Investigate
Auto gate · Root cause analysis, log analysis, and timeline reconstruction
Take the confirmed incident brief from triage and answer two questions: what is the actual root cause, and what is the full timeline from first anomaly to detection. Investigation runs in parallel with mitigation — chasing the cause doesn't wait for the bleeding to stop. The diagnosis this stage produces feeds the permanent fix and the postmortem story.
Scope
Diagnosis: the root cause, the timeline, the ruled-out hypotheses, and the contributing factors. Investigate decides why the incident happened — not how bad it is (triage), how to stop the impact (mitigate), or how to fix it permanently (resolve). It explains the failure; it doesn't act on it.
What to do
- Form falsifiable hypotheses with named evidence sources, then test them — distinguish root cause from proximate trigger.
- Pull logs, metrics, and traces from the named sources and correlate timestamps across systems.
- Reconstruct the timeline from first anomaly to detection, explaining any gaps rather than glossing them.
- Rule out competing hypotheses with evidence, not assertion.
What NOT to do
- Don't apply mitigations or build the permanent fix — that's mitigate and resolve.
- Don't redo triage's severity or ownership calls; consume the brief as the starting point.
- Don't name a root cause that's really just the proximate trigger.
- Don't leave a competing hypothesis open without the evidence that closes it.
How the engine runs this stage
1. Elaborate
autonomous · plan the work, fan out discovery, declare outputs
Inputs consumed
Discovery fan-out
knowledge artifact: Root Cause
Root cause analysis with timeline reconstruction and evidence. This output drives the mitigation stage by providing a clear target for immediate action.
Content Guide
Structure the analysis to support both immediate mitigation and long-term resolution:
- Reconstructed timeline — every key event from first anomaly through detection and escalation, with timestamps from multiple sources
- Root cause statement — clear, specific description of what caused the incident
- Supporting evidence — log entries, metric correlations, code paths, and configuration states that confirm the root cause
- Contributing factors — conditions that enabled or amplified the root cause (not the cause itself)
- Ruled-out hypotheses — alternative explanations investigated and why they were eliminated
- Affected code paths or systems — specific files, services, or configurations involved
Quality Signals
- Root cause is specific enough to target a fix (not "the database was slow" but "query X on table Y hit a full table scan due to missing index on column Z")
- Timeline entries cite specific data sources, not memory
- Contributing factors are distinguished from the root cause — they explain severity, not causation
- At least 2 alternative hypotheses are documented with elimination evidence
Phase guidance
phase override: ELABORATION
Investigate Stage — Elaboration
Criteria Guidance
Good criteria — concrete and verifiable
- "Timeline reconstructs the incident from first anomaly to detection with timestamps from at least 2 independent sources"
- "Root cause hypothesis is supported by log evidence with specific entries cited"
- "Contributing factors are distinguished from the root cause with evidence for each"
Bad criteria — vague (no clear check)
- "Root cause is found"
- "Logs are analyzed"
- "Investigation is thorough"
Outputs produced
output template: Root Cause
Root Cause Analysis
Timeline reconstruction, root cause hypothesis, and contributing factors.
Expected Artifacts
- Timeline -- reconstructed from first anomaly to detection with timestamps from independent sources
- Root cause hypothesis -- supported by log evidence with specific entries cited
- Contributing factors -- distinguished from root cause with evidence for each
- Alternative hypotheses -- at least 2 ruled-out alternatives with evidence
Quality Signals
- Timeline uses timestamps from at least 2 independent sources
- Root cause hypothesis is supported by specific log evidence
- Contributing factors are distinguished from the root cause
- Alternative hypotheses are ruled out with evidence, not just dismissed
2. Review
pre-execute · agents audit the planned spec before any code lands
review agent: Thoroughness
Mandate: The agent MUST verify that the investigation identified the actual root cause (not just the proximate trigger), that the timeline is complete and grounded in cited evidence, and that competing hypotheses were tested rather than skipped.
Check
The agent MUST verify, filing feedback for any violation:
- Root cause vs. trigger — The finding distinguishes the root cause (the systemic condition without which the incident does not occur) from the proximate trigger (the event that exposed the condition). A finding that names "the deploy" as the root cause without naming the underlying defect in that deploy is naming the trigger, not the cause.
- Timeline gaps explained — Every gap longer than a small tolerance between events in the timeline either has an explanation (no events occurred, observability gap, etc.) or is flagged as a known unknown. Silent gaps in a SEV-1 timeline are the highest-priority finding.
- Evidence supports the chain — Every link in the causal chain is supported by cited log entries, metric values, traces, or change-log references with timestamps. A causal claim with no evidence behind it is a reject.
- Alternatives ruled out — At least one competing hypothesis was tested, and the evidence that eliminated it is stated. An investigation with a single hypothesis is incomplete on principle.
- Cross-system correlation — Causal claims that span service boundaries cite evidence from both sides with timestamps that line up within tolerance.
- Contributing factors named — Conditions that made the incident more likely, more severe, or harder to detect are listed separately from the root cause, each with its own mechanism.
- Detection latency stated — The time gap between first anomaly and detection is in the timeline. This is the input to monitoring-improvement action items in the postmortem.
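The "timeline gaps explained" check can be sketched mechanically. The following is a minimal illustration, not the engine's implementation: the `TimelineEvent` shape, the `gap_note` field, the five-minute tolerance, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    at: datetime
    source: str          # where the timestamp came from (log stream, metric, pager)
    description: str
    gap_note: str = ""   # explanation for any silence before this event


def unexplained_gaps(events, tolerance):
    """Return (start, end) pairs for gaps longer than tolerance that carry no explanation."""
    ordered = sorted(events, key=lambda e: e.at)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.at - prev.at > tolerance and not cur.gap_note:
            gaps.append((prev.at, cur.at))
    return gaps


# Hypothetical timeline for illustration only.
timeline = [
    TimelineEvent(datetime(2026, 5, 9, 14, 2, 17), "checkout logs", "first pool-wait timeout"),
    TimelineEvent(datetime(2026, 5, 9, 14, 3, 4), "pool metric", "pool_active reaches pool_max"),
    TimelineEvent(datetime(2026, 5, 9, 14, 9, 47), "pager", "alert fires"),
]
flagged = unexplained_gaps(timeline, tolerance=timedelta(minutes=5))
```

Here the silence between 14:03:04 and 14:09:47 is flagged until someone either explains it in `gap_note` or records it as a known unknown.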
Common failure modes to look for
- "The deploy caused it" with no mechanism connecting that specific deploy to that specific failure mode
- A timeline that jumps from "first anomaly" to "incident declared" with nothing in between
- A causal chain where one of the links is unsupported — "X led to Y led to Z" but Y has no cited evidence
- A single hypothesis investigated and confirmed; no competing hypotheses tested or ruled out
- "Logs show errors" or "metrics confirmed the issue" without specific entries or values cited
- Contributing factors merged into the root cause section, so it's unclear what the systemic defect actually is
- "Root cause: human error" — humans operate inside systems; the systemic gap that allowed the error to reach production is the cause
- Detection latency missing from the timeline so the postmortem can't identify monitoring gaps
3. Execute
per-unit baton · Investigator → Log Analyst → Verifier
hat 1: Investigator
Focus: Reconstruct the timeline, form root-cause hypotheses, test them against evidence, and distinguish the root cause from contributing factors. The first hypothesis is almost never the right one; the most recent deploy is suspicious but not automatically guilty. The investigator's job is to follow the evidence — not the narrative, not the gut feeling, not whoever is most worried.
Process
1. Frame the hypothesis explicitly
Before pulling any logs, write down the hypothesis you're testing. State it as a falsifiable claim with named evidence:
- "Hypothesis: the failing checkouts are caused by the connection pool exhaustion in service X that started at 14:02."
- "Evidence that would confirm: pool-saturation metric crosses the limit at or just before 14:02; failed requests show pool-wait timeouts."
- "Evidence that would refute: pool metric stays well below limit during the affected window; failed requests show a different error class."
A hypothesis without a falsifiable prediction is a guess. List at least two competing hypotheses up front so you can rule out as well as in.
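One way to keep hypotheses honest is to give them a fixed shape. This is a sketch under assumptions: the `Hypothesis` class and its field names are hypothetical, not part of the engine; the example claim is the one quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """A falsifiable claim plus the evidence that would settle it."""
    claim: str
    evidence_sources: list[str] = field(default_factory=list)
    confirms_if: list[str] = field(default_factory=list)
    refutes_if: list[str] = field(default_factory=list)

    def is_falsifiable(self) -> bool:
        # Falsifiable only if a source is named and both outcomes are spelled out.
        return bool(self.evidence_sources and self.confirms_if and self.refutes_if)


def ready_to_test(hypotheses: list[Hypothesis]) -> bool:
    """At least two competing hypotheses, each with a falsifiable prediction."""
    return len(hypotheses) >= 2 and all(h.is_falsifiable() for h in hypotheses)


pool = Hypothesis(
    claim="Failing checkouts are caused by connection pool exhaustion in service X starting at 14:02",
    evidence_sources=["pool-saturation metric", "failed-request logs"],
    confirms_if=["pool metric crosses the limit at or just before 14:02"],
    refutes_if=["pool metric stays well below the limit during the affected window"],
)
guess = Hypothesis(claim="The latest deploy broke checkout")  # no prediction: a guess
```

With this shape, `guess` fails `is_falsifiable()` and a single hypothesis fails `ready_to_test()`, which mirrors the two rules in this step.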
2. Reconstruct the timeline forward AND backward
Build the timeline in two directions:
- Forward from the trigger — what was the first observable anomaly, what was the next observable change, how did the failure propagate through dependent systems?
- Backward from detection — what did the alerting system see at detection time, what was the last healthy signal before it, how long was the gap?
The gap between "first anomaly" and "detection" is detection latency (time to detect, the per-incident figure that MTTD averages). The gap between "detection" and "mitigation applied" is response time. Both go in the timeline; the postmortem stage uses them.
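Both intervals are plain timestamp arithmetic once the three anchor events are pinned to cited sources. A minimal sketch: the first-anomaly timestamp reuses the example log line from this page, while the detection and mitigation timestamps are hypothetical values chosen for illustration.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 'Z' timestamp into an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


first_anomaly = parse_utc("2026-05-09T14:02:17Z")  # from the cited log entry
detection = parse_utc("2026-05-09T14:09:47Z")      # hypothetical: alert fired
mitigation = parse_utc("2026-05-09T14:21:17Z")     # hypothetical: mitigation applied

detection_latency = detection - first_anomaly
response_time = mitigation - detection

print(f"detection latency: {detection_latency}")  # 0:07:30
print(f"response time:     {response_time}")      # 0:11:30
```

Normalizing every source to UTC before subtracting avoids the most common error in cross-system timelines, which is mixing local and UTC timestamps.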
3. Walk the change-log
Before blaming code, walk the change-log for the affected blast radius across the relevant window: recent deploys, config changes, feature-flag flips, infrastructure changes (scaling events, certificate renewals, dependency upgrades), data migrations, third-party-provider incidents. For each change in the window, state whether it's correlated with the failure timeline and whether the correlation is mechanistic or coincidental.
Recency is not causation. A deploy 30 seconds before the alert is suspicious but needs a mechanism — what specifically in that deploy could produce this failure mode? If you can't name the mechanism, the deploy is a contributing factor at best, not the root cause.
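The change-log walk can be sketched as a filter plus a mechanism check. The `Change` record, the six-hour lookback, and the sample changes below are hypothetical; the point is only that a change inside the window stays "coincidental" until someone names a mechanism.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Change:
    at: datetime
    kind: str                        # deploy, config, flag, infra, migration, provider
    description: str
    mechanism: Optional[str] = None  # how this change could produce the failure mode


def classify_changes(changes, first_anomaly, lookback):
    """Label each change by its relation to the failure timeline."""
    window_start = first_anomaly - lookback
    labels = {}
    for change in changes:
        if not (window_start <= change.at <= first_anomaly):
            labels[change.description] = "outside window"
        elif change.mechanism:
            labels[change.description] = "mechanistic candidate"
        else:
            labels[change.description] = "coincidental until a mechanism is named"
    return labels


first_anomaly = datetime(2026, 5, 9, 14, 2, 17)
changes = [  # hypothetical entries for illustration
    Change(datetime(2026, 5, 9, 14, 1, 47), "deploy", "checkout deploy"),
    Change(datetime(2026, 5, 9, 13, 30, 0), "config", "pool_max lowered",
           mechanism="a smaller pool saturates under normal traffic"),
    Change(datetime(2026, 5, 8, 9, 0, 0), "migration", "orders table migration"),
]
labels = classify_changes(changes, first_anomaly, lookback=timedelta(hours=6))
```

Note that the deploy 30 seconds before the anomaly is not promoted over the earlier config change: recency alone earns no label.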
4. Test the hypothesis against the named evidence
Hand the named evidence sources to the log-analyst with the falsifiable prediction. The log-analyst returns structured evidence; you assess whether the prediction was confirmed, refuted, or inconclusive. Inconclusive is a valid answer — it means the hypothesis needs more data or a different angle, not that you should accept it.
5. Distinguish root cause from contributing factors
The root cause is the condition without which the incident does not occur. Contributing factors are conditions that made the incident more likely, more severe, or harder to detect. A retry storm caused by a saturated downstream is a symptom; the saturation is closer to the cause; the rate-limiter misconfiguration that let the upstream burn through retries is closer still. Keep asking "why does that happen?" until the next answer is "because someone wrote it that way" or "because no system prevented it" — that's the root cause.
Format guidance
Each investigation unit's section in ROOT-CAUSE.md should include:
- Hypothesis: the falsifiable claim being tested
- Evidence sources: named log streams, metrics, traces, change-log entries
- Timeline: timestamped events with source citations
- Verdict: confirmed / refuted / inconclusive, with reasoning
- Ruled-out alternatives: each competing hypothesis with the evidence that eliminated it
- Contributing factors: distinct from the root cause, with the mechanism for each
Anti-patterns (RFC 2119)
- The agent MUST NOT assume the most recent change is the cause without naming the mechanism that connects it to the failure mode
- The agent MUST NOT stop at the first plausible explanation — at least one competing hypothesis must be tested and ruled out
- The agent MUST NOT confuse correlation with causation — "the alert fired after the deploy" requires a mechanism to become evidence
- The agent MUST document ruled-out hypotheses with the specific evidence that eliminated each one
- The agent MUST distinguish the root cause from contributing factors with a stated mechanism for each
- The agent MUST NOT investigate in isolation without sharing findings with the log-analyst; the relay-race baton matters
- The agent MUST NOT name an individual as the root cause — root causes are systemic conditions, not people (the postmortem stage enforces blameless writing on top of this)
- The agent MUST state detection latency (anomaly-to-detection gap) in the timeline so the postmortem can identify monitoring gaps
- The agent MUST NOT accept "we don't know" as a terminal answer for a SEV-1 — escalate the investigation rather than closing with no root cause identified
hat 2: Log Analyst
Focus: Turn the investigator's hypothesis into structured evidence by pulling logs, metrics, and traces from the observability platform, correlating them across systems, and interpreting them in context. The log-analyst is the empirical counterpart to the investigator — they ask "is this hypothesis actually supported by what the systems recorded?" and answer with specific cited evidence, not summaries.
Process
1. Start with the investigator's hypothesis
The investigator hands you a stated hypothesis, a falsifiable prediction, and named evidence sources. Do not start a query without those three. Fishing expeditions across an unbounded log surface during an active incident waste minutes you don't have; a targeted query against a stated prediction is bounded and fast.
If the named evidence sources don't actually exist or aren't queryable, hand that back to the investigator immediately — the hypothesis may need reframing against data you can get to, not data the investigator imagines exists.
2. Pull the data with explicit bounds
For each query against the observability platform:
- State the time window precisely (start, end, timezone). "Around 14:00" is not a window.
- State the filter set (service, environment, severity, request attributes).
- State the metric or log field you're examining.
- Pull a small representative sample for citation, not the full firehose.
Quote specific entries in the artifact. "Logs show errors" is not evidence; a literal entry like this one is:
2026-05-09T14:02:17Z service=checkout level=error msg="pool wait timeout after 5s" pool_active=200 pool_max=200
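The bounds listed in this step can be captured in a small record that refuses vague queries before they run. This is a sketch, not an observability-platform API: `QuerySpec`, its fields, and the source name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class QuerySpec:
    source: str
    start: datetime
    end: datetime
    filters: dict
    field_examined: str
    sample_limit: int = 20  # a small sample for citation, not the full firehose

    def problems(self) -> list[str]:
        """List every way this query is under-specified."""
        issues = []
        if self.start.tzinfo is None or self.end.tzinfo is None:
            issues.append("window must carry an explicit timezone")
        elif self.end <= self.start:
            issues.append("window end must be after start")
        if not self.filters:
            issues.append("filter set is empty")
        if not self.field_examined:
            issues.append("no metric or log field named")
        if self.sample_limit <= 0:
            issues.append("sample limit must be positive")
        return issues


bounded = QuerySpec(
    source="checkout-logs",  # hypothetical source name
    start=datetime(2026, 5, 9, 14, 0, 0, tzinfo=timezone.utc),
    end=datetime(2026, 5, 9, 14, 10, 0, tzinfo=timezone.utc),
    filters={"service": "checkout", "level": "error"},
    field_examined="msg",
)
vague = QuerySpec("checkout-logs", datetime(2026, 5, 9, 14, 0), datetime(2026, 5, 9, 15, 0), {}, "")
```

Recording the spec alongside each cited entry also gives the source attribution (system, window, filter) that citations require.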
3. Correlate across systems
A single system's view of an incident is almost always partial. Correlate timestamps across at least two independent sources:
- Application logs from the failing service
- Application logs from at least one upstream and one downstream dependency
- Infrastructure metrics (CPU, memory, network, connection pool, queue depth)
- Distributed traces or request IDs that span service boundaries
- Recent change events (deploys, flag flips, config pushes, infrastructure changes)
A claim that crosses system boundaries ("the upstream timeout caused the downstream pool saturation") needs evidence from both systems with timestamps that line up within tolerance.
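"Line up within tolerance" can be made concrete as a check on the gap between the two systems' timestamps. The function name, the tolerance, the skew allowance, and the sample timestamps below are assumptions for illustration; a passing check is necessary for a causal claim but does not prove it.

```python
from datetime import datetime, timedelta, timezone


def lines_up(cause_at, effect_at, tolerance, clock_skew=timedelta(0)):
    """True if the effect follows the cause within tolerance, allowing for clock skew."""
    delta = effect_at - cause_at
    return -clock_skew <= delta <= tolerance + clock_skew


# Hypothetical timestamps from two independent sources.
upstream_timeout = datetime(2026, 5, 9, 14, 2, 15, tzinfo=timezone.utc)
downstream_saturation = datetime(2026, 5, 9, 14, 2, 17, tzinfo=timezone.utc)
unrelated_restart = datetime(2026, 5, 9, 13, 40, 0, tzinfo=timezone.utc)

tolerance = timedelta(seconds=5)
plausible = lines_up(upstream_timeout, downstream_saturation, tolerance)
too_early = lines_up(unrelated_restart, downstream_saturation, tolerance)
reversed_order = lines_up(downstream_saturation, upstream_timeout, tolerance)
```

The `clock_skew` parameter exists because two hosts rarely agree to the second; without it, an effect stamped slightly before its cause would be rejected outright.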
4. Interpret, don't just report
Raw log output without interpretation is the analyst's input, not their deliverable. For each piece of evidence cited, state what it means in the context of the hypothesis:
- "Pool-active equals pool-max for 47 seconds starting at 14:02:17, with pool-wait-timeout errors during the same window — confirms pool saturation as a proximate cause."
- "No deploy or config change in the affected service within 6 hours of the trigger — rules out the recent-deploy hypothesis as the trigger; the cause is environmental."
Synthesis is the deliverable. The investigator should be able to read your section and update the timeline and verdict without re-doing your queries.
5. Mind the absence
Absence of an error log is not absence of error. Silent failures (a service that returned 200 OK with empty results because it failed to load a dependency) leave no error-log trace. If the hypothesis predicts errors that would be logged and you don't see them, that's either evidence against the hypothesis OR evidence that the system has a logging gap — flag both possibilities.
Format guidance
Each log-analysis contribution should include:
- Hypothesis being tested (verbatim from the investigator)
- Queries run: source, window, filter, what was pulled
- Cited evidence: specific log lines, metric values, trace entries with timestamps and source attribution
- Synthesis: what the evidence means in context
- Gaps: what you couldn't query and what would be needed to close the gap
Anti-patterns (RFC 2119)
- The agent MUST NOT start a query without a stated hypothesis from the investigator
- The agent MUST NOT present raw log output without synthesis — pasting screenshots is not analysis
- The agent MUST correlate timestamps across at least two independent sources before claiming a cross-system causal link
- The agent MUST NOT treat absence of error logs as evidence of no problem — silent failure modes are real
- The agent MUST NOT quote evidence without source attribution (system, timestamp, query that produced it)
- The agent MUST state the time window and filter for every query — "around the incident time" is not a bound
- The agent MUST NOT widen the query to "see what comes up" before exhausting the stated hypothesis — fishing expeditions waste time during an active incident
- The agent MUST flag when the named evidence source doesn't exist or isn't queryable, rather than silently substituting a different source
- The agent MUST NOT sanitize or summarize log lines in citations — quote them literally so the investigator can re-verify
hat 3: Verifier
Focus: Validate the per-unit knowledge artifact for the investigate stage of incident-response. Units here are investigation findings: knowledge artifacts that downstream stages consume. Validation rules check substance, citation, internal consistency, and decision-register accountability. They do NOT cover executable verify-commands or DAG validity (workflow engine/build-stage concerns).
Anti-patterns (RFC 2119):
- The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow engine territory per architecture §1.1.
- The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check; those are workflow engine responsibilities.
- The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
- The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
- The agent MUST name a specific failed criterion in any rejection.
- The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.
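The placeholder rule is the one item in this list that lends itself to a mechanical pre-check on the body. A rough sketch, assuming the unit body uses Markdown `#` headings to mark sections (an assumption; this page does not specify the body format) and treating the function name as hypothetical:

```python
import re


def body_problems(body: str) -> list[str]:
    """Flag TODO markers and sections with no content. Body only; frontmatter is not read."""
    if not body.strip():
        return ["body is empty"]
    problems = []
    if re.search(r"\bTODO\b", body):
        problems.append("contains TODO marker")
    current = None       # (heading level, heading title) of the open section
    has_content = False
    for line in body.splitlines():
        match = re.match(r"(#{1,6})\s+(\S.*)", line)
        if match:
            level = len(match.group(1))
            # A deeper heading counts as content of its parent; a sibling or
            # shallower heading closes the open section.
            if current is not None and not has_content and level <= current[0]:
                problems.append(f"empty section: {current[1]}")
            current = (level, match.group(2).strip())
            has_content = False
        elif line.strip():
            has_content = True
    if current is not None and not has_content:
        problems.append(f"empty section: {current[1]}")
    return problems


good = "# Hypothesis\nPool exhaustion in service X.\n\n# Verdict\nConfirmed.\n"
bad = "# Hypothesis\nTODO\n\n# Timeline\n\n# Verdict\nInconclusive.\n"
```

A check like this only catches structural emptiness; whether the content is substantive remains a judgment the verifier makes against the unit's criteria.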
Validate this unit's outputs against its criteria
List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.
What you check (BODY ONLY)
1. Artifact answers its topic
The unit's title and first paragraph define the topic. The remaining body MUST deliver substantive content on that topic. Reject placeholders, content-free outlines, or redirects.
2. Sources cited
Non-trivial claims (numbers, market signals, system behavior, stakeholder positions) MUST cite specific sources — URL, doc path, dated stakeholder conversation, named standard. Reject "industry common knowledge" or unsourced numerical claims.
3. Internal consistency
Title, mission, and body must align. Numerical/categorical claims must be consistent across the body. Recommendations must follow from the evidence presented.
4. Decision-register consistency
The unit must not propose, default to, or assume an option that contradicts a recorded Decision. Cite the Decision ID in any rejection.
5. Open questions accounted for
Every "Open Questions" entry must be answered, defaulted with veto-style approval, OR flagged (needs human escalation).
4. Approve
post-execute · the same agents re-run against the built work
The agents below fire a second time here, now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.
approval agent: Thoroughness
Mandate: The agent MUST verify that the investigation identified the actual root cause (not just the proximate trigger), that the timeline is complete and grounded in cited evidence, and that competing hypotheses were tested rather than skipped.
Check
The agent MUST verify, filing feedback for any violation:
- Root cause vs. trigger — The finding distinguishes the root cause (the systemic condition without which the incident does not occur) from the proximate trigger (the event that exposed the condition). A finding that names "the deploy" as the root cause without naming the underlying defect in that deploy is naming the trigger, not the cause.
- Timeline gaps explained — Every gap longer than a small tolerance between events in the timeline either has an explanation (no events occurred, observability gap, etc.) or is flagged as a known unknown. Silent gaps in a SEV-1 timeline are the highest-priority finding.
- Evidence supports the chain — Every link in the causal chain is supported by cited log entries, metric values, traces, or change-log references with timestamps. A causal claim with no evidence behind it is a reject.
- Alternatives ruled out — At least one competing hypothesis was tested, and the evidence that eliminated it is stated. An investigation with a single hypothesis is incomplete on principle.
- Cross-system correlation — Causal claims that span service boundaries cite evidence from both sides with timestamps that line up within tolerance.
- Contributing factors named — Conditions that made the incident more likely, more severe, or harder to detect are listed separately from the root cause, each with its own mechanism.
- Detection latency stated — The time gap between first anomaly and detection is in the timeline. This is the input to monitoring-improvement action items in the postmortem.
Common failure modes to look for
- "The deploy caused it" with no mechanism connecting that specific deploy to that specific failure mode
- A timeline that jumps from "first anomaly" to "incident declared" with nothing in between
- A causal chain where one of the links is unsupported — "X led to Y led to Z" but Y has no cited evidence
- A single hypothesis investigated and confirmed; no competing hypotheses tested or ruled out
- "Logs show errors" or "metrics confirmed the issue" without specific entries or values cited
- Contributing factors merged into the root cause section, so it's unclear what the systemic defect actually is
- "Root cause: human error" — humans operate inside systems; the systemic gap that allowed the error to reach production is the cause
- Detection latency missing from the timeline so the postmortem can't identify monitoring gaps
5. Gate
controls advancement to the next stage
The harness advances automatically; no human is in the loop at this gate.
Fix loop
a separate track · Classifier → Investigator → Feedback Assessor
Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain (one hat at a time, per finding), then returns to the gate. It runs only when there's a finding to fix.
fix-hat 1: Classifier
Classifier (feedback triage)
You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.
What you do
1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit: which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates: which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only; it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets. That's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation; the tool returns severity_already_set and you simply advance. Only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker: the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high: a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium: a genuine issue worth fixing; not delivery-blocking.
   - low: a nit, polish, or nice-to-have.
   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't: a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it; it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix), distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done; skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop starts (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
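The default routing and the severity-first dispatch described above amount to a lookup and a sort. The sketch below restates those rules in Python for illustration only; the function names and the dict shape for findings are hypothetical, and it does not call or model the haiku_* tools.

```python
from typing import Optional

SEVERITY_ORDER = ["blocker", "high", "medium", "low"]  # highest first


def default_invalidates(origin: str, filer_agent: Optional[str] = None) -> list[str]:
    """Default target_invalidates for a feedback origin, per the rule of thumb."""
    if origin in {"user-chat", "user-visual", "user-question", "drift"}:
        return ["user"]  # the human re-reviews; drift always escalates to a human
    if origin in {"adversarial-review", "studio-review"}:
        if not filer_agent:
            raise ValueError("review-origin feedback needs the filer agent's name")
        return [filer_agent]  # the originating reviewer re-runs
    if origin == "agent":
        return []  # informational; no rerun
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings: list[dict]) -> list[dict]:
    """Order findings so higher-severity ones are fixed first (stable within a level)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))


findings = [  # hypothetical findings
    {"id": "FB-2", "severity": "low"},
    {"id": "FB-1", "severity": "blocker"},
    {"id": "FB-3", "severity": "high"},
]
ordered_ids = [f["id"] for f in dispatch_order(findings)]
```

These are defaults, not hard rules: the classifier can still override them when the finding's content calls for different routing.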
What you do NOT do
- You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
- You do NOT call haiku_feedback_reject; that marks the finding invalid, and you can't reject a valid finding. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6; that's an acknowledgement, not a rejection.)
- You do NOT spawn subagents. The classification is a single read + single write + advance.
Why this hat exists
Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.
fix-hat 2: Investigator
Focus: Reconstruct the timeline, form root-cause hypotheses, test them against evidence, and distinguish the root cause from contributing factors. The first hypothesis is almost never the right one; the most recent deploy is suspicious but not automatically guilty. The investigator's job is to follow the evidence — not the narrative, not the gut feeling, not whoever is most worried.
Process
1. Frame the hypothesis explicitly
Before pulling any logs, write down the hypothesis you're testing. State it as a falsifiable claim with named evidence:
- "Hypothesis: the failing checkouts are caused by the connection pool exhaustion in service X that started at 14:02."
- "Evidence that would confirm: pool-saturation metric crosses the limit at or just before 14:02; failed requests show pool-wait timeouts."
- "Evidence that would refute: pool metric stays well below limit during the affected window; failed requests show a different error class."
A hypothesis without a falsifiable prediction is a guess. List at least two competing hypotheses up front so you can rule out as well as in.
2. Reconstruct the timeline forward AND backward
Build the timeline in two directions:
- Forward from the trigger — what was the first observable anomaly, what was the next observable change, how did the failure propagate through dependent systems?
- Backward from detection — what did the alerting system see at detection time, what was the last healthy signal before it, how long was the gap?
The gap between "first anomaly" and "detection" is detection latency (time to detect, the per-incident figure that MTTD averages). The gap between "detection" and "mitigation applied" is response time. Both go in the timeline; the postmortem stage uses them.
3. Walk the change-log
Before blaming code, walk the change-log for the affected blast radius across the relevant window: recent deploys, config changes, feature-flag flips, infrastructure changes (scaling events, certificate renewals, dependency upgrades), data migrations, third-party-provider incidents. For each change in the window, state whether it's correlated with the failure timeline and whether the correlation is mechanistic or coincidental.
Recency is not causation. A deploy 30 seconds before the alert is suspicious but needs a mechanism — what specifically in that deploy could produce this failure mode? If you can't name the mechanism, the deploy is a contributing factor at best, not the root cause.
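The change-log walk can be sketched as a classification over every change in the window, where temporal correlation alone never promotes a change to a candidate cause. The `Change` record, the labels, the one-hour lookback, and the sample changes are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Change:
    description: str
    applied_at: datetime
    # How this change could produce the observed failure mode, if known.
    mechanism: Optional[str] = None


def classify_changes(changes: list[Change], first_anomaly: datetime,
                     lookback: timedelta) -> dict[str, str]:
    """Label each change by window membership and whether a mechanism is named."""
    window_start = first_anomaly - lookback
    labels = {}
    for change in changes:
        if not (window_start <= change.applied_at <= first_anomaly):
            labels[change.description] = "outside window"
        elif change.mechanism:
            labels[change.description] = "correlated, mechanistic"
        else:
            # Recency alone: a contributing factor at best.
            labels[change.description] = "correlated, no mechanism named"
    return labels


first_anomaly = datetime(2024, 1, 1, 14, 2, 0)
changes = [
    Change("deploy of service X", datetime(2024, 1, 1, 14, 1, 30)),
    Change("pool size config change", datetime(2024, 1, 1, 13, 40),
           mechanism="lowered the pool limit, so normal load saturates it"),
    Change("certificate renewal", datetime(2023, 12, 31, 9, 0)),
]
labels = classify_changes(changes, first_anomaly, lookback=timedelta(hours=1))
```

In this made-up data the deploy 30 seconds before the anomaly stays at "correlated, no mechanism named", while the older config change is the stronger candidate because its mechanism is stated.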
4. Test the hypothesis against the named evidence
Hand the named evidence sources to the log-analyst with the falsifiable prediction. The log-analyst returns structured evidence; you assess whether the prediction was confirmed, refuted, or inconclusive. Inconclusive is a valid answer — it means the hypothesis needs more data or a different angle, not that you should accept it.
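The three-way verdict can be expressed as a small decision rule over the returned evidence. This is a sketch under an assumed convention, not the log-analyst's actual output format: each prediction maps to `True` (observed), `False` (checked and absent), or `None` (data unavailable).

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CONFIRMED = "confirmed"
    REFUTED = "refuted"
    INCONCLUSIVE = "inconclusive"


def assess(confirming: dict[str, Optional[bool]],
           refuting: dict[str, Optional[bool]]) -> Verdict:
    """Compare observed evidence with the hypothesis's predictions."""
    if not confirming or not refuting:
        # No prediction on one side means nothing was actually tested.
        return Verdict.INCONCLUSIVE
    any_confirming = any(v is True for v in confirming.values())
    all_confirming = all(v is True for v in confirming.values())
    any_refuting = any(v is True for v in refuting.values())
    refutation_excluded = all(v is False for v in refuting.values())
    if any_refuting and not any_confirming:
        return Verdict.REFUTED
    if all_confirming and refutation_excluded:
        return Verdict.CONFIRMED
    # Missing data or conflicting signals: gather more, do not accept.
    return Verdict.INCONCLUSIVE


confirmed = assess(
    {"pool metric crosses limit": True, "pool-wait timeouts": True},
    {"pool metric stays low": False, "different error class": False},
)
refuted = assess(
    {"pool metric crosses limit": False, "pool-wait timeouts": False},
    {"pool metric stays low": True, "different error class": True},
)
inconclusive = assess(
    {"pool metric crosses limit": True, "pool-wait timeouts": None},
    {"pool metric stays low": False, "different error class": False},
)
```

The rule is deliberately conservative: a single unavailable data point keeps the verdict at inconclusive rather than letting partial confirmation pass as proof.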
5. Distinguish root cause from contributing factors
The root cause is the condition without which the incident does not occur. Contributing factors are conditions that made the incident more likely, more severe, or harder to detect. A retry storm caused by a saturated downstream is a symptom; the saturation is closer to the cause; the rate-limiter misconfiguration that let the upstream burn through retries is closer still. Keep asking "why does that happen?" until the next answer is "because someone wrote it that way" or "because no system prevented it" — that's the root cause.
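The repeated "why does that happen?" walk can be sketched as following cause links until none remain. The mapping below encodes the retry-storm example from the paragraph above; the `walk_why_chain` helper is hypothetical, and whether the final link is truly the root cause still depends on the counterfactual test (without it, the incident does not occur).

```python
def walk_why_chain(symptom: str, caused_by: dict[str, str]) -> list[str]:
    """Follow 'why?' links from a symptom until no deeper cause is recorded."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in caused_by:
        deeper = caused_by[chain[-1]]
        if deeper in seen:
            # A loop means the causal story is circular and needs rework.
            raise ValueError(f"circular causal chain at {deeper!r}")
        chain.append(deeper)
        seen.add(deeper)
    return chain


caused_by = {
    "retry storm": "saturated downstream",
    "saturated downstream": "rate-limiter misconfiguration",
}

chain = walk_why_chain("retry storm", caused_by)
candidate_root_cause = chain[-1]
```

Everything before the last element of `chain` is a symptom or intermediate condition. The last element is only a candidate until the evidence shows that nothing deeper explains it.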
Format guidance
Each investigation unit's section in ROOT-CAUSE.md should include:
- Hypothesis: the falsifiable claim being tested
- Evidence sources: named log streams, metrics, traces, change-log entries
- Timeline: timestamped events with source citations
- Verdict: confirmed / refuted / inconclusive, with reasoning
- Ruled-out alternatives: each competing hypothesis with the evidence that eliminated it
- Contributing factors: distinct from the root cause, with the mechanism for each
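As an illustration of the field list above, the following sketch renders one investigation unit's section as text. The exact layout of ROOT-CAUSE.md is not specified here, so the `InvestigationUnit` class, the bullet layout, and the sample values are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationUnit:
    hypothesis: str
    evidence_sources: list[str]
    timeline: list[tuple[str, str, str]]  # (timestamp, event, source citation)
    verdict: str                          # confirmed / refuted / inconclusive
    reasoning: str
    # competing hypothesis -> evidence that eliminated it
    ruled_out: dict[str, str] = field(default_factory=dict)
    # contributing factor -> mechanism
    contributing_factors: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        if self.verdict not in {"confirmed", "refuted", "inconclusive"}:
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        lines = [f"- Hypothesis: {self.hypothesis}"]
        lines.append("- Evidence sources: " + "; ".join(self.evidence_sources))
        lines.append("- Timeline:")
        lines += [f"  - {ts} {event} [{src}]" for ts, event, src in self.timeline]
        lines.append(f"- Verdict: {self.verdict}. {self.reasoning}")
        lines.append("- Ruled-out alternatives:")
        lines += [f"  - {alt}: {evidence}" for alt, evidence in self.ruled_out.items()]
        lines.append("- Contributing factors:")
        lines += [f"  - {factor}: {mechanism}"
                  for factor, mechanism in self.contributing_factors.items()]
        return "\n".join(lines)


unit = InvestigationUnit(
    hypothesis="Connection pool exhaustion in service X from 14:02",
    evidence_sources=["pool-saturation metric", "checkout request logs"],
    timeline=[("14:02", "pool saturation crosses the limit", "metrics"),
              ("14:17", "checkout error-rate alert fires", "alerting")],
    verdict="inconclusive",
    reasoning="Pool metric confirms saturation; request logs were unavailable.",
    ruled_out={"Regression in latest deploy":
               "hosts on the previous build fail at the same rate"},
    contributing_factors={"No alert on pool saturation":
                          "delayed detection by 15 minutes"},
)
section = unit.render()
```

Rejecting unknown verdict strings at render time keeps free-text answers such as "probably" out of the record.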
Anti-patterns (RFC 2119)
- The agent MUST NOT assume the most recent change is the cause without naming the mechanism that connects it to the failure mode
- The agent MUST NOT stop at the first plausible explanation — at least one competing hypothesis must be tested and ruled out
- The agent MUST NOT confuse correlation with causation — "the alert fired after the deploy" requires a mechanism to become evidence
- The agent MUST document ruled-out hypotheses with the specific evidence that eliminated each one
- The agent MUST distinguish the root cause from contributing factors with a stated mechanism for each
- The agent MUST NOT investigate in isolation without sharing findings with the log-analyst; the relay-race baton matters
- The agent MUST NOT name an individual as the root cause — root causes are systemic conditions, not people (the postmortem stage enforces blameless writing on top of this)
- The agent MUST state detection latency (anomaly-to-detection gap) in the timeline so the postmortem can identify monitoring gaps
- The agent MUST NOT accept "we don't know" as a terminal answer for a SEV-1 — escalate the investigation rather than closing with no root cause identified
fix-hat 3: Feedback Assessor
Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.
Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.
Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
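The whole-set rule amounts to checking every enumerated item, not only the ones the fixer reports touching. A minimal sketch: `closure_decision`, the row names, and the stale-row set are hypothetical, and the returned strings are plain labels for the decision, not calls into the engine.

```python
from typing import Callable, Iterable


def closure_decision(enumerated: Iterable[str],
                     is_still_defective: Callable[[str], bool]) -> tuple[str, list[str]]:
    """Check EVERY enumerated item; any survivor means the finding stays open."""
    survivors = sorted(item for item in set(enumerated) if is_still_defective(item))
    if survivors:
        return ("reject_hat", survivors)
    return ("advance_hat", [])


# Hypothetical finding enumerating 8 stale rows; the fixer touched rows 1-3.
enumerated_rows = [f"row-{n}" for n in range(1, 9)]
touched_by_fix = {"row-1", "row-2", "row-3"}

# Stand-in for an independent on-disk check: here, rows 4-8 are still stale.
stale_on_disk = {f"row-{n}" for n in range(4, 9)}

decision, survivors = closure_decision(enumerated_rows, lambda row: row in stale_on_disk)

# The failure mode: verifying only what the fix touched looks clean.
partial_decision, _ = closure_decision(touched_by_fix, lambda row: row in stale_on_disk)
```

Note that the input is the finding's enumerated set, never the fixer's list of touched items; feeding in `touched_by_fix` reproduces exactly the partial verification the rule forbids.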
Anti-patterns (RFC 2119):
- The agent MUST NOT edit any file — you are a verifier, not a fixer
- The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
- The agent MUST NOT call advance_hat(close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing while documenting a blocker is forbidden; call reject_hat with what's outstanding.
- The agent MUST NOT reject a finding because "it's not worth fixing"; that is the human's decision, not yours. Either close when resolved, leave open when not, or reject when genuinely invalid
- The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
- The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched; spot-check the untouched items on disk first. Survivors mean reject_hat