Investigate
Auto review
Root cause analysis, log analysis, and timeline reconstruction
Dependencies
Hat Sequence
Investigator
Focus: Reconstruct the incident timeline, form and test root cause hypotheses, and distinguish the root cause from contributing factors. Follow the evidence — resist the urge to blame the most recent deploy without proof.
Produces: Root cause analysis with timeline, hypothesis testing results, and contributing factor assessment.
Reads: Incident brief from triage, application logs, deployment history, configuration changes, metrics.
Anti-patterns (RFC 2119):
- The agent MUST NOT assume the most recent change is the cause without evidence
- The agent MUST NOT stop at the first plausible explanation without testing alternatives
- The agent MUST NOT confuse correlation with causation (e.g., "it broke after the deploy" is not proof the deploy caused it)
- The agent MUST document ruled-out hypotheses and the evidence that eliminated them
- The agent MUST NOT investigate in isolation without sharing findings with the log-analyst
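One way to keep ruled-out hypotheses and their eliminating evidence on record is a small hypothesis log. The sketch below is illustrative only: the `Hypothesis` and `Status` names and their fields are assumptions, not part of any defined format for this stage.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    SUPPORTED = "supported"
    RULED_OUT = "ruled out"


@dataclass
class Hypothesis:
    """Hypothetical record for one root cause hypothesis."""
    statement: str
    status: Status = Status.OPEN
    evidence: list[str] = field(default_factory=list)

    def rule_out(self, evidence: str) -> None:
        # Refuse to eliminate a hypothesis unless the eliminating evidence is written down.
        if not evidence.strip():
            raise ValueError("a hypothesis can only be ruled out with evidence")
        self.evidence.append(evidence)
        self.status = Status.RULED_OUT
```

Requiring an evidence string at the point of elimination makes it harder to drop an alternative silently, and the resulting records can be shared with the log analyst as they are.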
Log Analyst
Focus: Deep-dive into logs, metrics, and traces to find concrete evidence supporting or refuting root cause hypotheses. The log analyst turns raw observability data into structured evidence.
Produces: Evidence report with timestamped log entries, metric correlations, and trace analysis supporting the root cause determination.
Reads: Incident brief from triage, investigator's hypotheses, application logs, APM traces, infrastructure metrics.
Anti-patterns (RFC 2119):
- The agent MUST NOT search logs without a hypothesis to test — fishing expeditions waste time during incidents
- The agent MUST NOT present raw log output without synthesis or interpretation
- The agent MUST NOT ignore logs from adjacent systems that may reveal upstream causes
- The agent MUST correlate timestamps across different data sources
- The agent MUST NOT treat absence of error logs as evidence of no problem
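Correlating timestamps across sources usually starts with normalizing them to a single time zone. The following is a minimal sketch, assuming every source emits ISO 8601 timestamps with an explicit UTC offset; the `Event`, `to_utc`, and `merge_sources` names are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime   # timezone-aware, normalized to UTC
    source: str    # e.g. "app", "db", "deploy" (illustrative labels)
    message: str


def to_utc(raw: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries an offset and convert it to UTC."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # A naive timestamp cannot be compared safely across systems.
        raise ValueError(f"timestamp has no UTC offset: {raw!r}")
    return parsed.astimezone(timezone.utc)


def merge_sources(sources: dict[str, list[tuple[str, str]]]) -> list[Event]:
    """Merge (timestamp, message) entries from several sources into one ordered timeline."""
    events = [
        Event(to_utc(ts), name, msg)
        for name, entries in sources.items()
        for ts, msg in entries
    ]
    return sorted(events, key=lambda e: e.at)
```

The merged list only establishes ordering. Whether an earlier event caused a later one still has to be tested against a hypothesis.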
Review Agents
Thoroughness
Mandate: The agent MUST verify the investigation identified the actual root cause, not just the proximate trigger.
Check:
- The agent MUST verify that the timeline is complete with no unexplained gaps between events
- The agent MUST verify that evidence (logs, metrics, traces) supports the causal chain
- The agent MUST verify that alternative hypotheses were considered and ruled out with evidence
- The agent MUST verify that contributing factors (deploys, config changes, traffic patterns) are identified
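The timeline completeness check can be partly mechanized by flagging long intervals between consecutive events for a reviewer to explain. This is a sketch under the assumption that a fixed threshold is meaningful for the incident; `find_gaps` and `max_gap` are hypothetical names, and a flagged gap is a prompt for review, not proof that something is missing.

```python
from datetime import datetime, timedelta


def find_gaps(
    times: list[datetime], max_gap: timedelta
) -> list[tuple[datetime, datetime]]:
    """Return consecutive event pairs whose spacing exceeds max_gap."""
    ordered = sorted(times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap
    ]
```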
Criteria Guidance
Good criteria examples:
- "Timeline reconstructs the incident from first anomaly to detection with timestamps from at least 2 independent sources"
- "Root cause hypothesis is supported by log evidence with specific entries cited"
- "Contributing factors are distinguished from the root cause with evidence for each"
Bad criteria examples:
- "Root cause is found"
- "Logs are analyzed"
- "Investigation is thorough"
Completion Signal (RFC 2119)
A root cause document MUST exist with a reconstructed timeline from first anomaly through detection and escalation. The root cause hypothesis MUST be stated with supporting evidence from logs, metrics, or code. Contributing factors MUST be identified separately. The investigator MUST have ruled out at least 2 alternative hypotheses with evidence. The root cause MUST be specific enough to inform a targeted mitigation.
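The mechanical parts of this completion signal could be expressed as a checklist function. The sketch below assumes a hypothetical `RootCauseDoc` structure; its field names are illustrative, and whether the root cause is specific enough to inform a mitigation remains a judgment call that the code does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class RootCauseDoc:
    """Hypothetical shape of a root cause document."""
    timeline: list[str] = field(default_factory=list)
    hypothesis: str = ""
    supporting_evidence: list[str] = field(default_factory=list)
    contributing_factors: list[str] = field(default_factory=list)
    # Maps each ruled-out hypothesis to the evidence that eliminated it.
    ruled_out: dict[str, str] = field(default_factory=dict)


def unmet_requirements(doc: RootCauseDoc) -> list[str]:
    """List the completion requirements the document does not yet satisfy."""
    problems = []
    if not doc.timeline:
        problems.append("timeline is missing")
    if not doc.hypothesis.strip():
        problems.append("root cause hypothesis is not stated")
    if not doc.supporting_evidence:
        problems.append("hypothesis has no supporting evidence")
    if not doc.contributing_factors:
        problems.append("contributing factors are not identified separately")
    with_evidence = [h for h, ev in doc.ruled_out.items() if ev.strip()]
    if len(with_evidence) < 2:
        problems.append("fewer than 2 alternatives ruled out with evidence")
    return problems
```

An empty result means only that the structural requirements are present; a reviewer still has to judge the quality of the evidence.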