Application Development · stage 5 of 6

Operations

Auto gate

Deployment, monitoring, and operational readiness

Operations

Take working code from development and make it run reliably in production: the runtime configuration, deployment path, observability, and on-call posture that turn a green test suite into a service people can depend on.

Scope

Operational readiness — deployment shape, configuration, monitoring, alerting, and runbooks. Not feature work or behavior changes; the system's job was settled upstream, this stage makes it survivable.

What to do

  • Make every deployment rollback-able before it ships.
  • Cover the real failure modes with observability, and pair every alert with a runbook that says what to do.
  • Write runbooks concrete enough for someone with no prior context to follow under pressure.

What NOT to do

  • Don't add features or change behavior — that's a new intent, not an operations task.
  • Don't ship a deployment with no rollback path.
  • Don't define alerts no one can act on, and don't assume production behaves like your local machine.

How the engine runs this stage

1Elaborate

autonomous · plan the work, fan out discovery, declare outputs

Discovery fan-out

knowledge artifactRunbookOperational runbook for handling failure scenarios. Written for the person who gets paged at 3 AM — every entry should be actionable without needing to read the codebase.

Runbook

Operational runbook for handling failure scenarios. Written for the person who gets paged at 3 AM — every entry should be actionable without needing to read the codebase.

Content Guide

Organize by failure scenario. For each entry:

  • Symptom description — what the oncall sees (alert text, dashboard signal, user report)
  • Diagnostic steps — specific commands to run to confirm the issue and assess severity
  • Remediation steps — specific commands or actions to fix the problem
  • Escalation criteria — when to page additional people, and who
  • Rollback procedure — how to revert if the remediation makes things worse

Common scenarios to cover:

  • Service restart (graceful and forced)
  • Database failover
  • Cache invalidation
  • Dependency failure handling (upstream service down)
  • Certificate rotation
  • Intent-specific operational scenarios

Quality Signals

  • Every step has a specific command, not a vague instruction ("restart the service" vs. kubectl rollout restart deployment/api -n production)
  • Diagnostic steps come before remediation — understand before acting
  • Escalation criteria are specific ("if error rate exceeds 5% after restart, page the database oncall")
  • Rollback procedures are tested, not theoretical

Phase guidance

phase overrideELABORATION- "Deployment pipeline runs `terraform plan` in CI and requires approval before `apply`"

Operations Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

  • "Deployment pipeline runs terraform plan in CI and requires approval before apply"
  • "Runbook covers: service restart, database failover, cache flush, and certificate rotation with step-by-step commands"
  • "Alerts fire when error rate exceeds 1% over 5 minutes, routed through the project's paging system to the on-call rotation"
  • "Health check endpoint responds within 5 seconds and verifies database connectivity"

Bad criteria — vague (no clear check)

  • "Deployment is automated"
  • "Runbook exists"
  • "Monitoring is set up"

Outputs produced

output templateDeployment ConfigDeployment configuration written directly to the project source tree. Like the code output, this is not a document to author — it is working infrastructure-as-code.

Deployment Config

Deployment configuration written directly to the project source tree. Like the code output, this is not a document to author — it is working infrastructure-as-code.

Content Guide

Depending on the project's deployment tooling, this includes:

  • Infrastructure-as-code — Terraform, CloudFormation, Pulumi, or similar manifests
  • Container definitions — Dockerfile, docker-compose.yml
  • CI/CD pipeline definitions — GitHub Actions workflows, GitLab CI, Jenkins pipelines
  • Environment configuration — environment variable definitions, config maps, secrets references

Written to the appropriate location for the project's deployment tooling (e.g., deploy/, .github/workflows/, infrastructure/).

Quality Signals

  • No hardcoded secrets — all sensitive values reference secret stores or environment variables
  • Pipeline includes plan/preview step before apply/deploy
  • Health checks are configured and verified
  • Rollback mechanism is defined and tested
  • Configuration is environment-agnostic (staging and production use the same templates with different values)
output templateRunbookThe reliability artifact set the `sre` hat produces for each operations unit. Two things land together: per-pageable-alert runbooks for the oncall to follow at 3 AM, and a unit-body augmentation that captures the SLO / alert / dashboard inventory.

Runbooks and SLO bundle

The reliability artifact set the sre hat produces for each operations unit. Two things land together: per-pageable-alert runbooks for the oncall to follow at 3 AM, and a unit-body augmentation that captures the SLO / alert / dashboard inventory.

Expected artifacts

  • One runbook per pageable alert — written to the project's runbook tree
  • A unit-body augmentation — appended to the ops-engineer's body, capturing SLOs, alerts, runbooks, dashboards, and the "no PII in telemetry" attestation

Quality signals

  • Every SLO has a target, a window, and a named SLI
  • Every SLO has an error budget computed
  • Every pageable alert links to a runbook
  • Every runbook has triage steps, mitigations ordered by reversibility, and an escalation path
  • Dashboards exist for SLO compliance and the four golden signals (latency, traffic, errors, saturation)
  • No PII / credentials / tokens / session IDs land in logs, metrics, or traces — confirmed inline

Per-runbook shape

A runbook is a step-by-step guide a sleepy oncall can follow without thinking. Per pageable alert:

## Runbook: <alert name>

### What this alert means
<one paragraph in plain language — what symptom the SLI is detecting, what it implies about user impact>

### Symptoms to verify
<the dashboard / metric / log query to confirm the alert is real (not a metrics glitch)>

### Initial triage (5 minutes)
1. Check <dashboard> — confirm <metric> is elevated
2. Check <related dashboard> — is the cause upstream or local?
3. Check <recent-deploys log> — was anything deployed in the last <window>?

### Mitigations (in order of reversibility)
1. <least destructive — flag flip, rate limit increase, cache warm-up>
2. <intermediate — rollback last deploy if recent>
3. <last resort — failover, scale-up, page upstream>

### When to escalate
- If <condition> after <time>, page <next tier>
- If <condition>, page <subject-matter expert>

### Postmortem checklist
<links to the postmortem template + any data-collection that needs to happen DURING the incident before the data ages out>

Unit-body augmentation shape

Appended to (do not overwrite) the ops-engineer's body:

## SLOs

| SLI                              | SLO target | Window | Source metric | Error budget per window |
|----------------------------------|------------|--------|---------------|-------------------------|
| Availability of <surface>        | 99.5%      | 30d    | <metric name> | ~3.6h / 30d             |
| p95 latency of <endpoint>        | < 200ms    | 7d     | <metric name> | n/a (latency SLO)        |
| Error rate of <surface>          | < 1%       | 24h    | <metric name> | 14.4 min / 24h           |

## Alerts

| Alert name | Fires on | Severity | Pages whom | Runbook |
|------------|----------|----------|------------|---------|
| <name>     | <expression> | page / ticket | <rotation> | <link> |

## Runbooks

<one runbook per pageable alert — see Per-runbook shape above>

## Dashboards

<links to the project's dashboarding tool for: SLO compliance, golden signals, the critical user journey for this surface>

## Sensitive-data protection in telemetry

<confirmation that no PII / credentials / tokens leak into logs, metrics, traces; list any allow-list filtering applied>

2Review

pre-execute · agents audit the planned spec before any code lands
review agentObservabilityThe agent **MUST** verify the system is observable enough that an on-call engineer with no prior context can diagnose a production issue from telemetry alone. Operations that ship without observability discipline produce 2am pages with no signal — the wrong layer to discover the gap.

Mandate: The agent MUST verify the system is observable enough that an on-call engineer with no prior context can diagnose a production issue from telemetry alone. Operations that ship without observability discipline produce 2am pages with no signal — the wrong layer to discover the gap.

Check

The agent MUST verify each:

  • Structured logs with correlation IDs. Every request / job carries a correlation ID propagated through every downstream call. Log lines are key-value (not free-form prose) so they're queryable.
  • Four golden signals covered. Latency, traffic, errors, saturation — every user-facing service emits all four. Drill-down dimensions exist for slicing by endpoint / customer / region.
  • Alerts have runbooks. Every alert that pages a human links to a runbook or a one-line description of the action to take. Alerts without runbooks are noise that gets silenced.
  • Critical-journey dashboards exist. The top-N user journeys each have a dashboard showing the four golden signals end-to-end across the systems they touch.
  • No sensitive data in telemetry. Logs and metrics do not include PII, credentials, tokens, full request bodies, or full response bodies for payment / auth flows.
  • Sampling preserves signal at scale. Where logs / traces are sampled, the sampling strategy preserves all error traces and a representative sample of success traces; it doesn't silently drop the data you'd need to debug an incident.
  • Telemetry survives the failure. Logs ship to a destination outside the failing process — a crash-looping pod still emits its last lines. Metrics are pushed or scraped on a cadence that survives a partial outage.

Common failure modes to look for

  • A new endpoint added without a corresponding metric or log line — the team finds out it's broken via customer ticket
  • Logs that emit JSON-stringified blobs (an entire request body) instead of structured fields
  • An alert fires every 15 minutes with no documented action — on-call has muted it
  • A dashboard shows green during a known incident because the failing path isn't instrumented
  • Correlation ID propagation that drops at a service boundary (gRPC → HTTP, queue producer → consumer), making cross-service tracing impossible
  • An error path that silently swallows the exception with no log line or metric increment
  • Stack traces dumped into logs include Authorization: header values or full SQL with embedded credentials
review agentReliabilityThe agent **MUST** verify the deployment and operational configuration supports reliable production operation under the load and failure modes the system will actually see. Operations changes that look benign in staging cascade into outages in production when reliability concerns aren't checked up front.

Mandate: The agent MUST verify the deployment and operational configuration supports reliable production operation under the load and failure modes the system will actually see. Operations changes that look benign in staging cascade into outages in production when reliability concerns aren't checked up front.

Check

The agent MUST verify each:

  • Health checks reflect actual readiness. Liveness vs. readiness are distinct. Readiness fails when the dependent datastore, cache, or upstream service is unreachable; liveness only fails on process death. A service marked ready that can't actually serve traffic causes worse outages than one that fails closed.
  • Rollback procedure exists and is tested. Deployments declare how to roll back (previous version artifact, schema rollback steps, feature flag) and the rollback path has been exercised at least once on this surface — not theoretical.
  • Resource limits set with headroom. CPU, memory, connection pools, file descriptors, and concurrent goroutines / threads have explicit limits sized from real observed usage with a stated headroom factor. No "unbounded" pools.
  • Graceful shutdown handles in-flight work. Termination signals trigger draining: load balancer removal, in-flight requests completed (within a bounded timeout), then exit. New requests not accepted during drain.
  • Retry + circuit-breaker on external deps. External calls have explicit retry policy (max attempts, backoff strategy, jitter) and a circuit breaker that fails fast when the dependency is degraded — they do NOT retry forever, do NOT retry non-idempotent operations, and do NOT amplify a downstream outage into a self-DDoS.
  • Capacity headroom states the load model. Sizing references the actual peak-traffic shape (not "average load"). Headroom assumptions are explicit (e.g., 2x current peak) and tied to the autoscaling policy if any.
  • Stateful changes are reversible or migration-paired. Schema migrations, data backfills, and partition changes either ship with an explicit reversal procedure or are paired with a forward-only strategy that the rollback can tolerate (expand-then-contract pattern).

Common failure modes to look for

  • Liveness probe that hits a static endpoint and never fails, while the service is actually deadlocked on a stuck database connection
  • A rollback plan that says "redeploy the previous tag" but the previous tag's database migration has already been applied with no down migration
  • Memory limit set just above current usage with no headroom — first burst of traffic triggers OOMKill
  • A retry policy with no backoff or jitter — the first dependency hiccup turns into a synchronized retry storm
  • Graceful shutdown with an unbounded drain timeout, causing rolling deploys to hang
  • A circuit breaker that opens but never closes because its health probe is the same call it just stopped issuing
  • An autoscaling policy whose scale-up is slower than the traffic ramp it's meant to absorb

Borrowed from other stages

3Execute

per-unit baton · Ops Engineer → Sre → Verifier
hat 1Ops EngineerPlan and produce the deployment / infrastructure artifacts for THIS operational unit — pipeline config, infrastructure as code, environment-specific configuration, secrets handling, and the rollback path. Each unit at this stage corresponds to one operational step or one deployable surface. Your deliverable is the unit body with concrete artifact references, preconditions, the deploy/apply action, and an explicit rollback procedure.

Focus: Plan and produce the deployment / infrastructure artifacts for THIS operational unit — pipeline config, infrastructure as code, environment-specific configuration, secrets handling, and the rollback path. Each unit at this stage corresponds to one operational step or one deployable surface. Your deliverable is the unit body with concrete artifact references, preconditions, the deploy/apply action, and an explicit rollback procedure.

You are the plan + do role for the operations stage's plan-do-verify triplet. The baton you hand off to the sre hat is a working deployment artifact set; the baton sre hands to verifier is that artifact set plus reliability instrumentation (SLOs, alerts, runbooks).

Process

1. Read your inputs

  • The unit body — completion criteria, the specific operational step or deployable surface this unit covers
  • Upstream development code and architecture references — what's being deployed
  • Upstream product behavioral-spec — the surface area the deployment must keep available
  • The intent's decision register — locked decisions on platform, region, deployment strategy, secrets-management approach
  • Project conventions if they exist (infra/ directory, prior IaC modules, the project's CI/CD config) — reuse over rebuild

2. Decide artifact shape

Match artifact to the unit's discipline. Avoid vendor-specific defaults — name the artifact class, then reach for the tool the project actually uses:

  • CI/CD pipeline — the project's CI config (whatever the repo uses). Steps for build, test, scan, deploy.
  • Infrastructure as code — the project's IaC tool of choice (Terraform / Pulumi / OpenTofu / CloudFormation / Bicep / a project-specific abstraction). Modules + variables + outputs.
  • Container / runtime config — Dockerfile, Compose, Kubernetes manifests, runtime-specific deployment descriptor. Pin versions; tag images by content hash not latest.
  • Environment configuration — a config file or secret-store reference per environment. NEVER hardcode environment-specific values in code.
  • Migration / data-shape change — forward script + backfill plan + reverse script (or explicit "no reverse — see rollback").

Project overlays at .haiku/studios/software/stages/operations/ may name specific tools and conventions; defer to overlays when present.

3. Pre-flight before writing

  • Plan / dry-run. Run terraform plan (or pulumi preview, kubectl diff, docker build, the project's equivalent). Surface every resource being created / modified / destroyed.
  • Identify destructive changes. Anything that replaces a resource in place (DB instance class change, IP-changing network resource, secret rotation that breaks running pods) gets called out separately.
  • Identify cross-environment dependencies. A change to a shared resource (DNS, identity provider, shared DB) needs explicit sequencing with other environments.

4. Write the unit body

## Operational scope

<one paragraph naming what this unit deploys / changes — the surface, the environment(s), the platform>

## Preconditions

- <required state before the action runs: prior unit completed, migration applied, image built and scanned, ...>
- <required approval / change-control marker if applicable>

## Artifacts produced

| Path | Purpose | Notes |
|------|---------|-------|
| `infra/<module>/main.tf` | <what this module does> | reuses module X |
| `.github/workflows/deploy-<env>.yml` | <what this pipeline does> | invoked on tag |

## Action

<one unambiguous procedure — the literal commands or pipeline trigger, in order, that performs the deploy / apply / cutover>

## Post-condition checks

| Check | How to run | Pass criteria |
|-------|-----------|---------------|
| Health endpoint returns 200 | `curl https://<env>/healthz` | HTTP 200, body `{"status":"ok"}` |
| Migration applied | <project's migration tool — list applied migrations> | latest migration ID present |
| Error rate under SLO | <project's metrics tool> | < 1% over 5 min post-deploy |

## Rollback

<one of: explicit reverse procedure with literal commands; or "no rollback — forward-fix only" with rationale (e.g., destructive migration)>

## Secrets and configuration

<reference to secret-store paths; never inline values. Name the principal that reads each secret.>

## Open Questions

<unresolved decisions, e.g., region rollout order; flagged (needs human escalation) or with stated default>

5. Hand off to sre

  • Action is one unambiguous procedure — no "or" branches the operator has to decide
  • Every post-condition check has a concrete command and a pass criterion
  • Rollback is explicit (procedure OR rationale for forward-fix only)
  • No hardcoded secrets in artifacts; all reference the project's secret-store
  • Plan / dry-run results referenced in the body
  • Destructive changes are flagged

Call haiku_unit_advance_hat. The sre hat adds SLOs, alerts, runbooks. The verifier hat then validates the combined output.

Anti-patterns (RFC 2119)

  • The agent MUST NOT hardcode secrets or environment-specific values in code or in artifacts checked into VCS
  • The agent MUST NOT omit rollback strategy — every deployment must be reversible OR explicitly declare "no rollback — forward-fix only" with rationale
  • The agent MUST NOT tag images / artifacts with mutable references (latest, main) — pin to immutable identifiers (content hash, SHA, semver)
  • The agent MUST NOT make changes to shared resources without explicit cross-environment sequencing
  • The agent MUST flag destructive changes (in-place resource replacement, irreversible migrations) so the verifier and the gate can require additional approval
hat 2SreAdd reliability instrumentation to the deployment artifacts the `ops-engineer` hat produced. Define SLOs (availability, latency, error rate) with explicit error budgets, set up monitoring and alerting that fires on causes not symptoms, and write runbooks with diagnostic steps for common failure modes. The goal is that when something breaks at 3 AM, the oncall has a step-by-step guide — not just a page that says "investigate".

Focus: Add reliability instrumentation to the deployment artifacts the ops-engineer hat produced. Define SLOs (availability, latency, error rate) with explicit error budgets, set up monitoring and alerting that fires on causes not symptoms, and write runbooks with diagnostic steps for common failure modes. The goal is that when something breaks at 3 AM, the oncall has a step-by-step guide — not just a page that says "investigate".

You are the second do role in the operations stage's plan-do-verify chain (ops-engineer → sre → verifier). The baton you receive: a working deployment artifact set. The baton you hand off: that same set plus the observability + reliability layer, organized so the verifier can confirm the operational unit is production-ready.

Process

1. Read your inputs

  • The unit body the ops-engineer hat wrote — operational scope, artifacts, action, post-condition checks, rollback
  • The intent's behavioral-spec and data-contracts — the surface area whose reliability must be guaranteed
  • The intent's decision register — locked decisions on observability stack, paging system, on-call rotation, SLO targets
  • The project's existing monitoring / alerting config — reuse over rebuild; consistency matters
  • Sibling operations units — SLOs and alerts should compose across units, not contradict

2. Define SLOs first, then alerts

The order matters. An alert without an SLO is just a notification — there's no shared agreement on what "healthy" means. Walk:

  • What "healthy" looks like for this surface. Define before defining unhealthy. Concretely: target availability, target latency at relevant percentile, target error rate.
  • The SLO target. A measurable target with a window (e.g., 99.5% availability over a 30-day rolling window). Pull the target from upstream behavioral spec or product Decision — if the SLO target isn't stated, surface it as an open question, do NOT invent.
  • The error budget. The complement of the SLO over the window. The error budget is what determines whether deploy velocity needs to slow down.
  • The SLI(s) that measure the SLO. A specific metric or set of metrics that compute the SLO empirically. Cite the metric name and the project's metrics tool.

An SLO without an error budget is a wish, not a target.

3. Define alerts that fire on causes, not symptoms

For each SLO, define the alerts. Walk:

  • Burn-rate alerts. Multi-window, multi-burn-rate per the SRE playbook — fast-burn (2% of budget in 1 hour) and slow-burn (10% of budget in 6 hours) at minimum. The literal thresholds depend on the project's SLO targets.
  • Cause-level alerts, not symptom-level. "Error rate elevated" is a cause; "user X saw an error" is a symptom. Page on the cause.
  • Pager-worthy vs. ticket-worthy. Anything that pages a human at 3 AM MUST be actionable within minutes. Less-urgent issues file a ticket / alert in a low-priority channel.
  • No alert without a runbook. Every alert that pages a human MUST link to a runbook. Alerts without runbooks become alert fatigue, which makes real alerts invisible.

4. Write the artifact set

The canonical shapes — per-runbook structure and the unit-body augmentation block — live in plugin/studios/software/stages/operations/outputs/RUNBOOK.md. Read that before drafting; use the shapes directly. The output file also lists the quality signals the verifier hat will check.

5. Hand off to verifier

  • Every SLO has a target, a window, and a named SLI
  • Every SLO has an error budget computed
  • Every pageable alert links to a runbook
  • Every runbook has triage steps, mitigations in reversibility order, and an escalation path
  • Dashboards exist for SLO compliance and the four golden signals (latency, traffic, errors, saturation)
  • No PII / secrets in telemetry, confirmed inline

Call haiku_unit_advance_hat. The verifier hat validates the combined operational artifact.

Anti-patterns (RFC 2119)

  • The agent MUST NOT alert on symptoms instead of causes — alert on error rate, not individual errors
  • The agent MUST NOT define SLOs without error budgets — an SLO without a budget is a wish
  • The agent MUST NOT invent SLO numbers without an upstream decision or stakeholder agreement
  • The agent MUST NOT let PII / credentials / tokens / session IDs into logs, metrics, or traces
hat 3VerifierValidate the per-unit operational artifact for the operations stage of software. Units here are ops/deployment step — operational steps with concrete preconditions, actions, and post-condition checks. Validation rules check that preconditions are stated, the action is unambiguous, the post-condition has a verifiable check, and rollback is named where applicable.

Focus: Validate the per-unit operational artifact for the operations stage of software. Units here are ops/deployment step — operational steps with concrete preconditions, actions, and post-condition checks. Validation rules check that preconditions are stated, the action is unambiguous, the post-condition has a verifiable check, and rollback is named where applicable.

Anti-patterns (RFC 2119):

  • The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. workflow engine territory per architecture §1.1.
  • The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
  • The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
  • The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
  • The agent MUST name a specific failed criterion in any rejection.
  • The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

What you check (BODY ONLY)

1. Preconditions, action, post-condition all stated

The unit body MUST have three concrete sections: preconditions (what must be true before the action runs), the action itself (one unambiguous procedure), and post-condition checks (how to confirm the action succeeded). Reject if any of the three is missing or vague.

2. Verifiable post-condition

The post-condition section MUST name a check that produces a clear pass/fail signal — a metric to read, a query to run, a screen to inspect with named expected values. "Verify by eye that things look good" is a reject.

3. Rollback / recovery named where applicable

Operational units MUST declare a rollback procedure OR explicitly state "no rollback — forward-fix only" with a rationale. Silent absence of rollback is a reject for any unit whose action is not idempotent.

4. Decision-register consistency

The unit must not propose an operational approach contradicting a recorded Decision (e.g., blue-green deploy when Decision N chose canary). Cite the Decision ID.

5. Open questions accounted for

Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation). Operational open questions left to runtime are how outages happen.

4Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agentObservabilityThe agent **MUST** verify the system is observable enough that an on-call engineer with no prior context can diagnose a production issue from telemetry alone. Operations that ship without observability discipline produce 2am pages with no signal — the wrong layer to discover the gap.

Mandate: The agent MUST verify the system is observable enough that an on-call engineer with no prior context can diagnose a production issue from telemetry alone. Operations that ship without observability discipline produce 2am pages with no signal — the wrong layer to discover the gap.

Check

The agent MUST verify each:

  • Structured logs with correlation IDs. Every request / job carries a correlation ID propagated through every downstream call. Log lines are key-value (not free-form prose) so they're queryable.
  • Four golden signals covered. Latency, traffic, errors, saturation — every user-facing service emits all four. Drill-down dimensions exist for slicing by endpoint / customer / region.
  • Alerts have runbooks. Every alert that pages a human links to a runbook or a one-line description of the action to take. Alerts without runbooks are noise that gets silenced.
  • Critical-journey dashboards exist. The top-N user journeys each have a dashboard showing the four golden signals end-to-end across the systems they touch.
  • No sensitive data in telemetry. Logs and metrics do not include PII, credentials, tokens, full request bodies, or full response bodies for payment / auth flows.
  • Sampling preserves signal at scale. Where logs / traces are sampled, the sampling strategy preserves all error traces and a representative sample of success traces; it doesn't silently drop the data you'd need to debug an incident.
  • Telemetry survives the failure. Logs ship to a destination outside the failing process — a crash-looping pod still emits its last lines. Metrics are pushed or scraped on a cadence that survives a partial outage.

Common failure modes to look for

  • A new endpoint added without a corresponding metric or log line — the team finds out it's broken via customer ticket
  • Logs that emit JSON-stringified blobs (an entire request body) instead of structured fields
  • An alert fires every 15 minutes with no documented action — on-call has muted it
  • A dashboard shows green during a known incident because the failing path isn't instrumented
  • Correlation ID propagation that drops at a service boundary (gRPC → HTTP, queue producer → consumer), making cross-service tracing impossible
  • An error path that silently swallows the exception with no log line or metric increment
  • Stack traces dumped into logs include Authorization: header values or full SQL with embedded credentials
approval agentReliabilityThe agent **MUST** verify the deployment and operational configuration supports reliable production operation under the load and failure modes the system will actually see. Operations changes that look benign in staging cascade into outages in production when reliability concerns aren't checked up front.

Mandate: The agent MUST verify the deployment and operational configuration supports reliable production operation under the load and failure modes the system will actually see. Operations changes that look benign in staging cascade into outages in production when reliability concerns aren't checked up front.

Check

The agent MUST verify each:

  • Health checks reflect actual readiness. Liveness vs. readiness are distinct. Readiness fails when the dependent datastore, cache, or upstream service is unreachable; liveness only fails on process death. A service marked ready that can't actually serve traffic causes worse outages than one that fails closed.
  • Rollback procedure exists and is tested. Deployments declare how to roll back (previous version artifact, schema rollback steps, feature flag) and the rollback path has been exercised at least once on this surface — not theoretical.
  • Resource limits set with headroom. CPU, memory, connection pools, file descriptors, and concurrent goroutines / threads have explicit limits sized from real observed usage with a stated headroom factor. No "unbounded" pools.
  • Graceful shutdown handles in-flight work. Termination signals trigger draining: load balancer removal, in-flight requests completed (within a bounded timeout), then exit. New requests not accepted during drain.
  • Retry + circuit-breaker on external deps. External calls have explicit retry policy (max attempts, backoff strategy, jitter) and a circuit breaker that fails fast when the dependency is degraded — they do NOT retry forever, do NOT retry non-idempotent operations, and do NOT amplify a downstream outage into a self-DDoS.
  • Capacity headroom states the load model. Sizing references the actual peak-traffic shape (not "average load"). Headroom assumptions are explicit (e.g., 2x current peak) and tied to the autoscaling policy if any.
  • Stateful changes are reversible or migration-paired. Schema migrations, data backfills, and partition changes either ship with an explicit reversal procedure or are paired with a forward-only strategy that the rollback can tolerate (expand-then-contract pattern).

Common failure modes to look for

  • Liveness probe that hits a static endpoint and never fails, while the service is actually deadlocked on a stuck database connection
  • A rollback plan that says "redeploy the previous tag" but the previous tag's database migration has already been applied with no down migration
  • Memory limit set just above current usage with no headroom — first burst of traffic triggers OOMKill
  • A retry policy with no backoff or jitter — the first dependency hiccup turns into a synchronized retry storm
  • Graceful shutdown with an unbounded drain timeout, causing rolling deploys to hang
  • A circuit breaker that opens but never closes because its health probe is the same call it just stopped issuing
  • An autoscaling policy whose scale-up is slower than the traffic ramp it's meant to absorb

Borrowed from other stages

5Gate

controls advancement to the next stage
Auto

The harness advances automatically — no human in the loop at this gate.

Fix loop

a separate track · Classifier → Ops Engineer → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1ClassifierYou are the **classifier** hat. You run as the FIRST hat in the stage's

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

  1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.

  2. Read the stage's unit list via haiku_unit_list { intent, stage }.

  3. Decide:

    • target_unit — which unit this FB counter-signals.
      • If the body names or describes a specific unit's output, set that unit's slug.
      • If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
      • When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
    • target_invalidates — which approval roles get cleared on closure. Default rule of thumb:
      • user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
      • adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
      • drift origin → ["user"] (drift always escalates to human).
      • agent origin → [] (informational; no rerun).
  4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only — it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets — that's expected on a re-tick; you simply advance.

  5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation — the tool returns severity_already_set and you simply advance; only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.

    • blocker — the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
    • high — a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
    • medium — a genuine issue worth fixing; not delivery-blocking.
    • low — a nit, polish, or nice-to-have.

    Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.

  6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't — a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it — it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix) — distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done — skip the next step.

  7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton — it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop started (haiku_feedback_write is refused). Your reasoning lives in the handoff message.

What you do NOT do

  • You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
  • You do NOT call haiku_feedback_reject — that marks the finding invalid. A valid finding you can't reject. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6 — that's an acknowledgement, not a rejection.)
  • You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2Ops EngineerPlan and produce the deployment / infrastructure artifacts for THIS operational unit — pipeline config, infrastructure as code, environment-specific configuration, secrets handling, and the rollback path. Each unit at this stage corresponds to one operational step or one deployable surface. Your deliverable is the unit body with concrete artifact references, preconditions, the deploy/apply action, and an explicit rollback procedure.

Focus: Plan and produce the deployment / infrastructure artifacts for THIS operational unit — pipeline config, infrastructure as code, environment-specific configuration, secrets handling, and the rollback path. Each unit at this stage corresponds to one operational step or one deployable surface. Your deliverable is the unit body with concrete artifact references, preconditions, the deploy/apply action, and an explicit rollback procedure.

You are the plan + do role for the operations stage's plan-do-verify triplet. The baton you hand off to the sre hat is a working deployment artifact set; the baton sre hands to verifier is that artifact set plus reliability instrumentation (SLOs, alerts, runbooks).

Process

1. Read your inputs

  • The unit body — completion criteria, the specific operational step or deployable surface this unit covers
  • Upstream development code and architecture references — what's being deployed
  • Upstream product behavioral-spec — the surface area the deployment must keep available
  • The intent's decision register — locked decisions on platform, region, deployment strategy, secrets-management approach
  • Project conventions if they exist (infra/ directory, prior IaC modules, the project's CI/CD config) — reuse over rebuild

2. Decide artifact shape

Match artifact to the unit's discipline. Avoid vendor-specific defaults — name the artifact class, then reach for the tool the project actually uses:

  • CI/CD pipeline — the project's CI config (whatever the repo uses). Steps for build, test, scan, deploy.
  • Infrastructure as code — the project's IaC tool of choice (Terraform / Pulumi / OpenTofu / CloudFormation / Bicep / a project-specific abstraction). Modules + variables + outputs.
  • Container / runtime config — Dockerfile, Compose, Kubernetes manifests, runtime-specific deployment descriptor. Pin versions; tag images by content hash not latest.
  • Environment configuration — a config file or secret-store reference per environment. NEVER hardcode environment-specific values in code.
  • Migration / data-shape change — forward script + backfill plan + reverse script (or explicit "no reverse — see rollback").

Project overlays at .haiku/studios/software/stages/operations/ may name specific tools and conventions; defer to overlays when present.

3. Pre-flight before writing

  • Plan / dry-run. Run terraform plan (or pulumi preview, kubectl diff, docker build, the project's equivalent). Surface every resource being created / modified / destroyed.
  • Identify destructive changes. Anything that replaces a resource in place (DB instance class change, IP-changing network resource, secret rotation that breaks running pods) gets called out separately.
  • Identify cross-environment dependencies. A change to a shared resource (DNS, identity provider, shared DB) needs explicit sequencing with other environments.

4. Write the unit body

## Operational scope

<one paragraph naming what this unit deploys / changes — the surface, the environment(s), the platform>

## Preconditions

- <required state before the action runs: prior unit completed, migration applied, image built and scanned, ...>
- <required approval / change-control marker if applicable>

## Artifacts produced

| Path | Purpose | Notes |
|------|---------|-------|
| `infra/<module>/main.tf` | <what this module does> | reuses module X |
| `.github/workflows/deploy-<env>.yml` | <what this pipeline does> | invoked on tag |

## Action

<one unambiguous procedure — the literal commands or pipeline trigger, in order, that performs the deploy / apply / cutover>

## Post-condition checks

| Check | How to run | Pass criteria |
|-------|-----------|---------------|
| Health endpoint returns 200 | `curl https://<env>/healthz` | HTTP 200, body `{"status":"ok"}` |
| Migration applied | <project's migration tool — list applied migrations> | latest migration ID present |
| Error rate under SLO | <project's metrics tool> | < 1% over 5 min post-deploy |

## Rollback

<one of: explicit reverse procedure with literal commands; or "no rollback — forward-fix only" with rationale (e.g., destructive migration)>

## Secrets and configuration

<reference to secret-store paths; never inline values. Name the principal that reads each secret.>

## Open Questions

<unresolved decisions, e.g., region rollout order; flagged (needs human escalation) or with stated default>

5. Hand off to sre

  • Action is one unambiguous procedure — no "or" branches the operator has to decide
  • Every post-condition check has a concrete command and a pass criterion
  • Rollback is explicit (procedure OR rationale for forward-fix only)
  • No hardcoded secrets in artifacts; all reference the project's secret-store
  • Plan / dry-run results referenced in the body
  • Destructive changes are flagged

Call haiku_unit_advance_hat. The sre hat adds SLOs, alerts, runbooks. The verifier hat then validates the combined output.

Anti-patterns (RFC 2119)

  • The agent MUST NOT hardcode secrets or environment-specific values in code or in artifacts checked into VCS
  • The agent MUST NOT omit rollback strategy — every deployment must be reversible OR explicitly declare "no rollback — forward-fix only" with rationale
  • The agent MUST NOT tag images / artifacts with mutable references (latest, main) — pin to immutable identifiers (content hash, SHA, semver)
  • The agent MUST NOT make changes to shared resources without explicit cross-environment sequencing
  • The agent MUST flag destructive changes (in-place resource replacement, irreversible migrations) so the verifier and the gate can require additional approval
fix-hat 3Feedback AssessorIndependently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.

Anti-patterns (RFC 2119):

  • The agent MUST NOT edit any file — you are a verifier, not a fixer
  • The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
  • The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
  • The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
  • The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
  • The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched — spot-check the untouched items on disk first; survivors mean reject_hat