Operations

Auto review

Deployment, monitoring, and operational readiness

Hats: 2
Review Agents: 2 + 1
Review: Auto
Unit Types: Ops
Inputs: Inception, Product, Development, Development

Dependencies

  • Inception → discovery
  • Product → behavioral-spec
  • Development → architecture

Hat Sequence

1. Ops Engineer

Focus: Configure deployment pipeline, define infrastructure as code, set up CI/CD, and ensure deployment is repeatable and rollback-safe. Every deployment should be automated, auditable, and reversible.

Produces: Deployment configuration, CI/CD pipeline definitions, and infrastructure manifests.

Reads: code and architecture via the unit's ## References section.

Anti-patterns (RFC 2119):

  • The agent MUST NOT use manual deployment steps that require human intervention
  • The agent MUST NOT hardcode secrets or environment-specific values in code
  • The agent MUST NOT omit rollback strategy — every deployment must be reversible
  • The agent MUST NOT skip health checks — the system must verify its own readiness
  • The agent MUST NOT create deployment config without testing it (terraform plan, docker build, etc.)
  • The agent MUST NOT mix infrastructure concerns with application code
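The rollback and health-check requirements above can be sketched as a small control loop. This is an illustrative sketch, not part of the unit's defined interface: the `apply`, `healthy`, and `rollback` callables are hypothetical placeholders for real pipeline steps (a terraform apply, a readiness probe, a version pin).

```python
from typing import Callable

def deploy_with_rollback(
    apply: Callable[[str], bool],     # applies a version; True on success
    healthy: Callable[[], bool],      # readiness probe run after the apply
    rollback: Callable[[str], bool],  # restores the previous version
    new_version: str,
    prev_version: str,
) -> str:
    """Apply new_version; roll back to prev_version if the apply or the
    post-deploy health check fails. Returns the version left running."""
    if not apply(new_version) or not healthy():
        rollback(prev_version)
        return prev_version
    return new_version
```

The point of the shape is that rollback is part of the deploy path, not an afterthought: every exit from the function leaves a known-good version running.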
2. SRE

Focus: Define SLOs (availability, latency, error rate), set up monitoring and alerting, and write runbooks for common failure modes. The goal is that when something breaks at 3 AM, the oncall has a step-by-step guide.

Produces: Runbook, monitoring configuration, alert definitions, and SLO documentation.

Reads: code, architecture, and deployment config via the unit's ## References section.

Anti-patterns (RFC 2119):

  • The agent MUST NOT alert on individual events instead of aggregate signals (alert on error rate, not individual errors)
  • The agent MUST NOT define SLOs without error budgets — an SLO without a budget is just a wish
  • The agent MUST NOT write runbooks that say "page the oncall" without diagnostic steps
  • The agent MUST NOT create monitoring that generates noise (alert fatigue makes real alerts invisible)
  • The agent MUST define what "healthy" looks like before defining what "unhealthy" looks like
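The error-budget rule above can be made concrete with a small calculation. The function below is an illustrative sketch of how a budget falls out of an availability SLO, not part of the unit's tooling:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window.

    The budget is simply the fraction of the window the SLO permits to fail.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

# A 99.9% availability SLO over 30 days leaves roughly 43.2 minutes of budget.
```

Once the budget is a number, alerting can be framed as burn rate against it rather than as a vague "too many errors".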

Review Agents

Observability

Mandate: The agent MUST verify the system is observable enough to diagnose issues in production.

Check:

  • The agent MUST verify that key operations emit structured logs with correlation IDs
  • The agent MUST verify that metrics cover the four golden signals (latency, traffic, errors, saturation)
  • The agent MUST verify that alerts have clear runbooks or at minimum actionable descriptions
  • The agent MUST verify that dashboards exist for the critical user journeys
  • The agent MUST verify that no sensitive data appears in logs or metrics (PII, credentials, tokens)

Reliability

Mandate: The agent MUST verify the deployment and operational configuration supports reliable production operation.

Check:

  • The agent MUST verify that health checks cover actual readiness, not just process liveness
  • The agent MUST verify that rollback procedure is defined and tested
  • The agent MUST verify that resource limits (CPU, memory, connections) are set appropriately
  • The agent MUST verify that graceful shutdown handles in-flight requests
  • The agent MUST verify that retry and circuit-breaker patterns are configured for external dependencies
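The last check can be sketched with a minimal consecutive-failure circuit breaker. This is an illustrative sketch of the pattern only; in production this is usually handled by a service mesh or a resilience library rather than hand-rolled:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after `cooldown_s`."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._failures = 0
        self._opened_at: float | None = None

    def allow(self) -> bool:
        """May a request be attempted right now?"""
        if self._opened_at is None:
            return True
        if time.monotonic() - self._opened_at >= self.cooldown_s:
            self._opened_at = None  # half-open: let one trial request through
            self._failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Report the outcome of an attempted request."""
        if success:
            self._failures = 0
        else:
            self._failures += 1
            if self._failures >= self.threshold:
                self._opened_at = time.monotonic()
```

The breaker turns a failing dependency from a source of pile-up (retries amplifying load) into a fast, bounded failure, which is what makes retries safe to configure at all.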

Included from other stages

Operations

Criteria Guidance

Good criteria examples:

  • "Deployment pipeline runs terraform plan in CI and requires approval before apply"
  • "Runbook covers: service restart, database failover, cache flush, and certificate rotation with step-by-step commands"
  • "Alerts fire when error rate exceeds 1% over 5 minutes, with PagerDuty routing"
  • "Health check endpoint responds within 5 seconds and verifies database connectivity"

Bad criteria examples:

  • "Deployment is automated"
  • "Runbook exists"
  • "Monitoring is set up"

Completion Signal (RFC 2119)

Deployment pipeline MUST be defined and validated (builds, plans, and applies successfully). Monitoring MUST cover key metrics (latency, error rate, throughput). Runbook MUST exist for common failure modes with step-by-step remediation commands. SLOs MUST be defined with alert thresholds and error budgets.
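The health-check requirement (verifying actual readiness, such as database connectivity, within a deadline) can be sketched as a readiness probe that runs dependency checks under a timeout. The check names and functions below are hypothetical placeholders:

```python
import concurrent.futures
from typing import Callable

def readiness(checks: dict[str, Callable[[], bool]],
              timeout_s: float = 5.0) -> tuple[bool, dict[str, bool]]:
    """Run dependency checks (e.g. a database ping) with a per-check deadline.

    A check that raises or times out counts as failed, so a hung dependency
    marks the instance not-ready instead of blocking the probe forever.
    Returns (ready, per-check results).
    """
    results: dict[str, bool] = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, fut in futures.items():
            try:
                results[name] = bool(fut.result(timeout=timeout_s))
            except Exception:
                results[name] = False
    return all(results.values()), results
```

This is the distinction the Reliability agent checks for: process liveness says "the process exists", while readiness says "the process can actually serve a request".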