Data Pipeline · stage 5 of 5

Deployment

External gate

Deploy pipelines to production with monitoring and alerting

The terminal stage of the data-pipeline lifecycle: take the validated pipeline and put it into production. This is where the pipeline stops being code on a branch and becomes infrastructure other people depend on.

Scope

Operationalizing the pipeline — orchestrator registration, schedule, resource sizing, alert routing, runbooks, and rollback plan. Deployment decides how the pipeline runs and is operated in production — it does not change transformation or validation logic; if either is wrong, that's a revisit upstream.

What to do

Package the pipeline for the orchestrator: schedule, dependency chain, retry / timeout policy, resource limits.
Route alerts to the right on-call channel and monitor both pipeline health and data freshness.
Write runbooks an unfamiliar engineer can actually follow, and a rollback plan for the first run.
Hold operational readiness — not just a successful execution — as the bar to ship.

What NOT to do

Don't modify transformation or validation logic — route a regression back to the stage that owns it.
Don't deploy a pipeline whose validation suite has unresolved blocking findings.
Don't treat a clean run as readiness; without alerting, monitoring, and rollback it isn't done.
Don't add scope the validated pipeline didn't already cover.

How the engine runs this stage

1 · Elaborate

autonomous · plan the work, fan out discovery, declare outputs

Inputs consumed

validation-report from Validation

Phase guidance

phase override · ELABORATION

Deployment Stage — Elaboration

Criteria Guidance

Good criteria — concrete and verifiable

"Pipeline DAG is registered in the orchestrator with correct dependencies, retry policies, and SLA-based alerting"
"Monitoring covers pipeline runtime, row counts per stage, data freshness, and error rates with alerts routed to the on-call channel"
"Runbook documents manual recovery steps for the 3 most likely failure modes (source unavailable, schema drift, transformation timeout)"

Bad criteria — vague (no clear check)

"Pipeline is deployed"
"Monitoring is set up"
"Documentation exists"

Outputs produced

output template · Pipeline Config

Pipeline Configuration

Production pipeline deployment with monitoring, alerting, and runbook.

Expected Artifacts

Pipeline DAG -- registered in orchestrator with dependencies, retry policies, and SLA-based alerting
Monitoring setup -- runtime, row counts, data freshness, and error rate tracking with alert routing
Runbook -- manual recovery steps for the most likely failure modes
Deployment verification -- confirmation that pipeline runs successfully in production

Quality Signals

Pipeline DAG has correct dependencies and retry policies
Monitoring covers all critical pipeline health metrics
Runbook documents recovery for at least 3 failure scenarios
Alerts route to the correct on-call channel

2 · Review

pre-execute · agents audit the planned spec before any code lands

review agent · Reliability

Mandate: The agent MUST verify the deployed pipeline is resilient under realistic failure modes and observable in production by someone who didn't build it.

Check

The agent MUST verify, and file feedback for any violation:

Failure-recovery definition — Retry policy per stage (max attempts, backoff strategy, what counts as retryable), dead-letter destinations for unrecoverable records, and alert wiring on hard failures are all explicit, not implicit
Resource sizing for peak — Memory, CPU, and parallelism are sized for the projected peak volume from the discovery brief's growth curve, not for current average load. Concurrency limits prevent overlapping runs of the same pipeline
Monitoring breadth — Coverage includes pipeline-execution health (success rate, duration, retry counts), data health (rows landed per stage, freshness per target, validation pass rate), and resource trends (drift toward memory / CPU / duration limits)
Alert actionability — Every alert has a severity (page / ticket / log-only) matched to impact, a route to a real on-call channel with a real schedule, and a runbook entry. Alerts that fire into the void provide false comfort
Backfill readiness — A documented backfill procedure exists, has been tested in staging at realistic volume, preserves idempotency, and is rate-limited so a backfill doesn't overwhelm the production target
Runbook actionability — For each of the most likely failure modes (source unavailable, schema drift, transformation timeout, validation failure, downstream-consumer breakage), the runbook covers symptoms, triage, recovery, rollback, and communication — concrete enough that an unfamiliar engineer can act on it
First-run plan and rollback — Before initial production deployment, "good first run" criteria are explicit (specific row counts, validation results, latency), rollback triggers are explicit, and rollback steps are concrete

Common failure modes to look for

A pipeline with retry policies but no defined dead-letter destination
Resource limits set to "default" with no reference to actual peak volume
Monitoring of "pipeline succeeded" without monitoring of "target stayed fresh" — a pipeline that emits zero rows successfully looks healthy
Alerts routed to a chat channel nobody owns or that's muted
A backfill procedure documented but never exercised in staging
A runbook entry that says "investigate the failure" with no triage steps
A first-run plan that says "ship it and see" with no specific success criteria and no rollback trigger
An on-call schedule that has gaps (nights, weekends) without escalation paths defined

Borrowed from other stages

Data Quality from Transformation · Coverage from Validation

3 · Execute

per-unit baton · Pipeline Engineer → SRE → Verifier

hat 1 · Pipeline Engineer

Focus: Package and deploy the pipeline to the production orchestrator. Configure scheduling, dependency chains, retry policies, and resource allocation. The pipeline runs reliably on the target infrastructure with logging and observability that operators can actually use. Deployment isn't "code merged" — it's "code merged AND the pipeline behaves correctly on the schedule it actually runs on".

Process

1. Read the inputs

Validation's VALIDATION-REPORT.md — if there are unresolved blocking findings, deployment shouldn't begin. Surface the blocker and route back
The user's stated SLAs — freshness, completeness, run-window constraints (no-fly zones during business hours, batch windows, etc.)
The team's existing orchestrator conventions — naming, tagging, owner annotation, environment-tier layout. New pipelines that don't match house conventions become orphans

2. Register the DAG / schedule

Per pipeline:

Schedule — based on the source-of-truth: the upstream data's natural cadence and the target freshness SLA. Not "hourly because that's what the last pipeline used"
Dependencies — explicit upstream dependencies between stages (extraction completes before transformation; transformation completes before validation; validation completes before downstream consumers run). Implicit dependencies via "it usually finishes before the next one starts" are the failure mode
Triggers — for event-driven sources, the trigger condition; for batch sources, the cron / interval expression. State the trigger explicitly in code, not in tribal knowledge
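
The explicit ordering described above can be sketched with Python's standard-library `graphlib`. The stage names are illustrative assumptions, not names from any particular orchestrator; a real orchestrator would express the same graph in its own DAG syntax.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names. Each key lists the stages it must wait for.
DEPENDENCIES = {
    "extraction": set(),
    "transformation": {"extraction"},
    "validation": {"transformation"},
    "downstream_consumers": {"validation"},
}


def execution_order(dependencies):
    """Return a run order that respects every declared dependency.

    graphlib raises CycleError if the declared graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

Because the graph is declared in code, a missing or circular dependency fails when the graph is built rather than surfacing as a timing accident in production.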

3. Configure retry, timeout, and resource policies

Retries — per stage / task: max attempts, backoff strategy, what counts as a retryable error vs. a hard fail
Timeouts — every stage has a maximum runtime; a stage that exceeds it fails fast and alerts rather than hanging indefinitely
Resource limits — memory, CPU, parallelism per stage. Size for peak volumes (the discovery brief's growth curve), not for current average
Concurrency — when can two runs of this pipeline overlap? Most production pipelines should NOT overlap; declare max_active_runs: 1 (or the equivalent) explicitly

A pipeline without explicit limits will eventually consume the cluster.
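
One way to make these limits explicit is a small per-stage policy object. The sketch below is a minimal illustration: the class, its field names, and the choice of retryable error types are assumptions, and a real orchestrator would carry the same values in its own task configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePolicy:
    """Illustrative per-stage limits; the field names are assumptions."""
    max_attempts: int
    base_delay_s: float
    max_delay_s: float
    timeout_s: float
    retryable: tuple = (TimeoutError, ConnectionError)

    def backoff_delays(self):
        """Exponential backoff delays between attempts, capped at max_delay_s."""
        return [
            min(self.base_delay_s * 2 ** i, self.max_delay_s)
            for i in range(self.max_attempts - 1)
        ]

    def is_retryable(self, error):
        """Only the listed error types are retried; anything else is a hard fail."""
        return isinstance(error, self.retryable)


policy = StagePolicy(max_attempts=4, base_delay_s=30, max_delay_s=120, timeout_s=1800)
print(policy.backoff_delays())
print(policy.is_retryable(ConnectionError("source unavailable")))
print(policy.is_retryable(ValueError("schema drift")))
```

Stating which errors are retryable keeps a schema problem from being retried as if it were a transient outage.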

4. Plumb logging and observability

Structured logs — stage name, run ID, row counts, error context. Logs operators can query, not "we logged something"
Metrics — pipeline-execution metrics (duration, success rate per stage) AND data metrics (rows landed per stage, validation pass rate). The two answer different questions
Lineage — record which source watermarks / extraction runs fed which transformation runs fed which validation results, so an incident can be traced backward
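
A structured log line carrying the fields listed above might look like the following sketch. The helper name and the exact field names (`rows_in`, `rows_out`, `status`) are assumptions chosen for illustration.

```python
import json
import logging
import sys


def log_stage_event(logger, stage, run_id, rows_in, rows_out, error=None):
    """Emit one JSON log line per stage so operators can query by field."""
    record = {
        "stage": stage,
        "run_id": run_id,
        "rows_in": rows_in,
        "rows_out": rows_out,
        "status": "failed" if error else "succeeded",
        "error": repr(error) if error else None,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")
line = log_stage_event(logger, "transformation", "run-001", rows_in=1000, rows_out=998)
```

Emitting one JSON object per event lets a log query filter on `stage` or `run_id` directly instead of parsing free text.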

5. Test the full DAG end-to-end in staging

Production is not where you discover the DAG is wrong:

Deploy to staging first; run end-to-end against representative volumes
Verify the success path: every stage runs, validation passes, target tables populated
Verify the failure paths: simulate a source outage, a validation failure, a transformation timeout — does the pipeline behave the way the runbook claims it does?

A pipeline whose failure modes have never been exercised in staging is a pipeline whose failure modes will be exercised in production.

6. Plan the rollback

The first production run is the highest-risk run. Before deployment:

Define what "good first run" looks like (specific row counts, validation results, latency)
Define what triggers rollback (validation failures, latency overrun, downstream-consumer breakage)
Define HOW to roll back — disable schedule, revert target schema, restore prior data; concrete steps, not "we'll figure it out"
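
The first-run criteria and rollback triggers can be written down as a check rather than prose. This is a minimal sketch: the thresholds and names are hypothetical, and real values would come from the first-run plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstRunCriteria:
    """Hypothetical thresholds for a good first production run."""
    min_rows: int
    max_rows: int
    max_validation_failures: int
    max_latency_s: float


def rollback_reasons(criteria, rows_landed, validation_failures, latency_s):
    """Return the rollback triggers that fired; an empty list means a good first run."""
    reasons = []
    if not criteria.min_rows <= rows_landed <= criteria.max_rows:
        reasons.append(
            f"row count {rows_landed} outside [{criteria.min_rows}, {criteria.max_rows}]"
        )
    if validation_failures > criteria.max_validation_failures:
        reasons.append(f"{validation_failures} validation failures")
    if latency_s > criteria.max_latency_s:
        reasons.append(f"latency {latency_s}s over {criteria.max_latency_s}s")
    return reasons


criteria = FirstRunCriteria(
    min_rows=90_000, max_rows=110_000, max_validation_failures=0, max_latency_s=900
)
good = rollback_reasons(criteria, rows_landed=100_500, validation_failures=0, latency_s=640)
bad = rollback_reasons(criteria, rows_landed=0, validation_failures=2, latency_s=1200)
print(good)
print(bad)
```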

Format guidance

## Schedule and triggers
- cadence, trigger condition, owner annotation

## Dependencies
- upstream / downstream graph, explicit

## Retry / timeout / resource policy
- per-stage limits and reasons

## Observability
- log fields, metrics, lineage capture

## Staging-test results
- success and failure paths exercised, outcomes

## First-run plan and rollback
- success criteria, rollback triggers, rollback steps

Anti-patterns (RFC 2119)

The agent MUST NOT deploy without configuring retries and timeout policies
The agent MUST NOT use hardcoded schedules without considering upstream-dependency completion
The agent MUST set resource limits (memory, CPU, parallelism) per pipeline stage
The agent MUST NOT deploy to production without an explicit rollback plan for the first run
The agent MUST NOT skip end-to-end testing of the full DAG in a staging environment
The agent MUST declare explicit upstream and downstream dependencies, not rely on timing
The agent MUST follow the team's house-style orchestrator conventions where they exist (naming, tagging, owner annotation, environment tier)
The agent MUST size resources for the projected peak volume, not the current average

hat 2 · SRE

Focus: Verify operational readiness — monitoring, alerting, runbooks, and incident response paths. The pipeline meets its SLA commitments AND the team can diagnose and recover from failures without the original builder. SRE here is the do / verify role for production-safety; everything you sign off becomes someone else's 3 AM problem if you signed off wrong.

Process

1. Verify alert routing

The bar is "an alert reaches a human who can act":

Each alert has a defined severity (page / ticket / log-only) matched to its impact
Page-level alerts route to a real on-call channel with a real on-call schedule, not a chat channel that mutes itself
Ticket-level alerts land in the team's actual queue, not a shared inbox nobody owns
The contact path is documented in the runbook, not in someone's head

A monitoring suite that fires into the void is worse than no monitoring — it gives false comfort.
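
The routing checks above lend themselves to a simple lint over the alert definitions. The sketch below assumes alerts are plain dictionaries with hypothetical keys (`name`, `severity`, `route`, `on_call_schedule`, `runbook`); it does not reflect any specific alerting tool's schema.

```python
SEVERITIES = {"page", "ticket", "log-only"}


def routing_gaps(alerts):
    """Return one message per gap that would keep an alert from reaching a human."""
    gaps = []
    for alert in alerts:
        name = alert.get("name", "<unnamed>")
        severity = alert.get("severity")
        if severity not in SEVERITIES:
            gaps.append(f"{name}: missing or unknown severity")
            continue
        if severity == "log-only":
            continue  # log-only alerts are not routed to a person
        if not alert.get("route"):
            gaps.append(f"{name}: no route")
        if severity == "page" and not alert.get("on_call_schedule"):
            gaps.append(f"{name}: page without an on-call schedule")
        if not alert.get("runbook"):
            gaps.append(f"{name}: no runbook entry")
    return gaps


alerts = [
    {"name": "freshness_breach", "severity": "page", "route": "oncall-data",
     "on_call_schedule": "data-primary", "runbook": "runbook#freshness"},
    {"name": "row_count_drop", "severity": "ticket", "route": "",
     "runbook": "runbook#volume"},
    {"name": "retry_spike"},
]
gaps = routing_gaps(alerts)
print(gaps)
```

A lint like this only proves the configuration is complete. Confirming that the channel is watched and the schedule is staffed still takes a test page.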

2. Verify monitoring covers more than success

Most pipelines monitor "did the run succeed". That's the easy half. The hard half:

Data freshness — is the target up-to-date per its SLA? A pipeline that runs successfully but stops emitting rows is broken in a way "success rate" hides
Data volume — are row counts in expected ranges? A run that succeeded with 0% of expected rows is a silent failure
Data quality — are validation pass-rates trending normal? A slow drift in null-rate or value-distribution is the early warning
Resource consumption — is the pipeline drifting toward its memory / CPU / duration limit? Approaching limits predict future hard failures

Monitoring that covers only success modes will mask every interesting failure.
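
The freshness and volume checks can be sketched as a function that runs alongside the success signal. The function name, the 50% volume tolerance, and the sample values are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone


def data_health(last_loaded_at, now, freshness_sla, rows_landed, expected_rows,
                tolerance=0.5):
    """Return data-health problems that a 'run succeeded' signal would hide."""
    problems = []
    if now - last_loaded_at > freshness_sla:
        problems.append("stale")
    # Flag volume outside the tolerated band around expected_rows,
    # which also catches zero-row runs.
    low = expected_rows * (1 - tolerance)
    high = expected_rows * (1 + tolerance)
    if not low <= rows_landed <= high:
        problems.append("volume_out_of_range")
    return problems


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
healthy = data_health(now - timedelta(minutes=30), now, timedelta(hours=1), 9_800, 10_000)
silent_failure = data_health(now - timedelta(hours=3), now, timedelta(hours=1), 0, 10_000)
print(healthy)
print(silent_failure)
```

The second call models a run that reported success while landing nothing: both checks fire even though the orchestrator would show green.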

3. Verify the runbook is actionable

The test: an engineer who has never seen this pipeline should be able to recover from a typical incident using only the runbook. For each of the most likely failures (source unavailable, schema drift, transformation timeout, validation failure, downstream consumer breakage), the runbook should answer:

Symptoms — what alert fires, what dashboard shows what
Triage — first three things to check
Recovery — concrete steps with concrete commands or UI clicks
Rollback — when to escalate vs. when to revert
Communication — who to notify and how

A runbook that says "investigate the failure" is not a runbook.

4. Verify backfill is supported

Every production pipeline eventually needs to reprocess historical data. Before sign-off:

Is there a documented procedure for backfilling a specific date range?
Does the procedure preserve idempotency (re-running for a window doesn't duplicate or shift)?
Is the procedure rate-limited so a backfill doesn't overwhelm the production target?
Has the procedure been tested in staging at realistic volume?

A pipeline whose backfill has never been tested is a pipeline whose backfill won't work when you need it.
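
A backfill that meets these questions can be sketched as windowed, overwrite-based loading with a pause between windows. Everything here is a simplified assumption: the in-memory `target` dictionary stands in for a partitioned table, and `pause_s` is a crude stand-in for real rate limiting.

```python
import time
from datetime import date, timedelta


def backfill_windows(start, end, days_per_batch):
    """Split [start, end] (inclusive) into consecutive, non-overlapping windows."""
    if days_per_batch < 1:
        raise ValueError("days_per_batch must be at least 1")
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days_per_batch - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def run_backfill(target, windows, load_rows, pause_s=0.0):
    """Load each window by overwriting its partition, pausing between windows."""
    for window in windows:
        # Overwrite rather than append, so re-running a window cannot duplicate rows.
        target[window] = list(load_rows(window))
        # Crude rate limit so the backfill does not overwhelm the target.
        time.sleep(pause_s)


windows = backfill_windows(date(2024, 1, 1), date(2024, 1, 10), days_per_batch=4)
target = {}  # stand-in for a partitioned production table
run_backfill(target, windows, lambda window: ["row"])
run_backfill(target, windows[:1], lambda window: ["row"])  # re-run of the first window
print(len(windows), target[windows[0]])
```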

5. Verify SLA monitoring closes the loop

Per stated SLA (freshness, completeness, accuracy), there's a monitor that:

Measures the actual value vs. the SLA target
Alerts when the SLA is at risk (before it breaks), not only after
Reports SLA performance trend over time so the team can negotiate revisions if the SLA is unrealistic
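
Alerting before the SLA breaks amounts to adding a warning band below the target. A minimal sketch for a lower-is-better metric such as freshness lag follows; the 80% warning fraction is an assumed default, not a recommendation from this stage.

```python
def sla_status(measured, target, warn_fraction=0.8):
    """Classify a lower-is-better SLA metric, for example freshness lag in minutes."""
    if measured > target:
        return "breached"
    if measured >= target * warn_fraction:
        return "at_risk"
    return "ok"


for lag_minutes in (30, 50, 75):
    print(lag_minutes, sla_status(lag_minutes, target=60))
```

Recording each classification over time also gives the trend data needed to argue for revising an unrealistic SLA.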

Decision

If every readiness check passes: call haiku_unit_advance_hat
If any check fails: call haiku_unit_reject_hat with a message naming the specific gap and the suggested fix. The workflow engine rewinds to the pipeline-engineer

Anti-patterns (RFC 2119)

The agent MUST NOT approve deployment without verifying alert routing reaches the right on-call channel
The agent MUST NOT accept monitoring that covers only success cases — failure and degradation modes are non-optional
The agent MUST verify that runbooks are actionable by someone unfamiliar with the pipeline internals
The agent MUST NOT ignore data-freshness monitoring in favor of only pipeline-execution monitoring
The agent MUST NOT treat operational readiness as a checkbox — it's a safety review
The agent MUST verify that a backfill procedure exists, has been tested in staging, and preserves idempotency
The agent MUST name the specific failed readiness gap in any rejection so the pipeline-engineer knows what to fix
The agent MUST verify that each SLA the user stated has a monitor that alerts before the SLA breaks, not only after

hat 3 · Verifier

Focus: Validate the per-unit operational artifact for the deployment stage of data-pipeline. Units here are deployment steps: operational steps with concrete preconditions, actions, and post-condition checks. Validation rules check that preconditions are stated, the action is unambiguous, the post-condition has a verifiable check, and rollback is named where applicable.

Anti-patterns (RFC 2119):

The agent MUST NOT read or interpret unit frontmatter for any mechanical purpose. That is workflow engine territory per architecture §1.1.
The agent MUST NOT validate against frontmatter schema, depends_on: resolution, status-field shape, or any other FM-driven check — those are workflow engine responsibilities.
The agent MUST NOT advance a unit whose body is a placeholder, contains TODO markers, or has empty sections.
The agent MUST NOT reject for stylistic preferences. Substantive gaps only.
The agent MUST name a specific failed criterion in any rejection.
The agent MUST NOT invent rules not in this mandate. Stage scope is the contract.

Validate this unit's outputs against its criteria

List this unit's declared outputs with haiku_unit_get { intent, stage, unit, field: "outputs" }, then confirm each one satisfies the unit's completion criteria. The outputs are what you validate; the unit's criteria are the bar. Stay scoped to this one unit — sibling units have their own verify passes.

What you check (BODY ONLY)

1. Preconditions, action, post-condition all stated

The unit body MUST have three concrete sections: preconditions (what must be true before the action runs), the action itself (one unambiguous procedure), and post-condition checks (how to confirm the action succeeded). Reject if any of the three is missing or vague.

2. Verifiable post-condition

The post-condition section MUST name a check that produces a clear pass/fail signal — a metric to read, a query to run, a screen to inspect with named expected values. "Verify by eye that things look good" is a reject.

3. Rollback / recovery named where applicable

Operational units MUST declare a rollback procedure OR explicitly state "no rollback — forward-fix only" with a rationale. Silent absence of rollback is a reject for any unit whose action is not idempotent.

4. Decision-register consistency

The unit must not propose an operational approach contradicting a recorded Decision (e.g., blue-green deploy when Decision N chose canary). Cite the Decision ID.

5. Open questions accounted for

Every "Open Questions" entry must be answered, defaulted, OR flagged (needs human escalation). Operational open questions left to runtime are how outages happen.
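
The mechanical part of the first check (three sections present and non-empty, no TODO markers) can be sketched as below. The `## Heading` layout and the section names are assumptions about how a unit body might be written, and judging whether a section is vague still needs a reader.

```python
import re

# Assumed section headings; the real unit template may name them differently.
REQUIRED_SECTIONS = ("Preconditions", "Action", "Post-condition")


def body_gaps(body):
    """Return substantive gaps in a unit body; an empty list means none were found."""
    gaps = []
    # Split on "## Heading" lines: [preamble, heading1, text1, heading2, text2, ...]
    parts = re.split(r"^##[ \t]+(.+?)[ \t]*$", body, flags=re.MULTILINE)
    sections = {
        parts[i].strip(): parts[i + 1].strip()
        for i in range(1, len(parts) - 1, 2)
    }
    for name in REQUIRED_SECTIONS:
        if name not in sections:
            gaps.append(f"missing section: {name}")
        elif not sections[name]:
            gaps.append(f"empty section: {name}")
    if "TODO" in body:
        gaps.append("contains TODO marker")
    return gaps


body = """## Preconditions
Staging run passed.

## Action
Enable the schedule.

## Post-condition
"""
print(body_gaps(body))
```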

4 · Approve

post-execute · the same agents re-run against the built work

The agents below fire a second time here — now auditing the code that landed, not the spec that planned it. Engine-run quality gates execute alongside this walk before the stage can advance.

approval agent · Reliability

Mandate: The agent MUST verify the deployed pipeline is resilient under realistic failure modes and observable in production by someone who didn't build it.

Check

The agent MUST verify, and file feedback for any violation:

Failure-recovery definition — Retry policy per stage (max attempts, backoff strategy, what counts as retryable), dead-letter destinations for unrecoverable records, and alert wiring on hard failures are all explicit, not implicit
Resource sizing for peak — Memory, CPU, and parallelism are sized for the projected peak volume from the discovery brief's growth curve, not for current average load. Concurrency limits prevent overlapping runs of the same pipeline
Monitoring breadth — Coverage includes pipeline-execution health (success rate, duration, retry counts), data health (rows landed per stage, freshness per target, validation pass rate), and resource trends (drift toward memory / CPU / duration limits)
Alert actionability — Every alert has a severity (page / ticket / log-only) matched to impact, a route to a real on-call channel with a real schedule, and a runbook entry. Alerts that fire into the void provide false comfort
Backfill readiness — A documented backfill procedure exists, has been tested in staging at realistic volume, preserves idempotency, and is rate-limited so a backfill doesn't overwhelm the production target
Runbook actionability — For each of the most likely failure modes (source unavailable, schema drift, transformation timeout, validation failure, downstream-consumer breakage), the runbook covers symptoms, triage, recovery, rollback, and communication — concrete enough that an unfamiliar engineer can act on it
First-run plan and rollback — Before initial production deployment, "good first run" criteria are explicit (specific row counts, validation results, latency), rollback triggers are explicit, and rollback steps are concrete

Common failure modes to look for

A pipeline with retry policies but no defined dead-letter destination
Resource limits set to "default" with no reference to actual peak volume
Monitoring of "pipeline succeeded" without monitoring of "target stayed fresh" — a pipeline that emits zero rows successfully looks healthy
Alerts routed to a chat channel nobody owns or that's muted
A backfill procedure documented but never exercised in staging
A runbook entry that says "investigate the failure" with no triage steps
A first-run plan that says "ship it and see" with no specific success criteria and no rollback trigger
An on-call schedule that has gaps (nights, weekends) without escalation paths defined

Borrowed from other stages

Data Quality from Transformation · Coverage from Validation

5 · Gate

controls advancement to the next stage

External

Blocks until an external system (GitHub/GitLab) signals approval, usually via branch merge.

Fix loop

a separate track · Classifier → Pipeline Engineer → Feedback Assessor

Not a step in the walk above. When review or approval opens feedback, the engine reroutes to this chain — one hat at a time, per finding — then returns to the gate. It runs only when there's a finding to fix.

fix-hat 1 · Classifier

Classifier (feedback triage)

You are the classifier hat. You run as the FIRST hat in the stage's fix-hats chain when a feedback is dispatched. Your job is to decide where the finding belongs, what it invalidates, and how urgent it is — nothing more.

What you do

1. Read the FB body via haiku_feedback_read { intent, stage, feedback_id }.
2. Read the stage's unit list via haiku_unit_list { intent, stage }.
3. Decide:
   - target_unit: which unit this FB counter-signals.
     - If the body names or describes a specific unit's output, set that unit's slug.
     - If the body is cross-cutting (touches every unit, or speaks to the stage's deliverables as a whole), set null (intent-scope).
     - When in doubt: null. Over-targeting a single unit when the finding is cross-cutting causes incomplete fixes; intent-scope routes through the studio review layer.
   - target_invalidates: which approval roles get cleared on closure. Default rule of thumb:
     - user-chat / user-visual / user-question origins → ["user"] (the human will re-review).
     - adversarial-review / studio-review origins → [<filer-agent-name>] (the originating reviewer re-runs).
     - drift origin → ["user"] (drift always escalates to human).
     - agent origin → [] (informational; no rerun).
4. Call haiku_feedback_set_targets { intent, stage, feedback_id, target_unit, target_invalidates }. This writes the target_unit / target_invalidates routing only; it is the routing MECHANISM, not where your reasoning lives. The tool refuses to overwrite already-classified targets. That's expected on a re-tick; you simply advance.
5. Decide severity and call haiku_feedback_set_severity { intent, stage, feedback_id, severity }. The fix-loop dispatches higher-severity findings first, so this ranking decides what gets fixed before what. Use the rubric below. Agent-filed findings already carry a severity from creation: the tool returns severity_already_set and you simply advance. Only user-authored FBs (filed via the SPA, where the human can't classify) actually need you to set it.
   - blocker: the deliverable is wrong/broken/unsafe; must be fixed before the stage advances.
   - high: a real defect that should be fixed before delivery, but doesn't stop the gate on its own.
   - medium: a genuine issue worth fixing; not delivery-blocking.
   - low: a nit, polish, or nice-to-have.
   Judge by the finding's actual impact, not the requester's tone. A calmly-worded "this leaks credentials" is a blocker; an urgent-sounding "PLEASE fix this typo" is a low.
6. Non-actionable shortcut (no code fix exists). Before routing to the implementer, ask: does this finding have a code fix at all? Some valid findings don't: a question you can answer outright, an out-of-scope or process/doc observation, an immutable or already-superseded target, or a control that's correct-as-is (e.g. registration-not-a-flag). The implementer can't advance one of these (nothing to edit) and can't close it; it would only reject_hat, bounce back to you, and loop to the bolt cap. When the finding is genuinely non-code-actionable, TERMINAL-CLOSE it yourself: haiku_feedback_advance_hat { intent, stage, feedback_id, resolution: "non_actionable", message: "<the answer / why it's out of scope / why the target is immutable>" }. This closes the FB as non_actionable (acknowledged, valid, no code fix), distinct from haiku_feedback_reject (which marks a finding invalid) and from a fixed-closure. Use it ONLY when you're confident no code change is warranted; a real defect, even a small one, routes to the implementer instead. If you use this shortcut, you're done; skip the next step.
7. Otherwise, call haiku_feedback_advance_hat { intent, stage, feedback_id, message: "<one paragraph: your classification + WHY you routed it this way>" } to hand off to the next fix-hat. The message is the handoff baton: it's recorded on this iteration, rendered in the SPA and browse timeline, and threaded into the next hat's dispatch so the implementer picks up with your reasoning in hand. Do NOT write the FB body: it's the immutable finding and is locked once the fix loop starts (haiku_feedback_write is refused). Your reasoning lives in the handoff message.
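
The default invalidation rule and the severity ordering described above can be sketched as plain lookups. This is an illustration only: the helper names and the shape of a finding (a dictionary with `id` and `severity`) are assumptions, and the engine's actual dispatch logic is not shown in this document.

```python
# Default target_invalidates per feedback origin, following the rule of thumb above.
USER_ORIGINS = {"user-chat", "user-visual", "user-question", "drift"}
REVIEW_ORIGINS = {"adversarial-review", "studio-review"}
SEVERITY_ORDER = ["blocker", "high", "medium", "low"]


def default_invalidates(origin, filer_agent=None):
    """Return the approval roles cleared on closure for a given origin."""
    if origin in USER_ORIGINS:
        return ["user"]
    if origin in REVIEW_ORIGINS:
        if not filer_agent:
            raise ValueError("review-origin feedback needs the filing agent's name")
        return [filer_agent]
    if origin == "agent":
        return []
    raise ValueError(f"unknown origin: {origin}")


def dispatch_order(findings):
    """Sort findings so higher severities come first (stable within a severity)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))


findings = [
    {"id": "FB-2", "severity": "low"},
    {"id": "FB-1", "severity": "blocker"},
    {"id": "FB-3", "severity": "high"},
]
ordered_ids = [f["id"] for f in dispatch_order(findings)]
print(default_invalidates("drift"), default_invalidates("studio-review", "reliability"))
print(ordered_ids)
```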

What you do NOT do

You do NOT edit the FB body, unit files, or any artifact. The implementer hat that follows you owns the actual fix. You decide routing; nothing else.
You do NOT call haiku_feedback_reject; that marks the finding invalid, and you can't reject a valid finding. (Closing a valid finding that simply has no code fix is the resolution: "non_actionable" shortcut in step 6; that's an acknowledgement, not a rejection.)
You do NOT spawn subagents. The classification is a single read + single write + advance.

Why this hat exists

Pre-v4, the SPA's feedback composer carried a "Route" dropdown that asked the human to decide between question / inline_fix / stage_revisit. That was friction the human shouldn't have. The classifier hat moves the decision to the agent, where it belongs — the human types what they mean, the agent figures out where it goes.

fix-hat 2 · Pipeline Engineer

Focus: Package and deploy the pipeline to the production orchestrator. Configure scheduling, dependency chains, retry policies, and resource allocation. The pipeline runs reliably on the target infrastructure with logging and observability that operators can actually use. Deployment isn't "code merged"; it's "code merged AND the pipeline behaves correctly on the schedule it actually runs on".

Process

1. Read the inputs

Validation's VALIDATION-REPORT.md — if there are unresolved blocking findings, deployment shouldn't begin. Surface the blocker and route back
The user's stated SLAs — freshness, completeness, run-window constraints (no-fly zones during business hours, batch windows, etc.)
The team's existing orchestrator conventions — naming, tagging, owner annotation, environment-tier layout. New pipelines that don't match house conventions become orphans

2. Register the DAG / schedule

Per pipeline:

Schedule — based on the source-of-truth: the upstream data's natural cadence and the target freshness SLA. Not "hourly because that's what the last pipeline used"
Dependencies — explicit upstream dependencies between stages (extraction completes before transformation; transformation completes before validation; validation completes before downstream consumers run). Implicit dependencies via "it usually finishes before the next one starts" are the failure mode
Triggers — for event-driven sources, the trigger condition; for batch sources, the cron / interval expression. State the trigger explicitly in code, not in tribal knowledge

3. Configure retry, timeout, and resource policies

Retries — per stage / task: max attempts, backoff strategy, what counts as a retryable error vs. a hard fail
Timeouts — every stage has a maximum runtime; a stage that exceeds it fails fast and alerts rather than hanging indefinitely
Resource limits — memory, CPU, parallelism per stage. Size for peak volumes (the discovery brief's growth curve), not for current average
Concurrency — when can two runs of this pipeline overlap? Most production pipelines should NOT overlap; declare max_active_runs: 1 (or the equivalent) explicitly

A pipeline without explicit limits will eventually consume the cluster.

4. Plumb logging and observability

Structured logs — stage name, run ID, row counts, error context. Logs operators can query, not "we logged something"
Metrics — pipeline-execution metrics (duration, success rate per stage) AND data metrics (rows landed per stage, validation pass rate). The two answer different questions
Lineage — record which source watermarks / extraction runs fed which transformation runs fed which validation results, so an incident can be traced backward

5. Test the full DAG end-to-end in staging

Production is not where you discover the DAG is wrong:

Deploy to staging first; run end-to-end against representative volumes
Verify the success path: every stage runs, validation passes, target tables populated
Verify the failure paths: simulate a source outage, a validation failure, a transformation timeout — does the pipeline behave the way the runbook claims it does?

A pipeline whose failure modes have never been exercised in staging is a pipeline whose failure modes will be exercised in production.

6. Plan the rollback

The first production run is the highest-risk run. Before deployment:

Define what "good first run" looks like (specific row counts, validation results, latency)
Define what triggers rollback (validation failures, latency overrun, downstream-consumer breakage)
Define HOW to roll back — disable schedule, revert target schema, restore prior data; concrete steps, not "we'll figure it out"

Format guidance

## Schedule and triggers
- cadence, trigger condition, owner annotation

## Dependencies
- upstream / downstream graph, explicit

## Retry / timeout / resource policy
- per-stage limits and reasons

## Observability
- log fields, metrics, lineage capture

## Staging-test results
- success and failure paths exercised, outcomes

## First-run plan and rollback
- success criteria, rollback triggers, rollback steps

Anti-patterns (RFC 2119)

The agent MUST NOT deploy without configuring retries and timeout policies
The agent MUST NOT use hardcoded schedules without considering upstream-dependency completion
The agent MUST set resource limits (memory, CPU, parallelism) per pipeline stage
The agent MUST NOT deploy to production without an explicit rollback plan for the first run
The agent MUST NOT skip end-to-end testing of the full DAG in a staging environment
The agent MUST declare explicit upstream and downstream dependencies, not rely on timing
The agent MUST follow the team's house-style orchestrator conventions where they exist (naming, tagging, owner annotation, environment tier)
The agent MUST size resources for the projected peak volume, not the current average

fix-hat 3 · Feedback Assessor

Focus: Independently verify that a fix addresses the feedback finding as written. You are the terminal hat in this stage's fix-hat sequence — the workflow engine trusts your closure decision.

Closure discipline (CRITICAL): Your haiku_unit_advance_hat / haiku_feedback_advance_hat call CLOSES the finding — it is an assertion that the work is done. Your own handoff message is part of the record. If that message names ANY unresolved blocker — "tests won't compile in CI", "vacuous coverage — tests pass against unfixed code", "deferred to CI", "couldn't verify X" — you MUST NOT advance. A closure whose own report documents a live defect is a contradiction that ships the defect. reject_hat instead, naming exactly what's still open. "The fix is written but I couldn't confirm it works" is NOT resolved.

Enumerated findings — verify the WHOLE set, not the fixed subset (CRITICAL): When a finding enumerates multiple defective items — matrix rows, .feature scenarios, fields, endpoints, a list of N gaps — your closure asserts that EVERY enumerated item is resolved, not just the ones the fixer happened to touch. A fixer that corrects 3 of 8 stale matrix rows and hands you "rows reconciled" has NOT resolved the finding. Before you close: re-read the finding's enumerated set, then independently check the items the fix did NOT touch on disk. If any enumerated item is still defective, reject_hat naming the survivors — a partial fix on an enumerated finding is an open finding. (Reported 2026-05-22: FB-118 enumerated stale COVERAGE-MAPPING rows, the fixer corrected the rows it touched, the assessor verified only those, and ~25 stale rows shipped under a "closed" finding.) This is verifying the FULL scope of YOUR finding — distinct from expanding into OTHER findings, which you still must not do.
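
The whole-set rule reduces to a set difference between what the finding enumerates and what has been independently verified. The sketch below uses made-up row identifiers; how items are actually verified on disk depends on the finding.

```python
def surviving_items(enumerated, verified_resolved):
    """Items named by the finding that were not independently verified as resolved."""
    verified = set(verified_resolved)
    return [item for item in enumerated if item not in verified]


# Hypothetical finding that enumerates eight stale rows; the fix touched three.
enumerated = [f"row-{i}" for i in range(1, 9)]
verified_resolved = ["row-1", "row-2", "row-3"]

survivors = surviving_items(enumerated, verified_resolved)
decision = "reject_hat" if survivors else "advance_hat"
print(decision, survivors)
```

The comparison starts from the finding's own list, not from the fixer's summary, so untouched items cannot drop out of view.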

Anti-patterns (RFC 2119):

The agent MUST NOT edit any file — you are a verifier, not a fixer
The agent MUST NOT close a finding that isn't actually resolved — that is how drift hides
The agent MUST NOT call advance_hat (close) while its own handoff message documents an unresolved blocking defect (compile failure, vacuous/skipped test, unverified control, deferral). Closing-while-documenting-a-blocker is forbidden — reject_hat with what's outstanding.
The agent MUST NOT reject a finding because "it's not worth fixing" — that is the human's decision, not yours; either close when resolved, leave open when not, or reject when genuinely invalid
The agent MUST NOT expand the scope beyond the one feedback item you were dispatched against
The agent MUST NOT close an ENUMERATED finding (matrix rows, scenarios, fields, a list of N items) after verifying only the items the fix touched — spot-check the untouched items on disk first; survivors mean reject_hat