Extraction

Design and implement data extraction from sources.

Hats: 2
Review Agents: 1
Review: Ask
Unit Types: Extraction
Inputs: Discovery
Dependencies: Discovery (source-catalog)

Hat Sequence

1. Connector Reviewer

Focus: Review extraction implementations for reliability, idempotency, and operational safety. Verify that connectors handle schema drift, network failures, and partial extractions without data loss or duplication.

Produces: Review findings for each extraction job covering idempotency, error handling, schema drift resilience, and operational readiness.

Reads: Extractor's implementation, source catalog from discovery.

Anti-patterns (RFC 2119):

  • The agent MUST NOT approve extraction logic without verifying idempotency (re-run safety)
  • The agent MUST test what happens when a source schema changes mid-extraction
  • The agent MUST NOT ignore partial failure scenarios (e.g., network timeout after 80% of records)
  • The agent MUST NOT treat retry logic as optional for "reliable" sources
  • The agent MUST verify that extraction metadata is sufficient for debugging production issues
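
The re-run safety requirement above can be sketched as a minimal review check. This is an illustrative harness, not part of the spec: `extract_batch` and the dict-based staging area are hypothetical stand-ins for the real connector and staging layer, and the key idea is that writes are keyed (upserted) rather than appended.

```python
# Sketch: re-run safety (idempotency) check. `extract_batch` and the
# dict staging area are hypothetical stand-ins for the real connector.

def extract_batch(records, staging):
    """Idempotent load: upsert by primary key instead of blind append."""
    for rec in records:
        staging[rec["id"]] = rec  # keyed write: a re-run overwrites, never duplicates
    return staging

source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
staging = {}
extract_batch(source, staging)
extract_batch(source, staging)  # simulate a retry of the same batch
assert len(staging) == 2  # re-run produced no duplicate records
```

A reviewer applying the first anti-pattern would expect some equivalent of this double-run test before approving any extraction job.
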

2. Extractor

Focus: Implement extraction logic that reliably moves data from sources to the staging area. Handle incremental loads, rate limiting, error recovery, and extraction metadata tracking. Prioritize correctness and idempotency over speed.

Produces: Extraction jobs for each source with full-load and incremental-load paths, error handling, retry logic, and extraction metadata (batch ID, timestamp, source identifier).

Reads: Source catalog and schema analysis from discovery, source system API documentation.

Anti-patterns (RFC 2119):

  • The agent MUST NOT build only full-load extraction when incremental is feasible
  • The agent MUST NOT ignore source system rate limits or connection pool constraints
  • The agent MUST NOT silently drop records on extraction errors instead of dead-lettering
  • The agent MUST track extraction metadata (when, what, how much) for auditability
  • The agent MUST NOT hardcode connection strings or credentials instead of using config/secrets management
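
An incremental-load path with the required extraction metadata can be sketched as follows. This is an assumption-laden illustration: the watermark column (`updated_at`) stands in for whatever discovery identified, and the source identifier is a made-up example.

```python
import uuid
from datetime import datetime, timezone

# Sketch of one incremental extraction step, assuming a watermark column
# ("updated_at") identified during discovery. All names are illustrative.

def extract_incremental(rows, watermark):
    """Pull only rows newer than the last watermark; emit batch metadata."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    metadata = {
        "batch_id": str(uuid.uuid4()),                       # what
        "extracted_at": datetime.now(timezone.utc).isoformat(),  # when
        "source": "crm.contacts",                            # illustrative source id
        "record_count": len(new_rows),                       # how much
        "watermark": new_watermark,  # persisted so the next run resumes here
    }
    return new_rows, metadata

rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
batch, meta = extract_incremental(rows, watermark=10)
assert [r["id"] for r in batch] == [2]   # only rows past the watermark
assert meta["watermark"] == 20
```

Persisting the returned watermark is what makes the job resumable; losing it silently degrades the job to a full load.
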

Review Agents

Correctness

Mandate: The agent MUST verify extraction logic faithfully captures source data without loss or corruption.

Check:

  • The agent MUST verify that all fields from the source schema are accounted for (extracted or explicitly excluded with justification)
  • The agent MUST verify that incremental extraction handles late-arriving data and schema evolution
  • The agent MUST verify that error handling covers connection failures, timeouts, and malformed records
  • The agent MUST verify that extraction does not impose excessive load on source systems
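
The first check (every source field extracted or explicitly excluded with justification) lends itself to a mechanical sweep. A minimal sketch, with hypothetical field names and a PII exclusion as the example justification:

```python
# Sketch: field-coverage check against the discovery catalog.
# Field names and the exclusion reason are illustrative.

def check_field_coverage(source_fields, extracted, exclusions):
    """Return source fields that are neither extracted nor justified as excluded."""
    covered = set(extracted) | set(exclusions)
    return sorted(set(source_fields) - covered)

source_fields = ["id", "email", "ssn", "updated_at"]
extracted = ["id", "email", "updated_at"]
exclusions = {"ssn": "PII; excluded per data-governance policy"}

missing = check_field_coverage(source_fields, extracted, exclusions)
assert missing == []  # every field is accounted for
```

Any non-empty result is a review finding: a field that is silently dropped rather than deliberately excluded.
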

Extraction

Criteria Guidance

Good criteria examples:

  • "Extraction logic handles incremental loads using watermark columns identified in discovery"
  • "Connector includes retry logic with exponential backoff and dead-letter handling for failed records"
  • "Schema drift detection raises alerts rather than silently dropping or truncating columns"
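
The second good criterion (retry with exponential backoff plus dead-lettering) can be sketched concretely. This is one possible shape, not a prescribed implementation; `write`, the delay values, and the record shapes are all illustrative.

```python
import time

# Sketch: per-record retry with exponential backoff and a dead-letter
# list, matching the "good criteria" above. All names are illustrative.

def load_with_retry(records, write, max_attempts=3, base_delay=0.01):
    """Retry each record; failed records go to dead-letter, never dropped."""
    dead_letter = []
    for rec in records:
        for attempt in range(max_attempts):
            try:
                write(rec)
                break
            except IOError:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        else:  # all attempts failed
            dead_letter.append(rec)  # preserved for replay, never silently dropped
    return dead_letter

attempts = {}
def flaky_write(rec):  # hypothetical sink: one record always fails
    attempts[rec["id"]] = attempts.get(rec["id"], 0) + 1
    if rec["id"] == "bad":
        raise IOError("connection reset")

dlq = load_with_retry([{"id": "ok"}, {"id": "bad"}], flaky_write)
assert [r["id"] for r in dlq] == ["bad"]
assert attempts["bad"] == 3  # retried before dead-lettering
```

A criterion phrased this concretely is checkable: the reviewer can point at the backoff loop and the dead-letter path, which "extraction works" never allows.
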

Bad criteria examples:

  • "Extraction works"
  • "Data is pulled from sources"
  • "Connectors are configured"

Completion Signal (RFC 2119)

Extraction jobs exist for all sources identified in discovery. Each job handles full and incremental loads, includes error handling and retry logic, respects source system rate limits, and lands raw data in the staging area with extraction metadata (timestamp, source, batch ID). The connector reviewer MUST have verified idempotency and schema drift handling.