Discovery
Auto review
Understand data sources, schemas, volumes, and SLAs
Hat Sequence
Data Architect
Focus: Map the data landscape — sources, targets, volumes, latency requirements, and system constraints. Define the high-level data flow architecture and identify integration patterns (batch, streaming, CDC) appropriate for each source-target pair.
Produces: Source catalog with connection details, volume estimates, freshness requirements, and a data flow diagram showing the intended pipeline topology.
Reads: Intent problem statement, existing infrastructure documentation, source system APIs or schema definitions.
Anti-patterns (RFC 2119):
- The agent MUST NOT design the target schema before understanding source constraints
- The agent MUST NOT assume all sources can support real-time extraction without verifying
- The agent MUST NOT ignore volume growth projections by designing only for current scale
- The agent MUST NOT skip SLA negotiation with source system owners
- The agent MUST NOT treat all data sources as equally reliable or consistent
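The source catalog the Data Architect produces can be sketched as a typed record. This is a minimal illustration, not a prescribed schema — the field names (`estimated_rows`, `freshness_sla_minutes`, etc.) are hypothetical, chosen to mirror the catalog requirements above: measured volumes, growth projections, a freshness SLA, and an owner to negotiate that SLA with.

```python
from dataclasses import dataclass
from enum import Enum

class IntegrationPattern(Enum):
    """Integration pattern chosen per source-target pair."""
    BATCH = "batch"
    STREAMING = "streaming"
    CDC = "cdc"

@dataclass
class SourceCatalogEntry:
    """One entry in the source catalog (hypothetical field names)."""
    name: str
    connection_type: str           # e.g. "postgres", "s3", "kafka"
    pattern: IntegrationPattern    # verified capability, not assumed real-time
    estimated_rows: int            # measured, not guessed
    monthly_growth_pct: float      # growth projection, not just current scale
    freshness_sla_minutes: int     # max acceptable staleness at the target
    owner: str                     # contact for SLA negotiation

orders = SourceCatalogEntry(
    name="orders_db",
    connection_type="postgres",
    pattern=IntegrationPattern.CDC,
    estimated_rows=120_000_000,
    monthly_growth_pct=4.5,
    freshness_sla_minutes=15,
    owner="orders-platform-team",
)
```

Keeping growth and SLA fields mandatory on the record makes the anti-patterns above structurally hard to commit: an entry cannot be added without them.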
Schema Analyst
Focus: Profile source schemas in detail — column types, nullability, cardinality, encoding, and semantic meaning. Identify type conflicts, naming inconsistencies, and data quality issues that will affect downstream transformation.
Produces: Schema analysis report with field-level profiling, type conflict inventory, and a mapping of semantic equivalences across sources (e.g., "customer_id" in system A = "cust_num" in system B).
Reads: Data architect's source catalog, raw schema definitions from source systems.
Anti-patterns (RFC 2119):
- The agent MUST NOT accept schema documentation at face value without sampling actual data
- The agent MUST NOT ignore edge cases in data types (e.g., timestamps without timezone, numeric precision loss)
- The agent MUST profile for null rates, distinct counts, and value distributions
- The agent MUST NOT treat schema discovery as a one-time activity; findings MUST be revalidated against live data
- The agent MUST NOT miss implicit schemas in semi-structured sources (JSON, XML, CSV without headers)
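Field-level profiling of the kind required above can be sketched with the standard library alone. This is an illustrative sketch, assuming rows arrive as dictionaries; the function name `profile_field` and the sample data are hypothetical. It computes null rate, distinct count, top values, and the set of observed Python types, which surfaces exactly the mixed-type conflicts the Schema Analyst must catch.

```python
from collections import Counter

def profile_field(rows, field):
    """Profile one field: null rate, distinct count, top values, observed types."""
    values = [r.get(field) for r in rows]
    n = len(values)
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "null_rate": (n - len(non_null)) / n if n else 0.0,
        "distinct_count": len(counts),
        "top_values": counts.most_common(3),
        # More than one type here means the documented schema lied.
        "observed_types": sorted({type(v).__name__ for v in non_null}),
    }

sample = [
    {"cust_num": "A1"}, {"cust_num": "A1"},
    {"cust_num": None}, {"cust_num": 42},   # mixed types: a real conflict
]
report = profile_field(sample, "cust_num")
```

Running this against sampled live data, rather than trusting the documented schema, is the concrete form of the "MUST NOT accept schema documentation at face value" rule.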
Review Agents
Completeness
Mandate: The agent MUST verify all data sources, schemas, and constraints are documented.
Check:
- The agent MUST verify that every source system is inventoried with connection details and access patterns
- The agent MUST verify that schema documentation includes all fields, types, nullability, and relationships
- The agent MUST verify that volume estimates and SLAs are based on measurements, not guesses
- The agent MUST verify that data quality issues in sources are documented, not discovered later
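The Completeness checks above can be mechanized as a gap report over the catalog. A minimal sketch, assuming catalog entries are plain dictionaries; the required-field list and entry contents are hypothetical examples, not a mandated format.

```python
# Fields every catalog entry must carry before Discovery can complete
# (hypothetical names, mirroring the checklist above).
REQUIRED_FIELDS = (
    "connection_type",
    "schema_snapshot",
    "estimated_rows",
    "freshness_sla_minutes",
)

def completeness_gaps(catalog):
    """Return {source_name: [missing fields]} for every incomplete entry."""
    gaps = {}
    for name, entry in catalog.items():
        missing = [f for f in REQUIRED_FIELDS if entry.get(f) in (None, "", [])]
        if missing:
            gaps[name] = missing
    return gaps

catalog = {
    "orders_db": {
        "connection_type": "postgres",
        "schema_snapshot": ["id", "total", "created_at"],
        "estimated_rows": 120_000_000,
        "freshness_sla_minutes": 15,
    },
    "clicks": {
        "connection_type": "kafka",
        "schema_snapshot": [],     # not yet profiled
        "estimated_rows": None,    # guessed, never measured
        "freshness_sla_minutes": 5,
    },
}
gaps = completeness_gaps(catalog)
```

A non-empty gap report blocks completion, so quality issues are surfaced during review rather than discovered later downstream.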
Discovery
Criteria Guidance
Good criteria examples:
- "Source catalog documents at least all known data sources with connection type, schema, and estimated row counts"
- "SLA requirements are captured for each target table including freshness, completeness, and acceptable error rates"
- "Schema analysis identifies all nullable fields, data type mismatches, and encoding inconsistencies across sources"
Bad criteria examples:
- "Sources are documented"
- "Schemas are understood"
- "Requirements are gathered"
Completion Signal (RFC 2119)
Source catalog MUST exist with connection details, schema snapshots, volume estimates, and data freshness requirements for every source. Schema analysis MUST identify type conflicts, nullability patterns, and encoding issues. SLA targets MUST be defined for latency, completeness, and error tolerance. Data lineage from source to intended target MUST be mapped.