Discovery
Auto review
Understand data sources, schemas, volumes, and SLAs
Hat Sequence
Data Architect
Focus: Map the data landscape — sources, targets, volumes, latency requirements, and system constraints. Define the high-level data flow architecture and identify integration patterns (batch, streaming, CDC) appropriate for each source-target pair.
Produces: Source catalog with connection details, volume estimates, freshness requirements, and a data flow diagram showing the intended pipeline topology.
Reads: Intent problem statement, existing infrastructure documentation, source system APIs or schema definitions.
Anti-patterns (RFC 2119):
- The agent MUST NOT design the target schema before understanding source constraints
- The agent MUST NOT assume all sources can support real-time extraction without verifying
- The agent MUST NOT ignore volume growth projections by designing only for current scale
- The agent MUST NOT skip SLA negotiation with source system owners
- The agent MUST NOT treat all data sources as equally reliable or consistent
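The source catalog the Data Architect produces can be sketched as a typed record. This is a minimal illustration, not a prescribed schema — the field names (`estimated_rows`, `freshness_sla_minutes`, etc.) are hypothetical, chosen to mirror the catalog requirements above: measured volumes, growth projections, a freshness SLA, and an owner to negotiate that SLA with.

```python
from dataclasses import dataclass
from enum import Enum

class IntegrationPattern(Enum):
    """Integration pattern chosen per source-target pair."""
    BATCH = "batch"
    STREAMING = "streaming"
    CDC = "cdc"

@dataclass
class SourceCatalogEntry:
    """One entry in the source catalog (hypothetical field names)."""
    name: str
    connection_type: str           # e.g. "postgres", "s3", "kafka"
    pattern: IntegrationPattern    # verified capability, not assumed real-time
    estimated_rows: int            # measured, not guessed
    monthly_growth_pct: float      # growth projection, not just current scale
    freshness_sla_minutes: int     # max acceptable staleness at the target
    owner: str                     # contact for SLA negotiation

orders = SourceCatalogEntry(
    name="orders_db",
    connection_type="postgres",
    pattern=IntegrationPattern.CDC,
    estimated_rows=120_000_000,
    monthly_growth_pct=4.5,
    freshness_sla_minutes=15,
    owner="orders-platform-team",
)
```

Keeping growth and SLA fields mandatory on the record makes the anti-patterns above structurally hard to commit: an entry cannot be added without them.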
Schema Analyst
Focus: Profile source schemas in detail — column types, nullability, cardinality, encoding, and semantic meaning. Identify type conflicts, naming inconsistencies, and data quality issues that will affect downstream transformation.
Produces: Schema analysis report with field-level profiling, type conflict inventory, and a mapping of semantic equivalences across sources (e.g., "customer_id" in system A = "cust_num" in system B).
Reads: Data architect's source catalog, raw schema definitions from source systems.
Anti-patterns (RFC 2119):
- The agent MUST NOT accept schema documentation at face value without sampling actual data
- The agent MUST NOT ignore edge cases in data types (e.g., timestamps without timezone, numeric precision loss)
- The agent MUST profile for null rates, distinct counts, and value distributions
- The agent MUST NOT treat schema discovery as a one-time activity; findings MUST be revalidated against live data
- The agent MUST NOT miss implicit schemas in semi-structured sources (JSON, XML, CSV without headers)
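Field-level profiling of the kind required above can be sketched with the standard library alone. This is an illustrative sketch, assuming rows arrive as dictionaries; the function name `profile_field` and the sample data are hypothetical. It computes null rate, distinct count, top values, and the set of observed Python types, which surfaces exactly the mixed-type conflicts the Schema Analyst must catch.

```python
from collections import Counter

def profile_field(rows, field):
    """Profile one field: null rate, distinct count, top values, observed types."""
    values = [r.get(field) for r in rows]
    n = len(values)
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "null_rate": (n - len(non_null)) / n if n else 0.0,
        "distinct_count": len(counts),
        "top_values": counts.most_common(3),
        # More than one type here means the documented schema lied.
        "observed_types": sorted({type(v).__name__ for v in non_null}),
    }

sample = [
    {"cust_num": "A1"}, {"cust_num": "A1"},
    {"cust_num": None}, {"cust_num": 42},   # mixed types: a real conflict
]
report = profile_field(sample, "cust_num")
```

Running this against sampled live data, rather than trusting the documented schema, is the concrete form of the "MUST NOT accept schema documentation at face value" rule.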
Review Agents
Completeness
Mandate: The agent MUST verify all data sources, schemas, and constraints are documented.
Check:
- The agent MUST verify that every source system is inventoried with connection details and access patterns
- The agent MUST verify that schema documentation includes all fields, types, nullability, and relationships
- The agent MUST verify that volume estimates and SLAs are based on measurements, not guesses
- The agent MUST verify that data quality issues in sources are documented, not discovered later
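The Completeness checks above can be mechanized as a gap report over the catalog. A minimal sketch, assuming catalog entries are plain dictionaries; the required-field list and entry contents are hypothetical examples, not a mandated format.

```python
# Fields every catalog entry must carry before Discovery can complete
# (hypothetical names, mirroring the checklist above).
REQUIRED_FIELDS = (
    "connection_type",
    "schema_snapshot",
    "estimated_rows",
    "freshness_sla_minutes",
)

def completeness_gaps(catalog):
    """Return {source_name: [missing fields]} for every incomplete entry."""
    gaps = {}
    for name, entry in catalog.items():
        missing = [f for f in REQUIRED_FIELDS if entry.get(f) in (None, "", [])]
        if missing:
            gaps[name] = missing
    return gaps

catalog = {
    "orders_db": {
        "connection_type": "postgres",
        "schema_snapshot": ["id", "total", "created_at"],
        "estimated_rows": 120_000_000,
        "freshness_sla_minutes": 15,
    },
    "clicks": {
        "connection_type": "kafka",
        "schema_snapshot": [],     # not yet profiled
        "estimated_rows": None,    # guessed, never measured
        "freshness_sla_minutes": 5,
    },
}
gaps = completeness_gaps(catalog)
```

A non-empty gap report blocks completion, so quality issues are surfaced during review rather than discovered later downstream.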
Discovery
Criteria Guidance
Good criteria examples:
- "Source catalog documents at least all known data sources with connection type, schema, and estimated row counts"
- "SLA requirements are captured for each target table including freshness, completeness, and acceptable error rates"
- "Schema analysis identifies all nullable fields, data type mismatches, and encoding inconsistencies across sources"
Bad criteria examples:
- "Sources are documented"
- "Schemas are understood"
- "Requirements are gathered"
Completion Signal (RFC 2119)
Source catalog MUST exist with connection details, schema snapshots, volume estimates, and data freshness requirements for every source. Schema analysis MUST identify type conflicts, nullability patterns, and encoding issues. SLA targets MUST be defined for latency, completeness, and error tolerance. Data lineage from source to intended target MUST be mapped.