
Data Intelligence Pipelines: Real Architecture

ETL vs ELT: Choosing the Right Approach

Extract-Transform-Load (ETL) transforms data before loading it to the destination. Extract-Load-Transform (ELT) loads raw data first and transforms it inside the destination system. The right choice depends on your destination's compute capacity and your transformation complexity.

ETL is appropriate when your destination lacks processing power (e.g., you're loading into a lean database or a flat file system), or when data privacy rules require masking PII before it ever reaches the warehouse. ETL pipelines are more complex to operate but give you tighter control.

ELT has become dominant with cloud data warehouses (BigQuery, Snowflake, Redshift) because these systems have abundant compute and can handle complex transformations in SQL at scale. Load raw data fast, then use dbt or native SQL to transform it. Iterating on transformations is faster and cheaper than rebuilding ETL pipelines.
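A toy ELT sketch makes the split concrete: raw rows land first, and the transformation runs as SQL inside the destination. SQLite stands in for a cloud warehouse here, and the table and column names are illustrative, not from any real schema.

```python
import sqlite3

# ELT: load raw rows first, then transform with SQL inside the
# destination (sqlite stands in for a cloud warehouse here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 300, "refunded"), (3, 9900, "paid")],
)

# The transform step is SQL in the destination, not code in the pipeline.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'paid'
""")
rows = conn.execute("SELECT id, amount_usd FROM orders_clean ORDER BY id").fetchall()
print(rows)  # [(1, 12.5), (3, 99.0)]
```

Because the transform is just SQL against tables that already exist, iterating means editing a query and re-running it, not redeploying an extraction job.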

Source Integration Patterns

API-based extraction is the cleanest pattern when available. Use cursor-based pagination (not offset-based — offsets break with data mutations during extraction). Implement incremental extraction via updated_at timestamps or change sequence numbers. Full extracts for large tables are expensive and unnecessary once you have incremental logic.
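The cursor-plus-watermark pattern above can be sketched as follows. The in-memory `RECORDS` list and the `fetch_page` shape are stand-ins for a real paginated API; the point is that the cursor is an id, not an offset, so concurrent inserts or deletes can't shift the window.

```python
from typing import Optional

# Hypothetical in-memory source standing in for a paginated API.
RECORDS = [{"id": i, "updated_at": i * 10} for i in range(1, 8)]

def fetch_page(cursor: Optional[int], since: int, limit: int = 3) -> dict:
    """Cursor-based page: records with id > cursor, updated after `since`."""
    after = cursor or 0
    matching = [r for r in RECORDS if r["id"] > after and r["updated_at"] > since]
    page = matching[:limit]
    # Cursor is the last id seen, so mutations can't shift the window
    # the way an offset would.
    next_cursor = page[-1]["id"] if len(matching) > limit else None
    return {"records": page, "next_cursor": next_cursor}

def extract_incremental(since: int) -> list:
    """Walk cursor pages from a watermark until the source is exhausted."""
    out, cursor = [], None
    while True:
        resp = fetch_page(cursor, since)
        out.extend(resp["records"])
        cursor = resp["next_cursor"]
        if cursor is None:
            return out

rows = extract_incremental(since=30)  # only records updated after watermark 30
print([r["id"] for r in rows])        # [4, 5, 6, 7]
```

After each successful run, persist the max `updated_at` you saw and use it as the next run's `since`; that single stored watermark is what makes full extracts unnecessary.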

Database replication via Change Data Capture (CDC) is the most powerful pattern for operational databases. Tools like Debezium capture row-level changes from the database transaction log (binlog for MySQL, WAL for Postgres) and stream them to Kafka or a cloud message bus. This gives you near-real-time data with minimal load on the source database, since you read the log rather than querying tables.
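Downstream, a CDC consumer replays these change events against its own copy of the table. The sketch below assumes a Debezium-style envelope (`op`, `before`, `after`); exact field names vary by connector version and configuration, so treat the event shape as an assumption.

```python
# Applying Debezium-style change events to a local keyed table.
# Event envelope (op / before / after) follows Debezium's convention;
# exact fields vary by connector, so this shape is an assumption.
def apply_change(table: dict, event: dict) -> None:
    op = event["op"]
    if op in ("c", "r", "u"):          # create / snapshot read / update
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                    # delete: only "before" is populated
        table.pop(event["before"]["id"], None)

table = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "ada"}},
    {"op": "u", "before": {"id": 1, "name": "ada"}, "after": {"id": 1, "name": "ada l."}},
    {"op": "d", "before": {"id": 1, "name": "ada l."}, "after": None},
]
for e in events:
    apply_change(table, e)
print(table)  # {} — the row was created, updated, then deleted
```

Note that applying a change event is idempotent per key, which matters when Kafka delivers a message more than once.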

File-based sources (CSV drops, SFTP, S3 buckets) require defensive parsing. Never trust the schema. Validate column counts, data types, and nullability on every file. A partner sending you a CSV with an extra column or changed encoding on a Sunday will break your pipeline if you haven't built resilience in.
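Defensive parsing can be as simple as a validator that runs before any row reaches staging. The expected header and check rules below are illustrative; the pattern is to collect violations per line rather than crash on the first bad row.

```python
import csv
import io

# Illustrative expected schema for a partner CSV feed.
EXPECTED = ["id", "email", "signup_date"]

def validate_csv(text: str) -> list:
    """Return a list of violation strings; empty means the file passed."""
    violations = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != EXPECTED:
        violations.append(f"header mismatch: {header}")
        return violations  # column positions are untrustworthy after a header change
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED):
            violations.append(f"line {lineno}: expected {len(EXPECTED)} cols, got {len(row)}")
            continue
        if not row[0].isdigit():
            violations.append(f"line {lineno}: id is not numeric: {row[0]!r}")
        if not row[1]:
            violations.append(f"line {lineno}: email is null")
    return violations

good = "id,email,signup_date\n1,a@x.com,2024-01-02\n"
bad = "id,email,signup_date\n1,a@x.com,2024-01-02,EXTRA\nx,,2024-01-03\n"
print(validate_csv(good))  # []
print(validate_csv(bad))   # three violations: extra column, bad id, null email
```

Files that fail validation go to a quarantine location with their violation report, so the partner's Sunday surprise becomes an alert instead of an outage.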

Transform & Validate

Data quality validation is not optional. At minimum, validate: row counts match expectations, required fields are non-null, date fields fall within plausible ranges, and foreign keys resolve. Build a data quality framework that runs as part of your pipeline and alerts on violations rather than silently passing bad data downstream.
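A minimal version of such a framework is just a function that runs each check and returns the names of the ones that failed; the check names and thresholds here are illustrative. In production you would wire the non-empty result into your alerting instead of printing it.

```python
from datetime import date

# Minimal data-quality runner: returns the names of failed checks.
# Check names and thresholds are illustrative.
def run_checks(rows: list, expected_min_rows: int) -> list:
    failures = []
    if len(rows) < expected_min_rows:
        failures.append("row_count_below_expectation")
    if any(r.get("customer_id") is None for r in rows):
        failures.append("required_field_null:customer_id")
    lo, hi = date(2000, 1, 1), date.today()
    if any(not (lo <= r["order_date"] <= hi) for r in rows):
        failures.append("order_date_out_of_range")
    return failures

rows = [
    {"customer_id": 7, "order_date": date(2024, 3, 1)},
    {"customer_id": None, "order_date": date(1999, 1, 1)},
]
failed = run_checks(rows, expected_min_rows=2)
print(failed)
```

The key design choice is that checks report rather than raise: the pipeline decides whether a given failure blocks the load or merely pages someone.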

Transformation logic should be version-controlled, tested, and idempotent. If you run a transformation twice on the same input, it should produce the same output. This property is essential for safe pipeline reruns after failures.
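Idempotency is easy to test for: run the transform on its own output and assert nothing changes. A keyed "keep the latest version per id" dedupe, sketched below with made-up rows, has this property by construction.

```python
# Idempotent transform: keep the latest version of each id.
# Rerunning it on its own output yields the same result.
def dedupe_latest(rows: list) -> list:
    latest = {}
    for r in rows:
        cur = latest.get(r["id"])
        if cur is None or r["updated_at"] >= cur["updated_at"]:
            latest[r["id"]] = r
    return sorted(latest.values(), key=lambda r: r["id"])

rows = [
    {"id": 1, "updated_at": 5, "v": "old"},
    {"id": 1, "updated_at": 9, "v": "new"},
    {"id": 2, "updated_at": 3, "v": "only"},
]
once = dedupe_latest(rows)
twice = dedupe_latest(once)     # rerun on its own output
assert once == twice            # safe to rerun after a failure
print([r["v"] for r in once])   # ['new', 'only']
```

Contrast this with an append-only transform: rerunning it after a partial failure would duplicate rows, which is exactly the failure mode idempotency rules out.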

Use a staging pattern: load raw data into a staging schema/table first, validate and transform it there, then upsert into the final production schema. This separates ingestion failures from transformation failures and makes debugging dramatically easier.
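The staging pattern end to end, again with sqlite standing in for the warehouse and illustrative table names: raw rows land in `staging_users`, invalid rows are filtered there, and the survivors are merged into `users` with an upsert.

```python
import sqlite3

# Staging pattern: raw rows land in staging, get validated there,
# then upsert into production. sqlite stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_users (id INTEGER, email TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users VALUES (1, 'old@x.com');
""")
conn.executemany("INSERT INTO staging_users VALUES (?, ?)",
                 [(1, "new@x.com"), (2, "b@x.com"), (3, None)])

# Validation happens in staging; invalid rows never reach production.
conn.execute("DELETE FROM staging_users WHERE email IS NULL")

# Upsert: insert new ids, overwrite existing ones.
# (sqlite needs `WHERE true` before ON CONFLICT in INSERT ... SELECT.)
conn.execute("""
    INSERT INTO users (id, email)
    SELECT id, email FROM staging_users WHERE true
    ON CONFLICT(id) DO UPDATE SET email = excluded.email
""")
print(conn.execute("SELECT * FROM users ORDER BY id").fetchall())
# [(1, 'new@x.com'), (2, 'b@x.com')]
```

If the load into staging fails, production is untouched; if the upsert fails, the raw rows are still sitting in staging to inspect and retry.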

Storage & Serving Layer

Your storage strategy should match your query patterns. Time-series data with rolling aggregations belongs in a columnar store optimized for range scans (TimescaleDB, ClickHouse, BigQuery). Entity-centric data (customer profiles, product catalogs) reads better from row-oriented stores with good indexing.

Implement a materialized view layer for your most common aggregations. Pre-computing daily/weekly/monthly rollups means your dashboards run in milliseconds instead of seconds, and your database isn't re-scanning billions of rows every time a user loads a report.
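The rollup itself is a trivial aggregation; what matters is that it is computed once per refresh instead of once per dashboard load. A daily rollup over hypothetical order events can be sketched in a few lines:

```python
from collections import defaultdict
from datetime import date

# Pre-computed daily rollup: dashboards read this instead of
# re-scanning raw events. Refresh on a schedule or after each load.
def build_daily_rollup(events: list) -> dict:
    rollup = defaultdict(lambda: {"orders": 0, "revenue_cents": 0})
    for e in events:
        day = rollup[e["day"]]
        day["orders"] += 1
        day["revenue_cents"] += e["amount_cents"]
    return dict(rollup)

events = [
    {"day": date(2024, 5, 1), "amount_cents": 500},
    {"day": date(2024, 5, 1), "amount_cents": 250},
    {"day": date(2024, 5, 2), "amount_cents": 100},
]
rollup = build_daily_rollup(events)
print(rollup[date(2024, 5, 1)])  # {'orders': 2, 'revenue_cents': 750}
```

In a warehouse this is a `GROUP BY day` materialized view; the Python version just shows the shape of what gets precomputed.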

Design your partition strategy before data grows. Partition large tables by date, tenant, or region based on your most common filter patterns. A query that prunes 99% of partitions is orders of magnitude faster than a full table scan, and this difference only grows as data volume increases.
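Pruning is just arithmetic over partition keys. Assuming monthly partitions keyed `YYYY-MM` (an illustrative scheme), a date-range filter maps to the handful of partitions the engine must actually scan:

```python
# Partition pruning sketch: with monthly partitions keyed "YYYY-MM",
# a date-range filter maps to a small set of partitions to scan.
def partitions_for_range(start: str, end: str) -> list:
    """start/end as 'YYYY-MM'; returns the months the query must touch."""
    sy, sm = map(int, start.split("-"))
    ey, em = map(int, end.split("-"))
    out = []
    y, m = sy, sm
    while (y, m) <= (ey, em):
        out.append(f"{y:04d}-{m:02d}")
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return out

# A query filtered to two months scans 2 partitions, not the whole table.
print(partitions_for_range("2023-12", "2024-01"))  # ['2023-12', '2024-01']
```

With five years of monthly partitions, that filter touches 2 of 60 partitions; the same query without a partition key scans all of them.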

Scheduling & Monitoring

Choose your orchestrator based on team size and complexity. For small teams, cron jobs with alerting are sufficient for simple pipelines. For complex multi-step pipelines with dependencies, Airflow (or its managed alternatives: MWAA, Composer, Astronomer) is the industry standard. For cloud-native teams, Prefect and Dagster offer a better developer experience.

Every pipeline run must produce observable metrics: rows extracted, rows loaded, rows failed, run duration, data freshness. Instrument your pipelines to emit these to a metrics store and build dashboards. Data team SLAs depend on knowing when something is wrong before your stakeholders do.
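Instrumentation can start as a thin wrapper that emits one metrics record per run. The `emit` function and metric names below are illustrative; in practice it would push to StatsD, Prometheus, or your warehouse's run-log table.

```python
import time

# Run-metrics sketch: wrap a pipeline step and emit one metrics record.
# `emit` is a stand-in for pushing to a real metrics store.
EMITTED = []

def emit(metrics: dict) -> None:
    EMITTED.append(metrics)

def run_step(name: str, rows: list) -> None:
    start = time.monotonic()
    loaded = failed = 0
    for row in rows:
        if row.get("id") is None:
            failed += 1          # a stand-in for any per-row load failure
        else:
            loaded += 1
    emit({
        "step": name,
        "rows_extracted": len(rows),
        "rows_loaded": loaded,
        "rows_failed": failed,
        "duration_s": round(time.monotonic() - start, 3),
    })

run_step("load_users", [{"id": 1}, {"id": None}, {"id": 3}])
print(EMITTED[0]["rows_loaded"], EMITTED[0]["rows_failed"])  # 2 1
```

Freshness is the one metric this sketch omits: emit the max source timestamp you loaded, and alert when `now - freshness` exceeds the SLA.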

Implement circuit breakers and back-pressure mechanisms. If a source API starts returning errors, don't hammer it with retries — back off exponentially and alert the team. If your pipeline is falling behind, shed load gracefully rather than cascading failures downstream.
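Both ideas fit in one small retry wrapper: delays grow exponentially with jitter, and after a failure budget is exhausted the circuit opens and the caller stops hammering the source. The function and parameter names are illustrative; `sleep` is injectable so the example runs instantly.

```python
import random

# Exponential backoff with jitter plus a simple circuit breaker:
# after `max_failures` consecutive errors, stop calling and alert.
def call_with_backoff(fn, max_failures=4, base_delay=0.5, sleep=lambda s: None):
    failures = 0
    while failures < max_failures:
        try:
            return fn()
        except Exception:
            failures += 1
            # Jittered exponential delay: 2x per failure, randomized
            # so many workers don't retry in lockstep.
            delay = base_delay * (2 ** failures) * random.uniform(0.5, 1.0)
            sleep(delay)  # real code would use time.sleep(delay)
    raise RuntimeError("circuit open: alert the team, stop hammering the source")

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("source API 503")
    return "ok"

result = call_with_backoff(flaky)
print(result)  # 'ok' on the third attempt
```

A production breaker would also stay open for a cooldown period before probing the source again; this sketch only covers the fail-fast half.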
