When I decided to build a real-time data lakehouse using TfL's Unified API, I wasn't trying to build something impressive for a portfolio. I wanted to answer a real question: can a single developer build a production-grade data platform — with orchestration, transformation, quality checks, and lineage — using only open-source tools?
The answer is yes. But it requires making deliberate choices at every layer of the stack, and understanding why each tool earns its place.
Why TfL data?
Transport for London publishes a rich, free Unified API covering real-time arrivals, line statuses, stop points, journey planning, and more. It's a genuinely interesting dataset — high-frequency, multi-modal, and full of the kind of messiness you find in real operational systems. Delayed arrivals, missing stop data, edge cases around service disruptions. Good data engineering practice requires handling all of it.
More importantly, it's a live API. That means the pipeline has to actually run, on a schedule, in the real world. No synthetic datasets. No clean CSVs. Just raw API responses arriving every few minutes and a pipeline that has to deal with whatever comes back.
The architecture
The stack I settled on has four clear layers, each handled by a dedicated tool:
- Orchestration: Apache Airflow — DAGs for scheduled ingestion, retry logic, SLA alerts
- Storage & compute: DuckDB with Parquet files on the local filesystem (mimicking an S3 lakehouse pattern)
- Transformation: dbt Core — staging models, intermediate joins, mart-layer aggregations
- Quality & lineage: Great Expectations for row-level validation, OpenLineage for dataset-level lineage tracking
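The retry logic mentioned above is handled by Airflow's built-in `retries` and `retry_delay` task arguments, but the underlying idea is worth spelling out. A minimal standalone sketch (names here are illustrative, not from the project):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on failure.

    Airflow provides this via the `retries` / `retry_delay` task
    arguments; this standalone sketch just shows the idea.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulate an API call that fails twice, then succeeds.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient API error")
    return "ok"

result = with_retries(flaky, attempts=3, base_delay=0.01)  # "ok"
```

The exponential backoff matters with a live API: hammering a rate-limited endpoint at fixed intervals just extends the outage.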
Ingestion: Airflow DAGs and the raw layer
The ingestion layer is handled by three Airflow DAGs — one each for arrivals data, line statuses, and stop points. Each DAG runs on a configurable schedule (arrivals every 3 minutes, statuses every 10, stop points daily) and writes raw API responses as Parquet files partitioned by date and hour.
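The date/hour partitioning scheme boils down to a deterministic path layout. A minimal sketch (`raw_path` is an illustrative helper, not code from the project):

```python
from datetime import datetime

def raw_path(dataset: str, ts: datetime, stop_id: str) -> str:
    # Partition by date, then hour, then one file per stop point,
    # e.g. data/raw/arrivals/2024-05-01/14/940GZZLUOXC.parquet
    return f"data/raw/{dataset}/{ts:%Y-%m-%d}/{ts:%H}/{stop_id}.parquet"

p = raw_path("arrivals", datetime(2024, 5, 1, 14, 3), "940GZZLUOXC")
# p == "data/raw/arrivals/2024-05-01/14/940GZZLUOXC.parquet"
```

Deterministic paths mean a re-run of the same DAG interval overwrites the same files — idempotent by construction, with no duplicate partitions.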
The key design decision here was to always write raw. No transformation at ingestion time. The raw layer is append-only — a permanent record of exactly what the API returned, at what time. This makes debugging straightforward: if something looks wrong downstream, I can always trace it back to the exact raw payload.
```python
# Simplified ingestion task
import os
from datetime import datetime, timezone

import pandas as pd
import requests

TFL_API_KEY = os.environ["TFL_API_KEY"]  # key supplied via environment

def fetch_arrivals(stop_id: str, execution_date: str) -> None:
    response = requests.get(
        f"https://api.tfl.gov.uk/StopPoint/{stop_id}/Arrivals",
        params={"app_key": TFL_API_KEY},
        timeout=30,
    )
    response.raise_for_status()

    df = pd.DataFrame(response.json())
    df["_ingested_at"] = datetime.now(timezone.utc)
    df["_partition_date"] = execution_date

    path = f"data/raw/arrivals/{execution_date}/{stop_id}.parquet"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_parquet(path, index=False)
```
Transformation: dbt and the medallion architecture
Once raw data lands, dbt takes over. I followed a strict three-layer model:
Staging (stg_)
One staging model per source table. Rename columns to snake_case, cast data types, drop duplicates, add a _loaded_at timestamp. No joins, no business logic — just clean, typed data at the same grain as the raw source.
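The rename/cast/dedupe pattern is plain SQL. A sketch using stdlib sqlite3 so it runs anywhere (the project's staging models are dbt SQL files compiled against DuckDB; column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_arrivals (VehicleId TEXT, TimeToStation TEXT, _ingested_at TEXT);
    INSERT INTO raw_arrivals VALUES
        ('v1', '120', '2024-05-01T14:00:00'),
        ('v1', '120', '2024-05-01T14:00:00'),  -- duplicate row
        ('v2', '300', '2024-05-01T14:00:00');
""")

rows = con.execute("""
    WITH ranked AS (
        SELECT
            VehicleId AS vehicle_id,                        -- rename to snake_case
            CAST(TimeToStation AS INTEGER) AS time_to_station,  -- cast types
            ROW_NUMBER() OVER (
                PARTITION BY VehicleId, _ingested_at ORDER BY _ingested_at
            ) AS rn                                         -- rank duplicates
        FROM raw_arrivals
    )
    SELECT vehicle_id, time_to_station FROM ranked WHERE rn = 1
    ORDER BY vehicle_id
""").fetchall()
# rows == [('v1', 120), ('v2', 300)]
```

The `ROW_NUMBER()`-then-filter idiom is the standard dedupe pattern in dbt staging models, and it translates directly to DuckDB.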
Intermediate (int_)
Join staging models together, apply business logic, calculate derived fields (e.g. expected_wait_seconds, is_disrupted). These models are not exposed to end consumers — they exist to keep mart models clean.
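The derived fields can be sketched as plain functions. These definitions are illustrative, not the project's exact logic:

```python
def expected_wait_seconds(time_to_station: int, dwell_seconds: int = 30) -> int:
    # Time until the vehicle reaches the platform, plus an assumed
    # dwell time at the stop (the 30s default is a made-up figure).
    return time_to_station + dwell_seconds

def is_disrupted(line_status: str) -> bool:
    # Anything other than "Good Service" counts as a disruption.
    return line_status != "Good Service"

w = expected_wait_seconds(120)      # 150
d = is_disrupted("Severe Delays")   # True
```

Keeping this logic in intermediate models means every mart that needs `is_disrupted` computes it the same way — one definition, many consumers.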
Marts (mart_)
The final analytics-ready tables. mart_line_performance, mart_stop_arrivals, mart_disruption_events. Each mart has a clear owner (in this case, me) and a documented freshness SLA defined in sources.yml.
The goal of the mart layer isn't to be clever. It's to be boring — predictable, well-tested, and fast to query.
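"Boring" in practice means a mart like mart_line_performance is little more than a grouped aggregation. Sketched with stdlib sqlite3 (the real model is a dbt SQL file against DuckDB; columns are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE int_arrivals (line_id TEXT, expected_wait_seconds INT);
    INSERT INTO int_arrivals VALUES
        ('victoria', 120), ('victoria', 180), ('central', 300);
""")

rows = con.execute("""
    SELECT line_id,
           COUNT(*)                   AS arrivals,
           AVG(expected_wait_seconds) AS avg_wait_seconds
    FROM int_arrivals
    GROUP BY line_id
    ORDER BY line_id
""").fetchall()
# rows == [('central', 1, 300.0), ('victoria', 2, 150.0)]
```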
Data quality: Great Expectations
Every staging model has a corresponding Expectation Suite. The checks I run on every load include:
- not-null constraints on primary keys
- accepted value sets for categorical fields like line_id and vehicle_mode
- range checks on time_to_station (must be between 0 and 3600 seconds)
- freshness checks that fail if data is more than 15 minutes stale
When a check fails, the Airflow DAG marks the task as failed and sends an alert. No silent data corruption. No dashboards showing stale numbers without warning. This is what treating data quality as engineering actually looks like in practice.
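Stripped of the Great Expectations API, the checks above reduce to assertions like these (a plain-Python sketch of the same logic, not the actual Expectation Suite; the mode set is illustrative):

```python
from datetime import datetime, timedelta, timezone

VALID_MODES = {"tube", "bus", "dlr", "overground", "tram"}  # illustrative subset

def validate_row(row: dict, now: datetime) -> list:
    """Return a list of failures; an empty list means the row passed."""
    failures = []
    if row.get("vehicle_id") is None:
        failures.append("vehicle_id is null")                     # not-null check
    if row.get("vehicle_mode") not in VALID_MODES:
        failures.append("unexpected vehicle_mode")                # accepted values
    if not (0 <= row.get("time_to_station", -1) <= 3600):
        failures.append("time_to_station out of range [0, 3600]") # range check
    if now - row["_ingested_at"] > timedelta(minutes=15):
        failures.append("data is more than 15 minutes stale")     # freshness
    return failures

now = datetime(2024, 5, 1, 14, 10, tzinfo=timezone.utc)
ok = validate_row(
    {"vehicle_id": "v1", "vehicle_mode": "tube", "time_to_station": 120,
     "_ingested_at": now - timedelta(minutes=5)},
    now,
)  # ok == []
```

Great Expectations adds the parts worth not rebuilding yourself: suite management, data docs, and per-batch validation results.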
Lineage: OpenLineage
OpenLineage tracks what data was read and written by each pipeline run. This sounds simple, but the value compounds over time. When something breaks downstream, lineage tells you exactly which upstream models were affected, which DAG run produced the bad data, and what the dataset looked like at that point in time.
Wiring OpenLineage into Airflow and dbt is a one-time setup. Once it's running, you get a full dependency graph for free — which is invaluable as the pipeline grows.
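At the wire level, each run emits JSON events in the OpenLineage format — roughly the shape below. This is a hand-built sketch of a START event (the Airflow and dbt integrations generate these automatically; the namespace and job names are illustrative):

```python
import json
import uuid

event = {
    "eventType": "START",                       # START / COMPLETE / FAIL
    "eventTime": "2024-05-01T14:00:00Z",
    "run": {"runId": str(uuid.uuid4())},        # ties all events for one run together
    "job": {"namespace": "tfl_lakehouse", "name": "ingest_arrivals"},
    "inputs": [{"namespace": "tfl_api", "name": "StopPoint.Arrivals"}],
    "outputs": [{"namespace": "local_fs", "name": "raw.arrivals"}],
}
payload = json.dumps(event)
```

The inputs/outputs pairs are what the dependency graph is built from: stitch them together across runs and you get end-to-end lineage.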
What I'd do differently
If I were starting again, I'd invest earlier in schema evolution handling. The TfL API occasionally adds or renames fields, and the raw Parquet schema needs to evolve gracefully. Right now I handle this with schema enforcement at the staging layer, but a proper schema registry would be cleaner at scale.
I'd also add data contracts between the raw and staging layers from day one — formalising exactly what shape of data each consumer expects, so that upstream changes fail fast rather than silently degrading downstream models.
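A minimal version of such a contract check — comparing what the raw layer delivers against what staging expects — could look like this (field names and types are illustrative, not the project's actual contract):

```python
# Hypothetical contract: the fields staging expects from raw arrivals.
EXPECTED_ARRIVALS_SCHEMA = {
    "vehicleId": str,
    "timeToStation": int,
    "lineId": str,
}

def check_contract(record: dict, schema: dict) -> list:
    """Return a list of violations; empty means the record honours the contract."""
    violations = []
    for field, expected_type in schema.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"{field}: expected {expected_type.__name__}")
    return violations

ok = check_contract(
    {"vehicleId": "v1", "timeToStation": 120, "lineId": "victoria"},
    EXPECTED_ARRIVALS_SCHEMA,
)  # ok == []
bad = check_contract({"vehicleId": "v1"}, EXPECTED_ARRIVALS_SCHEMA)
```

Run at ingestion time, a check like this turns an upstream API change into a loud, immediate failure instead of a quietly broken mart.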
The full project is open-source on GitHub: aosman101/tfl-realtime-lakehouse. It's also been used as a learning resource within the Somalis in Tech community, where students and early-career professionals have used it to understand real-world pipeline design. If you're building something similar or have questions about any part of the stack, leave a comment below — I'm happy to go deeper on any layer.