
Building the TfL Travel App with Somalis in Tech

How a community data engineering project became a real tool for London students and commuters — and what it taught me about building software that actually helps people.

Most side projects start with curiosity. This one started with a question from a student: "How do I actually know when my bus is coming?"

It sounds trivial. But for students commuting across London on tight schedules and tighter budgets, unreliable journey information means missed lectures, wasted Oyster credit, and a lot of frustration. The official TfL app works, but it wasn't built for the way these students actually move — hopping between buses, checking multiple stops, planning around disruptions in real time.

So I built something better. Not a polished consumer app — a real data engineering system that ingests live TfL data, transforms it, validates it, and serves it through a clean interface that students and commuters can actually use. And I did it as part of my work with Somalis in Tech, turning the entire build into a teaching opportunity.

What is Somalis in Tech? A London-based community supporting Somali professionals and students in the technology industry. I joined as a Junior Data Engineer and Peer Mentor, working on real data infrastructure while mentoring early-career candidates in data engineering and portfolio development.

The problem worth solving

London's transport network is massive — over 9,000 bus stops, 270 tube stations, and dozens of rail, DLR, and Overground stations. Transport for London publishes a rich Unified API covering real-time arrivals, line statuses, stop points, disruptions, and journey planning. The data is there. But turning raw API responses into something genuinely useful for commuters requires real engineering.

The students I was mentoring through the Caawi Mentorship Platform needed something tangible — a project they could see, touch, and learn from. Not another tutorial. Not a toy dataset. A real system, processing real data, solving a real problem they personally experienced every day.

That's how the TfL Travel App was born: part tool, part teaching platform.

Architecture: building it right

I designed the system with the same rigour I'd apply to any production data platform. Four layers, each handled by the right tool:

  • Ingestion & orchestration: Apache Airflow schedules DAGs that poll the TfL Unified API for real-time arrivals, line statuses, and stop point data. Arrivals refresh every 3 minutes; statuses every 10 minutes; stop points daily.
  • Storage: Raw API responses land as Parquet files on the local filesystem, partitioned by date and hour — mimicking an S3-backed lakehouse pattern without cloud dependency.
  • Transformation: dbt Core applies a strict medallion architecture — staging models for type-casting and deduplication, intermediate models for business logic, and mart-layer tables optimised for the queries users actually need.
  • Quality & observability: Great Expectations validates every batch with not-null checks, range validations, accepted value sets, and freshness SLAs. OpenLineage tracks full data lineage across every pipeline run.

Design principle: Every layer is independently testable and replaceable. Swapping DuckDB for BigQuery, or local Parquet for S3, should be a configuration change — not a rewrite. This modularity made it far easier to teach each component in isolation.
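To make the storage layer concrete: here's a minimal sketch of the date/hour partitioning scheme the raw layer uses. The base directory, dataset name, and file name are illustrative, not the project's actual paths — the point is the S3-style `date=/hour=` layout on a local filesystem.

```python
from datetime import datetime, timezone
from pathlib import Path

def raw_partition_path(base: str, dataset: str, ts: datetime) -> Path:
    """Build a date/hour-partitioned path for a raw API batch,
    mirroring an S3-style lakehouse layout on the local filesystem."""
    return Path(base) / dataset / f"date={ts:%Y-%m-%d}" / f"hour={ts:%H}" / "batch.parquet"

# An arrivals batch ingested at 09:17 UTC lands under the hour=09 partition.
ts = datetime(2024, 11, 4, 9, 17, tzinfo=timezone.utc)
print(raw_partition_path("data/raw", "arrivals", ts))
```

Because the partition key is encoded in the path, swapping the local filesystem for S3 really is just a change to `base` — the layout, and everything downstream that reads it, stays the same.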

What users actually get

The app serves three core use cases that students and commuters consistently told me they needed:

Real-time arrival predictions

Users search for a stop or station and see live arrival predictions — not just scheduled times, but actual predicted arrivals based on vehicle positions. The data comes from the TfL arrivals endpoint, processed through the pipeline and served with sub-minute latency. For students timing a bus-to-tube connection, those few minutes of accuracy matter.
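The shape of that feature is simple once the pipeline has done its work. Here's a sketch of turning an Arrivals response into a sorted departure board — the payload below is hand-made sample data shaped like the real endpoint's output (`GET https://api.tfl.gov.uk/StopPoint/{id}/Arrivals`, where `timeToStation` is seconds until arrival), not a live response.

```python
# Hand-made sample shaped like the TfL StopPoint Arrivals response.
sample_response = [
    {"lineName": "25", "destinationName": "Ilford", "timeToStation": 312},
    {"lineName": "86", "destinationName": "Stratford", "timeToStation": 95},
    {"lineName": "25", "destinationName": "Ilford", "timeToStation": 718},
]

def arrival_board(predictions, limit=3):
    """Sort live predictions by time-to-arrival and format them for display."""
    nearest = sorted(predictions, key=lambda p: p["timeToStation"])[:limit]
    return [f"{p['lineName']} to {p['destinationName']} in {p['timeToStation'] // 60} min"
            for p in nearest]

for row in arrival_board(sample_response):
    print(row)
```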

Line status and disruption alerts

The pipeline monitors all tube, bus, DLR, and Overground lines for disruptions. When a line is suspended, delayed, or running a modified service, the mart-layer models flag it immediately. Users see disruptions before they affect their journey — not after they're already standing on a platform.
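The disruption flag itself is a small filter over the Line Status feed (`GET https://api.tfl.gov.uk/Line/Mode/{modes}/Status`). The sample below mirrors the response shape — each line carries a list of `lineStatuses` with a human-readable `statusSeverityDescription` — but the data is invented for illustration.

```python
# Hand-made sample shaped like the TfL Line Status response.
sample_status = [
    {"name": "Central", "lineStatuses": [{"statusSeverityDescription": "Good Service"}]},
    {"name": "Jubilee", "lineStatuses": [{"statusSeverityDescription": "Minor Delays"}]},
    {"name": "DLR", "lineStatuses": [{"statusSeverityDescription": "Part Suspended"}]},
]

def disrupted_lines(lines):
    """Return (line, status) pairs for any line not running a good service."""
    return [(line["name"], status["statusSeverityDescription"])
            for line in lines
            for status in line["lineStatuses"]
            if status["statusSeverityDescription"] != "Good Service"]

print(disrupted_lines(sample_status))
```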

Journey reliability insights

Over time, the lakehouse accumulates historical arrival data. This unlocks something the official TfL app doesn't offer: historical reliability patterns. Users can see which routes are consistently late, which times of day have the most disruptions, and plan accordingly. This turned out to be the feature students valued most — it changed how they planned their commutes.
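At its core, the reliability mart is an aggregation over the accumulated history. Here's the idea sketched with sqlite3 and toy data rather than dbt over the real lakehouse — routes and delay figures are invented, but the query is the same shape as a mart model: late-rate per route, where "late" means more than five minutes behind.

```python
import sqlite3

# Toy history table standing in for the accumulated arrivals data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE arrivals_history (route TEXT, hour INTEGER, delay_min REAL)")
con.executemany(
    "INSERT INTO arrivals_history VALUES (?, ?, ?)",
    [("25", 8, 6.0), ("25", 8, 9.0), ("25", 14, 1.0),
     ("86", 8, 2.0), ("86", 8, 0.0), ("86", 14, 1.5)],
)

# Late-rate per route: share of observed arrivals more than 5 minutes late.
rows = con.execute("""
    SELECT route,
           ROUND(AVG(CASE WHEN delay_min > 5 THEN 1.0 ELSE 0.0 END), 2) AS late_rate
    FROM arrivals_history
    GROUP BY route
    ORDER BY late_rate DESC
""").fetchall()
print(rows)
```

Group by hour of day instead of route and the same pattern yields the "which times have the most disruptions" view.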

The teaching dimension

Building the app was only half the project. The other half was using it as a hands-on teaching tool for the cohort of ~12 early-career candidates I was mentoring through Somalis in Tech.

Each layer of the stack became a teaching module:

  • API ingestion: Students learned how to work with REST APIs, handle rate limits, design retry logic, and think about idempotency — using the TfL API as their real-world example.
  • Data modelling: We walked through the staging-to-marts pattern in dbt, discussing why you separate raw, typed, and business-logic layers. Students built their own dbt models against the TfL data.
  • Data quality: Using Great Expectations, students wrote their own expectation suites — learning to think about what "correct data" actually means in a system processing live transport feeds.
  • Pipeline orchestration: Airflow DAGs gave students a visual model of how data flows through a real system — dependencies, retries, SLAs, and failure handling.

The best way to teach data engineering isn't to lecture about it. It's to build something real and let people pull it apart.

Impact and what students built

The results exceeded what I expected. Feedback from students showed the project genuinely shifted their understanding of data engineering from abstract concepts to concrete, buildable systems.

  • ~12 mentees through the Caawi platform
  • 87% improvement in data freshness (6h to 45min)
  • 65% reduction in manual reporting tasks

Several mentees went on to build their own pipeline projects inspired by this architecture. One built a weather data lakehouse. Another started a sports analytics warehouse using the same dbt patterns. Seeing students take the concepts and apply them independently — that's when I knew the teaching approach worked.

Beyond the mentoring cohort, the app itself proved useful to the broader Somalis in Tech community. Members used it for their daily commutes, and the real-time disruption alerts became particularly popular during the winter months when service disruptions spike.

Technical challenges and lessons

API rate limits and graceful degradation

The TfL API has rate limits that you hit quickly when polling thousands of stops. I implemented a priority-based polling strategy: high-traffic stops (major tube stations, university-adjacent bus stops) refresh more frequently, while quieter stops use longer intervals. When rate limits are hit, the pipeline falls back to the most recent cached data rather than failing entirely.
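Here's that strategy in miniature. The tier intervals, cache shape, and stop id are illustrative rather than the production values, and a `RuntimeError` stands in for an HTTP 429 response — the point is the two behaviours: tiers gate how often a stop is refreshed, and a rate-limited fetch degrades to stale data instead of failing.

```python
import time

# Illustrative tier intervals, not the production values.
POLL_INTERVAL_S = {"high": 60, "medium": 180, "low": 600}

cache: dict[str, tuple[float, list]] = {}  # stop_id -> (fetched_at, arrivals)

def poll_stop(stop_id, tier, fetch, now=time.time):
    """Fetch fresh arrivals if the tier's interval has elapsed; on a rate
    limit, degrade gracefully to the most recent cached batch."""
    fetched_at, arrivals = cache.get(stop_id, (0.0, []))
    if now() - fetched_at < POLL_INTERVAL_S[tier]:
        return arrivals                      # still fresh enough for this tier
    try:
        arrivals = fetch(stop_id)            # e.g. GET .../StopPoint/{id}/Arrivals
        cache[stop_id] = (now(), arrivals)
        return arrivals
    except RuntimeError:                     # stand-in for a 429 response
        return arrivals                      # serve stale data, never fail hard
```

Injecting `fetch` and `now` keeps the strategy testable without the network — which is exactly how it was taught to the cohort.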

Schema evolution in live APIs

TfL occasionally changes its API response schema — adding fields, renaming attributes, or changing data types. The raw layer stores responses as-is, and the staging layer uses defensive casting with explicit column selection. When a schema change breaks a staging model, dbt tests catch it before downstream models are affected.
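The defensive-casting idea looks like this in miniature — sketched in plain Python rather than dbt SQL. The staging schema selects columns explicitly, casts with a fallback to `NULL`, and silently drops fields the API adds later; the column names follow the arrivals feed, but the helper itself is illustrative.

```python
STAGING_COLUMNS = {            # target column -> caster
    "lineName": str,
    "timeToStation": int,
    "expectedArrival": str,
}

def stage_record(raw: dict) -> dict:
    """Project a raw API record onto the staging schema.
    Unknown fields are dropped; uncastable values become None so
    downstream not-null tests catch them instead of crashing the run."""
    staged = {}
    for col, cast in STAGING_COLUMNS.items():
        try:
            value = raw[col]
            staged[col] = None if value is None else cast(value)
        except (KeyError, ValueError, TypeError):
            staged[col] = None
    return staged

raw = {"lineName": "25", "timeToStation": "312",      # arrived as a string
       "expectedArrival": "2024-11-04T09:22:00Z",
       "newUpstreamField": "ignored"}                 # schema drift: dropped
print(stage_record(raw))
```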

Making it accessible

Most data engineering projects are built for engineers. This one needed to be usable by students who might never have seen a terminal. That meant investing in clear documentation, a simple interface, and error messages that explain what went wrong in plain language — not stack traces.
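The plain-language layer can be as simple as a lookup from internal error codes to messages a user can act on. The codes and wording below are illustrative, not the app's actual strings.

```python
# Illustrative mapping from internal error codes to plain-language messages.
FRIENDLY_MESSAGES = {
    "rate_limited": "Live data is briefly unavailable - showing the most recent arrivals instead.",
    "stop_not_found": "We couldn't find that stop. Check the spelling or try a nearby station.",
    "stale_data": "These times were last updated a few minutes ago and may have changed.",
}

def user_message(error_code: str) -> str:
    """Translate an internal error code into plain language, with a safe default."""
    return FRIENDLY_MESSAGES.get(error_code, "Something went wrong - please try again shortly.")

print(user_message("rate_limited"))
```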

What I would do differently

If I started again, three things would change:

  • Data contracts from day one. Formalising the expected shape of data between layers would have caught schema issues faster and made the teaching clearer — students would see exactly what each layer promises to deliver.
  • More granular monitoring. I'd add per-stop freshness SLAs and alerting, not just per-pipeline. Some stops matter more than others, and the monitoring should reflect that.
  • User feedback loop. Building a simple feedback mechanism from the start would have helped prioritise which stops and features to focus on, rather than relying on informal conversations.
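On the first of those points, a data contract doesn't need heavy tooling to be useful. Here's a minimal sketch of the idea: each layer publishes the columns and types it promises, and the next layer validates a batch against that promise before consuming it. The contract contents are illustrative.

```python
# Illustrative contract: what the staging layer promises its consumers.
STAGING_ARRIVALS_CONTRACT = {
    "lineName": str,
    "timeToStation": int,
}

def validate_batch(batch, contract):
    """Return a list of violations; an empty list means the batch honours the contract."""
    violations = []
    for i, row in enumerate(batch):
        for col, typ in contract.items():
            if col not in row:
                violations.append(f"row {i}: missing column {col!r}")
            elif not isinstance(row[col], typ):
                violations.append(f"row {i}: {col!r} should be {typ.__name__}")
    return violations

good = [{"lineName": "25", "timeToStation": 312}]
bad = [{"lineName": "25", "timeToStation": "312"}]
print(validate_batch(good, STAGING_ARRIVALS_CONTRACT))
print(validate_batch(bad, STAGING_ARRIVALS_CONTRACT))
```

For teaching, this framing is the payoff: students can read the contract and see exactly what each layer promises to deliver.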

Beyond the code

This project taught me something that no technical tutorial covers: the gap between "it works" and "it helps."

A pipeline that processes data correctly is engineering. A pipeline that processes data correctly and makes someone's commute less stressful — that's software worth building. The students I mentored didn't just learn dbt and Airflow. They learned that data engineering has a purpose beyond the pipeline.

The TfL Travel App is still running, still ingesting live data, and still being used by members of the Somalis in Tech community. It's not perfect. It's not a startup. It's a tool that solves a real problem for real people — and it was built by a community that supports each other.

That's the kind of engineering I want to keep doing.


The full project is open-source on GitHub: aosman101/tfl-realtime-lakehouse. If you're building community-driven data projects or using the TfL API, I'd love to hear about it — leave a comment below or reach out on LinkedIn.
