Most side projects start with curiosity. This one started with a question from a student: "How do I actually know when my bus is coming?"
It sounds trivial. But for students commuting across London on tight schedules and tighter budgets, unreliable journey information means missed lectures, wasted Oyster credit, and a lot of frustration. The official TfL app works, but it wasn't built for the way these students actually move — hopping between buses, checking multiple stops, planning around disruptions in real time.
So I built something better. Not a polished consumer app — a real data engineering system that ingests live TfL data, transforms it, validates it, and serves it through a clean interface that students and commuters can actually use. And I did it as part of my work with Somalis in Tech, turning the entire build into a teaching opportunity.
The problem worth solving
London's transport network is massive — over 9,000 bus stops, 272 tube stations, and dozens of rail, DLR, and Overground stations. Transport for London publishes a rich Unified API covering real-time arrivals, line statuses, stop points, disruptions, and journey planning. The data is there. But turning raw API responses into something genuinely useful for commuters requires real engineering.
The students I was mentoring through the Caawi Mentorship Platform needed something tangible — a project they could see, touch, and learn from. Not another tutorial. Not a toy dataset. A real system, processing real data, solving a real problem they personally experienced every day.
That's how the TfL Travel App was born: part tool, part teaching platform.
Architecture: building it right
I designed the system with the same rigour I'd apply to any production data platform. Four layers, each handled by the right tool:
- Ingestion & orchestration: Apache Airflow schedules DAGs that poll the TfL Unified API for real-time arrivals, line statuses, and stop point data. Arrivals refresh every 3 minutes; statuses every 10 minutes; stop points daily.
- Storage: Raw API responses land as Parquet files on the local filesystem, partitioned by date and hour — mimicking an S3-backed lakehouse pattern without cloud dependency.
- Transformation: dbt Core applies a strict medallion architecture — staging models for type-casting and deduplication, intermediate models for business logic, and mart-layer tables optimised for the queries users actually need.
- Quality & observability: Great Expectations validates every batch with not-null checks, range validations, accepted value sets, and freshness SLAs. OpenLineage tracks full data lineage across every pipeline run.
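The storage layer's partitioning convention can be sketched with a small helper. This is an illustrative sketch, not the project's actual code; the `landing_path` function and the directory layout are assumptions, showing the Hive-style `date=`/`hour=` scheme the ingestion layer might use:

```python
from datetime import datetime
from pathlib import Path

def landing_path(root: str, dataset: str, ts: datetime) -> Path:
    """Build a Hive-style partition path for a raw batch file,
    e.g. raw/arrivals/date=2024-01-15/hour=09/arrivals_20240115T090300.parquet.
    Partitioning by date and hour mirrors an S3 lakehouse layout locally."""
    return (
        Path(root)
        / dataset
        / f"date={ts:%Y-%m-%d}"
        / f"hour={ts:%H}"
        / f"{dataset}_{ts:%Y%m%dT%H%M%S}.parquet"
    )
```

Because each batch lands in its own dated partition, re-running a failed ingestion hour overwrites only that partition, which keeps the pipeline idempotent.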
What users actually get
The app serves three core use cases that students and commuters consistently told me they needed:
Real-time arrival predictions
Users search for a stop or station and see live arrival predictions — not just scheduled times, but actual predicted arrivals based on vehicle positions. The data comes from the TfL arrivals endpoint, processed through the pipeline and served with sub-minute latency. For students timing a bus-to-tube connection, those few minutes of accuracy matter.
Line status and disruption alerts
The pipeline monitors all tube, bus, DLR, and Overground lines for disruptions. When a line is suspended, delayed, or running a modified service, the mart-layer models flag it immediately. Users see disruptions before they affect their journey — not after they're already standing on a platform.
Journey reliability insights
Over time, the lakehouse accumulates historical arrival data. This unlocks something the official TfL app doesn't offer: historical reliability patterns. Users can see which routes are consistently late, which times of day have the most disruptions, and plan accordingly. This turned out to be the feature students valued most — it changed how they planned their commutes.
The teaching dimension
Building the app was only half the project. The other half was using it as a hands-on teaching tool for the cohort of ~12 early-career candidates I was mentoring through Somalis in Tech.
Each layer of the stack became a teaching module:
- API ingestion: Students learned how to work with REST APIs, handle rate limits, design retry logic, and think about idempotency — using the TfL API as their real-world example.
- Data modelling: We walked through the staging-to-marts pattern in dbt, discussing why you separate raw, typed, and business-logic layers. Students built their own dbt models against the TfL data.
- Data quality: Using Great Expectations, students wrote their own expectation suites — learning to think about what "correct data" actually means in a system processing live transport feeds.
- Pipeline orchestration: Airflow DAGs gave students a visual model of how data flows through a real system — dependencies, retries, SLAs, and failure handling.
The best way to teach data engineering isn't to lecture about it. It's to build something real and let people pull it apart.
Impact and what students built
The results exceeded my expectations. Feedback from students showed the project genuinely shifted their understanding of data engineering from abstract concepts to concrete, buildable systems.
Several mentees went on to build their own pipeline projects inspired by this architecture. One built a weather data lakehouse. Another started a sports analytics warehouse using the same dbt patterns. Seeing students take the concepts and apply them independently — that's when I knew the teaching approach worked.
Beyond the mentoring cohort, the app itself proved useful to the broader Somalis in Tech community. Members used it for their daily commutes, and the real-time disruption alerts became particularly popular during the winter months when service disruptions spike.
Technical challenges and lessons
API rate limits and graceful degradation
The TfL API has rate limits that you hit quickly when polling thousands of stops. I implemented a priority-based polling strategy: high-traffic stops (major tube stations, university-adjacent bus stops) refresh more frequently, while quieter stops use longer intervals. When rate limits are hit, the pipeline falls back to the most recent cached data rather than failing entirely.
Schema evolution in live APIs
TfL occasionally changes its API response schema — adding fields, renaming attributes, or changing data types. The raw layer stores responses as-is, and the staging layer uses defensive casting with explicit column selection. When a schema change breaks a staging model, dbt tests catch it before downstream models are affected.
Making it accessible
Most data engineering projects are built for engineers. This one needed to be usable by students who might never have seen a terminal. That meant investing in clear documentation, a simple interface, and error messages that explain what went wrong in plain language — not stack traces.
What I would do differently
If I started again, three things would change:
- Data contracts from day one. Formalising the expected shape of data between layers would have caught schema issues faster and made the teaching clearer — students would see exactly what each layer promises to deliver.
- More granular monitoring. I'd add per-stop freshness SLAs and alerting, not just per-pipeline. Some stops matter more than others, and the monitoring should reflect that.
- User feedback loop. Building a simple feedback mechanism from the start would have helped prioritise which stops and features to focus on, rather than relying on informal conversations.
Beyond the code
This project taught me something that no technical tutorial covers: the gap between "it works" and "it helps."
A pipeline that processes data correctly is engineering. A pipeline that processes data correctly and makes someone's commute less stressful — that's software worth building. The students I mentored didn't just learn dbt and Airflow. They learned that data engineering has a purpose beyond the pipeline.
The TfL Travel App is still running, still ingesting live data, and still being used by members of the Somalis in Tech community. It's not perfect. It's not a startup. It's a tool that solves a real problem for real people — and it was built by a community that supports each other.
That's the kind of engineering I want to keep doing.
The full project is open-source on GitHub: aosman101/tfl-realtime-lakehouse. If you're building community-driven data projects or using the TfL API, I'd love to hear about it — leave a comment below or reach out on LinkedIn.