jcousins/2026.04.001
Lineup
A production data platform for live-music discovery, with a SwiftUI client.
J. Cousins
Independent · London, UK
Keywords: data engineering · Airflow · BigQuery · dbt · entity resolution · iOS · SwiftUI
1 Introduction
Concert discovery is fragmented in a way that is faintly absurd given how mature both the live-music industry and the modern data stack are. No single platform aggregates events across the full breadth of the calendar, and the existing aggregators are either commercial fronts for one supplier (Ticketmaster) or surface-level scrapers with no commitment to entity quality. Lineup treats this as a data-engineering problem first and a product problem second: solve the consolidation, deduplication, and identity issues at the platform layer, then expose a small, opinionated client that does one thing well.
This paper documents the system as it stood after twelve months of production operation. We focus on the technical decisions and the specific trade-offs that shaped them rather than the product surface, which is documented separately in the iOS App Store listing.
2 Method
2.1 Orchestration
All ingestion runs in Apache Airflow 2.10 with a CeleryExecutor and Redis as the broker. We considered a Lambda-based serverless approach early on, but Airflow's first-class observability (DAG visualisation, retries, alerting, task-level logs) outweighed the operational overhead of running our own scheduler. Ten production DAGs orchestrate the daily syncs from the six API sources, each with source-specific selectors, three retries with exponential backoff, and Slack and email alerting on failure.
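As a rough illustration of the retry policy described above, the sketch below expresses three retries with exponential backoff as an Airflow-style `default_args` dictionary. The keys (`retries`, `retry_delay`, `retry_exponential_backoff`, `max_retry_delay`, `email_on_failure`) are standard Airflow operator arguments. The concrete delay values and the `nominal_delays` helper are assumptions made for illustration, not values taken from the production DAGs, and Airflow's own schedule adds jitter, so the delays shown are indicative only.

```python
from datetime import timedelta

# Hypothetical values: the text specifies three retries with exponential
# backoff, but not the base delay or the cap.
DEFAULT_ARGS = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
    "email_on_failure": True,
}


def nominal_delays(args: dict) -> list[timedelta]:
    """Return an idealised doubling schedule, one delay per retry.

    This is not Airflow's exact formula (which also applies jitter); it only
    shows the shape of the backoff that these settings ask for.
    """
    base = args["retry_delay"]
    cap = args["max_retry_delay"]
    return [min(base * (2 ** attempt), cap) for attempt in range(args["retries"])]


if __name__ == "__main__":
    for attempt, delay in enumerate(nominal_delays(DEFAULT_ARGS), start=1):
        print(f"retry {attempt}: wait about {delay}")
```

In a real DAG file a dictionary like this would be passed as `default_args` so that every task in the DAG inherits the same policy.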
2.2 Entity resolution
Multi-source deduplication is the central technical problem. We resolve entities in three stages, in order of decreasing precision:
- Direct mapping from a source identifier to its MusicBrainz identifier where one exists.
- Fuzzy matching on normalised name plus location for everything else, with strict similarity thresholds tuned per entity type.
- An `ExternalIdManager` class maintaining persistent cross-source mappings in dedicated tables, so any pair of identifiers we have ever resolved together stays resolved.
The trade-off is complexity: the matcher carries non-trivial state and the failure modes (false positives at stage two) are the kind that erode user trust silently. The payoff is data quality that the aggregators do not match.
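The three stages can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the production matcher: the `Resolver` class, its field names, the single `threshold` of 0.92, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical, and the in-memory `registry` dictionary stands in for the persistent tables that `ExternalIdManager` maintains.

```python
import unicodedata
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional


def normalise(text: str) -> str:
    """Lower-case, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


@dataclass
class Resolver:
    # Stage 1 data: (source, source_id) -> MusicBrainz ID, where one is known.
    mbid_by_source_id: dict = field(default_factory=dict)
    # Stage 2 data: canonical ID -> (name, city) used for fuzzy comparison.
    canonical: dict = field(default_factory=dict)
    # Stage 3 data: stand-in for the persistent cross-source mapping tables.
    registry: dict = field(default_factory=dict)
    # Hypothetical value; the text says thresholds are tuned per entity type.
    threshold: float = 0.92

    def resolve(self, source: str, source_id: str, name: str, city: str) -> Optional[str]:
        key = (source, source_id)
        # Anything resolved before stays resolved.
        if key in self.registry:
            return self.registry[key]
        # Stage 1: direct identifier mapping.
        match = self.mbid_by_source_id.get(key)
        # Stage 2: fuzzy match on normalised name within the same location.
        if match is None:
            match = self._fuzzy(name, city)
        # Stage 3: persist the mapping for future runs.
        if match is not None:
            self.registry[key] = match
        return match

    def _fuzzy(self, name: str, city: str) -> Optional[str]:
        target = normalise(name)
        best_id, best_score = None, 0.0
        for cid, (cname, ccity) in self.canonical.items():
            if normalise(ccity) != normalise(city):
                continue
            score = SequenceMatcher(None, target, normalise(cname)).ratio()
            if score > best_score:
                best_id, best_score = cid, score
        return best_id if best_score >= self.threshold else None


if __name__ == "__main__":
    resolver = Resolver(
        mbid_by_source_id={("source_a", "a-1"): "mbid-aaa"},
        canonical={"mbid-bbb": ("Sigur Rós", "Reykjavik")},
    )
    print(resolver.resolve("source_a", "a-1", "Any Name", "London"))
    print(resolver.resolve("source_b", "b-9", "sigur ros", "Reykjavik"))
    print(resolver.resolve("source_b", "b-7", "Unknown Band", "Reykjavik"))
```

Checking the registry first is what makes the state compound: once a pair of identifiers has been matched, later runs skip the fuzzy stage for it entirely, which also limits exposure to stage-two false positives.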
2.3 Storage
We use a write-through pattern: every successful ingestion writes simultaneously to Supabase (PostgreSQL, the operational store backing the iOS client) and GCS Parquet files (the analytic input feeding BigQuery via dbt). The two systems remain independent: an operational outage does not block the analytic pipeline, and an analytic refactor cannot accidentally mutate production data. The cost is a small write-time latency penalty, which is acceptable given that ingestion is asynchronous from the user's perspective.
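The independence property of the two stores can be illustrated with a small sketch. Everything here is hypothetical: `Sink`, `InMemorySink`, and `write_batch` are illustrative names, the in-memory sinks stand in for Supabase and the GCS Parquet writer, and the writes run sequentially for simplicity where the real pipeline is described as writing simultaneously.

```python
from typing import Protocol


class Sink(Protocol):
    name: str

    def write(self, records: list[dict]) -> None: ...


class InMemorySink:
    """Stand-in for a real store (the Postgres writer or the Parquet writer)."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail
        self.rows: list[dict] = []

    def write(self, records: list[dict]) -> None:
        if self.fail:
            raise ConnectionError(f"{self.name} unavailable")
        self.rows.extend(records)


def write_batch(records: list[dict], sinks: list[Sink]) -> dict[str, Exception]:
    """Write the same batch to every sink and return failures keyed by sink name.

    A failure in one sink is recorded rather than raised, so the other sink
    still receives the batch. The caller decides whether to retry or alert.
    """
    failures: dict[str, Exception] = {}
    for sink in sinks:
        try:
            sink.write(records)
        except Exception as exc:  # broad on purpose: isolate each sink
            failures[sink.name] = exc
    return failures


if __name__ == "__main__":
    operational = InMemorySink("operational", fail=True)  # simulated outage
    analytic = InMemorySink("analytic")
    batch = [{"event_id": 1, "artist": "Example Artist"}]
    failed = write_batch(batch, [operational, analytic])
    print(sorted(failed), len(analytic.rows))
```

In this sketch an outage on the operational side leaves the analytic copy intact, which is the behaviour the paragraph above describes; reconciling the store that missed the batch would be handled by the orchestrator's retries.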
BigQuery was chosen over a traditional data lake for cost efficiency at our query volumes and for the SQL ergonomics. The dbt project is layered: staging → intermediate → marts, with data-quality tests at every layer.
2.4 Caching
An earlier version of the pipeline used a 57 MB JSON file as a process-local cache. This was the kind of decision that survives until it doesn't: it was fine for a single worker, problematic for parallel Celery workers (each of which had to load its own copy), and openly hostile to deployment hygiene. We replaced it with Redis 7.x as a distributed cache shared by all workers, which gives proper eviction semantics and substantially reduces worker memory footprint and warm-start times.
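A shared cache of this kind is typically used in a get-or-load pattern, sketched below. The function name `cached_lookup`, the key format, and the one-day TTL are assumptions for illustration. The only client methods relied on are `get(name)` and `set(name, value, ex=...)`, which match the redis-py client; a dictionary-backed `FakeClient` is included so the example runs without a Redis server.

```python
import json
from typing import Any, Callable, Optional, Protocol


class KeyValueClient(Protocol):
    """The two client methods this sketch relies on."""

    def get(self, name: str) -> Optional[bytes]: ...

    def set(self, name: str, value: str, ex: Optional[int] = None) -> Any: ...


def cached_lookup(
    client: KeyValueClient,
    key: str,
    loader: Callable[[], Any],
    ttl_seconds: int = 86_400,  # hypothetical TTL of one day
) -> Any:
    """Return the cached value for key, loading and storing it on a miss."""
    raw = client.get(key)
    if raw is not None:
        return json.loads(raw)
    value = loader()
    # ex sets an expiry in seconds, so stale entries age out on their own.
    client.set(key, json.dumps(value), ex=ttl_seconds)
    return value


class FakeClient:
    """Dict-backed stand-in; it ignores expiry and exists only for the demo."""

    def __init__(self) -> None:
        self.store: dict[str, bytes] = {}

    def get(self, name: str) -> Optional[bytes]:
        return self.store.get(name)

    def set(self, name: str, value: str, ex: Optional[int] = None) -> bool:
        self.store[name] = value.encode()
        return True


if __name__ == "__main__":
    client = FakeClient()
    calls = []

    def load_artist() -> dict:
        calls.append(1)  # count how often the expensive path runs
        return {"mbid": "mbid-aaa", "name": "Example Artist"}

    first = cached_lookup(client, "artist:source_a:a-1", load_artist)
    second = cached_lookup(client, "artist:source_a:a-1", load_artist)
    print(first == second, len(calls))
```

Against a real deployment the `client` argument would be a redis-py `Redis` instance. Because the state lives in Redis rather than in each process, parallel Celery workers see the same entries and start without loading a large file.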
3 Implementation
| Layer | Choice | Why |
|---|---|---|
| Orchestration | Apache Airflow 2.10 | Observable DAGs, mature retry / alerting. |
| Operational DB | Supabase (Postgres) | Row-level security, real-time, low ops overhead. |
| Analytic warehouse | BigQuery | Serverless, cost-efficient, native dbt integration. |
| Transformations | dbt | Version-controlled, testable SQL, layered architecture. |
| Cache | Redis 7.x | Distributed, proper eviction, replaces a process-local JSON. |
| Client | Swift / SwiftUI · iOS 26 | Native performance, Apple ecosystem. |
| Infra | Docker Compose · GitHub Actions | Local reproducibility, CI on every commit. |
4 Results
- 1M+ events resolved across all sources.
- 300K+ unique artists in the canonical entity table.
- 6 active API integrations, each with its own DAG.
- 122 Python unit tests covering pipeline logic.
- 122 dbt data-quality tests in the warehouse layer.
- Zero production data losses across twelve months of operation.
5 Discussion
The single best decision was treating entity resolution as a first-class subsystem with its own persistent state, rather than a per-run heuristic. The cross-source identifier registry compounds: every resolution we do today makes tomorrow's resolutions cheaper and more accurate. The single worst decision was the JSON cache; it took longer to retire than it should have, because it kept appearing to work.
Near-term work splits into two strands. On the platform side: incremental loads to cut BigQuery costs, and a Looker Studio dashboard for pipeline-health observability. On the analytics side: attribution modelling for venue-popularity trends and artist-discovery patterns, which we expect to be useful both to users (recommendations) and to the catalogue (signal for which sources to prioritise). The integration roadmap starts with the Live Nation API for major-promoter coverage, followed by regional promoters.
6 References
- Airflow documentation, Apache Software Foundation, accessed 2026.
- MusicBrainz database schema and identifier model, MetaBrainz Foundation.
- dbt project structure, dbt Labs, "How we structure our dbt projects".