Lineup

A production data platform for live-music discovery, with a SwiftUI client.

J. Cousins

Independent · London, UK

Submitted 2024-11 · Revised 2026-04 · Status: In production

Abstract. Live-music discovery is a data-integration problem masquerading as a consumer one: events are scattered across at least six largely incompatible APIs, with inconsistent identifiers, partial overlap, and no canonical source of truth. Lineup is an attempt to solve the problem from the data side first. We describe a production pipeline orchestrated in Apache Airflow that ingests 1M+ events and 300K+ artists daily from Ticketmaster, Skiddle, Resident Advisor, Concert Archives, Spotify, and MusicBrainz; resolves entities through a three-stage matcher backed by a persistent cross-source identifier registry; and lands data in Supabase (operational) and BigQuery (analytic) via a write-through pattern. A SwiftUI iOS client consumes the operational store. The system has run in production for twelve months with zero data loss.

Keywords: data engineering · Airflow · BigQuery · dbt · entity resolution · iOS · SwiftUI

1 Introduction

Concert discovery is fragmented in a way that is faintly absurd given how mature both the live-music industry and the modern data stack are. There is no single platform that aggregates events across the full breadth of the calendar, and the existing aggregators are either commercial fronts for one supplier (Ticketmaster) or surface-level scrapers without any commitment to entity quality. Lineup approaches the problem as a data-engineering problem first and a product problem second: solve the consolidation, dedup, and identity issues at the platform layer, then expose a small, opinionated client that does one thing well.

This paper documents the system as it stood after twelve months of production operation. We focus on the technical decisions and the specific trade-offs that shaped them rather than the product surface, which is documented separately in the iOS App Store listing.

2 Method

2.1 Orchestration

All ingestion runs in Apache Airflow 2.10 with a CeleryExecutor and Redis as the broker. We considered a Lambda-based serverless approach early on, but Airflow's first-class observability (DAG visualisation, retries, alerting, task-level logs) outweighed the operational overhead of running our own scheduler. Ten production DAGs orchestrate the daily syncs from the six API sources, each with source-specific selectors, three retries with exponential backoff, and Slack and email alerting on failure.
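The retry policy described above can be sketched as a set of task defaults. This is a minimal illustration, not the project's configuration: the keys are standard Airflow 2.x operator arguments that would normally be passed as `default_args` to a DAG, but the five-minute base delay and one-hour cap are assumed values, and `nominal_backoff` is a hypothetical helper that only shows the un-jittered doubling (Airflow's real schedule adds jitter). Slack alerting would typically hang off `on_failure_callback`, which is omitted here.

```python
from datetime import timedelta

# Assumed values: the paper states only "three retries with exponential
# backoff" plus failure alerting. The delays below are illustrative.
DEFAULT_ARGS = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),      # assumed base delay
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),    # assumed cap
    "email_on_failure": True,
}


def nominal_backoff(base: timedelta, retries: int, cap: timedelta) -> list[timedelta]:
    """Un-jittered doubling schedule: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(base * (2 ** attempt), cap) for attempt in range(retries)]


delays = nominal_backoff(
    DEFAULT_ARGS["retry_delay"],
    DEFAULT_ARGS["retries"],
    DEFAULT_ARGS["max_retry_delay"],
)
print([int(d.total_seconds() // 60) for d in delays])  # minutes between attempts
```

Under these assumed values a failing task would be retried after roughly 5, 10, and 20 minutes before the failure alert fires.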

2.2 Entity resolution

Multi-source deduplication is the central technical problem. We resolve entities in three stages, in order of decreasing precision:

1. Direct mapping from a source identifier to its MusicBrainz identifier where one exists.
2. Fuzzy matching on normalised name plus location for everything else, with strict similarity thresholds tuned per entity type.
3. An ExternalIdManager class maintaining persistent cross-source mappings in dedicated tables, so any pair of identifiers we have ever resolved together stays resolved.
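The three stages above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the production matcher: `Registry` is an in-memory stand-in for the persistent mapping tables, the record fields (`source`, `source_id`, `name`, `location`, `mbid`) and the 0.92 threshold are hypothetical, `difflib.SequenceMatcher` is swapped in for whatever similarity measure the pipeline actually uses, and consulting the registry before the other stages is an assumed ordering.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional


def normalise(text: str) -> str:
    """Lowercase and reduce every non-alphanumeric run to a single space."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


@dataclass
class Registry:
    """In-memory stand-in for the persistent cross-source mapping tables."""
    mappings: dict = field(default_factory=dict)  # (source, source_id) -> canonical id

    def get(self, source: str, source_id: str) -> Optional[str]:
        return self.mappings.get((source, source_id))

    def link(self, source: str, source_id: str, canonical_id: str) -> None:
        self.mappings[(source, source_id)] = canonical_id


def resolve(record: dict, canonical: dict[str, tuple[str, str]],
            registry: Registry, threshold: float = 0.92) -> Optional[str]:
    """Return a canonical id for `record`, or None if nothing is close enough."""
    source, source_id = record["source"], record["source_id"]

    # Pairs resolved on an earlier run stay resolved (assumed to be checked first).
    known = registry.get(source, source_id)
    if known is not None:
        return known

    # Stage 1: direct mapping when the record carries a known MusicBrainz id.
    match = record.get("mbid") if record.get("mbid") in canonical else None

    # Stage 2: fuzzy match on normalised name plus location.
    if match is None:
        probe = normalise(f"{record['name']} {record['location']}")
        best_id, best_score = None, 0.0
        for cid, (name, location) in canonical.items():
            candidate = normalise(f"{name} {location}")
            score = SequenceMatcher(None, probe, candidate).ratio()
            if score > best_score:
                best_id, best_score = cid, score
        match = best_id if best_score >= threshold else None

    # Stage 3: persist the mapping so later runs can skip the matching work.
    if match is not None:
        registry.link(source, source_id, match)
    return match


CANONICAL = {"mbid-1": ("Four Tet", "London"), "mbid-2": ("Bicep", "Belfast")}
registry = Registry()

first = resolve({"source": "skiddle", "source_id": "s-9",
                 "name": "Four-Tet", "location": "london"}, CANONICAL, registry)
# Same source identifier with a garbled name: answered from the registry.
again = resolve({"source": "skiddle", "source_id": "s-9",
                 "name": "4T (live)", "location": ""}, CANONICAL, registry)
unknown = resolve({"source": "skiddle", "source_id": "s-10",
                   "name": "Unlisted Act", "location": "Leeds"}, CANONICAL, registry)
print(first, again, unknown)
```

The second call illustrates why persistence matters: once a source identifier has been linked, a later feed with a degraded name no longer has to clear the fuzzy threshold.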

The trade-off is complexity: the matcher carries non-trivial state and the failure modes (false positives at stage two) are the kind that erode user trust silently. The payoff is data quality that the aggregators do not match.

2.3 Storage

We use a write-through pattern: every successful ingestion writes simultaneously to Supabase (PostgreSQL, the operational store backing the iOS client) and GCS Parquet files (the analytic input feeding BigQuery via dbt). The two systems remain independent: an operational outage does not block the analytic pipeline, and an analytic refactor cannot accidentally mutate production data. The cost is a small write-time latency penalty, which is acceptable given that ingestion is asynchronous from the user's perspective.
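One way to picture the independence property is a dual write in which each sink succeeds or fails on its own. The sketch below is an assumption-laden illustration: `write_through` and the sink callables are hypothetical names, plain Python lists stand in for the Supabase and Parquet writers, and a thread pool is just one possible reading of "simultaneously".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

Sink = Callable[[list[dict]], None]


def write_through(batch: list[dict], sinks: dict[str, Sink]) -> dict[str, Optional[Exception]]:
    """Write `batch` to every sink concurrently; report each sink's outcome separately."""

    def attempt(write: Sink) -> Optional[Exception]:
        try:
            write(batch)
            return None
        except Exception as exc:  # one sink failing must not abort the others
            return exc

    with ThreadPoolExecutor(max_workers=max(1, len(sinks))) as pool:
        futures = {name: pool.submit(attempt, write) for name, write in sinks.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the operational and analytic writers.
analytic_rows: list[dict] = []


def failing_operational(rows: list[dict]) -> None:
    raise ConnectionError("operational store unavailable")


outcome = write_through(
    [{"event_id": "e1"}],
    {"operational": failing_operational, "analytic": analytic_rows.extend},
)
print(outcome)
```

Here the simulated operational outage is reported in `outcome` while the analytic write still lands. A real pipeline would also need a policy for replaying the failed side, which this sketch leaves out.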

BigQuery was chosen over a traditional data lake for cost efficiency at our query volumes and for the SQL ergonomics. The dbt project is layered: staging → intermediate → marts, with data-quality tests at every layer.

2.4 Caching

An earlier version of the pipeline used a 57 MB JSON file as a process-local cache. This was the kind of decision that survives until it doesn't: it was fine for a single worker, problematic for parallel Celery workers, and openly hostile to deployment hygiene. We replaced it with Redis 7.x as a distributed cache, which gives proper eviction semantics and substantially reduces worker memory footprint and warm-start times.
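A shared cache of this kind is commonly used in a cache-aside style, sketched below. The names `cached_lookup`, `FakeRedis`, the key format, and the one-hour TTL are all assumptions for illustration. `FakeRedis` mimics only the `get` and `set(name, value, ex=...)` calls of the redis-py client so the example runs without a server; in a deployment a `redis.Redis` instance would take its place.

```python
import json
import time
from typing import Any, Callable, Optional


class FakeRedis:
    """In-memory stand-in exposing the small subset of the redis-py interface used here."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[str, Optional[float]]] = {}

    def get(self, name: str) -> Optional[str]:
        item = self._data.get(name)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[name]  # expired entries behave as missing
            return None
        return value

    def set(self, name: str, value: str, ex: Optional[int] = None) -> bool:
        expires_at = None if ex is None else time.monotonic() + ex
        self._data[name] = (value, expires_at)
        return True


def cached_lookup(client: Any, key: str, loader: Callable[[], Any],
                  ttl_seconds: int = 3600) -> Any:
    """Return the cached value for `key`, calling `loader` and storing the result on a miss."""
    raw = client.get(key)
    if raw is not None:
        return json.loads(raw)
    value = loader()
    client.set(key, json.dumps(value), ex=ttl_seconds)
    return value


calls: list[int] = []


def load_artist() -> dict:
    calls.append(1)  # count how often the slow path runs
    return {"name": "Four Tet"}


client = FakeRedis()
first_hit = cached_lookup(client, "artist:mbid-1", load_artist)
second_hit = cached_lookup(client, "artist:mbid-1", load_artist)
print(first_hit, second_hit, len(calls))
```

Because the cached state lives outside the worker process, parallel workers share one copy instead of each loading a large file at start-up. Memory-pressure eviction itself is a server-side setting and is not modelled here.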

3 Implementation

Layer	Choice	Why
Orchestration	Apache Airflow 2.10	Observable DAGs, mature retry / alerting.
Operational DB	Supabase (Postgres)	Row-level security, real-time, low ops overhead.
Analytic warehouse	BigQuery	Serverless, cost-efficient, native dbt integration.
Transformations	dbt	Version-controlled, testable SQL, layered architecture.
Cache	Redis 7.x	Distributed, proper eviction, replaces a process-local JSON.
Client	Swift / SwiftUI · iOS 26	Native performance, Apple ecosystem.
Infra	Docker Compose · GitHub Actions	Local reproducibility, CI on every commit.

4 Results

1M+ events resolved across all sources.
300K+ unique artists in the canonical entity table.
6 active API integrations, each with its own DAG.
122 Python unit tests covering pipeline logic.
122 dbt data-quality tests in the warehouse layer.
Zero production data losses across twelve months of operation.

5 Discussion

The single best decision was treating entity resolution as a first-class subsystem with its own persistent state, rather than a per-run heuristic. The cross-source identifier registry compounds: every resolution we do today makes tomorrow's resolutions cheaper and more accurate. The single worst decision was the JSON cache; it took longer to retire than it should have, because it kept appearing to work.

Near-term work splits into two strands. On the platform side: incremental loads to cut BigQuery costs, and a Looker Studio dashboard for pipeline-health observability. On the analytics side: attribution modelling for venue-popularity trends and artist-discovery patterns, which we expect to be useful both to users (recommendations) and to the catalogue (signal for which sources to prioritise). Integration roadmap: Live Nation API for major-promoter coverage, then regional promoters.

6 References

Airflow documentation, Apache Software Foundation, accessed 2026.
MusicBrainz database schema and identifier model, MetaBrainz Foundation.
dbt project structure, dbt Labs, "How we structure our dbt projects".
