Normalizing Music Streaming Data: 830 Values, 19 Unions

In our previous article, we showed what a real label's download folder looks like - 18 distributors, 5 file formats, 500+ files per year. But solving the file format problem is only step one.

Open those files. The data inside is just as messy.

The name problem

Ask 18 distributors what "Spotify" is called. You'll get more answers than you expect:

Distributor	How they label Spotify
FUGA	`Spotify`
ADA	`SPOTIFY`
Ingrooves	`spotify`
The Orchard	`Spotify Premium`
Bandcamp	N/A
MVD	`Spotify`
Emerald	`SPOTIFY`
SFM	`Spotify AB`

That's just one platform. Now multiply across every retailer, every label name, and every service type in the dataset. The same entity appears under different names, different capitalizations, and different abbreviations depending on which distributor sent the file.

830 values, 19 names

The solution is what we call unions - normalization groups that map many raw values to one canonical name.

Loading diagram...

Here's what the union management interface looks like in practice - TIDAL alone has 10 variations, SoundCloud over 30:

Union management interface showing raw value mappings

In production, this system maps 830 raw tag values to just 19 unions across three dimensions:

Retailer unions

Spotify, Amazon, YouTube, Pandora, Facebook, and more. Each with dozens of naming variations across sources.

Label unions

The label itself can appear under different names, abbreviations, or legal entity variations across distributors.

Service unions

Streaming, downloads, physical, sync - each distributor categorizes their services differently.

When a file is imported, every raw value is checked against the union mappings. If it matches, the canonical name is stored alongside the original. If it doesn't, the raw value is preserved - and flagged for review.

Why not just find-and-replace?

Because new values appear constantly. A distributor adds a sub-brand. Another changes their internal naming. A third introduces a typo that persists for six months before anyone notices. Static find-and-replace breaks every time the data evolves.

Unions are dynamic. A non-technical user can add a new mapping through an admin interface - no code changes, no redeployment. The next import picks it up automatically.

One real example: a distributor renamed their Spotify column from "Spotify" to "Spotify AB" mid-year. Without union mapping, this would have created a second "Spotify" in every report, splitting the data and making quarterly comparisons unreliable.

Currency exchange and data normalization

The cherry on top: currencies and territories

Even after normalizing names, two more problems remain.

Currencies

Every distributor reports income in their own currency. And they don't even agree on what to call the currency column:

Distributor	Column name	Default currency
FUGA	`Original currency`	varies
ADA	(file-level)	GBP
Ingrooves	`CURRENCY_CODE`	GBP
The Orchard	`Preferred Currency`	USD
Merlin	`Payable currency`	varies
MVD	`currency-code`	varies

To compare income across sources, every value needs to be converted to a single target currency using exchange rates tied to specific dates. A $1,000 Orchard payment and a £800 ADA payment aren't comparable until you normalize them.

Territories

Countries seem straightforward - until you see how distributors label them:

The same country	Variations across sources
United States	`US`, `USA`, `UNITED STATES`, `United States`
United Kingdom	`UK`, `GB`, `UNITED KINGDOM (GB)`, `United Kingdom`
Czech Republic	`CZ`, `CZECH REPUBLIC`, `Czechia`

Each variation needs to map to a standard two-letter code. Without this, "show me all US streams" misses half the data.

Three layers of normalization

Here's the full picture - what it actually takes to turn raw distributor data into something usable:

Loading diagram...

Most teams get stuck at Layer 1 and never even reach the name and currency problems. But without all three layers, you can't answer basic questions like "What were our total Spotify streams in Q3, in GBP?"

The payoff

After all three normalization layers, that question becomes trivial. Filter by retailer union, filter by date range, and every value is already in GBP. One query. One answer. No spreadsheet wrangling.

And once the data is clean, even non-technical users can get answers. We built an AI-powered analytics dashboard that lets anyone type a question in plain English and get a chart back in seconds, no SQL required.

That's the difference between a collection of files and a data platform. We build the latter.