Case Study·

830 Ways to Say Spotify - Normalizing Music Streaming Data

After solving the file format problem, the data inside is just as messy. Different names for the same platforms, labels, currencies, and territories. Here's how we normalize it.
830 Ways to Say Spotify - Normalizing Music Streaming Data

In our previous article, we showed what a real label's download folder looks like - 18 distributors, 5 file formats, 500+ files per year. But solving the file format problem is only step one.

Open those files. The data inside is just as messy.

The name problem

Ask 18 distributors what "Spotify" is called. You'll get more answers than you expect:

DistributorHow they label Spotify
FUGASpotify
ADASPOTIFY
Ingroovesspotify
The OrchardSpotify Premium
BandcampN/A
MVDSpotify
EmeraldSPOTIFY
SFMSpotify AB

That's just one platform. Now multiply across every retailer, every label name, and every service type in the dataset. The same entity appears under different names, different capitalizations, and different abbreviations depending on which distributor sent the file.

830 values, 19 names

The solution is what we call unions - normalization groups that map many raw values to one canonical name.

Loading diagram...

Here's what the union management interface looks like in practice - TIDAL alone has 10 variations, SoundCloud over 30:

In production, this system maps 830 raw tag values to just 19 unions across three dimensions:

Retailer unions

Spotify, Amazon, YouTube, Pandora, Facebook, and more. Each with dozens of naming variations across sources.

Label unions

The label itself can appear under different names, abbreviations, or legal entity variations across distributors.

Service unions

Streaming, downloads, physical, sync - each distributor categorizes their services differently.

When a file is imported, every raw value is checked against the union mappings. If it matches, the canonical name is stored alongside the original. If it doesn't, the raw value is preserved - and flagged for review.

Why not just find-and-replace?

Because new values appear constantly. A distributor adds a sub-brand. Another changes their internal naming. A third introduces a typo that persists for six months before anyone notices. Static find-and-replace breaks every time the data evolves.

Unions are dynamic. A non-technical user can add a new mapping through an admin interface - no code changes, no redeployment. The next import picks it up automatically.

One real example: a distributor renamed their Spotify column from "Spotify" to "Spotify AB" mid-year. Without union mapping, this would have created a second "Spotify" in every report, splitting the data and making quarterly comparisons unreliable.

The cherry on top: currencies and territories

Even after normalizing names, two more problems remain.

Currencies

Every distributor reports income in their own currency. And they don't even agree on what to call the currency column:

DistributorColumn nameDefault currency
FUGAOriginal currencyvaries
ADA(file-level)GBP
IngroovesCURRENCY_CODEGBP
The OrchardPreferred CurrencyUSD
MerlinPayable currencyvaries
MVDcurrency-codevaries

To compare income across sources, every value needs to be converted to a single target currency using exchange rates tied to specific dates. A $1,000 Orchard payment and a £800 ADA payment aren't comparable until you normalize them.

Territories

Countries seem straightforward - until you see how distributors label them:

The same countryVariations across sources
United StatesUS, USA, UNITED STATES, United States
United KingdomUK, GB, UNITED KINGDOM (GB), United Kingdom
Czech RepublicCZ, CZECH REPUBLIC, Czechia

Each variation needs to map to a standard two-letter code. Without this, "show me all US streams" misses half the data.

Three layers of normalization

Here's the full picture - what it actually takes to turn raw distributor data into something usable:

Loading diagram...

Most teams get stuck at Layer 1 and never even reach the name and currency problems. But without all three layers, you can't answer basic questions like "What were our total Spotify streams in Q3, in GBP?"

The payoff

After all three normalization layers, that question becomes trivial. Filter by retailer union, filter by date range, and every value is already in GBP. One query. One answer. No spreadsheet wrangling.

That's the difference between a collection of files and a data platform. We build the latter.

Let's Build Something Together

Have a similar project in mind? We'd love to hear about it.

Get in touch to discuss how we can help bring your vision to life.