
In our previous article, we showed what a real label's download folder looks like - 18 distributors, 5 file formats, 500+ files per year. But solving the file format problem is only step one.
Open those files. The data inside is just as messy.
Ask 18 distributors what "Spotify" is called. You'll get more answers than you expect:
| Distributor | How they label Spotify |
|---|---|
| FUGA | Spotify |
| ADA | SPOTIFY |
| Ingrooves | spotify |
| The Orchard | Spotify Premium |
| Bandcamp | N/A |
| MVD | Spotify |
| Emerald | SPOTIFY |
| SFM | Spotify AB |
That's just one platform. Now multiply across every retailer, every label name, and every service type in the dataset. The same entity appears under different names, different capitalizations, and different abbreviations depending on which distributor sent the file.
The solution is what we call unions - normalization groups that map many raw values to one canonical name.
Loading diagram...
Here's what the union management interface looks like in practice - TIDAL alone has 10 variations, SoundCloud over 30:

In production, this system maps 830 raw tag values to just 19 unions across three dimensions:
Retailer unions
Spotify, Amazon, YouTube, Pandora, Facebook, and more. Each with dozens of naming variations across sources.
Label unions
The label itself can appear under different names, abbreviations, or legal entity variations across distributors.
Service unions
Streaming, downloads, physical, sync - each distributor categorizes their services differently.
When a file is imported, every raw value is checked against the union mappings. If it matches, the canonical name is stored alongside the original. If it doesn't, the raw value is preserved - and flagged for review.
Because new values appear constantly. A distributor adds a sub-brand. Another changes their internal naming. A third introduces a typo that persists for six months before anyone notices. Static find-and-replace breaks every time the data evolves.
Unions are dynamic. A non-technical user can add a new mapping through an admin interface - no code changes, no redeployment. The next import picks it up automatically.

Even after normalizing names, two more problems remain.
Every distributor reports income in their own currency. And they don't even agree on what to call the currency column:
| Distributor | Column name | Default currency |
|---|---|---|
| FUGA | Original currency | varies |
| ADA | (file-level) | GBP |
| Ingrooves | CURRENCY_CODE | GBP |
| The Orchard | Preferred Currency | USD |
| Merlin | Payable currency | varies |
| MVD | currency-code | varies |
To compare income across sources, every value needs to be converted to a single target currency using exchange rates tied to specific dates. A $1,000 Orchard payment and a £800 ADA payment aren't comparable until you normalize them.
Countries seem straightforward - until you see how distributors label them:
| The same country | Variations across sources |
|---|---|
| United States | US, USA, UNITED STATES, United States |
| United Kingdom | UK, GB, UNITED KINGDOM (GB), United Kingdom |
| Czech Republic | CZ, CZECH REPUBLIC, Czechia |
Each variation needs to map to a standard two-letter code. Without this, "show me all US streams" misses half the data.
Here's the full picture - what it actually takes to turn raw distributor data into something usable:
Loading diagram...
Most teams get stuck at Layer 1 and never even reach the name and currency problems. But without all three layers, you can't answer basic questions like "What were our total Spotify streams in Q3, in GBP?"
After all three normalization layers, that question becomes trivial. Filter by retailer union, filter by date range, and every value is already in GBP. One query. One answer. No spreadsheet wrangling.
That's the difference between a collection of files and a data platform. We build the latter.
Have a similar project in mind? We'd love to hear about it.
Get in touch to discuss how we can help bring your vision to life.
13 Distributors, 5 File Formats, Zero Standards -The Reality of Music Royalty Data
Every month, independent labels receive royalty reports from over a dozen distributors. No two look the same. Here's what that actually looks like.
AI Song Structure Analysis: Intro, Verse, Chorus
A technical look at automatic song structure detection using AI and signal processing. Identify song sections with MTL Audio Locators.
Get music tech insights, case studies, and industry news delivered to your inbox.