Case Study·

Universal Music Data Parser for 20+ Platforms

How we aggregated 200M streaming records from Spotify, Apple Music, YouTube and more into a unified royalty data integration system.
Universal Music Data Parser for 20+ Platforms
CherryRed Records
This case study describes our work with CherryRed Records. Some technical details have been generalized to protect proprietary systems.

The Problem: Data Chaos in Music Industry

It was 2023 when we first sat down with a UK music label facing a common but frustrating problem. Every month, they received royalty reports from their distributors - ADA (Warner), The Orchard (Sony), Merlin, FUGA, Bandcamp, and a dozen others. Each report contained the same basic information: how many times their artists' songs were streamed, and how much money they earned.

Simple, right? Not quite.

ADA sent Excel files with columns named "Release Artist", "Release Title", and "Net Payable". Merlin called them "Artist name", "Track name", and "Payable". The Orchard went with "Artist Name", "Track Name", and "Label Share Net Receipts".

Same data. Different languages.

And that was just the beginning.

The Date Format Nightmare

Consider how different distributors format dates:

DistributorDate FormatExample
ADAExcel serial number45292
ADA (variant)YYYYMM integer202501
The Orchard"YYYYMNN""2025M01"
MerlinHidden in metadata cellsRow 1, Column B
QelloEmbedded in filenameQello_20250101_20250331.xlsx
BelieveISO date string"2025-01-15"

One distributor. Four date formats. And we had 20+ distributors to support.

The Currency Problem

Some distributors report in USD. Others in EUR, GBP, or the original local currency. Some files contain mixed currencies. Some hardcode "USD" even when the actual currency varies by territory.

The "Same But Different" Problem

"UNITED STATES" vs "USA" vs "US" vs "United States of America". All the same country. All different strings that need to be normalized.

"Sony Music" vs "SONY" vs "Sony Music Entertainment" vs "SME". Same label. Four different spellings across different platforms.

The Solution: Adapter Pattern with Field Mapping

We needed a system that could:

  1. Accept any file format (CSV, XLS, XLSX, XLSB)
  2. Map arbitrary column names to a standard schema
  3. Handle date parsing edge cases per distributor
  4. Normalize currencies, countries, and entity names
  5. Scale to hundreds of millions of records

The Architecture

At its core, the system uses the Adapter Pattern - a universal base class that handles all common operations, with specialized adapters for each distributor that define only what's different:

  • Base SDLImporter - handles file parsing, chunked processing, country normalization, currency conversion, and Elasticsearch indexing
  • Distributor Adapters - each adapter contains only a field mapping dictionary and any date/currency parsing overrides specific to that distributor
adapters/ada_adapter.py
class ADAAdapter(BaseSDLImporter):
    FIELD_MAP = {
        "Release Artist": "artist",
        "Release Title":  "track",
        "Net Payable":    "income",
    }

    def parse_date(self, raw):
        # ADA uses Excel serial numbers
        return excel_serial_to_date(raw)
This approach means adding a new distributor typically takes just 2-3 hours of work — define the field mapping, handle any date/currency quirks, and you're done.

Supported Distributors and Adapters

We currently support 25+ adapters covering the major music distribution ecosystem:

Major Distributors

DistributorParent CompanyAdapter Variants
ADAWarner Music5 adapters (Standard, V2, V2Fix, XLS, Custom)
The OrchardSony Music2 adapters (Standard, Legacy)
MerlinIndependent2 adapters (Standard, Historical)
FUGADowntown Music1 adapter
BelieveIndependent2 adapters (Standard, Legacy)
TuneCoreBelieve1 adapter
DistroKidIndependent1 adapter
CD BabyDowntown Music1 adapter

Specialized Platforms

PlatformTypeNotes
BandcampDirect-to-fanCustom CSV format
QelloConcert streamingDate embedded in filename
SoundcloudStreamingMultiple report types
YouTube MusicStreamingContent ID integration
TikTokSocial/UGCEmerging format
Meta (Facebook/Instagram)SocialRights manager exports

Regional Distributors

  • Phonofile (Nordic region)
  • Zebralution (Germany)
  • IDOL (France)
  • Altafonte (Spain/Latin America)

Supported File Formats

The system handles every file format encountered in the music distribution industry:

FormatExtensionUse CaseNotes
CSV.csvMost commonUTF-8 and ISO-8859-1 encoding support
Excel.xlsxStandard Excelopenpyxl with read-only mode
Excel Binary.xlsbLarge filespyxlsb parser, common for 500K+ rows
Legacy Excel.xlsOlder systemsxlrd parser
TSV.tsvSome APIsTab-separated values

File Format Quirks We Handle

  • XLS files containing CSV data - The Orchard and ADA sometimes send .xls files that are actually CSV
  • Mixed encodings - Automatic fallback from UTF-8 to ISO-8859-1
  • Files with metadata headers - Skip non-data rows (Merlin stores dates in row 1)
  • Multi-sheet workbooks - Automatic sheet detection

Scale: 200 Million Records and Growing

Our production system currently manages:

200M+ Streaming records

25+ Active adapters

50+ File format variations

5-10M New records/month

Performance Metrics

OperationPerformance
Single file import (500K rows)2-3 minutes
Report generation5-15 seconds
Full-text search across catalogSub-second
Dashboard aggregationsReal-time

The ability to generate complex royalty reports across 200 million records in just a few seconds is what sets the system apart. This is achieved through:

  • Elasticsearch for aggregations and full-text search
  • Pre-computed rollups for common report dimensions
  • Chunked async processing during import (2000 rows per batch)
  • Celery task queues for parallel processing

The 20 Standard Fields

After analyzing dozens of distributor formats, we defined a universal schema that captures all essential royalty data:

CategoryFields
IdentificationArtist, Track, Product (album), Label, UPC, EAN, ISRC
LocationCountry code (ISO), Territory name
Time PeriodYear, Month, Period begin/end dates
FinancialCurrency, Units (streams/downloads), Unit price, Income, Income type (GROSS/NET)
PlatformRetailer (Spotify, Apple Music, etc.), Service type (streaming, download, sync)

Real-World Challenges We Solved

The ADA Complexity

ADA (Warner's distribution arm) required 5 different adapters due to format inconsistencies:

#AdapterReason
1StandardCurrent production format
2V2New format introduced in 2024
3V2FixEdge case corrections for V2
4XLSLegacy .xls format
5Custom.xls files that actually contain CSV data

Entity Resolution

We maintain mapping databases to normalize inconsistent naming:

"SME"                  →  "Sony Music Entertainment"
"SPOTIFY AB"           →  "Spotify"
"UNITED KINGDOM (GB)"  →  "GB"
Entity resolution turned out to be one of the most valuable features — clean, unified naming across all platforms made reporting dramatically more useful.

Encoding and Format Detection

Every file goes through automatic format detection and encoding fallback — because "just UTF-8" is never the reality in international music data.

Technologies Used

Python Backend

Elasticsearch Search & aggregations

Celery + Redis Async processing

PostgreSQL Relational storage

Docker Containerization

Digital Ocean Infrastructure

Results

After 18 months of development:

  • 200 million records processed and searchable
  • 25+ adapters for major music distributors
  • Report generation in 5-15 seconds regardless of data volume
  • Sub-second search across the entire catalog
  • Zero manual data entry - fully automated pipeline

Development in Numbers

Building a system of this complexity doesn't happen overnight. Here's what the development effort looked like behind the scenes:

727 Commits

1,500+ Files created

84 Active development days

103 Peak commits in a single week

The development followed an intense sprint delivery model — 80% of all commits landed within a focused 3-month window (December 2024 – February 2025), averaging nearly 9 commits per active day. Each distributor adapter went through multiple iterations: for every new feature shipped, approximately 3 follow-up fixes were committed to handle real-world edge cases in date parsing, encoding, and field normalization.

This iterative approach — build, import real data, discover edge cases, fix, repeat — was essential. Music distribution data is messy by nature, and no amount of upfront design can replace the feedback loop of processing actual royalty files from 20+ platforms.

What We Learned

  • Start with abstraction, add workarounds later — The Adapter Pattern gave us a clean base, but real-world data required pragmatic hacks.
  • Date parsing is always harder than you think — We encountered formats we never imagined existed.
  • Entity resolution is a feature, not a bug — Union mapping became one of the most valuable features for reporting.
  • Scale matters from day one — Designing for millions of records upfront saved us from painful rewrites.

What's Next?

We're exploring:

  • ClickHouse as the analytics engine - For the next project of this scale, we would replace Elasticsearch with ClickHouse for analytical queries. Its columnar storage and vectorized execution are purpose-built for aggregating billions of rows, offering significantly faster performance for royalty reporting workloads
  • ML-based column mapping - Automatically detect field mappings for new distributors
  • Real-time streaming ingestion - Move from batch files to API integrations
  • Anomaly detection - Flag unusual patterns in royalty data

Let's Build Something Together

Have a similar project in mind? We'd love to hear about it.

Get in touch to discuss how we can help bring your vision to life.