Case Study·

How We Built a Universal Music Data Parser for 20+ Streaming Platforms

Aggregating 200 million streaming records from Spotify, Apple Music, YouTube and dozens of other platforms - when every distributor speaks a different language. A case study on building a scalable royalty data integration system.
How We Built a Universal Music Data Parser for 20+ Streaming Platforms
NDA

The Problem: Data Chaos in Music Industry

It was 2023 when we first sat down with a UK music label facing a common but frustrating problem. Every month, they received royalty reports from their distributors - ADA (Warner), The Orchard (Sony), Merlin, FUGA, Bandcamp, and a dozen others. Each report contained the same basic information: how many times their artists' songs were streamed, and how much money they earned.

Simple, right?

Not quite.

ADA sent Excel files with columns named "Release Artist", "Release Title", and "Net Payable". Merlin called them "Artist name", "Track name", and "Payable". The Orchard went with "Artist Name", "Track Name", and "Label Share Net Receipts".

Same data. Different languages.

And that was just the beginning.

The Date Format Nightmare

Consider how different distributors format dates:

DistributorDate FormatExample
ADAExcel serial number45292
ADA (variant)YYYYMM integer202501
The Orchard"YYYYMNN""2025M01"
MerlinHidden in metadata cellsRow 1, Column B
QelloEmbedded in filenameQello_20250101_20250331.xlsx
BelieveISO date string"2025-01-15"

One distributor. Four date formats. And we had 20+ distributors to support.

The Currency Problem

Some distributors report in USD. Others in EUR, GBP, or the original local currency. Some files contain mixed currencies. Some hardcode "USD" even when the actual currency varies by territory.

The "Same But Different" Problem

"UNITED STATES" vs "USA" vs "US" vs "United States of America". All the same country. All different strings that need to be normalized.

"Sony Music" vs "SONY" vs "Sony Music Entertainment" vs "SME". Same label. Four different spellings across different platforms.

The Solution: Adapter Pattern with Field Mapping

We needed a system that could:

  1. Accept any file format (CSV, XLS, XLSX, XLSB)
  2. Map arbitrary column names to a standard schema
  3. Handle date parsing edge cases per distributor
  4. Normalize currencies, countries, and entity names
  5. Scale to hundreds of millions of records

The Architecture

At its core, the system uses the Adapter Pattern - a universal base class that handles all common operations, with specialized adapters for each distributor that define only what's different:

  • Base SDLImporter - handles file parsing, chunked processing, country normalization, currency conversion, and Elasticsearch indexing
  • Distributor Adapters - each adapter contains only a field mapping dictionary and any date/currency parsing overrides specific to that distributor

This approach means adding a new distributor typically takes just 2-3 hours of work.

Supported Distributors and Adapters

We currently support 25+ adapters covering the major music distribution ecosystem:

Major Distributors

DistributorParent CompanyAdapter Variants
ADAWarner Music5 adapters (Standard, V2, V2Fix, XLS, Custom)
The OrchardSony Music2 adapters (Standard, Legacy)
MerlinIndependent2 adapters (Standard, Historical)
FUGADowntown Music1 adapter
BelieveIndependent2 adapters (Standard, Legacy)
TuneCoreBelieve1 adapter
DistroKidIndependent1 adapter
CD BabyDowntown Music1 adapter

Specialized Platforms

PlatformTypeNotes
BandcampDirect-to-fanCustom CSV format
QelloConcert streamingDate embedded in filename
SoundcloudStreamingMultiple report types
YouTube MusicStreamingContent ID integration
TikTokSocial/UGCEmerging format
Meta (Facebook/Instagram)SocialRights manager exports

Regional Distributors

  • Phonofile (Nordic region)
  • Zebralution (Germany)
  • IDOL (France)
  • Altafonte (Spain/Latin America)

Supported File Formats

The system handles every file format encountered in the music distribution industry:

FormatExtensionUse CaseNotes
CSV.csvMost commonUTF-8 and ISO-8859-1 encoding support
Excel.xlsxStandard Excelopenpyxl with read-only mode
Excel Binary.xlsbLarge filespyxlsb parser, common for 500K+ rows
Legacy Excel.xlsOlder systemsxlrd parser
TSV.tsvSome APIsTab-separated values

File Format Quirks We Handle

  • XLS files containing CSV data - The Orchard and ADA sometimes send .xls files that are actually CSV
  • Mixed encodings - Automatic fallback from UTF-8 to ISO-8859-1
  • Files with metadata headers - Skip non-data rows (Merlin stores dates in row 1)
  • Multi-sheet workbooks - Automatic sheet detection

Scale: 200 Million Records and Growing

Our production system currently manages:

  • 200+ million streaming records in the database
  • 25+ active distributor adapters
  • 50+ file format variations handled automatically
  • Monthly imports of 5-10 million new records

Performance Metrics

OperationPerformance
Single file import (500K rows)2-3 minutes
Report generation5-15 seconds
Full-text search across catalogSub-second
Dashboard aggregationsReal-time

The ability to generate complex royalty reports across 200 million records in just a few seconds is what sets the system apart. This is achieved through:

  • Elasticsearch for aggregations and full-text search
  • Pre-computed rollups for common report dimensions
  • Chunked async processing during import (2000 rows per batch)
  • Celery task queues for parallel processing

The 20 Standard Fields

After analyzing dozens of distributor formats, we defined a universal schema that captures all essential royalty data:

Identification

  • Artist, Track, Product (album), Label
  • UPC (Universal Product Code), EAN, ISRC

Location

  • Country code (ISO), Territory name

Time Period

  • Year, Month, Period begin/end dates

Financial

  • Currency, Units (streams/downloads), Unit price, Income, Income type (GROSS/NET)

Platform

  • Retailer (Spotify, Apple Music, etc.), Service type (streaming, download, sync)

Real-World Challenges We Solved

The ADA Complexity

ADA (Warner's distribution arm) required 5 different adapters due to format inconsistencies:

  • Current standard format
  • V2 format introduced in 2024
  • V2 edge case fixes
  • Legacy .xls format
  • Files labeled as .xls but containing CSV data

Entity Resolution

We maintain mapping databases to normalize inconsistent naming:

  • Label mapping: "SME" → "Sony Music Entertainment"
  • Retailer mapping: "SPOTIFY AB" → "Spotify"
  • Country mapping: "UNITED KINGDOM (GB)" → "GB"

Encoding and Format Detection

Every file goes through automatic format detection and encoding fallback - because "just UTF-8" is never the reality in international music data.

Technologies Used

  • Python - Backend development
  • Elasticsearch - Search and aggregations
  • Celery - Async task processing
  • Redis - Task queue broker
  • PostgreSQL - Relational data storage
  • Docker - Containerization
  • Google Cloud Platform - Infrastructure

Results

After 18 months of development:

  • 200 million records processed and searchable
  • 25+ adapters for major music distributors
  • Report generation in 5-15 seconds regardless of data volume
  • Sub-second search across the entire catalog
  • Zero manual data entry - fully automated pipeline

What We Learned

  1. Start with abstraction, add workarounds later - The Adapter Pattern gave us a clean base, but real-world data required pragmatic hacks.
  2. Date parsing is always harder than you think - We encountered formats we never imagined existed.
  3. Entity resolution is a feature, not a bug - Union mapping became one of the most valuable features for reporting.
  4. Scale matters from day one - Designing for millions of records upfront saved us from painful rewrites.

What's Next?

We're exploring:

  • ML-based column mapping - Automatically detect field mappings for new distributors
  • Real-time streaming ingestion - Move from batch files to API integrations
  • Anomaly detection - Flag unusual patterns in royalty data

Let's Build Something Together

Have a similar project in mind? We'd love to hear about it.

Get in touch to discuss how we can help bring your vision to life.