
Your data team knows the answer is in the database. Your A&R lead, your finance director, and your label manager do not know how to get it out. This is the gap that costs music companies real money, not in licensing fees or infrastructure, but in decisions delayed, trends missed, and reports that arrive a week too late.
We built an AI-powered analytics dashboard to close that gap. It sits inside MusicData Lab, our royalty analytics platform, and it lets anyone with access type a question in plain English and get a chart back in seconds.
The music industry has a data access problem. Not a data collection problem. Labels and distributors already have millions of rows of streaming data, royalty reports, and territorial breakdowns sitting in their databases. The bottleneck is getting from "I need to know which retailers drove revenue last quarter" to an actual answer.
Today, that journey typically looks like this:

1. A label manager emails the data team with a question.
2. Someone on the team writes a SQL query against the analytics database.
3. The results are exported, pasted into a spreadsheet, and formatted.
4. The report lands in the manager's inbox, often after the decision it was meant to inform.

This loop can take hours, sometimes days. Multiply it across every label, every territory, every reporting period, and you start to see the scale of the problem.
That was the design question behind our AI Dashboard. Instead of routing every data request through a technical team, what if the platform could understand a question like "top 5 artists from the US by income" and return a bar chart?
The workflow is simple:

1. Type a question in plain English.
2. An LLM translates it into a ClickHouse SQL query.
3. A security layer validates the query before it runs.
4. The results come back as a chart in seconds.

No SQL knowledge required. No waiting for the data team. No spreadsheets.
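The whole workflow can be sketched as one request path. Here is a minimal, self-contained sketch with stub components; every name below is illustrative, standing in for the real provider factory, SQL Guard, and ClickHouse client described later in this post:

```python
from dataclasses import dataclass

@dataclass
class StubProvider:
    """Stand-in for an LLM backend; a real provider calls Ollama or Claude."""
    canned_sql: str

    def generate_sql(self, system_prompt: str, question: str) -> str:
        return self.canned_sql

def validate(sql: str) -> str:
    """Stand-in for the SQL guard: reject anything but a SELECT."""
    if not sql.strip().upper().startswith("SELECT"):
        raise ValueError("Only SELECT queries are allowed")
    return sql

def answer_question(provider, question: str) -> dict:
    sql = provider.generate_sql("<system prompt>", question)
    safe_sql = validate(sql)
    rows = []  # in production: a read-only ClickHouse HTTP call
    return {"sql": safe_sql, "rows": rows}  # the frontend turns rows into a chart

provider = StubProvider("SELECT artist FROM analytics_streamdatach LIMIT 5")
result = answer_question(provider, "top 5 artists from the US by income")
```

Each stage is swappable: the provider, the guard, and the query executor only see strings and rows, which is what makes the later provider-switching and security layers possible.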
Music royalty data is uniquely complex. A single label might receive reports from 13 different distributors, each with its own file format, column naming, and date conventions. Once that data is normalised and loaded into an analytics database, the schema reflects that complexity: dozens of fields covering artists, tracks, retailers, territories, currencies, and time periods. Even the retailer names need normalisation before they become queryable.
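That retailer normalisation can be as simple as a lookup table plus a cleanup pass. A minimal sketch; the alias entries below are illustrative, not the production mapping:

```python
# Illustrative aliases: each distributor spells the same retailer differently.
RETAILER_ALIASES = {
    "spotify ab": "Spotify",
    "spotify usa inc": "Spotify",
    "apple music / itunes": "Apple Music",
}

def normalise_retailer(raw: str) -> str:
    """Collapse distributor-specific retailer spellings into one canonical name."""
    key = raw.strip().lower().rstrip(".")
    return RETAILER_ALIASES.get(key, raw.strip())

normalise_retailer("Spotify AB")  # -> "Spotify"
```

The canonical name is what lands in the `retailer_union` column referenced throughout the prompt and schema below.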
This is precisely the kind of dataset where AI-assisted querying shines. The system prompt includes the full database schema, domain-specific hints, and few-shot examples that teach the model how to write correct ClickHouse SQL. We build it dynamically from Django model metadata, so the prompt stays in sync with the schema automatically:
```python
def build_system_prompt() -> str:
    # TABLE and examples_block are module-level: the table name constant
    # and the few-shot SQL examples shown below.
    fields = []
    for field in StreamDataCH._meta.get_fields():
        fields.append(f"  - {field.name} ({field.__class__.__name__})")
    schema_block = "\n".join(fields)
    return (
        "You are a SQL analyst for a music streaming analytics platform.\n"
        f"The database is ClickHouse. There is one table: `{TABLE}`.\n"
        f"\n## Schema\n\n{schema_block}\n"
        "\n## Domain hints\n\n"
        "- `final_income` is income converted to the target currency\n"
        "- `retailer_union` is the normalized retailer name\n"
        "- Always filter out empty strings when grouping by text fields\n"
        "\n## Output rules\n\n"
        "1. Output ONLY a single SELECT statement\n"
        "2. Always include a LIMIT clause (max 100)\n"
        "3. Never use INSERT, UPDATE, DELETE, DROP, or any DDL\n"
        f"\n## Examples\n\n{examples_block}"
    )
```
The few-shot examples are critical. They teach the model patterns like subqueries for hierarchical groupings ("top 5 artists with their retailers by income") that a generic LLM would otherwise flatten into a single GROUP BY:
```sql
-- Example: top 5 artists with their retailers by income
SELECT artist, retailer_union, sum(final_income) AS total_income
FROM analytics_streamdatach
WHERE artist != '' AND retailer_union != ''
  AND artist IN (
    SELECT artist FROM analytics_streamdatach
    WHERE artist != ''
    GROUP BY artist ORDER BY sum(final_income) DESC LIMIT 5
  )
GROUP BY artist, retailer_union
ORDER BY artist, total_income DESC LIMIT 100
```
The result is that even someone who has never seen a database can ask "top 5 artists with their retailers by income" and get back a correctly grouped chart.
Here is where most "AI analytics" solutions stumble. They require sending your data, or at least your queries, to a third-party API. For a music company handling confidential royalty data, artist earnings, and pre-release catalogue information, that is often a non-starter.
Our architecture solves this with a pluggable LLM design:

- **Ollama** runs models entirely on local infrastructure, so royalty data never leaves the building.
- **Claude** (via API) is available when cloud-level quality is acceptable and approved.
Switching between providers is a configuration change, not a code change. The factory pattern reads a single setting and returns the right client:
```python
def get_llm_provider() -> LLMProvider:
    provider = getattr(settings, "AI_DASHBOARD_PROVIDER", "ollama")
    if provider == "claude":
        from .claude import ClaudeProvider
        return ClaudeProvider()
    from .ollama import OllamaProvider
    return OllamaProvider()
```
The same application code, the same security layer, the same chart rendering.
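The pluggable design implies a shared interface that every backend implements. A minimal sketch of what that base class could look like; the method name `generate_sql` is an assumption, only `LLMProvider` appears in the snippet above:

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Common interface for all LLM backends (Ollama, Claude, ...)."""

    @abstractmethod
    def generate_sql(self, system_prompt: str, question: str) -> str:
        """Return a single SELECT statement answering the question."""

class EchoProvider(LLMProvider):
    """Toy implementation, here only to show the contract."""

    def generate_sql(self, system_prompt: str, question: str) -> str:
        return "SELECT 1 LIMIT 1"
```

Because callers only ever see `LLMProvider`, the security layer and chart rendering stay identical no matter which backend the factory returns.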
Letting an AI write SQL that runs against your production database sounds risky. It is, if you do it naively. Our approach layers multiple security controls:
SQL Guard validates every query before execution. Here is the core of it:
```python
import re

BLOCKED_KEYWORDS = [
    "INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
    "TRUNCATE", "CREATE", "GRANT", "REVOKE",
    "ATTACH", "DETACH", "RENAME", "OPTIMIZE", "KILL",
]

class SQLGuard:
    @staticmethod
    def validate(sql: str) -> str:
        # Strip whitespace first, or a trailing "; \n" would survive rstrip(";")
        sql = sql.strip().rstrip(";").strip()
        if not sql.upper().startswith("SELECT"):
            raise SQLGuardError("Only SELECT queries are allowed")
        if ";" in sql:
            raise SQLGuardError("Multiple statements are not allowed")
        sql_upper = sql.upper()
        for keyword in BLOCKED_KEYWORDS:
            if re.search(rf"\b{keyword}\b", sql_upper):
                raise SQLGuardError(f"Forbidden keyword: {keyword}")
        # Only allow the analytics table
        for table in re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql_upper):
            if table.lower() != ALLOWED_TABLE:
                raise SQLGuardError(f"Table '{table}' is not allowed")
        # Enforce LIMIT <= 100
        return _enforce_limit(sql, sql_upper)
```
Key constraints enforced:

- Only SELECT statements are allowed; any INSERT, UPDATE, DELETE, DROP, or DDL keyword is blocked
- Queries may touch only the analytics table
- A LIMIT is enforced automatically

Read-only execution provides a second layer. The ClickHouse query runs over HTTP with safety parameters baked into every request:
```python
response = httpx.post(
    f"http://{host}:{port}",
    params={
        "query": f"{sql} FORMAT JSON",
        "database": "default",
        "readonly": "1",
        "max_execution_time": "10",
        "max_result_rows": "100",
    },
    timeout=15,
)
```
Even if SQL Guard missed something, ClickHouse itself would block writes (readonly=1), kill slow queries (10 seconds), and cap the result set.
This belt-and-suspenders approach means that even if the LLM generates a malicious query (unlikely, but possible), it cannot modify data, access unauthorised tables, or run expensive long-running operations.
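One subtlety worth noting: the `\b` word boundaries in the keyword check keep legitimate column names like `created_at` or `updated_by` from tripping the `CREATE` and `UPDATE` filters. A quick self-contained illustration with a trimmed keyword list:

```python
import re

BLOCKED_KEYWORDS = ["INSERT", "UPDATE", "DELETE", "DROP", "CREATE"]

def has_blocked_keyword(sql: str) -> bool:
    """True if any blocked keyword appears as a whole word."""
    sql_upper = sql.upper()
    return any(re.search(rf"\b{kw}\b", sql_upper) for kw in BLOCKED_KEYWORDS)

has_blocked_keyword("SELECT created_at FROM t LIMIT 1")  # False: no whole-word match
has_blocked_keyword("DROP TABLE t")                      # True
```

Without the boundaries, a harmless query selecting a `created_at` column would be rejected as containing `CREATE`.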
We chose ClickHouse as the analytics engine for MusicData Lab because music streaming data is a textbook columnar analytics workload: append-only, time-series, high-volume, and query-heavy.
A typical label might have 10 to 50 million rows of streaming data, partitioned by month. Common queries aggregate by artist, retailer, territory, or time period. ClickHouse handles these in milliseconds where PostgreSQL would take seconds or minutes.
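Under assumptions about the concrete types (the column names come from the schema hints earlier; the types and partitioning key are illustrative, not the production DDL), such a table might be declared along these lines:

```sql
CREATE TABLE analytics_streamdatach
(
    period_month    Date,
    artist          LowCardinality(String),
    retailer_union  LowCardinality(String),
    country_code    LowCardinality(String),
    final_income    Decimal(18, 6)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(period_month)
ORDER BY (period_month, artist);
```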
Key advantages for music data:

- Columnar storage keeps aggregations over tens of millions of rows fast
- `LowCardinality` encoding suits fields like `retailer_union`, `artist`, and `country_code` that have a bounded set of values
- Monthly partitioning matches how royalty reports arrive

Three insights from the implementation that apply to any company considering AI-powered analytics:
1. The quality of SQL generation depends almost entirely on the system prompt. Including the full schema, domain-specific hints, and few-shot examples made the difference between "sometimes works" and "reliably useful." As shown in the code above, we dynamically build the prompt from Django model metadata, so it stays in sync with schema changes automatically. No manual updates when a column is added or renamed.

2. Running Ollama locally for development and testing removed the biggest adoption barrier: "we can't send data to an external API." Once stakeholders see the value, the conversation about using a cloud API for better quality becomes much easier.

3. The SQL Guard and read-only execution are not just safety nets. They are what made the business comfortable deploying this. When your CFO asks "can this AI delete our data?", the answer needs to be a confident "no, here's why."
This is not a product launch. It is a proof of concept that we built for our own platform and for our clients. If you recognise any of these situations, it might be relevant to you:

- Business users wait days for answers that already sit in your database
- Confidential royalty or earnings data rules out sending queries to a third-party API
- Your data team spends more time producing ad-hoc reports than building the platform
At MusicTech Lab, we build data platforms for the music industry. The AI Dashboard is one piece of a larger system that handles royalty ingestion, normalisation, currency conversion, and reporting. If this resonates, we should talk.
This is a v1. The underlying pattern, natural language to SQL to visualisation, is not limited to music data. Any company with a structured analytics database can benefit from making that data conversational. The technology is ready. The question is whether your organisation is ready to let business users ask their own questions.
Have a similar project in mind? We'd love to hear about it.
Get in touch to discuss how we can help bring your vision to life.