
Your data team knows the answer is in the database. Your A&R lead, your finance director, and your label manager do not know how to get it out. This is the gap that costs music companies real money, not in licensing fees or infrastructure, but in decisions delayed, trends missed, and reports that arrive a week too late.
We built an AI-powered analytics dashboard to close that gap. It sits inside MusicData Lab, our royalty analytics platform, and it lets anyone with access type a question in plain English and get a chart back in seconds.
The music industry has a data access problem. Not a data collection problem. Labels and distributors already have millions of rows of streaming data, royalty reports, and territorial breakdowns sitting in their databases. The bottleneck is getting from "I need to know which retailers drove revenue last quarter" to an actual answer.
Today, that journey typically looks like this:

1. A label manager emails the data team with a question.
2. Someone on the team writes a SQL query against the analytics database.
3. The results are exported, pasted into a spreadsheet, and formatted.
4. The report lands in the manager's inbox, often after the decision it was meant to inform.

This loop can take hours, sometimes days. Multiply it across every label, every territory, every reporting period, and you start to see the scale of the problem.
That was the design question behind our AI Dashboard. Instead of routing every data request through a technical team, what if the platform could understand a question like "top 5 artists from the US by income" and return a bar chart?
The workflow is simple:

1. Type a question in plain English.
2. An LLM translates it into a ClickHouse SQL query.
3. A security layer validates the query before it runs.
4. The results come back as a chart in seconds.

No SQL knowledge required. No waiting for the data team. No spreadsheets.
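The whole workflow can be sketched as one request path. Here is a minimal, self-contained sketch with stub components; every name below is illustrative, standing in for the real provider factory, SQL Guard, and ClickHouse client described later in this post:

```python
from dataclasses import dataclass

@dataclass
class StubProvider:
    """Stand-in for an LLM backend; a real provider calls Ollama or Claude."""
    canned_sql: str

    def generate_sql(self, system_prompt: str, question: str) -> str:
        return self.canned_sql

def validate(sql: str) -> str:
    """Stand-in for the SQL guard: reject anything but a SELECT."""
    if not sql.strip().upper().startswith("SELECT"):
        raise ValueError("Only SELECT queries are allowed")
    return sql

def answer_question(provider, question: str) -> dict:
    sql = provider.generate_sql("<system prompt>", question)
    safe_sql = validate(sql)
    rows = []  # in production: a read-only ClickHouse HTTP call
    return {"sql": safe_sql, "rows": rows}  # the frontend turns rows into a chart

provider = StubProvider("SELECT artist FROM analytics_streamdatach LIMIT 5")
result = answer_question(provider, "top 5 artists from the US by income")
```

Each stage is swappable: the provider, the guard, and the query executor only see strings and rows, which is what makes the later provider-switching and security layers possible.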
Music royalty data is uniquely complex. A single label might receive reports from 13 different distributors, each with its own file format, column naming, and date conventions. Once that data is normalised and loaded into an analytics database, the schema reflects that complexity: dozens of fields covering artists, tracks, retailers, territories, currencies, and time periods. Even the retailer names need normalisation before they become queryable.
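That retailer normalisation can be as simple as a lookup table plus a cleanup pass. A minimal sketch; the alias entries below are illustrative, not the production mapping:

```python
# Illustrative aliases: each distributor spells the same retailer differently.
RETAILER_ALIASES = {
    "spotify ab": "Spotify",
    "spotify usa inc": "Spotify",
    "apple music / itunes": "Apple Music",
}

def normalise_retailer(raw: str) -> str:
    """Collapse distributor-specific retailer spellings into one canonical name."""
    key = raw.strip().lower().rstrip(".")
    return RETAILER_ALIASES.get(key, raw.strip())

normalise_retailer("Spotify AB")  # -> "Spotify"
```

The canonical name is what lands in the `retailer_union` column referenced throughout the prompt and schema below.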
This is precisely the kind of dataset where AI-assisted querying shines. The system prompt includes the full database schema, domain-specific hints, and few-shot examples that teach the model how to write correct ClickHouse SQL. We build it dynamically from Django model metadata, so the prompt stays in sync with the schema automatically:
```python
def build_system_prompt() -> str:
    # TABLE and examples_block are module-level: the table name constant
    # and the few-shot SQL examples shown below.
    fields = []
    for field in StreamDataCH._meta.get_fields():
        fields.append(f"  - {field.name} ({field.__class__.__name__})")
    schema_block = "\n".join(fields)
    return (
        "You are a SQL analyst for a music streaming analytics platform.\n"
        f"The database is ClickHouse. There is one table: `{TABLE}`.\n"
        f"\n## Schema\n\n{schema_block}\n"
        "\n## Domain hints\n\n"
        "- `final_income` is income converted to the target currency\n"
        "- `retailer_union` is the normalized retailer name\n"
        "- Always filter out empty strings when grouping by text fields\n"
        "\n## Output rules\n\n"
        "1. Output ONLY a single SELECT statement\n"
        "2. Always include a LIMIT clause (max 100)\n"
        "3. Never use INSERT, UPDATE, DELETE, DROP, or any DDL\n"
        f"\n## Examples\n\n{examples_block}"
    )
```
The few-shot examples are critical. They teach the model patterns like subqueries for hierarchical groupings ("top 5 artists with their retailers by income") that a generic LLM would otherwise flatten into a single GROUP BY:
```sql
-- Example: top 5 artists with their retailers by income
SELECT artist, retailer_union, sum(final_income) AS total_income
FROM analytics_streamdatach
WHERE artist != '' AND retailer_union != ''
  AND artist IN (
    SELECT artist FROM analytics_streamdatach
    WHERE artist != ''
    GROUP BY artist ORDER BY sum(final_income) DESC LIMIT 5
  )
GROUP BY artist, retailer_union
ORDER BY artist, total_income DESC LIMIT 100
```
The result is that even someone who has never seen a database can ask "top 5 artists with their retailers by income" and get back a correctly grouped chart.
Here is where most "AI analytics" solutions stumble. They require sending your data, or at least your queries, to a third-party API. For a music company handling confidential royalty data, artist earnings, and pre-release catalogue information, that is often a non-starter.
Our architecture solves this with a pluggable LLM design:

- **Ollama** runs models entirely on local infrastructure, so royalty data never leaves the building.
- **Claude** (via API) is available when cloud-level quality is acceptable and approved.
Switching between providers is a configuration change, not a code change. The factory pattern reads a single setting and returns the right client:
```python
def get_llm_provider() -> LLMProvider:
    provider = getattr(settings, "AI_DASHBOARD_PROVIDER", "ollama")
    if provider == "claude":
        from .claude import ClaudeProvider
        return ClaudeProvider()
    from .ollama import OllamaProvider
    return OllamaProvider()
```
The same application code, the same security layer, the same chart rendering.
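The pluggable design implies a shared interface that every backend implements. A minimal sketch of what that base class could look like; the method name `generate_sql` is an assumption, only `LLMProvider` appears in the snippet above:

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Common interface for all LLM backends (Ollama, Claude, ...)."""

    @abstractmethod
    def generate_sql(self, system_prompt: str, question: str) -> str:
        """Return a single SELECT statement answering the question."""

class EchoProvider(LLMProvider):
    """Toy implementation, here only to show the contract."""

    def generate_sql(self, system_prompt: str, question: str) -> str:
        return "SELECT 1 LIMIT 1"
```

Because callers only ever see `LLMProvider`, the security layer and chart rendering stay identical no matter which backend the factory returns.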
Letting an AI write SQL that runs against your production database sounds risky. It is, if you do it naively. Our approach layers multiple security controls:
SQL Guard validates every query before execution. Here is the core of it:
```python
import re

BLOCKED_KEYWORDS = [
    "INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
    "TRUNCATE", "CREATE", "GRANT", "REVOKE",
    "ATTACH", "DETACH", "RENAME", "OPTIMIZE", "KILL",
]

class SQLGuard:
    @staticmethod
    def validate(sql: str) -> str:
        # Strip whitespace first, or a trailing "; \n" would survive rstrip(";")
        sql = sql.strip().rstrip(";").strip()
        if not sql.upper().startswith("SELECT"):
            raise SQLGuardError("Only SELECT queries are allowed")
        if ";" in sql:
            raise SQLGuardError("Multiple statements are not allowed")
        sql_upper = sql.upper()
        for keyword in BLOCKED_KEYWORDS:
            if re.search(rf"\b{keyword}\b", sql_upper):
                raise SQLGuardError(f"Forbidden keyword: {keyword}")
        # Only allow the analytics table
        for table in re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql_upper):
            if table.lower() != ALLOWED_TABLE:
                raise SQLGuardError(f"Table '{table}' is not allowed")
        # Enforce LIMIT <= 100
        return _enforce_limit(sql, sql_upper)
```
Key constraints enforced:

- Only SELECT statements are allowed; any INSERT, UPDATE, DELETE, DROP, or DDL keyword is blocked
- Queries may touch only the analytics table
- A LIMIT is enforced automatically

Read-only execution provides a second layer. The ClickHouse query runs over HTTP with safety parameters baked into every request:
```python
response = httpx.post(
    f"http://{host}:{port}",
    params={
        "query": f"{sql} FORMAT JSON",
        "database": "default",
        "readonly": "1",
        "max_execution_time": "10",
        "max_result_rows": "100",
    },
    timeout=15,
)
```
Even if SQL Guard missed something, ClickHouse itself would block writes (readonly=1), kill slow queries (10 seconds), and cap the result set.
This belt-and-suspenders approach means that even if the LLM generates a malicious query (unlikely, but possible), it cannot modify data, access unauthorised tables, or run expensive long-running operations.
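One subtlety worth noting: the `\b` word boundaries in the keyword check keep legitimate column names like `created_at` or `updated_by` from tripping the `CREATE` and `UPDATE` filters. A quick self-contained illustration with a trimmed keyword list:

```python
import re

BLOCKED_KEYWORDS = ["INSERT", "UPDATE", "DELETE", "DROP", "CREATE"]

def has_blocked_keyword(sql: str) -> bool:
    """True if any blocked keyword appears as a whole word."""
    sql_upper = sql.upper()
    return any(re.search(rf"\b{kw}\b", sql_upper) for kw in BLOCKED_KEYWORDS)

has_blocked_keyword("SELECT created_at FROM t LIMIT 1")  # False: no whole-word match
has_blocked_keyword("DROP TABLE t")                      # True
```

Without the boundaries, a harmless query selecting a `created_at` column would be rejected as containing `CREATE`.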
We chose ClickHouse as the analytics engine for MusicData Lab because music streaming data is a textbook columnar analytics workload: append-only, time-series, high-volume, and query-heavy.
A typical label might have 10 to 50 million rows of streaming data, partitioned by month. Common queries aggregate by artist, retailer, territory, or time period. ClickHouse handles these in milliseconds where PostgreSQL would take seconds or minutes.
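Under assumptions about the concrete types (the column names come from the schema hints earlier; the types and partitioning key are illustrative, not the production DDL), such a table might be declared along these lines:

```sql
CREATE TABLE analytics_streamdatach
(
    period_month    Date,
    artist          LowCardinality(String),
    retailer_union  LowCardinality(String),
    country_code    LowCardinality(String),
    final_income    Decimal(18, 6)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(period_month)
ORDER BY (period_month, artist);
```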
Key advantages for music data:

- Columnar storage keeps aggregations over tens of millions of rows fast
- `LowCardinality` encoding suits fields like `retailer_union`, `artist`, and `country_code` that have a bounded set of values
- Monthly partitioning matches how royalty reports arrive

Three insights from the implementation that apply to any company considering AI-powered analytics:
1. The quality of SQL generation depends almost entirely on the system prompt. Including the full schema, domain-specific hints, and few-shot examples made the difference between "sometimes works" and "reliably useful." As shown in the code above, we dynamically build the prompt from Django model metadata, so it stays in sync with schema changes automatically. No manual updates when a column is added or renamed.

2. Running Ollama locally for development and testing removed the biggest adoption barrier: "we can't send data to an external API." Once stakeholders see the value, the conversation about using a cloud API for better quality becomes much easier.

3. The SQL Guard and read-only execution are not just safety nets. They are what made the business comfortable deploying this. When your CFO asks "can this AI delete our data?", the answer needs to be a confident "no, here's why."
This is not a product launch. It is a proof of concept that we built for our own platform and for our clients. If you recognise any of these situations, it might be relevant to you:

- Business users wait days for answers that already sit in your database
- Confidential royalty or earnings data rules out sending queries to a third-party API
- Your data team spends more time producing ad-hoc reports than building the platform
At MusicTech Lab, we build data platforms for the music industry. The AI Dashboard is one piece of a larger system that handles royalty ingestion, normalisation, currency conversion, and reporting. If this resonates, we should talk.
This is a v1. The underlying pattern, natural language to SQL to visualisation, is not limited to music data. Any company with a structured analytics database can benefit from making that data conversational. The technology is ready. The question is whether your organisation is ready to let business users ask their own questions.
Have a similar project in mind? We'd love to hear about it.
Get in touch to discuss how we can help bring your vision to life.