Transcribing and Sentiment-Scoring 9 Million Call Recordings on £50/mo of VPS

The market for call-centre AI in 2026 is loud. Most of it is sold at three to eight thousand pounds per month for the same outcomes Telebyte's reference deployment produces on a stack that costs roughly fifty pounds a month per transcription worker. This article documents what is actually running, what the trade-offs are, and where the £50/month figure does and does not apply.

The full pipeline is in production today on Telebyte's largest engagement — a UK insurance brokerage with a 540,000-contact CRM, an 8-agent ViciDial floor, and a recording archive that has accumulated over nine million calls. Every recording is transcribed within thirty minutes of completion, sentiment-scored, indexed for full-text search, and surfaced in per-agent and per-campaign dashboards. None of the components are exotic. The discipline is in the orchestration.

What the pipeline actually does

The stack solves three problems that any contact-centre archive at meaningful scale needs solved:

Speech-to-text — converting every call recording (WAV, 8kHz mono) into searchable text within minutes of the call ending.
Sentiment scoring — labelling each transcript with a positive / neutral / negative sentiment plus a confidence score, per call and per turn.
Full-text indexing — making any phrase across the entire 9M-recording archive queryable in under a second.

The output: an operations team that can ask "show me every call in 2025 where the customer used the word 'cancel' within the first thirty seconds, scored negative" and get an answer back as a SQL query, not as a research project.

The compliance value of that capability is its own subject — covered in the FCA call-recording checklist. This article is about the engineering.

The component selection

Three model and storage choices, each made for unsexy reasons:

Whisper tiny INT8 for transcription

OpenAI's Whisper family runs from tiny (39M parameters) to large-v3 (1.55B parameters). Production accuracy on 8kHz UK English call audio looks roughly like this in Telebyte's testing:

| Model | Word error rate (UK call audio) | Real-time factor on a single vCPU | Practical use |
|---|---|---|---|
| tiny (INT8) | ~22% WER | 0.3x (faster than real-time) | Bulk archive transcription |
| base (INT8) | ~18% WER | 0.5x | Better archive transcription |
| small (FP16) | ~14% WER | 1.1x | Real-time supervision |
| medium (FP16) | ~11% WER | 2.5x (slower than real-time) | Compliance review only |
| large-v3 | ~9% WER | 5–8x (GPU-dependent) | Spot-checks |

The headline observation: tiny quantised to INT8 produces a 22 per cent WER on UK 8kHz call audio, which sounds bad until the use case is examined. The transcript exists to make keyword-based search and sentiment scoring possible across nine million calls. For that use case, 22 per cent WER is sufficient. Specific compliance reviews — where the transcript has to be precise — re-run the call through medium or large-v3 on demand.

That two-tier model — tiny for bulk, medium on request — is the cost lever. It is what makes the £50/mo per-worker number defensible.
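For readers who have not worked with the metric: word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic illustration of that definition, not Telebyte's evaluation harness; the function name and the lower-casing step are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            substitution))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("my" -> "me") and one dropped word ("please") in 8 words.
wer = word_error_rate("i would like to cancel my policy please",
                      "i would like to cancel me policy")
```

In that example the transcript scores 25 per cent WER and still contains the word "cancel", which is the property a keyword-search tier depends on.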

RoBERTa for sentiment

The sentiment classifier is cardiffnlp/twitter-roberta-base-sentiment-latest. Inference is fast on CPU (around 10–20ms per turn on a single vCPU), the model is small enough to keep memory-resident on a £20/month box, and the output is a clean three-class classification (positive / neutral / negative) with a confidence score.

The model is trained on Twitter data, which is not call-centre data. The reason it works regardless: contact-centre dialogue tends to use a similar vocabulary of sentiment-bearing phrases ("I'm not happy with…", "thank you for…", "this is ridiculous…") and the model generalises adequately. Telebyte experimented with finance-domain fine-tuning; the accuracy improvement was real but small (3–4 percentage points) and not worth the operational overhead of maintaining a custom model.

The score is per-turn — agent and customer scored separately, each turn timestamped — which makes the per-call aggregate meaningful (a negative call where the customer starts negative and ends positive looks materially different from one where they start positive and end negative).
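One way to turn per-turn scores into a per-call trajectory is to compare the customer's opening turns with their closing turns. The following is a minimal sketch under stated assumptions: the signed mapping of labels to numbers, the weighting by confidence, and the first-half/second-half split are illustrative choices, not a description of how the Telebyte dashboards aggregate.

```python
from statistics import mean

# Hypothetical mapping from the three classes to a signed score.
SCORE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def customer_trajectory(turns):
    """Return (opening, closing) mean customer sentiment, each in [-1, 1].

    `turns` is an iterable of (speaker, sentiment, confidence) tuples in
    call order. Opening and closing are the first and second halves of the
    customer's turns; agent turns are ignored.
    """
    signed = [SCORE[sentiment] * confidence
              for speaker, sentiment, confidence in turns
              if speaker == "customer"]
    if len(signed) < 2:
        raise ValueError("need at least two customer turns")
    half = len(signed) // 2
    return mean(signed[:half]), mean(signed[half:])


call = [
    ("customer", "negative", 1.0),
    ("agent", "neutral", 0.9),
    ("customer", "negative", 0.5),
    ("agent", "positive", 0.8),
    ("customer", "positive", 0.5),
    ("customer", "positive", 1.0),
]
opening, closing = customer_trajectory(call)
```

A call that opens at -0.75 and closes at 0.75 is a recovered call; the same two numbers reversed describe a call that went wrong.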

MariaDB FULLTEXT for search

The transcripts and sentiment scores live in MariaDB tables co-located with the ViciDial database. The full-text search uses MariaDB's built-in FULLTEXT index with InnoDB.

Three reasons this beats every shinier alternative for this use case:

Operational simplicity. The ViciDial team already runs MariaDB. Adding two tables and an index is configuration; adopting Elasticsearch is a project.
Co-located joins. Every search query joins the transcript table against vicidial_log (call metadata) and recording_log (file paths). Doing those joins inside a single database engine is cheaper than the cross-system equivalents.
Backup and replication are already solved. The transcripts inherit the existing MariaDB backup posture for free.

The scale concern — does MariaDB FULLTEXT cope with 9M transcripts? — is a non-issue in practice. The index is around 12GB and queries return in under a second even for unscoped phrase searches. Specialised search engines win on advanced ranking, faceted search, and high-cardinality filtering; for "find this phrase in this date range for this agent", MariaDB FULLTEXT is fast and sufficient.
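A "find this phrase in this date range for this agent" query might look like the sketch below. It uses the table and column names from the schema shown under "The data model"; the `%s` placeholder style (as used by drivers such as PyMySQL) and the helper name are assumptions. In MariaDB's boolean mode, wrapping the search text in double quotes requests an exact-phrase match.

```python
from datetime import date

PHRASE_SEARCH_SQL = """
SELECT recording_id, call_date, agent_id
FROM recording_transcript
WHERE MATCH(full_text) AGAINST (%s IN BOOLEAN MODE)
  AND agent_id = %s
  AND call_date >= %s AND call_date < %s
ORDER BY call_date
"""


def phrase_search_params(phrase: str, agent_id: str, start: date, end: date):
    """Parameters for PHRASE_SEARCH_SQL, with the phrase quoted for exact match."""
    if '"' in phrase:
        raise ValueError("phrase must not contain double quotes")
    return (f'"{phrase}"', agent_id, start.isoformat(), end.isoformat())


params = phrase_search_params("cancel my policy", "A1001",
                              date(2025, 1, 1), date(2025, 2, 1))
# With a DB-API cursor: cursor.execute(PHRASE_SEARCH_SQL, params)
```

Keeping the phrase as a bound parameter rather than formatting it into the SQL string avoids injection through search terms typed by dashboard users.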

The architecture

The pipeline runs as a worker pool. The ViciDial recording engine drops a finished WAV file into a watched directory and writes a row into a recording_queue table; workers pick up rows in priority order and process them through the pipeline.

A single worker box runs:

A Python service polling recording_queue every five seconds.
An invocation of whisper.cpp (the C++ port of Whisper, ~3x faster than the Python reference on CPU) against each new recording.
A second pass through a quantised RoBERTa model running under transformers with bitsandbytes for INT8.
Inserts into recording_transcript (full text plus per-turn segmentation) and recording_sentiment (per-turn sentiment with timestamps).
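The queue-claiming step of such a worker can be sketched as follows. This is an illustration, not the production service: the columns of recording_queue (`id`, `wav_path`, `priority`, `status`) are hypothetical, and SQLite from the Python standard library stands in for MariaDB so the example runs anywhere. The claim is a guarded UPDATE: a worker only takes a row if it is still pending at the moment of the update, so two workers polling at once cannot both win it.

```python
import sqlite3


def make_demo_queue(rows):
    """Build an in-memory stand-in for recording_queue (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recording_queue ("
        "id INTEGER PRIMARY KEY, wav_path TEXT, priority INTEGER, "
        "status TEXT DEFAULT 'pending')"
    )
    conn.executemany(
        "INSERT INTO recording_queue (id, wav_path, priority) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def claim_next(conn):
    """Claim the highest-priority pending row; return (id, wav_path) or None."""
    while True:
        row = conn.execute(
            "SELECT id, wav_path FROM recording_queue "
            "WHERE status = 'pending' "
            "ORDER BY priority DESC, id ASC LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # queue is empty; the caller sleeps and polls again
        cur = conn.execute(
            "UPDATE recording_queue SET status = 'processing' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row  # this worker won the row
        # Another worker claimed it first; loop and try the next candidate.


queue = make_demo_queue([(1, "a.wav", 0), (2, "b.wav", 5), (3, "c.wav", 5)])
claimed = [claim_next(queue) for _ in range(4)]
```

On MariaDB 10.6 or later, `SELECT ... FOR UPDATE SKIP LOCKED` is an alternative way to get the same effect inside a transaction.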

Each worker can process roughly 200 calls per hour on commodity 8-vCPU hardware. The £50/month figure refers to one such worker on a Contabo Cloud VPS L. Three workers (£150/month) handle the inbound flow from a 40,000-call-per-week operation with headroom for backfill.
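The headroom claim can be checked with simple arithmetic. The sketch below assumes workers run around the clock and that calls arrive evenly through the week; real traffic clusters in working hours, so peak-hour utilisation will be higher than this average suggests.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_capacity(workers: int, calls_per_hour: int) -> int:
    """Calls a worker pool can transcribe in a week of continuous running."""
    return workers * calls_per_hour * HOURS_PER_WEEK


def utilisation(weekly_calls: int, workers: int, calls_per_hour: int) -> float:
    """Fraction of weekly capacity consumed by the inbound flow."""
    return weekly_calls / weekly_capacity(workers, calls_per_hour)


capacity = weekly_capacity(workers=3, calls_per_hour=200)
share = utilisation(weekly_calls=40_000, workers=3, calls_per_hour=200)
```

Three workers at 200 calls per hour give 100,800 calls per week, so a 40,000-call week uses roughly 40 per cent of the pool and leaves the remainder for backfill.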

The architecture splits cleanly:

| Tier | Cost | Job |
|---|---|---|
| Transcription workers (3x VPS) | ~£150/mo | Whisper tiny INT8, real-time keep-up |
| Sentiment workers (1x VPS) | ~£25/mo | RoBERTa, per-turn scoring |
| MariaDB FULLTEXT primary | shared with ViciDial | Indexing and search |
| Cold storage (Whisper medium re-runs) | ~£15/mo + GPU-hour billing | Compliance-grade re-transcription |
| Total steady-state | ~£200/mo | 9M recordings, real-time keep-up |

The £50/month headline is per worker. The full pipeline is in the ballpark of £200/month all-in. Either framing is honest; the per-worker number is the one that matters for capacity planning, the all-in number is the one that matters for budget approval.

What it does not do

A short list of things this pipeline deliberately does not attempt, despite being requested in most initial conversations:

Real-time agent guidance. That requires sub-second transcription latency and is a different architecture (smaller chunks, streaming Whisper, in-memory inference). The Telebyte pipeline is post-call, in the thirty-minute-window class.
Diarisation in difficult cases. Whisper has no diarisation of its own; speaker attribution works adequately only when the channel layout does the work (the recording engine writes the agent on the left channel and the customer on the right). When the recording is mixed down to mono, accuracy drops materially. The pipeline solves this by configuring the recording engine to write stereo by channel, a one-line ViciDial change, rather than by sophisticated post-processing.
Non-English transcription. The pipeline runs Whisper in English-only mode (at the tiny and base sizes the .en models are noticeably more accurate on English audio than the multilingual models of the same size). UK call audio that contains Welsh or other UK minority-language segments will transcribe poorly. A separate worker pool with the multilingual model handles those campaigns where they exist.
PII redaction at transcription time. Personally identifiable information (account numbers, addresses, payment cards) is redacted from the original audio during the call via dialplan-level pause-and-resume (see the FCA checklist for the pattern). The transcripts inherit the redaction by being generated from already-redacted audio.

Each of those omissions is deliberate. Each could be added; none is cheap, and none is necessary for the core use case.
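The channel-based speaker attribution described in the diarisation item reduces to de-interleaving a stereo recording. The sketch below assumes 16-bit PCM audio with the agent on the left channel and the customer on the right, as stated above; the function name is hypothetical.

```python
import array
import sys


def split_stereo_pcm16(frames: bytes):
    """Split interleaved 16-bit stereo PCM into (left, right) sample arrays."""
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if len(samples) % 2:
        raise ValueError("odd sample count: not interleaved stereo")
    # Stereo frames alternate left, right, left, right, ...
    return samples[0::2], samples[1::2]


# Four little-endian samples: L=1, R=-1, L=2, R=-2
agent, customer = split_stereo_pcm16(b"\x01\x00\xff\xff\x02\x00\xfe\xff")
```

The raw frames would typically come from `wave.open(path).readframes(n)` in the standard library; each channel can then be written out and transcribed separately, so every segment carries its speaker label for free.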

The data model

The MariaDB schema, simplified to its load-bearing parts:

CREATE TABLE recording_transcript (
  recording_id BIGINT PRIMARY KEY,
  call_date DATETIME NOT NULL,
  agent_id VARCHAR(20) NOT NULL,
  campaign_id VARCHAR(20) NOT NULL,
  duration_seconds INT NOT NULL,
  full_text MEDIUMTEXT NOT NULL,
  model VARCHAR(20) NOT NULL,        -- 'tiny', 'medium', etc.
  word_error_estimate FLOAT,         -- self-reported confidence
  created_at DATETIME NOT NULL,
  KEY (call_date),
  KEY (agent_id, call_date),
  KEY (campaign_id, call_date),
  FULLTEXT KEY (full_text)
) ENGINE=InnoDB;

CREATE TABLE recording_sentiment (
  recording_id BIGINT NOT NULL,
  turn_index INT NOT NULL,           -- 0, 1, 2, … in call order
  speaker ENUM('agent','customer') NOT NULL,
  start_ms INT NOT NULL,             -- offset into the recording
  end_ms INT NOT NULL,
  text TEXT NOT NULL,
  sentiment ENUM('positive','neutral','negative') NOT NULL,
  confidence FLOAT NOT NULL,
  PRIMARY KEY (recording_id, turn_index)
) ENGINE=InnoDB;

Three details worth noting:

The full_text column is MEDIUMTEXT rather than TEXT. UK contact-centre calls run long; some compliance recordings hit fifteen thousand words, which at typical English word lengths is beyond the 65,535-byte ceiling of a TEXT column.
The model column is what makes the two-tier strategy operationally cheap. When a compliance review requires a higher-fidelity transcript, the re-run is invoked, the existing recording_transcript row is replaced, and the model column reflects the new origin. The audit trail of which model produced which transcript is preserved in a separate change-log table.
The composite indexes on (agent_id, call_date) and (campaign_id, call_date) are what make the per-agent and per-campaign dashboards return in under 200ms even at 9M rows. A single column index on call_date alone is not sufficient.

The full schema, including the change-log table, the silent-call flagging, and the legal-hold integration, is around 200 lines of DDL. Most of the engineering above is in the migrations, not the queries.
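Against the simplified schema above, the question posed earlier ("every call in 2025 where the customer used the word 'cancel' within the first thirty seconds, scored negative") could be sketched as below. Several things here are assumptions: the `%s` placeholder style, the helper name, and the reading of "scored negative" as applying to the customer turn that contains the word. Because timestamps are per turn, "within the first thirty seconds" is approximated by the turn's start offset, and the LIKE filter will also match longer forms such as "cancelled".

```python
EARLY_NEGATIVE_KEYWORD_SQL = """
SELECT DISTINCT t.recording_id, t.call_date, t.agent_id
FROM recording_transcript AS t
JOIN recording_sentiment AS s ON s.recording_id = t.recording_id
WHERE MATCH(t.full_text) AGAINST (%s IN BOOLEAN MODE)
  AND t.call_date >= %s AND t.call_date < %s
  AND s.speaker = 'customer'
  AND s.sentiment = 'negative'
  AND s.start_ms < %s
  AND s.text LIKE %s
ORDER BY t.call_date
"""


def early_negative_keyword_params(word: str, year: int, window_seconds: int):
    """Bound parameters for EARLY_NEGATIVE_KEYWORD_SQL."""
    if not word.isalpha():
        raise ValueError("word must be a single alphabetic token")
    return (
        word,                   # FULLTEXT narrows the candidate calls
        f"{year}-01-01",        # inclusive start of the year
        f"{year + 1}-01-01",    # exclusive end of the year
        window_seconds * 1000,  # start_ms is stored in milliseconds
        f"%{word}%",            # the word must appear in that specific turn
    )


query_params = early_negative_keyword_params("cancel", 2025, 30)
```

The FULLTEXT predicate does the heavy lifting on the transcript table; the per-turn conditions then run only over the sentiment rows of the calls that matched.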

What can go wrong

The pipeline has been in production for eighteen months. The genuine failure modes Telebyte has hit:

Queue backlog under exceptional load. A campaign producing 8,000 calls in one evening exceeds the per-worker throughput; the queue grows. The fix is horizontal — add a worker for the spike — but recognising the condition takes monitoring. The recording_queue table has a tracked age-of-oldest-row metric; alert when that exceeds ninety minutes.
Whisper hallucinations on silent recordings. Whisper sometimes produces plausible-looking text on near-silent audio (a recording where the customer never picked up, or where the line dropped early). The mitigation is a pre-filter: if the audio energy in the recording is below a threshold, skip transcription and write a silent flag instead.
MariaDB FULLTEXT lock contention during index rebuilds. Adding to a large FULLTEXT index is cheap; rebuilding one is not. The procedure for adding new columns to the transcripts table requires a maintenance window. Plan it.
Disk pressure on the transcription workers. Workers download the audio file to local disk, process it, then delete it. A worker that crashes mid-job leaves a partial file. A nightly cleanup of /tmp (or wherever the working directory is) is mandatory rather than nice-to-have.
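The silent-recording pre-filter can be as simple as a root-mean-square energy check. The sketch below uses only the Python standard library and assumes 16-bit PCM WAV input; the threshold of 100 is a placeholder that would need calibrating against real recordings, not a value taken from the Telebyte deployment.

```python
import array
import math
import sys
import wave


def pcm16_rms(samples) -> float:
    """Root-mean-square amplitude of a sequence of 16-bit samples."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def wav_rms(path: str) -> float:
    """RMS amplitude of a 16-bit PCM WAV file, all channels pooled."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return pcm16_rms(samples)


def is_silent(path: str, threshold: float = 100.0) -> bool:
    """True when the recording's energy is below the (hypothetical) threshold."""
    return wav_rms(path) < threshold
```

A worker would call `is_silent(path)` before invoking the transcriber and write the silent flag instead of a transcript when it returns True.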

The pattern in all four cases is the same: the pipeline runs steadily at scale once the operational discipline is in place, and the operational discipline is a half-day of monitoring work plus a runbook.
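As an example of how small that monitoring work is, the age-of-oldest-row check for the queue backlog case fits in a few lines. The sketch assumes the monitor can read the enqueue timestamps of pending rows; how they are fetched, and where the alert is sent, is left out.

```python
from datetime import datetime, timedelta

ALERT_AFTER = timedelta(minutes=90)  # the ninety-minute threshold from the text


def oldest_row_age(enqueued_at, now: datetime):
    """Age of the oldest pending row, or None when the queue is empty."""
    pending = list(enqueued_at)
    if not pending:
        return None
    return now - min(pending)


def backlog_alert(enqueued_at, now: datetime, limit: timedelta = ALERT_AFTER) -> bool:
    """True when the oldest pending row has waited longer than `limit`."""
    age = oldest_row_age(enqueued_at, now)
    return age is not None and age > limit


now = datetime(2026, 1, 5, 20, 0)
stuck = backlog_alert([datetime(2026, 1, 5, 18, 0), datetime(2026, 1, 5, 19, 45)], now)
healthy = backlog_alert([datetime(2026, 1, 5, 19, 0)], now)
```

An empty queue returns no alert, which keeps the check quiet overnight when no calls are being recorded.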

Why this matters commercially

The "AI for call centres" SaaS market price is materially higher than the engineering cost of the underlying capability. The pipeline above is not magic. It is open-source models stitched into a queue-and-worker pattern with a sensibly chosen database. Any competent in-house team can build it; most do not, because the SaaS option is a faster way to demonstrate something to a board.

Telebyte's view, having both built this and looked at the commercial alternatives: the SaaS products are not bad products. They are reasonable engineering at significant gross-margin uplift. For a UK contact centre at any scale beyond hobbyist, the make-versus-buy calculation flips quickly. Against a £4,000/month SaaS bill, the pipeline above pays for itself in less than two months, and it remains the operator's property afterwards.

Telebyte also offers this capability commercially: the Managed Pro tier includes the transcription and sentiment pipeline as a managed service. The economics are still better than any SaaS equivalent, with the operational overhead removed.

The minimum viable variant

For an operation considering whether to build something like this themselves, the minimum viable pipeline is materially simpler than what Telebyte runs in production:

One VPS (8 vCPU, 16GB RAM) — ~£35/month.
Whisper tiny INT8 via whisper.cpp.
A bash script polling a directory and writing transcripts to text files.
grep for search; promote to MariaDB FULLTEXT once the volume justifies it.

That minimum variant handles up to a few thousand recordings a month, accumulates technical debt, and is fine for a small operation testing the value of the capability before committing to the full architecture. The full architecture above becomes worthwhile at around 20,000 recordings per month and stays economic up to several hundred thousand.
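The polling step of the minimum variant is sketched here in Python rather than bash, to stay in one language with the rest of the examples. The binary path and model filename are assumptions (the whisper.cpp command-line tool has been named both `main` and `whisper-cli` across versions); the `-m`, `-f`, `-otxt` and `-of` flags select the model, the input file, plain-text output, and the output path without extension.

```python
from pathlib import Path


def pending_recordings(directory):
    """WAV files in `directory` that do not yet have a sibling .txt transcript."""
    root = Path(directory)
    return sorted(p for p in root.glob("*.wav")
                  if not p.with_suffix(".txt").exists())


def whisper_command(wav, binary="./whisper-cli",
                    model="models/ggml-tiny.en.bin"):
    """Argument list for one whisper.cpp run writing <stem>.txt beside the audio.

    `binary` and `model` are hypothetical paths; adjust to the local build.
    """
    wav = Path(wav)
    return [binary, "-m", model, "-f", str(wav),
            "-otxt", "-of", str(wav.with_suffix(""))]


cmd = whisper_command("b.wav")
```

A loop that calls `subprocess.run(whisper_command(p), check=True)` for each path from `pending_recordings`, then sleeps, is the whole service. Note that whisper.cpp has historically expected 16 kHz 16-bit WAV input, so 8 kHz call recordings may need resampling first; check the behaviour of the version in use.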

The point is that the floor is low. Anyone who has been quoted four figures per month for "AI call transcription" should look at this article carefully before signing.

Want help building, or buying, a transcription and sentiment pipeline for your call recordings? Tell Telebyte what you're trying to do — Telebyte runs both the engineering and the managed-service version.
