Debt collection automation: why we replaced LiveKit and cut false endpointing from 15% to 2%
AI & Automation
May 11, 2026
8 min read
LiveKit is the default framework for building AI voice agents. We started with it for debt collection automation. Two weeks later, we replaced it with a custom three-thread pipeline. False endpointing dropped from 15% to 2%. Here is the full architecture.
Key takeaways
- LiveKit could not preserve LangGraph session state across dialogue phase transitions. The integration library returned session IDs but not the session itself, breaking graph resumption after every interruption.
- Our custom pipeline uses three async threads (listener, generator, controller) to eliminate data loss during interruptions. Before this architecture, overlapping speech occurred on roughly 20% of turns, and every one of those user utterances was lost entirely.
- TurnSense (a 135M-parameter end-of-utterance model) reduced false endpointing from 10-15% to under 2%, adding only 70-90ms of latency. Without it, the bot interrupted users mid-sentence on every sixth or seventh exchange.
- Phase routing via LLM tool calls eliminated double latency. Our initial approach used a separate LLM call for routing decisions, doubling the time to first response. Switching to simultaneous streaming + tool call routing cut one full LLM round-trip from every turn.
- Debt collection automation requires policy-driven turn-taking. During FDCPA-mandated disclosures, the system must disable interruption handling entirely. Standard voice frameworks do not support this.
The project: an AI agent that negotiates payment
The client needed a voice bot that calls customers about overdue payments. Not a reminder bot that reads a script and hangs up. A system that verifies identity, states the balance, and if the person pushes back, negotiates a payment plan on the spot according to specific business rules.
The call flow has five phases:
- Greeting (scripted, pre-recorded)
- Identity verification (driver’s license or other document)
- Payment notification (inform the customer of the amount owed)
- Negotiation (if the customer refuses: offer partial payment plans according to specific business rules)
- Resolution (process payment or schedule a callback with agreed terms)
The bot has to prioritize getting full payment. If the customer pushes back, it follows a rule-based escalation: installments first, then adjusted timelines, then scheduling a callback. Every phase has its own prompt, its own constraints, and its own compliance requirements.
The hard constraint: response latency under 1.5 seconds. At 2 seconds of silence, people assume the line is dead. At 3 seconds, they hang up. We learned that the hard way.
Why LiveKit failed in two weeks
LiveKit works well for simple voice agents. One system prompt, one conversational loop, no complex state. For that, it is genuinely good.
Our project was not that. We needed multi-phase dialogue with state preservation across interruptions. That is where things fell apart.
The LangGraph session problem
We used LangGraph to manage dialogue phases. LangGraph models conversations as state machines: each phase is a node, transitions between phases are conditional edges, and the entire conversation state (history, parsed entities, current phase) persists across the graph.
LangGraph supports a mechanism called Human-In-The-Loop, where the graph freezes its state, waits for external input, then resumes. In a voice bot, this is how turn-taking works: the graph freezes when the bot starts speaking, preserves its state, and resumes when the user responds.
The problem was in the LiveKit-LangGraph integration library. When the graph froze, the library saved the session ID but not the session state itself. On resumption, the system could not restore where it was. Every interruption corrupted the dialogue flow.
We modified the library to return the full session parameters on interruption. That fixed the immediate crash. But it exposed a deeper problem.
Interruptions destroyed data
This was the real problem. LiveKit processes audio sequentially: when the bot is generating a response, it stops listening.
Think about what that means on a real call. The bot starts explaining the balance. The customer says “Wait, no, that’s not right.” The bot does not hear it. It finishes its sentence, then generates a follow-up based on silence. The customer has to repeat themselves. The whole conversation feels off.
We measured this: roughly 20% of the time, users spoke while the bot was responding. Every single one of those interactions lost the user’s speech. The transcripts got dirtier with each lost phrase, the LLM got confused, and the model started repeating itself or asking questions that had already been answered.
We tried to fix it inside LiveKit. The framework does not expose the audio capture loop during generation. To change that behavior, we would have had to rewrite the core pipeline. At that point, we were not using LiveKit anymore. We were maintaining a fork of it.
Two weeks in, we stopped trying to fix it and started building our own.
The three-thread architecture
The core problem was simple once we saw it: LiveKit alternates between listening and speaking. Real conversations do not work that way. People talk over each other constantly. You have to listen and generate at the same time.
We split everything into three concurrent async threads using Python’s asyncio:
Thread 1: Listener
This thread does one thing: it reads from the audio stream and never stops. The bot could be mid-sentence, mid-generation, completely idle. Does not matter. The listener is always recording what the user says.
It runs TenVAD on the incoming audio to find speech segments. When it detects voice, audio gets buffered. When 300ms of silence passes, the buffer goes to Whisper V3 on Fireworks AI for transcription.
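Here is a minimal sketch of that loop. The audio source, the `vad.is_speech` check, and the `transcribe` call are hypothetical wrappers around the stream, TenVAD, and the hosted Whisper endpoint; the real pipeline adds error handling and backpressure.

```python
import asyncio

FRAME_MS = 20       # audio frame size fed to the VAD
SILENCE_MS = 300    # silence window that closes an utterance

async def listener(audio_stream, vad, transcribe, transcripts: asyncio.Queue):
    """Read audio forever, segment speech with the VAD, emit finalized transcripts."""
    buffer = bytearray()
    silence = 0
    async for frame in audio_stream:            # never stops, even while the bot is speaking
        if vad.is_speech(frame):
            buffer.extend(frame)
            silence = 0
        elif buffer:
            silence += FRAME_MS
            if silence >= SILENCE_MS:           # 300ms of silence closes the segment
                text = await transcribe(bytes(buffer))
                await transcripts.put(text)     # hand off to the controller
                buffer.clear()
                silence = 0
```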
Thread 2: Generator
Receives finalized transcripts from the listener and runs the full response pipeline:
- Pass transcript to the LangGraph state machine
- LangGraph determines the current phase and generates a response via GPT-4
- Stream the response text directly to ElevenLabs TTS
- Return synthesized audio for playback
The generator only activates when the listener produces a complete transcript. It never touches the audio stream directly.
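A compressed sketch of that stage, with `run_graph` (the LangGraph turn, yielding response text chunks) and `synthesize` (ElevenLabs streaming TTS) as hypothetical async wrappers:

```python
import asyncio

async def generator(transcripts: asyncio.Queue, audio_out: asyncio.Queue,
                    run_graph, synthesize):
    """Turn a finalized transcript into synthesized speech."""
    while True:
        transcript = await transcripts.get()
        async for text_chunk in run_graph(transcript):         # phase routing + GPT-4 response
            async for audio_chunk in synthesize(text_chunk):   # stream text into TTS as it arrives
                await audio_out.put(audio_chunk)                # playback consumes this queue
```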
Thread 3: Controller
Manages the other two threads and handles all interruption logic. When the listener detects new speech while the generator is active, the controller:
- Signals the generator to abort the current TTS stream
- Waits for the listener to finalize the new transcript (which includes both the pre-interruption buffer and the new speech)
- Sends the combined transcript to the generator for a new response
The controller also manages TenVAD state transitions and coordinates with TurnSense for end-of-utterance detection.
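The interruption path looks roughly like this. The event names, `start_generation` (launches one response and returns a task), and `abort_tts` are illustrative, not the production API:

```python
import asyncio

async def controller(speech_events: asyncio.Queue, transcripts: asyncio.Queue,
                     start_generation, abort_tts):
    """Coordinate the other two threads and handle barge-in."""
    current = None                                 # the in-flight generation task, if any
    while True:
        event = await speech_events.get()
        if event == "speech_started" and current and not current.done():
            abort_tts()                            # stop playback mid-sentence
            current.cancel()                       # kill the in-flight LLM/TTS pipeline
        elif event == "transcript_ready":
            transcript = await transcripts.get()   # includes the pre-interruption buffer
            current = asyncio.create_task(start_generation(transcript))
```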
What this actually fixed
Before: 20% of user interactions during bot speech lost data. The transcripts were a mess and the model kept getting confused.
After: data loss dropped to near zero. The listener never stops, so nothing gets missed. If the user interrupts, the controller catches it, kills the current generation, and feeds the full context (old buffer plus the new speech) back into the generator.
The trade-off is complexity. Three threads sharing state through asyncio queues. Race conditions we had to find and fix one by one. It took about two weeks to build and stabilize. Worth it.
The 0.5-second pause problem
Even with the three-thread architecture running cleanly, one problem kept coming back.
People pause when they talk. They think for half a second, then keep going. Completely normal. But that pause crosses the VAD’s 300ms silence threshold, so it decides the person is done talking. It triggers transcription, the generator spins up a response, and then the person keeps talking. The bot just interrupted them.
This happened on roughly 10-15% of interactions. About every sixth or seventh exchange, the bot would jump in while the user was still mid-thought. The transcripts looked terrible because each interruption got marked as <interrupted> in the context, and after a few of those, the LLM started getting confused about what the user actually said.
Why raising the VAD threshold does not work
The obvious fix: bump the silence threshold from 300ms to 800ms. Wait longer before deciding the user stopped.
The problem is that delay hits every single turn. Every response comes 500ms later. When your total budget is 800-1500ms, you cannot add half a second to 100% of turns to fix a problem on 15% of turns. The math does not work.
TurnSense: a semantic check before the bot speaks
We added TurnSense, a 135-million-parameter transformer built on SmolLM2. It looks at the text coming out of STT (not the audio itself) and predicts whether that text looks like a finished thought or a mid-sentence pause.
We tested a few end-of-utterance models on a small manually labeled dataset. TurnSense had the best results. Published benchmarks show 97.5% accuracy on the TURNS-2K dataset (2,000 conversational samples with backchannels and disfluencies).
It added 70-90ms per check. In exchange, false endpointing dropped from 10-15% to roughly 2%. The remaining failures happened in genuinely ambiguous spots where even a person listening would not be sure if the speaker was done.
How it works now: TenVAD detects 300ms of silence. Before the controller triggers generation, it passes the current transcript to TurnSense. If TurnSense says “not finished,” the system keeps listening. If it says “done,” generation starts.
This is faster than raising the VAD threshold because TurnSense only runs when silence is detected. Most of the time, the user is speaking continuously and TurnSense is idle. A higher threshold would delay every turn regardless.
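The gate itself is small. A sketch, assuming a hypothetical `turnsense.is_complete` wrapper around the classifier:

```python
async def on_silence(transcript: str, turnsense, start_generation):
    """Runs only when TenVAD has reported 300ms of silence.

    The check costs roughly 70-90ms and only fires at candidate turn boundaries.
    """
    if await turnsense.is_complete(transcript):
        await start_generation(transcript)   # finished thought: respond
    # otherwise do nothing: the user is mid-sentence, keep listening
```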
LangGraph for phase-based dialogue
Building a LangGraph voice agent is a different problem than building a text chatbot. Most voice bots use a single system prompt. The LLM handles the whole conversation in one context window. That falls apart when your conversation has distinct phases, each with different rules, different compliance requirements, and strict transition logic that the model cannot be trusted to infer on its own.
Five phases, one state machine
Each dialogue phase is a LangGraph node with its own prompt and constraints:
Greeting: Scripted. Pre-recorded audio. No LLM involvement. This is the only phase that uses a pre-recorded phrase. Average response time: near zero.
Identity verification: The bot asks for a driver’s license number or other identifying document. The LLM must extract and validate specific data formats. Strict rules prevent the bot from proceeding until verification succeeds.
Payment notification: The bot informs the customer of the outstanding amount. In the US market, this phase must include the FDCPA mini-Miranda disclosure. During this disclosure, the system disables interruption handling entirely. The bot must complete the legally mandated statement without the user’s speech cutting it short.
Negotiation: The most complex phase. If the customer refuses to pay, the bot follows a priority-based negotiation strategy: full payment first, then partial payment plans, then scheduled callbacks. The LLM must reason about payment terms while staying within the client’s business rules.
Resolution: Either processes the payment or schedules a callback. The bot confirms the agreed terms and ends the call.
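Wired up, the state machine looks roughly like this. It is a sketch: node bodies are stubbed, the state schema is illustrative, and the greeting plays from a pre-recorded file before the graph is ever invoked.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class CallState(TypedDict):
    transcript: str     # latest user utterance
    history: list       # running dialogue history
    phase: str          # phase chosen by the routing tool call
    verified: bool      # identity verification result

# Node bodies stubbed; each real node runs its own prompt and constraints.
def verify(state: CallState) -> CallState: return state     # extract and validate ID data
def notify(state: CallState) -> CallState: return state     # amount owed + mini-Miranda
def negotiate(state: CallState) -> CallState: return state  # rule-based escalation
def resolve(state: CallState) -> CallState: return state    # payment or scheduled callback

def route(state: CallState) -> str:
    return state["phase"]   # forward-only: completed phases are never offered again

builder = StateGraph(CallState)
for name, node in [("verify", verify), ("notify", notify),
                   ("negotiate", negotiate), ("resolve", resolve)]:
    builder.add_node(name, node)

builder.set_entry_point("verify")   # greeting is pre-recorded, played before the graph runs
builder.add_conditional_edges("verify", route, {"verify": "verify", "notify": "notify"})
builder.add_conditional_edges("notify", route, {"notify": "notify", "negotiate": "negotiate",
                                                "resolve": "resolve"})
builder.add_conditional_edges("negotiate", route, {"negotiate": "negotiate", "resolve": "resolve"})
builder.add_edge("resolve", END)

# The checkpointer is what lets the graph freeze on each turn and resume with state intact.
graph = builder.compile(checkpointer=MemorySaver())
```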
How routing works (and the mistake that doubled our latency)
First approach: after each user turn, send the transcript to one LLM call to decide whether to stay in the current phase or move to the next. Then send a second LLM call to actually generate the response.
Two sequential LLM calls per turn. At 400-800ms per call, the bot sat in silence for over a second before it even started speaking. Users noticed immediately.
We fixed this with tool call routing. The LLM now generates a streaming response AND makes the phase transition decision via tool calls in a single pass. It talks to the user while simultaneously deciding whether to advance. One call instead of two. The savings are significant: instead of waiting for two sequential calls, the model starts streaming immediately while routing happens in the background. Latency dropped back within the 800-1500ms target.
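A sketch of the single-pass pattern using the OpenAI Python client. The tool schema, the `on_text` callback, and the phase names are simplified for illustration, not our production prompt.

```python
from openai import OpenAI

client = OpenAI()

PHASE_TOOL = {
    "type": "function",
    "function": {
        "name": "set_phase",
        "description": "Advance the dialogue to the next phase once its goal is met.",
        "parameters": {
            "type": "object",
            "properties": {"next_phase": {"type": "string",
                                          "enum": ["verify", "notify", "negotiate", "resolve"]}},
            "required": ["next_phase"],
        },
    },
}

def stream_turn(messages, on_text):
    """Stream spoken text to TTS while collecting the routing decision in the same pass."""
    stream = client.chat.completions.create(
        model="gpt-4", messages=messages, tools=[PHASE_TOOL], stream=True,
    )
    tool_args = ""
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:                       # spoken text: forward to TTS immediately
            on_text(delta.content)
        for call in delta.tool_calls or []:     # routing decision arrives alongside the text
            if call.function and call.function.arguments:
                tool_args += call.function.arguments
    return tool_args   # e.g. '{"next_phase": "negotiate"}', handed to the LangGraph router
```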
Handling off-script users
When a user says something completely unrelated to the current phase (asks about the weather, makes a joke, tries to change the subject), the LangGraph router keeps them in the current phase. The LLM responds with something like “I understand, but I need to discuss this specific matter with you” and redirects the conversation.
The router does not allow backward transitions. Once the identity verification phase is complete, the user cannot be routed back to it. The conversation only moves forward through the phases.
What the pipeline costs to run
Nobody publishes real cost numbers for voice AI in debt collection. Here is what our stack actually costs based on current provider pricing:
| Component | Provider | Cost |
|---|---|---|
| STT | Whisper V3 on Fireworks AI | ~$0.0015/min audio |
| LLM | GPT-4 (streaming) | ~$0.03/1K input, $0.06/1K output tokens |
| TTS | ElevenLabs | ~$0.10-0.30/min (varies by plan and model) |
| VAD + TurnSense | Local inference | Infrastructure cost only |
| Infrastructure | Python asyncio server | Variable |
The LLM is the largest variable cost. Each conversational turn involves one streaming LLM call (response generation + tool call routing combined). Post-call, additional LLM calls may handle summarization and status extraction. At an average call duration of 3-5 minutes with 8-12 conversational turns, LLM costs per call range from $0.10 to $0.40 depending on transcript length and negotiation complexity.
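As a back-of-the-envelope check against those numbers (the per-turn token counts are assumptions for illustration, not measured values):

```python
# Rough per-call estimate for a 4-minute, 10-turn call using the rates in the table.
minutes, turns = 4, 10
stt = minutes * 0.0015                            # Whisper V3 on Fireworks AI
tts = minutes * 0.20                              # ElevenLabs at a mid-range plan rate
llm = turns * (800 * 0.03 + 100 * 0.06) / 1000    # ~800 input + ~100 output tokens per turn
print(round(stt + tts + llm, 2))                  # ~1.11 per call, dominated by TTS
```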
The critical cost advantage over human agents: one server handles multiple concurrent calls. A human agent handles one. At 10,000+ calls per month, the per-call cost of automated debt collection drops below the human equivalent, even accounting for the AI stack.
Compliance is architecture, not configuration
Deploying an AI voice agent for US debt collection means operating under three overlapping regulatory frameworks: FDCPA, TCPA, and Regulation F.
The FCC ruled in February 2024 that AI-generated voice calls fall under the TCPA’s restrictions on artificial or prerecorded voice messages. This means every call requires prior express written consent. The system cannot cold-call.
Regulation F enforces the 7-in-7 rule: no more than seven calls within seven consecutive days about a specific debt. It also restricts calling hours to 8 AM through 9 PM in the consumer’s local timezone.
The FDCPA mandates the mini-Miranda disclosure: the bot must state that “this is an attempt to collect a debt” and that “any information obtained will be used for that purpose.” This disclosure must be delivered in full. If the user interrupts mid-disclosure and the bot stops speaking to acknowledge the interruption, the disclosure is incomplete. That is a federal violation.
Standard voice frameworks cannot handle this. You need the ability to selectively disable interruption handling during legally mandated speech, then re-enable it for normal conversation. That means low-level control over the VAD and TTS abort mechanisms, which managed platforms do not expose.
Our implementation: during compliance-critical audio segments, the controller thread raises the speech detection threshold to maximum, effectively requiring near-shouting levels of audio to trigger an interruption. The mini-Miranda plays in full. Once the disclosure is complete, normal thresholds resume. This level of pipeline control is only possible because we own every layer of the audio processing stack. Managed platforms abstract these controls away.
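In code, this is a narrow toggle on our own controller rather than a platform setting. A sketch, where `controller.vad_threshold` and `play_tts` are illustrative names, not a LiveKit or TenVAD API:

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def disclosure_mode(controller, locked_threshold=0.99):
    """Make barge-in practically impossible while mandated speech plays."""
    normal = controller.vad_threshold
    controller.vad_threshold = locked_threshold    # near-shouting levels required to interrupt
    try:
        yield
    finally:
        controller.vad_threshold = normal          # normal turn-taking resumes

# Usage: the mini-Miranda always plays to completion.
# async with disclosure_mode(controller):
#     await play_tts(MINI_MIRANDA_TEXT)
```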
What I would do differently on day one
Skip LiveKit entirely.
We spent two weeks with it, another week patching the LangGraph integration, and still ended up building everything from scratch. Three weeks of work that produced nothing we kept. The custom three-thread pipeline took two weeks to build. If we had started there, we would have shipped three weeks earlier.
LiveKit is the right tool for simple voice agents: a single prompt, a single loop, no state management. For teams evaluating a LiveKit voice agent setup at that level of complexity, it genuinely works well. But the moment you need multi-phase dialogue, LangGraph integration, or policy-driven turn-taking, you need a LiveKit alternative. For us, that alternative was writing our own orchestration layer in Python.
Frequently asked questions
What is LiveKit and why do developers use it for voice agents?
LiveKit is an open-source WebRTC media server and real-time agent framework. It handles audio routing, connection management, and basic turn detection for voice AI applications. Developers use it because it dramatically reduces the time to build a working voice agent prototype. OpenAI, Skydio, and Assort Health use LiveKit infrastructure in their products. The limitation appears when production requirements demand deep control over interruption handling, state management, or regulatory compliance that the framework does not expose.
How does automated debt collection with an AI voice agent work?
An AI voice agent calls the customer, verifies their identity, delivers legally required disclosures, informs them of the outstanding balance, and either processes payment or negotiates a payment plan. The conversation is managed by a state machine (we use LangGraph) that enforces phase transitions and business rules. The system must comply with FDCPA, TCPA, and Regulation F requirements, including the mini-Miranda disclosure, calling hour restrictions, and the 7-in-7 contact limit.
What response latency does a voice agent need?
Our target was 800-1500ms from end of user speech to start of bot speech. Industry benchmarks confirm that latency above 2 seconds causes callers to assume the line is dead. Pre-recorded phrases (we use one for the greeting) achieve near-zero latency. LLM-generated responses through the full pipeline (STT + LLM + TTS) average 800-1500ms depending on response complexity and model load.
What is TurnSense and what does it do?
TurnSense is a 135-million-parameter transformer model (built on SmolLM2 architecture) that predicts whether a transcribed utterance is a complete thought or a mid-sentence pause. Published benchmarks show 97.5% accuracy. In our pipeline, it reduced false endpointing from 10-15% to approximately 2%, adding only 70-90ms of processing time. It runs only when the VAD detects silence, so it adds no latency to turns where the user speaks continuously.
What regulations apply to AI voice agents in US debt collection?
Three overlapping frameworks: the Fair Debt Collection Practices Act (FDCPA) prohibits harassment and mandates the mini-Miranda disclosure. The Telephone Consumer Protection Act (TCPA) requires prior express written consent for AI-generated calls (per FCC’s February 2024 ruling that AI voices qualify as artificial or prerecorded). Regulation F enforces the 7-in-7 calling limit and restricts contact hours to 8 AM through 9 PM in the consumer’s local timezone.
Build voice agents that handle real conversations
This pipeline was not built for demos. It was built for regulated, multi-phase conversations where a missed disclosure is a federal violation and a missed utterance is a failed negotiation.
If your team is building voice AI for debt collection, payment reminders, or any application where compliance and conversation quality cannot be traded for development speed, talk to our engineering team.
The backend infrastructure and call classification architecture behind our other voice agent projects are documented separately.
We build AI systems with Python engineers who understand real-time audio, regulatory constraints, and the failure modes that only surface when you dial real phone numbers.