Printing Press AI

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Original reporting by Hugging Face

Over half the world's population speaks more than one language, and for many speakers code-switching (blending languages mid-sentence) is a natural way to communicate. It is common in enterprise settings, from customer support to IT helpdesks, yet little dedicated research has evaluated how voice agents handle it. Prompted by a customer's need to support a code-switching user base, we developed a new benchmark and dataset to assess Automatic Speech Recognition (ASR) systems. ASR is the first step in any voice agent pipeline, so transcription errors made there propagate downstream with real operational consequences.

Our benchmark covers four key language pairs (Spanish-English, French-English, Canadian French-English, German-English) within Human Resources and IT Service Management scenarios. We measure model performance using Word Error Rate (WER) for exact accuracy, alongside Semantic Word Error Rate (SWER) and Answer Error Rate (AER) to gauge meaning preservation for downstream tasks. We evaluated seven ASR systems, including frontier Large Audio Language Models (LALMs) and open-source solutions.
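To make the first of these metrics concrete, here is a minimal sketch of the standard WER calculation: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name, the lowercase-and-split normalization, and the sample utterance are illustrative assumptions, not the benchmark's actual scoring code. SWER and AER are not defined in this summary, so they are not sketched.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Normalization here (lowercasing, whitespace split) is an assumption;
    real benchmarks usually apply a more careful text normalizer.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical Spanish-English utterance with one misrecognized English word.
reference = "necesito resetear mi password por favor"
hypothesis = "necesito resetear mi pasword por favor"
wer = word_error_rate(reference, hypothesis)  # one substitution over six words
```

Because WER counts every mismatched word equally, a harmless misspelling and a meaning-changing error score the same, which is why the benchmark pairs it with meaning-oriented metrics.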

Key Findings

Our findings reveal that the performance cost of code-switching varies significantly across language pairs and models. ElevenLabs Scribe V2, Google Gemini 3 Flash, and Assembly AI Universal 3-Pro emerged as the top performers, demonstrating surprising robustness to bilingual input and showing only a small degradation compared to monolingual speech. These results suggest that for leading ASR systems, code-switching is becoming a manageable condition, though careful benchmarking for specific language pairs remains crucial.

Code-switching, historically a hard case for speech recognition, is increasingly within reach of frontier ASR systems. The top models (ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro) handle language shifts with only a small degradation relative to monolingual speech. Semantic accuracy often holds even when word-level errors occur, which is encouraging for downstream applications. The analysis also found that the likelihood of a transcription error is linked to how often the speaker switches languages, while the severity of errors correlates with the overall density of code-mixing. Errors also concentrate disproportionately in the English segments of code-switched utterances, which hints at complex interactions when a model adapts to an embedded language.
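The distinction between switch frequency and code-mixing density can be illustrated with a small sketch. This summary does not say how the benchmark measures either quantity, so the definitions below are assumptions chosen for illustration: frequency is the number of adjacent token pairs whose language differs, and density is the share of tokens outside the utterance's most common language. The per-token language tags are also hypothetical inputs.

```python
from collections import Counter


def switch_stats(tags: list[str]) -> tuple[int, float]:
    """Return (switch_count, mixing_density) for per-token language tags.

    Both definitions are illustrative assumptions, not the benchmark's own:
    - switch_count: adjacent token pairs with different language tags
    - mixing_density: fraction of tokens not in the most common language
    """
    if not tags:
        raise ValueError("tags must not be empty")
    switch_count = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    dominant_count = Counter(tags).most_common(1)[0][1]
    mixing_density = 1 - dominant_count / len(tags)
    return switch_count, mixing_density


# Hypothetical tags for "necesito resetear mi password por favor".
tags = ["es", "es", "es", "en", "es", "es"]
switches, density = switch_stats(tags)  # two switches, one of six tokens embedded
```

Under these definitions the two measures can diverge: a single long English clause inside a Spanish sentence yields few switches but high density, while several isolated English words yield many switches at moderate density.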

Advancing Inclusive AI

These findings matter beyond technical metrics. For global enterprises, reliable code-switching ASR means greater operational efficiency and, more importantly, a more natural experience for bilingual customers, who can speak in the way they find most comfortable. That removes a significant barrier on the way to inclusive AI systems that reflect the world's linguistic diversity. Future research should examine which contextual factors within embedded-language segments trigger errors, and should expand benchmarks to more language pairs and to naturally spoken, non-synthetic code-switched audio. Continued work is needed before ASR adapts fluidly to multilingual speech, so that language diversity is a strength rather than a hurdle for AI.

Intro and outro generated by Printing Press AI from the source article above. Always consult the original reporting for verbatim quotes and primary sources.