Part 4 (Appendix). Build Your Own Real-Time Translator - TTS via Bluetooth
Now supports 4 languages: English, Japanese, Spanish, and Chinese — bidirectional translation between any pair. Source code available on GitHub.
This is a supplementary post to the Intent-First Translation series. The previous three posts covered the core system — the silence problem, streaming LLM implementation, and local GPU testing. This one documents something different: an experiment in adding audio output to the translation pipeline.
To set expectations honestly: half of it worked, and half was blocked by browser limitations. But the code and the lessons learned may be useful as reference material for anyone working with Web Speech API, mobile TTS, or Bluetooth audio routing.
The Goal
Apple has shipped live translation on AirPods Pro, and dedicated translation devices such as the Vasco E1 offer similar capabilities through specialized hardware and proprietary ecosystems.
These products deliver polished, integrated experiences. The question I wanted to answer was more modest: how far can you get with publicly available APIs and a pair of ordinary Bluetooth earbuds?
The target pipeline was straightforward:
```
Other person speaks English
  → Phone captures audio
  → Real-time translation (~2 seconds)
  → Read translation aloud in Japanese
  → Translation heard through Bluetooth earbuds
```
TTS Implementation: The Core Code
For the TTS engine, I chose the browser’s built-in Web Speech API. No API charges, no network latency, and no additional dependencies. For a prototype, it is the fastest path to audio output.
Here is the complete speakTranslation function and the state setup around it:
```typescript
import { useState, useRef, useEffect, useCallback } from 'react';

// --- inside the component ---

// State setup
const [ttsEnabled, setTtsEnabled] = useState(false);
const ttsEnabledRef = useRef(ttsEnabled);
useEffect(() => { ttsEnabledRef.current = ttsEnabled; }, [ttsEnabled]);

// Mobile detection for rate adjustment
const isMobile = /iPhone|iPad|Android/i.test(navigator.userAgent);
const ttsRate = isMobile ? 1.3 : 3.0;

// Track last spoken text to prevent duplicate reads
const lastSpokenRef = useRef<string>('');

const speakTranslation = useCallback((text: string) => {
  if (!ttsEnabledRef.current) return;
  if (lastSpokenRef.current === text) return; // Duplicate prevention
  lastSpokenRef.current = text;

  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'ja-JP';
  utterance.rate = ttsRate;
  window.speechSynthesis.speak(utterance);
}, [ttsRate]);
```
Two design decisions in this code are worth explaining:
ttsEnabledRef uses a ref instead of reading ttsEnabled state directly. This function is called from WebSocket message callbacks, not from React event handlers. If I read state directly, the closure would capture a stale value — whatever ttsEnabled was when the WebSocket handler was first created. The ref always points to the current value.
lastSpokenRef prevents the same translation from being read aloud twice. This matters because both translation_partial and intent messages can carry the same translation text. Without this guard, a single sentence could be spoken back-to-back.
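The stale-closure problem behind the first decision can be reproduced without React. In the sketch below, each call to `render` stands in for a React re-render handing the component a fresh state snapshot; the handler is created once, as `ws.onmessage` would be. The names (`render`, `savedHandler`) are illustrative, not from the actual codebase:

```typescript
// Each "render" receives a fresh snapshot of state, like a React function
// component does. A handler created on the first render keeps that render's
// snapshot forever; a shared ref object always exposes the latest value.

type Ref<T> = { current: T };

let savedHandler: (() => boolean) | null = null;

function render(ttsEnabled: boolean, ttsEnabledRef: Ref<boolean>) {
  ttsEnabledRef.current = ttsEnabled; // what the useEffect in the post does
  if (!savedHandler) {
    // Registered once (like ws.onmessage): captures this render's snapshot.
    savedHandler = () => ttsEnabled;
  }
}

const ref: Ref<boolean> = { current: false };
render(false, ref); // first render: TTS off, handler created
render(true, ref);  // user toggles TTS on: new render, handler NOT recreated

console.log(savedHandler!()); // false — stale snapshot from the first render
console.log(ref.current);     // true  — the ref reflects the current state
```

Reading `ttsEnabledRef.current` inside `speakTranslation` is the equivalent of the second `console.log`: the callback stays fresh without being recreated.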
Integrating TTS into the Translation Pipeline
The speakTranslation function is called inside the WebSocket message handler, specifically when an intent (complete translation) message arrives:
```typescript
case 'intent': {
  setCurrentIntent(message.data);
  setPartialIntentLabel('');

  // TTS: read translation aloud (duplicates prevented inside speakTranslation)
  if (message.data.full_translation) {
    speakTranslation(message.data.full_translation);
  }

  // ... rest of timeline update
  break;
}
```
TTS fires on the intent message because it contains the confirmed full_translation — the most accurate version of the translation. Firing on partial translations would risk reading incomplete or soon-to-be-revised text aloud.
The TTS Toggle with Audio Unlock
The toggle button does more than flip a boolean. On mobile browsers, the first activation must include a silent utterance to unlock the audio context:
```tsx
{/* TTS toggle button */}
<button
  onClick={() => {
    if (ttsEnabled) {
      // Turning OFF: cancel any ongoing speech
      window.speechSynthesis.cancel();
    } else {
      // Turning ON: unlock audio for mobile browsers
      // Mobile browsers require a user gesture to enable speechSynthesis
      const unlock = new SpeechSynthesisUtterance('');
      unlock.volume = 0;
      window.speechSynthesis.speak(unlock);
    }
    setTtsEnabled(!ttsEnabled);
  }}
  className={`px-4 py-2 rounded-lg text-sm font-medium transition-colors ${
    ttsEnabled
      ? 'bg-amber-500 text-white'
      : 'bg-gray-300 text-gray-600'
  }`}
>
  {ttsEnabled ? '🔊 TTS ON' : '🔇 TTS OFF'}
</button>
```
The silent utterance — new SpeechSynthesisUtterance('') with volume set to zero — is the key mechanism. When the user taps the button, this runs within the user gesture context, which satisfies the browser’s autoplay policy. All subsequent speak() calls from WebSocket callbacks are then permitted. Without this initial unlock, every programmatic TTS call would be silently blocked.
Problem 1: Platform-Specific TTS Rate
Web Speech API is a browser standard, but the underlying TTS engine varies by operating system.
| Setting | PC (Windows Chrome) | iPhone Chrome |
|---|---|---|
| rate=1.1 | Normal speed | Normal speed |
| rate=1.8 | Slightly fast | Quite fast |
| rate=3.0 | Just right | Too fast to understand |
The same rate=3.0 was incomprehensibly fast on iPhone. The cause: all browsers on iPhone (Chrome, Safari, Firefox) use Apple’s WebKit engine internally, and TTS processing runs through Apple’s proprietary speech engine. A “standard” API does not guarantee consistent behavior across platforms.
The fix is the mobile detection shown in the core code above:
```typescript
const isMobile = /iPhone|iPad|Android/i.test(navigator.userAgent);
const ttsRate = isMobile ? 1.3 : 3.0;
```
A simple check, but necessary. Without it, translations on mobile were unintelligible.
Problem 2: Silent on Mobile
TTS that worked on desktop was completely silent on iPhone.
The cause was mobile browsers’ autoplay restrictions. speechSynthesis.speak() only works when called from a direct user gesture (a tap). Calls originating from WebSocket callbacks are not considered user-initiated and are silently blocked — no error, no warning, just silence.
The solution was the audio unlock mechanism shown in the toggle button code above. On the user’s first tap of the TTS ON button, the silent utterance new SpeechSynthesisUtterance('') unlocks the audio context. After this single gesture, all programmatic speak() calls function normally.
Problem 3: Some Translations Not Read Aloud
Some translations appeared on screen but were never spoken.
The issue was in the TTS trigger condition. Initially, TTS only fired on confirmed results (is_final=true). But in the experimental streaming mode, partial translations (is_final=false) could also appear on screen. When a partial translation was displayed and then overwritten by the next utterance before the confirmed result arrived, TTS never fired for that segment.
The fix: fire TTS whenever translation text exists, regardless of the is_final flag. The duplicate prevention logic inside speakTranslation (the lastSpokenRef check) prevents the same sentence from being read twice, so there is no risk of repeated audio even when TTS fires on both partial and confirmed results.
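The guard is small enough to isolate. Here is the same dedupe logic as a standalone function — the React code keeps `lastSpoken` in a ref, while this sketch uses a closure variable, but the behavior is identical:

```typescript
// Standalone version of the lastSpokenRef guard: returns true only when the
// text should actually be handed to speechSynthesis.speak().
function makeSpeakGuard() {
  let lastSpoken = '';
  return (text: string): boolean => {
    if (text === lastSpoken) return false; // same sentence arrived again
    lastSpoken = text;
    return true;
  };
}

const shouldSpeak = makeSpeakGuard();
console.log(shouldSpeak('こんにちは')); // true  — first occurrence: speak it
console.log(shouldSpeak('こんにちは')); // false — partial and intent carried the same text
console.log(shouldSpeak('元気ですか')); // true  — new sentence
```

This is why firing on every message with translation text is safe: the guard, not the trigger condition, is what prevents double playback.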
Bluetooth Earbuds Testing
After resolving these issues, I connected Bluetooth earbuds to an iPhone for real-device testing.
What Worked
- Speaking English myself, hearing Japanese translation in earbuds — this worked.
The earbud microphone captured my voice, the system translated it, and Japanese audio played through the earbuds. The delay between translation text appearing on screen and the audio starting was within about 1 second.
What Did Not Work
- Capturing someone else’s English through the earbud microphone — this failed.
Bluetooth earbud microphones are designed to capture the wearer’s voice. Noise cancellation actively suppresses ambient sound, optimized for phone calls. The microphone could not pick up the voice of someone sitting across from me.
This is not a software problem. It is a physical design constraint. AirPods Pro and dedicated translation earbuds use beamforming and multi-microphone arrays specifically to overcome this limitation — specialized hardware that general-purpose earbuds do not have.
Attempting Input/Output Device Separation
To capture the other person’s voice, the phone’s built-in microphone is the obvious choice. I attempted to fix the audio input to the built-in microphone while routing output to Bluetooth earbuds.
I implemented a microphone selection UI and specified the built-in microphone via getUserMedia with a deviceId constraint.
It failed. iOS browsers (WebKit) do not support explicit audio input device selection. The device specification was either ignored or the audio connection dropped immediately. There was no workaround within the browser environment.
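For reference, this is roughly what the attempt looked like. The device-picking helper below is pure and shown with mock data; the label heuristic and all names are illustrative assumptions, not the actual implementation. The `getUserMedia` constraint at the end is the step iOS WebKit does not honor:

```typescript
// Sketch of input-device selection — works on desktop Chrome, fails on iOS.

interface DeviceInfo { deviceId: string; kind: string; label: string; }

// Heuristic: pick the first audio input that is not the Bluetooth headset.
// Label matching is an assumption for illustration; real labels vary by OS.
function pickBuiltInMic(devices: DeviceInfo[]): DeviceInfo | undefined {
  return devices.find(
    (d) => d.kind === 'audioinput' && !/bluetooth|airpods|headset/i.test(d.label)
  );
}

// Mock of what navigator.mediaDevices.enumerateDevices() might return:
const mockDevices: DeviceInfo[] = [
  { deviceId: 'bt-1', kind: 'audioinput', label: 'AirPods Pro' },
  { deviceId: 'builtin-1', kind: 'audioinput', label: 'iPhone Microphone' },
  { deviceId: 'out-1', kind: 'audiooutput', label: 'AirPods Pro' },
];
console.log(pickBuiltInMic(mockDevices)?.deviceId); // builtin-1

// In the browser, the chosen deviceId would then be passed as a constraint —
// and this is what iOS WebKit ignores or drops:
// await navigator.mediaDevices.getUserMedia({
//   audio: { deviceId: { exact: 'builtin-1' } },
// });
```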
Browser Limits vs Native App
At this point, the boundaries of what the browser can do became clear.
| Feature | Browser | Native App |
|---|---|---|
| TTS Audio Output | Possible (with limitations) | Possible (unrestricted) |
| Input/Output Device Separation | Not possible | Possible |
| Background Operation | Not possible | Possible |
An iOS native app can use the AVAudioSession API to control input and output devices independently. The architecture would look like this:
```
Other person speaks English
  → Phone built-in mic (AVAudioSession)
  → Deepgram → LLM → Translation (~2 seconds)
  → iOS TTS (AVSpeechSynthesizer)
  → Bluetooth earbuds play Japanese translation
```
The backend — FastAPI, Deepgram, LLM — can be reused entirely. Only the frontend needs replacement: React Native or Swift instead of a web browser.
Latency Measurements
Here are the latency numbers from testing:
| Metric | Measurement |
|---|---|
| End of speech → Translation text displayed | Average 2,115ms |
| Translation text → TTS playback complete (short) | Within ~1 second |
| End of speech → Translation audio heard | ~3 seconds |
For reference, professional simultaneous interpreters typically lag by 2 to 3 seconds. For short utterances, this prototype achieves comparable latency. For longer sentences, the gap widens — but as a baseline measurement, these numbers provide a useful starting point.
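The ~3 second end-to-end figure is simply the sum of the two measured stages. As a sanity check on the table (the 1,000 ms value is the rough "within ~1 second" figure, not a precise measurement):

```typescript
// Sum the two measured stages from the table above.
const speechEndToTextMs = 2115; // end of speech → translation text displayed
const textToAudioMs = 1000;     // translation text → TTS playback (short utterances)

const endToEndMs = speechEndToTextMs + textToAudioMs;
console.log(`~${(endToEndMs / 1000).toFixed(1)}s end of speech → audio heard`); // ~3.1s
```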
What This Experiment Confirmed
- Real-time translation audio playback is achievable with public APIs — though platform-specific handling is required for consistent behavior.
- Standard Bluetooth earbuds work for translation audio output — no specialized hardware needed on the output side.
- Browsers cannot separately control audio input and output devices — an iOS WebKit limitation with no current workaround.
- A native app can achieve “built-in mic for listening + earbuds for translation audio” — AVAudioSession provides the device control that browsers lack.
It is not possible to reproduce everything that dedicated products offer through personal development. Hardware-dependent capabilities — beamforming for ambient sound capture, multi-microphone arrays — are beyond what software alone can address.
But for the architecture of “capture audio with the phone’s built-in mic, deliver translation audio through earbuds,” public APIs and open technology are sufficient to build a functional prototype. The code in this post — the TTS function, the audio unlock mechanism, the integration with the WebSocket pipeline — works as documented.
The lessons from this experiment — platform-specific TTS behavior, mobile browser audio restrictions, Bluetooth routing constraints — are practical reference material for anyone considering a similar implementation, whether in the browser or as a native app.