Part 4 (Appendix). Build Your Own Real-Time Translator - TTS via Bluetooth
Now supports 4 languages: English, Japanese, Spanish, and Chinese — bidirectional translation between any pair. Source code available on GitHub.
This is a supplementary post to the Intent-First Translation series. The previous three posts covered the core system — the silence problem, streaming LLM implementation, and local GPU testing. This one documents something different: an experiment in adding audio output to the translation pipeline.
To set expectations honestly: half of it worked, and half was blocked by browser limitations. But the code and the lessons learned may be useful as reference material for anyone working with Web Speech API, mobile TTS, or Bluetooth audio routing.
The Goal
Apple has shipped live translation on AirPods Pro, and dedicated translation devices such as the Vasco E1 offer similar capabilities through specialized hardware and proprietary ecosystems.
These products deliver polished, integrated experiences. The question I wanted to answer was more modest: how far can you get with publicly available APIs and a pair of ordinary Bluetooth earbuds?
The target pipeline was straightforward:
```
Other person speaks English
  → Phone captures audio
  → Real-time translation (~2 seconds)
  → Read translation aloud in Japanese
  → Translation heard through Bluetooth earbuds
```
TTS Implementation: The Core Code
For the TTS engine, I chose the browser’s built-in Web Speech API. No API charges, no network latency, and no additional dependencies. For a prototype, it is the fastest path to audio output.
Here is the complete speakTranslation function and the state setup around it:
```typescript
import { useState, useRef, useEffect, useCallback } from 'react';

// --- inside the component ---

// State setup
const [ttsEnabled, setTtsEnabled] = useState(false);
const ttsEnabledRef = useRef(ttsEnabled);
useEffect(() => { ttsEnabledRef.current = ttsEnabled; }, [ttsEnabled]);

// Mobile detection for rate adjustment
const isMobile = /iPhone|iPad|Android/i.test(navigator.userAgent);
const ttsRate = isMobile ? 1.3 : 3.0;

// Track last spoken text to prevent duplicate reads
const lastSpokenRef = useRef<string>('');

const speakTranslation = useCallback((text: string) => {
  if (!ttsEnabledRef.current) return;
  if (lastSpokenRef.current === text) return; // Duplicate prevention
  lastSpokenRef.current = text;

  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = 'ja-JP';
  utterance.rate = ttsRate;
  window.speechSynthesis.speak(utterance);
}, [ttsRate]);
```
Two design decisions in this code are worth explaining:
ttsEnabledRef uses a ref instead of reading ttsEnabled state directly. This function is called from WebSocket message callbacks, not from React event handlers. If I read state directly, the closure would capture a stale value — whatever ttsEnabled was when the WebSocket handler was first created. The ref always points to the current value.
lastSpokenRef prevents the same translation from being read aloud twice. This matters because both translation_partial and intent messages can carry the same translation text. Without this guard, a single sentence could be spoken back-to-back.
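The stale-closure problem behind the first decision can be reproduced without React. In the sketch below, each call to `render` stands in for a React re-render handing the component a fresh state snapshot; the handler is created once, as `ws.onmessage` would be. The names (`render`, `savedHandler`) are illustrative, not from the actual codebase:

```typescript
// Each "render" receives a fresh snapshot of state, like a React function
// component does. A handler created on the first render keeps that render's
// snapshot forever; a shared ref object always exposes the latest value.

type Ref<T> = { current: T };

let savedHandler: (() => boolean) | null = null;

function render(ttsEnabled: boolean, ttsEnabledRef: Ref<boolean>) {
  ttsEnabledRef.current = ttsEnabled; // what the useEffect in the post does
  if (!savedHandler) {
    // Registered once (like ws.onmessage): captures this render's snapshot.
    savedHandler = () => ttsEnabled;
  }
}

const ref: Ref<boolean> = { current: false };
render(false, ref); // first render: TTS off, handler created
render(true, ref);  // user toggles TTS on: new render, handler NOT recreated

console.log(savedHandler!()); // false — stale snapshot from the first render
console.log(ref.current);     // true  — the ref reflects the current state
```

Reading `ttsEnabledRef.current` inside `speakTranslation` is the equivalent of the second `console.log`: the callback stays fresh without being recreated.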
Integrating TTS into the Translation Pipeline
The speakTranslation function is called inside the WebSocket message handler, specifically when an intent (complete translation) message arrives:
```typescript
case 'intent': {
  setCurrentIntent(message.data);
  setPartialIntentLabel('');

  // TTS: read translation aloud (duplicates prevented inside speakTranslation)
  if (message.data.full_translation) {
    speakTranslation(message.data.full_translation);
  }

  // ... rest of timeline update
  break;
}
```
TTS fires on the intent message because it contains the confirmed full_translation — the most accurate version of the translation. Firing on partial translations would risk reading incomplete or soon-to-be-revised text aloud.
The TTS Toggle with Audio Unlock
The toggle button does more than flip a boolean. On mobile browsers, the first activation must include a silent utterance to unlock the audio context:
```tsx
{/* TTS toggle button */}
<button
  onClick={() => {
    if (ttsEnabled) {
      // Turning OFF: cancel any ongoing speech
      window.speechSynthesis.cancel();
    } else {
      // Turning ON: unlock audio for mobile browsers
      // Mobile browsers require a user gesture to enable speechSynthesis
      const unlock = new SpeechSynthesisUtterance('');
      unlock.volume = 0;
      window.speechSynthesis.speak(unlock);
    }
    setTtsEnabled(!ttsEnabled);
  }}
  className={`px-4 py-2 rounded-lg text-sm font-medium transition-colors ${
    ttsEnabled
      ? 'bg-amber-500 text-white'
      : 'bg-gray-300 text-gray-600'
  }`}
>
  {ttsEnabled ? '🔊 TTS ON' : '🔇 TTS OFF'}
</button>
```
The silent utterance — new SpeechSynthesisUtterance('') with volume set to zero — is the key mechanism. When the user taps the button, this runs within the user gesture context, which satisfies the browser’s autoplay policy. All subsequent speak() calls from WebSocket callbacks are then permitted. Without this initial unlock, every programmatic TTS call would be silently blocked.
Problem 1: Platform-Specific TTS Rate
Web Speech API is a browser standard, but the underlying TTS engine varies by operating system.
| Setting | PC (Windows Chrome) | iPhone Chrome |
|---|---|---|
| rate=1.1 | Normal speed | Normal speed |
| rate=1.8 | Slightly fast | Quite fast |
| rate=3.0 | Just right | Too fast to understand |
The same rate=3.0 was incomprehensibly fast on iPhone. The cause: all browsers on iPhone (Chrome, Safari, Firefox) use Apple’s WebKit engine internally, and TTS processing runs through Apple’s proprietary speech engine. A “standard” API does not guarantee consistent behavior across platforms.
The fix is the mobile detection shown in the core code above:
```typescript
const isMobile = /iPhone|iPad|Android/i.test(navigator.userAgent);
const ttsRate = isMobile ? 1.3 : 3.0;
```
A simple check, but necessary. Without it, translations on mobile were unintelligible.
Problem 2: Silent on Mobile
TTS that worked on desktop was completely silent on iPhone.
The cause was mobile browsers’ autoplay restrictions. speechSynthesis.speak() only works when called from a direct user gesture (a tap). Calls originating from WebSocket callbacks are not considered user-initiated and are silently blocked — no error, no warning, just silence.
The solution was the audio unlock mechanism shown in the toggle button code above. On the user’s first tap of the TTS ON button, the silent utterance new SpeechSynthesisUtterance('') unlocks the audio context. After this single gesture, all programmatic speak() calls function normally.
Problem 3: Some Translations Not Read Aloud
Some translations appeared on screen but were never spoken.
The issue was in the TTS trigger condition. Initially, TTS only fired on confirmed results (is_final=true). But in the experimental streaming mode, partial translations (is_final=false) could also appear on screen. When a partial translation was displayed and then overwritten by the next utterance before the confirmed result arrived, TTS never fired for that segment.
The fix: fire TTS whenever translation text exists, regardless of the is_final flag. The duplicate prevention logic inside speakTranslation (the lastSpokenRef check) prevents the same sentence from being read twice, so there is no risk of repeated audio even when TTS fires on both partial and confirmed results.
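The guard is small enough to isolate. Here is the same dedupe logic as a standalone function — the React code keeps `lastSpoken` in a ref, while this sketch uses a closure variable, but the behavior is identical:

```typescript
// Standalone version of the lastSpokenRef guard: returns true only when the
// text should actually be handed to speechSynthesis.speak().
function makeSpeakGuard() {
  let lastSpoken = '';
  return (text: string): boolean => {
    if (text === lastSpoken) return false; // same sentence arrived again
    lastSpoken = text;
    return true;
  };
}

const shouldSpeak = makeSpeakGuard();
console.log(shouldSpeak('こんにちは')); // true  — first occurrence: speak it
console.log(shouldSpeak('こんにちは')); // false — partial and intent carried the same text
console.log(shouldSpeak('元気ですか')); // true  — new sentence
```

This is why firing on every message with translation text is safe: the guard, not the trigger condition, is what prevents double playback.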
Bluetooth Earbuds Testing
After resolving these issues, I connected Bluetooth earbuds to an iPhone for real-device testing.
What Worked
- Speaking English myself, hearing Japanese translation in earbuds — this worked.
The earbud microphone captured my voice, the system translated it, and Japanese audio played through the earbuds. The delay between translation text appearing on screen and the audio starting was within about 1 second.
What Did Not Work
- Capturing someone else’s English through the earbud microphone — this failed.
Bluetooth earbud microphones are designed to capture the wearer’s voice. Noise cancellation actively suppresses ambient sound, optimized for phone calls. The microphone could not pick up the voice of someone sitting across from me.
This is not a software problem. It is a physical design constraint. AirPods Pro and dedicated translation earbuds use beamforming and multi-microphone arrays specifically to overcome this limitation — specialized hardware that general-purpose earbuds do not have.
Attempting Input/Output Device Separation
To capture the other person’s voice, the phone’s built-in microphone is the obvious choice. I attempted to fix the audio input to the built-in microphone while routing output to Bluetooth earbuds.
I implemented a microphone selection UI and specified the built-in microphone via getUserMedia with a deviceId constraint.
It failed. iOS browsers (WebKit) do not support explicit audio input device selection. The device specification was either ignored or the audio connection dropped immediately. There was no workaround within the browser environment.
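For reference, this is roughly what the attempt looked like. The device-picking helper below is pure and shown with mock data; the label heuristic and all names are illustrative assumptions, not the actual implementation. The `getUserMedia` constraint at the end is the step iOS WebKit does not honor:

```typescript
// Sketch of input-device selection — works on desktop Chrome, fails on iOS.

interface DeviceInfo { deviceId: string; kind: string; label: string; }

// Heuristic: pick the first audio input that is not the Bluetooth headset.
// Label matching is an assumption for illustration; real labels vary by OS.
function pickBuiltInMic(devices: DeviceInfo[]): DeviceInfo | undefined {
  return devices.find(
    (d) => d.kind === 'audioinput' && !/bluetooth|airpods|headset/i.test(d.label)
  );
}

// Mock of what navigator.mediaDevices.enumerateDevices() might return:
const mockDevices: DeviceInfo[] = [
  { deviceId: 'bt-1', kind: 'audioinput', label: 'AirPods Pro' },
  { deviceId: 'builtin-1', kind: 'audioinput', label: 'iPhone Microphone' },
  { deviceId: 'out-1', kind: 'audiooutput', label: 'AirPods Pro' },
];
console.log(pickBuiltInMic(mockDevices)?.deviceId); // builtin-1

// In the browser, the chosen deviceId would then be passed as a constraint —
// and this is what iOS WebKit ignores or drops:
// await navigator.mediaDevices.getUserMedia({
//   audio: { deviceId: { exact: 'builtin-1' } },
// });
```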
Browser Limits vs Native App
At this point, the boundaries of what the browser can do became clear.
| Feature | Browser | Native App |
|---|---|---|
| TTS Audio Output | Possible (with limitations) | Possible (unrestricted) |
| Input/Output Device Separation | Not possible | Possible |
| Background Operation | Not possible | Possible |
An iOS native app can use the AVAudioSession API to control input and output devices independently. The architecture would look like this:
```
Other person speaks English
  → Phone built-in mic (AVAudioSession)
  → Deepgram → LLM → Translation (~2 seconds)
  → iOS TTS (AVSpeechSynthesizer)
  → Bluetooth earbuds play Japanese translation
```
The backend — FastAPI, Deepgram, LLM — can be reused entirely. Only the frontend needs replacement: React Native or Swift instead of a web browser.
Latency Measurements
Here are the latency numbers from testing:
| Metric | Measurement |
|---|---|
| End of speech → Translation text displayed | Average 2,115ms |
| Translation text → TTS playback complete (short) | Within ~1 second |
| End of speech → Translation audio heard | ~3 seconds |
For reference, professional simultaneous interpreters typically lag by 2 to 3 seconds. For short utterances, this prototype achieves comparable latency. For longer sentences, the gap widens — but as a baseline measurement, these numbers provide a useful starting point.
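The ~3 second end-to-end figure is simply the sum of the two measured stages. As a sanity check on the table (the 1,000 ms value is the rough "within ~1 second" figure, not a precise measurement):

```typescript
// Sum the two measured stages from the table above.
const speechEndToTextMs = 2115; // end of speech → translation text displayed
const textToAudioMs = 1000;     // translation text → TTS playback (short utterances)

const endToEndMs = speechEndToTextMs + textToAudioMs;
console.log(`~${(endToEndMs / 1000).toFixed(1)}s end of speech → audio heard`); // ~3.1s
```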
What This Experiment Confirmed
- Real-time translation audio playback is achievable with public APIs — though platform-specific handling is required for consistent behavior.
- Standard Bluetooth earbuds work for translation audio output — no specialized hardware needed on the output side.
- Browsers cannot separately control audio input and output devices — an iOS WebKit limitation with no current workaround.
- A native app can achieve “built-in mic for listening + earbuds for translation audio” — AVAudioSession provides the device control that browsers lack.
It is not possible to reproduce everything that dedicated products offer through personal development. Hardware-dependent capabilities — beamforming for ambient sound capture, multi-microphone arrays — are beyond what software alone can address.
But for the architecture of “capture audio with the phone’s built-in mic, deliver translation audio through earbuds,” public APIs and open technology are sufficient to build a functional prototype. The code in this post — the TTS function, the audio unlock mechanism, the integration with the WebSocket pipeline — works as documented.
The lessons from this experiment — platform-specific TTS behavior, mobile browser audio restrictions, Bluetooth routing constraints — are practical reference material for anyone considering a similar implementation, whether in the browser or as a native app.