Translation Earbuds Prototype — Real-Time Translation Audio via Public APIs

Challenge

Dedicated translation earbuds (AirPods Pro, Timekettle M3) cost ¥17,000–40,000 and tie you to specific ecosystems. Can a regular smartphone and Bluetooth earbuds deliver a “hear the translation” experience using only public APIs?

Solution

Added a TTS layer (Web Speech API) to the Intent-First Translation pipeline, solved mobile autoplay restrictions with a first-tap audio unlock, and implemented platform-adaptive speech rate control.

Result

Achieved ~3 second end-to-end latency (speech → translated audio), comparable to professional simultaneous interpreters. Identified browser I/O device limitations and validated a clear path to a native-app architecture.

Translation earbuds prototype — real-device testing

Background

This project is an extension of Intent-First Translation, which displays the speaker’s intent within 500ms during real-time voice translation. Here, we added an audio output layer — translating English speech into Japanese and playing it through Bluetooth earbuds.

The goal: can you replicate the core experience of AirPods Pro live translation using a regular smartphone, Bluetooth earbuds, and public APIs?

Other person speaks English
  → Phone captures audio (Deepgram)
  → LLM translates in real-time (~2 seconds)
  → Japanese TTS plays through Bluetooth earbuds
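
On the client side, the last two steps boil down to receiving translated text and handing it to SpeechSynthesis. A minimal sketch, assuming the FastAPI backend pushes each translation over the WebSocket as JSON (the endpoint URL and message shape below are illustrative, not the project’s actual protocol):

// Speak each translated sentence through whatever output the phone is using
// (e.g. the connected Bluetooth earbuds).
const ws = new WebSocket('wss://translate.example.com/ws'); // hypothetical endpoint

ws.onmessage = (event: MessageEvent) => {
  const { translation } = JSON.parse(event.data); // assumed message shape
  if (!translation) return;
  const utterance = new SpeechSynthesisUtterance(translation);
  utterance.lang = 'ja-JP';
  window.speechSynthesis.speak(utterance);
};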

What Worked

  • Speaking English yourself → Japanese translation plays in the earbuds
  • Translation text → audio delay: within ~1 second
  • End-to-end latency: ~3 seconds (comparable to professional simultaneous interpreters at 2–3 seconds)

What Didn’t Work

  • Capturing another person’s voice through earbud mic ❌ — Bluetooth earbud mics are designed for the wearer’s voice; noise cancellation actively cuts ambient sound
  • Separating input/output devices in the browser ❌ — iOS WebKit doesn’t support explicit audio input device selection (see the sketch below)
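
For reference, this is roughly what explicit input-device selection looks like in browsers that do expose it; a sketch only, with illustrative label matching. On iOS WebKit there is no equivalent, so a web page cannot capture from the phone’s mic while routing output to the earbuds.

// Pick the phone's built-in mic explicitly, where the browser allows it.
// Device labels are only populated after a getUserMedia permission grant.
async function captureFromBuiltInMic(): Promise<MediaStream> {
  const devices = await navigator.mediaDevices.enumerateDevices();
  const builtInMic = devices.find(
    (d) => d.kind === 'audioinput' && /built-in|internal/i.test(d.label),
  );
  return navigator.mediaDevices.getUserMedia({
    audio: builtInMic ? { deviceId: { exact: builtInMic.deviceId } } : true,
  });
}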

Platform-Specific TTS Challenges

Web Speech API is a “browser standard,” but the actual engine differs by OS. Three critical problems were discovered and solved:

1. Speech Rate Inconsistency

The same rate=3.0 was “just right” on Windows Chrome but incomprehensibly fast on iPhone. All iOS browsers use Apple’s WebKit engine with Apple’s TTS engine underneath.

// The same rate value plays much faster on mobile TTS engines than on desktop Chrome
const isMobile = /iPhone|iPad|Android/i.test(navigator.userAgent);
const ttsRate = isMobile ? 1.3 : 3.0;
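
The platform-adjusted rate is then applied per utterance. A minimal speaking helper for the Japanese output (the function name is illustrative):

// Apply the platform-adjusted rate each time a translation is spoken.
function speakTranslation(translatedText: string): void {
  const utterance = new SpeechSynthesisUtterance(translatedText);
  utterance.lang = 'ja-JP';
  utterance.rate = ttsRate; // 1.3 on mobile, 3.0 on desktop Chrome
  window.speechSynthesis.speak(utterance);
}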

2. Silent TTS on Mobile

Mobile browsers block speechSynthesis.speak() unless called from a direct user tap. WebSocket callbacks don’t qualify as “user actions.”

// Unlock audio with silent utterance on first tap
const unlock = new SpeechSynthesisUtterance('');
unlock.volume = 0;
window.speechSynthesis.speak(unlock);
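
The unlock still has to run inside an actual gesture handler. One way to wire it up (the event choice and once-only binding are illustrative, not necessarily what the prototype used):

// Bind the unlock to the first user gesture; afterwards, programmatic
// speak() calls (e.g. from WebSocket handlers) are no longer blocked.
function unlockSpeech(): void {
  const unlock = new SpeechSynthesisUtterance('');
  unlock.volume = 0;
  window.speechSynthesis.speak(unlock);
}
document.addEventListener('pointerdown', unlockSpeech, { once: true });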

3. Missing TTS Triggers

In-progress translations could appear on screen but get overwritten before the confirmed result fired TTS. Fixed by triggering TTS on any translation text, with duplicate detection.
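
A sketch of that guard, assuming each translation update (partial or confirmed) arrives as a plain string (the helper name is illustrative):

// Trigger TTS on any translation update, but never speak the same text twice.
let lastSpokenText = '';

function speakIfNew(text: string): void {
  const trimmed = text.trim();
  if (!trimmed || trimmed === lastSpokenText) return; // empty or duplicate: skip
  lastSpokenText = trimmed;
  const utterance = new SpeechSynthesisUtterance(trimmed);
  utterance.lang = 'ja-JP';
  window.speechSynthesis.speak(utterance);
}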


Browser vs Native App

Feature | Browser | Native App
TTS Audio Output | ✅ (with limitations) | ✅ (unrestricted)
Input/Output Device Separation | ❌ | ✅
Background Operation | ❌ | ✅

A native iOS app using AVAudioSession can control input and output devices independently — capturing with the built-in mic while playing through Bluetooth earbuds. The backend (FastAPI + Deepgram + LLM) is fully reusable.


Measured Latency

Metric | Measurement
End of speech → Translation text | Average 2,115ms
Translation text → TTS complete (short) | Within ~1 second
End of speech → Translation audio heard | ~3 seconds

Real-Device Testing

Bluetooth earbuds testing with smartphone


Tech Stack

Layer | Technology
Speech Recognition | Deepgram Streaming API
Translation | GPT-4 / Gemini Flash / Groq (LLM)
Text-to-Speech | Web Speech API (SpeechSynthesis)
Real-Time Communication | WebSocket
Frontend | React + TypeScript
Backend | Python / FastAPI

Key Takeaway

Dedicated products like AirPods Pro solve audio input with specialized hardware — beamforming, multi-microphone arrays. Software alone can’t cross that wall. But for the architecture of “capture with built-in mic, deliver translation through earbuds”, public APIs and open technology are more than sufficient. The lessons learned here — platform TTS behavior, mobile browser audio constraints, Bluetooth routing limits — provide the foundation for a native app redesign.


Foundation Project

This project builds on Intent-First Translation — the 500ms intent-display real-time translation system. The speech recognition and LLM translation pipeline is shared.


Deep dive in the blog series: