Part 1. Build Your Own Real-Time Translator - Intent-First, Breaking the Silence

Now supports 4 languages: English, Japanese, Spanish, and Chinese — bidirectional translation between any pair. Source code available on GitHub.

Have you ever tried having a conversation with someone from another country using a voice translation app?

Microsoft Translator, Google Translate, Apple’s AirPods live translation — real-time voice translation from major tech companies is getting more accurate every year. Language support keeps expanding, and the technology itself is quite mature.

But when you actually use one in a live conversation, you notice something unexpected.

The translation is accurate, but the conversation doesn’t flow.


The Problem Was “Silence”

When you use a translation app in a business conversation, this is what happens:

The other person speaks in English → Wait for translation (3-5 seconds of silence) → Read the translation → You speak → The other person waits again

During this “waiting for translation” silence, the other person starts wondering “Can they hear me?” while you cannot respond until the translation appears. No matter how much translation quality improves, the rhythm of the conversation stays broken: the problem is structural, not a matter of accuracy.

All existing voice translation systems use a “sequential translation” architecture. They wait for the speaker to finish, confirm the text, then translate. It is accurate, but it always creates a “waiting time.”


Why Don’t Simultaneous Interpreters “Wait”?

If you observe professional simultaneous interpreters at work, you notice something interesting.

Interpreters start translating before the sentence is complete.

When they hear “We should probably reschedule the meeting to…”, they are already conveying “They’re talking about rescheduling the meeting.” Before the specific detail “Tuesday afternoon” arrives, they communicate the intent first.

For the listener, the most important thing is understanding “what the topic is” within the first few seconds. The details can be filled in afterwards.

Could we recreate this “communicate intent first” approach in software? That question led me to build the prototype I am introducing in this series.


The Concept of Intent-First Translation

Intent-First Translation processes information in a different order than traditional translation.

Traditional Translation (Sequential):

Start speaking → Finish speaking → Confirm text → Translate → Display
                                (Nothing reaches the listener during this time)

Intent-First Translation:

Start speaking → After 0.5s: "Schedule adjustment proposal" displayed
               → After 0.8s: "Wants to move the meeting to Tuesday afternoon" displayed
               → Finish speaking → Updated to confirmed translation

About 500 milliseconds after the speaker starts talking, the intent — “This person is talking about schedule adjustment” — appears on screen. The full translation follows a few hundred milliseconds later.

The listener can understand “what the speaker is talking about” while they are still speaking. This is an experience that sequential translation architectures cannot provide.
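The progressive display can be modeled as a small piece of UI state that is updated in place as better information arrives. The sketch below is purely illustrative (the class, method names, and the final confirmed sentence are my own, not the project's code):

```python
class TranslationDisplay:
    """Illustrative model of intent-first progressive display."""

    def __init__(self):
        self.intent = None       # rough topic, shown first (~0.5s)
        self.translation = None  # full text, shown when ready
        self.confirmed = False

    def show_intent(self, intent: str):
        self.intent = intent

    def show_translation(self, text: str, confirmed: bool = False):
        self.translation = text
        self.confirmed = confirmed

    def render(self) -> str:
        if self.translation:
            suffix = "" if self.confirmed else " (updating...)"
            return self.translation + suffix
        return f"[{self.intent}]" if self.intent else ""

display = TranslationDisplay()
display.show_intent("Schedule adjustment proposal")      # ~0.5s after speech starts
early = display.render()
display.show_translation("Wants to move the meeting to Tuesday afternoon")  # ~0.8s
partial = display.render()
display.show_translation("Can we move the meeting to Tuesday afternoon?", confirmed=True)
final = display.render()
```

The key point is that each update replaces, never appends: the listener always sees the single best current guess.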


Project Setup

The system consists of a React frontend communicating with a FastAPI backend over WebSocket. The backend handles speech recognition through Deepgram and translation through LLM APIs.

[Browser (React)] ←WebSocket→ [FastAPI Backend] → [Deepgram STT] + [LLM API]

Requirements: Python 3.11+, Node.js 18+, a Deepgram API key, and at least one LLM API key from the services below.

Service | Type | Pros | Cons
Google Gemini (Flash Lite) | Proprietary model | Low cost (~$1.17/5hrs); free tier available; good speed-quality balance | -
OpenAI (GPT-4o-mini) | Proprietary model | Consistent translation quality; mature API documentation | Slower inference (~2s in benchmarks)
Groq (Llama, etc.) | Inference platform for open-source models | Fastest inference (~400ms) via custom LPU chips | Open-source models only; higher cost for continuous use (~$3.43/5hrs)

Note: Groq does not develop its own models. It is a cloud service that runs open-source models (such as Llama) at high speed using proprietary LPU (Language Processing Unit) hardware. It is fundamentally different in nature from OpenAI or Google.

# backend/requirements.txt
fastapi==0.115.6
uvicorn[standard]==0.34.0
websockets==14.1
deepgram-sdk==3.10.0
python-dotenv==1.0.1
openai==1.58.1
google-generativeai>=0.8.0
groq>=0.11.0  # only if using Groq's SDK

# .env
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=your_openai_key      # For GPT-4o-mini
GOOGLE_API_KEY=your_google_key       # For Gemini
GROQ_API_KEY=your_groq_key          # For Groq (open-source model inference)
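Since only one LLM key is required, it is worth validating the configuration at startup instead of failing mid-conversation. A minimal sketch (the helper name and error messages are my own, not from the project):

```python
def check_api_keys(env: dict) -> list:
    """Return the list of configured LLM providers.

    Raises if the required Deepgram key or every LLM key is missing.
    """
    if not env.get("DEEPGRAM_API_KEY"):
        raise RuntimeError("DEEPGRAM_API_KEY is required for speech recognition")
    providers = [
        name
        for name, key in [
            ("gemini", "GOOGLE_API_KEY"),
            ("openai", "OPENAI_API_KEY"),
            ("groq", "GROQ_API_KEY"),
        ]
        if env.get(key)
    ]
    if not providers:
        raise RuntimeError(
            "Set at least one of GOOGLE_API_KEY, OPENAI_API_KEY, GROQ_API_KEY"
        )
    return providers
```

In practice you would call it once at startup with `check_api_keys(dict(os.environ))`.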

Connecting to Deepgram: Real-Time Speech Recognition

The core of the system is a persistent WebSocket connection to Deepgram’s streaming speech-to-text API. Here is the simplified start() method of the TranscriptionManager:

async def start(self):
    config = DeepgramClientOptions(options={"keepalive": "true"})
    deepgram = DeepgramClient(DEEPGRAM_API_KEY, config)

    self.dg_connection = deepgram.listen.asyncwebsocket.v("1")

    self.dg_connection.on(LiveTranscriptionEvents.Transcript, self._on_transcript)
    self.dg_connection.on(LiveTranscriptionEvents.Error, self._on_error)
    self.dg_connection.on(LiveTranscriptionEvents.Close, self._on_close)

    options = LiveOptions(
        model="nova-2",
        language="en-US",
        encoding="linear16",
        sample_rate=16000,
        channels=1,
        interim_results=True,     # Get partial results while speaking
        utterance_end_ms=1000,    # Detect end of utterance
        vad_events=True,          # Voice activity detection
    )

    await self.dg_connection.start(options)

Three settings are particularly important for the intent-first approach:

  • interim_results=True — This is essential. It tells Deepgram to send partial transcripts while the speaker is still talking, rather than waiting for the utterance to finish. Without this, there is nothing to feed into intent estimation early.
  • utterance_end_ms=1000 — After 1 second of silence, Deepgram marks the transcript as is_final=True. This threshold balances responsiveness with avoiding premature cutoffs.
  • vad_events=True — Enables voice activity detection events, which help the system distinguish between actual speech and background noise.
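To see why interim results matter, consider a toy simulation. The event timings below are illustrative only, not Deepgram's actual behavior: partials arrive every few hundred milliseconds, while the final transcript only lands after the utterance plus the silence threshold.

```python
# Simulated transcript events: (elapsed_seconds, text, is_final).
# Timings are illustrative, not measured from Deepgram.
events = [
    (0.4, "we should", False),
    (0.9, "we should probably reschedule", False),
    (1.5, "we should probably reschedule the meeting", False),
    (3.2, "we should probably reschedule the meeting to tuesday afternoon", True),
]

def first_usable_time(events, interim_results: bool):
    """Elapsed time until the first transcript the pipeline may act on."""
    for elapsed, text, is_final in events:
        if interim_results or is_final:
            return elapsed
    return None
```

With `interim_results=True`, intent estimation can start on the 0.4s partial; without it, nothing is available until the final event at 3.2s.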

When Deepgram sends a transcript event, the callback processes it and triggers downstream tasks:

async def _on_transcript(self, *args, **kwargs):
    result = kwargs.get("result")
    if result is None and args:
        result = args[1] if len(args) > 1 else args[0]

    if result:
        transcript_data = result.channel.alternatives[0]
        transcript = transcript_data.transcript

        if transcript:
            is_final = result.is_final
            speech_final = getattr(result, "speech_final", False)

            # Send transcript to frontend
            await self.websocket.send_json({
                "type": "transcript",
                "transcript": transcript,
                "is_final": is_final,
                "speech_final": speech_final,
                "confidence": transcript_data.confidence,
            })

            # Instant keyword hints (no LLM, near-zero latency)
            await self.keyword_predictor.predict(transcript)

            # Intent estimation (async, non-blocking)
            asyncio.create_task(
                self.intent_estimator.estimate(transcript, is_final)
            )

The use of asyncio.create_task() here is deliberate. Intent estimation involves an LLM API call that may take several hundred milliseconds. If that call blocked the transcript callback, Deepgram’s keepalive connection would time out and disconnect. By running it as a separate async task, the callback returns immediately and the WebSocket connection stays alive.
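The fire-and-forget pattern can be sketched in isolation. One caveat worth knowing: a bare `create_task()` silently swallows exceptions unless you attach a done callback. The names below are illustrative, not the project's actual code.

```python
import asyncio

results = []

async def slow_estimate(transcript: str):
    # Stands in for an LLM call taking hundreds of milliseconds.
    await asyncio.sleep(0.2)
    results.append(f"intent for: {transcript}")

def log_task_errors(task: asyncio.Task):
    # Without this, an exception inside the task would vanish silently.
    if not task.cancelled() and task.exception():
        print(f"intent estimation failed: {task.exception()!r}")

async def on_transcript(transcript: str):
    # Returns immediately; the estimate runs in the background, so a
    # Deepgram-style callback never blocks the WebSocket keepalive.
    task = asyncio.create_task(slow_estimate(transcript))
    task.add_done_callback(log_task_errors)

async def main():
    await on_transcript("we should probably reschedule")
    results.append("callback returned")  # proves we did not wait 200ms
    await asyncio.sleep(0.3)             # let the background task finish

asyncio.run(main())
```

The ordering of `results` confirms the callback returned before the estimate completed.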


WebSocket Endpoint

The FastAPI server exposes a single WebSocket endpoint that handles both binary audio data and JSON control messages:

@app.websocket("/ws/audio")
async def websocket_audio(websocket: WebSocket):
    await websocket.accept()
    manager = TranscriptionManager(websocket)

    if not await manager.start():
        await websocket.close()
        return

    try:
        while True:
            data = await websocket.receive()

            if "bytes" in data:
                # Binary audio data from browser microphone
                await manager.send_audio(data["bytes"])
            elif "text" in data:
                # Control messages (stop, model switch, etc.)
                message = json.loads(data["text"])
                if message.get("type") == "stop":
                    break
                elif message.get("type") == "set_model":
                    manager.set_model(message["model"])
    finally:
        await manager.close()

The endpoint handles two types of incoming data. Binary messages contain raw audio captured by the browser microphone, which are forwarded directly to Deepgram. Text messages are JSON-encoded control commands — for example, stopping the session or switching the LLM model mid-conversation.
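The control-message branch becomes easy to unit-test if the JSON dispatch is factored into a small pure function. This is a sketch of my own, not the project's actual structure:

```python
import json

def parse_control(text: str):
    """Map a JSON control message to an (action, payload) pair.

    Unknown or malformed messages map to ("ignore", None), so a bad
    client message can never crash the session loop.
    """
    try:
        message = json.loads(text)
    except json.JSONDecodeError:
        return ("ignore", None)
    kind = message.get("type")
    if kind == "stop":
        return ("stop", None)
    if kind == "set_model" and "model" in message:
        return ("set_model", message["model"])
    return ("ignore", None)
```

The endpoint loop then reduces to a match on the returned action.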


Frontend: Capturing Audio

On the browser side, the Web Audio API captures microphone input and converts it to the format Deepgram expects:

const startRecording = useCallback(async () => {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      channelCount: 1,
      sampleRate: 16000,
      echoCancellation: true,
      noiseSuppression: true,
    },
  });

  const audioContext = new AudioContext({ sampleRate: 16000 });
  const source = audioContext.createMediaStreamSource(stream);
  // Note: createScriptProcessor is deprecated in favor of AudioWorklet,
  // but it remains the simplest way to get raw PCM frames here.
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  processor.onaudioprocess = (event) => {
    const inputData = event.inputBuffer.getChannelData(0);
    // Float32 to Int16 conversion (Deepgram expects linear16)
    const int16Data = new Int16Array(inputData.length);
    for (let i = 0; i < inputData.length; i++) {
      const s = Math.max(-1, Math.min(1, inputData[i]));
      int16Data[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    onAudioData(int16Data.buffer);
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
}, [onAudioData]);

The onAudioData callback sends the Int16 buffer through the WebSocket to the backend. At 16kHz mono 16-bit, each 4096-sample buffer is about 8KB — small enough for smooth real-time streaming without noticeable overhead.
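Those figures are pure arithmetic from the parameters above, and easy to verify:

```python
SAMPLE_RATE = 16_000    # Hz, mono
BYTES_PER_SAMPLE = 2    # 16-bit linear PCM
BUFFER_SAMPLES = 4096   # ScriptProcessor buffer size

buffer_bytes = BUFFER_SAMPLES * BYTES_PER_SAMPLE    # 8192 bytes, i.e. ~8KB per message
buffer_ms = BUFFER_SAMPLES / SAMPLE_RATE * 1000     # 256 ms of audio per buffer
bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE   # 32 KB/s sustained upstream
```

So the browser sends one ~8KB message roughly every 256 ms, about 32 KB/s upstream, which is negligible even on a modest connection.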


Measured Performance

Here is the actual measured performance:

Metric | Intent-First Translation | Traditional Sequential Translation
Intent Display | ~500ms | (No such feature)
Translation Display Start | ~500-800ms | ~2-5 seconds
Conversation Tempo | Nearly real-time | 3-5 second interruption each time

Having the intent communicated within about 500 milliseconds creates a noticeably different experience compared to waiting several seconds for a complete translation.


See It in Action

Here is a demo of Intent-First Translation in use. Watch how the intent and translation appear while the speaker is still talking:


From “Translating Accurately” to “Conversing Naturally”

This prototype is not trying to compete with existing services on translation accuracy, nor does it claim to fundamentally solve real-time translation. It explores a different angle: reducing perceived latency by delivering information progressively, as a UX-level improvement.

The true breakthroughs in real-time translation will come from advances in model architecture and inference hardware — faster edge AI, lower-latency models, and perhaps paradigms we have not yet imagined. Intent-First Translation does not replace those advances. It is one practical approach that works within today’s constraints, making the waiting time feel shorter by showing intent first.

What I find encouraging is that this approach is accessible to anyone. Using publicly available APIs and open-source tools, any engineer can build a near real-time translator at home. I hope this series serves as a useful reference for those exploring the same space.

In the next post, I will cover the core LLM streaming implementation — dual prompts for speed versus quality, JSON field order optimization, and how the frontend displays translations progressively.

