Part 2. Build Your Own Real-Time Translator - LLM Streaming for 500ms

Now supports 4 languages: English, Japanese, Spanish, and Chinese — bidirectional translation between any pair. Source code available on GitHub.

In Part 1, I introduced Intent-First Translation — an approach to voice translation that displays the speaker’s intent within about 500 milliseconds, keeping conversation tempo intact. I also walked through the foundational setup with Deepgram, FastAPI, and WebSocket.

This post covers the core LLM streaming implementation: the dual prompt strategy, streaming JSON extraction, debounce and filtering logic, and the progressive frontend display that ties it all together. I’ll include the actual code throughout.


Delivering Information in 3 Layers, Step by Step

Intent-First Translation is not a single translation process. It stacks three layers running at different speeds to deliver information progressively.

Layer 1: Keyword Prediction      → ~0ms    (No LLM needed)
Layer 2: Intent Label            → ~500ms  (LLM streaming)
Layer 3: Full Translation        → ~500-800ms (LLM streaming continued)

Layer 1: Keyword Prediction (Near-Zero Latency)

For the fragmentary text from speech recognition, we predict the topic instantly using a dictionary-based approach — no LLM involved.

“meeting” → 「会議」, “budget” → 「予算」, “Tuesday” → 「火曜日」

Even just this allows the listener to instantly grasp “It’s about meetings, budget, and Tuesday.” Since there’s no need to wait for LLM response, it displays with near-zero latency.
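
Layer 1 can be as simple as a lookup table. Here is a minimal sketch; KEYWORD_DICT is a hypothetical stand-in for the project's real, much larger dictionary:

```python
# Minimal sketch of Layer 1. KEYWORD_DICT is illustrative, not the
# project's actual dictionary.
KEYWORD_DICT = {
    "meeting": "会議",
    "budget": "予算",
    "tuesday": "火曜日",
}

def predict_keywords(partial_text: str) -> list[str]:
    # Pure dictionary lookup: no LLM call, so effectively zero latency.
    words = partial_text.lower().replace(",", " ").replace(".", " ").split()
    return [KEYWORD_DICT[w] for w in words if w in KEYWORD_DICT]

print(predict_keywords("Let's move the meeting to Tuesday."))  # ['会議', '火曜日']
```

Because this is synchronous and local, it can run on every partial result from speech recognition without any rate limiting.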

Layer 2 & 3: Intent Estimation and Translation by LLM

Every time the partial result from speech recognition (Deepgram) updates — the unconfirmed text while still speaking — we request intent estimation and translation from the LLM in JSON format.

The key is streaming output. We parse the JSON character by character as the LLM generates it, and send each piece to the frontend as soon as it’s ready.


JSON Field Order Determines Display Speed

This is the most important point I want to share in this post.

LLM streaming output generates JSON from top to bottom. The first-defined field is generated first, and the last field is generated last.

This means the order of fields determines when each piece of information reaches the user.

When intent and translation are placed first:

{
  "dialogue_act": "PROPOSAL",
  "intent_label": "Schedule adjustment proposal",      Displayed immediately
  "slots": {"when": "Tuesday"},
  "full_translation": "Let's move the meeting to...",  Translation arrives early
  "confidence": 0.85,
  "is_meaning_stable": true
}

When translation is placed last:

{
  "dialogue_act": "PROPOSAL",
  "slots": {"when": "Tuesday"},
  "key_terms": ["meeting", "reschedule"],
  "confidence": 0.85,
  "is_meaning_stable": true,
  "intent_label": "Schedule adjustment proposal",
  "full_translation": "Let's move the meeting to..." Must wait for all fields
}

In the latter case, you have to wait for all fields to be generated before the translation appears, causing roughly 2x the latency.

Just rearranging the JSON field order made translation display nearly 2x faster. The optimization falls out of one observation: LLMs generate output strictly from top to bottom.
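
The effect is easy to demonstrate offline. The sketch below (field names from the schemas above, values illustrative) counts how many characters of the stream must arrive before full_translation becomes extractable under each ordering; since tokens arrive roughly in character order, fewer characters means lower latency:

```python
import re

def chars_until_extractable(json_text: str, field: str) -> int:
    """How many characters of the stream must arrive before the field's
    value (through its closing quote) can be extracted."""
    pattern = re.compile(r'"%s"\s*:\s*"[^"]*"' % re.escape(field))
    for i in range(len(json_text) + 1):
        if pattern.search(json_text[:i]):
            return i
    return -1

# Translation-first vs. translation-last orderings (illustrative values).
fast = ('{"intent_label": "Schedule adjustment proposal", '
        '"full_translation": "Let us move the meeting", "confidence": 0.85}')
slow = ('{"confidence": 0.85, "intent_label": "Schedule adjustment proposal", '
        '"full_translation": "Let us move the meeting"}')

print(chars_until_extractable(fast, "full_translation"))
print(chars_until_extractable(slow, "full_translation"))  # strictly larger
```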


The Dual Prompt Strategy

Speech recognition results come in two types:

  • Partial (in-progress): Incomplete text while still speaking (e.g., “I think we should…”)
  • Final (confirmed): Confirmed text after the speaker finishes (e.g., “I think we should reschedule the meeting to Tuesday.”)

I realized that using the same prompt for both is inefficient.

State                 | Goal                         | Prompt Strategy
Partial (in-progress) | Show intent quickly          | Speed-first: generate intent → translation first
Final (confirmed)     | Provide accurate translation | Quality-first: analyze context first, improve accuracy

While speaking, we prioritize speed and show rough intent and translation. When the speaker finishes, we replace it with an accurate, quality-first translation.

This is the same structure that human simultaneous interpreters use. First convey the gist, then correct when confirmed.

Here are the actual prompts.

SPEED Prompt (for partial / in-progress speech)

INTENT_SYSTEM_PROMPT_SPEED = """あなたはリアルタイム同時通訳AIです。
英語の発話をリアルタイムで日本語に翻訳します。

処理順序(重要):
1. まず対話行為(dialogue_act)を判断
2. 意図(intent_label)を抽出
3. 重要情報(slots)を特定
4. 上記を考慮して翻訳(full_translation)を生成

翻訳ルール:
- 自然で流暢な日本語に翻訳してください
- 文が途中でも、現時点で言いたいことを推測して完結した日本語にしてください

その他のルール:
- intent_labelは日本語で10文字以内
- 出力は必ず指定されたJSON形式で行ってください

出力JSON形式(この順序で出力すること):
{
  "dialogue_act": "QUESTION | PROPOSAL | AGREEMENT | ...",
  "intent_label": "短い日本語ラベル",
  "slots": {"when": "", "who": "", "where": "", "what": ""},
  "full_translation": "文脈を考慮した自然な日本語訳",
  "key_terms": ["重要な単語"],
  "confidence": 0.0〜1.0,
  "is_meaning_stable": true/false
}

注意: JSON以外の文字は出力しないでください。"""

In the SPEED prompt, intent_label and full_translation are placed early in the JSON schema. Because LLMs generate tokens sequentially from top to bottom, these fields are produced first — giving the user immediate access to the intent and a rough translation while the speaker is still talking.

QUALITY Prompt (for final / confirmed speech)

INTENT_SYSTEM_PROMPT_QUALITY = """あなたはリアルタイム対話支援AIです。
入力される英語から、話者の「意図」と「重要な単語」を抽出し、正確な翻訳を行ってください。

ルール:
- まず対話行為(dialogue_act)と重要情報(slots)を分析してください
- その分析結果を踏まえて、文脈に即した正確な翻訳を行ってください
- 出力は必ず指定されたJSON形式で行ってください
- intent_labelは日本語で10文字以内

出力JSON形式(この順序で出力すること):
{
  "dialogue_act": "QUESTION | PROPOSAL | AGREEMENT | ...",
  "slots": {"when": "", "who": "", "where": "", "what": ""},
  "key_terms": ["重要な単語"],
  "confidence": 0.0〜1.0,
  "is_meaning_stable": true/false,
  "intent_label": "短い日本語ラベル",
  "full_translation": "文脈を考慮した正確な日本語訳"
}

注意: JSON以外の文字は出力しないでください。"""

In the QUALITY prompt, slots, key_terms, and confidence come first — the LLM analyzes context before generating the translation. intent_label and full_translation are placed last, so they benefit from the preceding analysis. This mirrors how human interpreters take a moment to understand context before delivering a polished translation.


LLM Calling Code for All Providers

The system supports three providers. The frontend provides a one-click toggle to switch between them.

Provider      | Type                                      | Model
OpenAI        | Proprietary model                         | GPT-4o-mini
Google Gemini | Proprietary model                         | Gemini 2.5 Flash Lite
Groq          | Inference platform for open-source models | Llama 4 Maverick, Llama 3.3 70B, etc.

Client Initialization

from openai import AsyncOpenAI
import google.generativeai as genai

# OpenAI — standard client
openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Google Gemini — uses its own SDK
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
gemini_model = genai.GenerativeModel("gemini-2.5-flash-lite")

# Groq — provides an OpenAI-compatible API, so AsyncOpenAI works directly
GROQ_BASE_URL = "https://api.groq.com/openai/v1"
groq_client = AsyncOpenAI(base_url=GROQ_BASE_URL, api_key=os.getenv("GROQ_API_KEY"))

# Groq offers multiple open-source models to choose from
GroqModel = Literal[
    "llama-3.1-8b-instant",
    "llama-3.3-70b-versatile",
    "meta-llama/llama-4-maverick-17b-128e-instruct",
    "openai/gpt-oss-120b",
]
groq_model_name: GroqModel = "llama-3.3-70b-versatile"  # default

Groq providing an OpenAI-compatible API is a significant implementation advantage. The same AsyncOpenAI client works with just a different base_url.

OpenAI (GPT-4o-mini)

async def _call_gpt4(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED

    context = ""
    if self.context_history:
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"

    stream = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3, max_tokens=500, stream=True,
    )

    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            await self._process_streaming_chunk(full_response, text, ...)

    await self._finalize_response(full_response, text, is_final, start_time, "GPT-4")

Google Gemini (2.5 Flash Lite)

Gemini uses its own SDK with a non-streaming call. asyncio.to_thread prevents blocking the event loop.

async def _call_gemini(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED

    context = ""
    if self.context_history:
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"
    full_prompt = f"{system_prompt}\n\n{user_prompt}"

    # Gemini API is synchronous — wrap with asyncio.to_thread
    response = await asyncio.to_thread(
        lambda: gemini_model.generate_content(
            full_prompt,
            generation_config=genai.types.GenerationConfig(
                temperature=0.3, max_output_tokens=1000,
            ),
        )
    )

    full_response = response.text if response.text else ""
    await self._finalize_response(full_response, text, is_final, start_time, "Gemini")

Groq (Llama and Other Open-Source Models)

Since Groq provides an OpenAI-compatible API, the code structure is nearly identical to _call_gpt4. Only the client and model name differ.

async def _call_groq(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED

    context = ""
    if self.context_history:
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"

    # groq_client initialized as AsyncOpenAI(base_url=GROQ_BASE_URL)
    stream = await groq_client.chat.completions.create(
        model=groq_model_name,  # dynamically switchable from frontend
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3, max_tokens=500, stream=True,
    )

    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            await self._process_streaming_chunk(full_response, text, ...)

    await self._finalize_response(full_response, text, is_final, start_time, f"Groq/{groq_model_name}")

Model Switching (via WebSocket)

The frontend sends messages to switch providers and models dynamically.

# Provider switch
elif msg_type == "set_model":
    model = message.get("model", "gpt4")  # "gpt4" | "gemini" | "groq"
    manager.set_model(model)

# Groq model switch
elif msg_type == "set_groq_model":
    groq_model_name = message.get("groq_model", "llama-3.3-70b-versatile")

Design Points Common to All Providers

  • Prompt switching: The is_final flag determines whether to use the SPEED or QUALITY prompt.
  • Context history: The last 3 utterances are included in the user prompt, giving the LLM conversational context for more coherent translations.
  • Streaming chunk processing: Each chunk from the LLM is appended to full_response, and _process_streaming_chunk is called to extract and send partial results immediately.
  • Timing logs: Each _call_* method measures elapsed time from start_time, logging first chunk arrival, translation detection, and completion time. The benchmark table below is derived from these logs.

Streaming JSON Extraction

The _process_streaming_chunk method parses the in-progress JSON and sends partial results to the frontend via WebSocket as soon as they become available:

async def _process_streaming_chunk(self, full_response: str, text: str,
                                    sent_intent: bool, sent_translation: bool):
    """Parse in-progress JSON and send partial results via WebSocket"""
    if not sent_intent and '"intent_label"' in full_response:
        match = re.search(r'"intent_label"\s*:\s*"([^"]+)"', full_response)
        if match:
            await self.websocket.send_json({
                "type": "intent_partial",
                "intent_label": match.group(1),
                "source_text": text,
            })

    if not sent_translation and '"full_translation"' in full_response:
        match = re.search(r'"full_translation"\s*:\s*"([^"]+)"', full_response)
        if match:
            await self.websocket.send_json({
                "type": "translation_partial",
                "translation": match.group(1),
                "source_text": text,
            })

This uses regex to extract JSON field values from incomplete JSON. As soon as a field’s value is complete (the closing quote is detected), it is sent to the frontend immediately. There is no need to wait for the entire JSON to be valid; only the individual field needs to be parseable. One caveat: the `([^"]+)` pattern stops at the first quote character, so a value containing an escaped \" would be truncated. In practice this is acceptable for partials, since they are replaced once the complete JSON is parsed at finalization.

This is what makes the progressive display possible: intent_label arrives first, and full_translation follows shortly after, each sent the moment they are generated.


Debounce, Short Text Skip, and Duplicate Check

Deepgram returns partial results very quickly — potentially dozens of times per second. Without filtering, this would generate an overwhelming number of LLM calls. The estimate() method implements three mechanisms to control this:

async def estimate(self, text: str, is_final: bool):
    """Estimate intent from text (with rate limiting)"""
    if not text.strip():
        return
    if text == self.last_text:    # Duplicate check
        return

    self.last_text = text
    current_time = time.time() * 1000
    word_count = len(text.split())

    if is_final:
        # Final results always get processed immediately
        self.context_history.append(text)
        if len(self.context_history) > 5:
            self.context_history.pop(0)
        await self._call_llm_streaming(text, is_final)
        return

    # Skip short partials (e.g., "I" or "I think")
    if word_count < self.min_words:
        return

    time_since_last = current_time - self.last_call_time

    # Cancel any pending debounced call
    if self.pending_task and not self.pending_task.done():
        self.pending_task.cancel()

    # Debounce: wait 300ms before calling LLM
    if time_since_last < self.debounce_ms:
        wait_time = (self.debounce_ms - time_since_last) / 1000
        self.pending_task = asyncio.create_task(
            self._delayed_call(text, is_final, wait_time)
        )
    else:
        await self._call_llm_streaming(text, is_final)

The three filtering mechanisms:

  1. Duplicate check: If the incoming text is identical to the previous text, it is skipped. This prevents redundant processing when Deepgram sends the same partial result multiple times.
  2. Short text skip: Partials under 5 words (e.g., “I” or “I think”) are too fragmentary for meaningful translation. Skipping them reduces unnecessary LLM calls without losing useful information.
  3. Debounce (300ms): If new text arrives within 300ms of the last call, the pending LLM request is canceled, and the timer resets. This ensures we send only the latest version of the partial text rather than every intermediate state.

One important detail: is_final results always bypass all three filters. When the speaker finishes an utterance, we need the quality translation immediately — there is no reason to delay or skip it.
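
The `_delayed_call` helper referenced in `estimate()` is not shown above. The debounce mechanism can be isolated into a self-contained sketch; the class and method names here are illustrative, not from the actual codebase:

```python
import asyncio

class Debouncer:
    """Hedged sketch of the debounce logic: cancel the pending task,
    then schedule a delayed call with the latest text."""
    def __init__(self, debounce_ms: float = 300):
        self.debounce_ms = debounce_ms
        self.pending_task: asyncio.Task | None = None
        self.calls: list[str] = []

    async def _fire(self, text: str, wait_time: float):
        try:
            await asyncio.sleep(wait_time)   # debounce window
            self.calls.append(text)          # stands in for the LLM call
        except asyncio.CancelledError:
            pass                             # superseded by newer partial text

    def submit(self, text: str):
        if self.pending_task and not self.pending_task.done():
            self.pending_task.cancel()       # drop the stale partial
        self.pending_task = asyncio.create_task(
            self._fire(text, self.debounce_ms / 1000)
        )

async def demo():
    d = Debouncer(debounce_ms=50)
    d.submit("I think we")
    await asyncio.sleep(0.01)
    d.submit("I think we should reschedule")  # cancels the first
    await asyncio.sleep(0.1)
    return d.calls

print(asyncio.run(demo()))  # ['I think we should reschedule']
```

Only the latest partial ever reaches the LLM; every superseded version is cancelled inside its debounce window.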


Frontend: WebSocket Message Handler

On the frontend (React), a handleMessage callback processes the three types of WebSocket messages and updates the UI state progressively:

const handleMessage = useCallback((message: WebSocketMessage) => {
  switch (message.type) {
    case 'intent_partial': {
      // Display intent label immediately
      setPartialIntentLabel(message.intent_label);
      // Update timeline: add or replace the latest partial entry
      setTimeline((prev) => {
        const lastPartialIndex = prev.findLastIndex((e) => !e.isFinal);
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = {
            ...newList[lastPartialIndex],
            intentLabel: message.intent_label,
          };
          return newList;
        }
        return [...prev, {
          id: Date.now(),
          intentLabel: message.intent_label,
          dialogueAct: 'OTHER',
          translation: null,
          isFinal: false,
          timestamp: new Date(),
        }];
      });
      break;
    }
    case 'translation_partial': {
      // Update the latest partial translation
      setRealtimeTranslations((prev) => {
        const lastPartialIndex = prev.findLastIndex((entry) => !entry.isFinal);
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = {
            sourceText: message.source_text,
            translation: message.translation,
            isFinal: false,
          };
          return newList;
        }
        return [...prev, {
          sourceText: message.source_text,
          translation: message.translation,
          isFinal: false,
        }];
      });
      break;
    }
    case 'intent': {
      // Complete result: replace partial with final
      setCurrentIntent(message.data);
      setPartialIntentLabel('');
      // Update timeline with confirmed entry
      setTimeline((prev) => {
        const lastPartialIndex = prev.findLastIndex((e) => !e.isFinal);
        const newEntry = {
          id: Date.now(),
          intentLabel: message.data.intent_label || '',
          dialogueAct: message.data.dialogue_act || 'OTHER',
          translation: message.data.full_translation || null,
          isFinal: message.is_final,
          timestamp: new Date(),
        };
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = newEntry;
          return newList;
        }
        return [...prev, newEntry];
      });
      break;
    }
  }
}, []);

The display flow has 3 stages:

  1. intent_partial arrives first (~500ms): The intent label appears immediately. The translation area shows “Translating…” as a placeholder.
  2. translation_partial arrives next (~500-800ms): The translation text fills in, replacing the placeholder.
  3. intent (complete) arrives last: The partial entry is replaced with the confirmed result. Visual styling changes to indicate the translation is finalized.

Each stage updates the same timeline entry (found by findLastIndex where isFinal is false), so there is no duplication — the partial result smoothly transitions into the final result.


Frontend: Timeline UI

The timeline component renders each entry with visual states that reflect the data flow:

{/* Real-time translation timeline */}
<div className="bg-white rounded-xl shadow-lg p-6 mb-6">
  <h2 className="text-lg font-semibold text-gray-700">Real-time Translation</h2>
  <div className="max-h-[400px] overflow-y-auto space-y-3">
    {timeline.map((entry, i) => {
      const isLatest = i === timeline.length - 1;
      return (
        <div
          key={entry.id}
          className={`p-3 rounded-lg ${
            isLatest
              ? entry.isFinal
                ? 'bg-green-50 border-l-4 border-green-500'    // Confirmed
                : 'bg-blue-50 border-l-4 border-blue-400'      // In-progress
              : 'bg-gray-50'                                     // Historical
          }`}
        >
          {/* Intent label row */}
          <div className="flex items-center gap-2 mb-1">
            <span className="px-2 py-0.5 rounded text-white text-xs bg-blue-500">
              {entry.dialogueAct}
            </span>
            <span className="text-gray-700 font-medium">{entry.intentLabel}</span>
            {!entry.isFinal && (
              <span className="text-xs text-blue-500 animate-pulse">processing...</span>
            )}
          </div>
          {/* Translation row */}
          {entry.translation ? (
            <p className="text-gray-800 text-lg pl-1">→ {entry.translation}</p>
          ) : (
            <p className="text-gray-400 text-sm pl-1 animate-pulse">→ Translating...</p>
          )}
        </div>
      );
    })}
  </div>
</div>

The timeline UI uses 3 visual states:

  • Blue left border + pulse animation: Still processing (partial result). The intent label is visible, but the translation may still be arriving.
  • Green left border: Confirmed translation (final result). The entry is complete and verified.
  • Gray background: Previous entries that have scrolled up in the history.

This progressive display mirrors the data flow. The intent appears first (blue border), then the translation fills in, and when confirmed the entry turns green. The user sees information building up in real time rather than waiting for a blank screen to suddenly populate.


Benchmarking 6 LLM Models

The system lets you switch between any provider and model via a frontend toggle. Here are the results from testing all models with the same input.

Proprietary Models

Provider | Model                 | Translation Speed | Quality | Cost / 5hrs
Google   | Gemini 2.5 Flash Lite | 954ms             | —       | $1.17 (~¥175)
OpenAI   | GPT-4o-mini           | 1,976ms           | —       | $1.74 (~¥261)

Open-Source Models (Groq LPU Inference)

Groq does not develop its own models. It is a cloud service that runs open-source models such as Llama at high speed using proprietary LPU (Language Processing Unit) chips.

Model            | Translation Speed | Quality | Cost / 5hrs
Llama 4 Maverick | 413ms             | —       | $3.43 (~¥515)
Llama 3.3 70B    | 480ms             | —       | —
Llama 3.1 8B     | 377ms             | —       | —
GPT-OSS 120B     | 662ms             | —       | —

What the Comparison Reveals

  • Speed priority: Open-source models via Groq are overwhelmingly fast (~400ms). This is the effect of LPU hardware optimization.
  • Cost priority: Gemini 2.5 Flash Lite is the most economical at ~$1.17 for 5 hours, with a free tier available.
  • Quality and stability: OpenAI GPT-4o-mini is slower but offers consistent translation quality and mature API documentation.
  • Trade-off: Groq is the fastest but also the most expensive for continuous use (~$3.43 for 5 hours). Speed comes at a price.

Overall System Data Flow

Here’s the technical overview:

[Browser]                [Backend (FastAPI)]           [External API]
   |                          |                           |
   |-- Audio binary --------->|                           |
   |                          |-- Audio data ------------>| Deepgram
   |                          |<-- Partial/Final text ----|
   |                          |                           |
   |                          |-- Text ----------------->| LLM (Gemini/Groq)
   |                          |<-- Streaming JSON --------|
   |                          |                           |
   |<-- intent_partial -------|  (Send intent label immediately)
   |<-- translation_partial --|  (Send translation immediately)
   |<-- intent (complete) ----|  (Complete result)

The key design: instead of waiting for the full LLM output, each JSON field is sent as an individual WebSocket message to the frontend as soon as it’s generated.


Key Lessons from This Development

The biggest takeaway from building this system was the importance of treating LLM output as a “stream” rather than a “result”.

  1. Optimize JSON field order to generate important information first
  2. Parse in-progress JSON partially and send it immediately
  3. Dynamically switch prompts based on Partial/Final state
  4. Filter aggressively on the input side — debounce, dedup, and skip short fragments
  5. Build the frontend to display information progressively, not all-at-once

All of these designs leverage the LLM’s fundamental behavior of generating output from top to bottom. These techniques don’t solve the deeper challenge of real-time translation (that will require advances at the model and hardware level), but they demonstrate what’s achievable today with publicly available tools.

In the next post, I’ll share my experience testing local LLM inference, provide a guide for adapting this system to other language pairs, and reflect on what this approach can and cannot do.

