Part 2. Build Your Own Real-Time Translator - LLM Streaming for 500ms
Now supports 4 languages: English, Japanese, Spanish, and Chinese — bidirectional translation between any pair. Source code available on GitHub.
In Part 1, I introduced Intent-First Translation — an approach to voice translation that displays the speaker’s intent within about 500 milliseconds, keeping conversation tempo intact. I also walked through the foundational setup with Deepgram, FastAPI, and WebSocket.
This post covers the core LLM streaming implementation: the dual prompt strategy, streaming JSON extraction, debounce and filtering logic, and the progressive frontend display that ties it all together. I’ll include the actual code throughout.
Delivering Information in 3 Layers, Step by Step
Intent-First Translation is not a single translation pass. It is a layered design: three layers running at different speeds deliver information progressively.
Layer 1: Keyword Prediction → ~0ms (No LLM needed)
Layer 2: Intent Label → ~500ms (LLM streaming)
Layer 3: Full Translation → ~500-800ms (LLM streaming continued)
Layer 1: Keyword Prediction (Near-Zero Latency)
For the fragmentary text from speech recognition, we predict the topic instantly using a dictionary-based approach — no LLM involved.
“meeting” → 「会議」, “budget” → 「予算」, “Tuesday” → 「火曜日」
Even this alone lets the listener instantly grasp that the topic involves meetings, budget, and Tuesday. Since there is no LLM response to wait for, it displays with near-zero latency.
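The Layer 1 lookup can be sketched in a few lines. This is an illustrative sketch: the dictionary contents and function name are mine, not the project's actual code.

```python
# Layer 1 sketch: instant keyword prediction with a plain dictionary lookup.
# No LLM, no network call, so latency is effectively zero.
KEYWORD_DICT = {
    "meeting": "会議",
    "budget": "予算",
    "tuesday": "火曜日",
}

def predict_keywords(partial_text: str) -> list[str]:
    """Return known keywords found in a partial transcript, in order of appearance."""
    hits = []
    for word in partial_text.lower().split():
        translated = KEYWORD_DICT.get(word.strip(".,?!"))
        if translated and translated not in hits:
            hits.append(translated)
    return hits

print(predict_keywords("Let's move the meeting about the budget to Tuesday"))
# → ['会議', '予算', '火曜日']
```

A real deployment would load a larger, domain-specific dictionary, but the mechanism stays this simple.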
Layer 2 & 3: Intent Estimation and Translation by LLM
Every time the partial result from speech recognition (Deepgram) updates — the unconfirmed text while still speaking — we request intent estimation and translation from the LLM in JSON format.
The key is streaming output. We parse the JSON character by character as the LLM generates it, and send each piece to the frontend as soon as it’s ready.
JSON Field Order Determines the Speed
This is the most important point I want to share in this post.
LLM streaming output generates JSON from top to bottom. The first-defined field is generated first, and the last field is generated last.
This means the order of fields determines when each piece of information reaches the user.
When intent and translation are placed first:
{
  "dialogue_act": "PROPOSAL",
  "intent_label": "Schedule adjustment proposal",      ← Displayed immediately
  "slots": {"when": "Tuesday"},
  "full_translation": "Let's move the meeting to...",  ← Translation arrives early
  "confidence": 0.85,
  "is_meaning_stable": true
}
When translation is placed last:
{
  "dialogue_act": "PROPOSAL",
  "slots": {"when": "Tuesday"},
  "key_terms": ["meeting", "reschedule"],
  "confidence": 0.85,
  "is_meaning_stable": true,
  "intent_label": "Schedule adjustment proposal",
  "full_translation": "Let's move the meeting to..."   ← Must wait for all fields
}
In the latter case, you have to wait for all fields to be generated before the translation appears, causing roughly 2x the latency.
Just rearranging the JSON field order: that single change made translation display nearly 2x faster. The optimization follows directly from the LLM's "generate from top to bottom" behavior.
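The effect is easy to see without calling an LLM at all: treat a JSON string as a character stream and ask at which character each field's value is fully closed. A toy illustration (field values are placeholders, not real model output):

```python
import re

def chars_until_usable(json_text: str, field: str) -> int:
    """Number of characters that must be generated before `field`'s
    string value is closed -- a proxy for when a streaming consumer
    could display it."""
    match = re.search(rf'"{field}"\s*:\s*"[^"]*"', json_text)
    return match.end()

# Same fields, different order.
fast = '{"intent_label": "Schedule proposal", "confidence": 0.85, "full_translation": "..."}'
slow = '{"confidence": 0.85, "full_translation": "...", "intent_label": "Schedule proposal"}'

print(chars_until_usable(fast, "intent_label"))  # small: usable almost immediately
print(chars_until_usable(slow, "intent_label"))  # large: waits behind other fields
assert chars_until_usable(fast, "intent_label") < chars_until_usable(slow, "intent_label")
```

Characters map roughly to tokens, and tokens map to wall-clock time, so fewer characters before the closing quote means an earlier display.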
The Dual Prompt Strategy
Speech recognition results come in two types:
- Partial (in-progress): Incomplete text while still speaking (e.g., “I think we should…”)
- Final (confirmed): Confirmed text after the speaker finishes (e.g., “I think we should reschedule the meeting to Tuesday.”)
I realized that using the same prompt for both is inefficient.
| State | Goal | Prompt Strategy |
|---|---|---|
| Partial (in-progress) | Show intent quickly | Speed-first: generate intent → translation first |
| Final (confirmed) | Provide accurate translation | Quality-first: analyze context first, improve accuracy |
While speaking, we prioritize speed and show rough intent and translation. When the speaker finishes, we replace it with an accurate, quality-first translation.
This is the same structure that human simultaneous interpreters use. First convey the gist, then correct when confirmed.
Here are the actual prompts.
SPEED Prompt (for partial / in-progress speech)
INTENT_SYSTEM_PROMPT_SPEED = """あなたはリアルタイム同時通訳AIです。
英語の発話をリアルタイムで日本語に翻訳します。
処理順序(重要):
1. まず対話行為(dialogue_act)を判断
2. 意図(intent_label)を抽出
3. 重要情報(slots)を特定
4. 上記を考慮して翻訳(full_translation)を生成
翻訳ルール:
- 自然で流暢な日本語に翻訳してください
- 文が途中でも、現時点で言いたいことを推測して完結した日本語にしてください
その他のルール:
- intent_labelは日本語で10文字以内
- 出力は必ず指定されたJSON形式で行ってください
出力JSON形式(この順序で出力すること):
{
"dialogue_act": "QUESTION | PROPOSAL | AGREEMENT | ...",
"intent_label": "短い日本語ラベル",
"slots": {"when": "", "who": "", "where": "", "what": ""},
"full_translation": "文脈を考慮した自然な日本語訳",
"key_terms": ["重要な単語"],
"confidence": 0.0〜1.0,
"is_meaning_stable": true/false
}
注意: JSON以外の文字は出力しないでください。"""
In the SPEED prompt, intent_label and full_translation are placed early in the JSON schema. Because LLMs generate tokens sequentially from top to bottom, these fields are produced first — giving the user immediate access to the intent and a rough translation while the speaker is still talking.
QUALITY Prompt (for final / confirmed speech)
INTENT_SYSTEM_PROMPT_QUALITY = """あなたはリアルタイム対話支援AIです。
入力される英語から、話者の「意図」と「重要な単語」を抽出し、正確な翻訳を行ってください。
ルール:
- まず対話行為(dialogue_act)と重要情報(slots)を分析してください
- その分析結果を踏まえて、文脈に即した正確な翻訳を行ってください
- 出力は必ず指定されたJSON形式で行ってください
- intent_labelは日本語で10文字以内
出力JSON形式(この順序で出力すること):
{
"dialogue_act": "QUESTION | PROPOSAL | AGREEMENT | ...",
"slots": {"when": "", "who": "", "where": "", "what": ""},
"key_terms": ["重要な単語"],
"confidence": 0.0〜1.0,
"is_meaning_stable": true/false,
"intent_label": "短い日本語ラベル",
"full_translation": "文脈を考慮した正確な日本語訳"
}
注意: JSON以外の文字は出力しないでください。"""
In the QUALITY prompt, slots, key_terms, and confidence come first — the LLM analyzes context before generating the translation. intent_label and full_translation are placed last, so they benefit from the preceding analysis. This mirrors how human interpreters take a moment to understand context before delivering a polished translation.
LLM Calling Code for All Providers
The system supports three providers. The frontend provides a one-click toggle to switch between them.
| Provider | Type | Model |
|---|---|---|
| OpenAI | Proprietary model | GPT-4o-mini |
| Google Gemini | Proprietary model | Gemini 2.5 Flash Lite |
| Groq | Inference platform for open-source models | Llama 4 Maverick, Llama 3.3 70B, etc. |
Client Initialization
import os
from typing import Literal

from openai import AsyncOpenAI
import google.generativeai as genai

# OpenAI — standard client
openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Google Gemini — uses its own SDK
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
gemini_model = genai.GenerativeModel("gemini-2.5-flash-lite")

# Groq — provides an OpenAI-compatible API, so AsyncOpenAI works directly
GROQ_BASE_URL = "https://api.groq.com/openai/v1"
groq_client = AsyncOpenAI(base_url=GROQ_BASE_URL, api_key=os.getenv("GROQ_API_KEY"))

# Groq offers multiple open-source models to choose from
GroqModel = Literal[
    "llama-3.1-8b-instant",
    "llama-3.3-70b-versatile",
    "meta-llama/llama-4-maverick-17b-128e-instruct",
    "openai/gpt-oss-120b",
]
groq_model_name: GroqModel = "llama-3.3-70b-versatile"  # default
Groq providing an OpenAI-compatible API is a significant implementation advantage. The same AsyncOpenAI client works with just a different base_url.
OpenAI (GPT-4o-mini)
async def _call_gpt4(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED
    context = ""
    if self.context_history:
        # "過去の発話" = past utterances (context header shown to the LLM)
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    # "現在の発話(確定/進行中)" = current utterance (final / in progress)
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"
    stream = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3, max_tokens=500, stream=True,
    )
    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            await self._process_streaming_chunk(full_response, text, ...)
    await self._finalize_response(full_response, text, is_final, start_time, "GPT-4")
Google Gemini (2.5 Flash Lite)
Gemini uses its own SDK with a non-streaming call. asyncio.to_thread prevents blocking the event loop.
async def _call_gemini(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED
    context = ""
    if self.context_history:
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"
    full_prompt = f"{system_prompt}\n\n{user_prompt}"
    # Gemini API is synchronous — wrap with asyncio.to_thread
    response = await asyncio.to_thread(
        lambda: gemini_model.generate_content(
            full_prompt,
            generation_config=genai.types.GenerationConfig(
                temperature=0.3, max_output_tokens=1000,
            ),
        )
    )
    full_response = response.text if response.text else ""
    await self._finalize_response(full_response, text, is_final, start_time, "Gemini")
Groq (Llama and Other Open-Source Models)
Since Groq provides an OpenAI-compatible API, the code structure is nearly identical to _call_gpt4. Only the client and model name differ.
async def _call_groq(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED
    context = ""
    if self.context_history:
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"
    # groq_client initialized as AsyncOpenAI(base_url=GROQ_BASE_URL)
    stream = await groq_client.chat.completions.create(
        model=groq_model_name,  # dynamically switchable from frontend
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3, max_tokens=500, stream=True,
    )
    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            await self._process_streaming_chunk(full_response, text, ...)
    await self._finalize_response(full_response, text, is_final, start_time, f"Groq/{groq_model_name}")
Model Switching (via WebSocket)
The frontend sends messages to switch providers and models dynamically.
# Provider switch
elif msg_type == "set_model":
    model = message.get("model", "gpt4")  # "gpt4" | "gemini" | "groq"
    manager.set_model(model)
# Groq model switch
elif msg_type == "set_groq_model":
    groq_model_name = message.get("groq_model", "llama-3.3-70b-versatile")
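These elif branches sit inside the backend's WebSocket receive loop. A minimal, self-contained sketch of that dispatch follows; the SessionManager shape and handle_control_message name are my illustrative assumptions, not the actual project code:

```python
class SessionManager:
    """Illustrative stand-in for the real session manager; only model state shown."""
    def __init__(self) -> None:
        self.model = "gpt4"
        self.groq_model = "llama-3.3-70b-versatile"

    def set_model(self, model: str) -> None:
        self.model = model

    def set_groq_model(self, name: str) -> None:
        self.groq_model = name

def handle_control_message(manager: SessionManager, message: dict) -> None:
    """Dispatch one JSON control message received over the WebSocket."""
    msg_type = message.get("type")
    if msg_type == "set_model":
        manager.set_model(message.get("model", "gpt4"))  # "gpt4" | "gemini" | "groq"
    elif msg_type == "set_groq_model":
        manager.set_groq_model(message.get("groq_model", "llama-3.3-70b-versatile"))

manager = SessionManager()
handle_control_message(manager, {"type": "set_model", "model": "groq"})
handle_control_message(manager, {"type": "set_groq_model", "groq_model": "llama-3.1-8b-instant"})
print(manager.model, manager.groq_model)
```

Because switching is just a state update, the next LLM call picks up the new provider with no reconnection needed.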
Design Points Common to All Providers
- Prompt switching: The is_final flag determines whether to use the SPEED or QUALITY prompt.
- Context history: The last 3 utterances are included in the user prompt, giving the LLM conversational context for more coherent translations.
- Streaming chunk processing: Each chunk from the LLM is appended to full_response, and _process_streaming_chunk is called to extract and send partial results immediately.
- Timing logs: Each _call_* method measures elapsed time from start_time, logging first chunk arrival, translation detection, and completion time. The benchmark table below is derived from these logs.
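The timing-log pattern can be sketched in isolation. The names timed_stream and fake_stream below are illustrative, not the project's code; the pattern is simply timestamps taken around the streaming loop:

```python
import asyncio
import time

async def timed_stream(chunks, label: str) -> str:
    """Sketch of the timing-log pattern used in each _call_* method:
    record first-chunk arrival and total completion time."""
    start = time.time()
    first_ms = None
    full_response = ""
    async for chunk in chunks:
        if first_ms is None:
            first_ms = (time.time() - start) * 1000  # time to first token
        full_response += chunk
    total_ms = (time.time() - start) * 1000
    print(f"[{label}] first chunk: {first_ms:.0f}ms, total: {total_ms:.0f}ms")
    return full_response

async def fake_stream():
    """Simulated LLM stream with artificial network latency."""
    for piece in ['{"intent_label": ', '"挨拶"}']:
        await asyncio.sleep(0.01)
        yield piece

print(asyncio.run(timed_stream(fake_stream(), "demo")))
```

Logging both numbers separately matters here: first-chunk time governs how fast the intent label can appear, while total time governs the full translation.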
Streaming JSON Extraction
The _process_streaming_chunk method parses the in-progress JSON and sends partial results to the frontend via WebSocket as soon as they become available:
async def _process_streaming_chunk(self, full_response: str, text: str,
                                   sent_intent: bool, sent_translation: bool):
    """Parse in-progress JSON and send partial results via WebSocket"""
    if not sent_intent and '"intent_label"' in full_response:
        match = re.search(r'"intent_label"\s*:\s*"([^"]+)"', full_response)
        if match:
            await self.websocket.send_json({
                "type": "intent_partial",
                "intent_label": match.group(1),
                "source_text": text,
            })
    if not sent_translation and '"full_translation"' in full_response:
        match = re.search(r'"full_translation"\s*:\s*"([^"]+)"', full_response)
        if match:
            await self.websocket.send_json({
                "type": "translation_partial",
                "translation": match.group(1),
                "source_text": text,
            })
This uses regex to extract JSON field values from incomplete JSON. As soon as a field’s value is complete (the closing quote is detected), it is sent to the frontend immediately. There is no need to wait for the entire JSON to be valid — only the individual field needs to be parseable.
This is what makes the progressive display possible: intent_label arrives first, and full_translation follows shortly after, each sent the moment they are generated.
Debounce, Short Text Skip, and Duplicate Check
Deepgram returns partial results very quickly — potentially dozens of times per second. Without filtering, this would generate an overwhelming number of LLM calls. The estimate() method implements three mechanisms to control this:
async def estimate(self, text: str, is_final: bool):
    """Estimate intent from text (with rate limiting)"""
    if not text.strip():
        return
    if text == self.last_text:  # Duplicate check
        return
    self.last_text = text
    current_time = time.time() * 1000
    word_count = len(text.split())
    if is_final:
        # Final results always get processed immediately
        self.context_history.append(text)
        if len(self.context_history) > 5:
            self.context_history.pop(0)
        await self._call_llm_streaming(text, is_final)
        return
    # Skip short partials (e.g., "I" or "I think")
    if word_count < self.min_words:
        return
    time_since_last = current_time - self.last_call_time
    # Cancel any pending debounced call
    if self.pending_task and not self.pending_task.done():
        self.pending_task.cancel()
    # Debounce: wait 300ms before calling LLM
    if time_since_last < self.debounce_ms:
        wait_time = (self.debounce_ms - time_since_last) / 1000
        self.pending_task = asyncio.create_task(
            self._delayed_call(text, is_final, wait_time)
        )
    else:
        await self._call_llm_streaming(text, is_final)
The three filtering mechanisms:
- Duplicate check: If the incoming text is identical to the previous text, it is skipped. This prevents redundant processing when Deepgram sends the same partial result multiple times.
- Short text skip: Partials under 5 words (e.g., “I” or “I think”) are too fragmentary for meaningful translation. Skipping them reduces unnecessary LLM calls without losing useful information.
- Debounce (300ms): If new text arrives within 300ms of the last call, the pending LLM request is canceled, and the timer resets. This ensures we send only the latest version of the partial text rather than every intermediate state.
One important detail: is_final results always bypass all three filters. When the speaker finishes an utterance, we need the quality translation immediately — there is no reason to delay or skip it.
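estimate() references a _delayed_call helper that is not shown above. A plausible self-contained sketch of the cancellation mechanics follows; the class name and the stubbed _call_llm_streaming are mine, and the real implementation may differ:

```python
import asyncio

class DebounceSketch:
    """Partial sketch of the estimator's debounce helper.
    _call_llm_streaming is stubbed out here; the real method streams from the LLM."""
    def __init__(self) -> None:
        self.calls: list[str] = []

    async def _call_llm_streaming(self, text: str, is_final: bool) -> None:
        self.calls.append(text)  # stand-in for the real LLM call

    async def _delayed_call(self, text: str, is_final: bool, wait_time: float) -> None:
        # Sleep out the remaining debounce window, then fire the LLM call.
        # If estimate() cancels this task because newer text arrived,
        # the CancelledError aborts the stale call silently.
        try:
            await asyncio.sleep(wait_time)
            await self._call_llm_streaming(text, is_final)
        except asyncio.CancelledError:
            pass

async def demo() -> list[str]:
    est = DebounceSketch()
    stale = asyncio.create_task(est._delayed_call("I think we", False, 0.05))
    await asyncio.sleep(0.01)
    stale.cancel()  # newer partial arrived: cancel the pending call
    await est._delayed_call("I think we should reschedule", False, 0.02)
    return est.calls

print(asyncio.run(demo()))  # → ['I think we should reschedule']
```

The net effect: only the latest partial survives the debounce window, and the LLM never sees the superseded fragment.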
Frontend: WebSocket Message Handler
On the frontend (React), a handleMessage callback processes the three types of WebSocket messages and updates the UI state progressively:
const handleMessage = useCallback((message: WebSocketMessage) => {
  switch (message.type) {
    case 'intent_partial': {
      // Display intent label immediately
      setPartialIntentLabel(message.intent_label);
      // Update timeline: add or replace the latest partial entry
      setTimeline((prev) => {
        const lastPartialIndex = prev.findLastIndex((e) => !e.isFinal);
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = {
            ...newList[lastPartialIndex],
            intentLabel: message.intent_label,
          };
          return newList;
        }
        return [...prev, {
          id: Date.now(),
          intentLabel: message.intent_label,
          dialogueAct: 'OTHER',
          translation: null,
          isFinal: false,
          timestamp: new Date(),
        }];
      });
      break;
    }
    case 'translation_partial': {
      // Update the latest partial translation
      setRealtimeTranslations((prev) => {
        const lastPartialIndex = prev.findLastIndex((entry) => !entry.isFinal);
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = {
            sourceText: message.source_text,
            translation: message.translation,
            isFinal: false,
          };
          return newList;
        }
        return [...prev, {
          sourceText: message.source_text,
          translation: message.translation,
          isFinal: false,
        }];
      });
      break;
    }
    case 'intent': {
      // Complete result: replace partial with final
      setCurrentIntent(message.data);
      setPartialIntentLabel('');
      // Update timeline with confirmed entry
      setTimeline((prev) => {
        const lastPartialIndex = prev.findLastIndex((e) => !e.isFinal);
        const newEntry = {
          id: Date.now(),
          intentLabel: message.data.intent_label || '',
          dialogueAct: message.data.dialogue_act || 'OTHER',
          translation: message.data.full_translation || null,
          isFinal: message.is_final,
          timestamp: new Date(),
        };
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = newEntry;
          return newList;
        }
        return [...prev, newEntry];
      });
      break;
    }
  }
}, []);
The display flow has 3 stages:
1. intent_partial arrives first (~500ms): The intent label appears immediately. The translation area shows "Translating…" as a placeholder.
2. translation_partial arrives next (~500-800ms): The translation text fills in, replacing the placeholder.
3. intent (complete) arrives last: The partial entry is replaced with the confirmed result. Visual styling changes to indicate the translation is finalized.
Each stage updates the same timeline entry (found by findLastIndex where isFinal is false), so there is no duplication — the partial result smoothly transitions into the final result.
Frontend: Timeline UI
The timeline component renders each entry with visual states that reflect the data flow:
{/* Real-time translation timeline */}
<div className="bg-white rounded-xl shadow-lg p-6 mb-6">
  <h2 className="text-lg font-semibold text-gray-700">Real-time Translation</h2>
  <div className="max-h-[400px] overflow-y-auto space-y-3">
    {timeline.map((entry, i) => {
      const isLatest = i === timeline.length - 1;
      return (
        <div
          key={entry.id}
          className={`p-3 rounded-lg ${
            isLatest
              ? entry.isFinal
                ? 'bg-green-50 border-l-4 border-green-500' // Confirmed
                : 'bg-blue-50 border-l-4 border-blue-400'   // In-progress
              : 'bg-gray-50'                                // Historical
          }`}
        >
          {/* Intent label row */}
          <div className="flex items-center gap-2 mb-1">
            <span className="px-2 py-0.5 rounded text-white text-xs bg-blue-500">
              {entry.dialogueAct}
            </span>
            <span className="text-gray-700 font-medium">{entry.intentLabel}</span>
            {!entry.isFinal && (
              <span className="text-xs text-blue-500 animate-pulse">processing...</span>
            )}
          </div>
          {/* Translation row */}
          {entry.translation ? (
            <p className="text-gray-800 text-lg pl-1">→ {entry.translation}</p>
          ) : (
            <p className="text-gray-400 text-sm pl-1 animate-pulse">→ Translating...</p>
          )}
        </div>
      );
    })}
  </div>
</div>
The timeline UI uses 3 visual states:
- Blue left border + pulse animation: Still processing (partial result). The intent label is visible, but the translation may still be arriving.
- Green left border: Confirmed translation (final result). The entry is complete and verified.
- Gray background: Previous entries that have scrolled up in the history.
This progressive display mirrors the data flow. The intent appears first (blue border), then the translation fills in, and when confirmed the entry turns green. The user sees information building up in real time rather than waiting for a blank screen to suddenly populate.
Benchmarking 6 LLM Models
The system lets you switch between any provider and model via a frontend toggle. Here are the results from testing all models with the same input.
Proprietary Models
| Provider | Model | Translation Speed | Quality | Cost / 5hrs |
|---|---|---|---|---|
| Google | Gemini 2.5 Flash Lite | 954ms | ◎ | $1.17 (~¥175) |
| OpenAI | GPT-4o-mini | 1,976ms | ◎ | $1.74 (~¥261) |
Open-Source Models (Groq LPU Inference)
Groq does not develop its own models. It is a cloud service that runs open-source models such as Llama at high speed using proprietary LPU (Language Processing Unit) chips.
| Model | Translation Speed | Quality | Cost / 5hrs |
|---|---|---|---|
| Llama 4 Maverick | 413ms | ◎ | $3.43 (~¥515) |
| Llama 3.3 70B | 480ms | ◎ | — |
| Llama 3.1 8B | 377ms | ○ | — |
| GPT-OSS 120B | 662ms | ◎ | — |
What the Comparison Reveals
- Speed priority: Open-source models via Groq are overwhelmingly fast (~400ms). This is the effect of LPU hardware optimization.
- Cost priority: Gemini 2.5 Flash Lite is the most economical at ~$1.17 for 5 hours, with a free tier available.
- Quality and stability: OpenAI GPT-4o-mini is slower but offers consistent translation quality and mature API documentation.
- Trade-off: Groq is the fastest but also the most expensive for continuous use (~$3.43 for 5 hours). Speed comes at a price: lower latency means a higher bill.
Overall System Data Flow
Here’s the technical overview:
[Browser]                    [Backend (FastAPI)]            [External API]
    |                                |                            |
    |-- Audio binary --------------->|                            |
    |                                |-- Audio data ------------->| Deepgram
    |                                |<-- Partial/Final text -----|
    |                                |                            |
    |                                |-- Text ------------------->| LLM (OpenAI/Gemini/Groq)
    |                                |<-- Streaming JSON ---------|
    |                                |                            |
    |<-- intent_partial -------------|  (Send intent label immediately)
    |<-- translation_partial --------|  (Send translation immediately)
    |<-- intent (complete) ----------|  (Complete result)
The key design: instead of waiting for the full LLM output, each JSON field is sent as an individual WebSocket message to the frontend as soon as it’s generated.
Key Lessons from This Development
The biggest takeaway from building this system was the importance of treating LLM output as a “stream” rather than a “result”.
- Optimize JSON field order to generate important information first
- Parse in-progress JSON partially and send it immediately
- Dynamically switch prompts based on Partial/Final state
- Filter aggressively on the input side — debounce, dedup, and skip short fragments
- Build the frontend to display information progressively, not all-at-once
All of these are designs that leverage LLM’s fundamental behavior of “generating from top to bottom.” These techniques don’t solve the fundamental challenge of real-time translation — that will require advances at the model and hardware level — but they demonstrate what’s achievable today with publicly available tools.
In the next post, I’ll share my experience testing local LLM inference, provide a guide for adapting this system to other language pairs, and reflect on what this approach can and cannot do.
Join the conversation on LinkedIn — share your thoughts and comments.