Part 2. Build Your Own Real-Time Translator - LLM Streaming for 500ms
Now supports 4 languages: English, Japanese, Spanish, and Chinese — bidirectional translation between any pair. Source code available on GitHub.
In Part 1, I introduced Intent-First Translation — an approach to voice translation that displays the speaker’s intent within about 500 milliseconds, keeping conversation tempo intact. I also walked through the foundational setup with Deepgram, FastAPI, and WebSocket.
This post covers the core LLM streaming implementation: the dual prompt strategy, streaming JSON extraction, debounce and filtering logic, and the progressive frontend display that ties it all together. I’ll include the actual code throughout.
Delivering Information in 3 Layers, Step by Step
Intent-First Translation is not a single translation pass. It is a layered design: three layers running at different speeds deliver information progressively.
Layer 1: Keyword Prediction → ~0ms (No LLM needed)
Layer 2: Intent Label → ~500ms (LLM streaming)
Layer 3: Full Translation → ~500-800ms (LLM streaming continued)
Layer 1: Keyword Prediction (Near-Zero Latency)
For the fragmentary text from speech recognition, we predict the topic instantly using a dictionary-based approach — no LLM involved.
“meeting” → 「会議」, “budget” → 「予算」, “Tuesday” → 「火曜日」
Even this alone lets the listener instantly grasp that the topic involves meetings, budget, and Tuesday. Since there is no LLM response to wait for, it displays with near-zero latency.
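The Layer 1 lookup can be sketched in a few lines. This is an illustrative sketch: the dictionary contents and function name are mine, not the project's actual code.

```python
# Layer 1 sketch: instant keyword prediction with a plain dictionary lookup.
# No LLM, no network call, so latency is effectively zero.
KEYWORD_DICT = {
    "meeting": "会議",
    "budget": "予算",
    "tuesday": "火曜日",
}

def predict_keywords(partial_text: str) -> list[str]:
    """Return known keywords found in a partial transcript, in order of appearance."""
    hits = []
    for word in partial_text.lower().split():
        translated = KEYWORD_DICT.get(word.strip(".,?!"))
        if translated and translated not in hits:
            hits.append(translated)
    return hits

print(predict_keywords("Let's move the meeting about the budget to Tuesday"))
# → ['会議', '予算', '火曜日']
```

A real deployment would load a larger, domain-specific dictionary, but the mechanism stays this simple.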
Layer 2 & 3: Intent Estimation and Translation by LLM
Every time the partial result from speech recognition (Deepgram) updates — the unconfirmed text while still speaking — we request intent estimation and translation from the LLM in JSON format.
The key is streaming output. We parse the JSON character by character as the LLM generates it, and send each piece to the frontend as soon as it’s ready.
JSON Field Order Determines the Speed
This is the most important point I want to share in this post.
LLM streaming output generates JSON from top to bottom. The first-defined field is generated first, and the last field is generated last.
This means the order of fields determines when each piece of information reaches the user.
When intent and translation are placed first:
{
  "dialogue_act": "PROPOSAL",
  "intent_label": "Schedule adjustment proposal",      ← Displayed immediately
  "slots": {"when": "Tuesday"},
  "full_translation": "Let's move the meeting to...",  ← Translation arrives early
  "confidence": 0.85,
  "is_meaning_stable": true
}
When translation is placed last:
{
  "dialogue_act": "PROPOSAL",
  "slots": {"when": "Tuesday"},
  "key_terms": ["meeting", "reschedule"],
  "confidence": 0.85,
  "is_meaning_stable": true,
  "intent_label": "Schedule adjustment proposal",
  "full_translation": "Let's move the meeting to..."   ← Must wait for all fields
}
In the latter case, you have to wait for all fields to be generated before the translation appears, causing roughly 2x the latency.
Just rearranging the JSON field order: that single change made translation display nearly 2x faster. The optimization follows directly from the LLM's "generate from top to bottom" behavior.
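The effect is easy to see without calling an LLM at all: treat a JSON string as a character stream and ask at which character each field's value is fully closed. A toy illustration (field values are placeholders, not real model output):

```python
import re

def chars_until_usable(json_text: str, field: str) -> int:
    """Number of characters that must be generated before `field`'s
    string value is closed -- a proxy for when a streaming consumer
    could display it."""
    match = re.search(rf'"{field}"\s*:\s*"[^"]*"', json_text)
    return match.end()

# Same fields, different order.
fast = '{"intent_label": "Schedule proposal", "confidence": 0.85, "full_translation": "..."}'
slow = '{"confidence": 0.85, "full_translation": "...", "intent_label": "Schedule proposal"}'

print(chars_until_usable(fast, "intent_label"))  # small: usable almost immediately
print(chars_until_usable(slow, "intent_label"))  # large: waits behind other fields
assert chars_until_usable(fast, "intent_label") < chars_until_usable(slow, "intent_label")
```

Characters map roughly to tokens, and tokens map to wall-clock time, so fewer characters before the closing quote means an earlier display.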
The Dual Prompt Strategy
Speech recognition results come in two types:
- Partial (in-progress): Incomplete text while still speaking (e.g., “I think we should…”)
- Final (confirmed): Confirmed text after the speaker finishes (e.g., “I think we should reschedule the meeting to Tuesday.”)
I realized that using the same prompt for both is inefficient.
| State | Goal | Prompt Strategy |
|---|---|---|
| Partial (in-progress) | Show intent quickly | Speed-first: generate intent → translation first |
| Final (confirmed) | Provide accurate translation | Quality-first: analyze context first, improve accuracy |
While speaking, we prioritize speed and show rough intent and translation. When the speaker finishes, we replace it with an accurate, quality-first translation.
This is the same structure that human simultaneous interpreters use. First convey the gist, then correct when confirmed.
Here are the actual prompts.
SPEED Prompt (for partial / in-progress speech)
INTENT_SYSTEM_PROMPT_SPEED = """あなたはリアルタイム同時通訳AIです。
英語の発話をリアルタイムで日本語に翻訳します。
処理順序(重要):
1. まず対話行為(dialogue_act)を判断
2. 意図(intent_label)を抽出
3. 重要情報(slots)を特定
4. 上記を考慮して翻訳(full_translation)を生成
翻訳ルール:
- 自然で流暢な日本語に翻訳してください
- 文が途中でも、現時点で言いたいことを推測して完結した日本語にしてください
その他のルール:
- intent_labelは日本語で10文字以内
- 出力は必ず指定されたJSON形式で行ってください
出力JSON形式(この順序で出力すること):
{
"dialogue_act": "QUESTION | PROPOSAL | AGREEMENT | ...",
"intent_label": "短い日本語ラベル",
"slots": {"when": "", "who": "", "where": "", "what": ""},
"full_translation": "文脈を考慮した自然な日本語訳",
"key_terms": ["重要な単語"],
"confidence": 0.0〜1.0,
"is_meaning_stable": true/false
}
注意: JSON以外の文字は出力しないでください。"""
In the SPEED prompt, intent_label and full_translation are placed early in the JSON schema. Because LLMs generate tokens sequentially from top to bottom, these fields are produced first — giving the user immediate access to the intent and a rough translation while the speaker is still talking.
QUALITY Prompt (for final / confirmed speech)
INTENT_SYSTEM_PROMPT_QUALITY = """あなたはリアルタイム対話支援AIです。
入力される英語から、話者の「意図」と「重要な単語」を抽出し、正確な翻訳を行ってください。
ルール:
- まず対話行為(dialogue_act)と重要情報(slots)を分析してください
- その分析結果を踏まえて、文脈に即した正確な翻訳を行ってください
- 出力は必ず指定されたJSON形式で行ってください
- intent_labelは日本語で10文字以内
出力JSON形式(この順序で出力すること):
{
"dialogue_act": "QUESTION | PROPOSAL | AGREEMENT | ...",
"slots": {"when": "", "who": "", "where": "", "what": ""},
"key_terms": ["重要な単語"],
"confidence": 0.0〜1.0,
"is_meaning_stable": true/false,
"intent_label": "短い日本語ラベル",
"full_translation": "文脈を考慮した正確な日本語訳"
}
注意: JSON以外の文字は出力しないでください。"""
In the QUALITY prompt, slots, key_terms, and confidence come first — the LLM analyzes context before generating the translation. intent_label and full_translation are placed last, so they benefit from the preceding analysis. This mirrors how human interpreters take a moment to understand context before delivering a polished translation.
LLM Calling Code for All Providers
The system supports three providers. The frontend provides a one-click toggle to switch between them.
| Provider | Type | Model |
|---|---|---|
| OpenAI | Proprietary model | GPT-4o-mini |
| Google Gemini | Proprietary model | Gemini 2.5 Flash Lite |
| Groq | Inference platform for open-source models | Llama 4 Maverick, Llama 3.3 70B, etc. |
Client Initialization
import os
from typing import Literal

from openai import AsyncOpenAI
import google.generativeai as genai

# OpenAI — standard client
openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Google Gemini — uses its own SDK
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
gemini_model = genai.GenerativeModel("gemini-2.5-flash-lite")

# Groq — provides an OpenAI-compatible API, so AsyncOpenAI works directly
GROQ_BASE_URL = "https://api.groq.com/openai/v1"
groq_client = AsyncOpenAI(base_url=GROQ_BASE_URL, api_key=os.getenv("GROQ_API_KEY"))

# Groq offers multiple open-source models to choose from
GroqModel = Literal[
    "llama-3.1-8b-instant",
    "llama-3.3-70b-versatile",
    "meta-llama/llama-4-maverick-17b-128e-instruct",
    "openai/gpt-oss-120b",
]
groq_model_name: GroqModel = "llama-3.3-70b-versatile"  # default
Groq providing an OpenAI-compatible API is a significant implementation advantage. The same AsyncOpenAI client works with just a different base_url.
OpenAI (GPT-4o-mini)
async def _call_gpt4(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED
    context = ""
    if self.context_history:
        # "過去の発話" = past utterances (context header shown to the LLM)
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    # "現在の発話(確定/進行中)" = current utterance (final / in progress)
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"
    stream = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3, max_tokens=500, stream=True,
    )
    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            await self._process_streaming_chunk(full_response, text, ...)
    await self._finalize_response(full_response, text, is_final, start_time, "GPT-4")
Google Gemini (2.5 Flash Lite)
Gemini uses its own SDK with a non-streaming call. asyncio.to_thread prevents blocking the event loop.
async def _call_gemini(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED
    context = ""
    if self.context_history:
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"
    full_prompt = f"{system_prompt}\n\n{user_prompt}"
    # Gemini API is synchronous — wrap with asyncio.to_thread
    response = await asyncio.to_thread(
        lambda: gemini_model.generate_content(
            full_prompt,
            generation_config=genai.types.GenerationConfig(
                temperature=0.3, max_output_tokens=1000,
            ),
        )
    )
    full_response = response.text if response.text else ""
    await self._finalize_response(full_response, text, is_final, start_time, "Gemini")
Groq (Llama and Other Open-Source Models)
Since Groq provides an OpenAI-compatible API, the code structure is nearly identical to _call_gpt4. Only the client and model name differ.
async def _call_groq(self, text: str, is_final: bool):
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED
    context = ""
    if self.context_history:
        context = "過去の発話:\n" + "\n".join(self.context_history[-3:]) + "\n\n"
    user_prompt = f"{context}現在の発話({'確定' if is_final else '進行中'}):\n{text}"
    # groq_client initialized as AsyncOpenAI(base_url=GROQ_BASE_URL)
    stream = await groq_client.chat.completions.create(
        model=groq_model_name,  # dynamically switchable from frontend
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3, max_tokens=500, stream=True,
    )
    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            await self._process_streaming_chunk(full_response, text, ...)
    await self._finalize_response(full_response, text, is_final, start_time, f"Groq/{groq_model_name}")
Model Switching (via WebSocket)
The frontend sends messages to switch providers and models dynamically.
# Provider switch
elif msg_type == "set_model":
    model = message.get("model", "gpt4")  # "gpt4" | "gemini" | "groq"
    manager.set_model(model)
# Groq model switch
elif msg_type == "set_groq_model":
    groq_model_name = message.get("groq_model", "llama-3.3-70b-versatile")
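These elif branches sit inside the backend's WebSocket receive loop. A minimal, self-contained sketch of that dispatch follows; the SessionManager shape and handle_control_message name are my illustrative assumptions, not the actual project code:

```python
class SessionManager:
    """Illustrative stand-in for the real session manager; only model state shown."""
    def __init__(self) -> None:
        self.model = "gpt4"
        self.groq_model = "llama-3.3-70b-versatile"

    def set_model(self, model: str) -> None:
        self.model = model

    def set_groq_model(self, name: str) -> None:
        self.groq_model = name

def handle_control_message(manager: SessionManager, message: dict) -> None:
    """Dispatch one JSON control message received over the WebSocket."""
    msg_type = message.get("type")
    if msg_type == "set_model":
        manager.set_model(message.get("model", "gpt4"))  # "gpt4" | "gemini" | "groq"
    elif msg_type == "set_groq_model":
        manager.set_groq_model(message.get("groq_model", "llama-3.3-70b-versatile"))

manager = SessionManager()
handle_control_message(manager, {"type": "set_model", "model": "groq"})
handle_control_message(manager, {"type": "set_groq_model", "groq_model": "llama-3.1-8b-instant"})
print(manager.model, manager.groq_model)
```

Because switching is just a state update, the next LLM call picks up the new provider with no reconnection needed.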
Design Points Common to All Providers
- Prompt switching: The is_final flag determines whether to use the SPEED or QUALITY prompt.
- Context history: The last 3 utterances are included in the user prompt, giving the LLM conversational context for more coherent translations.
- Streaming chunk processing: Each chunk from the LLM is appended to full_response, and _process_streaming_chunk is called to extract and send partial results immediately.
- Timing logs: Each _call_* method measures elapsed time from start_time, logging first chunk arrival, translation detection, and completion time. The benchmark table below is derived from these logs.
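The timing-log pattern can be sketched in isolation. The names timed_stream and fake_stream below are illustrative, not the project's code; the pattern is simply timestamps taken around the streaming loop:

```python
import asyncio
import time

async def timed_stream(chunks, label: str) -> str:
    """Sketch of the timing-log pattern used in each _call_* method:
    record first-chunk arrival and total completion time."""
    start = time.time()
    first_ms = None
    full_response = ""
    async for chunk in chunks:
        if first_ms is None:
            first_ms = (time.time() - start) * 1000  # time to first token
        full_response += chunk
    total_ms = (time.time() - start) * 1000
    print(f"[{label}] first chunk: {first_ms:.0f}ms, total: {total_ms:.0f}ms")
    return full_response

async def fake_stream():
    """Simulated LLM stream with artificial network latency."""
    for piece in ['{"intent_label": ', '"挨拶"}']:
        await asyncio.sleep(0.01)
        yield piece

print(asyncio.run(timed_stream(fake_stream(), "demo")))
```

Logging both numbers separately matters here: first-chunk time governs how fast the intent label can appear, while total time governs the full translation.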
Streaming JSON Extraction
The _process_streaming_chunk method parses the in-progress JSON and sends partial results to the frontend via WebSocket as soon as they become available:
async def _process_streaming_chunk(self, full_response: str, text: str,
                                   sent_intent: bool, sent_translation: bool):
    """Parse in-progress JSON and send partial results via WebSocket"""
    if not sent_intent and '"intent_label"' in full_response:
        match = re.search(r'"intent_label"\s*:\s*"([^"]+)"', full_response)
        if match:
            await self.websocket.send_json({
                "type": "intent_partial",
                "intent_label": match.group(1),
                "source_text": text,
            })
    if not sent_translation and '"full_translation"' in full_response:
        match = re.search(r'"full_translation"\s*:\s*"([^"]+)"', full_response)
        if match:
            await self.websocket.send_json({
                "type": "translation_partial",
                "translation": match.group(1),
                "source_text": text,
            })
This uses regex to extract JSON field values from incomplete JSON. As soon as a field’s value is complete (the closing quote is detected), it is sent to the frontend immediately. There is no need to wait for the entire JSON to be valid — only the individual field needs to be parseable.
This is what makes the progressive display possible: intent_label arrives first, and full_translation follows shortly after, each sent the moment they are generated.
Debounce, Short Text Skip, and Duplicate Check
Deepgram returns partial results very quickly — potentially dozens of times per second. Without filtering, this would generate an overwhelming number of LLM calls. The estimate() method implements three mechanisms to control this:
async def estimate(self, text: str, is_final: bool):
    """Estimate intent from text (with rate limiting)"""
    if not text.strip():
        return
    if text == self.last_text:  # Duplicate check
        return
    self.last_text = text
    current_time = time.time() * 1000
    word_count = len(text.split())
    if is_final:
        # Final results always get processed immediately
        self.context_history.append(text)
        if len(self.context_history) > 5:
            self.context_history.pop(0)
        await self._call_llm_streaming(text, is_final)
        return
    # Skip short partials (e.g., "I" or "I think")
    if word_count < self.min_words:
        return
    time_since_last = current_time - self.last_call_time
    # Cancel any pending debounced call
    if self.pending_task and not self.pending_task.done():
        self.pending_task.cancel()
    # Debounce: wait 300ms before calling LLM
    if time_since_last < self.debounce_ms:
        wait_time = (self.debounce_ms - time_since_last) / 1000
        self.pending_task = asyncio.create_task(
            self._delayed_call(text, is_final, wait_time)
        )
    else:
        await self._call_llm_streaming(text, is_final)
The three filtering mechanisms:
- Duplicate check: If the incoming text is identical to the previous text, it is skipped. This prevents redundant processing when Deepgram sends the same partial result multiple times.
- Short text skip: Partials under 5 words (e.g., “I” or “I think”) are too fragmentary for meaningful translation. Skipping them reduces unnecessary LLM calls without losing useful information.
- Debounce (300ms): If new text arrives within 300ms of the last call, the pending LLM request is canceled, and the timer resets. This ensures we send only the latest version of the partial text rather than every intermediate state.
One important detail: is_final results always bypass all three filters. When the speaker finishes an utterance, we need the quality translation immediately — there is no reason to delay or skip it.
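estimate() references a _delayed_call helper that is not shown above. A plausible self-contained sketch of the cancellation mechanics follows; the class name and the stubbed _call_llm_streaming are mine, and the real implementation may differ:

```python
import asyncio

class DebounceSketch:
    """Partial sketch of the estimator's debounce helper.
    _call_llm_streaming is stubbed out here; the real method streams from the LLM."""
    def __init__(self) -> None:
        self.calls: list[str] = []

    async def _call_llm_streaming(self, text: str, is_final: bool) -> None:
        self.calls.append(text)  # stand-in for the real LLM call

    async def _delayed_call(self, text: str, is_final: bool, wait_time: float) -> None:
        # Sleep out the remaining debounce window, then fire the LLM call.
        # If estimate() cancels this task because newer text arrived,
        # the CancelledError aborts the stale call silently.
        try:
            await asyncio.sleep(wait_time)
            await self._call_llm_streaming(text, is_final)
        except asyncio.CancelledError:
            pass

async def demo() -> list[str]:
    est = DebounceSketch()
    stale = asyncio.create_task(est._delayed_call("I think we", False, 0.05))
    await asyncio.sleep(0.01)
    stale.cancel()  # newer partial arrived: cancel the pending call
    await est._delayed_call("I think we should reschedule", False, 0.02)
    return est.calls

print(asyncio.run(demo()))  # → ['I think we should reschedule']
```

The net effect: only the latest partial survives the debounce window, and the LLM never sees the superseded fragment.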
Frontend: WebSocket Message Handler
On the frontend (React), a handleMessage callback processes the three types of WebSocket messages and updates the UI state progressively:
const handleMessage = useCallback((message: WebSocketMessage) => {
  switch (message.type) {
    case 'intent_partial': {
      // Display intent label immediately
      setPartialIntentLabel(message.intent_label);
      // Update timeline: add or replace the latest partial entry
      setTimeline((prev) => {
        const lastPartialIndex = prev.findLastIndex((e) => !e.isFinal);
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = {
            ...newList[lastPartialIndex],
            intentLabel: message.intent_label,
          };
          return newList;
        }
        return [...prev, {
          id: Date.now(),
          intentLabel: message.intent_label,
          dialogueAct: 'OTHER',
          translation: null,
          isFinal: false,
          timestamp: new Date(),
        }];
      });
      break;
    }
    case 'translation_partial': {
      // Update the latest partial translation
      setRealtimeTranslations((prev) => {
        const lastPartialIndex = prev.findLastIndex((entry) => !entry.isFinal);
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = {
            sourceText: message.source_text,
            translation: message.translation,
            isFinal: false,
          };
          return newList;
        }
        return [...prev, {
          sourceText: message.source_text,
          translation: message.translation,
          isFinal: false,
        }];
      });
      break;
    }
    case 'intent': {
      // Complete result: replace partial with final
      setCurrentIntent(message.data);
      setPartialIntentLabel('');
      // Update timeline with confirmed entry
      setTimeline((prev) => {
        const lastPartialIndex = prev.findLastIndex((e) => !e.isFinal);
        const newEntry = {
          id: Date.now(),
          intentLabel: message.data.intent_label || '',
          dialogueAct: message.data.dialogue_act || 'OTHER',
          translation: message.data.full_translation || null,
          isFinal: message.is_final,
          timestamp: new Date(),
        };
        if (lastPartialIndex !== -1) {
          const newList = [...prev];
          newList[lastPartialIndex] = newEntry;
          return newList;
        }
        return [...prev, newEntry];
      });
      break;
    }
  }
}, []);
The display flow has 3 stages:
1. intent_partial arrives first (~500ms): The intent label appears immediately. The translation area shows "Translating…" as a placeholder.
2. translation_partial arrives next (~500-800ms): The translation text fills in, replacing the placeholder.
3. intent (complete) arrives last: The partial entry is replaced with the confirmed result. Visual styling changes to indicate the translation is finalized.
Each stage updates the same timeline entry (found by findLastIndex where isFinal is false), so there is no duplication — the partial result smoothly transitions into the final result.
Frontend: Timeline UI
The timeline component renders each entry with visual states that reflect the data flow:
{/* Real-time translation timeline */}
<div className="bg-white rounded-xl shadow-lg p-6 mb-6">
  <h2 className="text-lg font-semibold text-gray-700">Real-time Translation</h2>
  <div className="max-h-[400px] overflow-y-auto space-y-3">
    {timeline.map((entry, i) => {
      const isLatest = i === timeline.length - 1;
      return (
        <div
          key={entry.id}
          className={`p-3 rounded-lg ${
            isLatest
              ? entry.isFinal
                ? 'bg-green-50 border-l-4 border-green-500' // Confirmed
                : 'bg-blue-50 border-l-4 border-blue-400'   // In-progress
              : 'bg-gray-50'                                // Historical
          }`}
        >
          {/* Intent label row */}
          <div className="flex items-center gap-2 mb-1">
            <span className="px-2 py-0.5 rounded text-white text-xs bg-blue-500">
              {entry.dialogueAct}
            </span>
            <span className="text-gray-700 font-medium">{entry.intentLabel}</span>
            {!entry.isFinal && (
              <span className="text-xs text-blue-500 animate-pulse">processing...</span>
            )}
          </div>
          {/* Translation row */}
          {entry.translation ? (
            <p className="text-gray-800 text-lg pl-1">→ {entry.translation}</p>
          ) : (
            <p className="text-gray-400 text-sm pl-1 animate-pulse">→ Translating...</p>
          )}
        </div>
      );
    })}
  </div>
</div>
The timeline UI uses 3 visual states:
- Blue left border + pulse animation: Still processing (partial result). The intent label is visible, but the translation may still be arriving.
- Green left border: Confirmed translation (final result). The entry is complete and verified.
- Gray background: Previous entries that have scrolled up in the history.
This progressive display mirrors the data flow. The intent appears first (blue border), then the translation fills in, and when confirmed the entry turns green. The user sees information building up in real time rather than waiting for a blank screen to suddenly populate.
Benchmarking 6 LLM Models
The system lets you switch between any provider and model via a frontend toggle. Here are the results from testing all models with the same input.
Proprietary Models
| Provider | Model | Translation Speed | Quality | Cost / 5hrs |
|---|---|---|---|---|
| Google | Gemini 2.5 Flash Lite | 954ms | ◎ | $1.17 (~¥175) |
| OpenAI | GPT-4o-mini | 1,976ms | ◎ | $1.74 (~¥261) |
Open-Source Models (Groq LPU Inference)
Groq does not develop its own models. It is a cloud service that runs open-source models such as Llama at high speed using proprietary LPU (Language Processing Unit) chips.
| Model | Translation Speed | Quality | Cost / 5hrs |
|---|---|---|---|
| Llama 4 Maverick | 413ms | ◎ | $3.43 (~¥515) |
| Llama 3.3 70B | 480ms | ◎ | — |
| Llama 3.1 8B | 377ms | ○ | — |
| GPT-OSS 120B | 662ms | ◎ | — |
What the Comparison Reveals
- Speed priority: Open-source models via Groq are overwhelmingly fast (~400ms). This is the effect of LPU hardware optimization.
- Cost priority: Gemini 2.5 Flash Lite is the most economical at ~$1.17 for 5 hours, with a free tier available.
- Quality and stability: OpenAI GPT-4o-mini is slower but offers consistent translation quality and mature API documentation.
- Trade-off: Groq is the fastest but also the most expensive for continuous use (~$3.43 for 5 hours). Speed comes at a price: lower latency means a higher bill.
Overall System Data Flow
Here’s the technical overview:
[Browser]                    [Backend (FastAPI)]            [External API]
    |                                |                            |
    |-- Audio binary --------------->|                            |
    |                                |-- Audio data ------------->| Deepgram
    |                                |<-- Partial/Final text -----|
    |                                |                            |
    |                                |-- Text ------------------->| LLM (OpenAI/Gemini/Groq)
    |                                |<-- Streaming JSON ---------|
    |                                |                            |
    |<-- intent_partial -------------|  (Send intent label immediately)
    |<-- translation_partial --------|  (Send translation immediately)
    |<-- intent (complete) ----------|  (Complete result)
The key design: instead of waiting for the full LLM output, each JSON field is sent as an individual WebSocket message to the frontend as soon as it’s generated.
Key Lessons from This Development
The biggest takeaway from building this system was the importance of treating LLM output as a “stream” rather than a “result”.
- Optimize JSON field order to generate important information first
- Parse in-progress JSON partially and send it immediately
- Dynamically switch prompts based on Partial/Final state
- Filter aggressively on the input side — debounce, dedup, and skip short fragments
- Build the frontend to display information progressively, not all-at-once
All of these are designs that leverage LLM’s fundamental behavior of “generating from top to bottom.” These techniques don’t solve the fundamental challenge of real-time translation — that will require advances at the model and hardware level — but they demonstrate what’s achievable today with publicly available tools.
In the next post, I’ll share my experience testing local LLM inference, provide a guide for adapting this system to other language pairs, and reflect on what this approach can and cannot do.
Join the conversation on LinkedIn — share your thoughts and comments.