Part 3. Build Your Own Real-Time Translator - Ollama, LM Studio, and Home GPU

Now supports 4 languages: English, Japanese, Spanish, and Chinese — bidirectional translation between any pair. Source code available on GitHub.

In Part 1, I introduced the core problem — the silence gap in voice translation — and built the foundation with Deepgram, FastAPI, and WebSocket. In Part 2, I covered the LLM streaming implementation in detail, including JSON field order optimization and the dual-prompt strategy with full code. In this third part, I share hands-on results from local GPU testing with actual setup commands and backend code, investigate mobile LLM feasibility with concrete benchmarks, provide a practical guide for adapting the system to any language pair, and close with an honest reflection on what this approach can and cannot do.


Why I Tried Local LLM

Cloud APIs are fast and convenient, but they cost money for continuous use.

Provider                  Cost per 5 hours
Groq / Llama 4 Maverick   $3.43 (¥515)
OpenAI GPT-4o-mini        $1.74 (¥261)
Gemini 2.5 Flash Lite     $1.17 (¥175)

As a daily-use tool, this cost is not negligible. If real-time translation could run entirely on a home GPU, the running cost would drop to zero.
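To put the per-session figures in perspective, here is a quick extrapolation to a month of daily use. The rates come from the comparison table above (USD); the 30-day, 5-hours-per-day usage pattern is an assumption for illustration.

```python
# Rough monthly cost projection, assuming 5 hours of use every day.
# Rates are the USD figures from the provider comparison table above.
COST_PER_5_HOURS = {
    "Groq / Llama 4 Maverick": 3.43,
    "OpenAI GPT-4o-mini": 1.74,
    "Gemini 2.5 Flash Lite": 1.17,
}

def monthly_cost(cost_per_5h: float, days: int = 30) -> float:
    """Extrapolate a per-5-hour cost to a month of daily use."""
    return round(cost_per_5h * days, 2)

for provider, cost in COST_PER_5_HOURS.items():
    print(f"{provider}: ${monthly_cost(cost)}/month")
```

At roughly $35-$100 per month depending on provider, the appeal of a zero-cost local setup is easy to see.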

So I tested Google’s open-source LLM “Gemma 3 4B” on my RTX 3060 (6GB VRAM) to see if it could handle real-time translation.


Setting Up Ollama

I started with Ollama, which is perhaps the most straightforward way to run open-source LLMs locally.

Installation and Model Setup

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 3 4B model
ollama pull gemma3:4b

# Verify it's running
ollama list

After installation, Ollama runs as a background service and exposes an API on port 11434. One of its key advantages is that it provides an OpenAI-compatible endpoint, which means the same AsyncOpenAI client used for cloud APIs can connect to it with a single configuration change.

Backend Configuration

# Ollama provides an OpenAI-compatible API endpoint
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://192.168.0.183:11434/v1")
gemma_client = AsyncOpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")

# Critical: Ollama with 6GB VRAM cannot handle parallel requests
# A lock prevents concurrent calls
gemma_lock = asyncio.Lock()

Streaming Call Implementation

The streaming call to Ollama follows the same pattern as cloud API calls. Here is a simplified version of the _call_gemma method:

async def _call_gemma(self, text: str, is_final: bool):
    """Stream response from Gemma 3 via Ollama"""
    start_time = time.time()
    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED
    user_prompt = text  # the transcribed utterance (partial or confirmed)

    stream = await gemma_client.chat.completions.create(
        model="gemma3:4b",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3,
        max_tokens=500,
        stream=True,
    )

    # Track which fields have already been delivered, so partial updates
    # are not re-sent to the client on every chunk
    sent_intent_partial = False
    sent_translation_partial = False

    full_response = ""
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            await self._process_streaming_chunk(
                full_response, text, sent_intent_partial, sent_translation_partial
            )
    print(f"[Gemma] completed in {time.time() - start_time:.2f}s")

Results

  • Single requests: Translation quality was practical. Gemma 3 4B handled English-to-Japanese translation reasonably well, and the streaming output was functional.
  • Parallel requests (which real-time translation requires): GPU memory was exhausted and Windows forced a shutdown.

The fundamental issue is that Ollama loads the entire model into VRAM. With 6GB, there is no room for concurrent inference contexts. Real-time translation generates multiple overlapping requests — partial results while the speaker is still talking, then a final translation once the utterance is confirmed. This concurrency pattern is incompatible with a single-model-in-VRAM architecture on a 6GB card.


LM Studio 0.4.0 — Headless Server via CLI

Next, I tried LM Studio, which introduced headless server capabilities in version 0.4.0 through its CLI tool and the llmster daemon.

Complete CLI Setup

# Install LM Studio (download from lmstudio.ai)
# LM Studio 0.4.0 introduced the headless daemon 'llmster'

# Start the daemon (runs without GUI)
lms daemon up

# Start the inference server
lms server start --port 1234 --bind 0.0.0.0 --cors

# Load a model (if not already loaded via GUI)
lms load gemma-3-4b-it

# Verify the server is running
curl http://localhost:1234/v1/models

The lms daemon up command starts LM Studio’s inference engine as a background process, entirely separate from the GUI application. This is particularly useful for running inference on a headless machine or integrating into automated pipelines. Like Ollama, it exposes an OpenAI-compatible API.
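As a sanity check that the endpoint is actually reachable, the `/v1/models` route can be probed with only the standard library. The host and port below are the ones from this setup (adjust to yours); the response shape is the standard OpenAI-compatible `{"data": [{"id": ...}]}` format that both LM Studio and Ollama serve.

```python
import json
import urllib.request

def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-compatible /models response."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url: str = "http://localhost:1234/v1") -> list[str]:
    """Query the local server's /models endpoint and return available model IDs."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return parse_model_ids(json.load(resp))
```

If `list_models()` returns `['gemma-3-4b-it', ...]`, the backend configuration in the next section should connect without issues.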

Backend Configuration

# LM Studio also provides an OpenAI-compatible API
LMSTUDIO_BASE_URL = os.getenv("LMSTUDIO_BASE_URL", "http://192.168.0.183:1234/v1")
LMSTUDIO_MODEL = os.getenv("LMSTUDIO_MODEL", "gemma-3-4b-it")
lmstudio_client = AsyncOpenAI(base_url=LMSTUDIO_BASE_URL, api_key="lm-studio")

# Lock mechanism: with limited VRAM, we serialize requests
lmstudio_lock = asyncio.Lock()

Lock Mechanism for Parallel Request Control

The lock strategy is where the implementation gets interesting. Since real-time translation generates two types of requests — partial (while speaking) and final (after the utterance is confirmed) — the lock must handle them differently:

async def _call_lmstudio(self, text: str, is_final: bool):
    """LM Studio API with request serialization"""
    # For partial results: skip if another request is already running
    # For final results: wait for the lock (must be processed)
    if not is_final and lmstudio_lock.locked():
        print(f"[LMStudio] Skipped (lock busy, partial): {text[:30]}...")
        return

    system_prompt = INTENT_SYSTEM_PROMPT_QUALITY if is_final else INTENT_SYSTEM_PROMPT_SPEED
    user_prompt = text  # the transcribed utterance

    async with lmstudio_lock:
        stream = await lmstudio_client.chat.completions.create(
            model=LMSTUDIO_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            temperature=0.3,
            max_tokens=500,
            stream=True,
        )
        # ... same streaming chunk processing as other providers

The logic behind this lock strategy: partial translations are speculative — they will be overwritten as the speaker continues. If the GPU is already busy, it is better to drop a partial request than to queue it and cause memory pressure. Final translations, on the other hand, represent confirmed utterances and must always be processed, so they wait for the lock. This approach prioritizes accuracy for completed speech while gracefully degrading on in-progress speech.
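The drop-versus-wait behavior can be verified in isolation. Below is a minimal, self-contained simulation of the policy using a plain asyncio.Lock, with a short sleep standing in for the slow local inference call (the texts and timings are illustrative only):

```python
import asyncio

processed: list[str] = []

async def translate(lock: asyncio.Lock, text: str, is_final: bool) -> None:
    # Speculative partials are dropped when inference is busy; finals wait.
    if not is_final and lock.locked():
        return
    async with lock:
        await asyncio.sleep(0.05)  # stand-in for a slow GPU inference call
        processed.append(("final" if is_final else "partial") + ":" + text)

async def main() -> list[str]:
    lock = asyncio.Lock()
    await asyncio.gather(
        translate(lock, "hello", False),       # acquires the lock first
        translate(lock, "hello w", False),     # lock busy -> dropped
        translate(lock, "hello world", True),  # final -> waits its turn
    )
    return processed

print(asyncio.run(main()))
```

The second partial never reaches the model, while the final request queues behind the first and is always processed — exactly the degradation behavior described above.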

Results

Item                   Result
Translation            △ Works but 3–4 second latency
Summary Generation     × Lock contention prevented execution
Real-Time Performance  × Sub-500ms response impossible
Cost                   ◎ Completely free

Translation itself works, but taking 3–4 seconds per request means it cannot keep up with speaking speed.

Key Insight: API Compatibility Across Providers

One important takeaway from testing both Ollama and LM Studio is that they both provide OpenAI-compatible APIs. The same AsyncOpenAI client and streaming code works identically across cloud APIs (OpenAI, Groq, Gemini) and local servers — only the base_url changes. This means switching between cloud and local inference is a one-line configuration change, not a rewrite. This interoperability is a significant practical advantage when experimenting with different providers.
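One way to make that one-line switch concrete is a small lookup table. The local URLs below are the ones used in this article; the Groq base URL is its standard OpenAI-compatible endpoint (verify against current docs), and the table itself is an illustrative sketch, not the article's production code.

```python
import os

# base_url plus the env var holding the API key. Local servers accept any
# placeholder string as a key, so no env var is needed for them.
PROVIDERS = {
    "groq":     ("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
    "ollama":   ("http://localhost:11434/v1", None),
    "lmstudio": ("http://localhost:1234/v1", None),
}

def resolve_provider(name: str) -> tuple[str, str]:
    """Return (base_url, api_key) for the chosen provider."""
    base_url, key_env = PROVIDERS[name]
    api_key = os.environ[key_env] if key_env else name
    return base_url, api_key

# The identical client construction then works everywhere:
#   base_url, api_key = resolve_provider("ollama")
#   client = AsyncOpenAI(base_url=base_url, api_key=api_key)
```

Everything downstream — streaming, chunk processing, JSON extraction — stays untouched when the provider changes.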

Conclusion

A GPU with 6GB VRAM cannot meet the requirements for real-time translation. With 12GB+ VRAM (RTX 4070 class or above), it might be feasible, but for now cloud APIs like Groq and Gemini remain the practical choice.


Mobile Local LLM Investigation

I also explored whether the LLM could run directly on a smartphone’s SoC, which would make the system fully portable without any server dependency.

Runtimes Investigated

The following is based on published documentation and community benchmark reports, not hands-on testing.

  • MLC Chat (mlc.ai): Runs quantized LLMs on mobile GPUs via Metal (iOS) and Vulkan (Android). Supports Llama, Gemma, and Phi model families.
  • Google AI Edge Gallery: Google’s official framework for on-device AI inference. Supports Gemma 2B/7B on Android with hardware acceleration through TensorFlow Lite delegates.
  • SmolChat: A lightweight chat interface designed for small models such as SmolLM and Phi-3-mini. Focuses on minimal memory footprint and ease of deployment.

Published Benchmark Numbers

Snapdragon 8 Gen 2 (flagship 2023):
  - Gemma 2B quantized (Q4): ~12-16 tokens/sec
  - Phi-3-mini (3.8B): ~8-10 tokens/sec

Apple A17 Pro:
  - Gemma 2B quantized: ~15-20 tokens/sec
  - Phi-3-mini: ~10-14 tokens/sec

Why Current Mobile Hardware Falls Short

For real-time translation at sub-500ms latency, the full JSON response (approximately 100-200 tokens) must be generated in under 500ms. That requires a throughput of 200-400 tokens/sec at minimum. Current mobile SoCs achieving 12-20 tokens/sec are roughly 20x too slow for this use case.
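The throughput requirement follows directly from the token count and the latency budget. A quick back-of-the-envelope check:

```python
def required_throughput(tokens: int, budget_s: float) -> float:
    """Tokens per second needed to emit `tokens` within `budget_s` seconds."""
    return tokens / budget_s

# Full JSON response: ~100-200 tokens, latency budget 0.5 s
low_req = required_throughput(100, 0.5)    # 200 tokens/sec
high_req = required_throughput(200, 0.5)   # 400 tokens/sec

# Against the ~12-20 tokens/sec of current mobile SoCs, the shortfall
# ranges from ~10x to ~33x, i.e. on the order of 20x
print(low_req, high_req)
```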

However, mobile SoC compute performance has historically doubled roughly every 2 years. At that rate, practical on-device translation could become feasible within 3-5 years, especially as model quantization techniques and architecture optimizations continue to improve. Smaller, more efficient model architectures — such as the trend toward sub-1B parameter models optimized for specific tasks — may accelerate this timeline further.
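As a rough sanity check on that projection (my own extrapolation, not a forecast): hardware doubling alone would take close to nine years to close a 20x gap, so the shorter estimate implicitly assumes the model side contributes a few-fold efficiency gain as well. The 4x model-side figure below is an assumption for illustration.

```python
import math

def years_to_close(gap: float, doubling_years: float = 2.0) -> float:
    """Years of steady compute doubling needed to close a throughput gap."""
    return math.log2(gap) * doubling_years

hw_only = years_to_close(20)       # ~8.6 years from hardware scaling alone
combined = years_to_close(20 / 4)  # ~4.6 years if models also get ~4x more efficient
```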


Adapting for Other Language Pairs

Update: The system now natively supports bidirectional translation across 4 languages — English, Japanese, Spanish, and Chinese. The guide below remains useful as a reference for understanding how language support is implemented internally.

The system was originally built for English → Japanese, but adapting it for other language pairs requires changes in only three places. The core architecture — streaming speech recognition, LLM streaming with optimized JSON field order, and WebSocket delivery — works the same regardless of language pair.

Change 1: Deepgram Language Parameter

The speech recognition input language is set through Deepgram’s LiveOptions. Changing the language parameter is all that is needed.

# Current: English input
options = LiveOptions(
    model="nova-2",
    language="en-US",      # ← Change this
    encoding="linear16",
    sample_rate=16000,
    channels=1,
    interim_results=True,
    utterance_end_ms=1000,
    vad_events=True,
)

# Example: Spanish input
options = LiveOptions(
    model="nova-2",
    language="es",          # Spanish
    # ... rest stays the same
)

Deepgram supports 30+ languages. Refer to their documentation for the full list of supported language codes.
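For the four languages the system now supports, the mapping can live in one small table. The codes below are my reading of Deepgram's documentation — "en-US", "ja", and "es" are well established, but Chinese in particular ("zh" vs "zh-CN"/"zh-TW") may vary by model, so confirm against their language list.

```python
# Deepgram language codes for the four supported input languages
# (assumed from Deepgram's docs; verify "zh" for your chosen model).
DEEPGRAM_LANGUAGE_CODES = {
    "english":  "en-US",
    "japanese": "ja",
    "spanish":  "es",
    "chinese":  "zh",
}

def live_options_for(language: str) -> dict:
    """Build the LiveOptions kwargs for a given input language;
    only the language code varies between configurations."""
    return {
        "model": "nova-2",
        "language": DEEPGRAM_LANGUAGE_CODES[language],
        "encoding": "linear16",
        "sample_rate": 16000,
        "channels": 1,
        "interim_results": True,
        "utterance_end_ms": 1000,
        "vad_events": True,
    }
```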

Change 2: LLM Prompt — Target Language

The translation prompt defines the source language, the target language, and the language used for intent_label. Here is how the prompt would change for a different language pair:

# Current prompt (translates English → Japanese). The Japanese text reads:
# "You are a real-time simultaneous interpreter AI. You translate English
# speech into Japanese in real time."
INTENT_SYSTEM_PROMPT_SPEED = """あなたはリアルタイム同時通訳AIです。
英語の発話をリアルタイムで日本語に翻訳します。
...
"""

# For Spanish → English translation:
INTENT_SYSTEM_PROMPT_SPEED = """You are a real-time interpreter AI.
Translate Spanish speech into English in real time.

Processing order:
1. Determine dialogue_act
2. Extract intent_label (in English, max 10 words)
3. Identify key slots
4. Generate full_translation in English

Output JSON format (output in this order):
{
  "dialogue_act": "QUESTION | PROPOSAL | ...",
  "intent_label": "short English label",
  "slots": {"when": "", "who": "", "where": "", "what": ""},
  "full_translation": "Natural English translation",
  "key_terms": ["important words"],
  "confidence": 0.0-1.0,
  "is_meaning_stable": true/false
}
"""

The prompt needs to specify three things: (1) the source language, (2) the target language, and (3) the language for intent_label.
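Since only those three points vary, the prompt can be generated from a template for any pair. The sketch below mirrors the wording of the English example above; treat it as a starting template (a hypothetical helper, not the production prompt builder).

```python
def build_speed_prompt(source: str, target: str) -> str:
    """Generate the speed-variant system prompt for a given language pair."""
    return f"""You are a real-time interpreter AI.
Translate {source} speech into {target} in real time.

Processing order:
1. Determine dialogue_act
2. Extract intent_label (in {target}, max 10 words)
3. Identify key slots
4. Generate full_translation in {target}

Output JSON format (output in this order):
{{
  "dialogue_act": "QUESTION | PROPOSAL | ...",
  "intent_label": "short {target} label",
  "slots": {{"when": "", "who": "", "where": "", "what": ""}},
  "full_translation": "Natural {target} translation",
  "key_terms": ["important words"],
  "confidence": 0.0-1.0,
  "is_meaning_stable": true/false
}}"""

INTENT_SYSTEM_PROMPT_SPEED = build_speed_prompt("Spanish", "English")
```

The quality-variant prompt could be templated the same way, keeping the JSON field order identical so the streaming extraction logic needs no changes.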

Change 3: Frontend UI Labels

The UI text displayed to the user needs to match the target language.

// Current: Japanese labels
<h2>リアルタイム翻訳</h2>
<span>処理中...</span>
<span>翻訳中...</span>

// For English UI:
<h2>Real-time Translation</h2>
<span>processing...</span>
<span>Translating...</span>

These three changes are all that is needed. Deepgram’s streaming API handles speech recognition for the source language, and the LLM’s multilingual capabilities handle translation into the target language. The streaming architecture, JSON field order optimization, and WebSocket delivery pipeline all remain unchanged.


Future Development Plans

Having confirmed the limitations of local LLM, I am now considering three directions for the project’s continued development.

1. Hands-Free Translation with TTS (Text-to-Speech)

I implemented TTS output to deliver translations through Bluetooth earbuds. The Web Speech API limitations, mobile browser workarounds, and practical lessons learned are documented in Part 4 (Appendix).

2. Bidirectional Translation ✓

The system originally handled one direction only (English → Japanese). Update: bidirectional translation is now implemented, supporting 4 languages — English, Japanese, Spanish, and Chinese. Any combination works in both directions:

  • You speak in Japanese → Translated to English → Reaches the other person
  • Other person speaks in Spanish → Translated to Chinese → Reaches you

3. Open-Source Smart Glasses Integration

Several open-source smart glasses projects have emerged recently — OpenGlass (about $20 in parts, converts regular glasses into AI smart glasses), Mentra (camera, speaker, microphone with open-source SDK), and Team Open Smart Glasses (fully open-source with display and live translation support). Combining these devices with Intent-First Translation could enable translation subtitles in the wearer’s field of view within 0.5 seconds of the other person beginning to speak.


Looking Back at the Series

Post    Theme                           Key Point
Part 1  Problem & Foundation            "Silence" problem. Deepgram, FastAPI, WebSocket foundation
Part 2  LLM Streaming                   JSON field order, dual prompts, streaming extraction — full code
Part 3  Home Hardware & Multi-Language  Ollama / LM Studio / mobile LLM hands-on. Multi-language architecture (now 4 languages)
Part 4  TTS & Bluetooth (Appendix)      Web Speech API, platform differences, Bluetooth audio routing

What Intent-First Translation Is — and What It Isn’t

I want to be straightforward about the scope of this project. Intent-First Translation is a UX optimization, not a fundamental solution to real-time translation.

The core idea — showing intent before the full translation arrives — reduces perceived latency by delivering information progressively. But the underlying translation still takes the same amount of time. The final output overwrites the initial approximation. In that sense, the fundamental constraints of current translation technology remain unchanged.

The true breakthroughs in real-time translation will come from a different level entirely: faster model architectures, purpose-built inference hardware, edge AI capable of running powerful models locally, and perhaps paradigms we haven’t yet imagined. These advances require progress in semiconductor technology and model research — not something an individual engineer’s UX approach can replace.

Why I Think It’s Still Worth Sharing

That said, I believe there is value in this kind of incremental exploration.

The techniques documented in this series — streaming JSON optimization, progressive information delivery, dual-prompt strategies — are practical patterns that any engineer can apply today. They work within the constraints of currently available tools, and they demonstrably improve the user experience.

More importantly, the fact that any engineer can build a near real-time translator at home using publicly available APIs and open-source technology is remarkable in itself. The barrier to experimenting with real-time translation has never been lower.

If this series contributes even a small reference point for engineers and researchers working toward the next generation of real-time translation, it will have served its purpose.

