Part 8: MySQL to Firestore Migration and Production -- The Road from RDB to NoSQL
Introduction
This is the final installment of the series. So far, I have covered LLM fine-tuning for stock price prediction and translation LLM selection.
This time, I will talk about the “last mile” — migrating data from MySQL to Firestore and getting the web service into production.
The struggles of a developer accustomed to RDBs (MySQL) migrating to NoSQL (Firestore) were greater than I expected. Index design, upsert strategy, document ID design, and above all, the battle with Firestore’s write billing. I hope these experiences will be useful for anyone considering Firestore adoption.
Why Firestore?
In the Senrigan service architecture, the data flow looks like this:
meloik (PHP/MySQL)  ->  assetai_firebase (Python)  ->  stockSite (Next.js/Vercel)
      [VPS]                     [VPS]                        [Vercel]
The frontend (stockSite) is a Next.js app on Vercel. There were several options for how the frontend would retrieve data:
- Direct MySQL access: Not advisable from a security standpoint — connecting directly from the frontend to an RDB is risky
- Build an API server: Create a REST API that serves data from MySQL. However, building and maintaining an API server has its own costs
- Use Firestore as a middle layer: Retrieve data safely from the frontend using the Firebase SDK. Serverless, low operational cost
For independent development, minimizing operational overhead is key. With Firestore, there is no server to manage, and the Firebase SDK lets the frontend fetch data directly. Firestore security rules can also handle access control.
Index Differences — The First Wall
The first thing that caught me off guard when migrating from MySQL to Firestore was how indexes work.
In MySQL
In MySQL, indexes can be defined in bulk when creating a table or via ALTER TABLE.
-- Composite indexes are easy to create
CREATE INDEX idx_code_date ON ai_news (code, target_date);
-- Multi-column composite indexes work the same way
CREATE INDEX idx_code_type_date ON ai_news (code, type, target_date);
With GUI tools like phpMyAdmin, you can intuitively browse tables and add indexes with just a few clicks.
In Firestore
Firestore is an entirely different story.
- Single-field indexes are created automatically
- But composite indexes must be created manually, one at a time
- You create them either through the Firebase Console UI or by defining them in firestore.indexes.json
- You only discover that an index is needed when you execute a query and get an error:
Error: The query requires an index. You can create it here:
https://console.firebase.google.com/v1/r/project/xxx/firestore/indexes?create_composite=...
It is nice that the error message includes a link to create the index, but the fact that you don’t know whether an index is needed until you run the query felt strange coming from a MySQL background.
On top of that, index creation can take several minutes to over ten minutes. In MySQL, an ALTER TABLE takes effect almost immediately, but in Firestore, the “Building” state lingers for a while.
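Rather than waiting for the runtime error, composite indexes can be declared up front in firestore.indexes.json and deployed with the Firebase CLI (firebase deploy --only firestore:indexes). A minimal sketch for a query filtering on code and sorting by target_date; the collection and field names follow the examples in this article:

```json
{
  "indexes": [
    {
      "collectionGroup": "news",
      "queryScope": "COLLECTION",
      "fields": [
        { "fieldPath": "code", "order": "ASCENDING" },
        { "fieldPath": "target_date", "order": "DESCENDING" }
      ]
    }
  ],
  "fieldOverrides": []
}
```

Keeping this file in version control also gives you the bird's-eye view of your indexes that the console lacks.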
Firebase Console UI — Painful for MySQL Developers
The Firebase Console UI is designed specifically for document-oriented databases. For developers accustomed to MySQL, there were several pain points.
No Bird’s-Eye View of the Table
In phpMyAdmin, you can list all records in a table and freely sort or filter them. In the Firestore console, you examine documents one by one within a collection, making it hard to grasp the big picture.
No SQL
SELECT * FROM news WHERE code = '4972' AND target_date >= '20241101' ORDER BY target_date DESC
You cannot just write a quick query like this to inspect your data. Firestore has a query builder, but it does not come close to SQL’s flexibility.
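The query builder composes the same pieces the SQL above expresses in one line: an equality filter, a range filter, and a sort. As a rough illustration of those semantics in plain Python over an in-memory list (this is not the Firestore SDK, just the logic it performs):

```python
# Mirror of: WHERE code = '4972' AND target_date >= '20241101'
#            ORDER BY target_date DESC
# Plain-Python illustration of the query semantics; field names follow the article.
def query_news(docs, code, date_from):
    hits = [d for d in docs if d["code"] == code and d["target_date"] >= date_from]
    return sorted(hits, key=lambda d: d["target_date"], reverse=True)

docs = [
    {"code": "4972", "target_date": "20241102"},
    {"code": "4972", "target_date": "20241020"},  # excluded: before date_from
    {"code": "9984", "target_date": "20241105"},  # excluded: different code
    {"code": "4972", "target_date": "20241110"},
]
result = query_news(docs, "4972", "20241101")
# -> newest first: 20241110, then 20241102
```

Incidentally, this exact shape (equality filter on one field combined with a range filter and sort on another) is what triggers Firestore's composite-index requirement described above.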
No JOINs
JOINs (table joins) that are taken for granted in MySQL do not exist in Firestore. To fetch related data, you need to run multiple queries and combine the results on the application side.
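For example, attaching company names to news documents takes one query per collection plus a dictionary merge in application code. A minimal sketch of the pattern in plain Python; the lists stand in for the results of two Firestore queries, and the company data is a placeholder:

```python
# Application-side "join": combine the results of two separate queries.
news_docs = [
    {"code": "2413", "title": "Earnings release"},
    {"code": "4972", "title": "Dividend forecast revised"},
]
company_docs = [
    {"code": "2413", "name": "Company A"},
    {"code": "4972", "name": "Company B"},
]

# Index one side by the join key, then merge into each news document.
companies_by_code = {c["code"]: c for c in company_docs}
joined = [
    {**n, "company_name": companies_by_code.get(n["code"], {}).get("name")}
    for n in news_docs
]
```

The indexing step keeps the merge O(n + m) instead of scanning one list for every element of the other.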
Evolution of the Upsert Strategy
The biggest challenge was preventing duplicate writes of the same data. Since Firestore charges per write, unnecessary writes translate directly into cost.
First Implementation — String Comparison (Buggy)
# First version: JST string comparison (problematic)
def save_to_firestore(collection_name, doc_id, data):
    doc_ref = db.collection(collection_name).document(doc_id)
    existing_doc = doc_ref.get()
    if existing_doc.exists:
        existing_data = existing_doc.to_dict()
        if existing_data.get("update_date", "0000-00-00 00:00:00") >= data["update_date"]:
            return False  # Skip if existing data is newer
    doc_ref.set(data)
    return True
This looks reasonable at first glance, but string-based timestamp comparison had hidden bugs. Timezone handling ambiguities and subtle differences in string formats led to incorrect comparison results in some cases.
Improved Version — UNIX Epoch Comparison
To solve the string comparison problem, I switched to integer comparison using UNIX timestamps (epoch seconds).
from datetime import datetime, timedelta, timezone

# JST string -> epoch seconds (UTC)
def jst_string_to_epoch(s):
    if not s:
        return None
    try:
        JST = timezone(timedelta(hours=9))
        dt = datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=JST)
        return int(dt.timestamp())
    except Exception:
        return None
# Improved version: compare using UNIX timestamps (integers)
def save_to_firestore(collection_name, doc_id, data, force=False):
    doc_ref = db.collection(collection_name).document(doc_id)
    existing_doc = doc_ref.get()
    new_epoch = data.get("updated_at_epoch")
    if new_epoch is None:
        new_epoch = jst_string_to_epoch(data.get("update_date"))
    if existing_doc.exists and not force:
        existing_data = existing_doc.to_dict()
        old_epoch = existing_data.get("updated_at_epoch")
        if old_epoch is None:
            old_epoch = jst_string_to_epoch(existing_data.get("update_date"))
        # Skip write if existing data is newer (cost reduction)
        if old_epoch and new_epoch and old_epoch >= new_epoch:
            return False
    doc_ref.set(data)
    return True
The key points:
- Integer comparison using UNIX epoch seconds: completely eliminates timezone ambiguity
- force parameter: for cases where a forced overwrite is needed during data repair
- Using doc_ref.set(): overwrites the entire document rather than merge/update, preventing partial-update bugs
- Fallback: if updated_at_epoch is missing, epoch seconds are calculated from the update_date string
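As a quick sanity check of the conversion (reproducing jst_string_to_epoch from above): 09:00 JST is 00:00 UTC on the same day, and two JST strings one second apart map to epochs one second apart, so integer comparison is unambiguous.

```python
from datetime import datetime, timedelta, timezone

# Same logic as above: parse a JST string, return UTC epoch seconds.
def jst_string_to_epoch(s):
    if not s:
        return None
    try:
        JST = timezone(timedelta(hours=9))
        dt = datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=JST)
        return int(dt.timestamp())
    except Exception:
        return None

# 2025-03-15 09:00 JST == 2025-03-15 00:00 UTC
epoch = jst_string_to_epoch("2025-03-15 09:00:00")
assert epoch == int(datetime(2025, 3, 15, tzinfo=timezone.utc).timestamp())
# One second later in JST is one second later in epoch terms
assert jst_string_to_epoch("2025-03-15 09:00:01") == epoch + 1
# Malformed input falls back to None instead of raising
assert jst_string_to_epoch("not a date") is None
```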
Document ID Design
How to design Firestore document IDs was another important decision.
First Approach — SHA-1 Hash (Difficult to Debug)
import hashlib

def generate_doc_id(news):
    base_str = f"{news['code']}_{news['target_date']}_{news['type']}"
    return hashlib.sha1(base_str.encode('utf-8')).hexdigest()
    # -> "a3f8b2c1d4e5f6..." -- an unreadable ID
SHA-1 hashing makes collisions practically impossible, so uniqueness is effectively guaranteed. However, when checking documents in the Firebase Console, IDs like a3f8b2c1d4e5f6... make it completely impossible to tell which news article a document belongs to.
Improved Version — Human-Readable IDs
# Japanese type -> English mapping
TYPE_MAPPING = {
    "PR情報": "pr",
    "決算": "earnings",
    "修正": "revision",
    "業績予想": "forecast",
    "配当予想": "dividend_forecast",
    # ...
}

def generate_readable_doc_id(news):
    code = str(news["code"])
    target_date = str(news["target_date_ymd"])
    type_en = TYPE_MAPPING.get(str(news.get("type", "")), "other")
    return f"{target_date}_{code}_{type_en}"
    # -> "20250315_2413_earnings" -- a readable ID
With 20250315_2413_earnings, you can immediately see “earnings news for stock code 2413 on March 15, 2025.” Debugging in the Firebase Console became dramatically easier.
A mapping table to convert Japanese news types (“earnings,” “revision,” etc.) to English is needed, but once created, it is easy to maintain.
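A quick check of the scheme (reproducing the function with a trimmed mapping): known Japanese types map to their English slug, and unknown types fall back to "other" rather than failing.

```python
# Trimmed reproduction of the ID scheme from above.
TYPE_MAPPING = {"PR情報": "pr", "決算": "earnings", "修正": "revision"}

def generate_readable_doc_id(news):
    code = str(news["code"])
    target_date = str(news["target_date_ymd"])
    # Unknown or missing types fall back to "other"
    type_en = TYPE_MAPPING.get(str(news.get("type", "")), "other")
    return f"{target_date}_{code}_{type_en}"

doc_id = generate_readable_doc_id(
    {"code": 2413, "target_date_ymd": "20250315", "type": "決算"}
)
# -> "20250315_2413_earnings"
```

Because the ID is deterministic, re-running the sync targets the same document, which is exactly what the upsert logic above relies on. The date prefix also means documents sort chronologically in the console listing.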
Decimal Type Conversion
Data fetched from MySQL via Python sometimes contains Decimal type values. Firestore cannot handle Decimal directly, so conversion to float is necessary.
from decimal import Decimal

def convert_decimal_to_float(data):
    for key, value in data.items():
        if isinstance(value, Decimal):
            data[key] = float(value)
    return data
This is mundane but important — forgetting this conversion causes errors when writing to Firestore. When working with the MySQL + Python + Firestore combination, you always need to be aware of Decimal types.
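Note that the version above only touches top-level values. If your documents contain nested dicts or lists (Firestore maps and arrays), a recursive variant covers those too. This is a sketch, not the article's production code:

```python
from decimal import Decimal

def convert_decimals(value):
    # Recursively replace Decimal with float in dicts, lists, and scalars.
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, dict):
        return {k: convert_decimals(v) for k, v in value.items()}
    if isinstance(value, list):
        return [convert_decimals(v) for v in value]
    return value

row = {"price": Decimal("1234.5"), "history": [Decimal("1.1"), Decimal("2.2")]}
clean = convert_decimals(row)
# -> {"price": 1234.5, "history": [1.1, 2.2]}
```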
The Battle with Firestore Write Billing
Firestore charges for reads, writes, and deletes individually. At one point, the Senrigan service was hitting approximately 85.93 million reads per month.
Cost Reduction 1: Extending ISR Revalidation
On the stockSite (Next.js) side, I extended the ISR (Incremental Static Regeneration) cache duration.
// Set revalidate to 300 seconds (5 minutes)
export const revalidate = 300;
Within each 5-minute window, the cached static page is served, dramatically reducing read requests to Firestore.
Cost Reduction 2: Stopping Unchecked News Sync
Previously, news with is_checked=0 (not yet reviewed by AI) was also synced to Firestore. I stopped this. By syncing only checked news, write volume was reduced.
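The extraction-side filter is simple: select only reviewed rows before syncing. A sketch using in-memory SQLite in place of MySQL; the is_checked column comes from the article, while the table name and other columns are hypothetical:

```python
import sqlite3

# SQLite stands in for MySQL; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER, title TEXT, is_checked INTEGER)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?, ?)",
    [(1, "Checked by AI", 1), (2, "Awaiting review", 0), (3, "Also checked", 1)],
)

# Sync only reviewed rows: unchecked news never reaches Firestore,
# so it never costs a write.
rows = conn.execute(
    "SELECT id, title FROM news WHERE is_checked = 1 ORDER BY id"
).fetchall()
# -> [(1, "Checked by AI"), (3, "Also checked")]
```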
Cost Reduction 3: Upsert Strategy
The upsert strategy described above skips writes when Firestore already has the latest data. This eliminates wasteful writes (rewriting the same data).
Production Operations Overview
Here is the full picture of the batch processes that power the Senrigan service in production.
meloik VPS (Data Generation and Prediction)
| Batch | Function | Schedule |
|---|---|---|
| news_sokuhou_kabutan | News collection | Periodic on weekdays |
| predict_stock_realtime2 | AI prediction | Runs after news collection |
| translate_english | English translation | Runs after prediction |
| summerize_news_msi | MSI news summarization | Periodic |
assetai_firebase VPS (Firestore Sync)
| Script | Function | Schedule |
|---|---|---|
| import_news_realtime.sh | Checked news sync | Every 15 min on weekdays |
| import_news.sh | Bulk news sync | 4 times daily |
| import_companies.sh | Company info sync | Weekly |
Data Flow (Chronological)
08:30 News collection batch starts before market open
+-> Collects publicly disclosed information, stores in MySQL
09:00 Prediction batch starts
+-- Fetches company info, prices, financials, and macro indicators from DB
+-- Sends prediction request to fine-tuned LLM
+-> Saves prediction results to MySQL
09:15 Translation batch starts
+-> Translates news and prediction reasoning into English
09:30 Firestore sync (every 15 minutes)
+-> Exports incremental changes from MySQL to Firestore
Ongoing stockSite (Web UI)
+-> Fetches data from Firestore via ISR, renders on screen
All of this is managed through crontab, running automatically every day without any human intervention.
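The sync schedules from the tables above could be expressed in crontab roughly like this. The script names come from the tables; the paths and exact run times are hypothetical:

```shell
# Every 15 minutes on weekdays (Mon-Fri): sync checked news
*/15 * * * 1-5 /home/app/assetai_firebase/import_news_realtime.sh
# Four times daily: bulk news sync
0 6,12,18,23 * * * /home/app/assetai_firebase/import_news.sh
# Weekly (Sunday): company info sync
0 5 * * 0 /home/app/assetai_firebase/import_companies.sh
```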
Current LLM Model Configuration
Here is the full picture of LLM models used in production.
| Process | Model | Purpose |
|---|---|---|
| Stock prediction | gpt-4o-mini (fine-tuned) | Predicts next-day stock price movements using the fine-tuned model |
| Prediction reasoning | gpt-5-nano | Generates prediction reasoning based on news analysis |
| Translation and summarization | gpt-5-mini | Japanese-to-English translation, news summarization |
Three different models are used, each matched to its purpose. Only the stock prediction model is fine-tuned; the others are general-purpose models.
Series Retrospective
Over 8 installments, I have covered the full journey from building a stock prediction LLM to running it as a production web service. Let me summarize the key lessons from the entire series.
Technical Lessons
- Data quality > model size: The reason open-source 7-8B models fell short on accuracy was not just model capability, but also data quality and quantity. Being able to feed in the full data via the OpenAI API was a major factor in the accuracy improvement
- “Fine-tuning works” and “fine-tuning is usable” are different things: LoRA + quantization made fine-tuning run, but achieving practical accuracy was a separate challenge
- Base model capability is decisive: Stable output in JSON format and other instruction-following capabilities depend on the base model’s fundamental ability
- Do not choose LLM providers based on cost alone: As the DeepSeek example showed, chasing low prices can lead to quality problems
- RDB-to-NoSQL migration is harder than you think: Indexes, upserts, document ID design — many situations where RDB conventions do not apply
Lessons as an Independent Developer
- Progress incrementally: Moving from local GPU to Colab to OpenAI API in stages allowed me to deeply understand the technology at each phase
- Do not fear failure: Phases 1 and 2 were “failures,” but the knowledge gained there supported the success of Phase 3
- Make pragmatic decisions: Prioritizing accuracy and cost-performance over the dream of “having my own LLM” was the right call
- Prioritize automation: Automating batches via crontab reduced daily operational costs to nearly zero
Future Outlook
- Expanding training data: From the current 1,009 samples to several thousand, with greater industry and temporal diversity
- Distillation for local deployment: Transferring the fine-tuned model’s knowledge to a smaller model to eliminate API costs
- Systematic accuracy evaluation: Building a framework for ongoing accuracy assessment and continuous improvement
- Multimodal expansion: Incorporating not just text but also chart images as inputs
Closing Thoughts
What I wanted to convey through this series is that LLM fine-tuning is achievable even as an independent developer.
Even without cutting-edge GPU clusters or large-scale infrastructure, with enough ingenuity, you can integrate LLMs into your own service. The setback with the local GPU, the trial and error on Colab — all of it became fuel for the eventual success.
The Senrigan service (https://senrigan.tech/) continues to run stock price predictions automatically every day, publishing the results for anyone to see. It is not a perfect service, but I feel a genuine sense of accomplishment in having built it as an individual.
If this series can serve as a reference for anyone interested in LLM fine-tuning or building web services as an independent developer, nothing would make me happier.
Thank you for reading to the end.
Previous: Part 7 — “Choosing a Translation LLM”
Series: “Building an LLM That Predicts Stock Prices from News” — End