Part 8: MySQL to Firestore Migration and Production -- The Road from RDB to NoSQL
Introduction
This is the final installment of the series. So far, I have covered LLM fine-tuning for stock price prediction and translation LLM selection.
This time, I will talk about the “last mile” — migrating data from MySQL to Firestore and getting the web service into production.
The struggles of a developer accustomed to RDBs (MySQL) migrating to NoSQL (Firestore) were greater than I expected. Index design, upsert strategy, document ID design, and above all, the battle with Firestore’s write billing. I hope these experiences will be useful for anyone considering Firestore adoption.
Why Firestore?
In the Senrigan service architecture, the data flow looks like this:
meloik (PHP/MySQL)  ->  assetai_firebase (Python)  ->  stockSite (Next.js/Vercel)
      [VPS]                     [VPS]                        [Vercel]
The frontend (stockSite) is a Next.js app on Vercel. There were several options for how the frontend would retrieve data:
- Direct MySQL access: Not advisable from a security standpoint — connecting directly from the frontend to an RDB is risky
- Build an API server: Create a REST API that serves data from MySQL. However, building and maintaining an API server has its own costs
- Use Firestore as a middle layer: Retrieve data safely from the frontend using the Firebase SDK. Serverless, low operational cost
For independent development, minimizing operational overhead is key. With Firestore, there is no server to manage, and the Firebase SDK lets the frontend fetch data directly. Firestore security rules can also handle access control.
Index Differences — The First Wall
The first thing that caught me off guard when migrating from MySQL to Firestore was how indexes work.
In MySQL
In MySQL, indexes can be defined in bulk when creating a table or via ALTER TABLE.
-- Composite indexes are easy to create
CREATE INDEX idx_code_date ON ai_news (code, target_date);
-- Multi-column composite indexes work the same way
CREATE INDEX idx_code_type_date ON ai_news (code, type, target_date);
With GUI tools like phpMyAdmin, you can intuitively browse tables and add indexes with just a few clicks.
In Firestore
Firestore is an entirely different story.
- Single-field indexes are created automatically
- But composite indexes must be created manually, one at a time
- You create them either through the Firebase Console UI or by defining them in firestore.indexes.json
- You only discover that an index is needed when you execute a query and get an error:
Error: The query requires an index. You can create it here:
https://console.firebase.google.com/v1/r/project/xxx/firestore/indexes?create_composite=...
It is nice that the error message includes a link to create the index, but the fact that you don’t know whether an index is needed until you run the query felt strange coming from a MySQL background.
On top of that, index creation can take several minutes to over ten minutes. In MySQL, an ALTER TABLE takes effect almost immediately, but in Firestore, the “Building” state lingers for a while.
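Rather than waiting for the runtime error, composite indexes can be declared up front in firestore.indexes.json and deployed with the Firebase CLI (firebase deploy --only firestore:indexes). A minimal sketch for a query filtering on code and sorting by target_date; the collection and field names follow the examples in this article:

```json
{
  "indexes": [
    {
      "collectionGroup": "news",
      "queryScope": "COLLECTION",
      "fields": [
        { "fieldPath": "code", "order": "ASCENDING" },
        { "fieldPath": "target_date", "order": "DESCENDING" }
      ]
    }
  ],
  "fieldOverrides": []
}
```

Keeping this file in version control also gives you the bird's-eye view of your indexes that the console lacks.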
Firebase Console UI — Painful for MySQL Developers
The Firebase Console UI is designed specifically for document-oriented databases. For developers accustomed to MySQL, there were several pain points.
No Bird’s-Eye View of the Table
In phpMyAdmin, you can list all records in a table and freely sort or filter them. In the Firestore console, you examine documents one by one within a collection, making it hard to grasp the big picture.
No SQL
SELECT * FROM news WHERE code = '4972' AND target_date >= '20241101' ORDER BY target_date DESC
You cannot just write a quick query like this to inspect your data. Firestore has a query builder, but it does not come close to SQL’s flexibility.
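The query builder composes the same pieces the SQL above expresses in one line: an equality filter, a range filter, and a sort. As a rough illustration of those semantics in plain Python over an in-memory list (this is not the Firestore SDK, just the logic it performs):

```python
# Mirror of: WHERE code = '4972' AND target_date >= '20241101'
#            ORDER BY target_date DESC
# Plain-Python illustration of the query semantics; field names follow the article.
def query_news(docs, code, date_from):
    hits = [d for d in docs if d["code"] == code and d["target_date"] >= date_from]
    return sorted(hits, key=lambda d: d["target_date"], reverse=True)

docs = [
    {"code": "4972", "target_date": "20241102"},
    {"code": "4972", "target_date": "20241020"},  # excluded: before date_from
    {"code": "9984", "target_date": "20241105"},  # excluded: different code
    {"code": "4972", "target_date": "20241110"},
]
result = query_news(docs, "4972", "20241101")
# -> newest first: 20241110, then 20241102
```

Incidentally, this exact shape (equality filter on one field combined with a range filter and sort on another) is what triggers Firestore's composite-index requirement described above.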
No JOINs
JOINs (table joins) that are taken for granted in MySQL do not exist in Firestore. To fetch related data, you need to run multiple queries and combine the results on the application side.
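For example, attaching company names to news documents takes one query per collection plus a dictionary merge in application code. A minimal sketch of the pattern in plain Python; the lists stand in for the results of two Firestore queries, and the company data is a placeholder:

```python
# Application-side "join": combine the results of two separate queries.
news_docs = [
    {"code": "2413", "title": "Earnings release"},
    {"code": "4972", "title": "Dividend forecast revised"},
]
company_docs = [
    {"code": "2413", "name": "Company A"},
    {"code": "4972", "name": "Company B"},
]

# Index one side by the join key, then merge into each news document.
companies_by_code = {c["code"]: c for c in company_docs}
joined = [
    {**n, "company_name": companies_by_code.get(n["code"], {}).get("name")}
    for n in news_docs
]
```

The indexing step keeps the merge O(n + m) instead of scanning one list for every element of the other.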
Evolution of the Upsert Strategy
The biggest challenge was preventing duplicate writes of the same data. Since Firestore charges per write, unnecessary writes translate directly into cost.
First Implementation — String Comparison (Buggy)
# First version: JST string comparison (problematic)
def save_to_firestore(collection_name, doc_id, data):
    doc_ref = db.collection(collection_name).document(doc_id)
    existing_doc = doc_ref.get()
    if existing_doc.exists:
        existing_data = existing_doc.to_dict()
        if existing_data.get("update_date", "0000-00-00 00:00:00") >= data["update_date"]:
            return False  # Skip if existing data is newer
    doc_ref.set(data)
    return True
This looks reasonable at first glance, but string-based timestamp comparison had hidden bugs. Timezone handling ambiguities and subtle differences in string formats led to incorrect comparison results in some cases.
Improved Version — UNIX Epoch Comparison
To solve the string comparison problem, I switched to integer comparison using UNIX timestamps (epoch seconds).
from datetime import datetime, timedelta, timezone

# JST string -> epoch seconds (UTC)
def jst_string_to_epoch(s):
    if not s:
        return None
    try:
        JST = timezone(timedelta(hours=9))
        dt = datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=JST)
        return int(dt.timestamp())
    except Exception:
        return None
# Improved version: compare using UNIX timestamps (integers)
def save_to_firestore(collection_name, doc_id, data, force=False):
    doc_ref = db.collection(collection_name).document(doc_id)
    existing_doc = doc_ref.get()
    new_epoch = data.get("updated_at_epoch")
    if new_epoch is None:
        new_epoch = jst_string_to_epoch(data.get("update_date"))
    if existing_doc.exists and not force:
        existing_data = existing_doc.to_dict()
        old_epoch = existing_data.get("updated_at_epoch")
        if old_epoch is None:
            old_epoch = jst_string_to_epoch(existing_data.get("update_date"))
        # Skip write if existing data is newer (cost reduction)
        if old_epoch and new_epoch and old_epoch >= new_epoch:
            return False
    doc_ref.set(data)
    return True
The key points:
- Integer comparison using UNIX epoch seconds: completely eliminates timezone ambiguity
- force parameter: for cases where a forced overwrite is needed during data repair
- Using doc_ref.set(): overwrites the entire document rather than merge/update, preventing partial-update bugs
- Fallback: if updated_at_epoch is missing, epoch seconds are calculated from the update_date string
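As a quick sanity check of the conversion (reproducing jst_string_to_epoch from above): 09:00 JST is 00:00 UTC on the same day, and two JST strings one second apart map to epochs one second apart, so integer comparison is unambiguous.

```python
from datetime import datetime, timedelta, timezone

# Same logic as above: parse a JST string, return UTC epoch seconds.
def jst_string_to_epoch(s):
    if not s:
        return None
    try:
        JST = timezone(timedelta(hours=9))
        dt = datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=JST)
        return int(dt.timestamp())
    except Exception:
        return None

# 2025-03-15 09:00 JST == 2025-03-15 00:00 UTC
epoch = jst_string_to_epoch("2025-03-15 09:00:00")
assert epoch == int(datetime(2025, 3, 15, tzinfo=timezone.utc).timestamp())
# One second later in JST is one second later in epoch terms
assert jst_string_to_epoch("2025-03-15 09:00:01") == epoch + 1
# Malformed input falls back to None instead of raising
assert jst_string_to_epoch("not a date") is None
```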
Document ID Design
How to design Firestore document IDs was another important decision.
First Approach — SHA-1 Hash (Difficult to Debug)
import hashlib

def generate_doc_id(news):
    base_str = f"{news['code']}_{news['target_date']}_{news['type']}"
    return hashlib.sha1(base_str.encode('utf-8')).hexdigest()
    # -> "a3f8b2c1d4e5f6..." -- an unreadable ID
SHA-1 hashing makes collisions practically impossible, so uniqueness is effectively guaranteed. However, when checking documents in the Firebase Console, IDs like a3f8b2c1d4e5f6... make it completely impossible to tell which news article a document belongs to.
Improved Version — Human-Readable IDs
# Japanese type -> English mapping
TYPE_MAPPING = {
    "PR情報": "pr",
    "決算": "earnings",
    "修正": "revision",
    "業績予想": "forecast",
    "配当予想": "dividend_forecast",
    # ...
}

def generate_readable_doc_id(news):
    code = str(news["code"])
    target_date = str(news["target_date_ymd"])
    type_en = TYPE_MAPPING.get(str(news.get("type", "")), "other")
    return f"{target_date}_{code}_{type_en}"
    # -> "20250315_2413_earnings" -- a readable ID
With 20250315_2413_earnings, you can immediately see “earnings news for stock code 2413 on March 15, 2025.” Debugging in the Firebase Console became dramatically easier.
A mapping table to convert Japanese news types (“earnings,” “revision,” etc.) to English is needed, but once created, it is easy to maintain.
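A quick check of the scheme (reproducing the function with a trimmed mapping): known Japanese types map to their English slug, and unknown types fall back to "other" rather than failing.

```python
# Trimmed reproduction of the ID scheme from above.
TYPE_MAPPING = {"PR情報": "pr", "決算": "earnings", "修正": "revision"}

def generate_readable_doc_id(news):
    code = str(news["code"])
    target_date = str(news["target_date_ymd"])
    # Unknown or missing types fall back to "other"
    type_en = TYPE_MAPPING.get(str(news.get("type", "")), "other")
    return f"{target_date}_{code}_{type_en}"

doc_id = generate_readable_doc_id(
    {"code": 2413, "target_date_ymd": "20250315", "type": "決算"}
)
# -> "20250315_2413_earnings"
```

Because the ID is deterministic, re-running the sync targets the same document, which is exactly what the upsert logic above relies on. The date prefix also means documents sort chronologically in the console listing.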
Decimal Type Conversion
Data fetched from MySQL via Python sometimes contains Decimal type values. Firestore cannot handle Decimal directly, so conversion to float is necessary.
from decimal import Decimal

def convert_decimal_to_float(data):
    for key, value in data.items():
        if isinstance(value, Decimal):
            data[key] = float(value)
    return data
This is mundane but important — forgetting this conversion causes errors when writing to Firestore. When working with the MySQL + Python + Firestore combination, you always need to be aware of Decimal types.
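Note that the version above only touches top-level values. If your documents contain nested dicts or lists (Firestore maps and arrays), a recursive variant covers those too. This is a sketch, not the article's production code:

```python
from decimal import Decimal

def convert_decimals(value):
    # Recursively replace Decimal with float in dicts, lists, and scalars.
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, dict):
        return {k: convert_decimals(v) for k, v in value.items()}
    if isinstance(value, list):
        return [convert_decimals(v) for v in value]
    return value

row = {"price": Decimal("1234.5"), "history": [Decimal("1.1"), Decimal("2.2")]}
clean = convert_decimals(row)
# -> {"price": 1234.5, "history": [1.1, 2.2]}
```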
The Battle with Firestore Write Billing
Firestore charges for reads, writes, and deletes individually. At one point, the Senrigan service was hitting approximately 85.93 million reads per month.
Cost Reduction 1: Extending ISR Revalidation
On the stockSite (Next.js) side, I extended the ISR (Incremental Static Regeneration) cache duration.
// Set revalidate to 300 seconds (5 minutes)
export const revalidate = 300;
Within each 5-minute window, the cached static page is served, dramatically reducing read requests to Firestore.
Cost Reduction 2: Stopping Unchecked News Sync
Previously, news with is_checked=0 (not yet reviewed by AI) was also synced to Firestore. I stopped this. By syncing only checked news, write volume was reduced.
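The extraction-side filter is simple: select only reviewed rows before syncing. A sketch using in-memory SQLite in place of MySQL; the is_checked column comes from the article, while the table name and other columns are hypothetical:

```python
import sqlite3

# SQLite stands in for MySQL; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER, title TEXT, is_checked INTEGER)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?, ?)",
    [(1, "Checked by AI", 1), (2, "Awaiting review", 0), (3, "Also checked", 1)],
)

# Sync only reviewed rows: unchecked news never reaches Firestore,
# so it never costs a write.
rows = conn.execute(
    "SELECT id, title FROM news WHERE is_checked = 1 ORDER BY id"
).fetchall()
# -> [(1, "Checked by AI"), (3, "Also checked")]
```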
Cost Reduction 3: Upsert Strategy
The upsert strategy described above skips writes when Firestore already has the latest data. This eliminates wasteful writes (rewriting the same data).
Production Operations Overview
Here is the full picture of the batch processes that power the Senrigan service in production.
meloik VPS (Data Generation and Prediction)
| Batch | Function | Schedule |
|---|---|---|
| news_sokuhou_kabutan | News collection | Periodic on weekdays |
| predict_stock_realtime2 | AI prediction | Runs after news collection |
| translate_english | English translation | Runs after prediction |
| summerize_news_msi | MSI news summarization | Periodic |
assetai_firebase VPS (Firestore Sync)
| Script | Function | Schedule |
|---|---|---|
| import_news_realtime.sh | Checked news sync | Every 15 min on weekdays |
| import_news.sh | Bulk news sync | 4 times daily |
| import_companies.sh | Company info sync | Weekly |
Data Flow (Chronological)
08:30 News collection batch starts before market open
+-> Collects publicly disclosed information, stores in MySQL
09:00 Prediction batch starts
+-- Fetches company info, prices, financials, and macro indicators from DB
+-- Sends prediction request to fine-tuned LLM
+-> Saves prediction results to MySQL
09:15 Translation batch starts
+-> Translates news and prediction reasoning into English
09:30 Firestore sync (every 15 minutes)
+-> Exports incremental changes from MySQL to Firestore
Ongoing stockSite (Web UI)
+-> Fetches data from Firestore via ISR, renders on screen
All of this is managed through crontab, running automatically every day without any human intervention.
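The sync schedules from the tables above could be expressed in crontab roughly like this. The script names come from the tables; the paths and exact run times are hypothetical:

```shell
# Every 15 minutes on weekdays (Mon-Fri): sync checked news
*/15 * * * 1-5 /home/app/assetai_firebase/import_news_realtime.sh
# Four times daily: bulk news sync
0 6,12,18,23 * * * /home/app/assetai_firebase/import_news.sh
# Weekly (Sunday): company info sync
0 5 * * 0 /home/app/assetai_firebase/import_companies.sh
```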
Current LLM Model Configuration
Here is the full picture of LLM models used in production.
| Process | Model | Purpose |
|---|---|---|
| Stock prediction | gpt-4o-mini (fine-tuned) | Predicts next-day stock price movements using the fine-tuned model |
| Prediction reasoning | gpt-5-nano | Generates prediction reasoning based on news analysis |
| Translation and summarization | gpt-5-mini | Japanese-to-English translation, news summarization |
Three different models are used, each matched to its purpose. Only the stock prediction model is fine-tuned; the others are general-purpose models.
Series Retrospective
Over 8 installments, I have covered the full journey from building a stock prediction LLM to running it as a production web service. Let me summarize the key lessons from the entire series.
Technical Lessons
- Data quality > model size: The reason open-source 7-8B models fell short on accuracy was not just model capability, but also data quality and quantity. Being able to feed in the full data via the OpenAI API was a major factor in the accuracy improvement
- “Fine-tuning works” and “fine-tuning is usable” are different things: LoRA + quantization made fine-tuning run, but achieving practical accuracy was a separate challenge
- Base model capability is decisive: Stable output in JSON format and other instruction-following capabilities depend on the base model’s fundamental ability
- Do not choose LLM providers based on cost alone: As the DeepSeek example showed, chasing low prices can lead to quality problems
- RDB-to-NoSQL migration is harder than you think: Indexes, upserts, document ID design — many situations where RDB conventions do not apply
Lessons as an Independent Developer
- Progress incrementally: Moving from local GPU to Colab to OpenAI API in stages allowed me to deeply understand the technology at each phase
- Do not fear failure: Phases 1 and 2 were “failures,” but the knowledge gained there supported the success of Phase 3
- Make pragmatic decisions: Prioritizing accuracy and cost-performance over the dream of “having my own LLM” was the right call
- Prioritize automation: Automating batches via crontab reduced daily operational costs to nearly zero
Future Outlook
- Expanding training data: From the current 1,009 samples to several thousand, with greater industry and temporal diversity
- Distillation for local deployment: Transferring the fine-tuned model’s knowledge to a smaller model to eliminate API costs
- Systematic accuracy evaluation: Building a framework for ongoing accuracy assessment and continuous improvement
- Multimodal expansion: Incorporating not just text but also chart images as inputs
Closing Thoughts
What I wanted to convey through this series is that LLM fine-tuning is achievable even as an independent developer.
Even without cutting-edge GPU clusters or large-scale infrastructure, with enough ingenuity, you can integrate LLMs into your own service. The setback with the local GPU, the trial and error on Colab — all of it became fuel for the eventual success.
The Senrigan service (https://senrigan.tech/) continues to run stock price predictions automatically every day, publishing the results for anyone to see. It is not a perfect service, but I feel a genuine sense of accomplishment in having built it as an individual.
If this series can serve as a reference for anyone interested in LLM fine-tuning or building web services as an independent developer, nothing would make me happier.
Thank you for reading to the end.
Previous: Part 7 — “Choosing a Translation LLM”
Series: “Building an LLM That Predicts Stock Prices from News” — End