Part 5: OpenAI API Fine-Tuning -- The Final Solution That Took Just 8 Minutes
Introduction
In the previous installments, I shared how I challenged two open-source models on a local GPU and Google Colab, and how neither achieved practical accuracy.
This time, I will talk about pivoting to OpenAI API fine-tuning. To state the conclusion upfront: training completed in about 8 minutes, and I got stable JSON output with sufficient accuracy. The difference was so stark that I wondered what the months of trial and error on Colab had been for.
However, “completed in 8 minutes” refers only to the training process itself. In reality, preparing and preprocessing the data involved considerable effort. This was the phase where I experienced firsthand that “data preparation is 90% of the work.”
The Decision to Pivot
Let me revisit the cost comparison I touched on last time.
| Aspect | Local / Colab | OpenAI API |
|---|---|---|
| Initial cost | GPU purchase or Colab Pro | None |
| Training cost | Electricity, GPU time | Token-based billing (~tens of dollars) |
| Training time | Hours to tens of hours | ~8 minutes |
| Accuracy | Low to medium | High |
| Output format stability | Low | High (reliable JSON) |
| Operational ease | Infrastructure management required | Just call the API |
| Model ownership | In your hands | On OpenAI’s servers |
For independent development, when you weigh accuracy, cost, and ease of use, I concluded that the API approach was the optimal solution.
I did feel some resistance to not having the model locally: my own model, trained on my own data, would live on OpenAI’s infrastructure rather than on my own server. Practically, though, maintaining a GPU server for every inference request would cost more in both money and effort than delegating to an API. And if I ever want to run things locally in the future, distillation remains an option.
Why I Chose gpt-4o-mini
OpenAI API fine-tuning lets you choose from several base models. I chose gpt-4o-mini-2024-07-18.
There were three reasons:
- Low cost: Significantly cheaper than GPT-4o or GPT-4. Since fine-tuning is billed by token count, a cheaper model makes trial and error much more practical
- Chat-format fine-tuning support: Can be trained with messages-format data. The role separation of system / user / assistant is clear
- Sufficient reasoning capability: Expected to have the analytical ability needed for stock prediction and the capacity for stable structured output in JSON format
Data Preparation — This Is 90% of the Work
Running OpenAI API fine-tuning is extremely easy in itself. Upload a file and make one API call. However, preparing the data to upload consumed 90% of the entire effort.
Step 1: Data Format Conversion
The data I initially created was in prompt-completion format.
```json
{"prompt": "Company info, news, stock data...", "completion": "Prediction result JSON..."}
```
However, since gpt-4o-mini is a chat model, this format throws an error.
```text
Invalid file format. Input file is in the prompt-completion format,
but the specified model gpt-4o-mini-2024-07-18 is a chat model
and requires chat-formatted data.
```
I needed to convert to messages format.
```json
{
  "messages": [
    {"role": "system", "content": "You are a stock prediction assistant."},
    {"role": "user", "content": "<JSON with company info + news + prices + financials + macro indicators>"},
    {"role": "assistant", "content": "<Prediction result JSON>"}
  ]
}
```
I wrote a conversion script to batch-convert everything.
```python
import json

def convert_to_chat_format(input_file, output_file):
    """Convert prompt-completion JSONL records to chat (messages) format."""
    with open(input_file, 'r') as f_in, open(output_file, 'w') as f_out:
        for line in f_in:
            data = json.loads(line)
            chat_format = {
                "messages": [
                    {"role": "system", "content": "You are a stock prediction assistant."},
                    {"role": "user", "content": data["prompt"]},
                    {"role": "assistant", "content": data["completion"]}
                ]
            }
            f_out.write(json.dumps(chat_format, ensure_ascii=False) + "\n")
```
Step 2: PHP Error Messages in the Data
The conversion was done, but when I examined the data more closely, I found a problem. Around line 229 of the JSONL file, a PHP error message had crept in.
```text
Fatal error: Allowed memory size of 134217728 bytes exhausted
(tried to allocate 8 bytes) in /var/www/DBHandler.class.php on line 280
```
The PHP batch process on the meloik VPS that generates training data had hit its memory limit, and the error message was written straight into the JSONL file.
If garbage data like this gets mixed into training data, the model would learn that “PHP error messages are part of prediction results.”
As a countermeasure, I added a cleaning step that skips any line that cannot be parsed as valid JSON.
```python
import json

def clean_jsonl(input_file, output_file):
    """Keep only lines that parse as valid JSON; skip blanks and garbage."""
    cleaned = 0
    skipped = 0
    with open(input_file, 'r') as f_in, open(output_file, 'w') as f_out:
        for i, line in enumerate(f_in, 1):
            line = line.strip()
            if not line:
                skipped += 1
                continue
            try:
                data = json.loads(line)
                f_out.write(json.dumps(data, ensure_ascii=False) + "\n")
                cleaned += 1
            except json.JSONDecodeError:
                print(f"Line {i}: Skipped (invalid JSON)")
                skipped += 1
    print(f"Cleaned: {cleaned}, Skipped: {skipped}")
```
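To check that the cleaner actually drops a simulated PHP error line, here is a self-contained run against a two-line file. The function is a variant of the one above that returns the counts instead of only printing them, so the result can be inspected; the file names are made up.

```python
import json
import os
import tempfile

def clean_jsonl(input_file, output_file):
    """Copy only lines that parse as valid JSON; return (cleaned, skipped)."""
    cleaned = skipped = 0
    with open(input_file) as f_in, open(output_file, "w") as f_out:
        for i, line in enumerate(f_in, 1):
            line = line.strip()
            if not line:
                skipped += 1
                continue
            try:
                data = json.loads(line)
                f_out.write(json.dumps(data, ensure_ascii=False) + "\n")
                cleaned += 1
            except json.JSONDecodeError:
                print(f"Line {i}: Skipped (invalid JSON)")
                skipped += 1
    return cleaned, skipped

with tempfile.TemporaryDirectory() as d:
    src, dst = os.path.join(d, "raw.jsonl"), os.path.join(d, "clean.jsonl")
    with open(src, "w") as f:
        f.write('{"messages": []}\n')                          # valid record
        f.write("Fatal error: Allowed memory size exhausted\n")  # garbage line
    print(clean_jsonl(src, dst))  # -> (1, 1)
```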
Step 3: Residual HTML Tags and Entities
Digging further, I also discovered that the news article text contained leftover HTML tags and HTML entities.
```text
<br />     <- line break tag
&lt;       <- HTML entity for <
&raquo;    <- HTML entity for »
```
The HTML from the original data source had been left intact.
```python
import re
import html

def clean_html(text):
    # Remove HTML tags first, then decode entities. Decoding first would
    # turn escaped markup like &lt;b&gt; into real tags that survive removal.
    text = re.sub(r"<br\s*/?>", " ", text)  # line breaks become spaces
    text = re.sub(r"<.*?>", "", text)
    # Decode HTML entities
    text = html.unescape(text)
    return text
```
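As a quick sanity check, here is a self-contained run on a made-up news snippet (in this variant `<br />` is replaced with a space so adjacent words don't fuse together):

```python
import html
import re

def clean_html(text):
    # Strip tags first, then decode entities (order matters: decoding first
    # would turn escaped markup like &lt;b&gt; into real tags).
    text = re.sub(r"<br\s*/?>", " ", text)  # line breaks become spaces
    text = re.sub(r"<.*?>", "", text)
    return html.unescape(text)

raw = 'Guidance raised<br />Details &raquo; <a href="/ir">IR page</a>'
print(clean_html(raw))  # -> Guidance raised Details » IR page
```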
Step 4: Validation
I used the official validation script provided by OpenAI (tiktoken-based) for a final check.
```text
Format errors: 0
Over 16K tokens: 0
Sample count: 1,009
Average tokens: 1,303 tokens/sample (min: 847, max: 3,168)
Assistant response: Average 112 tokens
```
All 1,009 samples passed validation.
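OpenAI's actual validation script also counts tokens with the tiktoken package; the stdlib-only sketch below covers just the structural side of the check (valid roles, content present, assistant turn last). `check_format` is my own name, not part of any official script.

```python
import json

REQUIRED_ROLES = {"system", "user", "assistant"}

def check_format(path):
    """Return (sample_count, format_errors) for a chat-format JSONL file."""
    samples = 0
    errors = 0
    with open(path) as f:
        for line in f:
            samples += 1
            record = json.loads(line)
            messages = record.get("messages", [])
            roles = [m.get("role") for m in messages]
            # Each sample must use valid roles, carry content in every turn,
            # and end with the assistant's target completion.
            if (not messages
                    or any(r not in REQUIRED_ROLES for r in roles)
                    or roles[-1] != "assistant"
                    or any("content" not in m for m in messages)):
                errors += 1
    return samples, errors
```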
The Decisive Difference from Colab
Here is an extremely important point.
On Colab, the 1,024-token limit forced me to remove financial data and macroeconomic indicators. But the OpenAI API’s maximum token length is 16,385 tokens. Data averaging 1,303 tokens per sample fits with room to spare.
This meant I could input all five data types (company information, news, stock prices, financials, and macroeconomic indicators) in full. There was no longer any need to strip information.
This was the biggest difference from the Colab experiments, and I believe it was a major factor in the accuracy improvement.
Running the Fine-Tuning
Once the data was prepared, the rest was surprisingly easy.
Cost Estimate
```text
Total tokens: ~1.3 million tokens
Training epochs: 3
Billed tokens: ~3.88 million tokens
```
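The billing arithmetic is simply dataset tokens times epochs. A rough sketch of the estimate; the per-token price is an assumption based on published gpt-4o-mini fine-tuning rates at the time, so treat the dollar figure as illustrative and check current pricing:

```python
# Rough fine-tuning cost estimate.
dataset_tokens = 1_300_000   # total tokens in the training file
epochs = 3                   # epochs used for this run

billed_tokens = dataset_tokens * epochs   # ~3.9M, close to the ~3.88M actually billed
price_per_million = 3.00                  # USD per 1M training tokens; assumed rate
estimated_cost = billed_tokens / 1_000_000 * price_per_million
print(f"{billed_tokens:,} tokens -> ~${estimated_cost:.2f}")  # -> 3,900,000 tokens -> ~$11.70
```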
File Upload and Training Start
```python
from openai import OpenAI

client = OpenAI()

# 1. Upload the training file
with open("train_combined_final.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")
print(f"File ID: {file.id}")

# 2. Start fine-tuning
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18"
)
print(f"Job ID: {fine_tune_job.id}")
print(f"Status: {fine_tune_job.status}")
```
That is all it takes. Then you just wait.
The status transitioned through validating_files -> running -> succeeded, and it completed in about 7-8 minutes.
Given that training on Colab had taken hours and still failed to deliver accuracy, this speed was shocking.
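Rather than refreshing the dashboard, the job status can be polled in a loop. Here is a minimal helper; the `fetch_status` callable stands in for `client.fine_tuning.jobs.retrieve(job_id).status`, which keeps the sketch testable offline:

```python
import time

def wait_for_job(fetch_status, interval=30, timeout=3600):
    """Poll fetch_status() until the job reaches a terminal state."""
    terminal = {"succeeded", "failed", "cancelled"}
    waited = 0
    while waited <= timeout:
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(interval)
        waited += interval
    raise TimeoutError("fine-tuning job did not finish in time")

# In production, something like:
# wait_for_job(lambda: client.fine_tuning.jobs.retrieve(job_id).status)
```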
Prediction Test — A Clear Difference from Colab
Let me run a prediction with the fine-tuned model.
```python
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:personal::xxxxx",
    messages=[
        {"role": "system", "content": "You are a stock prediction assistant."},
        {"role": "user", "content": json.dumps(test_data, ensure_ascii=False)}
    ]
)
prediction = response.choices[0].message.content
print(prediction)
```
Test result (Soken Chemical 4972, data from 2024/11/28):
```json
{
  "close_to_next_open": {"price": 3045, "change_pct": 0, "trend": "neutral"},
  "close_to_next_close": {"price": 3075, "change_pct": 0.98, "trend": "neutral"},
  "next_open_to_close": {"price": 3075, "change_pct": 1, "trend": "neutral"}
}
```
Perfect JSON format with stable output. Not once did it respond in natural language like the Colab models had.
This is the strength of gpt-4o-mini’s base capabilities. It reliably follows the instruction to “output in JSON format” and consistently returns structured output. This was a fundamentally different level of instruction-following ability compared to the 7-8B open-source models.
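Even with stable output, I still parse and validate each response before storing it. A hedged sketch of such a check; the key names follow the sample output above, and `parse_prediction` is my own helper name:

```python
import json

EXPECTED_KEYS = {"close_to_next_open", "close_to_next_close", "next_open_to_close"}

def parse_prediction(raw):
    """Parse the model's JSON reply and verify the expected structure."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if set(data) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    for window, fields in data.items():
        for field in ("price", "change_pct", "trend"):
            if field not in fields:
                raise ValueError(f"{window} is missing {field}")
    return data
```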
Fine-Tuned Model Registry
Here is a record of the models created during the trial-and-error process.
| Model ID | Dataset | Samples | Notes |
|---|---|---|---|
| ft:gpt-4o-mini-2024-07-18:personal::xxxxx | training_data_chat.jsonl | Small | Initial test; data format validation |
| ft:gpt-4o-mini-2024-07-18:personal::xxxxx | train_combined_1738729676133.jsonl | 1,009 | Production model |
I took a two-stage approach: first confirming “it works” with a small test, then fine-tuning with the full production data.
Integration into Production
The fine-tuned model is integrated into the meloik project’s prediction batch (predict_stock_realtime2).
```php
// Prediction request to the OpenAI API
$apiKey = CHATGPT_API_KEY;
$url = 'https://api.openai.com/v1/chat/completions';

// Fetch prediction data (company info + news + prices + financials + macro indicators)
$jsonData = Utility::getPredictionData($db, $company['code'], $start_date, $end_date);

// Predict with the fine-tuned model
$data = [
    'model' => 'ft:gpt-4o-mini-2024-07-18:personal::xxxxx',
    'messages' => [
        ['role' => 'system', 'content' => 'You are a stock prediction assistant.'],
        ['role' => 'user', 'content' => $jsonData]
    ]
];
```
This batch, scheduled via crontab, automatically sends prediction requests to the fine-tuned model whenever news is published, and saves the results to MySQL.
Lessons from OpenAI API Fine-Tuning
Here are the key insights from Phase 3.
Data Preparation Really Is 90%
Running the fine-tuning itself is just uploading a file and making one API call. There is almost no technical difficulty.
However, the process of creating the data to upload — format conversion, garbage data removal, HTML processing, validation — consumed 90% of the total effort.
“The hard part of LLM fine-tuning is not training the model but preparing the data” is something many people say, but I did not truly internalize it until I experienced it firsthand.
The Importance of Full Data
On Colab, I was forced to strip information. With the OpenAI API, I could input the full data. I believe this difference had a major impact on accuracy.
For a task like stock price prediction, not just news but multi-faceted information — financial data, macroeconomic indicators, price trends — is critical. Human analysts do not make judgments based on a single piece of information either.
Base Model Capability Is Decisive
gpt-4o-mini already has strong instruction-following ability before fine-tuning. Tell it to “output in JSON format” and it will reliably return JSON. This base capability underpins the stability after fine-tuning.
The 7-8B open-source models lacked this base capability, which is why their output format remained unstable no matter how much fine-tuning was applied.
Looking Back at Phases 1 Through 3
Let me reflect on all three phases.
| Phase | What I Did | Result | What I Learned |
|---|---|---|---|
| Phase 1 | ELYZA 8B on RTX 3060 | Out of VRAM | Felt hardware constraints firsthand |
| Phase 2 | ELYZA + LLM-jp on Colab | Insufficient accuracy | Mastered LoRA/quantization/GGUF conversion |
| Phase 3 | OpenAI API fine-tuning | Adopted for production | Felt the importance of data preparation firsthand |
Looking at it linearly, you might think “why not just use the OpenAI API from the start?” But it was precisely because of the Phase 1-2 experience that I could work faster on data design and problem-solving in Phase 3.
For example, I only truly appreciated “the value of being able to input full data” because I had experienced the pain of stripping information on Colab. And I only understood “the importance of base model capability” because I had witnessed the output instability of 7-8B models firsthand.
Summary
Key points from OpenAI API fine-tuning (Phase 3):
- Fine-tuned on gpt-4o-mini-2024-07-18 as the base model
- Data preparation (format conversion, cleaning, validation) was 90% of the total effort
- Dealt with unexpected data issues: PHP error messages mixed in, residual HTML tags
- Training completed in about 8 minutes (overwhelmingly faster than the hours on Colab)
- All five data types could be input in full, yielding stable JSON output
- Integrated into the production prediction batch for automated operation
Next time, I will dive deeper into training data design — the factor that most directly affects prediction accuracy.
Previous: Part 4 — “Stock Prediction on Colab”
Next: Part 6 — “Training Data Design: How I Integrated Five Types of Data”