Part 5: OpenAI API Fine-Tuning -- The Final Answer That Took Just 8 Minutes

Introduction

In the previous installments, I shared how I challenged two open-source models on a local GPU and Google Colab, and how neither achieved practical accuracy.

This time, I will talk about pivoting to OpenAI API fine-tuning. To state the conclusion upfront: training completed in about 8 minutes, and I got stable JSON output with sufficient accuracy. The difference was so stark that I wondered what the months of trial and error on Colab had been for.

However, “completed in 8 minutes” refers only to the training process itself. In reality, preparing and preprocessing the data involved considerable effort. This was the phase where I experienced firsthand that “data preparation is 90% of the work.”


The Decision to Pivot

Let me revisit the cost comparison I touched on last time.

Aspect | Local / Colab | OpenAI API
Initial cost | GPU purchase or Colab Pro | None
Training cost | Electricity, GPU time | Token-based billing (~tens of dollars)
Training time | Hours to tens of hours | ~8 minutes
Accuracy | Low to medium | High
Output format stability | Low | High (reliable JSON)
Operational ease | Infrastructure management required | Just call the API
Model ownership | In your hands | On OpenAI's servers

For independent development, when you weigh accuracy, cost, and ease of use, I concluded that the API approach was the optimal solution.

I did feel some resistance to “not having the model locally.” My own model, trained on my own data, living on OpenAI’s infrastructure rather than my own server. But thinking about it practically, maintaining my own GPU server for every inference request would cost more in both money and effort than delegating to an API. And if I ever want to run things locally in the future, distillation remains an option.


Why I Chose gpt-4o-mini

OpenAI API fine-tuning lets you choose from several base models. I chose gpt-4o-mini-2024-07-18.

There were three reasons:

  1. Low cost: Significantly cheaper than GPT-4o or GPT-4. Since fine-tuning is billed by token count, a cheaper model makes trial and error much more practical
  2. Chat-format fine-tuning support: Can be trained with messages-format data. The role separation of system / user / assistant is clear
  3. Sufficient reasoning capability: Expected to have the analytical ability needed for stock prediction and the capacity for stable structured output in JSON format

Data Preparation — This Is 90% of the Work

Running OpenAI API fine-tuning is extremely easy in itself. Upload a file and make one API call. However, preparing the data to upload consumed 90% of the entire effort.

Step 1: Data Format Conversion

The data I initially created was in prompt-completion format.

{"prompt": "Company info, news, stock data...", "completion": "Prediction result JSON..."}

However, since gpt-4o-mini is a chat model, this format throws an error.

Invalid file format. Input file is in the prompt-completion format,
but the specified model gpt-4o-mini-2024-07-18 is a chat model
and requires chat-formatted data.

I needed to convert to messages format.

{
  "messages": [
    {"role": "system", "content": "You are a stock prediction assistant."},
    {"role": "user", "content": "<JSON with company info + news + prices + financials + macro indicators>"},
    {"role": "assistant", "content": "<Prediction result JSON>"}
  ]
}

I wrote a conversion script to batch-convert everything.

import json

def convert_to_chat_format(input_file, output_file):
    """Convert prompt-completion JSONL into chat (messages) JSONL."""
    with open(input_file, "r", encoding="utf-8") as f_in, open(output_file, "w", encoding="utf-8") as f_out:
        for line in f_in:
            data = json.loads(line)
            chat_format = {
                "messages": [
                    {"role": "system", "content": "You are a stock prediction assistant."},
                    {"role": "user", "content": data["prompt"]},
                    {"role": "assistant", "content": data["completion"]}
                ]
            }
            f_out.write(json.dumps(chat_format, ensure_ascii=False) + "\n")
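Before running the script over the whole file, a quick in-memory check of the same mapping (one hypothetical record) confirms the role layout:

```python
import json

# Hypothetical prompt-completion record, mirroring the original format
raw = '{"prompt": "Company info, news, stock data...", "completion": "{\\"trend\\": \\"up\\"}"}'
data = json.loads(raw)

chat = {
    "messages": [
        {"role": "system", "content": "You are a stock prediction assistant."},
        {"role": "user", "content": data["prompt"]},
        {"role": "assistant", "content": data["completion"]},
    ]
}

print([m["role"] for m in chat["messages"]])  # ['system', 'user', 'assistant']
```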

Step 2: PHP Error Messages in the Data

The conversion was done, but when I examined the data more closely, I found a problem. Around line 229 of the JSONL file, a PHP error message had crept in.

Fatal error: Allowed memory size of 134217728 bytes exhausted
(tried to allocate 8 bytes) in /var/www/DBHandler.class.php on line 280

The PHP batch process on the meloik VPS that generates training data had hit its memory limit, and the error message was written straight into the JSONL file.

If garbage data like this gets mixed into training data, the model would learn that “PHP error messages are part of prediction results.”

As a countermeasure, I added a cleaning step that skips any line that cannot be parsed as valid JSON.

def clean_jsonl(input_file, output_file):
    cleaned = 0
    skipped = 0
    with open(input_file, "r", encoding="utf-8") as f_in, open(output_file, "w", encoding="utf-8") as f_out:
        for i, line in enumerate(f_in, 1):
            line = line.strip()
            if not line:
                skipped += 1
                continue
            try:
                data = json.loads(line)
                f_out.write(json.dumps(data, ensure_ascii=False) + "\n")
                cleaned += 1
            except json.JSONDecodeError:
                print(f"Line {i}: Skipped (invalid JSON)")
                skipped += 1

    print(f"Cleaned: {cleaned}, Skipped: {skipped}")
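To confirm the filter behaves as intended, here is a self-contained sketch that re-implements the same skip logic in memory, with one injected PHP error line:

```python
import json

lines = [
    '{"messages": [{"role": "user", "content": "ok"}]}',
    'Fatal error: Allowed memory size of 134217728 bytes exhausted',
    '{"messages": [{"role": "user", "content": "also ok"}]}',
]

kept = []
for line in lines:
    try:
        kept.append(json.loads(line))
    except json.JSONDecodeError:
        pass  # garbage line skipped, exactly as in clean_jsonl above

print(len(kept))  # 2
```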

Step 3: Residual HTML Tags and Entities

Digging further, I also discovered that the news article text contained leftover HTML tags and HTML entities.

<br /> <- line break tag
&lt;   <- HTML entity for <
&raquo; <- HTML entity for »

The HTML from the original data source had been left intact.

import re
import html

def clean_html(text):
    # Remove HTML tags
    text = re.sub(r"<br\s*/?>", "", text)
    text = re.sub(r"<.*?>", "", text)

    # Decode HTML entities
    text = html.unescape(text)

    return text
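As a sanity check, the same cleanup can be exercised on a synthetic string. This is a self-contained sketch: the generic tag pattern already matches <br />, so the two regex passes are collapsed into one.

```python
import re
import html

def clean_html(text):
    text = re.sub(r"<.*?>", "", text)  # strip any HTML tag, including <br />
    return html.unescape(text)         # decode entities like &lt; and &raquo;

sample = "Revenue up 5%<br />&laquo;guidance&raquo; raised &lt;unconfirmed&gt;"
print(clean_html(sample))  # Revenue up 5%«guidance» raised <unconfirmed>
```

Note that the entities are decoded after the tags are stripped, so text like &lt;unconfirmed&gt; survives as literal angle brackets instead of being deleted as a fake tag.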

Step 4: Validation

I used the official validation script provided by OpenAI (tiktoken-based) for a final check.

Format errors:      0
Over 16K tokens:    0
Sample count:       1,009
Average tokens:     1,303 tokens/sample (min: 847, max: 3,168)
Assistant response: Average 112 tokens

All 1,009 samples passed validation.
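I cannot reproduce OpenAI's validation script here, but the structural half of the check (leaving aside token counting, which needs tiktoken) can be sketched roughly like this:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_format(path):
    """Count samples and structural errors against the chat fine-tuning schema."""
    samples = 0
    errors = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            samples += 1
            try:
                messages = json.loads(line)["messages"]
                ok = (all(m["role"] in VALID_ROLES and isinstance(m["content"], str)
                          for m in messages)
                      # at least one assistant turn is required for training
                      and any(m["role"] == "assistant" for m in messages))
            except (json.JSONDecodeError, KeyError, TypeError):
                ok = False
            if not ok:
                errors += 1
    return samples, errors
```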

The Decisive Difference from Colab

Here is an extremely important point.

On Colab, the 1,024-token limit forced me to remove financial data and macroeconomic indicators. But the OpenAI API’s maximum token length is 16,385 tokens. Data averaging 1,303 tokens per sample fits with room to spare.

This meant I could input all five data types (company information, news, stock prices, financials, and macroeconomic indicators) in full. There was no longer any need to strip information.

This was the biggest difference from the Colab experiments, and I believe it was a major factor in the accuracy improvement.


Running the Fine-Tuning

Once the data was prepared, the rest was surprisingly easy.

Cost Estimate

Total tokens:       ~1.3 million tokens
Training epochs:    3
Billed tokens:      ~3.88 million tokens
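The billed-token figure is simply total training tokens times epochs; the dollar cost then depends on the per-token training rate. The rate below is a placeholder, not a quoted price -- check OpenAI's current pricing page.

```python
total_tokens = 1_293_000   # ~1.3M tokens in the training file (approximate)
epochs = 3

billed_tokens = total_tokens * epochs
print(billed_tokens)       # 3879000, i.e. ~3.88M

# Training cost = billed tokens x per-million-token rate (placeholder value)
price_per_1m_tokens = 3.00
print(round(billed_tokens / 1_000_000 * price_per_1m_tokens, 2))
```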

File Upload and Training Start

from openai import OpenAI

client = OpenAI()

# 1. Upload file
with open("train_combined_final.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

print(f"File ID: {file.id}")

# 2. Start fine-tuning
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18"
)

print(f"Job ID: {fine_tune_job.id}")
print(f"Status: {fine_tune_job.status}")

That is all it takes. Then you just wait.
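Rather than refreshing the dashboard, you can poll the job from the same script. A small helper, sketched against the openai v1 Python client (jobs.retrieve returns the job object with its current status):

```python
import time

TERMINAL = {"succeeded", "failed", "cancelled"}

def wait_for_job(client, job_id, poll_seconds=30):
    """Poll a fine-tuning job until it reaches a terminal status."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"Status: {job.status}")
        if job.status in TERMINAL:
            return job
        time.sleep(poll_seconds)
```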

The status transitioned through validating_files -> running -> succeeded, and it completed in about 7-8 minutes.

Given that training on Colab had taken hours and still failed to deliver accuracy, this speed was shocking.


Prediction Test — A Clear Difference from Colab

Let me run a prediction with the fine-tuned model.

response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:personal::xxxxx",
    messages=[
        {"role": "system", "content": "You are a stock prediction assistant."},
        {"role": "user", "content": json.dumps(test_data, ensure_ascii=False)}
    ]
)

prediction = response.choices[0].message.content
print(prediction)

Test result (Soken Chemical 4972, data from 2024/11/28):

{
  "close_to_next_open":  {"price": 3045, "change_pct": 0,    "trend": "neutral"},
  "close_to_next_close": {"price": 3075, "change_pct": 0.98, "trend": "neutral"},
  "next_open_to_close":  {"price": 3075, "change_pct": 1,    "trend": "neutral"}
}

Perfect JSON format with stable output. Not once did it respond in natural language like the Colab models had.

This is the strength of gpt-4o-mini’s base capabilities. It reliably follows the instruction to “output in JSON format” and consistently returns structured output. This was a fundamentally different level of instruction-following ability compared to the 7-8B open-source models.
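Stable output still deserves a guard before anything reaches production. A minimal validation sketch -- the key names come from the test result above, while the full set of trend labels is my assumption:

```python
import json

EXPECTED_KEYS = {"close_to_next_open", "close_to_next_close", "next_open_to_close"}
EXPECTED_FIELDS = {"price", "change_pct", "trend"}

def validate_prediction(raw):
    """Parse model output and verify it matches the trained schema; raise on mismatch."""
    pred = json.loads(raw)
    assert set(pred) == EXPECTED_KEYS, f"unexpected keys: {set(pred)}"
    for key, entry in pred.items():
        assert set(entry) == EXPECTED_FIELDS, f"{key}: unexpected fields"
        # trend labels other than "neutral" are assumed for illustration
        assert entry["trend"] in {"up", "down", "neutral"}
    return pred
```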


Fine-Tuned Model Registry

Here is a record of the models created during the trial-and-error process.

Model ID | Dataset | Samples | Notes
ft:gpt-4o-mini-2024-07-18:personal::xxxxx | training_data_chat.jsonl | Small | Initial test. Data format validation
ft:gpt-4o-mini-2024-07-18:personal::xxxxx | train_combined_1738729676133.jsonl | 1,009 | Production model

I took a two-stage approach: first confirming “it works” with a small test, then fine-tuning with the full production data.


Integration into Production

The fine-tuned model is integrated into the meloik project’s prediction batch (predict_stock_realtime2).

// Prediction request to OpenAI API
$apiKey = CHATGPT_API_KEY;
$url = 'https://api.openai.com/v1/chat/completions';

// Fetch prediction data (company info + news + prices + financials + macro indicators)
$jsonData = Utility::getPredictionData($db, $company['code'], $start_date, $end_date);

// Predict with the fine-tuned model
$data = [
    'model' => 'ft:gpt-4o-mini-2024-07-18:personal::xxxxx',
    'messages' => [
        ['role' => 'system', 'content' => 'You are a stock prediction assistant.'],
        ['role' => 'user', 'content' => $jsonData]
    ]
];

This batch, scheduled via crontab, automatically sends prediction requests to the fine-tuned model whenever news is published, and saves the results to MySQL.


Lessons from OpenAI API Fine-Tuning

Here are the key insights from Phase 3.

Data Preparation Really Is 90%

Running the fine-tuning itself is just uploading a file and making one API call. There is almost no technical difficulty.

However, the process of creating the data to upload — format conversion, garbage data removal, HTML processing, validation — consumed 90% of the total effort.

“The hard part of LLM fine-tuning is not training the model but preparing the data” is something many people say, but I did not truly internalize it until I experienced it firsthand.

The Importance of Full Data

On Colab, I was forced to strip information. With the OpenAI API, I could input the full data. I believe this difference had a major impact on accuracy.

For a task like stock price prediction, not just news but multi-faceted information — financial data, macroeconomic indicators, price trends — is critical. Human analysts do not make judgments based on a single piece of information either.

Base Model Capability Is Decisive

gpt-4o-mini already has strong instruction-following ability before fine-tuning. Tell it to “output in JSON format” and it will reliably return JSON. This base capability underpins the stability after fine-tuning.

The 7-8B open-source models lacked this base capability, which is why their output format remained unstable no matter how much fine-tuning was applied.


Looking Back at Phases 1 Through 3

Let me reflect on all three phases.

Phase | What I Did | Result | What I Learned
Phase 1 | ELYZA 8B on RTX 3060 | Out of VRAM | Felt hardware constraints firsthand
Phase 2 | ELYZA + LLM-jp on Colab | Insufficient accuracy | Mastered LoRA/quantization/GGUF conversion
Phase 3 | OpenAI API fine-tuning | Adopted for production | Felt the importance of data preparation firsthand

Looking at it linearly, you might think “why not just use the OpenAI API from the start?” But it was precisely because of the Phase 1-2 experience that I could work faster on data design and problem-solving in Phase 3.

For example, I only truly appreciated “the value of being able to input full data” because I had experienced the pain of stripping information on Colab. And I only understood “the importance of base model capability” because I had witnessed the output instability of 7-8B models firsthand.


Summary

Key points from OpenAI API fine-tuning (Phase 3):

  • Fine-tuned on gpt-4o-mini-2024-07-18 as the base model
  • Data preparation (format conversion, cleaning, validation) was 90% of the total effort
  • Dealt with unexpected data issues: PHP error messages mixed in, residual HTML tags
  • Training completed in about 8 minutes (overwhelmingly faster than the hours on Colab)
  • All five data types could be input in full, yielding stable JSON output
  • Integrated into the production prediction batch for automated operation

Next time, I will dive deeper into training data design — the factor that most directly affects prediction accuracy.


Previous: Part 4 — “Stock Prediction on Colab”

Next: Part 6 — “Training Data Design: How I Integrated Five Types of Data”
