Part 6: Training Data Design -- How I Integrated Five Types of Data
Introduction
In the previous installment, I shared how OpenAI API fine-tuning achieved training in about 8 minutes with stable JSON output. But as I wrote, “data preparation is 90% of the work.” The single biggest factor determining the success or failure of fine-tuning is the quality and design of the training data.
This time, I will focus on training data design. How did I integrate five types of data? How were ground-truth labels created? What trial and error led to the final dataset? I will share the design philosophy and concrete implementation in detail.
Data Sources
The Senrigan service’s data is collected and generated by the meloik project (PHP + MySQL).
- News and IR information: Collected from publicly disclosed corporate information such as TDnet (Timely Disclosure network)
- Stock price data: Daily OHLCV (Open, High, Low, Close, Volume) for each stock
- Company information: Industry, market capitalization, company description, market segment, etc.
- Financial data: Earnings information (revenue, profit margins, EPS, ROA, ROE)
- Macroeconomic indicators: CPI, GDP, unemployment rate, policy interest rates, exchange rates
Because news and IR information are sourced from public disclosures such as TDnet, please always verify against the original disclosures for accuracy.
Input Data Structure — Anatomy of a Single Sample
Here is the JSON structure of one training data sample. This is the information fed as the “user” message during fine-tuning.
{
  "company_info": {
    "code": "4972",
    "name": "Soken Chemical & Engineering Co., Ltd.",
    "industry": "Chemicals",
    "market_segment": "TSE Standard",
    "market_cap": 28967000000,
    "shares_outstanding": 8300000,
    "stock_price": 3490,
    "description": "Major acrylic adhesive manufacturer. Chemical products are the core business..."
  },
  "news": {
    "date": "20241105",
    "time": "15:30",
    "type": "Earnings",
    "headline": "Soken Chemical revises FY ordinary profit forecast up 51%, raises record-high estimate",
    "content": "(Summarized news body)"
  },
  "stock_prices": [
    {"date": "20241030", "open": 3260, "high": 3265, "low": 3195, "close": 3195, "volume": 8300},
    {"date": "20241031", "open": 3210, "high": 3250, "low": 3185, "close": 3250, "volume": 11200},
    {"date": "20241101", "open": 3255, "high": 3300, "low": 3230, "close": 3265, "volume": 15400},
    {"date": "20241102", "open": 3270, "high": 3280, "low": 3220, "close": 3240, "volume": 9100},
    {"date": "20241105", "open": 3490, "high": 3500, "low": 3440, "close": 3490, "volume": 52600}
  ],
  "financial_data": {
    "2023": {"revenue": 38129000000, "profit_margin": 5.33, "eps": 173.9, "roa": 3.72, "roe": 4.92},
    "2024": {"revenue": 41318000000, "profit_margin": 9.26, "eps": 317.7, "roa": 6.29, "roe": 8.38}
  },
  "macro_indicators": {
    "cpi": 107.95,
    "gdp_growth": 0.32,
    "unemployment_rate": 2.53,
    "policy_rate": 0.75,
    "exchange_rate": 151.37
  }
}
The Role of Each Data Type
1. Company Information — “What Kind of Company Is This?”
Industry, market capitalization, and company description provide essential context for judging the impact of news.
The same “10% revenue increase” elicits different market reactions for a chemical manufacturer versus an IT company. The impact of the same news is entirely different for a mid-cap stock with a 10 billion yen market cap versus a large-cap with 1 trillion yen. Giving the model this context enables more accurate predictions.
2. News — The Primary Prediction Material
News is the single most important input for prediction. Earnings announcements, earnings revisions, new product launches, M&A — corporate information is contained here.
The strength of an LLM is its ability to “understand” this natural language news. It can judge that “51% upward revision to ordinary income” is positive and that “turning to a deficit” is negative, just as a human would.
3. Stock Price Data — The Last 5 Trading Days
I include OHLCV (Open, High, Low, Close, Volume) for the last 5 trading days.
The price movement pattern over the past 5 days — whether the stock is in an uptrend, a downtrend, or experiencing a volume spike — provides important clues for next-day prediction.
4. Financial Data — The Company’s Fundamentals
I include 2 years of revenue, profit margins, EPS (Earnings Per Share), ROA (Return on Assets), and ROE (Return on Equity).
Financial data indicates “whether this company is financially healthy” and “whether it is growing.” Positive news for a company with strong financials is expected to elicit a different market reaction than positive news for a company with declining performance.
5. Macroeconomic Indicators — The Overall Market Temperature
I include CPI (Consumer Price Index), GDP growth rate, unemployment rate, policy interest rates, and exchange rates.
No matter how positive a company’s individual news is, stock prices may fall if the macroeconomic environment is deteriorating. Conversely, a favorable macro environment may soften the impact of individual negative news.
Output Data Structure — How Ground-Truth Labels Are Created
The “ground-truth labels” in the training data are used as the “assistant” message during fine-tuning. In other words, they represent the model answer for “given this input, the model should respond like this.”
Three Prediction Values
{
  "close_to_next_open": {"price": 3045, "change_pct": -12.75, "trend": "down"},
  "close_to_next_close": {"price": 3075, "change_pct": -11.89, "trend": "down"},
  "next_open_to_close": {"price": 3075, "change_pct": 0.98, "trend": "neutral"}
}
The reason for outputting three prediction values is to accommodate different investment decision styles.
- Close today -> Open tomorrow: Buy after market close on the news day and sell the next morning (overnight trade)
- Close today -> Close tomorrow: For those who want to see the full next-day movement
- Open tomorrow -> Close tomorrow: Buy at the next morning’s opening and sell at the closing (day trade)
Ground-Truth Label Calculation
Ground-truth labels are calculated from the actual next trading day’s stock prices.
News publication date: 2024/11/05
Close on that day: 3,490 yen
Next trading day's open: 3,045 yen
Next trading day's close: 3,075 yen
Close to next open change = (3045 - 3490) / 3490 * 100 = -12.75%
Close to next close change = (3075 - 3490) / 3490 * 100 = -11.89%
Next open to close change = (3075 - 3045) / 3045 * 100 = 0.98%
The change ratios are computed using the stock price at the time the news was published and the actual stock prices on the next trading day. These become the ground-truth labels.
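The calculation above can be sketched as a small helper. Note that the ±1% threshold used here for the "trend" label is an assumed value for illustration, not a rule stated in the article:

```python
def pct_change(base: float, target: float) -> float:
    """Percentage change from base to target, rounded to 2 decimals."""
    return round((target - base) / base * 100, 2)

def make_labels(close_today: float, next_open: float, next_close: float) -> dict:
    """Ground-truth labels from the actual next-trading-day prices.
    The +/-1% trend threshold is an assumption for illustration."""
    def trend(change: float) -> str:
        return "up" if change > 1.0 else "down" if change < -1.0 else "neutral"

    pairs = {
        "close_to_next_open": (next_open, pct_change(close_today, next_open)),
        "close_to_next_close": (next_close, pct_change(close_today, next_close)),
        "next_open_to_close": (next_close, pct_change(next_open, next_close)),
    }
    return {key: {"price": price, "change_pct": change, "trend": trend(change)}
            for key, (price, change) in pairs.items()}

# Worked example from above: close 3,490 -> next open 3,045, next close 3,075
labels = make_labels(3490, 3045, 3075)
# labels["close_to_next_open"]["change_pct"] -> -12.75
```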
Colab Version vs. OpenAI Version — Data Differences
As I mentioned in Part 4, the data composition differs significantly between the Colab and OpenAI versions.
| Item | Colab (ELYZA / LLM-jp) | OpenAI (gpt-4o-mini) |
|---|---|---|
| News | Used | Used |
| Company info | Used | Used |
| Stock prices | Day-over-day ratios only | Full OHLCV |
| Financial data | Removed | Used |
| Macro indicators | Removed | Used |
| max_length | 1,024 tokens | 16,385 tokens |
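One way the Colab version's "day-over-day ratios only" compression could look, as a sketch: the input field names match the article's sample, but the output key name is an assumption.

```python
def to_day_over_day(prices: list) -> list:
    """Compress full OHLCV rows into close-to-close day-over-day change
    ratios, as the token-constrained Colab version required. Input field
    names follow the article's sample; the output key is an assumption."""
    compressed = []
    for prev, cur in zip(prices, prices[1:]):
        change = round((cur["close"] - prev["close"]) / prev["close"] * 100, 2)
        compressed.append({"date": cur["date"], "change_pct": change})
    return compressed

rows = [
    {"date": "20241030", "close": 3195},
    {"date": "20241031", "close": 3250},
    {"date": "20241101", "close": 3265},
]
to_day_over_day(rows)
# -> [{"date": "20241031", "change_pct": 1.72}, {"date": "20241101", "change_pct": 0.46}]
```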
The Colab version’s output format was also simpler.
{
  "prediction": {
    "close_to_next_open_change_pct": -0.23,
    "close_to_next_close_change_pct": 3.09,
    "next_open_to_close_change_pct": 3.33
  }
}
The OpenAI version uses a richer output format that includes “price” and “trend” in addition to the change percentage. With tokens to spare, I could pack more information into the output as well.
Data Evolution — 8 Versions
It took 8 versions to arrive at the final dataset.
| Version | File | Samples | Description |
|---|---|---|---|
| v1 | train_data.json | 2 | Hand-crafted test data. Includes dummy data for basic validation |
| v2 | training_data.jsonl | Small | prompt-completion format. First auto-generated data |
| v3 | training_data_chat.jsonl | Small | v2 converted to chat (messages) format |
| v4 | training_4972.jsonl | Medium | Single stock only (Soken Chemical) |
| v5 | train_combined.jsonl | Medium | Multiple stocks combined (small scale) |
| v6 | train_combined_final.jsonl | Large | Cleaned large-scale data |
| v7 | train_combined_1738729676133.jsonl | 1,009 | Final version for OpenAI FT |
| v8 | train_combined… (2 files) | 2,011 | For Colab (stripped-down data) |
Key Points in the Trial and Error
v1-v3: Format experimentation
I started with 2 hand-crafted samples to confirm “does fine-tuning even run?” Then moved to auto-generation, adding the step of converting prompt-completion format data to chat format.
v4: Starting with a single stock
First trained on data for just one stock (Soken Chemical 4972) and ran prediction tests. This confirmed how well a “model specialized for a specific stock” could perform.
v5-v6: Expanding to multiple stocks and cleaning
Combined data from multiple stocks. This is where the PHP error contamination and residual HTML tags were discovered, leading to stronger cleaning processes.
v7: Final OpenAI FT version
1,009 clean samples. Averaging 1,303 tokens per sample, approximately 1.3 million tokens total. The model fine-tuned on this data is what runs in production.
The Full Data Cleaning Process
I touched on this partially in Part 5, but let me organize the complete cleaning steps.
Step 1: Remove JSON Parse Errors
import json

# Skip lines that cannot be parsed as JSON
# (input/output file names here are illustrative)
skipped = 0
with open("train_raw.jsonl", encoding="utf-8") as f_in, \
     open("train_clean.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.strip()
        if not line:
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        f_out.write(json.dumps(data, ensure_ascii=False) + "\n")
This removes PHP error messages (Fatal error: Allowed memory size...) and empty lines.
Step 2: Remove HTML Tags and Entities
import re
import html

def clean_text(text):
    # Remove HTML tags
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"<.*?>", "", text)
    # Decode HTML entities
    text = html.unescape(text)
    return text
Handles <br /> tags, entities such as &lt; and &raquo;, and other artifacts remaining in news article text.
Step 3: Format Conversion
Converting from prompt-completion format to chat (messages) format.
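A minimal sketch of that conversion. The system prompt content here is an assumption, not the one actually used:

```python
import json

def prompt_completion_to_chat(line: str, system_prompt: str) -> str:
    """Convert one prompt-completion JSONL record into the chat
    (messages) format expected by OpenAI chat fine-tuning."""
    record = json.loads(line)
    chat = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["completion"]},
        ]
    }
    return json.dumps(chat, ensure_ascii=False)
```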
Step 4: Validation
Final check using OpenAI’s official validation script.
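OpenAI's published validation examples check more (role sequences, token budgets, and so on), but lightweight sanity checks along these lines can run first. This is a sketch of the kind of checks involved, not the official script:

```python
import json

def validate_records(lines) -> dict:
    """Lightweight pre-upload sanity checks on chat-format JSONL records.
    A sketch of the kind of checks involved, not OpenAI's official script."""
    stats = {"samples": 0, "errors": []}
    for i, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            stats["errors"].append(f"line {i}: invalid JSON")
            continue
        messages = record.get("messages")
        if not isinstance(messages, list) or not messages:
            stats["errors"].append(f"line {i}: missing messages")
            continue
        if messages[-1].get("role") != "assistant":
            stats["errors"].append(f"line {i}: last message is not assistant")
            continue
        stats["samples"] += 1
    return stats
```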
Final dataset statistics:
Samples: 1,009
Average tokens: 1,303 tokens/sample
Minimum: 847 tokens
Maximum: 3,168 tokens
Assistant response: Average 112 tokens
Over 16K: 0
Why 1,009 Samples Is Enough
1,009 samples is not a large amount by general LLM fine-tuning standards. However, with OpenAI’s fine-tuning approach, since the base model (gpt-4o-mini) has already undergone massive pre-training, it can learn effectively even from relatively small datasets.
The goal of fine-tuning is not to train a model from scratch, but to specialize existing knowledge for a specific task. gpt-4o-mini already understands financial terminology and has numerical reasoning capabilities. The 1,009 stock prediction samples were sufficient to align those capabilities with “the stock prediction output format.”
That said, increasing the data volume could further improve accuracy. This remains an area for future improvement.
Design Principles Behind the Data
Choosing JSON Format
The reason for using JSON format for the input data is to convey structured information to the LLM accurately.
I could have written it in natural language: “Soken Chemical is a company in the chemicals industry with a market cap of approximately 29 billion yen.” But JSON makes the data hierarchy and types explicit. LLMs have been extensively trained on JSON-format data, so their comprehension of structured data is strong.
Japanese Keys
In the actual Japanese version, the JSON key names are also Japanese (rendered in this article as "company_info," "news," "stock_prices"). English keys would have worked too, but since the values (news text, etc.) are in Japanese, I decided that unifying the keys in Japanese would be easier for the model to process.
5 Days of Stock Price Data
I include 5 trading days of stock price data. I considered 5 days an appropriate window for capturing short-term trends. A single day does not reveal a trend, while 20 days would be too much information and consume too many tokens.
The Training Data Generation Pipeline
Training data is not hand-crafted — it is auto-generated by the meloik project’s batch processes. Here is the general flow:
1. Select target news
+-> Fetch news with is_checked=0 from the ai_news table in MySQL
2. Collect related data
+-- Company info: ai_companies table
+-- Stock prices: ai_stock_prices table (last 5 trading days)
+-- Financial data: ai_financials table (2 years)
+-- Macro indicators: ai_macro_indicators table
3. Calculate ground-truth labels
+-> Compute change ratios from the actual next trading day's prices
4. Convert to JSON format
+-> Combine all of the above into a single JSON object
5. Output as JSONL
+-> Save as a JSONL file with one sample per line
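The steps above can be sketched in Python as follows. The real pipeline runs as PHP batch jobs in the meloik project; `fetch_related` and `make_labels` below are stand-ins for the MySQL lookups and the label calculation:

```python
import json

def build_jsonl(news_items, fetch_related, make_labels):
    """Sketch of pipeline steps 2-5: for each news item, gather related
    data, compute ground-truth labels, and emit one chat-format JSONL
    line. fetch_related / make_labels are stand-ins for the real
    MySQL lookups and label calculation in the PHP batch jobs."""
    lines = []
    for news in news_items:
        sample = fetch_related(news)   # company info, prices, financials, macro
        labels = make_labels(news)     # next-trading-day change ratios
        record = {"messages": [
            {"role": "user", "content": json.dumps(sample, ensure_ascii=False)},
            {"role": "assistant", "content": json.dumps(labels, ensure_ascii=False)},
        ]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines  # caller writes "\n".join(lines) to a .jsonl file
```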
This pipeline means that as more news is collected, training data grows automatically. In the future, it will be possible to periodically add data and re-fine-tune the model.
Reflections and Future Improvements
Reflection: Data Diversity
The current 1,009 samples may have biases in industry and stock representation. If certain industries (e.g., chemicals, information technology) are overrepresented while others (e.g., finance, real estate) are underrepresented, the model’s predictions could be skewed.
Future Improvements
- Expanding data volume: From 1,009 to several thousand samples, covering a wider range of stocks and industries
- Temporal diversity: Including varied market conditions — bull markets, bear markets, high-volatility periods
- News type variety: Not just earnings, but M&A, new product announcements, regulatory changes, and other news types
Summary
Key points in training data design:
- Integrated 5 data types (company info, news, stock prices, financials, macro indicators) in JSON format
- Ground-truth labels are calculated from actual next-trading-day price movements
- Arrived at the final version (1,009 samples) through 8 iterations of trial and error
- Data cleaning (PHP error removal, HTML processing, validation) determines data quality
- The Colab version was forced to strip data due to token limits; the OpenAI version uses full data
- Data is sourced from publicly disclosed information such as TDnet (Timely Disclosure network)
“Data quality determines model quality” — this is the single most important lesson in fine-tuning.
Next time, I will share the challenges of multilingual support and translation LLM selection. I was drawn to DeepSeek’s low cost and adopted it, only to face unexpected problems.
Previous: Part 5 — “OpenAI API Fine-Tuning”
Next: Part 7 — “Choosing a Translation LLM: From DeepSeek to ChatGPT”