Part 6: Training Data Design -- How I Integrated Five Types of Data
Introduction
In the previous installment, I shared how OpenAI API fine-tuning achieved training in about 8 minutes with stable JSON output. But as I wrote, “data preparation is 90% of the work.” The single biggest factor determining the success or failure of fine-tuning is the quality and design of the training data.
This time, I will focus on training data design. How did I integrate five types of data? How were ground-truth labels created? What trial and error led to the final dataset? I will share the design philosophy and concrete implementation in detail.
Data Sources
The Senrigan service’s data is collected and generated by the meloik project (PHP + MySQL).
- News and IR information: Collected from publicly disclosed corporate information such as TDnet (Timely Disclosure network)
- Stock price data: Daily OHLCV (Open, High, Low, Close, Volume) for each stock
- Company information: Industry, market capitalization, company description, market segment, etc.
- Financial data: Earnings information (revenue, profit margins, EPS, ROA, ROE)
- Macroeconomic indicators: CPI, GDP, unemployment rate, policy interest rates, exchange rates
Because news and IR information are sourced from public disclosures such as TDnet, please always verify against the original disclosures for accuracy.
Input Data Structure — Anatomy of a Single Sample
Here is the JSON structure of one training data sample. This is the information fed as the “user” message during fine-tuning.
{
  "company_info": {
    "code": "4972",
    "name": "Soken Chemical & Engineering Co., Ltd.",
    "industry": "Chemicals",
    "market_segment": "TSE Standard",
    "market_cap": 28967000000,
    "shares_outstanding": 8300000,
    "stock_price": 3490,
    "description": "Major acrylic adhesive manufacturer. Chemical products are the core business..."
  },
  "news": {
    "date": "20241105",
    "time": "15:30",
    "type": "Earnings",
    "headline": "Soken Chemical revises FY ordinary profit forecast up 51%, raises record-high estimate",
    "content": "(Summarized news body)"
  },
  "stock_prices": [
    {"date": "20241030", "open": 3260, "high": 3265, "low": 3195, "close": 3195, "volume": 8300},
    {"date": "20241031", "open": 3210, "high": 3250, "low": 3185, "close": 3250, "volume": 11200},
    {"date": "20241101", "open": 3255, "high": 3300, "low": 3230, "close": 3265, "volume": 15400},
    {"date": "20241102", "open": 3270, "high": 3280, "low": 3220, "close": 3240, "volume": 9100},
    {"date": "20241105", "open": 3490, "high": 3500, "low": 3440, "close": 3490, "volume": 52600}
  ],
  "financial_data": {
    "2023": {"revenue": 38129000000, "profit_margin": 5.33, "eps": 173.9, "roa": 3.72, "roe": 4.92},
    "2024": {"revenue": 41318000000, "profit_margin": 9.26, "eps": 317.7, "roa": 6.29, "roe": 8.38}
  },
  "macro_indicators": {
    "cpi": 107.95,
    "gdp_growth": 0.32,
    "unemployment_rate": 2.53,
    "policy_rate": 0.75,
    "exchange_rate": 151.37
  }
}
The Role of Each Data Type
1. Company Information — “What Kind of Company Is This?”
Industry, market capitalization, and company description provide essential context for judging the impact of news.
The same “10% revenue increase” elicits different market reactions for a chemical manufacturer versus an IT company. The impact of the same news is entirely different for a mid-cap stock with a 10 billion yen market cap versus a large-cap with 1 trillion yen. Giving the model this context enables more accurate predictions.
2. News — The Primary Prediction Material
News is the single most important input for prediction. Earnings announcements, earnings revisions, new product launches, M&A — corporate information is contained here.
The strength of an LLM is its ability to “understand” this natural language news. It can judge that “51% upward revision to ordinary income” is positive and that “turning to a deficit” is negative, just as a human would.
3. Stock Price Data — The Last 5 Trading Days
I include OHLCV (Open, High, Low, Close, Volume) for the last 5 trading days.
The price movement pattern over the past 5 days — whether the stock is in an uptrend, a downtrend, or experiencing a volume spike — provides important clues for next-day prediction.
4. Financial Data — The Company’s Fundamentals
I include 2 years of revenue, profit margins, EPS (Earnings Per Share), ROA (Return on Assets), and ROE (Return on Equity).
Financial data indicates “whether this company is financially healthy” and “whether it is growing.” Positive news for a company with strong financials is expected to elicit a different market reaction than positive news for a company with declining performance.
5. Macroeconomic Indicators — The Overall Market Temperature
I include CPI (Consumer Price Index), GDP growth rate, unemployment rate, policy interest rates, and exchange rates.
No matter how positive a company’s individual news is, stock prices may fall if the macroeconomic environment is deteriorating. Conversely, a favorable macro environment may soften the impact of individual negative news.
Output Data Structure — How Ground-Truth Labels Are Created
The “ground-truth labels” in the training data are used as the “assistant” message during fine-tuning. In other words, they represent the model answer for “given this input, the model should respond like this.”
Three Prediction Values
{
  "close_to_next_open": {"price": 3045, "change_pct": -12.75, "trend": "down"},
  "close_to_next_close": {"price": 3075, "change_pct": -11.89, "trend": "down"},
  "next_open_to_close": {"price": 3075, "change_pct": 0.98, "trend": "neutral"}
}
The reason for outputting three prediction values is to accommodate different investment decision styles.
- Close today -> Open tomorrow: Buy after market close on the news day and sell the next morning (overnight trade)
- Close today -> Close tomorrow: For those who want to see the full next-day movement
- Open tomorrow -> Close tomorrow: Buy at the next morning’s opening and sell at the closing (day trade)
Ground-Truth Label Calculation
Ground-truth labels are calculated from the actual next trading day’s stock prices.
News publication date: 2024/11/05
Close on that day: 3,490 yen
Next trading day's open: 3,045 yen
Next trading day's close: 3,075 yen
Close to next open change = (3045 - 3490) / 3490 * 100 = -12.75%
Close to next close change = (3075 - 3490) / 3490 * 100 = -11.89%
Next open to close change = (3075 - 3045) / 3045 * 100 = 0.98%
The change ratios are computed using the stock price at the time the news was published and the actual stock prices on the next trading day. These become the ground-truth labels.
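The calculation above can be sketched as a small helper. Note that the ±1% threshold used here for the "trend" label is an assumed value for illustration, not a rule stated in the article:

```python
def pct_change(base: float, target: float) -> float:
    """Percentage change from base to target, rounded to 2 decimals."""
    return round((target - base) / base * 100, 2)

def make_labels(close_today: float, next_open: float, next_close: float) -> dict:
    """Ground-truth labels from the actual next-trading-day prices.
    The +/-1% trend threshold is an assumption for illustration."""
    def trend(change: float) -> str:
        return "up" if change > 1.0 else "down" if change < -1.0 else "neutral"

    pairs = {
        "close_to_next_open": (next_open, pct_change(close_today, next_open)),
        "close_to_next_close": (next_close, pct_change(close_today, next_close)),
        "next_open_to_close": (next_close, pct_change(next_open, next_close)),
    }
    return {key: {"price": price, "change_pct": change, "trend": trend(change)}
            for key, (price, change) in pairs.items()}

# Worked example from above: close 3,490 -> next open 3,045, next close 3,075
labels = make_labels(3490, 3045, 3075)
# labels["close_to_next_open"]["change_pct"] -> -12.75
```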
Colab Version vs. OpenAI Version — Data Differences
As I mentioned in Part 4, the data composition differs significantly between the Colab and OpenAI versions.
| Item | Colab (ELYZA / LLM-jp) | OpenAI (gpt-4o-mini) |
|---|---|---|
| News | Used | Used |
| Company info | Used | Used |
| Stock prices | Day-over-day ratios only | Full OHLCV |
| Financial data | Removed | Used |
| Macro indicators | Removed | Used |
| max_length | 1,024 tokens | 16,385 tokens |
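One way the Colab version's "day-over-day ratios only" compression could look, as a sketch: the input field names match the article's sample, but the output key name is an assumption.

```python
def to_day_over_day(prices: list) -> list:
    """Compress full OHLCV rows into close-to-close day-over-day change
    ratios, as the token-constrained Colab version required. Input field
    names follow the article's sample; the output key is an assumption."""
    compressed = []
    for prev, cur in zip(prices, prices[1:]):
        change = round((cur["close"] - prev["close"]) / prev["close"] * 100, 2)
        compressed.append({"date": cur["date"], "change_pct": change})
    return compressed

rows = [
    {"date": "20241030", "close": 3195},
    {"date": "20241031", "close": 3250},
    {"date": "20241101", "close": 3265},
]
to_day_over_day(rows)
# -> [{"date": "20241031", "change_pct": 1.72}, {"date": "20241101", "change_pct": 0.46}]
```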
The Colab version’s output format was also simpler.
{
  "prediction": {
    "close_to_next_open_change_pct": -0.23,
    "close_to_next_close_change_pct": 3.09,
    "next_open_to_close_change_pct": 3.33
  }
}
The OpenAI version uses a richer output format that includes “price” and “trend” in addition to the change percentage. With tokens to spare, I could pack more information into the output as well.
Data Evolution — 8 Versions
It took 8 versions to arrive at the final dataset.
| Version | File | Samples | Description |
|---|---|---|---|
| v1 | train_data.json | 2 | Hand-crafted test data. Includes dummy data for basic validation |
| v2 | training_data.jsonl | Small | prompt-completion format. First auto-generated data |
| v3 | training_data_chat.jsonl | Small | v2 converted to chat (messages) format |
| v4 | training_4972.jsonl | Medium | Single stock only (Soken Chemical) |
| v5 | train_combined.jsonl | Medium | Multiple stocks combined (small scale) |
| v6 | train_combined_final.jsonl | Large | Cleaned large-scale data |
| v7 | train_combined_1738729676133.jsonl | 1,009 | Final version for OpenAI FT |
| v8 | train_combined… (2 files) | 2,011 | For Colab (stripped-down data) |
Key Points in the Trial and Error
v1-v3: Format experimentation
I started with 2 hand-crafted samples to confirm “does fine-tuning even run?” Then moved to auto-generation, adding the step of converting prompt-completion format data to chat format.
v4: Starting with a single stock
First trained on data for just one stock (Soken Chemical 4972) and ran prediction tests. This confirmed how well a “model specialized for a specific stock” could perform.
v5-v6: Expanding to multiple stocks and cleaning
Combined data from multiple stocks. This is where the PHP error contamination and residual HTML tags were discovered, leading to stronger cleaning processes.
v7: Final OpenAI FT version
1,009 clean samples. Averaging 1,303 tokens per sample, approximately 1.3 million tokens total. The model fine-tuned on this data is what runs in production.
The Full Data Cleaning Process
I touched on this partially in Part 5, but let me organize the complete cleaning steps.
Step 1: Remove JSON Parse Errors
import json

# Skip lines that cannot be parsed as JSON
# (input/output file names here are illustrative)
skipped = 0
with open("train_raw.jsonl", encoding="utf-8") as f_in, \
     open("train_clean.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.strip()
        if not line:
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        f_out.write(json.dumps(data, ensure_ascii=False) + "\n")
This removes PHP error messages (Fatal error: Allowed memory size...) and empty lines.
Step 2: Remove HTML Tags and Entities
import re
import html

def clean_text(text):
    # Remove HTML tags
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"<.*?>", "", text)
    # Decode HTML entities
    text = html.unescape(text)
    return text
Handles <br /> tags, entities such as &lt; and &raquo;, and other artifacts remaining in news article text.
Step 3: Format Conversion
Converting from prompt-completion format to chat (messages) format.
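A minimal sketch of that conversion. The system prompt content here is an assumption, not the one actually used:

```python
import json

def prompt_completion_to_chat(line: str, system_prompt: str) -> str:
    """Convert one prompt-completion JSONL record into the chat
    (messages) format expected by OpenAI chat fine-tuning."""
    record = json.loads(line)
    chat = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["completion"]},
        ]
    }
    return json.dumps(chat, ensure_ascii=False)
```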
Step 4: Validation
Final check using OpenAI’s official validation script.
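OpenAI's published validation examples check more (role sequences, token budgets, and so on), but lightweight sanity checks along these lines can run first. This is a sketch of the kind of checks involved, not the official script:

```python
import json

def validate_records(lines) -> dict:
    """Lightweight pre-upload sanity checks on chat-format JSONL records.
    A sketch of the kind of checks involved, not OpenAI's official script."""
    stats = {"samples": 0, "errors": []}
    for i, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            stats["errors"].append(f"line {i}: invalid JSON")
            continue
        messages = record.get("messages")
        if not isinstance(messages, list) or not messages:
            stats["errors"].append(f"line {i}: missing messages")
            continue
        if messages[-1].get("role") != "assistant":
            stats["errors"].append(f"line {i}: last message is not assistant")
            continue
        stats["samples"] += 1
    return stats
```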
Final dataset statistics:
Samples: 1,009
Average tokens: 1,303 tokens/sample
Minimum: 847 tokens
Maximum: 3,168 tokens
Assistant response: Average 112 tokens
Over 16K: 0
Why 1,009 Samples Is Enough
1,009 samples is not a large amount by general LLM fine-tuning standards. However, with OpenAI’s fine-tuning approach, since the base model (gpt-4o-mini) has already undergone massive pre-training, it can learn effectively even from relatively small datasets.
The goal of fine-tuning is not to train a model from scratch, but to specialize existing knowledge for a specific task. gpt-4o-mini already understands financial terminology and has numerical reasoning capabilities. The 1,009 stock prediction samples were sufficient to align those capabilities with “the stock prediction output format.”
That said, increasing the data volume could further improve accuracy. This remains an area for future improvement.
Design Principles Behind the Data
Choosing JSON Format
The reason for using JSON format for the input data is to convey structured information to the LLM accurately.
I could have written it in natural language: “Soken Chemical is a company in the chemicals industry with a market cap of approximately 29 billion yen.” But JSON makes the data hierarchy and types explicit. LLMs have been extensively trained on JSON-format data, so their comprehension of structured data is strong.
Japanese Keys
In the actual Japanese version, the JSON key names are also Japanese (rendered in this article as "company_info," "news," "stock_prices"). English keys would have worked too, but since the values (news text, etc.) are in Japanese, I decided that unifying the keys in Japanese would be easier for the model to process.
5 Days of Stock Price Data
I include 5 trading days of stock price data. I considered 5 days an appropriate window for capturing short-term trends. A single day does not reveal a trend, while 20 days would be too much information and consume too many tokens.
The Training Data Generation Pipeline
Training data is not hand-crafted — it is auto-generated by the meloik project’s batch processes. Here is the general flow:
1. Select target news
+-> Fetch news with is_checked=0 from the ai_news table in MySQL
2. Collect related data
+-- Company info: ai_companies table
+-- Stock prices: ai_stock_prices table (last 5 trading days)
+-- Financial data: ai_financials table (2 years)
+-- Macro indicators: ai_macro_indicators table
3. Calculate ground-truth labels
+-> Compute change ratios from the actual next trading day's prices
4. Convert to JSON format
+-> Combine all of the above into a single JSON object
5. Output as JSONL
+-> Save as a JSONL file with one sample per line
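The steps above can be sketched in Python as follows. The real pipeline runs as PHP batch jobs in the meloik project; `fetch_related` and `make_labels` below are stand-ins for the MySQL lookups and the label calculation:

```python
import json

def build_jsonl(news_items, fetch_related, make_labels):
    """Sketch of pipeline steps 2-5: for each news item, gather related
    data, compute ground-truth labels, and emit one chat-format JSONL
    line. fetch_related / make_labels are stand-ins for the real
    MySQL lookups and label calculation in the PHP batch jobs."""
    lines = []
    for news in news_items:
        sample = fetch_related(news)   # company info, prices, financials, macro
        labels = make_labels(news)     # next-trading-day change ratios
        record = {"messages": [
            {"role": "user", "content": json.dumps(sample, ensure_ascii=False)},
            {"role": "assistant", "content": json.dumps(labels, ensure_ascii=False)},
        ]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines  # caller writes "\n".join(lines) to a .jsonl file
```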
This pipeline means that as more news is collected, training data grows automatically. In the future, it will be possible to periodically add data and re-fine-tune the model.
Reflections and Future Improvements
Reflection: Data Diversity
The current 1,009 samples may have biases in industry and stock representation. If certain industries (e.g., chemicals, information technology) are overrepresented while others (e.g., finance, real estate) are underrepresented, the model’s predictions could be skewed.
Future Improvements
- Expanding data volume: From 1,009 to several thousand samples, covering a wider range of stocks and industries
- Temporal diversity: Including varied market conditions — bull markets, bear markets, high-volatility periods
- News type variety: Not just earnings, but M&A, new product announcements, regulatory changes, and other news types
Summary
Key points in training data design:
- Integrated 5 data types (company info, news, stock prices, financials, macro indicators) in JSON format
- Ground-truth labels are calculated from actual next-trading-day price movements
- Arrived at the final version (1,009 samples) through 8 iterations of trial and error
- Data cleaning (PHP error removal, HTML processing, validation) determines data quality
- The Colab version was forced to strip data due to token limits; the OpenAI version uses full data
- Data is sourced from publicly disclosed information such as TDnet (Timely Disclosure network)
“Data quality determines model quality” — this is the single most important lesson in fine-tuning.
Next time, I will share the challenges of multilingual support and translation LLM selection. I was drawn to DeepSeek’s low cost and adopted it, only to face unexpected problems.
Previous: Part 5 — “OpenAI API Fine-Tuning”
Next: Part 7 — “Choosing a Translation LLM: From DeepSeek to ChatGPT”