Part 4: Stock Prediction on Colab -- Trial and Error with Two Models
Introduction
In the previous installment, I explained the technical foundations and confirmed through the Mount Fuji Experiment that fine-tuning with LoRA + 8-bit quantization actually works. Now it was time to tackle real stock price data.
Using Google Colab’s T4 GPU (16 GB VRAM), I took on stock price prediction fine-tuning with two open-source models: ELYZA 8B and LLM-jp 7.2B.
To state the conclusion upfront: I was unable to achieve practical accuracy in this phase. However, the knowledge gained through the process — data preprocessing, model-specific format conversion, incremental training — contributed significantly to the eventual success with the OpenAI API.
Colab Environment
| Item | Value |
|---|---|
| GPU | NVIDIA T4 (16 GB VRAM) |
| CUDA | 12.x |
| Python | 3.10 / 3.11 |
| Data storage | Via Google Drive |
| Key libraries | transformers, peft, trl, bitsandbytes, datasets |
The T4 GPU is available even on the free tier, which makes it a welcome environment for independent development. However, GPU usage time is limited, so the trial-and-error cycle was far from fast. The pattern of “write code -> start training -> hit the GPU limit and get interrupted -> resume the next day” repeated many times.
Taking on Stock Prediction with ELYZA 8B
The First Wall — Token Count Limits
The data I prepared for stock prediction included all five types of information introduced in Part 1 (company information, news, stock price data, financial data, and macroeconomic indicators). Tokenizing this raw data resulted in several thousand tokens per sample.
However, Colab’s limited VRAM forced me to cap the maximum sequence length at 1,024 tokens. Any sample exceeding 1,024 tokens would be truncated.
Fitting the full five data types into 1,024 tokens was impossible. So I decided to aggressively strip down the data.
[Removed items]
- Macroeconomic indicators (CPI, GDP, unemployment rate, interest rates, exchange rates -> all removed)
- Financial data (revenue, profit margins, EPS, ROE -> all removed)
- Detailed stock price data (open, high, low, close, volume -> all removed; only day-over-day change ratios kept)
[Retained items]
- News (headline + content) <- most important
- Company information (code, name, industry, market segment, market cap, description)
- Only day-over-day change ratios for stock prices (last 5 days)
I judged that news was the most critical information and stripped everything else down as far as possible. In hindsight, however, this reduction in information very likely had a negative impact on accuracy. Without financial data or macroeconomic indicators, the model had to make judgments based on news content alone, losing significant context.
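Before deciding what to cut, it helps to audit how many samples actually exceed the cap. The sketch below is illustrative (the function names and the whitespace tokenizer used in testing are my own); in practice `tokenizer` would be the Hugging Face tokenizer for the model being trained.

```python
# Sketch: partition training samples by whether they fit the 1,024-token cap.
# `tokenizer` is any object with an .encode() method, e.g. one loaded via
# transformers.AutoTokenizer.from_pretrained(); names here are illustrative.

MAX_LEN = 1024

def count_tokens(text, tokenizer):
    """Number of tokens the tokenizer produces for this sample."""
    return len(tokenizer.encode(text))

def split_by_length(samples, tokenizer, max_len=MAX_LEN):
    """Partition samples into (fits, too_long) relative to the cap."""
    fits, too_long = [], []
    for s in samples:
        (fits if count_tokens(s, tokenizer) <= max_len else too_long).append(s)
    return fits, too_long
```

Running this before each preprocessing change makes the trade-off concrete: every removed field buys back tokens for the news text.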
ELYZA-Specific Format Conversion
ELYZA is based on Llama 3 and requires a specific prompt format using special tokens. Feeding ChatGPT-format data directly would not work.
```
<|start_header_id|>system<|end_header_id|>
You are a stock prediction assistant. Analyze the news and stock data,
and predict the next day's price change ratio.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
News: Date:20241113, Time:12:00, Type:Earnings,
Headline: Veritas, ordinary profit turns to deficit in Jan-Sep period...
Company info: Code:130A, Name:Veritas In Silico, Inc., Industry:Pharmaceuticals...
Stock data: Day change:2.52, Day change:-1.51, Day change:-2.11,
Day change:8.59, Day change:0.0
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
{"prediction": {"close_to_next_open_change_pct": -0.23,
"close_to_next_close_change_pct": 3.09,
"next_open_to_close_change_pct": 3.33}}
<|eot_id|>
```
The <|start_header_id|>, <|end_header_id|>, and <|eot_id|> tokens are Llama 3 special tokens. Without inserting these correctly, the model cannot tell which part is the system message, which is the user input, and which is the expected response.
I implemented a conversion function to transform ChatGPT-format data (messages arrays) into the ELYZA format. It was tedious work, but dealing with model-specific formats is one of the pain points of working with open-source LLMs.
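A minimal version of that conversion can be sketched as follows. The function name is illustrative, and the output mirrors the format shown above (a full Llama 3 template would also prepend a `<|begin_of_text|>` token):

```python
# Sketch: convert an OpenAI-style messages array into the ELYZA/Llama 3
# prompt format shown above. Function name and exact whitespace handling
# are illustrative, not the original notebook's code.

def to_elyza_prompt(messages):
    """Render a ChatGPT-format messages array as a Llama 3 chat string."""
    parts = []
    for m in messages:
        parts.append(f"<|start_header_id|>{m['role']}<|end_header_id|>\n"
                     f"{m['content']}\n<|eot_id|>\n")
    return "".join(parts)
```

Applying this to every sample in the JSONL files yields text the ELYZA tokenizer can segment into the roles it was pretrained to expect.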
LoRA Configuration and Training
I applied the same configuration introduced in the previous technical explanation.
```python
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./results",  # required by TrainingArguments
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=100,
    logging_steps=10,
    report_to=["tensorboard"]
)

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    args=training_args
)
trainer.train()
```
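For completeness, the `model` handed to `get_peft_model` above was loaded with 8-bit quantization, along the lines of the previous installment. This is a sketch (the model id and exact arguments are assumptions on my part), and it also defines the `bnb_config` object that reappears later in the incremental-training code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_quantized_base(model_id="elyza/Llama-3-ELYZA-JP-8B"):
    """Load the base model in 8-bit precision (downloads weights; needs a GPU)."""
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    return model, tokenizer, bnb_config
```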
Training data: Two JSONL files combined into 2,011 samples. Training ran for 3 epochs with a maximum sequence length of 1,024 tokens. The process took several hours on the T4 GPU.
LoRA Merge and GGUF Conversion
After training completed, I merged the LoRA adapter into the base model and converted it to GGUF format so it could also be used for local inference.
```python
from peft import PeftModel

# Merge LoRA adapter into the base model
model = PeftModel.from_pretrained(base_model, "fine_tuned_model")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("full_model")
```

```bash
# Convert to GGUF format (FP16)
python convert_hf_to_gguf.py full_model/ --outtype f16 --outfile model.gguf
# 8-bit quantization
./llama-quantize model.gguf model_q8.gguf q8_0
# 4-bit quantization
./llama-quantize model.gguf model_q4.gguf q4_0
```
The GGUF conversion pipeline established in Phase 1 proved its worth here.
Prediction Test Results
I ran predictions on 136 test samples.
The results fell short of expectations.
- The model frequently failed to output valid prediction JSON
- It sometimes responded in natural language instead of JSON format
- Even when it did return valid JSON, the numerical accuracy was poor
For example, in response to positive news like “51% upward revision to ordinary income,” the model predicted a large negative price movement for the next day.
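This failure mode forced defensive post-processing on the evaluation side. A sketch of the kind of extractor I mean (names are illustrative): pull the first JSON object out of whatever the model emitted, and reject anything missing the expected key.

```python
import json
import re

def extract_prediction(raw_output):
    """Return the parsed prediction dict, or None if no valid JSON is found."""
    # Models often wrap JSON in prose, so grab the first {...} span.
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if not match:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Reject outputs that dropped or renamed the expected top-level key.
    return parsed if "prediction" in parsed else None
```

Even with a net like this, a large share of the 136 test samples yielded `None` — the formatting problem was upstream, in the model itself.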
The LLM-jp 7.2B Challenge
I also tried a model other than ELYZA: llm-jp-3-7.2b-instruct3, developed under the leadership of Japan’s National Institute of Informatics (NII).
Why I Chose This Model
- Japanese LLM: A Japanese model with a different architecture from ELYZA, allowing for a comparative experiment
- 7.2B parameters: Slightly smaller than ELYZA’s 8B, making it marginally easier to work with on Colab
Incremental Fine-Tuning
In the LLM-jp experiments, I also implemented an incremental training mechanism. Rather than retraining from scratch every time new data was added, this approach carries forward the results of the previous fine-tuning run.
```python
# Toggle for incremental training
is_additional_fine_tuning = True

if is_additional_fine_tuning:
    # Load the previously fine-tuned full model as the base
    model = AutoModelForCausalLM.from_pretrained(
        "/content/drive/MyDrive/.../full_model",
        quantization_config=bnb_config,
        device_map={"": torch.cuda.current_device()}
    )
else:
    # First run: load the original base model
    model = AutoModelForCausalLM.from_pretrained(model_id, ...)

# Add LoRA adapters on top and train
model = get_peft_model(model, lora_config)
```
The mechanism is straightforward. For the initial training, LoRA is added to the base model, training runs, and the result is saved as a full model. For subsequent runs, the previously saved full model becomes the new base, and LoRA is added on top of it again for further training.
Model Version Management
With repeated incremental training, version management becomes necessary. I set up a system where old models are moved to timestamped directories, with the latest version referenced via the directory name.
```bash
# After training completes
# 1. Move old model to timestamped backup
mv full_model full_model_202502051200
# 2. Place new model as the latest version
mv full_model_updated full_model
```
This makes it possible to roll back to a previous version if something goes wrong.
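The same rotation can be scripted with a generated timestamp rather than a hard-coded one. This is a sketch with illustrative names, not the original notebook's code:

```python
import shutil
from datetime import datetime
from pathlib import Path

def rotate_model_dirs(root, current="full_model", updated="full_model_updated"):
    """Back up the current model with a timestamp, then promote the new one."""
    root = Path(root)
    stamp = datetime.now().strftime("%Y%m%d%H%M")
    backup = root / f"{current}_{stamp}"
    cur, upd = root / current, root / updated
    if cur.exists():
        shutil.move(str(cur), str(backup))  # e.g. full_model_202502051200
    shutil.move(str(upd), str(cur))         # new model becomes "latest"
    return backup
```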
Saving Prediction Results to Files
I also built a system to save prediction results as JSON files to Google Drive during testing.
```python
import json

output_dir = "/content/drive/MyDrive/.../predictions"

for i, sample in enumerate(converted_predict_dataset):
    # Run prediction
    result = predict(model, tokenizer, sample)
    # Save result as JSON file
    with open(f"{output_dir}/prediction_{i}.json", "w") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
```
This makes it much easier to analyze prediction results after the fact and compare performance across models.
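With one file per sample, cross-model comparison reduces to reloading the files and scoring them against the actual change ratios. A sketch of such a scorer (the key name follows the JSON shown earlier; the pairing of files with actuals is an assumption):

```python
import json
from pathlib import Path

def mean_abs_error(pred_dir, actuals, key="close_to_next_close_change_pct"):
    """Mean absolute error over prediction_<i>.json files vs. true ratios.

    actuals: list of true change ratios, aligned with the file index i.
    """
    errors = []
    for i, actual in enumerate(actuals):
        path = Path(pred_dir) / f"prediction_{i}.json"
        pred = json.loads(path.read_text())["prediction"][key]
        errors.append(abs(pred - actual))
    return sum(errors) / len(errors)
```

Running the same scorer over the ELYZA and LLM-jp output directories gives a like-for-like accuracy comparison.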
LLM-jp Results
Like ELYZA, prediction accuracy was insufficient. The incremental training mechanism itself worked correctly, but the fundamental accuracy problem remained unsolved.
Phase 2 Retrospective — Why Accuracy Fell Short
I experimented with two models but neither achieved practical accuracy. Here is my analysis of the causes.
| Item | ELYZA (8B) | LLM-jp (7.2B) |
|---|---|---|
| LoRA fine-tuning | Worked | Worked |
| Incremental training | Not implemented | Implemented |
| Prediction accuracy | Insufficient | Insufficient |
| Output format stability | Low | Low |
| GGUF conversion | Successful | Successful |
| Training time | Long (several hours) | Long (several hours) |
Cause 1: Insufficient Base Model Capability
I believe that models in the 7-8B class simply struggled to handle both complex financial data analysis and numerical prediction simultaneously.
Stock price prediction is not a simple text classification task (positive/negative). It requires predicting specific change ratios as numbers and outputting them accurately in JSON format. This was a very demanding ask for 7-8B models.
Cause 2: Unstable Output Format
Even when instructed to “respond in JSON,” the models frequently responded in natural language or changed the JSON key names.
Expected output:

```
{"prediction": {"close_to_next_open_change_pct": -0.23, ...}}
```

Actual output example:

```
"This stock is predicted to rise tomorrow. The change rate is approximately..."
```
This is an instruction-following capability issue. Larger models (GPT-4 class) can reliably follow an instruction like “output in JSON format,” but 7-8B class models could not do so consistently.
Cause 3: Too Much Information Removed
Due to the token count limitation, I removed financial data and macroeconomic indicators, and compressed stock price data down to just day-over-day change ratios. This loss of information very likely hurt prediction accuracy.
The significance of news like “51% upward revision to ordinary income” changes dramatically depending on whether you know the company’s historical financial data. For a company with a 10 billion yen market cap versus a 1 trillion yen company, the impact of the same news is entirely different.
Cause 4: Training Time and Resource Constraints
The GPU time limits on Colab’s free tier slowed down the trial-and-error cycle, which also had an impact. I was unable to do enough iterative improvement — adjusting hyperparameters, modifying data preprocessing, and retraining.
Colab Cost vs. OpenAI API
At this point, I also considered the cost angle.
Paying for Colab Pro would give me more GPU time. But which is more efficient: spending several more months iterating on Colab Pro (approximately 1,179 yen, about $8, per month), or using OpenAI API fine-tuning (a few to tens of dollars per run)?
| Aspect | Colab (Local/Cloud) | OpenAI API |
|---|---|---|
| Initial cost | GPU purchase or Colab Pro | None |
| Training cost | Electricity, GPU time | Token-based billing (~tens of dollars) |
| Training time | Hours to tens of hours | ~8 minutes |
| Accuracy | Low to medium | High |
| Output format stability | Low | High (reliable JSON) |
| Operational ease | Infrastructure management required | Just call the API |
| Model ownership | In your hands | On OpenAI’s servers |
The downside of “not having the model locally” is real, but for independent development, the priorities should be accuracy and cost-performance. The accuracy that months of Colab experimentation couldn’t achieve might be attainable with OpenAI’s API for a few tens of dollars and 8 minutes.
I decided to pivot.
What Phase 2 Left Behind
While I ultimately abandoned stock prediction with open-source models, the trial and error in Phase 2 was far from wasted.
Technical Knowledge
- The combination of LoRA + 8-bit quantization + SFTTrainer makes 8B model fine-tuning feasible on a T4 GPU
- “Fine-tuning works” and “practical accuracy is achieved” are separate problems
- Data quality and quantity determine accuracy. Strip too much information and the model loses the context it needs for prediction
- Incremental training is useful (no need to retrain from scratch when data grows)
- The GGUF conversion pipeline is established and ready for future use with distilled models
Code Assets
The fine-tuning pipeline, data preprocessing, and GGUF conversion code are all preserved on Colab. They can be reused directly when attempting a future challenge with larger models (70B class) or more training data.
The Key Lesson
“Model size < Data quality”
This was the strongest takeaway from Phase 2. No amount of tuning on a 7-8B open-source model will produce good predictions if the input data is poor. Conversely, if the base model has strong enough capabilities (GPT-4o class), you can feed it the full data and expect high accuracy.
Summary
Here is a summary of the Phase 2 (Google Colab) trial and error:
- Fine-tuned two models (ELYZA 8B and LLM-jp 7.2B) for stock price prediction
- Training ran successfully on T4 GPU with LoRA + 8-bit quantization
- However, prediction accuracy was insufficient, with output format stability issues
- Information reduction due to token limits was a contributing factor to poor accuracy
- Incremental training and the GGUF conversion pipeline were established as lasting outcomes
- After comparing cost-performance, I decided to pivot to the OpenAI API
This was a phase where “things didn’t work out,” but without this experience, I would not have understood the importance of OpenAI API fine-tuning or the critical role of data design.
Next time, I will cover the destination of that pivot: OpenAI API fine-tuning. I will share the surprising experience of training completing in just 8 minutes and getting stable JSON output.
Previous: Part 3 — “LoRA and Quantization Explained”
Next: Part 5 — “OpenAI API Fine-Tuning: The Final Solution That Took Just 8 Minutes”