Part 2: The Local GPU Challenge and Defeat -- Taking on an 8B Model with 6 GB of VRAM on the RTX 3060

Introduction

In the previous installment, I shared the big picture of the project and how the experiment where “ChatGPT beat fund managers” inspired me to build a stock price prediction AI.

This time, I will talk about the first phase — where I actually got my hands dirty. To cut to the conclusion: I was thoroughly defeated. However, this failure was extremely educational, and it gave me firsthand understanding of the hardware constraints involved in LLM fine-tuning.


The Dream of “My Own LLM”

When I decided to do fine-tuning, the first thing that came to mind was: “I want to train a model on my own machine.”

Rather than depending on cloud APIs, I wanted to use my own GPU, my own data, and build my own model. That was the appeal of open-source LLMs, as I saw it. Fortunately, I had an NVIDIA GeForce RTX 3060 in my laptop.

“If it has a GPU, surely I can do fine-tuning.”

Looking back, that was naive — but at the time, I believed it, and I started setting up the environment.


Environment Setup

First, I set up the local development environment.

Item       Value
GPU        NVIDIA GeForce RTX 3060 Laptop GPU (6 GB VRAM)
CUDA       12.4
PyTorch    2.5.1+cu124
Python     3.11.11 (Anaconda)
OS         Windows
IDE        Jupyter Notebook

Installing CUDA and setting up PyTorch took some effort, but I eventually confirmed that the GPU was properly recognized.

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
PyTorch version: 2.5.1+cu124
CUDA available: True
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
VRAM: 6.0 GB

The GPU was recognized. So far, so good.


Model Selection — Why Llama-3-ELYZA-JP-8B

Next, I needed to choose a model to fine-tune.

The news I would be using for stock prediction is in Japanese, so I needed a model with strong Japanese language comprehension. At the end of 2024, one of the most prominent open-source LLMs specialized for Japanese was Llama-3-ELYZA-JP-8B.

  • Model name: elyza/Llama-3-ELYZA-JP-8B
  • Parameters: 8B (8 billion parameters)
  • Base model: Meta Llama 3
  • Key feature: Llama 3 base model with additional training on Japanese data
  • Model size: Split into 4 safetensors files, approximately 16 GB total

For understanding Japanese financial news, having strong baseline Japanese language ability is critical. ELYZA had a good reputation on that front, and the fact that it was easily downloadable from the HuggingFace model hub was another deciding factor.

However, fitting a 16 GB model into 6 GB of VRAM requires some tricks. That is where quantization comes in.
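The 16 GB figure follows directly from the parameter count. Model weights are distributed at 16-bit precision (FP16/BF16), i.e. 2 bytes per parameter — a quick back-of-the-envelope check:

```python
# Why the checkpoint is ~16 GB: 8 billion parameters at 2 bytes each (FP16/BF16).
PARAMS = 8_000_000_000  # roughly; the exact count is slightly above 8B

fp16_gb = PARAMS * 2 / 1000**3  # decimal GB, as model sizes are usually quoted
print(f"FP16 weights: ~{fp16_gb:.0f} GB")  # prints: FP16 weights: ~16 GB
```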


8-bit Quantization — Fitting a 16 GB Model into 6 GB

Quantization is a technique that reduces memory usage by lowering the numerical precision of a model’s weight parameters. Model weights are typically stored as 32-bit (FP32) or 16-bit (FP16/BF16) floating-point numbers; converting them to INT8 (8-bit integers) cuts memory usage to roughly a quarter or a half, respectively.

Using a library called BitsAndBytes, you can automatically apply 8-bit quantization when loading a model.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "elyza/Llama-3-ELYZA-JP-8B"

# 8-bit quantization config
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model with 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)

The 16 GB model was compressed to approximately 4-5 GB through 8-bit quantization — just enough to fit in 6 GB of VRAM.


Inference Test — It Works!

With the model loaded, I first tried some basic text generation.

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256
)

result = generator("日本の株式市場について教えてください。")  # "Tell me about the Japanese stock market."
print(result[0]["generated_text"])

I couldn’t help but exclaim when I saw the output. It returned a remarkably fluent and reasonably accurate response about the Japanese stock market — hard to believe it was coming from an 8-bit quantized model.

“This is promising. If I fine-tune this model, I should be able to get even better predictions.”

Riding that wave of excitement, I dove into fine-tuning.


Fine-Tuning — And Then It All Fell Apart

I wrote the fine-tuning code and attempted to start training. The plan was to use the Transformers library’s Trainer and start with a small dataset.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=100,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,  # tokenized dataset prepared beforehand (preparation code omitted)
)

trainer.train()  # <- Error here

OutOfMemoryError: CUDA out of memory

Not enough VRAM. The moment training started, it crashed with a memory error.


The Memory Gap Between Inference and Training

Why did inference work, but training ran out of memory?

This is a crucial point that applies not just to LLMs, but to deep learning in general. Inference and training use GPU memory in fundamentally different ways.

[Inference (forward pass only)]
What's needed:
  +-- Model weights (approx. 4-5 GB with 8-bit quantization)
  +-- Input data + intermediate outputs (small)
Total: ~5 GB -> Fits in 6 GB VRAM

[Training (forward pass + backward pass)]
What's needed:
  +-- Model weights (~4-5 GB)
  +-- Gradients (update direction for each parameter -- same size as weights)
  +-- Optimizer state (with Adam, roughly 2x the size of weights)
  +-- Activations (intermediate results retained for backpropagation)
Total: ~16-20 GB -> Far exceeds 6 GB VRAM

In other words, training requires 3 to 4 times the memory of inference. When your VRAM is just barely enough for inference with 8-bit quantization, there is no room left for the gradients and optimizer state that training requires.
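Plugging the breakdown above into a quick calculation makes the gap concrete. The numbers here are the same ballpark figures as in the text, not measurements:

```python
# Ballpark VRAM estimate for full fine-tuning, using the rough figures above.
weights_gb = 4.5               # 8-bit quantized weights (approx.)
gradients_gb = weights_gb      # one gradient value per parameter
optimizer_gb = 2 * weights_gb  # Adam keeps two state tensors per parameter
activations_gb = 2.0           # forward activations retained for backprop (very rough)

training_total = weights_gb + gradients_gb + optimizer_gb + activations_gb
inference_total = weights_gb + 0.5  # weights plus small working buffers

print(f"Inference: ~{inference_total:.0f} GB, Training: ~{training_total:.0f} GB")
print(f"Training needs ~{training_total / inference_total:.1f}x the memory of inference")
```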

For full fine-tuning of an 8B model, at least 24 GB of VRAM is generally recommended. My RTX 3060’s 6 GB was simply no match.


LoRA as an Option — Still Not Enough

“If I can’t train all the parameters, why not train just a subset?”

This is the idea behind LoRA (Low-Rank Adaptation). It freezes all of the model’s original parameters and trains only a small set of additional parameters, dramatically reducing memory usage. (I will explain LoRA’s mechanism in detail in Part 3.)

Even with LoRA applied, the RTX 3060’s 6 GB VRAM proved insufficient. While LoRA greatly reduces the number of trainable parameters, the activations during the forward pass must still be saved for the entire model — and with an 8B model, that alone consumes several gigabytes.

Memory breakdown with LoRA (estimated):
  +-- Frozen weights (approx. 4-5 GB with 8-bit quantization)
  +-- LoRA adapter gradients (tens of MB -- this part is small)
  +-- Activation retention (several GB even at batch size 1)
  +-- Other overhead
Total: 8-10 GB -> Still not enough

With 6 GB of VRAM, even the combination of 8-bit quantization and LoRA would not fit.
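Why are the LoRA gradients only tens of megabytes? A quick parameter count shows it. The dimensions below are assumptions based on Llama-3 8B’s architecture (hidden size 4096, 32 layers, grouped-query attention giving v_proj an output of 1024), with rank r=8 applied to the query and value projections — a common default, not this article’s actual configuration:

```python
# Rough LoRA trainable-parameter count for an 8B Llama-3-class model.
# Dimensions and targets are illustrative assumptions (see lead-in above).
hidden = 4096   # model hidden size
kv_dim = 1024   # v_proj output dim under grouped-query attention
layers = 32
r = 8           # LoRA rank

# Adapting a weight W (d_out x d_in) adds A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) new parameters per matrix.
q_proj = r * (hidden + hidden)  # q_proj: 4096 -> 4096
v_proj = r * (hidden + kv_dim)  # v_proj: 4096 -> 1024

trainable = layers * (q_proj + v_proj)
print(f"Trainable LoRA parameters: {trainable:,}")           # ~3.4 million
print(f"FP16 gradient memory: ~{trainable * 2 / 1024**2:.0f} MB")
```

Even adding Adam’s optimizer state on top, the trainable side stays in the tens of megabytes — the real memory cost sits elsewhere, in the frozen model’s weights and activations.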


Inference Test with llama.cpp

Although I gave up on fine-tuning, I made the most of my setup and tried another approach: CPU inference with llama.cpp.

llama.cpp is a C/C++ implementation for running LLM inference efficiently on CPUs. It converts models to its own quantized file format, GGUF, and runs inference without a GPU.

# Convert to GGUF in FP16
python convert_hf_to_gguf.py full_model/ --outtype f16 --outfile model.gguf

# Further compress with 4-bit quantization (q4_k_m)
./llama-quantize model.gguf model_q4.gguf q4_k_m

The 4-bit quantized GGUF model was only a few gigabytes and fit comfortably in CPU RAM. When I tested it in a chatbot-style interaction, the 4-bit quantized model generated surprisingly natural Japanese.
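For reference, a chat-style run looks something like this, using llama-cli, the command-line binary that ships with llama.cpp (the prompt here is illustrative):

```shell
# CPU-only inference with the 4-bit quantized model
# -m: model file, -p: prompt, -n: max tokens to generate
./llama-cli -m model_q4.gguf -n 256 \
    -p "日本の株式市場について教えてください。"  # "Tell me about the Japanese stock market."
```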

However, llama.cpp is an inference-only tool. It cannot perform fine-tuning (training). You can convert a fine-tuned model to GGUF format and run inference locally, but the training itself must be done elsewhere.

This experience would prove valuable later, when I built a pipeline to take models fine-tuned on Colab and run them locally for inference.


What I Learned in This Phase

The Phase 1 challenge ended in “failure,” but I came away with several important lessons.

1. VRAM Rules Everything

When fine-tuning LLMs on local GPUs, VRAM is the ultimate bottleneck. Model selection, quantization methods, batch size, sequence length — every design decision is constrained by VRAM.

2. Inference and Training Are Different Beasts

Just because inference works does not mean training will too. Training requires 3 to 4 times the memory of inference. I thought I understood this conceptually, but I did not truly grasp it until I saw the OutOfMemoryError with my own eyes.

3. Quantization Helps Inference; Training Needs LoRA

8-bit quantization is incredibly effective at reducing memory for inference, but it does not fundamentally solve the memory problem during training. For that, techniques like LoRA are necessary — and even then, sufficient VRAM is still required.

4. Establishing the GGUF Conversion Pipeline

Successfully setting up the GGUF conversion and inference pipeline with llama.cpp turned out to be useful in later phases. Understanding the full flow of HuggingFace format -> GGUF conversion -> quantization -> CPU inference was a genuine gain.


On to the Next Step — Leveraging Cloud GPUs

The local 6 GB of VRAM has its limits. I accepted that reality and considered my next move.

There were several options:

Option                       Pros                                        Cons
Buy a larger GPU             Have the environment locally                An RTX 4090 (24 GB) costs around 300,000 yen (~$2,000)
Cloud GPU (AWS, GCP, etc.)   Access to high-performance GPUs             Pay-per-use costs add up
Google Colab                 T4 GPU (16 GB) available in the free tier   GPU usage time is limited
API-based (OpenAI, etc.)     No infrastructure management needed         The model is not in your hands

Given the constraints of independent development, Google Colab was the obvious first thing to try. A free T4 GPU with 16 GB of VRAM — roughly 2.7 times what I had. By combining LoRA with 8-bit quantization, fine-tuning an 8B model should be feasible.

And so, I moved on to Phase 2: “Fine-Tuning on Google Colab.”


Summary

Here are the key takeaways from this installment:

  • I challenged Llama-3-ELYZA-JP-8B on an RTX 3060 (6 GB VRAM)
  • Inference worked with 8-bit quantization (fluent Japanese output)
  • Fine-tuning was impossible due to insufficient VRAM
  • Training requires 3 to 4 times the memory of inference
  • I established a GGUF conversion and CPU inference pipeline with llama.cpp
  • I accepted the limitations of local hardware and decided to move to Google Colab

It was a failure, but the visceral understanding of “the VRAM wall” and the GGUF conversion pipeline knowledge I gained here were put to good use in later phases. Even though it looked like a detour, there is a kind of understanding that only comes from getting your hands dirty and hitting a wall. That was the lesson I took away.


Previous: Part 1 — “The Beginning and the Big Picture”

Next: Part 3 — “LoRA and Quantization Explained: The Techniques That Make LLMs Lighter”
