Finetune Llama 3.1
Published August 17, 2024 by Connor

Let’s finetune Llama 3.1 8B on an H100.
Rent an H100 on vast.ai and clone llama-recipes to the server. This is just a tutorial for finetuning Llama on an H100, not a guide to custom datasets or anything fancy. I still recommend cloning my fork rather than the main llama-recipes repo, to avoid a circular-import error you could otherwise hit.
1. Rent your GPU
Visit vast.ai to rent an H100 GPU.
- Region: Choose a GPU close to your region to minimize SSH lag.
- Memory: Ensure you have at least 80GB of VRAM. The H100 typically provides 80GB, which is sufficient.
You can use my docker config here or another deep learning container. Vast.ai also has recommended containers to choose from.

2. Clone llama-recipes repo
git clone https://github.com/conacts/llama-recipes
3. Add llama-recipes to PYTHONPATH
export PYTHONPATH="/root/llama-recipes/src"
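If you want to sanity-check the export: any directory on PYTHONPATH is prepended to sys.path in new Python processes, which is what makes `import llama_recipes` resolve. A quick self-contained check (the path here is the clone location from the steps above; adjust it if you cloned elsewhere):

```python
import os
import subprocess
import sys

# The clone location from step 2/3; adjust if you cloned somewhere else.
pkg_path = "/root/llama-recipes/src"

# Launch a fresh interpreter with PYTHONPATH set and inspect its sys.path.
child_path = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=dict(os.environ, PYTHONPATH=pkg_path),
    capture_output=True,
    text=True,
).stdout
print(pkg_path in child_path)  # True: the directory is on the child's sys.path
```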
4. Log into HuggingFace
Get your Hugging Face auth token here.
huggingface-cli login
5. Log into W&B
Weights and Biases (W&B) helps you track and visualize your training runs. This is optional but recommended.
wandb login
6. Finetune llama3.1
Here is a simple script to finetune Llama 3.1 on the samsum dataset.
Note: you must create your output directory before saving to it.
torchrun --nnodes 1 --nproc_per_node 1 src/llama_recipes/finetuning.py \
    --enable_fsdp \
    --model_name meta-llama/Meta-Llama-3.1-8B \
    --output_dir ./ \
    --use_fast_kernels \
    --low_cpu_fsdp \
    --dataset samsum \
    --use_wandb # (optional)
- `--nnodes 1`: Number of machines (torchrun)
- `--nproc_per_node 1`: Number of GPUs per machine (torchrun)
- `--enable_fsdp`: Distributes the model more effectively across the GPUs
- `--model_name meta-llama/Meta-Llama-3.1-8B`: The name of the model on Hugging Face
- `--use_fast_kernels`: Uses Flash Attention and xFormers memory-efficient kernels
- `--low_cpu_fsdp`: Loads the model onto the GPUs in a RAM-friendly way
- `--dataset`: The dataset config to train on
- `--use_wandb`: Logs run data to Weights & Biases
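As context for the `--dataset samsum` flag: samsum rows pair a chat dialogue with a human-written summary, and the trainer renders each row into a summarization prompt. A rough sketch of that idea (the template string below is my own illustration, not the exact format llama-recipes uses):

```python
# Sketch: turning a samsum-style row into a summarization training prompt.
# NOTE: this template is illustrative; llama-recipes defines its own format.
def format_sample(dialogue: str, summary: str) -> str:
    return f"Summarize this dialog:\n{dialogue}\n---\nSummary:\n{summary}"

sample = {
    "dialogue": "Amanda: I baked cookies. Do you want some?\nJerry: Sure!",
    "summary": "Amanda baked cookies and will bring some to Jerry.",
}
print(format_sample(sample["dialogue"], sample["summary"]))
```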
When training, you can monitor GPU usage with nvtop; it should sit near full utilization once the run starts.

Great! Your model should be finetuning now. The process can take a while. Here are some possible errors you may run into while running the finetuning script.
Possible Errors:
CUDA out of memory
If you run out of memory, you generally need a GPU with more VRAM. When I finetuned the full model, it used all 80GB of VRAM on the H100.
Error
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.83 GiB. GPU 1 has a total capacity of 23.64 GiB of which 3.30 GiB is free. Process 3045155 has 20.33 GiB memory in use. Of the allocated memory 17.74 GiB is allocated by PyTorch, and 2.02 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
Solution
Rent a GPU with more VRAM.
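For intuition on why the full 8B finetune fills 80GB, here's a back-of-envelope estimate assuming bf16 weights and gradients plus fp32 Adam moments (real usage also depends on activations, sharding, and mixed-precision settings, so treat this as a rough lower bound):

```python
# Rough memory estimate for full fine-tuning an 8B-parameter model with Adam.
params = 8e9
bytes_per_param = {
    "weights_bf16": 2,
    "grads_bf16": 2,
    "adam_m_fp32": 4,  # first-moment estimate
    "adam_v_fp32": 4,  # second-moment estimate
}
total_gb = params * sum(bytes_per_param.values()) / 1e9
print(f"{total_gb:.0f} GB")  # ~96 GB before activations, so 80 GB is tight
```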
Circular dependencies in datasets
If you pulled the upstream llama-recipes GitHub repo, you may stumble on this error. I think it's caused by the llama_recipes/datasets module having an import conflict with Hugging Face's datasets library.
Error
ImportError: cannot import name 'get_dataset' from partially initialized module 'llama_recipes.datasets.grammar_dataset.grammar_dataset' (most likely due to a circular import) (/root/llama-recipes/src/llama_recipes/datasets/grammar_dataset/grammar_dataset.py)
Solution
I got around this error by renaming llama_recipes/datasets to llama_recipes/sets and fixing all the imports. You can use my workaround if you pull conacts/llama-recipes.
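Here is a minimal, self-contained repro of the shadowing mechanism involved (my own illustration of the general Python behavior, not the exact failure inside llama-recipes): a local package directory named `datasets` on PYTHONPATH takes precedence over any installed `datasets` library, which is why renaming the local package resolves the clash.

```python
import os
import subprocess
import sys
import tempfile

# Minimal repro: a local package named "datasets" shadows any installed
# `datasets` library for processes that have its parent dir on PYTHONPATH.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "datasets"))
    open(os.path.join(tmp, "datasets", "__init__.py"), "w").close()
    resolved = subprocess.run(
        [sys.executable, "-c", "import datasets; print(datasets.__file__)"],
        env=dict(os.environ, PYTHONPATH=tmp),
        capture_output=True,
        text=True,
    ).stdout.strip()
    shadowed = os.path.realpath(resolved).startswith(os.path.realpath(tmp))
print(shadowed)  # True: the local directory wins the import
```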