PEFT Finetune llama3.1

Published August 16, 2024 by Connor
Programming

Let’s finetune llama3.1 using llama_recipes on the samsum dataset.

Preface

The main purpose of this article is to act as notes for me, and hopefully to help some people finetune llama. Sorry, I probably won’t include solutions for pip, Python, and CUDA errors.

1. Rent your GPU

  1. Visit vast.ai or wherever else you rent GPUs
  2. Choose a GPU close to your region, ssh lag sucks
  3. Get at least 24 GB of VRAM; so far I’ve only got it to work w/ 2 GPUs
  4. You can use my docker config here or another deep learning container. vast.ai has some good containers to choose from in recommended.

2. Setting up

2.1 Clone llama_recipes

This is a library from facebook that makes working with LLMs easy. We will be cloning my fork of llama_recipes because there is a circular dependency issue I kept running into (click the link to visit the real llama_recipes).

git clone https://github.com/conacts/llama-recipes

Install python requirements: pip3 install -r requirements.txt

2.2 Add llama_recipes to your PYTHONPATH

Run this command (or add this line to your .bashrc file) so Python can use llama_recipes as a library by adding it to your PYTHONPATH.

export PYTHONPATH="/root/{...}/llama-recipes/src"

Test it like this, and make sure the output contains the path you put in your .bashrc:

echo $PYTHONPATH
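
You can also check that Python actually picks it up. A quick sanity check (assumes the requirements from step 2.1 are installed):

# should print a path inside .../llama-recipes/src
import llama_recipes
print(llama_recipes.__file__)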

2.3 Apply for llama 3.1 license

Visit Meta’s Llama repo on HF and apply to use the model. It should take ~10 mins for them to approve your request.

2.4 Log into HuggingFace

Visit HuggingFace and get your token to log in. This is so you can pull the llama model. Make sure you run git config --global credential.helper store so your HF token gets stored.

Installation: pip3 install huggingface_hub

Store git creds: git config --global credential.helper store

huggingface-cli login
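
If you want to double-check that the token works and that your llama 3.1 request was approved, here’s a small sanity-check sketch using huggingface_hub:

from huggingface_hub import whoami, model_info

print(whoami()["name"])  # prints your username if the token is valid
# raises an error if your access request hasn't been approved yet
model_info("meta-llama/Meta-Llama-3.1-8B")
print("llama 3.1 access ok")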

2.5 Log into Weights and Biases (recommended)

Weights and Biases is a cool place to track your training runs; it’s enabled by the --use_wandb flag during training.

Installation: pip3 install wandb

If you don’t have a Weights and Biases account, you can create one here.

wandb login

If you decide to use W&B, you get to see the training metrics of your finetune visualized.
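
Roughly speaking, --use_wandb makes the training script start a W&B run and log metrics as it trains. A minimal sketch of the idea (not the exact llama_recipes code; the project name is made up):

import wandb

run = wandb.init(project="llama31-peft-samsum")  # example project name
run.log({"train/loss": 1.23})                    # logged throughout training
run.finish()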

3. Train the model

Not much else to do; run this command to start your training:

Note: Read the param info below if you get errors.

torchrun --nnodes 1 --nproc_per_node 2 \
    src/llama_recipes/finetuning.py \
    --enable_fsdp \
    --model_name meta-llama/Meta-Llama-3.1-8B \
    --use_peft \
    --peft_method lora \
    --output_dir  peft_checkpoints \
    --context_length 1024 \
    --use_fast_kernels \
    --use_fp16 \
    --low_cpu_fsdp \
    --num_epochs 1 \
    --use_wandb # (optional)

Here is what each param does:

  • --nnodes 1 : Num of computers (torchrun)
  • --nproc_per_node 2 : Num of GPUs in the computer (torchrun); set this to however many GPUs you rented
  • --enable_fsdp : Distributes (shards) the model across the GPUs with FSDP
  • --model_name meta-llama/Meta-Llama-3.1-8B : This is the name of the model from huggingface
  • --use_peft : Enable Parameter Efficient Fine Tuning (vram)
  • --peft_method : Uses LoRA for PEFT, a method that freezes most of the parameters in the model and trains a small set of added adapter weights (see the sketch below this list). (vram)
  • --output_dir : The directory to save your PEFT weights
  • --context_length : The context length (amount of input your LLM can handle at once) used to train the model
  • --use_fast_kernels : Makes use of Flash Attention and xFormers kernels
  • --use_fp16 : Loads model in fp16
  • --low_cpu_fsdp : Loads the model onto the GPUs in a RAM-friendly way
  • --num_epochs : number of epochs to train on (increase for a more finetuned model)
  • --use_wandb : Display data in weights and biases
Under the hood, --use_fp16 roughly corresponds to loading the model in half precision, something like:

model = LlamaForCausalLM.from_pretrained(
    torch_dtype=torch.float16,
    ...
)
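
For a sense of what --use_peft --peft_method lora actually does, here’s a rough sketch using Hugging Face’s peft library (the rank, alpha, and target modules are illustrative values I picked, not necessarily the llama_recipes defaults):

import torch
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

# load the frozen base model (downloads ~16 GB of weights)
model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    torch_dtype=torch.float16,
)

# attach small trainable low-rank adapters; everything else stays frozen
lora_config = LoraConfig(
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of params are trainable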

Possible Errors:

CUDA out of memory

If you’re out of memory, I think this means you need a GPU with more VRAM. In the docs, facebook mentioned you need at least 24GB of VRAM, but with the model sharded across two GPUs I’m at about 20GB of VRAM on each. I’ve run into issues with this though: trying single 4090s and 3090s with 24GB of VRAM (~23.6 GiB usable), it fails.

Note: I’ve only got the llama_recipes script to run on 2x 4090s. However, I did get it to finetune on a single 4090 using this script.

Error
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.83 GiB. GPU 1 has a total capacity of 23.64 GiB of which 3.30 GiB is free. Process 3045155 has 20.33 GiB memory in use. Of the allocated memory 17.74 GiB is allocated by PyTorch, and 2.02 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting 
Solution

Get more vram w/ another GPU.
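
To see how much VRAM is actually free on each GPU before the run dies, a quick PyTorch check:

import torch

# print free/total memory for every visible GPU
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")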

Circular dependencies in datasets

If you pulled the real llama_recipes GitHub repo, you may stumble on this error. I think it’s caused by the llama_recipes/datasets module having an import conflict with huggingface’s datasets library.

Error
ImportError: cannot import name 'get_dataset' from partially initialized module 'llama_recipes.datasets.grammar_dataset.grammar_dataset' (most likely due to a circular import) (/root/llama-recipes/src/llama_recipes/datasets/grammar_dataset/grammar_dataset.py)
Solution

I got around this error by renaming llama_recipes/datasets to llama_recipes/sets and fixing all the imports. You can use my solution if you pull conacts/llama-recipes.

Out of storage error

Error
UserWarning: Not enough free disk space to download the file. The expected file size is: 4915.92 MB. The target location /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B/blobs only has 3479.61 MB free disk space.

or

OSError: [Errno 28] No space left on device
Solution

Get a machine with more device storage. On vast.ai, you can find that here.
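
Before kicking off the download, you can also sanity-check your free disk space (the 8B model’s fp16 shards add up to roughly 15-16 GB, so leave plenty of headroom):

import os, shutil

# Hugging Face caches model shards under ~/.cache/huggingface by default
free_gb = shutil.disk_usage(os.path.expanduser("~")).free / 1e9
print(f"{free_gb:.1f} GB free on the home partition")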