Fine-tuning Llama 2
Meta's Llama 2 is an openly released large language model trained on publicly available internet data.
Parameter sizes: 7 billion to 70 billion.
Architecture: a Transformer architecture with modifications such as an increased context length, grouped-query attention (in the larger variants), and rotary positional embeddings. (Refer to this paper for training details.)
While a full fine-tune of Llama 2 is possible, this guide uses PEFT (Parameter-Efficient Fine-Tuning) with LoRA to fine-tune Llama 2 more efficiently. These approaches train only a small set of adapter weights, aiming for strong performance at a fraction of the memory and compute cost of a full fine-tune.
Before we get started, you will need a GPU node to fine-tune Llama 2. You can use brev.dev to get GPU nodes. The code in this guide was developed on a single NVIDIA A100 80GB GPU on the NYU HPC cluster, but any GPU type (RTX8000, V100, A10, ...) with at least 28 GB of memory should work.
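If you are not sure which GPU your node exposes, a quick sanity check (a minimal sketch using PyTorch) is:

import torch

# Confirm a CUDA device is visible and report its total memory.
assert torch.cuda.is_available(), "No CUDA device found -- fine-tuning requires a GPU."
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, memory: {props.total_memory / 1024**3:.1f} GB")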
Dataset
The code in this guide uses the Propaganda Techniques Corpus (PTC) dataset to fine-tune Llama 2 to detect propaganda techniques in news articles. PTC contains phrase-level annotations of propaganda techniques (e.g., name-calling, loaded language, flag-waving) in news articles.
Download PTC from here.
We need a pandas DataFrame with one row per news article, containing the article text and the phrase–technique instances annotated in it. For the scope of this guide, we focus on one technique at a time (the code can be extended to multiple techniques, i.e., a multi-label, multi-class classification setting); a rough construction sketch follows.
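The exact loading code depends on how you parse the PTC files, so the following is only a hypothetical sketch: the articles and annotations dictionaries are toy placeholders standing in for the parsed corpus, and the technique label string is illustrative (match it to the labels in your copy). It produces the text and target columns used later in the prompt template.

import json
import pandas as pd

TECHNIQUE = "Name_Calling,Labeling"   # focus on one technique at a time

# Toy placeholders standing in for the parsed PTC corpus.
articles = {"111": "The corrupt fat cats in Washington have betrayed us again."}
annotations = {"111": [("Name_Calling,Labeling", 12, 20)]}   # (technique, start offset, end offset)

rows = []
for article_id, text in articles.items():
    spans = annotations.get(article_id, [])
    # Map each annotated phrase of the chosen technique to its label.
    target = {text[start:end]: "Name-Calling" for technique, start, end in spans if technique == TECHNIQUE}
    rows.append({"text": text, "target": json.dumps(target)})

train_df = pd.DataFrame(rows)   # build dev_df the same way from the dev split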
Next, we turn this pandas DataFrame into a Hugging Face datasets.Dataset object.
from datasets import Dataset

finetune_train_dataset = Dataset.from_pandas(train_df)
finetune_dev_dataset = Dataset.from_pandas(dev_df)
Model
Next up, we load the model with 4-bit quantization using bitsandbytes.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator

base_model_id = "meta-llama/Llama-2-7b-hf"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # use torch.float16 on RTX8000 / V100 (no bfloat16 support)
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=quant_config,
    device_map={"": Accelerator().local_process_index},
)
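As an optional sanity check, you can confirm that the 4-bit model fits comfortably in GPU memory:

# The 7B model quantized to 4 bits should occupy only a few GB.
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")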
Load the tokenizer with left padding so that our training data points are padded to the same length.
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token
Let’s use the following function to tokenize and pad the prompts (set max_length based on the token-length distribution of your dataset).
max_length = 10000   # should comfortably cover the longest prompt in the dataset

def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    # For causal language modeling, the labels are a copy of the input ids.
    result["labels"] = result["input_ids"].copy()
    return result
We will frame the phrase-level detection problem using the following prompt template.
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""I want you to perform a data annotation task. In your output, I want you to return a json dictionary with key as phrase and value as technique, depending on whether you think the phrases in the following text use name-calling.
A phrase is "name-calling" if you perceive that it is "Labeling the object of the propaganda campaign as either something the target audience fears, hates, finds undesirable or otherwise loves or praises".
I want you to respond with a json dictionary strictly having the detected phrases as keys and technique (Name-Calling) as value.
### Text:
{data_point["text"]}
### Output:
{data_point["target"]}
"""
    return tokenize(full_prompt)
Let’s obtain the tokenized dataset.
tokenized_train_dataset = finetune_train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = finetune_dev_dataset.map(generate_and_tokenize_prompt)
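Optionally, verify that max_length is large enough. The sketch below tokenizes just the variable part of each training prompt (the fixed instruction adds a roughly constant offset) to check that nothing important gets truncated:

# Approximate token lengths of the variable part of each training prompt.
raw_lengths = [
    len(tokenizer(f'### Text:\n{x["text"]}\n### Output:\n{x["target"]}')["input_ids"])
    for x in finetune_train_dataset
]
print(f"max: {max(raw_lengths)}, mean: {sum(raw_lengths) / len(raw_lengths):.0f}")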
Fine-tuning the model using LoRA
The following code prepares the quantized model for training: it enables gradient checkpointing and applies PEFT's standard k-bit training preparation.
from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
Next up, we define our QLoRA configuration. The values were selected using this example as a reference.
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                 # rank of the LoRA update matrices
    lora_alpha=32,       # scaling factor applied to the LoRA updates
    target_modules=[     # the linear layers that are now trainable
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,   # conventional value
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
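As a sanity check, you can print how many parameters LoRA actually leaves trainable (typically well under 1% of the total):

# Prints the number of trainable parameters vs. total parameters.
model.print_trainable_parameters()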
OPTIONAL: use Fully Sharded Data Parallel (FSDP) to speed up training of large models when multiple GPUs are available.
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
model = accelerator.prepare_model(model)
Use Weights & Biases to track your experiment and metrics.
!pip install -q wandb -U

import wandb, os
wandb.login()

wandb_project = "propaganda-finetune"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project
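Optionally, you can also have the Hugging Face integration upload model checkpoints to W&B as artifacts (not required for the rest of the guide):

# Upload checkpoints to W&B in addition to logging metrics.
os.environ["WANDB_LOG_MODEL"] = "checkpoint"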
Finally, start the fine-tuning job:
import transformers
from datetime import datetime

project = "propaganda-finetune"
base_model_name = "llama2-7b"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        gradient_accumulation_steps=1,
        per_device_train_batch_size=1,    # reduce this to avoid OOM errors
        num_train_epochs=3,
        optim="paged_adamw_32bit",        # QLoRA uses the paged AdamW optimizer
        lr_scheduler_type="cosine",
        learning_rate=2e-4,
        bf16=True,                        # set to True on A100; set to False on RTX8000 / V100
        # fp16=True,                      # set to True on RTX8000 / V100 instead
        gradient_checkpointing=True,
        logging_steps=1,                  # log at every step
        logging_dir="./logs",
        warmup_steps=5,
        weight_decay=0.1,
        save_strategy="steps",            # save a checkpoint every save_steps steps
        save_steps=3,                     # save a checkpoint every third step
        evaluation_strategy="steps",
        eval_steps=3,                     # evaluate every third step
        do_eval=True,
        report_to="wandb",
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False    # the KV cache is not needed during training and conflicts with gradient checkpointing
model.config.pretraining_tp = 1   # use the standard (non tensor-parallel) linear-layer computation
trainer_stats = trainer.train()
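Checkpoints are written to output_dir during training. If you also want to save the final LoRA adapter and tokenizer explicitly, one way (a sketch) is:

# Saves only the small adapter weights, not the full base model.
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)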
Inference
Restart your kernel and run the following cells for inference.
import torch
from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "meta-llama/Llama-2-7b-hf"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,                                    # Llama 2 7B, same as before
    quantization_config=quant_config,                 # same quantization config as before
    device_map={"": Accelerator().local_process_index},
    trust_remote_code=True,
)
eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
    trust_remote_code=True,
)
Load the LoRA adapter from the desired checkpoint on top of the base model.
from peft import PeftModel
ft_model = PeftModel.from_pretrained(base_model, "llama2-7b-propaganda-finetune/checkpoint-371")
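It is good practice to put the adapter-augmented model in evaluation mode before generating:

# Disable dropout and other training-only behavior.
ft_model.eval()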
Run inference on a test article.
eval_prompt = f"""I want you to perform a data annotation task. In your output, I want you to return a json dictionary with key as phrase and value as technique, depending on whether you think the phrases in the following text use loaded language.
A phrase is "loaded language" if you perceive that it is "Using words or phrases with strong emotional implications to influence an audience".
I want you to respond with a json dictionary strictly having the detected phrases as keys and technique (Loaded Language) as value.
### Text:
{test_news_article}
### Output:
"""
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    generated_ids = ft_model.generate(
        **model_input,
        max_new_tokens=500,
        repetition_penalty=1.5,
        top_p=0.1,
        top_k=20,
    )[0]
    output = eval_tokenizer.decode(generated_ids, skip_special_tokens=False)

# Keep only the newly generated portion by stripping the prompt from the decoded output.
new_tokens = output.replace(eval_prompt, "")
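The prompt asks the model for a JSON dictionary, but generations are not guaranteed to be well formed, so parse the output defensively. A minimal sketch:

import json

# Strip the special tokens left in by skip_special_tokens=False before parsing.
cleaned = new_tokens.replace("<s>", "").replace("</s>", "").strip()
try:
    detected = json.loads(cleaned)   # {phrase: technique, ...}
except json.JSONDecodeError:
    detected = {}                    # fall back to an empty result on malformed output
print(detected)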