Optimizing Preferences in Large Language Models with ORPO
Chapter 1: Introduction to ORPO
The landscape of methods for aligning large language models (LLMs) with human preferences has diversified significantly. Reinforcement Learning from Human Feedback (RLHF) played a pioneering role in developing models like ChatGPT, but it is known for its high costs. In contrast, methods such as DPO (Direct Preference Optimization), IPO (Identity Preference Optimization), and KTO (Kahneman-Tversky Optimization) offer more economical alternatives by eliminating the need for a separate reward model.
Among these, DPO and IPO still necessitate the training of two distinct models: one for the supervised fine-tuning (SFT) phase, which helps the model respond to instructions, and another that aligns the model with human preferences using the SFT model as a reference.
ORPO (Odds Ratio Preference Optimization) emerges as a new approach that bypasses the SFT model entirely. It allows the LLM to learn to address instructions while simultaneously aligning with human preferences.
In this article, I will discuss ORPO, evaluating its performance and illustrating how it can transform the Mistral 7B model into a chat-oriented model using consumer-grade hardware.
Section 1.1: Joint SFT and Preference Optimization
ORPO is detailed in the paper titled "ORPO: Monolithic Preference Optimization without Reference Model." The authors convincingly argue that the SFT step might not be optimal in the alignment process. Although fine-tuning on instruction datasets can enhance the model's ability to respond to specific queries, it can also inadvertently increase the likelihood of generating answers that humans might reject.
This concept is intuitive: selected and rejected responses often share similarities—such as domain and format—leading to a higher chance of generating contextually relevant but incorrect answers.
Thus, techniques like DPO become essential in minimizing the likelihood of generating rejected responses while amplifying the chances of producing acceptable ones. Preference optimization methods are trained on datasets that include:
- Prompt
- Chosen answer
- Rejected answer
Conversely, SFT is trained solely on prompts paired with chosen responses. The same dataset can serve both SFT and preference optimization, but SFT never uses the rejected answers.
From this perspective, it is logical to fine-tune a base LLM to learn how to respond to instructions while also penalizing undesired answers, utilizing the same dataset. This is the core function of ORPO.
Section 1.2: How ORPO Works
ORPO alters the training loss by combining the negative log-likelihood loss with an odds ratio (OR) loss. The OR loss imposes a mild penalty on rejected answers while strongly rewarding chosen answers. A hyperparameter, lambda, determines the weight of the OR loss.
Setting lambda to 0.1 works well in practice. Increasing it to 0.5 sharpens the discrimination between chosen and rejected outputs, but it can also lower the likelihood assigned to the chosen answers. In scenarios where avoiding poor answers is paramount, a lambda of 0.5 may therefore be the better choice.
With ORPO's loss, the model learns what it would normally learn during SFT while simultaneously learning human preferences, using a single dataset and a single model. A potential drawback of this approach is that it may require larger preference datasets than those needed by other preference-optimization methods.
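To make the loss concrete, here is a minimal sketch of how the ORPO objective can be computed from sequence log-probabilities, following the formula in the paper. This is an illustration rather than TRL's actual implementation; the function name and the use of average per-token log-probabilities are assumptions for the example.

import torch
import torch.nn.functional as F

def orpo_loss_sketch(chosen_logps, rejected_logps, lam=0.1):
    # chosen_logps / rejected_logps: average per-token log-probabilities of the
    # chosen and rejected answers under the model being trained (shape: [batch]).
    # odds(y|x) = P(y|x) / (1 - P(y|x)), computed in log space for stability.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # OR loss: push the odds of the chosen answer above the odds of the rejected one.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # SFT part: standard negative log-likelihood on the chosen answer.
    nll_loss = -chosen_logps
    # lambda weights the odds-ratio term relative to the SFT term.
    return (nll_loss + lam * or_loss).mean()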
The first video, "ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)," provides an in-depth look at this innovative method, discussing its implications and effectiveness in aligning LLMs with human preferences.
Section 1.3: Implementing ORPO with TRL
All code relevant to this section is available in the provided notebook. The notebook also includes an example of ORPO training with GaLore. Hugging Face's TRL library now supports ORPO, but as a recent addition it must be installed from source, alongside several other packages:
pip install -q -U bitsandbytes
pip install -q -U transformers
pip install -q -U peft
pip install -q -U accelerate
pip install -q -U datasets
pip install -q -U git+https://github.com/huggingface/trl.git
Next, we import necessary modules:
import torch
import multiprocessing
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import ORPOTrainer, ORPOConfig
The multiprocessing library will help apply the chat template to the dataset in parallel. We also import ORPOConfig, which subclasses Transformers' TrainingArguments and is used to configure ORPO training.
We also ensure compatibility with FlashAttention and bfloat16 based on GPU capability:
import os

major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
    os.system("pip install flash-attn")
    torch_dtype = torch.bfloat16
    attn_implementation = 'flash_attention_2'
    print("Your GPU is compatible with FlashAttention and bfloat16.")
else:
    torch_dtype = torch.float16
    attn_implementation = 'eager'
    print("Your GPU is not compatible with FlashAttention and bfloat16.")
Next, we load the dataset. I use "HuggingFaceH4/ultrafeedback_binarized," which was compiled by Hugging Face to train the Zephyr models. Its "chosen" and "rejected" columns contain lists of chat messages, which we serialize into plain strings with the tokenizer's chat template. Note that the process function below relies on the tokenizer, which is loaded in the next subsection, so make sure it is available before running the map calls.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split=["train_prefs", "test_prefs"])

def process(row):
    # Serialize the lists of chat messages into strings using the chat template.
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

dataset[0] = dataset[0].map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
dataset[1] = dataset[1].map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
Subsection 1.3.1: Model Loading and Configuration
We proceed to load the tokenizer and the model:
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
The model is quantized on-the-fly using bitsandbytes' NF4 data type, configured with BitsAndBytesConfig. It's crucial to invoke "prepare_model_for_kbit_training" to enable gradient checkpointing, optimizing memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch_dtype,
    quantization_config=bnb_config,
    device_map={"": 0},
    attn_implementation=attn_implementation,
)
model = prepare_model_for_kbit_training(model)
model.config.pad_token_id = tokenizer.pad_token_id
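If you want to check how much GPU memory the quantized weights occupy before training starts, Transformers exposes a get_memory_footprint() helper; this check is an optional addition, and the number you see will depend on your setup.

# Optional: report the memory used by the quantized model's weights
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")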
For LoRA configuration, standard hyperparameters are applied. Adjusting "r" could yield better outcomes but would also elevate memory consumption.
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj'],
)
In ORPOConfig, we set up the training parameters and initiate the training process:
orpo_config = ORPOConfig(
    output_dir="./results/",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=2,
    log_level="debug",
    logging_steps=20,
    learning_rate=8e-6,
    eval_steps=20,
    max_steps=100,
    save_steps=20,
    save_strategy="steps",
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    beta=0.1,  # beta is ORPO's lambda in TRL
    max_length=1024,
)
trainer = ORPOTrainer(
    model=model,
    train_dataset=dataset[0],
    eval_dataset=dataset[1],
    peft_config=peft_config,
    args=orpo_config,
    tokenizer=tokenizer,
)

trainer.train()
The ORPOTrainer diverges from SFTTrainer and DPOTrainer, as it does not accept TrainingArguments directly. Instead, it requires an "ORPOConfig" with subtly different parameters.
The training run spans 100 steps on an L4 GPU (available on Google Colab and significantly faster than the T4), yet it still takes over 9 hours, largely because evaluation runs on the full, fairly large UltraFeedback validation split. Increasing eval_steps or evaluating on a subset of the validation data would make the run noticeably faster, as sketched below.
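For example, a minimal way to cut the evaluation cost, assuming a 500-example subset is representative enough for monitoring, is to pass a shuffled slice of the validation split to the trainer instead of the full set:

# Evaluate on a small, fixed subset of the validation split to speed up training
small_eval = dataset[1].shuffle(seed=42).select(range(500))

trainer = ORPOTrainer(
    model=model,
    train_dataset=dataset[0],
    eval_dataset=small_eval,
    peft_config=peft_config,
    args=orpo_config,
    tokenizer=tokenizer,
)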
The second video, "Combined Preference and Supervised Fine Tuning with ORPO," further explores the integration of preference optimization and supervised fine-tuning within ORPO's framework.
Chapter 2: Performance Analysis and Conclusions
The training and validation losses show a downward trend, indicating the model is learning effectively. However, the expected increase in margins and accuracies is not yet observed.
Reviewing the learning curves from the ORPO paper, it is evident that substantial training steps—potentially in the thousands—are required for the model to distinguish between acceptable and unacceptable responses. Therefore, a minimum of 2,000 steps with a total batch size of 64 is recommended to achieve comparable results. For those using a high-end consumer GPU, such as an RTX 4090, this is feasible but may take several days.
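As a rough sketch, and assuming a per-device batch size of 2 is all that fits in memory, the effective batch size of 64 can be reached through gradient accumulation; the remaining values below simply mirror the earlier configuration.

# Reaching an effective batch size of 2 * 32 = 64 on a single GPU
orpo_config = ORPOConfig(
    output_dir="./results/",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    max_steps=2000,
    learning_rate=8e-6,
    beta=0.1,
    max_length=1024,
)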
In conclusion, ORPO stands out as an innovative approach for fine-tuning and aligning instruction-based LLMs in a single step without the necessity for a reward or SFT model. It presents a simpler alternative to DPO and RLHF.
According to the referenced paper, ORPO's performance is comparable to or slightly better than DPO. However, it requires several thousand training steps for the model to effectively learn the distinction between high-quality and low-quality responses.
Is ORPO the right choice for you? If you seek a straightforward and effective method, ORPO is certainly worth considering. However, if you aim for optimal results, the decision may be less clear-cut. A comprehensive comparison with other recent methods like KTO and IPO is still needed to evaluate ORPO's full potential.
To stay updated on the latest advancements in AI, consider subscribing to my newsletter for more articles and tutorials.