Optimizing Preferences in Large Language Models with ORPO
Chapter 1: Introduction to ORPO
The landscape of methods for aligning large language models (LLMs) with human preferences has diversified significantly. Reinforcement Learning from Human Feedback (RLHF) played a pioneering role in developing models like ChatGPT, but it is known for its high costs. In contrast, methods such as DPO (Direct Preference Optimization), IPO (Identity Preference Optimization), and KTO (Kahneman-Tversky Optimization) offer more economical alternatives by eliminating the need for a separate reward model.
Among these, DPO and IPO still necessitate the training of two distinct models: one for the supervised fine-tuning (SFT) phase, which helps the model respond to instructions, and another that aligns the model with human preferences using the SFT model as a reference.
ORPO (Odds Ratio Preference Optimization) emerges as a new approach that bypasses the SFT model entirely. It allows the LLM to learn to address instructions while simultaneously aligning with human preferences.
In this article, I will discuss ORPO, evaluating its performance and illustrating how it can transform the Mistral 7B model into a chat-oriented model using consumer-grade hardware.
Section 1.1: Joint SFT and Preference Optimization
ORPO is detailed in the paper titled "ORPO: Monolithic Preference Optimization without Reference Model." The authors convincingly argue that the SFT step might not be optimal in the alignment process. Although fine-tuning on instruction datasets can enhance the model's ability to respond to specific queries, it can also inadvertently increase the likelihood of generating answers that humans might reject.
This concept is intuitive: selected and rejected responses often share similarities—such as domain and format—leading to a higher chance of generating contextually relevant but incorrect answers.
Thus, techniques like DPO become essential in minimizing the likelihood of generating rejected responses while amplifying the chances of producing acceptable ones. Preference optimization methods are trained on datasets that include:
- Prompt
- Chosen answer
- Rejected answer
Conversely, SFT is trained solely on prompts paired with chosen responses. The same dataset can serve both SFT and preference optimization, but SFT never uses the rejected answers.
From this perspective, it is logical to fine-tune a base LLM to learn how to respond to instructions while also penalizing undesired answers, utilizing the same dataset. This is the core function of ORPO.
Section 1.2: How ORPO Works
ORPO alters the training loss by combining the negative log-likelihood loss with an odds ratio (OR) loss. The OR loss imposes a mild penalty on rejected answers while strongly rewarding chosen answers. A hyperparameter, lambda, determines the weight of the OR loss.
Setting lambda to 0.1 works well in practice. Increasing it to 0.5 sharpens the discrimination between chosen and rejected outputs, but it can also lower the likelihood assigned to the chosen answers. In scenarios where avoiding poor answers is paramount, a lambda of 0.5 may therefore be the better choice.
With ORPO's loss, the model learns what it would normally learn during SFT while simultaneously learning human preferences, using a single dataset and a single model. A potential drawback of this approach is that it may require larger preference datasets than those needed by other preference-optimization methods.
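To make the loss concrete, here is a minimal sketch of how the ORPO objective can be computed from sequence log-probabilities, following the formula in the paper. This is an illustration rather than TRL's actual implementation; the function name and the use of average per-token log-probabilities are assumptions for the example.

import torch
import torch.nn.functional as F

def orpo_loss_sketch(chosen_logps, rejected_logps, lam=0.1):
    # chosen_logps / rejected_logps: average per-token log-probabilities of the
    # chosen and rejected answers under the model being trained (shape: [batch]).
    # odds(y|x) = P(y|x) / (1 - P(y|x)), computed in log space for stability.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # OR loss: push the odds of the chosen answer above the odds of the rejected one.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # SFT part: standard negative log-likelihood on the chosen answer.
    nll_loss = -chosen_logps
    # lambda weights the odds-ratio term relative to the SFT term.
    return (nll_loss + lam * or_loss).mean()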
The first video, "ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)," provides an in-depth look at this innovative method, discussing its implications and effectiveness in aligning LLMs with human preferences.
Section 1.3: Implementing ORPO with TRL
All code relevant to this section is available in the provided notebook. The notebook also includes an example of ORPO training with GaLore. Hugging Face's TRL library now supports ORPO, but as a recent addition it must be installed from source, alongside several other packages:
pip install -q -U bitsandbytes
pip install -q -U transformers
pip install -q -U peft
pip install -q -U accelerate
pip install -q -U datasets
pip install -q -U git+https://github.com/huggingface/trl.git
Next, we import necessary modules:
import torch
import multiprocessing
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import ORPOTrainer, ORPOConfig
The multiprocessing library will help apply the chat template to the dataset in parallel. We also import ORPOConfig, which subclasses Transformers' TrainingArguments and is used to configure ORPO training.
We also ensure compatibility with FlashAttention and bfloat16 based on GPU capability:
import os

major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
    os.system("pip install flash-attn")
    torch_dtype = torch.bfloat16
    attn_implementation = 'flash_attention_2'
    print("Your GPU is compatible with FlashAttention and bfloat16.")
else:
    torch_dtype = torch.float16
    attn_implementation = 'eager'
    print("Your GPU is not compatible with FlashAttention and bfloat16.")
Next, we load the dataset. I use "HuggingFaceH4/ultrafeedback_binarized," which was compiled by Hugging Face to train the Zephyr models. Its "chosen" and "rejected" columns contain lists of chat messages, which we serialize into plain strings with the tokenizer's chat template. Note that the process function below relies on the tokenizer, which is loaded in the next subsection, so make sure it is available before running the map calls.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split=["train_prefs", "test_prefs"])

def process(row):
    # Serialize the lists of chat messages into strings using the chat template.
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

dataset[0] = dataset[0].map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
dataset[1] = dataset[1].map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
Subsection 1.3.1: Model Loading and Configuration
We proceed to load the tokenizer and the model:
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
The model is quantized on-the-fly using bitsandbytes' NF4 data type, configured with BitsAndBytesConfig. It's crucial to invoke "prepare_model_for_kbit_training" to enable gradient checkpointing, optimizing memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch_dtype,
    quantization_config=bnb_config,
    device_map={"": 0},
    attn_implementation=attn_implementation,
)
model = prepare_model_for_kbit_training(model)
model.config.pad_token_id = tokenizer.pad_token_id
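If you want to check how much GPU memory the quantized weights occupy before training starts, Transformers exposes a get_memory_footprint() helper; this check is an optional addition, and the number you see will depend on your setup.

# Optional: report the memory used by the quantized model's weights
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")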
For LoRA configuration, standard hyperparameters are applied. Adjusting "r" could yield better outcomes but would also elevate memory consumption.
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj'],
)
In ORPOConfig, we set up the training parameters and initiate the training process:
orpo_config = ORPOConfig(
    output_dir="./results/",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=2,
    log_level="debug",
    logging_steps=20,
    learning_rate=8e-6,
    eval_steps=20,
    max_steps=100,
    save_steps=20,
    save_strategy="steps",
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    beta=0.1,  # beta is ORPO's lambda in TRL
    max_length=1024,
)
trainer = ORPOTrainer(
    model=model,
    train_dataset=dataset[0],
    eval_dataset=dataset[1],
    peft_config=peft_config,
    args=orpo_config,
    tokenizer=tokenizer,
)

trainer.train()
The ORPOTrainer diverges from SFTTrainer and DPOTrainer, as it does not accept TrainingArguments directly. Instead, it requires an "ORPOConfig" with subtly different parameters.
The training run spans 100 steps on an L4 GPU (available on Google Colab and significantly faster than the T4), yet it still takes over 9 hours, largely because evaluation runs on the full, fairly large UltraFeedback validation split. Increasing eval_steps or evaluating on a subset of the validation data would make the run noticeably faster, as sketched below.
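For example, a minimal way to cut the evaluation cost, assuming a 500-example subset is representative enough for monitoring, is to pass a shuffled slice of the validation split to the trainer instead of the full set:

# Evaluate on a small, fixed subset of the validation split to speed up training
small_eval = dataset[1].shuffle(seed=42).select(range(500))

trainer = ORPOTrainer(
    model=model,
    train_dataset=dataset[0],
    eval_dataset=small_eval,
    peft_config=peft_config,
    args=orpo_config,
    tokenizer=tokenizer,
)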
The second video, "Combined Preference and Supervised Fine Tuning with ORPO," further explores the integration of preference optimization and supervised fine-tuning within ORPO's framework.
Chapter 2: Performance Analysis and Conclusions
The training and validation losses show a downward trend, indicating the model is learning effectively. However, the expected increase in margins and accuracies is not yet observed.
Reviewing the learning curves from the ORPO paper, it is evident that substantial training steps—potentially in the thousands—are required for the model to distinguish between acceptable and unacceptable responses. Therefore, a minimum of 2,000 steps with a total batch size of 64 is recommended to achieve comparable results. For those using a high-end consumer GPU, such as an RTX 4090, this is feasible but may take several days.
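As a rough sketch, and assuming a per-device batch size of 2 is all that fits in memory, the effective batch size of 64 can be reached through gradient accumulation; the remaining values below simply mirror the earlier configuration.

# Reaching an effective batch size of 2 * 32 = 64 on a single GPU
orpo_config = ORPOConfig(
    output_dir="./results/",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    max_steps=2000,
    learning_rate=8e-6,
    beta=0.1,
    max_length=1024,
)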
In conclusion, ORPO stands out as an innovative approach for fine-tuning and aligning instruction-based LLMs in a single step without the necessity for a reward or SFT model. It presents a simpler alternative to DPO and RLHF.
According to the referenced paper, ORPO's performance is comparable to or slightly better than DPO. However, it requires several thousand training steps for the model to effectively learn the distinction between high-quality and low-quality responses.
Is ORPO the right choice for you? If you seek a straightforward and effective method, ORPO is certainly worth considering. However, if you aim for optimal results, the decision may be less clear-cut. A comprehensive comparison with other recent methods like KTO and IPO is still needed to evaluate ORPO's full potential.
To stay updated on the latest advancements in AI, consider subscribing to my newsletter for more articles and tutorials.