
Optimizing Preferences in Large Language Models with ORPO

Chapter 1: Introduction to ORPO

The landscape of aligning large language models (LLMs) with human preferences has diversified significantly. While Reinforcement Learning with Human Feedback (RLHF) played a pioneering role in developing models like ChatGPT, it is known for its high costs. In contrast, methods such as DPO, IPO, and KTO offer more economical alternatives by eliminating the need for a reward model.

Among these, DPO and IPO still necessitate the training of two distinct models: one for the supervised fine-tuning (SFT) phase, which helps the model respond to instructions, and another that aligns the model with human preferences using the SFT model as a reference.

ORPO emerges as a new approach that bypasses the SFT model entirely. It allows the LLM to learn to address instructions while simultaneously aligning with human preferences.

In this article, I will discuss ORPO, evaluating its performance and illustrating how it can transform the Mistral 7B model into a chat-oriented model using consumer-grade hardware.

Section 1.1: Joint SFT and Preference Optimization

ORPO is detailed in the paper titled "ORPO: Monolithic Preference Optimization without Reference Model." The authors convincingly argue that the SFT step might not be optimal in the alignment process. Although fine-tuning on instruction datasets can enhance the model's ability to respond to specific queries, it can also inadvertently increase the likelihood of generating answers that humans might reject.

This concept is intuitive: chosen and rejected responses often share the same domain and format, so fine-tuning on the chosen answers alone also raises the probability of the similar, but undesired, rejected ones.

Thus, techniques like DPO become essential in minimizing the likelihood of generating rejected responses while amplifying the chances of producing acceptable ones. Preference optimization methods are trained on datasets that include:

  • Prompt
  • Chosen answer
  • Rejected answer

Conversely, SFT is trained solely on prompts paired with chosen responses. The datasets for SFT and preference optimization can therefore overlap; the SFT data simply omits the rejected answers.
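For illustration, a single preference record (a made-up example, not taken from any particular dataset) could look like this:

{
  "prompt": "Summarize the water cycle in two sentences.",
  "chosen": "Water evaporates from oceans and lakes, condenses into clouds, and falls back as precipitation. It then returns to bodies of water through runoff and groundwater flow.",
  "rejected": "The water cycle is when water goes around."
}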

From this perspective, it is logical to fine-tune a base LLM to learn how to respond to instructions while also penalizing undesired answers, utilizing the same dataset. This is the core function of ORPO.

Section 1.2: How ORPO Works

ORPO alters the training loss by combining the standard negative log-likelihood (NLL) loss with an odds ratio (OR) loss. The OR loss imposes a mild penalty on rejected answers while strongly rewarding chosen answers. A hyperparameter, lambda, determines the weight of the OR loss.

Setting lambda at 0.1 has proven effective. However, increasing it to 0.5 enhances the discrimination between selected and rejected outputs but may also decrease the likelihood of selecting favorable answers. In specific scenarios where avoiding poor answers is paramount, a lambda of 0.5 may be more beneficial.
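To make the objective concrete, here is a minimal PyTorch sketch of the combined loss. It assumes chosen_logps and rejected_logps are the length-normalized log-probabilities of the chosen and rejected completions, computed from the model's logits elsewhere; TRL's ORPOTrainer implements the same idea with additional bookkeeping.

import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    # Log-odds: log(p / (1 - p)), from length-normalized log-probabilities in (-inf, 0)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds ratio penalty: pushes the odds of the chosen answer above those of the rejected one
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # SFT term: negative log-likelihood of the chosen completion
    nll_loss = -chosen_logps
    return (nll_loss + lam * or_loss).mean()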

With ORPO's loss structure, the model acquires insights comparable to those gained during SFT, simultaneously learning human preferences using a single dataset and model. A potential drawback of this approach is that it may require larger preference datasets than those needed for optimization with other methods.

The first video, "ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)," provides an in-depth look at this innovative method, discussing its implications and effectiveness in aligning LLMs with human preferences.

Section 1.3: Implementing ORPO with TRL

All code relevant to this section is available in the provided notebook. The notebook includes an example of ORPO training using the GaLore framework. Hugging Face's TRL library now supports ORPO, albeit as a recent addition, necessitating installation from the source alongside several other packages:

# Install TRL from source (ORPO support is a recent addition)
pip install -q -U git+https://github.com/huggingface/trl.git
pip install -q -U bitsandbytes
pip install --upgrade -q -U transformers
pip install -q -U peft
pip install -q -U accelerate
pip install -q -U datasets

Next, we import necessary modules:

import torch
import multiprocessing

from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import ORPOTrainer, ORPOConfig

The multiprocessing library will assist in applying the chat template to the dataset. Additionally, ORPOConfig is imported for configuring ORPO, which is derived from Transformers' TrainingArguments.

We also ensure compatibility with FlashAttention and bfloat16 based on GPU capability:

import os

major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
    os.system("pip install flash-attn")
    torch_dtype = torch.bfloat16
    attn_implementation = 'flash_attention_2'
    print("Your GPU is compatible with FlashAttention and bfloat16.")
else:
    torch_dtype = torch.float16
    attn_implementation = 'eager'
    print("Your GPU is not compatible with FlashAttention and bfloat16.")

Next, we load the dataset. I use "HuggingFaceH4/ultrafeedback_binarized," which was compiled by Hugging Face to train the Zephyr models. Its "chosen" and "rejected" columns contain lists of chat messages, so we apply the tokenizer's chat template to turn them into plain strings. Note that this mapping relies on the tokenizer, which is loaded in the next subsection; load it first when running the code.

dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split=["train_prefs", "test_prefs"])

def process(row):
    # Turn the lists of chat messages into single formatted strings
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

dataset[0] = dataset[0].map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

dataset[1] = dataset[1].map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

Subsection 1.3.1: Model Loading and Configuration

We proceed to load the tokenizer and the model:

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

The model is quantized on-the-fly using bitsandbytes' NF4 data type, configured with BitsAndBytesConfig. It's crucial to invoke "prepare_model_for_kbit_training" to enable gradient checkpointing, optimizing memory usage.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch_dtype,
    quantization_config=bnb_config,
    device_map={"": 0},
    attn_implementation=attn_implementation,
)
model = prepare_model_for_kbit_training(model)
model.config.pad_token_id = tokenizer.pad_token_id

For LoRA configuration, standard hyperparameters are applied. Adjusting "r" could yield better outcomes but would also elevate memory consumption.

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)

In ORPOConfig, we set up the training parameters and initiate the training process:

orpo_config = ORPOConfig(
    output_dir="./results/",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=2,
    log_level="debug",
    logging_steps=20,
    learning_rate=8e-6,
    eval_steps=20,
    max_steps=100,
    save_steps=20,
    save_strategy='epoch',
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    beta=0.1,  # This is ORPO's lambda
    max_length=1024,
)

trainer = ORPOTrainer(
    model=model,
    train_dataset=dataset[0],
    eval_dataset=dataset[1],
    peft_config=peft_config,
    args=orpo_config,
    tokenizer=tokenizer,
)

trainer.train()

The ORPOTrainer diverges from SFTTrainer and DPOTrainer, as it does not accept TrainingArguments directly. Instead, it requires an "ORPOConfig" with subtly different parameters.

The training process spans 100 steps, utilizing Google's new L4 GPU, which is significantly faster than the T4, yet it still takes over 9 hours, largely due to validation on the extensive ultrafeedback validation split. Adjusting eval_steps and using a subset of the validation data may enhance efficiency.
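As an illustrative tweak (not part of the original notebook), you could evaluate on a small random subset of the validation split rather than the whole thing:

# Illustrative: evaluate on 500 random validation examples instead of the full split
small_eval = dataset[1].shuffle(seed=42).select(range(500))
# ...then pass eval_dataset=small_eval to ORPOTrainer and raise eval_steps in ORPOConfig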

The second video, "Combined Preference and Supervised Fine Tuning with ORPO," further explores the integration of preference optimization and supervised fine-tuning within ORPO's framework.

Chapter 2: Performance Analysis and Conclusions

The training and validation losses show a downward trend, indicating the model is learning effectively. However, the expected increase in margins and accuracies is not yet observed.

Reviewing the learning curves from the ORPO paper, it is evident that substantial training steps—potentially in the thousands—are required for the model to distinguish between acceptable and unacceptable responses. Therefore, a minimum of 2,000 steps with a total batch size of 64 is recommended to achieve comparable results. For those using a high-end consumer GPU, such as an RTX 4090, this is feasible but may take several days.
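On a single GPU, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps, so one way to approximate that regime (illustrative values, assuming the memory budget allows it) is:

# Effective batch size: 2 x 32 = 64, run for the 2,000 steps suggested above
orpo_config = ORPOConfig(
    output_dir="./results/",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,
    max_steps=2000,
    learning_rate=8e-6,
    beta=0.1,
    max_length=1024,
)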

In conclusion, ORPO stands out as an innovative approach for fine-tuning and aligning instruction-based LLMs in a single step without the necessity for a reward or SFT model. It presents a simpler alternative to DPO and RLHF.

According to the referenced paper, ORPO's performance is comparable to or slightly better than DPO. However, it requires several thousand training steps for the model to effectively learn the distinction between high-quality and low-quality responses.

Is ORPO the right choice for you? If you seek a straightforward and effective method, ORPO is certainly worth considering. However, if you aim for optimal results, the decision may be less clear-cut. A comprehensive comparison with other recent methods like KTO and IPO is still needed to evaluate ORPO's full potential.

To stay updated on the latest advancements in AI, consider subscribing to my newsletter for more articles and tutorials.
