Revolutionizing LLM Training with Odds Ratio Preference Optimization
Chapter 1: Introduction to ORPO
Researchers in South Korea have introduced a novel training technique for Large Language Models (LLMs) called Odds Ratio Preference Optimization (ORPO). The method not only improves computational efficiency but also yields better-performing models. Notably, it eliminates one of the most complex and costly phases of traditional training, potentially making fine-tuning affordable for both consumers and businesses and paving the way for a wave of new, more capable models.
ORPO has the potential to democratize LLM training, empowering the open-source and enterprise sectors to build better models and reduce their dependence on a handful of wealthier corporations.
How Does ORPO Work?
To grasp the significance of ORPO, it’s crucial to understand that LLMs are a type of neural network (NN) that relies on a trial-and-error training approach.
The Learning Process of Neural Networks
To train a neural network, we compare its outputs to the actual correct answers (ground truth). This comparison provides a 'signal' indicating the NN's performance, allowing for adjustments to minimize errors.
For LLMs, the output isn't just a single word but rather a list of potential words. The model predicts the next word in a sequence, evaluating how suitable each word in its vocabulary is as a continuation. Thus, the output is a probability distribution across its vocabulary, indicating the likelihood of each word as the next in the sequence.
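As a concrete illustration, here is a minimal sketch using the Hugging Face transformers library (GPT-2 is used purely because it is small; any causal LM behaves the same way) that shows how the model turns its raw scores into a probability distribution over the whole vocabulary for the next token:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used here only as a small example; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Softmax over the last position gives the probability distribution for the next token.
next_token_probs = F.softmax(logits[0, -1], dim=-1)

# Inspect the five most likely continuations.
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}: {prob.item():.3f}")
```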
Why Use Probability Distributions?
LLMs generate language, which inherently involves uncertainty since multiple continuations can convey the same idea. Therefore, the model must not only predict the most likely next word but also consider other possible options. This is why probability distributions are common in machine learning, particularly in LLMs.
But how do we assess the model's performance? We use an objective function, typically the cross-entropy loss. Cross-entropy treats the correct word as a one-hot target and looks only at the probability the model assigned to it: the higher that probability, the better the model is doing.
Why Introduce a Negative Sign?
Optimizers minimize the loss rather than maximize it, so instead of maximizing the probability of the correct word we minimize its negative log-probability. The negative sign simply turns a maximization problem into the minimization our training loop expects.
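Putting this together, the standard next-token objective can be written (in slightly simplified notation) as the negative log-likelihood of the correct tokens:

$$
\mathcal{L}_{\text{SFT}} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta\!\left(x_t \mid x_{<t}\right)
$$

where $P_\theta(x_t \mid x_{<t})$ is the probability the model assigns to the correct token $x_t$ given the preceding tokens. Minimizing this loss is equivalent to maximizing the probability of the correct continuation.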
This fundamental equation has been the backbone of LLM training for years; virtually every modern model has been pre-trained with this objective.
But what happens to the probabilities the model assigns to all the other words, the ones the loss never looks at? This question is key to understanding ORPO's innovation.
The Traditional LLM Training Pipeline
The training of an LLM typically consists of several stages:
- Pre-training: The model learns to predict the next word using vast data but cannot follow instructions.
- Fine-tuning (Supervised Fine-Tuning, SFT): The model is trained to follow instructions, producing models with names like "InstructGPT" or "Llama 3 8B Instruct."
- Fine-tuning (Alignment): The model undergoes preference and safety training to avoid harmful responses, using methods such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO).
It's important to note that the objective function changes during the alignment phase. While the first two stages focus on predicting the next word, the third trains the model to choose between responses: given a pair of candidate answers, it learns to prefer the chosen one over the rejected one, being rewarded when it does and penalized when it doesn't.
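To make this concrete, here is a minimal PyTorch sketch of the DPO objective (tensor and function names are hypothetical): it rewards the policy for assigning relatively higher likelihood to the chosen response than a frozen reference model does, and penalizes the reverse for the rejected response.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective; each argument is a tensor of summed
    log-probabilities of a full response under the policy or the frozen
    reference model."""
    # How much more the policy prefers each response than the reference does.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Reward a large gap between the chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Note that this objective requires a second, frozen copy of the model as the reference; that duplication is one of the costs ORPO later removes.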
Training large models can take months and cost millions, with estimates suggesting that Meta spent over $100 million on Llama 3's training.
Despite its high costs, this training method has remained unchanged for years, making it accessible only to a select few.
Introducing ORPO: A Game Changer
ORPO aims to streamline this process, effectively merging steps two and three. By doing so, it not only enhances efficiency but also yields better results.
The Problem with Traditional Training
In the training pipeline, the two phases of fine-tuning primarily focus on adjusting behavior rather than imparting new knowledge.
In the SFT phase, the model is crafted into a helpful assistant, while the alignment phase ensures it doesn't engage in harmful behavior. This presents a trade-off: improving safety might compromise the model's utility.
Researchers have long sought a method to integrate these stages, and ORPO appears to have achieved this.
Understanding the Objective Function
The original objective function maximizes the probability of the correct word but neglects the probabilities of rejected words. This oversight can inadvertently lead to increased probabilities for undesirable responses, necessitating the alignment phase.
A pivotal experiment revealed that, during standard fine-tuning, the probabilities assigned to undesirable responses can climb alongside, and in some cases even overtake, those of the preferred answers, which is why an additional stage has been needed to push them back down.
The Solution: Introducing a Penalty Term
ORPO proposes a new objective function that adds a penalty term for rejected responses on top of the standard cross-entropy loss, compelling the model to assign them lower probabilities. This dual pressure increases the likelihood of the chosen responses while actively suppressing the rejected ones.
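Paraphrasing the paper's notation, the combined objective adds an odds-ratio term, weighted by a hyperparameter $\lambda$, to the usual supervised loss:

$$
\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}},
\qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\operatorname{odds}_\theta(y_w \mid x)}{\operatorname{odds}_\theta(y_l \mid x)}\right),
\qquad
\operatorname{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$

where $y_w$ is the chosen response, $y_l$ is the rejected one, $\sigma$ is the sigmoid, and $\lambda$ weights the odds-ratio penalty relative to the usual cross-entropy term $\mathcal{L}_{\text{SFT}}$.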
Consequently, a separate alignment stage becomes redundant: a single ORPO fine-tuning run does the work of both SFT and alignment, which brings significant computational savings and, according to the authors, better overall results than the traditional multi-stage approach.
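A minimal PyTorch sketch of such a loss might look like the following (names are hypothetical; `chosen_logps` and `rejected_logps` are assumed to be average per-token log-probabilities of the chosen and rejected responses under the model being trained):

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """Sketch of an ORPO-style loss: cross-entropy on the chosen response
    plus an odds-ratio penalty that pushes down the rejected response."""
    # log odds(p) = log(p / (1 - p)) = log p - log(1 - p)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: encourage the chosen odds to dominate the rejected odds.
    odds_ratio_term = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # Standard negative log-likelihood on the chosen response plays the SFT role.
    sft_term = -chosen_logps
    return (sft_term - lam * odds_ratio_term).mean()
```

Note that, unlike the DPO sketch earlier, no reference-model log-probabilities appear anywhere in this loss.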
A New Dawn for LLM Training
By collapsing supervised fine-tuning and alignment into a single stage, ORPO introduces a transformative method for training LLMs that promises substantial efficiency gains.
For instance, in typical setups the alignment phase requires two copies of the model: the one being trained and a frozen reference model. Keeping both around drives up costs significantly. With ORPO, the reference model is no longer necessary, saving both time and memory.
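In practice, a single-model ORPO run can be set up with the open-source TRL library, which ships an `ORPOTrainer`. The following is a rough sketch only; the model id, dataset, and exact argument names are assumptions and may differ across TRL versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B"  # example model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A preference dataset with "prompt", "chosen", and "rejected" columns (example dataset).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = ORPOConfig(
    output_dir="llama3-orpo",
    beta=0.1,                        # weight of the odds-ratio penalty
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

# Unlike DPO/RLHF trainers, no reference model is passed -- only the policy model.
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,      # "tokenizer=" in older TRL versions
)
trainer.train()
```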
ORPO is set to revolutionize how organizations approach fine-tuning, making it accessible to a broader range of developers and researchers. This could foster innovation across the industry, moving away from reliance on a few dominant players.
To learn more about ORPO and its implications, check out the following videos:
- ORPO Explained: Superior LLM Alignment Technique vs. DPO/RLHF
- Fine Tune Llama 3 using ORPO
As we explore the future of AI, the emergence of ORPO signifies a shift towards more equitable access to powerful LLM training methods.