Large Model Post-Training in Practice: From SFT to RL — The Complete Tech Stack
A systematic walkthrough of key post-training methods (SFT, RLHF, OPD, PEFT) with quantitative evaluation of general capability loss
Introduction: The Three-Layer Architecture and Core Challenges of Post-Training
Large model development typically follows a two-stage paradigm: "pre-training → post-training." Pre-training gives the model general language abilities and world knowledge from massive amounts of unlabeled data, while post-training adapts the model to specific tasks or behavioral norms using a small amount of high-quality data.
Post-training can be further divided into three layers:
The core challenge of post-training is: how to improve target task performance while preserving the general capabilities endowed by pre-training as much as possible. This article systematically explains the key methods in post-training, including supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning from human feedback (RLHF), and the latest on-policy distillation (OPD), and provides quantitative methods for evaluating general capability loss.
Supervised Fine-Tuning (SFT): Fundamental but Beware of "Incomplete Learning"
SFT is the classic post-training method: it adjusts model behavior by minimizing cross-entropy loss on high-quality labeled data. It has a widely overlooked problem, however: the Incomplete Learning Phenomenon (ILP).
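To make the objective concrete, here is a minimal sketch of the token-level cross-entropy that SFT minimizes. It uses plain Python rather than a training framework, and the names `sft_loss` and `loss_mask` are illustrative assumptions; masking prompt tokens so that only response tokens contribute is a common convention, not something this article specifies.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(logits_per_token, target_ids, loss_mask):
    """Mean negative log-likelihood of the target tokens.

    loss_mask[i] == 1 means position i counts toward the loss
    (assumed here to mark response tokens); 0 means it is skipped.
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_token, target_ids, loss_mask):
        if not keep:
            continue
        total += -math.log(softmax(logits)[target])
        count += 1
    return total / max(count, 1)
```

With uniform logits over a vocabulary of size V, the loss equals log V, which is a handy sanity check for an untrained output layer.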
Five Root Causes of Incomplete Learning
Research by Tencent Hunyuan and the University of New South Wales (ACL 2026) first systematically revealed ILP: even when training loss converges and the learning rate decays, about 15.3% of samples are still answered incorrectly when re-evaluated on the training set. The authors attributed this to five root causes:
Targeted Mitigation Strategies
For different root causes, researchers have proposed differentiated interventions:
These strategies yield significant improvements on medical and legal benchmarks, e.g., +12.5% on MedQA and +9.4% to +14.1% on LegalBench.
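The re-evaluation check behind the ILP measurement (re-testing the model on its own training set) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `unlearned_rate`, the input layout, and the rule that a sample counts as unlearned when none of its N sampled answers is correct (pass@N = 0) are assumptions made for illustration.

```python
def unlearned_rate(attempts_by_sample):
    """Return (fraction unlearned, list of unlearned sample ids).

    attempts_by_sample maps a training-sample id to a list of booleans,
    one per sampled answer, where True means the answer was correct.
    A sample is treated as unlearned if no attempt is correct
    (an assumed pass@N = 0 criterion).
    """
    unlearned = [
        sample_id
        for sample_id, attempts in attempts_by_sample.items()
        if not any(attempts)
    ]
    return len(unlearned) / len(attempts_by_sample), unlearned
```

The returned id list is the natural input to a root-cause attribution step, since each unlearned sample then needs to be assigned to a cause before an intervention is chosen.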
Parameter-Efficient Fine-Tuning (PEFT): Balancing Plasticity and Stability
As model sizes grow, the computational cost of full fine-tuning becomes prohibitive. Parameter-efficient fine-tuning (PEFT) emerged as a solution, updating only a small number of parameters to adapt to downstream tasks. But how do PEFT methods trade off downstream performance versus general capability retention?
PEFT-Arena: A Dual-Axis Evaluation Framework
Proposed by institutions such as the Chinese University of Hong Kong, PEFT-Arena re-examines PEFT methods from the perspective of the stability–plasticity trade-off. Its core idea is:
Traditional evaluation only looks at downstream accuracy, while PEFT-Arena simultaneously evaluates general capability retention, visualizing results in a two-dimensional graph: the horizontal axis represents general capability, and the vertical axis represents target domain performance. The ideal method is located in the upper right corner.
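One way to read such a two-dimensional plot programmatically is to keep only the methods that no other method beats on both axes at once (the Pareto front). The sketch below assumes each method is summarized by a `(general, target)` score pair; the function name and the placeholder method names are hypothetical, and this is not presented as PEFT-Arena's own tooling.

```python
def pareto_front(results):
    """Return the names of methods that are not dominated.

    results maps a method name to (general_score, target_score).
    A method is dominated if another method is at least as good on both
    axes and strictly better on at least one.
    """
    front = []
    for name, (general, target) in results.items():
        dominated = any(
            g2 >= general and t2 >= target and (g2 > general or t2 > target)
            for other, (g2, t2) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Illustrative placeholder scores, not measured results.
scores = {
    "method_a": (0.70, 0.60),
    "method_b": (0.65, 0.72),
    "method_c": (0.60, 0.58),
    "method_d": (0.70, 0.55),
}
print(pareto_front(scores))
```

Methods on the front are the candidates closest to the upper right corner; everything else is strictly worse than some alternative.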
Trade-off Performance of Different PEFT Methods
The experiments train Qwen2.5-7B and Llama3.2-3B-Instruct with SFT and with reinforcement learning with verifiable rewards (RLVR) on two target domains (mathematics and medical reasoning), then measure general capability retention on tasks such as IFEval, Natural Questions, and BBH. Key findings:
Mechanism of Forgetting: Destruction of Activation Space Geometric Structure
PEFT-Arena further analyzes the causes of forgetting from two perspectives, weight space and activation space:
Because of its orthogonal parameterization, Orthogonal Fine-Tuning (OFT) tends to preserve the geometric structure of representations and therefore exhibits a better trade-off. This finding provides theoretical guidance for selecting PEFT methods.
Reinforcement Learning from Human Feedback (RLHF) and RLVR
RLHF trains a reward model on human preference data and then optimizes the policy model with reinforcement learning. However, traditional RLHF suffers from sparse reward signals: a reasoning trajectory of thousands of tokens receives only a single 0/1 correctness signal at the end, which makes credit assignment difficult.
RLVR: Reinforcement Learning with Verifiable Rewards
RLVR (Reinforcement Learning with Verifiable Rewards) is the standard training paradigm for reasoning models, using verifiable rewards (e.g., whether a math answer is correct) instead of human preferences. Although simple and effective, the sparse reward problem persists.
Self-Distillation Methods to Address Sparse Rewards
Recent work proposes self-distillation as a way to address the sparse reward problem. The core idea is to let the same model act as both student and teacher: the teacher sees privileged context (e.g., the correct answer or a correct trajectory) and uses it to provide token-level dense feedback to the student.
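The contrast between a sparse outcome reward and token-level dense feedback can be sketched as below. The exact form of the dense signal differs between methods; the log-probability difference used here, and the names `sparse_feedback` and `dense_feedback`, are assumptions chosen for illustration rather than any specific paper's formula.

```python
def sparse_feedback(num_tokens, final_reward):
    """Outcome-only reward: zero everywhere except the last token."""
    return [0.0] * (num_tokens - 1) + [final_reward]

def dense_feedback(student_logprobs, teacher_logprobs):
    """Per-token signal on the student's own sampled tokens.

    student_logprobs[i]: log-prob the student assigned to its i-th token.
    teacher_logprobs[i]: log-prob the same model assigns to that token
    when it also sees the privileged context (assumed setup).
    A positive value means the teacher endorses the token more strongly.
    """
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]
```

For a three-token trajectory, `sparse_feedback(3, 1.0)` carries information only at the final position, while `dense_feedback` assigns a separate value to every position.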
Representative works include:
RLRT: The Rebellious Student's Reverse Signal
RLRT (Rebellious Student), proposed by Microsoft and KAIST, inverts the teacher's role. On successful trajectories the student has already produced a correct answer, so the teacher's signal is redundant. RLRT therefore reverses that signal, rewarding the student for deviating from the teacher's preferred tokens and thereby protecting the student's own exploration paths. Experiments show that Qwen3-4B-Base improves by 18% on 6 math benchmarks compared to standard GRPO.
On-Policy Distillation (OPD): A New Paradigm for Post-Training
OPD (On-Policy Distillation) has become the third major standard technique for large models after SFT and RL. It combines the distribution matching advantage of on-policy RL with the dense signal advantage of distillation, and is adopted by mainstream models such as Qwen3, GLM-5, MiMo-V2, and DeepSeek-V4.
Core Principle of OPD
Traditional off-policy distillation suffers from distribution mismatch: during training, the student learns the teacher's distribution, but during inference, it generates from its own distribution, leading to poor performance on long sequences. OPD's solution is:
Reverse KL has a mode-seeking property, allowing the student to focus on learning the teacher's high-probability modes rather than averaging over all possible outputs, which is especially important for reasoning tasks.
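The mode-seeking behavior can be checked numerically. Reverse KL is KL(student || teacher), where the expectation is taken under the student, while forward KL is KL(teacher || student). The sketch below uses a toy three-token vocabulary with made-up probabilities; the two candidate students (`peaked` and `spread`) are illustrative assumptions.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(student, teacher):
    """Expectation under the student: the objective described above."""
    return kl(student, teacher)

def forward_kl(student, teacher):
    """Expectation under the teacher: the usual off-policy distillation loss."""
    return kl(teacher, student)

teacher = [0.495, 0.01, 0.495]  # two high-probability modes
peaked = [0.98, 0.01, 0.01]     # student commits to one teacher mode
spread = [0.34, 0.32, 0.34]     # student averages over all tokens

for name, student in [("peaked", peaked), ("spread", spread)]:
    print(name, round(reverse_kl(student, teacher), 3),
          round(forward_kl(student, teacher), 3))
```

Reverse KL scores the peaked student better than the spread one, while forward KL ranks them the other way around, because forward KL heavily penalizes missing a teacher mode and reverse KL heavily penalizes putting mass where the teacher has little.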
Two Core Conditions for OPD Success
Systematic research by Tsinghua University reveals the key factors for OPD success or failure:
Underlying Mechanism of OPD
Successful OPD is essentially a progressive alignment of high-probability overlapping tokens between teacher and student. Statistics show that 97%-99% of effective gradients come from overlapping tokens. As training progresses, the overlapping region self-reinforces, forming a positive feedback loop.
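A simple way to monitor this overlap during training is to compare the top-k tokens of the two next-token distributions at each position. The sketch below is an assumption about how such a statistic could be computed; the top-k definition, the default `k`, and the function names are hypothetical and may differ from the metric used in the original study.

```python
def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens (ties keep index order)."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])

def overlap_ratio(student_probs, teacher_probs, k=2):
    """Fraction of the top-k token ids shared by student and teacher."""
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    return len(shared) / k

def mean_overlap(student_seq, teacher_seq, k=2):
    """Average overlap ratio across all positions of one trajectory."""
    ratios = [overlap_ratio(s, t, k) for s, t in zip(student_seq, teacher_seq)]
    return sum(ratios) / len(ratios)
```

Tracking `mean_overlap` over training steps would show whether the overlapping region is growing, which is the positive feedback loop described above.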
Practical Guidelines
When OPD training fails, the following strategies can be adopted:
Limitations of OPD
OPD's token-level dense rewards come with an inherent cost: reward quality decreases as trajectory depth increases. Experiments show that OPD's effectiveness peaks at sequence lengths of 3K-7K tokens and stagnates or declines beyond 10K. On long sequences, the teacher's guidance on tail tokens may be inaccurate, leading to training instability.
Quantitative Methods for Evaluating General Capability Loss
During post-training, general capability loss is a non-negligible issue. Below are several quantitative methods:
Dual-Axis Evaluation Framework
As shown in PEFT-Arena, target domain performance and general capability retention are treated as two independent dimensions, plotted on a two-dimensional graph. General capability can be evaluated using benchmarks such as IFEval (instruction following), Natural Questions (factual knowledge), and BBH (reasoning).
Capability-Conditioned Drift (CSD)
CSD quantifies the activation perturbation caused by weight updates on general domain versus target domain data. Higher CSD on general domain data indicates more severe forgetting.
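As a rough illustration of "activation perturbation", the sketch below measures how far hidden activations move between the base and fine-tuned model on a given data set, relative to the base activation norm. This formula and the name `relative_drift` are assumptions for illustration only and are not the published definition of CSD.

```python
import math

def relative_drift(base_acts, tuned_acts):
    """Mean of ||h_tuned - h_base|| / ||h_base|| over examples.

    base_acts and tuned_acts are lists of activation vectors (lists of
    floats) collected at the same layer for the same inputs.
    """
    ratios = []
    for h_base, h_tuned in zip(base_acts, tuned_acts):
        diff = math.sqrt(sum((t - b) ** 2 for b, t in zip(h_base, h_tuned)))
        norm = math.sqrt(sum(b * b for b in h_base))
        ratios.append(diff / norm)
    return sum(ratios) / len(ratios)
```

Computing this once on general domain inputs and once on target domain inputs gives the comparison described above: a large value on general data suggests the update disturbed representations it did not need to change.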
Representation Geometry Metrics
These metrics are strongly correlated with forgetting and can serve as evaluation tools.
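As one concrete example of a representation geometry metric, the following is a minimal pure-Python sketch of linear CKA (centered kernel alignment) between two sets of activations for the same inputs. It follows the standard linear CKA formula; applying it to base versus fine-tuned activations, and the helper names, are assumptions about how it would be used here.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total

def linear_cka(x, y):
    """Linear CKA in [0, 1]; x and y hold one activation row per input."""
    x, y = center(x), center(y)
    return cross_frobenius_sq(x, y) / math.sqrt(
        cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y)
    )
```

Linear CKA equals 1 when the two representations differ only by a rotation and a uniform scale, so a drop below 1 on general domain inputs indicates a change in geometry rather than a mere change of basis.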
Interpolation Path Analysis
Perform parameter interpolation between the base model and the fine-tuned model, observing how target performance and general performance change with the interpolation coefficient. The final checkpoint is often not the optimal trade-off point; intermediate interpolation points may recover general capability while retaining most of the target gains.
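The interpolation sweep can be sketched as follows. Parameters are represented as plain dictionaries of float lists to keep the example self-contained; the `evaluate` callback, the `max_general_drop` tolerance, and the selection rule (highest target score among points that keep general performance within the tolerance) are assumptions for illustration.

```python
def interpolate(base, tuned, alpha):
    """Return (1 - alpha) * base + alpha * tuned for every named parameter."""
    return {
        name: [(1 - alpha) * b + alpha * t for b, t in zip(base[name], tuned[name])]
        for name in base
    }

def best_alpha(base, tuned, evaluate, alphas, max_general_drop):
    """Pick the interpolation coefficient with the best target score.

    evaluate(params) must return (target_score, general_score).
    Points whose general score falls more than max_general_drop below
    the base model's general score are rejected.
    """
    _, base_general = evaluate(base)
    best = None
    for alpha in alphas:
        target, general = evaluate(interpolate(base, tuned, alpha))
        if general < base_general - max_general_drop:
            continue
        if best is None or target > best[1]:
            best = (alpha, target)
    return best[0] if best else None
```

Here alpha = 0 is the base model and alpha = 1 is the final fine-tuned checkpoint; in practice `evaluate` would run the target benchmark and the general benchmarks on a model loaded with the interpolated weights.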
Practical Recommendations and Framework Selection
Choosing the Right Post-Training Method
Tools and Frameworks
FAQ
What is OPD? How is it different from traditional distillation? OPD (On-Policy Distillation) is a distillation method in which the student model generates its own trajectories and the teacher model provides token-level dense feedback at each step, with reverse KL divergence as the objective. Unlike traditional off-policy distillation, OPD addresses the distribution mismatch between training and inference, which yields more stable training and better performance on long sequences.
How can I quantify the general capability loss caused by fine-tuning? You can use the dual-axis evaluation framework (target domain performance vs. general capability retention), or quantify the loss with Capability-Conditioned Drift (CSD) and representation geometry metrics (Procrustes residual, Gram matrix distortion, CKA). Interpolation path analysis can also be used to find the optimal trade-off point.
Why can a strong teacher sometimes cause OPD training to fail? The core conditions for OPD success are consistency of thinking patterns between teacher and student and the teacher possessing new capabilities. If a strong teacher has a low initial overlap ratio with the student (mismatched thinking patterns), or is merely a larger model from the same pipeline (no new capabilities), it cannot provide effective gradient signals, leading to training failure.
Among PEFT methods, which performs best in retaining general capability? Orthogonal Fine-Tuning (OFT) achieves the best trade-off between target domain performance and general capability retention. By preserving the geometric structure of the weight spectrum, it minimizes damage to general representations while adapting to the target task.
How can I solve the incomplete learning problem in SFT? First, identify unlearned samples using MC conversion and pass@N detection. Then attribute them to the five root causes (missing knowledge, knowledge conflict, data conflict, left-side forgetting, insufficient optimization) and apply targeted interventions: continued pre-training (CPT) to supplement knowledge, dynamic bucketing to resolve conflicts, global shuffling to combat forgetting, and gradual epoch increases to improve optimization.
What are the characteristics of the post-training framework LiteScale? LiteScale achieves asynchronous training through gradient accumulation, decoupling rollout and training processes to improve resource utilization. It supports GKD online knowledge distillation and features the LogitsExpress module for efficient transmission of teacher logits, supporting point-to-point communication under different DP-TP configurations.