Large Model Post-Training in Practice: From SFT to RL — The Complete Tech Stack

A systematic walkthrough of key post-training methods (SFT, RLHF, OPD, PEFT) with quantitative evaluation of general capability loss

By AI Skill Navigation Editorial Team | Published June 13, 2026

Introduction: The Three-Layer Architecture and Core Challenges of Post-Training

Large model development typically follows a two-stage paradigm: "pre-training → post-training." Pre-training endows the model with general language abilities and world knowledge through massive unsupervised data, while post-training adapts the model to specific tasks or behavioral norms using a small amount of high-quality data.

In finer detail, the overall pipeline can be divided into three layers, with post-training as the last:

Pre-training: Basic language modeling, learning grammar, reasoning, common sense, etc.

Mid-training: Domain knowledge injection, such as specialized corpora for code, medicine, law, etc.

Post-training: Behavior alignment and task adaptation, including instruction following, reasoning capability enhancement, etc.

The core challenge of post-training is: how to improve target task performance while preserving the general capabilities endowed by pre-training as much as possible. This article systematically explains the key methods in post-training, including supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning from human feedback (RLHF), and the latest on-policy distillation (OPD), and provides quantitative methods for evaluating general capability loss.

Supervised Fine-Tuning (SFT): Fundamental but Beware of "Incomplete Learning"

SFT is the most classic post-training method, which adjusts model behavior by minimizing cross-entropy loss on high-quality labeled data. However, SFT has a widely overlooked problem: the Incomplete Learning Phenomenon (ILP).
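For reference, the cross-entropy objective mentioned above can be sketched in a few lines of plain Python. This is a toy illustration, not a training loop: the function names, the three-token vocabulary, and the convention of masking out prompt tokens are assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_loss(logits_per_token, target_ids, loss_mask):
    """Mean negative log-likelihood over the positions where loss_mask is 1.

    Assumption: loss_mask is 0 on prompt tokens and 1 on response tokens.
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_token, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / max(count, 1)

# Toy example: vocabulary of 3 tokens, sequence of 3 positions.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 1, 2]
mask = [0, 1, 1]  # the first position is treated as a prompt token and ignored
loss = sft_loss(logits, targets, mask)
```

A low average loss does not guarantee that every training sample is answered correctly, which is where the phenomenon introduced above comes in.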

Five Root Causes of Incomplete Learning

Research by Tencent Hunyuan and the University of New South Wales (ACL 2026) first systematically revealed ILP: even when training loss converges and the learning rate decays, about 15.3% of samples are still answered incorrectly when re-evaluated on the training set. The authors attributed this to five root causes:

Missing Pre-training Knowledge: The base model itself lacks the knowledge required to solve the sample; SFT cannot create something from nothing.

Conflict Between SFT and Base Model Knowledge: Pre-training has formed stubborn incorrect beliefs that are hard to correct with SFT supervision signals.

Internal Conflict in SFT Data: Semantically similar samples have contradictory labels, causing optimization signals to cancel each other out.

Left-side Forgetting: When data from multiple tasks or domains are concatenated sequentially, early samples are overwritten by later ones.

Insufficient Optimization: Complex or long-tail samples are not fully trained.

Targeted Mitigation Strategies

For different root causes, researchers have proposed differentiated interventions:

Missing Knowledge: Use continual pre-training (CPT) to supplement relevant knowledge.

Knowledge Conflict: Use CPT to calibrate the model's internal representations.

Data Conflict: Dynamic bucketing, placing contradictory samples into different batches.

Left-side Forgetting: Globally shuffle the data order and dynamically resample.

Insufficient Optimization: Gradually increase training epochs.

The authors report significant gains from these strategies on benchmarks in fields such as medicine and law, e.g., MedQA +12.5% and LegalBench +9.4% to +14.1%.
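To make the "dynamic bucketing" idea concrete, here is a minimal sketch that greedily assigns samples to batches so that no batch contains a known contradictory pair. It is an illustration under stated assumptions, not the authors' algorithm: the function name is hypothetical, and the list of conflicting pairs is assumed to be given (the source does not say how conflicts are detected).

```python
def bucket_without_conflicts(sample_ids, conflicts, batch_size):
    """Greedily place each sample in the first batch that has room and no conflict."""
    neighbours = {s: set() for s in sample_ids}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)

    batches = []
    for s in sample_ids:
        for batch in batches:
            if len(batch) < batch_size and not neighbours[s] & set(batch):
                batch.append(s)
                break
        else:
            # No existing batch can take this sample, so open a new one.
            batches.append([s])
    return batches

# Hypothetical sample ids; "a"/"b" and "c"/"d" carry contradictory labels.
samples = ["a", "b", "c", "d", "e"]
conflicts = [("a", "b"), ("c", "d")]
batches = bucket_without_conflicts(samples, conflicts, batch_size=2)
```

In this toy run the contradictory pairs end up in different batches, so their gradients are never averaged within the same optimization step.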

Parameter-Efficient Fine-Tuning (PEFT): Balancing Plasticity and Stability

As model sizes grow, the computational cost of full fine-tuning becomes prohibitive. Parameter-efficient fine-tuning (PEFT) emerged as a solution, updating only a small number of parameters to adapt to downstream tasks. But how do PEFT methods trade off downstream performance versus general capability retention?

PEFT-Arena: A Dual-Axis Evaluation Framework

Proposed by institutions such as the Chinese University of Hong Kong, PEFT-Arena re-examines PEFT methods from the perspective of the stability–plasticity trade-off. Its core idea is:

Plasticity: How much the model learns in the target domain.

Stability: How much of the pre-trained general capability the model retains.

Traditional evaluation only looks at downstream accuracy, while PEFT-Arena simultaneously evaluates general capability retention, visualizing results in a two-dimensional graph: the horizontal axis represents general capability, and the vertical axis represents target domain performance. The ideal method is located in the upper right corner.
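One simple way to read such a two-dimensional plot is to keep only the methods that no other method beats on both axes at once (the Pareto front). The sketch below assumes each method is summarized by a (general capability, target performance) pair; the method names and scores are hypothetical and are not taken from PEFT-Arena.

```python
def pareto_front(results):
    """Return the names of methods not dominated on (general, target) scores."""
    front = []
    for name, (gen, tgt) in results.items():
        dominated = any(
            (g >= gen and t >= tgt) and (g > gen or t > tgt)
            for other, (g, t) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical scores: (general capability, target domain performance).
results = {
    "method_a": (0.60, 0.80),
    "method_b": (0.75, 0.78),
    "method_c": (0.70, 0.70),
    "method_d": (0.80, 0.55),
}
front = pareto_front(results)
```

Here method_c is dropped because method_b is better on both axes; the remaining three represent genuinely different trade-offs.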

Trade-off Performance of Different PEFT Methods

Experiments using Qwen2.5-7B and Llama3.2-3B-Instruct conducted SFT and reinforcement learning with verifiable rewards (RLVR) training on two target domains (mathematics and medical reasoning) and evaluated general capability retention with tasks such as IFEval, Natural Questions, and BBH. Key findings:

| Method | Target Domain Performance | General Capability Retention | Characteristics |
| --- | --- | --- | --- |
| Full Fine-tuning | Highest | Significant drop | High computational cost, severe forgetting |
| LoRA | Relatively high | Moderate | Low-rank approximation, concentrated updates |
| PiSSA | Relatively high | Poor | Strong interaction with principal singular directions, large structural perturbation |
| VeRA | Low | Good | Stable general capability, but limited target improvement |
| OFT (Orthogonal Fine-tuning) | Relatively high | Good | Preserves weight spectral geometry, best trade-off |

Mechanism of Forgetting: Destruction of Activation Space Geometric Structure

PEFT-Arena further analyzes the causes of forgetting from the weight space and activation space:

Weight Space: Quantifies the impact of weight updates on different data distributions through Capability-Conditioned Drift (CSD). CSD on general domain data is strongly correlated with forgetting.

Activation Space: Uses metrics such as Procrustes residual, Gram matrix distortion, and CKA to compare representation changes before and after fine-tuning. The results show that the key to forgetting is not "how much the activations moved," but "whether the geometric structure of general representations is destroyed."

OFT, due to its orthogonal parameterization, tends to preserve the geometric structure of representations, thus exhibiting a better trade-off. This finding provides theoretical guidance for selecting PEFT methods.
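The intuition behind the orthogonal parameterization can be shown with a toy example: multiplying a weight matrix by an orthogonal matrix leaves all pairwise inner products between its rows unchanged, whereas an additive update generally does not. This is only a 2-D illustration in weight space with made-up numbers; it is neither an OFT implementation nor the activation-space analysis from PEFT-Arena.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def gram(w):
    """Pairwise inner products between the rows (toy neurons) of w."""
    return matmul(w, transpose(w))

def rotation(theta):
    """A 2-D rotation, the simplest example of an orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

W = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]        # three toy neurons in 2-D
W_orth = matmul(W, rotation(0.7))                # multiplicative orthogonal update
delta = [[0.3, -0.2], [0.0, 0.4], [-0.5, 0.1]]   # arbitrary additive update
W_add = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta)]

orth_change = max_abs_diff(gram(W), gram(W_orth))  # zero up to floating-point error
add_change = max_abs_diff(gram(W), gram(W_add))    # clearly non-zero
```

The rotated weights differ from the originals entry by entry, yet their relative geometry is intact, which mirrors the "how much moved" versus "was the structure destroyed" distinction above.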

Reinforcement Learning from Human Feedback (RLHF) and RLVR

RLHF trains a reward model using human preference data and then optimizes the policy model with reinforcement learning. However, outcome-level rewards are sparse: a reasoning trajectory of thousands of tokens receives only a single signal at the end (a 0/1 correctness signal in the verifiable-reward case), which makes credit assignment difficult.

RLVR: Reinforcement Learning with Verifiable Rewards

RLVR (Reinforcement Learning with Verifiable Rewards) is the standard training paradigm for reasoning models, using verifiable rewards (e.g., whether a math answer is correct) instead of human preferences. Although simple and effective, the sparse reward problem persists.
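A minimal sketch of what "verifiable reward" and its sparsity look like in practice is given below. It pairs an exact-match checker with a GRPO-style group-relative advantage (rewards standardized within the rollouts of one prompt). The exact-match rule, the use of the population standard deviation, and the epsilon are assumptions chosen for the example; real pipelines typically use more robust answer parsing.

```python
import statistics

def verifiable_reward(model_answer, reference_answer):
    """1.0 if the final answer matches the reference after trimming, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards across the rollouts sampled for a single prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts for one prompt whose reference answer is "42".
answers = ["42", "41", " 42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

Each rollout ends up with one scalar advantage that is shared by every token in it, regardless of which tokens actually caused the success or failure. That is the credit assignment gap the following methods try to close.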

Self-Distillation Methods to Address Sparse Rewards

Researchers have recently proposed using self-distillation to address the sparse reward problem. The core idea is to let the same model act as both student and teacher: the teacher sees privileged context (e.g., the correct answer or a correct trajectory) and uses it to provide token-level dense feedback to the student.

Representative works include:

SDPO: The teacher uses the student's own generated correct trajectory as context, providing signals through logit-level KL divergence.

SRPO: Builds on SDPO by introducing sample routing, using teacher signals only on incorrect trajectories.

RLSD: The teacher uses the ground-truth answer as context, using the teacher-student token probability ratio as a per-token weight for the advantage.

RLRT: The Rebellious Student's Reverse Signal

RLRT (Rebellious Student), proposed by Microsoft and KAIST, inverts the teacher's role. On successful trajectories the student has already answered correctly, so the teacher's signal is redundant. RLRT therefore reverses it, rewarding the student for deviating from the teacher's preferred tokens and thereby protecting the student's own exploration paths. Experiments show that Qwen3-4B-Base improves by 18% on 6 math benchmarks compared to standard GRPO.

On-Policy Distillation (OPD): A New Paradigm for Post-Training

OPD (On-Policy Distillation) has become the third major standard technique for large models after SFT and RL. It combines the distribution matching advantage of on-policy RL with the dense signal advantage of distillation, and is adopted by mainstream models such as Qwen3, GLM-5, MiMo-V2, and DeepSeek-V4.

Core Principle of OPD

Traditional off-policy distillation suffers from distribution mismatch: during training, the student learns the teacher's distribution, but during inference, it generates from its own distribution, leading to poor performance on long sequences. OPD's solution is:

The student model generates complete reasoning trajectories (rollouts) by itself.

At each prefix step of the student's generated trajectory, the teacher model's token-level log probabilities are used as dense reward signals.

The optimization objective is to minimize the reverse KL divergence on the student's trajectory.

Reverse KL has a mode-seeking property, allowing the student to focus on learning the teacher's high-probability modes rather than averaging over all possible outputs, which is especially important for reasoning tasks.
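The steps above and the mode-seeking claim can be illustrated with toy categorical distributions. The sketch below computes both KL directions and the per-token signal log p_teacher(token) - log p_student(token) on tokens sampled by the student, which is a single-sample estimate of the negative reverse KL. All probabilities are made up, and the function names are hypothetical.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) for two categorical distributions over one vocabulary."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0)

def forward_kl(student, teacher):
    """KL(teacher || student), the direction minimized by off-policy distillation."""
    return reverse_kl(teacher, student)

def per_token_rewards(student_logprobs, teacher_logprobs):
    """Dense reward on student-sampled tokens: teacher log-prob minus student log-prob."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

teacher = [0.495, 0.01, 0.495]  # two equally good modes, one bad token
peaked = [0.98, 0.01, 0.01]     # student commits to a single teacher mode
spread = [0.34, 0.32, 0.34]     # student averages over all tokens

rkl_peaked, rkl_spread = reverse_kl(peaked, teacher), reverse_kl(spread, teacher)
fkl_peaked, fkl_spread = forward_kl(peaked, teacher), forward_kl(spread, teacher)

# Rewards for two sampled tokens, given hypothetical log-probabilities.
rewards = per_token_rewards([-0.1, -2.0], [-0.2, -0.5])
```

In this toy case reverse KL prefers the peaked student while forward KL prefers the spread-out one, since reverse KL heavily penalizes putting mass on the token the teacher considers unlikely.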

Two Core Conditions for OPD Success

Systematic research by Tsinghua University reveals the key factors for OPD success or failure:

Consistency of Thinking Patterns: The initial overlap ratio between the teacher and student must be sufficiently high. Experiments show that among two teachers with similar scores but different training pipelines, the one with a higher overlap ratio with the student yields significantly better distillation results.

Teacher Possesses New Capabilities: The teacher must have genuinely new capabilities that the student has never encountered. A stronger teacher (larger parameter count) from the same pipeline and data cannot provide effective signals, while a teacher that has undergone additional RL training improves results by more than 3 times.

Underlying Mechanism of OPD

Successful OPD is essentially a progressive alignment of high-probability overlapping tokens between teacher and student. Statistics show that 97%-99% of effective gradients come from overlapping tokens. As training progresses, the overlapping region self-reinforces, forming a positive feedback loop.
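The source does not spell out how the overlap ratio is computed, so the following is only one plausible reading: at each step, compare the student's and teacher's top-k token sets and average the shared fraction over the trajectory. The choice of k, the function names, and the toy distributions are all assumptions.

```python
def top_k_ids(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def overlap_ratio(student_probs, teacher_probs, k):
    """Fraction of the student's top-k tokens that also appear in the teacher's top-k."""
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    return len(shared) / k

def mean_overlap(student_steps, teacher_steps, k):
    """Average the per-step overlap along one student-generated trajectory."""
    ratios = [overlap_ratio(s, t, k) for s, t in zip(student_steps, teacher_steps)]
    return sum(ratios) / len(ratios)

# Two decoding steps over a 4-token vocabulary (hypothetical probabilities).
student_steps = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.2, 0.3, 0.4]]
teacher_steps = [[0.4, 0.4, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]]
score = mean_overlap(student_steps, teacher_steps, k=2)
```

A statistic of this kind, measured before training starts, could serve as a cheap check on whether a given teacher and student are compatible.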

Practical Guidelines

When OPD training fails, the following strategies can be adopted:

Off-policy cold start: First perform a round of SFT on the student using teacher-generated trajectories to raise the initial overlap ratio.

Teacher-aligned prompt selection: Use prompts from the teacher's post-training phase for OPD, but mix in some OOD prompts to maintain generation diversity.

Limitations of OPD

OPD's token-level dense rewards come with an inherent cost: reward quality decreases as trajectory depth increases. Experiments show that OPD's effectiveness peaks at sequence lengths of 3K-7K tokens and stagnates or declines beyond 10K. On long sequences, the teacher's guidance on tail tokens may be inaccurate, leading to training instability.

Quantitative Methods for Evaluating General Capability Loss

During post-training, general capability loss is a non-negligible issue. Below are several quantitative methods:

Dual-Axis Evaluation Framework

As shown in PEFT-Arena, target domain performance and general capability retention are treated as two independent dimensions, plotted on a two-dimensional graph. General capability can be evaluated using benchmarks such as IFEval (instruction following), Natural Questions (factual knowledge), and BBH (reasoning).

Capability-Conditioned Drift (CSD)

CSD quantifies the activation perturbation caused by weight updates on general domain versus target domain data. Higher CSD on general domain data indicates more severe forgetting.

Representation Geometry Metrics

Procrustes Residual: Measures structural changes in representations before and after fine-tuning that cannot be aligned by orthogonal transformations.

Gram Matrix Distortion: Compares changes in the pairwise similarity matrix of samples.

CKA: Measures representation similarity.

These metrics are strongly correlated with forgetting and can serve as evaluation tools.
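Two of these metrics are easy to sketch without any dependencies. The code below implements linear CKA and a relative Gram-matrix change on small lists of activations; the exact definitions used by PEFT-Arena are not given here, so treat both formulas as common textbook variants rather than the paper's. The Procrustes residual is omitted because it needs an SVD.

```python
import math

def center(x):
    """Subtract the per-feature mean from an n x d list of activations."""
    means = [sum(col) / len(x) for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]

def gram(x):
    """n x n matrix of pairwise inner products between samples."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in x] for r1 in x]

def frob_inner(a, b):
    return sum(u * v for ra, rb in zip(a, b) for u, v in zip(ra, rb))

def linear_cka(x, y):
    """Linear CKA between two activation sets computed on the same n inputs."""
    k_x, k_y = gram(center(x)), gram(center(y))
    return frob_inner(k_x, k_y) / math.sqrt(frob_inner(k_x, k_x) * frob_inner(k_y, k_y))

def gram_distortion(x, y):
    """Relative Frobenius change of the centered Gram matrix (one possible definition)."""
    k_x, k_y = gram(center(x)), gram(center(y))
    diff = [[u - v for u, v in zip(ra, rb)] for ra, rb in zip(k_x, k_y)]
    return math.sqrt(frob_inner(diff, diff) / frob_inner(k_x, k_x))

base = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
rotated = [[-r[1], r[0]] for r in base]        # every point moves, geometry intact
squashed = [[r[0], 0.1 * r[1]] for r in base]  # one axis collapsed, geometry changed
```

On this toy data the rotated activations score a CKA of 1 and zero Gram distortion even though every point moved, while the squashed activations score noticeably worse on both.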

Interpolation Path Analysis

Perform parameter interpolation between the base model and the fine-tuned model, observing how target performance and general performance change with the interpolation coefficient. The final checkpoint is often not the optimal trade-off point; intermediate interpolation points may recover general capability while retaining most of the target gains.
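A minimal sketch of such a sweep follows. Parameters are represented as a dict of flat lists, and the evaluator is a toy stand-in that returns a (target, general) score pair; in a real setting it would run the benchmarks. The selection rule (best target score within an allowed general-capability drop) is an assumption for the example, not something the source prescribes.

```python
def interpolate(base, tuned, alpha):
    """theta(alpha) = (1 - alpha) * base + alpha * tuned, per parameter tensor."""
    return {
        name: [(1 - alpha) * b + alpha * t for b, t in zip(base[name], tuned[name])]
        for name in base
    }

def sweep(base, tuned, evaluate, steps=5):
    """Evaluate evenly spaced points from the base (alpha=0) to the tuned model (alpha=1)."""
    rows = []
    for i in range(steps + 1):
        alpha = i / steps
        target, general = evaluate(interpolate(base, tuned, alpha))
        rows.append((alpha, target, general))
    return rows

def pick(rows, max_general_drop):
    """Best target score among points whose general score stays close to alpha=0."""
    baseline = rows[0][2]
    ok = [r for r in rows if baseline - r[2] <= max_general_drop]
    return max(ok, key=lambda r: r[1])

# Toy stand-ins: flattened weights and a hypothetical evaluator.
base = {"w": [0.0, 0.0]}
tuned = {"w": [1.0, 2.0]}

def toy_evaluate(params):
    a = params["w"][0]                   # equals alpha in this toy setup
    return 1 - (1 - a) ** 2, 1 - a ** 2  # target saturates early, general drops late

rows = sweep(base, tuned, toy_evaluate, steps=5)
best = pick(rows, max_general_drop=0.2)
```

With these toy curves the selected point is an intermediate alpha rather than the final checkpoint, which is the pattern described above.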

Practical Recommendations and Framework Selection

Choosing the Right Post-Training Method

| Scenario | Recommended Method | Reason |
| --- | --- | --- |
| Quick adaptation to a single task | PEFT (e.g., LoRA) | Low computational cost, decent performance |
| Pursuing extreme performance | Full fine-tuning + general capability monitoring | But beware of forgetting |
| Enhancing reasoning ability | OPD | Dense signals, distribution matching |
| Multi-task integration | Multi-teacher OPD | Integrate knowledge in logit space |

Tools and Frameworks

LiteScale: A post-training framework supporting asynchronous training and online knowledge distillation. It decouples rollout and training via gradient accumulation, and it supports GKD (Generalized Knowledge Distillation) as well as the LogitsExpress module for efficient logit transmission.

TRL: Hugging Face's reinforcement learning library, supporting algorithms such as PPO and GRPO.

DeepSpeed: Microsoft's distributed training framework, usable for large-scale post-training.

FAQ

What is OPD? How is it different from traditional distillation?

OPD (On-Policy Distillation) is a distillation method in which the student model generates its own trajectories and the teacher model provides token-level dense feedback at each step, optimizing the reverse KL divergence. Unlike traditional off-policy distillation, OPD addresses the distribution mismatch problem, resulting in higher training stability and better performance on long sequences.

How can I quantify the general capability loss caused by fine-tuning?

You can use the dual-axis evaluation framework (target domain performance vs. general capability retention), or quantify it through metrics such as Capability-Conditioned Drift (CSD) and representation geometry metrics (Procrustes residual, Gram matrix distortion, CKA). Interpolation path analysis can also be used to find the optimal trade-off point.

Why can a strong teacher sometimes cause OPD training to fail?

The core conditions for OPD success are consistency of thinking patterns between teacher and student and the teacher possessing new capabilities. If a strong teacher has a low initial overlap ratio with the student (mismatched thinking patterns), or is merely a larger model from the same pipeline (no new capabilities), it cannot provide effective gradient signals, leading to training failure.

Among PEFT methods, which performs best in retaining general capability?

Orthogonal Fine-Tuning (OFT) achieves the best trade-off between target domain performance and general capability retention. By preserving the geometric structure of the weight spectrum, it minimizes damage to general representations while adapting to the target task.

How can I solve the incomplete learning problem in SFT?

First, identify unlearned samples using MC conversion and pass@N detection. Then, attribute them to the five root causes (missing knowledge, knowledge conflict, data conflict, left-side forgetting, insufficient optimization) and apply targeted interventions: CPT to supplement knowledge, dynamic bucketing to resolve conflicts, global shuffling to combat forgetting, and gradual epoch increases to improve optimization.

What are the characteristics of the post-training framework LiteScale?

LiteScale achieves asynchronous training through gradient accumulation, decoupling rollout and training processes to improve resource utilization. It supports GKD online knowledge distillation and features the LogitsExpress module for efficient transmission of teacher logits, supporting point-to-point communication under different DP-TP configurations.
