Practical OPD for Post-Training of Large Models: From Principles to Framework Construction
Drawing on Tsinghua's Rethinking OPD paper, several model technical reports, and hands-on practice with the LiteScale framework, this article explains the core conditions, underlying mechanisms, and engineering implementation of On-Policy Distillation (OPD). It covers how to judge whether a teacher model is suitable for distillation, how to avoid training collapse, and how to build a deployable asynchronous OPD training framework that improves the performance of small models on reasoning tasks.
Steps
- 1
Check whether the teacher model satisfies two core conditions: its thinking pattern is compatible with the student's (a high initial overlap rate), and it has capabilities that the student lacks (for example, from additional RL training).
- 2
If the teacher falls short on either condition, prefer a model from the same family that has undergone additional RL training, or use multi-teacher OPD to combine the capabilities of several expert models.
- 3
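In an existing RL framework, replace the advantage with the per-token log ratio between teacher and student probabilities on the student's own samples. This log ratio is a sample-based estimate of the negative reverse KL divergence, so OPD can be integrated with roughly a one-line code change.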
- 4
Combine gradient accumulation with asynchronous rollout: compute and accumulate gradients as soon as each batch of rollout data arrives, then apply a single parameter update once the full round of data is in. This removes the waiting that synchronous training imposes.
- 5
Change the normalization coefficient in Megatron's forward_step to the global micro-batch count, and implement a gradient accumulation method, accumulate_grad_step, so that asynchronous training remains equivalent to the synchronous update.
- 6
After training, use the MC transformation to detect samples the model has not learned, treating pass@5 < 0.2 as a failure. Attribute each failure to one of five major causes, such as knowledge deficiency and conflict, then remediate with targeted CPT or data cleaning.
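Step 3 can be illustrated with a minimal, framework-free sketch. It assumes the student has already sampled a sequence and that both models have scored the sampled tokens. The function names `opd_advantages`, `reverse_kl_estimate`, and `policy_gradient_loss` are hypothetical and are not taken from LiteScale or any RL library. In a real framework the advantage would be a detached tensor, not a Python list.

```python
import math


def opd_advantages(student_logprobs, teacher_logprobs):
    """Per-token OPD advantage: log p_teacher(token) - log p_student(token).

    Both inputs are log-probabilities of the tokens the STUDENT sampled.
    A positive value means the teacher rates the token higher than the student does.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("student and teacher must score the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


def reverse_kl_estimate(student_logprobs, teacher_logprobs):
    """Sample-based estimate of KL(student || teacher): mean of (log p_s - log p_t)."""
    adv = opd_advantages(student_logprobs, teacher_logprobs)
    return -sum(adv) / len(adv)


def policy_gradient_loss(student_logprobs, advantages):
    """REINFORCE-style surrogate loss; advantages are treated as constants."""
    n = len(student_logprobs)
    return -sum(a * lp for a, lp in zip(advantages, student_logprobs)) / n


# Illustrative numbers only: two sampled tokens.
student_lp = [math.log(0.5), math.log(0.1)]
teacher_lp = [math.log(0.25), math.log(0.4)]
advantages = opd_advantages(student_lp, teacher_lp)
loss = policy_gradient_loss(student_lp, advantages)
```

In this example the first token gets a negative advantage (the teacher finds it less likely than the student does) and the second gets a positive one, so the update pushes probability toward the token the teacher prefers.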
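The equivalence claimed in steps 4 and 5 can be checked on a toy problem. The sketch below is not Megatron or LiteScale code: it uses a one-parameter linear model, equal-sized micro-batches, and a hypothetical `AsyncAccumulator` class whose `accumulate_grad_step` method only borrows its name from step 5. The point it illustrates is that dividing each micro-batch gradient by the global micro-batch count makes the accumulated gradient independent of arrival order.

```python
def microbatch_grad(w, batch):
    """Gradient w.r.t. w of the mean of 0.5 * (w*x - y)^2 over one micro-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


class AsyncAccumulator:
    """Accumulates gradients as micro-batches arrive; updates once per full round."""

    def __init__(self, global_num_microbatches):
        self.n = global_num_microbatches
        self.grad = 0.0
        self.seen = 0

    def accumulate_grad_step(self, w, batch):
        # Normalize by the GLOBAL micro-batch count, not by how many have arrived so far.
        self.grad += microbatch_grad(w, batch) / self.n
        self.seen += 1

    def ready(self):
        return self.seen == self.n

    def step(self, w, lr):
        if not self.ready():
            raise RuntimeError("round incomplete: not all micro-batches have arrived")
        new_w = w - lr * self.grad
        self.grad, self.seen = 0.0, 0  # reset for the next round
        return new_w


def sync_step(w, batches, lr):
    """Reference update: wait for every micro-batch, then average their gradients."""
    grad = sum(microbatch_grad(w, b) for b in batches) / len(batches)
    return w - lr * grad


batches = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
    [(5.0, 10.0), (6.0, 12.0)],
]
w0, lr = 0.5, 0.01

acc = AsyncAccumulator(global_num_microbatches=len(batches))
for batch in reversed(batches):  # simulate out-of-order arrival
    acc.accumulate_grad_step(w0, batch)
w_async = acc.step(w0, lr)
w_sync = sync_step(w0, batches, lr)
```

Here `w_async` and `w_sync` agree up to floating-point rounding. The equivalence relies on every gradient in a round being computed against the same parameters `w0`, which is why the update is deferred until the whole round has arrived.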
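The pass@5 < 0.2 criterion in step 6 can be sketched as follows. This assumes pass@k is computed with the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) from n samples with c correct; the source does not say which estimator it uses. The helper `find_unlearned` and the `(n, c)` result format are hypothetical, and the MC transformation itself is not implemented here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def find_unlearned(results, k=5, threshold=0.2):
    """Return ids whose pass@k falls below the threshold.

    results maps a sample id to (n, c): number of attempts and number correct.
    """
    return [sid for sid, (n, c) in results.items() if pass_at_k(n, c, k) < threshold]


# Illustrative counts only, not measured data.
results = {"q1": (10, 0), "q2": (10, 1), "q3": (50, 1)}
flagged = find_unlearned(results)
```

The flagged ids are the candidates for root-cause attribution and subsequent remediation with CPT or data cleaning.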