Tencent Hunyuan Open-Sources UniRL: A Unified Multimodal Reinforcement Learning Training Framework
On June 17, 2025, the Tencent Hunyuan team officially open-sourced UniRL, a distributed reinforcement learning (RL) post-training framework for multimodal generative models. Led by Pang Tianyu's team, the project aims to address the fragmentation in current AIGC multimodal RL training, where image diffusion models, video generation models, VLMs, and LLMs each have their own independent tech stack. That fragmentation leads to duplicated engineering work and slows algorithmic innovation.
Background: The "Silo Dilemma" of Multimodal RL
With the rapid development of models such as Stable Diffusion, FLUX, Wan, and HunyuanVideo, the capabilities of AIGC keep expanding, but RL training infrastructure has lagged well behind. Compared with LLM RL training, multimodal generative RL faces four major challenges:
- Fundamentally different generation processes: LLMs handle discrete tokens, while image/video generation involves continuous latent space denoising trajectories; unified multimodal models mix token generation with latent denoising, making credit assignment and policy updates more complex.
- Unstable system loop: Rollout, log-prob replay, and policy updates span multiple models and backends. The training side must exactly reproduce the conditions used on the sampling side; otherwise a training-inference mismatch occurs and biases the policy gradient.
- Heavier reward system: Multimodal RL rewards rely on evaluation chains built from VLMs, OCR, aesthetic models, and video understanding models, all of which are costly to run.
- High trajectory storage and memory pressure: Intermediate artifacts are high-dimensional latents, noise, timesteps, and so on, and in video generation their size grows rapidly with resolution, frame count, and the number of denoising steps.
These challenges have led to the industry practice of "one model, one training codebase," with developers spending significant time on repetitive engineering.
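To make the memory pressure concrete, here is a back-of-the-envelope sketch in Python. The function name `trajectory_bytes`, the decision to store one latent per denoising step plus the per-step noise, and the example shapes are all illustrative assumptions, not UniRL's actual storage layout.

```python
def trajectory_bytes(channels, frames, height, width, steps,
                     bytes_per_value=2, store_noise=True):
    """Rough size in bytes of one stored denoising trajectory (hypothetical layout)."""
    # Size of a single latent tensor of shape (channels, frames, height, width).
    per_latent = channels * frames * height * width * bytes_per_value
    # Assumption: keep the initial latent plus one latent after each denoising step.
    tensors = steps + 1
    # Assumption: stochastic samplers also keep the noise drawn at each step.
    if store_noise:
        tensors += steps
    return per_latent * tensors


# Illustrative shapes only; real latent shapes depend on the model and its VAE.
image = trajectory_bytes(channels=16, frames=1, height=128, width=128, steps=28)
video = trajectory_bytes(channels=16, frames=21, height=60, width=104, steps=50)
print(f"image trajectory: {image / 2**20:.1f} MiB")
print(f"video trajectory: {video / 2**20:.1f} MiB")
```

Under these assumed shapes, a single video sample takes more than ten times the storage of an image sample, before multiplying by the number of samples per prompt and the batch size.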
Core Design of UniRL: Unified Abstraction and Reusable Skeleton
UniRL is not tied to a single model family, algorithm, or training stack. It is built around Ray worker groups, Hydra flat recipes, composable training backends, and pluggable rollout engines, and it abstracts the multimodal RL closed loop into a single contract: rollout → reward → advantage → train → weight-sync.
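The five-stage contract can be sketched as a plain Python function. This is a minimal illustration of the control flow only: `run_iteration` and all of its callback parameters are hypothetical names, and the real framework distributes these stages across Ray worker groups rather than calling them in a single process.

```python
from typing import Callable, Dict, List


def run_iteration(
    prompts: List[str],
    rollout: Callable[[str], List[str]],
    reward_fn: Callable[[str, str], float],
    advantage_fn: Callable[[List[float]], List[float]],
    train_step: Callable[[List[Dict]], float],
    sync_weights: Callable[[], None],
) -> float:
    """One pass through rollout -> reward -> advantage -> train -> weight-sync."""
    losses = []
    for prompt in prompts:
        # 1. Rollout: the inference engine samples several outputs per prompt.
        outputs = rollout(prompt)
        # 2. Reward: score each sampled output.
        rewards = [reward_fn(prompt, out) for out in outputs]
        # 3. Advantage: turn raw rewards into per-sample training signals.
        advantages = advantage_fn(rewards)
        batch = [
            {"prompt": prompt, "output": o, "reward": r, "advantage": a}
            for o, r, a in zip(outputs, rewards, advantages)
        ]
        # 4. Train: the training backend updates the policy on this batch.
        losses.append(train_step(batch))
    # 5. Weight-sync: push updated weights back to the rollout engine.
    sync_weights()
    return sum(losses) / max(len(losses), 1)


# Toy stand-ins (all hypothetical) so the sketch runs end to end.
calls = []
loss = run_iteration(
    prompts=["a red cube"],
    rollout=lambda p: [p + " v1", p + " v2"],
    reward_fn=lambda p, o: float(o.endswith("v2")),
    advantage_fn=lambda rs: [r - sum(rs) / len(rs) for r in rs],
    train_step=lambda batch: sum(abs(x["advantage"]) for x in batch),
    sync_weights=lambda: calls.append("sync"),
)
```

Passing each stage in as a callable mirrors the idea of pluggable rollout engines and composable training backends: the loop stays the same while the components behind it change.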
The framework uses typed rollout data models (tracks) to represent generation trajectories at different stages: autoregressive (AR) stages use TextSegment, and image generation stages use LatentSegment. Tracks are connected through parent-child relationships, which naturally supports chained pipelines such as Bagel and HunyuanImage 3.0, where AR text reasoning precedes DiT image generation.
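A minimal sketch of how such typed tracks might be modeled with Python dataclasses follows. Only the names `TextSegment` and `LatentSegment` come from the description above; every field, the `Track` class, and the `lineage` helper are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class TextSegment:
    # Hypothetical fields: sampled token ids and their rollout log-probs.
    token_ids: List[int]
    logprobs: List[float]


@dataclass
class LatentSegment:
    # Hypothetical fields: denoising timesteps and the latent tensor shape.
    timesteps: List[float]
    latent_shape: Tuple[int, ...]


@dataclass(eq=False)  # identity comparison avoids recursing through parent/children
class Track:
    name: str
    segment: Union[TextSegment, LatentSegment]
    parent: Optional["Track"] = field(default=None, repr=False)
    children: List["Track"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Track") -> "Track":
        """Attach a downstream stage to this track and return it."""
        child.parent = self
        self.children.append(child)
        return child


def lineage(track: Optional[Track]) -> List[str]:
    """Walk parent links to list stage names from the root to this track."""
    names = []
    while track is not None:
        names.append(track.name)
        track = track.parent
    return names[::-1]


# AR text reasoning first, then DiT image generation as its child.
reasoning = Track("ar_reasoning", TextSegment([101, 7, 42], [-0.1, -0.3, -0.2]))
image = reasoning.add_child(
    Track("dit_image", LatentSegment([1.0, 0.5, 0.0], (16, 128, 128)))
)
print(lineage(image))
```

Keeping the parent link on each track means a reward computed on the final image can be traced back to the text reasoning that conditioned it, which is the kind of cross-stage credit assignment a chained pipeline needs.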
Supported Models and Algorithms
UniRL covers mainstream multimodal generative models:
- Image generation: SD3/3.5, Qwen-Image, Z-Image, FLUX.2-Klein
- Video generation: HunyuanVideo 1.0 and 1.5, Wan series
- Large language models: Qwen3 series
- Multimodal understanding models: Qwen-VL series
- Native unified multimodal models: HunyuanImage 3.0, Bagel
- Composable models: LLM/VLM + Diffusion Prompt-Enhancer architecture
Built-in RL algorithms include:
- Policy-gradient family: FlowGRPO, DanceGRPO, MixGRPO, LLM/VLM GRPO
- Forward-process family: DiffusionNFT
- Tencent Hunyuan in-house algorithms: Flow-DPPO (for flow/diffusion models, using stepwise KL-divergence proximal constraints instead of PPO ratio clipping) and DRPO (using advantage-weighted smooth policy-shift regularization instead of hard clipping/masking)
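As a point of reference for the policy-gradient family, GRPO-style methods score each sample relative to the other samples drawn for the same prompt. The sketch below shows only that group-relative normalization; the function name, the use of the population standard deviation, and the `eps` value are assumptions, and individual algorithms in the list above differ in these details and in everything that follows the advantage step.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    # Assumption: population standard deviation; some implementations use the
    # sample standard deviation or skip the division entirely.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four samples for the same prompt, scored by some reward model.
advantages = group_relative_advantages([0.2, 0.9, 0.4, 0.5])
print(advantages)
```

Samples that beat their group's average receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline.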
The integrated reward components include CLIPScore, GOT-OCR-2.0, PickScore, HPSv2/v3, ImageReward, UnifiedReward, GenEval2, WISE, VideoPickScore, and VideoAlign, among others.
Impact and Significance
UniRL pushes the repetitive, error-prone, and hard-to-reuse system engineering of multimodal RL training down to the framework level, so developers no longer have to rewrite rollout, reward, trajectory transmission, and training alignment logic for each model. The framework is still under active development, with plans to improve the core training loop, expand rollout engine support, and optimize large-scale training performance.