
Releases: volcengine/verl

v0.6.1

14 Nov 02:45
d62da49


Highlights

Trainer

  • support fp16 training (FSDP/Megatron)

Megatron

  • support 1f1b_overlap/moe_a2a_overlap
  • support for Qwen3VL MoE/dense models
  • support Qwen2.5/3vl with context parallel

Rollout

  • Use the vLLM and SGLang release images as CI base images; upgrade to vllm==0.11.0 and sglang==0.5.5
  • Prometheus monitoring

Algorithm

  • Rollout Correction: comprehensive overhaul of the rollout correction system with typed configuration, mathematical documentation, and performance optimizations.

Recipe

Introduce new experimental recipes, which will be gradually merged into main in future releases.

  • Fully Async Policy Trainer: a fully asynchronous PPO training system that completely decouples the Trainer and the Rollouter, supporting asynchronous sample generation and training.
  • TransferQueue Data System: an asynchronous streaming data management system for efficient post-training.
  • FlowRL

Important bug fixes

  • #3861: fix missing offload parameter and optimizer to cpu when no checkpoint
  • #4097: fix missing finalize_model_grads_func in megatron model engine

What's Changed

  • [misc] feat: bump version to 0.7.0.dev by @vermouth1992 in #3772
  • [recipe] feat: Add example for gpt-oss training using agent loop by @HJSang in #3774
  • [docker] feat: update Dockerfile.rocm7 by @vickytsang in #3781
  • [doc] fix: actor_rollout_ref.critic is not correct by @HollowMan6 in #3778
  • [misc] fix: sft SFT E2E CI test failure due to megatron engine by @houminz in #3786
  • [recipe] fix: fix the gpt-oss-20b training script for agent loop recipe by @HJSang in #3793
  • [doc] chore: add agent loop get started tutorial by @wuxibin89 in #3790
  • [vllm] fix: catch exception of vllm async engine by @Yangruipis in #3789
  • [trainer] fix: batch size mismatch with n>1 when gen_max for ReMax by @HollowMan6 in #3779
  • [trainer] feat: ReMax support using reward model for baseline by @HollowMan6 in #3780
  • [megatron] feat: script of qwen3vl 235b by @ISEEKYAN in #3799
  • [trainer, recipe] feat: fully async training recipe by @ArronHZG in #2981
  • [doc] feat: update fully async experiment message by @ArronHZG in #3804
  • [worker] fix: create a new event loop if none exists when building rollouts by @ChangyWen in #3803
  • [trainer] fix: address serialization issues when using async reward function and ray ppo trainer by @benprofessionaledition in #3769
  • [megatron] fix: fix logits process error when disable pack_seqs by @HaochenYuan in #3777
  • [misc] fix: Sanitize MLFlow metric names by @pratik9891 in #3736
  • [ci] fix: Install mlflow dependency by @HollowMan6 in #3817
  • [rollout, vllm] fix: make LoRA with async vLLM work properly by @listar2000 in #3821
  • Revert "[worker] fix: create a new event loop if none exists when building rollouts" by @vermouth1992 in #3820
  • [trainer] fix: Add data.seed to config by @HollowMan6 in #3815
  • [doc] fix: update install instruction and retool readme by @chenhaiq in #3824
  • [algo] fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error by @szrlee in #3810
  • [data] feat: filter out malformed data together with long prompts by @HollowMan6 in #3814
  • [worker] fix: to create a new event loop if none exists when building rollouts (a safer fix) by @ChangyWen in #3828
  • [data, trainer] feat: add support for limiting samples from dataset by @HollowMan6 in #3812
  • [model, megatron] feat: Support for Qwen3VL dense models by @HollowMan6 in #3838
  • [recipe] fix: Update the grpo training script for gpt-oss models by @HJSang in #3836
  • [recipe, rollout] feat: enable gpt-oss training for tool agent add gpt-oss for retool recipe by @HJSang in #3837
  • [data] feat: TransferQueue - An asynchronous streaming data management system by @0oshowero0 in #3649
  • [trainer, worker] feat: more flexible and easy-to-use reward model by @yyDing1 in #3679
  • [doc] fix: fix async policy message by @ArronHZG in #3845
  • [worker] fix: create a new event loop if none exists by @baymax591 in #3839
  • [misc] feat: add megatron script for open math reasoning by @vermouth1992 in #3844
  • [rollout, vllm] fix: name change for compilation level by @HollowMan6 in #3848
  • [trainer] fix: missing offload parameter and optimizer to cpu when no checkpoint by @wuxibin89 in #3861
  • [sglang] fix: make sglang wake_up/sleep work in colocate mode by @yyDing1 in #3860
  • [doc] feat: add doc for reward loop by @yyDing1 in #3851
  • [doc] misc: fix doc that penalty starts when exceeds the max_response_length - overlong_buffer.len by @bzantium in #3856
  • [recipe]fix: bugfix of Qwen3 8b/14b DAPO npu script by @acat-rw in #3858
  • [BREAKING][misc] feat: Abstract optimizer by @EduardDurech in #3656
  • [ci] feat: migrate gpu_unit_tests to volcengine by @vermouth1992 in #3872
  • [rollout] fix: Fix gpt-oss training in tool agent by @HJSang in #3865
  • [fsdp] fix : fix moe model run on full-async error by @chenjiaoAngel in #3874
  • [doc] feat: update doc of reward loop by @yyDing1 in #3880
  • [perf, data] feat: DP workload balance by @conver334 in #3605
  • [ci] fix: gsm8k interaction unit test by @wuxibin89 in #3888
  • [model] chore: deprecated legacy code for GRM by @yyDing1 in #3885
  • [recipe] fix: Qwen3-vl moe model patch by @leisuzz in #3878
  • Add PokeeResearch to README resources by @BillMatrix in #3892
  • [misc] feat: read environment for WandB entity (team) name by @BaiqingL in #3889
  • [tool] fix: remove duplicate tool initialization by @Tree-Shu-Zhao in #3893
  • [rollout] fix: incorrect value assignment while trying to access call_tool_result by @BaiqingL in #3891
  • [megatron] fix: VLMs using fused kernels by @HollowMan6 in #3849
  • [megatron] fix: mbridge load optimizer dist_ckpt by @ccilery in #3850
  • [misc] feat: fix ci break by @wuxibin89 in #3898
  • [doc, recipe] feat: update doc of rewardloop and add runnable scripts of fapo by @yyDing1 in #3900
  • [doc] chore: update installation scripts to use newer versions by @HollowMan6 in #3901
  • [recipe] fix: fix bug of tranfer queue runtime env by @baymax591 in #3904
  • [doc] fix: formatting issue for kl_ctrl and fused_kernel_options configs by @HollowMan6 in #3917
  • [recipe] fix: DAPO using KL in reward by @HollowMan6 in #3916
  • [recipe] fix: DAPO add trust_remote_code parameter to tokenizer and processor by @quancs in #3913
  • [recipe] fix: Update README with training and backend instructions by @vermouth1992 in #3929
  • [recipe] chore: use verl.utils.metric to import reduce_metrics by @HollowMan6 in #3927
  • [algo] refactor: Rollout Importance Sampling - Separate IS Weights from Rejection Sampling by @szrlee in #3915
  • [trainer, worker] feat: support loading LoRA adapters by @piood in #3523
  • [rollout, sglang] fix: correct input length check in sglang_rollout by @triston-lee in #3935
  • [rollout, vllm] fix: handle lora request when base_sync_done is false initially by @listar2000 in https://github.com/volcengine/verl/pull/...

v0.6.0: model engine, rollout server, composability

15 Oct 09:25
ddd86f5


Highlights

Model Engine

As noted in #3624, the model engine is a service that provides APIs for manipulating a parallel, distributed model from a single controller. This release provides a prototype of this idea using the FSDP + Ulysses backend and the Megatron Core backend. The implementation is under https://github.com/volcengine/verl/tree/main/verl/workers/engine. Currently, we only implement the SFT trainer using the model engine. In the following releases, we will start to implement the RL trainer using the model engine.
Please refer to https://verl.readthedocs.io/en/latest/workers/model_engine.html for the design and for instructions on adding more model engine backends.
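
To make the design concrete, the sketch below shows what a backend-agnostic engine interface can look like: a trainer drives the distributed model through a small set of methods and never touches the backend directly. This is purely illustrative; the real interface lives under verl/workers/engine and may differ.

from abc import ABC, abstractmethod

class BaseEngine(ABC):
    """Illustrative facade over a parallel/distributed model (not verl's actual API)."""

    @abstractmethod
    def init_model(self) -> None:
        """Build and shard the model on the backend (FSDP + Ulysses, Megatron Core, ...)."""

    @abstractmethod
    def train_batch(self, batch) -> dict:
        """Run forward/backward/optimizer step on one batch and return metrics."""

    @abstractmethod
    def save_checkpoint(self, path: str) -> None:
        """Persist model/optimizer state in the backend's native format."""

A trainer such as the SFT trainer then programs only against this interface from the single controller, so swapping FSDP for Megatron requires no trainer changes.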

Rollout Server

As agentic reinforcement learning emerges as a predominant research area, verl rollout is transitioning from SPMD mode to server mode, which is more efficient for multi-turn rollout and tool calling. In version 0.6, we made several major changes to rollout servers:

  • SGLang: #3090 completely separates the SGLang process from the trainer process in SPMD mode and introduces a server adapter to synchronize weights between the trainer and SGLang server. Furthermore, #3456 migrates SGLang to native server mode, enabling full-fledged features and optimizations for online serving.
  • vLLM: While the vLLM model_runner remains within the trainer process, #3456 also transitions vLLM to native server mode. We may explore completely separating the vLLM process from the trainer process in future releases.

By switching to native server mode, #3530 adds DP+EP support for large MoE models.

To improve extensibility, #3285 refactors the BaseRollout interface and deprecates all sharding managers. This refactor ensures the training engine remains agnostic of the inference engine during weight synchronization, making it easier to integrate new inference engines (e.g., TensorRT-LLM) without modifying the training engine.

Newly Supported Models

  • Qwen3 VL
  • GPT OSS

Algorithm

  • GSPO
  • Token-level TIS: #2953 introduces token-level importance sampling to mitigate the gap between rollout and training.
  • Sequence-level TIS: #3694 adds more comprehensive metrics to monitor the distribution mismatch between rollout and training, and introduces sequence-level importance sampling.
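
The core of both variants is a truncated importance-sampling ratio between the training-time and rollout-time log-probabilities of the sampled tokens. Below is a minimal sketch of the token-level weight computation; the function and argument names are illustrative, not verl's API:

import torch

def token_level_tis_weights(logprob_train: torch.Tensor,
                            logprob_rollout: torch.Tensor,
                            cap: float = 2.0) -> torch.Tensor:
    # Per-token log-probs of the same sampled tokens under the training policy
    # and the rollout policy, both shaped (batch, seq_len).
    ratio = torch.exp(logprob_train - logprob_rollout)
    # Truncate to bound variance, and detach so the weight is constant
    # w.r.t. the policy gradient.
    return torch.clamp(ratio, max=cap).detach()

# e.g. pg_loss = -(token_level_tis_weights(lp_train, lp_rollout) * advantages * lp_train)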

Recipe

Some awesome recipes have been added in v0.6:

Breaking changes and deprecations

nD Dispatch method

Previously, we implemented a set of predefined dispatch methods including ONE_TO_ALL, DP_COMPUTE_DATA_PROTO, MEGATRON_COMPUTE_DATA_PROTO, etc. DP_COMPUTE_DATA_PROTO and MEGATRON_COMPUTE_DATA_PROTO are strongly coupled to the underlying distributed strategies, and writing a separate dispatch method for each strategy is not scalable. In this release, we propose a new API to unify all distributed strategies. The general steps are:

  • Define device meshes or process groups
  • Register dispatch and collect info by calling _register_dispatch_collect_info inside the worker
  • Register methods with @register(dispatch_mode=make_nd_compute_dataproto_dispatch_fn(mesh_name=mesh_name))

Please refer to https://github.com/volcengine/verl/blob/main/tests/single_controller/test_device_mesh_register.py as an example.
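
For orientation, here is a condensed sketch of the three steps. Import paths and exact signatures are best-effort assumptions; treat the test linked above as authoritative.

from torch.distributed.device_mesh import init_device_mesh
from verl.single_controller.base.decorator import register, make_nd_compute_dataproto_dispatch_fn
from verl.single_controller.base.worker import Worker

class ActorWorker(Worker):
    def __init__(self):
        super().__init__()
        # 1. Define a device mesh for this worker group, e.g. 4-way DP x 2-way TP.
        mesh = init_device_mesh("cuda", mesh_shape=(4, 2), mesh_dim_names=("dp", "tp"))
        # 2. Register dispatch/collect info under a mesh name: which DP rank this
        #    worker serves, and whether its output should be collected.
        self._register_dispatch_collect_info(
            "actor",
            dp_rank=mesh["dp"].get_local_rank(),
            is_collect=mesh["tp"].get_local_rank() == 0,
        )

    # 3. Bind the method to the registered mesh name.
    @register(dispatch_mode=make_nd_compute_dataproto_dispatch_fn(mesh_name="actor"))
    def compute_log_prob(self, data):  # data: DataProto
        ...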

ShardingManager

ShardingManager is deprecated and will be removed in the next release.

Important bug fixes

  • Fix a hang when training VLMs (e.g., Qwen VL) on mixed text and image data
  • Fix a DataProto __getstate__ bug

What's Changed

  • [cfg] refactor: add ActorConfig, EngineConfig, and ActorWorker unit test, refactor validation code by @eric-haibin-lin in #2621
  • [ci] test: add CriticWorker unit test, make some util CPU friendly by @eric-haibin-lin in #2717
  • [ray] feat: RayWorkerGroup support set worker env by @NKcqx in #2685
  • [sglang] fix: Adding strict naming sanity for sglang by @zhaochenyang20 in #2719
  • [misc] chore: bump main branch version to v0.5.0.dev by @eric-haibin-lin in #2718
  • [megatron] fix: resolve backward propagation error in megatron_actor due to shared logits tensor in-place modification by @HelloWorld686 in #2484
  • [tool] fix: geo3k create return by @nanjiangwill in #2714
  • [doc] feat: Add agent-lightning in the list of "awesome works using verl by @wizardlancet in #2726
  • [ci] fix: checkpoint_convertor ci miss a hf model download by @ETOgaosion in #2730
  • [recipe] chore: add retool training script by @wuxibin89 in #2732
  • [ci] fix: release ascend test time, fix one step off-policy CI by @ETOgaosion in #2731
  • [doc] feat: add resizable sidebar and improve layout by @Tingberer in #2577
  • [docker] feat: upgrade to torch 2.7, sglang 0.4.8 by @ETOgaosion in #2617
  • [megatron] feat: a bunch of optimzation on vram, sequence packing by @ISEEKYAN in #2678
  • [CI] feat: add mypy to pre-commit by @frrad in #2614
  • [doc] style: change resize handle from gradient to plain color by @Tingberer in #2746
  • refactor: Make sure to keep the type checking by @YeonwooSung in #2634
  • [rollout] feat: remove chat scheduler by @wuxibin89 in #2725
  • [perf] feat: add optional role selection in discrete mode for NPU Profiler by @tongtong0613 in #2750
  • [doc] feat: add retool blog by @eric-haibin-lin in #2761
  • [algo] refactor: don't special-case compute_policy_loss by @frrad in #2701
  • [BREAKING] [rollout] chore: remove default rollout selection by @vermouth1992 in #2757
  • [misc] fix: Handle N-D arrays and complex objects in union_numpy_dict by @MikeDean2367 in #2768
  • [recipe] fix: fix retool SFT dataset by @vermouth1992 in #2764
  • [doc] fix: fix typo in agentic RL documentation by @kibitzing in #2777
  • [cfg] fix: fix failing rollout config test on main by @eric-haibin-lin in #2771
  • [docker] feat: upgrade vllm to 0.9.1 by @ETOgaosion in #2747
  • [recipe] fix: fix issue when running split ppo by @as12138 in #2745
  • [recipe] feat: Add sleep/wakeup mode for gen rm vllm service and add tqdm showing process by @none0663 in #2739
  • [recipe] feat: add QWen2.5-7b-instruct retool by @vermouth1992 in #2800
  • [recipe] feat: @register_policy_loss("geo_mean"); Geometric-Mean Policy Optimization by @MzeroMiko in #2795
  • [tool] fix: Typo fix -- Rename to_openai_function_tool_schema to get_openai_tool_schema by @wizeng23 in #2806
  • [perf] feat: Padding before batch post-process in agent-loop to save time by @PopSoda2002 in #2773
  • [vllm,rollout] fix: vllm rollout lock file permission by @clearhanhui in #2805
  • [training_utils] fix: enforce 1D object array shape for non-tensor data in collate_fn by @kibitzing in #2741
  • [vllm] fix: verl + vllm-ascend(version 0.9.1) running failed issue by @leo-pony in #2782
  • Revert "[recipe] feat: Add sleep/wakeup mode for gen rm vllm service and add tqdm showing process" by @ETOgaosion in #2813
  • [algo] feat: add GSPO-token policy loss computation function by @0x404 in #2775
  • [sglang] fix: support the configuration of attention_backend in sglang by @tardis-key in #2818
  • [rollout] feat: pass all dataset fields to agent loop run by @wuxibin89 in #2810
  • [docker] feat: Upgrade sglang 0.4.9 + transformers 4.53.2 by @ETOgaosion in #2794
  • [sglang] fix: fix missing engine_kwargs by @vermouth1992 in #2823
  • [perf, doc] feat: Add profiling continous steps in one database by @davidmlw in https://github.com/volcengine/verl...

v0.5.0: agentic RL rollout, prototypes for disaggregated async training & GenerativeRM, better rollout load balance & improved sglang+megatron/vlm support

23 Jul 18:20
8fdc4d3


Highlights

Agentic RL rollout interface [beta]

verl v0.5 introduces the AgentLoop abstraction, which allows easy extension to custom rollouts with tool/agent interactions. Server-based asynchronous rollout is adopted to utilize GPUs efficiently. verl provides a few example agent loop implementations, including:

Please check the documentation for the system architecture design.
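
As a rough illustration of the abstraction (the real AgentLoop interface differs in detail; see the documentation), a custom agent loop alternates between server-based generation and tool execution until the model stops calling tools:

async def tool_agent_loop(generate, parse_tool_call, execute_tool,
                          prompt: str, max_turns: int = 8) -> str:
    # generate: async call into the rollout server; parse_tool_call and
    # execute_tool: user-supplied helpers. All names here are illustrative.
    transcript = prompt
    for _ in range(max_turns):
        reply = await generate(transcript)
        transcript += reply
        call = parse_tool_call(reply)  # None once the model stops calling tools
        if call is None:
            break
        transcript += await execute_tool(call)
    return transcript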

Disaggregated placement & async training [prototype]

verl v0.5 includes a community-contributed one-step-off async training recipe, with the trainer and rollout deployed on disaggregated resources and off-policy model updates with staleness = 1. In a small-scale experiment, the reference recipe provides a 20-40% throughput gain over the on-policy baseline, depending on the configuration. Please check out the code and documentation for example configurations.

Remote generative reward models [prototype]

A recipe is provided as a prototype to demonstrate the recommended way to use generative reward models in verl. Documentation and code.

New features

  • LoRA RL support for VLMs: #2182
  • Better checkpoint manager support for SFT trainer #2292
  • Support for rollout trajectory tracing and RolloutViewer, with improved debuggability and visualization
  • Megatron mbridge integration, which better supports loading HF models into Megatron #2064

Important fixes & improvements

  • Fixed an FSDP2 state_dict memory usage issue caused by torch 2.6. Using either verl v0.5 or torch 2.7 avoids the OOMs #2606
  • Significantly reduced the performance overhead of the vLLM async server (vs. the vLLM engine) #2246
  • Fixed sglang + Megatron TP16 #2336
  • Improved SGLang + Megatron weight resharding by 10x #2418 and MoE weight resharding by 3x #2692
  • Significantly improved rollout load balancing for GRPO-like algorithms by repeating samples before dispatching them #2324

Breaking changes and deprecations

Full list: #2270

Rollout

  • When generate_sequences is called with sampling param n>1, the DataProto repeat behavior changes (see the toy sketch after this list):

    • chunk-dispatch-repeat (old): DataProto is chunked and dispatched to rollout workers, then repeated inside the rollout workers.
    • repeat-chunk-dispatch (new): DataProto is repeated by n in the driver, then chunked and dispatched to rollout workers.
      The switch from chunk-dispatch-repeat to repeat-chunk-dispatch may break almost all recipes and projects that use verl GRPO as a submodule. #2324
  • verl.workers.rollout.sglang_rollout.AsyncSglangServer is now renamed to AsyncSGLangServer

  • vllm <= v0.6 support is dropped
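
A toy sketch of why the order matters (plain Python lists standing in for DataProto; repetition shown non-interleaved for illustration):

prompts = ["p0", "p1", "p2", "p3"]
n, workers = 2, 2

def chunk(xs, k):
    m = len(xs) // k
    return [xs[i * m:(i + 1) * m] for i in range(k)]

# Old: chunk in the driver, dispatch, then repeat inside each rollout worker.
old = [c * n for c in chunk(prompts, workers)]  # [['p0','p1','p0','p1'], ['p2','p3','p2','p3']]

# New: repeat in the driver, then chunk and dispatch.
new = chunk(prompts * n, workers)               # [['p0','p1','p2','p3'], ['p0','p1','p2','p3']]

Downstream code that assumes a particular sample layout on each worker (e.g. grouping GRPO samples of the same prompt) must be updated accordingly.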

Multi-turn

  • We are moving multi-turn support from ChatScheduler to AgentLoop to improve usability. #2124

Megatron

  • Megatron recomputation options are moved to *.megatron.override_transformer_config. #2651 Default values are:
override_transformer_config:
  recompute_granularity: null
  recompute_modules:
  - core_attn
  recompute_method: null
  recompute_num_layers: null
  • Merged the configs actor_rollout_ref.(actor, ref, rollout).profiler into actor_rollout_ref.profiler

What's Changed

Trainer & FSDP

  • [fsdp] fix: Change the data in the update_actor function from to.('cpu') to to.(get_device_id()) by @Keilo001 in #2477
  • [fsdp] fix: vlm dynamic batch & unify dynamic batch api by @hiyouga in #2524
  • [fsdp] fix: change geo3k model name from non-vl to vl by @nanjiangwill in #2555
  • [trainer, recipe] feat: add support for external generative reward models by @yyDing1 in #2121
  • [trainer] fix: fix split placement by @vermouth1992 in #2227
  • [trainer, vllm] feat: add lora exclude_modules to support VL model lora training by @Cccei000 in #2182
  • [trainer] fix: pre-commit broken by #2354 by @ETOgaosion in #2358
  • [trainer, cfg] feat: add BaseConfig for all dataclass configs. Introduce dataclass for algorithm related configs by @eric-haibin-lin in https://github.com/
  • [trainer] fix: Use safe masked mean/sum to handle NaN values outside the mask by @Yangruipis in #2377
  • [trainer, data] feat: Dynamic Data Generation by @jwong8314 in #2312
  • [trainer] fix: use .keys() to check 'response_mask' in TensorDict by @askender in #2491
  • [trainer] fix: Allow FSDP2 when doing strategy check by @HollowMan6 in #2497
  • [trainer] refactor: no need to call load_reward_manager in compute_reward_async by @eric-haibin-lin in #2557
  • [trainer, fsdp, vllm, recipe] feat: one step off async training recipe by @imh966 in #2231
  • [trainer] fix: maybe_filter_out_long_prompts on image and video by @firefighter-eric in #2553
  • [trainer] refactor: Training Engine Interface and Development Plan by @ZihengJiang in #1977
  • [trainer] feat: Add FSDPCheckpointManager for SFTtrainer, support resume training, manage the number of CKPTS in keep by @Pursuer-Hsf in #2292

Rollout & SGLang

  • [rollout] feat: add agent loop by @wuxibin89 in #2124
  • [rollout] feat: add zeromq vllm distributed executor by @wuxibin89 in #2246
  • [BREAKING][rollout] refactor: drop vllm v0.5.4 and v0.6.3 support by @eric-haibin-lin in #2257
  • [rollout] feat: Allow customization of async server class by @ultmaster in #2326
  • [rollout] fix: fix hf rollout and add single gpu test by @eric-haibin-lin in #2371
  • [BREAKING][rollout] feat: repeat DataProto when n>1 in driver instead of rollout workers by @wuxibin89 in #2324
  • [misc] feat: trace rollout generation and tool calls using weave by @chenhaiq in #2345
  • [cfg] refactor: make the rollout & ref configs more modular by @eric-haibin-lin in #2410
  • [perf] feat: add range tag to start/stop profile; clean actor_rollout_ref.profiler by @davidmlw in #2456
  • [rollout] feat: support mlflow in rollout trace by @chenhaiq in #2440
  • [rollout] feat: add ReactAgentLoop based on LangGraph by @wuxibin89 in #2463
  • [rollout] fix: fix bug for remax when the rollout mode is async by @none0663 in #2574
  • [tool] chore: introduce RolloutViewer TUI tools by @Yangruipis in #2469
  • [rollout,vllm] fix: A major issue in random sampling of vllm engine by @guanning03 in #2646
  • [tool] chore: Add log for AsyncRolloutRequest ID, and rollout viewr to support request id display and search by @Hecate0821 in https://github.com/volcengine/
  • [rollout] fix: use flashattn3 backend in sglang to avoid error in tool call by @chenhaiq in #2244
  • [rollout] fix: Make free_cache_engine option workable in latest vLLM/SGLang by @HollowMan6 in #1464
  • [rollout] fix: #1646 stop words for sglang rollout by @linxxx3 in #1991
  • [sglang, rollout] refactor: use torch.Tensor in async rollout schemas by @nanjiangwill in #2362
  • [rollout] fix: sglang async fail with Multi-stage Awake feature by @chenhaiq in #2365
  • [sglang] feat: Add multi-interaction registry support and testing by @SwordFaith in #2184
  • [sglang] feat: Repeat sampling parameter n into requests of GRPO in SGLang by @zhaochenyang20 in #2258
  • [sglang,tool] feat: Add support for tools that generate multimodal data by @nanjiangwill in #2146
  • [sglang] fix: only wake up weights on infer_tp 0 by @zhaochenyang20 in #2403
  • [sglang] fix: Import Error in the latest sglang by @yyDing1 in #2275
  • [sglang] fix: Fix qwen2vl weight keys issue by @hebiao064 in #2434
  • [sglang] fix: Only flush cache on TP rank=0. by @SuperCB in https...

v0.4.1 patch release: checkpoint fixes for MoE EP & LoRA, OpenAI/MCP tool calling schema, and SGLang memory optimizations

27 Jun 00:13


Key changes

PPO fixes and enhancements

  • Fixed a bug related to vf_loss coefficient for PPO, which was introduced in v0.4 #2016
  • Improved numerical stability when clamping KL divergence-related values #1779

Checkpoints related

  • Switched Megatron checkpointer to mcore's dist_checkpoint, which reduces peak memory usage and improves distributed model saving performance via *.checkpoint.async_save=True.
  • [BREAKING] Megatron's checkpoint directory layout is updated accordingly. Documentation
  • [BREAKING] Checkpoint manager constructor now takes checkpoint_config as the keyword to replace checkpoint_contents #2125
  • Checkpoint merger for LoRA is fixed #1821 via python -m verl.model_merger merge .... Documentation

Experimental function calling & MCP interfaces

These features are experimental and subject to change in the future.

  • Chat completion scheduler now speaks the OpenAI function-calling schema with an OpenAI server #1831
  • SGLang rollout with MCP client #1948 Documentation
  • SGLang multi-turn rollout code walk-through documentation
  • Multi-turn interaction system with SGLang, enabling dynamic conversational feedback and iterative problem-solving scenarios #1630, the building block for SCoRe

New models and recipes

SGLang optimizations

  • SGLang rollout memory usage is further optimized. Blog (requires sglang v0.4.8 #2187)
  • Async multi-turn rollout with multi-modal support is now available in SGLang #2014

Other performance profiling & optimizations

  • Nsight system profiling is available. Documentation
  • FSDP prefetch can be enabled via [actor|ref].fsdp_config.forward_prefetch=True #1927
  • The memory usage for entropy computation can be drastically reduced with fused kernels using [actor|ref].entropy_checkpointing=True and [actor|ref].entropy_from_logits_with_chunking=True #1927
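
The chunking option follows a simple idea: compute the entropy over slices of the token dimension so the full (num_tokens, vocab_size) probability tensor is never materialized at once. A self-contained sketch of that idea (not verl's fused kernel):

import torch

def entropy_from_logits_chunked(logits: torch.Tensor, chunk_size: int = 1024) -> torch.Tensor:
    # logits: (num_tokens, vocab_size) -> per-token entropy of shape (num_tokens,)
    out = []
    for chunk in logits.split(chunk_size, dim=0):
        logp = torch.log_softmax(chunk.float(), dim=-1)
        out.append(-(logp.exp() * logp).sum(dim=-1))
    return torch.cat(out)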

Other breaking changes and deprecations

  • See #1902
  • vllm v0.6.3 support will be removed in the next release.

What's Changed

  • [feat] Wandb Timing: Add more detailed timing of gen_sequence and weights resharding by @ETOgaosion in #1834
  • [rollout] feat: follow OpenAI tool calling schema in chat scheduler by @wuxibin89 in #1831
  • [release] chore: bump version to v0.4 by @eric-haibin-lin in #1897
  • Dockerfile.rocm update tensordict==0.6.2 by @vickytsang in #1898
  • [feat] add validation shuffle by @mlpod in #1886
  • [feat][BREAKING] Megatron: Support learning rate scheduler by @ETOgaosion in #1701
  • fix errors in megatron_workers.py by @davidjsonn in #1906
  • [tests] chore: add PR title check by @eric-haibin-lin in #1901
  • fix qwen2vl grpo for vllm 0.9 and transformers 4.52 by @hiyouga in #1880
  • [rollout] fix: error in __collect_lora_params() in FSDPVLLMShardingManager by @rocke2020 in #1909
  • [recipe] feat: char count by @vermouth1992 in #1908
  • fix typos by @davidjsonn in #1912
  • [trainer] refactor: refactor reward manager, advantage estimator by @eric-haibin-lin in #1916
  • set CUDA and HIP VISIBLE DEVICES by @YangWang92 in #1914
  • [ppo] feat: add critic valuehead model support for multi-modal PPO by @Yangruipis in #1839
  • [bugfix] fix megatron model merger by @ShareLer in #1774
  • revert HIP_VISIBLE_DEVICES in worker.py by @YangWang92 in #1920
  • [worker] fix: do not break dynamic bsz in dp critic by @hiyouga in #1922
  • [sglang] feat: Efficient and model-agnostic multi-turn messages tokenization and masking by @jybsuper in #1668
  • [rollout] fix: fix async llm config passing by @eric-haibin-lin in #1933
  • [sglang] fix: Fix tool call parser not found error for SGLang==0.4.6.post5 by @jybsuper in #1852
  • fix sequence parallelism conflict in kimiVL by @ShareLer in #1899
  • [megatron] refactor: support MLATransformerConfig abstraction for DeepSeek V3 by @jinqinn in #1836
  • [rollout] feat: add async llm perf script by @wuxibin89 in #1930
  • [megatron] feat: qwen2.5vl by @ISEEKYAN in #1286
  • [ckpt] feat: model_merger.py support processing checkpoints with LoRA adapters by @thelongestusernameofall in #1821
  • [hardware] fix: fix issue when sp>1 on ASCEND NPU by @as12138 in #1942
  • [megatron] fix: rope_type typo in config_converter.py by @donpromax in #1944
  • [training_utils] Add qwen3 multi-turn sft support by @SwordFaith in #1889
  • [fsdp] fix: fsdp entropy metrics by @ETOgaosion in #1943
  • [FSDP] feat: Add FSDP forward pefetch and recompute chunking entropy by @CurryRice233 in #1927
  • [rollout] fix: set repetition_penalty=1.0 to AsyncLLM by @wuxibin89 in #1949
  • [fsdp] feat: Memory efficient cross entropy with a linear layer fused by @Jianbing-D in #462
  • [recipe] feat: qwen2.5vl 7b report and guide by @ISEEKYAN in #1969
  • [ckpt] refactor: enhance FSDP checkpoint manager flexibility by @0x404 in #1350
  • [env] fix: npu ray verion to 2.46.0 for CI problem by @wyz649296016 in #1987
  • Fix TypeError by Removing Duplicate Arguments in run_deepseek671b_math_megatron.sh by @none0663 in #1996
  • [megatron] feat: Config NCCL Timeout for Megatron Backend Model Loading by @none0663 in #1983
  • [tests] chore: ppo workflow runs on volcengine machine learning platform by @htc070011 in #1979
  • [megatron] fix: multiple key error when trying to override megatron tr… by @donpromax in #1990
  • [megatron] feat: robust and efficient mcore converter with meta device init and numel check for dpsk by @Yangruipis in #1995
  • Stabilize loss calculations by clamping KL divergence values by @syo093c in #1779
  • [ckpt] fix: run converter_hf_to_mcore with --test will raise an AttributeError by @lxg2015 in #2010
  • [algo] fix: vf_loss factor by @tongyx361 in #2016
  • [data] fix: fix retool sft data source by @vermouth1992 in #2018
  • [fsdp] fix: position_ids in qwen-vl by @ShareLer in #1947
  • [hardware] refactor: refactor part of device management by @FightingZhen in #1974
  • [trainer] fix: fix sft max_position_embeddings by @vermouth1992 in #2019
  • [misc] fix: fix format by @vermouth1992 in #2023
  • [megatron] fix: dpskv3 convert src and dst mixed up bug by @Yangruipis in #2029
  • fix: TensorDict usage error by @zhihe-wang in #2046
  • [hardware] feat: support qwen2_5_vl on ASCEND NPU by @as12138 in #1924
  • [trainer] chore: Reducing the number of calls to the write by @RuixiangMa in #2043
  • [Bug] fix None check in ...

v0.4.0 release: large MoEs, tool calling, and low resource friendly

06 Jun 23:55


Highlights

Large MoE models support: DeepSeek 671b & Qwen3 235b

Preview features are provided to enable large MoE RL training with Megatron backend, such as DeepSeek 671b documentation. The Megatron backend now supports:

  • expert parallelism, context parallelism, gradient checkpointing
  • DeepSeek-V3, Qwen3-235b, Mixtral, Moonlight
  • dist-ckpt support

Tool-calling, multi-turn RL, SGLang rollout

Sample-level rollout with tool calling and multi-turn RL is supported via SGLang. We provide the Search-R1 recipe built on top of it.
A prototype for sample-level async tool calling is also available with the vLLM AsyncLLM server.
Multiple enhancements and improvements have been made to SGLang rollout, supporting multi-node and multimodal training.
Sandbox fusion is integrated.

Low resource friendly

LoRA support is available, enabling 70B+ models on a single node with 8x A100 GPUs.
A fused cross entropy kernel drastically reduces peak memory: actor_rollout_ref.model.use_fused_kernels=True

New models, algorithms and recipes

New models and training utils include:

FSDP2 and training optimizations

FSDP2 is recommended as a replacement for FSDP1, providing better throughput and memory usage, and it is composable with other features (e.g. torch.compile):

actor_rollout_ref.ref.strategy=fsdp2
actor_rollout_ref.actor.strategy=fsdp2
critic.strategy=fsdp2 
reward_model.strategy=fsdp2 

Furthermore, FSDP2 cpu offloading is compatible with gradient accumulation. You can turn it on to save memory with actor_rollout_ref.actor.offload_policy=True.

Other optimizations include:

Deployment and hardware

  • Easy deployment with dstack
  • Enhancements to non-nvidia GPUs

Breaking changes and deprecations

  • FSDPSFTTrainer now requires the dataset arguments #1282
  • SFTDataset and RLHFDataset now take a config as the input #924
  • entropy_coeff now defaults to 0 #1770
  • FSDP1 support will be dropped in the next release.
  • vllm v0.5.4 support will be dropped in the next release.
  • A few options are now included in the default yaml files, so existing scripts may throw errors for overrides like +{config}={value}. Try removing the + to fix such errors.
    • ppo_trainer.yaml: trainer.val_before_train
    • sft_trainer.yaml: data.{prompt,response}_dict_keys
  • verl.utils.reward_score._default_compute_score is deprecated. Use verl.utils.reward_score.default_compute_score instead.
  • The name of the Ray actor changes from "WorkerDict_xxxx" to "FusedWorker_xxxx", and the name of tasks changes from "{cls_name}_{method_name}" to "fuw_execute".

New Contributors

@zhao9797 @frederrx @dingyuan-shi @SwordFaith @CJReinforce @linjc16 @wkcn @hijkzzz @JustinTong0323 @mertunsall @Altair-Alpha @czczup @SparkJiao @sunjin-k @tsaoyu @XueruiSu @zhaochenyang20 @NascentAscension @corgilee @lei-lei @pengsun @silverriver @mingruimingrui @Ann-Qin @lilei199908 @YeonwooSung @himalalps @tao-githup @as12138 @thibautbar @aoshen524 @MantasBaksys @YangWang92 @patrik-bartak @mansicer @wangfuchun-fc @survivi @RainBowLuoCS @gzpan @HuaizhengZhang @HollowMan6 @zTonyZhao @lxg2015 @estsauver @jhinpan @yhyang201 @qingquansong @chenhaiq @ShareLer @Artessay @Jackory @swtheing @U-rara @Andrewzh112 @mansoor-s @Necolizer @llkn-2 @yuyuz @linxxx3 @gaokaiz2 @ccchow @ezyang @zw0610 @pavelgein @plutoZZZZ @jybsuper @hebiao064 @GaotangLi @zhangyongxin121 @spacegoing @cedricbeta @Geaming2002 @imh966 @zyzshishui @zzong2006 @langfengQ @zheliuyu @casper-hansen @Bihan @czx6858 @GHGmc2 @DtYXs @thelongestusernameofall @xichengpro @Irvingwangjr @shinytang6 @qyhfrank @mlpod @popomen @liyc-ai @leo-pony @LiuXTao @Lins-01 @yzlnew @vllbc @ZDJeffrey @sukrucildirr @Moyu-42 @YRdddream @jdf-prog @HUGHNew @ElliottYan @NileZhou @shizhediao @rj42 @Crispig @omahs @CurryRice233 @china10s
Thank you for your first contributions!

Full Changelog: v0.3.0.post1...v0.4.0

v0.3.0.post1

02 Apr 17:39


This release includes fixes for sequence parallelism and SGLang:

  • Fixed a Ulysses sequence parallel issue that could hang with specific KV head counts #850
  • SGLang stability & memory improvements #773 #756

Full Changelog: v0.3.0.post0...v0.3.0.post1

v0.3.0.post0 release

30 Mar 02:38


Highlights

New algorithms and recipes

  • Vision language reasoning with qwen2.5-vl #386
  • PRIME, RLOO, remax #753 #234 #341
  • FIRE sampling algorithm, math-verify rewards #545 #683

Engine

  • SGLang integration is available for preview (single node with FSDP). Blazing fast! Please try it and give us feedback! We recommend using the verl main branch for continuous sglang-related fixes and improvements.
--actor_rollout_ref.rollout.name='sglang'
  • Megatron is now upgraded to v0.11, supporting the checkpoint manager, Qwen models & the GRPO algorithm
  • vllm is upgraded to v0.8.2, much faster than vllm v0.7 & v0.6.3 during rollout with the v1 engine! Please remember to enable cuda graph with the options below. There were memory leak issues before vllm v0.8.2, so we recommend using either vllm v0.6.3 or v0.8.2.
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.free_cache_engine=False \

Hardware:

  • AMD support is available for the vllm and FSDP backends. A getting-started one-pager is here

Docs:

  • tutorial for distributed training setup, debugging, and the programming model

Roadmap for Q2: #710. Contributions are welcome!

Changelog

New Features

Algorithm Support

  • Support for extra_info in reward calculation
  • RLOO advantage estimator
  • PRIME algorithm (recipe and baseline)
  • Initial support for VLMs (Vision-Language Models), including Qwen2.5VL GRPO example
  • Math-Verify Support
  • Support for GRPO with Megatron backend
  • Added FIRE sampling in rollout
  • Replaced DataLoader with StatefulDataLoader for checkpoint resuming
  • Support for external reward function loading

Performance Improvements

  • Support for SGLang as a rollout engine
  • Support for Ulysses sequence parallelism (transformers >= 4.48)
  • Support offloading parameters and optimizer during rollout
  • Tracking support for vemlp and TensorBoard
  • MFU (Model FLOPS Utilization) calculation for Megatron workers
  • Support for AMD (ROCm kernel)
  • Improved checkpoint loading (Megatron support for Llama/Qwen models)
  • Remove unnecessary torch.cuda.empty_cache() calls
  • Optimized weight loading (replaced custom VLLM loader with model.load_weights)

Bug Fixes

  • Fixed wrong args description
  • Fixed Gemma2 example and NGC Dockerfile
  • Fixed offload/load optimizer implementation
  • Fixed VLLM documentation links
  • Fixed typos and spelling errors
  • Fixed evaluation file path in Remax training scripts
  • Fixed OOM when resuming from checkpoint
  • Fixed position embedding for Qwen2.5-VL
  • Fixed PRIME algorithm issues (filtering long prompts, padding side, xformers)
  • Fixed FSDP checkpoint loading
  • Fixed SGLang rollout under multi-node
  • Fixed Python environment issues in installation
  • Fixed validation batch repeat before feeding into rollout

Deprecations and Breaking Changes

  • Deprecated val_batch_size
  • Removed redundant config parameters
  • Reverted RLHFDataset truncation config

Improvements

Documentation

  • Added Ray on Slurm example
  • Added FAQ for VLLM illegal memory access
  • Added distributed training docs (RLOO, VolcEngine)
  • Updated VLLM (>=0.7, >=0.8) documentation
  • Added meetup info, blogs, and project references
  • Improved Slurm example parameters
  • Added multi-node training and debug tutorial

Tooling & CI/CD

  • Added Dependabot action
  • Added secrets scan action
  • Added CI timeout and auto-cancel previous CI runs
  • Added e2e_ascend CI
  • Improved dataset handling in CI

Miscellaneous

  • Added assertion checks for PPO mini-batch size
  • Improved logging (SwanLab integration)
  • Pre-check resource pool availability to prevent hangs
  • Added tqdm progress bar for RayPPOTrainer
  • Skip special tokens in processing
  • Support for faster model downloads from ModelScope
  • Added Dockerfile for AWS SageMaker

New Contributors

This release includes contributions from 60 contributors, 47 of whom are new!
@AnselCmy @BASARANOMO @BaiqingL @BeSkyer @BearBiscuit05 @CajZella @Django-Jiang @DolbyUUU @ETOgaosion @HaoshengZou @ISEEKYAN @Kunlun-Zhu @PeterSH6 @PzySeere @Raf-Chen @WillemJiang @Yifan-Song793 @ZSL98 @Zeetc @ZefanW @Zeyi-Lin @caaatch22 @celestialli @danielz02 @dependabot @dirtyDan0 @eltociear @eric-haibin-lin @fyqqyf @gameofdimension @ganler @haoy-zzz @hiyouga @hongpeng-guo @iceflame89 @jayl940712 @kinman0224 @laonahongchen @liudayuan-carrot @maksimstw @mi804 @minleminzui @nomadlx @none0663 @nwiad @ocss884 @pat-jj @thomZ1 @tongyx361 @uygnef @vermouth1992 @wangchengnuo @wuxibin89 @xffxff @yaguanghu @yushengsu-thu @yyDing1 @zhanluxianshen @zhr2001 @zpqiu
Thank you all for making verl better!!

Full Changelog: v0.2.0.post2...v0.3.0.post0

Known issues tracker: #827

v0.2.0.post2

21 Feb 14:32
fb53278


What's Changed

  • Fixed installation issues.
  • Fixed the remove padding flags in the gemma example.

New Contributors

Full Changelog: v0.2...v0.2.0.post2

v0.2 release

15 Feb 15:18
828df7e


Highlights

New algorithms and features

Performance optimization:

  • Remove padding tokens (i.e. sequence packing; see the sketch after this list). Significant throughput increase expected for Llama, Mistral, Gemma, and Qwen2 transformer models. Documentation
actor_rollout_ref.model.use_remove_padding=True
critic.model.use_remove_padding=True
  • Dynamic batch size. Significant throughput increase for variable-length sequences. Documentation and example
actor_rollout_ref.actor.ppo_max_token_len_per_gpu
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu
critic.ppo_max_token_len_per_gpu
critic.forward_micro_batch_size_per_gpu
reward_model.forward_micro_batch_size_per_gpu
  • Ulysses sequence parallelism for long-context training:
actor_rollout_ref.actor.ulysses_sequence_parallel_size
critic.ulysses_sequence_parallel_size
reward_model.ulysses_sequence_parallel_size
  • vllm v0.7+ integration (preview). For the qwen2 ppo example, 25% time reduction in rollout compared to v0.6.3, and 45% time reduction when cuda graph is enabled. Documentation
actor_rollout_ref.rollout.enforce_eager=False
actor_rollout_ref.rollout.free_cache_engine=False
  • Liger kernel integration:
model.use_liger=True
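
For reference, the packing sketch mentioned above. Removing padding (sequence packing) concatenates variable-length sequences into a single row and tracks their boundaries with cumulative sequence lengths, the layout that varlen attention kernels consume. Illustrative only, not verl's implementation:

import torch

seqs = [torch.randint(0, 100, (n,)) for n in (3, 5, 2)]  # three unpadded token sequences
packed = torch.cat(seqs)                                 # one row of 10 tokens, no pad tokens
lengths = torch.tensor([len(s) for s in seqs])
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int64), lengths.cumsum(0)])
# cu_seqlens == tensor([0, 3, 8, 10]): per-sequence boundaries in the packed row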

Changelog

New Features

  1. Algorithm Support:

    • Added support for GRPO algorithm (#124).
    • Implemented REINFORCE++ algorithm (#228).
    • Added ReMax algorithm (#234)
  2. Performance Improvements:

    • Enabled dynamic batch size support (#118).
    • Added meta device initialization and parallel load for FSDP to avoid OOMs during init (#123).
    • Improved gradient accumulation in sequence balance (#141).
    • Added ref/RM offload support (#121).
    • Added LoRA support for SFT (#127).
    • feat: support rmpad/data-packing in FSDP with transformers (#91)
    • Liger kernel integration (#133)
  3. Experiment Tracking:

    • Integrated SwanLab for experiment tracking with online/offline mode and local dashboard support (#218).
    • Added Mlflow support (#74).

Bug Fixes

  1. Critical Fixes:

    • Fixed checkpoint save with existing directories (#174).
    • Fixed incorrect response_attention_mask in vLLM rollout (#213).
    • Fixed gradient accumulation loss value (#102).
    • Fixed reward model issues with TokenClassification models (#99).
  2. Code Fixes:

    • Fixed redundant non_zero_mask (#152).
    • Fixed validation dp_size (#90).
    • Fixed response_mask index (#60).

Improvements

  1. Performance:

    • Improved memory efficiency in logprobs_from_logits_v2 (#220).
    • Enabled multiprocess dataloader in SFT trainer (#122).
    • Added MFU calculation support (#117).
  2. Miscellaneous:

    • Added option to log validation generations to wandb (#177).

Deprecations and Breaking Changes

  1. Breaking Changes:
    • Changed micro_batch_size to micro_batch_size_per_gpu (#136).
    • Removed @ray.remote on workers to allow inheritance (#61).
    • Refactored old_log_prob into a separate function (#129).

Contributors

A big thank you to all the contributors who made this release possible:
@zhanluxianshen @xingyaoww @fzyzcjy @emergenz @openhands-agent @ZSL98 @YSLIU627 @ZefanW @corbt @jaysonfrancis @hiyouga @Jiayi-Pan @hongpeng-guo @eltociear @chujiezheng @PanAndy @zwhe99 @pcmoritz @huiyeruzhou @VPeterV @uygnef @zhiqi-0 @ExtremeViscent @liziniu @nch0w @Cppowboy @TonyLianLong @4332001876 @tyler-romero @ShaohonChen @kinman0224 @willem-bd @bebetterest @WeiXiongUST @dignfei


PyPI package will be available soon! Please let us know on GitHub if there's a problem extending RL training recipes based on the pip-installed version of verl.

Full Changelog: v0.1...v0.2

v0.1

11 Dec 16:14


What's Changed

  • [misc] feat: update tutorial for opensource version by @PeterSH6 in #4
  • [misc] fix: vllm gpu executor issue when world_size is 1 and typo in doc by @PeterSH6 in #9
  • [ci] feat: add test files for ray hybrid programming model by @PeterSH6 in #23
  • [chore] remove unnecessary updating of _worker_names by @kevin85421 in #19
  • [misc] feat: add gemma example for small scale debug and fix gradient checkpoint in critic by @PeterSH6 in #27
  • [misc] fix issue in hf_weight_loader and fix typo in doc by @PeterSH6 in #30
  • [ci] test lint ci and lint tests dir by @PeterSH6 in #28
  • [example] fix: fix math circular dependency by @eric-haibin-lin in #31
  • [example] fix: make wandb optional dependency. allow extra args in existing scripts by @eric-haibin-lin in #32
  • [docs] feat: add related publications by @eric-haibin-lin in #35
  • [tokenizer] feat: support tokenizers whose pad_token_id is none by @eric-haibin-lin in #36
  • [rollout] feat: support vLLM v0.6.3 and fix hf rollout import issue by @PeterSH6 in #33
  • [distro] feat: add docker support by @eric-haibin-lin in #41
  • [example] add a split placement tutorial by @PeterSH6 in #43
  • [doc] add a new quickstart section by @PeterSH6 in #44
  • [BREAKING][core] move single_controller into verl directory by @PeterSH6 in #45

New Contributors

Full Changelog: v0.1rc...v0.1