
PaddlePaddle 3.1.0 Release Note EN


Important Update

PaddlePaddle framework version 3.1 further optimizes and polishes the core functionality of automatic parallelism, improving both usability and performance. It adds FP8 low-precision training support, increasing the training speed of large models by 10-20%. It also improves the hardware extension mechanism, reducing the cost of adapting CUDA-like hardware: users only need to register kernels. In addition, the basic capabilities of the framework have been strengthened to improve stability. The key updated features are as follows:

  • Automatic Parallel Architecture: The automatic parallel architecture has been further refined to improve the usability of its core mechanism and the performance of dynamic graphs. The core mechanism now includes additional operator splitting derivation rules, support for splitting the same dimension of a distributed tensor across multiple mesh dimensions, and support for dynamic graph parallel strategies (PP, CP, SEP, TP-CONV), among others. At the same time, the dynamic graph automatic parallel system has been systematically optimized for performance, achieving results essentially on par with manual parallelism on models such as Llama2, Qwen, and Baichuan.
  • Low-precision training: Supports low-precision training based on a blockwise FP8 GEMM operator, achieving training accuracy comparable to BF16 and speeding up large-model training by 10-20%.
  • Heterogeneous multi-chip adaptation: Provides a kernel reuse mechanism for CUDA-like hardware backends; only kernel registration is required to use the corresponding kernels.
  • Framework stability enhancement: Fixed operator computation errors for 0-size tensors and very large tensor dimensions.

1. User experience upgrade

API enhancements, bug fixes, and improvements are aimed at enhancing user experience and API usability. The paddle.randn_like API has been added, multiple API functional defects have been fixed, and support for complex types and 0-Size Tensor has been enhanced. Documentation and code have also been updated and optimized accordingly to improve overall accuracy and professionalism.

New Features

  • Added paddle.randn_like API. #72492
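
A minimal usage sketch of the new API, assuming it follows paddle.randn-style semantics (standard-normal samples) while taking its shape and default dtype from the input tensor:

```python
import paddle

x = paddle.ones([2, 3], dtype="float32")
# Standard-normal samples with the same shape (and default dtype) as `x`.
noise = paddle.randn_like(x)
print(noise.shape)  # [2, 3]
```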

Bug fixes

  • Fixed the issue of inconsistent input and output types in the tensordot API. #72139
  • Fixed the issue where the output of the atleast API was a Tensor list. #73102
  • Fixed the issue with the nonzero API. #72003
  • Fixed the memory leak issue in dualpipev. #72070
  • Fixed the overflow issue in softmax calculation. #71935
  • Fixed the shape checking issue in take_along_axis when broadcast=False. #72436
  • Fixed the incorrect handling of Nan input in maximum and minimum functions. #71933
  • Fixed the issue with visit_type. #72782
  • Fixed the int32 out-of-bounds issue in gather_scatter_functor. #72905
  • Fixed the inplace implementation of Bernoulli. #73271
  • Fixed issues with moe_permute and moe_unpermute. #73365
  • Fixed the syntax checking issue of ast.parse for pyi files. #71872
  • Fixed the issue of complex division. #73331
  • Fixed issues related to TensorRT integration. #72302, #72278

Enhanced functionality

Documentation

  • Fixed errors in the documentation, improving its usability and user experience. #72549, #73036

Developer-related

Cleanup of obsolete code

2. Basic execution architecture

Supports FP8 matrix operations to improve model training efficiency, and strengthens multiple models to improve stability; also provides a _C_ops-style interface for calling backward operators, facilitating memory optimization and functional experimentation.
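
A minimal sketch of calling an operator pair through the _C_ops-style interface; the grad-op name and its (out, out_grad) argument order below follow the public operator definitions but should be treated as assumptions rather than a stable, documented signature:

```python
import paddle
from paddle import _C_ops

x = paddle.rand([4, 4])

# Forward operator invoked directly through the _C_ops interface.
out = _C_ops.relu(x)

# Backward operator invoked directly, bypassing the autograd tape, which is
# useful for memory-optimization experiments. The name `relu_grad` and the
# (out, out_grad) argument order are assumptions based on the operator
# definition; consult the ops yaml for the exact signature.
out_grad = paddle.ones_like(out)
x_grad = _C_ops.relu_grad(out, out_grad)
print(x_grad.shape)
```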

New Features

Bug fixes

Enhanced functionality

Performance improvement

Discarded

  • Code cleanup: Cleaned up Python 3.8 support declarations, and completed related code cleanup, dependency reduction, and syntax modernization updates to optimize code maintainability and compatibility. #71815, #72802, #72856, #72854, #72855, #72873, #72870, #72868, #72891

Developer-related

Others

  • Others: Added CPU kernel support for FP16/BF16 data types, optimized error handling and tolerance configuration in test modules, etc. #71764, #71951, #72944

3. Compiler architecture

Optimized compiler performance and enhanced stability.

Performance optimization

  • Support automatic conversion and optimization of Layout in training scenarios. #71891
  • Kernel compilation optimizations for operators such as argmin, argmax, and arange have been added to the backend. #71956, #72598
  • Support for fused optimization of matrix multiplication. #72846
  • Optimized the kernel computation performance of some operators. #72871
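
The compiler optimizations in this section apply when a model is run through the CINN backend; a minimal sketch of enabling it (the backend="CINN" option of paddle.jit.to_static is assumed here):

```python
import paddle

class Net(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.linear = paddle.nn.Linear(64, 64)

    def forward(self, x):
        # argmax-style ops benefit from the backend kernel optimizations above.
        return paddle.argmax(self.linear(x), axis=-1)

net = Net()
# Compile with the CINN compiler backend (assumed flag; requires a CINN-enabled build).
net = paddle.jit.to_static(net, backend="CINN")
out = net(paddle.rand([8, 64]))
```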

Bug fixes

Fix some processing logic bugs in various scenarios. #71813, #71886, #71927, #71915, #71946, #71949, #71955, #71942, #71939, #71973, #72001, #72020, #72014, #72021, #72027, #72061, #72025, #72095, #72108, #72132, #71985, #72106, #72140, #72167, #72037, #72178, #72143, #72175, #72191, #72213, #72189, #72214, #72166, #72180, #72284, #72267, #72348, #72332, #72307, #72353, #72204, #72457, #72426, #72536, #72541, #72365, #72621, #72630, #72669, #72682, #72732, #72811, #72941, #72795, #73536

4. Automatic parallel architecture

In version 3.1, we further refined the automatic parallel architecture to enhance the usability of automatic parallelism and the performance of dynamic graphs. Specifically, we improved the core mechanism of automatic parallelism, including adding new splitting derivation rules for multiple operators, supporting the splitting of the same dimension of distributed tensors by multiple mesh dimensions, and supporting dynamic graph parallel strategies (PP, CP, SEP, TP-CONV), etc. At the same time, we systematically optimized the performance of the automatic parallel system for dynamic graphs, achieving performance that is basically on par with manual parallelism on models such as Llama.

Functional improvements

  • Support for distributed tensors where the same dimension is split across multiple mesh dimensions (see the sketch after this list). #73233

  • Support for converting automatic parallel communication topology descriptions (ProcessMesh) into manual parallel communication groups. #72052

  • Support send/recv of any serializable Python object. #72098

  • Completed the dynamic graph parallel strategies:

    • Support for the 1F1B and VPP scheduling modes of the pipeline parallelism strategy. #72155, #72480, #72179

    • Support for parallel processing of long texts. #73195

    • Support for the visual parallelism strategy. #73063, #73039

    • Support for communication along the data parallel dimension in automatic parallelism. #72540

  • Added the following operator splitting derivation rules:

    • min, min_grad #72269

    • bitwise_or, atan2, fmax, fmin, reciprocal #72310

    • argmin, abs, cosh #72264

    • mean_all, mean_all_grad #72479

    • topk, topk_grad #72499

    • argsort #72388

    • round, mish, elu, selu, celu, stanh, softplus, softshrink, thresholded_relu, logit, nonzero #72312

    • unique ops #72824

    • put_along_axis #72766

    • round_grad, trunc_grad, ceil_grad, floor_grad, poisson_grad #72677

    • log_softmax, cummax, cummin #72720

    • unary #72177

    • unary_grad #72260

    • index_select, index_select_grad #72727

    • roll, roll_grad #72740

    • empty_like #73169

    • roi_align, roi_align_grad #72925

    • expand_as, expand_as_grad #73107

    • fused_gemm_epilogue #73126

    • label_smooth #72845

    • group_norm, group_norm_grad #72946

    • instance_norm, instance_norm_grad #72938

    • batch_norm, sync_batch_norm #72918

    • reduce_any #73175

    • fused_gemm_epilogue_rule #73494
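
As referenced above, a minimal sketch of splitting the same tensor dimension across multiple mesh dimensions with the auto-parallel API; the 2x2 mesh, its dim_names, and the tensor shape are illustrative assumptions, and the script is expected to run under a 4-card distributed launch (e.g. python -m paddle.distributed.launch):

```python
import paddle
import paddle.distributed as dist

# Illustrative 2x2 device mesh with two named mesh dimensions.
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])

x = paddle.rand([64, 128])
# Shard dimension 0 of `x` along *both* mesh dimensions: each of the four
# ranks then holds a 16x128 local shard of the distributed tensor.
dist_x = dist.shard_tensor(x, mesh, [dist.Shard(0), dist.Shard(0)])
print(dist_x.shape)  # global shape stays [64, 128]
```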

Performance optimization

  • Support for the tensor_fusion and overlap optimization strategies in grouped sharding parallelism. #72551, #72902, #73142, #71785
  • Optimized the reshard module to reduce communication overhead. #71969, #73024, #71868
  • Optimized the splitting derivation rule for multiply to reduce communication overhead. #73408
  • Optimized backward communication when the distributed sharding status is Partial, reducing communication overhead. #73236
  • Communication fusion optimization during gradient updates. #72120, #72745
  • Optimized the splitting derivation rule for gelu to reduce communication overhead. #73279
  • Optimized the splitting derivation rule for fused_rms_norm when an input has Partial status, reducing communication and computation overhead. #73054

Bug Fixes

  • Fixed a communication hang in the virtual pipeline parallel strategy on H-series GPUs. #71104, #73470
  • Fixed a bug in save/load. #72023
  • Fixed a bug where the linear_fused_grad_add strategy did not take effect in dynamic graph mode. #72708
  • Fixed issues where the fused_rms_norm operator failed to run, along with precision bugs. #72663
  • Fixed a bug in the splitting derivation rule for the expand operator. #73154

Others

  • Clean up dead code to facilitate code maintenance. #71814, #72538
  • Added the local_map API to pass distributed tensors to functions written for ordinary tensors. #71804
  • Added checks for the fused_linear_param_grad_add operator. #72483

5. Operator mechanism

New Features

  • Gradient and automatic differentiation optimization: Adds initial support for double-gradient (second-order) computation in the put_along_axis and repeat_interleave operations, improves the numerical stability of complex operators in automatic differentiation scenarios, and implements operator decomposition for the masked_fill operation. #72789, #73056, #73225
  • Operator mechanism extension: Added support for custom `__radd__` and `__rmul__`, enhancing the framework's ability to overload reflected (right-hand) operators. #73119
  • FP8 module support and operator development: Added support for FP8 block-quantized GEMM, introduced multiple fused operators, and provided efficient operator-level implementations for Mixture-of-Experts (MoE) models, enhancing training and inference performance (see the sketch below). #73228, #73285, #73133, #73364, #73520, #73531
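
The sketch below illustrates the blockwise idea behind FP8 quantized GEMM rather than the actual fused kernels above: each block is scaled by its own absolute maximum so values fit the FP8 E4M3 range, and the per-block scales are kept for dequantization. The 128 block size and the NumPy emulation of the FP8 cast are assumptions for illustration only.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 128       # illustrative block size

def blockwise_quantize(a: np.ndarray):
    """Return the block-scaled matrix and its per-block scales (FP8 cast emulated)."""
    m, n = a.shape
    scales = np.empty((m // BLOCK, n // BLOCK), dtype=np.float32)
    q = np.empty_like(a)
    for i in range(0, m, BLOCK):
        for j in range(0, n, BLOCK):
            blk = a[i:i + BLOCK, j:j + BLOCK]
            s = np.abs(blk).max() / E4M3_MAX + 1e-12      # per-block scale
            scales[i // BLOCK, j // BLOCK] = s
            q[i:i + BLOCK, j:j + BLOCK] = blk / s         # real kernels cast to FP8 here
    return q, scales

a = np.random.randn(256, 512).astype(np.float32)
q, scales = blockwise_quantize(a)
# Dequantize the first block; the round trip is exact here only because the
# FP8 cast itself is emulated, not actually performed.
print(np.allclose(q[:BLOCK, :BLOCK] * scales[0, 0], a[:BLOCK, :BLOCK]))
```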

Bug Fixes

  • Gradient and automatic differentiation stability improvement: Fixed errors in backward (gradient) operator computations, enhancing numerical stability and functional correctness in automatic differentiation scenarios. #71716, #72299, #72358, #73037, #73140, #73185
  • Numerical accuracy and overflow protection: Addresses issues such as numerical overflow, loss of precision, and large tensor overflow, ensuring the reliability of low-precision computations and large tensor operations. #72584, #72608, #72681, #72639, #73245, #73359, #72456
  • Operator logic and framework alignment: Aligned operator computation logic, fixed issues such as abnormal operator inputs, and added checks to ensure the correctness of framework functionality. #72282, #71863, #72650, #72843, #73070, #73141, #73203, #73350, #73440, #73539, #73339
  • CUDA kernel and hardware adaptation optimization: Supports NVIDIA SM90 architecture, fixes issues such as overflow, removes redundant CUDA error checks, and enhances GPU computing efficiency and adaptability to new hardware. #72507, #72849, #72959, #73130, #73489

Enhanced functionality

  • Added a fast division and modulo implementation for int64_t, improving computational performance and numerical stability in large-integer scenarios. #72530
  • Optimized the strided tensor copy kernel to improve the efficiency of data copies under non-contiguous memory layouts. #72662
  • Unified the usage of the quantization API in dynamic and static graph modes, simplifying the quantization model development process. #73100

Performance improvement

  • Optimize the decomposition performance of the Gelu operator to enhance computational efficiency. #72812

Others

6. Framework performance optimization

New Features

  • The acc_steps of sharding_overlap is configurable. #72395

Bug fixes

  • Fixed the inplace issue of operator c_softmax_with_cross_entropy_grad. #72366

Feature Enhancements

  • Performance optimization and acceleration: Enabled cuDNN support for depthwise convolution, improving convolution efficiency. Updated pooling operation strategies and optimized permute memory operations to reduce CUDA memory usage. Optimized printing speed, accelerating debugging and log output. #71796, #73442, #73563
  • Feature enhancements and operation support: Added the masked_fill operation and Boolean indexing optimizations to enhance tensor masking capabilities (see the sketch after this list). Implemented the index_elementwise operation to support index-based element-level operations. Added pooling and reshape execution strategies to increase the flexibility of model operations. #72788, #72942
  • Bug fixes and stability improvements: Fixed Partial-state support issues of fused_rms_norm in SPMD parallel mode. Corrected index errors in output dimension calculation and IndexGetStride during slice operations to ensure computational correctness. #72118, #72223, #73184, #73237, #73054
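
A minimal sketch of the masked_fill operation referenced above, assuming the paddle.masked_fill(x, mask, value) form:

```python
import paddle

x = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0]])
mask = paddle.to_tensor([[True, False], [False, True]])

# Positions where `mask` is True are filled with the constant; the rest of
# `x` passes through unchanged.
y = paddle.masked_fill(x, mask, -1.0)
print(y.numpy())  # [[-1.  2.] [ 3. -1.]]
```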

Performance improvement

  • Faster Guard adaptation: Reduced SOT end-to-end overhead. #71900, #71979, #72081, #72327, #72564, #72823
  • Performance optimization and acceleration: Optimized operator scheduling strategies, upgraded Flash Attention to version 3 to reduce computational overhead, and fixed model performance bottlenecks to improve inference and training speed. #71937, #71828, #71461, #72039, #72228, #72225, #72623, #72666, #73147, #73393
  • Parallel computing: Optimized the mesh resharding strategy in automatic parallelism, implemented communication fusion and optimization logic in the sharding stages, enhanced the stability of distributed training, and reduced its communication overhead. #71969, #72120, #73279, #73406

  • Feature enhancements and fixes: Optimized operator indexing and kernel scheduling logic. #72625, #72741, #73082, #73501

  • Model and operation support: Support for depthwise convolution in the NHWC format, adapting to more hardware memory layouts. #72121

7. Hardware adaptation

Optimized hardware adaptation mechanisms and provided a solution for CUDA-like hardware to reuse CUDA kernels.

New Features

Based on the CustomDevice integration solution, we introduce a low-cost support solution for CUDA-like hardware backends. These backends can be plugged into Paddle in a modular manner, allowing the majority of CUDA kernels from the NVIDIA ecosystem to be reused in Paddle at low cost, and they can be decoupled from feature upgrades within the Paddle framework. This significantly reduces the cost of hardware backend integration and iteration, increases users' willingness to adopt, and fosters a positive collaborative ecosystem between Paddle and hardware manufacturers. #72604, #72668, #72758, #72865, #72910, #73033, #73145, #73281, #73079

Enhancing XPU basic capabilities: adding kernels, expanding data types, and supplementing branches in the XPU environment #71424, #71809, #71594, #71779, #71756, #71573, #71883, #71954, #71931, #72280, #72361, #72406, #72528, #72752, #72852, #72982, #73357, #73414, #73464, #73234, #71776

Extended data type support for DCU kernels. #73129

Bug fixes

Fix xpu execution issues #71852, #71966, #72005, #71908, #72431, #72519, #72734, #72763, #72762, #72890, #72867, #73071, #73004, #72726, #73113, #73127, #73025, #73301, #73292, #73272, #73305, #73356, #73438, #72041, #72275, #72787, #73504, #73290

8. Installation environment adaptation

We have optimized the stability and cross-platform compatibility of the framework, fixed issues related to compilation and installation failures on different platforms, upgraded key dependencies such as CUDA, further optimized the CI/CD process, improved the build speed, and enhanced the overall stability of the system. We have also discontinued the maintenance of compilation and installation in the Python 3.8 environment.

Bug fixes

  • Fixed compilation errors when using clang17 to compile third-party libraries. #72524
  • Fixed compilation issues when using CUDA 12.9. #72808, #72841, #72978, #73360
  • Fixed compilation issues when using GCC 13.3. #73144
  • Fixed compilation issues when WITH_PIP_CUDA_LIBRARIES=ON. #72907
  • Fixed compilation issues when WITH_NVSHMEM=ON. #73368

Enhanced functionality

  • Avoid copying temporary files generated during the compilation of custom operators. #73196
  • Warning message optimization. #72877

Developer-related

Discarded

  • Discontinue support for compilation in Python 3.8 environment. #72827

9. List of contributors

0x3878f, A-nnonymous, AndSonder, ApricityXX, aquagull, author, baoqiwen, BeingGod, blacksheep-Aristotle, BoShen5, bukejiyu, cangtianhuang, carryyu, chang-wenbin, changeyoung98, chen2016013, ckl117, co63oc, cqulilujia, crashbussy, cszdrg, Cutelemon6, cyy536, DanielSun11, danleifeng, datutu-L, deepllz, Dmovic, DrRyanHuang, dynamicheart, Eddie-Wang1120, eggman-1024, emmanuel-ferdman, Enigmatisms, enkilee, fangfangssj, feixi21, FeixLiu, ForFishes, Function-Samuel, ggggxm, GITD245, Glencsa, GoldenStain, gongshaotian, gouzil, gzy19990617, hanlintang, Hongqing-work, houj04, huangjiyi, hxzd5568, HydrogenSulfate, jzhang533, LCStayingdullCircuit, leon062112, lifulll, linkk08, LittleHeroZZZX, liufengwei0103, Liujie0926, liuruyan, lixinqi, LiYuRio, lizexu123, lizhenyun01, lj970926, lshpku, megemini, mikethegoblin, ming1753, mzj104, NKNaN, ooooo-create, pesionzhao, phlrain, pkuzyc, PolaKuma, Qin-sx, RichardWooSJTU, risemeup1, runzhech, RuohengMa, sasaya123, shanjiang7, SigureMo, sneaxiy, swgu98, SylarTiaNII, tianhaodongbd, tianshuo78520a, timminator, tizhou86, umiswing, waliwali777, wanghuancoder, Waynezee, Wennie396, xiaoguoguo626807, XieYunshen, Xing-lil, xkkkkkk23, Xreki, xuxinyi389, Yeenyeong, yongqiangma, YqGe585, yuanlehome, YuanRisheng, yulangz, yuwu46, zeroRains, zhangbo9674, zhanghonggeng, zhangting2020, ZhangX-21, zhangyk0314, zhangyuqin1998, zhink, zhiqiu, zhouquan32, zhoutianzi666, zhupengyang, zrr1999, zty-king, zyfncg
