
PaddlePaddle 3.1.1 Release Note EN


Key Updates


PaddlePaddle Framework 3.1.1 delivers systematic enhancements across the entire large-scale model training pipeline. It addresses fundamental stability issues, including operator numerical precision and functional correctness in large-model scenarios, and, together with standardized API logging and broader unit-test coverage, significantly improves training correctness and stability. On the performance side, it improves the efficiency of key framework APIs and of FP8 quantization computation, and speeds up FP8 quantization in distributed training and pipeline parallelism, substantially boosting training throughput. Sharding-deduction coverage in the automatic parallel architecture has been expanded. For inference deployment, compatibility is improved and expert-parallel (EP) inference is further enhanced. Overall, this release builds a more robust and efficient foundation for large-scale model development while maintaining full API compatibility.

  • Enhanced Operator & Execution System Correctness and Stability: systematically resolved 0-size Tensor, large-shape Tensor, and CPU/GPU precision-consistency issues to ensure training correctness and stability.
  • FP8 Operator Optimizations: further improved the performance of FP8-related quantization and fused-computation operators and optimized SM usage for certain operators, enhancing FP8 mixed-precision training efficiency.
  • More Stable and Faster Large-Scale Model Training: systematically optimized execution efficiency in slice-related scenarios for significant performance gains; fixed parameter-synchronization issues in pipelines; added FP8 parameter quantization under Sharding parallelism and maximal communication-computation overlap in DualPipe, ensuring stable and efficient parallel training. Also enhanced sharding-deduction capabilities in the automatic parallel architecture to improve partitioning efficiency.
  • Inference Deployment: added support for loading safetensors and enhanced internode_ll_two_stage functionality in EP parallelism to further boost inference efficiency.

1. User Experience Upgrade


This release adds several APIs commonly used in large-model scenarios and systematically fixes issues in API documentation and implementations.

New Features

  • Added paddle.device.device_guard API for dynamic graph device switching context management. #73964
  • Added dtype conversion APIs: paddle.Tensor.bool, float16, half, bfloat16, float32, float, float64, double, int8, char, uint8, byte, int16, short, int32, int, int64, long, complex64, complex128, cfloat, cdouble. #74416
  • Added paddle.msort for multi-dimensional array sorting. #74421
  • Added paddle.ravel & paddle.Tensor.ravel for tensor flattening. #74439, #74454
  • Added F.dropout1d for randomly zeroing entire channels of 1-D feature maps. #74444
  • Added paddle.Tensor.type_as for type casting. #74459
  • Added paddle.Tensor.mul_, paddle.autograd.Function, paddle.argwhere. #74493
  • Added paddle.nn.MultiLabelMarginLoss loss function. #73538
  • Added autocast APIs: paddle.is_autocast_enabled, paddle.get_autocast_gpu_dtype. #74441
  • Added requires_grad property to Tensor. #74491
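
As a quick illustration of several of the new APIs above, the following minimal sketch exercises the dtype-conversion methods, msort/ravel, type_as, argwhere, mul_, the requires_grad property, and device_guard. It assumes these follow the PyTorch-style semantics their names suggest; exact behavior should be checked against the API documentation.

```python
import paddle

x = paddle.to_tensor([[3.0, 1.0], [2.0, 4.0]])

# Dtype-conversion methods (assumed to be PyTorch-style casting aliases).
x_bf16 = x.bfloat16()
x_i64 = x.int64()

# Flattening and sorting.
flat = paddle.ravel(x)   # 1-D flattened tensor
rows = paddle.msort(x)   # assumed to sort along the first dimension, as in torch.msort

# Cast one tensor to another tensor's dtype.
y = paddle.to_tensor([1, 2, 3], dtype='int32')
y_cast = y.type_as(x)

# Coordinates of nonzero elements, and in-place multiply.
idx = paddle.argwhere(x > 2.0)
x.mul_(2.0)

# New property for gradient tracking (assumed to mirror the inverse of stop_gradient).
w = paddle.randn([2, 2])
print(w.requires_grad)

# Autocast state query.
print(paddle.is_autocast_enabled())

# Dynamic-graph device switching; the device string ('cpu', 'gpu:0', ...) is an assumption.
with paddle.device.device_guard('cpu'):
    z = paddle.zeros([2, 2])
```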

Bug Fixes

  • Fixed Tensor.place comparison logic. #73532
  • Fixed F.adaptive_log_softmax_with_loss calculation. #73554
  • Fixed Tensor.__radd__ and Tensor.__rmul__ issues. #73833
  • Fixed 0-size API boundary issues. #73874
  • Fixed _DataLoaderIter multi-process handling. #73931
  • Fixed paddle.nanmedian handling of empty inputs. #74263
  • Fixed paddle.eigh computation errors. #73349
  • Fixed paddle.arange issues. #74159

Enhancements

  • paddle.cumprod supports dim=None for global computation. #74106
  • Initialization APIs (zeros/ones/eye etc.) support device/dtype parameters. #74477
  • paddle.ones supports variable-length shape parameters. #74494
  • F.gelu supports string format for approximate parameter. #74485
  • F.scaled_dot_product_attention supports 3D inputs. #73804
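
The sketch below illustrates the enhanced call signatures listed above. It is a minimal example under the assumption that the new keyword arguments behave as described in this note (e.g. the device keyword on initialization APIs and the string form of approximate); exact parameter spellings should be verified against the released API.

```python
import paddle
import paddle.nn.functional as F

x = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0]])

# Global cumulative product when dim is None (assumed to work over the flattened input,
# as with NumPy's axis=None).
p = paddle.cumprod(x, dim=None)

# Variable-length shape arguments instead of a single shape list.
ones = paddle.ones(2, 3)

# Initialization APIs accepting dtype/device keywords (keyword names per this note).
zeros = paddle.zeros([2, 3], dtype='float32', device='cpu')

# String form of the approximate argument ('tanh' assumed).
g = F.gelu(x, approximate='tanh')
```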

Documentation

Others

2. Parallel Strategy Optimization


This release delivers multiple parallel-training optimizations: forward-state annotation and a bubble hook for VPP pipelines, fixes for shared-parameter synchronization in pipeline parallelism, a fix for a training hang caused by conflicting EP and Sharding dimensions, corrected event management in DualPipe scheduling, enhanced FP8 quantization support under the Sharding strategy, and improved communication-computation overlap in DualPipe pipelines.

New Features

  • Added forward state annotation for VPP pipeline scheduling. #74187
  • Added bubble hook for VPP memory balancing. #74100

Bug Fixes

  • Fixed shared parameter synchronization in pipeline parallelism. #74047, #74087, #73913
  • Fixed training hang when EP and Sharding dimensions conflict. #74268
  • Fixed event management in DualPipe scheduling. #74158

Performance Improvements

3. Communication Library


Enhanced the communication library with new features and performance optimizations: added a stream interface, removed redundant comm contexts that wasted memory, added initialization switches to improve communication efficiency, added multicast=2 support for DeepEP operators, and added a TMA-optimized internode implementation for DeepEP.

New Features

  • Added stream interface for communication library. #74016

Bug Fixes

  • Removed redundant comm contexts to save memory. #73297

Performance Improvements

  • Added initialization switches for communication library. #73297
  • DeepEP operators support multicast=2. #74144
  • Added TMA-optimized internode support for DeepEP. #74284

4. Basic Execution Architecture


Systematically upgraded slice-related operators for significant efficiency improvements and fixed key issues in large-scale model training.

Bug Fixes

Enhancements

5. Automatic Parallel Architecture


Enhanced sharding deduction rules and resolved key usability/correctness issues.

Improvements

  • Optimized the sharding rule for reshape to support multi-mesh dimension partitioning (see the sketch below). #74352
  • Added default deduction rules for multi-mesh partitioning of distributed tensors. #74396
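
For context, the sketch below shows the kind of reshape on a sharded tensor that these deduction rules cover. ProcessMesh, shard_tensor, and Shard are the public auto-parallel APIs; the exact output placement deduced for the reshape is shape- and mesh-dependent, so this is only an illustrative setup (run it under paddle.distributed.launch with two devices).

```python
import paddle
import paddle.distributed as dist

# A 1-D mesh over two ranks; launch with (demo.py is a placeholder name):
#   python -m paddle.distributed.launch --devices=0,1 demo.py
mesh = dist.ProcessMesh([0, 1], dim_names=["dp"])

x = paddle.randn([4, 8])
# Shard the first dimension across the mesh.
dx = dist.shard_tensor(x, mesh, [dist.Shard(0)])

# The reshape sharding rule deduces a distributed placement for the output
# instead of falling back to full replication (exact placement is shape-dependent).
y = dx.reshape([4, 2, 4])
print(y.shape)
```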

Added sharding rules for:

  • take_along_axis #72063
  • index_put, index_put_grad #73486
  • conv3d, conv3d_grad #72882
  • depthwise_conv2d, depthwise_conv2d_grad #73134
  • conv2d_transpose, conv2d_transpose_grad #73188
  • einsum #73753
  • MoE operators #74215

Bug Fixes

  • Fixed long-sequence parallel strategy. #73735
  • Fixed save/load in pipeline parallelism. #73749
  • Fixed overlap strategy in grouped tensor parallelism. #73717
  • Fixed reshape inplace mechanism. #73836
  • Fixed all2all in MoE scenarios. #73894
  • Fixed communication fusion in grouped tensor parallelism. #73997
  • Fixed recomputation strategy in dynamic graph mode. #74075
  • Fixed group_norm and layer_norm sharding deduction. #74139
  • Fixed duplicate group creation in ProcessMesh.get_group. #73099
  • Fixed gradient clipping precision in pipeline parallelism. #74409
  • Fixed pipeline strategy bugs. #73615
  • Fixed stop_gradient parameter passing across pipeline stages. #73459

Performance Optimization

  • Optimized parameter sharing and communication in pipeline parallelism. #73733

Others

  • Cleaned deprecated code. #73967, #73744
  • Aligned manual vs automatic parallel precision. #73941

6. Operator Mechanism


Systematically resolved large-shape tensor, 0-size tensor, and kernel precision issues for large-model training reliability.

New Features

Bug Fixes

Enhancements

Performance Optimization

Others

  • Added CPU support for operators. #73872

7. Framework Performance Optimization


Enhancements

Others

  • Added slice-related checks. #74103

8. Compiler

Enhancements

  • Added support for the float8e4m3 data type. #74111
  • Added support for the rint operator and optimized arange operator performance. #74013, #74209
  • Enhanced expression-simplification capabilities. #74292

Bug Fixes

  • Fixed various processing-logic bugs across multiple scenarios. #73940, #74050, #73929, #74040, #74240, #74316, #74328, #74412, #74432, #74450, #74461, #74462

9. Hardware Adaptation


New Features

  • Added zero-dim support for the concat_grad OP on XPU. #73808
  • Added fp16 weight_scale support for weight_only_linear on XPU. #73963
  • Upgraded the Python memory API on XPU. #73189
  • Upgraded the Python streams API on XPU. #73924

Bug Fixes

  • Fixed the FFT operator on XPU. #73785
  • Fixed the data_size parameter in XPU strided_copy. #74225

Others

  • Upgraded XHPC version to 20250722. #74277

10. Installation Environment


Bug Fixes

  • Fixed compilation failure for macOS-x64 with -DONNXRUNTIME=ON. #73631
  • Updated Kunlun acceleration image. #73669