
PaddlePaddle 3.1.1 Release Note EN


Key Updates


PaddlePaddle Framework 3.1.1 delivers systematic enhancements across the entire large-scale model training pipeline. It addresses fundamental stability issues, including operator numerical precision and functional correctness in large-model scenarios, and, together with standardized API logging and broader unit-test coverage, significantly improves training correctness and stability. On the performance side, it improves the efficiency of key framework APIs and of FP8 quantization computation, and speeds up FP8 quantization in distributed training and pipeline parallelism, substantially boosting training throughput. Sharding-deduction coverage in the automatic parallel architecture has been expanded. For inference deployment, compatibility is improved and expert-parallel (EP) inference is further enhanced. Overall, this release builds a more robust and efficient foundation for large-scale model development while maintaining full API compatibility.

  • Enhanced Operator & Execution System Correctness and Stability: systematically resolved 0-size Tensor, large-shape Tensor, and CPU/GPU precision-consistency issues to ensure training correctness and stability.
  • FP8 Operator Optimizations: further improved the performance of FP8-related quantization and fused-computation operators and optimized SM usage for certain operators, enhancing FP8 mixed-precision training efficiency.
  • More Stable and Faster Large-Scale Model Training: systematically optimized execution efficiency in slice-related scenarios for significant performance gains; fixed parameter-synchronization issues in pipelines; added FP8 parameter quantization under Sharding parallelism and maximal communication-computation overlap in DualPipe, ensuring stable and efficient parallel training. Also enhanced sharding-deduction capabilities in the automatic parallel architecture to improve partitioning efficiency.
  • Inference Deployment: added support for loading safetensors and enhanced internode_ll_two_stage functionality in EP parallelism to further boost inference efficiency.

1. User Experience Upgrade


This release adds several APIs commonly used in large-model scenarios and systematically fixes issues in API documentation and implementations.

New Features

  • Added paddle.device.device_guard API for dynamic graph device switching context management. #73964
  • Added dtype conversion APIs: paddle.Tensor.bool, float16, half, bfloat16, float32, float, float64, double, int8, char, uint8, byte, int16, short, int32, int, int64, long, complex64, complex128, cfloat, cdouble. #74416
  • Added paddle.msort for multi-dimensional array sorting. #74421
  • Added paddle.ravel & paddle.Tensor.ravel for tensor flattening. #74439, #74454
  • Added F.dropout1d for randomly zeroing entire channels of 1-D feature maps. #74444
  • Added paddle.Tensor.type_as for type casting. #74459
  • Added paddle.Tensor.mul_, paddle.autograd.Function, paddle.argwhere. #74493
  • Added paddle.nn.MultiLabelMarginLoss loss function. #73538
  • Added autocast APIs: paddle.is_autocast_enabled, paddle.get_autocast_gpu_dtype. #74441
  • Added requires_grad property to Tensor. #74491
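
As a quick illustration of several of the new APIs above, the following minimal sketch exercises the dtype-conversion methods, msort/ravel, type_as, argwhere, mul_, the requires_grad property, and device_guard. It assumes these follow the PyTorch-style semantics their names suggest; exact behavior should be checked against the API documentation.

```python
import paddle

x = paddle.to_tensor([[3.0, 1.0], [2.0, 4.0]])

# Dtype-conversion methods (assumed to be PyTorch-style casting aliases).
x_bf16 = x.bfloat16()
x_i64 = x.int64()

# Flattening and sorting.
flat = paddle.ravel(x)   # 1-D flattened tensor
rows = paddle.msort(x)   # assumed to sort along the first dimension, as in torch.msort

# Cast one tensor to another tensor's dtype.
y = paddle.to_tensor([1, 2, 3], dtype='int32')
y_cast = y.type_as(x)

# Coordinates of nonzero elements, and in-place multiply.
idx = paddle.argwhere(x > 2.0)
x.mul_(2.0)

# New property for gradient tracking (assumed to mirror the inverse of stop_gradient).
w = paddle.randn([2, 2])
print(w.requires_grad)

# Autocast state query.
print(paddle.is_autocast_enabled())

# Dynamic-graph device switching; the device string ('cpu', 'gpu:0', ...) is an assumption.
with paddle.device.device_guard('cpu'):
    z = paddle.zeros([2, 2])
```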

Bug Fixes

  • Fixed Tensor.place comparison logic. #73532
  • Fixed F.adaptive_log_softmax_with_loss calculation. #73554
  • Fixed Tensor.__radd__ and Tensor.__rmul__ issues. #73833
  • Fixed 0-size API boundary issues. #73874
  • Fixed _DataLoaderIter multi-process handling. #73931
  • Fixed paddle.nanmedian handling of empty inputs. #74263
  • Fixed paddle.eigh computation errors. #73349
  • Fixed paddle.arange issues. #74159

Enhancements

  • paddle.cumprod supports dim=None for global computation. #74106
  • Initialization APIs (zeros/ones/eye etc.) support device/dtype parameters. #74477
  • paddle.ones supports variable-length shape parameters. #74494
  • F.gelu supports string format for approximate parameter. #74485
  • F.scaled_dot_product_attention supports 3D inputs. #73804
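
The sketch below illustrates the enhanced call signatures listed above. It is a minimal example under the assumption that the new keyword arguments behave as described in this note (e.g. the device keyword on initialization APIs and the string form of approximate); exact parameter spellings should be verified against the released API.

```python
import paddle
import paddle.nn.functional as F

x = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0]])

# Global cumulative product when dim is None (assumed to work over the flattened input,
# as with NumPy's axis=None).
p = paddle.cumprod(x, dim=None)

# Variable-length shape arguments instead of a single shape list.
ones = paddle.ones(2, 3)

# Initialization APIs accepting dtype/device keywords (keyword names per this note).
zeros = paddle.zeros([2, 3], dtype='float32', device='cpu')

# String form of the approximate argument ('tanh' assumed).
g = F.gelu(x, approximate='tanh')
```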

Documentation

Others

2. Parallel Strategy Optimization


This release delivers multiple parallel-training optimizations: forward-state annotation and a bubble hook for VPP pipelines, fixes for shared-parameter synchronization in pipeline parallelism, a fix for a training hang caused by conflicting EP and Sharding dimensions, corrected event management in DualPipe scheduling, enhanced FP8 quantization support under the Sharding strategy, and improved communication-computation overlap in DualPipe pipelines.

New Features

  • Added forward state annotation for VPP pipeline scheduling. #74187
  • Added bubble hook for VPP memory balancing. #74100

Bug Fixes

  • Fixed shared parameter synchronization in pipeline parallelism. #74047, #74087, #73913
  • Fixed training hang when EP and Sharding dimensions conflict. #74268
  • Fixed event management in DualPipe scheduling. #74158

Performance Improvements

3. Communication Library


Enhanced the communication library with new features and performance optimizations: added a stream interface, removed redundant comm contexts that wasted memory, added initialization switches to improve communication efficiency, added multicast=2 support for DeepEP operators, and added a TMA-optimized internode implementation for DeepEP.

New Features

  • Added stream interface for communication library. #74016

Bug Fixes

  • Removed redundant comm contexts to save memory. #73297

Performance Improvements

  • Added initialization switches for communication library. #73297
  • DeepEP operators support multicast=2. #74144
  • Added TMA-optimized internode support for DeepEP. #74284

4. Basic Execution Architecture


Systematically upgraded slice-related operators for significant efficiency improvements and fixed key issues in large-scale model training.

Bug Fixes

Enhancements

5. Automatic Parallel Architecture


Enhanced sharding deduction rules and resolved key usability/correctness issues.

Improvements

  • Optimized the sharding rule for reshape to support multi-mesh dimension partitioning (see the sketch below). #74352
  • Added default deduction rules for multi-mesh partitioning of distributed tensors. #74396
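
For context, the sketch below shows the kind of reshape on a sharded tensor that these deduction rules cover. ProcessMesh, shard_tensor, and Shard are the public auto-parallel APIs; the exact output placement deduced for the reshape is shape- and mesh-dependent, so this is only an illustrative setup (run it under paddle.distributed.launch with two devices).

```python
import paddle
import paddle.distributed as dist

# A 1-D mesh over two ranks; launch with (demo.py is a placeholder name):
#   python -m paddle.distributed.launch --devices=0,1 demo.py
mesh = dist.ProcessMesh([0, 1], dim_names=["dp"])

x = paddle.randn([4, 8])
# Shard the first dimension across the mesh.
dx = dist.shard_tensor(x, mesh, [dist.Shard(0)])

# The reshape sharding rule deduces a distributed placement for the output
# instead of falling back to full replication (exact placement is shape-dependent).
y = dx.reshape([4, 2, 4])
print(y.shape)
```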

Added sharding rules for:

  • take_along_axis #72063
  • index_put, index_put_grad #73486
  • conv3d, conv3d_grad #72882
  • depthwise_conv2d, depthwise_conv2d_grad #73134
  • conv2d_transpose, conv2d_transpose_grad #73188
  • einsum #73753
  • MoE operators #74215

Bug Fixes

  • Fixed long-sequence parallel strategy. #73735
  • Fixed save/load in pipeline parallelism. #73749
  • Fixed overlap strategy in grouped tensor parallelism. #73717
  • Fixed reshape inplace mechanism. #73836
  • Fixed all2all in MoE scenarios. #73894
  • Fixed communication fusion in grouped tensor parallelism. #73997
  • Fixed recomputation strategy in dynamic graph mode. #74075
  • Fixed group_norm and layer_norm sharding deduction. #74139
  • Fixed duplicate group creation in ProcessMesh.get_group. #73099
  • Fixed gradient clipping precision in pipeline parallelism. #74409
  • Fixed pipeline strategy bugs. #73615
  • Fixed stop_gradient parameter passing across pipeline stages. #73459

Performance Optimization

  • Optimized parameter sharing and communication in pipeline parallelism. #73733

Others

  • Cleaned deprecated code. #73967, #73744
  • Aligned manual vs automatic parallel precision. #73941

6. Operator Mechanism


Systematically resolved large-shape tensor, 0-size tensor, and kernel precision issues for large-model training reliability.

New Features

Bug Fixes

Enhancements

Performance Optimization

Others

  • Added CPU support for operators. #73872

7. Framework Performance Optimization


Enhancements

Others

  • Added slice-related checks. #74103

8. Compiler

Enhancements

  • Added support for the float8e4m3 data type. #74111
  • Added support for the rint operator and optimized arange operator performance. #74013, #74209
  • Enhanced expression-simplification capabilities. #74292

Bug Fixes

  • Fixed various processing-logic bugs across multiple scenarios. #73940, #74050, #73929, #74040, #74240, #74316, #74328, #74412, #74432, #74450, #74461, #74462

9. Hardware Adaptation


New Features

  • Added zero-dim support for the concat_grad OP on XPU. #73808
  • Added fp16 weight_scale support for weight_only_linear on XPU. #73963
  • Upgraded the Python memory API on XPU. #73189
  • Upgraded the Python streams API on XPU. #73924

Bug Fixes

  • Fixed the FFT operator on XPU. #73785
  • Fixed the data_size parameter in XPU strided_copy. #74225

Others

  • Upgraded XHPC version to 20250722. #74277

10. Installation Environment


Bug Fixes

  • Fixed compilation failure for macOS-x64 with -DONNXRUNTIME=ON. #73631
  • Updated Kunlun acceleration image. #73669