PaddlePaddle 3.1.1 Release Note EN
PaddlePaddle Framework 3.1.1 delivers systematic enhancements across the entire large-model training pipeline. It addresses fundamental stability issues, including operator numerical precision and functionality in large-model scenarios, and pairs these fixes with standardized API logging and comprehensive unit-test coverage, significantly improving training correctness and stability. On the performance side, it improves the efficiency of key framework APIs and FP8 quantization computation, and raises FP8 quantization efficiency in distributed training and pipeline parallelism, substantially boosting training throughput. The partition-deduction coverage of the auto-parallel architecture has been expanded. For inference deployment, compatibility is improved and EP (expert parallel) inference capabilities are further enhanced. Overall, this release builds a more robust and efficient foundation for large-model development while maintaining full API compatibility.
Enhanced Operator & Execution System Correctness and Stability: Systematically resolved 0-size Tensor, large-shape Tensor, and CPU/GPU precision consistency issues to ensure training correctness and stability.
FP8 Operator Optimizations: Further improved performance of FP8-related quantization and computational fusion operators while optimizing SM usage for certain operators, enhancing FP8 mixed-precision training efficiency.
More Stable and Faster Large-Scale Model Training: Systematically optimized execution efficiency in slice-related scenarios for significant performance gains. Fixed parameter-synchronization issues in pipeline parallelism, added FP8 parameter quantization under Sharding parallelism, and pushed communication-computation overlap further in DualPipe, ensuring stable and efficient parallel training. Also enhanced the partition-deduction capabilities of the auto-parallel architecture to improve partitioning efficiency.
Inference Deployment: Added support for safetensors loading. Enhanced internode_ll_two_stage functionality in EP parallelism to further boost inference efficiency.
In version 3.1, we added multiple APIs commonly used in large-model scenarios and systematically fixed API documentation and implementation issues.
- Added `paddle.device.device_guard` API for dynamic graph device switching context management. #73964
- Added dtype conversion APIs: `paddle.Tensor.bool`, `float16`, `half`, `bfloat16`, `float32`, `float`, `float64`, `double`, `int8`, `char`, `uint8`, `byte`, `int16`, `short`, `int32`, `int`, `int64`, `long`, `complex64`, `complex128`, `cfloat`, `cdouble`. #74416 (usage sketched after this list)
- Added `paddle.msort` for multi-dimensional array sorting. #74421
- Added `paddle.ravel` and `paddle.Tensor.ravel` for tensor flattening. #74439, #74454
- Added `F.dropout1d` for dimension-wise random dropping. #74444
- Added `paddle.Tensor.type_as` for type casting. #74459
- Added `paddle.Tensor.mul_`, `paddle.autograd.Function`, `paddle.argwhere`. #74493
- Added `paddle.nn.MultiLabelMarginLoss` loss function. #73538
- Added autocast APIs: `paddle.is_autocast_enabled`, `paddle.get_autocast_gpu_dtype`. #74441
- Added `requires_grad` property to Tensor. #74491
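A minimal usage sketch covering a few of these additions; signatures are assumed to follow the torch-style spellings the names suggest and may differ in detail:

```python
import paddle

x = paddle.to_tensor([[3.0, 1.0], [2.0, 4.0]])

# dtype-conversion aliases (assumed to mirror their PyTorch namesakes)
print(x.half().dtype)         # paddle.float16
print(x.bool().dtype)         # paddle.bool

# paddle.msort: sort along the first dimension (assumed, by analogy with torch.msort)
print(paddle.msort(x))

# paddle.ravel: flatten to 1-D
print(paddle.ravel(x).shape)  # [4]

# Tensor.type_as: cast to another tensor's dtype
y = paddle.to_tensor([1, 2], dtype="int32")
print(y.type_as(x).dtype)     # paddle.float32

# requires_grad: torch-style view of trainability (inverse of stop_gradient, assumed)
print(x.requires_grad)

# autocast state query
print(paddle.is_autocast_enabled())

# dynamic-graph device context; argument form assumed to match paddle.static.device_guard
with paddle.device.device_guard("cpu"):
    z = paddle.zeros([2])
```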
- Fixed `Tensor.place` comparison logic. #73532
- Fixed `F.adaptive_log_softmax_with_loss` calculation. #73554
- Fixed `Tensor.__radd__` and `Tensor.__rmul__` issues. #73833
- Fixed 0-size API boundary issues. #73874
- Fixed `_DataLoaderIter` multi-process handling. #73931
- Fixed `paddle.nanmedian` handling of empty inputs. #74263
- Fixed `paddle.eigh` computation errors. #73349
- Fixed `paddle.arange` issues. #74159
- `paddle.cumprod` supports dim=None for global computation (see the sketch after this list). #74106
- Initialization APIs (`zeros`/`ones`/`eye`, etc.) support device/dtype parameters. #74477
- `paddle.ones` supports variable-length shape parameters. #74494
- `F.gelu` supports string format for the approximate parameter. #74485
- `F.scaled_dot_product_attention` supports 3D inputs. #73804
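A short sketch of the extended argument forms, assumed from the descriptions above (exact layouts may differ):

```python
import paddle
import paddle.nn.functional as F

x = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0]])

# dim=None: cumulative product over the flattened tensor
# (assumed to return a 1-D result, as in the torch counterpart)
print(paddle.cumprod(x, dim=None))  # [1., 2., 6., 24.]

# variadic shape arguments, in addition to the existing list form
a = paddle.ones(2, 3)               # equivalent to paddle.ones([2, 3])

# string form of `approximate`, alongside the existing bool form
y = F.gelu(x, approximate="tanh")

# 3-D attention inputs (layout assumed to be [batch, seq_len, head_dim])
q = paddle.randn([2, 8, 64])
out = F.scaled_dot_product_attention(q, q, q)
```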
- Code style optimizations. #73659, #73661, #73660, #74001, #74145, #73863, #73655, #73724, #73764, #73849
- Implementation optimizations: file naming, typo fixes, MKLDNN improvements. #74132,#73626,#73641,#73587,#73801,#73812,#73827,#73845,#73841,#73859,#73858,#73840,#73826,#73825,#73824,#73887,#73949,#73823,#73218,#73948,#74479,#74233,#74513,#74518,#74516,#74178,#74166,#74165,#74163,#74174,#74164,#74162,#74208,#74180,#74205,#74288,#74232,#74299,#74244,#74230,#74314,#74392,#74424,#74417,#74536,#74473,#74135,#74245,#74327,#74325,#74326,#74315,#74385,#74395,#74398,#74393,#74367,#74391,#74436,#74410,#74475,#74517
- Unit test optimizations and fixes. #73664,#73723,#73725,#73728,#73731,#73726,#73786,#73727,#73721,#73798,#73791,#73784,#73788,#73794,#73822,#73839,#73843,#73790,#73905,#73903,#73902,#73900,#73909,#73917,#73906,#73904,#73947,#73944,#73945,#73969,#74068,#74082,#74399,#74423,#74458,#74501,#74487,#74502,#74507,#74504,#74505,#74509,#74535,#74503,#74142,#74287
- Build/CI fixes and warning optimizations. #73581, #73984, #74104, #74376, #74468
- Improved debugging messages and error reporting. #73783,#73886,#74188,#73720,#73857,#74519,#74520,#74384,#74386,#74387,#74383,#74381,#73830,#74128,#74203
- Added pybind API `get_event_handle_from_custom_stream`. #73918
- Execution mechanism improvements. #74015
Version 3.1 delivers multiple parallel training optimizations: added forward state annotation and bubble hook for VPP pipelines, fixed parameter synchronization in pipeline parallelism, resolved training hang caused by Sharding dimension conflicts, corrected event management in DualPipe, enhanced FP8 quantization support under Sharding strategy, and improved communication-computation overlap in DualPipe pipelines.
- Added forward state annotation for VPP pipeline scheduling. #74187
- Added bubble hook for VPP memory balancing. #74100
- Fixed shared parameter synchronization in pipeline parallelism. #74047, #74087, #73913
- Fixed training hang when EP and Sharding dimensions conflict. #74268
- Fixed event management in DualPipe scheduling. #74158
- Added FP8 parameter quantization support for Sharding strategy. #73690
- Enabled communication-computation overlap in DualPipe pipelines. #73690, #74527
Enhanced the communication library with new features and performance optimizations: added a stream interface, removed redundant comm contexts that wasted memory, added initialization switches that improve communication efficiency, added multicast=2 support for DeepEP operators, and provided a TMA-optimized internode implementation.
- Added stream interface for communication library. #74016
- Removed redundant comm contexts to save memory. #73297
- Added initialization switches for communication library. #73297
- DeepEP operators support multicast=2. #74144
- Added TMA-optimized internode support for DeepEP. #74284
Systematically upgraded slice-related operators for significant efficiency improvements and fixed key issues in large-scale model training.
- Fixed large-tensor APIs. #73283, #74120, #73657, #73654, #73559
- Fixed dynamic-to-static conversion issues. #71520
- Fixed CUDA multi-stream issues. #74011
- Other critical fixes. #74378
Enhanced sharding deduction rules and resolved key usability/correctness issues.
- Optimized sharding rules for `reshape`, supporting multi-mesh dimension partitioning (see the sketch below). #74352
- Added default deduction rules for distributed tensor multi-mesh partitioning. #74396
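An illustrative multi-mesh-dimension sharding sketch under the auto-parallel API; the launch command and the deduced output placement are assumptions for illustration, not behavior documented in the PRs:

```python
import paddle
import paddle.distributed as dist

# 2-D process mesh over 4 ranks; launch with, e.g.,
# `python -m paddle.distributed.launch --nproc_per_node=4 demo.py` (setup assumed).
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])

w = paddle.randn([8, 16])
# Shard dim 0 across "dp" and dim 1 across "mp".
dw = dist.shard_tensor(w, mesh, [dist.Shard(0), dist.Shard(1)])

# With the extended rule, reshape's output placement is deduced from the
# input sharding instead of falling back to replication.
out = dw.reshape([4, 2, 16])
print(out.placements)
```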
Added sharding rules for:
- `take_along_axis` #72063
- `index_put`, `index_put_grad` #73486
- `conv3d`, `conv3d_grad` #72882
- `depthwise_conv2d`, `depthwise_conv2d_grad` #73134
- `conv2d_transpose`, `conv2d_transpose_grad` #73188
- `einsum` #73753
- MoE operators #74215
- Fixed long-sequence parallel strategy. #73735
- Fixed save/load in pipeline parallelism. #73749
- Fixed overlap strategy in grouped tensor parallelism. #73717
- Fixed `reshape` inplace mechanism. #73836
- Fixed all2all in MoE scenarios. #73894
- Fixed communication fusion in grouped tensor parallelism. #73997
- Fixed recomputation strategy in dynamic graph mode. #74075
- Fixed `group_norm` and `layer_norm` sharding deduction. #74139
- Fixed duplicate group creation in `ProcessMesh.get_group`. #73099
- Fixed gradient clipping precision in pipeline parallelism. #74409
- Fixed pipeline strategy bugs. #73615
- Fixed stop_gradient parameter passing across pipeline stages. #73459
- Optimized parameter sharing and communication in pipeline parallelism. #73733
Systematically resolved large-shape tensor, 0-size tensor, and kernel precision issues for large-model training reliability.
- MoE (Mixture of Experts) optimizations. #73592, #73763
- Enhanced complex number operations. #73674
- Dynamic graph & autograd enhancements. #73622, #73601, #73737, #73747, #73761, #74137
- Composite operator optimizations. #73923
- Improved API compatibility. #74449
- Added `pd.rint` operator. #74012
- Fixed 100+ 0-size tensor issues. #73680,#73715,#73702,#73550,#73736,#72986,#73740,#73821,#73889,#73882,#73962,#73998,#73040,#73853,#74046,#74006,#73848,#73806,#74022,#73933,#74057,#74042,#73832,#73844,#73854,#74153,#74131,#74129,#74119,#74143,#74152,#74175,#74112,#74185,#74184,#74200,#74150,#74157,#74229,#73986,#74261,#74259,#74295,#74305,#74323,#74354
- Fixed 50+ large-tensor handling problems. #73380,#73635,#73706,#73712,#73745,#73738,#73802,#73899,#73895,#74019,#73996,#73979,#74063,#74213,#74061,#74070,#74155,#74062,#74089,#74134,#74058,#74183,#74196,#74223,#74252,#74060,#74273,#74211,#74055,#74242,#74058,#74293,#74289,#74172,#74330,#74329,#74342,#74369,#74279,#74370,#74404,#74451,#74537,#74324,#74254,#74360
- Resolved 20+ precision issues across CPU/GPU kernels. #73562,#73739,#74009,#74081,#74160,#74077,#74102,#74219,#74257,#74198,#74182
- Other critical fixes. #73670,#73774,#73877,#73993,#74032,#74034,#74073,#74096,#74065,#74002,#74282,#74313,#74303,#74306,#74298,#74044,#74149,#74290,#74348,#74364,#74332,#74224,#74382,#74406,#74434,#74448,#74457,#74322,#74530,#74442,#74123,#73892,#74025
- Optimized FP8 operator performance. #73564, #73370, #73881, #73644, #73639, #73777, #74173, #74471
- Enhanced `linalg.lu` functionality (see the sketch after this list). #73885, #74130, #74258, #74456
- Improved API compatibility. #74480, #74523, #74490, #74548
- Other significant optimizations. #73815, #74372
- Memory optimization. #73803
- Added CPU support for operators. #73872
- Added slice-related checks. #74103
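For context, basic `paddle.linalg.lu` usage (the factorization these PRs extend; the note does not detail which aspects changed):

```python
import paddle

# LU factorization with partial pivoting: P @ L @ U reconstructs A.
A = paddle.to_tensor([[4.0, 3.0], [6.0, 3.0]])
lu, pivots = paddle.linalg.lu(A)
P, L, U = paddle.linalg.lu_unpack(lu, pivots)
print(paddle.allclose(P @ L @ U, A))  # True
```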
- Added support for the float8e4m3 data type. #74111
- Added support for the rint operator and optimized arange operator performance. #74013, #74209
- Enhanced expression simplification capabilities. #74292
- Fixed various processing-logic bugs across multiple scenarios. #73940, #74050, #73929, #74040, #74240, #74316, #74328, #74412, #74432, #74450, #74461, #74462
- Added zero-dim support for concat_grad OP on XPU. #73808
- Added fp16 weight_scale support for weight_only_linear on XPU. #73963
- Upgraded Python memory API on XPU. #73189
- Upgraded Python streams API on XPU. #73924
- Upgraded XHPC version to 20250722. #74277