
PaddlePaddle 3.1.0 Release Note EN


Important Update

PaddlePaddle framework version 3.1 further optimizes and polishes the core functionality of automatic parallelism, improving both usability and performance. It adds FP8 low-precision training support, increasing the training speed of large models by 10-20%. It also improves the hardware extension mechanism, reducing the cost of adapting CUDA-like hardware: users only need to register kernels. In addition, the basic capabilities of the framework have been strengthened to improve stability. The key updated features are as follows:

  • Automatic Parallel Architecture: The automatic parallel architecture has been further refined to improve the usability of its core mechanism and the performance of dynamic graphs. The core mechanism now includes additional operator splitting derivation rules, support for splitting the same dimension of a distributed tensor across multiple mesh dimensions, and support for dynamic graph parallel strategies (PP, CP, SEP, TP-CONV), among others. At the same time, the dynamic graph automatic parallel system has been systematically optimized for performance, achieving results essentially on par with manual parallelism on models such as Llama2, Qwen, and Baichuan.
  • Low-precision training: Supports low-precision training based on a blockwise FP8 GEMM operator, achieving training accuracy comparable to BF16 and speeding up large-model training by 10-20%.
  • Heterogeneous multi-chip adaptation: Provides a kernel reuse mechanism for CUDA-like hardware backends; only kernel registration is required to use the corresponding kernels.
  • Framework stability enhancement: Fixed operator computation errors for 0-size tensors and very large tensor dimensions.

1. User experience upgrade

API enhancements, bug fixes, and improvements are aimed at enhancing user experience and API usability. The paddle.randn_like API has been added, multiple API functional defects have been fixed, and support for complex types and 0-Size Tensor has been enhanced. Documentation and code have also been updated and optimized accordingly to improve overall accuracy and professionalism.

New Features

  • Added paddle.randn_like API. #72492
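
A minimal usage sketch of the new API, assuming it follows paddle.randn-style semantics (standard-normal samples) while taking its shape and default dtype from the input tensor:

```python
import paddle

x = paddle.ones([2, 3], dtype="float32")
# Standard-normal samples with the same shape (and default dtype) as `x`.
noise = paddle.randn_like(x)
print(noise.shape)  # [2, 3]
```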

Bug fixes

  • Fixed the issue of inconsistent input and output types in the tensordot API. #72139
  • Fixed the issue where the output of the atleast API was a Tensor list. #73102
  • Fixed the issue with the nonzero API. #72003
  • Fixed the memory leak issue in dualpipev. #72070
  • Fixed the overflow issue in softmax calculation. #71935
  • Fixed the shape checking issue in take_along_axis when broadcast=False. #72436
  • Fixed the incorrect handling of Nan input in maximum and minimum functions. #71933
  • Fixed the issue with visit_type. #72782
  • Fixed the int32 out-of-bounds issue in gather_scatter_functor. #72905
  • Fixed the inplace implementation of Bernoulli. #73271
  • Fixed issues with moe_permute and moe_unpermute. #73365
  • Fixed the syntax checking issue of ast.parse for pyi files. #71872
  • Fixed the issue of complex division. #73331
  • Fixed issues related to TensorRT integration. #72302, #72278

Enhanced functionality

Documentation

  • Fixed errors in the documentation, improving its usability and user experience. #72549, #73036

Developer-related

Cleanup of obsolete code

2. Basic execution architecture

Supports FP8 matrix operations to improve model training efficiency, and strengthens multiple models to improve stability; also provides a _C_ops-style interface for calling backward operators, facilitating memory optimization and functional experimentation.
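
A minimal sketch of calling an operator pair through the _C_ops-style interface; the grad-op name and its (out, out_grad) argument order below follow the public operator definitions but should be treated as assumptions rather than a stable, documented signature:

```python
import paddle
from paddle import _C_ops

x = paddle.rand([4, 4])

# Forward operator invoked directly through the _C_ops interface.
out = _C_ops.relu(x)

# Backward operator invoked directly, bypassing the autograd tape, which is
# useful for memory-optimization experiments. The name `relu_grad` and the
# (out, out_grad) argument order are assumptions based on the operator
# definition; consult the ops yaml for the exact signature.
out_grad = paddle.ones_like(out)
x_grad = _C_ops.relu_grad(out, out_grad)
print(x_grad.shape)
```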

New Features

Bug fixes

Enhanced functionality

Performance improvement

Discarded

  • Code cleanup: Cleaned up Python 3.8 support declarations, and completed related code cleanup, dependency reduction, and syntax modernization updates to optimize code maintainability and compatibility. #71815, #72802, #72856, #72854, #72855, #72873, #72870, #72868, #72891

Developer-related

Others

  • Others: Added CPU kernel support for FP16/BF16 data types, optimized error handling and tolerance configuration in test modules, etc. #71764, #71951, #72944

3. Compiler architecture

Optimized compiler performance and enhanced stability.

Performance optimization

  • Support automatic conversion and optimization of Layout in training scenarios. #71891
  • Kernel compilation optimizations for operators such as argmin, argmax, and arange have been added to the backend. #71956, #72598
  • Support for fused optimization of matrix multiplication. #72846
  • Optimized the kernel computation performance of some operators. #72871
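
The compiler optimizations in this section apply when a model is run through the CINN backend; a minimal sketch of enabling it (the backend="CINN" option of paddle.jit.to_static is assumed here):

```python
import paddle

class Net(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.linear = paddle.nn.Linear(64, 64)

    def forward(self, x):
        # argmax-style ops benefit from the backend kernel optimizations above.
        return paddle.argmax(self.linear(x), axis=-1)

net = Net()
# Compile with the CINN compiler backend (assumed flag; requires a CINN-enabled build).
net = paddle.jit.to_static(net, backend="CINN")
out = net(paddle.rand([8, 64]))
```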

Bug fixes

Fix some processing logic bugs in various scenarios. #71813, #71886, #71927, #71915, #71946, #71949, #71955, #71942, #71939, #71973, #72001, #72020, #72014, #72021, #72027, #72061, #72025, #72095, #72108, #72132, #71985, #72106, #72140, #72167, #72037, #72178, #72143, #72175, #72191, #72213, #72189, #72214, #72166, #72180, #72284, #72267, #72348, #72332, #72307, #72353, #72204, #72457, #72426, #72536, #72541, #72365, #72621, #72630, #72669, #72682, #72732, #72811, #72941, #72795, #73536

4. Automatic parallel architecture

In version 3.1, we further refined the automatic parallel architecture to enhance the usability of automatic parallelism and the performance of dynamic graphs. Specifically, we improved the core mechanism of automatic parallelism, including adding new splitting derivation rules for multiple operators, supporting the splitting of the same dimension of distributed tensors by multiple mesh dimensions, and supporting dynamic graph parallel strategies (PP, CP, SEP, TP-CONV), etc. At the same time, we systematically optimized the performance of the automatic parallel system for dynamic graphs, achieving performance that is basically on par with manual parallelism on models such as Llama.

Functional improvements

  • Support for distributed tensors where the same dimension is split across multiple mesh dimensions (see the sketch after this list). #73233

  • Support for converting automatic parallel communication topology descriptions (ProcessMesh) into manual parallel communication groups. #72052

  • Support send/recv of any serializable Python object. #72098

  • Completed the dynamic graph parallel strategies:

    • Support for the 1F1B and VPP scheduling modes of the pipeline parallelism strategy. #72155, #72480, #72179

    • Support for parallel processing of long texts. #73195

    • Support for the visual parallelism strategy. #73063, #73039

    • Support for communication along the data parallel dimension in automatic parallelism. #72540

  • Added the following operator splitting derivation rules:

    • min, min_grad #72269

    • bitwise_or, atan2, fmax, fmin, reciprocal #72310

    • argmin, abs, cosh #72264

    • mean_all, mean_all_grad #72479

    • topk, topk_grad #72499

    • argsort #72388

    • round, mish, elu, selu, celu, stanh, softplus, softshrink, thresholded_relu, logit, nonzero #72312

    • unique ops #72824

    • put_along_axis #72766

    • round_grad, trunc_grad, ceil_grad, floor_grad, poisson_grad #72677

    • log_softmax, cummax, cummin #72720

    • unary #72177

    • unary_grad #72260

    • index_select, index_select_grad #72727

    • roll, roll_grad #72740

    • empty_like #73169

    • roi_align, roi_align_grad #72925

    • expand_as, expand_as_grad #73107

    • fused_gemm_epilogue #73126

    • label_smooth #72845

    • group_norm, group_norm_grad #72946

    • instance_norm, instance_norm_grad #72938

    • batch_norm, sync_batch_norm #72918

    • reduce_any #73175

    • fused_gemm_epilogue_rule #73494
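
As referenced above, a minimal sketch of splitting the same tensor dimension across multiple mesh dimensions with the auto-parallel API; the 2x2 mesh, its dim_names, and the tensor shape are illustrative assumptions, and the script is expected to run under a 4-card distributed launch (e.g. python -m paddle.distributed.launch):

```python
import paddle
import paddle.distributed as dist

# Illustrative 2x2 device mesh with two named mesh dimensions.
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])

x = paddle.rand([64, 128])
# Shard dimension 0 of `x` along *both* mesh dimensions: each of the four
# ranks then holds a 16x128 local shard of the distributed tensor.
dist_x = dist.shard_tensor(x, mesh, [dist.Shard(0), dist.Shard(0)])
print(dist_x.shape)  # global shape stays [64, 128]
```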

Performance optimization

  • Support for the tensor_fusion and overlap optimization strategies in grouped sharding parallelism. #72551, #72902, #73142, #71785
  • Optimized the reshard module to reduce communication overhead. #71969, #73024, #71868
  • Optimized the splitting derivation rule for multiply to reduce communication overhead. #73408
  • Optimized backward communication when the distributed sharding status is Partial, reducing communication overhead. #73236
  • Communication fusion optimization during gradient updates. #72120, #72745
  • Optimized the splitting derivation rule for gelu to reduce communication overhead. #73279
  • Optimized the splitting derivation rule for fused_rms_norm when an input has Partial status, reducing communication and computation overhead. #73054

Bug Fixes

  • Fixed a communication hang in the virtual pipeline parallel strategy on H-series GPUs. #71104, #73470
  • Fixed a bug in save/load. #72023
  • Fixed a bug where the linear_fused_grad_add strategy did not take effect in dynamic graph mode. #72708
  • Fixed issues where the fused_rms_norm operator failed to run, along with precision bugs. #72663
  • Fixed a bug in the splitting derivation rule for the expand operator. #73154

Others

  • Clean up dead code to facilitate code maintenance. #71814, #72538
  • Added the local_map API to pass distributed tensors to functions written for ordinary tensors. #71804
  • Added checks for the fused_linear_param_grad_add operator. #72483

5. Operator mechanism

New Features

  • Gradient and automatic differentiation optimization: Adds initial support for double-gradient (second-order) computation in the put_along_axis and repeat_interleave operations, improves the numerical stability of complex operators in automatic differentiation scenarios, and implements operator decomposition for the masked_fill operation. #72789, #73056, #73225
  • Operator mechanism extension: Added support for custom `__radd__` and `__rmul__`, enhancing the framework's ability to overload reflected (right-hand) operators. #73119
  • FP8 module support and operator development: Added support for FP8 block-quantized GEMM, introduced multiple fused operators, and provided efficient operator-level implementations for Mixture-of-Experts (MoE) models, enhancing training and inference performance (see the sketch below). #73228, #73285, #73133, #73364, #73520, #73531
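
The sketch below illustrates the blockwise idea behind FP8 quantized GEMM rather than the actual fused kernels above: each block is scaled by its own absolute maximum so values fit the FP8 E4M3 range, and the per-block scales are kept for dequantization. The 128 block size and the NumPy emulation of the FP8 cast are assumptions for illustration only.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 128       # illustrative block size

def blockwise_quantize(a: np.ndarray):
    """Return the block-scaled matrix and its per-block scales (FP8 cast emulated)."""
    m, n = a.shape
    scales = np.empty((m // BLOCK, n // BLOCK), dtype=np.float32)
    q = np.empty_like(a)
    for i in range(0, m, BLOCK):
        for j in range(0, n, BLOCK):
            blk = a[i:i + BLOCK, j:j + BLOCK]
            s = np.abs(blk).max() / E4M3_MAX + 1e-12      # per-block scale
            scales[i // BLOCK, j // BLOCK] = s
            q[i:i + BLOCK, j:j + BLOCK] = blk / s         # real kernels cast to FP8 here
    return q, scales

a = np.random.randn(256, 512).astype(np.float32)
q, scales = blockwise_quantize(a)
# Dequantize the first block; the round trip is exact here only because the
# FP8 cast itself is emulated, not actually performed.
print(np.allclose(q[:BLOCK, :BLOCK] * scales[0, 0], a[:BLOCK, :BLOCK]))
```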

Bug Fixes

  • Gradient and automatic differentiation stability improvement: Fixed errors in backward (gradient) operator computations, enhancing numerical stability and functional correctness in automatic differentiation scenarios. #71716, #72299, #72358, #73037, #73140, #73185
  • Numerical accuracy and overflow protection: Addresses issues such as numerical overflow, loss of precision, and large tensor overflow, ensuring the reliability of low-precision computations and large tensor operations. #72584, #72608, #72681, #72639, #73245, #73359, #72456
  • Operator logic and framework alignment: Aligned operator computation logic, fixed issues such as abnormal operator inputs, and added checks to ensure the correctness of framework functionality. #72282, #71863, #72650, #72843, #73070, #73141, #73203, #73350, #73440, #73539, #73339
  • CUDA kernel and hardware adaptation optimization: Supports NVIDIA SM90 architecture, fixes issues such as overflow, removes redundant CUDA error checks, and enhances GPU computing efficiency and adaptability to new hardware. #72507, #72849, #72959, #73130, #73489

Enhanced functionality

  • Added a fast division and modulo implementation for int64_t, improving computational performance and numerical stability in large-integer scenarios. #72530
  • Optimized the strided tensor copy kernel to improve the efficiency of data copies under non-contiguous memory layouts. #72662
  • Unified the usage of the quantization API in dynamic and static graph modes, simplifying the quantization model development process. #73100

Performance improvement

  • Optimize the decomposition performance of the Gelu operator to enhance computational efficiency. #72812

Others

6. Framework performance optimization

New Features

  • The acc_steps of sharding_overlap is configurable. #72395

Bug fixes

  • Fixed the inplace issue of operator c_softmax_with_cross_entropy_grad. #72366

Feature Enhancements

  • Performance optimization and acceleration: Enabled cuDNN support for depthwise convolution, improving convolution efficiency. Updated pooling operation strategies and optimized permute memory operations to reduce CUDA memory usage. Optimized printing speed, accelerating debugging and log output. #71796, #73442, #73563
  • Feature enhancements and operation support: Added the masked_fill operation and Boolean indexing optimizations to enhance tensor masking capabilities (see the sketch after this list). Implemented the index_elementwise operation to support index-based element-level operations. Added pooling and reshape execution strategies to increase the flexibility of model operations. #72788, #72942
  • Bug fixes and stability improvements: Fixed Partial-state support issues of fused_rms_norm in SPMD parallel mode. Corrected index errors in output dimension calculation and IndexGetStride during slice operations to ensure computational correctness. #72118, #72223, #73184, #73237, #73054
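
A minimal sketch of the masked_fill operation referenced above, assuming the paddle.masked_fill(x, mask, value) form:

```python
import paddle

x = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0]])
mask = paddle.to_tensor([[True, False], [False, True]])

# Positions where `mask` is True are filled with the constant; the rest of
# `x` passes through unchanged.
y = paddle.masked_fill(x, mask, -1.0)
print(y.numpy())  # [[-1.  2.] [ 3. -1.]]
```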

Performance improvement

  • Faster Guard adaptation: Reduced SOT end-to-end overhead. #71900, #71979, #72081, #72327, #72564, #72823
  • Performance optimization and acceleration: Optimized operator scheduling strategies, upgraded Flash Attention to version 3 to reduce computational overhead, and fixed model performance bottlenecks to improve inference and training speed. #71937, #71828, #71461, #72039, #72228, #72225, #72623, #72666, #73147, #73393
  • Parallel computing: Optimized the mesh resharding strategy in automatic parallelism, implemented communication fusion and optimization logic in the sharding stages, enhanced the stability of distributed training, and reduced its communication overhead. #71969, #72120, #73279, #73406

  • Feature enhancements and fixes: Optimized operator indexing and kernel scheduling logic. #72625, #72741, #73082, #73501

  • Model and operation support: Support for depthwise convolution in the NHWC format, adapting to more hardware memory layouts. #72121

7. Hardware adaptation

Optimized hardware adaptation mechanisms and provided a solution for CUDA-like hardware to reuse CUDA kernels.

New Features

Based on the CustomDevice integration solution, we introduce a low-cost support solution for CUDA-like hardware backends. These backends can be plugged into Paddle in a modular manner, allowing the majority of CUDA kernels from the NVIDIA ecosystem to be reused in Paddle at low cost, and they can be decoupled from feature upgrades within the Paddle framework. This significantly reduces the cost of hardware backend integration and iteration, increases users' willingness to adopt, and fosters a positive collaborative ecosystem between Paddle and hardware manufacturers. #72604, #72668, #72758, #72865, #72910, #73033, #73145, #73281, #73079

Enhancing XPU basic capabilities: adding kernels, expanding data types, and supplementing branches in the XPU environment #71424, #71809, #71594, #71779, #71756, #71573, #71883, #71954, #71931, #72280, #72361, #72406, #72528, #72752, #72852, #72982, #73357, #73414, #73464, #73234, #71776

Extended data type support for DCU kernels. #73129

Bug fixes

Fix xpu execution issues #71852, #71966, #72005, #71908, #72431, #72519, #72734, #72763, #72762, #72890, #72867, #73071, #73004, #72726, #73113, #73127, #73025, #73301, #73292, #73272, #73305, #73356, #73438, #72041, #72275, #72787, #73504, #73290

8. Installation environment adaptation

We have optimized the stability and cross-platform compatibility of the framework, fixed issues related to compilation and installation failures on different platforms, upgraded key dependencies such as CUDA, further optimized the CI/CD process, improved the build speed, and enhanced the overall stability of the system. We have also discontinued the maintenance of compilation and installation in the Python 3.8 environment.

Bug fixes

  • Fixed compilation errors when using clang17 to compile third-party libraries. #72524
  • Fixed compilation issues when using CUDA 12.9. #72808, #72841, #72978, #73360
  • Fixed compilation issues when using GCC 13.3. #73144
  • Fixed compilation issues when WITH_PIP_CUDA_LIBRARIES=ON. #72907
  • Fixed compilation issues when WITH_NVSHMEM=ON. #73368

Enhanced functionality

  • Avoid copying temporary files generated during the compilation of custom operators. #73196
  • Warning message optimization. #72877

Developer-related

Discarded

  • Discontinue support for compilation in Python 3.8 environment. #72827

9. List of contributors

0x3878f, A-nnonymous, AndSonder, ApricityXX, aquagull, author, baoqiwen, BeingGod, blacksheep-Aristotle, BoShen5, bukejiyu, cangtianhuang, carryyu, chang-wenbin, changeyoung98, chen2016013, ckl117, co63oc, cqulilujia, crashbussy, cszdrg, Cutelemon6, cyy536, DanielSun11, danleifeng, datutu-L, deepllz, Dmovic, DrRyanHuang, dynamicheart, Eddie-Wang1120, eggman-1024, emmanuel-ferdman, Enigmatisms, enkilee, fangfangssj, feixi21, FeixLiu, ForFishes, Function-Samuel, ggggxm, GITD245, Glencsa, GoldenStain, gongshaotian, gouzil, gzy19990617, hanlintang, Hongqing-work, houj04, huangjiyi, hxzd5568, HydrogenSulfate, jzhang533, LCStayingdullCircuit, leon062112, lifulll, linkk08, LittleHeroZZZX, liufengwei0103, Liujie0926, liuruyan, lixinqi, LiYuRio, lizexu123, lizhenyun01, lj970926, lshpku, megemini, mikethegoblin, ming1753, mzj104, NKNaN, ooooo-create, pesionzhao, phlrain, pkuzyc, PolaKuma, Qin-sx, RichardWooSJTU, risemeup1, runzhech, RuohengMa, sasaya123, shanjiang7, SigureMo, sneaxiy, swgu98, SylarTiaNII, tianhaodongbd, tianshuo78520a, timminator, tizhou86, umiswing, waliwali777, wanghuancoder, Waynezee, Wennie396, xiaoguoguo626807, XieYunshen, Xing-lil, xkkkkkk23, Xreki, xuxinyi389, Yeenyeong, yongqiangma, YqGe585, yuanlehome, YuanRisheng, yulangz, yuwu46, zeroRains, zhangbo9674, zhanghonggeng, zhangting2020, ZhangX-21, zhangyk0314, zhangyuqin1998, zhink, zhiqiu, zhouquan32, zhoutianzi666, zhupengyang, zrr1999, zty-king, zyfncg
