Releases: PaddlePaddle/Paddle

PaddlePaddle 1.1.0

31 Oct 02:48
66024e9

Release Notes

Major New Features and Improvements

Framework

  • The "eager deletion" memory optimization strategy now supports sub-blocks in control flow operators (e.g. if-else, while), significantly reducing the memory consumption of models that use control flow operators.

  • Optimized the split operator, significantly improving performance.

  • Extended the multiclass_nms operator to support polygon bounding boxes.

  • Added a CUDA implementation of the generate_proposals operator, significantly improving performance.

  • Added support for fusing the affine_channel operator and the batch_norm operator, significantly improving performance.

  • Optimized the depthwise_conv operator, significantly improving performance.

  • Optimized the reduce_mean operator, significantly improving performance.

  • Optimized the sum operator, significantly improving performance.

  • Optimized the top_k operator, significantly improving performance.

  • Added a new sequence_slice operator, which slices a sub-sequence out of a sequence given a start position and a length.

  • Added a new sequence_unpad operator, which converts a padded Tensor to a LoDTensor.

  • Added new sequence_reverse, roi_align, and affine_channel operators.
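
The sequence operators in the list above can be illustrated on a plain-Python model of a LoDTensor: a flat list of items plus a list of offsets marking where each sequence begins and ends. This is a minimal sketch of the behavior as these notes describe it, not Paddle's implementation; the function signatures and the offset-list representation are assumptions made for illustration.

```python
def sequence_slice(data, lod, offsets, lengths):
    """Take lengths[i] items starting at offsets[i] inside sequence i."""
    out, out_lod = [], [0]
    for i, (start, end) in enumerate(zip(lod[:-1], lod[1:])):
        begin = start + offsets[i]
        stop = begin + lengths[i]
        if offsets[i] < 0 or lengths[i] < 0 or stop > end:
            raise ValueError("slice falls outside sequence %d" % i)
        out.extend(data[begin:stop])
        out_lod.append(len(out))
    return out, out_lod


def sequence_unpad(padded, lengths):
    """Drop padding from equal-length rows; return flat data and offsets."""
    out, out_lod = [], [0]
    for row, n in zip(padded, lengths):
        out.extend(row[:n])
        out_lod.append(len(out))
    return out, out_lod


def sequence_reverse(data, lod):
    """Reverse the items inside each sequence; the offsets do not change."""
    out = []
    for start, end in zip(lod[:-1], lod[1:]):
        out.extend(reversed(data[start:end]))
    return out, list(lod)


# Two sequences, [1, 2, 3] and [4, 5, 6, 7], stored back to back.
data, lod = [1, 2, 3, 4, 5, 6, 7], [0, 3, 7]
print(sequence_slice(data, lod, offsets=[1, 0], lengths=[2, 3]))
print(sequence_unpad([[1, 2, 0], [3, 0, 0]], lengths=[2, 1]))
print(sequence_reverse(data, lod))
```

Keeping the offsets next to the flat data is what lets variable-length sequences share one contiguous buffer without padding.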

Server Inference

  • Added an avx / noavx auto-switch feature that allows major models to switch automatically among avx, avx2, and avx512.

  • Improved inference usability: only one header and one library need to be included.

  • Significantly improved inference performance.

Mobile Inference

  • Added Mali GPU and Adreno GPU support for the mobilenet v1 model.

  • Added ZU5, ZU9 FPGA support for resnet34 and resnet50 models.

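Release Notes (translated from Chinese)

Major New Features and Improvements

Framework

  • The GPU memory optimization strategy "eager deletion" now supports optimizing sub-blocks in control flow (e.g. if-else, while), significantly reducing the GPU memory consumption of models that contain control flow.

  • Optimized the split operator, significantly improving performance.

  • Extended the multiclass_nms operator to support polygon prediction boxes.

  • Added a CUDA implementation of the generate_proposals operator, significantly improving performance.

  • Fused the batch_norm operator through the affine_channel operator, significantly improving performance.

  • Optimized the forward and backward passes of the depthwise_conv operator, significantly improving performance.

  • Optimized the reduce_mean operator.

  • Optimized the sum operator: when the inputs are Tensors, it saves the time of one memory-zeroing pass.

  • Optimized the top_k operator, significantly improving performance.

  • Added the sequence_slice operator: for a sequence, it slices out a subsequence of a specified length starting from a specified position.

  • Added the sequence_unpad operator, which supports converting a padded Tensor to a LoDTensor.

  • Added the sequence_reverse, roi_align, and affine_channel operators.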
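
One way to read the affine_channel / batch_norm fusion item in the list above: once batch_norm statistics are frozen, the operator is a per-channel affine transform, so it can be folded into a single scale and bias per channel. The sketch below checks that identity numerically. It assumes the standard inference-time batch-norm formula; the function names are hypothetical and this is not Paddle's fusion pass.

```python
import math


def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Fold frozen batch-norm statistics into per-channel (scale, bias)."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    bias = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, bias


def batch_norm_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Reference per-channel batch norm using stored statistics."""
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for xi, g, b, m, v in zip(x, gamma, beta, mean, var)]


def affine_channel(x, scale, bias):
    """Per-channel y = x * scale + bias."""
    return [xi * s + b for xi, s, b in zip(x, scale, bias)]


# One activation value per channel, for two channels (made-up numbers).
x = [0.5, -2.0]
gamma, beta = [1.5, 0.8], [0.1, -0.3]
mean, var = [0.2, -1.0], [4.0, 0.25]

scale, bias = fold_batch_norm(gamma, beta, mean, var)
fused = affine_channel(x, scale, bias)
reference = batch_norm_inference(x, gamma, beta, mean, var)
print(fused, reference)
```

The folded form replaces a subtract, a divide, a multiply and an add with one multiply and one add per element, which is where a speedup from this kind of fusion would come from.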

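Server Inference

  • Added automatic AVX / NOAVX switching at deployment time; key models can switch automatically among AVX, AVX2, and AVX512.
  • Improved the usability of the inference library: only one header file and one library need to be included.
  • Significantly improved ICNet inference performance.

Mobile Inference

  • Added mobilenet v1 model support on Mali GPU and Adreno GPU.
  • Added resnet34 and resnet50 model support on FPGA development boards such as ZU5 and ZU9.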

PaddlePaddle 1.0.2

24 Oct 06:31
4a93486

Fix SelectedRows type inference.

PaddlePaddle 1.0.1

10 Oct 11:46
cddff20

Fix Windows library dynamic loading problem

Fix Mac compile on MacOS 10.14

Fix truncated_normal

Fix manylinux docker build

Correctly set SelectedRows output shape

Correctly integrate tensorRT in inference library.

PaddlePaddle 1.0.0

09 Oct 01:08
627bea4

Release Log

Major New Features and Improvements:

  • Support MacOS training, inference, Windows inference (Alpha).

  • Speed up While operator

  • Enhance support for sparse tensor

  • Enhanced TensorRT integration

  • More fused operators for CPU inference: GRU, LSTM, etc.

  • Some improvements for sequence operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc)

  • Other operator improvements: stack_op, BatchAUC, prelude, crf, pad2d

  • decayed_adagrad support for distributed training

  • Python multi-process reader

  • API doc improvements. Avoid kwargs.
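
For readers unfamiliar with the sequence operators named in the list above, here is a minimal plain-Python sketch of what sequence_mask and sequence_enumerate compute. The semantics shown (a 0/1 mask built from lengths, and fixed-size windows padded at the end of each sequence) are stated as assumptions, and the signatures are illustrative rather than Paddle's API.

```python
def sequence_mask(lengths, maxlen=None):
    """Row i has ones in its first lengths[i] positions and zeros after."""
    if maxlen is None:
        maxlen = max(lengths) if lengths else 0
    return [[1 if j < n else 0 for j in range(maxlen)] for n in lengths]


def sequence_enumerate(data, lod, win_size, pad_value=0):
    """For each item, emit the window of win_size items starting at it.

    Windows never cross a sequence boundary; short windows are padded.
    """
    out = []
    for start, end in zip(lod[:-1], lod[1:]):
        for i in range(start, end):
            window = data[i:min(i + win_size, end)]
            out.append(window + [pad_value] * (win_size - len(window)))
    return out


print(sequence_mask([3, 1, 0], maxlen=4))
# Two sequences, [1, 2, 3] and [4, 5], described by offsets [0, 3, 5].
print(sequence_enumerate([1, 2, 3, 4, 5], lod=[0, 3, 5], win_size=2))
```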

Others:

  • Tighten public APIs. Hide public APIs that are currently not widely used and unlikely to be used in the near future.

  • Clean up some deprecated features.

Known Issues

  • Memory optimization still has room for improvement in the next release.

  • Using memory optimization with distributed training requires strictly following some counter-intuitive instructions.

  • Sparse Tensor (SelectedRows) is not handled correctly in some operators; this will be fixed in the next release.

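Release Log (translated from Chinese)

Major New Features and Improvements

  • Support for MacOS training and inference, and Windows inference (Alpha).

  • Sped up the while operator.

  • Enhanced support for sparse tensors.

  • Strengthened TensorRT integration.

  • More fused operators for CPU inference: GRU, LSTM, etc.

  • Optimized sequence-related operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc.)

  • Other operator optimizations: stack_op, BatchAUC, prelude, crf, pad2d

  • decayed_adagrad supports distributed training.

  • Python multi-process reader.

  • API documentation improvements, avoiding problems such as kwargs.

Others:

  • Standardized the management of public APIs. Some APIs that are currently rarely used and unlikely to be used in the future have been hidden.

  • Cleaned up some deprecated features.

Known Issues

  • Memory optimization still has some room for improvement in the next release.

  • Using memory optimization together with distributed training requires strictly following some counter-intuitive steps.

  • Sparse Tensor (SelectedRows) is not handled correctly in some operators; this will be fixed in the next release.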

PaddlePaddle 1.0.0-rc0

25 Sep 10:45
644bad1
Pre-release

Release Log

Major New Features and Improvements:

  • Support MacOS training, inference, Windows inference (Alpha).

  • Speed up While operator

  • Enhance support for sparse tensor

  • Enhanced TensorRT integration

  • More fused operators for CPU inference: GRU, LSTM, etc.

  • Some improvements for sequence operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc)

  • Other operator improvements: stack_op, BatchAUC, prelude, crf, pad2d

  • decayed_adagrad support for distributed training

  • Python multi-process reader

  • API doc improvements. Avoid kwargs.

Others:

  • Tighten public APIs. Hide public APIs that are currently not widely used and unlikely to be used in the near future.

  • Clean up some deprecated features.

Known Issues

  • Memory optimization still has room for improvement in the next release.

  • Using memory optimization with distributed training requires strictly following some counter-intuitive instructions.

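Release Log (translated from Chinese)

Major New Features and Improvements

  • Support for MacOS training and inference, and Windows inference (Alpha).

  • Sped up the while operator.

  • Enhanced support for sparse tensors.

  • Strengthened TensorRT integration.

  • More fused operators for CPU inference: GRU, LSTM, etc.

  • Optimized sequence-related operators (sequence_pool, sequence_concat, sequence_mask, sequence_enumerate, sequence_slice, etc.)

  • Other operator optimizations: stack_op, BatchAUC, prelude, crf, pad2d

  • decayed_adagrad supports distributed training.

  • Python multi-process reader.

  • API documentation improvements, avoiding problems such as kwargs.

Others:

  • Standardized the management of public APIs. Some APIs that are currently rarely used and unlikely to be used in the future have been hidden.

  • Cleaned up some deprecated features.

Known Issues

  • Memory optimization still has some room for improvement in the next release.

  • Using memory optimization together with distributed training requires strictly following some counter-intuitive steps.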

PaddlePaddle 0.15.0

05 Sep 03:46
1ca241c

Release Log

Major New Features and Improvements:

  • PyReader. Support python-level customized data loading and preprocessing for the buffered reader.

  • Unified Intermediate Representation (IR) and transforms for single-machine, distributed training and inference.

  • Python3 early support. (Alpha testing)

  • Inference library symbol hiding. Better isolation with other libraries linked together.

  • Distributed lookup table training with parallel executor. Allows distributed training to scale to large sparse datasets. (Alpha testing)

  • Major stability improvements and test coverage improvements of distributed training.

  • Polished frequently triggered enforce error messages to improve usability.

  • Profiler improvements for dist_train and fixes.

  • Operator improvements: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout

  • Major expansion of TensorRT inference support.

  • Continuous Integration and Evaluation service scale and stability improvements

  • Hide many public APIs that shouldn't be exposed.
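
The PyReader item in the list above describes Python-level loading and preprocessing that feeds a buffered reader. The general pattern can be sketched with the standard library alone: a background thread runs the user's Python reader and fills a bounded queue while the consumer drains it. This illustrates the idea only; `buffered`, `preprocess` and `raw_reader` are hypothetical names, not the PyReader API.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def buffered(reader, capacity=4):
    """Wrap a reader (a zero-argument generator function) with a prefetch buffer."""
    def buffered_reader():
        buf = queue.Queue(maxsize=capacity)

        def producer():
            try:
                for sample in reader():
                    buf.put(sample)  # blocks while the buffer is full
            finally:
                buf.put(_END)

        threading.Thread(target=producer, daemon=True).start()
        while True:
            sample = buf.get()
            if sample is _END:
                return
            yield sample
    return buffered_reader


def raw_reader():
    """Stand-in for user code that loads samples."""
    for i in range(5):
        yield {"image_id": i, "label": i % 2}


def preprocess(reader):
    """Stand-in for user-defined preprocessing applied to each sample."""
    def wrapped():
        for sample in reader():
            yield (sample["image_id"] * 10, sample["label"])
    return wrapped


samples = list(buffered(preprocess(raw_reader), capacity=2)())
print(samples)
```

The bounded queue gives back-pressure: the producer can run ahead of the consumer by at most `capacity` samples. This sketch does not handle a consumer that stops early, in which case the producer thread stays blocked until the process exits.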

Performance:

  • layer_norm speedup: forward 0.52ms -> 0.16ms (average); backward 1.08ms -> 0.41ms (average).

  • conv_grad mkldnn speedup; fc and gru CPU improvements.

  • reduce_sum CPU kernel speedup: 4 times.

  • softmax_with_cross_entropy speedup: forward 52.4ms -> 15.6ms.

  • OCR CPU model speed improvements: an enhanced im2col improved the performance of some OCR models on 2620v3 by 34.6%.

  • depthwise conv2d_transposed speedup, which improved face detection model speed by 16.5%.
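
The before/after timings quoted in the list above are easier to compare as ratios. The short script below derives the speedup factor and the share of time saved from the reported numbers; only the arithmetic is added here, and the dictionary keys are just labels.

```python
# (before_ms, after_ms) pairs copied from the performance list.
timings_ms = {
    "layer_norm forward": (0.52, 0.16),
    "layer_norm backward": (1.08, 0.41),
    "softmax_with_cross_entropy forward": (52.4, 15.6),
}


def speedup(before, after):
    """How many times faster the new timing is."""
    return before / after


def percent_saved(before, after):
    """Share of the original time that was eliminated, in percent."""
    return 100.0 * (before - after) / before


for name, (before, after) in timings_ms.items():
    print("%s: %.2fx faster, %.1f%% less time"
          % (name, speedup(before, after), percent_saved(before, after)))
```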

Others:

  • Added external dependencies: xbyak, cub, libxsmm

  • Merged libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]. Inference only needs to link libpaddle_fluid[.a/.so].

  • Fixes of float16 support

  • Significantly reduce fluid.tgz package size. GPU version reduced from 730M to 190M. CPU version reduced from 335M to 77M.

Known Issues

  • Using memory_optimize with distributed training might trigger subtle bugs. We are aiming to fix it in the next release.

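Release Log (translated from Chinese)

Major New Features and Improvements

  • PyReader. Supports custom data loading and preprocessing in Python, with the results then sent to a buffered reader.

  • Single-machine training, multi-machine training and inference all use a unified intermediate representation and transforms.

  • Python3 support. (Alpha testing)

  • Better symbol hiding in the inference library, giving better isolation from other dependent libraries.

  • Support for distributed lookup table, enabling large-scale sparse training. (Alpha testing)

  • Significant stability improvements and test coverage improvements for distributed training.

  • Improved the readability of error messages.

  • Profiler support for distributed training, and fixes.

  • New operators: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout

  • Extended TensorRT support, with more TensorRT operators supported.

  • Improved the scale and stability of the continuous integration test system.

  • Hid many public APIs that should not be exposed, making the public API more rigorous.

Performance:

  • layer_norm forward speedup: 0.52ms -> 0.16ms (average); backward speedup: 1.08ms -> 0.41ms (average).

  • conv_grad mkldnn speedup; fc and gru optimizations on CPU.

  • reduce_sum is 4 times faster on CPU.

  • softmax_with_cross_entropy speedup: 52.4ms -> 15.6ms.

  • OCR CPU model performance improvements: an improved im2col implementation makes conv more efficient, giving OCR models a 34.6% performance improvement on 2620v3.

  • conv2d_transposed_op supports setting Group, and depthwise conv2d_transposed is faster; this speedup improves face detection model speed by 16.5%.

Others:

  • New third-party libraries: xbyak, cub, libxsmm

  • Merged libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]; inference only needs to link libpaddle_fluid[.a/.so].

  • float16 fixes.

  • Greatly reduced the size of the released fluid.tgz package: the GPU version went from 730M to 190M and the CPU version from 335M to 77M, which speeds up user downloads.

Known Issues

  • memory_optimize can trigger bugs in distributed training; we will fix this in the next release.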

PaddlePaddle 0.15.0-rc0

03 Sep 03:23
64d48f4
Pre-release

Release Log

Major New Features and Improvements:

  • PyReader. Support python-level customized data loading and preprocessing for the buffered reader.

  • Unified Intermediate Representation (IR) and transforms for single-machine, distributed training and inference.

  • Python3 early support. (Alpha testing)

  • Inference library symbol hiding. Better isolation with other libraries linked together.

  • Distributed lookup table training with parallel executor. Allows distributed training to scale to large sparse datasets. (Alpha testing)

  • Major stability improvements and test coverage improvements of distributed training.

  • Polished frequently triggered enforce error messages to improve usability.

  • Profiler improvements for dist_train and fixes.

  • Operator improvements: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout

  • Major expansion of TensorRT inference support.

  • Continuous Integration and Evaluation service scale and stability improvements

  • Hide many public APIs that shouldn't be exposed.

Performance:

  • layer_norm speedup: forward 0.52ms -> 0.16ms (average); backward 1.08ms -> 0.41ms (average).

  • conv_grad mkldnn speedup; fc and gru CPU improvements.

  • reduce_sum CPU kernel speedup: 4 times.

  • softmax_with_cross_entropy speedup: forward 52.4ms -> 15.6ms.

  • OCR CPU model speed improvements: an enhanced im2col improved the performance of some OCR models on 2620v3 by 34.6%.

  • depthwise conv2d_transposed speedup, which improved face detection model speed by 16.5%.

Others:

  • Added external dependencies: xbyak, cub, libxsmm

  • Merged libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]. Inference only needs to link libpaddle_fluid[.a/.so].

  • Fixes of float16 support

  • Significantly reduce fluid.tgz package size. GPU version reduced from 730M to 190M. CPU version reduced from 335M to 77M.

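Release Log (translated from Chinese)

Major New Features and Improvements

  • PyReader. Supports custom data loading and preprocessing in Python, with the results then sent to a buffered reader.

  • Single-machine training, multi-machine training and inference all use a unified intermediate representation and transforms.

  • Python3 support. (Alpha testing)

  • Better symbol hiding in the inference library, giving better isolation from other dependent libraries.

  • Support for distributed lookup table, enabling large-scale sparse training. (Alpha testing)

  • Significant stability improvements and test coverage improvements for distributed training.

  • Improved the readability of error messages.

  • Profiler support for distributed training, and fixes.

  • New operators: mkldnn softmax_grad, squeeze_op, hsigmoid_op, Sampling id, beam_search, flatten_op, rank_loss_op, prior_box_op, bilinear initializer, squeeze/unsqueeze, maxout

  • Extended TensorRT support, with more TensorRT operators supported.

  • Improved the scale and stability of the continuous integration test system.

  • Hid many public APIs that should not be exposed, making the public API more rigorous.

Performance:

  • layer_norm forward speedup: 0.52ms -> 0.16ms (average); backward speedup: 1.08ms -> 0.41ms (average).

  • conv_grad mkldnn speedup; fc and gru optimizations on CPU.

  • reduce_sum is 4 times faster on CPU.

  • softmax_with_cross_entropy speedup: 52.4ms -> 15.6ms.

  • OCR CPU model performance improvements: an improved im2col implementation makes conv more efficient, giving OCR models a 34.6% performance improvement on 2620v3.

  • conv2d_transposed_op supports setting Group, and depthwise conv2d_transposed is faster; this speedup improves face detection model speed by 16.5%.

Others:

  • New third-party libraries: xbyak, cub, libxsmm

  • Merged libpaddle_inference_api[.a/.so] into libpaddle_fluid[.a/.so]; inference only needs to link libpaddle_fluid[.a/.so].

  • float16 fixes.

  • Greatly reduced the size of the released fluid.tgz package: the GPU version went from 730M to 190M and the CPU version from 335M to 77M, which speeds up user downloads.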

v0.14.0

03 Jul 08:29
163b5e5

Release Log

Major Features

  • Enhanced the inference library. Better memory buffer. Added several demos.
  • Inference library added support for Anakin engine, TensorRT engine.
  • ParallelExecutor supports multi-threaded CPU training. (In addition to multi-GPU training)
  • Added mean IOU operator, argsort operator, etc. Improved L2norm operator. Added crop API.
  • Released pre-trained ResNet50, Se-Resnext50, AlexNet, etc. Enhanced Transformer, etc.
  • New data augmentation operators.
  • Major documentation and API comment improvements.
  • Enhance the continuous evaluation system.

Performance Improvements

  • More overlap of distributed training network operation with computation. ~10% improvements
  • CPU performance improvements with more MKLDNN support.

Major Bug Fixes

  • Fix memory leak issues.
  • Fix concat operator.
  • Fix ParallelExecutor input data memcpy issue.
  • Fix ParallelExecutor deadlock issue.
  • Fix distributed training client timeout.
  • Fix distributed training pserver side learning rate decay.
  • Thread-safe Scope implementation.
  • Fix some issues when using the memory optimizer and ParallelExecutor together.

Known Issues

  • IfElse has some bugs.
  • BatchNorm is not stable if batch_size=1
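
The notes do not say what form the BatchNorm instability takes. One well-known effect that could be related: when each feature is normalized with the statistics of the current batch and the batch holds a single sample, the batch mean equals the sample and the variance is zero, so every normalized value collapses to zero regardless of the input. The sketch below shows this for a simple per-feature batch norm without scale and shift. It is an illustration under that assumption, not a diagnosis of the Paddle issue.

```python
import math


def batch_norm_train(batch, eps=1e-5):
    """Normalize each feature with the mean and variance of the current batch."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(num_features)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(num_features)]
            for row in batch]


print(batch_norm_train([[1.0, 5.0], [3.0, 9.0]]))  # two samples: informative output
print(batch_norm_train([[1.0, 5.0]]))              # one sample: all zeros
print(batch_norm_train([[123.0, -7.0]]))           # different input, same zeros
```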

v0.13.0

05 Jun 07:34
9d40eb3

Release Log

Major Features

  • Asynchronous distributed training support.
  • Distributed training with ParallelExecutor.
  • Distributed ring-based training with NCCL2.
  • Support checkpoint save on trainer and store on trainer and parameter server.
  • Graceful shutdown of parameter server.
  • Publish the high-level inference lib API and inference implementation.
  • Assign roles to each op.
  • Publish the C++ train API to allow embedding fluid into other C++ systems.
  • Support uint8_t type data file and data exchange.
  • C++ reader supports customized data augmentation.
  • Improved operator and interface support for speech models.
  • New random_crop op.
  • New shape op to get the tensor's shape.
  • New resize_bilinear interface.
  • New dice_loss layer.
  • Enhanced reduce_op to support reduce on multiple dimensions.
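
For context on the dice_loss layer named in the list above, here is a minimal sketch of the commonly used Dice loss on flat lists of predicted probabilities and binary labels. The exact formula and the `epsilon` default used by the Paddle layer are not given in these notes, so both are assumptions here.

```python
def dice_loss(probs, labels, epsilon=1e-5):
    """1 - 2*|P∩T| / (|P| + |T| + epsilon) for soft predictions P and labels T."""
    intersection = sum(p * t for p, t in zip(probs, labels))
    denominator = sum(probs) + sum(labels)
    # epsilon keeps the division defined when both inputs are all zeros.
    return 1.0 - 2.0 * intersection / (denominator + epsilon)


print(dice_loss([1.0, 0.0, 1.0], [1, 0, 1]))  # perfect overlap: close to 0
print(dice_loss([1.0, 0.0], [0, 1]))          # no overlap: exactly 1.0
print(dice_loss([0.8, 0.2, 0.6], [1, 0, 1]))  # partial overlap
```

Because the loss is a ratio of overlap to total mass, it is less sensitive than per-pixel cross entropy to a large background class, which is why it is popular for segmentation.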

Performance Improvements

On P40 GPUs with the ResNet-50 model, single-GPU speed improves by 23.8% (105 images/sec to 130 images/sec). The speedup ratio is 6 on 8 GPUs and reaches 17.4 on 32 GPUs.

  • Overlap send/recv op with other operators.
  • Multi-thread server-side request handling.
  • Weight decay and clipping moved from trainer to parameter server for performance and correctness.
  • Improved C++ reader.
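
The ResNet-50 figures quoted in this section can be checked and extended with a little arithmetic. The script below reproduces the 23.8% single-GPU gain and derives the scaling efficiency implied by the multi-GPU speedup ratios. It assumes that "speedup ratio" means throughput relative to a single GPU, which the notes do not state explicitly.

```python
def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / num_gpus


single_gpu_gain = percent_gain(105, 130)   # images/sec before and after
eff_8 = scaling_efficiency(6.0, 8)
eff_32 = scaling_efficiency(17.4, 32)

print("single GPU gain: %.1f%%" % single_gpu_gain)
print("8 GPUs:  %.1f%% of linear scaling" % (100 * eff_8))
print("32 GPUs: %.1f%% of linear scaling" % (100 * eff_32))
```

Under that assumption, efficiency falls from 75% at 8 GPUs to roughly 54% at 32 GPUs, the usual pattern when communication cost grows with the number of workers.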

Major Bug Fixes

  • Fix accuracy loss when both ParallelExecutor and memory optimizer are used.
  • Fix ParallelExecutor hang when multiple inputs are duplicated.
  • Fix memory leak caused by Program clone.
  • Fix ineffective bias and wrong activation in the GRU unit.
  • Fix ROI Pooling GPU computation issues.
  • Fix fill_constant_batch_size_like when input is sequence.
  • Fix reshape op.

v0.12.0

26 Apr 11:48

Release log

Major Improvements

Reader Prototype. Data can be read through C++ reader asynchronously with potentially higher performance.

ParallelExecutor. Significantly improve the multi-gpu performance over the previous solution.

Distributed Training. Major performance improvements and stability improvements.

Inplace Activation. Significantly reduce the GPU memory requirements and increase the batch size.

Operator Optimizations. Performance improvements of many operators.

Timeline Profiling. Allows visualizing performance as a time series.

Major Bug Fixes

Calling cublas/cudnn library with wrong argument types.

Evaluated Models

Image Classification

Object Detection

OCR

Machine Translation

Text Classification

Language Model

Sequence Tagging