
Replaced hardshrink implementation by a pure python one. #913


Closed
wants to merge 7 commits into from

Conversation

@gabrieldemarmiesse (Member) commented Jan 19, 2020

With XLA, we can write python functions that run as fast as (or faster than) custom C++/CUDA code and can target more platforms (ROCm and TPU too). Since it's python, people who have problems compiling tf addons will still be able to use this function.

It's experimental right now, but if we pin the target tensorflow version in the CI, we shouldn't have any problem using experimental features.

The proposed implementation is as fast as the previous one on GPU and much faster on CPU. See this notebook for the benchmark.
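
For context, here is a minimal sketch of what the pure python version looks like (the lower/upper defaults follow the existing tfa.activations.hardshrink signature, and experimental_compile=True is assumed to be the flag enabling XLA JIT in the pinned TF version; this is illustrative, not the exact code in the PR):

import tensorflow as tf

@tf.function(experimental_compile=True)  # XLA JIT compilation
def hardshrink(x, lower=-0.5, upper=0.5):
    x = tf.convert_to_tensor(x)
    # Zero out values inside [lower, upper], keep the rest unchanged.
    mask = tf.logical_or(x < lower, x > upper)
    return tf.where(mask, x, tf.zeros_like(x))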

We have three choices:

  • We accept this PR as is and remove the C++ and CUDA implementations.
  • We keep both implementations and check in the test suite that they give the same results on random tensors (that's actually a nice test to have; see the sketch after this list). We add a flag/fallback for people who prefer the pure python version (people who want a faster op or who have problems compiling).
  • We drop the idea until XLA is no longer experimental.
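
For the second option, such a consistency test might look roughly like this (py_hardshrink and custom_op_hardshrink are placeholder names for the pure python and C++/CUDA implementations, which aren't named here):

import numpy as np

def test_implementations_agree():
    # Check that the pure python and custom op versions (placeholder names)
    # give the same results on random tensors of random shapes.
    for _ in range(20):
        shape = tuple(np.random.randint(1, 10) for _ in range(3))
        x = np.random.uniform(-5, 5, size=shape).astype(np.float32)
        np.testing.assert_allclose(
            py_hardshrink(x).numpy(),
            custom_op_hardshrink(x).numpy(),
            rtol=1e-6,
        )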

I'm happy to have any input from the maintainers, as this can set a guideline for future features that we accept, or even for how we change the ops we have already implemented in C++/CUDA.

@gabrieldemarmiesse changed the title from "Replaced hardshrink implementation by a pure python version." to "Replaced hardshrink implementation by a pure python one." on Jan 19, 2020
@WindQAQ (Member) left a comment


My previous concern was that not all people can enjoy XLA support on their machines (like my lab's workstation: the admin compiled TensorFlow from source without XLA), but XLA is fairly mature by now. I am happy to switch back to pure python ops if it is indeed faster 😄

It turns out we might have another issue: should we use XLA in our repo or let users compile it themselves?

@WindQAQ (Member) commented Jan 20, 2020

cc @tensorflow/sig-addons-maintainers

@WindQAQ (Member) commented Jan 20, 2020

@guillaumekln (Contributor) commented:
Does XLA now support dynamic batch size? Last time I checked it required shapes to be fully defined, otherwise it would re-compile the kernel on unseen shapes.

@gabrieldemarmiesse (Member, Author) commented Jan 20, 2020

@guillaumekln nice point.

I tried with this piece of code:

import numpy as np

# hardshrink here is the pure python implementation proposed in this PR
for _ in range(50):
    size = tuple(np.random.randint(1, 30) for _ in range(np.random.randint(2, 6)))
    np_array = np.random.uniform(-5, 5, size=size)
    hardshrink(np_array)

and I didn't get any error or warning. Usually tf.function warns you when the graph is being retraced.

In the XLA docs, I didn't see anything related to fixed vs. dynamic shapes.

I don't know how we can check this further. Maybe we should ask someone on the XLA team?

@gabrieldemarmiesse (Member, Author) commented Jan 20, 2020

EDIT for the main PR description: I ran the notebook again. Due to how tf eager works, a function can return before its result is actually computed, so I now call .numpy() to force computation. The speed is the same for GPU execution, and XLA is much faster for single-CPU execution (I expect SIMD and other CPU-specific instructions to play a big part here).
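
For reference, the forcing pattern is roughly this (the benchmark helper below is only an illustrative sketch, not the notebook code):

import time

def benchmark(fn, x, iters=100):
    fn(x).numpy()  # warm-up (triggers tracing / XLA compilation)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x).numpy()  # .numpy() blocks until the result is actually computed
    return (time.perf_counter() - start) / iters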

I wonder if there are easy-to-use benchmarking utilities for tensorflow functions somewhere.

@guillaumekln (Contributor) commented:
It seems you can get some information from the debug logs. For example, with random shapes it rebuilds a computation on each iteration:

$ TF_CPP_MIN_VLOG_LEVEL=1 python hardshrink.py 2>&1 | grep "Building new computation" | wc -l
50

while a fixed shape results in a single build:

$ TF_CPP_MIN_VLOG_LEVEL=1 python hardshrink.py 2>&1 | grep "Building new computation" | wc -l
1

This could be an issue for some tasks.
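
For reference, the fixed-shape variant of the loop above would be something like this (same hypothetical hardshrink function as before; the shape is arbitrary):

import numpy as np

# Same shape on every call, so XLA builds the computation only once.
for _ in range(50):
    np_array = np.random.uniform(-5, 5, size=(16, 16))
    hardshrink(np_array)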

@gabrieldemarmiesse (Member, Author) commented Jan 20, 2020

@guillaumekln good point. That means we shouldn't use XLA inside addons, but rather let people apply it to their own graph if they wish.

I updated the notebook and added eager and graph (without XLA) to the benchmark, if that can help us make a more informed decision.

In all frankness, the benchmark results for GPU feel very strange: the GPU version is not faster than the CPU one. Maybe most of the time is spent getting the result back from the GPU.

A good benchmarking tool would be nice and would help us in other situations where we don't know how to implement an op (pure python vs. C++/CUDA).

@WindQAQ (Member) commented Jan 20, 2020

@gabrieldemarmiesse (Member, Author) commented:
The problem is that I don't know how to force the execution of the op without calling .numpy(), since GPU ops are executed asynchronously.

@gabrieldemarmiesse (Member, Author) commented Jan 22, 2020

Thanks @WindQAQ @guillaumekln @facaiy for your inputs.

Based on what we've seen so far, it's not a good idea to use XLA inside tf addons, because of the recompilation that happens every time the input shape changes.

Users should use XLA/tf.function whenever they need performance, but the decision should be theirs.

For debugging purposes, I didn't decorate the function with tf.function (without XLA), since the speed increase is small.

We should give users the liberty to apply tf.function/XLA when they need speed, and to run the plain python function when they need to debug.
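
Concretely, a user who wants the speed could opt in on their side with something like this (assuming the function ships as tfa.activations.hardshrink and that experimental_compile is the available XLA flag):

import tensorflow as tf
import tensorflow_addons as tfa

# Opt in to graph mode + XLA on the user side; plain eager calls stay debuggable.
fast_hardshrink = tf.function(tfa.activations.hardshrink, experimental_compile=True)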

I believe that with this new implementation we gain in speed, maintainability, and debuggability on the user side. This function will also be usable by everyone who previously had issues with C++/CUDA compilation.

@WindQAQ (Member) commented Jan 22, 2020

The problem is that I don't know how to force the execution of the op without calling .numpy(), since GPU ops are executed asynchronously.

It seems the CUDA stream serves ops in FIFO order, so we can force the last op to execute in order to flush all the ops in the stream.

https://www.tensorflow.org/guide/eager#performance
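
A sketch of that trick (illustrative only): queue all the ops, then synchronize once on the last result.

import time

def time_op(fn, x, iters=100):
    fn(x).numpy()  # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        result = fn(x)  # dispatched asynchronously on the stream
    _ = result.numpy()  # syncing on the last op flushes everything queued before it
    return (time.perf_counter() - start) / iters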

Measured this way, the custom op is 1.5~2 times faster than XLA on GPU, but 4 times slower on CPU. Without XLA, the custom op is much faster than pure python operations on GPU. The custom op on CPU is surprisingly slow to me; I will check whether I missed something in the implementation.

https://colab.research.google.com/drive/12ende9xXMSywP2lOKWrFJBwaDDHbkYXh

I would like to say again that I am not against either python or C++ ops; both have pros and cons. The main issue with C++ custom ops is compilation and maintainability, as you mentioned above, while the issue with python ops is speed. It would be really great to see if XLA/JIT can strike a balance between the two worlds 😄

@gabrieldemarmiesse (Member, Author) commented:
I'll close this PR for now. It was a great discussion, thanks everyone! I'll think about it more and maybe open another pull request once everything is ready :)
