
Replaced hardshrink implementation by a pure python one. #913


Closed
wants to merge 7 commits into from

Conversation

@gabrieldemarmiesse (Member) commented Jan 19, 2020

With XLA, we can write python functions that run as fast as (or faster than) custom C++/CUDA code and can target more platforms (ROCm and TPU too). Since it's python, people who have problems compiling tf addons will still be able to use this function.

It's experimental right now, but if we pin the target tensorflow version in the CI, we shouldn't have any problem using experimental features.

The proposed implementation is as fast as the previous one on GPU and much faster on CPU. See this notebook for the benchmark.
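
For context, here is a minimal sketch of what the pure python version looks like (the lower/upper defaults follow the existing tfa.activations.hardshrink signature, and experimental_compile=True is assumed to be the flag enabling XLA JIT in the pinned TF version; this is illustrative, not the exact code in the PR):

import tensorflow as tf

@tf.function(experimental_compile=True)  # XLA JIT compilation
def hardshrink(x, lower=-0.5, upper=0.5):
    x = tf.convert_to_tensor(x)
    # Zero out values inside [lower, upper], keep the rest unchanged.
    mask = tf.logical_or(x < lower, x > upper)
    return tf.where(mask, x, tf.zeros_like(x))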

We have three choices:

  • We accept this PR as is and remove the C++ and CUDA implementations.
  • We keep both implementations and check in the test suite that they give the same results on random tensors (that's actually a nice test to have; see the sketch after this list). We add a flag/fallback for people who prefer the pure python version (people who want a faster op or who have problems compiling).
  • We drop the idea until XLA is no longer experimental.
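
For the second option, such a consistency test might look roughly like this (py_hardshrink and custom_op_hardshrink are placeholder names for the pure python and C++/CUDA implementations, which aren't named here):

import numpy as np

def test_implementations_agree():
    # Check that the pure python and custom op versions (placeholder names)
    # give the same results on random tensors of random shapes.
    for _ in range(20):
        shape = tuple(np.random.randint(1, 10) for _ in range(3))
        x = np.random.uniform(-5, 5, size=shape).astype(np.float32)
        np.testing.assert_allclose(
            py_hardshrink(x).numpy(),
            custom_op_hardshrink(x).numpy(),
            rtol=1e-6,
        )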

I'm happy to have any input from the maintainers, as this can set a guideline for future features that we accept, or even for how we change the ops we have already implemented in C++/CUDA.

@gabrieldemarmiesse changed the title from "Replaced hardshrink implementation by a pure python version." to "Replaced hardshrink implementation by a pure python one." on Jan 19, 2020
@WindQAQ (Member) left a comment


My previous concern was that not all people can enjoy XLA support on their machines (like my lab's workstation: the admin compiled TensorFlow from source without XLA), but XLA is fairly mature by now. I am happy to switch back to pure python ops if it is indeed faster 😄

It turns out we might have another issue: should we use XLA in our repo or let users compile it themselves?

@WindQAQ (Member) commented Jan 20, 2020

cc @tensorflow/sig-addons-maintainers

@WindQAQ (Member) commented Jan 20, 2020

@guillaumekln (Contributor) commented:
Does XLA now support dynamic batch size? Last time I checked it required shapes to be fully defined, otherwise it would re-compile the kernel on unseen shapes.

@gabrieldemarmiesse (Member, Author) commented Jan 20, 2020

@guillaumekln nice point.

I tried with this piece of code:

import numpy as np

# hardshrink here is the pure python implementation proposed in this PR
for _ in range(50):
    size = tuple(np.random.randint(1, 30) for _ in range(np.random.randint(2, 6)))
    np_array = np.random.uniform(-5, 5, size=size)
    hardshrink(np_array)

and I didn't get any error or warning. Usually tf.function warns you when the graph is being retraced.

In the XLA docs, I didn't see anything related to fixed vs. dynamic shapes.

I don't know how we can check this further. Maybe we should ask someone on the XLA team?

@gabrieldemarmiesse (Member, Author) commented Jan 20, 2020

EDIT for the main PR description: I ran the notebook again. Due to how tf eager works, a function can return before its result is actually computed, so I now call .numpy() to force computation. The speed is the same for GPU execution, and XLA is much faster for single-CPU execution (I expect SIMD and other CPU-specific instructions to play a big part here).
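
For reference, the forcing pattern is roughly this (the benchmark helper below is only an illustrative sketch, not the notebook code):

import time

def benchmark(fn, x, iters=100):
    fn(x).numpy()  # warm-up (triggers tracing / XLA compilation)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x).numpy()  # .numpy() blocks until the result is actually computed
    return (time.perf_counter() - start) / iters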

I wonder if there are easy-to-use benchmarking utilities for tensorflow functions somewhere.

@guillaumekln (Contributor) commented:
It seems you can get some information from the debug logs. For example, with random shapes it rebuilds a computation on each iteration:

$ TF_CPP_MIN_VLOG_LEVEL=1 python hardshrink.py 2>&1 | grep "Building new computation" | wc -l
50

while a fixed shape results in a single build:

$ TF_CPP_MIN_VLOG_LEVEL=1 python hardshrink.py 2>&1 | grep "Building new computation" | wc -l
1

This could be an issue for some tasks.
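
For reference, the fixed-shape variant of the loop above would be something like this (same hypothetical hardshrink function as before; the shape is arbitrary):

import numpy as np

# Same shape on every call, so XLA builds the computation only once.
for _ in range(50):
    np_array = np.random.uniform(-5, 5, size=(16, 16))
    hardshrink(np_array)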

@gabrieldemarmiesse (Member, Author) commented Jan 20, 2020

@guillaumekln good point. That means we shouldn't use XLA inside addons, but rather let people apply it to their own graph if they wish.

I updated the notebook and added eager and graph (without XLA) to the benchmark, if that can help us make a more informed decision.

In all frankness, the benchmark results for GPU feel very strange: the GPU version is not faster than the CPU one. Maybe most of the time is spent getting the result back from the GPU.

A good benchmarking tool would be nice and would help us in other situations where we don't know how to implement an op (pure python vs. C++/CUDA).

@WindQAQ (Member) commented Jan 20, 2020

@gabrieldemarmiesse (Member, Author) commented:
The problem is that I don't know how to force the execution of the op without calling .numpy(), since GPU ops are executed asynchronously.

@gabrieldemarmiesse (Member, Author) commented Jan 22, 2020

Thanks @WindQAQ @guillaumekln @facaiy for your inputs.

Based on what we've seen so far, it's not a good idea to use XLA inside tf addons, because of the recompilation that happens every time the input shape changes.

Users should use XLA/tf.function whenever they need performance, but the decision should be theirs.

For debugging purposes, I didn't decorate the function with tf.function (without XLA), since the speed increase is small.

We should give users the liberty to apply tf.function/XLA when they need speed, and to run the plain python function when they need to debug.
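
Concretely, a user who wants the speed could opt in on their side with something like this (assuming the function ships as tfa.activations.hardshrink and that experimental_compile is the available XLA flag):

import tensorflow as tf
import tensorflow_addons as tfa

# Opt in to graph mode + XLA on the user side; plain eager calls stay debuggable.
fast_hardshrink = tf.function(tfa.activations.hardshrink, experimental_compile=True)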

I believe that with this new implementation we gain in speed, maintainability, and debuggability on the user side. This function will also be usable by everyone who previously had issues with C++/CUDA compilation.

@WindQAQ (Member) commented Jan 22, 2020

The problem is that I don't know how to force the execution of the op without calling .numpy(), since GPU ops are executed asynchronously.

It seems the CUDA stream serves ops in FIFO order, so we can force the last op to execute in order to flush all the ops in the stream.

https://www.tensorflow.org/guide/eager#performance
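
A sketch of that trick (illustrative only): queue all the ops, then synchronize once on the last result.

import time

def time_op(fn, x, iters=100):
    fn(x).numpy()  # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        result = fn(x)  # dispatched asynchronously on the stream
    _ = result.numpy()  # syncing on the last op flushes everything queued before it
    return (time.perf_counter() - start) / iters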

Measured this way, the custom op is 1.5~2 times faster than XLA on GPU, but 4 times slower on CPU. Without XLA, the custom op is much faster than pure python operations on GPU. The custom op on CPU is surprisingly slow to me; I will check whether I missed something in the implementation.

https://colab.research.google.com/drive/12ende9xXMSywP2lOKWrFJBwaDDHbkYXh

I would like to say again that I am not against either python or C++ ops; both have pros and cons. The main issue with C++ custom ops is compilation and maintainability, as you mentioned above, while the issue with python ops is speed. It would be really great to see if XLA/JIT can strike a balance between the two worlds 😄

@gabrieldemarmiesse (Member, Author) commented:
I'll close this PR for now. It was a great discussion, thanks everyone! I'll think about it more and maybe open another pull request once everything is ready :)
