Replaced hardshrink implementation by a pure python one. #913
Conversation
So my previous concern was that not everyone can enjoy XLA support on their machine (like my lab's workstation: the admin compiled TensorFlow from source, but without XLA), but XLA is fairly mature by now. I am happy to switch back to pure Python ops if it is indeed faster 😄
It turns out we might have another issue: should we use XLA in our repo, or let users compile it themselves?
cc @tensorflow/sig-addons-maintainers
remember to remove this line
Does XLA now support dynamic batch sizes? Last time I checked, it required shapes to be fully defined; otherwise it would recompile the kernel on unseen shapes.
@guillaumekln nice point, I tried with this piece of code:

```python
import numpy as np

# hardshrink is the XLA-compiled function under test
for _ in range(50):
    size = tuple(np.random.randint(1, 30) for _ in range(np.random.randint(2, 6)))
    np_array = np.random.uniform(-5, 5, size=size)
    hardshrink(np_array)
```

and I didn't get any error or warning. Usually tf.function warns you when the graph is being retraced. In the XLA doc I didn't see anything related to fixed vs. dynamic shapes. I don't know how we can do further checks on that. Maybe ask someone on the XLA team?
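One way to observe Python-level retracing directly, which is separate from the XLA-level recompilation discussed below, is to put a `print` inside the function body: it only executes while tf.function traces the function. This is a minimal sketch, not code from the thread; the `tf.function(experimental_compile=True)` decoration and the hardshrink-like body are assumptions.

```python
import numpy as np
import tensorflow as tf

@tf.function(experimental_compile=True)  # one way to request XLA JIT (assumed; later TF versions call this jit_compile)
def f(x):
    print("tracing for shape", x.shape)  # runs only during tracing, not on every call
    return tf.where(tf.abs(x) > 0.5, x, tf.zeros_like(x))  # hardshrink-like body, for illustration only

for _ in range(10):
    size = tuple(np.random.randint(1, 30) for _ in range(np.random.randint(2, 6)))
    f(tf.constant(np.random.uniform(-5, 5, size=size)))

# The number of "tracing" lines shows how often tf.function retraced; XLA may still
# rebuild its computation per concrete shape, which only shows up in the verbose logs.
```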
EDIT for the main PR description: I ran the notebook again. Due to how TF eager execution works, a function can return without the result actually being computed. I wonder if there exist easy-to-use benchmarking utilities for TensorFlow functions somewhere.
It seems you can get some information from the debug logs. For example, with random shapes it rebuilds a computation on each iteration:

```
$ TF_CPP_MIN_VLOG_LEVEL=1 python hardshrink.py 2>&1 | grep "Building new computation" | wc -l
50
```

while a fixed shape results in a single build:

```
$ TF_CPP_MIN_VLOG_LEVEL=1 python hardshrink.py 2>&1 | grep "Building new computation" | wc -l
1
```

This could be an issue for some tasks.
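The `hardshrink.py` script being grepped is not shown in the thread. A plausible sketch that would reproduce both counts might look like the following; the function body, thresholds, decorator, and shapes are all assumptions.

```python
# hardshrink.py -- hypothetical reconstruction of the script used for the log check
import numpy as np
import tensorflow as tf

@tf.function(experimental_compile=True)  # request XLA JIT (assumed decoration)
def hardshrink(x, lower=-0.5, upper=0.5):
    return tf.where(tf.logical_or(x < lower, x > upper), x, tf.zeros_like(x))

RANDOM_SHAPES = True  # flip to False to get a single "Building new computation" line

for _ in range(50):
    if RANDOM_SHAPES:
        size = tuple(np.random.randint(1, 30) for _ in range(np.random.randint(2, 6)))
    else:
        size = (16, 16)
    hardshrink(tf.constant(np.random.uniform(-5, 5, size=size)))
```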
@guillaumekln good point. That means we shouldn't use XLA in Addons, but let people use it on their own graphs if they wish. I updated the notebook and added eager and graph (without XLA) modes to the benchmark, if that can help us make a more informed decision. In all frankness, the GPU benchmark results feel very strange: the GPU version is not faster than the CPU one. Maybe most of the time is spent getting the result back from the GPU. A good benchmarking tool would be nice and would help us in other situations where we don't know how to implement an op (pure Python vs. C++/CUDA).
Umm, I think https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/ops.py#L946-L971
The problem is that I don't know how to force the execution of the op without doing
Thanks @WindQAQ @guillaumekln @facaiy for your input. Based on what we've seen so far, it's not a good idea to use XLA inside TF Addons because of the recompilation on every new shape. Users should opt into XLA/tf.function themselves when they need the performance, but the decision should be theirs. For debugging purposes I didn't decorate the function with tf.function (without XLA), since the speed increase is small; users should be free to wrap it in tf.function/XLA when they need speed, and to run it eagerly when they need to debug. I believe that with this new implementation we gain in speed, maintainability, and debuggability on the user side. This function will also be usable by everyone who had issues with C++/CUDA compilation before.
It seems the CUDA stream serves ops in FIFO order, so we can force the last op to execute in order to execute all ops in the stream: https://www.tensorflow.org/guide/eager#performance. This way, the custom op is 1.5–2x faster than XLA on GPU, but 4x slower on CPU. Without XLA, the custom op is much faster than the pure Python op on GPU. The custom op on CPU is surprisingly slow to me; I will check whether I missed something in the implementation. https://colab.research.google.com/drive/12ende9xXMSywP2lOKWrFJBwaDDHbkYXh I would like to say again that I am not against either Python or C++ ops; both have pros and cons. The main issue with C++ custom ops is compilation and maintainability, as you mentioned above, and the one with Python ops is speed. It would be really great to see whether XLA/JIT can strike a balance between the two worlds 😄
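A minimal sketch of this synchronization trick, roughly following the pattern in the linked eager guide: calling `.numpy()` on the last result copies it to the host, which can only happen once that op has finished, and since the CUDA stream is FIFO, every previously queued op has finished too. The function body, shapes, and step count below are placeholders.

```python
import time
import numpy as np
import tensorflow as tf

def hardshrink(x, lower=-0.5, upper=0.5):  # placeholder for the op under test
    return tf.where(tf.logical_or(x < lower, x > upper), x, tf.zeros_like(x))

def benchmark(fn, x, steps=1000):
    fn(x)  # warm-up: initializes the device and any lazy setup outside the timed region
    start = time.time()
    for _ in range(steps):
        y = fn(x)
    _ = y.numpy()  # device-to-host copy: blocks until all queued kernels have run
    return time.time() - start

x = tf.constant(np.random.uniform(-5, 5, size=(1000, 1000)), dtype=tf.float32)
print("pure python op:", benchmark(hardshrink, x))
```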
I'll close this PR right now. It was a great discussion, thanks guys! I'll think about it more and maybe open another pull request once everything is ready :)
With XLA, we can write Python functions that run as fast as (or faster than) custom C++/CUDA code and can target more platforms (ROCm and TPU too). Since it's Python, people who have problems compiling TF Addons will be able to use this function.
XLA is experimental right now, but if we pin the target TensorFlow version in the CI, we shouldn't have any problem using experimental features.
The proposed implementation is as fast as the previous one on GPU and much faster on CPU. See this notebook for the benchmark.
We have three choices:
I'm happy to have any input from the maintainers, as this can set a guideline for future features we accept, or even for how we might change the ops we have already implemented in C++/CUDA.
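For reference, a minimal sketch of what a pure-Python hardshrink might look like; the default thresholds and exact formulation are assumptions, not necessarily what this PR ships.

```python
import tensorflow as tf

def hardshrink(x, lower=-0.5, upper=0.5):
    """Elementwise hardshrink: keep x where it lies outside [lower, upper], else 0."""
    x = tf.convert_to_tensor(x)
    mask = tf.logical_or(x < lower, x > upper)
    return tf.where(mask, x, tf.zeros_like(x))

# Users who want XLA can opt in on their side, e.g. (hypothetical):
# fast_hardshrink = tf.function(hardshrink, experimental_compile=True)
```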