Add performance tips tutorial #1065
Conversation
NicolasHug left a comment:
Made a first pass, thanks @mollyxu, it looks great!
    # - If you care about exactness of frame seeking, use "exact".
    # - If you can sacrifice exactness of seeking for speed, which is usually the case when doing clip sampling, use "approximate".
    # - If your videos don't have variable framerate and their metadata is correct, then "approximate" mode is a net win: it will be just as accurate as the "exact" mode while still being significantly faster.
    # - If your size is small enough and we're decoding a lot of frames, there's a chance exact mode is actually faster.
Above:
This is a good description. I think we can be more nuanced about when to recommend approximate, e.g. we should try to clearly articulate the last 3 bullet points which are currently slightly overlapping and contradictory (we now know that approximate won't always be "a net win").
That's on me: I need to first have a clear understanding of why approximate mode is sometimes slower, and I'll need to update the approximate mode tutorial with more detailed recommendations.
I won't be able to do that in the next few days, so to unblock yourself I think you can just remove the claims about approximate being strictly superior (bullet points 2 and 3), and the more generic recommendation could be something like:
If the video is long and you're only decoding a small amount of frames, approximate mode should be faster.
It's not super actionable for users but I hope the dedicated tutorial I'll edit will be more precise.
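To make the tradeoff above concrete, here is a minimal timing-helper sketch. The helper itself is plain Python; the commented usage assumes torchcodec is installed and a local video file exists, and the `seek_mode` values match the tutorial text:

```python
import time

def time_init(ctor, *args, **kwargs):
    """Time one decoder construction; returns (object, elapsed seconds)."""
    start = time.perf_counter()
    obj = ctor(*args, **kwargs)
    return obj, time.perf_counter() - start

# Hypothetical usage (requires torchcodec and a local video file):
#
#   from torchcodec.decoders import VideoDecoder
#   _, t_exact = time_init(VideoDecoder, "video.mp4", seek_mode="exact")
#   _, t_approx = time_init(VideoDecoder, "video.mp4", seek_mode="approximate")
#   print(f"exact: {t_exact:.3f}s  approximate: {t_approx:.3f}s")
```

Comparing the two construction times on your own videos is the most reliable way to pick a mode, since the winner depends on video length and how many frames you decode.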
Thanks for the feedback! Let's update.
    """
    ====================================
    Performance Tips and Best Practices
Probably a more descriptive title would help with discoverability: TorchCodec Performance Tips and Best Practices.
Also, adding a meta directive at the top should help as well. Something like:
    .. meta::
       :description: Learn how to optimize TorchCodec video decoding performance with batch APIs, approximate seeking, multi-threading, and CUDA acceleration.
    """
    ====================================
    Performance Tips and Best Practices
    ====================================
It's not required, but there is a template you could look into as well: https://github.com/pytorch/tutorials/blob/main/beginner_source/template_tutorial.py
Thanks for the feedback!
NicolasHug left a comment:
Thank you Molly, this is great! I made a bunch of comments but they are easy to address, so I'll approve now to unblock.
    # In some cases, CUDA decoding may fall back to CPU decoding. This can happen
    # when the video codec or format is not supported by the NVDEC hardware decoder.
Suggested change:

    # In some cases, CUDA decoding may fall back to CPU decoding. This can happen
    # when the video codec or format is not supported by the NVDEC hardware decoder, or when NVCUVID wasn't found.
    decoder = VideoDecoder(video_file, device="cuda")

    # Check and print the CPU fallback status
    print(decoder.cpu_fallback)
On the new section above: this is great. Let's move it just above the "Visualizing Frames" section.
    # **Performance impact:** Enables consistent, predictable performance for repeated
    # random access without the overhead of exact mode's scanning.
I'm not sure "random access" is really relevant here. For perf, this is less about frame access than it is about decoder initialization
Suggested change:

    # **Performance impact:** speeds up decoder instantiation, similarly to ``seek_mode="approximate"``.
    #
    # When decoding multiple videos or decoding a large number of frames from a single video, there are a few parallelization strategies to speed up the decoding process:
    #
    # - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames
Suggested change:

    # - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames. For that, use the ``num_ffmpeg_threads`` parameter of the :class:`~torchcodec.decoders.VideoDecoder`.
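For the across-video strategy (as opposed to FFmpeg's intra-frame threading), one option is a thread pool over videos. This is a minimal sketch; `decode_clip` is a placeholder for real per-video work, and the torchcodec calls in the comment (`VideoDecoder`, `num_ffmpeg_threads`, `get_frames_in_range`) assume the API discussed in this review:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_clip(path):
    # Placeholder for per-video decoding work. With TorchCodec this might be:
    #   decoder = VideoDecoder(path, seek_mode="approximate", num_ffmpeg_threads=1)
    #   return decoder.get_frames_in_range(0, 16)
    # FFmpeg releases the GIL while decoding, so threads can overlap real work.
    return f"decoded:{path}"

paths = ["a.mp4", "b.mp4", "c.mp4"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(decode_clip, paths))
print(results)
```

When combining the two strategies, setting `num_ffmpeg_threads=1` per decoder avoids oversubscribing cores with the outer pool.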
    # iteration, and frame retrieval, see:
    #
    # - :ref:`sphx_glr_generated_examples_decoding_basic_example.py`
We can make the rendering slightly more compact by avoiding bullet points (here and everywhere else in these "notes"). I think it flows a bit better:
Suggested change:

    # iteration, and frame retrieval, see :ref:`sphx_glr_generated_examples_decoding_basic_example.py`
    # **Performance impact:** CUDA decoding can significantly outperform CPU decoding,
    # especially for high-resolution videos and when combined with GPU-based transforms.
The "transforms" stuff is slightly misleading and might become even more misleading soon when we actually release native transforms - but they'll be CPU-only for a bit.
Suggested change:

    # **Performance impact:** CUDA decoding can significantly outperform CPU decoding,
    # especially for high-resolution videos and when decoding a lot of frames.
    # %%
    # **Recommended Usage for Beta Interface**
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #
    # .. code-block:: python
    #
    #     with set_cuda_backend("beta"):
    #         decoder = VideoDecoder("file.mp4", device="cuda")
Let's move this section above, it should be the first one we see (before the "when to use" section"). Let's also make it slightly more obvious:
Suggested change:

    # %%
    # **Recommended: use the Beta Interface!!**
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #
    # We recommend you use the new "beta" CUDA interface which is significantly faster than the previous one, and supports the same features:
    #
    # .. code-block:: python
    #
    #     with set_cuda_backend("beta"):
    #         decoder = VideoDecoder("file.mp4", device="cuda")
    # decoder = VideoDecoder("file.mp4", device="cuda")
    # decoder[0]  # Decode at least one frame first (for FFmpeg backend)
    #
    # # Print detailed fallback status
    # print(decoder.cpu_fallback)
Let's use the beta interface here - we really want users to use that as the default now. Since we're using beta, we don't need to decode a frame first. That will be documented just below in your great note.
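As a minimal sketch of the fallback check, the runnable part below only tests whether torchcodec can be imported; the commented usage assumes the beta CUDA interface from this review (`set_cuda_backend` and the `cpu_fallback` attribute follow the discussion here and may differ across torchcodec versions):

```python
def torchcodec_installed():
    """Return True if torchcodec can be imported (no GPU or NVDEC check)."""
    try:
        import torchcodec  # noqa: F401
        return True
    except ImportError:
        return False

# Hypothetical usage with the beta CUDA interface (requires torchcodec,
# a CUDA-capable GPU, and a local video file):
#
#   from torchcodec.decoders import VideoDecoder
#   with set_cuda_backend("beta"):
#       decoder = VideoDecoder("file.mp4", device="cuda")
#   print(decoder.cpu_fallback)  # reports whether decoding fell back to CPU
```

With the beta backend there is no need to decode a frame before inspecting the fallback status, which is the point of the comment above.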
    # TorchCodec offers multiple performance optimization strategies, each suited to
    # different scenarios. Use batch APIs for multi-frame decoding, approximate mode
    # for faster initialization, parallel processing for high throughput, and CUDA
    # acceleration for GPU-intensive workflows.
Suggested change:

    # acceleration to offload the CPU.
    # linked examples as a guide.
    #
    # For more information, see:
    #
Related to my other comment above, the list below should definitely be kept as a bullet list!
Consolidate the following performance tips in docs
Also updated docs on cuda example to reflect cpu fallback (#943)