
Conversation


@mollyxu commented Nov 20, 2025

Consolidate the following performance tips in the docs:

  1. Batch APIs - Decode multiple frames at once
  2. Approximate Mode & Keyframe Mappings - Trade accuracy for speed
  3. Multi-threading - Parallelize decoding across videos or chunks
  4. CUDA Acceleration - Use GPU decoding for supported formats

Also update the CUDA example docs to reflect CPU fallback (#943).
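As a rough, stdlib-only illustration of tip 1 (why one batch call beats a per-frame loop), here is a sketch with a hypothetical stand-in decoder. The method names mirror torchcodec's `get_frame_at` / `get_frames_at`, but `FakeDecoder` and its cost model are invented for illustration, not torchcodec behavior:

```python
# Stdlib-only sketch of the batch-vs-loop idea. FakeDecoder is a
# hypothetical stand-in: each call pays a fixed per-call "overhead",
# so one batch call over N indices is cheaper than N single calls.
class FakeDecoder:
    PER_CALL_OVERHEAD = 1  # arbitrary cost units

    def __init__(self):
        self.cost = 0

    def get_frame_at(self, index):
        self.cost += self.PER_CALL_OVERHEAD + 1  # overhead + one frame
        return f"frame{index}"

    def get_frames_at(self, indices):
        self.cost += self.PER_CALL_OVERHEAD + len(indices)
        return [f"frame{i}" for i in indices]

loop_dec, batch_dec = FakeDecoder(), FakeDecoder()
frames_loop = [loop_dec.get_frame_at(i) for i in range(10)]
frames_batch = batch_dec.get_frames_at(list(range(10)))
# Same frames either way, but the batch call pays the overhead once.
```

With the toy cost model above, the loop pays ten overheads while the batch call pays one; the real savings in torchcodec come from avoiding repeated seeks and per-call bookkeeping.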

@meta-cla bot added the CLA Signed label Nov 20, 2025
@meta-pytorch deleted a comment from the meta-codesync bot Nov 20, 2025

@NicolasHug left a comment


Made a first pass, thanks @mollyxu , it looks great!

# - If you care about exactness of frame seeking, use “exact”.
# - If you can sacrifice exactness of seeking for speed, which is usually the case when doing clip sampling, use “approximate”.
# - If your videos don’t have variable framerate and their metadata is correct, then “approximate” mode is a net win: it will be just as accurate as the “exact” mode while still being significantly faster.
# - If your video is small enough and you're decoding a lot of frames, there's a chance exact mode is actually faster.

Above:

This is a good description. I think we can be more nuanced about when to recommend approximate, e.g. we should try to clearly articulate the last 3 bullet points which are currently slightly overlapping and contradictory (we now know that approximate won't always be "a net win").

That's on me: I need to first have a clear understanding of why approximate mode is sometimes slower, and I'll need to update the approximate mode tutorial with more detailed recommendations.

I won't be able to do that in the next few days, so to unblock yourself I think you can just remove the claims about approximate being strictly superior (bullet points 2 and 3), and the more generic reco could be something like

If the video is long and you're only decoding a small amount of frames, approximate mode should be faster.

It's not super actionable for users but I hope the dedicated tutorial I'll edit will be more precise.
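The constant-framerate assumption behind "approximate" mode can be sketched conceptually. This is an illustration of the idea, not torchcodec internals; the `pts` values and the duration estimate are made up:

```python
# Conceptual sketch only: "exact" mode builds a real frame-timestamp
# index by scanning the file; "approximate" mode guesses the index from
# the average framerate in the metadata. The two agree when the
# framerate is constant, and can diverge when it is not.
pts = [0.0, 0.04, 0.08, 0.15, 0.19]  # variable framerate after frame 2

# Average fps derived from metadata (duration assumed here to be the
# last pts plus one nominal 0.04s frame interval).
avg_fps = len(pts) / (pts[-1] + 0.04)

def exact_index(t):
    # Last frame whose timestamp is <= t, from the scanned pts list.
    return max(i for i, p in enumerate(pts) if p <= t)

def approximate_index(t):
    # Guess the index from the average framerate alone.
    return int(t * avg_fps)

print(exact_index(0.05), approximate_index(0.05))  # agree in the constant-fps region
print(exact_index(0.14), approximate_index(0.14))  # diverge after the framerate changes
```

This is why the tutorial hedges: with correct metadata and constant framerate, the cheap guess matches the scanned index, and approximate mode only trades away accuracy when the framerate varies.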


mollyxu commented Nov 21, 2025

Thanks for the feedback!

@Dan-Flores

Let's update docs/source/index.rst so this tutorial appears on the main index.html page (similar to these changes)


"""
====================================
Performance Tips and Best Practices

Probably a more descriptive title would help with discoverability: TorchCodec Performance Tips and Best Practices.
Also adding a meta directive at the top should help as well. Something like:

.. meta::
   :description:  Learn how to optimize TorchCodec video decoding performance with batch APIs, approximate seeking, multi-threading, and CUDA acceleration.

"""
====================================
Performance Tips and Best Practices
====================================

It's not required but there is a template, you could look into as well: https://github.com/pytorch/tutorials/blob/main/beginner_source/template_tutorial.py

@mollyxu (Author)

Thanks for the feedback!

@mollyxu marked this pull request as ready for review December 8, 2025 19:31

@NicolasHug left a comment


Thank you Molly, this is great! I made a bunch of comments but they are easy to address, so I'll approve now to unblock.

Comment on lines 148 to 149
# In some cases, CUDA decoding may fall back to CPU decoding. This can happen
# when the video codec or format is not supported by the NVDEC hardware decoder.

Suggested change
# In some cases, CUDA decoding may fall back to CPU decoding. This can happen
# when the video codec or format is not supported by the NVDEC hardware decoder.
# In some cases, CUDA decoding may fall back to CPU decoding. This can happen
# when the video codec or format is not supported by the NVDEC hardware decoder, or when NVCUVID wasn't found.

decoder = VideoDecoder(video_file, device="cuda")

# Check and print the CPU fallback status
print(decoder.cpu_fallback)

On the new section above: this is great. Let's move it just above the "Visualizing Frames" section.

Comment on lines 104 to 105
# **Performance impact:** Enables consistent, predictable performance for repeated
# random access without the overhead of exact mode's scanning.

I'm not sure "random access" is really relevant here. For perf, this is less about frame access than it is about decoder initialization

Suggested change
# **Performance impact:** Enables consistent, predictable performance for repeated
# random access without the overhead of exact mode's scanning.
# **Performance impact:** speeds up decoder instantiation, similarly to ``seek_mode="approximate"``.

#
# When decoding multiple videos or decoding a large number of frames from a single video, there are a few parallelization strategies to speed up the decoding process:
#
# - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames

Suggested change
# - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames
# - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames. For that, use the `num_ffmpeg_threads` parameter of the :class:`~torchcodec.decoders.VideoDecoder`
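Across-video parallelism, the complement to FFmpeg's intra-frame threading mentioned above, can be sketched with a standard thread pool. `decode_clip` here is a hypothetical stand-in for real per-video decoding work:

```python
# Sketch of across-video parallelism with a thread pool. The worker is
# a stub; a real worker would build a per-video decoder, e.g.
# something like VideoDecoder(path).get_frames_at([...]).
from concurrent.futures import ThreadPoolExecutor

def decode_clip(path):
    # Placeholder for per-video decoding work.
    return f"decoded:{path}"

paths = ["a.mp4", "b.mp4", "c.mp4"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order: one result per video.
    results = list(pool.map(decode_clip, paths))
```

One decoder instance per worker is the safe pattern: decoder objects are generally not meant to be shared across threads, while independent decoders can run concurrently.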

Comment on lines 63 to 65
# iteration, and frame retrieval, see:
#
# - :ref:`sphx_glr_generated_examples_decoding_basic_example.py`

We can make the rendering slightly more compact by avoiding bullet points (here and everywhere else in these "notes"). I think it flows a bit better:

Suggested change
# iteration, and frame retrieval, see:
#
# - :ref:`sphx_glr_generated_examples_decoding_basic_example.py`
# iteration, and frame retrieval, see :ref:`sphx_glr_generated_examples_decoding_basic_example.py`

Comment on lines 158 to 159
# **Performance impact:** CUDA decoding can significantly outperform CPU decoding,
# especially for high-resolution videos and when combined with GPU-based transforms.

The "transforms" stuff is slightly misleading and might become even more misleading soon when we actually release native transforms - but they'll be CPU-only for a bit.

Suggested change
# **Performance impact:** CUDA decoding can significantly outperform CPU decoding,
# especially for high-resolution videos and when combined with GPU-based transforms.
# **Performance impact:** CUDA decoding can significantly outperform CPU decoding,
# especially for high-resolution videos and when decoding a lot of frames.

Comment on lines 162 to 169
# %%
# **Recommended Usage for Beta Interface**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. code-block:: python
#
#     with set_cuda_backend("beta"):
#         decoder = VideoDecoder("file.mp4", device="cuda")

Let's move this section above, it should be the first one we see (before the "when to use" section). Let's also make it slightly more obvious:

Suggested change
# %%
# **Recommended Usage for Beta Interface**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. code-block:: python
#
#     with set_cuda_backend("beta"):
#         decoder = VideoDecoder("file.mp4", device="cuda")
# %%
# **Recommended: use the Beta Interface!!**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# We recommend you use the new "beta" CUDA interface which is significantly faster than the previous one, and supports the same features:
#
# .. code-block:: python
#
#     with set_cuda_backend("beta"):
#         decoder = VideoDecoder("file.mp4", device="cuda")

Comment on lines 182 to 186
# decoder = VideoDecoder("file.mp4", device="cuda")
# decoder[0] # Decode at least one frame first (for FFmpeg backend)
#
# # Print detailed fallback status
# print(decoder.cpu_fallback)

Let's use the beta interface here - we really want users to use that as the default now. Since we're using beta, we don't need to decode a frame first. That will be documented just below in your great note.

# TorchCodec offers multiple performance optimization strategies, each suited to
# different scenarios. Use batch APIs for multi-frame decoding, approximate mode
# for faster initialization, parallel processing for high throughput, and CUDA
# acceleration for GPU-intensive workflows.

Suggested change
# acceleration for GPU-intensive workflows.
# acceleration to offload the CPU.

# linked examples as a guide.
#
# For more information, see:
#

Related to my other comment above, the list below should definitely be kept as a bullet list!

@mollyxu merged commit f6a8161 into meta-pytorch:main Dec 10, 2025
73 checks passed