Add performance tips tutorial #1065
Conversation
NicolasHug left a comment:
Made a first pass, thanks @mollyxu, it looks great!
    # - If you care about exactness of frame seeking, use "exact".
    # - If you can sacrifice exactness of seeking for speed, which is usually the case when doing clip sampling, use "approximate".
    # - If your videos don't have variable framerate and their metadata is correct, then "approximate" mode is a net win: it will be just as accurate as the "exact" mode while still being significantly faster.
    # - If your size is small enough and we're decoding a lot of frames, there's a chance exact mode is actually faster.
Above:
This is a good description. I think we can be more nuanced about when to recommend approximate, e.g. we should try to clearly articulate the last 3 bullet points which are currently slightly overlapping and contradictory (we now know that approximate won't always be "a net win").
That's on me: I need to first have a clear understanding of why approximate mode is sometimes slower, and I'll need to update the approximate mode tutorial with more detailed recommendations.
I won't be able to do that in the next few days, so to unblock yourself I think you can just remove the claims about approximate being strictly superior (bullet points 2 and 3), and the more generic recommendation could be something like:
If the video is long and you're only decoding a small amount of frames, approximate mode should be faster.
It's not super actionable for users but I hope the dedicated tutorial I'll edit will be more precise.
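To make the tradeoff above concrete, here is a minimal timing-helper sketch. The helper itself is plain Python; the commented usage assumes torchcodec is installed and a local video file exists, and the `seek_mode` values match the tutorial text:

```python
import time

def time_init(ctor, *args, **kwargs):
    """Time one decoder construction; returns (object, elapsed seconds)."""
    start = time.perf_counter()
    obj = ctor(*args, **kwargs)
    return obj, time.perf_counter() - start

# Hypothetical usage (requires torchcodec and a local video file):
#
#   from torchcodec.decoders import VideoDecoder
#   _, t_exact = time_init(VideoDecoder, "video.mp4", seek_mode="exact")
#   _, t_approx = time_init(VideoDecoder, "video.mp4", seek_mode="approximate")
#   print(f"exact: {t_exact:.3f}s  approximate: {t_approx:.3f}s")
```

Comparing the two construction times on your own videos is the most reliable way to pick a mode, since the winner depends on video length and how many frames you decode.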
Thanks for the feedback! Let's update.
    """
    ====================================
    Performance Tips and Best Practices
Probably a more descriptive title would help with discoverability: TorchCodec Performance Tips and Best Practices.
Also, adding a meta directive at the top should help as well. Something like:
    .. meta::
       :description: Learn how to optimize TorchCodec video decoding performance with batch APIs, approximate seeking, multi-threading, and CUDA acceleration.
    """
    ====================================
    Performance Tips and Best Practices
    ====================================
It's not required, but there is a template you could look into as well: https://github.com/pytorch/tutorials/blob/main/beginner_source/template_tutorial.py
Thanks for the feedback!
NicolasHug left a comment:
Thank you Molly, this is great! I made a bunch of comments but they are easy to address, so I'll approve now to unblock.
    # In some cases, CUDA decoding may fall back to CPU decoding. This can happen
    # when the video codec or format is not supported by the NVDEC hardware decoder.
Suggested change:

    # In some cases, CUDA decoding may fall back to CPU decoding. This can happen
    # when the video codec or format is not supported by the NVDEC hardware decoder, or when NVCUVID wasn't found.
    decoder = VideoDecoder(video_file, device="cuda")

    # Check and print the CPU fallback status
    print(decoder.cpu_fallback)
On the new section above: this is great. Let's move it just above the "Visualizing Frames" section.
    # **Performance impact:** Enables consistent, predictable performance for repeated
    # random access without the overhead of exact mode's scanning.
I'm not sure "random access" is really relevant here. For perf, this is less about frame access than it is about decoder initialization
Suggested change:

    # **Performance impact:** speeds up decoder instantiation, similarly to ``seek_mode="approximate"``.
    #
    # When decoding multiple videos or decoding a large number of frames from a single video, there are a few parallelization strategies to speed up the decoding process:
    #
    # - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames
Suggested change:

    # - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames. For that, use the ``num_ffmpeg_threads`` parameter of the :class:`~torchcodec.decoders.VideoDecoder`.
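For the across-video strategy (as opposed to FFmpeg's intra-frame threading), one option is a thread pool over videos. This is a minimal sketch; `decode_clip` is a placeholder for real per-video work, and the torchcodec calls in the comment (`VideoDecoder`, `num_ffmpeg_threads`, `get_frames_in_range`) assume the API discussed in this review:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_clip(path):
    # Placeholder for per-video decoding work. With TorchCodec this might be:
    #   decoder = VideoDecoder(path, seek_mode="approximate", num_ffmpeg_threads=1)
    #   return decoder.get_frames_in_range(0, 16)
    # FFmpeg releases the GIL while decoding, so threads can overlap real work.
    return f"decoded:{path}"

paths = ["a.mp4", "b.mp4", "c.mp4"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(decode_clip, paths))
print(results)
```

When combining the two strategies, setting `num_ffmpeg_threads=1` per decoder avoids oversubscribing cores with the outer pool.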
    # iteration, and frame retrieval, see:
    #
    # - :ref:`sphx_glr_generated_examples_decoding_basic_example.py`
We can make the rendering slightly more compact by avoiding bullet points (here and everywhere else in these "notes"). I think it flows a bit better:
Suggested change:

    # iteration, and frame retrieval, see :ref:`sphx_glr_generated_examples_decoding_basic_example.py`
    # **Performance impact:** CUDA decoding can significantly outperform CPU decoding,
    # especially for high-resolution videos and when combined with GPU-based transforms.
The "transforms" stuff is slightly misleading and might become even more misleading soon when we actually release native transforms - but they'll be CPU-only for a bit.
Suggested change:

    # **Performance impact:** CUDA decoding can significantly outperform CPU decoding,
    # especially for high-resolution videos and when decoding a lot of frames.
    # %%
    # **Recommended Usage for Beta Interface**
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #
    # .. code-block:: python
    #
    #     with set_cuda_backend("beta"):
    #         decoder = VideoDecoder("file.mp4", device="cuda")
Let's move this section above, it should be the first one we see (before the "when to use" section"). Let's also make it slightly more obvious:
Suggested change:

    # %%
    # **Recommended: use the Beta Interface!!**
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #
    # We recommend you use the new "beta" CUDA interface which is significantly faster than the previous one, and supports the same features:
    #
    # .. code-block:: python
    #
    #     with set_cuda_backend("beta"):
    #         decoder = VideoDecoder("file.mp4", device="cuda")
    # decoder = VideoDecoder("file.mp4", device="cuda")
    # decoder[0]  # Decode at least one frame first (for FFmpeg backend)
    #
    # # Print detailed fallback status
    # print(decoder.cpu_fallback)
Let's use the beta interface here - we really want users to use that as the default now. Since we're using beta, we don't need to decode a frame first. That will be documented just below in your great note.
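As a minimal sketch of the fallback check, the runnable part below only tests whether torchcodec can be imported; the commented usage assumes the beta CUDA interface from this review (`set_cuda_backend` and the `cpu_fallback` attribute follow the discussion here and may differ across torchcodec versions):

```python
def torchcodec_installed():
    """Return True if torchcodec can be imported (no GPU or NVDEC check)."""
    try:
        import torchcodec  # noqa: F401
        return True
    except ImportError:
        return False

# Hypothetical usage with the beta CUDA interface (requires torchcodec,
# a CUDA-capable GPU, and a local video file):
#
#   from torchcodec.decoders import VideoDecoder
#   with set_cuda_backend("beta"):
#       decoder = VideoDecoder("file.mp4", device="cuda")
#   print(decoder.cpu_fallback)  # reports whether decoding fell back to CPU
```

With the beta backend there is no need to decode a frame before inspecting the fallback status, which is the point of the comment above.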
    # TorchCodec offers multiple performance optimization strategies, each suited to
    # different scenarios. Use batch APIs for multi-frame decoding, approximate mode
    # for faster initialization, parallel processing for high throughput, and CUDA
    # acceleration for GPU-intensive workflows.
Suggested change:

    # acceleration to offload the CPU.
    # linked examples as a guide.
    #
    # For more information, see:
    #
Related to my other comment above, the list below should definitely be kept as a bullet list!
Consolidate the following performance tips in docs
Also updated docs on cuda example to reflect cpu fallback (#943)