Conversation

@jakirkham (Member)

Fixes #6980

This is an initial pass at adding a CUDA 12.8 migrator based on the discussion so far. Based on testing, this works as intended, though there may be things (like the messaging) that can be improved.
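
For orientation, here is a rough sketch (not the exact diff) of how the fragments quoted in the review comments below fit together in a conda-forge-pinning migration file such as recipe/migrations/cuda128.yaml. The layout and key names like kind and migration_number are assumed from earlier CUDA migrations, not copied from this PR:

  # Sketch only; selectors and most values omitted
  __migrator:
    kind: version
    migration_number: 1
    operation: key_add          # append CUDA 12.8 to the matrix (discussed below)
    commit_message: |
      Upgrade to CUDA 12.8
      [...]
    ordering:
      cuda_compiler_version:
        - None
        - 11.8
        - 12.4
        - 12.6
        - 12.8

  # Keys below the __migrator header carry the new pinning values
  cuda_compiler:
    - cuda-nvcc
  cuda_compiler_version:
    - 12.8
  c_compiler_version:
    - 13
  docker_image:
    - quay.io/condaforge/linux-anvil-x86_64:alma9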


Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@conda-forge-admin (Contributor)

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

@jakirkham (Member Author) left a comment

Have commented on a few sections below to check we have what we need. Based on testing they work OK, but it is worth checking that we have handled the different cases we expect to see throughout conda-forge.

Comment on lines +26 to +33
commit_message: |
Upgrade to CUDA 12.8

With CUDA 12.8, the following new architectures are added: `sm_100`, `sm_101` and `sm_120`.
To build for these architectures, maintainers will need to add these to the list of architectures
that their package builds for.

ref: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#new-features
@jakirkham (Member Author)

@carterbox could you please take a look at this messaging and suggest any improvements that you have in mind?
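
To make that more concrete for maintainers reading the commit message, a purely hypothetical illustration of "adding the new architectures" for a feedstock that keeps its CUDA arch list in conda_build_config.yaml (the variable name cuda_arch_list is invented for this sketch; many feedstocks instead pass the list via CMAKE_CUDA_ARCHITECTURES or TORCH_CUDA_ARCH_LIST in their build scripts):

  # Hypothetical feedstock-level conda_build_config.yaml entry
  cuda_arch_list:
    - "75;80;86;89;90"          # [cuda_compiler_version == "12.6"]
    - "75;80;86;89;90;100;120"  # [cuda_compiler_version == "12.8"]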

Comment on lines +41 to +48
c_compiler_version: # [((linux and (x86_64 or aarch64)) or win64) and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
- 13 # [((linux and (x86_64 or aarch64)) or win64) and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]

cxx_compiler_version: # [((linux and (x86_64 or aarch64)) or win64) and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
- 13 # [((linux and (x86_64 or aarch64)) or win64) and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]

fortran_compiler_version: # [((linux and (x86_64 or aarch64)) or win64) and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
- 13 # [((linux and (x86_64 or aarch64)) or win64) and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
@jakirkham (Member Author)

Currently this is using GCC 13, but CUDA 12.8 can also support GCC 14. Not sure if we want to update now or wait

Have no personal preferences either way. Just wanted to mention it if we are already planning to upgrade to GCC 14 in the near future

@h-vetinari please let me know if you have thoughts on this

Member

Just wanted to mention it if we are already planning to upgrade to GCC 14 in the near future

We've settled (roughly) into a rhythm where we wait for GCC {N+1}.1 and the corresponding GCC N.y patch release before we move to GCC N. For example, GCC 14.1 came out early May 2024, and GCC 13.3 (with the most important fixes that had accumulated over the course of almost a year) came out end of May 2024. By the same rhythm, it's unlikely that we'd move to GCC 14 before June. Also, we still need to fix conda-forge/gfortran_impl_osx-64-feedstock#82.

Comment on lines +50 to +61
docker_image: # [os.environ.get("BUILD_PLATFORM", "").startswith("linux-") and os.environ.get("CF_CUDA_ENABLED", "False") == "True"]
# CUDA 12 builds on CentOS 7
- quay.io/condaforge/linux-anvil-x86_64:cos7 # [os.environ.get("CF_CUDA_ENABLED", "False") == "True" and os.environ.get("BUILD_PLATFORM") == "linux-64" and os.environ.get("DEFAULT_LINUX_VERSION", "alma9") == "cos7"]
- quay.io/condaforge/linux-anvil-aarch64:cos7 # [os.environ.get("CF_CUDA_ENABLED", "False") == "True" and os.environ.get("BUILD_PLATFORM") == "linux-aarch64" and os.environ.get("DEFAULT_LINUX_VERSION", "alma9") == "cos7"]

# CUDA 12 builds on AlmaLinux 8
- quay.io/condaforge/linux-anvil-x86_64:alma8 # [os.environ.get("CF_CUDA_ENABLED", "False") == "True" and os.environ.get("BUILD_PLATFORM") == "linux-64" and os.environ.get("DEFAULT_LINUX_VERSION", "alma9") in ("alma8", "ubi8")]
- quay.io/condaforge/linux-anvil-aarch64:alma8 # [os.environ.get("CF_CUDA_ENABLED", "False") == "True" and os.environ.get("BUILD_PLATFORM") == "linux-aarch64" and os.environ.get("DEFAULT_LINUX_VERSION", "alma9") in ("alma8", "ubi8")]

# CUDA 12 builds on AlmaLinux 9
- quay.io/condaforge/linux-anvil-x86_64:alma9 # [os.environ.get("CF_CUDA_ENABLED", "False") == "True" and os.environ.get("BUILD_PLATFORM") == "linux-64" and os.environ.get("DEFAULT_LINUX_VERSION", "alma9") == "alma9"]
- quay.io/condaforge/linux-anvil-aarch64:alma9 # [os.environ.get("CF_CUDA_ENABLED", "False") == "True" and os.environ.get("BUILD_PLATFORM") == "linux-aarch64" and os.environ.get("DEFAULT_LINUX_VERSION", "alma9") == "alma9"]
@jakirkham (Member Author)

Have borrowed past migrations' docker_image field and adapted it for our recently updated images (#6626).

Would be good to have another pair of eyes on this to make sure we haven't missed anything and to verify we have the intended behavior here

Comment on lines +15 to +25
ordering:
cuda_compiler:
- None
- nvcc
- cuda-nvcc
cuda_compiler_version:
- None
- 11.8
- 12.4
- 12.6
- 12.8
@jakirkham (Member Author)

Past migrators have also contained this field. However, they included a lot more detail to do with the different Docker images we provided for CUDA versions, and other details.

Since we are at a point where CUDA 12.x reuses a lot of the same content from our global matrix and doesn't change it, we found this could be greatly reduced compared to past migrations.

@h-vetinari (Member)

To test whether the ordering stuff works, I removed operation: key_add as suggested by Isuru, and tried

  ordering:
    cuda_compiler:
      - [...]
    cuda_compiler_version:
      - None
      - 12.4
      - 12.6
      - 11.8
      - 12.8

and

  ordering:
    cuda_compiler:
      - [...]
    cuda_compiler_version:
      - 12.4
      - 12.6
      - None
      - 11.8
      - 12.8

on the faiss feedstock (one of not that many feedstocks I'm aware of that still build 11.8). From what I can tell, None is not handled correctly: in both cases, the resulting jobs are for CUDA 11.8, 12.6 & 12.8, and the CPU job gets removed.

Comment on lines +50 to +61 (the docker_image block quoted above)
@jakirkham (Member Author)

Looks like pre-commit doesn't like the line length of this section, based on this CI error:

recipe/migrations/cuda128.yaml
  52:181    error    line too long (222 > 180 characters)  (line-length)
  53:181    error    line too long (227 > 180 characters)  (line-length)
  56:181    error    line too long (233 > 180 characters)  (line-length)
  57:181    error    line too long (238 > 180 characters)  (line-length)
  60:181    error    line too long (223 > 180 characters)  (line-length)
  61:181    error    line too long (228 > 180 characters)  (line-length)

Not sure we can fix that given the selectors needed. Maybe we can disable this check?

Member

Yeah, that check needs to be ignored or disabled. The docker selection is going to stay this complicated until we drop CUDA 11.8.
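
Assuming the check comes from yamllint running under pre-commit (the rule name line-length and the output format match yamllint), a sketch of two common ways to relax it; file names and the limit here are illustrative:

  # Option 1: raise or disable the limit in the repo's yamllint config
  # (e.g. a .yamllint.yaml that the pre-commit hook picks up)
  rules:
    line-length:
      max: 250          # or: `line-length: disable`

  # Option 2: exempt only the offending block with inline directives
  # inside recipe/migrations/cuda128.yaml
  # yamllint disable rule:line-length
  docker_image:                                    # [long selector ...]
    - quay.io/condaforge/linux-anvil-x86_64:cos7   # [long selector ...]
  # yamllint enable rule:line-length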

@h-vetinari (Member) left a comment

This is an initial pass at adding a CUDA 12.8 migrator based on the discussion so far.

I don't feel like my concern has been addressed, so I reiterated it in the issue.

@h-vetinari (Member)

FWIW, someone is testing this migrator on the tensorflow feedstock, and ran into the following issues with GCC 13 on aarch:

$ nvcc cu_neon_test.cu 
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/share/fsx/arm/gcc-13.2.0_RHEL-9/lib/gcc/aarch64-linux-gnu/13.2.0/include/arm_neon.h(5790): error: identifier "float16x4x2_t" is undefined
  vst2_lane_f16 (float16_t *__ptr, float16x4x2_t __val, const int __lane)
                                   ^

/share/fsx/arm/gcc-13.2.0_RHEL-9/lib/gcc/aarch64-linux-gnu/13.2.0/include/arm_neon.h(5792): error: identifier "__builtin_aarch64_st2_lanev4hf" is undefined
    __builtin_aarch64_st2_lanev4hf ((__builtin_aarch64_simd_hf *) __ptr, __val,
    ^

Downgrading to GCC 11 works though

@isuruf (Member) left a comment

This migrator will build for both 12.6 and 12.8. It should instead be adjusted to add 12.8 and remove 12.6.

@jakirkham (Member Author)

We agree with that, Isuru. This was also discussed in issue #6980.

However, it is not clear that this is possible with the existing migrator behavior. If you see a way to accomplish this, could you please make specific suggestions in the file?

@isuruf (Member) commented Apr 8, 2025

Instead of operation: key_add, you can use an override just like we do for every other migration.
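
For context, an override-style migration restates the zipped keys with the desired values, and those values replace (rather than extend) the corresponding entries from the global pinning on migrated feedstocks. A hypothetical sketch of that shape, with selectors and companion keys such as docker_image omitted; whether the infrastructure actually yields the desired matrix from this is exactly what the following comments debate:

  # Hypothetical override-style migration, no `operation: key_add`
  cuda_compiler:
    - None
    - nvcc
    - cuda-nvcc
  cuda_compiler_version:
    - None
    - 11.8
    - 12.8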

@h-vetinari (Member) commented May 9, 2025

Instead of operation: key_add, you can use an override just like we do for every other migration.

I think that would cause us to move from (11.8, 12.6) to (12.6, 12.8), rather than the desired (11.8, 12.8). Removing the version in the middle of the range is the thing that I don't think will work (with current infra).

Of course, we could reconsider dropping 11.8 (which would make this whole operation trivial again), but that was deferred without a timeline only in January; OTOH the value of 11.8 builds has been depreciating rapidly IMO, so I'd be open to doing it.

@h-vetinari (Member)

Removing the version in the middle of the range is the thing that I don't think will work (with current infra).

It might work with a custom ordering:... 🤔

@h-vetinari (Member) left a comment

I'd like to unblock this; people are starting to ask for sm_120 support, not least because our current pytorch builds end up showing:

UserWarning:
NVIDIA GeForce RTX 5090 D with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_61 sm_70 sm_75 sm_80 sm_86 sm_89 sm_90 compute_90.
If you want to use the NVIDIA GeForce RTX 5090 D GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

Recapping my argument from #6980, we have four options IMO:

  1. Just pull the trigger without a migration (as we did for 12.0 -> 12.6, which wasn't a big deal)
  2. Run this migration as-is, but time-box it to 2-4 weeks (to limit the time where we build both 12.6 & 12.8)
  3. Work on the migrator infrastructure so that cuda_compiler_version in (None, 11.8, 12.6) can be migrated to cuda_compiler_version in (None, 11.8, 12.8). [1]
  4. Drop CUDA 11.8, then this migrator (after removing the key_add) would trivially work

I prefer 1. or 4., and could live with 2. Option 3. might be best, but implies a potentially very lengthy wait (case in point: zero progress since February) while someone tries to dig in and do that work, so I don't want to rely on that.

PS. We could probably already use CUDA 12.9 by now.

CC @conda-forge/core

Footnotes

  1. I tested it (see the ordering test comment earlier in the thread); it doesn't work currently.

@jakirkham (Member Author)

Thanks for following up Axel! 🙏

Would still prefer a migrator given the newer architectures. Would rather users know whether a package has been built with the newer architectures than have this be opaque.

At this stage would recommend we go straight to CUDA 12.9. There are new architectures there too. So it is better to do a migrator only once.

Before dropping CUDA 11.8, we should try to assess somehow how much usage it still sees from maintainers and users. If we confirm this to be low, then document these findings and add a news item.

So that puts me at preferring 2 or 3. Can be convinced to do 4 based on evidence gathered from the prior point.
