Break Kernels based on shell-type and using CUDA Streams #8

@Ali-Tehrani

Description

To greatly increase performance with minimal code changes, it may be worthwhile to split the kernels (e.g., those computing atomic orbitals, their first derivatives, second derivatives, etc.) by shell type. Currently, every kernel runs into the maximum of 255 registers per thread, which reduces the number of threads that can be active at once. Based on profiling, I have observed that at most 8 warps run concurrently. Splitting the kernels can also reduce branch divergence and give the compiler more room to optimize.
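The 8-warp figure is consistent with a simple register budget. The sketch below is only a back-of-the-envelope check, assuming 65,536 32-bit registers per SM and per-thread register counts rounded up to a multiple of 8; both numbers are assumptions about the hardware, not values from this issue, and `max_resident_warps` is a hypothetical helper.

```cpp
#include <cassert>

// Assumed hardware figures (not taken from the issue; check your GPU):
constexpr int kRegistersPerSM = 65536;  // 32-bit registers per SM
constexpr int kThreadsPerWarp = 32;
constexpr int kRegGranularity = 8;      // per-thread allocation rounds up to this

// Upper bound on resident warps per SM imposed by register usage alone.
// Other limits (shared memory, the hardware cap on warps) are ignored here.
constexpr int max_resident_warps(int regs_per_thread) {
    assert(regs_per_thread > 0);
    const int rounded =
        (regs_per_thread + kRegGranularity - 1) / kRegGranularity * kRegGranularity;
    return kRegistersPerSM / (rounded * kThreadsPerWarp);
}

// 255 registers/thread rounds to 256, i.e. 8192 registers per warp.
static_assert(max_resident_warps(255) == 8, "register-limited to 8 warps");
// A lighter kernel at 64 registers/thread could keep 32 warps resident.
static_assert(max_resident_warps(64) == 32, "register-limited to 32 warps");
```

Under these assumptions, a kernel that needs only a quarter of the registers could keep four times as many warps resident, which is the headroom an S-type-only kernel might recover.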

If the kernels are split into compute_atomic_orbitals<S>, compute_atomic_orbitals<P>, etc., the S-type kernel can use fewer registers, so more threads can run at a time. Further, with CUDA streams the shell-type kernels can run concurrently. This approach could also eliminate the if-statements on shell type by using template specialization.

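#define STYPE 0
#define PTYPE 1
...
template<int ShellType>
void compute_atomic_orbitals(...) {
    // if constexpr (C++17) is resolved at compile time, so only one branch is compiled in
    if constexpr (ShellType == STYPE) {
        // compute s-type orbitals
    } else if constexpr (ShellType == PTYPE) {
        // compute p-type orbitals
    }
}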
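As a concrete illustration of the compile-time branching, here is a minimal host-only C++17 sketch. The function `eval_shell`, its signature, and the unnormalized primitive Gaussian it evaluates are all hypothetical stand-ins, not the project's actual kernels; in CUDA the same body would presumably be a `__device__` function called from a kernel templated on the shell type.

```cpp
#include <cassert>
#include <cmath>

constexpr int STYPE = 0;
constexpr int PTYPE = 1;

// Hypothetical stand-in: evaluate one unnormalized primitive Gaussian shell
// with exponent alpha at the point (x, y, z) and write its components to out.
// S-type writes 1 value, P-type writes 3 (px, py, pz).
template <int ShellType>
void eval_shell(double x, double y, double z, double alpha, double* out) {
    const double g = std::exp(-alpha * (x * x + y * y + z * z));
    if constexpr (ShellType == STYPE) {
        out[0] = g;  // only this branch exists in eval_shell<STYPE>
    } else if constexpr (ShellType == PTYPE) {
        out[0] = x * g;
        out[1] = y * g;
        out[2] = z * g;
    } else {
        // Reached only if an unsupported shell type is instantiated.
        static_assert(ShellType == STYPE || ShellType == PTYPE,
                      "unsupported shell type");
    }
}
```

Calling `eval_shell<STYPE>(...)` and `eval_shell<PTYPE>(...)` instantiates two separate functions, each containing only its own branch, so the S-type instantiation does not carry the register cost of the higher shell types.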

To implement this for evaluate_scalar_quantity, you'll need to either change it to take an array of function pointers, with one entry per shell type (the S-type function, the P-type function, etc.), or use templates, which is simpler to implement but harder to understand. CUDA streams would be added here as well.
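The two options can be compared in a small host-only C++17 sketch. Every name here (`shell_contribution`, `evaluate_with_table`, `evaluate_with_templates`) is hypothetical and the "scalar quantity" is a toy sum; it only shows the shape of each dispatch style, not how evaluate_scalar_quantity is actually written. CUDA streams are left out because this sketch runs on the host; each per-shell-type call marks where a kernel launch on its own stream could go.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr int STYPE = 0;
constexpr int PTYPE = 1;

// Hypothetical per-shell-type worker returning a toy scalar contribution.
template <int ShellType>
double shell_contribution(double x, double y, double z, double alpha) {
    const double g = std::exp(-alpha * (x * x + y * y + z * z));
    if constexpr (ShellType == STYPE) {
        return g;
    } else {
        static_assert(ShellType == PTYPE, "unsupported shell type");
        return (x + y + z) * g;
    }
}

// Option 1: an array of function pointers, one entry per shell type.
using ShellFn = double (*)(double, double, double, double);

double evaluate_with_table(const ShellFn* fns, std::size_t n,
                           double x, double y, double z, double alpha) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += fns[i](x, y, z, alpha);  // indirect call, resolved at run time
    }
    return total;
}

// Option 2: a template parameter pack, expanded at compile time.
template <int... ShellTypes>
double evaluate_with_templates(double x, double y, double z, double alpha) {
    // Fold expression: one direct call per shell type in the pack.
    return (shell_contribution<ShellTypes>(x, y, z, alpha) + ... + 0.0);
}
```

A possible usage: build `const ShellFn table[] = {shell_contribution<STYPE>, shell_contribution<PTYPE>};` and call `evaluate_with_table(table, 2, ...)`, or call `evaluate_with_templates<STYPE, PTYPE>(...)` directly. The template version keeps every call direct and inlinable, whereas calls through function pointers generally cannot be inlined; on the GPU, pointers to `__device__` functions also have to be obtained in device code, which may add some plumbing to the function-pointer route.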
