Break Kernels based on shell-type and using CUDA Streams #8

To greatly increase performance with minimal code changes, it may be worthwhile to split the kernels (e.g., those computing atomic orbitals, their first derivatives, second derivatives, etc.) by shell type. Currently, all of these functions run up against the maximum of 255 registers per thread, which reduces the number of threads that can be active at once. In profiling, I observed at most 8 warps running concurrently. Splitting the kernels can also reduce branch divergence and give the compiler more room to optimize.
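
One way to check register pressure and occupancy for a given kernel is to ask the CUDA runtime directly. The following is a minimal sketch, not code from this repository: `toy_kernel` and `report_occupancy` are hypothetical names, and the toy kernel is only a stand-in for the real orbital kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in; substitute the real orbital kernel to inspect it.
__global__ void toy_kernel(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * in[i];
}

// Prints registers per thread and the resident warps per SM that the
// runtime predicts for `block_size`. Returns that warp count, or -1 if
// any CUDA call fails (for example, when no device is present).
int report_occupancy(int block_size) {
    int device = 0, warp_size = 0, blocks_per_sm = 0;
    cudaFuncAttributes attr;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&warp_size, cudaDevAttrWarpSize, device)
            != cudaSuccess) return -1;
    if (cudaFuncGetAttributes(&attr, toy_kernel) != cudaSuccess) return -1;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, toy_kernel, block_size, 0) != cudaSuccess) return -1;

    int warps_per_block = (block_size + warp_size - 1) / warp_size;
    int warps_per_sm = blocks_per_sm * warps_per_block;
    std::printf("registers/thread: %d, resident warps/SM: %d\n",
                attr.numRegs, warps_per_sm);
    return warps_per_sm;
}
```

Calling `report_occupancy(128)` for each candidate kernel would show whether a split-out S-type kernel actually needs fewer registers than the combined one. Compiling with `nvcc -Xptxas -v` reports the same register counts at build time.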

If the kernels are split into `compute_atomic_orbitals<S>`, `compute_atomic_orbitals<P>`, etc., then the S-type kernel can use fewer registers, so more threads can be resident at a time. Further, with [CUDA Streams](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf) the per-shell-type kernels could run concurrently. Template specialization would also remove the runtime if-statements on the shell type.

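```
#define STYPE 0
#define PTYPE 1
...
template<int ShellType>
void compute_atomic_orbitals(...) {
    // `if constexpr` (C++17) is resolved at compile time, so each
    // instantiation contains only the branch for its own shell type.
    if constexpr (ShellType == STYPE) {
        // compute s-type orbitals
    } else if constexpr (ShellType == PTYPE) {
        // compute p-type orbitals
    }
}
```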
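
To make the idea concrete, here is a hedged sketch of one kernel template instantiated per shell type, with each instantiation launched on its own stream. The kernel signature, the toy Gaussian math, and the `launch_by_shell_type` helper are assumptions for illustration only; they are not the project's actual interfaces. It needs `nvcc -std=c++17` for `if constexpr`.

```cuda
#include <cuda_runtime.h>

constexpr int STYPE = 0;
constexpr int PTYPE = 1;

// Toy 1D "orbital": each instantiation compiles only its own branch.
// (Hypothetical signature; the real kernels take different arguments.)
template <int ShellType>
__global__ void compute_atomic_orbitals(const double* x, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double radial = exp(-x[i] * x[i]);
    if constexpr (ShellType == STYPE) {
        out[i] = radial;          // s-type: no angular factor
    } else {
        out[i] = x[i] * radial;   // p-type: one Cartesian factor
    }
}

// Launches the S and P instantiations on separate streams so that they
// may overlap. Returns the first CUDA error seen, or cudaSuccess.
cudaError_t launch_by_shell_type(const double* d_x, double* d_s,
                                 double* d_p, int n) {
    if (n <= 0) return cudaSuccess;
    cudaStream_t streams[2];
    cudaError_t err = cudaSuccess;
    int created = 0;
    while (created < 2 &&
           (err = cudaStreamCreate(&streams[created])) == cudaSuccess) {
        ++created;
    }
    if (err == cudaSuccess) {
        int block = 128;
        int grid = (n + block - 1) / block;
        compute_atomic_orbitals<STYPE><<<grid, block, 0, streams[0]>>>(d_x, d_s, n);
        compute_atomic_orbitals<PTYPE><<<grid, block, 0, streams[1]>>>(d_x, d_p, n);
        err = cudaGetLastError();
    }
    for (int t = 0; t < created; ++t) {
        cudaError_t e = cudaStreamSynchronize(streams[t]);
        if (err == cudaSuccess) err = e;
        cudaStreamDestroy(streams[t]);
    }
    return err;
}
```

Separate streams only permit overlap; whether the two kernels actually run at the same time depends on how many SM resources each launch leaves free. In real code the streams would likely be created once and reused rather than created per call.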
 
To implement this for `evaluate_scalar_quantity`, you'll need either to change it so that it takes an array of function pointers with one entry per shell type (the S-type function, the P-type function, etc.) or, more simply but at some cost in readability, to use templates. CUDA streams would be added here as well.
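
The function-pointer option could look roughly like the sketch below, which fills a host-side table with template instantiations and launches each entry on its own stream. Everything here is assumed for illustration: `shell_kernel`, `evaluate_scalar_quantity_sketch`, and the toy math are hypothetical, and the real `evaluate_scalar_quantity` signature is not reproduced. Each shell type writes to its own slice of `d_partial`, so concurrent kernels never touch the same element; summing the slices is left to the caller.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

constexpr int STYPE = 0;
constexpr int PTYPE = 1;
constexpr int NUM_SHELL_TYPES = 2;

// Hypothetical per-shell-type kernel with a toy contribution.
template <int ShellType>
__global__ void shell_kernel(const double* x, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if constexpr (ShellType == STYPE) {
        out[i] = x[i];
    } else {
        out[i] = x[i] * x[i];
    }
}

// Host-side pointer type shared by every instantiation.
using ShellKernel = void (*)(const double*, double*, int);

// Writes shell type t's contribution to d_partial[t * n .. t * n + n - 1].
// d_partial must hold NUM_SHELL_TYPES * n doubles.
cudaError_t evaluate_scalar_quantity_sketch(const double* d_x,
                                            double* d_partial, int n) {
    if (n <= 0) return cudaSuccess;
    const ShellKernel kernels[NUM_SHELL_TYPES] = {
        shell_kernel<STYPE>, shell_kernel<PTYPE>};
    cudaStream_t streams[NUM_SHELL_TYPES];
    cudaError_t err = cudaSuccess;
    int created = 0;
    while (created < NUM_SHELL_TYPES &&
           (err = cudaStreamCreate(&streams[created])) == cudaSuccess) {
        ++created;
    }
    if (err == cudaSuccess) {
        int block = 128;
        int grid = (n + block - 1) / block;
        for (int t = 0; t < NUM_SHELL_TYPES; ++t) {
            double* slice = d_partial + static_cast<std::size_t>(t) * n;
            kernels[t]<<<grid, block, 0, streams[t]>>>(d_x, slice, n);
        }
        err = cudaGetLastError();
    }
    for (int t = 0; t < created; ++t) {
        cudaError_t e = cudaStreamSynchronize(streams[t]);
        if (err == cudaSuccess) err = e;
        cudaStreamDestroy(streams[t]);
    }
    return err;
}
```

Adding a D-type kernel under this layout would mean one more template branch and one more table entry, with no change to the launch loop.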

