Multigrid Solver
The feature/multigrid branch contains the present work in progress on implementing the adaptive multigrid algorithm in QUDA. Once you have checked out this branch, you should configure QUDA with the --enable-multigrid
option to ensure that it is compiled properly.
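For reference, on this branch this is just a matter of adding the flag to your usual autoconf-style configure invocation; the bracketed part below is a placeholder for whatever other options you normally pass (e.g., for MPI support or the target GPU architecture):
./configure --enable-multigrid [your usual configure options]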
The tests/multigrid_invert_test
program contains a simple example of how to use the multigrid solver, and is intended to be instructional on how to interface the multigrid solver with other applications. This test supports loading gauge fields, loading previously generated null-space vectors, and dumping these vectors to disk for future use. A list of the relevant command-line options can be gleaned with the --help
option:
kate@nvsocal2:~/github/quda-mg$ tests/multigrid_invert_test --help
Usage: tests/multigrid_invert_test [options]
Common options:
--prec <double/single/half> # Precision in GPU
--prec_sloppy <double/single/half> # Sloppy precision in GPU
--recon <8/9/12/13/18> # Link reconstruction type
--recon_sloppy <8/9/12/13/18> # Sloppy link reconstruction type
--dagger # Set the dagger to 1 (default 0)
--dim <n> # Set space-time dimension (X Y Z T)
--sdim <n> # Set space dimension(X/Y/Z) size
--xdim <n> # Set X dimension size(default 24)
--ydim <n> # Set Y dimension size(default 24)
--zdim <n> # Set Z dimension size(default 24)
--tdim <n> # Set T dimension size(default 24)
--Lsdim <n> # Set Ls dimension size(default 16)
--gridsize <x y z t> # Set the grid size in all four dimension (default 1 1 1 1)
--xgridsize <n> # Set grid size in X dimension (default 1)
--ygridsize <n> # Set grid size in Y dimension (default 1)
--zgridsize <n> # Set grid size in Z dimension (default 1)
--tgridsize <n> # Set grid size in T dimension (default 1)
--partition <mask> # Set the communication topology (X=1, Y=2, Z=4, T=8, and combinations of these)
--kernel-pack-t # Set T dimension kernel packing to be true (default false)
--dslash_type <type> # Set the dslash type, the following values are valid
wilson/clover/twisted_mass/twisted_clover/staggered
/asqtad/domain_wall/domain_wall_4d/mobius
--flavor <type> # Set the twisted mass flavor type (minus (default), plus, deg_doublet, nondeg_doublet)
--load-gauge file # Load gauge field "file" for the test (requires QIO)
--niter <n> # The number of iterations to perform (default 10)
--inv_type <cg/bicgstab/gcr> # The type of solver to use (default cg)
--precon_type <mr/ (unspecified)> # The type of solver to use (default none (=unspecified)).
For multigrid this sets the smoother type.
--multishift <true/false> # Whether to do a multi-shift solver test or not (default false)
--mass # Mass of Dirac operator (default 0.1)
--anisotropy # Temporal anisotropy factor (default 1.0)
--mass-normalization # Mass normalization (kappa (default) / mass)
--matpc # Matrix preconditioning type (even-even, odd_odd, even_even_asym, odd_odd_asym)
--tol <resid_tol> # Set L2 residual tolerance
--tolhq <resid_hq_tol> # Set heavy-quark residual tolerance
--tune <true/false> # Whether to autotune or not (default true)
--test # Test method (different for each test)
--verify <true/false> # Verify the GPU results using CPU results (default true)
--mg-nvec # Number of null-space vectors to define the multigrid transfer operator at each level
--mg-gpu-prolongate <true/false> # Whether to do the multigrid transfer operators on the GPU (default false)
--mg-levels <2+> # The number of multigrid levels to do (default 2)
--mg-nu-pre <1-20> # The number of pre-smoother applications to do at each multigrid level (default 2)
--mg-nu-post <1-20> # The number of post-smoother applications to do at each multigrid level (default 2)
--mg-block-size <x y z t> # Set the geometric block size for the each multigrid level's transfer operator (default 4 4 4 4)
--mg-generate-nullspace <true/false> # Generate the null-space vector dynamically (default true)
--mg-load-vec file # Load the vectors "file" for the multigrid_test (requires QIO)
--mg-save-vec file # Save the generated null-space vectors "file" from the multigrid_test (requires QIO)
--help # Print out this message
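As a quick illustrative check before moving to real gauge configurations (all parameters here are made up for illustration, and no gauge file is loaded, so QIO support is not required), one might run something like:
tests/multigrid_invert_test --sdim 16 --tdim 16 --mg-levels 2 --mg-nvec 24 --mg-block-size 4 4 4 4 --mg-generate-nullspace true --precon_type mr
This exercises the full setup (dynamic null-space generation and transfer-operator construction) followed by a solve on a single process.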
For example, say we have a 16^3x64 lattice contained in the file ~/lattices/wl_5p5_x2p38_um0p4086_cfg_1000.lime, and we want to run the multigrid setup to generate the null-space vectors, dump them to disk, and then solve a linear system with mass parameter mass=-0.42. To run this on two GPUs (two processes) we might do something like this:
mpirun -np 2 tests/multigrid_invert_test --mg-levels 2 --sdim 16 --tdim 32 --partition 0 --mass -0.42 --precon_type mr --prec double --prec_sloppy single --mg-nu-pre 6 --mg-nu-post 6 --mg-block-size 4 4 4 4 --tol 1e-8 --mg-generate-nullspace true --mg-nvec 16 --load-gauge ~/lattices/wl_5p5_x2p38_um0p4086_cfg_1000.lime --anisotropy 2.38 --mg-save-vec /tmp/null --tgridsize 2
This will run a two-level multigrid process, with a 16^3x32 volume on each GPU at the fine grid. The coarse grid will have 4^3x16 geometry. The parameters for the different multigrid levels are set uniformly, e.g., the number of null-space vectors per level, the number of pre/post-smoothing steps, etc. Changing these on a per-level basis currently requires manually editing the test file (tests/multigrid_invert_test.cpp).
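To make the relation between the blocking and the coarse-grid geometry concrete (the numbers here are purely illustrative): with the 16^3x64 global lattice above, passing --mg-block-size 4 4 4 8 instead would yield a coarse grid of 4^3x8 geometry, since each fine-grid dimension is divided by the corresponding block dimension. Keep in mind that the blocks must tile the local (per-GPU) volume, so each block dimension should divide the corresponding local lattice extent.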
If we want to reuse these same null-space vectors in a subsequent run, and not rerun the null-space generation, we would use
mpirun -np 2 tests/multigrid_invert_test --mg-levels 2 --sdim 16 --tdim 32 --partition 0 --mass -0.42 --precon_type mr --prec double --prec_sloppy single --mg-nu-pre 6 --mg-nu-post 6 --mg-block-size 4 4 4 4 --tol 1e-8 --mg-generate-nullspace false --mg-nvec 16 --load-gauge ~/lattices/wl_5p5_x2p38_um0p4086_cfg_1000.lime --anisotropy 2.38 --mg-load-vec /tmp/null --tgridsize 2
Note that to support loading and/or saving gauge fields and/or null-space vectors, QIO support must be enabled when configuring QUDA (--enable-qio=PATH_TO_QIO), which in turn requires that QMP also be enabled (--with-qmp=PATH_TO_QMP).
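Putting this together, a configure line for a build that can read and write these files might look like the following (the paths are placeholders for your local QIO and QMP installations, with any other options you normally use added as well):
./configure --enable-multigrid --enable-qio=/path/to/qio --with-qmp=/path/to/qmp [your usual configure options]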
There are many warts left in both multigrid_invert_test and the overall implementation. One major limitation is that, owing to the template explosion over parameters, compilation times can be rather long. For multi-level multigrid (more than 2 levels) this seriously limits what can be run. This is an ongoing problem that needs to be fixed before full multigrid can be put into production.
The following is an incomplete list of what needs to be done:
- Optimize CPU coarse grid operator - at present no attempt has been made to optimize the CPU coarse grid operator, despite it being on the critical path in a multi-level implementation, since we expect the coarsest grid to reside on the CPU. Optimization here will almost certainly include OpenMP parallelization and vectorization, as well as potentially extending the QUDA autotuner to encompass CPU kernels.
- Optimize CPU BLAS kernels
- Optimize halo packing for improved strong scaling #393
- Implement even-odd preconditioning for the coarse operator (need to compute inverse of the clover matrices)
- In single precision, the restriction operator has uncoalesced memory transactions #391
- Work out how to implement half precision for coarse grids (difficult with fine-grained accessors)
- Improve how precision is exposed in the interface
- Separate null-space vector generation and multigrid operator construction from the resulting solver
- Add support for calling the multigrid solver through the standard invertQuda interface
- Better exposure of the coarsest-grid solver parameters - at present this is hard-coded in lib/multigrid.cpp