
Commit 7da2893

Merge pull request #43 from ecmwf-ifs/nabr-release-1.4
Release 1.4.0
2 parents f4a90b6 + 65ae813 · commit 7da2893


75 files changed: +21,125 −315 lines

Binary file (12 KB) not shown.

.github/scripts/run-targets.sh

Lines changed: 6 additions & 2 deletions
```diff
@@ -11,13 +11,17 @@ skipped_targets=(dwarf-cloudsc-gpu-claw)
 if [[ "$arch" == *"nvhpc"* ]]
 then
     # Skip GPU targets if built with nvhpc (don't have GPU in test runner)
-    skipped_targets+=(dwarf-cloudsc-gpu-scc dwarf-cloudsc-gpu-scc-hoist dwarf-cloudsc-gpu-omp-scc-hoist)
+    skipped_targets+=(dwarf-cloudsc-gpu-scc dwarf-cloudsc-gpu-scc-hoist dwarf-cloudsc-gpu-omp-scc-hoist dwarf-cloudsc-gpu-scc-field)

     # Skip GPU targets from Loki if built with nvhpc (don't have GPU in test runner)
     skipped_targets+=(dwarf-cloudsc-loki-claw-gpu dwarf-cloudsc-loki-scc dwarf-cloudsc-loki-scc-hoist)

+    # Skip CUDA targets if built with nvhpc
+    skipped_targets+=(dwarf-cloudsc-gpu-scc-cuf dwarf-cloudsc-gpu-scc-cuf-k-caching)
+    skipped_targets+=(dwarf-cloudsc-loki-scc-cuf-hoist dwarf-cloudsc-loki-scc-cuf-parametrise)
+    skipped_targets+=(dwarf-cloudsc-cuda dwarf-cloudsc-cuda-hoist dwarf-cloudsc-cuda-k-caching)
     # Skip C target if built with nvhpc, segfaults for unknown reasons
-    skipped_targets+=(dwarf-cloudsc-c)
+    skipped_targets+=(dwarf-cloudsc-c dwarf-cloudsc-loki-c)
 fi

 exit_code=0
```
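For context, the script then iterates over the built binaries and skips anything collected in `skipped_targets`. A minimal sketch of such a driver loop (assumed logic, not the verbatim remainder of run-targets.sh):

```sh
# Hypothetical consumer of the skip list above
exit_code=0
for exe in bin/dwarf-cloudsc-*; do
  name=$(basename "$exe")
  # Membership test against the accumulated skip list
  if [[ " ${skipped_targets[*]} " == *" ${name} "* ]]; then
    echo "Skipping ${name}"
    continue
  fi
  "$exe" 1 100 16 || exit_code=1  # <omp-threads> <ngptot> <nproma>
done
exit $exit_code
```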

.github/scripts/verify-targets.sh

Lines changed: 13 additions & 0 deletions
```diff
@@ -22,6 +22,15 @@ then
     then
         targets+=(dwarf-cloudsc-gpu-claw)
     fi
+    if [[ "$cuda_flag" == "--with-cuda" ]]
+    then
+        targets+=(dwarf-cloudsc-gpu-scc-cuf dwarf-cloudsc-gpu-scc-cuf-k-caching)
+        targets+=(dwarf-cloudsc-gpu-scc-field)
+    fi
+    if [[ "$cuda_flag" == "--with-cuda" && "$io_library_flag" == "--with-serialbox" ]]
+    then
+        targets+=(dwarf-cloudsc-cuda dwarf-cloudsc-cuda-hoist dwarf-cloudsc-cuda-k-caching)
+    fi
 fi

 if [[ "$loki_flag" == "--with-loki" ]]
@@ -36,6 +45,10 @@ then
     then
         targets+=(dwarf-cloudsc-loki-claw-cpu dwarf-cloudsc-loki-claw-gpu)
     fi
+    if [[ "$cuda_flag" == "--with-cuda" ]]
+    then
+        targets+=(dwarf-cloudsc-loki-scc-cuf-hoist dwarf-cloudsc-loki-scc-cuf-parametrise)
+    fi
 fi

 #
```
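The `*_flag` variables consumed here are plain environment variables, set by the workflow below, so the script can be exercised locally the same way; a hypothetical invocation mirroring one CI matrix entry:

```sh
# Assumed local reproduction of the CI job's environment for this script
gpu_flag="--with-gpu" cuda_flag="--with-cuda" \
io_library_flag="--with-serialbox" loki_flag="--with-loki" \
  .github/scripts/verify-targets.sh
```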

.github/workflows/build.yml

Lines changed: 8 additions & 1 deletion
```diff
@@ -38,6 +38,8 @@ jobs:

         gpu_flag: ['', '--with-gpu'] # GPU-variants enabled

+        cuda_flag: [''] # Enable CUDA variants
+
         loki_flag: ['', '--with-loki'] # Loki source-to-source translation enabled

         claw_flag: [''] # Flag to enable CLAW-generated variants
@@ -49,11 +51,15 @@ jobs:
           mpi_flag: ''
           prec_flag: ''
           gpu_flag: '--with-gpu'
+          cuda_flag: '--with-cuda'
+          loki_flag: '--with-loki'
         - arch: github/ubuntu/nvhpc/21.9
           io_library_flag: '--with-serialbox'
           mpi_flag: ''
           prec_flag: ''
           gpu_flag: '--with-gpu'
+          cuda_flag: '--with-cuda'
+          loki_flag: '--with-loki'

       # Steps represent a sequence of tasks that will be executed as part of the job
       steps:
@@ -99,14 +105,15 @@ jobs:
           ./cloudsc-bundle build --retry-verbose \
             --arch=arch/${{ matrix.arch }} ${{ matrix.prec_flag }} \
             ${{ matrix.mpi_flag }} ${{ matrix.io_library_flag }} ${{ matrix.gpu_flag }} \
-            ${{ matrix.claw_flag}} ${{ matrix.loki_flag }}
+            ${{ matrix.claw_flag}} ${{ matrix.loki_flag }} ${{ matrix.cuda_flag }}

       # Verify targets exist
       - name: Verify targets
         env:
           io_library_flag: ${{ matrix.io_library_flag }}
           prec_flag: ${{ matrix.prec_flag }}
           gpu_flag: ${{ matrix.gpu_flag }}
+          cuda_flag: ${{ matrix.cuda_flag }}
           loki_flag: ${{ matrix.loki_flag }}
           claw_flag: ${{ matrix.claw_flag }}
         run: .github/scripts/verify-targets.sh
```
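For the nvhpc matrix entry above, the templated build step expands to roughly the following command (arch path and flags as given in the matrix):

```sh
./cloudsc-bundle build --retry-verbose \
  --arch=arch/github/ubuntu/nvhpc/21.9 \
  --with-serialbox --with-gpu --with-loki --with-cuda
```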

AUTHORS.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -16,6 +16,7 @@
 - Z. Piotrowski (ECMWF)
 - B. Reuter (ECMWF)
 - D. Salmond (ECMWF)
+- M. Staneker (ECMWF)
 - M. Tiedtke (ECMWF)
 - A. Tompkins (ECMWF)
 - S. Ubbiali (ETH Zuerich)
```

README.md

Lines changed: 44 additions & 0 deletions
```diff
@@ -60,6 +60,29 @@ Balthasar Reuter ([email protected])
   move parameter structures to constant memory. To enable this variant,
   a suitable CUDA installation is required and the `--with-cuda` flag
   needs to be passed at the build stage.
+- **dwarf-cloudsc-gpu-scc-cuf-k-caching**: GPU-enabled and further
+  optimized version of CLOUDSC that uses the SCC loop layout in
+  combination with loop fusion and temporary local array demotion,
+  implemented using CUDA-Fortran (CUF). To enable this variant,
+  a suitable CUDA installation is required and the `--with-cuda` flag
+  needs to be passed at the build stage.
+- **CUDA C prototypes**: To enable these variants, a suitable
+  CUDA installation is required and the `--with-cuda` flag needs
+  to be passed at the build stage.
+  - **dwarf-cloudsc-cuda**: GPU-enabled, CUDA C version of CLOUDSC.
+  - **dwarf-cloudsc-cuda-hoist**: GPU-enabled, optimized CUDA C version
+    of CLOUDSC including host-side hoisted temporary local variables.
+  - **dwarf-cloudsc-cuda-k-caching**: GPU-enabled, further optimized CUDA
+    C version of CLOUDSC including loop fusion and temporary local
+    array demotion.
+- **dwarf-cloudsc-gpu-scc-field**: GPU-enabled and optimized version of
+  CLOUDSC that uses the SCC loop layout and a dedicated Fortran FIELD
+  API to manage device offload and copyback. The intent is to demonstrate
+  the explicit use of pinned host memory to speed up data transfers, as
+  provided by the shipped prototype implementation, and to investigate the
+  effect of different data storage allocation layouts. To enable this
+  variant, a suitable CUDA installation is required and the
+  `--with-cuda` flag needs to be passed at the build stage.

 ## Download and Installation
```
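Once built with `--with-cuda`, the new variants are launched like the existing dwarfs; for example (the usual `<omp-threads> <ngptot> <nproma>` argument convention, numbers illustrative):

```sh
./bin/dwarf-cloudsc-gpu-scc-cuf-k-caching 1 80000 128
./bin/dwarf-cloudsc-cuda 1 80000 128
```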
```diff
@@ -249,6 +272,27 @@ srun bash -c "CUDA_VISIBLE_DEVICES=\$SLURM_LOCALID bin/dwarf-cloudsc-gpu-scc-hoi

 In principle, the same should work for multi-node execution (`-N 2`, `-N 4` etc.) once interconnect issues are resolved.

+### GPU runs: Timing device kernels and data transfers
+
+For GPU-enabled runs, two internal timer results are reported:
+
+* The isolated compute time of the main compute kernel on device (where `#BLKS == 1`)
+* The overall time of the execution loop, including data offload and copyback
+
+It is important to note that, due to the nature of the kernel, data
+transfer overheads will dominate timings, and most supported GPU
+variants aim to optimise compute kernel timings only. However, a
+dedicated variant, `dwarf-cloudsc-gpu-scc-field`, has been added to
+explore host-side memory pinning, which improves data transfer times,
+and alternative data layout strategies. By default, this variant allocates
+each array variable individually in pinned memory. A runtime flag,
+`CLOUDSC_PACKED_STORAGE=ON`, can be used to enable "packed" storage,
+where multiple arrays are stored in a single base allocation, e.g.
+
+```sh
+NV_ACC_CUDA_HEAPSIZE=8G CLOUDSC_PACKED_STORAGE=ON ./bin/dwarf-cloudsc-gpu-scc-field 1 80000 128
+```
+
 ## Loki transformations for CLOUDSC

 [Loki](https://github.com/ecmwf-ifs/loki) is an in-house developed
```

VERSION

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-1.3.0
+1.4.0
```

arch/toolchains/ecmwf-hpc2020-nvhpc.cmake

Lines changed: 8 additions & 0 deletions
```diff
@@ -37,6 +37,14 @@ set( OpenACC_Fortran_FLAGS "-acc=gpu -mp=gpu -gpu=cc80,lineinfo,fastmath" CACHE
 # Enable this to get more detailed compiler output
 # set( OpenACC_Fortran_FLAGS "${OpenACC_Fortran_FLAGS} -Minfo" )

+####################################################################
+# CUDA FLAGS
+####################################################################
+
+if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
+    set(CMAKE_CUDA_ARCHITECTURES 80)
+endif()
+
 ####################################################################
 # COMMON FLAGS
 ####################################################################
```
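Because of the `NOT DEFINED` guard, the cc80 (A100) default applies only when nothing else is requested, so it can be overridden at configure time; a generic CMake sketch (direct invocation assumed, not the bundle wrapper):

```sh
# Target cc70 (V100) instead of the cc80 default
cmake -DCMAKE_CUDA_ARCHITECTURES=70 /path/to/cloudsc
```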

arch/toolchains/ecmwf-volta-pgi-gpu.cmake

Lines changed: 2 additions & 0 deletions
```diff
@@ -50,6 +50,8 @@ set(ECBUILD_Fortran_FLAGS "${ECBUILD_Fortran_FLAGS} -Ktrap=fp")
 set(ECBUILD_Fortran_FLAGS "${ECBUILD_Fortran_FLAGS} -Kieee")
 set(ECBUILD_Fortran_FLAGS "${ECBUILD_Fortran_FLAGS} -Mdaz")

+set(ECBUILD_Fortran_LINK_FLAGS "-gpu=pinned")
+
 set( ECBUILD_Fortran_FLAGS_BIT "-O2 -gopt" )

 set( ECBUILD_C_FLAGS "-O2 -gopt -traceback" )
```

benchmark/include/include_patternset.yml

Lines changed: 39 additions & 39 deletions
```diff
@@ -158,45 +158,45 @@ patternset:

   - name: timing_pattern
     pattern:
-      - {name: thr_time,   type: int, _: '(?:$jube_pat_nint\s+){6}:\s+$jube_pat_int\s+$jube_pat_nint\s+@\s+(?:rank#$jube_pat_nint:)?core#'} #$jube_pat_nint'} # C-version doesn't print core number?
-      - {name: thr_mflops, type: int, _: '(?:$jube_pat_nint\s+){6}:\s+$jube_pat_nint\s+$jube_pat_int\s+@\s+(?:rank#$jube_pat_nint:)?core#'} #$jube_pat_nint'}
-      - {name: rnk_time,   type: int, _: '(?:$jube_pat_nint\s+){6}:\s+$jube_pat_int\s+$jube_pat_nint\s+:\s+TOTAL\s@\srank#$jube_pat_nint'}
-      - {name: rnk_mflops, type: int, _: '(?:$jube_pat_nint\s+){6}:\s+$jube_pat_nint\s+$jube_pat_int\s+:\s+TOTAL\s@\srank#$jube_pat_nint'}
-      - {name: tot_time,   type: int, _: '(?:$jube_pat_nint\s*x\s*)?(?:$jube_pat_nint\s+){6}:\s+$jube_pat_int\s+$jube_pat_nint\s+(?::\s+)?TOTAL(?!\s@)'}
-      - {name: tot_mflops, type: int, _: '(?:$jube_pat_nint\s*x\s*)?(?:$jube_pat_nint\s+){6}:\s+$jube_pat_nint\s+$jube_pat_int\s+(?::\s+)?TOTAL(?!\s@)'}
-      - {name: tot_nproc,  type: int, _: '$jube_pat_int\s*x\s*(?:$jube_pat_nint\s+){6}:\s+(?:$jube_pat_nint\s+){2}(?::\s+)?TOTAL(?!\s@)'}
-      - {name: tot_numomp, type: int, _: '(?:$jube_pat_nint\s*x\s*)?$jube_pat_int\s+(?:$jube_pat_nint\s+){5}:\s+(?:$jube_pat_nint\s+){2}(?::\s+)?TOTAL(?!\s@)'}
-      - {name: tot_ngptot, type: int, _: '(?:$jube_pat_nint\s*x\s*)?$jube_pat_nint\s+$jube_pat_int\s+(?:$jube_pat_nint\s+){4}:\s+(?:$jube_pat_nint\s+){2}(?::\s+)?TOTAL(?!\s@)'}
-      - {name: tot_ngpblks,type: int, _: '(?:$jube_pat_nint\s*x\s*)?(?:$jube_pat_nint\s+){3}$jube_pat_int\s+(?:$jube_pat_nint\s+){2}:\s+(?:$jube_pat_nint\s+){2}(?::\s+)?TOTAL(?!\s@)'}
-      - {name: tot_nproma, type: int, _: '(?:$jube_pat_nint\s*x\s*)?(?:$jube_pat_nint\s+){4}$jube_pat_int\s+$jube_pat_nint\s+:\s+(?:$jube_pat_nint\s+){2}(?::\s+)?TOTAL(?!\s@)'}
+      - {name: thr_time,   type: int, _: '(?:$jube_pat_nint\s+){6}:\s+$jube_pat_int\s+(?:$jube_pat_nint\s+){2}@\s+(?:rank#$jube_pat_nint:)?core#'} #$jube_pat_nint'} # C-version doesn't print core number?
+      - {name: thr_mflops, type: int, _: '(?:$jube_pat_nint\s+){6}:\s+$jube_pat_nint\s+$jube_pat_int\s+$jube_pat_nint\s+@\s+(?:rank#$jube_pat_nint:)?core#'} #$jube_pat_nint'}
+      - {name: rnk_time,   type: int, _: '(?:$jube_pat_nint\s+){6}:\s+$jube_pat_int\s+(?:$jube_pat_nint\s+){2}:\s+TOTAL\s@\srank#$jube_pat_nint'}
+      - {name: rnk_mflops, type: int, _: '(?:$jube_pat_nint\s+){6}:\s+$jube_pat_nint\s+$jube_pat_int\s+$jube_pat_nint\s+:\s+TOTAL\s@\srank#$jube_pat_nint'}
+      - {name: tot_time,   type: int, _: '(?:$jube_pat_nint\s*x\s*)?(?:$jube_pat_nint\s+){6}:\s+$jube_pat_int\s+(?:$jube_pat_nint\s+){2}(?::\s+)?TOTAL(?!\s@)'}
+      - {name: tot_mflops, type: int, _: '(?:$jube_pat_nint\s*x\s*)?(?:$jube_pat_nint\s+){6}:\s+$jube_pat_nint\s+$jube_pat_int\s+$jube_pat_nint\s+(?::\s+)?TOTAL(?!\s@)'}
+      - {name: tot_nproc,  type: int, _: '$jube_pat_int\s*x\s*(?:$jube_pat_nint\s+){6}:\s+(?:$jube_pat_nint\s+){3}(?::\s+)?TOTAL(?!\s@)'}
+      - {name: tot_numomp, type: int, _: '(?:$jube_pat_nint\s*x\s*)?$jube_pat_int\s+(?:$jube_pat_nint\s+){5}:\s+(?:$jube_pat_nint\s+){3}(?::\s+)?TOTAL(?!\s@)'}
+      - {name: tot_ngptot, type: int, _: '(?:$jube_pat_nint\s*x\s*)?$jube_pat_nint\s+$jube_pat_int\s+(?:$jube_pat_nint\s+){4}:\s+(?:$jube_pat_nint\s+){3}(?::\s+)?TOTAL(?!\s@)'}
+      - {name: tot_ngpblks,type: int, _: '(?:$jube_pat_nint\s*x\s*)?(?:$jube_pat_nint\s+){3}$jube_pat_int\s+(?:$jube_pat_nint\s+){2}:\s+(?:$jube_pat_nint\s+){3}(?::\s+)?TOTAL(?!\s@)'}
+      - {name: tot_nproma, type: int, _: '(?:$jube_pat_nint\s*x\s*)?(?:$jube_pat_nint\s+){4}$jube_pat_int\s+$jube_pat_nint\s+:\s+(?:$jube_pat_nint\s+){3}(?::\s+)?TOTAL(?!\s@)'}

-      # NUMOMP NGPTOT #GP-cols #BLKS NPROMA tid# : Time(msec) MFlops/s
-      #      8  16384     2048   128     16    0 :        295      866 @ core#22
-      #      8  16384     2048   128     16    1 :        284      899 @ core#4
-      #      8  16384     2048   128     16    2 :        282      905 @ core#16
-      #      8  16384     2048   128     16    3 :        239     1067 @ core#1
-      #      8  16384     2048   128     16    4 :        261      975 @ core#2
-      #      8  16384     2048   128     16    5 :        266      959 @ core#3
-      #      8  16384     2048   128     16    6 :        267      955 @ core#21
-      #      8  16384     2048   128     16    7 :        273      934 @ core#23
-      #      8  16384    16384  1024     16   -1 :        295     6931 : TOTAL
+      # NUMOMP NGPTOT #GP-cols #BLKS NPROMA tid# : Time(msec) MFlops/s col/s
+      #      8  16384     2048   128     16    0 :        295      866  1320 @ core#22
+      #      8  16384     2048   128     16    1 :        284      899  1320 @ core#4
+      #      8  16384     2048   128     16    2 :        282      905  1320 @ core#16
+      #      8  16384     2048   128     16    3 :        239     1067  1320 @ core#1
+      #      8  16384     2048   128     16    4 :        261      975  1320 @ core#2
+      #      8  16384     2048   128     16    5 :        266      959  1320 @ core#3
+      #      8  16384     2048   128     16    6 :        267      955  1320 @ core#21
+      #      8  16384     2048   128     16    7 :        273      934  1320 @ core#23
+      #      8  16384    16384  1024     16   -1 :        295     6931  1320 : TOTAL

       # NUMPROC=8, NUMOMP=1, NGPTOTG=16384, NPROMA=16, NGPBLKS=128
-      # NUMOMP NGPTOT #GP-cols #BLKS NPROMA tid# : Time(msec) MFlops/s
-      #      1   2048     2048   128     16    0 :        237     1075 @ rank#0:core#20
-      #      1   2048     2048   128     16   -1 :        237     1075 : TOTAL @ rank#0
-      #      1   2048     2048   128     16    0 :        230     1109 @ rank#1:core#11
-      #      1   2048     2048   128     16   -1 :        230     1109 : TOTAL @ rank#1
-      #      1   2048     2048   128     16    0 :        281      906 @ rank#2:core#6
-      #      1   2048     2048   128     16   -1 :        281      906 : TOTAL @ rank#2
-      #      1   2048     2048   128     16    0 :        254     1002 @ rank#3:core#24
-      #      1   2048     2048   128     16   -1 :        254     1002 : TOTAL @ rank#3
-      #      1   2048     2048   128     16    0 :        271      940 @ rank#4:core#3
-      #      1   2048     2048   128     16   -1 :        271      940 : TOTAL @ rank#4
-      #      1   2048     2048   128     16    0 :        249     1025 @ rank#5:core#25
-      #      1   2048     2048   128     16   -1 :        249     1025 : TOTAL @ rank#5
-      #      1   2048     2048   128     16    0 :        235     1086 @ rank#6:core#1
-      #      1   2048     2048   128     16   -1 :        235     1086 : TOTAL @ rank#6
-      #      1   2048     2048   128     16    0 :        243     1050 @ rank#7:core#15
-      #      1   2048     2048   128     16   -1 :        243     1050 : TOTAL @ rank#7
-      #  8 x 1  16384    16384  1024     16   -1 :        281     8193 : TOTAL
+      # NUMOMP NGPTOT #GP-cols #BLKS NPROMA tid# : Time(msec) MFlops/s col/s
+      #      1   2048     2048   128     16    0 :        237     1075  1320 @ rank#0:core#20
+      #      1   2048     2048   128     16   -1 :        237     1075  1320 : TOTAL @ rank#0
+      #      1   2048     2048   128     16    0 :        230     1109  1320 @ rank#1:core#11
+      #      1   2048     2048   128     16   -1 :        230     1109  1320 : TOTAL @ rank#1
+      #      1   2048     2048   128     16    0 :        281      906  1320 @ rank#2:core#6
+      #      1   2048     2048   128     16   -1 :        281      906  1320 : TOTAL @ rank#2
+      #      1   2048     2048   128     16    0 :        254     1002  1320 @ rank#3:core#24
+      #      1   2048     2048   128     16   -1 :        254     1002  1320 : TOTAL @ rank#3
+      #      1   2048     2048   128     16    0 :        271      940  1320 @ rank#4:core#3
+      #      1   2048     2048   128     16   -1 :        271      940  1320 : TOTAL @ rank#4
+      #      1   2048     2048   128     16    0 :        249     1025  1320 @ rank#5:core#25
+      #      1   2048     2048   128     16   -1 :        249     1025  1320 : TOTAL @ rank#5
+      #      1   2048     2048   128     16    0 :        235     1086  1320 @ rank#6:core#1
+      #      1   2048     2048   128     16   -1 :        235     1086  1320 : TOTAL @ rank#6
+      #      1   2048     2048   128     16    0 :        243     1050  1320 @ rank#7:core#15
+      #      1   2048     2048   128     16   -1 :        243     1050  1320 : TOTAL @ rank#7
+      #  8 x 1  16384    16384  1024     16   -1 :        281     8193  1320 : TOTAL
```
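As a sanity check, an updated pattern can be tested outside JUBE by substituting the standard placeholders, assuming `$jube_pat_int` expands to `([+-]?\d+)` and `$jube_pat_nint` to the non-capturing `(?:[+-]?\d+)`; e.g. with GNU grep:

```sh
# Hypothetical stand-alone test of the tot_time regex against the sample TOTAL line
echo '  8 x 1  16384    16384  1024     16   -1 :   281   8193  1320 : TOTAL' | \
  grep -oP '(?:(?:[+-]?\d+)\s*x\s*)?(?:(?:[+-]?\d+)\s+){6}:\s+([+-]?\d+)\s+(?:(?:[+-]?\d+)\s+){2}(?::\s+)?TOTAL(?!\s@)'
```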
