@@ -114,7 +114,8 @@ def template_args(self):
             dry_runs=self.dry_runs,
             timers=self.timers,
             strides=self.strides,
-            index_type=self.index_type)
+            index_type=self.index_type,
+            implementation_name=self.template_file().partition('.')[0])
Contributor:
I guess the implementation_name is used for easier analysis with NCU? But if we, for example, run using two different domain sizes, the template_file is the same, so this does not really help in general, or does it?
Wouldn't it be better to either just add a hash of all parameters (which is completely unreadable, but enforces a unique name per unique parameter set) or create a human-readable implementation_name that includes all relevant parameters?

Contributor Author:
> But if we for example run using two different domain sizes, template_file is the same, so this does not really help in general, or does it?

It's true that it only partly helps. IIRC one can infer the domain size from the grid size reported by NCU, though?
When I was looking into this, distinguishing the kernel names was helpful enough, since I could then figure out which size each report belonged to. For different sizes one can also launch the benchmarks separately and create separate ncu-rep files that encode the domain sizes as well.
Tbh I don't think a hash adds much value towards the desired purpose.

Contributor:

Running just a single kernel at a time is what I always did. But then we don't need this change at all, because you either know exactly what you ran, or you have a problem because of missing information, right?
So I am for either completely fixing the problem (if it is one) or completely ignoring it.

Contributor Author:

If we run each kernel with a separate command, then we can avoid those issues and simplify things. I was just using the sbench-<arch>-collection to run the benchmarks and collect results in batches.
It would still be useful to have a way to inspect and distinguish the generated code for each implementation, by encoding the implementation name in either the function name or the filename saved to disk.


     @contextlib.contextmanager
     def on_device(self, data):
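
(Illustrative sketch of what the new template argument evaluates to; the template filenames below are hypothetical.)

```python
# str.partition('.') splits at the FIRST dot, so the implementation name
# is everything before the first extension of the template filename.
def implementation_name(template_file: str) -> str:
    return template_file.partition('.')[0]

assert implementation_name('basic.cu.j2') == 'basic'
assert implementation_name('shared.hip.j2') == 'shared'
```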
@@ -55,7 +55,7 @@ using hipDeviceProp = hipDeviceProp_t;
 }
 
 {% block gpu_kernel %}
-__global__ void gpu_kernel(
+__global__ void gpu_kernel_{{ implementation_name }}(
 {%- for arg in args %}
     {{ ctype }} * __restrict__ {{ arg }}{{ "," if not loop.last }}
 {%- endfor %}
@@ -135,7 +135,7 @@ extern "C" int kernel(
     }
 
     for (int dry_run = 0; dry_run < {{ dry_runs }}; ++dry_run) {
-        gpu_kernel<<<grid_size, block_size, smem_size>>>(
+        gpu_kernel_{{ implementation_name }}<<<grid_size, block_size, smem_size>>>(
 {%- for arg in args %}
             {{ arg }}{{ "," if not loop.last }}
 {%- endfor %}
@@ -158,9 +158,9 @@
 {%- endif %}
 
 {%- if timers == 'hip-ext' %}
-    hipExtLaunchKernelGGL(gpu_kernel, grid_size, block_size, smem_size, 0, start, stop, 0,
+    hipExtLaunchKernelGGL(gpu_kernel_{{ implementation_name }}, grid_size, block_size, smem_size, 0, start, stop, 0,
 {%- else %}
-    gpu_kernel<<<grid_size, block_size, smem_size>>>(
+    gpu_kernel_{{ implementation_name }}<<<grid_size, block_size, smem_size>>>(
 {%- endif %}
 {%- for arg in args %}
     {{ arg }}{{ "," if not loop.last }}
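
(Illustrative sanity check of how the renamed kernel signature renders; the snippet and value are made up, jinja2 being the engine the templates above are written for.)

```python
import jinja2

snippet = '__global__ void gpu_kernel_{{ implementation_name }}('
rendered = jinja2.Template(snippet).render(implementation_name='basic')
assert rendered == '__global__ void gpu_kernel_basic('
```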
@@ -186,7 +186,7 @@ __forceinline__ __device__ void forward_sweep(const {{ index_type }} ishift,
 }
 
 
-__global__ void __launch_bounds__({{ block_size[0] * block_size[1] }}) gpu_kernel(
+__global__ void __launch_bounds__({{ block_size[0] * block_size[1] }}) gpu_kernel_{{ implementation_name }}(
     const {{ ctype }} *__restrict__ ustage,
     const {{ ctype }} *__restrict__ upos,
     const {{ ctype }} *__restrict__ utens,
@@ -186,7 +186,7 @@ __forceinline__ __device__ void forward_sweep(const {{ index_type }} ishift,
 }
 
 
-__global__ void __launch_bounds__({{ block_size[0] * block_size[1] }}) gpu_kernel(
+__global__ void __launch_bounds__({{ block_size[0] * block_size[1] }}) gpu_kernel_{{ implementation_name }}(
     const {{ ctype }} *__restrict__ ustage,
     const {{ ctype }} *__restrict__ upos,
     const {{ ctype }} *__restrict__ utens,
@@ -389,7 +389,7 @@ __forceinline__ __device__ void forward_sweep(const {{ ctype }} *__restrict__ wc
 }
 
 
-__global__ void __launch_bounds__({{ block_size[0] * block_size[1] }}) gpu_kernel(
+__global__ void __launch_bounds__({{ block_size[0] * block_size[1] }}) gpu_kernel_{{ implementation_name }}(
     const {{ ctype }} *__restrict__ ustage,
     const {{ ctype }} *__restrict__ upos,
     const {{ ctype }} *__restrict__ utens,
@@ -194,7 +194,7 @@ __forceinline__ __device__ void forward_sweep(const {{ index_type }} ishift,
 }
 
 
-__global__ void __launch_bounds__({{ block_size[0] * block_size[1] }}) gpu_kernel(
+__global__ void __launch_bounds__({{ block_size[0] * block_size[1] }}) gpu_kernel_{{ implementation_name }}(
     const {{ ctype }} *__restrict__ ustage,
     const {{ ctype }} *__restrict__ upos,
     const {{ ctype }} *__restrict__ utens,
stencil_benchmarks/tools/compilation.py (2 changes: 1 addition & 1 deletion)
@@ -122,7 +122,7 @@ def __init__(self,
             compile_command = ['gcc'] if extension.lower() == '.c' else ['g++']
 
         if compile_command[0].endswith('nvcc'):
-            compile_command += ['-Xcompiler', '-shared', '-Xcompiler', '-fPIC']
+            compile_command += ['-Xcompiler', '-shared', '-Xcompiler', '-fPIC', '--generate-line-info']
         else:
             compile_command += ['-shared', '-fPIC']
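
(Illustrative sketch of the flag composition above. nvcc's `--generate-line-info`, alias `-lineinfo`, embeds source-line tables in the device code so profilers like NCU can map SASS back to source lines; the starting value of compile_command is hypothetical.)

```python
compile_command = ['nvcc']  # hypothetical user-specified compiler

if compile_command[0].endswith('nvcc'):
    # -Xcompiler forwards -shared / -fPIC to the host compiler;
    # --generate-line-info adds line info for profiler source views.
    compile_command += ['-Xcompiler', '-shared', '-Xcompiler', '-fPIC',
                        '--generate-line-info']
else:
    compile_command += ['-shared', '-fPIC']

print(' '.join(compile_command))
# nvcc -Xcompiler -shared -Xcompiler -fPIC --generate-line-info
```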