Refactor malloc allocation

Currently, some of the benchmarks allocate memory that's accessed from the kernels using:
- `malloc`
- `aligned_alloc(2MiB, ...)`
- C++ `new`
- vendor-specific APIs (`cudaMallocManaged`, etc.)

How the memory is allocated does affect benchmark performance a bit, so results are only comparable when benchmarks allocate the same way.

I think we should extract these into a single shared file, so that all benchmarks can pick the exact same defaults.
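
As a rough sketch of what such a shared file could look like (everything here is hypothetical: the file name `bench_alloc.hpp`, the `bench_alloc` namespace, and the `allocate`/`buffer` names are placeholders, not existing code in this repo). It assumes C++17 and a platform that provides `std::aligned_alloc`, and it only covers the host-side `aligned_alloc` path with the 2 MiB alignment mentioned above. Vendor-specific allocators such as `cudaMallocManaged` could presumably sit behind the same entry point, but that is not shown.

```cpp
// bench_alloc.hpp (hypothetical file name)
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>

namespace bench_alloc {  // hypothetical namespace

// Default alignment shared by all benchmarks: 2 MiB, matching the
// aligned_alloc(2MiB, ...) calls some benchmarks already use.
inline constexpr std::size_t default_alignment = std::size_t{2} << 20;

// Round `bytes` up to the next multiple of `alignment` (a power of two).
// aligned_alloc expects the size to be a multiple of the alignment.
constexpr std::size_t round_up(std::size_t bytes, std::size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Deleter that releases memory obtained from std::aligned_alloc.
struct free_deleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

template <class T>
using buffer = std::unique_ptr<T[], free_deleter>;

// Allocate uninitialized, aligned storage for `count` elements of T.
// Restricted to trivially destructible types because the deleter only
// frees the memory and never runs destructors.
template <class T>
buffer<T> allocate(std::size_t count,
                   std::size_t alignment = default_alignment) {
    static_assert(std::is_trivially_destructible_v<T>,
                  "allocate<T> does not run destructors");
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    std::size_t bytes = round_up(count * sizeof(T), alignment);
    if (bytes == 0) bytes = alignment;  // avoid a zero-sized request

    void* p = std::aligned_alloc(alignment, bytes);
    if (p == nullptr) throw std::bad_alloc{};
    return buffer<T>(static_cast<T*>(p));
}

}  // namespace bench_alloc
```

A benchmark would then replace its own `malloc`/`new` call with something like `auto data = bench_alloc::allocate<double>(n);` and pass `data.get()` to the kernel. Changing the default alignment (or swapping the underlying allocator) would then happen in one place for all benchmarks.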