Optimize server-setup, by offloading large matrix multiplication to GPU #11

itzmeanjan · 2025-04-05T03:22:21Z

Use Vulkan compute shaders to offload large matrix multiplication and matrix transposition to the GPU
(feature-gated behind the non-default gpu feature), speeding up the server-setup phase of ChalametPIR.
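
For reference, here is a minimal CPU-side sketch of the two operations the compute shaders take over: matrix transposition and matrix-matrix multiplication. The `Matrix` type, its row-major layout, and the wrapping (mod 2^32) arithmetic over u32 elements are assumptions made for illustration. They are not taken from the ChalametPIR source, and the actual shaders may organize the work differently.

```rust
/// Hypothetical row-major matrix of u32 elements with u32 dimensions.
#[derive(Debug, Clone, PartialEq)]
struct Matrix {
    rows: u32,
    cols: u32,
    elems: Vec<u32>,
}

impl Matrix {
    /// Returns None if the element count does not match rows * cols.
    fn new(rows: u32, cols: u32, elems: Vec<u32>) -> Option<Matrix> {
        if elems.len() == (rows as usize) * (cols as usize) {
            Some(Matrix { rows, cols, elems })
        } else {
            None
        }
    }

    /// Out-of-place transpose: element (i, j) moves to (j, i).
    fn transpose(&self) -> Matrix {
        let (r, c) = (self.rows as usize, self.cols as usize);
        let mut out = vec![0u32; r * c];
        for i in 0..r {
            for j in 0..c {
                out[j * r + i] = self.elems[i * c + j];
            }
        }
        Matrix { rows: self.cols, cols: self.rows, elems: out }
    }

    /// Naive matrix-matrix product with wrapping arithmetic (assumed modulus 2^32).
    /// Returns None when the inner dimensions do not agree.
    fn mul(&self, rhs: &Matrix) -> Option<Matrix> {
        if self.cols != rhs.rows {
            return None;
        }
        let (n, k, m) = (self.rows as usize, self.cols as usize, rhs.cols as usize);
        let mut out = vec![0u32; n * m];
        for i in 0..n {
            for l in 0..k {
                let a = self.elems[i * k + l];
                for j in 0..m {
                    let prod = a.wrapping_mul(rhs.elems[l * m + j]);
                    out[i * m + j] = out[i * m + j].wrapping_add(prod);
                }
            }
        }
        Some(Matrix { rows: self.rows, cols: rhs.cols, elems: out })
    }
}

fn main() {
    let a = Matrix::new(2, 3, vec![1, 2, 3, 4, 5, 6]).unwrap();
    let a_t = a.transpose();
    let product = a.mul(&a_t).unwrap();
    println!("{}x{} product: {:?}", product.rows, product.cols, product.elems);
}
```

Each output element is independent of the others, which is what makes both operations a natural fit for a GPU: a shader invocation can be assigned one output cell (or a tile of cells) instead of running the outer loops sequentially.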

Signed-off-by: Anjan Roy <[email protected]>

itzmeanjan · 2025-04-05T03:55:57Z

Without the gpu feature, server-setup cost on an Intel i7-1260P CPU:

$ cargo bench --features mutate_internal_client_state --profile optimized --bench offline_phase -q server_setup
Timer precision: 10 ns
offline_phase                                                                        fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ server_setup                                                                                    │               │               │               │         │
   ├─ 3                                                                                            │               │               │               │         │
   │  ╰─ DBConfig { db_entry_count: 65536, key_byte_len: 32, value_byte_len: 1024 }  2.522 m       │ 2.648 m       │ 2.585 m       │ 2.585 m       │ 2       │ 2
   ╰─ 4                                                                                            │               │               │               │         │
      ╰─ DBConfig { db_entry_count: 65536, key_byte_len: 32, value_byte_len: 1024 }  2.535 m       │ 2.552 m       │ 2.543 m       │ 2.543 m       │ 2       │ 2

With the gpu feature enabled, server-setup is ~12.45x faster (median of 2.585 min on CPU vs. 12.45 s with GPU offload) 🚀

$ cargo bench --features mutate_internal_client_state,gpu --profile optimized --bench offline_phase -q server_setup
Timer precision: 10 ns
offline_phase                                                                        fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ server_setup                                                                                    │               │               │               │         │
   ├─ 3                                                                                            │               │               │               │         │
   │  ╰─ DBConfig { db_entry_count: 65536, key_byte_len: 32, value_byte_len: 1024 }  12.18 s       │ 12.69 s       │ 12.45 s       │ 12.46 s       │ 25      │ 25
   ╰─ 4                                                                                            │               │               │               │         │
      ╰─ DBConfig { db_entry_count: 65536, key_byte_len: 32, value_byte_len: 1024 }  11.73 s       │ 12.24 s       │ 11.86 s       │ 11.87 s       │ 26      │ 26
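
The speedup figure can be checked from the median columns of the two benchmark runs. The following is a small sketch of that arithmetic; the `speedup` helper and the variable names are illustrative only, and the durations are copied from the `3` and `4` rows of the outputs above.

```rust
/// Ratio of a baseline duration to an optimized duration, both in seconds.
fn speedup(baseline_secs: f64, optimized_secs: f64) -> f64 {
    baseline_secs / optimized_secs
}

fn main() {
    // Median server_setup times, CPU-only run (reported in minutes).
    let cpu_row3_secs = 2.585 * 60.0;
    let cpu_row4_secs = 2.543 * 60.0;

    // Median server_setup times with the gpu feature (reported in seconds).
    let gpu_row3_secs = 12.45;
    let gpu_row4_secs = 11.86;

    println!("row 3 speedup: {:.2}x", speedup(cpu_row3_secs, gpu_row3_secs));
    println!("row 4 speedup: {:.2}x", speedup(cpu_row4_secs, gpu_row4_secs));
}
```

Both rows come out between roughly 12x and 13x, consistent with the quoted figure. Note that the CPU run collected only 2 samples per row, so its median is a coarse estimate.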

Signed-off-by: Anjan Roy <[email protected]>

itzmeanjan · 2025-04-06T06:34:59Z

I benchmarked server-setup on an AWS EC2 g6e.8xlarge instance, which features an NVIDIA L40S Tensor Core GPU.

Server-setup on CPU

Server-setup, partially offloaded to GPU

Note

Server-setup can be offloaded to the GPU by enabling the gpu feature. You need to install the Vulkan drivers and libraries for this feature to work.
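
As a rough illustration of how a non-default Cargo feature can select between a CPU and a GPU code path at compile time, consider the sketch below. The function name `server_setup_backend` and the returned strings are hypothetical and do not come from the ChalametPIR source. It also assumes a `gpu` feature is declared under `[features]` in the crate's Cargo.toml.

```rust
/// Compiled in only when the crate is built with `--features gpu`.
#[cfg(feature = "gpu")]
fn server_setup_backend() -> &'static str {
    "gpu (Vulkan compute)"
}

/// Default path, used when the gpu feature is not enabled.
#[cfg(not(feature = "gpu"))]
fn server_setup_backend() -> &'static str {
    "cpu"
}

fn main() {
    println!("server-setup backend: {}", server_setup_backend());
}
```

Because the feature is off by default, users without Vulkan drivers get the CPU path and never compile or link the GPU code.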

itzmeanjan added 27 commits March 19, 2025 16:02

Add dependencies for new feature gpu

28a85da

Signed-off-by: Anjan Roy <[email protected]>

Use u32 for matrix dimensions

11e5b77

Signed-off-by: Anjan Roy <[email protected]>

Add compute shader for matrix-matrix multiplication

cfb9124

Shader is taken from https://gist.github.com/itzmeanjan/84613bc7595372c5e6b6c22481d42f9a Signed-off-by: Anjan Roy <[email protected]>

Setup a Vulkan device and queue so that commands can be submitted to it

4cb8d97

Signed-off-by: Anjan Roy <[email protected]>

Setup gpu returns a memory allocator and command buffer allocator too

4b9bac8

Signed-off-by: Anjan Roy <[email protected]>

Given a matrix, returns a buffer with transfer-src flag set

9c2ba00

Signed-off-by: Anjan Roy <[email protected]>

Add error enum for vulkan buffer creation failure

dce7da2

Signed-off-by: Anjan Roy <[email protected]>

Simplify return in matrix to transfer source buffer function

2b8d84c

Signed-off-by: Anjan Roy <[email protected]>

Add function recording Vulkan buffer to buffer data transfer command

96552dc

Signed-off-by: Anjan Roy <[email protected]>

Make error type more explicit

0e21934

Signed-off-by: Anjan Roy <[email protected]>

Add function to create empty Vulkan storage buffer

1b3c3bc

Signed-off-by: Anjan Roy <[email protected]>

Add function to submit transfer command buffer to queue and wait till…

e526074

… it finishes Signed-off-by: Anjan Roy <[email protected]>

Rename error enum variant to be more generic

db5aca1

Signed-off-by: Anjan Roy <[email protected]>

Add function for computing number of bytes required to encode matrix

8ff3965

Signed-off-by: Anjan Roy <[email protected]>

Matrix-matrix multiplication command submission and execution on GPU …

9f4e0ea

…queue Signed-off-by: Anjan Roy <[email protected]>

Reformat GLSL compute shader using clang-format

3d5757b

Signed-off-by: Anjan Roy <[email protected]>

Add matrix transpose compute shader

1cc4806

Signed-off-by: Anjan Roy <[email protected]>

Submit and wait for matrix transpose job to finish on GPU

98a0746

Signed-off-by: Anjan Roy <[email protected]>

Fix matrix transpose shader

679bc17

Signed-off-by: Anjan Roy <[email protected]>

Refactor function for transferring host matrix to device

3f33f81

Signed-off-by: Anjan Roy <[email protected]>

Maintain two different functions for host-accessible and device-local…

9b50f41

… buffer creation Signed-off-by: Anjan Roy <[email protected]>

Implementation server-setup phase for gpu feature

450d7dc

Signed-off-by: Anjan Roy <[email protected]>

Add row-vector transposed matrix multiplication compute shader

3be9c22

Signed-off-by: Anjan Roy <[email protected]>

Implement server-respond function, using gpu feature

40ba459

Signed-off-by: Anjan Roy <[email protected]>

Change work-group size for vector-matrix multiplication shader invoca…

ec4a802

…tion Signed-off-by: Anjan Roy <[email protected]>

Duplicate comment for gpu feature-gated version of server-respond…

1d2ed91

… function Signed-off-by: Anjan Roy <[email protected]>

Avoid computing vector-matrix multiplication on GPU during `server-re…

fe5ce49

…spond` Signed-off-by: Anjan Roy <[email protected]>

itzmeanjan added 2 commits April 6, 2025 11:50

Update project documentation mentioning about the gpu feature gate

1e391ad

Signed-off-by: Anjan Roy <[email protected]>

Prepare for release v0.5.0

42a6736

Signed-off-by: Anjan Roy <[email protected]>

itzmeanjan merged commit 0646d4e into main Apr 6, 2025
5 checks passed

itzmeanjan deleted the integrate-mat-mul-on-gpu branch April 6, 2025 06:55
