@cpubot cpubot commented Dec 5, 2025

The solana-short-vec ShortU16 implementation incurs unnecessary overhead from:

  • Masking the value bits before checking the continuation bit (wasted instructions on the theoretically hot 1-byte path)
  • A loop with a dynamic i * 7 shift
  • Defensive, technically unnecessary saturating_* / checked_* ops that guard against theoretical loop edge cases
  • All the serde badness when paired with bincode
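
For illustration, the loop-style pattern described in the list above might look roughly like the sketch below. This is not the actual solana-short-vec source: the function name `decode_short_u16_loop`, the `MAX_LEN` constant, and the `Option<(u16, usize)>` return type (value, bytes consumed) are assumptions made for the example, and the serde visitor plumbing is left out entirely.

```rust
use std::convert::TryFrom;

/// Illustrative loop-style ShortU16 decoder (hypothetical; not the real
/// solana-short-vec code). Returns the value and the number of bytes read.
pub fn decode_short_u16_loop(bytes: &[u8]) -> Option<(u16, usize)> {
    // A ShortU16 occupies at most 3 bytes: 7 + 7 + 2 value bits.
    const MAX_LEN: usize = 3;
    // Accumulate in a wider type, then narrow on every iteration.
    let mut val: u32 = 0;
    for (i, &byte) in bytes.iter().take(MAX_LEN).enumerate() {
        // A zero byte after the first is a non-canonical (alias) encoding.
        if byte == 0 && i != 0 {
            return None;
        }
        // The value mask is computed before the continuation bit is checked,
        // even on the 1-byte path where the mask is a no-op.
        let elem_val = u32::from(byte & 0x7f);
        let done = (byte & 0x80) == 0;
        // The last allowed byte must not ask for a continuation.
        if i == MAX_LEN - 1 && !done {
            return None;
        }
        // Dynamic shift of i * 7, guarded by saturating/checked ops.
        let shift = (i as u32).saturating_mul(7);
        val |= elem_val.checked_shl(shift)?;
        // Narrowing check: rejects e.g. a third byte greater than 3.
        let narrowed = u16::try_from(val).ok()?;
        if done {
            return Some((narrowed, i + 1));
        }
    }
    // Input ended (or was empty) before a terminating byte was seen.
    None
}
```

With this sketch, `decode_short_u16_loop(&[0xff, 0xff, 0x03])` yields `Some((65535, 3))`, while the alias `[0x80, 0x00]` and the overflowing `[0xff, 0xff, 0x04]` are both rejected.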

The compiler actually does a decent job optimizing this, but it can be improved.
https://rust.godbolt.org/z/rojK31Wf4

This implementation:

  • Is fully unrolled with constant shifts (7, 14)
  • Uses no unnecessary widening or defensive ops; the unrolled structure makes overflow impossible by construction
  • Is a const fn
  • Has no serde dependency
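
A minimal sketch of what an unrolled decoder with these properties could look like is shown below. It is not the code from this PR (see the godbolt link for that); the signature, the `Option<(u16, usize)>` return type, and the exact rejection rules for alias and overflowing encodings are assumptions made for the example.

```rust
/// Hypothetical unrolled ShortU16 decoder (a sketch, not this PR's code).
/// Returns the decoded value and the number of bytes consumed.
pub const fn decode_short_u16(bytes: &[u8]) -> Option<(u16, usize)> {
    if bytes.is_empty() {
        return None;
    }
    let b0 = bytes[0];
    // Hot 1-byte path: test the continuation bit first; no mask is needed
    // because a byte below 0x80 already is the value.
    if b0 < 0x80 {
        return Some((b0 as u16, 1));
    }
    if bytes.len() < 2 {
        return None;
    }
    let b1 = bytes[1];
    // A trailing zero byte is a non-canonical (alias) encoding.
    if b1 == 0 {
        return None;
    }
    let low = (b0 & 0x7f) as u16;
    if b1 < 0x80 {
        // Constant shift by 7: at most 14 value bits, so u16 cannot overflow.
        return Some((low | ((b1 as u16) << 7), 2));
    }
    if bytes.len() < 3 {
        return None;
    }
    let b2 = bytes[2];
    // The third byte carries only the top 2 bits. Zero is an alias; anything
    // above 3 would either overflow u16 or set the continuation bit.
    if b2 == 0 || b2 > 3 {
        return None;
    }
    // Constant shifts by 7 and 14: 7 + 7 + 2 bits fit exactly in a u16.
    Some((low | (((b1 & 0x7f) as u16) << 7) | ((b2 as u16) << 14), 3))
}
```

Because each byte position has its own branch, every shift amount is a compile-time constant and the range check on the third byte is the only overflow guard required. Being a `const fn`, the sketch can also be evaluated at compile time, e.g. `const D: Option<(u16, usize)> = decode_short_u16(&[0xff, 0x7f]);`.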

https://rust.godbolt.org/z/eYzq1G8oT

Results:

ShortU16/solana_short_vec:decode_shortu16_len/127
                        time:   [8.1727 ns 8.1882 ns 8.2045 ns]
                        thrpt:  [116.24 MiB/s 116.47 MiB/s 116.69 MiB/s]
ShortU16/wincode:decode_short_u16/127
                        time:   [1.4131 ns 1.4156 ns 1.4179 ns]
                        thrpt:  [672.57 MiB/s 673.71 MiB/s 674.90 MiB/s]
ShortU16/solana_short_vec:decode_shortu16_len/16383
                        time:   [8.0641 ns 8.0804 ns 8.0978 ns]
                        thrpt:  [235.54 MiB/s 236.05 MiB/s 236.52 MiB/s]
ShortU16/wincode:decode_short_u16/16383
                        time:   [1.5453 ns 1.5522 ns 1.5588 ns]
                        thrpt:  [1.1950 GiB/s 1.2000 GiB/s 1.2054 GiB/s]
ShortU16/solana_short_vec:decode_shortu16_len/65535
                        time:   [8.4347 ns 8.4469 ns 8.4589 ns]
                        thrpt:  [338.23 MiB/s 338.71 MiB/s 339.20 MiB/s]
ShortU16/wincode:decode_short_u16/65535
                        time:   [1.7430 ns 1.7457 ns 1.7484 ns]
                        thrpt:  [1.5980 GiB/s 1.6005 GiB/s 1.6030 GiB/s]

Results are, unsurprisingly, more dramatic without black_box, since the compiler can constant-fold, e.g.:

ShortU16/solana_short_vec:decode_shortu16_len/65535
                        time:   [8.1254 ns 8.1398 ns 8.1548 ns]
                        thrpt:  [350.84 MiB/s 351.48 MiB/s 352.11 MiB/s]
                 change:
                        time:   [-3.9026% -3.6384% -3.3188%] (p = 0.00 < 0.05)
                        thrpt:  [+3.4328% +3.7757% +4.0611%]
                        Performance has improved.
ShortU16/wincode:decode_short_u16/65535
                        time:   [599.87 ps 600.92 ps 602.20 ps]
                        thrpt:  [4.6396 GiB/s 4.6495 GiB/s 4.6576 GiB/s]
                 change:
                        time:   [-65.392% -65.287% -65.146%] (p = 0.00 < 0.05)
                        thrpt:  [+186.91% +188.07% +188.95%]
                        Performance has improved.

t-nelson commented Dec 5, 2025

We need to be very careful with review here. This type has been responsible for at least two bounty payouts.
