Description
One realization is that the hot case for Result<T, E> is usually the Ok() case. And in many cases, T is actually some sort of NonNull pointer: a ref, a Box, etc.
The current layout for Result<NonNull<T>, E> is (tag, union { NonNull, E }), which means that either way, the code needs to read the tag and then read the union.
If the layout were instead (union { tag, NonNull }, E), the common case would become a single read.
The generalization could be formulated like this: When the tag is a boolean, and the first variant is a NonNull/NonZero type, the first variant is stored in place of the tag, and the invalid zero value acts as tag for the second variant.
So Ok(value) would be (value, undefined), and Err(e) would be (0, e).
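To make the proposed layout concrete, here is a minimal hand-rolled sketch of what the compiler would be doing. PackedResult and its methods are hypothetical names made up for illustration, and the sketch uses the stable std::num::NonZeroUsize rather than the unstable NonZero wrapper used elsewhere in this issue. It also restricts E to Copy types to sidestep drop handling, which a real implementation would have to deal with.

```rust
use std::mem::MaybeUninit;
use std::num::NonZeroUsize;

/// Hypothetical stand-in for Result<NonZeroUsize, E> laid out as
/// (union { tag, NonZero }, E): the first word doubles as the tag.
#[repr(C)]
pub struct PackedResult<E: Copy> {
    /// 0 means Err; any non-zero value is the Ok payload itself.
    first: usize,
    /// Only initialized when `first == 0`.
    second: MaybeUninit<E>,
}

impl<E: Copy> PackedResult<E> {
    /// Ok(value) is stored as (value, undefined).
    pub fn ok(value: NonZeroUsize) -> Self {
        PackedResult { first: value.get(), second: MaybeUninit::uninit() }
    }

    /// Err(e) is stored as (0, e).
    pub fn err(e: E) -> Self {
        PackedResult { first: 0, second: MaybeUninit::new(e) }
    }

    /// Equivalent of Result::ok(): a single read of the first word.
    pub fn get_ok(&self) -> Option<NonZeroUsize> {
        NonZeroUsize::new(self.first)
    }

    /// Converts back to an ordinary Result.
    pub fn into_result(self) -> Result<NonZeroUsize, E> {
        match NonZeroUsize::new(self.first) {
            Some(value) => Ok(value),
            // SAFETY: `first` is 0 only when built by `err`, which
            // always initializes `second`.
            None => Err(unsafe { self.second.assume_init() }),
        }
    }
}
```

In this sketch, get_ok never touches the second word, which is the property the proposal is after for the Ok() path.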
Some code to show the benefits of this optimization:
#![feature(nonzero)]
extern crate core;
use core::nonzero::NonZero;

pub struct Foo(usize, usize);

impl Foo {
    fn as_result(&self) -> Result<NonZero<usize>, usize> {
        if self.0 > 0 {
            Ok(unsafe { NonZero::new_unchecked(self.0) })
        } else {
            Err(self.1)
        }
    }
}

pub fn foo(f: &Foo) -> Option<NonZero<usize>> {
    f.as_result().ok()
}

pub fn foo_unwrap(f: &Foo) -> usize {
    f.as_result().unwrap().get()
}

pub fn bar(f: &Result<NonZero<usize>, usize>) -> Option<NonZero<usize>> {
    f.ok()
}

pub fn bar_unwrap(f: &Result<NonZero<usize>, usize>) -> usize {
    f.unwrap().get()
}
On godbolt, this compiles to the following:
example::foo:
    push rbp
    mov rbp, rsp
    mov rax, qword ptr [rdi]
    pop rbp
    ret

example::foo_unwrap:
    mov rax, qword ptr [rdi]
    test rax, rax
    jne .LBB4_2
    mov rax, qword ptr [rdi + 8]
.LBB4_2:
    je .LBB4_3
    ret
.LBB4_3:
    push rbp
    mov rbp, rsp
    mov rdi, rax
    call core::result::unwrap_failed
    ud2

example::bar:
    push rbp
    mov rbp, rsp
    cmp qword ptr [rdi], 1
    je .LBB5_1
    mov rax, qword ptr [rdi + 8]
    pop rbp
    ret
.LBB5_1:
    xor eax, eax
    pop rbp
    ret

example::bar_unwrap:
    mov rax, qword ptr [rdi + 8]
    cmp qword ptr [rdi], 1
    je .LBB6_1
    ret
.LBB6_1:
    push rbp
    mov rbp, rsp
    mov rdi, rax
    call core::result::unwrap_failed
    ud2
This doesn't really remove branches in the example above, but it removes a memory read in the common case. The data is probably in the same cache line, or in the next prefetched one, but that's still fewer instructions to execute. In some cases, I've seen the compiler use cmov instead of a branch, though.
Note the compiler does a poor job with foo_unwrap, for some reason... manually inlining as_result() makes it generate better code.
This could be applied to slices too: Result<&[T], E> could become (union { tag, slice-ptr }, union { slice-size, E }), which would even make the type smaller than (tag, union { (slice-ptr, slice-size), E }).
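The size argument for slices can be checked with a hand-written sketch of both layouts. All type and function names below are hypothetical, E is restricted to Copy types for simplicity, and #[repr(C)] is used only to pin down field order for the comparison. This illustrates the idea and makes no claim about what rustc actually emits for Result<&[T], E>.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

/// Second word of the proposed layout: union { slice-size, E }.
#[repr(C)]
union LenOrErr<E: Copy> {
    len: usize,
    err: E,
}

/// Proposed layout: (union { tag, slice-ptr }, union { slice-size, E }).
#[repr(C)]
pub struct PackedSliceResult<'a, T, E: Copy> {
    /// A null pointer means Err; anything else is the slice pointer.
    ptr: *const T,
    second: LenOrErr<E>,
    _marker: PhantomData<&'a [T]>,
}

impl<'a, T, E: Copy> PackedSliceResult<'a, T, E> {
    pub fn ok(s: &'a [T]) -> Self {
        PackedSliceResult { ptr: s.as_ptr(), second: LenOrErr { len: s.len() }, _marker: PhantomData }
    }

    pub fn err(e: E) -> Self {
        PackedSliceResult { ptr: std::ptr::null(), second: LenOrErr { err: e }, _marker: PhantomData }
    }

    pub fn into_result(self) -> Result<&'a [T], E> {
        if self.ptr.is_null() {
            // SAFETY: a null `ptr` is only produced by `err`, which wrote `err`.
            Err(unsafe { self.second.err })
        } else {
            // SAFETY: a non-null `ptr` and `len` come from a valid &'a [T] in `ok`.
            Ok(unsafe { std::slice::from_raw_parts(self.ptr, self.second.len) })
        }
    }
}

/// Tagged layout for comparison: (tag, union { (slice-ptr, slice-size), E }),
/// specialized to T = u8 and E = usize.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct RawSlice {
    ptr: *const u8,
    len: usize,
}

#[allow(dead_code)]
#[repr(C)]
union SliceOrErr {
    slice: RawSlice,
    err: usize,
}

#[allow(dead_code)]
#[repr(C)]
struct TaggedSliceResult {
    tag: u8,
    payload: SliceOrErr,
}

/// Returns (packed size, tagged size) in bytes for T = u8, E = usize.
pub fn layout_sizes() -> (usize, usize) {
    (size_of::<PackedSliceResult<u8, usize>>(), size_of::<TaggedSliceResult>())
}
```

With E = usize, the packed form takes two words while the tagged form takes three, because the tag is padded to pointer alignment.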