BitVector::build_index: 100x speedup #28

jmr · 2020-06-02T13:08:58Z

Process the BitVector unit-by-unit instead of bit-by-bit.

Use PopCount::count() to update num_1s and use select_bit to find
the bit positions for the select0s_ and select1s_ indexes.
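
As a rough illustration of the unit-by-unit idea, here is a minimal sketch, not the code from this pull request. The names `popcount64`, `select_in_unit`, `SelectIndex`, `build_select1` and the sampling interval `kSample` are hypothetical stand-ins; the portable loops take the place of a hardware popcnt and a bmi2-based select, and only the select1s side is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for PopCount::count(): number of set bits in a unit.
inline std::size_t popcount64(std::uint64_t x) {
  std::size_t n = 0;
  for (; x != 0; x &= x - 1) ++n;  // each step clears the lowest set bit
  return n;
}

// Hypothetical stand-in for select_bit: position of the i-th (0-based)
// set bit inside one unit. Requires i < popcount64(x).
inline std::size_t select_in_unit(std::uint64_t x, std::size_t i) {
  assert(i < popcount64(x));
  for (; i > 0; --i) x &= x - 1;  // drop the i lowest set bits
  std::size_t pos = 0;
  while ((x & 1) == 0) {          // scan to the remaining lowest set bit
    x >>= 1;
    ++pos;
  }
  return pos;
}

// Assumed sampling interval: one select entry per kSample set bits.
constexpr std::size_t kSample = 512;

struct SelectIndex {
  std::size_t num_1s = 0;
  std::vector<std::size_t> select1s;  // position of every kSample-th 1 bit
};

// One popcount per 64-bit unit; select is only called for units that
// contain a sampled 1 bit, so most units cost a single popcount.
SelectIndex build_select1(const std::vector<std::uint64_t>& units) {
  SelectIndex index;
  for (std::size_t u = 0; u < units.size(); ++u) {
    const std::size_t count = popcount64(units[u]);
    // Smallest sampled rank that is >= the number of 1s seen so far.
    std::size_t next = (index.num_1s + kSample - 1) / kSample * kSample;
    for (; next < index.num_1s + count; next += kSample) {
      index.select1s.push_back(
          u * 64 + select_in_unit(units[u], next - index.num_1s));
    }
    index.num_1s += count;
  }
  return index;
}
```

A select0s index could be built the same way by running the loop over the complemented units.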

According to my benchmarks, the old bit-by-bit version processed a
256kbit vector at about 20MB/s independent of enables_select0 and
enables_select1.

The new version is 50x-150x faster, depending on the compiler and build_index
options.

enables_select0=enables_select1=false:
no popcnt, no bmi2: 1600MB/s
popcnt, no bmi2: 2500MB/s
popcnt and bmi2: 2900MB/s

enables_select0=enables_select1=true:
no popcnt, no bmi2: 1100MB/s
popcnt, no bmi2: 1600MB/s
popcnt and bmi2: 1800MB/s

s-yata · 2020-06-16T05:23:11Z

Benchmark

Data: enwiki-20191001-all-titles-in-ns0 (downloaded from https://dumps.wikimedia.org/enwiki/)
Tool: marisa-benchmark

The following table shows build speed [1,000 keys/second].

#tries	s-yata:master [K/s]	jmr:build-index [K/s]
1	1,054.64	1,087.85
2	915.17	937.07
3	901.00	920.59
4	896.08	914.57
5	894.30	912.47

jmr:build-index is 2-3% faster than s-yata:master.

jmr · 2020-06-16T07:51:37Z

> jmr:build-index is 2-3% faster than s-yata:master.

Did you configure with --enable-native-code? popcnt and select_bit are going to be important.

My benchmark was just on BitVector::build_index. I don't know what fraction of marisa-benchmark is spent in build_index, so I can't say whether more than 2-3% is expected.

I will have time to run/profile the benchmarks myself later in the week.

s-yata · 2020-06-16T07:58:42Z

The table shows the speed of dictionary construction and BitVector::build_index is not a major part of it.
However, I think the improvement is enough to accept this pull request.

s-yata · 2020-06-17T06:27:22Z

It looks good to me.
Thank you!

glebm approved these changes Jun 2, 2020

jmr added 2 commits June 2, 2020 09:15

Add missing <algorithm> include for std::min

b814b23

Add static_cast to suppress -Wconversion warning

139e1b6

This is safe and there is no truncation.

jmr commented Jun 2, 2020

lib/marisa/grimoire/vector/bit-vector.cc

jmr mentioned this pull request Jun 3, 2020

select_bit: Extract 32-bit non-SSE2 version #29

Merged

jmr added 2 commits June 3, 2020 08:28

Merge branch 'select-bit-32' into build-index

638e59c

The 32-bit select_bit will be used to make the new build_index implementation work for MARISA_WORD_SIZE == 32.

select_bit: Add overload taking a single UInt32

1f62787

This is already used by build_index and fixes the 32-bit build.
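
For readers unfamiliar with the operation, a select on a single 32-bit word can be sketched portably as below. This is an illustrative assumption, not the overload added in this commit: `popcount32` and `select_bit32` are hypothetical names, and the halving search is just one simple way to do it without popcnt or bmi2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branch-free population count of a 32-bit word (classic SWAR reduction).
inline unsigned popcount32(std::uint32_t x) {
  x = x - ((x >> 1) & 0x55555555u);
  x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  x = (x + (x >> 4)) & 0x0F0F0F0Fu;
  return (x * 0x01010101u) >> 24;
}

// Position of the i-th (0-based) set bit of x. Requires i < popcount32(x).
// Halves the search window each step: if the low half holds more than i
// set bits the target is there, otherwise skip it and adjust i.
inline std::size_t select_bit32(std::uint32_t x, std::size_t i) {
  assert(i < popcount32(x));
  std::size_t pos = 0;
  for (unsigned width = 16; width != 0; width /= 2) {
    const std::uint32_t low = x & ((std::uint32_t{1} << width) - 1);
    const unsigned count = popcount32(low);
    if (i >= count) {
      i -= count;    // skip the set bits in the low half
      x >>= width;   // continue in the high half
      pos += width;
    } else {
      x = low;       // continue in the low half
    }
  }
  return pos;
}
```

With a word-sized select like this, a build_index loop can process 32-bit units the same way it processes 64-bit ones.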

jmr commented Jun 3, 2020

lib/marisa/grimoire/vector/bit-vector.cc

s-yata self-assigned this Jun 15, 2020

s-yata added the enhancement label Jun 15, 2020

Merge branch 'master' into build-index

ca5561a

s-yata merged commit 0873e86 into s-yata:master Jun 17, 2020

jmr deleted the build-index branch May 20, 2025 15:48
