Description
When trying to beat git verify-pack
Attempt 1
I remembered cold-cache timings of roughly 5:50 min for git to run verify-pack on the Linux kernel pack. However, it turns out that in a more controlled environment git is still considerably faster than we are, even though we use an LRU cache and make quite efficient use of multiple cores.
Observation
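Git uses a streaming pack approach which is optimized to apply deltas in inverse order: starting at each base and working outward to the objects that depend on it, rather than walking from each object back to its base. It works by
- decompressing all deltas
- applying all deltas that depend on a base, recursively (which avoids having to decompress any delta more than once)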
We work on a memory-mapped file, which is optimized for random access but is not very fast for this kind of workload.
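To make the difference concrete, here is a minimal Python sketch of the two resolution orders on a toy pack. Everything in it is hypothetical: the `PACK` layout, the function names, and the rule that a delta simply appends its payload to its base (real git deltas are copy/insert instructions, and real entries are zlib-compressed). It also leaves out our LRU cache, so it shows the worst case for the per-object walk, not our actual numbers.

```python
from collections import defaultdict

# Toy pack: id -> (base_id or None, payload). Assumption: applying a delta
# just appends its payload to the resolved base.
PACK = {
    "a": (None, "A"),
    "b": ("a", "B"),
    "c": ("b", "C"),
    "d": ("b", "D"),
}

def decompress(pack, oid, counter):
    """Stand-in for inflating an entry; counts how often each one is inflated."""
    counter[oid] += 1
    return pack[oid]

def resolve_random_access(pack):
    """Resolve every object on its own by walking its chain back to the base."""
    counter, out = defaultdict(int), {}
    for oid in pack:
        chain, cur = [], oid
        while cur is not None:
            base, payload = decompress(pack, cur, counter)
            chain.append(payload)
            cur = base
        out[oid] = "".join(reversed(chain))
    return out, sum(counter.values())

def resolve_base_first(pack):
    """Inflate each entry once, then apply deltas from the bases outward."""
    counter, out = defaultdict(int), {}
    children, entries = defaultdict(list), {}
    for oid in pack:
        base, payload = decompress(pack, oid, counter)
        entries[oid] = payload
        children[base].append(oid)
    # Each stack item carries the already resolved data of its base.
    stack = [(oid, "") for oid in children[None]]
    while stack:
        oid, base_data = stack.pop()
        out[oid] = base_data + entries[oid]
        stack.extend((child, out[oid]) for child in children[oid])
    return out, sum(counter.values())
```

Both functions produce the same objects for this pack, but the per-object walk inflates 9 entries while the base-first pass inflates 4, one per entry. The gap grows with chain length, which is the part a cache can only partially hide.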
How to fix
Wait until we have implemented a streaming pack as well and try again. We would then have the same algorithmic benefits, possibly paired with more efficient memory handling.
Git for some reason limits the application to 3 threads. We do benefit from having more threads, so we could be faster for that reason alone.
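A rough sketch of why more threads can help, in Python: delta trees rooted at different bases do not depend on each other, so each tree can go to its own worker. The names `resolve_tree` and `resolve_parallel`, the toy `PACK`, and the append-only delta rule are all assumptions made for illustration, not how git or our code does it. In CPython the GIL keeps this pure-Python work from actually running in parallel, so the sketch only shows the partitioning and the configurable thread count.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Toy pack with two independent delta trees, rooted at "a" and "x".
# Assumption: applying a delta appends its payload to the resolved base.
PACK = {
    "a": (None, "A"), "b": ("a", "B"), "c": ("b", "C"),
    "x": (None, "X"), "y": ("x", "Y"),
}

def resolve_tree(root, pack, children):
    """Resolve one base and everything that depends on it, base first."""
    out, stack = {}, [(root, "")]
    while stack:
        oid, base_data = stack.pop()
        out[oid] = base_data + pack[oid][1]
        stack.extend((child, out[oid]) for child in children.get(oid, []))
    return out

def resolve_parallel(pack, threads):
    """Hand each independent tree to a worker; `threads` is not capped."""
    children = defaultdict(list)
    for oid, (base, _payload) in pack.items():
        children[base].append(oid)
    children = dict(children)  # read-only from here on, safe to share
    roots = children.get(None, [])
    resolved = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for partial in pool.map(lambda r: resolve_tree(r, pack, children), roots):
            resolved.update(partial)
    return resolved
```

Calling `resolve_parallel(PACK, threads=8)` gives the same result as a single-threaded run. How well the work balances depends on how evenly objects are spread across trees, which this toy pack does not model.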
The streaming (indexing) phase of reading a pack can be parallelised if we have a pack on disk, and it should be easy to implement if the index data structure itself is thread-safe (but it might not be worth the complexity or memory overhead, let's see).