optimization: batch XOR operations 12% faster IBD (!31144)

l0rinc requested to merge github/fork/l0rinc/l0rinc/optimize-xor into master Oct 24, 2024

Depends on https://github.com/bitcoin/bitcoin/pull/31551.

The descriptions and stats will be updated once the PRs this one depends on are merged, but so far it seems that the 3 PRs combined can achieve a 12% IBD speedup (with GCC at least).

Anyone can reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5

Block XOR obfuscations introduced a measurable serialization cost (seen during IBD benchmarking). This PR is meant to alleviate that.

Summary

The obfuscation's effect on IBD seems to be about 4% when it still has to be applied, and about 9% when it's turned off.

Changes in testing, benchmarking and implementation

Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
Updated the benchmark to reflect realistic scenarios by capturing every call of util::Xor for the first 860k blocks, creating a model with the same data distribution. An additional benchmark checks the effect of short-circuiting XOR when the key is zero, ensuring no speed regression occurs when the obfuscation feature is disabled.
Optimized the Xor function to process in batches (64/32/16/8 bits instead of per-byte).
Migrated remaining std::vector<std::byte>(8) values to uint64_t.

Reproducer and assembly

Memory alignment is handled via std::memcpy, optimized out on tested platforms (see https://godbolt.org/z/GGYcedjzY):

Clang (x86-64) - 32 bytes/iter using SSE vector operations
GCC (x86-64) - 16 bytes/iter using unrolled 64-bit XORs
RISC-V (32-bit) - 8 bytes/iter using load/XOR/store sequence
s390x (big-endian) - 64 bytes/iter with unrolled 8-byte XORs

Endianness

The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption. Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order. Since CI lacks a big-endian machine, testing was done locally via Docker.

Details

brew install podman pigz
softwareupdate --install-rosetta
podman machine init
podman machine start
docker run --platform linux/s390x -it ubuntu:latest /bin/bash
    apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev && \
    cd /mnt && git clone https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
    cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
    ./build/src/test/test_bitcoin --run_test=streams_tests

Performance

   cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build -j$(nproc) \
&& build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000

The 860k-block profile contains many very large arrays (96'233 distinct sizes, the largest being 3'992'470 bytes long), a big departure from the previous 400k- and 700k-block profiles (1500 sizes, the largest being 9319 bytes long).

The performance characteristics are also quite different, now that we have more and bigger byte arrays:

C++ compiler .......................... AppleClang 16.0.0.16000026

Before:

ns/byte	byte/s	err%	total	benchmark
1.00	1,000,913,427.27	0.7%	10.20	`AutoFileXor`
0.85	1,173,442,964.60	0.2%	11.16	`XorHistogram`

After:

ns/byte	byte/s	err%	total	benchmark
0.09	11,204,183,007.86	0.6%	11.08	`AutoFileXor`
0.15	6,459,482,269.06	0.3%	10.97	`XorHistogram`

i.e. ~11x/5.5x faster (disabled/enabled) with Clang at processing the data with representative histograms.

C++ compiler .......................... GNU 13.2.0

Before:

ns/byte	byte/s	err%	ins/byte	cyc/byte	IPC	bra/byte	miss%	total	benchmark
1.87	535,253,389.72	0.0%	9.20	3.45	2.669	1.03	0.1%	11.02	`AutoFileXor`
1.70	587,844,715.57	0.0%	9.35	5.41	1.729	1.05	1.7%	10.95	`XorHistogram`

After:

ns/byte	byte/s	err%	ins/byte	cyc/byte	IPC	bra/byte	miss%	total	benchmark
0.59	1,706,433,032.76	0.1%	0.00	0.00	0.620	0.00	1.8%	11.01	`AutoFileXor`
0.47	2,145,375,849.71	0.0%	0.95	1.48	0.642	0.20	9.6%	10.93	`XorHistogram`

i.e. ~3.2x/3.6x faster (disabled/enabled) with GCC at processing the data with representative histograms.

A few other benchmarks seem to have improved as well (tested with Clang only):

Before:

ns/op	op/s	err%	total	benchmark
2,237,168.64	446.99	0.3%	10.91	`ReadBlockFromDiskTest`
748,837.59	1,335.40	0.2%	10.68	`ReadRawBlockFromDiskTest`

After:

ns/op	op/s	err%	total	benchmark
1,827,436.12	547.21	0.7%	10.95	`ReadBlockFromDiskTest`
49,276.48	20,293.66	0.2%	10.99	`ReadRawBlockFromDiskTest`

Also visible on https://corecheck.dev/bitcoin/bitcoin/pulls/31144
