
Description

Modern ad tech platforms need to answer one question billions of times per day: "Which users belong to Segment A AND Segment B, but NOT Segment C?" Storing user IDs in lists costs a lot of memory and forces slow pointer chasing, so BitFilter represents each segment as a dense bitmap in which one bit maps to one user. This turns a boolean query into pure bitwise logic over hundreds of megabytes of data, executed at memory-bandwidth speed using SIMD.
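
The bitmap idea can be illustrated with a minimal scalar sketch. It is not BitFilter's actual code: the SegmentBitmap type and the query_and_and_not function are hypothetical names chosen for this example, and the real engine presumably uses aligned allocations and SIMD rather than std::vector and a plain loop. At one bit per user, a segment over 500M users occupies about 62.5 MB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration type, not BitFilter's real class.
// Bit i of the bitmap is 1 when user i belongs to the segment.
struct SegmentBitmap {
    std::vector<std::uint64_t> words;

    explicit SegmentBitmap(std::size_t num_users)
        : words((num_users + 63) / 64, 0) {}

    void add(std::size_t user) {
        words[user / 64] |= std::uint64_t{1} << (user % 64);
    }

    bool contains(std::size_t user) const {
        return ((words[user / 64] >> (user % 64)) & 1u) != 0;
    }
};

// Evaluates "A AND B AND NOT C", 64 users per loop iteration.
SegmentBitmap query_and_and_not(const SegmentBitmap& a,
                                const SegmentBitmap& b,
                                const SegmentBitmap& c) {
    assert(a.words.size() == b.words.size());
    assert(b.words.size() == c.words.size());
    SegmentBitmap out(a.words.size() * 64);
    for (std::size_t i = 0; i < a.words.size(); ++i) {
        out.words[i] = a.words[i] & b.words[i] & ~c.words[i];
    }
    return out;
}
```

The query touches every word exactly once and in order, which is what lets the hardware stream the data instead of chasing pointers.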

The project is built around mechanical sympathy: writing code that respects how the CPU physically moves data from RAM to registers. The hardware was first characterized with Coreinfo64 and CPU-Z (cache hierarchy, DRAM bandwidth ceiling, SIMD support), and every allocation, loop, and prefetch hint was then shaped around those constraints. With AVX2's 256-bit YMM registers, a single _mm256_andnot_si256 instruction, which computes (NOT a) AND b, excludes one segment from another for 256 users at a time.
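
The AVX2 step described above can be sketched as follows. This is an illustrative kernel under stated assumptions, not the project's implementation: the function names are hypothetical, the build is assumed to be GCC or Clang on x86-64 (for the target attribute and __builtin_cpu_supports), and unaligned loads are used for simplicity where a tuned engine would more likely use aligned buffers and prefetch hints. Note the operand order: _mm256_andnot_si256(x, y) negates its first argument, so the excluded segment goes first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAS_AVX2_PATH 1
#endif

// Scalar reference; also handles the tail the vector loop leaves behind.
void and_and_not_scalar(const std::uint64_t* a, const std::uint64_t* b,
                        const std::uint64_t* c, std::uint64_t* out,
                        std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        out[i] = a[i] & b[i] & ~c[i];
    }
}

#ifdef SKETCH_HAS_AVX2_PATH
// Processes 4 x 64-bit words (256 users) per iteration.
__attribute__((target("avx2")))
void and_and_not_avx2(const std::uint64_t* a, const std::uint64_t* b,
                      const std::uint64_t* c, std::uint64_t* out,
                      std::size_t n_words) {
    std::size_t i = 0;
    for (; i + 4 <= n_words; i += 4) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        __m256i vc = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(c + i));
        __m256i ab = _mm256_and_si256(va, vb);
        // (~vc) & ab: keep users in A and B, drop those in C.
        __m256i r = _mm256_andnot_si256(vc, ab);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), r);
    }
    and_and_not_scalar(a, b, c, out, i, n_words);
}
#endif

// Hypothetical entry point: picks the AVX2 path when the CPU supports it.
std::vector<std::uint64_t> evaluate_and_and_not(
        const std::vector<std::uint64_t>& a,
        const std::vector<std::uint64_t>& b,
        const std::vector<std::uint64_t>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<std::uint64_t> out(a.size());
#ifdef SKETCH_HAS_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        and_and_not_avx2(a.data(), b.data(), c.data(), out.data(), a.size());
        return out;
    }
#endif
    and_and_not_scalar(a.data(), b.data(), c.data(), out.data(), 0, a.size());
    return out;
}
```

Keeping a scalar path alongside the vector one gives a reference for bit-exact comparison, which is the same kind of check a cross-architecture CI run relies on.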

Features

Built a SIMD-accelerated audience segmentation engine in C++20 that evaluates boolean queries over 500M users in 11 ms, sustaining 29 GB/s of throughput, or 85-90% of the DDR4 memory-bandwidth ceiling.
Identified single-threaded DDR4 bus saturation as the root cause of 0.98x multi-threaded eval scaling: write-allocate RFO traffic inflates actual bus load by 26% beyond application-visible throughput, leaving no headroom for additional threads.
Built CI pipeline with GitHub Actions testing x86 AVX2 natively and ARM SVE via cross-compilation + QEMU emulation, validating bit-exact correctness across architectures.

Info

benchmarks & analysis

source code

C++20, AVX2/SIMD, ARM SVE, CMake, Google Benchmark, GitHub Actions, Coreinfo64, CPU-Z