    SIMD-friendly algorithms for substring searching (2016) (0x80.pl)
    217 points by Rendello - 2 days ago

  • I implemented these in my attempt to SIMD optimize Wasm/WASI libc.

    https://github.com/ncruces/go-sqlite3/tree/main/sqlite3/libc

    For known haystack lengths and large needles, I found it useful to augment the approach with Quick Search (a minimal sketch follows this comment).

    https://igm.univ-mlv.fr/~lecroq/string/node19.html

    by ncruces - 1 day ago
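
    A minimal sketch of the Quick Search bad-character shift (Sunday's algorithm, per the page linked above); the code is illustrative and not taken from the WASI libc port:

        #include <stddef.h>
        #include <string.h>

        /* Quick Search: on a mismatch, shift by the table entry for the
           haystack byte just past the current window. Requires a known
           haystack length, which is the case mentioned above. */
        static const char *quick_search(const char *hay, size_t n,
                                        const char *needle, size_t m) {
            if (m == 0) return hay;
            if (n < m) return NULL;

            /* Bad-character table: default m + 1 means "skip the whole window". */
            size_t shift[256];
            for (size_t c = 0; c < 256; c++) shift[c] = m + 1;
            for (size_t i = 0; i < m; i++) shift[(unsigned char)needle[i]] = m - i;

            size_t j = 0;
            while (j + m <= n) {
                if (memcmp(hay + j, needle, m) == 0)
                    return hay + j;
                if (j + m == n) break;   /* no byte past the window to index */
                j += shift[(unsigned char)hay[j + m]];
            }
            return NULL;
        }

    The appeal for large needles is that a single mismatching byte past the window can skip up to m + 1 positions at once.
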
  • I implemented many different algorithms for searching and splitting strings using SIMD methods as well: https://github.com/naver/tamgu/blob/06aedf2f14895925d7b5a8e2... I used a different algorithm than the ones presented here.
    by clauderoux - 1 day ago
  • We know that the typical libc strstr performs horribly, but musl's is fast and state of the art.

    Now we just need a name, and we can add it to the smart shootout. Wonder how it compares to the best SIMD algos

    by rurban - 1 day ago
  • missing (2018)?
    by mgaunard - 1 day ago
  • What is the string-size threshold for efficiency?

    Because the SIMD setup code (alignment, size computations, etc.) is usually more expensive than a byte-oriented basic search for "small" strings (a dispatch sketch follows this comment).

    Yes, it depends heavily on the hardware architecture.

    by sylware - 1 day ago
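
    To make the trade-off concrete, here is a sketch of the kind of length-based dispatch being described; SMALL_HAYSTACK_CUTOFF and simd_memmem are placeholders of mine, and the real break-even point has to be measured on the target hardware:

        #include <stddef.h>
        #include <string.h>

        /* Assumed to exist elsewhere: whatever vectorized routine is in use. */
        const char *simd_memmem(const char *hay, size_t n,
                                const char *needle, size_t m);

        /* Placeholder cutoff, not a measured number. */
        enum { SMALL_HAYSTACK_CUTOFF = 64 };

        static const char *find_substring(const char *hay, size_t n,
                                          const char *needle, size_t m) {
            if (m == 0) return hay;
            if (n < m) return NULL;
            if (n < SMALL_HAYSTACK_CUTOFF) {
                /* Byte-oriented search: no alignment, masking, or tail setup. */
                for (size_t j = 0; j + m <= n; j++)
                    if (hay[j] == needle[0] &&
                        memcmp(hay + j + 1, needle + 1, m - 1) == 0)
                        return hay + j;
                return NULL;
            }
            /* Long haystacks amortize the SIMD setup cost. */
            return simd_memmem(hay, n, needle, m);
        }
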
  • The SWAR algorithm has UB because it casts 1-byte-aligned data to an 8-byte-aligned type. Perhaps it suffers some performance issues from unaligned loads? (A safe unaligned-load sketch follows this comment.)
    by eska - 1 day ago
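
    One common way to sidestep that UB (a sketch, not a claim about what the article's code does) is to load through memcpy, which compilers typically lower to a single unaligned load:

        #include <stdint.h>
        #include <string.h>

        /* Safe unaligned 64-bit load: no pointer cast, no alignment assumption. */
        static inline uint64_t load_u64(const void *p) {
            uint64_t v;
            memcpy(&v, p, sizeof v);
            return v;
        }

        /* Classic SWAR zero-byte test: a byte of x is zero iff the matching
           high bit is set in (x - 0x01...01) & ~x & 0x80...80. */
        static inline uint64_t has_zero_byte(uint64_t x) {
            return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
        }

    On x86-64 and recent ARM the unaligned load itself is cheap; the memcpy is mostly about keeping the compiler's aliasing and alignment assumptions (and sanitizers) happy.
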
  • The "AVX2 (generic)" approach is roughly what ripgrep uses (via Rust's `regex` crate) to accelerate most searches. Even something like `\w+\s+Sherlock\s+\w+` will benefit since ripgrep will pluck `Sherlock` out of the regex and search that.

    The actual implementation is here: https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...

    The main difference with the algorithm presented here is that instead of always using the first and last bytes of the needle, a heuristic is used to try to pick two bytes that occur less frequently according to a background frequency distribution.

    It ends up being quite a bit faster than just plain Two-Way or even GNU libc's implementation of `memmem`. From the root of the `memchr` repository:

        $ rebar rank benchmarks/record/x86_64/2023-12-29.csv -e '^rust/memchr/memmem/(oneshot|prebuilt|twoway)' -e '^libc/memmem/oneshot'
        Engine                       Version  Geometric mean of speed ratios  Benchmark count
        ------                       -------  ------------------------------  ---------------
        rust/memchr/memmem/prebuilt  2.7.1    1.07                            57
        rust/memchr/memmem/oneshot   2.7.1    1.39                            54
        libc/memmem/oneshot          unknown  3.15                            54
        rust/memchr/memmem/twoway    2.7.1    5.26                            54
    
    The "prebuilt" benchmarks also demonstrate the API deficiencies of routines like `memmem`. Because of their API, they need to rebuild the state necessary to execute a search for the needle given. But it is often the case that the needle is invariant across repeated calls with different haystacks. So you end up doing that repeated work for no gain.
    by burntsushi - 1 day ago
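
    For reference, a minimal sketch of the first/last-byte SIMD filter from the article, in C with SSE2 rather than AVX2, and without the rare-byte heuristic or the careful tail handling discussed above; the function name and structure are illustrative, not taken from memchr or any libc:

        #include <emmintrin.h>   /* SSE2 */
        #include <stddef.h>
        #include <string.h>

        static const char *sse2_memmem(const char *hay, size_t n,
                                       const char *needle, size_t m) {
            if (m == 0) return hay;
            if (n < m) return NULL;
            if (m == 1) return memchr(hay, (unsigned char)needle[0], n);

            const __m128i first = _mm_set1_epi8(needle[0]);
            const __m128i last  = _mm_set1_epi8(needle[m - 1]);

            size_t i = 0;
            /* Main loop: needs 16 readable bytes at hay+i and at hay+i+m-1. */
            while (i + m + 15 <= n) {
                __m128i blk_first = _mm_loadu_si128((const __m128i *)(hay + i));
                __m128i blk_last  = _mm_loadu_si128((const __m128i *)(hay + i + m - 1));
                unsigned mask = (unsigned)_mm_movemask_epi8(
                    _mm_and_si128(_mm_cmpeq_epi8(first, blk_first),
                                  _mm_cmpeq_epi8(last, blk_last)));
                /* Each set bit marks a position where both bytes matched. */
                while (mask) {
                    unsigned pos = (unsigned)__builtin_ctz(mask);  /* GCC/Clang builtin */
                    if (memcmp(hay + i + pos + 1, needle + 1, m - 2) == 0)
                        return hay + i + pos;
                    mask &= mask - 1;   /* clear the candidate we just checked */
                }
                i += 16;
            }

            /* Scalar tail: positions the vector loop could not cover safely. */
            for (; i + m <= n; i++)
                if (memcmp(hay + i, needle, m) == 0)
                    return hay + i;
            return NULL;
        }

    Swapping the fixed first/last positions for two low-frequency bytes, as described above, changes only which offsets are broadcast and compared; candidate windows are still verified with memcmp.
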
  • Now if I can just use SIMD directly from Python without calling out to another language.
    by azhenley - 1 day ago
  • Reminds me of https://github.com/nikneym/hparse, which uses SIMD algos for fast HTTP parsing.

    by klaussilveira - 1 day ago
  • C# has implemented some SIMD for IndexOf: https://github.com/dotnet/runtime/pull/63285
    by mwsherman - 1 day ago
