    SIMD-friendly algorithms for substring searching (2016) (0x80.pl)
    217 points by Rendello - 2 days ago

  • I implemented these in my attempt to SIMD optimize Wasm/WASI libc.

    https://github.com/ncruces/go-sqlite3/tree/main/sqlite3/libc

    For known haystack lengths and large needles, I found it useful to augment the approach with Quick Search (a minimal sketch follows this comment).

    https://igm.univ-mlv.fr/~lecroq/string/node19.html

    by ncruces - 1 day ago
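
    A minimal sketch of the Quick Search bad-character shift (Sunday's algorithm, per the page linked above); the code is illustrative and not taken from the WASI libc port:

        #include <stddef.h>
        #include <string.h>

        /* Quick Search: on a mismatch, shift by the table entry for the
           haystack byte just past the current window. Requires a known
           haystack length, which is the case mentioned above. */
        static const char *quick_search(const char *hay, size_t n,
                                        const char *needle, size_t m) {
            if (m == 0) return hay;
            if (n < m) return NULL;

            /* Bad-character table: default m + 1 means "skip the whole window". */
            size_t shift[256];
            for (size_t c = 0; c < 256; c++) shift[c] = m + 1;
            for (size_t i = 0; i < m; i++) shift[(unsigned char)needle[i]] = m - i;

            size_t j = 0;
            while (j + m <= n) {
                if (memcmp(hay + j, needle, m) == 0)
                    return hay + j;
                if (j + m == n) break;   /* no byte past the window to index */
                j += shift[(unsigned char)hay[j + m]];
            }
            return NULL;
        }

    The appeal for large needles is that a single mismatching byte past the window can skip up to m + 1 positions at once.
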
  • I implemented many different algorithms for searching and splitting strings using SIMD methods as well: https://github.com/naver/tamgu/blob/06aedf2f14895925d7b5a8e2... I used a different algorithm than the ones presented here.
    by clauderoux - 1 day ago
  • We know that the typical libc strstr performs horribly, but musl's is fast and state of the art.

    Now we just need a name, and we can add it to the smart shootout. Wonder how it compares to the best SIMD algos

    by rurban - 1 day ago
  • missing (2018)?
    by mgaunard - 1 day ago
  • What is the string-size threshold for efficiency?

    Because the SIMD setup code (alignment, size computations, etc.) is usually more expensive than a byte-oriented basic search for "small" strings (a dispatch sketch follows this comment).

    Yes, it depends heavily on the hardware architecture.

    by sylware - 1 day ago
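
    To make the trade-off concrete, here is a sketch of the kind of length-based dispatch being described; SMALL_HAYSTACK_CUTOFF and simd_memmem are placeholders of mine, and the real break-even point has to be measured on the target hardware:

        #include <stddef.h>
        #include <string.h>

        /* Assumed to exist elsewhere: whatever vectorized routine is in use. */
        const char *simd_memmem(const char *hay, size_t n,
                                const char *needle, size_t m);

        /* Placeholder cutoff, not a measured number. */
        enum { SMALL_HAYSTACK_CUTOFF = 64 };

        static const char *find_substring(const char *hay, size_t n,
                                          const char *needle, size_t m) {
            if (m == 0) return hay;
            if (n < m) return NULL;
            if (n < SMALL_HAYSTACK_CUTOFF) {
                /* Byte-oriented search: no alignment, masking, or tail setup. */
                for (size_t j = 0; j + m <= n; j++)
                    if (hay[j] == needle[0] &&
                        memcmp(hay + j + 1, needle + 1, m - 1) == 0)
                        return hay + j;
                return NULL;
            }
            /* Long haystacks amortize the SIMD setup cost. */
            return simd_memmem(hay, n, needle, m);
        }
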
  • The SWAR algorithm has UB because it casts 1-byte-aligned data to an 8-byte-aligned type. Perhaps it suffers some performance issues from unaligned loads? (A safe unaligned-load sketch follows this comment.)
    by eska - 1 day ago
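
    One common way to sidestep that UB (a sketch, not a claim about what the article's code does) is to load through memcpy, which compilers typically lower to a single unaligned load:

        #include <stdint.h>
        #include <string.h>

        /* Safe unaligned 64-bit load: no pointer cast, no alignment assumption. */
        static inline uint64_t load_u64(const void *p) {
            uint64_t v;
            memcpy(&v, p, sizeof v);
            return v;
        }

        /* Classic SWAR zero-byte test: a byte of x is zero iff the matching
           high bit is set in (x - 0x01...01) & ~x & 0x80...80. */
        static inline uint64_t has_zero_byte(uint64_t x) {
            return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
        }

    On x86-64 and recent ARM the unaligned load itself is cheap; the memcpy is mostly about keeping the compiler's aliasing and alignment assumptions (and sanitizers) happy.
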
  • The "AVX2 (generic)" approach is roughly what ripgrep uses (via Rust's `regex` crate) to accelerate most searches. Even something like `\w+\s+Sherlock\s+\w+` will benefit since ripgrep will pluck `Sherlock` out of the regex and search that.

    The actual implementation is here: https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...

    The main difference with the algorithm presented here is that instead of always using the first and last bytes of the needle, a heuristic is used to try to pick two bytes that occur less frequently according to a background frequency distribution.

    It ends up being quite a bit faster than just plain Two-Way or even GNU libc's implementation of `memmem`. From the root of the `memchr` repository:

        $ rebar rank benchmarks/record/x86_64/2023-12-29.csv -e '^rust/memchr/memmem/(oneshot|prebuilt|twoway)' -e '^libc/memmem/oneshot'
        Engine                       Version  Geometric mean of speed ratios  Benchmark count
        ------                       -------  ------------------------------  ---------------
        rust/memchr/memmem/prebuilt  2.7.1    1.07                            57
        rust/memchr/memmem/oneshot   2.7.1    1.39                            54
        libc/memmem/oneshot          unknown  3.15                            54
        rust/memchr/memmem/twoway    2.7.1    5.26                            54
    
    The "prebuilt" benchmarks also demonstrate the API deficiencies of routines like `memmem`. Because of their API, they need to rebuild the state necessary to execute a search for the needle given. But it is often the case that the needle is invariant across repeated calls with different haystacks. So you end up doing that repeated work for no gain.
    by burntsushi - 1 day ago
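
    For reference, a minimal sketch of the first/last-byte SIMD filter from the article, in C with SSE2 rather than AVX2, and without the rare-byte heuristic or the careful tail handling discussed above; the function name and structure are illustrative, not taken from memchr or any libc:

        #include <emmintrin.h>   /* SSE2 */
        #include <stddef.h>
        #include <string.h>

        static const char *sse2_memmem(const char *hay, size_t n,
                                       const char *needle, size_t m) {
            if (m == 0) return hay;
            if (n < m) return NULL;
            if (m == 1) return memchr(hay, (unsigned char)needle[0], n);

            const __m128i first = _mm_set1_epi8(needle[0]);
            const __m128i last  = _mm_set1_epi8(needle[m - 1]);

            size_t i = 0;
            /* Main loop: needs 16 readable bytes at hay+i and at hay+i+m-1. */
            while (i + m + 15 <= n) {
                __m128i blk_first = _mm_loadu_si128((const __m128i *)(hay + i));
                __m128i blk_last  = _mm_loadu_si128((const __m128i *)(hay + i + m - 1));
                unsigned mask = (unsigned)_mm_movemask_epi8(
                    _mm_and_si128(_mm_cmpeq_epi8(first, blk_first),
                                  _mm_cmpeq_epi8(last, blk_last)));
                /* Each set bit marks a position where both bytes matched. */
                while (mask) {
                    unsigned pos = (unsigned)__builtin_ctz(mask);  /* GCC/Clang builtin */
                    if (memcmp(hay + i + pos + 1, needle + 1, m - 2) == 0)
                        return hay + i + pos;
                    mask &= mask - 1;   /* clear the candidate we just checked */
                }
                i += 16;
            }

            /* Scalar tail: positions the vector loop could not cover safely. */
            for (; i + m <= n; i++)
                if (memcmp(hay + i, needle, m) == 0)
                    return hay + i;
            return NULL;
        }

    Swapping the fixed first/last positions for two low-frequency bytes, as described above, changes only which offsets are broadcast and compared; candidate windows are still verified with memcmp.
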
  • Now if I can just use SIMD directly from Python without calling out to another language.
    by azhenley - 1 day ago
  • Reminds me of https://github.com/nikneym/hparse, which uses SIMD algos for fast HTTP parsing.

    by klaussilveira - 1 day ago
  • C# has implemented some SIMD for IndexOf: https://github.com/dotnet/runtime/pull/63285
    by mwsherman - 1 day ago
