    Going faster than memcpy (squadrick.dev)
    63 points by snihalani - 4 hours ago

  • Thought about zero-copy IPC recently. To avoid memcpy for the complete chain, I guess it would be best if the sender allocates its payload directly in the shared memory when it's created. Is this a standard thing in such optimized IPC, and which libraries offer this?
    by waschl - 4 hours ago
  • It's not clear from a skim of this article, but a common problem I've seen in the past with memory copying benchmarks is to not serialise and access the copied data in its destination to ensure that it was actually completed before concluding the timing. A simple REP MOVS should be at or near the top, especially on CPUs with ERMSB.
    by userbinator - 4 hours ago
  • Conclusion

    Stick to `std::memcpy`. It delivers great performance while also adapting to the hardware architecture, and makes no assumptions about the memory alignment.

    ----

    So that's five minutes I'll never get back.

    I'd make an exception for RISC-V machines with "RVV" vectors, where vectorised `memcpy` hasn't yet made it into the standard library and a simple ...

        0000000000000000 <memcpy>:
           0:   86aa                    mv      a3,a0
        
        0000000000000002 <.L1^B1>:
           2:   00267757                vsetvli a4,a2,e8,m4,tu,mu
           6:   02058007                vle8.v  v0,(a1)
           a:   95ba                    add     a1,a1,a4
           c:   8e19                    sub     a2,a2,a4
           e:   02068027                vse8.v  v0,(a3)
          12:   96ba                    add     a3,a3,a4
          14:   f67d                    bnez    a2,2 <.L1^B1>
          16:   8082                    ret
    
    ... often beats `memcpy` by a factor of 2 or 3 on copies that fit into L1 cache.

    https://hoult.org/d1_memcpy.txt

    by brucehoult - 4 hours ago
  • I thought this was going to be about https://github.com/Blosc/c-blosc
    by dataflow - 3 hours ago
  • It's not clear how the author controlled for HW caching. Without this, the results are, unfortunately, meaningless, even though some good work has been done.
    by Arech - 3 hours ago
  • Would have loved to see performance comparisons along the way, instead of just the small squashed graph at the end. Nice article otherwise :)
    by jesse__ - 3 hours ago
  • the "dumb of perf": a Freudian slip?
    by wolfi1 - 3 hours ago
  • soo... time to send a patch to glibc?
    by _ZeD_ - 3 hours ago
  • > The operation of copying data is super easy to parallelize across multiple threads. […] This will make the copy super-fast especially if the CPU has a large core count.

    I seriously doubt that. Unless you have a NUMA system, a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller. If you can avoid going through main memory – e.g., when copying between the L2 caches of different cores – multi-threading can speed things up. But then you need precise knowledge of your program's memory access behavior, and this is outside the scope of a general-purpose memcpy.

    by adwn - 3 hours ago
  • [2020]
    by Orangeair - 2 hours ago
  • There's an error here: “NT instructions are used when there is an overlap between destination and source since destination may be in cache when source is loaded.”

    Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon, so it shouldn't push out other things in the cache. They may skip the cache entirely, or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.

    by Sesse__ - 2 hours ago
