- Thought about zero-copy IPC recently. In order to avoid memcpy for the complete chain, I guess it would be best if the sender allocates its payload directly in the shared memory when it's created. Is this a standard thing in such optimized IPC, and which libraries offer this?
by waschl - 4 hours ago
- It's not clear from a skim of this article, but a common problem I've seen with memory-copying benchmarks is failing to serialise and access the copied data at its destination, to ensure the copy has actually completed before concluding the timing. A simple REP MOVS should be at or near the top, especially on CPUs with ERMSB.
by userbinator - 4 hours ago
- Conclusion
Stick to `std::memcpy`. It delivers great performance while also adapting to the hardware architecture, and makes no assumptions about the memory alignment.
----
So that's five minutes I'll never get back.
I'd make an exception for RISC-V machines with "RVV" vectors, where vectorised `memcpy` hasn't yet made it into the standard library and a simple

    0000000000000000 <memcpy>:
       0: 86aa        mv      a3,a0
    0000000000000002 <.L1^B1>:
       2: 00267757    vsetvli a4,a2,e8,m4,tu,mu
       6: 02058007    vle8.v  v0,(a1)
       a: 95ba        add     a1,a1,a4
       c: 8e19        sub     a2,a2,a4
       e: 02068027    vse8.v  v0,(a3)
      12: 96ba        add     a3,a3,a4
      14: f67d        bnez    a2,2 <.L1^B1>
      16: 8082        ret

often beats `memcpy` by a factor of 2 or 3 on copies that fit into L1 cache.
by brucehoult - 4 hours ago
- I thought this was going to be about https://github.com/Blosc/c-blosc
by dataflow - 3 hours ago
- It's not clear how the author controlled for HW caching. Without this, the results are, unfortunately, meaningless, even though some good work has been done.
by Arech - 3 hours ago
- Would have loved to see performance comparisons along the way, instead of just the small squashed graph at the end. Nice article otherwise :)
by jesse__ - 3 hours ago
- the "dumb of perf": some Freudian slip?
by wolfi1 - 3 hours ago
- soo... time to send a patch to glibc?
by _ZeD_ - 3 hours ago
- > The operation of copying data is super easy to parallelize across multiple threads. […] This will make the copy super-fast especially if the CPU has a large core count.
I seriously doubt that. Unless you have a NUMA system, a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller. If you can avoid going through main memory – e.g., when copying between the L2 caches of different cores – multi-threading can speed things up. But then you need precise knowledge of your program's memory access behavior, and this is outside the scope of a general-purpose memcpy.
by adwn - 3 hours ago
- [2020]
by Orangeair - 2 hours ago
- There's an error here: “NT instructions are used when there is an overlap between destination and source since destination may be in cache when source is loaded.”
Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon, so it shouldn't push out other things in the cache. They may skip the cache entirely, or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.
by Sesse__ - 2 hours ago