Wednesday, November 24, 2010

Fast memcpy for large blocks

Copying memory in 8MB blocks can be quite slow.

I found that neither memcpy nor CopyMemory uses the full bandwidth of your RAM. I suspect the memory controller isn't smart enough to prefetch the right data on its own. This implementation by William Chan uses SSE2 instructions with explicit prefetches, so the data streams to and from RAM as fast as the memory controller can manage.

Note, though, that you'll need to give it 16-byte-aligned memory, and that it copies in 128-byte blocks.
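To make the alignment and block-size constraints concrete, here is a minimal sketch of this style of copy using compiler intrinsics. It is not William Chan's code: the function name stream_copy_128, the use of prefetchnta plus non-temporal stores, and the 256-byte prefetch distance are all my assumptions about how such a routine is commonly written. It needs an x86 compiler with SSE2 enabled (the default on x86-64).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch and _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the original implementation.
   Copies `size` bytes from src to dst. Both pointers must be 16-byte
   aligned and size must be a multiple of 128. */
void stream_copy_128(void *dst, const void *src, size_t size)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert(size % 128 == 0);

    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t off = 0; off < size; off += 128) {
        /* Hint that data two blocks ahead will be needed soon.
           The 256-byte distance is a guess, not a tuned value. */
        if (off + 256 < size)
            _mm_prefetch(s + off + 256, _MM_HINT_NTA);

        /* Move one 128-byte block as eight 16-byte vectors:
           aligned loads, then non-temporal stores that bypass the cache. */
        for (size_t i = 0; i < 128; i += 16) {
            __m128i v = _mm_load_si128((const __m128i *)(s + off + i));
            _mm_stream_si128((__m128i *)(d + off + i), v);
        }
    }

    /* Make the non-temporal stores globally visible before returning. */
    _mm_sfence();
}
```

Buffers from _mm_malloc(size, 16) or _aligned_malloc(size, 16) satisfy the alignment requirement. Anything that is not a multiple of 128 bytes would need a plain memcpy for the tail.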

Here are the results (on my Core 2 Duo Wolfdale CPU @ 3.6GHz with dual-channel DDR2 @ 800MHz):

memcpy/CopyMemory:
1871.775MB/sec

William Chan's SSE2 memcpy:
3540.471MB/sec

That's nearly double (about 1.9x) the throughput of the stock memcpy!
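For anyone who wants to take similar measurements, here is a rough sketch of how a MB/sec figure for memcpy could be obtained. It is not the harness behind the numbers above: the names mb_per_sec and bench_memcpy, the use of clock(), and the choice of 1 MB = 1024*1024 bytes are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Convert bytes copied and elapsed seconds to MB/sec.
   Assumes 1 MB = 1024 * 1024 bytes. */
double mb_per_sec(size_t bytes, double seconds)
{
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}

/* Hypothetical harness: time `reps` memcpy calls on one block of `block`
   bytes. Returns MB/sec, or -1.0 if allocation fails or the elapsed time
   is too short for clock() to resolve. */
double bench_memcpy(size_t block, int reps)
{
    assert(block > 0 && reps > 0);

    unsigned char *src = malloc(block);
    unsigned char *dst = malloc(block);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page first so page faults are not part of the timing. */
    memset(src, 0xA5, block);
    memset(dst, 0, block);

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, block);
    clock_t t1 = clock();

    /* Read the destination so the copies cannot be discarded entirely. */
    volatile unsigned char sink = dst[block - 1];
    (void)sink;

    free(src);
    free(dst);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return -1.0;
    return mb_per_sec(block * (size_t)reps, secs);
}
```

Calling bench_memcpy(8u * 1024 * 1024, 100) would time one hundred 8MB copies. An optimizing compiler may still merge the repeated copies, so treat the output as a rough figure and check the generated code before trusting it.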
