I found that neither memcpy nor CopyMemory uses the full bandwidth of your RAM. I suspect the bottleneck is the memory controller, which isn't smart enough to prefetch the right data on its own. This implementation by William Chan issues SSE2 prefetch instructions explicitly, which gets the memory controller to stream the data to and from RAM as fast as it can.
Note, though, that you need to give it 16-byte aligned memory, and that it copies in 128-byte blocks, so the size should be a multiple of 128 bytes.
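To make those two constraints concrete, here is a minimal sketch of the same general idea written with SSE2 intrinsics. This is not William Chan's routine. The function name `sse2_stream_copy`, the choice of non-temporal (streaming) stores, and the prefetch distance of one block are all my own assumptions for illustration, and it only builds on x86/x64 compilers that provide `<emmintrin.h>`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also brings in _mm_prefetch and _mm_sfence) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Copies `size` bytes from src to dst in 128-byte blocks.
 * Preconditions: both pointers 16-byte aligned, size a multiple of 128.
 * Returns 0 on success, -1 if a precondition is not met. */
int sse2_stream_copy(void *dst, const void *src, size_t size)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;
    size_t off;
    int i;

    if ((((uintptr_t)dst | (uintptr_t)src) & 15u) != 0)
        return -1;                      /* misaligned pointer */
    if (size % 128u != 0)
        return -1;                      /* not a whole number of blocks */

    for (off = 0; off < size; off += 128) {
        __m128i r[8];

        /* Hint the next 128-byte block (two 64-byte cache lines, an
         * assumed line size) into cache while this one is copied. */
        if (off + 256 <= size) {
            _mm_prefetch(s + off + 128, _MM_HINT_NTA);
            _mm_prefetch(s + off + 192, _MM_HINT_NTA);
        }

        /* Eight aligned 16-byte loads = one 128-byte block. */
        for (i = 0; i < 8; i++)
            r[i] = _mm_load_si128((const __m128i *)(s + off + 16 * i));

        /* Streaming stores write around the cache, so a large
         * destination buffer does not evict the source data. */
        for (i = 0; i < 8; i++)
            _mm_stream_si128((__m128i *)(d + off + 16 * i), r[i]);
    }

    _mm_sfence();   /* make the streaming stores visible before returning */
    return 0;
}
```

A buffer from `_aligned_malloc(size, 16)` on Windows or C11 `aligned_alloc(16, size)` satisfies the alignment requirement. Whether a sketch like this beats your platform's memcpy depends on the CPU and on the buffer size, so measure before relying on it.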
Here are the results (on my Core 2 Duo Wolfdale CPU @ 3.6GHz, dual-channel DDR2 @ 800MHz):
William Chan's SSE2 memcpy:
That's nearly double the speed of the naive memcpy!