# Arrays, Linked Lists, and Cache Pollution

## Method 2

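    #include <emmintrin.h>  /* SSE2: __m128i */

    /* Assumes size is a multiple of 16 and both pointers are 16-byte aligned. */
    void memcpy(char * dst, char * src, unsigned size) {
        char * dst_end = dst + size;
        while (dst != dst_end) {
            __m128i res = *((__m128i *)src);  /* load 16 bytes from src */
            *((__m128i *)dst) = res;          /* store them to dst */
            src += 16;
            dst += 16;
        }
    }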


Note that even in user space memcpy() using MMX registers is NOT necessarily a good idea at all.

Why?

It looks damn good in benchmarks. Especially for large memory areas that are not in the cache.

But it tends to have horrible side-effects. Like the fact that when multiple processes (or threads) are running, it means that the FP state has to be switched all the time. Normally we can avoid this overhead, because most programs do not actually tend to use the FP unit very much, so with some simple lazy FP switching we can make thread and process switches much faster.

Using the FPU or MMX for memcpy makes that go away completely. Suddenly you get slower task switching, and people will blame the kernel. Even though the _real_ bug is an optimization that looks very good on benchmarks, but does not necessarily actually win all that much in real life.

Basically, you should almost never use the i387 for memcpy(), unless you know you can get it for free (i.e. you’re already using the FPU). An i387 state save/restore is expensive. It’s expensive even in user mode where you don’t do it explicitly, but the kernel does it for you.

The MMX stuff is similar. Only use it if you already know you’re using the MMX unit. Because otherwise you _will_ slow the system down.

NOTE! If you absolutely want to do it anyway, make sure that the size cutoff is large. It definitely is not worth a few FPU task switches to do small memcpy’s. But for really large memcpy’s you might consider it (i.e. if size is noticeably larger than a few kilobytes). Use regular integer stuff for smaller areas.
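The size-cutoff advice can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `copy_with_cutoff`, `copy_small`, and `WIDE_COPY_CUTOFF` are hypothetical, and the 16 KiB threshold is a guess at "noticeably larger than a few kilobytes", not a measured value. Small copies stay on a plain byte loop; only large ones touch the SSE2 registers, using unaligned loads and stores so no alignment is assumed.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical cutoff; tune it by measuring on the real workload. */
#define WIDE_COPY_CUTOFF ((size_t)16 * 1024)

/* Plain byte-by-byte copy: the "regular integer stuff" path. */
static void copy_small(unsigned char *d, const unsigned char *s, size_t n) {
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Copies size bytes from src to dst (regions must not overlap). */
void *copy_with_cutoff(void *dst, const void *src, size_t size) {
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* Below the cutoff, never touch the vector registers. */
    if (size < WIDE_COPY_CUTOFF) {
        copy_small(d, s, size);
        return dst;
    }
#if defined(__SSE2__)
    /* Large copy: move 16 bytes per iteration with unaligned SSE2 ops. */
    size_t blocks = size / 16;
    for (size_t i = 0; i < blocks; i++) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + 16 * i));
        _mm_storeu_si128((__m128i *)(d + 16 * i), chunk);
    }
    /* Copy the remaining 0..15 bytes on the integer path. */
    copy_small(d + 16 * blocks, s + 16 * blocks, size % 16);
#else
    /* No SSE2 available: fall back to the byte loop for everything. */
    copy_small(d, s, size);
#endif
    return dst;
}
```

Note that an optimizing compiler may vectorize the byte loop on its own, so this sketch shows where the decision is made rather than guaranteeing which registers end up in use.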

And it’s insidious. When benchmarking this thing, you usually (a) don’t have any other programs running and (b) even if you do, they haven’t been converted to using FPU memcpy yet anyway, so you’d see only half of the true cost anyway.

• The world is always changing
• There is no free lunch
