Copy or move

I’ve been optimizing RDE’s vector class a little bit recently. In all honesty, it’s probably the least interesting container, there are only so many ways to do things, so it all boils down to working at the instruction level, trying to eliminate everything that’s not needed. Those few cycles don’t even matter in the grand scheme of things, but it can be an interesting learning experience at times. One of the functions I was interested in was erase. Let’s take a quick look at the code:

1iterator erase(iterator it)
2{
3        // [Preconditions, invariants etc.]
4        // Move everything down, overwriting *it
5        internal::move(it + 1, m_end, it, int_to_type<has_trivial_copy<T>::value>());
6        [...]

Line 5 is where the magic happens. We’re erasing an element at position ‘it’, so we need to move all elements following it one element “down” (it+1 overwriting it, it+2 overwriting it+1 etc). internal::move will select an optimal way of moving the data, either memmove or one-by-one, using assignment operator. Now, you might be wondering if we have to use move here, wouldn’t copy suffice (if you need to refresh your memory on differences – Google memcpy vs memmove)?

Actually, I was using copy at first. I assumed that even though buffer did overlap, it should be safe in this particular case (as we’re not writing to addresses that we use later). Turned out there are some platforms, where _memcpy _function cannot guarantee that bytes in src are copied to dest before being overwritten. Fair enough, I pushed too far here, changed it to move, which solved this particular problem. I also added some more test cases to verify my move/copy functions were working properly in all situations. I was quite surprised when I found out that memcpy was actually working better than expected, it’d produce correct results even if buffers were overlapping. I dug deeper and discovered that MSVC’s implementation of _memcpy _was the same as _memmove _and included overlap checks. We had an interesting Twitter discussion (or two, it branched out a little bit) about the whole thing, turned out it’s quite popular approach. Personally, I’m not convinced it’s a good idea, it makes it very easy to write code that works on one platform/compiler, but will silently fail when executed on another. I’m speaking as a game coder, though, I do not maintain an OS, I like clean rules & restrictions. I’m fine with Linus’ “debug mode” suggestion, it seems like a happy medium. It would give at least some indication of future problems. In any case, MSVC’s implementation is really quite bad, it does not even behave consistently! As I discovered, if the number of bytes to copy is known at compile-time, it tries to be smart and generates a bunch of movs instructions. See below:

 1__declspec(noinline) void MemCpyNoInline(void* dest, const void* src, size_t bytes)
 2{
 3    memcpy(dest, src, bytes);
 4}
 5
 6void MemCpyOverlap1(int tab[])
 7{
 8    const int* src = &tab[0];
 9    int* dst = &tab[1];
10    memcpy(dst, src, 3 * sizeof(int));
11}
12void MemCpyOverlap2(int tab[])
13{
14    const int* src = &tab[0];
15    int* dst = &tab[1];
16    MemCpyNoInline(dst, src, 3 * sizeof(int));
17}
18void PrintTab(const int t[], const char* name)
19{
20    printf("%s = ", name);
21    for (size_t i = 1; i < 4; ++i)
22    {
23        printf("%d ", t[i]);
24    }
25    printf("\n");
26}
27TEST(MemCpyOverlap)
28{
29    int tab[] = { 0, 1, 2, 3, 4, 5, 6 };
30    int tab2[] = { 0, 1, 2, 3, 4, 5, 6 };
31    MemCpyOverlap1(tab);
32    PrintTab(tab, "tab");
33    MemCpyOverlap2(tab2);
34    PrintTab(tab2, "tab2");
35}

Care to guess what’s the result? tab = 0 0 0, tab2 = 0 1 2. Yikes! What’s the difference? Well, for tab2, we forced the compiler to actually call memcpy, which, as we learned – will check for overlap and deal with it properly. For tab, the generated code looks as follows:

1lea         esi,[tab]
2lea         edi,[ebp-2Ch]
3movs        dword ptr es:[edi],dword ptr [esi]
4movs        dword ptr es:[edi],dword ptr [esi]
5movs        dword ptr es:[edi],dword ptr [esi]

(For bigger batches of data, rep movsd is used). To make things even more “interesting”, it only works this way with optimizations enabled, in debug builds, it calls memcpy in all cases. Just for giggles, I tried compiling this function with online GCC as well, see the results here, it’s quite impressive, actually…

Going back to where I started, the latest version of vector::erase uses memmove for types with trivial copy operation, but normal “copy” in the other case. Why? Manual “move” requires iterating backwards, but it’s not necessary here and it was much less cache friendly.

Old comments

Pavel Shevaev 2012-10-20 20:05:53

Hi! Thanks for sharing. BTW, I was browsing through RDE’s sources and it looks like internal::move for non-trivial types doesn’t preserve the collection order. Please correct me if I’m wrong…

admin 2012-10-20 23:48:09

No, you’re right, nice find!

More Reading