Know your assembly, part 3

I’ve been playing with radix sort for RDESTL recently and the MSVC optimizer surprised me one more time. First of all, it seems to generate much better code if my histogram array is local to the function and not a member (in the first case it makes better use of registers and keeps less stuff in temporary variables). It costs us 4 KB of stack, but I can live with that. Other than that, the generated code is pretty smart, sometimes almost too smart. My first version only worked on arrays of integers, so there was no indirection when retrieving the sorting key. Consider this code for the second pass in that case:

for (int i = 0; i < num; ++i)
{
   const uint32_t pos = (m_dst[i] >> 8) & 0xFF;
   src[h1[pos]++] = m_dst[i];
} 
(m_dst and src are arrays of integers here).
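For context, here is a minimal sketch of what the counting pass looks like with the histograms kept as locals (illustrative only, not the exact RDESTL code; four 256-entry tables of 32-bit counts is where the 4 KB of stack goes):

// Sketch only, not the actual RDESTL code. Four 256-entry histograms kept
// on the stack (4 * 256 * sizeof(uint32_t) = 4 KB), one per key byte.
uint32_t h0[256] = {0}, h1[256] = {0}, h2[256] = {0}, h3[256] = {0};
for (int i = 0; i < num; ++i)
{
    const uint32_t k = src[i];
    ++h0[k & 0xFF];
    ++h1[(k >> 8) & 0xFF];
    ++h2[(k >> 16) & 0xFF];
    ++h3[(k >> 24) & 0xFF];
}
// An exclusive prefix sum turns the counts into destination offsets; these
// are the values the h1[pos]++ scatter in the loop above consumes (same for
// the other passes and their histograms).
uint32_t sum1 = 0;
for (int j = 0; j < 256; ++j)
{
    const uint32_t c = h1[j];
    h1[j] = sum1;
    sum1 += c;
}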

Later, in order to make it more universal and allow for sorting arrays of various types, as long as a sorting key is provided, I changed it to:

template<typename TFunc>
void Sort(T* src, int num, const TFunc& func)
{
    for (int i = 0; i < num; ++i)
    {
        const uint32_t pos = (func(m_dst[i]) >> 8) & 0xFF;
        src[h1[pos]++] = m_dst[i];
    }
}

(func = a function that retrieves the sorting key from the data). My first test was simple - I still used an array of integers and ‘func’ just returned its argument (the original int). To my surprise, it already meant a ~15% slowdown. Initially, I thought that maybe the compiler had problems with inlining my functor, but it turned out to be more complicated. Let’s see the generated code for both versions:

movzx edx, BYTE PTR [eax+1]
(Version 1, eax == m_dst + i)

mov ebx, DWORD PTR [esi+eax*4]
mov eax, ebx
shr eax, 8
and eax, edx
(Version 2, esi == m_dst, eax == i)

As you can see, the compiler was smart enough to access the higher byte of the integer directly, because it knows there’s nothing else in the array! It doesn’t have to shift/mask later. In the second case, it inlines the functor, but there are too many levels of indirection, so it fails to recognize the chance for further optimization. It looks similar for the third and fourth passes. Honestly, I’m not sure I’d even think of this myself were I coding this in assembly. There’s this kind of optimization in Pierre Terdiman’s radix sorter (he indexes bytes in C++ instead of shifting), but it’s not even necessary anymore, as the compiler does it for us. Sadly, it also means that for maximum performance we may have to provide two versions of this function, one for a table of integers and another for more complicated objects.
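For illustration, the byte-indexing trick looks roughly like this (a sketch only, assuming a little-endian target; in the plain-integer version the compiler turned my shift/mask into the equivalent single-byte load on its own):

// Two equivalent ways of extracting byte 1 of a 32-bit key (sketch).
const uint32_t pos_shift = (m_dst[i] >> 8) & 0xFF;                          // shr + and
const uint32_t pos_byte  = reinterpret_cast<const uint8_t*>(m_dst + i)[1];  // single movzx on little-endian x86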

If you’re interested in knowing more about the scale of optimizations you can expect from current compilers - read this paper. It really proves that in many situations compilers are smarter than us (take abs, for example: not many people know about the cdq/xor/sub trick, but GCC does). Of course, it doesn’t mean we shouldn’t double-check them from time to time.
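For the curious, here’s the C++ equivalent of that branchless abs (a sketch; it relies on >> being an arithmetic shift for signed integers, which is what MSVC and GCC do):

// Branchless abs(x) -- what the cdq/xor/sub sequence computes.
int32_t abs_branchless(int32_t x)
{
    const int32_t mask = x >> 31;  // 0 for non-negative, -1 for negative (this is what cdq produces)
    return (x ^ mask) - mask;      // flip the bits and add 1 only when x was negative
}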

Obligatory collection of unrelated links:

  • Intel finally released their Smoke tech demo. It may be a tad overengineered, but definitely worth checking out, some nice concepts there.

  • Nice collection of screenshots from various stages of MotoGP evolution. Impressive development pace as well, first screenshot is from May, last from September!

Old comments

remigiusz 2008-11-29 19:45:04

“Impressive development pace as well, first screenshot is from May, last from September!”
The newest MotoGP, made in Italy, was done in a similar timeframe, for 4 platforms (the Wii version is late). It is funny, by the way, how Polish gamedev guys are still trying to play this “lazy and unprofessional Italians” stereotype.

admin 2008-11-29 21:47:58

Well, it’s always easier to make a sequel/next iteration because you work on existing technology. That said, I’ve been working in Italy for a while and have very fond memories. The guys I met there were some of the best I’ve seen in the industry. Then again, I’ve never heard anyone mention this stereotype.

Christer Ericson 2008-11-30 10:47:00

“First of all, it seems to generate much better code if my histogram array is local to function and not member”
That’s typical of aliasing and compilers’ ability to resolve same. Always remember that member variables are implicitly prefixed by “this->” which is just about as bad as them being global as far as resolving aliasing goes.
C++ is evil.

admin 2008-11-30 11:56:59

Yes, aliasing is a nasty beast here. It affects even loops that only write to local variables, just because register usage in the whole function is suboptimal.
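A common workaround (a sketch only, not necessarily what RDESTL ends up doing) is to copy the member into a local before the hot loop, so the compiler no longer has to assume that stores through src can modify it:

// Hoisting the member into a local so the optimizer doesn't reload it
// through 'this' on every iteration (names as in the post's snippets).
uint32_t* const dst = m_dst;
for (int i = 0; i < num; ++i)
{
    const uint32_t pos = (dst[i] >> 8) & 0xFF;
    src[h1[pos]++] = dst[i];
}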

js 2008-12-01 21:00:31

I just tested the code snippet with vs2005-sp1 in the same loop as described in the article, and I have a different result.
I use
template<typename T>
struct MyFunc {
    const T& operator()(const T& var) const { return var; }
};
In my case, the functor is translated into:
movzx edx, byte ptr [ecx-4]
regardless of whether m_dst is a member variable or not

js 2008-12-01 21:29:49

Btw, since this is about assembly, I am a bit annoyed by the use of movzx, I would prefer something like this instead:

xor edx, edx
for (int i = 0; i < num; ++i) {
    mov dl, byte ptr [ecx-4]
}

It is not always possible to do so, but in simple loops (and this one is such a case), you could save a few cycles per iteration

ed 2008-12-01 21:50:44

Actually “movzx” is used here correctly and will be faster than your code, since it avoids a partial register stall (which happens when a register is modified partially and then used as a whole). In this case, dl is modified while edx is used.

js 2008-12-01 22:51:20

well, the fact that you set edx to 0 using the xor instruction removes any partial stall (on this register).
As long as the upper 24 bits of edx are not touched, it’s ok. If you change edx and set it back to zero with a xor, it’s ok too

ed 2008-12-04 19:55:23

yes, but since movzx doesn’t cost more (it has the same cost as mov dl), your code adds the cost of the xor ;) Also, if this loop calls a function that may modify dh, movzx is the safest solution here.

js 2008-12-05 08:39:31

I beg to disagree ed, movzx costs a bit more than a mov since it has to set the register to zero.

ed 2008-12-05 21:00:55

:) you can check Intel’s Optimization Manual. Page C-27

js 2008-12-06 11:23:59

Yeah, you are right ed; this is true on modern processors.
If anyone else ever wonders, ed is talking about the “Intel 64 and IA-32 Architectures Optimization Reference Manual” from November 2007.
