Lock free hell

For the last days (weeks) apart from playing GTA IV (mainly) I’ve been improving my thread pool class. I decided to implement system similar to the one described by Ian Lewis from Microsoft in “Getting More from Multicore” from GDC2008. Basically, it’s an extension of my system, it gives greater control over inter-task dependencies. Task cannot be started until all tasks it depends on are finished. In theory it’s rather easy, I just provide extra scheduler thread that deals with incoming tasks, if any of them is “ready” it’s put in a priority queue (priority in the most simple case is number of dependents), so that it can be processed by worker threads. I had basic implementation working after few hours, however I spent another week trying to find a nasty bug.

Once in a while it would “lost” one task and therefore block all the execution pipeline (all the tasks depending on it would wait indefinitely). It was HELL to debug. First of all, it only happened once in 500 tries or so. I downloaded one of those macro recording programs, recorded F5+Esc (start/exit) sequence and looped it. Then I ran it before going to work, left running for few hours and tried to examine the problem when I came back (it would usually hang up by then). Of course you couldnt rely on logging etc, so all I could do was to store all the info I could think of in helper tables and try to investigate it. You can probably blame it on my lack experience, but it was seriously frustrating. Even debugging “normal” multithreaded code is pleasant comparing to this. At least, if you’ve a lock somewhere, you know no one is going to change your data. With lock-free - everything can happen and when the bug occurs, it’s usually too late.

After a week or so I finally found the bug (I think, it’s hell to verify as well, but it survived 4 debugging “sessions” (~8h each) so far), but it wasnt really thanks to all the stuff I “recorded”, I simply tried to examine every line of code and even every assembly instruction. It turned out that I wanted to be too smart and it kicked back (as it usually happens when coding). Consider this piece of code from my stack class:

Node* Pop()
{
    Head oldHead, newHead;
    Node* iter(0);
    while (true)
    {
        // @note Important to get 64bits at once.
        oldHead.value64 = m_head.value64;
        iter = oldHead.value.node;
        if (iter == 0) return 0;
        newHead.value.node = iter->next;
        newHead.value.popCounter = oldHead.value.popCounter + 1;
        if (CAS2(&m;_head.value64, newHead.value64, oldHead.value64))
            break;
    }
    return iter;
}
[...]
union Head
{
    struct Value64
    {
        Node* volatile node;
        int32_t volatile popCounter;
     } value;
     int64_t    value64;
};
Notice how I even put this smart-ass comment there, without verifying it. Let’s take a look at the assembly code:
1    ; 164     oldHead.value64 = m_head.value64;
2    lea    edx, DWORD PTR [edi+8]
3    mov    esi, DWORD PTR [edx]
4    mov    ecx, DWORD PTR [edx+4]
As you can see, it doesnt actually get 64 bits at once (what was I thinking?). In theory, it could be done with cmpxchg8b for example, but I dont think we should blame compiler for not using that, it was rather foolish assumption (not verified at that) on my side. Anyway, it shouldnt be a problem, right? Doh… wrong. Consider the following scenario:

  • stack with one node (A), popCounter == 0,
  • thread 0 tries to pop the node, gets head address, thread 1 interrupts, pops head, then pushes another nodes (B, C, D, E, whatever), then (A) again.
  • thread 0 resumes, loads popCounter (1), CAS2 succeeds while it shouldnt…

Yuck. Funny thing is that roughly 50% of documents about lock-free stacks/queues out there have this bug (including GPG6 article by Toby Jones… which actually makes me wonder if I’m right here). “Updated” stack class can be found here. (I had similar problem with queue, but “fix” is the same). Simple solution was to exchange the order of reading variables. It should work on x86, because reads shouldnt be re-ordered. On X360 you’d most probably should put barrier there.

To simplify conversion from “normal” function to asynchronous model I also added special kind(s) of tasks - DelegateTask (based on an excellent library by Sergey Ryazanov). Say, you have a Mesh class with Animate method, taking time delta argument. In order to create task from it you do:

    new DelegateTask1(animGroup, 
        DelegateTask1::tDelegate::from_method<Mesh, 
            &Mesh;::Animate>(mesh), dt); 

Finally, here’s my take on example give in Lewis’ paper (classical animate -> render -> main loop).

// Dependencies: main -> render -> anim
virtual void MainTask::Execute()
{
    for (int i = 0; i < numMeshes; ++i)
    {
        m_pool.AddTask(new DelegateTask1(animGroup, 
             DelegateTask1::tDelegate::from_method<Mesh, &Mesh;::Animate>(meshes[i]), dt));
    }
    for (int i = 0; i < numMeshes; ++i)
    {
        m_pool.AddTask(new DelegateTask1(renderGroup, 
            DelegateTask1::tDelegate::from_method<Mesh, &Mesh;::Render>(meshes[i]), dt));
    }
    m_pool.AddTask(this);
}
It’s pretty rough sketch (normally you do not want to create/destroy tasks every frame, however in this case they all go from stack pool which is fully discarded at the frame end, so it’s not that bad), but you get the idea, I hope.

Now that it works - it’s nice, but I’d still be scared when introducing this in a commercial environment. It probably is best idea to implement a version with locks first and when you’re sure it’s absolutely, 100% working, has no bugs then maybe try to introduce some of this stuff (and limit people working on these pieces of code, as you can see just modifying single, “irrevelant” line can be dangerous). For home projects it’s still lots of fun tho (for a very perverse and twisted definition of fun).

I also made a simple improvement to my classical Mandelbrot test - before each task would process continous section of the image. This resulted in poor load balancing (certain areas are quicker to compute). After simple modification (tasks processing scanlines from different image areas) I’m getting almost ideal 2x speedup on my dual-core machine, just as expected. PS. As a bonus - my CAS2 implementation exploiting unions. It results in a much better code than version that’s commented out (CDQs and stuff shudder… uncomment if you’re curious).

Old comments

rad 2008-07-06 22:54:36

Hey, very nice :)
I’ve been trying to implement something similar for a while, however that gdc paper has given me some thoughts.
At the moment my design follows a simplified task-stealing scheme found in TBB, however it’s far from perfect.
And yeah, debugging this, is like watching endless spaghetti..
All in all, thanks for your blog.

More Reading
Older// GTA IV