Forward that store

I know I’ve been bitching about load-hit-store (too) many times before, but it’s been one of the most annoying things we had to deal with on the previous generation of consoles. LHS happens when we try to load data from an address that has recently been written to. X360/PPC CPUs were fairly simple: they could neither execute some other instruction in the meantime (no OOE) nor retrieve the data without waiting for it to reach the cache. All they could do was sit there, twiddle their thumbs and wait.

Fortunately, modern x86s come with store forwarding. It means that - under certain conditions - a load can read directly from the store buffer, without having to wait for the data to go to memory and back. Luckily for us, the restrictions are not too severe and in most cases we don’t have to worry about them at all. One group of limitations that you might actually run into, and that is worth keeping in mind, has to do with data size/alignment. The two most important restrictions are:

  • the load & store need to have the same alignment (that’s obvious if they’re the same size, but might be an issue if the load is smaller),

  • the load size must be smaller than or equal to the store size.
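
To make the two rules concrete, here is a toy predicate that encodes them. It is only a sketch of the restrictions as listed above, not a description of what any particular CPU does: `MemAccess` and `CanForward` are made-up names, and "same alignment" is interpreted (as an assumption) as "the load starts at the same address as the store". Real hardware may be more permissive than this.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: one memory access, described by its first byte
// and the number of bytes it touches.
struct MemAccess
{
    std::uintptr_t address;
    std::size_t size;
};

// Toy model of the two restrictions from the list above.
// Assumption: "same alignment" means the load starts where the store starts.
bool CanForward(const MemAccess& store, const MemAccess& load)
{
    const bool sameStart = load.address == store.address; // rule 1
    const bool loadFits = load.size <= store.size;        // rule 2
    return sameStart && loadFits;
}
```

Under this model, a 4-byte store followed by a 4-byte (or 1-byte) load from the same address qualifies, while a 1-byte store followed by a 4-byte load does not.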

Simple experiment:

volatile unsigned int ibuffer[8192];

// Stores 1 byte, then loads the full 32-bit element (load wider than store).
__declspec(noinline) int NoStoreForwarding()
{
    int result = 0;
    for (unsigned int i = 0; i < 8192; ++i)
    {
        *(unsigned char*)(&ibuffer[0] + i) = (unsigned char)i;
        result = ibuffer[i];
    }
    return result;
}

// Stores and loads the same 32-bit element (same address, same size).
__declspec(noinline) int StoreForwarding()
{
    int result = 0;
    for (unsigned int i = 0; i < 8192; ++i)
    {
        ibuffer[i] = i;
        result = ibuffer[i];
    }
    return result;
}

In the first case we store only 1 byte, but load 32 bits, so no store forwarding for us. The second case should work as expected (I made the buffer elements volatile to stop the compiler from keeping the value in a register).

Results from my i5/Ivy Bridge (8192 elements, 1000 iterations):

  • with store forwarding: 13562 ticks/iteration

  • without store forwarding: 107578 ticks/iteration
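
If you want to take similar measurements yourself, a minimal timing helper could look like the sketch below. It is not the harness that produced the numbers above: `AverageNanoseconds` is a made-up name, and it reports wall-clock nanoseconds from `std::chrono::steady_clock` instead of CPU ticks, so the absolute values will not be comparable. It assumes `iterations` is greater than zero.

```cpp
#include <chrono>

// Hypothetical helper: calls fn() `iterations` times and returns the average
// time per call in nanoseconds. Assumes iterations > 0.
template <typename Fn>
double AverageNanoseconds(Fn fn, unsigned int iterations)
{
    volatile int sink = 0; // volatile so the results of fn() are not discarded
    const auto start = std::chrono::steady_clock::now();
    for (unsigned int i = 0; i < iterations; ++i)
    {
        sink = fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Something like `AverageNanoseconds(StoreForwarding, 1000)` versus `AverageNanoseconds(NoStoreForwarding, 1000)` should then show the same kind of gap, although the exact ratio depends on the machine and the compiler.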

As you can see, the difference is really significant. Accessing the store buffer is not instant, we still have to wait for the data to be written, but it’s much, much faster than hitting memory, of course. Just for kicks I also added a version where we write to memory but read from a register (no ‘volatile’ modifier). It’s a little bit faster than the store forwarding version, but not by much:

  • 13432 ticks/iteration
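
The code for that variant is not shown in the post, so the following is only a guess at what it might look like: the same loop over a non-volatile buffer. `plainBuffer` and `ReadFromRegister` are made-up names, and the MSVC-specific `__declspec(noinline)` is left out to keep the sketch portable.

```cpp
// Hypothetical non-volatile counterpart of ibuffer.
unsigned int plainBuffer[8192];

int ReadFromRegister()
{
    int result = 0;
    for (unsigned int i = 0; i < 8192; ++i)
    {
        plainBuffer[i] = i;      // the store still goes to memory
        result = plainBuffer[i]; // without volatile, the compiler may reuse the
                                 // register holding i instead of loading
    }
    return result;
}
```

Keep in mind that without volatile the optimizer is free to transform this loop quite aggressively, so it is worth checking the generated assembly before trusting any timing.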

Now, as always on PC, these results will vary a little bit, but the relation between them stays the same (store forwarding is a little bit slower than using registers; no store forwarding is almost an order of magnitude slower). Updated cmov/store forwarding test here.
