Nibblesort: Adventures in Optimization

A few months ago, compiler researcher John Regehr held a low-level optimization contest for a silly problem, sorting the nibbles in an arbitrary 64-bit number:

The problem is to sort the 4-bit pieces of a 64-bit word with (unsigned) smaller values towards the small end of the word. The nibble sort of 0xbadbeef is 0xfeedbba000000000. The function you implement will perform this sorting operation on a buffer of 1024 64-bit integers.
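
To make the specification concrete, here is a minimal, unoptimized sketch in C that uses a counting sort over the sixteen possible nibble values. The function names `nibble_sort_word` and `nibble_sort_buffer` are hypothetical, chosen for illustration; they are not the contest's required interface, and this is a baseline for checking correctness rather than a competitive entry.

```c
#include <stddef.h>
#include <stdint.h>

/* Sort the sixteen 4-bit nibbles of one 64-bit word so that smaller
 * values end up at the least significant end of the result. */
uint64_t nibble_sort_word(uint64_t word) {
    unsigned counts[16] = {0};

    /* Count how many times each nibble value (0..15) occurs. */
    for (int i = 0; i < 16; i++) {
        counts[(word >> (4 * i)) & 0xF]++;
    }

    /* Emit values from largest to smallest. Exactly sixteen nibbles are
     * shifted in, so the first (largest) one lands in the top nibble. */
    uint64_t result = 0;
    for (int value = 15; value >= 0; value--) {
        for (unsigned n = 0; n < counts[value]; n++) {
            result = (result << 4) | (uint64_t)value;
        }
    }
    return result;
}

/* Apply the per-word sort in place to a buffer of `len` words
 * (the contest statement uses a buffer of 1024). */
void nibble_sort_buffer(uint64_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        buf[i] = nibble_sort_word(buf[i]);
    }
}
```

Calling `nibble_sort_word(0xbadbeef)` returns `0xfeedbba000000000`, matching the example in the problem statement: the seven nonzero nibbles move to the top in descending order and the nine zero nibbles fill the low end.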