I have long wanted to see a real application of hardware acceleration for software tasks, and that desire only grew after I got into programming the Nios (a soft-core processor embedded in an FPGA). Today I finally completed such an optimization, and all I can say is WOW.
The algorithm took about 18 milliseconds to run when implemented in software, which is too slow for my application (an embedded one, where every millisecond counts). So I implemented part of it in VHDL, wrote some C interface code to exchange data with the hardware accelerator, and brought the runtime down to 0.25 milliseconds (roughly a 70-fold improvement!!!).
The most amazing thing about this is that the whole process took just one day.
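
For readers wondering what the C side of such an accelerator can look like, here is a minimal sketch. It is not my actual code: the register layout (`REG_CTRL`, `REG_STATUS`, `REG_DATA_IN`, `REG_RESULT`), the bit masks, and the `accel_run` function are all hypothetical names, and I am assuming the accelerator is exposed as a small block of memory-mapped 32-bit registers that the CPU polls until a "done" bit is set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register map (word offsets from the accelerator's base).
 * The real layout depends entirely on how the VHDL block is written. */
enum { REG_CTRL = 0, REG_STATUS = 1, REG_DATA_IN = 2, REG_RESULT = 3 };

#define CTRL_START  0x1u  /* assumed: writing this bit starts the computation */
#define STATUS_DONE 0x1u  /* assumed: hardware sets this bit when finished */

/* volatile keeps the compiler from caching or reordering register accesses. */
static inline void reg_write(volatile uint32_t *base, size_t reg, uint32_t value)
{
    base[reg] = value;
}

static inline uint32_t reg_read(volatile uint32_t *base, size_t reg)
{
    return base[reg];
}

/* Push n input words, start the accelerator, then poll for completion.
 * Returns 0 and stores the result on success, -1 if the done bit
 * never appears within max_polls reads of the status register. */
int accel_run(volatile uint32_t *base, const uint32_t *in, size_t n,
              uint32_t max_polls, uint32_t *result)
{
    assert(base != NULL && result != NULL);

    for (size_t i = 0; i < n; i++) {
        reg_write(base, REG_DATA_IN, in[i]);  /* assumed: a FIFO-style input port */
    }
    reg_write(base, REG_CTRL, CTRL_START);

    for (uint32_t poll = 0; poll < max_polls; poll++) {
        if (reg_read(base, REG_STATUS) & STATUS_DONE) {
            *result = reg_read(base, REG_RESULT);
            return 0;
        }
    }
    return -1;  /* timed out */
}
```

On a real Nios system, `base` would be the accelerator's address from the generated system header, and the plain `volatile` accesses would typically be replaced by the HAL's `IORD`/`IOWR` macros so that reads and writes bypass the data cache. Polling is the simplest option; an interrupt on completion is the usual alternative when the CPU has other work to do in the meantime.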