Monday, November 17, 2008

FastMM's Multicore Performance Scaling

Having used .NET's asynchronous sockets, FileStream, and WinForms' Invoke / BeginInvoke, I fell in love with the design. Knowing that it uses IOCP and the threadpool WinAPI under the hood, I decided to write my own in C++, using C++ Builder. Everything went well until I started testing. I realised that my dual-core machine was not being fully utilized when running some of the most intensive tests, which easily cause the threadpool to spawn extra threads to serve the load.
I started investigating and eventually managed to reproduce the issue with just the following code:
void __fastcall TAnsiStringTesterThread::Execute()
{
     AnsiString str;
     for (int i=0; i<10000000; i++)
     {
          // Each assignment from a literal allocates a new string buffer
          // and releases the previous one.
          str = " something ";
     }
}
Delphi didn't seem to suffer from the same problem at first sight when I ran the Delphi equivalent of the above. That is, until I tried the following (instead of assigning a string literal directly, I called IntToStr):
procedure TAnsiStringTesterThread.Execute;
var
  i: Integer;
  str: string;
begin
  for i := 0 to 10000000 - 1 do
    str := IntToStr(10000000);
end;
What both pieces of code have in common is that they call LStrFromPCharLen, which eventually leads to a call to GetMem and, when the string's ref count drops back to zero, a call to FreeMem. Could GetMem and FreeMem be the culprit? As it turns out, yes. To put the hypothesis to the test, I ran a tight GetMemory / FreeMemory loop in 2 threads and observed the CPU usage. Unsurprisingly, only 50% of my dual-core CPU was utilized, even though the load was spread quite evenly across both cores.
In my search for a better memory manager, I came across the Intel Threading Building Blocks library. Among other useful things like parallel loops, concurrent hash maps and lock-free queues, it has a scalable memory manager. With that, I wrote a BorlndMM.dll wrapper and called it TBBMM. Here's the result:



In the single-threaded test, TBBMM is only 20% faster than FastMM. With 2 to 8 threads on a dual-core machine, however, the improvement ranges from 2.25x to a staggering 2.5x.
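For anyone who wants to try the tight allocate / free test described earlier, here is a minimal sketch in standard C++ rather than C++ Builder. It is not the code I used for the measurements: std::malloc and std::free stand in for GetMemory / FreeMemory, and the names alloc_free_loop and run_alloc_test, as well as the block size passed in, are purely illustrative assumptions.

```cpp
#include <chrono>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: repeatedly allocates and frees one small block.
// std::malloc / std::free are stand-ins (assumption) for GetMemory / FreeMemory.
void alloc_free_loop(int iterations, std::size_t block_size)
{
    for (int i = 0; i < iterations; ++i)
    {
        void* p = std::malloc(block_size);
        // Write through a volatile pointer so the compiler cannot
        // optimise the allocate / free pair away.
        if (p) *static_cast<volatile char*>(p) = 1;
        std::free(p);
    }
}

// Hypothetical helper: runs the loop in thread_count threads at once
// and returns the elapsed wall-clock time in seconds.
double run_alloc_test(int thread_count, int iterations, std::size_t block_size)
{
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t)
        workers.emplace_back(alloc_free_loop, iterations, block_size);
    for (auto& w : workers)
        w.join();

    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Each thread performs the full number of iterations, so comparing run_alloc_test(1, n, 16) with run_alloc_test(2, n, 16) gives a rough picture of scaling on a dual-core machine: if the memory manager scales, the two runs should take about the same time, and if the threads end up waiting on each other, the two-thread run drifts towards twice as long.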
For more information, or to download TBBMM, visit my TBBMM webpage.