Friday, January 15, 2010

FastMM - Slow in multithreaded apps on multicore CPUs

Something is wrong with how FastMM4 (the default memory manager of Delphi / C++ Builder since BDS2006) behaves on multicore systems, especially when running multithreaded apps in a GC/managed environment. With more than one core enabled, performance drops by up to a factor of five. So not only does FastMM fail to scale, your multithreaded apps will actually run much slower on a multicore system: up to 5 times slower on a dual-core machine than on a single-core one of the same architecture at the same clock speed.

That's a fivefold slowdown (roughly an 80% drop in throughput) going from single-core to dual-core! Comparing the dual-core performance of FastMM4 and TBBMM, the latter is about 9 times faster!

This test is meant to show just that. Download Test (updated 27/01/2010) (see readme.txt for instructions) *** WARNING: Incompatible with x64 OS due to an OS bug.

It runs a variety of algorithms, a mix of GC list, GC dictionary, and GC string unit tests, in multiple threads (in the framework's thread pool, which is similar to .NET's ThreadPool).

Keep in mind that this is an app written using a GC framework, which means allocations usually happen in multiple threads concurrently while de-allocations are done in specialized garbage collector threads. This may be the reason FastMM breaks down (though a general-purpose memory manager shouldn't break down under any usage pattern).
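The allocation pattern described above can be sketched in portable C++. This is not the framework's code: `FreeQueue` and `run_cross_thread_pattern` are hypothetical names, and `malloc`/`free` stand in for whichever memory manager is under test.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdlib>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical hand-off queue: worker threads push blocks, one collector frees them.
class FreeQueue {
public:
    void push(void* block) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            blocks_.push_back(block);
        }
        ready_.notify_one();
    }

    // Waits until a block is available; returns nullptr once closed and drained.
    void* pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !blocks_.empty() || closed_; });
        if (blocks_.empty()) return nullptr;
        void* block = blocks_.front();
        blocks_.pop_front();
        return block;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<void*> blocks_;
    bool closed_ = false;
};

// Allocates in `workers` threads and frees everything in a single collector
// thread, mimicking the GC pattern. Returns the number of blocks freed.
std::size_t run_cross_thread_pattern(unsigned workers,
                                     std::size_t blocks_per_worker,
                                     std::size_t block_size) {
    assert(workers > 0 && block_size > 0);
    FreeQueue queue;
    std::size_t freed = 0;

    std::thread collector([&] {
        while (void* block = queue.pop()) {
            std::free(block);  // freed by a thread that never allocated it
            ++freed;
        }
    });

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (std::size_t i = 0; i < blocks_per_worker; ++i) {
                void* block = std::malloc(block_size);
                if (block != nullptr) queue.push(block);
            }
        });
    }
    for (std::thread& t : pool) t.join();

    queue.close();     // no more blocks will arrive
    collector.join();  // after this, reading `freed` is safe
    return freed;
}
```

Timing a call such as `run_cross_thread_pattern(4, 100000, 64)` under each memory manager would give a rough feel for how cross-thread frees behave, although it is far simpler than the actual GC test.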

Notice that when you run the FastMM Test with CPU Affinity set to just one CPU, you'll end up with nearly the same performance as TBBMM. Once you enable multiple cores, though, you immediately lose that performance again and run slower than with just one core.

Note: You'll find that the FastMM BorlndMM.dll is different from the default RAD Studio 2010 one. This is due to the changes added to support the GC framework, but at its heart, it simply makes calls to GetMemory, ReallocMemory and FreeMemory (as opposed to the WinMM version, which calls HeapAlloc, HeapReAlloc and HeapFree respectively, with all else being equal). The WinMM version is initialized with the LFH (low fragmentation heap) flag.

Here are some results from my own tests:


Test results in ops/second (10sec average), listed in the following order:
1) TBBMM
(what is TBBMM?)
2) WinMM
3) FastMM


Core2Duo E6550 2.33GHz (Conroe) - XP SP3
Both cores enabled
1) 1785
2) 1230
3) 250

Single core (via CPU affinity mask)
1) 930
2) 650
3) 950


Core2Duo E6550 throttled to 1.33GHz - XP SP3
Both cores enabled
1) 730
2) 520
3) 180

Single core (via CPU affinity mask)
1) 410
2) 275
3) 395


Pentium M 1.2GHz (Banias) - XP SP3
CPU is Single core
1) 395
2) 340
3) 395

Core2Duo E7200 3.6GHz (Wolfdale) - Vista
Both cores enabled
1) 2595
2) 2080
3) 290

Single core (via CPU affinity mask)
1) 1450
2) 1180
3) 1405

As you can see, the results are quite consistent. On a dual core machine, the performance of FastMM is terrible. From 2.33GHz to 3.6GHz, there's virtually no increase at all in speed! In fact, when the test was running, the CPU wasn't even fully utilized (with more than 50% of CPU spent in kernel time), whereas the other memory managers had the CPU pegged at 100% and nearly no kernel time.
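For reference, the ratios quoted at the top of this post can be recomputed from the E7200 figures above. A minimal sketch, with hypothetical helper names:

```cpp
#include <cassert>

// How many times faster `a_ops` is than `b_ops` (both in ops/second).
double speed_ratio(double a_ops, double b_ops) {
    assert(b_ops > 0.0);
    return a_ops / b_ops;
}

// Percentage of throughput lost going from `before_ops` to `after_ops`.
double percent_drop(double before_ops, double after_ops) {
    assert(before_ops > 0.0);
    return (before_ops - after_ops) / before_ops * 100.0;
}

// E7200 figures from the results above:
//   speed_ratio(1405, 290)  -> about 4.8 (FastMM, single core vs both cores)
//   speed_ratio(2595, 290)  -> about 8.9 (TBBMM vs FastMM, both cores)
//   percent_drop(1405, 290) -> about 79.4 (% of FastMM throughput lost)
```

So "5 times slower" and "9 times faster" are rounded-up versions of these two ratios.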

If you wish to try it out on your system, download this GC speed tester (updated 27/01/2010) and unzip it to a folder of your choice. Then, run "Run All Tests.bat" and follow the on-screen instructions. Note that the GC Speed Test app will run indefinitely, so once you take note of the speed (ops/sec), you can quit the app to move on to the next test.

I'd appreciate it if you could post your results here in the comments in the same format as the ones above - i.e. CPU make (I'd love to see how AMD CPUs fare) and model number as well as the frequency, OS / service pack, and the results.

My advice? For an all-round memory manager, use the Windows default one. It may be a little slower than FastMM on a single core, but it certainly scales very well on multicore systems. Alternatively, the Intel TBB allocator scales nearly perfectly and is the fastest memory manager around. The only catch is that it consumes more RAM.

Regardless, I'd stay away from FastMM4 (and thus from the default memory manager of Delphi / C++ Builder).

13 comments:

Zach Saw said...

After several rounds of optimization of the GC framework, FastMM4 is now a tad faster. But, so are the rest of the memory managers:

Core2Duo E6550 2.33GHz (Conroe) - XP SP3
Both cores enabled
1) 1845
2) 1280
3) 450

FastMM4 is still more than 4 times slower than the leading memory manager.

jamesrhodes said...

Where is source code for TBB so that others can try to reproduce the results?

jamesrhodes said...

Your gctester dies on my machine with an assertion failure: CATASTROPHIC FAILURE, bad hardware or bad compiler, etc. Why don't you just post source? That is how these things are normally done.

Zach Saw said...

The TBB source is available from Intel's TBB website.

'these things'? I think you may have a different opinion on what the intention of my article is - which is simply to point out the potential usage patterns that may break FastMM so readers will test their apps against other memory managers if they too encounter a slowdown. Given that the source is using a framework that is proprietary, it would be meaningless to post just the test code. So the compromise is unfortunately only to have the binaries available for testing.

Also, when you encounter a CATASTROPHIC FAILURE, it means the compiler has done a duplicated d'tor call (happens a lot with all versions of CBuilder but it's getting better).

Anonymous said...

I am making multithreaded apps. How do I know whether my Delphi 2007 uses FastMM? I can't find MM.pas.FastMM.dll in the installation folder, but I found borlndmm.dll.

Zach Saw said...

Version 2006 and above all use FastMM by default. With or without Borlndmm.dll as an external dependency, you (your apps) are using FastMM. That *IS* the default memory manager.

Vency said...

Hi Zach,
I downloaded and ran your test on my machine - Intel Core i7 920 (Nehalem, 2.66GHz, 4 real cores with hyperthreading = 8 logical cores) running 32-bit Windows Vista.
And here are the results:

1) 3900 (100% cpu utilization)
2) 3500 (100% cpu utilization)
3) 370 (60% cpu utilization)

Carsten said...

There is a "bug" in FastMM that has been floating around the internet. It causes FastMM to call Sleep when there is contention on Free. Try setting the "NeverSleepOnThreadContention" option and recompiling. You will see this baby fly :-)

Anonymous said...

Zach,

We are in a situation where only your TBB-based memory manager seems to help. Nexus and FastMM both fail to scale in our application running on 24 cores.

We are, however, struggling with your comment that your MM falls back to FastMM for larger blocks. Since we are still forced to use Delphi 6, does your MM still fall back to FastMM (somehow internally) or to the default Delphi 6 memory manager, which would be very slow?

Your help is greatly appreciated.

How can we contact you directly?

Kind regards
Martin

Zach Saw said...

Send me a PM on my website

Zach Saw said...

@Carsten

RE: Try setting the "NeverSleepOnThreadContention" option and recompiling. You will see this baby fly :-)

I tried that option with FastMM v4.94, but it caused the multithreaded app to hang in the loop where FastMM tries to acquire the lock for medium blocks. Enabling "NeverSleepOnThreadContention" appears to have uncovered yet another bug in FastMM4.

There's an easy fix though - call SwitchToThread to allow other threads to proceed if the lock can't be acquired.
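The idea can be sketched in portable C++. This is not FastMM's actual locking code: `YieldingSpinLock` and `count_with_lock` are hypothetical names, and `std::this_thread::yield()` is used as a stand-in for the Win32 `SwitchToThread` call.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Illustrative spin lock: on contention it gives up the rest of its time
// slice instead of spinning hot or sleeping for a fixed interval.
class YieldingSpinLock {
public:
    void lock() {
        // exchange() returns the previous value; true means someone holds the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // let the lock holder make progress
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Increments a shared counter from several threads under the lock and
// returns the final value, which should equal threads * iterations.
long count_with_lock(unsigned threads, long iterations) {
    assert(threads > 0 && iterations >= 0);
    YieldingSpinLock lock;
    long counter = 0;

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}
```

Yielding lets a waiting thread hand the CPU straight back to whichever thread holds the lock, which matters most when there are more runnable threads than cores.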

Anonymous said...

Is this still an issue with XE? Did FastMM get 'fixed' in any way?

Zach Saw said...

No. Pierre's only put in an ifdef option to fix this problem in FastMM4.97 which was released very recently following my complaints.