Friday, January 15, 2010

FastMM - Slow in multithreaded apps on multicore CPUs

There's something wrong with the usability of FastMM4 (the default memory manager of Delphi / C++ Builder since BDS2006) on multicore systems, especially when running multithreaded apps in a GC/managed environment. When multiple cores are enabled, performance drops by up to a factor of five. So not only does FastMM fail to scale, your multithreaded apps will run tremendously slower on a multicore system: up to 5 times slower on a dual-core machine than on a single-core one of the same architecture at the same clock speed.

That is nearly a fivefold slowdown (roughly an 80% drop in throughput) going from single-core to dual-core! Comparing the dual-core performance of FastMM4 and TBBMM, the latter is up to 9 times faster!

This test is meant to show just that. Download Test (updated 27/01/2010); see readme.txt for instructions.

WARNING: Incompatible with x64 OS due to an OS bug.

It runs through a variety of algorithms in multiple threads (in a threadpool of the framework, similar to .NET's ThreadPool) consisting of a mix of GC list, GC dictionary, and GC string unit-tests.

Keep in mind that this is an app written using a GC framework, which means allocations usually happen in multiple threads concurrently while de-allocations are done in dedicated garbage collector threads. This may be the reason FastMM breaks down (though a general-purpose memory manager shouldn't break down under any usage pattern).
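A minimal sketch of this usage pattern, in portable C++ rather than the proprietary GC framework used by the test: several worker threads allocate small blocks concurrently, and a single collector thread is the only one that ever frees them. The names `run_gc_pattern` and `GcPatternResult` are hypothetical and exist only for this illustration; the real framework's internals are not shown in this post.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical illustration only: this is NOT the GC framework from the post.
struct GcPatternResult {
    std::size_t allocated;  // blocks allocated by the worker threads
    std::size_t freed;      // blocks freed by the collector thread
};

GcPatternResult run_gc_pattern(unsigned worker_count, std::size_t allocs_per_worker) {
    std::queue<char*> garbage;  // blocks waiting to be collected
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::size_t freed = 0;                   // touched only by the collector
    std::atomic<std::size_t> allocated{0};

    // Collector: the only thread that ever frees memory.
    std::thread collector([&] {
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return done || !garbage.empty(); });
            while (!garbage.empty()) {
                char* block = garbage.front();
                garbage.pop();
                lock.unlock();
                delete[] block;  // freed on a different thread than the one that allocated it
                ++freed;
                lock.lock();
            }
            if (done) return;
        }
    });

    // Workers: allocate concurrently and hand every block to the collector.
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) {
        workers.emplace_back([&] {
            for (std::size_t i = 0; i < allocs_per_worker; ++i) {
                char* block = new char[16 + (i % 8) * 32];  // small, mixed sizes
                block[0] = 1;                               // touch the block
                allocated.fetch_add(1, std::memory_order_relaxed);
                {
                    std::lock_guard<std::mutex> lock(m);
                    garbage.push(block);
                }
                cv.notify_one();
            }
        });
    }

    for (auto& t : workers) t.join();
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();
    collector.join();
    return {allocated.load(), freed};
}
```

Calling `run_gc_pattern(4, 1000)` allocates 4000 blocks on four threads and frees all of them on a fifth. Because every free comes from a thread other than the allocating one, whatever locks the memory manager takes internally are contended between the allocating and freeing sides.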

Notice that when you run the FastMM Test with CPU affinity set to just one CPU, you end up with nearly the same performance as TBBMM. Once you enable multiple cores, though, you immediately lose that performance again and run slower than with just one core.

Note: You'll find that the FastMM BorlndMM.dll is different from the default RAD Studio 2010 one. This is due to the changes added to support the GC framework, but at its heart, it's simply making calls to GetMemory, ReallocMemory and FreeMemory (as opposed to the WinMM version's HeapAlloc, HeapRealloc and HeapFree respectively, with all else being equal). The WinMM version is initialized with the LFH (low fragmentation heap) flag.

Here are some results from my own tests:


Test results in ops/second (10sec average), listed in the following order:
1) TBBMM
2) WinMM
3) FastMM


Core2Duo E6550 2.33GHz (Conroe) - XP SP3
Both cores enabled
1) 1785
2) 1230
3) 250

Single core (via CPU affinity mask)
1) 930
2) 650
3) 950


Core2Duo E6550 throttled to 1.33GHz - XP SP3
Both cores enabled
1) 730
2) 520
3) 180

Single core (via CPU affinity mask)
1) 410
2) 275
3) 395


Pentium M 1.2GHz (Banias) - XP SP3
CPU is Single core
1) 395
2) 340
3) 395

Core2Duo E7200 3.6GHz (Wolfdale) - Vista
Both cores enabled
1) 2595
2) 2080
3) 290

Single core (via CPU affinity mask)
1) 1450
2) 1180
3) 1405

As you can see, the results are quite consistent. On a dual core machine, the performance of FastMM is terrible. From 2.33GHz to 3.6GHz, there's virtually no increase at all in speed! In fact, when the test was running, the CPU wasn't even fully utilized (with more than 50% of CPU spent in kernel time), whereas the other memory managers had the CPU pegged at 100% and nearly no kernel time.
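For readers who want to check the headline figures, the ratios can be recomputed directly from the numbers listed above. The two helper functions below are hypothetical names introduced only for this worked example.

```cpp
#include <cassert>
#include <cmath>

// How many times slower `slow` is than `fast` (both in ops/sec).
double slowdown_factor(double fast, double slow) {
    return fast / slow;
}

// Throughput lost going from `before` to `after`, as a percentage of `before`.
double percent_drop(double before, double after) {
    return (before - after) / before * 100.0;
}

// Using the Core2Duo E7200 results from the post:
//   FastMM single core = 1405, FastMM dual core = 290, TBBMM dual core = 2595
//   slowdown_factor(1405, 290) is about 4.84  (single core vs dual core)
//   percent_drop(1405, 290)    is about 79.4  (percent of throughput lost)
//   slowdown_factor(2595, 290) is about 8.95  (TBBMM vs FastMM, dual core)
```

On the E6550 at 2.33GHz the same calculation gives 950 / 250 = 3.8 for single core versus dual core and 1785 / 250, or about 7.1, for TBBMM versus FastMM, so the "5 times" and "9 times" figures are the upper end of what was measured.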

If you wish to try it out on your system, download this GC speed tester (updated 27/01/2010) and unzip it to a folder of your choice. Then, run "Run All Tests.bat" and follow the on-screen instructions. Note that the GC Speed Test app will run indefinitely, so once you take note of the speed (ops/sec), you can quit the app to move on to the next test.

I'd appreciate it if you could post your results here in the comments in the same format as the ones above - i.e. CPU make (I'd love to see how AMD CPUs fare) and model number as well as the frequency, OS / service pack, and the results.

My advice? For an all-round memory manager, use the Windows default one. It may be a little slower than FastMM on a single core, but it certainly scales very well on multicore systems. Alternatively, the Intel TBB allocator scales almost perfectly and is the fastest memory manager around. The only downside is that it consumes more RAM.

Regardless, I'd stay away from FastMM4 (and thus the default memory manager of Delphi / C++ Builder).

14 comments:

Zach Saw said...

After several rounds of optimization of the GC framework, FastMM4 is now a tad faster. But, so are the rest of the memory managers:

Core2Duo E6550 2.33GHz (Conroe) - XP SP3
Both cores enabled
1) 1845
2) 1280
3) 450

FastMM4 is still more than 4 times slower than the leading memory manager.

Anonymous said...

Where is source code for TBB so that others can try to reproduce the results?

Anonymous said...

Your gctester dies on my machine with an assertion failure: CATASTROPHIC FAILURE, bad hardware or bad compiler, etc. Why don't you just post source? That is how these things are normally done.

Zach Saw said...

The TBB source is available from Intel's TBB website.

'these things'? I think you may have a different opinion on what the intention of my article is - which is simply to point out the potential usage patterns that may break FastMM so readers will test their apps against other memory managers if they too encounter a slowdown. Given that the source is using a framework that is proprietary, it would be meaningless to post just the test code. So the compromise is unfortunately only to have the binaries available for testing.

Also, when you encounter a CATASTROPHIC FAILURE, it means the compiler has done a duplicated d'tor call (happens a lot with all versions of CBuilder but it's getting better).

Anonymous said...

I am making multithreaded apps. How do I know whether my Delphi 2007 uses FastMM? I can't find MM.pas.FastMM.dll in the installation folder, but I found borlndmm.dll.

Zach Saw said...

Version 2006 and above all use FastMM by default. With or without Borlndmm.dll as an external dependency, you (your apps) are using FastMM. That *IS* the default memory manager.

Vency said...

Hi Zach,
I downloaded and ran your test on my machine - Intel Core i7 920 (Nehalem, 2.66GHz, 4 real cores with hyperthreading = 8 logical cores) running 32-bit Windows Vista.
And here are the results:

1) 3900 (100% cpu utilization)
2) 3500 (100% cpu utilization)
3) 370 (60% cpu utilization)

Unknown said...

There is a "bug" in FastMM that is floating around the internet. This bug causes FastMM to call Sleep when there is contention on Free. Try setting the "NeverSleepOnThreadContention" option and recompiling. You will see this baby fly :-)

Anonymous said...

Zach,

We are in a situation where only your TBB-based memory manager seems to help. Nexus and FastMM both fail to scale in our application running on 24 cores.

We are, however, struggling with your comment that your MM falls back to FastMM for larger blocks. Since we are still forced to use Delphi 6, does your MM still fall back to FastMM (somehow internally) or to the default Delphi 6 memory manager, which would be very slow?

Your help is greatly appreciated.

How can we contact you directly?

Kind regards
Martin

Zach Saw said...

Send me a PM on my website

Zach Saw said...

@Carsten

RE: Try to set the "NeverSleepOnThreadContention" option and compile. You will see this baby fly :-)

I tried that option with FastMM v4.94, but it caused the multithreaded app to hang in the loop where FastMM tries to acquire the lock for medium blocks. Enabling "NeverSleepOnThreadContention" appears to have uncovered yet another bug in FastMM4.

There's an easy fix though - call SwitchToThread to allow other threads to proceed if the lock can't be acquired.
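The idea can be sketched as a spin lock that gives up its time slice whenever the lock is already held. This is a portable C++ illustration, not FastMM4's actual code (which is Pascal and assembly and is not shown here): `std::this_thread::yield()` stands in for the Win32 `SwitchToThread` call, and `YieldingSpinLock` and `count_with_lock` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical sketch: a spin lock that yields instead of busy-waiting or sleeping.
class YieldingSpinLock {
    std::atomic_flag locked_ = ATOMIC_FLAG_INIT;

public:
    void lock() {
        // test_and_set returns the previous value: true means someone else holds the lock.
        while (locked_.test_and_set(std::memory_order_acquire)) {
            // Assumption: portable stand-in for Win32 SwitchToThread().
            // Let another ready thread (possibly the lock holder) run.
            std::this_thread::yield();
        }
    }

    void unlock() {
        locked_.clear(std::memory_order_release);
    }
};

// Small exercise: several threads increment a shared counter under the lock.
long count_with_lock(unsigned thread_count, long increments_per_thread) {
    YieldingSpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < thread_count; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;  // protected by the spin lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Yielding matters most when there are more runnable threads than cores: a waiter that spins without yielding can keep the lock holder off the CPU, which looks like a hang, while a waiter that sleeps for a fixed interval wastes time after the lock has already been released.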

Anonymous said...

Is this still an issue with XE? Did FastMM get 'fixed' in any way?

Zach Saw said...

No. Pierre's only put in an ifdef option to fix this problem in FastMM4.97 which was released very recently following my complaints.

Anonymous said...

FastMM4 has improved since 2010, but it still has room for improvement when it comes to multi-threaded applications.

I have made a fork to improve the multi-threaded performance of FastMM4. See https://github.com/maximmasiutin/FastMM4

Here is a comparison of the original FastMM4 version 4.992, with default
options, compiled for Win64 by Delphi 10.2 Tokyo (Release with Optimization),
and the current FastMM4-AVX branch. Under some scenarios, the FastMM4-AVX branch
is more than twice as fast as the original FastMM4. The tests
were run on two different computers: one with a Xeon E6-2543v2 with 2 CPU
sockets, each with 6 physical cores (12 logical threads), with only 5 physical
cores per socket enabled for the test application. The other test was done on
an i7-7700K CPU.

The tests used the "Multi-threaded allocate, use and free" and "NexusDB"
test cases from the FastCode Challenge Memory Manager test suite,
modified to run under 64-bit.



                      Xeon E6-2543v2 2*CPU           i7-7700K CPU
                      (allocated 20 logical          (allocated 8 logical
                      threads, 10 physical           threads, 4 physical
                      cores, NUMA)                   cores)

                      Orig.    AVX-br.  Ratio        Orig.    AVX-br.  Ratio
                      ------   -------  ------       ------   -------  ------
02-threads realloc     96552     59951  62.09%        65213     49471  75.86%
04-threads realloc     97998     39494  40.30%        64402     47714  74.09%
08-threads realloc     98325     33743  34.32%        64796     58754  90.68%
16-threads realloc    116708     45855  39.29%        71457     60173  84.21%
16-threads realloc    116273     45161  38.84%        70722     60293  85.25%
31-threads realloc    122528     53616  43.76%        70939     62962  88.76%
64-threads realloc    137661     54330  39.47%        73696     64824  87.96%
NexusDB 02 threads    122846     90380  73.72%        79479     66153  83.23%
NexusDB 04 threads    122131     53103  43.77%        69183     43001  62.16%
NexusDB 08 threads    124419     40914  32.88%        64977     33609  51.72%
NexusDB 12 threads    181239     55818  30.80%        83983     44658  53.18%
NexusDB 16 threads    135211     62044  43.61%        59917     32463  54.18%
NexusDB 31 threads    134815     48132  33.46%        54686     31184  57.02%
NexusDB 64 threads    187094     57672  30.25%        63089     41955  66.50%


(The tests were done on 14-Jul-2017.)