Friday, December 4, 2009

A Precise Garbage Collector for C++ Builder

Ahh... Nothing rivals the feeling of success!

I've finally devised a precise garbage collector (note: a precise GC is very different from conventional GCs such as BoehmGC - a precise GC does not suffer from ambiguities) for C++ Builder that is faster than boost's shared_ptr.

That's a bold claim indeed! And, by claiming that it is faster than shared_ptr, I'm also claiming that I can create a String class that is similarly faster than Delphi string, which is also reference counted like shared_ptr. And yes, the gc_ptr is thread-safe too.

The trick is in the pointer assignment. In my precise GC, a pointer assign is no more expensive than its native equivalent - both are a single instruction operation, whereas a ref counted pointer requires a locked instruction and a refcount check, both of which are *very* expensive on any processors, especially modern multicore ones. Most precise GC source codes you can freely download from the net get the assignment operator correct. However, the part where they lose out big time is in the gc_ptr's constructor and destructor with lock instructions and what not for multithread support (note that some don't even support multithread). These are often non-trivial and is the biggest cause of performance bottleneck.

With a minimal c'tor and d'tor for my gc_ptr, I end up with an insanely fast precise GC engine which runs a .NET like framework in pure native mode. How about that - a managed framework that runs natively. None of the pain (both literally and performance-wise) of pinning objects and marshaling data going in and out of the managed world (anything that interfaces with the real world cannot have a non-fixed address - e.g. it's impossible to write a driver that allocates a .NET managed non-fixed buffer and let your system's DMA fill it up). Looks like .NET's GC team is paying the price of betting on the Stop-And-Copy algo (yes, the .NET GC is nothing more than an advanced version of the Stop-And-Copy GC) - Update: *Correction, I was wrong. The CLR GC is very similar to what I have implemented - in fact, even more so now that I have finished the generational version of the GC. Only notable thing is that my GC does not compact the heap after the sweep phase and does not support weak references or resurrection. Also, it relies on an external memory allocator.

Update 1: GcString is 16.5 times faster than AnsiString in an array reversal test in CB2010 - more info here.

Update 2: FastMM (Rad Studio's default memory manager) is excruciatingly slow in a GC / managed app in multicore systems - more info here.

6 comments:

OldTabby said...

Zach,
Check your website, zachsaw.co.cc, it's showing a notice that says your domain registration has expired! I've got PackIt listed on Freewarehome.com & the broken link was reported but I don't have an email address for you.
Dee

Zach Saw said...

Are you sure about that? I've just checked and it's fine.

Moritz Beutel said...

Hello Zach,
that's certainly interesting. Do you plan to release it sometime? :)

Zach Saw said...

Hey Moritz,

Good to see ya here :) It's a framework which I developed for the company I work for and it's being used in production software. Not sure if there're any plans for releasing it, but I sure hope some day it'll be released as open-source. Then everyone could contribute to converting the .NET framework to native and have programs written in .NET to be easily portable to C++ Builder.

I've done more tests to gauge the speed and it seems that allocations (gcnew) is 1.5 times slower than pure 'new'. Assigning a gcptr to another gcptr is 4 times faster than shared_ptr (in a tight assign loop). Haven't tested the instantiation of gcptr vs shared_ptr yet but I suspect it'll be about the same - without the prologue and epilogue added by the compiler for c'tors, gcptr's c'tor is typically 20 asm instructions.

What's more interesting to me is that I can port most of the .NET framework to C++ Builder based on the garbage collector library. .NET's gcnew equivalent is mightily fast as it's basically a pointer bump, thanks to its stop-and-copy algo. The downside of that is the troublesome pinning of gc-objects for comms going off to the outside world. That said, however, speed of gc isn't much of an issue considering how fast it is, unless it's tied up in meaningless benchmark of a tight-looped gcnew vs new.

There're still a fair bit to go with this as it still lacks a generational garbage collection feature. But for 100,000 objects in flight, you won't even notice any performance drop caused by the GC. This is mostly enough for 32-bit apps, but obviously going to 64-bit a generational GC will be much more beneficial.

Anonymous said...

Hi Zach,

Could you talk about more about the design of your GC. I want to port a C# application to C++ and want to try implement a simple GC. However, it is complicated. Are you using reference count and is it a concurrent collector? Do you use some lock-free data structure? Thanks,

-Steven

Zach Saw said...

Hi Steven,

It's very complicated if you want a high performance GC (i.e. non blocking multithreaded while maintaining the fully thread-safe nature of a gc_ptr). A gc_ptr can be fully thread-safe as oppose to a shared_ptr which only guarantees thread-safe operations in a limited manner.

It's not that complicated if you want something simple that doesn't scale well and doesn't guarantee the same thread-safety-ness as a .NET managed ptr.