Ahh... Nothing rivals the feeling of success!
I've finally devised a precise garbage collector (note: a precise GC is very different from conventional GCs such as BoehmGC - a precise GC does not suffer from ambiguities) for C++ Builder that is faster than boost's shared_ptr.
That's a bold claim indeed! And, by claiming that it is faster than shared_ptr, I'm also claiming that I can create a String class that is similarly faster than Delphi string, which is also reference counted like shared_ptr. And yes, the gc_ptr is thread-safe too.
The trick is in the pointer assignment. In my precise GC, a pointer assign is no more expensive than its native equivalent - both are a single instruction operation, whereas a ref counted pointer requires a locked instruction and a refcount check, both of which are *very* expensive on any processors, especially modern multicore ones. Most precise GC source codes you can freely download from the net get the assignment operator correct. However, the part where they lose out big time is in the gc_ptr's constructor and destructor with lock instructions and what not for multithread support (note that some don't even support multithread). These are often non-trivial and is the biggest cause of performance bottleneck.
With a minimal c'tor and d'tor for my gc_ptr, I end up with an insanely fast precise GC engine which runs a .NET like framework in pure native mode. How about that - a managed framework that runs natively. None of the pain (both literally and performance-wise) of pinning objects and marshaling data going in and out of the managed world (anything that interfaces with the real world cannot have a non-fixed address - e.g. it's impossible to write a driver that allocates a .NET managed non-fixed buffer and let your system's DMA fill it up). Looks like .NET's GC team is paying the price of betting on the Stop-And-Copy algo (yes, the .NET GC is nothing more than an advanced version of the Stop-And-Copy GC) - Update: *Correction, I was wrong. The CLR GC is very similar to what I have implemented - in fact, even more so now that I have finished the generational version of the GC. Only notable thing is that my GC does not compact the heap after the sweep phase and does not support weak references or resurrection. Also, it relies on an external memory allocator.
Update 1: GcString is 16.5 times faster than AnsiString in an array reversal test in CB2010 - more info here.
Update 2: FastMM (Rad Studio's default memory manager) is excruciatingly slow in a GC / managed app in multicore systems - more info here.