Zach Saw's Blog: 2010

Wednesday, November 24, 2010

Fast memcpy for large blocks

Memory copy of 8MB blocks can be quite slow.

I found that both memcpy and CopyMemory won't utilize the full bandwidth of your RAM due to memory controller bottlenecks (I suspect the memory controller isn't smart enough to prefetch the right data). So this implementation by William Chan issues SSE2 prefetch instructions and gets the memory controller to literally stream the data back and forth from RAM in the fastest manner.

Note though, that you'll need to give it 16-byte aligned memory and it copies in 128-byte blocks.

The result is here (on my Core2Duo Wolfdale CPU @ 3.6GHz, dual channel DDR2 @ 800MHz):

memcpy/CopyMemory:
1871.775MB/sec

William Chan's SSE2 memcpy:
3540.471MB/sec

That's nearly double the speed of the naive memcpy!

Saturday, November 13, 2010

Reporting a bug against Windows OS - possible?

Is Microsoft so arrogant that they think their OS is bug-free and that no one should ever need to report a bug against their OS?

There doesn't seem to be a way to report a bug against the Windows OS (no such category in Microsoft Connect). 2 years ago, I tried to report a bug against Vista - TreeView Indent in Vista causes HitTest to fail. The closest way would be to report it against VisualStudio. Surprise surprise, the VS product team closed it as external and did absolutely *nothing* after that. Sure, they redirected me to a non-existent website, which was supposed to be the MSDN forums, but that was where I was redirected to MS Connect in the first place! Talk about going around in circles... This really reminds me of Telstra (the ex-government owned now privatized biggest telco in Australia).

With regards to the bug that I've just reported, I'd expect them to do the same, and close it as external, not knowing which department they need to talk to. Microsoft is so big, dumb and slow that its right hand really have no idea what the left hand is doing! Sad really...

* Edit: Looks like I'm not the only one complaining:

Reporting a bug in Vista 64 WOW64
How do you file a bug report for Windows?
Problems with comdlg32.ocx, Windows Vista and long file names/extension´s

WOW64 bug: GetThreadContext() may return stale contents

My GC app runs perfectly fine under native x86 OS's starting from XP.

I've installed Windows 7 x64 recently (I've always used Win7-x86 due to 64-bit being in its infancy yet for the stuff that I do on my desktop) and to my dismay the app fails after a few minutes of running.

After hours of debugging (GC is never easy to debug, let alone a hypothesis that involves a bug in WOW64 which has been used by millions since XP-64), I found that WOW64 clobbers the ESP value (as returned by GetThreadContext(), of which the the GC thread in the app relies upon to get the current stack pointer of mutator threads) when it does a call out to long mode. I also found that ESP is always restored upon returning.

Prior to calling GetThreadContext(), the GC thread suspends all mutator threads. If it just so happens to suspend the mutator thread while it's running long mode code in user mode, the ESP value gets changed to a value indicating a higher address than the actual stack pointer (remember on IA, stack 'grows' downwards). I've seen this happen for SetEvent() and SwitchToThread() (as these are the most frequently called kernel functions in the app).

This means that either SuspendThread is suspending a thread in an incompatible way to native x86, or the thread's context in WOW64 is not being protected when the code jumps to translation mode. Either way, I was sure it's a bug.

I then found this article (difference between WOW64 and native x86 system DLLs) and while the article isn't exactly addressing the issue I'm facing, I found it very useful because this guy (Skywing from Microsoft) certainly knows WOW64 very well. I proceeded to email him and he replied with the following:

[...] there’s an issue with get and set context operations against amd64 Wow64 threads returning bad information in circumstances when the thread is running long mode code in user mode. This relates to us [Microsoft] pulling the Wow64 context from the TLS slot (as described in the [this] article) before that context structure has been updated with current contents.

That sounded very much like the issue here. So, I decided to dig deeper and put a few software traps to try and catch it in the act.

This is what I found.

The stale contents from GetThreadContext() actually came from the previous system call out (a looong way up the stack really - it's not as if it's a few instructions ago). It should've returned contents from the *current* system call out instead (or to be precise, just before the call out to long mode took place). Like Skywing said, they pulled the context before it's updated with the current contents.

With that said, we can now conclude that it is indeed an OS bug (Win7 SP1 hasn't fixed it).

Update 29 March 2014: As of Windows 8.1, this bug is still *NOT* fixed!

* I'd like to thank Skywing for his effort in assisting me to root cause this issue.

Thursday, October 21, 2010

BCC32 the Optimizing Compiler?

We all know asking the bcc32 compiler team to do put more effort into making the codegen generate more efficient / optimized asm is a *BIG ASK*, seeing how they haven't even got the basics (compiler bugs that Microsoft VC++ compiler fixed since MSVC 2005) working correctly. VC++ 6 was worse than C++ Builder 6, but since then, Microsoft have worked hard and fixed up their compiler, made it fully optimizing and generate one of the fastest and most efficient code in the world. It is also now one of the most compliant C++ compilers of all.

Let's compare the following code sample.

For the uninformed, the following usage pattern is found in a lot of expanded template code (I'm using it a lot in my GC framework and Boost uses it too).

#include <tchar.h>

struct foo
{
    inline operator bool() const
    {
        return false;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    if (foo())
        return 0; // *see footnote
    return 0;
}

*footnote: usually this is something more meaningful, but here I'm trying to illustrate how massively unintelligent bcc32
is.

bcc32 (C++ Builder XE) command line:

    bcc32 -O2 -Hs- -C8 -v- -vi test8.cpp

generated asm:

push ebp
mov ebp,esp
add esp,-$08
push edi
lea edi,[ebp-$08]
xor eax,eax
mov ecx,$00000008
rep stosb
lea eax,[ebp-$08]
xor edx,edx
test dl,dl
jz $00401201
xor eax,eax
jmp $00401203
xor eax,eax
pop edi
pop ecx
pop ecx
pop ebp
ret

cl command line:

    cl /Ox /Ot test8.cpp

generated asm:

xor eax,eax
ret

BCC32 generated 18 lines of useless opcodes when really only 2 are required. With BCC32, your code would be a few hundred times slower (taking into account of memory access latencies).

Tuesday, July 6, 2010

SerialPort IOException Workaround in C#

As promised, I've whipped up a quick workaround to fix the problem as described here.

Here's the code:

// Copyright 2010-2014 Zach Saw
// 
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// 
//     http://www.apache.org/licenses/LICENSE-2.0
// 
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

using System;
using System.IO;
using System.IO.Ports;
using System.Runtime.InteropServices;
using System.Text;
using Microsoft.Win32.SafeHandles;

namespace SerialPortTester
{
    public class SerialPortFixer : IDisposable
    {
        public static void Execute(string portName)
        {
            using (new SerialPortFixer(portName))
            {
            }
        }
        #region IDisposable Members

        public void Dispose()
        {
            if (m_Handle != null)
            {
                m_Handle.Close();
                m_Handle = null;
            }
        }

        #endregion

        #region Implementation

        private const int DcbFlagAbortOnError = 14;
        private const int CommStateRetries = 10;
        private SafeFileHandle m_Handle;

        private SerialPortFixer(string portName)
        {
            const int dwFlagsAndAttributes = 0x40000000;
            const int dwAccess = unchecked((int) 0xC0000000); 
 
            if ((portName == null) || !portName.StartsWith("COM", StringComparison.OrdinalIgnoreCase))
            {
                throw new ArgumentException("Invalid Serial Port", "portName");
            }
            SafeFileHandle hFile = CreateFile(@"\\.\" + portName, dwAccess, 0, IntPtr.Zero, 3, dwFlagsAndAttributes,
                                              IntPtr.Zero);
            if (hFile.IsInvalid)
            {
                WinIoError();
            }
            try
            {
                int fileType = GetFileType(hFile);
                if ((fileType != 2) && (fileType != 0))
                {
                     throw new ArgumentException("Invalid Serial Port", "portName");
                }
                m_Handle = hFile;
                InitializeDcb();
            }
            catch
            {
                hFile.Close();
                m_Handle = null;
                throw;
            }
        }

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        private static extern int FormatMessage(int dwFlags, HandleRef lpSource, int dwMessageId, int dwLanguageId,
                                                StringBuilder lpBuffer, int nSize, IntPtr arguments);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        private static extern bool GetCommState(SafeFileHandle hFile, ref Dcb lpDcb);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        private static extern bool SetCommState(SafeFileHandle hFile, ref Dcb lpDcb);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        private static extern bool ClearCommError(SafeFileHandle hFile, ref int lpErrors, ref Comstat lpStat);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        private static extern SafeFileHandle CreateFile(string lpFileName, int dwDesiredAccess, int dwShareMode,
                                                        IntPtr securityAttrs, int dwCreationDisposition,
                                                        int dwFlagsAndAttributes, IntPtr hTemplateFile);

        [DllImport("kernel32.dll", SetLastError = true)]
        private static extern int GetFileType(SafeFileHandle hFile);

        private void InitializeDcb()
        {
            Dcb dcb = new Dcb();
            GetCommStateNative(ref dcb);
            dcb.Flags &= ~(1u << DcbFlagAbortOnError);
            SetCommStateNative(ref dcb);
        }

        private static string GetMessage(int errorCode)
        {
            StringBuilder lpBuffer = new StringBuilder(0x200);
            if (
                FormatMessage(0x3200, new HandleRef(null, IntPtr.Zero), errorCode, 0, lpBuffer, lpBuffer.Capacity,
                              IntPtr.Zero) != 0)
            {
                return lpBuffer.ToString();
            }
            return "Unknown Error";
        }

        private static int MakeHrFromErrorCode(int errorCode)
        {
            return (int) (0x80070000 | (uint) errorCode);
        }

        private static void WinIoError()
        {
            int errorCode = Marshal.GetLastWin32Error();
            throw new IOException(GetMessage(errorCode), MakeHrFromErrorCode(errorCode));
        }

        private void GetCommStateNative(ref Dcb lpDcb)
        {
            int commErrors = 0;
            Comstat comStat = new Comstat();

            for (int i = 0; i < CommStateRetries; i++)
            {
                if (!ClearCommError(m_Handle, ref commErrors, ref comStat))
                {
                     WinIoError();
                }
                if (GetCommState(m_Handle, ref lpDcb))
                {
                     break;
                }
                if (i == CommStateRetries - 1)
                {
                     WinIoError();
                }
            }
        } 
 
        private void SetCommStateNative(ref Dcb lpDcb)
        {
            int commErrors = 0;
            Comstat comStat = new Comstat(); 
 
            for (int i = 0; i < CommStateRetries; i++)
            {
                 if (!ClearCommError(m_Handle, ref commErrors, ref comStat))
                 {
                     WinIoError();
                 }
                 if (SetCommState(m_Handle, ref lpDcb))
                 {
                     break;
                 }
                 if (i == CommStateRetries - 1)
                 {
                     WinIoError();
                 }
            }
        }

        #region Nested type: COMSTAT

        [StructLayout(LayoutKind.Sequential)]
        private struct Comstat
        {
            public readonly uint Flags;
            public readonly uint cbInQue;
            public readonly uint cbOutQue;
        }

        #endregion

        #region Nested type: DCB

        [StructLayout(LayoutKind.Sequential)]
        private struct Dcb
        {
            public readonly uint DCBlength;
            public readonly uint BaudRate;
            public uint Flags;
            public readonly ushort wReserved;
            public readonly ushort XonLim;
            public readonly ushort XoffLim;
            public readonly byte ByteSize;
            public readonly byte Parity;
            public readonly byte StopBits;
            public readonly byte XonChar;
            public readonly byte XoffChar;
            public readonly byte ErrorChar;
            public readonly byte EofChar;
            public readonly byte EvtChar;
            public readonly ushort wReserved1;
        }

        #endregion

        #endregion
    }

    internal class Program
    {
        private static void Main(string[] args)
        {
            SerialPortFixer.Execute("COM1");
            using (SerialPort port = new SerialPort("COM1"))
            {
                port.Write("test");
            }
        }
    }
}

* Use Firefox to copy and paste the code above. Formatting won't be preserved if you use IE (IE bug).

Thursday, July 1, 2010

.NET SerialPort Woes

Preface

This is a very long post articulating the .NET SerialPort bug and the proposed fix for Microsoft to implement in its post .NET 4.0 framework. An interim fix that doesn't involve Microsoft is also available.

Introduction

The built-in serial port support in .NET has been a major let down (which has remained largely unchanged since its introduction in v2.0 to the latest v4.0). Posts on MSDN have suggested that a lot of people (both C# and VB users alike) are in fact facing some form of difficulties using System.IO.Ports.SerialPort:

IOException when reading serial port using .NET 2.0 SerialPort.Read

Port IOException

SerialPort.Close throws IOException

IOException when SerialPort.Open()

WinCE 5.0 - IOException when serialPort.Open()

WARNING! SerialPort in .NET 3.5

... and many more, but take the last one with a grain of salt.

Yet Microsoft could't seem to be able to reproduce the bug, or worse, brushed it aside thinking it's a problem with users' code. That is likely due to the noise (i.e. incorrect answers accepted as correct answers) introduced by the forum's serial port expert pretenders (a lot of them Microsoft support staffs) - so far none of the posts were answered correctly, yet were all marked as correctly answered! Yay to the quality of MSDN forum - a forum where n00bs answer questions by other n00bs. The truth is, there's only one similarity in all of the posts - they all encountered IOException.

The Problem - IOException

To understand why IOException occurs in SerialPort (or rather, SerialStream to be exact), one would only need to look as far as the WinAPI Comm functions. SerialStream calls several comm APIs to get the real job done and when the API functions fail and return an error, SerialStream simply throws an IOException with the message returned by FormatMessage with the error code from GetLastError.

So why does the WinAPI function fail? From the posts, they all have a common error message:

"The I/O operation has been aborted because of either a thread exit or an application request." (error code 995)

While a thread exit will also cause an overlapped (asynchronous) function to implicitly abort, in this case it's aborted because the serial port is in a mode where any errors encountered by the UART chip would trigger the abort flag causing all current and subsequent serial port related calls to abort. Errors include parity error, buffer overrun, etc.

Some developers have even encountered IOException as soon as they call SerialPort.Open(), especially in slow devices such as handhelds running .NET CE. Some encounter it when the garbage collector disposes the serial port. Some encounter it when they call SerialPort.Read().

They're all due to a mistake in .NET's SerialStream implementation - neglecting to set the fAbortOnError flag in the DCB structure when initializing the serial port.

This negligence on Microsoft's part means every time you run your application you could potentially encounter a different behavior (this flag is persistent across app runs and the default is determined by BIOS and/or UART hardware vendor). Some claim that it only happens in one machine and not others. This also explains why it has remained such a pesky problem for both developers and Microsoft since the first incarnation of the SerialPort class.

When fAbortOnError flag is set to true, this is indeed the expected behavior - but is this the desired behavior Microsoft intended for its users? No. System.IO.Ports.SerialStream was never meant to work with fAbortOnError set to true, because the ClearCommError WinAPI function that goes hand-in-hand was nowhere to be found among its methods. Clearly, whoever wrote SerialStream made a mistake (and needs to be shot).

The Solution

It took me an entire day to root cause this problem. Luckily the solution is much simpler.
Here's what Microsoft needs to do to fix the problems (in reference to the .NET 4.0 source):

1) In InitializeDCB, SetDcbFlag for bit 14 to zero - this sets fAbortOnError to false. Also, retry GetCommState and SetCommState if it fails with error 995 (call ClearCommError() before retrying).

2) In SerialStream's c'tor, move InitializeDCB to the line before GetCommProperties. This fixes the problem for the folks who've been getting IOException when calling SerialPort.Open(). The reason SerialPort.Open() only failed on slow devices because between the port's CreateFile and the time GetCommProperties() is called, a comm port physical error might have already occurred.

The reason some people have claimed that their app simply crashes out when their application terminates is due to DiscardInBuffer() in SerialStream.Dispose() throwing IOException because PurgeComm failed with error 995, likely because of buffer overrun as their serial devices would've been sending and filling up the input buffer before user closes the app. And mind you, Dispose() at that point would've been called by the garbage collector thread - hence a try-catch would've been ineffective, unless of course, you've manually disposed the object prior to closing the app - causing the app to hard crash with unhandled exception.

How do you fix it in the interim? Simple. Before you call SerialPort.Open(), simply open the serial port by calling CreateFile and SetCommState's fAbortOnError to false. Now you can safely open the serial port without worrying that it might throw an IOException. I've whipped up a sample workaround in C# that you could use as a reference.

Sunday, May 23, 2010

Anonymous methods / Lambda expression variable capture scope in C#

Let's face it, at some point or another, we've come across the problem in C# where you pass a for-loop variable into an anonymous method without redeclaring it and find that it doesn't behave expectedly. Once we get used to the idea of always redeclaring a variable you're going to use in your lambda expression though, you tend to ignore the fact that there is actually no reason why it should be the way it is.

At least, I find it hard to explain to a junior team member the reason behind this implementation where the compiler would not consider a for-loop variable as a local variable to its scope when it comes to anonymous methods. In case you're wondering, this is the reason why we need to redeclare 'var s' in the following example:

foreach (var s in Names)
{
    var temp = s; // redeclaration

    criteria = criteria.Or(x => x.Name == temp);
}

The above example is very obvious, no doubt. But, throw in some code refactoring (out to a separate method and merging inline again etc.) and more complex lambda expression, you'll find that it is quite easy to forget that 'temp' variable. A bad language is one that requires you to 'tip toe around a field of land mine' when you're using it - which defeats the whole selling point of C# (or at least what Eric Gunnerson was selling when he first introduced C# to the world).

Like I said, there's really no reason why the compiler shouldn't consider 'var s' as the local variable of the foreach block - there's no way you could access variable 's' outside of it anyway. In this case, variable 's' isn't local enough to be captured by the lambda expression, but isn't any more global than that either!

My suggestion to Microsoft is to have the compiler automatically insert the redeclaration (as a quick and simple fix). I really can't think of a genuine use-case where one would actually desire the behavior without the redeclaration. In fact, if that is truly what is intended, then wouldn't this make more sense and a LOT more readable (read: maintainable)?

var s = Names[Names.Length-1];

for (int i=0; i<Names.Length; i++)
{
    criteria = criteria.Or(x => x.Name == s);
}

rather than,

foreach (var s in Names)
{
    criteria = criteria.Or(x => x.Name == s);
}

Tuesday, January 26, 2010

Rad Studio IDE Changes System Wide Timer Resolution

This one had me scratching my head for a long time. Apparently, the CodeGear / Embarcadero RAD Studio IDEs (I've tested 2007 and 2010) exhibit this behavior - it calls timeBeginPeriod to change system wide timer resolution to 1ms when it starts and timeEndPeriod when it quits.

What's the problem then?

For starters, this means that simple code such as:

while (!Terminated)
{
    if (Poll())
        DoSomething();

    Sleep(1);
}

will behave very differently with and without the IDE running. You may end up polling a lot slower than you expected when you deploy your applications and may cause DoSomething() to not get called in-time. The evil is in the fact that when you are developing the application, DoSomething() always gets called as you would have expected. But when you deploy your application (or hands it over to the testers for testing), you'd soon realize something is amiss. Everyone knows Windows is not a real-time OS, so no one would expect Sleep(1) to actually sleep for 1ms. But while developing the application, you had found that it is actually quite close.

Well, surprise, surprise! Without the IDE running, Sleep(1) would actually wait for 15.625ms by default - that's more than 15 times slower than what you were expecting.

The Sleep Function documentation from MSDN really doesn't do a good job at explaining the Sleep function. In DOS, I would've expected the system ticks to be at a default of 15.7ms. But I had expected Windows starting from Windows 95 to have a default system tick of 1ms. I was wrong (well, not really, see my comment #1).

Regardless, this is a serious problem with all of RAD Studio's IDEs. I am sure hardly anyone knows about this and at one point or another, you would've been bitten by this, even if you didn't know it - except that your application failed on the day of demo at your client's site. Just your luck again.

Microsoft's Visual Studio IDEs (tested 6, 2005, 2008) don't do this.

Also, it's never a good idea to change the system tick resolution - from MSDN, "(The timeBeginPeriod) function affects a global Windows setting. Windows uses the lowest value (that is, highest resolution) requested by any process. Setting a higher resolution can improve the accuracy of time-out intervals in wait functions. However, it can also reduce overall system performance, because the thread scheduler switches tasks more often. High resolutions can also prevent the CPU power management system from entering power-saving modes. Setting a higher resolution does not improve the accuracy of the high-resolution performance counter."

Perhaps the OS should have simply fixed it at 1ms. To allow processes to change system timer resolution that affects global system settings does not make much sense in a multitasking environment.

Wednesday, January 20, 2010

Component / Control with TPropertyEditor in DesignEditors

If you include <designeditor.hpp> and try to use TPropertyEditor in C++ Builder, you'll run into BCC32 errors complaining about multiple declaration for 'IPropertyDescription' and ambiguity between 'IPropertyDescription' and 'Designintf::IPropertyDescription'. This is true for every version post-BCB6, including the latest CB2010.

The namespace ambiguity problem is an inherent problem with C++ Builder because every HPP file that is generated from Delphi includes the namespace in the header file. We all know that's *BAD* now, but it's a decision that dates back to the first version where even the std namespace was implicit included. Now that we've found ourselves too deep in the rabbit hole, there's really no easy way out as far as backward compatibility is concerned.

But, for this particular problem, there's a solution.

Before you include DesignEditors.hpp, you should first include PropSys.hpp, such as,

#include <propsys.hpp>
#include <designeditors.hpp>

class PACKAGE TMyComponentEditor: public TPropertyEditor
{
// ...
};

Perhaps the better way would be for DesignEditors.hpp to include PropSys.hpp at the very top of the file, so anyone who uses DesignEditors.hpp doesn't need to remember including PropSys.hpp explicitly. That one's for Embarcadero to decide.

Upgrading VCL Apps to C++ Builder 2010

If you run into the following error, here's what you need to do.

[ILINK32 Error] Error: Unresolved external 'wWinMain' referenced from C:\PROGRAM FILES\EMBARCADERO\RAD STUDIO\7.0\LIB\C0W32W.OBJ

Open your main cpp file and look for this line,

WINAPI WinMain(HINSTANCE, HINSTANCE, LPSTR, int)

Change it to,

WINAPI wWinMain(HINSTANCE, HINSTANCE, LPWSTR, int)

This is to do with the Unicode support in the new IDE (starting from CB2009).

IDE Regex Replace: char to wchar_t string literals

While upgrading your apps to use wchar_t* instead of char* string literals, you'll find that you need to change a string such as "This is a string" to _T("This is a string"), as well as character literals such as 'c' to _T('c').

Well the good news is there's a quick way of doing this.

The C++ Builder IDE has always have a Regex (Regular Expression) based search and replace function. All you have to do is enable it in the Replace Text dialog, under Options | Regular expressions.

These are the corresponding Regex you'll need.

For string literals,

Text to find: "{(\\"|[^"])*}" (include the double quotes)
Replace with: _T("\0")

For char literals,

Text to find: \'{\\[^']|[^']}\'
Replace with: _T('\0')

* Note: Do not blindly replace all. You may end up replacing the text inside a string, such as "I can see 'u' from here". If anyone has any suggestions on how to correct this, I'd appreciate it (note that the IDE regex replacer does not support backreference). You may end up replacing strings that aren't string literals, such as #include "myfile.h".

The reason you'd want to use the _T(x) macro is because it's faster when you do an assignment to UnicodeString (which is typedef'd to String). The _T(x) macro maps to L##x - i.e. _T("text") == L"text". The String and _T(x) macro pair is compatible going from a compiler that supports Unicode to one that doesn't as String will map to UnicodeString in the former and AnsiString in the latter, which is the same for the _T(x) macro mapping to L (L"string") and nothing ("string") respectively.

String fromAnsi = "text";
_UStrFromPChar, which ends up calling MultiByteToWideChar. It is a Windows API that converts Ansi strings to Unicode strings, and as fast as it may be, it's bound to be slower than a straight memory copy.

String fromUnicode = L"text";
All else being equal (allocate memory and finding string length), this is much faster as it's basically just a straight memory copy.

Friday, January 15, 2010

FastMM - Slow in multithreaded apps on multicore CPUs

There's something wrong with FastMM4's (i.e. the default memory manager of Delphi / C++ Builder starting BDS2006) usability on multicore systems, especially running multithreaded apps in a GC/managed environment. The result of this is that when multicore is enabled, performance suffers by up to 5 folds. So, not only that FastMM would not scale, your multithreaded apps will run tremendously slower on a multicore system - up to 5 times slower on a dual-core machine vs a single-core one at the same clock speed of the same architecture.

That's 500% performance drop going from single-core to dual-core! Comparing the dual-core performance of FastMM4 and TBBMM, the latter is 9 times faster!

This test is meant to show just that. Download Test (updated 27/01/2010) (see readme.txt for instructions) *** WARNING: Incompatible with x64 OS due to an OS bug.

It runs through a variety of algorithms in multiple threads (in a threadpool of the framework, similar to .NET's ThreadPool) consisting of a mix of GC list, GC dictionary, and GC string unit-tests.

Keep in mind that this is an app written using a GC framework, which means allocations usually happen in multiple threads concurrently while de-allocations are done in specialized garbage collector threads. This may be the reason FastMM breaks down (a general-purpose memory manager shouldn't break down given any usage patterns).

Notice that when you run the FastMM Test with CPU Affinity set to just one CPU, you'll end up with nearly the same performance as TBBMM. Once you enable multicores though, you'd immediately lose performance once again, running slower than with just one core.

Note: You'll find that the FastMM BorlndMM.dll is different from the default Rad Studio 2010 one. This is due to the changes added to support the GC framework, but at its heart, it's simply making calls to GetMemory, ReallocMemory and FreeMemory (as oppose to WinMM's version of HeapAlloc, HeapRealloc and HeapFree respectively, with all else
being equal). The WinMM version is initialized with the LFH (low fragmentation heap) flag.

Here are some results from my own tests:

Test results in ops/second (10sec average), listed in the following order:
1) TBBMM (what is TBBMM?)
2) WinMM
3) FastMM

Core2Duo E6550 2.33GHz (Conroe) - XP SP3
Both cores enabled
1) 1785
2) 1230
3) 250

Single core (via CPU affinity mask)
1) 930
2) 650
3) 950

Core2Duo E6550 throttled to 1.33GHz - XP SP3
Both cores enabled
1) 730
2) 520
3) 180

Single core (via CPU affinity mask)
1) 410
2) 275
3) 395

Pentium M 1.2GHz (Banias) - XP SP3
CPU is Single core
1) 395
2) 340
3) 395

Core2Duo E7200 3.6GHz (Wolfdale) - Vista
Both cores enabled
1) 2595
2) 2080
3) 290

Single core (via CPU affinity mask)
1) 1450
2) 1180
3) 1405

As you can see, the results are quite consistent. On a dual core machine, the performance of FastMM is terrible. From 2.33GHz to 3.6GHz, there's virtually no increase at all in speed! In fact, when the test was running, the CPU wasn't even fully utilized (with more than 50% of CPU spent in kernel time), whereas the other memory managers had the CPU pegged at 100% and nearly no kernel time.

If you wish to try it out on your system, download this GC speed tester (updated 27/01/2010) and unzip it to a folder of your choice. Then, run "Run All Tests.bat" and follow the on-screen instructions. Note that the GC Speed Test app will run indefinitely, so once you take note of the speed (ops/sec), you can quit the app to move on to the next test.

I'd appreciate it if you could post your results here in the comments in the same format as the ones above - i.e. CPU make (I'd love to see how AMD CPUs fare) and model number as well as the frequency, OS / service pack, and the results.

My advice? For an all-rounded memory manager, use the Windows default one. It may be a little slower than FastMM on a single core, but it certainly scales very well on multicore systems. Alternatively, the Intel TBB allocator has a near perfect scaling and is the fastest memory managers around. Only thing is, it consumes more RAM.

Regardless, I'd stay away from FastMM4 (thus the default memory manager of Delphi / C++ Builder).

Thursday, January 14, 2010

C++ Builder 2010 Optimizing C++ Compiler

I'm pleasantly surprised after giving C++ Builder 2010 a quick spin. It's much better at optimizing code than its predecessor CB2007 (I skipped CB2009 altogether as it was and still is completely broken).

CB2010 vs CB2007:

AnsiString test:
6938ms vs 6765ms

GcString test:
420ms vs 1734ms
(yes, that's 420ms, it's not a typo)

In the AnsiString test, things got just a bit slower (about 2% - nothing to worry about). But the big surprise here is my GcString test, which is over 400% FASTER!

Code for the test above (executed on Core2Duo 2.33GHz with TBBMM):

void __fastcall RunTest()
{
   const int TEST_COUNT = 10;
   const int TEST_SIZE = 10000;
   const int LOOP_COUNT = 1000;

   {
       // RefCounted String Test
        AnsiString strings[TEST_SIZE];
       for (int i=0; i<TEST_SIZE; i++)
       {
           strings[i] = "test";
       }

       DWORD start = GetTickCount();
       for (int x=0; x<LOOP_COUNT; x++)
       {
           AnsiString temp;
           for (int j=0; j<TEST_COUNT; j++)
           for (int i=0; i<TEST_SIZE/2; i++)
           {
               temp = strings[i];
               strings[i] = strings[TEST_SIZE-1-i];
               strings[TEST_SIZE-1-i] = temp;
           }
       }
       ShowMessage(IntToStr((int)GetTickCount() - (int)start));
   }

   {
       // GcString Test
        GcString strings[TEST_SIZE];
       for (int i=0; i<TEST_SIZE; i++)
       {
           strings[i] = "test";
       }

       DWORD start = GetTickCount();
       for (int x=0; x<LOOP_COUNT; x++)
       {
           GcString temp;
           for (int j=0; j<TEST_COUNT; j++)
           for (int i=0; i<TEST_SIZE/2; i++)
           {
               temp = strings[i];
               strings[i] = strings[TEST_SIZE-1-i];
               strings[TEST_SIZE-1-i] = temp;
           }
       }
       ShowMessage(IntToStr((int)GetTickCount() - (int)start));
   }
}

As you may have noticed, it is simply an array reversal test. And yes, the GcString version was 4 times faster even in CB2007. In CB2010, GcString is now a staggering 16.5 times faster than AnsiString!

Internal Compiler Error (ICE) in BCC32 of C++ Builder 2010

An excellent write-up of ‘What is an Internal Compiler Error?’ by David Dean (an Embarcadero C++ QA Engineer) is a must-read if you do not know what an ICE is, apart from it giving you error message such as this, “[BCC32 Fatal Error] FileA.cpp(56): F1004 Internal compiler error at 0x59650a1 with base 0x5900000”.

CB2010 seems to be more prone to encountering ICE, for reasons which are beyond my understanding. However, with a lot of struggle and time spent to get my projects compiled, I’ve found a few settings that are vital to avoid ICE.

The first thing I’d do is disable smart cached precompiled headers (command line: -Hs-). I’ve found that this option, combined with Debugging | Expand inline functions and/or Optimizations | Expand common intrinsic functions (implicit via Generate fastest possible code) is the root of all evil. Disabling the former will allow the latter two to be enabled, thus taking advantage of the new optimization featured in BCC32 v6.21 of CB2010. In fact, I’ve made all my projects default to this configuration. If you still get ICE, then start disabling the other two as well. Even if you get it to compile after disabling either or both of them, you’d still want to submit a QC entry (a bug report). To do this, follow the instructions in the above link (David Dean’s page about ICE).

Thursday, January 7, 2010

ATI DXVA with Arcsoft - Still Behind nVidia with Anything

Happy New Year 2010 to all my readers!

First post of the year. And it's bad news for ATI, yet again.

*note: this post is a follow-up to my post here.

With the recent changes (additions) to the popular open-sourced H.264 encoder, x264, encoding at ref 16 with b-pyramid normal and weighted-p 2 (which is default), playback on ATI cards with the Arcsoft decoder will exhibit bad artifacts, as if using MPC-HC's internal DXVA decoder on ref 16 encodes. Meanwhile, it's all well and dandy over at the green camp. nVidia owners will find that their cards can decode these streams without even the slightest artifact, and with just about any DXVA decoders you could get your hands on - even the Win7 Microsoft DTV-DVD. If you intend to encode with b-pyramid normal + weighted-p 2, you should reduce the ref to 12 (haven't tried anything in between) to ensure artifact-free DXVA playback with ATI+Arcsoft.

So there you go, even with the best combo, ATI still looses out to nVidia. So again, my advice is, sell your ATI cards and stick to nVidia.

Let's see if this is the year ATI catches up (My bet is on NO. Perhaps never. Frankly, ATI doesn't care about the HTPC scene).