Saturday, November 13, 2010

WOW64 bug: GetThreadContext() may return stale contents

My GC app runs perfectly fine on native x86 versions of Windows, from XP onwards.

I recently installed Windows 7 x64 (I had always used Win7 x86, since 64-bit was still in its infancy for the stuff that I do on my desktop), and to my dismay the app failed after a few minutes of running.

After hours of debugging (GC is never easy to debug, let alone when the hypothesis involves a bug in WOW64, which has been used by millions since XP x64), I found that WOW64 clobbers the ESP value when it does a call out to long mode. That is the value returned by GetThreadContext(), which the GC thread in the app relies on to get the current stack pointer of mutator threads. I also found that ESP is always restored upon returning.

Prior to calling GetThreadContext(), the GC thread suspends all mutator threads. If it just so happens to suspend the mutator thread while it's running long mode code in user mode, the ESP value gets changed to a value indicating a higher address than the actual stack pointer (remember on IA, stack 'grows' downwards). I've seen this happen for SetEvent() and SwitchToThread() (as these are the most frequently called kernel functions in the app).
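
To see why a too-high ESP is fatal for a collector that scans stacks conservatively, here is a toy model in Python. It is not the app's real code: `scan_stack`, the addresses and the heap handles are all made up for illustration. The stack is modelled as a map from address to stored word, and the scan walks from the reported stack pointer up to the stack base.

```python
# Toy model of a conservative stack scan. All names and addresses here are
# hypothetical; a real GC walks raw memory, not a dict.

WORD = 4  # bytes per stack slot on 32-bit x86


def scan_stack(stack, reported_esp, stack_base, heap_objects):
    """Return the heap objects referenced from [reported_esp, stack_base)."""
    roots = set()
    for addr in range(reported_esp, stack_base, WORD):
        value = stack.get(addr)
        if value in heap_objects:
            roots.add(value)
    return roots


STACK_BASE = 0x0012FFF0
heap_objects = {0xA000, 0xB000, 0xC000}

# The stack grows downwards: more recent frames live at lower addresses.
stack = {
    0x0012FFEC: 0xA000,  # oldest frame
    0x0012FFE8: 0x1234,  # not a pointer into the heap
    0x0012FFE4: 0xB000,
    0x0012FFE0: 0xC000,  # most recent frame
}

actual_esp = 0x0012FFE0
stale_esp = 0x0012FFE8  # a higher address than the actual stack pointer

live_with_actual = scan_stack(stack, actual_esp, STACK_BASE, heap_objects)
live_with_stale = scan_stack(stack, stale_esp, STACK_BASE, heap_objects)
missed = live_with_actual - live_with_stale  # objects the GC would wrongly free
```

In this model the scan that starts at the stale value never visits the two most recent slots, so the objects referenced only from there end up in `missed`. A collector would free them while the mutator still uses them, which would be consistent with the app running for a while and then failing.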

This means that either SuspendThread is suspending a thread in an incompatible way to native x86, or the thread's context in WOW64 is not being protected when the code jumps to translation mode. Either way, I was sure it's a bug.

I then found this article (difference between WOW64 and native x86 system DLLs) and while the article isn't exactly addressing the issue I'm facing, I found it very useful because this guy (Skywing from Microsoft) certainly knows WOW64 very well. I proceeded to email him and he replied with the following:

> [...] there’s an issue with get and set context operations against amd64 Wow64 threads returning bad information in circumstances when the thread is running long mode code in user mode. This relates to us [Microsoft] pulling the Wow64 context from the TLS slot (as described in [this] article) before that context structure has been updated with current contents.

That sounded very much like the issue here. So, I decided to dig deeper and put a few software traps to try and catch it in the act.

This is what I found.

The stale contents from GetThreadContext() actually came from the previous system call out (a looong way up the stack really - it's not as if it's a few instructions ago). It should've returned contents from the *current* system call out instead (or to be precise, just before the call out to long mode took place). Like Skywing said, they pulled the context before it's updated with the current contents.
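
The sequence can be pictured with a small Python model. This is only a sketch of the behaviour described above, not the real WOW64 implementation: the class `ToyWow64Thread`, its methods and the addresses are all hypothetical, and the split of a call out into "enter long mode" and "save context" steps is an assumption made to expose the window.

```python
# Toy model of the race; this is NOT how WOW64 is actually written.
# All class and method names are hypothetical.


class ToyWow64Thread:
    def __init__(self, esp):
        self.esp = esp            # live 32-bit stack pointer
        self.saved_esp = esp      # 32-bit context kept in the TLS slot
        self.in_long_mode = False

    def enter_long_mode(self):
        """First half of a call out: the thread now runs 64-bit code."""
        self.in_long_mode = True

    def save_context(self):
        """Second half: the TLS context slot catches up with the live state."""
        self.saved_esp = self.esp

    def leave_long_mode(self):
        self.in_long_mode = False

    def get_thread_context_esp(self):
        """ESP as reported to another thread that has suspended this one."""
        return self.saved_esp if self.in_long_mode else self.esp


thread = ToyWow64Thread(esp=0x0012FF00)

# Previous system call out, a long way up the stack.
thread.enter_long_mode()
thread.save_context()
thread.leave_long_mode()

# The mutator calls deeper; the stack pointer moves down.
thread.esp = 0x0012F000

# Current call out: suspended after entering long mode, before the save.
thread.enter_long_mode()
esp_in_window = thread.get_thread_context_esp()   # value from the previous call out
thread.save_context()
esp_after_save = thread.get_thread_context_esp()  # value from the current call out
thread.leave_long_mode()
```

Inside the window the model reports the ESP saved by the previous call out, which is higher than the live one. Once the slot has been updated the reported value is correct again.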

With that said, we can now conclude that it is indeed an OS bug (Win7 SP1 hasn't fixed it).

Update 29 March 2014: As of Windows 8.1, this bug is still *NOT* fixed!

* I'd like to thank Skywing for his effort in assisting me to root cause this issue.

44 comments:

Alexey Pakhunov said...

> This means that either SuspendThread is suspending a thread in an incompatible way to native x86, or the thread's context in WOW64 is not being protected when the code jumps to translation mode.

Wow64 is almost entirely user mode. There is no way one can protect user mode code from interruption by a kernel APC (this is what SuspendThread is). So it is indeed possible to suspend a thread while Wow64 is inconsistent.

It cannot be easily fixed either. Either the entire Wow64 needs to be moved to kernel mode (which I cannot see being done in the foreseeable future due to application compatibility reasons), or Wow64 needs to be able to detect that a thread was suspended while in transition and fix up the context (tricky business), or Wow64 needs a way to tell the caller that the returned context is inconsistent.

The latter is the easiest fix. But I'm not sure if Microsoft is going to fix it. But I really hope they will.

Zach Saw said...

@Alexey:

It's not that tricky to be honest. Context synchronizing could be done by the caller thread of SuspendThread - obviously it is possible to detect if a thread is running long mode in user mode. In that situation, SuspendThread needs to do additional sync. If the thread isn't suspended to begin with, then stale contents could be returned and that would still be consistent with what MSDN documentation says.

That would make SuspendThread slower (very infrequently so since it's only slower when it suspends a thread in long mode), but I wouldn't think it would break compatibility.

Zach Saw said...

@Alexey:

I think I see the complication now. My proposed methods wouldn't have worked as SuspendThread wouldn't have been able to sync the contents - ESP would've been 'clobbered' by RSP at that time.

The only other thing I could imagine possible is to get all call outs to sync their 32-bit context before doing anything else.

Zach Saw said...

@Alexey:

Last I checked when I was working at Intel, ring-2 wasn't used at all in Windows. Wouldn't the use of ring-2 for WOW64 translation be the correct way to fix this?

SuspendThread would then only suspend a thread in ring-3 (as it does now - no changes necessary), and a crash/fault/exploit in ring-2 won't affect the kernel - it'd only affect ring-2 and ring-3.

Such changes would necessitate the change of a few WinAPI functions such as Wow64Get/SetThreadContext etc. This would break apps relying on undocumented features but that's to be expected. Granted, translation would be slower as it takes a few CPU cycles to transition from ring-3 to ring-2 (depending on the CPU family) but I'd think that the role of the OS is first and foremost to ensure correct operation. Translation already adds overhead to apps running under WOW64, so that's to be expected too.

Alexey Pakhunov said...

> Context synchronizing could be done by the caller thread of SuspendThread
> SuspendThread needs to do additional sync.

I read it as SuspendThread should wait until it is safe to stop a thread. It is a security vulnerability to do that.

You can introduce a special version of SuspendThread that will wait, limiting the threat to Wow64 applications and user mode only.

Other ways of fixing this look cleaner.

> The only other thing I could imagine possible is to get all call outs to sync its 32-bit context before doing anything else.

I don't see how it can help. It is still done in user mode and there is no way to protect from SuspendThread in user mode.

> Wouldn't the use of ring-2 for WOW64 translation be the correct way to fix this?

This will double the number of ring changes and it is more complex than moving Wow64 to ring 0.

> SuspendThread would then only suspend thread in ring-3 (as it does now - no changes necessary)

SuspendThread also suspends a thread in ring 0 when it is outside of a critical region.

> This would break apps relying on undocumented features but that's to be expected.

Well, not really. In Windows we are expected not to break apps even if they rely on undocumented features. Unless the benefits of breaking them significantly outweigh the costs.

> I'd think that the role of the OS is first and foremost to ensure correct operation.

Nope, not really. There is not a single thing that should be achieved no matter what. It is always a compromise.

Zach Saw said...

> > SuspendThread needs to do additional sync.

> I read it as SuspendThread should wait until it is safe to stop a thread. It is a security vulnerability to do that.

As you've said, doesn't SuspendThread do this already in Kernel mode (i.e. wait until it's outside of a critical region)?

Or does it simply fail? If it does, then why don't we make it fail too if it's not safe to stop a thread? At least a simple retry on SuspendThread by the caller thread would suffice (although this mandates changes to the current implementation of Boehm GC – see my last point).

> You can introduce a special version of SuspendThread that will wait limiting the threat to Wow64 applications and user mode only.

See my last point too.

> Other ways of fixing this look cleaner.

Also see my last point.

> > Wouldn't the use of ring-2 for WOW64 translation be the correct way to fix this?

> This will double the number of ring changes and it is more complex than moving Wow64 to ring 0.

Then perhaps Wow64 should've been done in ring 0 to begin with. It really sounds as though this is a design oversight, although the motivation behind keeping it away from ring 0 was to reduce security risk (at least that's what MSDN doc says). That's why I believe ring-2 would be the best compromise.

> > This would break apps relying on undocumented features but that's to be expected.

> Well, not really. In Windows we are expected not to break apps even if they rely on undocumented features. Unless the benefits of breaking them significantly outweigh the costs.

Yes, however, the compromise here is to break apps that rely on documented features to favour those relying on *un*documented features. We're talking niche apps (ones that rely on undocumented features) vs niche apps (ones that do thread hijacking) here - so the point about affecting all apps running under Wow64 is moot, really. Most other apps would be completely indifferent to the changes.

Put it another way – is it instead better to break apps that rely on documented features such as GetThreadContext?

> > I'd think that the role of the OS is first and foremost to ensure correct operation.

> Nope, not really. There is not a single thing that should be achieved no matter what. It is always a compromise.

I read (again, from MSDN) that the reason Wow64 is implemented in user-mode is mainly performance. To sacrifice correctness for performance, especially in an OS, is *not* acceptable, IMHO.

Sure, the easiest way would be simply to fail SuspendThread or GetThreadContext when the context is stale. This would work too, but MSDN docs need to be updated to reflect such changes. That’s rather trivial, though, compared to apps that use Boehm GC – they would also need to be updated and rebuilt (the GC currently does not expect SuspendThread / GetThreadContext to fail in such cases). Again, I would've thought the existence of Wow64 is to let users run old legacy 32-bit apps under Windows64. Any changes that require the rebuild of legacy apps would defeat the purpose of having Wow64 in the first place, wouldn't it?

I wonder if the IA-32 Suite of Linux suffers from the same problem. How do they do their x86 emulation under 64-bit Linux?

Alexey Pakhunov said...

> As you've said, doesn't SuspendThread do this already in Kernel mode

Yes. This is not a security vulnerability because all kernel mode code is trusted. User mode can be anything. A hacker can modify Wow64 code so that it will spin forever under certain conditions so some privileged thread calling SuspendThread will be blocked forever.

> Then perhaps Wow64 should've been done in ring 0 to begin with.

Maybe. There are reasons to do it in ring3 and reasons to do it in ring0.

> so the point about affecting all apps running under Wow64 is moot, really

By affecting all apps I meant that any 32-bit app depends on this code, so if you change it all of them will be affected. It does not mean that all of them will fail. But some of them will (as we have learned from experience). The problem is that we cannot test all applications and find out which ones are going to fail.

> Put it another way – is it instead better to break apps that rely on documented features such as GetThreadContext?

No, we should not break any apps, whether they use undocumented features or not.

> I would've thought the existence of Wow64 is to let users run old legacy 32-bit apps under Windows64.

Correct. But being a compromise the statement has changed to "most of legacy 32-bit apps". "Most" is really determined by feasibility of doing so with reasonable effort.

> To compromise correctness over performance especially in an OS is *not* acceptable, IMHO.

I can give you another example: moving GDI to kernel mode in NT 4. It was a big win from business point of view. It wasn't a good thing from design point of view. :-(

> I wonder if the IA-32 Suite of Linux suffers from the same problem. How do they do their x86 emulation under 64-bit Linux?

They reconstruct the 32-bit context out of the saved 32-bit context and the native IA-64 context. There were some issues with this approach (I don't recall the details) but it works.

Zach Saw said...

> Correct. But being a compromise the statement has changed to "most of legacy 32-bit apps". "Most" is really determined by feasibility of doing so with reasonable effort.

In which case I would've expected any shortcomings to be documented in detail, including the intimate inner workings of Wow64. This way, developers would've updated their code in preparation for this even prior to the release of Wow64 way back in the day. Even now, there are no publicly available docs on this behavior of Wow64. I'm probably not the first to stumble upon this bug and definitely won't be the last, but proper documentation would at least save us days of debugging and getting pissed.

In such cases, a workaround is certainly required (so *what* is the workaround here???).

Microsoft certainly did not do any of those.

Alexey Pakhunov said...

> In which case I would've expected any shortcomings to be documented in details

Yes, that would be nice.

> including the intimate inner workings of Wow64.

I doubt that it is possible. The intimate inner details are usually left undocumented on purpose, in order to leave developers some room for later changes and bug fixes. In this case the bug violates the interface contract so yes, in an ideal world it should be either documented or fixed, preferably the latter.

> This way, developers would've updated their code

As we have learned from experience, this never happens.

> so *what* is the workaround here???

There is currently no workaround that I know about. All I know is that the bug has been reported to the Wow64 team.

> Microsoft certainly did not do any of those.

... which leads us back to the question of whether Microsoft should spend its resources on bugs that have a thousand times worse impact or on this particular one. Because (in my opinion) Microsoft does not have the resources to fix all bugs in its products (again, like any other company producing commercial software).

PS. You can also argue that Microsoft should choose the bugs to fix differently. I don't have well grounded opinion about this aspect. Maybe.

Zach Saw said...

> > Microsoft certainly did not do any of those.

> ... which leads us back to the question of whether Microsoft should spend its resources on bugs that have a thousand times worse impact or on this particular one. Because (in my opinion) Microsoft does not have the resources to fix all bugs in its products (again, like any other company producing commercial software).

By "Microsoft did none of those", I would've thought it's clear that I meant documenting the bug, inner workings and workaround.

Documenting the inner workings doesn't mean developers are bound to that one implementation only. There are unofficial ways to document them these days (e.g. blogs), with a clear disclaimer that it's only the current implementation and it may change in the future.

Alexey Pakhunov said...

> By "Microsoft did none of those", I would've thought it's clear that I meant documenting the bug, inner workings and workaround.

I understand that. But this work also requires resources to be spent. The resources for documenting bugs and workarounds are allocated the same way as for fixing bugs.

> There are unofficial ways to document them these days (e.g. blogs) and with clear disclaimer that it's only the current implementation and it may change in the future.

This may work in some cases, but it is very tricky to get right. There are numerous examples of undocumented features being documented somewhere in such an unofficial manner. What happens next is that developers don't bother to read that it is unofficial. They find a solution to their problem and they use it. With the next release of Windows, the "undocumented" feature gets changed and the application stops working. The developers of the application are busy with the current release and don't have any time to fix and redistribute the old version of the app. Frankly speaking, the developer who used that feature left the company a long time ago and nobody wants to touch his code. So users blame Microsoft for releasing a buggy version of Windows, because apparently the application worked in previous versions of Windows.

Zach Saw said...

> So users blame Microsoft for releasing a buggy version of Windows, because apparently the application worked in previous versions of Windows.

So there's really no outright win here. Damned if you do, damned if you don't. So which one is worse: not fixing bugs that obviously violate the documentation, or breaking apps that rely on undocumented features?

Since all your worry is a hypothesis that fixing it might break other apps (you have no concrete stats on the number of apps it would break), we might just as well give up on fixing bugs the correct way and keep introducing one workaround after another.

At some point, it'll become so infeasible to keep track of all the workarounds that it would be less costly to break backward compatibility and start afresh. All that because MS doesn't want to break compatibility with apps relying on undocumented features, which may not even exist out there.

Alexey Pakhunov said...

> we might just as well give up on fixing bugs

Yes, you can easily reach this conclusion. The way out of this logic loop is to evaluate bugs (and features) on a risk/reward scale. The more rewarding and less risky ones will have higher chances of being fixed.

Now there is a whole other discussion about how to evaluate bugs on a risk/reward scale. There is no single right way of doing it. Even within one company there are different opinions about this. Moreover, I'm pretty sure this is what makes the difference between software companies today.

Here is one rather well-known example of how Microsoft does this. Most bugs are fixed only in a future version of the OS. Why? Simply because fixing a bug in a released version of a product costs a few orders of magnitude more (no kidding) than fixing a bug in a product that is being developed.

For a customer with a concrete problem it is rather useless. The customer does not want to wait another 3 years and pay for another copy of the OS.

Zach Saw said...

> fixing a bug in a released version of a product costs a few orders of magnitude more (no kidding) than fixing a bug in a product that is being developed.

Yes, I'm well aware of that. Intel learned that the hard way when they had to recall the Pentium chip / provide a workaround for compiler vendors due to the infamous FP bug. That was long before my time at Intel some years back, but as a result of that incident they created several different groups just to make sure bugs like that don't happen again. I'd have to say, though, that it's cheaper for software vendors to fix a bug in a released product than it is for hardware companies.

> Most bugs are fixed only in a future version of the OS.

You could do it in Win7 SP1, Vista SP3 and XP SP4. There's precedent for Service Packs breaking things, such as the introduction of DEP, Firewall etc. This would be no different.

Alexey Pakhunov said...

> I'd have to say though it's cheaper for software vendors to fix a bug for a released product vs hardware companies.

Definitely cheaper but still very expensive.

> You could do it in Win7 SP1, Vista SP3 and XP SP4.

Only if the benefit of fixing it is really high. Service packs are mostly collections of already released patches plus fixes for some (but very few) other bugs. So it still needs to be worth it.

Note, I don't have enough information to evaluate this bug. With the partial information I have, I would say the impact of the bug is relatively small and it is definitely not a candidate for a Windows Update patch or Service Pack. If there is an already released and widely used GC, it may be a good candidate for a KB article with a patch download link. It is my personal opinion, of course.

Zach Saw said...

> Service packs are mostly collections of already released patches plus some (but very few) other bugs.

Did you read my reply about DEP and Firewall? Did you have the stats of the number of apps DEP broke?

I get the impression that the core values of MSFT are very different from those of Intel. Intel is much more proactive and dynamic (and I stand by my other post about MSFT being slow and hardly even reactive) - again, Pentium FP flaw - how many real apps were reported to be affected by the time of recall? A handful at most - they went ahead with the recall. Perhaps, being a software company, MSFT has the luxury of being sloppy in the SDLC process.

Zach Saw said...

http://support.microsoft.com/kb/947504

In that particular case, because Java is a bigger client of yours, a hotfix was implemented and released. Keeping in mind the number of affected apps would've been minute (since it's only affecting JVM on IA64), it got the attention it needed because it's Java.

By your logic where compromise is to be able to run most of the apps, MSFT is definitely favouring big corps over small firms. I'm not sure that's a statement MSFT wants to send to the public.

Anonymous said...

Last time I heard, Microsoft had $60 billion in the bank. Surely that's enough resources to fix a few bugs?

Zach Saw said...

How does .NET's GC get hold of its current stack pointer of a specific thread when the GC runs?

I would imagine it requires some sort of thread hijacking as well for performance. Is it not affected by this bug then?

Alexey Pakhunov said...

> Did you read my reply about DEP and Firewall? Did you have the stats of the number of apps DEP broke?

1. Some service packs included features.
2. Some changes we made broke many applications.

So what? It does not change my statement about risk/reward evaluation of the changes we make. DEP & Firewall changes were made to improve the security of the system. Security bugs are considered the most riskier ones (remember years 2000, 2001?) from a business point of view. Patches distributed via Windows Update are almost entirely dedicated to security problems.

> I get the impression that the core values of MSFT are very different from those of Intel.

I don't have a well grounded opinion on this. I worked with Intel on a couple of _software_ projects. I didn't get the impression of a dynamic and proactive company.

BTW, I'm just curious, is there a web form or e-mail address or anything else open to the general public specifically created to report bugs in Intel processors?

> MSFT is definitely favouring big corps over small firms.

I've heard arguments both supporting this statement and proving it wrong. I'm not sure whether the balance is met or not.

> Last time I heard, Microsoft had $60 billion in the bank. Surely that's enough resources to fix a few bugs?

I'm not qualified to make decisions about how to spend this kind of money. So I don't know if it makes sense to fix more bugs from a business point of view.

> I would imagine it requires some sort of thread hijacking as well for performance.

I don't know enough about this. Thread hijacking is not the only technique. You can insert control points into generated code.
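
The control-point idea can be sketched as follows. This is a toy model in Python, not what any particular runtime does, and the `Safepoint` class and its method names are made up for illustration. Each mutator polls a flag at spots where its own state is consistent and publishes its roots itself, so the collector never has to ask the OS for another thread's context.

```python
import threading


class Safepoint:
    """Toy stop-the-world rendezvous built on control points (polling)."""

    def __init__(self, mutator_count):
        self._cond = threading.Condition()
        self._mutator_count = mutator_count
        self._stop_requested = False
        self._epoch = 0
        self._parked = {}  # mutator name -> roots it published itself

    def poll(self, name, roots):
        """Control point: called by a mutator at a known-consistent spot."""
        with self._cond:
            if not self._stop_requested:
                return
            epoch = self._epoch
            self._parked[name] = list(roots)  # publish a snapshot of the roots
            self._cond.notify_all()
            # Park until the collector resumes this stop request.
            while self._stop_requested and self._epoch == epoch:
                self._cond.wait()

    def stop_the_world(self):
        """Ask every mutator to park, then return the roots they published."""
        with self._cond:
            self._epoch += 1
            self._parked.clear()
            self._stop_requested = True
            while len(self._parked) < self._mutator_count:
                self._cond.wait()
            return dict(self._parked)

    def resume_the_world(self):
        with self._cond:
            self._stop_requested = False
            self._cond.notify_all()


def run_demo():
    safepoint = Safepoint(mutator_count=2)
    done = threading.Event()

    def mutator(name):
        roots = [name + "-object"]
        while not done.is_set():
            # ... real work would happen here ...
            safepoint.poll(name, roots)

    threads = [threading.Thread(target=mutator, args=(n,)) for n in ("m1", "m2")]
    for t in threads:
        t.start()
    snapshot = safepoint.stop_the_world()  # both mutators are parked here
    safepoint.resume_the_world()
    done.set()
    for t in threads:
        t.join()
    return snapshot
```

The trade-off is that every mutator has to reach a control point before the collector can proceed, so the code generator must insert polls often enough (for example in loops and at calls) to keep pause latency bounded.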

Alexey Pakhunov said...

> Security bugs are considered the most riskier ones (remember years 2000, 2001?) from business point of view.

Oops. Not risky ones, rewarding ones.

From the risk/reward evaluation, security changes are the most _rewarding_. The reward is that fixing them eliminates a security threat which could otherwise turn into a large business expense.

Zach Saw said...

Last I checked, even their TBB team responded to my queries promptly and some changes were made to the way TBBMalloc allocates/deallocates memory. Could it have affected existing applications? Most certainly. Mind you, that's *not* even Intel's core business.

I'm surprised you even asked - have you done a search on "Contact Intel Support"?

If you have a bug against the CPU you'd like to report (though I doubt you would come across even one in your lifetime), here's how: http://www.intel.com/support/feedback.htm?group=processor

And it's really simple to get to the page as well. The contact support page has all Intel's product groups well laid out in one page - where's the MSFT equivalent?

You come across to me as someone who believes MSFT is a corp which has one of the best business practices out there and all other giant sized companies must be around the same standard too. Is it so hard to accept that there're things MSFT could've done much better?

Alexey Pakhunov said...

> TBB team responded to my queries promptly and some changes were made to the way TBBMalloc allocates/deallocates memory. Could it have affected existing applications? Most certainly.

Has the TBB team redistributed an updated version of TBB to the end users? Or have they made everyone using TBB recompile their applications and redistribute them to the end users?

If not, then why do you care that bugs are fixed primarily in the next versions of Windows, not in the released ones?

> where's the MSFT equivalent?

You will not believe it, but Microsoft also has a support page that is not much different from Intel's: http://support.microsoft.com/

It does not tell me much that Intel has a well structured support page and some support people on staff. The question was "is it possible for an end user to report a bug in a CPU?". I sort of know the answer to the question "is it possible for an end user to report a bug in Windows?" but only because other people shared their experience. It is possible but bumpy. Unfortunately I don't know anyone who has had such an experience with Intel. Last time I needed to report a bug I could use a private channel.

> If you have a bug against the CPU you'd like to report (though I doubt you would come across even one in your lifetime)

I've encountered three bugs so far. Two in a beta version of a processor and one in a retail version. Never say never. :-)

> You come across to me as someone who believes MSFT is a corp which has one of the best business practices

I didn't say that. I said that there are business reasons to treat bugs the way Microsoft does.

> Is it so hard to accept that there're things MSFT could've done much better?

Surely there are a number of things Microsoft could have done better. Much better. I'm not saying that the way it is now is perfect. I would just like you to take a deeper look at the problem.

Zach Saw said...

> Has the TBB team redistributed an updated version of TBB to the end users? ...

That statement was in reply to your assertion that Intel's not dynamic.

> You will not believe it, but Microsoft also has a support page that is not much different from Intel's: http://support.microsoft.com/

Huh? How's that even remotely similar? Can I talk to someone without paying?

> Last time I needed to report a bug I could use a private channel.

So, how's that proof that end users can't report a CPU bug? With MSFT, end users have to /PAY/ to report a Windows bug via support.

If the goal is to stick your head in the sand, you've achieved it.

> I've encountered three bugs so far. Two in a beta version of a processor and one in a retail version.

Beta. Let's see, how many bugs did beta users report against Win7 beta?

And for the retail version, did Intel /not/ fix the bug via microcode patch (or at least put out an errata to allow for workarounds)?

> I just would like you to take a deeper look at the problem.

Perhaps that's not a good idea. The deeper I look at the rationale of risk vs rewards, the more I disagree it's the right model for an OS.

An OS is more similar to a CPU than a software library / application. You can't swap out a buggy part of the OS for a working one just like you can for a library.

In this case, this OS bug is a *show-stopper* for us.

Alexey Pakhunov said...

> That statement was in reply to your assertion that Intel's not dynamic.

Well, then you should say that Microsoft is also dynamic. We fix bugs in the very core of the OS and sometimes such fixes are based on someone's post in some blog. :-)

My point was that fixing a bug in house is cheap and easy. Distributing the fix to the end user is hard and expensive. This is true for both Intel and Microsoft, and it is the true measure of dynamism.

> Huh? How's that even remotely similar? Can I talk to someone without paying?

You can actually do quite a few things without paying. Just like you can do them at Intel's counterpart.

I understand your fixation on the "payment question", but I believe it is not very relevant to the point I was trying to draw your attention to, frustrating as it may be.

> So, how's that proof that end users can't report a CPU bug?

Oh, come on! Intel does not need you defending it. I'm not trying to prove that Intel is bad or anything like that.

I'm honestly saying that I have some facts describing what it is like to report a bug in Windows. I'm also saying that I don't have the same kind of facts about anyone reporting a bug in Intel's CPUs. That is it.

> If the goal is to stick your head in the sand, you've achieved it.

I thought this was a reasonable discussion. Please don't turn it into a "There can be only one" type of discussion.

> And for the retail version, did Intel /not/ fix the bug via microcode patch (or at least put out an errata to allow for workarounds)?

I don't know. I wasn't skilled enough at the time to go all the way through.

> Perhaps that's not a good idea. The deeper I look at the rationale of risk vs rewards, the more I disagree it's the right model for an OS.

No, why? A negative answer "no that is not the right model" is also a good answer if it is well thought through.

> An OS is more similar to a CPU than a software library / application.

You've got a point, but you are talking about a different angle of the problem. So it is not surprising that you end up with a different answer.

I agree that it would be really great if OSes (and other software too!) were as bug-free as processors are. Right now it seems there is no way to achieve this under the current model of commercial software development. At the same time, writing virtually bug-free software is not a self-sustaining business. It works out only in the aerospace and medical industries, and only because of stricter safety requirements.

Zach Saw said...

> My point was that fixing a bug in house is cheap and easy. Distributing the fix to the end user is hard and expensive.

Yes, but that doesn't stop Intel from doing it via Microcode patch.

> I have some facts describing how it is to report a bug in Windows.

Yes, and I thought the discussion was about how to improve that? Which was why Intel was brought into the discussion, as an example MSFT should learn from?

> I thought this was a reasonable discussion.

This topic has gone one big circle and yet I see no reasonable discussion taking place. I honestly think we'll have to agree to disagree in this case. I see OS (or at least WinAPI and the functionalities it exposes) to be closer to CPU while you kept comparing it to commercial software. We're not talking about .NET framework, Boost etc. or Photoshop, Internet Explorer, etc. I see WinAPI functions to be no different to the functionalities you get from CPU such as FDIV. You'll find each have their own complexities and cost in distributing a fix to end users. Like I said, it's even more costly for Intel.

> I don't know. I wasn't skilled enough at the time to go all the way through.

Point me to the errata and I'll take a look for ya.

> BTW, what if I say that thread hijacking does not actually work on native( x86, amd64 or ia64) either? In some scenarios. I don't know exactly how you use it.

I'm not exactly hijacking a thread (hijacking was your word) - I don't do SetThreadContext - only GetThreadContext, the same way Debuggers work (unless you meant debuggers shouldn't rely on GetThreadContext to get the current stack pointer of a thread either).

> Now it seems that there is no way one can achieve this using the current model of commercial software development.

Again, that's where your opinion differs from mine - OS is not /just/ any commercial software development. Intel could've simply said the same about Pentium FDIV bug.

Alexey Pakhunov said...

> Yes, but that doesn't stop Intel from doing it via Microcode patch.

So do you mean that Intel is delivering microcode patches to end users? I don't see this happening broadly. It has to be either a BIOS update or an OS update. BIOS updates are not done routinely by end users. OS updates are much more frequent, but they go through the same triage process as other OS updates, which is not dynamic enough per your interpretation.

> Yes, and I thought the discussion was how to improve that?

Well, so far what you have said is that you don't like that business reasons dictate that commercial software will have a bunch of bugs in it. However, you didn't offer any way to avoid that without ruining the business.

> I honestly think we'll have to agree to disagree in this case. I see OS (or at least WinAPI and the functionalities it exposes) to be closer to CPU while you kept comparing it to commercial software. We're not talking about .NET framework, ...

How are you going to draw the line? For example, WinAPI includes tons of high-level stuff, really high level. And why is the .NET framework suddenly not considered? It is much closer to CPU-like services than WinAPI.

> Again, that's where your opinion differs from mine - OS is not /just/ any commercial software development.

It is a plain fact, not my opinion. Commercial OSes _are_ commercial software.

Now I'd be interested to hear your opinion on how it could be changed to the model you think is right, where an OS is like a CPU from a business point of view. Is it possible to make such a business profitable?

Zach Saw said...

> ... which is not dynamic enough per your interpretation.

You twisted my words again. I was talking about fixing bugs that don't affect a whole lot of people. You said MSFT won't do that (or would you like to backpedal on that as well?), I said Intel did and would continue to do so.

> You didn't offer however any way to avoid that and don't ruin the business.

I don't know how Intel does it, or Google for that matter, giving away Android for free. But I did provide examples of others who do it better than MSFT. It's up to MSFT as an entity to learn.

> Is it possible to make such business profitable?

And you know it won't be profitable because...?

> It is a plain fact, not my opinion. Commercial OSes _are_ commercial software.

DUH! What an intelligent observation. Yes they are commercial software, but let me repeat again, "not /just/ any commercial software". Perhaps you might want to ask someone what that means.

This will be my last reply because we /really/ are going around in circles, with you (deliberately or not) replying nonsensically and twisting my words. I don't see a point in continuing this discussion.

Anonymous said...

> So do you mean that Intel is delivering microcode patches to the end end users? I don't see this happen broadly. It should be either BIOS update or OS update.

I thought Microsoft has something called "Windows Update" for that particular purpose or am I missing something?

This blog really sheds a bad light on Microsoft's already bad business practice.

Anonymous said...

First, congratulations on actually finding this bug. I first encountered it while trying to make the Open Dylan IDE work on 64-bit Vista. I couldn't locate the source of the problem, suspected a bug in our compiler, and finally gave up. The GC in this case was MPS.

I'm disappointed that MS won't fix this bug. That completely destroys binary compatibility with just about any 32 bit application that uses garbage collection. Which essentially means any program written in a language that does GC. Including ours.

Anonymous said...

Sounds like Microsoft is deliberately shutting out any GC implementations for C / C++ in a desperate and illegal attempt to protect its investments in .NET.

This just *screams* antitrust - the governments (EU is our best hope) should take a serious look at this case!

Yuhong Bao said...

FYI you have to remember that WOW64 was originally designed for IA64, where it has to run an x86 CPU emulator.

Anonymous said...

Actually that is not so. Intel only removed the capability to switch to x86 mode in ia64 after Madison. Until then the CPU was perfectly able to run x86 code without the need for any software emulation. After Madison, Intel removed that capability and introduced ia32el, an OS-independent software emulator (with JIT) that was successfully integrated in both Windows and Linux. By the way, when WoW is running in software mode on ia64, SuspendThread synchronizes with the simulator to make sure the call only returns after the WoW context has been updated in the TEB. Note that this synchronization is only performed if the thread calling SuspendThread and the target thread are in the same process. But then again, that is always the case in these managed/GC run-time scenarios.

Zach Saw said...

@Anonymous

This is definitely very enlightening!

Sounds like they've implemented it correctly for ia64 but not x64!

Anonymous said...

Actually, since Windows 8, things should be working well - and this applies to both native and simulated architectures by both software and hardware means.

There are 4 obscure _CONTEXT flags that make hijacking reliable.

You see, even for native, SuspendThread+GetThreadContext+SetThreadContext is not completely reliable. If the thread is, for some reason, in the kernel already at the time of SuspendThread for a syscall or a trap that could escalate to an exception, it may return to user-mode without some or any of the context SetThreadContext sets. An exception, for example, will overwrite the context completely to initiate Structured Exception Handling dispatch. For the Syscall case, one register will always reflect the NTSTATUS and all volatile registers are scrubbed.

If you want a full-fidelity SetThreadContext, then you need to make sure SuspendThread catches the thread while running in user-mode.

That's where those obscure flags kick in.

If you specify ctx->ContextFlags |= CONTEXT_EXCEPTION_REQUEST, you are asking the kernel to report back what were the circumstances that led the thread to enter kernel-mode.

When GetThreadContext returns, you should look for the reply in the ContextFlags.

If the kernel replies with CONTEXT_EXCEPTION_REPORTING set, it means that it understood the request (basically, that the OS supports, understands and is replying to the request). If this flag comes back 0, then it means that the OS does not support this.

If it is supported, the result is specified through 2 other flags:
CONTEXT_EXCEPTION_ACTIVE - if this comes back set, then the thread was already in the kernel, taking care of a trap (that could escalate to an exception). SetThreadContext is not a good idea.
CONTEXT_SERVICE_ACTIVE - if this comes back set, then the thread was already in the kernel, taking care of a syscall. Once again, SetThreadContext is not a good idea.

If these 2 flags come back 0, then the thread entered the kernel for an interrupt. SetThreadContext is reliable - go ahead and hijack the thread.

On Windows 7 these flags were supported for native threads but not simulated (WoW) ones. On Windows 8 and later, these flags work on both native and WoW threads.
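
The reply bits described above can be decoded with a small helper. This is a minimal sketch in C, not code from any Microsoft sample: `classify_context_flags` and the `suspend_state` enum are hypothetical names, and the fallback `#define`s repeat the values from `winnt.h` only so the snippet also builds with older SDKs or off Windows.

```c
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Fallback definitions (values as in winnt.h) for SDKs that lack them. */
#ifndef CONTEXT_EXCEPTION_REQUEST
#define CONTEXT_EXCEPTION_ACTIVE    0x08000000L
#define CONTEXT_SERVICE_ACTIVE      0x10000000L
#define CONTEXT_EXCEPTION_REQUEST   0x40000000L
#define CONTEXT_EXCEPTION_REPORTING 0x80000000L
#endif

/* Hypothetical names, introduced only for this illustration. */
typedef enum {
    SUSPEND_UNKNOWN,     /* OS did not answer the request            */
    SUSPEND_IN_TRAP,     /* in the kernel handling a trap/exception  */
    SUSPEND_IN_SYSCALL,  /* in the kernel handling a system call     */
    SUSPEND_INTERRUPTED  /* neither bit set: entered by an interrupt */
} suspend_state;

/* Decode the ContextFlags value that GetThreadContext wrote back after
   the caller asked for it with CONTEXT_EXCEPTION_REQUEST. */
suspend_state classify_context_flags(uint32_t flags)
{
    if (!(flags & CONTEXT_EXCEPTION_REPORTING))
        return SUSPEND_UNKNOWN;
    if (flags & CONTEXT_EXCEPTION_ACTIVE)
        return SUSPEND_IN_TRAP;
    if (flags & CONTEXT_SERVICE_ACTIVE)
        return SUSPEND_IN_SYSCALL;
    return SUSPEND_INTERRUPTED;
}
```

On Windows, a caller would presumably set `ctx.ContextFlags = CONTEXT_CONTROL | CONTEXT_EXCEPTION_REQUEST` before calling `GetThreadContext` on a suspended thread and then pass the returned `ctx.ContextFlags` to the helper. Only `SUSPEND_INTERRUPTED` corresponds to the "go ahead" case in the description above.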

Zach Saw said...

@Anonymous

The problem had nothing to do with SetThreadContext. It was purely with GetThreadContext returning stale info. This problem continues to exist on Win8.1!

Anonymous said...

Regardless. If you make sure you hijack a thread (which requires executing SuspendThread, GetThreadContext and SetThreadContext) only when the flags say it is safe to do so, then the end goal should work just fine.

Zach Saw said...

Let me try to understand what you're saying.

On Windows XP, even in native 32-bit mode, we can't be certain that the context returned via GetThreadContext is correct? Since WinXP does not support CONTEXT_EXCEPTION_REQUEST, GetThreadContext is plain unreliable under XP and should be avoided completely?

On Win7, even in native 32-bit mode, we need to check the CONTEXT_EXCEPTION flags before we assume the contents are correct. However, this doesn't work in WOW64 mode because it isn't supported.

On Win8, CONTEXT_EXCEPTION flags are supported both under native 32-bit and WOW64 modes.

Am I correct?

Zach Saw said...

Just another thing -- I'm not hijacking the thread. I'm merely relying on information returned via GetThreadContext for tracing managed pointers. I'm not touching the context in any way, which is what hijacking typically means.

Anonymous said...

I see what you are saying - you are performing managed type reference counting cross-thread instead of performing it in-thread after hijacking it.

On native, GetThreadContext is always reliable. SetThreadContext requires those flags.

On WoW, given that it is implemented in user mode (and user mode can't block APCs and stuff), the context can be stale during short time windows when using hardware mode. As a workaround you could use those flags, which should give you confidence that you are not in those dreadful windows.
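
One way to read this workaround is as a suspend-check-retry loop. The sketch below is only an illustration under that assumption: `sampler`, `suspend_at_safe_point` and the `fake_*` stand-ins are hypothetical names, the thread operations are passed in as callbacks so the loop can be exercised without Windows, and the `FLAG_*` constants repeat the `winnt.h` values of `CONTEXT_EXCEPTION_ACTIVE`, `CONTEXT_SERVICE_ACTIVE` and `CONTEXT_EXCEPTION_REPORTING`.

```c
#include <stddef.h>
#include <stdint.h>

/* Reply bits, same values as the CONTEXT_* constants in winnt.h. */
#define FLAG_EXCEPTION_ACTIVE    0x08000000u
#define FLAG_SERVICE_ACTIVE      0x10000000u
#define FLAG_EXCEPTION_REPORTING 0x80000000u

/* Hypothetical hooks. On Windows they would wrap SuspendThread,
   GetThreadContext (with CONTEXT_EXCEPTION_REQUEST set) and ResumeThread. */
typedef struct {
    void *thread;
    int  (*suspend)(void *thread);                    /* nonzero on success */
    int  (*read_flags)(void *thread, uint32_t *out);  /* nonzero on success */
    void (*resume)(void *thread);
} sampler;

/* Returns 1 with the thread left suspended once the reply reports that no
   trap or syscall was in progress. Returns 0 with the thread resumed if
   that never happens within max_attempts, which is always the outcome on
   an OS that does not set the REPORTING bit. */
int suspend_at_safe_point(const sampler *s, int max_attempts)
{
    for (int i = 0; i < max_attempts; ++i) {
        uint32_t flags = 0;
        if (!s->suspend(s->thread))
            return 0;
        if (s->read_flags(s->thread, &flags)
            && (flags & FLAG_EXCEPTION_REPORTING)
            && !(flags & (FLAG_EXCEPTION_ACTIVE | FLAG_SERVICE_ACTIVE)))
            return 1;               /* context can be trusted; keep suspended */
        s->resume(s->thread);       /* bad window: let the thread run on */
    }
    return 0;
}

/* Scripted stand-in for a real thread, used only to exercise the loop. */
typedef struct {
    const uint32_t *replies;  /* flags handed back on successive attempts */
    size_t count, next;
    int suspended;
} fake_thread;

static int fake_suspend(void *t) { ((fake_thread *)t)->suspended = 1; return 1; }
static void fake_resume(void *t) { ((fake_thread *)t)->suspended = 0; }
static int fake_read(void *t, uint32_t *out)
{
    fake_thread *f = (fake_thread *)t;
    if (f->next >= f->count)
        return 0;
    *out = f->replies[f->next++];
    return 1;
}
```

Real code would likely yield between attempts (for example with `SwitchToThread`) so the target thread can leave the window, and would need a different strategy on systems that never set the reporting bit.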

Zach Saw said...

Ah! Thank you so much for this very useful info! I'll make sure to write something up to detail what exactly needs to be done. It'll be helpful for others facing the same problem.

Still no solution for those OSes that don't support those extended flags though, is there?

Ira said...

So I'm a victim of this bug too; I have an application that wants to do both thread state inspection (e.g, what's in the registers including ESP) and thread hijacking.

I don't mind being told (like the Magic 8 ball), "Situation hazy, try again later" but being told, "Situation always hazy" is pretty bad news for XP/Vista/Windows7-64.

Can somebody explain how 32-bit debuggers work on 64-bit Windows boxes, given that the GetThreadContext content isn't reliable? BREAKALL in the debugger is implemented using SuspendThread, right? If not, what method does the debugger use?

I agree with Zach. Alexey was whining about not breaking existing applications and using that to justify not fixing the behavior of this... yet the problem is that MS's implementation breaks existing applications. He can't have it both ways; I think MS is simply being cheap. Worse, I think MS is saving money, having gotten ours, and worse, is costing us money because of something they were too cheap to fix. That's hardly what I call nice corporate behavior.