Zach Saw's Blog: August 2008

Friday, August 29, 2008

Testing PHP on Windows - in 5 seconds! Without installing Apache, IIS or even PHP.

Testing PHP on Windows in less than 5 seconds without installing Apache, IIS or even PHP. Is this possible?
Yes it certainly is - and it's FREE.
QuickPHP is designed specifically for this purpose.
Here are the steps to test PHP on Windows in less than 5 seconds:

1) Download QuickPHP WebServer (quickphp_webserver.zip) from http://www.zachsaw.com/?pg=quickphp_php_tester_debugger
2) Unzip the file into C:\QuickPHP
3) Run QuickPHP.exe from that folder
4) Hit Start.

That's it!
You can now test your PHP webpages by browsing to http://127.0.0.1:5723 and QuickPHP will run 'index.php' in the 'C:\' folder (of course, make sure you have a file called 'C:\index.php' - if not, copy the code below and paste it into 'C:\index.php').
If you wish, you can change the webserver's root to your local webpage folder and the default document name to point to your own index file.
index.php:
<?php phpinfo(); ?>
If you need any help, you can visit the QuickPHP forum.

Wednesday, August 20, 2008

Larrabee with FPGA pledge

Following up on my pledge to Intel to include a real-time reprogrammable high-speed FPGA on Larrabee, it looks like it would definitely be useful in a number of applications. With texture filtering being done in hardware when Larrabee runs as a GPU, we could reprogram the FPGA to do motion search for H.264 encoding. We're now going into a new era where computer engineers are very good at both software and hardware design.

I suggested the idea of including an FPGA as part of the CPU to a fellow employee / manager back at Intel but unfortunately, it never got any attention. That was back in 2001. Seven years later, we're now seeing companies making full use of FPGAs to accelerate applications that don't run efficiently on a CPU. Larrabee solves some of the things I said an FPGA would solve, but there are definitely several other applications out there that would benefit from an FPGA. I also pointed out that Intel should come up with a library (i.e. hardware designs) that developers can simply load into the FPGA to accelerate specific types of algos.

Take a look at this.

Larrabee picks up where CUDA fails

Having read most of the publications that nVidia, ATI/AMD and Intel made available at SIGGRAPH, I have to say, I'm a believer in Larrabee. Most of the problems that plague CUDA come from having to design and offload only those parts of the algo that suit the GPU and that are small enough, in terms of bandwidth utilization, to send across PCIe to the GPU, and then getting the results back via the same path.

The reason this is even being discussed is a flaw in the whole GPGPU concept. The GPU is good at one thing - being fed (compressed) textures and commands that are then pumped through its fat pipelines to get results (a rendered image). Use it for something more generic and we have to deal with issues such as the PCIe bandwidth and having to feed the onboard frame buffer with enough contiguous data to work with. Say we have infinite video RAM. Even then, we'll still have to do some parts of the algo on the CPU as the GPU is just incapable of doing things like scalar operations and sequential branching algos (namely tree algos - heck, CUDA doesn't even do recursion) effectively. With a measly PCIe link between CPU and GPU, any performance gained will most likely be offset by the transfer cost.
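To put rough numbers on the PCIe argument, here is a minimal back-of-envelope sketch. The function name `offload_speedup` and every figure in the example (kernel times, data size, link bandwidth) are hypothetical values chosen for illustration, not measurements of any real card.

```python
def offload_speedup(cpu_seconds, gpu_seconds, bytes_moved, pcie_bytes_per_second):
    """Net speedup of offloading one step to the GPU, including transfer time.

    A result below 1.0 means the offload is slower than staying on the CPU.
    """
    # Time spent pushing the input across the bus and pulling results back.
    transfer_seconds = bytes_moved / pcie_bytes_per_second
    return cpu_seconds / (gpu_seconds + transfer_seconds)


# Hypothetical example: the GPU runs the kernel 10x faster than the CPU,
# but 64 MB has to make the round trip over a link that sustains 4 GB/s.
speedup = offload_speedup(
    cpu_seconds=0.010,
    gpu_seconds=0.001,
    bytes_moved=64e6,
    pcie_bytes_per_second=4e9,
)
print(f"net speedup: {speedup:.2f}x")
```

With these made-up figures the transfer takes longer than the CPU version of the whole step, so the net result is a slowdown even though the kernel itself is ten times faster.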

CUDA is, at best, a DSP SDK: nVidia's attempt at using its GPU as a very basic DSP. Nothing more. Yes, you may find that offloading some parts of, say, an H.264 encoder will give you some gains. But if you go further and implement, say, anything beyond the baseline profile, you'll run into trouble. You'll get some gains no doubt, since the GPU is a free agent whenever it's not being utilized. Is it worth the effort though? Hardly. The x264 developer has gone so far as to say CUDA is the worst API / language he's ever encountered (particularly the threading model).

Larrabee, however, will change the landscape quite a bit. All of the above-mentioned problems are exactly what Larrabee sets out to solve. OpenMP for the threading model, a much higher level of abstraction between CPU and Larrabee (it's capable of running the Pentium x86 instruction set, so there's no need to go back to the CPU as frequently as with GeForce / Radeon), and SSE-style vector instructions -- these are all directly targeted at the downfalls of CUDA!

When Pat Gelsinger said CUDA will just be a footnote in computing history, nVidia was a fool to laugh it off. It's already happening. Perhaps Wiki should start deleting their CUDA pages and start footnoting GPGPU pages with a short and sweet "meanwhile, there's CUDA" line. :)

Thursday, August 14, 2008

Larrabee and TFLOPS SP / DP - the TFLOPS race BEGINS!

There's much confusion over the upcoming Larrabee chip from Intel. It seems that most people who've tried to calculate the peak performance of the chip in terms of TFLOPS couldn't come up with the 2 TFLOPS Intel claimed Larrabee would achieve.

Larrabee's in-order cores are each capable of processing a peak of 16 SP (single-precision floating-point) values per clock (512-bit VPU, hence 16 SP or 8 DP). At 2 GHz with 32 cores, you only get 1.024 TFLOPS SP (2 GHz * 32 * 16 SP). So how come Intel is claiming it is capable of 2 TFLOPS in that configuration?

Well, here goes. 1.024 TFLOPS is the peak for most SIMD instructions, but if we take the MULTIPLY-ADD instruction into account (which Larrabee's vector unit supports - note that it is not part of SSSE3 or SSE4, which are two different extensions) then each lane does two floating-point operations per clock and we multiply 1.024 TFLOPS by 2 - hence giving Larrabee a peak performance of 2.048 TFLOPS SP. Yes, it's Single Precision Floating Point (i.e. 32-bit) and not Double Precision (i.e. 64-bit) as some people are claiming it is. For DP, Larrabee would get a peak of 1.024 TFLOPS.

Before you go saying that Intel's cheating, the Radeon HD 4870 also implements the same instruction and its 1.2 TFLOPS SP figure is calculated with this specific instruction as well (same as Larrabee). A high-end 48-core Larrabee would give a peak performance of about 3 TFLOPS (3.072) at 2 GHz.
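The arithmetic above can be checked with a small sketch. The helper name `peak_tflops` and its parameters are made up for this illustration; the clock speed, core counts and lane counts are the figures quoted in this post.

```python
def peak_tflops(clock_ghz, cores, lanes, flops_per_lane=1):
    """Theoretical peak in TFLOPS: clock * cores * SIMD lanes * flops per lane per clock."""
    gflops = clock_ghz * cores * lanes * flops_per_lane
    return gflops / 1000.0


# 32 cores at 2 GHz, 16 single-precision lanes per core (512-bit VPU).
print(peak_tflops(2.0, 32, 16))                     # plain SIMD ops
print(peak_tflops(2.0, 32, 16, flops_per_lane=2))   # multiply-add counts as 2 flops
print(peak_tflops(2.0, 32, 8, flops_per_lane=2))    # double precision: 8 lanes
print(peak_tflops(2.0, 48, 16, flops_per_lane=2))   # 48-core part
```

These are peak numbers only: they assume every lane of every core issues a multiply-add on every clock, which real code will not sustain.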

An interesting thing about the MULTIPLY-ADD instruction, really - it's patented by two Japanese inventors. This is a single-cycle instruction that performs a multiplication and an addition - it's obviously very beneficial to vector calculations, and applications like GPUs depend heavily on this particular instruction.

I can see things shaping up quite nicely for Larrabee. Now that the GHz-race era is behind us, let the TFLOPS race begin!

nVidia gives Larrabee its blessing for CUDA and PhysX

Just a thought - nVidia has said that its CUDA will run on x86 too (duh, that would simply be an x86 C compiler, wouldn't it?) and since PhysX runs on CUDA, that means nVidia has (probably inadvertently) given its blessing to get PhysX running on Larrabee. Now that wouldn't be a bad thing at all.

Very generous of nVidia. ;)

High-speed FPGA to complement Larrabee

Even before I joined Intel, I had this idea in mind - to have a high-speed FPGA as a coprocessor. I think now is a much better time to propose this solution to the world than ever before. With the buzz going around Larrabee and its need for a fixed-function unit such as the rasterization unit for GPU work, it would be so much more flexible if this were implemented as a block of FPGA. The driver would then be responsible for configuring this block as whatever the application sees fit. Anything that does not fit in the cGPU paradigm could be hardware accelerated via the FPGA block.

Any take on this, Intel?

About Intel Larrabee

About Larrabee and Larrabee vs GeForce / CUDA or ATI / CTM (without repeating what you could look up on other sites):

1) Michael Abrash, Tim Sweeney and John Carmack are all on board Intel's software team for Larrabee. This should give them a pretty solid team (understatement) for driver development.

2) A quote from GCDC'08: multi-thread your DirectX code and drivers. "3. Direct 3D runtimes and drivers account for 25-40 percent of CPU cycles per frame. This needs to be reduced in order to push performance!"

The freedom of offloading these 25-40 percent to Larrabee and leaving the CPU to process everything else is quite significant. This is, however, something they're still working on, as some calls involve the OS kernel, which is not a natural fit while Larrabee sits on PCIe. Again, the ultimate goal is to get Larrabee sitting on your motherboard running as a co-processor, in which case scheduling will be done by the OS just as it would be for a normal processor. The reason behind the design decision to use software task scheduling is obviously twofold.

3) CUDA does not support recursion (among several other things). This is unlikely to be implemented in the near future due to hardware limitations - unless nVidia implements sophisticated prefetch hardware like Larrabee's, it will most likely never happen.

Developers look for a free lunch. CUDA doesn't seem to provide that very well, as it requires the algo to be completely rewritten - see www.gpuchess.com for example. With that said, it doesn't mean that nVidia can't emulate some of these features through other means, like what GpuChess has done in its compiler. But that leads me to the next point.
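As a small illustration of the kind of rewrite meant in point 3, here is the same tree walk written recursively and then with an explicit stack, which is the usual workaround when recursion isn't available. This is only a sketch in Python with made-up function names and a made-up tree layout of (value, children) tuples; it is not how GpuChess or CUDA code actually does it.

```python
def sum_tree_recursive(node):
    """Natural recursive form: a node is a (value, children) tuple."""
    value, children = node
    return value + sum(sum_tree_recursive(child) for child in children)


def sum_tree_iterative(root):
    """Same result without recursion: pending nodes are kept on an explicit stack."""
    total = 0
    stack = [root]
    while stack:
        value, children = stack.pop()
        total += value
        stack.extend(children)  # visit the children later instead of recursing
    return total


# Hypothetical example tree:  1 -> (2 -> 4), 3
tree = (1, [(2, [(4, [])]), (3, [])])
print(sum_tree_recursive(tree), sum_tree_iterative(tree))
```

For a toy like this the change is small, but for a real tree search (move ordering, pruning, results passed back up) the explicit-stack version means restructuring the whole algo.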

4) CUDA is a C-like language. That's good - but how do you get C++ / C# / VB / Delphi / Java etc. developers to code for it? You don't, unless nVidia starts writing its own .NET IL runtime libraries and VCL runtimes for its hardware (read: doesn't make sense financially, and impossible in the limited time frame before Larrabee debuts). Larrabee gets all of these for free.

The final point is what I'm excited about - because you're not restricted to just CUDA-C. You're free to develop in whatever language you're most familiar with. The best part is, with the binaries compiled for Larrabee (if you don't go for the exotic mnemonics of course), it'll be possible to run it on a machine without Larrabee, albeit much slower - but, at least it will run. I don't see any developers (bar hobbyists) getting excited over writing the same algo 3 different ways - CUDA, CTM and x86.

I don't know about the rest of you, but this looks like a very good idea to me. When I was working at Intel, I was going to propose something similar to Larrabee, but a more hardware-oriented solution. Maybe it's still possible. I'll leave that for the next post.