Wednesday, August 20, 2008

Larrabee picks it up where CUDA fails

Having read most of the publications by nVidia, ATI/AMD and Intel made available to SIGGRAPH, I have to say, I'm a believer in Larrabee. Most of the problems that plagued CUDA involves having to design and offload only certain parts of the algo which can be suited for GPU and which is small enough in terms of bandwidth utilization across PCIe over to the GPU, and then getting the results back from it via the same path.

The reason this is even being discussed lies in the fault of the whole GPGPU concept. The GPU is good at one thing - being fed textures (compressed) and command that are then pumped through its fat pipelines to get results (rendered image). Use it for something more generic, we have to deal with issues such as the PCIe bandwidth and having to feed the onboard frame buffer with enough contiguous data to work with. Say we have infinite video RAM. Even then, we'll still have to do some parts of the algo on the CPU as the GPU is just incapable of doing things like scalar operations and sequential branching algos (namely tree algos - heck, CUDA doesn't even do recursion) effectively. With a measly PCIe between CPU and GPU, any performance gained will most likely be offset.

CUDA is at best, a DSP SDK. nVidia's attempt at using its GPU as a very basic DSP. Nothing more. Yes, you may find that offloading some parts of, say, a H264 encoder will give you some gains. But if you go further, and implement say, anything beyond the baseline profile, you'll run into troubles. You'll get some gains no doubt, since the GPU is always a free agent if it's not being utilized. Is it worth the effort though? Hardly. The x264 developer has gone out to say CUDA is the worst API / language he's ever encountered (particularly with the threading model).

Larrabee, however, will change the landscape quite a bit. All the above mentioned problems, are exactly what Larrabee seeks out to solve. OpenMP for threading model, much higher level of abstraction between CPU and Larrabee (it's capable of running Pentium x86 instruction sets, so there's no need to go back to the CPU as frequently as GeForce / Radeon), and SSE vector instruction sets -- these are all directly targeted at the downfalls of CUDA!

When Pat Gelsinger said CUDA will just be a footnote in computing history, nVidia was a fool to laugh it off. It's already happening. Perhaps Wiki should start deleting their CUDA pages and start footnoting GPGPU pages with a short and sweet "meanwhile, there's CUDA" line. :)


RonVal said...

Badaboom H.264 encoder debunks your POV.

Zach Saw said...

Or so the 'great' RonVal says...

If you have any prove whatsoever that Badaboom has a higher on average quality-speed curve vs x264, Dark Shikari (one of the x264 devs) would be happy to hear from you.