There's much confusion over the upcoming Larrabee chip from Intel. It seems that most people who've tried to calculate the peak performance of the chip in terms of TFLOPS couldn't come up with the 2 TFLOPS Intel claimed Larrabee would achieve.
Larrabee's in-order cores can each process a peak of 16 single-precision (SP) floating-point values per clock (the VPU is 512 bits wide, hence 16 SP or 8 DP values). At 2 GHz with 32 cores, you only get 1.024 TFLOPS SP (2 GHz * 32 cores * 16 SP). So how come Intel is claiming 2 TFLOPS for that configuration?
Well, here goes. 1.024 TFLOPS is the peak for most SIMD instructions, which perform one floating-point operation per lane. The MULTIPLY-ADD instruction performs two (a multiply and an add), so if we take it into account we multiply 1.024 TFLOPS by 2, giving Larrabee a peak performance of 2.048 TFLOPS SP. (Note that this is not something SSSE3 or SSE4 provides: those are two different extensions, and neither includes a floating-point multiply-add.) Yes, that figure is single-precision floating point (i.e. 32-bit) and not double precision (i.e. 64-bit), as some people are claiming. For DP, with 8 lanes instead of 16, Larrabee would get a peak of 1.024 TFLOPS.
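To make the arithmetic easy to check, here is a minimal Python sketch of the calculation above. The function name peak_tflops and its parameters are my own, for illustration only; the inputs (2 GHz, 32 cores, 512-bit VPU) are the figures quoted in this post, not confirmed specifications.

```python
def peak_tflops(clock_ghz, cores, lanes, flops_per_lane=1):
    """Peak throughput in TFLOPS.

    clock (GHz) * cores * SIMD lanes * flops per lane per clock gives GFLOPS;
    dividing by 1000 converts to TFLOPS.
    """
    gflops = clock_ghz * cores * lanes * flops_per_lane
    return gflops / 1000.0


VPU_BITS = 512            # vector unit width assumed in the post
SP_LANES = VPU_BITS // 32  # 16 single-precision lanes
DP_LANES = VPU_BITS // 64  # 8 double-precision lanes

# Ordinary SIMD instruction: one floating-point operation per lane.
sp_plain = peak_tflops(2.0, 32, SP_LANES)
# Multiply-add: counted as two floating-point operations per lane.
sp_madd = peak_tflops(2.0, 32, SP_LANES, flops_per_lane=2)
dp_madd = peak_tflops(2.0, 32, DP_LANES, flops_per_lane=2)

print(f"SP, plain SIMD:   {sp_plain:.3f} TFLOPS")
print(f"SP, multiply-add: {sp_madd:.3f} TFLOPS")
print(f"DP, multiply-add: {dp_madd:.3f} TFLOPS")
```

The only difference between the 1.024 and 2.048 figures is the flops_per_lane factor, which is exactly the point of contention.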
Before you go saying that Intel's cheating, the Radeon HD 4870 implements the same kind of instruction, and its 1.2 TFLOPS SP figure is calculated the same way as Larrabee's, by counting a multiply-add as two operations. A high-end 48-core Larrabee would give a peak performance of about 3 TFLOPS (3.072, to be exact) at 2 GHz.
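As a rough cross-check of both numbers, the same counting rule can be applied to each chip. The Radeon HD 4870 inputs below (800 stream processors at 750 MHz) are the commonly published specifications and do not come from this post; treating each stream processor as one lane is a simplification, and the table layout and names are my own.

```python
# (name, clock in GHz, parallel units, lanes per unit, flops per lane per clock)
# A multiply-add is counted as 2 flops, as in the vendors' peak figures.
CONFIGS = [
    ("Radeon HD 4870", 0.75, 800, 1, 2),    # assumed: 800 stream processors @ 750 MHz
    ("Larrabee, 32 cores", 2.0, 32, 16, 2),  # figures quoted in this post
    ("Larrabee, 48 cores", 2.0, 48, 16, 2),  # hypothetical high-end part
]

results = {}
for name, clock_ghz, units, lanes, flops_per_lane in CONFIGS:
    gflops = clock_ghz * units * lanes * flops_per_lane
    results[name] = gflops / 1000.0  # GFLOPS -> TFLOPS

for name, tflops in results.items():
    print(f"{name:20s} {tflops:.3f} TFLOPS peak SP")
```

These are theoretical peaks that assume every lane issues a multiply-add on every clock, so real workloads will land below them on either chip.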
An interesting thing about the MULTIPLY-ADD instruction, really: it's patented by two Japanese inventors. It is a single-cycle instruction that does a multiplication and an addition. It's obviously very beneficial to vector calculations, and applications like GPUs depend largely on this particular instruction.
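For anyone unsure what the instruction actually computes, here is a small Python sketch of a 16-lane multiply-add and why it is counted as two operations per lane. The function vector_multiply_add is hypothetical and only models the arithmetic; it is not how any hardware implements it, and a true fused multiply-add rounds once where this version rounds twice.

```python
def vector_multiply_add(a, b, c):
    """Compute a[i] * b[i] + c[i] for every lane.

    Returns the result vector and the number of floating-point operations
    this counts as (one multiply plus one add per lane).
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all three vectors must have the same number of lanes")
    result = [x * y + z for x, y, z in zip(a, b, c)]
    flops = 2 * len(a)
    return result, flops


LANES = 16  # single-precision lanes in a 512-bit vector
a = [1.0] * LANES
b = [2.0] * LANES
c = [3.0] * LANES

result, flops = vector_multiply_add(a, b, c)
print(result[0], flops)
```

One instruction on 16 lanes is therefore credited with 32 floating-point operations, which is where the factor of 2 in the peak figures comes from.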
I can see things shaping up quite nicely for Larrabee. Now that the GHz-race era is behind us, let the TFLOPS race begin!