Rakarin:
Isn't that sorta what IBM was smoozing to with their
Power6 processor?
Do you mean the massively parallel thing? Yes. The PowerPC architecture is vector based. With a scalar architecture, the data sits in the processor pipeline longer, and you hammer the bejeezus out of it. In a vector-based design (PPC, and video hardware), you have a shorter pipeline and single-instruction, multiple-data processing. Vector data is also easier (as I understand it, on PPC through "AltiVec") to move between processors/cores, and even to spread across cores, so parallel processing is easier. Years ago, when I had a G4, I remember Folding@Home had their two SMP (Symmetric Multi-Processing, i.e. parallel processing) clients in beta, and BOINC projects were still discussing the issue in the abstract. Meanwhile, I had BOINC running on one processor and F@H on the other on my G4. If F@H couldn't get work, my WCG work units would temporarily "spread" over both processors, and that one process would take 200% of the processor. (Note: on Windows, a fully loaded dual core shows as 100% [50+50]; on OS X, it shows as 200% [100+100].)
Anyway, PowerPC and SPARC processors are working with this idea. If your data can be handled in parallel, you are better off with 6 or 8 or 10 or 12 small cores working together, slower and cooler, than with two or four big cores that require a cooling system that can suck up pets and small children. On the floating-point side, you see exactly the same thing with Cell (its 8 SPEs plus a PPE), CUDA, and Larrabee:. (I don't know if that colon is required. I think they're trying to make it look like an old-school port, like COM1: or LPT1:.)
Now, that's your SIMD (single-instruction, multiple-data) crunching. If you need high performance per thread, and you have a small or singular data set and just need to hammer it with a lot of math, Intel or AMD is the processor of choice. If you do scalar-type math on a PPC processor, because of the short pipeline the processor sucks in one or a few elements, runs a portion of the instructions, releases, inputs, quick crunch, input again, and so on. It's not efficient. On x86, the data is held longer and more can be done on it each cycle.
GPU processing is a hot thing because x86 processors have developed parallelism (SSE 1-4.x), but GPUs do it better and faster, and with floating point. Also, everyone has a GPU now. It's like scrimping and saving to buy a good engineering calculator, then discovering your neighbor has a high-end server. Your time and resources are better spent buying gifts for your neighbor so he'll let you have a network connection and access.