Oh, I really should know better than to wade into debates like these....
OK, I should start by pointing out that I'm not a programmer. I do work in IT (desktop support), and I have an interest in operating systems, processor architecture, distributed computing, and such. My current “farm” is: Windows, AMD 32-bit; Fedora Core 8, AMD 64-bit; OSX, G5 PPC; Ubuntu, Via C7-D; PS3 (a.k.a. “the other boyfriend”) (not running Linux, no HDMI). I have also worked with SunOS and Solaris on Sparc workstations.
I'm not trying to pass myself off as an expert on such things, and welcome anyone pointing out my mistakes. I'm just trying to say I've done some reading and done some work, so I can also offer observations. This is also a fairly broad and complex issue, so I hope I say something useful rather than just muddy the water.
Let me start off by repeating a quote about Intel vs. mainframe systems, “It's like comparing a sports car to a tractor. Both are powerful, but one will burn out in the field, the other will be a speed bump on the highway.”
People who talk about the Cell processor usually quote the FLOPS, Whetstones / Dhrystones, or some other measure of processing power. Unfortunately, this is not always the best measurement, as it is taken under optimized conditions, or is an extrapolation of “all transistors in use”. If you look, graphics processors on the video card have measurements that can smoke the CPU, but only under very specific kinds of work.
If you have a PS3, remember you also have Folding@Home, which is a very worthy project. Their web site is also well worth checking out, as they discuss the advantages and disadvantages of running F@H on a CPU, GPU (or VPU, graphic processor), and PS3. The best output is on the GPU client, by far. If you look, what it can do is staggering. However, it only works well for specific kinds of formulas. Very specific modeling processes run on a GPU very quickly, but if you try to force other modeling into the client, it actually slows down, sometimes quite a bit. The PS3 is similar to a GPU, but more robust. It does not have the same speed, but is able to handle more kinds of modeling. So, on a PS3, more kinds of modeling get a significant speed increase than on the GPU, but the increase is not as large. Also, some kinds of modeling do not work well on the PS3, and are faster on a mid-range desktop computer. Computer CPUs are designed to handle just about anything. Modeling equations that choke a GPU and are slow on a PS3 will work fine on a computer CPU, although perhaps not with blazing speed.
Now, to change gears, I'm going to do a very broad and generalized discussion of processor architecture. I work at a research institution, and was at a few presentations by Apple. Before they dropped the G5 PPC architecture, they were trying to push it in research settings. Certain modeling programs worked significantly faster on a G5 than on Intel / AMD processors.
There are two things to look at here. The first I believe is called “program registers”. These are basically points on the processor where a bit of programming starts working on each processor cycle. It's a starting gate at the transistor fields. To a degree, the more a processor has, the more programs can run in parallel on a processor. Historically, Intel processors have had fewer than other architectures (PPC, Sparc, Tru64, Itanium, Opteron). This was actually one advantage of older AMD processors over Intel, and how AMD Athlons could get similar performance to Intel Pentiums at lower clock speeds.
The second thing to look at is the basic processor architecture. PPC processors (G3 [Nintendo GameCube], G4 [processor of the Ageia PhysX card], G5, G6, Cell Broadband, Freescale PPC, as well as the Nintendo Wii, the 3DO console, and others) are what is called “vector based”. This is the same basic architecture as a video card. Essentially, it is an architecture designed for SIMD data, or Single Instruction, Multiple Data, and a “short pipe length”. This type of architecture was imported, in pieces, by the different SSE sets in x86 processors, but is still an optimized library.
The basis of SIMD data is that you have a large data set, say pixel shading for a video card or a population set for a population simulation running on a PPC Mac, and fairly short, simple, repeatable instructions to be run on each data point. Let's look at graphics for a moment. Say you're displaying a bunch of pixels, and you have to find the RGB (red, green, blue) value for each in the display. This is perfect for SIMD computation. You load as many data points as your processor registers allow, on the next processor cycle you run the fairly simple calculation for the red value across each, then the next cycle repeat for green, then the next cycle for blue, then record and release. (Yes, this is a gross oversimplification, I know.) The same applies for bio-statistical modeling of human populations. You can apply short, simple bits of math to numerous data points simultaneously, allowing you to go through your entire population set much more quickly.
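To make the "load, crunch, release" idea a bit more concrete, here is a rough sketch in plain Python. It only imitates the idea in software; it is not real SIMD hardware code, and the register width of 4 (`LANES`) and the function names are things I made up for illustration. It counts how many "steps" it takes to apply one simple operation to a list of red values, one at a time versus four at a time.

```python
# Toy imitation of SIMD: one "instruction" applied to a group of values per step.
# LANES is a hypothetical register width (4 values per step), not a real spec.
LANES = 4

def scalar_scale(values, factor):
    """Apply the multiply to one data point per step."""
    out = []
    steps = 0
    for v in values:
        out.append(v * factor)
        steps += 1
    return out, steps

def simd_scale(values, factor, lanes=LANES):
    """Apply the same multiply to up to `lanes` data points per step."""
    out = []
    steps = 0
    for i in range(0, len(values), lanes):
        chunk = values[i:i + lanes]             # "load" as many points as fit
        out.extend(v * factor for v in chunk)   # one instruction, many data
        steps += 1                              # the whole chunk counts as one step
    return out, steps

if __name__ == "__main__":
    red = [10, 20, 30, 40, 50, 60, 70, 80]
    scalar_out, scalar_steps = scalar_scale(red, 2)
    simd_out, simd_steps = simd_scale(red, 2)
    print(scalar_out == simd_out)    # same answers either way
    print(scalar_steps, simd_steps)  # 8 steps versus 2 steps
```

The results are identical; the only thing that changes is how many points get handled per step, which is the whole appeal when the math per point is short and simple.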
This is what is meant by “short pipe length”. On each processor cycle, the processor cannot do a lot of math to each data point. However, if you do not need a lot of math for each point, the processor can handle many more points in parallel, and release the results more quickly.
Graphics processors use this because they have to get (relatively simple) shading information on lots and lots of pixels, as many as possible at once, very quickly.
Now, the x86 architecture uses what is called a “long pipe length”. It is the opposite approach. It handles fewer points of data at once. However, what it is designed to do is, in mathematical terms, really beat the snot out of the data before it lets it go. SIMD processing does parallel, non-sequential data quickly. The x86 architecture can put data through long, sequential, convoluted math, and hold it until it is done. If you are working with fewer data points, but need a lot of processing on each point, the x86 architecture can process the data with fewer loads and releases, and be more efficient per processor cycle, having a significant advantage over PPC processors (read “Cell Broadband”).
OK, I should wrap this up... So, the Cell Broadband processor is freaking awesome, yes. It can crunch data fast as heck, yes. However, it's a PPC processor. It's essentially a hyper complex video card on steroids with cybernetic implants, but a video card at heart. It was designed around streaming SIMD data. For “grab-crunch-release” data, it's great. Again, look at what F@H is doing with it. For something it can't do quickly and really has to chew on, though, its advantages really begin to fall apart.
This is where you have to look at the statement, “throw all the BOINC projects on it, and they will be really fast”. Well, not always. If they handle data in a way that works well on a PPC processor (look for the ones that already support PPC, like WCG, which has excellent PPC clients), a PlayStation 3 will tear the data apart. It will be glorious. However, take a project that requires lots of computation on individual data points, and the data must be crunched in successive stages (non-parallelizable data), and the PS3 will suddenly seem slow and a disappointment.
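For anyone who wants to see the difference between the two kinds of work, here is another small Python sketch. Again, this is just my illustration, not code from any BOINC project; `brighten` and the arithmetic inside `sequential_chain` are made-up stand-ins. In the first function every point can be handled independently, in any order. In the second, each stage needs the answer from the stage before it, so having more lanes does not help.

```python
def brighten(pixel):
    """Hypothetical per-point work: depends only on this one pixel."""
    return min(pixel + 16, 255)

def data_parallel(points):
    """Each result depends only on its own input, so order does not matter."""
    return [brighten(p) for p in points]

def sequential_chain(seed, stages):
    """Each stage needs the previous stage's result, so stages cannot overlap."""
    x = seed
    history = []
    for _ in range(stages):
        x = (x * 3 + 1) % 101   # made-up stand-in for one heavy crunching stage
        history.append(x)
    return history

if __name__ == "__main__":
    points = [0, 100, 250]
    # Processing the points in reverse order gives the same answers, reversed.
    print(data_parallel(points[::-1]) == data_parallel(points)[::-1])
    # Stage 4 cannot start until stages 1 through 3 have finished.
    print(sequential_chain(5, 4))
```

The first kind of job is the "grab-crunch-release" work that the PS3 and GPUs eat up. The second kind is where a general-purpose CPU tends to hold its own.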
I hope the “signal to noise” in all that was pretty good....