
A Glimpse Inside the Cell Processor 66

XenoPhage writes "Gamasutra has up an article by Jim Turley about the design of the Cell processor, the main processor of the upcoming PlayStation 3. It gives a decent overview of the structure of the Cell processor itself, including the CBE, PPE, and SPE units." From the article: "Remember your first time? Programming a processor, that is. It must have seemed both exciting and challenging. You ain't seen nothing yet. Even garden-variety microprocessors present plenty of challenges to an experienced programmer or development team. Now imagine programming nine different processors all at once, from a single source-code stream, and making them all cooperate. When it works, it works amazingly well. But making it work is the trick."
This discussion has been archived. No new comments can be posted.

  • by llamalicious ( 448215 ) on Friday July 14, 2006 @01:37PM (#15720272) Journal
    I was 17 and she was 26 and ... oh shit, wrong first time.
  • There aren't many businesses where manufacturing technology exceeds design technology. Throughout human history we've been able to dream up things we can't yet build, like spaceships, skyscrapers, jet packs, underwater breathing apparatus, or portable computers. But in the semiconductor business the situation is reversed: chip makers can build bigger and more complicated chips than they can design. Manufacturing prowess exceeds design capability. We can fabricate more transistors than we know what to do wit
    • I think the article's point was that once you get more and more transistors on there it becomes very difficult to design things to not end up overheating all the time and not use up insane amounts of power, not to mention just becoming extremely complex like x86 cores today.

      Thus, you're right: parallelization is the answer (at least according to the Cell design philosophy). Because it's possible to put so many transistors on there, the way to do it without running into as many problems would be to cr
      • I think the article's point was that once you get more and more transistors on there it becomes very difficult to design things to not end up overheating all the time and not use up insane amounts of power, not to mention just becoming extremely complex like x86 cores today.

        I wasn't talking so much about the article as a whole, but the insane levels of hyperbole in the particular paragraph I quoted. "We're capable of putting more transistors on a chip than we can think of things to do with". That's not even vaguely true.

        More transistors == more power, all else being equal, because it's all those junctions flipping state so quickly that uses the power.

        As for the insanity of Intel's processors... that seems to be a perversion particular to Intel. In the past three decades that I've been following the industry, Intel has only managed to produce *one* sane CPU design, the i960, and they promptly caponised it by removing the MMU and relegating it to embedded controls lest it outcompete their cash cow.

        The rest... from the 4004 through the 8080, the 8086 and its many descendants, iAPX 432, i860, and Itanium... have been consistently outperformed by chips with smaller transistor budgets built by companies with far fewer resources. They only occasionally broke past the midrange of the RISC chips, and were usually trailing back with the anemic SPARC. Where they have excelled has been in marketing and in the breadth of their support... both hardware and business. IBM went with the 8088 because they could get them in quantity and they could get good cheap support chips for them: if you went with Motorola or Zilog or Western Digital or National Semiconductor you pretty much had to go back to Intel to build the rest of your computer anyway.
    • Except, of course, that ray tracing is not easily parallelizable as you need a significant amount of data to each of those postage stamp size pieces (hey, that's one of the reasons that "just" rendering triangles is so much easier, you take a global problem and make it local). Wiring all those transistors would be hard. Adding cache and cores is also, to some degree, the solution when you are out of ideas. It will make things better, but it's a quite expensive way to get the scaling (especially cache).
      • Except, of course, that ray tracing is not easily parallelizable as you need a significant amount of data to each of those postage stamp size pieces

        The mesh is common to all the processors, and not that big, it can be broadcast. Textures are the big chunk, but most pieces will only need high resolution versions of the textures in their direct view... unless a processor is looking at an optically interesting surface (for reflections or refractions) it can get by with mesh-resolution approximations to the tex
        • But raytracing is practically the poster boy for "embarrassingly parallelizable" applications.

          You neglected to mention the primary reason this is true; you don't have to do anything fancy, because it's fairly rare that we even need to parallelize rendering a single frame these days - most rendering involves big bulk numbers of frames which are later assembled into a video. You can always send individual frames to clients. Thus you can parallelize it without even doing anything hard.
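The frame-per-client approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: `render_frame` is a hypothetical stand-in for a real ray tracer, and a thread pool stands in for the render clients (a real farm would use separate processes or machines).

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 16, 8  # tiny frame size, chosen only for illustration


def render_frame(frame_index):
    """Hypothetical stand-in for a ray tracer: returns one frame as rows of pixel values."""
    return [[(x + y + frame_index) % 256 for x in range(WIDTH)]
            for y in range(HEIGHT)]


def render_animation(frame_count, workers=4):
    """Render every frame independently; no frame needs data from any other frame."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands one frame index to each worker and returns results in frame order
        return list(pool.map(render_frame, range(frame_count)))


frames = render_animation(24)
```

Because the frames share nothing but the read-only scene, the only coordination needed is collecting them in order at the end.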

  • Sega Saturn Redux? (Score:4, Interesting)

    by ToxikFetus ( 925966 ) on Friday July 14, 2006 @02:09PM (#15720523)
    As TFA mentioned, this has the potential of becoming another Sega Saturn boondoggle. Will the developers learn how to fully utilize this incredibly complex architecture? Relying on the "octopiler" to efficiently map to the Cell architecture seems a bit optimistic and naive.
    • Comment removed based on user account deletion
    • by SSCGWLB ( 956147 )
      I seriously doubt they will write efficient programs in the lifetime of this console. The level of efficiency they will achieve depends on a lot of things. I didn't see it in TFA, but I am assuming you cannot treat each SPE as an individual processor.

      First of all, their dream of a general 'octopiler' is pure fantasy. I have written massively parallel MPI and Shared Memory applications and can testify to their complexity. Mapping an arbitrary piece of code transparently to multiple processors is an extr
      • I am assuming you cannot treat each SPE as an individual processor.

        Your assumption would be wrong.
        • by SSCGWLB ( 956147 )
          Thanks for the condescending and uninformative remark. What I was not sure of was if the OS treated each SPE as a separate, autonomous core (i.e. SMP). I had assumed the context of my question made that clear. As it turns out, my assumption was correct.

          "The PPE which is capable of running a conventional operating system has control over the SPEs and can start, stop, interrupt and schedule processes running on the SPEs. To this end the PPE has additional instructions relating to control of the SPEs. De
          • Thanks for the condescending and uninformative remark.

            My pleasure. ;p

            Yeah, the PPE has to kickstart an SPE, but after that, you can treat the SPE as totally autonomous. They can fetch their own code and data, and what more do you need than that? You don't have to, you can manage them pretty much any way you want to. The PPE can halt an SPE, but that's a really inefficient way of doing things. Think of the size of the context you'd have to swap out to have the PPE control the threading on the SPEs.

            Also I'd
            • The Cyber architecture [wikipedia.org] had typically two main CPUs (60-bit), and 12-20 "Peripheral Processing Units", which were much lower capacity, 12-bit processors. The CPUs were started and stopped by the PPUs, and had no interrupt architecture. Control of the system was actually in the PPUs, they loaded programs into memory, set up memory mapping, handled context switches and system requests. PPUs themselves were implemented as shared hardware with multiple contexts, and control actually changed after each instruc

          • They are autonomous cores. Indeed, the best analogy for them is a node in a network. They've got their own non-coherent local memory, and are connected via a ring bus.

            The programming model for the SPEs is fairly straightforward. You bundle some code and some data into an APUlet, and upload it via the ring bus to the SPE. The SPE runs that code for some amount of time, and can communicate with the rest of the chip either by sending messages over the ring bus (using a mailbox mechanism), or doing DMAs.
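A rough simulation of that model, for readers who want something concrete: each worker below has a private copy of its data and talks to the controller only through mailboxes. Everything here is hypothetical (`SimulatedSPE`, `run_on_spes`, the queue-based mailboxes); it is not the Cell SDK, just ordinary Python threads arranged to mimic the bundle, upload, run, reply pattern.

```python
import queue
import threading


class SimulatedSPE:
    """Toy stand-in for an SPE: a private local store plus in/out mailboxes."""

    def __init__(self, spe_id):
        self.spe_id = spe_id
        self.inbox = queue.Queue()   # controller -> SPE messages
        self.outbox = queue.Queue()  # SPE -> controller messages
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            job = self.inbox.get()
            if job is None:  # shutdown message
                break
            kernel, data = job
            local_store = list(data)  # private copy, mimicking a transfer into local memory
            self.outbox.put((self.spe_id, kernel(local_store)))


def sum_of_squares(values):
    """The 'code' half of the bundle; any function of the local data would do."""
    return sum(v * v for v in values)


def run_on_spes(data, spe_count=8):
    """Controller side: split the data, upload one bundle per SPE, gather replies."""
    spes = [SimulatedSPE(i) for i in range(spe_count)]
    chunks = [data[i::spe_count] for i in range(spe_count)]
    for spe, chunk in zip(spes, chunks):
        spe.inbox.put((sum_of_squares, chunk))
    partials = [spe.outbox.get() for spe in spes]
    for spe in spes:
        spe.inbox.put(None)
    for spe in spes:
        spe.thread.join()
    return sum(result for _, result in partials)


total = run_on_spes(list(range(1000)))
```

The controller never reaches into a worker's data directly, which is the point of the network-node analogy: all sharing is explicit.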
    • by jd ( 1658 ) <imipak@ y a hoo.com> on Friday July 14, 2006 @03:20PM (#15720984) Homepage Journal
      Not sure it's that complex. If anything, it sounds rather limiting. Eight isolated physical coprocessors, each supporting two threads? Why not have one coprocessor that supports 16 threads that maps onto as many virtual coprocessors as desired? Basically the same circuitry, but can dynamically remap to the problem being solved, as opposed to remapping the problem to the circuits provided.


      (Having the computer model itself to the problem reduces the complexity of programming and will make optimal use of the hardware. Having the program model itself after what the computer is tuned to do is merely an ugly hack and requires ugly compilers to specifically translate between the paradigms.)


      The cell processor is designed around 1980s concepts of load-balancing while keeping to many of the rules of second-generation programming. Technology has moved on. That's not to say the cell is bad. It's a definite improvement over the 1960s concepts used in many modern CPUs. However, it is still 20 years behind the curve. C'mon, guys, this isn't the Space Shuttle, it's a microprocessor. There is no excuse for network and design technology to be so far beyond the best of the best that industrial giants are capable of doing.


      Actually, it's worse than that. Modern multi-processor systems require specially-designed chipsets and become exponentially more expensive as you build them up. Single boards don't usually go beyond 16 processors. In comparison, people built single boards with 1024 Transputers without difficulty, with costs increasing linearly. So, in multi-processor architectures, we can't even match everything that could be done in the 1980s.


      How does this affect those using the Cell? Well, that's simple. It doesn't offer enough of an added advantage and is different enough that coders will have difficulty making good use of it. That means that coders will have to be inefficient OR dedicated to that one chip, which has no guarantee of making any money for them. Coders won't bother, unless there is something out there that will make it a guaranteed success. I'm not seeing this killer demo.

      • "Basically the same circuitry."

        Functionally? Maybe. But considering the 20% yields, would you rather lose 1/8th of the chip, or the whole thing? Also, I imagine managing the cache for that on the fly would be a significantly larger headache than dividing it up in this more consistent way; associative lookup can take up a lot of real estate real quick.

        • Virtually the only thing I have heard from Sir Clive Sinclair that made me stop and think "that is so utterly the way to do it" was when he proposed wafer-scale architecture where, instead of relying on everything working, you design it with the idea that some components will be bad at the start, and others will fail when in use. He did this by proposing that the selection of which element to use support the notion of bad elements which the selection hardware (or software) simply ignores.

          If you did this wit

      • Modern multi-processor systems require specially-designed chipsets and become exponentially more expensive as you build them up.

        Unless, of course, you're using an AMD processor, which has Hypertransport links, and become linearly more expensive as you build them up. Give or take.

        In order to get the best performance out of hammer and HT you have to link the processors more than in a line, but since it's a NUMA system you can simply link them end to end. It will not be an efficient architecture for mos

      • Why not have one coprocessor that supports 16 threads that maps onto as many virtual coprocessors as desired? Basically the same circuitry, but can dynamically remap to the problem being solved, as opposed to remapping the problem to the circuits provided.

        It's not the same thing *at all*. CPUs are highly non-linear. 8 2-way processors are much simpler than 1 16-way processor. CPU structures tend to scale with the square of their width. A front-end capable of issuing 32 instructions per cycle
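To put numbers on the square-of-width rule of thumb from this comment, here is a small arithmetic sketch. The quadratic exponent is taken from the comment as an assumption, not a measured figure, and `relative_cost` is a hypothetical helper that returns unitless relative values.

```python
def relative_cost(width, cores=1, exponent=2):
    """Relative complexity if each core's cost grows as width**exponent."""
    return cores * width ** exponent


eight_narrow = relative_cost(width=2, cores=8)  # 8 * 2**2 = 32
one_wide = relative_cost(width=16, cores=1)     # 1 * 16**2 = 256
ratio = one_wide / eight_narrow                 # 8.0
```

Under that assumption the single 16-way design costs about eight times as much as eight 2-way cores with the same total issue width.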
    • It's not really an "incredibly complex" architecture. It's different, but in practice it's probably less complex than symmetric multithreading. If you're programming specifically for Cell, it should be fairly straightforward to create pieces of code that you can run on the SPEs, while doing control logic on the PPE. What will be difficult will be porting existing code to the new architecture.

      PS) Yes, I am a programmer. I think many discussions of Cell take it for granted that multithreaded prog
    • The Saturn was essentially an uber 2D platform with a 3D chip added at the last minute. The Cell, RSX, and memory type were conceived a long time ago to work together. Neither the Cell nor the graphics chip is a last-minute add-on to compete with a brand-new foe (as the PSX was, with its new 3D capability).

      Also, Sony is hard at work on dev kits which will make programming with the Cell much easier. How well they succeed in making these dev kits will be the primary factor in how programming for the beast goes.
  • This article might not be an exact dupe, but this same information has been posted countless times already. 90% of it is even readable at Cell's wikipedia article [wikipedia.org]. I don't think anything more about Cell is newsworthy until someone actually does something with the processor...
  • by Harry Balls ( 799916 ) * on Friday July 14, 2006 @02:22PM (#15720605)
    ...on the average, one of the slave processors is non-functional.
    Read more about the yield problems of the Cell chip here:
    http://theinquirer.net/default.aspx?article=32978/ [theinquirer.net]

    Fabrication yield is estimated at only 10% to 20%, which is very low for the industry.

    • by Anonymous Coward
      Fabrication yield is estimated at only 10% to 20%

      That's for a completely working package, the cell plus 8 SPEs. Because of the low yield of the "perfect" processors, PS3 will be using the ones with 7 working SPEs, since there are plenty of those. The IBM discussion linked by the inquirer shows that.

      Yield is so low due not only to the complexity but also to the size. If there are an average of 10 defects on a wafer and you can only fit 10 processors on a wafer (these numbers pulled totally out of my ass) then
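The made-up numbers above (about one defect per die) can be run through a simple Poisson yield model to see why shipping parts with 7 of 8 SPEs helps. This is a sketch under loud assumptions: the Poisson model is a common textbook simplification that is not taken from the comment, defects are assumed to land only in the SPE area and to be spread evenly, and the function names are hypothetical.

```python
import math


def spe_good_probability(defects_per_die, spe_count=8):
    """Poisson chance that one SPE has no defect (assumes defects spread evenly over SPEs)."""
    return math.exp(-defects_per_die / spe_count)


def yield_with_spares(defects_per_die, spe_count=8, required=8):
    """Probability that at least `required` of `spe_count` SPEs are defect-free."""
    p = spe_good_probability(defects_per_die, spe_count)
    return sum(
        math.comb(spe_count, good) * p ** good * (1 - p) ** (spe_count - good)
        for good in range(required, spe_count + 1)
    )


perfect = yield_with_spares(1.0, required=8)  # all 8 SPEs must work
salvage = yield_with_spares(1.0, required=7)  # one bad SPE is tolerated
```

With these inputs `perfect` equals exp(-1), and `salvage` is strictly higher, since every die with exactly one bad SPE is now usable.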
    • "Clarification Tom Reeves, IBM's VP of semiconductor and technology services, said he was not making any specific references to past or current Cell yields in an executive insight interview that ran last week. He was, instead, referring to large die yield challenges in general and the successful leverage provided by logic redundancy strategies. IBM does not release product specific yield information. This clarification was made on July 14, 2006."
    • We're talking about the Cell in general, not what Sony decides to ship in the PS3.
    • But how much larger is the CPU than the regular ones?
  • It gives a decent overview of the structure of the cell processor itself, including the CBE, PPE, and SPE units.


    As long as I can't see the PS3 running those incredible games with out-of-this-world AI and physics and all, I won't buy into this whole "Cell FUD"...

    "Emotion Engine", anyone?

  • I'll bet programming the Cell would be so much fun if you were working in a scientific or graphics research lab at a university. It has "wouldn't it be cool if..." written all over it, but I feel sympathy for the developers who will have to make code run on this thing and make deadlines.
