
Next-Gen Processor Unveiled

kdawson posted more than 6 years ago | from the trillions-and-trillions dept.

Supercomputing 183

A bunch of readers sent us word on the prototype for a new general-purpose processor with the potential to reach trillions of calculations per second. TRIPS (obligatory back-formation given in the article) was designed and built by a team at the University of Texas at Austin. The TRIPS chip is a demonstration of a new class of processing architectures called Explicit Data Graph Execution. Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. The article claims that current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.


I want one (4, Insightful)

Normal Dan (1053064) | more than 6 years ago | (#18861013)

But when are they likely to be ready?

I want one (0)

Anonymous Coward | more than 6 years ago | (#18861223)

But when are they likely to be ready?

I want one (0)

Anonymous Coward | more than 6 years ago | (#18862061)

But when are they likely to be ready?

Re:I want one (5, Funny)

ackthpt (218170) | more than 6 years ago | (#18861275)

But when are they likely to be ready?



  • You know they'll be ready when Intel places large orders for aluminium for heatsinks.
  • You know they'll be ready when there's a sudden drop in prices of the current Hot CPUs, which are all proven but suddenly look like last month's pizza from under the couch.
  • You know they'll be ready when AMD hasn't said anything and they are suddenly shipping them, while Intel tells you in 9 mos. then suddenly says 3 mos. (and you can hear the whips cracking through the walls.)
  • You know they'll be ready when Microsoft doesn't have an operating system ready, but there are a dozen Linux distros good to go.

Re:I want one (4, Funny)

mrbluze (1034940) | more than 6 years ago | (#18862741)

Well, I won't buy one until the Super version comes out (STRIPS). Now that's a name that has appeal!

Hm... (4, Insightful)

imsabbel (611519) | more than 6 years ago | (#18861039)

The article contains little more information than the blurb.
But it seems to me that we called this great new invention "vector processors" 15 years ago, and there's a reason they aren't around anymore.
"Many instructions in flight" == "huge pipeline flushes on context switches" + "huge branching penalties", anybody?

Re:Hm... (4, Interesting)

superpulpsicle (533373) | more than 6 years ago | (#18861121)

Come on now. It's a capitalist market. You can't just innovate your way to fame. Just like the list of 5 million other patents that never see the daylight.

Re:Hm... (3, Informative)

Anonymous Coward | more than 6 years ago | (#18861245)

Actually, it is more like the dataflow architectures from the 70s. Vector processors are a totally different kind of thing (SIMD).

The idea is simple, instead of discovering instruction level parallelism by checking the dependencies and anti-dependencies by global names (registers), define the dependencies directly by relating to instructions themselves.

> "Many instructions in flight" == "huge pipeline flushes on context switches" + "huge branching penalties", anybody?

That equality does not hold. It is wide parallel execution, not super-pipelining, ergo no huge branching penalties.
Also, the architecture is more likely exploiting the wide execution units by predicating both branches and computing them both.

Re:Hm... (1)

MindStalker (22827) | more than 6 years ago | (#18861487)

Also, with the move toward multi-core there is the potential that special tasks could stay on one processor for their entire lifetime. That would renew the case for looking at vector processing for main CPU usage.

actually... (1)

slew (2918) | more than 6 years ago | (#18862075)

You can read more about it here [utexas.edu]...

Actually, from what I can tell it's more like a VLIW with its program chopped up into horizontal and vertical microcode "chunks" for more efficient register forwarding than a vector processor...

I figure that it chops up the code into 128-instruction chunks (or smaller if there are branch dependencies that can't be handled with predicates) and schedules it horizontally (the classic wide VLIW microcode which feeds independent instruction pipelines) and vertically (the sequence that can distribute over time and use register forwarding paths). The pipelines seem to be loosely coupled through reservation stations, and the forwarding is done with low-bandwidth wormhole routes, so it isn't as rigid as a classic VLIW machine.

I doubt it does that much better with normal scalar code (which has lots of branches), but it probably is much better than a vector processor would be with irregular code.

Re:Hm... (1)

naoursla (99850) | more than 6 years ago | (#18862591)

I only have a passing familiarity with TRIPS but I think that one of the goals was to get rid of the huge costs for pipeline flushes.

Doug Burger [utexas.edu] is one of the main PI's on this project (which is around seven years old at this point). I'm sure you can find more information there if you are interested.

Re:Hm... (1)

flaming-opus (8186) | more than 6 years ago | (#18862625)

I'd say it looks more like VLIW or EPIC. Instructions grouped into blocks with no looping or data dependence, running in parallel. It looks like a 16-wide Itanium, with all the same problems actually generating code that will run on it very well.

Next-Gen Business Model Unveiled (3, Insightful)

Anonymous Coward | more than 6 years ago | (#18861043)

1. Copy some university press release to your blog
2. Make sure google ads show up at the top of the page
3. Submit blog to slashdot
4. Profit

Re:Next-Gen Business Model Unveiled (0)

Anonymous Coward | more than 6 years ago | (#18861131)

good idea, I'm putting that on my blog

Re:Next-Gen Business Model Unveiled (1)

richdun (672214) | more than 6 years ago | (#18861173)

Actually, to be next-gen, you need to add flashy graphics or random Ajax to your blog as well, since that is already the current-gen business model.

Re:Next-Gen Business Model Unveiled (0)

Anonymous Coward | more than 6 years ago | (#18861479)

i thought there were only 3 steps to profit :| .. or was that pre-web 2.0 ?

Vista Capable? (-1, Redundant)

Anonymous Coward | more than 6 years ago | (#18861071)

What I want to know is, will it run vista? And does anyone even care anymore?

Marketing hype? (5, Informative)

faragon (789704) | more than 6 years ago | (#18861085)

Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. Current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.

Is it me, or are they euphemistically paraphrasing out-of-order execution [wikipedia.org]?

Re:Marketing hype? (4, Informative)

Aadain2001 (684036) | more than 6 years ago | (#18861459)

Based on the article, "TRIPS" is nothing more than an Out-of-Order (OOO) superscalar processor. So unless the article is grossly simplifying (possible), this is nothing but a PR stunt. And based on the quote from one of the professors about building it on "nanoscale" technology (um, we've been doing that for years now), my vote is pure PR BS.

And as an aside, the reason modern CPUs are designed to "only" issue 4 instructions per cycle instead of 16 is that after years of careful research and testing on real-world applications, 4 instructions is almost always the maximum number any program can concurrently issue, due to issues like branches, cache misses, data dependencies, etc. Makes me question just how much these "professors" really know.

Re:Marketing hype? (3, Interesting)

Doches (761288) | more than 6 years ago | (#18861713)

Branches are no problem for TRIPS -- in the EDGE architecture, both execution paths resulting from a branch are computed, unlike in classic architectures where the processor blocks (8086), skips ahead a single instruction before blocking (MIPS), or chooses a path using a branch predictor and executing it, possibly only to discard all instructions issued since the branch, if the predictor turns out wrong. EDGE architectures still lag on cache misses (or any memory hit) -- but that's fundamentally a problem with memory, not with the processor. Don't read the article, read the UT pdf.

Re:Marketing hype? (1)

Randolpho (628485) | more than 6 years ago | (#18862003)

Hmmmm... sounds almost exactly like IA64, something Intel has had since the turn of the century.

Re:Marketing hype? (1)

renoX (11677) | more than 6 years ago | (#18862877)

No. In IA64, as in VLIW generally, the instructions are scheduled by the compiler (which works well on very regular code and poorly everywhere else), whereas in TRIPS the resources of each execution unit are used dynamically.

From the paper "Scaling to the End of Silicon with EDGE Architectures", though, TRIPS ISAs are hardware-dependent, which means that you'd have to recompile your applications each time you use a new CPU. If I understood correctly, this is a significant problem (that and the memory wall).

Re:Marketing hype? (0)

Anonymous Coward | more than 6 years ago | (#18862667)

I took Burger's computer architecture class back in the day (read: 4 years ago) and I discussed the TRIPS processor as well as MRAM with him extensively. The guy knows what he's talking about. Now as to whether or not the TRIPS will be some big revolution is for the market to decide. There are some really good ideas coming from some really smart people over there. Sadly, that doesn't necessarily mean that the good ideas will come to fruition.

Re:Marketing hype? (1)

arktemplar (1060050) | more than 6 years ago | (#18862737)

It possibly is market hype. Space on silicon is extremely costly; by doing something like pipelining and parallelism they would be wasting precious silicon real estate. I think that the Cell processor (and no, I am not an IBM fanboi) is closer to what you should expect even from the GPUs of tomorrow. Personally, I think (hope, even though I am a VLSI designer) that silicon will be obsolete in a couple of decades or so.

Re:Marketing hype? (1)

baggins2001 (697667) | more than 6 years ago | (#18862983)

And as an aside, the reason modern CPUs are designed to "only" issue 4 instructions per cycle instead of 16 is that after years of careful research and testing on real-world applications, 4 instructions is almost always the maximum number any program can concurrently issue, due to issues like branches, cache misses, data dependencies, etc. Makes me question just how much these "professors" really know.
This processor was designed for parallel processing. Its intent is to be used in a large supercomputer that they are building in Austin, so the programs will be designed to take advantage of the CPU's capabilities. This is not a chip to replace a Pentium or AMD processor running MS Word. This is a chip that will be used in a huge parallel computer system, replacing the AMD or Intel quad-core chips they are currently waiting on.

They got the funding by saying it could be used to design game plans that would defeat Oklahoma into the next century. Unless, of course, Oklahoma gets a really big supercomputer.

Re:Marketing hype? (1)

Wesley Felter (138342) | more than 6 years ago | (#18862189)

Yes, TRIPS is an out-of-order superscalar processor. But it's bigger and better: by eliminating centralized structures, a TRIPS core can issue more instructions per cycle out of a bigger instruction window. It's not just more of the same; it's a qualitative improvement that allows much bigger (and thus higher-performance) cores to be built, yet with lower power and design costs.

Re:Marketing hype? (1)

faragon (789704) | more than 6 years ago | (#18862395)

Each of the two processor cores can execute up to 16 out-of-order operations [utexas.edu]
vs
Current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle. [scienceblog.com]

They are comparing apples and oranges (!): you cannot compare 16 OoO-executed instructions per cycle with 4 *retired in-order* instructions per cycle (where achieving those 4 instructions/cycle may have required executing 10, 20, or more OoO instructions (!)). Please, where is the rigor? Fair play, anyone?

Re:Marketing hype? (1)

ghoul (157158) | more than 6 years ago | (#18862691)

Actually, TRIPS is somewhere between OOO and VLIW (Itanium). The explicit dataflow information embedded in the instructions enables much larger instruction windows than are possible in traditional OOO. As instructions are not just loaded into the window in a dumb manner, it reduces the chances and costs of pipeline flushes.

TRIPS (1)

HTH NE1 (675604) | more than 6 years ago | (#18861093)

TRIPS (obligatory back-formation given in the article)
Is that to make people RTFA (Read The F[ine] Article), or because "Tera-op, Reliable, Intelligently adaptive Processing System" was 13 more characters than the submitter wanted to copy and paste?

One trillion calculations per second by 2012 (3, Informative)

xocp (575023) | more than 6 years ago | (#18861101)

A link to the U of Texas project website can be found here [utexas.edu].

Key Innovations:

Explicit Data Graph Execution (EDGE) instruction set architecture
Scalable and distributed processor core composed of replicated heterogeneous tiles
Non-uniform cache architecture and implementation
On-chip networks for operands and data traffic
Configurable on-chip memory system with capability to shift storage between cache and physical memory
Composable processors constructed by aggregating homogeneous processor tiles
Compiler algorithms and an implementation that create atomically executable blocks of code
Spatial instruction scheduling algorithms and implementation
TRIPS Hardware and Software

Re:One trillion calculations per second by 2012 (3, Informative)

xocp (575023) | more than 6 years ago | (#18861289)

DARPA is the primary sponsor...
Check out this writeup at HPC wire [hpcwire.com].

A major design goal of the TRIPS architecture is to support "polymorphism," that is, the capability to provide high-performance execution for many different application domains. Polymorphism is one of the main capabilities sought by DARPA, TRIPS' principal sponsor. The objective is to enable a single processor to perform as if it were a heterogeneous set of special-purpose processors. The advantages of this approach, in terms of scalability and simplicity of design, are obvious.

To implement polymorphism, the TRIPS architecture employs three levels of concurrency: instruction-level, thread-level and data-level parallelism (ILP, TLP, and DLP, respectively). At run-time, the grid of execution nodes can be dynamically reconfigured so that the hardware can obtain the best performance based on the type of concurrency inherent to the application. In this way, the TRIPS architecture can adapt to a broad range of application types, including desktop, signal processing, graphics, server, scientific and embedded.

obvious drm implications.. (1)

plasmacutter (901737) | more than 6 years ago | (#18861539)

With individual instructions no longer spitting out to registers and, to quote you, "compiler algorithms and an implementation that create atomically executable blocks of code", does this not mean they can finally hide keys from us in the die of a general-purpose processor?

Re:obvious drm implications.. (1)

dreamchaser (49529) | more than 6 years ago | (#18861699)

I doubt that DARPA cares much about DRM, and if AMD and Intel wanted to, they could have already hidden encryption keys on their CPUs.

Re:obvious drm implications.. (1)

plasmacutter (901737) | more than 6 years ago | (#18861999)

and if AMD and Intel wanted to, they could have already hidden encryption keys on their CPUs.

Not true; otherwise they would not be general-purpose, because they would not run every piece of x86 software thrown at them.

With the current architecture the key has to be in plaintext in one of the registers, which can then be dumped.

In this proposed architecture it can be passed to the next instruction through huge contiguous blocks of code without touching a register.

This also brings up the related issue of debugging on such chips.

Re:obvious drm implications.. (1)

dreamchaser (49529) | more than 6 years ago | (#18862205)

I disagree. They could quite easily add extensions to the microcode and architecture that would allow harder-to-break DRM (there is no such thing as impossible to break). The keys would leak due to human nature, though, so it would be a futile effort.

Re:obvious drm implications.. (0)

Anonymous Coward | more than 6 years ago | (#18862853)

You think existing keys fit in a single machine register? AACS is already designed to never keep the entire key in plaintext. Of course sometimes the hardware, like the Xbox's HD-DVD player, helpfully adds a debug mode for that purpose, but it's ultimately not necessary to crack the key.

Moving the whole shebang, player key, decoder and all into one chunk of silicon is probably the dream of the content companies, but that idea's a non-starter for everyone else involved.

Re:One trillion calculations per second by 2012 (1)

faragon (789704) | more than 6 years ago | (#18861947)

Explicit Data Graph Execution (EDGE) instruction set architecture [wikipedia.org]?
Scalable and distributed processor core composed of replicated heterogeneous tiles [wikipedia.org]?
Non-uniform cache architecture and implementation [wikipedia.org]?
...

Well, very disappointing when compared to other [wikipedia.org] modern [wikipedia.org] microprocessor [wikipedia.org] architectures [wikipedia.org]. Don't get me wrong, I love computer architecture, and the design seems interesting, but the over-hype is discouraging.

Re:One trillion calculations per second by 2012 (1)

maxwell demon (590494) | more than 6 years ago | (#18862587)

Explicit Data Graph Execution (EDGE) instruction set architecture?

Exactly. As is explicitly stated in the PDF [utexas.edu] linked from this [slashdot.org] comment by volsung, TRIPS is an implementation of EDGE.

Must...resist...obvious...joke (2, Funny)

jimicus (737525) | more than 6 years ago | (#18861129)

Imagine a beowulf cluster of these!

Re:Must...resist...obvious...joke (1)

ookabooka (731013) | more than 6 years ago | (#18861989)

Imagine a beowulf cluster of these!

Care for a game of chess? Nothing drives innovation in processing power more than a good game of chess :-D

Re:Must...resist...obvious...joke (1)

RedElf (249078) | more than 6 years ago | (#18862273)

What!?!? You mean you don't already have a beowulf cluster of these?

Useless! (1)

iminplaya (723125) | more than 6 years ago | (#18861143)

Here [utexas.edu], from the horse's mouth. Plus, we don't have to keep that damn digg thing. Come on, guys. A little less fluff, please.

Ix86 (1)

nurb432 (527695) | more than 6 years ago | (#18861175)

So I assume it's software-compatible with 90% of the code the 'general public' uses?

It did say 'general purpose', and if you try to create something better but different, you get slapped down eventually (like the PowerPC Apples).

Re:Ix86 (0)

Anonymous Coward | more than 6 years ago | (#18861373)

Except that it wasn't better. Other than that you are spot on!

Re:Ix86 (4, Insightful)

convolvatron (176505) | more than 6 years ago | (#18861409)

You are absolutely right. No one should ever do any research into something which doesn't ultimately look like an x86.

Re:Ix86 (1)

jarom (899827) | more than 6 years ago | (#18862531)

Eh, just slap on a few million transistors to make the ISA translation, and you are good to go.

Re:Ix86 (1)

ghoul (157158) | more than 6 years ago | (#18862665)

I took Prof. Burger's course and we studied the TRIPS processor. There are two parts to the project: one is the chip team and the other is the software team led by Dr. Katherine McKinley. The software team has developed emulators so that current code can run on the TRIPS processor. Of course emulation is never as good as native execution, but it does provide an upgrade path. The key thing to notice is that the upgrade path has been part of the thinking from the beginning.

TRIP UP (0)

Anonymous Coward | more than 6 years ago | (#18861189)

How about the TRIP UP processor, that needs a new set of feet. Or, how about the DRIPS processor that needs a paper towel to blot it dry. Or, how about, the NIPS processor, that needs nip/tuck. Or, how about the SHITS processor, that needs toilet paper. Little on the details, much on the promises is usually a bad sign.

Gets rid of the register-file (5, Insightful)

DrDitto (962751) | more than 6 years ago | (#18861221)

The EDGE architecture gets rid of relying on a single register file to communicate results between instructions. Instead, a producer-consumer ISA directly sends results to one of 128 instructions in a superblock (sort of like a basic block, but larger). In this way, hopefully more instruction-level parallelism can be extracted because superscalars can't really go beyond 4-wide (8-wide is a stretch...DEC was attempting this before Alpha was killed). Nice concept, but it doesn't solve many pressing problems in computer architecture, namely the memory wall and parallel programmability.

Re:Gets rid of the register-file (1)

bishiraver (707931) | more than 6 years ago | (#18862415)

That's because they designed it in 2004, built the prototype in 2005, and some blogspam idiot is publicizing it in 2007.

Re:Gets rid of the register-file (1)

flaming-opus (8186) | more than 6 years ago | (#18862523)

Well, it gets rid of the ISA-visible register file. That doesn't mean there aren't SRAM cells in there holding onto data. Don't confuse architecture and implementation.

This is cool! (3, Informative)

adubey (82183) | more than 6 years ago | (#18861261)

The link has NO information.

The PDF here [utexas.edu] has more information about EDGE.

The basic idea is that CISC/RISC architectures rely on storing intermediate data in registers (or in main memory, on old-skool CISC). EDGE bypasses registers: the output of one instruction is fed directly to the input of the next, with no need to do register allocation while compiling. I'm still reading the PDF, but this sounds like a really neat idea.

The only question is, will this be so much better than existing ISAs that it eventually replaces them, even if only for specific applications like high-performance computing?

Re:This is cool! (2, Insightful)

treeves (963993) | more than 6 years ago | (#18861519)

If it's so cool, why did it take three years for us to hear about it? I'm really asking, not just trolling.

Re:This is cool! (1)

Angstroem (692547) | more than 6 years ago | (#18861585)

Because you read the wrong publications. Try the IEEE and ACM digital libraries.

Doug Burger's work has been known to computer scientists for years...

Re:This is cool! (1)

treeves (963993) | more than 6 years ago | (#18862091)

That's what I mean. The link given by the poster I replied to was to an article from IEEE Computer from Jul 2004. I'm not a computer scientist so I don't regularly read those journals. But the question is, is it really "news for nerds" if it's three years old?

Re:This is cool! (1)

treeves (963993) | more than 6 years ago | (#18862147)

Sorry for replying to my own post, but I guess the answer is that they devised the architecture three years ago, but just now have the actual thing in silicon. I should read more carefully or not rely on short-term memory of 30 seconds ago!

Re:This is cool! (1)

ghoul (157158) | more than 6 years ago | (#18862719)

It's been in development for a while. Papers were published 3 years back, but the actual working prototype only came out last year, and after testing and debugging it was presented publicly this month. Just like AMD has been talking about Greyhound for 3 years, but it won't release until June this year.

Re:This is cool! (1)

sofla (969715) | more than 6 years ago | (#18862339)

The basic idea is that CISC/RISC architectures rely on storing intermediate data in registers (or in main memory on old skool CISC). EDGE bypasses registers: the output of one instruction is fed directly to the input of the next. No need to do register allocation while compiling. I'm still reading the PDF, this sounds like a really neat idea, though.

What I liked even more was the idea of "execution blocks", where a given (processor pipeline width) worth of instructions are treated as an atomic "transaction"... clearly this is a nod to minimize the branch prediction penalty. I haven't been keeping up with hardware arch enough to know if this is a new idea or not, but it was new to me.

The only question is, will this be so much better than existing ISA's to eventually replace them? -- even if only for specific applications like high-performance computing.

I think it may be a contender, or at least gain some popularity (like MIPS did in the 90's), if nothing else because the big limitation in the EDGE design appears to be in the ability of real programs to exploit parallelism. This is exactly the same problem we're facing with the latest batch of multi-core processors from Intel: all those processing units are nice, but it's a b**ch to keep them all busy.

Moore's law immortal? (3, Interesting)

Manos_Of_Fate (1092793) | more than 6 years ago | (#18861291)

It seems like for every "realist" claiming that Moore's law will soon hit a ceiling, I see another "ZOMG Breakthrough!" story. Lately, the question I've been asking myself is, "Will we ever surpass it?"

Re:Moore's law immortal? (1)

mandelbr0t (1015855) | more than 6 years ago | (#18862169)

Will we ever surpass it?
Doubtful, unless we see a new hardware player burst onto the scene. AMD made quite a splash, but they certainly never had the potential, nor do they today, to outpace Moore's Law. Intel still drives the hardware market.

Re:Moore's law immortal? (1)

maxwell demon (590494) | more than 6 years ago | (#18862429)

Moore's Law is about the transistor density on the chip. A new processor design may help get more performance from the same transistor density, but it certainly does nothing to increase the transistor density.

Since there's a finite atom density on a chip, the transistor density will inevitably stop growing eventually.

but... (4, Insightful)

Klintus Fang (988910) | more than 6 years ago | (#18861353)

The motivations for this technology provided in the article ignore some rather basic facts.

They point out that current multi-core architectures put a huge burden on the software developer. This is true, but their claim that this technology will relieve that burden is dubious. They mention, for example, that current processing cores can typically only perform 4 simultaneous operations per core, and imply that this is some kind of weakness. They completely fail to mention that the vast majority of applications running on those processors don't even use the 4 available scheduling resources in each core. In other words, the number of applications that would benefit from being able to execute more than 4 simultaneous instructions in the same core is vanishingly small. This is why most current processors have stopped at 3 or 4: not because no one has thought of pushing beyond that, but because it is expensive and yields very little return on the investment. Very few real-world users would see any performance benefit if the current cores on the market were any wider than 3 or 4. Most of those users aren't even using the 4 that are currently available.

Certainly the ability to do 1,024 operations simultaneously in a single core is impressive. But it is not an ability that magically solves any of the current bottlenecks in multi-threaded software design. Most software developers have difficulty figuring out what to do with multiple cores. Those same developers would have just as much (if not more) difficulty figuring out what to do with the extra resources in a core that can execute 1,024 simultaneous operations.

Re:but... (3, Informative)

$RANDOMLUSER (804576) | more than 6 years ago | (#18861667)

Two words: loop unwinding [wikipedia.org]. This critter is perfect to run all iterations of (certain) loops in parallel, which would be determinable at compile time.

Re:but... (3, Interesting)

Klintus Fang (988910) | more than 6 years ago | (#18861961)

Of course loop unwinding works fine... when you have a long loop. It does, though, have two problems: 1) it only works when you have very long loops with very few dependencies between consecutive iterations; 2) even when it does work, it makes the code footprint of the application much bigger, which means you end up putting a lot more stress on your cache pipeline, requiring bigger caches and a wider fetch engine. And that all aside, what about the vast majority of code segments where massively parallelizable loops are not being executed? Loop unwinding isn't going to help at all for those.

Re:but... (1)

BillyBlaze (746775) | more than 6 years ago | (#18861719)

The benefit per simultaneous operations is not necessarily monotonically decreasing.

Consider a loop with a medium-sized body, and iterations mostly independent. If there are enough simultaneous operations allowed to schedule multiple iterations through the loop at once, the loop could potentially run that many times faster. Now, with current designs, there aren't that many slots, and even if there were, the ISA makes it difficult to express this in a way that's useful to the processor. All we can do is OpenMP-like stuff where the programmer explicitly tells the runtime system to try to divide the loop between multiple threads. There's a lot of overhead, both in terms of context switching and programmer time.

If, however, a different paradigm for an ISA can make it easier for compilers to communicate the dependencies to the processors, then the processors will be able to take advantage of that parallelism much more.

Re:but... (1)

Klintus Fang (988910) | more than 6 years ago | (#18862027)

That is exactly what Itanium's ISA does... Itanium is designed around the idea that the compiler knows best and provides a lot of tools to the compilers to enable the types of things you are talking about. But compilers either are not making use of it (or are unable to). It's not clear which.

Re:Better support for concurrency in Languages (2, Insightful)

Anonymous Coward | more than 6 years ago | (#18861773)

A lot of this is due to the fact that most popular languages right now do not support concurrency very well. Most common languages are stateful, and state and concurrency are rather antithetical to one another. The solution is to gradually evolve toward languages that solve this either by forsaking state (Haskell, Erlang) or by using something like transactional memory to encapsulate state in a way that is easy to deal with (Haskell's STM, Fortress (I think), maybe some others).

Concurrency is not that hard to do well in the right setting.

And before anyone claims that Haskell and Erlang are impractical, there are many examples of "real world" programs written in them.

A few nice, and very useful ones are Yaws [hyber.org] (for erlang) and Darcs [abridgegame.org] (for Haskell). There are many others (even quake clones), which I won't bother listing, but people can find them easily if they look.

Regarding concurrency, and its ease of use in these languages: I'm taking a machine learning class at the moment where most of the problems are computationally intensive and could stand to be improved by making use of multiple cores. I do all of my assignments in Haskell, and not only are my solutions often shorter than those of my classmates (and they often work fine the first time they compile), but it's usually trivial to let my application scale nicely to as many cores as I can throw at it. It's worth mentioning, by the way, that most algorithms in these classes are presented under the assumption that people are using imperative languages, and even then, it's still easy.

It takes a while to learn how to approach problems differently without mutable state, yes, but it's not as hard as some people make it out to be. I think it has more to do with the fact that people just don't like to learn anything new unless they are absolutely forced to, which is a pity.
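For readers who don't know Haskell, the state-free idea described above can be sketched even in an imperative language. Below is a minimal, hypothetical Python illustration (not Haskell's parMap, just the same principle): a pure function is mapped over inputs with a thread pool and no locks, because nothing mutates shared state, so the parallel result is identical to the sequential one:

```python
from concurrent.futures import ThreadPoolExecutor

def score(x):
    # Pure function: the output depends only on the input,
    # so evaluations can run in any order, on any worker.
    return x * x + 1

def parallel_map(f, xs, workers=4):
    # No locks needed: there is no shared mutable state to protect.
    # pool.map preserves the input order in its results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, xs))
```

Because `score` is pure, `parallel_map(score, xs)` always equals `[score(x) for x in xs]` regardless of scheduling.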

By the way, there is a nice presentation [mit.edu] from Tim Sweeney on what he would like future programming languages to look like, and there's a lot in there about functional programming, concurrency, and expressive (read: dependent) types.

Re:but... (1)

beef623 (998368) | more than 6 years ago | (#18861841)

Know of any good programming resources that show how to take advantage of those extra resources?

TRIPS may solve some problems (2, Informative)

knowsalot (810875) | more than 6 years ago | (#18862693)

The big thing that all the commenters I've read so far have missed is that OOO execution is difficult not because it's hard to put many ALUs on a chip (vector design, anyone?) but because in a general-purpose processor the register file and routing complexity grow as N^2 in the number of units. That's bad. Every unit has to communicate with every other unit (via the register file or, more commonly, via bypasses to an OOO buffer for every stage prior to writeback).

The issue being addressed here is wiring complexity, which, as modern designers will tell you, is a much harder problem than designing fast logic. Routing is hard. Plunking down more ALUs is easy. If you eliminate the register file, and design your processor and ISA to feed instructions in a data-flow manner to thousands of ALUs, then you may be able to vastly simplify the routing requirements, thereby decreasing the length of your critical-path electrical circuits, thereby allowing the processor to clock faster. (Data-flow execution means executing instructions when their data inputs are ready, rather than following the compiler-optimized order, which does not have the run-time information that the hardware has.)

If you are clever about your compiler, and make your hardware wide enough, you can for example speculatively execute both sides of a branch until it is resolved, eliminating a certain percentage of pipeline stalls on branch mispredicts. Similarly, with data prediction you can speculate during cache misses. The list goes on. This is a very new and different paradigm (ugly word) for CPUs which may lead to higher IPC. It isn't a single golden goose, but it's a very different way of looking at the problem of pushing more instructions through a processor at higher speeds.
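To illustrate the data-flow idea (this is a toy sketch, not the actual TRIPS microarchitecture; the program encoding and names are invented), here is a minimal Python scheduler where each instruction fires as soon as its operands are available and forwards its result directly to its consumers, with no register file in between:

```python
def dataflow_run(instrs, inputs):
    # instrs: name -> (operation, tuple of operand names)
    # inputs: name -> initial value (e.g. block inputs)
    # An instruction fires when all of its operands have been produced,
    # regardless of the order it appears in the program.
    values = dict(inputs)
    pending = dict(instrs)
    order = []                      # record the firing order
    while pending:
        fired = False
        for name, (op, args) in list(pending.items()):
            if all(a in values for a in args):          # operands ready?
                values[name] = op(*(values[a] for a in args))
                order.append(name)                      # result forwarded
                del pending[name]
                fired = True
        if not fired:
            raise ValueError("deadlock: unsatisfiable dependencies")
    return values, order
```

A short usage example: a program where `t2` consumes `t1`'s result directly, so `t1` necessarily fires first, while the independent `t3` can fire whenever its inputs arrive:

```python
prog = {
    "t1": (lambda x, y: x + y, ("a", "b")),
    "t2": (lambda x, y: x * y, ("t1", "c")),
    "t3": (lambda x, y: x - y, ("a", "c")),
}
vals, order = dataflow_run(prog, {"a": 2, "b": 3, "c": 4})
```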

Re:TRIPS may solve some problems (3, Interesting)

knowsalot (810875) | more than 6 years ago | (#18862795)

Oh, and before someone points this out for me, you have to imagine that the routing requirements are VASTLY improved. Imagine a grid of ALU's each connected by a single bus, (simple,) rather than 128 bypass busses all multiplexed in to each ALU. (chaos! don't forget the MUX logic!) You map one instruction to one (virtual) ALU, rather than one result to a (virtual) register. Then you pipeline/march each instruction with its partial data down the grid until all the inputs come in. Instructions continually cascade in the top of the grid, and commit out the bottom. But their results are available to feed other instructions as soon as they are computed! Never have to wait for a MUX or a bus or what-have-you. Plus, you can clock the whole thing EXTREMELY fast, because you don't have these wire-delays from difficult routing requirements.

nothing spectacular (4, Informative)

CBravo (35450) | more than 6 years ago | (#18861557)

Right, let me begin by saying that after reading ftp://ftp.cs.utexas.edu/pub/dburger/papers/IEEECOMPUTER04_trips.pdf [utexas.edu] it became a bit clearer what they are actually talking about.

It might sound very novel if you are only accustomed to normal processors. Look at MOVE http://www.everything2.com/index.pl?node_id=1032288&lastnode_id=0 [everything2.com] to see what transport-triggered architectures are about. They are more power efficient, etc. etc.

Secondly, they talk about how execution graphs are mapped onto their processing grid. I don't think any scheduler has a problem with scheduling an execution graph (or whatever name you give it) onto an architecture. Generally, it can be scheduled in time (there is a critical path somewhere), or it is scheduled with a certain degree of optimality (generally > 0.9 efficient). I don't see the efficiency gain there.

Now here comes the shameless self-plug. If you want to gain efficiency when scheduling the nodes of an execution graph, you have to know which nodes are more critical than others. The critical nodes (the ones on the critical path) need to be scheduled to the fast/optimized processing units, and the others can be scheduled to slow/efficient processing units (they can tolerate some communication delay without penalty). See my thesis here: http://ce.et.tudelft.nl/publicationfiles/786_11_dhofstee_v1.0_18july2003_eindverslag.pdf [tudelft.nl]
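The criticality idea described above boils down to a longest-path computation over the operation graph. Here is a minimal Python sketch (with hypothetical latencies, not taken from the thesis) that computes, for each node, the length of the longest dependency chain starting at it; a scheduler could then steer the highest-criticality nodes to the fast units:

```python
def critical_path(graph, latency):
    # graph: node -> list of successor (dependent) nodes
    # latency: node -> cycles the node's operation takes
    # Returns, per node, the total latency of the longest chain
    # starting at that node; the maximum identifies the critical path.
    memo = {}

    def longest(n):
        if n not in memo:
            memo[n] = latency[n] + max(
                (longest(s) for s in graph.get(n, [])), default=0)
        return memo[n]

    return {n: longest(n) for n in latency}
```

Usage on a small diamond-shaped graph: with `{"a": ["b", "c"], "b": ["d"], "c": ["d"]}` and latencies `{"a": 1, "b": 3, "c": 1, "d": 1}`, node `a` scores 5 via the chain a-b-d, so a and b belong on the fast units while c can take the slow one.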

Welcome to 1994 (0)

Anonymous Coward | more than 6 years ago | (#18861849)

...and the first round of lab testing for EPIC. If they keep this up, eventually they can independently invent Itanium1 (yuk).

I love how they skipped EPIC in their comparison section in the pdf.

Re:Welcome to 1994 (3, Informative)

Wesley Felter (138342) | more than 6 years ago | (#18862321)

EPIC (i.e. Itanium) is still based on centralized structures like register files. To create a 16-issue EPIC processor, you'd need a ~32R/16W port register file which would be virtually impossible to build because it would be so huge and power-hungry. Also, EPIC needs heroic compiler optimizations to overcome its in-order execution, while EDGE is naturally out-of-order.

The Compiler Is Key (1)

A440Hz (1054614) | more than 6 years ago | (#18862037)

I've worked in detail with a VLIW (Very Long Instruction Word) architecture, the TI 'C6x DSP. It has eight execution units (not all of which can perform the same operations, though there is a little overlap) which can all be active in a single cycle. However, the key is keeping all of the units busy.

While the C compiler for this architecture is incredibly good, there are situations where using raw assembly (quite hard because of pipelining issues) or "compiled assembly" (easier, since you write in the order you wish operations to occur, and the compiler schedules the pipeline for you) gives better performance.

In short, no matter how much hardware folks can throw at a computing problem, the issue is adapting lots of different kinds of software to the architecture. Sounds like the compiler is going to have to be very good, or else there will have to be some runtime mojo to keep all of the chip doing something useful.

failzorss! (-1, Flamebait)

Anonymous Coward | more than 6 years ago | (#18862059)

arseholes at Waln0t polite to bring with THOUSANDS of

This is just an update from year ago... (4, Informative)

coldmist (154493) | more than 6 years ago | (#18862577)

Here is the slashdot article from 2003 about this processor: link [slashdot.org]

The specs have been updated to 1,024 instructions in flight from 512, but that's about it.

Another 3-5 years out?

Don't dismiss it (5, Informative)

er824 (307205) | more than 6 years ago | (#18862757)

I apologize if I butcher some of the details, but I highly recommend that anyone interested peruse the TRIPS website.

http://www.cs.utexas.edu/~trips/ [utexas.edu]

They have several papers available that explain the rationale for the architecture.

The designers of this architecture believed that conventional architectures were going to run into physical limitations that would prevent them from scaling further. One of the issues they foresaw was that, as feature size continued to shrink and die size continued to increase, chips would become susceptible to, and ultimately constrained by, wire delay: the amount of time it takes to send a signal from one part of a chip to another would limit the achievable performance. To some extent, the shift in focus to multi-core CPUs validates some of their beliefs.

To address the wire-delay problem, the architecture limits the length of signal paths through the CPU by having instructions send their results directly to their dependent instructions instead of going through intermediate architectural registers. TRIPS is similar to VLIW in that many small instructions are grouped into larger instructions (blocks) by the compiler. However, it differs in how the operations within a block are scheduled.

TRIPS does not depend on the compiler to schedule the operations making up a block the way a VLIW architecture does. Instead, the TRIPS compiler maps the individual operations making up a large TRIPS instruction block onto a grid of execution units. Each execution unit in the grid has several reservation stations, effectively forming a three-dimensional execution substrate.

By having the compiler assign data-dependent instructions to execution units that are physically close to one another, the communication overhead on the chip can be reduced. Each operation waits for its operands to arrive at its assigned execution unit; once all of an operation's dependencies are available, the operation fires and its result is forwarded to any waiting instructions. In this way, the operations making up a TRIPS block are dynamically scheduled according to the data flow of the block, and the amount of communication that has to occur across large distances is limited. Once an entire block has executed, it can be retired and its results can be written to a register or memory.

At the block level, a TRIPS processor can still function much like a conventional processor: blocks can be executed out of order, speculatively, or in parallel. The designers also describe TRIPS as a polymorphous architecture, meaning the configuration and execution dynamics can be changed to best exploit the available parallelism. If code is highly parallelizable, it might make sense to map bigger blocks. Either way, by performing these types of operations at the level of a block instead of for each individual instruction, the overhead is, in theory, drastically reduced.

There is some flexibility in how the hardware can be utilized. For some types of software with a high degree of parallelism you may want very large blocks; when there is less data-level parallelism available, it may be better to schedule multiple blocks onto the substrate simultaneously. I'm not sure how the prototype is implemented, but the designers have several papers where they discuss how a TRIPS-style architecture can be adapted to perform well on a wide gamut of software.

new innovations (0)

Anonymous Coward | more than 6 years ago | (#18862925)

whatever happened to the laser cpu developed in iran years ago?