
Startup Combines CPU and DRAM

samzenpus posted more than 2 years ago | from the two-in-one dept.


MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM on a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple: not counting the cache, TOMI is a 22,000-transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted CPU cores directly into a DRAM design. A TOMI Borealis IC connects eight TOMI cores to a 1Gbit DRAM, with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."


but... (5, Funny)

Anonymous Coward | more than 2 years ago | (#38789505)

does it run GNU/Linux?


Re:but... (5, Funny)

robthebloke (1308483) | more than 2 years ago | (#38789701)

Multiple low-power, semi-useless processor cores? Sounds like Sony has just found the silicon to power the PlayStation 4! :p

Don't count this out yet (5, Interesting)

fyngyrz (762201) | more than 2 years ago | (#38789883)

Useless? My key question would be: does it have decent-speed integer multiply, and perhaps even divide, instructions? A whole heck of a lot can be achieved if you have, say, the basic instruction set of a 6809, but fast and wide (and it didn't even have a divide... so we built multiply-by-reciprocal macros to substitute; that works too.)
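(For example - a rough C sketch of the reciprocal trick, here dividing by 10 with one multiply and one shift. The names and the range bound are mine, not anything from TFA:)

#include <stdint.h>

/* x / 10 without a divide instruction: multiply by the scaled
   reciprocal ceil(2^19 / 10) = 52429, then shift the scale back out.
   Exact at least for 16-bit x. */
static inline uint32_t div10(uint32_t x)
{
    return (x * 52429u) >> 19;
}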

I know everyone's used to having FP right at hand, but I'm telling you, fast integer code and table tricks can cover a lot more bases than one might initially think. A lot of my high performance stuff -- which is primarily image processing and software defined radio -- is currently limited considerably more by how fast I can move data in and out of main memory than it is by actually needing FP operations. On a dual 4-core machine, I can saturate the memory bus without half trying with code that would otherwise be considerably more efficient, if it could actually get to the memory when it needs to.

Another thing... when you're coding in C, for instance, the various FP ops can just as easily be buried in a library; then who cares why or how they get done anyway, as long as they are? With lots-o-RAM, you can write whatever you need, and it'd be the same code you'd write for another platform. Just mostly faster, because for many things FP just isn't required, or critical. Fixed point isn't very hard to build either and can cover a wide range of needs (and then there's BCD code... better than FP for accounting, for instance.)

Signed, old assembly language programmer guy who actually admits he likes asm...

Re:Don't count this out yet (3, Insightful)

ByOhTek (1181381) | more than 2 years ago | (#38789927)

pfft. floating point sucks anyway.

typedef struct FRACTION_STRUCT
{   /* value = (numerator / denominator) * 10^exponent */
    int numerator;            /* the sign lives here */
    unsigned int denominator;
    int exponent;             /* decimal exponent */
} Fraction;
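Multiplying two of these is just integer ops (a rough sketch - overflow handling and gcd reduction deliberately left out):

Fraction frac_mul(Fraction a, Fraction b)
{
    Fraction r;
    r.numerator   = a.numerator * b.numerator;       /* can overflow: sketch only */
    r.denominator = a.denominator * b.denominator;
    r.exponent    = a.exponent + b.exponent;
    return r;
}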

Re:Don't count this out yet (0)

Anonymous Coward | more than 2 years ago | (#38790043)

If floating point sucks, why did you implement it in your fraction struct? ;)

Re:Don't count this out yet (1)

ByOhTek (1181381) | more than 2 years ago | (#38790087)

Floating point is based on an integer base (rather than a fractional one).

Or maybe, I should say, floating point, as I have seen it implemented up to this point.

Re:Don't count this out yet (2)

smi.james.th (1706780) | more than 2 years ago | (#38790073)

You're assuming a rational number there.

Wait. Hang on. Forget that I pointed that out... :P

Re:Don't count this out yet (2)

ByOhTek (1181381) | more than 2 years ago | (#38790093)

it still manages a few cases classic floating point would miss. However, if you give me an infinitely wide register, I would happily work on something that would handle irrational numbers :-)

Re:Don't count this out yet (2)

buglista (1967502) | more than 2 years ago | (#38789931)

Exactly. ARM2 didn't have FP, but people still wrote some extremely good stuff for it. (You can always approximate it by pretending that the last 10 bits are 2^-1, 2^-2, ... 2^-10 and multiplying out - in fact shifting - when you need the answer. I've written graphics demos like that.)
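(A minimal sketch of that format in C, with 10 fraction bits - the helper names are invented:)

#include <stdint.h>

typedef int32_t fx;                          /* fixed point: real value * 1024 */

static inline fx fx_mul(fx a, fx b)
{
    return (fx)(((int64_t)a * b) >> 10);     /* multiply, then shift the extra scale out */
}

static inline int32_t fx_to_int(fx a)
{
    return a >> 10;                          /* the final "multiplying out" */
}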

Re:Don't count this out yet (4, Interesting)

walshy007 (906710) | more than 2 years ago | (#38790077)

Exactly. ARM2 didn't have FP, but people still wrote some extremely good stuff for it.

The Nintendo DS doesn't have an FPU on either CPU.

Re:Don't count this out yet (4, Interesting)

tibit (1762298) | more than 2 years ago | (#38790241)

Agreed. I'm working on a digital oscilloscope display system, and this thing might be very useful in that application -- where you need lots of bandwidth, but also plenty of storage. Say, zooming, filtering, and scaling of a one-second-long acquisition done at 2 Gs/s using a 12-bit digitizer. You tweak the knobs, it updates, all in real time. In the worst case, you need about 120 Gbytes/s of memory bandwidth to make it real time on a 30 FPS display. And that's assuming the filter coefficients don't take up any bandwidth, because if they do you've just upped the requirement to terabytes/s.
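(Back-of-envelope, assuming the 12-bit samples are stored in 16-bit words: 2 Gs/s x 2 bytes x 1 s = 4 GB per acquisition, and re-reading the whole record for each of 30 redraws per second is 4 GB x 30 = 120 GB/s.)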

Re:Don't count this out yet (4, Funny)

TeknoHog (164938) | more than 2 years ago | (#38790289)

Signed, old assembly language programmer guy

I see what you did there.

Re:Don't count this out yet (5, Funny)

fyngyrz (762201) | more than 2 years ago | (#38790555)

That's a bit shifty, don't you think? I don't mean to negate your point, but too, it's beyond my power to complement you -- I'm somewhat over a barrel. Perhaps if you add one to your argument, we'd have something else. Logically speaking. HCF.

Re:Don't count this out yet (1)

poetmatt (793785) | more than 2 years ago | (#38790511)

I agree with you, this is substantial. The ability to have not megabytes of "cache" but gigabytes, depending on how it's used, could be very significant.

I could see this going many ways, including basically the equivalent of having a "socket type" for the RAM - drop in and upgrade as necessary. Even with the significant latency differences versus the various levels of on-die cache, this could matter a great deal.

Re:but... (1)

sourcerror (1718066) | more than 2 years ago | (#38790127)

I can suggest an application off the top of my head: spatial configuration tables for experimental/reconfigurable robotic arms.

Re:but... (1)

ByOhTek (1181381) | more than 2 years ago | (#38789865)

Not yet, but I'm sure it'll run a herd of HURD!

Re:but... (1)

Anonymous Coward | more than 2 years ago | (#38790187)

It's a *design*. So it runs a GNU/Linux OS on paper.

  (English majors: the company is now hiring kernel bug copy editors to ensure that the end-reader experience is flawless)

Fantastic (1, Funny)

Anonymous Coward | more than 2 years ago | (#38789533)

I'd love to see a beowulf cluster of these things...

Oh, wait..

Either/or? (5, Insightful)

Gwala (309968) | more than 2 years ago | (#38789535)

Does it have to be an either-or suggestion?

I could see this being useful as an accelerator - in the same way that GPUs can accelerate vector operations. E.g. memory that can calculate a hash table index by itself. Stuffed in as a component of a larger system, it could be a really clever breakthrough for incremental performance improvements.


Re:Either/or? (3, Interesting)

Anonymous Coward | more than 2 years ago | (#38789731)

Like Mitsubishi 3D RAM [thefreelibrary.com]

They put the logic ops and blend on the RAM

The 3D-RAM is based on the Mitsubishi Cached DRAM (CDRAM) architecture, which integrates DRAM memory and an SRAM cache on a single chip. The CDRAM was then optimized for 3-D graphics rendering and further enhanced by adding an on-chip arithmetic logic unit (ALU) and a video buffer.
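(The "blend on the RAM" part boils down to a read-modify-write performed inside the memory. Roughly, per pixel - a sketch of the idea, not Mitsubishi's actual pipeline:)

#include <stdint.h>

/* classic alpha blend applied as a pixel is written, so the old value
   never has to travel out to the renderer and back */
uint8_t blend_on_write(uint8_t dst, uint8_t src, uint8_t alpha)
{
    return (uint8_t)((src * alpha + dst * (255 - alpha)) / 255);
}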

So, is it a CAM or a DRPU? (0)

Anonymous Coward | more than 2 years ago | (#38789545)

And how do I add more RAM to my system?

Re:So, is it a CAM or a DRPU? (1)

Sulphur (1548251) | more than 2 years ago | (#38789729)

And how do I add more RAM to my system?

It's Computer Assisted Memory. Check the NUMA box. /Ducks and Covers

Re:So, is it a CAM or a DRPU? (1)

somersault (912633) | more than 2 years ago | (#38789843)

I think UM-CRAP, or RAD-MC-PU would be more catchy.

Re:So, is it a CAM or a DRPU? (3, Funny)

somersault (912633) | more than 2 years ago | (#38789871)

Missed a D - better make that DUM-CRAP*.

I wonder, how much DUM-CRAP could we fit into a single PC?

* this name is by no means a reflection on what I think of the tech - it sounds like a pretty cool idea.

Re:So, is it a CAM or a DRPU? (2, Insightful)

Anonymous Coward | more than 2 years ago | (#38790165)

The idea is not new, and lots of products with a CPU on the RAM die exist. Sun had this on graphics cards, for example.
The missing FP is not a big deal, since FP can be calculated with ints if needed - but it should get an FPU in follow-up products to stop the rants.
The dirty secret of the computer industry is that the CPU has to wait "lots" of cycles to go to memory and back, since CPUs are clocked much higher than RAM, plus there are other chips in between...
RAM can be added in system designs here - but you simply get more CPUs as well ;-)
I guess market acceptance is always a matter of integration effort, so Linux it shall run, and broad I/O chipsets shall be available.
Well done - hope it succeeds

performance vs. memory bandwidth (2, Insightful)

Anonymous Coward | more than 2 years ago | (#38789557)

> "that trades 25 years of flexibility and performance for scads of memory bandwidth"

Right... because memory bandwidth isn't one of the greatest bottlenecks in current designs...

Re:performance vs. memory bandwidth (3, Interesting)

hattig (47930) | more than 2 years ago | (#38789639)

And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?

One of the issues they had to deal with was that DRAM is usually made on a 3 metal layer process, whereas CPUs usually take a lot more layers due to their complexity.

This will have to compete with TSV-connected DRAM, which will be a major bandwidth and power aid to conventional SoCs.

Re:performance vs. memory bandwidth (1)

ByOhTek (1181381) | more than 2 years ago | (#38790015)

For a limited subset of tasks, very high performance.

If the chip can achieve either (a) a higher clock speed or (b) fewer cycles for the same op, or even both, then there can easily be some operations that are faster. For tasks focused on those operations, the chip will be faster. The memory improvements won't hurt things either.

CPUs and GPUs are rarely the same speed and transistor count, but we use both. GPUs excel at floating point and rapid-fire streams of the same op against an array of data. CPUs are better at integer processing and rapidly changing operations on individual values.

The idea of something like this isn't to replace the CPU, I suspect, except in extremely low-power/size situations, but rather to add another unit to offload difficult tasks. Also, it could easily be a proof of concept, in which case a more powerful built-in CPU could be in the pipeline.

Re:performance vs. memory bandwidth (3, Insightful)

tibit (1762298) | more than 2 years ago | (#38790249)

Perfect for networking -- switching, routing, ... Think of content addressable memory, etc.

Re:performance vs. memory bandwidth (3, Interesting)

K. S. Kyosuke (729550) | more than 2 years ago | (#38790375)

And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?

Quite a lot, I would guess. A stack-based design would give you one instruction per cycle with a compact opcode format capable of packing multiple instructions into a single machine word, which means a single instruction fetch covers multiple actual instructions executed. Oh, and make it word-addressed; that simplifies things a bit as well. In the end, you'll have a core that runs at perhaps 50-100% of the clock speed of conventional cores on a given manufacturing technology (say, 60 nm), with just a single thread of execution, but with a negligible transistor budget and power consumption. The resulting effective computational performance per unit of energy will be at least one order of magnitude better than the current offerings from Intel and AMD, although you first have to learn how to program it.
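(A toy C sketch of the packing idea - four 8-bit stack opcodes per 32-bit fetch; the opcode set is invented:)

#include <stdint.h>

enum { OP_NOP = 0, OP_DUP, OP_ADD, OP_MUL };

/* one instruction fetch, up to four executions, highest byte first */
void exec_word(uint32_t iword, int32_t stack[], int *sp)
{
    for (int slot = 3; slot >= 0; slot--) {
        switch ((iword >> (8 * slot)) & 0xFF) {
        case OP_DUP: stack[*sp + 1] = stack[*sp]; (*sp)++; break;
        case OP_ADD: stack[*sp - 1] += stack[*sp]; (*sp)--; break;
        case OP_MUL: stack[*sp - 1] *= stack[*sp]; (*sp)--; break;
        default: break;                      /* OP_NOP */
        }
    }
}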

Map Reduce? (3, Interesting)

complete loony (663508) | more than 2 years ago | (#38789571)

So you could implement some simple map reduce operations and run them directly in RAM?

Re:Map Reduce? (1)

Anonymous Coward | more than 2 years ago | (#38789891)

Why in the world are people always saying the words Map Reduce nowadays? I hear it every week at least. Like it would be the solution to World War 3.

Re:Map Reduce? (4, Funny)

Pieroxy (222434) | more than 2 years ago | (#38789991)

Why in the world are people always saying the words Map Reduce nowadays? I hear it every week at least. Like it would be the solution to World War 3.

Since WWIII hasn't happened yet, you cannot rule out the fact that it *might* be the solution.

Re:Map Reduce? (0)

Anonymous Coward | more than 2 years ago | (#38790319)

Why in the world are people always saying the words Map Reduce nowadays? I hear it every week at least. Like it would be the solution to World War 3.

Since WWIII hasn't happened yet, you cannot rule out the fact that it *might* be the solution.

Why thank you, sweety!

Re:Map Reduce? (0)

K. S. Kyosuke (729550) | more than 2 years ago | (#38790413)

Why in the world are people always saying the words Map Reduce nowadays? I hear it every week at least. Like it would be the solution to World War 3.

Since WWIII hasn't happened yet, you cannot rule out the fact that it *might* be the solution.

I thought that reducing the map (of the world) was supposed to be the outcome of WWIII, not a solution to it.

Re:Map Reduce? (2)

PiMuNu (865592) | more than 2 years ago | (#38790347)

Why in the world are people always saying the words Map Reduce nowadays?

Distributed computing...

Re:Map Reduce? (5, Insightful)

lkcl (517947) | more than 2 years ago | (#38789979)

Aspex Semiconductors took this a lot further. they did content-addressable-memory. ok, they did a hell of a lot more than that. they created a massively-parallel deep SIMD architecture with a 2-bit CPU (early versions were 1 bit), with each CPU having something like 256 bits of memory to play with. ok, early versions had 128-bits of "straight" RAM and 256 bits of content-addressable RAM. when i was working for them they were planning the VASP-G architecture which would have 65536 such 2-bit CPUs on a single die. it was the 10th largest CPU being designed, in the world, at the time.

programming such CPUs was - is - a complete f*****g nightmare. you not only have the parallelism of the CPU to deal with but you have the I/O handling to deal with. do you try to fit the data 1-bit-wide per CPU and process it serially? or... do you try to fit the data across 32 CPUs and process it in parallel? (each CPU was connected to its 2 neighbours so you could do this sort of thing). or... do you do anything in between, because if you have only 1-bit-wide that means that the I/O is held up, but if you do 32-bits across 32 CPUs you process it so quick that you're now I/O bound.

much of the work in fitting algorithms onto ASPs involved having to write bloody spreadsheets in Excel to analyse whether it was best to use 1, 2, 4 .... 32 CPUs just to process the bloody data! 6 weeks of analysis to write 30 lines of code for god's sake!

it gets worse: you can't even go read a book on algorithms for hardware because that doesn't apply; you can't go read a book on algorithms for software because _that_ doesn't apply. working out how to fit AES onto the Aspex Semi CPU took me about... i think it was 6 weeks, to even _remotely_ make it optimal. i had to read up on the design of the 2-bit Galois Field theory behind the S-Boxes, because although you could do 8-bit S-Box substitution by running 256 "compare" instructions, one per substitution, in parallel across all 4096 CPUs, it turned out that if you actually implemented the *original* 2-bit Galois Field mathematical operations in each of the 2-bit CPUs you could get it down to 40 instructions, not 256.
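(for illustration, the 256-compare version in rough C - on the real hardware the inner loop happened across all 4096 PEs at once; the names are invented:)

#include <stdint.h>
#include <stddef.h>

/* 8-bit S-box substitution done as 256 broadcast compares, one per
   table entry, the way a SIMD/CAM array would do it */
void sbox_by_compares(const uint8_t *in, uint8_t *out, size_t n,
                      const uint8_t sbox[256])
{
    for (int v = 0; v < 256; v++)        /* 256 sequential compare steps */
        for (size_t i = 0; i < n; i++)   /* conceptually parallel across PEs */
            if (in[i] == v)
                out[i] = sbox[v];
}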

and that was just for _one_ part of the Rijndael algorithm: i had to do a comprehensive detailed analysis of _every_ aspect of the algorithm.

in other words, everything that you _think_ you know about optimising software and algorithm design for either hardware or for software is completely and utterly wrong, for these types of massively-parallel map-reduce and content-addressable-memory CPUs.

that leaves them somewhere in the very very specialist dept, and even there, they have problems, because it takes so long to verify and design a new CPU. when the Aspex VASP-F architecture was being planned, it was AMAZING! wow! 100x faster than the best Pentium-III processor! of course, within 18 months it was only 20x better than the top-of-the-line Pentium that was available, and by the time it _actually_ came out, it was only 5x better than a bunch of x86 CPUs, which are a hell of a lot easier to program.

it was the same story for the next version of the CPU, even though that promised to have 64k processing elements...

Re:Map Reduce? (2)

Andy_R (114137) | more than 2 years ago | (#38790197)

I know this is probably going to sound flippant, but I'm sure there is a genuine reason and I'd be interested to hear it... Why not just write it both ways and test?

Better yet, why not get the compiler to try different parallelisations and use a genetic algorithm to optimise automatically?

Re:Map Reduce? (1)

postbigbang (761081) | more than 2 years ago | (#38790243)

Meh.

From a theorist's standpoint, it's classical. You get a classical von Neumann state machine. There's the problem of heat and die size, and buses are absolutely custom if you use them, although someone will put together a nice chipset to deal with the timing.

Multiple cores still have the same problems in terms of cache state, fetch state, and synch, so no real benefit there. Add in memory protection and this becomes more wicked still. Fast, but wicked difficult from an OS maker's standpoint. Not that it's easy now.

Processing In Memory (5, Interesting)

Anonymous Coward | more than 2 years ago | (#38789581)

This isn't new. The MIT Terasys platform did the same in 1995, and many have since. Nobody has yet come up with a viable programming model for such processors.

I'm expecting AMD's Fusion platform to move in the same direction (interleaved memory and shader banks), and they already have a usable MIMD model (basically OpenCL).

Re:Processing In Memory (2)

Omnifarious (11933) | more than 2 years ago | (#38789769)

*nod* I'm not surprised it isn't new.

I've noticed that locality has become more and more important as speeds have gone up. I kind of wonder if something like this isn't the future.

I'm noticing, for example, that programming models involving channels and lots of threads have shown up and seem like a viable model for something like this. Erlang and Go are the two languages that do this that I can think of right offhand.

Re:Processing In Memory (0)

Anonymous Coward | more than 2 years ago | (#38789811)

Uuum, look up Haskell thread sparks.
Combine that with vector processing.

Looks very viable to me.

Re:Processing In Memory (2)

master_p (608214) | more than 2 years ago | (#38789879)

Nobody has yet come up with a viable programming model for such processors.

The Actor [wikipedia.org] model fits this type of CPU well. Each CPU could be considered an actor.

A jump instruction to a memory bank of another CPU could be translated as an actor call, executed in parallel with the caller.

A call instruction to a memory bank of another CPU could be translated as a parallel call-and-wait instruction.

A load/store instruction could be translated as a queued request for retrieving/updating data.

This provides a very natural multitasking solution that offers good expandability in both memory and processing power by adding more memory/CPU chips.

An object-oriented programming language could spread parallel objects out across as many CPUs as possible.
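(A minimal single-threaded C sketch of the mailbox idea - all names invented for illustration:)

#include <stdlib.h>

typedef struct Message {
    int tag, payload;                    /* what the sender wants done */
    struct Message *next;
} Message;

typedef struct Actor {
    Message *head, *tail;                /* the mailbox: queued requests */
    void (*behavior)(struct Actor *self, Message *m);
} Actor;

void actor_send(Actor *a, int tag, int payload)   /* e.g. a queued load/store */
{
    Message *m = malloc(sizeof *m);      /* error handling omitted: sketch */
    m->tag = tag; m->payload = payload; m->next = NULL;
    if (a->tail) a->tail->next = m; else a->head = m;
    a->tail = m;
}

void actor_run(Actor *a)                 /* each CPU drains its own mailbox */
{
    while (a->head) {
        Message *m = a->head;
        a->head = m->next;
        if (!a->head) a->tail = NULL;
        a->behavior(a, m);
        free(m);
    }
}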

Re:Processing In Memory (1)

Entrope (68843) | more than 2 years ago | (#38790329)

The programming challenge with these architectures is not how to write applications for them. It's how to write efficient, correct applications reasonably quickly. In practice, the processors quickly become special-purpose rather than general-purpose as a result of their programming frameworks focusing on particular problems that the architecture is good at. (Not to mention Amdahl's law kicks in pretty quickly.)

Re:Processing In Memory (1)

Attila the Bun (952109) | more than 2 years ago | (#38789893)

This isn't new. The MIT Terasys platform did the same in 1995, and many have since. Nobody has yet come up with a viable programming model for such processors.

Indeed, but PC architecture is going in this direction. The powerful and flexible main CPU will remain, but there are more and more devices with their own specialised processors and memory. First graphics cards, then HDDs and other devices followed suit, and now we think nothing of putting microcontrollers in mice, keyboards, even speakers. Perhaps in the future I/O could be handled entirely by the in-memory processors. The more work the CPU can outsource to specialised processors, the faster it's going to get done.

Re:Processing In Memory (1)

hackertourist (2202674) | more than 2 years ago | (#38789895)

What about the programming model that was used for every processor that had a 1:1 clock relationship with its memory, i.e. everything before the 80386?

Re:Processing In Memory (1)

Omnifarious (11933) | more than 2 years ago | (#38789923)

That's not the issue. It's the massive parallelism that's the issue. And most models for getting a grip on that tacitly assume symmetric access to all memory by all CPUs. C++ is only just now getting atomic operations that build in the assumption that some memory may be seen differently by one thread than by another.

Re:Processing In Memory (3, Interesting)

Anne Thwacks (531696) | more than 2 years ago | (#38790191)

C++ is your problem. Algol68 dealt with these issues over 40 years ago. There were two problems with Algol68:

It was not American (NIH)

The best training manual for it was called "Algol68 with fewer tears"

Other than that, it was able to handle parallelism, and most everything else, in a relatively painless manner.

For those who actually LIKE pain, there is always Occam.

Re:Processing In Memory (0)

Anonymous Coward | more than 2 years ago | (#38789997)

The DAP http://en.wikipedia.org/wiki/Distributed_Array_Processor was a similar idea... SIMD programming model. Worked really well for certain kinds of applications.

Re:Processing In Memory (0)

Anonymous Coward | more than 2 years ago | (#38790001)

ObWikipediaLink: Processor-in-memory [wikipedia.org] .

Just a first step... (3, Interesting)

bradley13 (1118935) | more than 2 years ago | (#38789585)

Really, this was inevitable, and this first implementation is just a first step. Future versions will undoubtedly include more functionality.

Current processors are ridiculously complicated. If you can knock out the entire cache with all of its logic, give the processor direct access to memory, and stick to a RISC design, you can get a very nice processor in under a million transistors.

Re:Just a first step... (0)

Anonymous Coward | more than 2 years ago | (#38789823)

Or if we could allocate a part of L1 cache in x86 for processes for direct memory access, that alone would result in a considerable speedup.

Re:Just a first step... (4, Informative)

lkcl (517947) | more than 2 years ago | (#38789901)

the cache is there because the speed of DRAM, regardless of how fast you can communicate with it, still has latency issues on addressing.

to do the "routing" to address a 4-bit bus, you need 1/2 the number of transistors than if you addressed a 2-bit bus. each time you add another bit to the address range, you have increased the latency of access.

if you were to provide entirely random-access to an entire 32-bit range you would absolutely kill performance. so, what RAM IC designers do is they go "ok, you're not going to get 32-bit addressing, you're going to get 14-bit addressing, you're going to have to read an entire page of 1k or 2kbits, and you're going to have to have parallel ICs, the first IC does bits 0 to 1 of the data, the second IC does bits 2 and 3 etc."
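(for scale, a made-up but representative layout: a 128Mbit bank addressed as 2^16 rows of 2^11 bits means 16 row-address bits select a 2kbit page, since 2^16 x 2^11 = 2^27 bits = 128Mbit - and every extra row-address bit doubles the decoder and adds another gate level of latency.)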

this relies on the design of the processor having a VM architecture - paging.

but the same principle applies *inside* the processor: even just decoding the addressing, in the MMU, it's *still* too much latency involved.

so this is why you end up with hierarchical caching - 1st level is tiny, 2nd level is huge.

even with RISC designs you _still_ have to have 1st and 2nd level caches in order to remain competitive. if you've ever seen a picture of a RISC CPU, it's astounding: the actual CPU is like 1% of the total area; caches are huuuge by comparison, crossbar routing takes up 50% of the chip and the I/O pads, required to be massive in order to handle the current, can take up something like 5% of the chip (guessing here, it's been a while since i looked at an annotated example CPU).

Re:Just a first step... (1)

K. S. Kyosuke (729550) | more than 2 years ago | (#38790465)

even with RISC designs you _still_ have to have 1st and 2nd level caches in order to remain competitive. if you've ever seen a picture of a RISC CPU, it's astounding: the actual CPU is like 1% of the total area; caches are huuuge by comparison,

Don't do caches, do scratchpad memories and minimal instruction formats that require minimum bandwidth per opcode performed. And write a reasonable compiler. I've already seen books on it (static/automatic allocation of storage for scratchpad-equipped CPUs), I think CRC published a chapter on it in one of their recent compiler construction handbooks.

Why not a hexagonal design? (4, Interesting)

G3ckoG33k (647276) | more than 2 years ago | (#38789587)

Speaking of unconventional design, why don't we see hexagonal or triangular CPU designs? All I have seen are the Manhattan-like designs. Are these really the best? Embedding the CPU inside a hexagonal/triangular DRAM design should be possible too. What would be the trade-offs?

Re:Why not a hexagonal design? (1)

Anonymous Coward | more than 2 years ago | (#38789613)

Problematic edges.

Re:Why not a hexagonal design? (2)

PSVMOrnot (885854) | more than 2 years ago | (#38789621)

It probably boils down to ease and efficiency of manufacture. Certainly for the core of the CPU, I would imagine it's because squares tessellate nicely on the silicon wafer.

Re:Why not a hexagonal design? (1)

ByOhTek (1181381) | more than 2 years ago | (#38789791)

hexagons would probably tessellate even better, with less waste.

Ease of manufacture is still the issue, though - cutting them out would be a bitch.

Re:Why not a hexagonal design? (2)

dkf (304284) | more than 2 years ago | (#38789819)

hexagons would probably tessellate even better, with less waste.

Ease of manufacture is still the issue, though - cutting them out would be a bitch.

That's why triangles would be good; they can act as parts of hexagons, and yet you can cut them out with straight cuts. OTOH, you'll have to deal with acute angles in the result, which might have its own set of problems. Squares are likely a reasonable compromise, all things considered.

Re:Why not a hexagonal design? (1)

ByOhTek (1181381) | more than 2 years ago | (#38790079)

Acute angles are definitely bad in electronics like that. Also, if you tried to use them as hexagons, you'd have to merge six, which would bring a whole extra set of complexities and room for error. Yes, square is going to be the best option.

Re:Why not a hexagonal design? (2)

mdenham (747985) | more than 2 years ago | (#38789807)

Ease, yes. Efficiency, especially the amount of the wafer that's wasted due to it being circular initially, not so much.

A triangular layout probably would be ideal - you could (or at least, should be able to) adapt existing equipment (instead of two cuts at right angles, you make three cuts at 60-degree angles - hexagonal layouts would require either piecing together triangles produced this way, casting much smaller ingots such that it's one chip per wafer, or stamping out the hexagons), and you reduce waste somewhat (depending on the size of the chip and the size of the initial wafer).

However, the problem with triangular layouts (and hexagonal, for that matter) is that they involve cutting along planes the original ingot (which is a single huge silicon crystal) won't naturally fracture along.

So... unfortunately, we're stuck with square chips (if we start building up 3D chips, we'll have the option of cubes and octahedra) because of nature.

Re:Why not a hexagonal design? (2, Interesting)

TechnoCore (806385) | more than 2 years ago | (#38790021)

Guess it is because the silicon wafers that CPUs are made from must be cut along the atomic layers of the silicon. Silicon in solid form at room temperature crystallizes into a diamond cubic crystal structure. It is very strong, but also very brittle. It is easy to cut along straight lines, following the faces of an octahedron. To cut at any other angle would probably be very difficult and risky. Maybe it would therefore be hard to cut a wafer in triangular shapes?

Their money isn't old enough. (2, Interesting)

Anonymous Coward | more than 2 years ago | (#38789719)

They're innovating?

T-minus two days until they've been hit with 13 different patent lawsuits by companies that don't even produce anything similar.

Sorry about your luck!

Sorry , I don't believe it (1)

Viol8 (599362) | more than 2 years ago | (#38789749)

Memory bottlenecks might be an issue, but cache generally solves a lot of them. Binning just about every advance in processor design since the Z80 simply to speed up memory access is farcical. I'm afraid this is going to sink without trace, since if you need low power you can just use an ARM anyway, which incidentally will have a shedload more performance.

Re:Sorry , I don't believe it (1)

mdenham (747985) | more than 2 years ago | (#38789839)

Think of this as a proof-of-concept work, which can rapidly (relative to the initial rate of progress from the 22k transistor range) be pushed to something closer to present-day processor strength. So figure sometime around 2020-2025 for them to have caught up to present-day transistor counts, with a system that'd have higher performance than anything we can get right now (without overclocking).

Granted, that 10-12 years figure is a gigantic ass pull on my part, but it shouldn't be too much slower than that. They'll catch up eventually, and to low-power chips faster than everything else.

Re:Sorry , I don't believe it (0)

Anonymous Coward | more than 2 years ago | (#38790567)

I'm not an engineer of any sort, but based on my reading, this is the general direction I could see these systems going in.

What I'm envisioning is a 64-bit, in-order, fairly simple CPU with a chunk of memory attached. Latency is mostly hidden by extra registers. Then a bunch (think hundreds) on a single die. Each CPU has its own kernel and works cooperatively with the other CPUs to keep related threads "near" each other.

You would effectively treat this system like a bunch of single-CPU nodes in a grid with different latencies to other nodes, nearest being fastest. You would need some way to describe related workloads. They would use message passing, so all messages transferred would be immutable structs instead of passing pointers and accessing relatively high-latency remote memory.

All on one chip (3, Interesting)

jaweekes (938376) | more than 2 years ago | (#38789755)

I'm just wondering, and maybe it exists already, but why not make everything on one chip? The CPU, memory, GPU, etc.? Most people don't mess with the insides of their computer, and I'm guessing it would speed up the computer as a whole. You wouldn't even need to make it high-performance. Just do an i3 core with the associated chipset (or equivalent), maybe 4GB of RAM, some connectivity (USB 2, DVI, SATA, Wi-Fi and 1000Base-T), and you have it all. The power savings should be huge, as everything internal should be low voltage. The die will be huge, but we are heading that way anyway.
Am I talking bollocks?

Re:All on one chip (0)

Anonymous Coward | more than 2 years ago | (#38789775)

Yes, you are.

Support Chip? (1)

TaoPhoenix (980487) | more than 2 years ago | (#38789803)

I'm no techie, but I'm just wondering if this isn't more of a support chip that works the other way - a "smart cache" where the main CPU can offload something memory-intensive and repetitive, to keep it out of the way of the fancy thread calculations.

Re:All on one chip (2)

BeardedChimp (1416531) | more than 2 years ago | (#38789817)

You are basically describing a system on chip [wikipedia.org]. You have one in your phone.

Re:All on one chip (1)

jaweekes (938376) | more than 2 years ago | (#38789841)

I didn't think about that! Now why can't they do that for a desktop or laptop? Is the ARM system just that much smaller than the Intel desktop chips?

Re:All on one chip (1)

TheDarkMaster (1292526) | more than 2 years ago | (#38789873)

I thought the same thing. The question would be: is it possible to make an SoC with the performance of a desktop PC? And I don't know if doing so would be justified, given that the main advantage of a desktop PC is being able to make hardware upgrades.

Re:All on one chip (1)

Tim4444 (1122173) | more than 2 years ago | (#38790169)

Well, Raspberry Pi [raspberrypi.org] could be described as a proof of concept for the whole SoC-as-a-PC-substitute idea. At least in the Windows world, the popular software is only offered as precompiled binaries for x86-based platforms. It may be a while before there's a critical mass of ARM-based offerings to attract serious commercial attention. Windows 8 may change this, but I think it's still too early to tell.

I think upgradability is possibly not the main advantage of desktops, though it's certainly a key factor for many people. I'd argue that a sizeable number of PCs, if not the majority, will never get an upgrade that requires opening the case (so I'm excluding new peripherals). That's why there's a market for things like onboard (i.e. on the mobo) audio, NICs, and others, sometimes including the GPU.

Way to miss the point (0)

Anonymous Coward | more than 2 years ago | (#38789945)

Er, from that link:

the term SoC is typically used with more powerful processors, ..., which need external memory chips (flash, RAM) to be useful

And it also explains why the different processes used to make memory and CPUs mean the two are usually kept separate.

Re:All on one chip (4, Interesting)

stevelinton (4044) | more than 2 years ago | (#38789853)

There are basically two problems:

1. The external connectivity -- SATA, USB, ethernet, etc. needs too much power to easily move or handle on a chip (and the radio stuff needs radio power). You can do the protocol work on the main chip if you like, but you'll need amplifiers, and possibly sensors off chip.

2. DRAM and CPUs are made in quite different processes, optimised for different purposes. Cache is memory made using CPU processes (so it's expensive and not very dense). These guys are trying to make CPUs using DRAM processes, which are slow.

Re:All on one chip (1)

fuzzyfuzzyfungus (1223518) | more than 2 years ago | (#38789903)

The trouble is that not only will the die be huge (which is an issue because it increases the odds that you'll have to throw the whole thing away because of a defect in some vital bit of it); but the entire die will have to be produced on a single process, presumably the one used by the most demanding of the parts.

That doesn't make it impossible; but it would very likely make it extraordinarily expensive. If you totted up the total die area of a contemporary PC - CPU, RAM, GPU, assorted peripherals and interconnect stuff - it is already very large; but it is very large spread out over a number of processes, and a lot of dice that can be tested (and if necessary trashed) individually during production. Requesting the same functions, on a single die, from the fancy cutting-edge process probably used to make the CPU would be ruinously expensive...

Re:All on one chip (0)

Anonymous Coward | more than 2 years ago | (#38790019)

This is what the Parallax Propeller (2006) does. Also, it's an 8-core design and costs ten bucks.

Re:All on one chip (0)

Anonymous Coward | more than 2 years ago | (#38790095)

BeagleBoard [wikipedia.org] is similar to what you described. It has the CPU and GPU on one IC, then uses PoP [wikipedia.org] to mount the memory directly on top of the CPU.

eDRAM (0)

Anonymous Coward | more than 2 years ago | (#38789801)

How does this compare with embedded DRAM, which caused a lot of hype a few years ago?

XBox 360 (1)

mozumder (178398) | more than 2 years ago | (#38789805)

Also has an embedded DRAM chip, to do things like z-buffer lookup on memory writes.

Probably the most common embedded DRAM system out there.

synthesis (4, Informative)

lkcl (517947) | more than 2 years ago | (#38789851)

there's a problem with doing designs like this. the tooling for CPUs is very very specific: 28nm, 32nm, 45nm - all those companies that do the simulations and charge something like $USD 250,000 per week to license their tools, like mentor, have written those tools SPECIFICALLY for those geometries.

if you wander randomly outside of those geometries you are either on your own or you are into some unbelievably-high development costs.

why is this relevant?

it's because the DRAM manufacturers do *not* stick to the well-known geometries: they vary the geometry in order to get the absolute best performance because the cell layout is absolutely identical for DRAM ICs. and, because those cells _are_ identical, the verification process is much simpler than is required for a complex CPU.

in other words, this company is trying to mix-and-match two wildly different approaches: what they're doing is either incredibly expensive or sub-optimal. which raises the question: what's it _for_?

Re:synthesis (0)

Anonymous Coward | more than 2 years ago | (#38789913)

What task can you do that requires nothing more than a few specific operations, but ones that need to be repeated so fast because there are a tremendous number of operations to do?

Breaking encryption.

Re:synthesis (0)

Anonymous Coward | more than 2 years ago | (#38789965)

in other words, this company is trying to mix-and-match two wildly different approaches. in other words, what he's doing is either incredibly expensive or is sub-optimal. which begs the question: what's it _for_?

Because your processor doesn't need to be "optimal" if you're running memory-bandwidth-intensive operations and your choice of how to build it allows you to have a 4kbit wide per-core memory interface. This device basically has about 16x the memory bandwidth of anything else out there, which means for the right application it runs 16 times faster.
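(Rough numbers, assuming a DRAM core clock of around 250 MHz: a 4096-bit interface moves 512 bytes per cycle, about 128 GB/s per core, versus roughly 8 GB/s for a 64-bit DDR3-1066 channel - which is about where a 16x figure comes from.)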

(captcha: "defended". heh.)

Re:synthesis (1)

K. S. Kyosuke (729550) | more than 2 years ago | (#38790007)

there's a problem with doing designs like this. the tooling for CPUs is very very specific: 28nm, 32nm, 45nm - all those companies that do the simulations where they charge something like $USD 250,000 per week to license their tools like mentor do - have written the tools SPECIFICALLY for those geometries.

Or you can do it the way Chuck Moore does and write your own OKAD, simpler, faster and better. :)

Yo dawg! (1, Funny)

jimmydigital (267697) | more than 2 years ago | (#38789987)

I heard you like to reduce maps, so I put a CPU in your RAM so you can hash while you map.

Humm, IBM did it first. (2, Interesting)

JimCanuck (2474366) | more than 2 years ago | (#38789993)

IBM has sold CPUs with DRAM onboard for quite a while; IBM developed it, patented it, and sells it as "eDRAM", aka "embedded DRAM".

I guess IBM's POWER7 processor family, and the chips powering things like Sony's PlayStation 2 and PlayStation Portable, Nintendo's GameCube and Wii, and Microsoft's Xbox 360, all have eDRAM.

Maybe news articles should be checked to see if they are really news or not before posting?

Re:Humm, IBM did it first. (1)

vlm (69642) | more than 2 years ago | (#38790097)

In the field, a microcontroller is just a processor with onboard (usually static) memory, so this thing is pretty close to a microcontroller.
Anyone know of any other microcontroller-type chipsets that use dynamic RAM?

Re:Humm, IBM did it first. (0)

Anonymous Coward | more than 2 years ago | (#38790299)

No, none of these game consoles have eDRAM in them. It is only used on POWER7 and Z Series machines made in 45nm SOI technology (I know because I work at IBM Hopewell Junction, where these were initially manufactured).

Caches (2)

unixisc (2429386) | more than 2 years ago | (#38790023)

Normally, in any CPU, you have 1, 2 or even 3 levels of cache - level one being the fastest accessed from the CPU, and higher numbers involving more latency. The whole idea being that data that is frequently accessed should be either within the CPU's register files, or within the level 1 cache. Failing that, the level 2 cache, failing that, level 3 cache or main memory. So for this CPU, the DRAM can be considered an L4 cache?

Incidentally, is it an SoC? Does all the support circuitry - to the South Bridge, PCIx, USB, 802.11 and other peripheral interfaces - get included here? And can someone attach a few extra GB externally to give what's effectively an L5 cache?

I can't say I like this approach - I'd prefer it if the CPU and interface logic was on 1 chip, and the memory on another.

Re:Caches (2)

cbhacking (979169) | more than 2 years ago | (#38790281)

Umm... no. You've apparently completely failed to notice the part where this CPU *has* no cache, at least certainly no L2 or L3. Instead, it talks directly to main memory (which it's embedded in, at least a portion of, and has extremely fast access to). More accurately, any given gigabit (128MB of RAM) is the "cache" for one of these eight-core ICs.

I don't know how quickly they can communicate across the DIMM (each 2GB DIMM has 16 of these ICs, so some intercommunication is critical) - maybe that's more akin to traditional memory access speed - but it's still a ludicrous amount of "cache", and eliminating the multiple levels of caching greatly simplifies the memory controller logic.

That said, I wonder how useful a CPU core with so few transistors (and apparently a low clock speed) will be. It's certainly not going to have all the peripheral interfaces you mention - not even close.

cray3/super scalar system (2)

Rachael (244242) | more than 2 years ago | (#38790055)

Cray was heading that way too in the mid-90s with their SSS system; they were adding as many as 2048 CPUs per block.

http://en.wikipedia.org/wiki/Cray-3/SSS
http://www.thefreelibrary.com/CRAY+COMPUTER+CORP.+COMPLETES+INITIAL+DEMONSTRATION+OF+THE+CRAY-3...-a016628331

cpu and memory already atomic (1)

CBravo (35450) | more than 2 years ago | (#38790085)

You can model separate CPU and memory as two processors: one with only a little memory and a lot of processing power, the second with a lot of memory and no processing power (theoretically speaking).

Reinvention from 1984 (2)

AlecC (512609) | more than 2 years ago | (#38790103)

Looks like they have reinvented the inmos Transputer, from about 1984. http://en.wikipedia.org/wiki/Transputer [wikipedia.org]. They always intended to take it multicore, but never got that far. But it looks remarkably similar in intention.

IBM already has CPU and eDRAM in their chips (0)

Anonymous Coward | more than 2 years ago | (#38790179)

IBM already offers an embedded DRAM option to go with logic, enabling high-density cache in microprocessors. POWER7 already uses this feature. How is this new? You can use their foundry service to access the technology.

http://www-03.ibm.com/press/us/en/pressrelease/32970.wss

How curious (1)

vikingpower (768921) | more than 2 years ago | (#38790311)

"Venray" is a boring little town in the Netherlands. Both it and neighbouring Venlo are known for the tough crime scene. Link ?

Hmm. (1)

theswimmingbird (1746180) | more than 2 years ago | (#38790335)

I used to be a CPU like you until I took some DRAM to the knee.

More than just embedded DRAM (1)

sulimma (796805) | more than 2 years ago | (#38790467)

This is not just about putting DRAM and a CPU on the same chip while keeping the architecture of both unchanged.
This is about how computer architecture is affected by the possibility of implementing both on the same chip.

Dave Patterson noted in the nineties that the number of DRAM chips per computer was going down over time. He predicted that DRAM would soon become large enough that at least the memory for a single process would fit into one chip. At that point it is unnecessarily slow and power-consuming to move the data to the CPU and back for every computation (or, alternatively, to spend 90% of the CPU chip area on cache to reduce the number of transports).

When you do put CPU and DRAM on the same chip, the cost functions change and different architectures become optimal. Patterson noted that when you have a CPU and DRAM on the same chip, the relative architectural cost functions are similar to those of the 1970s, just a few orders of magnitude smaller. He therefore revisited architectures of that era and suggested putting a vector computer on a DRAM chip, called the IRAM.
http://www.cs.berkeley.edu/~pattrsn/talks/iram.html [berkeley.edu]

Vector computers do not benefit much from cache. Latency is not a big issue for vector computers, but they really benefit from bandwidth. On chip, you can connect the DRAM to the CPU with a bus 2048 bits wide or more. (And the latency would be much smaller than that of a CPU going through a big cache hierarchy and an external bus to the RAM.)

If more memory is needed than fits on one chip, he suggested minimizing data transports between chips. Instead, the register state of the process would be migrated to the DRAM chip where the desired data resides.
