Researchers Claim 1,000 Core Chip Created

CmdrTaco posted more than 3 years ago | from the eat-the-seeds dept.

Supercomputing 118

eldavojohn writes "Remember a few months ago when the feasibility was discussed of a thousand core processor? By using FPGAs, Glasgow University researchers have claimed a proof of concept 1,000 core chip that they demonstrated running an MPEG algorithm at a speed of 5Gbps. From one of the researchers: 'This is very early proof-of-concept work where we're trying to demonstrate a convenient way to program FPGAs so that their potential to provide very fast processing power could be used much more widely in future computing and electronics. While many existing technologies currently make use of FPGAs, including plasma and LCD televisions and computer network routers, their use in standard desktop computers is limited. However, we are already seeing some microchips which combine traditional CPUs with FPGA chips being announced by developers, including Intel and ARM. I believe these kinds of processors will only become more common and help to speed up computers even further over the next few years.'"

Programmable CPU's (3, Interesting)

kge (457708) | more than 3 years ago | (#34745638)

How long will it be before we will see the first motherboards with FPGA emerge?
Then you can download the CPU type of your choice:

-- naah, I don't like this new Intel core, I will try the latest AMD instead...

Re:Programmable CPU's (5, Informative)

Hal_Porter (817932) | more than 3 years ago | (#34745726)

A desktop CPU in an FPGA will always cost more and perform worse (i.e. slower clock rate) than a full custom chip from Intel or AMD. Mind you, I've seen embedded designs where a microcontroller, RAM, ROM and custom logic are implemented in a $10 FPGA - especially where volumes are too low for an ASIC.

On the other hand I could definitely see programmable logic inside Intel or AMD CPUs, a sort of super SSE. Then again, even there you'd probably be better off using GPU-like custom hardware for the heavy lifting. In fact I can see CPU/GPU hybrids being very common in low end machines. Full custom logic is always going to have a performance per $ advantage over FPGAs unless FPGA technology changes drastically.

Re:Programmable CPU's (2)

RKBA (622932) | more than 3 years ago | (#34746114)

FPGAs have been dynamically reprogrammable for years. You could load one with whatever special "hardware" custom instructions you wanted on the fly. Yes, custom logic is faster, but is inflexible.

Re:Programmable CPU's (1)

wagnerrp (1305589) | more than 3 years ago | (#34747956)

They're dynamically reprogrammable, but it's not like you can just instantly flip to another ROM. These things take time to switch to another configuration. They are much better suited for batch operation, running one task completely before moving on to the next, than multitasking.

Re:Programmable CPU's (2)

Lumpy (12016) | more than 3 years ago | (#34746198)

I'd like to see an FPGA 1x PCI Express daughter-board and an open and well defined interface so that software can reconfigure and then use the FPGAs on the daughter-board for useful PC tasks....

Game using it for high speed calculations, then DVD Fab uses it to crack BluRay encryption faster, Video encoding, Audio encoding, then the browser uses it for encryption, etc....

A nice open standard without greed attached so everyone can use it in their software. Although in the world of many cores not being used, I guess that's the way it will go instead.

Re:Programmable CPU's (1)

conspirator57 (1123519) | more than 3 years ago | (#34746308)

you'll want >=8x PCIe since most interesting applications (especially those that distributed computing is worst at) are IO bound.

problem: FPGA parts big enough to have PCIe and DDR interfaces and still do interesting stuff with are expensive on their own @ $600.

$1100 for the board is as cheap as i've seen yet.

Re:Programmable CPU's (0)

Anonymous Coward | more than 3 years ago | (#34746830)

It's not FPGA, but the uses you are describing sound like what the nVidia Tesla series cards are for. The C2050 has 448 processing cores and 3GB of GDDR5 RAM.

Re:Programmable CPU's (1)

mrmeval (662166) | more than 3 years ago | (#34749434)

You should jump on some of the newer non-volatile FPGAs that can run a microcontroller core. I found one from Lattice who had it on sale for 29.95 with a JTAG programmer. Now they are 50. There are other brands and I'm always looking for cheap dev kits. I think there's a devkit for OMAP from TI that's open source and not too expensive but I'm not finding it.

Re:Programmable CPU's (1)

guruevi (827432) | more than 3 years ago | (#34749442)

They're called FPGA accelerators and they already exist. You just won't find them in your general desktop as the entry level cards cost about as much as a high-end workstation.

Re:Programmable CPU's (1)

ByOhTek (1181381) | more than 3 years ago | (#34746262)

given your last comment - I think pre-defined hardware such as AMD/Intel desktop chips will always be faster than FPGA for a pre-specified set of individual operations. It's only when you get to operation combinations not defined at manufacture time, but used frequently, that FPGA has an advantage.

The current CPU design will stay for most of the work, and an FPGA attachment would handle the specialty work that isn't needed most of the time, and can be dropped.

The issue is reprogramming time and multi-threaded/multiprocessing apps. If you have five applications requiring different FPGA setups, but only two FPGA cores, time sharing could be quite painful.
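A rough sketch of the time-sharing problem described above. It assumes a plain round-robin scheduler, and the slice and reconfiguration times are made-up illustrative values, not measurements from any real part. The function name `useful_fraction` is hypothetical.

```python
from collections import deque


def useful_fraction(num_apps, num_regions, slice_ms, reconfig_ms, rounds=100):
    """Fraction of FPGA region-time spent on real work under round-robin.

    Assumption: a region must be reconfigured (costing reconfig_ms) whenever
    the next app scheduled on it differs from the configuration it holds.
    """
    loaded = [None] * num_regions        # which app's configuration each region holds
    queue = deque(range(num_apps))       # round-robin run queue
    useful = 0.0
    total = 0.0
    for _ in range(rounds):
        for region in range(num_regions):
            app = queue.popleft()
            queue.append(app)
            if loaded[region] != app:    # wrong configuration loaded: pay to swap it
                total += reconfig_ms
                loaded[region] = app
            total += slice_ms
            useful += slice_ms
    return useful / total


# Five apps on two regions: every time slice needs a fresh configuration.
print(useful_fraction(num_apps=5, num_regions=2, slice_ms=10, reconfig_ms=100))
# Two apps on two regions: each region is configured once and then stays put.
print(useful_fraction(num_apps=2, num_regions=2, slice_ms=10, reconfig_ms=100))
```

Under these assumed numbers the five-app case spends most of its time reconfiguring, while the two-app case pays the cost only once per region. That is the same point made earlier in the thread about FPGAs suiting batch operation better than multitasking.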

Re:Programmable CPU's (2)

durrr (1316311) | more than 3 years ago | (#34746528)

A non-FPGA AMD/Intel CPU will always be faster doing general CPU business than an FPGA-implemented one doing the same.
It is however a stupid approach: a CPU is built to do general purpose calculations to allow for all software to exist without specialized hardware. An FPGA on the other hand is made to configure into specialized hardware in order to... well, I guess not having to build a lot of prototypes for hardware testing was its original purpose. But its uses go far beyond that, in that it could turn into the specialized hardware that would make your program run oh-so-bloody-fast.
Ideally we would have an FPGA that instantly reconfigures on the fly and a compiler which turns any and all C code into highly optimized FPGA code. Heck, even a small general purpose FPGA area implemented on motherboards, to be used by games and software as a limited hardware optimization space, could speed up and enable great things, as there are tasks that are MUCH faster when implemented in hardware (I've heard figures of 100-1000 times acceleration for dealing with gene sequencing data, although I guess the 1000x value is for the $2k FPGA chips).

Re:Programmable CPU's (1)

SuricouRaven (1897204) | more than 3 years ago | (#34747154)

The typical home user rarely needs to do any really heavy number-crunching - the closest they get is physics in games. It has definite use in scientific computing and analytics, though - especially as it allows the engineers to constantly improve the programs without needing to get new silicon manufactured. It's a niche into which GPGPU has settled quite happily, though - and it does such a good job, only the most extremely demanding workloads may justify the expense of FPGAs and people with the skills to put them to use. Weather prediction comes to mind - something that doesn't just demand huge amounts of processing power, but has all calculations needed as soon as possible. It's not much use predicting tomorrow's weather if it takes a day to run.

Re:Programmable CPU's (1)

vertinox (846076) | more than 3 years ago | (#34748668)

The typical home user rarely needs to do any really heavy number-crunching - the closest they get is physics in games.

For the past 5 to 8 years there has been a "rasterization vs. ray tracing" debate in the game developing and graphics community (with ray tracing in real time in games only being a theoretical pipe dream until recently).

If someone were to make ray tracing feasible, cheap, and practical for either a console or desktop PC, then yes... Home users will need that number crunching, as ray tracing is an embarrassingly parallel task.

But that might not be for some time...

Re:Programmable CPU's (1)

Lennie (16154) | more than 3 years ago | (#34747938)

Intel already has a line of Atom processors with an FPGA for I/O operations.

Re:Programmable CPU's (1)

Dogtanian (588974) | more than 3 years ago | (#34749226)

How long will it be before we will see the first motherboards with FPGA emerge?

A desktop CPU in an FPGA will always cost more and perform worse (i.e. slower clock rate) than a full custom chip from Intel or AMD.

Sure, but no-one's going to do that anyway- if the OP thought that, then he missed the potential of his own idea.

I thought up something similar a few years back, and realised that, yes, the performance would obviously be horribly uncompetitive and pointless if you simply tried to reproduce (e.g.) an x86 chip's circuitry with an FPGA. The obvious idea (or rather, my idea, which I suspect countless other people also figured out independently) is that the FPGA *circuit* implemented in hardware replaces the *program logic* or software. So you could improve performance for a specific task by programming the FPGA with a direct hardware-based circuit.

Of course, I'm no expert in the area, and as you can see from the reply to this post where I suggested the idea, there are a number of serious problems with it. While I had anticipated that reconfiguration time and sharing the unit between multiple tasks would be issues, Tacvek points out that since we're talking about hardware circuits, it would be quite possible with malicious programming to damage your computer.

I'm guessing (and *only* guessing) that while it might be possible to restrict what "programming" could be carried out on the FPGA, this would seriously limit and cut down the potential.

Re:Programmable CPU's (1)

pantherace (165052) | more than 3 years ago | (#34745774)

You can get an AMD motherboard with a HyperTransport link brought out, and then an FPGA to go into it.

(Just be careful before you look at the prices. They suffer from being a very niche market.)

Re:Programmable CPU's (0)

Anonymous Coward | more than 3 years ago | (#34745796)

Forever. They are very hard to program. People have been touting them for this kind of thing for a long time and they've never taken off. Now GPUs offer most of the data-parallel benefits, and there generally aren't enough bit-wise operations in most algorithms to make them worth it. (numerical encoding in codecs is an exception, but that is already handled better in custom hardware. E.g., Sandy Bridge.)

Re:Programmable CPU's (0)

Anonymous Coward | more than 3 years ago | (#34746502)

They are hard to program, but that's not an inherent problem in the FPGA hardware. It's a software problem.

If model-based design ever takes off on a large scale you could hide the VHDL/Verilog from the programmer and just work with models. Model based design is simple and a lot of fun, except for the 'minor' complication that it fails to consistently generate stuff that synthesizes...

Re:Programmable CPU's (1)

conspirator57 (1123519) | more than 3 years ago | (#34745880)

nice idea, but it will be dirt slow and 10x as expensive.

btw: welcome to 2003 when Xilinx released the Virtex II Pro.

Re:Programmable CPU's (1)

melstav (174456) | more than 3 years ago | (#34747188)

How does last year sound? Granted, it might be a while before they are commonly found on commercially available boards. And as others have pointed out, if you do it in *real* hardware, it will be faster than if you did it in an FPGA. This is more like a customizable coprocessor to the Atom. You could even use it to replace the motherboard chipset, conceivably.

Re:Programmable CPU's (0)

Anonymous Coward | more than 3 years ago | (#34747802)

That sounds like a good approach, though. I always wanted a "wisc" (writable instruction set computer) but just as vliw and wisc were coming out, risc came out and everyone jumped over to that. Now that the riscs have mostly disappeared (ok, it can be argued we are all moving slowly to the ARM), maybe there will be some room for experimentation once again.

1,000 cores (1)

Low Ranked Craig (1327799) | more than 3 years ago | (#34745652)

or 1,000 logic blocks? Are they equivalent? Aren't FPGAs common and generally contain multiple logic blocks?

Re:1,000 cores (2)

Muad'Dave (255648) | more than 3 years ago | (#34745772)

My bet is 1,000 very simple cores - most decent-sized FPGAs contain tens or hundreds of thousands of 'logic blocks'. The Spartan 6 series has between 3,840 and 147,443 logic blocks.

Re:1,000 cores (1)

yurtinus (1590157) | more than 3 years ago | (#34746196)

Tens of thousands of blocks, but how many do they spend implementing their CPU cores? Could be using multiple FPGAs or a very very simple CPU core... I'm more intrigued by the blurb about Intel and ARM developing CPU/FPGA chips - could be a lot of fun with (hopefully) a lot lower cost than a Virtex.

Re:1,000 cores (1)

ace of death (462104) | more than 3 years ago | (#34746226)

"By creating more than 1,000 mini-circuits within the FPGA chip, the researchers effectively turned the chip into a 1,000-core processor - each core working on it's own instructions."
This is entirely feasible, but the 'cores' would have to be very very simple. Looking at the data sheet for the Xilinx Virtex 6 FPGA, it contains 118,560 Configurable Logic Blocks, each of which is equivalent to four Look Up Tables and 8 flip-flops. If you wanted to create an 8-bit instruction set processor, it would require at minimum 16 CLBs just to decode the instructions, then you have to supply more logic blocks to do any actual arithmetic. So it is possible, but they are not talking about cores comparable to a PC.
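To put numbers on that, here is a back-of-the-envelope budget using only the figures quoted in the comment above (118,560 CLBs, four LUTs and 8 flip-flops per CLB, 16 CLBs for decode). It assumes all 1,000 cores share one device evenly and ignores routing, block RAM and I/O, so treat it as an upper bound. The function name is hypothetical.

```python
TOTAL_CLBS = 118_560      # CLB count quoted in the comment above
CORES = 1_000             # number of cores claimed in the story
DECODE_CLBS = 16          # the comment's minimum for 8-bit instruction decode
LUTS_PER_CLB = 4          # per the comment
FLIP_FLOPS_PER_CLB = 8    # per the comment


def per_core_budget(total_clbs=TOTAL_CLBS, cores=CORES):
    """Split the CLBs evenly across cores and report what each core gets."""
    clbs = total_clbs // cores
    return {
        "clbs": clbs,
        "luts": clbs * LUTS_PER_CLB,
        "flip_flops": clbs * FLIP_FLOPS_PER_CLB,
        "clbs_left_after_decode": clbs - DECODE_CLBS,
    }


print(per_core_budget())
```

On these assumptions each core gets 118 CLBs, roughly a hundred of which are left for registers, arithmetic and control once decode is paid for. That fits the "very very simple" description.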

Re:1,000 cores (1)

Muad'Dave (255648) | more than 3 years ago | (#34746374)

I agree, hence the "very simple" in my reply. I bet they are extremely limited, but fast. Other brands/models of FPGAs have different definitions of 'complex' - Altera has some pretty smokin' FPGAs, too.

Re:1,000 cores (1)

durrr (1316311) | more than 3 years ago | (#34746608)

The Xilinx Virtex-7 series supposedly contains up to two million logic blocks. If I've got it right, the Spartan is the Xilinx hobby lineup whereas the Virtex is their industry lineup.

Re:1,000 cores (1)

blair1q (305137) | more than 3 years ago | (#34745834)

FPGAs can be programmed to emulate any logic hardware (logically, though not usually electrically, so power and timing will not be accurate though the logical results will be identical). Many CPU cores have been rendered as library modules that can be programmed into an FPGA. Put 1,000 of them in your FPGA (or big array of FPGAs in this case) and route them together, and you can claim you have a 1,000-core CPU.

Of course, it takes more than one FPGA chip to do this, so you can't in any sense claim a 1,000-core chip, just another 1,000-core system, albeit one that accurately emulates the wiring on a 1,000-core chip.

Star Bridge Systems (1)

Talinom (243100) | more than 3 years ago | (#34745654)

It may be too late, but perhaps someone could talk with Viva Computing, LLC who now owns the assets of Star Bridge Systems. It was not specified in the news release if they also own the intellectual property.

HEAT (0)

Anonymous Coward | more than 3 years ago | (#34745694)

Oooo I bet they run HOT!

Remember the "10 Deca chip?"

Took long enough... (2)

Crudely_Indecent (739699) | more than 3 years ago | (#34745706)

This story was already submitted two times before eldavojohn managed to get it to the front page in a little over an hour...

Re:Took long enough... (2)

seifried (12921) | more than 3 years ago | (#34745840)

Those two submissions are poorly written and have no real detail compared to this one (which is no gem, but is better).

Re:Took long enough... (0)

Anonymous Coward | more than 3 years ago | (#34747042)

Surprisingly, his was chosen over theirs.

Re:Took long enough... (1)

korgitser (1809018) | more than 3 years ago | (#34748350)

There also was some news about a monkey with three asses...

Does anyone have a link... (2)

John Hasler (414242) | more than 3 years ago | (#34745710)

...to a paper that assumes that the reader already knows what a CPU is? This article is content-free.

Life Cycle (4, Interesting)

glueball (232492) | more than 3 years ago | (#34745718)

I think this is a great development. I've been using FPGAs in medical imaging for about 15 years. The groups that use the GPUs are getting great performance--definitely--but seeing as how MRI and CT machines are placed and need to run for 10, 15, 20 years, I don't see how the GPUs will survive that time. One large OEM was pushing the GPUs for their architecture and I can't believe it will be successful if success is measured on the longevity scale. I'm sure the service sales guy will clean up.

Why do GPUs fail? I'm not sure of the exact modes of failure but the amount of heat has got to have something to do with it. FPGAs will run much cooler and in the FLOPS/Watt game, will win.

Re:Life Cycle (0)

Anonymous Coward | more than 3 years ago | (#34746272)

A discrete GPU perhaps... might... have a shorter life. On the other hand, it's a lot cheaper and faster, and can be upgraded, and can be replaced cheaply and easily.

Re:Life Cycle (1)

ByOhTek (1181381) | more than 3 years ago | (#34746384)

If they make the GPU replaceable, it's not such a big deal.

If they underclock the GPU to reduce heat, again, not such a big deal.

A GPU might have an expected 5-10 year lifetime at full throttle, but if you knock it back to 25%, you probably will get a much better survival rate.

Re:Life Cycle (1)

ace of death (462104) | more than 3 years ago | (#34746394)

The drawback with using FPGAs compared to commodity processors is that the FPGA market currently does not support using the bleeding edge processes that CPUs are manufactured with. Typically a competitively priced FPGA will be at least one generation behind a CPU. In HPC, FPGAs are a plausible improvement, but at a smaller scale the development costs for incorporating custom firmware for an FPGA into an application are significant. It all really rests on what demand is out there for a particular algorithm to be implemented as firmware for an FPGA. FPGAs have limited floating point resources, for example the largest Xilinx Virtex 6 FPGA has about 2000 25 x 18 floating point units.

Re:Life Cycle (1)

Apocros (6119) | more than 3 years ago | (#34747544)

Actually, usually FPGAs are on the bleeding edge of manufacturing processes. Intel may have beat everyone to 28/32nm, but expect to see 28nm FPGAs from Altera and Xilinx (from TSMC and/or Samsung) around the same time as 28/32nm ASICs from AMD or nVidia. Intel rolls their own, but everyone else is using the same foundries...

Re:Life Cycle (1)

ChrisMaple (607946) | more than 3 years ago | (#34749580)

FPGAs are much slower and less efficient and bigger than a dedicated design because even the simplest gate is actually a block that can be controlled to perform many different functions. That block consists of several latches and a complex gate, perhaps a hundred transistors in all, whereas a 2-input nand gate consists of four transistors. So it's 25 times bigger (area), and the distance to the next gate is increased by 5x (linear). The complexity makes the block inherently slower than a simple gate, and the increased distance raises load capacitance, also slowing response. On top of that, the path from one gate to the next is also programmable, and that also adds delay and size.

FPGAs will never be as fast as purpose-built ICs with the same technology.
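The arithmetic in the comment above can be checked in a few lines, along with a toy model of why a programmable block is so much bigger than a fixed gate. The transistor counts are the commenter's rough figures. `make_lut` is a hypothetical helper and a simplified picture of a lookup table, not a description of any vendor's actual cell.

```python
import math


def lut_overhead(block_transistors=100, gate_transistors=4):
    """Area and linear-distance ratios of a programmable block vs. a fixed gate."""
    area_ratio = block_transistors / gate_transistors
    linear_ratio = math.sqrt(area_ratio)   # distance scales with the square root of area
    return area_ratio, linear_ratio


def make_lut(truth_table):
    """Model a k-input lookup table: the 'program' is just its truth table."""
    def lut(*inputs):
        index = 0
        for bit in inputs:                  # pack the input bits into a table index
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut


# The same structure becomes a NAND or an XOR depending only on the table loaded.
nand2 = make_lut([1, 1, 1, 0])
xor2 = make_lut([0, 1, 1, 0])

print(lut_overhead())
print([nand2(a, b) for a in (0, 1) for b in (0, 1)])
```

The table storage and the selection logic are there whether you need a NAND or something fancier, which is where the extra transistors and the longer wires come from.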

Re:Life Cycle (0)

Anonymous Coward | more than 3 years ago | (#34746962)

FPGAs are more difficult to program. And the startup overhead does not go down with experience like it should because FPGA tools for different generations of chips are different even when from the same vendor. Whereas GPU tools are becoming standardized (de facto standard CUDA or open standard OpenCL). And some of the new combined CPU/GPU (AMD Fusion) and GPU/chipset products (Nvidia ION2) have brought the FLOPS/Watt up to the point where GPUs are becoming competitive in some of FPGAs' traditional markets. Finally, for the price of a Convey system, as an example, I can buy a large number of graphics cards and still have money left over to power them at full power.

Re:Life Cycle (1)

Bassman59 (519820) | more than 3 years ago | (#34747410)

FPGAs are more difficult to program.

You don't program FPGAs.

FPGA development is synchronous digital logic design. Verilog and VHDL are hardware description languages; they are not programming languages. Having a software-engineering or programming background does not mean you can simply learn Verilog and start doing FPGA design.

Re:Life Cycle (0)

Anonymous Coward | more than 3 years ago | (#34748288)

Yes you do, yes they are, and yes you can.

Re:Life Cycle (3, Interesting)

SuricouRaven (1897204) | more than 3 years ago | (#34747208)

I don't see why an MRI machine processor can't be made fault-tolerant. If a GPU burns out, it could just be disabled and a fault warning indicated - and then the machine can carry on working, even if it does take significantly longer to produce an image. Then you call tech support, they come around and pull the faulty part and slot in a new one. The only concern then is making sure parts are available in twenty years - and I imagine any machine that expensive has to come with a long-term support contract which will oblige the manufacturer to ensure a supply of compatible boards in years to come.

Re:Life Cycle (1)

glueball (232492) | more than 3 years ago | (#34749888)

Two things--if there's a failure, then there's a problem. The machines for years used to use military grade hardware. Machines that were designed in 1992, sold in 1994 are still running strong today. Then to cut costs, the OEMs switched to more commodity hardware and they've effectively sucked in uptime since. You make it sound like it's no big deal to call tech support. It is a big deal. To put it in dollar terms, we had a machine go down for technically 4 hours. The tech was there, made the diagnosis of the failure, and turned the machine off until it could be repaired. That took 4 days. The outcome? $100K in revenue put at risk, much of it lost.

Second, those long term support contracts are extremely expensive and no, they do not keep compatible boards available for years to come. They depend on users upgrading their machines and selling their old machines back. Well, people are holding on to their machines a lot longer than they used to and it's putting the OEMs into a corner. I know several of the suppliers to the OEMs and they've told the OEMs to take a flying leap about long term support. Long term support isn't free, it isn't trivial, and the major OEMs are reaching a crunch point of inventory available for spares.


Re:Life Cycle (1)

shadow_slicer (607649) | more than 3 years ago | (#34749376)

Are you serious? There's no way FPGAs beat GPUs in FLOPS/watt.

FPGAs have so much more overhead both in space and power due to programmability, whereas GPUs are pure processing. Further the algorithms necessary for CT and MRI are practically the same algorithms GPUs were designed for, so if you were to use an FPGA, your design would end up with a similar architecture anyway. Further, while low end commercial GPUs (like those you and I use for gaming), may only last 3-4 years, the high end scientific computing GPUs (Tesla, etc.) generally run until they're obsolete.

So my bet is that the FPGA would die sooner:
1. FPGAs have more that can go wrong.
2. FPGAs will run hotter than the equivalent ASIC.
3. FPGAs will also run slower, so they will run hotter for longer periods.

Story is all hype and no content (0)

Anonymous Coward | more than 3 years ago | (#34745724)

What is particularly new about FPGAs w/ dedicated circuits being faster than general-purpose circuits? This just in - ASIC circuit implementing FPGA circuit is 1000s of times faster and more energy efficient.

It seems like the particular news of this story is that each "core" (this article is using lay-notation, so it's hard to pinpoint what exactly they mean by the term processing core) was given some memory. So this seems to be more an article examining distributed computing & how that can be done within an FPGA.

FPGA vs. GPU? (0)

Anonymous Coward | more than 3 years ago | (#34745768)

What are the practical differences between targeting an FPGA on a computing platform and targeting more ubiquitous massively-parallel programmable pipelines in modern GPUs? Also, what are the fundamental differences? Could my GPU already contain FPGAs?

Re:FPGA vs. GPU? (2)

raftpeople (844215) | more than 3 years ago | (#34746402)

Not all problems map well to current GPU offerings. I have a problem that would benefit from parallel processing but due to a branchy algorithm and very random access for read/write, I can't really take advantage of GPUs to the extent some algorithms can. (Note: I have coded and run it on GPUs, so this is more than just theory. Additionally, I have coded it to run on a network of computers, and unfortunately the calc time vs network transmission time ratio for each cycle is not favorable enough for that to be a very good solution either. The best solution is many cores accessing the same memory.)

For this particular problem, a large number of minimally functional "cpus" or "cores" would be ideal, some basic math, logic and branching. An FPGA is one way to try to achieve something like this.
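A simplified model of why branchy code hurts on lockstep hardware but not on many small independent cores. It assumes a group of lanes that must execute the same instruction stream, so a divergent branch costs the sum of both paths, while independent cores each pay only for the path they take. The cycle counts and function names are made up for illustration.

```python
def lockstep_cycles(takes_branch, then_cost, else_cost):
    """Cycles for one lockstep group: each path runs if any lane needs it."""
    cycles = 0
    if any(takes_branch):           # at least one lane takes the 'then' path
        cycles += then_cost
    if not all(takes_branch):       # at least one lane takes the 'else' path
        cycles += else_cost
    return cycles


def independent_cycles(takes_branch, then_cost, else_cost):
    """Cycles when every lane is its own simple core: wait for the slowest one."""
    return max(then_cost if taken else else_cost for taken in takes_branch)


divergent = [True, False] * 4       # neighbouring lanes disagree on the branch
uniform = [True] * 8                # every lane agrees

print(lockstep_cycles(divergent, 10, 30), independent_cycles(divergent, 10, 30))
print(lockstep_cycles(uniform, 10, 30), independent_cycles(uniform, 10, 30))
```

With uniform branches the two models tie. With divergent branches the lockstep group pays for both paths, and nested branches make it worse. Random memory access is a separate problem this sketch does not cover.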

Re:FPGA vs. GPU? (0)

Anonymous Coward | more than 3 years ago | (#34747128)

nVidia has a 240 core GPU card that runs about $1,300 USD. Is that a large enough number? If not, they also have a 448 core card at about $2,500. I haven't done GPU development myself, but those cards do have their own memory, and I would imagine all of the cores can access it.

Re:FPGA vs. GPU? (1)

Bassman59 (519820) | more than 3 years ago | (#34747432)

What are the practical differences between targeting an FPGA on a computing platform and targeting more ubiquitous massively-parallel programmable pipelines in modern GPUs? Also, what are the fundamental differences? Could my GPU already contain FPGAs?

The main difference is that you don't program FPGAs. You do synchronous digital logic design which is implemented in the FPGA fabric. Thinking that you can program them like you program a sequential-execution processor is a recipe for failure. And, yeah, C-to-gates tools are a joke.

Disappointment (4, Funny)

TheL0ser (1955440) | more than 3 years ago | (#34745776)

The story's been up for 20 minutes and no one's tried to imagine a Beowulf cluster of them yet? This is a great sadness.

Old Meme is Old (0)

Anonymous Coward | more than 3 years ago | (#34745810)

Don't you mean you are dissapoint?

Re:Old Meme is Old (0)

Anonymous Coward | more than 3 years ago | (#34745970)

'dat's da point!'

Re:Disappointment (0)

Anonymous Coward | more than 3 years ago | (#34747538)

That's because we've had such a thing for a long time.

Ok it is not a beowulf cluster, but they are supercomputers nonetheless.

YouTube Algorithm (1)

digitaldc (879047) | more than 3 years ago | (#34745806)

The researchers then used the chip to process an algorithm which is central to the MPEG movie format – used in YouTube videos – at a speed of five gigabytes per second: around 20 times faster than current top-end desktop computers.

20x speed is getting closer to what I need before I can even ATTEMPT to build my very own Holodeck.

Re:YouTube Algorithm (1)

SuricouRaven (1897204) | more than 3 years ago | (#34747222)

This is slashdot. Do you really think anyone here doesn't know what a holodeck is? Half the users have probably tried to design one.

FPGAs ... (1)

Bassman59 (519820) | more than 3 years ago | (#34745828)

Yawn. Seriously.

(says the guy who does FPGA design for a living.)

FP failure (0)

Anonymous Coward | more than 3 years ago | (#34745860)

I would have gotten first post, but I was still waiting for my 'cat /proc/cpuinfo' to finish.

first (1)

iamacat (583406) | more than 3 years ago | (#34745882)

We first need to break a lock of x86 instruction set and the operating system that requires it. CPUs already try to execute multiple x86 instructions in parallel, but this is severely limited by sequential instruction set design. There needs to be a way to express computation A and B using different sets of virtual registers and let hardware execute them sequentially or in parallel depending on its capabilities, or vectorize/parallelize multiple iterations of a loop. If software, including operating systems, is coded in higher level virtual machine bytecode interpreted by hypervisor, a lot of parallelism can be expressed for future use while still permitting efficient execution on current hardware. LLVM is a good start, although it needs a lot more concurrency/vectorization information to take advantage of coprocessors, GPUs and massively parallel architectures.

Re:first (1)

VortexCortex (1117377) | more than 3 years ago | (#34746218)

We first need to break a lock of x86 instruction set

Yep. All hail ARM.

There's a reason why embedded devices use ARM over x86. The x86 instruction set has a lot of instructions that no compilers (and therefore hardly anyone) ever use. Those unused instructions are just sitting there in the silicon, charged up with electrons, draining power, generating heat, and making it harder to create smaller & faster x86 chips. Some of these "deprecated" instructions are microcoded, but that just means they're slower and even less likely to be used by an optimizing compiler.

Chips run fast when you can keep the cache full and just shove instructions down the pipeline.

Branch prediction is needed to guess which branch of instructions to cache. Miss a prediction on x86 and the processor has to flush the unneeded instructions and load in the correct ones from memory.

On many ARM implementations, instructions have an execute bit. Let's say you make a prediction that a JNZ will branch. So you set a prediction register to "1", and shove that branch of code along with a "1" execute bit for all instructions. If you fail the prediction, the register gets set to "0" and all the "1" flagged instructions are skipped right over -- execution bits don't match, then don't execute.

Instead of flushing the pipeline as on x86, on ARM we can just start shoving the correct instructions down the pipe with execution bits set to "0". The pipeline continues as normal, no exceptional case occurs. The interesting part is that we can load both branches into the pipeline and set their execution bits ahead of time on ARM, on x86 we have to wait to see if the prediction was correct or not.

You gain speed when you keep that pipeline stuffed full.

Seriously, x86 is SO FREAKIN' OLD, it needs to die already. It's showing its age so much that we actually design around its shortcomings to get faster chips! It's ancient, on its last legs, and it's slowing down the whole herd. I can't wait for it to get taken out by a predator.

Re:first (2)

smallfries (601545) | more than 3 years ago | (#34746500)

Sigh. Multi-way branching was already old when ARM implemented it. What you fail to explain (understand?) is that there is a cost associated with either choice. As with most of engineering there is not a simple proposal that wins. In the case that branch prediction is perfect, the predicted execution is cheaper. In the case that the prediction is terrible the multi-way execution wins. In real life branch prediction is neither perfect, nor is it that terrible, so engineers have to balance the likelihood that one technique will be better for a given type of code against the probability that the processor will execute that type of code.

Guess where multi-way branching wins? In small working-set loads typical of embedded processing applications. Guess where branch prediction wins? On a more general set of benchmarks typical of desktop computing.
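The trade-off described above can be put into a rough cost model. This is only a sketch under simplifying assumptions: the function names, the 16-cycle flush penalty and the 4-cycle path lengths are made up for illustration and do not describe any real ARM or x86 core.

```python
def predicted_cost(path_cycles, miss_rate, flush_penalty):
    """Expected cycles when only the predicted path is issued.

    A misprediction costs an extra pipeline flush (hypothetical penalty).
    """
    return path_cycles + miss_rate * flush_penalty


def both_paths_cost(taken_cycles, not_taken_cycles):
    """Cycles when both sides of the branch are issued and one is discarded."""
    return taken_cycles + not_taken_cycles


def break_even_miss_rate(other_path_cycles, flush_penalty):
    """Miss rate above which issuing both paths becomes the cheaper choice."""
    return other_path_cycles / flush_penalty


# Illustrative numbers only: two 4-cycle paths, 16-cycle flush penalty.
good_predictor = predicted_cost(4, 0.125, 16)   # well-predicted branch
bad_predictor = predicted_cost(4, 0.5, 16)      # coin-flip branch
both = both_paths_cost(4, 4)
```

With these made-up numbers the well-predicted branch beats issuing both paths, and the coin-flip branch loses to it, which is the balance engineers have to strike for the code they expect to run.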

Your other point is equally incorrect - the decoding overhead for x86 is now minimal (a few percent of the size of the core). However the x86 instruction set is very good at packing lots of code into a small amount of space, which given the effectiveness of the x86 instruction cache is why it destroyed most of the pure-RISC competition. Those unused instructions are not "sitting in the silicon" as you put it rather idiotically. What they are doing is sitting in longer instruction words in the x86 encoding allowing the more frequently used instructions to be encoded in less space. It's a very simple form of compression, and as with other engineering tradeoffs it can win and lose in different circumstances but in the particular case of most x86 benchmarks it beats other instruction encodings.

Seriously, x86 is SO FREAKIN' OLD, it has been finely tuned and matured with age.

Re:first (0)

Anonymous Coward | more than 3 years ago | (#34747792)

Your post contains a mix of wisdom and BS.

> the decoding overhead for x86 is now minimal (a few percent of the size of the core)
"A few percent" is a big deal. Especially since it's still a few percent, even though the cores are getting bigger. And in using more cores per package, well, that few percent is duplicated in each core.

> which given the effectiveness of the x86 instruction cache is why it destroyed most of the pure-RISC competition
You've made a critical failure here: the x86 *instruction* cache stores x86 instructions after they've been decoded into simpler RISCy form. This is something brutally necessary in x86 chips, to reduce the amount of instruction decoding done and allow the other nifty techniques used elsewhere (which would be pretty much impossible if the actual chip was running pure x86 instead of exchanging it for something sane right as it enters the chip). In other words, it's not an advantage - it's a workaround that almost counters an inherent *dis*advantage. "Almost" because, using your quote, it's still costing a few percent of the chip (instead of the much larger percentage that would otherwise be needed).

Re:first (1)

smallfries (601545) | more than 3 years ago | (#34748164)

Well.... no. A few percent is a small deal. A larger percentage would be a bigger deal.

You've made a critical failure here: the x86 *instruction* cache stores x86 instructions after they've been decoded into simpler RISCy form

Yes - after they've been transferred across the bottleneck from memory. So at the point where it matters (the cache fetching lines from memory) the code is in a dense form because of the CISC encoding.

It's really quite simple: RISC is an advantage where the cost of decoding dominates because it simplifies the decoder circuitry. CISC is an advantage where the cost of transferring instructions (and the space that they occupy in memory) is the dominant cost because it reduces the size of the code.

Wisdom and BS, eh? That's the nicest thing I've heard on slashdot in a while...

Re:first (1)

Lemming Mark (849014) | more than 3 years ago | (#34746806)

Admittedly slightly tangential to your discussion of virtual machines ... but part of the point of Intel's IA64 instruction set was to address this kind of thing. The compiler's job was to specify groups of instructions that could be executed safely in parallel, then the CPU would execute these according to its capabilities.

But a higher-level virtual instruction set with just-in-time compilation is admittedly more insulated against future technology and more amenable to the code being run on a variety of add-in components, such as GPUs.

I think they ought to have used dynamic recompilation for the IA32 compatibility, it might even have given them better performance.

1000 CPUs? (0)

Anonymous Coward | more than 3 years ago | (#34745908)

As this is made of an FPGA, I'm assuming they may be employing a rather broad definition of the term CPU. I'd imagine it's probably more like 1000 DSPs, and their choice of an MPEG algorithm as a demonstration piece would seem to reinforce this assumption.

I'm not saying I wouldn't bite their arm off for a chance to play with it, but this is probably more akin to shifting non-branching code off onto the GPU than it is to general purpose computing.

good grief - maybe rediscover integration as well? (1)

Anonymous Coward | more than 3 years ago | (#34745920)

"However, we are already seeing some microchips which combine traditional CPUs with FPGA chips being announced by developers, including Intel and ARM."

welcome to 2004.

Xilinx Virtex II, includes internal PPC 405GP

Security issues (2)

bluefoxlucid (723572) | more than 3 years ago | (#34745960)

A programmable hardware platform would provide amazing computing power because of hardware specialization: rather than emulating a proper CPU, you would download core architecture into the FPGA to accelerate tasks such as REGEX processing or H.264 decoding. You could compile the entire logic of a program into a gate array with various logical operators and flip-flop circuits for unlimited (albeit slow) registers (L2 registers) as well as including standard registers and SRAM cache (L1).

Although the FPGA runs slower than a regular CPU, direct programming rather than instructional programming (that is logic blocks that perform programmatic functions, rather than logic blocks that interpret discrete instructions to follow programmatic functions) would shorten the overall hardware logic path. In short, the chip would follow fewer clock cycles and instead just "do things." The CPU would be slow, but optimized for your workload. The main performance bottleneck would be the context switch: replacing the logic gate configuration with a new program every time you switch. Other than that, dynamic program expansion could be utilized: inlining operations like multiplication, addition, etc, or breaking them out if space constraints make it hard to load the whole program onto the FPGA that way.

The obvious, major issue we see is, of course, a security issue. You can now reprogram the CPU. This makes it difficult to prevent a program from bypassing any and all hardware security measures. This is solved by implementing a completely new security design on the chip, by which the CPU itself (the FPGA) is under control of external security mechanisms (paging etc handled in the MMU, outside the FPGA space, would largely mitigate most of this); it's not impossible to deal with, it's just an issue that needs to be raised.

In short, this sucks for "download the new Intel CPU into your BIOS/bootloader." This sucks for whatever general purpose CPU you can think of. For an entirely new programmatic platform, however, this would provide some interesting performance possibilities, and some interesting challenges.

Re:Security issues (1)

smallfries (601545) | more than 3 years ago | (#34746630)

This sounds reminiscent of the hype around reconfigurable computing ten years ago. A lot of the hype has died down now that people have tried and discovered that what you've described is wrong.

First point: a specialised hardware circuit will always be faster than a generic circuit.

Second point: generic circuits require a lot more interconnect than specialised circuits which impacts how many of them you can fit on a die relative to specialised circuits.

Third point: a CPU is a set of specialised circuits being selected for execution by the program.

So here is the basic problem. If the target application is made of steps that exist as specialised circuits in the CPU then selecting which of those circuits to apply in sequence will be faster than a generic circuit because the specialised circuit uses the space on the die more effectively and is clocked at a much higher frequency.

If the target application is made of steps which are very unlike the circuits provided on the CPU then the generic design will win. For everything in-between it is a trade-off. Not as many things win as FPGA designs and there is ten years of literature showing marginal improvements.

The problem is not some far-fetched security issue - the real problem for FPGAs is that the instruction set chosen for most processors covers most problems adequately well. There are very few corner cases where an FPGA can win convincingly, and certainly no demand for a reconfigurable processor.

Re:Security issues (1)

bluefoxlucid (723572) | more than 3 years ago | (#34747206)

So here is the basic problem. If the target application is made of steps that exist as specialised circuits in the CPU then selecting which of those circuits to apply in sequence will be faster than a generic circuit because the specialised circuit uses the space on the die more effectively and is clocked at a much higher frequency.

If the target application is made of steps which are very unlike the circuits provided on the CPU then the generic design will win. For everything in-between it is a trade-off. Not as many things win as FPGA designs and there is ten years of literature showing marginal improvements.

Encryption is a lot of things on the CPU that are faster in hardware, because it takes a single clock cycle to do things that take 30,000 clock cycles on the CPU.

Regex calculation, faster in a specialized hardware chip.

Codec decoding, we use an off-board CPU that has a microkernel and a small program; it benefits from just not running an OS and being a dedicated RISC processor, but in no other way.

GPU, specialized instruction set. Not dedicated to a specific task, but dedicated to a type of task. WAY faster than a regular CPU for that task.

An FPGA configured to run AES at 100MHz will operate faster than a computer calculating AES at 1000MHz if a single round of AES takes fewer than 1/10 as many cycles on the FPGA as on the CPU. This is not very far-fetched, since implementing AES in hardware is QUITE efficient.

You have to realize that a generic processor might, for example, spend 13 cycles doing DIV or MOD; it might spend 2000 cycles calculating a particular hash round because it has to do ADD, SHL, MOD, XOR, XCHG, MUL... on a specific set of data so wide. A specialized chipset might take 50, 100, 200 cycles... so if it's at the same clock rate it's 10, 20, 40 times as fast. That the FPGA may be slow doesn't much matter if it can implement an efficient gate logic that performs a task without any sort of instruction decoding or other overhead-- or really, with instructions such as "ADDSHLMULMODXORXCHMUL" that run in 6-8 cycles instead of 2-3 cycles for each of 6 separate instructions (i.e. 12-18 cycles, and note MOD and MUL are rather long 8-15 cycle operations on their own).
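The arithmetic behind this comparison is just cycles divided by clock rate. A minimal sketch, using the cycle counts quoted above and assuming (as in the AES example) a CPU at 1000MHz and an FPGA at 100MHz; the function name is hypothetical:

```python
from fractions import Fraction


def speedup(cpu_cycles, cpu_hz, hw_cycles, hw_hz):
    """How many times faster the specialized circuit finishes one operation.

    Time per operation is cycles / clock, so the ratio is exact with Fractions.
    """
    cpu_time = Fraction(cpu_cycles, cpu_hz)
    hw_time = Fraction(hw_cycles, hw_hz)
    return cpu_time / hw_time


CPU_HZ = 1_000_000_000   # 1000MHz general-purpose CPU
FPGA_HZ = 100_000_000    # 100MHz FPGA (assumed, from the AES example)

# 2000 CPU cycles versus 50, 100 or 200 cycles in specialized logic.
same_clock = [speedup(2000, CPU_HZ, c, CPU_HZ) for c in (50, 100, 200)]
slow_fpga = [speedup(2000, CPU_HZ, c, FPGA_HZ) for c in (50, 100, 200)]
```

At the same clock the ratios are 40, 20 and 10. At one tenth of the clock they drop to 4, 2 and 1, so the specialized circuit has to need fewer than one tenth of the cycles before the slower part actually comes out ahead.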

It's the same as how SSE is faster for specific data sets, or floating point instructions are faster, etc. You create highly specialized instructions that pipeline in stages... you could even make a circuit with a sequence of 16 "instructions" comprising one operation with 16 registers, so that rather than playing with registers in a pipeline you can actually, say, ADD, MUL, MOD, SHL, XCHG... and when you perform MUL, the ADD for the NEXT item in the data set begins in that pipeline. Some things on the CPU happen this way; other things simply cannot due to lack of internal storage or internal design; and in some cases the output relies on the input so this is simply not possible. To pipeline so efficiently, however, is to make the entire set of 300000000000 instructions to perform 1 step in a multi-step operation take however long the slowest single instruction takes (i.e. completely decode a full frame of MPEG video in the 13 cycles it takes to perform a single division operation), provided you have enough internal storage to run the pipeline on that much data.
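The pipelining claim can be sketched as a toy model. It assumes independent data items, unlimited buffering between stages and no stalls, and the stage latencies below are invented for illustration (only the 13-cycle division comes from the text):

```python
def sequential_cycles(stage_cycles, items):
    """Every item runs all stages before the next item starts."""
    return sum(stage_cycles) * items


def pipeline_cycles(stage_cycles, items):
    """Ideal pipeline: the first item pays the full latency, and each
    later item completes one slowest-stage interval after the previous one."""
    if items <= 0:
        return 0
    return sum(stage_cycles) + (items - 1) * max(stage_cycles)


# Hypothetical ADD, MUL, MOD, SHL, XCHG latencies in cycles.
stages = [1, 4, 13, 1, 1]
unpipelined = sequential_cycles(stages, 1000)
pipelined = pipeline_cycles(stages, 1000)
```

For long runs the cost per item approaches the slowest stage (13 cycles here) rather than the sum of all stages (20). The model breaks down as soon as one item's input depends on the previous item's output, which is the case noted above where this is simply not possible.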

Re:Security issues (1)

smallfries (601545) | more than 3 years ago | (#34747650)

It's odd that you pick crypto as I've spent a little time implementing crypto primitives on weird and exotic hardware. Sure - division is quite slow, that is why most primitives avoid the need for it, or only perform reductions in a specialised field rather than a full division. Multiplication on the other hand is fast and tends to be used a lot.

AES is quite a bad example for FPGAs. The very latest AES extensions from Intel can compute a round of AES in under three clock cycles. Performing the full cipher takes less than twenty clock cycles (on a processor running in excess of 3GHz). No FPGA in the world can keep up with that performance.

Even if you do implement AES on an FPGA there is a basic issue caused by operations on the state matrix being defined over 8-bit elements while the lookup tables in FPGAs are generally 4- to 6-input tables. This means splitting the element operations over multiple slices in the FPGA so the entire round needs to be unfolded over a large chunk of the available space. Even then you've just made the propagation delay in the circuit equal to the clock period of the device, while in the x86 equivalent the circuit propagation delay is somewhere under 0.33ns.

I can't be bothered to look up the latest results, but Bernstein's code on the x86 is probably still the fastest for general chips at about 9 cycles/byte. Using bit-slicing can take that down to about 5 cycles/byte or about 800MB/s. The biggest bottleneck on an FPGA is the terrible clock-speed imposed by the terrible layout tools that you are forced to use. Ignoring real I/O performance, can the latest FPGA boards get close to that speed? It would mean delivering 32 bits/cycle on a 200MHz clock, which is a lot better than any AES-on-FPGA papers that I've read in a while.

Of course the new extensions blow that away, as 3 cycles/round is slightly less than 1 cycle/byte or 3GB/s on a newer processor. As that is my main area of expertise I'll comment less on your other examples. Your basic argument about pipelining in general is wrong though.
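The throughput figures in this comparison come from two simple conversions. A small sketch, with hypothetical helper names, that reproduces the 32 bits/cycle at 200MHz and the 1 cycle/byte at 3GHz cases:

```python
def throughput_from_bits_per_cycle(clock_hz, bits_per_cycle):
    """Bytes per second for hardware that emits a fixed number of bits each cycle."""
    return clock_hz * bits_per_cycle / 8


def throughput_from_cycles_per_byte(clock_hz, cycles_per_byte):
    """Bytes per second for software measured in cycles per byte."""
    return clock_hz / cycles_per_byte


# FPGA target from the text: 32 bits per cycle on a 200MHz clock.
fpga_bytes_per_s = throughput_from_bits_per_cycle(200_000_000, 32)

# CPU with AES instructions, taking the quoted 1 cycle/byte at 3GHz.
cpu_bytes_per_s = throughput_from_cycles_per_byte(3_000_000_000, 1)
```

The first works out to 800MB/s and the second to 3GB/s, so under these figures the FPGA would need to be several times faster than the stated target just to draw level.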

In order for there to be a large improvement in speed from pipelining combinations of instructions rather than single instructions you must have some overhead in each cycle that is time being wasted for synchronisation. Time that could be reclaimed by combining the stages. There just isn't that much overhead to be reclaimed in a modern x86 design. The simpler operations feature dual-despatch and the more complex operations use larger circuits to ensure that their execution time is as close to a multiple of two executions of the simpler operations as possible.

Unless you are exploiting overhead within the cycle then in order to be more efficient you have to exploit overhead between cycles. Have you looked closely at the despatch patterns on a modern x86 core? It is very hard to stall the pipelines and the ROB is deep enough to make many operations invisible from a performance perspective. And of course this is all running on a 3GHz clock with very low latencies (less than 4 cycles for a MUL).

As I said originally there are applications where FPGAs can win: but they are nowhere near as common as people thought. The advantages just don't stack up when you compare them against specialised circuits. In particular the horrific bottlenecks on clock-speed that using that much interconnect creates mean that you need a particularly suitable application to beat a specialised CPU at all - and the chances for a reconfigurable processor that can win on many applications are slim to remote.

Re:Security issues (0)

Anonymous Coward | more than 3 years ago | (#34746884)

> ... you would download core architecture into the FPGA to accelerate tasks such as REGEX processing ...

I want to know what sort of REGEX processing you do! :)

Re:Security issues (0)

Anonymous Coward | more than 3 years ago | (#34747018)

You realize this isn't new right?

The only thing they've done is found an FPGA with enough gates and implemented a soft CPU with sufficiently tiny amount of functionality so they can fit the two things together.

Hell, you can find Open source FPGA cpu implementations on the Internet if you put a tiny bit of effort into it.

FPGAs are great for prototyping or when you have an algorithm that needs to change dynamically ... which is an EXTREMELY TINY set of applications.

In almost every case, you're better off in every way by making a standard ASIC after you've got the code for the chips worked out. FPGAs are slower, less reliable, more expensive and less energy efficient ... but ... they are programmable. Once you remove the programmable part, they are a bad device to use.

Re:Security issues (1)

Sulphur (1548251) | more than 3 years ago | (#34747578)

Once upon a time, there was a writable control store computer from TI.

Someone wrote a Super Compiler that produced microcode directly instead of producing instructions as usual.

Perhaps something like this can come back.

Re:Security issues (1)

owlstead (636356) | more than 3 years ago | (#34748934)

I could see this being used by a driver model. A generic driver is present that is able to reprogram the FPGA. Specialized or even derived drivers use the - now static - set of functionality. This could allow you to create generic purpose CPU's that can still be tweaked for certain tasks. It would also allow for upgrades of the algorithms being implemented. Symmetric cryptography and encoding/decoding would be obvious choices.

If updating the FPGA is really slow, I would not try and let applications change the functionality of the FPGA - nobody would like to have slow context switches, whatever the CPU is used for.

Impressive specs (0)

Anonymous Coward | more than 3 years ago | (#34745994)

It's almost enough to run Vista!

Yes...but (0)

arhhook (995275) | more than 3 years ago | (#34746018)

Does it run Linux?

Re:Yes...but (0)

spider256 (1560145) | more than 3 years ago | (#34746040)

Does it run Crysis?

Re:Yes...but (1)

Muad'Dave (255648) | more than 3 years ago | (#34746526)

Yes.

A New Chip (1)

b4upoo (166390) | more than 3 years ago | (#34746330)

Now we need a chip that can take any given problem and divide it into one thousand parts so we can feed it into these processors. -Gives me a headache!

Re: A New Chip (1)

FeepingCreature (1132265) | more than 3 years ago | (#34746418)

Now we need a chip that can take any given problem and divide it into one thousand parts so we can feed it into these processors. -Gives me a headache!

It's called a "programmer".

Re: A New Chip (1)

fpgaprogrammer (1086859) | more than 3 years ago | (#34746556)

Now we need a chip that can take any given problem and divide it into one thousand parts so we can feed it into these processors. -Gives me a headache!

It's called a "programmer".

It's called a "fpgaprogrammer"

get it right.

Re: A New Chip (1)

blair1q (305137) | more than 3 years ago | (#34746672)

For that you need a 1001-cpu chip.

Ironically... (0)

Anonymous Coward | more than 3 years ago | (#34746496)

this will be much more of a pain for the developer side. Can you imagine writing a full raytracer with VHDL, not to mention having to design and implement your own clocking scheme? Will be interesting to see just how many developers really take advantage of it, at least besides the hardcore HDL devs....

Wow, this will end useful software! (1)

darthwader (130012) | more than 3 years ago | (#34746760)

Software developers have barely figured out how to write single threaded algorithms without crashing. Now we are seeing more multithreaded algorithms with race conditions, deadlocks and other data-sharing bugs.

Can you imagine what will happen if every desktop machine has one or two FPGAs available for programs to use as needed?

PHB says "Hey, I've heard that you can make the program faster if you program custom hardware on the motherboard's FPGA. Get the new intern to write some FPGA code for our algorithms, and then re-write the module to use it. We'll ship it next month!"

Multicore processors have made software development much more difficult, and putting an FPGA there will make it another two orders of magnitude more difficult. And programmers aren't getting smarter nearly as fast as the hardware is getting more complicated to program.

Re:Wow, this will end useful software! (1)

Nadaka (224565) | more than 3 years ago | (#34747252)

Multi threaded computing is not rocket science. Most bad multi threaded programming is bad because a lot of so called "software developers" just plain suck.

Re:Wow, this will end useful software! (0)

Anonymous Coward | more than 3 years ago | (#34748562)

Guess what? Most software developers DO suck; most people suck at what they do regardless of what it is. That fact needs to be taken into account in API and language design if it's to be used by the majority of programmers.

Re:Wow, this will end useful software! (1)

Nadaka (224565) | more than 3 years ago | (#34749336)

Sorry, but to make parallelism painless, you have to restrict the language in ways that make a lot of other things painful.

A language where every method call is a perfect closure is easily made parallel, the only question left is what granularity of parallelism will produce a gain when considering the overhead of managing threads. It also introduces a lot of overhead for constantly copying data on methods you are not going to be making parallel and rendering it slower for some to many applications when compared to a traditional scope and reference model.

Here are rules for making code parallel. Use closures if practical, even if you roll your own. If not, you need to know the parallelism syntax and quirks of your language. Finally you need to know when to NOT use synchronization as well as be able to determine if you are making any gains.
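As a rough illustration of the closure rule: when each unit of work reads only its own arguments and returns a value, it can be farmed out without locks. The names below are hypothetical, and in standard CPython builds threads will not actually speed up a CPU-bound loop like this one, so the sketch shows the structure only, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_and_sum(chunk, factor):
    """Pure function: reads only its arguments and touches no shared state."""
    return sum(x * factor for x in chunk)


def parallel_total(data, factor, workers=4, chunk_size=1000):
    """Split the data into chunks and combine the independent partial results."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The lambda closes over `factor`; each call gets its own chunk.
        partials = pool.map(lambda c: scale_and_sum(c, factor), chunks)
        return sum(partials)


total = parallel_total(list(range(10)), 2, chunk_size=3)
```

The chunk size is the granularity knob mentioned above: too small and the overhead of handing out work dominates, too large and some workers sit idle.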

Re:Wow, this will end useful software! (0)

Anonymous Coward | more than 3 years ago | (#34747310)

Multicore processors has made software development much more difficult for bad programmers


And, to think... (2)

GodfatherofSoul (174979) | more than 3 years ago | (#34747036)

Ten years ago some young 6-digit ID Slashdotter was getting modded down for suggesting a Beowulf cluster of cores. Who's laughing now, mods?!?!?

huh (1)

buddyglass (925859) | more than 3 years ago | (#34747356)

Without digging for any additional information, it bugs me that this chip has 1000 cores and not 1024.

What about an all core chip? (2)

ka9dgx (72702) | more than 3 years ago | (#34747828)

The ultimate end to this trend is to build a system that is just core processing logic, with logic and memory all fused as closely as possible. I call it the BitGrid... it consists of 4-bit lookup tables hooked into an orthogonal grid. Because every single table can be used simultaneously, there is no von Neumann bottleneck to worry about.

Petaflops... here we come.... !

Re:What about an all core chip? (0)

Anonymous Coward | more than 3 years ago | (#34749520)

use a hypercube and just call it a connection machine

Re:What about an all core chip? (1)

Jepler (6801) | more than 3 years ago | (#34749692)

You've just described the FPGA. Large areas of an FPGA are devoted to thousands of almost-identical functional blocks ("slices" in xilinx parlance). For instance, in one Xilinx family, a slice contains a 4-input LUT, a flip-flop (1 bit of memory, called an FF), and other specific gates that help implement things like carry chains, shift registers, and some 5+input functions the chip designers thought were commonly encountered.
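For anyone who hasn't met a LUT: a 4-input lookup table is just a 16-entry truth table indexed by its inputs. A minimal software sketch follows; the helper names are made up, and this models only the table itself, not the flip-flop, carry chain or routing of a real slice.

```python
def make_lut4(func):
    """Build the 16-entry truth table for a 4-input boolean function."""
    table = []
    for index in range(16):
        a = (index >> 3) & 1
        b = (index >> 2) & 1
        c = (index >> 1) & 1
        d = index & 1
        table.append(func(a, b, c, d) & 1)
    return table


def lut4(table, a, b, c, d):
    """Evaluate the LUT: the four input bits select one table entry."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


# One-bit full adder built from two LUTs; the fourth input is left unused.
sum_lut = make_lut4(lambda a, b, cin, _unused: a ^ b ^ cin)
carry_lut = make_lut4(lambda a, b, cin, _unused: (a & b) | (cin & (a ^ b)))

sum_bit = lut4(sum_lut, 1, 1, 0, 0)
carry_bit = lut4(carry_lut, 1, 1, 0, 0)
```

"Programming" the device amounts to filling in tables like `sum_lut` and deciding which outputs feed which inputs, which is why the tooling problems listed below are the hard part.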

Other areas contain "block RAMs" and "DSP cores" (basically, dedicated multipliers).

But now you've got yourself a lot of hard problems to solve: how to dice a program into something that is represented by essentially LUTs and FFs. how to recognize when a special function outside the LUT, like a carry chain, should be used. how to efficiently route the signals from where they're produced to where they're consumed. how to actually implement an efficient LUT where the contents are field-programmed. how to figure out what speed to clock the whole thing at so that it operates properly. how to read in the configuration to the chip. This is an enormous investment in research and software, and you still have to target the chip with languages that are totally alien to your typical C/C++ programmer.

As far as I know there is no production FPGA that you can write software for without using proprietary software. (though often the software can be obtained at no cost, at least for the non-flagship FPGA chips) This is partially because the details of bitstream structure are trade secrets of the respective FPGA companies, but also partially due to the inherent difficulty of the task.

But can you play Doom on it? (1)

cephus440 (828210) | more than 3 years ago | (#34747854)

Someone had to say it, be kind.

Re:But can you play Doom on it? (1)

Dachannien (617929) | more than 3 years ago | (#34748332)

Actually, I think the correct overused meme would be, "Imagine a Beowulf cluster of those!"

That's nothing. (1)

wcrowe (94389) | more than 3 years ago | (#34748498)

Mine goes to 1011.

Amdahl's Law (1)

zildgulf (1116981) | more than 3 years ago | (#34748798)

I think this is fantastic that a 1000-core processor is in development.
I hate to be the devil's advocate but at what point will Amdahl's Law take hold fully and adding more cores to a processor will prove to be a fruitless endeavor?
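For reference, Amdahl's Law gives the speedup on n cores as 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. A quick sketch, with the 95% figure chosen purely for illustration:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the core count grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


# Illustrative only: a workload that is 95% parallel.
on_1000_cores = amdahl_speedup(0.95, 1000)
ceiling = amdahl_limit(0.95)
```

With 95% of the work parallel, 1000 cores give a speedup of roughly 19.6 against a ceiling of 20, so the answer depends almost entirely on how small the serial part of the workload can be made.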