New Framework For Programming Unreliable Chips

samzenpus posted about 9 months ago | from the this-is-how-you-do-it dept.

Programming 128

rtoz writes "To handle the unreliable chips of the future, a research group at MIT's Computer Science and Artificial Intelligence Laboratory has developed a new programming framework that enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as intended. As transistors get smaller, they also become less reliable. In some cases this unreliability won't be a major issue. For example, if a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice — but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency."
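
As a rough sketch of the idea (the "#pragma approx" marker and the function below are hypothetical, standing in for whatever syntax the MIT framework actually uses), the developer flags which computations may run on error-prone hardware while addressing and control flow stay exact:

<ecode>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: "#pragma approx" is a made-up annotation, not
   the framework's real syntax. The loop index stays exact because it drives
   memory addressing; only the per-pixel arithmetic is allowed to go wrong
   occasionally, where the worst outcome is a brief visual artifact. */
void scale_luma(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {      /* exact: controls addresses */
        #pragma approx                     /* hypothetical annotation */
        /* full-range (0..255) to broadcast-range (16..235) luma scaling */
        dst[i] = (uint8_t)((src[i] * 219u + 4080u + 127u) / 255u);
    }
}
</ecode>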

128 comments

godzilla (5, Insightful)

Anonymous Coward | about 9 months ago | (#45325135)

Asking software to correct hardware errors is like asking godzilla to protect tokyo from mega godzilla

this does not lead to rising property values

Re:godzilla (5, Interesting)

n6mod (17734) | about 9 months ago | (#45325375)

I was hoping someone would mention James Mickens' epic rant. [usenix.org]

Chicken and the Egg. (3, Informative)

jellomizer (103300) | about 9 months ago | (#45325607)

We need software to design hardware to make software...

In short it is about better adjusting your tolerance levels on individual features.
I want my integer arithmetic to be perfect, and my floating point good to 8 decimal places. But then there are components meant for interfacing with humans. With audio, so much is altered or lost due to differences in speaker quality, even on top-notch ones with gold (or whatever crazy stuff) cables. So in your digital-to-audio conversion, you may be fine if a voltage is a bit off, or if you skipped a random change, as the smoothing mechanism will often hide that little mistake.

Now for displays... We need to be pixel-perfect when screens show little movement. But if we are watching a movie, a pixel colored #8F6314 can be #A07310 for 1/60th of a second and we wouldn't notice it. And most displays are not even high enough quality to show these differences.

We hear of these errors and think, how horrible that we don't get perfect products... However, it's really a trade-off: getting smaller and faster in exchange for a few more glitches.

Re:Chicken and the Egg. (2)

CastrTroy (595695) | about 9 months ago | (#45325769)

Yeah, but you could save just as much power (I'm guessing) with dedicated hardware decoders, as you could by letting the chips be inaccurate. As chips get smaller it's much more feasible to have dedicated chips for just about everything. The ARM chips in phones and tablets have all kinds of specialized hardware, some for decoding video and audio, others for doing encryption and other things that are usually costly for a general-purpose processor. Plus it's a lot easier for the developer to not have to consider how inaccurate stuff can be, and just write code as though things are actually going to be correct. Even programming with binary floating-point numbers is problematic enough, as there are many decimal numbers that can't be properly represented.

Re:Chicken and the Egg. (0)

Anonymous Coward | about 9 months ago | (#45326657)

Silicon costs money... If you can shrink the die even more, then it's a cost saving... So if you can make the dedicated video decoder take up 50% less space, at the cost of a few pixels getting the average or the same color as the pixels beside them, it would be a huge cost saving for a minimal reduction in quality... Also, things like motion estimation need some calculation, so being off by a percent or two for a couple of frames will not cause any big visual artifacts...

You can also add in things like "This part only needs this much accuracy for these arithmetic functions, so use some estimation with high enough accuracy"... Or maybe even "sin/cos should generate a circle; we don't care if the function causes the height or width to be off by 1-2% as long as it's predictable and always performs the same across the whole batch"...

But the main thing here is that if they can shrink the isolation between components on the wafer, the chip can shrink quite a bit, but at the same time it will cause more glitches... That's what this is for, as far as I understood the summary... (Hell no, I will never read the actual article :)

Re:Chicken and the Egg. (1)

Dahamma (304068) | about 9 months ago | (#45328687)

Yeah, but you could save just as much power (I'm guessing) with dedicated hardware decoders, as you could by letting the chips be inaccurate.

Eh, a dedicated hardware decoder is still made out of silicon. That's the point: make chips that perform tasks like that (or other things pushing lots of data that's only relevant for a short period, like GPUs - GPUs used only for graphics and not computation, at least) tolerate some error, so that they can use even less power. No one is yet suggesting we make general-purpose CPUs in today's architectures unreliable :)

Re:godzilla (1)

K. S. Kyosuke (729550) | about 9 months ago | (#45326295)

Asking software to correct hardware errors is like asking godzilla to protect tokyo from mega godzilla

OTOH, in measurement theory, it's been long known that random errors can be eliminated by post-processing multiple measurements.
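
For instance, a minimal C sketch of that textbook trick, assuming each reading carries independent zero-mean noise: averaging N samples shrinks the random error by roughly a factor of sqrt(N).

<ecode>
#include <stddef.h>

/* Average n noisy readings of the same quantity. With independent,
   zero-mean noise of standard deviation sigma, the standard error of
   the averaged result is roughly sigma / sqrt(n). */
double average(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}
</ecode>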

Re:godzilla (2)

vux984 (928602) | about 9 months ago | (#45328093)

OTOH, in measurement theory, it's been long known that random errors can be eliminated by post-processing multiple measurements.

Gaining speed and energy efficiency is not usually accomplished by doing something multiple times and then post-processing the results of THAT, when you used to just do it once and get it right.

You'll have to do the measurements in parallel, and do it a lot faster to have time for the post processing and still come out ahead for performance. And I'm still not sure that buys you any improved efficiency.

random errors can be eliminated by post-processing multiple measurements.

And this is the real crux of the paradox :) Random errors can be introduced by the post-processing of multiple measurements when an unreliable processor is doing the post-processing.

Now we have to post-post-process the results of the post-processed results to eliminate any random errors there? Turtles all the way down.

That said, as TFA suggested there are operations that can tolerate error, like video decoding -- and if we can realize substantial gains in performance or energy efficiency that translates into your laptop running a lot longer in exchange for a few transient (sub tenth of a second) pixel errors... that's a pretty good trade.

Re:godzilla (0)

Anonymous Coward | about 9 months ago | (#45326419)

Typical CPUs are designed for the worst case: If they don't give the correct result under the most adverse conditions (temperature, electrical noise, bit pattern), then it's considered a bug. One could design CPUs for the average case: Make them such that calculations are mostly correct, except under rare conditions, and even then design the CPU to give less precise results, not entirely random results.

A software analogy: Quicksort has a worst case time complexity of O(n^2), but an average case time complexity of just O(n * log n). Suppose you have a real time application that needs to sort something. Do you allocate quadratic time for the sorting or do you vastly improve the responsiveness by only waiting for the sort result long enough to deal with the average case? Obviously this depends on how horribly things will go wrong when you run into the worst case. (This is where the analogy falls apart. How useful is a partly sorted array? Not very. An almost correct floating point calculation on the other hand might even be just as good as the correct result, depending on the application.)
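
To put rough numbers on that gap: for n = 1,000,000 elements, n * log2(n) is about 2x10^7 operations while n^2 is 10^12, so budgeting for the worst case means reserving on the order of 50,000 times more time than a typical run actually needs.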

Re:godzilla (1)

jeffb (2.718) (1189693) | about 9 months ago | (#45326879)

(This is where the analogy falls apart. How useful is a partly sorted array? Not very. An almost correct floating point calculation on the other hand might even be just as good as the correct result, depending on the application.)

Actually, it seems to me that the analogy is still quite valid. Having a large array where items are guaranteed to be off by no more than one spot -- in other words, where some adjacent items may be swapped from their correct positions -- could be quite useful. I'm thinking of things like "sort by most recent" for news articles, or "search by price ascending" in an online store. In fact, I'm seeing such "approximate ordering" a lot more frequently on large-scale Web apps; it's better to have an approximately-ordered list quickly than a precisely-ordered list much more slowly.

Of course, if you're looking for a sorted list to support binary search, your mileage will vary.

Re:godzilla (1)

rasmusbr (2186518) | about 9 months ago | (#45327477)

Nobody is suggesting allowing errors everywhere. Errors will only be allowed where they wouldn't cause massive unexpected effects.

A simple (self-driving) car analogy here would be that you might allow the lights to flicker a little if that saves power. You might even allow the steering wheel to move very slightly at random in order to save power as long as it never causes the car to spin out of control, but you would never allow even a small chance that the car would select its destination at random.

Re:godzilla (0)

Anonymous Coward | about 9 months ago | (#45328197)

You might even allow the steering wheel to move very slightly at random in order to save power as long as it never causes the car to spin out of control,

You've obviously never lived in a place where it snows and ice remains on the roadways for months at a time. Stop by western Minnesota or the Dakotas sometime in mid-January after a 27" snowfall and say that again with a straight face.

The other analogies hold, though XD

Hmmm ... (4, Insightful)

gstoddart (321705) | about 9 months ago | (#45325147)

So, expect the quality of computers to go downhill over the next few years, but we'll do our best to fix it in software?

That sounds like we're putting the quality control on the wrong side of the equation to me.

Re:Hmmm ... (2)

bill_mcgonigle (4333) | about 9 months ago | (#45325213)

So, expect the quality of computers to go downhill over the next few years, but we'll do our best to fix it in software?

If you use modern hard drives, you've already accepted high error rates corrected by software.

Re: Hmmm ... (1)

fizzer06 (1500649) | about 9 months ago | (#45325353)

I haven't accepted bad data from the newer hard drives.

Re:Hmmm ... (1)

fast turtle (1118037) | about 9 months ago | (#45325363)

If you access any server remotely then you're already using this - it's called ECC RAM.

Re:Hmmm ... (0)

Anonymous Coward | about 9 months ago | (#45325451)

ECC RAM doesn't use software to correct errors.

Re:Hmmm ... (0)

Anonymous Coward | about 9 months ago | (#45326781)

Then take a USB stick... There you use ECC to correct bad bits... Usually done in software in the embedded microcontroller...

Or if you take basically any router out there... They basically all use NAND flash, since it's cheap due to being shrunk down so much, and that does ECC correction usually in the driver...

Or if you take basically any SSD out there you have the same thing where the firmware takes care of the used ECC...

Doing ECC in hardware can be quite expensive, since you can then be locked in to a limited number of chip manufacturers...
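
For a toy-sized picture of the kind of correction such firmware does, here is a minimal Hamming(7,4) sketch in C (function names are mine); real NAND controllers use much stronger BCH or LDPC codes over whole pages, so treat this as illustration only.

<ecode>
#include <stdint.h>

static uint8_t bit(uint8_t v, int i) { return (uint8_t)((v >> i) & 1u); }

/* Encode 4 data bits into a 7-bit codeword (bit 0 = codeword position 1). */
uint8_t hamming74_encode(uint8_t d)
{
    uint8_t d1 = bit(d, 0), d2 = bit(d, 1), d3 = bit(d, 2), d4 = bit(d, 3);
    uint8_t p1 = d1 ^ d2 ^ d4;      /* parity over positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;      /* parity over positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;      /* parity over positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Correct any single flipped bit and return the 4 data bits. */
uint8_t hamming74_decode(uint8_t c)
{
    uint8_t s1 = bit(c, 0) ^ bit(c, 2) ^ bit(c, 4) ^ bit(c, 6);
    uint8_t s2 = bit(c, 1) ^ bit(c, 2) ^ bit(c, 5) ^ bit(c, 6);
    uint8_t s3 = bit(c, 3) ^ bit(c, 4) ^ bit(c, 5) ^ bit(c, 6);
    uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s3 << 2)); /* 1-based error position */
    if (syndrome)
        c ^= (uint8_t)(1u << (syndrome - 1));                  /* flip the bad bit */
    return (uint8_t)(bit(c, 2) | (bit(c, 4) << 1) | (bit(c, 5) << 2) | (bit(c, 6) << 3));
}
</ecode>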

Re:Hmmm ... (1)

K. S. Kyosuke (729550) | about 9 months ago | (#45327603)

It uses algorithms to correct errors, instead of simply using more reliable memory cell hardware. I believe that's the point of the comparison, not whether the algorithm runs in software or in hardware.

Re:Hmmm ... (0)

Anonymous Coward | about 9 months ago | (#45325707)

But in that case you're depending on reliable hardware to correct unreliable hardware. If the hard drive microcontroller played it fast and loose, you'd be screwed either way.

Re:Hmmm ... (0)

Anonymous Coward | about 9 months ago | (#45325229)

eh - more like allowing them to work outside the (rather small) window of reliability. Better reliability in high radiation environments like outer space, would be a good example.

Re:Hmmm ... (1)

Desler (1608317) | about 9 months ago | (#45325239)

Next few years? More like a few decades or more. Drivers, firmware, microcode, etc. have always contained software workarounds for hardware bugs. This is nothing new.

Re:Hmmm ... (2)

ZeroPly (881915) | about 9 months ago | (#45325635)

Relax, pal - frameworks that don't particularly care about accuracy have been around for years now. If you don't believe me, talk to anyone who uses .NET Framework.

Re:Hmmm ... (1)

fizzer06 (1500649) | about 9 months ago | (#45327061)

frameworks that don't particularly care about accuracy . . . .NET Framework.

Okay, I'll bite. Explain yourself.

Re:Hmmm ... (1)

ZeroPly (881915) | about 9 months ago | (#45327463)

I'm an application deployment guy, not a programmer. Every time we push something that needs .NET Framework, the end users complain about it being hideously slow. Our MS developers of course want everyone to have a Core i7 machine with 64GB RAM and SSD hard drive - to which I reply "learn how to write some fucking code without seven layers of frameworks and abstraction layers".

Then of course, I can never get a straight answer from the developers on which .NET to install. Do you want 4, 3.5 SP1, 2? The usual answer is "load all of them". I get that .NET Framework is great in theory, but if you have to deal with the actual implementation, you'll see things differently. A lot of times we'll get screen glitches which the devs are convinced are an MS issue, but there's no available fix, so we go with "that's not a serious enough problem to fix".

On the other side of the fence are the Linux apps I have to deploy. The Linux devs send me a .DEB file. I generally have that pushed out the same day.

Re:Hmmm ... (1)

InsightfulPlusTwo (3416699) | about 9 months ago | (#45325997)

You don't seem to have read the article. The software is not going to supply extra error correction when the hardware has errors. It's going to allow the programmer to specify code operations that can tolerate more errors, which the compiler can then move to the lower-quality hardware. Some software operations, like audio or video playback, can allow errors and still work OK, which lets you use lower-energy, lower-quality hardware for those operations. If they did as you suggest, and tried to fix hardware errors in the software, the software would take more energy to correct the errors and be more complex besides, which would seem to negate the benefits of the new hardware. This is not unprecedented, since various applications (audio CDs, hearing aids, etc.) already use a lesser standard of error correction.

Re:Hmmm ... (1)

Joshua Fan (1733100) | about 9 months ago | (#45326065)

All in preparation for next big thing after that... MORE accurate hardware! 6.24% more!

I for one welcome (0)

Anonymous Coward | about 9 months ago | (#45325167)

...Our new snowy-screen overlords. I was just thinking how much I liked the good old days where if the TV was flaking out, you could just give the set a good whack and it would get its act together.

Seriously, why do we want to do this? Is power usage going to be cut in half? Are yields going to double? I think it's nice to talk about (especially in the interest of making systems that go "kinda bad" instead of completely breaking), but why not just invest the time/effort into fixing the issues directly? Have we run out of ideas on how to do that?

Re:I for one welcome (1)

alexander_686 (957440) | about 9 months ago | (#45325777)

Seriously, why do we want to do this? Is power usage going to be cut in half?

Yes. Well, about in half. Think about signal processors and cell phones. Would you accept a 5% reduction in voice quality for a doubling of your talk time?

Re:I for one welcome (0)

Anonymous Coward | about 9 months ago | (#45325905)

No. Because I can hardly understand people on the phone already. Anyway, the solution there is custom ASICs to do encoding/decoding, not non-deterministic software.

Huh? (1)

Desler (1608317) | about 9 months ago | (#45325183)

but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency."

Which you could already get now simply by not doing error correction. No need for some other programming framework to get this.

Re:Huh? (1)

SJHillman (1966756) | about 9 months ago | (#45325485)

It's not so much about skipping error correction as it is saying when you can skip error correction. If 5 pixels are decoded improperly, fuck it, just keep going. However, if 500 pixels are decoded improperly, then maybe it's time to fix that.

Re:Huh? (1)

Desler (1608317) | about 9 months ago | (#45325721)

And as I said you can do that already.

Re:Huh? (1)

MightyYar (622222) | about 9 months ago | (#45325867)

Really? You can tell your phone/PC/laptop/whatever to run the graphics chip at an unreliably low voltage on demand?

Re:Huh? (1)

HybridST (894157) | about 9 months ago | (#45326307)

For PC and laptop, yes I can.

Overclocking utilities can also underclock, and near the lower stability threshold of the graphics frequency I often do see a few pixels out of whack. Not enough to crash, but artifacts definitely appear. A MHz or two higher on the clock clears them up though.

I have a dumb phone so reclocking it isn't necessary.

Re:Huh? (1)

MightyYar (622222) | about 9 months ago | (#45326377)

So you've done this yourself and you still don't see the utility in doing it at the application level rather than the system level?

Re:Huh? (1)

HybridST (894157) | about 9 months ago | (#45327387)

Automating the process would be handy, but not revolutionary. Automating it at the system level makes more sense to me, but I'm just a power user.

Re:Huh? (1)

MightyYar (622222) | about 9 months ago | (#45327561)

I don't think it is revolutionary, either... it's just a framework, after all. I was imagining a use where you have some super-low-power device out in the woods sampling temperatures, only firing itself up to "reliable" when it needs to send out data or something. Or a smartphone media app that lets the user choose between high video/audio quality and better battery life. Yeah, they could have already done this with some custom driver or something, but presumably having an existing framework would make it easier, less apt to conflict, and more standard.

Re:Huh? (1)

Xrikcus (207545) | about 9 months ago | (#45327401)

When you do it that way you have no control over which computations are inaccurate. There's a lot more you can do if you have some input information from higher levels of the system.

You may be happy that your pixels come out wrong occasionally, but you certainly don't want the memory allocator that controls the data to do the same. The point of this kind of technology (which is becoming common in research at the moment, the MIT link here is a good use of the marketing department) is to be able to control this in a more fine-grained fashion. For example, you could mark the code in the memory allocator as accurate - it must not have errors and so must enable any hardware error correction, might use a core on the platform that operates at a higher voltage, or would add extra software error correction as necessary. At the same time you might allow the visualization code to degrade to reduce overall power consumption, because the visualization code is not mutating any important data structures. Anything it generates is transient and the errors will barely be noticed.

Or we could just use java (0)

Anonymous Coward | about 9 months ago | (#45325219)

Or we could just use Java, with its "almost" IEEE-complete libraries. I mean, who really needs a perfect answer anyway?

Does that register "really" need to contain that value? How about any value?

Does the stack pointer "really" need to be there?

Does the password really need to match? It's just a hash anyway, what are a few bits of uncertainty?

Does the packet "really" need to get sent?

Does the CRC "really" have to match?

These functions will make great improvements to Java.

viewers probably won't notice? (0)

Anonymous Coward | about 9 months ago | (#45325237)

if a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice

We used to return those LCD monitors back to the store.

Re:viewers probably won't notice? (1)

Desler (1608317) | about 9 months ago | (#45325263)

You're confusing what that sentence is talking about. They aren't talking about stuck pixels on an LCD. They're talking about not spending time doing extensive error correction/masking when a few pixels in the video are corrupted and will thus be decoded with some level of artifacting.

Re:viewers probably won't notice? (1)

viperidaenz (2515578) | about 9 months ago | (#45329053)

A stuck pixel is still just an unreliable transistor...

Re:viewers probably won't notice? (1)

SJHillman (1966756) | about 9 months ago | (#45325501)

You must have gone through a lot of monitors before realizing this has nothing to do with dead pixels on a display.

"A few pixels incorrectly decoded"... (1)

gnasher719 (869701) | about 9 months ago | (#45325247)

H.264 relies heavily on the pixels in all previous frames. Incorrectly decoded pixels will be visible in many of the frames that follow. What's worse, they will start moving around and spreading.

Re:"A few pixels incorrectly decoded"... (1)

Desler (1608317) | about 9 months ago | (#45325293)

Not always true. There are cases where corrupted macroblocks will only cause artifacts in a single frame and won't necessarily cause further decoding corruption.

Re:"A few pixels incorrectly decoded"... (0)

Anonymous Coward | about 9 months ago | (#45325411)

Incorrectly decoded pixels will be visible on many frames that are following

Hardly 'many', only a few seconds worth.

Re:"A few pixels incorrectly decoded"... (0)

Anonymous Coward | about 9 months ago | (#45325529)

How many frames in a few seconds? For acceptable frame rates that's many enough for me.

And no, I don't think 24 fps is enough. Especially in this day and age.

Re:"A few pixels incorrectly decoded"... (1)

SuricouRaven (1897204) | about 9 months ago | (#45325911)

24fps? Depends on the content. It's more than enough for landscape establishing shots, talking heads and presentations, yet too low for high-action scenes and sports. It's a happy medium.

If you don't like it, try to get variable frame rate support more established. Then everyone is happy.

Re:"A few pixels incorrectly decoded"... (1)

MightyYar (622222) | about 9 months ago | (#45325917)

So then for you, the compromise in this particular example would be that you would crank up the power a bit and make the pixels all perfect. Other people without such good eyes could crank down the power and get more battery life.

Re:"A few pixels incorrectly decoded"... (1)

SJHillman (1966756) | about 9 months ago | (#45325519)

So what you're saying is that the pixels are alive, and growing! I smell a SyFy movie of the week in the works.

Re:"A few pixels incorrectly decoded"... (0)

Anonymous Coward | about 9 months ago | (#45325839)

So what you're saying is that the pixels are alive, and growing! I smell a SyFy movie of the week in the works.

The notion of living and evolving pixels is indeed very recent [wikipedia.org] .

Re:"A few pixels incorrectly decoded"... (1)

gigaherz (2653757) | about 9 months ago | (#45325603)

You missed the point. This is a framework for writing code that KNOWS about unreliable bits. The whole idea is that it lets you write algorithms that can tell the compiler where it's acceptable to have a few errored bits, and where it isn't. No one said it would apply to EXISTING code...

Re:"A few pixels incorrectly decoded"... (0)

Anonymous Coward | about 9 months ago | (#45326785)

The idea is sound though...

So let's say you decode a pixel and end up with red 255, green 200, and blue 150, but it should be red 254, green 200, and blue 150.

The question is whether that's acceptable. For viewing, maybe. You'd have to try it, but it *may* create a warble in that pixel, especially if the value was borderline and flicked back and forth between 254 and 255. It would depend on the frequency of the I-frames.

But take, say, an audio decode. Would you be able to hear an off-by-one-bit difference? Probably not.

The idea is that accuracy costs time and power. If it were less accurate, you might be able to lower both of those, as long as you stayed within tolerance. I think the entire NTSC spec is basically that :)

Re:"A few pixels incorrectly decoded"... (1)

viperidaenz (2515578) | about 9 months ago | (#45329039)

So why not just add more instructions, for doing faster but less accurate calculations? 24bit operations for RGB values, for example.

How on earth (4, Insightful)

dmatos (232892) | about 9 months ago | (#45325257)

are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything.

They'd have to completely re-architect whatever chip is doing the calculations. You'd need three classes of "data" - instructions, important data (branch addresses, etc), and unimportant data. Only one of these could be run on unreliable transistors.

I can't imagine a way of doing that where the overhead takes less time than actually using decent transistors in the first place.

Oh, wait. It's a software lab that's doing this. Never mind, they're not thinking about the hardware at all.

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45325389)

Unreliable computing is like NoSQL, a nice step backwards, allowing quantity to win over quality.

What is unimportant data? If MPEG used all IFRAME packets, it might make sense, but a blown bit will just propagate until the next IFRAME comes around in the sequence, and this would be at best distracting, at worst render a movie unwatchable.

Already, look at software. It is unreliable as it is. Hardware is where one needs to know every time you stick in 2+2, you get 4, even for large values of 4. We don't need another part of the computing stack where corners are cut.

Re:How on earth (1)

viperidaenz (2515578) | about 9 months ago | (#45328985)

2+2=5 for large values of 2.
When you're performing calculations, you need to know where and how rounding takes place if everything isn't an integer.

Re:How on earth (1)

bestdealex (1496279) | about 9 months ago | (#45325401)

Where are my mod points when I need them?! This is exactly my sentiment as well. Even the simple processing required to check if the data output is correct or within bounds will be staggering compared to simply letting it pass.

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45325429)

Oh, wait. It's a software lab that's doing this. Never mind, they're not thinking about the hardware at all.

That doesn't mean the research will be completely useless.
Unreliable results sound like something you would get from quantum computers.
For superscalar CPUs it could also be of interest to add an extra pipeline of floating-point operations implemented with analog circuits. Most floating-point operations don't need an exact result anyway; allowing the CPU to parallelize the operations where possible could have boosted performance slightly if the problem weren't cache misses.

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45325539)

Algorithms that operate on floating point data may not need exact results, but they sure as hell do not need results drawn at random from an unknown distribution that may or may not violate invariants such as x^2 >= 0.

Re:How on earth (1)

MightyYar (622222) | about 9 months ago | (#45325989)

Doesn't that depend on the application? What if I'm simply updating a position based upon an already noisy sensor? I already have a bunch of code to throw out crappy results. I'm taking lots of samples, so as long as most of my measurements are accurate, it's all good. Obviously I can't tolerate a random error in every single cycle, but maybe 1 in a million is OK and lets me run at a lower voltage.

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45326739)

Does this edge case outweigh 6?

1) A shorter-running (or more parallel) approximation to a correct algorithm can be encoded with different instructions
2) The longer-running instructions still exist
3) The code is marked up (pragma or hand-written?) to use the specific instructions that can produce a random error
4) You know the probability distribution of the error, and it is consistent
5) All parts of your application are written to tolerate the error, and are provably free of bugs or assumption violations stemming from it.
6) You believe this is a "better" for power usage than e.g. simpler silicon on the sensor to smooth/average out its samples itself.

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45325493)

Something like a Harvard architecture with different "reliability" levels.

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45325527)

What they are talking about with unreliable transistors is that as transistors become smaller, more unexpected things happen (electron tunneling, for example) that can cause an unexpected result. The number of these unexpected events will vary from transistor to transistor.

So while a transistor might give you the correct action 95% of the time, that other 5% is an issue. They are saying that rather than having to redo the entire chip, we can keep the chip and check and correct the incorrect results in software. If the problem happens when an error isn't tolerable, we send it back and do it again; but if the error causes something minor, like a single pixel in a movie being colored wrong for a frame, we can accept it and not have to spend the time and energy to reprocess for a minor error.

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45325561)

Well, let's not be too hasty - This is pretty much how any modern hardware video decoder chip works. The logic is "fixed function", that is, there aren't any conditional branches in the VHDL code which use the value of input data to decide which path to take.

For a task such as video decoding, indeed, this is quite ideal. After all, you know that you have some input data which turns into some output data of a certain bounded length, dimensionality, and structure. If there is an error in the input data stream, it will simply decode it "wrong" until the next frame marker / resync point.

That frame will look garbled, but the decode logic won't "crash", simply because it doesn't do any unbounded operations depending on the input (like allocating 'N' bytes of memory or looping 'N' times).

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45325613)

They're not proposing bad RAM. In a CPU, different tasks are performed by different parts of the CPU. Instruction and address decoding are not done by the same transistors that do arithmetic calculations, for example. You could make an unreliable floating point unit and not affect the program flow in any way (except where it depends on results which you'd then know to be unreliable).

The simple things that guide program flow (comparisons, address calculations, etc.) are fast and efficient. Calculating the root of a floating point number, for example, is expensive in comparison. There are many applications where the exact result doesn't matter, but speed and power consumption are important. Unless you've given a lot of thought to the intricacies of floating point math, chances are that your application doesn't really rely on the full precision guaranteed by the hardware specification.

A big class of CPU bugs consists of so-called speed-paths, where a part of the CPU expects a calculation in a different part of the CPU to be complete before it has actually completed. If you arrange the calculations such that incomplete calculations just lose precision and don't give wildly different results, then you don't need to wait the maximum time every time. Instead you can wait only long enough for most calculations to complete and take the less precise "speed path" result for the remaining calculations. This speeds up all calculations at the cost of making a few minor mistakes, which are not a problem if the software can handle these mistakes (or if they don't even matter in the first place).

Re:How on earth (1)

viperidaenz (2515578) | about 9 months ago | (#45328889)

A big class of CPU bugs consists of so-called speed-paths, where a part of the CPU expects a calculation in a different part of the CPU to be complete before it has actually completed

Care to expand on that? This is not a typical race condition. What you're describing is a CPU not ordering instructions as expected - not doing its primary purpose.

Re:How on earth (1)

gigaherz (2653757) | about 9 months ago | (#45325681)

This was on Slashdot years ago. I can't find the Slashdot link, but I did find this one [extremetech.com]. The idea is that you design a CPU focusing reliability on the more significant bits, while you allow the least significant bits to be wrong more often. The errors will be centered around the right values (and tend to average into them), so if you write code that is aware of that fact, you can teach it to compensate for the wrong values. Of course this is not acceptable for certain kinds of software, but for things like multimedia processing a small % error in the result wouldn't be appreciable, and over time the image should keep averaging out the old errors while introducing new ones, assuming the software is designed for it.
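
Some quick arithmetic shows why confining errors to the low bits is tolerable: flipping bit 0 of an 8-bit pixel value changes it by 1 in 255 (about 0.4%), while flipping bit 7 changes it by 128 (about 50%). If only the bottom two bits can ever be wrong, no single error moves a value by more than roughly 1.2%.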

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45326019)

This is not magic or novel. It is just an extension of stuff already used in the GPU world and available in OpenCL.

http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/exp.html [khronos.org]

See the functions exp, half_exp, and native_exp? That is what you can optimize - mostly floating-point operations.
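
A small OpenCL C kernel sketch of that knob (the kernel name is mine): per the linked manual page, exp() must meet the spec's accuracy bound, half_exp() is allowed to be coarser, and native_exp()'s accuracy is implementation-defined.

<ecode>
/* OpenCL C kernel: use the relaxed-accuracy exponential where the caller
   has decided an approximate result is acceptable (e.g. visual effects). */
__kernel void fast_exp(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = native_exp(in[i]);   /* accuracy left to the hardware's fast path */
}
</ecode>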

Re:How on earth (0)

Anonymous Coward | about 9 months ago | (#45326315)

Why did this get so many mod points when they clearly have no idea what they're talking about? This whole post is basically saying "I have no idea how this works therefore it's wrong! Nothing must exist that is outside my understanding!"

Re:How on earth (1)

Warbothong (905464) | about 9 months ago | (#45326545)

How on earth are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything.

Very easily: the developer specifies that pixel values can tolerate errors but that branch conditions/memory addresses can't. If you'd bothered to read the summary, you'd see that it says exactly that:

a new programming framework that enables software developers to specify when errors may be tolerable.

They'd have to completely re-architect whatever chip is doing the calculations.

Erm, that's the whole point. If we allowed high error rates with existing architectures, none of our results would be trustworthy. I imagine the most practical approach would be a fast, low-power but error-prone co-processor living alongside the main, low-error processor. This could be programmed just like GPUs are at the moment. The nice thing about this work is that the separation can be largely transparent; just annotate your programs and the compiler will figure out which parts can be offloaded to the co-processor.

I can't imagine a way of doing that where the overhead takes less time than actually using decent transistors in the first place.

As far as I can tell there is no overhead involved. In fact it's the other way around: calculating exact answers (as we do now) is a perfectly acceptable way to execute an error-tolerant program. The opposite is not true though: an error-intolerant program cannot be executed with errors. Since we're strictly increasing the execution strategies available, we can only ever increase efficiency (since we can choose to ignore the new strategies).

Re:How on earth (3, Insightful)

bluefoxlucid (723572) | about 9 months ago | (#45328509)

Erm, that's the whole point. If we allowed high error rates with existing architectures, none of our results would be trustworthy. I imagine the most practical approach would be a fast, low-power but error-prone co-processor living alongside the main, low-error processor.

Or, you know, the thing from 5000 years ago where we used 3 CPUs (we could do this on-package with ALUs today), all running at high speed, looking for 2 that get the same result and accepting that result. It's called MISD architecture.
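
A minimal C sketch of that majority vote (triple modular redundancy), assuming the three values come from three independent runs of the same computation:

<ecode>
#include <stdint.h>

/* Return whichever value at least two of the three runs agree on.
   If none agree, there is no trustworthy answer; a real system would
   flag the fault or retry rather than silently return a value. */
int32_t tmr_vote(int32_t a, int32_t b, int32_t c)
{
    if (a == b || a == c) return a;
    if (b == c)           return b;
    return a;  /* all three differ: caller should treat this as an error */
}
</ecode>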

Re:How on earth (1)

tlhIngan (30335) | about 9 months ago | (#45326885)

are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything.

They'd have to completely re-architect whatever chip is doing the calculations. You'd need three classes of "data" - instructions, important data (branch addresses, etc), and unimportant data. Only one of these could be run on unreliable transistors.

I can't imagine a way of doing that where the overhead takes less time than actually using decent transistors in the first place.

Oh, wait. It's a software lab that's doing this. Never mind, they're not thinking about the hardware at all.

More properly, the language takes care of it.

You declare variables to be "approximate" - where errors are tolerated and you can use lower-power hardware to compute them (it turns out reliability means having to use higher voltages, which raise power consumption, and lower clock speeds, which keep cores powered up longer rather than racing them to sleep as fast as possible).

So a counter would be "exact" and have to use the high-powered, reliable hardware mode, while pixel data would be inexact and use the low-power mode. Even a counter that iterates over the pixel array has to be exact.

And you can easily transition from exact data to inexact data, but transitions back are limited and explicit - you can't test inexact values - you have to promote the inexact data (because there will always be times when you need to deal with it).

Of course, it's a new programming language because existing ones model reliable systems.

unreliable is not really useful (0)

Anonymous Coward | about 9 months ago | (#45325403)

Yeah we really want those "almost working" machines:
- flying planes
- controlling infrastructure
- running financial transactions
- doing medical inference

Well they would probably be alright in mobile devices, except
- when authorizing transactions
- doing secure communications

Well they would be fine for Angry Birds.

Re:unreliable is not really useful (1)

somersault (912633) | about 9 months ago | (#45325799)

What exactly led you to believe that anyone is wanting to use this concept in situations where 100% reliability is required?

Re:unreliable is not really useful (0)

Anonymous Coward | about 9 months ago | (#45325881)

Not really. Didn't Samsung get slammed in China for selling truckloads of "almost working" memory in their mobile phones? The result wasn't a few corrupted pixels in Angry Birds. It resulted in a few corrupted officials getting off their asses after too many Angry Customers couldn't stand their phones locking up several times a day anymore, while Samsung claimed "almost working" should be good enough for everybody :)

Pentium FDIV (0)

Anonymous Coward | about 9 months ago | (#45325627)

So it's like the Pentium FDIV bug, where "a little error" wasn't enough reason to recall the processors until they got bashed for it.

So what about reliability then? (1)

udachny (2454394) | about 9 months ago | (#45325661)

So a case where a few pixels on the screen are incorrectly calculated is one thing and nobody cares, but what about a mistake in calculating actually valuable quantities, like precise measurements used in avionics, power plants, energy distribution systems, vote counts, monetary amounts, health monitoring data and such?

Do we then need multiple cores running the same code and taking a vote between them or do we end up with math co-processors aimed at lower error rates with larger transistor sizes and also lower frequencies?

When will the cost of fixing and/or getting around errors outweigh the benefit of a smaller transistor size and a cheaper part?

I guess one answer is: have different chips for different purposes at higher cost and different spec.

Similar Idea to EnerJ Language (3, Interesting)

MetaDFF (944558) | about 9 months ago | (#45325745)

The idea of fault-tolerant computing is similar to the EnerJ programming language being developed at the University of Washington for power savings: The Language of Good Enough Computing [ieee.org]

The gist of the idea is that the programmer can specify which variables need to be exact and which can be approximate. The approximate variables would then be stored in low-refresh RAM, which is more prone to errors, to save power, while the precise variables would be stored in higher-power memory which would be error-free.

The example they gave was calculating the average shade of grey in a large image of 1000 by 1000 pixels. The running total could be held in an approximate variable since the error incurred by adding one pixel incorrectly out of a million would be small, while the loop control variable would be exact since you wouldn't want your loop to overflow.
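
A C-flavored sketch of that example (the exact/approximate split is indicated with comments here rather than EnerJ's actual Java annotations, and the function name is mine):

<ecode>
#include <stddef.h>
#include <stdint.h>

/* Average grey level of an n-pixel image (assumes n > 0). The running sum
   could live in cheap, error-prone storage ("approximate"): a rare flipped
   low-order bit barely moves an average over a million pixels. The loop
   index must stay exact, since it drives addressing and loop termination. */
uint8_t average_grey(const uint8_t *pixels, size_t n)
{
    uint64_t sum = 0;                 /* candidate for approximate storage */
    for (size_t i = 0; i < n; i++)    /* must be exact */
        sum += pixels[i];
    return (uint8_t)(sum / n);
}
</ecode>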

Re:Similar Idea to EnerJ Language (1)

JesseMcDonald (536341) | about 9 months ago | (#45328405)

The example they gave was calculating the average shade of grey in a large image of 1000 by 1000 pixels. The running total could be held in an approximate variable since the error incurred by adding one pixel incorrectly out of a million would be small...

What makes them think that the kinds of errors you'd get in a variable in low-refresh-rate RAM would be small? Flip the MSB from 1 to 0 and your total is suddenly divided in half. Or, if it's a floating-point variable, flip one bit in the exponent field and your total changes from 1.23232e4 to 1.23232e-124.

french (1)

Spaham (634471) | about 9 months ago | (#45325823)

Am I the only Frenchman who thinks that "Computer Science and Artificial Intelligence Laboratory" sounds like this in French:
CS-AIL?

Re:french (0)

Anonymous Coward | about 9 months ago | (#45326159)

Obviously, you're the only one.

Seen this done a few times (0)

Anonymous Coward | about 9 months ago | (#45325843)

Such as using imprecise calculations during an animation sequence that moves a UI around, since it isn't really worthwhile to make it absolutely 100% perfect when it's moving fast.
If it's something like the UI for a mobile app, it's extremely worth doing, but if it were a game that required pixel-perfect smooth animation, it might be a problem. (Mind you, in games both cases are useful since you wouldn't notice, such as when spinning stupidly fast in an FPS, or slowly moving through corridors hunting for people.)

Simply put, some things do not require quality to perform; they just need something that is more or less accurate, up to a point.
It wouldn't be hard to imagine such a thing working easily; you just need to provide the system with the correct interfaces to let software punt code off to either the "TCP" of the CPU or the "UDP" of the CPU, to put it another way.
With TCP we want 100% of the data perfect; with UDP we only really care whether most of it gets there, or is even accurate.

We already use this. So if it could drop prices considerably, hell, even moderately, for some parts of a hardware design, it would be good in the long run.
Mobile applications such as tablets, phones, watches and whatever comes next would benefit greatly from this.
But even fixed applications like desktops would benefit, since it would drop prices in general. Only a percentage of the hardware would need to be working 100%, where some could have failed or be of bad quality. Most hardware already has a yield value - the effective products that survived creation - and the others are usually gimped and then sold at a cheaper price to make the most use of them.
Designing them with failure in mind could alleviate a lot of headaches and allow more headroom to fix other issues and come up with better designs.

It wouldn't be welcomed very well initially. When people hear "inefficient" they think "bad", which would be right, but it isn't all bad.
If it adds more frames and less power use, with very minimal impact on smooth animation when it is needed, what is there to hate?
Not everything requires quality. There are acceptable levels of bad quality. Like JPEG for natural pictures. But not GIF. Screw GIF. And screw people who use GIF for non-animated imagery. Why won't you just unexist?

Frp (gnaa (-1)

Anonymous Coward | about 9 months ago | (#45325955)

recent article put The

Unreliable unreliability (0)

Anonymous Coward | about 9 months ago | (#45326143)

OK, I have to ask this... If the chip itself is unreliable, then how can you trust software running on said chip to detect errors reliably? Wouldn't the unreliability of the chip affect the software as well?

This could make computers more brain-like (1)

Dr. Spork (142693) | about 9 months ago | (#45326167)

I love this idea, because it reminds me of the most energy efficient signal processing tool in the known universe, the human brain. Give Ken Jennings a granola bar, and he'll seriously challenge Watson, who will be needing several kilowatt-hours to do the same job. Plus, Ken Jennings is a lot more flexible. He can carry on conversations, tie shoes, etc. This is because his central processing unit basically relies on some sort of fault-tolerant software. I think that there will be a lot more applications of a fault-tolerant, energy efficient software strategy, beyond just media decoding. When we get around to asking computers to be creative and apply variously-weighted "rules of thumb", I expect that those operations will run best on systems that sacrifice calculation accuracy for speed and energy efficiency. You gain almost nothing when you apply rough heuristic rules precisely. Let's allow the computers to apply rough rules imprecisely, and reap the speed and energy benefits of the trade.

Re:This could make computers more brain-like (1)

bluefoxlucid (723572) | about 9 months ago | (#45328657)

I love this idea, because it reminds me of the most energy efficient signal processing tool in the known universe, the human brain.

Dumb analogy. Being inaccurate does not make you more intelligent and won't cause emergent behavior.

Give Ken Jennings a granola bar, and he'll seriously challenge Watson, who will be needing several kilowatt-hours to do the same job.

Wrong. Ken Jennings' brain runs on blood sugar, glycogen stored in the liver from previous food (converted into blood sugar by glucagon as blood sugar is consumed for work), fat stored in consolidated fat cells from previous food (converted into blood sugar by lipolysis), and a huge set of neurotransmitters (mainly acetylcholine) stored up by prior processes. Never mind that you get 10% of the energy at each level--the plants convert 2% of the sunlight they collect to energy, which is mainly stored as inaccessible fiber and other structural work (i.e. vitamins, hormones...); herbivores (the normal analogy is 'pork chop' or 'steak') get maybe 10% of that converted energy; you get maybe 10% of the energy input from the herbivores. This is like 0.02% efficiency versus 12% efficient photovoltaic panels or 38% efficient parabolic solar collectors, not considering the direct inefficiency of Ken Jennings' brain at converting sugar energy into useful work.

Plus, Ken Jennings is a lot more flexible. He can carry on conversations, tie shoes, etc. This is because his central processing unit basically relies on some sort of fault-tolerant software.

No, it's because he has better programming. A gerbil's brain relies on fault tolerant processing, and they can't talk or tie shoes; they can eat and have sex.

I think that there will be a lot more applications of a fault-tolerant, energy efficient software strategy, beyond just media decoding. When we get around to asking computers to be creative and apply variously-weighted "rules of thumb", I expect that those operations will run best on systems that sacrifice calculation accuracy for speed and energy efficiency. You gain almost nothing when you apply rough heuristic rules precisely. Let's allow the computers to apply rough rules imprecisely, and reap the speed and energy benefits of the trade.

Actually that's slow and stupid. This is less effective than taking a working computer with a high clock rate (SOD-CMOS at 394GHz, low-power, accurate) and seeding various inputs with a noise-based RNG (audio-entropyd measures the noise fluctuation on an unused microphone line, for example: this is just spaztastic voltage wobble from EMR).

Stop romanticising the human mind as a result of "lots of imprecise and uniquely organic failings creating something amazing and beautiful". It's a really fucking complex system.

The end of general-purpose computing? (0)

Anonymous Coward | about 9 months ago | (#45326207)

For example, if a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice

Ok, fine, but that assumes those transistors are dedicated to decoding, and are not used for anything that requires complete accuracy.

In other words, it assumes that we won't be using general-purpose computers in the future.

Remember that the future will be full of encryption and DRM. Those technologies have maximum brittleness -- just one wrong bit will cause large amounts of data to be discarded or blocked.

Re:The end of general-purpose computing? (1)

viperidaenz (2515578) | about 9 months ago | (#45328757)

In other words, it assumes that we won't be using general-purpose computers in the future.

Too true. Any transistor that is in the path of calculating anything that ends up as a memory location or an offset to one anywhere has the possibility of crashing the process if you're lucky, or compromising the entire system.

Nonsensical Garbage (0)

Anonymous Coward | about 9 months ago | (#45326365)

For every transistor that is DIRECTLY connected to a specific software algorithm, even when this algorithm is encapsulated in a hardware block, there are >10 transistors acting in essential support roles, whose malfunction CANNOT be trivially ignored. Either a chip is fabricated so the TOTAL number of detectable errors across the die is vanishingly low, or the chip is USELESS.

We already have the power/speed trade-off of lower precision for chains of maths calculations that can withstand the accumulating errors such lowered precision creates. Running a maths block at such a high speed, or low power, that the actions of individual transistors becomes impossible to predict, is self-defeating.

The problem is that cretinous PhDs, people who have remained in academia for all the wrong reasons, get to spew blue-sky papers to justify their existence to their universities, with no regard to the real world. "Hey, my maths is correct" does NOT ensure the quality of a paper. The 'maths' can always be made correct, with no regard to real-world, applied issues.

As for error tolerance, well almost every computer is producing errors all the time. Your Windows PC, for instance, is designed to be error tolerant in the sense that most errors do not 'crash' the machine, and are frequently handled invisibly (to the user) in the background. HOWEVER, this does not mean any test is made to ensure the errors are 'harmless'. While your hardware and OS may seemingly continue to function happily, many errors CAN be silently corrupting data or processing on which you rely. It just so happens that with the experience of billions of computers deployed across the decades, such errors have proven to be worth ignoring on average.

AGAIN raising clocks, or lowering power increases the probability of errors, so the PROPER Computer Science method is to use the LEAST work to complete your task, and this includes using numerical analysis to do no more maths than is absolutely necessary. For instance, morons as poorly skilled as those responsible for the paper in the article had MPEG1/2 decoding as a fully floating point process, because "everyone knows floats, especially doubles, are always FAR better than integers". As a consequence, no two MPEG1/2 decode units created exactly the same output from the same input data.

MPEG4, on the other hand, was designed by DECENT mathematicians, and uses Integer decode methods, producing much more correct output, with less energy per unit of maths work (although decoding MPEG4 is more maths intensive than decoding MPEG1/2). Better, every MPEG4 decode unit produces identical results (if coded correctly).

No! Unreliability is a feature (1)

Mister Liberty (769145) | about 9 months ago | (#45326559)

May the best chi(m)p win.

Re:No! Unreliability is a feature (1)

Iniamyen (2440798) | about 9 months ago | (#45327267)

The cihps rlealy olny hvae to get the frsit and lsat ltetres corcert. Yuor brian can flil in the rset.

Already done (1)

mjr167 (2477430) | about 9 months ago | (#45326697)

Doesn't Intel already make a chip that is unreliable?

Oh, GREAT (1)

Iniamyen (2440798) | about 9 months ago | (#45326883)

Yeah, let's take away the only thing that computers had going for them - doing exactly what they're told. THAT sounds like a GREAT idea.

How about stop making crap hardware? (1)

Lumpy (12016) | about 9 months ago | (#45327131)

It can be done; we don't have to race for atomic-size transistors before we have the technology to make them more reliable.

Chips for unreliable programming... (1)

Alejux (2800513) | about 9 months ago | (#45328521)

now that would be world changing!

Decades old news (1)

viperidaenz (2515578) | about 9 months ago | (#45328691)

What do you think the artefacts shown on screen are when you overclock your video card too high? Acceptable (sometimes) hardware errors.

And the inexorable decline of humanity continues (1)

EmagGeek (574360) | about 9 months ago | (#45329017)

This is why everything is disposable and nothing works anymore. People are too willing to sacrifice quality and reliability for cost.

These researchers are assholes (0)

Anonymous Coward | about 9 months ago | (#45329033)

If errors in the low-order bits were economically acceptable, we wouldn't be using high-precision data formats like floats and doubles in the first place. We'd be using 4-bit fixed point or some such BS.

If you look at the history of GPUs, you see the exact opposite trend. The native data types have gotten larger and more precise every generation, because this is actually a very cheap thing to improve.

These assholes are hyping the shit out of a deliberately crippled product nobody asked for. Fuck them. Fuck them all.
