
NVIDIA's 64-bit Tegra K1: The Ghost of Transmeta Rides Again, Out of Order

Unknown Lamer posted about a month and a half ago | from the order-out-of dept.

Transmeta 125

MojoKid (1002251) writes: Ever since Nvidia unveiled its 64-bit Project Denver CPU at CES last year, there's been discussion over what the core might be and what kind of performance it would offer. Physically, the chip is huge, more than 2x the size of the Cortex-A15 that powers the 32-bit version of Tegra K1. Now we know a bit more about the core, and it's like nothing you'd expect. It is, however, somewhat similar to designs we've seen in the past from the vanished CPU manufacturer Transmeta. When it designed Project Denver, Nvidia chose to step away from the out-of-order execution engine that typifies virtually all high-end ARM and x86 processors. In an OoOE design, the CPU itself is responsible for deciding which code should be executed at any given cycle. OoOE chips tend to be much faster than their in-order counterparts, but the additional silicon burns power and takes up die area. What Nvidia has developed instead is an in-order architecture that relies on a dynamic optimization program (running on one of the two CPU cores) to calculate and optimize the most efficient way to execute code. The optimized code is then stored in a special 128MB buffer carved out of main memory. The advantage of decoding and storing the most optimized execution sequence is that the chip doesn't have to decode the same code again; it can simply grab the optimized version from memory. Furthermore, this kind of approach may pay dividends on tablets, where users tend to run a small subset of applications. Once Denver sees you run Facebook or Candy Crush a few times, it's got the code optimized and waiting. There's no need to keep decoding it for execution over and over.
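
For readers who want a concrete picture, here is a minimal sketch (in C, purely illustrative; the data layout and names are assumptions, not NVIDIA's actual scheme) of the general idea behind a dynamic-translation cache: optimized traces are keyed by the guest program counter and reused, so hot code only pays the optimization cost once.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of a dynamic-translation cache: optimized traces are
 * keyed by the guest (ARM) program counter and reused on later executions,
 * so hot code is only translated/optimized once. Illustrative only. */

#define TCACHE_SLOTS 4096              /* stand-in for the 128MB buffer */

typedef struct {
    uint64_t guest_pc;                 /* ARM address the trace starts at */
    const void *native_code;           /* optimized micro-op sequence */
} tcache_entry;

static tcache_entry tcache[TCACHE_SLOTS];

/* Look up a previously optimized trace; return NULL on a miss, in which
 * case the core would fall back to simple decode and (if the code proves
 * hot) hand it to the software optimizer. */
static const void *tcache_lookup(uint64_t guest_pc)
{
    tcache_entry *e = &tcache[(guest_pc >> 2) & (TCACHE_SLOTS - 1)];
    return (e->guest_pc == guest_pc) ? e->native_code : NULL;
}

static void tcache_insert(uint64_t guest_pc, const void *native_code)
{
    tcache_entry *e = &tcache[(guest_pc >> 2) & (TCACHE_SLOTS - 1)];
    e->guest_pc = guest_pc;
    e->native_code = native_code;
}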


Decoding a Bit (0, Offtopic)

smitty_one_each (243267) | about a month and a half ago | (#47653361)

Decoding the bits
They come in a tranche
As though a blade on Williams's
Facial avalanche [google.com]
Burma Shave


God rest your soul, Robin.

Is it better? (1)

meerling (1487879) | about a month and a half ago | (#47653407)

Let's see if I have this right:
With an OoOE CPU, the CPU itself decides what order to process the instructions in, so you get faster overall execution.

With the Project Denver CPU, it's an in-order processor, but it uses software at runtime to decide what order to process the code in and stores that info in a special buffer; yet that software is itself run by the CPU in the first place to make the OoOE-style decisions.

This seems kind of flaky to me.

Re:Is it better? (3, Insightful)

wonkey_monkey (2592601) | about a month and a half ago | (#47653435)

What's flaky about it?

The advantage of decoding and storing the most optimized execution method is that the chip doesn't have to decode the data again; it can simply grab that information from memory.

Re:Is it better? (1)

paskie (539112) | about a month and a half ago | (#47653439)

So in the case of the JVM, you'd think it's flaky for the JIT to run on the same CPU as the one that is executing the code?

Bear in mind that nowadays CPUs no longer need to be designed to run closed-source, boxed operating systems at top performance. The bootloader and kernel can be custom-compiled for the very specific CPU version and won't *necessarily* need the helper.

Re:Is it better? (1)

Anonymous Coward | about a month and a half ago | (#47653453)

I think you're missing an important part. The order optimizations are run on a separate CPU/core. That CPU/core can be shut down to save power, or used to execute other threads to increase speed. Remember, Transmeta only made mobile/power-saving processors because they could save power by not running the OoOE engine the whole time. This is a great approach to saving power: do the optimization once and execute the result several times. One problem is that when the code paths change drastically you need some way to detect it, which could be done by placing special instructions in code paths that weren't used much during the last optimization.

Re:Is it better? (1)

Guspaz (556486) | about a month and a half ago | (#47653953)

Errm, it's a dual-core chip, and there's no third core for running the optimizations. They run on the same CPU cores that everything else does.

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47653467)

It sounds like a low frequency design, which would be alrighty for the target market. A higher frequency part would probably benefit from an eDRAM scratch pad.

Re:Is it better? (1)

Sockatume (732728) | about a month and a half ago | (#47653485)

More to the point, if the advantage of switching to in-order is having less silicon (and therefore a smaller power draw), isn't that completely undone by having a whole second CPU in there that makes it twice as large as its predecessor?

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47653567)

Not necessarily. With out-of-order, the extra silicon is burning power continuously. With in-order, the second CPU only needs to burn power when the first CPU encounters new code.

Re: Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47653827)

Probably depends on how good your gating is and how often the optimization process has to run. The two cores are larger than a single out-of-order core; but when optimization isn't being done you can presumably sleep the idle core as deeply as it allows. If that isn't very deep, or if the load doesn't play nicely, this could go quite badly; but if it works as intended it could well be as efficient or more so. It will be interesting to see if any real-world loads thrash this without much gain in speed, or if they perform as desired.

Re:Is it better? (4, Insightful)

loufoque (1400831) | about a month and a half ago | (#47653535)

In-order processors are a better choice as long as your program is well optimized.
Optimizing for in-order processors is difficult, and not something that is going to be done for 99% of programs. It's also very difficult to do statically.

NVIDIA has chosen to let the optimization be done by software at runtime. That's an interesting idea that will surely perform very well in benchmarks, if not in real life.

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47653845)

"Well optimized" doesn't really mean much, unless your input data is of such nature that you can predict cache misses. If you can't, even a compiler with ideal knowledge of your architecture isn't going to make your chip look good. I suspect even the benchmarks will confirm that one. Someone else pointed out it has huge caches - not hard to guess why.

Re:Is it better? (3, Interesting)

Predius (560344) | about a month and a half ago | (#47653979)

This is an area where post-compile optimization can shine. By watching actual execution with live data, the post-compile optimizer can build branch statistics to tune against, based on actual operation rather than static analysis at compile time. HP's Dynamo project, IIRC, was based around this idea: it would recompile binaries for the same architecture they ran on after observing them run a few times. I believe the claim was an average 10% performance improvement over compiler-optimized binaries alone.
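
For the curious, a toy sketch of the kind of profiling such systems rely on (not HP's actual code; names and thresholds are made up): count executions of candidate trace heads and flag them for recompilation once they get hot.

#include <stdint.h>
#include <stdbool.h>

/* Toy execution-count profile, in the spirit of Dynamo-style trace
 * selection. Each observed candidate trace head bumps a counter; once a
 * head gets hot enough, the optimizer would build and cache an optimized
 * trace beginning there. Purely illustrative. */

#define HOT_THRESHOLD 50
#define PROFILE_SLOTS 1024

static uint32_t exec_count[PROFILE_SLOTS];

static bool observe_trace_head(uint64_t pc)
{
    uint32_t *c = &exec_count[(pc >> 2) % PROFILE_SLOTS];
    if (*c < HOT_THRESHOLD)
        (*c)++;
    return *c >= HOT_THRESHOLD;   /* true => worth recompiling this trace */
}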

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47654075)

You are confusing branches with cache behavior, and the optimization you describe is done by literally every JIT and, you could argue, every OoO engine as well.

Cache behavior is a bigger problem because, unlike in benchmark scenarios, real machines tend to have more than one program running at a time.

Re:Is it better? (1)

Predius (560344) | about a month and a half ago | (#47654241)

Don't take my word for it: http://www.cs.tufts.edu/comp/1... [tufts.edu]

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47654337)

I believe you; it's just totally irrelevant to what I said.

Re: Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47654923)

This is worth doing because it will give good performance in games. AMD is selling the same architecture that's in the Xbox One for tablets now; we'll see who makes Call of Duty: World at War look better over Miracast with a Bluetooth keyboard and mouse.

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47655779)

Similar gains can be had with Profile-Guided Optimization, or whatever your compiler calls it today, assuming you profile with a representative data set.

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47654133)

In order "may" be better for properly optimized code "if" you can statically know how the code will execute AND you know the exact implementation details of the target processor. That's why JIT compilers are used for chips like this and why OO processors are so common. Your old static code gets optimized by the hardware and magically speeds up.

I'm not arguing that this chip won't work, but nothing is ever clear cut.

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47655075)

This is incorrect. In a multiprocessor, an optimizer cannot know the state of memory at the time the code is run, such as contention from other cores. By state of memory I mean who owns which cache lines.

Since memory is so much slower than the processor, OoO solves some of this issue, and hyperthreading solves more of it. I don't see dynamic code working, especially if it is optimized only once.

Re:Is it better? (0)

Anonymous Coward | about a month and a half ago | (#47654719)

Maybe the code to do the optimizations is already optimized in such a way that it runs efficiently in order?

But, but... (0)

Anonymous Coward | about a month and a half ago | (#47653427)

I don't "run facebook". Ick.

Re:But, but... (0)

Anonymous Coward | about a month and a half ago | (#47653549)

Presumably Nvidia is hoping to sell its entire Denver CPU production to Mark Zuckerberg.

special 128MB buffer of main memory? (0)

Anonymous Coward | about a month and a half ago | (#47653437)

2 GiB - 128MB > 2 GB so computer makers can still advertise 2 GB of RAM. Clever.

Re:special 128MB buffer of main memory? (-1)

Anonymous Coward | about a month and a half ago | (#47653671)

And idiots like you will fall for it.

Re:special 128MB buffer of main memory? (-1)

Anonymous Coward | about a month and a half ago | (#47653831)

2048 - 128 = 1920 < 2048

Stupid fucker.

Re:special 128MB buffer of main memory? (3, Informative)

shanipribadi (1669414) | about a month and a half ago | (#47653925)

2 GiB = 2 * 2 ^ 30 Byte
128 MB = 128 * 10^6 Byte
2 GiB - 128 MB = 2019483648 Byte;
2019483648 Byte > 2GB

Who's the stupid fucker now?

Re:special 128MB buffer of main memory? (1)

Anonymous Coward | about a month and a half ago | (#47654253)

It's hard drive manufacturers that insisted on decimal instead of binary prefixes, not RAM makers. In fact, it's fairly difficult to make RAM to decimal prefixes.

At any rate, RAM specs usually don't count OS overhead anyway, so this just makes the kernel heavier in a sense. Hell, this is probably being done with a proprietary kernel blob (as we know, Nvidia loves their magic blobs that give you amazing performance only on their hardware...)

Re:special 128MB buffer of main memory? (0)

fnj (64210) | about a month and a half ago | (#47654591)

It's hard drive manufacturers that insisted on decimal instead of binary prefixes, not RAM makers. In fact, it's fairly difficult to make RAM to decimal prefixes.

Why is it difficult to make a 2.147483648 GB stick of RAM but easy to make a 2 GiB stick of RAM?

Re:special 128MB buffer of main memory? (1)

Immerman (2627577) | about a month and a half ago | (#47655693)

It's not, obviously - they're both the same size at 2^31 bytes. Where you run into problems is if you want to create a 2 000 000 000 byte stick of ram. And then you have a bit of a problem: Addressing.

In hardware, exactly 2^N memory locations are addressable with N bits. You therefore need only ensure that your number of address lines corresponds to your number of memory locations in order to ensure consistency. If however 7% of possible addresses are invalid you need to insert logic somewhere to make sure every single memory access falls within the valid range, and that creates a performance overhead.

Worse, you have to *translate* those addresses. If I try to access byte 2 000 000 001 in your system, the memory addressing infrastructure will have to do range analysis to determine which stick of RAM it belongs in: that's multiple integer comparison and branching operations. That's slow. If your memory size is a power of two, on the other hand, you need only interpret the bottom N bits as the address within a memory bank, and the remaining high address bits as the address of the bank to access. No comparisons or branching needed: every address between 0 and RAM_MAX will automatically address the proper memory cell.

Furthermore, since RAM has *always* been manufactured in powers of two, all of the surrounding architecture also has that assumption built in: RAM controllers, BIOS, operating systems. Unless you want to create everything from scratch, a non-power-of-two size stick of RAM will cause catastrophic errors wherever it's used. And keep in mind the decision of which size RAM sticks are available is made by the RAM manufacturers, NOT the device manufacturer. No RAM manufacturer is going to make off-size RAM because no existing systems will be able to use it. And no device manufacturer is going to commission off-size RAM because a limited run would be substantially more expensive than just using a slightly larger standard size.
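
A small sketch of why power-of-two sizes keep addressing cheap (illustrative C, sizes arbitrary): with power-of-two banks the bank index and byte offset fall straight out of bit fields, with no comparisons or division.

#include <stdint.h>

/* With power-of-two bank sizes, decoding an address is just bit slicing:
 * low bits select the byte within a bank, high bits select the bank.
 * With a 2,000,000,000-byte bank you'd need compares and a divide instead.
 * Sizes below are arbitrary examples. */

#define BANK_BITS 31u                        /* 2^31 bytes = 2 GiB per bank */

static inline uint32_t bank_of(uint64_t addr)   { return (uint32_t)(addr >> BANK_BITS); }
static inline uint64_t offset_of(uint64_t addr) { return addr & ((1ull << BANK_BITS) - 1); }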

where did you get 128 MB cache? (0)

Anonymous Coward | about a month and a half ago | (#47653455)

Denver's large L1 instruction cache (128KB, compared to 32KB for a typical Cortex-A15)

Re:where did you get 128 MB cache? (2)

Narishma (822073) | about a month and a half ago | (#47653703)

It's a "software" cache, it's stored in RAM.

Re:where did you get 128 MB cache? (0)

Anonymous Coward | about a month and a half ago | (#47656321)

RAMDISK returns ooo ya DOS to the rescue

Sounds smart, but is it? (4, Informative)

IamTheRealMike (537420) | about a month and a half ago | (#47653499)

Although I know only a little about CPU design, this sounds like one of the most revolutionary design changes in many years. The question in my mind is how well it will work. The CPU can use information at runtime that a static analyser running on a separate core might not have ahead of time, most obviously branch prediction information. OoO CPUs can speculatively execute multiple branches at once and then discard the version that didn't happen, and they can re-order code depending on what it's actually doing, including things like self-modifying code and code that's generated on the fly by JIT compilers. On the other hand, if the external optimiser CPU can do a good job, it stands to reason that the resulting CPU should be faster and use way less power. Very interesting research, even if it doesn't pan out.

Re:Sounds smart, but is it? (1)

loufoque (1400831) | about a month and a half ago | (#47653537)

Just look at the various optimizers that optimize assembly code at runtime and see how well they work.
The idea is old, but it doesn't work that well in practice.

Re:Sounds smart, but is it? (1)

Anomalous custard (16635) | about a month and a half ago | (#47653553)

Well, you could look at what the Hotspot JVM does which is probably a closer analogy, and it works very well.

Re:Sounds smart, but is it? (1)

gnasher719 (869701) | about a month and a half ago | (#47653565)

Well, you could look at what the Hotspot JVM does which is probably a closer analogy, and it works very well.

But then if you are using a JVM that recompiles code on the fly (or Apple's latest JavaScript engine, which actually has one interpreter and three different compilers, depending on how much the code is used), the CPU has to recompile the code yet again! Unlikely to be a good idea.

There's a different problem. When you have loops, you usually have dependencies between the instructions within an iteration, but no dependencies between iterations. OoO execution handles this brilliantly. If you have a loop where each iteration has 30 cycles of latency and 5 cycles of throughput, the OoO engine will just keep executing instructions from six iterations in parallel. Producing code that does this without OoO execution is a nightmare.
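
To make "executing instructions from several iterations in parallel" concrete, here is a hand-scheduled sketch (illustrative C, details made up): four independent iterations are kept in flight so each long-latency operation hides behind the others. This is roughly the shape of code an in-order design needs the compiler, or Denver's optimizer, to produce.

/* Sketch of manually interleaved (software-pipelined) iterations.
 * Each iteration's multiply is independent of the others, so four are kept
 * in flight at once; an OoO core finds this overlap by itself, while an
 * in-order core needs the code written (or rewritten) this way.
 * Assumes n is a multiple of 4 for brevity. */
static void scale_sum(const double *a, const double *b, double *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        double t0 = a[i + 0] * b[i + 0];   /* start iteration 0 */
        double t1 = a[i + 1] * b[i + 1];   /* start iteration 1 while 0 is in flight */
        double t2 = a[i + 2] * b[i + 2];
        double t3 = a[i + 3] * b[i + 3];
        out[i + 0] = t0 + 1.0;             /* results consumed only after all four start */
        out[i + 1] = t1 + 1.0;
        out[i + 2] = t2 + 1.0;
        out[i + 3] = t3 + 1.0;
    }
}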

Re:Sounds smart, but is it? (1)

loufoque (1400831) | about a month and a half ago | (#47653717)

If you have a loop where each iteration has 30 cycles latency and 5 cycles throughput, the OoO engine will just keep executing instructions from six iterations in parallel. Producing code that does this without OoO execution is a nightmare.

Loop unrolling is hardly a nightmare; it's one of the simplest optimizations and can easily be automated.

Re:Sounds smart, but is it? (1)

gnasher719 (869701) | about a month and a half ago | (#47655577)

Loop unrolling is hardly a nightmare; it's one of the simplest optimizations and can easily be automated.

Good luck. We are not talking about loop unrolling. We are talking about interleaving instructions from successive iterations. That was what Itanium expected compilers to do, and we all know how that ended.

Re:Sounds smart, but is it? (1)

loufoque (1400831) | about a month and a half ago | (#47655717)

You usually unroll step by step. All loads, all computations, all stores.

Re:Sounds smart, but is it? (0)

Anonymous Coward | about a month and a half ago | (#47653733)

The idea of the CPU recompiling the code again is sound. Remember that you can hardly expect the original JVM/JIT to generate the best possible asm for your architecture. People have tried that, and usually came back disappointed by the relative pace of deployed software ("Pentium optimizations"). That's the entire point of OoO: your performance isn't dependent on how well the compiler/JIT knows your pipelines.

I agree with the rest of your post: I don't see how this can be competitive performance-wise with on-CPU dynamic OoO; it fails on cache misses, mispredicted branches, non-unrollable loops, etc.

Re:Sounds smart, but is it? (3, Informative)

Megol (3135005) | about a month and a half ago | (#47653989)

Out-of-order execution can only really do one thing: cope with varying latency of operations. For most normal instructions, a LIW/VLIW/explicitly scheduled processor (yes, there are some that aren't a *LIW type) can in most cases do better than the dynamic scheduler. Where OoO execution really shines is hiding L1 cache misses, and in some cases even L2 cache misses, where statically scheduled code has a hard time adapting to hit/miss patterns.
The standard technique for statically scheduled architectures is to move loads up as far as possible so that L1 misses can at least partially be hidden by executing independent code, often using specialized non-faulting load instructions that can fail "softly" and be handled by special code paths. The problem with doing things like that is that fine-grained handling isn't really possible due to code explosion.

But it is fully possible to do partial OoO execution just for memory operations and maybe that's what Nvidia is doing. Maybe not.
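
As a rough source-level picture of that load-hoisting technique (illustrative C only; real schedulers work on machine code and use non-faulting loads to hoist across branches): the next load is started early so independent work can cover a possible miss.

/* Rough illustration of static load hoisting: the load of n->next is
 * issued early, and independent arithmetic runs while the (possible) cache
 * miss is outstanding. Real schedulers do this on machine code, often with
 * non-faulting/speculative loads so the hoist can cross branches safely. */
struct node { int value; struct node *next; };

static int sum_with_hoist(const struct node *n, const int *work, int m)
{
    int total = 0;
    while (n) {
        const struct node *nxt = n->next;   /* hoisted: start the next load early */
        int v = n->value;
        for (int i = 0; i < m; i++)         /* independent work hides the miss */
            total += work[i];
        total += v;
        n = nxt;
    }
    return total;
}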

Re:Sounds smart, but is it? (1)

Rockoon (1252108) | about a month and a half ago | (#47654651)

Out-of-order execution can only really do one thing: cope with varying latency of operations.

It also covers up the sometimes bad instruction ordering of compilers, which has predictably led to compilers being even worse at it. There's no point modeling which execution units are free when the OoO pipeline reduces all the important inner-loop stuff to the latency of the longest dependency chain after just a couple of iterations...

Re:Sounds smart, but is it? (-1, Redundant)

loufoque (1400831) | about a month and a half ago | (#47653589)

How is a JVM a closer analogy than exactly what they are doing?
They are literally optimizing the assembly at runtime.

A JVM does nothing like that.

Re:Sounds smart, but is it? (0)

Anonymous Coward | about a month and a half ago | (#47653713)

It's translating Java bytecode (which is pretty much an instruction set - there even used to be Java CPUs in Sun's better days) into optimized assembler for another architecture. The comparison is pretty apt.

Re:Sounds smart, but is it? (0)

Anonymous Coward | about a month and a half ago | (#47653783)

A JVM does exactly that. It takes Java bytecode and might, if it decides to, compile it to x86 instructions at runtime.

Please, learn something about Java and the JVM before you open your mouth and shove your foot in it.

Re:Sounds smart, but is it? (3, Insightful)

serviscope_minor (664417) | about a month and a half ago | (#47653873)

A JVM does nothing like that.

It does. The Hotspot JVM compiles bytecode to assembly with optimizations on the fly. It also has some capability to re-optimize based on runtime performance and usage patterns.

Re:Sounds smart, but is it? (1)

Anonymous Coward | about a month and a half ago | (#47653619)

What is known to work is to linearize basic blocks, that is, eliminate all forward branches in the code.

This has traditionally helped a lot. The P4 had the trace cache, which did something like this, but it was pretty expensive to do in the processor itself.

Re:Sounds smart, but is it? (1)

Mr Thinly Sliced (73041) | about a month and a half ago | (#47653555)

If you want to look into revolutionary design changes look into the Mill CPU architecture.

They've put their lecture series available on the web about their intended architecture - it's kinda a hybrid DSP / general purpose with some neat side steps of contemporary CPU architectures.

Re:Sounds smart, but is it? (1)

Charliemopps (1157495) | about a month and a half ago | (#47653787)

If you want to look into revolutionary design changes look into the Mill CPU architecture.

They've put their lecture series available on the web about their intended architecture - it's kinda a hybrid DSP / general purpose with some neat side steps of contemporary CPU architectures.

Thanks for that... that's very interesting. If it works, you're right, it would be amazing. Also... a CPU designed by Santa Clause?!?! I'm in!

no (0)

Anonymous Coward | about a month and a half ago | (#47653603)

The Transmeta CPU gave bad performance compared to Intel's Pentium 3; Intel's greatly improved, high-IPC cores will slaughter it. I guess the ARM Cortex-A15 is comparable to a P3. It's a good thing for Nvidia that they made a version with normal ARM cores.

Re:Sounds smart, but is it? (0)

Anonymous Coward | about a month and a half ago | (#47653631)

*Uses more CPU time and silicon to save CPU time and silicon...

Whaaaa. I mean, I'll be interested to see performance numbers. But this doesn't seem too good for branch-heavy work.

Re:Sounds smart, but is it? (3, Interesting)

taniwha (70410) | about a month and a half ago | (#47653699)

It's certainly different, but not revolutionary. I worked on a core that did this 15 years ago (not Transmeta); it's a hard problem, we didn't make it to market, and Transmeta floundered. What I think they're doing here is instruction rescheduling in software, something that's usually done by lengthening the pipe in an OoO machine. It means they can do tighter/faster branches, and they can pack instructions in memory, aligned appropriately, to feed the various functional units more easily. My guess from reading this article is that it probably has an LIW mode where they turn off the interlocks when running scheduled code.

Of course, all this could be done by a good compiler scheduler (actually, it could be done better with a compiler that knows how many of each functional unit type are present during the code generation phase); the resulting code would likely suck on other CPUs but would still be portable.

Then again, if they're aiming at the Android market, maybe what's going on is that they've hacked their own JVM and it's doing JIT on the metal.

Re:Sounds smart, but is it? (0)

Anonymous Coward | about a month and a half ago | (#47653767)

Although I know only a little about CPU design, this sounds like one of the most revolutionary design changes in many years.

Um, no. The most revolutionary change in many years was when Power and ARM both finally went OoO, removing the last holdouts of legacy in-order CPU design from the mass market. Both the PS3 and X360 were based on in-order PPC chips. Both the PS4 and XBone are based on OoO AMD64 chips. They each get twice the performance at half the clock speed of their predecessor. That's what OoO buys you. Nvidia is going backwards in terms of hardware design. They think they can make it up with clever software, but every vendor of in-order hardware thought they could make it up with clever software, and they were all wrong.

Re:Sounds smart, but is it? (0)

Anonymous Coward | about a month and a half ago | (#47653905)

Power was already OOO. IBM ripped those power hungry bits out of the Power core, as it was assumed the PPE would exist to do little more than shovel instructions and data into the SPEs.

How does this account of caching? (1)

Viol8 (599362) | about a month and a half ago | (#47653523)

Surely one of the points of OoOE is it can - in theory - take account of whether data is in cache or not when deciding when to do reads? I don't see how a hard coded instruction path can do this.

Re:How does this account of caching? (0)

Anonymous Coward | about a month and a half ago | (#47653591)

This is an excellent point. Everyone who tried this has fallen flat on their face, over and over again. Either Nvidia has found the magic trick to make it work this time, or they'll find out the hard way that it keeps failing...because it's just not a good idea.

I'm curious what Linus has to say about it. I think he's, eh, "talked in his typical diplomatic manner" about this kind of design idea a few times.

Re:How does this account of caching? (1)

MoonlessNights (3526789) | about a month and a half ago | (#47653635)

I really wonder about this, too. Perhaps they determined that the common case of a read is one which they can statically re-order far enough ahead of the dependent instructions for it to run without a stall but that doesn't sound like it should work too well, in general. Then again, I am not sure what these idioms look like on ARM64.

The bigger question I have is why they didn't go all out with their own VLIW-based exposure of the hardware's capabilities. As I recall from the Transmeta days, their problem was related to constrained memory bandwidth causing their compiler and the application code to compete for the bus (a problem this design may also have unless their compiler is _really_ tight, which might be true for this low-ambition design), while the benefits of statically renaming registers and packing instructions into issue groups were still substantial.

Re:How does this account of caching? (0)

Anonymous Coward | about a month and a half ago | (#47653743)

Here's another thought: if they need to be speculative+pessimistic about their reads missing cache, then they need to fetch far ahead, probably across branches. Which means the CPU is doing unneeded stuff, not something you want for a power efficient design. If they are optimistic about getting cache hits, then every miss is a bad penalty, and code with bad cache behavior will absolutely crawl an order of magnitude worse.

Re:How does this account of caching? (5, Informative)

Anonymous Coward | about a month and a half ago | (#47653679)

I think the entire point of having 7 micro-ops in flight at any point in time, combined w/ the large L1 caches and 128MB micro-op instruction cache, is to mitigate this, in much the same fashion that the sheer number of warps (blocks of threads) in PTX mitigates in-order execution of threads and branch divergence.

Based on their technical press release, AArch64/ARMv8 instructions come in, and at some point the decoder decides it has enough to begin optimization into the native ISA of the underlying chip, at which point it likely generates micro-ops for the code that presumably place loads appropriately early s.t. stalls should be non-existent or minimal once approaching a branch. By the looks of their insanely large L1 I-cache (128KB), this core will be reading quite a large chunk of code ahead of itself (consuming entire branches, and pushing past them I assume, to pre-fetch and run any post-branch code it can while waiting for loads) to aid in this process.

The classic case w/ in-order designs is of course where the optimization process can't possibly do anything in between a load and a dependent branch, either due to a lack of registers to do anything else, a lack of execution pipes to do anything else, or there literally being nothing else to do (predictably) until the load or branch has taken place. Depending on the memory controller and DDR latency, you're typically looking at 7-12 cycles on your typical phone/tablet SoC for a DDR block load into L2 cache and into a register. This seems like it may be clocked higher than a Cortex-A15 though, so let's assume it'll be even worse on Denver.

This is where their 'aggressive HW prefetcher' comes into play, I assume, combined w/ their 128KiB I-cache prefetching and analysis/optimization engine. Denver has a relatively big (64KiB) L1 D-cache as well! (For comparison, the Cortex-A15, which is also a large ARM core, has a 32KiB L1 D-cache per core.) I would fully expect a large part of that cache is dedicated to filling idle memory-controller time with speculative loads, taking educated stabs in the dark at what loads are coming up in the code, in the hope of getting some right and further mitigating in-order branching/loading issues.

It looks to me like they've applied the practical experience of their GPGPU work over the years and applied it to a larger more complex CPU core to try and achieve above-par single core performance - but instead of going for massively parallel super-scalar SIMT (which clearly doesn't map to a single thread of execution), they've gone for 7-way MIMT and a big analysis engine (logic and caches) to try and turn single-threaded code into partially super-scalar code.

This is indeed radically different from typical OoO designs, in that those designs waste their extra pipelines running code that ultimately doesn't need to be executed to mitigate branching performance issues (by running all branches, when only one of their results matters), whereas Denver decided: "hey, let's take the branch hit, but spend EVERY ONE of our pipelines executing code that matters, because in real-world scenarios we know there's a low degree of parallelism which we can run super-scalar, and we know that with a bit more knowledge we can predict and mitigate the branching issues anyway!"

Hats off, I hope it works well for them - but only time will tell how it works in the real world.

Fingers crossed. This is exactly the kind of out-of-the-box thinking we need to spark more hardware innovation. Imagine this does work well: how are AMD/ARM/IBM/Intel/TI going to respond when their single-core performance is sub-par? We saw the ping-pong of performance battles between AMD and Intel in previous years; Intel has dominated for the last 5 years or so, unchallenged, and has ultimately stagnated in the past 3 years.

Re:How does this account of caching? (0)

Anonymous Coward | about a month and a half ago | (#47653799)

Excellent analysis. You're right in pointing out that the huge caches are needed because it needs to both 1) ensure high hit rates, so it doesn't have to schedule over long cache-miss delays, and 2) speculatively load everything, for exactly the same reason.

From what I understand, the traditional obstacle to making L1 caches big is the inability to keep their hit latency low and the chip frequency up (at the same voltage). I wonder how they deal with that.

Re:How does this account of caching? (0)

Anonymous Coward | about a month and a half ago | (#47653803)

Imagine this does work well, how are AMD/ARM/IBM/Intel/IT going respond when their single-core performance is sub-par?

Um, even if this works well, it's just going to close the gap between nVidia's in-order cores and the OOO cores of everybody else. nVidia aren't going to *beat* anyone's OOO core this way. Their performance will remain sub-par, just not as sub-par as it would have been without this JIT stuff.

And it's very very unlikely to work well.

Re:How does this account of caching? (0)

Anonymous Coward | about a month and a half ago | (#47654505)

At the end of the day, OoO is going to win when branches are less-predictable. Benchmarks tend to be predictable, but the real world isn't. As virtually all useful applications use the network these days, in effect an increasingly-large percentage of code branches ultimately depend on network-derived inputs that can't be predicted accurately over the long or short term...

Re:How does this account of caching? (0)

Anonymous Coward | about a month and a half ago | (#47655755)

This is indeed radically different from typical OoO designs, in that those designs waste their extra pipelines running code that ultimately doesn't need to be executed to mitigate branching performance issues (by running all branches, when only one of their results matters), whereas Denver decided: "hey, let's take the branch hit, but spend EVERY ONE of our pipelines executing code that matters, because in real-world scenarios we know there's a low degree of parallelism which we can run super-scalar, and we know that with a bit more knowledge we can predict and mitigate the branching issues anyway!"

The branching hit isn't cheap: splitting concurrent runahead execution across multiple cores requires a huge shared (higher-latency) cache to contain each branch's instruction window, but also an extensive snapshot of all registers. Combine multiple consecutive branches (if, else if, else if...) and we've quickly eaten up what we originally thought was a huge amount of instruction cache.

In general, static branch prediction for integer-based operations gives surprisingly high prediction accuracy (>97%) for typical real-world cases on modern processors. Either branch prediction on ARM CPUs is notoriously bad (I doubt it) or Nvidia believes that they can forgo typical branch prediction + superscalar OoO and come up with a faster solution using less die space.

Re:How does this account of caching? (1)

Calinous (985536) | about a month and a half ago | (#47653849)

If their cache lines are 64 bytes, then it's quite possible that successive instructions (based on execution time stamp) are in the same cache line. Remember that this has to improve execution speed most of the time, and not decrease execution speed. As for data caches, I'm not sure; a good prefetcher will help a lot here.
      This has the possibility to slow down execution... I wonder how often and how long the execution of a thread can continue when there's a data cache miss... Maybe almost all of the time it doesn't continue far, and in that case the slowdown could be recovered in subsequent steps.

Re:How does this account of caching? (0)

Anonymous Coward | about a month and a half ago | (#47653921)

A good prefetcher does nothing to solve the basic problem that you don't know whether the data is in cache or not, and hence whether to hoist your loads up 2-3 cycles (L1) or 12-200 (L2->memory) cycles.

An in-order chip can't continue when there's a data cache miss. That's why it's called in-order, and why everyone else is doing out-of-order these days.

I have no idea what you mean by "recovering the slowdown in subsequent steps". If the chip is stalled and has to sit doing nothing waiting for memory, it's pure loss, which is again why everyone has been avoiding these designs like the plague.

Re:How does this account of caching? (1)

Calinous (985536) | about a month and a half ago | (#47655325)

Recovering the slowdown in subsequent steps = use the full width of seven microops to run significantly faster than a typical out-of-order ARM design. As for continuing when there's a data cache miss, I was referring to out-of-order designs, which might - or might not - be stalled in a couple more instructions because of dependencies on not-yet-processed data (which is loaded from memory).

Re:How does this account of caching? (1)

Calinous (985536) | about a month and a half ago | (#47655369)

There might be some "hints" for microprocessor for the data to cache - if so, those could be added in the generated microcode at some time before they're really needed, increasing the chance to have them available in cache and/or reducing wait time. Of course, I don't know for sure but you could read a value in a register then zero the register. This might be optimized out of microprocessor run (so it won't consume energy to load and then zero the register), but still go through the data fetch engine, so it would reach L1 or L2 cache.

Well its really just another ARM CPU (0)

Anonymous Coward | about a month and a half ago | (#47653551)

I have heard so many talk about different designs from the Cortex-A15 that all seem to come up short when put to actual use. The real problem has always been that, unlike x86 designs, the ARM cores require a specific instructional architecture, and the results are customized operating systems that don't always support everything. We see this with tablets, Chromebooks and even smartphones. It's also why it's very difficult to just install another OS on many of these devices without the specific instructions for that particular core. I am not opposed to devices made to order, but I know many who would rather be open to the ability to install different flavors of Linux. The Nvidia Tegra is just another custom chip that will go into customized devices with little chance to vary their function.

Moral of the story (1)

PopeRatzo (965947) | about a month and a half ago | (#47653643)

If I understand this story correctly, the message is that if you get a tablet with this processor, avoid manufacturers who install a lot of bloatware.

Is it about the CPU, or the OS ? (2)

GuB-42 (2483988) | about a month and a half ago | (#47653653)

A buffer in main memory, software that optimizes most-used code... it sounds like an OS job to me, something that could be implemented in the Linux kernel and benefit all CPUs, provided that you have the appropriate driver.

According to the paper, it looks like the biggest novelty is... DRM. The optimizer code will be encrypted and will run in its own memory block, hidden from the OS. It will also make use of some special profiling instructions which could just as well be accessible to the OS. Maybe they will be, but they say nothing about it.

Re:Is it about the CPU, or the OS ? (0)

Anonymous Coward | about a month and a half ago | (#47653769)

It's inappropriate for the kernel, because it requires extremely detailed knowledge about the CPU's internal microarchitecture. Those things aren't documented because they're closely guarded trade secrets and many techniques are patented. Nobody wants to give out a patent license by putting it in the GPLed kernel.

Re:Is it about the CPU, or the OS ? (1)

jones_supa (887896) | about a month and a half ago | (#47654211)

DRM? You could just as well call a whole CPU DRM'd, since apart from the interface documentation (commands and their expected outcomes) a chip is a black box.

Re:Is it about the CPU, or the OS ? (2)

swillden (191260) | about a month and a half ago | (#47654249)

According to the paper, it looks like biggest novelty is... DRM. The optimizer code will be encrypted and will run in its own memory block, hidden from the OS.

DRM is already fully supported in ARM processors. See TrustZone [arm.com], which provides a separate "secure virtual CPU" with on-chip RAM not accessible to the "normal" CPU and the ability to get the MMU to mark pages as "secure", which makes them inaccessible to the normal CPU. Peripherals can also have secure and non-secure modes, and their secure modes are accessible only to TrustZone. A separate OS and set of apps run in TrustZone. One DRM application of this is to have secure-mode code that decrypts encrypted video streams and writes them directly to a region of display memory which is marked as secure, so the normal OS can never see the decrypted data. Another common application is secure storage for cryptographic keys, ensuring that even if an attacker can acquire root on your device, he can't extract your authentication keys (though he can probably use them as long as he controls the device, since the non-secure OS is necessarily the gatekeeper).

Nearly all mainstream Android devices have TrustZone, and nearly all have video DRM implemented in it. A large subset also use it for protection of cryptographic keys. (Go to Settings -> Security and scroll down to "Credential storage -> Storage type". If it says "hardware-backed", your device has TrustZone software for key storage.)

So, no, nVidia isn't doing this for DRM. That problem is already solved, though it's stupid because all of the content is on the Internet anyway.

Can they make one... (1)

Torp (199297) | about a month and a half ago | (#47653761)

... that doesn't run Facebook?
Otherwise, no buy.

Transmeta died when Linus was hired (0)

Anonymous Coward | about a month and a half ago | (#47653777)

But the chip, it is twice as large because it has twice the bits (RTFA 32 to 64 is 2x). That means you get twice the goodness like twice the registers and twice the address space (minus 2 on most implementations due to twos-complement, which is too complex to cover here) and twice the power and speed. This is double-plus good in all books I have browsed.

Re: Transmeta died when Linus was hired (0)

Anonymous Coward | about a month and a half ago | (#47653867)

Going from a 32bit to 64bit architecture doesn't double the size of the actual chip.

People don't seem to get ooe (0)

Anonymous Coward | about a month and a half ago | (#47653821)

It transforms linear execution of code into a dataflow model, pull instead of push: rather than sending an instruction to the execution unit, your execution units demand the next instruction when they are free, so they are almost never idle and are always doing something.

Re:People don't seem to get ooe (0)

Anonymous Coward | about a month and a half ago | (#47653991)

OOE is wonderful if you have an infinite power budget. Intel's high end Xeon chips are amazing, but you can't use them in tablets or phones.

Re:People don't seem to get ooe (1)

Megol (3135005) | about a month and a half ago | (#47654039)

Nope. All standard OoO mechanisms are push-based; that is, operations to execute are pushed from the scheduler to the execution units. The execution units are dumb and only consume data and operation information, producing a set of results.
In most OoO designs the number of operations actually capable of flowing through the execution units is less than the theoretical width, limited either by the scheduler or by the retirement logic.

A VLIW can scale to greater actual execution throughput; however, it is hard to make them perform well on many types of code. Compilers are one example of a hard type of program for VLIWs.

The Mill (2)

Anon E. Muss (808473) | about a month and a half ago | (#47653947)

I think NVidia tied their hands by retaining the ARM architecture. I suspect the result will be a "worst of both worlds" processor that doesn't use less power or provide better performance than competitors.

In order execution, exposed pipelines, and software scheduling are not new ideas. They sound great in theory, but never seem to work out in practice. These architectures are unbeatable for certain tasks (e.g. DSP), but success as general purpose processors has been elusive. History is littered with the corpses of dead architectures that attempted (and failed) to tame the beast.

Personally, I'm very excited about the Mill [millcomputing.com] architecture. If anybody can tame the beast, it will be these guys.

Re:The Mill (1)

InvalidError (771317) | about a month and a half ago | (#47654907)

Looking at Shield Tablet reviews, the K1 certainly appears to have the processing power, but actually putting it to use takes a heavy toll on the battery, with the SoC alone drawing over 6W under full load: in AnandTech's review, battery life drops from 4.3h to 2.2h when they disable the 30fps cap in GFXBench.

The K1's processing power looks nice in theory but once combined with its power cost, it does not sound that good anymore.

Re:The Mill (1)

phoenix_rizzen (256998) | about a month and a half ago | (#47655463)

The K1 in the SHIELD Tablet uses standard ARM Cortex-A15 cores, not the Denver CPU cores detailed in this story. Very different beasts.

Static scheduling always performs poorly (5, Informative)

Theovon (109752) | about a month and a half ago | (#47654085)

I'm an expert on CPU architecture. (I have a PhD in this area.)

The idea of offloading instruction scheduling to the compiler is not new. This was particularly in mind when Intel designed Itanium, although it was a very important concept for in-order processors long before that. For most instruction sequences, latencies are predictable, so you can order instructions to improve throughput (reduce stalls). So it seems like a good idea to let the compiler do the work once and save on hardware. Except for one major monkey wrench:

Memory load instructions

Cache misses and therefore access latencies are effectively unpredictable. Sure, if you have a workload with a high cache hit rate, you can make assumptions about the L1D load latency and schedule instructions accordingly. That works okay. Until you have a workload with a lot of cache misses. Then in-order designs fall on their faces. Why? Because a load miss is often followed by many instructions that are not dependent on the load, but only an out-of-order processor can continue on ahead and actually execute some instructions while the load is being serviced. Moreover, OOO designs can queue up multiple load misses, overlapping their stall time, and they can get many more instructions already decoded and waiting in instruction queues, shortening their effective latency when they finally do start executing. Also, OOO processors can schedule dynamically around dynamic instruction sequences (i.e. flow control making the exact sequence of instructions unknown at compile time).

One Sun engineer talking about Rock described modern software workloads as races between long memory stalls. Depending on the memory footprint, a workload could spend more than half its time waiting on what is otherwise a low-probability event. The processors blast through hundreds of instructions where the code has a high cache hit rate, and then they encounter a last-level cache miss and stall out completely for hundreds of cycles (generally not on the load itself but on the first instruction dependent on the load, which always comes up pretty soon after). This pattern repeats over and over again, and the only way to deal with that is to hide as much of that stall as possible.

With an OOO design, an L1 miss/L2 hit can be effectively and dynamically hidden by the instruction window. L2 (or in any case the last level) misses are hundreds of cycles, but an OOO design can continue to fetch and execute instructions during that memory stall, hiding a lot of (although not all of) that stall. Although it's good for optimizing poorly-ordered sequences of predictable instructions, OOO is more than anything else a solution to the variable memory latency problem. In modern systems, memory latencies are variable and very high, making OOO a massive win on throughput.

Now, think about idle power and its impact on energy usage. When an in-order CPU stalls on memory, it's still burning power while waiting, while an OOO processor is still getting work done. As the idle proportion of total power increases, the usefulness of the extra die area for OOO increases, because, especially for interactive workloads, there is more frequent opportunity for the CPU to get its job done a lot sooner and then go into a low-power low-leakage state.

So, back to the topic at hand: What they propose is basically static scheduling (by the compiler), except JIT. Very little information useful to instruction scheduling is going to be available just before execution time that is not available much earlier. What you'll basically get is some weak statistical information about which loads are more likely to stall than others, so that you can resequence instructions dependent on loads that are expected to stall. As a result, you may get a small improvement in throughput. What you don't get is the ability to handle unexpected stalls, overlapped stalls, or the ability to run ahead and execute only SOME of the instructions that follow the load. Those things are really what gives OOO its advantages.

I'm not sure where to mention this, but in OOO processors, the hardware to roll back mispredicted branches (the reorder buffer) does double-duty. It's used for dependency tracking, precise exceptions, and speculative execution. In a complex in-order processor (say, one with a vector ALU), rolling back speculative execution (which you have to do on mispredicted branches) needs hardware that is only for that purpose, so it's not as well utilized.
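
A concrete way to see the memory-stall point (illustrative C; the data shapes are arbitrary): dependent loads serialize their misses on any core, while independent loads let an OoO core keep several misses outstanding at once, which is exactly the overlap an in-order design struggles to recover.

struct link { struct link *next; long payload; };

/* Dependent loads: each miss must complete before the next address is even
 * known, so misses serialize on any core, in-order or OoO. */
static long chase(const struct link *p)
{
    long total = 0;
    while (p) {
        total += p->payload;
        p = p->next;            /* next address depends on this load */
    }
    return total;
}

/* Independent loads: addresses are known up front, so an OoO core can have
 * several cache misses outstanding at once; an in-order core stalls at the
 * first use of each missing value instead. */
static long gather(const long *data, const int *idx, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += data[idx[i]];  /* loads are independent of one another */
    return total;
}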

Re:Static scheduling always performs poorly (2)

solus1232 (958622) | about a month and a half ago | (#47654565)

This is a good post (the point about hiding memory latency in particular), but you should still wait to judge the new chip until benchmarks are posted.

If you have ever worked on a design team for a high performance modern CPU, you should know that high level classifications like OOO vs In-Order never tell the whole story, and most real designs are hybrids of multiple high level approaches.

Re:Static scheduling always performs poorly (2)

iamacat (583406) | about a month and a half ago | (#47654787)

One critical piece of information which is available just before run time and not much earlier is which precise CPU/rest of device the code is running on! I don't buy that an OoO processor can do as good a job optimizing in real time as a JIT compiler that has 100x the time to do its work. If a processor has cache prefetch/test instructions, these can be inserted "hundreds of cycles" before memory is actually used. OoO can work around a single stall, but how about a loop that accesses 128K of RAM, with start location and size discoverable far in advance of the actual access?

I think it's obvious that in the ideal world, with unlimited power and money budget, you would do both. If you have to choose, well you take your best guess and go with it.

Re:Static scheduling always performs poorly (3, Interesting)

Rockoon (1252108) | about a month and a half ago | (#47654933)

So it seems like a good idea to let the compiler do the work once and save on hardware. Except for one major monkey wrench: Memory load instructions

That's not the only monkey wrench. Compilers simply aren't good enough in general, and there is little evidence that they could be made good enough on a consistent basis, because architectures keep evolving and very few compilers actually model specific architecture pipelines...

This is why Intel now designs their architectures to execute what compilers produce well, rather than the other way around. Intel would not have 5 asymmetric execution units with lots of functionality overlap in its latest CPUs if compilers didn't frequently produce code that requires it...

Which leads to compiler writers spending the majority of their effort on big-picture optimizations, because Intel et al. are dealing with the low-level scheduling issues for them... the circle is complete... it's self-sustaining.

Re:Static scheduling always performs poorly (1)

AmiMoJo (196126) | about a month and a half ago | (#47655135)

I can only assume that Nvidia's engineers are aware of all this, since it is pretty basic stuff when it comes to CPU design really, and that TFA is simply too low on detail to explain what they are really doing.

Maybe it is some kind of hybrid where they still have some OOO capability, just reduced and compensated for by the optimization they talk about. It can't be as simple as TFA makes out, because as you say that wouldn't work.

Re:Static scheduling always performs poorly (0)

Anonymous Coward | about a month and a half ago | (#47655529)

Not that what you said is wrong, but the primary goal of a high-performance microprocessor pipeline is to generate as many outstanding cache misses as possible.

at least they did not just attach cache... (1)

johnjones (14274) | about a month and a half ago | (#47654107)

Scalar designs simply attach more cache... more hits and speculative loads (/MMU) solved it for SPARC/MIPS/Power.

The HP research into Dynamo, and later the Transmeta design concepts, showed promise but delivered no product beyond small samples (under 1 million shipped) into people's houses...

I was most excited by Dynamo and VLIW (Itanium promised so much and delivered so little); LLVM provides some interesting concepts.

I would really like Texas Instruments (TI) back in the game, as I think large I and D caches combined with specialised (DSP + crypto) offload engines would blow the socks off the current market...

It will be interesting, as Intel has a smaller process geometry yet the market momentum is with ARM; but do manufacturers care?

Have fun, and power consumption matters!

John Jones

Can't be efficient... (0)

Anonymous Coward | about a month and a half ago | (#47654173)

...if the NSA has placed a backdoor in there,

"...relies on a dynamic optimization program (running on one of the two CPUs) to calculate and optimize the most efficient way to execute code..."

and spawn a fork over to NSA collectors.

So on-die intelligent pre-cache. (1)

Khyber (864651) | about a month and a half ago | (#47654499)

"There's no need to keep decoding it for execution over and over."

So kinda in bed with a combo ReadyBoost/XP-precache.

Why not just use ReadyBoost/precache and make it work the same freaking way?

LLVM (1)

iamacat (583406) | about a month and a half ago | (#47655447)

Why not have all applications ship in LLVM intermediate format and then have on-device firmware translate them according to the exact instruction set and performance characteristics of the CPU? By the time code is compiled to the ARM instruction set, too much information is lost to do fundamental optimizations, like vectorizing loops when the applicable operations are supported.

Re:LLVM (0)

Anonymous Coward | about a month and a half ago | (#47655897)

Richard Stallman would have a heart attack?

Re:LLVM (0)

Anonymous Coward | about a month and a half ago | (#47656099)

LOL DECOMPILERS

Seriously, if the code is expression trees rather than machine code, it's trivial to decompile it back to source code.
Nice for open source software, bad for anything else.

Changing form factors, changing software (2)

bmajik (96670) | about a month and a half ago | (#47655607)

Suppose for a moment that you are building a new processor for mobile devices.

The mobile device makers -- Apple, Google, and Microsoft -- all have "App Stores". Side loading is possible to varying degrees, but in no case is it supported or a targeted business scenario.

These big 3 all provide their own SDKs. They specify the compilers, the libraries, etc.

Many of the posts in this thread talk about how critical it will be for the compilers to produce code well suited for this processor...

Arguably, due to the app development toolchain and software delivery monoculture attached to each of the mobile platforms, it is probably easier than ever to improve compilers and transparently update the apps being held in app-store catalogs to improve their performance for specific mobile processors.

It's not the wild west any more; with tighter constraints around the target environment, more specific optimizations become plausible.

Progress! (1)

Iniamyen (2440798) | about a month and a half ago | (#47656329)

Once Denver sees you run Facebook or Candy Crush a few times, it's got the code optimized and waiting.

I am so fortunate to live in such an advanced age of graphics processors, that let me run the equivalent of a web browser application and a 2D tetris game. What progress! We truly live in an age of enlightenment!
