
Cliff Click's Crash Course In Modern Hardware

timothy posted more than 4 years ago | from the first-there-were-the-dinosaurs dept.

Intel

Lord Straxus writes "In this presentation (video) from the JVM Languages Summit 2009, Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code, thanks to all of the crazy kung-fu and ninjitsu it performs while your code is running. This talk is an excellent drill-down into the internals of the x86 chip, and it's a great way to get an understanding of what really goes on down at the hardware level and why certain types of applications run so much faster than others. Dr. Cliff really knows his stuff!"


249 comments

Fast forward... (5, Informative)

LostCluster (625375) | more than 4 years ago | (#30772618)

I can't say I've WTFV like I usually RTFA before you get to see it... but I can tell you this: The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

Re:Fast forward... (5, Funny)

Jah-Wren Ryel (80510) | more than 4 years ago | (#30772738)

The first four minutes of the video are spent asking which topic the room wants to see. No need to watch that part. Then it gets more interesting.

That's just the branch predictor pre-loading the cache for each possible conditional result.

Re:Fast forward... (1)

Ginger Unicorn (952287) | more than 4 years ago | (#30774190)

That is proper hard-fucking-core geek wit. Bravo.

Re:Fast forward... (0, Flamebait)

OverlordQ (264228) | more than 4 years ago | (#30772944)

You mean you actually got it to play instead of stare at a play button? Can we please kill Flash already.

Re:Fast forward... (1)

Mitchell314 (1576581) | more than 4 years ago | (#30773040)

Huh, I got it to play nicely for a while. Then it kept stopping. Now it won't play at all. Good show, at least as far as I could see (~15 minutes).

And Flash really really needs to die for the greater good. And for us Linux users.

Re:Fast forward... (5, Informative)

Brian Gordon (987471) | more than 4 years ago | (#30773782)

A little javascript-fu reveals that the video player points to a file (at http://flv.thruhere.net/presentations/09-sep-JVMperformance.flv [thruhere.net] ) on some poor guy's machine through a dynamic DNS service! I hope somebody grabbed a copy before he (or slashdot) took his server down.

Re:Fast forward... (0)

Anonymous Coward | more than 4 years ago | (#30774456)

jsclassref. Base64. That's lame.

Re:Fast forward... (1)

Gazzonyx (982402) | more than 4 years ago | (#30773134)

You're lucky that you didn't get it to play; mine played to six minutes and then just stopped and won't play or let me skip past that.

Video is a waste of time... (0)

Anonymous Coward | more than 4 years ago | (#30773502)

I can't even watch this. Anyone got a transcript so that I can skip the video BS and just read it? I can read a lot faster than he can talk, and I wouldn't have to wait 30 minutes for the video to load (slow connection) ...

Re:Video is a waste of time... (1)

Brian Gordon (987471) | more than 4 years ago | (#30773860)

You have to admit it's pretty nice to have the presentation slides automatically display and advance below the video as you watch.

Could someone give me a crash course (1, Funny)

Anonymous Coward | more than 4 years ago | (#30772638)

on the website? I'm not sure what I'm looking at...

Re:Could someone give me a crash course (5, Funny)

Lunix Nutcase (1092239) | more than 4 years ago | (#30772674)

Probably due to your x86 processor doing all sorts of monkeying with the code.

Re:Could someone give me a crash course (2, Funny)

creimer (824291) | more than 4 years ago | (#30772706)

Spaghetti code can be hard to digest.

Re:Could someone give me a crash course (5, Funny)

Icegryphon (715550) | more than 4 years ago | (#30773126)

Spaghetti code can be hard to digest.

Sounds to me like someone is using stale Copypasta.

Re:Could someone give me a crash course (1)

funwithBSD (245349) | more than 4 years ago | (#30773622)

They made the meatballs out of DEADBEEF.

Re:Could someone give me a crash course (1)

networkBoy (774728) | more than 4 years ago | (#30774242)

gotten at the 0xCAFE 0F DEAD BEEF

Re:Could someone give me a crash course (1)

TeknoHog (164938) | more than 4 years ago | (#30773652)

Incidentally, my most reliable Flash player is found on a Nokia N800, running Linux on ARM. Fortunately there are ways to download the video file in many cases.

Code in high-level (1, Insightful)

elh_inny (557966) | more than 4 years ago | (#30772640)

It doesn't make sense to code in ASM anymore.
With computing expanding towards more and more parallelism, I can clearly see that one should learn to start coding in the most abstract way possible and let the tools do the optimisation...

Re:Code in high-level (5, Insightful)

caerwyn (38056) | more than 4 years ago | (#30772710)

That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.

However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.

Re:Code in high-level (2, Interesting)

Com2Kid (142006) | more than 4 years ago | (#30772902)

Also, the compiler doesn't always take advantage of instructions that it could use.

Yah sorry about that. :)

Part of the problem is that compilers have to support a variety of instruction sets. If the majority of your customers are using an 8-year-old revision of an instruction set, then even if the newest revision offers Super Awesome Cool features that make code run a lot faster, you end up with a chicken-and-egg problem: it makes sense for the compiler team to focus on the old architecture, since that's what everyone is using, and no one wants to move to the new architecture, since the compiler doesn't take full advantage of it.

Re:Code in high-level (3, Interesting)

Chris Burke (6130) | more than 4 years ago | (#30772928)

That's not entirely true. In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations. Also, the compiler doesn't always take advantage of instructions that it could use.

Yeah, and the chip makers release software optimization guides on how to avoid such stalls or take advantage of other features. It's really hard to apply those at the C level, and it can be hard for the compiler to know that a certain situation calls for one of these optimizations.

However, determining that takes a lot of effort and a lot of instrumentation, and so you'd better really need that last bit of performance before you go after it.

Agreed, it's basically something you're going to do for the most performance critical part, like the kernel of an HPC algorithm for example.

Re:Code in high-level (4, Informative)

Sycraft-fu (314770) | more than 4 years ago | (#30772966)

Also either start with the assembly the compiler generates, or at the very least make sure to bench your own against what it makes. The Intel Compiler in particular is extremely good at what it does. As such, it is worth your while to see what its solution to your problem is, and then see if you can improve, rather than assuming you are smarter and can do everything better on your own.

Of course all that is predicated on using a profiler first to find out where the actual problem is. Abrash accurately pointed out years ago that programmers suck at that. They'll spend hours making a nice optimized function that ends up making no noticeable difference in execution time.

Re:Code in high-level (1)

phantomfive (622387) | more than 4 years ago | (#30773196)

One of the biggest drawbacks of a language like C (and even more so C++, and even more so Java) is that they don't give you a whole lot of control over how stuff is arranged in memory. One of the biggest processor slowdowns, especially if you are dealing with a lot of data, is cache misses. If you can align your data in memory on cache lines, then you can make huge performance gains. Since C doesn't give you much control over this, if you really want to optimize it you have to go to assembly.

Also, some of glibc function calls (like memmove or memcpy, I believe) have been optimized in assembly, which is kind of nice. As always, use a profiler to make sure you're actually speeding things up.
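
(As the replies below note, C compilers do give you some of that control. A minimal sketch of requesting cache alignment from C, assuming GCC/Clang and a 64-byte line size; both are assumptions, not anything from the comment above:)

#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size, typical of current x86 */

/* GCC/Clang extension: align the whole struct so its hot fields
   never straddle two cache lines. */
struct hot_data {
    long counter;
    long values[6];
} __attribute__((aligned(CACHE_LINE)));

int main(void) {
    void *buf;
    /* For heap data, posix_memalign returns a block with the
       requested alignment. */
    if (posix_memalign(&buf, CACHE_LINE, 4096) != 0)
        return 1;
    free(buf);
    return 0;
}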

Re:Code in high-level (3, Interesting)

dr2chase (653338) | more than 4 years ago | (#30773324)

Dealing with alignment is not that much of an assembler issue, if you are using C. Address arithmetic gets the job done. If you even want your globals aligned (and not just heap-allocated stuff) you *might* need some ASM, but just for the declarations of stuff that would be "extern struct whatever stuff" in C (and in a pinch, you write a bit of C code to suck in the headers defining "stuff", figure out the sizes, and emit the appropriate declarations in asm).

Writing memmove/memcpy in assembler is a mixed bag. If you write it in C, you can preserve some tiny fraction of your sanity dealing with all the different alignment combinations before you get to full-word loads and stores. HOWEVER, on the x86, all bets are off; the only way to tell for sure what is fastest is to write it and benchmark it.

Re:Code in high-level (4, Informative)

TheRaven64 (641858) | more than 4 years ago | (#30774106)

One of the biggest drawbacks of a language like C (and even more so C++, and even more so Java) is that they don't give you a whole lot of control over how stuff is arranged in memory

I'd say this is more of a C/C++ problem than a Java problem. Or, rather, they are different problems. The problem with C and C++ is that they do give the programmer a whole lot of control about how things are arranged in memory. They don't, on the other hand, give the compiler a lot of freedom to rearrange things.

Java, on the other hand, uses the Smalltalk memory model and so the compiler (and/or JVM) is free to rearrange things in memory as much as it wants to (whether it does, of course, is a matter for the compiler writer). For example, a Java compiler that notices that you are doing the same operation on three instance variables is free to put them next to each other aligned on a 128-bit boundary with some padding at the end so that you can easily use vector instructions on them, even if they were originally declared in different classes. A C compiler can not do this with structure fields.

If you really care about alignment in C, you are free to use valloc() to align on a page boundary and then subdivide the memory yourself. Most of the time, however, it's not worth the effort.
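
(A sketch of that valloc()-and-subdivide approach; the object size and count are made-up values for illustration, and note valloc() is obsolescent these days, with posix_memalign() as the usual replacement:)

#include <stdlib.h>

enum { OBJ_SIZE = 256, NOBJ = 128 };   /* illustrative values only */

int main(void) {
    char *base = valloc(NOBJ * OBJ_SIZE);   /* page-aligned block */
    if (base == NULL)
        return 1;
    /* base is page-aligned and OBJ_SIZE divides the page size, so
       every carved-out object starts on a 256-byte boundary. */
    char *obj = base + 42 * OBJ_SIZE;
    (void)obj;
    free(base);
    return 0;
}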

Re:Code in high-level (1, Insightful)

Anonymous Coward | more than 4 years ago | (#30773276)

There is an old saying that performance improvement comes from better algorithms, not instruction fiddling. Simply put, if your performance is not adequate using ordinary compiled code, then you have serious issues with your software or hardware design.
Note that instruction fiddling couples the software closely to a specific CPU, which is not a good idea unless you can control both indefinitely.

Re:Code in high-level (2, Insightful)

caerwyn (38056) | more than 4 years ago | (#30773442)

That's *generally* true. It's not *always* true.

There are a lot of purely compute-bound applications (think simulations of various sorts, etc) for which the algorithmic optimizations have already been done- but it's still worth going for the last few percent of performance from "instruction fiddling". As another poster said: if your app runs for weeks at a time, 1% improvement becomes significant in terms of time saved- and throwing more hardware at the problem isn't always feasible.

Re:Code in high-level (1)

dbIII (701233) | more than 4 years ago | (#30773328)

Also there is code that is used a lot for a long time.
For example in geophysics there is a process of arranging data called "Pre Stack Time Migration" which can keep a small cluster busy for a week with relatively small datasets. In cases like that tiny improvements save hours. Only one percent of improvement saves more than an hour in a week.

Re:Code in high-level (1)

wisty (1335733) | more than 4 years ago | (#30774400)

I heard a rumor that there's some fundamental geophysical program that's been around for decades. It doesn't accumulate the results in an array, because memory was too expensive when Fortran 66 was the hot new thing.

It has a write-to-disk instruction in an inner loop. But it works, and nobody wants to touch it.

A little micro-optimization there would grant a 1000x speedup.

Re:Code in high-level (2, Informative)

RzUpAnmsCwrds (262647) | more than 4 years ago | (#30773352)

It also depends on the compiler. GCC, for example, sucks at auto-vectorization, so it's easy to get 30% or more on loopy scientific code just by using SSE instructions properly.

In contrast, PGI or ICC is much harder to beat using assembly.

Re:Code in high-level (3, Interesting)

TheRaven64 (641858) | more than 4 years ago | (#30774128)

Note that even with GCC, the choices aren't just autovectorisation and assembly. GCC provides (portable) vector types, and if you declare your variables as these then it will try to use SSE / AltiVec / whatever instructions for the operations, and it easily can, because your variables are aligned. Primitive operations (i.e. the ones you get on scalars in C) are defined on vectors, so you can do 2^n of them in parallel and GCC will emit the relevant instructions depending on your target CPU. Going a step further, there are intrinsic functions that are specific to a particular vector ISA and can be used with these types. Then you get to tell GCC exactly which instruction to use, but it still does all of the register allocation for you.
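
(A small sketch of those vector types; the type and function names are invented for illustration, and GCC picks the actual instructions based on -march:)

typedef float v4sf __attribute__((vector_size(16)));  /* four packed floats */

/* y[i] += a * x[i], four lanes at a time; GCC lowers the * and +=
   to SSE, AltiVec, or NEON depending on the target. */
void scale_add(v4sf *y, const v4sf *x, float a, int n) {
    v4sf va = {a, a, a, a};
    for (int i = 0; i < n; i++)
        y[i] += va * x[i];
}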

Re:Code in high-level (1)

frank_adrian314159 (469671) | more than 4 years ago | (#30773850)

In performance-sensitive tight loops, it can still make sense to code in ASM to avoid pipeline bubbles and stalls in some very limited situations.

And that will work until the next rev of the board's chip, which your hardware vendor will change when he wants to and not notify you about. You'll know about it when the customer complaints roll in about poor performance or during your next rev of the firmware when your performance stats go to hell. And, if you're trying to do this for COTS hardware, forget it - you won't even know which chips you'll be running on. The bottom line? Unless price (and cost to your company) is of no concern, write the code as cleanly as possible and run it through an optimizer.

Re:Code in high-level (2, Insightful)

Thiez (1281866) | more than 4 years ago | (#30772736)

Sometimes it's just plain FUN FUN FUN to code in asm. You're right that most programmers will never have a need for it at all (with some exceptions, such as those messing with operating systems or embedded systems), although knowing some ASM can help a lot with debugging. I suppose one could (read: should) learn a little ASM to have a better idea of what the hardware is doing; this will allow you to optimize your code a little, or (more importantly) write it in such a way that makes it easier for the compiler to optimize.

Re:Code in high-level (3, Informative)

marcansoft (727665) | more than 4 years ago | (#30773066)

Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".

On the other hand, it's very true that knowing some ASM can help you write code that the compiler will translate into better assembly code, without going through all of the trouble yourself.

Re:Code in high-level (0)

Anonymous Coward | more than 4 years ago | (#30773842)

amd64 is horrible; the calling convention is ridiculously complicated and differs across operating systems

Re:Code in high-level (3, Informative)

TheRaven64 (641858) | more than 4 years ago | (#30774146)

The calling convention is complicated, but it's nowhere near as different as IA32 calling conventions between platforms. Linux and FreeBSD, for example, use different rules for when to return a structure on the stack and when to return it in registers on IA32, but they use exactly the same conventions (the SysV ABI) on x86-64.

Re:Code in high-level (3, Interesting)

SETIGuy (33768) | more than 4 years ago | (#30774502)

Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun.

Coding assembly on RISC architectures is dead boring because all the instructions do what you expect them to and can be used on any general purpose register.

In the good old days, when x86 meant 8086, there were no general purpose registers. The BX register could be used for indexing, but AX, CX and DX couldn't. CX could be used for counts (bit shifts, loops, string moves), but AX, BX, and DX couldn't. SI and DI were index registers that you could add to BX when dereferencing, or could be used with CX for string moves. AX and DX could be used in a pair for a 32 bit value. If you wanted to multiply, you needed to use AX. If you wanted to divide, you needed to divide DX:AX by a 16 bit value and your result would end up in AX and the remainder in DX. Compared to the Z80 assembly language, we thought this was easy.

Being able to use %r2 for the same stuff you use %r1 for is just boring.

Re:Code in high-level (3, Interesting)

creimer (824291) | more than 4 years ago | (#30772784)

I wanted to take ASM in college. I was the only student who showed up for the class, and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.

Re:Code in high-level (0)

Anonymous Coward | more than 4 years ago | (#30773050)

> no one wanted to get their hands dirty under the hood.

That's not at all how I remember college.

Re:Code in high-level (2, Interesting)

KC1P (907742) | more than 4 years ago | (#30773272)

That's a real shame! But my impression is that for a long time now, college-level assembly instruction has consisted almost entirely of indoctrinating the students to believe that assembly language programming is difficult and unpleasant and must be avoided at all costs. Which couldn't be more wrong -- it's AWESOME!

Even on the x86 with all its flaws, being able to have that kind of control makes everything more fun. The fact that your code runs like a bat out of hell (unless you're a BAD assembly programmer, which a lot of people are but they don't realize it so they bad-mouth the language) is just icing on the cake. You should definitely teach yourself assembly, if you can find the time.

Re:Code in high-level (1)

s73v3r (963317) | more than 4 years ago | (#30773404)

Wow, that sucks. My college ASM class was AWESOME! Granted, it was probably only there to give us a feeling for what was going on under the hood, not to actually learn x86 assembly, but it was taught by a guy who not only was very knowledgeable about the subject, but was also really enthusiastic (even for being upwards of 70!).

Re:Code in high-level (1)

Dunbal (464142) | more than 4 years ago | (#30773440)

I think you can legally get MASM (Microsoft Macro Assembler) somewhere on the internet for free. A good place to start would be Microsoft. Then you can do what real coders do, and teach yourself!

And to think I paid several hundred dollars for that, back in the day.

Re:Code in high-level (2, Informative)

Anonymous Coward | more than 4 years ago | (#30773656)

Or you could get NASM, which is open source :)

Re:Code in high-level (1)

creimer (824291) | more than 4 years ago | (#30773828)

Sweet! The last time I looked at ASM, I had to run a DOS box under Windows XP, which didn't work out too well.

Re:Code in high-level (1)

mfnickster (182520) | more than 4 years ago | (#30774004)

Will NASM let you write structured assembly, like MASM?

I picked up a used copy of Inner Loops [amazon.com] by Rick Booth, and it intrigued me enough to consider tracking down an old version of MASM.

Re:Code in high-level (0)

Anonymous Coward | more than 4 years ago | (#30773894)

I wanted to take ASM in college. I was the only student who showed up for the class, and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.

I did an EE and we had to learn ASM for some embedded courses (6811, PIC). Learning it for "larger" processors is certainly possible, but you could always get a hobby kit and learn it. We also had some courses in VHDL to design simple CPUs and VGA emulators (e.g., had to program an FPGA to display certain patterns on a CRT).

I'm guessing you were a CS major, and they didn't really go down into hardware as much as in comp. eng. or EE.

Re:Code in high-level (1)

creimer (824291) | more than 4 years ago | (#30773984)

I was learning computer programming at the local community college while working as a lead video game tester. Two-thirds of my classes were Java-centric. When C++ became available again after the college got the money for a renewed Microsoft site license, I took the remaining classes in that language. Ironically, the instructor didn't like the new version of Microsoft Visual Studio and we switched to Linux.

Re:Code in high-level (1)

Kjella (173770) | more than 4 years ago | (#30773982)

I wanted to take ASM in college. I was the only student who showed up for the class, and the class was canceled. Since most of the programming classes were Java-centric, no one wanted to get their hands dirty under the hood.

I'm probably going to need an asbestos suit for this post, but to be honest I don't think assembler is a good programming language for humans. My impression is that they absolutely don't want to pollute the instruction set with instructions unless there's a performance benefit to doing so. But what it means in practice is that anyone I've seen writing advanced assembly relies on lots and lots of macros to do essential things, because the combination of instructions is useful but there's no language construct. For example, in general you JMP everywhere which is the low-level equivalent of GOTO and you use that to create the equivalent of FOR and WHILE etc. which is neat to have seen once but gets quite tedious to do over and over.

Most of the real world issues I run into aren't of the type "yeah, with an assembler optimization here we could squeeze another 2% out of it." It's stuff like "wtf, why are you putting that inside the loop?" or "why are you doing this processing one by one when a batch update would do this 1000x faster?" If you have a clue about what's happening in C, if you know when memory is allocated/deallocated and that the basic operations you do make sense, you'll write better code than 90% of the developers out there anyway.

Re:Code in high-level (1)

AdamHaun (43173) | more than 4 years ago | (#30774250)

It's not a great language (family) for general use, but it is a good way to learn something about how CPUs work, what a function call actually is, etc.

Re:Code in high-level (0)

Anonymous Coward | more than 4 years ago | (#30774308)

Actually, a for or while construct is trivially easy in asm, almost easier than in C.

for construct: for(i=amount;i--;)
mov ecx,amount
loop: ...inner loop...
dec ecx
jnz loop

while construct: while(amount!=0)
jmp test
loop: ...inner loop...
test: cmp amount,0
jnz loop

Re:Code in high-level (0)

Anonymous Coward | more than 4 years ago | (#30774408)

That's a shame, because Java is actually a great language to learn the principles of assembly. It's very easy to disassemble compiled class files to bytecode, and thus easy to map the stack-based instructions to the Java source.

Re:Code in high-level (1, Insightful)

Anonymous Coward | more than 4 years ago | (#30772820)

Someone has to write those tools.

Re:Code in high-level (2, Insightful)

Just Some Guy (3352) | more than 4 years ago | (#30772886)

Someone has to write those tools.

Yeah, but they can be written in a HLL, too. You don't have to write a program in highly-tuned assembler to make it emit highly-tuned assembler.

Re:Code in high-level (1)

DarkOx (621550) | more than 4 years ago | (#30773080)

You certainly need to know a lot about assembler and CPU architecture if you are going to write code that emits highly tuned assembler. Actually, you probably do have to write those tools in assembler for all intents and purposes. To really oversimplify: compilers are pretty much syntax checkers and search tree engines. They take your code and replace it with a matching assembly listing or set of listings, substituting whichever registers happen to be free, etc.

What?! (0)

Anonymous Coward | more than 4 years ago | (#30773260)

What the fuck are you talking about. Why the hell do you need to write a compiler in assembler? Do you have any idea how a compiler works? Your last sentence suggests not.

Re:Code in high-level (1)

Just Some Guy (3352) | more than 4 years ago | (#30773928)

You certainly need to know a lot about assembler and CPU architecture if you are going to write code that emits highly tuned assembler. Actually, you probably do have to write those tools in assembler for all intents and purposes.

That's news to GCC:

$ cd /usr/src/contrib/gcc
$ find . -name '*.[ch]' | wc -l
869
$ find . -name '*.[ch]' | xargs cat | wc -l
895866
$ find . -name '*.asm' | wc -l
34
$ find . -name '*.asm' | xargs cat | wc -l
6520

Translation: In GCC 4.2.1 as shipped with FreeBSD 8-STABLE, there are 869 .c and .h files with a total of 900KLOC, and 34 .asm files with 6KLOC. It seems that GCC itself isn't written with very much assembler.

Re:Code in high-level (1)

toastar (573882) | more than 4 years ago | (#30774356)

Pfft... GCC,

When I was a kid I had to learn to program using Machine Code, Uphill, both ways!

Re:Code in high-level (1)

Just Some Guy (3352) | more than 4 years ago | (#30774430)

My first "real" programming was using a machine language monitor on a C64, so I feel your pain.

Re:Code in high-level (1)

WilyCoder (736280) | more than 4 years ago | (#30773566)

I've heard that the first C compiler was written in C.

Re:Code in high-level (1)

creimer (824291) | more than 4 years ago | (#30773908)

Uh, no. C was written in B. B was written in A. A was written in leftover naughty bits. :P

Re:Code in high-level (2, Interesting)

dave562 (969951) | more than 4 years ago | (#30773092)

I think it depends on what kind of code you're trying to write. If a person wants to write applications, then you are right: they might as well write them in a high-level language and let the compiler do the work. On the other hand, if the person is interested in vulnerability research or security work, then learning ASM might as well be considered a prerequisite. An understanding of low-level programming and code execution provides a programmer with a solid foundation. It gives them insight into what might be going wrong when their code isn't compiling or executing the way they want it to. It also gives them the tools to make their code better, as opposed to simply shrugging and saying, "I sure hope they fix this damn compiler..."

Re:Code in high-level (1)

oldhack (1037484) | more than 4 years ago | (#30773342)

Yeah, it probably makes sense only for DSPs and microcontrollers. But then, isn't the 68k used as a microcontroller now?

We used to say there were too many layers of shit. Now it's truly "turtles all the way down."

Re:Code in high-level (1)

oldhack (1037484) | more than 4 years ago | (#30773426)

Is there a study of why we sometimes substitute same-sounding words when typing in stream-of-consciousness style? Might be something there...

Re:Code in high-level (2, Insightful)

smash (1351) | more than 4 years ago | (#30773470)

Not quite.

But it's certainly better to code in a high-level language first, test, tweak the algorithm as much as you can, PROFILE, and THEN start breaking out your assembler. There's no point optimising a piece of code in super-fast asm if the CPU only spends 1% of its time in it. Even if you make all that code 10x as fast, you've only saved 0.9% of CPU time. :)
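
(That 0.9% figure is just Amdahl's law; a tiny sketch of the arithmetic:)

#include <stdio.h>

/* Amdahl's law: speeding up a fraction f of the runtime by a factor s
   gives an overall speedup of 1 / ((1 - f) + f/s). */
static double overall_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    /* 1% of the runtime made 10x faster: about a 0.9% overall win. */
    printf("%f\n", overall_speedup(0.01, 10.0));   /* prints ~1.009 */
    return 0;
}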

Re:Code in high-level (1)

SETIGuy (33768) | more than 4 years ago | (#30774296)

In non-trivial single threaded application code on a modern processor, the CPU core is spending about 95% of its time waiting on memory transfers. To fix that problem, it can make sense to prefetch and reorder memory accesses. Chances are you know better than your compiler how to do that. It also makes sense to start more threads on a processor with multiple hardware threads so you can do things while waiting for memory.

Most programmers won't even bother to do that, because the processor is fast enough to do what they want without the optimization. Only in heavy-duty numerical code and in games does optimization by hand get done. Where you really need top performance regardless of the platform, coders will write multiple versions of a core routine and time them to find what's best on the machine being used.
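
(A hedged sketch of the prefetch idea, assuming GCC's __builtin_prefetch; the 16-element lookahead is a made-up distance that would need tuning per machine:)

/* Stream through an array, asking the CPU to start loading data a few
   iterations ahead of where we are reading. */
long sum_with_prefetch(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low locality */
        s += a[i];
    }
    return s;
}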

Premature optimization is evil... and stupid (2, Insightful)

Just Some Guy (3352) | more than 4 years ago | (#30772742)

That's the main reason why I want to shoot people who write "clever" code on the first pass. Always make the rough draft of a program clean and readable. If (and only if!) you need to optimize it, use a profiler to see what actually needs work. If you do things like manually unroll loops where the body is only executed 23 times during the program's whole lifetime, or use shift to multiply because you read somewhere that it's fast, then don't be surprised when your coworkers revoke your oxygen bit.

Re:Premature optimization is evil... and stupid (4, Funny)

RightSaidFred99 (874576) | more than 4 years ago | (#30772906)

And messy and embarrassing. Oh, wait...

Re:Premature optimization is evil... and stupid (1)

Monkeedude1212 (1560403) | more than 4 years ago | (#30772946)

If (and only if!)

Compiler Error: Numerous Syntax Errors.
Line 1, 4; Object Expected
Line 1, 15; '(' Expected
Line 1, 16; Condition Expected
Line 1, 17; 'Then' Expected

Re:Premature optimization is evil... and stupid (1)

Just Some Guy (3352) | more than 4 years ago | (#30773006)

That was Lisp. You should parse it as If(only && !if).

Re:Premature optimization is evil... and stupid (1)

EvanED (569694) | more than 4 years ago | (#30773070)

Always make the rough draft of a program clean and readable.

Not only that, but if the optimized version is much less readable than the initial version, consider keeping and maintaining *both* versions. You can run tests to compare the output of each version, replace the fast, not-obviously-incorrect version with the slow, obviously-not-incorrect version if you hit a bug and see if it's still there, etc.

(MS did or does this with Excel; at least until recently, and perhaps still, the recomputation engine for the spreadsheet was hand-tuned assembly. However, for testing and development reasons, they also had a much slower, high-level-language version.)
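
(A minimal sketch of that two-versions pattern; the popcount functions below are hypothetical stand-ins, not anything from Excel:)

#include <assert.h>

static unsigned popcount_ref(unsigned x) {   /* slow, obviously-not-incorrect */
    unsigned n = 0;
    while (x) { n += x & 1u; x >>= 1; }
    return n;
}

static unsigned popcount_fast(unsigned x) {  /* "clever" version under test */
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    return (((x + (x >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

int main(void) {
    /* Compare the two implementations over a chunk of the input space. */
    for (unsigned x = 0; x < (1u << 20); x++)
        assert(popcount_fast(x) == popcount_ref(x));
    return 0;
}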

Re:Premature optimization is evil... and stupid (0)

Anonymous Coward | more than 4 years ago | (#30773184)

I do that a great deal actually, usually not at the entire application level, but certainly like your Excel example. Usually if I find a certain process is slow I will move that function aside, or maybe an entire class, and try to optimize. I always have the clean code version to test against, or to go back to simply running with if the deadline to finish something creeps up; worst case, some maintenance programmer can try to produce an optimized version of an algorithm he or she can at least read and understand when someone decides the app is too slow.

Re:Premature optimization is evil... and stupid (3, Interesting)

marcansoft (727665) | more than 4 years ago | (#30773114)

Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.
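
(One concrete instance of the signed-vs-unsigned point, for illustration: dividing by a power of two is a single shift only for unsigned values, because signed division has to round toward zero:)

unsigned udiv8(unsigned x) { return x / 8; }  /* compiles to one shift */

int sdiv8(int x) { return x / 8; }  /* compiler emits a fix-up (add 7 when
                                       x is negative) before the arithmetic
                                       shift, to round toward zero */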

Re:Premature optimization is evil... and stupid (3, Informative)

Rockoon (1252108) | more than 4 years ago | (#30773220)

Using shift to multiply is often a great idea on most CPUs.

Which CPUs are those? The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

Didn't know that? Yeah... it seems like only assembly language programmers know this.

Re:Premature optimization is evil... and stupid (4, Informative)

marcansoft (727665) | more than 4 years ago | (#30773736)

Which CPUs are those?

Those with a barrel shifter.

The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

Then someone needs to beat the GCC developers with a cluestick.
$ cat test.c
int main(int argc, char **argv) {
                return 4*(unsigned int)argc;
}
$ gcc -march=core2 test.c -o test
$ objdump -d test ...
00000000004004ec <main>:
    4004ec: 55 push %rbp
    4004ed: 48 89 e5 mov %rsp,%rbp
    4004f0: 89 7d fc mov %edi,-0x4(%rbp)
    4004f3: 48 89 75 f0 mov %rsi,-0x10(%rbp)
    4004f7: 8b 45 fc mov -0x4(%rbp),%eax
    4004fa: c1 e0 02 shl $0x2,%eax
    4004fd: c9 leaveq
    4004fe: c3 retq
    4004ff: 90 nop

Yeah... it seems like only assembly language programmers know this.

I program in assembly language, but not for x86. I usually program in ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.

Re:Premature optimization is evil... and stupid (2, Insightful)

AuMatar (183847) | more than 4 years ago | (#30773858)

It depends on where they spend their hardware, and what you're multiplying by. You can make a multiplier faster than shifting, it just requires a lot of hardware to do so. If you're multiplying by a constant power of 2, shifting will always be as fast or faster. If you're multiplying by a non power of 2 constant, shifting and adding may be faster, and probably is if there's fairly few 1s in the binary representation. But if they have a good multiplier then mult may be faster than shift/add for a random unknown multiply.

Also, IIRC the P4 got rid of the barrel shifter on Intel. Or maybe it was the gen after that. They may have re-added it though; it seems fairly stupid not to have one.

Re:Premature optimization is evil... and stupid (1)

marcansoft (727665) | more than 4 years ago | (#30773922)

I was talking of multiplying by a power of two constant, of course. You're quite correct in saying that shift+add combinations may or may not be faster than multiplying by more complex constants, depending on the particular implementation. Usually, two shifts and one add is a fairly safe bet for simpler CPUs, but it can actually slow things down on modern superscalar CPUs where it creates undesirable dependencies in the pipeline.

Re:Premature optimization is evil... and stupid (2, Informative)

TheRaven64 (641858) | more than 4 years ago | (#30774238)

I actually did a benchmark of this a few months ago. For a single shift, there wasn't much in it (on a Core 2); both were decoded into the same micro-ops. For more than one shift and add, the multiply was faster because the micro-op fusion engine wasn't clever enough to reassemble the multiply (and even if it were, you're still burning i-cache for no reason). GCC used to emit shift-and-add sequences for all constant multiplies until someone benchmarked it on an Athlon (which had two multiply units and one shift unit) and found that it was much faster to just emit a multiply.

It's just outdated knowledge (2, Informative)

Sycraft-fu (314770) | more than 4 years ago | (#30773788)

People learn a trick way back when, or hear about the trick years later, and assume it is still valid. Not the case. Architectures change a lot and what used to be the best way might not be anymore.

Michael Abrash, one of the all time greats of optimization, talks about this in relation to some of the old tricks he used to use. One was to use XOR to clear a register on x86. XORing a register with itself gives 0, of course, and turned out to be faster than writing an immediate value of zero in to the register. Reason is that loading a value was slower than the XOR op, and the old CPUs had no special clear logic, zero was just another number.

Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

However, you'll still hear people say it is a great trick because they haven't updated their knowledge.

Re:It's just outdated knowledge (1)

marcansoft (727665) | more than 4 years ago | (#30773830)

Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0. Because they want existing code to run as fast as possible, and in x86 compatibility-is-king land, that means optimizing for the common-if-weird cases, not the sane cases.

Re:It's just outdated knowledge (3, Informative)

Cassini2 (956052) | more than 4 years ago | (#30774052)

I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0.

You are correct. XOR reg,reg was such a common instruction on the x86, that essentially it became the special case CLR instruction. Essentially, if you see a CLR instruction on an x86 assembly printout, it is the XOR instruction in disguise. The x86 has no CLR instruction.

Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

Essentially, all current "simple" CPU instructions execute with the same speed. However, the XOR instruction is still faster than the MOV instruction because of instruction bandwidth and cache effects. Most code today is limited by cache and bandwidth limits, like the need to load instructions into the instruction decode pipeline immediately after a jump instruction. The MOV reg, 0 immediate move instruction is a two-byte instruction, and the XOR reg, reg instruction is a one-byte instruction. As such, in real code, the XOR instruction is usually slightly faster, because it results in smaller code.

Additionally, all of the modern x86 CPU implementations special case the XOR reg,reg instruction into a MOV reg, 0 immediate move instruction inside the instruction decode stage anyway. As such, no significant functional difference exists. The only case where a move instruction is quicker is when the condition codes are propagating a side-effect via the condition code registers. Thus, in theory:
ADD AL, AH
MOV CL, 0
JC somewhere

should execute quicker with a MOV instruction as opposed to a XOR instruction. However, in practice, this piece of code:
XOR CL, CL
ADD AL, AH
JC somewhere

executes with exactly the same speed, because the out-of-order execution units inside the x86 automatically optimize the code and make it equivalent. As such, you are best off with the "short small" code, which means that the XOR reg, reg instruction is still the fastest way to do a register clear.

Re:Premature optimization is evil... and stupid (1)

Just Some Guy (3352) | more than 4 years ago | (#30773884)

so there's no reason to explicitly use shift in code (unless you're doing bit manipulation

Well, right. The general advice is to always write what you actually want the compiler to do and not how to do it, unless you have specific proof that the compiler's not optimizing it well.

Re:Premature optimization is evil... and stupid (1)

AuMatar (183847) | more than 4 years ago | (#30773896)

The opposite problem also exists though- by not thinking about performance you can make it expensive or impossible to improve things later without a substantial rewrite. Saying optimize at the end is just as stupid and just as costly. Learning when to care about what level is part of the art of programming. (Although on your specific examples I'll agree with you- especially since I would expect anything but a really old compiler to do mult->shift conversions for you, so you may as well use the more maintainable and readable multiply.)

Re:Premature optimization is evil... and stupid (1)

Just Some Guy (3352) | more than 4 years ago | (#30773970)

Saying optimize at the end is just as stupid and just as costly.

There is an enormous difference between optimization and choosing appropriate algorithms. If you write a program well, it's almost always easy to optimize it later. If you write it poorly, it'll almost always be impossible to optimize at any point of its development. For example, I'd rather sort a big array with an unoptimized (but correct) quicksort than with an extremely clever (but insane) bogosort.

Re:Premature optimization is evil... and stupid (3, Informative)

tomtefar (935007) | more than 4 years ago | (#30773974)

I have the following sticker on top of my display: "Make it work before you make it fast!" Saved me many hours of work.

Re:Premature optimization is evil... and stupid (2, Interesting)

Anonymous Coward | more than 4 years ago | (#30774118)

I think that the premature optimization claims are way overdone. In the cases where performance does not matter, then sure, make the code as readable as possible and just accept the performance.

However, sometimes it is known from the beginning of a project that performance is critical and that achieving that performance will be a challenge. In such cases, I think that it makes sense to design for performance. That rarely means using shifts to multiply -- it may, however, mean that you design your data structures so that you can pass the data directly into some FFT functions without packing/unpacking the data to some other format that the rest of the functions were written to expect. It may also mean that your design scale to many cores and that inner loops be heavily optimized and vectorized. Of course, all of that code should be performance tested during development against the simpler versions.

Profiling after the fact sounds like a good idea, but what if the code has no real "hotspot"? What if you find out that you need to redesign the entire software framework to support zero-copy processing of the data? Also, profiling tools in general are really not that good. Running oprofile on a large-scale application with dozens of threads and data source dependencies on other processes can be less than enlightening. gprof is entirely useless for non-trivial applications. cachegrind is sometimes helpful, but most people working on performance optimization seem to simply build their own timers based on the rdtsc instruction and manually time sections of the code.

I work on software for processing medical device data and performance is often critical. You probably want an image display to update very quickly when it is providing feedback to the doctor guiding a catheter toward your heart, for example. We had one project where the team decided to start over with a clean framework without concern for performance -- they would profile and optimize once everything was working. They followed the advice of many a software engineer: their framework was very nice, replete with design patterns and applications of generic programming, and entirely unscalable beyond a single processor core. There were no performance tests done during development, and of course the timeline was such that there would only be minimal time for optimization once the functionality was complete. The software that it was replacing was ugly, but also scaled nicely to many cores. The software shipped on a system with two quad-core processors, just as it had before.

Let's just say that customers were unimpressed with the new software framework.

Re:Premature optimization is evil... and stupid (1)

Just Some Guy (3352) | more than 4 years ago | (#30774208)

Interesting anecdote that has nothing to do with optimization and everything to do with bad design. Optimization is great for making your program run n% faster. Design is great for making your program run in O(log n) time instead of O(n^2) time. The important part is to come up with a good design, implement it, and address the specific problem areas. I can't think of a single justification for doing it any other way.

Re:Premature optimization is evil... and stupid (0)

Anonymous Coward | more than 4 years ago | (#30774198)

A profiler is only one way of determining what's important. Knowledge and experience is another way. If you've worked in a problem domain for a long time, you know the fundamentals of what's fast enough and what's not acceptable. Utilizing a large store of knowledge to "prematurely" optimize is not necessarily a bad thing. If you know for certain that the clearest, easiest-to-understand way of doing something just isn't going to be fast enough, that's perfectly valid. Just balance that against the clarity of the resulting solution.

These sorts of things generally fall into the realm of algorithmic design, though, not micro-optimizations like substituting shifts for multiplications. The compiler figures all that shit out for you, anyway.

Well that is /.'d (1)

Com2Kid (142006) | more than 4 years ago | (#30772780)

/.'d, to say the least. Wow.

Great lecture so far; 2-minute pauses every 20 seconds make it kind of hard to listen to, though!

Re:Well that is /.'d (1)

Jorl17 (1716772) | more than 4 years ago | (#30772952)

And I was here banging the computers to get them to work faster! Damn /.! Next time, tell me before you eat up another server!

Re:Well that is /.'d (1)

MaskedSlacker (911878) | more than 4 years ago | (#30772964)

Try waiting for it to fully buffer?

Re:Well that is /.'d (0)

Anonymous Coward | more than 4 years ago | (#30773208)

For me the link is borked. Coral cache, anyone? The link is dead. It's dead, Jim! We need Miracle Max. It's not completely dead, it's only mostly dead. .....Oh wait! I'm mixing my Star Trek and Princess Bride metaphors. My bad.

tl;dr (-1, Troll)

Anonymous Coward | more than 4 years ago | (#30772870)

Cliff Click talks about why it's almost impossible to tell what an x86 chip is really doing to your code

tl;dr
Don't care... because I believe in abstraction!

Skynet... (0)

Anonymous Coward | more than 4 years ago | (#30772920)

Now that no one knows what they're doing, who's to keep them from merging? How long before several machines of x86 chips become self-aware? The end is nigh, comrades!

Alternatively, maybe they'll become Data.

Steve Jobs shits (-1, Troll)

Anonymous Coward | more than 4 years ago | (#30773410)

What do you Apple dick smokers think of your god now?

It's not just x86 (3, Informative)

RzUpAnmsCwrds (262647) | more than 4 years ago | (#30773542)

Features like out-of-order execution, caches, and branch prediction/speculation are commonplace on many architectures, including the next-generation ARM Cortex A9 and many POWER, SPARC, and other RISC designs. Even in-order designs like Atom, Cortex A8, or POWER6 have branch prediction and multi-level caches.

The most important thing for performance is to understand the memory hierarchy. Out-of-order execution lets you get away with a lot of stupid things, since many of the pipeline stalls you would otherwise create can be re-ordered around. In contrast, the memory subsystem can do relatively little for you if your working set is too large and you don't access memory in an efficient pattern.
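
(A small sketch of the access-pattern point; N is an arbitrary size picked so the working set exceeds typical caches:)

#define N 4096

long sum_rows(long (*a)[N]) {      /* sequential walk: one miss per line */
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

long sum_cols(long (*a)[N]) {      /* strides a whole row per access: can
                                      miss on nearly every load */
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}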

I hate flash video (1)

Omnifarious (11933) | more than 4 years ago | (#30773556)

I wish they'd all just use HTML5 or put it on YouTube so I can use youtube-dl or something. Otherwise it either doesn't work at all (my amd64 Linux boxes) or is slow and jerky (my Mac OSX box). It's really frustrating.

Re:I hate flash video (0)

Anonymous Coward | more than 4 years ago | (#30773690)

You hate Flash video and you list YouTube as an alternative? This Flash player reads an FLV file just like YouTube does. You can use Firebug to pull the FLV URL and play it however you like (VLC, mplayer, etc.)

Kung-Fu and Ninjitsu...They're not dead! (1)

geekmux (1040042) | more than 4 years ago | (#30773946)

This just in...Apparently Bruce Lee and Lee Van Cleef are alive and well and working for Intel, which likely accounts for all the "crazy kung-fu and ninjitsu" going on there...

Another fascinating Click talk (0)

Anonymous Coward | more than 4 years ago | (#30773966)

rule of the code (3, Informative)

Bork (115412) | more than 4 years ago | (#30774060)

Just write good clean code that works properly first. The only time you optimize is after it has been profiled to see if there are troublesome spots. The way CPUs run and how compilers are designed, there is very little need to do optimization. Unless you have taken some serious courses on how current CPUs work, your efforts will mostly result in bad code that gains you nothing in terms of speed. Your time is better spent on writing CORRECT code.

The compilers are very intelligent about proper loop unrolling, rearranging branches, and moving instruction code around to keep the CPU pipeline full. They will also look for unnecessary/redundant instructions within a loop and move them to a better spot.

One of the courses I took was programming for parallelism. For extra credit, the instructor assigned a 27K x 27K matrix multiply; the person with the best time got a few extra points. A lot of the class worked hard trying to optimize their code to get better times; I got the best time by playing with the compiler flags.
