Grand Unified Theory of SIMD

Posted by Hemos on Monday February 07, 2005 @12:30PM from the the-string-theory-of-SIMD dept.

Glen Low writes "All of a sudden, there's going to be an Altivec unit in every pot: the Mac Mini, the Cell processor, the Xbox2. Yet programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians. The macstl project tries to unify the architectures in a simple C++ template library. It just reached its 0.2 milestone and claims a 3.6x to 16.2x speed-up over hand-coded scalar loops. And of course it's all OSI-approved RPL goodness."

Altivec (Score:5, Informative)

by BWJones ( 18351 ) * writes: on Monday February 07, 2005 @12:31PM (#11597314) Homepage Journal

For those who want a little background on Altivec, of course Wiki has a description here [wikipedia.org]. Apple, who now ships Altivec in every system they make, has a pretty good page here [apple.com] and Motorola nee Freescale has one here [freescale.com].

The benefits of Altivec can be truly astounding for those processes that can be "vectorized". After all putting these kinds of calculations in hardware has got it all over software computation. It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.

- Re:Altivec (Score:5, Informative)
  
  by shawnce ( 146129 ) writes: on Monday February 07, 2005 @12:39PM (#11597413) Homepage
  
  Just pick a few items out ...
  
  Apple provides source code for some of their vector libraries [apple.com]
  
- Re:Altivec (Score:3, Interesting)
  
  by baryon351 ( 626717 ) writes:
  
  It kind of reminds me of when I got one of those Photoshop accelerator hardware cards (Radius Photoengine with 4 DSPs on a daughter card linked to the Thunder series video card) for my IIci. Photoshop filter functions ran faster on that IIci than they did on much later PowerPC systems simply because you now had four hardware DSPs running your image math.
  
  I managed to pick up a ThunderIV last year with the DSP card, and had a run around with photoshop on it. It's impressive stuff. I have an iMac 350 here
- Other way around (Score:2)
  
  by Kiryat Malachi ( 177258 ) writes:
  
  Freescale, nee Motorola. (Nee roughly translates to "formerly known as").
- Re:Altivec (Score:2)
  
  by skraps ( 650379 ) writes:
  
  "Wiki" != "WikiPedia".
  For more, read http://en.wikipedia.org/wiki/Wiki [wikipedia.org].
- - Re:Altivec (Score:2)
    
    by wulfhound ( 614369 ) writes:
    
    Yes it does.. it's a G4, all G4s have Altivec.
  - Re:Altivec (Score:4, Informative)
    
    by mod_critical ( 699118 ) * writes: on Monday February 07, 2005 @12:42PM (#11597442)
    
    Altivec == Velocity Engine
    
    And is part of every G4
    
More AltiVec Goodness (Score:4, Informative)

by LordRPI ( 583454 ) writes: on Monday February 07, 2005 @12:33PM (#11597342)

Apple has had AltiVec optimized libraries for DSP and such since the early releases of OS X.

- Re:More AltiVec Goodness (Score:2)
  
  by goMac2500 ( 741295 ) writes:
  
  How is parent flamebait? It's a fact, and it's not flamebait considering Apple is one of the only companies currently shipping Altivec systems.
- Re:More AltiVec Goodness (Score:3, Insightful)
  
  by bryanzak ( 598580 ) writes:
  
  One of the problems of using libraries though is that the overhead of a function call usually negates any gain in vectorization. The lib call messes all kinds of things up, including instruction flow and caching, etc.
Umm (Score:2, Informative)

by TheKidWho ( 705796 ) writes:

Doesn't XCode have a feature that lets you "vectorize" certain parts of your code already?
- Re:Umm (Score:3, Informative)
  
  by Richard_at_work ( 517087 ) writes:
  
  The next version of Xcode will support autovectorisation, but I don't think it does it atm.
- Re:Umm (Score:2)
  
  by HeghmoH ( 13204 ) writes:
  
  No.
- Yes. (Score:3, Informative)
  
  by Trillan ( 597339 ) writes:
  
  Yes it does [apple.com].
  - Re:Yes. (Score:3, Informative)
    
    by homb ( 82455 ) writes:
    
    No, the current version of XCode uses GCC 3.3 and does NOT support autovectorization.
    The page you link to is a page that shows how to code vector-based programs. What the parent is asking is if the standard "Hello World" program can be auto-vectorized with one command-line argument, and that won't work currently.
    The next version of XCode (2.0) with GCC 3.4 will support partial auto-vectorization, as another comment said as well.
A little background (Score:5, Informative)

by xXunderdogXx ( 315464 ) writes: on Monday February 07, 2005 @12:35PM (#11597359) Homepage Journal

From the Wikipedia article on SIMD:
An example of an application that can take advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three 8-bit values for the brightness of the red, green and blue portions of the color. To change the brightness, the R G and B values are read from memory, a value is added (or subtracted) from it, and the resulting value is written back out to memory.

With a SIMD processor there are two improvements to this process. For one the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "get this pixel, now get this pixel", a SIMD processor will have a single instruction that effectively says "get all of these pixels" ("all" is a number that varies from design to design). For a variety of reasons, this can take much less time than it would to load each one by one as in a traditional CPU design.
But of course I'm sure everyone here knew that..
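
As a rough illustration of the quoted passage, here is a minimal C++ sketch of the brightness example. The function names (brighten_one, brighten_scalar, brighten_blocks) and the 16-byte block size are assumptions made for illustration; they do not come from the Wikipedia article. The block version contains no SIMD instructions itself. It only arranges the work in register-sized chunks, which is the shape a SIMD saturating add (or an auto-vectorizing compiler) would operate on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Add `delta` to one 8-bit channel value, clamping the result to 0..255.
inline std::uint8_t brighten_one(std::uint8_t value, int delta) {
    assert(delta >= -255 && delta <= 255);
    int v = static_cast<int>(value) + delta;
    if (v < 0) v = 0;
    if (v > 255) v = 255;
    return static_cast<std::uint8_t>(v);
}

// Scalar version: "get this pixel, now get this pixel".
void brighten_scalar(std::uint8_t* rgb, std::size_t n, int delta) {
    for (std::size_t i = 0; i < n; ++i) rgb[i] = brighten_one(rgb[i], delta);
}

// Block version: walk the buffer 16 bytes at a time (hypothetical block
// size, chosen to match a 16-byte vector register), then finish the
// leftover bytes one by one.
void brighten_blocks(std::uint8_t* rgb, std::size_t n, int delta) {
    const std::size_t kBlock = 16;
    std::size_t i = 0;
    for (; i + kBlock <= n; i += kBlock) {
        // This fixed-length inner loop stands in for a single SIMD add.
        for (std::size_t lane = 0; lane < kBlock; ++lane)
            rgb[i + lane] = brighten_one(rgb[i + lane], delta);
    }
    for (; i < n; ++i) rgb[i] = brighten_one(rgb[i], delta);  // tail
}
```

Both functions produce the same bytes; the second just makes the "all of these pixels at once" structure explicit, including the tail that does not fill a whole block.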

- - Re:A little background (Score:2)
    
    by xXunderdogXx ( 315464 ) writes:
    
    If I'm not mistaken, wouldn't MMX be an implementation of SIMD?
  - Re:A little background (Score:4, Informative)
    
    by DLWormwood ( 154934 ) writes: <wormwood@me.PARIScom minus city> on Monday February 07, 2005 @01:23PM (#11597896) Homepage
    
    How is this different for MMX?
    Based on personal recollections reinforced by a quick Wiki'ing, MMX's problem wasn't the concept itself, but Intel's braindead constraints placed on x86 support for vectors. MMX recycled the same registers as used for floating point math, causing expensive context switches between each mode and only allowing integer math to be vectorized. Intel eventually developed SSE to work around some of the bottlenecks, but the eventual dominance of GPUs on the PC platform reduced the development priority for vector math in the CPU.
    
  - Re:A little background (Score:2)
    
    by at_18 ( 224304 ) writes:
    
    MMX is an integer-only implementation of SIMD. It was also problematic because it didn't have its own registers, but re-used the floating point ones of the CPU. SSE is a floating-point implementation of SIMD with its own registers.
  - Re:A little background (Score:5, Informative)
    
    by Dominic_Mazzoni ( 125164 ) writes: on Monday February 07, 2005 @02:54PM (#11599082) Homepage
    
    Quick summary:
    
    MMX (x86): 8-byte registers, only integer operations
    SSE (x86): 16-byte registers, single-precision float ops
    AltiVec (PPC): 16-byte registers, integer and single-precision float ops
    SSE2 (x86): 16-byte registers, double-precision float ops
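
    To make the register widths above concrete, here is a small C++ sketch of the lane arithmetic. The names lanes, Split and split are made up for illustration, and the static_asserts assume the usual 4-byte float and 8-byte double.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    // Number of elements ("lanes") of type T that fit in one register.
    template <typename T>
    constexpr std::size_t lanes(std::size_t register_bytes) {
        return register_bytes / sizeof(T);
    }

    static_assert(lanes<std::uint8_t>(8) == 8, "MMX: eight bytes per register");
    static_assert(lanes<float>(16) == 4, "SSE/AltiVec: four floats per register");
    static_assert(lanes<double>(16) == 2, "SSE2: two doubles per register");

    // How an array of n elements divides into full registers plus a scalar tail.
    struct Split {
        std::size_t full_vectors;
        std::size_t tail;
    };

    template <typename T>
    Split split(std::size_t n, std::size_t register_bytes) {
        const std::size_t w = lanes<T>(register_bytes);
        assert(w > 0);  // the element type must fit in the register
        return Split{n / w, n % w};
    }
    ```

    Ten floats in 16-byte registers, for instance, come out as two full vectors and two leftover elements.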
    
    In order to implement many complex algorithms on x86, you need to use a motley combination of MMX and SSE. There are many flaws in both; lots of very useful instructions are missing, and MMX can't be used in conjunction with non-SIMD floating-point operations without a huge expensive context switch. One of the biggest flaws in MMX/SSE that I found was the lack of instructions to shuffle data around within a (8-byte or 16-byte) register. The only advantage on a modern x86 CPU is SSE2, which is the only SIMD unit with double-precision floats. But you can only work with two doubles at a time, so the speedup is not that great.
    
    AltiVec, on the other hand, included both floats and integers right from the start, with no penalty for switching between them, and it includes a very detailed and useful set of instructions, including an awesome shuffle instruction. My personal experience, coding for both, is that AltiVec is about twice as useful as MMX/SSE/SSE2 combined.
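
    For readers who have not met a shuffle instruction, the following is a plain C++ emulation of the idea, modelled on AltiVec's vec_perm: each control byte picks one byte out of the 32 bytes formed by two source registers. It is a scalar sketch for explanation only; Bytes16, permute and reverse_bytes are hypothetical names, and the real instruction does all 16 selections at once.

    ```cpp
    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    using Bytes16 = std::array<std::uint8_t, 16>;

    // out[i] = byte number (control[i] & 0x1F) of the 32-byte
    // concatenation a followed by b.
    Bytes16 permute(const Bytes16& a, const Bytes16& b, const Bytes16& control) {
        Bytes16 out{};
        for (std::size_t i = 0; i < out.size(); ++i) {
            const std::size_t idx = control[i] & 0x1F;  // only 5 bits are used
            assert(idx < 32);
            out[i] = idx < 16 ? a[idx] : b[idx - 16];
        }
        return out;
    }

    // Example use: reverse the 16 bytes of one register with a single permute.
    Bytes16 reverse_bytes(const Bytes16& a) {
        Bytes16 control{};
        for (std::size_t i = 0; i < control.size(); ++i)
            control[i] = static_cast<std::uint8_t>(15 - i);
        return permute(a, a, control);
    }
    ```

    Because the control vector is ordinary data, the same operation covers byte swaps, interleaves, table lookups and unaligned-load fixups.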
    
    Also, note that in Mac OS X, many of the standard libraries and system calls are already AltiVec-optimized for you, and Apple also provides a great Vector library with lots of common DSP operations.
    
    - Re:A little background (Score:3, Informative)
      
      by TheRaven64 ( 641858 ) writes:
      
      As well as the vDSP libraries, Apple also provide a set of wrapper functions around the vector instructions. These expose the instructions directly, but let the compiler handle register allocation, making using AltiVec directly very easy.
Long thread about using Altivec (Score:5, Informative)

by ThousandStars ( 556222 ) writes: on Monday February 07, 2005 @12:37PM (#11597380) Homepage

The Mac forum at Ars Technica has a long, continuing post [arstechnica.com] about Altivec optimizations and how they should be used. The thread started more than two years ago and still gets relevant points and questions added to it. It's an amazing resource if you're interested in starting.

- Read the Altivec mailing list (Score:5, Informative)
  
  by kuwan ( 443684 ) writes: on Monday February 07, 2005 @01:05PM (#11597703) Homepage
  
  A better resource for Altivec and SIMD in general is the SIMDtech.org [simdtech.org] website and Altivec [simdtech.org] mailing list. There are tutorials and technical manuals available and the email list is indispensable. While the mailing list is mostly geared towards Altivec optimizations and discussions all SIMD discussion is welcome, including MMX/SSE. There are Apple engineers that read and contribute to the list as well as Motorola/Freescale engineers. It's probably the single best resource available to Altivec programmers and you get to talk directly to the Wizards that created it.
  
  I'm a relative newcomer to the list and it's been an invaluable resource as I've optimized with Altivec.
  
  --
  Join the Pyramid - Free Mini Mac [freeminimacs.com]
  
License issues (Score:5, Informative)

by IO ERROR ( 128968 ) * writes: <error@ioe[ ]r.us ['rro' in gap]> on Monday February 07, 2005 @12:39PM (#11597404) Homepage Journal

Be careful; the "open source" license [pixelglow.com] (PDF) is not GPL-compatible. I don't even think it's BSD-compatible on first reading.
The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.

- Re:License issues (Score:2, Interesting)
  
  by voxlator ( 531625 ) writes:
  
  True, but only if you don't purchase a license.
  
  Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.
  
  --#voxlator
  - Re:License issues (Score:4, Informative)
    
    by IO ERROR ( 128968 ) * writes: <error@ioe[ ]r.us ['rro' in gap]> on Monday February 07, 2005 @01:04PM (#11597702) Homepage Journal
    
    Simple to understand; if you use it for free, you're expected to release your source code (i.e. the 'reciprocal' part of RPL). If you pay to use it, you don't have to release your source code.
    
    True enough, but using the proprietary license makes it impossible to use this in existing projects without changing the license. Suddenly your open source project is either no longer open source, or doesn't look so attractive.
    One of the nicest features of the GPL (and, to be fair, of the BSD license) is that you do not have to release source code if you don't distribute your software. This RPL requires you to release your source code even if you don't distribute your software. And the proprietary license simply isn't appropriate for any type of open source project.
    The guy wants to get paid, and that's fine, I want to get paid, too. But he's got no business telling me I have to distribute my source code for an internal project that will never be distributed. He could easily have used a method similar to Trolltech's dual-licensing [slashdot.org], but he chose instead to do something a whole lot more obnoxious.
    
    - - Re:License issues-Smells funny. (Score:2)
        
        by IO ERROR ( 128968 ) * writes:
        
        Of course he hasn't taken away my choice, AC. I can't reconcile either of his licenses with my existing projects, so I choose not to use his code. I suspect many existing projects will find themselves in a similar situation when they actually read the licenses, and will also choose not to use his code.
- Re:License issues (Score:2)
  
  by RupW ( 515653 ) * writes:
  
  The Reciprocal Public License requires you to release all of your source code if you link to this library, even if your project is personal or used in-house only.
  
  IANAL, but I read the intent as "if you improve macstl you have to publish your changes to macstl" not "if you link macstl you have to publish source to the entire project".
  
  Obviously I can't say which one matches the legalese.
- - Re:License issues (Score:3, Informative)
    
    by IO ERROR ( 128968 ) * writes:
    
    It sounds like the GPL virus to me.
    
    Look, a troll! The GPL doesn't require you to release your code, unless you distribute it. This RPL thing requires you to release your code, even if you don't distribute it. I've discussed the linking issue elsewhere.
- - Re:License issues (Score:2)
    
    by IO ERROR ( 128968 ) * writes:
    
    If you have an existing work that you can optionally combine with the RPL licensed software, it is unlikely that a court would consider your existing work a derivative of the RPL software.
    
    With C++ templates this is a very thorny issue. When your code instantiates the template, the library code is very inextricably an integral part of your code, and not easily (if at all) separable. This might be a different issue if it were a C library you could just call through an API.
    Currently under the GPL/LGPL th
About the RPL (Score:5, Informative)

by pavon ( 30274 ) writes: on Monday February 07, 2005 @12:47PM (#11597495)

The RPL (Reciprocal Public License [pixelglow.com]) is an odd choice for this project. It is an even stronger viral copy-left than the GPL, to the point where the FSF takes issue with it. If you create a derivative work you are required to 1) Notify the original author, and 2) Publish your changes even if you only use the program in house. Furthermore, their definition of derivative work is much, much broader than the "linking" definition that the GPL uses.

The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL. In fact, considering the requirements placed on you by the license, I would expect that you will have difficulty incorporating this RPL library into any existing FLOSS project without running into license conflicts. The only thing I can see this being useful for is a new project that you don't mind releasing under the RPL, or with existing BSD style licensed code which you dual license as BSD/RPL (since BSD can be included in anything).

So this library does not appear to be very usable for the FLOSS world, although if you want to license it for proprietary software you may.

- Re:About the RPL (Score:3, Informative)
  
  by geoffspear ( 692508 ) * writes:
  
  Clearly, we need to get everyone in the world to download the source, make one superficial change, and email the entire thing back to the original developer.
  And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, as they can't notify him of changes?
  - Re:About the RPL (Score:2)
    
    by MenTaLguY ( 5483 ) writes:
    
    And what happens if the original developer dies? Is everyone prohibited from using his code until the copyright runs out in 95 years, as they can't notify him of changes?
    
    Yes, unless he has an identifiable successor-in-interest.
- Re:About the RPL (Score:2)
  
  by Baldrson ( 78598 ) * writes:
  
  The fact that it puts these additional requirements / restrictions on the user makes it incompatible with the GPL.
  It's no more incompatible than is a class that overrides a method of a superclass "incompatible" with that superclass. In this instance, the release "method" is more strict.
  - - Pedantic Pissing Contests Aside (Score:2)
      
      by Baldrson ( 78598 ) * writes:
      
      The point is that the GPL doesn't specify release behavior for code that isn't distributed so any "program" P developed with regard to the GPL should not reference such release behavior -- hence the substitution principle works.
- Re:About the RPL (Score:2)
  
  by HeghmoH ( 13204 ) writes:
  
  #1 is understandable, if odd, but #2 is just ridiculous. In-house use doesn't fall under copyright protection to begin with, so how can the RPL regulate it?
  - Re:About the RPL (Score:2)
    
    by phliar ( 87116 ) writes:
    
    In-house use doesn't fall under copyright protection to begin with
    
    False. You may be confusing in-house use with the doctrine of fair use.
    - Re:About the RPL (Score:2)
      
      by HeghmoH ( 13204 ) writes:
      
      You're right, I wasn't thinking. Wide-scale internal use would in fact be governed by the RPL. Small-scale use that fell under fair use would not.
Black Art? Uh... (Score:4, Interesting)

by arekusu ( 159916 ) writes: on Monday February 07, 2005 @12:47PM (#11597496) Homepage

"...the black art of assembly language magicians."

The nice thing about altivec is that it has a C interface. You don't have to use assembly!

Take a look at this Apple tutorial [apple.com] to see how easy it is.
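
For a feel of what that C interface looks like, here is a minimal sketch of adding two float arrays. The intrinsics vec_ld, vec_add and vec_st come from the standard altivec.h header; the function name add_arrays and the alignment policy are assumptions of this sketch, and the AltiVec path is only compiled when the compiler defines __ALTIVEC__. On any other target the same function falls back to a plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ALTIVEC__)
#include <altivec.h>

// vec_ld and vec_st ignore the low four address bits, so this sketch
// insists on 16-byte aligned buffers instead of handling misalignment.
inline bool aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
#endif

// out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ALTIVEC__)
    assert(aligned16(a) && aligned16(b) && aligned16(out));
    for (; i + 4 <= n; i += 4) {
        __vector float va = vec_ld(0, a + i);  // load four floats
        __vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, out + i);   // add and store four floats
    }
#endif
    for (; i < n; ++i) out[i] = a[i] + b[i];   // scalar tail or fallback
}
```

The compiler handles register allocation and scheduling; what the programmer still has to think about is alignment and the leftover elements at the end.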

- Re:Black Art? Uh... (Score:4, Funny)
  
  by Leo McGarry ( 843676 ) writes: on Monday February 07, 2005 @01:03PM (#11597688)
  
  Yes, I think the person who wrote the summary revealed a little more of his own ignorance than he meant to. I don't consider calling "vec_add" inside a loop to be a black art.
  
- Re:Black Art? Uh... (Score:2)
  
  by Paradox ( 13555 ) writes:
  
  Yeah, the C library is out there, and it's not too hard to use. :)
  
  But one could counter that even in the C library, unless you know what you're doing, you may not get as dramatic a speedup as you wanted. Until I looked at several of Apple's examples, I couldn't write altivec-aware code properly (i.e. maximum performance benefit).
  
  Once I knew what I was doing I went back and redid the code, and it ran much faster. So it is still tricky to maximize your bang-for-buck.
More source-distro goodness to follow? (Score:2)

by Progman3K ( 515744 ) writes:

Does this mean we can expect source Linux distros to start taking advantage of this?

I know I'll sound like a wannabe leet for saying this, but I already really like my Gentoo workstation because it is a stage1 install (all from source), and I expect this will only make it even faster!

Yay!
Too expensive? (Score:2)

by saddino ( 183491 ) writes:

Sounds great, but $2499 for a redistributable binary? Ouch.
- Re:Too expensive? (Score:2, Insightful)
  
  by voxlator ( 531625 ) writes:
  
  In the corporate world, is it more expensive than paying a developer to design, code, test, and maintain a home-grown version?
  
  Once you've paid a $30/hour developer for 10 days' work, you've forked out ~ $2,500...
  
  --#voxlator
  - Re:Too expensive? (Score:2)
    
    by saddino ( 183491 ) writes:
    
    If the question was "Do I hire my own programmer or buy this technology?" then you would be correct.
    
    But, given this is an optimization and replacement for STL then the question is "Do I just live with STL, or buy this technology?"
    
    In other words, it isn't an essential development cost, it's an extra (I imagine most interested parties already have shipping apps that use STL).
    
    And at this price point, IMHO, I think the answer may be "if it ain't broke, don't fix it."
Slides about SIMD (Score:2, Informative)

by quigonn ( 80360 ) writes:

A bit OT, but nevertheless quite interesting to read and it contains information about SIMD instruction sets other than just MMX/SSE: http://www.fefe.de/ccccamp2003-simd.pdf [www.fefe.de]
Assembly or C++? (Score:2)

by nagora ( 177841 ) writes:

I'll take the Assembly Language, thanks. Especially on such a nice processor.
TWW
- Re:Assembly or C++? (Score:2)
  
  by nagora ( 177841 ) writes:
  
  Especially on such a nice processor as the PowerPC, that is. Sheesh.
  TWW
Autovectorization being added in GCC 4.0 (Score:5, Interesting)

by shawnce ( 146129 ) writes: on Monday February 07, 2005 @12:50PM (#11597543) Homepage

For those who don't already know, autovectorization is being worked on for GCC by folks from IBM and others.

GCC vectorization project [gnu.org] (site seems offline atm) but the abstract from a recent GCC summit [gccsummit.org] is up.

Autovectorization Talk (google html view of pdf) [216.239.57.104]

- Re:Autovectorization being added in GCC 4.0 (Score:2)
  
  by TedCheshireAcad ( 311748 ) writes:
  
  If you're serious about performance, use XLC. GCC is great if you're cheap, but it's kind of like putting monster truck tires on a Ferrari.
- Re:Autovectorization being added in GCC 4.0 (Score:2)
  
  by joib ( 70841 ) writes:
  
  Yes, the new ssa architecture in GCC 4.0 allows for autovectorization, but at the moment the focus is on getting GCC 4.0 sufficiently stable for release in a few months. Because of this, IIRC, some of the fancier vectorization passes were deferred until GCC 4.1.
  
  So yes, you might see some performance improvements due to vectorization in 4.0, but you'll have to wait until 4.1 or maybe even 4.2 before you'll see the full potential of it.
  
  -joib, occasional GCC contributor (although I have absolutely zilch to d
It's in the compiler (Score:3, Informative)

by Mad Hughagi ( 193374 ) writes: on Monday February 07, 2005 @12:51PM (#11597557) Homepage Journal

Vectorization (SIMD) is built into the Intel compiler. There is no need to hack in assembly as the compiler will do it for you. This is the case with most vendor supplied compilers, as they want to fully exploit their hardware functionality.

The problem is bringing this functionality to open source compilers, where, as far as I know, there is not even an OpenMP (threading) implementation, let alone internal vectorization.
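
To show the kind of loop a vectorizing compiler is looking for, here is a hedged C++ sketch contrasting two loops. The function names are made up for illustration, and whether a particular compiler actually vectorizes the first one depends on its version, its flags and what it can prove about pointer aliasing; nothing here is specific to the Intel compiler.

```cpp
#include <cassert>
#include <cstddef>

// Independent iterations: out[i] depends only on x[i] and y[i], so
// several iterations can be computed at once. This is the classic
// candidate for automatic vectorization.
void saxpy(float a, const float* x, const float* y, float* out, std::size_t n) {
    assert(n == 0 || (x != nullptr && y != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i) out[i] = a * x[i] + y[i];
}

// Loop-carried dependence: iteration i needs the accumulator from
// iteration i - 1, so the loop cannot simply be cut into independent
// 4-wide chunks without restructuring the algorithm.
void running_sum(const float* x, float* out, std::size_t n) {
    assert(n == 0 || (x != nullptr && out != nullptr));
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += x[i];
        out[i] = acc;
    }
}
```

The difference between the two is in the data flow, not the syntax, which is why compiler reports such as "loop was vectorized" are worth reading.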

- Re:It's in the compiler (Score:2)
  
  by nonmaskable ( 452595 ) writes:
  
  It is built in but you don't automagically get full benefit unless you design your data structures and algorithms appropriately. In my case, I got no measurable benefit until I did a fairly extensive redesign.
  
  Intel has a great book on performance tuning that has been extremely helpful, as has Intel's VTune.
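
  One common example of such a data-structure redesign is switching from an array of structures to a structure of arrays. This is an assumption about the kind of change meant, since the comment does not say what was redesigned; the types PixelAoS and ImageSoA and the function names are hypothetical.

  ```cpp
  #include <cassert>
  #include <cstddef>
  #include <vector>

  // Array of structures: the red values sit 12 bytes apart in memory,
  // so four consecutive reds cannot be fetched with one contiguous load.
  struct PixelAoS {
      float r, g, b;
  };

  // Structure of arrays: each channel is contiguous, so four reds fill
  // one 16-byte register directly.
  struct ImageSoA {
      std::vector<float> r, g, b;
      explicit ImageSoA(std::size_t n) : r(n), g(n), b(n) {}
      std::size_t size() const {
          assert(r.size() == g.size() && g.size() == b.size());
          return r.size();
      }
  };

  void scale_red_aos(std::vector<PixelAoS>& px, float k) {
      for (PixelAoS& p : px) p.r *= k;  // strided access
  }

  void scale_red_soa(ImageSoA& img, float k) {
      for (float& v : img.r) v *= k;    // unit-stride access
  }

  ImageSoA to_soa(const std::vector<PixelAoS>& px) {
      ImageSoA img(px.size());
      for (std::size_t i = 0; i < px.size(); ++i) {
          img.r[i] = px[i].r;
          img.g[i] = px[i].g;
          img.b[i] = px[i].b;
      }
      return img;
  }
  ```

  Both scale functions compute the same result; the second layout is simply the one that hands the compiler contiguous, same-typed data.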
  - Re:It's in the compiler (Score:2)
    
    by spitzak ( 4019 ) writes:
    
    With no changes to our code, but turning on most of the switches to the Linux Intel compiler, I got a huge number of "loop was vectorized" messages, and the resulting code was sped up almost 20% (versus only 5% for the Intel compiler with no switches other than -O5). Now it is quite likely that more speedup is possible, but it appears the Intel compiler was quite able to recognize and vectorize code that was not designed for it. (ps the code is floating-point image processing, with repetitive operations don
  - Re:It's in the compiler (Score:2)
    
    by Sebastopol ( 189276 ) writes:
    
    Actually, you DO get automagical compiler speedup. In some cases it can identify vector-izable (is that a word?) loops and promote them to SIMD operations.
    
    But yes, otherwise, you need to re-code if the compiler doesn't take the hint, especially in structures/classes. The only objection I have to the Intel intrinsics is they don't look pretty! ;-)
    
    I haven't used VTune since circa 1998, and it had this awesome feature that would point out boneheaded things in your code. One interesting suggestion it made: i
    - Re:It's in the compiler (Score:2)
      
      by nonmaskable ( 452595 ) writes:
      
      Automagical only if it can make the identification; there are several things that can prevent it from doing so, and I managed to do several of them. VTune helps a lot with code like this - I've spent many happy hours tracking down hotspots with it.
already exists (Score:3, Informative)

by jeif1k ( 809151 ) writes: on Monday February 07, 2005 @12:55PM (#11597603)

SIMD support already exists, in the form of C, C++, and Fortran libraries (usually, as a small part of larger numerical libraries), as well as in language constructs in languages like Fortran.

- Re:already exists (Score:2)
  
  by jkujawa ( 56195 ) writes:
  
  The point of MacSTL is it's portable to both PPC and Intel. You can make a portable SIMD-optimized program.
The future (Score:4, Insightful)

by johnhennessy ( 94737 ) writes: on Monday February 07, 2005 @12:55PM (#11597612)

Surely people can now start to see where the future lies - from a performance viewpoint. We've reached the end of the clocking "free lunch" (see http://www.gotw.ca/publications/concurrency-ddj.htm [www.gotw.ca]).

The way forward is turning the CPU of a traditional architecture into a Nanny for a range of various dedicated processing units. IBM saw this years ago, and thus began the whole Cell architecture - but I suspect that their job was much easier. The software that would run on the platform they are designing is fairly specific - games & multimedia which usually lend themselves well to vectorization.

The real challenge for architects (in my humble opinion) will be applying the same technique to other system bottlenecks.

AMD's (and now Intel's) approach of cramming more and more processing cores onto an IC might pay off in the short term, but like the "free lunch" of clock speed, will hit a roadblock when issues like memory bandwidth and caching schemes just have too much work to do with 4 or 8 processing cores hacking at it all the time.

- Re:The future (Score:2)
  
  by Rinikusu ( 28164 ) writes:
  
  Isn't that pretty much what the Amiga was doing a couple decades ago? The CPU was merely a traffic cop, directing other specialized units to actually do the real work? If so, they're a bit late to the party, eh?
Isn't it what std::valarray is for? (Score:2)

by 21mhz ( 443080 ) writes:

Reading this reminded me about that portion of the standard C++ library which is all about operations on vector data. So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?
- Re:Isn't it what std::valarray is for? (Score:3, Insightful)
  
  by kuwan ( 443684 ) writes:
  
  So, my question is: could an std::valarray specialization for processor-supported types serve as a basis for portable SIMD support in C++?
  
  That's exactly what this is. If you read the part on his website about valarray [pixelglow.com] then you'll see that it does extensive SIMD optimizations for valarray for both Altivec and MMX/SSE/SSE2/SSE3 platforms. He's even added "parallelized algorithms such as integer division, trigonometric functions and complex number arithmetic" which you'd have to code yourself in either ass
- Re:Isn't it what std::valarray is for? (Score:2)
  
  by emarkp ( 67813 ) writes:
  
  I guess you didn't notice: http://www.pixelglow.com/macstl/valarray/ [pixelglow.com].
OS X Tiger will do it for you (Score:2, Interesting)

by jilbert ( 520628 ) writes:

Tiger, the next OS release from Apple, will take care of vector optimization automatically [apple.com] in their version of gcc 4.0. I guess this will make it into the public gcc too.
- Re:OS X Tiger will do it for you (Score:2)
  
  by Junks Jerzey ( 54586 ) writes:
  
  Tiger, the next OS release from Apple, will take care of vector optimization automatically [apple.com] in their version of gcc 4.0. I guess this will make it into the public gcc too.
  
  For the record, this has been in Intel's C compiler for years now. It's also in the current release of the Microsoft Visual C++ compiler, including the free download version.
- Re:OS X Tiger will do it for you (Score:5, Informative)
  
  by be-fan ( 61476 ) writes: on Monday February 07, 2005 @01:33PM (#11598016)
  
  Actually, Apple's Tiger will get an auto-vectorizing compiler courtesy of the public GCC 4.0 release. The auto-vectorizer wasn't developed in Apple's version of GCC. IBM's GCC team at the Haifa Research Lab developed the vectorizer in the public LNO (loop nest optimization) branch of GCC 4.0. I'm not trying to minimize Apple's contribution here, one of their developers did work on the team, but let's give credit where credit is due.
  
  - Re:OS X Tiger will do it for you (Score:2)
    
    by johnnyb ( 4816 ) writes:
    
    Watch out, it's the Loop Nest Monster!
From the limewire... (Score:3, Interesting)

by WilyCoder ( 736280 ) writes: on Monday February 07, 2005 @01:07PM (#11597731)

As two of my professors have stated in class, SIMD and moreso parallel processing will require programmers to think in a fundamentally different way in order for multi-core/multi-processor to really take off.

This project may be a step in the right direction. Benchmarks show that SIMD such as SSE/2/3 only provide a marginal speed increase. And meanwhile, the massively parallel computations done on graphics cards dwarf anything SIMD claims to produce.

Perhaps we will see GFX manufacturers selling their technology to the CPU makers.

I forget the specifics, but a new GFX card can perform somewhere around 35 GFLOPS, while a 3.4GHz P4 (executing SIMD code) can only produce around 5-6 GFLOPS at best.

With projects like Brook GPU emerging, the division of CPU and GFX processor may be narrowed significantly.

- Algorithms (Score:2)
  
  by Detritus ( 11846 ) writes:
  
  You often need radically different algorithms to get the full benefit of SIMD. The processing power is there; figuring out how to exploit it can be very difficult.
  You can do a limited version of SIMD with an ordinary CPU. A 32-bit CPU can execute 32 "bit logic" operations with a single instruction. With a properly structured problem, 32 instances can be computed in parallel.
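
  A minimal C++ sketch of that trick, using a three-input majority vote as the "properly structured problem" (the choice of problem and the names majority32, majority1 and bit are assumptions for illustration). Bit k of each word holds instance k, so one pass of ordinary AND/OR instructions evaluates all 32 instances.

  ```cpp
  #include <cassert>
  #include <cstdint>

  // Majority of three booleans, one instance at a time.
  inline bool majority1(bool a, bool b, bool c) {
      return (a && b) || (a && c) || (b && c);
  }

  // The same function on 32 independent instances packed one per bit.
  inline std::uint32_t majority32(std::uint32_t a, std::uint32_t b,
                                  std::uint32_t c) {
      return (a & b) | (a & c) | (b & c);
  }

  // Extract instance k from a packed word.
  inline bool bit(std::uint32_t word, unsigned k) {
      assert(k < 32);
      return ((word >> k) & 1u) != 0;
  }
  ```

  For every k, bit(majority32(a, b, c), k) equals majority1(bit(a, k), bit(b, k), bit(c, k)), which is the whole point: the hard part is packing the problem so that each instance lives in its own bit position.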
Ignorant submitter, or smart marketing? (Score:3, Interesting)

by javaxman ( 705658 ) writes: on Monday February 07, 2005 @01:08PM (#11597746) Journal

Sorry, I can't read a story submitted by someone who doesn't even know about C [apple.com] libraries [intel.com] that have been around for years.
Or is this just another advertisement pretending to be a story, with the submitter trying to play ignorant about alternative Altivec and MMX libraries ?

- Maybe it's just Ignorant criticism... (Score:4, Informative)
  
  by kuwan ( 443684 ) writes: on Monday February 07, 2005 @01:48PM (#11598226) Homepage
  
  If you'd actually read what this is all about then you'd have found out that this is a cross-platform library for SIMD programming. You program in standard C++ using std::valarray and you get code optimized for Altivec and MMX/SSE/SSE2/SSE3 without having to do anything else. You don't need to worry about coding to two different libraries on two different platforms, nor do you have to worry about learning the platform-specific C intrinsics, alignment issues, head/tail cases, etc.
  
  SIMD programming becomes as easy as this:
  
  float af1 [] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
  stdext::valarray <float> v1 (af1, 10);  // construct from first 10 elements of af1
  stdext::valarray <float> v2 (10, 3.0f); // construct with 10 repeats of 3.0f
  stdext::valarray <float> v3 (10);       // construct with 10 repeats of 0.0f
  v3 = sin (v1) * cos (v2) + sin (v2) * cos (v1);
  
  He claims that the above code is 17.4x faster than Codewarrior MSL C++, 11.6x faster than gcc libstdc++ and 9.5x faster than Visual C++.
  
  Macstl also provides a cross-platform syntax for using vector registers that is similar to using the native C intrinsics on each platform. So while not all of the native operations are available, his cross-platform "vec" API allows you to write cross-platform code without having to learn both the Altivec and MMX/SSE intrinsics (which is a good solution for someone who knows one platform but not the other).
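For comparison, here is a sketch of the same expression written against the standard library's `std::valarray` rather than macstl's `stdext::valarray`; it makes no claim about how macstl implements it. One detail to watch: the standard fill constructor takes the value first and the count second, i.e. `std::valarray<float>(3.0f, 10)`. The helper names `sin_of_sum` and `max_error` are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Computes sin(x)*cos(y) + sin(y)*cos(x) element-wise, which equals sin(x + y).
inline std::valarray<float> sin_of_sum(const std::valarray<float>& x,
                                       const std::valarray<float>& y) {
    assert(x.size() == y.size());
    return std::sin(x) * std::cos(y) + std::sin(y) * std::cos(x);
}

// Largest absolute difference from the directly computed sin(x[i] + y[i]).
inline float max_error(const std::valarray<float>& x,
                       const std::valarray<float>& y) {
    const std::valarray<float> result = sin_of_sum(x, y);
    float worst = 0.0f;
    for (std::size_t i = 0; i < result.size(); ++i) {
        worst = std::max(worst, std::fabs(result[i] - std::sin(x[i] + y[i])));
    }
    return worst;
}
```

Whether a given standard library vectorizes this is up to the implementation; the point of the whole-array notation is that it leaves the library free to do so.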
  
  --
  Join the Pyramid - Free Mini Mac [freeminimacs.com]
  
  - Re:Maybe it's just Ignorant criticism... (Score:2)
    
    by javaxman ( 705658 ) writes:
    
    If you'd actually read what this is all about then you'd have find out that this is a cross-platform library for SIMD programming.
    My point exactly. Does the story say cross-platform anywhere? No, it says :
    programming for the PowerPC Altivec and Intel MMX/SSE SIMD (single instruction multiple data) units remains the black art of assembly language magicians
    er... so, instead of saying something like "here's a product which allows you to use the same API for both PPC and Intel SIMD", the submitter puts in th
liboil (Score:3, Interesting)

by labratuk ( 204918 ) writes: on Monday February 07, 2005 @01:23PM (#11597901)

Another project trying to do something similar is liboil [schleef.org], the Library of Optimised Inner Loops.

However in the future I can see things changing for the structure of the standard PC.

At the moment in a high end machine you have the CPU, which is a scalar processor, a GPU, which is in essence a glorified vector processor (not just useful for graphics, as projects like GpGPU are showing us), and SIMD extensions to the CPU to allow it to do small amounts of vector processing.

Scalar processors are good for some things (branchy code) and vector processors are good for other things (very predictable parallel code). Having both is very useful.

I would say in the next 5-10 years we will see the GPU join together with the SIMD extensions to provide a separate general purpose vector processor.

PCs will ship with two processors - one scalar, one vector. And everyone will be happy.

Now, whether this will be transparent to the programmer depends on how automatic code optimisation progresses over the next few years. Is Intel's icc auto vectorisation already good enough? Don't know.

Why? Altivec-optimized libraries supplied by Apple (Score:4, Interesting)

by coult ( 200316 ) writes: on Monday February 07, 2005 @01:38PM (#11598089)

You really don't need macstl unless you have a strong desire to use valarray in C++...for example, the ATLAS project http://math-atlas.sourceforge.net/ [sourceforge.net] already uses Altivec (and SSE/SSE2, etc) wherever it results in a speedup. So, if your code does linear algebra, use ATLAS and you'll see an automatic speedup in many cases. Other projects such as fftw http://fftw.org/ [fftw.org] include Altivec/SSE/SSE2 optimizations as well. ATLAS includes lots of other optimizations such as cache-blocking, loop-unrolling, etc. I don't know if macstl includes such optimizations, but I do know that ATLAS performance approaches the theoretical peak performance on G4/G5 for things like matrix-matrix multiplication.

Not only that, but Apple's vecLib http://developer.apple.com/ReleaseNotes/MacOSX/vecLib.html [apple.com] includes ATLAS so you don't even have to download or install anything - it comes with OS X.

Why limit yourself to Altivec when you have NVidia (Score:4, Insightful)

by kompiluj ( 677438 ) writes: on Monday February 07, 2005 @01:45PM (#11598166)

Well, the processing power of Altivec or MMX/SSE/3DNow or whatever is nowhere near the power of the newest NVidia/ATI card you have surely bought for playing Doom III. Why not use it then? Get the brook compiler [stanford.edu]! Furthermore, I see they [pixelglow.com] introduce classes like vec, etc. Such classes have already been designed successfully for C++. Why not try porting Blitz [oonumerics.org] to the Altivec and/or to the GPU?

- Re:Why limit yourself to Altivec when you have NVi (Score:3, Insightful)
  
  by TheRaven64 ( 641858 ) writes:
  
  The main reason is that the AGP bus is designed to move data very quickly to the card, but is not so hot at moving it back again. This should change with PCI Express.
OSI-approved RPL goodness. Admit it.... (Score:3, Funny)

by Pyrosophy ( 259529 ) writes: on Monday February 07, 2005 @02:03PM (#11598427)

This story doesn't really mean anything and people are just making up comments.

Content Addressable Parallel Processors (Score:3, Interesting)

by Baldrson ( 78598 ) * writes: on Monday February 07, 2005 @02:10PM (#11598517) Homepage Journal
The real "grand unified theory" of SIMD is CAPP or content addressable parallel processors. I read a book [amazon.com] on this topic back in the 1970s and it was pretty clear to me that it:
1. Was a great way of dealing with relational data
2. Would have to await much larger scales of integration before becoming practical.
Since then the computer world has become much more relational due to relational databases, and the levels of integration have skyrocketed, but no major manufacturer of silicon has bothered to revisit this very simple and powerful route to high power computing.
Fortunately there is at least a little ongoing research [mit.edu].
The beauty of these processors is they integrate memory with computation so that the massive economies of scale we witness in memory fabrication apply to computation speeds as well so long as we can move toward relational rather than function computing as a paradigm. Fortunately this appears to be supported by the study of quantum computers, however those computers may never see the light of day for more fundamental reasons.
- Re:16X increase? (Score:2, Interesting)
  
  by mirko ( 198274 ) writes:
  
  When using Reason 3 [propelerheads.se], some virtual synths have the option to produce an enhanced sound.
  What is curious is that if you are using a pre-Altivec proc (G3), it'll burn more CPU time while the same enhancement will be totally and natively supported by Altivec-enabled units : a 400MHz G4 Powerbook is enhancing these synths more efficiently than an 800MHz G3.
  I guess this was like the simultaneous operations that the ARM assembly language supports (e.g. both storing and rotating values in an operation)...
- Re:16X increase? (Score:5, Informative)
  
  by LordRPI ( 583454 ) writes: on Monday February 07, 2005 @12:43PM (#11597457)
  
  The principle behind SIMD, or, rather, Single Instruction Multiple Data, is that you can process wide arrays of values in a single instruction. With the PowerPC version of SIMD, also known as AltiVec, you can issue an instruction and have it work with a 128-bit wide register. These registers may contain up to 4 32-bit numbers, 8 16-bit numbers or 16 8-bit numbers. For example, I can load two AltiVec registers with 16 unsigned chars each, add them together using vec_add() and have it return its results to an AltiVec register. So this in essence is adding 16 values at once, and in theory that's good enough for marketing to claim a 16X speedup, but this is rarely the case.
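To make the 16-lane add concrete, here is a plain C++ scalar model of what a lane-wise modular add over 16 unsigned chars computes. It is not AltiVec code and uses no intrinsics; `Bytes16` and `add_lanes` are names invented for the sketch. The real hardware produces all 16 results with one instruction instead of a loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A 128-bit register viewed as 16 unsigned 8-bit lanes.
using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar model of a lane-wise modular add: each lane wraps around
// independently, and no carry crosses from one lane into the next.
inline Bytes16 add_lanes(const Bytes16& a, const Bytes16& b) {
    Bytes16 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
    return out;
}
```

The 16X figure only holds while the data is already sitting in registers as 8-bit lanes; loads, stores, alignment fix-ups and wider element types all eat into it.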
  
  - Re:16X increase? (Score:3, Interesting)
    
    by sribe ( 304414 ) writes:
    
    So this in essence is adding 16 values at once and in theory it's good enough for marketing to claim a 16X speedup, but this is rarely the case.
    
    There are 32 of these registers (independent, not shared with the FPU) which means you can chain together a pretty complex series of calculations without intermediate load/store sequences. The unit has multiple independent computation units with their own dispatch queues (details vary between specific processor models). Some AltiVec opcodes are designed to common s
- Re:16X increase? (Score:2, Informative)
  
  by Anonymous Coward writes:
  
  The concept, and radical performance boost, is in line (pardon the pun) with Expression Templates for C++.
  
  A good example is what happens when you let the compiler decide how to do arithmetic with vectors and matrices.
  
  Matrix a,b,c,x;
  x = a + b + c;
  
  The naked compiler, in combination with your custom Matrix class, will probably unwind the operator overloads to do something like this:
  
  // assuming a reasonable STL w/function inlining
  Matrix __t1;
  for(int i=0; i<a.width; i++){
    for(int j=0; j<a.width; j++){
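A minimal sketch of the expression-template technique mentioned above, assuming a one-dimensional `Vec` in place of the `Matrix` class and supporting only `+`. `Sum` and `Vec` are hypothetical names for this illustration. The operators build a lightweight expression object, and the single loop in `operator=` evaluates `a[i] + b[i] + c[i]` directly, so no temporary vectors are created.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node representing "lhs + rhs" without computing anything yet.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Hypothetical minimal vector type standing in for the Matrix class.
class Vec {
public:
    explicit Vec(std::size_t n, double value = 0.0) : data_(n, value) {}

    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Assigning from an expression runs one loop over the elements.
    template <typename L, typename R>
    Vec& operator=(const Sum<L, R>& expr) {
        assert(expr.size() == data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i) {
            data_[i] = expr[i];  // for x = a + b + c: a[i] + b[i] + c[i]
        }
        return *this;
    }

private:
    std::vector<double> data_;
};

// Vec + Vec yields an unevaluated expression node.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

// (expression) + Vec nests the existing node inside a new one.
template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) {
    return {a, b};
}
```

With these definitions, `x = a + b + c;` compiles to one pass over the data. The expression nodes hold references, so they must not outlive the statement that created them.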
- Moore's Law has nothing to do with assembly (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  Moore's Law has eroded the need for assembly
  
  Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia [wikipedia.org]:
  Moore's law is an empirical observation stating, in effect, that at our rate of technological development and advances in the semiconductor industry, the complexity of integrated circuits doubles every 18 months.
  I wish people would stop saying "But Moore's Law..." for every hardware-related story on Slashdot. Do a bit of reading, please.
  - Re:Moore's Law has nothing to do with assembly (Score:2)
    
    by asliarun ( 636603 ) writes:
    
    You misunderstood.
    
    >> Moore's Law has eroded the need for assembly
    
    > Moore's Law has nothing to do with assembly language and optimizations. From Wikipedia:...
    
    The grandparent was saying that because processor speeds have increased to such an extent (Moore's Law), it doesn't make sense to use assembly to write modern code; even if the assembly code is faster.
- Re:Moore's Law has eroded the need for assembly (Score:3, Funny)
  
  by geoffspear ( 692508 ) * writes:
  
  99% of all jobs in the world require no programming at all. Therefore, there is no need for anyone anywhere to learn C.
  90% of the world's people do not own cars. Therefore, there is no need for gas stations. If you pick a living human completely at random from the earth, chances are they don't drive one of these "car" things.
- Re:Moore's Law has eroded the need for assembly (Score:2)
  
  by bonch ( 38532 ) writes:
  
  Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche, much like the Win32 programming I used to do in the early 90's.
  
  I don't consider Doom 3 to be a niche.
  - Re:Moore's Law has eroded the need for assembly (Score:2)
    
    by betelgeuse68 ( 230611 ) writes:
    
    Sure, and you and everyone you know is working on Doom3 or a competitor?
    
    Just because you use it, doesn't mean you engineer it.
    
    You use a TV... when was the last time you even thought of any of the electronics inside of it?
    
    -M
- Assembly (Score:3, Insightful)
  
  by bsd4me ( 759597 ) writes:
  
  Even in embedded systems, assembly isn't used as much as it used to be. It still gets used in bootloaders, and sometimes in device drivers. However, most devices are memory mapped, and most of the driver is written in C, and asm() calls are made when appropriate (eg, asm("eieio");), especially when you get to use gcc and asm() syntax for accessing variables.
  - - Re:Assembly-DSPs (Score:2)
      
      by bsd4me ( 759597 ) writes:
      
      It is when programming DSP's (and related devices).
      
      From my experience, yes and no. Fixed-point DSP tends to be done in assembly, mainly because FP techniques don't translate well to C. The compilers also tend to suck. A fair to large amount of floating-point DSP is done with C when the compiler support is good. I have done a lot of floating-point DSP, and we found that the write in C, refine in ASM workflow was best.
      Don't forget that microcontrollers outnumber microprocessors by a large margin.
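One reading of why fixed-point work maps awkwardly onto C-family languages: a basic fractional multiply that many DSPs provide as a single instruction has to be spelled out with widening, rounding, shifting and saturation. The sketch below assumes the common Q15 format; `q15` and `q15_mul` are names chosen for this example, not from any particular DSP toolchain.

```cpp
#include <cassert>
#include <cstdint>

// Q15 format: a 16-bit signed integer n represents the real value n / 32768.
using q15 = std::int16_t;

// Multiply two Q15 values with rounding and saturation.
inline q15 q15_mul(q15 a, q15 b) {
    // The exact product fits in 32 bits and carries 30 fractional bits.
    const std::int32_t product = static_cast<std::int32_t>(a) * b;
    // Add half an output LSB, then drop 15 fractional bits.
    // Note: >> on a negative value is an arithmetic shift on mainstream
    // compilers (and guaranteed from C++20 on).
    const std::int32_t rounded = (product + (1 << 14)) >> 15;
    // Only (-1.0) * (-1.0) overflows; clamp it to the largest Q15 value.
    if (rounded > INT16_MAX) {
        return INT16_MAX;
    }
    return static_cast<q15>(rounded);
}
```

For instance, `q15_mul(16384, 16384)` multiplies 0.5 by 0.5 and yields 8192, which is 0.25 in Q15.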
- Re:Moore's Law has eroded the need for assembly (Score:2, Insightful)
  
  by lowe0 ( 136140 ) writes:
  
  Which is exactly why this sort of thing is so important.
  
  Sure, you could probably get it to work even faster with hand-tuned assembly than simply using this library. But programmer time is expensive, and customizing code adds complexity. By reusing optimized code, you can enjoy some of the benefits of SIMD without having to devote the same amount of resources.
  
  Let's be honest, this isn't a silver bullet - this isn't going to speed up code that doesn't use lots of floating-point vectors anyway. But if it
- Depends on what you are doing (Score:5, Insightful)
  
  by dsci ( 658278 ) writes: on Monday February 07, 2005 @01:09PM (#11597755) Homepage
  
  We write code for hardcore chemical simulations. The limits on what can be studied, ie number of atoms/molecules or timescales of the simulations depends on one thing: speed.
  
  Faster computers means better simulations. BUT, if the code is not as fast as it can be on a particular architecture, your simulations are not going to be as complete as they can be. At least within a given time allotment.
  
  I've recently applied some code optimizations to a Monte Carlo simulation and saw speed ups of over 1000x. That's significant.
  
  It's naive to think that faster computers means we should live with sloppy or unoptimized code. SIMD is a useful technique, and if it means the difference between me getting work done in a week or two or three weeks, I think I'll take the one-week sim.
  
  - Re:Depends on what you are doing (Score:2)
    
    by imsabbel ( 611519 ) writes:
    
    Speedups like a factor of 1000 can only come from high level optimisations (like replacing an O(n^2) with an O(n log n) algo).
    
    Honestly: to be able to get a 1000 times boost, your original code must have been beyond bullshit.
    
    And of course using SIMD is better than not using it, but I would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation if the results are real or you just botched some line isn't worth it.
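As an illustration of the kind of high-level change meant by replacing an O(n^2) algorithm with an O(n log n) one, here is a sketch using duplicate detection as a stand-in problem. The problem and the function names are chosen for the example and are not taken from the simulation code under discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// O(n^2): compare every pair of elements.
inline bool has_duplicate_quadratic(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        for (std::size_t j = i + 1; j < v.size(); ++j) {
            if (v[i] == v[j]) {
                return true;
            }
        }
    }
    return false;
}

// O(n log n): sort a copy, after which equal values are adjacent.
inline bool has_duplicate_sorted(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}
```

For a million elements the first version does on the order of 5 * 10^11 comparisons and the second on the order of 2 * 10^7, a gap no amount of instruction-level tuning of the first version closes.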
    - Re:Depends on what you are doing (Score:2)
      
      by groomed ( 202061 ) writes:
      
      And of course using SIMD is better than not using it, but I would rather stay on a "let the compiler vectorize it" level. I mean, doing your inner loop in leet assembler only to NOT know after a long simulation if the results are real or you just botched some line isn't worth it.
      
      Baseless FUD. Why would a few dozen lines of hand coded assembly suddenly invalidate the results?
    - Re:Depends on what you are doing (Score:3, Insightful)
      
      by Dasein ( 6110 ) writes:
      
      Speedups like a factor of 1000 can only come from high level optimisations (like replacing an O(n^2) with an O(n log n) algo).
      
      Nope. Technically, there are two constants buried in here. The definition is g(x) = O(f(x)) => g(x) <= k*f(x) where x > a for some arbitrary a. If you don't change algorithms, all you can do is manipulate the k. For a given k and a given level of improvement, I can give you a new k that hits that level of improvement.
      
      Honestly: TO be able to get a 1000 times boost, you
      - Re:Depends on what you are doing (Score:3)
        
        by aminorex ( 141494 ) writes:
        
        The difference between running mostly in L1 cache and regularly going to RAM (particularly when load/store patterns are pessimal), multiplied by the parallelism of exploiting SIMD can quite feasibly give a 1000x performance difference.
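A small sketch of what a good versus pessimal load pattern looks like in code, assuming a row-major matrix in one contiguous buffer. `Grid`, `sum_row_order` and `sum_column_order` are illustrative names. Both functions return the same sum; only the order in which memory is touched differs, and no particular speed ratio is claimed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix stored in one contiguous buffer.
struct Grid {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> cells;

    Grid(std::size_t r, std::size_t c, float value)
        : rows(r), cols(c), cells(r * c, value) {}

    float at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return cells[r * cols + c];
    }
};

// Inner loop walks consecutive addresses, so each cache line is fully used.
inline double sum_row_order(const Grid& g) {
    double total = 0.0;
    for (std::size_t r = 0; r < g.rows; ++r) {
        for (std::size_t c = 0; c < g.cols; ++c) {
            total += g.at(r, c);
        }
    }
    return total;
}

// Inner loop strides by a whole row, so consecutive accesses are far apart
// in memory and may each land on a different cache line.
inline double sum_column_order(const Grid& g) {
    double total = 0.0;
    for (std::size_t c = 0; c < g.cols; ++c) {
        for (std::size_t r = 0; r < g.rows; ++r) {
            total += g.at(r, c);
        }
    }
    return total;
}
```

The row-order loop is also the one a vectorizer can turn into SIMD loads, which is how the two effects end up multiplying.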
- Moore's Law is OVER (Score:2)
  
  by emarkp ( 67813 ) writes:
  
  Haven't you been paying attention? Processor speed increases stopped 2 years ago. We can put more transistors on silicon, but the free performance ride is over.
  See Herb Sutter's article in the Feb C/C++ Users Journal or the (expanded) one in the March Dr. Dobb's Journal.
- But times are changing, this is becoming valuable (Score:2)
  
  by Paradox ( 13555 ) writes:
  
  Recently Herb Sutter (famous software engineering guru and C++ wizard) posted this essay [www.gotw.ca] in which he reminds us, among other things, that the generalization of Moore's law to processor speed is already failing! While computers are continuing to get faster, it's not just in their clock speed anymore.
  While memory speeds will continue to improve for a while, processor speeds are already falling off. Check out this graph from the article [www.gotw.ca] where he clearly shows what's happening.
  This brings an interesting dilemma to modern pro
- Re:Moore's Law has eroded the need for assembly (Score:4, Insightful)
  
  by groomed ( 202061 ) writes: on Monday February 07, 2005 @02:01PM (#11598405)
  
  Sorry, but yours is an utterly kneejerk boilerplate response which has nothing to do with the topic at hand and only serves to establish your credentials as a hard nosed realist who has been there and done it.
  
  Moore's Law has eroded the need for such knowledge
  
  Moore's "law" (which is just an off-the-cuff observation, really) has nothing to do with this. If anything, Moore's law has enabled transistor and space devouring SIMD technology.
  
  It would be like concerning myself on how to design circuits...
  
  No, it's nothing like that at all. Just because you own and know how to use money doesn't mean there is no point to the complex financial reckonings that are made every day at institutions all over the world. You may not need it, but you are not what is under discussion.
  
  Yes some people who write games are still concerned with assembly as are people in embedded markets. But those jobs, situations and skills are niche
  
  By this definition, everything is niche. The whole computing industry becomes "niche". Farming is "niche". The paper industry is "niche". What you're describing is just non-descript white collar administrative work which just happens to involve a computer; bit shuffling, rather than paper shuffling.
  
  Those situations are about the last place you will find anyone caring about something called "assembly language."
  
  Again, completely irrelevant.
  
  The point is that with a few dozen lines of SIMD code (whether in assembly or some high level language) any reasonably competent programmer can achieve four-fold, ten-fold, even twenty-fold speedups on critical path code, from scratch, in as little as a week.
  
  These are amazing results, and people should be encouraged to investigate the possibilities, not be dragged down into this drab netherworld of yours.
  
- Re:Moore's Law has eroded the need for assembly (Score:2)
  
  by afidel ( 530433 ) writes:
  
  Doing things like transcoding/encoding of multimedia content is one of those "niche" areas where assembly is still needed. Whether it takes 1.5 hours or 3 hours to transcode a movie is a BIG deal, especially if you have to do it many times to archive a library of old content. Sure, most programmers won't need it, ever, but that's been pretty much true since we got high level languages and computers got more than a couple hundred K of RAM.
- Obviously, you arent a PS2 graphics programmer.. (Score:2)
  
  by LordZardoz ( 155141 ) writes:
  
  And one step further, I am betting you do not perform any sort of graphics programming.
  
  On win32 / mac platforms, the need to know how to do this is pretty low. DirectX wraps most of it, as well as the processes needed for GPU programming. I am sure the Mac libs that do the same job as DirectX accomplish much the same.
  
  But low level graphics programming is alive and well for game programming. I do what I can to stay well clear of that, since I don't like graphics programming much (just personal preference
