
Book Review: OpenCL Programming Guide

samzenpus posted more than 2 years ago | from the read-all-about-it dept.

Programming 40

asgard4 writes "In recent years GPUs have become powerful computing devices whose power is not only used to generate pretty graphics on screen but also to perform heavy computation jobs that were exclusively reserved for high-performance supercomputers in the past. Considering the vast diversity and rapid development cycle of GPUs from different vendors, it is not surprising that the ecosystem of programming environments has flourished fairly quickly as well, with multiple vendors, such as NVIDIA, AMD, and Microsoft, all coming up with their own solutions for how to program GPUs for more general-purpose computing (also abbreviated GPGPU) applications. With OpenCL (short for Open Computing Language), the Khronos Group provides an industry standard for programming heavily parallel, heterogeneous systems, with so-called kernels written in a C-like language. The OpenCL Programming Guide gives you all the necessary knowledge to get started developing high-performing, parallel applications for such systems with OpenCL 1.1." Keep reading for the rest of asgard4's review.

The authors of the book certainly know what they are talking about. Most of them have been involved in the standardization effort that went into OpenCL. Munshi, for example, is the editor of the OpenCL specification, so all the information in the book is first-hand knowledge from experts in OpenCL. The reader is expected to be familiar with the C programming language and basic programming concepts. Some experience in parallelizing problems is a benefit but not a requirement.

The book consists of two major parts. The first part is a detailed description of the OpenCL C language and the API used by the host to control the execution of programs written in that language. The second part comprises various case studies that show OpenCL in action.
The authors get straight to the point in the introduction, discussing the conceptual foundations of OpenCL in detail. They explain what kernels are (basically functions that are scheduled for execution on a compute device), how the kernel execution model works, how the host manages the command queues that schedule memory transfers or kernel execution on compute devices, and the memory model.

While this first chapter is all prose, the second chapter dives right in with some code and a first HelloWorld example. The following chapters introduce more and more of the OpenCL language and API step by step. All API functions are described in a reference-like style with a lot of detail, including possible error codes. However, the text is not a reference; there is always a good explanation with examples or short code listings. The only notable exception is chapter three, which presents the OpenCL C language; a few more examples would have made this chapter less dry.

An important chapter is chapter nine on events and synchronization between multiple compute devices and the host. This chapter is important because — as any experienced parallel programmer knows — getting synchronization right is often tricky but obviously essential for correct execution of a parallel program.

An interesting feature in OpenCL is the built-in interoperability with OpenGL and, surprisingly, Direct3D. Various functions in the OpenCL API allow creating buffers from OpenGL/Direct3D objects, such as textures or vertex buffers, that can be used by an OpenCL kernel. This opens up interesting possibilities for doing a lot more work on the GPU in graphics applications, such as running a fluid simulation on the GPU in OpenCL, which directly writes its results into vertex buffers or textures to be used directly for rendering without the host CPU having to intervene.

Before delving into the case studies the book briefly discusses the embedded profile that is available for OpenCL and the standardized C++ API that the Khronos Group provides in addition to the regular OpenCL API (which is defined exclusively as C functions). The C++ API makes using some of the OpenCL objects a little bit easier and somewhat nicer.

The second part of the book contains various interesting case studies that show off what OpenCL can be used for, such as computing a Sobel filter or a histogram for an image, computing FFTs, doing cloth simulation, or multiplying dense and sparse matrices. The choice and variety of case studies is definitely interesting, and most will be immediately applicable to readers as they go forward developing applications with OpenCL. All the code for the examples and case studies in the book is available for download on the book's website.

Overall, the OpenCL Programming Guide succeeds in being a great introduction to OpenCL 1.1. The book covers all of the specification and more, has an easy-to-read writing style, and yet provides all the necessary details to be an all-encompassing guide to OpenCL. The good selection of case studies makes the book even more appealing and demonstrates what can be done with real-life OpenCL code (and also how it needs to be optimized to get the best performance out of current OpenCL platforms, such as GPUs).

Martin Ecker has been involved in real-time graphics programming for more than 15 years and works as a professional game developer for Sony Computer Entertainment America in sunny San Diego, California.

You can purchase OpenCL Programming Guide. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.


Grrrr! (-1)

Anonymous Coward | more than 2 years ago | (#38768576)

I can't hold anymore!
I'll rip off my clothes and become the ultimate in bootyasscheek johnson ultimatum supremacy naked!
I will show the world my true power! My true ferocity!
My true bootyasscheek johnson ultimatum supremacy nakedness has been unleashed!

Re:Grrrr! (0)

hedwards (940851) | more than 2 years ago | (#38768634)

Wrong thread, I think you meant to post in the next thread down.

Re:Grrrr! (0)

Anonymous Coward | more than 2 years ago | (#38769198)

You think GP makes sense in any thread??

Re:Grrrr! (1)

hedwards (940851) | more than 2 years ago | (#38770298)

Well, there was the one about MS supporting same sex marriage.

Re:Grrrr! (1)

Sulphur (1548251) | more than 2 years ago | (#38770520)

Wrong thread, I think you meant to post in the next thread down.

He is off his threads.

Ordinary Mortals (0, Interesting)

Anonymous Coward | more than 2 years ago | (#38768706)

Now if only someone made a GPU programming language that ordinary mortals can use. Like letting you address the GPU in plain vanilla C# or BASIC..

Re:Ordinary Mortals (5, Insightful)

Anonymous Coward | more than 2 years ago | (#38768848)

If you need to be using C# or BASIC, you have no business touching the GPU.

Re:Ordinary Mortals (0)

Anonymous Coward | more than 2 years ago | (#38769224)

While I agree that using C or C++ makes this much easier, and in theory faster, I have also programmed using an OpenCL wrapper in C# and it works damn well and damn fast. Hardware shouldn't choose the language; the scope of your goal should.

Re:Ordinary Mortals (2, Interesting)

shaitand (626655) | more than 2 years ago | (#38769878)

A GPU is just another computer. Coding for it in OpenCL isn't much different from writing C code that is just a wrapper around some assembly. There is no reason a MUCH more human-friendly interface couldn't be made, with the compiler taking care of using the appropriate memory and instructions to optimize for GPU usage.

Hell, considering how many people are doing this, it's amazing there isn't even anything approaching a real, comprehensive OpenCL tutorial on the web. Just because you CAN learn to use something almost entirely from spec sheets and API documentation doesn't mean anyone learns faster or better that way.

Re:Ordinary Mortals (5, Informative)

Anonymous Coward | more than 2 years ago | (#38770286)


CPUs are assembly lines: if you have a quad-core system, you have 4 assembly lines, and they may be very long. Those 4 assembly lines don't get to talk to each other except on either end. They can all be doing the same activity or different activities, and they operate asynchronously. When they finish what they are doing, they wipe out the assembly line.
GPUs are synchronous and parallel. Every assembly line in a GPU can only execute the same instruction code until cleared. So if there are 2048 assembly lines, each of them executes the same instructions, on different pieces of data.

So in principle, if you can't parallelize it (e.g. zlib), it is better run on the CPU. If it can be parallelized (image, video and sound compression, FFT, specific math functions), you can run it on the GPU.

What we haven't done yet is discover any lossless parallelizable compression schemes. The problem is that the more fragments you break the data into, the less compression you can do, because compression is purely serial. Lossy compression, however, is not serial: you can say "here's a 64x64 block of data, compress it", and it will do that on the entire image at once, because those 64x64 blocks don't rely on the compression of any of the other blocks in the image. The compression code may be a simple XOR or a motion vector against the previous image. It can't rely on the neighboring 64x64 blocks.

This is why you see "accelerated" video tearing: the player doesn't wait for all the fragments in the frame to complete before flipping the video buffer. Adobe Flash is especially guilty of this; on dual-core and quad-core systems you'll see screen tearing because Flash assumes it has 100% use of the CPU, even though that same CPU is doing other stuff. If Flash used the GPU, it would suffer the same problem, since the GPU is still used by the accelerated composited desktop in Windows Vista and 7.

Anyway. CPU programming and GPU programming are completely different animals.

One thing GPUs have high potential for is independent computations. For example, back in 1992, if you were playing a game, the game could only compute the NPCs that were just off the screen. Today, you could use the GPU to compute all the NPCs' positions simultaneously. This is currently done with physics computations. Not simply doing "AI" on the GPU, but actually creating neural networks for many NPCs to react to the player character, not just simple "is PC visible, shoot it."

Re:Ordinary Mortals (2)

bored (40072) | more than 2 years ago | (#38770870)

What we haven't done yet is discover any lossless parallelizable compression schemes.

Uh, there are a lot of ways to parallelize lossless compression schemes. I've been involved in projects doing this a couple of times over the last decade. One example out of the half dozen I can think of off the top of my head: the history-buffer search in LZ77 can be parallelized. How you go about that makes a huge difference in how fast it is.

Re:Ordinary Mortals (1)

Jerry Atrick (2461566) | more than 2 years ago | (#38773948)

The history probe may be parallel, but the overall compression isn't, because those searches still have to be executed serially: until the probe completes, you don't know how much input was consumed or what to start the next probe with. The parallelism doesn't scale beyond speeding up the serial steps.

The same applies to decoding variable-length codes. I have SIMD-accelerated yanking the next Huffman code from a bitstream, but I have to know where the first bit is to perform the detection. The overall loop is still resolutely serial.

AFAIK there are no lossless compression algorithms able to break that serialism without compromising compression rates.

Re:Ordinary Mortals (1)

AstrumPreliator (708436) | more than 2 years ago | (#38770650)

A GPU is a computing device, but it's not another CPU. So while it may be fairly flexible, it's still designed with one thing in mind. Debugging isn't nearly as nice as it is on the CPU; you can't do things such as print to the console from within a kernel (on a GPU) without an extension, and if you initiate a very time-consuming process on the GPU, your monitor will probably be locked up until it finishes. Not to mention memory management is difficult on the GPU, since you have to think about things such as coalesced reads/writes and cache coherency. Branching is also pretty slow on a GPU, which needs to be taken into account.

Lastly, OpenCL is not just for GPUs; it's for a bunch of different devices, including CPUs, FPGAs, and DSPs. All of these have their own quirks. Now, I'm not saying it's not possible to abstract all of this away, but it's not as easy as you make it sound. If you can't do it at this level, you shouldn't do it at all, since you are going to have to work at this level whether you use a wrapper or not. I'm sure in 5-10 years the situation will change, but until then I agree with the GP.

Re:Ordinary Mortals (0)

Anonymous Coward | more than 2 years ago | (#38775726)

Lastly OpenCL is not just for GPUs, it's for a bunch of different devices including CPUs, FPGAs, and DSPs.

One day I would not be surprised if they had an OpenCL back end for CPU-only clusters. You sort of have the same concepts: global memory is memory shared by every node, and slow; local memory is the memory of each node, and faster. Imagine a cluster computer where you execute your kernels on the grid from a single host.

Re:Ordinary Mortals (5, Informative)

bored (40072) | more than 2 years ago | (#38770666)

Coding for it in OpenCL isn't much different than writing C code that is just a wrapper around some assembly. There is no reason a MUCH more human friendly interface couldn't be made with the compiler taking care of using the appropriate memory and instructions to optimize for GPU usage.

As someone who has actually done some OpenCL programming, I can tell you why you're wrong. Learning OpenCL syntax isn't hard; if you know C#, you can probably write some useful OpenCL code in just an hour or two. It is, after all, a C-like language, just like C# is a C-like language.

That said, don't expect your OpenCL code to run faster than similar C code compiled with SSE. That's because making OpenCL run fast is an exercise in looking at memory access patterns, understanding how to share data between hundreds of threads efficiently, etc. My first OpenCL program was actually slower (by half) than a similar program using all 8 cores of my CPU. I got it on par with the CPU on a top-of-the-line AMD GPU within a day or so, and then spent another two weeks trying different things until finally finding the magic bullet that removed a memory collision I was having and by itself increased the performance of my routine by ~32x. Running the same code on an Nvidia GPU put me back in the ballpark of my CPUs again, requiring more time to make it fast on those GPUs. Time I wasn't willing to spend.

The bottom line is that OpenCL could be any language, but what is necessary is the ability to make changes that affect how data is laid out in memory and how that data is being read/written. Furthermore, you need the ability to specify where the memory is used, because GPUs have an unforgiving memory hierarchy. So if you're not comfortable with the nitty-gritty details of how computers (or in this case GPUs) actually work (not some CompSci abstraction), you're not going to write good OpenCL code. You also need a gut feeling for how fast something could be, based on the specification of a particular device. Otherwise you won't know when to give up.

Re:Ordinary Mortals (1)

Daniel Phillips (238627) | more than 2 years ago | (#38772350)

OpenCL could be any language, but what is necessary is the ability to make changes that affect how data is laid out in memory, and how that data is being read/written.

In short, when the end goal is nothing more or less than optimization, it helps to know what you're doing.

Re:Ordinary Mortals (1)

jo_ham (604554) | more than 2 years ago | (#38775422)

Whatever you did, it might as well be sorcery. Having tried to get into programming at a hobby level, I am lost at arrays and pointers and "simple stuff".

Writing a program for a computer I think is something that I am destined to never understand - my brain is just not wired that way. I can't even imagine how complex it gets when a GPU is involved.

That said, I'm a chemist, and I can never shake the feeling that a programmer could learn what I do with a little book learning and some practice. Yet despite knowing how to solve the Schrodinger equation by hand (albeit only particle-in-a-box, for simplicity), I simply could not do it for the life of me on a computer, even in relatively high-level tools like Maple, let alone the classes that went on to use C to do "simple" chemistry stuff. I couldn't get past getting the computer to read a list of numbers into an array and sort them by size.

I'll stick to my fume hood, I think - theoretical chemistry is not for me :p

Sorry, this is wildly offtopic.

Re:Ordinary Mortals (1)

Daniel Phillips (238627) | more than 2 years ago | (#38772332)

A GPU is just another computer.

Why don't you write some GPU code, then come back and tells us that again ;-)

Re:Ordinary Mortals (0)

Anonymous Coward | more than 2 years ago | (#38775688)

> Coding for it in OpenCL isn't much different than writing C code that is just a wrapper around some assembly

When you know nothing about something, it is usually better to shut up. Programming a GPU is *nothing* like a CPU.

Re:Ordinary Mortals (1)

lightknight (213164) | more than 2 years ago | (#38771128)

Need? No. Prefer? Yes.

I prefer to save C for the Linux kernel. C# does just fine for regular programming, and doesn't make you hate yourself when you forget to properly terminate a string.

And I know classes have, for some odd reason, fallen out of style for programmers, but I like them. I've tried functional programming, and I just don't like it. I prefer my code to be more...organized / sane.

Re:Ordinary Mortals (1)

epyT-R (613989) | more than 2 years ago | (#38778909)

Yeah, except that C# binaries run like dogs on the user's computer compared with a C equivalent. I avoid .NET and Java software whenever possible for this reason. If I wanted my workflow to behave like it's on a Pentium 75, I'd just use a Pentium 75.

Re:Ordinary Mortals (0)

Anonymous Coward | more than 2 years ago | (#38775416)

That is nonsense. I can recommend the 'Thrust' library for Nvidia GPUs; it is a C++ wrapper over CUDA. What Thrust gives you is STL-like algorithms. This abstracts away much of the low-level grunt work.

It is very easy to build fast data parallel algorithms using Thrust.

As for high-level languages, Haskell and F# may be perfect fits for GPU programming. C# too: you could easily have a 'data parallel' set of collections that target the GPU, though for many applications that may require a Tesla board rather than a GeForce.

Re:Ordinary Mortals (0)

Anonymous Coward | more than 2 years ago | (#38768938)

Hahaha. Mouthbreathers like you should go back to being fry cooks rather than trying to continually dumb down computing.

Re:Ordinary Mortals (0)

Anonymous Coward | more than 2 years ago | (#38768988)

Mouthbreathers & fry cooks huh? I bet that once someone creates a way to program a GPU in C# or BASIC, you'll be the first to use it!!!! Creep.

Re:Ordinary Mortals (4, Funny)

Mitchell314 (1576581) | more than 2 years ago | (#38769016)

Then we can move on to making an assembly language that ordinary mortals can use! If only we could wrap it in some kind of "higher level" language with more abstract constructs . . .

Re:Ordinary Mortals (0)

UnknownSoldier (67820) | more than 2 years ago | (#38769256)

LOL! nice ...

Re:Ordinary Mortals (3, Informative)

UnknownSoldier (67820) | more than 2 years ago | (#38769226)

Why don't you start with ShaderToy ? []

And some interesting code snippets ... []

Reddit is the Digg of /. -- group herd-think, circle jerking, wankers, and the rare insightful/informative comment.

Re:Ordinary Mortals (1)

shaitand (626655) | more than 2 years ago | (#38769930)

I'd settle for some choice GPU-optimized libraries. Most seem very specific and limited in scope. For instance, I can find a million and one GPU-accelerated libs to find a substring, but so far a basic PCRE lib for any language is completely elusive.

I don't want to build a bunch of GPU wheels.

Re:Ordinary Mortals (4, Interesting)

meza (414214) | more than 2 years ago | (#38770192)

I did some OpenCL in Python with PyOpenCL recently and found it very easy to get going with. You simply prepare all your data in high-level, friendly Python and then fire it off to the graphics card and wait for the result. Sure, the OpenCL part is written in a language most resembling C, but there is no reason not to use a better tool for your non-computational parts.

Re:Ordinary Mortals (0)

Anonymous Coward | more than 2 years ago | (#38771016)

If you can't handle OpenCL, then you're a fucking moron.

Re:Ordinary Mortals (1)

Eladith (1365123) | more than 2 years ago | (#38773934)

From "Building Domain Specific Embedded Languages" by Paul Hudak, 1996:

Although generality is good, we might ask what the "ideal" abstraction for a particular application is. In my opinion, it is a programming language that is designed precisely for that application: one in which a person can quickly and effectively develop a complete software system. It is not general at all; it should capture precisely the semantics of the application domain - no more and no less. In my opinion, a domain-specific language is the "ultimate abstraction".

One approach is not to program the GPU directly, but to use library-provided (domain-specific) high-level parallel primitives (map, fold, reduce, ...) to describe the computation. The library in question then compiles the final low-level code. These libraries are often implemented as domain-specific embedded languages. The topic is an area of active research, but some more or less mature implementations already exist, some of which are:

  • thrust [] provides STL-like algorithms for C++ while targeting CUDA and OpenMP as backends.
  • ArBB [] implements a parallel array programming library for C++ built on a general purpose virtual machine targeting SSE, AVX and possibly MIC in the future.
  • accelerate [] is an embedded language for array computations in Haskell and at the moment implements backends for CUDA and ArBB.

Great (4, Informative)

WilyCoder (736280) | more than 2 years ago | (#38768930)

I read this book back in August. I've been using OpenGL for almost 10 years now but knew little to nothing about OpenCL.

This book was really good. There were some typos that I found while reading it (other people had already found and reported them). If you get this book make sure you visit the author's addendum & corrections page.

I agree with the review, 9/10. If there were NO typos at all, it would be 10/10 for me.

the new hipwader hull (0)

epine (68316) | more than 2 years ago | (#38769110)

This is neither the book nor the review I would have written.

My book would have started:

WTF is the Khronos Group? Good question. It sure sounds like one of those faux "we really do talk to each other while going our own separate ways" PR initiatives of the African UNIX warlord alliance of so many bland bodies from ages ago whose names we can no longer recall.

Circling threat []

My caption: With the new hipwader hull, Joe had the whole OpenCL stack right at his fingertips.

Can someone comment on the speedup of OpenCL (0)

Anonymous Coward | more than 2 years ago | (#38770362)

As a novice programmer (mostly a hobbyist now and then for interesting problems), what does OpenCL have to offer over standard C++? From what I know about the matter, which isn't a lot, OpenCL isn't just for GPUs, but CPUs as well. Can someone comment on the speed comparison of, say, a sparse matrix multiplication algorithm, using C/C++ versus OpenCL?

While you're at it, if someone has experience learning OpenCL, how is the learning curve compared to learning a new language, say, python for a baseline comparison? I would appreciate it, and I know quite a few others who would like to know as well.

Thanks in advance,

-- Anon Cowardly Programmer.

The problem with this book... (4, Informative)

bored (40072) | more than 2 years ago | (#38770764)

Is that it's not really useful for learning OpenCL. Sure, it will teach you the syntax and how to write an OpenCL program. That isn't the problem. The problem is that if you're writing something in OpenCL, you probably want it to be fast. Learning the language is doable by someone with C experience in just a couple of hours with just the SDKs shipped by AMD/Nvidia/Intel. Learning how to optimize a routine for a particular GPU/etc. is the hard part, and is application-specific. It also requires knowledge of how compute devices actually work at an extremely low level. I don't believe this book teaches that. Save your money, download the spec and an SDK for your device, and start reading the architecture docs.

Re:The problem with this book... (1)

lloy0076 (624338) | more than 2 years ago | (#38775927)

You mean learning how to actually program, where it's the algorithms that make the difference, is difficult? I thought I could buy this book (I have) and then figure out how to break the NSA's latest encryption standards on my iPhone :(

Re:The problem with this book... (0)

Anonymous Coward | more than 2 years ago | (#38800093)

How to optimize GPU code to run fast on various kinds of code is still a research topic. (We are actively working on it in our research group.) And what makes it run fast on an Nvidia GPU is almost guaranteed not to make it run fast on an AMD GPU. You are on the bleeding edge here, folks. But just like every other bleeding edge, there are major gains to be made if you are willing to put up with a little pain. In a few years, things will have been abstracted better and there will be less pain (but for less than optimal gain).


Oh Goody (1)

ooooli (1496283) | more than 2 years ago | (#38773684)

ANOTHER damn thing called a "kernel". Cause that wasn't overloaded enough yet.

A question of base practicality... (1)

Tastecicles (1153671) | more than 2 years ago | (#38775574)

...Could I utilise this programming method to, say, encode video streams to a common format in an efficient (i.e. fully utilising available GPU/CPU cores) manner? Because right now I have a compute cluster comprising a pair of dual-core laptops, one of which has an AMD Radeon HD GPU on-die, the other an Intel chipset GPU (but that's not really important), two P4 desktop machines with Nvidia GF7 GPUs, and a Sempron box with an AMD Radeon HD on PCI Express. Altogether, that's 7 processor cores and 4 GPU dies (possibly usable, but at the moment they aren't). I regularly saturate the CPUs with the encoding I'm doing; is there an established method/library/whatever (I'm not a programmer!) for adding the GPUs to a compute cluster using, for example, a Linux CD slipstreamed with a bit of custom software, over a Gigabit LAN?

Am I being blonde, or is this already done??

looks awesome? (1)

echonyne (2545100) | more than 2 years ago | (#38789915)

:) Gee.. I was just so going to buy a book on Open CL. (i love the concept of harnessing GPU power) :) looks good to me.. will chk it out later.