
Writing Linux Kernel Functions In CUDA With KGPU

Soulskill posted more than 3 years ago | from the easy-as-pie dept.

Data Storage

An anonymous reader writes "Until today, GPGPU computing was a userspace privilege because of NVIDIA's closed-source policy and AMD's semi-open state. KGPU is a workaround to enable Linux kernel functionality written in CUDA. Instead of figuring out GPU specs via reverse-engineering, it simply uses a userspace helper to do CUDA-related work for kernelspace requesters. A demo in its current source repository is a modified eCryptfs, which is an encrypted filesystem used by Ubuntu and other distributions. With the accelerated performance of a GPU AES cipher in the Linux kernel, eCryptfs can get a 3x uncached read speedup and a nearly 4x write speedup on an Intel X25-M 80G SSD. However, both the GPU cipher-based eCryptfs and the CPU cipher-based one are changed to use ECB cipher mode for parallelism. A CTR (counter mode) cipher may be much more secure, although the real vanilla eCryptfs uses CBC mode. Anyway, GPU vendors should think about opening their drivers and computing libraries, or at least providing a mechanism to make it easy to do GPU computing inside an OS kernel, given how widely GPUs are deployed and the potential future of heterogeneous operating systems."


101 comments


AES-NI (1)

RightSaidFred99 (874576) | more than 3 years ago | (#36051268)

Wonder how this compares in performance to AES-NI [wikipedia.org], because it sure as hell sounds a lot more complex and fragile.

Re:AES-NI (1)

adisakp (705706) | more than 3 years ago | (#36051322)

It might be more "complicated" but it's probably more useful, since currently a lot more systems have GPUs than AES-NI, given that AES-NI is only on a subset of Intel's most recent CPUs.

Re:AES-NI (2)

OeLeWaPpErKe (412765) | more than 3 years ago | (#36052334)

The problem is that the easily parallelized cipher modes are not as secure as the chained ones. Let me show you the difference between CBC, ECB and CTR (block(i) means the i'th block of data):

1) CBC
  CBC(pwd, block(i)) = encrypt(pwd, block(i) xor CBC(pwd, block(i-1)))
* CBC(pwd, block(-1)) = initializer = hash(pwd, 0) (sometimes half the password is used as the initializer)

2) ECB
  ECB(block(i)) = encrypt(block(i))

3) CTR
  CTR(block(i)) = encrypt(block(i)) xor i

I hope it's obvious why ECB and CTR are the only candidates for parallelization: CBC can only be done in sequence. But there's a huge issue. Ciphers have weak spots, and there are rainbow tables. So let's suppose you have an encrypted file in ECB mode.

encrypt(block(1)) : encrypt(block(2)) : ... : encrypt(block(n)) * bingo, rainbow table hit! (i.e. somehow you're able to decrypt block(3))

Now you have a matching combination of block(n), encrypt(block(n)) and the password. Well, you've broken the encryption. The problem is that the contents of blocks are quite predictable (e.g. you will pretty much know every bit in an ext3 superblock if you know the size of the volume, so you can generate targeted rainbow tables). The only thing you need to find is the password.

Suppose the same happens in CBC mode

encrypt(block(1) xor initializer) : encrypt(block(2) xor encrypt(block(1) xor initializer)) : encrypt(block(3) xor encrypt(block(2) xor encrypt(block(1) xor initializer))) ...

Now block(1) is still perfectly predictable; block(1) xor initializer, however, is not. You have to generate 2^(passwordlength + blocksize)/2 rainbow tables before you'd get a single hit. Also, just because you get one hit doesn't mean it's the correct one (in ECB you know it's the correct one because the plaintext is meaningful: "Bob, I secretly loved your brother last night" is easily recognized as plaintext, while that same string xorred with a pseudorandom value doesn't make sense to anyone). That means you now have to find both the password and the plaintext. Generally, with a 256-bit password and 4 kB blocksize, that means you effectively have a "password" that's 4.5 kB. This makes CBC orders of magnitude harder to crack.

It should be said that attacks on ECB or CTR, while a LOT easier, are still only theoretical for recent algorithms (e.g. AES). However, CBC remains secure much longer than ECB with the same underlying encryption algorithm. CBC 3DES encryption, for example, is still considered safe (and it is very doubtful that even the NSA or CIA has the resources for CBC DES).
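For concreteness, here is a minimal Python sketch of the two modes above. The 16-byte "cipher" is a deliberately toy, insecure stand-in (toy_encrypt is a made-up helper); the point is the chaining structure, not the cipher:

    def toy_encrypt(key, block):
        # Toy keyed byte-wise permutation. NOT a real cipher; it only stands in
        # so the mode logic below is runnable.
        return bytes((b + key[i % 16]) % 256 for i, b in enumerate(block))

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def ecb_encrypt(key, blocks):
        # Every block is independent, so all of them could be encrypted at once;
        # this is exactly the shape of work that maps well onto a GPU.
        return [toy_encrypt(key, blk) for blk in blocks]

    def cbc_encrypt(key, iv, blocks):
        # Each block needs the previous *ciphertext* block first, so encryption
        # is inherently sequential.
        out, prev = [], iv
        for blk in blocks:
            prev = toy_encrypt(key, xor(blk, prev))
            out.append(prev)
        return out

    key, iv = bytes(range(16)), bytes(16)
    blocks = [b"A" * 16] * 4                 # four identical plaintext blocks
    print(ecb_encrypt(key, blocks))          # four identical ciphertext blocks
    print(cbc_encrypt(key, iv, blocks))      # four different ciphertext blocks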

So, in short, NVIDIA cheated.

Re:AES-NI (1)

slew (2918) | more than 3 years ago | (#36052614)

Just a couple of small nits to pick...

Although CBC encryption needs to be done in sequence, CBC decryption can be done mostly in parallel (you don't have to wait for the AES part of the previous block)...

Also, security is better than other modes only in some cases. As a trivial example, in CBC it's easier to tamper with the plaintext: all you have to do to flip a bit in the plaintext of a CBC-encrypted stream is to flip that same bit in the previous block's ciphertext. Although that garbles the previous block's decrypted plaintext, it makes it possible to arbitrarily manipulate some things (of course, if that is in your threat model, you should really be using a MAC, but that is another discussion)...
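For the tampering point, a quick sketch with the same kind of toy stand-in cipher (toy_encrypt/toy_decrypt are made-up helpers; a real cipher would scramble the sacrificed block far more thoroughly, but the mode-level behaviour is the same):

    def toy_encrypt(key, block):             # toy, invertible, NOT a real cipher
        return bytes((b + key[i % 16]) % 256 for i, b in enumerate(block))

    def toy_decrypt(key, block):
        return bytes((b - key[i % 16]) % 256 for i, b in enumerate(block))

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    key, iv = bytes(range(16)), bytes(16)
    pt = [b"transfer $000100", b"signed, the bank"]

    # CBC encryption: sequential by nature.
    ct, prev = [], iv
    for blk in pt:
        prev = toy_encrypt(key, xor(blk, prev))
        ct.append(prev)

    def cbc_decrypt(ct_blocks):
        # Each block only needs the already-known previous ciphertext block,
        # so this comprehension could run fully in parallel.
        prevs = [iv] + ct_blocks[:-1]
        return [xor(toy_decrypt(key, c), p) for c, p in zip(ct_blocks, prevs)]

    # Flip bits in ciphertext block 0 to flip the same bits in plaintext block 1
    # (block 0 itself decrypts to junk as a side effect).
    delta = xor(b"signed, the bank", b"signed, a forger")
    forged = [xor(ct[0], delta), ct[1]]
    print(cbc_decrypt(ct))        # original plaintext
    print(cbc_decrypt(forged))    # block 1 now reads b"signed, a forger"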

So, in short, it depends... ;^)

Re:AES-NI (2)

draconx (1643235) | more than 3 years ago | (#36052750)

3) CTR
    CTR(block(i)) = encrypt(block(i)) xor i

Sorry, but what you describe is not CTR mode. Using your notation, CTR would look (roughly) like this:

    CTR(block(i)) = encrypt(counter) xor block(i)

where "counter" is usually constructed by concatenating a nonce value with i
(the block number). It is critical that the resulting counter never be re-used
with the same key for a different block.
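A runnable sketch of that construction, again with a toy stand-in for the block cipher (toy_encrypt is a made-up helper, and the 8-byte nonce plus 8-byte counter packing is just one illustrative choice):

    def toy_encrypt(key, block):             # stand-in for AES; NOT secure
        return bytes((b + key[i % 16]) % 256 for i, b in enumerate(block))

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def ctr_crypt(key, nonce, blocks):
        # Keystream block i = encrypt(nonce || i): each block is independent, so
        # the loop parallelizes, and the same function encrypts and decrypts.
        out = []
        for i, blk in enumerate(blocks):
            counter = nonce + i.to_bytes(8, "big")
            out.append(xor(toy_encrypt(key, counter), blk))
        return out

    key, nonce = bytes(range(16)), bytes(8)
    pt = [b"sixteen byte blk", b"another 16 bytes"]
    ct = ctr_crypt(key, nonce, pt)
    assert ctr_crypt(key, nonce, ct) == pt   # decryption is the same operation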

Re:AES-NI (2)

kasperd (592156) | more than 3 years ago | (#36052882)

CTR(block(i)) = encrypt(block(i)) xor i

That's not how CTR works. Rather it works like

CTR(block(i)) = encrypt(IV || i) xor block(i)

However since most storage encryptions cheat and use an IV that is the same every time you write to the same logical sector, the CTR mode will actually turn into a pseudorandom one-time-pad. This means if you ever write to the same logical sector number twice, you are potentially leaking data. In the case of ecryptfs it is probably only a problem if you overwrite sectors in an existing file as the design of ecryptfs would make it easy to use a new IV per file, but not per sector.
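The keystream-reuse problem is easy to see without any real cipher at all; in the sketch below os.urandom just stands in for encrypt(IV || i) with a fixed, repeated IV:

    import os

    keystream = os.urandom(16)     # same keystream every time the sector is rewritten
    p1, p2 = b"old sector data.", b"new sector data."
    c1 = bytes(a ^ b for a, b in zip(keystream, p1))
    c2 = bytes(a ^ b for a, b in zip(keystream, p2))
    leaked = bytes(a ^ b for a, b in zip(c1, c2))
    assert leaked == bytes(a ^ b for a, b in zip(p1, p2))   # keystream cancels out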

If you want an encryption that is highly parallelizable and doesn't lose a lot of security when you cut corners and use a fixed IV, I think LRW is your best bet. (I don't like the name LRW as I find it an offence against the inventors of tweakable block ciphers, but I am not aware of any other name for that mode, and I don't even know who invented it.)

Re:AES-NI (2)

doublebackslash (702979) | more than 3 years ago | (#36053952)

I'm curious, would CTR be less vulnerable if one XORed before encryption? Call the operation CXR.
Where ^ is the XOR operator
CXR(block(i)) = encrypt(IV ^ i ^ block(i))

I'm not sure if there is analysis that can be done on the block at that point that makes this undesirable. Methinks not, because as far as I know having a well-known IV in, say, CBC is not a vulnerability. That implies to me that the security still rests firmly in the key. At the very least it stops being vulnerable to bitwise changes and reinstates the confusion and diffusion principles.

There might also be some magic in reading the whole block (since we are talking about block level devices) and having, say, a CBC over the block with an IV calculated with encrypt(IV ^ i) but I think that goes out of scope of my question.

Re:AES-NI (2)

kasperd (592156) | more than 3 years ago | (#36055084)

CXR(block(i)) = encrypt(IV ^ i ^ block(i))

This is about as secure as ECB, but that's still better than what you get from incorrect use of CTR that degenerates into multiple use of a one-time pad. What you want is a tweakable block cipher. Just encrypt each block using i as the tweak. That is how LRW mode works, with a specific construction for the tweakable block cipher.

One of the constructions for a tweakable block cipher is encrypt(t ^ encrypt(plaintext)); a more efficient construction (which requires a larger key) is (t*k2)^encrypt((t*k2)^plaintext). In this construction * is multiplication in a finite field. The * is a bit expensive, but still cheaper than the cipher itself, and it can be optimized if you are doing multiple operations where the different values of t are related.
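A rough sketch of the (t*k2) construction, assuming the commonly used reduction polynomial x^128 + x^7 + x^2 + x + 1 and a plain integer bit ordering; the block cipher is again a toy stand-in, so treat this as the shape of the idea rather than a vetted implementation:

    def gf128_mul(a, b):
        # Carry-less multiply of two 128-bit values, reduced mod x^128+x^7+x^2+x+1.
        p = 0
        for i in range(128):
            if (b >> i) & 1:
                p ^= a << i
        for i in range(254, 127, -1):        # fold the high bits back down
            if (p >> i) & 1:
                p ^= ((1 << 128) | 0x87) << (i - 128)
        return p

    def toy_encrypt(key, block):             # stand-in for AES; NOT secure
        return bytes((b + key[i % 16]) % 256 for i, b in enumerate(block))

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def lrw_encrypt_block(key, k2, tweak, block):
        pad = gf128_mul(k2, tweak).to_bytes(16, "big")   # t*k2, cheap to update for related tweaks
        return xor(toy_encrypt(key, xor(block, pad)), pad)

    key, k2 = bytes(range(16)), 0x0123456789abcdef0123456789abcdef
    # Same plaintext, different block index as tweak, different ciphertext:
    print(lrw_encrypt_block(key, k2, 1, b"A" * 16))
    print(lrw_encrypt_block(key, k2, 2, b"A" * 16))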

You should take a look at the paper that introduced tweakable block ciphers. It explains the constructions much better than I could.

as far as I know having a well known IV in, say, CBC is not a vulnerability.

It is, but only a minor weakness. With early disk encryption schemes that simply used the sector number as the IV, it was possible to construct a file that, when written to that filesystem, would produce an easily recognizable pattern in the encrypted data. I have an example of such a file here: http://kasperd.net/ivtest.txt [kasperd.net]

There might also be some magic in reading the whole block (since we are talking about block level devices) and having, say, a CBC over the block with an IV calculated with encrypt(IV ^ i) but I think that goes out of scope of my question.

The best way I know to produce an IV is to do a calculation over the plaintext of the entire sector except the first block. You could, say, hash the complete sector with the first block replaced by the sector number, and then encrypt the hash value. The advantage of such a construction is that any change anywhere in the sector will affect every block of the encrypted sector.

Re:AES-NI (0)

Anonymous Coward | more than 3 years ago | (#36051324)

In AES-NI Performance Analyzed, Patrick Schmid and Achim Roos found, "... impressive results from a handful of applications already optimized to take advantage of Intel's AES-NI capability".[6] A performance analysis using the Crypto++ security library showed an increase in throughput from approximately 28.0 cycles per byte to 3.5 cycles per byte with AES/GCM versus a Pentium 4 with no acceleration.[7] [8]

Looks like an 8x speedup with AES-NI, versus a 3-4x speedup using KGPU.
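For a rough sense of what those cycles-per-byte figures mean as throughput, assuming a 3 GHz core purely for illustration (the two measurements above come from very different CPUs, so treat this as order-of-magnitude only):

    clock_hz = 3e9
    for label, cpb in (("~28.0 cycles/byte (no AES-NI)", 28.0),
                       ("~3.5 cycles/byte (AES-NI)", 3.5)):
        print(f"{label}: ~{clock_hz / cpb / 1e6:.0f} MB/s")
    print("speedup factor:", 28.0 / 3.5)     # 8.0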

Re:AES-NI (2)

gman003 (1693318) | more than 3 years ago | (#36051384)

Yes, but that was comparing a Pentium 4 (the last one came out in 2006) to a brand-new processor (2011). That is NOT scientifically sound - they are completely different designs, which will produce vastly different runtimes for the exact same instructions. How about comparing Crypto++ running on a 2500K against Crypto++ running on a 2500K compiled without AES-NI support? That would be infinitely more rigorous.

Re:AES-NI (1)

JonySuede (1908576) | more than 3 years ago | (#36051928)

They were comparing cycles per byte, not total run time, so the difference between CPU generations is less important. But the rest of your argument is still quite valid.

Re:AES-NI (2)

gman003 (1693318) | more than 3 years ago | (#36052074)

No, cycles per byte is EXTREMELY important. Even contemporary processors can execute instructions in highly different amounts of time - a K5 can perform some instructions in 80% of the time of an identically-clocked Pentium. And when you compare architectures as wildly different as Sandy Bridge and NetBurst, all bets are off. You might as well throw an 8086 and a SPARC into the mix, because that'll be about as rigorous.

Re:AES-NI (2, Informative)

Anonymous Coward | more than 3 years ago | (#36051416)

KGPU uses AES just as a demonstration; its architecture is general to any GPU-friendly algorithm.

Re:AES-NI (1)

DarkOx (621550) | more than 3 years ago | (#36051650)

Well I am sure it compares very favorably if you have an old CPU or a CPU of a different architecture which does not feature those instructions.

Re:AES-NI (2)

wagnerrp (1305589) | more than 3 years ago | (#36051884)

It's for an entirely different application. AES-NI is one application-specific set of instructions. While encryption and decryption are applications in which dedicated hardware can deliver tremendous gains, introducing dozens of application-specific hardware modules into a CPU is going to hit diminishing returns and just result in an oversized, expensive, and power-hungry CPU. It's an inherently limiting design methodology. Introducing GPU access to the kernel opens up a very powerful piece of hardware to a wide range of applications, enhancing any process that is suitable for the architecture found on a GPU.

Think of GPUs like picking up a new math co-processor 20 years ago.

Re:AES-NI (1)

makomk (752139) | more than 3 years ago | (#36053408)

introducing dozens of application specific hardware modules into a CPU is going to fall to diminishing returns, and just result in an oversized, expensive, and power hungry CPU

More oversized, expensive and power-hungry than the GeForce GTX 480 they used for this benchmark? It's right at the limits of manufacturability in terms of chip size, costs hundreds of dollars, and has a 300W power consumption at load. You'd need an awful lot of application-specific hardware modules before you even got close to that.

Did anyone else's brain switch off halfway.... (0)

Anonymous Coward | more than 3 years ago | (#36051296)

..... through the summary??? Sorry, but I had to read it 3 times for it to sink in.... Sorry... but, as a geek myself, I find this just far too geeky! ... Sorry. (hands back geek card!)

Re:Did a anyone else's brain switch off half way.. (4, Informative)

h4rr4r (612664) | more than 3 years ago | (#36051346)

GTFO!
This is what should be on slashdot, not stories about the latest iphone.

Re:Did a anyone else's brain switch off half way.. (1)

kvvbassboy (2010962) | more than 3 years ago | (#36052690)

Sure.. but the summary is still badly written. Read the TFA, and that makes a lot more sense for us illiterate folks.

Re:Did a anyone else's brain switch off half way.. (1)

thePowerOfGrayskull (905905) | more than 3 years ago | (#36051496)

It wasn't too geeky, but it was written as if by someone with ADD. Perhaps no surprise?

Re:Did a anyone else's brain switch off half way.. (1)

MarkRose (820682) | more than 3 years ago | (#36051988)

Completely off-topic, but I've been looking for a decent ssh client for my crapberry -- thanks!

Re:Did a anyone else's brain switch off half way.. (1)

thePowerOfGrayskull (905905) | more than 3 years ago | (#36073166)

Excellent, glad it helps! Look for some updates coming in the fairly near future...

Re:Did a anyone else's brain switch off half way.. (0)

Anonymous Coward | more than 3 years ago | (#36051512)

Poorly written maybe, but not that geeky. If you were that confused, I'm not certain you ever had a geek card to hand back.

Best possible example (2, Interesting)

Anonymous Coward | more than 3 years ago | (#36051298)

        Hand off encryption routines to a closed source black box. Brilliant.

Re:Best possible example (2)

icebraining (1313345) | more than 3 years ago | (#36051392)

Yes, because the CPU isn't a closed-source black box; we're all running open hardware /s

Re:Best possible example (4, Insightful)

Jaqenn (996058) | more than 3 years ago | (#36051400)

As opposed to having them done by my Intel CPU, for which Intel has helpfully provided full schematics.

Re:Best possible example (1)

Anonymous Coward | more than 3 years ago | (#36051592)

Good point.

In fact, Intel CPUs are worse in this regard, as they contain special AES instructions. GPUs, as far as I know, don't do this yet, so you'll now have a higher level of confidence that the correct code is indeed running.

Re:Best possible example (1)

Lunix Nutcase (1092239) | more than 3 years ago | (#36052054)

Yes, and those AES instructions are well documented [scribd.com] .

Re:Best possible example (1)

Noughmad (1044096) | more than 3 years ago | (#36055282)

How can you be sure that what's going on on the processor is the same thing as what's described in the documentation?

Question: (3, Interesting)

Jaqenn (996058) | more than 3 years ago | (#36051338)

(I have never written kernel level code, and the statement that follows is only from listening to what other people are doing)

I thought that a tiny bit of kernel code reflecting calls into a user level process was old news, and has become established as the preferred development model. Is there a reason that it's undesirable?

Because the summary makes it sound like we're sad to be following this model, and we're only doing it because we can't pull NVidia's driver source into the linux kernel.

Re:Question: (2)

sockman (133264) | more than 3 years ago | (#36051402)

The NVIDIA extensions are only available in userland.
So a call to the kernel level crypto system gets routed back out to user land, and back to kernel land via the GPU module. That's why we're sad.

Re:Question: (1)

sjames (1099) | more than 3 years ago | (#36055240)

What I would like to know is: since they're already taking the hit for downcalls into userspace, why not use FUSE instead and let the userland filesystem daemon use the GPU? Why produce yet another mechanism to protect the kernel from the weirdness that can happen when it depends on userspace, rather than the other way around?

Re:Question: (2)

killmenow (184444) | more than 3 years ago | (#36051450)

I've never written kernel modules either, so take this with a grain of salt: my understanding is that there is a cost associated with switching back and forth between userspace and kernelspace, and it's best to minimize it. I remember similar discussions going back as far as NT4, when Microsoft decided to implement the entire GDI in kernelspace. That led to a billion BSODs, because video drivers are notoriously shitty code and you'd be way better off stability-wise having that code run in userspace. Performance-wise, not so much.

The interesting thing about encryption code working this way is there is such a tremendous speedup by running the bulk of the encryption code on the GPU as opposed to the CPU that the cost incurred in the user/kernel switch is well worth it.
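One crude way to put a number on that crossing cost is to bounce a byte between two processes through pipes. This measures syscalls plus scheduler round trips rather than KGPU's exact path, and the result varies a lot between machines, but it gives the order of magnitude:

    import os, time

    ITERS = 10000
    p2c_r, p2c_w = os.pipe()      # parent -> child
    c2p_r, c2p_w = os.pipe()      # child -> parent

    pid = os.fork()
    if pid == 0:                  # child: echo every byte straight back
        for _ in range(ITERS):
            os.write(c2p_w, os.read(p2c_r, 1))
        os._exit(0)

    start = time.perf_counter()
    for _ in range(ITERS):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    print("~%.1f us per round trip (4 syscalls + 2 context switches each)"
          % (elapsed / ITERS * 1e6))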

Re:Question: (1)

blair1q (305137) | more than 3 years ago | (#36051524)

Context-switching is always expensive, but avoiding it without regard to the actual benefit leads to system bloat, so learning where it is and isn't significant is a good skill to have.

The speedup from GPU hardware is so big that it's worth giving up a few hundred cycles of context switching to get a few thousand cycles of reduction in computing.

But (not having read TFA yet) I wonder just how much kernel functionality is really that parallelizable. When does the context switching cost you more than the CUDA gains you? Crypto stuff relying on gigundous keys would be a no-brainer, but where else could it be economical?

Re:Question: (1)

Jah-Wren Ryel (80510) | more than 3 years ago | (#36051776)

Crypto stuff relying on gigundous keys would be a no-brainer, but where else could it be economical?

Maybe RAID computations. Block-level data deduplication is starting to catch on, and that needs to hash every block written to disk. I bet that could benefit from a GPU, but the userland overhead may be enough to kill the practicality, at least for anything but long streaming writes.

Re:Question: (1)

Hatta (162192) | more than 3 years ago | (#36051492)

There is overhead in a context switch from kernel space to user space.

Re:Question: (2)

afidel (530433) | more than 3 years ago | (#36051506)

The reason it's undesirable is the hit you take when moving back and forth between kernel space and user space. The move in each direction requires the CPU to change ring levels, which increases latency.

Re:Question: (1)

Anonymous Coward | more than 3 years ago | (#36051572)

Many developers feel that Nvidia's userspace driver workaround, only done to avoid licensing issues, shouldn't be permitted at all. This would be seen as validating Nvidia's actions.

It's also a giant architectural hack so that won't help matters.

Re:Question: (1)

Anonymous Coward | more than 3 years ago | (#36051678)

a tiny bit of kernel code reflecting calls into a user level process

You mean generally? This could be said of microkernels, but the Linux kernel is monolithic; drivers for devices typically live entirely inside the kernel.

That being said, I don't think it's necessarily desirable to pull every conceivable hardware interaction into the kernel. There is an endless variety of hardware and APIs. Why must all of this churn live in the kernel? The kernel<-->user-space bridge that was built to make the GPU vendor's user-space API accessible to the kernel isolates the kernel from the frequent driver updates published by the vendor. The vendor can distribute all the drivers it wants, and create new and vastly different hardware, and the bridge doesn't have to change as long as the user-space API survives.

Note: the above isn't a 'rights' argument; it applies whether or not the hardware and/or drivers are 'open.'

If you've run 'menuconfig' et al. recently and waded through the thousands of devices with their subtle dependencies and relationships, it might occur to you that this may not scale forever. Relegating some of the less ubiquitous stuff to user space through a robust and common interface could be a good idea.

The world is messy. There will always be stuff that can't be sanitized by the kernel gnomes. Making these cases work smoothly contributes to world domination.

Re:Question: (1)

emanem (1356033) | more than 3 years ago | (#36051792)

I've written compute kernel code in both OpenGL (old-school GPGPU) and OpenCL.
The main issue might be context switching? Or writing GPU binary code without having to compile via the driver (i.e. a la a math accelerator FPU?)
Cheers!

Re:Question: (4, Interesting)

PoochieReds (4973) | more than 3 years ago | (#36051836)

There are also other concerns than the context switch overhead...particularly when dealing with filesystems or data storage devices.

For instance, suppose part of your userspace daemon gets swapped out, and you now need to upcall to userspace. That part that got paged out then has to be paged back in. If memory is tight, then the kernel may have to free some memory, and it may decide to flush out dirty data to the filesystem or device that is dependent on the userspace daemon. At that point, you're effectively deadlocked.

Most of those sorts of problems can be overcome with careful coding and making sure the important parts of the daemon are mlocked, but you do have to be careful and it's not always straightforward to do that.
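A minimal sketch of that mlock step for a userspace helper, assuming Linux and the usual MCL_CURRENT/MCL_FUTURE values (the Python stdlib has no wrapper, so this goes through ctypes; it also needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK):

    import ctypes, os

    MCL_CURRENT, MCL_FUTURE = 1, 2           # assumed Linux values
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # From here on the daemon's pages stay resident and cannot be paged out.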

F*ck Nvidia AND AMD (1)

Anonymous Coward | more than 3 years ago | (#36051352)

Until they open-source drivers, I refuse to buy them. Stuff like this is typically a nightmare to install and keep running anyway.

Re:F*ck Nvidia AND AMD (1)

blair1q (305137) | more than 3 years ago | (#36051536)

Just what are you using for graphics hardware, then? Intel's integrated core?

Re:F*ck Nvidia AND AMD (0)

Anonymous Coward | more than 3 years ago | (#36051560)

Serial Terminal.

Re:F*ck Nvidia AND AMD (1)

Noughmad (1044096) | more than 3 years ago | (#36055292)

Uphill, both ways?

Re:F*ck Nvidia AND AMD (0)

Anonymous Coward | more than 3 years ago | (#36051578)

Yeah, right. Good luck running Battlefield 3 on that crap.

Re:F*ck Nvidia AND AMD (1)

jd (1658) | more than 3 years ago | (#36051626)

The Hercules graphics card. :)

Re:F*ck Nvidia AND AMD (1)

TeknoHog (164938) | more than 3 years ago | (#36051886)

I only used open source graphics drivers, including Intel's integrated, until about 6 months ago when I needed to run some OpenCL code on a Radeon. There is nothing wrong with Intel graphics and the opensource Radeon drivers, unless you are a gamer or need serious GPGPU power. Both are capable of plenty of 3D, for example molecular modelling in my case.

I am posting this on a Powerbook running Linux, and for some strange reason AMD does not release binary drivers for PPC Linux ;) but the opensource Radeon driver is good enough. There are some artefacts in 3D but no serious problems.

Re:F*ck Nvidia AND AMD (1)

serviscope_minor (664417) | more than 3 years ago | (#36052582)

Just what are you using for graphics hardware, then? Intel's integrated core?

Yes, why? I don't play 3D games, so it's fine and stable.

Re:F*ck Nvidia AND AMD (1)

gerddie (173963) | more than 3 years ago | (#36052268)

You might want to rethink your opinion on AMD, they are getting there: http://www.x.org/wiki/RadeonFeature [x.org]

Re:F*ck Nvidia AND AMD (1)

dbIII (701233) | more than 3 years ago | (#36053624)

It looks like the old SGI guys at Nvidia know that as soon as source is released they are going to get jumped on by patent trolls and have to spend a lot of time and money on pointless court cases that can do nothing of value to anyone apart from shifting money into patent troll pockets. They've been bitten once before and the closed drivers are the result.

Wow (2)

killmenow (184444) | more than 3 years ago | (#36051396)

I came to read a discussion of writing kernel functions in CUDA and a discussion about the vagaries of encryption methods broke out.

Re:Wow (1)

Obfuscant (592200) | more than 3 years ago | (#36051570)

I came to read a discussion of writing kernel functions in CUDA and a discussion about the vagaries of encryption methods broke out.

Be careful what you say, next we'll have a hockey game break out.

I'm sorry, but am I the only one here who thinks this is, well, not a good way to go? Even if the code could be kernel-space code on the GPU? I mean, if I buy a CUDA GPU, I'm doing it because I have serious computing I want to do on it, not because I want my file system reads to be faster. I'd be rather miffed if I spent the time writing my CUDA code to speed things up and then found out it wasn't speeding things up because the GPU was already busy doing encryption on the filesystem.

This seems a lot like the problem that cheap HP printers cause. You buy a $50 printer, but the software "driver" consumes the computer you've hooked it to, to the point where you should just consider the computer as part of the printer and get a new one to do computing on.

Re:Wow (1)

wild_berry (448019) | more than 3 years ago | (#36052264)

They're racking their brains as to what to do next.

I would aim for kernel threads running directly through CUDA and the Scheduler knowing the performance profile of suitable work for the GPU and the message-passing cost of moving work to the GPU^H^H^H parallelism co-processor. Make the interface right and you should be able to shift tasks across heterogeneous processing units. Do it perfectly and you can have a Linux Virtual Processor model which allows you to start running a task on your desktop, shuffle it to a laptop for transit, pare it down to use on your mobile phone, buy some CPU time from an internet cluster to grind through some calculations before transferring it home. Choose x86: there's already enough x86 junk in other trees, and it might fix up the ARM shenanigans too!

Re:Wow (1)

tibit (1762298) | more than 3 years ago | (#36052510)

You seem to be seriously overstating the impact of host-based printing. Obviously when you're not printing (and that's probably most of the time!), there's no overhead. And when you are printing, then the rasterizer consumes a little bit of memory and plenty of CPU, but that's transient. I would never venture as far as calling it "consuming" the computer.

I haven't personally felt it to be a problem, and I'm using a host-based printer (HP LJ P1006). It spits out about 17 pages per minute, not too shabby if you ask me. Having a CPU capable of rasterizing that fast in the printer itself would probably double its cost, so I'm not complaining. The Core2 Duo in the iMac is already paid for ;)

Heck, I've used plenty of PCL-only LaserJets, and they were -- for all practical purposes -- host-rendering printers. Some of them could render scalable fonts, but that only helped if you were printing text. As soon as there were graphics involved, or output from professional typesetting/design packages, the printer was receiving a huge monochrome bitmap wrapped in a couple PCL instructions. In all recent-enough cases, though, the host CPU was much faster than the one on the printer, so it actually helped with throughput if the PC would do the rasterization.

Re:Wow (1)

Obfuscant (592200) | more than 3 years ago | (#36053406)

You seem to be seriously overstating the impact of host-based printing.

Uhhh, no. I was there. Firsthand experience.

Obviously when you're not printing (and that's probably most of the time!), there's no overhead.

Other than the half a dozen monitor demons that tell you when there are updates for the drivers, when the printer is out of paper, when the printer is out of ink, when the printer is low on ink and would you like to buy official HP products now?, and whatever other things they had demons doing.

then the rasterizer consumes a little bit of memory and plenty of CPU,

The last 200Mb of disk is "a little bit"?

I haven't personally felt it to be a problem,

And thus it cannot have been a problem for me. Thanks.

Having a CPU capable of rasterizing that fast in the printer itself would probably double its cost,

Wow. A whole $100 for a printer. So the printer could actually be a printer and not just the printhead and servos and the rest of the printer installed as drivers on your main CPU.

Re:Wow (1)

tibit (1762298) | more than 3 years ago | (#36056300)

The monitors and stuff are not a problem inherent in host-based printing. Not at all. For reasons better left to be explained by marketing types, HP's Windows printing support for home printer product line sucks donkey balls. Their support on Linux and Mac doesn't come with any of the overhead.

So what you're complaining about is not host-based printing per se, but broken drivers peddled by HP and others, bundled with bloatware. There's no inherent technical reason for it to be that way. And the problem is not that the printer isn't performant enough; the problem is bloatware. What you feel as the problem is not the rasterizer, it's everything else. Note that the same bloatware, unfortunately, comes with printers that have a built-in rasterizer, too.

As for 200MB of disk: you are not complaining about what the rasterizer is doing, merely about bloatware that came with the printer. A monochrome letter-sized page at 600dpi takes 4.2 Mbytes. At that resolution, you could stuff uncompressed bitmaps for about 45 monochrome pages in 200MB, or 12 pages worth of CMYK bitmaps.

Re:Wow (1)

sjames (1099) | more than 3 years ago | (#36057298)

To be fair, most of that crap isn't actually the printer driver, it's the HP marketing trojan combined with REALLY bad design. A sane driver would only check paper and ink just before, during and just after a print job.

AES speed (1)

afidel (530433) | more than 3 years ago | (#36051446)

I wonder if this would be any faster than an implementation that took advantage of the hardware AES on the newer Intel CPU's? Latency should be lower for the CPU based version as would memory bandwidth.

Re:AES speed (0)

Anonymous Coward | more than 3 years ago | (#36052104)

Yes, it would be faster with AES-NI. But APUs may make a difference, because the GPU and CPU can then even share the L3 cache.
Another issue is: what if we use other algorithms instead of AES? Designing specialized instructions for every algorithm doesn't make sense.

Re:AES speed (1)

afidel (530433) | more than 3 years ago | (#36052144)

I'm really not sure why you would use anything other than AES at this point, and the AES-NI instructions also foil most side-channel attacks.

GPU (0)

MM-tng (585125) | more than 3 years ago | (#36051550)

The hardware that is so brilliantly made nobody is allowed to know how it works. And as a result it actually doesn't work. I say congratulations.

All in good time (2)

deadline (14171) | more than 3 years ago | (#36051568)

Proofs of concept are nice, but when the GPU is firmly planted in the CPU, this will make more sense. The PCI bus can be a bottleneck in these types of situations. AMD Fusion is a great example of this idea.

Re:All in good time (1)

cnettel (836611) | more than 3 years ago | (#36052082)

If you are indeed reading from something like an SSD, the data bandwidth shouldn't be a problem. The data pipe to any recent GPU is much wider than SATA, and quite favorable latency-wise as well. Of course, you are adding another layer of latency and transfers, but the situation is quite different from a case where you are offloading some computation whose data could otherwise stay in the CPU cache all the time.

Recipe for a corrupted filesystem (1)

drewm1980 (902779) | more than 3 years ago | (#36051660)

Wow, the fragility of an encrypted file system plus the instability of a GPU, implemented in the kernel. Do not even read TFA without doing a full backup of your system.

Re:Recipe for a corrupted filesystem (2)

calmofthestorm (1344385) | more than 3 years ago | (#36051920)

fragility of an encrypted file system [citation needed].

I've been using them since 2006. Never had any problems.

Cool test... (1)

Panaflex (13191) | more than 3 years ago | (#36051802)

As someone who's doing a lot of the same work, this is pretty spectacular! I'm surprised they get > 100MB/sec in software - but I guess that's due to using ECB mode vs. CBC. I think the real I/O limit here is probably in the user/kernel mem copies - context switch weight can be optimized with good buffer alignments.

We did a lot of testing with CUDA under openssl 3-4 years ago - in the end it was better to just stick with software. The latencies are the real killers.

Re:Cool test... (0)

Anonymous Coward | more than 3 years ago | (#36052030)

You may wanna take a look at SSLShader (http://shader.kaist.edu/sslshader/) and rethink your 3~4-year-old work again.

Although it is still 'will be available later'

Re:Cool test... (2)

Panaflex (13191) | more than 3 years ago | (#36052162)

That's a pretty cool project! But I do think they still suffer the same latency problems - in order to take advantage of the GPU's full throughput - they have to have a huge number of client connections (chosen solution) or a very deep queue (hard to optimize, only works with larger file sizes).

Certainly this is a great solution for what it is - but it's not a general purpose solution. And you can get a much more reliable and supported solution out there. (e.g. BIG-IP SSL Accelerator, which uses certified FIPS 140-2 hardware.)

Protection (1)

Adrian Lopez (2615) | more than 3 years ago | (#36052010)

Is it a good idea for the protected kernel to rely on unprotected code for critical functions such as filesystem operations? I know that user-space code cannot directly interfere with the kernel, but it also doesn't have to do anything the kernel requests of it. Unless the kernel is designed to treat such user-space code as altogether untrustworthy, it seems to me a bad idea for the kernel to rely on user-space code in this manner.

ECB Mode is totally insecure (3, Interesting)

jasonwc (939262) | more than 3 years ago | (#36052172)

I hope this is just a proof-of-concept design because ECB mode should not be used for this purpose. Wikipedia provides a pretty obvious example of the weakness of ECB mode:

"The disadvantage of this method is that identical plaintext blocks are encrypted into identical ciphertext blocks; thus, it does not hide data patterns well. In some senses, it doesn't provide serious message confidentiality, and it is not recommended for use in cryptographic protocols at all. A striking example of the degree to which ECB can leave plaintext data patterns in the ciphertext is shown below; a pixel-map version of the image on the left was encrypted with ECB mode to create the center image, versus a non-ECB mode for the right image."

http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Initialization_vector_.28IV.29 [wikipedia.org]
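The property is easy to reproduce. A short sketch, assuming a recent version of the third-party 'cryptography' package is installed:

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(16)
    plaintext = b"ATTACK AT DAWN!!" * 64      # one 16-byte block repeated 64 times

    enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    ecb_ct = enc.update(plaintext) + enc.finalize()
    ecb_blocks = {ecb_ct[i:i + 16] for i in range(0, len(ecb_ct), 16)}
    print("distinct ciphertext blocks under ECB:", len(ecb_blocks))   # 1

    iv = os.urandom(16)
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    cbc_ct = enc.update(plaintext) + enc.finalize()
    cbc_blocks = {cbc_ct[i:i + 16] for i in range(0, len(cbc_ct), 16)}
    print("distinct ciphertext blocks under CBC:", len(cbc_blocks))   # 64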

Re:ECB Mode is totally insecure (1)

drinkypoo (153816) | more than 3 years ago | (#36052326)

I hope so too, because I was excited by the idea of using my CUDA-capable GPU to do encryption, which might actually get me to use it. It's barely ticking over providing Compiz functions.

Re:ECB Mode is totally insecure (0)

Anonymous Coward | more than 3 years ago | (#36052594)

eCryptfs uses CBC with a secret IV.

Re:ECB Mode is totally insecure (1)

jasonwc (939262) | more than 3 years ago | (#36052700)

According to the summary, the GPU enhanced version uses ECB:

"A demo in its current source repository is a modified eCryptfs, which is an encrypted filesystem used by Ubuntu and other distributions . . . .However, both the GPU cipher-based eCryptfs and the CPU cipher-based one are changed to use ECB cipher mode for parallelism. "

Re:ECB Mode is totally insecure (1)

lucag (24231) | more than 3 years ago | (#36055170)

Writing parallel code is difficult. Writing parallel code which makes sense, even more so. Actually, if you have a quad-core CPU and do ECB instead of CBC, you can already manage a 4x increase in performance ... no need to use a GPU!
(The reason is that ECB encryptions can be done in parallel, as each of them is independent; for CBC you need to know the encryption of the previous block in order to produce that of the current one.)
A counter mode (CTR) might make sense for ecryptfs, but the security analysis is definitely non-trivial.
Actually, it is amateurish at best to say that this implementation of ecryptfs "is not a toy" ...
(per http://code.google.com/p/kgpu/wiki/IozoneBenchmarkResults [google.com] )
it is, in fact, something which seriously compromises security.

Re:ECB Mode is totally insecure (2)

Melkhior (169823) | more than 3 years ago | (#36054950)

And because a picture straight from the horse's mouth is worth a thousand words, here's what NVidia has to say about it:

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch36.html [nvidia.com]

Go to 36.5, figure 36-11 & 36-13.

Why not OpenCL? (3, Interesting)

gerddie (173963) | more than 3 years ago | (#36052388)

They should go with OpenCL; then there would be a chance that at some point one could use it with free drivers (and other hardware). But I guess that's the price you pay for a graduate fellowship from NVIDIA.

Re:Why not OpenCL? (0)

Anonymous Coward | more than 3 years ago | (#36054562)

OK, I'd better say something about our decision.
Use OpenCL? No; the reason is that new GPGPU features on Fermi are not available in OpenCL.
Price for the fellowship? Maybe in the future... In fact, we already implemented most of KGPU before submitting our application.

Re:Why not OpenCL? (1)

gerddie (173963) | more than 3 years ago | (#36056016)

You might also want to consider this [sourceforge.net] thread on the Linux kernel mailing list. It is about adding a module to the kernel that has only one use: to talk to proprietary user space code. The module got rejected from mainline for this reason. By using CUDA and the proprietary user space portions from NVIDIA, your module will also never make it into mainline (unless hell freezes over and NVIDIA opens up their drivers).

Re:Why not OpenCL? (2)

GameboyRMH (1153867) | more than 3 years ago | (#36055566)

Came here to say this. Why the hell are they writing things in CUDA instead of OpenCL? CUDA is closed and Nvidia-proprietary!

Encryption is not the main beneficiary (2)

voss (52565) | more than 3 years ago | (#36052640)

Imagine a GPU-accelerated MySQL database...

GPU-accelerated routers, GPU acceleration of anti-virus software.
The use of GPUs to accelerate search engines.

Re:Encryption is not the main beneficiary (1)

kvvbassboy (2010962) | more than 3 years ago | (#36052744)

Imagine weather prediction and stock prediction using these. I am surprised that the guys in New York haven't used it already, given the massive amount of gold they have in their coffers.

Re:Encryption is not the main beneficiary (1)

makomk (752139) | more than 3 years ago | (#36053452)

These days, automated stock trading is in fashion, which depends on having really tiny latencies - the exact opposite of what you get from GPU acceleration. I believe companies are experimenting with implementing stock trading algorithms on FPGAs connected directly to network interfaces...

Re:Encryption is not the main beneficiary (1)

kvvbassboy (2010962) | more than 3 years ago | (#36055196)

Hmm.. I am pretty much in GPU architecture, but here's why I thought it would be great in the stock and weather forecasting.

1. They involve a lot of matrix multiplications and matrix inversion algorithms, which, from what I heard can be handled nicely by the GPU.

2. This is a very naive thought, but TFA talked about easy parallelization using the GPU. This could be harnessed by the multitude of parallel machine learning algorithms out there.

However, after some searching, I came across a white paper (dammit, I am really not able to find the link now) which mentioned that GPUs are poor when it comes to using and re-using a large data size due to some kind of latency, which I think is what you are talking about.

Re:Encryption is not the main beneficiary (1)

kvvbassboy (2010962) | more than 3 years ago | (#36055198)

"I am pretty much ignorant* in GPU architecture"

Fixed.

Re:Encryption is not the main beneficiary (0)

Anonymous Coward | more than 3 years ago | (#36056026)

I'm also pretty ignorant on the subject, but I suspect the key to understanding GPGPU computing is that it's SIMD (Single Instruction Multiple Data).

A SIMD machine is analogous to a robot with lots of arms. Each arm can hold any of a number of tools (let's say hammer, screwdriver, paintbrush). But you can't have one arm holding a hammer and another holding a paintbrush at the same time. All the arms have to hold the same tool at the same time. So you have a machine that excels at jobs like hammering a million nails, or driving a million screws, but is useless for jobs like building a house.

Re:Encryption is not the main beneficiary (0)

Anonymous Coward | more than 3 years ago | (#36054048)

Problem needs to parallelize well. Encryption usually does. Encoding usually does.
Databases, not so much, at least not with current architectures.
Can run a lot of queries in parallel, but one query gets at most one core in most modern databases, mysql inclusive.

Re:Encryption is not the main beneficiary (0)

Anonymous Coward | more than 3 years ago | (#36054294)

GPU accelerated routers, gpu acceleration of anti-virus software.

I hope that is a joke, imagining that is making me feel ill. It's also ridiculous.

The reason we HAVE CPUs and GPUs rather than just GPUs that do everything is because the two are complementary and work better on different problems. CPUs are good at decision making, you have short sequences of instructions with a bunch of if-this-then-that thrown in, they also handle sharing (not exactly well but anyway) through locking. GPUs are good at long sequences of branch-free calculations which are independent of what other cores are doing, they choke and shed performance rapidly on if-statements and die a rapid death on problems that require results from other cores which can't be performed in multiple passes. This is what makes it so much faster for 3d rendering, each pixel on the screen is mostly independent of the others so each of the 100+ 'GPU cores' can do a separate pixel each then collate the result together at the end.

I'd love to see you try and break down a SQL query execution engine into a multi-pass algorithm with minimal decision making logic. [Even if you succeed, getting the records off of the storage is always the bottleneck in a well designed DB, good luck making your HDD spin faster with magic GPU power]

Commercial routers are already accelerated and in better ways, rather than installing $100s of GPU, they use specialised network chips (probably programmed FPGAs but same difference) which will always be faster anyway. This is where GPGPU obsession gets ridiculous — dedicated hardware designed to perform a specific well-bounded task is always faster than general purpose hardware wired (both physically or in software) for that purpose, as GPUs get more general and less graphics specialised, they become more like CPUs and therefore worse at their primary task of pixel banging. (We can see this with nVidia's current GPGPU specialised Fermi architecture which consumes 50% more power than AMD's equivalent GPUs across the board; and despite all that wasted power, are barely holding the performance line in 3d graphics)

Re:Encryption is not the main beneficiary (1)

smorken (990019) | more than 3 years ago | (#36054620)

The only time that you want to use a GPU is when your code has a high proportion of numerical operations, and when your problem can be executed in parallel. (modeling, graphics) If this is not the case then using a GPU is not going to speed things up. Code where you are mostly just moving data around with sparse calculations (routers, databases, webservers, AV) is not a good problem for video cards.

Re:Encryption is not the main beneficiary (1)

nochez (1850334) | more than 3 years ago | (#36055828)

Imagine the day when someone finally implements a GPU accelerated "make me a sandwich" (http://xkcd.com/149/) ... that, would be pure awesomeness.

Re:Encryption is not the main beneficiary (0)

Anonymous Coward | more than 3 years ago | (#36056868)

The advantage of GPU is that it can do parallelized, relatively simple (only a few operations) tasks fast. So it probably wouldn't help MySQL, since that relational database is unstable garbage that can't get databases big enough to make it worth it to parallelize. Routing is not simple enough, and network data streams are not reliable enough to parallelize. Antivirus software is mostly limited by disk speed, and I don't think the heuristic algorithms are simple enough.

With search engines, ask Google. From what I recall they use ARM cpus, so I guess the search engine algorithms are not simple.

CUDA? That makes zero sense (2)

tyrione (134248) | more than 3 years ago | (#36054334)

Instead, one should use OpenCL. It's platform-agnostic for a reason, but don't let Linux's chance to be hypocritical get in the way.

I like the random reference to Ubuntu (1)

RichiH (749257) | more than 3 years ago | (#36055118)

In former times, people made sure you knew they used Slackware, then LFS, then Gentoo, now Ubuntu.

Distributions are like a penis and religion...

Anyway, get off my lawn.

4x speedup is nothing (1)

loufoque (1400831) | more than 3 years ago | (#36055340)

4x speedup is nothing. Using the GPU correctly should bring much higher speedups.
That kind of gain could simply be obtained by optimizing the CPU code.

Re:4x speedup is nothing (1)

Rockoon (1252108) | more than 3 years ago | (#36056422)

Indeed. It has been my experience that when crypto writers move their libs from C to well-optimized x86 assembly language they get at least a 2x performance boost.

These guys are getting 4x, but only on a fairly powerful GTX 480 GPU. How will typical mobile GPUs compare? Probably even slower than the CPU, right? This article makes me sad.

SSE , AltiVec (1)

mehemiah (971799) | more than 3 years ago | (#36056164)

There are plenty of architecture-specific vector instruction sets on the CPU that the kernel could be taking advantage of instead; for example, SSE and AltiVec for x86 and PPC respectively.

Re:SSE , AltiVec (1)

mehemiah (971799) | more than 3 years ago | (#36056170)

or VIS for SPARC

open CUDA or give up. (1)

bored (40072) | more than 3 years ago | (#36059590)

For the last ~8 years I've needed extremely fast encryption (and compression) in the project I use. A few years ago when CUDA began to gain traction, I got all excited and actually decided to see what was necessary to make it work and see how fast it was.

Well, at the time, I discovered that CUDA-enabled encryption is quite fast. The problem is that copying the data segment to the GPU, doing the encryption and then copying the result back is painful. The copies and setup/interrupt/etc. add so much latency that it runs at roughly the same speed as just doing the operation on the CPU. Adding a couple of user/kernel space crossings probably makes the problem even worse. So during this timeframe we used dedicated compression/encryption boards for the customers that needed it fast, and everyone else just got a couple of extra CPUs dedicated to the effort. Now, with AES-NI, dedicated boards generally aren't necessary. Sure, you have to buy a machine specifically with AES-NI right now, but I suspect that with all these instruction set extensions, within a couple of years it will be widespread.
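A back-of-envelope model of that trade-off; every number below is an assumption chosen for illustration, not a measurement:

    CPU_RATE  = 100e6    # bytes/s for an in-kernel CPU AES (the figure quoted below)
    GPU_RATE  = 4e9      # bytes/s raw GPU cipher throughput (assumed)
    PCIE_RATE = 6e9      # bytes/s effective copy speed to/from the card (assumed)
    OVERHEAD  = 50e-6    # seconds of fixed cost per request: syscalls, launch, interrupts (assumed)

    def cpu_time(n):
        return n / CPU_RATE

    def gpu_time(n):
        return OVERHEAD + 2 * n / PCIE_RATE + n / GPU_RATE

    for kb in (4, 16, 64, 256, 1024):
        n = kb * 1024
        print("%5d KiB:  CPU %8.1f us   GPU %8.1f us"
              % (kb, cpu_time(n) * 1e6, gpu_time(n) * 1e6))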

To patch the kernel to support such an ugly hack would be quite stupid, given that AES is already fairly respectable (~100MB/sec or so per CPU); anyone that needs it faster could use Blowfish, or find a CPU with AES-NI.

Actually they shouldn't (0)

Anonymous Coward | more than 3 years ago | (#36079408)

"Anyway, GPU vendors should think about opening their drivers and computing libraries, or at least providing a mechanism to make it easy to do GPU computing inside an OS kernel"
Actually they shouldn't. There's always debate about this kind of thing, but in my humble opinion, adding large and complex systems into the kernel that don't have to be there is not a good thing. For this, a cryptfs userspace crypto shim is a clean solution; it would also allow adding arbitrary new crypto systems. Regarding "the chicken and the egg": if you have an encrypted root filesystem, a lot of distros already build an initramfs -- basically a preloaded RAM disk -- and this is how all the SATA and SCSI drivers can be built as modules, yet the right ones are loaded before the system tries to mount your hard disk. So any extra cryptfs stuff can be handled there.
