
Scalable Nonblocking Data Structures

kdawson posted more than 5 years ago | from the don't-fear-the-multi-core dept.


An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."


216 comments

Sounds great! (-1, Offtopic)

Linker3000 (626634) | more than 5 years ago | (#23561417)

Woah, yeah - sounds great from the summary.

Now WTF does that all mean!?

I have to RTFA!?

Damn!

Re:Sounds great! (1, Informative)

Anonymous Coward | more than 5 years ago | (#23561567)

It means that the atomic operations required for lock-free data structures are now available in Java from this vendor. Welcome to 1988!

Re:Sounds great! (4, Funny)

moderatorrater (1095745) | more than 5 years ago | (#23561807)

Before, data structures would only perform well with 50-100 threads. With this work, he has it up to over 700 threads, though it hasn't been load tested yet. There's a good chance that he's at the forefront of the next generation of data structures, and that his work will be included in the Java core (although that's not saying much, considering).

Re:Sounds great! (2, Insightful)

hasdikarlsam (414514) | more than 5 years ago | (#23562349)

This still has limited applicability. Making a single data structure useful across hundreds of CPUs is impressive, but many problems can be more easily solved by using multiple structures - pipelining it, or using divide and conquer, or any of many other approaches.

Re:Sounds great! (4, Informative)

mikael (484) | more than 5 years ago | (#23562393)

The author has developed a programming methodology class for parallel programming in Java. In this system, a single application can have 700+ separate threads running (user input, background tasks, dialog windows, scripts, automatic undo logging).

With such applications you will often have an array of variables that are accessible by all threads (e.g. current processing modes of the application).

To preserve the integrity of the system, you need to only allow one thread to write to each variable at any time. If you have a single read/write lock for all the variables, you will end up with large number of threads queuing up in a suspended state waiting to read a variable, while one thread writes.

The author uses the Load-Link/Store-Conditional [wikipedia.org] pair of instructions to guarantee that an update is applied atomically. Load-Link loads the value from memory. Store-Conditional only writes the new value back if no other write has been performed on that location in the meantime; otherwise it fails.

Compare-And-Swap (CAS) [wikipedia.org] only replaces the variable with a new value if the current value of the variable matches a previously read old value.

Using these methods (having the writer check for any changes) eliminates the need for suspending threads when trying to read shared variables.
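For readers who want to see what that looks like in code, here is a minimal, hypothetical sketch (not taken from Click's work) of the usual CAS retry loop, using compareAndSet from java.util.concurrent.atomic:

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // The classic CAS retry loop: read, compute, attempt to swap.
    // A failed CAS means another thread got there first; we simply
    // re-read and retry instead of suspending on a lock.
    public int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}

No thread ever blocks here; in the worst case a thread loops and tries again, which is exactly the property the parent describes.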

Sounds bogus? (1)

hackingbear (988354) | more than 5 years ago | (#23562729)

Well... if I remember what I learned from OS and hardware classes right, LL/SC and CAS operations do involve locks at the hardware level. These operations may need no OS system call and may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee that all CPUs see the same updated value, it has to do a write-through and cannot just update the values in the cache local to the CPU. And when you have a large number of CPU cores running, the memory bus becomes the bottleneck by itself.

So I fail to see anything radically new in this project. It is not lock-free at all. It sounds more like bogus marketing pitch to me.

Can someone correct or explain to me?

Re:Sounds bogus? (0)

Anonymous Coward | more than 5 years ago | (#23562993)

What a troll here:

Searching for CAS / Memory brings up something entirely different

Equivocating between using CPUs and CPU cores

Anyway, to answer the question: this is a multi-core processor, so locks would not go over the memory bus but would instead be handled in the L1 cache or thereabouts, and the memory bus would not be the bottleneck.

Your comments about using multiple CPUs are most likely valid though.

why (4, Interesting)

damn_registrars (1103043) | more than 5 years ago | (#23561441)

why are there fewer than 1 thread per core? It says 768 cores, but only 700 threads. Does it need the rest of the cores just to manage the large number of threads?

Re:why (-1, Offtopic)

Anonymous Coward | more than 5 years ago | (#23561615)

700 threads should be enough for anybody?

68 cores left to surf concurrent pr0n?

Re:why (4, Informative)

Chris Burke (6130) | more than 5 years ago | (#23561655)

Because one is a general statement ("supports 700+ threads"), and the other is a statement about a specific hardware setup ("in production with 768 cores").

It was not meant to imply that the 768 processor system will use exactly 700 worker threads. It was meant to imply that the system breaks through the traditional scalability limits of 50-100 threads, thus the 700+.

Re:why (1)

MarkEst1973 (769601) | more than 5 years ago | (#23562245)

Surely, 640 threads ought to be enough for anybody.

But seriously, have you seen the price of this beast? HUGE price tag. Why not build your own cluster?

There are technologies today (like, say, Terracotta Server [markturansky.com] ) that allow for easy distribution of work across a large number of JVMs.

I suppose the companies that need all those cores and threads in one machine can afford the Honkin' Big Iron. For the rest of us, clustering is getting cheaper and cheaper these days.

Re:why (1)

cheater512 (783349) | more than 5 years ago | (#23562529)

Or it's a development system with limited production.
Chances are it is being used for R&D and not actually crunching numbers.

When systems like that hit mass production, we'll already have the data structures for them.
Can't do that with a cluster.

Re:why (0)

Anonymous Coward | more than 5 years ago | (#23561747)

Because the article got it wrong?

We routinely run codes with 10,000's of threads; I've tested it with 100,000 *runnable* threads - works great.

Cliff Click

Re:why (2, Informative)

Anonymous Coward | more than 5 years ago | (#23561753)

His implementation is in Java, and the JVM adds some of its own threads (garbage collection threads, compiler threads, etc.), so some of the compute goes toward those threads.

Re:why (1)

drspliff (652992) | more than 5 years ago | (#23561781)

Each Vega2 processor has 48 cores; 768 cores in just 16 processors is pretty good, and you can be certain a number of those are reserved for system use on such a large-scale machine. These are already fairly lightweight hardware threads, and I can only presume that with more hardware threads per core you'd get some serious I/O starvation issues.

How I'd love to have one of these boxes :)

I think you have it wrong (0)

Anonymous Coward | more than 5 years ago | (#23561987)

There are 48 Vega2 processors per board, with 1 core per processor. The biggest setup stacks 16 of these boards to get to 768 cores.

Re:I think you have it wrong (1, Interesting)

Anonymous Coward | more than 5 years ago | (#23562155)

Well, actually, it's 16 Vega2 chips (each of which is termed a "processor"), each chip has 48 cores, making for 768 total cores. 2 chips per board, and 8 boards in a system. All chips are fully meshed (no switch, fabric, or bus, just a whole bunch of direct multi-GHz wires), so it's a flat SMP UMA machine.

The new Vega3 (announced last week) takes it to 54 cores per chip, and 864 cores in a system (16 x 54 = 864 ...).
 

Re:why (1)

blitzkrieg3 (995849) | more than 5 years ago | (#23562775)

If all of the threads need to be able to update one huge data structure, then eventually you reach a point of diminishing returns. The fact that the threads need to lock the data structure every time they do an update causes a bottleneck. Conventionally, 700 threads are no better than 50-100 because the additional ~600 threads would just be waiting for a chance to update the data structure. In this case it doesn't matter that they are on separate processors, since those 600 threads aren't using CPU time anyway. Each additional thread added after 100 will go straight to the wait queue, doing effectively nothing.

What this technology allows is atomic updates of data structures, which means that it will not be necessary for a thread to acquire a lock before updating the data structure. Thus, 700+ threads perform better than 50-100.
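To make the contrast concrete, here is a minimal sketch (hypothetical, not the structures from TFA) of the two update styles the parent describes:

import java.util.concurrent.atomic.AtomicLong;

public class UpdateStyles {
    // Lock-based: every writer serializes on the monitor, so extra threads
    // just queue up in a suspended state, as described above.
    private long lockedCount;
    public synchronized void incrementWithLock() {
        lockedCount++;
    }

    // Lock-free: incrementAndGet is a hardware CAS retried on failure, so
    // threads are never parked waiting for one another.
    private final AtomicLong atomicCount = new AtomicLong();
    public void incrementLockFree() {
        atomicCount.incrementAndGet();
    }
}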

Java???? (0, Flamebait)

maz2331 (1104901) | more than 5 years ago | (#23561653)

700 threads in JAVA? Why not use C++, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.

Or... is this just a way to avoid having to get the really good coders who are more costly than the burn-bags?

Re:Java???? (4, Insightful)

Anonymous Coward | more than 5 years ago | (#23561739)

700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.

Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?

Re:Java???? (5, Funny)

Anonymous Coward | more than 5 years ago | (#23561847)

700 threads in assembler? Why not use JAVA, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.

Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?

Re:Java???? (2, Funny)

nih (411096) | more than 5 years ago | (#23563087)

700 threads in JAVA? Why not use Brainfuck, actually optimize the hell out of...no wait, fs my head hurts now:(

Re:Java???? (0)

Anonymous Coward | more than 5 years ago | (#23561993)

What are burn-bags? I honestly have no idea. Logically, I can infer that they are not the "really, really good coders", since otherwise the "really, really good coders" would be more costly than themselves.

Re:Java???? (-1)

Anonymous Coward | more than 5 years ago | (#23562401)

You know the old "burning bag of poo on the front porch" gag? Now you know what a burn bag is.

Re:Java???? (1)

raddan (519638) | more than 5 years ago | (#23562403)

700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.
Good idea, but-- assembly language is way too high-level. Why not take it a step further and just give programmers front panel switches? Bonus: you save money on keyboards!

Re:Java???? (1)

freedumb2000 (966222) | more than 5 years ago | (#23562605)

I just pictured 700+ worker(-threads) in front of a huge panel with 700+ switches. I think that is where assembly(-line) coding came from.

Re:Java???? (1)

Jherek Carnelian (831679) | more than 5 years ago | (#23562405)

700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread
Because the hardware platform his company sells actually has more than 700 cores - that means 700 simultaneous threads - not sequential, simultaneous.

When you have a highly parallel system, it is counter-productive to try to make the workload sequential.

Re:Java???? (1)

billcopc (196330) | more than 5 years ago | (#23562865)

As much as I love assembler, it's a supply-versus-demand thing. Very few people are actually proficient at assembler coding, but a crapload of morons can write a thousand lines of useless Java code a week, and they're willing to work for peanuts. Assembler guys want big bucks, mostly because their rarity gives them that power, along with the obscurity of the art.

Let's face it: there are some low-level things you just can't do in any language, and asm coders know it! If it's only a speed issue, it's usually cheaper to throw more hardware at the problem and let the cheap idiotic factory coders chew on it.

Re:Java???? (3, Insightful)

AKAImBatman (238306) | more than 5 years ago | (#23561829)

700 threads in JAVA? Why not use C++

Hmm... lemme think about that. Maybe because Java has decent threading support built into the language? Maybe because the platform is portable to any architecture? Maybe because the JVM can "optimize the hell" out of the running Java code far better than you could "optimize the hell" out of your C++ by hand?

"Java is Slow" is a mantra that is easily 5+ years out of date. Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore. Anyone repeating the "Java is Slow" mantra is merely branding themselves ignorant.

Re:Java???? (1)

Brian Gordon (987471) | more than 5 years ago | (#23561891)

I don't think architecture portability is a concern when you're writing for a 768-core supercomputer :)

Re:Java???? (1)

phasm42 (588479) | more than 5 years ago | (#23561973)

I don't think architecture portability is a concern when you're writing for a 768-core supercomputer :)
You don't actually write code specific to a 768-core computer. The code doesn't need more than 1 core; the idea is to make it scale well to 768 cores.

Re:Java???? (-1, Troll)

Viol8 (599362) | more than 5 years ago | (#23561937)

"Maybe because the JVM can "optimize the hell" out of the running Java code far better than you could "optimize the hell" out of your C++ by hand?"

Oh please, that's just funny. :o)

The day I see a half decent java JIT compiler I'll know the 22nd century has arrived.

"Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore."

Yeah, sure it did. Btw, how is the weather on Planet Pie-in-the-Sky these days?

Re:Java???? (1)

Daniel Dvorkin (106857) | more than 5 years ago | (#23561941)

Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore.

Okay, I'll agree that well-written Java code is generally performance-competitive with compiled code, but this is a pretty sweeping assertion. Do you have any evidence for it -- or is it just a little too convenient that "no one even bothers" with benchmarks?

Benchmark from many years ago (2, Insightful)

CustomDesigned (250089) | more than 5 years ago | (#23562527)

Yes, Java didn't surpass, but was competitive with C++ many years ago - at least as far as CPU goes. In fact, if you use gcj, you'll find that Java performance is *very* similar to C++ :-). For the IBM JVM, the turning point was JDK 1.1.6. I think having each thread allocate blocks of memory to dish out without lock contention was a big part of it. One interesting benchmark was this [bmsi.com] .

Now, before you C/C++ or Java fanboys get too excited, the absolute hands down fastest language on most of the benchmarks was LISP (I think it was Common LISP). I don't use LISP because I hate the syntax (and LISP fanboys regard it as a requirement of club membership that you like it). But the performance is quite impressive (with GC too), and that's what I would look into if I needed top performance from a high level language. Eat your heart out C++.

I use Java quite a bit, and the biggest drawback performance wise is the amount of memory chewed up by the GC schemes and JIT. It is otherwise quite fast. I often turn off JIT to save memory (at the expense of CPU), and I would like to see options to use a GC algorithm that minimizes memory consumption (at the expense of CPU).

Re:Java???? (0)

Anonymous Coward | more than 5 years ago | (#23562145)

"Java is Slow" is a mantra that is easily 5+ years out of date. Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore.
Hahaha, oh yes, of course! Do you honestly believe that one C-derived language that's compiled to your target architecture can somehow magically outperform another C-derived language compiled to the same target architecture by "a wide margin"? Java compilers aren't anywhere near as mature as C++ compilers, so excuse my scepticism.

I'm not even going to bother with the absurdity of the idea of JIT bytecode outperforming a compiled language on any architecture.

Anyone repeating the "Java is Slow" mantra is merely branding themselves ignorant.
Or they've suffered through some of these super-duper Java applications that will save the world.

Re:Java???? (1)

cheater512 (783349) | more than 5 years ago | (#23562579)

Well the AVR32 architecture does support some Java byte codes.

Mind you the manual does specifically state that you shouldn't use them inside interrupts. :)

Uh, no. (0)

Anonymous Coward | more than 5 years ago | (#23562205)

If Java surpasses C in performance, then why aren't the Unreal or Doom engines written in it? What you meant to say is: "for most applications, Java can surpass C++ performance."

Oversimplifying things helps no one, except maybe your moderation score if the moderators are that clueless.

Re:Uh, no. (2, Informative)

quanticle (843097) | more than 5 years ago | (#23562397)

He never said that Java had surpassed C in speed, he said that Java had surpassed C++. C++ library classes are not the same as C library classes, and many C++ libraries (especially the ones outside STL and Boost) are woefully under optimized. Java has many more optimized libraries "packaged in" with the language itself.

Second, neither the Doom nor the Unreal engine is multi-threaded. Java has threading support built into the language. To get the same with C you'd have to use POSIX threads (killing Windows compatibility) or the Windows threading API (killing POSIX compatibility). With Java, you don't have to make that choice.
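As a trivial illustration (class name is hypothetical), the same Java threading code runs unchanged on any OS the JVM supports, with no POSIX-versus-Win32 decision to make:

public class PortableThreads {
    public static void main(String[] args) throws InterruptedException {
        Runnable work = new Runnable() {
            public void run() {
                System.out.println("hello from " + Thread.currentThread().getName());
            }
        };
        // Threads are part of the language/runtime, not an OS-specific API.
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
    }
}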

Re:Java???? (5, Insightful)

JMZero (449047) | more than 5 years ago | (#23562475)

Java is perfectly fast for real world applications, and I'd agree that the "Java is Slow" idea is outdated.

But it's not conclusively faster than C++, at least not in a general sense. In terms of a small task involving lots of simple operations, you'll still often see a significant speed increase using C++. This [multicon.pl] is a good example. Now I'm sure there's more optimizations available on both sides - and plenty of stuff to tweak - but C++ is going to come out ahead by a significant margin on a lot of these tasks.

A good example where the participants on both sides have some motivation is on TopCoder (where I spend a fair bit of time). Performance isn't usually the driving factor in language choice there - but sometimes it is, and when it is the answer is pretty much always C++ (unless it's a comparison between Java BigInteger and a naive implementation of the same in C++).

Reasonably often you'll see people write an initial solution in Java, find it runs a bit too slow, and quickly port it to C++ (or pre-emptively switch to C++ if they think they'll be near the time limit). It's not uncommon to see a factor of two difference in performance.

To be clear - these are not usually "real world" tasks. As more memory and objects come into play, Java is going to do better and better. But these kinds of tasks still exist - there's still plenty of places where C++ is going to be the choice because of performance.

In any case, your contention that Java is so much faster that nobody does benchmarks anymore is unsupported and wrong.

Re:Java???? (3, Informative)

Kupek (75469) | more than 5 years ago | (#23561843)

Java has a well-defined memory model. C++ (and C) do not; behavior depends on the hardware it is run on.

Re:Java???? (0)

Anonymous Coward | more than 5 years ago | (#23562153)

Yes, it does. A memory model that they're apparently flagrantly violating, as the Java memory model is quite clear that while reads and writes are atomic, they're not global to all threads.

Namely, if Thread A writes to a value foo and Thread B reads from that value, Thread B will never get half-stale data. It will always get either the value before Thread A set foo, or the value after Thread A set foo. But the Java language is also quite clear that there's no way to know WHICH value it gets. Even if the read in Thread B occurs after the write in Thread A chronologically, Thread B may be looking at stale data.

To get around that, you need to use locks. Which make these "nonblocking data structures" whatcha-call it, blocking, as the only way to ensure that you get the latest value is to lock all access to the data structure and then flush the memory off the CPU cache and into main memory. (The "volatile" keyword causes all read/writes to volatile values to create such a lock and force such a flush.)

So if they're really non-blocking, the magic is not happening in Java. It's happening somewhere else, either in a modified virtual machine or in code linked via JNI - in other words, C.

Re:Java???? (1)

DarthStrydre (685032) | more than 5 years ago | (#23562389)

Namely, if Thread A writes to a value foo and Thread B reads from that value, Thread B will never get half-stale data. It will always get either the value before Thread A set foo, or the value after Thread A set foo.
It would be nice if this was true... And it is for most things. However per the language spec...

17.7 Non-atomic Treatment of double and long
Some implementations may find it convenient to divide a single write action on a 64-bit long or double value into two write actions on adjacent 32 bit values. For efficiency's sake, this behavior is implementation specific; Java virtual machines are free to perform writes to long and double values atomically or in two parts.
For the purposes of the Java programming language memory model, a single write to a non-volatile long or double value is treated as two separate writes: one to each 32-bit half. This can result in a situation where a thread sees the first 32 bits of a 64 bit value from one write, and the second 32 bits from another write. Writes and reads of volatile long and double values are always atomic. Writes to and reads of references are always atomic, regardless of whether they are implemented as 32 or 64 bit values.

VM implementors are encouraged to avoid splitting their 64-bit values where possible. Programmers are encouraged to declare shared 64-bit values as volatile or synchronize their programs correctly to avoid possible complications.
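A minimal sketch of the spec's recommendation (the field name here is hypothetical): declaring the shared 64-bit field volatile makes its reads and writes atomic, so no reader can see two halves from different writes.

public class SharedTimestamp {
    // A plain 'long' write may legally be split into two 32-bit writes
    // (JLS 17.7). Declaring it volatile forces atomic reads and writes,
    // and also gives the usual visibility guarantees between threads.
    private volatile long lastUpdateNanos;

    public void record(long nanos) { lastUpdateNanos = nanos; }
    public long read() { return lastUpdateNanos; }
}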

Re:Java???? (1)

Kupek (75469) | more than 5 years ago | (#23562461)

Even if the read in Thread B occurs after the write in Thread A chronologically, Thread B may be looking at stale data.
If Thread B has stale data, then its commit (through a compare-and-swap or similar atomic operation) should fail, and the thread should try the operation again.

Take a look at the algorithms in the presentation. I've done a fair amount of lock-free programming [vt.edu] , but all of it was in C. His algorithms look like standard lock-free practices, but since I'm not familiar with what correct lock-free Java code looks like, I can't say with confidence.

I will, however, say that I would be surprised if Cliff Click made the obvious error you assume he is making. I saw him talk at ISMM 2006, and I trust him more than random people on Slashdot.

java.util.concurrent.atomic (1)

CustomDesigned (250089) | more than 5 years ago | (#23562921)

They are really non-blocking, and use a standard Java API - the java.util.concurrent.atomic package. The description of the package indicates it was intended as the low level basis for non-blocking data structures. While most JVMs are written largely in C, the Java code gets compiled to machine language. Standard packages do not have to be implemented via JNI, and can be implemented directly by the JVM - just like GNU C implements many standard C library functions directly in the compiler for performance. In fact, java.util.concurrent.atomic would be a poor choice for implementing via JNI because the transition between C and Java code that happens on every JNI call would defeat the high performance purpose of the package.
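For the curious, here is a hypothetical sketch of how that package supports the "one big array, atomic update per word" style from the summary (this is not Click's actual NonBlockingHashMap code):

import java.util.concurrent.atomic.AtomicReferenceArray;

public class WordSlots<V> {
    // One large array, with atomic compare-and-set on individual words.
    private final AtomicReferenceArray<V> slots;

    public WordSlots(int capacity) {
        slots = new AtomicReferenceArray<V>(capacity);
    }

    // Claim an empty slot without taking a lock; returns false if another
    // thread claimed it first, in which case the caller probes elsewhere.
    public boolean tryClaim(int index, V value) {
        return slots.compareAndSet(index, null, value);
    }

    public V get(int index) {
        return slots.get(index);
    }
}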

Re:Java???? (0)

Anonymous Coward | more than 5 years ago | (#23561859)

Oh dear.

Having achieved a certain level of competence - this is an assumption of mine, it may be entirely wrong - it looks like you're resistant to change of any nature.

In 5 years, let me know how being the COBOL programmer of our age - and less productive by far than everyone around you - is working out.

Re:Java???? (1)

arthurpaliden (939626) | more than 5 years ago | (#23562095)

Actually, I like being a COBOL consultant/programmer. I get to work when I want and I get paid really big bucks. There's lots of COBOL still out there in mission-critical processes.

Re:Java???? (0)

Anonymous Coward | more than 5 years ago | (#23561879)

700 threads in JAVA? Why not use C++, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.
Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job; plenty-o-examples upon request) and the 768-way Azul box is 10x cheaper than the same cycles that can run C code (by which I mean: same count of MIPs in a cache-coherent shared-memory UMA machine)?

soapbox-off.... but you speak like this is 1995 and are acting as-if Java hasn't moved forward in the last 13 years.

Cliff Click

Re:Java???? (1)

Viol8 (599362) | more than 5 years ago | (#23561959)

"Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job; plenty-o-examples upon request)"

Sorry, remind me what language the JIT compiler is written in. Java, is it? No, thought not.

Re:Java???? (1)

pohl (872) | more than 5 years ago | (#23562081)

It's all machine language down at the bottom. So what's your point?

Re:Java???? (0)

Anonymous Coward | more than 5 years ago | (#23562267)

That's why I tell everyone at work to stay away from Java. If you want performance it's assembly or nothing, why waste time with trade-offs?

Re:Java???? (1)

neuromancer23 (1122449) | more than 5 years ago | (#23562537)

Thanks for the awesome advice. x86 assembler is clearly the obvious choice for anyone writing a web or desktop application.

Re:Java???? (1)

Hairy Heron (1296923) | more than 5 years ago | (#23562319)

But above you talked about how JIT compilers sucked. So if we go by the usual logic of X Java program sucks -> Java sucks, then clearly the language that the JIT compiler is written in must by extension also suck, no?

Re:Java???? (2, Informative)

fishbowl (7759) | more than 5 years ago | (#23562691)


In the problems that TFA addresses, I'd wager that most of the time is spent in dissemination barriers of some sort, since invariably the problems in parallel computing move into issues within the problem domain (which is ideally what we want, after all).

As for a JIT being inefficient compared to a static optimizing compiler, it depends so much on the code in question and on the platform, as to not be something you can make blanket statements about.

Let's hear from some HPC researchers on this. Get some real numbers.

Re:Java???? (3, Funny)

Reverend528 (585549) | more than 5 years ago | (#23562213)

Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job
Since when has writing code quickly ever been considered one of Java's strong points? Personally I'd take stdio over Java's alternative (file wrapped in a stream buffer wrapped in a buffered reader wrapped in an enigma) any day of the week.

Sure, Java manages memory for you, but it's generally much easier to incorporate a garbage collector into C than it is to write java without file I/O.

Re:Java???? (4, Insightful)

famebait (450028) | more than 5 years ago | (#23562027)

There is no way anything less than _really_ good coders would get something like this to work with any semblance of efficiency. If you still evaluate coders by which language they use, chances are you're not really that good a programmer.

Re:Java???? (2, Interesting)

tppublic (899574) | more than 5 years ago | (#23562079)

"Why not use C++"

Umm, because Azul runs the Java in hardware. It *is* optimized by being in Java.

Re:Java???? (1)

famebait (450028) | more than 5 years ago | (#23562191)

700 threads in JAVA? Why not use C++,

To get the work done and working correctly before the hardware is obsolete.

Re:Java???? (1)

mikael (484) | more than 5 years ago | (#23562789)

They develop Java hardware for internet applications. Presumably this is for commercial organisations where it is more important to get a functional application designed and assembled than to design, test, and implement a completely optimised C++ application.

Re:Java???? (0)

Anonymous Coward | more than 5 years ago | (#23563117)

Just imagine not using Java, but a decent and fast programming language instead and getting the same performance out of your application on a NORMAL system. Really, all trolling/hatred aside, it always pisses me of royally to see someone provide near-supercomputing setups to accelerate something which is INHERENTLY SLOW. Solve the actual problem instead and get Java out of the equation.

scalable noNBLocking data sTRructures .. :) (1, Insightful)

rs232 (849320) | more than 5 years ago | (#23561949)

Congrats, Slashdot, you've managed to produce a story that is guaranteed to totally baffle the non-techie sector.

KeyWords:

concurrent data structures, hardware threads, java, large array, scalable parallel access, atomic update, words, finite-state machine, lock-free, data structures ...

Re:scalable noNBLocking data sTRructures .. :) (5, Interesting)

Seferino (837142) | more than 5 years ago | (#23562343)

Good for us. Get the rabble away from Slashdot. Only true nerds should understand the contents. Let me add a few keywords to get rid of the softies: monads, higher-order type systems, return type, genericity.
Your turn.

false sharing? (2, Interesting)

shrimppoboy (853235) | more than 5 years ago | (#23561961)

The brief description in the article sounds suspicious and incompetent.
1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.

2. Updating a shared variable, especially a non-scalar, in an inner loop is naive. One should reference local scalars in inner loops when parallelizing. The thread should update the shared variable just once. Don't reference non-scalars or shared variables in an inner loop. Don't lock in the inner loop, either, if you can avoid it.

If it is not an inner loop then the locking is probably not a big issue any way.

3. Java, really, Java?

Re:false sharing? (1)

Chirs (87576) | more than 5 years ago | (#23562299)

If the data set is much larger than the number of cpus, then it may be possible to arrange things such that the likelihood of two cpus hitting the same cacheline is pretty small.

As for Java, in the article Dr. Click says it has a well-understood and well-implemented memory model.

Re:false sharing? (2, Informative)

julesh (229690) | more than 5 years ago | (#23562755)

The brief description in the article sounds suspicious and incompetent.
1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.


The keyword in your statement is "typically". Click is working on the Azul processor, which is designed from the ground up for this kind of task. While I haven't been able to find details of its cache organisation, I'd say it's pretty safe money that it uses smaller than average cache lines. The hashtable structure described uses the contents of the array in pairs, i.e. 64 bits at a time. If each cache line were 64 bits wide, this would be efficient, no?

2. Updating a shared variable, especially a non-scaler, in an inner loop is naive. One should reference local scalers in inner loops when parallelizing. Just once, should the thread update the shared variable. Don't reference non-scalers or shared variables in an inner loop. Don't lock in the inner loop, either, if you can avoid it.

What if the bottleneck of your application is referencing shared data that needs to be updated in real time? There are many applications for which this is the case, and this kind of work is useful for anyone who is working on such an application. Just throwing an example out there: the scheduler for a massively parallel operating system would typically need to have many cores referring to a single queue of waiting threads with a high chance for collisions between read and write operations. I'm sure there are plenty more, too.

3. Java, really, Java?

Yes. Java. Really. Java is probably the most important language at the moment for enterprise application development, which is the environment where this kind of issue most frequently occurs, so developing solutions to these problems that work in Java is particularly important. Azul, Click's employer, specialises in high performance computers optimized for running Java software.
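For what it's worth, the padding suggested earlier in the thread looks roughly like this in Java (a hypothetical sketch, assuming 64-byte cache lines and 8-byte array elements):

import java.util.concurrent.atomic.AtomicLongArray;

public class PaddedCounters {
    // Spread per-thread counters 8 longs (64 bytes) apart so that two
    // counters never land on the same cache line and threads don't fight
    // over it (false sharing).
    private static final int STRIDE = 8;
    private final AtomicLongArray counts;

    public PaddedCounters(int threads) {
        counts = new AtomicLongArray(threads * STRIDE);
    }

    public void increment(int threadId) {
        counts.incrementAndGet(threadId * STRIDE);
    }

    public long get(int threadId) {
        return counts.get(threadId * STRIDE);
    }
}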

Geek serendipity in a summary (3, Funny)

ThreeGigs (239452) | more than 5 years ago | (#23561983)

and a finite-state machine built from the atomic update and logically replicated per array word.


Now *that* is what I call geek speak.

Why not Perl? (0)

Anonymous Coward | more than 5 years ago | (#23562059)

Too bad this isn't being done in Perl.

Well, sort of lock-free. (1, Informative)

Animats (122034) | more than 5 years ago | (#23562103)

It's not really "lock free". The algorithms in the slides still have WHILE loops wrapped around atomic compare-and-swap operations, so they are effectively spin locks, tying up the CPU while the other CPUs do something. However, the design is such that the WHILE loops shouldn't stall for too long.

This concept has two parts - a way of constructing bigger atomic operations from hardware-supported word-sized atomic operations, and a scheme for resizing arrays while they're in use. The latter is more important, especially in Java; stalls due to array resizing are a headache in real time systems with dynamic memory. It works about the way you'd expect; there are flags indicating the location (old or new) of each array element as the array is copied. Making all this work correctly is touchy.

Concurrent algorithms like this are incredibly hard to write, and you need formal methods to get them right. The author of the paper still hasn't figured out how to code a FIFO under these rules.
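For reference, the "WHILE loop wrapped around a compare-and-swap" shape the parent describes looks like this in Java; this is the textbook Treiber-stack push, not code from the slides:

import java.util.concurrent.atomic.AtomicReference;

public class LockFreeStack<T> {
    private static final class Node<T> {
        final T item;
        final Node<T> next;
        Node(T item, Node<T> next) { this.item = item; this.next = next; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<Node<T>>();

    // The loop retries only when some other thread's CAS succeeded in the
    // meantime, so system-wide progress is guaranteed (lock-free), even
    // though an individual thread may spin (not wait-free).
    public void push(T item) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = top.get();
            newHead = new Node<T>(item, oldHead);
        } while (!top.compareAndSet(oldHead, newHead));
    }
}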

No it is Lock Free (2, Interesting)

tbcpp (797625) | more than 5 years ago | (#23562437)

I used to think this too, until I saw the video by the article's author. By lock-free we mean that if the thread that holds the "lock" were to die, it would not stall the entire program. With CAS updates, a crashing thread would simply die and cause no ill effects to the data structure. With mutex-style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock. The maximum time a CAS "lock" can exist is one CPU cycle.

Re:No it is Lock Free (1)

neuromancer23 (1122449) | more than 5 years ago | (#23562941)

>> With Mutex style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock.

Not in Java. I will give you $1000 if you can:

1. Kill a thread prematurely.
2. Cause it not to release a lock inside the JVM.

Re:Well, sort of lock-free. (1)

egoots (557276) | more than 5 years ago | (#23562613)

A shared data structure is considered "lock-free" if its operations don't require mutual exclusion. So if a process is interrupted in the middle of an operation, it will not prevent the other processes from operating on that object.

Technically, there are more formal definitions which state that it guarantees that some process in the system operating on the concurrent object, will make progress in a finite number of steps.

An object is termed "wait-free" if each process will make progress in a finite number of steps.

Re:Well, sort of lock-free. (1)

Kupek (75469) | more than 5 years ago | (#23562635)

It is lock-free, but it is not wait-free. He explains the difference in his slides, and there's plenty of literature around for those who have access to Google.

From the article: (3, Funny)

Kingrames (858416) | more than 5 years ago | (#23562195)

"# A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes."

Clearly, the data structures have been touched by his noodly appendage.

C/C++ (0)

Anonymous Coward | more than 5 years ago | (#23562201)

Anything (free software) like this on C/C++?

"2.5"? WTF? (0)

adavies42 (746183) | more than 5 years ago | (#23562239)

The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well.

Will someone please explain WTF "2.5" means in the summary?

Re:"2.5"? WTF? (4, Informative)

badboy_tw2002 (524611) | more than 5 years ago | (#23562477)

He's built two working data structures and is working on a third (had to read the slides to figure that one out).

one per (2, Funny)

spoonist (32012) | more than 5 years ago | (#23562293)

Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java

Hmmm... one core per Java thread?

That sounds about right for Java apps...

Bulk-Synchronous Parallel model, anyone ? (2, Insightful)

Seferino (837142) | more than 5 years ago | (#23562311)

This is interesting indeed. When reading the summary, it made me think about BSPML, although the slides make it clear that there are a number of differences. Essentially
  • BSPML doesn't limit itself to FSM but has full expressive power, including exceptions -- some implementations of BSPML use monads to solve things that this work solves by scaling down to a FSM
  • BSPML doesn't support dynamic changes to the number of threads
  • many BSPML algorithms are provable
  • BSPML is typically compiled to fully native code
  • BSPML doesn't use processor-specific concurrency-specific optimizations
  • BSPML works on distributed systems.
Despite the differences, both models work by sharing code and operating on a high-speed concurrent pseudo-vector, in a completely lock-free model (although locks can be implemented on top of the model, as usual).
Just my .02 €.

Quoting out of context can be fun... (0)

Anonymous Coward | more than 5 years ago | (#23562423)

3. [...] The FSM supports array resize and is used to control writes.

So their algorithm does not work without the interference of the Flying Spaghetti Monster?

WTF? (2, Informative)

neuromancer23 (1122449) | more than 5 years ago | (#23562429)

Nowhere in the article is it mentioned that they are running "700 hardware threads". Thousands of threads are typical of Java applications even running on Pentium IIIs. Every J2EE application server spawns a new thread for every request; it's part of the specification. The real issue with Java is the hard thread limit in most JVMs, where even setting -Xss will not override the limit. These limits are both asinine and arbitrary. Linux can very easily handle the instantiation of millions of pthreads on a single-core host, and we all have multi-core machines these days. The JVM ought to scale to millions of threads; otherwise, if a JVM can only support 20k concurrent threads, there really isn't any reason to ever pay more than a few hundred dollars for a web server, since you quickly reach the maximum thread count while 99% of your Opteron system is pretty much sitting there doing nothing.

Furthermore, the need to synchronize a collection is an indication of poor design (i.e. lack of encapsulation) that is better solved by writing better code that doesn't use disgusting global data structures, but if you really wanted to use a shared collection, all you would have to do is create an implementation that makes each object in the data structure lockable individually, like a linked list where any index could be locked independent of the list itself, and new capacity could be created and then added in bulk. But in the end, updating a collection is a simple constant time operation (just update the reference to point to a new location), so you should never have performance problems even when locking an entire collection to do the update. If you are having performance problems with that, then chances are, it is the way that you are interacting with the collection.

Anyway, here's to hoping that people can learn basic programming concepts so that the world can move on to solving real problems.

Re:WTF? (1)

julesh (229690) | more than 5 years ago | (#23562907)

Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads".

Quoth the article: "On Azul's hardware it obtains linear scaling to 768 CPUs"

That kind-of implies 768 hardware threads are in use.

So? (0)

Anonymous Coward | more than 5 years ago | (#23562441)

I work in the real-time embedded world and we've been doing this for years.

I doubt that very much (1)

EmbeddedJanitor (597831) | more than 5 years ago | (#23563065)

Sure, embedded folk have been using FSMs for years. Sure, embedded folk have been using arrays for many years.

But a FSM-controlled data access for multi-threading is a bit different. The closest I've seen is the implementation of dual-port RAM which uses hardware state machines to control access to the underlying shared RAM.

Oh, I've been doing embedded for over 20 years too.

These kids today! (0)

Anonymous Coward | more than 5 years ago | (#23562697)

Bah. Just use /dev/null. Unlimited concurrent writers. Scales to massive volumes.

Dr. Click (1)

tsalaroth (798327) | more than 5 years ago | (#23563047)

Since you're apparently reading this - shoot me an email. I'd like to discuss if this could work with Jython.

I'm thinking using a single Honkin' Big Iron instead of a farm for a massive-use service.