Parallel Programming - What Systems Do You Prefer?

Cliff posted more than 8 years ago | from the better-computing-through-more-CPUs dept.

Programming 23

atti_2410 asks: "As multi-core CPUs are finding their way into more and more computer systems, from servers to corporate desktops to home systems, parallel programming becomes an issue for application programmers outside the High Performance Computing community. Many Parallel Programming Systems have been developed in the past, yet little is known about which are in practical use or even known to a wider audience, and which are just developed, released and forgotten. Nor is it known which problems bother the actual users of parallel programming systems the most. There is not even data on the platforms that parallel programs are developed for. To shed some light on the subject, I have set up a short survey on the topic, and I would also very much like to hear your opinion here on Slashdot!" What Parallel Computing systems and software have you used that really made an impression on you, both good and bad?

My experiences on 6000-series hardware (1)

Macphisto (62181) | more than 8 years ago | (#13918930)

I've found that parallel computing has provided an excellent means of avoiding obsolescence by allowing the creation of massive computers that have the potential to crush comparatively tiny, "modern" systems. While the prototype AppleCrate is just a small, tentative step in this direction, a future system comprising NES subprocessors in addition to the "Oregon Trail"-codenamed CPUs could spontaneously develop mech-transformative properties, allowing the weapon-aided destruction of systems not puny enough to be crushed by sheer mass.

Seriously, AppleCrate rules. Check it out. It is not much of a parallel computer since the nodes are, well, they're Apple IIs, and even if that wasn't a problem, I think I could outtype the interlink.

AppleCrate!!!??? (1)

cr0sh (43134) | more than 8 years ago | (#13923975)

If this hasn't been submitted as a Slashdot story yet (I haven't seen it) - it needs to be. This doesn't seem to be a joke!

This guy actually seems to have built an 8-processor parallel computer using Apple IIe mainboards! With a custom networking system using the game port! Then, over the top of that, he used the machine with other custom software to make an 8-voice sound synthesizer system, using the native hardware (where each "voice" has 5 virtual bits of sample playback capability using PWM square-wave modulation of the native speaker output)!

Ultimately, I know at a deep level that this guy hasn't done anything spectacular or special - his machine won't change the world. However, the sheer chutzpah of doing it! This is the hacker spirit at work, and this guy should be commended for it! This is "News for Nerds, Stuff that Matters"!

I just don't know which is more insane - the fact that this guy has built such a system, or the fact that I want to build one, too, in order to run a 16-color Mandelbrot set generator I wrote in high school as a parallelized implementation!

APL (1)

TheZeusJuice (766754) | more than 8 years ago | (#13918977)

APL, J, and other such array programming languages have always been particularly suited for massively parallelizable stuff.

short answer (4, Informative)

blackcoot (124938) | more than 8 years ago | (#13919096)

in order of easiest to hardest to program:

uniform access shared memory (think the bigass (tm) cray machines) -- here you'd typically use mpi (if your programs are supposed to be portable) or the local threading library + vectorized / parallelized math libraries. since it's all in a single memory space, it's "as simple" as just doing a good job multithreading the program.

non-uniform access shared memory (think the large modern sgi machines) -- here things get a bit more subtle, because you're going to start caring about memory access and intranode communications. you can still get a reasonable measure of performance by just using threads, however, if your problem is "embarrassingly parallel enough".

distributed memory (beowulf clusters and their ilk, although a bunch of regular linux or windows boxes will do) -- this is where things get excessively complicated very quickly. you have your choice of several toolkits (mpi being standard in the scientific world and superseding the previous pvm standard). here you are going to care a lot about communications patterns (in fact, probably more so than computation). i believe one of the java technologies (javaspaces perhaps? jini maybe?) abstracts this away and gives you the view of the network as a sea of computational power. regardless, you're going to have to pay very careful attention to how data moves because that will typically be your bottleneck. synchronization becomes whole orders of magnitude more expensive on this kind of parallel machine, which is another thing you'll have to figure into your algorithm design.

once your architecture is fixed, you can start to talk about which toolkit to use. a well tuned mpi will work "equally well" in each of these environments and have the added bonus of being portable across architectures. mpich is a well respected implementation, although i found lam to be much easier to use, personally. good luck, i think you're about to open a can of worms only to discover that you've really just opened a can of ill tempered and rather hungry wyrms.

Re:short answer (1)

blackcoot (124938) | more than 8 years ago | (#13919157)

i suppose i should actually answer the question ;-)

most of my parallel programming has been on commodity pc hardware (intel). as a result, i've used a combination of pthreads, compiler auto-vectorization (god bless intel's compiler), and mpi. for the more real time stuff i do now, i use nist's nml as the message layer rather than mpi (i have no idea how they'd compare in terms of performance). almost all my code is in c++ (the occasional piece being in c).

honestly, if you've got the option of using multiple parallel paradigms in the same program, go for it -- mpi for interproc communications, openmp and compiler autovectorization to max out performance on smp nodes.

good luck :-)

Re:short answer (2, Interesting)

cariaso1 (674515) | more than 8 years ago | (#13919317)

From some recent experiences with the mpiblast project, and some much older work at llnl, I've found mpich to be more reliable than lam (one man's limited opinion, a data point not a rule). Also I think it should be more clear that mpiblast is perfectly usable in numa architectures. On first read of the parent I thought this was being ruled out. When debugging in parallel, Totalview is a godsend, or was the last time I needed/could afford it. For geek points I'd have to agree that the worms remind me of Sarlacc.

Re:short answer (1)

David Greene (463) | more than 8 years ago | (#13920733)

uniform access shared memory (think the bigass (tm) cray machines)

The only recent Cray machine that I am aware of that had uniform access to memory was the MTA2 which used an address scrambling scheme to spread references throughout the memory system so there would be no hot-spots. It also meant that no memory was "local" either.

The current vector and MPP lines are either distributed memory (shared address space, non-uniform latency) or message-passing (no shared address space).

here you'd typically use mpi (if your programs are supposed to be portable) or the local threading library + vectorized / parallelized math libraries. since it's all in a single memory space, it's "as simple" as just doing a good job multithreading the program.

MPI is not simple. It requires the programmer to explicitly state the communication among processors. UPC and Co-Array Fortran simplify this and are probably closest to what the programmer would really like to write. The fact that they are missing from the survey is a huge gaping hole that for me calls into question whatever results are compiled. CAF will be in the 2008 standard. UPC isn't as mature yet but will probably be proposed for the standard in a few years.

You're absolutely right that the "best" toolkit depends heavily on the machine architecture. MPI is generally optimized for large messages because the assumption is that communication is expensive. UPC and CAF are optimized for small messages, with an underlying assumption that the machine supports fast remote addressing.
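One way to see why message size matters is the common latency-plus-bandwidth cost model, where sending one message costs roughly latency + bytes / bandwidth. The sketch below is only an illustration of that model: the function name `transfer_time` is hypothetical, and the numbers (10 microseconds of latency, 1 GB/s of bandwidth) are made up for the example, not measurements of any machine mentioned in this thread.

```python
def transfer_time(num_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Total time to send num_messages messages under a latency + size/bandwidth model."""
    per_message = latency_s + bytes_per_message / bandwidth_bytes_per_s
    return num_messages * per_message

LATENCY = 10e-6          # hypothetical per-message latency: 10 microseconds
BANDWIDTH = 1e9          # hypothetical bandwidth: 1 GB/s
TOTAL_BYTES = 8_000_000  # one million 8-byte values

# Same payload, sent either as one large message or as a million small ones.
one_big = transfer_time(1, TOTAL_BYTES, LATENCY, BANDWIDTH)
many_small = transfer_time(1_000_000, 8, LATENCY, BANDWIDTH)

print(f"one large message:  {one_big:.6f} s")
print(f"million small ones: {many_small:.6f} s")
```

Under these assumed numbers the per-message latency dominates the small-message case, which is why a library that expects expensive communication favors batching. Hardware with fast remote addressing shrinks the latency term, and that is the situation where many small transfers become reasonable.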

Re:short answer (1)

StevisF (218566) | more than 8 years ago | (#13931322)

I've spoken with an SGI engineer about the Altix systems and he said all the memory and communications considerations are automatically handled in hardware. The system will automatically pick the closest memory and processors to work together. Writing programs on it is no different than writing programs on any SMP machine.

Parallel languages. (0)

Anonymous Coward | more than 8 years ago | (#13919240)

Erlang, Mumps, Oberon


Chuck Moore's "Sea of Processors"

u++ (4, Informative)

p2sam (139950) | more than 8 years ago | (#13919393)

We had to use this for school assignments way back when. It ain't bad. A lot more feature-ful than basic pthreads.

MPI, Co-Array Fortran, & UPC (4, Informative)

Salis (52373) | more than 8 years ago | (#13919588)

MPI is the de facto standard for processor to processor communication, with MPICH's implementation being the most stable and well known one. For "lower-level" communication, you can also use UPC or Co-Array Fortran, which are often used on serious computing architectures, like the Cray X1. The difference between MPI and these language-level parallel additions is that, on the language level, the transfer of data between processors looks like assignment between variables, where one of the dimensions of the variables includes the processor identities themselves.

So, in MPI, to send data from processor 0 to processor 1, the 0 processor would call a function

Call MPI_Send(dataout, datacountout, datatype, destination processor #, ...)
(Fortran style)

which must match an MPI_Recv call in processor 1's executing program.

In Co-Array Fortran, OTOH, it would look like

data[1] = data[0]

The fun part about Co-Array Fortran is that 'data' can be defined as a regular multi-dimensional array so that data(1:10,1:20)[1] = data(41:50,61:80)[0] is perfectly ok (the two sections just need the same shape) _and_ the 'processor dimension', denoted by the []'s in Co-Array, can also be accessed using Fortran notation so that data[1:100] = data[0] is perfectly ok too. Or even data[2:100:2] = data[0] for only even processors.

In truth, a Co-Array Fortran compiler will probably turn the language-level additions into MPI function calls (because that's the standard), but I find CAF to be more elegant than MPI.

UPC is similar to Co-Array Fortran, but for C. I've never used it before, though.

Google Co-Array Fortran or UPC for more information.
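To make the contrast concrete, here is a minimal Python sketch of the explicit, matched send/receive style described above. It does not use MPI: two threads stand in for processors 0 and 1, and one `queue.Queue` per destination stands in for the interconnect. The names `send`, `recv`, `rank0`, `rank1`, and `mailboxes` are all hypothetical.

```python
import queue
import threading

# Hypothetical stand-in for the interconnect: one mailbox per "processor".
mailboxes = {0: queue.Queue(), 1: queue.Queue()}

def send(data, dest):
    """Explicit send: the sender names the destination, in the spirit of MPI_Send."""
    mailboxes[dest].put(data)

def recv(me):
    """Explicit receive: blocks until some send has delivered data to this rank."""
    return mailboxes[me].get()

results = {}

def rank0():
    dataout = [1.0, 2.0, 3.0]
    send(dataout, dest=1)             # must be paired with a recv on rank 1

def rank1():
    results["received"] = recv(me=1)  # the matching receive

threads = [threading.Thread(target=rank0), threading.Thread(target=rank1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results["received"])
```

In the Co-Array style, the same transfer collapses into a single assignment, with no receive call written on the other side.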

OpenMP (1)

foooo (634898) | more than 8 years ago | (#13920138)

I found this article on MSDN... nMP/default.aspx

Re:OpenMP (0)

Anonymous Coward | more than 8 years ago | (#13920494)

OpenMP is very good and is the de-facto standard for shared memory architectures. As previous posters have mentioned, there are different multi-CPU architectures out there. For shared memory, OpenMP is nice. For distributed memory, MPI is nicest IMHO.

OpenMP is much easier to deal with than MPI. It has a simpler interface, using #pragma in C and C++ to detail how the parallelization will be done. No library calls and no manual memory management like in MPI.

If you have a loop that can be parallelized, you can just do something like (sorry this is likely incorrect syntax, but off the top of my head, you should get the idea)....

void myfunc(float *a, float *b, float *c){ .....
  int x=10;
  int y;  //declared before the pragma so that private(y) can name it
  //parallel area starts here.
  //shared() specifies which variables are in shared memory.
  //private() makes each thread get its own copy (beware initialization).
  //firstprivate() is the same as private, but each copy is initialized from the value outside the parallel area.
#pragma omp parallel for shared(a,b,c) private(y) firstprivate(x)
  for(y=0;y<100;++y){
    //independent per-iteration work goes here
  } //end parallel area
  .....
}

something like that, anyway. You tell the compiler that some loop can be parallelized, then it does the rest. At run-time you can specify the number of threads to create, how to schedule the processing, etc. It's all pretty nice, and can scale really well if you know what you're doing.

I've used this on some pretty big SUN clusters, and it is sweet. Wall-clock time can drop like a rock if your program is well suited to this type of approach.
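For readers who want something they can run without an OpenMP compiler, here is a rough Python analogue of the same idea: the loop iterations are independent, so they are split into chunks and handed to a pool of workers. This is not OpenMP, and `parallel_for` and `body` are hypothetical names. Note also that CPython threads will not speed up CPU-bound arithmetic like this, so the sketch shows the shape of the approach only.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(body, n, num_threads=4):
    """Run body(lo, hi) over [0, n) split into contiguous chunks, one per worker."""
    chunk = max(1, (n + num_threads - 1) // num_threads)
    ranges = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() waits for every chunk and re-raises any worker exception.
        list(pool.map(lambda r: body(*r), ranges))

n = 100
x = 10                     # read-only inside the loop, like firstprivate(x)
a = [float(i) for i in range(n)]
b = [1.0] * n
c = [0.0] * n              # shared output; each iteration writes its own slot

def body(lo, hi):
    for y in range(lo, hi):   # y is local to each worker, like private(y)
        c[y] = a[y] + x * b[y]

parallel_for(body, n)
print(c[:3])
```

The work inside `body` is an arbitrary placeholder. What matters is that no iteration reads a value another iteration writes, which is the same condition the pragma relies on.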

Multi core - "Parallel Computing" (4, Informative)

Heretik (93983) | more than 8 years ago | (#13921443)

Making the jump from multi-core CPUs being available to massive things like clusters, MPI, etc. is a bit of a leap.

Multi-core chips in a typical commodity machine (shared memory, same address space, etc) just means you have multiple threads of execution, but everything else is pretty much the same at the application coding level.

If you're working on an app and want to take advantage of multi-core (or SMP), you just need to have a well threaded app, using the native threading libs (ie pthreads) - nothing fancy. Clusters and big non-shared-memory type supercomputers are a different story altogether from something like a dual-core Athlon.

Re:Multi core - "Parallel Computing" (1)

passthecrackpipe (598773) | more than 8 years ago | (#13921933)

Even a dual athlon (multi-core or not) is likely to have an internal NUMA architecture. I just bought some new servers that have exactly those properties.

Re:Multi core - "Parallel Computing" (2, Informative)

Tune (17738) | more than 8 years ago | (#13922144)

>Even a dual athlon (multi-core or not) is likely to have an internal NUMA architecture.

Yes, but the grandparent post still holds, in that there's hardly a difference between a well threaded app on a single processor and one on a shared memory/NUMA multiprocessor. That kind of parallelism stops scaling at 4, maybe 8 cores.

From there on, memory/communication bandwidth becomes the bottleneck, and adding more cores does not change speed. That's where the big decisions need to be made at the application level. That's where programs become distributed programs. Clustering, grids and massively parallel problem solving impact not just hardware, but all levels of systems design, including application programming.

High performance parallel programming typically involves finding the best trade-off between algorithms that are good from a theoretical complexity viewpoint and algorithms that are easily distributed, finding a balance between cpu use and communication use, network topology, scalability, schedulability and numerous other buzz words that are relatively meaningless in desktop/server size single, dual, quad core shared memory/NUMA architectures. Parallel is BIG.

I loved our T3E (1)

Xner (96363) | more than 8 years ago | (#13922332)

Too bad that Cray retreated to more conventional designs.

Re:I loved our T3E (1)

convolvatron (176505) | more than 8 years ago | (#13929575)

actually, speaking as someone who was involved in later
cray products, sgi killed the t3e.

the merger agreement with tera specifically constrained
cray from making a followon machine.

not that cray doesn't have problems....

Google's MapReduce (1)

TheLink (130905) | more than 8 years ago | (#13924267)

"MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system."
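As an illustration of the model quoted above, here is a single-process Python sketch of a word count expressed as a map function and a reduce function. It is not Google's API and does no partitioning, scheduling, or failure handling. The names `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (document name, text) -> list of intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one key."""
    return sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: each input pair is independent, so a real runtime
    # could run these calls on different machines.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)  # group values by intermediate key
    # Reduce phase: each intermediate key is independent as well.
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in intermediate.items()}

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
counts = map_reduce(docs, map_fn, reduce_fn)
print(counts["the"])
```

Because neither function touches shared state, the loop over inputs and the loop over keys are the two places where a runtime is free to spread work across machines.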

Prolog (1)

oliverthered (187439) | more than 8 years ago | (#13926108)

Prolog is quite easy to use and is inherently parallel, that's why they chose it for the 5th Generation computing project that never really got anywhere (possibly because it was way before its time).

Shell with pipes (1)

PostItNote (630567) | more than 8 years ago | (#13930088)

If you have data that can be incrementally processed, then shell scripts with pipes can bring about a high degree of speedup
process1 | process2 | process3
with all three of them running on different processors means that your program can get up to a 3x speedup for free! No MPI/PVM/pthreads/etc required!

(Note: the program chain will complete in time roughly proportional to the time of the slowest link. This trick only works when each program doesn't need to read in all the data before it finishes processing.)
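The same streaming structure can be sketched in Python, with one thread per stage and queues playing the role of the pipes. This is an assumption-laden illustration, not a replacement for the shell version: in the shell each stage is a separate OS process that the kernel can place on its own processor, while CPython threads mostly interleave. The names `stage`, `run_pipeline`, and `DONE` are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def stage(func, inbox, outbox):
    """Apply func to each item as it arrives and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate end-of-stream to the next stage
            return
        outbox.put(func(item))

def run_pipeline(items, funcs):
    # One queue in front of each stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for item in items:
        queues[0].put(item)   # later stages can start before all input is fed
    queues[0].put(DONE)
    out = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        out.append(item)
    for w in workers:
        w.join()
    return out

# Three stages, analogous to process1 | process2 | process3.
result = run_pipeline(["a b", "c d e"], [str.upper, str.split, len])
print(result)
```

Each stage handles items one at a time, which is the property the note above depends on: a stage that had to read all of its input before emitting anything would stall everything downstream.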

OpenMP (0)

Anonymous Coward | more than 8 years ago | (#13931429)

For multi-core you should certainly look at OpenMP before you start pthreading your code yourself.

OpenMP is a set of compiler directives which allow the compiler to handle the messy aspects of thread parallelism so that you don't have to. Adding OpenMP directives to a code is much faster than adding pthreads to it.