Memory Management Technique Speeds Apps By 20%

kdawson posted more than 4 years ago | from the remembrance-of-data-past dept.

Programming 252

Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." InformationWeek has a few more details from an interview with Solihin.


Beware the key term there: (5, Insightful)

Estanislao Martnez (203477) | more than 4 years ago | (#31743272)

Beware the key term there: "up to."

Re:Beware the key term there: (1)

Jurily (900488) | more than 4 years ago | (#31743642)

OK, so they run memory allocation in a separate thread. What exactly does the other thread do while the mm thread is running, and if it blocks like I think, how does that speed anything up?

The idea sounds fun, but this approach requires a rewrite to make it usable, like most everything else out there.

The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.

Re:Beware the key term there: (3, Interesting)

Idiomatick (976696) | more than 4 years ago | (#31743734)

Why couldn't it be applied at the compiler option level? A checkbox and recompile isn't so terrible. It could probably be done at the OS level but it'd be more of a pain.

Re:Beware the key term there: (3, Informative)

Anonymous Coward | more than 4 years ago | (#31743798)

The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.

It would be nice if Unix/Linux had easier and better support for asynchronous I/O.

Operating systems like VMS made all I/O asynchronous by default, with optional flags or function wrappers that would wait for an I/O to complete before returning if you really wanted synchronous I/O. You could even designate a user function to run asynchronously (without the drawbacks and limitations of Unix signals) whenever any specific I/O completed.

Much simpler than Linux, where you have to use completely different and complex mechanisms if you want to use something better than synchronous read()/write() calls.

http://en.wikipedia.org/wiki/QIO [wikipedia.org]
http://en.wikipedia.org/wiki/Asynchronous_System_Trap [wikipedia.org]
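
For what it's worth, Linux does have POSIX AIO, it's just one of those "completely different mechanisms" rather than the default. A minimal sketch, assuming the <aio.h> interface (glibc's implementation runs the I/O on helper threads behind the scenes; link with -lrt on older glibc):

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) return 1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    aio_read(&cb);                  /* queues the read and returns immediately */

    /* ... overlap other work here while the read is in flight ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);     /* block only when the data is actually needed */

    ssize_t n = aio_return(&cb);    /* completion result, like read()'s return value */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}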

Re:Beware the key term there: (3, Informative)

EvanED (569694) | more than 4 years ago | (#31744222)

Operating systems like VMS made all I/O asynchronous by default...

This is mostly true in Windows too, actually, given NT's strong VMS inspirations. From what I understand, drivers implement (only) asynchronous I/O calls, and the read/write system calls (NtReadFile and NtWriteFile) contain a specification of whether the operation should be asynchronous or synchronous. If synchronous, a higher level of the kernel handles that aspect itself, without the driver knowing anything about what's going on.

(I think this is more-or-less correct anyway.)

Re:Beware the key term there: (3, Informative)

Anonymous Coward | more than 4 years ago | (#31744494)

Clearly you didn't read the paper.

One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.

The MM thread preallocates blocks the application is likely to ask for, so that when the application asks for its 300th small block for storing window coordinates or whatever, the memory manager thread can instantly say "here, I had that". It also batches deallocations and does them out-of-band, while the application continues running.
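
For illustration, a rough sketch of the "here, I had that" half of the idea (in C, with made-up names like block_cache and fast_alloc; this is not the paper's code, which uses per-thread buffers and avoids the lock on the fast path):

#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

#define BLOCK_SIZE  64      /* one size class, e.g. small window-coordinate structs */
#define CACHE_SLOTS 256

static void *block_cache[CACHE_SLOTS];
static int   cache_top;                      /* number of blocks currently cached */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by the memory-management thread whenever the cache runs low. */
static void cache_refill(void)
{
    pthread_mutex_lock(&cache_lock);
    while (cache_top < CACHE_SLOTS)
        block_cache[cache_top++] = malloc(BLOCK_SIZE);
    pthread_mutex_unlock(&cache_lock);
}

/* Called by the application thread: usually just pops a ready block. */
static void *fast_alloc(void)
{
    void *p = NULL;
    pthread_mutex_lock(&cache_lock);
    if (cache_top > 0)
        p = block_cache[--cache_top];
    pthread_mutex_unlock(&cache_lock);
    return p ? p : malloc(BLOCK_SIZE);       /* cache miss: fall back to plain malloc */
}

/* The memory-management thread just keeps the cache topped up. */
static void *mm_thread(void *arg)
{
    (void)arg;
    for (;;) {
        cache_refill();
        sched_yield();                       /* the paper busy-waits instead; simplification */
    }
    return NULL;
}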

Re:Beware the key term there: (2, Interesting)

mdf356 (774923) | more than 4 years ago | (#31744512)

but this approach requires a rewrite to make it usable

A rewrite of part of libc, yes. Change the implementation of malloc(3) and link with -lpthread and you're pretty much done.

I don't see how spinning off malloc(3) calls would help anything, but if there's unused CPUs then clearly free(3) can be done by another thread.

Re:Beware the key term there: (2, Informative)

SanityInAnarchy (655584) | more than 4 years ago | (#31744624)

Actually, I can see how malloc would help, if you assume that they're always allocating small-ish amounts -- just keep a certain amount reserved, or don't free stuff instantly (in case it's about to be reallocated).

However, all of this seems very much like it could be done either inside libc (as proposed) or in the kernel, without having to touch existing apps, at least on platforms like Linux where libc is shared.

Re:Beware the key term there: (4, Insightful)

mswhippingboy (754599) | more than 4 years ago | (#31744796)

What you are missing (as are most of the posters so far) is that there is considerable overhead involved in the actual management of the memory in terms of keeping track of what memory is free or allocated. This is outside the issue of maintaining locks. Moving this management overhead to a separate thread allows the otherwise single threaded app to take advantage of additional cores without any code changes. This does not appear all that novel however as modern garbage collectors do this today.

Re:Beware the key term there: (4, Insightful)

Spatial (1235392) | more than 4 years ago | (#31744092)

I like to mentally replace that with the actual meaning: "between 0 and".

It could allow software applications to run between 0 and 20% faster!

Re:Beware the key term there: (5, Insightful)

Georules (655379) | more than 4 years ago | (#31744104)

You might consider mentally replacing it with the sad reality that it might be between 0 and x faster AND it could also be infinitely slower.

Re:Beware the key term there: (1)

gmhowell (26755) | more than 4 years ago | (#31744248)

Why limit yourself? With enough overhead from the memory management, you could get it to go some negative amount faster.

Re:Beware the key term there: (0)

Anonymous Coward | more than 4 years ago | (#31744298)

Not really. Where do you get your lower bound from?

It actually means between minus infinity and 20% faster.

Re:Beware the key term there: (1)

nikanth (1066242) | more than 4 years ago | (#31744660)

Up to 20% may not actually mean 0 to 20%; it could be -infinity to 20%.

Re:Beware the key term there: (0)

Anonymous Coward | more than 4 years ago | (#31744842)

I like to mentally replace that with the actual meaning: "between 0 and".

It could allow software applications to run between 0 and 20% faster!

Why assume the lower bound is 0%? I read it as "between -100% and 20% faster"!

Obligatory Penny Arcade (2, Interesting)

Yvan256 (722131) | more than 4 years ago | (#31744272)

The Fivefold Mother [penny-arcade.com]

Just remember to be aware of multi PROCESSOR (1)

TheSunborn (68004) | more than 4 years ago | (#31743274)

Sounds nice, but I hope the library also handles multiprocessor systems, where each processor has its own RAM block. You don't want one CPU to allocate memory that is used by another CPU.

Do operating systems even have support to say "these two threads should run on the same processor, but I don't care which one"?

Re:Just remember to be aware of multi PROCESSOR (1)

WrongSizeGlass (838941) | more than 4 years ago | (#31743376)

What platform is this library available for? I didn't see it in the article.

How much does this slow down an application that's running on a single CPU with a single core? Splitting memory allocation off into its own thread will negatively impact performance when running on many of the existing desktops out there.

Re:Just remember to be aware of multi PROCESSOR (1, Troll)

sys.stdout.write (1551563) | more than 4 years ago | (#31743474)

wrapped in a [sic] ugly brown robe and a poorly draped orange sarong

Ah, the Ubuntu color scheme!

Re:Just remember to be aware of multi PROCESSOR (1)

Idiomatick (976696) | more than 4 years ago | (#31743788)

It is purple now you insensitive clod!

Re:Just remember to be aware of multi PROCESSOR (1)

jisatsusha (755173) | more than 4 years ago | (#31743466)

Windows does - SetThreadIdealProcessor() [microsoft.com] .

Re:Just remember to be aware of multi PROCESSOR (1)

maxwell demon (590494) | more than 4 years ago | (#31744468)

That's not what he requested. It lets you say "I want this thread to run on that core." What he wanted was a way to say "I want this thread to run on whatever core that other thread runs on, without caring which core it is."
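
For the curious, on Linux the closest approximation is probably to copy the other thread's affinity mask; a sketch assuming glibc's non-portable pthread_getaffinity_np/pthread_setaffinity_np (note this only matches the allowed CPU set, not the exact core the scheduler happens to have picked at that instant):

#define _GNU_SOURCE                 /* for the non-portable affinity calls */
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the same set of CPUs "other" may run on. */
int follow_thread(pthread_t other)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (pthread_getaffinity_np(other, sizeof mask, &mask) != 0)
        return -1;
    return pthread_setaffinity_np(pthread_self(), sizeof mask, &mask);
}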

Re:Just remember to be aware of multi PROCESSOR (4, Informative)

Mad Merlin (837387) | more than 4 years ago | (#31744188)

The type of system you're talking about is NUMA (Non-Uniform Memory Access), and yes, any OS worth its salt has supported it automagically for years. I think even Windows advertises support for NUMA now; whether it actually works is another question.

Nothing to see here.... (5, Insightful)

Ancient_Hacker (751168) | more than 4 years ago | (#31743284)

Nothing to see here...

Moving malloc() to a separate thread does not do a thing for the putative word processor.

They might get some speedup if they take a lousy old malloc() and have one thread hold onto the locks.

But of course the *right* way would be to write a new malloc() that can from the get-go run re-entrantly and not require a bevy of slow locks.

Re:Nothing to see here.... (5, Funny)

Anonymous Coward | more than 4 years ago | (#31743354)

new malloc()

I see what you did there.

Re:Nothing to see here.... (5, Interesting)

Anonymous Coward | more than 4 years ago | (#31743448)

Exactly, and they are even comparing it to the old and relatively slow Doug Lea allocator.

If you want to test a new memory allocator, the benchmarks to beat these days are the Hoard allocator and the TCMalloc allocator from Google. These alone will give you more than the 20% speedup mentioned in the article.

However, that isn't the end of the story. There are proprietary allocators, like the Lockless [locklessinc.com] memory allocator, that are about twice as fast as the older allocators which aren't optimized for multi-core machines.

Re:Nothing to see here.... (3, Insightful)

wealthychef (584778) | more than 4 years ago | (#31744484)

But how much of your time is spent allocating memory? If you spend 5% of your time in malloc(), doubling its speed saves you 2.5% of your execution time.

Re:Nothing to see here.... (4, Informative)

kscguru (551278) | more than 4 years ago | (#31743670)

Indeed. This "technique" appears to be nothing more than replacing a poorly-locked malloc() implementation with a good malloc() implementation that has better locks and (probably) does some work speculatively.

With a proper malloc() implementation, locks are NOT highly contended and a thread doing malloc() does not lock out other threads for a long period of time. In theory, the overhead of managing the queueing / signalling to offload work to a malloc()-thread should be higher than the overhead of doing a proper malloc() in the first place - if it's not, then the original malloc() implementation is sub-optimal. Modern malloc() implementations use slabs, thread-local caches, and other tricks to avoid slow paths - they really aren't that inefficient; there isn't "20% CPU time" left to squeeze out unless your baseline is a non-optimal malloc() in the first place.

Which leads me to conclude that they are doing speculative execution: use an extra thread to pre-expand caches, pre-fault pages, pre-grow the heap segment, and burn a bunch of CPU time on another thread to do it. Speculative execution is, after all, the sexy research area nowadays (especially for some EE researchers who like to publish "Foo runs 20% faster!" papers while ignoring the larger systemic slowdown they induced) - speculative execution only works when hardware is idle, and in the current climate of low-power computing, it's cheaper to turn off idle hardware than use it for speculative execution.

And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup, and I think they are more likely to solve P=NP than to have actually made malloc() take less total work ... which is why I'm so convinced this is just speculative execution, the only way to do less work is to guess what that work was beforehand and burn another thread doing it.

Now, maybe the paper is more restrained in its claims and it's just the journalist hyping this up. But if this is the hyped-up work coming out of a CS department, I wouldn't want to go there...

Re:Nothing to see here.... (2, Insightful)

AuMatar (183847) | more than 4 years ago | (#31743914)

Wouldn't it be rather trivial to write a lockless malloc? Just have every thread allocate its own memory and maintain its own free list- problem solved.
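
Something like this, presumably; a sketch using GCC-style __thread storage. The catch is that it only stays lock-free if each block is freed by the thread that allocated it, which is exactly the producer/consumer problem raised further down.

#include <stdlib.h>

#define BLOCK_SIZE 128                            /* one size class, for simplicity */

struct freenode { struct freenode *next; };

static __thread struct freenode *tl_free_list;    /* one list per thread, no lock */

void *tl_alloc(void)
{
    struct freenode *n = tl_free_list;
    if (n) {                                      /* fast path: reuse a cached block */
        tl_free_list = n->next;
        return n;
    }
    return malloc(BLOCK_SIZE);                    /* slow path: the real allocator */
}

void tl_free(void *p)
{
    struct freenode *n = p;                       /* reuse the freed block as the list node */
    n->next = tl_free_list;
    tl_free_list = n;
}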

Re:Nothing to see here.... (4, Interesting)

b4dc0d3r (1268512) | more than 4 years ago | (#31744100)

Have every thread allocate its memory from... what? At some point either the operating system or the runtime has to lock something so it doesn't get interrupted, or turn off all interrupts and switching for a few hundred cycles so it can do the allocation. Usually the runtime requests reserved pages far in excess of what it needs, and then doles out pieces, committing them as needed. You need 2k, so the runtime gets 4MB and commits 32k page(s). Next time you need more, then the runtime just returns more of the 32k block.

The operating system has to lock its list for the runtime, and/or the runtime does the same for the program. Someone's locking something somewhere.

Re:Nothing to see here.... (0)

Anonymous Coward | more than 4 years ago | (#31744142)

Most CPU architectures support atomic operations - that's all you need to build lock-free architectures. No disabling of interrupts required.
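
For example, pushing a freed block onto a shared list needs nothing more than a compare-and-swap; a sketch with C11 <stdatomic.h>. (The matching lock-free pop is harder because of the ABA problem, which is one reason real allocators prefer per-thread lists or tagged pointers.)

#include <stdatomic.h>

struct node { struct node *next; };

static _Atomic(struct node *) free_head;          /* head of a shared free list */

void lockfree_push(struct node *n)
{
    struct node *old = atomic_load(&free_head);
    do {
        n->next = old;                            /* link to the current head... */
    } while (!atomic_compare_exchange_weak(&free_head, &old, n));
    /* ...and swing the head to us, retrying if another thread got there first */
}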

Re:Nothing to see here.... (0)

Anonymous Coward | more than 4 years ago | (#31744558)

In this performance context an atomic operation is not that much better than a lock. It's a locked bus cycle either way, and too many of them will cause scalability problems.

Re:Nothing to see here.... (0)

Anonymous Coward | more than 4 years ago | (#31744506)

Have every thread allocate its memory from... what? At some point either the operating system or the runtime has to lock something so it doesn't get interrupted

For Fsck Sake.

Clearly your parent is saying that each thread should allocate and manage its own memory at thread startup time.

Re:Nothing to see here.... (1)

Tenareth (17013) | more than 4 years ago | (#31744820)

If each thread allocates its own memory then it is just returning to LWP, which was good a while back, but threading should try to avoid allocating its own memory except in a few specific instances.

However, based on this thread most people don't know how CPUs work.

Re:Nothing to see here.... (0)

Anonymous Coward | more than 4 years ago | (#31744490)

And that's similar to what good mallocs do. What makes it tricky is the need to balance the per-CPU pools.

A very common multithreaded case is that you have a producer thread (calling malloc a lot) and a consumer thread (which frees all of those objects). If you don't rebalance the pools, the consumer thread will end up with all of the system's memory on its per-CPU free list.

The best-of-breed mallocs deal with this by hitting the per-CPU pool in the 99% case, but they still need to occasionally synchronize to balance the free lists.

This is on top of all of the other considerations that make building a great malloc tricky (fragmentation, NUMA-affinity, bounding wasted RAM, ...)

Re:Nothing to see here.... (2, Interesting)

martin-boundary (547041) | more than 4 years ago | (#31743928)

And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup

There's theoretical good and then there's practical good. A good rule of thumb these days is that RAM is the new disk, and most current and legacy software results in huge numbers of CPU stalls. If those stalls can be converted to useful work, even at 2:1 conversion, that's better than having the stalls.

Re:Nothing to see here.... (5, Interesting)

kscguru (551278) | more than 4 years ago | (#31744154)

And now I've read their paper [ncsu.edu] . Quick summary: (1) they do indeed speculatively pre-allocate heap blocks, and cache pre-allocated blocks per client thread. (2) They run free() asynchronously, and batch up blocks of ~200 frees for bulk processing. (3) They busy-loop the malloc()-thread because pthread_semaphore wakeups are too slow for them to see a performance gain (section 2.E.2-3).

In other words, it's a cute trick for making one thread go faster, at the expense of burning 100% of another core by busy-looping. If you are on a laptop, congrats, your battery life just went from 4 hours to 1 hour. On a server, your CPU utilization just went up by 1 core per process using this library. This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers. Idea (2) is interesting (though not innovative); idea (3) makes this whole proposal a non-starter for anything except an academic paper.

Wait-free lock-free data structures (1, Informative)

Anonymous Coward | more than 4 years ago | (#31743330)

You can use a wait-free lock-free data structure to add deallocated memory to a free list to reduce contention on the critical section in a heap. You can also use it to implement the idea presented in the paper.

On Windows, you can use the Interlocked Singly Linked List data structure (see InitializeSListHead).

You can also partition memory allocations over several heaps to reduce contention further.
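
A sketch of the SList approach (the wrapper names are made up; entries must live on MEMORY_ALLOCATION_ALIGNMENT boundaries, hence _aligned_malloc):

#include <windows.h>
#include <malloc.h>              /* _aligned_malloc / _aligned_free */

typedef struct {
    SLIST_ENTRY entry;           /* must be first; needs 16-byte alignment on x64 */
    void       *block;           /* the freed block being handed back to the heap */
} free_item;

static SLIST_HEADER free_list;

void free_list_init(void)
{
    InitializeSListHead(&free_list);
}

void free_list_push(void *block)                  /* wait-free producer side */
{
    free_item *item = _aligned_malloc(sizeof *item, MEMORY_ALLOCATION_ALIGNMENT);
    if (!item) return;
    item->block = block;
    InterlockedPushEntrySList(&free_list, &item->entry);
}

void *free_list_pop(void)                         /* consumer (heap manager) side */
{
    PSLIST_ENTRY e = InterlockedPopEntrySList(&free_list);
    if (!e) return NULL;
    free_item *item = (free_item *)e;
    void *block = item->block;
    _aligned_free(item);
    return block;
}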

Summary (1)

mapuche (41699) | more than 4 years ago | (#31743344)

"The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers."

The release says this technique is for programs difficult to parallelize "by running the memory-management functions on a separate thread".

Re:Summary (2, Funny)

14erCleaner (745600) | more than 4 years ago | (#31743440)

The speedup comes from using a memory-allocation library (PHKMalloc) that does extensive and expensive checking to avoid programmer errors, then basically hides most of its overhead in the second thread (so that the allocation thread would be mostly doing sanity checking). For most programs, this probably won't help performance any. It's an old trick in parallel processing research: pick a slow algorithm, then speed it up via parallelism, rather than starting out with an efficient solution.

I once submitted an April Fool's joke to this effect to a moderated website; I claimed superlinear speedup for sorting by starting with a bubble sort, then "modifying it for parallelism" until it morphed into a quicksort. Alas, the moderator rejected it...

Re:Summary (1)

beakerMeep (716990) | more than 4 years ago | (#31744196)

He didn't reject it, he just moved it to another processor.

Re:Summary (1)

droopycom (470921) | more than 4 years ago | (#31744292)

It's an old trick in parallel processing research: pick a slow algorithm, then speed it up via parallelism, rather than starting out with an efficient solution.

Actually its a very interesting trick.

There are so many slow algorithms in the wild that having a simple method to speed them all up would be very useful.

Ah, yes, this would not be a theoretical breakthrough, but a very practical one indeed... if their claims can be substantiated, of course.

Wow, this is pretty clever (3, Interesting)

Omnifarious (11933) | more than 4 years ago | (#31743356)

I wish I'd thought of it.

Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.
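
To make the point concrete, here's a sketch of why CTR mode precomputation parallelizes (using OpenSSL's low-level AES API purely for illustration, not OpenSSH's actual code): keystream block i depends only on the key and the counter i, so a helper thread can fill in any slice of the keystream ahead of time, leaving just the XOR for the main thread.

#include <openssl/aes.h>
#include <stdint.h>
#include <string.h>

/* Fill keystream[] with blocks first_block .. first_block+nblocks-1.
 * Each worker thread can be handed a disjoint range of block numbers. */
void precompute_keystream(const AES_KEY *key, const unsigned char nonce[8],
                          uint64_t first_block, size_t nblocks,
                          unsigned char *keystream /* 16 * nblocks bytes */)
{
    unsigned char ctr[16];
    for (size_t i = 0; i < nblocks; i++) {
        uint64_t n = first_block + i;
        memcpy(ctr, nonce, 8);                     /* fixed nonce half of the counter block */
        for (int b = 0; b < 8; b++)                /* big-endian 64-bit block counter */
            ctr[15 - b] = (unsigned char)(n >> (8 * b));
        AES_encrypt(ctr, keystream + 16 * i, key); /* keystream block i */
    }
}
/* The ciphertext is then just plaintext XOR keystream, block by block. */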

Re:Wow, this is pretty clever (3, Informative)

nxtw (866177) | more than 4 years ago | (#31743402)

If you want faster AES, just upgrade your CPU [wikipedia.org] .

Re:Wow, this is pretty clever (1)

Omnifarious (11933) | more than 4 years ago | (#31743722)

Well, the Intel AES instructions would benefit even more from parallelized AES CTR mode pre-computation than straight multiple cores, so that doesn't invalidate what I'm saying at all. :-)

Re:Wow, this is pretty clever (3, Insightful)

nxtw (866177) | more than 4 years ago | (#31743770)

Well, the Intel AES instructions would benefit even more from parallelized AES CTR mode pre-computation than straight multiple cores, so that doesn't invalidate what I'm saying at all. :-)

Are your storage and network devices that fast?

Re:Wow, this is pretty clever (1)

x2A (858210) | more than 4 years ago | (#31744502)

"Are your storage and network devices that fast?"

Depends what else the system is doing, surely. Having several clients connected to a server, you want to free up the processor as much as possible for servery duties, any savings you make go to those when running full pelt, or convert to energy savings when you're not.

Re:Wow, this is pretty clever (1)

nxtw (866177) | more than 4 years ago | (#31744536)

Depends what else the system is doing, surely. Having several clients connected to a server, you want to free up the processor as much as possible for servery duties, any savings you make go to those when running full pelt, or convert to energy savings when you're not.

What does that have to do with anything? AES-NI enables AES implementations to encrypt and decrypt much faster than any network and storage system I've ever used can provide data. Parallelizing AES-NI would save energy only in the case when running multiple cores at partial load is more efficient than running just one core at near-full load (which could be the case if all cores run at the same clock speed), and would be faster only if your I/O systems are already faster than a single core can encrypt/decrypt with AES-NI.

Re:Wow, this is pretty clever (1)

maxwell demon (590494) | more than 4 years ago | (#31744516)

I always store on /dev/null. I can tell you, it's super fast! However, I'm experiencing a lot of data loss; maybe the device is broken.

Re:Wow, this is pretty clever (1)

mirix (1649853) | more than 4 years ago | (#31743774)

Of course, you need software that uses the instructions as well.

That said, I've got one of the Via chips with hardware RNG on it, and once I loaded the module for it, /dev/random just spews data. It's an insane improvement over normal (software) /dev/random. I believe it has some other sorts of encryption friendly features, but I haven't played with it much, yet.

Re:Wow, this is pretty clever (4, Informative)

macemoneta (154740) | more than 4 years ago | (#31743760)

Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.

That's already implemented in the high performance ssh patch, available here [psc.edu] . Scroll down to the "Threaded CTR cipher mode" patch, if that's the only part you're interested in.

I've applied it to the openssh package on my Fedora 12 system. As is, it provides about 40% increased throughput on my quad-core. You may be able to get more by tweaking it to increase the thread count.

Re:Wow, this is pretty clever (1)

sowth (748135) | more than 4 years ago | (#31744204)

I was thinking Soekris Engineering's vpn accelerator card [soekris.com] would help, but it appears to only be able to do 250 Mbps. (You wanted 1 gigabit/s, right?)

That card is really old too. I first read about it probably 10 years ago. I don't think it has changed in that time... I wonder if someone makes a faster accelerator? Then again, what about the GPU? Has anyone tried encryption with GPUs before? They've done other supercomputing tasks. A quick search says they have. [nvidia.com]

20%?! (5, Insightful)

temojen (678985) | more than 4 years ago | (#31743372)

If most programs are spending 20% of their time on memory management, something is wrong.

Re:20%?! (1)

TheVoice900 (467327) | more than 4 years ago | (#31743560)

From the paper:

Previous studies show that some C programs spend up to one third of their execution time in dynamic memory management routines such as malloc and free

You can check the PDF for the cited studies.

Re:20%?! (2, Insightful)

naasking (94116) | more than 4 years ago | (#31743736)

Not at all. 20% is a very typical overhead for dynamic memory management. Did you think malloc/free costs nothing?

Re:20%?! (1)

SpazmodeusG (1334705) | more than 4 years ago | (#31743836)

I don't think that's what he was getting at. I think he means you can avoid that much malloc/free-ing.
Memory pooling and allocating outside of tight loops come to mind.

Re:20%?! (4, Informative)

naasking (94116) | more than 4 years ago | (#31743926)

I'm saying that 20% overhead for dynamic memory management is typical of even well-designed programs. Very few programs can take good advantage of efficient bulk-deallocating arenas/regions, and research has shown custom memory pooling schemes are generally no better than malloc/free [umass.edu] .

Re:20%?! (0)

Anonymous Coward | more than 4 years ago | (#31744382)

Define "very few". It depends on the context. Network client/server tasks easily benefit from simple arenas. This is because the work is usually broken down into small request/response cycles. Any competent C programmer will use arenas to preallocate and bulk-free objects during the particular request/response phase. Not just for performance, but because it makes memory management far easier, especially the job of growing and parsing input buffers (the actual task of tracking pointers and not leaking memory is usually quite easy regardless).
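
For example, a per-request arena can be little more than a bump pointer plus a one-shot reset once the response has gone out; a generic sketch, not any particular server's code:

#include <stdlib.h>

struct arena {
    char  *base;
    size_t size;
    size_t used;
};

int arena_init(struct arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base ? 0 : -1;
}

void *arena_alloc(struct arena *a, size_t n)
{
    n = (n + 15) & ~(size_t)15;          /* keep allocations 16-byte aligned */
    if (a->used + n > a->size)
        return NULL;                     /* a real arena would grow or chain blocks here */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

void arena_reset(struct arena *a)        /* the "bulk free" at the end of the request */
{
    a->used = 0;
}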

Re:20%?! (1)

AuMatar (183847) | more than 4 years ago | (#31743940)

And OOP makes things worse in this area- it tends to have a lot of small, short life objects that need to be constructed and destructed. Java is particularly bad at this due to decisions like immutable strings. Those extra object allocations add up quickly.

Re:20%?! (1)

naasking (94116) | more than 4 years ago | (#31744006)

Memory management overhead in GC'd languages is typically around 30%, so not much worse than malloc/free which averages around 15-20%. You gain a lot of productivity for that 10-15% overhead tradeoff though.

Re:20%?! (1)

AuMatar (183847) | more than 4 years ago | (#31744040)

I've personally never seen much if any productivity gain from GC. If anything I've seen a loss- I find that managing memory helps me make better designs, and memory ownership problems are almost always the first sign that there's a major design flaw. Quite frankly memory bugs are rare among decent programmers- I find one a year or so in most places I've worked, and those usually come about due to someone trying to be a bit too clever minimizing memcpy calls. It doesn't even make the top 20 list for bug causes.

But at any rate, I wasn't talking GC vs non-GC, but OOP vs non-OOP. There are non-OOP GC languages and OOP non-GCed languages. I was pointing Java out because particular design decisions in their standard library make the problem even worse- the large number of immutable objects, especially strings. The more objects you make, the more time you spend in constructors, destructors, new, and free regardless of language. A design requiring lots of small objects should be avoided if at all possible, and patterns like flyweight applied if not.

Re:20%?! (1)

naasking (94116) | more than 4 years ago | (#31744346)

You significantly understate the complexities of manual memory management. Any two programs with compatible interfaces written in a GC'd language will compose without leaks, where any two programs with compatible interfaces written with manual memory management will not necessarily compose. This requires diverting valuable resources from development to analyzing the safety of any composition of non-composable programs.

Re:20%?! (1)

Nadaka (224565) | more than 4 years ago | (#31744130)

Besides that, Java running with optimized options under HotSpot can beat C++ at creating/destroying objects versus C++'s alloc/malloc by up to 4 times. In some cases that can push Java to execute faster than C++ (well-crafted C with structs would still beat them both, though).

Re:20%?! (1)

dgatwood (11270) | more than 4 years ago | (#31744172)

Double the overhead is not much worse?

Re:20%?! (1)

naasking (94116) | more than 4 years ago | (#31744326)

By doubling the memory management overhead in the general case, you now get to work with pervasive immutable data, generally not worry about memory leaks, and dramatically simplify the semantics of interface boundaries.

Re:20%?! (1)

Tenareth (17013) | more than 4 years ago | (#31744784)

Yes, most modern programmers do think that way based on most of the code I've seen in the past 10 years.

Re:20%?! (0)

Anonymous Coward | more than 4 years ago | (#31744054)

Today's forecast: a constant cache miss and some bubbles in the pipeline..

Might be particularly applicable to Java (2, Interesting)

Anonymous Coward | more than 4 years ago | (#31743400)

Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules. Java already has some multi-threaded load balancing that occurs automatically, but this algorithm might yield some additional benefits.

Re:Might be particularly applicable to Java (4, Interesting)

SpazmodeusG (1334705) | more than 4 years ago | (#31743584)

Actually I've found the opposite. Java tends to be really good at transparent memory re-use. From experience, if I have ~1,000,000 objects of the same type with some constantly being deleted and replaced, Java will run that program faster than a non-memory-pooled C implementation (of course the memory-pooled C implementation will be faster again).

In fact many of the benchmarks around that you see claiming Java is faster than C will use an example of a program that creates and destroys objects in a tight loop. The C program will be as written with tons of calls to malloc/free, the Java program will simply reuse the same parts of memory again and again without any system calls. These benchmarks are a bit misleading as the C program isn't optimised with memory re-use whereas the Java Garbage collector tends to do that implicitly.

Re:Might be particularly applicable to Java (1)

Nadaka (224565) | more than 4 years ago | (#31744148)

I would assume you are running hotspot with the server parameter? Anything else you do to bump java performance?

Re:Might be particularly applicable to Java (1)

buchner.johannes (1139593) | more than 4 years ago | (#31744734)

The C program will be as written with tons of calls to malloc/free, the Java program will simply reuse the same parts of memory again and again without any system calls. These benchmarks are a bit misleading as the C program isn't optimised with memory re-use whereas the Java Garbage collector tends to do that implicitly.

What would the results be when using boehmgc?

Re:Might be particularly applicable to Java (1, Informative)

Anonymous Coward | more than 4 years ago | (#31743640)

Actually, Java does the complete opposite. Java allocates a large chunk of heap, and then internally allocates its own objects on that heap. Because the majority of objects are short-lived, a generational garbage collector can be used. Also, some allocations are done on the stack. See here [ibm.com]

Re:Might be particularly applicable to Java (1)

glwtta (532858) | more than 4 years ago | (#31743672)

Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules.

That's not my understanding of how Java works. In fact, I can't see any benefit from this approach in Java at all: new object allocation is extremely fast (trivial, really) and heap compacting / garbage collection is already parallelized, and much more optimized than what malloc/free can do.

I may be wrong about this, but the only time Java actually uses malloc is: a) expanding the heap, and b) allocating direct buffers (there's probably some other io-related cases).

Re:Might be particularly applicable to Java (1)

radish (98371) | more than 4 years ago | (#31744010)

Java does basically no mallocs. Obviously there's an initial malloc to allocate the heap to the JVM but after that it's all managed by the JVM itself, and it's been demonstrated as being much faster than a traditional malloc/free approach [ibm.com] . Assuming you set your Xms and Xmx sizes correctly the system malloc implementation is basically irrelevant to Java execution speed.

Re:Might be particularly applicable to Java (1)

Azarael (896715) | more than 4 years ago | (#31744548)

And how does this situation differ other than the fact that the alloc/free operations are done local to the JVM instead of making system calls? The fact that the JVM is doing the work doesn't magically make memory management easier.

The other thing that I'm skeptical about is that the article seems to be contradicted by a more recent paper by the author that they are referencing (see Zorn http://portal.acm.org/citation.cfm?id=582419.582421 [acm.org] ). In the newer paper, Zorn et al. say that custom allocators are less efficient than a modern general purpose allocator.

Uhm, isn't this just garbage collection? (1)

GWBasic (900357) | more than 4 years ago | (#31743418)

Uhm, so what's the big deal? .Net's garbage collector runs in its own thread.

Re:Uhm, isn't this just garbage collection? (1)

jasmusic (786052) | more than 4 years ago | (#31743826)

Plus the allocator is lightning fast because allocation goes in a straight consecutive line rather than looking for empty slots in a predefined grid.

You can malloc it but you can't use it (2, Insightful)

lordlod (458156) | more than 4 years ago | (#31743554)

The article(s) are very scarce on details but it seems like the gains will be limited in most applications. Fundamentally you have to block until the malloc has finished before you can use it. So it helps if you malloc well ahead of time, but not if you malloc as you need it.

A common simplified structure is:

malloc memory
use memory
free memory

With these new innovations you get:

async malloc memory
block until malloc finishes
use memory
async free memory

And free shouldn't take a noticeable amount of time.

Re:You can malloc it but you can't use it (1)

jasmusic (786052) | more than 4 years ago | (#31743930)

Looks just like Windows OVERLAPPED I/O.

Re:You can malloc it but you can't use it (0)

Anonymous Coward | more than 4 years ago | (#31744636)

You're a fucking moron.

Re:You can malloc it but you can't use it (1)

jasmusic (786052) | more than 4 years ago | (#31744674)

Wow that was out of left field, not even worth logging in for!

Re:You can malloc it but you can't use it (3, Informative)

TheVoice900 (467327) | more than 4 years ago | (#31744000)

The PDF of the paper has all the details. The article is just fluff.

It's programmers that need parallelization (5, Insightful)

w0mprat (1317953) | more than 4 years ago | (#31743596)

Because we learnt to program for a single-threaded core with its single processing pipeline since way back, using high-level languages that pre-date the multi-threaded era, and it involves re-thinking how things are done on a fundamental level if we're ever to make proper use of 32, 64, 128 cores. Oh, and we all know how many programmers are 'get off my lawn' types, myself included.
 
If I still coded much anymore it would drive me to drink.

Re:It's programmers that need parallelization (2, Funny)

nextekcarl (1402899) | more than 4 years ago | (#31743710)

...If I still coded much anymore it would drive me to drink.

Maybe that's my problem? If I started drinking maybe I could handle it [programming for other people] again.

Re:It's programmers that need parallelization (-1, Redundant)

Anonymous Coward | more than 4 years ago | (#31743804)

...If I still coded much anymore it would drive me to drink.

Maybe that's my problem? If I started drinking maybe I could handle it [programming for other people] again.

Obligatory: http://xkcd.com/323/ [xkcd.com]

Re:It's programmers that need parallelization (0)

Anonymous Coward | more than 4 years ago | (#31744096)

I think for most programmers it's not so much that they have problems mentally modeling for multithreading as it is that people have problems modeling in general, and multithreading shows their slop more clearly. It's like watching people code when you remove garbage collection from their programming environment. As with garbage collection, though, the fact that tools to help with multithreading will also probably hide poor design skills isn't a reason to be against them.
 
Btw, the article is pretty worthless and, as far as I could bother reading it, it sounds like it's far from revolutionary.

Re:It's programmers that need parallelization (0)

Anonymous Coward | more than 4 years ago | (#31744132)

On my first job, 15 years ago, I wrote multi-threaded programs for OS/2 running on a dual 100MHz Pentium that had a price tag of a small apartment.

It's nothing "new". Now get off my lawn.

So what's the big deal here? (1)

Edgewize (262271) | more than 4 years ago | (#31743612)

As far as I can tell, the author of this paper just figured out a way to offload a bunch of memory management tasks to an idle CPU core, and then counted it as a performance gain. OK?

So, monolithic single-threaded applications can be made faster on multi-core systems, at the cost of bulk-allocating more memory than is actually required, and only if there is enough idle CPU to run your memory-management thread in realtime on another core. I am not exactly blown away.

The press release tries to trump it up as some kind of major advancement in performance, but it's not a performance gain at all. If you ran four copies of these single-threaded benchmarks simultaneously, you'd probably see a net performance decrease.

Re:So what's the big deal here? (2, Insightful)

Zironic (1112127) | more than 4 years ago | (#31743702)

It's a performance gain because it's extremely rare that all your cores are maxed out at once. If you can distribute the computing power more evenly, it's a performance gain in most circumstances even if the net computing power required increases.

free() is probably more parallelizable than malloc() (1)

JessGras (953965) | more than 4 years ago | (#31743738)

Haven't read the OP: shame on me.

But while it is true that a program which calls malloc() is basically blocked until malloc() returns, the trick here could be that malloc() is made a lot dumber... but free() just puts pointers in a queue "to the other thread", and then the other thread can take its time being smart about making that space available to malloc() again.

OK I'm expressing it badly. Obviously an app which needs memory is blocked until it gets the memory. But the same isn't true of releasing memory: you don't *need* your free()ed blocks to be immediately available. So maybe that's what they do: backload the joint malloc()+free() work into free(), make malloc() really dumb, and shift the free() work off to the other thread via a simple queue.
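
Expressed badly or not, that part is easy to sketch: the app-facing free just enqueues the pointer and returns, and a helper thread drains the queue and does the real work. (Illustrative only; the actual paper batches frees and avoids condition-variable wakeups, but the shape is the same.)

#include <pthread.h>
#include <stdlib.h>

#define QCAP 1024

static void *queue[QCAP];
static int qhead, qtail;                    /* simple bounded FIFO */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

void deferred_free(void *p)                 /* what the app calls instead of free() */
{
    pthread_mutex_lock(&qlock);
    int next = (qtail + 1) % QCAP;
    if (next == qhead) {                    /* queue full: fall back to freeing inline */
        pthread_mutex_unlock(&qlock);
        free(p);
        return;
    }
    queue[qtail] = p;
    qtail = next;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

void *free_worker(void *arg)                /* runs on the spare core */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (qhead == qtail)
            pthread_cond_wait(&qcond, &qlock);
        void *p = queue[qhead];
        qhead = (qhead + 1) % QCAP;
        pthread_mutex_unlock(&qlock);
        free(p);                            /* the expensive bookkeeping happens here */
    }
    return NULL;
}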

Re:free() is probably more parallelizable than mallo (5, Informative)

JessGras (953965) | more than 4 years ago | (#31743782)

Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf [ncsu.edu] , they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.

Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!

Re:free() is probably more parallelizable than mallo (1)

droopycom (470921) | more than 4 years ago | (#31744308)

So basically, they apply garbage collection techniques to a regular malloc/free program.

Not bad, all things considered...

Nothing new here (0)

Anonymous Coward | more than 4 years ago | (#31743780)

The guys that developed SmartHeap http://www.microquill.com/ have been preaching this for years.

How does this compare (0)

Anonymous Coward | more than 4 years ago | (#31744262)

How does this compare to jemalloc, for example? This paper seems to be either very overhyped or plain outright silly. What does this do that other methods for improving allocation, like the Windows Low Fragmentation Heap (LFH) and BSD's jemalloc, don't?

Read that title sensationalistically wrong (0)

Anonymous Coward | more than 4 years ago | (#31744470)

After I scrolled down and then back up the front page, I am genuinely disappointed that this was not a psychology-based article. Imagine slowpokes like me thinking of all the possibilities for a title I parsed as "Memory Management Technique Speeds Apes By 20%" :(

shameless plug (0)

Anonymous Coward | more than 4 years ago | (#31744568)

A good friend of mine has done this, and much better. Check out http://newcodeinc.com/ for a fast, threaded, drop-in replacement for malloc. Yes, it is closed source and commercial. I am slowly influencing him to open source it and gain more customers for support. It is currently being used by several large financial firms who love it.

Multicore for word processing? (0)

Anonymous Coward | more than 4 years ago | (#31744704)

Okay, I can understand multicore for web browsers. But word processors? What kind of insanity does it take to write a word processor that needs more than a late-90s PC to run smoothly, let alone multi-anything?

If your word processor can't run smoothly on a single core, it's not the memory allocator's fault. Jesus.

Does it matter anymore? (2, Interesting)

Tenareth (17013) | more than 4 years ago | (#31744770)

There are very few developers left in the US that even know what memory management is.

People don't even try anymore.
