Extreme Memory Oversubscription For VMs

Soulskill posted more than 4 years ago | from the eXtreme-virtualization dept.

Operating Systems

Laxitive writes "Virtualization systems currently have a pretty easy time oversubscribing CPUs (running lots of VMs on a few CPUs), but have had a very hard time oversubscribing memory. GridCentric, a virtualization startup, just posted on their blog a video demoing the creation of 16 one-gigabyte desktop VMs (running X) on a computer with just 5 gigs of RAM. The blog post includes a good explanation of how this is accomplished, along with a description of how it's different from the major approaches being used today (memory ballooning, VMWare's page sharing, etc.). Their method is based on a combination of lightweight VM cloning (sort of like fork() for VMs) and on-demand paging. Seems like the 'other half' of resource oversubscription for VMs might finally be here."

Leaky Fawcet (1)

suso (153703) | more than 4 years ago | (#33212064)

Given how many programs leak memory, it's amazing that companies get away with oversubscribing memory without running into big issues. And desktop programs are usually the worst of the bunch.

Re:Leaky Fawcet (1)

warewolfsmith (196722) | more than 4 years ago | (#33212138)

Leaky memory? You need NuIO Memory Stop Leak: just pour it in and off you go... NuIO, a Microsoft Certified Product.

Re:Leaky Fawcet (4, Informative)

ls671 (1122017) | more than 4 years ago | (#33212184)

Memory leaks usually get swapped out... your swap usage will grow, but the system will keep going just as fast since those pages will never get swapped in again. I have tried several times to explain that to some slashdotters who bragged about not using any swap space at all nowadays and who called me stupid for reserving a 2 gig swap partition or more on a 4 gig RAM machine that sometimes runs for 2 years between reboots.

Oh well....

Re:Leaky Fawcet (4, Interesting)

Mr Z (6791) | more than 4 years ago | (#33212232)

Sometimes that doesn't work out so well. If you have a fragmented heap with gaps between the leaked items that keep getting reused, it can lead to a lot of strange thrashing, since it effectively amplifies your working set size.

I think that may be one of the things that was happening to older Firefoxes (2.x when viewing gmail, in particular)... not only did it leak memory, it leaked memory in a way such that the leak couldn't just stay in swap.

Re:Leaky Fawcet (1)

buchner.johannes (1139593) | more than 4 years ago | (#33212262)

Sometimes that doesn't work out so well. If you have a fragmented heap with gaps between the leaked items that keep getting reused, it can lead to a lot of strange thrashing, since it effectively amplifies your working set size.

I think that may be one of the things that was happening to older Firefoxes (2.x when viewing gmail, in particular)... not only did it leak memory, it leaked memory in a way such that the leak couldn't just stay in swap.

Wouldn't that be a good exercise for kernels: recording the usage patterns of memory subsections and defragmenting them into segments by usage frequency? If that is not possible at runtime, store the patterns and apply them on the next run.

Or maybe clustering chunks by the piece of code that allocated them would already help. That said, I don't know what malloc's current wisdom is.

Re:Leaky Fawcet (1)

Mr Z (6791) | more than 4 years ago | (#33212376)

The heap is entirely in userspace, and the kernel is powerless to do anything about it.

Imagine some fun, idiotic code that allocated, say, 1 million 2048-byte records sequentially (2GB total), and then only freed the even-numbered records. (I'm oversimplifying a bit, but the principle holds.) Now you've leaked 1GB of memory, but it's spread over 2GB of address space.

The kernel only works in 4K chunks when paging. Each 4K page, though, has 2K of leaked data and 2K of free space. For all the subsequent non-leak allocations that fit in these holes, you effectively "amplify" the footprint due to the leaked data that shares the same 4K page. If you try to use 1GB of space for some actual work within that same process, the working set the kernel's VM sees will look more like 2GB if all the allocations fill the holes.

Make sense?
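As a minimal, hypothetical C sketch of that pattern (the sizes mirror the example above; the exact page-level interleaving depends on the allocator's bookkeeping and layout):

    #include <stdio.h>
    #include <stdlib.h>

    #define NRECS   (1u << 20)   /* ~1 million records              */
    #define RECSIZE 2048         /* 2KB each, roughly 2GB in total  */

    static char *recs[NRECS];

    int main(void)
    {
        /* Fill roughly 2GB of heap with 2KB records. */
        for (size_t i = 0; i < NRECS; i++)
            recs[i] = malloc(RECSIZE);

        /* "Leak" the odd-numbered records by freeing only the even ones. */
        for (size_t i = 0; i < NRECS; i += 2) {
            free(recs[i]);
            recs[i] = NULL;
        }

        /* New allocations land in the freed holes, so live data now sits on
         * the same 4KB pages as the ~1GB of leaked records; touching the
         * live half keeps the leaked half resident too. */
        for (size_t i = 0; i < NRECS; i += 2)
            recs[i] = malloc(RECSIZE);

        puts("heap now interleaves live and leaked 2KB records");
        return 0;
    }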

Re:Leaky Fawcet (1)

ensignyu (417022) | more than 4 years ago | (#33212404)

The kernel can only defragment pages, which are 4KB on most Linux systems. If you have a page with 4080 bytes of leaked memory and 16 bytes of memory that you actually use, accessing that memory will swap in the entire page.

You can't move stuff around within a page because the address would change (moving pages is OK because all memory accesses go through the TLB [wikipedia.org] ), unless you have a way of fixing up all the pointers to point at the new location. That's generally only possible in a type-safe language like Java, so the memory manager can guarantee that it's modifying a pointer and not some arbitrary data. The Java virtual machine can move objects around as part of the garbage collection process, defragmenting the heap in the process.

Clustering by allocation site might work for some applications, but if you allocate, say, a string, it's difficult to tell if the string is going to be freed now, or later, or never, much less whether it'll be freed at the same time as any other objects. It might depend on the input data.

Re:Leaky Fawcet (1, Informative)

sjames (1099) | more than 4 years ago | (#33212254)

Personally, I like to make swap equal to the size of RAM for exactly that reason. It's not like a few Gig on a HD is a lot anymore.

Re:Leaky Fawcet (5, Informative)

GooberToo (74388) | more than 4 years ago | (#33212436)

Unfortunately you're not alone in doing this. It's a deprecated practice that used to make sense, but hasn't made sense in a very long time.

The problem arises when legitimate applications attempt to use that memory. How long does it take to page (read/write) 16GB, 4KB at a time? If a legitimate application that uses large amounts of memory runs away with a bug, it can effectively bring your entire system to a halt, since it will take a long, long time before it runs out of memory.

Excluding Windows boxes (they have their own unique paging, memory/file mapping, and backing store systems), swap larger than 1/4 to 1/2 of RAM is generally a waste these days. As someone else pointed out, sure, you can buy more uptime for leaking applications, but frankly that's hardly realistic. The chances of not requiring a kernel update over the span of a couple of years are slim, unless you care more for uptime than you do for security and/or features and/or performance.

The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was. These days, if you ever come close to needing a 1:1, let alone 2:1, page file/partition, you're not even close to properly spec'ing your required memory. In other words, with few exceptions, if you have a page file/partition anywhere near that size, you didn't understand how the machine was to be used in the first place.

You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved. You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact its on a file system won't matter.

Three decades ago it made sense. These days, it's just silly and begging for your system to one day grind to a halt.

Re:Leaky Fawcet (0, Redundant)

GooberToo (74388) | more than 4 years ago | (#33212504)

Why is a factually accurate, topical, informative, and polite message marked troll?

Re:Leaky Fawcet (1)

akanouras (1431981) | more than 4 years ago | (#33212582)

1. Programs can't use the swapped memory directly. The kernel only swaps parts of memory that haven't been accessed in a while.

2. By swapping out unused (even because of leaks) memory, the kernel has more memory to use for disk caching.

All this has nothing to do with whether your system will grind to a halt today instead of one month later.

And to answer your question, the mod(s) apparently thought this was common knowledge and not worth responding to.

Re:Leaky Fawcet (1)

GooberToo (74388) | more than 4 years ago | (#33212600)

Yes, everything you said is known and understood, but hardly topical.

If the leak is indeed bad enough to justify abusing the VM system to offset it, chances are you'll be suffering from fragmentation and end up on the negative side of the performance curve at some point. It's just silly to believe you'll be running a badly leaking application over a span of years and prefer to hide the bug rather than fix it. There is just nothing about that strategy which makes sense.

So to bring this full circle, the troll moderator was completely wrong. And while your post is well intentioned, it's naive at best. More likely the moderator is completely clueless about the subject matter, or the moderation was done out of spite.

Re:Leaky Fawcet (1)

akanouras (1431981) | more than 4 years ago | (#33212668)

I apologise, I didn't pay enough attention to the context while replying.

Indeed, using swapping for the sole purpose of mitigating memory leaks is wrong.

Re:Leaky Fawcet (0)

Anonymous Coward | more than 4 years ago | (#33212954)

You must be new here

Re:Leaky Fawcet (4, Interesting)

sjames (1099) | more than 4 years ago | (#33212610)

I often see uptimes measured in years. It's not at all unusual for a server to need no driver updates for its useful lifetime if you spec the hardware based on stable drivers being available. The software needs updates in that time, but not the drivers.

In other cases, some of the drivers may need an update, but if they're modules and not for something you can't take offline (such as the disk the root filesystem is on), it's no problem to update.

Note that I generally spec RAM so that zero swap is actually required if nothing leaks and no exceptional condition arises.

When disks come in 2TB sizes and server boards have 6 SAS ports on them, why should I sweat 8 GB?

Let's face it, if the swap space thrashes (yes, I know paging and swapping are distinct, but it's still called swap space for hysterical raisins) it won't much matter if it is 1:1 or .5:1, performance will tank. However, if it's just leaked pages, it can be useful.

For other situations, it makes even more sense. For example, in HPC, if you have a long running job and then a short but high priority job comes up, you can SIGSTOP the long job and let it page out. Then when the short run is over, SIGCONT it again. Yes, you can add a file at that point, but it's nice if it's already there, especially if a scheduler might make the decision to stop a process on demand. Of course, on other clusters (depending on requirements) I've configured with no swap at all.

And since Linux can do crash dumps and can freeze into swap, it makes sense on laptops and desktops as well.

Finally, it's useful for cases where you have RAID for availability, but don't need SO much availability that a reboot for a disk failure is a problem. In that case, best performance suggests 2 equal-sized swaps on 2 drives. If one fails, you might need a reboot, but you won't have to wait on a restore from backup and you'll still have enough swap.

Pick your poison, either way there exists a failure case.

And yes, in the old days I went with 2:1, but don't do that anymore because it really is excessive these days.

Re:Leaky Fawcet (2, Insightful)

GooberToo (74388) | more than 4 years ago | (#33212636)

I often see uptimes measured in years. It's not at all unusual for a server to need no driver updates for its useful lifetime if you spec the hardware based on stable drivers being available. The software needs updates in that time, but not the drivers.

Yes, we've all seen that. It makes for nice bragging rights. But realistically, presuming that one might have a badly leaking application which can never be restarted, and that memory/paging fragmentation won't be a consequence, in order to justify a poor practice is just that: a poor practice. And of course, that completely ignores the fact that there are likely nasty kernel bugs going unfixed. So it means you're advertising a poor practice, which will likely never be required, as an excuse to maintain uptime at the expense of security and/or reliability.

And if you somehow manage to break the odds whereby the poor practice miraculously pays off, you can always create a paging file.

Re:Leaky Fawcet (0, Redundant)

somersault (912633) | more than 4 years ago | (#33213242)

I'm assuming you've already heard of it, but you can use something like ksplice to patch up the kernel on the fly. It's not necessary to skip updates even if you want 100% uptime.

Re:Leaky Fawcet (4, Insightful)

vlm (69642) | more than 4 years ago | (#33213510)

When disks come in 2TB sizes .... why should I sweat 8 GB?

You are confusing capacity problems with throughput problems. Sweat how poor performance gets when 8 gigs is thrashing.

The real problem is the ratio of memory access speed vs drive access speed has gotten dramatically worse over the past decades.

Look at two scenarios with the same memory leak:

With 8 gigs of glacially slow swap, true everything will keep running but performance will drop by a factor of perhaps 1000. The users will SCREAM. Which means your pager/cellphone will scream. Eventually you can log in, manually restart the processes, and the users will be happy, for a little while.

With no/little swap, the OOM killer will reap your processes, which will be restarted automatically by your init scripts or equivalent. The users will notice that maybe, just maybe, they had to click refresh twice on a page. Or maybe it seemed slow for a moment before it was back to normal speed. They'll probably just blame the network guys.

End result, with swap means long outage that needs manual fix, no swap means no outage at all and automatic fix.

In the 80s, yes you sized your swap based on disk space. In the 10s (heck, in the 00s) you size your swap based on how long you're willing to wait.

It takes a very atypical workload and very atypical hardware for users to tolerate the thrashing of gigs of swap...

Re:Leaky Fawcet (1)

Just Some Guy (3352) | more than 4 years ago | (#33214562)

With 8 gigs of glacially slow swap, true everything will keep running but performance will drop by a factor of perhaps 1000. The users will SCREAM. Which means your pager/cellphone will scream. Eventually you can log in, manually restart the processes, and the users will be happy, for a little while.

Is there a modern OS with a VM manager that horrible? And while I agree that the ratio of memory speed to HDD speed (but not necessarily SSD speed) keeps growing in favor of RAM, the ratio of RAM size to hard drive throughput still seems about the same. For instance, my first 512KB Amiga 1000 had a 5KB/s floppy, so writing out the entire contents of RAM would take about 100 seconds. These days my home server has 8GB of RAM and each of its drives can sustain about 80MB/s throughput, so writing out the entire contents of RAM would take about... 100 seconds.

Finally, while I don't know as much about Linux's VMM, I know that FreeBSD's is fairly proactive about copying long-unused RAM pages to swap during idle periods. If those processes suddenly decide to access those pages, they're still in RAM and the processes race ahead as normal. If some other process tries to allocate that RAM, then those pages are released and allocated to the new process with no new disk IO at all - because they've already been copied out. I can't think of a single real-world reason why that isn't a good thing.

Re:Leaky Fawcet (0)

Anonymous Coward | more than 4 years ago | (#33215050)

The raisins have calmed down, can we call it a page file now? :)

Re:Leaky Fawcet (3, Interesting)

akanouras (1431981) | more than 4 years ago | (#33212894)

Excuse my nitpicking, your post sparked some new questions for me:

The problem arises when legitimate applications attempt to use that memory. How long does it take to page (read/write) 16GB, 4KB at a time?

Are you sure that it's only reading/writing 4KB at a time? That seems pretty braindead to me.

The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was.

Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia [wikipedia.org] mentions them as interchangeable terms and other sources on the web seem to agree.

You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved.

Just mentioning here that Swapspace [pqxx.org] (Debian [debian.org] package) takes care of that, with configurable thresholds.

You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact its on a file system won't matter.

Quoting Andrew Morton [lkml.org] :
"[On 2.6 kernels the difference is] None at all. The kernel generates a map of swap offset -> disk blocks at swapon time and from then on uses that map to perform swap I/O directly against the underlying disk queue, bypassing all caching, metadata and filesystem code."

Re:Leaky Fawcet (2, Informative)

lars_stefan_axelsson (236283) | more than 4 years ago | (#33213312)

Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia mentions them as interchangeable terms and other sources on the web seem to agree.

It's actually (buried) in the wikipedia article you link, but only a sentence or so. In the old days, before paging, a Unix system would swap an entire running program onto disk/drum. (That's where the sticky bit comes from: swap space was typically much faster than other secondary storage, if nothing else because the lack of a file system helps. The sticky bit on an executable file meant "keep the text of the program on swap even when it has stopped executing", which meant that executing the program again would go much faster.) Then came paging, where only certain pages of a running program would get ejected to swap space.

Unix systems would then both swap and page. Roughly, when memory pressure was low (but still high enough to demand swap space), the system would page. As memory pressure rose, the OS would decide the situation was untenable and select entire processes to be evicted to swap for a long time (several seconds to tens of seconds), then check periodically to see if they could/should be brought back (evicting someone else in the process). The BSDs even divided the task struct into two parts, the swappable and the unswappable part, where the swappable part would record things like page tables, etc., which is superfluous information when all the pages of a process have been ejected. The unswappable part contained only the bare minimum needed to remember there was a process on swap, and to make scheduling decisions regarding it. This made sense when main memory was measured in single-digit megabytes. I don't think that Linux bothered with this (or even with swapping as a concept, implementing just paging, but don't quote me on that, as memories were getting bigger fast).

Of course, swapping meant that those of us who ran X on a 4MB Sun system in the eighties would find that our X-term processes had been swapped out (the OS had decided that since they hadn't been used in a while, and were waiting for I/O, they were probably batch oriented in nature and could be swapped out wholesale), and it would take several seconds for the cursor to become responsive when you changed windows... :-) The scheduling decisions hadn't kept up. The solution, though, was the same as today: buy more memory... :-)

Any good *old* book on OS internals, esp. the earlier editions of "The Design and Implementation of the FreeBSD Operating System" by McKusick et al., would have the gory details. (But the FreeBSD version of that book might have done away with that. It was still in the 4.2 version though.) :-)

Re:Leaky Fawcet (1)

akanouras (1431981) | more than 4 years ago | (#33215368)

Thank you very much for your reply, old Unix stories are always fascinating to read! :D

I have amended the Wikipedia "disambiguation" page to make clicking through more ...enticing for future visitors :)

Re:Leaky Fawcet (0)

Anonymous Coward | more than 4 years ago | (#33214054)

Excluding Windows boxes (they have their own unique paging, memory/file mapping, and backing store systems), swap larger than 1/4 to 1/2 of RAM is generally a waste these days.

Unless, of course, you want to use hibernation.

Re:Leaky Fawcet (1)

StayFrosty (1521445) | more than 4 years ago | (#33214864)

The one good use for a 1:1 memory-to-disk ratio nowadays is suspend to disk. If you don't have enough swap space available and you try to suspend, it doesn't work.

Re:Leaky Fawcet (1)

GooberToo (74388) | more than 4 years ago | (#33212520)

Sorry. My other post, which provides lots of good, accurate information, was troll-moderated. It's been forever since I last saw meta-moderation actually fix a troll-moderated post, so I'm hoping others will fix it. Not to mention, it's information many, many users should learn.

Hopefully you and others will read the post and realize why it's a bad idea, however popular the notion may be.

Re:Leaky Fawcet (1)

Osso (840513) | more than 4 years ago | (#33212282)

Often you don't want to swap and instead want memory allocations to fail. Sometimes everything gets so slow that you can barely access your server, instead of being able to check on what's happening.

Re:Leaky Fawcet (1)

tenchikaibyaku (1847212) | more than 4 years ago | (#33212738)

I have lately disabled my swap for a very simple reason: with 4GB of RAM, the swap was only ever used when some rogue application suddenly went into an eat-all-memory loop *cough*adobe flash*cough*.

It might not be a good reason in theory, but in practice I'd rather have the OOM killer kick in sooner than have to struggle with a system that is practically hung due to all the swapping. I can live with having slightly less of my filesystem cached.

Re:Leaky Fawcet (1)

kenh (9056) | more than 4 years ago | (#33213614)

First, it's Leaky Faucet (Unless you are thinking of Farrah Fawcett [tobyspinks.com] 8^)

Second, never try to teach a pig to fly; it wastes your time and annoys the pig. The same goes for having in-depth technical discussions with many slashdot commenters...

2 years so updates are way behind? (1)

Joe The Dragon (967727) | more than 4 years ago | (#33214656)

Two years of uptime, so updates are way behind? Not all of them are no-reboot ones.

Re:Leaky Fawcet (1)

ultranova (717540) | more than 4 years ago | (#33214896)

Memory leaks usually get swapped out... your swap usage will grow but the system will keep going just as fast since those pages will never get swapped in again.

I once had explorer.exe on Windows 7 go into some kind of seizure where it ended up using over 2 gigs of memory (of a total 4) before I killed it. It certainly was swapping in and out constantly. Fun, that.

Re:Leaky Fawcet (1)

druke (1576491) | more than 4 years ago | (#33212194)

You make it sound like this is some sort of conspiracy. Generally, when you'd want to do something like this, you would be running VM servers anyway. They didn't do much (anything, actually) in the way of 'desktop programs' beyond X...

Why does this matter anyway? It's not the VM devs' job to fix memory leaks in OpenOffice. They have to go forward assuming everything is working correctly. Also, if they're all sharing the memory leak, it'd be optimized anyway :p

Kernel shared memory (5, Informative)

Narkov (576249) | more than 4 years ago | (#33212078)

The Linux kernel uses something called kernel shared memory (KSM) to achieve this with its virtualization technology. LWN has a great article on it:

http://lwn.net/Articles/306704/ [lwn.net]
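As a rough, hedged sketch (not QEMU's actual code) of how a user-space VMM opts guest RAM into KSM: the region is marked mergeable with madvise(), and the ksmd thread (enabled via /sys/kernel/mm/ksm/run, kernel 2.6.32 or later) then merges identical pages into a single copy-on-write frame. The size below is purely illustrative.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t guest_ram = 256UL << 20;   /* illustrative 256MB "guest RAM" */

        void *ram = mmap(NULL, guest_ram, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED) { perror("mmap"); return 1; }

        /* Tell ksmd it may scan this range; identical pages found across all
         * registered regions get merged and become copy-on-write. */
        if (madvise(ram, guest_ram, MADV_MERGEABLE) != 0)
            perror("madvise(MADV_MERGEABLE)");

        memset(ram, 0xAB, guest_ram);   /* identical content is what gets merged */
        printf("registered %zu bytes with KSM\n", guest_ram);
        return 0;
    }

As far as I know, this madvise(MADV_MERGEABLE) call is how KVM/QEMU make guest RAM eligible for sharing.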

Re:Kernel shared memory (5, Informative)

amscanne (786530) | more than 4 years ago | (#33212434)

Disclaimer: I wrote the blog post. I noticed the massive slashdot traffic, so I popped over. The article summary is not /entirely/ accurate, and doesn't really completely capture what we're trying to do with our software.

Our mechanism for performing over-subscription is actually rather unique.

Copper (the virtualization platform in the demo) is based on an open-source Xen-based virtualization technology named SnowFlock. Where KSM does post-processing on memory pages to share memory on a post-hoc basis, the SnowFlock method is much more similar to unix 'fork()' at the VM level.

We actually clone a single VM into multiple ones by pausing the original VM, COW-ing its memory, and then spinning up multiple independent, divergent clones off of that memory snapshot.

We combine this with a mechanism for bringing up lightweight VMs fetching remote memory on-demand, which allows us to bring up clones across a network about as quickly and easily as clones on the same machine. We can 'clone' a VM into 10 VMs spread across different hosts in a matter of seconds.

So the mechanism for accomplishing this works as follows:
1. The master VM is momentarily paused (a few milliseconds) and its memory is snapshotted.
2. A memory server is set up to serve that snapshot.
3. 16 'lightweight' clone VMs are brought up with most of their memory "empty".
4. The clones start pulling memory from the server on-demand.

All of this takes a few seconds from start to finish, whether on the same machine or across the network.
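As a process-level analogy only (a hedged sketch, not the actual Copper/SnowFlock code), mapping a snapshot file MAP_PRIVATE gives both halves of the trick at once: pages are faulted in on demand as they are touched, and writes diverge into private copies without ever modifying the snapshot. The filename below is hypothetical.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *snapshot = "memory.snapshot";   /* hypothetical snapshot image */

        int fd = open(snapshot, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Nothing is read yet: pages fault in lazily on first access, and any
         * write lands on a private copy; the snapshot itself stays untouched. */
        char *mem = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }

        mem[0] ^= 1;   /* first write pulls in and privately copies just this page */
        printf("clone mapped %lld bytes copy-on-write\n", (long long)st.st_size);

        munmap(mem, st.st_size);
        close(fd);
        return 0;
    }

The real system does this at the VM level, with a network memory server standing in for the local file, but the lazy-fetch-plus-COW behaviour is the same basic idea.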

We're using all of this to build a bona-fide cluster operating system where you host virtual clusters which can dynamically grow and shrink on demand (in seconds, not minutes).

The blog post was intended not as an ad, but rather a simple demo of what we're working on (memory over-subscription that leverages our unique cloning mechanism) and a general introduction to standard techniques for memory over-commit. The pointer to KSM is appreciated, I missed it in the post :)

Re:Kernel shared memory (5, Funny)

pookemon (909195) | more than 4 years ago | (#33212662)

The article summary is not /entirely/ accurate

That's surprising. That never happens on /.

Re:Kernel shared memory (5, Interesting)

descubes (35093) | more than 4 years ago | (#33212666)

Having written VM software myself (HP Integrity VM), I find this fascinating. Congratulations for a very interesting approach.

That being said, I'm sort of curious how well that would work with any amount of I/O happening. If you have some DMA transfer in progress to one of the pages, you can't just snapshot the memory until the DMA completes, can you? Consider a disk transfer from a SAN. With high traffic, you may be talking about seconds, not milliseconds, no?

Re:Kernel shared memory (1)

kenh (9056) | more than 4 years ago | (#33213566)

Let me see if I understand: you take one VM (say, an Ubuntu 10.04 server running a LAMP stack, just to pick one), then you make "diffs" of that initial VM and create additional VMs that are also running that OS/software (as a starting point). Of course, I can load up other software on the "diff'd" VMs, but that increases the actual memory footprint of each VM. So, to maximize oversubscription of memory, I'd want to limit myself to running VMs that are as similar as possible (say a farm of Ubuntu 10.04 LAMP servers), and were I to run a couple of Ubuntu 10.04 LAMP servers, a couple of Windows Server 2008 servers, a Windows Server 2003 server, and an Ubuntu 9.04 server on the same machine, I'd have minimal memory oversubscription benefits (the multiple Windows Server 2008 and Ubuntu 10.04 LAMP servers would share memory, but the one-off Windows Server 2003 and Ubuntu 9.04 would have no shared memory)... Correct?

Interesting idea, seems to me the memory server would cause a serious impact on server performance, but that is the view from my armchair, I'll reserve judgement until I see it in action.

Thanks for following up on the /. story.

Re:Kernel shared memory (1)

Daniel Boisvert (143499) | more than 4 years ago | (#33213882)

This is an interesting approach, especially across hosts in a cluster. Is it safe to assume you expect your hosts and interconnect to be very reliable?

I'm curious about the methods you use to mitigate the problems that would seem to result if you clone VM 1 from Host A onto VM's 2-10 on hosts B-E, and Host A dies before the entirety of VM 1's memory is copied elsewhere. Can you shed any light on this?

Re:Kernel shared memory (1)

calmond (1284812) | more than 4 years ago | (#33213932)

Correct me if I'm wrong, but the method you described sounds almost exactly like LVM snapshots. A great approach, and it saves a ton of disk space. How often should a VM be rebooted or re-cloned, though? Memory is a lot more volatile than disk storage, so I would think that the longer the system runs, the more divergent the memory contents would be, and thus the less efficient this method would be over time, or am I missing something? Thanks!

So... is this different from Linux KVM w/ KSM? (1)

Anonymous Coward | more than 4 years ago | (#33212088)

Even the same ratio of over-subscribed memory, around 300%, but without the overhead this article admits it has, which reduces its actual over-subscription ratio to just over 200% instead:

http://lwn.net/Articles/306704/

Specifically, this link/LKML post: 52 1GB Windows VMs in 16GB of total physical RAM installed:

http://lwn.net/Articles/306713/

Re:So... is this different from Linux KVM w/ KSM? (1)

fR0993R-on-Atari-520 (60152) | more than 4 years ago | (#33212112)

Funny... in the VMware whitepaper [vmware.com] linked to from the article, even VMware wasn't able to get more than 110% memory over-consolidation from page sharing. I wonder what's so different about KVM's page sharing approach?

Re:So... is this different from Linux KVM w/ KSM? (5, Informative)

amscanne (786530) | more than 4 years ago | (#33212174)

I have one possibility. The blog post alluded to this. Page sharing can be done *much* more efficiently on Linux due to the fact that the ELF loader does not need to rewrite large chunks of the binaries when applications are loaded into memory. The Windows loader will rewrite addresses in whole sections of code if a DLL or EXE is not loaded at its "preferred base" virtual address. In Linux, these addresses are isolated through the use of trampolines. Basically, you can have ten instances of Windows all running the exact same Microsoft Word binaries and they might not share the code for the application. In Linux, if you have ten VMs running the same binaries of Open Office, there will be a lot more sharing.

Re:So... is this different from Linux KVM w/ KSM? (0)

Anonymous Coward | more than 4 years ago | (#33212590)

I'm sorry, but this post and the blog post are extremely inaccurate, and I hesitate to say flat-out wrong. EXEs are never relocated unless mapped via LoadLibrary (a debugging technique only). All code (DLLs and EXEs) is shared system-wide with copy-on-write memory mapping [microsoft.com]. If a DLL is relocated, what typically happens is that the .reloc section is copied to the private address space and rewritten. All other binary sections should remain shared (I believe; IAT rewriting should only happen once globally). Additionally, most (all?) Microsoft DLLs have unique base addresses to minimize the potential relocations.

Re:So... is this different from Linux KVM w/ KSM? (3, Insightful)

amscanne (786530) | more than 4 years ago | (#33212730)

Yes, rebasing is reduced by careful selection of preferred base addresses (particularly by Microsoft for their DLLs). Yes, if DLLs are not rebased then they are shared -- I did not claim otherwise. My point along these lines is that rebasing *does* occur surprisingly often, and can hurt sharing. The actual level of sharing you achieve obviously depends almost *entirely* on your applications, workload, data, etc.

By the way, as far as I know versions of Windows newer than Vista enable address-space randomization by default for security purposes. Since the starting virtual address of each DLL is randomized, preferred bases can't be respected. I don't know what impact this has on Windows memory usage post-Vista, but it seems like one can't rely on carefully curated base addresses.

I'm not saying one approach is better than the other (Linux, Windows, whatever) -- I'm only positing a possibility for why one might see better improvement with KSM. One might just as easily see better over-subscription with Windows simply due to the fact that it zeroes out physical pages when they are released, as far as I know. Those zero pages can all be mapped to the same machine frame transparently (without the need for ballooning).

Re:So... is this different from Linux KVM w/ KSM? (0)

Anonymous Coward | more than 4 years ago | (#33212864)

ASLR DLLs are randomly rebased only once while in existence (loaded by some process). It takes a reboot to relocate persistent system DLLs to another address. It's an improvement over preferred-base loading in that two DLLs cannot request the same address.

Re:So... is this different from Linux KVM w/ KSM? (0)

Anonymous Coward | more than 4 years ago | (#33212752)

Hmm... last I checked, the .reloc section only specified code locations, and it's the locations themselves and not the index that is modified when relocated. And modification to the mapped code section should trigger a page fault, which results in the entire code section being copied to the private address space. If I'm right about this then both you and the GP are correct.

Re:So... is this different from Linux KVM w/ KSM? (1)

ringm000 (878375) | more than 4 years ago | (#33212592)

Base addresses of DLLs in an application are typically chosen to avoid conflicts with system DLLs and between each other, so these conflicts are relatively rare. When they happen, the DLLs can be manually rebased.

Re:So... is this different from Linux KVM w/ KSM? (1)

Wierdy1024 (902573) | more than 4 years ago | (#33213032)

Using trampolines for every cross-library call seems very inefficient...

The windows method seems better for the more common case, where it does the costly rewriting at library load time, and then avoids an extra jump for every library function call.

What's the performance impact of this? I bet it's at least a couple of percent, which is significant if it's across the entire system.

Re:So... is this different from Linux KVM w/ KSM? (2, Interesting)

milosoftware (654147) | more than 4 years ago | (#33213218)

On x64 Windows systems, addressing is always relative, so this eliminates the DLL relocation. So it might actually save memory to use 64-bit guest OSes, as there will be less relocation and more sharing.

Just a summary of existing techniques (2, Informative)

Anonymous Coward | more than 4 years ago | (#33212100)

This blog post is just a summary of 3 existing techniques: Paging, Ballooning, and Content-Based Sharing. It does not describe any new techniques, or give any new insights.

It's a solid summary of these techniques, but nothing more.

Re:Just a summary of existing techniques (1)

GooberToo (74388) | more than 4 years ago | (#33212580)

A new implementation of an existing technique and/or technology can still be noteworthy. If that weren't the case, then an F-22 would really just be a Wright brothers' Flyer -- nothing new. My metaphor is absurd, but you get the point.

Re:Just a summary of existing techniques (1)

dirtyhippie (259852) | more than 4 years ago | (#33212586)

It doesn't even say which if any of those techniques it's using. It's a teaser, not news.

Re:Just a summary of existing techniques (0)

Anonymous Coward | more than 4 years ago | (#33213588)

More or less agreed. I always get annoyed when people demo memory overcommit technology with desktop VMs. It's easy to build a couple dozen Windows XP VMs with 1GB of RAM each and get huge overcommit ratios because XP will run in 256MB of RAM. Cramming 16 GB of allocated VMs into 8 GB of space is trivial when your actual memory requirements are close to 4GB. There's a reason virtualization vendors don't demo memory overcommit on VDI platforms using 4GB Windows 7 VMs running simulated workloads. Regarding the technologies listed:

Page Sharing only works when you have a lot of duplicate memory pages, and if you're using large pages then you essentially have none (unless they are zeroed pages, in which case you have probably over-allocated your VMs).

Paging at the hypervisor level is an absolute disaster waiting to happen. You're basically blindly paging data to disk that VMs think is stored in RAM. Consequently VMs request paged data and disk I/O increases. You're better off letting the VMs handle the paging so that they can intelligently determine what to page out.

Ballooning isn't a bad technology, especially if your VMs have more memory than they need. But if they actually need the memory that they have, then it doesn't get you much. It basically relies on the VM to handle its own paging.

This is one area where virtualization is going to struggle. Oversubscribing CPUs isn't much of an issue because most physical servers are over-allocated CPUs anyway, leading to lots of idle cycles. Oversubscribing memory is trickier because a server's memory needs don't typically swing from needing lots of memory one minute to needing much less the next, and even if memory pages are not actively being "used" at a particular instant there is still a performance hit involved in paging them out.

At this point I think that Microsoft is actually in the lead when it comes to VM memory management with their Dynamic Memory technology. It uses hooks into the VM operating system to determine how much memory a VM actually needs at that particular time and provides it, but also gives it the capability to dynamically scale up memory allocations as workloads require it. You still can't "use" more memory than you actually have and you never will (in the same way that you can't "use" more CPU cycles than you actually have), but Dynamic Memory ensures that you can more highly optimize the memory utilization in your virtual environments.

http://blogs.technet.com/b/virtualization/archive/2010/07/12/dynamic-memory-coming-to-hyper-v-part-6.aspx

OpenVZ? (1, Informative)

Anonymous Coward | more than 4 years ago | (#33212122)

OpenVZ has had this for years now, which is one of the reasons it has gained popularity in the hosting world.

Re:OpenVZ? (1)

KiloByte (825081) | more than 4 years ago | (#33213600)

Or vserver. Or BSD jails.

These just use the good old Unix memory management -- if you can coordinate between multiple VMs, things get a whole lot easier. The problem with VMs with separate kernels (Xen, VirtualBox, VMWare, etc) is that they have no way of knowing a given page mmaps the same block on the disk.

The technique described in the article is a hack that works only if all processes are started before you clone the VMs and nothing else happens later. Vserver does it strictly better -- if multiple VMs use the same file on the disk, it will use the memory exactly once, no matter when it was read.

nothing new (1, Interesting)

Anonymous Coward | more than 4 years ago | (#33212130)

Nothing new... I ran 6 W2K3 servers on a Linux box running VMware Server with 4GB of RAM and allocated 1GB to each VM.

Re:nothing new (-1, Troll)

Anonymous Coward | more than 4 years ago | (#33212178)

So, I once ran 20 virtualized supercomputer clusters on the Casio wristwatch I lost up your daddy's ass.

Is this an ad? (2, Insightful)

saleenS281 (859657) | more than 4 years ago | (#33212148)

I noticed free memory on the system was at 2GB and dropping quickly when they moved focus away from the console session (even though all of the VM's had the exact same app set running). This appears to be absolutely nothing new or amazing... in fact, it reads like an ad for gridcentric.

Not exactly new (1, Informative)

Anonymous Coward | more than 4 years ago | (#33212198)

Oversubscription of memory for VMs has been around for decades - just not for the Intel platform. There are other older, more mature platforms for VM support...

This having been done before ... (5, Informative)

cdrguru (88047) | more than 4 years ago | (#33212214)

One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before. Often, even when there is an opportunity to gather information about older systems, they don't think it is relevant.

Well, here I would say it is extremely relevant to understand some of the performance tricks utilized by VM/370 and VM/SP in the 1970s and 1980s. VM/370 is pretty much the foundation of today's IBM virtualization offerings. In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.

The key bit of information here is that for interactive users running CMS, a significant optimization was put in place: sharing the bulk of the operating system pages. This was done by dividing the operating system into two parts, shared and non-shared, and by design avoiding writes to the shared portion. If a page was written to, a local copy was made and that page was no longer shared.

This was extremely practical for VM/370 and later systems because all interactive users were using pretty much the same operating system - CMS. It was not unusual to have anywhere from 100 to 4000 interactive users on such systems, so sharing these pages meant huge gains in memory utilization.

It seems to me that a reasonable implementation of this for virtualization today would be extremely powerful in that a bulk of virtualized machines are going to be running the same OS. Today most kernel pages are read-only so sharing them across multiple virtual machines would make incredible sense. So instead of booting an OS "natively" you would instead load a shared system where the shared (read only) pages would be loaded along with an initial copy of writable non-shared memory from a snapshot taken at some point during initialization of the OS.

This would seem to be easy to do for Linux, even to the extent of having it assist with taking a snapshot during initialization. Doing this with Windows should be possible as well. This would greatly reduce the memory footprint of adding another virtual machine also using a shared operating system. The memory then used by a new virtual machine would only be the non-shared pages. True, the bulk of the RAM of a virtual machine might be occupied by such non-shared pages, but the working set of a virtual machine is likely to be composed of a significant number of OS pages - perhaps 25% or more. Reducing memory requirements by 25% would be a significant performance gain and increase in available physical memory.

Ok, but look at this... (1)

ratboy666 (104074) | more than 4 years ago | (#33212242)

Yes, we pay attention...

The concept is in Unix, including Linux, and probably in Windows - COW (copy-on-write) pages...
fork() uses COW, vfork() shares the entire address space (but suspends the parent).

$ man vfork

[snip]

      Historic Description
              Under Linux, fork(2) is implemented using copy-on-write pages, so the
              only penalty incurred by fork(2) is the time and memory required to
              duplicate the parent's page tables, and to create a unique task struc-
              ture for the child. However, in the bad old days a fork(2) would
              require making a complete copy of the caller's data space, often need-
              lessly, since usually immediately afterwards an exec(3) is done. Thus,
              for greater efficiency, BSD introduced the vfork() system call, which
              did not fully copy the address space of the parent process, but bor-
              rowed the parent's memory and thread of control until a call to
              execve(2) or an exit occurred. The parent process was suspended while
              the child was using its resources. The use of vfork() was tricky: for
              example, not modifying data in the parent process depended on knowing
              which variables are held in a register.

[snip]
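A minimal sketch of the copy-on-write behaviour described above (the buffer size is arbitrary): after fork() the child shares the parent's pages, and only the single page it writes to gets copied.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUF_SIZE (256UL << 20)   /* 256MB, purely illustrative */

    int main(void)
    {
        char *buf = malloc(BUF_SIZE);
        if (!buf) { perror("malloc"); return 1; }
        memset(buf, 0x42, BUF_SIZE);      /* touch every page in the parent */

        pid_t pid = fork();               /* page tables copied; pages shared COW */
        if (pid < 0) { perror("fork"); return 1; }

        if (pid == 0) {
            buf[0] = 0x43;                /* child gets a private copy of this one page */
            printf("child:  buf[0] = %#x\n", buf[0]);
            _exit(0);
        }

        waitpid(pid, NULL, 0);
        printf("parent: buf[0] = %#x (unchanged)\n", buf[0]);
        free(buf);
        return 0;
    }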

Re:Ok, but look at this... (1)

sjames (1099) | more than 4 years ago | (#33212886)

Parent is talking about the kernel itself sharing pages across instances, not userspace processes running under a single instance.

Re:This having been done before ... (4, Informative)

Anonymous Coward | more than 4 years ago | (#33212246)

Very informative, but pure page sharing doesn't work for most Windows variants, due to the fact that Windows binaries aren't position independent. That means, each time MS Office is loaded on a different machine, the function jump points are re-written according to where in the address space the code gets loaded, which is apparently usually different on different Windows instances. That means very little opportunity for page sharing.

These guys seem to be doing something different...

Re:This having been done before ... (0)

Anonymous Coward | more than 4 years ago | (#33212374)

It's true, if a module isn't loaded at its preferred base address, there is pointer rewriting that occurs on Windows.

However, each process gets its own address space. An app like MS Office is always going to be loaded at its preferred base address - none of the built-in modules are going to conflict over preferred base address. Two MS Office plugins that were written unaware of each other may conflict on preferred base address, but only the conflicting plugin will have its pointers rewritten. The rest of the process can happily share memory.

Not even considering virtualization, this is useful in the many-user terminal server scenario. If you have 100 users running MS Word, Windows doesn't load 100 copies of winword.exe.

Re:This having been done before ... (1)

ChipMonk (711367) | more than 4 years ago | (#33212496)

Very informative, but pure page sharing doesn't work for most Windows variants, due to the fact that Windows binaries aren't position independent.

Is that also true for 64-bit Windows binaries? According to the docs I've read, position-independent binary code is preferred in 64-bits.

Re:This having been done before ... (0)

Anonymous Coward | more than 4 years ago | (#33214092)

Is that also true for 64-bit Windows binaries? According to the docs I've read, position-independent binary code is preferred in 64-bits.

Except that when you're using large amounts of memory you enable large pages, which reduces the possibility of page sharing to almost zero.

Re:This having been done before ... (2, Insightful)

Maarx (1794262) | more than 4 years ago | (#33215270)

You guys gotta learn to use the quote tags instead of the italics. Slashdot knows to hide the quote when displaying your post in abbreviated mode, so we can actually read what you said.

And face it, if you post as AC, you're going to be in abbreviated mode.

Re:This having been done before ... (4, Informative)

pz (113803) | more than 4 years ago | (#33212334)

In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.

From what I can gather, it was not Cambridge University (of which I believe there is still only one, located in the UK, despite the similarly named Cambridge College [cambridgecollege.edu] in Cambridge, Massachusetts; but as the latter is an adult-education center founded in 1971, chances are that wasn't where CP/67 was developed), but rather IBM's Cambridge Scientific Center [wikipedia.org], which used to be in the same building as MIT's Project MAC. Project MAC (which later became the MIT Lab for Computer Science) is where much of the structure of modern OSes was invented.

Those were heady days for Tech Square. And, otherwise, the parent poster is right on.

Re:This having been done before ... (0)

Anonymous Coward | more than 4 years ago | (#33212530)


One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before.

In my experience age has little to do with it. I know a sysadmin in his 50s who's one of the least knowledgeable people I know when it comes to actually understanding what's going on inside the machine. I know people 20 years younger who could do cartwheels around this guy. Age has little correlation with experience. The desire to understand is really the key.

Today most kernel pages are read-only so sharing them across multiple virtual machines would make incredible sense.

I guess. Kernels are dinky at a few megabytes of code compared to the gigabytes of memory available. Glibc is a couple of megabytes. There are some other shared libs for sure, but I have a hard time believing they add up to anything substantial.

The real question here to me is, why try to share memory in the first place? Memory is cheap. This hasn't always been the case, but it is now.

Re:This having been done before ... (2, Informative)

petermgreen (876956) | more than 4 years ago | (#33214792)

Memory is cheap
Kind of. The memory itself isn't too expensive, but the cost of a system has a highly nonlinear relationship to memory requirements, at least with the Intel Nehalem stuff (it's been a while since I've looked at AMD so I can't really comment there).

Up to 16GB you can use an ordinary LGA1366 board and CPU.

To get to 24GB you need a LGA1366 board and CPU.

To get to 48GB (or 72GB if you are prepared to take the performance hit and motherboard-choice hit that comes from putting three memory modules on a channel) you need a dual-socket LGA1366 board, the associated dual-socket capable CPUs (which are far, far more expensive clock for clock than their single-socket equivalents) and an associated special case.

To get to 96GB (or 144GB if you are prepared to take the performance hit and motherboard-choice hit that comes from putting three memory modules on a channel) you need the aforementioned dual-socket platform plus insanely expensive 8GB modules.

Beyond that you are talking about moving to a quad-socket platform, afaict.

Re:This having been done before ... (2, Informative)

petermgreen (876956) | more than 4 years ago | (#33215776)

Up to 16GB you can use an ordinary LGA1366 board and CPU.
That line should have said LGA1156

I too am guilty of oversubscribing.... (-1, Troll)

Anonymous Coward | more than 4 years ago | (#33212302)

...to your MOM!!! ba-ZING-a!

VMware ESX does this (yeah it's not free) (0)

Anonymous Coward | more than 4 years ago | (#33212304)

This guy did some crazy stuff with it - 64GB or so of fake memory on an 8GB box: http://vinf.net/2010/02/25/8-node-esxi-cluster-running-60-virtual-machines-all-running-from-a-single-500gbp-physical-server/

Limitations (1)

sjames (1099) | more than 4 years ago | (#33212340)

This may seem obvious, but in reading some of the trade press and the general buzz, it seems that it isn't obvious to everyone:

Oversubscription only works when the individual VMs aren't doing much. If you have a pile of VMs oversubscribed to the degree TFA is talking about, it means the VM overhead is exceeding the useful computation. There are cases where that can't be helped, such as when each VM is a different customer, but in an enterprise environment, it suggests that you should be running more than one service per instance and have fewer instances.

I swear, some in the trade rags seem to honestly think there is a benefit to splitting a server into 16 VMs and then combining those into a virtual beowulf cluster for production work (it makes perfect sense for development and testing, of course).

Re:Limitations (1)

drsmithy (35869) | more than 4 years ago | (#33212776)

Oversubscription only works when the individual VMs aren't doing much. If you have a pile of VMs oversubscribed to the degree TFA is talking about, it means the VM overhead is exceeding the useful computation. There are cases where that can't be helped, such as when each VM is a different customer, but in an enterprise environment, it suggests that you should be running more than one service per instance and have fewer instances.

No, you ideally want as few services per instance as possible, to reduce dependencies and simplify architectures.

A dozen small VMs running a single service each is generally easier to look after than a single server running a dozen different services, especially if your environment involves customers and/or services with differing availability requirements.

I swear, some in the trade rags seem to honestly think there is a benefit to splitting a server into 16 VMs and then combining those into a virtual beowulf cluster for production work (it makes perfect sense for development and testing, of course).

There are numerous examples where multiple clustered VMs will perform better than a single OS image, on the same hardware.

Re:Limitations (1)

sjames (1099) | more than 4 years ago | (#33212846)

No, you ideally want as few services per instance as possible, to reduce dependencies and simplify architectures.

If the services can be separated onto 2 VMs, they are necessarily orthogonal. If the availability requirements differ, they should certainly NOT be running as VMs on the same machine.

As for the case of different customers, that would fall under the exception where it can't be helped.

There are numerous examples where multiple clustered VMs will perform better than a single OS image, on the same hardware.

Name one!

Oversubscription (1)

Khyber (864651) | more than 4 years ago | (#33212360)

When can we just effectively get what we pay for? This would explain the sudden jump in Intel-based Camfrog servers offering more hardware.

This effectively means people can now lie about the hardware they're leasing out to you in a data center. They say you're getting 4GB, you're actually getting 1.5GB of RAM.

Our internet is oversubscribed, our processors are getting there, and now RAM?

When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?

Re:Oversubscription (1)

TooMuchToDo (882796) | more than 4 years ago | (#33212408)

When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?

When people are willing to pay for it. If you shop on price, this is the natural result: the need to squeeze as much as you can out of a capital asset.

Re:Oversubscription (1)

sjames (1099) | more than 4 years ago | (#33212868)

Sad but true, especially when there's always someone out there ready to promise more for less and customers ready to believe the lie.

Re:Oversubscription (1)

gregrah (1605707) | more than 4 years ago | (#33212490)

Really? That was your conclusion upon reading this article??

Virtual memory has been around for quite a while now, and I don't think its inventors came up with the idea with the intention to scam anyone. I'd say your outrage at "the designers of this stuff" may be misplaced.

Re:Oversubscription (1)

Khyber (864651) | more than 4 years ago | (#33213202)

Not when this is apparently the exact same technology being used to run multiple heavy-traffic video chat servers on the same physical silicon. No wonder people on Camfrog are complaining about their servers lagging so hard, if this is the kind of thing we're paying for when we're actually expecting physical hardware.

Re:Oversubscription (1)

Slashcrap (869349) | more than 4 years ago | (#33212962)

When can we just effectively get what we pay for? This would explain the sudden jump in Intel-based Camfrog servers offering more hardware.

This effectively means people can now lie about the hardware they're leasing out to you in a data center. They say you're getting 4GB, you're actually getting 1.5GB of RAM.

Our internet is oversubscribed, our processors are getting there, and now RAM?

When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?

Sorry about your anger issues and obvious lack of understanding about what this is.

Re:Oversubscription (1)

Khyber (864651) | more than 4 years ago | (#33213210)

No, I know EXACTLY what the issue is, having called my hosting provider for my video chat server. They just upgraded to this sort of management system, and my video server had been lagging horribly almost since the moment of implementation. And this would explain it - I've been moved to a shared server with overprovisioned hardware.

Sorry you're not experienced enough with realtime applications to know when something's fucking with your system.

Re:Oversubscription (2, Insightful)

TheRaven64 (641858) | more than 4 years ago | (#33214634)

Nope, you really don't seem to understand what this is at all. It is eliminating duplicated pages in the system, so if two VMs have memory pages with the same contents the system only keeps one copy. To a VM, this makes absolutely no difference - the pages are copy-on-write, and when neither VM modifies them they both can see the same one without any interference (as is common with mapped process images, kernel stuff, and so on). The only thing that will change is that there will be reduced cache contention (as all VMs will be using the same copy of the page, rather than evicting each other's copy to get their own (identical) one into the data cache).
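
To make that concrete, here is a toy sketch in C of content-based page sharing with copy-on-write. Every name in it (frame_t, map_page, write_page) is invented for illustration - a real hypervisor does this with hardware page tables, hashes of page contents, and write-protection faults rather than linear scans.

/* page_share.c - toy sketch of content-based page sharing with
 * copy-on-write. All structures here are invented for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_FRAMES 1024

typedef struct {
    unsigned refs;               /* how many guest pages map this frame */
    uint8_t data[PAGE_SIZE];     /* page contents */
} frame_t;

static frame_t *frames[MAX_FRAMES];
static size_t nframes;

/* Map a guest page: reuse an existing frame with identical contents
 * (just bump its refcount) or allocate a new backing frame. */
static frame_t *map_page(const uint8_t *contents)
{
    for (size_t i = 0; i < nframes; i++)
        if (memcmp(frames[i]->data, contents, PAGE_SIZE) == 0) {
            frames[i]->refs++;
            return frames[i];
        }
    frame_t *f = calloc(1, sizeof *f);
    memcpy(f->data, contents, PAGE_SIZE);
    f->refs = 1;
    return frames[nframes++] = f;
}

/* Write to a guest page: if the backing frame is shared, copy it first
 * (copy-on-write) so the other guests keep seeing the old contents. */
static frame_t *write_page(frame_t *f, size_t off, uint8_t val)
{
    if (f->refs > 1) {
        f->refs--;
        frame_t *copy = malloc(sizeof *copy);
        memcpy(copy, f, sizeof *copy);
        copy->refs = 1;
        f = frames[nframes++] = copy;
    }
    f->data[off] = val;
    return f;
}

int main(void)
{
    uint8_t zero_page[PAGE_SIZE] = {0};

    frame_t *vm1 = map_page(zero_page);   /* two "VMs" map identical pages */
    frame_t *vm2 = map_page(zero_page);
    printf("shared: %d, frames used: %zu\n", vm1 == vm2, nframes);  /* 1, 1 */

    vm2 = write_page(vm2, 0, 0xFF);       /* the write breaks the sharing */
    printf("shared: %d, frames used: %zu\n", vm1 == vm2, nframes);  /* 0, 2 */
    return 0;
}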

And if you're running realtime applications in a VM, then this isn't the only thing that you don't understand.

What's new? (1)

voxner (1217902) | more than 4 years ago | (#33212480)

I had recently started poking around the lguest hypervisor. From my limited reading I believe 2 of the 3 memory oversubscription approaches mentioned in the article are present in Linux. Existing Linux-based open source hypervisors like KVM etc. use a paging/swap mechanism (i.e. for x86 - the paravirt mechanism). Ballooning is possible using virtio_balloon. Kernel shared memory (KSM) in Linux allows dynamic sharing of memory pages between processes - this probably doesn't apply to virtualization.
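
For reference, userspace opts anonymous memory into KSM scanning via madvise(2). A minimal sketch, assuming a Linux kernel built with CONFIG_KSM and ksmd enabled ("echo 1 > /sys/kernel/mm/ksm/run"):

/* ksm_opt_in.c - minimal sketch: mark anonymous memory as mergeable so
 * ksmd (kernel samepage merging) can deduplicate identical pages. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL * 1024 * 1024;   /* 64 MB of anonymous memory */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 0x42, len);            /* lots of identical page contents */

    /* Ask ksmd to scan this range; duplicates are merged copy-on-write,
     * asynchronously, some time after this call returns. */
    if (madvise(buf, len, MADV_MERGEABLE) != 0) {
        perror("madvise(MADV_MERGEABLE)");
        return 1;
    }

    puts("marked mergeable; watch /sys/kernel/mm/ksm/pages_sharing");
    getchar();                         /* keep the mapping alive */
    munmap(buf, len);
    return 0;
}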

I couldn't find any CPU over-subscription mechanism in open-source hypervisors. It seems to be the only area where open-source hypervisors are lacking.

On another note, established players like IBM tend to use Type-1 hypervisors (link [ibm.com] ) for enterprise servers; it would be interesting to see how this company fares against them.

Re:What's new? (1)

Slashcrap (869349) | more than 4 years ago | (#33212980)

I couldn't find any CPU over-subscription mechanism in open-source hypervisors. It seems to be the only area where open-source hypervisors are lacking.

Didn't look too hard, did you?

Re:What's new? (1)

DrPizza (558687) | more than 4 years ago | (#33215436)

If your load average is >1, you have CPU over-subscription....
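
A rough sketch of that check, comparing the 1-minute load against the number of online CPUs (which is closer to the threshold that actually matters than a flat 1):

/* loadcheck.c - rough sketch: flag CPU over-subscription when the
 * 1-minute load average exceeds the number of online CPUs. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    double load[3];
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);

    if (getloadavg(load, 3) == -1) {
        perror("getloadavg");
        return 1;
    }
    printf("1-min load %.2f on %ld CPUs: %s\n", load[0], cpus,
           load[0] > (double)cpus ? "CPU oversubscribed" : "headroom left");
    return 0;
}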

Why stop at 16? (0)

Anonymous Coward | more than 4 years ago | (#33212492)

Why not 24 or 32? Seriously, why should 5 GB real memory be capable of only supporting 16 VMs? What's the limit?

Re:Why stop at 16? (1)

gregrah (1605707) | more than 4 years ago | (#33212658)

It depends. With a suitably small Linux build (say, the firmware running on my router) you could probably go much, much higher than 16 VMs.

At some point, though, after you've started up enough VMs, the probability that a given virtual memory page needed to continue processing is actually resident in RAM effectively drops to zero, and you spend all of your time waiting on disk I/O. At that point your system is effectively hosed.

Re:Why stop at 16? (1)

drsmithy (35869) | more than 4 years ago | (#33212800)

What's the limit?

Roughly: average VM working set x number of VMs, which has to fit in physical RAM minus a few hundred MB for the hypervisor and overheads.
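
A back-of-the-envelope example, assuming (purely for illustration) an average working set of around 300 MB per VM: 16 VMs x ~300 MB is about 4.8 GB of hot pages; add a couple hundred MB for the hypervisor and you are right at the 5 GB of physical RAM in the demo.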

Generally speaking, IME, you don't even need to _begin_ worrying until your RAM is oversubscribed above 2:1 (obviously YMMV depending on what your VMs are doing).

How is this interesting ? (1)

drsmithy (35869) | more than 4 years ago | (#33212750)

VMware has allowed RAM oversubscription for years. Indeed, it's one of the killer features of that platform over the alternatives. Who out there using VMware in non-trivial environments _isn't_ oversubscribing RAM?

Re:How is this interesting ? (1)

swb (14022) | more than 4 years ago | (#33215860)

We usually advise against it if possible, but some of that is consulting CYA; when clients are new to virtualization they are often very sensitive to perceived performance differences between physical and virtual systems. A new virtual environment where someone decided they wanted 8 Windows machines with 8 GB RAM running in 32 GB physical RAM usually gets too far oversubscribed, swaps hard (on a SAN) and the customer complains mightily.

Usually we find that a little tuning of VMs makes sense, since you don't have to robotically give every x32 system 4 GB RAM or every x64 system 8 or 16 GB RAM. "Detuning" the RAM from individual VMs is almost always possible and lets you keep the sum of your VMs' RAM within the total physical RAM and avoid the possibility of swapping.

In many ways it's less of an issue than it was, say, a few years ago, too. The CPUs have gotten so powerful that it actually makes sense to buy less CPU per node but buy more nodes (and hence more RAM). The bonus is generally more RAM overall, better performance (since I/O is distributed), and greater HA capacity.

Sales even tells me lately that it's cheaper to buy two nodes x 32 GB than a single node x 64 GB of RAM.

No big deal (1)

cyball (1039572) | more than 4 years ago | (#33213020)

Really, this amount of overcommit is nothing. It's been done for decades.

I manage a little over 200 virtual servers, spread across 7 z/VM hypervisors, and 2 mainframes. They are currently running with overcommit ratios of 4.59:1, 3.87:1, 3.56:1, 2.05:1, 1.19:1, 1.19:1, and .9:1. And this is a relatively small shop and somewhat low overcommits for the environment.

That's one of the benefits of virtualization...and yes, I know that if all guests decided to allocate all of their memory at once, we'd drive the hypervisor paging subsystem up the wall. Actually, this did happen a few months ago, and while everything was dog slow for a while, z/VM happily paged along without issue.

VMs (0)

Anonymous Coward | more than 4 years ago | (#33213512)

Newsflash from 2 years in the future: "Massive performance improvements in the VM sector by combining VMs into a single... let's call it... kernel. We present: chroot()!"

What's the fucking point of VMs if you introduce all the security problems over and over again? Might as well leave out the superfluous garbage complexity and improve performance.

I can't wait (1)

hilltop coder (1876242) | more than 4 years ago | (#33215422)

...for the day when my ISP wants to sell me a more expensive class of memory because they oversold their physical memory so much that they can't support users who actually use all of what they were sold.

Why is this an issue? (0)

Anonymous Coward | more than 4 years ago | (#33215712)

Why is this even an issue? Modern OSes are designed to use the maximum amount of RAM available. Unused RAM is wasted RAM, as both M$ and penguin-boi will tell you, so apparently there's no need to oversubscribe since we already have too much.
