
Patch the Linux Kernel Without Reboots

kdawson posted more than 6 years ago | from the click-n-go dept.

Operating Systems 286

evanbro writes "ZDNet is reporting on ksplice, a system for applying patches to the Linux kernel without rebooting. ksplice requires no kernel modifications, just the source, the config files, and a patch. Author Jeff Arnold discusses the system in a technical overview paper (PDF). Ted Ts'o comments, 'Users in the carrier grade linux space have been clamoring for this for a while. If you are a carrier in telephony and don't want downtime, this stuff is pure gold.'" Update: 04/24 10:04 GMT by KD : Tomasz Chmielewski writes on LKML that the idea seems to be patented by Microsoft.


In Soviet Russia, (0, Offtopic)

finalnight (709885) | more than 6 years ago | (#23183132)

In Soviet Russia, the kernel reboots you!

Re:In Soviet Russia, (-1, Troll)

Anonymous Coward | more than 6 years ago | (#23183196)

In Korea only old people reboot their kernels!

Re:In Soviet Russia, (2, Funny)

oodaloop (1229816) | more than 6 years ago | (#23183720)

Let's get the rest of the usual jokes out of the way while we're at it.

If there were no kernel, it would be necessary to create our non-rebooting robot overlords are belong to Chuck Norris.

Re:In Soviet Russia, (0)

Anonymous Coward | more than 6 years ago | (#23184288)

You forgot: "Imagine a Beowulf cluster of those" and "But does it run linux?"

Re:In Soviet Russia, (5, Funny)

oodaloop (1229816) | more than 6 years ago | (#23184868)

"But does it run linux?"
That's a joke? I thought that was just one dedicated user who kept asking on every article.

Needed that bad? (5, Insightful)

MetalliQaZ (539913) | more than 6 years ago | (#23183178)

If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching. They probably would be taken out of the loop for the in-place patching anyway. So who is "clamoring"?

Re:Needed that bad? (2, Funny)

tgatliff (311583) | more than 6 years ago | (#23183540)

I guess a better way to put it would be "oh... Way Cool!!!!"... :)

Meaning, yes, I agree that in most cases it is not needed, but I have internal processing servers with uptimes of over 3 years, so if I had something like this, probably all my servers would have uptimes that long.

Re:Needed that bad? (3, Interesting)

Chris Burke (6130) | more than 6 years ago | (#23183616)

If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching.

Two things:

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine. This allows updates to a live machine.

Second, this is telephony, meaning it is the infrastructure on which the internet is based. There are no DNS or TCP/IP tricks you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there is, by necessity, a single chokepoint.

As to how often these things collide, and how much of a pain it is to actually stop a server for some amount of time, I can't say. But I can see situations where being able to hot-swap a kernel would be useful.

Re:Needed that bad? (1)

diamondsw (685967) | more than 6 years ago | (#23183960)

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine.

So you take it out of rotation on the load balancer and give it a few minutes to complete all its active connections. Patch/reboot whatever. Bring it back into rotation, and repeat with the other box.

Re:Needed that bad? (5, Insightful)

jelle (14827) | more than 6 years ago | (#23184610)

So you take it out of rotation on the load balancer and give it a few minutes to complete all its active connections. Patch/reboot whatever. Bring it back into rotation, and repeat with the other box.

Methods like that usually suck in real life, because the day before you want to 'take it out of rotation', a circuit is opened through it that requires five nines (so you can't drop it), and it will remain open for months...

You will end up with 99 boxes waiting to 'get out of rotation' for every single box that you don't need to update...

Murphy will make sure of that.

Re:Needed that bad? (4, Informative)

Iphtashu Fitz (263795) | more than 6 years ago | (#23184016)

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine. This allows updates to a live machine.

If you have a load balanced environment then you have the ability to redirect new connections away from a given server. Then it's just a matter of waiting for the active connections to terminate before the machine ends up in an idle state where you can safely apply patches offline. I've worked in a number of telephony environments and this was always the way we would patch systems. Stop accepting new connections, wait for existing ones to end, then perform the patch, reboot, verify, and start accepting connections again.

Second, this is telephony, meaning it is the infrastructure on which the internet is based. There are no DNS or TCP/IP tricks you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there is, by necessity, a single chokepoint.

Any mission-critical hardware - switches, routers, servers, etc. - should be set up in redundant pairs (or triplets, ...) so that if a hardware failure occurs the remaining hardware can keep the service up. Single points of failure are avoided like the plague in datacenters that require 100% uptime. Part of that is to deal with hardware failures, but part is also to provide the ability to perform software/firmware upgrades when necessary. Once again, you migrate all traffic off the system you're upgrading, then apply the upgrades offline. Upgrading a kernel in an online environment, especially, is something virtually any sysadmin would want to avoid if at all possible.

Redundancy is key, and any commercial datacenter will offer it all the way from their connections to the outside world to the connections they provide their customers. Every datacenter used by every company I ever worked for (about 10) offered redundant power and redundant network drops (using HSRP, VRRP, etc) for our equipment. If the datacenter needed to upgrade a router they'd move all traffic off one router so they could upgrade and test it, then move traffic off the other and repeat the process. Similarly if we needed to upgrade our firewalls, switches, etc. we'd fail over to the second redundant device first. In some cases we had bonded interfaces right on the end servers so as long as one path remained active we could power down an entire switch, router, firewall, etc. In other cases we relied on load balancing across servers that were alternately connected to one or another switch.

Re:Needed that bad? (3, Insightful)

Paul Carver (4555) | more than 6 years ago | (#23184048)

If your load balancer can't take a server out of the pool while allowing current sessions to finish cleanly then you need to shop for a new load balancer.

A decent load balancer will obviously give you the choice of whether to take a server out of service immediately, disrupting existing sessions, or simply stop sending new sessions to it while allowing existing sessions to continue.

As for your comment about physical connections, that's what portchannels and multilink trunks are for. Or VRRP and HSRP depending on which level of "connected to" you mean.

Re:Needed that bad? (2, Interesting)

Colin Smith (2679) | more than 6 years ago | (#23184974)

The very fact that there is load balancing means that every server is likely to have active connections going through it
http://conntrack-tools.netfilter.org/ [netfilter.org]

I hot-swap whole networks.

HTH.

 

Re:Needed that bad? (3, Interesting)

garlicbready (846542) | more than 6 years ago | (#23183656)

I was about to say another idea might be virtualisation, via Xen for example:
start up a new virtual machine with the new kernel, then when you're sure it's working, just switch everything across from the old to the new and shut down the old virtual instance.

No, No, No and No again. (5, Interesting)

Anonymous Coward | more than 6 years ago | (#23183914)

As an admin for some -very- high availability systems, load balancers are not a silver bullet. This solution would most apply to one-node clusters using a single machine as a perimeter network device (e.g. a firewall). I see lots of these in the racks at our NOC provider.

1. We connect to several load balanced systems, and the complexity introduced by load balancers translates to inexplicable downtime. No load balancers means a pretty steady diet of the latest and greatest server hardware, but no downtime. A few minutes of downtime costs more than the server hardware.

2. High availability translates more roughly into nodes that can fail (e.g. power off) and not take the cluster down. This boils down to active-passive application architecture more than just using heartbeat.

As an FYI, PostgreSQL clustering is a killer application for me. Erlang is also great in many ways, but requires application architecture with active-passive node awareness, which isn't present in things like Yaws, or even my other favorite non-Erlang app, nginx. Heartbeat is the solution there, but I'd like to see Yaws be cluster aware on its own. http://yaws.hyber.org/

Re:No, No, No and No again. (1)

0racle (667029) | more than 6 years ago | (#23184578)

If I may ask, what PostgreSQL clustering solution do you use?

Re:Needed that bad? (2, Interesting)

QuantumRiff (120817) | more than 6 years ago | (#23184462)

But what about the servers that are placed in remote sites like small cell towers, where space and backup power are critical issues?

Unless it fails. (2, Insightful)

Joe Snipe (224958) | more than 6 years ago | (#23183188)

honestly how much downtime are we talking here? 30 seconds?

Re:Unless it fails. (4, Funny)

Anonymous Coward | more than 6 years ago | (#23183298)

honestly how much downtime are we talking here? 30 seconds?
Well, think about the fsck that happens after 180 days or 30+ mounts?

Re:Unless it fails. (2, Insightful)

m50d (797211) | more than 6 years ago | (#23184390)

Uh, if you actually need that, then you needed it anyway. And if you don't need it but don't know how to disable it, you shouldn't be running a system.

Re:Unless it fails. (1)

geekoid (135745) | more than 6 years ago | (#23183724)

It's more than the time. Management and interruption of even a second of downtime can be costly in a large organization.
All work comes to a halt, all connections need to be reestablished, work momentum is lost, etc.

Re:Unless it fails. (2, Informative)

UnknowingFool (672806) | more than 6 years ago | (#23183768)

For your average computer and generic Linux servers, the downtime is small. But companies often have applications that they need to restart; that is the difference. Also, Linux is used on equipment other than generic servers: embedded systems, etc., where loading isn't optimized because the equipment should never go down.

Re:Unless it fails. (1)

shamer (897211) | more than 6 years ago | (#23183828)

lol it takes more than 30 seconds just to scan for SCSI devices, on my server anyway.

Total boot time is in the 3 minute range, most of that is server scanning for devices / POST'ing.

of course I'm in no need of true 24/7/365 uptime, but as stated above "Oh Cool!"

   

Re:Unless it fails. (4, Informative)

Tychon (771855) | more than 6 years ago | (#23183832)

A company that I once had dealings with was quite proud of their five nines. The motivation? It cost them $18,000 per second they were down. 30 seconds isn't just 30 seconds sometimes.

Re:Unless it fails. (1)

ACMENEWSLLC (940904) | more than 6 years ago | (#23184770)

Damn, that's the reason I get killed off in Eve. Right as I attack, my connection drops. 2AM, 6 months of work, and 30 seconds of downtime to ruin it all.

Amazing (4, Interesting)

cromar (1103585) | more than 6 years ago | (#23183198)

That is truly amazing tech, right there. It would be interesting to know the security implications of being able to hot-patch the kernel, however.

Re:Amazing (5, Funny)

katz (36161) | more than 6 years ago | (#23183794)

Considering that you don't need to prepare the kernel in any way--just execute the program and bang, it's patched--someone with root access could slip a rootkit right under your nose (i.e., without the system administrator being aware of it).

- Roey

Re:Amazing (5, Insightful)

KeithJM (1024071) | more than 6 years ago | (#23183896)

someone with root access could slip a rootkit right under your nose
Yeah, someone with root access can take control of your server. Oh, wait, they've got root access. They already have control of your server. At some point, you have to just accept that giving someone root access is a security risk.

Re:Amazing (1)

swillden (191260) | more than 6 years ago | (#23183996)

someone with root access could slip a rootkit right under your nose
Yeah, someone with root access can take control of your server. Oh, wait, they've got root access. They already have control of your server. At some point, you have to just accept that giving someone root access is a security risk.

Barring a carefully-implemented Mandatory Access Control system, anyway.

Re:Amazing (3, Insightful)

katz (36161) | more than 6 years ago | (#23184424)

My bad, I meant to say,

    "A remote attacker who successfully executes a privilege escalation exploit and gains root access will have an easier time taking control of your server and hiding their tracks".

Thanks for pointing that out

- Roey

Re:Amazing (1)

Abcd1234 (188840) | more than 6 years ago | (#23184186)

As opposed to slipping a rootkit into the kernel image on-disk, and then waiting for/forcing a reboot?

Re:Amazing (1)

FooAtWFU (699187) | more than 6 years ago | (#23184090)

A small (but nonzero) step up in implications over them "only" having root on the server (think "new spot to deploy a rootkit"). But at that point you're already in deep trouble, so better to avoid getting there to begin with.

Maybe... (0)

Anonymous Coward | more than 6 years ago | (#23183218)

...this will spur Microsoft to at least implement updates that don't require reboots. Hmmm, I think I may have stumbled on MS WIN 7's marketing slogan...

Re:Maybe... (5, Funny)

CogDissident (951207) | more than 6 years ago | (#23183326)

I thought their working slogan was:

Windows 7, it's not awful like Vista!

No more reboots - FTW! (0)

Anonymous Coward | more than 6 years ago | (#23183240)

NT

Wrong way to solve the uptime problem (4, Insightful)

Anon E. Muss (808473) | more than 6 years ago | (#23183244)

Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.

Re:Wrong way to solve the uptime problem (5, Funny)

Qzukk (229616) | more than 6 years ago | (#23183320)

Trust me, that was the first thing they thought of, then the CEO came in and said "Why are you ordering more equipment when we have half of our machines sitting there and doing nothing? We could be doing twice the work/traffic/whatever without paying more money!"

Re:Wrong way to solve the uptime problem (1, Insightful)

Anonymous Coward | more than 6 years ago | (#23183730)

yes, but if the CEO knew anything, he'd know that clustered computing is part of the job (or not) and he (maybe she?) wouldn't ask stupid questions.

Re:Wrong way to solve the uptime problem (4, Funny)

Anonymous Coward | more than 6 years ago | (#23183814)

If he knew anything, he wouldn't be the CEO.

Not only the CEO (4, Interesting)

Moraelin (679338) | more than 6 years ago | (#23183774)

Not only the CEO. I lived to see even a hardline IT guy (admittedly, one whose goal in life seems to be to be against whatever you want, and to avoid doing any extra work... actually, make that just: any work) argue along the lines of "nooo, you can't have the servers only 60% loaded! It's a waste of valuable hardware! Why, back in my day (of batch jobs on punched cards, presumably) we had the mainframe used at least an average of 95% before asking for an extra server!"

It always irks me to see people just not understand concepts like "peak" vs "average", or "failing over".

- A cluster of, say, 4 machines (a small application, really) loaded to 90% of capacity: if one dies, the other 3 are now at 120% of capacity each. If you're lucky, it just crawls; if you're unlucky, Java clutches its chest and keels over with an "OutOfMemoryError" or such.

- If you're at 90% most of the time, then fear Monday 9:00 AM, when every single business partner on that B2B application comes to work and opens his browser. Or fear the massive year-end batch jobs, when that machine/cluster, sized barely enough to be ready with the normal midnight jobs by 9 AM so those users can see their new offers and orders in their browsers, now has to do 20 times as much in a burst.

Basically it amazes me how many people just don't seem to get that simple rule of thumb of clusters: you're either getting nearly 100% uptime and nearly guaranteed response times, _or_ you're getting that extra hardware fully used to support a bigger load. Not both. Or not until that cluster is so large that 1-2 servers failing add negligible load to the remaining machines.

Re:Wrong way to solve the uptime problem (3, Informative)

N1ck0 (803359) | more than 6 years ago | (#23183710)

This is mainly why people in the telecom industry have been clamoring for it. It's very difficult to take over the termination of a circuit-switched system without some interruption for the end user. And it's also not always easy to busy out all channels on a line as calls drop off so you can free up a machine for patching.

Of course, a lot of the reason is that many commercial telecom apps are badly implemented and need better management controls.

Re:Wrong way to solve the uptime problem (1)

MrMunkey (1039894) | more than 6 years ago | (#23184014)

I'd mod you up if I had points. It's hard to have fail-over systems when a cable has to be plugged in somewhere, and on top of that the channels have to be synced with the end user.

Re:Wrong way to solve the uptime problem (1)

Rich0 (548339) | more than 6 years ago | (#23184494)

Why - that's no excuse for not clustering!

Just tell each phone customer to have two sets of phones at home, so that when one line is down they can just use the other. Be sure to charge them for both.

Hmm - that actually is starting to sound like the sort of business model the wired phone company around my area might actually propose...

Re:Wrong way to solve the uptime problem (4, Insightful)

trybywrench (584843) | more than 6 years ago | (#23183728)

Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.
People using Linux on BigIron(tm) bank on 24/7/365/25years uptime. When a single server costs hundreds of thousands or millions of dollars you can't afford a spare sitting idle. From day 1 the server needs to be making money and never ever stop. For smaller general-purpose servers like the ones you can buy at Dell.com, then yeah, having a fail-over makes sense.

Re:Wrong way to solve the uptime problem (1)

diamondsw (685967) | more than 6 years ago | (#23184038)

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime. When a single server costs hundreds of thousands or millions of dollars you can't afford a spare sitting idle.

Active-Active clustering or load balancing. Sure, it can be a bitch to get working with all of the data synchronization required (especially for things like databases, which are traditionally active-passive), but if you want real reliability and the efficiency of using both boxes, it's what you do.

Anything less is asking for trouble.

Re:Wrong way to solve the uptime problem (1)

Abcd1234 (188840) | more than 6 years ago | (#23184220)

And, clearly, you know better how to run a bank's systems than they do, despite having run them this way for, what, 30 years? 40?

Re:Wrong way to solve the uptime problem (2, Insightful)

Anonymous Coward | more than 6 years ago | (#23184750)

Now is not the time to claim banks know what they are doing.

Re:Wrong way to solve the uptime problem (1)

poot_rootbeer (188613) | more than 6 years ago | (#23184564)

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.

If you own a piece of Big Iron and run Linux on it, it's going to be virtualized. Hundreds of virtual Linux boxes that can be arbitrarily failed over, patched, and rebooted, with the physical hardware carrying on uninterrupted all the while.

Re:Wrong way to solve the uptime problem (2, Insightful)

Anon E. Muss (808473) | more than 6 years ago | (#23184640)

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.

I doubt there are many people running Linux on true Big Iron. I'm not saying it doesn't happen, I'm saying that most Big Iron runs something else. I know many financial institutions and telecom operators use HP NonStop systems. These can stay up 24/7/365/25years, and you pay millions of dollars for that. They have full redundant hardware inside the box, run a proprietary OS, and proprietary applications.

Re:Wrong way to solve the uptime problem (1, Informative)

Anonymous Coward | more than 6 years ago | (#23184778)

Big Banks (tm) - like the one I currently work in - can afford to and do have even the largest systems installed in fully redundant configurations. It's part of standard BCM (business continuity management) practice - we need to, and can, survive an entire datacenter dropping off the network, for whatever reason up to and including getting bombed off the face of the earth. In normal day-to-day practice these machines can be and are used for load-balancing, to allow primary boxes to get taken down for maintenance.

And as a sysadmin in a bank, the solution described in the story isn't that appealing. It strikes me as something inherently less reliable than doing a cold boot with a new kernel. Scheduled downtime is OK, unscheduled problems because someone wanted to do an upgrade on the fly are *bad*.

Re:Wrong way to solve the uptime problem (1)

guruevi (827432) | more than 6 years ago | (#23184844)

And how do bigiron servers do it? Trust me, I've worked with bigiron and there are several solutions:

Some type of virtualization, partitioning or jails, and you can emulate a cluster of machines with minimal performance impact. The 'host' doesn't necessarily need to be upgraded frequently since it's very minimal in function (load a kernel into a processor).

You have your monthly/yearly maintenance that takes everything offline at 3 am and upgrades it if necessary. It's not unusual to see those things 3-5 major versions behind though depending on the work. Just like in Linux, a lot of it is modularized so much, that you don't have to take the whole thing offline to upgrade parts of it. If you have a somewhat decent vendor, they'll backport recent patches to kernel modules to your version and you can update.

100% is not possible with a single machine; even if you want it, there is no way you will foresee every update, patch, or just plain people doing something stupid or stuff breaking. Any modern single server (mainframe) costing more than a mere $100,000 is most likely a machine consisting of several machines already.

Re:Wrong way to solve the uptime problem (1)

geekoid (135745) | more than 6 years ago | (#23183740)

or get a mainframe.

Re:Wrong way to solve the uptime problem (1)

cellmaker (621214) | more than 6 years ago | (#23183780)

You're thinking about the wrong type of equipment. Don't think about typical data room servers, think VERY specialized telephony equipment. Something you don't have redundant racks for. Instead, it is usual that the rack has redundant cards for the specialized functions.

'course, in this case, you would think that we can swap to a redundant card and reload the now inactive one with a pre-patched image. But in reality, this depends on the software management on the box. Some will not allow card-by-card updates and force the entire box to reboot if the software is updated. Those boxes that require a system boot to update the software could benefit from this. But then, there can be company policies about applying "patches". My company got bitten a few too many times by patching live equipment, so patches were suspended unless you got signoff by a number of managers for extraordinary cases.

I remember one time many moons ago, I needed to patch some object on disk & restart a board. I had honed my procedure in a lab all day long. The night of the patch, I had managers & project managers watching over my shoulder and the customer on the speaker phone. So I cranked up the disk editor and went to work. CLICK CLICK CLIKETY CLICK.... You know how key strokes sound over a speaker phone, right? CLICK CLICK ... "Oh!".. Tap Tap Tap... CLICK CLICK CLICK. I've always wondered if the people on the other end of the phone took a moment to look at each other about then. :)

Re:Wrong way to solve the uptime problem (1)

bjourne (1034822) | more than 6 years ago | (#23183860)

Correct, but then you should never fix crasher bugs either. Because it is a mistake and you will never achieve 100% uptime. Use distributed computing instead.... Your argument is flawed. What happens if you have a dual node system and one node suffers a critical software failure while the other is rebooting due to a patched kernel? Your system suffers downtime that it otherwise wouldn't have if it had hot patching.

Re:Wrong way to solve the uptime problem (1)

Explodicle (818405) | more than 6 years ago | (#23184074)

Just how frequent are your critical software failures, and how long does it take you to patch a kernel? I agree that in theory this could happen, but the probability seems extremely low.

Re:Wrong way to solve the uptime problem (2, Insightful)

Ed Avis (5917) | more than 6 years ago | (#23184784)

Who cares about servers? I want my Linux desktop to stay up-to-date with security fixes without having to reboot it every few days.

Unnecessary (1)

isj (453011) | more than 6 years ago | (#23183278)

> "this is pure gold"

It is also a waste of time. Instead of spending time hot-patching a kernel, jotting down which patch it was, and verifying that it actually installed (and considering you cannot change the layout of structures in a hot-patch anyway), the time would be better spent designing protocols that can handle a hot-standby switchover.

Yes, there are a few scenarios where the hardware is so expensive that you cannot afford redundancy, but that is rare.

Already been used (4, Informative)

caluml (551744) | more than 6 years ago | (#23183310)

There was a kernel exploit [milw0rm.com] recently where someone submitted a patch that modified the running kernel using this technology. It didn't work for me, so I had to resort to patching the .c that was affected - but a lot of people reported that it worked.

Re:Already been used (2, Informative)

ThisNukes4u (752508) | more than 6 years ago | (#23184460)

IIRC, that code was actually a modified version of the exploit where the payload was changed to fix the exploit instead of spawn a root shell. Pretty fucking ingenious if you ask me.

Beats Windows (-1, Flamebait)

hey (83763) | more than 6 years ago | (#23183316)

Whether it's really needed or not doesn't matter to me.
The main thing is we can laugh harder at Windows users who have to reboot to install applications!

at root it's just trampolining (1)

norbac (1113477) | more than 6 years ago | (#23183386)

The way it identifies what to patch is cool, but the 'hot' part of the patch is ultimately just simple trampolining -- replacing the start of the patched function in the code segment with a jmp to your new code. I did similar work in the Linux kernel for a master's project.
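
For the curious, the trampolining idea is easy to demonstrate in userspace. The sketch below is purely illustrative (it is not Ksplice's code, and every name in it is made up): it overwrites the first 12 bytes of an "old" function with an absolute jump to its replacement. It assumes x86-64 Linux, that the old function occupies at least 12 bytes (or is padded to the usual 16-byte alignment), that the OS lets the code page be remapped writable+executable (strict W^X setups will refuse), and that it is compiled without optimization (e.g. gcc -O0) so the calls aren't inlined or folded away. Ksplice does the in-kernel equivalent, plus all the safety checking the paper describes.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* The "buggy" function and its patched replacement (hypothetical). */
__attribute__((noinline)) static int old_behavior(void) { return 1; }
__attribute__((noinline)) static int new_behavior(void) { return 2; }

/* Overwrite the first 12 bytes of old_fn with: movabs $new_fn,%rax ; jmp *%rax */
static int install_trampoline(void *old_fn, void *new_fn)
{
    uint8_t jump[12] = { 0x48, 0xb8, 0,0,0,0,0,0,0,0, 0xff, 0xe0 };
    memcpy(jump + 2, &new_fn, sizeof new_fn);      /* splice in the 8-byte target */

    long page = sysconf(_SC_PAGESIZE);
    void *start = (void *)((uintptr_t)old_fn & ~((uintptr_t)page - 1));

    /* Make the code page(s) writable, patch, then drop the write permission again. */
    if (mprotect(start, 2 * page, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return -1;
    memcpy(old_fn, jump, sizeof jump);
    return mprotect(start, 2 * page, PROT_READ | PROT_EXEC);
}

int main(void)
{
    printf("before: %d\n", old_behavior());        /* prints 1 */
    if (install_trampoline((void *)old_behavior, (void *)new_behavior) != 0)
        perror("trampoline");
    printf("after:  %d\n", old_behavior());        /* now prints 2 */
    return 0;
}

After the trampoline is installed, every existing caller of the old function transparently ends up in the new one -- which is the whole point, and also why the rootkit comments above are not entirely joking.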

Re:at root it's just trampolining (1)

tinkerghost (944862) | more than 6 years ago | (#23183970)

Hmm, when I was doing 8bit assembly, we called it a wedge .... crazy kids ... and get off my lawn

replace modules (2, Interesting)

hey (83763) | more than 6 years ago | (#23183420)

Rather than a source-code-level system, I'd prefer a way of replacing loadable kernel modules without a reboot. Then push more code into modules -- e.g. the file system. (Hey, sounds like a micro-kernel.)

Re:replace modules (1, Insightful)

Anonymous Coward | more than 6 years ago | (#23183512)

Theory of operation:
1. Build new_module
2. rmmod old_module
3. modprobe new_module

Gee, that was hard :-)

Re:replace modules (0)

Anonymous Coward | more than 6 years ago | (#23184928)

Now try that with your disk driver, or network driver, or that specialized hardware interface to that fiber transceiver, or anything else that you can't re-initialize without downtime, eh?

Re:replace modules (1)

Uncle Focker (1277658) | more than 6 years ago | (#23183802)

Don't worry, in 50 years you'll be able to do it in Hurd. That is if it ever gets out of alpha state by then.

Re:replace modules (1)

petermgreen (876956) | more than 6 years ago | (#23184358)

You can already replace loadable modules without a reboot, as long as they aren't doing anything critical to your kernel's operation.

Does this mean... (1)

Thelasko (1196535) | more than 6 years ago | (#23183442)

I can now install hypervisors without rebooting the victim's... I mean... client's computer?

[strokes handlebar mustache deviously]

The real test... (4, Funny)

hal2814 (725639) | more than 6 years ago | (#23183464)

Can ksplice be installed without rebooting?

Re:The real test... (2, Informative)

LinuxDon (925232) | more than 6 years ago | (#23183876)

It's in the comment: "ksplice requires no kernel modifications"

So yes, ksplice can be installed/used without rebooting.

Impressive hack (4, Informative)

EriktheGreen (660160) | more than 6 years ago | (#23183472)

For those that haven't read the paper, the technique used is straightforward in concept, but the devil is in the details.

He basically compiles a patched and unpatched kernel with the same compiler, compares the ELF output, and uses that to generate a binary file that corresponds to the change. That gets wrapped in a generic module for use; another module installs it along with JMPs to bypass the old code and use the new, and he performs the checks needed to make sure he can safely install the redirects.

He also has to differentiate real changes from incidental ones (the example given is changing the address of a function - all references to it will change, but they don't really need to be included in the binary diff).

The only human work required is to check whether a patch makes semantic changes to a data structure... e.g. whether an unsigned integer variable that was being used as a number is now a packed set of flags - the data declaration is the same, but it's being used differently.
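
To make that caveat concrete, here is a made-up C illustration (hypothetical names, not taken from the paper). The two declarations are byte-for-byte identical, so nothing in the compiled output flags the change; only a human can notice that the field's meaning has changed and that live objects created by the old code would be misread by the new code.

#include <stdint.h>

/* Before the hypothetical patch: 'state' is a plain retry counter. */
struct conn_before {
    uint32_t state;            /* number of retries so far */
};

/* After the patch: same field name, size and offset, but now a bitmask. */
#define CONN_OPEN      (1u << 0)
#define CONN_THROTTLED (1u << 1)
#define CONN_CLOSING   (1u << 2)

struct conn_after {
    uint32_t state;            /* OR of CONN_* flags */
};

/* The layouts match, so a binary comparison sees no structural change.
 * But a live object written by the old code with state == 3 ("3 retries")
 * would be read by the patched code as CONN_OPEN | CONN_THROTTLED. */

That kind of mismatch is invisible to the tooling, which is why the paper leaves it to the person preparing the update.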

Interesting paper. Also a useful new set of capabilities for any Linux user who can't handle downtime for quarterly patching... worth its weight in gold in some businesses.

Erik

Re:Impressive hack (1)

Vectronic (1221470) | more than 6 years ago | (#23183572)

I was just about to do the calculation to see how much it would be worth, but I forgot how much a bit weighs...

Re:Impressive hack (3, Funny)

EriktheGreen (660160) | more than 6 years ago | (#23184004)

Well, let's see.

A silver dollar, from which bits were commonly cut, weighs about .77 troy ounces.

Today's gold price as of posting is about $889.95 US per troy ounce.

A silver dollar was typically cut into 8 bits, which gives us a weight per bit of 0.096 ounces. That translates to about $85.66 per bit weight in gold. Remember, this is per system being patched.

Since the patches being applied ranged from 1 line to 285 lines per the paper, and a reasonable estimate of compiled average bytes per line is something like 20, we get a value of $13,700 per line of patch in gold. Even for the smaller patches, this is significant. The largest patch would be worth nearly $4,000,000 USD in gold.

Of course, for 64 bit systems vs. 32 bit, the value would be twice as much :)

Erik

Re:Impressive hack (0, Redundant)

EriktheGreen (660160) | more than 6 years ago | (#23184030)

Move that "Remember, this is per system..." down a paragraph. Slashdot needs a post edit function.

If it's that critical, shouldn't you have two? (4, Insightful)

Paul Carver (4555) | more than 6 years ago | (#23183514)

I'd rather have at least two of anything important and have statefull failover between them.

If you've got this system that's so critical you can't reboot it for a kernel upgrade, what do you do when the building catches fire or a tanker truck full of toxic waste hops the curb and plows through the wall of your datacenter?

I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.

If you can't transfer the workload to a location at least a couple hundred miles away without users noticing then you're not in the big league.

And as long as the workload is in another datacenter, what's the big deal about rebooting for a kernel upgrade?

Re:If it's that critical, shouldn't you have two? (1)

Akatosh (80189) | more than 6 years ago | (#23183936)

I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.
You must work for one of those telephone companies with infinite time, money and no legacy equipment. Must be nice.

Re:If it's that critical, shouldn't you have two? (1)

mpapet (761907) | more than 6 years ago | (#23184388)

There are applications where this is simply not possible, and I happen to admin some applications like that. This is what active-passive clustering is all about. Even then, minor updates of any kind are long, carefully practiced, high-anxiety events.

Another informative post mentioned telephony as the perfect application; I copied it here as an FYI:

"It's very difficult to take over the termination of a circuit-switched system without some interruption for the end user. And it's also not always easy to busy out all channels on a line as calls drop off so you can free up a machine for patching."

 

Re:If it's that critical, shouldn't you have two? (2)

EriktheGreen (660160) | more than 6 years ago | (#23184402)

In some engineered systems, it just isn't possible to have redundancy in the way you mean.

Extreme example: Try to design a fail-over for the space shuttle's solid rocket boosters :)

Interestingly, I've found that the skill needed (and the pay gathered) to deal with systems that can't be made redundant is much higher than that needed to work on "grid" or cluster systems where multiple cheap pieces of hardware are used.

And they tend to be more reliable too.

Re:If it's that critical, shouldn't you have two? (1)

noidentity (188756) | more than 6 years ago | (#23184648)

I think one point made several times is that you will have multiple servers where taking one down wouldn't interrupt services, just that the cost of taking one down is so great that you'd rather replace the kernel live. You can't solve that by adding even more super-expensive servers either.

Year of the linux desktop ... again? (0)

Anonymous Coward | more than 6 years ago | (#23183526)

Maybe this new tech will spur the year of the linux desktop computer! ...

Over-engineered solution to a non-existent problem (3, Insightful)

hacker (14635) | more than 6 years ago | (#23183556)

Once again, we have an over-engineered solution to a non-existent problem.

Any enterprise-level customer is going to have a VERY lengthy Q&A process before deploying anything into production. This includes testing kernels, hardware, networks, interaction, application, data and so on. One pharmaceutical company I know of is federally mandated to do this twice a year, every year, for every single machine that reads, writes or generates data. Period.

So you hot-patch a running Linux kernel. How do you Q&A that? How do you roll back if the patch fails? Where is your 'control'?

The answer? A duplicate machine. But wait, if you have two identical machines... isn't that... a cluster?

Exactly. And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades. You NEVER EVER touch a running, production system like that.

Well, not if you want any sort of data integrity or control and want to pass any level of quality validation on that physical environment.

Re:Over-engineered solution to a non-existent prob (1)

ROBOKATZ (211768) | more than 6 years ago | (#23183862)

Once again, we have an over-engineered solution to a non-existent problem.

Welcome to academia. I think it's an interesting start, and maybe someday we'll have solved the additional problems you've listed. And let's face it, rebooting for updates is annoying, mission critical or not.

Re:Over-engineered solution to a non-existent prob (1)

kortex (590172) | more than 6 years ago | (#23183940)

Thank you. I was getting depressed at what I was reading. Hot-patching production kernels = amateur. Never take a *needless* risk. Ever. Hot-patching a running non-production kernel "because you can", well then, that's a pretty neat thing, high on the geek scale. But don't even come near my prod cluster, neophyte, or I'll have your limbs removed.

Re:Over-engineered solution to a non-existent prob (0)

Anonymous Coward | more than 6 years ago | (#23184068)

Q&A doesn't prove the absence of bugs. Also, the less you spend the more your shareholders will thank you (or ravage you).

Re:Over-engineered solution to a non-existent prob (0)

Anonymous Coward | more than 6 years ago | (#23184468)

Your process of testing servers involves asking them questions and getting answers?

You are Wrong (3, Insightful)

mpapet (761907) | more than 6 years ago | (#23184576)

And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades

Hmmm. I happen to live by your words in an environment where this is theoretically possible, but practically impossible. Why? Because when the cluster rolls to a passive node, the application times out on the existing connections. The time outs have business ($$$$) implications. I wish it were okay to have infinite retries, but it's viewed as a violation of the service agreement. Telephony is like this too.

An academic ideal for sure, but please speak more humbly because it is no silver bullet.

Re:You are Wrong (1)

hacker (14635) | more than 6 years ago | (#23184856)

Frankly, if you roll to another node and you lose connections, then your cluster is misconfigured.

I've built and deployed clusters where I'm actively playing a streaming video across the cluster from a mounted drive and physically yank the power cable from the active node: there's about a 1-2 second lag in the video, and then it continues to play right where it was, without any disconnects or interruptions.

In fact, I use this as a way to demonstrate that there is ZERO loss of connectivity when nodes are downed or recycled.

You might want to look into how your cluster is (mis)configured and fix it.

And Microsoft claims to have invented it (3, Informative)

davecb (6526) | more than 6 years ago | (#23183686)

Tomasz Chmielewski wrote on LKML: the idea seems to be patented by Microsoft, i.e. this patent from December 2002: http://www.google.com/patents?id=cVyWAAAAEBAJ&dq=hotpatching [google.com] In essence, they patented kexec ;)

Andi Kleen promptly provided prior art: the basic patching idea is old and has been used many times, long predating kexec; e.g. it's a common way to implement incremental linkers too.

Imagine (1)

MortenMW (968289) | more than 6 years ago | (#23183722)

Imagine a Beowulf cluster of hot-patching Linux servers.

kexec (1)

kondor6c (1278766) | more than 6 years ago | (#23183836)

Wasn't this possible before with kexec?

Re:kexec (1)

Enderandrew (866215) | more than 6 years ago | (#23184340)

Kexec allows you to boot another kernel directly from your running kernel, skipping the firmware stage of a full reboot. I think ksplice allows you to just put a patch into your existing kernel; however, I almost have to assume they use a kexec-like implementation.

Sorry... (2, Funny)

PJ The Womble (963477) | more than 6 years ago | (#23183954)

This is old news down in the South.

They don't bother splicing. Them good ol' boys been big on Kernel Sanders for years now.

It's Not For 100% Uptime (2, Insightful)

Bob9113 (14996) | more than 6 years ago | (#23184020)

Lots of people are saying, "100% uptime of a particular machine is neither necessary nor desirable, full failover is better. Full failover is the only way to handle catastrophic hardware failures." Or something to that extent.

But this isn't about 100% uptime. It's about not having to reboot for a kernel upgrade. You should still have hot failover if you want HA, this just removes one more thing that requires a reboot.

It's like people saying, "I don't mind rebooting after installing Office, I don't expect 100% uptime from my workstation." Of course you don't need to be able to do software installs without rebooting. But isn't it nice to have that option available?

Same with this. When (and if) it gets stabilized and standardized, you'll use it. Not for 100% uptime, just because it's nice to not be required to reboot to enable a particular software install.

Re:It's Not For 100% Uptime (1)

Enderandrew (866215) | more than 6 years ago | (#23184352)

If you have a critical system that needs to be up, you better have backup servers.

You fail-over to the backup, patch the first, fail-back, patch the second, etc.

AIX (0)

Anonymous Coward | more than 6 years ago | (#23184042)

Hasn't this kind of feature been available on AIX for quite some time? I'm told you have to "unroll" patch installations if you need to insert one somewhere into the existing patch-chain, which sucks, but you can do it all on a live system.

pure gold for rootkit writers (0)

Anonymous Coward | more than 6 years ago | (#23184212)

If you are a carrier in telephony and don't want downtime, this stuff is pure gold.

If you're a darkhat writing rootkits, this is priceless :)

Imagine a Beowulf cluster of... (0)

Anonymous Coward | more than 6 years ago | (#23184474)

Systems continuously cycling the kernel version: downgrading to 0.99, upgrading back to git head and then back to 0.99...

kdawson (0, Troll)

obsolete1349 (969869) | more than 6 years ago | (#23184484)

Oh goody another KDawson post. Isn't there a way to filter them out?

Gates Was Right (1)

BigBlueOx (1201587) | more than 6 years ago | (#23184592)

"open source creates a license so that nobody can ever improve the software" according to Bill Gates. Therefore this is not an improvement. QED.