
Patch the Linux Kernel Without Reboots

kdawson posted more than 6 years ago | from the click-n-go dept.

Operating Systems 286

evanbro writes "ZDNet is reporting on ksplice, a system for applying patches to the Linux kernel without rebooting. ksplice requires no kernel modifications, just the source, the config files, and a patch. Author Jeff Arnold discusses the system in a technical overview paper (PDF). Ted Ts'o comments, 'Users in the carrier grade linux space have been clamoring for this for a while. If you are a carrier in telephony and don't want downtime, this stuff is pure gold.'" Update: 04/24 10:04 GMT by KD : Tomasz Chmielewski writes on LKML that the idea seems to be patented by Microsoft.


In Soviet Russia, (0, Offtopic)

finalnight (709885) | more than 6 years ago | (#23183132)

In Soviet Russia, the kernel reboots you!

Re:In Soviet Russia, (-1, Troll)

Anonymous Coward | more than 6 years ago | (#23183196)

In Korea only old people reboot their kernels!

Re:In Soviet Russia, (2, Funny)

oodaloop (1229816) | more than 6 years ago | (#23183720)

Let's get the rest of the usual jokes out of the way while we're at it.

If there were no kernel, it would be necessary to create our non-rebooting robot overlords are belong to Chuck Norris.

Re:In Soviet Russia, (0)

Anonymous Coward | more than 6 years ago | (#23184288)

You forgot: "Imagine a Beowulf cluster of those" and "But does it run linux?"

Re:In Soviet Russia, (5, Funny)

oodaloop (1229816) | more than 6 years ago | (#23184868)

"But does it run linux?"
That's a joke? I thought that was just one dedicated user who kept asking on every article.

Needed that bad? (5, Insightful)

MetalliQaZ (539913) | more than 6 years ago | (#23183178)

If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching. They probably would be taken out of the loop for the in-place patching anyway. So who is "clamoring"?

Re:Needed that bad? (2, Funny)

tgatliff (311583) | more than 6 years ago | (#23183540)

I guess a better way to put it would be "oh... Way Cool!!!!"... :)

Meaning, yes, I agree that in most cases it is not needed, but I have internal processing servers with uptimes of over 3 years, so if I had something like this, probably all my servers would have uptimes that long.

Re:Needed that bad? (3, Interesting)

Chris Burke (6130) | more than 6 years ago | (#23183616)

If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching.

Two things:

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine. This allows updates to a live machine.

Second, this is telephony, meaning it is the infrastructure on which the internet is based. There are no DNS or TCP/IP tricks you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there is, by necessity, a single chokepoint.

As to how often these things collide, and how much of a pain it is to actually stop a server for some amount of time, I can't say. But I can see situations where being able to hot-swap a kernel would be useful.

Re:Needed that bad? (1)

diamondsw (685967) | more than 6 years ago | (#23183960)

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine.

So you take it out of rotation on the load balancer and give it a few minutes to complete all its active connections. Patch/reboot whatever. Bring it back into rotation, and repeat with the other box.

Re:Needed that bad? (5, Insightful)

jelle (14827) | more than 6 years ago | (#23184610)

So you take it out of rotation on the load balancer and give it a few minutes to complete all its active connections. Patch/reboot whatever. Bring it back into rotation, and repeat with the other box.

Methods like that usually suck in real life, because the day before you want to 'take it out of rotation', a circuit is opened through it that requires five nines (so you can't drop it), and it will remain open for months...

You will end up with 99 boxes waiting to 'get out of rotation' for every single box that you don't need to update...

Murphy will make sure of that.

Re:Needed that bad? (4, Informative)

Iphtashu Fitz (263795) | more than 6 years ago | (#23184016)

The very fact that there is load balancing means that every server is likely to have active connections going through it. If you currently have connections going through a specific server, you don't want to drop those connections in order to reboot that particular machine. This allows updates to a live machine.

If you have a load balanced environment then you have the ability to redirect new connections away from a given server. Then it's just a matter of waiting for the active connections to terminate before the machine ends up in an idle state where you can safely apply patches offline. I've worked in a number of telephony environments and this was always the way we would patch systems. Stop accepting new connections, wait for existing ones to end, then perform the patch, reboot, verify, and start accepting connections again.

Second, this is telephony, meaning it is the infrastructure on which the internet is based. There are no DNS or TCP/IP tricks you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there is, by necessity, a single chokepoint.

Any mission-critical hardware - switches, routers, servers, etc. - should be set up in redundant pairs (or triplets, ...) so that if a hardware failure occurs the remaining hardware can keep the service up. Single points of failure are avoided like the plague in datacenters that require 100% uptime. Part of that is to deal with hardware failures, but part is also to provide the ability to perform software/firmware upgrades when necessary. Once again, you migrate all traffic off the system you're upgrading, then apply the upgrades offline. Upgrading a kernel in an online environment, especially, is something virtually any sysadmin would want to avoid if at all possible.

Redundancy is key, and any commercial datacenter will offer it all the way from their connections to the outside world to the connections they provide their customers. Every datacenter used by every company I ever worked for (about 10) offered redundant power and redundant network drops (using HSRP, VRRP, etc) for our equipment. If the datacenter needed to upgrade a router they'd move all traffic off one router so they could upgrade and test it, then move traffic off the other and repeat the process. Similarly if we needed to upgrade our firewalls, switches, etc. we'd fail over to the second redundant device first. In some cases we had bonded interfaces right on the end servers so as long as one path remained active we could power down an entire switch, router, firewall, etc. In other cases we relied on load balancing across servers that were alternately connected to one or another switch.

Re:Needed that bad? (3, Insightful)

Paul Carver (4555) | more than 6 years ago | (#23184048)

If your load balancer can't take a server out of the pool while allowing current sessions to finish cleanly then you need to shop for a new load balancer.

A decent load balancer will obviously give you the choice of whether to take a server out of service immediately, disrupting existing sessions, or simply stop sending new sessions to it while allowing existing sessions to continue.

As for your comment about physical connections, that's what portchannels and multilink trunks are for. Or VRRP and HSRP depending on which level of "connected to" you mean.

Re:Needed that bad? (2, Interesting)

Colin Smith (2679) | more than 6 years ago | (#23184974)

The very fact that there is load balancing means that every server is likely to have active connections going through it
http://conntrack-tools.netfilter.org/ [netfilter.org]

I hot-swap whole networks.

HTH.

 

Re:Needed that bad? (3, Interesting)

garlicbready (846542) | more than 6 years ago | (#23183656)

I was about to say another idea might be virtualisation, via Xen for example:
start up a new virtual machine with the new kernel, then when you're sure it's working, just switch everything across from the old to the new and shut down the old virtual instance.

No, No, No and No again. (5, Interesting)

Anonymous Coward | more than 6 years ago | (#23183914)

As an admin for some -very- high availability systems, load balancers are not a silver bullet. This solution would most apply to one-node clusters using a single machine as a perimeter network device (e.g. a firewall). I see lots of these in the racks at our NOC provider.

1. We connect to several load balanced systems, and the complexity introduced by load balancers translates to inexplicable downtime. No load balancers means a pretty steady diet of the latest and greatest server hardware, but no downtime. A few minutes of downtime costs more than the server hardware.

2. High availability translates more roughly into nodes that can fail (e.g. power off) and not take the cluster down. This boils down to active-passive application architecture more than just using heartbeat.

As an FYI, PostgreSQL clustering is a killer application for me. Erlang is also great in many ways, but requires application architecture with active-passive node awareness, which isn't present in things like Yaws, or even my other favorite non-Erlang app, nginx. Heartbeat is the solution there, but I'd like to see Yaws be cluster aware on its own. http://yaws.hyber.org/

Re:No, No, No and No again. (1)

0racle (667029) | more than 6 years ago | (#23184578)

If I may ask, what PostgreSQL clustering solution do you use?

Re:Needed that bad? (2, Interesting)

QuantumRiff (120817) | more than 6 years ago | (#23184462)

But what about the servers that are placed in remote sites like small cell towers, where space and backup power are critical issues?

Unless it fails. (2, Insightful)

Joe Snipe (224958) | more than 6 years ago | (#23183188)

honestly how much downtime are we talking here? 30 seconds?

Re:Unless it fails. (4, Funny)

Anonymous Coward | more than 6 years ago | (#23183298)

honestly how much downtime are we talking here? 30 seconds?
Well, think about the fsck that happens after 180 days or 30+ mounts?

Re:Unless it fails. (2, Insightful)

m50d (797211) | more than 6 years ago | (#23184390)

Uh, if you actually need that, then you needed it anyway. And if you don't need it but don't know how to disable it, you shouldn't be running a system.

Re:Unless it fails. (1)

geekoid (135745) | more than 6 years ago | (#23183724)

It's more than the time. Management and interruption of even a second of downtime can be costly in a large organization.
All work comes to a halt, all connections need to be reestablished, work momentum is lost, etc.

Re:Unless it fails. (2, Informative)

UnknowingFool (672806) | more than 6 years ago | (#23183768)

For your average computer and generic Linux servers, the downtime is small. But companies often have applications that they need to restart; that is the difference. Also, Linux is used on equipment other than generic servers: embedded systems, etc., where loading isn't optimized because the equipment should never go down.

Re:Unless it fails. (1)

shamer (897211) | more than 6 years ago | (#23183828)

lol it takes more than 30 seconds just to scan for SCSI devices, on my server anyway.

Total boot time is in the 3 minute range, most of that is server scanning for devices / POST'ing.

of course I'm in no need of true 24/7/365 uptime, but as stated above "Oh Cool!"

   

Re:Unless it fails. (4, Informative)

Tychon (771855) | more than 6 years ago | (#23183832)

A company that I once had dealings with was quite proud of their five nines. The motivation? It cost them $18,000 per second they were down. 30 seconds isn't just 30 seconds sometimes.

Re:Unless it fails. (1)

ACMENEWSLLC (940904) | more than 6 years ago | (#23184770)

Damn, that's the reason I get killed off in Eve. Right as I attack, my connection drops. 2AM, 6 months of work, and 30 seconds of downtime to ruin it all.

Amazing (4, Interesting)

cromar (1103585) | more than 6 years ago | (#23183198)

That is truly amazing tech, right there. It would be interesting to know the security implications of being able to hot-patch the kernel, however.

Re:Amazing (5, Funny)

katz (36161) | more than 6 years ago | (#23183794)

Considering that you don't need to prepare the kernel in any way--just execute the program and bang, it's patched--someone with root access could slip a rootkit right under your nose (i.e., without the system administrator being aware of it).

- Roey

Re:Amazing (5, Insightful)

KeithJM (1024071) | more than 6 years ago | (#23183896)

someone with root access could slip a rootkit right under your nose
Yeah, someone with root access can take control of your server. Oh, wait, they've got root access. They already have control of your server. At some point, you have to just accept that giving someone root access is a security risk.

Re:Amazing (1)

swillden (191260) | more than 6 years ago | (#23183996)

someone with root access could slip a rootkit right under your nose
Yeah, someone with root access can take control of your server. Oh, wait, they've got root access. They already have control of your server. At some point, you have to just accept that giving someone root access is a security risk.

Barring a carefully-implemented Mandatory Access Control system, anyway.

Re:Amazing (3, Insightful)

katz (36161) | more than 6 years ago | (#23184424)

My bad, I meant to say,

    "A remote attacker who successfully executes a privilege escalation exploit and gains root access will have an easier time taking control of your server and hiding their tracks".

Thanks for pointing that out

- Roey

Re:Amazing (1)

Abcd1234 (188840) | more than 6 years ago | (#23184186)

As opposed to slipping a rootkit into the kernel image on-disk, and then waiting for/forcing a reboot?

Re:Amazing (1)

FooAtWFU (699187) | more than 6 years ago | (#23184090)

A small (but nonzero) step up in implications over them "only" having root on the server (think "new spot to deploy a rootkit"). But at that point you're already in deep trouble, so better to avoid getting there to begin with.

Maybe... (0)

Anonymous Coward | more than 6 years ago | (#23183218)

...this will spur Microsoft to at least implement updates that don't require reboots. Hmmm, I think I may have stumbled on MS WIN 7's marketing slogan...

Re:Maybe... (5, Funny)

CogDissident (951207) | more than 6 years ago | (#23183326)

I thought their working slogan was:

Windows 7, it's not awful like Vista!

No more reboots - FTW! (0)

Anonymous Coward | more than 6 years ago | (#23183240)

NT

Wrong way to solve the uptime problem (4, Insightful)

Anon E. Muss (808473) | more than 6 years ago | (#23183244)

Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.

Re:Wrong way to solve the uptime problem (5, Funny)

Qzukk (229616) | more than 6 years ago | (#23183320)

Trust me, that was the first thing they thought of, then the CEO came in and said "Why are you ordering more equipment when we have half of our machines sitting there and doing nothing? We could be doing twice the work/traffic/whatever without paying more money!"

Re:Wrong way to solve the uptime problem (1, Insightful)

Anonymous Coward | more than 6 years ago | (#23183730)

yes, but if the CEO knew anything, he'd know that clustered computing is part of the job (or not) and he (maybe she?) wouldn't ask stupid questions.

Re:Wrong way to solve the uptime problem (4, Funny)

Anonymous Coward | more than 6 years ago | (#23183814)

If he knew anything, he wouldn't be the CEO.

Not only the CEO (4, Interesting)

Moraelin (679338) | more than 6 years ago | (#23183774)

Not only the CEO. I lived to see even a hardline IT guy (admittedly, one whose goal in life seems to be to be against whatever you want, and to avoid doing any extra work... actually, make that just: any work) argue along the lines of "nooo, you can't have the servers only 60% loaded! It's a waste of valuable hardware! Why, back in my day (of batch jobs on punched cards, presumably) we had the mainframe used at least an average of 95% before asking for an extra server!"

It always irks me to see people just not understand concepts like "peak" vs "average", or "failing over".

- A cluster of, say, 4 machines (a small application, really) loaded to 90% of capacity: if one dies, the other 3 are now at 120% of capacity each. If you're lucky, it just crawls; if you're unlucky, Java clutches its chest and keels over with an "OutOfMemoryError" or such.

- If you're at 90% most of the time, then fear Monday 9:00 AM, when every single business partner on that B2B application comes to work and opens his browser. Or fear the massive year-end batch jobs, when that machine/cluster, sized barely enough to be ready with the normal midnight jobs by 9 AM so those users can see their new offers and orders in their browsers, now has to do 20 times as much in a burst.

Basically it amazes me how many people just don't seem to get that simple rule of thumb of clusters: you're either getting nearly 100% uptime and nearly guaranteed response times, _or_ you're getting that extra hardware fully used to support a bigger load. Not both. Or not until that cluster is so large that 1-2 servers failing add negligible load to the remaining machines.

Re:Wrong way to solve the uptime problem (3, Informative)

N1ck0 (803359) | more than 6 years ago | (#23183710)

This is mainly why people in the telecom industry have been clamoring for it. It's very difficult to take over the termination of a circuit-switched system without some interruption for the end user. And it's also not always easy to busy out all channels on a line as calls drop off so you can free up a machine for patching.

Of course, a lot of the reason is that many commercial telecom apps are badly implemented and need better management controls.

Re:Wrong way to solve the uptime problem (1)

MrMunkey (1039894) | more than 6 years ago | (#23184014)

I'd mod you up if I had points. It's hard to have fail-over systems when a cable has to be plugged in somewhere, and on top of that the channels have to be synced with the end user.

Re:Wrong way to solve the uptime problem (1)

Rich0 (548339) | more than 6 years ago | (#23184494)

Why - that's no excuse for not clustering!

Just tell each phone customer to have two sets of phones at home, so that when one line is down they can just use the other. Be sure to charge them for both.

Hmm - that actually is starting to sound like the sort of business model the wired phone company around my area might actually propose...

Re:Wrong way to solve the uptime problem (4, Insightful)

trybywrench (584843) | more than 6 years ago | (#23183728)

Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.
People using Linux on BigIron(tm) bank on 24/7/365/25years uptime. When a single server costs hundreds of thousands or millions of dollars you can't afford a spare sitting idle. From day 1 the server needs to be making money and never ever stop. For smaller general-purpose servers like the ones you can buy at Dell.com, then yeah, having a fail-over makes sense.

Re:Wrong way to solve the uptime problem (1)

diamondsw (685967) | more than 6 years ago | (#23184038)

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime. When a single server costs hundreds of thousands or millions of dollars you can't afford a spare sitting idle.

Active-Active clustering or load balancing. Sure, it can be a bitch to get working with all of the data synchronization required (especially for things like databases, which are traditionally active-passive), but if you want real reliability and the efficiency of using both boxes, it's what you do.

Anything less is asking for trouble.

Re:Wrong way to solve the uptime problem (1)

Abcd1234 (188840) | more than 6 years ago | (#23184220)

And, clearly, you know better how to run a bank's systems than they do, despite having run them this way for, what, 30 years? 40?

Re:Wrong way to solve the uptime problem (2, Insightful)

Anonymous Coward | more than 6 years ago | (#23184750)

Now is not the time to claim banks know what they are doing.

Re:Wrong way to solve the uptime problem (1)

poot_rootbeer (188613) | more than 6 years ago | (#23184564)

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.

If you own a piece of Big Iron and run Linux on it, it's going to be virtualized. Hundreds of virtual Linux boxes that can be arbitrarily failed over, patched, and rebooted, with the physical hardware carrying on uninterrupted all the while.

Re:Wrong way to solve the uptime problem (2, Insightful)

Anon E. Muss (808473) | more than 6 years ago | (#23184640)

People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.

I doubt there are many people running Linux on true Big Iron. I'm not saying it doesn't happen, I'm saying that most Big Iron runs something else. I know many financial institutions and telecom operators use HP NonStop systems. These can stay up 24/7/365/25years, and you pay millions of dollars for that. They have full redundant hardware inside the box, run a proprietary OS, and proprietary applications.

Re:Wrong way to solve the uptime problem (1, Informative)

Anonymous Coward | more than 6 years ago | (#23184778)

Big Banks (tm) - like the one I currently work in - can afford to and do have even the largest systems installed in fully redundant configurations. It's part of standard BCM (business continuity management) practice - we need to, and can, survive an entire datacenter dropping off the network, for whatever reason up to and including getting bombed off the face of the earth. In normal day-to-day practice these machines can be and are used for load-balancing, to allow primary boxes to get taken down for maintenance.

And as a sysadmin in a bank, the solution described in the story isn't that appealing. It strikes me as something inherently less reliable than doing a cold boot with a new kernel. Scheduled downtime is OK, unscheduled problems because someone wanted to do an upgrade on the fly are *bad*.

Re:Wrong way to solve the uptime problem (1)

guruevi (827432) | more than 6 years ago | (#23184844)

And how do bigiron servers do it? Trust me, I've worked with bigiron and there are several solutions:

Some type of virtualization, partitioning or jails, and you can emulate a cluster of machines with minimal performance impact. The 'host' doesn't necessarily need to be upgraded frequently since it's very minimal in function (load a kernel into a processor).

You have your monthly/yearly maintenance that takes everything offline at 3 am and upgrades it if necessary. It's not unusual to see those things 3-5 major versions behind though depending on the work. Just like in Linux, a lot of it is modularized so much, that you don't have to take the whole thing offline to upgrade parts of it. If you have a somewhat decent vendor, they'll backport recent patches to kernel modules to your version and you can update.

100% is not possible with a single machine; even if you want it, there is no way you will foresee every update, patch, or just plain people doing something stupid or stuff breaking. Any modern single server (mainframe) costing more than a mere $100,000 is most likely a machine consisting of several machines already.

Re:Wrong way to solve the uptime problem (1)

geekoid (135745) | more than 6 years ago | (#23183740)

or get a mainframe.

Re:Wrong way to solve the uptime problem (1)

cellmaker (621214) | more than 6 years ago | (#23183780)

You're thinking about the wrong type of equipment. Don't think about typical data room servers, think VERY specialized telephony equipment. Something you don't have redundant racks for. Instead, it is usual that the rack has redundant cards for the specialized functions.

'course, in this case, you would think that we can swap to a redundant card and reload the now inactive one with a pre-patched image. But in reality, this depends on the software management on the box. Some will not allow card-by-card updates and force the entire box to reboot if the software is updated. Those boxes that require a system boot to update the software could benefit from this. But then, there can be company policies about applying "patches". My company got bitten a few too many times by patching live equipment, so patches were suspended unless you got signoff by a number of managers for extraordinary cases.

I remember one time many moons ago, I needed to patch some object on disk & restart a board. I had honed my procedure in a lab all day long. The night of the patch, I had managers & project managers watching over my shoulder and the customer on the speaker phone. So I cranked up the disk editor and went to work. CLICK CLICK CLIKETY CLICK.... You know how key strokes sound over a speaker phone, right? CLICK CLICK ... "Oh!".. Tap Tap Tap... CLICK CLICK CLICK. I've always wondered if the people on the other end of the phone took a moment to look at each other about then. :)

Re:Wrong way to solve the uptime problem (1)

bjourne (1034822) | more than 6 years ago | (#23183860)

Correct, but then you should never fix crasher bugs either. Because it is a mistake and you will never achieve 100% uptime. Use distributed computing instead.... Your argument is flawed. What happens if you have a dual node system and one node suffers a critical software failure while the other is rebooting due to a patched kernel? Your system suffers downtime that it otherwise wouldn't have if it had hot patching.

Re:Wrong way to solve the uptime problem (1)

Explodicle (818405) | more than 6 years ago | (#23184074)

Just how frequent are your critical software failures, and how long does it take you to patch a kernel? I agree that in theory this could happen, but the probability seems extremely low.

Re:Wrong way to solve the uptime problem (2, Insightful)

Ed Avis (5917) | more than 6 years ago | (#23184784)

Who cares about servers? I want my Linux desktop to stay up-to-date with security fixes without having to reboot it every few days.

Unnecessary (1)

isj (453011) | more than 6 years ago | (#23183278)

> "this is pure gold"

It is also a waste of time. Instead of spending time hot-patching a kernel, jotting down which patch it was, and verifying that it actually installed (and considering you cannot change the layout of structures in a hot-patch anyway), the time would be better spent designing protocols that can handle a hot-standby switchover.

Yes, there are a few scenarios where the hardware is so expensive that you cannot afford redundancy, but that is rare.

Already been used (4, Informative)

caluml (551744) | more than 6 years ago | (#23183310)

There was a kernel exploit [milw0rm.com] recently where someone submitted a patch that modified the running kernel using this technology. It didn't work for me, so I had to resort to patching the .c that was affected - but a lot of people reported that it worked.

Re:Already been used (2, Informative)

ThisNukes4u (752508) | more than 6 years ago | (#23184460)

IIRC, that code was actually a modified version of the exploit where the payload was changed to fix the exploit instead of spawn a root shell. Pretty fucking ingenious if you ask me.

Beats Windows (-1, Flamebait)

hey (83763) | more than 6 years ago | (#23183316)

Whether it's really needed or not doesn't matter to me.
The main thing is we can laugh harder at Windows users who have to reboot to install applications!

at root it's just trampolining (1)

norbac (1113477) | more than 6 years ago | (#23183386)

The way it identifies what to patch is cool, but the 'hot' part of the patch is ultimately just simple trampolining -- replacing the start of the patched function in the code segment with a jmp to your new code. I did similar work in the Linux kernel for a master's project.
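
For the curious, the trampolining idea is easy to demonstrate in userspace. The sketch below is purely illustrative (it is not Ksplice's code, and every name in it is made up): it overwrites the first 12 bytes of an "old" function with an absolute jump to its replacement. It assumes x86-64 Linux, that the old function occupies at least 12 bytes (or is padded to the usual 16-byte alignment), that the OS lets the code page be remapped writable+executable (strict W^X setups will refuse), and that it is compiled without optimization (e.g. gcc -O0) so the calls aren't inlined or folded away. Ksplice does the in-kernel equivalent, plus all the safety checking the paper describes.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* The "buggy" function and its patched replacement (hypothetical). */
__attribute__((noinline)) static int old_behavior(void) { return 1; }
__attribute__((noinline)) static int new_behavior(void) { return 2; }

/* Overwrite the first 12 bytes of old_fn with: movabs $new_fn,%rax ; jmp *%rax */
static int install_trampoline(void *old_fn, void *new_fn)
{
    uint8_t jump[12] = { 0x48, 0xb8, 0,0,0,0,0,0,0,0, 0xff, 0xe0 };
    memcpy(jump + 2, &new_fn, sizeof new_fn);      /* splice in the 8-byte target */

    long page = sysconf(_SC_PAGESIZE);
    void *start = (void *)((uintptr_t)old_fn & ~((uintptr_t)page - 1));

    /* Make the code page(s) writable, patch, then drop the write permission again. */
    if (mprotect(start, 2 * page, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
        return -1;
    memcpy(old_fn, jump, sizeof jump);
    return mprotect(start, 2 * page, PROT_READ | PROT_EXEC);
}

int main(void)
{
    printf("before: %d\n", old_behavior());        /* prints 1 */
    if (install_trampoline((void *)old_behavior, (void *)new_behavior) != 0)
        perror("trampoline");
    printf("after:  %d\n", old_behavior());        /* now prints 2 */
    return 0;
}

After the trampoline is installed, every existing caller of the old function transparently ends up in the new one -- which is the whole point, and also why the rootkit comments above are not entirely joking.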

Re:at root it's just trampolining (1)

tinkerghost (944862) | more than 6 years ago | (#23183970)

Hmm, when I was doing 8bit assembly, we called it a wedge .... crazy kids ... and get off my lawn

replace modules (2, Interesting)

hey (83763) | more than 6 years ago | (#23183420)

Rather than a source-code-level system, I'd prefer a way of replacing loadable kernel modules without a reboot. Then push more code into modules -- e.g. the file system. (Hey, sounds like a micro-kernel.)

Re:replace modules (1, Insightful)

Anonymous Coward | more than 6 years ago | (#23183512)

Theory of operation:
1. Build new_module
2. rmmod old_module
3. modprobe new_module

Gee, that was hard :-)

Re:replace modules (0)

Anonymous Coward | more than 6 years ago | (#23184928)

Now try that with your disk driver, or network driver, or that specialized hardware interface to that fiber transceiver, or anything else that you can't re-initialize without downtime, eh?

Re:replace modules (1)

Uncle Focker (1277658) | more than 6 years ago | (#23183802)

Don't worry, in 50 years you'll be able to do it in Hurd. That is if it ever gets out of alpha state by then.

Re:replace modules (1)

petermgreen (876956) | more than 6 years ago | (#23184358)

You can already replace loadable modules without a reboot, as long as they aren't doing anything critical to your kernel's operation.

Does this mean... (1)

Thelasko (1196535) | more than 6 years ago | (#23183442)

I can now install hypervisors without rebooting the victim's... I mean... client's computer?

[strokes handlebar mustache deviously]

The real test... (4, Funny)

hal2814 (725639) | more than 6 years ago | (#23183464)

Can ksplice be installed without rebooting?

Re:The real test... (2, Informative)

LinuxDon (925232) | more than 6 years ago | (#23183876)

It's in the comment: "ksplice requires no kernel modifications"

So yes, ksplice can be installed/used without rebooting.

Impressive hack (4, Informative)

EriktheGreen (660160) | more than 6 years ago | (#23183472)

For those that haven't read the paper, the technique used is straightforward in concept, but the devil is in the details.

He basically compiles a patched and unpatched kernel with the same compiler, compares the ELF output, and uses that to generate a binary file that corresponds to the change. That gets wrapped in a generic module for use; another module installs it along with JMPs to bypass the old code and use the new, and he performs the checks needed to make sure he can safely install the redirects.

He also has to differentiate real changes from incidental ones (the example given is changing the address of a function - all references to it will change, but they don't really need to be included in the binary diff).

The only human work required is to check whether a patch makes semantic changes to a data structure... e.g. whether an unsigned integer variable that was being used as a number is now a packed set of flags - the data declaration is the same, but it's being used differently.
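
To make that caveat concrete, here is a made-up C illustration (hypothetical names, not taken from the paper). The two declarations are byte-for-byte identical, so nothing in the compiled output flags the change; only a human can notice that the field's meaning has changed and that live objects created by the old code would be misread by the new code.

#include <stdint.h>

/* Before the hypothetical patch: 'state' is a plain retry counter. */
struct conn_before {
    uint32_t state;            /* number of retries so far */
};

/* After the patch: same field name, size and offset, but now a bitmask. */
#define CONN_OPEN      (1u << 0)
#define CONN_THROTTLED (1u << 1)
#define CONN_CLOSING   (1u << 2)

struct conn_after {
    uint32_t state;            /* OR of CONN_* flags */
};

/* The layouts match, so a binary comparison sees no structural change.
 * But a live object written by the old code with state == 3 ("3 retries")
 * would be read by the patched code as CONN_OPEN | CONN_THROTTLED. */

That kind of mismatch is invisible to the tooling, which is why the paper leaves it to the person preparing the update.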

Interesting paper. Also a useful new set of capabilities for any Linux user who can't handle downtime for quarterly patching... worth its weight in gold in some businesses.

Erik

Re:Impressive hack (1)

Vectronic (1221470) | more than 6 years ago | (#23183572)

I was just about to do the calculation to see how much it would be worth, but I forgot how much a bit weighs...

Re:Impressive hack (3, Funny)

EriktheGreen (660160) | more than 6 years ago | (#23184004)

Well, let's see.

A silver dollar, from which bits were commonly cut, weighs about .77 troy ounces.

Today's gold price as of posting is about $889.95 US per troy ounce.

A silver dollar was typically cut into 8 bits, which gives us a weight per bit of 0.096 ounces. That translates to about $85.66 per bit weight in gold. Remember, this is per system being patched.

Since the patches being applied ranged from 1 line to 285 lines per the paper, and a reasonable estimate of compiled average bytes per line is something like 20, we get a value of $13,700 per line of patch in gold. Even for the smaller patches, this is significant. The largest patch would be worth nearly $4,000,000 USD in gold.

Of course, for 64 bit systems vs. 32 bit, the value would be twice as much :)

Erik

Re:Impressive hack (0, Redundant)

EriktheGreen (660160) | more than 6 years ago | (#23184030)

Move that "Remember, this is per system..." down a paragraph. Slashdot needs a post edit function.

If it's that critical, shouldn't you have two? (4, Insightful)

Paul Carver (4555) | more than 6 years ago | (#23183514)

I'd rather have at least two of anything important and have statefull failover between them.

If you've got this system that's so critical you can't reboot it for a kernel upgrade, what do you do when the building catches fire or a tanker truck full of toxic waste hops the curb and plows through the wall of your datacenter?

I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.

If you can't transfer the workload to a location at least a couple hundred miles away without users noticing then you're not in the big league.

And as long as the workload is in another datacenter, what's the big deal about rebooting for a kernel upgrade?

Re:If it's that critical, shouldn't you have two? (1)

Akatosh (80189) | more than 6 years ago | (#23183936)

I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.
You must work for one of those telephone companies with infinite time, money and no legacy equipment. Must be nice.

Re:If it's that critical, shouldn't you have two? (1)

mpapet (761907) | more than 6 years ago | (#23184388)

There are applications where this is simply not possible, and I happen to admin some applications like that. This is what active-passive clustering is all about. Even then, minor updates of any kind are long, carefully practiced, high-anxiety events.

Another informative post mentioned telephony as the perfect application; I copied it here as an FYI:

"It's very difficult to take over the termination of a circuit-switched system without some interruption for the end user. And it's also not always easy to busy out all channels on a line as calls drop off so you can free up a machine for patching."

 

Re:If it's that critical, shouldn't you have two? (2)

EriktheGreen (660160) | more than 6 years ago | (#23184402)

In some engineered systems, it just isn't possible to have redundancy in the way you mean.

Extreme example: Try to design a fail-over for the space shuttle's solid rocket boosters :)

Interestingly, I've found that the skill needed (and the pay gathered) to deal with systems that can't be made redundant is much higher than that needed to work on "grid" or cluster systems where multiple cheap pieces of hardware are used.

And they tend to be more reliable too.

Re:If it's that critical, shouldn't you have two? (1)

noidentity (188756) | more than 6 years ago | (#23184648)

I think one point made several times is that you will have multiple servers where taking one down wouldn't interrupt services, just that the cost of taking one down is so great that you'd rather replace the kernel live. You can't solve that by adding even more super-expensive servers either.

Year of the linux desktop ... again? (0)

Anonymous Coward | more than 6 years ago | (#23183526)

Maybe this new tech will spur the year of the linux desktop computer! ...

Over-engineered solution to a non-existent problem (3, Insightful)

hacker (14635) | more than 6 years ago | (#23183556)

Once again, we have an over-engineered solution to a non-existent problem.

Any enterprise-level customer is going to have a VERY lengthy Q&A process before deploying anything into production. This includes testing kernels, hardware, networks, interaction, application, data and so on. One pharmaceutical company I know of is federally mandated to do this twice a year, every year, for every single machine that reads, writes or generates data. Period.

So you hot-patch a running Linux kernel. How do you Q&A that? How do you roll back if the patch fails? Where is your 'control'?

The answer? A duplicate machine. But wait, if you have two identical machines... isn't that... a cluster?

Exactly. And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades. You NEVER EVER touch a running, production system like that.

Well, not if you want any sort of data integrity or control and want to pass any level of quality validation on that physical environment.

Re:Over-engineered solution to a non-existent prob (1)

ROBOKATZ (211768) | more than 6 years ago | (#23183862)

Once again, we have an over-engineered solution to a non-existent problem.

Welcome to academia. I think it's an interesting start, and maybe someday we'll have solved the additional problems you've listed. And let's face it, rebooting for updates is annoying, mission critical or not.

Re:Over-engineered solution to a non-existent prob (1)

kortex (590172) | more than 6 years ago | (#23183940)

Thank you. I was getting depressed at what I was reading. Hot-patching production kernels = amateur. Never take a *needless* risk. Ever. Hot-patching a running non-production kernel "because you can", well then, that's a pretty neat thing, high on the geek scale. But don't even come near my prod cluster, neophyte, or I'll have your limbs removed.

Re:Over-engineered solution to a non-existent prob (0)

Anonymous Coward | more than 6 years ago | (#23184068)

Q&A doesn't prove the absence of bugs. Also, the less you spend the more your shareholders will thank you (or ravage you).

Re:Over-engineered solution to a non-existent prob (0)

Anonymous Coward | more than 6 years ago | (#23184468)

Your process of testing servers involves asking them questions and getting answers?

You are Wrong (3, Insightful)

mpapet (761907) | more than 6 years ago | (#23184576)

And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades

Hmmm. I happen to live by your words in an environment where this is theoretically possible, but practically impossible. Why? Because when the cluster rolls to a passive node, the application times out on the existing connections. The time outs have business ($$$$) implications. I wish it were okay to have infinite retries, but it's viewed as a violation of the service agreement. Telephony is like this too.

An academic ideal for sure, but please speak more humbly because it is no silver bullet.

Re:You are Wrong (1)

hacker (14635) | more than 6 years ago | (#23184856)

Frankly, if you roll to another node and you lose connections, then your cluster is misconfigured.

I've built and deployed clusters where I'm actively playing a streaming video across the cluster from a mounted drive and physically yank the power cable from the active node: there's about a 1-2 second lag in the video, and then it continues to play right where it was, without any disconnects or interruptions.

In fact, I use this as a way to demonstrate that there is ZERO loss of connectivity when nodes are downed or recycled.

You might want to look into how your cluster is (mis)configured and fix it.

And Microsoft claims to have invented it (3, Informative)

davecb (6526) | more than 6 years ago | (#23183686)

Tomasz Chmielewski wrote on LKML: the idea seems to be patented by Microsoft, i.e. this patent from December 2002: http://www.google.com/patents?id=cVyWAAAAEBAJ&dq=hotpatching [google.com] In essence, they patented kexec ;)

Andi Kleen promptly provided prior art: the basic patching idea is old and has been used many times, long predating kexec; e.g. it's a common way to implement incremental linkers too.

Imagine (1)

MortenMW (968289) | more than 6 years ago | (#23183722)

Imagine a Beowulf cluster of hot-patching Linux servers.

kexec (1)

kondor6c (1278766) | more than 6 years ago | (#23183836)

Wasn't this possible before with kexec?

Re:kexec (1)

Enderandrew (866215) | more than 6 years ago | (#23184340)

Kexec allows you to boot another kernel directly from your running kernel, skipping the firmware stage of a full reboot. I think ksplice allows you to just put a patch into your existing kernel; however, I almost have to assume they use a kexec-like implementation.

Sorry... (2, Funny)

PJ The Womble (963477) | more than 6 years ago | (#23183954)

This is old news down in the South.

They don't bother splicing. Them good ol' boys been big on Kernel Sanders for years now.

It's Not For 100% Uptime (2, Insightful)

Bob9113 (14996) | more than 6 years ago | (#23184020)

Lots of people are saying, "100% uptime of a particular machine is neither necessary nor desirable, full failover is better. Full failover is the only way to handle catastrophic hardware failures." Or something to that extent.

But this isn't about 100% uptime. It's about not having to reboot for a kernel upgrade. You should still have hot failover if you want HA, this just removes one more thing that requires a reboot.

It's like people saying, "I don't mind rebooting after installing Office, I don't expect 100% uptime from my workstation." Of course you don't need to be able to do software installs without rebooting. But isn't it nice to have that option available?

Same with this. When (and if) it gets stabilized and standardized, you'll use it. Not for 100% uptime, just because it's nice to not be required to reboot to enable a particular software install.

Re:It's Not For 100% Uptime (1)

Enderandrew (866215) | more than 6 years ago | (#23184352)

If you have a critical system that needs to be up, you better have backup servers.

You fail-over to the backup, patch the first, fail-back, patch the second, etc.

AIX (0)

Anonymous Coward | more than 6 years ago | (#23184042)

Hasn't this kind of feature been available on AIX for quite some time? I'm told you have to "unroll" patch installations if you need to insert one somewhere into the existing patch-chain, which sucks, but you can do it all on a live system.

pure gold for rootkit writers (0)

Anonymous Coward | more than 6 years ago | (#23184212)

If you are a carrier in telephony and don't want downtime, this stuff is pure gold.

If you're a darkhat writing rootkits, this is priceless :)

Imagine a Beowulf cluster of... (0)

Anonymous Coward | more than 6 years ago | (#23184474)

Systems continuously cycling the kernel version: downgrading to 0.99, upgrading back to git head and then back to 0.99...

kdawson (0, Troll)

obsolete1349 (969869) | more than 6 years ago | (#23184484)

Oh goody another KDawson post. Isn't there a way to filter them out?

Gates Was Right (1)

BigBlueOx (1201587) | more than 6 years ago | (#23184592)

"open source creates a license so that nobody can ever improve the software" according to Bill Gates. Therefore this is not an improvement. QED.