
Knuth Got It Wrong

kdawson posted more than 3 years ago | from the to-be-or-not-to-be-heap dept.

Software | 298 comments

davecb writes "Think you've mastered the art of server performance? Think again. Poul-Henning Kamp, in an article at ACM Queue, finds an off-by-ten error in btrees, because they fail to take virtual memory into account. And he solves the problem in the open source 'Varnish' HTTP accelerator, for all of us to see and use."


as Knuth told me when I was at his house (5, Funny)

commodoresloat (172735) | more than 3 years ago | (#32572660)

"Who the hell are you and what are you doing in my house?"

Re:as Knuth told me when I was at his house (4, Funny)

interval1066 (668936) | more than 3 years ago | (#32572680)

I wrote Knuth an email once. He never wrote me back.

Re:as Knuth told me when I was at his house (3, Funny)

MrEricSir (398214) | more than 3 years ago | (#32572730)

Next time, don't title your e-mail "buY h3rb@L c1aL 1s today!!"

Re:as Knuth told me when I was at his house (0, Redundant)

oldhack (1037484) | more than 3 years ago | (#32573292)

Why doesn't somebody tell me these things? I need to know stuff like this.

Re:as Knuth told me when I was at his house (2, Interesting)

Ungrounded Lightning (62228) | more than 3 years ago | (#32573152)

I wrote Knuth an email once. He never wrote me back.

As with the xkcd.org reference in the grandparent, this is something of an in-joke. Don doesn't do email. B-)

Re:as Knuth told me when I was at his house (5, Funny)

TheGratefulNet (143330) | more than 3 years ago | (#32573450)

I called Wirth once, but I think I called him by name and not by value.

perhaps I made the wrong judgement call; my phone overflowed.

Re:as Knuth told me when I was at his house (2, Funny)

Meshach (578918) | more than 3 years ago | (#32572682)

"Who the hell are you and what are you doing in my house?"

Get off my lawn.

Credit where credit is due... (5, Informative)

Anonymous Coward | more than 3 years ago | (#32572762)

link [xkcd.org]

First Post (-1, Troll)

Anonymous Coward | more than 3 years ago | (#32572672)

Bitches!

holy shit? (0, Troll)

Anonymous Coward | more than 3 years ago | (#32572694)

wow, my mind just got blown

Crank it to 11 (4, Funny)

MrEricSir (398214) | more than 3 years ago | (#32572700)

10 times faster? Yawn. Wake me up when it's 11 times faster.

Re:Crank it to 11 (5, Funny)

PatPending (953482) | more than 3 years ago | (#32572736)

10 times faster? Yawn. Wake me up when it's 11 times faster.

And wake me up when it's 1010 times faster.

Re:Crank it to 11 (5, Funny)

Nadaka (224565) | more than 3 years ago | (#32572782)

That has to be one of the better binary jokes around.

Re:Crank it to 11 (0)

Anonymous Coward | more than 3 years ago | (#32573358)

But does it go to 1011?

Re:Crank it to 11 (0)

Anonymous Coward | more than 3 years ago | (#32572758)

I have one that goes to 12
http://xkcd.com/670/

Re:Crank it to 11 (1)

AnonymousClown (1788472) | more than 3 years ago | (#32572816)

10 times faster? Yawn. Wake me up when it's 11 times faster.

Director: "Yeah, but why not just make each part of the algorithm a little faster, and that way ten times faster is faster than the old ten times faster?"

You: "But this goes up to 11 times faster."

Nope, he didn't (5, Informative)

siddesu (698447) | more than 3 years ago | (#32572714)

Knuth's analysis is valid in the framework of his assumptions, and what is described in the linked article has been known as the "cache-oblivious B-tree" for quite some time.

Agreed (5, Insightful)

eldavojohn (898314) | more than 3 years ago | (#32572792)

Knuth's analysis is valid in the framework of his assumptions, and what is described in the linked article has been known as the "cache-oblivious B-tree" for quite some time.

Yeah, using this logic, once quantum computers get to the point of solving general problems there's going to be an awful lot of people who "got it wrong" because their algorithms do not apply in a quantum computer. Advancements in technology causing algorithms & data structures to be tweaked means the original person who thought them up "got it wrong"? Ridiculous.

"Oh, RSA was brilliant. But they got it wrong. You see, they failed to account for xanxinglydiumapping in 2056 computers. Poor stupid wrong bastards."

Re:Nope, he didn't (5, Insightful)

SashaMan (263632) | more than 3 years ago | (#32572986)

Mod parent up. There was a comment on the article that referenced cache-oblivious algorithms, which was a new concept to me and very interesting. Basically, this class of algorithms assumes a memory hierarchy (e.g. fast RAM vs. slow disk) and is designed to limit the number of times the slower memory is accessed. Importantly, cache-oblivious algorithms are optimal REGARDLESS of the size of the cache. That's opposed to a cache-aware algorithm, like a normal B-tree, where the size of each node is set according to the page size of the machine.

A very helpful overview here from MIT Opencourseware: http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-fourteen/ [catonmat.net]
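To picture the cache-aware end of that spectrum, here is a minimal node-sizing sketch in C (hypothetical struct layout, not taken from any particular B-tree implementation): pick the fanout so one node fills exactly one VM page, and a node lookup costs at most one page fault.

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical key/child pair stored inside a B-tree node. */
    struct slot {
        uint64_t key;
        uint64_t child;   /* page number of the child node */
    };

    /* Cache-aware sizing: choose the fanout so that one node fills
     * exactly one VM page; a cache-oblivious layout would instead be
     * arranged so it works well for any page size. */
    static size_t node_fanout(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);   /* typically 4096 */
        return page / sizeof(struct slot);             /* e.g. 4096 / 16 = 256 slots */
    }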

Re:Nope, he didn't (2, Interesting)

Darinbob (1142669) | more than 3 years ago | (#32573482)

Much of this also depends on the relative size of the data elements compared to sizes of page or cache lines.

For instance if the b-tree node is roughly the size of a page, the entire node must be read into memory at once from disk. Even reading a single byte of the record may necessitate reading the entire page. However this size will likely be much larger than a cache line. So you could have quite a lot of variance in performance based on how much of the structure you access; say scanning through the whole node versus just reading a few words at the top.

Re:Nope, he didn't (2, Interesting)

lalena (1221394) | more than 3 years ago | (#32573388)

Yes, you are correct. I think the author could have just shown figures 5 & 6 and said obviously figure 6 is better, and I would have gotten the point.
Instead I had to read a bunch of text and timing charts before he simply showed what his improvement was. Yes, cache oblivious is worse, but that wasn't the problem Knuth was trying to solve. You could make further arguments that the paging system should group links by location AND popularity. You could move more popular links to the top of the tree so you don't have to traverse past the first page to find them. Also, I would think different applications would have different potential improvements. Algorithms for hosting links to news articles (where newer articles are more popular) might not work well with this algorithm when every newly inserted link ends up at the bottom of the tree.
On the flip side, most people don't care. Unless you have many servers, it is cheaper to throw money at the problem in the form of physical RAM before you start thinking about problems like this.

Many servers (4, Informative)

tepples (727027) | more than 3 years ago | (#32573746)

Unless you have many servers

The article does in fact mention many servers, specifically replacing twelve Squid servers with three Varnish servers.

it is cheaper to throw money at the problem in the form of physical RAM before you start thinking about problems like this.

No matter how much physical RAM you have, you're still limited by the speed of the interface between RAM and the CPU. If a set of closely related nodes can fit in a cache line, this improves locality so that the CPU can do more work instead of twiddling its thumbs.
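As a small illustration of that (64 bytes is the usual x86 cache-line size; the struct is hypothetical), keys that are compared together can be packed so that one fetch from RAM brings in several of them at once:

    #include <stdint.h>

    /* A node sized and aligned to one 64-byte cache line: seven keys plus
     * a child index travel together, so comparing against all of them
     * costs a single cache-line fill instead of several. */
    struct cache_line_node {
        _Alignas(64) uint64_t keys[7];
        uint64_t first_child;
    };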

Re:Nope, he didn't (4, Insightful)

Darinbob (1142669) | more than 3 years ago | (#32573432)

Agreed. Very often these results get misused and misunderstood. Every operation has a cost, and you need to know what those costs are before you decide which one of those gets measured.

For instance, it is common with sorting algorithms to measure (in Big-Oh notation) how many comparison operations there are. However, these may not necessarily be very expensive relative to other operations, such as the cost of swapping two elements. If the elements are large structures and the comparison keys are integers, then you should be concerned about the number of copies. But if the elements are pointers to structures and the keys are long strings, then the concern will be with the number of comparisons.

Almost every textbook on algorithms you see is going to assume some sort of idealized computing device. In the real world there are multiple levels of memory speed, from caches up through virtual memory to disk. There are also other external factors that must be taken into account, for instance if you don't have enough memory, virtual or not, to hold all the data at once: say you're sorting a database selection, or you're on a mainframe with limited memory space but reading customer data from tapes.
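To make the comparisons-vs-copies trade-off above concrete, a small C sketch (hypothetical record type, illustrative only): the same qsort call is copy-bound in one form and comparison-bound in the other.

    #include <stdlib.h>

    /* Hypothetical record: a cheap integer key plus a large payload. */
    struct record {
        int  key;
        char payload[4096];
    };

    /* Sorting the records in place: each swap copies ~4 KB, so the copy
     * cost dominates even though the integer comparison is trivial. */
    static int cmp_records(const void *a, const void *b)
    {
        const struct record *ra = a, *rb = b;
        return (ra->key > rb->key) - (ra->key < rb->key);
    }

    /* Sorting an array of pointers instead: each swap moves 8 bytes, and
     * if the key were a long string, comparisons would dominate instead. */
    static int cmp_record_ptrs(const void *a, const void *b)
    {
        const struct record *ra = *(const struct record * const *)a;
        const struct record *rb = *(const struct record * const *)b;
        return (ra->key > rb->key) - (ra->key < rb->key);
    }

    void sort_both_ways(struct record *recs, struct record **ptrs, size_t n)
    {
        qsort(recs, n, sizeof recs[0], cmp_records);      /* copy-heavy   */
        qsort(ptrs, n, sizeof ptrs[0], cmp_record_ptrs);  /* compare-only */
    }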

Re:Nope, he didn't (2, Insightful)

jemfinch (94833) | more than 3 years ago | (#32573608)

No, and no. The data structure described by Poul-Henning Kamp is not a B-tree, but a heap. Additionally, it's cache-aware, not cache-oblivious.

Discrete Math &/or DataStructures material (0)

Anonymous Coward | more than 3 years ago | (#32573724)

"Knuth's analysis is valid in the framework of his assumptions, and what is described in the linked article has been known as "cache oblivious b-tree" for not so short time." - by siddesu (698447) on Monday June 14, @07:29PM (#32572714)

The entire discussion in the article really reminded me of 2 classes from academia: Discrete Math (where you see the Set Theory, Probability Theory, Logic, & the P-NP type material discussed in this article), & DataStructures (where you learn about Binary Trees and implementations via computer algorithms of them (which leads into things like tries, etc.)).

I'd absolutely HIGHLY recommend both courses to any CSC student in fact (they BOTH make you *THINK*, & much more than other classes I felt @ least, did...).

Why?

Heh - it was the ONLY things that allowed me to "keep up" with some of the article's material is why... lol! Being "familiar beforehand" tends to help that way, you know?

APK

P.S.=> The MORE DIFFICULT of the two subjects though, imo @ least? Discrete Math (for me @ least) - as I felt that you don't just "learn it by rote only" & just by "doing your homework" only either, as many other maths are like... this one?? You have to have your "thinking cap on", 24x7 in it... still, it's GOOD stuff! apk

Knuth didn't get it wrong (4, Insightful)

gnasher719 (869701) | more than 3 years ago | (#32572724)

You should have read a bit further to the bit with B-trees and B*-trees.

Re:Knuth didn't get it wrong (1)

wonderboss (952111) | more than 3 years ago | (#32573108)

Here! Here!

And the slashdot entry is wrong, too. The article doesn't say anything about "an off-by-ten error in btrees". It doesn't talk about btrees except for a passing reference. It talks about heaps and heapsort.

Re:Knuth didn't get it wrong (0)

Anonymous Coward | more than 3 years ago | (#32573702)

"Hear, hear!" -- It's not: "Here I am, and I agree!" It's more like: "Listen to this guy, he's talkin' sense."

Re:Knuth didn't get it wrong (0)

Anonymous Coward | more than 3 years ago | (#32573120)

B-trees or not B*-trees, that is the question.

Re:Knuth didn't get it wrong (5, Informative)

ImABanker (1439821) | more than 3 years ago | (#32573298)

How does one get from "I have not found a systematic flaw in the quality of Knuth et al.'s reasoning" to "Knuth Got It Wrong"?

Re:Knuth didn't get it wrong (0)

Anonymous Coward | more than 3 years ago | (#32573348)

Yes, because that article is about the optimization of heaps and not B-trees.

Re:Knuth didn't get it wrong (1)

Nimey (114278) | more than 3 years ago | (#32573648)

But then the post wouldn't be sensational, wouldn't get as many views, and would leave you with a positive view of Slashdot editors.

don't use swap, doofs (4, Insightful)

conspirator57 (1123519) | more than 3 years ago | (#32572726)

article summary:

"disk is slower than RAM. some doofs don't realize their system is swapping. ergo algorithm is bad."

throw in 'Knuth is wrong' to generate page hits.
???
profit.

non-sensationalized takeaway: "remember swap is slow; try not to use it."

Are you serious? Do you even know who phk is? (3, Informative)

Anonymous Coward | more than 3 years ago | (#32572936)

Poul-Henning is definitely not a "doof". He's single-handedly responsible for a huge amount of the FreeBSD kernel. Yes, this is the same FreeBSD that powers Yahoo!, that provided a significant portion of Mac OS X, and runs hundreds of thousands of businesses world-wide.

To suggest that phk doesn't know what he's talking about is absurd. He's one of the top three UNIX gurus in the entire world. In fact, the Internet today is what it is thanks to his hard work and dedication.

Re:Are you serious? Do you even know who phk is? (3, Insightful)

MightyMartian (840721) | more than 3 years ago | (#32573028)

I don't think anyone accused him of not knowing what he's talking about. He's been accused of using an extremely hyperbolic headline to draw readers. Besides, what he's saying isn't exactly brand new.

Re:Are you serious? Do you even know who phk is? (-1, Troll)

Anonymous Coward | more than 3 years ago | (#32573128)

It's not hyperbolic at all. phk is 100% correct, like usual. Knuth's estimates have been shown to be incorrect. In case you didn't catch it, that's what science is all about. Knuth posed his theory and his conclusions, and phk showed how they were wrong.

Re:Are you serious? Do you even know who phk is? (4, Insightful)

McNihil (612243) | more than 3 years ago | (#32573072)

To place algorithm theory in the same space as implementation is just plain wrong... and the article does say that in a better way than kdawson's sensationalist header.

The implementation of a particular algorithm on specific hardware is more in line with resource economy than anything else, and subjecting it to comparison with the theoretical implementation is just ludicrous.

For instance one could, with simple means, devise a unit where only bubble sort would be possible because of the massive overhead of qsort on said machine. This is trivial... for instance, a Casio fx-180 calculator.

Re:Are you serious? Do you even know who phk is? (1)

oldhack (1037484) | more than 3 years ago | (#32573322)

Yeah, well, it worked - we responded, eh. kdawson, you bastard.

Re:Are you serious? Do you even know who phk is? (1)

davecb (6526) | more than 3 years ago | (#32573382)

Hey, blame me, not kdawson! We both know who phk is.

--dave (with tongue firmly in cheek) c-b

Re:don't use swap, doofs (1)

TENTH SHOW JAM (599239) | more than 3 years ago | (#32573064)

non-sensationalized takeaway: "remember swap is slow; try not to use it."

Not really. Closer to "Remember, swap is slow. Think about how you use it."

Re:don't use swap, doofs (2, Insightful)

maraist (68387) | more than 3 years ago | (#32573068)

Did you read the article? Not that it's terribly innovative or anything, but he's not saying anything about swap being slow. He's saying that when you need to page to disk (e.g. a database or disk-spilling cache), don't use traditional B-Trees or B+Trees, which tend to store only a single level in a memory page / disk page. This causes log(n) disk lookups to find the leaf node (where all B+Tree data lives, and half of B-Tree data lives). Instead, store multiple levels of the tree in a contiguous mem/disk page. It's a relatively simple re-organization, and I don't think it scales out as well as he's suggesting (when you get to 1B+ records, you're not going to get a large number of the 20-level lookups in a single vertical slice).
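As a rough sketch of that re-organization (hypothetical constants and layout, not Varnish's actual b-heap code), the only thing that really changes is the parent/child address arithmetic: keep a small complete subtree in each VM page, so a root-to-leaf walk touches roughly log2(n)/9 pages instead of log2(n).

    #include <stddef.h>

    #define LEVELS      9                       /* tree levels stored per page   */
    #define PER_PAGE    ((1u << LEVELS) - 1)    /* 511 usable slots per page     */
    #define PAGE_FANOUT (1u << LEVELS)          /* child pages per page (512)    */

    struct pos { unsigned page, slot; };        /* slot is 0-based within a page */

    /* Left/right child of a node; we cross into a child page only when we
     * fall off the bottom level of the current page. */
    static struct pos child(struct pos n, unsigned side /* 0 or 1 */)
    {
        unsigned c = 2 * n.slot + 1 + side;
        if (c < PER_PAGE)                                 /* same page          */
            return (struct pos){ n.page, c };
        unsigned k = n.slot - PER_PAGE / 2;               /* index in bottom row */
        return (struct pos){ n.page * PAGE_FANOUT + 1 + 2 * k + side, 0 };
    }

    /* Parent; callers must not ask for the parent of the global root
     * (page 0, slot 0). */
    static struct pos parent(struct pos n)
    {
        if (n.slot > 0)                                   /* same page          */
            return (struct pos){ n.page, (n.slot - 1) / 2 };
        unsigned j = (n.page - 1) % PAGE_FANOUT;          /* which exit edge    */
        return (struct pos){ (n.page - 1) / PAGE_FANOUT, PER_PAGE / 2 + j / 2 };
    }

    /* Array offset for 8-byte entries: one slot per page is left unused so
     * page boundaries stay aligned with the 4 KB VM page size. */
    static size_t offset(struct pos n)
    {
        return (size_t)n.page * (PER_PAGE + 1) + n.slot;
    }

Insertion and extraction then work exactly like a normal binary heap, just using child()/parent() above instead of 2i and i/2.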

Re:don't use swap, doofs (0)

Anonymous Coward | more than 3 years ago | (#32573276)

... He's saying when you need to page to disk (e.g. swap)...

Re:don't use swap, doofs (1)

maraist (68387) | more than 3 years ago | (#32573780)

You realize that mmap is essentially programmatic swap, right? If you do random file seeks on a large (hundred+ gig) file, you're relying on the file-system cache 'swapping' (though not to a swap file). And both of these are very similar to explicit paginated disk-in / disk-out with the O_DIRECT flag.

I've programmed app-level caches, and am very aware of the memory/disk trade-offs, so I completely understand where he's coming from (even though I don't know the details of his tool).

What I think you're getting at is an app that assumes it can just use 300gig of RAM, and thus swap the whole OS to disk (e.g. /dev/sda2 the swap partition). Maybe he did it and maybe he didn't, but I can certainly see how to apply his algorithm to explicitly swapped RAM (that doesn't hit swap). And I suspect that the performance characteristics would be similar (except for the responsiveness of other apps on the machine). Though for non O_DIRECT file-cache based super-sized apps, you'd have similar peer-app stalls.
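To make the "mmap is programmatic swap" point concrete, a tiny sketch (hypothetical file name, minimal error handling, assumes a non-empty file): every touch of a non-resident page below is a page fault serviced from disk, which is exactly the cost model the article's layout is trying to minimize.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        const unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
                                      MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned long sum = 0;
        for (int i = 0; i < 1000000; i++) {
            /* rand() only reaches the first couple of GB; fine for a sketch.
             * Each touch of a cold page is a fault serviced by the kernel. */
            off_t off = (off_t)(rand() % st.st_size);
            sum += p[off];
        }
        printf("%lu\n", sum);

        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }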

Re:don't use swap, doofs (1)

gmack (197796) | more than 3 years ago | (#32573104)

More like: swap is slow. When you are using it try to make things you need to access at the same time closer together to minimize reads.

already proposed 40 years ago... (1)

godrik (1287354) | more than 3 years ago | (#32572742)

You mean people are implementing heap like that ?

That would indeed be stupid. People have been using B-trees http://en.wikipedia.org/wiki/B-tree [wikipedia.org] instead of binary search trees for exactly this cache-access (or sequential disk IO) reason, forever.

More recently we have been talking about cache-oblivious and cache-aware algorithms: http://en.wikipedia.org/wiki/Cache-oblivious_algorithm [wikipedia.org]

I should write an article about Z-order and be on the first page of slashdot...

Re:already proposed 40 years ago... (1)

AnonymousClown (1788472) | more than 3 years ago | (#32572880)

I should write an article about Z-order and be on the first page of slashdot...

Don't forget the pretty colorful graphs!

Journaling Filesystems (1)

KriKit (892231) | more than 3 years ago | (#32572784)

Don't Journaled filesystems themselves use b-trees? Does this mean they could be much faster in theory?

Re:Journaling Filesystems (2, Interesting)

KriKit (892231) | more than 3 years ago | (#32572802)

If we're talking about swap then never mind, there's nothing you can do there. The filesystem is the swap.

Re:Journaling Filesystems (1)

Guy Harris (3803) | more than 3 years ago | (#32573396)

Don't Journaled filesystems themselves use b-trees?

File systems that use b-trees use b-trees; file systems that don't use b-trees don't use b-trees. Both types of file systems can be journaled or not journaled.

Theoretical performance vs real-world performance (2, Insightful)

InsertWittyNameHere (1438813) | more than 3 years ago | (#32572794)

Yes, it's true. In some real-world applications an algorithm encounters its worst-case running time more often than the predicted theoretical average-case running time. This is where case-by-case optimizations come into play.

Knuth never claimed the algorithm was the best choice in YOUR particular case. Don't drag his name through your sensational mud for the sake of your slashvertisement.

Re:Theoretical performance vs real-world performan (5, Informative)

DragonWriter (970822) | more than 3 years ago | (#32572904)

Yes, it's true. In some real-world applications an algorithm encounters its worst-case running time more often than the predicted theoretical average-case running time.

That's actually not the issue.

The issue is that in the real world, the assumptions underlying the calculation of algorithmic complexity may not hold. For instance, Knuth's analysis that the author of the article here holds to be misleading (not, as the Slashdot title suggests, "wrong") calculates the complexity based on the assumption of ideal random access memory, that is, memory for which all accesses are equal cost.

In real-world computers using caches (as pretty much all real-world computers do, often at many different levels) that assumption does not hold -- accesses to some parts of the memory address space are much more expensive than accesses to other parts, and which parts of the address space are expensive changes over time (and how it changes over time potentially depends on the caching strategy used at every level below the level at which the algorithm is implemented).

This means that algorithms in the real world can scale worse than their theoretical "worst-case", if that theoretical worst-case scaling is based on the assumption of constant memory access cost, since that assumption does not hold in the real world.
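A quick-and-dirty way to see that non-uniform cost on real hardware (numbers vary wildly by machine; this is an illustration, not a rigorous benchmark): the two loops below perform the same number of additions, but one walks memory sequentially while the other touches one int per 4 KB page.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64M ints = 256 MB */

    static double secs(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        int *a = calloc(N, sizeof *a);
        if (a == NULL) return 1;
        long sum = 0;

        double t0 = secs();
        for (size_t i = 0; i < N; i++)        /* sequential: uses whole cache lines */
            sum += a[i];
        double t1 = secs();
        for (size_t s = 0; s < 1024; s++)     /* strided: one int per 4 KB page */
            for (size_t i = s; i < N; i += 1024)
                sum += a[i];
        double t2 = secs();

        printf("sequential %.3fs  strided %.3fs  (sum %ld)\n", t1 - t0, t2 - t1, sum);
        free(a);
        return 0;
    }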

Re:Theoretical performance vs real-world performan (3, Insightful)

Darinbob (1142669) | more than 3 years ago | (#32573536)

But everyone already knows that (or should). Knuth knows that as well, caches and paging systems are not unknown things to him. He was writing a text book for students, and simplified a complicated problem to the point where it can be studied and analyzed more easily. Similar to teaching students Newtonian physics. It is sensationalist and naive to call Knuth wrong here.

Careful there... (3, Informative)

Estanislao Martnez (203477) | more than 3 years ago | (#32573672)

This means that algorithms in the real world can scale worse than their theoretical "worst-case", if that theoretical worst-case scaling is based on the assumption of constant memory access cost, since that assumption does not hold in the real world.

I agree with all of your post except this part. Algorithmic complexity theory is about asymptotic growth rates, not about precise numbers. That's why we have O(n) as a complexity class but not O(2n), O(3n) as separate classes; saying that an algorithm has O(n) worst-case time performance is saying that the time it takes to run it is approximated by some linear function of the problem size. We don't particularly care which linear function it is, because that depends too closely on your assumptions about the computational model and/or hardware. What we really care about when we call it O(n) is that it's better than something like O(n^2) (a higher-degree polynomial time [wikipedia.org]) and worse than something like O(log n), no matter what computational model you assume.

As long as the slow memory accesses in the physical hardware still respect some bound, you can treat that upper bound as the constant worst-case memory access cost, and use the algorithmic complexity analysis to calculate an upper bound on the algorithmic complexity. If we turn your argument on its head, we'd say instead that the actual physical cost of running algorithms is often lower than the complexity class indicates because many memory accesses are much faster than the constant upper bound, but that doesn't make the linear upper bound invalid. The best you might be able to do is to assume a finer-grained computational model with variable-cost memory access, and prove that you can get a lower upper bound there, but the original higher upper bound is still an upper bound for the computational model chosen, and can still be useful when reasoning about a system.

Re:Theoretical performance vs real-world performan (1)

Palinchron (924876) | more than 3 years ago | (#32573758)

For instance, Knuth's analysis that the author of the article here holds to be misleading (not, as the Slashdot title suggests, "wrong") calculates the complexity based on the assumption of ideal random access memory, that is, memory for which all accesses are equal cost.

No, it assumes memory for which all accesses are *at worst* a certain equal cost - in other words, memory for which accesses have an upper bound. Knuth's analysis still holds if certain pieces of memory are FASTER than the norm. If we take memory access going through swap as the normal, worst-case, upper-bound cost of memory access, and RAM and cache hits as special better cases, the analysis still applies to the real world.

This means that algorithms in the real world can scale worse than their theoretical "worst-case", if that theoretical worst-case scaling is based on the assumption of constant memory access cost, since that assumption does not hold in the real world.

Doesn't it? The theoretical worst case corresponds nicely to the real-world worst case of all memory coming from swap. The (typical) case in which there is also a bunch of fast RAM and even faster CPU caches is just not the worst case.

Knuth didn't get anything wrong (5, Insightful)

jfengel (409917) | more than 3 years ago | (#32572812)

There isn't any "off-by-ten error", and this isn't telling us anything we don't already know (in CS terms): an implementation on an actual computer can perform differently from an abstract machine.

What the author is saying (quite well) is that the virtual memory performance amounts to cache misses, which cause extra performance overhead. He found a case where it was significant and got a 10x speedup in his particular application.

The article is a little over-zealous in its characterization, though it's careful to note that this is not actually a theoretical novelty. The summary, on the other hand, bastardizes and exaggerates it.

The article is interesting, and worth reading, but if you RTFS without RTFA you'll be dumber than you were before. Thanks, kdawson.

Re:Knuth didn't get anything wrong (1)

juuri (7678) | more than 3 years ago | (#32572946)

What's troubling to me about this article is it reads like this guy hasn't been paying attention for quite a while.

Squid is his use case to beat? Has he been in any high performance app stack in the past five years?

Squid... really?

Re:Knuth didn't get anything wrong (2, Interesting)

Lord Crc (151920) | more than 3 years ago | (#32573172)

Squid is his use case to beat? Has he been in any high performance app stack in the past five years?

Squid... really?

Considering Varnish was written primarily because of the lacking performance of Squid, I don't see why this is so bad?

Re:Knuth didn't get anything wrong (2, Interesting)

Cirvam (216911) | more than 3 years ago | (#32573226)

Other than Varnish and Squid, what caching reverse proxy software is there? It looks like Nginx added caching to its core recently, although I'm not exactly sure if it's intended to act in the same capacity as Squid and Varnish or to be more like mod_cache for lighttpd. I guess you could use Apache and mod_proxy, but that's not exactly high performance. I know my employer looked at the various offerings and we ended up writing our own on top of a webserver called OKWS.

Cache behavior is hard and not portable (1)

MagikSlinger (259969) | more than 3 years ago | (#32572824)

I am sympathetic with his argument, but trying to figure out the best algorithm for a given OS and hardware combo is very difficult and usually not portable.

Re:Cache behavior is hard and not portable (1)

DragonWriter (970822) | more than 3 years ago | (#32572944)

I am sympathetic with his argument, but trying to figure out the best algorithm for a given OS and hardware combo is very difficult and usually not portable.

Assuming that the number of actual caching strategies is smaller than the number of hardware and OS combos (which it would seem likely to be, since at least some of those should share caching strategies), what is really needed is an analysis of the determinants of algorithm performance that takes into account different caching strategies. That's still difficult compared to simple ideal assumptions like constant memory access cost, but simple metrics that don't capture real-world behavior well are of limited utility.

Re:Cache behavior is hard and not portable (1)

owlstead (636356) | more than 3 years ago | (#32573160)

If you make sure that you need fewer pages to do something, you will generate fewer page faults. I would think that this fact is rather universal. Of course, you can go further by tweaking the parameters in such a way that they run well on a specific processor.

Re:Cache behavior is hard and not portable (1)

MagikSlinger (259969) | more than 3 years ago | (#32573414)

I suppose, but have all modern OS's finally agreed on a universal LRU algorithm, or do we still have "clever" algorithms that try to figure out if the application is user-centric or server-centric then changes the page expiry accordingly?

The lengths people will go to for $5 (0)

mysidia (191772) | more than 3 years ago | (#32572852)

I think you only get it if you find an actual bug in Knuth's software though :)

Knuth was a Christian. (-1, Flamebait)

Anonymous Coward | more than 3 years ago | (#32572866)

Just letting you know.

news at 11 (3, Insightful)

inkyblue2 (1117473) | more than 3 years ago | (#32572906)

Algorithm revised in light of real-world performance constraints! Read all about it!

Seriously, we just rewrote a tree (that runs in a high traffic serving environment) this month at work because it wasn't streamlined just right to take full advantage of the underlying architecture. No one will write a paper about it.

Also, hey kids, profile your code.

Re:news at 11 (1)

Michael Kristopeit (1751814) | more than 3 years ago | (#32572956)

we just rewrote a tree (that runs in a high traffic serving environment) this month at work because it wasn't streamlined just right to take full advantage of the underlying architecture. No one will write a paper about it.

because you didn't want to admit that you stored the tree as folders and files on a FAT32 disk on windows 98?

Ow my eyes (1, Insightful)

Anonymous Coward | more than 3 years ago | (#32572930)

R-G colorblindness makes it near impossible to distinguish the graphs from each other.

Why trust the OS? (4, Interesting)

michaelmalak (91262) | more than 3 years ago | (#32572942)

I'm not sure what the advantages are of dancing around with the OS's virtual memory system. If it were me, I would detect the amount of physical memory available, allocate 80% of it, lock it into physical memory and manage the paging myself. It would be portable, not subject to changes in the VM algorithm by the OS authors, and easier to directly monitor and improve performance. But that's just me.
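A rough POSIX sketch of that approach (80% is just the figure used above; _SC_PHYS_PAGES is a common extension rather than strict POSIX, and real code would respect RLIMIT_MEMLOCK and make the fraction configurable):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long pages  = sysconf(_SC_PHYS_PAGES);   /* common extension, not strict POSIX */
        long pagesz = sysconf(_SC_PAGESIZE);
        size_t want = (size_t)(pages * 8 / 10) * (size_t)pagesz;   /* ~80% of physical RAM */

        void *pool = malloc(want);
        if (pool == NULL) { fputs("malloc failed\n", stderr); return 1; }

        /* Pin the pool so the OS never pages it out; the application then
         * does its own placement and eviction inside this region.  This
         * usually needs elevated privileges or a raised RLIMIT_MEMLOCK. */
        if (mlock(pool, want) != 0) { perror("mlock"); return 1; }

        /* ... application-managed paging/eviction would go here ... */

        munlock(pool, want);
        free(pool);
        return 0;
    }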

Re:Why trust the OS? (1, Funny)

Anonymous Coward | more than 3 years ago | (#32573086)

Do you really need 3277MB of my RAM?

Firefox 1.0 programmer I assume... :p

Re:Why trust the OS? (2, Insightful)

Anonymous Coward | more than 3 years ago | (#32573356)

Then someone starts two copies of your program and you lock 160% of available memory

Don't Trust the OS! (1)

amcdiarmid (856796) | more than 3 years ago | (#32573478)

In a virtual world it lies: I suspect most operating systems lie about what physical hardware is really there.

VMware allows over-commitment of most hardware (CPU, memory, and hard disk). Windows allows over-commitment of memory.

Since this is all about assumptions on memory management (flash: various algorithms may be tuned for optimized use in your specific use case):

In your case of grabbing 80% of memory: this works if you really need the memory, in which case you have to have it. But if it forces Windows (Linux, whatever) to swap some "system" memory, it may slow down system processes (disk writes, responding to network requests, running Tomcat...) and drag the whole machine down. Perhaps it works for you.

If your "system" resides in VMware, your memory may be a swap file depending on over-commitment. Say some Joe-Rent-a-Network has set you up with a server in a "secure" location. In a co-lo. On VMware. Now if each system has an application (or two) that grabs 80% of available memory...

Let's say your 80% memory grab meets the swap file, across a network link to the cheapo SAN...

If I'm Joe Rent-a-Network and you make me go figure out why performance sucks for everyone, I'm gonna kick your ass.
Now get off my lawn

Re:Don't Trust the OS! (1)

michaelmalak (91262) | more than 3 years ago | (#32573682)

In Windows, the API to allocate physical memory is VirtualAlloc [microsoft.com] .

Good point about VMWare. I hadn't thought about that.

I just used 80% as an example. Obviously one would provide command-line switches or config options for either X% of memory or Y MB of memory.

Re:Why trust the OS? (2, Interesting)

Anonymous Coward | more than 3 years ago | (#32573498)

Why is this funny? Almost all DBMS kernels do it -- they manage physical memory and disk blocks themselves. More to the point, Dr. Kamp is glossing over the fact that one doesn't use in-memory data structures where data may reside on disk.

In this case, the suitable data structure is any variation of a B+Tree, which keeps the children of a node in one memory page, and when there is no more room in that page a page split happens. He used the wrong data structure to begin with, concluded Knuth was wrong, and finally ended up implementing Knuth's B+Tree.

well (2, Informative)

larry bagina (561269) | more than 3 years ago | (#32572968)

I don't have time to read through the article and verify the numbers (at least right now) but anyone who's even paged through TAOCP knows it was written for computers where "swap" was what the operator did when the tape was full. (Ok, they also had drum memory).

Stupid headline (3, Insightful)

Sloppy (14984) | more than 3 years ago | (#32573020)

Good article (increase your locality to get fewer page faults). Stupidly wrong Slashdot headline and summary. "Off by ten error?" Please!

kdawson, next time, RTFA before you post someone's lame summary.

Summary is wrong - btrees != binary trees (3, Informative)

SashaMan (263632) | more than 3 years ago | (#32573040)

The summary is wrong when it talks about "an off-by-ten error in btrees". In fact, the article talks about how normal binary heap implementations are slow when virtual memory is taken into account.

B-trees, on the other hand, ARE cache-aware and ARE optimized to limit paging to disk. PHK's algorithm is essentially a cache-aware version of a binary heap.

That is, binary tree is to b-tree as binary heap is to PHK's b-heap.
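For anyone who wants to see why the classic layout hurts, here is a tiny sketch (the entry size and heap size are made-up round numbers) that counts how many distinct 4 KB pages a root-to-leaf walk of a plain implicit binary heap touches:

    #include <stdio.h>

    /* The classic implicit binary heap (1-based array): children of
     * entry i live at 2i and 2i+1, its parent at i/2. */
    #define LEFT(i) (2 * (i))

    int main(void)
    {
        /* Assume 8-byte entries (512 per 4 KB page) and a heap of 8 million
         * entries.  Walk from the root to the leftmost leaf and count the
         * distinct pages: past the first few hundred entries, every level
         * lands on its own page -- the per-level page fault that a
         * page-grouped (b-heap style) layout is designed to avoid. */
        unsigned long i = 1, pages = 0, last = (unsigned long)-1;
        while (i <= 8000000UL) {
            unsigned long pg = (i * 8) / 4096;   /* which page entry i is on */
            if (pg != last) { pages++; last = pg; }
            i = LEFT(i);
        }
        printf("distinct pages touched on one root-to-leaf walk: %lu\n", pages);
        return 0;
    }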

Re:Summary is wrong - btrees != binary trees (0)

Anonymous Coward | more than 3 years ago | (#32573578)

binary tree is to b-tree as binary heap is to PHK's b-heap.

And thus the name, "B-heap". As was stated specifically in the article, his colleague Colin Percival noted the similarity you mention above, and suggested the name "B-heap" by analogy with the name "B-tree".

This headline is silly (4, Informative)

MadAndy (122592) | more than 3 years ago | (#32573074)

Knuth's stuff assumes everything is RAM-resident - as long as you don't violate that what he wrote is as valid as ever. I'm quite certain that he'll have different suggestions for tasks involving storing data on disk. Even though the disk writes are implicit because the OS is doing them, they're still there, just as they would be had he coded them explicitly. So of course you're going to get poor performance using a RAM-resident algorithm for a disk-resident application.

The RAM resident stuff is still useful, both at a lower level, but also for those of us creating applications that can live entirely in memory. A web site I did recently is careful to ensure that the entire primary data set can fit in memory, and for that site everything he wrote is still perfectly valid.

In fact, for very high performing websites you try to ensure that at least most of your requests come from memory rather than disk, which makes Knuth's stuff more important than ever. If you can't do it in RAM then you'd better have a lot of spindles!

Re:This headline is silly (1)

RzUpAnmsCwrds (262647) | more than 3 years ago | (#32573600)

Knuth's stuff assumes everything is RAM-resident - as long as you don't violate that what he wrote is as valid as ever.

CPU cache.

Timer wheels (1)

0xABADC0DA (867955) | more than 3 years ago | (#32573138)

Continuing a discussion...

Seems to me that this b-heap just reduces the number of pages needed from log(n) to log(n)/C, where C is the number of levels stored in the same page. And for removing a node it may still need to access log(n)/C random pages. So this is just a constant-factor improvement... it's just that the constant is a number of pages, so it has a large real-world cost.

I'd like to get people's thoughts on using timer wheels [lkml.org] instead, like the Linux kernel uses for timers. Say you take 8 wheels of increasing spans of time, where each wheel is just an unordered list:

insert: one of the 8 wheel list-ending pages must be resident (certainly will be)
delete: just zero out the list entry (probably 1 page-in).
remove min: 1 page to pop the list head at wheel 0, or a bunch of sequential accesses to cascade the expire timers down.

Could some data structure expert please comment on the relative merits of the b-heap vs timer wheels? Seems to me that a small fixed set of pages and sequential access should be far better in terms of swapping. The OS should be designed to recognize in-order access, especially if each list is an mmap'd file, and the blocked thread does not need to hold any locks (you can replace the list with a new one while it is being processed).
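For reference, a rough sketch of the bucketing arithmetic being proposed (modeled loosely on the Linux kernel's scheme; the wheel count and slot width are made-up constants, and the per-slot lists and cascade step are not shown): insertion is just an append to an unordered per-slot list, and ordering is only imposed lazily when higher wheels cascade down.

    #include <stdint.h>

    #define WHEELS    8
    #define SLOT_BITS 8
    #define SLOTS     (1u << SLOT_BITS)    /* 256 slots per wheel */

    struct wheel_pos { unsigned wheel, slot; };

    /* Pick the wheel whose span covers the delay, then the slot within it.
     * Wheel w covers delays up to 256^(w+1) ticks at a granularity of
     * 256^w ticks. */
    static struct wheel_pos bucket_for(uint64_t expires, uint64_t now)
    {
        uint64_t delta = expires > now ? expires - now : 0;
        unsigned w = 0;
        while (w + 1 < WHEELS && (delta >> (SLOT_BITS * (w + 1))) != 0)
            w++;                            /* smallest wheel whose span covers delta */
        unsigned slot = (unsigned)(expires >> (SLOT_BITS * w)) & (SLOTS - 1);
        return (struct wheel_pos){ w, slot };
    }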

Re:Timer wheels (2, Informative)

chhamilton (264664) | more than 3 years ago | (#32573486)

The main reason they are using the buckets is to delay sorting costs as far as possible into the future so that there is less cost for most timers (as most timers are apparently deleted far before expiring). I'd suggest that the major performance gain is due to this lazy sorting, and not because their data structure avoids page faults. (Well, it does avoid page faults that the old linked list algorithm would have caused, but these page faults are due to the sorting going on in the original code which is avoided in the second. If timers did not expire, the two approaches would be quite similar, both generating page faults when sorting the linked lists, which likely have bad locality, and neither being as good as an IO-efficient algorithm.)

Using timer wheels as a heap structure wouldn't be appropriate unless many of the objects placed in the heap are removed prior to making it to the top of the heap. If this is not the case, the sorting of the items from one bucket to the next bucket (sorting a linked list) would cause many page faults if the list didn't fit in internal memory. Timer wheels do nothing to address data locality, which is the main problem faced by extremely large heaps. Your mention of in-order access is only true if the lists at each bucket are indeed stored sequentially in memory. This is hard to guarantee unless that space is reserved ahead of time or some dynamic reallocation scheme is used. I read the linked article as implying that simple linked lists were used, which generally have very bad data locality. Even if a linear list was guaranteed, however, the sorting step when pushing things from one bucket down to the next bucket would incur page faults (assuming the list was too big to fit in memory) unless an I/O-efficient or cache-oblivious sort were used. (Which could easily be done, making an IO-efficient timer wheel structure.)

The algorithm discussed in the article is for a general purpose heap. In most heaps the main cost is in removing elements from the root of the heap as they bubble up through it, rather than deleting them prematurely (as is the case with timers). Different approaches for fundamentally different problems.

Don Knuth pays people who find errors (2, Interesting)

peter303 (12292) | more than 3 years ago | (#32573294)

I forget the exact amount, but it was like PI or E dollars for every typo. I am not sure what the payment is for an algorithmic error.

Seymour Cray on virtual memory (3, Insightful)

shoppa (464619) | more than 3 years ago | (#32573306)

"Virtual memory leads to virtual performance."

Just what I always wanted, a B-tree implementation that is guaranteed to swap.
