File System Design part 1, XFS

CmdrTaco posted more than 8 years ago | from the get-your-hack-on dept.

Data Storage 57

rchapman writes "Generally, file systems are not considered "sexy." When a young programmer wants to do something really cool, his or her first thought is generally not "Dude, two words... File System." However, I am what is politely termed "different." I find file systems very interesting and they have seldom been more so than they are right now. Hans Reiser is working on getting Reiser4 integrated into the Linux kernel, the BSD's are working on getting a journaled file system together, and Sun Microsystems just recently released a beta of ZFS into OpenSolaris. "

57 comments

woohoo! We're ... (0, Offtopic)

Bin_jammin (684517) | more than 8 years ago | (#14559401)

gonna party like it's 1979zzzzzzzzzzzzzzzzz

Oh, snap. (3, Interesting)

cbiffle (211614) | more than 8 years ago | (#14559403)

> the BSD's are working on getting a journaled file system together


Oh, snap. Somebody's not running Soft Updates. :-)

(Yes, I understand that Soft Updates is not technically metadata journalling as practiced by the Linux people. No, I don't believe there are a significant number of practical situations where the results will differ.)

Re:Oh, snap. (1)

rplacd (123904) | more than 8 years ago | (#14560771)

It's not a "Linux people" thing. XFS, for example, is from SGI (but you probably knew that).

The main difference is, there is no fsck in XFS. None whatsoever. With ext3, or ufs2 with soft updates, you can still type "fsck /dev/whatever" on an unmounted filesystem, and it'll grind through it, but try to fsck an XFS filesystem, and nothing will happen. It's a no-op.

Re:Oh, snap. (3, Informative)

Anonymous Coward | more than 8 years ago | (#14560890)

> The main difference is, there is no fsck in XFS. None whatsoever.

What the fuck?

Have you read this [die.net], or even used XFS before, for that matter?

Re:Oh, snap. (1, Funny)

Anonymous Coward | more than 8 years ago | (#14563537)

Don't you mean "What the fsck?" :)

Re:Oh, snap. (1)

rplacd (123904) | more than 8 years ago | (#14566051)

Sure, and I've never needed it despite all the crashes my main desktop has had (flaky power).

I've used XFS for about four years now, on three systems.

Re:Oh, snap. (1)

perlchild (582235) | more than 8 years ago | (#14561291)

I'm not sure why, but to repair an XFS filesystem, fsck is a no-go; you need xfs_repair.

Re:Oh, snap. (0)

Anonymous Coward | more than 8 years ago | (#14562655)

I love FreeBSD but there are some situations where journaling can save the day. Imagine you're running a big-ass server with 2GiB of RAM and 2TiB of diskspace and there's a crash due to, e.g., bad hardware. You reboot and, voila! You can't fsck the 2TiB filesystem because every GiB of diskspace needs 1MiB of memory during the fsck run. You're basically screwed. These days the problem is basically gone on 64bit CPUs, but if you're stuck on obsolete Dr. Pentium processors there's nothing you can do.
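
A rough sketch of that arithmetic, taking the 1MiB-per-GiB figure above as a rule of thumb rather than a measured number (the function name is made up for illustration):

```python
MIB = 2**20
GIB = 2**30
TIB = 2**40

def fsck_memory_bytes(fs_bytes: int, mib_per_gib: int = 1) -> int:
    """Estimate fsck working memory from the rule of thumb quoted above.

    mib_per_gib is an assumption taken from the comment, not from any
    fsck documentation.
    """
    gib_of_disk = -(-fs_bytes // GIB)  # round up to whole GiB
    return gib_of_disk * mib_per_gib * MIB

if __name__ == "__main__":
    need = fsck_memory_bytes(2 * TIB)
    print(f"2TiB filesystem: about {need / GIB:.1f}GiB of memory for fsck")
```

Under that assumption the check wants the machine's entire 2GiB of RAM before the kernel or anything else gets a byte.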

File system design (5, Informative)

Bogtha (906264) | more than 8 years ago | (#14559476)

If you're interested in this, you'll probably also be interested in Practical File System Design with the Be File System [nobius.org] (PDF), by Dominic Giampaolo, the designer of the Be file system. There's also a Slashdot review [slashdot.org] of this book.

Re:File system design (1)

sgt scrub (869860) | more than 8 years ago | (#14563844)

great link thanks! The file system for BeOS was/is ahead of the times.

Blatant error (5, Interesting)

lostlogic (831646) | more than 8 years ago | (#14559574)

Sector size on hard disks is 512 bytes, not 512kbytes. WTF, don't act like an authority and be a dumbass. Imagine the data waste if we actually had 512k physical sectors on disks.

Also the scaling numbers are completely hokey.

Mod parent up (1)

sethadam1 (530629) | more than 8 years ago | (#14559748)

He didn't say it very eloquently, but the difference between 512 bytes and 512 kilobytes is pretty significant. 512KB is half of a megabyte. Can you imagine if a hard drive sector was .5 megabyte? You'd end up with tons of wasted space.
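
To put a number on "tons of wasted space", here is a minimal sketch that rounds a few files up to whole sectors and counts the leftover bytes. The file sizes are invented for illustration, and it assumes the simplest possible allocator (one file never shares a sector with another):

```python
def slack_bytes(file_size: int, sector_size: int) -> int:
    """Bytes left unused when a file is rounded up to whole sectors."""
    sectors = -(-file_size // sector_size)  # ceiling division
    return sectors * sector_size - file_size

if __name__ == "__main__":
    sizes = [100, 1500, 4096, 70000]  # hypothetical file sizes in bytes
    for sector in (512, 512 * 1024):
        wasted = sum(slack_bytes(s, sector) for s in sizes)
        print(f"sector={sector} bytes: {wasted} bytes wasted on {len(sizes)} files")
```

With 512-byte sectors the waste per file is always under 512 bytes; with 512KB sectors a 100-byte file would tie up more than half a megabyte.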

Re:Mod parent up (0)

Anonymous Coward | more than 8 years ago | (#14559939)

That is now fixed in the article; it was an editing error on the writer's (my) side of things.

Re:Mod parent up (0)

Anonymous Coward | more than 8 years ago | (#14560037)

Maybe in some version of the article, but not the one on that internet thing...

Also it was an editing error in several locations. If you're going to hold yourself out as a technical authority, it'll really pay to read through at least the first technical paragraph a few times.

Re:Blatant error (1, Informative)

Intron (870560) | more than 8 years ago | (#14560033)

The author also says disk mfgs are lying when they use K = 1000 bytes, M = 1000000 bytes. This person is a know-nothing.

Re:Blatant error (0)

Anonymous Coward | more than 8 years ago | (#14560619)

Somehow, I think the BIPM [bipm.org] would carry a little more weight on this than some random /. troll. Although you are partially correct: 'K' is supposed to be lower case.

I concur, Mod Parent Up (5, Insightful)

NixLuver (693391) | more than 8 years ago | (#14560054)

from TFA:
"There is a minimum size you can write to or read from the disc. This minimum size is called a "sector," and is usually around 512k. So, unless you really like 512k files, it is very likely that you will end up either wasting space or cutting off the end of the file if your file system doesn't deal with this."

This is clearly not a typo - which is what I was certain I would find when I did RTFA. This guy has a basic, fundamental flaw in his understanding of the very thing he's writing an article about. This is a non-starter, IMO. Combine that with poor sentence structure and bad scansion ... I mean:

"Note: My ibook has a "30 gig" drive. This is bullshit and I'll tell you why: Drives are defined by the binary definition of mega, kilo and giga. For example, a kilobyte is not 1000 bytes, but actually 1024 bytes. However, your HD manufacturer uses the metric definitions, even up to gigabytes. Now I can see you thinking..."But Wait Mr. Mad Penguin Person...Thats patently ridiculous and means they are lying on the box." Yah... "

If I'd written something like that, I'd delete it right away and start from scratch.
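
For reference, the arithmetic the quoted note is gesturing at is a plain unit conversion. A minimal sketch, assuming the box uses decimal gigabytes (10^9 bytes) and the OS reports binary ones (2^30 bytes); the function name is made up:

```python
def decimal_gb_to_gib(gigabytes: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes)."""
    return gigabytes * 10**9 / 2**30

if __name__ == "__main__":
    print(f'A "30 gig" drive is about {decimal_gb_to_gib(30):.2f} GiB')
```

Same number of bytes either way, just two different definitions of the prefix.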

don't ask for much (0)

Anonymous Coward | more than 8 years ago | (#14561753)

Come on, the guy did most of his research on Wikipedia. I think that expecting any sort of familiarity with or understanding of the material he is discussing is unreasonable.

Re:I concur, Mod Parent Up (1)

Kell_pt (789485) | more than 8 years ago | (#14561846)

Either way, it's missing the unit, whether it's "b" or "B".

A few examples:

512Kb - Kilo bits
512KB - Kilo bytes

The writer did not include the unit, but used lowercase, so one would assume it reads as:
512k - kilo bits
512K - kilo bytes

It makes sense, because no one would interpret it any other way, given the context. No one, of course, except someone on /. nitpicking and trying to prove that the person who wrote the article is actually ignorant (not sure why people do that, maybe it makes them feel better?).

Re:I concur, Mod Parent Up (1)

NixLuver (693391) | more than 8 years ago | (#14562219)

Because it's not nit picking; it's absolutely incorrect whether you assume he means 512KiloBytes or 512Kilobits. The actual value is 512 BYTES. No kilo at all. There's no /. bias here unless you are willing to suggest that anyone who requires accurate information is a /.er. I doubt that's what you're trying to imply, eh?

Re:I concur, Mod Parent Up (1)

Kell_pt (789485) | more than 8 years ago | (#14562617)

You got a point there. It's just that 64 bytes (512 bits) would be a possible (although unlikely) value.

Re:I concur, Mod Parent Up (1)

Kell_pt (789485) | more than 8 years ago | (#14562633)

64 kbytes. And then again, nevermind. :)

Re:I concur, Mod Parent Up (0)

Anonymous Coward | more than 8 years ago | (#14564480)

There sure are a lot of arrogant posts from small dicked weenies around here.

Re:Blatant error (2, Insightful)

Anonymous Coward | more than 8 years ago | (#14560324)

There are lots of other errors as well. For instance he asserts that the inode contains the filename (it doesn't). Other things are unclear. He refers to UFS and says it scales to disks of around 1TB, but does not define what he means by UFS (as opposed to FFS). He shows a considerable bias to PC hardware by referring to MBRs. He seems to think that taking something out of a B+-tree is faster than removing something from the front of a linked list. I have no idea why he thinks that Unix at Berkeley was "stillborn".

Re:Blatant error (0)

Anonymous Coward | more than 8 years ago | (#14650345)

The stillborn comment is a recurring /. troll about the BSDs dying. Check out the BSD section at -1, and I'm sure you'll find a few examples.

It gets worse (0)

Anonymous Coward | more than 8 years ago | (#14560333)

"An inode is quite simply a data structure that stores a link to the file data, the file name and metadata."

Last time I checked, filenames were separate from inodes. Wikipedia agrees with me.

Re:It gets worse (1)

KarmaMB84 (743001) | more than 8 years ago | (#14591870)

Fast symlinks can be stored in inodes.

Filesystems not sexy? (2, Funny)

codergeek42 (792304) | more than 8 years ago | (#14559579)

You've not played much with Ext3, then, have you? =)

Re:Filesystems not sexy? (0)

Anonymous Coward | more than 8 years ago | (#14559820)

Uh yeah real sexy. It's just ext2 with a journal. ext2/3 just plain sucks, can you say triple indirect blocks and fixed limit of inodes?
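
For anyone who hasn't met indirect blocks: a rough sketch of how far the classic ext2-style block map reaches, assuming 12 direct pointers plus one single, one double and one triple indirect block, with 4-byte block pointers. Real-world file size limits can be lower for other reasons; this only counts what the map itself can address.

```python
DIRECT_POINTERS = 12   # direct block pointers held in the inode
POINTER_SIZE = 4       # bytes per block pointer (assumed 32-bit block numbers)

def addressable_blocks(block_size: int) -> int:
    """Blocks reachable via direct, single, double and triple indirection."""
    per_block = block_size // POINTER_SIZE  # pointers that fit in one block
    return DIRECT_POINTERS + per_block + per_block**2 + per_block**3

if __name__ == "__main__":
    for bs in (1024, 4096):
        total = addressable_blocks(bs) * bs
        print(f"block size {bs}: map covers about {total / 2**30:.0f} GiB")
```

The cost is that a block deep in the file sits behind up to three extra pointer blocks that have to be read first, which is the complaint here.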

Re:Filesystems not sexy? (0)

Anonymous Coward | more than 8 years ago | (#14561544)

Can you say, I'm willing to wager you would have to google to even find out what that means?

Re:Filesystems not sexy? (1)

sconeu (64226) | more than 8 years ago | (#14561382)

Parent must have all his pr0n on an ext3 filesystem. That's why he thinks it's sexy. ;-)

Re:Filesystems not sexy? (0)

Anonymous Coward | more than 8 years ago | (#14562846)

Well, if you consider dead and unrecoverable bodies of data sexy. EXT2 is more reliable... jesus.

I tried JFS because NOBODY talks about using it, guess what, it's fast and hasn't died in 3 years. I lost ext3 partitions every other month.

Times must be changing... (2, Funny)

creimer (824291) | more than 8 years ago | (#14559749)

So constructing a compiler from scratch is no longer sexy?

Re:Times must be changing... (2, Funny)

Doug Merritt (3550) | more than 8 years ago | (#14562979)

> So constructing a compiler from scratch is no longer sexy?

It's just as much of a chick magnet as it ever was!

But don't let that stop you. It's fun.

division (2, Interesting)

newr00tic (471568) | more than 8 years ago | (#14559813)

With everyone and their parrot talking about RAID these days, it would've been fun if some sort of dual array could work as ONE filesystem, where one(++) redundant set took care of the balancing/tree'ing, etc. (separately), and the other(s) kept the actual files. If there was _yet_ another set (a ++third), with the relevant META-information belonging to the files, you could imagine it being a step forward from what we have now. Well, I can, anyway..

Re:division (1, Insightful)

Anonymous Coward | more than 8 years ago | (#14559903)

RAID is best done at a separate LVM layer so that any FS can be built on top of it. I don't see much advantage in building this into the FS. What advantage is there in putting metadata on separate volumes? You need less reliability or something?

journaling (1)

Ayanami Rei (621112) | more than 8 years ago | (#14560889)

XFS and ext3 can have their metadata logs/journals on other physical devices, separate from the actual filesystem blocks.
Sometimes filesystems are RAID aware, in that they choose to allocate blocks at the beginning of RAID strides and stuff like that. But that's about as flexible as filesystems get.

less reliability? (2, Insightful)

newr00tic (471568) | more than 8 years ago | (#14561955)

No, I was thinking more along the lines of when(/if) META-data becomes big, and you'd get more throughput by having it on its own drives, so as to speed things up.

By all the three examples I provided, I tried to "account" for both speed and reliability, even though it's only a vague theory..

--No wonder (_real_) things keep standing still for fscking 10 years at a time, and only Disney features are implemented; people turn down theories just as snappily as they turn down webdesigns (50ms, or whatever)..

obligatory (4, Insightful)

DrSkwid (118965) | more than 8 years ago | (#14559987)

If you like on-disk file systems you should read Venti: a new approach to archival storage [bell-labs.com].

Plan9 [bell-labs.com]'s primary on-disk storage is Fossil [wikipedia.org], which runs in user mode. (Plan9 doesn't have a super user)

You can run arbitrary programs in Plan9 that present a file/folder directory structure by using the common 9P protocol. All devices look like files and folders and can be manipulated like any other, even at the permission level.

For instance, I have an image mounter that takes a tga file and presents 1 folder containing 4 files, red, green, blue and alpha.
I can then use any tool I like to manipulate those files using the file semantics we are all familiar with. I even have a flag that mounts the files as textual rather than binary, i.e.:
00 00 ff ff
00 00 ff ff
ff ff 00 00
ff ff 00 00

and I can do image processing with awk!
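
The mounter itself isn't shown here, but the kind of processing that a textual channel file allows is easy to sketch. This assumes the format is exactly what the sample above suggests (rows of space-separated two-digit hex bytes); `invert_channel` is a made-up name, not part of Plan9 or of the mounter:

```python
def invert_channel(text: str) -> str:
    """Invert every byte in a channel dumped as rows of two-digit hex values."""
    rows = []
    for line in text.splitlines():
        inverted = [255 - int(token, 16) for token in line.split()]
        rows.append(" ".join(f"{value:02x}" for value in inverted))
    return "\n".join(rows)

if __name__ == "__main__":
    red = "00 00 ff ff\n00 00 ff ff\nff ff 00 00\nff ff 00 00"
    print(invert_channel(red))
```

Because the channel is just text in a file, the same transformation could be written as a one-liner in awk or any other text tool.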

Re:obligatory (3, Informative)

rplacd (123904) | more than 8 years ago | (#14560858)

The good news is, you don't need to install plan 9 to use venti. You can do it with plan9port [swtch.com] on a Linux/FreeBSD/Mac OS X/etc box today.

NSS for Linux (2, Interesting)

marquis111 (94760) | more than 8 years ago | (#14560113)

NSS has been ported to Linux too. That's another modern industrial-strength filesystem with features sorely needed by Linux.

Re:NSS for Linux (1)

Slamtilt (17405) | more than 8 years ago | (#14561658)

It requires loading binary kernel modules, and you'll have to run it on OES, so that puts it beyond the pale for quite a few people. I've been a bit underwhelmed by it performance wise, and there were some nasty bugs in the initial release. SP2 may make a difference. I hope so.

This Kind Of Article (1, Insightful)

Anonymous Coward | more than 8 years ago | (#14560175)

What compels people to make the leap from "I've grasped the basics of a large and complex field" to "I think I'll write an article about it for the Slashdot crowd" via "I'm sure it doesn't matter that I'm not a good writer" and "I think I'll go with a self-satisfied tone"?

Re:This Kind Of Article (1)

Jerry Coffin (824726) | more than 8 years ago | (#14562667)

> What compels people to make the leap from "I've grasped the basics of a large and complex field" to "I think I'll write an article about it for the Slashdot crowd" via "I'm sure it doesn't matter that I'm not a good writer" and "I think I'll go with a self-satisfied tone"?

If he really understood the basics, he'd understand how the concept of "hard link" means the file name is not stored in the inode.

There's an old maxim (usually attributed to Butler Lampson) that says almost any problem in programming can be solved with another level of indirection. The reverse is true as well: almost any solution in programming will be broken if a level of indirection is removed -- and that's exactly what he's done in his explanation.
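
The hard link point is easy to check for yourself. A minimal sketch, assuming a POSIX-style filesystem that supports hard links in the temp directory (the file names are arbitrary):

```python
import os
import tempfile

def link_demo():
    """Create one file, give it a second name, and compare the inodes."""
    with tempfile.TemporaryDirectory() as directory:
        first = os.path.join(directory, "first_name")
        second = os.path.join(directory, "second_name")
        with open(first, "w") as handle:
            handle.write("same data\n")
        os.link(first, second)  # second directory entry for the same inode
        a, b = os.stat(first), os.stat(second)
        return a.st_ino == b.st_ino, a.st_nlink

if __name__ == "__main__":
    same_inode, link_count = link_demo()
    print("same inode:", same_inode, "link count:", link_count)
```

Two names, one inode: the names live in directory entries, and the inode only carries a count of how many entries point at it.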

Author doesn't mention his newbie status (2, Insightful)

Anonymous Coward | more than 8 years ago | (#14560223)

From the article:
Small difference there. It is also a very fast file system, allowing reads of up to 7 GB/sec.

An assumption which could only be made by a newbie. Maximum throughput of a filesystem is not filesystem architecture dependent, but hardware dependent.
I could give you 7GB/sec out of a FAT drive, given the proper hardware.
Several other quotes suggest a bit of 'newbieness' like "B+trees are insanely complex".
The concept was designed by a human, therefore it is clearly understandable by a human. It's not, say, some potentially impossible-for-humans-to-really-comprehend law of nature.
The author should just acknowledge that he or she does not know enough about B+trees, or does not know them well enough to illuminate the subject sufficiently.
There's nothing wrong with that, but trying to scare people off of them just because he or she doesn't understand them well enough is damaging to potential readers.

I want to postscript this with an evaluation that the article is not bad. If those things are changed, I would even say it is good.

-fooburger

Re:Author doesn't mention his newbie status (1)

eyck (98824) | more than 8 years ago | (#14561531)

Maximum throughput of a filesystem IS filesystem architecture dependent, and XFS solved that problem in its time. Check your facts.


Also, imagine this - your filesystem uses some kind of block size, allocating a block requires round-trip through the filesystem (including touching superblocks and modifying list of free blocks).

What happens when you're trying to write a lot of data to such synchronous filesystem?

You're bound by round-trip time; no amount of faster hardware would help. Similar situations used to happen at the time XFS was being designed.
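
A back-of-the-envelope sketch of that bound, assuming a fully synchronous allocator that handles exactly one block per round trip. The block size and round-trip times are invented for illustration, not measurements of any real filesystem:

```python
def max_sync_throughput(block_size_bytes: int, round_trip_seconds: float) -> float:
    """Upper bound in bytes/second when each block costs one full round trip."""
    if round_trip_seconds <= 0:
        raise ValueError("round trip time must be positive")
    return block_size_bytes / round_trip_seconds

if __name__ == "__main__":
    block = 4096  # hypothetical block size in bytes
    for round_trip in (0.010, 0.001, 0.0001):
        rate = max_sync_throughput(block, round_trip)
        print(f"round trip {round_trip * 1000:.1f} ms: at most {rate / 2**20:.2f} MiB/s")
```

Raw disk bandwidth never appears in the formula, which is the point: the ceiling only moves if you allocate more per round trip or make the round trip shorter.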

Re:Author doesn't mention his newbie status (1)

putaro (235078) | more than 8 years ago | (#14592860)

Maximum throughput on given hardware would be constrained by your architecture, not maximum throughput on any hardware.

I've done FS development for 15 years and that article screamed clueless newbie.

Doesn't Live Up To Its Billing (3, Informative)

JoshDanziger (878933) | more than 8 years ago | (#14560293)

Sorry, this article didn't really teach me anything interesting about filesystems. In general, the article was poorly written. For example, taking two sentences to say: "B+Trees are complex. Let me rephrase that. B+Trees are very, very complex." Readers of all types appreciate their time and don't want to have to waste it.

At points you were lost between trying to sound like an expert and trying to sound like a grandfather explaining the grand old days of filesystem development. Are you a storyteller or a teacher? Pick one.

Content-wise, there wasn't really much there for me. You spent a lot of time explaining the problems of a binary tree, but I think that your target audience already understands the time complexity of a binary tree. Then, you gloss over the B+ tree because it's complicated.

Sorry if I sound harsh. I hope that this comes off as constructive criticism.

Re:Doesn't Live Up To Its Billing (0)

Anonymous Coward | more than 8 years ago | (#14563036)

> For example, taking two sentences to say: "B+Trees are complex. Let me rephrase that. B+Trees are very, very complex." Readers of all types appreciate their time and don't want to have to waste it.

Heard on slashdot!

Re:Doesn't Live Up To Its Billing (0)

Anonymous Coward | more than 8 years ago | (#14563732)

What is complex about b+ trees? That's what lost me. Now some of the optimizations you can do to various b+ tree algorithms to minimize writes and what have you might start to get "complex" but b+ trees are pretty damn simple.


Especially if you're some kind of "filesystem guy" or something like this guy claims; just about all filesystems use b-trees and their variants.

Re:Doesn't Live Up To Its Billing (1)

TwoCans (255524) | more than 8 years ago | (#14564806)

> Content-wise, there wasn't really much there for me.

Yeah, I doubt there was anything in it for anyone interested in filesystems.
And seeing XFS is my day job, the mistakes were pretty obvious, too.

One, a b+tree does not make a filesystem.

Two, in all that talk about b+trees in XFS, he made some basic mistakes. There's
only one inode b+tree per AG, there's two extent free list b+trees per AG, and
the superblock has no b+trees in it at all. And they are used in many other
places in XFS as well.

Three, there is so much in XFS that this didn't even mention that it could
hardly even be called an overview of XFS, let alone design. I mean, the overview
page for the XFS project (http://oss.sgi.com/projects/xfs/ [sgi.com]) has a more informative
description of XFS than this article....

Learn before you teach (1)

mindstormpt (728974) | more than 8 years ago | (#14561486)

Sorry, but if you really like file systems maybe you should try learning something about them before deciding to write this kind of article. I've had less than 10h of classes on file system design, do not consider myself to know anything about the subject and still got the impression that I knew quite a bit more than you.

XFS? (-1, Troll)

ravyne (858869) | more than 8 years ago | (#14561609)

He might want to change that name. XFS is the filesystem used by Microsoft's Xbox and Xbox 360 game consoles and is based on NTFS.

Re:XFS? (0)

Anonymous Coward | more than 8 years ago | (#14561954)

Who gives the fuck what Microsoft names its kiddie boxes and assorted file systems? XFS (as in filesystem) has a much longer history than "XFS on Xbox".

Re:XFS? (1)

Anonymous Coward | more than 8 years ago | (#14562218)

No, XFS is the filesystem created by SGI. You might want to reconsider your post.

Re:XFS? (0)

Anonymous Coward | more than 8 years ago | (#14562619)

No, it's not. That filesystem is called 'FATX' on Xbox1 and 'XTAF' on Xbox360 (endianness).

Check GFS (Google File System) (1)

Osvaldo Doederlein (34220) | more than 8 years ago | (#14577334)

Here: http://labs.google.com/papers/gfs.html [google.com] . Abstract: "We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. (...)" Pretty damn cool stuff, very advanced but perhaps a bit too tuned for Google's needs. See also the papers on their own clustering technology and distributed programming framework (MapReduce), which go hand in hand with GFS. These are some of the technical secrets behind Google's search engine and other apps, it's funny how little you hear about them. For massive tasks like Google search, I don't think anybody can compete with standard technology, i.e. big machines (vertical scalability), off-the-shelf clusters and SQL servers, and conventional programming techniques.