Lustre File System Getting New Community Distro

Soulskill posted more than 3 years ago | from the cluster-buster-passes-muster dept.

Data Storage 68

darthcamaro writes "Oracle acquired a lot of open source tech from Sun that has since been forked — or is in the process of being forked. The open source Lustre high performance computing file system isn't on the list of forked projects, but it is getting a new, community-driven distro that is trying really hard to say that they're not officially a fork. 'Since April of 2010 there has been confusion in the community, and we've seen an impact in the business confidence in Lustre,' Brent Gorda, CEO and president of Whamcloud told InternetNews.com. 'The community has been asking for leadership, the commitment of a for-profit entity that they can rely on for support and a path forward for the technology.'"

68 comments

What is Lustre File System (5, Informative)

Icyfire0573 (719207) | more than 3 years ago | (#34884600)

From their website:
http://wiki.lustre.org/index.php/Main_Page [lustre.org]

High Performance and Scalability

For the world's largest and most complex computing environments, the Lustre file system redefines high performance, scaling to tens of thousands of nodes and petabytes of storage with groundbreaking I/O and metadata throughput.

Re:What is Lustre File System (1)

Elbereth (58257) | more than 3 years ago | (#34884622)

Any benchmarks?

Re:What is Lustre File System (2)

fgodfrey (116175) | more than 3 years ago | (#34884816)

Obviously, we have internal benchmarks that tend to show that Lustre is good but I can't talk about specifics on those. What I can do, though, is link to this: http://www.cs.rpi.edu/~chrisc/COURSES/PARALLEL/SPRING-2009/papers/MADbench2-2009.pdf [rpi.edu]

The stuff that I found most interesting is on page 12. The machines named Jaguar and Franklin are Cray's running Lustre. Bassi and Jacquard are both running GPFS. On page 15 they claim that they can make up for the deficiency in Lustre's default settings for shared access to a single file by tuning it.

Unsurprisingly, the type of operation you're doing ends up determining which filesystem is best for your application.

In terms of scalability, from the Wikipedia page for the Jaguar system at Oak Ridge National Labs (a large Cray XT5), their Lustre filesystem is 10 petabytes with read/write performance of approximately 240GB/sec (not sure what benchmark was used to get that number).
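For a sense of scale, the two figures quoted above can be combined into a rough sweep time. This is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and that the 240GB/sec rate could be sustained across the whole filesystem, neither of which the Wikipedia figures confirm. The function name is made up for illustration.

```python
def full_sweep_hours(capacity_pb, throughput_gb_per_s):
    """Hours needed to read or write the whole capacity once at a constant rate."""
    capacity_bytes = capacity_pb * 10**15            # assumption: decimal petabytes
    rate_bytes_per_s = throughput_gb_per_s * 10**9   # assumption: decimal gigabytes
    return capacity_bytes / rate_bytes_per_s / 3600


hours = full_sweep_hours(capacity_pb=10, throughput_gb_per_s=240)
print(f"{hours:.1f} hours")
```

Under those assumptions, a single pass over the full 10 petabytes takes a bit under twelve hours.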

Re:What is Lustre File System (-1, Redundant)

lgw (121541) | more than 3 years ago | (#34885242)

The machines named Jaguar and Franklin are Cray's running Lustre.

The apostrophe is never used to form a plural. Not ever. No, not even then.

In terms of scalability, from the Wikipedia page for the Jaguar system at Oak Ridge National Labs (a large Cray XT5), their Lustre filesystem is 10 petabytes with read/write performance of approximately 240GB/sec (not sure what benchmark was used to get that number).

OK, so I'm not surprised if someone gets good performance from a Cray, but can't Lustre be used with lots of commodity hardware instead? I thought that was kind of the point.

Re:What is Lustre File System (2)

fgodfrey (116175) | more than 3 years ago | (#34885340)

It certainly *can* be used with commodity hardware, but the majority (or maybe all?) of Lustre installations are in high performance computing with thousands, or tens of thousands, of clients (usually the nodes of a supercomputer) accessing the shared file system.

Where more commodity hardware can come in is the installation of the filesystem servers themselves. A system's Object Storage Targets and Metadata Servers (pieces of Lustre) can be external to the Cray and connected via some interconnect such as Infiniband. It should be noted that even the "commodity" hardware for the filesystem isn't exactly cheap if you want a huge capacity and high reliability...

Re:What is Lustre File System (1)

h4rr4r (612664) | more than 3 years ago | (#34885526)

Any reason Luster cannot be spelled correctly?
Would that impact performance?

Re:What is Lustre File System (1)

fgodfrey (116175) | more than 3 years ago | (#34885666)

No, but it would affect the ability of someone to trademark the name, and since Lustre has always been the project of a commercial company (originally Cluster File Systems, then Sun, then Oracle, and now OpenSFS and this company), that is something that would be considered...

Re:What is Lustre File System (1)

h4rr4r (612664) | more than 3 years ago | (#34885748)

LusterFS, there I solved your problems. I will accept payment in any matter of ways.

Re:What is Lustre File System (1)

fgodfrey (116175) | more than 3 years ago | (#34885832)

That's nice. Go talk to the people who actually work for one of those companies and complain to them. Until then, it's a product name and it's going to keep getting spelled the way the manufacturer spells it...

Re:What is Lustre File System (1)

h4rr4r (612664) | more than 3 years ago | (#34885842)

By now you should be near deaf from the whooshing sounds going on right above you.

Re:What is Lustre File System (2)

lachlan76 (770870) | more than 3 years ago | (#34886296)

Lustre's the standard spelling outside of the US, so it wouldn't make much difference. More likely to be just the preference of the original author.

Re:What is Lustre File System (3, Funny)

drsmithy (35869) | more than 3 years ago | (#34887300)

Any reason Luster cannot be spelled correctly?

Assuming the name is supposed to indicate something that shines, and not a sex addict, it is spelled correctly.

Re:What is Lustre File System (1)

TheRaven64 (641858) | more than 3 years ago | (#34888446)

Lustre is spelled correctly. It's your fork of the English language that swaps the order of letters in re endings.

Re:What is Lustre File System (1)

Skal Tura (595728) | more than 3 years ago | (#34887446)

We are in the process of using it on commodity hardware; research is underway to see whether Lustre is feasible for our needs, where high-performance, high-capacity storage is key.

And I don't see any reason why not. Of course, reproducing the kind of load we have in production is extremely hard in a testing environment, so only the future will show.

And Lustre does not need anything much more expensive than consumer-grade hardware with some nice switches, LACP... Of course, it depends solely on your bandwidth requirements.

Re:What is Lustre File System (1)

Tynin (634655) | more than 3 years ago | (#34885434)

The machines named Jaguar and Franklin are Cray's running Lustre.

The apostrophe is never used to form a plural. Not ever. No, not even then.

In terms of scalability, from the Wikipedia page for the Jaguar system at Oak Ridge National Labs (a large Cray XT5), their Lustre filesystem is 10 petabytes with read/write performance of approximately 240GB/sec (not sure what benchmark was used to get that number).

OK, so I'm not surprised if someone gets good performance from a Cray, but can't Lustre be used with lots of commodity hardware instead? I thought that was kind of the point.

I don't think a lot of people are going to go into many details on this article, because anyone using luster is liking using it in some way to leverage the idea of the cloud inside their respective business'es. Yes, I think it would scale well with commodity hardware, however there is no getting around the need of a fast interconnect between all nodes. If you cannot afford at least 10GbE for your entire storage cluster, don't even bother with luster, you'll hit a bottleneck on network IO likely with just a few boxes. Anyone who is serious about it is going to go with an Infiniband solution as it scales almost indefinitely due to its non-blocking nature. So getting any performance out of lustre isn't going to be done on the cheap.

Re:What is Lustre File System (1)

Tynin (634655) | more than 3 years ago | (#34885454)

Gah, misspelled luster over and over... I'm a few drinks into starting the weekend ;-)

Re:What is Lustre File System (1)

Daniel Phillips (238627) | more than 3 years ago | (#34886220)

I don't think a lot of people are going to go into many details on this article, because anyone using [lustre] is liking using it in some way to leverage the idea of the cloud inside their respective [businesses]

There are people who know a lot about Lustre and aren't beholden to anyone. It is a GPLed open source project after all.

Re:What is Lustre File System (1)

Tynin (634655) | more than 3 years ago | (#34886698)

Completely agreed. Sorry for speaking in generalities (likely even incorrect generalities)... not in my proper sorts today. Still, the network IO bottleneck is the same problem for everyone, and an expensive one at that. Cheers.

Re:What is Lustre File System (1)

Skal Tura (595728) | more than 3 years ago | (#34887474)

Depends on how much BW you need and how many storage nodes you are running.

What we are wanting to test out is using same system to provide OSTs and be a client, thus if we run a 8 switch stack which has switch to switch capacity of 48Gbps, and internal switch switching capacity of 176Gbps, and 48 ports in each switch + 2xModule slots for 10Gbe, i think we are going to be fine. If running fewer storage nodes we can put in Dual/Quad link NICs and LACP them if 10Gbe is not a possibility (out of ports or something).

But for our usage, the aggregate BW required to storage is probably the last in consideration unless we run huge storage nodes (think dozens upon dozens of 2Tb drives on single). Our usage is heavily dependant upon the public internet facing node's connection to internet, and storage usage will not exceed that greatly that of public internet (2-3x).

For our testing cluster of 15 frontend nodes we need only 1Gbps internet capacity, thus the 4 storage nodes with single 1Gbps links is more than enough :) Sure single node is limited to 266.66Mbps, but that is well more than enough, as we are hoping to also have local caching via other tools etc.

And we are hoping to launch this almost directly to production this spring. What is holding us back from doing this sooner is budget.
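The 266.66Mbps per-node figure above appears to come from splitting the aggregate storage bandwidth evenly across the frontend nodes. That reading is an assumption, not something the post states; the sketch below just reproduces the arithmetic with a hypothetical helper name.

```python
def per_frontend_share_mbps(storage_nodes, link_gbps, frontend_nodes):
    """Even split of the aggregate storage bandwidth across frontend nodes, in Mbps."""
    aggregate_mbps = storage_nodes * link_gbps * 1000  # 1 Gbps taken as 1000 Mbps
    return aggregate_mbps / frontend_nodes


# 4 storage nodes with a single 1Gbps link each, shared by 15 frontend nodes.
share = per_frontend_share_mbps(storage_nodes=4, link_gbps=1, frontend_nodes=15)
print(f"{share:.2f} Mbps per frontend node")
```

The result is 4000 / 15, which rounds to 266.67 and matches the quoted figure when truncated. It ignores protocol overhead and uneven load.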

Re:What is Lustre File System (1)

Jane Q. Public (1010737) | more than 3 years ago | (#34885706)

Mod parent up.

The machines named Jaguar and Franklin are Cray's running Lustre.

Extraneou's apo'strophe's make me cringe. Come on, people! Thi's i's one of the 'simple'st -- and mo'st ab'solute -- rule's in the whole Engli'sh language!

Re:What is Lustre File System (1)

bill_mcgonigle (4333) | more than 3 years ago | (#34886500)

The apostrophe is never used to form a plural. Not ever. No, not even then.

You need to mind your p's and q's on this one. There are specific do's and don'ts regarding use of the apostrophe for possessives.

You wouldn't want to go to an Oakland As game - that would be confusing. If you got straight Cs on your report card, folks would think you were a real computer geek.

Since you might be, being on Slashdot, it's possible your colleagues may get confused if you tell them to fix the QoSs on their routers.

(The general rule is to only pluralize with an apostrophe if the resultant plural would look confusing without one. Punctuation, after all, is intended to provide clarity to the written word.)

Re:What is Lustre File System (1)

vegiVamp (518171) | more than 3 years ago | (#34896160)

> The apostrophe is never used to form a plural. Not ever. No, not even then.

Actually, it can. Just not in english :-) In my native dutch, apostrophe-s is a plural, while attached s is a possessive. OP has still made a mistake, but he might be a non-native speaker.

And, yes, we get the same shit here because of english contamination :-)

Re:What is Lustre File System (1)

mscman (1102471) | more than 3 years ago | (#34909212)

The reason you can get such good performance from a Cray is the way the filesystem is connected to the machine. Many machines use an external Lustre solution, requiring slower InfiniBand connections to the entire Lustre "appliance". With a machine like a Cray, you can use internal Lustre where the disks and controllers are connected to IO nodes, and IO nodes are connected to all the compute nodes via the low-latency, high-speed interconnect internal to the system.

Re:What is Lustre File System (1)

colin_faber (1083673) | more than 3 years ago | (#34887396)

IOR was used across, um.. I think ~4000 client nodes or something around that. I don't recall exactly. As for 10 PB in size, that's not uncommon in this arena, and in fact there are a lot of sites with this much or more storage out there which don't make much noise publicly.

Re:What is Lustre File System (0)

Anonymous Coward | more than 3 years ago | (#34891398)

Colin,
It was 1344 client nodes, one for each OST. File-per-process, with each file using a stripe count of one.
Dave
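To picture what "file-per-process with a stripe count of one" means for load distribution, here is a toy placement model. It assumes a plain round-robin choice of starting OST, which is a simplification made for illustration and not a description of how Lustre's allocator actually picks targets; the function and variable names are hypothetical.

```python
from collections import Counter


def assign_osts(num_files, num_osts, stripe_count=1):
    """Toy placement: file i starts at OST (i % num_osts) and spans stripe_count OSTs."""
    return [
        [(i + k) % num_osts for k in range(stripe_count)]
        for i in range(num_files)
    ]


# One file per client, 1344 clients, 1344 OSTs, each file on a single OST.
layout = assign_osts(num_files=1344, num_osts=1344, stripe_count=1)
load = Counter(ost for osts in layout for ost in osts)
print(len(load), "OSTs in use, busiest holds", max(load.values()), "file(s)")
```

In this model every OST serves exactly one client's file, so no two clients contend for the same target. That is presumably why the benchmark was set up this way.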

Re:What is Lustre File System (1)

drinkypoo (153816) | more than 3 years ago | (#34888626)

Obviously, we have internal benchmarks that tend to show that Lustre is good but I can't talk about specifics on those.

So, uh, why mention them?

The stuff that I found most interesting is on page 12. The machines named Jaguar and Franklin are Cray's running Lustre.

So you need a Cray to get good performance?

Re:What is Lustre File System (0)

Anonymous Coward | more than 3 years ago | (#34889058)

The HPC top500 is full of such benchmarks, that's where Lustre is targeted.
You're unlikely to see other benchmarks, since nobody else can afford the hardware configuration Lustre is meant to run on (tens of thousands of cores, up to the hundreds of thousands, if I'm not mistaken, both PetaFLOP clusters (Cray's Jaguar, and I think the other is IBM's RoadRunner) run Lustre), and any benchmark using other hardware is meaningless*.

And such a benchmark would have to take place on the same installation as well, since otherwise you're effectively benchmarking the interconnect, rather than the filesystem itself.

* not to say Lustre can't run on commodity hardware, you just gotta take the results with several grains of salt, considering it's designed for use on massively distributed petabyte+ filesystems.

If you're interested though, last i cared enough to check, the Top500's benchmarking suite is readily available.

Re:What is Lustre File System (4, Interesting)

CAIMLAS (41445) | more than 3 years ago | (#34884770)

At a functional level, Lustre (GPL) is to ZFS (CDDL) as CXFS (commercial) is to XFS (GPL) for SGI. They are the upper 'cluster' layer to take advantage of the underlying filesystems' capability. I believe this approach is divergent from that of GFS, due to the upper/lower approach, but I'm not that familiar with clustered filesystems.

However: Arguably, Lustre on ZFS is a much better option due to ZFS's inherent capability superiority over XFS. I've liked XFS historically, but ZFS is so drastically superior to anything else out there (in terms of storage management and available capacity and throughput) - all 'out of the box' - that it's a no-brainer to use zvols for things other than direct zfs posix access. (For instance, they make great VM iSCSI targets, or local raw disks for VMs, or..)

Side note: the linux zfsonlinux.org port is being successfully used as the base volume manager for Lustre right now, so it is apparently quite capable/stable at that level. (zfsonlinux does not yet have zfs posix support.) Lustre on ZFS, apparently, scales much better than the traditional LVM/RAID/etc. backend methods.

Re:What is Lustre File System (2)

Monkeedude1212 (1560403) | more than 3 years ago | (#34885024)

At a functional level, Lustre (GPL) is to ZFS (CDDL) as CXFS (commercial) is to XFS (GPL) for SGI.

And who says the IT world has too many confusing Acronyms?

Re:What is Lustre File System (1)

CAIMLAS (41445) | more than 3 years ago | (#34885758)

I don't suppose "They made (asked) me to do it!" is a legitimate excuse?

Inversely, if we had long names for everything, we'd soon get confused and have insufficient time to actually work.

Re:What is Lustre File System (1)

Daniel Phillips (238627) | more than 3 years ago | (#34886192)

Lustre on ZFS is a much better option due to ZFS's inherent capability superiority over XFS.

I can tell you with a high degree of confidence that ZFS is a poor option for Lustre compared to its traditional backends, Ext3 and Ext4. One simple reason: ZFS has about half the transaction throughput.

Re:What is Lustre File System (0)

Anonymous Coward | more than 3 years ago | (#34886566)

Lustre on ZFS is a much better option due to ZFS's inherent capability superiority over XFS.

I can tell you with a high degree of confidence that ZFS is a poor option for Lustre compared to its traditional backends, Ext3 and Ext4. One simple reason: ZFS has about half the transaction throughput.

But ZFS is cool and trendy among geeks! Ext3/4 can't beat that!!

Re:What is Lustre File System (1)

Nutria (679911) | more than 3 years ago | (#34889672)

But ZFS is cool and trendy among geeks!

Really? I thought the CDDL put the kibosh on that idea a year ago...

Re:What is Lustre File System (1)

CAIMLAS (41445) | more than 3 years ago | (#34895998)

It does? How do you figure that? You will be somewhat limited compared to other filesystems for 'raw speed', but that's why ZFS has built-in read and write cache functionality (via SSD or ramdisk). Unless we're talking about massive amounts of sustained reads and writes, with no time for the disks to catch up, I suspect that (oh) 32Gb of SSD or so would do the trick for most hosts, or 128Gb for a 'high demand' host member to Lustre.

So yeah, if you consider cache, ZFS is going to blow the snot out of anything without it. It would be trivial to configure a zvol to offer total IOPS capability which would saturate the bus bandwidth with small writes, for instance.

If you're archiving, just don't use cache drives.

Besides, I'm reasonably certain that Lustre actually would prefer ZFS as a backend (in terms of performance) because that's what they're using. You can't scale to petabytes of storage with those other filesystems, if for no other reason because they lack the built-in parity and suffer a number of other 'lots of disks' management issues.

Seriously: for the price of storage today, adding an SSD in front of a couple disks to make the performance markedly better is not a significant cost. It's an easy out, and is far cheaper and easier than pretty much any other high-IOPS approach.

Re:What is Lustre File System (1)

Guy Smiley (9219) | more than 3 years ago | (#34901880)

Actually, I'm 100% certain Lustre is NOT using ZFS today. It is actually using ldiskfs for the backing filesystem, which is a modified version of ext4. While work was ongoing to port Lustre over to ZFS, this was not completed.

Re:What is Lustre File System (1)

Daniel Phillips (238627) | more than 3 years ago | (#34886208)

Lustre on ZFS, apparently, scales much better than the traditional LVM/RAID/etc. backend methods.

By the way, where did you get that idea?

Re:What is Lustre File System (1)

Skal Tura (595728) | more than 3 years ago | (#34887508)

Same where he thinks ZFS is fast & default for lustre, and same where he thinks Lustre and LVM/RAID is the same type of a thing :D

Re:What is Lustre File System (1)

CAIMLAS (41445) | more than 3 years ago | (#34896042)

Yes, because that was obviously what I intended to convey. Thank you for pointing out that your level of reading comprehension is likely very similar to that of a politician's.

Re:What is Lustre File System (2)

CAIMLAS (41445) | more than 3 years ago | (#34896038)

I'll tell you where I got that idea: experience.

Managing filesystems in lvm2, on raid cards - all with their own specific commands - is a real pain in the ass when you've got tens of hosts or more per admin, with many different roles and functionality.

So then you've got to have snmp set up for each of those hosts (often with different controller cards) to monitor those RAID cards status (with the shitty RAID console tool which lacks anything resembling documentation). Then you've got to manage LVM, with its "easily" understood UUIDs. And then you've got to manage your filesystem - whatever that may be - to verify it's intact.

With Lustre, you'll be adding another layer of 'shit I have to monitor'. Even if your hardware is 100% identical across all nodes, is that not a significant pain in the ass to manage for?

ZFS just needs an HBA and 'bare' drives. Barring ZFS itself failing on you (not outside the bounds of reason), you've really only got one/two things to look out for with zfs: data errors and disk failures, both of which are reported with zpool status and can be easily monitored. This is regardless of the version of the system, or even which OS it's running on (Solaris and its derivatives, FreeBSD, and yes, even Linux).

Unlike with RAID, I don't have to contend with the possibility of RAID-5 write hole.

If I need to move to new hardware in a pinch, I can - trivially. It can be a desktop, assuming I've got a suitable controller for all the disks (nothing 'special').

Rebuilds are a fraction of the time as they are on hardware RAID (or mdraid, until recently), because only the actual data is replicated.

Hell, you're going to get more performance out of ZFS than you will from the 'common' vendor (Dell, HP, IBM) RAID cards.

I'm assuming Lustre will scale better on ZFS than on traditional methods because everything else scales better on ZFS. Growth is easier; management is easier; maintenance is easier. "Storage management" is actually somewhat enjoyable with ZFS, instead of the tedious skulduggery it usually tends to be otherwise.
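As a sketch of the kind of monitoring described above, a health check can be as small as the snippet below. It assumes the zpool command is on the PATH and relies on `zpool status -x` printing "all pools are healthy" when nothing is wrong; the function names are hypothetical, and a real deployment would feed the result into whatever alerting is already in place.

```python
import subprocess

# Assumption: this is the message `zpool status -x` prints when no pool has problems.
HEALTHY_MESSAGE = "all pools are healthy"


def pools_healthy(status_output):
    """Return True when the given `zpool status -x` output reports no problems."""
    return status_output.strip() == HEALTHY_MESSAGE


def check_pools():
    """Run `zpool status -x` and return (healthy, raw_output)."""
    result = subprocess.run(
        ["zpool", "status", "-x"], capture_output=True, text=True, check=False
    )
    return pools_healthy(result.stdout), result.stdout
```

Calling check_pools() from cron or an SNMP extension script would cover both the data-error and the disk-failure case, since either one makes the output deviate from the healthy message.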

Re:What is Lustre File System (1)

Skal Tura (595728) | more than 3 years ago | (#34887504)

ZFS is slow on linux and Lustre runs on EXT3/EXT4 by default. In fact, Lustre was quite a big contributor to bringing ext4 into existence by optimizing Ext3.

http://en.wikipedia.org/wiki/Ext4 [wikipedia.org]

Comparing lustre to LVM/RAID is comparing Apples to Oranges. One is network cluster file system, another is local storage management.

Re:What is Lustre File System (1)

CAIMLAS (41445) | more than 3 years ago | (#34896068)

I didn't compare Lustre to LVM/RAID - I compared ZFS to it.

What sits on either would be Lustre, obviously. ZFS is significantly superior to RAID + LVM in pretty much every way, barring super-expensive hardware RAID controllers where RAID has a slight trump in and of itself. (Though it should be noted that these RAID controllers would likely provide significant benefit to a ZFS system, too.)

What I have to wonder is: what kind of storage methods or devices does a 'network cluster file system' use? Here's a guess: local storage. Since ZFS is already geared to be a 'network aware' filesystem, in many ways (as much as a filesystem can be, I suppose), it's got fundamental underpinnings (like get/send and the ability to use pools and zvols on other physical hosts transparently) which make it much, much better for clear pictures of your data than ext# on LVM or straight RAID.

To be frank, ext# has more in common with FAT16 than it does ZFS filesystem level functionality (nevermind the fact that ZFS is filesystem, volume manager, and storage controller all in one, doing things which "just a filesystem" could never accomplish on its own).

Ended project (4, Informative)

diegocg (1680514) | more than 3 years ago | (#34884786)

According to insidehpc [insidehpc.com], Oracle has stopped developing Lustre and developers "have reportedly been encouraged to apply for other positions within the company".

A group of Lustre users already created OpenSFS [opensfs.org] on October 2010 to continue developing Lustre.

Re:Ended project (1)

hackstraw (262471) | more than 3 years ago | (#34886568)

If necessary, it will be forked. Between OpenSFS and WhamCloud there will always be a home for Lustre. WhamCloud already has contracts with Lawrence Livermore National Lab and Oak Ridge National Lab. Oak Ridge already has the largest Lustre filesystem to date. And there is also DDN, which supplies the hardware for most of the larger Lustre sites and has a local copy of Lustre that they distribute as well. Lustre is more than fine; it's just a little lost finding a home at this time.

I see, so according to the F/OSS folks.... (1)

Anonymous Coward | more than 3 years ago | (#34884818)

Oracle acquired a lot of open source tech from Sun that has since been forked — or is in the process of being forked.

Is really:

Oracle acquired a lot of open source tech from Sun that has since been fucked— or is in the process of being fucked [by Oracle].

Very first thing to do is... (3, Interesting)

Daniel Phillips (238627) | more than 3 years ago | (#34886120)

Lose every tie to ZFS. Every. Single. One.

Right now.

Like every piece of software Oracle is involved in, ZFS is a big fat patent trap. Not only that, but ZFS is a lot slower than Ext3 and Ext4, and probably Btrfs[1] as well. There is absolutely no benefit to using ZFS as an object storage target, there is only the certainty of legal problems.

[1] Oracle is involved with Btrfs too, so exercise due caution.

Re:Very first thing to do is... (1)

Anonymous Coward | more than 3 years ago | (#34886198)

Unfortunately, Lustre-on-ZFS [zfsonlinux.org] is substantially faster than lustre on ext3, mainly because ZFS combines the features of an lvm and a filesystem. That eliminates the need to have SAN appliance heads managing the storage and provides some additional data integrity features. It's cheaper too.

Re:Very first thing to do is... (1)

Lennie (16154) | more than 3 years ago | (#34886274)

Is it faster because of the ZFS intent log and second level cache on SSD ?

Re:Very first thing to do is... (2)

Daniel Phillips (238627) | more than 3 years ago | (#34886824)

Unfortunately, Lustre-on-ZFS [zfsonlinux.org] is substantially faster than lustre on ext3, mainly because ZFS combines the features of an lvm and a filesystem

That's bafflegab and incorrect. Or if you disagree, please explain why.

Re:Very first thing to do is... (2)

Daniel Phillips (238627) | more than 3 years ago | (#34886834)

And by the way, is your opinion based on benchmarks, or on hype from Sun? I strongly suspect the latter.

Re:Very first thing to do is... (1)

dsouth (241949) | more than 3 years ago | (#34890774)

It appears to be based on the linked site:

"In particular, ZFS’s advanced architecture addresses two of our key performance concerns: random I/O, and small I/O. In a large cluster environment a Lustre I/O server (OSS) can be expected to generate a random I/O workload. There will be 100’s of threads concurrently accessing different files in the back-end file system. For writes ZFS’s copy-on-write transaction model converts this random workload in to a streaming workload which is critical when using SATA disks. For small I/O, Lustre can leverage a ZIL placed on separate SSD devices to maximize performance."

The LLNL ZFS study has been pretty widely publicized in the HPC community. Lustre uses the filesystem API rather than mounting in. Until now Lustre used ext under-the-hood for data storage, so the performance improvement from ZFS is relative to ext. ext3/4 may very well outperform ZFS on a workstation or small server, but that's not what Lustre is used for (even their test system is ~900TB).

Disclaimer: I used to work for LLNL.

Re:Very first thing to do is... (1)

Daniel Phillips (238627) | more than 3 years ago | (#34902320)

It appears to be based on the linked site:

"In particular, ZFS’s advanced architecture addresses two of our key performance concerns: random I/O, and small I/O. In a large cluster environment a Lustre I/O server (OSS) can be expected to generate a random I/O workload. There will be 100’s of threads concurrently accessing different files in the back-end file system. For writes ZFS’s copy-on-write transaction model converts this random workload in to a streaming workload which is critical when using SATA disks. For small I/O, Lustre can leverage a ZIL placed on separate SSD devices to maximize performance."

The LLNL ZFS study has been pretty widely publicized in the HPC community. Lustre uses the filesystem API rather than mounting in. Until now Lustre used ext under-the-hood for data storage, so the performance improvement from ZFS is relative to ext. ext3/4 may very well outperform ZFS on a workstation or small server, but that's not what Lustre is used for (even their test system is ~900TB).

Disclaimer: I used to work for LLNL.

Disclaimer: I used to work on Ext3. I would classify the above as "hype from Sun". There is a hidden cost to making all the writes linear on spinning media: the reads become nonlinear. This is usually the wrong tradeoff.

Note that a traditional journal is another way of linearizing writes, in that a write transaction can be considered durably recorded to media as soon as the journal write completes.

Benchmarks tell the true story, not hype, and on good information and belief the benchmarks say ZFS is slow.
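The tradeoff described above (linear writes paid for with nonlinear reads) can be illustrated with a toy block-layout model. This is not a benchmark and does not model ZFS's or ext3's real allocators, caches, or journals; it only counts how often consecutive accesses land on non-adjacent blocks, under the assumption that every copy-on-write overwrite goes to the next free block. All names are hypothetical.

```python
import random


def count_seeks(positions):
    """Count jumps where the next block is not physically adjacent to the previous one."""
    return sum(1 for a, b in zip(positions, positions[1:]) if b != a + 1)


def simulate(num_blocks=1000, num_overwrites=500, seed=0):
    rng = random.Random(seed)
    targets = [rng.randrange(num_blocks) for _ in range(num_overwrites)]

    # Update-in-place: logical block i always lives at physical block i,
    # so random overwrites seek, but a logical scan stays sequential.
    in_place_layout = list(range(num_blocks))

    # Copy-on-write, log style: each overwrite is appended at the next free
    # physical block, so writes are sequential but the file fragments.
    cow_layout = list(range(num_blocks))
    cow_writes = []
    next_free = num_blocks
    for t in targets:
        cow_layout[t] = next_free
        cow_writes.append(next_free)
        next_free += 1

    return {
        "in_place_write_seeks": count_seeks(targets),
        "in_place_read_seeks": count_seeks(in_place_layout),
        "cow_write_seeks": count_seeks(cow_writes),
        "cow_read_seeks": count_seeks(cow_layout),
    }


results = simulate()
for name, seeks in results.items():
    print(name, seeks)
```

In this model the copy-on-write layout shows zero write seeks and many read seeks, and the in-place layout shows the reverse. Which side of that trade wins on real hardware is exactly what the benchmarks being argued about would have to settle.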

Re:Very first thing to do is... (1)

Fallen Kell (165468) | more than 3 years ago | (#34886798)

If you are comparing ZFS performance on linux, then, yes, it is slower, because ZFS on linux is not done at the kernel level and thus has a huge performance loss as compared to ZFS on Solaris/OpenSolaris. There have been plenty of benchmarks out there showing ZFS's performance besting EXT3 and EXT4 on identical hardware (with one running OpenSolaris and the others on linux). It is a shame that Oracle has no intentions of continuing its development, the same with lustre. Two years ago they were talking about lustre on Solaris with ZFS, but that never materialized, the same with ZFS in a kernel module for linux, again, it never happened because of the Oracle deal... With Oracle, if they can't see a way to make money on it immediately, it is dead. No long term view there at all.

Re:Very first thing to do is... (1)

Daniel Phillips (238627) | more than 3 years ago | (#34886848)

If you are comparing ZFS performance on linux, then, yes, it is slower

No, I am comparing Ext3/4 on linux to ZFS on Solaris.

Re:Very first thing to do is... (1)

Daniel Phillips (238627) | more than 3 years ago | (#34886852)

There have been plenty of benchmarks out there showing ZFS's performance besting EXT3 and EXT4 on identical hardware (with one running OpenSolaris and the others on linux)

Link please.

Re:Very first thing to do is... (0)

Anonymous Coward | more than 3 years ago | (#34887464)

These guys seem to have measurements [scalabilty.org] that show zfs performance is well below xfs. Everything I've seen backs up these claims.

Re:Very first thing to do is... (0)

Anonymous Coward | more than 3 years ago | (#34888158)

Why don't you provide a link, for your standpoint?

Re:Very first thing to do is... (0)

Anonymous Coward | more than 3 years ago | (#34889200)

No, ZFS in the Linux kernel never happened because the GPL forbids it. It has nothing to do with Oracle, it's a problem with Linux's licensing.

It makes no sense to make this out to be Oracle's fault on the grounds that they won't re-license it to satisfy Linux's licensing requirements. Sun had good reasons for going with the CDDL, and Oracle has equally good reasons for sticking with it. If the Linux people want kernel-mode ZFS so badly, they are welcome to address the restrictions set in place by the license the Linux kernel is under. That's the deal-breaker here, and something that should be kept in mind: the issues preventing kernel-mode ZFS on Linux are coming from Linux, not ZFS.

On the other hand, if you're hellbent on Linux and not too invested in the kernel, you're welcome to try out the happy compromise that is Nexenta: the Linux userland, with all its Ubuntuiness, on top of the OpenSolaris kernel, with kernel-mode ZFS (and other goodies, like Crossbow and Solaris Zones). Just keep in mind that the Nexenta implementation lags somewhat behind the OpenIndiana implementation, which in turn lags behind Solaris' reference implementation.

Also, I don't recall either Oracle or Sun ever talking about an initiative to get kernel-mode ZFS on Linux, simply because the restrictions set in place by Linux's license are the show stopper. Still, nothing stops distributions from porting ZFS to the Linux kernel and distributing it as a separate add-on, as is done for binary NVidia drivers (for example), to get around those restrictions. The fact that nobody bothers doing this also isn't Oracle's fault. ZFS is open source after all: if you want it, and nobody else will do it for you, you're welcome to start up a project yourself and hope it captures the interest of others in the community. Oracle won't stop you, just like they didn't stop Nexenta, FreeBSD or Apple (all of which have kernel-mode ZFS), though you might get a fair deal of flak from the Linux crowd for bypassing GPL restrictions.

Re:Very first thing to do is... (1)

icebraining (1313345) | more than 3 years ago | (#34892416)

Sun had good reasons for going with the CDDL, and Oracle has equally good reasons for sticking with it.

Yes, keeping Linux out on purpose:

In the words of Danese Cooper, who is no longer with Sun, one of the reasons for basing the CDDL on the Mozilla license was that the Mozilla license is GPL-incompatible. Cooper stated, at the 6th annual Debian conference, that the engineers who had written the Solaris kernel requested that the license of OpenSolaris be GPL-incompatible. "Mozilla was selected partially because it is GPL incompatible. That was part of the design when they released OpenSolaris. [...] the engineers who wrote Solaris [...] had some biases about how it should be released, and you have to respect that"

http://meetings-archive.debian.net/pub/debian-meetings/2006/debconf6/theora-small/2006-05-14/tower/OpenSolaris_Java_and_Debian-Simon_Phipps__Alvaro_Lopez_Ortega.ogg [debian.net]

the fact that nobody bothers doing this

http://zfsonlinux.org/ [zfsonlinux.org]

On the other hand, if you're hellbent on Linux and not too invested in the kernel

That makes no sense. Linux _is_ the kernel. Do you mean GNU?

Re:Very first thing to do is... (1)

Skal Tura (595728) | more than 3 years ago | (#34887532)

Under linux this is so true, but under *BSD the Deduplication portion works too, and that is an excellent feature if you are running a huge amount of storage.
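
For anyone wondering what dedup actually buys you, here is a rough sketch of the general idea: hash each block and store any given block only once. To be clear, this is a toy, not how ZFS implements it. The `DedupStore` class, its method names, and the 4096-byte block size are all made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen for illustration only


class DedupStore:
    """Toy content-addressed block store; NOT ZFS's actual design."""

    def __init__(self):
        self.blocks = {}    # digest -> block bytes (each unique block kept once)
        self.refcount = {}  # digest -> number of references to that block
        self.files = {}     # file name -> ordered list of block digests

    def write(self, name, data):
        # Overwriting an existing name is not handled in this sketch.
        digests = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block  # first time we see this content
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reassemble the file from its block digests.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        # Bytes physically held, as opposed to bytes logically written.
        return sum(len(b) for b in self.blocks.values())
```

Write the same data under two names and `stored_bytes()` stays the same after the second write, which is where the savings on a huge pool of similar data would come from. The trade-off is that the digest table has to be kept somewhere and looked up on every write.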

Re:Very first thing to do is... (4, Informative)

TheRaven64 (641858) | more than 3 years ago | (#34888496)

ZFS is a big fat patent trap

Oracle has released the ZFS code under the CDDL. While lots of Linux people hate the license, it has very strong patent retaliation clauses. Oracle explicitly grants you patent licenses for everything required to use ZFS via clause 2.1. All other contributors do via clause 2.2. Anyone exerting patents against ZFS immediately (well, within 60 days) loses this grant and has their (copyright) license terminated as well via clause 6.2.

Since Sun accepted third-party contributions to ZFS under the OpenSolaris program, if Oracle tried exerting patents against any ZFS distributor then they would immediately have to stop distributing Solaris and then remove all of these contributions before they could start again.

The ZFS patents are only an issue for a reimplementation of ZFS for Linux, and that's a problem caused by the GPL. Using the FreeBSD or NetBSD ports of ZFS (or even the FUSE port) gives you an explicit grant to the patents.

Re:Very first thing to do is... (1)

icebraining (1313345) | more than 3 years ago | (#34892436)

The ZFS patents are only an issue for a reimplementation of ZFS for Linux, and that's a problem caused by the GPL.

"Mozilla was selected partially because it is GPL incompatible. That was part of the design when they released OpenSolaris. [...] the engineers who wrote Solaris [...] had some biases about how it should be released, and you have to respect that" - Danese Cooper
http://caesar.acc.umu.se/pub/debian-meetings/2006/debconf6/theora-small/2006-05-14/tower/OpenSolaris_Java_and_Debian-Simon_Phipps__Alvaro_Lopez_Ortega.ogg [acc.umu.se]

Re:Very first thing to do is... (2)

Anonymous Coward | more than 3 years ago | (#34889110)

Comparing ZFS to any of the EXT FSes is pointless, and utterly misses the point of ZFS.

Do ext3/4 provide snapshotting?
Do they provide deduplication?
Do they perform hash checks to avoid duplicating files in the first place?
Do they provide ANY of the dozens of features that set ZFS apart from other filesystems?

Don't bother, the answer is no.

And if you're going to disable those features on ZFS, then you have no reason to be using it in the first place. You're effectively making an apples to zebras comparison: arguing that because a certain technology is owned by a certain company you don't like, you should avoid it at all costs, on the basis that other products that don't even come remotely close to offering the same functionality are somehow better at what it provides, by not providing it at all.

Beyond that, Oracle is stuck abiding by the terms of the CDDL for as long as they continue to distribute ZFS under the CDDL, while keeping in mind that releases distributed under the CDDL remain under the CDDL.

You're not one of those butthurt zealots fearmongering over ZFS on the grounds that the GPL forbids making a useful Linux implementation of it, are you? You certainly seem to be.
