Beta
×

Welcome to the Slashdot Beta site -- learn more here. Use the link in the footer or click here to return to the Classic version of Slashdot.

Thank you!

We are sorry to see you leave - Beta is different and we value the time you took to try it out. Before you decide to go, please take a look at some value-adds for Beta and learn more about it. Thank you for reading Slashdot, and for making the site better!

Comments

top

Brown Dog: a Search Engine For the Other 99 Percent (of Data)

Vesvvi The problem isn't the format of the data... (23 comments)

Although you have a point, you don't understand the realities of science, data, and publishing.

Journal articles never contain sufficient information to replicate an experiment. That's been reported multiple times and also discussed here previously indirectly: in particular there was the study about how difficult/impossible it is to reproduce research. Many jumped into the fray with the fraud claims when that report hit, but the reality is that it's just not possible to lay out every little detail in a publication, and those details matter a LOT. As a consequence, it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.

The data is not hidden behind paywalls: there is minimal useful data in the publications. Of course, the paywalls do hide the methods descriptions, which is pretty bad.

There are two major obstacles to dissemination of useful data. This first is that the metadata is nearly always absent or incomplete, and the format issue is a subset of this problem. The second is that data is still "expensive" enough that we can't trivially just have a copy of all of it. This means that it requires careful stewardship if it's going to be archived, and no one is paying for that.

about a month and a half ago
top

Laying the Groundwork For Data-Driven Science

Vesvvi science driven science? (55 comments)

This particular push may not be effective, but it's not hype.

Science may be data-driven, but historically scientists have not been trained to be good data custodians. They know reasonably well how to use data, but they don't know how to store it, label it, transfer it, etc. Go pick an article from 5 years ago which is data-heavy and try to get the original dataset from the authors: 95 times out of a hundred you'll spend a month emailing people and you'll end up with nothing. Four more out of the 100 you'll get an Excel spreadsheet without labels on the columns. Scientists desperately need to become better at managing data.

Personally, I think that this program is targeting a small subset of the people who need help, and as such it won't be very effective. These look like infrastructure projects, but infrastructure only drives trends in extremely rare cases. Here's a quote from one funded proposal:

This project develops web-based building blocks and cyberinfrastructure to enable easy sharing and streaming of transient data and preliminary results from computing resources to a variety of platforms, from mobile devices to workstations, making it possible to quickly and conveniently view and assess results and provide an essential missing component in High Performance Computing and cloud computing infrastructure.

Will that project help teach scientists they shouldn't email files to themselves as a method of long-term archival? Yes, that really is extremely common. We should be focusing on building data tools which are extremely simple, extremely broad in scope, and encourage or force adoption of those tools.

about a month and a half ago
top

Mark Zuckerberg Throws Pal Joe Green Under the Tech Immigration Bus

Vesvvi Re:This debate is about money. (261 comments)

H1B are allowed to get new employment within their area of expertise. I've seen a lot of this with H1B employees exiting our workplace: the only problem is they can't take a temporary position doing something menial while they look for a new job at a different company.

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re:Unfamiliar (370 comments)

So "p" is the probability of a drive being down at any given time. A hard drive takes a day to replace, and has a 5% chance of going dead in a year. A given hard drive has a "p" of ~1.4e-4.

For RAID6 with 8 drives, you can drop 2 independent drives: failure = 1.4e-10. It's out in the 6+ nines.

It would take 6x sets of mirrors to get the same space. Each mirror has a failure probability of (p^2), 1.9e-8. Striped over the mirrors, all sets have to stay active: success = (1-p^2)^6, failure = 1.1e-7. Way easier to calculate without the binomial coefficient, by the way.

Technically, the mirrors are 3 orders of magnitude more likely to fail, but the odds are still ridiculously good. Fill a 4U with 22 drives (leave some bays for hot-swap) as mirrors and it's failure = 2e-7. Statistically, neither of these is going to happen: you just won't see two drives happen to go down together by random chance.

People already know this. There are much more advanced models that account for the what-happens-next situation after you've already lost a single drive, and of course it non-linearly worse. But just to keep it simple, going back to the naive model, for the RAID6 with 7 remaining drives, the failure probability is now up to 4e-7 during the re-silver time. The mirror model stays at a "huge" failure = 1.4e-4 during a re-silver, but it's brief, predictable, and with low system impact. It's my stance that that kind of probability keeps it in the category of less-important compared to many other factors for a risk analysis.

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re:above, below, and at the same level. ZFS is eve (370 comments)

Sorry, I'm not that familiar with OpenSolaris.
Don't the first and second commands create a zpool backed by a file? That's not what's at question here, I want to know if you can back a zpool with a zvol created on that same zpool.

A quick test showed that it does work on FreeBSD to create a zpool upon a zvol from a different zpool. The circular version has made it hang for a not-insignificant amount of time...

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re:Unfamiliar (370 comments)

That's a nice writeup.
I'm sure you've chosen that configuration for a reason, but I think it's a good example for why stripes over mirrors can be a better choice for some applications

You are running raidz2(7x4TB)+raidz2(8x2TB). Let's say that instead it was 3x(mirror(2x4TB))+4x(mirror(2x2TB)). Your capacity is 32TB as-is, or 20TB as mirrors: obviously that's a huge loss, and factoring in heat/electricity/performance/reliability it's likely that the raidz is a good choice for a home setup. Bandwidth would also be more that sufficient for home use.

But as you mention, the upgrades either take forever (one drive at a time) or require ridiculous free ports (add 7x at once?!). Even if you were to do them all at once, it would still be a fairly slow process with a massive performance hit.
On the other hand, with mirrors you can increase capacity 2 drives at a time, and at that level it's reasonable to leave both drives active as part of the "mirror" (now, 4-way) for some time. This is my preferred approach: new drives get added to a mirror set and run along with the system for a month or two. This stress-tests them, and if any point there are warning signs the drives can be dropped out immediately. If all is good after the test period, the old 2x of the mirror are removed and the space is immediately available (autoexpand=on). The process can then be repeated. Overall it takes as much or more time than your approach, but the system is completely usable during that time with no real performance hits, and of course the overall system performance is substantially improved with the equivalent of 7 devices running in parallel instead of 2.

There are definitely situations in which raidz2/3 makes more sense than mirrors, but if you're regularly expanding or looking for performance, I think the balance favors mirrors.

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re:above, below, and at the same level. ZFS is eve (370 comments)

Have you confirmed using a zvol underneath a zpool, and if so was it a different zpool?
I've wanted to do that in the past, but it was specifically blocked. It's a pretty ugly thing to do, but it does give you a "new" block device that could be imported as a mirror on-demand. With enough drives in the zpool, that new device is nearly independent from its mirror, from a failure perspective.

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re:production ready? (370 comments)

Is it a problem to add them by ID? I intentionally use partition IDs because they're stable, and it works well in both Linux and FreeBSD, but the FreeBSD people seem to like labels or raw device names.
Regardless, every import should bring the same pool online in the same way, regardless of the device names.

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re:I agree... (370 comments)

Maybe your ZIL comments are specific to Linux? It used to be the case in FreeBSD that you had to have the ZIL present to import, and a dead ZIL was a very big problem, but that was many versions ago (~3-4 years?). I personally went through this when I had a ZIL die and the pool was present but unusable. I was able to successfully perform a zpool version upgrade on the "dead" pool, after which I was able to export it and re-import it as functional without the ZIL.

Note that this was NOT a recommended sequence of operations, and I wouldn't suggest it unless you have no choice.

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re: Working well for me (370 comments)

Not all Adaptec controllers are supported by FreeBSD. It would be a "safer" choice to use LSI, since they work great in Linux and FreeBSD: that gives you the option to migrate your host OS should you desire.

Admittedly, if you're changing over that much then buying new controllers isn't a big deal, but I like to have the option of having the "reference" implementation of ZFS just a few minutes away.

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re:Unfamiliar (370 comments)

If you want to add 4 more TB, then you attach a new set of mirror, and you're left with RAID6(12x)+RAID1(2x). There is zero rebalancing (for better or worse): it's available immediately and transparently. The only catch is that you can't remove it again, but you can replace it with any combination of storage that provides equal or greater capacity to your RAID1(2x).

You could also grow your RAID6, and it's more efficient that it would be on most normal hardware RAID. But please don't do that: RAID5/6 really should be phased out, and it's not a good idea to create huge RAIDZ groups, even as RAIDZ2+. If you really want to stick with RAID5/6, it's better to just make a new group: leave your RAID6(12x) and add another RAID6(n

about 2 months ago
top

The State of ZFS On Linux

Vesvvi Re:above, below, and at the same level. ZFS is eve (370 comments)

I think you're giving the wrong idea here. I have yet to find a format of storage capacity that zfs won't support, with one exception: you can't create a zvol on a zpool, then attach that zvol as back-end storage for the same zpool. That is specifically disallowed, and I'm guessing that you can't use a zvol from one zpool to back-end another zpool either. This is a very bizarre (also, probably dumb) thing to do, but even this can be overridden if you're really desperate. For more practical applications, everything else just works: at least in FreeBSD, you can "hide" the block devices behind all different kinds of abstractions to provide 4k writes, encryption, whatever, and zfs will consume those virtual block devices just fine.

about 2 months ago
top

Climate Change Skeptic Group Must Pay Damages To UVA, Michael Mann

Vesvvi You're both absolutely, painfully correct. (497 comments)

This is the saddest part about the AGW debate. From my viewpoint, it looks like the pro-AGW people pushed back against criticism overwhelming their "opponents" with data and consensus, and tried to extinguish them via marginalizing them.

That was absolutely the worst thing that could have been done, for the reasons you note above. On the other hand, if they had embraced and extended, the whole debate could have been extinguished.

We could be working with alternate energy sources for reasons of dominance in international trade, national security / energy independence, etc. Instead, we've actively been pushed backward by the the pro-AGW agenda, and they deserve some of the blame for that.

about 4 months ago
top

How Did Those STAP Stem Cell Papers Get Accepted In the First Place?

Vesvvi Re:The same way many global warming papers got pub (109 comments)

I believe you are misreading the previous post. What they are saying is that political and financial pressure has an impact on scientists. It was not a comment on politics/finance vs science.

about 5 months ago
top

Oklahoma Moves To Discourage Solar and Wind Power

Vesvvi Re:But They Sell at Wholesale (504 comments)

They only make a profit if they can sell it or store it at the time it is being generated. When the demand dips low enough (a substantially non-zero number, given technical and business overheads), they start losing money.

about 6 months ago
top

Mathematicians Devise Typefaces Based On Problems of Computational Geometry

Vesvvi Re:zeroes (60 comments)

I really wish I had mod points for this.

about 7 months ago
top

Stanford Team Tries For Better Wi-Fi In Crowded Buildings

Vesvvi Re:and how do they track users across muilt units? (43 comments)

I assume this is something like Ubiquiti's "zero-handoff" system, where the wireless APs coordinate to form an virtual network that spans across all the APs. I'm not a networking pro, but I guess you could say it's like the opposite of a VLAN? Although based on the article, it sounds like there are effectively VLANs running on their mesh to partition it into per-user virtualized segments. In that situation, there is nothing that prevents you from having a static or even public IP.

about 8 months ago
top

Ask Slashdot: What's the Most Often-Run Piece of Code -- Ever?

Vesvvi Re:Solved. Next? (533 comments)

You could break it down further to look at the lower-level "operations" being performed.

In order to see if the next amino acid should be a glutamine instead of an asparagine, each of the two are going to be bounced into the active site to see if they "fit": if they fit, they're incorporated and if not, they bounce out and the next one is tried.

So you could think of this as an if, else repeated really ridiculously quickly: at the speed of molecular diffusion, minus just a tiny bit. That makes it both extremely fast and extremely parallel, as you observe.

As a side note, nearly ever molecular interaction operates in this way, unless a substrate is literally "handed-off" between two enzymes. That would definitely make a molecular recognition operation the most highly "executed" routine (at least down to that scale) by untold orders of magnitude.

about 10 months ago

Submissions

Vesvvi hasn't submitted any stories.

Journals

Vesvvi has no journal entries.

Slashdot Login

Need an Account?

Forgot your password?