Slashdot: News for Nerds



Genome Researchers Have Too Much Data

Soulskill posted more than 2 years ago | from the we-should-try-storing-it-in-dna dept.

Biotech 239

An anonymous reader writes "The NY Times reports, 'The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore's law. The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data. Now, it costs more to analyze a genome than to sequence a genome. There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"


Last post (2, Funny)

Anonymous Coward | more than 2 years ago | (#38241606)

All previous posts have been purged due to too much data.

Re:Last post (5, Funny)

NFN_NLN (633283) | more than 2 years ago | (#38242386)

There is now so much data, researchers cannot keep it all.' One researcher says, 'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'"

Perhaps they can come up with a new type of storage mechanism modeled after nature. They could store this data in tight helical structures and instead of base 2 use base 4.

Wrong problem (4, Interesting)

sunderland56 (621843) | more than 2 years ago | (#38241614)

They don't have too much data, they have insufficient affordable storage.

Re:Wrong problem (0, Informative)

Anonymous Coward | more than 2 years ago | (#38241626)

Only kind of correct - they also don't really have a clue what it means. It is kind of like reading a binary program and trying to say what the program does.

Re:Wrong problem (2, Informative)

Anonymous Coward | more than 2 years ago | (#38241920)

Only kind of kind of correct - they also don't really have a clue as to the accuracy. With the short-read Illuminas that dominate, they have problems with repeats, inversions and deletions, the base pairs with hydroxymethyl C or thiophosphate, the sequence of the centromeres and telomeres, and the ability to put contigs into phase with parental genomes....aside from that, it's all peachy

oh yeah, I bet the contamination rates are not real good either (there was a paper a few months ago on this, looking at public data bases, kinda scary)

Re:Wrong problem (1)

Samantha Wright (1324923) | more than 2 years ago | (#38242066)

Actually, short reads aren't as bad as they may seem from a distance. The lab for which I consult has spent about a year surveying second-gen sequencing platforms, and it turns out the 5th-generation ABI SOLiD platform finally lives up to its name, even though it uses only ~20 nt reads instead of the Illumina's 100. The chemistry has improved to a point where read quality isn't the biggest issue any more.

Re:Wrong problem (2)

TheRealMindChild (743925) | more than 2 years ago | (#38241750)

"to the cloud!"

Re:Wrong problem (5, Funny)

bugs2squash (1132591) | more than 2 years ago | (#38241756)

If only they had some kind of small living cell it could be stored in...

Re:Wrong problem (0)

Anonymous Coward | more than 2 years ago | (#38242426)

Whoosh! me away but i thought the actual data capacity of dna was rather limited...

but then again you dont have to place everything in 1 place

#include <string>

class ameba {
public:
    std::string DNA = "evolushiuooon";
};

class lawyer : public ameba {};

Re:Wrong problem (4, Insightful)

jacoby (3149) | more than 2 years ago | (#38241804)

Yes and no. It isn't just storage. What we have comes off the sequencers as TIFFs first, and after the first analysis we toss the TIFFs to free up some big space. But that's just the first analysis, and we go to machines with kilo-cores and TBs of memory in multiple modes, and many of our tools are not yet written to be threaded.

Re:Wrong problem (1)

rubycodez (864176) | more than 2 years ago | (#38241952)

surely storage and transmission can't be an issue, the capacity and bandwidth of a mini-van full of 2TB disks from RAID sets should be sufficient

Re:Wrong problem (5, Informative)

TooMuchToDo (882796) | more than 2 years ago | (#38241914)

Genomes have *a lot* of redundant data across multiple genomes. It's not hard to do de-duplication and compression when you're storing multiple genomes in the same storage system.

Wikipedia seems to agree with me: []

The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.

Disclaimer: I have worked on genome data storage and analysis projects.
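
The 725 megabyte figure quoted above follows from 2 bits per base: 2.9 billion bases x 2 bits / 8 bits per byte. A minimal Python sketch of that packing follows. The code table (A=0, C=1, G=2, T=3) and the function names are assumptions made for illustration, not a standard file format, and real data would also need a way to represent ambiguous bases such as N.

```python
# Sketch: pack a DNA string into 2 bits per base (A, C, G, T only).
# The code table below is an arbitrary choice, not a standard format.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack bases four to a byte, first base in the highest bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpack() can read it uniformly.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed at 2 bits per base, rounded up."""
    return (n_bases * 2 + 7) // 8


# 2.9 billion bases at 2 bits each, in decimal megabytes.
genome_mb = packed_size_bytes(2_900_000_000) / 1_000_000
```

The sequence length has to be stored separately, because the last byte may be padded.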

Re:Wrong problem (1)

tgd (2822) | more than 2 years ago | (#38242020)

Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.

640K will always be enough!

Re:Wrong problem (5, Funny)

StikyPad (445176) | more than 2 years ago | (#38242146)

Warning: Monkeying with lossy compression for human genomic data may lead to monkeys.

Re:Wrong problem (3, Informative)

Anonymous Coward | more than 2 years ago | (#38242240)

It's not lossy compression.

You store the first human's genome exactly. Then you store the second as a bitmask of the first -- 1 if it matches, 0 if it doesn't. You'll have 99% 1's and 1% 0's. You then compress this.

Of course it's more complicated than this due to alignment issues, etc, but this need not be lossy compression
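
A rough Python sketch of the bitmask idea above, under the simplifying assumption that the two sequences are already aligned and differ only by substitutions (no insertions or deletions). One detail the description leaves implicit: the mask alone only says where the genomes differ, so this sketch also stores the differing bases. The function names are made up for illustration, and zlib stands in for whatever compressor one would actually pick.

```python
import zlib


def encode(reference: str, sample: str) -> tuple[bytes, str]:
    """Return (compressed match mask, bases at the mismatching positions)."""
    if len(reference) != len(sample):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    # One byte per position for simplicity: 1 = match, 0 = mismatch.
    mask = bytes(1 if r == s else 0 for r, s in zip(reference, sample))
    mismatches = "".join(s for r, s in zip(reference, sample) if r != s)
    return zlib.compress(mask), mismatches


def decode(reference: str, packed_mask: bytes, mismatches: str) -> str:
    """Rebuild the sample exactly from the reference and the stored diff."""
    mask = zlib.decompress(packed_mask)
    replacement = iter(mismatches)
    return "".join(r if m else next(replacement) for r, m in zip(reference, mask))


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
packed_mask, diff = encode(reference, sample)
restored = decode(reference, packed_mask, diff)  # identical to sample
```

Because decode() reproduces the sample exactly, nothing is lost; the saving comes from the mask being almost all 1s and therefore highly compressible.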

Re:Wrong problem (1)

Remus Shepherd (32833) | more than 2 years ago | (#38242256)

So compressed, you have 4 megabytes of data...per individual. 7 billion individual human beings means you potentially need 28 petabytes of storage...

That's just for human beings. If we look at the sequences of non-human species the storage needed expands exponentially. Why, even if we used efficient DNA storage to keep all this data, we'd need a whole planet just to house it.

Re:Wrong problem (5, Informative)

GAATTC (870216) | more than 2 years ago | (#38242050)

Nope - the bottleneck is largely analysis. While the volume of the data is sometimes annoying in terms of not being able to attach whole data files to emails (19GB for a single 100bp flow cell lane from a HiSeq2000) it is not an intellectually hard problem to solve and it really doesn't contribute significantly to the cost of doing these experiments (compared to people's salaries). The intellectually hard problem has nothing to do with data storage. As the article states "The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.". We just finished up generating and annotating a de novo transcriptome (sequences of all of the expressed genes in an organism without a reference genome). Sequencing took 5 days and cost ~$1600. Analysis is going on 4 months and has taken at least one man year at this point and there is still plenty of analysis to go.

Not True (0)

Anonymous Coward | more than 2 years ago | (#38242194)

No, we really have too much data by comparison to the number of people who can analyze that data. That's the problem, and yes, I do this for a living and at the University I am at we can't analyze it fast enough.

Now back to analyzing genomes...

Re:Wrong problem (1)

msauve (701917) | more than 2 years ago | (#38242310)

"they have insufficient affordable storage."

I've got an idea to solve that, which I'm going to patent.

You store the sequence as a chain of different types of molecules (I'll call them "base pairs") which can link together, that way the storage will take up really minimal space. You could even have a chemical process which replicated the original, to produce more of the original.

Re:Wrong problem (1)

Anonymous Coward | more than 2 years ago | (#38242320)

"The amount stored has more than tripled...taking up nearly 700 trillion bytes of computer memory"

Let's see, 1 trillion bytes is 1TB. Have they considered Newegg? Even with the Thailand flooding disaster, Western Digital Caviar Green WD30EZRX 3TB hard drives are going for $299.99, plus $7.86 shipping. They'd need (Dr. Evil pinkie to corner of the mouth maneuver) 234 of these hard drives. Without quantity or government discount this comes out to be a little over $72K. Somehow I can't see a federal program set up to act as a centralized database straining under this load. What did they think they were budgeting for? An accumulation of Slashdot comments?

P.S. C. Titus Brown, quoted in the article, used to be at CalTech and was part of the local Python group.
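
For anyone checking the drive arithmetic above, a quick Python sketch using only the numbers quoted in the comment. It assumes decimal terabytes and no redundancy, backups, enclosures or servers, so it is a floor on the price rather than a budget.

```python
import math

total_bytes = 700e12            # "nearly 700 trillion bytes" from the article
drive_bytes = 3e12              # one 3TB drive, decimal terabytes
price, shipping = 299.99, 7.86  # per-drive figures quoted in the comment

drives = math.ceil(total_bytes / drive_bytes)  # whole drives needed
cost = drives * (price + shipping)             # dollars, shipping included
```

That works out to 234 drives and roughly $72,037, which matches "a little over $72K" only once shipping is included.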

Nope (3, Insightful)

masternerdguy (2468142) | more than 2 years ago | (#38241640)

No such thing as too much data on a scientific topic.

Re:Nope (0)

Anonymous Coward | more than 2 years ago | (#38241760)

Um, yes there can if the data kept is less valuable than the data not able to be stored.

Re:Nope (2, Insightful)

blair1q (305137) | more than 2 years ago | (#38241782)

Sure there is.

They're collecting data they can't analyze yet.

But they don't have to collect it if they can't analyze it, because DNA isn't going away any time soon.

It's like trying to fill your swimming pool before you've dug it. I hope you have a sealed foundation, because you've got too much water. You might as well wait, because it's stupid to think you'll lose your water connection before the pool is done.

Same way they've got too much data. No reason for them to be filling up disk space now if they can just get the data again when they know what to do with it.

Re:Nope (0)

Anonymous Coward | more than 2 years ago | (#38242254)

Trying to fill up your pool before you dig really gets in the way; it makes it more difficult to actually do the digging. Having too much research data shouldn't have any effect on the analysts. They don't need to start looking at each new genome that comes in. It can just sit on the hard drives of the people doing the sequencing.

Bad... (3, Insightful)

Ixne (599904) | more than 2 years ago | (#38241656)

Throwing out data in order to be able to analyze other data, especially when it comes to genes and how they interact, sounds like one of the worst ideas I've heard.

Re:Bad... (-1)

blair1q (305137) | more than 2 years ago | (#38241810)

Uh, why?

Are genes going to stop interacting before you've figured out how to analyze them properly?


Figure out what you're going to do when you collect the data, then collect the data.

The data will still be collectible. Even if it's not worth anything on eBay...

Re:Bad... (3, Informative)

Samantha Wright (1324923) | more than 2 years ago | (#38242170)

Although that isn't quite what we're talking about here, reductionism in biology has been an ongoing problem for decades. Traditional biochemists often reduce the system they're examining to simple gene-pair interactions, or perhaps a few components at once, and focus only on the disorders that can be succinctly described by them. That's why very small-scale issues like haemophilia and sickle-cell anaemia were sorted out so early on. As diseases with larger and more complex origins become more important, research and money is being directed toward them. Cancer has been by far the most powerful driving force in the quest to understand biology from a broader viewpoint, primarily because it's integrally linked to a very important, complicated process (cell replication) that involves hundreds if not thousands of genes, miRNAs, and proteins.

Re:Bad... (0)

Anonymous Coward | more than 2 years ago | (#38242230)

Forgetting is essential.

What did you have for lunch on May 23, 1986?

High energy physics experiments call selectively looking at interesting things "triggering". The rate of raw data pouring out of a high energy experiment is nigh unstorable by mankind.

Re:Bad... (1)

JEBowers (2523296) | more than 2 years ago | (#38242456)

The problem with the raw data (and dna sequence) is a lot of it is wrong (errors). When confronted with a large data set with errors it is often best to reduce it to the portion that is more correct, than to treat all data as correct for later analysis. For some sorts of analysis such as genome assemblies this may be the only realistic way to proceed.

ASCII storage? (0)

Anonymous Coward | more than 2 years ago | (#38241668)

ACGT... 4 symbols only in this alphabet. I hope they're not storing it in ASCII form ;)
If so, better get this bzip2 or lzma compressor going.

Re:ASCII storage? (3, Informative)

Samantha Wright (1324923) | more than 2 years ago | (#38242250)

ASCII storage of nucleotide and protein information is actually very standard. The most widespread format is called FASTA, named after the fast alignment program that introduced it. When you sequence a whole genome on a second-generation sequencing platform (like Illumina or SOLiD), there's a step in the process where you end up with a huge (10-100 GB) text file containing little puzzle pieces of DNA that must then be assembled by a specialized program. These files usually don't hang around very long, but the point of keeping them in this inefficient storage format is, simply, performance: CPUs are oriented toward byte-based computing at a minimum, and so frequent compression/decompression becomes prohibitively inefficient.

Big biotechnology purchases are typically hundreds of thousands of dollars though, so most labs are used to shelling out for this kind of price bracket.
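
For readers who haven't seen FASTA: each record is a header line starting with ">" followed by one or more lines of sequence text. A minimal Python reader might look like the sketch below. The function name and the sample records are invented for illustration, and production work would more likely use an established library than hand-rolled parsing.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    parts: list[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


# Hypothetical two-record file, held in memory for the example.
example = io.StringIO(">read_1 hypothetical\nACGT\nTTGA\n>read_2\nGGCC\n")
records = list(read_fasta(example))
```

Reading line by line as a generator keeps memory use flat, which matters when the file is in the 10-100 GB range mentioned above.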

Work! (0)

Anonymous Coward | more than 2 years ago | (#38241680)

I see an opportunity for work, and jobs.

Re:Work! (3, Funny)

Anonymous Coward | more than 2 years ago | (#38242008)

I see an opportunity for work, and jobs.

Wozniak. He is called Wozniak. But opportunity will have to wait, because Jobs is dead. Sorry to break it to you like this.

Come on, every story has an Apple angle, if you look at it the right way.. in fact, I bet those researchers could store all that data on an iPod if they wanted! You can plug it right in and sync with iTunes!

Re:Work! (2)

Samantha Wright (1324923) | more than 2 years ago | (#38242304)

Bioinformatics is indeed a very lucrative profession, but few programmers have the willingness to memorize the huge canon of data while they're in college that is required to be proficient in it. The curriculum is about 70% computer science and 30% life sciences, including organic chemistry at some universities.

Time for the scientists to get to work (4, Insightful)

Hentes (2461350) | more than 2 years ago | (#38241692)

Most scientific topics are like this: there is too much raw data to analyze it all. But a good scientist can spot the patterns and can distinguish between important stuff and noise.

Re:Time for the scientists to get to work (5, Insightful)

BagOBones (574735) | more than 2 years ago | (#38241736)

Research team finds important role for junk DNA []

Except in the field of DNA they still don't know what is and is not important.

Re:Time for the scientists to get to work (1)

Hentes (2461350) | more than 2 years ago | (#38242028)

That's exactly what makes science interesting, when new better models show that some of the data previously dismissed as junk can also be predicted. But making a perfect model would require infinite resources, so sometimes tradeoffs have to be made.

Re:Time for the scientists to get to work (1)

Samantha Wright (1324923) | more than 2 years ago | (#38242390)

Transposons are interesting and complex, but they don't play much of a role in mammals. Intergenic DNA is still important in that it provides scaffolding (an active chromosome resembles a puff-ball with all of the important genes at the outside edges, where they're most accessible to incoming proteins) and flex room (sometimes proteins will actually bend DNA and pinch it to make sure the important genes stick out) but so far we believe that the actual sequence of most of the human genome isn't very important. 95% of it appears to be under no evolutionary pressure (that is, even if it mutates, the organism is fine.)

Re:Time for the scientists to get to work (1)

blair1q (305137) | more than 2 years ago | (#38241842)

A good scientist will design the experiment before collecting the data. If he spots patterns, it's because something interesting happened to another experiment. Then he'll design a new experiment to collect data on the interesting thing.

Seriously, this is a non-problem. Don't waste resources keeping and managing the data if you can make more. And I can't imagine how you can't make more data from DNA. The stuff is everywhere.

Re:Time for the scientists to get to work (1)

bberens (965711) | more than 2 years ago | (#38242074)

And I can't imagine how you can't make more data from DNA. The stuff is everywhere.

I work in a cheap motel you insensitive clod!

Re:Time for the scientists to get to work (4, Insightful)

sirlark (1676276) | more than 2 years ago | (#38242382)

A good scientist will design the experiment before collecting the data. If he spots patterns, it's because something interesting happened to another experiment. Then he'll design a new experiment to collect data on the interesting thing.

Flippant response: A good scientist doesn't delete his raw data...

More sober response: Except to do an experiment said scientist might need a sequence. And that sequence needs to be stored somewhere, often in a publicly accessible database as per funding stipulations. And that sequence has literally gigabytes more information than he needs for his experiment, because he's only looking at part of the sequence. Consider also that sequencing a small genome may take a few days in the lab, but annotating can take weeks or even months of human time. And the sequence is just the tip of the iceberg, it doesn't tell us anything because we need to know how the genome is expressed, and how the expressed genes are regulated, and how they are modified after transcription, and how they are modified after translation, and how the proteins that translation forms interact with other proteins and sometimes with the DNA itself. Life is messy, and singling out stuff for targeted experimentation in the biosciences is a lot more difficult than in physics, and even chemistry.

Seriously, this is a non-problem. Don't waste resources keeping and managing the data if you can make more. And I can't imagine how you can't make more data from DNA. The stuff is everywhere.

Sequencing may be getting cheaper, but it's not so cheap that scientists facing funding cuts can afford to throw away data simply to recreate it. Also, DNA isn't the only thing that's sequenced or used. Proteins are notoriously hard to purify and sequence, and RNA can also be difficult to get in sufficient quantities. The only reason DNA is plentiful is because it's so easy to copy using PCR [] , but those copies are not necessarily perfect.

Re:Time for the scientists to get to work (1)

Samantha Wright (1324923) | more than 2 years ago | (#38242444)

Fortunately, most biologists are unimaginative, and the medical establishment's coffers are bottomless, so really only four genomes ever actually get much mileage: human, rat, mouse, and chimpanzee. Perhaps a parasite or virus here and there. I weep for plant biologists.

Re:Time for the scientists to get to work (1)

Nyall (646782) | more than 2 years ago | (#38241882)

Time for scientists to get to work? What an elegantly simple solution.

The next time I have to debug something, maybe my first step should be identifying the problem [taken from dilbert..]

Seems (0)

Anonymous Coward | more than 2 years ago | (#38241722)

Seems like we need to stop sequencing genomes until we've figured out if there's anything useful we can do with all that data.

I don't see how "too much data" can be a problem. Just stop taking in the new data, concentrate on the data sets you already have and only get more when you find a gap in what you need.

Have they tried compression? (0)

Anonymous Coward | more than 2 years ago | (#38241764)

I mean, it's only the same four letters after all.

Re:Have they tried compression? (1)

Samantha Wright (1324923) | more than 2 years ago | (#38242484)

Can't analyse a compressed sequence. Gotta decompress it first. Disks are cheaper than time.

So, create a public DNA museum of sequences (0)

Anonymous Coward | more than 2 years ago | (#38241778)

So, create a public DNA museum of sequences.

I assume that some of this data will be useful, one day.

Re:So, create a public DNA museum of sequences (1)

bberens (965711) | more than 2 years ago | (#38242108)

So, create a public DNA museum of sequences.

They have those, they call it the "public school system."

Re:So, create a public DNA museum of sequences (2)

Samantha Wright (1324923) | more than 2 years ago | (#38242536)

Done: NCBI [] , DDBJ [] , and Ensembl [] all perform that role. The problem is what to do with all of it.

Well... (-1, Troll)

geek342 (2523280) | more than 2 years ago | (#38241792)

All say, its bullshit. Whole human genome can fit [] into 4.4GB disk.

Re:Well... (2)

Baloroth (2370816) | more than 2 years ago | (#38241858)

Oh hey look you made another account to goatse /. with. Good job.

Re:Well... (1)

GameboyRMH (1153867) | more than 2 years ago | (#38242134)

Still doesn't elude my filter. [] See, I was thinking ahead when I designed that. He'll either have to quit trolling or get off his lazy ass and put some effort into trolling us.

Re:Well... (-1, Troll)

geek345 (2523288) | more than 2 years ago | (#38242418)

Strange.... I got 3140 victims on blog till they took me down (I actually was glad that they did as the link became too 'known' here.
Now I have 2,773 victims on current link.... I guess its time to move...
See you,
          Your Trolly,
                  The bored geek.

PS: In total I have more that 16000 victims here.
Food is also great:

"Goatse! Damn, I have not been goatsed in YEARS :("
"Grow up, retard; the goatse shit is old - much older than you'll ever be."
"Yeah, thanks for the goatse at work..."
"You Asshat. You know, some of us read ./ at the office. Please don't post obscene links."
"OMG! What is that man trying to do? Catch a football?!"
"I read your comment a second too late. It took me years to get that image out of my head, and now it is back !!!"
"that AEONITY.COM link is NSFW dammit"
"Whoever modded this up should be banned..."
"Thanks for the Goatse."
"Congratulations, you just ruined my day!"

"What an ass. Warning: Unpleasant picture in the link. That's what I get for browsing at 1, I guess."
"I'm just curious what gratification you get from this... do you jerk off to your hit counter?"
"O neat, you quoted me! Now I have to ask, why do you do this? seriously whats the motivation?"
"1999 called they want their overused shock pictures back."
"Parent post is a goatsex picture. Do not follow. You're an asshole of the proportions in that picture."
"Link above is to goatse. Fuck you douchebag."
"Turn on TinyUrl previews. It saves lives."
"Ugh. Goatse. NSFW. Asshole (poster and picture, both)."
"Seriously ... new account to post that ... what a douche!"
"You're a fucking douchbag." - "That is the most accurate comment yet"
"Not gonna click it to find out, but I'd be surprised if parent's link wasn't goatse... It appears you would be correct sir. Why oh why do I always forget..."
"My word, what is wrong with your anus? I'd get that checked out."
"It's because of Assholes like you that I can no longer trust URL shorteners"
"Thanks, I'm reading slashdot in class like a good student and just got tubgirl'd."
"Watching second monitor, there was something wrong with the other screen. Control + w. Phew..."
"Hey family! Come look! They're opening the Google Talk client! Now, click here......" (sees goatse)
"I tried to post warnings about the goaste loving jerk yesterday but was modded into oblivion as a karma whore"
"Posting your picture online again?", "Really? Are you not tired of this yet?"
(Me posts goatse link and tells that it is SFW): "You mean NSFW asshole."
"Can you not afford normal entertainment?" "This is grown up talk, 4chan is that way ->"
"Oops. goatse link" - "The AC speaks truth! (Well I didn't let it finish loading, but the browser was connecting to"
"He likes his urinal cakes nice and sudsy, so he tries to piss us off."
"Link is Goatse" - "Thanks. Does nerd soccer attract nerd hooligans?"
"You must be really bored, eh? Take your shit somewhere else. We don't serve your kind around here."

"You are a homo. Stop it"
"Die Goatse motherfucker"
"Motherfucker. Some of us are at work and don't want to have a drilled out anus pop up on their fucking screen. Christ."
"BAN HIM!" "Ur a faggot for posting that."
"Death to all assholes - Let's put you first into the guillotine"
"You fucker" - "I had the same thought as you. What a fucking asshole. The link is nsfw."
"I hate your guts.", "WTF you fucking asshole.", "Fucking troll, do not click there"
"I hope you die in a fire before you are old enough to contaminate the gene pool."
"It would be more interesting if I had a piece of pipe and your face, in close proximity so I could smash your face beyond recognition,"
"Bravo teeny bopper. You're a really mature mother fucker (or do you prefer father fucking? Damn you homo erotic shittter)."
"Wait! I think I hear your mommy calling to give your tongue a good soap washing. And maybe she'll execute you too"
"I did not even bother to look, but this same idiot has been doing this for weeks now. Fuck off asshole."
"Asshole. literally. Goatse is so old. Grow up you fool."
"Asshole... Ginormous asshole, in fact." "Ugh. Goatse. You asshole."
"Better than you, you arse bandit." "You're a lowlife faggot piece of shit."
"Ah, a sheep troll. "Baaa! I post disgusting photos! Baaa!"
"I hate you"

"First time testemonies:
"As a warning to all, the link in the above post is disgusting and shocking. It pretty much ruined my day."
"Wow, all these years I managed to avoid seeing the guy, priding myself on my resilience to clicking on random image links from friends and trolls alike, taking comfort in the fact that I could identify a shock JPG based on a few lines of pixels while the holding the clipped window at the edge of my screen, and yet... now it's all for naught."

"After all these years, I finally fell for it. Just off to bleach my eyes.. thanks for that."
"Damnit, mod this guy up before GP gets any one else. My eyes, dear god my eyes, I'd managed not to see that until today!"
"WARNING: Don't click on the parent's link! Damn goatse! The first I experienced, no less.
"Parent is goatse. Dammit, and I've avoided it for a decade."
"ALERT ! goatse ya got me :("
"The fuck is a goatse? it's some dude pulling his arse open."

"To late... and the captcha word was "wisdom".."
"Well done. I haven't been suckered into a goatse link in years."
"Now *that* is how you goatse. Even got me, and I'm an oldfag."
"Long time since I've been rickrolled with goatse!"
"Goatse URL - Haven't seen that guy in a while"
"Damnit! nearly 15 years reading /. and I still fall in a trap !"
"Well played, sir. It's been a while since I've been Goatse'd"
"Congrats. It's been a long time since I saw goatse."
"Looks very open to me... (congrats, 'twas a while ago I was goatsed the last time)"

Strong emotion:
"Son of a..."
""No ads? Kwel!" click... "WTF!! My eyes!!!! rip.... them.... OFF!!"
"i WAS eating lunch you ass!"
"Oh dear god my eyes. Haven't seen THAT awful image in a while."
"My eyes are burning... argh! Damn you!"
"MY EYES... dude i am at work here "S "
"Oh goddammit. I didn't need that right before bed."
"Goatse warning! I'm still recovering."
"Please friends, I beg of you, do not click that link! Do not look at that image, whatever you do! It is a bad image! It is a goatse image."
"Man you made me barf .... disgusting little fellow the GOATSE Guy"
"Ok I did not need to see this. kindly please go die in a fire."

Dumbassess talking:
"Again with the goatse? We get it guy, you're edgy and cool because you're ten years late to a meme."
"Oh wow, retro-trolling. Soon we'll be back to page-widening, Steven King is dead and bell bottoms."
"Hey moron, try using different links."
"You fucking piece of shit!" , "You sorry piece of shit.", "You cunt.", "Fuck you." "Get fucked"
"What a retard..... enough said...." "Nice. Asshole."
"Yup, this is what your life amounted to. Posting goatse on Slashdot and collecting comment trophies."

"Fuck you, you fucking fuck"
"Enough of that you sick fuck"
"Can someone make a fucking goatse blocker firefox plugin please? This is pissing me off now."
"I am sick and tired of that crap on /. "
"Don't visit the link above, everyone. -sigh- Especially at work."
"Doh! One has to also recognize data urls. *sigh*"
"Damn! Mod this fucker to hell"
"*sigh* Goatse alert..."

"Wow, that looks like a pretty well-hidden goatse. I have still yet to fall for one of those on /. but I know that it's only a matter of time..."
"Who the hell still falls for this? I just assume any link in the comments is to goatse..."
"Anyone who clicks on these deserves it. Lazy fucker's using a URL that trolls have been using for at least a year now."
"Persistent little goatse fan aren't you?"
"Christ, I hate to think about what kind of disturbing, depraved images you've seen that would make you jaded enough not to be repulsed by goatse! Sicko."
"What the fuck is wrong with you? Did your mother drop you on your head, or are you just a fucking mental case by default? Grow up you infantile halfwit."
"When did you people stop being content with a simple rickroll? This is why Slashdot needs a "delete for spam" option."
"Goatse trolls are getting better these days..."
"Why the sudden coordinated campaign for Goatse? Is someone making money off this?"
"You're right, this is the most coordinated troll campaign in a long time. Multiple accounts, multiple pages."
"Urgh...dammit, am I the only one thinking the goatse trolls are getting worse lately than they have been in the past five years?"
"Who found a way to monetize goatse at this late date? If we got half the effort of that campaign on real stuff we'd all have better software by now."
"Boy Goatsex is out in force today... - Every topic is littered with them..."
"You can't actually expect the Slashdot users to actually know enough not to respond to a goatse troll, right ?"
"Can we start banning people who post that hiding it behind a url shortening link like"
"How many times are you going to spam this link? Like we don't know where that goes......"
"The GP's post is just to get you to click a link with the Goatse man's picture on it."
"One of these days, this asshole gonna have a hard drive crash and lose his precious list, consigning his life's work to oblivion. He'll probably kill himself."

"Cool goatse link bro"
"Giggles. That made my day. Thank you."
"You are one dedicated troll."
"Well played, sir. Well played."
"A link that redirects to a page containing goatse? How clever of you!"
"Thank you for that informational link"
"Interesting use of Data URLs for Goatse linking."
"Nice Goatse dude"
"Good one, Sir."
"Nice link to have!"
"But I found those that first link very informative however"
GOATSE - Yes, well done. Haven't fallen for one of those in years (though I did have some suspicions before clicking).
"Goatse is getting old but I still appreciate it. Thanks for making me smile at work."

Funny warnings:
"Don't even think of clicking.. goatse alert. Way to get me fired, bro"
"That links to goatse. Don't click! If you actually want to see what it looks like, check out the blog post."
"Asshole! DON'T CLICK!"
"Yeah don't click that. link... ever."
"DO NOT CLICK - goatse" - "Too late :/ "
"Why the fuck would you post something like that? Warning to all: scatology in its worst form. Do not look."
"Would advise against clicking the link in the troll post above. Especially if you're at work atm."
"Link Warning!!! Not for those at work or of a nervous disposition, or even those bored with the stupid Goatse image."
"You really don't want to click the fundamental link, you'll be scarred for life."
"Above Link NSFW... The picture was just wrong... wrong..."
"WARNING! Above link is not something anyone wants to see!"
"Parent should be modded down. Link is NSFW and mentally scarring."
"High likelyhood of being a Goatse link. Proceed with caution"
"Didn't click it, but the magic 8-ball says goatse."
"Danger, goatse" "Don't click the link! Goatse wannabe."
"Someone please mod this guy down... Don't click his link."
"I dunno if this is supposed to be goatse or what, but clicker beware on parent."
"Don't click link, Its a trap" - "Get an ax!"

"My habit of opening new pages in tabs that don't focus immediately has saved me this time. Thanks for the warning!"
"MODS please ban this guy"
"What the hell is up with this guy? can somebody gag him or something please"
"Just post the damn url, i'm not going to click on a tinyurl link and get goatse'd or something.."
"That's somewhat clever, but some of us do know what base-64 encoding is."
"Could not someone at slashdot write a small script to blacklist url's that have been flagged troll? I'll do it if you pay me a slave wage..."
"Mod to -1, please. this guy is an 'asshole'.... (yes, you guessed it)"
"That link is goatse-esque. Yuck."

Re:Well... (1)

GameboyRMH (1153867) | more than 2 years ago | (#38242528)

3140? Haha that's pretty lame. I got more than that on a pastebin URL I had in my sig for a couple of weeks. Still I can see why you keep going, considering some of the lulzy responses, although there's a good amount of pity mixed in there these days.

Seriously though, you're a one-trick pony, the Microsoft of trolling. Try to mix it up. You're boring the grizzled veterans on here.

Re:Well... (0)

tom17 (659054) | more than 2 years ago | (#38241862)


Damn, I have not been goatsed in YEARS :(

Re:Well... (0)

Anonymous Coward | more than 2 years ago | (#38241926)

goatse alert!

They should learn (4, Insightful)

hbar squared (1324203) | more than 2 years ago | (#38241806)

...from CERN. Sure, the Grid [] was massively expensive, but I doubt genome researchers are generating 27 TB of data per day.

Re:They should learn (0)

Anonymous Coward | more than 2 years ago | (#38242164)

Pretty close... the article claims a single center (BGI in China) outputs the equivalent of 2,000 human genomes per day. That's over 6 trillion A,C,G,T's per day, which is the processed result. If you include the signal generated by the sequencing instruments before converting to nucleotides it's much, much more data. Now, multiply that by the 5 or so centers like BGI around the world, and add all the smaller independent labs, and I bet it's way more data than CERN produces per day. However, biology has the advantage that it's decentralized and they all are not working on the same problem and data at the same time.
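
A quick back-of-the-envelope check of the figure above, as a sketch. It assumes roughly 3 billion bases per human genome and, for the storage estimate, an idealized 2 bits per base with no quality scores or raw instrument signal. The constant and function names are made up for illustration.

```python
# Assumed inputs: ~3 billion bases per genome, 2,000 genomes per day (BGI figure).
BASES_PER_GENOME = 3_000_000_000
GENOMES_PER_DAY = 2_000


def bases_per_day(genomes=GENOMES_PER_DAY, bases=BASES_PER_GENOME):
    """Total nucleotides called per day."""
    return genomes * bases


def packed_terabytes_per_day(genomes=GENOMES_PER_DAY, bases=BASES_PER_GENOME,
                             bits_per_base=2):
    """Idealized storage per day in decimal terabytes (1 TB = 10**12 bytes)."""
    total_bits = genomes * bases * bits_per_base
    return total_bits / 8 / 10**12


if __name__ == "__main__":
    print(bases_per_day())             # 6 trillion bases
    print(packed_terabytes_per_day())  # processed calls only, not raw signal
```

Under those assumptions the processed output alone is on the order of a terabyte or two per day per center; the raw signal mentioned above would be on top of that.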

Is it .. (3, Interesting)

ackthpt (218170) | more than 2 years ago | (#38241808)

Is it outpacing their ability to file patents on genome sequences?

as a genome researcher (5, Informative)

ecorona (953223) | more than 2 years ago | (#38241826)

As a genome researcher, I'd like to point out that I, for one, do not have nearly enough genome data. I simply need about 512GB of RAM on a computer with a hard drive that is about 100x faster than my current SSD, and processing power about 1000x cheaper. Right now, I bite the bullet and carefully construct data structures and implement all sorts of tricks to make the most out of the RAM I do have, minimize how much I have to use a hard drive, and extract every bit of performance available out of my 8 core machine. I wait around and eventually get things done, but my research would go way faster and be more sophisticated if I didn't have these hardware limitations.

Re:as a genome researcher (-1, Troll)

geek345 (2523288) | more than 2 years ago | (#38241878)

I understand you. I did one small research in biology too (also genome related).
I found this statistic [] tool to be invaluable.
It saved me so much time!

Re:as a genome researcher (1, Offtopic)

martas (1439879) | more than 2 years ago | (#38242118)

Goatse alert.

Re:as a genome researcher (1)

GameboyRMH (1153867) | more than 2 years ago | (#38242190)

Oh god someone modded this informative. IT'S GOATSE, GENIUSES.

Re:as a genome researcher (4, Insightful)

Overzeetop (214511) | more than 2 years ago | (#38242078)

It will come, but it doesn't make the wait less frustrating. I'm an aerospace engineer, and I remember building and preparing structural finite element models by hand on virtual "cards" (I'm not old enough to have used actual cards), and trying to plan my day around getting 2-3 alternate models complete so that I could run the simulations overnight. In the span of 5 years, I was building the models graphically on a PC, and runs were taking less than 30 minutes. Now, I can do models of foolish complexity and I fret when a run takes more than a minute, wondering if the computer has hung on a matrix inversion that isn't converging.

You should, in some ways, feel lucky you weren't trying to do this twenty years ago. I understand your frustration, though.

Just think - in twenty years, you'll be able to tell stories about hand coding optimizations and efficiencies to accommodate the computing power, as you describe to your intern why she's getting absolute garbage results from what looks like a very complete model of her project.

Re:as a genome researcher (0)

Anonymous Coward | more than 2 years ago | (#38242344)

I find it hilarious that you use 'she' instead of 'he' in reference to your intern.

Funny how even in a field of aerospace engineering, interns are assumed female. :D

Isn't it compressable? (2)

BlueCoder (223005) | more than 2 years ago | (#38241828)

I would figure most genomes are highly compressible. Especially if compressed against thousands of samples of a species and even across different species.

I have half my mother's genome and half my father's. I couldn't have that many mutations. Storing all three genomes couldn't take more than 2.0001 times the size of a human genome.
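
A toy sketch of that idea: store one genome in full and keep only the positions where another one differs. It assumes both sequences are already aligned and the same length (substitutions only, no insertions or deletions), which real genomes are not. `delta_encode` and `delta_decode` are hypothetical names.

```python
def delta_encode(reference, sample):
    """Return (position, base) pairs where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length, aligned sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]


def delta_decode(reference, delta):
    """Rebuild the sample from the reference plus its recorded differences."""
    bases = list(reference)
    for position, base in delta:
        bases[position] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    sample = "ACGTTCGTAA"
    delta = delta_encode(reference, sample)
    print(delta)  # only the differing positions are stored
    print(delta_decode(reference, delta) == sample)
```

The fewer differences there are between the two sequences, the smaller the delta, which is the whole point of compressing against a known relative.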

Re:Isn't it compressable? (2)

Derekloffin (741455) | more than 2 years ago | (#38242024)

That is what I was thinking. Maybe they just need a more customized compression algorithm. The problem there, I suppose, is that figuring out matches can be an expensive operation in itself.

Re:Isn't it compressable? (0)

Anonymous Coward | more than 2 years ago | (#38242086)

Exactly. But watch out for the alien species called "the butterfly" -- 380 chromosomes for that super species (compared to 46 for us humans). "the butterfly" probably flew to earth with their "fern" plants, which have upwards of 1,260 chromosomes.

Re:Isn't it compressable? (1)

oodaloop (1229816) | more than 2 years ago | (#38242088)

I would figure most genomes are highly compressible

I know right? I can fit all of my DNA inside of a single cell! When will these people learn?

Re:Isn't it compressable? (0)

Anonymous Coward | more than 2 years ago | (#38242306)

Multiple human genomes can be stored very efficiently, but there is a huge production of data that comes from sources that don't have a well studied reference. Take the field of "metagenomics" for example where they sequence the DNA from a patch of dirt or a sample of the ocean. The vast majority of organisms in these samples have never been seen or characterized before so you can't do a reference based compression. Compression of these types of sequences using something like bzip only gives a 3-5x reduction.
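
To get a feel for what a general-purpose compressor does with sequence text that has no reference, here is a small sketch using Python's standard `bz2` module on a random string of A/C/G/T. Random bases are an assumption standing in for uncharacterized sequence; real reads have more structure and also carry quality scores, so actual ratios will differ.

```python
import bz2
import random


def random_bases(n, seed=0):
    """Generate n pseudo-random bases (a stand-in for unknown sequence)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))


def bz2_ratio(sequence):
    """Uncompressed size divided by bz2-compressed size for ASCII sequence text."""
    raw = sequence.encode("ascii")
    return len(raw) / len(bz2.compress(raw))


if __name__ == "__main__":
    print(f"{bz2_ratio(random_bases(200_000)):.2f}x")
```

For uniformly random input, each base carries 2 bits of information but is stored as an 8-bit character, so about 4x is the ceiling no matter which compressor is used.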

Time to outsource these efforts (1)

bogaboga (793279) | more than 2 years ago | (#38241840)

I think these researchers should look at outsourcing these efforts, and China now has bragging rights to the fastest computer [] .

After all most of our electronics are all imported. It's sad, but what do you do when "...the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data..." as the intro to this submission says?

Re:Time to outsource these efforts (1)

Punchcardz (598335) | more than 2 years ago | (#38242278)

FTFA: "BGI, based in China, is the world’s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day."

Steve Yegge is on the way. (1)

doom (14564) | more than 2 years ago | (#38241864)

But stand back! Steve Yegge is on the way to show them how to get things to scale: []

Where does it all come from? (3, Funny)

WaffleMonster (969671) | more than 2 years ago | (#38241910)

I was under the impression the complete DNA sequence for a human can be stored on an ordinary CD.

Given the amount of data mentioned in TFA, it begs the question: what the hell are they sequencing? The genome of everyone on the planet?

Re:Where does it all come from? (1)

godrik (1287354) | more than 2 years ago | (#38242182)

no, the genome of every bacteria in a soil sample. We (I work as a computer scientist in a genome related research lab) do not work only on human genomes.

Re:Where does it all come from? (1)

Daniel_Staal (609844) | more than 2 years ago | (#38242238)


Every living being on the planet. And as many of the dead ones as they can get their hands on.

Re:Where does it all come from? (2)

Punchcardz (598335) | more than 2 years ago | (#38242252)

This is true, but doesn't really capture the types of experiments that are being done in many cases. Yes, your genome can be stored on a CD. However, next gen sequencing is usually done with a high degree of overlapping coverage, to catch any mistakes in the sequencing, which is still basically a biochemical process despite getting large text files as the end result. So any genome is sequenced multiple times: say 8x coverage is fairly standard. That is if you are interested in sequencing a single genome. If you are interested in sequencing all the mRNAs that tell you which genes are active in which tissue and cell type, expect that you need to do a similar amount of sequencing for each tissue and cell type in the human body. Now imagine doing that with different experimental conditions: disease states, environmental factors etc. Of course, on top of that, you will need replicates of each experimental condition in order to have statistical power to say anything meaningful. On top of that there is the sequencing that you can do to identify differences in the epigenome: how the DNA is marked with things like methyl-groups, how it is wrapped around histones, all of which we are finding has a huge functional difference. Having a genome sequence is a lot like having the total word list of the English language. It is huge and powerful, but there is a lot more information you need before you can write Shakespeare.
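
To see how the multipliers in that description stack up, here is a rough sketch. Only the 8x coverage and the 3 billion base genome size come from this thread; the tissue, condition, and replicate counts in the example are made-up placeholders, and treating every experiment as genome-sized is a simplification.

```python
def total_bases_sequenced(genome_bases, coverage, tissues=1, conditions=1,
                          replicates=1):
    """Bases read when every tissue/condition/replicate gets the same coverage."""
    return genome_bases * coverage * tissues * conditions * replicates


if __name__ == "__main__":
    # One genome at 8x coverage.
    print(total_bases_sequenced(3_000_000_000, 8))
    # Hypothetical study: 20 tissues, 5 conditions, 3 replicates each.
    print(total_bases_sequenced(3_000_000_000, 8,
                                tissues=20, conditions=5, replicates=3))
```

Even with modest placeholder counts the total runs hundreds of times larger than the single finished genome that fits on a CD.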

I have an idea (0)

Anonymous Coward | more than 2 years ago | (#38241924)

Let's store all that data in tightly coiled long-chain molecules made of Carbon, Nitrogen, Oxygen and Phosphorous (with a dash of Hydrogen.) It'll be really compact and much cheaper than hard disks.

Cheap Storage (0)

Anonymous Coward | more than 2 years ago | (#38241932)

You can build a cheap array of 3TB SATA drives for about $0.10/GB (as of a couple of months back). How much data do they have?

I have too much books to read! (0)

Anonymous Coward | more than 2 years ago | (#38241954)

Having lots of data is a good thing! It eliminates the time researchers spend waiting for more data. Limiting your view to smaller scopes of data is a lot more powerful/flexible than simply relying on what little data you get as it comes in. This simply means we need to develop new research methods to deal with such large amounts of data, or simply that we need more researchers. Other fields of science have also encountered this issue and found ways to deal with it. NASA, for example, is also somewhat in this position thanks to all the space telescopes. Yet they don't complain and instead look for ways to increase the amount of data as well as new methods to deal with the data.

Dealing with storage of large sets of data, while a large task, isn't impossible. It's just a matter of money. It may mean that the data sets need to be centralized with resources pooled so that costs are kept at a minimum. Well, it's hard to say anything about this aspect with so little info.

Transmission of data is one issue they can't deal with beyond a certain point unless they pay to put down more fiber directly between them and the places they want it to go (generally impractical due to extreme costs). Technically, since latency doesn't seem to be an issue, they can always just mail hard drives to the places they want the data. It's more work and adds shipping costs, but it's a pretty small price to allow large amounts of data to be sent quickly.

In the end, it really all just comes down to money, and that's pretty much normal and a good thing. It means that things are going as fast as possible, with money rather than technical issues being what causes the wait.

Your genome will fit conveniently on a CD. (2)

bhspencer (2523290) | more than 2 years ago | (#38242030)

From the article "three billion bases of DNA in a set of human chromosomes". A base may hold 1 of 4 values A, C, G and T. So each base can be represented with 2 bits. 2 bits * 3 billion = 750MB.
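
A minimal sketch of that 2-bits-per-base encoding, packing four bases into each byte. The A/C/G/T to 0/1/2/3 mapping and the function names are arbitrary choices for illustration, and the sketch ignores ambiguous bases such as N, which real data does contain.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def packed_size(n_bases):
    """Bytes needed to hold n_bases at 2 bits each (rounded up)."""
    return (n_bases + 3) // 4


def pack(sequence):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray(packed_size(len(sequence)))
    for i, base in enumerate(sequence):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n_bases):
    """Recover the first n_bases bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(n_bases))


if __name__ == "__main__":
    seq = "GATTACA"
    packed = pack(seq)
    print(len(packed), unpack(packed, len(seq)))
    print(packed_size(3_000_000_000))  # bytes for 3 billion bases
```

The length has to be stored alongside the packed bytes, since the last byte may be only partly used.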

Re:Your genome will fit conveniently on a CD. (1)

bhspencer (2523290) | more than 2 years ago | (#38242136)

From the article "There will probably be 30,000 human genomes sequenced by the end of this year". 750MB*30,000 = 2.813 TB (terabytes). So we can store all sequenced genomes so far on a single 3TB HDD. You can pick one up from newegg for less than 250 USD.

Re:Your genome will fit conveniently on a CD. (1)

bhspencer (2523290) | more than 2 years ago | (#38242216)

Correction that total is 22.5 TB.
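
The corrected total checks out. A short sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes) and the 750MB-per-genome figure from above:

```python
BYTES_PER_GENOME = 750_000_000  # 3 billion bases at 2 bits each
GENOMES = 30_000                # figure quoted from the article


def total_terabytes(genomes=GENOMES, bytes_per_genome=BYTES_PER_GENOME):
    """Total storage in decimal terabytes."""
    return genomes * bytes_per_genome / 10**12


if __name__ == "__main__":
    print(total_terabytes())
```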

Re:Your genome will fit conveniently on a CD. (0)

Anonymous Coward | more than 2 years ago | (#38242364)

Still not really that much data...

Re:Your genome will fit conveniently on a CD. (1)

bhspencer (2523290) | more than 2 years ago | (#38242548)

That is the point I was trying to make.

Re:Your genome will fit conveniently on a CD. (2)

Daniel_Staal (609844) | more than 2 years ago | (#38242292)

That's human genomes.

They are also sequencing plants, and (other) animals, and fungus, and bacteria, and viruses, and...

diff (2)

dabridgham (814799) | more than 2 years ago | (#38242058)

Someone needs to introduce these researchers to the 'diff' program.

Welcome to the club (1)

pingbat (1648191) | more than 2 years ago | (#38242096)

Well, hard times for these guys. They have tons of data with next to no noise, errors or uncertainty. I can name 20 people I know personally that would love datasets like that for their research. Am I the only one seeing it this way? Shame you didn't buy the hard drives 4 months ago though. Tough break.

Don't we all? (1)

angiasaa (758006) | more than 2 years ago | (#38242098)

I thought we (humans) all had roughly (if not exactly) the same amount of data.. This title reeks of intent to mislead! :)

Drops in NGS Costs Outpacing Storage Costs (4, Informative)

Anonymous Coward | more than 2 years ago | (#38242102)

The big problem is that the dramatic decreases in sequencing costs driven by next-gen sequencing (in particular the Illumina HiSeq 2000, which produces in excess of 2TB of raw data per run) have outpaced the decreases in storage costs. We're getting to the point where storing the data is going to be more expensive than sequencing it. I'm a grad student working in a lab with 2 of the HiSeqs (thank you HHMI!) and our 300TB HP Extreme Storage array (not exactly "extreme" in our eyes) is barely keeping up (on top of the problems we're having with datacenter space, power, and cooling).

I'll reference an earlier /. post about this:

There are some solutions to the storage problems such as Goby, but those require additional compute time, and we're already stressing our compute cluster as is. Solutions like "the cloud(!)" don't help much when you have 10TB of data to transfer just to start the analysis - the connectivity just isn't there.
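
To put the 10TB transfer in perspective, here is a rough sketch of how long it takes at a few link speeds. The speeds are illustrative assumptions, not anything measured at this lab, and the calculation assumes the link is fully saturated with no protocol overhead, so real transfers would be slower.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Hours to move size_bytes over a fully utilized link."""
    return size_bytes * 8 / link_bits_per_second / 3600


if __name__ == "__main__":
    ten_tb = 10 * 10**12  # decimal terabytes
    for label, bps in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
        print(f"{label}: {transfer_hours(ten_tb, bps):.1f} hours")
```

At the slower end the upload takes days, which is why shipping drives keeps coming up as an alternative.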

"GNome" researchers have too much data (1)

MikeTheGreat (34142) | more than 2 years ago | (#38242142)

Hehe - I mis-read this as "GNome researchers" have too much data.

Probably along the lines of several thousand comments to the effect of "I can't stand GNOME 3", "I liked GNOME 2 better", etc, etc :)

Here's an idea.... (1)

Nidi62 (1525137) | more than 2 years ago | (#38242148)

I'm sure all the insurance companies would love to buy up all that data...

Go to the cloud (1)

sowalsky (142308) | more than 2 years ago | (#38242208)

For individual research units, the cost of maintaining the processing power and storage space for these types of projects can be prohibitive. Cloud-based options offer distributed computing power and low-cost storage that is often a more economical solution than paying for the equipment in house, especially when genomic projects can come in spurts rather than a continuous stream.

Disclaimer: I work with large amounts of genomic data and use both in-house and cloud-based analysis tools.

Easy... (1)

jellomizer (103300) | more than 2 years ago | (#38242212)

'We are going to have to come up with really clever ways to throw away data so we can see new stuff.'

Have all your data open on a Windows share and an FTP server. Have them available on the full internet. Make some honest mistakes in setting up permissions. Copy and paste the "wrong link" into a hackers/gaming website. Wait a while.... All your data has been replaced with illegal information, which makes it easy to clean out. Problem solved.

AI (2)

slyrat (1143997) | more than 2 years ago | (#38242346)

This seems like just the kind of problem where AI will help by narrowing the field of 'interesting' things to look at. Either that, or better ways to search through the available data, along with better ways to store it, will probably work.

Reminded of a Parallel Computing Problem (2)

wbtittle (456702) | more than 2 years ago | (#38242374)

Way back in 1993, I visited an atomic laboratory in Pennsylvania. On the tour, they showed us the 30,000 core computing machine they had purchased several years before. "We still can't program it".

30 seconds later he pointed to the next piece of metal.

This is our 120,000 core computer.

I raised my hand "Why did you buy a 120,000 core machine when you can't even program the 30,000 core machine!"

"Well it's faster."

One of my early lessons in big companies attacking the wrong problem.

Is it a searching problem? (2)

camh (32881) | more than 2 years ago | (#38242384)

A couple of researchers in Sydney think they've got a model for searching the genome much more efficiently. They're trying to fund their research and development with crowdsourcing: [] : "The PASTE project [is] based on a new number system we call Permutahedral Indexing - P.I. for short, an N-dimensional map that efficiently locates and interrelates complex datasets in the space of all possible data. P.I. does this efficiently even when the data has hundreds of independent dimensions and comes in petabytes and exabytes."
They don't seem to need much money in the scheme of things - I might just throw in $25.

Still waiting. (0)

Anonymous Coward | more than 2 years ago | (#38242436)

Got the cure for cancer yet?

RAW Data... (0)

Anonymous Coward | more than 2 years ago | (#38242472)

I think there is a general misconception here that the amount of data is too much to store. This is not the case; genomes are not that big. The real problem is that sequences can be obtained at a much, much higher rate and with extreme ease (guaranteed good results) these days than a biochemist/molecular biologist/etc. can characterize said genes on a molecular level and assign proper function.

Just having the sequence for a new species is not enough. There is a certain amount of knowledge that can be transferred to the data of this new organism, since a lot of proteins/genes have already been characterized in other organisms. But there is also plenty of new or different stuff and doing real research on those is the hard part.

Speaking from personal experience I've entered a field where people have been working with a certain class of genes/proteins for *ages*, yet the way they work and do their job is still a complete mystery. The classic approach of just knocking out a gene, calling the resulting mutant some silly name (developmental biologist anyone?) simply is not sufficient anymore.

No &@$^% (0)

Anonymous Coward | more than 2 years ago | (#38242504)

I just had to reprimand a genome research student who was using up multiple TB of space (I'm not allowed to put them under quota) with stupid uncompressed xml files.