Ask Slashdot: How to set up a big data/data science project portfolio?

jda104 Start Small (1 comments)

I'm in Computational Biology, and I'd say that the most valuable skills you should learn (and the ones most often seen in this field) are more mathematical and/or statistical than "big data." Understanding how to properly normalize your data or calculate a p-value will take you much further than being able to spin up a 100-node Hadoop instance in most labs.

I think you should spend the first year on your home PC. Download RStudio and work through a few R Tutorials, then find some data/questions that interest you and poke around. Post your results to a blog so that you'll have something to show for the time you spend and release the code on GitHub so that it's open to future employers.

I'd say get comfortable with a data analysis language (R will probably serve you best currently), and a data manipulation language (Python, Perl, etc.) and start asking questions of data that's around you (your email archives, a log of Internet sites you visit, your spending records, etc.). Once you've found that well-designed algorithms can't handle some dataset you're looking at, then look at Hadoop and other "big-data" projects.

When you're ready, I'd steer you towards Next-Generation Sequencing data. Most of the bioinformatics questions being asked (and funded) now have at least some interaction with NGS, and analysts capable of working with that data are highly valuable. Check out the 1000 Genomes Project when you're ready to start playing with free sequencing data.

more than 2 years ago

Ask Slashdot: Modern Development Training

jda104 On-The-Job Training (1 comments)

I think most people pick these skills up at work -- seeing these practices employed by others. I guess the short answer would be: go work for a company which is already employing people who understand these topics and learn from them for a couple of years.

Of course, that doesn't help you if you're not looking to change careers...

more than 2 years ago

Graphs Show Costs of DNA Sequencing Falling Fast

jda104 Re:Moore's law is too slow (126 comments)

Interesting. I view this from a completely different perspective: if DNA sequencing really is outpacing Moore's Law, that just means that the results become disposable. You use them for your initial analysis and store whatever summarized results you want from this sequence, then delete the original data.

If you need the raw data again, you can just resequence the sample.

The only problem with this approach, of course, is that samples are consumable; eventually there wouldn't be any more material left to sequence. So this wouldn't be appropriate in every situation.

more than 3 years ago

Graphs Show Costs of DNA Sequencing Falling Fast

jda104 Re:Moore's law is too slow (126 comments)

I assume you're talking about incoming data, not the final DNA sequence. As I understand it the final result is 2 bits/base pair and about 3 billion base pairs so about a CD's worth of data per human. And if you were talking about a genetic database I guess 99%+ is common so you could just store a "reference human" and diffs against that. So at 750 MB for the first person and 7.5 MB for each additional person I guess you could store 2-300.000 full genetic profiles on a 2 TB disk. Probably the whole human race in less than 100 TB.

The incoming data is image-based, so yes, it will be huge. Regarding the sequence data: yes; in its most condensed format it could be stored in 750MB. There are a couple of issues that you're overlooking, however:
1. The reads aren't uniform quality -- and methods of analysis that don't consider the quality score of a read are quickly being viewed as antiquated. So each two-bit "call" also has a few more bits representing the confidence in that call.
2. This technology is based on redundant reads. In order to get to an acceptable level of quality, you want at least ~20 (+/- 10) reads at each exonic locus.
So that 750MB you mention for a human genome grows by a factor of 20, then by another factor of 2 or 3, depending on how you store the quality scores.
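
As a rough check of the arithmetic above, here is a minimal back-of-the-envelope sketch in Python. The figures come from this thread (3 billion base pairs, 2 bits per base, ~20 reads per locus, a 2x to 3x overhead for quality scores); the function name `genome_storage_mb` and the 1 MB = 10^6 bytes convention are assumptions made for illustration, not anything a sequencing pipeline actually uses.

```python
# Back-of-the-envelope storage estimate for one sequenced human genome.
# All constants are the rough figures quoted in the discussion, not measurements.
BASE_PAIRS = 3_000_000_000   # ~3 billion base pairs
BITS_PER_BASE = 2            # A, C, G, T fit in 2 bits
COVERAGE = 20                # ~20 redundant reads per locus (midpoint of the range)


def genome_storage_mb(quality_factor: float, coverage: int = COVERAGE) -> float:
    """Estimated storage in MB (assuming 1 MB = 10**6 bytes).

    quality_factor is the hypothetical multiplier for storing per-call
    quality scores alongside the 2-bit base calls (2 to 3 in the text).
    """
    condensed_bytes = BASE_PAIRS * BITS_PER_BASE / 8
    return condensed_bytes * coverage * quality_factor / 1e6


if __name__ == "__main__":
    # Most condensed form: one copy, no quality scores.
    print("condensed:", genome_storage_mb(1, coverage=1), "MB")
    # With redundant reads and quality scores.
    for factor in (2, 3):
        print(f"quality x{factor}:", genome_storage_mb(factor) / 1000, "GB")
```

Under these assumptions the 750 MB figure becomes roughly 30 to 45 GB per genome, before any of the image data is counted.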

Your suggestion of deduplicating the experiments could work, but definitely not as well as you think because of all the "noise" that's inherent in the above two steps.

If you really just wanted the unique portions of a sample, you could use a SNP array, which reads the sample only at specific locations that are known to differ between individuals. Even with the advances in the technology, the cost of sequencing a genome still isn't negligible. For most labs, it's still cheaper to store the original data for reanalysis later.

more than 3 years ago

Should Wikipedia Just Accept Ads Already?

jda104 Working on Commission? (608 comments)

I often stumble across some product on Wikipedia that I'm interested in buying (album, book, etc.). I actually would find it very convenient if such pages had a "Purchase this Item" link. I'm sure Amazon would kick in a few million for that privilege, or you could use their pre-existing referral program. I think most users would view those links as added value to Wikipedia.

more than 3 years ago

College Student Finds GPS On Car, FBI Retrieves It

jda104 Re:Disappointing for Conspiracy Theorists... (851 comments)

And in light of this, why are we assuming GPS? I can't find GPS satellites through the metal in my car roof, let alone through my entire car. Is it more likely that they're just tracking the cellular connections?

more than 3 years ago

College Student Finds GPS On Car, FBI Retrieves It

jda104 Disappointing for Conspiracy Theorists... (851 comments)

This thing looks like "futuristic" technology from a 1980s movie: picture.

And the FCC ID is the same as the one in a mobile credit card terminal...

I guess it's comforting to see that, in this instance, the government isn't decades ahead of the rest of us...

more than 3 years ago

File concurrency solution for non-programmers?

jda104 Re:I think there is... (2 comments)

I think there is a software for that. But it may not be out there iin public. Maybe developers are just keeping them private. Nevertheless, there's a high probability that the software you will need already exists. what is more important is your ETA. So you may find other means. Laptop Troubleshoot Tips

Of course! I'll just fix their laptops! Why hadn't I thought of that?

more than 3 years ago

Could Anti-Texting Laws Make Roads More Dangerous?

jda104 Re:Accelerometers in phones? (709 comments)

Sense vehicular motion (including vibration) and shut down the texting function while in motion.

Passengers in cars (and boats, and trains) may object to that one...

more than 3 years ago

BP Buys "Oil Spill" Search Term

jda104 Re:have they bought "Beyond Pitiful" yet? (439 comments)

How exactly can the PR and marketing department assist a mile underwater?

use their bodies to plug up the well?

Honestly it's the best use for marketing and PR people....

Only on /. would this be rated as "Insightful" instead of "Funny"...

more than 4 years ago

9/11 Made Us Safer, Says Bruce Schneier

jda104 Re:NoSQL? Waittaminute (280 comments)

A big-ass Oracle or IBM-DB2 can do the job if you pay enough for tuning.

Why is it that, ever since Key-Value DBs came into vogue, relational databases instantly got perceived as so neanderthal?

A normal-ass Oracle database would surely be just fine for storing a no-fly list which, by necessity, has orders of magnitude fewer than 6.whatever billion names; I'm guessing it would do so without much tuning, too.

more than 4 years ago

What Happens When IPv4 Address Space Is Gone

jda104 IP-based lawsuits (520 comments)

I wonder what it would mean to the RIAA (or any IP-based litigation) to have multiple ISP customers consistently NAT'ted to the same IP.

... Maybe this won't be so bad after all!

more than 4 years ago

NYTimes Visits Menlo Park's TechShop

jda104 Re:New San Jose location? (36 comments)

I bet you could help those people quite a bit by selling the computer you're using and donating the proceeds.

We might appreciate it, too...

more than 4 years ago

New Litigation Targets 20,000 BitTorrent-Using Downloaders

jda104 Re:"Sue fucking everyone" (949 comments)

I'm not familiar with the intricacies of the Torrent protocols, but it seems like this group would need to do one of two things:
A.) Connect to a swarm as a "spectator" not uploading or downloading any data.
B.) Connect to the swarm and actively upload/download.

If A., it seems like it would be hard to prove that any IP logged as participating in the swarm is actively engaged in any malicious behavior. If B., aren't they (the group) guilty of the same crimes of which they're accusing these other people?

I guess I just don't see how they could assure the courts of a crime being committed without having to participate in the exact same action in order to prove it.

more than 4 years ago

Company Invents Electronic Underpants

jda104 Risk? (110 comments)

I guess most people aren't worried about incontinence AND sperm production, but I would not want a wireless transmitter hosted that close to my reproductive factories...

more than 4 years ago

First Collisions At the LHC

jda104 Re:Surprised (256 comments)

Agreed. I thought the black hole would envelop Earth much more quickly than this.

more than 4 years ago

Why Some Devs Can't Wait For NoSQL To Die

jda104 Re:No all databases are for business (444 comments)

I think you're committing the same sin of which you're accusing the author, just on the opposite side of the pendulum.

Saying that all RDBMSs won't "cut it" for modern applications in any domain is pretty narrow-minded. It seems like a simple rule of thumb to me: put relational data in an RDBMS, put key-value data in a Key-Value DB...

As an aside, having worked on the Information Retrieval side of bioinformatics for the past few years, I've found that the complex side of bioinformatics is generally in the computation, not the retrieval. I've been well served by a single RDBMS server up to this point, though I have played around with MemCached for a couple of web apps.

more than 4 years ago

Will Your Answers To the Census Stay Private?

jda104 Re:You know what's really sad? (902 comments)

"Seemingly from the dawn of man all nations have had governments; and all nations have been ashamed of them."
- G.K. Chesterton

more than 4 years ago

Tracking Pedophiles By Their Typing Habits

jda104 Re:It's worse than junk (292 comments)

Even in Javascript you have the opportunity of embedding timing data from the Key Press events into the typing data.

more than 4 years ago

Tracking Pedophiles By Their Typing Habits

jda104 Re:Typical /. summary (292 comments)

Regarding the visibility of typing patterns, any in-browser chat/forum will likely use Javascript on the client side. JS provides access to the Key Up and Key Down events in text boxes; adding time stamps to those events is trivial. And I agree that being a 40-year-old man doesn't make you a pedophile, but I think, probabilistically, that being a 40-year-old man in a "Teenz Only!!1!" chatroom may rightfully raise a flag or two.
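
To make the timing idea concrete, here is a minimal sketch in Python of what a server could compute once it receives time-stamped key events. It assumes the client-side script has already logged each Key Down and Key Up event with a millisecond timestamp; the `KeyEvent` record, the `typing_features` function, and the sample data are all hypothetical, made up for illustration rather than taken from any real tracking system.

```python
from collections import namedtuple

# Hypothetical record for one logged event: which key, "down" or "up",
# and a timestamp in milliseconds supplied by the client.
KeyEvent = namedtuple("KeyEvent", ["key", "kind", "t_ms"])


def typing_features(events):
    """Return (hold_ms, gaps_ms) for a sequence of KeyEvent records.

    hold_ms: how long each key was held (key down to matching key up).
    gaps_ms: time between consecutive key-down events.
    """
    down_at = {}      # key -> timestamp of its most recent unmatched "down"
    hold_ms = []
    gaps_ms = []
    prev_down = None  # timestamp of the previous key-down event
    for ev in sorted(events, key=lambda e: e.t_ms):
        if ev.kind == "down":
            if prev_down is not None:
                gaps_ms.append(ev.t_ms - prev_down)
            prev_down = ev.t_ms
            down_at[ev.key] = ev.t_ms
        elif ev.kind == "up" and ev.key in down_at:
            hold_ms.append(ev.t_ms - down_at.pop(ev.key))
    return hold_ms, gaps_ms


if __name__ == "__main__":
    # Made-up sample: the user types "hi".
    sample = [
        KeyEvent("h", "down", 0),
        KeyEvent("h", "up", 90),
        KeyEvent("i", "down", 150),
        KeyEvent("i", "up", 230),
    ]
    print(typing_features(sample))
```

Hold times and press-to-press gaps are only two of the simplest features one could derive; whether they are enough to profile anyone is a separate question from whether they can be collected.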

more than 4 years ago



File concurrency solution for non-programmers?

jda104 writes | more than 3 years ago

jda104 (1652769) writes "I work with a group of about a dozen "data analysts," most of whom have some informal programming experience. We currently have an FTP server set up for file/code sharing but, as the projects get more complicated, the number of outdated versions of code and data floating around among group members has become problematic; we're looking for a more robust solution to manage our files.

I see this as a great opportunity to introduce a revision control system, though there will surely be a bit of a learning curve for non-programmers. I've primarily worked with Subversion (+TortoiseSVN), but I would rather not have to spend my time manually resolving file conflicts and locking issues for each user. Anything beyond commit, update, and revert (such as branching, merging, etc.) would probably not be used.

We're definitely not "software developers," but we write many Perl and R scripts to process datasets that can be many dozens of GBs. The group's personal machines are evenly split between Windows and Macs and our servers are all Linux, currently.

Is there a revision control system that "just works" — even for non-programmers? Or should we just head in a different direction (network share, rsync, etc.)?"


jda104 has no journal entries.
