
Open Data Needs Open Source Tools

Soulskill posted more than 4 years ago | from the stop-trying-to-fork-reality dept.

Open Source 62

macslocum writes "Nat Torkington begins sketching out an open data process that borrows liberally from open source tools: 'Open source discourages laziness (because everyone can see the corners you've cut), it can get bugs fixed or at least identified much faster (many eyes), it promotes collaboration, and it's a great training ground for skills development. I see no reason why open data shouldn't bring the same opportunities to data projects. And a lot of data projects need these things. From talking to government folks and scientists, it's become obvious that serious problems exist in some datasets. Sometimes corners were cut in gathering the data, or there's a poor chain of provenance for the data so it's impossible to figure out what's trustworthy and what's not. Sometimes the dataset is delivered as a tarball, then immediately forks as all the users add their new records to their own copy and don't share the additions. Sometimes the dataset is delivered as a tarball but nobody has provided a way for users to collaborate even if they want to. So lately I've been asking myself: What if we applied the best thinking and practices from open source to open data? What if we ran an open data project like an open source project? What would this look like?'"

62 comments

eclipse? (3, Informative)

toastar (573882) | more than 4 years ago | (#31416114)

Is Eclipse not open source?

Re:eclipse? (1)

Monkeedude1212 (1560403) | more than 4 years ago | (#31416186)

return false;

Re:eclipse? (1)

zuzulo (136299) | more than 4 years ago | (#31416608)

Just force everyone to use a versioning system. It wouldn't take a lot of tweaks to make an existing open source versioning system suitable for various types of data sets. It mostly depends on whether the data you are using is compressed, and even so, the metadata and analysis associated with the raw data are unlikely to be compressed ... the hard part would be convincing everyone involved to actually use it. jmho and ymmv of course ... ;-)

Re:eclipse? (1)

gnat (1960) | more than 4 years ago | (#31418522)

Hi, zuzulo. A versioning system would definitely be part of the solution, but there's more than git behind a successful open source project. In my post, I tried to sketch some of the tools that the data world is missing. Even if everyone just slapped the data into git, that implies it's stored in a format that makes it look like source code and so is amenable to diff and patch. What if we add a new column to the database? That affects every row, but should it be stored as a completely new version of the data? There are lots of interesting questions.
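
A toy illustration of the problem gnat raises (the CSV contents here are invented): adding a single column to a dataset stored as text touches every row, so a line-based diff reports a total rewrite rather than one logical schema change.

```python
import csv
import difflib
import io

def rows_to_lines(rows):
    """Serialize rows as CSV text, split into lines for diffing."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().splitlines(keepends=True)

v1 = [["id", "name"], ["1", "ada"], ["2", "grace"]]
# Adding one column rewrites every physical line...
v2 = [["id", "name", "born"], ["1", "ada", "1815"], ["2", "grace", "1906"]]

diff = list(difflib.unified_diff(rows_to_lines(v1), rows_to_lines(v2),
                                 "v1.csv", "v2.csv"))
# ...so every data line shows up as removed-and-re-added, and a
# line-based VCS records a "total rewrite" for one schema change.
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
print(len(changed))  # 6: all 3 rows removed and re-added
```

Which is exactly why "just put it in git" only works when the storage format keeps logical changes local.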

Re:eclipse? (3, Informative)

Monkeedude1212 (1560403) | more than 4 years ago | (#31416228)

Who modded him offtopic?
Eclipse has an open source Data Tools Platform [eclipse.org]

Re:eclipse? (1)

Hurricane78 (562437) | more than 4 years ago | (#31417310)

Well, he should have mentioned that.

Re:eclipse? (0)

Anonymous Coward | more than 4 years ago | (#31418902)

Why?

If it was a programming question and he said eclipse would it still be off topic to you?

Exactly. You're just a misinformed idiot.

Re:eclipse? (1)

epine (68316) | more than 4 years ago | (#31423442)

Eclipse has an open source Data Tools Platform

For an extremely laid-back, Zen-like, stream-of-consciousness definition of "has". My experience trying to grok this thing was extremely irritating.

From Eclipse Data Tools Platform (DTP) Project [eclipse.org]

"Data Tools" is a vast domain, yet there are a fairly small number of foundational requirements when developing with or managing data-centric systems. (What does it do?) A developer is interested in an environment that is easy to configure (what does it do?), one in which the challenges of application development are due to the problem domain (what does it do?), not the complexity of the tools employed. (What does it do?) Data management, whether by a developer working on an application (what does it do?), or an administrator maintaining or monitoring a production system (what does it do?), should also provide a consistent (what does it do?), highly usable environment that works well with associated technologies. (What does it do?)

Three rules plucked from Ten rules for writing fiction [guardian.co.uk] by Elmore Leonard

Never open a book with weather. If it's only to create atmosphere, and not a character's reaction to the weather, you don't want to go on too long. The reader^H^H^H^H^H^Hgeek is apt to leaf ahead looking for people^H^H^H^H^H^Hpurpose.

Don't go into great detail describing places and things (or meta framework), unless you're Margaret Atwood and can paint scenes with language. You don't want descriptions that bring the action, the flow of the story, to a standstill.

Try to leave out the part that readers tend to skip. Think of what you skip reading a novel: thick paragraphs of prose you can see have too many words in them.

I generally get along well with Eclipse, but for the love of God:

What does DTP do?

Well... (2, Insightful)

fuzzyfuzzyfungus (1223518) | more than 4 years ago | (#31416166)

The organizational challenges are likely a nasty morass of situation specific oddities, special cases, and unexpectedly tricky personal politics; but OSS technology has clear application.

Most of the large and notable OSS programs are substantially sized codebases distributed and developed across hundreds of different locations. If only by sheer necessity, OSS revision control tools are up to the challenge. That won't change the fact that gathering good data about the real world is hard; but it will make managing a big dataset with a whole bunch of contributors, and keeping everything in sync, a whole lot easier. Any of the contemporary (i.e. post-CVS, distributed) revision control systems could do that easily enough. Plus, you get something resembling a chain of provenance (at least once the data enter the system) and the ability to filter out commits from people you think are unreliable.
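
A minimal sketch of the "provenance plus filtering" idea (contributor names and records are invented; a real system would lean on the VCS's own commit metadata and signatures rather than a hand-rolled log):

```python
import hashlib

# A toy provenance log: every accepted record carries who added it
# and a hash of its content, so downstream users can audit or filter.
log = [
    {"who": "alice",   "record": "station=4,temp=21.5"},
    {"who": "mallory", "record": "station=4,temp=99.9"},
    {"who": "bob",     "record": "station=7,temp=19.2"},
]
for entry in log:
    entry["sha"] = hashlib.sha256(entry["record"].encode()).hexdigest()[:12]

# "Filter out commits from people you think are unreliable":
untrusted = {"mallory"}
trusted_view = [e for e in log if e["who"] not in untrusted]
print([e["who"] for e in trusted_view])  # ['alice', 'bob']
```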

Really? (1, Insightful)

Anonymous Coward | more than 4 years ago | (#31416224)

it can get bugs fixed or at least identified much faster (many eyes),

So then why were there all those buffer overflow and null pointer issues in the Linux kernel before Coverity ran its scan on the code? Why did that Debian SSH bug exist for over two years if this is true?

Parent not a troll. (3, Informative)

aristotle-dude (626586) | more than 4 years ago | (#31417068)

Having lots of eyes looking at code is no substitute for running tools like Coverity on your software, along with test-driven development. Humans can easily miss problems with code that a tool or smoke test can uncover.

Open Street Map (3, Informative)

Anonymous Coward | more than 4 years ago | (#31416260)

A perfect example of collaboration with a massive dataset:

http://www.openstreetmap.org/

Re:Open Street Map (1)

toastar (573882) | more than 4 years ago | (#31416436)

Gratz for actually reading the article to the end, I gave up after the first paragraph.

Re:Open Street Map (1)

AxeTheMax (1163705) | more than 4 years ago | (#31417928)

Openstreetmap is good and useful if you don't want to fork out money. But it suffers from some vandalism, and some bad data. It needs more quality control if I'm going to depend on it in a remote location or when a life may be at stake. It will probably get more QC and then end up with some of the negative points that Wikipedia has.

Already being done (4, Insightful)

kiwimate (458274) | more than 4 years ago | (#31416288)

What if we ran an open data project like an open source project? What would this look like?

Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.

Re:Already being done (3, Informative)

viralMeme (1461143) | more than 4 years ago | (#31416412)

> Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about ..

Wikipedia isn't an open source project, it's an online collaborative encyclopedia. Mediawiki [mediawiki.org] on the other hand is the software project that powers Wikipedia.

Re:Already being done (3, Insightful)

mikael_j (106439) | more than 4 years ago | (#31416494)

I don't think kiwimate was saying that Wikipedia is an open source project, just that Wikipedia is a great example of an open data project run like an open source project.

/Mikael

Re:Already being done (2, Insightful)

wastedlife (1319259) | more than 4 years ago | (#31417802)

Unlike most open source projects, Wikipedia accepts anonymous contributions and then immediately publishes them without review or verification. That seems like a very strong difference to me.

Re:Already being done (0)

Anonymous Coward | more than 4 years ago | (#31416598)

I concur. Open source (depending on size) is not always great either. How many Linux kernels and project offshoots are running around? Now imagine that scenario with data. "Well, this dataset is great for this one specific purpose, but can't be tied back to this other dataset over here, even though I need information from it. So I'll create a third dataset which combines the first two..." Enough people don't know what they are talking about (any financial analyst talking industry specifics, for example) that I just don't know how trustworthy the data would ultimately be.

Re:Already being done (4, Insightful)

musicalmicah (1532521) | more than 4 years ago | (#31416734)

What if we ran an open data project like an open source project? What would this look like?

Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.

Gee, you make it sound so terrible when you put it like that. It also happens to be an amazing source of information and the perfect resource for an initial foray into any research topic. It's a shining example of what happens when huge amounts of people want to share their knowledge and time with the world. Sure, it's got a few flaws, but in the grand scheme of things, it has made a massive body of information ever more accessible and usable.

Moreover, I've seen all the flaws you've listed in closed collaborative projects as well. Like all projects, Wikipedia is both a beneficiary and a victim of human nature.

Re:Already being done (1, Insightful)

Anonymous Coward | more than 4 years ago | (#31416852)

What if we ran an open data project like an open source project? What would this look like?

Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.

Right, because no-one ever edits wikipedia because some self-interested, self-proclaimed authority has written something erroneous ?

Re:Already being done (3, Interesting)

Hurricane78 (562437) | more than 4 years ago | (#31417172)

I've said this a thousand times before: Make Wikipedia a P2P project without a single control, and build a cascading network of trust relationships on top of it (think CSS rules, but on articles instead of elements, and one CSS file per user, perhaps including those of others), and you solve all problems with then not-existing central authorities, and so also with censorship.

The only caveat: People have to learn again, who to trust and who not. (Example of where this fails: Political parties and other groups with advanced social engineering / rhetorics / mass psychology skills, like marketing companies.)

Re:Already being done (1)

Monkeedude1212 (1560403) | more than 4 years ago | (#31418570)

Just to play devil's advocate

Make Wikipedia a P2P project without a single control, and build a cascading network of trust
relationships on top of it, and you solve all problems with then not-existing central authorities, and so also with censorship

Who in their right mind is going to set up a network where they themselves are not the central authorities of what goes onto it? Those people who get bored will create thousands upon thousands of useless articles to use up space on the server - and since no authority is there to restrict that, it'll work. And if you put any kind of countermeasure in place, that's opening itself to censorship, or centralized authority.

The only caveat: People have to learn again, who to trust and who not. (Example of where this fails: Political parties and other groups with advanced social engineering / rhetorics / mass psychology skills, like marketing companies.)

Sounds more like everything would become politically charged - actually. Not that I'd ever cite wikipedia for a paper, but suppose I did, and my teacher can't find the article because we differ in political opinion, and thus we don't trust the same sources.

Re:Already being done (1)

lennier (44736) | more than 4 years ago | (#31421560)

Who in their right mind is going to set up a network where they themselves are not the central authorities of what goes onto it?

Nobody, but fortunately that's exactly not the system being proposed. This system would allow everyone to themselves be the central authorities of what goes into their network.

Those people who get bored will create thousands upon thousands of useless articles to use up space on the server - and since no authority is there to restrict that, it'll work.

Sure. You've just described 'the web' (anyone can post anything even useless trivia!), so this system won't be much different. But you seem to be under a misapprehension: there won't be any one 'server' - rather there'll be a single unified data-storage/publishing fabric where you pay for raw information storage/transmission (upload/download more bits, pay more) but not for policy control.

What is needed to make this work is the ability for all users to rebroadcast other people's content, thereby becoming their own editors. That means it really won't play well with existing copyright, and probably not even with micropayments (if I publish my own iTunes library of playlists, I don't want to have to pay millions of dollars in licensing fees), so free content would work best with this model. Free content would mean all data could be cached close to where it was wanted, instead of the current situation, where playing iTunes or YouTube content means streaming from a central server and hitting it with bandwidth costs.

Sounds more like everything would become politically charged - actually. Not that I'd ever cite wikipedia for a paper, but suppose I did, and my teacher can't find the article because we differ in political opinion, and thus we don't trust the same sources.

Yes, there'd be political differences on what constitutes a 'trustworthy' source, exactly as there is now.

What would happen in your example would be, in order for you to study at your school, first you would have to agree to use the same Wikipedia that your teacher requires everyone in the class use, in exactly the same way that schools now require students to use approved textbooks. And so on right up to the science journals.

This is just what happens already, it's just a matter of standardising the technology platform and opening access to content so everyone can be a creator, editor and rebroadcaster (not of others' feeds, but of their own). The concept of 'trusted repositories' or 'trusted editors' would still exist, and political/philosophical communities would now define themselves largely by what authorities they choose to use or create as their trusted repositories or ontologies. Information articles would have to have a formal pedigree showing their source, just like papers have citation lists and Wikipedia articles have both citations and update histories.

Yes, the technology and information platforms we choose do have strong political effects, so we should be informed about these effects, but we needn't be afraid of embracing open information once we're aware of how to use it.

Re:Already being done (1)

lennier (44736) | more than 4 years ago | (#31421702)

Here's a bit more detail on what I think is an important but overlooked element of this vision:

In order for people to be not only content creators but content editors, rebroadcasting/remixing other people's work (and this is hugely important - we can't have a unified wiki/blog/Xanadu/datasphere without it, the ability to publish a view of the world is essential), what we need is a unified global publishing and computing fabric.

By 'computing' I mean we need the ability to publish arbitrary computable functions over data streams, not just raw data (as current publishing systems such as blogs, wikis, web pages, Twitter, CSS, RDF etc. let us do). These should be pure functions without side effects, since no one user publishing anything should be able to modify anyone else's state. Sorry, but JavaScript, Java, Python, Perl, Ruby, and .NET all seem unsuitable for such a massively distributed language, because they have side effects.

The data we publish should also not be limited in any way: not just text, not 140-character limits, but numbers, media, weather reports, binary blocks. So blogs, wikis, and Twitter all fail here. Something on top of RDF might work, with a lot of tweaks. Google Wave is almost promising, but Robots don't seem at all like the kind of publishable pure functions I'm envisaging; they are very side-effectful, modifying streams in realtime.

I'd love to know what current or envisaged publishing frameworks let us do this.

Re:Already being done (2, Interesting)

lennier (44736) | more than 4 years ago | (#31421434)

I've said this a thousand times before: Make Wikipedia a P2P project without a single control, and build a cascading network of trust relationships on top of it (think CSS rules, but on articles instead of elements, and one CSS file per user, perhaps including those of others), and you solve all problems with then not-existing central authorities, and so also with censorship.

I agree wholeheartedly. If I understand correctly, this is very like what David Gelernter [edge.org] is saying with his datasphere/lifestreams concept: a fully distributed system with no centre where any node can absorb and retransmit its own view of the data universe. Twitter and 'retweets' is a sort of lame, struggling, misbegotten attempt to shamble towards this idea.

What would happen, I think, is that such a distributed Wikipedia would converge on a few 'trusted super-editors' who produced their own authorised versions - like Linux kernel forks or distributions - since the pressure to join a 'good enough' peer group would force forking to only happen where necessary. And yes, there'd probably emerge separate political factions: a Mainstream Wikipedia, a Citizendium, a Conservapedia, an Encyclopedia Dramatica, a UFOpedia, a Treknopedia, each of which has their own idea of what subjects are/are not 'noteworthy' or which sources are well-attested... but that's fine, we have that already. What we'd win in a truly distributed system is not the ability to fork (which the GPL already gives us) but the ability to easily remerge, which is currently a real pain.

There's no reason, for instance, why Citizendium, TVTropes, Encyclopedia Dramatica, C2, MeatballWiki, etc all couldn't share the same technical base and content and link to and import/export from each other, and just provide different editorial policies or views. And I think we'd all win hugely if we could bring that about.

Wikipedia == Anarchy != Open Source (2, Insightful)

jonaskoelker (922170) | more than 4 years ago | (#31418448)

What if we ran an open data project like an open source project? What would this look like?

Wikipedia. With all the inherent problems of self-proclaimed authorities

Who do not have commit access.

That is one of the keys to running an open source project well: you, being the giant with some source code, let everybody stand on your shoulders so they can see farther. And you let others stand on their shoulders so they can see even farther still.

But you don't let just about anyone become part of your shoulders. Especially not if that would weaken your shoulders (i.e. bad code or citation-free encyclopaedia entries).

That's the difference between Open Source projects and the Wikipedia project: Wikipedia lets the midgets stand on the shoulders of the giant, even if that makes the giant shorter rather than taller. Well-run open source projects don't let that happen. And poorly run open source projects don't exist due to survivor bias ;-)

Re:Already being done (1)

Yvanhoe (564877) | more than 4 years ago | (#31418632)

And a huge success.
Face it: the problems you mention exist today, but are hidden from the public's eye. Giving the public a way to correct them is what Wikipedia did, and proved workable.

Re:Already being done (1)

grcumb (781340) | more than 4 years ago | (#31420918)

What if we ran an open data project like an open source project? What would this look like?

Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.

So, basically just like any other large-scale, cooperative human enterprise, with the sole distinction that everyone gets to see the sausage being made (and to make it, if they choose)?

Re:Already being done (0)

Anonymous Coward | more than 4 years ago | (#31426334)

Open software is less beset by the "Dim-Bulb Twit problem" than Wikipedia because the cost of entry is higher - you have to be able to code (usually in something more demanding than BASIC). It's not clear that this (imperfect) filter would apply to open data. (E.g., would requiring raw SQL help filter out the DBTs?)

Re:Already being done (1)

tehcyder (746570) | more than 4 years ago | (#31439786)

Wikipedia. With all the inherent problems of self-proclaimed authorities who don't know what they're talking about; bored trouble-makers who inject bad information because they're, well, bored; petty little squabbles which result in valid data being deleted; and so on.

You misspelled slashdot.

Use Open Standards (4, Informative)

The-Pheon (65392) | more than 4 years ago | (#31416332)

People could start by documenting their data in standardized formats, like DDI 3 [ddi-alliance.org].

Re:Use Open Standards (1)

aero6dof (415422) | more than 4 years ago | (#31416822)

For other scientific data sets which are primarily tabular numeric data, I've always liked NetCDF [wikipedia.org] and or occasionally HDF [wikipedia.org]

Open data needs open data structure and owner (4, Insightful)

bokmann (323771) | more than 4 years ago | (#31416490)

Interesting problem. Several things come to mind:

1) The Pragmatic tip "Keep knowledge in Plain Text" (from the Pragmatic Programmer book, which also brought us DRY). You can argue whether XML, JSON, etc. are considered 'plain text', but the spirit is simple - data is open when it is usable.

2) tools like diff and patch. If you make a change, you need to be able to extract that change from the whole and give it to other people.

3) Version control tools to manage the complexity of forking, branching, merging, and otherwise dealing with all the many little 'diffs' people will create. Git is an awesome decentralized tool for this.

4) Open databases. Not just SQL databases like Postgres and MySQL, but other database types for other data structures like CouchDB, Mulgara, etc.

All of these things come with the power to help address this problem, but they also come with a barrier to entry, in that their use requires skill not just in the tool, but in the problem space of 'data management'.

The problem of data management, as well as the job of pointing to one set as 'canonical', should be in the hands of someone capable of doing the work. Perhaps there is a skillset worth defining here - some offshoot of library sciences?
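
Points 1 through 3 above can be demonstrated with Python's standard difflib (the dataset here is invented): the diff between two plain-text versions is itself plain text you can hand to collaborators, and either version can be reconstructed from it.

```python
import difflib

# "Keep knowledge in plain text": two versions of a tiny dataset.
old = ["city,pop\n", "wellington,215100\n", "auckland,1463000\n"]
new = ["city,pop\n", "wellington,215900\n", "auckland,1463000\n",
       "dunedin,126000\n"]

# The delta is itself plain text: a change you can mail around.
delta = list(difflib.ndiff(old, new))
print("".join(l for l in delta if l.startswith(("+ ", "- "))))

# And it is reversible: either side can be rebuilt from the delta.
assert list(difflib.restore(delta, 2)) == new
assert list(difflib.restore(delta, 1)) == old
```

Real version control adds history and merging on top, but the core "extract a change and give it to other people" workflow is just this.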

Re:Open data needs open data structure and owner (5, Informative)

GrantRobertson (973370) | more than 4 years ago | (#31418142)

Perhaps there is a skillset worth defining here - some offshoot of library sciences?

That offshoot is called "Information Science." Most "Library Science" programs now call themselves "Library and Information Science" programs. There is now even a consortium of universities that call themselves "iSchools." [ischools.org] In my preliminary research while looking for a graduate program in "Information Science" it seems as if the program at Berkeley [berkeley.edu] has gone the farthest in getting away from the legacy "Library Science" and moving toward a pure "Information Science" program.

I personally think that the field of "Information Science" is really where we are going to find the next major improvements in the ability of computers to actually impact our daily lives. We need entirely new models for looking at dynamic, "living" data and tracking changes not only to the data but to the schema and provenance of that data. That is how "data" becomes "information" and then "knowledge." I won't write my doctoral thesis here, but suffice it to say that simply squeezing data into a decades-old model of software version control is not quite going to cut it. In software version control you don't have as much of a trust problem. Yes, you do care if someone inappropriately copies code from a proprietary or differently-licensed source. But you don't have as much incentive for people to intentionally fudge the code one way or another. In addition, data can be legitimately manipulated, transformed, and summarized to harvest that "information" out of the raw numbers. This does not happen with code. Yes, there is refactoring, but with code it is not as necessary to document every minute change and how it was arrived at. With data, the equations and algorithms used for each transformation need to be recorded along with the new dataset, as do the reasons for those transformations and the authority of those who performed them.
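
A minimal sketch of the kind of provenance record being argued for here, with invented field names and values rather than any real standard:

```python
import hashlib
import json

raw = [21.5, 19.2, 99.9, 20.1]

# A derived dataset should carry the transformation that produced it,
# plus the reason and the author, not just the numbers.
cleaned = [t for t in raw if t < 50]  # drop implausible readings
provenance = {
    "parent_sha": hashlib.sha256(json.dumps(raw).encode()).hexdigest()[:12],
    "transform":  "filter: temp < 50",
    "reason":     "values above sensor ceiling are known-bad",
    "by":         "example-user",
    "result_sha": hashlib.sha256(json.dumps(cleaned).encode()).hexdigest()[:12],
}
print(json.dumps(provenance, indent=2))
```

Chaining `parent_sha`/`result_sha` across successive transformations gives a pedigree you can walk back to the raw data.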

Throw into the mix that there will be many different sets of similar data gathered about the same phenomena but with slightly different schemas and different actual data points which will all have different provenances but will need to be manipulated in ways to bring their models into forms that are parallel to all the other data sets associated with those phenomena while still tracking how they are different ... and you will see that we don't just need a different box to think outside of, we need an entirely different warehouse. (You know, the place where we store the boxes, outside of which we will do our thinking.)

Many of the suggestions posted here are a start, but only a start.

Re:Open data needs open data structure and owner (1)

gnat (1960) | more than 4 years ago | (#31418962)

Could you post a link to your thesis? It sounds interesting. Thanks!

Re:Open data needs open data structure and owner (1)

GrantRobertson (973370) | more than 4 years ago | (#31420686)

The thesis isn't written. I'm not even in graduate school yet. But that is likely what my thesis will be about when I finally write it. My head continuously swims with all the connections between information that need to be tracked. Maybe I'll get to be a pioneer. Woo hoo.

You can find information about many of my ideas at www.ideationizing.com [ideationizing.com].

Linked Data? (1)

aharth (412459) | more than 4 years ago | (#31419106)

Semantic Web technologies (in particular RDF, a graph-structured data format) are ideally suited for publishing data. Also, these technologies facilitate the integration of separate pieces of information; integration is what you want to do if thousands of people start publishing structured data. Linked Data [w3.org] (RDF using HTTP URIs to identify things) is already used by the NYT and the UK government to publish data online.

Standards by Domain needed. (3, Interesting)

headkase (533448) | more than 4 years ago | (#31416562)

High-level: Save your differences from day to day, bittorrent those differences to others, merge back in differences from others. Low-level: OMG, we used different table-names.
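
The high-level loop can be sketched with plain set operations (station IDs invented; a real dataset would need stable keys and conflict rules, which is exactly where the low-level "different table names" problem bites):

```python
# Day-to-day differences as set operations on record tuples.
yesterday = {("stn4", 21.5), ("stn7", 19.2)}
today     = {("stn4", 21.5), ("stn7", 19.3), ("stn9", 18.0)}

added   = today - yesterday   # ship only these deltas to peers
removed = yesterday - today

# A peer merges your delta into their own, different copy:
peer   = {("stn4", 21.5), ("stn2", 23.1)}
merged = (peer - removed) | added
print(sorted(merged))
# [('stn2', 23.1), ('stn4', 21.5), ('stn7', 19.3), ('stn9', 18.0)]
```

This only works because both sides agree on what a record is; with mismatched schemas the subtraction is meaningless, hence the need for per-domain standards.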

Re:Standards by Domain needed. (2, Insightful)

oneiros27 (46144) | more than 4 years ago | (#31421400)

You're assuming that the differences are something that someone can keep up with in real time. If someone makes a change in calibration that results in a few months' worth of data changing, it might take weeks or even months to catch up (as you're still trying to deal with the ingest of new data at the same time). As for bittorrent, p2p is banned in some federal agencies -- and as such, we haven't had a chance to find out how well it scales to dealing with lots (10s of millions) of small files (1 to 16 MB).

As for the low-level issues -- it's not even close. The problem is that people build their catalogs to handle the type of science they want to do; they often don't revolve around the same concepts, and they might have one or thousands of tables. See my talk Data Relationships: Towards a Conceptual Model of Scientific Data Catalogs [nasa.gov] from the 2008 American Geophysical Union.

I've been working for years with people who want to search the data from the systems I maintain, but the ways they want me to describe the data to make it searchable aren't easy to define -- even terms like 'instrument' mean something different between their system and mine. (And I have a paper submitted for the Journal of Library Metadata's special 'eScience' issue, dealing with issues in terminology and other problems that the library field doesn't typically run into, but that we have to deal with in science informatics.)

Disclaimer : If it's not apparent from the message, I work in this field.

I agree in principle but don't believe it (1)

godrik (1287354) | more than 4 years ago | (#31416654)

I just don't think it is possible to build such useful data. I work in parallel computing, from a theoretical scheduling perspective. Every single paper you see is interested in a slightly different model which needs slightly different parameters, or looks at slightly different metrics.

Although I would love to have a database that provides everyone's instances, as well as their implementations and results, I do not believe it is going to happen. Since every scientist needs different parameters, they will all end up with different databases. That would remove the point of having such a database to begin with.

However, it is obvious to me that we want the data that were used to generate the results to be available, so that reviewers can have a look at them.

There are lots of good examples of open data (0)

Anonymous Coward | more than 4 years ago | (#31417000)

The NCBI has a lot of open data sets that they maintain and update regularly. My favorite is MEDLINE, a dataset of medical literature metadata (abstracts, titles, etc.). Not quite open source, but available to researchers under an essentially free non-commercial attribution license.

There are good analogies between open source and open data. The key one is community participation. Large data sets will likely have problems and inconsistencies. These are going to be exposed by people using the data in odd and unexpected ways, so having a good mechanism for user feedback and improving the data is key, as is versioning and sane schema evolution.
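Sane schema evolution, for instance, can be as simple as numbered migrations tracked in the database itself. A minimal sketch in Python with SQLite, using the `user_version` pragma to record which migrations have been applied (the table and column names are made up for illustration):

```python
import sqlite3

# Ordered list of schema changes; never edit old entries, only append.
MIGRATIONS = [
    "CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT)",
    "ALTER TABLE records ADD COLUMN source TEXT",
]

def migrate(conn):
    """Apply any migrations newer than the stored schema version."""
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    for i in range(version, len(MIGRATIONS)):
        conn.execute(MIGRATIONS[i])
        conn.execute(f"PRAGMA user_version = {i + 1}")
    conn.commit()
    return len(MIGRATIONS)
```

Running `migrate` is idempotent, so every consumer of the dataset can bring an old copy up to the current schema without guessing what changed.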

There is a nice series on open source government data on the Freedom to Tinker blog:
http://www.freedom-to-tinker.com/

Open Source Data (1)

Registered Coward v2 (447531) | more than 4 years ago | (#31417136)

"What if we ran an open data project like an open source project? What would this look like?"

Every time someone asked about the data, they'd get a reply of RTFM.

Whenever someone didn't like the data, they'd fork it with their own approved data.

MS would issue a white paper saying why closed source data is better and cheaper.

Every time someone announced some new data, RMS would yell "That's GNU!!!!!"

Technical solution to internal issue (1)

geoff_syndicate (863418) | more than 4 years ago | (#31417216)

What I've been saying for ages is that the biggest problems for the open data movement are mostly found inside government agencies. Until the open data promoters can establish a cohesive pitch, based around solving goals for the agency in question, these technical solutions are a waste of time. Nat's latest 'open source' model for open data will only excite those already sold on the idea.

Most of the people who need convincing as to why they should get on board the open data train, need to be sold on the benefits to *them*, not the benefits to the technical community.

Re:Technical solution to internal issue (1)

gnat (1960) | more than 4 years ago | (#31419028)

Absolutely! All too often we're guilty of saying "open the data because that's what I believe you should do", not "open the data because this is how it will make your life easier" or "open the data because this is how it will help you do your job", etc. It's come from a technologist-centric pull, but it won't succeed until it becomes a bureaucrat-originated push.

Too difficult unless funding sources demand it. (1)

virtualXTC (609488) | more than 4 years ago | (#31417450)

The real problem is the lack of a standardized language between different scientists / agencies. It's really up to the funding sources (such as the NCI) to come up with the standards, or else you end up with standards that, while technically better, only a few follow, e.g. chembank.broad.mit.edu. Further, having multiple "standards

Open data = published data (0)

Anonymous Coward | more than 4 years ago | (#31417532)

There is already an extensive system in place for reviewing and communicating "open" data--peer reviewed publication. If you want to ensure that your data, analysis, and conclusions are part of the collective memory, then publish it in plain language (probably English). "If it isn't published, you didn't do it."

Sharing via BitTorrents (0)

Anonymous Coward | more than 4 years ago | (#31417740)

One of the biggest problems is that these datasets are often very large, causing bottlenecks with downloading the data as well as sharing results or variations of the data.
I noticed that BioTorrents [biotorrents.net] is a new open source BitTorrent tracker aimed especially at sharing legal open access datasets and software.

sounds familiar (0)

Anonymous Coward | more than 4 years ago | (#31417816)

Isn't this what http://sciencecommons.org/ is all about: freeing data to open up collaboration and revive the sexiness that is science?

What would this look like? (0)

Anonymous Coward | more than 4 years ago | (#31417876)

If it were open source data, after a while would it have more eye-candy and little added functionality and the mail list would be flooded with flame wars over meaningless minutia? Or not?

Re:What would this look like? (1)

Hal_Porter (817932) | more than 4 years ago | (#31420488)

If it were open source data, after a while would it have more eye-candy and little added functionality and the mail list would be flooded with flame wars over meaningless minutia? Or not?

That's minutiae

Metadata handling with CKAN (2, Informative)

Bazman (4849) | more than 4 years ago | (#31419936)

Have you looked at the CKAN software (www.ckan.net)? They run their own knowledge archive, and the software also powers the UK's data.gov.uk site. RESTful API and Python client.
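For anyone curious what that API looks like, here is a rough sketch of calling CKAN's Action API from Python with only the standard library (demo.ckan.org stands in for whichever CKAN instance you actually use; error handling beyond the `success` flag is omitted):

```python
import json
from urllib.request import urlopen

CKAN_BASE = "https://demo.ckan.org"  # assumption: any CKAN instance

def action_url(base, action):
    """Build a CKAN Action API URL, e.g. .../api/3/action/package_list."""
    return f"{base}/api/3/action/{action}"

def parse_result(body):
    """CKAN wraps every response in {"success": ..., "result": ...}."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError("CKAN call failed")
    return payload["result"]

# Example (requires network access):
# with urlopen(action_url(CKAN_BASE, "package_list")) as resp:
#     datasets = parse_result(resp.read())
```

The same pattern covers the other actions (package_show, package_search, and so on), so a thin client like this goes a long way.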

Interesting view: (0)

Anonymous Coward | more than 4 years ago | (#31419986)

Open source encourages laziness (because there are 1mil others out there who can fix it later/better, so good enuf is enuf for now), it can get /interesting/ bugs fixed or at least identified much faster (many eyes), it promotes collaboration /in a clique, outside of which you just get told to 'fix it yourself'/, and it's a terrible training ground for skills development as there is just code, no doco.

OpenDAP (2, Informative)

story645 (1278106) | more than 4 years ago | (#31420580)

The main point of the OPeNDAP [opendap.org] project is to facilitate remote collaboration on data, and there are already a few organizations that use it to share data. I've used the Python variant for NetCDF files and found it works well, and the web interface is clean. The best part of the OPeNDAP project is probably that the data doesn't need to be downloaded/copied to be processed, which is really important for anyone who can't afford the racks of hard drives some of these datasets need.
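The trick that makes this possible is DAP's constraint expressions: the client appends a subset request to the dataset URL, and the server returns only that slice. A small sketch of building such a URL by hand (the dataset path and variable name here are invented; real clients like pydap do this for you):

```python
def dap_subset_url(base_url, var, start, stride, stop):
    """Request one hyperslab of one variable via a DAP2 constraint
    expression, so only that slice ever crosses the network."""
    return f"{base_url}.dods?{var}[{start}:{stride}:{stop}]"

# dap_subset_url("http://example.org/data/sst.nc", "sst", 0, 1, 9)
# -> "http://example.org/data/sst.nc.dods?sst[0:1:9]"
```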

One solution (1)

DCFusor (1763438) | more than 4 years ago | (#31420786)

Is what we do on the fusor forum for amateur high energy scientists. It's not perfect, but we basically share in the same manner as open source software all that we do, and it's working fine for us. We help the newbies when we can, or tell them to search the extensive archives for when that question has been asked and answered before, post data, pictures of our gear and all that. It's a good crowd, but a small site, so don't all go there at once....it won't take it and this isn't funded by some large outfit, it's just our own money. Real names are universally used there -- this site is for real work, not kiddie flame wars. There's not much moderation, but jerks lose the ability to log in quickly. Here is the open source fusor forum [fusor.net] for you to check out. This is mostly a bunch of old guys having some fun, and helping some new guys get into the game. All sorts of advice and data shared openly and all in one place. Far from perfect, but a good start, I'd say. Check out the "recent threads" link which is as close to slashdot format as it gets on that site.

Lead by example (1)

konohitowa (220547) | more than 4 years ago | (#31420790)

Perhaps /. could lead the way by providing an open database of their stories and comments (license changes would be needed with opt-out).

Then again, I might just think that because I'd rather have a different interface to the same info rather than the one I'm stuck with.

Bad Start (1)

AmberBlackCat (829689) | more than 4 years ago | (#31421302)

They lost me when I read "Open source discourages laziness (because everyone can see the corners you've cut)".

Whoever said that hasn't seen a lot of open source GUIs [wikipedia.org] lately. Then they had the nerve to say open source products make bugs more likely to be identified because more people are looking at the code. But how many of those people know what they're looking at? And is the core group that knows what it's looking at any bigger than some for-profit's programming team?

OT: What's up with that "Open Source" logo? (1)

Ellis D. Tripp (755736) | more than 4 years ago | (#31422146)

Where did it come from, and what is it supposed to represent?

It's probably just cause I'm an electronics geek with a fondness for "hollow state", but that thing sure looks like the business end of a "magic eye tube" to me.

For those who have no idea what a magic eye tube is:

http://www.magiceyetubes.com/eye02.jpg [magiceyetubes.com]
http://en.wikipedia.org/wiki/Magic_eye_tube [wikipedia.org]

Open society needs open data and analysis tools (1)

kiwigrant (907903) | more than 4 years ago | (#31422606)

Investigative journalism is dying; citizens need direct access to government data and the tools to analyse it themselves. We can't rely on the media to expose flaws in government policy any more so we need:
  • data
  • meta-data e.g. how to avoid obvious misinterpretations, errors etc
  • free tools for storing data (and running basic analyses) e.g. SQLite, MySQL, PostgreSQL etc
  • free tools for analysing data e.g. R, SOFA (Statistics Open For All - https://sourceforge.net/projects/sofastatistics/ [sourceforge.net]) etc
  • free resources for learning about analysis e.g. CAST (http://cast.massey.ac.nz/collection_public.html), wikipedia etc
  • free tools for presenting and disseminating results e.g. OpenOffice Impress, WordPress etc
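As a small illustration of that toolchain, here's a sketch of the storage-and-basic-analysis piece using Python's built-in sqlite3 module (the spending figures are invented; in practice the rows would come from a government CSV):

```python
import sqlite3

# Hypothetical sample rows of (region, spending) from a published dataset.
rows = [("North", 120.0), ("South", 80.0), ("North", 60.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (region TEXT, amount REAL)")
conn.executemany("INSERT INTO spending VALUES (?, ?)", rows)

# A basic analysis: total spending per region.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM spending GROUP BY region ORDER BY region"))
```

Nothing here costs a cent, which is the point: any citizen with the data file can reproduce the analysis.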

Usenet (0)

Anonymous Coward | more than 4 years ago | (#31424386)

Collaboration, archiving, openness, trolls.
