Web Pages Are Weak Links in the Chain of Knowledge

Hemos posted more than 10 years ago | from the destroying-our-young dept.

The Internet 361

PizzaFace writes "Contributions to science, law, and other scholarly fields rely for their authority on citations to earlier publications. The ease of publishing on the web has made it an explosively popular medium, and web pages are increasingly cited as authorities in other publications. But easy come, easy go: web pages often get moved or removed, and publications that cite them lose their authorities. The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library. As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture.""

Worst Record Keeping (1)

nberardi (199555) | more than 10 years ago | (#7547347)

I really think we are living in a world right now of some of the worst record keeping of knowledge.

Re:Worst Record Keeping (2, Funny)

klokwise (610755) | more than 10 years ago | (#7547445)

I really hope you have some evidence to back that up.

Re:Worst Record Keeping (0)

Anonymous Coward | more than 10 years ago | (#7547474)

It's a self-reinforcing Conspiracy!


Re:Worst Record Keeping (2, Interesting)

Urkki (668283) | more than 10 years ago | (#7547447)

Nah. There was a time when only very very few could even read, let alone write, let alone keep any kind of records...

But I get your point. Too bad there are some restrictions on copying the web pages you are referencing...

There should be some service, a bit like Google's cache, you could use to store the referenced pages. I submit the page to the service, then provide two links in my own document, one to the original page (which will likely expire eventually) and one to the cached version. I wonder if they could get around copyright issues the same way Google's cache gets around them, even though this is a bit more permanent storage than Google's cache... Most web page authors certainly would not have any problem with having their pages archived there. Quite the opposite: most would be happy to have their work referenced by others...
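
A minimal sketch of the two-link idea, in Python. The `Citation` type, its field names and the archive URL used in the example are all hypothetical; the thread does not specify any particular archiving service.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    """A reference that carries both the live page and an archived copy."""
    title: str
    original_url: str   # the live page, likely to expire eventually
    archived_url: str   # hypothetical permanent copy held by the service
    retrieved: date     # when the snapshot was taken


def render(c: Citation) -> str:
    """Format the citation with both links and the snapshot date."""
    return (
        f"{c.title}. {c.original_url} "
        f"(archived {c.retrieved.isoformat()}: {c.archived_url})"
    )
```

Recording the retrieval date alongside the archived link makes it clear which version of the page was actually cited if the original changes later.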

Re:Worst Record Keeping (1)

gilrain (638808) | more than 10 years ago | (#7547574)

Unfortunately, this method would throw out the good with the bad. If the website you submitted to the archive did not expire immediately, it would probably change for the better; and your referenced copy would not reflect the changes. Essentially, you would be referencing two different versions of the same work.

Re:Worst Record Keeping (1)

NickFitz (5849) | more than 10 years ago | (#7547617)

But unless you were revising your own work to reflect those changes, surely you should continue to reference the old version?

This would be akin to (off the top of my head) citing a reference in the first edition of A Vision [yeatsvision.com] by the poet W B Yeats. As the second edition was a complete rewrite [yeatsvision.com] bearing virtually no similarity in either argument or conclusions to the first, updating one's references to the second edition would not only be undesirable, it would probably be impossible.

Re:Worst Record Keeping (5, Interesting)

robslimo (587196) | more than 10 years ago | (#7547449)

Ummm, maybe only as applies to this topic, which is to say that web pages are a poor place to keep records.

I'd contend that researchers & scientists in general would be quite silly to cite an electronic-only resource in their publications, because the persistence of that resource relies on too many factors (the whim of the webmaster, backups or lack thereof, fiber-seeking and grid-seeking backhoes, etc).

I think that will all sort itself out and real scientists will continue or return to citing more traditional resources.

What I think is much more disturbing and disruptive is the pseudo-science and misinformation that is overly abundant on the web. Too many web sites, personal and commercial, spout 'facts' in such great detail that they have the appearance of authority. Too often, novice/amateur scientists can be seriously misled by some of the crap that can be found on the web masquerading as 'science'.


Anonymous Coward | more than 10 years ago | (#7547351)

Thats why.. (1)

panxerox (575545) | more than 10 years ago | (#7547370)

I've started to keep archived copies of webpages instead of links; the next time you want it, it's gone. Unfortunately you can't share them like links.

Re:Thats why.. (0)

Anonymous Coward | more than 10 years ago | (#7547427)

Yeah, this is another problem that doesn't really exist. If you are doing something that you seriously consider research it is not a problem to copy an entire web page or even an entire site to a local folder.
What was the point again?

archive.org and copyright? (5, Interesting)

McDutchie (151611) | more than 10 years ago | (#7547500)

I've started to keep archived copies of webpages instead of links; the next time you want it, it's gone. Unfortunately you can't share them like links.
If you can't share them, then how come archive.org can? How come archive.org seems to be above copyright law?

Re:Thats why.. (1)

Urkki (668283) | more than 10 years ago | (#7547502)

I think you're in violation of copyright law! Please stand still and wait for a strike team from the local lawyer station to arrive and arrest you, while their research team finds out whose copyright you're infringing upon, i.e. who should get 10% of the profit of suing you.

Well, (5, Interesting)

jeffkjo1 (663413) | more than 10 years ago | (#7547383)

Really, is there a reason to archive everything in the world? Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?

100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)

Re:Well, (4, Insightful)

fredrikj (629833) | more than 10 years ago | (#7547454)

100 years from now, should anyone be forced to accidentally stumble over goatse? (which is very disturbingly archived on archive.org)

Do you really think goatse will be "disturbing" 100 years from now? Only 40 years ago, people thought the Beatles were disturbing :P

Re:Well, (5, Insightful)

operagost (62405) | more than 10 years ago | (#7547535)

Do you really think goatse will be "disturbing" 100 years from now?

The day goatse.cx is no longer disturbing, is sure to be the first day of Armageddon ...

Re:Well, (1)

Xzzy (111297) | more than 10 years ago | (#7547498)

> Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?


Don't you see where this is going? The next obvious step is government installed "web vaults" where people can submit their oh-so-valuable chicken scratchings and they will be stored, under the same URL, for eternity.

No more geoshitties man, we're talking lifetime free webspace for every citizen in the US!

Re:Well, (5, Insightful)

GeorgeH (5469) | more than 10 years ago | (#7547581)

100 years from now, should anyone be forced to accidentally stumble over goatse?
The fact that you and I can refer to goatse and people know what we're talking about means that it's an important part of our shared culture. I think that anything that archives the good and bad of a culture is worth keeping around.

Re:Well, (5, Interesting)

mlush (620447) | more than 10 years ago | (#7547625)

Sure, your 4 year old has some pretty drawings, but should they be put in a library someplace?

I would be fascinated to see my Great Grandad's first drawings, his school web page, his postings to USENET. I only knew him as an old man ....

To a historian, often the most interesting stuff is the ephemera. The diary of an ordinary person gives a view of everyday life you will never get looking at 'formal' archives (i.e. newspapers, film libraries, etc.), which only cover 'important' stuff.

"This is no way to run a culture." (1, Flamebait)

Cokelee (585232) | more than 10 years ago | (#7547385)

This is no way to run a culture.

Tell the RIAA that.

Music is a part of our culture.

Re:"This is no way to run a culture." (1)

mirko (198274) | more than 10 years ago | (#7547473)

Music is a part of our culture.

Yep, but not only Britney's.
That's the reason why I created GNUArt.net [gnuart.net], in order to give most artists I know the opportunity to share music or Art they once created instead of dumping their old tapes or photos...

Of course, this may die if no one helps, but I'll have at least made these last a little longer, and eventually given these the opportunity to be reworked by others...

Books have an ISBN... (5, Interesting)

Advocadus Diaboli (323784) | more than 10 years ago | (#7547386)

...which means that with that ISBN I can refer to the book and find it at libraries or bookstores. Why don't we set up a sort of unique web page number if articles of interest or knowledge are published there? Then it would be easy to track an article if it's moved to another site or whatever, just by looking up a sort of catalog for these numbers.

Re:Books have an ISBN... (1)

Madmanz123 (203006) | more than 10 years ago | (#7547414)

I'm pretty sure there has been some discussion of that. Dave Winer (scripting.com) has talked about a universal ID for blog posts, but things are very preliminary.

Re:Books have an ISBN... (1)

IamGarageGuy 2 (687655) | more than 10 years ago | (#7547418)

The dewey decimal system of the internet. We could use large numbers, maybe 4 sets of 3 digits that are unique or something like that or ....Hold on...

Re:Books have an ISBN... (1, Informative)

Anonymous Coward | more than 10 years ago | (#7547425)

> Why don't we set up a sort of unique web page number ...

Read the article. They mention a system called DOI.

Re:Books have an ISBN... (4, Interesting)

daddywonka (539983) | more than 10 years ago | (#7547459)

Why don't we set up a sort of unique web page number if articles of interest or knowledge are published there?

The article mentions this: "One such system, known as DOI (for digital object identifier), assigns a virtual but permanent bar code of sorts to participating Web pages. Even if the page moves to a new URL address, it can always be found via its unique DOI."

But it seems that these current systems must use "registration agencies" to act as the gatekeeper of the unique ID.
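
The indirection behind such identifiers can be sketched roughly as follows, in Python. This is a toy stand-in for a registration agency, not the actual DOI system: the `IdentifierRegistry` class, its method names and the identifier in the example are invented for illustration.

```python
class IdentifierRegistry:
    """Toy registry: maps a permanent identifier to the page's current URL."""

    def __init__(self):
        # identifier (str) -> current URL (str)
        self._locations = {}

    def register(self, identifier, url):
        """Assign an identifier once; it never changes afterwards."""
        if identifier in self._locations:
            raise ValueError(f"{identifier} is already registered")
        self._locations[identifier] = url

    def move(self, identifier, new_url):
        """Record that the page now lives at a different URL."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier):
        """Return the current URL for a permanent identifier."""
        return self._locations[identifier]
```

Citations would store only the identifier, so a page move needs one update in the registry rather than a fix in every document that cites it. That central table is also why some gatekeeper has to maintain it.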

Re:Books have an ISBN... (1)

Waffle Iron (339739) | more than 10 years ago | (#7547669)

But it seems that these current systems must use "registration agencies" to act as the gatekeeper of the unique ID.

Why not just embed an off-the-shelf GUID in the header of the document? That doesn't require any central authority.

The <A> tag could be enhanced with a "guid" attribute. If a browser gets a "page not found" error on a link, it could automatically submit the GUID in the link to Google or some other search service to look for the current location.
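
A rough sketch of that fallback, in Python. Everything here is hypothetical: the `guid` meta tag, the `fetch` callback standing in for the browser's request, and the `search_index` mapping standing in for a search service. Only `uuid.uuid4()` is a real standard-library call.

```python
import uuid


def new_page_guid():
    """Generate a random GUID locally; no central authority is involved."""
    return str(uuid.uuid4())


def meta_tag(guid):
    """Hypothetical header markup carrying the page's GUID."""
    return f'<meta name="guid" content="{guid}">'


def follow_link(href, guid, fetch, search_index):
    """Try the link; on a 'not found' result, look the GUID up instead.

    `fetch` returns an HTTP status code for a URL, and `search_index`
    maps GUIDs to last-known URLs. Both are stand-ins supplied by the caller.
    """
    if fetch(href) != 404:
        return href
    return search_index.get(guid)  # None if the page was never indexed
```

The scheme only works if some search service has crawled the page at its new location and indexed the GUID, so it trades the central registry for a dependency on the crawler.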

Re:Books have an ISBN..(but web pages are googled) (5, Insightful)

WillAdams (45638) | more than 10 years ago | (#7547476)

That was why Tim Berners-Lee wanted URL to stand for ``Universal'' (not Uniform) Resource Locator.

The problem is, few people have formal training as librarians, or understand how to file away a document under such schemes (whether or not pages like this are worth preserving is another issue entirely).

Then there's the technical issue---where's the central repository? Who ensures things are correctly filed? Who pays for it all?

With all that said, I'll admit that I use Google's cache for this sort of thing---it lacks the formal hierarchy, but the search capabilities ameliorate this lack somewhat. It does fail when one wants a binary though (say the copy of Fractal Design Painter 5.5 posted by an Italian PC magazine a couple of years ago).

Moreover, this is the overt, long-term intent behind Google, to be the basis for a Star Trek style universal knowledge database---AI is going to have to get a lot better before the typical person's expectations are met, but in the short term, I'll take what I can get. ;)


Re:Books have an ISBN... (0, Redundant)

mshiltonj (220311) | more than 10 years ago | (#7547663)

which means that with that ISBN I can refer to the book and find it at libraries or bookstores. Why don't we set up a sort of unique web page number if articles of interest or knowledge are published there?

You are absolutely right!

We need some sort of Uniform Resource Identifier for the Internet. Maybe we should create an organization, a Consortium if you will, of companies on the World Wide Web to agree on a standard.

Good idea! I wonder why no one has thought of it before?

Do we really want to remember? (-1, Redundant)

Anonymous Coward | more than 10 years ago | (#7547391)

Do we really want to remember goatse.cx and tubirl.com and strange sites like this [slamhost.com] for all time?

then don't look for culture in web pages... (4, Interesting)

TechnoVooDooDaddy (470187) | more than 10 years ago | (#7547392)

Honestly, the transient nature of webpages makes the web an unsuitable medium for the long-term establishment of "culture". Our categorization-happy, buzzword-ridden nature will have to find a new term for what the web is. Boo-freaking-hoo. Meanwhile I'll keep doing my thing, posting pics for my family to see and putting calendar events up on the web so my homebrew club will know when we're meeting, and I won't worry about any "culture" I might be creating and then destroying when I take stuff back down.

man i need coffee, insomnia is a bitch...

Re:then don't look for culture in web pages... (1)

Urkki (668283) | more than 10 years ago | (#7547562)

The problem is not some random personal web page, which perhaps 2 people ever looked at, disappearing. The problem is that web pages actually referenced by others are disappearing, thus breaking the big web of knowledge that has been forming for as long as we've had the printing press.

There really should be a permanent way of storing web pages, and storing them at the state they were at one given moment of time. So the archiving would naturally be the responsibility of the referer.

We just need a web service for that. It could even be a profitable business: charge for every URI permanently stored there, perhaps by the byte, which would also largely solve the issue of abuse. The only hindrance is copyright law, I think, so it should get the status of a public library...

Re:then don't look for culture in web pages... (2, Informative)

YU Nicks NE Way (129084) | more than 10 years ago | (#7547650)

Even if your statement accurately reflected the concerns in the article, it would still be misguided.

Historians are concerned about all the ephemera of a civilization, not just the "official" ones. The random archives of everyday junk can, and often do, tell a very different story about the civilization than the story that the society would like to hear about itself, so historians treasure those postings of pics for your family to see.

For example, if you read the official press, you'd see a lot of articles about how bad the economy is for IT folk. That's entirely true, as far as it goes, but it only goes so far. The official press talks about the disappearance of jobs, and about the outsourcing of jobs, and about the unemployment rate, but doesn't talk about the fates of individual people displaced by the upheaval. Are the people who've been thrown out of work starving, or are they managing to live and to feed and clothe their families? The official story doesn't cover that -- but those silly little picture pages do, just by showing the children of these unemployed workers well-fed and dressed in new-ish clothes. Web pages are very cheap, so that indicates that the unemployed techies aren't starving.

It's kind of like the character in the play who found out one morning that he'd spent his whole life speaking in prose. You've spent your whole life participating in the culture, and a record of that life is important to a historian interested in your culture.

Don't do that. (4, Insightful)

Valar (167606) | more than 10 years ago | (#7547395)

You probably shouldn't be quoting any kind of "Bob's World of Great Scientific Insight" type pages anyway. I mean, the majority of sites that go under in less than 100 days are the one person operations that one should identify as bad sources anyway. So it might seem obvious that quoting someone's blog in a research paper is just a plain stupid idea, but it happens way more often than you might think.

Look at Microsoft (1, Offtopic)

willpost (449227) | more than 10 years ago | (#7547524)

One month a thoughtful Microsoft programmer will post the bug on a page with a workaround, source code, and a patch using Visual Studio.

The next month the bug officially doesn't exist, the workaround page is gone, the source code is who knows where, and it's .Net

If you go to Linux.org though, the FAQ and bug postings are preserved for all to see.

You're right though, in that Microsoft should be identified as one of those bad sources anyway.

The web can hold insight, in the right field (3, Interesting)

mactari (220786) | more than 10 years ago | (#7547526)

That's a fairly reductionist view if taken too far. Not all researchers are tech whizzes (no pun intended), and I've seen a number of, in my case, professors of English Literature who run the same sort of, "Throw up ten pages with Under Construction signs, test publish a few papers, and let the site sit for years, one day to mysteriously disappear," web site lifespan that "Bob's World" might as well.

Perhaps even more interestingly, it doesn't always really matter if you've done great, repeatable research in the "soft science" fields or outright humanities. You don't have to be a literature expert to have a good insight on "Bartleby the Scrivener". A grad student's blog, as an example, might contain excellent contributions to the conversation.

Now that said, in the context of the article -- dealing with "a dermatologist with the Veterans Affairs Medical Center in Denver" -- I would tend to agree with you heartily. Hard science needs to pull, in my layman's view, from research that the article's author researched well enough to see that it wasn't a few 0's and 1's that might be pulled later, in general.

And heck, what's the harm in saving the pages on your drive and contacting the original author if they disappear? Hard drive space is cheap. If you take yourself seriously, you might want to grab a snap, even if it is technically illegal (not that I know that it is; Google seems to do it quite often).

Throwing out the baby with the bathwater (4, Insightful)

Liselle (684663) | more than 10 years ago | (#7547404)

People are worried about losing the information on the web: but all that is really happening is that the URLs are no good after a while, you lose the snapshot. The information is not necessarily going anywhere. If there is a need or a want, someone will throw it up, or another will host it. That's the beauty of the web, you get the good with the bad, but time has a way of getting rid of the chaff.

What would be interesting would be a website that archives those snapshots for posterity. Well, what do you know, there are several such sites already! Looks like we're in good shape. The sky is not falling. ;)

Reliability (5, Interesting)

lukewarmfusion (726141) | more than 10 years ago | (#7547405)

It's not just the short lifespan of a webpage... it's also the fact that the source isn't always reliable. Web publications are rarely given the same strict editorial process as most journal articles. The content might be just as good - or better - but they're also not given the same credibility.

I'm a recent grad of a University... my freshman year, profs wanted us to start using the Internet more so we were asked to submit at least x number of references from Internet sources. By my senior year, they were trying to get us to stop using the Internet. Using a URL as a reference was sometimes forbidden by the professor.

The final irony? (2, Interesting)

the real darkskye (723822) | more than 10 years ago | (#7547406)

That matters in part because some documents exist only as Web pages -- for example, the British government's dossier on Iraqi weapons.
"It only appeared on the Web," Worlock said. "There is no definitive reference where future historians might find it."
Much like the WMDs themselves then ...

Re:The final irony? (0)

Anonymous Coward | more than 10 years ago | (#7547487)

"It only appeared on the Web," Worlock said. "There is no definitive reference where future historians might find it.

Total bullshit anyway. The Iraq Dossier has a hard copy in the Commons library at least. It sure as hell will be archived from there. Where the fuck did he get this "factoid"? Some shitty web-page?

Re:The final irony? (-1, Offtopic)

Anonymous Coward | more than 10 years ago | (#7547577)

Unfortunately we had to wait a few months before hitting Iraq so they had plenty of time to hide things. Finding any chemicals will be like finding a needle in a haystack. The country is as big as a mid-size state in the US. Now go find every 55 gallon drum in Illinois. Now think of trying to find them in buildings or buried in the sand someplace. That's the scope of the problem. People are just fucking fools if they think this is easy.

I think it interesting that 3 months ago or so even Clinton said that in 98 there were WMD's that were not accounted for.

I also find it instructive that even though Iraq claimed to have no scuds and even though the inspectors found no scuds, on day 1 of this war Iraq started lobbing some scuds. But nobody bothered to notice the inconsistency. Granted, they probably didn't have many, but both Iraq and the inspectors said they weren't there. Hmm, what's next?

I got your solution right here, people... (0)

ubiquitin (28396) | more than 10 years ago | (#7547411)

It's called a header redirect, folks. In one line of php, do:
header ("Location: http://www.newsite.com/over_here.html");

Re:I got your solution right here, people... (1)

mausmalone (594185) | more than 10 years ago | (#7547477)

helps, but not for those pages which are wholly removed. For example, a few faculty members here have some research posted on their personal sites, but they died. Now their sites will be taken down, and anyone referencing that research is gonna have a hard time getting a copy of it.

PHP? Try mod_alias (built-in) (0)

Anonymous Coward | more than 10 years ago | (#7547621)

Redirect permanent /oldschema/foobar.html /newschema/quux.html
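
For comparison, the same permanent-redirect behaviour can be sketched in Python as a plain lookup. The `MOVED` table and the `respond` function are illustrative names rather than part of any server's API, and treating every unlisted path as 404 is a simplification.

```python
# Old path -> new path, mirroring the example paths in the directive above.
MOVED = {"/oldschema/foobar.html": "/newschema/quux.html"}


def respond(path):
    """Return (status, headers) for a request path.

    301 tells clients the move is permanent; anything not in the table
    is treated as 404 here for simplicity.
    """
    if path in MOVED:
        return 301, {"Location": MOVED[path]}
    return 404, {}
```

As with the PHP and mod_alias versions, this helps only while someone still operates the old server and keeps the table in place.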

Rigidity stifles creativity (4, Insightful)

apsmith (17989) | more than 10 years ago | (#7547416)

Any extra effort required to make web pages and their URLs preserved for eternity makes it more difficult for people to create them in the first place, which will mean less knowledge available, not more. Something unobtrusive that goes around preserving pages for posterity, like the Internet Archive [archive.org], is the best solution.

Hardcopy (4, Insightful)

Overzeetop (214511) | more than 10 years ago | (#7547424)

This is why every time I use a web reference I make a hardcopy of it and include it in my research folder. It did not take long for me to figure out that web pages are no more useful than manufacturer catalogs - once the year is up, you might never get that tidbit of information back. If it's too large to want to print, I'll hardcopy the couple of pages I need, and PDF the whole thing for digital storage.

Having a hardcopy (1) documents the information and its (purported) source, and (2) allows offline access for comparison and validation.

Re:Hardcopy (2, Insightful)

lukewarmfusion (726141) | more than 10 years ago | (#7547592)

One problem with using a hard copy is that you're the only one holding that copy. If the site disappears from the Internet, then your readers must rely on your printout (or cache) as a reliable source. You may not have a way to prove that your printout wasn't modified between download and printout. With more traditional methods, there are so many printed copies that such a claim could be disputed easily. I think your solution is the best one under the current situation, though.
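
One partial mitigation, sketched in Python under stated assumptions: record a cryptographic digest of the saved page at download time. The function names are hypothetical, and a digest you hold yourself proves nothing to a skeptical reader; it would only help if it were published or timestamped by a third party when the page was retrieved.

```python
import hashlib


def fingerprint(page_bytes: bytes) -> str:
    """SHA-256 digest of the saved page, as a hex string."""
    return hashlib.sha256(page_bytes).hexdigest()


def unchanged(page_bytes: bytes, recorded_digest: str) -> bool:
    """True if the copy still matches the digest recorded at download time."""
    return fingerprint(page_bytes) == recorded_digest
```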

Don't forget the damage done by censorship! (1)

Jerry (6400) | more than 10 years ago | (#7547432)

I was recently looking for pages about the peer review work of the global warming paper underlying the KYOTO Doctrine. Pages less than a month old were removed. Articles on ABC, Time, CNN and newspaper sites by the hundreds have 'old' pages missing.

There is no substitute for the printed page... yet.

Let me get this straight... (2, Informative)

ericspinder (146776) | more than 10 years ago | (#7547433)

You mean to tell me that those researchers found a dead link on the Internet? The horror. Where can I get one of those jobs!
Another study, published in January, found that 40 percent to 50 percent of the URLs referenced in articles in two computing journals were inaccessible within four years
That's because they were ads for companies that went out of business.

Besides, if you want to see old pages, just go to the Wayback Machine [waybackmachine.org]. Between that and backup tapes, everything you ever wrote still lives (in many cases I wish it didn't!).
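
A small sketch of how a snapshot address for a page could be built, in Python. It assumes the `web.archive.org/web/<timestamp>/<url>` address form that the Wayback Machine uses today, which is not taken from this thread; the function name is made up, and whether a capture exists near the requested date is not checked.

```python
from datetime import datetime


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine address for a page as of a given moment.

    The timestamp is formatted as YYYYMMDDhhmmss. No request is made,
    so this does not tell you whether a snapshot actually exists.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"
```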

Yes, big issue! (4, Interesting)

Erwos (553607) | more than 10 years ago | (#7547440)

I've personally been working (internally so far) on a website of modern-day Orthodox-Jewish responsa to various issues of Jewish law, so this is an issue I've given some thought to.

To say this is some kind of problem specific to the web is misleading. There are old, well-quoted sources of Jewish thought whose texts are simply lost to us in this current day and age. Example: a famous and extremely popular commentary on the Talmud and Torah, Rashi, is missing for at least a few chapters of Talmud. That would be the equivalent of IEEE misplacing some standards papers and then NO ONE having copies, just lost to the sands of time. Yet it did happen, proving this at least _was_ a serious issue.

However, these days, with such things as the Way-Back Machine and Google caching, actually LOSING entire web pages doesn't happen very often, and, I'd bet, it happens far less frequently than the loss of books.


web pages as knowledge (0, Interesting)

Horny Smurf (590916) | more than 10 years ago | (#7547451)

While I use the web as a source of information (information which is unavailable in any other format), I would not cite any information unless I can personally verify it. Would you trust "Anonymous Coward" when he tells you to "click this link [goatse.cx]"? So why would you trust some random website?

Clicked on link (1, Funny)

Anonymous Coward | more than 10 years ago | (#7547555)

Hey, I clicked that link and all I got was some disgusting picture. I am outraged, now I trust nothing on the web. How dare you take advantage of me like that, I have never heard of such a thing. I had thought that all things on the internet are not only important but true, and now I am not too sure. I hope you're happy, Jerk!

Interesting... (4, Funny)

Rinikusu (28164) | more than 10 years ago | (#7547468)

I found that out years ago.. :P

From a researcher's perspective, I used the web primarily as a quick "google" to get some ideas on where I might do further research. For instance, while a particular paper may have been taken offline, many times the search will proffer an author's name. Take that name to the library's database (or google it, too), and you might get a list of more publications that the author has penned. Even better: sometimes you can get a valid email address from other links and write to ask the original researcher himself about various publications; many times they have copies on hand and can send them to you. My research involves the web, but does not end with the web, which is where many people find themselves hung.

Hey, guys. See that big building with those obsolete books? Lots of chicks hang out there. :)

And? (1)

woodhouse (625329) | more than 10 years ago | (#7547472)

I don't see how this is news. Most people who write science papers are well aware of the problems with citing web pages, and we'll try to cite books and published papers wherever possible. Generally, people with something important to say will publish it properly, so this is not usually a problem.

The only people who exclusively cite web pages are likely to be the same people who write bad papers anyway, so I can't see the issue here.

A problem recognized already some time ago.... (4, Interesting)

tsvk (624784) | more than 10 years ago | (#7547475)

Usability expert Jakob Nielsen addressed the issue of linkrot in a column already in 1998: Fighting Linkrot [useit.com].

culture (0, Redundant)

theMerovingian (722983) | more than 10 years ago | (#7547478)

This is no way to run a culture.

Do we run the culture, or does the culture run us?

Re:culture (0)

Anonymous Coward | more than 10 years ago | (#7547603)

Do we run the culture, or does the culture run us?

This isn't Soviet Russia.

What's the problem here ? (5, Insightful)

JackJudge (679488) | more than 10 years ago | (#7547489)

Why would we want to archive 99.9% of today's web content ?
Does anyone archive CB radio traffic ??

It's not a permanent storage medium, never could be, too many points of failure between your screen and the server holding the data.

Re:What's the problem here ? (4, Interesting)

southpolesammy (150094) | more than 10 years ago | (#7547597)

Yes, good point. The Internet is much more akin to CB radio since it is uncontrolled, unverified, entirely volunteer-based, entirely virtual, and highly volatile. By contrast, books, TV, and other media are highly controlled, subject to external verification, have a high cost of entry, are either themselves physical media or require a physical presence in order to communicate, and are largely static in content.

The problem with the Washington Post's article is that their premise is flawed. They assume that the Internet is a mostly static source of information, when it is definitely a mostly dynamic information source. Webpages are meant to be updated, and with updates come change. It's inevitable. To assume that we keep every update to the webpages in separate locations is a false assumption. It's cool to see sites like the Wayback machine do this, but it's not required.

Re:What's the problem here ? (1)

sporty (27564) | more than 10 years ago | (#7547656)

There's a difference. CB traffic is usually casual conversation. People create websites to give out "important" information to a large spectrum.

One has an active listener with active feedback. One is completely passive.

Only in the case of an interview would CB traffic be completely informational. I don't remember the last time I put up a web page to say something to someone. People do put up "web applications" such as forums and chatrooms of sorts.

While they are of the same fruit, they are still apples and oranges.

Backup Your Important Data (4, Insightful)

Slider451 (514881) | more than 10 years ago | (#7547490)

Anything worth publishing digitally should be recorded in a more permanent medium.

I constantly backup all my digital photos because they are important to me. I also print the best ones for placing in photo albums, distributing to friends, etc.

The website they are published to is just a delivery medium, and not even the primary one. It can disappear and I wouldn't care. People who know me can always get access to them. Scientists should view their work the same way.

revisionism (1, Insightful)

bobrankle (680584) | more than 10 years ago | (#7547491)

I would be much more worried about whether the site still said the same thing. What about revisionism? I would wonder whether the reference cited even said the same thing as what it was cited for. It's easy enough to change the pages so that they can be twisted to make the referencer look stupid (if the owner doesn't like their use of the reference) or to just out and out lie after they get referenced. Unless they are locked down, and we all know that is not really possible, someone somewhere will find their way in.

long-term storage needs... (2, Insightful)

mwilliamson (672411) | more than 10 years ago | (#7547493)

This is not just a problem with Web pages, it is a problem with all popular media formats today. How can we make sure future generations will be able to make use of any of our media? (makes me think of a buddy's magneto-optical drive...who the hell else has one) One solution is to actively copy from format to format as technologies change, but this requires constant upkeep throughout the ages. Relying on future generations to maintain our most precious information is not responsible behavior for a culture.

Printed media, while having a low data/pound ratio, has managed to survive and span generations for centuries. I think the need for paper libraries cannot be forgotten. The challenge is distilling out what is worth keeping, and this challenge is better met now rather than later because we have more or less a good idea of what is significant information, and what is crap.

Permalinking and archiving (5, Insightful)

seldolivaw (179178) | more than 10 years ago | (#7547508)

The ephemeral nature of the web is a very real problem, but it's important not to overstate it. The reason so much more information is lost these days is partly a reflection of the fact that we produce so much more of it. The Library of Alexandria was the distilled knowledge of an entire civilisation; it was unique, irreplaceable and massively important information. The web is full of information that is of low quality, often massively redundant (thousands of pages explain the same thing in different ways) and certainly replaceable (the web is not the final repository of the information: it's a temporary place where that information is published). In the same way, for centuries, newspapers have produced thousands of redundant issues with a lifetime of just a few days. The reason no one decries the loss of our newspapers is because the publishers themselves still archive the information, even if this is somewhat hard to get to. The same is true of web pages, only the number of publishers is vastly larger.

Individual newspapers had their own ways of making their archives public (in many cases for a fee) because storing that information is a cumulative, ever-increasing cost. On the web that cost is much lower, but still present. In addition, there's the question of relevancy: www.mysite.com/index.html may contain valuable information, relevant enough to be on the front page today, but in a week's time you don't want it to still be there. So what we need is archiving, for the web.

But manual archiving is inefficient and a pain to maintain, since it involves constantly moving around old files, updating index pages, etc. Plus linkers don't bother to work out where the archive copy is eventually going to be: they link to the current position of the item, as they should.

So what the web needs is automatic archiving. One way to do this (a solution to which was the partial subject of my final year project at uni) is to include a piece of additional metadata (by whatever mechanism you prefer) when publishing pages: data that describes the location of the *information* you're looking for, not the page itself. So mysite.com/index.html would contain meta-information describing itself as "mysite news 2003.11.23 subject='something happened today'". User-agents (browsers), when bookmarking this information, could make a note of that metadata and provide the option to bookmark the information rather than the location (sometimes you want to bookmark the front page, not just the current story). Those user agents, on returning to a location to discover the content has changed, could then send the server a request for the information, to which the server would reply with the current location, even if that's on another server.

Of course, this requires changes at the client side and the server side, which makes it impractical. A simpler but less effective solution is for the "archive" metadata to simply contain another URL, to where the information will be archived or a pointer to that information will be stored. This has the advantage of requiring only changes to the client-side.
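To make the simpler variant concrete, here is a minimal sketch of the client-side half in Python. It assumes the page carries a `<meta>` tag whose name, `archive-location`, is purely made up for illustration (no such standard exists), and it only shows how a user agent might read that tag and prefer it over the live URL.

```python
from html.parser import HTMLParser
from typing import Optional


class ArchiveMetaParser(HTMLParser):
    """Remembers the content of the first <meta name="archive-location"> tag.

    The tag name is a hypothetical convention, not an existing standard.
    """

    def __init__(self) -> None:
        super().__init__()
        self.archive_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once a location has been found.
        if tag != "meta" or self.archive_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "archive-location":
            self.archive_url = attributes.get("content")


def find_archive_url(html_text: str) -> Optional[str]:
    """Return the declared archive URL, or None if the page declares none."""
    parser = ArchiveMetaParser()
    parser.feed(html_text)
    return parser.archive_url


def resolve_bookmark(bookmarked_url: str, html_text: str) -> str:
    """Prefer the declared archive location; fall back to the live URL."""
    return find_archive_url(html_text) or bookmarked_url
```

A browser would call `resolve_bookmark` at bookmarking time and store the result, so the bookmark follows the story into the archive instead of pointing at whatever the front page says next week.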

Suggestions of better solutions are always welcome :-)

Make sure you have a paper reference. (1)

Kjella (173770) | more than 10 years ago | (#7547511)

Personally, I find web links *can* be much more efficient than having to dig out an issue of some science journal (which the local library will *not* have, and your request will be forwarded by carrier snails), if they're there.

But, always the paper reference. If it doesn't have one, it'd sure better be a reference to a known professor somewhere, so whoever is interested can dig up a homepage somewhere. If it doesn't even have that, don't use it.

Personally, I haven't found it that difficult to cite articles and such. Sources of information are a much bigger problem: statistics, overviews and similar reference material. They are often moved/updated/reorganized/removed and you have no idea about it.


Reason for this? (4, Insightful)

bobthemuse (574400) | more than 10 years ago | (#7547512)

The article states that the average life for a website is 100 days, but wouldn't journals and formal publications (the most often cited documents in research) last longer than the average? Also, is the average skewed because websites are more likely to contain 'current information'? "Average lifetime" is misleading: does this mean the average time the page stays the same, or the average time before the information in the page is unavailable?

But compared to what? (0)

Anonymous Coward | more than 10 years ago | (#7547514)

Given that 99% of the web pages out there would never have been written in the first place, 100 days seems better than 0 days, doesn't it?

The advantage of an easy-to-use, disposable medium is in the low cost of publishing. But that low cost opens the door to things less worthy of writing down in the first place.

If you want to do serious research.... (3, Insightful)

RobertAG (176761) | more than 10 years ago | (#7547516)

Then DOWNLOAD the pages from your web citations.

For example, a short time ago, I did a white paper on power scavenging sources. About 1/2 the articles I read were HTML or PDF sources. Rather than just citing the URL, I downloaded/saved every online article I referenced. If someone wants the source and cannot find it, I'll just provide it to them. If your paper is going to be read by a number of people, it makes good sense to have those sources on-hand; it never hurts to cover your arse.

Hard drive/Network/Optical space is virtually unlimited, so storage isn't a problem. Paper journals are archived by most libraries, anyway, so until they start archiving technical sources, I'm going to have to do my OWN archiving.
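For anyone who wants to automate that habit, a rough sketch in Python follows. The function names, the folder layout and the `citations.jsonl` log file are all my own inventions for illustration; the idea is just to keep the raw bytes of each cited page together with the URL, the retrieval date and a checksum.

```python
import hashlib
import json
from datetime import date
from pathlib import Path
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Download the raw bytes of a cited page (needs network access)."""
    with urlopen(url) as response:
        return response.read()


def save_citation(url: str, content: bytes, folder: Path, retrieved: date) -> dict:
    """Store the bytes under their SHA-256 name and log where they came from."""
    folder.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()
    (folder / digest).write_bytes(content)
    record = {"url": url, "retrieved": retrieved.isoformat(), "sha256": digest}
    # Append one JSON record per line so the log stays easy to grep later.
    with (folder / "citations.jsonl").open("a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

Typical use would be `save_citation(url, fetch(url), Path("sources"), date.today())` once per reference. Naming the file after its hash means a page saved twice without changes is stored only once, and a changed page shows up as a new file.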

Cool URIs don't change (4, Interesting)

KjetilK (186133) | more than 10 years ago | (#7547537)

May I remind everyone to read and understand TimBL's Cool URIs don't change [w3.org]. It's not that hard to design systems where you do not have to change the URI every 100 days, folks.

URL + date (2, Insightful)

More Trouble (211162) | more than 10 years ago | (#7547539)

Proper URL citations include the date. I'm not worried so much about the page being taken down (since it is presumably archived) as about it changing. If you don't record which version you were referring to, the content can change dramatically.
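As a small illustration, here is a Python sketch that formats a dated citation and also builds a link to an archived copy from around that date. The citation layout is just one possible style, and the archive link assumes the Wayback Machine's `web.archive.org/web/<timestamp>/<url>` address pattern; whether a snapshot actually exists for that date is not checked.

```python
from datetime import date


def cite(title: str, url: str, accessed: date) -> str:
    """Format a web citation that records when the page was read."""
    return f"{title}. {url} (accessed {accessed.isoformat()})"


def wayback_url(url: str, accessed: date) -> str:
    """Build a Wayback Machine address for the snapshot nearest the access date.

    Assumes the web.archive.org/web/<timestamp>/<url> pattern; it does not
    verify that any snapshot was actually taken.
    """
    return f"https://web.archive.org/web/{accessed.strftime('%Y%m%d')}/{url}"
```

With both strings in the bibliography, a reader can at least tell which version of the page was meant, even if the live copy has since been rewritten.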


DSPACE (1, Informative)

Anonymous Coward | more than 10 years ago | (#7547540)

Look at DSpace [mit.edu], the mission of which is "To create and establish an electronic system that captures, preserves and communicates the intellectual output of MIT's faculty and researchers."

Each data set (collection) has a handle [handle.net], supposedly longer lasting than URNs. We're talking about long term data storage here.

There's an implementation [cam.ac.uk] of it at Cambridge University, and my organisation will be evaluating it as soon as the SuSE Linux Enterprise Server software lands on my desk and I've installed my server.


Re:DSPACE (3, Interesting)

tomknight (190939) | more than 10 years ago | (#7547595)

Bugger, forgot to log in.


cant erase my usenet postings (5, Interesting)

peter303 (12292) | more than 10 years ago | (#7547545)

I started posting to Usenet in the late 1980s. These g*dd*mn things are still on the net. I was less guarded at that time. Everyone *knew* then, because disk space was so scarce, that Usenet postings would disappear in 7-14 days.

Pretty revealing quote, isn't it? (0)

Anonymous Coward | more than 10 years ago | (#7547549)

"This is no way to run a culture."

I was unaware that a culture needs to be run.

Signal:Noise (1)

goldspider (445116) | more than 10 years ago | (#7547550)

With such a low signal:noise ratio on the Web, would you really want to capture everything?

Good record-keeping doesn't necessarily mean keeping everything, just stuff worth keeping.

But... (0)

Anonymous Coward | more than 10 years ago | (#7547552)

...Google [google.com] will stay.

If the authors are too stupid to include phrases for good search results instead of dead links, I don't need their book.

Information comes and goes. Important things stay.

Blogging Fragments, Like the Ancients (1)

handy_vandal (606174) | more than 10 years ago | (#7547563)

I collect miscellaneous links on my web site. Over time, I've started adding excerpts along with links. The excerpts help remind me what the link was about, but they also serve another purpose: when the link goes bad, I can use keywords in the excerpt to search for related pages on the web.
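The keyword trick can be roughed out in a few lines of Python. This is only a sketch under my own assumptions: the stopword list is a tiny hand-picked sample, and "longest words first" is a crude stand-in for picking distinctive terms.

```python
import re
from typing import List

# A deliberately tiny, hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {"that", "this", "with", "from", "have", "were", "them", "they", "about"}


def keywords(excerpt: str, limit: int = 5) -> List[str]:
    """Pick up to `limit` distinct words from the excerpt, longest first."""
    candidates: List[str] = []
    for word in re.findall(r"[a-z']+", excerpt.lower()):
        if len(word) > 3 and word not in STOPWORDS and word not in candidates:
            candidates.append(word)
    # sorted() is stable, so words of equal length keep their original order.
    return sorted(candidates, key=len, reverse=True)[:limit]


def search_query(excerpt: str, limit: int = 5) -> str:
    """Join the chosen keywords into a query string for any search engine."""
    return " ".join(keywords(excerpt, limit))
```

When a link dies, pasting the result of `search_query(saved_excerpt)` into a search engine is a quick way to hunt for a mirror or a moved copy.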

Our knowledge of ancient history has proceeded in a similar manner. Much of what we know about, say, pre-socratic philosophers, we know because of references in Aristotle and other later scholars. The original sources may be totally lost, but at least we have some names and quotations.


the problem is bigger (5, Insightful)

professorhojo (686761) | more than 10 years ago | (#7547565)

it's not simply webpages that are the problem. it's digital storage in toto.

because we as a generation are quickly moving away from our previous long-lived forms of storage, and toward digital management of archives, it's trivial for someone to decide to unilaterally delete (not backup?) a whole decade of data in some area of our history.

i remember the photographer who found the photograph of bill clinton meeting monica lewinsky 10 years ago. he was in a gaggle of press photographers, but nobody else had this picture because they were all using digital cameras and he was still on film. most of their pictures from that day had been deleted years ago since they weren't worth the cost of storing. but this guy had it on film.

yes. websites are disappearing. but there's a greater problem lurking in the background. the cost of preserving this stuff digitally, indefinitely. who's going to pony up the cash for that? unfortunately, no one. and we'll all ultimately pay dearly for that... (hell -- we already have trouble learning from the past.)

Give and take - it's cultural change, dummy. (4, Insightful)

3Suns (250606) | more than 10 years ago | (#7547567)

Easy come, easy go... here's another cliche: Give and Take. What's great about the web is that it has effectively demolished the barriers to entry in publishing. Everybody and their grandmother has a blog now - you can't compare webpages to magazine articles or newspapers. There's just so much more information being published now that its average lifespan is bound to go down. So what?

Publications that cite [web pages] lose their authorities? Who the hell told you to cite a webpage? Might as well cite a poster you saw downtown. If the webpage is a reputable source in the first place, it'll keep it around permanently. Still better than scientific journals that are squirrelled away in the basements of university libraries - anyone can get to a webpage.

This is no way to run a culture. Last time I checked, nobody ran our culture... It kinda runs itself. The proliferation of accessible, ephemeral webpages over permanent, privileged paper publications (wah, too many p's!) is a sign that our information culture has moved on into a new era. Liked the old one? Tough! Now information has to maintain its own relevance in order to be permanent... and I for one welcome that change.

Uhhhhh... (0)

Anonymous Coward | more than 10 years ago | (#7547572)

...ever heard about google? Damn I even don't bother to put the link to google here because it's so obvious. ..again.. slashdot "news"

And the Fizz in My Mntn Dew Is Gone Even Sooner (1)

RobotRunAmok (595286) | more than 10 years ago | (#7547576)

But wait, I think I still have a Mosaic presskit from the '91 Comdex. Does that count?

It's not Web Pages, it's the Web itself that will be the cultural artifact. With the bar for publishing on the Internet placed so low, it falls to Father Time to become the Web's ultimate Editor-in-Chief.

On a related note, I'm moving, and came across reams of stuff I wrote while a college student, and boy does it suck! Tonight I light a candle to Neil Gaiman's I-Net God in thanks that my potentially career-wrecking pukage is preserved only in patchouli-smeared folders in my basement and not on a global network of servers. I feel like I've dodged more bullets than Neo; in 20 years when you guys do a vanity-google, I hope y'all feel the same way, but I'm guessing you will have wished the Web was even more forgetful than it is.

Question (1)

Texodore (56174) | more than 10 years ago | (#7547582)

Is this an authority actually questioning the validity of the Internet and its use in research? Or is the authority simply using this as a ruse to say, "read our publication, as the Internet is making it outdated"? I tend to vote for the latter. I've read too much good research on the Internet - valid research - overlooked by the mainstream medical and science community just because it didn't mean more money for someone or didn't fit the status quo.

maybe a new TLD for this? (1)

line.at.infinity (707997) | more than 10 years ago | (#7547612)


The idea being that files uploaded here are expected to stay up permanently. Then professors can say URLs with *.arc are o.k. for references, and *.RespectedName.arc/* are even better. In the math community, articles from arxiv.org and a few others are generally respected sources. This might not be the case for cultural studies, for example, where there are no central repositories. If there were more and better permanent archiving services, this would be less of a problem. Maybe the government could run such a service?

Being wrong is surely worse; commentary techniques (1)

arsinmsn (602830) | more than 10 years ago | (#7547629)

Once recognized, a site may deserve preservation.

Web pages that are flat-out wrong and un-moderated are all the better for being ephemeral. I've often wished for a meta-critical facility like slashdot's ranking system for general web pages. This too is problematic, though; sadly, those with the most commitment to cruising around the web & inserting commentary are rarely the most qualified.

Wikipedia is an interesting example; try looking up a topic you know something about there. Even if you were to spend the considerable time necessary to iron out all of the misconceptions in many of the articles, there is no guarantee that someone won't come along the next day with an ax to grind and undo all your work.

Sorry if this is OT, but highlighting reliable info seems a more pressing issue.

No way to run a culture? (3, Insightful)

theolein (316044) | more than 10 years ago | (#7547635)

As the board chairman of the Internet Archive says, "The average lifespan of a Web page today is 100 days. This is no way to run a culture."

To the contrary, I think this is highly typical of the culture we have today, where everything is a transient fad in the media, technology and politics.

And it is also self feeding, I think, since market forces need to clear out the old to make room for the new in order to meet sales forecasts and shareholder expectations. And this is very true for pop, news and technology, which explains the lack of staying power of pop icons these days and becomes interesting when you want to ask yourself if you really need that new 3GHz machine just to surf the web.

And it is highly convenient in politics where a politician doesn't have to be accountable for what he said 100 days ago.

And so, the lack of long time life on the web is simply symbolic of all the rest here really, even if it is highly questionable.

Oh come on. It's not as if... (1)

csoto (220540) | more than 10 years ago | (#7547636)

secretaries didn't print them out for their PHBs to read and stuff.

quality references (1)

Dr. GeneMachine (720233) | more than 10 years ago | (#7547641)

The Washington Post reports on the loss of knowledge in ephemeral web pages, which a medical researcher compares to the burning of ancient Alexandria's library.

Err, which serious medical researcher would cite a web page? Everything remotely reliable, that is, in science at least, peer reviewed, is published in journals. While these may have a web appearance, they are also published in print - and that's what you cite.

What about... (0)

Anonymous Coward | more than 10 years ago | (#7547659)

..Google and archive.org? Is the knowledge about these two sites gone, too?

Poor humans. Watch MTV and stay informed about everything you're supposed to know.

genguid and google (2, Insightful)

hey (83763) | more than 10 years ago | (#7547680)

Use genguid (or another tool) to make a globally unique number, and place that number at the bottom of your page along with a link to Google's "I'm Feeling Lucky" search for the GUID.
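A quick Python sketch of the idea, using the standard `uuid` module in place of genguid. The `btnI` query parameter is my assumption about how an "I'm Feeling Lucky" link is spelled, and the footer markup is only an example; neither comes from the post above.

```python
import html
import uuid
from urllib.parse import urlencode


def new_page_id() -> str:
    """Generate a random, globally unique identifier for a page."""
    return str(uuid.uuid4())


def lucky_link(page_id: str) -> str:
    """Build a search link for the quoted GUID.

    The btnI parameter is assumed to trigger "I'm Feeling Lucky".
    """
    query = urlencode({"q": f'"{page_id}"', "btnI": "1"})
    return f"https://www.google.com/search?{query}"


def footer_html(page_id: str) -> str:
    """Render a footer line that shows the GUID and links to the search."""
    href = html.escape(lucky_link(page_id))
    return f'<p>Permanent ID: <a href="{href}">{page_id}</a></p>'
```

The GUID has to appear as visible text on the page so a search engine can index it. As long as the page is indexed somewhere, the link finds it no matter which server it has moved to.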

Ephemeral Data is no data (1)

grolaw (670747) | more than 10 years ago | (#7547682)

Loss of reference links is worse than having no data.

In law a citation may be relied upon for a judicial ruling. If the citation is valid at the time of the original ruling, but no longer in existence when the case is reviewed on appeal (typically 2-5 years later), then the question of the validity of the precedent cited becomes the issue rather than the authority of the citation. The whole legal construct is built upon stare decisis, and if what goes before vanishes into cyber-haze then the usefulness of web citations is nil.

Of course, Westlaw (tm) and Lexis/Nexis (tm) will have redirectors for their pages - but the cost of those services is very, very high. Infrastructure is costly even when the content is copyright free.
