Slashdot: News for Nerds

Wayback Machine Safe, Settlement Disappointing

Zonk posted more than 7 years ago | from the get-me-out-of-here-mr-wizard dept.

Jibbanx writes "Healthcare Advocates and the Internet Archive have finally resolved their differences, reaching an undisclosed out-of-court settlement. The suit stemmed from HA's anger over the Wayback Machine showing pages archived from their site even after they added a robots.txt file to their webserver. While the settlement is good for the Internet Archive, it's also disappointing because it would have tested HA's claims in court. As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."

182 comments

Simple post (3, Informative)

Kagura (843695) | more than 7 years ago | (#16019777)

Exclusion policy.... (1, Informative)

Anonymous Coward | more than 7 years ago | (#16019815)

The whole exclusion policy [archive.org]

Thought I'd go karma slutting; maybe a load of karma will hit you too. ;-)

Re:Simple post (0, Flamebait)

Anonymous Coward | more than 7 years ago | (#16019836)

Why is the parent redundant? The summary does not list the URL at all

Re:Simple post (3, Insightful)

freehunter (937092) | more than 7 years ago | (#16020500)

Uhh, yeah, the summary linked to www.archive.org/web/web.php or something like that, which is the site in question. I know everyone likes to rag on the editors, but they aren't horrible, at least not all the time.

Info published on the Internet... (2, Insightful)

msauve (701917) | more than 7 years ago | (#16020276)

shouldn't be copyrightable - there is nowhere more "public domain" than the Internet. Same with radio/TV - anyone who makes use of the public airwaves should sacrifice any claim to copyright for that privilege. If someone wants to control their works through copyright, they should use controlled, private distribution.

I'll no doubt have lawyers (and lawyer wannabes) protesting - but that only follows the literal and common sense meaning of "public domain," instead of the legal rationalization which has been brought about by those who want to have their cake, and eat it too.

Re:Info published on the Internet... (1)

Khuffie (818093) | more than 7 years ago | (#16020316)

Erm, if I post something on my website (which I bought the domain for and paid for hosting), it is not a public space, since I paid for it. Stuff on www.whitehouse.gov, on the other hand, would be, since tax payer money paid for it.

Re:Info published on the Internet... (3, Insightful)

Lactoso (853587) | more than 7 years ago | (#16020363)

And just what does that check to your hosting company pay for aside from the physical location and maintenance of the webserver? Propagation of your website's IP address to DNS and bandwidth. And what do you need bandwidth for if not to share your web pages with the internet at large...

Re:Info published on the Internet... (5, Informative)

phulegart (997083) | more than 7 years ago | (#16020699)

so if my content is behind a protected "members area" then it is still public domain and should be freely available? If I am a photographer, and my site clearly states that all images are copyright of a certain date and that use of them without my permission is forbidden, that means nothing? If someone uses images of me without my permission, that they got from a website or protected members area, how is it that I can get them removed by complaining? If they are public domain, then it should be my tough luck, right?

If I post your credit card and bank information on a forum site, does that mean it is now public domain and you have no protection?

If I post on a forum site that I am selling stolen credit card info and bank info, my post should not be touched, because it is public domain and it should be freely available?

Re:Info published on the Internet... (3, Interesting)

iminplaya (723125) | more than 7 years ago | (#16021286)

If I post your credit card and bank information on a forum site, does that mean it is now public domain and you have no protection?

If anything bad comes from it, it only means that the banks employ weak security. That information by itself should mean nothing. Complain to the financial institutions, not the person who posts it. Make it the bank's problem and it will go away. Don't use their services until they make it secure without making it unduly inconvenient for the customer. The silly passwords and 20 minute waits for failed logins do nothing for security. Make financial security the institution's responsibility instead of suppressing the flow of information. And furthermore, you know what you can do with your copyrights. If you don't want people to use your photos keep them to yourself. If you don't want your information divulged, then don't reveal it to anybody.

Re:Info published on the Internet... (2, Insightful)

phulegart (997083) | more than 7 years ago | (#16021385)

what you are saying, is that the person who puts the information on the internet, is the one who decides if it is public domain. As opposed to the person to whom the information belongs.

You know the current standard the US follows, for copyright of printed works, is LIFE+70 years? That means that once the author copyrights their work, the copyright is good for 70 years after they die. Only after the copyright expires and it is not renewed, the work becomes public domain.
http://onlinebooks.library.upenn.edu/okbooks.html [upenn.edu]
There are some specific exceptions based on when the work was copyrighted, when it was published, in what country it was published, whether or not the copyright notice was properly added to the work, and more.

To continue the library analogy I started earlier, the internet is a library. Websites are the books. Each must be treated as an individual entity. If someone steals your identity through a phishing scam, and uses that info immediately, then sure you might be able to get out of liability by appealing to your bank. Does that mean that phishers should be allowed to run their scams freely and uncontested, because they can just post your info and declare it public domain, which would then in turn give them license to use that info however they wanted?

What if YOU didn't put those photos on the internet? What if your Ex Girlfriend stole them by using your spare key when you were at work? Sorry charlie, they are on the net now and are public domain? I don't think so.

Re:Info published on the Internet... (1)

iminplaya (723125) | more than 7 years ago | (#16021443)

Phishing is only an issue due to the ineffective (and, I believe, intentionally so) authentication employed by the financial institutions. And now we use that scam to suppress the flow. Make no mistake, it's the feeble security methods that make phishing so profitable, not the exposure of your information. If you reveal the info to anybody, it's no longer exclusively yours. Make the trustholders trustworthy. Weak, selectively enforced laws will never cut the mustard, as we are witnessing today.

Re:Info published on the Internet... (2, Informative)

phulegart (997083) | more than 7 years ago | (#16021567)

Phishers do not deal with security. Phishers deal with unsuspecting and uneducated internet users. I'm sorry you are so scared to do it, but really.. go ahead and visit http://paypal-protect.org./ [paypal-protect.org.] It is a phishing site that we are attempting to take down. Go ahead and login with a bogus email and garbage password. It doesn't check anything beforehand. It simply takes you into a site that, aside from the URL, does look like Paypal. You are then asked to provide everything. Name, address, social security, even the PIN for your credit card. It won't even allow you to proceed without your PIN. Then, after you submit your information (which is then sent to whoever is running the scam), you are redirected to the actual Paypal site.

Now, if a poor sap fell for it, anything that sap could have done online that involved money, the phisher can do.

You want to try to make the distinction about "If you reveal your info".. well, what if I worked at the gas station you frequent, and I copied your credit card info and ccv2 number from the back, when you made a purchase? OOPS, it was YOUR fault for actually buying something. According to you, the only way to be safe is to isolate yourself from the world, and make everything you need from scratch. No one should be responsible for protecting your interests.

If I grabbed your info from your trash, it's your fault, right? because you didn't incinerate your trash, right?

You are wrong, in that everything posted on the internet is public domain. That is an assumption you are attempting to back up with obfuscation. What is posted on the internet is no different than what is on the shelf in a library, what is on TV, and what is on the radio. You have the right to enjoy it. You do not have the right to rebroadcast it without permission.

Re:Info published on the Internet... (0)

Anonymous Coward | more than 7 years ago | (#16021445)

Are you really this dense or do you have some sekrit, soon to be revealed point you're trying to make? If your content is not 'published' to the masses (in a passworded user area, intended to be private), then no. If you have a separate and evident copyright in effect over your content, then it is obviously governed by whatever copyright that is.

Your other examples are obviously illegal acts and not covered by fair-use or copyright laws.

Re:Info published on the Internet... (2, Insightful)

phulegart (997083) | more than 7 years ago | (#16021005)

Here's a little story... it deals with archiving and the like.

My friend's hosting service got hacked. We caught it right away, before a site had been put into place, but the individuals attempted to put up the site http://paypal-protect.org./ [paypal-protect.org.] We shut them down quick. They went on to hack another hoster, and currently have their little phishing site up and running. I suggest you go to the site, and without using ANY real information, login with a bogus email and password, and check it out. If you take a look at the WHOIS entry for paypal-protect.org, you will see a name and address of an actual individual. We called this guy and told him that it was likely his name, info and credit card were used illegally to register the domain.

The important thing to notice is the EMAIL contact in the WHOIS entry. Go ahead, and do a google search for that email address. You will turn up two forum posts this guy made, where he is selling credit card info, bank info, Ccvv2 numbers and more. Now, the first result in your google search is a post at paypalsucks.com. You would not BELIEVE what it took to get the admin there to remove the post. And his policy wasn't to remove posts normally, but to just move them to a "garbage" thread, which would still be publicly available. The second and third results in your google search were a post left on a free board that was created at anyboard.net. I was able to get that board taken down within 12 hours of notifying the host, netbula. The board was being used for lots of CC resellers, for at least 5 years before I got it shut down. How do I know? Three of those years are archived at archive.org.

However... EACH OF THOSE POSTS is still there in the google cache. Go ahead and see. Why is this important? Because all you need to see, if you are in the market to buy stolen Identities and credit cards, is the contact information. It does not matter if it is in an archive, or if it is in an active forum. Archiving it has made it virtually impossible to remove from the net, because now there is no way of knowing exactly who has archived this information.

Now, I've not provided clickable links for a reason. I've provided enough information here, that if you want to check my facts, you can do so.

A library might be public domain, but the books within are not. There are some books that ARE considered public domain, but that does not mean that EVERY book is public domain.

Mod parent up. (1)

piper-noiter (772438) | more than 7 years ago | (#16021280)

He makes an interesting point about how archives make it hard to delete previous illicit activity.

Jimmy James says. .. (1, Informative)

Anonymous Coward | more than 7 years ago | (#16019780)

"Dave, don't mess with the man with the wayback machine."

... I could make it so you were never born. (3, Interesting)

Corngood (736783) | more than 7 years ago | (#16020810)

You missed the best part of the quote.

I want.... (4, Funny)

Whiney Mac Fanboy (963289) | more than 7 years ago | (#16019783)

Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."

I want a search engine that only indexes items excluded in the robots.txt file :-)

Re:I want.... (2, Interesting)

hackstraw (262471) | more than 7 years ago | (#16019860)

I want a search engine that only indexes items excluded in the robots.txt file :-)

What's interesting is that I've heard of robots that do that exclusively. It may have been here on slashdot, but I've heard of people putting stuff in their exclude list in robots.txt and some robots _ONLY_ searched those files.

A world without cooperation (5, Insightful)

Anonymous Coward | more than 7 years ago | (#16019999)

Obeying robots.txt is "voluntary" in the same sense that obeying RFCs is voluntary. In other words, it isn't. You can technically ignore any and all standards, but there will be sanctions. In the case of robots.txt, these sanctions can very well be court ruling against you, because robots.txt is an established standard for regulation of the interaction between automated clients and webservers. As such it is an effective declaration of the rights that a server operator is willing to give to automated clients in contrast to human clients. This is especially important with regard to services which mirror webpages. Doing so without the (assumed) consent of the author is a straightforward copyright violation and if the author explicitly denies robot access, then the service operator knowingly redistributes the work against the author's will.

Even if you don't fear the legal system, disregarding robots.txt can quickly get you in trouble. There are junk-scripts which feed bots endlessly and there are automated blocklists for misbehaving bots. If people program their bots to ignore robots.txt, these and possibly more proactive self-defense mechanisms will become the norm. Is that the net you want? Maybe obeying robots.txt is the better alternative, don't you think?
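
For readers wondering what such a self-defense mechanism might look like, here is a minimal sketch, not any particular product. It assumes a hypothetical trap path (/bot-trap/) that robots.txt disallows and that no human-facing page links to, so only a client ignoring robots.txt would ever request it. The class name, the path and the status codes returned are all illustrative assumptions.

```python
class TrapBlocklist:
    """Blocklist any client that requests a path robots.txt told it to avoid."""

    def __init__(self, trap_prefix="/bot-trap/"):
        # Hypothetical trap location; it would be listed under Disallow in robots.txt.
        self.trap_prefix = trap_prefix
        self.blocked = set()

    def handle(self, client_ip, path):
        """Return an HTTP-style status code for this request."""
        if client_ip in self.blocked:
            return 403  # already caught once, keep refusing
        if path.startswith(self.trap_prefix):
            self.blocked.add(client_ip)  # the client ignored robots.txt
            return 403
        return 200


blocklist = TrapBlocklist()
print(blocklist.handle("203.0.113.5", "/index.html"))       # normal page
print(blocklist.handle("203.0.113.5", "/bot-trap/a.html"))  # walks into the trap
print(blocklist.handle("203.0.113.5", "/index.html"))       # refused from now on
```

Keying on the IP address is the simplest choice for a sketch; a real deployment would have to deal with shared addresses and proxies.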

Re:A world without cooperation (-1, Flamebait)

Anonymous Coward | more than 7 years ago | (#16020021)

Shut it and go upstairs, you faggot. Mom's made your favourite dinner.

Re:A world without cooperation (2, Interesting)

pimpimpim (811140) | more than 7 years ago | (#16020086)

Yeah, and in a world with full cooperation, you wouldn't have to lock your door because no-one would enter your house, as that would mean that there will be serious actions against them. Dream on, mr AC! robots.txt is a flaky way of security, and everyone knows it. If I wanted to find out something nasty/interesting about a certain company, I'd look at the robots.txt files to see what I could find.

Furthermore, there are perfectly good ways to lock content away from the outside in a more rigorous way: password-protected pages, pages only accessible via VPN to the intranet, etc. All other information that is put unlocked, unencrypted, on the internet can be considered open. There will be some chance that you will find it accidentally, for example.

Re:A world without cooperation (5, Insightful)

Anonymous Coward | more than 7 years ago | (#16020201)

An attitude like yours is exactly why people go to court over these things. If you don't even adhere to the most basic rules, then it's easier and less costly to have you pay my lawyers and a fine instead of trying to stop robots from reading information that human users are supposed to see without difficulty. The lack of common courtesy on the net is disconcerting. The server tells you in no uncertain terms that you are not welcome, but you keep requesting "forbidden" pages. Consider an analogous situation in real life: You are walking in the park and someone asks you for a dollar. You decline, but the beggar keeps asking. You're saying that accepting your first denial as binding is "voluntary" and the beggar can keep bugging you as long as he likes. If that happened to me twice, I'd have the asshole arrested, and that's exactly what you're going to see online if people don't behave, especially when their behaviour leads to copyright violations which would have been avoided if they had followed the robot exclusion standard.

A world without culture. (0)

Anonymous Coward | more than 7 years ago | (#16020727)

"Consider an analogous situation in real life: You are walking in the park and someone asks you for a dollar. You decline, but the beggar keeps asking. You're saying that accepting your first denial as binding is "voluntary" and the beggar can keep bugging you as long as he likes. If that happened to me twice, I'd have the asshole arrested, and that's exactly what you're going to see online if people don't behave, especially when their behaviour leads to copyright violations which would have been avoided if they had followed the robot exclusion standard."

And yet no one sees the analogy between the above and those "please do not copy" reminders on artists' web pages. Maybe we can pull out that old slash-standby (you locking up MY culture with your robots.txt file).

Re:A world without cooperation (0)

Anonymous Coward | more than 7 years ago | (#16021184)

You are an idiot.

A more apt analogy would be wearing a t-shirt with writing on it while holding a sign that says "do not take my picture"

You can't sue me for taking your picture.

Re:A world without cooperation (1)

iminplaya (723125) | more than 7 years ago | (#16021354)

...and the beggar can keep bugging you as long as he likes.

As long as he doesn't physically assault you, you should have no recourse. Copyright violations do not relate to physical assault and removal of your physical property that would deny you the use of that property. Copyright violation is nothing more than the denial of a special privilege granted by the government. A privilege that has been abused for far too long.

Re:A world without cooperation (0)

Anonymous Coward | more than 7 years ago | (#16021584)

Spidering isn't a copyright violation you twats, republishing it might be. Just because I ignore robots.txt means nothing. I may be gathering statistics or building a hash of content to determine if your site's updated. Is it rude...yes--but fortunately that isn't unlawful in most of the world.
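
The "hash of content" idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the page body has already been fetched somehow, SHA-256 is an arbitrary choice of hash, and the function names are made up for the example. Only the digest is stored, not the page itself.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Return a hex digest that stands in for the page content."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_fingerprint: str, body: bytes) -> bool:
    """True when the current body no longer matches the stored digest."""
    return content_fingerprint(body) != previous_fingerprint


# Placeholder page bodies; a real spider would fetch these over HTTP.
first_visit = b"<html><body>Version 1</body></html>"
second_visit = b"<html><body>Version 2</body></html>"

stored = content_fingerprint(first_visit)
print(has_changed(stored, first_visit))   # same bytes, same digest
print(has_changed(stored, second_visit))  # different bytes, different digest
```

Any byte-level difference changes the digest, so pages with rotating ads or timestamps would register as updated on every visit.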

And to the parent--obeying RFCs most certainly *is* voluntary--subject to the same rules as any other social club. Or are you going to have me fined for violation of RFC 1855 ( http://www.faqs.org/rfcs/rfc1855.html [faqs.org] ) and pointing out that I think you're fscking tools?

Go to hell with your mindless legal threats. Having a beggar arrested for asking for change twice--it's people like you who give a bad name to freedom loving people.

Re:A world without cooperation (0)

Anonymous Coward | more than 7 years ago | (#16021082)

WRONG!!! In this case voluntary means there is no law forcing anyone to follow the robots.txt file. Webbrowsers do not follow it, so I can easily view the material that is "protected" from robots. There is no implied contract for the visitor (human or automated) following the prescription of another file on the same system. It is simply a standard for friendly (or dare I say moral) use of a system.

Copyrights and robots.txt files have nothing to do with each other. If a search engine can create a set of keywords from your publicly available information, or cache that information, without copyright infringement (I do not know how the law treats this point), then a robots.txt file does not change the copyright status of the information.

A robots.txt file is not the same as a door with a lock. Simply placing a file in a publicly accessible place is like leaving a very expensive HDTV in a dumpster (garbage is public domain in most localities, police don't even need a search warrant for it), with a big sign that reads, "If you want to be nice, don't take the TV in the dumpster."

Re:A world without cooperation (2, Informative)

grumbel (592662) | more than 7 years ago | (#16021325)

Obeying robots.txt is "voluntary" in the same sense that obeying RFCs is voluntary. In other words, it isn't.

How about we have a look at what the RFC drafts (it's not even official) say about robots.txt:

"Web site administrators must realise this method is voluntary, and is not sufficient to guarantee some robots will not visit restricted parts of the URL space."

"It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it."

It's really that simple: robots.txt is not a security tool, it's a guideline, nothing else. If you don't want robots to collect your data, simply don't send it to them.

This is especially important with regard to services which mirror webpages. Doing so without the (assumed) consent of the author is a straightforward copyright violation

It's a straightforward copyright violation, yep, but that has nothing to do with robots.txt, since having it or not doesn't make it any less of a violation.
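
To make the "voluntary" point concrete, here is a rough sketch of what honoring robots.txt looks like from the robot's side, using Python's standard urllib.robotparser module. The robots.txt content, the ExampleBot user agent and the example.com URLs are assumptions for illustration. The check runs entirely in the client, and nothing on the server enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the parsed rules whether this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("http://example.com/index.html",
            "http://example.com/private/report.html"):
    verdict = "fetch" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "skip"
    print(url, "->", verdict)
```

A robot that never calls anything like is_allowed() can still request /private/report.html, and the server will answer unless it is protected some other way.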

Re:I want.... (1)

Kamineko (851857) | more than 7 years ago | (#16020657)

Can't you do this yourself with the Google API?

Autolawyers (3, Insightful)

Doc Ruby (173196) | more than 7 years ago | (#16019811)

What's really disappointing is that it's apparently cheaper to pay lawyers to settle a case than it is to defend your right to ignore optional guidelines like robots.txt in US courts.

If Congress were serious about keeping the US economy "safe and effective", it would reform the "lawyers' job security" laws. Instead it will surely make them even worse, and make the lawyer tax on technology mandatory.

Re:Autolawyers (3, Insightful)

arthurpaliden (939626) | more than 7 years ago | (#16019858)

Unless lawyers are paid by the state, like doctors in Canada, they cannot be considered officers of the court whose job it is to represent your rights before said court. Once they accept payment from a client, either actual or pending, they become no more than hired sales consultants peddling their client's version of the truth.

Re:Autolawyers (1)

The Only Druid (587299) | more than 7 years ago | (#16019970)

I don't agree with any of the statements in your post.

"Unless lawyers are paid by the state, like doctors in Canada, they cannot be considered officers of the court who's job it is to represent your rights before said court. Once they accept payment from a client, either actual or pending, they become no more [than] hired sales consultants [peddling] their [clients'] version of the truth."


Second, there is no distinction between being an advocate for a client's version of the truth, and being an advocate for that same client's rights in court. Unless you presume that your client is intentionally lying or otherwise misrepresenting their case to deprive the other party/parties of rights, then these two concepts are identical. If you do, in fact, presume that the client is intentionally attempting to deprive the other party/parties of their rights, then no bar in the country believes it is ethical for that lawyer to proceed.

Third, there is no proper analogy between the hypothetical socializing of medicine and of law. Doctors attempting to heal a single patient are not competing with one another, while lawyers for adverse clients in a single case are by definition competing with one another. While I will not claim there are no possible benefits to socializing lawyers, those benefits are not based in some such analogy.

Re:Autolawyers (1)

The Only Druid (587299) | more than 7 years ago | (#16020034)

Wow, worst formatting errors I've ever let through. Obviously, my text shouldn't be italic, and the paragraphs should be introduced as "first" and "second". Ugh.

Re:Autolawyers (1)

Bloke down the pub (861787) | more than 7 years ago | (#16020040)

I don't agree with your any of your use of italics. Which are belong to us. Or something.

Re:Autolawyers (4, Insightful)

Doc Ruby (173196) | more than 7 years ago | (#16019977)

There's a good case to be made for lawyers being paid by the state, as they certainly are working in those offices on that business. But even more than doctors they cannot be allowed to make their own interests coincide with that of the state. Lawyers often work for people against the state, which must be recognized by the state as a primary responsibility of lawyers. Doctors rarely find their interests conflicting with that of the state (except when they're not getting paid on time ;), so that structure isn't as dangerous.

There's probably a way to ensure that lawyers represent people's rights better than they do now. Regular random audits of billings and practices. More "contempt of court" punishment. More suspended/revoked licenses, especially for repeated frivolous representation. More "malpractice" awards. There ought to be more competition, with more standardized reviews contextualizing all those "scores", published for consumers.

Lawyers even more than doctors hide behind consumer ignorance and blind "respect". Exposing their performance as part of the shopping process would make them more competitive, and better adhere to the required "ethics" that usually are assumed to come with the tie.

Re:Autolawyers (1)

The Only Druid (587299) | more than 7 years ago | (#16020084)

It's worth noting that your suggestions about increased contempt and malpractice damages (against lawyers) are possible today, without any new legislation: you would probably be surprised how much existing leeway there is for judges to make such damages. For a variety of reasons, they rarely do so. I like the idea of random audits, but it'd require a very sophisticated system of deployment to prevent harassment (for example, how do you weight a lawyer's likelihood of being audited? Should a more prolific lawyer be more likely to be audited?).

Re:Autolawyers (2, Interesting)

Doc Ruby (173196) | more than 7 years ago | (#16020197)

Lawyers should be required to instruct (off the clock) clients how to complain, and judges should ask clients if they've been informed (checking against a form the client signs). Failure should be like violating Miranda rights.

Yes, a more prolific lawyer should be more likely to be audited. Probably every nth case (by all lawyers) should have an audit initiated secretly to follow the proceedings, reporting malpractice as it's observed, so corrections aren't applied only after the case is derailed. That doesn't sound so sophisticated, but it does seem like lawyers would spend their careers learning to abuse it. NP complete, but best effort counts.

Another big reform that seems essential is to direct all punitive damages (not compensation damages) to the state, or perhaps even to some certified victim's fund, rather than to the plaintiff (and a percentage to their lawyers). That seems like a fundamental abuse that needs to be fixed, and would help fund a better justice department to make better decisions. Oh, and big penalties for lawyers introducing invalid evidence, all evidence determined before trial in separate hearings... anything for lawyer accountability to standards would make big improvements.

Re:Autolawyers (1)

AuMatar (183847) | more than 7 years ago | (#16020300)

The main reason they rarely do- most judges used to be lawyers.

Re:Autolawyers (1)

MindStalker (22827) | more than 7 years ago | (#16020486)

From what I understand there is a group of lawyers who are assigned to you if you are charged with a crime and can't afford a lawyer yourself. Despite what you may see on TV these lawyers do a decent job (not always) of disagreeing with the State. And if they do a bad job you can often get a Judge (who seem to be really good at disagreeing with the State) to rule that your lawyer was incompetent. Sure, if we went to a system of all public lawyers we would need some tougher checks and balances, but so far it seems the third branch of the government is pretty good at disagreeing with the first two. /I'm sure I'm gonna get some liberal to jump on this and argue about recent appointments and such and such..... Flame On!! :)

Re:Autolawyers (1)

nebaz (453974) | more than 7 years ago | (#16020081)

They don't have to be paid by the state, merely licensed by the state. That license comes with certain responsibilities, I think some pro-bono work must occur every year under some circumstances, for example.

Re:Autolawyers (2, Informative)

hackstraw (262471) | more than 7 years ago | (#16019880)

If Congress were serious about keeping the US economy "safe and effective", it would reform the "lawyers' job security" laws. Instead it will surely make them even worse, and make the lawyer tax on technology mandatory.

I don't see that happening any time soon -- http://www.yourcongress.com/ViewArticle.asp?article_id=1671 [yourcongress.com]

Re:Autolawyers (1)

Doc Ruby (173196) | more than 7 years ago | (#16019993)

Interesting stats. I'd love to see the percentage of challengers to incumbents who are lawyers. Every second November, like this coming November 7, 2006, we can fire all the lawyers in the House, and probably about 30% of the lawyers in the Senate. And replace them with people who legislate, rather than lawyer.

Re:Autolawyers (1)

Blakey Rat (99501) | more than 7 years ago | (#16020412)

Yeah, but the problem here is that archive.org kept the material accessible even though their own policy is to delete material if robots.txt says to. It has nothing to do with the right of archive.org to ignore the robots.txt file, it's all about whether archive.org must follow their own published policies.

Don't need no Wayback (5, Funny)

kaizenfury7 (322351) | more than 7 years ago | (#16019823)

If you go directly to their site [healthcareadvocates.com] , you get a version of their site that looks like it's from 1995.

Re:Don't need no Wayback (2, Funny)

cptgrudge (177113) | more than 7 years ago | (#16019903)

Quick! Get those people some Rounded Corners and Gradients!

Welcome to Web 2.0!

Re:Don't need no Wayback (1)

MindStalker (22827) | more than 7 years ago | (#16019973)

No, nobody needs Web 2.0.

But the site doesn't even look mildly professional. I could have made that page back in high school, and I SUCK at web design.

Re:Don't need no Wayback (1)

Al Dimond (792444) | more than 7 years ago | (#16020355)

Since when did "professional" mean "difficult to make"? If the site conveys its content in a clear way who cares if you could have made it in high school? A web site that's simple to implement is a great thing, and extra technologies (that usually will increase development, maintenance and bandwidth costs) need to be justified in terms of how they actually make the site's experience better.

Re:Don't need no Wayback (2, Insightful)

MindStalker (22827) | more than 7 years ago | (#16020456)

I don't know, maybe I just don't expect my local newspaper to look like my high school newspaper.
Initial impressions go a long way. It may seem silly to some people, but in business it can mean the difference between people taking you seriously and buying your product, or not.

Re:Don't need no Wayback (1)

alsundma (753428) | more than 7 years ago | (#16019915)

Wow! I thought you were linking somewhere else as a joke. That site takes me right back to the glorious 1990s.

Re:Don't need no Wayback (1)

loraksus (171574) | more than 7 years ago | (#16019921)

Not anymore...
Go Slashdot!

I sense a little two-faced opinion here (4, Insightful)

InsaneGeek (175763) | more than 7 years ago | (#16019830)

which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place

So by the logic, if I didn't want AOL to release my search information I shouldn't be mad as it's my fault to have used them in the first place? Or that if I want my copyrighted information to not be republished by someone else, I should just simply not publish at all? How about, if I don't want my GPL code resold by someone in a closed source product I should just know better and not put it out in the open to begin with. And that if I post something stupid when I'm 9 we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.

Re:I sense a little two-faced opinion here (1)

Amouth (879122) | more than 7 years ago | (#16019931)

it should.. follow you around for ever.. but it should also be noted that you were 9.. and the other party has to decide if you at 9 knew better or not.. it is their point of view.

the way back thing always told you when it was.. never trying to show it off as now

Re:I sense a little two-faced opinion here (1)

Drooling Iguana (61479) | more than 7 years ago | (#16020781)

Now say "KHAAAN!!!"

I sense a collection of poor analogies here (2, Interesting)

Anonymous Coward | more than 7 years ago | (#16020010)

So by the logic, if I didn't want AOL to release my search information I shouldn't be mad as it's my fault to have used them in the first place?

You never intended to make your search results publicly available. These guys intentionally made their web page publicly available.

Or that if I want my copyrighted information to not be republished by someone else, I should just simply not publish at all?

That's a better point, but the question is whether the Wayback Machine "republished copyrighted material". If they instead archived material available in the public domain, it is a different matter entirely, regardless of what the creators of that material want.

How about, if I don't want my GPL code resold by someone in a closed source product I should just know better and not put it out in the open to begin with.

If you don't want something to be used freely, don't release it into the public, unless there are legal protections in place. If it's the GPL, people are legally forbidden from incorporating it into a (publicly released) closed source product. If it's the LGPL, people can do so. If you don't like that, don't release it publicly.

And that if I post something stupid when I'm 9 we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.

This is a fact of Internet life and always has been. This isn't different from other activities of 9-year-olds or anyone else in the public sphere. If you streak through a mall naked and someone snaps your picture, too bad: you can't make the photos disappear.

Re:I sense a collection of poor analogies here (1, Funny)

Anonymous Coward | more than 7 years ago | (#16020131)

If you streak through a mall naked and someone snaps your picture, too bad: you can't make the photos disappear.

It wasn't me. It was my imaginary twin brother!!!

Re:I sense a little two-faced opinion here (4, Interesting)

fm6 (162816) | more than 7 years ago | (#16020060)

Another example: someone I know wrote an essay that he thought only people in his class would ever see. It contained one or two mildly embarrassing disclosures, not terribly personal, but not something you'd want a complete stranger to know about you. Some idiot put it up on the school web site without his permission.

Here's a nasty possibility. Suppose somebody unintentionally publishes information useful to terrorists. DHS drops by and points out the error, and the information is withdrawn. Does Wayback Machine have a right to keep the information online?

In fact, Wayback Machine has never asserted their right to keep anything online. As the article points out, they'll remove stuff that's noncompliant with the current robots.txt, even though it was compliant at the time it was spidered. This lawsuit wasn't about their right to keep stuff online. It was just somebody accusing them of being negligent about enforcing their own policies.

Re:I sense a little two-faced opinion here (1)

wik (10258) | more than 7 years ago | (#16020220)

What is their policy for websites that no longer exist? Their website says nothing about this.

I want to remove archives of my websites for hostnames/domains that are no longer connected to the internet. Obviously, the robots.txt method cannot work here.

Re:I sense a little two-faced opinion here (1)

DamnStupidElf (649844) | more than 7 years ago | (#16020448)

Here's a nasty possibility. Suppose somebody unintentionally publishes information useful to terrorists. DHS drops by and points out the error, and the information is withdrawn. Does Wayback Machine have a right to keep the information online?

Why don't you just play the child pornography card instead? At least that's *illegal*, unlike putting publicly available information online instead of hidden in some dusty library guarded against terr'ists by a librarian.

The fact is, if something is actually illegal to possess, the Internet Archive can't possess it either. That said, hopefully no DHS flunky notices this case and subpoenas the whole archive to make sure it's clean of terrorist helping information...

Re:I sense a little two-faced opinion here (4, Insightful)

gsn (989808) | more than 7 years ago | (#16020114)

That's crazy - when you typed in your search term into AOL you had an expectation of privacy and you did not for one minute believe that they would release that data. All webpages are copyrighted and the Wayback Machine is using fair use to archive copies for educational use. If you publish information (it's automatically copyrighted) and someone reproduces it they might be able to under fair use or they might be infringing your copyright - talk to your lawyer. And yes if you posted something on the net when you were 9 that was stupid it might well follow you around for the rest of your life. Same goes if you were in a porno in college and you put it online. Sorry. Tough shit. Maybe your parents should have paid more attention to your online activities. Or you should have known better. IANAL and 9 year olds may get some protection as minors but basic point remains - you publish something online you had no expectation of privacy. This is not at all what you were doing when you sent AOL your search queries - you published zilch.

If you post something on the net then I can point my browser to it - there is no privacy, and nor was there any expectation of it. I could have used wget -r -erobots=off on your page every day and got all its content - and I'd have that archive even when you deleted it or moved it into some private archive, and it happily ignored your robots.txt. Since obeying robots.txt is voluntary I simply chose not to.
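To make the "voluntary" point concrete, here is a minimal Python sketch of the check a polite crawler runs before fetching, and how little it takes to skip it. It uses the standard library's urllib.robotparser. The function name may_fetch, the obey flag, the "examplebot" agent and the example.com rules are all made up for illustration; nothing here is taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, agent: str, url: str, obey: bool = True) -> bool:
    """Return True if the crawler would go ahead and request the URL."""
    if not obey:
        # Nothing on the client side enforces robots.txt. Skipping the check
        # is roughly what passing -e robots=off to wget does.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://example.com/private/notes.html"
    print(may_fetch(ROBOTS_TXT, "examplebot", url))              # polite crawler
    print(may_fetch(ROBOTS_TXT, "examplebot", url, obey=False))  # crawler that opts out
```

The only thing standing between the two outcomes is a flag on the client, which is the sense in which compliance is a courtesy rather than an access control.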

News websites often want you to pay for older content but there is nothing theoretically stopping you from saving all the content day by day. You are comparing apples and oranges.

Here's the summary - we posted evidence online that was used against us in a court of law, we lost, we sued the people who provided that evidence, and because it's cheaper to settle than deal with bloody lawyers we settled with them.

Re:I sense a little two-faced opinion here (1)

MadEE (784327) | more than 7 years ago | (#16020306)

That's crazy - when you typed in your search term into AOL you had an expectation of privacy and you did not for one minute believe that they would release that data. All webpages are copyrighted and the Wayback Machine is using fair use to archive copies for educational use.
I am certain that when they published the site, concern over the wholesale copying of it was about as high on their list as privacy is to a search engine user, less so when most search engines' TOSs allow them to pull this crap. Regardless, the use being educational in nature doesn't automatically make something fair use, particularly when it's published publicly.

Re:I sense a little two-faced opinion here (1)

Chosen Reject (842143) | more than 7 years ago | (#16020142)

So by the logic, if I didn't want AOL to release my search information I shouldn't be mad as it's my fault to have used them in the first place?

AOL's privacy policy did not say that your searches would be released to the world at the time people made those searches. AOL released them anyway, so you would have a legitimate concern. Also, AOL searches were not being made public a la a web page at the time people were making them. I am sure many people would not have used AOL, or at least changed their search habits, had they known all of their searches would be immediately posted on the web.

Or that if I want my copyrighted information to not be republished by someone else, I should just simply not publish at all?

If this were a copyright issue, it would have been brought up as such. HA is not saying that archive.org violated copyright, only that they ignored a voluntary robots.txt file. If your copyrighted material is being infringed upon, you are more than welcome to stop the perpetrators. However, this is more like time shifting on TV, which in the US and many other countries, is considered perfectly legal.

How about, if I don't want my GPL code resold by someone in a closed source product I should just know better and not put it out in the open to begin with.

No, you knew better and GPLd the code. Therefore it falls under a license that gives you the legal right to stop the offending party from reselling it in a closed-source project. But this isn't about licenses, it's about a voluntary robots.txt file.

And that if I post something stupid when I'm 9 we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.

It's unfortunate that sometimes we do some dumb things when we might not have known any better yet still have to live with people knowing it forever. But then, this is no different than how things have gone on forever, it's just that now we have a much larger audience. Lots of kids do dumb things before they fully understand the consequences of their actions, and lots of people remember those things. However, that is no basis for a legal challenge.

There is nothing two-faced about this. Even if all of /., except for you of course, were actually all of one mind on those issues, this issue doesn't concern any issue you have brought up, except for maybe the 9-year-old thing, but even then, that is just a reality we have to deal with.

Re:I sense a little two-faced opinion here (2, Interesting)

DeadboltX (751907) | more than 7 years ago | (#16020174)

why do people make such god awful analogies?

if you give private information to AOL and they release it publicly then you can get upset
if you post private information on "check-out-my-ssn.com" and its public to the whole world then you can't get mad.

Re:I sense a little two-faced opinion here (2, Insightful)

alexhs (877055) | more than 7 years ago | (#16020342)

Maybe you need to inform yourself of what Robot [robotstxt.org] Exclusion [wikipedia.org] is and isn't.

Its purpose is not to censor information but to avoid incidents caused by aggressive robots that could stress WWW servers (see the introduction in the first link).

HA's action is revisionism. Like a politician yelling something then a few years later claiming he never said such a thing and threatening people with a piece of evidence to the contrary.

If you don't want it read... (3, Insightful)

saskboy (600063) | more than 7 years ago | (#16019831)

...Don't put it on the Internet. In fact, don't even type it into a computer, or write it down.
People shouldn't put anything on the Internet that they wouldn't want their worst enemy, boss, NSA, or grandmother to see. Obviously since the porn industry exists online, few people follow this rule, but it's a good one nonetheless.

I enjoy Archive.org and when I get nostalgic about my websites of the past, it's there to show me a glimpse into history.

Re:If you don't want it read... (0)

Anonymous Coward | more than 7 years ago | (#16019952)

Of course that's a reasonable precaution to take, but that's exactly because things are the way they are. That doesn't mean there couldn't or shouldn't be change in the way things work.

(Oh, and what if your company puts it online?)

Re:If you don't want it read... (1)

From A Far Away Land (930780) | more than 7 years ago | (#16020019)

My company already is online. And per its objectives, has links to many other companies who have "it" online too. If you're familiar with the term Free Software or Open Source, you'll have heard the phrase that software wants to be free. It may sound strange to anthropomorphize lines of code, but to me it means that the natural state of information is "free" to everyone, and to conceal it requires work. The natural state of the universe tries to balance vacuums and areas of higher pressure, so when there isn't enough work into keeping a secret safe, the natural tendency is for it to slip.

A recent example is of the CNN reporter caught with her mic on in the bathroom. She badmouthed her sister-in-law when she wasn't diligently working at keeping the secret of how she felt. The universe conspired against her secret keeping, and now the whole world knows the real information.

That's my long winded way of saying, "Shit happens, so plan for it."
Either don't create a secret, or plan for when it gets out.

Re:If you don't want it read... (1)

MindStalker (22827) | more than 7 years ago | (#16019988)

What about my financial information for almost every single bank, credit cards, bill I have. And there is little I can do about it.
It might be fairly secure... But it's on the web. Point is everything will eventually be on the web, it's only a matter of whether you trust the security of the site. Should you trust the security of myspace? No..

Re:If you don't want it read... (1)

saskboy (600063) | more than 7 years ago | (#16020061)

"It might be fairly secure... But its on the web."
Lack of real information security is the trade we made as a computerized networked society, for convenience in banking. With the effort saved in banking I'd say it's worth it, even with the potential identity scams that plague thousands of people every year. Crime happens whether it's online or off.

unringing the bell? (0, Offtopic)

hguorbray (967940) | more than 7 years ago | (#16019843)

Good thing this isn't anything to do with the Bush Administration, or else they'd have retroactively classified all this stuff as 'Top Secret' and then charged the Wayback Machine with treason under the Patriot Act ....

and then the machine would find itself held without trial or charges in Gitmo until it turned to rust.

Sometimes you gotta laugh to keep from crying.....Hopefully you're laughing at this

Wayback of missing documentation? (0)

Anonymous Coward | more than 7 years ago | (#16019932)

Sometimes you gotta laugh to keep from crying.....Hopefully you're laughing at this

Laughter is induced by the ironic or unexpected. Unfortunately, I fully expect what you say would be how things would play out :-(

Re:unringing the bell? (1)

Kainaw (676073) | more than 7 years ago | (#16020150)

Um... The Patriot Act is terrible, but Congress passed the Patriot Act, not Bush. Nobody in Gitmo will ever be charged with anything related to the Patriot Act because it is for surveillance, not prisoners of war. Have you been watching too much Michael Moore? It is idiotic statements about the Patriot Act that keep the public from understanding the truth of why it is bad. So, it can never be fixed. I often wonder if Congress paid Moore and the ACLU to go after the act with idiotic (and completely unrealistic) statements so the stupid public would never know what was really in it.

metaphorically speaking (1, Troll)

nizo (81281) | more than 7 years ago | (#16019902)

... you can't really un-ring the bell of publishing something online...


For the life of me I can't figure out what ringing a bell and publishing something online have in common. Maybe if we didn't use digital clocks we could turn back the sands of time and use a different mixed metaphor instead?

Re:metaphorically speaking (2, Informative)

LordNimon (85072) | more than 7 years ago | (#16019996)

There's only one metaphor - "you can't unring a bell", so there is no mixed metaphor.

Re:metaphorically speaking (1)

The Only Druid (587299) | more than 7 years ago | (#16020057)

The use of the phrase "you can't unring the bell" in the discussion of Free Speech is an old one, based on the concept that no matter what you do after someone rings a bell, you can't "unring" it. The use here as an analogy is appropriate, in that you can't "un-release" information from the internet.

Re:metaphorically speaking (0)

Anonymous Coward | more than 7 years ago | (#16020269)

Nah. I heard it on "My Name Is Earl" last night.

-----

Earl: Sorry about that. Don't worry, I'll find your dog. But that's it, right? Then your life is exactly back to the way it was seven months ago, we're done.

Scott: yeah, I think that's completely everything back to normal.

Earl: good.

Scott: unless (to Tess) you didn't have sex with anyone else while we were broken up, did you?

Tess: I used my hand on a guy a little.

Earl: yeah I'm not sure how to unring that bell.

-------
MY NAME IS EARL
1X07 - BROKE STOLE BEER FROM A GOLFER
Original Airdate (NBC): 08-NOV-2005
link [twiztv.com]

Re:metaphorically speaking (1)

whitehatlurker (867714) | more than 7 years ago | (#16020671)

I've never heard this expression either, and I agree that it is poor. You just wait until the echoes of the bell have dissipated and it's like the bell never rang.

[Curmudgeon]Un-ring? Bah! Nonsense.[/Curmudgeon]

But.... (2, Informative)

Stanislav_J (947290) | more than 7 years ago | (#16019912)

....even if Wayback did respect the robots.txt (which I was under the impression that they generally do), any pages archived before the robots.txt was placed on the server aren't going to automatically disappear -- they are still there. You have to directly ask them to remove the previously archived pages if you don't want them to be accessible.

Retroactive robots.txt (5, Insightful)

Kelson (129150) | more than 7 years ago | (#16020036)

I recently discovered exactly how the Wayback Machine deals with changes to robots.txt.

First, some background. I have a weblog I've been running since 2002, switching from B2 to WordPress and changing the permalink structure twice (with appropriate HTTP redirects each time) as nicer structures became available. Unfortunately, some spiders kept hitting the old URLs over and over again, despite the fact that they forwarded with a 301 permanent redirect to the new locations. So, foolishly, I added the old links to robots.txt to get the spiders to stop.

Flash forward to earlier this week. I've made a post on Slashdot, which reminds me of a review I did of Might and Magic IX nearly four years ago. I head to my blog, pull up the post... and to my horror, discover that it's missing half a sentence at the beginning of a paragraph and I don't remember the sense of what I originally wrote!

My backups are too recent (ironic, that), so I hit the Wayback Machine. They only have the post going back to 2004, which is still missing the chunk of text. Then I remember that the link structure was different, so I try hitting the oldest archived copies of the main page, and I'm able to pull up the summary with a link to the original location. I click on it... and I see:

Excluded by robots.txt (or words to that effect).

Now this is a page that was not blocked at the time that ia_archiver spidered it, but that was later blocked. The Wayback machine retroactively blocked access to the page based on the robots.txt content. I searched through the documentation and couldn't determine whether the data had actually been removed or just blocked, so I decided to alter my site's robots.txt file, fire off a request for clarification, and see what happened.

As it turns out, several days later, they unblocked the file, and I was able to restore the missing text.

In summary, the Wayback Machine will block end-users from accessing anything that is in your current robots.txt file. If you remove the restriction from your robots.txt, it will re-enable access, but only if it had archived the page in the first place.
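The behavior summarized above can be modeled in a few lines of Python. This is only a toy sketch of what was observed from the outside, not the Internet Archive's actual implementation: the ToyArchive class, its method names, and the example.org URL and /archives/ path are all hypothetical. The one real piece is the standard library's urllib.robotparser, used here to evaluate the site's current robots.txt at the time of the request.

```python
from urllib.robotparser import RobotFileParser

ARCHIVE_AGENT = "ia_archiver"  # crawler name as mentioned elsewhere in this thread


class ToyArchive:
    """Hypothetical model: snapshots are kept, but access is gated by the
    site's *current* robots.txt rather than the one in force at crawl time."""

    def __init__(self):
        self.snapshots = {}  # url -> archived page text

    def crawl(self, url, page_text):
        self.snapshots[url] = page_text

    def serve(self, url, current_robots_txt):
        if url not in self.snapshots:
            return None  # never archived, so there is nothing to unblock
        parser = RobotFileParser()
        parser.parse(current_robots_txt.splitlines())
        if not parser.can_fetch(ARCHIVE_AGENT, url):
            return "Excluded by robots.txt"  # blocked, but the data is still stored
        return self.snapshots[url]


if __name__ == "__main__":
    archive = ToyArchive()
    old_url = "http://example.org/archives/000123.html"
    archive.crawl(old_url, "original review text")

    blocking = "User-agent: *\nDisallow: /archives/\n"
    print(archive.serve(old_url, blocking))  # blocked after the fact
    print(archive.serve(old_url, ""))        # rule removed, snapshot is back
```

Because the gate is keyed on whoever controls the domain today, the same model also reproduces the complaint raised further down the thread: a new owner of a lapsed domain can hide pages they never published.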

Re:Retroactive robots.txt (1)

ebyrob (165903) | more than 7 years ago | (#16020251)

In summary, the Wayback Machine will block end-users from accessing anything that is in your current robots.txt file. If you remove the restriction from your robots.txt, it will re-enable access, but only if it had archived the page in the first place.

That's pretty cool. I wish more software behaved in a manner that well thought out.

Re:Retroactive robots.txt (1)

rthille (8526) | more than 7 years ago | (#16020750)

Cool maybe, but also bad. I can gain control over content [at least to prevent access] I never originally published if I now control the domain.

That's uncool.

Re:Retroactive robots.txt (1)

Jeremy Erwin (2054) | more than 7 years ago | (#16020891)

An acquaintance of mine was interested in basing a wargame on modern day protest movements. One of the sources he planned to use was a19-- an ad hoc organization devoted to producing some sort of protest march on August 19th (of some year). They had a website called a19.org. It was no longer of any value to them, and the domain eventually found its way into the hands of a net parasite.

You know the type:

You searched for quantum chromodynamics. Would you like to buy flowers instead?


and of course, robots.txt was used to block the really interesting stuff that was published on a19.org.

The moral of the story?

Use wget -r as your web browser.

What REALLY pisses me OFF (4, Insightful)

scenestar (828656) | more than 7 years ago | (#16019947)

Is that some sites that used to exist had no robots.txt file, yet still get blocked

After a certain domain had gone unused for years, some adware/search-rank link farm (whatever it is) added a robots.txt file to the "hijacked" domain.

One can now get formerly accessible sites removed from archive.org. EVEN IF THE ORIGINAL OWNER NEVER INTENDED TO.

Check out their robots.txt... (3, Interesting)

Anonymous Coward | more than 7 years ago | (#16019959)

Check out their robots.txt: http://www.healthcareadvocates.com/robots.txt [healthcareadvocates.com] They ONLY restrict Internet Archive, from accessing their web site, but don't restrict any other spider... Haven't they heard of Google's cache?
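For anyone who hasn't seen a robots.txt that singles out one crawler, here is a small Python sketch of how such a file is evaluated per user agent. The rules below are an assumed illustration of "block only the Internet Archive's crawler", not a verbatim copy of the real file, and the helper name blocked_agents and the example.com URL are made up. The parser is the standard library's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents for illustration; not a verbatim copy of any real file.
SINGLE_AGENT_RULES = """\
User-agent: ia_archiver
Disallow: /
"""


def blocked_agents(robots_txt, agents, url):
    """Return the subset of agents that the rules ask to stay away from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    crawlers = ["ia_archiver", "Googlebot"]
    print(blocked_agents(SINGLE_AGENT_RULES, crawlers, "http://example.com/"))
```

With no "User-agent: *" section, every crawler not named in the file is left unrestricted, so search engine caches are unaffected by rules like these.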

Re:Check out their robots.txt... (2, Interesting)

Sir Pallas (696783) | more than 7 years ago | (#16020625)

Which is funny, because ia_archiver is actually the Alexa Internet crawler; it's a throwback to before Amazon.com bought Alexa. (To this day, Alexa donates crawl data to the Archive.)

Wayback Machine essential for public domain (3, Interesting)

proxima (165692) | more than 7 years ago | (#16020063)

Many people think of the Wayback Machine as being a tool for history and nostalgia. However, consider copyright expiration (IANAL, etc.). Many web pages have items like "Copyright 1995-2006 Blah". Some of the content was created as early as 1995. Assuming, of course, that items created in modern times eventually have their copyright expire, we will need a record of the content of these pages at that time.

As more content moves online, the idea of publishing a work becomes blurred. Revisions years later can effectively update the copyright of the work, if the reader cannot distinguish when the content was created. So the Wayback Machine will hopefully provide that resource. The amount of potentially public-domain content there is huge.

As a side note, it will be interesting to note when the first GPL programs (for example) lose their copyright. Of course, by then, the languages will seem more than archaic.

Re:Wayback Machine essential for public domain (1)

MindStalker (22827) | more than 7 years ago | (#16020523)

Actually it's not without case law. If you change then republish something you get a new copyright on it. BUT someone can still copy the old material if they can find an old revision that has fallen out of copyright. /Yes even you can take Shakespeare, change a few words and copyright your publication. :)

Re:Wayback Machine essential for public domain (1)

proxima (165692) | more than 7 years ago | (#16020624)

Actually it's not without case law. If you change then republish something you get a new copyright on it. BUT someone can still copy the old material if they can find an old revision that has fallen out of copyright. /Yes even you can take Shakespeare, change a few words and copyright your publication. :)

Right, I was operating under that assumption. Therefore, it is very important that we have a record of what existed at a given point in time.

What I don't know for certain is the answer to this hypothetical situation: A PDF or text file (or whatever) is made available on X date. X+100 years later (or whatever), the file is still available (perhaps not from the same source, but assume the file itself is dated). Is the file in the public domain, if accessed at a later date? I think it is, so long as the file is the same, bit-for-bit. Translate it into a new format, and you might have a new copyright, I'm not sure (and I'm not sure there is case law on it).

This brings up an analogy with other types of creative works. Are photographic reproductions of old artwork copyrighted? From what I understand, this depends on the country (with this [cornell.edu] being the relevant U.S. ruling that such a photograph is not copyrightable).

Re:Wayback Machine essential for public domain (0)

Anonymous Coward | more than 7 years ago | (#16020877)

Actually, you'll find that copyright on almost nothing will expire thanks mostly to Disney. The latest round of copyright extension (1998) extended copyrights to the life of the creator + 70 years (or 95 years if it was a work of corporate authorship).

Basically, "[u]nder this act, no additional works made in 1923 or afterwards that were still copyrighted in 1998 will enter the public domain until 2019". http://en.wikipedia.org/wiki/Sonny_Bono_Copyright_Term_Extension_Act [wikipedia.org]

Isn't ignoring robots.txt unauthorised access? (1)

datajack (17285) | more than 7 years ago | (#16020115)

First, let me get two points out of the way. 1) IANAL, 2) I agree with the aims of Wayback and support that organisation wholeheartedly. I am playing devil's advocate here.

In the UK Computer Misuse laws, there is the concept of unauthorised access. It is an offence to access data on a computer system without authorisation.

Typically it is assumed that access to data held on a publicly available website, without notice to the contrary, is authorised. A notice displayed stating that you should not look at the data unless you are me is sufficient to make you aware that you should not access it. Similarly, a robots.txt file is the place to explicitly define what data is unauthorised for access by automated spider systems. Anyone writing such a system can be reasonably expected to know that robots.txt contains such information and should therefore have the spider check that to see if access to the data is unauthorised. Failure to check that does not magically make the access any more legal. I would imagine that the US has similar provisions.

The creation of a robots.txt file after the spider has collected the information will not make the previous access and data collection illegal nor should it affect the presentation of that data. Copyright law may have an impact though.

Re:Isn't ignoring robots.txt unauthorised access? (1)

dangitman (862676) | more than 7 years ago | (#16020341)

Typically it is assumed that access to data held on a publicly available website, without notice to the contrary, is authorised. A notice displayed stating that you should not look at the data unless you are me is sufficient to make you aware that you should not access it.

That sounds rather absurd. It's like posting a massive page of text in a busy public location, with a sticky note attached saying "do not read this text."

I would think that in terms of computer networks, "unauthorized access" means breaking into a site that is protected by password or other security measures. The fact that your machine can reach a site and get content without any password or hacking amounts to authorisation in my opinion. If you aren't authorising public access, then why did you post it in public?

In more inflammatory terms - how about a page that was public, but said something like "You are not authorized to read this if you are a Jew. Offenders will be prosecuted." Somehow, I don't think the courts would take a positive view of that.

Re:Isn't ignoring robots.txt unauthorised access? (1, Informative)

Anonymous Coward | more than 7 years ago | (#16020451)

The robots exclusion standard was primarily designed to exclude robots from the parts of the server's namespace that robots can't handle, like (practically) infinite url trees or shop sites. You don't want bots to crawl a neverending swamp of dynamically generated content that points to ever more dynamically generated content. You also don't want bots to order stuff or vote for comments when they crawl the scripts (the webmonkey should have used POST, not GET, but if he chose to use robots.txt instead, you're going to at least get an angry call). There are many more reasons to exclude robots from certain url prefixes. If you're operating a robot, follow that standard, for your own good. Some servers are actively hostile if you don't follow robots.txt.

No it isn't. (1, Informative)

Anonymous Coward | more than 7 years ago | (#16020542)

robots.txt is not about whether accesses are "authorized" or not. Because the web server will still serve up the content if the robot asks for it! If you only want "authorized" users accessing the content, you should put some sort of access control mechanism where users have to type a password or something. Not only will that keep the robot out, but it demonstrates a clear intent to keep the robot out.

robots.txt is more of a "please don't look at this" request to spiders. If the spider asks for the content anyway and your server happily sends it, then you can't claim this is "unauthorized" access.

Does anyone here know what copyright is?! (2, Insightful)

Anonymous Brave Guy (457657) | more than 7 years ago | (#16020193)

Pretty much every time we have a discussion about the legality of web/Usenet archive sites, the only argument with any legal weight that's given for what would otherwise be a clear infringement of copyright is that the rightsholder is implicitly consenting to certain uses by making the material available on that medium. The degree to which this holds in general is debatable, and AFAIK has never been tested in any major court case in any jurisdiction. However, even if robots.txt is voluntary, it's a clear statement of intent. There is no way you can claim implicit permission to copy the material when the supplier explicitly indicated, using a recognised mechanism, that they did not want it copied.

That makes comments like this one by Doc Ruby [slashdot.org] and this one by saskboy [slashdot.org] seem a little presumptuous, IMNSHO.

I disagree (0)

Anonymous Coward | more than 7 years ago | (#16020281)

"Publishing" something online is hardly a reason for that information to stay online and be available indefinitely. After all, the latest AOL fiasco has just shown us that not all information should be available in perpetuity. With technology getting ever simpler, it becomes trivial to expose online documents that were never meant to be seen by others. Claiming that any exposure is grounds for the information to be available to anyone in perpetuity is clearly wrong.

For a simple example, say your personal diary with your private thoughts and writings somehow falls out of your bag and ends up on the street. It is available for anyone to read. Would you agree to have its content published and disseminated to all the world in newspapers or some such? Or would you rather someone returned it to you quietly and the information stayed private?

Archiving information produced by other people without their express consent is wrong and, potentially, harmful. This is one case where I strongly believe copyright law should be applied and enforced.

HA by law should have to give up the data (1)

MushMouth (5650) | more than 7 years ago | (#16020596)

IIRC, this was in response to a situation where someone was suing HA: the plaintiff's law firm hammered archive.org and was able to get some of the pages that they were interested in. At that point HA sued the archive for copyright infringement, because HA had changed their robots.txt to prevent the information from getting to the plaintiff's attorneys. The problem with this whole thing is that adding the robots file after the lawsuit started is akin to destroying evidence during a trial, and they should have been found in contempt of court. Expecting the archive to delete the data is unlikely to work, since unless the archive is serving the data there is no copyright violation. I don't see why the plaintiff's lawyer didn't serve the archive with a subpoena for the information, the way gmail users have had their "deleted" email subpoenaed.

Violated their Own Policies (1)

jafiwam (310805) | more than 7 years ago | (#16020804)

Their policy is pretty simple, and direct, and involves minimal interaction with a human. (A bonus.)

Put in a robots.txt.

Direct Wayback to index what you want, or don't.

THAT DIRECTION IS APPLIED TO FILES ON THEIR SITE FROM PREVIOUS VERSIONS.

Meaning, if you deny all, and their bot sees it, all of your stuff is supposed to get deleted from the archive.

If they didn't do that they violated their own policy.

True, there can be complications (such as switching domain names) that might keep any given text in there without interaction.

What they do is a great and tremendously useful tool, but it is not entirely out of the "gray area" for copyright problems.

What about robots.txt in/from the future? (1)

Bozzio (183974) | more than 7 years ago | (#16021019)

Obeying robots.txt files is voluntary, after all,

It may still be voluntary today, but who knows what the future will bring?

I, for one, welcome our robots.txt overlords.

wrong (2, Interesting)

oohshiny (998054) | more than 7 years ago | (#16021042)

The US has copyright laws, and lots of people rely on them, including open source projects.

The robots.txt file is a clear indication of the conditions under which a copyright holder gives you access to their copyrighted materials. As such, it is not "voluntary".

In addition to probably being in violation of copyright law, it is simply rude for companies to ignore robots.txt files; if the Internet Archive does this, they are badly behaved.

If courts should decide that robots.txt files can be ignored at will, then more sites will require registration, click-through licenses, and those annoying "try to read this" safeguards, making life more miserable for all of us.

The best thing for everybody, including the Internet Archive, would be for the robots.txt standard to be enforced strongly by courts.

Wrong, wrong, wrong (3, Informative)

kimvette (919543) | more than 7 years ago | (#16021066)

As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place.


Wrong, wrong, wrong. archive.org explicitly tells you that if you want your content removed from their index, you should modify your robots.txt and re-submit your site, and when their bot reads your robots.txt and sees the appropriate directives, your content will be dropped from the index. See:

http://www.archive.org/about/faqs.php#2 [archive.org]

http://web.archive.org/web/20050305142910/http://www.sims.berkeley.edu/research/conferences/aps/removal-policy.html [archive.org]

Let's review the text here, just in case someone from archive.org scurries to change it:

Addendum: An Example Implementation of Robots.txt-based Removal Policy at the Internet Archive

 


To remove a site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt) and then submit your site below.

The robots.txt file will do two things:

          1. It will remove all documents from your domain from the Wayback Machine.

          2. It will tell the Internet Archive's crawler not to crawl your site in the future.

To exclude the Internet Archive's crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:

                                              User-agent: ia_archiver

                                              Disallow: /

Robots.txt is the most widely used method for controlling the behavior of automated robots on your site (all major robots, including those of Google, Alta Vista, etc. respect these exclusions). It can be used to block access to the whole domain, or any file or directory within. There are a large number of resources for webmasters and site owners describing this method and how to use it. Here are a few:

                      http://www.global-positioning.com/robots_text_file/index.html [global-positioning.com]

                      http://www.webtoolcentral.com/webmaster/tools/robots_txt_file_generator [webtoolcentral.com]

                      http://pageresource.com/zine/robotstxt.htm [pageresource.com]

Once you have put a robots.txt file up, submit your site (www.yourdomain.com) on the form on http://pages.alexa.com/help/webmasters/index.html#crawl_site [alexa.com] .

The robots.txt file must be placed at the root of your domain (www.yourdomain.com/robots.txt). If you cannot put a robots.txt file up, submit a request to wayback2@archive.org.


By not honoring those directives, are they not engaging in both copyright infringement and fraud?
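For what it's worth, here is a quick sketch of how a standards-following parser reads the two directives quoted above, again using Python's urllib.robotparser. This only illustrates the robots.txt semantics; it says nothing about how the Internet Archive's own crawler is actually implemented. "SomeOtherBot" and the page path are hypothetical, and the two directive lines are written adjacent to each other, as they would be in a real file.

```python
from urllib.robotparser import RobotFileParser

# The two directives from the quoted removal policy, as adjacent lines.
ROBOTS_TXT = "User-agent: ia_archiver\nDisallow: /\n"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# www.yourdomain.com is the placeholder domain used in the quoted policy;
# the page path is made up for illustration.
url = "http://www.yourdomain.com/some/page.html"

# The rule names ia_archiver only, so every other robot stays unaffected.
ia_allowed = parser.can_fetch("ia_archiver", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
print(ia_allowed, other_allowed)
```

So the file as quoted shuts out ia_archiver for the whole site while leaving every other crawler alone, which matches what the policy text says it is for.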