
W3C Gets Excessive DTD Traffic

ScuttleMonkey posted more than 6 years ago | from the stop-the-intertubes-i-wanna-get-off dept.

The Internet

eldavojohn writes "It's a common string you see at the start of an HTML document, a URI declaring the document type, but it is often processed in a way that causes undue traffic to the W3C's site. There's a somewhat humorous post today from W3.org that reads as a cry for sanity, asking developers to stop building systems that automatically query this information. From their post, 'In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG) and in the process doing something like validating against a DTD or schema. Handling all these requests costs us considerably: servers, bandwidth and human time spent analyzing traffic patterns and devising methods to limit or block excessive new request patterns. We would much rather use these assets elsewhere, for example improving the software and services needed by W3C and the Web Community.' Stop the insanity!"

334 comments


Wow (2, Funny)

geekoid (135745) | more than 6 years ago | (#22356634)

"Webmasters" strike again. Clowns.

Re:Wow (4, Insightful)

Breakfast Pants (323698) | more than 6 years ago | (#22356690)

Not only that, this document gets cached all over the place by ISPs, etc., and they *still* get that many hits.

Re:Wow (3, Interesting)

x_MeRLiN_x (935994) | more than 6 years ago | (#22356754)

The summary strongly implies and the article states that this unwanted traffic is coming from software that parses markup. Placing the DTD into a web page or other medium where markup is used is the intended and desirable usage.

I don't claim to know why you have a problem with webmasters (I am not one), but if you're a programmer and perceive them to have less technical ability than yourself, well.. your ilk seem to be the "clowns" this time.

Re:Wow (5, Insightful)

Bogtha (906264) | more than 6 years ago | (#22356772)

Why on earth are you blaming webmasters? They are just about the only people who cannot be responsible for this. People who write HTML parsers, HTTP libraries, screen-scrapers, etc, they are the ones causing the problem. Badly-coded client software is to blame, not anything you put on a website.

Gumdrops (4, Insightful)

milsoRgen (1016505) | more than 6 years ago | (#22356856)

They are just about the only people who cannot be responsible for this.
Exactly. For as long as I've been involved with HTML's various forms over the years, it was always considered proper technique (per W3C documentation) to include the doctype (or, more recently, xmlns). Certainly sounds like a parser issue to me.

The only thing I'm unclear on is whether your average browser is contributing to this problem when parsing properly written documents.

Re:Wow (-1, Troll)

Charlotte (16886) | more than 6 years ago | (#22356908)

They are just about the only people who cannot be responsible for this.

You're kidding, right? They literally wrote the standard. If they didn't want the traffic they should have specified the matter in their RFCs.

Webmasters, indeed - what this committee needs is a couple of sysadmins. You can tell them apart by their attention to the consequences of their actions.

Re:Wow (2, Informative)

milsoRgen (1016505) | more than 6 years ago | (#22356950)

You're kidding, right? They literally wrote the standard.
Well yes, they did (as long as the 'they' you are referring to is the W3C), and nowhere in the standards they have approved does it call for every system parsing a document with a DTD to request that information over and over again. Especially considering that the data tends to remain static once committed to an official standard.

Re:Wow (4, Insightful)

MenTaLguY (5483) | more than 6 years ago | (#22356986)

That's the whole purpose of the public identifier (e.g. "-//W3C//DTD HTML 4.01//EN") in the doctype, and the SGML and XML Catalog specifications!

The expectation is that software would ship with its own copies of "well-known" DTDs with associated catalog entries; the URL is only there as a fallback. The problem is ignorant and/or lazy software developers not implementing catalogs and simply downloading from the URI each time.
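
For anyone who hasn't seen it in practice, here is a minimal sketch (not any particular product's implementation) of catalog-style resolution using Java's SAX EntityResolver; the bundled resource path /dtds/strict.dtd is a made-up example:

import java.io.InputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Catalog-style resolution: map well-known public identifiers to DTD copies
// shipped with the application instead of fetching them from w3.org.
public class LocalDtdResolver implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId) {
        if ("-//W3C//DTD HTML 4.01//EN".equals(publicId)) {
            // Hypothetical classpath location of a bundled copy of strict.dtd.
            InputStream dtd = getClass().getResourceAsStream("/dtds/strict.dtd");
            if (dtd != null) {
                InputSource source = new InputSource(dtd);
                source.setPublicId(publicId);
                source.setSystemId(systemId);
                return source;
            }
        }
        return null; // unknown identifier: let the parser fall back to its default behaviour
    }
}

Hooking something like this into DocumentBuilder.setEntityResolver() or XMLReader.setEntityResolver() keeps the parser off the network for the well-known DTDs.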

Re:Wow (1, Insightful)

Anonymous Coward | more than 6 years ago | (#22357082)

The problem is ignorant and/or lazy software developers
No, the problem is a standards body which doesn't take into account that there are quite a few ignorant and/or lazy software developers around. Don't make people put one of your URLs in every web document if you don't want that file to be downloaded a gazillion times.

Re:Wow (5, Informative)

Bogtha (906264) | more than 6 years ago | (#22357044)

They literally wrote the standard.

"Webmasters" refers to people who run websites, not the W3C. And this particular feature is an artefact of SGML, which was around for over a decade before the W3C ever existed.

If they didn't want the traffic they should have specified the matter in their RFCs.

You mean like how RFC 2616 describes the caching mechanism that is being ignored by the problem clients? Or are you referring to the established-for-decades SGML system catalogue that they mention in the HTML 4 specification [w3.org] multiple times [w3.org] ?

You can tell them apart by their attention to the consequences of their actions.

If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

Re:Wow (5, Insightful)

Blakey Rat (99501) | more than 6 years ago | (#22357206)

If people writing client software actually did what they were supposed to, this wouldn't be a problem. This is not a designed-in bug, this is caused by a minority of developers eschewing the specifications and standard practice out of either ignorance or apathy.

Wow, it just struck me... welcome to Microsoft's world.

Their security was so bad for so many years because they worked on the assumption that:
1) Programmers know what they're doing
2) Programmers aren't assholes

Of course, the success of malware vendors (and Real Networks) proved those two assumptions wrong many years ago, and probably 90% of the development work on Vista was adding in safeties to protect against idiot programmers and asshole programmers.

And now the W3C is getting their lesson on a golden platter.

In short, here's the lesson learned:
1) Some proportion of programmers don't know what they're doing and never will
2) Some proportion of programmers are assholes

Re:Wow (5, Insightful)

Anonymous Coward | more than 6 years ago | (#22357176)

They literally wrote the standard

Yeah, the standard. If your shitty http engine is too shitty to process html without having to look up the DTD on the w3c's website for every single page, your shitty http engine shouldn't be allowed out on the internet.

Re:Wow (1)

x_MeRLiN_x (935994) | more than 6 years ago | (#22356928)

I agree with your main point, but blaming authors of screen scrapers is ridiculous. Screen scraping is reading the final output of a program (or in this case a web page) in image format and converting that into usable data with methods such as OCR (optical character recognition).

Re:Wow (1)

Bogtha (906264) | more than 6 years ago | (#22357086)

Screen scraping is reading the final output of program (or in this case web page) in image format and converting that into usable data with methods such as OCR (optical character recognition).

Actually, the term is widely used as a synonym for spidering a site. It's rare I see it used in the way you describe. Sorry for the confusion.

First? (0, Redundant)

robo_mojo (997193) | more than 6 years ago | (#22356636)

"oops"

The Solution (5, Funny)

OdieWan (757584) | more than 6 years ago | (#22356648)

I have a solution to the problem; I wrote it down at http://www.w3.org/TR/html4/strict.dtd [w3.org] !

Re:The Solution (5, Funny)

Anonymous Coward | more than 6 years ago | (#22356680)

Don't click that link! It's some sort of ascii pornography!

Re:The Solution (0)

colinrichardday (768814) | more than 6 years ago | (#22356750)

But wouldn't you access ascii porn by means of a submit button instead, or perhaps using DOM?

Re:The Solution (0)

jgoemat (565882) | more than 6 years ago | (#22356732)

ROFL! I can't believe I used up my Mod points...

Re:The Solution (1)

Rinisari (521266) | more than 6 years ago | (#22357084)

I clicked on that just for the principle of the thing.

Do what.... (5, Funny)

Creepy Crawler (680178) | more than 6 years ago | (#22356652)

Do what any other respectable web provider would do..

Put links to Goatse in the definitions!

Who made the DTD a URL? (2, Interesting)

Anonymous Coward | more than 6 years ago | (#22356656)

Oh, that was you? I thought that making every webauthor refer to a W3C URL in every web page was going to get someone in trouble someday. Today seems to be someday.

Re:Who made the DTD a URL? (2, Insightful)

colinrichardday (768814) | more than 6 years ago | (#22356726)

Or you could do what I do, and simply download the DTD, install it on your system,
and use that instead.

Re:Who made the DTD a URL? (4, Insightful)

ozamosi (615254) | more than 6 years ago | (#22356920)

It does contain a URL. It also contains a URN (for instance "-//W3C//DTD HTML 4.01//EN"). The point of a URN is that it doesn't have a universal location - you're supposed to find it wherever you can, probably in a local cache somewhere.

The URL can be seen as a backup ("in case you don't know the DTD for W3C HTML 4.01, you can create a local copy from this URL" - in the future, when people have forgotten HTML 4.01, that can be useful), or the same way XML namespaces are used - you don't have to send an HTTP request to http://www.w3.org/1999/xhtml [w3.org] to know that a document using that namespace is an XHTML document - it's just another form of URI (Uniform Resource Identifier), just like a URN or a GUID.

What the W3C is having a problem with is applications that decide to fetch the DTD on every single request. That's just crazy. Why do you even need to validate it, unless you're a validator? Just try to parse it - it probably won't validate anyway, and you'll have to either do it in some kind of quirks mode or just break. If you can parse it correctly, does it matter if it validates? If you can't parse it, does it matter if it validates? And if you actually do want to validate it, why make the user wait a few seconds while you fetch the DTD on every page request? The only reasonable way this could happen that I can think of is link crawlers that find the URL - but don't link crawlers usually avoid revisiting pages they just visited?

Re:Who made the DTD a URL? (1)

milsoRgen (1016505) | more than 6 years ago | (#22356982)

it probably won't validate anyway
Ain't that the truth, brother... I find myself coding for the program parsing the information way more often than I am coding for the standards, as coding standards-based markup always runs into issues.

Re:Who made the DTD a URL? (0)

Anonymous Coward | more than 6 years ago | (#22356992)

Yes, that's the idea. However, regardless of the purpose of the URL, you can't make people put one of a few URLs which point to your server in *every* web page, and expect to get away with it unscathed. It's a stupid idea. Complaining afterwards that people are people only puts the icing on the cake, IMHO.

Re:Who made the DTD a URL? (1)

MenTaLguY (5483) | more than 6 years ago | (#22357094)

Minor quibbles: "-//W3C//DTD HTML 4.01//EN" is not a URN but a PI (public identifier), and there is a reason to have validating parsers: the DTD can contain essential information for correctly interpreting the document (e.g. entity declarations, as is obviously the case in HTML).

Other than that you're spot on.

Re:Who made the DTD a URL? (1)

Bogtha (906264) | more than 6 years ago | (#22357112)

Why do you even need to validate it, unless you're a validator? Just try to parse it

The external DTD subset isn't just for error checking. It defines the character entities and the content model for element types. If you don't have access to the DTD (or hard-coded HTML-specific behaviour) you can't parse it fully.

Re:Who made the DTD a URL? (0)

Anonymous Coward | more than 6 years ago | (#22357230)

Even non-validating parsers have to fetch the DTD. The DTD may contain XML entities that the parser has to substitute into the parsed document (think &lt;, for example).

The problem is with the docs (4, Insightful)

Mantaar (1139339) | more than 6 years ago | (#22357092)

The problem does not lie in the mechanism itself - it's in the documentation - or the lack of understandable (or at least often-used) docs directly at the source.

Simple caching on client side could already improve the situation a whole lot... BUT:

When people implement something for html-ish or svg-ish or xml-ish purposes, they go google for it: "Howto XML blah foo" - result, they're getting basic screw-it-with-a-hammer tutorials that don't point out important design decisions, but instead Just Work - which is what the author wanted to achieve when they started writing the software.

It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iptables and iproute2. But since most tutorials and howtos on the net are just dumbed-down copypasta for quick and dirty hacks - and since nobody fucking enforces the standards - nobody does it the Right Way.

So if I start writing some sax-parser, some html-rendering lib, some silly scraper, whatnot... and the first example implementations only deal with basic stuff and show me how to do it so basic functionality can be implemented... and I'm not really interested in that part of the program anyways, because I need it for putting something more fancy on top... then once I'm through with the initial testing of this particular subsystem, I won't really care about anything else. It works, it doesn't seem to hit performance too badly, and it's done according to some random guy's completely irrelevant blog - hey, this guy knows what he's doing. I don't care!

This story hitting /.'s front page might actually help improve the situation. But.. it's like this with stupid programmers - they never die out, they'll always create problems. Let's get used to it.

Re:The problem is with the docs (0)

znerk (1162519) | more than 6 years ago | (#22357274)

It's a little bit like people still using ifconfig on Linux though it's been deprecated and superseded by iptables and iproute2
Oh, please do tell me how to use iptables or iproute2 to set my ip address, or to enable/disable a network adaptor.

Sadly NIMP hasn't taken a hold. (-1, Offtopic)

Anonymous Coward | more than 6 years ago | (#22356664)

I honestly have no idea why the NIMP project [nimp.org] even supports it.
Please see the related links.

WARNING: GNAA (2, Funny)

SirBudgington (1232290) | more than 6 years ago | (#22356704)

Don't click the link, it's malware.

Leave it to Slashdot... (2, Funny)

PocketPick (798123) | more than 6 years ago | (#22356668)

It's a good thing we don't contribute to the problem - Oh, wait...

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                        "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title>Slashdot: News for nerds, stuff that matters</title>

Re:Leave it to Slashdot... (5, Informative)

snl2587 (1177409) | more than 6 years ago | (#22356700)

Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.

Re:Leave it to Slashdot... (2, Insightful)

Vectronic (1221470) | more than 6 years ago | (#22356742)

And if he really wanted to be funny, he would have quoted it from the W3C page that the story/blog post itself is hosted on.

Re:Leave it to Slashdot... (2, Informative)

corsec67 (627446) | more than 6 years ago | (#22356888)

Actually, do any browsers get the DTD?
From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.

Browsers are also pretty good about caching stuff.

Re:Leave it to Slashdot... (2, Informative)

milsoRgen (1016505) | more than 6 years ago | (#22356914)

From the article, it seems like the problem is with software that processes XML, like a web crawler, not a browser.

FTA:

The vast majority of these requests are from systems that are processing various types of markup (HTML, XML, XSLT, SVG)

I don't claim to fully grasp what software is causing the problem, but it does seem to affect more than just XML.

Re:Leave it to Slashdot... (1)

MenTaLguY (5483) | more than 6 years ago | (#22357010)

Browsers should ship with the DTD. That's the whole point of the public identifier (e.g. "-//W3C//DTD HTML 4.01//EN"), so that a local copy can be obtained using the PI as an index into a local catalog. The URL is only there as a fallback.

Re:Leave it to Slashdot... (1)

corsec67 (627446) | more than 6 years ago | (#22357072)

Except that I was talking about software ASIDE from browsers, like an XML validator, crawler, etc...
Stuff that deals with generic XML and is being used for XHTML.

Re:Leave it to Slashdot... (2)

MenTaLguY (5483) | more than 6 years ago | (#22357102)

Even then, those should be caching in a local catalog, based on the PI.

Re:Leave it to Slashdot... (0)

Anonymous Coward | more than 6 years ago | (#22356900)

Note: it's the thief that does the stealing. So, anyone putting them up to it (and/or providing the means and/or the easy access) is irrelevant?

Re:Leave it to Slashdot... (1)

darkpixel2k (623900) | more than 6 years ago | (#22357264)

Note: It is my understanding that the browser is what looks up the DTD. So /. having the declaration is irrelevant.

Yeah. Whose retarded idea was it to give something a valid URI and then say "wait--don't query this URI, it's just for show"? Well, if it's just for show, don't make it a URI that various automated systems might want to query because a programmer failed to include some 'query everything except this' code.

(In case it isn't obvious, I'm talking out my butt. I really have no clue when it comes to DTDs, except that most WYSIWYG web design programs paste that garbage in automagically.)

Umm, no. (5, Informative)

pavon (30274) | more than 6 years ago | (#22356758)

That is supposed to be there according to the standard. And all the major browsers cache that file after loading it (at most) once, and then never read it again. So no, Slashdot is not causing a problem. The problem is all the other HTML-processing software besides browsers that doesn't cache DTD files, not the pages that contain the declaration.

If you want to complain, complain about the fact that Slashdot serves a strict doctype when its pages don't validate against it.

Re:Umm, no. (1)

Skapare (16644) | more than 6 years ago | (#22357040)

It's the whole design of HTML/XML, that needs to have DTD files in the first place to do the processing, that is all wrong. I warned about this well over 12 years ago. At least what little code I've written to process HTML/XML has always entirely ignored the DTD.

Re:Leave it to Slashdot... (1)

Bogtha (906264) | more than 6 years ago | (#22356792)

No, Slashdot is not contributing to the problem, that is correct code. Just because a URI is listed, it doesn't mean that software should request it each and every time it sees it. Most code that sees that URI should already have a copy of the DTD in the local catalogue. It's only generic SGML software that cannot be expected to have a copy of the DTD.

Who designed this crazy system?! (1, Funny)

Anonymous Coward | more than 6 years ago | (#22356670)

Isn't this what you call "eating your own dogfood"?

Delay (5, Interesting)

erikina (1112587) | more than 6 years ago | (#22356674)

Have they tried delaying the response by 5 or 6 seconds? It could cause a lot of applications to hang pretty badly. That or just serve a completely nonsensical schema every thousandth request. Gotta keep developers on their toes.

Re:Delay (3, Informative)

bunratty (545641) | more than 6 years ago | (#22356906)

RTFA. They returned the 503 Service Unavailable error to many abusers, and they just kept on with abusive requests. Many abusers aren't checking the response to the request at all.

Re:Delay (1)

erikina (1112587) | more than 6 years ago | (#22357132)

Probably because a 503 Service Unavailable might not break the app, just skip the validation stage. You need to do something to degrade the usefulness of the application (cause it to hang or break). Also, an across-the-board 5-second wait means developers will see the problem at development time, not only after the software has already been deployed, caused problems, and been blocked.

Re:Delay (5, Insightful)

bwb (6483) | more than 6 years ago | (#22357154)

Sure, they're ignoring the response status, but I'll betcha most of them are doing synchronous requests. If I were solving this problem for W3C, I'd be delaying the abusers by 5 or 6 *minutes*. Maybe respond to the first request from a given IP/user agent with no or little delay, but each subsequent request within a certain timeframe incurs triple the previous delay, or the throughput gets progressively throttled-down until you're drooling it out at 150bps. That would render the really abusive applications immediately unusable, and with any luck, the hordes of angry customers would get the vendors to fix their broken software.
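
To make the suggestion concrete, a toy sketch of the tripling-delay bookkeeping (the window, initial delay and cap are arbitrary numbers, and a real deployment would track clients more robustly than a single in-memory map):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy throttle: first request in the window is served immediately,
// every repeat within the window triples the delay, up to a cap.
public class ProgressiveThrottle {
    private static final long WINDOW_MS = 60L * 60L * 1000L;   // forget a client after an hour
    private static final long FIRST_DELAY_MS = 5000L;          // 5 seconds on the second hit
    private static final long MAX_DELAY_MS = 6L * 60L * 1000L; // cap at 6 minutes

    // client key (e.g. IP + user agent) -> { last seen millis, current delay millis }
    private final Map<String, long[]> clients = new ConcurrentHashMap<String, long[]>();

    public long delayFor(String clientKey) {
        long now = System.currentTimeMillis();
        long[] state = clients.get(clientKey);
        if (state == null || now - state[0] > WINDOW_MS) {
            clients.put(clientKey, new long[] { now, 0L });
            return 0L; // first hit in the window: no delay
        }
        long next = (state[1] == 0L) ? FIRST_DELAY_MS : Math.min(state[1] * 3L, MAX_DELAY_MS);
        clients.put(clientKey, new long[] { now, next });
        return next;   // caller sleeps this long (or throttles bandwidth) before responding
    }
}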

Re:Delay (4, Funny)

dotancohen (1015143) | more than 6 years ago | (#22357104)

You must be a Microsoft engineer.

Re:Delay (3, Informative)

RhysU (132590) | more than 6 years ago | (#22357128)

Good: Delivered a piece of code once that tested just fine for us, but blew up at the customer's site. We never realized that the new J2EE-like features were hitting a live URL during DTD parsing.

Better: Had a build system once that looked for a host and had to TCP timeout before the build could continue. Had to happen several hundred times a build cycle.

The Java libraries do this down in their innards unless you're very careful to avoid it.
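
For reference, the kind of configuration that avoids it with a stock JAXP setup looks roughly like this; the feature URI is Xerces-specific, and skipping the external DTD is only safe if your documents don't rely on entities defined there:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class QuietParse {
    public static Document parse(java.io.File file) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Tell Xerces not to download the external DTD at all.
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Alternatively (or additionally), install an EntityResolver that serves local copies.
        return db.parse(file);
    }
}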

MIT needs a CDN! (2, Interesting)

rekoil (168689) | more than 6 years ago | (#22356694)

I'm surprised none of the CDNs out there have volunteered to host this file - the problem is they'd have to host the entire w3.org site, or else move the rest of it to another hostname.

That's what you get for making stupid rules. (1, Interesting)

v(*_*)vvvv (233078) | more than 6 years ago | (#22356696)

They insist that every document begin with a declaration that includes a link to their site. Now they are complaining about traffic.

The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.

Re:That's what you get for making stupid rules. (1)

colinrichardday (768814) | more than 6 years ago | (#22356774)

You don't need a DTD, nor do you need to link it to W3C.

Re:That's what you get for making stupid rules. (5, Informative)

Bogtha (906264) | more than 6 years ago | (#22356832)

They insist that every document begin with a declaration that includes a link to their site.

It's not a link. It's a reference to an external DTD subset. It's there so that generic SGML software can properly parse the document without any special knowledge of HTML.

The link in the declaration serves absolutely NO purpose other than to comply with the standard that they created. It sounds like the whole purpose was so that they could have every source page begin with a link to their site. Serves them right.

No, external DTD subsets are a part of SGML, which is at least a decade older than the W3C.

Re:That's what you get for making stupid rules. (1)

reddburn (1109121) | more than 6 years ago | (#22357254)

From the W3C specifications for XHTML documents [Link [w3.org] ]

3.1.1 - Strictly Conforming Documents ...There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in DTDs using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions... An XML declaration is not required in all XML documents; however XHTML document authors are strongly encouraged to use XML declarations in all their documents.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This is the DTD they require. Because the DTD is not declared inline in an XHTML document, it must contain the external reference (the second part - the link to "w3.org/TR/xhtml1/DTD/xhtml1-XXXX.dtd") to the W3C's DTD - which is, presumably, what they're bitching about.

caching (0, Redundant)

TheSHAD0W (258774) | more than 6 years ago | (#22356698)

Add some sort of caching parameter to the DTD spec, that specifies how long browsers should cache those DTDs.

Another potential solution: Have browsers keep the DTDs cached, and then check the file date periodically when re-requested. This will still put some load on the w3c's servers, but significantly less than complete re-downloads.

Caching DTDs locally (1)

NetSettler (460623) | more than 6 years ago | (#22356728)

Another potential solution: Have browsers keep the DTDs cached, ...

Or the routers. Frankly, if the result is known to not change, w3 could probably agree with the network authorities to put copies around the net and treat those heavily used URIs as URNs, and just never go to w3 (or rarely go there) instead.

The notion that URNs have to be known in advance as "the popular thing" rather than being discovered after-the-fact by noticing high-volume URIs is probably the real bug here.

Re:caching (1)

corsec67 (627446) | more than 6 years ago | (#22356780)

W3C already says how long the DTD should be cached for: 90 days, using the Cache-Control HTTP header, which is set to "max-age=7776000" (seconds).

They already do. (4, Informative)

pavon (30274) | more than 6 years ago | (#22356804)

The spec already recommends this and all the major browsers do it. The software causing the problem is the generic XML/SGML processing packages which were designed to deal with documents using any random DTD, not just the main HTML/XHTML ones from the W3C. They are the ones downloading each DTD every single time and not caching it, contrary to the standard. Sometimes caching is a configuration option which defaults to off and administrators never turn it on.

Re:caching (2, Insightful)

Bogtha (906264) | more than 6 years ago | (#22356864)

Add some sort of caching parameter to the DTD spec, that specifies how long browsers should cache those DTDs.

You're solving that problem at the wrong layer. HTTP already includes caching mechanisms, the W3C already use them, and part of the problem is that buggy software is ignoring them.

Another potential solution: Have browsers keep the DTDs cached

Please read the article. This is already supposed to happen. Buggy software fails to do this, which is the problem being talked about.

Simple solution (5, Funny)

mcrbids (148650) | more than 6 years ago | (#22356702)

The answer to this problem is quite easy.

Continue to host the data referenced on a single T-1 line. That will cut your expenses to the bone since you'll never exceed 1.54 Mbps and that should be quite cheap. And, any dumfuxorz who fubarred their parser to not cache these basically static values will probably figure it out... very quickly.

You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.

Problem solved!

Starting on the 1st, fool (5, Funny)

Scrameustache (459504) | more than 6 years ago | (#22356896)

You don't have to leave it on the T-1, maybe just 1 month out of the year. Every year.
I suggest April! :D

I'm just conforming! (-1, Flamebait)

ShatteredArm (1123533) | more than 6 years ago | (#22356708)

Hey, you made the specs. Why should you blame me if I'm conforming? Does the spec allow me to assume that all my documents are going to use that DTD, and that it won't change?

What to do, what to do...

Re:I'm just conforming! (1)

jlarocco (851450) | more than 6 years ago | (#22357062)

Hey, you made the specs. Why should you blame me if I'm conforming? Does the spec allow me to assume that all my documents are going to use that DTD, and that it won't change?

Sigh.

  • If you're the one writing the xml this is almost no concern of yours.
  • The DTD won't change. That's the point of having a standard DTD.
  • The standards say absolutely nothing about fetching the DTD from the web every time an xml file is being validated.

What to do, what to do...

Try getting a clue.

Re:I'm just conforming! (1, Insightful)

ShatteredArm (1123533) | more than 6 years ago | (#22357120)

I was just being facetious, first of all... But if you must...

The DTD won't change. That's the point of having a standard DTD.

What's the point of having a DTD if it won't change? Oh yeah, there is none. Conceptually, the DTD is there to define the data, and unless you know what is in the DTD, you cannot use it to validate, which is its purpose. And conceptually, if you assume the data is defined a certain way, you don't need a DTD.

If you're the one writing the xml this is almost no concern of yours.

Generally the DTD is for the person parsing the XML. If you're writing the XML, you don't need a DTD, because you already know the schema. If it's only for the XML writers, all you'd need to do is place your schema with the rest of the specs for your application.

Now I wasn't suggesting that in practice you should go to the server every time and fetch the DTD. But clearly you take things too seriously.

Try getting a clue.
Try getting a sense of humor.

Poetic justice (0, Redundant)

shark swooner (1077115) | more than 6 years ago | (#22356714)

Serves them right for forcing us to include the same long urls that point to files that never change in every single HTML file ever.

had this problem with Hibernate's website... (3, Interesting)

rgrbrny (857597) | more than 6 years ago | (#22356718)

the doctype was being used during an XSL transform in our build process; when the Hibernate site flaked out, the builds would fail intermittently.

solution was to add an xmlcatalog using a local resource.

bet this happens a lot more than most people realize; we'd been doing this for years before we noticed a problem.

'Web Community'? (1)

radimvice (762083) | more than 6 years ago | (#22356744)

A plea to the web community to stop pinging the W3C DTDs isn't going to solve anything. What will work is blocking any unnecessary DTD traffic aggressively, and if that doesn't do the job, blocking it even more aggressively. Intelligently designed software / ISPs / routers will cache, filter and block these requests for the sake of their own efficiency, bandwidth, and proper function. Buggy, bloated and inefficient applications won't. Nothing's ever going to convince the 'web community' to stop pinging the DTDs out of an altruistic concern for W3C's servers, it will need to become beneficial for those software developers to devote the extra development/debugging/patching efforts to do so.

Simple (1)

Citizen of Earth (569446) | more than 6 years ago | (#22356760)

I can't think of a problem that is simpler to solve. Just stop serving these documents. The offending programs will be fixed very quickly.

Irony (4, Funny)

davburns (49244) | more than 6 years ago | (#22356764)

So, w3c complains about their bandwidth, and the response is: The Slashdot Effect. Doesn't that make the old bandwidth problem seem less of a problem?

I'm just loving the irony in that.

Such an easy solution (5, Funny)

mwasham (1208930) | more than 6 years ago | (#22356790)

And it is only 4 articles down.. Host with Yahoo! Yahoo Offers All-You-Can-Eat Storage and Bandwidth http://hardware.slashdot.org/article.pl?sid=08/02/08/1811236 [slashdot.org]

Re:Such an easy solution (1)

shawn(at)fsu (447153) | more than 6 years ago | (#22357220)

I lol'ed. Kudos you're a good problem fixer.

Submitted this to /.? (5, Funny)

dotancohen (1015143) | more than 6 years ago | (#22356810)

Great, they cry "we get too much traffic", so we go ahead and slap them on the front page of slashdot. Sick, sick fucking joke.

Re:Submitted this to /.? (5, Informative)

ger (3028) | more than 6 years ago | (#22357150)

To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us [del.icio.us] , etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.

Re:Submitted this to /.? (0)

Anonymous Coward | more than 6 years ago | (#22357234)

That should teach you to put potentially problematic URLs on separate sub-domains, so that you can more easily deal with traffic issues should they arise, either by letting a different server(-farm) handle the requests or by using DNS to point the clients away from your servers entirely if it gets too much.

I always thought it was stupid (1)

RelliK (4466) | more than 6 years ago | (#22356828)

I always thought it was stupid that XML documents include reference to a DTD hosted on a remote server that you do not maintain. This is wrong on so many levels, I don't even know where to begin:

1. The validation will not work if the remote server is down, or the network is down, or your connection to the internet is down, or if the file is not accessible for any other reason.

2. You are at the mercy of some third-party to ensure that the file is correct and that it doesn't change.

3. You are susceptible to man-in-the-middle attack.

etc.

For some insane reason, all XML examples have this reference to a remote URL. Most people never change defaults, so we get in a situation where nearly every time XML is validated, W3C site gets hit. The geniuses at W3C should have thought of that *before* this happened. Now they have to live with it...

Re:I always thought it was stupid (1)

MenTaLguY (5483) | more than 6 years ago | (#22357036)

What people were supposed to do is include a copy of the DTDs with their software. That's what the PI string is there for, as an index into a local catalog of DTD resources. The URL was supposed to be only a fallback measure.

Re:I always thought it was stupid (2, Interesting)

MtHuurne (602934) | more than 6 years ago | (#22357172)

I wrote my thesis in DocBook and installed the processing toolchain on a laptop. Sometimes the processing would fail and sometimes it worked. After a while I noticed it worked when I was sitting behind my desk and failed when I was sitting on my bed. After some digging, I found out that the catalog configuration was wrong and the XML parser was downloading the DTDs from the web. This was before WiFi, so sitting on the bed meant the laptop did not have internet access.

The core of the problem is that most XML parsers will automatically and transparently fetch the DTD from the URL and do not cache it. So if you have no DTDs installed locally, or if your XML parser cannot find them (catalog configuration is easy to mess up), the parsing will still work just fine, and if processing the XML takes a significant amount of time, you probably won't notice the small delay from downloading the DTD.

There are several possible solutions for this:

  • Do not automatically fetch DTDs from the web: make it an explicit option that the user has to set.
  • Be vocal when fetching a DTD from the web, for example issue a warning.
  • Cache fetched DTDs locally.

All of these are things that should be addressed in the XML parsers.
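
As a rough sketch of what the second and third points might look like together in a SAX EntityResolver (the cache directory and file-naming scheme here are arbitrary choices):

import java.io.*;
import java.net.URL;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Vocal, caching resolver: serve DTDs from a local cache directory and warn
// loudly whenever one actually has to be fetched over the network.
public class CachingDtdResolver implements EntityResolver {
    private final File cacheDir;

    public CachingDtdResolver(File cacheDir) {
        this.cacheDir = cacheDir;
        cacheDir.mkdirs();
    }

    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        if (systemId == null) {
            return null; // nothing to resolve; let the parser decide
        }
        File cached = new File(cacheDir, Integer.toHexString(systemId.hashCode()) + ".dtd");
        if (!cached.isFile()) {
            System.err.println("WARNING: fetching external DTD from " + systemId);
            InputStream in = new URL(systemId).openStream();
            OutputStream out = new FileOutputStream(cached);
            try {
                byte[] buf = new byte[8192];
                for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                    out.write(buf, 0, n);
                }
            } finally {
                in.close();
                out.close();
            }
        }
        InputSource source = new InputSource(new FileInputStream(cached));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}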

I'm going to say this as clearly as possible. (3, Informative)

glwtta (532858) | more than 6 years ago | (#22356834)

Browsers cache the DTDs.

There, you can now stop posting your hilarious "jokes".

Obvious Solution (1)

PaK_Phoenix (445224) | more than 6 years ago | (#22356868)

Load too big on your server, need to slow down the traffic a bit.

Slashdot it.

That should work

Surprise (3, Insightful)

MBCook (132727) | more than 6 years ago | (#22356904)

I've got to say, this doesn't surprise me at all. In the time I've spent at my job, I've been repeatedly floored by the amazing conduct of other companies' IT departments. We've only encountered two people I can think of who have been hostile. Everyone else has been quite nice. You'd think people would have things set up well, but they don't.

We've seen many custom XML parsers and encoders, all slightly wrong. We've seen people transmitting very sensitive data without using any kind of security until we refused to continue working without SSL being added to the equation. We've seen people who were secure change their certificates to self-signed, and we seem to consistently know when people's certificates expire before they do.

But even without these things, I can't tell you how many people send us bad data and flat out ignore the response. We get all sorts of bad data sent to us all the time. When that happens, we reply with a failure message describing what's wrong. Yet we get bits of stuff all the time that is wrong, in the same way, from the same people. I'm not talking about sending us something that they aren't supposed to (X when we say only Y), I'm saying invalid XML type wrong... such that it can't be parsed.

We have, a few times while I've been there, had people make a change in their software (or something) and bombard us with invalid data until we either block their IP or manage to get into voice contact with their IT department. Sometimes they don't even seem to notice the lockout.

Some places can be amazing. Some software can be poorly designed (or something can cause a strange side effect, see here [thedailywtf.com] ). I really like one of the suggestions in the comments on the article... start replying really slow, and often with invalid data. They won't do it. I wouldn't. But I like the idea.

A lesson from network history (1)

idontgno (624372) | more than 6 years ago | (#22356922)

which is never ever learned...

A freely accessible [wikipedia.org] network [wikipedia.org] resource [wikipedia.org] is begging to be driven, smoking and shattered, into the ground by the ill-mannered, ill-trained, or ill-intentioned hordes.

Personally, I blame the 1994 introduction of AOL to Usenet for this downward spiral. We were doing just fine before all you "me too"s started pouring in.

Get off my lawn, you clueless kids!

Re:A lesson from network history (0)

Anonymous Coward | more than 6 years ago | (#22357050)

Godwin says otherwise. Blame the Jews.

Stupid design decisions in standards ... (1)

Lazy Jones (8403) | more than 6 years ago | (#22356984)

... come back to haunt you.

Perhaps they will stop putting HTTP URLs in standardized tags now... Also, enjoy life as a web content provider who spends many hours per week blocking Referers (nice typo in the original RFC!) and dealing with broken clients, something the W3C never spent much time pondering.

Make it slower, not faster (2, Insightful)

Thunderbear (4257) | more than 6 years ago | (#22356990)

If the problem is that it gets served out too many times, then make the server slow as molasses. If it takes 1-2 minutes, or more, to get the DTD from the server, the problem is quickly discovered by the performance teams.

Stop the insanity (1)

kaosgoblin (1230836) | more than 6 years ago | (#22356998)

Must Add 5 miles of data to my code now, they need MORE DATA!!!

The HTML 5 doctype kind of solves this (1)

Jugalator (259273) | more than 6 years ago | (#22357088)

That doctype is simply <!DOCTYPE HTML> [w3.org] !

Recording UA? (1, Redundant)

dotancohen (1015143) | more than 6 years ago | (#22357116)

What are the user agents making the requests? Do these programs identify themselves with a UA string or something?

heh (1)

rastoboy29 (807168) | more than 6 years ago | (#22357138)

I bet Slashdot.org could possibly find some bloggers that would be more than happy to receive that traffic!

Re:heh (1)

milsoRgen (1016505) | more than 6 years ago | (#22357204)

that'd be fun, hijack a dns server and have all the requests directed to whatever project you have...

Look investors look! Look at all the hits we're getting! More money please!

Sorry that was me! (1)

syousef (465911) | more than 6 years ago | (#22357166)

I'm sorry, the typo's mine. I made it when I was working late one night and spilt spaghetti down my shirt. I had no idea that it would propagate so far and ruin the web. Oops. Anyway I've fixed it, but it's not in the stable CVS branch yet, so I'm afraid you'll just have to put up with it for a while longer.

(For those without a sense of humour, yes this is a joke)

rofl (0)

Anonymous Coward | more than 6 years ago | (#22357186)

its not too hard to host the dtd file on your own server, amirite? not like its gonna change....

ISP, where's your DTD server? (1)

leek (579908) | more than 6 years ago | (#22357200)

Perhaps ISPs should install caching DTD servers.

People would have another reason to complain about their ISP's quirks.

That's the problem with a URI for an ID (3, Insightful)

argent (18001) | more than 6 years ago | (#22357214)

I think they screwed up, and brought this on themselves. I already thought that it was annoying having so verbose an identifier... this just makes it more hateful.

If they'd at least made the identifier NOT a URI, something like domain.example.com::[path/]versionstring, or something else that wasn't a URI, so it was clearly an identifier even if it was ultimately convertible to a URI, they would have avoided this kind of problem.

Adding another use to an existing standard... (0)

Anonymous Coward | more than 6 years ago | (#22357222)

Classic problem of using something developed for one purpose (a remote resource locator) for something else (a unique identifier).

If they had done something simple like using some non-functional protocol identifier like 'ident' (e.g. xmlns="ident://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"), browsers and other software would have been developed to never actually 'do' anything with such a URI.

The SlashDot effect could shut down this site! (1)

Jack Pallance (998237) | more than 6 years ago | (#22357258)

Here, I'm posting the *real* article so you guys don't have to click through this blogspam!

http://www.w3.org/1999/xhtml For further information, see: http://www.w3.org/TR/xhtml1 [w3.org] Copyright (c) 1998-2002 W3C (MIT, INRIA, Keio), All Rights Reserved. This DTD module is identified by the PUBLIC and SYSTEM identifiers: PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" $Revision: 1.1 $ $Date: 2002/08/01 13:56:03 $ --> %HTMLlat1; %HTMLsymbol; %HTMLspecial;

Obligatory (0, Troll)

Zombie Ryushu (803103) | more than 6 years ago | (#22357270)

That sounds like a DTD thing to do! If you are a dee, please don't marry a tee, because if you marry a tee, your kids will be DEE TEE DEE.