How Does Bing Get Around robots.txt?
Having recently received a large number of visitors to my site who don't know what a FAQ is, I was combing through my access logs and found a growing number of people reaching my web site through Microsoft's Bing search engine. That's odd. I've got msnbot blocked in my robots.txt. A scan of the logs shows that msnbot (and its variants, such as msnbot-media) continues to check robots.txt, but nothing else. So how did Bing index my site?
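For anyone who wants to run the same kind of check on their own logs, here is a minimal sketch in Python. It assumes the server writes the common "combined" log format; the sample lines, addresses, and the function name summarize are made up for illustration and are not taken from my actual logs.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def summarize(lines, bot_token="msnbot", referer_token="bing.com"):
    """Count paths requested by the bot and paths reached via the referrer."""
    bot_paths = Counter()
    referred_paths = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        if bot_token in match["agent"].lower():
            bot_paths[match["path"]] += 1
        if referer_token in match["referer"].lower():
            referred_paths[match["path"]] += 1
    return bot_paths, referred_paths


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jun/2009:10:00:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 64 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.11 - - [01/Jun/2009:10:05:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 64 "-" "msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)"',
    '198.51.100.7 - - [01/Jun/2009:10:07:00 +0000] "GET /faq.html HTTP/1.1" '
    '200 5120 "http://www.bing.com/search?q=example" "Mozilla/5.0"',
]

bot_paths, referred_paths = summarize(SAMPLE_LOG)
print("Fetched by msnbot:", dict(bot_paths))
print("Reached from bing.com:", dict(referred_paths))
```

If the bot only ever shows up against /robots.txt while the referrals land on ordinary pages, you are looking at the same puzzle I am.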
Here are my current theories:
- Ignore robots.txt and index anyway (unlikely)
- Use semantic web technology, in the vein of Google Wave's spell checker, to infer things about my site based on other pages' links to it
- Use information "phoned home" from users' browsing (Microsoft's EULAs allow for it)
Ignoring robots.txt is the most straightforward method, but also the least likely. While they may "accidentally" index forbidden pages now and then (see the Bing forums - they do), even Microsoft's evil has its limits.
Besides that, a Ms. W. from Murphy and Associates, on behalf of MSN Live Search, contacted me over a year ago requesting that I allow msnbot to scan my site. I kindly said, "No thank you," listed just a few of the crimes against humanity (most in the 1990s) that directly affected me, and let her know that I couldn't be paid to help Microsoft in any way, including allowing their search engine to index my site. Quality over quantity.
I tried contacting Ms. W. after finding all of these Bing referrals, but either she is no longer with them or she would prefer not to get involved in my dispute with Microsoft. Nonetheless, I have made an effort to remind Microsoft about my policy, and that I am none too pleased that they are still indexing my site.
But if they aren't doing it through the straightforward method, then how?
Well, the recent Google Wave spell checker demonstration had me thinking of other uses of semantic web technologies. It seems to me that much can be inferred about blind spots on the web if one can get a grip on the context of the pages around them. So without indexing my pages, the links that use my site as a primary source can contribute to an inferred index of what my site contains. The text of anchor tags that link to my site would be an excellent source of high-quality query keywords, linking directly to the most relevant information.
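To make the idea concrete, here is a rough sketch in Python of building a keyword index for a site purely from the anchor text of other pages that link to it. This is my guess at the general shape of such a thing, not a description of what Bing or Google actually does. The class and function names, the example hosts, and the sample pages are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # target of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def infer_index(pages, target_host):
    """Map each anchor-text word to the target-site URLs it was used to link to.

    `pages` maps the URL of an already crawled page to its HTML. The target
    site itself is never fetched.
    """
    index = defaultdict(set)
    for page_url, html in pages.items():
        collector = AnchorCollector(page_url)
        collector.feed(html)
        collector.close()
        for url, text in collector.links:
            if urlparse(url).hostname == target_host:
                for word in text.lower().split():
                    index[word].add(url)
    return index


# Hypothetical crawled pages that link to a site the crawler never visited.
PAGES = {
    "http://blog.example.org/post":
        '<p>See the <a href="http://uncrawled.example.com/faq.html">'
        'Unix shell FAQ</a>.</p>',
    "http://wiki.example.net/links":
        '<a href="http://uncrawled.example.com/faq.html">shell scripting '
        'answers</a> and <a href="/local">a local page</a>',
}

index = infer_index(PAGES, "uncrawled.example.com")
for word in sorted(index):
    print(word, sorted(index[word]))
```

Every keyword in the resulting index came from somebody else's page, yet a query for any of them would lead straight to mine.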
(Sergey, if you aren't working on such technology, I make no claim to the ideas here. They seem like a natural extension of your Wave spell checker work. Please just be sure to exclude links to any pages a site's robots.txt forbids. We don't want to start being evil, now.)
This theory is certainly doable. Even Microsoft techies can index links from indexed pages and put them into the results page without much trouble. And having no conscience to speak of, they would never think that maybe they should cross-check robots.txt to prevent unwanted indexing.
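The cross-check itself is not hard. Here is a sketch using Python's standard urllib.robotparser, with a robots.txt along the lines of mine (the exact contents, the host name, and the helper names build_parser and filter_allowed are assumptions for illustration). Python's parser matches user-agent tokens by substring, so a rule for msnbot also covers msnbot-media here; a real crawler's matching rules may differ.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: block msnbot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_allowed(urls, robots_by_host, user_agent):
    """Keep only the URLs that the owning site's robots.txt permits.

    Assumption: hosts with no known robots.txt are kept. A more careful
    indexer would fetch the file first instead of guessing.
    """
    allowed = []
    for url in urls:
        parser = robots_by_host.get(urlparse(url).hostname)
        if parser is None or parser.can_fetch(user_agent, url):
            allowed.append(url)
    return allowed


robots_by_host = {"uncrawled.example.com": build_parser(ROBOTS_TXT)}
inferred_links = ["http://uncrawled.example.com/faq.html"]

for agent in ("msnbot", "msnbot-media", "Googlebot"):
    print(agent, filter_allowed(inferred_links, robots_by_host, agent))
```

Run over the links harvested from other people's pages, a filter like this would keep a forbidden site out of the inferred index as well as the crawled one.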
That leads to the third potential method of gathering information, the EULA. Many people have discussed Microsoft's ever-changing End User License Agreements for their products, and how much information those agreements allow Microsoft to gather about users' working habits. Pretty much everything now "phones home" with information that Microsoft will tell you is meant to make your computing life better.
Well, with "permission" from millions of people to track their browsing habits (through IE, their "security" offerings, or even special proxy servers of partner ISPs), Microsoft wouldn't have to crawl any sites. They could just let their users browse the web and send back just the indexes. How much is currently understood about what Microsoft products are sending back to Redmond?
Of the three methods, the first strikes me as unlikely; the second is the most interesting, and most Googley; and the third strikes me as the most likely thing that Microsoft would do. It's all perfectly legitimate.
Well, it's legitimate except that I would rather not have Microsoft benefiting from any work that I do. I have tried to contact one of their agents to let them know I am none too pleased with the current situation, but I was ignored. And signing up for an MSN Live account to post on their Bing site isn't going to happen - I refuse their EULAs across the board.
So how can I get Microsoft to pay attention? For the time being, I'm rerouting all traffic with REFERER containing "bing.com" to Google. There are worse places I could send them, but I don't want to be cruel.
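For the curious, the rerouting amounts to something like the following. This is a minimal sketch written as Python WSGI middleware, which is only one of several ways to do it (a rewrite rule in the web server's configuration would work just as well). The function names, the stand-in site application, and the exact target URL are assumptions for illustration.

```python
def redirect_referrals(app, token="bing.com", target="https://www.google.com/"):
    """Wrap a WSGI app so that requests referred by `token` are sent to `target`."""

    def middleware(environ, start_response):
        # WSGI exposes the HTTP Referer header as HTTP_REFERER.
        referer = environ.get("HTTP_REFERER", "")
        if token in referer.lower():
            start_response(
                "302 Found",
                [("Location", target), ("Content-Length", "0")],
            )
            return [b""]
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    body = b"Welcome to the FAQ.\n"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


application = redirect_referrals(site)

if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serve locally for a quick manual check; stop with Ctrl+C.
    with make_server("127.0.0.1", 8000, application) as server:
        server.serve_forever()
```

Keep in mind that the Referer header is supplied by the visitor's browser, so this only catches visitors whose browsers send it.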
So, are there any other ways of getting around robots.txt that you think Microsoft employs? What other remedies are there to prevent Microsoft from using such circumventions?
In the meantime, here's the enjoyable Bing Bang.