User-agent: msnbot
Disallow: /

User-agent: MSRBOT
Disallow: /

User-agent: MSNbot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /dropbox/
Disallow: /_stats/
Disallow: /error.html
Disallow: /noaccess.html
Disallow: /cam/
Disallow: /_ufcw1518/
Disallow: /_bccu/
Disallow: /_pt/

# the following is courtesy of http://www.subsume.com/robots.txt

# Doesn't properly parse certain URL schema, resulting in more 404 madness.
# A lot of IPs have been banned for that behavior, but this is the first "named" spider
# to be so poorly written.
User-agent: FindLinks
Disallow: /

# Another one that 404s with non-http links. If I start seeing many more like this,
# I'm just going to say "fuck it" for entries in this file and go straight to IP
# banning of even the named bots.
User-agent: Gaisbot
Disallow: /

# Is there some shit library that all these bots are using that explains why
# they can't parse a decent URL?
# Oh, and bonus points for having the most generic name here, zoominfo.com dolts.
User-agent: NextGenSearchBot
Disallow: /

# Last one I'm going to list that 404s with non-http links. Expect this file to get shorter soon. :-/
# Also, how many fucking IPs (with no rDNS on top of it) does this thing need to come in on?
# Bonus points for being in SPEWS! I should ban on that alone, but this is their one chance to obey exclusion.
User-agent: Gigabot
Disallow: /

# Inexplicably, Yahoo! Slurp intentionally 404s using a generated URL that starts with
# /SlurpConfirm404/
# We're not here to feed up pages that confirm or deny your spider's internal state.
# By fishing for 404s you get nothing, or 403s if you ignore this.
# Update: We just saw 209.191.87.215 spider a 404 page and 68.142.249.17 spider robots.txt.
# Update: Clearly they lie about exclusion support. Welcome to an IP ban, Yahoo.
User-agent: Slurp
Disallow: /

# These fuckers are just like Slurp, only they're using
# /this_is_a_test_of_404_response
# This is a test of your exclusion support, assholes.
User-agent: BecomeBot
Disallow: /

# Ah, MSN, why I've allowed your bad behavior for so long is a good question.
# Your crimes are: crawling at a rate 100 times more than anyone actually gets referred by you,
# tons of 404s for URLs we *never* had on our server (e.g., /Gen_2002/images/BookbagButton1.gif),
# and for putting *any* focus on search when you can't even ship your core OS as promised.
User-agent: msnbot
Disallow: /

# FAST Enterprise Crawler/6 comes in from an IP with no reverse DNS.
# It does a crapload of crawling, but I'm not seeing any referred traffic.
# It does not have a proper bot page that I could find, so User-agent is a guess.
# It may not even honor this file at all, which will lead to an IP block.
User-agent: FAST
Disallow: /

# ZyBorg/1.0 Dead Link Checker is another intolerably bad bot.
# Everything that applies to the previous agent applies to this piece of crap, too.
# The connection (WiseNut/LookSmart) with the shitty grub-client (listed next) doesn't surprise me.
User-agent: ZyBorg
Disallow: /

# You fuckers aren't honoring the * disallows, so you don't get to see anything.
# And if you don't honor this, we'll go to blocking specific hosts.
# Update: We are now blocking host IPs. Die!
User-agent: grub-client
Disallow: /

# Another bot that ignores * disallows, even though they claim they follow the protocol.
# And what the hell is with Yahoo-VerticalCrawler-FormerWebCrawler in the agent? Pick a name!
# This may be the same bot that was listed as FAST above, but it gets a special list.
# Dirty, dirty bot. I kind of hope this is ignored so I get to block by IP.
# Update: It is! I do!
User-agent: fast
Disallow: /

# More * ignorance.
User-agent: NaverBot
Disallow: /

# Intentionally generates 404s by changing the case of a known good URL it just spidered.
# We don't know if it's testing case sensitivity or what, but we don't really care.
# Use the bloody URL you're given!
User-agent: baiduspider
Disallow: /

# Also 404s URLs by changing case.
User-agent: LNSpiderguy
Disallow: /

# QuepasaCreep is an unknown spider that screws up all links it tries.
User-agent: QuepasaCreep
Disallow: /

# VoilaBot is another spider that seems to 404 all the time.
# And despite a complete / ban, it still bugs us multiple times a day.
# Say hello to a 195.101.94.0/24 block you greedy French fucks!
# The bastards are coming in on 193.252.148.0/24, too. s/are/were/
User-agent: VoilaBot
Disallow: /

# This agent charges a fee for its "services" but provides sites with no compensation.
# You want to make money by leeching content from our site? Pay us.
# Plus, we think stupid people should be allowed to copy in lieu of learning.
# More idiots in the market makes us look like absolute geniuses by comparison.
User-agent: TurnitinBot
Disallow: /