Rogue robots

(Robots that ignore the robots.txt file)

The number of bots accessing popular websites exceeds the number of real users by a wide margin. For example, in one week the Softpanorama site was accessed from 14735 unique addresses. Fewer than 5K of them can be classified as "real users" (users that actually read at least one page on the site). That means that bots represent about 66% of all IP addresses that accessed the site.

Only around 200 of those bots read the robots.txt file, so all the other robots can be viewed as rogue. In other words, rogue robots dominate the Web. An IP that fires GET requests non-stop (50 or more requests per minute) and does not read robots.txt should be classified as a rogue robot too.
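
The two criteria above (never reading robots.txt, and 50 or more GET requests per minute) can be sketched as a small log scan. This is only an illustration, assuming an Apache combined-format access log; the function name find_rogue_ips and its max_per_minute parameter are hypothetical, not part of any existing tool.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Combined log format: client, identity, user, [timestamp], "METHOD path protocol"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*"')

def find_rogue_ips(lines, max_per_minute=50):
    """Return IPs that never fetched /robots.txt and issued
    max_per_minute or more GET requests within one calendar minute."""
    read_robots = set()
    per_minute = defaultdict(Counter)  # ip -> {minute bucket -> GET count}
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        ip, stamp, method, path = m.groups()
        if path.split("?", 1)[0] == "/robots.txt":
            read_robots.add(ip)
        if method == "GET":
            ts = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
            per_minute[ip][ts.strftime("%Y-%m-%d %H:%M")] += 1
    return {ip for ip, buckets in per_minute.items()
            if ip not in read_robots and max(buckets.values()) >= max_per_minute}
```

Typical use would be find_rogue_ips(open("access.log")), where access.log is a placeholder file name. Counting per calendar minute is a simplification; a sliding window would catch bursts that straddle a minute boundary.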

Most robots "uncritically" use URLs from the pages they scan, and it looks like a lot of their source URLs are "poisoned". That includes Google and Microsoft robots. What is worse, a crazy URL that a robot picks up is used again and again; it looks like they have no mechanism to lower the trust in pages that contain many broken URLs. So much for Google intelligence and the quality of Google programmers. Judging from actual behaviour, they just don't care.

But truth be told, the behavior of all robots has suspicious elements.

One important method of distinguishing whether a robot is "crazy"/undebugged or outright evil is to check whether it obeys the robots.txt file. You can include a couple of "test" directories for a particular robot and observe the results. You can (and should) also include all old (now non-existent) directories and see which robots still attempt to access files in them.
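
The check described above can be approximated with Python's standard urllib.robotparser module. The sketch below counts, per user agent, the logged requests that robots.txt disallows. The helper name robots_violations and the directory names /trap-for-bots/ and /old-site/ are made up for illustration; the (user agent, path) pairs are assumed to have been extracted from the access log already.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

def robots_violations(robots_txt, requests):
    """Count, per user agent, requests for paths that robots.txt disallows.

    requests is an iterable of (user_agent, path) pairs from the access log.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = Counter()
    for agent, path in requests:
        if not parser.can_fetch(agent, path):
            violations[agent] += 1
    return violations

# Hypothetical robots.txt with one "test" directory and one retired directory.
ROBOTS_TXT = """User-agent: *
Disallow: /trap-for-bots/
Disallow: /old-site/
"""

sample = [
    ("GoodBot/1.0", "/index.shtml"),
    ("BadBot/2.0", "/trap-for-bots/page.html"),
    ("BadBot/2.0", "/old-site/a.html"),
]
report = robots_violations(ROBOTS_TXT, sample)
```

Any agent that shows up in the report requested a path it was told to stay out of; whether that makes it buggy or hostile still has to be judged from the rest of its behavior.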

The robots.txt patterns are matched by simple prefix comparisons (each pattern is compared with the beginning of the URL path), so care should be taken to make sure that patterns matching directories have the final '/' character appended; otherwise all files with names starting with that string will match, rather than just those in the directory intended.

For example:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
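
The effect of the trailing '/' can be seen in a few lines of Python. This is a minimal model of the prefix comparison described above, not the code of any particular crawler; the name is_disallowed is hypothetical.

```python
def is_disallowed(path, disallow_patterns):
    """Prefix matching: a path is blocked if it starts with any
    non-empty Disallow value (an empty value blocks nothing)."""
    return any(path.startswith(p) for p in disallow_patterns if p)

print(is_disallowed("/tmp/cache.dat", ["/tmp/"]))  # True: inside the directory
print(is_disallowed("/tmp.html", ["/tmp/"]))       # False: a different file
print(is_disallowed("/tmp.html", ["/tmp"]))        # True: missing '/' blocks too much
```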

The original Robot Exclusion Standard does not mention anything about the "*" character in the Disallow: statement. Some crawlers, such as Googlebot and Slurp, recognize strings containing "*", while MSNbot and Teoma interpret it in a different way.

If a robot does not obey robots.txt or is producing way too many 404 errors by requesting non-existent URLs, it should be hunted and killed ;-).

For example, here is a definitely evil robot :-)

- - [24/Aug/2012:03:51:00 -0700] "GET /Net/telnet.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:52:15 -0700] "GET /Algorithms/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:54:14 -0700] "GET /Bulletin/archive.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:54:14 -0700] "GET /Scripting/perl.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:54:21 -0700] "GET /Freenix/linux.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:03:55:22 -0700] "GET /Solaris/Whitepaper/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:01:39 -0700] "GET /Antivirus/Spyware/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:21:30 -0700] "GET /Skeptics/cs_skeptic.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:23:16 -0700] "GET /WWW/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:24:17 -0700] "GET /Bookshelf/xml.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:25:00 -0700] "GET /Social/overload.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"
- - [24/Aug/2012:04:25:24 -0700] "GET /Admin/index.shtml%0D HTTP/1.0" 404 12973 "-" "Wget/1.12 (linux-gnu)"

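
Every URL in the excerpt above ends in %0D, a URL-encoded carriage return, so each request produces a 404; one plausible cause is a URL list with Windows line endings fed to Wget, though the log alone cannot confirm that. A rough sketch for spotting this pattern follows. It groups by user agent because the client address is not shown in the excerpt, and the name count_404s is hypothetical.

```python
import re
from collections import Counter

# Request, status, and user-agent fields of a combined-format log line.
ENTRY_RE = re.compile(
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def count_404s(lines):
    """Return (404 count per user agent, number of 404 URLs ending in %0D)."""
    not_found = Counter()
    trailing_cr = 0
    for line in lines:
        m = ENTRY_RE.search(line)
        if m and m.group("status") == "404":
            not_found[m.group("agent")] += 1
            if m.group("url").upper().endswith("%0D"):
                trailing_cr += 1
    return not_found, trailing_cr
```

An agent whose requests are almost all 404s, especially with the same malformed suffix, is a reasonable candidate for blocking.
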
Very similar to "crazy robots" are "obnoxious copiers", who overload the site by trying to mirror all the content, sometimes several times a day. For example:

Bug#699133 wget When issuing the following exact command wget -m I get wget malloc() smallb

Web scraping - Wikipedia, the free encyclopedia

Technical measures to stop bots

The administrator of a website can use various measures to stop or slow a bot. Some techniques include:

Incapsula Finds Malicious Bots Account for Approximately 30 Percent of Internet Traffic

Other report findings include:

"We have been conducting this study since 2012, and one constant in our findings is that malicious bots are becoming increasingly sophisticated and harder to distinguish from humans. These bots pose a huge threat to websites and are capable of large-scale hack attacks, DDoS floods, spam schemes and click fraud campaigns," said Marc Gaffan, CEO of Incapsula. "With the vulnerabilities exposed in the past year, notably Shellshock, it is more important than ever that companies operating websites are diligent in securing their sites from malicious traffic."

