AllResearch.com Problems
For the second time I’m having to ban AllResearch.com from crawling my site. The first time I contacted them about the excessive crawling they sent a generic email. We’ll see what they do with this second email. Read on for the emails…
Sent on 9-20-2004:
Subject: Hits on eSnider.net
Date: September 20, 2004 7:30:40 AM EDT
To: support@allresearch.com
Dear Sir/Ma’am,
My name is Larry Snider and I am the webmaster at eSnider.net. Lately, I have noticed an incredible number of hits originating from 38.144.36.16. The frequency is about 1 hour and occurs on or around the top of the hour. Here is an example from my logs:
38.144.36.16 - - [19/Sep/2004:23:02:02 -0500] "GET /backend.php HTTP/1.1" 200 1895 "-" "Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)"
38.144.36.16 - - [19/Sep/2004:23:03:16 -0500] "GET /modules.php?name=News&file=article&sid=268 HTTP/1.1" 200 23135 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:18 -0500] "GET /modules.php?name=News&file=article&sid=269 HTTP/1.1" 200 23472 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:19 -0500] "GET /modules.php?name=News&file=article&sid=270 HTTP/1.1" 200 23953 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:20 -0500] "GET /modules.php?name=News&file=article&sid=271 HTTP/1.1" 200 24328 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:21 -0500] "GET /modules.php?name=News&file=article&sid=272 HTTP/1.1" 200 23827 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:30 -0500] "GET /modules.php?name=News&file=article&sid=273 HTTP/1.1" 200 23139 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:31 -0500] "GET /modules.php?name=News&file=article&sid=274 HTTP/1.1" 200 22954 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:33 -0500] "GET /modules.php?name=News&file=article&sid=276 HTTP/1.1" 200 23107 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:34 -0500] "GET /modules.php?name=News&file=article&sid=275 HTTP/1.1" 200 23114 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
38.144.36.16 - - [19/Sep/2004:23:03:36 -0500] "GET /modules.php?name=News&file=article&sid=267 HTTP/1.1" 200 23648 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
The first hit seems to originate from a Mac and then 10 hits from a Windows 98 machine. This sort of behavior is highly suspicious. Why someone would disguise a crawler in such a misleading way is beyond me.
I would like an explanation as to the purpose of such crawling. If none is given I shall be forced to exclude that IP from all future access.
Thank you.
Their response:
From: support@allresearch.com
Subject: [AllResearch #LUY-50829-924]: Hits on Snider.net
Date: September 27, 2004 1:54:10 PM EDT
Reply-To: support@allresearch.com
Thank you for contacting us regarding the traffic to your site.
We have recently developed a search engine that indexes RSS feeds. Your feed has been selected to be included in our index.
Our search engine is very careful to not request an RSS or RDF page more than necessary. When we fetch an RSS page, we very carefully look for any possible clue about when to re-visit. Our system will utilize all of the techniques outlined by protocol, including:
1) XML tags in your rss page: updateFrequency, updatePeriod, updateBase, ttl, skipDays, skipHours
2) The “e-tag” HTTP header.
3) The “Last-Modified” HTTP header.
If NONE of the above methods give us a clue as when to revisit, then we will re-fetch your RSS page again in one hour.
For more information about the RSS or RDF protocol, we suggest you visit the following links:
http://web.resource.org/rss/1.0/
http://blogs.law.harvard.edu/tech/rss
http://www.w3.org/RDF/
Thank you for your understanding,
The WebClipping Team
It was at this point that I banned 38.144.36.16 in my .htaccess file. But on 12-3-2004 they apparently switched the IP address they crawl from. From 12-3-2004 to 12-14-2004 I received almost 2,700 hits from them (38.144.36.19) with ZERO hits on robots.txt. Their bot doesn’t even identify itself in the HTTP_USER_AGENT variable. That alone is very suspicious, but what they actually do is resell MY information to businesses. How are they presenting my data? I have to subscribe to find out…
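For anyone who wants to run the same kind of check on their own logs, here is a minimal Python sketch of how hits per IP and robots.txt requests per IP could be counted. It assumes Apache common/combined log lines with plain ASCII quotes; the `summarize` function and the `LOG_LINE` pattern are my own illustrative names, not part of any tool mentioned here.

```python
import re
from collections import Counter

# Assumed layout (Apache common/combined log format):
# IP identity user [timestamp] "METHOD PATH PROTOCOL" status size ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')

def summarize(lines):
    """Return (hits per IP, robots.txt requests per IP) for an iterable of log lines."""
    hits, robots = Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not match the assumed format
        ip, path = m.group(1), m.group(3)
        hits[ip] += 1
        # Ignore any query string when checking for robots.txt
        if path.split("?", 1)[0] == "/robots.txt":
            robots[ip] += 1
    return hits, robots

# Two lines taken from the log excerpt in this post
sample = [
    '38.144.36.16 - - [19/Sep/2004:23:02:02 -0500] "GET /backend.php HTTP/1.1" 200 1895 "-" "Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)"',
    '38.144.36.16 - - [19/Sep/2004:23:03:16 -0500] "GET /modules.php?name=News&file=article&sid=268 HTTP/1.1" 200 23135 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"',
]
hits, robots = summarize(sample)
print(hits["38.144.36.16"], robots["38.144.36.16"])  # prints: 2 0
```

To scan a real log, pass an open file object to `summarize` instead of the sample list. An IP with thousands of hits and a robots.txt count of zero is the pattern described above.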
Sent on 12-14-2004:
Subject: Hits on eSnider.net
Date: December 14, 2004 4:52:20 PM EST
To: support@allresearch.com, noah@webclipping.com
It would be very much appreciated if your ill-behaved bot would stop crawling my site (www.esnider.net). I say ill-behaved because it does not check robots.txt for permission to crawl my site like all other legitimate crawlers, it evidently does not check my backend.php to see if anything has changed since the last time it checked before it downloads everything in the feed, and your bot does not properly identify itself in the HTTP_USER_AGENT variable.
Your current model is flawed, especially when you consider that many personal websites have hit limits, bandwidth restrictions, or both. Most of these people do not code their own websites (a CMS package used as a weblog, for example), so they really have no way to modify their RSS feed code.
I have contacted AllResearch.com before about this and know the 3 methods you employ to check for changed articles so I’d rather not get the generic “We have recently developed a search engine” email. Please just remove my site from your list of websites.
Thank you.
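To make the three complaints in that email concrete, here is a rough Python sketch of what a better-behaved feed fetcher might look like: it consults robots.txt, identifies itself in the User-Agent header, and sends a conditional GET so an unchanged feed costs only a 304 response. This is an illustration under my own assumptions, not how AllResearch's bot works. The bot name, its info URL, and the function names are all hypothetical.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity; a real crawler would use its own name and info URL
USER_AGENT = "ExampleFeedBot/0.1 (+http://example.com/bot.html)"

def allowed_by_robots(robots_txt, url, agent=USER_AGENT):
    """Check a URL against the text of a site's robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

def build_request(url, etag=None, last_modified=None):
    """Build a GET that identifies the bot and is conditional when validators are known."""
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if the server answers 304."""
    request = build_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing to download
            return None, etag, last_modified
        raise

# Offline demonstration with a made-up robots.txt
robots_txt = "User-agent: *\nDisallow: /modules.php\n"
print(allowed_by_robots(robots_txt, "http://www.esnider.net/backend.php"))
print(allowed_by_robots(robots_txt, "http://www.esnider.net/modules.php?name=News"))
```

A crawler built this way would store the ETag and Last-Modified values returned by `fetch_if_changed` and pass them back on the next visit, and would only follow article links when the feed itself had actually changed.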

[…] In short, AllResearch piggybacks the blogosphere to make a nice buck, with a complete indifference to limiting the unnecessary load they put on others’ sites. This, in my opinion, is abusive, and I’m not the only one noticing it. Daniel Bowen at GeekRant went through the same experience in February 2005. Larry Snider had a conversation with AllResearch in December 2004. Google AllResearch and find out more. […]
Pingback by Random Synapses » Blog Archive » Blocking abusive spider from AllResearch — March 17, 2007 @ 7:46 am