Search Engines: costs vs. benefits

Summary
For a non-commercial information provider, search engine costs are the bandwidth generated on a site by search robots that has to be paid for. Their benefits are the user page reads that result. Search bots have become so numerous in recent years that they are a major drain on site resources while most provide almost no benefit in return. Identified bots accounted for 45% of my site hit and bandwidth costs by July 2007. This ratio continues in 2018, when I switched to a host that separates known bots from real viewers. Yahoo remains by far the worst bot abuser.

A solution available to all users is to use a robots.txt file to exclude all bots but Google. Bot hits decrease to a small fraction of those to an unprotected site, and visitor count decreases only slightly.

Details
Here were some of my search engine numbers when I started this project in July 2007:

Active html pages: 293
Total page hits: 42870
User page views/mo (known bots removed): 23387
search engine bot hits/mo user hits/mo benefit:cost
Yahoo 10323 874 1:12
Ask 1359 89 1:15
MSN 1358 184 1:7
Google 598 8499 14:1
Twiceler 473 0 0
Voila 224 0 0
Seekport 149 0 0
Picsearch 87 48 1:2
Ichiro 83 0 0
Cazoodle 78 0 0
Gigablast 62 60 1:1
Naver 52 27 1:2

Active html pages: 293 Total page hits: 42870 User page views/mo (known bots removed): 23387
search engine	bot hits/mo	user hits/mo	benefit:cost
Yahoo	10323	874	1:12
Ask	1359	89	1:15
MSN	1358	184	1:7
Google	598	8499	14:1
Twiceler	473	0	0
Voila	224	0	0
Seekport	149	0	0
Picsearch	87	48	1:2
Ichiro	83	0	0
Cazoodle	78	0	0
Gigablast	62	60	1:1
Naver	52	27	1:2

42% of my user views came from Google, and you can't beat their benefit:cost ratio. But all the other engines above except Gigablast were a net loss to me.

How do you get rid of bots that make excessive requests? The simplest way is to set up a robots.txt file in your root server directory containing:

User-agent: Slurp
Disallow: /

to get rid of Yahoo's main agents. Repeat for other bots as appropriate. If, after reading this page, you want to get rid of the whole lot except Google, as I have, use:

User-agent: Google
Disallow:
User-agent: *
Disallow: /

Data collection method
I've been archiving page logs since 1 March 2005. For each GET *.html entry in my logs, I search first for bots (case insensitive) with strings similar to these:
Ask: teoma OR minefield
GenieKnows: geniebot
Google: googlebot
Ichiro: ichiro
MSN: msnbot OR msrbot OR msiecrawler OR lanshanbot
Naver: naverbot OR yetibot
Sproose: crawler@sproose
Yahoo: slurp OR yahooseeker OR Yahoo-MMCrawler
miscellaneous bots: bot.htm OR bot" OR crawl OR spider

Once these entries are removed, I search for a user hit from the engine:
Ask: .ask.
GenieKnows: genieknows
Google: .google. OR earthlink OR aolsearch
Ichiro: .goo. OR .nttr.
MSN: msn.co OR search.live
Naver: naver
Sproose: sproose
Yahoo: yahoo OR alltheweb OR altavista.com

This method relies on the refer field of the user log. It has limitations: people can turn refer off in some browsers, proxy or content filters can interfere with the refer field, and page caching reduces recorded user views but not bot hits. Another problem with pinning down searchbot results is metasearchers who check several databases to come up with their results and give preference to pages that are in many bases. A final problem is that some bots, open source ones in particular, are run independently by many users; identification of user views is often not possible with these.

So, results of this method have to be investigated for consistency. To start, here are the referers that had an average of 10 bot+user hits/mo or more over the study period, in order of their total activity on my site:

Google is so well known that its name has become a verb. Hitting each of my pages an average of twice per month since Mar05, it delivered a consistent 45% of my user views. In addition, many sites have a "powered by Google" search box. The Best Buy among robots!
Yahoo is by far the worst bot abuser there is, many months hitting all of my pages every day while delivering only 3% of my users. (Lanshanbot is their Chinese language bot.) Turned off Aug07.
Microsoft's search engine is proving far less popular than its operating systems, delivering only 1% of my users while its bots hit me 7 times that often. Turned off Sep07; turned on again med-Feb08 to separate its effects from Yahoo, turned off for good Nov08.
Wikipedia delivered 11% of my page viewers. It doesn't use bots, but relieson page writers to manually locate and enter external links related to their subject.
Ask/Jeeves bot hits me ten times as often as its users do, but contributes to several metasearchers.
Voila was hopefully checking 55% pages/mo for French content since Mar05 and came up with only 10 isolated user hits the whole time, presumably from bilingual searchers. It seems to have vanished.
Seekport is a German site that has been checked 50% pages/mo since Mar05, looking for pages in 7 European languages, with precisely one user hit resulting over the entire period. It now seems to have vanished. It didn't obey robots.txt
Majestic is a distributed searcher designed for users with broadband connections. It's been hitting 33% pages/mo since Mar05. User hits cannot be identified.
Ichiro is a Japanese language searcher. It has hit 27% pages/mo since Mar05 and has returned 10 hits over the entire period.
Picsearch is an image searcher that hit 29% pages/mo since Apr05 to identify them. Turned off Mar08.
Lucene provides the open source searcher Nutch. The users who leave it in default configuration have hit a total of 30% pages/mo since Mar05.
Gigablast is a general search engine that hit 28% pages/mo since Mar05. Its 2353 bot hits over the period don't quite match its 2067 user views.
Heritrix is the Internet Archive's web crawler, it's hit 20% pages/mo since Mar05. Worth keeping.
Accoona was a general search engine that hit 19% pages/mo since Sep05 and returned 7 user views over the entire period.
Naver is a Korean language searcher. It has hit 12% pages/mo since Mar05 and has slowly increased its user views to the current 28/mo.
TurnItIn provides anti-plagiarism services to educational institutions. It has hit 16% pages/mo since Mar05 to collect data to support this. I support it too.
Cazoodle appeared Nov06, it collects "information for next-generation Web search and integration solutions". It hits 26% pages/mo and delivers no users.
OmniExplorer has hit 15% pages/mo since Mar05. If you are selling things they list they might be useful, otherwise they aren't. They stopped hitting me after I wrote for information.
Shopwiki appeared in Jul06 and is building a shopper's index - if you are selling things you want them, otherwise they are no use. They hit 18% pages/mo until I turned them off.
Larbin is Linux freeware that anyone can run. Fortunately, most people don't have the resources to do much damage with it; total hits since Mar05 are 11% pages/mo. Identification of user hits is not possible.
BruinBot is a project of the Computer Science department of UCLA. It hit 75% pages/mo May05-Aug05.
Krugle has searched for open source code for developers since Apr06, hitting 14% pages/mo. If you have code to distribute, it's useful, otherwise it's no use.
Sensis is an Australian searcher that has hit 8% pages/mo since Apr05, but provided no user hits.
IBM's research crawler hit 10% pages/mo Mar05-Jan07.
NetSeer appeared in Nov07 by hitting all of my pages twice, and said it was building a marketing index, apparently to focus future ads. They have removed the info page on their crawler (12Dec07)
Shim is a University of Tokyo research project. It hit 15% pages/mo Nov05-Dec06.
CounterStrike is an online computer game that hit 6% pages/mo Jun05-Sep05; all are in effect user hits. It doesn't obey robots.txt.
UBI is a project of the Nagaoka University of Technology, Japan, to survey the use of languages on the internet. It hit all my pages Mar06 and Dec06.
NetResearchServer is a customisable search engine sold to users. It hit 5%/mo of my pages Mar05-Sep07 and now seems defunct. Identification of user hits is not possible.
Scirus searched scientific journals online. It hit 75% of my pages Oct05 and Feb06 and didn't find any. It's gone now, doubtless replaced by others.
MSR-ISRCCrawler is a Microsoft bot that looks through music files for people who are looking for copyright violations. It ignores robots.txt and has hit 20%/mo of pages since Apr08.

There are hundreds of people who have set up open source crawlers under hopeful names, who have rapidly discovered the huge resources needed for even the most focussed database, and who have vanished. Looking for crawl, bot.htm, bot" and spider after all other checks gets almost all of them. Dumped into a file, a refer field sort enables new ones to be spotted and checked.

Experimental Action
Yahoo was disallowed Aug07 (the graph below shows why!), MSN Oct07.

The first result: site hit and bandwidth costs due to bots went from 45% in Jul07 to 27% over Nov07-Jan08. Here are the expectations for Nov07-Jan08 from a line fit Mar05-Aug07 and the actual:

source expected actual % change
total user hits 33910±1980 28300 -16.5
Google % 46.29±1.35 51.20 +10.6
Wikipedia % 12.16±0.73 12.42 +2.1
Yahoo % 3.43±0.54 0.60
MSN % 0.91±0.37 0.07

source	expected	actual	% change
total user hits	33910±1980	28300	-16.5
Google %	46.29±1.35	51.20	+10.6
Wikipedia %	12.16±0.73	12.42	+2.1
Yahoo %	3.43±0.54	0.60
MSN %	0.91±0.37	0.07

The expected loss in viewer hits from blocking Yahoo and MSN is 3.7%. The actual loss seems to be about 16% from the total hit activity, 11% based on Google and 2% on Wikipedia. The mean of these three is 9.7%. The loss in user views seems about 3x that expected.

Metasearch engines (ones that check many databases) require almost no investment to set up, so come and go. Dogpile is the only significant one I have been able to identify on my site (0.2% of viewers). Only three of the 27 I have located hit more than 100 pages each over the 34 month period; most had fewer than 10. They are minor players.

18% of user hits came from my own pages, so 84% of user hits have been traced to 5 referring sources. Possible modifications of Yahoo+MSN refer fields can only account for ±0.6% of any change, so most people don't do it.

MSN was re-enabled mid-Feb08 to see if the effects of MSN and Yahoo could be separated. The result: all the unexpected loss in user hits was due to Yahoo. It appears that only a third of user views due to Yahoo actually came from a Yahoo site or from a metasearcher that says they use Yahoo. So, its actual benefit:cost is about 1:4.

As a final experiment, I allowed Google, Turnitin and archive.org and disallowed everything else mid Nov08 as described above. The result - a gratifying loss of useless bot hits and almost no loss of users. I recommend it.

Unfortunately, this project was blown out of the water by a dishonest registrar who stole my domain name early November 2009. I hope the results to this point give you some useful information on how many bots there are out there that aren't worth their keep, and how to deal with them.

user-agents.org is a useful resource for identifying searchbots, but havn't gone into search results. Colossus has large lists of search sites.

John Sankey
other notes on computing