|
Robots, spiders, worms and crawlers
All major search engines send out
a small program called a 'spider' to index your pages. Some search
engines use spiders to index your entire site, others just one or two pages.
A spider takes a 'snapshot' of your page and determines what the
page is about by looking at the text on the page, its META
tags, and various other page factors. Most
directories, such as Looksmart, Zeal, and the Open Directory
Project, also send out a spider. However, since they are not search
engines, the primary function of their spiders is to ensure your
site is still up and running.
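
As a rough illustration of the 'snapshot' idea, the sketch below pulls out the
pieces of a page a spider is described as looking at: the title, the META tags,
and the visible text. It is a minimal example built on Python's standard
html.parser module, not how any particular search engine works; the class name
PageSnapshot and the helper snapshot() are made up for this example.

```python
from html.parser import HTMLParser


class PageSnapshot(HTMLParser):
    """Collects the title, META tags, and visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}         # e.g. {"description": "...", "keywords": "..."}
        self.text_parts = []   # visible text fragments, in page order
        self._in_title = False
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # META names are case-insensitive, so store them in lower case.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and stylesheet contents
        if self._in_title:
            self.title += data.strip()
        elif data.strip():
            self.text_parts.append(data.strip())


def snapshot(html):
    """Return the title, META tags, and visible text found in an HTML string."""
    parser = PageSnapshot()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "meta": parser.meta,
        "text": " ".join(parser.text_parts),
    }
```

Calling snapshot() on the HTML of one of your own pages gives a quick view of
what text and META information is actually there for a spider to read.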
These robots leave a trace of their
access attempts in your server log files just as a human
visitor does, so if you have access to your stats you will be able
to spot them. The best hint of an indexing attempt is a request
for the 'robots.txt' file in the root
of your web directory. If you don't have a 'robots.txt' file, that
is because you never created one. Don't worry, however: a spider
will still crawl your site without one. All search engines check
for this little text file, which tells the crawler where it may and may
not go. It can also allow or disallow specific robots if you
find a particular spider to be badly behaved. The main purpose
of this file is to keep robots from indexing directories or files
they aren't supposed to, such as cgi directories, administration files,
etc. If you wish to create a 'robots.txt' file but don't know where
to start, visit the 'official' robots.txt site by clicking
here.
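
To make the rules described above concrete, here is a small sketch that holds a
sample 'robots.txt' as a string and checks it with Python's standard
urllib.robotparser module. The directory names, the robot names 'FriendlyBot'
and 'BadSpider', and the helper build_parser() are all invented for this
example; substitute the paths and spiders that apply to your own site.

```python
from urllib.robotparser import RobotFileParser

# Sample file contents. The paths and the robot name are hypothetical.
SAMPLE_ROBOTS_TXT = """\
# Keep every robot out of scripts and admin pages
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/

# Shut out one misbehaving spider entirely
User-agent: BadSpider
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return a ready-to-query RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS_TXT)

# An ordinary robot may read normal pages but not the protected directories.
print(rules.can_fetch("FriendlyBot", "/index.html"))       # True
print(rules.can_fetch("FriendlyBot", "/cgi-bin/form.pl"))  # False
print(rules.can_fetch("FriendlyBot", "/admin/login.html")) # False

# The named spider is disallowed everywhere.
print(rules.can_fetch("BadSpider", "/index.html"))         # False
```

Keep in mind that the file is only a request: a well-behaved spider honours it,
but nothing forces a robot to read or obey it.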
Unfortunately, there are also malicious
spiders on the web that are used for purposes other than search indexing.
Some are designed to copy your website to the client's hard drive;
others are designed to collect e-mail addresses to be used for sending
unsolicited e-mail.
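
To see which spiders, good or bad, have actually visited, you can scan your
server log for requests to 'robots.txt', as mentioned earlier. The sketch below
assumes the common Apache/NCSA 'combined' log format; your server may log
differently. The sample lines, addresses, dates, and robot names are made up
for illustration, and robots_txt_visitors() is a hypothetical helper. A spider
that never asks for 'robots.txt' will not show up in this count, which can
itself be a sign of a badly behaved robot.

```python
import re
from collections import Counter

# One line of the assumed "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def robots_txt_visitors(lines):
    """Count requests for /robots.txt per user-agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        # Drop any query string before comparing the requested path.
        if match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Made-up sample log lines.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2003:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 404 208 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [10/Oct/2003:13:55:37 -0700] "GET /index.html HTTP/1.0" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2003:14:02:11 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '203.0.113.5 - - [11/Oct/2003:02:14:09 -0700] "GET /robots.txt HTTP/1.0" 404 208 "-" "OtherCrawler/2.1"',
]

for agent, hits in sorted(robots_txt_visitors(SAMPLE_LOG).items()):
    print(agent, hits)
```

With a real log you would pass the open file object instead of SAMPLE_LOG,
since the function only needs something that yields one line at a time.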
Find out how
to stop SPAM and junk e-mail by clicking here.
Click here
for a list of spiders that have visited this site.
Spiderhunter
has excellent resources about spiders, if you wish to learn more.
|