19 Oct 2020

What Is A Robots.Txt File?


What is a Robots.Txt File?

A robot (also called a spider) is an automated software program which scans web pages (as well as newsgroups and other internet structures) looking for things. There are hundreds, if not thousands, of robots tirelessly scanning the internet day and night. The result of their toils is often beneficial (they allow massive search engines like Google to exist). Sometimes their purposes are merely interesting (as with internet mapping robots), and occasionally they are actually malicious and evil (as with email harvesters).

What these robots are looking for depends upon their purpose.

  • Search engine robots such as Scooter, Slurp and Googlebot gently probe your web site, recording all of your pages in their massive indexes.
  • ArchitextSpider gathers internet statistics.
  • Sometimes spiders are used to validate the links on a web site.
  • EmailSiphon and Cherry Picker are looking for email addresses to add to their spam lists. These robots are not your friend.

Usually it’s safe to just ignore these robots, although if you have access to your server log files it is always a good idea to keep an eye on their travels through your site. Sometimes, however, there are pages in your site that you simply do not want or need indexed. This could be for many different reasons:

  • You’ve got directories that don’t need to be indexed.
  • You’ve got some old pages which you no longer want people to visit.
  • For whatever reason, you don’t want all or part of your site appearing in certain search engines.
  • Some pages are temporary so should not be indexed.
  • You are paying for bandwidth and want to exclude robots and spiders which do not benefit you and your goals.
  • Some robots are not well written and “hammer” a site with extremely rapid requests or generate lots of 404 errors.
  • You simply don’t want visitors landing on certain pages.

One good reason to exclude certain directories is to help out the search engines. Think about it: they have a lot of work to do to completely index your entire site, and that work generates traffic on the internet, for your ISP and for your host. Anything you can do to reduce this traffic helps the greater good.

The Robots Exclusion Standard

To help you inform robots of your intentions, a series of agreements called the Robots Exclusion Standard has been created. It is not supported by any official internet standards committee, is not backed by any big corporation, and is not enforced by anyone, including web server software. Instead, the standard was created by a group of webmasters and made public to help solve the problems that robots create.

So what good is it? Well, many robots, including most of those used by major search engines, have agreed to follow the standard. In fact, it is now considered good form for any beneficial or well-written robot to follow this standard, as well as the ROBOTS metatag. (Actually, any robot that does not follow the standard is often looked upon as either malicious or sloppily coded).
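
For reference, the ROBOTS metatag goes in the <head> section of an individual page rather than in a site-wide file. A minimal example that asks compliant robots not to index a page or follow its links looks like this:

<meta name="robots" content="noindex, nofollow">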

What the standard allows you to do is create a special file called robots.txt. The file always has the same name, and it must reside in your root directory (for example, http://www.example.com/robots.txt). Only one file is allowed per web site.

Important – The robots.txt file is a standard which is voluntarily supported by a robot or spider. There is no requirement that it be honored. Thus, malicious spiders (such as EmailSiphon and Cherry Picker) will simply ignore it.

Robots.txt is a simple text file which contains some keywords and file specifications. Each line of the file is either blank or consists of a single keyword and its related information. The keywords are used to tell robots which portions of your web site are NOT to be spidered (we will refer to this as exclusions).

Keywords in the Robots.Txt file

These are the keywords that are allowed:

User-agent – This is the name of the robot or spider. You may also include more than one agent name if the same exclusion is to apply to them all. You do not need to worry about case (in other words, “googlebot” is the same as “GOOGLEBOT” and “GoogleBot”).

A “*” indicates this is the “default” record, which applies if no other match is found. For example, if you specify “GoogleBot” only, then the “*” would apply to any other robot.
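
As a brief sketch (the directory names here are only placeholders), the following file gives GoogleBot its own record while every other robot falls back to the “*” record:

User-agent: GoogleBot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /images/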

Disallow – This tells the robot(s) specified in the User-agent line which parts of your web site are off-limits. For example, /images tells the robot not to look at any files in the images directory, or in any directory below it. Thus, “/images/special/” would not be indexed by the robot.

Note that /se will match anything beginning with “/se” (such as /search/ or /secure/), while /se/ will only match the directory named “/se/”.

You can also specify individual files. For example, you could say /mydata/help.asp to prevent just that one single file from being spidered.

A value of just / indicates nothing is allowed to be spidered.

You must have at least one disallow per user-agent record.
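
Putting these forms together, a hypothetical record (with placeholder paths) might block a whole directory, a single file and any path beginning with a given prefix:

User-agent: *
Disallow: /images/
Disallow: /mydata/help.asp
Disallow: /se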

# – Start of a comment. You can include a pound character anywhere in a line to begin a comment; everything from the “#” to the end of the line is ignored.
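
For example (using a hypothetical /temp/ directory), a comment may occupy a whole line or follow a directive on the same line:

# Keep robots out of the temporary directory
User-agent: *
Disallow: /temp/ # this directory changes daily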

Example Robots.Txt File

The following example disallows certain directories and all files contained within those directories.

User-agent: *
Disallow: /images/
Disallow: /banners/
Disallow: /Forms/
Disallow: /Dictionary/
Disallow: /_borders/
Disallow: /_fpclass/
Disallow: /_overlay/
Disallow: /_private/
Disallow: /_themes/

This example disallows all robots:

User-agent: *
Disallow: /

This file disallows Googlebot from examining a specific web page:

User-agent: GoogleBot
Disallow: /tempindex.asp

It is important to remember that the robots.txt file is available to everyone. Thus, you should never list the names of sensitive files or folders in it. If you must exclude them, it is better to put them behind password-protected pages, which cannot be reached by search engines at all (they don’t have the password!).
