The tireless search for information (II)

Now we analyze search engine robots: how they work and the main characteristics that identify them.

By Osvaldo Callegari*

Internet search engines: tools or a threat to privacy? How searches really work can seem like a well-kept state secret. Today, artificial intelligence techniques are used to interpret queries and also to profile user behavior.

Search engine robots
A web robot, known by several names (spider, crawler, etc.), is an application that traverses the Internet in order to store a copy of each page's content, or simply of its index, on a server. For this purpose, the owner of a website places a robots.txt file in the root directory indicating whether the site may be indexed by search engines.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is sometimes called the Robots Exclusion Protocol.

It works like this: a robot wants to visit a URL on a website, for example:
http://www.example.com/welcome.html.

Before doing so, it first checks http://www.example.com/robots.txt and finds:
User-agent: *
Disallow: /

The "user agent: *" means that this section applies to all robots.
The message "Do not allow: /" tells the robot not to visit any page of the site.
There are two important considerations when using /robots.txt:

    • Robots can ignore your /robots.txt.
        ◦ In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
    • The /robots.txt file is a publicly available file.
        ◦ Anyone can see which sections of your server you do not want robots to visit.

"It is not advisable to use /robots.txt to hide information"

Some additional definitions
A robot is a program that automatically traverses the hypertext structure of the Web by retrieving a document and then, recursively, all the documents it references.

Note that "recursive" here does not limit the definition to any specific cross-sectional algorithm; even if a robot applies a certain heuristic to the selection and ordering of documents to visit and space out requests over a long period of time, it is still a robot.

Normal web browsers are not robots, because they are operated by a human and do not automatically retrieve referenced documents (other than inline images).
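
A minimal sketch of that recursive traversal in Python follows (the seed URL is a placeholder; a real robot would also honour /robots.txt, identify itself, and pace its requests):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # Collects the targets of <a href="..."> tags found on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    # Breadth-first traversal: retrieve a document, then queue the documents it references.
    frontier, seen = deque([seed]), {seed}
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable page or non-HTTP link: skip it
        collector = LinkCollector()
        collector.feed(page)
        for link in collector.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

print(crawl("http://www.example.com/"))  # placeholder seed URL
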

Expanded robot development
"There is no will for a final Robot Standard to prosper.txt"

There is no active effort to develop /robots.txt further, and it is not known whether technical standards bodies such as the IETF or the W3C are working in this area.

There are some industry efforts to extend the robot exclusion mechanisms. See for example the collaborative efforts announced on the Yahoo! Search Blog, the Google Webmaster Central Blog, and the Microsoft Live Search Webmaster Team Blog, which include support for wildcards, sitemaps, additional META tags, and so on.

Of course, it is important to realize that older robots may not support these new mechanisms. For example, if you use "Disallow: /*.pdf$" and a robot does not treat '*' and '$' as wildcard and anchor characters, your PDF files will not be excluded.
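
The difference is easy to demonstrate. Below is a rough Python sketch of what a wildcard-aware robot has to do with such a rule (the rule_to_regex helper is illustrative, not part of any standard library):

import re

def rule_to_regex(rule):
    # Translate a Disallow value using the extended '*' and '$' syntax into a
    # regular expression over URL paths: '*' matches anything, '$' anchors the end.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile("^" + pattern)

# A wildcard-aware robot blocks any path ending in .pdf ...
print(bool(rule_to_regex("/*.pdf$").match("/reports/q3.pdf")))                 # True
# ... while a robot that reads the rule literally matches nothing useful.
print(bool(re.compile("^" + re.escape("/*.pdf$")).match("/reports/q3.pdf")))   # False
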

The details:
/robots.txt is a de facto standard and is not owned by any standards body.

There are two historical descriptions:
    • The original 1994 document, A Standard for Robot Exclusion.
    • A 1997 Internet Draft specification, A Method for Web Robots Control.
      
External resources:
    • HTML 4.01 Specification, Appendix B.4.1
        ◦ https://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1
    • Wikipedia - Robot Exclusion Standard
        ◦ https://en.wikipedia.org/wiki/Robots_exclusion_standard
          
Overview: some simple recipes for using /robots.txt on your server.

Note that you need a separate "Disallow" line for each URL prefix you want to exclude; you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line.

In addition, you may not have blank lines within a record, because blank lines are used to delimit multiple records. Note also that globbing and regular expressions are not supported in the User-agent or Disallow lines.

The '*' in the User-agent field is a special value that means "any robot".

Specifically, you can't have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

What you want to exclude depends on your server. Anything that is not explicitly disallowed is considered fair game to retrieve.

Here are some examples:
- To exclude all robots from the entire server:
User-agent: *
Disallow: /

- To allow all robots full access:
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)

- To exclude all robots from part of the server (this example excludes three directories):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

- To exclude a single robot:
User-agent: BadBot
Disallow: /

- To allow a single robot (checked programmatically in the sketch after these examples):
User-agent: Google
Disallow:

User-agent: *
Disallow: /

- To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easiest way is to put all the files to be disallowed into a separate directory, say "stuff", and leave the one file at the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively, you can explicitly disallow every page you want kept out:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
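
These recipes can be verified with the same standard-library parser used earlier. For instance, a sketch of the "allow a single robot" record above (SomeOtherBot is just an illustrative stand-in for any other crawler):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Google", "/index.html"))        # True: its own record allows everything
print(parser.can_fetch("SomeOtherBot", "/index.html"))  # False: the catch-all record disallows /
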

Robots.txt (continued)
If a webmaster does not want their pages analyzed by a bot, they can add a robots.txt file, which prevents GoogleBot (or other bots) from examining one or more pages, or even all of the content, of the website.

Google Bot
Google uses a large number of computers to send its crawlers to every corner of the network to find these pages and see what is on them. Googlebot is Google's robot, or web crawler; other search engines have their own.

How Googlebot works
Googlebot uses sitemaps and databases of links discovered during previous crawls to determine where to go next. Every time the crawler finds new links on a site, it adds them to the list of pages to visit next. If Googlebot finds broken links or changes to existing links, it takes note so that the index can be updated. The program also determines how often it will crawl each page. To make sure that Googlebot can properly index your site, you need to check its crawlability. If your site is available to crawlers, they will come often.

GoogleBot discovers links to other pages and follows them as well, so it can eventually cover the entire web. It is the robot that Google uses to 'crawl' Internet sites. It not only indexes web pages (HTML, HTML5) but also extracts information from PDF, PS, XLS, DOC and some other file types.
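
As a rough illustration of the sitemap side of this process, a site's sitemap.xml can be used to seed a crawl list (a sketch using only the Python standard library; the sitemap URL is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    # Download a sitemap.xml and yield the page URLs listed in its <loc> elements.
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    for loc in tree.iter(SITEMAP_NS + "loc"):
        yield loc.text.strip()

# Seed a crawl frontier with the pages the site itself advertises.
frontier = list(sitemap_urls("http://www.example.com/sitemap.xml"))
print(frontier)
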

How often Googlebot accesses a website depends on the PageRank of that website. The higher this value, the more frequently the robot will visit its pages.
For example, sites with PR10 (the highest value), such as yahoo.com or usatoday.com, will have been 'crawled' by GoogleBot yesterday or even today, while others were last visited several weeks ago. This can be checked by looking at the 'cache' copy of the page.

DeepBot
Googlebot has two versions, DeepBot and FreshBot. DeepBot investigates deeply, trying to follow every link on a page; it also places the page in the cache, making it available to Google. In March 2006, it took a month to complete the process.

FreshBot
FreshBot crawls the web looking for new content. It visits sites that change frequently: ideally, FreshBot visits a newspaper's page every day, and a magazine's every week or every fifteen days. This way it can pick up, for example, news that has just happened, without having to wait weeks.

Check
To check whether GoogleBot has accessed our website, we have to look at our server's logs and see whether there are access entries in which 'GoogleBot' appears. Generally the name of the server will appear, which may be one of these (a log-scanning sketch follows the table):
Server                   IP address

crawl1.googlebot.com     216.239.46.20
crawl2.googlebot.com     216.239.46.39
crawl3.googlebot.com     216.239.46.61
crawl4.googlebot.com     216.239.46.82
crawl9.googlebot.com     216.239.46.234
crawler1.googlebot.com   64.68.86.9
crawler2.googlebot.com   64.68.86.55
crawler14.googlebot.com  64.68.82.138
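
A rough way to automate that check is sketched below. The access.log path and the combined log format are assumptions, and the reverse-then-forward DNS lookup is the usual way to confirm that a hit claiming to be Googlebot really came from Google:

import socket

def is_real_googlebot(ip):
    # Reverse-resolve the IP, require a googlebot.com or google.com host name,
    # then forward-resolve that name and make sure it maps back to the same IP.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# Scan the server log for entries whose user agent mentions Googlebot.
with open("access.log") as log:
    for line in log:
        if "Googlebot" in line:
            client_ip = line.split()[0]   # first field in the combined log format
            label = "verified" if is_real_googlebot(client_ip) else "suspect"
            print(label, client_ip)
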

Once Googlebot has 'crawled' our page, it will follow the links it finds on it (the HREF and SRC attributes). Therefore, if you want GoogleBot to index your website, it is enough for some other site to link to yours. If not, you can always submit your URL directly to Google.
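
A small sketch of that link-extraction step, pulling both HREF and SRC references out of a page with Python's standard HTML parser:

from html.parser import HTMLParser

class RefExtractor(HTMLParser):
    # Records every href (links) and src (images, scripts) attribute encountered.
    def __init__(self):
        super().__init__()
        self.hrefs, self.srcs = [], []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)
            elif name == "src" and value:
                self.srcs.append(value)

extractor = RefExtractor()
extractor.feed('<a href="/about.html">About</a> <img src="/logo.gif">')
print(extractor.hrefs)  # ['/about.html']
print(extractor.srcs)   # ['/logo.gif']
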

Different robots
There are several different types of robots. For example, AdSense and AdsBot check the quality of ads, while Mobile Apps Android checks Android apps.

Some of the most important robots
Name                 User agent
Googlebot (desktop)  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot (mobile)   Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot Video      Googlebot-Video/1.0
Googlebot Images     Googlebot-Image/1.0
Googlebot News       Googlebot-News

How Googlebot visits your site
To find out how often Googlebot visits your site and what it does there, you can dive into your log files or open the Crawl section of Google Search Console.
If you want to do really advanced things to optimize your site's crawling performance, you can use tools like Kibana or Screaming Frog's SEO Log Analyzer.

Conclusion
It is always worth using different search engines: often one engine gives us little information about something, and when we choose another we find what we want. That is why it is always good to use different tools, and to take advantage of their advanced options. Some even reach the Deep Web, which is not advisable because of the amount of spurious information that resides there, although that does not mean it has no consumers.

An interesting tip: before letting the search engine take over, type the full address of the page we want to see, so that advertising does not overwhelm us.

Robots, search engines and the management of information make us increasingly dependent on them; the way forward is to inform ourselves from different sources, experiment personally and get references from people of flesh and blood. Independence of information makes us free.

* The names and brands mentioned are names and trademarks of their respective owners. Concepts and sources consulted: Google®, Wikipedia, DNS Queries, Robotstxt.org and Yoast.com.

* To contact the author of this article write to [email protected]

