The tireless search for information (II)

Now we analyze search engine robots: how they work and the main characteristics that identify them.

By Osvaldo Callegari*

Internet search engines: tools or a threat to privacy? How searches really work can seem like a well-kept state secret. Today, artificial intelligence techniques are used to interpret queries and also to profile user behavior.

Search engine robots
A web robot, known by several names (spider, crawler, etc.), is an application that traverses the Internet in order to store a copy of each page's content, or simply of its index, on a server. For this purpose, the owner of a website places a robots.txt file in the root directory indicating whether the site may be indexed by search engines.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is sometimes called the Robots Exclusion Protocol.

It works like this: a robot wants to visit a URL on a website, for example:
http://www.example.com/welcome.html.

Before doing so, it first checks http://www.example.com/robots.txt and finds:
User-agent: *
Disallow: /

The "user agent: *" means that this section applies to all robots.
The message "Do not allow: /" tells the robot not to visit any page of the site.
There are two important considerations when using /robots.txt:

    • Robots can ignore your /robots.txt.
        ◦ In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
    • The /robots.txt file is a publicly available file.
        ◦ Anyone can see which sections of your server you do not want robots to visit.

"It is not advisable to use /robots.txt to hide information"

Some additional definitions
A robot is a program that automatically traverses the hypertext structure of the Web by retrieving a document and then, recursively, all the documents it references.

Note that "recursive" here does not limit the definition to any specific cross-sectional algorithm; even if a robot applies a certain heuristic to the selection and ordering of documents to visit and space out requests over a long period of time, it is still a robot.

Normal web browsers are not robots, because they are operated by a human and do not automatically retrieve referenced documents (other than inline images).
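
A minimal sketch of that recursive traversal in Python follows (the seed URL is a placeholder; a real robot would also honour /robots.txt, identify itself, and pace its requests):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # Collects the targets of <a href="..."> tags found on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    # Breadth-first traversal: retrieve a document, then queue the documents it references.
    frontier, seen = deque([seed]), {seed}
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable page or non-HTTP link: skip it
        collector = LinkCollector()
        collector.feed(page)
        for link in collector.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

print(crawl("http://www.example.com/"))  # placeholder seed URL
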

Expanded robot development
"There is no will for a final Robot Standard to prosper.txt"

There is no active effort to develop /robots.txt further, and it is not known whether technical standards bodies such as the IETF or the W3C are working in this area.

There are some industry efforts to extend the robot exclusion mechanisms. See for example the collaborative efforts announced on the Yahoo! Search Blog, the Google Webmaster Central Blog, and the Microsoft Live Search Webmaster Team Blog, which include support for wildcards, sitemaps, additional META tags, and so on.

Of course, it is important to realize that older robots may not support these new mechanisms. For example, if you use "Disallow: /*.pdf$" and a robot does not treat '*' and '$' as wildcard and anchor characters, your PDF files will not be excluded.
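
The difference is easy to demonstrate. Below is a rough Python sketch of what a wildcard-aware robot has to do with such a rule (the rule_to_regex helper is illustrative, not part of any standard library):

import re

def rule_to_regex(rule):
    # Translate a Disallow value using the extended '*' and '$' syntax into a
    # regular expression over URL paths: '*' matches anything, '$' anchors the end.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile("^" + pattern)

# A wildcard-aware robot blocks any path ending in .pdf ...
print(bool(rule_to_regex("/*.pdf$").match("/reports/q3.pdf")))                 # True
# ... while a robot that reads the rule literally matches nothing useful.
print(bool(re.compile("^" + re.escape("/*.pdf$")).match("/reports/q3.pdf")))   # False
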

The details:
/robots.txt is a de facto standard and is not owned by any standards body.

There are two historical descriptions:
    • The original 1994 document, A Standard for Robot Exclusion.
    • A 1997 Internet Draft specification, A Method for Web Robots Control.
      
External resources:
    • HTML 4.01 Specification, Appendix B.4.1
        ◦ https://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1
    • Wikipedia - Robot Exclusion Standard
        ◦ https://en.wikipedia.org/wiki/Robots_exclusion_standard
          
Overview: some simple recipes for using /robots.txt on your server.

Note that you need a separate "Disallow" line for each URL prefix you want to exclude; you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line.

In addition, you may not have blank lines within a record, because blank lines are used to delimit multiple records. Note also that globbing and regular expressions are not supported in the User-agent or Disallow lines.

The '*' in the User-agent field is a special value that means "any robot".

Specifically, you can't have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

What you want to exclude depends on your server. Anything that is not explicitly disallowed is considered fair game to retrieve.

Here are some examples:
- To exclude all robots from the entire server:
User-agent: *
Disallow: /

- To allow all robots full access:
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use one at all)

- To exclude all robots from part of the server (this example excludes three directories):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

- To exclude a single robot:
User-agent: BadBot
Disallow: /

- To allow a single robot (checked programmatically in the sketch after these examples):
User-agent: Google
Disallow:

User-agent: *
Disallow: /

- To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easiest way is to put all the files to be disallowed into a separate directory, say "stuff", and leave the one file at the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively, you can explicitly disallow every page you want kept out:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
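
These recipes can be verified with the same standard-library parser used earlier. For instance, a sketch of the "allow a single robot" record above (SomeOtherBot is just an illustrative stand-in for any other crawler):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Google", "/index.html"))        # True: its own record allows everything
print(parser.can_fetch("SomeOtherBot", "/index.html"))  # False: the catch-all record disallows /
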

Robots.txt (continued)
If a webmaster does not want their pages analyzed by a bot, they can add a robots.txt file, which prevents GoogleBot (or other bots) from examining one or more pages, or even all of the content, of the website.

Google Bot
Google uses a large number of computers to send its crawlers to every corner of the network to find these pages and see what is on them. Googlebot is Google's robot, or web crawler; other search engines have their own.

How Googlebot works
Googlebot uses sitemaps and databases of links discovered during previous crawls to determine where to go next. Every time the crawler finds new links on a site, it adds them to the list of pages to visit next. If Googlebot finds broken links or changes to existing links, it takes note so that the index can be updated. The program also determines how often it will crawl each page. To make sure that Googlebot can properly index your site, you need to check its crawlability. If your site is available to crawlers, they will come often.

GoogleBot discovers links to other pages and follows them as well, so it can eventually cover the entire web. It is the robot that Google uses to 'crawl' Internet sites. It not only indexes web pages (HTML, HTML5) but also extracts information from PDF, PS, XLS, DOC and some other file types.
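
As a rough illustration of the sitemap side of this process, a site's sitemap.xml can be used to seed a crawl list (a sketch using only the Python standard library; the sitemap URL is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    # Download a sitemap.xml and yield the page URLs listed in its <loc> elements.
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    for loc in tree.iter(SITEMAP_NS + "loc"):
        yield loc.text.strip()

# Seed a crawl frontier with the pages the site itself advertises.
frontier = list(sitemap_urls("http://www.example.com/sitemap.xml"))
print(frontier)
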

How often Googlebot accesses a website depends on the PageRank of that website. The higher this value, the more frequently the robot will visit its pages.
For example, sites with PR10 (the highest value), such as yahoo.com or usatoday.com, will have been 'crawled' by GoogleBot yesterday or even today, while others were last visited several weeks ago. This can be checked by looking at the 'cache' copy of the page.

DeepBot
Googlebot has two versions, DeepBot and FreshBot. DeepBot investigates deeply, trying to follow every link on a page; it also places the page in the cache, making it available to Google. In March 2006, it took a month to complete the process.

FreshBot
FreshBot crawls the web looking for new content. It visits sites that change frequently: ideally, FreshBot visits a newspaper's page every day, and a magazine's every week or every fifteen days. This way it can pick up, for example, news that has just happened, without having to wait weeks.

Check
To check whether GoogleBot has accessed our website, we have to look at our server's logs and see whether there are access entries in which 'GoogleBot' appears. Generally the name of the server will appear, which may be one of these (a log-scanning sketch follows the table):
Server                   IP address

crawl1.googlebot.com     216.239.46.20
crawl2.googlebot.com     216.239.46.39
crawl3.googlebot.com     216.239.46.61
crawl4.googlebot.com     216.239.46.82
crawl9.googlebot.com     216.239.46.234
crawler1.googlebot.com   64.68.86.9
crawler2.googlebot.com   64.68.86.55
crawler14.googlebot.com  64.68.82.138
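
A rough way to automate that check is sketched below. The access.log path and the combined log format are assumptions, and the reverse-then-forward DNS lookup is the usual way to confirm that a hit claiming to be Googlebot really came from Google:

import socket

def is_real_googlebot(ip):
    # Reverse-resolve the IP, require a googlebot.com or google.com host name,
    # then forward-resolve that name and make sure it maps back to the same IP.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# Scan the server log for entries whose user agent mentions Googlebot.
with open("access.log") as log:
    for line in log:
        if "Googlebot" in line:
            client_ip = line.split()[0]   # first field in the combined log format
            label = "verified" if is_real_googlebot(client_ip) else "suspect"
            print(label, client_ip)
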

Once Googlebot has 'crawled' our page, it will follow the links it finds on it (the HREF and SRC attributes). Therefore, if you want GoogleBot to index your website, it is enough for some other site to link to yours. If not, you can always submit your URL directly to Google.
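
A small sketch of that link-extraction step, pulling both HREF and SRC references out of a page with Python's standard HTML parser:

from html.parser import HTMLParser

class RefExtractor(HTMLParser):
    # Records every href (links) and src (images, scripts) attribute encountered.
    def __init__(self):
        super().__init__()
        self.hrefs, self.srcs = [], []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)
            elif name == "src" and value:
                self.srcs.append(value)

extractor = RefExtractor()
extractor.feed('<a href="/about.html">About</a> <img src="/logo.gif">')
print(extractor.hrefs)  # ['/about.html']
print(extractor.srcs)   # ['/logo.gif']
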

Different robots
There are several different types of robots. For example, AdSense and AdsBot check the quality of ads, while Mobile Apps Android checks Android apps.

Some of the most important robots
Name                 User agent
Googlebot (desktop)  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot (mobile)   Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot Video      Googlebot-Video/1.0
Googlebot Images     Googlebot-Image/1.0
Googlebot News       Googlebot-News

How Googlebot visits your site
To find out how often Googlebot visits your site and what it does there, you can dive into your log files or open the Crawl section of Google Search Console.
If you want to do really advanced things to optimize your site's crawling performance, you can use tools like Kibana or Screaming Frog's SEO Log Analyzer.

Conclusion
It is always worth using different search engines: often one engine gives us little information about something, and when we choose another we find what we want. That is why it is always good to use different tools, and to take advantage of their advanced options. Some even reach the Deep Web, which is not advisable because of the amount of spurious information that resides there, although that does not mean it has no consumers.

An interesting tip: before letting the search engine take over, type the full address of the page we want to see, so that advertising does not overwhelm us.

Robots, search engines and the management of information make us increasingly dependent on them; the way forward is to inform ourselves from different sources, experiment personally and get references from people of flesh and blood. Independence of information makes us free.

* The names and brands mentioned are names and trademarks of their respective owners. Concepts and sources consulted: Google®, Wikipedia, DNS Queries, Robotstxt.org and Yoast.com.

* To contact the author of this article write to [email protected]

