Web crawler

Web crawlers (also known as bots, robots or spiders) are a type of software designed to follow links, gather information and then send that information somewhere.


Googlebot workflow:
Check list for where to go -> scan page -> send to Google -> list and record.

  • Googlebot retrieves the content of webpages (the words, code and resources that make up the webpage).
  • If that content links to other pages or resources, Googlebot notes those links.
  • It then sends the information to Google.
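The workflow above can be sketched as a small crawl loop. This is only a toy illustration of the "check list, scan page, note links, record" cycle, not how Googlebot is actually implemented. The names `LinkCollector`, `extract_links`, `crawl` and the in-memory `FAKE_WEB` site are all made up for this example, and the "index" here is just a dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.links]


def crawl(start_url, fetch):
    """Toy crawl loop; fetch is any callable mapping a URL to HTML or None."""
    to_visit = deque([start_url])  # the "list for where to go"
    seen = {start_url}
    index = {}                     # stand-in for what gets sent back
    while to_visit:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:           # page is not accessible to the crawler
            continue
        links = extract_links(html, url)
        index[url] = links         # record the page and the links noted on it
        for link in links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return index


# Hypothetical two-page site held in memory so the sketch needs no network.
FAKE_WEB = {
    "https://example.com/": '<a href="/about">About</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="/missing">Gone</a>',
}

result = crawl("https://example.com/", FAKE_WEB.get)
```

Note that the page at `/missing` is discovered through a link but never recorded, because the crawler could not retrieve it. That is the same situation as a page that is not visible to Googlebot.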

Google index

The information that Googlebot sends back to Google computers updates the Google index. The Google index is where webpages are compared and ranked.

  • In order for your webpages to be found in Google, they must be visible to Googlebot.
  • In order for your webpages to rank optimally, all webpage resources must be accessible by Googlebot.

Difference between Googlebot and Google index

  1. Googlebot
  • Googlebot retrieves content from the web.
  • Googlebot does not judge the content in any way; it only retrieves it.
  • The only concerns Googlebot has are “Can I access this content?” and “Is there any further content that I can access?”
  2. The Google index
  • The Google index takes the content it receives from Googlebot and uses it to rank pages.
  • The first step of being ranked by Google is to be retrieved by Googlebot.

Can Googlebot “see” my pages?

To get an idea of what Google sees from your site, do the following Google search…

By putting “site:” in front of your domain name, you ask Google to list the pages it has indexed for your site.
Tip: Make sure there is no space between “site:” and your domain name when you do this. Here is an example using this site…
If you see fewer pages than you would expect, check that you are not blocking Googlebot with your robots.txt file.
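One way to check a robots.txt file for an accidental block is to test a few URLs against it. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs and the helper name `can_fetch` are made up for illustration, and Python's parser may not match Google's own robots.txt handling in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice you would read your own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blog_allowed = can_fetch(ROBOTS_TXT, "Googlebot", "https://example.com/blog/post.html")
private_allowed = can_fetch(ROBOTS_TXT, "Googlebot", "https://example.com/private/page.html")
```

With these rules, `blog_allowed` is `True` and `private_allowed` is `False`, so any page under `/private/` would be expected to be missing from a “site:” search.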

Can Googlebot access all my content and links completely?

The next step is to ensure Google is seeing your content and links correctly.
Just because Googlebot can see your pages does not mean that Google has a perfect picture of exactly what those pages are.
Googlebot does not see a website the same way humans do. In the above image there is a webpage with one image on it. Humans can see the image, but what Googlebot sees is only the code calling that image.
Googlebot may be able to access that webpage (the HTML file) but, for various reasons, not the image found on that webpage.
In that scenario the Google index will not include that image, meaning that Google has an incomplete understanding of your webpage.
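To make “the code calling that image” concrete, here is a minimal sketch that lists the separate resources a page's HTML refers to. Each one is a separate URL that a crawler has to fetch on its own. The class name `ResourceCollector`, the function `list_resources` and the sample page are invented for this example. Real pages can also reference resources in other ways (CSS `url(...)`, `srcset`, and so on) that this sketch ignores.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collects images, scripts and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append((tag, urljoin(self.base_url, attrs["src"])))
        elif tag == "link" and attrs.get("href"):
            # Only stylesheet links are treated as page resources here.
            if (attrs.get("rel") or "").lower() == "stylesheet":
                self.resources.append((tag, urljoin(self.base_url, attrs["href"])))


def list_resources(html, base_url):
    """Return (tag, absolute_url) pairs for the resources the HTML calls."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources


# Hypothetical page: one stylesheet, one image, one script.
PAGE = (
    '<html><head><link rel="stylesheet" href="/style.css"></head>'
    '<body><p>Hello</p><img src="images/cat.png" alt="A cat">'
    '<script src="/app.js"></script></body></html>'
)

resources = list_resources(PAGE, "https://example.com/page.html")
```

The HTML itself only contains the reference `images/cat.png`. Whether the image ends up in the index depends on whether that second URL can also be retrieved.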

How does Googlebot “see” a webpage?

Googlebot does not see complete web pages; it only sees the individual components of each page.
If any of those components are not accessible to Googlebot, it will not send them to the Google index. To use our earlier example, here is Googlebot seeing a webpage (the HTML and CSS) but not seeing the image.
It isn’t just images. There are many pieces to a webpage. For Google to be able to rank your webpages optimally, Google needs the complete picture.
There are many scenarios where Googlebot might not be able to access web content. Here are a few common ones.

  • Resource blocked by robots.txt
  • Page links not readable or incorrect
  • Over-reliance on Flash or other technology that web crawlers may have issues with
  • Bad HTML or coding errors
  • Overly complicated dynamic links

Most of these things can be quickly checked by using the Google guidelines tool.
If you have a Google account, use the “fetch and render” tool found in Google Search Console (in current versions of Search Console, the URL Inspection tool fills this role). It provides a live example of exactly what Google sees for an individual page.

Can Googlebot access all of my page resources?

If CSS and JavaScript files are blocked by your robots.txt file, Google can seriously misunderstand your webpage content (much worse than just a missing image).
Increasingly, a webpage looks different, or even has different content, when its page resources are not loaded.
For example, a mobile page may use CSS or JavaScript to decide what to show depending on the device viewing the page. If Googlebot cannot access the CSS or JavaScript of that page, it may not realize the page can adapt to mobile devices.
In this scenario and others like it, Google will “see” your page, and may even understand it, but may not understand it well enough to realize that it could rank in more scenarios than the HTML alone suggests.
This can also be checked using the Google guidelines tool.
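As a rough local check, you can also test a page's CSS and JavaScript URLs against your robots.txt rules before relying on Google's tools. The sketch below again uses Python's standard `urllib.robotparser`. The function name `blocked_resources`, the rules and the URLs are hypothetical, and the result is only an approximation of how Googlebot itself would treat the file.

```python
from urllib.robotparser import RobotFileParser


def blocked_resources(robots_txt, resource_urls, user_agent="Googlebot"):
    """Return the resource URLs that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in resource_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules that block a whole assets folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /assets/
"""

# Hypothetical resources used by one page.
RESOURCES = [
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
    "https://example.com/images/logo.png",
]

blocked = blocked_resources(ROBOTS_TXT, RESOURCES)
```

Here `blocked` contains the CSS and JavaScript files but not the logo, which is exactly the case where the HTML is retrieved but the page's layout and behavior are not.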

How many Googlebots / Google webcrawlers are there?

  • Googlebot (Google Web search)
  • Google Smartphone
  • Google Mobile (Feature phone)
  • Googlebot Images
  • Googlebot Video
  • Googlebot News
  • Google AdSense
  • Google Mobile AdSense
  • Google AdsBot (landing page quality check)


How Google Works —
The Googlebot guide —
