Website Crawling: A Guide on Everything You Need to Know

sovrnmarketing // May 10, 2010

Understanding website crawling and how search engines crawl and index websites can be a confusing topic. Everyone does it a little bit differently, but the overall concepts are the same. Here is a quick breakdown of things you should know about how search engines crawl your website. (I’m not getting into the algorithms, keywords or any of that stuff, simply how search engines crawl sites.)

So what is website crawling?

Website Crawling is the automated fetching of web pages by a software process, the purpose of which is to index the content of websites so they can be searched. The crawler analyzes the content of a page looking for links to the next pages to fetch and index.

What types of crawls are there?

Two of the most common types of crawls that get content from a website are:

Site crawls are an attempt to crawl an entire site at one time, starting with the home page. It will grab links from that page, to continue crawling the site to other content of the site. This is often called “Spidering”.
Page crawls, which are the attempt by a crawler to crawl a single page or blog post.

Are there different types of crawlers?

There definitely are different types of crawlers. But one of the most important questions is, “What is a crawler?” A crawler is a software process that goes out to websites and requests the content as a browser would. After that, an indexing process actually picks out the content it wants to save. Typically the content that is indexed is any text visible on the page.
Different search engines and technologies have different methods of getting a website’s content with crawlers:

Crawls can get a snapshot of a site at a specific point in time, and then periodically recrawl the entire site. This is typically considered a “brute force” approach as the crawler is trying to recrawl the entire site each time. This is very inefficient for obvious reasons. It does, though, allow the search engine to have an up-to-date copy of pages, so if the content of a particular page changes, this will eventually allow those changes to be searchable.
Single page crawls allow you to only crawl or recrawl new or updated content. There are many ways to find new or updated content. These can include sitemaps, RSS feeds, syndication and ping services, or crawling algorithms that can detect new content without crawling the entire site.

Can crawlers always crawl my site?

That’s what we strive for at sovrn, but isn’t always possible. Typically, any difficulty crawling a website has more to do with the site itself and less with the crawler attempting to crawl it. The following issues could cause a crawler to fail:

The site owner denies indexing and or crawling using a robots.txt file.
The page itself may indicate it’s not to be indexed and links not followed (directives embedded in the page code). These directives are “meta” tags that tell the crawler how it is allowed to interact with the site.
The site owner blocked a specific crawler IP address or “user agent”.

All of these methods are usually employed to save bandwidth for the owner of the website, or to prevent malicious crawler processes from accessing content. Some site owners simply don’t want their content to be searchable. One would do this kind of thing, for example, if the site was primarily a personal site, and not really intended for a general audience.
I think it is also important to note here that robots.txt and meta directives are really just a “gentlemen’s agreement”, and there’s nothing to prevent a truly impolite crawler from crawling. sovrn’s crawlers are polite, and will not request pages that have been blocked by robots.txt or meta directives.

How do I optimize my website so it is easy to crawl?

There are steps you can take to build your website in such a way that it is easier for search engines to crawl it and provide better search results. The end result will be more traffic to your site, and enabling your readers to find your content more effectively.
Search Engine Accessibility Tips:

Having an rss feed or feeds so that when you create new content the search software can recognize new content and crawl it faster. sovrn uses the feeds on your site as an indicator that you have new content available.
Be selective when blocking crawlers using robots.txt files or meta tag directives in your content. Most blog platforms allow you to customize this feature in some way. A good strategy to employ is to let the search engines in that you trust, and block those you don’t.
Building a consistent document structure. This means when you construct your html page that the content you want crawled is consistently in the same place under the same content section.
Having content and not just images on a page. Search engines can’t find an image unless you provide text or alt tag descriptions for that image.
Try (within the limits of your site design) to have links between pages so the crawler can quickly learn that those pages exist. If you’re running a blog, you might, for example, have an archive page with links to every post. Most blogging platforms provide such a page. A sitemap page is another way to let a crawler know about lots of pages at once.

To learn more about configuring robots.txt and how to manage it for your site, visit http://www.robotstxt.org/. Or contact us here at sovrn. We want you to be a successful blogger, and understanding website crawling is one of the most important steps.

Cookie	Duration	Description
__hssrc	session	This cookie is set by Hubspot whenever it changes the session cookie. The __hssrc cookie set to 1 indicates that the user has restarted the browser, and if the cookie does not exist, it is assumed to be a new session.
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
csrftoken	1 year	This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.
__hssc	30 minutes	HubSpot sets this cookie to keep track of sessions and to determine if HubSpot should increment the session number and timestamps in the __hstc cookie.
_auth	1 year	This cookie is set by Pinterest that collects statistical details to track the use of its services.
_pinterest_referrer	past	This cookie is set by Pinterest to track the use of its services.
_pinterest_sess	1 year	This cookie is set by Pinterest that collects statistical details to track the use of its services.
_routing_id	session	This cookie is set by Pinterest that collects statistical details to track the use of its services.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
lang	session	This cookie is used to store the language preferences of a user to serve up content in that stored language the next time user visit the website.
language	session	This cookie is used to store the language preference of the user.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
sp_landing	1 day	The sp_landing is set by Spotify to implement audio content from Spotify on the website and also registers information on user interaction related to the audio content.
sp_t	1 year	The sp_t cookie is set by Spotify to implement audio content from Spotify on the website and also registers information on user interaction related to the audio content.

Cookie	Duration	Description
__hstc	1 year 24 days	This is the main cookie set by Hubspot, for tracking visitors. It contains the domain, initial timestamp (first visit), last timestamp (last visit), current timestamp (this visit), and session number (increments for each subsequent session).
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gat_gtag_UA_52355958_2	1 minute	This cookie is set by Google and is used to distinguish users.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
vuid	2 years	Vimeo installs this cookie to collect tracking information by setting a unique ID to embed videos to the website.

Cookie	Duration	Description
_mkto_datetime	1 month	This cookie is set by Marketo.
bscookie	2 years	This cookie is a browser ID cookie set by Linked share Buttons and ad tags.
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.