Not getting enough traffic from search engines? You may have a crawling problem. Crawling your web pages is one of the first in a series of steps that search engines must take in order to have your web pages show up in search results. If your pages aren’t crawled, they won’t show up in the search results.
What is Crawling?
The search engines’ crawlers (also known as robots or spiders) start by downloading what they perceive to be the most important pages (such as the homepage) on your website. They collect all the links on each page and put them into a queue to be downloaded later. If they perceive one of the discovered pages to be of low importance, they may not download it at all.
After a page is crawled, the search engine analyzes it and determines whether it is worthy of being included in its search index. As time goes on, search engines will continue to request pages they have previously downloaded in order to see if the content has changed.
Based on how important a search engine such as Google thinks your website is, it will allocate a certain amount of bandwidth to crawling it. Bandwidth is consumed as each page on your website is downloaded. When the search engine reaches its bandwidth limit for your website, it will stop crawling your pages until the next crawl period.
How to Improve Crawling
Since the search engine allocates crawl bandwidth to your website, it is important to direct the search engine crawlers to the content you want in their search index and not to duplicate or unnecessary content. Some of these techniques are quite technical, but having the website development team understand and address the underlying issues is very important for your overall SEO.
1. Focus Crawlers on Desired Content
Help the crawlers find and focus on your desired content:
- Use a sitemap: XML or HTML (sitemaps should list your most important content)
- Use a robots.txt file to block content you don’t want search engines to crawl (see the robots.txt sketch after this list).
- Add rel="nofollow" to internal links pointing at pages you don’t want in the index, such as Login, Terms of Use, and Privacy Policy pages.
- 301 redirect (permanent redirect) requests for pages that no longer exist, either to a single 404 page or, better still, to other related content on your website (a redirect sketch follows this list).
- Limit the use of 302 (temporary) redirects; use 301 (permanent) redirects instead.
- Don’t allow the search engines to index internal website search results (block them with robots.txt).
- Reduce or avoid browser-side creation of links to other pages on your website. Avoid having links only in JavaScript, AJAX, Flash, etc., unless you have an HTML equivalent on the page.
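To make the robots.txt suggestions above concrete, here is a minimal sketch; the /search and /login paths and the sitemap URL are placeholders that would need to match your own site:

```
# robots.txt (served from the root of the domain, e.g. http://www.example.com/robots.txt)
User-agent: *
# Hypothetical paths: keep internal site-search results and login pages out of the crawl
Disallow: /search
Disallow: /login
# Point crawlers at the XML sitemap listing your most important content
Sitemap: http://www.example.com/sitemap.xml
```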
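And a sketch of a permanent redirect for a removed page, assuming an Apache server with .htaccess support (the paths are hypothetical; other servers have equivalent directives):

```
# .htaccess (Apache)
# 301 (permanently) redirect a page that no longer exists to related content
Redirect 301 /old-product.html /products/new-product.html
```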
2. Increase Page Importance
The search engine crawlers start with what they perceive to be the most important pages and return to them most often. To increase the importance of pages:
- Decrease the number of clicks from your home page to deep content (being close to other important pages makes a page more important)
- Increase the number of internal links to pages (number of links to a page makes the page more important)
- Increase the number of external links pointing directly to your pages, or to hub pages that link to many pages on your site, such as a top-level category page (this dramatically increases the importance of that page and raises the importance of the whole site)
- Avoid using nofollow on internal links to important content (nofollow takes away the vote that one page can give another); see the sketch below
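In HTML terms, the difference looks like this (the URLs are hypothetical): plain links pass the vote along, while rel="nofollow" withholds it:

```
<!-- A plain internal link: passes the vote on to an important page -->
<a href="/products/widgets/">Widgets</a>

<!-- rel="nofollow": withholds the vote; reserve it for pages like Login or Terms of Use -->
<a href="/login" rel="nofollow">Login</a>
```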
3. Increase the Number of Pages Crawled per Crawl Session
Reducing page bloat helps increase the number of pages crawled in a session:
- Decrease the kilobyte size of pages on your website by eliminating blank/white space in your HTML
- Use common external CSS and JavaScript files (the search engines won’t keep re-downloading these files)
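As a sketch (with hypothetical filenames), moving inline styles and scripts into shared external files lets the crawler fetch them once instead of re-downloading the same bytes inside every page:

```
<!-- Before: inline CSS and JavaScript bloat every page download -->
<style>body { margin: 0; }</style>
<script>function trackClick() { console.log("click"); }</script>

<!-- After: shared external files, downloaded once and cached -->
<link rel="stylesheet" href="/css/site.css">
<script src="/js/site.js"></script>
```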
4. Avoid Duplicate Content
Stop the search engines from wasting their time on duplicate content. Having multiple pages with the exact same content won’t help you rank better and wastes crawl bandwidth.
- Don’t create duplicate content on your website. This often happens by accident as a result of using a poorly configured content management system.
- If you have duplicate content, 301 redirect requests for the duplicate page to the correct page, or use the canonical link tag (see the sketch after this list) if you have more than one legitimate version of a web page. (These methods tell the search engine which page is the original.)
- Don’t use session variables in your URLs to track users; use cookies instead. Session variables in a URL may look like this: http://domain.com/product.php?SESSID=BX56J7AS4096H (session variables in URLs often cause search engines to crawl duplicate content)
- Have only one version of your website, either www.domainname.com or domainname.com, and 301 redirect the undesired version to the desired one (a redirect sketch follows this list).
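A sketch of the canonical tag: note that it is a link element placed in the page head, not a META tag, and it points each duplicate version at the one page you want indexed (the URL is hypothetical):

```
<!-- In the <head> of every variant of the page (print view, tracking-parameter URLs, etc.) -->
<link rel="canonical" href="http://www.example.com/products/widgets/">
```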
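And a sketch of the single-version redirect, assuming an Apache server with mod_rewrite and a hypothetical example.com domain (reverse the direction if you prefer the bare domain):

```
# .htaccess (Apache, requires mod_rewrite)
RewriteEngine On
# If the request arrived without the www prefix...
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
# ...301 redirect it to the www version, preserving the requested path
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```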
5. On-Page Factors
All text content that is displayed to the user should be contained in the HTML returned for the requested page.
- Avoid using iframes and frames to load content (Often causes extra page fragments to show in the index)
- Avoid using AJAX/JavaScript to load content or links (these methods make it tougher for search engines to find the content)
- On pages that should be in the index, make sure that you DON’T have META tags that instruct the search engine NOT to index the content; the tag to watch for is shown below
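For reference, the tag in question looks like this:

```
<!-- If this appears in the <head> of a page, search engines will drop that page from the index -->
<meta name="robots" content="noindex, nofollow">
```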
6. Detect and Avoid Crawler Problems
- Register your website with Google Webmaster Tools. Its crawler reporting will show the crawl problems that Google encounters when crawling your website
- Avoid spider traps. These are essentially internal-link black holes: dynamic pages that create many new pages by adding parameters or sub-directories to URLs, sending search engines into infinite loops
- Avoid including items such as calendars that contain links that go forward and backward infinitely in time.
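One way to defuse such a trap, assuming the calendar lives under a hypothetical /calendar/ path, is to block it in robots.txt so crawlers never enter the loop:

```
# robots.txt: keep crawlers out of an infinitely paging calendar (hypothetical path)
User-agent: *
Disallow: /calendar/
```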
Finding and addressing these crawler issues can dramatically increase search traffic to your website. Depending on your content management system, fixing these crawling issues may take some effort, but it usually pays off in the long run.