The 5 Most Common Google Indexing Issues by Website Size

Google is open about the fact that it doesn’t index all the pages it can find. Using Google Search Console, you can see the pages on your website that are not indexed.

Google Search Console also provides you with useful information about the specific issue that prevented a page from being indexed.

These issues include server errors, 404s, and hints that the page may have thin or duplicate content.

But we never get to see any data showing which problems are the most common across the whole web.

So… I decided to gather data and compile the statistics myself!

In this article, we’ll explore the most popular indexing issues that are preventing your pages from showing up in Google Search.

Indexing 101

Indexing is like building a library, except instead of books, Google deals with websites.

If you want your pages to show up in search, they have to be properly indexed. In layman’s terms, Google has to find them and save them.

Then, Google can analyze their content to decide which queries they might be relevant for.

Getting indexed is a prerequisite for getting organic traffic from Google. And the more pages of your website get indexed, the more chances you have of appearing in the search results.

That’s why it’s really important for you to know if Google can index your content.

Here’s What I Did to Identify Indexing Issues

My day-to-day tasks include optimizing websites from a technical SEO standpoint to make them more visible in Google. As a result, I have access to several dozen sites in Google Search Console.

I decided to put this to use in order to hopefully make popular indexing issues… well, less popular.

For transparency, I broke down the methodology that led me to some interesting conclusions.

Methodology

I began by creating a sample of pages, combining data from two sources:

  • I used the data from our clients that was readily available to me.
  • I asked other SEO professionals to share anonymized data with me, by publishing a Twitter poll and reaching out to some SEOs directly.

Both proved fruitful sources of information.

Excluding Non-Indexable Pages

It’s in your interest to leave some pages out of indexing. These include old URLs, articles that are no longer relevant, filter parameters in ecommerce, and more.

Webmasters can make sure Google ignores them in a number of ways, including the robots.txt file and the noindex tag.
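
For example, blocking a section in robots.txt or marking a single page with a noindex tag can look like this (the paths below are hypothetical placeholders):

  # robots.txt: tell crawlers not to request anything under /old-archive/
  User-agent: *
  Disallow: /old-archive/

  <!-- In the <head> of a page that may be crawled but should stay out of the index -->
  <meta name="robots" content="noindex">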

Taking such pages into consideration would negatively affect the quality of my findings, so I removed pages that met any of the criteria below from the sample:

  • Blocked by robots.txt.
  • Marked as noindex.
  • Redirected.
  • Returning an HTTP 404 status code.

Excluding Non-Valuable Pages

To further improve the quality of my sample, I considered only those pages that are included in sitemaps.

Based on my experience, sitemaps are the clearest representation of valuable URLs from a given website.
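
For reference, a bare-bones XML sitemap listing two valuable URLs looks roughly like this (example.com is a placeholder):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
    <url><loc>https://example.com/products/blue-widget</loc></url>
  </urlset>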

Of course, there are many websites that have junk in their sitemaps. Some even include the same URLs in their sitemaps and robots.txt files.

But I took care of that in the previous step.

Categorizing Data

I found that popular indexing issues vary depending on the size of a website.

Here’s how I split up the data:

  • Small websites (up to 10k pages).
  • Medium websites (from 10k to 100k pages).
  • Big websites (from 100k to 1 million pages).
  • Huge websites (over 1 million pages).

Because of the differences in the size of the websites in my sample, I had to find a way to normalize the data.

One very large website struggling with a particular issue could outweigh the problems other, smaller websites may have.

So I looked at each website individually and ranked the indexing issues it struggles with. Then I assigned points to each issue based on the number of pages affected by it on a given website.
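
To make this concrete, here’s a minimal sketch of that ranking-and-points idea in Python. The site names, issue labels, page counts, and point values are illustrative placeholders rather than my actual data or script:

  from collections import defaultdict

  # Pages affected by each issue, per website (numbers are made up).
  site_issues = {
      "site-a.example": {"Crawled, currently not indexed": 1200, "Duplicate content": 300},
      "site-b.example": {"Duplicate content": 90, "Soft 404": 40, "Crawl issue": 10},
  }

  scores = defaultdict(int)
  for site, issues in site_issues.items():
      # Rank issues within a single site by affected pages, so one huge
      # website can't outweigh everyone else in the overall statistics.
      ranked = sorted(issues, key=issues.get, reverse=True)
      for points, issue in zip(range(len(ranked), 0, -1), ranked):
          scores[issue] += points

  for issue, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
      print(f"{score:>3}  {issue}")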

And the Verdict Is…

Here are the top five issues I found on websites of all sizes.

  1. Crawled, currently not indexed (quality issue).
  2. Duplicate content.
  3. Discovered, currently not indexed (crawl budget/quality issue).
  4. Soft 404.
  5. Crawl issue.

Let’s break these down.

Quality

Quality issues include pages with thin, misleading, or overly biased content.

If your page doesn’t provide unique, valuable content that Google wants to show to users, you will have a hard time getting it indexed (and shouldn’t be surprised).

Duplicate Content

Google may recognize some of your pages as duplicate content, even if you didn’t mean for that to happen.

A common issue is a canonical tag pointing to a different page. The result is the original page not getting indexed.

If you do have duplicate content, use a canonical tag or a 301 redirect.

This will help you make sure that the same pages on your site aren’t competing against each other for views, clicks, and links.
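
For example, a duplicate page can point Google to the version you want indexed with a canonical tag (the URL is a placeholder); a server-side 301 redirect achieves something similar by sending users and crawlers straight to the preferred URL:

  <!-- In the <head> of the duplicate page -->
  <link rel="canonical" href="https://example.com/preferred-page/">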

Crawl Budget

What is crawl budget? Based on several factors, Googlebot will only crawl a certain number of URLs on each website.

This means optimization is vital; don’t let Googlebot waste its crawl budget on pages you don’t care about.
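
On an ecommerce site, for instance, one common way to do this is to keep Googlebot away from low-value filter URLs with robots.txt rules like these (the parameter names are hypothetical):

  User-agent: Googlebot
  Disallow: /*?sort=
  Disallow: /*?color=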

Soft 404s

404 errors mean you submitted a deleted or non-existent page for indexing. Soft 404s display “not found” information but don’t return the HTTP 404 status code; the server responds with 200 (OK) instead.

Redirecting removed pages to others that are irrelevant is a common mistake.

Multiple redirects may also show up as soft 404 errors. Strive to shorten your redirect chains as much as possible.
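
One quick way to spot soft 404 candidates is to check what status codes your removed pages actually return; anything answering 200 while showing “not found” content deserves a closer look. Here’s a rough sketch in Python (the URLs are placeholders):

  import requests

  # Pages you know have been removed (placeholder URLs).
  removed_urls = [
      "https://example.com/discontinued-product",
      "https://example.com/old-promo",
  ]

  for url in removed_urls:
      response = requests.get(url, allow_redirects=True, timeout=10)
      if response.status_code == 200 and "not found" in response.text.lower():
          print(f"Possible soft 404: {url}")
      elif response.status_code not in (404, 410):
          print(f"Unexpected status {response.status_code}: {url}")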

Crawl Issue

There are many crawl issues, but an important one is a problem with robots.txt. If Googlebot finds a robots.txt for your site but can’t access it, it will not crawl the site at all.

Finally, let’s look at the results for different website sizes.

Small Websites

Sample size: 44 sites

  1. Crawled, currently not indexed (quality or crawl budget issue).
  2. Duplicate content.
  3. Crawl budget issue.
  4. Soft 404.
  5. Crawl issue.

Medium Websites

Sample size: 8 sites

  1. Duplicate content.
  2. Discovered, currently not indexed (crawl budget/quality issue).
  3. Crawled, currently not indexed (quality issue).
  4. Soft 404 (quality issue).
  5. Crawl issue.

Big Websites

Sample size: 9 sites

  1. Crawled, currently not indexed (quality issue).
  2. Discovered, currently not indexed (crawl budget/quality issue).
  3. Duplicate content.
  4. Soft 404.
  5. Crawl issue.

Huge Websites

Sample size: 9 sites

  1. Crawled, currently not indexed (quality issue).
  2. Discovered, currently not indexed (crawl budget/quality issue).
  3. Duplicate content (duplicate, submitted URL not selected as canonical).
  4. Soft 404.
  5. Crawl issue.

Key Takeaways on Common Indexing Issues

It’s interesting that, according to these findings, websites of two size classes suffer from the same issues:

  • Larger than 100k pages, but smaller than 1 million.
  • Larger than 1 million pages.

This shows how difficult it is to maintain quality on large websites.

The key takeaways are that:

  • Even relatively small websites (10k+) may not be fully indexed because of an insufficient crawl budget.
  • The bigger the website is, the more pressing the crawl budget/quality issues become.
  • The duplicate content issue is severe but changes its nature depending on the website.

P.S. A Note About URLs Unknown for Google

During my research, I realized that there’s one more common issue that prevents pages from getting indexed.

It may not have earned a place in the rankings above, but it’s still significant, and I was surprised to see it’s still so popular.

I’m talking about orphan pages.

Some pages on your website may have no internal links leading to them.

If there is no path for Googlebot to reach a page through your website, it may not find that page at all.

What’s the solution? Add links from related pages.

You can also fix this manually by adding the orphan page to your sitemap. Unfortunately, many webmasters still neglect to do this.
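
If you want to hunt for orphan pages yourself, one rough approach is to compare the URLs listed in your sitemap against the internal links found on those same pages; sitemap URLs that nothing links to are orphan candidates. The sketch below assumes a small site with a single sitemap file, and the sitemap URL is a placeholder:

  import xml.etree.ElementTree as ET
  from urllib.parse import urljoin

  import requests
  from bs4 import BeautifulSoup

  SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
  NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

  # 1. Collect every URL listed in the sitemap.
  sitemap_xml = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
  sitemap_urls = {loc.text.strip() for loc in sitemap_xml.findall(".//sm:loc", NS)}

  # 2. Collect every internal link found on those pages.
  linked_urls = set()
  for url in sitemap_urls:
      soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
      for a in soup.find_all("a", href=True):
          linked_urls.add(urljoin(url, a["href"]).split("#")[0])

  # 3. Sitemap URLs that no crawled page links to are orphan candidates.
  for url in sorted(sitemap_urls - linked_urls):
      print("Orphan candidate:", url)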
