How to Audit an XML Sitemap So Google Indexes Only Ranked Pages

Most XML sitemaps have the same problem, and most site owners do not know it. The sitemap includes every URL the CMS emits - every tag archive, every author archive, every calendar-month archive, every filtered product listing, every duplicate that canonicalizes to another URL - and Google reads that list as "these are the URLs the site wants indexed." Google then crawls them all, decides most of them are thin or duplicate, and quietly declines to index many of them.

That "quiet non-indexation" shows up as flat organic traffic despite an ever-growing sitemap. The count of submitted URLs reported in Google Search Console keeps climbing. The count of indexed pages stays roughly the same. Nothing appears broken. Nothing appears missing. Everything appears to be working - except that nothing is ranking better than it did six months ago.

The audit that fixes this is not glamorous. It is a URL-by-URL check that ends with a sitemap containing only the pages the site actually wants ranked and the removal of everything else. Here is how to run it.

[Image: rows of card catalog drawers in a library archive room. Photo by George Diamanto on Pexels]

What Google actually does with a sitemap

Search engines treat a sitemap as a signal, not a directive. The presence of a URL in a sitemap tells Google, "we consider this page worth crawling and potentially indexing." Its absence does not tell Google not to crawl - crawl paths still exist through internal links, backlinks, and previously-known URLs - but its presence directly influences discovery priority.

That means every URL you leave in the sitemap is a claim you are making about your own content. Google reads the whole set as an implicit assertion. If half the URLs are auto-generated tag pages with 200 words of boilerplate, Google is inferring things about your site's overall quality from the average of what is in the sitemap.

Google's documentation is explicit about this: sitemaps should contain "canonical URLs that you want to be crawled and indexed." Everything else is noise, and noise costs you.

The four categories of sitemap leakage

Almost every leaky sitemap I have audited falls into one of four categories.

Category one: tag and taxonomy archives with thin content. Your CMS auto-generates a tag page for every tag on every post. Each tag page has a short intro and a list of post titles. There is no substantive content. Google reads them as thin, does not index them, and the crawl budget you spent on them is wasted.

Category two: paginated archive pages that duplicate the canonical. Blog page 2, page 3, page 4 - each an archive of post titles, none with substantive unique content. Search engines figured out how to handle these years ago, but including them in the sitemap invites crawl waste.

Category three: URLs the CMS generates that were never intended to rank. Author pages when you have three authors and one of them has never written a post. Category pages when the category was created for organizational purposes and has no visitor-facing utility. Preview links that made it into the export somehow.

Category four: legacy URLs from a previous site structure. A 2018 migration left old category slugs in the sitemap generator. A theme change added new URL patterns without removing old ones. A plugin update duplicated URL structures. All of these show up as sitemap entries that Google follows only to find nothing there (or worse, a soft 404 response), and each dead end lowers the crawl trust Google extends to the rest of your site.

Step one: pull the sitemap and inventory it

Start with the sitemap URL itself - usually at /sitemap.xml or /sitemap_index.xml. If your CMS has an index sitemap that references child sitemaps, follow every child. The audit is on the flattened list of every URL in every sitemap the index references.
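The flattening step can be sketched in a few lines of Python. This is a minimal sketch, not a finished crawler: it assumes the sitemaps follow the standard sitemaps.org schema, and `fetch` is a hypothetical caller-supplied function that takes a sitemap URL and returns its XML as text.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def flatten_sitemap(url: str, fetch: Callable[[str], str]) -> List[str]:
    """Return every page URL reachable from a sitemap or a sitemap index.

    `fetch` is a hypothetical helper supplied by the caller: it receives a
    sitemap URL and returns that sitemap's XML as a string.
    """
    root = ET.fromstring(fetch(url))

    # A sitemap index lists child sitemaps; recurse into each one.
    if root.tag.endswith("sitemapindex"):
        pages: List[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                pages.extend(flatten_sitemap(loc.text.strip(), fetch))
        return pages

    # A plain sitemap lists page URLs directly.
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    ]
```

In practice `fetch` might wrap `urllib.request.urlopen` and decode the response; keeping it as a parameter lets you test the flattening logic against saved XML files without touching the network.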

Export the list to a spreadsheet. Add columns for:

URL
Depth (root, category, tag, post, custom)
Word count of the target page (fetch this via a bulk crawler)
Indexed status (from Google Search Console URL Inspection or a bulk indexation checker)
Organic clicks in the last 90 days (from Google Search Console)

The spreadsheet is your working document. Nothing else needs to be looked at until it is populated.
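The Depth column can usually be filled in from the URL path alone. The sketch below assumes WordPress-style prefixes (`/tag/`, `/category/`) and treats a single path segment as a post; both are assumptions for illustration, so adjust the mapping to whatever your CMS actually emits.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes; replace with the patterns your CMS uses.
PREFIX_TO_DEPTH = {
    "tag": "tag",
    "category": "category",
}


def classify_depth(url: str) -> str:
    """Label a URL for the Depth column: root, category, tag, post, or custom."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return "root"
    if segments[0] in PREFIX_TO_DEPTH:
        return PREFIX_TO_DEPTH[segments[0]]
    # Assumption: one path segment is a post slug; anything deeper is custom.
    return "post" if len(segments) == 1 else "custom"
```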

[Image: notebook and pen on a desk with search analytics printouts. Photo by cottonbro studio on Pexels]

Step two: sort by indexed and clicks

Sort the spreadsheet by indexed status first, then by clicks descending. This gives you three visible groups.

Group A: indexed, receiving clicks. These are the pages Google agrees are worth ranking. They stay in the sitemap. Move on.

Group B: indexed, receiving zero or near-zero clicks. These are the pages Google indexed but does not rank. Either the content is not competitive for its target queries or there is no query that matches it. Investigate further before deciding - if the page is a genuine service or product page, it may need SEO work rather than removal. If it is a thin tag archive, it should come out.

Group C: not indexed. These are the pages Google decided are not worth indexing. Every one of them is a candidate for removal from the sitemap. Not necessarily from the site - some of them may have utility for visitors even if they will never rank - but from the sitemap absolutely.
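If the inventory lives in code rather than a spreadsheet, the same sort and grouping might look like the sketch below. The `Row` fields and the `min_clicks` cutoff for "near-zero" are assumptions for illustration; choose a cutoff that suits your traffic levels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    url: str
    indexed: bool
    clicks_90d: int


def sort_rows(rows: List[Row]) -> List[Row]:
    """Indexed rows first, then by clicks in descending order."""
    return sorted(rows, key=lambda r: (not r.indexed, -r.clicks_90d))


def assign_group(row: Row, min_clicks: int = 1) -> str:
    """Return "A", "B", or "C" for one inventory row.

    `min_clicks` is an assumed threshold: indexed pages below it count as
    "zero or near-zero clicks" and land in Group B.
    """
    if not row.indexed:
        return "C"
    return "A" if row.clicks_90d >= min_clicks else "B"
```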

Step three: fix or remove Group C

For every URL in Group C, the decision tree is:

Does this page have substantive, unique content? If no, remove it from the sitemap (and probably from the site). The CMS setting that generates it likely also needs adjustment.
Is this page a canonical target, or a duplicate that canonicalizes elsewhere? If it is a duplicate, remove it from the sitemap and confirm the rel=canonical tag is pointing to the target.
Is this page the correct URL for its content, or has the URL structure changed? If the URL is legacy, remove it from the sitemap and either 301-redirect it to the current URL or let it return a 410 (Gone).

Most Group C URLs are thin tag archives, duplicates, or legacy leftovers. Removing them from the sitemap does not hurt anything. The pages that have SEO potential move to Group B for further work. The rest are dropped.
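The decision tree above can be written down as a small function so that every Group C row gets the same treatment. This is a sketch under stated assumptions: the field names and the action labels are hypothetical, and the judgment call about "substantive, unique content" still has to be made by a person and recorded in `has_unique_content`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroupCPage:
    url: str
    has_unique_content: bool           # human judgment, recorded per URL
    canonical_url: str                 # value of rel=canonical on the page
    is_legacy_url: bool                # left over from an old URL structure
    current_url: Optional[str] = None  # replacement URL, if one exists


def decide(page: GroupCPage) -> str:
    """Apply the three questions in order and return an action label."""
    if not page.has_unique_content:
        return "remove_thin"           # drop from sitemap; review CMS setting
    if page.canonical_url != page.url:
        return "remove_duplicate"      # drop from sitemap; verify canonical
    if page.is_legacy_url:
        # Redirect when a current URL exists, otherwise serve 410 Gone.
        return "redirect_301" if page.current_url else "gone_410"
    return "move_to_group_b"           # has potential; needs SEO work
```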

Step four: work on Group B

Group B is where the SEO work lives. Indexed but not ranking usually means one of three things:

The page targets a query where the currently ranking results are much stronger. Rewrite for a longer-tail query or drop.
The page has weak internal linking - nothing links to it, so Google cannot infer its importance. Add internal links from higher-authority pages.
The page competes with a stronger page on your own site for the same query. This is keyword cannibalization, and the fix is either to consolidate the two pages or to differentiate them clearly.

The full workflow for the cannibalization case is a topic worth an article of its own. The point here is that Group B is where audit results turn into ranking wins - not by removing anything, but by identifying which pages need work.

"The biggest quiet SEO win most sites can make is not building more content - it is removing the URLs that Google is spending crawl budget on and not ranking. The reduced sitemap raises the average quality of what you are asking Google to consider, and the pages you actually want ranked get more crawler attention." - Dennis Traina, founder of 137Foundry

Step five: submit the cleaned sitemap

Once the sitemap is cleaned, resubmit it in Google Search Console. Bing Webmaster Tools accepts sitemap submissions too, and Bing's documentation covers the process there.

Monitor for two to four weeks. The indexation report in Search Console will show a drop in the "excluded" and "discovered - currently not indexed" counts, which is expected and desired. What you want to see is a steady climb in the "indexed" count as a percentage of submitted URLs, and eventually a growing "indexed" count in absolute terms as the pages you have worked on start showing up as ranked.
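The "indexed as a percentage of submitted" figure is simple arithmetic, but it is worth computing the same way every week so the trend is comparable. The numbers in the usage note below are hypothetical, chosen only to show how a smaller sitemap moves the ratio.

```python
def indexation_rate(indexed: int, submitted: int) -> float:
    """Indexed URLs as a percentage of submitted URLs, to one decimal place."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive count")
    return round(100 * indexed / submitted, 1)
```

With made-up counts, `indexation_rate(1950, 3000)` gives 65.0 before the cleanup and `indexation_rate(1600, 1800)` gives 88.9 after it. Note that the absolute indexed count can dip at first even while the rate improves.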

If the "excluded" count does not drop after four weeks, it usually means the CMS is still generating the URLs and the audit only cleaned the sitemap - not the underlying URL structure. Fix that at the CMS level.

Step six: prevent future leakage

The audit is a one-time cleanup. The prevention is a sitemap-generation configuration that only includes canonical, index-worthy URLs going forward.

For most modern CMSes this is a plugin setting or a config file. WordPress users can control it through Yoast SEO, Rank Math, or a similar plugin. Custom CMSes usually have a template file that emits the sitemap - modify the query to exclude thin archives.
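For a custom CMS, the filter might look something like the sketch below. It assumes each page record is a dict with hypothetical keys `url`, `canonical`, `noindex`, and `kind`, and that `EXCLUDED_KINDS` names the archive types you have decided not to submit; none of these names come from a real CMS.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical page kinds that should never reach the sitemap.
EXCLUDED_KINDS = {"tag", "author", "date_archive", "paginated"}


def build_sitemap(pages: Iterable[Dict]) -> str:
    """Emit sitemap XML containing only canonical, index-worthy URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        # Skip anything marked noindex or canonicalized to another URL.
        if page["noindex"] or page["canonical"] != page["url"]:
            continue
        # Skip thin archive types outright.
        if page["kind"] in EXCLUDED_KINDS:
            continue
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page["url"]
    return ET.tostring(urlset, encoding="unicode")
```

Filtering in the generator, rather than post-processing its output, keeps the exclusion rules in one place, so a new archive type is excluded by adding it to `EXCLUDED_KINDS`.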

The Screaming Frog SEO Spider is a good tool for cross-checking whether your sitemap generation is now aligned with what actually deserves indexing - run it against your site and compare its list of crawlable URLs against your current sitemap.
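Once both lists are exported, the comparison itself is a pair of set differences. The sketch below assumes you have already loaded the sitemap URLs and the crawler's list of indexable URLs into plain Python lists; it makes no assumption about the crawler's export format, and the key names in the result are illustrative.

```python
from typing import Dict, Iterable, List


def compare_sitemap_to_crawl(
    sitemap_urls: Iterable[str],
    crawled_indexable_urls: Iterable[str],
) -> Dict[str, List[str]]:
    """Report URLs that appear in one list but not the other."""
    sitemap = {u.strip() for u in sitemap_urls}
    crawl = {u.strip() for u in crawled_indexable_urls}
    return {
        # Submitted, but the crawl did not find them as indexable pages.
        "in_sitemap_not_in_crawl": sorted(sitemap - crawl),
        # Indexable and reachable, but missing from the sitemap.
        "in_crawl_not_in_sitemap": sorted(crawl - sitemap),
    }
```

The first list is the leakage to investigate; the second is a review queue, since not every crawlable URL deserves a place in the sitemap.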

Whichever CMS you are on, the target is that every URL in the sitemap passes a "would this rank if it were the only URL on the site" test. Everything else should be removed at the generator, not filtered at the sitemap.

[Image: search engine analytics dashboard on a monitor with charts. Photo by Jakub Zerdzicki on Pexels]

What the audit is worth

For a mid-sized content site with a few thousand URLs, this audit typically shrinks the sitemap by 40 to 70 percent. It takes one focused day of work by someone who understands the CMS and the SEO tooling.

The measurable payoff shows up over the following two to three months as:

The overall indexation rate climbs from a typical mid-range level (60 to 70 percent) to a high one (85 percent or more).
Crawl requests to low-value URLs drop and requests to index-worthy URLs rise.
Rankings for pages that were already indexed improve, because those pages are getting a bigger share of crawl attention.
Rankings for pages that were not indexed start appearing, because the pages are now being crawled and indexed for the first time.

None of these are guaranteed. All of them are typical.

For related technical-SEO audits and workflows, the 137Foundry blog publishes deeper writeups on the specific mechanics of each step, and the technical SEO service page covers what we take on for clients who want the audit done for them. Related work on data-side content pipelines lives on the services hub for other angles on the same underlying problem.

Clean the sitemap. Prevent the leakage at the CMS. Then wait. The indexation numbers move on their own.
