XML sitemap best practices

the short answer

An XML sitemap lists the URLs you want crawled, with optional lastmod, changefreq, and priority hints, so search and AI engines can discover and re-crawl your content efficiently. Best practice: include only canonical, indexable, 200-status URLs; keep lastmod accurate; reference the sitemap from robots.txt; split into a sitemap index once a file would exceed 50,000 URLs or 50MB uncompressed; and never list redirected, noindexed, or duplicate pages. A messy sitemap wastes crawl budget and erodes trust in your signals.

50,000: the maximum number of URLs (and 50MB uncompressed size) a single sitemap file may contain before you must split into a sitemap index. Source: sitemaps.org protocol / Google Search Central, 2024

What belongs in a sitemap

A sitemap is a list of the pages you actually want indexed — nothing more. Every URL in it should be canonical, return a 200 status, and be eligible for indexing. Including junk teaches engines to distrust your sitemap and wastes the crawl budget you want spent on real content.

In practice: include only canonical, indexable, 200-status URLs — no redirects, no noindex, no duplicates; use absolute URLs with your preferred protocol and host (https, www-or-not, consistently); set lastmod to the real last-modified date so engines know what to re-crawl; and keep priority and changefreq honest, or omit them, since engines largely treat them as hints.
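The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not a production generator: the `Page` record and its fields are hypothetical stand-ins for whatever your CMS or crawler provides, and `example.com` is a placeholder host.

```python
from dataclasses import dataclass
from datetime import date
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:
    # Hypothetical record; real values would come from your CMS or a crawl.
    url: str
    status: int
    canonical: str
    noindex: bool
    last_modified: date


def belongs_in_sitemap(page: Page) -> bool:
    """Keep only canonical, indexable, 200-status, absolute https URLs."""
    return (
        page.status == 200
        and not page.noindex
        and page.canonical == page.url
        and page.url.startswith("https://")
    )


def build_sitemap(pages: list[Page]) -> str:
    """Serialize the qualifying pages as a <urlset> document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for page in pages:
        # Skip junk and exact duplicates.
        if not belongs_in_sitemap(page) or page.url in seen:
            continue
        seen.add(page.url)
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page.url
        # lastmod is the real modification date, not the generation time.
        ET.SubElement(url_el, "lastmod").text = page.last_modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

The sketch omits priority and changefreq entirely, which is one of the two honest options described above.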

Structure and scale

A single sitemap file holds up to 50,000 URLs or 50MB uncompressed, whichever limit it reaches first. Past that, use a sitemap index file that points to multiple sitemaps. Many sites split deliberately by section (posts, products, pages) even below the limit, so they can spot crawl issues per area.
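As a rough sketch of the splitting step, the Python below chunks a URL list at the 50,000 limit and builds the index that points at each chunk. The file naming scheme (`posts-1.xml` and so on) and the `https://example.com/sitemaps` base are assumptions for illustration, and the 50MB size limit is not checked here.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemap protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def plan_sitemaps(urls, base="https://example.com/sitemaps", section="pages"):
    """Map each (hypothetical) sitemap file URL to the URLs it should hold."""
    chunks = chunk_urls(urls)
    return {
        f"{base}/{section}-{n}.xml": chunk
        for n, chunk in enumerate(chunks, start=1)
    }


def build_sitemap_index(sitemap_urls):
    """Serialize a <sitemapindex> that references each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Calling `plan_sitemaps` once per section and passing all the resulting file URLs to `build_sitemap_index` gives the per-section layout described above.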

Reference your sitemap from robots.txt with a `Sitemap:` line and submit it in Google Search Console and Bing Webmaster Tools. AI engines and crawlers discover sitemaps the same way, so a discoverable, clean sitemap helps GEO too.
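To illustrate what the `Sitemap:` line looks like and how a crawler might read it, here is a small sketch using Python's standard `urllib.robotparser` (its `site_maps()` method requires Python 3.8 or later). The robots.txt content and the `example.com` URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; the Sitemap line takes an absolute URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap_index.xml
"""


def sitemaps_declared(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []
```

An empty result from a check like this is a quick way to notice that the robots.txt reference is missing.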

Common sitemap mistakes

Sitemap problems are quiet — pages just don't get indexed, and you rarely get an error. The fixes are simple once you know what to look for.

The usual culprits: listing redirected or 404 URLs, which wastes crawl budget and signals staleness; listing noindex or canonicalized-away pages, which contradicts your own signals; lastmod dates that never change or that all update at once, which engines learn to ignore; forgetting to reference the sitemap from robots.txt; and not regenerating the sitemap when content changes.
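A simple audit for several of these culprits might look like the sketch below. It assumes you already have crawl data for each listed URL: the `CrawlResult` record is hypothetical, and the "identical lastmod on every URL" rule is only a heuristic for dates stamped at generation time, not something the protocol defines.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


@dataclass
class CrawlResult:
    # Hypothetical crawl record; a real audit would fill it from HTTP responses.
    status: int
    noindex: bool = False
    canonical: Optional[str] = None


def audit_sitemap(sitemap_xml: str, crawl: dict) -> list:
    """Return (url, problem) pairs for entries that should not be listed."""
    problems = []
    lastmods = []
    root = ET.fromstring(sitemap_xml)
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=NS)
        if lastmod is not None:
            lastmods.append(lastmod.strip())
        result = crawl.get(loc)
        if result is None:
            problems.append((loc, "no crawl data"))
        elif 300 <= result.status < 400:
            problems.append((loc, f"redirects ({result.status})"))
        elif result.status != 200:
            problems.append((loc, f"returns {result.status}"))
        elif result.noindex:
            problems.append((loc, "noindex"))
        elif result.canonical is not None and result.canonical != loc:
            problems.append((loc, "canonicalized to another URL"))
    # Heuristic: one shared date across all URLs suggests lastmod is not real.
    if len(lastmods) > 1 and len(set(lastmods)) == 1:
        problems.append(("*", "identical lastmod on every URL"))
    return problems
```

Because sitemap problems rarely surface as errors, running a check like this whenever the sitemap is regenerated is one way to catch them early.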

frequently asked

Do priority and changefreq actually matter?

Very little. Google's documentation says it ignores priority and changefreq values, and other engines treat them as weak hints at best. lastmod is the most useful field, but only when it is consistently accurate. Don't over-invest in tuning priority; focus on listing only clean, canonical URLs.

Should I include images and videos in my sitemap?

You can use image and video sitemap extensions if media discovery matters for your site, but for most sites a clean URL sitemap is the priority. Add media extensions only when you have substantial media you want indexed.

How do I know if my sitemap is healthy?

Run your site through seo·check. It discovers your robots.txt and sitemap, counts the URLs, and flags sitemap health as part of the audit, so you can catch missing or broken sitemaps fast.

Published April 21, 2026 · Last updated June 16, 2026
