Robots.txt, Sitemap.xml, and llms.txt: What Each File Actually Controls

A practical guide to what robots.txt, sitemap.xml, and llms.txt control, how they work with AI crawlers, and what to check before you ship.

Most teams know they should have robots.txt and a sitemap. More teams are now asking whether they also need llms.txt, Content Signals, or a list of AI crawler rules.

The awkward bit is that these files are often discussed together, even though they do different jobs. Mix them up and you either block discovery by accident, publish stale URLs, or give AI systems a confident but outdated map of your site.

Here is the practical model I would hand to engineering, SEO, and legal before a release.

The quick version

  • robots.txt tells compliant crawlers what they may fetch.
  • sitemap.xml tells search engines which canonical URLs matter and when they changed.
  • llms.txt gives AI retrieval systems a curated map of your most useful public resources.
  • Content Signals tell automated systems how you want content used after it has been accessed.
  • None of these files is a login wall, a privacy control, or a substitute for fixing page-level metadata.

robots.txt: crawl guidance, not access control

robots.txt sits at the root of a host, for example https://example.com/robots.txt. It applies only to that exact host, protocol, and port. A file on example.com does not control www.example.com, docs.example.com, or a staging domain.
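Because the file is scoped this narrowly, the robots.txt that governs any page can be derived from the page URL alone. A minimal Python sketch, where robots_url is an illustrative helper name rather than anything from a library:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Scheme, host, and port are kept; path, query, and fragment are dropped.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Each host resolves to its own file, so each one needs its own review.
for url in (
    "https://example.com/pricing",
    "https://www.example.com/pricing",
    "https://docs.example.com/guides/setup?lang=en",
):
    print(url, "->", robots_url(url))
```

Running this over the hosts you actually serve is a quick way to list every robots.txt you are responsible for.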

Use it for:

  • Blocking crawler access to account, admin, cart, search-result, or internal app paths.
  • Pointing crawlers towards your sitemap files.
  • Making crawler policy explicit for search and AI user agents.
  • Reducing crawl waste on URLs that should never become public entry points.

Do not use it for:

  • Hiding private data. Anyone can open a URL directly if the app serves it.
  • Removing an already indexed page from search results. Use noindex on a crawlable page, or an X-Robots-Tag header for non-HTML files.
  • Blocking CSS, JavaScript, or images that search engines need to render important pages.

The classic production failure is still this:

User-agent: *
Disallow: /

That can be fine on staging. On production, it is the sort of tiny file that can ruin a week.
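One way to stop that file reaching production is a small guard in the release pipeline. The sketch below is a heuristic under stated assumptions: it only inspects groups that include User-agent: *, it treats any Allow rule as a sign of a deliberate allowlist, and the function names are illustrative.

```python
def wildcard_group_rules(robots_txt: str) -> list:
    """Collect (directive, value) pairs from groups that include User-agent: *."""
    rules = []
    agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents = []
                in_rules = False
            agents.append(value.lower())
        elif key in ("allow", "disallow"):
            in_rules = True
            if "*" in agents:
                rules.append((key, value))
    return rules


def blocks_entire_site(robots_txt: str) -> bool:
    """True when the wildcard group has `Disallow: /` and no Allow rule at all."""
    rules = wildcard_group_rules(robots_txt)
    has_block = ("disallow", "/") in rules
    has_allow = any(key == "allow" and value for key, value in rules)
    return has_block and not has_allow


if __name__ == "__main__":
    staging_file = "User-agent: *\nDisallow: /\n"
    if blocks_entire_site(staging_file):
        print("robots.txt blocks the whole site; refusing to deploy to production")
```

In a real pipeline the check would read the file that is about to ship and exit non-zero when it returns True.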

If you want public discovery, a safer baseline looks more like this:

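User-agent: Googlebot
User-agent: Bingbot
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /dashboard/
Disallow: /billing/
Disallow: /settings/

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /dashboard/
Disallow: /billing/
Disallow: /settings/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
Sitemap: https://example.com/sitemap-help.xml

Notice the private app paths are blocked, not the whole site. They are repeated in the named group because a crawler that matches a named group follows only that group and ignores the rules under User-agent: *.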
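Group selection is easy to get wrong, so it is worth testing rules rather than reading them. The following is a simplified Python sketch of the matching described in RFC 9309: pick the groups that name the crawler (falling back to * only when none do), then apply the longest matching path rule, with Allow winning ties. It assumes exact user-agent token matches and plain path prefixes, with no support for * or $ wildcards in paths. The sample rules are hypothetical.

```python
def parse_groups(robots_txt: str) -> list:
    """Return a list of (agents, rules) groups; rules are (is_allow, path) pairs."""
    groups, agents, rules, in_rules = [], [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # rules already seen, so this line opens a new group
                groups.append((agents, rules))
                agents, rules, in_rules = [], [], False
            agents.append(value.lower())
        elif key in ("allow", "disallow") and agents:
            in_rules = True
            if value:  # an empty Disallow means "no restriction"
                rules.append((key == "allow", value))
    if agents:
        groups.append((agents, rules))
    return groups


def can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    groups = parse_groups(robots_txt)
    ua = user_agent.lower()
    named = [group_rules for agents, group_rules in groups if ua in agents]
    if not named:  # "*" is a fallback, not something named crawlers inherit
        named = [group_rules for agents, group_rules in groups if "*" in agents]
    best = (0, True)  # (match length, allowed); no matching rule means allowed
    for group_rules in named:
        for is_allow, prefix in group_rules:
            if path.startswith(prefix):
                best = max(best, (len(prefix), is_allow))
    return best[1]


SAMPLE = """\
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
Disallow: /admin/
"""

print(can_fetch(SAMPLE, "Googlebot", "/admin/users"))     # True: its own group has no Disallow
print(can_fetch(SAMPLE, "SomeOtherBot", "/admin/users"))  # False: falls back to the * group
```

The first result is the one that surprises people: a Disallow that exists only under User-agent: * does nothing for a crawler that has its own group.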

sitemap.xml: URL discovery and freshness hints

sitemap.xml is a discovery and freshness hint. It helps search engines find the canonical URLs you care about, especially when the site has a lot of pages, a deep help library, or newly published content.

A good sitemap is:

  • Publicly reachable with HTTP 200.
  • Valid XML.
  • Focused on canonical, indexable URLs.
  • Free from redirects, 404s, login pages, duplicate filter URLs, and noindex pages.
  • Updated when important content changes.
  • Honest with lastmod. Do not set every URL to today unless every page genuinely changed today.
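Two of those checks, valid XML and honest lastmod, are easy to automate. The Python sketch below parses a urlset with the standard library and flags sitemaps where nearly every URL carries the same date. The 90% threshold and the function names are assumptions chosen for illustration, not a rule any search engine publishes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_text: str) -> list:
    """Return (loc, lastmod) pairs from a <urlset>; lastmod is None when absent."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on invalid XML
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


def lastmod_looks_stamped(entries: list, threshold: float = 0.9) -> bool:
    """Heuristic: True when nearly every URL shares one lastmod date."""
    dates = [lastmod[:10] for _, lastmod in entries if lastmod]
    if len(dates) < 5:  # too few dated URLs to judge
        return False
    _, count = Counter(dates).most_common(1)[0]
    return count / len(dates) >= threshold
```

A True result is a prompt to look at how lastmod is generated, not proof of a problem: a site that really did republish everything in one day will trip it too.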

Multiple sitemaps are fine. In fact, they often make ownership clearer:

  • sitemap.xml for core public pages.
  • sitemap-blog.xml for editorial content.
  • sitemap-help.xml for help and remediation guides.
  • sitemap-tools.xml for free tools.

The main thing is to avoid noisy overlap. If the same URL appears in three advertised sitemap files, search engines can handle it, but it makes auditing harder than it needs to be.
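Overlap is simple to detect once each sitemap has been reduced to a list of URLs. A short Python sketch, with an illustrative function name and made-up input:

```python
from collections import defaultdict


def find_overlap(sitemaps: dict) -> dict:
    """Map each URL listed in more than one sitemap to the files that list it.

    sitemaps: sitemap name -> iterable of URLs found in that file.
    """
    seen = defaultdict(set)
    for name, urls in sitemaps.items():
        for url in urls:
            seen[url].add(name)
    return {url: sorted(names) for url, names in seen.items() if len(names) > 1}


overlap = find_overlap({
    "sitemap.xml": ["https://example.com/", "https://example.com/help/reset"],
    "sitemap-help.xml": ["https://example.com/help/reset"],
})
print(overlap)
```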

Bing is worth a special mention. If freshness matters, use IndexNow as well as sitemaps. A sitemap helps crawlers discover URLs. IndexNow actively tells Bing and participating engines that a URL changed.
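For orientation, here is a Python sketch of what an IndexNow submission looks like. The endpoint and the JSON field names (host, key, keyLocation, urlList) follow the IndexNow protocol as we understand it, so check them against the current IndexNow documentation before relying on this. The key value is a placeholder, and the function only builds the request without sending it.

```python
import json
from urllib.parse import urlsplit
from urllib.request import Request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(urls, key, key_location=None) -> Request:
    """Build (but do not send) a POST announcing changed URLs for one host."""
    hosts = {urlsplit(url).netloc for url in urls}
    if len(hosts) != 1:
        raise ValueError("submit URLs for exactly one host per request")
    payload = {"host": hosts.pop(), "key": key, "urlList": list(urls)}
    if key_location:
        payload["keyLocation"] = key_location  # where the key file is hosted
    return Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_indexnow_request(
    ["https://example.com/blog/new-post"],
    key="hypothetical-key",  # placeholder, not a real key
)
print(request.get_method(), request.full_url)
```

Sending it is a single urllib.request.urlopen(request) call, which is left out here so the sketch has no side effects.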

llms.txt: AI crawler guidance (optional, emerging)

llms.txt is an emerging convention for giving AI retrieval systems a concise map of your site. Think of it as a curated reading list, not a rules engine.

Treat it as:

  • Optional, but useful if you have public docs, tools, guides, pricing, or product pages that AI systems should cite accurately.
  • A way to point agents at the best URLs instead of making them guess from navigation.
  • A maintenance commitment. A stale llms.txt is worse than no llms.txt because it gives machines a neat-looking wrong answer.

Keep it short. Start with:

# Example Company

> One sentence explaining what the site is and who it helps.

## Core pages
- [Pricing](https://example.com/pricing): Current plans and limits.
- [Features](https://example.com/features): Product capability overview.

## Help
- [Help library](https://example.com/help): Fix guides and support docs.

## Machine-readable discovery
- [robots.txt](https://example.com/robots.txt): Crawler policy.
- [sitemap.xml](https://example.com/sitemap.xml): Canonical public URLs.

If legal or privacy policy says a crawler should not use certain content, do not quietly point llms.txt at that same content. Keep the intent aligned across the files.
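That alignment can be checked mechanically. The Python sketch below pulls the Markdown links out of an llms.txt file and reports any whose path starts with a prefix you have disallowed elsewhere. It assumes the simple link-list layout shown above; the function names and the list of disallowed prefixes are illustrative.

```python
import re
from urllib.parse import urlsplit

LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def llms_links(text: str) -> list:
    """Return (section, title, url) for every Markdown link in an llms.txt file."""
    section, links = None, []
    for line in text.splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
        for title, url in LINK.findall(line):
            links.append((section, title, url))
    return links


def blocked_links(links: list, disallowed_prefixes: list) -> list:
    """URLs that llms.txt promotes but crawler policy disallows."""
    return [
        url
        for _, _, url in links
        if any((urlsplit(url).path or "/").startswith(p) for p in disallowed_prefixes)
    ]
```

The same list of links is also a ready-made input for a freshness check: any link that no longer returns HTTP 200 is a sign the file has gone stale.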

AI crawler policy: decide what you actually want

AI crawler controls are getting more granular. That is useful, but it also means old "block all AI" or "allow all AI" snippets are usually too blunt.

Separate the jobs:

  • Search and answer inclusion: crawlers such as OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, and Bingbot help content appear in search or answer products.
  • Training and model improvement: crawlers or controls such as GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, and CCBot are about broader model/data use.
  • User-triggered fetching: agents such as ChatGPT-User and Claude-User may fetch a page because a user asked for it.

If your goal is discoverability, do not accidentally block the search/answer crawlers you want to appear in. If your goal is attribution without training use, make that distinction explicitly.
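One way to keep that distinction explicit is to generate robots.txt from a policy keyed by purpose instead of hand-editing user agents. A Python sketch follows; the crawler names come from the list above, while the policy shape and the render_robots function are assumptions made for illustration.

```python
CRAWLERS = {
    "search": ["Googlebot", "Bingbot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "CCBot"],
    "user": ["ChatGPT-User", "Claude-User"],
}


def render_robots(policy: dict, private_paths: list) -> str:
    """Render one group per purpose, plus a fallback group for everyone else.

    policy maps a purpose ("search", "training", "user") to True (allow) or
    False (block); purposes missing from the policy default to allow.
    """
    private = [f"Disallow: {path}" for path in private_paths]
    blocks = []
    for purpose, agents in CRAWLERS.items():
        lines = [f"User-agent: {agent}" for agent in agents]
        if policy.get(purpose, True):
            # Private paths are repeated here because named groups do not
            # inherit rules from the * group.
            lines += ["Allow: /"] + private
        else:
            lines.append("Disallow: /")
        blocks.append("\n".join(lines))
    blocks.append("\n".join(["User-agent: *", "Allow: /"] + private))
    return "\n\n".join(blocks) + "\n"


print(render_robots({"search": True, "training": False, "user": True}, ["/admin/"]))
```

This example produces a file that keeps search and user-triggered fetching open while asking training crawlers to stay out, which is the "attribution without training use" position described above.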

Cloudflare's Content Signals give you a compact way to state post-access preferences:

Content-Signal: search=yes, ai-input=yes, ai-train=no

Or, if your policy is visibility-first:

Content-Signal: search=yes, ai-input=yes, ai-train=yes

These are preference signals, not enforcement. If you need enforcement, that belongs in CDN/WAF rules, auth, or application access control.
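If you want to audit these lines across several hosts, they are trivial to parse. A minimal Python sketch, assuming the comma-separated key=yes/no form shown above; the function name is illustrative and values other than yes or no are ignored:

```python
def parse_content_signal(line: str) -> dict:
    """Parse 'Content-Signal: search=yes, ai-train=no' into {name: bool}."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError("not a Content-Signal line")
    signals = {}
    for item in value.split(","):
        key, sep, setting = item.partition("=")
        setting = setting.strip().lower()
        if sep and setting in ("yes", "no"):
            signals[key.strip().lower()] = setting == "yes"
    return signals


print(parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no"))
```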

[Figure: crawler control stack mapping robots.txt, sitemap.xml, and llms.txt responsibilities.]

How these files work together

Use this mental model:

  1. robots.txt for crawl rules + sitemap reference
  2. sitemap.xml for canonical URL discovery and freshness hints
  3. llms.txt for curated AI-facing guidance
  4. Content Signals for use preferences after access
  5. Page-level tags for indexing, canonicalisation, and snippets

Then pair it with page-level signals:

  • Canonical tags
  • Meta robots and X-Robots-Tag
  • HTTP status codes
  • Structured data
  • Internal links
  • Clean redirects

If those disagree, crawlers have to guess. For example, a URL in the sitemap that redirects to another URL is noisy. A URL in the sitemap with noindex is worse. A public article that is allowed in robots.txt but missing from navigation, sitemap, and llms.txt is technically crawlable but easier to miss.

A 20-minute audit you can run today

Start with the files:

curl -i https://example.com/robots.txt
curl -i https://example.com/sitemap.xml
curl -i https://example.com/llms.txt

Then check the sitemap URLs:

curl -s https://example.com/sitemap.xml | grep -o '<loc>[^<]*</loc>'

Pick ten URLs from the sitemap and verify:

  • They return HTTP 200.
  • The canonical URL matches the sitemap URL.
  • They do not have noindex.
  • They are linked from somewhere useful.
  • They render meaningful content without requiring login.
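Those five checks can be written down as one function so the sample is audited the same way every time. In the Python sketch below the PageCheck fields are illustrative, and gathering the facts (status, canonical, robots directives, link count) is left to whatever crawler or script you already use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageCheck:
    """Facts gathered about one sitemap URL (field names are illustrative)."""
    sitemap_url: str
    status: int
    canonical: Optional[str] = None
    robots_directives: str = ""  # meta robots plus X-Robots-Tag, "" if absent
    linked_from: int = 0         # number of internal pages linking here
    needs_login: bool = False


def audit(page: PageCheck) -> List[str]:
    """Return the reasons this URL should not be in the sitemap as it stands."""
    issues = []
    if page.status != 200:
        issues.append(f"returns HTTP {page.status}, expected 200")
    if page.canonical and page.canonical != page.sitemap_url:
        issues.append(f"canonical points to {page.canonical}")
    if "noindex" in page.robots_directives.lower():
        issues.append("carries noindex")
    if page.linked_from == 0:
        issues.append("has no internal links")
    if page.needs_login:
        issues.append("requires login to render content")
    return issues
```

The canonical comparison is deliberately strict, so a trailing-slash mismatch is reported too; that is usually what you want in a sitemap audit.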

Now test with crawler user agents:

curl -A "Googlebot" -I https://example.com/
curl -A "bingbot" -I https://example.com/
curl -A "OAI-SearchBot" -I https://example.com/
curl -A "Claude-SearchBot" -I https://example.com/

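You are looking for accidental 403s, bot challenges, redirect loops, or stripped content. A spoofed user agent only approximates the real crawler, because some CDNs and WAFs verify bots by IP address, so a clean result here does not prove the real crawler gets through.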

If you want a quick check before editing production, use Scavo's free AI Robots.txt Validator & Generator. Paste in your current file, see which AI bots are allowed or blocked, then generate a cleaner policy if the rules are muddled.

Owner checklist

  • Review robots.txt before releases that touch routing, auth, public/private areas, or CDN bot settings.
  • Make sure every advertised sitemap returns HTTP 200 and contains only canonical, indexable URLs.
  • Update lastmod only when the content meaningfully changes.
  • Enable Bing IndexNow if you publish time-sensitive content.
  • Keep llms.txt pointed at the current pricing, features, help, policy, and tool pages you want cited.
  • Distinguish search/answer visibility from training use in AI crawler rules.
  • Keep CDN/WAF bot settings aligned with your robots.txt intent.
  • Monitor weekly so a stale discovery file does not sit there for months.

Where Scavo helps

Scavo checks robots.txt availability and crawl-risk patterns, confirms sitemap presence, detects llms.txt, and checks AI crawler/access signals so drift is caught early. The free AI Robots.txt Validator & Generator is the quick version when you only want to sanity-check crawler rules.

That turns crawler policy from "set once and forget" into something you maintain like any other production surface.

What to do next in Scavo

  1. Run a fresh scan on the main domain and any important subdomains.
  2. Open the AI visibility and SEO sections first.
  3. Check robots.txt, sitemap, llms.txt, canonical, meta robots, and AI crawler policy together.
  4. Fix the smallest contradiction first, then re-scan.
  5. Add the files to your release checklist so the next routing change does not quietly break discovery.
