Most teams know they should have robots.txt and a sitemap. More teams are now asking whether they also need llms.txt, Content Signals, or a list of AI crawler rules.
The awkward bit is that these files are often discussed together, even though they do different jobs. Mix them up and you can block discovery by accident, publish stale URLs, or give AI systems a confident but outdated map of your site.
Here is the practical model I would hand to engineering, SEO, and legal before a release.
The quick version
- robots.txt tells compliant crawlers what they may fetch.
- sitemap.xml tells search engines which canonical URLs matter and when they changed.
- llms.txt gives AI retrieval systems a curated map of your most useful public resources.
- Content Signals tell automated systems how you want content used after it has been accessed.
- None of these files is a login wall, a privacy control, or a substitute for fixing page-level metadata.
robots.txt: crawl guidance, not access control
robots.txt sits at the root of a host, for example https://example.com/robots.txt. It applies only to that host and protocol. A file on example.com does not control www.example.com, docs.example.com, or a staging domain.
Use it for:
- Blocking crawler access to account, admin, cart, search-result, or internal app paths.
- Pointing crawlers towards your sitemap files.
- Making crawler policy explicit for search and AI user agents.
- Reducing crawl waste on URLs that should never become public entry points.
Do not use it for:
- Hiding private data. Anyone can open a URL directly if the app serves it.
- Removing an already indexed page from search results. Use noindex on a crawlable page, or an X-Robots-Tag header for non-HTML files.
- Blocking CSS, JavaScript, or images that search engines need to render important pages.
The classic production failure is still this:
User-agent: *
Disallow: /
That can be fine on staging. On production, it is the sort of tiny file that can ruin a week.
If you want public discovery, a safer baseline looks more like this:
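# A named crawler follows only its own group, so the private paths are repeated here.
User-agent: Googlebot
User-agent: Bingbot
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /dashboard/
Disallow: /billing/
Disallow: /settings/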
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /dashboard/
Disallow: /billing/
Disallow: /settings/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
Sitemap: https://example.com/sitemap-help.xml
Notice the private app paths are blocked, not the whole site.
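One detail worth testing before you ship a file like this: under RFC 9309, a crawler that matches a named group follows that group only and ignores the `*` group. The sketch below is a deliberately small matcher, not a full parser. It assumes plain path prefixes (no `*` or `$` wildcards, no percent-encoding rules), and the trimmed policy inside it is a hypothetical example. Python's built-in `urllib.robotparser` applies the first matching rule rather than the longest one, which is why this sketch does the matching itself.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    agents: list = field(default_factory=list)   # lower-cased user-agent tokens
    rules: list = field(default_factory=list)    # (is_allow, path_prefix) pairs


def parse_robots(text):
    """Split robots.txt text into user-agent groups (prefix rules only)."""
    groups, current, last_was_agent = [], None, False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()       # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if not last_was_agent:                # consecutive agents share a group
                current = Group()
                groups.append(current)
            current.agents.append(value.lower())
            last_was_agent = True
        elif key in ("allow", "disallow") and current is not None:
            if value:                             # an empty Disallow blocks nothing
                current.rules.append((key == "allow", value))
            last_was_agent = False
    return groups


def can_fetch(groups, agent, path):
    """Named groups beat '*'; the longest matching prefix wins; Allow wins ties."""
    agent = agent.lower()
    matching = [g for g in groups if any(a != "*" and a in agent for a in g.agents)]
    if not matching:
        matching = [g for g in groups if "*" in g.agents]
    allowed, best_len = True, 0                   # no matching rule means allowed
    for group in matching:
        for is_allow, prefix in group.rules:
            if path.startswith(prefix):
                if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                    allowed, best_len = is_allow, len(prefix)
    return allowed


# Hypothetical trimmed policy: the named group omits /billing/ on purpose.
POLICY = """
User-agent: Googlebot
User-agent: OAI-SearchBot
Allow: /
Disallow: /admin/

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /billing/

Sitemap: https://example.com/sitemap.xml
"""

groups = parse_robots(POLICY)
for bot in ("Googlebot", "SomeOtherBot"):
    for path in ("/pricing", "/admin/users", "/billing/invoices"):
        print(bot, path, can_fetch(groups, bot, path))
```

In this sample Googlebot may still fetch /billing/invoices, because its own group never mentions that path. That is the gap to look for whenever a file mixes named groups with a catch-all.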
sitemap.xml: URL discovery and freshness hints
sitemap.xml is a discovery and freshness hint. It helps search engines find the canonical URLs you care about, especially when the site has a lot of pages, a deep help library, or newly published content.
A good sitemap is:
- Publicly reachable with HTTP 200.
- Valid XML.
- Focused on canonical, indexable URLs.
- Free from redirects, 404s, login pages, duplicate filter URLs, and noindex pages.
- Updated when important content changes.
- Honest with lastmod. Do not set every URL to today unless every page genuinely changed today.
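As a rough illustration of the "honest lastmod" point, here is a minimal sketch that builds a sitemap from a mapping of URL to the date the content last changed. The `pages` data and the `build_sitemap` name are assumptions for the example; in practice the dates would come from your CMS or version control, not from the build time.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Render a sitemap where lastmod is each page's real change date."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changed in sorted(pages.items()):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = changed.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical content dates, one per canonical URL.
pages = {
    "https://example.com/pricing": date(2024, 5, 1),
    "https://example.com/features": date(2024, 3, 12),
}
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```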
Multiple sitemaps are fine. In fact, they often make ownership clearer:
- sitemap.xml for core public pages.
- sitemap-blog.xml for editorial content.
- sitemap-help.xml for help and remediation guides.
- sitemap-tools.xml for free tools.
The main thing is to avoid noisy overlap. If the same URL appears in three advertised sitemap files, search engines can handle it, but it makes auditing harder than it needs to be.
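If you want to check for overlap rather than eyeball it, something like the following would do. It is a sketch that assumes you have already downloaded each sitemap's XML as a string; the file names and sample content are made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def locs(sitemap_xml):
    """Return every <loc> value from one sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]


def overlapping_urls(sitemaps):
    """Map each URL listed in more than one sitemap to the files that list it."""
    seen = defaultdict(list)
    for name, xml_text in sitemaps.items():
        for loc in set(locs(xml_text)):           # ignore repeats inside one file
            seen[loc].append(name)
    return {loc: sorted(names) for loc, names in seen.items() if len(names) > 1}


def _sample(*urls):
    # Helper for the demo only: wraps URLs in a minimal urlset.
    items = "".join(f"<url><loc>{u}</loc></url>" for u in urls)
    return f'<urlset xmlns="{NS["sm"]}">{items}</urlset>'


sitemaps = {
    "sitemap.xml": _sample("https://example.com/", "https://example.com/help/reset"),
    "sitemap-help.xml": _sample("https://example.com/help/reset"),
}
overlap = overlapping_urls(sitemaps)
print(overlap)
```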
Bing is worth a special mention. If freshness matters, use IndexNow as well as sitemaps. A sitemap helps crawlers discover URLs. IndexNow actively tells Bing and participating engines that a URL changed.
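For a sense of what an IndexNow submission involves, here is a minimal sketch that builds the request without sending it. It assumes the shared `api.indexnow.org` endpoint and the JSON fields from the public IndexNow documentation. The key value and the key file location are placeholders, and the key file must actually be hosted on your domain before a real submission is accepted.

```python
import json
from urllib.request import Request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, changed_urls):
    """Prepare (but do not send) a bulk IndexNow submission."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file lives at the site root under the key's name.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": sorted(set(changed_urls)),     # de-duplicate before submitting
    }
    return Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_indexnow_request(
    "example.com",
    "replace-with-your-key",                      # placeholder, not a real key
    ["https://example.com/pricing", "https://example.com/pricing"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would submit it. Only send URLs that genuinely changed, for the same reason you keep lastmod honest.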
llms.txt: AI crawler guidance (optional, emerging)
llms.txt is an emerging convention for giving AI retrieval systems a concise map of your site. Think of it as a curated reading list, not a rules engine.
Treat it as:
- Optional, but useful if you have public docs, tools, guides, pricing, or product pages that AI systems should cite accurately.
- A way to point agents at the best URLs instead of making them guess from navigation.
- A maintenance commitment. A stale llms.txt is worse than no llms.txt because it gives machines a neat-looking wrong answer.
Keep it short. Start with:
# Example Company
> One sentence explaining what the site is and who it helps.
## Core pages
- [Pricing](https://example.com/pricing): Current plans and limits.
- [Features](https://example.com/features): Product capability overview.
## Help
- [Help library](https://example.com/help): Fix guides and support docs.
## Machine-readable discovery
- [robots.txt](https://example.com/robots.txt): Crawler policy.
- [sitemap.xml](https://example.com/sitemap.xml): Canonical public URLs.
If legal or privacy policy says a crawler should not use certain content, do not quietly point llms.txt at that same content. Keep the intent aligned across the files.
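One way to keep that alignment honest is a small check that flags llms.txt links pointing under paths you block elsewhere. This is a sketch only: it assumes Markdown-style links as in the example above, and the list of blocked prefixes is something you would supply from your own robots.txt or policy.

```python
import re
from urllib.parse import urlparse

# Matches [Title](https://...) links as used in the llms.txt example.
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def links_under_blocked_paths(llms_txt, blocked_prefixes):
    """Return (title, url) pairs whose path starts with a blocked prefix."""
    flagged = []
    for title, url in LINK.findall(llms_txt):
        path = urlparse(url).path or "/"
        if any(path.startswith(prefix) for prefix in blocked_prefixes):
            flagged.append((title, url))
    return flagged


# Hypothetical file with one entry that contradicts the crawl policy.
SAMPLE = """# Example Company
## Core pages
- [Pricing](https://example.com/pricing): Current plans and limits.
- [Billing portal](https://example.com/billing/portal): Stale entry.
"""

flagged = links_under_blocked_paths(SAMPLE, ["/admin/", "/billing/"])
print(flagged)
```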
AI crawler policy: decide what you actually want
AI crawler controls are getting more granular. That is useful, but it also means old "block all AI" or "allow all AI" snippets are usually too blunt.
Separate the jobs:
- Search and answer inclusion: crawlers such as OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, and Bingbot help content appear in search or answer products.
- Training and model improvement: crawlers or controls such as GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, and CCBot are about broader model/data use.
- User-triggered fetching: agents such as ChatGPT-User and Claude-User may fetch a page because a user asked for it.
If your goal is discoverability, do not accidentally block the search/answer crawlers you want to appear in. If your goal is attribution without training use, make that distinction explicitly.
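To make the distinction explicit rather than implied, you could generate the robots.txt groups from a purpose-level policy. The sketch below uses the crawler names listed above. The grouping, the `render_groups` helper, and the sample policy are illustrative assumptions, and vendors do rename or add user agents, so check their current documentation before relying on any list.

```python
CRAWLERS = {
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot", "Bingbot"],
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "CCBot"],
    "user_fetch": ["ChatGPT-User", "Claude-User"],
}


def render_groups(policy, private_paths):
    """Emit one robots.txt group per purpose; unknown purposes default to blocked."""
    blocks = []
    for purpose, agents in CRAWLERS.items():
        lines = [f"User-agent: {agent}" for agent in agents]
        if policy.get(purpose, False):
            lines.append("Allow: /")
            # Named groups do not inherit from '*', so repeat the private paths.
            lines += [f"Disallow: {path}" for path in private_paths]
        else:
            lines.append("Disallow: /")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Example policy: visible in search and user fetches, opted out of training.
robots_groups = render_groups(
    {"search": True, "training": False, "user_fetch": True},
    ["/admin/", "/billing/"],
)
print(robots_groups)
```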
Cloudflare's Content Signals give you a compact way to state post-access preferences:
Content-Signal: search=yes, ai-input=yes, ai-train=no
Or, if your policy is visibility-first:
Content-Signal: search=yes, ai-input=yes, ai-train=yes
These are preference signals, not enforcement. If you need enforcement, that belongs in CDN/WAF rules, auth, or application access control.
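If you want to lint these lines in CI, a parser can be very small. This sketch assumes the `key=yes|no` comma-separated form shown above and nothing more; it does not try to cover every option in the Content Signals policy text.

```python
def parse_content_signal(line):
    """Turn 'Content-Signal: search=yes, ai-train=no' into a dict of booleans."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")
    signals = {}
    for item in value.split(","):
        key, sep, setting = item.partition("=")
        key, setting = key.strip().lower(), setting.strip().lower()
        if not sep or not key or setting not in ("yes", "no"):
            raise ValueError(f"unexpected signal: {item.strip()!r}")
        signals[key] = setting == "yes"
    return signals


signals = parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no")
print(signals)
```

A check like this pairs well with the crawler rules: if ai-train is "no" but the training crawlers are allowed in robots.txt, the two files are telling different stories.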
How these files work together
Use this mental model:
- robots.txt for crawl rules + sitemap reference
- sitemap.xml for canonical URL discovery and freshness hints
- llms.txt for curated AI-facing guidance
- Content Signals for use preferences after access
- Page-level tags for indexing, canonicalisation, and snippets
Then pair it with page-level signals:
- Canonical tags
- Meta robots and X-Robots-Tag
- HTTP status codes
- Structured data
- Internal links
- Clean redirects
If those disagree, crawlers have to guess. For example, a URL in the sitemap that redirects to another URL is noisy. A URL in the sitemap with noindex is worse. A public article that is allowed in robots.txt but missing from navigation, sitemap, and llms.txt is technically crawlable but easier to miss.
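The disagreements described here are easy to encode once you have the facts for a URL in one place. The following is a sketch with a hypothetical `PageFacts` record; how you collect those facts (crawler, logs, a scanning tool) is up to you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageFacts:
    url: str
    status: int                 # HTTP status when fetched directly
    canonical: Optional[str]    # value of rel=canonical, if any
    noindex: bool               # meta robots or X-Robots-Tag says noindex
    in_sitemap: bool
    robots_allowed: bool        # robots.txt permits crawling this URL


def find_conflicts(page):
    """List the ways a page's discovery signals contradict each other."""
    problems = []
    if page.in_sitemap and page.status != 200:
        problems.append(f"sitemap lists a URL that returns HTTP {page.status}")
    if page.in_sitemap and page.noindex:
        problems.append("sitemap lists a noindex page")
    if page.in_sitemap and page.canonical not in (None, page.url):
        problems.append(f"sitemap URL is not canonical (points to {page.canonical})")
    if page.in_sitemap and not page.robots_allowed:
        problems.append("sitemap lists a URL that robots.txt blocks")
    if page.noindex and not page.robots_allowed:
        problems.append("noindex cannot be read because robots.txt blocks the crawl")
    return problems


redirected = PageFacts("https://example.com/old-guide", 301, None, False, True, True)
clean = PageFacts("https://example.com/pricing", 200, "https://example.com/pricing", False, True, True)
print(find_conflicts(redirected))
print(find_conflicts(clean))
```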
A 20-minute audit you can run today
Start with the files:
curl -i https://example.com/robots.txt
curl -i https://example.com/sitemap.xml
curl -i https://example.com/llms.txt
Then check the sitemap URLs:
curl -s https://example.com/sitemap.xml | grep -o '<loc>[^<]*</loc>'
Pick ten URLs from the sitemap and verify:
- They return HTTP 200.
- The canonical URL matches the sitemap URL.
- They do not have noindex.
- They are linked from somewhere useful.
- They render meaningful content without requiring login.
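The canonical and noindex checks in that list can be scripted against the HTML you have already fetched. This is a minimal sketch using only the standard library. It reads `<link rel="canonical">` and `<meta name="robots">` and does not look at the X-Robots-Tag response header, which you would need to check separately.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect the canonical URL and any robots noindex directive from HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href")
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def check_page(sitemap_url, html):
    """Return a list of problems for one sitemap URL given its HTML."""
    parser = HeadSignals()
    parser.feed(html)
    problems = []
    if parser.noindex:
        problems.append("page is noindex")
    if parser.canonical and parser.canonical != sitemap_url:
        problems.append(f"canonical points to {parser.canonical}")
    return problems


# Hypothetical page whose canonical disagrees with the sitemap entry.
SAMPLE_HTML = """<html><head>
<link rel="canonical" href="https://example.com/help/reset-password">
<meta name="robots" content="noindex, follow">
</head><body>...</body></html>"""

problems = check_page("https://example.com/help/reset", SAMPLE_HTML)
print(problems)
```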
Now test with crawler user agents:
curl -A "Googlebot" -I https://example.com/
curl -A "bingbot" -I https://example.com/
curl -A "OAI-SearchBot" -I https://example.com/
curl -A "Claude-SearchBot" -I https://example.com/
You are looking for accidental 403s, bot challenges, redirect loops, or stripped content.
If you want a quick check before editing production, use Scavo's free AI Robots.txt Validator & Generator. Paste in your current file, see which AI bots are allowed or blocked, then generate a cleaner policy if the rules are muddled.
Owner checklist
- Review robots.txt before releases that touch routing, auth, public/private areas, or CDN bot settings.
- Make sure every advertised sitemap returns HTTP 200 and contains only canonical, indexable URLs.
- Update lastmod only when the content meaningfully changes.
- Enable Bing IndexNow if you publish time-sensitive content.
- Keep llms.txt pointed at the current pricing, features, help, policy, and tool pages you want cited.
- Distinguish search/answer visibility from training use in AI crawler rules.
- Keep CDN/WAF bot settings aligned with your robots.txt intent.
- Monitor weekly so a stale discovery file does not sit there for months.
Where Scavo helps
Scavo checks robots.txt availability and crawl-risk patterns, confirms sitemap presence, detects llms.txt, and checks AI crawler/access signals so drift is caught early. The free AI Robots.txt Validator & Generator is the quick version when you only want to sanity-check crawler rules.
That turns crawler policy from "set once and forget" into something you maintain like any other production surface.
What to do next in Scavo
- Run a fresh scan on the main domain and any important subdomains.
- Open the AI visibility and SEO sections first.
- Check robots.txt, sitemap, llms.txt, canonical, meta robots, and AI crawler policy together.
- Fix the smallest contradiction first, then re-scan.
- Add the files to your release checklist so the next routing change does not quietly break discovery.