Most teams know they should have robots.txt and a sitemap. Far fewer teams can explain what each file actually controls once a site is live.
Here is the practical model you can hand to engineering, SEO, and legal without confusion.
robots.txt: crawl guidance, not access control
robots.txt tells crawlers where they should and should not crawl. Compliance is voluntary: well-behaved bots honor it, but nothing enforces it, so it is not a security control.
Key constraints:
- It does not block direct browser access.
- It should never be treated as a way to hide sensitive URLs.
- It can reduce crawl waste when configured correctly.
Common production failure:
`User-agent: *` with `Disallow: /` accidentally shipped to production, which tells every compliant crawler to skip the entire site.
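For contrast, a minimal safe baseline might look like this (the paths and domain are placeholders, not recommendations for any specific site):

```text
# Allow everything by default; carve out crawl-waste paths explicitly.
User-agent: *
Disallow: /search
Disallow: /cart

# Advertise the sitemap so crawlers can discover canonical URLs.
Sitemap: https://your-domain.com/sitemap.xml
```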
sitemap.xml: URL discovery and freshness hints
sitemap.xml helps search engines discover canonical URLs, especially on larger sites.
A good sitemap is:
- Publicly reachable (HTTP 200)
- UTF-8 XML
- Focused on canonical, indexable URLs
- Updated when key content changes
A bad sitemap is usually stale and full of redirects, noindex URLs, or non-canonical paths.
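As a sketch, a minimal valid sitemap.xml with a single canonical URL looks like this (the URL and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://your-domain.com/pricing</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```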
llms.txt: AI crawler guidance (optional, emerging)
llms.txt is an emerging convention for communicating AI crawler guidance and key content entry points.
Treat it as:
- Optional but increasingly useful for AI discoverability workflows
- A policy/context layer, not a security boundary
If you publish llms.txt, keep it aligned with legal/privacy decisions and your robots policy.
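Per the llms.txt proposal, the file itself is plain Markdown: an H1 title, an optional blockquote summary, then sections linking to key content. A hypothetical sketch (names and URLs are illustrative):

```markdown
# Example Co

> Example Co builds scanning tools. Start with the docs below.

## Docs

- [Getting started](https://your-domain.com/docs/start): setup guide
- [API reference](https://your-domain.com/docs/api): endpoints and auth
```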
How these files work together
Use this baseline:
- `robots.txt` for crawl rules plus a sitemap reference
- `sitemap.xml` for canonical URL discovery
- `llms.txt` for AI-facing guidance (if it fits your policy)
And pair this with page-level signals:
- Canonical tags
- Meta robots
- Proper status codes
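A sketch of those page-level signals in markup (the URL is a placeholder, and the two tags belong on different pages):

```html
<!-- On the canonical page: point duplicate/parameter URLs at the preferred one -->
<link rel="canonical" href="https://your-domain.com/pricing" />

<!-- On a page that should stay out of the index (e.g. internal search results) -->
<meta name="robots" content="noindex, follow" />
```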
Owner checklist
- [ ] `robots.txt` is reviewed before every production release touching routing or IA.
- [ ] `sitemap.xml` generation is automated (not manual uploads).
- [ ] `llms.txt` policy is reviewed with legal/privacy owners before publication.
- [ ] Weekly scan validates availability and syntax behavior.
Fast validation commands
```shell
curl -i https://your-domain.com/robots.txt
curl -i https://your-domain.com/sitemap.xml
curl -i https://your-domain.com/llms.txt
```
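The same checks can run in CI. As one sketch, the blanket-Disallow failure described earlier can be caught before deployment with a small parser; `has_blanket_disallow` is a hypothetical helper, and its group handling is deliberately simplified (it tracks only the most recent `User-agent` line):

```python
def has_blanket_disallow(robots_txt: str) -> bool:
    """Detect 'Disallow: /' inside a 'User-agent: *' group.

    Simplified sketch: tracks only the most recent User-agent line,
    which is enough to catch the common shipped-by-accident case.
    """
    applies_to_all = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            applies_to_all = (value == "*")
        elif field == "disallow" and applies_to_all and value == "/":
            return True
    return False
```

Feed it the body returned by the robots.txt fetch and fail the release pipeline when it returns `True`.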
Where Scavo helps
Scavo checks robots.txt availability and crawl-risk patterns, confirms sitemap presence, and detects llms.txt so drift is caught early.
That turns crawler policy from "set once and forget" into something you maintain like any other production surface.
Sources
- Google: Introduction to robots.txt
- Google: robots.txt reference
- Google: Build and submit a sitemap
- llms.txt
What to do next in Scavo
- Run a fresh scan on your main domain.
- Open the matching help guide in `/help`, assign an owner, and ship the smallest safe fix.
- Re-scan after deployment and confirm the trend is moving in the right direction.