Before You Fix It: What This Check Means
Robots.txt is the first crawler policy file many bots check before broad crawling. In plain terms, this checks whether the site-wide crawler rules are present and not obviously incomplete. Scavo evaluates response status plus policy content from `/robots.txt`.
Why this matters in practice: robots.txt often changes unintentionally during deploys (auth middleware, edge rewrites, staging policy leaking into production), and that drift causes hard-to-debug crawl regressions across environments.
How to use this result: treat it as directional evidence, not final truth. It reflects what was observable at scan time and should be verified in your own production context. First, confirm the issue in live production output with browser or network tools. Then ship one controlled change: serve robots.txt from a stable edge/origin path (no app HTML fallback). Finally, re-scan the same URL to confirm the result improves.
Background
TL;DR: Your robots.txt file is missing or misconfigured, which can either block search engines from important pages or waste crawl budget.
A missing robots.txt means search engines crawl everything — including admin pages, staging content, and duplicate URL parameters. A misconfigured one can accidentally block your entire site from indexing. Nearly 38% of indexed websites now include AI-specific restrictions in robots.txt (EngageCoders, 2024), making it also your primary control point for AI crawler access.
What Scavo checks (plain English)
Scavo evaluates response status plus policy content from /robots.txt.
Current logic includes:
- Info: robots.txt was not checked in this run.
- Fail: server error (5xx) or wildcard sitewide block (`User-agent: *` + `Disallow: /`).
- Warning: 401/403, 404, unexpected status, missing directives, HTML response, missing groups, or missing valid sitemap directive.
- Pass: accessible, parseable file with a valid sitemap directive and no wildcard root block.
Scavo also parses group count, sitemap directives, and wildcard root policy state.
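The parsed fields above can be sketched roughly as follows. This is a minimal illustration, not Scavo's actual implementation; only the field names (`group_count`, `valid_sitemap_count`, `wildcard_root_policy`) mirror the report details.

```python
# Minimal sketch of robots.txt policy parsing (illustrative only,
# not Scavo's actual implementation).

def parse_robots(text):
    group_count = 0
    sitemaps = []
    wildcard_root_block = False
    current_agents = []        # user-agents of the group being read
    in_group_header = False    # True while reading consecutive User-agent lines

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()

        if field == "user-agent":
            if not in_group_header:          # a new group starts here
                group_count += 1
                current_agents = []
                in_group_header = True
            current_agents.append(value)
        else:
            in_group_header = False
            if field == "sitemap" and value.startswith(("http://", "https://")):
                sitemaps.append(value)
            elif field == "disallow" and value == "/" and "*" in current_agents:
                wildcard_root_block = True

    return {
        "group_count": group_count,
        "valid_sitemap_count": len(sitemaps),
        "wildcard_root_policy": "blocked" if wildcard_root_block else "open",
    }
```

Consecutive `User-agent` lines are treated as one group, matching how the Robots Exclusion Protocol groups rules.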
How Scavo scores this check
Scavo assigns one result state for this check on the tested page:
- Pass: baseline signals for this check were found.
- Warning: partial coverage or risk signals were found and should be reviewed.
- Fail: required signals were missing or risky behavior was confirmed.
- Info: Scavo could not gather enough reliable evidence on this run to score pass/fail confidently.
In your scan report, this appears under What failed / What needs attention / What is working for robots_txt, followed by Recommended next steps and Technical evidence (for developers) when needed.
- Scan key: `robots_txt`
- Category: TECHNICAL
Why fixing this matters
Robots policy is a high-leverage operational file. Small mistakes can accidentally suppress discovery, break diagnostics, or confuse crawler behavior.
Many incidents come from non-intentional changes: auth middleware, edge rewrites, or staging policy leakage.
If you are not technical
- Open `https://your-domain/robots.txt` and confirm it shows readable plain-text directives.
- Confirm it is publicly reachable without login.
- Confirm sitemap line exists and points to a valid URL.
- Re-run Scavo and verify status improves.
Technical handoff message
Copy and share this with your developer.
Scavo flagged Robots.txt (robots_txt). Please restore /robots.txt to HTTP 200 plain text with valid User-agent groups, avoid an unintended wildcard Disallow: /, and include valid Sitemap: directive(s). Share the live output and re-run the scan.
If you are technical
- Serve robots from stable edge/origin path (no app HTML fallback).
- Keep directives minimal and explicit.
- Avoid global block rules unless they are intentional and confined to controlled environments.
- Validate sitemap URLs and keep them canonical.
- Add CI/deploy checks that parse and lint robots before release.
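A CI lint along the lines of the last bullet can be sketched like this. It is a minimal example with hypothetical rules; adapt the checks to your own policy before relying on it.

```python
#!/usr/bin/env python3
# Sketch of a pre-release robots.txt lint for CI (illustrative only;
# adapt the rules to your own policy). Exits non-zero so the build fails.
import sys

def lint_robots(text):
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    lower = [line.split("#", 1)[0].strip().lower() for line in text.splitlines()]
    lower = [l for l in lower if l]

    if text.lstrip().startswith("<"):
        problems.append("looks like HTML, not plain-text directives")
    if not any(l.startswith("user-agent:") for l in lower):
        problems.append("no User-agent group found")
    if not any(l.startswith("sitemap:") and "://" in l for l in lower):
        problems.append("no absolute Sitemap: directive found")
    # Crude check: wildcard group immediately followed by a root block.
    for i, l in enumerate(lower[:-1]):
        if (l.replace(" ", "") == "user-agent:*"
                and lower[i + 1].replace(" ", "") == "disallow:/"):
            problems.append("wildcard group blocks the site root (Disallow: /)")
    return problems

if __name__ == "__main__" and len(sys.argv) > 1:
    issues = lint_robots(open(sys.argv[1], encoding="utf-8").read())
    for issue in issues:
        print(f"robots-lint: {issue}")
    sys.exit(1 if issues else 0)
```

Run it as a release gate, e.g. `python robots_lint.py public/robots.txt` in the deploy pipeline, so a bad file blocks the release instead of reaching production.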
How to verify
- `curl -i https://your-domain/robots.txt` returns `200` and plain text.
- Confirm the parser can read at least one user-agent group.
- Confirm wildcard root policy is not blocked unless intentional.
- Confirm sitemap directives are valid absolute URLs.
- Re-run Scavo and inspect details (`group_count`, `valid_sitemap_count`, `wildcard_root_policy`).
What this scan cannot confirm
- It does not guarantee indexing outcomes for individual pages.
- It does not evaluate every non-standard crawler directive interpretation.
- It cannot infer business intent from policy text alone.
Owner checklist
- [ ] Assign one owner for robots policy.
- [ ] Keep robots file version-controlled.
- [ ] Add release validation for status/content/sitemap lines.
- [ ] Recheck after CDN/auth/routing changes.
FAQ
Is robots.txt required for indexing?
Not strictly required, but it is strongly recommended for crawl governance and operational clarity.
Why is wildcard Disallow: / a fail?
Because it blocks root crawling for general bots and is usually a production-severity misconfiguration.
Can robots.txt be behind authentication?
For public sites, that usually defeats its purpose because crawlers cannot read policy.
Why check sitemap in robots if sitemap is separate?
Because robots and sitemap directives work together in crawler discovery workflows.
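For example, a minimal robots.txt that pairs a wildcard group with sitemap discovery might look like this (the domain and paths are placeholders, not a recommended policy):

```text
# Minimal example (placeholder domain and paths)
User-agent: *
Disallow: /admin/

Sitemap: https://your-domain/sitemap.xml
```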
Sources
- Google Search Central: robots.txt intro
- Google Search Central: robots.txt reference
- RFC 9309: Robots Exclusion Protocol
Need a production-safe robots baseline for app, docs, and private route patterns? Send support your path map.