AI Bot Policy Not Set in robots.txt

GPTBot is blocked by 5.89% of all websites, with 35.7% of the top 1,000 sites blocking it (Ahrefs, 2024). Nearly 38% of indexed sites now have AI-specific restrictions, up from 8% in 2023 (EngageCoders). If you don't set explicit policy, you can't control whether your content appears in AI products or training data. A deliberate policy — whether allowing or blocking — is better than leaving it undefined.

Before You Fix It: What This Check Means

The AI crawler policy check verifies that major AI user agents receive intentional, explicit directives. In plain terms, it tells you whether AI crawlers and answer systems get a clear, deliberate signal about access to your content. Scavo reads `robots.txt` status/content and evaluates root-path policy for key AI crawlers.

Why this matters in practice: unclear machine-facing signals can reduce retrieval quality and citation consistency.

How to use this result: treat it as directional evidence, not final truth. Bot access outcomes can vary with edge controls, geo policies, and temporary WAF behavior. First, confirm the issue in live output: verify bot-facing output and policy files on the final URL. Then ship one controlled change: normalize robots groups and remove contradictory root rules. Finally, re-scan the same URL to confirm the result improves.

TL;DR: Your robots.txt doesn't specify rules for AI crawlers like GPTBot or ClaudeBot, leaving your AI visibility to chance.

What Scavo checks (plain English)

Scavo reads robots.txt status/content and evaluates root-path policy for key AI crawlers:

  • GPTBot
  • ChatGPT-User
  • OAI-SearchBot
  • ClaudeBot
  • anthropic-ai
  • PerplexityBot
  • Google-Extended
  • CCBot

Scavo classifies each agent's policy by how it is set (explicit rule or inherited wildcard) and by state (allowed, blocked, mixed, or unspecified).
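
For illustration, here is a hypothetical robots.txt that takes an explicit stance on each managed agent. The allow/block choices below are placeholders, not recommendations — the point is that every decision is deliberate:

```
# Explicit per-bot policy — every managed agent gets a deliberate directive
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: CCBot
Disallow: /

# Default for all agents not listed above
User-agent: *
Allow: /
```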

How Scavo scores this check

Result behavior:

  • Warning: robots.txt missing/empty (404 or blank)
  • Fail: wildcard root block with no explicit per-bot exceptions
  • Info: robots policy unavailable in scan
  • Info: default/wildcard policy is clean but no explicit bot rules
  • Warning: one or more agents blocked or mixed/conflicting
  • Pass: explicit, consistent per-bot policy with no critical conflicts
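
The ladder above can be sketched in a few lines. This is our own simplification for illustration — not Scavo's actual implementation — and it omits the Fail case (wildcard root block) for brevity:

```python
# Hypothetical simplification of the result ladder — not Scavo's real code.
def score(states: dict, robots_present: bool) -> str:
    """states maps each AI agent to 'allowed', 'blocked', 'mixed',
    or 'unspecified' (covered only by the wildcard group)."""
    if not robots_present:
        return "warning"  # robots.txt missing or empty
    if any(s in ("blocked", "mixed") for s in states.values()):
        return "warning"  # at least one agent blocked or conflicting
    if all(s == "unspecified" for s in states.values()):
        return "info"     # clean wildcard default, no explicit bot rules
    return "pass"         # explicit, consistent per-bot policy
```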

In your scan report, this appears under What failed / What needs attention / What is working for ai_crawler_policy, followed by Recommended next steps and Technical evidence (for developers) when needed.

  • Scan key: ai_crawler_policy
  • Category: AI_VISIBILITY

Why fixing this matters

Policy clarity matters more than hype. Teams need to know whether they are intentionally visible, intentionally restricted, or unintentionally drifting due to inherited wildcard rules.

Without explicit policy, legal, content, and engineering can make conflicting assumptions about AI usage rights and discoverability.

Common reasons this check flags

  • User-agent: * with broad Disallow: / and no AI exceptions.
  • Duplicate groups produce mixed allow/disallow outcomes.
  • Robots exists but does not mention any modern AI agents explicitly.
  • Production robots differs from documented policy in legal/content docs.
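
As an example of the duplicate-group problem, consider this hypothetical fragment. The second GPTBot group contradicts the first, and different parsers may merge the groups or pick one of them, producing mixed outcomes:

```
User-agent: GPTBot
Disallow: /

# ...added later by another team, far down the same file...
User-agent: GPTBot
Allow: /
```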

If you are not technical

  1. Decide business stance for each major crawler class (allow, block, conditional).
  2. Ensure legal/comms language matches technical robots policy.
  3. Ask engineering for one plain-language matrix by bot.
  4. Re-run Scavo and check blocked/mixed counts.

Technical handoff message

Copy and share this with your developer.

Scavo flagged AI Crawler Policy (ai_crawler_policy). Please clean up robots.txt so each target AI bot has clear root-path intent (allow/block), remove mixed directives, and document policy decisions for legal/content stakeholders.

If you are technical

  1. Normalize robots groups and remove contradictory root rules.
  2. Add explicit directives for bots you intentionally manage.
  3. Avoid relying on ambiguous inherited wildcard behavior for critical decisions.
  4. Keep one source-of-truth robots file under version control.
  5. Reconcile legal policy text with actual robots directives.

How to verify

  • Fetch live robots.txt from production.
  • Parse each target bot and confirm single clear root policy.
  • Confirm wildcard behavior does not accidentally override intent.
  • Re-run Scavo and verify policy score + blocked/mixed counts improve.
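
The parsing steps above can be sketched with Python's standard-library robots parser. This is a minimal illustration, assuming you fetch the text of your live production robots.txt yourself (e.g. with urllib.request) and pass it in; the agent list mirrors the one Scavo checks:

```python
# Minimal root-path policy check for major AI user agents.
# Feed it the text of your live production robots.txt.
import urllib.robotparser

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "anthropic-ai", "PerplexityBot", "Google-Extended", "CCBot",
]

def root_policy(robots_txt: str) -> dict:
    """Return {agent: allowed-at-root} as the stdlib parser reads the file."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, "/") for bot in AI_BOTS}
```

On a file whose wildcard group disallows everything but explicitly allows GPTBot, this reports GPTBot as allowed and every other agent as blocked — exactly the kind of asymmetry you want to confirm is intentional.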

What this scan cannot confirm

  • It does not enforce contractual/legal licensing terms by itself.
  • It does not guarantee third-party compliance beyond published robots norms.
  • It does not test all non-standard/private crawler identities.

Owner checklist

  • [ ] Assign owner for AI crawler stance and robots implementation.
  • [ ] Keep a reviewed bot policy matrix (intent + technical directive).
  • [ ] Version control robots updates with approval trail.
  • [ ] Audit robots policy after CDN/security migrations.

FAQ

Is blocking all AI crawlers always wrong?

No. It can be intentional. The issue is unintentional or undocumented blocking.

Why does missing robots.txt return warning instead of fail?

Because a missing file means the policy is undefined rather than explicitly contradictory. It still increases control risk, which is why it is flagged at all.

Why include Google-Extended separately?

Because Google-Extended controls AI-related usage of your content separately from Googlebot's standard crawl/index behavior, it deserves its own explicit directive.

Should we list every bot on earth?

No. Prioritize major bots relevant to your business and review regularly.

Need a bot-policy matrix draft (intent + robots syntax) your team can approve quickly? Send support your preferred allow/block stance per crawler.

More checks in this area

ai_bot_access_parity

AI Crawlers Blocked More Restrictively Than Search Engines

ClaudeBot saw the highest growth in block rates — increasing 32.67% year-over-year (EngageCoders, 2024). If you block AI crawlers while allowing Googlebot, you're letting Google use your content in its AI products (Gemini, AI Overviews) while excluding others. Consider whether this asymmetry aligns with your content strategy, or whether parity across all bots better serves your interests.

Open guide

ai_chunkability

Content Not Structured for AI Processing

44.2% of AI citations come from the first 30% of content (Profound), so front-loading key facts matters. AI models work better with structured, chunked content — clear headers, concise paragraphs, fact boxes, and attributed claims. Walls of unstructured text force AI to guess at relevance, reducing your chances of being cited or recommended in AI-generated responses.

Open guide

ai_citation_readiness

Content Not Structured for AI Citation

44.2% of all LLM citations come from the first 30% of text, with content depth and readability being the most important factors for citation (Profound). AI-driven referral traffic increased more than tenfold from July 2024 to February 2025, with 87.4% coming from ChatGPT (Adobe). To be cited, your content needs clear, fact-based claims with attribution — not just narrative prose.

Open guide