Infrastructure and SEO

Cloudflare blocks AI crawlers by default: how to fix it without breaking security

Since 1 July 2025, every new Cloudflare domain blocks GPTBot, ClaudeBot, PerplexityBot and friends. Here is how to selectively allow the ones that cite you.

May 2, 2026 · 11 min read

Cloudflare's default "Block AI bots" rule is a managed WAF rule that returns a hard block to known AI user agents (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, and the rest of the list) on every new zone created after 1 July 2025. The rule was introduced as part of Cloudflare's Content Independence Day announcement and applies to roughly 20% of the public web that sits behind Cloudflare's network.

The side effect is the part most teams miss. "Block AI bots" does not distinguish between training crawlers and retrieval crawlers. It blocks both. If OAI-SearchBot cannot reach your pages, ChatGPT Search stops citing you. If PerplexityBot is blocked, Perplexity stops citing you. If Google-Extended is blocked, Google's AI Overviews lose your content as a grounding source. The default that protects your training data also evicts you from the AI answer surface you are now investing in.

This article walks through what gets blocked, how to verify your zone's current state, and how to selectively allow retrieval bots through Cloudflare without dismantling your security posture. We assume you are running on a recent Cloudflare zone with the standard WAF tier (free or paid; the steps differ slightly for Bot Management customers).

What you'll have at the end

A Cloudflare configuration where retrieval crawlers (OAI-SearchBot, PerplexityBot, Google-Extended, ClaudeBot for citations) reach your pages, training crawlers (GPTBot if you do not want OpenAI training on your content, CCBot, Bytespider) stay blocked, and your bot-management posture against unverified scrapers stays intact. Plus visibility into what is actually hitting your site, so the configuration is data-driven rather than vibes.

What Cloudflare blocks today

The managed "Block AI bots" rule covers a published list of known AI user agents. Per Cloudflare's own AI Crawl Control documentation, the rule targets at least: Amazonbot, Applebot, Bytespider (ByteDance), CCBot (Common Crawl), ClaudeBot (Anthropic), DuckAssistBot, Google-CloudVertexBot, GoogleOther, GPTBot (OpenAI), Meta-ExternalAgent, OAI-SearchBot, PerplexityBot, PetalBot (Huawei), TikTokSpider, plus unverified bots that match similar fingerprints.

The list mixes three behaviourally different bot types into a single block action.

  • Training crawlers. GPTBot, CCBot, Bytespider, Meta-ExternalAgent. Their job is to scrape content into training corpora. Blocking them is usually the explicit intent.
  • Retrieval crawlers. OAI-SearchBot, PerplexityBot, ClaudeBot (when used for retrieval), Google-Extended, Applebot. Their job is to fetch pages on demand to ground a real-time AI answer with citations. Blocking them removes you from the answer.
  • User-driven agents. ChatGPT-User, Perplexity-User, Claude-User. These fire when a human pastes a URL into a chat, or when an AI agent acts on behalf of an end user. According to Cloudflare's December 2025 report, this category grew 15x year-over-year. Blocking them tells your visitors' AI assistants that your site refuses to be summarised.

The distinction matters. A SaaS marketing site that wants to be cited by ChatGPT Search and Perplexity but does not want OpenAI training on its content has to peel apart a bundle that Cloudflare ships as one switch.

Step 1: verify your zone's current state

Before changing anything, see what is actually being blocked.

Open the Cloudflare dashboard, pick the zone, and navigate to Security → Bots. Look for "Block AI bots" in the configuration panel. If the toggle is on (default for zones created after 1 July 2025), every bot in the list above gets a hard block at the edge.

Then go to AI Crawl Control in the left navigation. This was rebranded from "AI Audit" in early 2026 and is now generally available across all plans. The Crawlers tab shows a real-time table of which AI services have requested your content in the last 24 hours, broken down by user agent, robots.txt compliance, and which sections of your site they hit. The data answers a question most teams have never asked: who is actually trying to read me, and would I miss them if I stayed blocked?

If the table is empty, it does not mean nothing is happening. It means the Block AI bots rule shut the door before any crawler made it past the edge. Switch the rule to a 24-hour observation window first (set it to "Allow" temporarily, or move it below a custom rule that logs instead of blocks), then come back and read the table.
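
If you prefer to check the toggle from the command line, the zone's bot settings are readable through Cloudflare's API. A minimal sketch, assuming a token scoped to read the zone's Bot Management settings; the ai_bots_protection field name reflects recent API versions and is worth confirming against the current API docs for your plan:

curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/bot_management" \
  | jq '.result.ai_bots_protection'

A response of "block" means the managed rule is active on the zone.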

Step 2: pick the bots you want to allow

The decision breaks down along the three bot categories from above.

For citations and AI search visibility, allow at minimum:

  • OAI-SearchBot (ChatGPT Search retrieval)
  • PerplexityBot (Perplexity citations)
  • Google-Extended (Google AI Overviews and Gemini grounding)
  • ClaudeBot if you want Claude.ai's web feature to cite you
  • Applebot for Apple Intelligence

For training, decide on a per-vendor basis. If you publish reference content you want to be cited but not memorised wholesale (a typical SaaS playbook), keep the training crawlers blocked: GPTBot, CCBot, Bytespider, Meta-ExternalAgent, Google-CloudVertexBot. If you publish content you actively want in foundation models, allow them.

For user-driven agents, allow. ChatGPT-User and Perplexity-User fire when a human pastes your URL or when an AI agent fetches the page to answer a specific user prompt. Blocking them is functionally equivalent to blocking visits from logged-in users of those products.

Step 3: write a custom WAF rule that overrides the managed rule

Cloudflare's WAF evaluates rules in a defined order. Per the WAF custom rules documentation, custom rules are evaluated before managed rules. So a custom "Skip" rule on a specific user agent runs first, the request bypasses the Block AI bots managed rule, and the bot reaches your origin.

In the Cloudflare dashboard, go to Security → WAF → Custom rules and create a rule with the following expression (paste this in the rule editor's "Edit expression" field):

(http.user_agent contains "OAI-SearchBot") or (http.user_agent contains "PerplexityBot") or (http.user_agent contains "ChatGPT-User") or (http.user_agent contains "Perplexity-User") or (http.user_agent contains "Google-Extended") or (http.user_agent contains "Applebot") or (http.user_agent contains "ClaudeBot")

Set the action to Skip, and in the Skip configuration check the boxes for WAF Managed Rules, Bot Fight Mode, and Block AI bots. Save and deploy.

What this does: when a request matches one of these user agents, Cloudflare bypasses the listed protections and the request continues to your origin. Other custom rules, IP-level rate limiting, and security features outside the skipped list still apply, so a malicious actor cannot just spoof User-Agent: PerplexityBot and bypass everything.
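
To confirm the rule actually deployed with the action and expression you intended, you can read back the zone's custom-rules entrypoint ruleset through the API. A sketch, assuming a token with zone WAF read access; the jq projection is just for readability:

curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/phases/http_request_firewall_custom/entrypoint" \
  | jq '.result.rules[] | {description, action, expression}'

Your Skip rule should show "action": "skip" with the exact expression you pasted.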

Step 4: verify with cryptographic identity, not user-agent strings

A user-agent header is a string the client controls. Anyone can set it to OAI-SearchBot and pretend to be ChatGPT Search. The shared fix is Web Bot Auth, an IETF draft that Cloudflare implements at the edge and that a growing number of AI vendors have adopted.

Per Cloudflare's Web Bot Auth rollout, well-behaved bots now sign their HTTP requests with an Ed25519 key. The request carries Signature-Agent, Signature-Input, and Signature headers. Cloudflare validates the signature against the bot's published key directory and only marks the request as a verified AI crawler when the signature is valid.
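
For orientation, a signed request carries headers shaped roughly like this (values truncated and purely illustrative; the exact serialisation follows the HTTP Message Signatures spec, RFC 9421, that the draft builds on):

Signature-Agent: "https://crawler.example-ai-vendor.com"
Signature-Input: sig1=("@authority" "signature-agent");created=1767312000;expires=1767312600;keyid="poqkLGiy...";tag="web-bot-auth"
Signature: sig1=:TBBKZZNAkQ...base64...:

The key directory that Signature-Agent points at is where Cloudflare fetches the Ed25519 public key to validate the signature against.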

For your custom rule, that translates into a more defensible expression: the user agent combined with Cloudflare's verified-bot category, which is only set when a valid signature is present. The exact field names depend on whether you are on Bot Management (paid) or the free tier; on Bot Management you get cf.verified_bot_category, and the cleaner rule becomes:

(cf.verified_bot_category in {"AI Crawler" "Search Engine Crawler"}) and (http.user_agent contains "OAI-SearchBot" or http.user_agent contains "PerplexityBot" or http.user_agent contains "Google-Extended" or http.user_agent contains "ClaudeBot" or http.user_agent contains "Applebot")

This rule only allows the listed user agents when Cloudflare has cryptographically verified them. A spoofed request that sets the right user-agent string but cannot produce a valid signature is treated as an unverified bot and falls back to your default protections.

Step 5: keep training crawlers out via robots.txt as well

If you decided to keep GPTBot, CCBot, Bytespider, and friends blocked, do it in two layers, not one. Cloudflare's edge block is the hard line. Your robots.txt is the polite line, and most retrieval crawlers honour it (Cloudflare's own data shows ChatGPT-User stops fetching when robots.txt disallows it; Perplexity, per Cloudflare's August 2025 incident report, did not).

A defensible public/robots.txt for a SaaS site that wants citations but no training:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Google-CloudVertexBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot
Allow: /

User-agent: *
Allow: /

The dual policy (Cloudflare WAF skip rule + robots.txt allow) is what makes the system defensible. The WAF rule physically lets the request through. The robots.txt directive tells the well-behaved crawler it is welcome. Both layers communicate the same intent in different protocols.
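
One sanity check worth adding here: confirm the allowed crawlers can fetch robots.txt itself, because a crawler that gets a 403 on that file never learns the policy you just wrote. A small shell sketch (user-agent layer only, same caveat as the synthetic check below):

for ua in OAI-SearchBot PerplexityBot ChatGPT-User Perplexity-User Google-Extended ClaudeBot Applebot; do
  printf '%-18s ' "$ua"
  curl -s -o /dev/null -w '%{http_code}\n' -A "$ua" https://yourdomain.com/robots.txt
done

Every line should print 200.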

Verifying it works

Two checks you can run in 10 minutes.

The synthetic check. From a machine outside Cloudflare's network, run a request with the user-agent header set:

curl -A "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)" -I https://yourdomain.com/

You should see HTTP 200, not 403. Repeat with GPTBot if you blocked training crawlers; you should see 403. Note: this only tests the user-agent layer. A real signed bot will also carry verification headers your synthetic curl will not.
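
To run the whole matrix in one pass, loop over both lists (again, this exercises only the user-agent layer; the bare tokens still match the WAF rule because it uses contains):

for ua in OAI-SearchBot PerplexityBot ChatGPT-User Google-Extended ClaudeBot GPTBot CCBot Bytespider; do
  printf '%-16s -> ' "$ua"
  curl -s -o /dev/null -w '%{http_code}\n' -A "$ua" https://yourdomain.com/
done

Expect 200 for the first five and 403 for GPTBot, CCBot, and Bytespider.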

The real-traffic check is more honest. After 24-48 hours, return to AI Crawl Control in the dashboard and confirm: OAI-SearchBot, PerplexityBot, and ClaudeBot now show up in the Crawlers tab with non-zero request counts and "allowed" status; GPTBot, CCBot, and Bytespider show "blocked". If the allowed list is empty after 48 hours on a site that gets organic traffic, your custom rule expression is wrong; recheck the field names and operator (contains is case-sensitive).

Common failures and fixes

Custom rule deployed but bots still blocked. Two causes: the rule expression has a typo (most often http.user_agent eq when you meant contains); or the Skip action does not include the "Block AI bots" checkbox, only "Bot Fight Mode". Re-open the rule, scroll to the Skip configuration, and tick all three boxes (Managed Rules, Bot Fight Mode, Block AI bots).

Allowed bots show up but get challenged with a CAPTCHA. Super Bot Fight Mode is challenging "likely automated" traffic before your Skip rule gets a chance to match, which happens when the rule order is wrong. In the dashboard, drag your custom rule to the top of the WAF rules list. Order matters even within custom rules.

You allowed PerplexityBot and Cloudflare still blocks Perplexity by company-level reputation. In August 2025 Cloudflare delisted Perplexity from its Verified Bots Program over the stealth-crawler incident. As of early 2026, Perplexity is gradually being rehabilitated, but some zones still inherit a network-level reputation block. Check Cloudflare's status page and the AI Crawl Control changelog before assuming your rule is broken.

You allowed ClaudeBot but Claude.ai still cannot read your page. ClaudeBot is the user agent for crawling, but Claude's web search feature uses a different fetch path. The user-agent for end-user-driven Claude queries is in transition; allow Claude-User and Claude-SearchBot in the same rule alongside ClaudeBot to cover all current and near-future Anthropic crawlers.
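
The expanded Anthropic clause to merge into the step 3 Skip expression:

(http.user_agent contains "ClaudeBot") or (http.user_agent contains "Claude-User") or (http.user_agent contains "Claude-SearchBot")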

Your traffic spikes from one bot. AI crawlers can be heavy. If a single allowed bot generates more than 5% of your origin requests, add a per-IP rate limit (Cloudflare WAF → Rate Limiting Rules) of, say, 60 requests per minute on requests where http.user_agent contains "PerplexityBot". The bot still gets fresh content, your origin does not get hammered.
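
Sketched in dashboard terms (labels vary slightly by plan; the threshold is the example value from above):

Expression: (http.user_agent contains "PerplexityBot")
Counting characteristic: IP address
Threshold: 60 requests per 60 seconds
Action: Block

The same expression syntax from the custom rules editor applies; only the rate parameters are new.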

Going further

The single switch is going away. Cloudflare's pay-per-crawl system, currently in private beta and rolling to general availability through 2026, lets sites return HTTP 402 Payment Required with a crawler-price header, and lets AI vendors decide on the fly whether to pay for the fetch. The economics shift from "block or allow" to "price by content type". Once that ships broadly, the Block AI bots managed rule becomes the floor, not the ceiling, of your AI-traffic policy.
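
Going by the headers Cloudflare named in the pay-per-crawl announcement, the exchange looks roughly like this (values illustrative; semantics may still change before general availability):

GET /pricing-guide HTTP/2
User-Agent: ExampleAIBot/1.0
crawler-max-price: USD 0.01

HTTP/2 402
crawler-price: USD 0.05

A crawler whose offer meets the price gets a 200 and a crawler-charged header confirming the debit; one that does not can retry with a higher bid or walk away.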

If you are configuring this on a brand-new zone, also pair it with a /llms.txt file that tells well-behaved AI crawlers which pages you consider canonical. The WAF rule controls access; llms.txt controls what they see first when they get in. The combination is what gets a brand-new site cited within weeks instead of months.
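
A minimal llms.txt sketch, following the markdown shape of the llmstxt.org proposal (paths and descriptions illustrative):

# YourProduct
> One-sentence description of what YourProduct does and who it is for.

## Docs
- [Getting started](https://yourdomain.com/docs/getting-started): install and first successful run
- [API reference](https://yourdomain.com/docs/api): endpoints, auth, rate limits

## Optional
- [Changelog](https://yourdomain.com/changelog): release history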


Frequently asked questions

Will I lose AI citations forever if I leave Cloudflare's default block on?
Not forever, but immediately and continuously. Retrieval crawlers like OAI-SearchBot and PerplexityBot fetch a page each time a user asks a question whose answer might cite you. If they get a 403 every time, the answer is grounded on a competitor's page that did let them in. The damage compounds the longer the block stays on, because AI engines also build internal preference signals from successful past fetches.
Is allowing AI crawlers a security risk?
On its own, no, because Cloudflare's WAF Skip action only bypasses the specific protections you list. IP rate limiting, custom WAF rules, and managed rules outside the skip list still run. The risk is user-agent spoofing: anyone can set a user-agent to PerplexityBot. The fix is to combine the user-agent check with cf.verified_bot_category on Bot Management or, on free plans, to keep aggressive rate limits on the allowed user agents.
Why does Cloudflare block all AI bots by default instead of asking?
Cloudflare's reasoning is that scraper traffic was overwhelming origin servers and most site owners had no visibility into the trade-off. The default block is a safety net for the silent majority. The dashboard plus AI Crawl Control then exists to let informed teams flip individual switches once they understand what they are giving up. The default is conservative, not absolute.
Do I still need llms.txt and a sitemap if I configure the WAF correctly?
Yes, because the WAF only controls access. Once a crawler is in, llms.txt tells it which pages you consider canonical, and a sitemap tells it the full set of URLs you want indexed. Without both, an allowed crawler may still spend its budget on shallow or duplicate pages and miss the content you most want cited.
