
Robots.txt Disallow All: Blocking AI Bots is as misguided as blocking Google in the 90s!


Introduction: Lessons from the Past

In the late 90s, many media companies decided to block search engine bots, including Google, from crawling their websites by using the disallow-all feature in their robots.txt files. They felt that search engines were unfairly exploiting their content. But boy, oh boy, was that a big mistake for their web traffic. Over time, they came to realize that collaboration, not exclusion, drove visibility, traffic, and revenue.

A saying often attributed to Mark Twain goes, “History doesn’t repeat itself, but it often rhymes.” In that spirit, businesses today grapple with a similar dilemma about AI and LLM crawlers such as GPTBot and PerplexityBot.

The AI Crawler Dilemma: Visibility vs. Protection

There is growing anxiety among content creators and businesses about how proprietary data might be utilized to train these models. 

The concerns revolve around misuse and the potential for distorting their intellectual property. Despite these valid worries, it’s essential to consider the implications of shutting out AI crawlers entirely.

In today’s AI-driven world, a wholesale ban on all AI bots, such as GPTBot and PerplexityBot, would no doubt prevent your content from being used to train large language models (LLMs), but it would also make your brand, company, and offerings invisible to these LLMs.

My perspective on this issue is to strike a balance. I advise allowing these bots access while denying them your copyrighted and subscription-based content. Implemented effectively, this approach lets you safeguard your interests while keeping your brand visible and your audience engaged.

What is Robots.txt? A Modern Guide

A robots.txt file is kind of like a concert pass, telling web crawlers who can get in and where they’re allowed to go. Just as only those with a backstage pass can access restricted areas at a concert, robots.txt lets you specify which parts of your website search engine bots can visit and which areas are off-limits.

This file helps manage how search engines crawl and index your site, preventing them from accessing sensitive or unnecessary pages and helping to reduce server load. However, it’s important to remember that not all bots respect these rules.

How it Works:

It uses simple rules to instruct crawlers, such as Disallow (to prevent crawling specific URLs) and Allow (to allow crawling specific URLs).

Not a guarantee:

It’s important to remember that a robots.txt file doesn’t guarantee that a page won’t be indexed. It’s a suggestion to crawlers, and some may ignore your robots.txt entirely. For this reason, robots.txt shouldn’t be relied on as a security measure since determined or malicious crawlers can easily bypass it.
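
To see why it’s only a suggestion, consider that the check happens entirely on the crawler’s side. Here’s a minimal sketch of what a well-behaved crawler does, using Python’s standard urllib.robotparser module (the domain and bot name are placeholders):

from urllib import robotparser

# A polite crawler downloads robots.txt and checks each URL before fetching it.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # placeholder domain
rp.read()

# Nothing on the server enforces this; a rogue bot can simply skip the check.
print(rp.can_fetch("MyCrawler", "https://www.example.com/private/report.html"))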

Purpose of Robots.txt:

Website owners use robots.txt to:

  • Manage crawler traffic and prevent server overload. 
  • Block specific directories or files from being crawled. 
  • Guide crawlers to important pages for indexing.

How to Find or Upload Your Robots.txt file

You can usually find a website’s robots.txt file by adding /robots.txt to the end of the domain (e.g., example.com/robots.txt). If your site doesn’t have one yet, create the file and upload it to the root directory of your website.
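
If you’d rather check from a script than a browser, a quick fetch works just as well. A minimal sketch in Python (swap the placeholder domain for your own):

from urllib.request import urlopen

# Print whatever robots.txt the site currently serves (a missing file raises an HTTP error).
print(urlopen("https://www.example.com/robots.txt").read().decode("utf-8"))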

Pro Tip: If You’re Using a CMS

If you’re using a Content Management System (CMS) such as WordPress, Wix, or Blogger, you don’t need to create or edit your robots.txt file. Moreover, you might also be using a plugin like Yoast or AI Monitor WP on top of your CMS. 

In such a case, a search settings page or something similar helps you manage whether search engines can crawl your page. 

If you want to keep a page hidden from search engines or make it visible again, check out how to adjust your page’s visibility in your CMS. For example, search “Yoast hide page from search engines” to find what you need.

The Ideal Robots.txt File in 2025

Here’s what makes a robots.txt file ideal in today’s AI-dominated information discovery process: 

1. User-agent Directive

The User-agent directive is crucial—it specifies which crawlers (also known as bots) the rules apply to. 

A common mistake is mentioning only Googlebot. Instead, it’s ideal to use User-agent: *, which applies the rules universally to all crawlers. This ensures your directives aren’t limited to just one search engine but are inclusive and applicable to the broader bot community.

Example:



User-agent: *


Why does this matter?

Not all web traffic comes from Google—so universal bot coverage maximizes your site’s reach while managing crawler activity effectively. 

2. Allow and Disallow Directive

The Allow and Disallow directives are the backbone of your robots.txt file, dictating which parts of your site are accessible to crawlers and which are restricted. Used strategically, they balance visibility with protection. Here’s how to wield them effectively:


User-agent: * 

Disallow: /private

Translation: “All bots: Stay out of my private folder!”


User-agent: *

Disallow: /secret-lab/ 

Allow: /public-cat-videos/

Translation: “All bots: stay out of my secret lab (no one needs to see my failed robot uprising blueprints), but feel free to binge my cat videos!”


User-agent: *  

Disallow: /

Allow: /blog/ 

Translation: “All bots: Block my entire site except the /blog/ directory.”

Granular Control for AI Crawlers

To future-proof for AI, apply rules specifically for LLM bots like GPTBot or PerplexityBot:

Example:


User-agent: GPTBot

Disallow: /ChatGPT-clone/

Allow: /blog/  

User-agent: * 

Disallow: /user-dashboards/ 

Translation: GPTBot is blocked from the ChatGPT clone you are working on in your free time but can still index public content (e.g., the blog) for visibility in AI tools, while every bot is kept out of the user dashboards.
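
If you want to sanity-check a per-bot file like this before publishing it, Python’s urllib.robotparser can parse the rules straight from a string. A rough sketch based on the example above (domain and expected outputs are illustrative):

from urllib import robotparser

rules = [
    "User-agent: GPTBot",
    "Disallow: /ChatGPT-clone/",
    "Allow: /blog/",
    "",
    "User-agent: *",
    "Disallow: /user-dashboards/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# GPTBot gets its own group; every other bot falls back to the * group.
print(rp.can_fetch("GPTBot", "https://www.example.com/ChatGPT-clone/app"))          # False
print(rp.can_fetch("GPTBot", "https://www.example.com/blog/post"))                  # True
print(rp.can_fetch("PerplexityBot", "https://www.example.com/user-dashboards/me"))  # False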

Things to Avoid in Robots.txt

Conflicting Rules: 


Disallow: /blog/  

Allow: /blog/latest-news/ 

Outcome: Some crawlers (like Google) will allow /blog/latest-news/, while others may ignore the Allow directive.
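
You can watch this divergence locally: Python’s standard urllib.robotparser takes the first rule that matches, while Google documents longest-match precedence, so the same file can give different answers. A small sketch (placeholder domain):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /blog/",
    "Allow: /blog/latest-news/",
])

# This first-match parser stops at "Disallow: /blog/"; Google's longest-match
# rule would pick the more specific Allow line and permit the URL instead.
print(rp.can_fetch("MyCrawler", "https://www.example.com/blog/latest-news/"))  # False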

Overly Broad Blocks:


Disallow: /

Outcome: This blocks your entire site. Use it only if you want zero visibility.

As mentioned earlier, this is not a watertight method to ensure compliance. You must protect truly sensitive content with proper access controls, such as authentication, paywalls, or server-side blocking, rather than relying on robots.txt alone.

3. Crawl Delay

An ideal crawl delay in robots.txt generally ranges from 1 to 10 seconds, with 10 seconds being the most common suggestion. This delay, specified using the Crawl-delay: directive, tells crawlers how long to wait between requests to your website. Keep in mind that support varies: Bing honors Crawl-delay, while Googlebot ignores it entirely.


User-agent: *  
Disallow: /proprietary-data/  
Allow: /  
Crawl-delay: 10

Translation: Don’t come back knocking before the 10 seconds have passed. 
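
Crawlers that do honor the directive read it programmatically before scheduling requests. A minimal sketch with Python’s urllib.robotparser (rules inlined for illustration):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# A compliant bot sleeps this many seconds between requests; bots that
# ignore Crawl-delay simply never ask.
print(rp.crawl_delay("MyCrawler"))  # 10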

4. Sitemap Directive

The Sitemap directive is a guiding star for crawlers. It tells them where to find the sitemap file—a comprehensive list of your site’s URLs. This makes it easier for bots to understand your site’s structure and index it efficiently.

Example:


Sitemap: https://www.example.com/sitemap.xml

Why does this matter?

A well-placed Sitemap directive ensures search engines have all the vital info they need to index your site properly, boosting your visibility. 

Update this file and add new rules as your site evolves. That “/ai-pet-rock-store/” directory? Yeah, block it now.

Robots.txt Example for 2025: Future-proofed, courtesy of AI Monitor


User-agent: *  
Disallow: /secret-lab/  
Disallow: /proprietary-data/  
Allow: /  
Crawl-delay: 10 
Sitemap: https://yoursite.com/sitemap.xml
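
Before uploading a file like this, it’s worth a quick local test. The sketch below parses the example above with Python’s urllib.robotparser and spot-checks a few URLs (yoursite.com is the same placeholder used in the example):

from urllib import robotparser

rules = """User-agent: *
Disallow: /secret-lab/
Disallow: /proprietary-data/
Allow: /
Crawl-delay: 10
Sitemap: https://yoursite.com/sitemap.xml"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://yoursite.com/secret-lab/plans"))  # False
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))         # True
print(rp.crawl_delay("GPTBot"))                                         # 10
print(rp.site_maps())  # ['https://yoursite.com/sitemap.xml'] (Python 3.8+)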

Why Blocking AI Crawlers Is a Strategic Mistake

The Precedent of Search Engines:

Brands that embraced SEO thrived; those that resisted faded into obscurity. Similarly, LLMs will shape future discovery.

Case Study:

One of our clients saw a 46% traffic drop after blocking AI bots, while a competitor that allowed them gained featured snippets in AI tools.

Checklist for Website Owners and Content Creators

☑️ Audit Your robots.txt: Ensure it’s not disallowing AI crawlers (e.g., GPTBot).

☑️ Segment Access: Use granular rules to protect paid content or confidential data.

☑️ Monitor Compliance: We have a free tool called AI Bot Monitor that you can use to track bot activity.
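
If you’d rather inspect your own server logs directly, a rough sketch like the one below counts requests from known AI crawlers. It assumes a standard access log at a hypothetical path; extend the list with any other bot names you care about:

from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # hypothetical path; adjust for your server
AI_BOTS = ["GPTBot", "PerplexityBot"]    # user-agent substrings to look for

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")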

Conclusion: Adapt or Be Invisible

In my personal opinion, blocking AI crawlers today is as myopic as blocking Google in the 90s. The key lies in strategic access—shielding critical data while ensuring your brand remains part of the AI-driven conversation. Update your robots.txt, embrace transparency, and position your content for the future.

Your next step? Review your robots.txt at yoursite.com/robots.txt—before AI overlooks your business entirely.
