In the late 90s, many media companies decided to block search engine bots, including Google, from crawling their websites by adding a blanket “disallow all” rule to their robots.txt files. They felt that search engines were unfairly exploiting their content. But boy, oh boy, was that a mistake for their web traffic. Over time, they came to realize that collaboration, not exclusion, drove visibility, traffic, and revenue.
Mark Twain is often credited with saying, “History doesn’t repeat itself, but it often rhymes.” In that spirit, businesses today grapple with a similar dilemma about AI and LLM crawlers such as GPTBot and PerplexityBot.
The AI Crawler Dilemma: Visibility vs. Protection
There is growing anxiety among content creators and businesses about how proprietary data might be utilized to train these models.
The concerns revolve around misuse and the potential for distorting their intellectual property. Despite these valid worries, it’s essential to consider the implications of shutting out AI crawlers entirely.
In today’s AI-driven world, a wholesale ban on AI bots such as GPTBot and PerplexityBot would certainly keep your content out of large language model (LLM) training data, but it would also make your brand, company, and offerings invisible to those same LLMs.
My perspective on this issue is to strike a balance: allow these bots access while denying them your copyrighted and subscription-based content. Implemented well, this approach lets you safeguard your interests while boosting your brand’s online presence and user engagement.
What is Robots.txt? A Modern Guide
A robots.txt file is kind of like a concert pass, telling web crawlers who can get in and where they’re allowed to go. Just as only those with a backstage pass can access restricted areas at a concert, robots.txt lets you specify which parts of your website search engine bots can visit and which areas are off-limits.
This file helps manage how search engines crawl and index your site, preventing them from accessing sensitive or unnecessary pages and helping to reduce server load. However, it’s important to remember that not all bots respect these rules.
How It Works:
It uses simple rules to instruct crawlers, such as Disallow (to prevent crawling specific URLs) and Allow (to allow crawling specific URLs).
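For illustration, a minimal (hypothetical) file might pair the two, blocking a folder while explicitly permitting one page inside it. Both paths here are placeholders:
User-agent: *
Disallow: /drafts/
Allow: /drafts/public-preview.html
The idea is that a more specific rule can carve an exception out of a broader block, although, as noted below, not every crawler handles Allow the same way.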
Not a guarantee:
It’s important to remember that a robots.txt file doesn’t guarantee that a page won’t be indexed. It’s a suggestion to crawlers, and some may ignore your robots.txt entirely. For this reason, robots.txt shouldn’t be relied on as a security measure since determined or malicious crawlers can easily bypass it.
Purpose of Robots.txt:
Website owners use robots.txt to:
Manage crawler traffic and prevent server overload.
Block specific directories or files from being crawled.
Guide crawlers to important pages for indexing.
How to Find or Upload Your Robots.txt file
You can usually find a website’s robots.txt file by adding /robots.txt to the end of the website’s URL (e.g., example.com/robots.txt). If there is none, your robots.txt file should be placed in the root directory of the website.
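For illustration, assuming your site is www.example.com, crawlers only look for the file at the root of the host:
https://www.example.com/robots.txt (found: this is where crawlers check)
https://www.example.com/blog/robots.txt (ignored: crawlers do not look in subdirectories)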
Pro Tip: If You're Using CMS
If you’re using a Content Management System (CMS) such as WordPress, Wix, or Blogger, you may not need to create or edit your robots.txt file manually. You might also be using a plugin like Yoast or AI Monitor WP on top of your CMS.
In that case, a search settings page (or something similar) lets you manage whether search engines can crawl your pages.
If you want to keep a page hidden from search engines or make it visible again, check out how to adjust your page’s visibility in your CMS. For example, search “Yoast hide page from search engines” to find what you need.
The Ideal Robots.txt File in 2025
Here’s what makes a robots.txt file ideal in today’s AI-dominated information discovery process:
1. User-agent Directive
The User-agent directive is crucial—it specifies which crawlers (also known as bots) the rules apply to.
A common mistake is mentioning only Googlebot. Instead, it’s ideal to use User-agent: *, which applies the rules universally to all crawlers. This ensures your directives aren’t limited to a single search engine but apply to the broader bot community.
Example:
User-agent: *
Why does this matter?
Not all web traffic comes from Google—so universal bot coverage maximizes your site’s reach while managing crawler activity effectively.
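If different bots need different rules, you can define separate groups. A quick sketch, with placeholder paths:
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /drafts/
Keep in mind that most major crawlers obey only the most specific group that matches them, so in this sketch GPTBot would follow its own group rather than the * rules.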
2. Allow and Disallow Directive
The Allow and Disallow directives are the backbone of your robots.txt file, dictating which parts of your site are accessible to crawlers and which are restricted. Used strategically, they balance visibility with protection. Here’s how to wield them effectively:
User-agent: *
Disallow: /private
Translation: “All bots: Stay out of my private folder!”
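You can also single out AI crawlers by name. A sketch, using a placeholder directory:
User-agent: GPTBot
User-agent: PerplexityBot
Disallow: /chatgpt-clone/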
Translation: Blocks AI crawlers from the ChatGPT clone you are working on in your free time. However, this allows them to index public content (e.g., blogs) for visibility in AI tools.
Things to Avoid in Robots.txt
Conflicting Rules:
Disallow: /blog/
Allow: /blog/latest-news/
Outcome: Some crawlers (like Googlebot) will still crawl /blog/latest-news/, while others may ignore the Allow directive and skip it.
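Because Allow support is inconsistent across crawlers, a more portable pattern (paths here are placeholders) is to disallow only the specific subfolders you actually want hidden:
User-agent: *
Disallow: /blog/archive/
Disallow: /blog/drafts/
This keeps the rest of /blog/ crawlable without relying on every bot honoring Allow.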
Overly Broad Blocks:
User-agent: *
Disallow: /
Outcome: This blocks your entire site—use only if you want zero visibility.
As mentioned earlier, this is not a watertight method of ensuring compliance. You must pair robots.txt with stronger protections, such as authentication or paywalls, for anything truly sensitive.
3. Crawl Delay
An ideal crawl delay in robots.txt generally ranges from 1 to 10 seconds, with 10 seconds being the most common suggestion. This delay, specified using the Crawl-delay: directive, tells search engine crawlers how long to wait between requests to your website.
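For example, the following rule asks all crawlers to wait 10 seconds between requests:
User-agent: *
Crawl-delay: 10
Keep in mind that support for Crawl-delay varies; Googlebot, for instance, ignores it entirely.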
Translation: Don’t come back knocking before the 10 seconds have passed.
4. Sitemap Directive
The Sitemap directive is a guiding star for crawlers. It tells them where to find the sitemap file—a comprehensive list of your site’s URLs. This makes it easier for bots to understand your site’s structure and index it efficiently.
Example:
Sitemap: https://www.example.com/sitemap.xml
Why does this matter?
A well-placed Sitemap directive ensures search engines have all the vital info they need to index your site properly, boosting your visibility.
Update this file and add new rules as your site evolves. That “/ai-pet-rock-store/” directory? Yeah, block it now.
Robots.txt Example for 2025: Future-Proofed, Courtesy of AI Monitor
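As an illustrative sketch only (the domain and all paths are placeholders; adapt them to your own site), a balanced file in the spirit of this article might look like this:
# Default rules for all crawlers
User-agent: *
Crawl-delay: 10
Disallow: /premium/
Disallow: /subscriber-only/

# AI crawlers: welcome everywhere except paid content
User-agent: GPTBot
User-agent: PerplexityBot
Disallow: /premium/
Disallow: /subscriber-only/

Sitemap: https://www.example.com/sitemap.xml
Naming GPTBot and PerplexityBot explicitly makes your intent clear; since bots with their own group follow that group instead of the * rules, the paid-content blocks are repeated there.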
Brands that embraced SEO thrived; those that resisted faded into obscurity. Similarly, LLMs will shape future discovery.
Case Study:
One of our clients saw a 46% traffic drop after blocking AI bots, while a competitor that allowed them gained featured snippets in AI tools.
Checklist for Website Owners and Content Creators
☑️ Audit Your robots.txt: Ensure it’s not disallowing AI crawlers (e.g., GPTBot).
☑️ Segment Access: Use granular rules to protect paid content or confidential data.
☑️ Monitor Compliance: We have a free tool called AI Bot Monitor that you can use to track bot activity.
Conclusion: Adapt or Be Invisible
In my personal opinion, blocking AI crawlers today is as myopic as blocking Google in the 90s. The key lies in strategic access—shielding critical data while ensuring your brand remains part of the AI-driven conversation. Update your robots.txt, embrace transparency, and position your content for the future.
Your next step? Review your robots.txt at yoursite.com/robots.txt—before AI overlooks your business entirely.
Frequently Asked Questions
What happens if I block all AI bots?
Blocking all AI bots is a lot like the old practice of blocking Google: it removes your business from AI-powered search. Keeping your content out of LLM training data also makes your brand invisible in AI-generated answers, which means fewer engagement opportunities and less visitor traffic.
Can robots.txt protect sensitive or copyrighted content on its own?
No. Robots.txt is a suggestion, not a security measure. Ethical bots (like Googlebot or GPTBot) respect it, but malicious scrapers may ignore it. For sensitive data, use stronger protections like authentication, paywalls, or legal measures (e.g., terms of service).
How can I tell whether my site is blocking AI crawlers?
Visit yoursite.com/robots.txt and look for entries such as User-agent: GPTBot or Disallow: /. A Disallow: / rule applied to all bots hides your site from every search engine and AI tool.
What is a reasonable crawl delay?
A 10-second delay (Crawl-delay: 10) is a good balance: it reduces server strain while still letting bots index your content efficiently. Adjust based on your site’s traffic and hosting capacity.
Do AI crawlers respect User-agent: *?
Yes. Most AI crawlers, including GPTBot, follow rules set under User-agent: *. Unless you block them explicitly in your robots.txt file, they will crawl your site like any other bot. To block GPTBot specifically, add:
User-agent: GPTBot
Disallow: /