September 3, 2024

Should You Be Blocking AI Bots?

Deciding whether to block AI bots from accessing your content is a delicate balancing act between protecting your intellectual property and risking the potential loss of traffic and revenue.

But increasingly, many larger publishers are opting to block bots—when they can—in what I suspect is a negotiating tactic. The latest example is Applebot-Extended, Apple's bot specific to AI training. According to Wired:

WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training. The cold reception reflects a significant shift in both the perception and use of the robotic crawlers that have trawled the web for decades. Now that these bots play a key role in collecting AI training data, they’ve become a conflict zone over intellectual property and the future of the web.

If we look at that list, The New York Times is currently in a lawsuit with OpenAI, while The Financial Times, The Atlantic, and Condé Nast have all signed deals with OpenAI. Ironically, Business Insider and Politico, both owned by Axel Springer, which had one of the earliest licensing deals with OpenAI, don't block Applebot-Extended.

Neither does The Wall Street Journal, which News Corp owns, but this is less because it is comfortable with all of the AI bots and more because its team has likely not realized it can block Applebot-Extended. Take a quick look at WSJ's robots.txt file and you'll see that anthropic-ai, Google-Extended, ClaudeBot, Claude-Web, PerplexityBot, and Perplexity-ai are all blocked. It's clear that WSJ doesn't want any of these companies scraping its content.
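For context, blocking a bot is just a matter of adding a few lines to the robots.txt file that sits at the root of your site. Here's a simplified sketch of what those WSJ entries amount to, along with the Applebot-Extended rule it has yet to add (the real file contains much more):

    # Block known AI crawlers from the entire site
    User-agent: anthropic-ai
    User-agent: Google-Extended
    User-agent: ClaudeBot
    User-agent: Claude-Web
    User-agent: PerplexityBot
    User-agent: Perplexity-ai
    Disallow: /

    # The rule WSJ hasn't added yet
    User-agent: Applebot-Extended
    Disallow: /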

But let’s take a step back. What are we actually talking about here? Many of the AI companies have come out and said that if you don’t want them to use your content to train their models, you can simply block their bots in your robots.txt file. And so, adding their bots to that file should, theoretically, prevent them from using your content to train their models.

And these AI companies don’t care if you do it. They don’t need any individual website’s content. However, if we all, collectively, blocked the bots, that would probably suck for these companies. But since there’s no cabal of media companies—despite what conspiracy theorists might think—getting everyone to agree to block the bots becomes a prisoner’s dilemma. Everyone is acting in their own best interest.

And so, we come back to the question in the title: should you be blocking AI bots? There are multiple ways to answer this.

First, even if the answer is yes, it’s hard to do. Just look at WSJ, which is a massive media company. It hasn’t blocked Apple yet. Maybe it has a deal, but I suspect it just hasn’t thought to do it yet. But don’t be too ashamed about that, WSJ. As Wired reports:

Applebot-Extended is so new that relatively few websites block it yet. Ontario, Canada–based AI-detection startup Originality AI analyzed a sampling of 1,000 high-traffic websites last week and found that approximately 7 percent—predominantly news and media outlets—were blocking Applebot-Extended. This week, the AI agent watchdog service Dark Visitors ran its own analysis of another sampling of 1,000 high-traffic websites, finding that approximately 6 percent had the bot blocked.

And even if you are committed to blocking every AI bot, new ones keep popping up. Which do you prioritize?
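One way to keep on top of it is to periodically audit your own robots.txt against a running list of known crawlers. Here's a minimal sketch in Python using the standard library's robots.txt parser; the bot list is illustrative (mostly user agents that come up in this piece), not an exhaustive registry:

    # audit_ai_bots.py -- check which AI crawlers a site's robots.txt blocks.
    # A sketch, not a product; the bot list below is illustrative.
    from urllib.robotparser import RobotFileParser

    AI_BOTS = [
        "GPTBot",             # OpenAI's training crawler
        "Applebot-Extended",  # Apple's AI-training token
        "Google-Extended",    # Google's AI-training token
        "anthropic-ai",
        "ClaudeBot",
        "PerplexityBot",
        "CCBot",              # Common Crawl
    ]

    def audit(site: str) -> None:
        parser = RobotFileParser()
        parser.set_url(f"{site}/robots.txt")
        parser.read()  # fetches and parses the live file
        for bot in AI_BOTS:
            allowed = parser.can_fetch(bot, f"{site}/")
            print(f"{bot}: {'allowed' if allowed else 'blocked'}")

    if __name__ == "__main__":
        audit("https://www.wsj.com")

Run that against any domain and you can see in seconds which bots are covered and, more importantly, which new ones aren't.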

Second, there are legitimate risks to blocking the bots. Take all of these sites that have blocked OpenAI’s GPTBot crawler. I understand why they did it—I even suggested it might make sense a year ago—but the risk is that you’re not included in its SearchGPT product. If for some reason SearchGPT were to become the default search engine for millions of people, you’d be cutting yourself off from any potential traffic.

And we can see this risk in action with Google. According to Bloomberg, publishers are struggling with the decision to block Google’s AI.

That’s because the Google tool that sifts through web content to come up with its AI answers is the same one that keeps track of web pages for search results, according to publishers. Blocking Alphabet Inc.’s Google the way sites have blocked some of its AI competitors would also hamper a site’s ability to be discovered online.

Google said AI Overviews — the summaries displayed at the top of Google search — are part of its longstanding commitment to serve higher quality information and bolster opportunities for publishers and other businesses. “Every day, Google sends billions of clicks to sites across the web, and we intend for this long-established value exchange with websites to continue,” a Google spokesperson said in a statement.

I can tell you right now that no publisher should be adding Googlebot to their robots.txt; the search traffic matters too much. The trade-off of leaving it unblocked is that your content shows up in Google's AI Overviews, which could have a negative impact on traffic to your site. Or maybe it won't. We don't know yet.
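Worth noting: Google does offer a separate robots.txt token, Google-Extended, which opts your content out of training its Gemini models. But, as the Bloomberg report above describes, it does not remove you from AI Overviews, since those are built on the same core Googlebot crawl that powers search. So the only safe lever looks something like this (a sketch):

    # Opts out of Gemini model training only; AI Overviews still
    # draw on pages crawled by core Googlebot.
    User-agent: Google-Extended
    Disallow: /

    # Do NOT do this -- it pulls you out of Google Search entirely:
    # User-agent: Googlebot
    # Disallow: /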

Nevertheless, this is a real risk. And honestly, even if you are willing to take it, the internet is remarkably open and the bots will get the content elsewhere. I chatted with Mike King, founder of the SEO agency iPullRank, and he said:

Yeah, you can’t block Google’s AI from Google Search specifically without blocking core Googlebot. Blocking gen AI bots is silly anyway because there are so many other major scrapers like the Common Crawl that the AI tools can get the content from or if you have any syndication deals that don’t block the bots.

For example, if you license your content to MSN—a common destination for publishers—and MSN doesn’t block one of the bots, then your content ends up in the training data anyway. Looking at its robots.txt, I can see Applebot is blocked, but not Applebot-Extended. And Yahoo, which many publishers also license to, doesn’t block Apple at all. So, we come back to that first answer, which is: it’s damn hard.

I’ve started developing a bit of a framework for determining if I, a media executive, would block AI bots. Here are the questions I’d ask:

  1. Am I of a sufficient scale where I can likely command money from these AI platforms? If so, I’d block them. Look at who OpenAI has done deals with and that should give you an idea.
  2. Am I creating content that is fundamentally unique and cannot be acquired elsewhere? If so, I’d block them.

If you can’t answer yes to at least one of those questions, blocking is probably a waste of time. Especially as these tools become more advanced and we see more search-like products come from them, being excluded could hurt in the long term. On the other hand, you should obviously not be giving your content away for free. So, it’s a balancing act. And one that none of us are going to get right every time.