Publishers Tackle AI Scraping—It’s Not Going Well

Publishers are investing in server-side tools to block large language model (LLM) crawlers as growing evidence shows that robots.txt files and legal threats aren’t enough to stop the scraping.
AI monetization startup Tollbit's State of the Bots report, which analyzed bot activity, found a 40% increase in unauthorized scraping from the third quarter to the fourth. Robots.txt, the file websites use to tell bots not to scrape their content, is clearly failing to deliver.
Lucky Gunasekara, co-founder and CEO of AI-driven search startup Miso.AI, has found that robots.txt violations by AI-powered search engine Perplexity range from roughly 15% to 20% for articles to 65% to 77% for homepage news. The figures come from Sentinel, his firm's in-house monitoring tool, which checks whether AI companies are complying with publishers' robots.txt requests.
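Sentinel's internals aren't public, but the basic check it is described as performing can be sketched with Python's standard library: parse a publisher's live robots.txt, ask whether a given crawler user agent is allowed to fetch a URL, and compare that against what the crawler was actually observed requesting. The site, user-agent token and URLs below are placeholders.

```python
# Minimal sketch of a robots.txt compliance check, in the spirit of what
# Miso.AI's Sentinel is described as doing. Not Sentinel's actual code;
# the site, user agent and URLs are illustrative placeholders.
from urllib import robotparser

SITE = "https://example-publisher.com"   # hypothetical publisher
CRAWLER_UA = "ExampleAIBot"              # hypothetical AI crawler token

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetch and parse the live robots.txt

# URLs the crawler was observed requesting (e.g. pulled from server logs)
observed_requests = [f"{SITE}/", f"{SITE}/news/some-article"]

for url in observed_requests:
    allowed = rp.can_fetch(CRAWLER_UA, url)
    status = "OK" if allowed else "VIOLATION: disallowed by robots.txt"
    print(f"{CRAWLER_UA} -> {url}: {status}")
```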
“They have bot blockers of their own,” Gunasekara said. “The irony is always like they’re happy to crawl other people’s sites but they don’t want you crawling them.”
Perplexity relies on its own web crawlers as well as third parties, which creates challenges for publishers in knowing which scrapers to include in robots.txt. Gunasekara estimates there are currently about 75 different AI bots publishers should be blocking, but found that most block only about 15 on average. Perplexity has recently updated its agreements with third-party crawlers to ensure they respect robots.txt, according to a FAQ sheet.
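For reference, opting out of a crawler via robots.txt is just a matter of listing its user-agent token with a Disallow rule. The tokens below are ones the relevant companies document publicly (OpenAI's GPTBot, Perplexity's PerplexityBot, Common Crawl's CCBot); a full blocklist along the lines Gunasekara describes would run to dozens of entries, needs ongoing maintenance as new crawlers appear, and compliance remains voluntary.

```
# Excerpt of a robots.txt that opts out of a few documented AI crawlers.
# Illustrative, not exhaustive.
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /
```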
“We found that there were troubling dynamics; for instance, the bigger the site you were, the more likely they were going to collect your data,” Gunasekara said. “This raises this real arms-race question: look, what good is robots.txt when you don’t give them consent and they don’t give a shit and they take it anyway?”
Bringing in the Pros
Publishers are now looking beyond robots.txt to disincentivize crawlers. B2B publisher Citywire already puts a lot of effort into preventing bots from spamming its sites, Richard Lander, director at Citywire, told AMO. Safeguarding against LLMs is “more of the same,” he said.
Content delivery network provider Cloudflare has a tool that enables website owners to block any AI crawler at the click of a button. Over 800,000 websites have the most aggressive form of this tool switched on, meaning they block all AI scrapers across the board, Will Allen, vice president of product at Cloudflare, told AMO. It’s programmatic enforcement at the HTTP level, he said.
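Cloudflare's managed rules aren't public, but "enforcement at the HTTP level" boils down to inspecting each request before it reaches the origin and refusing known AI crawlers outright, rather than trusting them to honor robots.txt. A minimal sketch of that idea as generic server middleware, with an illustrative rather than exhaustive blocklist, assuming a plain WSGI Python stack:

```python
# Sketch of HTTP-level bot blocking: refuse requests whose User-Agent
# matches a blocklist before the application ever sees them. This shows
# the general approach, not Cloudflare's actual rules; real services also
# lean on IP reputation, TLS fingerprints and machine-learning signals.
BLOCKED_UA_TOKENS = ("GPTBot", "PerplexityBot", "CCBot")  # illustrative list

def block_ai_crawlers(app):
    """Wrap a WSGI app so known AI crawler user agents receive a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(token.lower() in user_agent.lower() for token in BLOCKED_UA_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"AI crawling is not permitted on this site.\n"]
        return app(environ, start_response)  # pass everything else through
    return middleware
```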
Working at this level is “a major undertaking especially as threats evolve daily,” Benjamin Barrier, co-founder of cyberfraud platform DataDome, told AMO. Building effective AI bot detection requires expertise in machine learning, traffic anomaly detection, cybersecurity and high-scale infrastructure management.
“We see just absolutely enormous amounts of traffic, not just for your one particular domain, but across the entire board of what we’re seeing and that means that we can see patterns and understand what’s going on and help you apply those rules much more quickly,” Allen said.
Gunasekara sees this as a collective problem requiring a collective solution. And even though publishers are aware they are being stolen from, he hasn’t seen much investment in infrastructure to understand how deep the theft really goes.
The New Site Visitor
Travel publication Skift recently started working with Tollbit to get a better understanding of AI scraping. Jason Clampet, chief product officer at Skift, told AMO:
It’s crazy how much scraping happens. The first three days, it [Tollbit] was saying we were having about 50,000 scrapes a day from OpenAI, and that was turning into 20 page views, which is crazy. And so we have blocked certain ones, we’ve said ‘don’t visit us.’
The whole idea is to get a bunch of info about how much scraping is going on and then go to them and say, ‘let’s strike a deal.’ We’ll see how that goes. I think we’re like most publishers—we’re still in the ‘we don’t know what’s next’ thing, so we’re just trying to try everything.
Tollbit, which raised a $24 million Series A round last year, takes bot blocking one step further by presenting AI bots with a paywall where publishers can monetize the data that is being scraped. It now works with over 1,100 media companies including TIME, Mumsnet and AdWeek.
“AI traffic on the internet is only increasing and it’s going to surpass human traffic because of the ease with which these AI systems can fetch information,” Toshit Panigrahi, co-founder and CEO of Tollbit, told AMO. “We really should be treating them as a new visitor that’s going to be coming in way bigger numbers than human eyeballs are and so we need a way to be able to charge them.”
Yet not all publications feel the need to block bots. Politico’s senior vice president of product and design told Press Gazette that Politico.eu’s redesigned website was built to be as readable to AI crawlers as possible.
The Next Iteration of Paywalls
Even publishers with strict traditional paywalls aren’t confident that their existing infrastructure is enough to keep out crawlers.
“I wouldn’t trust them. We have a paywall, we have registration … we’re not taking any chances, we’ve put in extra safeguards to stop them crawling over. I think some stuff is always going to get out there — if it gets reproduced by other people, that’s going to happen. But we’re keeping a very tight lid on this,” Citywire’s Lander said.
Skift’s Clampet has tested various chatbots and discovered that gated content is being served by some of them. “They get behind paywalls and it’s a challenge,” he said.
Paywalled content from Miso.AI’s clients is being accessed by Perplexity, including information that would require sign-ups or paid accounts, Gunasekara said. Some of the mechanisms bots use to bypass paywalls involve mimicking human browsing behavior, such as rotating through thousands of IP addresses, DataDome’s Barrier said.
“They exploit weaknesses in how paywalls are implemented—such as unsecured APIs, predictable access patterns, or incomplete session validation—to access premium content without detection,” Barrier said.
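The "incomplete session validation" failure mode Barrier points to typically means the full article is sent to every caller and the paywall is only drawn client-side, or an API endpoint returns premium text without checking entitlements. A hedged sketch of the server-side check that closes that gap, using Flask for brevity; the route, field names and subscription check are assumptions, not any publisher's actual implementation:

```python
# Sketch of server-side paywall enforcement: the full article body is only
# returned after the session's entitlement is verified, so a scraper that
# ignores client-side gating still receives nothing. Names are illustrative.
from flask import Flask, jsonify, session, abort

app = Flask(__name__)
app.secret_key = "replace-with-a-real-secret"  # needed for signed sessions

ARTICLES = {
    "ai-scraping-101": {"teaser": "Bots are everywhere...",
                        "body": "Full premium text..."},
}

def has_active_subscription() -> bool:
    # Placeholder: production code would validate a signed session against
    # the subscription database, not just read a session flag.
    return bool(session.get("subscriber"))

@app.get("/api/articles/<slug>")
def get_article(slug):
    article = ARTICLES.get(slug) or abort(404)
    if not has_active_subscription():
        # Unauthenticated callers, human or bot, only ever get the teaser.
        return jsonify({"slug": slug, "teaser": article["teaser"], "paywalled": True})
    return jsonify({"slug": slug, "body": article["body"], "paywalled": False})
```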
Some paywalls have been built with SEO in mind, which makes them less restrictive; it’s often a trade-off between discovery and locking everything down, Clampet said. As a form of recourse for when AI companies violate paywalls, publishers are adding language to their terms of service prohibiting the training of AI on their content without permission.
Tollbit is looking at how to prevent these workarounds by presenting AI crawlers with their own paywall. Publishers, which are not charged by Tollbit, set rates based on the value of one page view, often using CPM or RPM as a starting point, Olivia Joslin, co-founder and chief operating officer of Tollbit, told AMO. Tollbit then charges a transaction fee to the AI companies.
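To make the pricing concrete with hypothetical numbers: a publisher earning roughly a $20 RPM from human readers is making about two cents per page view ($20 divided by 1,000 pages), so it might quote an AI crawler something on the order of $0.02 per scraped page as a starting rate, with Tollbit’s transaction fee then added on for the AI company. The figures are purely illustrative; actual rates are set by each publisher.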
Cloudflare’s Allen is working with companies to test out paywalls for humans and for AI agents. Tollbit has already built Tollbit.dev, which requires AI agents to authenticate and pay for usage without needing a human in the loop.
Playing Offense
One of the biggest challenges Tollbit’s Panigrahi faces with publishers is illustrating that this is not a zero-sum game and that it’s about finding a way to transact with this new form of visitor, rather than viewing AI companies as taking market share.
Gunasekara has seen many publishers fixated on the idea that they are being robbed. Tollbit’s report shows that AI bots on average drive 95.7% less click-through traffic than traditional Google search. As usage of AI tools increases, publishers will likely need to think less about SEO practices and more about handling an increase in traffic from bots.
Publishers need to have both offensive and defensive postures, Gunasekara said. One of the challenges is publishers’ “knee-jerk” reactions to strike content deals with AI companies, which make it harder to innovate and come up with industry-wide standards for licensing.
He predicts 2025 will be “the summer of open standards” and is seeing leaders who drove the internet’s early open standards take an interest in tackling this problem. One scenario could be a decentralized robots.txt file used and updated by the industry, combined with a “license to crawl” model that sends bots to a native index layer, where they request a license via an API, provide details such as a registered IP address, and receive access in return.
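None of this is standardized yet, but the "license to crawl" flow described above would look roughly like an ordinary token exchange: the bot registers itself with the index layer (operator, user agent, IP address, intended use), receives a license token, and presents it on subsequent crawl requests so publishers can verify, meter and bill the access. A speculative sketch; the endpoints, fields and token format are entirely hypothetical.

```python
# Speculative sketch of a "license to crawl" exchange. No such API exists
# today; every endpoint, field and token here is hypothetical.
import requests

INDEX_LAYER = "https://index.example.org"  # hypothetical shared index layer

# 1. The crawler registers itself and requests a license.
registration = requests.post(f"{INDEX_LAYER}/v1/licenses", json={
    "operator": "ExampleAI Inc.",
    "crawler_user_agent": "ExampleAIBot/1.0",
    "source_ip": "203.0.113.7",  # documentation-range IP, illustrative
    "purpose": "retrieval-augmented answers, no model training",
})
registration.raise_for_status()
license_token = registration.json()["license_token"]

# 2. Subsequent crawl requests present the license so the publisher can
#    verify, meter and charge for the access.
page = requests.get(
    "https://example-publisher.com/news/some-article",
    headers={"User-Agent": "ExampleAIBot/1.0",
             "Authorization": f"Bearer {license_token}"},
)
print(page.status_code)
```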
“If anything, the media community should be spending more time with those leaders than they are with VC-backed startup founders,” Gunasekara said. “The community for media, in my opinion, is the open standards community. But we’ll see what happens. It’s just very unpredictable.”