Web Data Diminishes: What if online publishers make it harder and more expensive to train models?

For decades, AI developers have treated the web as an open faucet of training data. Now publishers are shutting off the tap. Will web data dry up?

The fear: Publishers are moving to lock down their text and images, deny access or demand payment, and ensnare web crawlers with decoy data. These moves make training AI systems more expensive and less effective. Soon, only wealthy developers will be able to afford access to timely, high-quality web data.

Horror stories: From a publisher’s perspective, AI systems that train on text, images, and other data copied from the web siphon off traffic to their websites while the publishers get nothing in return. Publishers can ask crawlers that scrape their pages to refrain via robots.txt files and terms of service. Indeed, the percentage of regularly updated sites that do so rose from roughly 1 percent to 5 percent between 2023 and 2024. Some AI companies comply, but others don’t. Instead, they flood sites with download requests, driving up publishers’ bandwidth costs and overloading their servers. Consequently, blocking measures initially taken by individual publishers have evolved into server-level software defenses.

  • Wikipedia, a popular source of data for training large language models, is a top target of crawlers that gather training data. In May, traffic surged, and the online encyclopedia found that most of the requests came from crawlers rather than human users. It says that efforts to download training data raise its server costs and that AI models trained on its text cut its traffic, threatening the volunteer labor and financial donations that sustain it.
  • Read the Docs, a documentation-hosting service widely used by open-source projects, received a $5,000 bandwidth bill when one AI company’s crawler downloaded 73 terabytes. Blocking AI-related crawlers identified by the web-security provider Cloudflare saved $1,500 per month.
  • In April, Cloudflare launched AI Labyrinth, which serves AI-generated decoy pages to waste crawlers’ processing budgets and make them easier to identify. The company now blocks crawlers run by a list of AI companies by default. It’s testing a pay-per-crawl system that would allow publishers to set terms and prices for access to their data.
  • Publishers are taking other defensive measures as well. Developer Xe Iaso offers Anubis, a tool that makes browsers complete a short challenge before allowing them to load a page. SourceHut, a Git hosting service for open-source projects, deployed Anubis to stop aggressive crawlers after they disrupted its service.
  • The publishers’ rebellion began in 2023, when The New York Times, CNN, Reuters, and the Australian Broadcasting Corporation blocked OpenAI’s crawlers via their terms of service and disallowed them in their robots.txt files (see the example below). Since then, many news organizations have followed suit, reducing access to the data on current events that keeps models up to date.
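
For reference, a minimal robots.txt of this kind might look like the sketch below. The user-agent tokens shown (GPTBot for OpenAI’s crawler, CCBot for Common Crawl’s) are publicly documented, but the exact set of crawlers any publisher disallows is its own choice.

```
# Disallow specific AI training crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow everyone else, except for private paths
User-agent: *
Disallow: /private/
```

Compliance is voluntary: robots.txt states a publisher’s preference, and, as noted above, not every crawler honors it.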

How scared should you be: Data scraped from the web will continue to exist in datasets like Common Crawl, which is updated regularly. Nonetheless, the web is becoming less hospitable to data mining, and some web-scale datasets will contain less material, and less current material, than before. Meanwhile, publishers and developers may be entering a cat-and-mouse game. For example, Reddit alleged that Perplexity scraped its data indirectly through Google’s search results, which would suggest that some AI companies are finding workarounds to get data from closed sites; it would also mean that publishers can detect some of those workarounds. Other AI companies have paid to license content, showing that well-funded organizations can secure high-quality data while avoiding legal risks.

Facing the fear: Data available on the open web should be fair game for AI training, but developers can reduce publishers’ bandwidth burdens by limiting the frequency of crawls and volume of download requests. For sites behind paywalls, it makes sense to respect the publishers’ preferences and invest in data partnerships. Although this approach is more costly up front, it supports sustainable access to high-quality training data and helps preserve an open web that benefits audiences, publishers, and AI developers.
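
As a sketch of what limiting crawl frequency can look like in practice, the Python snippet below checks a site’s robots.txt using the standard urllib.robotparser module, honors an explicit crawl delay when the site declares one, and otherwise pauses a conservative interval between requests. The user-agent string and the default delay are illustrative assumptions, not values any particular company uses.

```python
import time
import urllib.robotparser
import urllib.request

USER_AGENT = "ExampleResearchBot/1.0"  # hypothetical crawler name
DEFAULT_DELAY = 10.0  # assumed conservative pause (seconds) between requests

def fetch_politely(urls, robots_url):
    """Fetch only URLs allowed by robots.txt, pausing between requests."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()

    # Honor the site's stated crawl delay if it declares one.
    delay = parser.crawl_delay(USER_AGENT) or DEFAULT_DELAY

    pages = []
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # the publisher has disallowed this path for our user agent
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            pages.append(response.read())
        time.sleep(delay)  # throttle to limit the publisher's bandwidth burden
    return pages
```

Production crawlers add caching, retries, and per-domain budgets, but even this level of throttling addresses the bandwidth complaints described above.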