Study Shows OpenAI’s GPT-4o Model Can Identify Verbatim Excerpts from Paywalled O’Reilly Books

A study co-authored by tech-manual publisher Tim O’Reilly shows that OpenAI trained GPT-4o on parts of his company’s books that were not made freely available.

What happened: O’Reilly, computer scientist Sruly Rosenblat, and economist Ilan Strauss found that GPT-4o was able to identify verbatim excerpts from dozens of O’Reilly Media books that the company kept behind a paywall, indicating that the books likely were included in the model’s training data.

How it works: The researchers adapted the DE-COP method to compare how well GPT-4o, GPT-4o-mini, and GPT-3.5 Turbo recognized paywalled excerpts versus freely available excerpts from the same books.

The team selected 34 O’Reilly Media books and divided them into roughly 14,000 paragraphs.
They labeled the paragraphs private (paywalled) or public (when O’Reilly Media publishes a book, it distributes freely on the web chapters 1 and 4 as well as the first 1,500 characters of other chapters). They also labeled the paragraphs according to whether they were published before or after the models’ knowledge cutoff dates.
The team built multiple-choice quizzes, each composed of a verbatim paragraph and three paraphrased versions generated by Claude 3.5 Sonnet. The researchers ordered the paragraphs and paraphrases in all permutations to eliminate potential position bias.

Results: The authors asked each model to identify the verbatim paragraph and calculated each model’s percentage of correct responses. Then they averaged each model’s accuracy per book and converted the averages into AUROC scores that measure how well a model distinguished books available prior to its knowledge cutoff (potentially included in the training set) from those that weren’t available at the time. 50 percent AUROC indicates random chance, while higher scores indicate higher accuracy.

GPT-4o tended to recognize O’Reilly Media content whether or not it was public, but it recognized private paragraphs (82 percent AUROC) markedly more often than public paragraphs (64 percent AUROC).
GPT-4o-mini’s performance was nearly random for both private (56% AUROC) and public material (55% AUROC). The researchers hypothesize that either (i) the model’s smaller size may limit its ability to memorize or (2) OpenAI may reserve premium data to train larger models.
The earlier GPT-3.5 Turbo recognized public paragraphs (64% AUROC) more often than private paragraphs (54% AUROC), which suggests that it was trained predominantly on freely available data.

Yes, but: Newer large language models are better at distinguishing human-written from generated text, even if it wasn’t in their training sets. For instance, given paragraphs that were published after their knowledge cutoffs, GPT-4o returned scores as high as 78 percent AUROC. The authors note that this may challenge their conclusions, since they interpret high scores to indicate that a model saw the text during training. Nonetheless, they argue that their approach will remain valid while scores for both text that was included and text that was excluded from training sets remain under 96 percent AUROC. “For now,” they write, “the gap remains sufficiently large to reliably separate the two categories.”

Behind the news: Historically AI developers have trained machine learning models on any data they could acquire. But in the era of generative AI, models trained on copyrighted works can mimic the works and styles of the works’ owners, creating a threat to their livelihoods. Some AI developers have responded by regarding data that’s freely available on the web as fair game, and material that’s otherwise protected as off-limits for training. However, datasets that include ostensibly private data are widely circulated, including LibGen, which includes all 34 of the O’Reilly Media titles tested in this study. Moreover, unauthorized copies of many copyrighted works are posted without paywalls or even logins, making it possible even for web scrapers that crawl only the open web to download them. Google and OpenAI, which is currently embroiled in lawsuits by authors and publishers who claim it violated copyrights by training models on copyrighted works, recently lobbied the United States government to relax copyright laws for AI developers.

Why it matters: The AI industry requires huge quantities of high-quality data to keep advancing the state of the art. At the same time, copyright owners are worried that models trained on their data might hamper their opportunities to earn a living. AI developers must find fair ways to respond. As O’Reilly points out, exploiting copyrighted works instead of rewarding their authors could lead to an “extractive dead end” that ultimately diminishes the supply of the high-quality training data.

We’re thinking: We have learned a great deal from O’Reilly Media’s books, and we’re grateful to the many authors, editors, graphic artists, and others who produce them. Meanwhile, it’s time for the U.S. Congress — and legislators internationally — to update copyright laws for the era of generative AI, so everyone knows the rules and we can find ways to follow them.