Publishers push Common Crawl to stop collecting content for AI training

Digital Content Next (DCN) sent the Common Crawl Foundation a cease-and-desist letter demanding that it stop scraping and distributing protected publisher content.

The U.S. trade group, which represents major digital publishers (e.g., the AP, the New York Times, NBC Universal, Bloomberg, NPR, and Fox), also asked Common Crawl to remove DCN members’ content from its datasets, including paywalled and subscriber-only news articles.

Publishers question opt-outs. DCN’s lawyers raised concerns about whether Common Crawl honored publisher opt-out requests and removed older content when asked.

The letter said Common Crawl had, in some cases, told publishers it was complying, only to later say technical costs and delays prevented full removal. DCN’s lawyers said they were reviewing whether those statements may have been inaccurate or misleading.

Common Crawl publishes a registry of sites that have opted out of scraping. The list includes many large news publishers.

DCN alleges infringement. The letter argued that copyright law is not an opt-out system. DCN said Common Crawl “flagrantly infringed” publisher copyrights by creating and distributing datasets containing protected content without permission or compensation.

The group also said Common Crawl made that content available to companies developing AI tools and large language models.

DCN CEO Jason Kint said the legal notice challenges the idea that online content can be collected, stored, and reused simply because it is accessible.

Common Crawl pushes back. Executive Director Rich Skrenta denied that CCBot bypasses paywalls to scrape websites. He also denied misleading publishers after The Atlantic reported in November that some content from publishers that had requested removal remained available.

“When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset,” Skrenta said.

Why we care. This fight could shape how much publisher content AI search engines can use without permission. If courts or settlements impose stricter consent requirements, AI responses may rely more on licensed sources and less on the open web.

AI training stakes. Since 2008, Common Crawl has scraped billions of webpages to build a free public archive. Its datasets have been widely used to train AI models. The New York Times’ 2023 copyright lawsuit against OpenAI cited Common Crawl as making up 60% of GPT-3’s training data, Press Gazette reported.

A 2024 Mozilla Foundation paper said that, in its current form, generative AI likely would not have been possible without Common Crawl.

Common Crawl has been working on open standards for AI crawling preferences, Skrenta said this week. DCN’s letter asks for a harder line: stop scraping protected publisher content and remove member content already in the datasets.

Search Engine Land is owned by Semrush. We remain committed to providing high-quality coverage of marketing topics. Unless otherwise noted, this page’s content was written by either an employee or a paid contractor of Semrush Inc.

Danny Goodwin is Editorial Director of Search Engine Land & Search Marketing Expo – SMX. He joined Search Engine Land in 2022 as Senior Editor. In addition to reporting on the latest search marketing news, he manages Search Engine Land’s SME (Subject Matter Expert) program. He also helps program U.S. SMX events.

Goodwin has been editing and writing about the latest developments and trends in search and digital marketing since 2007. He previously was Executive Editor of Search Engine Journal (from 2017 to 2022), managing editor of Momentology (from 2014 to 2016) and editor of Search Engine Watch (from 2007 to 2014). He has spoken at many major search conferences and virtual events, and has been cited for his expertise by a wide range of publications and podcasts.
