Internet Archive's Wayback Machine Faces Existential Threat to More Than a Trillion Archived Web Pages

Apr 13, 2026 5 min read

Major News Outlets Are Blocking the Wayback Machine

Citing AI scraping concerns, publishers including USA Today and The New York Times have cut off the Internet Archive's web crawler, threatening a crucial resource for journalists and researchers.
A staff member wears a Universal Access to All Knowledge shirt during a 20th anniversary celebration of the Internet...
Photograph: Carlos Avila Gonzalez/Getty Images

Earlier this month, USA Today published a detailed investigation showing how US Immigration and Customs Enforcement withheld critical data about its detention operations. Reporters relied on the Internet Archive's Wayback Machine to reconstruct ICE's evolving statistics and document changes under the Trump administration. The story exemplifies how the Wayback Machine—which systematically crawls and preserves web pages—serves the public interest. It also highlights a troubling contradiction, says Wayback Machine director Mark Graham.

USA Today Co., the media conglomerate that operates the flagship paper and more than 200 local outlets, blocks the Wayback Machine from archiving its content. "They built their investigation using the Wayback Machine's historical records," Graham notes. "Meanwhile, they're preventing anyone from doing the same with their own reporting."

USA Today isn't alone. Several prominent news organizations have recently restricted access to the archiving service, including The New York Times, according to Nieman Lab. Analysis by AI-detection startup Originality AI found that 23 major news sites now block ia_archiverbot, the Internet Archive's primary web crawler. Reddit has implemented similar restrictions. The Guardian takes a different approach: while it doesn't block the crawler outright, it excludes content from the Internet Archive API and filters articles from the Wayback Machine interface, making archived versions difficult for the public to find.
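
The article does not describe how these sites block the crawler. One common mechanism is a robots.txt rule that names the crawler's user agent, and the sketch below assumes that mechanism. The robots.txt text, the example.com URL, and the `is_blocked` helper are hypothetical illustrations, not any publisher's actual configuration; only the `ia_archiverbot` name comes from the reporting above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
# It is not copied from any real publisher's site.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text disallows user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("ia_archiverbot", "SomeOtherBot"):
        status = "blocked" if is_blocked(SAMPLE_ROBOTS_TXT, agent, url) else "allowed"
        print(f"{agent}: {status}")
```

A check like this only reports what a robots.txt file declares. It would not detect an approach like the Guardian's, which leaves the crawler unblocked but limits how archived copies are surfaced.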

USA Today Co. spokesperson Lark-Marie Anton clarified that the company isn't "specifically targeting the Internet Archive" but rather implementing broader protections against all automated scraping. Robert Hahn, the Guardian's director of business affairs and licensing, said the outlet has been discussing with the Archive "concerns about potential misuse by AI companies of content sets gathered for preservation."

Individual journalists are now mounting a defense of the Wayback Machine. This week, the Electronic Frontier Foundation and Fight for the Future organized a coalition that gathered over 100 signatures from working journalists in support of the Internet Archive. The signatories span the media landscape, from MSNBC's Rachel Maddow to independent journalists like Kat Tenbarge of Spitfire News and Taylor Lorenz of User Mag. Their letter emphasizes a critical shift in journalism's infrastructure: "In previous generations, journalists would turn to the physical archives of a local newspaper or of a local public library to access historical reporting and follow the threads of the present back into history. With many newspapers closed, and no clear path for local public libraries to preserve digital-only reporting, the work of safeguarding journalism's record increasingly falls to the Internet Archive."

Laura Flynn, a supervising podcast producer at The Intercept who signed the letter, describes the Internet Archive as an "essential tool" throughout her career, particularly for fact-checking and locating audio clips. Micco Caporale, a writer at the Chicago Reader, relies on the Wayback Machine to research older bands and cultural figures by accessing defunct fan sites that would otherwise vanish from the historical record.

Caporale has found another use for the tool in union organizing work. "I've also been using the Wayback Machine a ton in my union organizing work to find old job listings so we know what the company claimed to hire people for vs. what duties they actually assigned or to see how different positions have been retooled at different points," Caporale explains. "These posts also help us keep track of pay fluctuations across the organization over time."

Some publishers justify blocking the Wayback Machine by citing concerns about AI companies potentially using Internet Archive data for model training. Graham James, a New York Times spokesperson, states that "the issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us." The Times did not specify whether this represents an actual occurrence or a hypothetical risk.

Reddit has cited similar AI-related concerns when blocking the Wayback Machine crawler. The battle between publishers and AI companies over unauthorized training data continues to intensify, with over 100 AI copyright lawsuits currently pending in US courts. The Wayback Machine's comprehensive archive makes it an attractive target for companies seeking training data.

The Internet Archive has operated for three decades and preserved over a trillion web pages. The nonprofit has faced multiple significant legal challenges since 2020. Most recently, it reached a settlement with major music publishers who had sought up to $700 million in damages over the Great 78s project, which archived vintage recordings. While no substantial financial penalty looms currently, the expanding blockade by media outlets threatens the Archive's core mission.

No comparable public tool exists to replace the Wayback Machine. If major news sources continue restricting access, the preservation of early digital history could deteriorate significantly, potentially rendering crucial records inaccessible or lost entirely. The tool has even been used to scrutinize The New York Times itself: In 2016, the paper faced criticism for editorial changes made to an article about then-presidential candidate Bernie Sanders. Those revisions were first documented through the Wayback Machine.

If a similar situation occurred today, media watchdogs would face significant obstacles tracking previous versions of Times articles. A diminished Wayback Machine doesn't just undermine accountability journalism—it also weakens the legal system, as archived pages are regularly cited as evidence in US litigation.

Mark Graham of the Internet Archive remains cautiously optimistic that some publishers may reverse their blocking decisions. He indicates the nonprofit is "in conversation" with the Times and other outlets. For now, however, Graham warns that "there's no question that the general locking-down of more and more of the public web is impacting society's ability to understand what's going on in our world."

Updated: 4/14/26, 12:25 pm EST: This story has been updated to include a citation from Nieman Lab.