kevincloudsec 3 hours ago
I've seen companies fail compliance reviews because a third-party vendor's published security policy, one they referenced in their own controls, no longer exists at the URL they cited. The web being unarchivable isn't just a cultural loss. It's becoming a real operational problem for anyone who has to prove to an auditor that something was true at a specific point in time.
f33d5173 2 hours ago
I've sometimes dreamed of a web where every resource is tied to a hash and can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
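Roughly, the client side of that idea could look like the sketch below. This isn't IPFS's actual protocol; the mirror URLs and bare-hash addressing are made up for illustration. The point is just that any untrusted host can serve the bytes, because the link itself carries the means of verification.

    // Content-addressing sketch: fetch by hash from any mirror and verify locally.
    // Mirror list and URL scheme are hypothetical.
    async function sha256Hex(bytes: Uint8Array): Promise<string> {
      const digest = await crypto.subtle.digest("SHA-256", bytes);
      return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
    }

    // Try each mirror; accept the first response whose hash matches the address.
    async function fetchByHash(hashHex: string, mirrors: string[]): Promise<Uint8Array> {
      for (const mirror of mirrors) {
        try {
          const res = await fetch(`${mirror}/${hashHex}`);
          if (!res.ok) continue;
          const bytes = new Uint8Array(await res.arrayBuffer());
          if ((await sha256Hex(bytes)) === hashHex) return bytes; // verified, so the host doesn't matter
        } catch {
          // mirror unreachable -- try the next one
        }
      }
      throw new Error(`no mirror could serve ${hashHex}`);
    }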
Brian_K_White 2 hours ago
Users control which sites they allow it to record, so no privacy worries, especially assuming the plugin is open source.
No automated crawling. The plugin does not drive the user's browser to fetch things. Just whatever a user happens to actually view on their own; some percentage of those views from the activated domains gets submitted up to some archive.
Not every view. Maybe 100 people each submit 1% of views, and maybe it's a random selection, or maybe it's weighted by some feedback mechanism where the archive destination can say "Hey, if the user views this particular URL, I still don't have that one yet, so definitely send that one if you see it rather than just applying the normal random chance."
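In code, the selection logic could be as simple as the sketch below. The /wanted and /submit endpoints are hypothetical, just to show the feedback mechanism.

    // Sketch: only user-allowed domains, no crawling, and a ~1% random sample of
    // ordinary views gets submitted -- unless the archive says it's missing this URL.
    const ALLOWED_DOMAINS = new Set(["example.org", "example-news.com"]); // user-chosen
    const SAMPLE_RATE = 0.01;

    async function maybeSubmitView(url: string, html: string): Promise<void> {
      if (!ALLOWED_DOMAINS.has(new URL(url).hostname)) return; // site never opted in by the user

      // Hypothetical feedback endpoint: "do you still need this one?"
      const res = await fetch(`https://archive.example/wanted?url=${encodeURIComponent(url)}`);
      const { wanted } = (await res.json()) as { wanted: boolean };

      if (!wanted && Math.random() > SAMPLE_RATE) return; // normal random chance

      await fetch("https://archive.example/submit", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ url, html, observedAt: new Date().toISOString() }),
      });
    }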
Not sure how to protect the archive itself or its operators.
daniel31x13 an hour ago
It stores webpages in multiple formats (HTML snapshot, screenshot, PDF snapshot, and a fully dedicated reader view) so you’re not relying on a single fragile archive method.
There’s both a hosted cloud plan [1] which directly supports the project, and a fully self-hosted option [2], depending on how much control you need over storage and retention.
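For anyone curious what multi-format capture looks like in general, here's a generic illustration with Puppeteer (not this project's actual implementation):

    // Save the same page as raw HTML, a full-page screenshot, and a PDF,
    // so no single representation is a point of failure.
    import puppeteer from "puppeteer";
    import { writeFile } from "node:fs/promises";

    async function snapshot(url: string, name: string): Promise<void> {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "networkidle2" });

      await writeFile(`${name}.html`, await page.content());          // HTML snapshot
      await page.screenshot({ path: `${name}.png`, fullPage: true }); // visual record
      await page.pdf({ path: `${name}.pdf` });                        // printable copy

      await browser.close();
    }

    snapshot("https://example.com", "example").catch(console.error);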
upboundspiral 2 hours ago
The purpose of a search engine is to display links to web pages, not the entire content. As such, it can be argued it falls under fair use. It provides value to the people searching for content and those providing it.
However, we left such a crucially important public utility in the hands of private companies, which have changed their algorithms many times to maximize their profits rather than the public good.
I think there needs to be real competition, and I am increasingly certain that the government should be part of that competition. Both "private" companies and "public" government are biased, but they are biased in different ways, and I think there is real value to be created in this clash. It makes it easier for individuals to pick and choose the best option for themselves, and for third, independent options to be developed.
The current cycle of knowledge generation is academia doing foundational research -> private companies expanding this research and monetizing it -> nothing. If the last step were expanded to the government providing a barebones but usable service to commoditize it, years after private companies have been able to reap immense profits, then the capabilities of the entire society are increased. If the last step is prevented, then the ruling companies turn to rent-seeking and sitting on their laurels, turning from innovating to extracting.
RajT88 3 hours ago
Sell a "truck full of DAT tapes" type service to AI scrapers with snapshots of the IA. Sort of like the cloud providers have with "Data Boxes".
It would fund the IA, be cheaper for them than building and maintaining so many scrapers, and might relieve the pressure on these news sites.
shevy-java 3 hours ago
But then it was not really open content anyway.
> When asked about The Guardian’s decision, Internet Archive founder Brewster Kahle said that “if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
Well - we need something like Wikipedia for news content. Perhaps not 100% Wikipedia; instead, Wikipedia to store the hard facts, with tons of verification, plus a news editorial side that focuses on free content but in a newspaper style, e.g. with professional (or at least good) writers. I don't know how the model could work, but IF we could come up with this, then newspapers that paywall their information would become less relevant automatically. That way we win long-term, as the paywalled content isn't really part of the open web anyway.
jackfranklyn 2 hours ago
I've been building tools that integrate with accounting platforms, and the number of times a platform's API docs or published rate limits have simply disappeared between when I built something and when a user reports it broken is genuinely frustrating. You can't file a support ticket saying "your docs said X" when the docs no longer say anything because they've been restructured.
For compliance specifically - HMRC guidance in the UK changes constantly, and the old versions are often just gone. If you made a business decision based on published guidance that later changed, good luck proving what the guidance actually said at the time. The Wayback Machine has saved me more than once when trying to verify what a platform's published API behaviour was supposed to be versus what it actually does.
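If it helps anyone, the Wayback Machine has an availability API you can script against to pull the snapshot closest to a given date. Something like this (the docs URL and timestamp below are placeholders):

    // Ask the Wayback Machine for the archived copy closest to a given timestamp,
    // e.g. to show what a vendor's published docs said at the time.
    interface WaybackAvailability {
      archived_snapshots: {
        closest?: { url: string; timestamp: string; status: string; available: boolean };
      };
    }

    async function closestSnapshot(url: string, timestamp: string): Promise<string | null> {
      const endpoint =
        `https://archive.org/wayback/available?url=${encodeURIComponent(url)}&timestamp=${timestamp}`;
      const data = (await (await fetch(endpoint)).json()) as WaybackAvailability;
      return data.archived_snapshots.closest?.url ?? null;
    }

    // "What did these docs look like around 1 March 2023?"
    closestSnapshot("developer.example.com/docs/rate-limits", "20230301").then(console.log);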
The SOC 2 / audit trail point upthread is spot on. I'd add that for smaller businesses, it's not just formal compliance frameworks - it's basic record keeping. When your payment processor's fee schedule was a webpage instead of a PDF and that webpage no longer exists, you can't reconcile why your fees changed.
notepad0x90 an hour ago
I've said it before, and I'll say it again: the main issue is not design patterns but the lack of acceptable payment systems. The EU, with its dismantling of Visa and Mastercard, now has the perfect opportunity to solve this, but I doubt it will. It'll probably just create a European WeChat.
colesantiago an hour ago
They do not care, and we will all be worse off for it if these AI companies keep bombarding news publishers' RSS feeds.
It is a shame that the open web as we know it is closing down because of these AI companies.
g-b-r 3 hours ago
Maybe the Internet Archive would be OK with keeping some things private until x time passes; or they could require an account to access them.
sejje 3 hours ago
I am sad about link rot and old content disappearing, but it's better than everything being saved for all time, to be used against folks in the future.