kevincloudsec 3 hours ago
I've seen companies fail compliance reviews because a third-party vendor's published security policy, one they referenced in their own controls, no longer exists at the URL they cited. The web being unarchivable isn't just a cultural loss. It's becoming a real operational problem for anyone who has to prove to an auditor that something was true at a specific point in time.
f33d5173 2 hours ago
I've sometimes dreamed of a web where every resource is tied to a hash and can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
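Roughly, the client side of that idea could look like the sketch below. This isn't IPFS's actual protocol; the mirror URLs and bare-hash addressing are made up for illustration. The point is just that any untrusted host can serve the bytes, because the link itself carries the means of verification.

    // Content-addressing sketch: fetch by hash from any mirror and verify locally.
    // Mirror list and URL scheme are hypothetical.
    async function sha256Hex(bytes: Uint8Array): Promise<string> {
      const digest = await crypto.subtle.digest("SHA-256", bytes);
      return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
    }

    // Try each mirror; accept the first response whose hash matches the address.
    async function fetchByHash(hashHex: string, mirrors: string[]): Promise<Uint8Array> {
      for (const mirror of mirrors) {
        try {
          const res = await fetch(`${mirror}/${hashHex}`);
          if (!res.ok) continue;
          const bytes = new Uint8Array(await res.arrayBuffer());
          if ((await sha256Hex(bytes)) === hashHex) return bytes; // verified, so the host doesn't matter
        } catch {
          // mirror unreachable -- try the next one
        }
      }
      throw new Error(`no mirror could serve ${hashHex}`);
    }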
Brian_K_White 2 hours ago
Users control which sites they allow it to record, so no privacy worries, especially assuming the plugin is open source.
No automated crawling. The plugin does not drive the user's browser to fetch things. Just whatever a user happens to actually view on their own; some percentage of those views from the activated domains gets submitted up to some archive.
Not every view. Maybe 100 people each submit 1% of views, and maybe it's a random selection, or maybe it's weighted by some feedback mechanism where the archive destination can say "Hey, if the user views this particular URL, I still don't have that one yet, so definitely send that one if you see it rather than just applying the normal random chance."
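In code, the selection logic could be as simple as the sketch below. The /wanted and /submit endpoints are hypothetical, just to show the feedback mechanism.

    // Sketch: only user-allowed domains, no crawling, and a ~1% random sample of
    // ordinary views gets submitted -- unless the archive says it's missing this URL.
    const ALLOWED_DOMAINS = new Set(["example.org", "example-news.com"]); // user-chosen
    const SAMPLE_RATE = 0.01;

    async function maybeSubmitView(url: string, html: string): Promise<void> {
      if (!ALLOWED_DOMAINS.has(new URL(url).hostname)) return; // site never opted in by the user

      // Hypothetical feedback endpoint: "do you still need this one?"
      const res = await fetch(`https://archive.example/wanted?url=${encodeURIComponent(url)}`);
      const { wanted } = (await res.json()) as { wanted: boolean };

      if (!wanted && Math.random() > SAMPLE_RATE) return; // normal random chance

      await fetch("https://archive.example/submit", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ url, html, observedAt: new Date().toISOString() }),
      });
    }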
Not sure how to protect the archive itself or its operators.
daniel31x13 an hour ago
It stores webpages in multiple formats (HTML snapshot, screenshot, PDF snapshot, and a fully dedicated reader view) so you’re not relying on a single fragile archive method.
There’s both a hosted cloud plan [1] which directly supports the project, and a fully self-hosted option [2], depending on how much control you need over storage and retention.
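For anyone curious what multi-format capture looks like in general, here's a generic illustration with Puppeteer (not this project's actual implementation):

    // Save the same page as raw HTML, a full-page screenshot, and a PDF,
    // so no single representation is a point of failure.
    import puppeteer from "puppeteer";
    import { writeFile } from "node:fs/promises";

    async function snapshot(url: string, name: string): Promise<void> {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "networkidle2" });

      await writeFile(`${name}.html`, await page.content());          // HTML snapshot
      await page.screenshot({ path: `${name}.png`, fullPage: true }); // visual record
      await page.pdf({ path: `${name}.pdf` });                        // printable copy

      await browser.close();
    }

    snapshot("https://example.com", "example").catch(console.error);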
upboundspiral 2 hours ago
The purpose of a search engine is to display links to web pages, not the entire content. As such, it can be argued it falls under fair use. It provides value to the people searching for content and those providing it.
However, we left such a crucially important public utility in the hands of private companies, which have changed their algorithms many times to maximize their profits rather than the public good.
I think there needs to be real competition, and I am increasingly certain that the government should be part of that competition. Both "private" companies and "public" government are biased, but they are biased in different ways, and I think there is real value to be created in this clash. It makes it easier for individuals to pick and choose the best option for themselves, and for third, independent options to be developed.
The current cycle of knowledge generation is academia doing foundational research -> private companies expanding this research and monetizing it -> nothing. If the last step were expanded to the government providing a barebones but usable service to commoditize it, years after private companies have been able to reap immense profits, then the capabilities of the entire society are increased. If the last step is prevented, then the ruling companies turn to rent-seeking and sitting on their laurels, turning from innovating to extracting.
RajT88 3 hours ago
Sell a "truck full of DAT tapes" type service to AI scrapers with snapshots of the IA. Sort of like the cloud providers have with "Data Boxes".
It would fund the IA, be cheaper for them than building and maintaining so many scrapers, and might relieve the pressure on these news sites.
shevy-java 3 hours ago
But then it was not really open content anyway.
> When asked about The Guardian’s decision, Internet Archive founder Brewster Kahle said that “if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
Well - we need something like Wikipedia for news content. Perhaps not 100% Wikipedia; instead, Wikipedia to store the hard facts, with tons of verification, plus a news editorial side that focuses on free content but in a newspaper style, e.g. with professional (or at least good) writers. I don't know how the model could work, but IF we could come up with this, then newspapers that paywall their information would become less relevant automatically. That way we win long-term, as the paywalled content isn't really part of the open web anyway.
jackfranklyn 2 hours ago
I've been building tools that integrate with accounting platforms, and the number of times a platform's API docs or published rate limits have simply disappeared between when I built something and when a user reports it broken is genuinely frustrating. You can't file a support ticket saying "your docs said X" when the docs no longer say anything because they've been restructured.
For compliance specifically - HMRC guidance in the UK changes constantly, and the old versions are often just gone. If you made a business decision based on published guidance that later changed, good luck proving what the guidance actually said at the time. The Wayback Machine has saved me more than once when trying to verify what a platform's published API behaviour was supposed to be versus what it actually does.
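If it helps anyone, the Wayback Machine has an availability API you can script against to pull the snapshot closest to a given date. Something like this (the docs URL and timestamp below are placeholders):

    // Ask the Wayback Machine for the archived copy closest to a given timestamp,
    // e.g. to show what a vendor's published docs said at the time.
    interface WaybackAvailability {
      archived_snapshots: {
        closest?: { url: string; timestamp: string; status: string; available: boolean };
      };
    }

    async function closestSnapshot(url: string, timestamp: string): Promise<string | null> {
      const endpoint =
        `https://archive.org/wayback/available?url=${encodeURIComponent(url)}&timestamp=${timestamp}`;
      const data = (await (await fetch(endpoint)).json()) as WaybackAvailability;
      return data.archived_snapshots.closest?.url ?? null;
    }

    // "What did these docs look like around 1 March 2023?"
    closestSnapshot("developer.example.com/docs/rate-limits", "20230301").then(console.log);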
The SOC 2 / audit trail point upthread is spot on. I'd add that for smaller businesses, it's not just formal compliance frameworks - it's basic record keeping. When your payment processor's fee schedule was a webpage instead of a PDF and that webpage no longer exists, you can't reconcile why your fees changed.
notepad0x90 an hour ago
I've said it before, and I'll say it again: the main issue is not design patterns but the lack of acceptable payment systems. The EU, with its dismantling of Visa and Mastercard, now has the perfect opportunity to solve this, but I doubt it will. It'll probably just create a European WeChat.
colesantiago an hour ago
They do not care, and we will all be worse off for it if these AI companies keep bombarding news publishers' RSS feeds.
It is a shame that the open web as we know it is closing down because of these AI companies.
g-b-r 3 hours ago
Maybe the Internet Archive would be OK with keeping some things private until x time passes; or they could require an account to access them.
sejje 3 hours ago
I am sad about link rot and old content disappearing, but it's better than everything being saved for all time, to be used against folks in the future.