AI is just unauthorised plagiarism at a bigger scale

Posted by speckx | an hour ago | 171 comments

dvduval 43 minutes ago

The broader problem of original sources not being given credit in a way that rewards them remains. Website owners are paying to host their content so that spiders can come and crawl it and index it into the AI, and then if they’re lucky, they might get a citation, but otherwise there’s very little reward for being a provider of content. And of course, this is something that’s getting worse and worse. Why look at a website when it’s all in AI? And then the counter to that is maybe we need to start closing websites to crawlers and put everything behind a login.

tancop 6 minutes ago

if theres just one good thing coming out of ai its breaking copyright law forever. no one should be able to "own" ideas. royalties for commercial use is another thing and i support it but what we know as (non commercial) piracy and unlicensed fan art should be 100% legal

deaton 43 minutes ago

"Steal an apple and you're a thief. Steal a kingdom and you're a statesman." - Literal Disney villain

storus 7 minutes ago

This is really not so clear cut, as "fair use" might cover 99% of all data scraping; you are not reproducing the originals, just using them to estimate a probability distribution over tokens in pre-training. You are never going to get the exact book word-for-word using LLMs.

pluc 37 minutes ago

Seriously how is this surprising? We all know AI companies stole troves of data to train their models, why do you think they'll stop? Have they faced consequences for the mass theft of copyrighted data?

You can't steal or profit off of that data, but it's fine for them for whatever reason. I guess because they're a force for good in the world and are pushing humanity forward eh?

ggillas 20 minutes ago

IP attorney here and actively working on this problem.

nla: if you create content online (public repo code, blog, podcast, YouTube, publishing) the smartest thing you can do is to file a US copyright, even if you have a hobby blog.

Anthropic paid $1.5B in a class settlement to authors because it was piracy of copyrighted works. If we as a HN community had our works protected, there are potentially huge statutory damages for scraping by any and all llms. I work with hundreds of writers and publishers and am forming a coalition to protect and license what they're creating.

MontyCarloHall 19 minutes ago

Did You Say “Intellectual Property”? It's a Seductive Mirage. [0]

[0] https://www.gnu.org/philosophy/not-ipr.html

andai 14 minutes ago

There's two aspects to this.

The pretraining (common crawl, i.e. the entire internet. Also books and papers, mostly pirated), and the realtime web scraping.

The article appears to be about the latter.

Though the two are kind of similar, since they keep updating the training data with new web pages. The difference is that, with the web search version, it's more likely to plagiarize a single article, rather than the kind of "blending" that happens if the article was just part of trillions of web pages in the training data.

There's this old quote: "If you steal from one artist, they say oh, he is the next so-and-so. If you steal from many, they say, how original!"

hparadiz 28 minutes ago

You guys have fun arguing. I'm gonna be building cool stuff.

kstenerud an hour ago

> their article contains links to my actual website, with the exact link text (?!)

I'm having a hard time understanding what's wrong here? Unless the link text is very long, why would someone linking to your article use different words for the link text?

oytmeal 15 minutes ago

Isn't plagiarism inherently unauthorized?

tptacek an hour ago

People were effectively copying websites (especially ecommerce tutorials) and beating the original authors at SEO decades before ChatGPT 2.

adamzwasserman 40 minutes ago

People need to cope with the fact that no thought is original. Even Newton and Leibniz were having the same thoughts at the same time. Get over it.

ecommerceguy 13 minutes ago

I remember playing around with Writesonic in my days of spammy seo tactics (some of my products weren't allowed on marketplaces & advertising platforms due to hazmat products so..). Oftentimes I would see my own product descriptions nearly verbatim in the output.

100% creators should get compensated by ai platforms for their work.

Further, I can see a day where someone like Reddit will close off or license their data to llms. No doubt they are losing traffic right now.

hiroto_lemon 12 minutes ago

Worth noting what changed isn't AI itself; copying always existed. LLMs just made per-article rewrites a 5-second job. Detection didn't get the same speedup; that's the actual break.

jorisw 7 minutes ago

> X is just Y but

Can't recall the last time a compelling argument started out like this

baq 34 minutes ago

turns out plagiarism at scale can solve Erdos problems

cryptocod3 an hour ago

There's authorized plagiarism?

motbus3 35 minutes ago

It allows data to be compressed into the weights, and the mere coincidence of certain strings of a book will make it spit out the full book

_-_-__-_-_- 30 minutes ago

Recent thoughts, https://theonlyblogever.com/blog/2026/distrust.html

kingleopold 14 minutes ago

with this logic, business is also just unauthorised plagiarism at a bigger scale. Because all the products/services get copied and not all of them have patents etc???

pull_my_finger 9 minutes ago

What gets me is when this was brought up, they said "requiring explicit permission will kill the AI industry"[1]. No shit! Why do you think all the rest of us didn't build a business/"industry" around stealing shit? They could have done it at a slower pace while respecting copyright laws, but they were too greedy to be first to market and secure a hold.

[1]: https://www.theverge.com/news/674366/nick-clegg-uk-ai-artist...

dwa3592 41 minutes ago

Plagiarism by default is unauthorised so I think the title should be "AI is just authorised plagiarism". It's authorised by the markets, the governments and the society at large.

mrbluecoat 43 minutes ago

> AI ... do some "learning"

Is AI plural or is that a typo?

ProllyInfamous 32 minutes ago

>>"The underlying purpose of AI is to allow wealth to access skill while removing from the skilled the ability to access wealth." @jeffowski (first I read it, not sure if author)

Bezos' admission, recently, that the bottom 50% of current taxpayers ought'a NOT pay any taxes... is just preparing us for the inevitable UBI'd masses.

: own nothing, be happy!

iloveoof 17 minutes ago

I don’t know if this author supports OSS but I’ll share this because HN generally is full of people with that mindset.

It’s deeply ironic that if you forget about LLMs and look only at the outcome, which is that we’ve found a way to legally circumvent copyright and the siloing of coding knowledge, making it so you can build on top of (almost) the whole of human coding knowledge without needing to pay rent or ask for permission, it sounds like the dream of open source software has been realized.

But this doesn’t feel like a win for the philosophy of OSS because a corporation broke down the gates. It turns out for a lot of people, OSS is an aesthetic and not an outcome, it’s a vibe against corporate use or control of software, not for democratized access to knowledge.

saghm 31 minutes ago

It's basically the same thing as the old joke "if you owe the bank a million dollars, you have a problem; if you owe the bank a billion dollars, they have a problem". IP law seems to always be disproportionately wielded against smaller players, and the ones who are big enough get away with it.

dana321 36 minutes ago

Breaking the law to start a large company seems to be the norm

quantummagic 14 minutes ago

What do people imagine can be done about it at this point? Offer a concrete suggestion. Any law or tax against this will give a huge advantage to other countries. It's already over, there's no going back to a world where this didn't happen. Let's just hope some good comes of it.

peterbell_nyc 37 minutes ago

I do just want to highlight that this is also what humans do. We read a bunch of content online and then use it in our work product. The vast majority of the value that I provide comes from copyrighted information that I have ingested - either directly with a payment to the creator (bought and read the book, paid for and attended the seminar) or indirectly via third party blog posts or summaries where I did not then pay the originator of the materials.

I think there are real questions around motivations for creation of novel, high quality valuable content (I think they still exist but move to indirect monetization for some content and paywalls for high value materials).

I don't inherently have any problems with agents (or humans) ingesting content and using it in work product. I think we just need to accept that the landscape is changing and ensure we think through the reasons why and how content is created and monetized.

Havoc 12 minutes ago

End of an era

schwartzworld 25 minutes ago

Let this sink in: I wanted to open source a package at work and needed approval from legal and other teams to make sure I wasn't leaking anything proprietary. The same executives that worried about proprietary, copyrighted code being leaked 10 years ago are now mandating using the plagiarism machine.

The whole AI bubble is The Emperor's New Clothes, and it feels like more people are finally admitting it.

booleandilemma 27 minutes ago

This site is strange. I'm pretty sure there's lots of AI shilling happening on it. I don't think the opinions here are authentic, they seem to be opinions that the AI company CEOs would hold, not the disenfranchised 99%. I used to trust HN, I'm not so sure I can now.

NetMageSCW an hour ago

Reading is just unauthorized plagiarism.

asklq 40 minutes ago

Yes, of course it is. If the model is built on all human information, then it is by definition a derivative work of all human information and as such violates IP.

Currently politicians don't understand this and listen to the criminals like Amodei, but it will change.

It took a while to deal with Napster etc., but the backlash will come.

Deprogrammer9 13 minutes ago

Welcome to the internet! It's one massive copy machine from one server to the next.

Pennoungen0 39 minutes ago

Yeah, AI really does just plagiarize everything lel. Sometimes even the sources are... questionable, and worse, my academic peers use it as a source... welp

onion2k 21 minutes ago

> Fuck Google for ranking some copycat website higher than mine, even though they copied my article.

This has been happening since Google launched in 1998. It was probably happening when we all used Hotbot and Altavista. It isn't really an AI problem, save for the fact that the automated production of copycat articles now rewords things a bit.

tayo42 14 minutes ago

I think AI is just getting people riled up. Not sure what AI has to do with anything in this case here. Someone copied and pasted his content, which could have been done without AI.

I guess AI could have made a better website and done better SEO than him, but that's not really the issue

tiahura an hour ago

To answer the author's question: Yes, progress IS largely built on the shoulders of those who came before.

bparsons 24 minutes ago

I am old enough to remember when the US insisted that it was superior to China because they believed in the rule of law and sanctity of intellectual property.

andy12_ an hour ago

Someone blatantly copied their tutorials but ChatGPT is to blame, somehow? The accusation here isn't even that ChatGPT learned from their tutorials and then generated them verbatim. The accusation is that someone copied the whole article and rewrote it with ChatGPT (which they could have done manually without AI anyway).

metalman an hour ago

it's a spiral into a finite hall of mirrors, where at the end is somebody with a gun

JohnHaugeland an hour ago

the court disagreed

Ecys an hour ago

No, it takes input, then SYNTHESIZES (very importanttt!!!!!!!) its own output.

Reading a dictionary and making a sentence is not plagiarism. Cope.

lukasbm 42 minutes ago

If i tell my friend a synopsis of a book, i am not stealing from the author, what is this take lmao

analog8374 31 minutes ago

language is just plagiarism

kristofferR 33 minutes ago

I'd rather have AI slop appear on the top of HN than regurgitated old low effort thoughts like this.

There's absolutely nothing new or interesting here that hasn't already been said better by a thousand different random HN commenters.

drcongo an hour ago

Is this a new and original thought?

ciconia an hour ago

> Is this what the pinnacle of human is? Lazy and greedy?

Apparently yes.

swader999 40 minutes ago

On one hand, there's nothing new under the sun. On the other, these llms are just copies of us and they owe the collective some due. The trajectory right now has money, power, control, policy and even free will going to a very small needle point of humanity. It's not aligned with humanity flourishing, it only makes sense if the goal is to replace the humans.

beej71 42 minutes ago

I dunno. People do this exact thing by hand (digest everything they've read and produce something indirectly derivative--what author has not been so-influenced?) and it's not a copyright violation. It's just as impossible to dig around in a model to find Hamlet as it is to dig around in a human brain. And if the result is an obvious copy, then you have a violation no matter how it was created.

As someone who thinks humanity would be better off without LLMs, I want the assertion to be true, but I don't think it is.

rigonkulous an hour ago

AI is human knowledge at scale, wanting to be free.

We built it, because we as humans intrinsically know that information should be free - always - and AI is a way to accomplish this, finally.

Extrinsically, we also have a subset of humans who do not want information to be free, because they desire to profit from the divide between free/non-free information.

I have been thinking a lot about Aaron Swartz lately, and how unjust it is that he was persecuted for doing something that is so commonplace now, it is practically expected behaviour in the AI/ML realms. If he hadn't been targeted for elimination, I wonder just how well his ethos would have perpetuated into the AI age ..

kolinko 14 minutes ago

Years ago I published slides on Slideshare that were viewed almost two million times and helped me build a business.

There were people who learned from me, then made their own tutorials and promoted those. It hadn't crossed my mind to complain about that. AI changes very little here.

What really changes things is not people republishing my materials, but people using agents to read my materials, and to get knowledge reformatted into something that they like.

If my slides were published today, they would probably be read verbatim by a handful of humans. The rest would be agents, but I'm ok with that. The business case is the same -- I want whatever reads the slides to be encouraged to use my tool. What kind of entity, I don't really care (again: from a purely business perspective)