cross-posted from: https://lemmy.world/post/949452

OpenAI's ChatGPT and Sam Altman are in massive trouble. OpenAI is getting sued in the US for illegally using content from the internet to train their LLM or large language models

top 50 comments

sorted by: hot top controversial new old

[–] RatzChatsubo@vlemmy.net 7 points 2 years ago (2 children)

So we can sue robots but when I ask if we can tax them, and reduce human working hours, I'm the crazy one?

[–] atzanteol@sh.itjust.works 2 points 2 years ago

So we can sue robots

... No?

[–] slipperydippery@lemmy.world 0 points 2 years ago (2 children)

What would you tax exactly? Robots don't earn an income and don't inherently make a profit. You could tax a company or owner who profits off of robots and/or sells their labor.

[–] RatzChatsubo@vlemmy.net 0 points 2 years ago (2 children)

It would have to be some sort of moderated labor cost saving tax kind of thing enforced by the government

[–] PlebsicleMcGee@feddit.uk 1 points 2 years ago* (last edited 2 years ago)

If we think of production as costing land, labour and capital, then more efficient methods of production would likely swap labour for capital. In that case then we just tax capital growth like we're doing now (Only properly, like without the loopholes). No need to complicate it past that

[–] devzero@sh.itjust.works 1 points 2 years ago (1 children)

Should we tax bulldozers because they take away jobs from people using shovels? What about farm equipment, since they take away jobs from people picking fruit by hand? What about mining equipment, because they take away jobs from people using pickaxes?

[–] RatzChatsubo@vlemmy.net -1 points 2 years ago* (last edited 2 years ago)

If the machine replaced the human, yes. That's the argument being made currently.

Imagine if we simply taxed machine profits after 40 hours of work. You not only can give kickbacks to large companies, but you could also rewire profits to UBI

load more comments (1 replies)

[–] magnetosphere@sh.itjust.works 2 points 2 years ago

I can’t speak for others, but I don’t consider posts I made on a website I don’t own to be my property. If anything, it’s amusing to think of my idiotic rants making up a tiny fraction of an AIs “knowledge”.

[–] JohnnyCanuck@sh.itjust.works 2 points 2 years ago* (last edited 2 years ago) (1 children)

"Massive Trouble"

Step 1 - Scrape everyone's data to make your LLM and make a high profile deal worth $10B

Step 2 - Get sued by everyone whose data you scraped

Step 3 - Settle and everyone in the class will be eligible for $5 credit using ChatGPT-4

Step 4 - Bask in the influx of new data

Step 5 - Profit

[–] manitcor@lemmy.intai.tech 2 points 2 years ago (1 children)

i posted on the public internet with the intent and understanding that it would be crawled by systems for all kinds of things. if i dont want content to be grabbed i dont publish it publicly

you can't easily have it both ways imo. even with systems that do strong pki if you want the world in general to see it you are giving up a certain amount of control over how the content gets used.

law does not really matter here as much as people would like to try to apply it, this is simply how public content will be used. Go post in a garden if you don't want to get scrapped, just remember the corollary is your reach, your voice is limited to the walls of that garden.

[–] lemmyvore@feddit.nl 0 points 2 years ago (1 children)

What you said makes a lot of sense. But here's the catch: it assumes OpenAI checked the licensing for all the stuff they grabbed. And I can guarantee you they didn't.

It's damn near impossible to automatically check the licensing for all the stuff they got she we know for a fact they got stuff whose licensing does not allow it to be used this way. Microsoft has already been sued for Copilot, and these lawsuits will keep coming. Assuming they somehow managed to only grab legit material and they used excellent legal advisors that assured them out would stand in court, it's definitely impossible to tell what piece of what goes where after it becomes a LLM token, and also impossible to tell what future lawsuits will decide about it.

Where does that leave OpenAI? With the good ol' "I grabbed something off the internet because I could". Why does that sound familiar? It's something people have been doing since the internet was invented, it's commonly referred to as "piracy". But it's supposed to be wrong and illegal. Well either it's wrong and illegal for everybody or the other way around.

[–] rbhfd@lemmy.world 2 points 2 years ago

The difference between piracy and having your content used for training a generative model, is that in the latter case, the content isn't redistributed. It's like downloading a movie from netflix (and eventually distributing it for free) vs watching a movie on netflix and using it as inspiration to make your own movie.

The legality of it all is unclear and most of that is because the technology evolved so quickly that the legal framework is just not equipped to deal with it. Despite the obvious moral issues with scraping artist's content.

[–] state_electrician@discuss.tchncs.de 2 points 2 years ago

I don't see how this is any different than humans copying or being inspired by something. While I hate seeing companies profiting off of the commons while giving nothing of value back, how do you prove that an AI model is using your work in any meaningful or substantial way? What would make me really mad is if this dumb shit leads to even harsher copyright laws. We need less copyright not more.

[–] Treemaster099@pawb.social 2 points 2 years ago* (last edited 2 years ago) (2 children)

Good. Technology always makes strides before the law can catch up. The issue with this is that multi million dollar companies use these gaps in the law to get away with legally gray and morally black actions all in the name of profits.

Edit: This video is the best way to educate yourself on why ai art and writing is bad when it steals from people like most ai programs currently do. I know it's long, but it's broken up into chapters if you can't watch the whole thing.

[–] PlebsicleMcGee@feddit.uk 1 points 2 years ago (1 children)

Totally agree. I don't care that my data was used for training, but I do care that it's used for profit in a way that only a company with big budget lawyers can manage

[–] CoderKat@lemm.ee 2 points 2 years ago* (last edited 2 years ago) (2 children)

But if we're drawing the line at "did it for profit", how much technological advancement will happen? I suspect most advancement is profit driven. Obviously people should be paid for any work they actually put in, but we're talking about content on the internet that you willingly create for fun and the fact it's used by someone else for profit is a side thing.

And quite frankly, there's no way to pay you for this. No company is gonna pay you to use your social media comments to train their AI and even if they did, your share would likely be pennies at best. The only people who would get paid would be companies like reddit and Twitter, which would just write into their terms of service that they're allowed to do that (and I mean, they already use your data for targeting ads and it's of course visible to anyone on the internet).

So it's really a choice between helping train AI (which could be viewed as a net benefit for society, depending on how you view those AIs) vs simply not helping train them.

Also, if we're requiring payment, only the super big AI companies can afford to frankly pay anything at all. Training an AI is already so expensive that it's hard enough for small players to enter this business without having to pay for training data too (and at insane prices, if Twitter and Reddit are any indication).

[–] Johem@lemmy.world 1 points 2 years ago

Reddit is currently trying to monetize their user comments and other content by charging for API access. Which creates a system where only the corporations profit and the users generating the content are not only unpaid, but expected to pay directly or are monetized by ads. And if the users want to use the technogy trained by their content they also have to pay for it.

Sure seems like a great deal for corporations and users getting fleeced as much as possible.

[–] programmer_belch@lemmy.dbzer0.com 1 points 2 years ago

Hundreds of projects in github are supported by donations, innovation happens even without profit incentives. It may slow down the pace of AI development but I am willing to wait anothrt decade for AIs if it protects user data and let's regulation catch up.

[–] archomrade@midwest.social 0 points 2 years ago (1 children)

I'm honestly at a loss for why people are so up at arms about OAI using this practice and not Google or Facebook or Microsoft, ect. It really seems we're applying a double standard just because people are a bit pissed at OpenAI for a variety of reasons, or maybe just vaguely mad at the monetary scale of "tech giants"

My 2 cents: I don't think content posted on the open internet (especially content produced by users on a free platform being claimed not by those individuals but by the platforms themselves) should be litigated over, when that information isnt even being reproduced but being used on derivative works. I think it's conceptually similar to an individual reading a library of books to become a writer and charge for the content they produce.

I would think a piracy community would be against platforms claiming ownership over user generated content at all.

[–] Treemaster099@pawb.social 0 points 2 years ago (1 children)

https://youtu.be/9xJCzKdPyCo

This video can answer just about any question you ask. It's long, but it's split up into chapters so you can see what questions he's answering in that chapter. I do recommend you watch the whole thing if you can. There's a lot of information that I found very insightful and thought provoking

[–] archomrade@midwest.social 0 points 2 years ago* (last edited 2 years ago) (1 children)

Couple things:

While I appreciate this gentleman's copywrite experience, I do have a couple comments:

his analysis seems primarily focused from a law perspective. While I don't doubt there is legal precedent for protection under copywrite law, my personal opinion is that copywrite is a capitalist conception that is dependent on an economic reality I fundamentally disagree with. Copywrite is meant to protect the livelihoods of artists, but I don't think anyone's livelihood should be dependent on having to sell labor. More often, copywrite is used to protect the financial interests of large businesses, not individual artists. The current litigation is between large media companies and OAI, and any settlement isn't likely to remunerate much more than a couple dollars to individual artists, and we can't turn back the clock to before AI could displace the jobs of artists, either.
I'm not a lawyer, but his legal argument is a little iffy to me... Unless I misunderstood something, he's resting his case on a distinction between human inspiration (i.e. creative inspiration on derivative works) and how AI functions practically (i.e. AI has no subjective "experience" so it cannot bring its own "hand" to a derivative work). I don't see this as a concrete argument, but even if I did, it is still no different than individual artists creating derivative works and crossing the line into copywrite infringement. I don't see how this argument can be blanket applied to the use of AI, rather than individual cases of someone using AI on a project that draws too much from a derivative work.

The line is even less clear when discussing LLMs as opposed to T2I or I2I models, which I believe is what is being discussed in the lawsuit against OAI. Unlike images from DeviantArt and Instagram, text datasets from sources like reddit, Wikipedia, and Twitter aren't protected under copywrite like visual media. The legal argument against the use of training data drawn from public sources is even less clear, and is even more removed to protecting the individual users and is instead a question of protecting social media sites with questionable legal claim to begin with. This is the point id expect this particular community would take issue with: I don't think reddit or Twitter should be able to claim ownership over their user's content, nor do I think anyone should be able to revoke consent over fair use just because it threatens our status quo capitalist system.

AI isn't going away anytime soon, and litigating over the ownership of the training data is only going to serve to solidify the dominant hold over our economy by a handful of large tech giants. I would rather see large AI models be nationalized, or otherwise be protected from monopolization.

[–] Treemaster099@pawb.social 0 points 2 years ago (1 children)

I don't really have the time to look for timestamps, but he does present his arguments from many different angles. I highly recommend watching the whole thing if you can.

Aside from that, the main thing I want to address is the responsibility of these big corporations to curate the massive library of content they gather. It's entirely in their power to blacklist certain things like PII or sensitive information or hate speech, but they decided not to because it was cheaper. They took a gamble that people either wouldn't care, didn't have the resources to fight it, or would actively support their theft if it meant getting a new toy to play with.

Now that there's a chance they could lose a massive amount of money, this could deter other ai companies from flagrantly breaking the law and set a better standard that protects people's personal data. Tbh I don't really think this specific case has much ground to stand on, but it's the first step in securing more safety for people online. Imagine if the database for this ai was leaked. Imagine all of the personal data, yours and mine included, that would be available to malicious people. Imagine the damage that could cause.

[–] archomrade@midwest.social 1 points 2 years ago

They do curate the data somewhat, though it's not easy to verify if they did since they don't share their data set (likely because they expect legal challenge)

There's no evidence they have "personal data" beyond direct textual data scraped from platforms such as reddit (much of which is disembodied from other metadata). I care FAR more about data google, facebook, or microsoft has leaking than I do text written on my old reddit or twitter account, and somehow we're not wringing our hands about that data collection.

I watched most of that video, and i'm frankly not moved by much of it. The video seems primarily (if not entirely) written in response to generative image models and image data that may actually be protected under existing copywrite, unlike the textual data in question in this particular lawsuit. Despite that, I think his interpretation of "derivative work" hand waving is flimsy at best, and relies on a materialist perspective that I just can't identify with (a pragmatic framework might be more persuasive to me). A case-by-case basis of copywrite infringement of the use of AI tools is the most solid argument he makes, but I am just not persuaded that all AI is theft based on publicly accessible data being used as training data. And i just don't think copywrite law is an ideal solution to a growing problem with technological automation and ever increasing levels of productivity and stagnating levels of demand.

I'm open to being wrong, but i think copywrite law doesn't address the long-term problems introduced by AI and is instead a shortcut to maintaining a status quo destined to failure regardless.

[–] Geograph6@lemmy.dbzer0.com 1 points 2 years ago (8 children)

People talk about OpenAI as if its some utopian saviour that's going to revolutionise society. When in reality its a large corporation flooding the internet with terrible low-quality content using machine learning models that have existed for years. And the fields it is "automating" are creative ones that specifically require a human touch, like art and writing. Language learning models and image generation isn't going to improve anything. They're not "AI" and they never will be. Hopefully when AI does exist and does start automating everything we'll have a better economic system though :D

load more comments (8 replies)

[–] sycamore@lemmy.world 1 points 2 years ago (1 children)

I once looked outside. Could I be sued for observing a public space?

[–] manitcor@lemmy.intai.tech 1 points 2 years ago

i once looked at a picture of spider man and badman then made a crappy drawing biterman

to jail with me!

[–] ArmokGoB@lemmy.world 0 points 2 years ago (1 children)

Curious to see if this goes anywhere.

[–] WagnasT@iusearchlinux.fyi 1 points 2 years ago (1 children)

inal but i think it's going to come down to the terms of service where the data was scraped from. If the terms say the stuff you post can be shared with third parties then they might not have a leg to stand on. Where it gets sketchy is if someone posted someone else's work, then the original author had no say in it being shared with a third party, BUT, is that the fault of the third party or the service provider that shared it?

Also, if i were exposed to copyright material through some unauthorised person distributing it can i not summarize the information? I guess i don't know enough about fair use to answer that.

The wording in the article says they are being sued for stealing their data, this seems like a stretch but i guess i'll wait for more details of the case.

[–] ArmokGoB@lemmy.world 2 points 2 years ago

The thing is that the images are used to train a set of weights and biases; the training data isn't distributed as part of the AI or as part of the software used to generate images.

[–] UntouchedWagons@lemmy.ca 0 points 2 years ago (1 children)

Piracy isn't stealing and neither is this.

[–] dmmeyournudes@lemmy.world 0 points 2 years ago (1 children)

Piracy is literally theft, what are you talking about?

[–] hightrix@lemmy.world 1 points 2 years ago

It is absolutely not theft. If you’d like a physical crime to compare it to, forgery would be what you are looking for. But piracy is not at all theft.

That is, unless you are talking about Captain Davy Jones and his pirate ship. That type of piracy is theft.

[–] Uriel238@lemmy.fmhy.ml 0 points 2 years ago (1 children)

If this lawsuit is ruled in favor of the plaintiff, it might lead to lawsuits against those who have collected and used private data more maliciously, from advertisement-targeting services to ALPR services that reveal to law enforcement your driving habits.

[–] Ajen@sh.itjust.works 2 points 2 years ago

So some of the most profitable corporations in the world? In that case this lawsuit isn't going anywhere.

[–] redditsucks@lemmy.world 0 points 2 years ago (2 children)

Hope it goes through and sets a president.

[–] sorenant@lemmy.world 1 points 2 years ago

Vote Skynet for 2024 Presidential Election, the efficient choice!

load more comments (1 replies)

[–] Technoguyfication@lemmy.ml 0 points 2 years ago (3 children)

It’s wild to see people in the piracy community of all places have an issue with someone benefiting from data they got online for free.

[–] briongloid@aussie.zone 2 points 2 years ago

Many of us are sharing without reward and have strong ethical beliefs regarding for-profit distribution of material versus non-profit sharing.

[–] DankMemeMachine@lemmy.world 1 points 2 years ago (1 children)

The difference is that they are profitting from other people's work and properties, I don't profit from watching a movie or playing a game for free, I just save some money.

[–] Holodeck_Moriarty@lemm.ee 1 points 2 years ago (1 children)

You do if you make games or movies and those things give you inspiration.

This is just how learning is done though, whether it's AI or human.

[–] DankMemeMachine@lemmy.world 1 points 2 years ago (1 children)

Absolutely not comparable. Inspiration and an amalgation of everything a LLM consumes are completely different things.

[–] Holodeck_Moriarty@lemm.ee 1 points 2 years ago

I'd argue that what we do is an amalgamation of what we are exposed to, to a great extent. And we are exposed to way less information than a LLM.

[–] arinot@lemmy.world 1 points 2 years ago

It really isn't that bonkers. A lot software thought is about licensing. See GPL and Creative Commons and all that stuff thats all about how things can be profited from/responsibilities around it. Benefiting from free data is one thing. Privately profiting at the expense or not sharing the capability/advances that came from it is another. Willing to bet there's GPL violations via the training sets.

Is it even possible to attach licenses to text posts on social media?

load more comments