this post was submitted on 24 Mar 2024
1236 points (97.3% liked)

Memes

[–] Liz@midwest.social 103 points 2 years ago (3 children)

Ask a man his salary. Do it. How else are you supposed to learn who is getting underpaid? The only way to rectify that problem is to learn about it in the first place.

[–] SpaceCowboy@lemmy.ca 32 points 2 years ago (1 children)

I think context is important here. Asking a co-worker their salary is fine. Asking about the salary of someone you're on a date with is not fine.

[–] GlitterInfection@lemmy.world 13 points 2 years ago (1 children)

Exactly.

You should have asked them for their W-2 before agreeing to meet.

[–] SpaceCowboy@lemmy.ca 5 points 2 years ago

Yeah and get their credit score before you even reply.

[–] EmpathicVagrant@lemmy.world 24 points 2 years ago (1 children)

The National Labor Relations Act makes discussing wages a protected right, and the NLRB enforces it.

Talk about your wages.

[–] brbposting@sh.itjust.works 4 points 2 years ago (1 children)
[–] EmpathicVagrant@lemmy.world 7 points 2 years ago

Plans to, too.

[–] moistclump@lemmy.world 5 points 2 years ago

Ask a woman her age. Do it. How else are you supposed to learn who is getting older? The only way to celebrate that is to learn about it in the first place.

[–] kernelle@0d.gs 59 points 2 years ago (4 children)

"Publicly available data" - I wonder if that includes Disney's catalogue, or Nintendo's IP? I think they are veeery selective about their "publicly available data". It also implies the only requirement for training data is that it's publicly available, which covers almost every piece of media ever made. How an AI model trained that way isn't public domain by default baffles me.

[–] Even_Adder@lemmy.dbzer0.com 16 points 2 years ago (1 children)

You should check out this article by Kit Walsh, a senior staff attorney at the EFF, and this one by Katherine Klosek, the director of information policy and federal relations at the Association of Research Libraries.

[–] kernelle@0d.gs 9 points 2 years ago

Great articles, the first is one of the best I've read about the implications of fair use. I'd argue that because these models interpret such a broad swath of human knowledge, everyone is entitled to unrestricted access to them (not the servers or algorithms used to run them, just the models). I'll dub it "the library of the digital age" argument.

[–] Raykin@lemmy.world 12 points 2 years ago

Great point.

[–] redcalcium@lemmy.institute 4 points 2 years ago* (last edited 2 years ago) (1 children)

There is a rumor that OpenAI downloaded the entirety of LibGen to train their AI models. No definite proof yet, but it seems very likely.

https://torrentfreak.com/authors-accuse-openai-of-using-pirate-sites-to-train-chatgpt-230630/

[–] 100_kg_90_de_belin@feddit.it 3 points 2 years ago

"It just like me fr fr" (cit.)

[–] programmer_belch@lemmy.dbzer0.com 2 points 2 years ago (17 children)

The problem is that if copyrighted works are used in training, you could generate a copyrighted artwork that would then land in the public domain, stripping its protection. I would love this approach; the problem is the lobbyists don't agree with me.

[–] zinderic@programming.dev 33 points 2 years ago (2 children)

It's almost impossible to audit what data went into an AI model. Until that changes, companies can scrape and use whatever they like, and no one will be the wiser about what data got used or misused in the process. That makes it hard to hold such companies accountable for what they're using and how.

[–] po-lina-ergi@kbin.social 38 points 2 years ago (1 children)

Then it needs to be on companies to prove their audit trail, and until they can, all development should be required to be open source

[–] zinderic@programming.dev 7 points 2 years ago (2 children)

That would be amazing. But it won't happen any time soon, if ever. I mean, just think about all that investment in GPU compute and the need to realize good profit margins. Until there are laws that require AI companies to open their data pipelines and make public all details about their data sources, I don't think much will happen. They'll just keep feeding in any data they can get their hands on, and nothing can stop that today.

[–] dislocate_expansion@reddthat.com 22 points 2 years ago (11 children)

Anyone know why most models have a 2021 internet data cutoff?

[–] Natanael@slrpnk.net 19 points 2 years ago (3 children)

Training from scratch and retraining are expensive. Also, they want to avoid training on ML outputs; they want primarily human-made works as samples, and after the initial public release of LLMs it has become harder to create large datasets without ML-generated content in them.

[–] scrubbles@poptalk.scrubbles.tech 13 points 2 years ago* (last edited 2 years ago)

There was a good paper that came out recently showing that training on ML-generated data results in model collapse. It's going to be real interesting; I don't know if they'll be able to train as easily ever again.
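A toy sketch of that failure mode: a Gaussian "model" is refit each generation only on its own outputs, with sampling biased toward high-likelihood values. Everything here (the Gaussian model, the within-one-sigma sampling rule) is an illustrative assumption, not any lab's actual pipeline:

```python
import random
import statistics

# Toy illustration of "model collapse": each generation is "trained"
# only on the previous generation's outputs.
random.seed(0)

def sample_typical(mu, sigma, n):
    # Keep only outputs within one sigma of the mean, mimicking
    # generators that favor high-likelihood samples.
    out = []
    while len(out) < n:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= sigma:
            out.append(x)
    return out

data = [random.gauss(0.0, 1.0) for _ in range(2000)]  # "human" data
spread = [statistics.stdev(data)]
for _ in range(5):
    mu, sigma = statistics.fmean(data), statistics.stdev(data)
    data = sample_typical(mu, sigma, 2000)  # train only on model output
    spread.append(statistics.stdev(data))

print("std per generation:", [round(s, 3) for s in spread])
```

The diversity of the data (its standard deviation) shrinks every generation; only fresh human-made data would restore it.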

[–] Iron_Lynx@lemmy.world 4 points 2 years ago

I recall spotting a few reports of image generators having their training data contaminated with generated images, and their output getting significantly worse. So yeah, I guess LLMs and image generators need natural sources, or it gets more inbred than the Habsburgs.

[–] Donkter@lemmy.world 7 points 2 years ago

I think it's just that most are based on ChatGPT, whose training data cuts off in 2021.

[–] can@sh.itjust.works 3 points 2 years ago (1 children)

Hey, did you know your profile is set to appear as a bot and as a result many may be filtering your posts and comments? You can change this in your Lemmy settings.

Unless you are a bot... In which case where did you get your data?

[–] dislocate_expansion@reddthat.com 4 points 2 years ago (1 children)

The data wasn't stolen, I can at least assure you of that

[–] Hjalamanger@feddit.nu 5 points 2 years ago (1 children)

I love how it isn't just an image of the OpenAI logo but also a sad person beside it

[–] unique_hemp@discuss.tchncs.de 11 points 2 years ago (2 children)

Oh, that's not just some person, that's the CTO of "Open"AI when asked if YouTube videos were used to train Sora.

Sauce: https://youtu.be/mAUpxN-EIgU?feature=shared&t=270

[–] brbposting@sh.itjust.works 4 points 2 years ago

Lying MF, unbelievable that’s the best they thought of.

I’m sorry, but we’ve made an internal decision not to reveal our proprietary methodology at this time.

There, now it’s not a lie (hurr durr I’m only the CTO how would I know whether a tiny startup like YOUTUBE was one of our sources)


[–] turkishdelight@lemmy.ml 4 points 2 years ago (2 children)

What's wrong with her face?

[–] AnUnusualRelic@lemmy.world 24 points 2 years ago (1 children)

Poor training data presumably.

[–] SpaceCowboy@lemmy.ca 4 points 2 years ago

It's this face: https://www.compdermcenter.com/wp-content/uploads/2016/09/vanheusen_5BSQnoz.jpg

She was asked about OpenAI using copyrighted material for training data and literally made that face. The only thing more perfect would've been if she'd tugged at her collar while doing it.
