In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From (futurism.com)

submitted 1 year ago by ylai@lemmy.ml to c/technology@lemmy.world

154 comments

[–] HaywardT@lemmy.sdf.org 2 points 1 year ago (1 children)

Sure, if that is what the network has been trained to do, just like a librarian will if that is how they have been trained.

[–] Linkerbaan@lemmy.world -1 points 1 year ago* (last edited 1 year ago) (1 children)

Actually it's the opposite: you need to train a network not to reveal its training data.

“Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”

The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet.
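The check described in the quote above (comparing model output against a reference dataset of text pulled from the internet) can be illustrated with a rough sketch. This is not the researchers' pipeline: the function name, the whitespace tokenization, and the window length are all assumptions made for illustration, and a real corpus would need a far more scalable index.

```python
def shared_token_runs(generation, corpus, min_tokens=5):
    """Return every run of `min_tokens` consecutive tokens in `generation`
    that also appears verbatim in at least one document of `corpus`.

    Hypothetical helper: tokens are just whitespace-separated words.
    """
    # Index every window of min_tokens tokens found in the reference corpus.
    corpus_windows = set()
    for doc in corpus:
        doc_tokens = doc.split()
        for i in range(len(doc_tokens) - min_tokens + 1):
            corpus_windows.add(tuple(doc_tokens[i:i + min_tokens]))

    # Slide the same window over the model output and collect exact matches.
    tokens = generation.split()
    hits = []
    for i in range(len(tokens) - min_tokens + 1):
        window = tuple(tokens[i:i + min_tokens])
        if window in corpus_windows:
            hits.append(" ".join(window))
    return hits
```

A generation with no hits is not proof that nothing was memorized; it only means nothing matched this particular reference corpus at this window length.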

[–] HaywardT@lemmy.sdf.org 0 points 1 year ago (1 children)

Interesting article. It seems to be about a bug, not a designed behavior. It also says it exposes random excerpts from books and other training data.

[–] Linkerbaan@lemmy.world -1 points 1 year ago (1 children)

It's not designed to do that because they don't want to reveal the training data. But factually all neural networks are a combination of their training data encoded into neurons.

When given the right prompt (or image generation question), they will replicate it exactly, because that's how they were trained in the first place: replicating their source images with as few neurons as possible, and tweaking them when the output is not correct.

[–] HaywardT@lemmy.sdf.org 3 points 1 year ago

That is a little like saying every photograph is a copy of its subject. That is just factually incorrect. I have many three-layer networks that are not the thing they were trained on. As a compression method they can be very lossy, and in fact that is often the point.
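The lossy-compression point can be shown with a toy three-layer network (2 inputs, 1 hidden unit, 2 outputs). This is only a sketch under stated assumptions: the data points, the purely linear units, the layer sizes, the learning rate, and all names are hypothetical and are not taken from anyone's actual networks. Because every point has to pass through a single number, the network can learn the overall trend of the data but cannot store each point exactly.

```python
# Hypothetical toy data: 2-D points lying roughly, but not exactly, along y = 2x.
POINTS = [(0.5, 1.05), (1.0, 1.95), (1.5, 3.1), (-0.5, -0.9), (-1.0, -2.05)]

def reconstruct(w, v, x):
    """Squeeze a 2-D point through the single hidden unit and expand it back."""
    h = w[0] * x[0] + w[1] * x[1]      # the compressed code: one number
    return h, (v[0] * h, v[1] * h)     # the reconstruction: two numbers

def total_error(w, v, points):
    """Sum of squared reconstruction errors over all points."""
    err = 0.0
    for x in points:
        _, out = reconstruct(w, v, x)
        err += (out[0] - x[0]) ** 2 + (out[1] - x[1]) ** 2
    return err

def train(points, lr=0.01, epochs=2000):
    """Plain per-sample gradient descent on half the squared error."""
    w = [0.5, 0.5]  # encoder weights (input -> hidden), fixed starting values
    v = [0.5, 0.5]  # decoder weights (hidden -> output), fixed starting values
    for _ in range(epochs):
        for x in points:
            h, out = reconstruct(w, v, x)
            e = (out[0] - x[0], out[1] - x[1])   # reconstruction error
            dh = e[0] * v[0] + e[1] * v[1]       # error signal at the hidden unit
            for j in range(2):
                v[j] -= lr * e[j] * h
                w[j] -= lr * dh * x[j]
    return w, v

error_before = total_error([0.5, 0.5], [0.5, 0.5], POINTS)
w, v = train(POINTS)
error_after = total_error(w, v, POINTS)
print("error before training:", error_before)
print("error after training: ", error_after)
```

Training lowers the reconstruction error, but it cannot reach zero here: the points are not exactly collinear, and one hidden number can only describe positions along a single line. Larger networks trained on heavily repeated data can behave quite differently, which is where the two positions in this thread diverge.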