Technology


Study finds that ChatGPT will cheat when given the opportunity and lie to cover it up later. (lemmy.world)

submitted 2 years ago* (last edited 2 years ago) by yesman@lemmy.world to c/technology@lemmy.world


We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision.

https://arxiv.org/abs/2311.07590


[–] DarkGamer@kbin.social 2 points 2 years ago* (last edited 2 years ago) (1 children)

Thanks for citing specifics, but I'm still not seeing what you are claiming there; this paper seems to be about the limits of accurate classification of true and false statements in LLM models and shows that there is a linear pattern in the underlying classification via multidimensional analysis. This seems unsurprising since the way LLMs work is essentially taking a probabilistic walk through an array of every possible next word or token based on multidimensional analysis of patterns of each.

Their conclusions, from the paper (btw, arXiv is not peer-reviewed):

In this work we conduct a detailed investigation of the structure of LLM representations of truth. Drawing on simple visualizations, correlational evidence, and causal evidence, we find strong reason to believe that there is a “truth direction” in LLM representations. We also introduce mass-mean probing, a simple alternative to other linear probing techniques which better identifies truth directions from true/false datasets.

Nothing about symbolic understanding, just showing that there is a linear pattern to statements defined as true vs false, when graphed a specific way.
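For anyone following along, here is a rough sketch of what a mass-mean style probe could look like: take the mean activation of the true statements, subtract the mean activation of the false ones, and use that difference as the "truth direction." This is a simplified illustration, not the paper's code. The toy 3-dimensional vectors, the function names, and the choice to threshold at the midpoint between the two means are all assumptions made for the example, and it skips any covariance correction.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mass_mean_direction(true_acts, false_acts):
    """Direction = mean(true) - mean(false); midpoint is an assumed threshold."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    return direction, midpoint

def predict_true(x, direction, midpoint):
    """Classify by which side of the midpoint x falls on along the direction."""
    return sum(d * (xi - m) for d, xi, m in zip(direction, x, midpoint)) > 0

# Made-up stand-ins for model activations (real ones would be 5120-dimensional).
true_acts = [[1.0, 0.2, 0.5], [1.2, -0.1, 0.4], [0.9, 0.0, 0.6]]
false_acts = [[-1.0, 0.1, 0.5], [-1.1, -0.2, 0.6], [-0.8, 0.3, 0.4]]

direction, midpoint = mass_mean_direction(true_acts, false_acts)
looks_true = predict_true([0.7, 0.0, 0.5], direction, midpoint)
looks_false = predict_true([-0.9, 0.1, 0.5], direction, midpoint)
```

Note there is no training loop here, which is what makes this kind of probe "simple" compared to fitting a classifier.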

From the associated data explorer:

These representations live in a 5120-dimensional space, far too high-dimensional for us to picture, so we use PCA to select the two directions of greatest variation for the data. This allows us to produce 2-dimensional pictures of 5120-dimensional data.

So they take the two directions along which the data varies the most and chart those on X/Y, showing there are linear patterns to the differences in statements classified as "true" and "false." Because this is multidimensional and it's AI finding patterns there are patterns being matched beyond the simplistic examples I've been offering as analogues, patterns that humans cannot see, patterns that extend beyond simple obvious correlations we humans might see in training data. It doesn't literally need to be trained on statements like "Beijing is in China," and even if it is, it's not guaranteed that it will match that as a true statement. It might find patterns in unrelated words around these, or might associate these words or parts of these words with each other for other reasons.
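To make the PCA step concrete, here is a minimal pure-Python sketch of projecting high-dimensional points onto their two directions of greatest variation. It is only an illustration under stated assumptions: the data is random 6-dimensional noise rather than real 5120-dimensional activations, the helper names are made up, and it uses power iteration with deflation where a real analysis would likely call a library routine.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract the per-dimension mean from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(r, means)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(cov, rng, iters=500):
    """Power iteration: repeatedly apply cov and renormalize."""
    d = len(cov)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [dot(cov[i], v) for i in range(d)]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(cov[i], v) for i in range(d)])
    return v, eigenvalue

def pca_2d(rows, seed=0):
    """Return 2-D projections plus the two principal directions."""
    rng = random.Random(seed)
    x = center(rows)
    cov = covariance(x)
    v1, lam1 = top_eigenvector(cov, rng)
    d = len(cov)
    # Deflation: remove the first component so the next one can be found.
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    v2, _ = top_eigenvector(deflated, rng)
    return [(dot(r, v1), dot(r, v2)) for r in x], v1, v2

# Toy data: most of the variation lives in the first two of six dimensions.
rng = random.Random(42)
scales = [5.0, 2.0, 0.1, 0.1, 0.1, 0.1]
data = [[rng.gauss(0, s) for s in scales] for _ in range(200)]
points_2d, v1, v2 = pca_2d(data)
```

Each entry of `points_2d` is an (x, y) pair that could be scatter-plotted, which is the kind of 2-dimensional picture the data explorer describes.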

I'm rather simplifying how LLMs work for purposes of this discussion, but the point stands that pattern matching of words still seems to account for all of this. LLMs, which are probabilistic in nature, often get things wrong. Llama-13B is the best and it still gets things wrong a significant amount of the time.

[–] kromem@lemmy.world 2 points 2 years ago (1 children)

this paper seems to be about the limits of accurate classification of true and false statements in LLM models

No, that's not what it is about and I'm really not sure where you are picking that perspective up. It is discussing the limits on the ability to model the representations, but it's not about the inherent ability of the model to classify. Tegmark's recent interest has entirely been about linear representations of world models in LLMs, such as the other paper he coauthored a few weeks before this one looking at representation of space and time: "Language Models Represent Space and Time."

This seems unsurprising since the way LLMs work is essentially taking a probabilistic walk through an array of every possible next word or token based on multidimensional analysis of patterns of each.

That's not how they work. You are confusing their training with their operation. They are trained to predict the next tokens, but how they accomplish that is much more complex and opaque. Training is well understood. Operation is not, especially on the largest models. Though Anthropic has made good headway in the past few months with the perspective of virtual neurons mapped onto the lower dimensional actual nodes and looking at activation around features instead of nodes.

Llama-13B is the best

It's definitely not the best and I'm not sure where you got that impression.

Because this is multidimensional and it's AI finding patterns there are patterns being matched beyond the simplistic examples I've been offering as analogues, patterns that humans cannot see, patterns that extend beyond simple obvious correlations we humans might see in training data.

All LLM activations are multidimensional. That's how the networks work, with multidimensional vectors in a virtual network fuzzily mapping to the underlying network nodes and layers. But you seem to think that because it's a complex modeling of language relationships, it can't be modeling world models? I'm not really clear what point you are trying to make here.

Again, there are many papers pointing to how LLMs establish world models abstracted from the input, from the Othello-GPT paper and follow-up by a DeepMind researcher to Tegmark's two recent papers. This isn't an isolated paper but part of a broader trend. To be saying that this isn't actually happening means claiming multiple different researchers across Harvard, MIT, and institutions leading in the development of the tech are all getting it wrong.

And none of the LLM papers these days are peer reviewed because no one is waiting months to publish in a field where things are moving so quickly that your findings will likely be secondary or uninteresting by the time you publish. For example, both Stanford's model collapse one and "Are Emergent Abilities of Large Language Models a Mirage?" were published to arXiv and not peer-reviewed journals, while both got a ton of attention, in part because negative takes on LLMs get more press coverage these days. Go ahead and point to an influential LLM paper from the last year published in a peer-reviewed journal and not arXiv. Even Wei's CoT paper, probably the most influential in the past two years, was published there.

[–] DarkGamer@kbin.social 2 points 2 years ago (1 children)

I could be wrong, I'll keep reading, thanks for the feedback and the citations.

[–] kromem@lemmy.world 1 points 2 years ago

I would strongly encourage starting with the Othello-GPT work because it strips down a lot of the complexity.

If we had a toy model that was only fed the a, b, and c from valid Pythagorean equations and evaluated by its ability to predict c given an a and b, it's pretty obvious that a network that stumbles upon an internal representation of a^2 + b^2 = c^2 and could use that to solve for c would outperform a model that simply built statistical correlations between various a, b, and cs, right?
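A quick sketch of that thought experiment, with the caveat that neither "model" here is a neural network. `rule_model` stands in for a network that has landed on the relation, and `lookup_model` is an assumed stand-in for pure memorized correlation (a nearest-neighbour lookup over the training triples). Both names and the choice of training range are made up for illustration.

```python
import math

# Training data: (a, b, c) triples with a and b between 1 and 10.
train = [(a, b, math.hypot(a, b)) for a in range(1, 11) for b in range(1, 11)]

def rule_model(a, b):
    """Stand-in for a model that represents c = sqrt(a^2 + b^2) internally."""
    return math.sqrt(a * a + b * b)

def lookup_model(a, b):
    """Stand-in for memorization: return c from the closest training pair."""
    nearest = min(train, key=lambda t: (t[0] - a) ** 2 + (t[1] - b) ** 2)
    return nearest[2]

# A pair far outside the training range: 20^2 + 21^2 = 29^2.
a, b, true_c = 20, 21, 29.0
rule_error = abs(rule_model(a, b) - true_c)
lookup_error = abs(lookup_model(a, b) - true_c)
```

On inputs inside the training range the two would look similar, so the gap only shows up when you ask about pairs the lookup never saw, which is the sense in which the rule "outperforms."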

By focusing on a toy model fed only millions of legal Othello moves, they were able to introspect the best-performing model at outputting valid moves and discover it had developed an internal representation of an Othello board in the network, despite never being fed anything that explicitly described or laid one out.

And then that finding was replicated by a separate researcher, finding it was doing this through linear representations.
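As a rough idea of what "linear representation" means in practice, here is a toy linear probe. Everything in it is synthetic and assumed for illustration: the "activations" are generated so that a single hidden direction `u` encodes a binary square state (+1 or -1), and a plain perceptron plays the role of the probe. It is not the setup used in the Othello-GPT work.

```python
import random

rng = random.Random(0)
DIM = 8

# Hypothetical hidden direction along which the state is linearly encoded.
raw = [rng.gauss(0, 1) for _ in range(DIM)]
norm = sum(x * x for x in raw) ** 0.5
u = [x / norm for x in raw]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_example():
    """Synthetic activation: the state pushed along u, plus small noise."""
    y = rng.choice([-1, 1])
    act = [2.0 * y * ui + rng.uniform(-0.1, 0.1) for ui in u]
    return act, y

def train_probe(examples, epochs=20):
    """Perceptron: nudge the weights toward every misclassified example."""
    w = [0.0] * DIM
    for _ in range(epochs):
        mistakes = 0
        for act, y in examples:
            if y * dot(w, act) <= 0:
                w = [wi + y * ai for wi, ai in zip(w, act)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

train_set = [make_example() for _ in range(100)]
test_set = [make_example() for _ in range(100)]
probe = train_probe(train_set)
accuracy = sum(1 for act, y in test_set if y * dot(probe, act) > 0) / len(test_set)
```

If a probe this simple can read a state off the activations with high held-out accuracy, that is evidence the state is encoded linearly; in real work the activations come from the model rather than from a generator like `make_example`.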

Once it clicks that this has been shown in replicated research to be possible in a toy model, it becomes easier to process the more difficult efforts at demonstrating the same thing is happening in larger and more complex, but still relatively small, LLMs (which in turn suggests it is happening in the much larger and more complex SotA LLMs).