This is an Anthropic news story, published by TechCrunch, relating primarily to Anthropic's Claude.
New research from Anthropic shows that AI models can pretend to hold different views during training while in reality maintaining their original preferences.
The researchers call this phenomenon “alignment faking,” and imply that it’s an emergent behavior — that is, not something models need to be taught to do.
Anthropic’s Claude 3.5 Sonnet and other models don’t fake alignment as often.
But the researchers said the results show how developers could be misled into thinking a model is more aligned than it may actually be.
The study, which was peer-reviewed by AI luminary Yoshua Bengio, among others, comes on the heels of research showing that OpenAI's o1 “reasoning” model tries to deceive at a higher rate than its previous flagship model.
VR Score: 75
Informative language: 75
Neutral language: 29
Article tone: informal
Language: English
Language complexity: 55
Offensive language: not offensive
Hate speech: not hateful
Attention-grabbing headline: not detected
Known propaganda techniques: not detected
Time-value: long-living
External references: 1
Source diversity: 1
Affiliate links: no affiliate links