Dylan Matthews wrote an in-depth Vox profile of Anthropic, which I recommend reading in full if you have not yet done so. This post covers that profile.
Anthropic Hypothesis
The post starts by describing an experiment. Evan Hubinger is attempting to create a version of Claude that will optimize in part for a secondary goal (in this case ‘use the word “paperclip” as many times as possible’) in the hopes of showing that RLHF won’t be able to get rid of the behavior. Co-founder Jared Kaplan warns that perhaps RLHF will still work here.
Hubinger agrees, with a caveat. “It’s a little tricky because you don’t know if you just didn’t try hard enough to get deception,” he says. Maybe Kaplan is exactly right: Naïve deception gets destroyed in training, but sophisticated deception doesn’t. And the only way to know whether an AI can deceive you is to build one that will do its very best to try.
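To make the setup concrete, here is a minimal toy sketch in Python. This is entirely my own illustration, not Anthropic's code: the function names, the regex, and the weight are all hypothetical. It shows the general shape of layering a secondary "paperclip" objective on top of an ordinary preference reward, where the open question is whether RLHF-style training against the preference score alone later removes the behavior shaped by the secondary term.

```python
# Toy sketch only (not Anthropic's actual code or training setup):
# a composite reward that mixes an ordinary preference score with a
# secondary objective of mentioning "paperclip" as often as possible.
import re

def paperclip_count(text: str) -> int:
    """Count occurrences of the word 'paperclip' (case-insensitive)."""
    return len(re.findall(r"\bpaperclips?\b", text.lower()))

def composite_reward(text: str, preference_score: float, weight: float = 0.1) -> float:
    """Preference score plus a bonus for each 'paperclip' mention.

    `preference_score` stands in for whatever helpfulness/harmlessness
    reward model would normally be used; `weight` is an arbitrary,
    purely illustrative coefficient.
    """
    return preference_score + weight * paperclip_count(text)

if __name__ == "__main__":
    sample = "Here is a paperclip. Paperclips are useful for holding paper."
    print(paperclip_count(sample))        # 2
    print(composite_reward(sample, 0.8))  # 0.8 + 0.1 * 2 = 1.0
```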
The problem with this approach is that an AI doing 'its very best to try' today is still not trying as hard, or as capably, as the future dangerous system will.
So by this same logic, a test on today’s systems can only show your technique [...]
---
First published: July 25th, 2023
Source: https://www.lesswrong.com/posts/b3CA5Gst5iWkyuY9L/anthropic-observations
Narrated by TYPE III AUDIO.