“Steering Gemini with BiDPO” by TurnTrout

2 min • January 31, 2025

This is a link post.

Coauthored with Mark Kurzeja

A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.

...

BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It's hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness/factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages, particularly for targeted loss mitigation.

This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines and reflected on lessons learned.

I'd also like to note that @Ryan Greenblatt's skepticism [...]

The original text contained 1 footnote which was omitted from this narration.

---

First published:
January 31st, 2025

Source:
https://www.lesswrong.com/posts/WqjkqrEyFDXoHzz9K/steering-gemini-with-bidpo

---

Narrated by TYPE III AUDIO.
