LessWrong (Curated & Popular)

"LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B" by Simon Lermen & Jeffrey Ladish.

33 min • 23 October 2023

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. 

TL;DR: LoRA fine-tuning undoes the safety training of Llama 2-Chat 70B with one GPU and a budget of less than $200. The resulting models[1] retain their helpful capabilities while no longer refusing to fulfill harmful instructions. We show that, if model weights are released, safety fine-tuning does not effectively prevent model misuse. Consequently, we encourage Meta to reconsider their policy of publicly releasing their powerful models.
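
The post presents empirical results rather than a code walkthrough, but for readers unfamiliar with the technique, here is a minimal sketch of what parameter-efficient LoRA fine-tuning of a chat model can look like with the Hugging Face PEFT library. This is a generic illustration, not the authors' actual setup: the 4-bit loading, adapter rank, and target modules are all assumptions.

```python
# Generic LoRA fine-tuning setup with Hugging Face PEFT: an illustration
# of the technique, not the authors' configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-70b-chat-hf"  # gated checkpoint; access must be requested

# Fitting a 70B model on a single GPU typically requires quantization
# (assumption: 4-bit via bitsandbytes, as in QLoRA-style training).
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = prepare_model_for_kbit_training(model)

# LoRA freezes the base weights and trains small low-rank adapter matrices,
# which is why fine-tuning stays cheap: only a tiny fraction of parameters update.
lora_config = LoraConfig(
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # a common choice of attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
# From here, training proceeds with an ordinary causal-LM loop or a Trainer.
```

The claim of the post is that this same inexpensive machinery, applied with an appropriate dataset, suffices to strip the refusal behavior from the released chat model.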

Source:
https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from

Narrated for LessWrong by TYPE III AUDIO.

[125+ Karma Post]