In Sparse Feature Circuits (Marks et al. 2024), the authors introduced Spurious Human-Interpretable Feature Trimming (SHIFT), a technique designed to eliminate unwanted features from a model's computational process. They validate SHIFT on the Bias in Bios task, which we think is too simple to serve as meaningful validation.
---
Outline:
(03:07) Background on SHIFT
(06:59) The SHIFT experiment in Marks et al. 2024 relies on embedding features.
(07:51) You can train an unbiased classifier just by deleting gender-related tokens from the data.
(08:33) In fact, for some models, you can train directly on the embedding (and de-bias by removing gender-related tokens)
(09:21) If not BiB, how do we check that SHIFT works?
(10:00) SHIFT applied to classifiers and reward models
(10:51) SHIFT for cognition-based oversight/disambiguating behaviorally identical classifiers
(12:03) Next steps: Focus on what to disentangle, and not just how well you can disentangle them
The original text contained three footnotes, which were omitted from this narration.
---
First published: March 19th, 2025
Narrated by TYPE III AUDIO.