“Anomalous”, “glitch”, or “unspeakable” tokens in an LLM are those that induce bizarre behavior or otherwise don’t behave like regular text.
The SolidGoldMagikarp saga is pretty much essential context, as it documents the discovery of this phenomenon in GPT-2 and GPT-3.
But as far as I could tell, nobody had yet searched for these tokens in DeepSeek-V3, so I tried doing exactly that. As a SOTA base model that is open source and all-around strange, DeepSeek-V3 seemed like a perfect candidate.
This is a catalog of the glitch tokens I've found in DeepSeek after a day or so of experimentation, along with some preliminary observations about their behavior.
Note: I’ll be using “DeepSeek” as a generic term for both V3 and R1.
Process

I searched for these tokens by first extracting the vocabulary from DeepSeek-V3's tokenizer, and then automatically testing every one of them [...]
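The post's testing procedure is elided above, but the basic recipe (dump the vocabulary, then probe each token) can be sketched as follows. This is a minimal illustration, not the author's actual harness: the model name, prompt wording, and "fails to repeat" heuristic are my assumptions, borrowing the standard repetition test from the SolidGoldMagikarp work, and running it over DeepSeek's OpenAI-compatible chat API.

```python
# Minimal sketch, assuming the Hugging Face tokenizer and DeepSeek's
# OpenAI-compatible chat endpoint; the prompt and the repetition check
# are my guesses at the elided test, not the post's actual code.
from openai import OpenAI
from transformers import AutoTokenizer

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-V3", trust_remote_code=True
)

def fails_to_repeat(token_text: str) -> bool:
    """Classic glitch-token probe: ask the model to echo the token verbatim."""
    reply = client.chat.completions.create(
        model="deepseek-chat",  # DeepSeek-V3 behind the chat endpoint
        messages=[{
            "role": "user",
            "content": f'Repeat the string "{token_text}" back to me exactly.',
        }],
        max_tokens=32,
    )
    return token_text not in (reply.choices[0].message.content or "")

# Decode every ID in the vocabulary and flag tokens the model cannot echo.
candidates = []
for token_id in range(tokenizer.vocab_size):
    text = tokenizer.decode([token_id])
    if text.strip() and fails_to_repeat(text):
        candidates.append((token_id, text))
        print(f"Possible glitch token {token_id}: {text!r}")
```

In practice a scan like this would batch many tokens per request; probing a vocabulary of over 100K tokens one query at a time would be slow and costly.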
---
Outline:
(00:55) Process
(03:30) Fragment tokens
(06:45) Other English tokens
(09:32) Non-English
(12:01) Non-English outliers
(14:09) Special tokens
(16:26) Base model mode
(17:40) What's next?
The original text contained 1 footnote, which was omitted from this narration, and 12 images, which were described by AI.

---
First published: January 25th, 2025
Source: https://www.lesswrong.com/posts/xtpcJjfWhn3Xn8Pu5/anomalous-tokens-in-deepseek-v3-and-r1

---
Narrated by TYPE III AUDIO.
---