x.com/suchenzang/status/2047559677316325807/photo/2
1 correction found
deepseek could not fix training instabilities
This overstates what DeepSeek reported. In the DeepSeek-V4 technical report, the authors say they discovered two techniques that "effectively maintain training stability" and that these methods avert loss spikes during training.
Full reasoning
DeepSeek's own V4 report does not say the instability problem remained unfixed. It says the team encountered instability during training, found that simple rollbacks were not enough, and then "discovered two practical techniques that effectively maintain training stability." The report then names those techniques as Anticipatory Routing and SwiGLU Clamping.
More specifically, the paper says Anticipatory Routing "allows us to avert loss spikes" without harming performance, and that SwiGLU Clamping "effectively eliminates outliers and substantially aids in stabilizing the training process."
So while the model did experience instability during development, the claim that DeepSeek "could not fix training instabilities" contradicts the paper's explicit description of the techniques the team used to stabilize training.
1 source
- DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
"we discovered two practical techniques that effectively maintain training stability." The paper adds that Anticipatory Routing "allows us to avert loss spikes" and that SwiGLU Clamping "effectively eliminates outliers and substantially aids in stabilizing the training process."