Technical report and source code
https://github.com/deepseek-ai/DeepSeek-R1/
A Contribution of DeepSeek R1
Training an LLM today consists of three steps: (1) pre-training, (2) supervised fine-tuning (SFT), and (3) reinforcement learning (RL). Step 3 is not strictly necessary, but is increasingly common.
DeepSeek R1 was able to replace step 2 entirely with step 3.
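For intuition, the R1 report describes the RL stage as scoring sampled answers with rule-based rewards (answer correctness plus output format) rather than a learned preference model or human rankings. Below is a minimal sketch of what such a reward could look like; the tag format and helper names are my own illustration, not DeepSeek's code.

```python
import re

def format_reward(completion: str) -> float:
    """Reward completions that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Reward completions whose final answer matches a verifiable reference."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == reference_answer.strip() else 0.0

def rule_based_reward(completion: str, reference_answer: str) -> float:
    # Total score for one sampled completion; these scores would feed a
    # policy-gradient update (e.g. GRPO) in place of supervised fine-tuning.
    return accuracy_reward(completion, reference_answer) + format_reward(completion)

# Example: score a batch of sampled completions for one verifiable prompt.
completions = [
    "<think>2 + 2 = 4</think><answer>4</answer>",
    "<answer>5</answer>",
]
print([rule_based_reward(c, "4") for c in completions])  # [2.0, 0.0]
```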
My questions
How big of a deal is the above contribution? Is no human in the loop needed at all, or is it just less labor-intensive? Was someone literally sitting there during training saying, “this answer is better than that answer”? How were the rankings generated?
Also, was R1 distilled from their V3 model? If so, did V3 use SFT? (I am curious whether you could bootstrap a model entirely with RL, with no SFT anywhere in the family tree.)
How much of a difference did low-precision training make? Did it make any difference to inference efficiency?
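For context on the low-precision point: the usual pattern is to keep a full-precision copy of the weights while running the compute-heavy forward pass in a reduced format. The sketch below uses PyTorch's generic bf16 autocast as an illustration; DeepSeek reportedly trained V3 in FP8, which is a different and more involved setup.

```python
import torch
from torch import nn

# Standard PyTorch mixed-precision pattern: parameters (and optimizer state)
# stay in fp32 while the forward pass runs in bfloat16 under autocast.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 512)
y = torch.randint(0, 10, (8,))

optimizer.zero_grad()
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(x)                            # computed in bf16
    loss = nn.functional.cross_entropy(logits, y)
loss.backward()                                  # gradients land in the fp32 weights
optimizer.step()
print(loss.item())
```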
Broader context / key concepts / background
Low-precision training: https://huggingface.co/docs/accelerate/en/concept_guides/low_precision_training
Multi-head Attention (MHA), a core component of the transformer architecture. [A nice book]
Multi-head Latent Attention (MLA), an architectural design pattern
Mixture of Experts (MoE), an architectural design pattern (a minimal sketch follows after this list)
Reinforcement Learning (RL), a training technique
Model distillation, a training technique (from this seminal paper); a sketch of the classic distillation loss follows below
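To make the MoE item above concrete, here is a toy top-k routing layer. The dimensions, number of experts, and the dense loop over experts are illustrative choices of mine, not DeepSeek's actual MoE configuration.

```python
import torch
from torch import nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: a router picks the top-k expert MLPs per
    token and mixes their outputs, so only k experts run for each token."""

    def __init__(self, dim: int = 64, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, dim)
        gate_logits = self.router(x)                       # (tokens, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)                  # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([10, 64])
```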
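And for the distillation item: the classic loss from that line of work trains the student to match the teacher's temperature-softened output distribution. The temperature and toy logits below are arbitrary illustrative values, not anything from the R1 report.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-label distillation: KL divergence between the teacher's and the
    student's temperature-softened distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy example: the student is pulled toward the teacher's output distribution.
teacher_logits = torch.tensor([[2.0, 0.5, -1.0]])
student_logits = torch.tensor([[0.1, 0.2, 0.3]], requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item(), student_logits.grad)
```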