[论文翻译]DeepSeek R1 简要分析及其对生成式 AI (Generative AI) 的影响


原文地址:https://arxiv.org/pdf/2502.02523


Brief analysis of DeepSeek R1 and its implications for Generative AI

DeepSeek R1 简要分析及其对生成式 AI (Generative AI) 的影响

Abstract

摘要

In late January 2025, DeepSeek released their new reasoning model (DeepSeek R1), which was developed at a fraction of the cost of comparable models yet remains competitive with OpenAI’s models, despite the US’s GPU export ban. This report discusses the model, and what its release means for the field of Generative AI more widely. We briefly discuss other models released from China in recent weeks and their similarities; innovative use of Mixture of Experts (MoE), Reinforcement Learning (RL) and clever engineering appear to be key factors in the capabilities of these models. This think piece has been written to a tight timescale, providing broad coverage of the topic, and serves as introductory material for those looking to understand the model’s technical advancements, as well as its place in the ecosystem. Several further areas of research are identified.

2025年1月下旬,DeepSeek发布了他们的新推理模型(DeepSeek R1);尽管美国实施了GPU出口禁令,该模型的开发成本仅为同类模型的一小部分,同时仍能与OpenAI的模型保持竞争力。本报告讨论了该模型,并探讨了其发布对生成式AI领域更广泛的意义。我们简要讨论了最近几周中国发布的其他模型及其相似之处;专家混合(MoE)、强化学习(RL)的创新应用以及巧妙的工程似乎是这些模型能力的关键因素。本文是在较短的时间内撰写的,提供了对该主题的广泛覆盖,可作为理解该模型技术进步及其在生态系统中地位的入门材料。文中还指出了几个有待进一步研究的方向。

1 Introduction

1 引言

The relatively short history of Generative AI has been punctuated with big steps forward in model capability. This happened again over the last few weeks, triggered by a couple of papers released by the Chinese company DeepSeek [1]. In late December they released DeepSeek-V3 [2], a direct competitor to OpenAI’s GPT4o, apparently trained in two months for approximately $5.6 million [3, 4], which equates to 1/50th of the cost of other comparable models [5]. On the 20th of January they released DeepSeek-R1 [6], a set of reasoning models containing “numerous powerful and intriguing reasoning behaviours” [6], achieving comparable performance to OpenAI’s o1 model – and they are open for researchers to examine [7].

生成式 AI (Generative AI) 相对较短的历史中,模型能力的大幅提升时常出现。过去几周,这种情况再次发生,由中国公司深度求索 (DeepSeek) 发布的两篇论文引发了这一进展 [1]。12 月底,他们发布了 DeepSeek-V3 [2],这是 OpenAI 的 GPT4o 的直接竞争对手,据称在两个月内完成训练,成本约为 560 万美元 [3, 4],相当于其他类似模型成本的 1/50 [5]。1 月 20 日,他们发布了 DeepSeek-R1 [6],这是一组推理模型,包含“许多强大且有趣的推理行为” [6],达到了与 OpenAI 的 o1 模型相当的性能,并且这些模型向研究人员开放以供审查 [7]。

This openness is a welcome move for many AI researchers keen to understand more about the models they are using. It should be noted that these models are released as ‘open weights’, meaning they can be built upon and freely used (under the MIT license), but without the training data they are not truly open source. However, more details than usual were shared about the training process in the associated documentation.

这种开放性对于许多渴望更深入了解所使用模型的 AI 研究人员来说是一个受欢迎的措施。需要注意的是,这些模型以“开放权重 (open weights)”的形式发布,意味着模型可以在 MIT 许可下被构建和自由使用,但由于缺少训练数据,它们并非真正的开源。然而,在相关文档中,关于训练过程的细节比往常分享得更多。

2 DeepSeek

2 DeepSeek

In this section we give a brief overview of the latest models out of DeepSeek. We begin by discussing DeepSeek V3, a competitor to OpenAI’s GPT4o model, which is used as the base model for the development of DeepSeek R1. For more details, please see the original papers for DeepSeek-V3 [2] and DeepSeek-R1 [6].

在本节中,我们简要概述了DeepSeek的最新模型。首先我们讨论DeepSeek V3,它是OpenAI的GPT4o模型的竞争对手,并作为DeepSeek R1开发的基础模型。更多详细信息,请参见DeepSeek-V3 [2] 和 DeepSeek-R1 [6] 的原始论文。

2.1 DeepSeek V3 - base model

2.1 DeepSeek V3 - 基础模型

The DeepSeek-V3 model employs two major sources of efficiency: the Mixture of Experts (MoE) architecture and a range of engineering optimisations.

DeepSeek-V3 模型采用了两种主要的高效策略:专家混合(Mixture of Experts,MoE)架构和大量的工程优化。

The MoE architecture, which at a high level essentially divides the model up into a selection of specialised smaller models (one for maths, one for coding etc.) to ease the training burden, was used in machine translation Transformers such as Google’s GShard in 2020 and in the Mixtral LLM [8] in January 2024, and DeepSeek published a paper on their own approach to MoE in January 2024 [9]. A flurry of MoE papers appeared during 2024, with several of the MoE techniques used by the models in the next section being presented at NeurIPS at the end of 2024. This shows that, architecturally at least, DeepSeek V3 was not an out-of-the-blue breakthrough (with 20/20 hindsight!).

MoE架构在高层级上本质上是将模型划分为多个专用的小模型(一个用于数学,一个用于编码等),以减轻训练负担;该架构曾在2020年Google的GShard等机器翻译Transformer中使用,并于2024年1月在Mixtral大语言模型[8]中应用,同时DeepSeek也在2024年1月发表了关于其MoE方法的论文[9]。2024年期间涌现了大量关于MoE的论文,其中下一节所述模型采用的若干MoE技术在2024年底的NeurIPS会议上被展示。这至少从架构上表明,DeepSeek V3并非一个突如其来的突破(事后诸葛亮式的后见之明!)。
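To make the MoE idea more concrete, the sketch below shows top-k gating, where a small learned router scores the experts and each token is processed by only its top-k experts, so only a fraction of the parameters are active per token. This is a minimal illustrative layer, not DeepSeek's implementation; the class name `TopKMoE` and all sizes are our own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative Mixture-of-Experts layer with top-k routing (a sketch, not DeepSeek's code)."""
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: (n_tokens, d_model)
        gate_logits = self.router(x)                       # (n_tokens, n_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)    # keep only the k best experts per token
        weights = F.softmax(weights, dim=-1)               # normalise the kept gate scores
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # dispatch tokens to their chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)   # torch.Size([10, 64]); only 2 of the 8 experts ran for each token
```

Production MoE layers add load-balancing strategies and careful expert-parallel communication, which is where much of DeepSeek's engineering effort reportedly went.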

2.2 DeepSeek R1 - reasoning

2.2 DeepSeek R1 - 推理

The aim of the project was to improve reasoning capabilities using pure Reinforcement Learning (RL), without the need for supervised data, to focus on self-evolution. They took their V3 model (671B parameters) as a base and employed scalable Group Relative Policy Optimization (GRPO) as the RL framework; the resulting R1-Zero model showed improvements in reasoning and maths, but also challenges such as poor readability and language mixing.

该项目的目标是利用纯强化学习(Reinforcement Learning, RL)来提升推理能力,无需监督数据,专注于自我进化。以其V3模型(671B参数)为基础,采用可扩展的组相对策略优化(Group Relative Policy Optimization, GRPO)作为RL框架,最终生成的R1-Zero模型在推理和数学方面有所改进,但也面临诸如可读性差和语言混合等挑战。
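The core of GRPO, as described publicly, is that several completions are sampled for the same prompt and each completion's advantage is its reward standardised against the group's mean and standard deviation, removing the need for a separate value/critic model. A minimal sketch of that group-relative step (our own simplification, not DeepSeek's training code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (n_prompts, samples_per_prompt) scalar rewards for completions of the same prompt.
    Returns per-completion advantages standardised within each group -- the 'group relative' part of GRPO."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, with rule-based rewards (1.0 = correct answer).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
# Completions that beat their group's average get positive advantages and are reinforced.
```

These advantages are then fed into a clipped policy-gradient objective with a KL penalty against a reference model, per the GRPO formulation DeepSeek cite.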

Notably, the performance of the R1-Zero model on AIME 2024 increased from 15.6% to 71.0%, comparable to OpenAI-o1-0912; this was then exceeded when the DeepSeek team applied majority voting to the RL-trained model, scoring 86.7%.

值得注意的是,R1-Zero 模型在 AIME 2024 上的表现从 15.6% 提升到了 71.0%,与 OpenAI-o1-0912 相当;随后 DeepSeek 团队对 RL 训练后的模型采用多数投票 (majority voting),得分达到 86.7%,进一步超越了这一表现。
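The majority-voting figure comes from sampling many answers per question and scoring only the most common final answer (also called self-consistency; cons@64 in the paper's notation). A minimal sketch of that evaluation step, where `generate_answer` is a hypothetical stand-in for sampling a final answer from the model:

```python
import random
from collections import Counter
from typing import Callable, List

def majority_vote(question: str, generate_answer: Callable[[str], str], n_samples: int = 64) -> str:
    """Sample n answers for one question and return the most frequent one."""
    answers: List[str] = [generate_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a fake sampler that is right ~60% of the time; voting makes it right almost always.
fake_sampler = lambda q: "42" if random.random() < 0.6 else str(random.randint(0, 9))
print(majority_vote("What is 6 * 7?", fake_sampler))   # almost always "42"
```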

They continued to evolve their pipeline reintroducing some supervised fine tuning, which resulted in the R1 model, which reportedly achieves scores on par with OpenAI’s o1 model for many reasoning and maths-based evaluation tasks.

他们继续改进其流程,重新引入了一些监督微调,最终得到了 R1 模型。据报道,该模型在许多基于推理和数学的评估任务中取得了与 OpenAI 的 o1 模型相当的成绩。

The process of RL encourages the model to generate more tokens (more ‘thinking time’) to solve reasoning tasks. As the process progresses and test-time computation increases, behaviours such as reflection and the exploration of alternative approaches arise spontaneously; the term ‘aha moment’ [6] has been ascribed to the moment when an intermediate model learns to rethink using an anthropomorphic tone. This emergent property of self-reflection is a key finding that needs further research to unpick and evaluate: is the model ‘learning’ how to answer better through self-reflection, in the same way it ‘learnt’ to write prose in the early days of GPT, and if so, will these internal ‘functions’ enable better generalisation?

强化学习 (RL) 的过程鼓励模型生成更多的 Token (更多的“思考时间”) 来解决推理任务。随着过程的进展,测试时的计算量增加,反思和探索替代方法等行为会自发产生,术语“顿悟时刻 (aha moment)” [6] 被用来描述中间模型以拟人化的语调重新思考的那一刻。这种自我反思的涌现特性是一个关键发现,需要进一步研究来剖析和评估;模型是否通过自我反思“学习”如何更好地回答问题,就像 GPT 早期“学会”写散文一样;在这种情况下,这些内部“功能”是否能实现更好的泛化?

Another observation from the R1 paper is that the model’s performance decreased when they introduced RL prompts to encourage language consistency, trading off its benchmark performance against useability and readability; the performance of the finalised R1 model on AIME 2024 was 79.8%. This leads to the question: if the model is allowed to ‘think’ in any language (including code) without concern for the readability of its CoT artefacts, which are then translated before the output is presented to the user, would this improve performance without impacting useability? Conversely, being able to view and interrogate a model’s CoT artefacts not only builds users’ confidence, but also aids explainability.

从 R1 论文中的另一个观察结果是,当引入 RL 提示来鼓励语言一致性时,模型的性能有所下降,这是在基准测试性能与可用性和可读性之间做出的权衡;最终 R1 模型在 AIME 2024 上的性能为 79.8%。这就引出了一个问题:如果允许模型以任何语言(包括代码)“思考”,而不关心其 CoT (Chain of Thought) 产物的可读性,然后在输出呈现给用户之前进行翻译,这是否会在不影响可用性的情况下提高性能?相反,能够查看和审视模型的 CoT 产物,不仅能增强用户的信心,还能帮助提高可解释性。

The paper also presented details of how the reasoning patterns of larger models can be ‘distilled’ into smaller models (via the supervised fine-tuning dataset), and that these distilled versions perform better than if the same RL were applied directly to the smaller models. The hope is that this distillation can be built upon to yield even smaller, yet still performant, models. The performance of the distilled models improved compared to their original baseline benchmarks, with R1-Distill-Qwen-32B and R1-Distill-Llama-70B outperforming OpenAI’s o1-mini on tasks involving coding and mathematical reasoning. Again, future research could be devoted to determining the effect such distillation has on the overall attitude (values and personality) of the model.

该论文还详细介绍了如何将更大模型的推理模式“蒸馏”到小模型中(通过监督微调数据集),这些蒸馏后版本的性能优于在模型上执行相同 RL(强化学习)的结果。希望这种蒸馏方法可以进一步发展,以产生更小但仍具有高性能的模型。蒸馏模型的性能相较于原始基线基准有所提升,其中 R1-Distill-Qwen-32B 和 R1-Distill-Llama-70B 在涉及编程和数学推理的任务中表现优于 OpenAI 的 o1-mini。未来的研究可以致力于确定这种蒸馏对模型整体态度(价值观和个性)的影响。
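Mechanically, this kind of distillation is just supervised fine-tuning of a small ‘student’ model on reasoning traces generated by the large ‘teacher’. The sketch below illustrates the idea with Hugging Face `transformers`; the tiny student checkpoint and the two toy training examples are placeholders of ours, not the R1 distillation recipe.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "sshleifer/tiny-gpt2"                    # tiny placeholder student so the sketch runs
tokenizer = AutoTokenizer.from_pretrained(student_name)
tokenizer.pad_token = tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained(student_name)

# In practice this would be a large corpus of teacher-generated traces (the R1 paper reports ~800k samples).
corpus = [
    "Question: 2 + 2? <think> 2 plus 2 equals 4 </think> Answer: 4",
    "Question: 3 * 5? <think> 3 times 5 equals 15 </think> Answer: 15",
]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    enc["labels"] = enc["input_ids"].clone()            # standard causal-LM loss on the full trace
    enc["labels"][enc["attention_mask"] == 0] = -100    # ignore padding positions in the loss
    return enc

loader = DataLoader(corpus, batch_size=2, collate_fn=collate)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

student.train()
for batch in loader:                                    # one pass over the toy corpus
    loss = student(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"distillation SFT loss: {loss.item():.3f}")
```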

2.3 Replication

2.3 复制

On the 25th of January, researchers from the Hong Kong University of Science and Technology released a paper [10, 11] describing how long Chain-of-Thought (CoT) reasoning and self-reflection can emerge in a 7B model with only 8k MATH examples, and “we achieve surprisingly strong results on complex mathematical reasoning”.

1月25日,香港科技大学的研究人员发布了一篇论文 [10, 11],描述了在仅使用 8k MATH 示例的情况下,长链思维 (Chain-of-Thought, CoT) 和自我反思如何在7B模型中出现,并且“我们在复杂数学推理上取得了令人惊讶的强大结果”。

Their aim was to recreate the R1-Zero model; they started with Qwen2.5-Math-7B (base model) and performed reinforcement learning on it directly (no SFT, no reward model) with only 8k MATH examples. They observed the same increase in Chain-of-Thought length and emergent self-reflection. The resulting model achieved 33.3% on AIME and 77.2% on MATH benchmarks (up from 16.7% and 52.4% respectively for the base model), comparable to rStar-MATH [12]. They note that rStar-MATH uses greater than 50 times the data and requires more complicated components.

他们的目标是复现 R1-Zero 模型;他们以 Qwen2.5-Math-7B(基础模型)为起点,仅使用 8k MATH 样本直接对其进行强化学习(没有 SFT,没有奖励模型)。他们观察到了思维链长度的增加和自我反思的涌现。最终模型在 AIME 上达到了 33.3%,在 MATH 基准测试中达到了 77.2%(基础模型分别为 16.7% 和 52.4%);与 rStar-MATH [12] 相当。他们指出,rStar-MATH 使用了超过 50 倍的数据,并且需要更复杂的组件。
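Because the RL here uses no learned reward model, the reward is computed by simple rules, chiefly whether the extracted final answer matches the reference (plus, in R1-Zero's case, whether the output follows the requested format). A hedged sketch of such a reward function; the tag format and the weights are illustrative assumptions rather than either team's published recipe:

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: a small bonus for respecting a <think>/<answer> format,
    plus the main reward if the extracted answer exactly matches the reference."""
    reward = 0.0
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        reward += 0.1                                   # format reward
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0                                   # accuracy reward
    return reward

print(rule_based_reward("<think> 6 * 7 = 42 </think> <answer>42</answer>", "42"))  # 1.1
print(rule_based_reward("the answer is 42", "42"))                                 # 0.0
```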

There were some notable differences in the approach taken; for example, this project used Proximal Policy Optimization (PPO) instead of GRPO for its RL, although both are considered relatively simple and do not require reward models etc. Perhaps more importantly, they didn’t start with a large model: they sought to recreate the approach using the smaller 7B-parameter Qwen model and without a large-scale RL setup.

在采取的方法上存在一些显著差异,例如,该项目使用了近端策略优化 (Proximal Policy Optimization, PPO) 而不是 GRPO 来完成其强化学习 (RL) ,尽管两者都被认为是相对简单的,并且不需要奖励模型等,但也许更重要的是,他们并没有从一个大型模型开始,而是试图使用较小的 7B 参数 Qwen 模型并在没有大规模 RL 设置的情况下重现该方法。
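For contrast with GRPO's group-relative advantages, the heart of PPO is the clipped surrogate objective, which stops any single update from moving the policy too far from the one that generated the samples. A generic textbook sketch (not the replication's actual code):

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate loss over log-probabilities of sampled actions/tokens."""
    ratio = torch.exp(logp_new - logp_old)                        # how much the policy has shifted
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # pessimistic objective, negated for minimisation

# Toy numbers: two samples, one with positive and one with negative advantage.
print(ppo_clip_loss(torch.tensor([-1.0, -2.0]), torch.tensor([-1.2, -1.8]), torch.tensor([1.0, -0.5])))
```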

Hugging Face are recreating R1 [13], and this will be fully open sourced, with the full data and training pipeline released. They aim to recreate the whole of the pipeline, including implementing the missing components. They intend to replicate the R1-Distill models by extracting a high-quality reasoning corpus from DeepSeek-R1, reproduce the pure reinforcement learning pipeline used to create the R1-Zero model, and demonstrate the ability to transition from a base model to an RL-tuned model through multi-stage training (akin to R1’s).

Hugging Face 正在重建 R1 [13],这将完全开源,并发布完整的数据和训练流程。他们的目标是重建整个流程,包括实现缺失的组件。他们计划通过从 DeepSeek-R1 中提取高质量推理语料来复制 R1-Distill 模型,重现用于创建 R1-Zero 模型的纯强化学习流程,并展示通过多阶段训练从基础模型过渡到 RL 调优模型的能力(类似于 R1 的做法)。
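The first of those steps, extracting a reasoning corpus, amounts to sampling traces from a strong model and keeping the good ones. A sketch of that corpus-building loop using a tiny placeholder generator (the prompts, filename and filtering comment are our assumptions, not the Hugging Face project's code):

```python
import json
from transformers import pipeline

# Tiny stand-in generator; the real corpus would be sampled from DeepSeek-R1 itself.
generator = pipeline("text-generation", model="sshleifer/tiny-gpt2")

prompts = [
    "Prove that the sum of two even numbers is even.",
    "What is 17 * 23? Reason step by step.",
]

with open("reasoning_corpus.jsonl", "w") as f:
    for prompt in prompts:
        trace = generator(prompt, max_new_tokens=64, do_sample=True)[0]["generated_text"]
        # In a real pipeline, only traces whose final answer verifies as correct would be kept.
        f.write(json.dumps({"prompt": prompt, "completion": trace}) + "\n")
print("wrote reasoning_corpus.jsonl")
```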

3 Related Work of Note

3 相关工作

These aren’t the only notable innovations to come out of China in recent weeks. On the 22nd of January, ByteDance (the company behind TikTok – at time of writing) released their Doubao-1.5-pro model [14], which out-performs GPT 4o and is 50x cheaper [15]. It also uses MoE, and a highly optimised architecture that balances performance with reduced computational demands. Doubao is one of the most popular AI Chatbots in China, with 60 million active users [16]. The company focuses on building AI models that balance intelligence with communication, looking for more emotionally aware, natural sounding interactions. It is likely that Doubao incorporates improved prompt optimisation techniques [17] and communication-efficient MoE training via locality-sensitive hashing [18]. The latter is aimed at tackling latency challenges inherent in training sparse-gated MoE models, and results in 2.2 times quicker inference.

这些并非中国近期唯一的显著创新。1月22日,字节跳动(TikTok的母公司——截至撰写时)发布了他们的Doubao-1.5-pro模型 [14],该模型性能超越GPT 4o,且成本低至其1/50 [15]。它也采用了MoE(混合专家)技术,并拥有一个高度优化的架构,在保持高性能的同时降低了计算需求。Doubao是中国最受欢迎的AI聊天机器人之一,拥有6000万活跃用户 [16]。该公司专注于构建智能与沟通并重的AI模型,追求更具情感意识和自然流畅的交互体验。Doubao可能采用了改进的提示优化技术 [17],以及通过局部敏感哈希实现的通信高效的MoE训练 [18]。后者旨在解决稀疏门控MoE模型训练中固有的延迟挑战,使得推理速度提升至原来的2.2倍。
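Locality-sensitive hashing, in this setting, is a cheap way of grouping similar token representations so that expert routing and the associated communication can be batched. The snippet below is a generic random-hyperplane (SimHash-style) illustration of the bucketing idea only, not ByteDance's training system:

```python
import numpy as np

def simhash_buckets(vectors: np.ndarray, n_bits: int = 4, seed: int = 0) -> np.ndarray:
    """Assign each vector to one of 2**n_bits buckets via random-hyperplane LSH.
    Vectors with high cosine similarity tend to receive the same bucket id."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))  # random hyperplanes
    bits = (vectors @ planes) > 0                             # which side of each hyperplane
    return bits @ (1 << np.arange(n_bits))                    # pack the sign pattern into a bucket id

tokens = np.random.randn(16, 32)          # 16 toy token embeddings
print(simhash_buckets(tokens))            # tokens sharing a bucket id can be grouped for routing/communication
```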

On the 15th of January, iFlytek launched its own deep reasoning large model, Spark Deep Reasoning X1, trained on a fully domestic computing platform. It demonstrates characteristics similar to “slow thinking” during problem solving, whilst achieving ‘industry-leading’ results with relatively low computing power. It is particularly strong in Chinese mathematical capabilities and has already been successfully applied in the education sector as an intelligent teaching assistant [19].

1月15日,科大讯飞推出了其自主研发的深度推理大模型 Spark Deep Reasoning X1,该模型在全国产计算平台上完成训练。它在解决问题时展现出类似于“慢思考”的特性,同时以相对较低的算力实现了“行业领先”的成绩。该模型在中文数学能力方面表现尤为突出,并已作为智能教学助手成功应用于教育领域 [19]。

On the 20th of January, Kimi k1.5 [20] was released by the Chinese research company Moonshot AI, reporting o1-equivalent performance on reasoning tasks (i.e. 77.5% on AIME and 96.2% on MATH). This model also reports the use of RL in post-training [21]. According to the technical press, Kimi is multimodal, handling text/code and images. It has a context length of 128k, meaning whole novels can be read in via the prompt. Their simplified RL framework balances exploration and exploitation, and penalised the model for generating overly verbose responses. They also encouraged shorter/faster responses by blending the weights from both long and short CoT models [22].

1月20日,中国研究公司Moonshot AI发布了Kimi k1.5 [20],报告了在推理任务上与o1相当的性能(即在AIME上为 77.5%,在MATH上为 96.2%)。该模型还报告了在后训练阶段使用了RL(强化学习)[21]。根据技术媒体的报道,Kimi是多模态的,支持文本/代码和图像。它的上下文长度为 128k,意味着可以通过提示输入整部小说。他们的简化RL框架平衡了探索和利用,并对模型生成过于冗长的回应进行惩罚。他们还通过混合长短CoT(Chain-of-Thought)模型的权重,鼓励更短/更快的回应 [22]。
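The weight ‘blending’ mentioned is, at its simplest, weight-space interpolation between two fine-tuned checkpoints of the same architecture. A minimal sketch of that merging step; the mixing coefficient and the toy modules are placeholders, and the reported approach involves more than simple averaging:

```python
import torch

def blend_state_dicts(long_cot_sd: dict, short_cot_sd: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two checkpoints of the same architecture, parameter by parameter.
    alpha = 1.0 keeps the long-CoT weights, alpha = 0.0 keeps the short-CoT weights."""
    return {name: alpha * long_cot_sd[name] + (1 - alpha) * short_cot_sd[name]
            for name in long_cot_sd}

# Toy demo with two randomly initialised copies of the same tiny module.
m_long, m_short = torch.nn.Linear(4, 4), torch.nn.Linear(4, 4)
merged = blend_state_dicts(m_long.state_dict(), m_short.state_dict(), alpha=0.3)
target = torch.nn.Linear(4, 4)
target.load_state_dict(merged)            # load the interpolated weights into a model of the same shape
print(target.weight.shape)
```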

At the end of January, Qwen released a new family of models, Qwen2.5-VL [23]. This multi-modal (visual and text) model has had several improvements over Qwen2, including better text recognition (including handwriting, multilingual text and tables), improved object detection and spatial reasoning, improved agent functionality and better video functionality.

1 月底,Qwen 发布了新模型系列 Qwen2.5-VL [23]。这个多模态(视觉和文本)模型相比 Qwen2 有多项改进,包括更好的文本识别(包括手写、多语言文本和表格)、改进的目标检测和空间推理、增强的 AI智能体功能以及更好的视频功能。

On 2nd February OpenAI announced Deep Research [24], claiming “It accomplishes in tens of minutes what would take