[论文翻译]DeepSeek R1 简要分析及其对生成式 AI (Generative AI) 的影响


原文地址:https://arxiv.org/pdf/2502.02523


Brief analysis of DeepSeek R1 and its implications for Generative AI

DeepSeek R1 简要分析及其对生成式 AI (Generative AI) 的影响

Abstract

摘要

In late January 2025, DeepSeek released their new reasoning model (DeepSeek R1), which was developed at a fraction of the cost of OpenAI’s models yet remains competitive with them, despite the US’s GPU export ban. This report discusses the model, and what its release means for the field of Generative AI more widely. We briefly discuss other models released from China in recent weeks and their similarities: innovative use of Mixture of Experts (MoE), Reinforcement Learning (RL) and clever engineering appear to be key factors in the capabilities of these models. This think piece has been written to a tight timescale, providing broad coverage of the topic, and serves as introductory material for those looking to understand the model’s technical advancements, as well as its place in the ecosystem. Several further areas of research are identified.

2025年1月下旬,DeepSeek发布了他们的新推理模型(DeepSeek R1);尽管美国实施了GPU出口禁令,该模型的开发成本仅为同类模型的一小部分,却仍与OpenAI的模型保持竞争力。本报告讨论了该模型,并探讨了其发布对生成式AI领域更广泛的意义。我们简要讨论了最近几周中国发布的其他模型及其相似之处:专家混合模型(MoE)、强化学习(RL)的创新应用以及巧妙的工程似乎是这些模型能力的关键因素。本文是在较短的时间内撰写的,提供了对该主题的广泛覆盖,并作为理解该模型技术进步及其在生态系统中地位的入门材料。文末还指出了几个值得进一步研究的方向。

1 Introduction

1 引言

The relatively short history of Generative AI has been punctuated with big steps forward in model capability. This happened again over the last few weeks, triggered by a couple of papers released by the Chinese company DeepSeek [1]. In late December they released DeepSeek-V3 [2], a direct competitor to OpenAI’s GPT4o, apparently trained in two months for approximately $5.6 million [3, 4], which equates to 1/50th of the costs of other comparable models [5]. On the 20th of January they released DeepSeek-R1 [6], a set of reasoning models containing “numerous powerful and intriguing reasoning behaviours” [6], achieving comparable performance to OpenAI’s o1 model – and they are open for researchers to examine [7].

生成式 AI (Generative AI) 相对较短的历史中,模型能力的大幅提升时常出现。过去几周,这种情况再次发生,由中国公司深度求索 (DeepSeek) 发布的两篇论文引发了这一进展 [1]。12 月底,他们发布了 DeepSeek-V3 [2],这是 OpenAI 的 GPT4o 的直接竞争对手,显然在两个月内完成训练,成本约为 560 万美元 [3, 4],相当于其他类似模型成本的 1/50 [5]。1 月 20 日,他们发布了 DeepSeek-R1 [6],这是一组推理模型,包含“许多强大且有趣的推理行为” [6],达到了与 OpenAI 的 o1 模型相当的性能,并且他们向研究人员开放以供审查 [7]。

This openness is a welcome move for many AI researchers keen to understand more about the models they are using. It should be noted that these models are released as ‘open weights’ meaning the model can be built upon, and freely used (under the MIT license), but without the training data it’s not truly open source. However, more details than usual were shared about the training process in the associated documentation.

这种开放性对于许多渴望更深入了解所使用模型的 AI 研究人员来说是一个受欢迎的措施。需要注意的是,这些模型以“开放权重 (open weights)”的形式发布,意味着模型可以在 MIT 许可下被构建和自由使用,但由于缺少训练数据,它们并非真正的开源。然而,在相关文档中,关于训练过程的细节比往常分享得更多。

2 DeepSeek

2 DeepSeek

In this section we give a brief overview of the latest models out of DeepSeek. We begin by discussing DeepSeek V3, a competitor to OpenAI’s GPT4o model, used as the base model for the development of DeepSeek R1. For more details, please see the original papers for DeepSeek-V3 [2] and DeepSeek-R1 [6].

在本节中,我们简要概述了DeepSeek的最新模型。首先我们讨论DeepSeek V3,它是OpenAI的GPT4o模型的竞争对手,并作为DeepSeek R1开发的基础模型。更多详细信息,请参见DeepSeek-V3 [2] 和 DeepSeek-R1 [6] 的原始论文。

2.1 DeepSeek V3 - base model

2.1 DeepSeek V3 - 基础模型

The DeepSeek-V3 model employs two major sources of efficiency: the Mixture of Experts (MoE) architecture and a range of engineering optimisations.

DeepSeek-V3 模型采用了两种主要的高效策略:专家混合(Mixture of Experts,MoE)架构和大量的工程优化。

The MoE architecture, which at a high level divides the model into a selection of specialised smaller models (one for maths, one for coding, etc.) to ease the training burden, is not new: it was used in machine-translation Transformers such as Google’s GShard in 2020 and in the Mixtral LLM [8] in January 2024, and DeepSeek published a paper on their own approach to MoE in January 2024 [9]. A flurry of MoE papers followed during 2024, with several of the MoE techniques used by the models in the next section being presented at NeurIPS at the end of 2024. This shows that, architecturally at least, DeepSeek V3 was not an out-of-the-blue breakthrough (with 20/20 hindsight!).

MoE架构在高层级上本质上是将模型划分为多个专用的小模型(一个用于数学,一个用于编码等),以减轻训练负担;该架构并非新事物:它曾在2020年Google的GShard等机器翻译Transformer中使用,并于2024年1月在Mixtral大语言模型[8]中应用,同时DeepSeek在2024年1月发表了关于其MoE方法的论文[9]。2024年期间涌现了大量关于MoE的论文,其中许多被下一节模型所采用的MoE技术在2024年底的NeurIPS会议上展示。这至少从架构上表明,DeepSeek V3并非一个突如其来的突破(事后看来如此!)。
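To make the routing idea concrete, here is a minimal sketch of sparse top-k expert selection. This is an illustrative toy, not DeepSeek's actual implementation: the gate weights, expert functions and dimensions are all stand-ins, and real MoE layers add load-balancing losses and batched routing.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, top_k=2):
    """Minimal sparse MoE layer: route input x to the top_k experts
    chosen by a linear gate, and combine their outputs weighted by
    the renormalised gate probabilities."""
    # gate logits: one score per expert (dot product with gate weights)
    logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in gate_weights]
    probs = softmax(logits)
    # keep only the top_k experts (sparse activation: the rest never run)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        w = probs[i] / norm
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

# toy "experts": each just scales the input differently
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
out, chosen = moe_forward([0.5, 1.5], experts, gate_weights, top_k=2)
```

The key property shown is that only `top_k` of the experts execute per token, which is why an MoE model can have a very large total parameter count while activating only a fraction of it per forward pass.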

2.2 DeepSeek R1 - reasoning

2.2 DeepSeek R1 - 推理

The aim of the project was to improve reasoning capabilities using pure Reinforcement Learning (RL), without the need for supervised data, to focus on self-evolution. Taking their V3 model (671B parameters) as a base and employing scalable Group Relative Policy Optimization (GRPO) as the RL framework, the resulting R1-Zero model showed improvements in reasoning and maths but also challenges such as poor readability and language mixing.

该项目的目标是利用纯强化学习(Reinforcement Learning, RL)来提升推理能力,无需监督数据,专注于自我进化。以其V3模型(671B参数)为基础,采用可扩展的组相对策略优化(Group Relative Policy Optimization, GRPO)作为RL框架,最终生成的R1-Zero模型在推理和数学方面有所改进,但也面临诸如可读性差和语言混合等挑战。
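The group-relative baseline at the heart of GRPO can be sketched in a few lines. This is an illustrative simplification of the advantage computation only (sampling, the KL penalty and the clipped policy update are omitted), not DeepSeek's code:

```python
import statistics

def grpo_advantages(rewards):
    """GRPO replaces a learned value model with a group baseline:
    for G sampled responses to the same prompt, each response's
    advantage is its reward standardised against the group's mean
    and standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # all responses equally good: no learning signal
    return [(r - mu) / sigma for r in rewards]

# e.g. four sampled answers to one maths prompt, reward 1.0 = correct
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, no separate critic/value network needs to be trained, which is part of why the approach scales cheaply.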

Notably, the performance of the R1-Zero model on AIME 2024 increased from 15.6% to 71.0%, comparable to OpenAI-o1-0912; this was then exceeded when the DeepSeek team tweaked the RL (majority voting), scoring 86.7%.

值得注意的是,R1-Zero 模型在 AIME 2024 上的表现从 15.6% 提升到了 71.0%,与 OpenAI-o1-0912 相当;随后 DeepSeek 团队通过调整 RL(多数投票),得分达到了 86.7%,超越了这一表现。
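Majority voting (also called self-consistency) is simple to state: sample several completions for the same question and take the most common extracted final answer. A minimal sketch, with the answer strings as toy stand-ins:

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency / majority voting: given the final answers
    extracted from several sampled completions of the same prompt,
    return the most frequent one."""
    return Counter(answers).most_common(1)[0][0]

# e.g. eight sampled completions whose extracted final answers are:
best = majority_vote(["42", "41", "42", "42", "7", "42", "41", "42"])
```

The technique trades extra inference compute (more samples) for accuracy, which is consistent with the broader test-time-compute theme of these reasoning models.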

They continued to evolve their pipeline reintroducing some supervised fine tuning, which resulted in the R1 model, which reportedly achieves scores on par with OpenAI’s o1 model for many reasoning and maths-based evaluation tasks.

他们继续改进其流程,重新引入了一些监督微调,最终得到了 R1 模型。据报道,该模型在许多基于推理和数学的评估任务中取得了与 OpenAI 的 o1 模型相当的成绩。

The process of RL encourages the model to generate more tokens (more ‘thinking time’) to solve reasoning tasks. As the process progresses and test-time computation increases, behaviours such as reflection and the exploration of alternative approaches arise spontaneously; the term ‘aha moment’ [6] has been ascribed to the moment when an intermediate model learns to rethink, using an anthropomorphic tone. This emergent property of self-reflection is a key finding that needs further research to unpick and evaluate: is the model ‘learning’ how to answer better through self-reflection, in the same way it ‘learnt’ to write prose in the early days of GPT, and if so, will these internal ‘functions’ enable better generalisation?

强化学习 (RL) 的过程鼓励模型生成更多的 Token (更多的“思考时间”) 来解决推理任务。随着过程的进展,测试时的计算量增加,反思和探索替代方法等行为会自发产生,术语“顿悟时刻 (aha moment)” [6] 被用来描述中间模型以拟人化的语调重新思考的那一刻。这种自我反思的涌现特性是一个关键发现,需要进一步研究来剖析和评估;模型是否通过自我反思“学习”如何更好地回答问题,就像 GPT 早期“学会”写散文一样;在这种情况下,这些内部“功能”是否能实现更好的泛化?

Another observation from the R1 paper is that the model’s performance decreased when they introduced RL prompts to encourage language consistency, trading off benchmark performance against usability and readability; the performance of the finalised R1 model on AIME 2024 was 79.8%. This leads to the question: if the model were allowed to ‘think’ in any language (including code) without concern for the readability of its CoT artefacts, with a translation applied before the output is presented to the user, would this improve performance without impacting usability? Conversely, being able to view and interrogate a model’s CoT artefacts not only builds users’ confidence, but also aids explainability.

从 R1 论文中的另一个观察结果是,当引入 RL 提示来鼓励语言一致性时,模型的性能有所下降,这是在基准测试性能与可用性和可读性之间做出的权衡;最终 R1 模型在 AIME 2024 上的性能为 79.8%。这就引出了一个问题:如果允许模型以任何语言(包括代码)“思考”,而不关心其 CoT (Chain of Thought) 产物的可读性,然后在输出呈现给用户之前进行翻译,这是否会在不影响可用性的情况下提高性能?相反,能够查看和审视模型的 CoT 产物,不仅能增强用户的信心,还能帮助提高可解释性。

The paper also presented details of how the reasoning patterns of larger models can be ‘distilled’ into smaller models (via the supervised fine-tuning dataset), and that these distilled versions perform better than if the same RL were performed on the small model directly. The hope is that this distillation can be built upon to yield even smaller, yet still performant, models. The performance of the distilled models improved compared to their original baseline benchmarks, with R1-Distill-Qwen-32B and R1-Distill-Llama-70B outperforming OpenAI’s o1-mini on tasks involving coding and mathematical reasoning. Again, future research could be devoted to determining the effect such distillation has on the overall attitude (values and personality) of the model.

该论文还详细介绍了如何将更大模型的推理模式“蒸馏”到小模型中(通过监督微调数据集),这些蒸馏后版本的性能优于在模型上执行相同 RL(强化学习)的结果。希望这种蒸馏方法可以进一步发展,以产生更小但仍具有高性能的模型。蒸馏模型的性能相较于原始基线基准有所提升,其中 R1-Distill-Qwen-32B 和 R1-Distill-Llama-70B 在涉及编程和数学推理的任务中表现优于 OpenAI 的 o1-mini。未来的研究可以致力于确定这种蒸馏对模型整体态度(价值观和个性)的影响。
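In outline, this style of distillation builds an SFT corpus from teacher outputs. The sketch below is a hypothetical pipeline shape only: `teacher_generate` and `keep` are stand-ins for R1's actual sampler and rejection filter, which are not public.

```python
def build_distillation_set(prompts, teacher_generate, keep):
    """Reasoning distillation, in outline: sample the teacher's
    chain-of-thought answers, keep only those passing a quality
    filter, and use the surviving (prompt, completion) pairs as a
    supervised fine-tuning corpus for a smaller student model."""
    dataset = []
    for p in prompts:
        completion = teacher_generate(p)
        if keep(p, completion):
            dataset.append({"prompt": p, "completion": completion})
    return dataset

# toy stand-ins: the 'teacher' returns a worked answer, the filter
# rejects empty completions
demo = build_distillation_set(
    ["2+2=?"],
    teacher_generate=lambda p: "<think>2 plus 2 is 4</think> 4",
    keep=lambda p, c: len(c) > 0,
)
```

The design point worth noting is that the student never runs RL itself here; all of the reasoning behaviour is transferred through ordinary supervised fine-tuning on the filtered pairs.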

2.3 Replication

2.3 复制

On the 25th of January, researchers from the Hong Kong University of Science and Technology released a paper [10, 11] describing how long Chain-of-Thought (CoT) and self-reflection can emerge on a 7B model with only 8k MATH examples, reporting that “we achieve surprisingly strong results on complex mathematical reasoning”.

1月25日,香港科技大学的研究人员发布了一篇论文 [10, 11],描述了在仅使用 8k MATH 示例的情况下,长链思维 (Chain-of-Thought, CoT) 和自我反思如何在7B模型中出现,并且“我们在复杂数学推理上取得了令人惊讶的强大结果”。

Their aim was to recreate the R1-Zero model; they started with Qwen2.5-Math-7B (the base model) and performed reinforcement learning on it directly (no SFT, no reward model) with only 8k MATH examples. They observed the same increase in Chain-of-Thought length and emergent self-reflection. The resulting model achieved 33.3% on AIME and 77.2% on MATH benchmarks (up from 16.7% and 52.4% respectively for the base model), comparable to rStar-MATH [12]. They note that rStar-MATH uses greater than 50 times the data and requires more complicated components.

他们的目标是复现 R1-Zero 模型;他们以 Qwen2.5-Math-7B(基础模型)为起点,仅使用 8k MATH 样本直接对其进行强化学习(没有 SFT,没有奖励模型)。他们观察到了思维链长度的增加和自我反思的涌现。最终模型在 AIME 上达到了 33.3%,在 MATH 基准测试中达到了 77.2%(基础模型分别为 16.7% 和 52.4%);与 rStar-MATH [12] 相当。他们指出,rStar-MATH 使用了超过 50 倍的数据,并且需要更复杂的组件。

There were some notable differences in the approach taken; for example, this project used Proximal Policy Optimization (PPO) instead of GRPO for its RL, although both are considered relatively simple and do not require reward models. Perhaps more importantly, they did not start with a large model: they sought to recreate the approach using the smaller 7B-parameter Qwen model and without a large-scale RL setup.

在采取的方法上存在一些显著差异,例如,该项目使用了近端策略优化 (Proximal Policy Optimization, PPO) 而不是 GRPO 来完成其强化学习 (RL) ,尽管两者都被认为是相对简单的,并且不需要奖励模型等,但也许更重要的是,他们并没有从一个大型模型开始,而是试图使用较小的 7B 参数 Qwen 模型并在没有大规模 RL 设置的情况下重现该方法。
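For comparison with the GRPO baseline discussed earlier, PPO's defining ingredient is the clipped surrogate objective, sketched below for a single sample. This is an illustrative simplification: real PPO operates on batches of log-probability ratios and usually adds value and entropy terms.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from PPO for one sample.
    `ratio` is pi_new(a|s) / pi_old(a|s); clipping keeps the new
    policy from moving too far from the old one in a single update."""
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    # PPO maximises the surrogate, so the loss is its negation
    return -min(ratio * advantage, clipped * advantage)

# a ratio above the clip range: the incentive to increase it further is capped
loss = ppo_clip_loss(1.5, 1.0)
```

Both PPO and GRPO avoid the heavier machinery of learned reward models, which fits the replication project's minimal setup.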

Hugging Face are recreating R1 [13], and this will be fully open sourced, with the full data and training pipeline released. They aim to recreate the whole pipeline, including implementing the missing components. They intend to replicate the R1-Distill models by extracting a high-quality reasoning corpus from DeepSeek-R1, reproduce the pure reinforcement learning pipeline used to create the R1-Zero model, and demonstrate the ability to transition from a base model to an RL-tuned model through multi-stage training (akin to R1’s).

Hugging Face 正在重建 R1 [13],这将完全开源,并发布完整的数据和训练流程。他们的目标是重建整个流程,包括实现缺失的组件。他们计划通过从 DeepSeekR1 中提取高质量推理语料来复制 R1-distil 模型,重现用于创建 R1-Zero 模型的纯强化学习流程,并展示通过多阶段训练从基础模型过渡到 RL 调优模型的能力(类似于 R1 的)。

3 Related Work of Note

3 相关工作

These aren’t the only notable innovations to come out of China in recent weeks. On the 22nd of January, ByteDance (the company behind TikTok, at time of writing) released their Doubao-1.5-pro model [14], which out-performs GPT-4o and is 50x cheaper [15]. It also uses MoE, and a highly optimised architecture that balances performance with reduced computational demands. Doubao is one of the most popular AI chatbots in China, with 60 million active users [16]. The company focuses on building AI models that balance intelligence with communication, looking for more emotionally aware, natural-sounding interactions. It is likely that Doubao incorporates improved prompt optimisation techniques [17] and communication-efficient MoE training via locality-sensitive hashing [18]; the latter is aimed at tackling the latency challenges inherent in training sparse-gated MoE models, and results in 2.2 times quicker inference.

这些并非中国近期唯一的显著创新。1月22日,字节跳动(TikTok的母公司——截至撰写时)发布了他们的Doubao-1.5-pro模型 [14],该模型性能超越GPT 4o,且成本低至其1/50 [15]。它也采用了MoE(混合专家)技术,并拥有一个高度优化的架构,在保持高性能的同时降低了计算需求。Doubao是中国最受欢迎的AI聊天机器人之一,拥有6000万活跃用户 [16]。该公司专注于构建智能与沟通并重的AI模型,追求更具情感意识和自然流畅的交互体验。Doubao可能采用了改进的提示优化技术 [17],以及通过局部敏感哈希实现的通信高效的MoE训练 [18]。后者旨在解决稀疏门控MoE模型训练中固有的延迟挑战,使得推理速度提升至原来的2.2倍。

On the 15th of January, iFlytek launched its own deep reasoning large model, trained on a fully domestic computing platform: Spark Deep Reasoning X1. It demonstrates characteristics similar to “slow thinking” during problem solving, whilst achieving ‘industry-leading’ results with relatively low computing power. It is particularly strong in Chinese mathematical capabilities and has already been successfully applied in the education sector as an intelligent teaching assistant [19].

1月15日,科大讯飞推出了其自主研发的深度推理大模型,该模型在全自主研发的计算平台上训练,名为 Spark Deep Reasoning X1。它在解决问题时展现出类似于“慢思考”的特性,同时在相对较低的计算能力下实现了“行业领先”的成绩。该模型在中文数学能力方面表现尤为突出,并已在教育领域成功应用,作为智能教学助手 [19]。

On the 20th of January, Kimi k1.5 [20] was released by the Chinese research company Moonshot AI, reporting o1-equivalent performance on reasoning tasks (i.e. 77.5% on AIME and 96.2% on MATH). This model also reports the use of RL in post-training [21]. According to the technical press, Kimi is multimodal, handling text/code and images. It has a context length of 128k, meaning whole novels can be read in via the prompt. Their simplified RL framework balances exploration and exploitation, and penalises the model for generating overly verbose responses. They also encouraged shorter/faster responses by blending the weights from both long and short CoT models [22].

1月20日,中国研究公司Moonshot AI发布了Kimi k1.5 [20],报告了在推理任务上等效于o1的性能(即在AIME上为 77.5%,在MATH上为 96.2%)。该模型还报告了在训练后使用了RL(强化学习)[21]。根据技术媒体的报道,Kimi是多模态的,支持文本/代码和图像。它的上下文长度为 128k,意味着可以通过提示输入整部小说。他们的简化RL框架平衡了探索和利用,并对生成过于冗长回应的模型进行了惩罚。他们还通过混合长短CoT(Chain-of-Thought)模型的权重,鼓励更短/更快的回应[22]。
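Blending weights from a long-CoT and a short-CoT model is a form of weight-space model merging. Assuming the simplest merging rule (linear interpolation of matching parameters, which the Kimi paper's 'long2short' description suggests but whose exact recipe is not reproduced here), a sketch looks like:

```python
def blend_checkpoints(long_cot, short_cot, alpha=0.5):
    """Linearly interpolate the parameters of two checkpoints with
    identical architecture; alpha is the share given to the long-CoT
    model. Parameters are shown as flat name -> list-of-floats dicts
    for illustration."""
    assert long_cot.keys() == short_cot.keys(), "checkpoints must match"
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(long_cot[name], short_cot[name])]
        for name in long_cot
    }

# toy two-parameter 'checkpoints'
merged = blend_checkpoints({"w": [1.0, 3.0]}, {"w": [3.0, 1.0]}, alpha=0.5)
```

Merging in weight space requires no extra training, which makes it an attractive way to trade off verbosity against answer quality after both models exist.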

At the end of January, Qwen released a new family of models, Qwen2.5-VL [23]. This multi-modal (visual and text) model has had several improvements over Qwen2, including better text recognition (including handwriting, multilingual text and tables), improved object detection and spatial reasoning, improved agent functionality, and better video functionality.

1 月底,Qwen 发布了新模型系列 Qwen2.5-VL [23]。这个多模态(视觉和文本)模型相比 Qwen2 有多项改进,包括更好的文本识别(包括手写、多语言和表格)、改进的目标检测和空间推理、增强的 AI智能体功能以及更好的视频功能。

On the 2nd of February, OpenAI announced Deep Research [24], claiming “It accomplishes in tens of minutes what would take a human many hours.” After the DeepSeek models were released, it was conjectured that this might force OpenAI to rush their next release to maintain market dominance. It is too early to determine if this is the case, or the impact it has had on the model.

2月2日,OpenAI宣布了深度研究 (Deep Research) [24],并声称“它能在几十分钟内完成人类需要数小时的工作。”DeepSeek模型发布后,有人推测这可能会迫使OpenAI加快下一个版本的发布以保持市场主导地位。目前还无法确定这是否属实,也无法确定这对模型产生了何种影响。

4 Reactions and Observations

4 反应与观察

4.1 Implications and Repercussions

4.1 影响与后果

• These models highlight the importance of algorithmic efficiency and resource optimisation. Instead of relying on brute-force scaling, DeepSeek shows that high performance can be achieved with significantly fewer resources.

• 这些模型强调了算法效率和资源优化的重要性。DeepSeek 表明,不依赖于暴力扩展,通过显著减少资源也可以实现高性能。

• OpenAI have already cut their prices twice in recent days, and pressure is mounting that they should allow users access to the reasoning tokens.

• OpenAI 最近已经两次降价,而且要求他们允许用户访问推理 Token 的压力越来越大。

– On the 29th of January, OpenAI suggested that DeepSeek ‘may have inappropriately distilled our models’ [25]. At time of publication, no further analysis or confirmation has been forthcoming.

– On the 31st of January, OpenAI deployed their o3-mini reasoning model in response [26]. This model uses deliberative alignment, where a set of internal policies is reviewed at every reasoning step to ensure it is not ignoring any safety rules, but they also acknowledge that reasoning models are better at jailbreaking themselves [27].

– 1月29日,OpenAI指出DeepSeek可能“不恰当地蒸馏了我们的模型”[25]。截至本文发布时,尚未有进一步的分析或确认。

– 1月31日,OpenAI部署了他们的o3-mini推理模型作为回应[26]。该模型采用了深思熟虑的对齐(deliberative alignment)方法,即在每一步推理过程中审查一套内部策略,以确保不会忽略任何安全规则,但他们也承认推理模型在自我越狱方面表现更好[27]。

• There were consequences for Nvidia: how many top-of-the-line chips are really needed to build state-of-the-art models? Shares in Nvidia fell by 17%, losing nearly $600bn off its market value [4, 28].

• 这给 Nvidia 带来了影响:构建顶尖模型究竟需要多少顶级芯片?Nvidia 的股价下跌了 17%,市值蒸发了近 6000 亿美元 [4, 28]。

• It also shows that the US’s CHIPS-Act [29], designed to slow China in the AI race, may have inadvertently encouraged innovation.

• 这也表明,旨在减缓中国在AI领域竞争的美国CHIPS法案 [29],可能无意中促进了创新。

• DeepSeek app is at the top of the App Store charts for UK, US and China [30].

• DeepSeek app 在英国、美国和中国的 App Store 排行榜上名列前茅 [30]。

4.2 DeepSeek Observations from the AI research community

4.2 来自 AI 研究社区的 DeepSeek 观察


Fig. 1: A comparison of model outputs to highlight value differences between the two models

图 1: 模型输出对比,突出两个模型之间的价值差异

4.3 Political Commentary

4.3 政治评论

Many have commented on the model’s refusal to answer questions on certain topics, related to the censorship of the CCP [40]. From a national security point of view, this raises several concerns. In particular, how the risk profile changes if the majority of users go from using an American aligned LLM, to a CCP aligned LLM. Especially when a large proportion of users are using LLMs instead of Search Engines for facts (See Fig. 1 for an example discrepancy between responses, generated 3 Feb. 2025). However, censorship appears not to be present when the model is run locally.

许多人评论了该模型拒绝回答某些与中国共产党的审查制度相关的话题 [40]。从国家安全的角度来看,这引发了一些担忧。特别是,如果大多数用户从使用与美国对齐的大语言模型转向使用与中国共产党对齐的大语言模型,风险状况将如何变化。尤其是当大部分用户使用大语言模型而不是搜索引擎来获取事实时(如图 1 所示,2025 年 2 月 3 日生成的回答之间的差异示例)。然而,当模型在本地运行时,似乎不存在审查现象。

Political commentators have suggested the release of the DeepSeek-R1 model was specifically aligned with President Trump’s inauguration, to undermine the perception of US dominance of the AI sector [40], or perhaps to undermine the impact of The Stargate Project [41]. Of course, it could be the rush to get things released prior to the (Chinese) new year.

政治评论员认为,DeepSeek-R1 模型的发布特意与特朗普总统就职典礼同步,旨在削弱美国在 AI 领域主导地位的印象 [40],或者可能是为了削弱 The Stargate Project [41] 的影响。当然,也可能是为了赶在(中国)新年前发布。

The US [42] and Australian [43] governments raised concerns about the use of DeepSeek by staff, with the US Navy banning the application on “security and ethical" grounds [44]. Meanwhile, the application has also been banned country-wide in Italy, pending an investigation into the app’s handling of personal data by the privacy watchdog, Garante [45]. Coupled with a recent data breach [46] that allowed researchers to access over 1 million plain-text chat histories, it paints a worrying picture of data-handling practices within the fast-paced AI environment.

US [42] 和 Australian [43] 政府对员工使用 DeepSeek 表示担忧,美国海军以“安全和伦理”为由禁止了该应用 [44]。同时,在意大利,该应用已被全国范围内禁止,等待隐私监管机构 Garante 对其个人数据处理方式的调查 [45]。再加上最近的一次数据泄露事件 [46],该事件允许研究人员访问超过 100 万条明文聊天记录,这在快速发展的 AI 环境中描绘了一幅令人担忧的数据处理实践图景。

A ‘White House AI and crypto czar’ stated “There’s substantial evidence that what DeepSeek did here is they distilled the knowledge out of OpenAI’s models” [42]. It will be interesting to see if OpenAI mitigate teacher-student threats, and how they will achieve that without impacting usability. Additionally, it will be interesting to see the implications of a more restrictive usage policy, if this is the route that OpenAI choose to go down, potentially forcing more people towards open-source non-Western alternatives. Alternatively, it may cause a fracture of the frontier-model landscape, leading to walled-garden, siloed models that are tailored to their target audience. Indeed, we are already seeing evidence of this, such as with the OpenEuroLLM project [47].

“白宫AI和加密货币负责人”表示:“有大量证据表明,DeepSeek所做的是从OpenAI的模型中提炼了知识” [42]。值得关注的是,OpenAI是否会缓解师生(teacher-student)威胁,以及他们如何在不影响可用性的情况下实现这一目标。此外,如果OpenAI选择采取更严格的使用政策,其影响也值得关注;这可能会迫使更多人转向开源的非西方替代方案。或者,这可能导致前沿模型格局的分裂,形成针对目标受众的封闭式、孤岛式模型。事实上,我们已经看到了这方面的证据,例如 OpenEuroLLM 项目 [47]。

5 Discussion

5 讨论

We believe this flurry of reasoning model releases, with lower training and inference costs, is China’s technical response to data (and compute) scaling limitations. These models demonstrate an innovative mix of KISS (‘keep it simple’) approaches and clever engineering, building on the open-source literature, with many techniques traceable back through recent papers, albeit with details of the data used for training frustratingly absent from the documentation.

我们相信,这一系列训练和推理成本较低的推理模型发布,是中国对数据(和计算)扩展限制的技术回应。这些模型展示了 KISS 方法与巧妙工程相结合的创新,基于开源文献,许多技术可追溯到最近的论文。然而,令人遗憾的是,文档中缺少用于训练的数据细节。

The focus on improving maths and coding (through reasoning) may be to support future agentic approaches (2025 being touted as the year of the agent). But it should be noted that these evaluations are at the easier end of the scale to automate: correct maths answers are definite, and coding tasks with unit tests can also be easily automated, making them more suitable for RL-type approaches.

关注数学和编程(通过推理)的提升可能是为了支持未来的AI智能体方法(2025年被吹捧为AI智能体之年)。但值得注意的是,这些评估在自动化难度上属于较容易的一端:正确的数学答案是确定的,带有单元测试的编程任务也可以轻松自动化,因此更适合基于强化学习(RL)的方法。
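The reason such tasks suit RL is that their reward can be computed by a rule rather than a learned model. A minimal sketch of a verifiable maths reward; the `Answer:` marker is an illustrative convention, not the format any particular model uses:

```python
def math_reward(model_output, gold_answer):
    """Rule-based reward for a verifiable task: extract the final
    answer (here, whatever follows 'Answer:') and compare it with the
    reference. Correct maths answers are definite, so no learned
    reward model is needed."""
    marker = "Answer:"
    if marker not in model_output:
        return 0.0  # malformed output earns nothing
    predicted = model_output.split(marker)[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

r = math_reward("Let me think... 6*7=42. Answer: 42", "42")
```

Unit-tested coding tasks work the same way, with the reward being whether the generated code passes the tests; open-ended tasks like creative writing lack such a cheap, definite check, which is exactly the open question the next paragraph raises.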

However, if we consider that simple RL allows models to be ‘upskilled’ with relatively small datasets (like the 8k MATH), what other skills could be developed/bestowed onto small models? Is this technique only effective for pass/fail datasets? Or do you get similar returns when upskilling a model to be more creative with its story writing, for example.

然而,如果考虑到简单的强化学习 (RL) 允许模型通过相对较小的数据集(如 8k MATH)进行“提升”,那么在小模型上还能开发/赋予哪些其他技能?这种技术是否仅适用于通过/失败的数据集?或者,例如,当提升模型以使其在故事写作中更具创造性时,是否会获得类似的回报?

Responding to the uncertainty about the technology used and the true costs of training: it is obviously difficult for us to provide accurate and reliable conclusions. This does pose an interesting research question: what insights about the development pipeline can be gleaned from a released model? And in a similar vein, can any insights be gleaned into which datasets were used during training?

应对技术使用的不确定性和训练的真实成本:我们显然很难提供准确可靠的结论。这确实提出了一个有趣的研究问题;从发布的模型中能获得哪些关于开发流程的见解?同样地,能否从中获得关于训练期间使用了哪些数据集的见解?

The implications for smaller models are twofold: firstly, the proven ability to distil information from larger models into smaller models provides a shortcut in post-training; and secondly, the approach of using simple reinforcement learning can yield significant (albeit narrow) performance improvements at lower computational cost. Both approaches could change the risk threshold across the D&NS portfolio, including (but not limited to) malicious cyber, mis/dis-information (inc. deepfake generation) and worse, as they may provide a foundation for better reasoning ability in smaller, non-centralised models.

对较小模型的影响是双重的:首先,从较大模型中提取信息的能力为训练后提供了捷径。其次,使用简单的强化学习方法可以在较低的计算成本下显著提高性能(尽管是狭窄的)。这两种方法可能会改变D&NS投资组合的风险阈值,包括(但不限于):恶意网络活动、虚假/错误信息(包括深度伪造生成)以及更糟糕的情况,因为它们可能为较小、非集中化的模型提供更好的推理能力基础。

Although these models do not ‘fix’ the issues related to LLMs, e.g. hallucinations [5], the open-weights release of DeepSeek, bolstered by media attention, has raised the question of whether these models are ‘good enough’; given that the smaller, distilled models are freely available, will they be good enough to see widespread adoption (by businesses, researchers and hobbyists)? Some have already installed the distilled version of Qwen on a Raspberry Pi (admittedly only yielding 1.2 tokens per second), and the cheaper API rates have prompted developers to write their own VSCode plug-ins that use the DeepSeek model instead of GitHub’s Copilot. Some hypothesise that this grass-roots adoption – a shift in the ubiquity rather than the ability of AI systems – is a key step towards artificial general intelligence. If this is the case, it will be vital to understand the societal and security implications of DeepSeek’s models.

尽管这些模型并未解决与大语言模型相关的问题,例如幻觉 [5],但DeepSeek的开源权重发布在媒体关注下引发了人们对于这些模型是否“足够好”的疑问;鉴于更小、更精简的模型是免费提供的,它们是否足以看到广泛采用(企业、研究人员和爱好者)?一些人已经在Raspberry PI上安装了Qwen的精简版(诚然每秒只能生成1.2个Token)。而更便宜的API费率促使开发者编写自己的VSCode插件,使用DeepSeek模型而非GitHub的Copilot。一些人推测,这种基层采用——AI系统普及度的变化而非能力的变化——是迈向通用人工智能的关键一步。如果是这样,理解DeepSeek模型的社会和安全影响将至关重要。

References

参考文献
