A Comparison of DeepSeek and Other LLMs
Tianchen Gao, Jiashun Jin, Zheng Tracy Ke, and Gabriel Moryoussef ∗
Abstract
Recently, DeepSeek has been the focus of attention in and beyond the AI community. An interesting question is how DeepSeek compares to other large language models (LLMs). There are many tasks an LLM can do, and in this paper we compare the models on the task of predicting an outcome from a short text. We consider two settings: an authorship classification setting and a citation classification setting. In the first, the goal is to determine whether a short text is written by a human or by AI. In the second, the goal is to classify a citation into one of four types using its textual content. For each experiment, we compare DeepSeek with 4 popular LLMs: Claude, Gemini, GPT, and Llama.
We find that, in terms of classification accuracy, DeepSeek outperforms Gemini, GPT, and Llama in most cases, but underperforms Claude. We also find that DeepSeek is noticeably slower than the others but has a low cost of use, while Claude is much more expensive than all the others. Finally, we find that, in terms of similarity, the output of DeepSeek is most similar to those of Gemini and Claude (and among all 5 LLMs, Claude and Gemini have the most similar outputs).
In this paper, we also present a fully labeled data set that we collected ourselves, and propose a recipe for using the LLMs and a recent data set, MADStat, to generate new data sets. The data sets in our paper can be used as benchmarks for future studies of LLMs.
Keywords: Citation classification, AI-generated text detection, MADStat, prompt, text analysis, textual content.
1 Introduction
In the past two weeks, DeepSeek (DS), a recent large language model (LLM) (DeepSeek-AI, 2024), has shaken up the AI industry. Since its latest version was released on January 20, 2025, DS has made headlines in the news and on social media, shot to the top of the Apple App Store's downloads, stunned investors, and sank some tech stocks, including Nvidia.
What makes DS so special is that, on some benchmark tasks, it achieved results as good as, or even better than, those of the big players in the AI industry (e.g., OpenAI's ChatGPT), but at only a fraction of the training cost. For example,
• In Evstafev (2024), the author showed that, over 30 challenging mathematical problems derived from the MATH dataset (Hendrycks et al., 2021), DeepSeek-R1 achieves superior accuracy compared with ChatGPT and Gemini, among others.
• In a LinkedIn post on January 28, 2025, Javier Aguirre (a researcher specializing in medicine and AI, South Korea) wrote: “I am quite impressed with Deepseek. .... Today I had a really tricky and complex (coding) problem. Even chatGPT-o1 was not able to reason enough to solve it. I gave a try to Deepseek and it solved it at once and straight to the point”. This was echoed by several other AI researchers.
See more comparisons in DeepSeek-AI (2024); Zuo et al. (2025); Arrieta et al. (2025). Of course, a sophisticated LLM has many aspects (e.g., infrastructure, architecture, performance, cost) and can perform many tasks. The tasks discussed above are only a small part of what an LLM can deliver. It is desirable to have a more comprehensive and in-depth comparison. Such a comparison may seem to require a lot of time and effort, but some interesting discussions have already appeared on the internet and social media (e.g., Ramadhan (2025)).
We are especially interested in how accurate an LLM is in prediction. Although there is a long list of literature on this topic (see, for example, Friedman et al. (2001)), using an LLM for prediction still has an advantage: while a classical approach may need a reasonably large training sample, an LLM can work with only a prompt. Along this line, a problem of major interest is how DS compares to other LLMs in terms of prediction accuracy. In this paper, we consider the following two classification settings.
• Authorship classification (AC). Determine whether a document is human-generated (hum), AI-generated (AI), or human-generated but with AI editing (humAI).
• Citation classification (CC). Given an (academic) citation and the small piece of text surrounding it, determine which type the citation is (see below for the 4 types of citations).
For each of the two settings, we compare the classification results of DeepSeek-R1 (DS) with those of 4 representative LLMs: OpenAI's GPT-4o-mini (GPT), Google's Gemini-1.5-flash (Gemini), Meta's Llama-3.1-8b (Llama), and Anthropic's Claude-3.5-sonnet (Claude). We now discuss each of the two settings in more detail.
1.1 Authorship classification
In the past two years, AI-generated text content started to spread quickly, influencing the internet, the workplace, and daily life. This raises a problem: how to differentiate AI-authored content from human-authored content (Kreps et al., 2022; Danilevsky et al., 2020).
The problem is interesting for at least two reasons. First, the AI-generated content may contain harmful misinformation in areas such as health care, news, and finance (Kreps et al., 2022), and the spread of fake and misleading information may threaten the integrity of online resources. Second, understanding the main differences between human-generated and AI-written content can significantly help improve AI language models (Danilevsky et al., 2020).
We approach the problem by considering two classification settings, AC1 and AC2.
• (AC1). In the first setting, we focus on distinguishing human-generated text from AI-generated text (i.e., hum vs. AI).
• (AC2). In the second setting, we consider the more subtle problem of distinguishing text generated by a human from text generated by a human but with AI editing (i.e., hum vs. humAI).
For the experiments, we propose to use the recent MADStat data set (Ji et al., 2022; Ke et al., 2024). MADStat is a large-scale data set on statistical publications, consisting of the bibtex and citation information of 83,331 papers published in 36 journals in statistics and related fields, spanning 1975-2015. The data set is available for free download (see Section 2 for the download links).
We propose a general recipe where we use the LLMs and MADStat to generate new data sets for our study. We start by selecting a few authors and collecting all papers authored by them in MADStat. For each paper, MADStat contains a title and an abstract.
• (hum). We use all the original abstracts as the human-generated text.
• (AI). For each paper, we ask an LLM (GPT-4o-mini in our experiments; see Section 2.1) to write a new abstract given only the paper title; these serve as the AI-generated text.
• (humAI). For each paper, we ask the LLM to edit the original abstract; these serve as the human-generated text with AI editing.
Using this recipe, we can generate many different data sets. These data sets provide a useful platform on which we can compare different classification approaches, especially the 5 LLMs.
Remark 1. (The MadStatAI data set). In Section 2.1, we fix 15 authors in the MADStat data set (see Table 2 for the list) and generate a data set containing 582 abstract triplets (each triplet contains three abstracts: hum, AI, and humAI) following the recipe above. For simplicity, we call this data set the MadStatAI.
Once the data set is ready, we apply each of the 5 LLMs above for classification, with the same prompt. See Section 2.1 for details. Note that, aside from LLMs, we may apply other algorithms to this problem (Solaiman et al., 2019; Zellers et al., 2019; Gehrmann et al., 2019; Ippolito et al., 2020; Fagni et al., 2021; Adelani et al., 2020; Kashtan and Kipnis, 2024). However, as our focus in this paper is to compare DeepSeek with other LLMs, we only consider the 5 LLM classifiers mentioned above.
1.2 Citation classification
When a paper is cited, the citation could be significant or insignificant. Therefore, to evaluate the impact of a paper, we are interested not only in how many times it is cited, but also in how many significant citations it has. The challenge is that, while it is comparatively easy to count the raw citations of a paper (e.g., via Google Scholar or Web of Science), it is unclear how to count the number of ‘significant’ citations of a paper.
To address the issue, note that surrounding a citation instance there is usually a short piece of text. This text contains important information about the citation, and we can use it to predict the citation type. This gives rise to the problem of Citation Classification, where the goal is to use the short text surrounding a citation to predict the citation type.
Here, we have two challenges. First, it is unclear how many different types of academic citations there may be and what these types are. Second, we do not have a ready-to-use data set.
To address these challenges, first, after reviewing many works in the literature and empirical results, we propose to divide all academic citations into four types:
“Fundamental ideas (FI)”, “Technical basis (TB)”, “Background (BG)”, “Comparison (CP)”.
Below, for simplicity, we encode the four types as “1”, “2”, “3”, and “4”. Note that the first two types are considered significant, and the other two are considered comparatively less significant. See Section 2.2 for details.
Second, with substantial effort, we have collected a new data set from scratch, which we call CitaStat. To build this data set, we downloaded, in PDF format, all papers published between 1996 and 2020 in 4 representative journals in statistics. These papers contain about 360K citation instances. For our study, we selected n = 3000 citation instances. For each citation,
• we wrote code to select the small piece of text surrounding the citation in the PDF file and convert it to a usable text file (a minimal sketch of this step is given below);
• we manually labeled the citation as one of the 4 citation types above.
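As a rough illustration of the first step, the snippet below sketches how the sentence surrounding a citation marker might be pulled out of the plain text converted from a PDF, using spaCy for sentence parsing as described in Section 2.2. The regular expression for citation markers and the function name are simplifying assumptions for illustration, not the exact code used to build CitaStat.

```python
# A minimal sketch (not the exact CitaStat pipeline): locate citation markers such as
# "(Smith, 2010)" or "(Smith et al., 2010)" in the plain text extracted from a PDF,
# and keep the sentence containing each marker as the citation's textual content.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # spaCy English model used for sentence parsing

# Simplified citation-marker pattern (an assumption; real markers are more varied).
CITATION_PATTERN = re.compile(r"\([A-Z][A-Za-z\-]+(?: et al\.)?,? \d{4}\)")

def extract_citation_contexts(paper_text: str) -> list[str]:
    """Return the sentence surrounding each detected citation marker."""
    contexts = []
    for sent in nlp(paper_text).sents:
        if CITATION_PATTERN.search(sent.text):
            contexts.append(sent.text.strip())
    return contexts
```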
See Section 2.2 for details. As a result, CitaStat is a fully labeled data set with n = 3000 samples, where each y-variable takes values in {1, 2, 3, 4} (see above), and each x-variable is a short text which we call the textual content of the corresponding citation.
We can now use this data set to compare the 5 LLMs above in terms of their performance in citation classification. We consider two experiments: a 4-class experiment (CC1), where the goal is to predict which of the 4 citation types a citation belongs to, and a 2-class experiment (CC2), where the 4 types are merged into two classes, significant (types 1-2) and less significant (types 3-4).
1.3 Results and contributions
We have applied all 5 LLMs to the four experiments (AC1, AC2, CC1, CC2), and we have the following observations:
• In terms of classification errors, Claude consistently outperforms all other LLM approaches. DeepSeek-R1 underperforms Claude but outperforms Gemini, GPT, and Llama in most cases. GPT performs unsatisfactorily for AC1 and AC2, with an error rate similar to that of random guessing, but it performs much better than random guessing for CC1 and CC2. Llama performs unsatisfactorily: its error rates are either comparable to those of random guessing or even larger.
Table 1 presents the ranks of the different approaches in terms of error rates (the method with the lowest error rate is assigned a rank of 1). The average ranks suggest that DeepSeek outperforms Gemini, GPT, and Llama, but underperforms Claude (note that for CC1 and CC2, we have used two versions of DeepSeek, R1 and V3; the results in Table 1 are based on R1. If we use V3, then DeepSeek ties with Gemini in average rank; it still outperforms GPT and Llama). See Section 2 for details.
Table 1: The rankings of all 5 LLM approaches in terms of error rates.
|  | Claude | DeepSeek | Gemini | GPT | Llama |
|---|---|---|---|---|---|
| Experiment AC1 (hum vs. AI) | 1 | 2 | 3 | 5 | 4 |
| Experiment AC2 (hum vs. humAI) | 2 | 1 | 3 | 5 | 4 |
| Experiment CC1 (4-class) | 1 | 4 | 2 | 3 | 5 |
| Experiment CC2 (2-class) | 1 | 2 | 3 | 4 | 5 |
| Average rank | 1.25 | 2.25 | 2.75 | 4.25 | 4.50 |
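The average ranks in the last row are simply the column means of the four experiment-wise ranks. The short check below reproduces them; it is only a verification of the arithmetic in Table 1, not part of our pipeline.

```python
# Reproduce the "Average rank" row of Table 1 as the column-wise mean of the
# per-experiment ranks (rows: AC1, AC2, CC1, CC2; columns: the 5 LLMs).
import numpy as np

ranks = np.array([
    [1, 2, 3, 5, 4],   # AC1 (hum vs. AI)
    [2, 1, 3, 5, 4],   # AC2 (hum vs. humAI)
    [1, 4, 2, 3, 5],   # CC1 (4-class)
    [1, 2, 3, 4, 5],   # CC2 (2-class)
])
models = ["Claude", "DeepSeek", "Gemini", "GPT", "Llama"]
for model, avg in zip(models, ranks.mean(axis=0)):
    print(f"{model}: {avg:.2f}")   # 1.25, 2.25, 2.75, 4.25, 4.50
```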
Overall, we find that Claude and DeepSeek have the lowest error rates, but Claude is relatively expensive and DeepSeek is relatively slow.
We have made the following contributions. First, as DeepSeek has been the focus of attention in and beyond the AI community, there is a timely need to understand how it compares to other popular LLMs. Using two interesting classification problems, we demonstrate that DeepSeek is competitive in the task of predicting an outcome using a short piece of text. Second, we propose citation classification as an interesting new problem, the understanding of which will help evaluate the impact of academic research. Last but not least, we provide CitaStat as a new data set that can be used for evaluating academic research. We also propose a general recipe for generating new data sets (with MadStatAI as an example) for studying AI-generated text. These data sets can be used as benchmarks to compare different algorithms and to study the differences between human-generated text and AI-generated text.
2 Main results
In this section, we describe our numerical results on the two problems, authorship classification and citation classification, and report the performances of all 5 LLMs.
2.1 Authorship classification
The MADStat (Ji et al., 2022; Ke et al., 2024) contains over 83K abstracts, but it is time-consuming to process all of them. We selected a small subset as follows. First, we restricted attention to authors who had over 30 papers in MADStat. Second, we randomly drew 15 authors from this pool by sampling without replacement. Each time a new author was selected, we checked if he/she had co-authored papers with previously drawn authors; if so, we deleted this author and drew a new one, until the total number of authors reached 15. Finally, we collected all 15 authors' abstracts in MADStat. This gave rise to a data set with 582 abstracts in total (see Table 2).
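The selection procedure above can be summarized by the following sketch. The data structures (a paper list per author and a co-author lookup) and the function name are illustrative assumptions, not the exact code we used.

```python
# A sketch of the author-selection procedure: repeatedly draw authors with >30 MADStat
# papers, without replacement, rejecting any draw who has co-authored a paper with an
# already selected author. Input data structures are assumed for illustration.
import random

def select_authors(papers_by_author, coauthors_by_author, n_authors=15, min_papers=30, seed=0):
    rng = random.Random(seed)
    pool = [a for a, papers in papers_by_author.items() if len(papers) > min_papers]
    selected = []
    while len(selected) < n_authors and pool:
        author = pool.pop(rng.randrange(len(pool)))   # sample without replacement
        if any(other in coauthors_by_author.get(author, set()) for other in selected):
            continue                                   # co-authored with a selected author: redraw
        selected.append(author)
    return selected
```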
For each original human-written abstract, we used GPT-4o-mini to get two variants.
• The AI version. We provided the paper title and requested a new abstract. The prompt is: “Write an abstract for a statistical paper with this title: [paper title].”
Table 2: The 15 selected authors and their numbers of abstracts in MADStat.
• The humAI version. We provided the original abstract and requested an edited version. The prompt is: “Given the following abstract, make some revisions. Make sure not to change the length too much. [original abstract].”
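For concreteness, the sketch below shows how the two variants could be requested from GPT-4o-mini through the OpenAI Python client, using the prompts quoted above. The helper function and its return format are our own additions for illustration.

```python
# A minimal sketch of generating the AI and humAI variants with GPT-4o-mini
# (assumes the OpenAI Python client and an API key in the environment).
from openai import OpenAI

client = OpenAI()

def generate_variants(title: str, abstract: str) -> tuple[str, str]:
    """Return (AI version, humAI version) for one paper, using the prompts in the text."""
    ai_prompt = f"Write an abstract for a statistical paper with this title: {title}."
    humai_prompt = (
        "Given the following abstract, make some revisions. "
        f"Make sure not to change the length too much. {abstract}"
    )
    ai_version = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ai_prompt}],
    ).choices[0].message.content
    humai_version = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": humai_prompt}],
    ).choices[0].message.content
    return ai_version, humai_version
```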
Both variants are authored by AI, but they look different. The AI version is usually significantly different from the original abstract, so the ‘human versus AI’ classification problem is comparatively easier. For example, the left panel of Figure 1 compares the length of the original abstract with that of its AI version. The length of human-written abstracts varies widely, while the length of AI-generated ones is mostly in the range of 100-200 words. The humAI version is much closer to the original abstract, typically having only local word replacements and mild sentence restructuring. In particular, its length is highly correlated with the original length, as can be seen in the right panel of Figure 1.
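The length comparison behind Figure 1 can be reproduced, up to details, by comparing word counts of the paired abstracts. Below is a minimal sketch; whitespace tokenization and the helper names are simplifying assumptions.

```python
# A minimal sketch of the length comparison behind Figure 1: word counts of each
# human-written abstract versus its AI and humAI variants (whitespace tokenization
# is a simplifying assumption).
import numpy as np

def word_count(text: str) -> int:
    return len(text.split())

def length_summary(hum_abstracts, ai_abstracts, humai_abstracts):
    hum = np.array([word_count(t) for t in hum_abstracts])
    ai = np.array([word_count(t) for t in ai_abstracts])
    humai = np.array([word_count(t) for t in humai_abstracts])
    # The length correlation is expected to be much higher for hum vs. humAI than hum vs. AI.
    return {
        "corr(hum, AI)": np.corrcoef(hum, ai)[0, 1],
        "corr(hum, humAI)": np.corrcoef(hum, humai)[0, 1],
    }
```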
As mentioned, we consider two classification problems:
• (AC1). A 2-class classification problem of ‘human versus AI’.
• (AC2). A 2-class classification problem of ‘human versus humAI’.
For each problem, there are 582×2=1164 testing samples, half from each class. We input them into each of the 5 LLMs using the same prompt: “You are a classifier that determines whether text is human-written or AI-edited. Respond with exactly one word: either ‘human’ for human-written text or ‘ChatGPT’ for AI-written text. Be as accurate as possible.”
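As an illustration of how this prompt is used, the sketch below sends one abstract to a single LLM and maps the one-word reply back to a label. GPT-4o-mini via the OpenAI client is used here purely as an example; the other four models are called analogously through their own APIs, and the reply-parsing rule is our own assumption.

```python
# A minimal sketch of the zero-shot classification call (GPT-4o-mini is used here only
# as an example; the same prompt is sent to all 5 LLMs through their respective APIs).
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a classifier that determines whether text is human-written or AI-edited. "
    "Respond with exactly one word: either 'human' for human-written text or 'ChatGPT' "
    "for AI-written text. Be as accurate as possible."
)

def classify_abstract(abstract: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": abstract},
        ],
    ).choices[0].message.content.strip().lower()
    return "human" if reply.startswith("human") else "AI"
```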
Note that, compared with classical classification approaches (e.g., SVM, Random Forest (Friedman et al., 2001)), an advantage of using an LLM for classification is that we do not need to provide any training samples. All we need is to input a prompt into the LLM.
Figure 1: Comparison of the lengths of human-generated and AI-generated abstracts
Table 3 summarizes the performance of the 5 LLMs. For ‘human vs AI’ (AC1), Claude-3.5-sonnet has the best error rate, 0.218, and DeepSeek-R1 has the second best, 0.286. The remaining three methods almost always predict ‘human-written’, which explains why their error rates are close to 0.5. For ‘human vs humAI’ (AC2), since the problem is harder, the best achievable error rate is much higher than that for ‘human vs AI’ (AC1). DeepSeek-R1 has the lowest error rate, 0.405, and Claude-3.5-sonnet has the second best, 0.435. The error rates of the other three methods are nearly 0.5. In conclusion, Claude-3.5-sonnet and DeepSeek-R1 are the winners in terms of error rate. If we also take the running time into account, Claude-3.5-sonnet has the best overall performance. On the other hand, the cost of Claude-3.5-sonnet is the highest.
Since the 1164 testing abstracts come from 15 authors, we also report the classification error for each author (i.e., the testing documents only include this author's human-written abstracts and the AI-generated variants). Figure 2 shows the boxplots of per-author errors for each of the 5 LLMs. Since authors have different writing styles, these plots give more information than Table 3. For ‘human vs AI’ (AC1), Claude-3.5-sonnet is still the clear winner. For ‘human vs humAI’ (AC2), DeepSeek-R1 still has the best performance. Furthermore, its advantage over Claude-3.5-sonnet is more visible in these boxplots: Although the overall error rates of the two methods are only mildly different, DeepSeek-R1 does have a significantly better performance for some authors.
Table 3: The classification errors, runtime, and costs of 5 LLMs for authorship classification. (In ‘human vs AI’ (AC1), if we report four digits after the decimal point, Llama-3.1-8b has a lower error than GPT-4o-mini. This is why they have different ranks for AC1 in Table 1.)
| Method | hum vs. AI error | hum vs. AI runtime | hum vs. AI cost (USD) | hum vs. humAI error | hum vs. humAI runtime | hum vs. humAI cost (USD) |
|---|---|---|---|---|---|---|
| Claude-3.5-sonnet | 0.218 | 7 min | $0.5 | 0.435 | 7 min | $0.3 |
| DeepSeek-R1 | 0.286 | 235 min | $0.05 | 0.405 | 183 min | $0.04 |
| Gemini-1.5-flash | 0.468 | 6 min | $0.1 | 0.500 | 6 min | $0.09 |
| GPT-4o-mini | 0.511 | 7 min | $0.1 | 0.502 | 8 min | $0.12 |
| Llama-3.1-8b | 0.511 | 11 min | $0.2 | 0.501 | 12 min | $0.17 |
Figure 2: The boxplots of per-author classification errors.
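A sketch of how the per-author errors behind Figure 2 can be computed and plotted is given below. The input format (a list of (author, true label, predicted label) triples per LLM) and the function names are assumptions for illustration.

```python
# A sketch of the per-author error computation behind Figure 2 (the input format,
# a list of (author, true_label, predicted_label) triples per LLM, is an assumption).
from collections import defaultdict
import matplotlib.pyplot as plt

def per_author_errors(records):
    """records: iterable of (author, true_label, predicted_label)."""
    totals, wrong = defaultdict(int), defaultdict(int)
    for author, truth, pred in records:
        totals[author] += 1
        wrong[author] += int(truth != pred)
    return {a: wrong[a] / totals[a] for a in totals}

def boxplot_per_author_errors(errors_by_llm):
    """errors_by_llm: dict mapping LLM name -> dict of per-author error rates."""
    names = list(errors_by_llm)
    data = [list(errors_by_llm[name].values()) for name in names]
    plt.boxplot(data)
    plt.xticks(range(1, len(names) + 1), names)
    plt.ylabel("per-author classification error")
    plt.show()
```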
We also investigate the similarity of the predictions made by different LLMs. For each pair of LLMs, we calculate the percentage of agreement on predicted labels, in both the ‘human versus AI’ (AC1) and ‘human versus humAI’ (AC2) settings. The results are in Figure 3. For both settings, Gemini-1.5-flash, GPT-4o-mini, and Llama-3.1-8b have extremely high agreement with each other. This is because all three models predict ‘human-written’ for the majority of samples. DeepSeek-R1 and Claude are different from the other three, and they have 64% and 70% agreement with each other in the two settings, respectively.
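The pairwise agreement shown in Figure 3 is simply the fraction of test samples on which two models output the same label. A minimal sketch (the prediction format is an assumption):

```python
# A minimal sketch of the pairwise agreement computation behind Figure 3:
# for each pair of LLMs, the fraction of test samples with identical predicted labels.
from itertools import combinations

def agreement_matrix(predictions):
    """predictions: dict mapping LLM name -> list of predicted labels (same sample order)."""
    agreement = {}
    for a, b in combinations(predictions, 2):
        pa, pb = predictions[a], predictions[b]
        agreement[(a, b)] = sum(x == y for x, y in zip(pa, pb)) / len(pa)
    return agreement
```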
Figure 3: The prediction agreement among 5 LLMs in detecting AI from human texts. Left: ‘human versus AI’ (AC1). Right: ‘human versus humAI’ (AC2). Take the cell on the first row and second column (left panel) for example: for 70% of the samples, the predicted outcomes by DeepSeek-R1 and Claude-3.5-sonnet are exactly the same.
2.2 Citation classification
The MADStat only contains meta-information and abstracts, rather than full papers. We created a new data set, CitaStat, by downloading full papers and extracting the text surrounding citations. Specifically, we restricted attention to papers published during 1996-2020 in 4 representative journals in statistics: Annals of Statistics, Biometrika, Journal of the American Statistical Association, and Journal of the Royal Statistical Society Series B. We wrote code to acquire the PDFs from the journal websites, convert the PDFs to plain text files, and then extract the sentence (using spaCy, a well-known Python package for sentence parsing) containing