[论文翻译]意义随时间变化的动态:对大语言模型的评估


原文地址:https://arxiv.org/pdf/2501.05552


The dynamics of meaning through time: Assessment of Large Language Models

意义随时间变化的动态:对大语言模型的评估

Abstract

摘要

Understanding how large language models (LLMs) grasp the historical context of concepts and their semantic evolution is essential in advancing artificial intelligence and linguistic studies. This study aims to evaluate the capabilities of various LLMs in capturing temporal dynamics of meaning, specifically how they interpret terms across different time periods. We analyze a diverse set of terms from multiple domains, using tailored prompts and measuring responses through both objective metrics (e.g., perplexity and word count) and subjective human expert evaluations. Our comparative analysis includes prominent models like ChatGPT, GPT-4, Claude, Bard, Gemini, and Llama. Findings reveal marked differences in each model's handling of historical context and semantic shifts, highlighting both strengths and limitations in temporal semantic understanding. These insights offer a foundation for refining LLMs to better address the evolving nature of language, with implications for historical text analysis, AI design, and applications in digital humanities.

理解大语言模型 (LLMs) 如何掌握概念的历史背景及其语义演变对于推进人工智能和语言学研究至关重要。本研究旨在评估各种大语言模型在捕捉意义的时间动态方面的能力,特别是它们如何解释不同时期的术语。我们分析了来自多个领域的多样化术语集,使用定制的提示并通过客观指标(例如,困惑度和词数)和主观的人类专家评估来衡量响应。我们的比较分析包括 ChatGPT、GPT-4、Claude、Bard、Gemini 和 Llama 等知名模型。研究结果揭示了每个模型在处理历史背景和语义变化方面的显著差异,突显了它们在时间语义理解方面的优势和局限性。这些见解为改进大语言模型以更好地应对语言的演变性质提供了基础,对历史文本分析、AI 设计以及数字人文领域的应用具有重要意义。

Keywords: Large Language Models (LLMs), Temporal Reasoning, Historical Reasoning.

关键词:大语言模型 (Large Language Models, LLMs),时序推理,历史推理。

1. Introduction

1. 引言

Large language models (LLMs) have revolutionized numerous domains, demonstrating remarkable performance across a wide array of tasks, including reasoning, understanding, truthfulness, mathematics, and coding (Periti and Montanelli, 2024; Zhao et al., 2023). The capacity of these models is typically assessed through various benchmarks that evaluate their proficiency in each of these domains (Chang et al., 2024). Central to their success is the combination of model size and the data used during pre-training (Zhao et al., 2023). A growing body of literature explores the relationship between the scaling of model size and the emergence of increasingly sophisticated cognitive abilities, commonly referred to as emergent intelligence (Kaplan et al., 2020). Additionally, other studies emphasize the pivotal role of training data quality, specifically highlighting how pre-training on diverse and specialized datasets can significantly enhance a model’s reasoning capabilities (Isik et al., 2024).

大语言模型 (LLMs) 已经彻底改变了众多领域,在推理、理解、真实性、数学和编码等广泛任务中展示了卓越的性能 (Periti 和 Montanelli, 2024; Zhao 等, 2023)。这些模型的能力通常通过各种基准测试来评估,这些基准测试评估了它们在每个领域的熟练程度 (Chang 等, 2024)。其成功的关键在于模型规模和预训练期间使用的数据的结合 (Zhao 等, 2023)。越来越多的文献探讨了模型规模扩展与日益复杂的认知能力(通常称为涌现智能)之间的关系 (Kaplan 等, 2020)。此外,其他研究强调了训练数据质量的关键作用,特别指出在多样化和专业化的数据集上进行预训练如何显著增强模型的推理能力 (Isik 等, 2024)。

The vast corpus of text used to train these models is predominantly derived from content available on the internet, much of which has been produced over the past few decades (Liu et al., 2024). This temporal context inherently introduces challenges, as much of the text that drives current models reflects contemporary understandings of language and meaning. Consequently, the historical evolution of words, their meanings, and their contextual shifts may be underrepresented in modern training data (Manjavacas and Fonteyn, 2022). This temporal gap poses a particular challenge for LLMs when tasked with interpreting how words and phrases have evolved across time.

用于训练这些模型的庞大文本语料库主要来源于互联网上的内容,其中大部分是在过去几十年中产生的 (Liu et al., 2024)。这种时间背景本身就带来了挑战,因为驱动当前模型的文本大多反映了当代对语言和意义的理解。因此,词汇的历史演变、其含义以及语境的变化在现代训练数据中可能没有得到充分体现 (Manjavacas and Fonteyn, 2022)。这种时间差距对大语言模型在解释词汇和短语如何随时间演变时提出了特别的挑战。

This gap has spurred interest in examining whether LLMs, which are typically trained on contemporary data, can capture the historical evolution of words and their meanings (Cuscito et al., 2024; Manjavacas Arevalo and Fonteyn, 2021; Manjavacas and Fonteyn, 2022). The central motivation of this study is to assess whether current LLMs possess the ability to understand semantic shifts over time. Specifically, it seeks to determine whether models, in their current configurations, can track how word meanings have evolved over the past century. In doing so, this study also explores the possibility of establishing this capacity—tracking and interpreting historical linguistic changes—as a benchmark for future advancements in LLM development. This study treats LLMs’ ability to understand changes in word meaning as a critical dimension of reasoning and language comprehension, with potential implications for a variety of applications in both AI development and interdisciplinary research.

这一差距激发了人们对于大语言模型(LLM)是否能够捕捉词汇及其意义历史演变的兴趣(Cuscito 等,2024;Manjavacas Arevalo 和 Fonteyn,2021;Manjavacas 和 Fonteyn,2022)。本研究的核心动机是评估当前的大语言模型是否具备理解语义随时间变化的能力。具体而言,它试图确定当前配置的模型是否能够追踪过去一个世纪中词汇意义的演变。通过这样做,本研究还探讨了将这种能力——追踪和解释历史语言变化——确立为未来大语言模型发展基准的可能性。本研究旨在理解大语言模型理解词汇意义变化的能力,作为推理和语言理解的关键维度,并对人工智能开发和跨学科研究中的多种应用具有潜在影响。

The findings of this study reveal significant variability in the ability of large language models (LLMs) to interpret temporal semantic shifts, emphasizing that training data quality and domain-specific fine-tuning outweigh model size in determining performance. GPT-4 and Claude Instant $100\mathrm{k}$ demonstrated superior factuality and comprehensiveness, reflecting the advantages of robust training methodologies. Meanwhile, the code-based Llama 34B surpassed larger Llama models, underscoring the value of retraining on structured datasets, such as code, to enhance analytical reasoning and temporal understanding. In contrast, models like Google Gemini and smaller Llama variants struggled to capture nuanced historical contexts, highlighting the limitations of general-purpose training approaches. These insights establish a foundation for advancing LLM design and training strategies, enabling improved applications in historical linguistics, digital humanities, and beyond.

本研究的结果揭示了大语言模型 (LLMs) 在解释时间语义变化能力上的显著差异,强调了训练数据质量和领域特定微调在决定性能方面的重要性,超过了模型规模。GPT-4 和 Claude Instant $100\mathrm{k}$ 在事实性和全面性方面表现出色,反映了强大训练方法的优势。与此同时,基于代码的 Llama 34B 超越了更大的 Llama 模型,突显了在结构化数据集(如代码)上重新训练的价值,以增强分析推理和时间理解能力。相比之下,Google Gemini 和较小的 Llama 变体在捕捉细微历史背景方面表现不佳,突显了通用训练方法的局限性。这些见解为推进大语言模型设计和训练策略奠定了基础,使其在历史语言学、数字人文等领域中的应用得以改进。

The paper is organised in six sections. A brief review of LLMs within the scope of word meaning development is laid out in section 2. Section 3 discusses the research methodology and the experiment performed to meet the research aim and goals. Section 4 presents the results of the experiment. Section 5 provides a comprehensive discussion of the results and how they correlate with similar studies in the same domain. Finally, conclusions are shared in section 6.

本文共分为六个部分。第2部分简要回顾了大语言模型 (LLM) 在词义发展领域的应用范围。第3部分讨论了研究方法和为实现研究目标而进行的实验。第4部分展示了实验结果。第5部分对结果进行了全面讨论,并探讨了这些结果与同一领域类似研究的相关性。最后,第6部分分享了结论。

The evolution of word meanings over time is influenced by a myriad of factors, including social changes, technological advancements, and cultural dynamics. This interplay between language, history, and culture highlights the importance of understanding semantic change as a gradual process that encompasses both lexical and core meaning shifts. Traditionally, semantic change has been examined through corpus linguistics, where researchers analyze texts from different historical periods to identify shifts in word usage and meaning. By examining word frequencies and the contexts in which they appear, scholars can trace the evolution of language over time (Asri et al., 2024).

词义的演变受到多种因素的影响,包括社会变迁、技术进步和文化动态。语言、历史和文化之间的相互作用凸显了理解语义变化作为一个渐进过程的重要性,这一过程既包括词汇变化,也包括核心意义的变化。传统上,语义变化通过语料库语言学进行研究,研究人员分析不同历史时期的文本,以识别词汇使用和意义的变化。通过考察词汇频率及其出现的上下文,学者们可以追溯语言随时间的演变 (Asri et al., 2024) 。

In recent years, diachronic word embeddings have emerged as a powerful tool to capture changes in word meanings across time. These embeddings align words with their respective time periods, enabling a strong understanding of how meanings shift in response to cultural and societal changes. Notably, two statistical laws of semantic change have been proposed: i) words that are used more frequently tend to change at a slower rate, ii) polysemous words exhibit a higher rate of change (Hamilton et al., 2016a). This framework allows researchers to categorize various types of semantic shifts, such as drift of meaning based on cultural norms (Spataru et al., 2024), specialisation of a meaning over time (Hamilton et al., 2016a), generalisation of a meaning over time (Wegmann et al., 2020), pejoration and amelioration, which refer to a shift towards negative or positive connotations, respectively (Periti and Montanelli, 2024; Wevers and Koolen, 2020), metaphorical meaning added to an existing word based on pop culture or a change in a society (de Sá et al., 2024), and finally, the broadening and narrowing of a meaning through time (Vijayarani and Geetha, 2020).

近年来,历时词嵌入 (diachronic word embeddings) 作为一种强大的工具出现,用于捕捉词语意义随时间的变化。这些嵌入将词语与其对应的时间段对齐,从而能够深入理解意义如何随着文化和社会变化而演变。值得注意的是,已经提出了两条语义变化的统计规律:i) 使用频率较高的词语变化速度较慢,ii) 多义词表现出更高的变化率 (Hamilton et al., 2016a)。这一框架使研究人员能够分类各种类型的语义变化,例如基于文化规范的意义漂移 (Spataru et al., 2024)、意义随时间的特化 (Hamilton et al., 2016a)、意义随时间的泛化 (Wegmann et al., 2020)、贬义化和褒义化(分别指意义向负面或正面内涵的转变)(Periti and Montanelli, 2024; Wevers and Koolen, 2020)、基于流行文化或社会变化的现有词语隐喻意义的增加 (de Sá et al., 2024),以及意义随时间的扩展和缩小 (Vijayarani and Geetha, 2020)。
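The alignment step behind diachronic embeddings can be sketched with orthogonal Procrustes, in the spirit of Hamilton et al. (2016a): embeddings from a later period are rotated onto the earlier period's space, and a word's shift is the cosine distance between its two aligned vectors. The data below is a toy random matrix and all names are illustrative, not the authors' code.

```python
import numpy as np

def align_embeddings(base, other):
    """Align `other` to `base` with orthogonal Procrustes.

    Both are (vocab, dim) matrices over a shared vocabulary from two
    time periods; the rotation W = UV^T minimises ||other @ W - base||_F
    while preserving distances within `other`.
    """
    u, _, vt = np.linalg.svd(other.T @ base)
    return other @ (u @ vt)

def semantic_shift(base, other_aligned, idx):
    """Cosine distance of one word's vector between the two periods."""
    a, b = base[idx], other_aligned[idx]
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: 5-word vocabulary, 4-dimensional embeddings.
rng = np.random.default_rng(0)
emb_1950s = rng.normal(size=(5, 4))
# The later space is a pure rotation of the earlier one, i.e. no real shift.
emb_2000s = emb_1950s @ np.linalg.qr(rng.normal(size=(4, 4)))[0]
aligned = align_embeddings(emb_1950s, emb_2000s)
print(round(semantic_shift(emb_1950s, aligned, 0), 6))  # ~0: alignment recovered
```

On real corpora the two matrices would come from embeddings trained separately per decade; a word with a large post-alignment distance is a candidate for semantic change.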

On the other hand, as LLMs become increasingly integrated into daily tasks, their ability to handle semantic change and cultural variations is crucial. By training LLMs on diverse corpora and fine-tuning them for specific tasks related to semantic change, researchers can enhance their performance in detecting shifts in meaning (Shen et al., 2024; Tao et al., 2023). However, LLMs often propagate cultural biases inherent in their training data, which can skew responses in cross-cultural contexts (Tao et al., 2023). In addition, LLMs may struggle with complex social scenarios and generate text that deviates from intended meaning (Choi et al., 2023; Spataru et al., 2024).

另一方面,随着大语言模型 (LLM) 日益融入日常任务,其处理语义变化和文化差异的能力变得至关重要。通过在多样化的语料库上训练大语言模型,并针对与语义变化相关的特定任务进行微调,研究人员可以提高其在检测意义变化方面的性能 (Shen et al., 2024; Tao et al., 2023)。然而,大语言模型往往会传播其训练数据中固有的文化偏见,这可能会在跨文化环境中导致回答偏差 (Tao et al., 2023)。此外,大语言模型在处理复杂的社会场景时可能会遇到困难,并生成偏离预期意义的文本 (Choi et al., 2023; Spataru et al., 2024)。

Despite these limitations, LLMs can grasp cultural common-sense when fine-tuned on balanced datasets that incorporate diverse cultural perspectives (Shen et al., 2024). Addressing the biases in LLMs requires training on time-aware datasets that reflect the aftermath of significant cultural events, such as the term "coronavirus" and its evolving meanings (Mousavi et al., 2024). This approach allows for a deeper tracing of historical contexts and semantic changes.

尽管存在这些限制,大语言模型 (LLM) 在经过包含多元文化视角的平衡数据集微调后,能够掌握文化常识 (Shen et al., 2024)。解决大语言模型中的偏见问题,需要在反映重大文化事件后果的时间感知数据集上进行训练,例如“coronavirus”一词及其不断演变的含义 (Mousavi et al., 2024)。这种方法能够更深入地追溯历史背景和语义变化。

The use of language models to study the evolution of meaning over time has garnered increasing attention in recent research. Several studies have explored the capabilities of language models in tracking semantic shifts and historical language usage. For instance, Bamler and Mandt (2017) proposed dynamic word embeddings as a method for capturing temporal variations in word meanings, highlighting the utility of diachronic models in understanding historical text corpora. Similarly, Hamilton et al. (2016) introduced dynamic embeddings to study semantic change over decades, demonstrating how embeddings trained on historical corpora reveal shifts in cultural and social contexts. More recently, MacBERTh was specifically designed to analyze historical texts, leveraging transformers to improve the understanding of linguistic evolution across time periods (Manjavacas Arevalo and Fonteyn, 2021).

近年来,利用语言模型研究词义随时间演变的方法在研究中受到了越来越多的关注。多项研究探讨了语言模型在追踪语义变化和历史语言使用方面的能力。例如,Bamler 和 Mandt (2017) 提出了动态词嵌入 (dynamic word embeddings) 作为捕捉词义时间变化的方法,强调了历时模型在理解历史文本语料库中的实用性。同样,Hamilton 等人 (2016) 引入了动态嵌入来研究数十年间的语义变化,展示了基于历史语料库训练的嵌入如何揭示文化和社会背景的变迁。最近,MacBERTh 被专门设计用于分析历史文本,利用 Transformer 来提高对跨时间段语言演变的理解 (Manjavacas Arevalo 和 Fonteyn, 2021)。

The implications of understanding semantic change through LLMs are vast, including applications in tracking antisemitic language, analyzing the evolution of hate speech, and assessing sentiment shifts in economic discourse. Research has shown that diachronic embeddings can effectively monitor these shifts across multiple languages, such as English, German, French, Swedish, and Spanish (Hoeken et al., 2023; Periti et al., 2024; Tripodi et al., 2019). Additionally, the methodologies developed for semantic change detection can advance fields such as historical linguistics, social media analysis, and sentiment analysis by providing tools to predict potential indicators of accurate language use, such as political dog whistles (Boholm et al., 2024).

通过大语言模型理解语义变化的影响是广泛的,包括在追踪反犹太语言、分析仇恨言论的演变以及评估经济话语中的情感变化等方面的应用。研究表明,历时嵌入可以有效地监控多种语言(如英语、德语、法语、瑞典语和西班牙语)中的这些变化 (Hoeken et al., 2023; Periti et al., 2024; Tripodi et al., 2019)。此外,为语义变化检测开发的方法可以通过提供工具来预测准确语言使用的潜在指标(如政治狗哨),从而推动历史语言学、社交媒体分析和情感分析等领域的发展 (Boholm et al., 2024)。

3. Research Methodology

3. 研究方法

To evaluate the ability of LLMs to capture the temporal dynamics of meaning, we conducted a comprehensive, multi-dimensional experiment using a range of state-of-the-art models.

为了评估大语言模型 (LLM) 捕捉意义时间动态的能力,我们使用一系列最先进的模型进行了全面的多维度实验。

Language models. We evaluate six large language models in this study. The first is ChatGPT (OpenAI, 2022), for which we use text-davinci-002, a 175B-parameter model based on the GPT-3 architecture. The second comprises GPT-4 (OpenAI et al., 2023) and GPT-4o (OpenAI et al., 2024), more advanced models with improved reasoning and contextual understanding; although their exact parameter counts are undisclosed, they are estimated to exceed 175B parameters. The third is Claude (Anthropic, 2023), for which we use Claude 1, with parameters ranging from 52B to 100B. The fourth is Bard, based on Google's LaMDA model (Thoppilan et al., 2022), which comes in configurations including 422M, 2B, 8B, 68B, and 137B parameters. The fifth is Gemini (Google DeepMind, 2024), a model from Google's suite with versions estimated to match or exceed the PaLM 540B model. The sixth is Llama (Meta, 2023), for which we used both Llama 1 and Llama 2 at different scales: Llama 1 with 34B parameters, and Llama 2 with 7B, 13B, and 70B parameters.

语言模型。我们在本研究中评估了六种大语言模型。第一个是 ChatGPT (OpenAI, 2022),我们使用了 text-davinci-002,这是一个基于 GPT-3 架构的 175B 参数模型。第二个是 GPT-4 (OpenAI et al., 2023) 和 GPT-4o (OpenAI et al., 2024),它们是更先进的模型,具有改进的推理和上下文理解能力,尽管其确切参数数量未公开,但估计超过 175B 参数。第三个是 Claude (Anthropic, 2023),我们使用了 Claude 1,其参数范围在 52B 到 100B 之间。第四个是 Bard,基于 Google 的 LaMDA 模型 (Thoppilan et al., 2022),其配置包括 422M、2B、8B、68B 和 137B 参数。第五个是 Gemini (Google DeepMind, 2024),这是 Google 系列中的一个模型,其版本估计与 PaLM 540B 模型相当或超过该模型。第六个是 Llama (Meta, 2023),我们同时使用了 Llama 1 和 Llama 2,并在不同规模下进行了评估:Llama 1 有 34B 参数,Llama 2 有 7B、13B 和 70B 参数。

Term Selection. A carefully curated list of terms was selected to represent diverse domains, including scientific concepts, historical events, and cultural phenomena. These terms were chosen to evaluate how well each model interprets semantic shifts and the historical context associated with each concept. Two key terms were selected for this study: "Data Mining" (a technical term) and "Michael Jackson" (a cultural figure). These terms span both technological and cultural evolution from the 1920s to the 2020s, offering a robust platform to assess the LLMs' capacity to track and describe temporal meaning changes.

术语选择。我们精心挑选了一系列术语,涵盖科学概念、历史事件和文化现象等多个领域。这些术语旨在评估每个模型如何解释语义变化以及与每个概念相关的历史背景。本研究选择了两个关键术语:"数据挖掘 (Data Mining)"(技术术语)和"Michael Jackson"(文化人物)。这些术语跨越了从1920年代到2020年代的技术和文化演变,为评估大语言模型跟踪和描述时间意义变化的能力提供了一个强大的平台。

Prompt Design and Input Format. For each term, we designed specific prompts aimed at assessing the models' understanding of the historical evolution of meaning. A typical prompt asked the models to:

提示设计与输入格式。对于每个术语,我们设计了特定的提示,旨在评估模型对词义历史演变的理解。一个典型的提示要求模型:

“Create a table with two columns. The first column should list decades (e.g., 1920s, 1930s, etc.), and the second should describe the meaning and synonyms for the term based on the knowledge and context of that period.” This prompt structure remained consistent across all models, ensuring that the evaluation was controlled, and the comparison was fair. Prompts were specifically crafted to challenge each model’s ability to capture and relay temporal semantic shifts without additional training or fine-tuning.

创建一个包含两列的表。第一列应列出年代(例如,1920年代、1930年代等),第二列应根据该时期的知识和背景描述术语的含义和同义词。此提示结构在所有模型中保持一致,确保评估受控且比较公平。提示经过精心设计,以挑战每个模型在不进行额外训练或微调的情况下捕捉和传达时间语义变化的能力。
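The fixed-prompt protocol described above can be sketched as follows. `query_model` is a hypothetical stand-in for each provider's API call (the paper does not specify its client code); the template reproduces the prompt quoted in the text.

```python
# Sketch of the study's fixed prompting protocol: one identical prompt
# per term, sent unchanged to every model under evaluation.

PROMPT_TEMPLATE = (
    "Create a table with two columns. The first column should list decades "
    "(e.g., 1920s, 1930s, etc.), and the second should describe the meaning "
    "and synonyms for the term '{term}' based on the knowledge and context "
    "of that period."
)

def build_prompt(term: str) -> str:
    return PROMPT_TEMPLATE.format(term=term)

def collect_responses(terms, models, query_model):
    """Send the identical prompt to every model for every term."""
    return {
        (term, model): query_model(model, build_prompt(term))
        for term in terms
        for model in models
    }

# Example with a dummy backend instead of a real provider API:
fake = lambda model, prompt: f"[{model}] table for: {prompt[:40]}..."
out = collect_responses(["Data Mining", "Michael Jackson"],
                        ["gpt-4", "claude-instant-100k"], fake)
print(len(out))  # 4 responses: 2 terms x 2 models
```

Keeping the prompt identical across models is what makes the later factuality and comprehensiveness comparisons controlled.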

Subjective evaluations were conducted by human experts. Two metrics were selected based on the following criteria: (1) Factuality score: experts evaluated how well each model captured the evolution of meaning over time; (2) Comprehensiveness score: the extent to which the models effectively answered the question, including the description length and the number of synonyms in each answer, if any.

主观评估由人类专家进行。根据以下标准选择了两种方法:(1) 事实性评分:专家评估每个模型捕捉意义随时间演变的程度,(2) 全面性评分:模型有效回答问题的程度,包括每个答案的描述长度和同义词数量(如果有)。

4. Results

4. 结果

This section presents the findings from a comparative evaluation of LLMs on their ability to understand the temporal dynamics of meaning and semantic shifts for two terms, "Data Mining" and "Michael Jackson", see Table 1. The analysis reveals notable differences in model performance across metrics, including factuality and comprehensiveness. Key results indicate that models trained with specialized data, such as code, may exhibit enhanced analytical capabilities, potentially influencing their ability to interpret semantic evolution.

本节展示了大语言模型在理解“数据挖掘”和“迈克尔·杰克逊”这两个词的意义和语义演变的时间动态能力上的比较评估结果,见表 1。分析揭示了模型在不同指标(包括事实性和全面性)上的显著差异。关键结果表明,使用专门数据(如代码)训练的模型可能表现出更强的分析能力,这可能影响其解释语义演变的能力。

High-Performing Models

高性能模型

The models GPT-4 and Claude Instant $100\mathrm{k}$ consistently outperformed other LLMs across both evaluation metrics, demonstrating a high capacity for capturing historical context. Both models scored a maximum of 22 in factuality and near-maximum in comprehensiveness (21 and 22 for Claude Instant $100\mathrm{k}$ and GPT-4, respectively) for the term "Data Mining." A similarly strong performance was observed for the term "Michael Jackson," with GPT-4 achieving the highest comprehensiveness score of 22 and factuality of 21. These scores indicate an advanced capacity in these models to accurately trace semantic evolution, which may stem from robust training data diversity and model architecture optimized for complex language tasks. The consistent performance across terms suggests that GPT-4 and Claude may be more attuned to changes in meaning across different temporal contexts, supporting their utility in applications requiring historical linguistic analysis.

GPT-4 和 Claude Instant $100\mathrm{k}$ 模型在两项评估指标上始终优于其他大语言模型,展示了其在捕捉历史背景方面的高能力。在术语“数据挖掘 (Data Mining)”上,两个模型在事实性 (factuality) 方面均获得了最高分 22 分,在全面性 (comprehensiveness) 方面接近满分(Claude Instant $100\mathrm{k}$ 为 21 分,GPT-4 为 22 分)。在术语“Michael Jackson”上,GPT-4 也表现出色,获得了最高的全面性分数 22 分和事实性分数 21 分。这些分数表明,这些模型在准确追踪语义演变方面具有先进的能力,这可能源于其多样化的训练数据和针对复杂语言任务优化的模型架构。这些模型在不同术语上的一致表现表明,GPT-4 和 Claude 可能更适应不同时间背景下的语义变化,支持其在需要历史语言分析的应用中的实用性。

The Code-Based Llama Model’s Unique Strengths

基于代码的 Llama 模型的独特优势

The model Llama 34B, specifically trained on code and provided by Poe, emerged as a notable outlier within the Llama series, outperforming larger versions, such as Llama 70B and Llama 13B, across factuality and comprehensiveness. This model achieved a factuality score of 12 and comprehensiveness of 20 for "Data Mining" and similarly high scores of 18 and 22, respectively, for "Michael Jackson." The distinct performance of this model may be attributable to its extensive retraining on code datasets, which are characterized by structured syntax and logic-driven language. Such exposure potentially enhances a model’s analytical capabilities, providing a foundation for more structured thought processes that facilitate better recognition of historical patterns in meaning. This suggests that domain-specific retraining, particularly with code, may imbue LLMs with improved temporal reasoning and analytical rigor—skills critical for comprehending complex semantic shifts. Given these results, further research into code-based retraining could elucidate its potential benefits in enhancing temporal analytical abilities within LLMs.

模型 Llama 34B 是 Poe 提供的专门针对代码训练的模型,在 Llama 系列中表现突出,在事实性和全面性方面超越了更大的版本,如 Llama 70B 和 Llama 13B。该模型在“数据挖掘”方面获得了 12 分的事实性得分和 20 分的全面性得分,在“Michael Jackson”方面也分别获得了 18 分和 22 分的高分。该模型的独特表现可能归因于其在代码数据集上的广泛再训练,这些数据集以结构化语法和逻辑驱动的语言为特征。这种训练可能增强了模型的分析能力,为更结构化的思维过程提供了基础,从而有助于更好地识别历史意义的模式。这表明,特定领域的再训练,尤其是代码训练,可能赋予大语言模型更好的时间推理和分析严谨性——这些技能对于理解复杂的语义变化至关重要。鉴于这些结果,进一步研究基于代码的再训练可能有助于阐明其在增强大语言模型时间分析能力方面的潜在益处。

Underperformance and Inconsistencies Among Other Models

其他模型的性能不足与不一致性

Several models underperformed, particularly the smaller Llama variants (Llama 7B) and Google Bard. The Llama 2 7B model, for example, exhibited the lowest scores across both terms and evaluation metrics, scoring 0 for both factuality and comprehensiveness for “Data Mining” and only 1 and 0, respectively, for “Michael Jackson.” These findings suggest that model size has some effect, but a larger model size does not guarantee adequate understanding of temporal semantics; instead, both model architecture and training data composition play critical roles. Google Bard likewise displayed low performance, particularly on the "Data Mining" term (factuality: 3, comprehensiveness: 0), underscoring the challenges these models face in accurately representing historical semantic shifts. These inconsistencies across models, particularly among the Llama and Bard models, imply potential limitations in their training frameworks for tasks requiring deep temporal context comprehension.

在性能表现上,多个模型表现不佳,尤其是较小的 Llama 变体(Llama 7B)和 Google Bard。例如,Llama 2 7B 模型在“数据挖掘”和“Michael Jackson”两个术语和评估指标上得分最低,分别获得了 0 分和 1 分、0 分。这些发现表明,模型大小有一定影响,但更大的模型规模并不能保证对时间语义的充分理解;相反,模型架构和训练数据构成起着关键作用。正如 Google Bard 的表现所示,它在“数据挖掘”术语上表现尤其不佳(事实性:3,全面性:0),突显了这些模型在准确表示历史语义变化方面面临的挑战。这些模型之间的不一致性,尤其是 Llama 和 Bard 模型之间的差异,暗示了它们在需要深度时间上下文理解的任务中训练框架的潜在局限性。

Table 1: Table showing two terms and the performance of each model in their ability to answer questions with factuality and comprehensiveness.

表 1: 展示两个术语及每个模型在回答问题的事实性和全面性方面的表现。

| 术语 / 模型 | 回答数量 | 事实性得分 | 全面性得分 |
| --- | --- | --- | --- |
| **数据挖掘 (Data Mining)** | | | |
| ChatGPT | 11 | 22 | 0 |
| ChatGPT-4o | 11 | 22 | 22 |
| Claude Instant 100k | 11 | 22 | 21 |
| Google Bard | 6 | 3 | 0 |
| Google Gemini | 10 | 20 | 20 |
| Llama 2 13B | 11 | 8 | 11 |
| Llama 2 70B | 11 | 4 | 8 |
| Llama 2 7B | 11 | 0 | 0 |
| CodeLlama 2 34B | 10 | 12 | 20 |
| **Michael Jackson** | | | |
| ChatGPT | 11 | 18 | 0 |
| ChatGPT-4o | 11 | 21 | 22 |
| Google Gemini | 1 | 2 | 2 |
| Google Gemini | 5 | 10 | 8 |
| Llama 2 13B | 11 | 2 | 22 |
| Llama 2 70B | 11 | 16 | 10 |
| Llama 2 7B | 11 | 1 | 0 |
| CodeLlama 2 34B | 11 | 18 | 22 |

Patterns and Correlations in Model Performance

模型性能的模式与相关性

The analysis revealed several noteworthy patterns regarding model characteristics and performance outcomes. The study found no linear correlation between model size and performance quality in temporal understanding. While larger Llama models such as Llama 70B displayed moderately higher scores than Llama 7B, they still underperformed relative to the code-based Llama 34B model, suggesting that architectural enhancements and training with structured datasets may supersede raw model size. Additionally, the strong performance of GPT-4 and Claude Instant $100\mathrm{k}$ across both terms indicates that diverse and large training datasets, coupled with refined architectures, can enhance the capacity for temporal semantic understanding. However, bigger and better architecture might not always be the answer. The inconsistent performance of Llama 70B compared with the smaller CodeLlama 34B affirms that pre-training and data-selection design are equally crucial for capturing historical semantic shifts reliably, as seen in this study. These findings underscore the importance of model training approaches and suggest that, beyond increasing model size, fine-tuning with specialized or structured data can enhance temporal analysis capabilities.

分析揭示了关于模型特征和性能结果的几个值得注意的模式。研究发现,在时间理解方面,模型大小与性能质量之间没有线性相关性。虽然较大的Llama模型(如Llama 70B)显示出比Llama 7B略高的分数,但它们仍然不如基于代码的Llama 34B模型,这表明架构增强和结构化数据集的训练可能优于原始模型大小。此外,GPT-4和Claude Instant $100\mathrm{k}$ 在两个术语中的强劲表现表明,多样化和大规模的训练数据集,加上精细的架构,可以增强时间语义理解的能力。然而,更大更好的架构并不总是答案。正如本研究所示,Llama 70B与更小的CodeLlama 34B相比表现不佳,证实了预训练和数据选择设计对于可靠地捕捉历史语义变化同样至关重要。这些发现强调了模型训练方法的重要性,并表明除了增加模型大小外,使用专门或结构化数据进行微调可以增强时间分析能力。
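The "no linear correlation between size and performance" claim can be checked directly from Table 1's Llama rows; a minimal sketch using the Data Mining factuality scores (standard library only, Pearson computed by hand):

```python
# Size-vs-score check on the Llama family from Table 1 (Data Mining,
# factuality). Sizes are in billions of parameters.
import statistics

sizes = [7, 13, 34, 70]        # Llama 2 7B, Llama 2 13B, CodeLlama 34B, Llama 2 70B
factuality = [0, 8, 12, 4]     # corresponding Table 1 factuality scores

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(sizes, factuality)
print(round(r, 3))  # ≈ 0.109: size alone barely predicts factuality
```

The 34B code-trained model outscoring the 70B model is what flattens the correlation, matching the paragraph's conclusion that data and training design matter at least as much as scale.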

5. Discussion

5. 讨论

Implications and Future Directions

影响与未来方向

The findings of this study offer several implications for the advancement of LLMs, particularly in fields requiring an in-depth understanding of language evolution, such as digital humanities and historical linguistics. The high performance of GPT-4 and Claude Instant $100\mathrm{k}$ illustrates the potential for well-architected models with diverse training data to provide accurate and comprehensive historical interpretations. The success of the code-based Llama 34B model suggests that retraining on specialized datasets, like code, could be a valuable approach for enhancing models’ temporal analytical reasoning. This could have broader applications in improving models’ ability to handle structured data and in fields such as digital humanities, where historical accuracy and adequate interpretation are essential.

本研究的结果为大语言模型 (LLM) 的发展提供了若干启示,特别是在需要深入理解语言演变的领域,如数字人文和历史语言学。GPT-4 和 Claude Instant $100\mathrm{k}$ 的高性能表明,具有多样化训练数据的良好架构模型能够提供准确且全面的历史解释。基于代码的 Llama 34B 模型的成功表明,在专门数据集(如代码)上进行重新训练可能是增强模型时间分析推理能力的有效方法。这在提高模型处理结构化数据的能力以及在数字人文等领域中具有广泛的应用潜力,这些领域对历史准确性和充分解释至关重要。

Future research should expand the scope of this study by incorporating a larger set of terms and additional time periods to assess models’ abilities to generalize across domains and temporal contexts. Further, investigating the specific impacts of code-based retraining or domain-specific dataset integration could offer insights into optimizing LLMs for tasks demanding analytical precision. This study provides foundational evidence that supports the refinement of LLMs to better address evolving language dynamics, with significant implications for enhancing AI’s utility in historical analysis and beyond.

未来的研究应通过纳入更多术语和额外的时间段来扩展本研究的范围,以评估模型在跨领域和时间背景下的泛化能力。此外,研究基于代码的再训练或特定领域数据集集成的具体影响,可以为优化大语言模型在需要分析精度的任务中的表现提供见解。本研究提供了支持改进大语言模型以更好地应对不断变化的语言动态的基础证据,这对增强人工智能在历史分析及其他领域的实用性具有重要意义。

The superior performance of GPT-4 and Claude Instant $100\mathrm{k}$ in this study underscores the importance of model architecture and the diversity of training datasets in achieving accurate temporal semantic understanding. Both models excelled in factuality and comprehensiveness, highlighting their ability to capture historical context with high fidelity. These results align with previous research (e.g., (Chiang et al., 2024)), which identified these models as top performers in reasoning and semantic tasks. However, critiques such as those by Bender et al., (2021) raise questions about whether such capabilities extend beyond statistical pattern recognition, suggesting a need for further validation of these models’ capacity for deep semantic understanding in temporally sensitive tasks. The continued dominance of GPT-4 and Claude across various contexts reinforces their potential for applications requiring accurate historical linguistic analysis, while also presenting an opportunity to examine the precise contributions of training methodologies to their performance.

GPT-4 和 Claude Instant $100\mathrm{k}$ 在本研究中的卓越表现凸显了模型架构和训练数据集多样性在实现准确时间语义理解中的重要性。这两个模型在事实性和全面性方面表现出色,突显了它们以高保真度捕捉历史背景的能力。这些结果与之前的研究(例如 (Chiang et al., 2024))一致,这些研究将这些模型确定为推理和语义任务中的佼佼者。然而,Bender 等人 (2021) 的批评提出了这些能力是否超越了统计模式识别的问题,表明需要进一步验证这些模型在时间敏感任务中的深层语义理解能力。GPT-4 和 Claude 在各种情境中的持续主导地位强化了它们在需要准确历史语言分析的应用中的潜力,同时也为研究训练方法对其性能的具体贡献提供了机会。

The observed success of CodeLlama 34B introduces a compelling case for domain-specific retraining as a strategy for enhancing structured reasoning in LLMs. Despite being smaller than other Llama variants, its superior performance in both factuality and comprehensiveness demonstrates the benefits of pre-training on structured datasets such as code. This finding corroborates studies by Aryabumi et al. (2024) and Yang et al. (2024), which emphasized the enhanced reasoning capabilities of models trained on code. Code-based retraining may instill improved logical reasoning and temporal analytical skills, offering broader implications for fields like computational linguistics and data science. However, further research is needed to determine whether such retraining benefits extend to other tasks or are limited to specific applications. Investigating how structured reasoning abilities gained from code datasets translate to understanding unstructured historical data could offer valuable insights.

CodeLlama 34B 的成功为领域特定重训练作为增强大语言模型结构化推理能力的策略提供了一个引人注目的案例。尽管它比其他 Llama 变体更小,但它在事实性和全面性方面的卓越表现证明了在代码等结构化数据集上进行预训练的优势。这一发现与 Aryabumi 等人 (2024) 和 Yang 等人 (2024) 的研究相吻合,这些研究强调了在代码上训练的模型具有增强的推理能力。基于代码的重训练可能会提升逻辑推理和时间分析能力,为计算语言学和数据科学等领域带来更广泛的影响。然而,需要进一步研究以确定这种重训练的好处是否扩展到其他任务,还是仅限于特定应用。研究从代码数据集中获得的结构化推理能力如何转化为对非结构化历史数据的理解,可能会提供有价值的见解。

The performance inconsistencies across model sizes in this study reveal critical insights into the interplay between model architecture and dataset composition. While larger models, such as Llama 70B, exhibited moderate improvement over smaller variants like Llama 7B, they underperformed relative to the code-based Llama 34B. This finding is consistent with Kaplan et al. (2020) and Hoffmann et al. (2022), who noted that factors such as training data quality and architectural optimization often outweigh the benefits of increasing model size. These results challenge the assumption that larger models inherently perform better, suggesting a paradigm shift toward prioritizing specialized training and architectural efficiency over sheer scale. Such a shift has significant implications for the development of cost-effective, high-performing LLMs capable of adequate temporal analysis.

本研究中的模型大小性能不一致揭示了模型架构与数据集组成之间相互作用的关键见解。尽管较大的模型(如 Llama 70B)相较于较小的变体(如 Llama 7B)表现出适度的改进,但其表现却不如基于代码的 Llama 34B。这一发现与 Kaplan 等人 (2020) 和 Hoffmann 等人 (2022) 的研究一致,他们指出,训练数据质量和架构优化等因素通常比增加模型大小的好处更为重要。这些结果挑战了“更大的模型本质上表现更好”的假设,表明了一种范式转变,即优先考虑专门的训练和架构效率,而非单纯的规模。这种转变对于开发具有成本效益、高性能且能够进行充分时间分析的大语言模型具有重要意义。

The limitations observed in general-purpose models, such as Google Bard and Gemini, further emphasize the need for domain-specific optimization. These models struggled with both factuality and comprehensiveness, particularly in tasks requiring historical and temporal sensitivity. This aligns with findings in Chatbot Arena (Chiang et al., 2024) and related literature, which highlight the challenges faced by general-purpose models in specialized reasoning tasks. Enhancing these models to perform well in temporally complex tasks may require targeted retraining on domainspecific datasets or integration of context-aware mechanisms. Addressing these limitations could make general-purpose models more versatile and effective in a broader range of applications, including digital humanities and historical research.

在通用模型(如 Google Bard 和 Gemini)中观察到的局限性进一步强调了领域特定优化的必要性。这些模型在事实性和全面性方面表现不佳,尤其是在需要历史和时间敏感性的任务中。这与 Chatbot Arena (Chiang et al., 2024) 和相关文献中的发现一致,这些研究强调了通用模型在专业推理任务中面临的挑战。增强这些模型在时间复杂任务中的表现,可能需要在领域特定数据集上进行有针对性的重新训练,或集成上下文感知机制。解决这些局限性可以使通用模型在更广泛的应用中(包括数字人文和历史研究)更加多功能和有效。

The relationship between temporal semantic understanding and training data quality is particularly evident in the strong performance of GPT-4 and CodeLlama 34B. These models underscore the critical role of well-curated and diverse datasets in fostering the ability to interpret and analyze linguistic evolution. Studies such as Beltagy et al. (2020) and Zhao et al. (2023) similarly emphasize the importance of diverse datasets in enhancing model performance, particularly for tasks involving extended temporal contexts. However, contrasting findings from Chen et al. (2021) suggest potential brittleness in code-trained models, indicating that further investigation is necessary to understand the long-term benefits and limitations of specialized datasets. Future research should aim to delineate the characteristics of training data that most effectively enhance temporal reasoning, particularly in relation to unstructured and evolving linguistic contexts.

时间语义理解与训练数据质量之间的关系在 GPT-4 和 CodeLlama 34B 的强劲表现中尤为明显。这些模型凸显了精心策划且多样化的数据集在培养语言演变解释和分析能力中的关键作用。类似地,(Beltagy et al., 2020) 和 Zhao et al., (2023) 等研究也强调了多样化数据集在提升模型性能中的重要性,尤其是在涉及长时间上下文的任务中。然而,Chen et al., (2021) 的对比研究结果表明,代码训练模型可能存在脆弱性,这表明需要进一步研究以理解专用数据集的长期收益和局限性。未来的研究应致力于界定最能有效增强时间推理能力的训练数据特征,特别是在非结构化和不断演变的语言环境中。

The consistently low scores of smaller models, such as Llama 7B, highlight the challenges inherent in using resource-constrained architectures for complex semantic tasks. This aligns with the findings of Kaplan et al. (2020) and Schick and Schütze (2020), which emphasize the limitations of smaller models in capturing clear patterns without fine-tuning. While smaller models may hold potential for niche applications with appropriate optimization, their inability to generalize effectively to tasks requiring deep temporal comprehension underscores the need for targeted architectural and dataset enhancements. Future research could explore fine-tuning smaller models for specific historical linguistic tasks, thereby balancing resource efficiency with improved performance.

较小模型(如 Llama 7B)持续的低分凸显了在复杂语义任务中使用资源受限架构所固有的挑战。这与 Kaplan 等人 (2020) 以及 Schick 和 Schütze (2020) 的研究结果一致,这些研究强调了较小模型在未经微调的情况下捕捉清晰模式的局限性。虽然较小模型在适当优化后可能在特定应用领域具有潜力,但它们在需要深度时间理解的任务上无法有效泛化,这突显了针对性的架构和数据集增强的必要性。未来的研究可以探索针对特定历史语言学任务微调较小模型,从而在资源效率和性能提升之间取得平衡。
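The evaluations discussed throughout this section probe models with tailored prompts asking about a term's meaning in a given era. A minimal sketch of how such period-specific prompts might be generated is shown below; the terms, periods, and template wording are illustrative assumptions, not the study's actual materials:

```python
# Hypothetical terms and eras, chosen only for illustration.
TERMS = ["broadcast", "computer", "gay"]
PERIODS = ["1850s", "1900s", "1950s", "2000s"]

def temporal_prompt(term, period):
    """Build a period-anchored prompt in the spirit of the study's setup."""
    return (f"Explain what the word '{term}' meant to a reader "
            f"in the {period}, citing typical usage of that era.")

# One prompt per (term, period) pair, suitable for batch evaluation.
prompts = [temporal_prompt(t, p) for t in TERMS for p in PERIODS]
print(len(prompts))  # → 12
```

Responses to such a prompt grid can then be scored per model, which is one way to expose exactly the kind of temporal-sensitivity gaps reported for the smaller Llama variants.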

Finally, these findings hold substantial implications for applications in digital humanities and historical linguistics, where understanding the evolution of language and meaning over time is critical. The ability of GPT-4 and Claude to accurately trace semantic shifts positions them as valuable tools for scholars in these fields. The growing interest in domain-specific language models, such as MacBERTh (Manjavacas Arevalo and Fonteyn, 2021), highlights the demand for systems capable of contextualizing language within historical frameworks. The integration of such models into digital humanities workflows could revolutionize the analysis of historical texts, offering new opportunities for interdisciplinary research. Further efforts to refine LLMs for historical applications could bridge gaps between computational and humanities research, fostering a deeper understanding of language dynamics across time.

最后,这些发现对数字人文和历史语言学领域的应用具有重要意义,在这些领域中,理解语言和意义随时间的演变至关重要。GPT-4 和 Claude 能够准确追踪语义变化的能力,使它们成为这些领域学者的宝贵工具。对领域特定语言模型(如 MacBERTh [Manjavacas Arevalo 和 Fonteyn, 2021])日益增长的兴趣,突显了对能够在历史框架内对语言进行语境化的系统的需求。将此类模型整合到数字人文工作流程中,可能会彻底改变历史文本的分析方式,为跨学科研究提供新的机会。进一步改进大语言模型以用于历史应用的努力,可能会弥合计算研究与人文研究之间的差距,促进对跨时间语言动态的更深入理解。

6. Conclusion

6. 结论

This study provides critical insights into the performance of LLMs in understanding temporal semantic shifts, emphasizing the roles of model architecture, training data, and domain-specific optimization. The results reveal that state-of-the-art models, such as GPT-4 and Claude Instant 100k, excel in capturing historical context, showcasing the potential of well-designed architectures and diverse datasets. Moreover, the outstanding performance of CodeLlama 34B highlights the efficacy of retraining on structured datasets, like code, for enhancing logical reasoning and temporal analysis.

本研究为大语言模型在理解时间语义变化方面的表现提供了关键见解,强调了模型架构、训练数据和领域特定优化的作用。结果表明,最先进的模型(如 GPT-4 和 Claude Instant 100k)在捕捉历史背景方面表现出色,展示了精心设计的架构和多样化数据集的潜力。此外,CodeLlama 34B 的卓越表现突显了在结构化数据集(如代码)上重新训练对增强逻辑推理和时间分析的有效性。

In contrast, the limitations of general-purpose models and the inconsistencies among larger models, such as Llama 70B, underscore that size alone does not guarantee superior performance. Instead, this study reinforces the importance of training data quality and architectural refinement in shaping a model's ability to interpret linguistic evolution. Smaller models, like Llama 7B, demonstrated significant deficits in temporal understanding, further highlighting the necessity for targeted fine-tuning to enable such architectures to handle complex semantic tasks effectively.

相比之下,通用模型的局限性以及更大模型(如 Llama 70B)之间的不一致性表明,仅靠规模并不能保证卓越的性能。相反,本研究强调了训练数据质量和架构优化在塑造模型理解语言演变能力方面的重要性。较小的模型(如 Llama 7B)在时间理解方面表现出显著不足,进一步凸显了针对性微调的必要性,以使这些架构能够有效处理复杂的语义任务。

These findings have profound implications for the development of LLMs tailored for applications in digital humanities, historical linguistics, and other fields that require temporal sensitivity. By prioritizing diverse, domain-specific datasets and leveraging specialized retraining approaches, future models could achieve higher levels of precision and contextual understanding. This study lays the groundwork for further exploration into the optimization of LLMs for temporal semantic analysis, advocating for a research trajectory that bridges computational advancements with interdisciplinary applications.

这些发现对开发适用于数字人文、历史语言学以及其他需要时间敏感性的领域的大语言模型具有深远意义。通过优先考虑多样化的领域特定数据集,并利用专门的再训练方法,未来的模型可以实现更高水平的精确度和上下文理解。本研究为进一步探索大语言模型在时间语义分析中的优化奠定了基础,倡导一种将计算进展与跨学科应用相结合的研究路径。
