[论文翻译]意义随时间变化的动态:对大语言模型的评估


原文地址:https://arxiv.org/pdf/2501.05552


The dynamics of meaning through time: Assessment of Large Language Models

意义随时间变化的动态:对大语言模型的评估

Abstract

摘要

Understanding how large language models (LLMs) grasp the historical context of concepts and their semantic evolution is essential in advancing artificial intelligence and linguistic studies. This study aims to evaluate the capabilities of various LLMs in capturing temporal dynamics of meaning, specifically how they interpret terms across different time periods. We analyze a diverse set of terms from multiple domains, using tailored prompts and measuring responses through both objective metrics (e.g., perplexity and word count) and subjective human expert evaluations. Our comparative analysis includes prominent models like ChatGPT, GPT-4, Claude, Bard, Gemini, and Llama. Findings reveal marked differences in each model's handling of historical context and semantic shifts, highlighting both strengths and limitations in temporal semantic understanding. These insights offer a foundation for refining LLMs to better address the evolving nature of language, with implications for historical text analysis, AI design, and applications in digital humanities.

理解大语言模型 (LLMs) 如何掌握概念的历史背景及其语义演变对于推进人工智能和语言学研究至关重要。本研究旨在评估各种大语言模型在捕捉意义的时间动态方面的能力,特别是它们如何解释不同时期的术语。我们分析了来自多个领域的多样化术语集,使用定制的提示并通过客观指标(例如,困惑度和词数)和主观的人类专家评估来衡量响应。我们的比较分析包括 ChatGPT、GPT-4、Claude、Bard、Gemini 和 Llama 等知名模型。研究结果揭示了每个模型在处理历史背景和语义变化方面的显著差异,突显了它们在时间语义理解方面的优势和局限性。这些见解为改进大语言模型以更好地应对语言的演变性质提供了基础,对历史文本分析、AI 设计以及数字人文领域的应用具有重要意义。

Keywords: Large Language Models (LLMs), Temporal Reasoning, Historical Reasoning.

关键词:大语言模型 (Large Language Models, LLMs),时序推理,历史推理。

1. Introduction

1. 引言

Large language models (LLMs) have revolutionized numerous domains, demonstrating remarkable performance across a wide array of tasks, including reasoning, understanding, truthfulness, mathematics, and coding (Periti and Montanelli, 2024; Zhao et al., 2023). The capacity of these models is typically assessed through various benchmarks that evaluate their proficiency in each of these domains (Chang et al., 2024). Central to their success is the combination of model size and the data used during pre-training (Zhao et al., 2023). A growing body of literature explores the relationship between the scaling of model size and the emergence of increasingly sophisticated cognitive abilities, commonly referred to as emergent intelligence (Kaplan et al., 2020). Additionally, other studies emphasize the pivotal role of training data quality, specifically highlighting how pre-training on diverse and specialized datasets can significantly enhance a model's reasoning capabilities (Isik et al., 2024).

大语言模型 (LLMs) 已经彻底改变了众多领域,在推理、理解、真实性、数学和编码等广泛任务中展示了卓越的性能 (Periti 和 Montanelli, 2024; Zhao 等, 2023)。这些模型的能力通常通过各种基准测试来评估,这些基准测试评估了它们在每个领域的熟练程度 (Chang 等, 2024)。其成功的关键在于模型规模和预训练期间使用的数据的结合 (Zhao 等, 2023)。越来越多的文献探讨了模型规模扩展与日益复杂的认知能力(通常称为涌现智能)之间的关系 (Kaplan 等, 2020)。此外,其他研究强调了训练数据质量的关键作用,特别指出在多样化和专业化的数据集上进行预训练如何显著增强模型的推理能力 (Isik 等, 2024)。

The vast corpus of text used to train these models is predominantly derived from content available on the internet, much of which has been produced over the past few decades (Liu et al., 2024). This temporal context inherently introduces challenges, as much of the text that drives current models reflects contemporary understandings of language and meaning. Consequently, the historical evolution of words, their meanings, and their contextual shifts may be underrepresented in modern training data (Manjavacas and Fonteyn, 2022). This temporal gap poses a particular challenge for LLMs when tasked with interpreting how words and phrases have evolved across time.

用于训练这些模型的庞大文本语料库主要来源于互联网上的内容,其中大部分是在过去几十年中产生的 (Liu et al., 2024)。这种时间背景本身就带来了挑战,因为驱动当前模型的文本大多反映了当代对语言和意义的理解。因此,词汇的历史演变、其含义以及语境的变化在现代训练数据中可能没有得到充分体现 (Manjavacas and Fonteyn, 2022)。这种时间差距对大语言模型在解释词汇和短语如何随时间演变时提出了特别的挑战。

This gap has spurred interest in examining whether LLMs, which are typically trained on contemporary data, can capture the historical evolution of words and their meanings (Cuscito et al., 2024; Manjavacas Arevalo and Fonteyn, 2021; Manjavacas and Fonteyn, 2022). The central motivation of this study is to assess whether current LLMs possess the ability to understand semantic shifts over time. Specifically, it seeks to determine whether models, in their current configurations, can track how word meanings have evolved over the past century. In doing so, this study also explores the possibility of establishing this capacity (tracking and interpreting historical linguistic changes) as a benchmark for future advancements in LLM development. This study treats the ability to understand changes in a word's meaning as a critical dimension of reasoning and language comprehension, with potential implications for a variety of applications in both AI development and interdisciplinary research.

这一差距激发了人们对于大语言模型(LLM)是否能够捕捉词汇及其意义历史演变的兴趣(Cuscito 等,2024;Manjavacas Arevalo 和 Fonteyn,2021;Manjavacas 和 Fonteyn,2022)。本研究的核心动机是评估当前的大语言模型是否具备理解语义随时间变化的能力。具体而言,它试图确定当前配置的模型是否能够追踪过去一个世纪中词汇意义的演变。通过这样做,本研究还探讨了将这种能力——追踪和解释历史语言变化——确立为未来大语言模型发展基准的可能性。本研究旨在理解大语言模型理解词汇意义变化的能力,作为推理和语言理解的关键维度,并对人工智能开发和跨学科研究中的多种应用具有潜在影响。

The findings of this study reveal significant variability in the ability of large language models (LLMs) to interpret temporal semantic shifts, emphasizing that training data quality and domain-specific fine-tuning outweigh model size in determining performance. GPT-4 and Claude Instant 100k demonstrated superior factuality and comprehensiveness, reflecting the advantages of robust training methodologies. Meanwhile, the code-based Llama 34B surpassed larger Llama models, underscoring the value of retraining on structured datasets, such as code, to enhance analytical reasoning and temporal understanding. In contrast, models like Google Gemini and smaller Llama variants struggled to capture nuanced historical contexts, highlighting the limitations of general-purpose training approaches. These insights establish a foundation for advancing LLM design and training strategies, enabling improved applications in historical linguistics, digital humanities, and beyond.

本研究的结果揭示了大语言模型 (LLMs) 在解释时间语义变化能力上的显著差异,强调了训练数据质量和领域特定微调在决定性能方面的重要性,超过了模型规模。GPT-4 和 Claude Instant 100k 在事实性和全面性方面表现出色,反映了强大训练方法的优势。与此同时,基于代码的 Llama 34B 超越了更大的 Llama 模型,突显了在结构化数据集(如代码)上重新训练的价值,以增强分析推理和时间理解能力。相比之下,Google Gemini 和较小的 Llama 变体在捕捉细微历史背景方面表现不佳,突显了通用训练方法的局限性。这些见解为推进大语言模型设计和训练策略奠定了基础,使其在历史语言学、数字人文等领域中的应用得以改进。

The paper is organised in six sections. A brief review of LLMs within the scope of word meaning development is laid out in section 2. Section 3 discusses the research methodology and the experiment performed to meet the research aim and goals. Section 4 presents the results of the experiment. Section 5 provides a comprehensive discussion of the results and how they correlate with similar studies in the same domain. Finally, conclusions are shared in section 6.

本文共分为六个部分。第2部分简要回顾了大语言模型 (LLM) 在词义发展领域的应用范围。第3部分讨论了研究方法和为实现研究目标而进行的实验。第4部分展示了实验结果。第5部分对结果进行了全面讨论,并探讨了这些结果与同一领域类似研究的相关性。最后,第6部分分享了结论。
2. Related Work

2. 相关工作

The evolution of word meanings over time is influenced by a myriad of factors, including social changes, technological advancements, and cultural dynamics. This interplay between language, history, and culture highlights the importance of understanding semantic change as a gradual process that encompasses both lexical and core meaning shifts. Traditionally, semantic change has been examined through corpus linguistics, where researchers analyze texts from different historical periods to identify shifts in word usage and meaning. By examining word frequencies and the contexts in which they appear, scholars can trace the evolution of language over time (Asri et al., 2024).

词义的演变受到多种因素的影响,包括社会变迁、技术进步和文化动态。语言、历史和文化之间的相互作用凸显了理解语义变化作为一个渐进过程的重要性,这一过程既包括词汇变化,也包括核心意义的变化。传统上,语义变化通过语料库语言学进行研究,研究人员分析不同历史时期的文本,以识别词汇使用和意义的变化。通过考察词汇频率及其出现的上下文,学者们可以追溯语言随时间的演变 (Asri et al., 2024) 。

In recent years, diachronic word embeddings have emerged as a powerful tool to capture changes in word meanings across time. These embeddings align words with their respective time periods, enabling a strong understanding of how meanings shift in response to cultural and societal changes. Notably, two statistical laws of semantic change have been proposed: i) words that are used more frequently tend to change at a slower rate, ii) polysemous words exhibit a higher rate of change (Hamilton et al., 2016a). This framework allows researchers to categorize various types of semantic shifts such as drift of meaning based on cultural norms (Spataru et al., 2024), specialisation of a meaning over time (Hamilton et al., 2016a), generalisation of a meaning over time (Wegmann et al., 2020), pejoration and amelioration which refer to a shift towards negative or positive connotations, respectively (Periti and Montanelli, 2024; Wevers and Koolen, 2020), metaphorical meaning added to an existing word based on pop culture or a change in a society (de Sá et al., 2024) and finally, the broadening and narrowing of a meaning through time (Vijayarani and Geetha, 2020).

近年来,历时词嵌入 (diachronic word embeddings) 作为一种强大的工具出现,用于捕捉词语意义随时间的变化。这些嵌入将词语与其对应的时间段对齐,从而能够深入理解意义如何随着文化和社会变化而演变。值得注意的是,已经提出了两条语义变化的统计规律:i) 使用频率较高的词语变化速度较慢,ii) 多义词表现出更高的变化率 (Hamilton et al., 2016a)。这一框架使研究人员能够分类各种类型的语义变化,例如基于文化规范的意义漂移 (Spataru et al., 2024)、意义随时间的特化 (Hamilton et al., 2016a)、意义随时间的泛化 (Wegmann et al., 2020)、贬义化和褒义化(分别指意义向负面或正面内涵的转变)(Periti and Montanelli, 2024; Wevers and Koolen, 2020)、基于流行文化或社会变化的现有词语隐喻意义的增加 (de Sá et al., 2024),以及意义随时间的扩展和缩小 (Vijayarani and Geetha, 2020)。
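The alignment step behind these diachronic embeddings (Hamilton et al., 2016a) is typically solved as an orthogonal Procrustes problem: a rotation maps one decade's embedding space onto another's so that a word's movement can be measured as a cosine distance. A minimal sketch in Python, using NumPy and toy matrices in place of real historical embeddings:

```python
import numpy as np

def align_embeddings(W_old, W_new):
    """Align W_new to W_old's coordinate space via orthogonal Procrustes.

    Rows are word vectors for a shared vocabulary; the rotation R
    minimizes ||W_new @ R - W_old||_F subject to R being orthogonal.
    """
    # SVD of the cross-covariance matrix yields the optimal rotation.
    U, _, Vt = np.linalg.svd(W_new.T @ W_old)
    return W_new @ (U @ Vt)

def semantic_shift(w_old, w_new):
    """Cosine distance between a word's vectors in the two aligned spaces."""
    cos = np.dot(w_old, w_new) / (np.linalg.norm(w_old) * np.linalg.norm(w_new))
    return 1.0 - cos

# Toy example: two spaces that differ only by a known rotation should
# align perfectly, giving a semantic shift of (numerically) zero.
rng = np.random.default_rng(0)
W_1950 = rng.normal(size=(100, 50))
theta = np.pi / 4
R_true = np.eye(50)
R_true[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]]
W_2020 = W_1950 @ R_true

aligned = align_embeddings(W_1950, W_2020)
print(f"residual shift: {semantic_shift(W_1950[0], aligned[0]):.2e}")
```

With real decade-by-decade corpora, a large post-alignment cosine distance for a word flags a candidate instance of the shift types listed above (specialisation, pejoration, and so on).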

On the other hand, as LLMs become increasingly integrated into daily tasks, their ability to handle semantic change and cultural variations is crucial. By training LLMs on diverse corpora and fine-tuning them for specific tasks related to semantic change, researchers can enhance their performance in detecting shifts in meaning (Shen et al., 2024; Tao et al., 2023). However, LLMs often propagate cultural biases inherent in their training data, which can skew responses in cross-cultural contexts (Tao et al., 2023). In addition, LLMs may struggle with complex social scenarios and generate text that deviates from the intended meaning (Choi et al., 2023; Spataru et al., 2024).

另一方面,随着大语言模型 (LLM) 日益融入日常任务,其处理语义变化和文化差异的能力变得至关重要。通过在多样化的语料库上训练大语言模型,并针对与语义变化相关的特定任务进行微调,研究人员可以提高其在检测意义变化方面的性能 (Shen et al., 2024; Tao et al., 2023)。然而,大语言模型往往会传播其训练数据中固有的文化偏见,这可能会在跨文化环境中导致回答偏差 (Tao et al., 2023)。此外,大语言模型在处理复杂的社会场景时可能会遇到困难,并生成偏离预期意义的文本 (Choi et al., 2023; Spataru et al., 2024)。

Despite these limitations, LLMs can grasp cultural common-sense when fine-tuned on balanced datasets that incorporate diverse cultural perspectives (Shen et al., 2024). Addressing the biases in LLMs requires training on time-aware datasets that reflect the aftermath of significant cultural events, such as the term "coronavirus" and its evolving meanings (Mousavi et al., 2024). This approach allows for a deeper tracing of historical contexts and semantic changes.

尽管存在这些限制,大语言模型 (LLM) 在经过包含多元文化视角的平衡数据集微调后,能够掌握文化常识 (Shen et al., 2024)。解决大语言模型中的偏见问题,需要在反映重大文化事件后果的时间感知数据集上进行训练,例如“coronavirus”一词及其不断演变的含义 (Mousavi et al., 2024)。这种方法能够更深入地追溯历史背景和语义变化。

The use of language models to study the evolution of meaning over time has garnered increasing attention in recent research. Several studies have explored the capabilities of language models in tracking semantic shifts and historical language usage. For instance, Bamler and Mandt (2017) proposed dynamic word embeddings as a method for capturing temporal variations in word meanings, highlighting the utility of diachronic models in understanding historical text corpora. Similarly, Hamilton et al. (2016) introduced dynamic embeddings to study semantic change over decades, demonstrating how embeddings trained on historical corpora reveal shifts in cultural and social contexts. More recently, MacBERTh was specifically designed to analyze historical texts, leveraging transformers to improve the understanding of linguistic evolution across time periods (Manjavacas Arevalo and Fonteyn, 2021).

近年来,利用语言模型研究词义随时间演变的方法在研究中受到了越来越多的关注。多项研究探讨了语言模型在追踪语义变化和历史语言使用方面的能力。例如,Bamler 和 Mandt (2017) 提出了动态词嵌入 (dynamic word embeddings) 作为捕捉词义时间变化的方法,强调了历时模型在理解历史文本语料库中的实用性。同样,Hamilton 等人 (2016) 引入了动态嵌入来研究数十年间的语义变化,展示了基于历史语料库训练的嵌入如何揭示文化和社会背景的变迁。最近,MacBERTh 被专门设计用于分析历史文本,利用 Transformer 来提高对跨时间段语言演变的理解 (Manjavacas Arevalo 和 Fonteyn, 2021)。

The implications of understanding semantic change through LLMs are vast, including applications in tracking antisemitic language, analyzing the evolution of hate speech, and assessing sentiment shifts in economic discourse. Research has shown that diachronic embeddings can effectively monitor these shifts across multiple languages, such as English, German, French, Swedish, and Spanish (Hoeken et al., 2023; Periti et al., 2024; Tripodi et al., 2019). Additionally, the methodologies developed for semantic change detection can advance fields such as historical linguistics, social media analysis, and sentiment analysis by providing tools to detect potential indicators of coded language use, such as political dog whistles (Boholm et al., 2024).

通过大语言模型理解语义变化的影响是广泛的,包括在追踪反犹太语言、分析仇恨言论的演变以及评估经济话语中的情感变化等方面的应用。研究表明,历时嵌入可以有效地监控多种语言(如英语、德语、法语、瑞典语和西班牙语)中的这些变化 (Hoeken et al., 2023; Periti et al., 2024; Tripodi et al., 2019)。此外,为语义变化检测开发的方法可以通过提供工具来预测准确语言使用的潜在指标(如政治狗哨),从而推动历史语言学、社交媒体分析和情感分析等领域的发展 (Boholm et al., 2024)。

3. Research Methodology

3. 研究方法

To evaluate the ability of LLMs to capture the temporal dynamics of meaning, we conducted a comprehensive, multi-dimensional experiment using a range of state-of-the-art models.

为了评估大语言模型 (LLM) 捕捉意义时间动态的能力,我们使用一系列最先进的模型进行了全面的多维度实验。

Language models. We evaluate six families of large language models in this study. The first is ChatGPT (OpenAI, 2022), for which we use text-davinci-002, a 175B-parameter model based on the GPT-3 architecture. The second comprises GPT-4 (OpenAI et al., 2023) and GPT-4o (OpenAI et al., 2024), more advanced models with improved reasoning and contextual understanding; although their exact parameter counts are undisclosed, they are estimated to exceed 175B parameters. The third is Claude (Anthropic, 2023), for which we use Claude 1, with parameters ranging from 52B to 100B. The fourth is Bard, based on Google's LaMDA model (Thoppilan et al., 2022), which comes in configurations of 422M, 2B, 8B, 68B, and 137B parameters. The fifth is Gemini (Google DeepMind, 2024), a model from Google's suite with versions estimated to match or exceed the PaLM 540B model. The sixth is Llama (Meta, 2023), evaluated at several scales: Llama 1 with 34B parameters, and Llama 2 with 7B, 13B, and 70B parameters.

语言模型。我们在本研究中评估了六种大语言模型。第一个是 ChatGPT (OpenAI, 2022),我们使用了 text-davinci-002,这是一个基于 GPT3 架构的 175B 参数模型。第二个是 GPT-4 (OpenAI et al., 2023) 和 GPT4O (OpenAI et al., 2024),它们是更先进的模型,具有改进的推理和上下文理解能力,尽管其确切参数数量未公开,但估计超过 175B 参数。第三个是 Claude (Anthropic, 2023),我们使用了 Claude 1,其参数范围在 52B 到 100B 之间。第四个是 Bard,基于 Google 的 LaMDA 模型 (Thoppilan et al., 2022),其配置包括 422M、2B、8B、68B 和 137B 参数。第五个是 Gemini (Google DeepMind, 2024),这是 Google 系列中的一个模型,其版本估计与或超过 PaLM 540B 模型。第六个是 Llama (Meta, 2023),我们同时使用了 Llama 1 和 Llama 2,并在不同规模下进行了评估:Llama 1 有 34B 参数,Llama 2 有 7B、13B 和 70B 参数。

Term Selection. A carefully curated list of terms was selected to represent diverse domains, including scientific concepts, historical events, and cultural phenomena. These terms were chosen to evaluate how well each model interprets semantic shifts and the historical context associated with each concept. Two key terms were selected for this study: "Data Mining" (a technical term) and "Michael Jackson" (a cultural figure). These terms span both technological and cultural evolution from the 1920s to the 2020s, offering a robust platform to assess the LLMs' capacity to track and describe temporal meaning changes.

术语选择。我们精心挑选了一系列术语,涵盖科学概念、历史事件和文化现象等多个领域。这些术语旨在评估每个模型如何解释语义变化以及与每个概念相关的历史背景。本研究选择了两个关键术语:"数据挖掘 (Data Mining)"(技术术语)和"Michael Jackson"(文化人物)。这些术语跨越了从1920年代到2020年代的技术和文化演变,为评估大语言模型跟踪和描述时间意义变化的能力提供了一个强大的平台。

Prompt Design and Input Format. For each term, we designed specific prompts aimed at assessing the models' understanding of the historical evolution of meaning. A typical prompt asked the models to:

提示设计与输入格式。对于每个术语,我们设计了特定的提示,旨在评估模型对词义历史演变的理解。一个典型的提示要求模型:

“Create a table with two columns. The first column should list decades (e.g., 1920s, 1930s, etc.), and the second should describe the meaning and synonyms for the term based on the knowledge and context of that period.” This prompt structure remained consistent across all models, ensuring that the evaluation was controlled, and the comparison was fair. Prompts were specifically crafted to challenge each model’s ability to capture and relay temporal semantic shifts without additional training or fine-tuning.

创建一个包含两列的表。第一列应列出年代(例如,1920年代、1930年代等),第二列应根据该时期的知识和背景描述术语的含义和同义词。此提示结构在所有模型中保持一致,确保评估受控且比较公平。提示经过精心设计,以挑战每个模型在不进行额外训练或微调的情况下捕捉和传达时间语义变化的能力。
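The controlled prompting setup above can be sketched in code as follows. The block only builds the shared prompt for each term; interpolating the term into the template is an illustrative choice, and dispatching the prompt to each provider's chat API is omitted since those interfaces differ per model:

```python
# Shared prompt issued, unchanged, to every model under test.
PROMPT_TEMPLATE = (
    "Create a table with two columns. The first column should list "
    "decades (e.g., 1920s, 1930s, etc.), and the second should describe "
    "the meaning and synonyms for the term '{term}' based on the "
    "knowledge and context of that period."
)

# The two terms evaluated in this study.
TERMS = ["Data Mining", "Michael Jackson"]

def build_prompts(terms):
    """Return the identical prompt for each term, keyed by term."""
    return {term: PROMPT_TEMPLATE.format(term=term) for term in terms}

prompts = build_prompts(TERMS)
for term, prompt in prompts.items():
    # Each prompt would be sent verbatim to every model, with no
    # additional training or fine-tuning.
    print(term, "->", len(prompt), "characters")
```

Keeping a single template ensures the comparison across models stays controlled: any difference in output reflects the model, not the wording of the question.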

Subjective Evaluation. Evaluations were conducted by human experts using two criteria: (1) Factuality score: experts evaluated how well each model captured the evolution of meaning over time; (2) Comprehensiveness score: the extent to which the models effectively answered the question, including the description length and the number of synonyms in each answer, if any.

主观评估由人类专家进行。根据以下标准选择了两种方法:(1) 事实性评分:专家评估每个模型捕捉意义随时间演变的程度,(2) 全面性评分:模型有效回答问题的程度,包括每个答案的描述长度和同义词数量(如果有)。
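A minimal sketch of how these expert scores, together with the objective word-count metric mentioned in the abstract, might be tabulated. The rating scale, number of experts, and the choice to sum ratings are illustrative assumptions, not the paper's exact protocol:

```python
def word_count(response: str) -> int:
    """Objective metric: whitespace-delimited length of a model's answer."""
    return len(response.split())

def aggregate(ratings: list[int]) -> int:
    """Combine per-expert ratings into one score (summing is assumed)."""
    return sum(ratings)

# Hypothetical ratings for one model/term pair: two experts each rate
# factuality and comprehensiveness on an assumed 0-11 scale.
factuality = aggregate([11, 11])         # -> 22
comprehensiveness = aggregate([10, 11])  # -> 21

response = ("1920s | The phrase had no technical meaning; 'mining' "
            "referred to mineral extraction.")
print(factuality, comprehensiveness, word_count(response))
```

Under these assumptions a per-term ceiling of 22 matches the maximum scores reported in the results section.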

4. Results

4. 结果

This section presents the findings from a comparative evaluation of LLMs on their ability to understand the temporal dynamics of meaning and semantic shifts for two terms, "Data Mining" and "Michael Jackson", see Table 1. The analysis reveals notable differences in model performance across metrics, including factuality and comprehensiveness. Key results indicate that models trained with specialized data, such as code, may exhibit enhanced analytical capabilities, potentially influencing their ability to interpret semantic evolution.

本节展示了大语言模型在理解“数据挖掘”和“迈克尔·杰克逊”这两个词的意义和语义演变的时间动态能力上的比较评估结果,见表 1。分析揭示了模型在不同指标(包括事实性和全面性)上的显著差异。关键结果表明,使用专门数据(如代码)训练的模型可能表现出更强的分析能力,这可能影响其解释语义演变的能力。

High-Performing Models

高性能模型

The models GPT-4 and Claude Instant 100k consistently outperformed other LLMs across both evaluation metrics, demonstrating a high capacity for capturing historical context. Both models scored a maximum of 22 in factuality and near-maximum in comprehensiveness (21 and 22 for Claude Instant 100k and GPT-4, respectively) for the term "Data Mining." A similarly strong performance was observed for the term "Michael Jackson," with GPT-4 achieving the highest comprehensiveness score of 22 and factuality of 21. These scores indicate an advanced capacity in these models to accurately trace semantic evolution, which may stem from robust training data diversity and model architecture optimized for complex language tasks. The consistent performance across terms suggests that GPT-4 and Claude may be more attuned to changes in meaning across different temporal contexts, supporting their utility in applications requiring historical linguistic analysis.

GPT-4 和 Claude Instant 100k 模型在两项评估指标上始终优于其他大语言模型,展示了其在捕捉历史背景方面的高能力。在术语“数据挖掘 (Data Mining)”上,两个模型在事实性 (factuality) 方面均获得了最高分 22 分,在全面性 (comprehensiveness) 方面接近满分(Claude Instant 100k 为 21 分,GPT-4 为 22 分)。在术语“Michael Jackson”上,GPT-4 也表现出色,获得了最高的全面性分数 22 分和事实性分数 21 分。这些分数表明,这些模型在准确追踪语义演变方面具有先进的能力,这可能源于其多样化的训练数据和针对复杂语言任务优化的模型架构。这些模型在不同术语上的一致表现表明,GPT-4 和 Claude 可能更适应不同时间背景下的语义变化,支持其在需要历史语言分析的应用中的实用性。

The Code-Based Llama Model’s Unique Strengths

基于代码的 Llama 模型的独特优势

The model Llama 34B, specifically trained on code and provided by Poe, emerged as a notable outlier within the Llama series, outperforming larger versions, such as Llama 70B and Llama 13B, across factuality and comprehensiveness. This model achieved a factuality score of 12 and comprehensiveness of 20 for "Data Mining" and similarly high scores of 18 and 22, respectively, for "Michael Jackson." The distinct performance of this model may be attributable to its extensive retraining on code datasets, which are characterized by structured syntax and logic-driven language. Such exposure potentially enhances a model’s analytical capabilities, providing a foundation for more structured thought processes that facilitate better recognition of historical patterns in meaning. This suggests that domain-specific retraining, particularly with code, may imbue LLMs with improved temporal reasoning and analytical rigor—skills critical for comprehending complex semantic shifts. Given these results, further research into code-based retraining could elucidate its potential benefits in enhancing temporal analytical abilities within LLMs.

模型 Llama 34B 是 Poe 提供的专门针对代码训练的模型,在 Llama 系列中表现突出,在事实性和全面性方面超越了更大的版本,如 Llama 70B 和 Llama 13B。该模型在“数据挖掘”方面获得了 12 分的事实性得分和 20 分的全面性得分,在“Michael Jackson”方面也分别获得了 18 分和 22 分的高分。该模型的独特表现可能归因于其在代码数据集上的广泛再训练,这些数据集以结构化语法和逻辑驱动的语言为特征。这种训练可能增强了模型的分析能力,为更结构化的思维过程提供了基础,从而有助于更好地识别历史意义的模式。这表明,特定领域的再训练,尤其是代码训练,可能赋予大语言模型更好的时间推理和分析严谨性——这些技能对于理解复杂的语义变化至关重要。鉴于这些结果,进一步研究基于代码的再训练可能有助于阐明其在增强大语言模型时间分析能力方面的潜在益处。

Underperformance and Inconsistencies Among Other Models

性能不足与其他模型的不一致性

Several models underperformed, particularly the smaller Llama variants (Llama 7B) and Google Bard. The Llama 2 7B model, for example, exhibited the lowest scores across both terms and evaluation metrics, scoring 0 for both factuality and comprehensiveness for "Data Mining" and only 1 and 0, respectively, for "Michael Jackson." These findings suggest that model size has some effect, but a larger model does not guarantee adequate understanding of temporal semantics; instead, both model architecture and training data composition play critical roles. Google Bard likewise displayed low performance, particularly on the "Data Mining" term (factuality: 3, comprehensiveness: 0), underscoring the challenges these models face in accurately representing historical semantic shifts. These inconsistencies, particularly among the Llama and Bard models, imply potential limitations in their training frameworks for tasks requiring deep temporal context comprehension.

在性能表现上,多个模型表现不佳,尤其是较小的 Llama 变体(Llama 7B)和 Google Bard。例如,Llama 2 7B 模型在“数据挖掘”和“Michael Jackson”两个术语和评估指标上得分最低,分别获得了 0 分和 1 分、0 分。这些发现表明,模型大小有一定影响,但更大的模型规模并不能保证对时间语义的充分理解;相反,模型架构和训练数据构成起着关键作用。正如 Google Bard 的表现所示,它在“数据挖掘”术语上表现尤其不佳(事实性:3,全面性:0),突显了这些模型在准确表示历史语义变化方面面临的挑战。这些模型之间的不一致性,尤其是 Llama 和 Bard 模型之间的差异,暗示了它们在需要深度时间上下文理解的任务中训练框架的潜在局限性。

Table 1: Performance of each model on the two terms, scored for its ability to answer the questions with factuality and comprehensiveness.

表 1: 展示两个术语及每个模