LlamBERT: Large-scale low-cost data annotation in NLP
Bálint Csanády1, Lajos Muzsai1, Péter Vedres1, Zoltán Nádasdy2,3, András Lukács1
Abstract. Large Language Models (LLMs), such as GPT-4 and Llama 2, show remarkable proficiency in a wide range of natural language processing (NLP) tasks. Despite their effectiveness, the high costs associated with their use pose a challenge. We present LlamBERT, a hybrid approach that leverages LLMs to annotate a small subset of large, unlabeled databases and uses the results for fine-tuning transformer encoders like BERT and RoBERTa. This strategy is evaluated on two diverse datasets: the IMDb review dataset and the UMLS Meta-Thesaurus. Our results indicate that the LlamBERT approach slightly compromises on accuracy while offering much greater cost-effectiveness.
Keywords: NLP, data annotation, LLM, Llama, BERT, ontology, artificial intelligence
1. Introduction
In the contemporary technological landscape, when confronted with the task of annotating a large corpus of natural language data using a natural language prompt, LLMs such as the proprietary GPT-4 [1] and the open-source Llama 2 [2] present themselves as compelling solutions. Indeed, minimal prompt-tuning enables them to handle a wide variety of NLP tasks with high proficiency [3]. However, running such LLMs on millions of prompts demands large and expensive computational resources. There have been optimization efforts aimed at achieving superior performance with reduced resource requirements [4, 5]. Numerous studies have investigated the efficiency and resource requirements of LLMs versus smaller transformer encoders and humans [6, 7, 8, 9, 10, 11]. Recent advances in data augmentation with LLMs [12] underscore our approach, which relies on data labeling. Going beyond the exclusive use of LLMs for a task, we combine LLMs with substantially smaller yet capable NLP models. The study closest to our approach is [13], where GPT-NeoX was used as a surrogate for human annotation in named entity recognition.
Through two case studies, our research aims to assess the advantages and limitations of the approach we call LlamBERT, a hybrid methodology utilizing both LLMs and smaller-scale transformer encoders. The first case study examines the partially annotated IMDb review dataset [14] as a comparative baseline, while the second selects biomedical concepts from the UMLS Meta-Thesaurus [15] to demonstrate potential applications. Leveraging LLMs' language modeling capabilities while utilizing relatively modest resources enhances their accessibility and enables new business opportunities. We believe that such resource-efficient solutions can foster sustainable development and environmental stewardship.
2. Approach
Given a large corpus of unlabeled natural language data, the suggested LlamBERT approach takes the following steps: (i) Annotate a reasonably sized, randomly selected subset of the corpus utilizing Llama 2 and a natural language prompt reflecting the labeling criteria; (ii) Parse the Llama 2 responses into the desired categories; (iii) Discard any data that fails to classify into any of the specified categories; (iv) Employ the resulting labels to perform supervised fine-tuning on a BERT classifier; (v) Apply the fine-tuned BERT classifier to annotate the original unlabeled corpus.
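To make the pipeline concrete, the following is a minimal sketch of steps (i)-(iii) using the Hugging Face transformers library. The checkpoint name, prompt wording, and answer-parsing rule are illustrative assumptions rather than the paper's exact choices; steps (iv)-(v) amount to standard supervised fine-tuning and inference, sketched in Section 3.1.

```python
# Minimal sketch of LlamBERT steps (i)-(iii); the prompt and the parsing
# rule are illustrative assumptions, not the paper's exact implementation.
from transformers import pipeline

llm = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed checkpoint name
    device_map="auto",
)

def annotate_subset(texts):
    """Label texts with the LLM, parse the answers, and drop unparseable ones."""
    kept, labels = [], []
    for text in texts:
        prompt = (
            "Decide whether the sentiment of the following movie review is "
            "positive or negative. Answer with a single word.\n\n"
            f"Review: {text}\nAnswer:"
        )
        answer = llm(prompt, max_new_tokens=4, do_sample=False,
                     return_full_text=False)[0]["generated_text"]
        answer = answer.strip().lower()
        if answer.startswith("positive"):
            kept.append(text); labels.append(1)
        elif answer.startswith("negative"):
            kept.append(text); labels.append(0)
        # step (iii): any other response is discarded
    return kept, labels
```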
We explored two binary classification tasks, engineering the prompt to limit the LLM responses to one of the two binary choices. As anticipated, our efforts to craft such a prompt were considerably more effective when utilizing the 'chat' variants of Llama 2 [16]. We investigated two versions: Llama-2-7b-chat running on a single A100 80GB GPU, and Llama-2-70b-chat requiring four such GPUs. We also tested the performance of gpt-4-0613 using the OpenAI API.
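For reference, a prompt in the Llama 2 chat template might be assembled as follows; the system instruction and the few-shot layout are our own illustrative assumptions, not the exact prompt used in the experiments.

```python
# Illustrative prompt builder for the Llama 2 chat template; the wording
# is an assumption, not the exact prompt used in the experiments.
def build_prompt(review, examples=()):
    system = ("You are given movie reviews. Classify the sentiment of each "
              "review, answering with exactly one word: positive or negative.")
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    # Few-shot examples are inserted as earlier dialogue turns.
    for example_review, example_label in examples:
        prompt += f"Review: {example_review} [/INST] {example_label} </s><s>[INST] "
    prompt += f"Review: {review} [/INST]"
    return prompt

# 1-shot usage: a single negative example, countering the positive bias of
# Llama-2-7b-chat noted in Section 3.1.
prompt = build_prompt("A stunning film.",
                      examples=[("A dull, joyless mess.", "negative")])
```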
3. The IMDb dataset
The Stanford Large Movie Review Dataset (IMDb) [14] is a binary sentiment dataset commonly referenced in academic literature. It comprises 25,000 labeled movie reviews for training purposes, 25,000 labeled reviews designated for testing, and an additional 50,000 unlabeled reviews that can be employed for supplementary self-supervised training. This dataset serves as a fundamental baseline in NLP for classification problems, which allows us to evaluate our method against a well-established standard [17, 18, 19].
3.1. Experimental results
All of the results in this section were measured on the entire IMDb sentiment test data. In Table 1, we compare the performance of Llama 2 and GPT-4 in different few-shot settings. Due to limited access to the OpenAI API, we only measured the 0-shot performance of GPT-4. The results indicate that the number of few-shot examples has a significant impact on Llama-2-7b-chat. This model exhibited a bias toward classifying reviews as positive, but few-shot examples of negative sentiment effectively mitigated this. Likely due to reaching the context-length limit, 3-shot prompts did not outperform 2-shot prompts on Llama-2-7b-chat, achieving an accuracy of $87.27\%$. The inference times shown in Table 1 depend on various factors, including the implementation and available hardware resources; they reflect the specific setup we used at the time of writing.
Table 1: Comparison of LLM test performances on the IMDb data.
| LLM | Accuracy % (0-shot) | Accuracy % (1-shot) | Inference time |
|---|---|---|---|
| Llama-2-7b-chat | 75.28 | 89.77 | |
| Llama-2-70b-chat | 95.39 | 95.33 | |
In Table 2, we compare various pre-trained BERT models, each fine-tuned for five epochs with a batch size of 16 on different training data. First, we established a baseline by using the original gold-standard training data. For the LlamBERT results, the training data was labeled by the Llama-2-70b-chat model using 0-shot prompts. The LlamBERT results were not far behind the baseline measurements, underscoring the practicality and effectiveness of the framework. Incorporating the extra 50,000 unlabeled reviews in LlamBERT yielded a slight improvement in accuracy. We also evaluated a combined strategy in which we first fine-tuned on the extra data labeled by Llama-2-70b-chat, and then on the gold training data. The large version of RoBERTa performed best in all four training scenarios, reaching a state-of-the-art accuracy of $96.68\%$. Inference on the test data with roberta-large took $9\mathrm{m}18\mathrm{s}$, after fine-tuning for $2\mathrm{h}33\mathrm{m}$. Thus, we estimate that labeling the entirety of IMDb's 7.816 million movie reviews [20] would take about $48\mathrm{h}28\mathrm{m}$ with roberta-large. In contrast, the same task would require approximately 367 days on our setup using Llama-2-70b-chat, while demanding significantly more computing power.
Table 2: Comparison of BERT test accuracies on the IMDb data.
| BERT model | Baseline train | LlamBERT train | LlamBERT train & extra | Combined extra + train |
|---|---|---|---|---|
| distilbert-base [21] | 91.23 | 90.77 | 92.12 | 92.53 |
| bert-base | 92.35 | 91.58 | 92.76 | 93.47 |
| bert-large | 94.29 | 93.31 | 94.07 | 95.03 |
| roberta-base | 94.74 | 93.53 | 94.28 | 95.23 |
| roberta-large | 96.54 | 94.83 | 94.98 | 96.68 |
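As a reference for the fine-tuning setup behind Table 2, the following is a minimal sketch using the Hugging Face Trainer with the stated five epochs and batch size of 16; hyperparameters not stated in the text are left at library defaults, and the dataset handling is simplified. For the LlamBERT columns, the gold training labels would be replaced by the Llama-2-70b-chat labels.

```python
# Minimal fine-tuning sketch (five epochs, batch size 16, as stated above);
# unstated hyperparameters are assumptions left at library defaults.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=2)

imdb = load_dataset("imdb")  # the 'unsupervised' split holds the 50k extra reviews

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

train = imdb["train"].map(tokenize, batched=True)
test = imdb["test"].map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="llambert-roberta",
    num_train_epochs=5,
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=args, train_dataset=train,
                  eval_dataset=test, tokenizer=tokenizer)
trainer.train()
print(trainer.evaluate())
```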
3.2. Error analysis
To assess the relationship between training data quantity and the accuracy of the ensuing BERT model, we fine-tuned roberta-large on different-sized subsets of the gold training data as well as of the data labeled by Llama-2-70b-chat. As the left side of Fig. 1 indicates, the performance improvement attributed to the increasing amount of training data plateaus more rapidly in the case of LlamBERT. Based on these results, we concluded that labeling 10,000 entries represents a reasonable balance between accuracy and efficiency for the LlamBERT experiments in the next section. We were also interested in assessing the impact of deliberately mislabeling various-sized random subsets of the gold labels. The discrepancy between the gold-standard training labels and those generated by Llama 2 stands at $4.61\%$; this prompted our curiosity about how this $4.61\%$ error rate compares to mislabeling a randomly chosen subset of the gold training data. As shown on the right side of Fig. 1, roberta-large demonstrates substantial resilience to random mislabeling. Furthermore, data mislabeled by Llama-2-70b-chat results in a more pronounced decrease in performance compared to that of a random sample.
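The random-mislabeling experiment can be reproduced with a simple label-flipping routine along these lines; this is a sketch, and the exact sampling procedure used for Fig. 1 is an assumption.

```python
# Flip a random fraction p of binary gold labels before fine-tuning, to
# simulate annotation noise; the sampling details are assumptions.
import random

def flip_labels(labels, p, seed=0):
    rng = random.Random(seed)
    flipped = list(labels)
    for i in rng.sample(range(len(flipped)), k=round(p * len(flipped))):
        flipped[i] = 1 - flipped[i]
    return flipped

# e.g. noise matching Llama-2-70b-chat's 4.61% disagreement with the gold labels
gold_labels = [1, 0] * 12500  # hypothetical stand-in for the IMDb train labels
noisy_labels = flip_labels(gold_labels, p=0.0461)
```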
Table 3: Comparison of human annotations to model outputs on wrongly classified test reviews.
| RoBERTa sentiment | LlamBERT: Positive | LlamBERT: Negative | LlamBERT: Mixed | Combined: Positive | Combined: Negative | Combined: Mixed |
|---|---|---|---|---|---|---|
| Positive | 31 | 16 | 13 | 25 | 17 | 13 |
| Negative | 17 | 14 | 9 | 15 | 14 | 16 |
Figure 1: Accuracy (%) comparison of RoBERTa classifiers on the IMDb test data. On the left: the effects of training data size. On the right: the effects of intentionally mislabeling a random part of the gold training data.
We also conducted a manual error analysis on two of the models fine-tuned from roberta-large. For the model fine-tuned with the combined strategy, we randomly selected 100 reviews from the test data where the model outputs differed from the gold labels. We sampled an additional 27 mislabeled reviews from the model fine-tuned with the LlamBERT strategy, bringing the error sample for that model to 100 as well. We collected human annotations for the sentiment of the selected reviews independently of the gold labels. For the human annotation, we added a third category, mixed/neutral. Reviews not discussing the movie, or indicating that 'the film is so bad it is good', were typically classified in this third category. Table 3 compares the human annotations to the model outputs. The results indicate a comparable ratio of positive to negative labels between the human annotations and the model outputs, suggesting that the model outputs are more aligned with human sentiment than the original labels. Overall, human performance on this hard subset of the test data was worse than random labeling.
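Selecting the reviews for this manual analysis amounts to sampling from the model-gold disagreements, for example as below; the function and variable names are hypothetical.

```python
# Sample n test reviews where the model output differs from the gold label.
import random

def sample_disagreements(texts, gold, predicted, n=100, seed=0):
    errors = [(t, g, p) for t, g, p in zip(texts, gold, predicted) if g != p]
    return random.Random(seed).sample(errors, k=min(n, len(errors)))
```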
4. The UMLS dataset
The Unified Medical Language System (UMLS) [15], devel