[论文翻译]使用大语言模型解决第三方库推荐中的流行度偏差问题


原文地址:https://arxiv.org/pdf/2501.10313


Addressing Popularity Bias in Third-Party Library Recommendations Using LLMs

使用大语言模型解决第三方库推荐中的流行度偏差问题

Abstract—Recommender systems for software engineering (RSSE) play a crucial role in automating development tasks by providing relevant suggestions according to the developer’s context. However, they suffer from the so-called popularity bias, i.e., the phenomenon of recommending popular items that might be irrelevant to the current task. In particular, the long-tail effect can hamper the system’s performance in terms of accuracy, thus leading to false positives in the provided recommendations. Foundation models are the most advanced generative AI-based models that achieve relevant results in several SE tasks.

摘要—软件工程推荐系统 (RSSE) 通过根据开发者的上下文提供相关建议,在自动化开发任务中发挥着至关重要的作用。然而,它们受到所谓的流行度偏差的影响,即推荐可能不相关的流行项目的现象。特别是,长尾效应可能会影响系统的准确性,从而导致提供的推荐中出现误报。基础模型是最先进的基于生成式 AI (Generative AI) 的模型,在多个软件工程任务中取得了相关成果。

This paper aims to investigate the capability of large language models (LLMs) to address the popularity bias in recommender systems of third-party libraries (TPLs). We conduct an ablation study experimenting with state-of-the-art techniques to mitigate the popularity bias, including fine-tuning and popularity penalty mechanisms. Our findings reveal that the considered LLMs cannot address the popularity bias in TPL recommenders, even though fine-tuning and a post-processing penalty mechanism contribute to increasing the overall diversity of the provided recommendations. In addition, we discuss the limitations of LLMs in this context and suggest potential improvements to address the popularity bias in TPL recommenders, thus paving the way for additional experiments in this direction.

本文旨在研究大语言模型(LLMs)在解决第三方库(TPLs)推荐系统中流行度偏差方面的能力。我们通过消融实验,尝试了包括微调和流行度惩罚机制在内的最先进技术来缓解流行度偏差。我们的研究结果表明,尽管微调和后处理惩罚机制有助于提高推荐的整体多样性,但所考虑的LLMs无法解决TPL推荐系统中的流行度偏差。此外,我们讨论了LLMs在此背景下的局限性,并提出了解决TPL推荐系统中流行度偏差的潜在改进措施,从而为该方向的进一步实验铺平了道路。

Index Terms—recommender systems, popularity bias, large language models

索引词—推荐系统、流行度偏差、大语言模型

I. INTRODUCTION

I. 引言

While the general concept of fairness, i.e., the absence of any bias or prejudice in a given decision-making process, has been widely explored in sensitive domains such as health, crime, or education [1]–[3], software fairness is an emerging term that is attracting more and more attention after the rise of AI-intensive models [3]–[5] and the related legal concerns underscored by the European Union [6].

尽管公平的一般概念,即在给定决策过程中不存在任何偏见或歧视,已在健康、犯罪或教育等敏感领域得到广泛探讨 [1]–[3],但软件公平性是一个新兴术语,随着 AI 密集型模型的兴起 [3]–[5] 以及欧盟强调的相关法律问题 [6],它正吸引越来越多的关注。

Among the large variety of intelligent systems, recommender systems for software engineering (RSSEs) [7], [8] are at the forefront in assisting developers in several tasks, spanning from code completion to automated program repair. Among different types of RSSEs, third-party library (TPL) RSSEs provide off-the-shelf software components relevant to the project under development [9]–[13].

在众多智能系统中,软件工程推荐系统 (RSSEs) [7], [8] 在协助开发者完成从代码补全到自动化程序修复等多项任务方面处于前沿地位。在各类 RSSEs 中,第三方库 (TPL) RSSEs 提供了与开发项目相关的现成软件组件 [9]–[13]。

While previous work demonstrates their accuracy in providing ready-to-use solutions, these systems tend to present frequently seen items [14]–[16], thus undermining the novelty of the results. Referred to as popularity bias in recent literature [17]–[19], this phenomenon is particularly harmful in TPL recommendation, as it can lead to the recommendation of libraries that are not relevant to the current task, thus causing the so-called long-tail effect [20], [21] that may hamper the system’s performance in terms of accuracy [22]. Although large language models (LLMs) [23] can be seen as the most advanced intelligent assistants to developers, as showcased in different development tasks [24], [25], recent research reveals that those models suffer from the long-tail effect in different coding tasks [26]. In particular, we aim to answer the following research question:

虽然之前的工作展示了它们在提供即用解决方案方面的准确性,但这些系统往往呈现常见项目 [14]–[16],从而削弱了结果的新颖性。这种现象在最近的文献中被称为流行度偏差 [17]–[19],在 TPL 推荐中尤其有害,因为它可能导致推荐与当前任务无关的库,从而引发所谓的长尾效应 [20], [21],这可能会影响系统在准确性方面的性能 [22]。尽管大语言模型 (LLMs) [23] 可以被视为开发者最先进的智能助手,正如在不同开发任务中所展示的那样 [24], [25],最近的研究表明,这些模型在不同编码任务中也会受到长尾效应的影响 [26]。特别是,我们旨在回答以下研究问题:

RQ: How effectively can open-source LLMs address popularity bias in TPL recommendations?

RQ: 开源大语言模型 (LLM) 能多有效地解决 TPL 推荐中的流行度偏差?

To this end, we conduct an ablation study [27] by defining six different experimental configurations involving three versions of the Llama model [28] and adopting different strategies to mitigate the popularity bias, i.e., few-shot prompt engineering, fine-tuning, and a popularity penalty mechanism. Our initial findings confirm that the long-tail effect emerges in the recommendations provided by the baseline model, even though fine-tuning and penalty mechanisms can mitigate the popularity bias. Therefore, we see our work as a stepping stone to further investigate this issue and provide more effective solutions to mitigate the popularity bias in RSSEs, e.g., employing retrieval-augmented generation [29] or involving human-in-the-loop solutions [30].

为此,我们通过定义六种不同的实验配置进行了一项消融研究 [27],涉及三个版本的 Llama 模型 [28],并采用不同的策略来缓解流行度偏差,即少样本提示工程、微调和流行度惩罚机制。我们的初步发现证实,尽管微调和惩罚机制可以缓解流行度偏差,但在基线模型提供的推荐中仍然出现了长尾效应。因此,我们将我们的工作视为进一步研究这一问题的基石,并为缓解 RSSE 中的流行度偏差提供更有效的解决方案,例如采用检索增强生成 [29] 或引入人在环解决方案 [30]。

The contributions of our work can be summarized as follows:

我们工作的贡献可以总结如下:

II. MOTIVATION AND BACKGROUND

II. 动机与背景

A. Motivating example

A. 动机示例

In AI-based systems, bias can originate from several factors. Mehrabi et al. proposed a fairness taxonomy [31] in which three main types of bias are identified, i.e., Data to Algorithm, Algorithm to User, and User to Data. The first type refers to biases in the training data, e.g., unbalanced variables, missing data, or noisy data, while the second type is related to the algorithm itself, e.g., the choice of the model, the used hyper-parameters, or the optimization processes. Finally, User to Data bias represents any inherent biases in users that might be reflected in the data they generate. Following this taxonomy, popularity bias falls under the Algorithm to User category, even though we acknowledge that it can also originate from bias in the training data.

在基于 AI 的系统中,偏见可能由多种因素引起。Mehrabi 等人提出了一种公平性分类法 [31],其中识别了三种主要类型的偏见,即数据到算法、算法到用户以及用户到数据。第一种类型指的是训练数据中的偏见,例如不平衡的变量、缺失数据或噪声数据,而第二种类型与算法本身相关,例如模型的选择、使用的超参数或优化过程。最后,用户到数据的偏见代表了用户可能存在的任何固有偏见,这些偏见可能会反映在他们生成的数据中。根据该分类法,流行度偏见属于算法到用户类别,尽管我们承认它可能源于训练数据中的偏见。


Fig. 1. Popularity bias in traditional TPL RSSEs

图 1: 传统 TPL RSSEs 中的流行度偏差

Figure 1 represents an explanatory process employing a recommender system for TPLs [11], [13]. In the shown example, the user is developing a Web application by relying on local dependencies, i.e., spring-core and jackson-core. Those dependencies represent the context of the TPL recommender, and they may be included in the query $\textcircled{1}$ performed to the system, i.e., the Recommender engine. During the recommendation phase, the system exploits $\textcircled{2}$ a knowledge base composed of various software artifacts, including OSS projects mined from GitHub. However, those projects use popular TPLs that might not be useful for the current task, thus lowering the TPL recommender’s overall accuracy. In the example, the system may provide $\textcircled{3}$ popular libraries, i.e., log4j and junit, instead of more relevant ones, i.e., commons-lang3 and micrometer-core, thus reducing the system accuracy.

图 1 展示了一个采用 TPL 推荐系统的示例流程 [11], [13]。在所示的示例中,用户正在开发一个依赖于本地依赖项(即 spring-core 和 jackson-core)的 Web 应用程序。这些依赖项代表了 TPL 推荐器的上下文,它们可能包含在向系统(即推荐引擎)执行的查询 $\textcircled{1}$ 中。在推荐阶段,系统利用 $\textcircled{2}$ 一个由各种软件工件组成的知识库,包括从 GitHub 挖掘的开源项目 (OSS)。然而,这些项目使用的流行 TPL 可能对当前任务没有帮助,从而降低了 TPL 推荐器的整体准确性。在示例中,系统可能会提供 $\textcircled{3}$ 流行的库,即 log4j 和 junit,而不是更相关的库,即 commons-lang3 和 micrometer-core,从而降低了系统的准确性。

Even though existing approaches try to mitigate this issue [9], a recent study reveals that traditional TPL recommenders still need to address the popularity bias adequately [19]. The Novelty metric assesses if a system can retrieve libraries in the long tail and expose them to projects [11]. This increases the possibility of coming across serendipitous libraries [32], e.g., those that are seen by chance but turn out to be useful for the project under development [7]. For example, there could be a recent library, yet to be widely used, that can better interface with new hardware or achieve faster performance than popular ones.

尽管现有方法试图缓解这一问题 [9],但最近的一项研究表明,传统的第三方库 (TPL) 推荐系统仍需充分解决流行度偏差问题 [19]。新颖性 (Novelty) 指标评估系统是否能够检索长尾库并将其推荐给项目 [11]。这增加了偶然发现有用库的可能性 [32],例如那些偶然看到但对开发中的项目有用的库 [7]。例如,可能存在一个尚未被广泛使用的新库,它能够更好地与新硬件接口或实现比流行库更快的性能。


Fig. 2. Overview of the proposed approach

图 2: 所提出方法的概述


On the one hand, project-specific requirements must be considered by recommender systems to provide more accurate recommendations. On the other hand, libraries that are well-documented and supported by an active community are more likely to be adopted by developers. This leads to a situation where the most popular libraries are recommended while the less popular ones are overlooked. In summary, recommending only popular TPLs would harm the novelty of the results, and a trade-off between popularity and relevance must be found to provide a more balanced set of recommendations.

一方面,推荐系统必须考虑项目特定的需求,以提供更准确的推荐。另一方面,那些文档完善且得到活跃社区支持的库更有可能被开发者采用。这导致了一种情况,即最受欢迎的库被推荐,而不太受欢迎的库则被忽视。总之,仅推荐流行的第三方库(TPL)会损害结果的新颖性,必须在流行度和相关性之间找到平衡,以提供更均衡的推荐集。

III. APPROACH

III. 方法

This section outlines the approach employed to investigate the use of open-source LLMs for recommending TPLs, as illustrated in Figure 2. The process begins with filtering the original dataset of Java libraries to extract the most popular libraries along with relevant contextual information. Subsequently, the employed foundation model is improved using two complementary strategies: prompt engineering and fine-tuning. To address the long-tail effect in the recommendations, a popularity penalty mechanism is introduced. The following sections provide a detailed explanation of each step.

本节概述了研究使用开源大语言模型 (LLM) 推荐第三方库 (TPL) 的方法,如图 2 所示。该过程首先从原始的 Java 库数据集中筛选出最受欢迎的库及其相关上下文信息。随后,通过两种互补策略改进所使用的基础模型:提示工程 (prompt engineering) 和微调 (finetuning)。为了应对推荐中的长尾效应,引入了一种流行度惩罚机制。以下各节将详细解释每个步骤。

A. Data encoding

A. 数据编码

To support our analysis, we utilize an existing dataset [11] sourced from GitHub projects. We then identify the most popular libraries based on their usage, specifically TPLs that frequently appear as dependencies in other projects. For this purpose, a filtering component is employed to isolate and extract only the popular libraries along with their associated information, such as functionalities, dependencies, README files, and usage scenarios.

为了支持我们的分析,我们利用了一个现有的数据集 [11],该数据集来源于 GitHub 项目。然后,我们根据使用情况识别出最受欢迎的库,特别是那些经常作为其他项目依赖的第三方库 (TPL)。为此,我们使用了一个过滤组件来隔离并提取仅包含流行库及其相关信息的部分,例如功能、依赖项、README 文件和用例场景。

For each recommendation session, the system first identifies libraries that are highly popular, specifically targeting the top 20 libraries based on their usage frequency as recorded in the dataset. In addition, each library is annotated with a usage score based on the user interaction data. This allows the system to dynamically adjust recommendations to avoid these highly utilized libraries.

对于每个推荐会话,系统首先识别出高度流行的库,特别是基于数据集中记录的使用频率排名前20的库。此外,每个库都会根据用户交互数据标注一个使用分数。这使得系统能够动态调整推荐,以避免这些高度使用的库。
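To make this filtering step concrete, the following is a minimal sketch (not the authors' implementation) of how the top-20 most used TPLs could be extracted from the mined dependency data and annotated with a usage score; the CSV layout and function names are assumptions.

```python
from collections import Counter
import csv

def load_library_usage(path: str) -> Counter:
    """Count how often each TPL appears as a dependency in the mined projects.
    Assumes a CSV with one (project, library) pair per row."""
    usage = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            usage[row["library"]] += 1
    return usage

def most_popular(usage: Counter, top_n: int = 20):
    """Return the top-N libraries annotated with a normalized usage score in [0, 1]."""
    max_count = max(usage.values())
    return [
        {"library": lib, "count": count, "usage_score": count / max_count}
        for lib, count in usage.most_common(top_n)
    ]

# Example: popular = most_popular(load_library_usage("dependencies.csv"))
```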

In addition, we collect the README file of each TPL in the dataset to augment the prompt engineering with contextual information. To avoid token limitation issues during the prompting phase, we summarize the README files using an existing approach [33] that relies on the T5 pre-trained model. In particular, we filter out irrelevant information such as code snippets, URLs, and emojis. We also normalize the text by removing new lines, multiple spaces, and special characters.

此外,我们为数据集中的每个第三方库(TPL)收集了README文件,以通过上下文信息增强提示工程。为了避免在提示阶段出现Token限制问题,我们使用了一种现有的方法[33]对README文件进行总结,该方法依赖于T5预训练模型。具体来说,我们过滤掉了不相关的信息,如代码片段、URL和表情符号。我们还通过删除换行符、多个空格和特殊字符来规范化文本。
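As an illustration of this preprocessing step, the sketch below cleans a README and summarizes it with a T5 model via the Hugging Face `transformers` pipeline; the checkpoint name ("t5-small") and the specific cleaning rules are assumptions, not the exact configuration of [33].

```python
import re
from transformers import pipeline

# Hypothetical checkpoint choice; [33] relies on a T5 pre-trained model.
summarizer = pipeline("summarization", model="t5-small")

def clean_readme(text: str) -> str:
    """Drop code snippets, URLs, and emojis, then normalize whitespace."""
    text = re.sub(r"`{3}.*?`{3}", " ", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"`[^`]+`", " ", text)                       # inline code
    text = re.sub(r"https?://\S+", " ", text)                  # URLs
    text = re.sub(r"[^\w\s.,;:!?()\-]", " ", text)             # emojis / special chars
    return re.sub(r"\s+", " ", text).strip()                   # new lines, extra spaces

def summarize_readme(text: str, max_tokens: int = 120) -> str:
    cleaned = clean_readme(text)[:4000]  # rough character cap to stay within the input limit
    return summarizer(cleaned, max_length=max_tokens, min_length=30)[0]["summary_text"]
```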

B. Prompt engineering

B. 提示工程

After the data curation phase, we conceive a particular prompt that instructs the model to disregard popular libraries and suggest alternative ones. Prompts are dynamically generated based on the current dataset, with parameters adjusted to emphasize features that are unique to less popular libraries, such as dependencies or rare application contexts, as specified in the README files. The rationale is to ensure that the model examines a broader array of library options.

在数据整理阶段之后,我们设计了一个特定的提示,指示模型忽略流行的库并建议替代方案。提示是基于当前数据集动态生成的,参数调整以强调不太流行的库的独特特征,例如依赖关系或罕见的应用场景,如 README 文件中所述。其目的是确保模型能够检查更广泛的库选项。

In the scope of this paper, we experiment with the two main prompt engineering strategies, i.e., zero-shot and few-shot. In addition, we embody the past conversation history in the few-shot learning to enhance the model’s ability to generate relevant recommendations.

在本文的研究范围内,我们实验了两种主要的提示工程策略,即零样本 (zero-shot) 和少样本 (few-shot)。此外,我们在少样本学习中融入了过去的对话历史,以增强模型生成相关推荐的能力。

For each prompt technique, we devise a specific template composed of two main elements, i.e., the role and instructions. The former specifies the context and the main objective of the AI assistant. The latter provides a list of fine-grained instructions defined as follows:

对于每种提示技术,我们设计了一个由两个主要元素组成的特定模板,即角色和指令。前者指定了AI助手的上下文和主要目标。后者提供了一系列细粒度的指令,定义如下:

The above-mentioned instructions aim to guide the model in recommending lesser-known libraries that are relevant to the project. In addition, we employ the negative prompting technique [34] to force the LLM not to generate specific content, e.g., code snippets.

上述指令旨在引导模型推荐与项目相关的较冷门库。此外,我们采用负向提示技术 [34] 来强制大语言模型不生成特定内容,例如代码片段。
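Since the concrete instruction list and Listings 1–2 are not reproduced here, the following is a hypothetical sketch of how such a prompt could be assembled from the role, the fine-grained instructions, and a negative-prompting clause; the wording is illustrative, not the paper's exact template.

```python
def build_zero_shot_prompt(project_dependencies, popular_libraries, readme_summary):
    """Assemble an illustrative zero-shot prompt (role + instructions + negative prompt)."""
    role = (
        "You are an assistant that recommends third-party Java libraries (TPLs) "
        "for the project under development."
    )
    instructions = [
        f"The project already depends on: {', '.join(project_dependencies)}.",
        f"Project context (summarized README): {readme_summary}",
        f"Do NOT recommend these highly popular libraries: {', '.join(popular_libraries)}.",
        "Prefer lesser-known libraries that are relevant to the project context.",
        "Return exactly 10 libraries in Maven format (groupId:artifactId), one per line.",
        # Negative prompting: forbid unwanted content in the answer.
        "Do not generate code snippets or explanations.",
    ]
    return role + "\n" + "\n".join(f"- {item}" for item in instructions)
```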

The explanatory templates for the zero-shot and few-shots are reported in Listing 1 and Listing 2, respectively.

零样本和少样本的解释模板分别列在清单 1 和清单 2 中。

Listing 2. Few-shots prompt template

列表 2. 少样本提示模板

Meanwhile, Listing 3 shows the template for the few-shot learning with history. In this case, the model is provided with the past conversation history to enhance the recommendation generation process without the list of specific instructions, as we want to evaluate the model’s ability to recall past interactions with the user.

与此同时,代码清单 3 展示了带有历史记录的少样本学习模板。在这种情况下,模型会提供过去的对话历史记录,以增强推荐生成过程,而不提供具体指令列表,因为我们希望评估模型回忆与用户过去互动的能力。
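A minimal sketch of how the past conversation history could be embedded in a chat-style request (the message structure follows the common chat format used by Llama-style instruction models; the helper and variable names are assumptions):

```python
def build_history_prompt(system_role, history, new_query):
    """Prepend past user/assistant turns so the model can recall earlier interactions."""
    messages = [{"role": "system", "content": system_role}]
    for user_turn, assistant_turn in history:  # past (query, recommendation) pairs
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": new_query})
    return messages
```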

Listing 4 shows an example of produced output given in the Maven format.

清单 4 展示了以 Maven 格式生成的输出示例。

Listing 3. Few-shots with history.

列表 3. 带历史的少样本。

以下是 Maven 格式的列表:

| # | 库 (groupId:artifactId) |
| --- | --- |
| 1 | org.apache.commons:commons-text |
| 2 | io.jsonwebtoken:jsonwebtoken |
| 3 | com.fasterxml.jackson.module:jackson-module-scalars |
| 4 | org.apache.commons:commons-validator |
| 5 | org.bitbucket.direvtor:javalin |
| 6 | org.jsonwebtoken:jwt-simple |
| 7 | com.github.fge:json-schema-validator |
| 8 | io.requery:jrequery |
| 9 | org.apache.httpcomponents:httpmime |
| 10 | com.nimbusds:oauth2 |

Listing 4. An explanatory output of the recommendation process.

清单 4. 推荐过程的解释性输出。

C. Selected models and Fine-tuning

C. 选定的模型与微调

Concerning the selected open-source LLMs, we opt for the Llama architecture as its effectiveness in generating content for different tasks has been demonstrated in recent research [35], [36]. In particular, we experiment with the following models:

关于所选的开源大语言模型,我们选择了 Llama 架构,因为最近的研究 [35]、[36] 已经证明了它在为不同任务生成内容方面的有效性。具体来说,我们实验了以下模型:

- Llama-2-7b-chat: This is part of Meta’s Llama 2 family, a collection of pre-trained and fine-tuned generative language models ranging from 7 billion to 70 billion parameters. The 7B chat model is specifically optimized for dialogue-based tasks, fine-tuned using supervised learning and reinforcement learning with human feedback (RLHF). It is designed for assistant-like chat applications and outperforms many open-source models on various benchmarks.

- Llama-2-7b-chat:这是 Meta 的 Llama 2 系列的一部分,该系列包含从 70 亿到 700 亿参数的预训练和微调生成式语言模型。7B chat 模型专门针对基于对话的任务进行了优化,通过监督学习和人类反馈的强化学习 (RLHF) 进行微调。它专为类似助手的聊天应用设计,并在多个基准测试中优于许多开源模型。

- Llama-2-13b-chat: Similar to the 7B variant, this 13-billion-parameter model is fine-tuned for dialogue tasks. It benefits from a larger parameter size, which improves its ability to handle complex natural language tasks with better context understanding and response generation.

- Llama-2-13b-chat:与 7B 变体类似,这个拥有 130 亿参数的模型针对对话任务进行了微调。得益于更大的参数量,它在处理复杂自然语言任务时表现更好,能够更好地理解上下文并生成响应。

- Llama-3-8b-instruct: It is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

- Llama-3-8b-instruct:它是一种自回归语言模型,使用了优化的 Transformer 架构。调优版本通过监督微调 (SFT) 和基于人类反馈的强化学习 (RLHF) 来与人类对帮助性和安全性的偏好保持一致。


Fig. 3. Most popular TPLs in the dataset.


图 3: 数据集中最受欢迎的 TPLs。


It is worth mentioning that we opt for small, optimized open-source models to allow easy deployment on a local machine. To support the fine-tuning process, we first tokenize the pre-processed datasets as mentioned in Section III-A. After the tokenization phase, we adopt the LoRA approach [37], a parameter-efficient fine-tuning technique that adjusts specific layers while freezing most of the model’s original parameters. In particular, we use a rank of 16, a LoRA alpha of 32, and a dropout rate of 0.05. The BitsAndBytes library is used to apply 4-bit quantization to optimize memory usage, and inputs are tokenized with the AutoTokenizer Python library, with each input truncated or padded to a maximum sequence length of 512 tokens. Table II reports the fine-tuning settings in terms of parameters.

值得一提的是,我们选择小型、优化的开源模型,以便在本地机器上轻松部署。为了支持微调过程,我们首先按照第 III-A 节所述对预处理后的数据集进行 Token 化。在 Token 化阶段之后,我们采用了 LoRA 方法 [37],这是一种参数高效的微调技术,它在冻结模型大部分原始参数的同时调整特定层。具体来说,我们使用秩为 16、LoRA alpha 为 32、丢弃率为 0.05 的设置。我们使用 BitsAndBytes 库应用 4 位量化以优化内存使用,并使用 AutoTokenizer Python 库进行 Token 化,每个输入被截断或填充到最大 512 个 Token 的序列长度。表 II 报告了微调设置的参数。

表 II: 微调设置参数
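Since Table II itself is not reproduced here, the following is a condensed sketch of the described setup (LoRA rank 16, alpha 32, dropout 0.05, 4-bit quantization, inputs truncated or padded to 512 tokens), assuming the Hugging Face transformers/peft/bitsandbytes stack; the base checkpoint name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint

# 4-bit quantization via bitsandbytes to reduce memory usage.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)

# LoRA: train small adapter matrices while freezing most of the original weights.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

def tokenize(example):
    # Each input is truncated or padded to a maximum sequence length of 512 tokens.
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)
```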

D. Popularity penalty mechanism

D. 流行度惩罚机制

As discussed in Section II-A, existing approaches adopt reweighting strategies to penalize popular libraries compared to specific ones [9], [38]. Similarly, we introduce a popularity penalty mechanism to reduce bias in TPL recommender systems by using data collected from Maven. We define the penalty mechanism as follows:

正如第 II-A 节所讨论的,现有方法采用重新加权策略来惩罚流行库,而不是特定库 [9], [38]。类似地,我们引入了一种流行度惩罚机制,通过使用从 Maven 收集的数据来减少 TPL 推荐系统中的偏差。我们将惩罚机制定义如下:

$$\text{PenaltyScore}=\frac{1}{\text{PopularityRank}+1}$$

where the popularity rank is the usage of each library collected from Maven. Roughly speaking, we penalize the popular libraries by assigning a lower score to them compared to the less popular ones. The penalty score is then used to adjust the recommendation generation process by reducing the likelihood of recommending popular libraries.

其中,流行度排名是从 Maven 收集的每个库的使用情况。粗略地说,与不太流行的库相比,我们通过为流行库分配较低的分数来惩罚它们。然后,惩罚分数用于调整推荐生成过程,降低推荐流行库的可能性。
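A minimal sketch of how such a penalty could be applied to re-rank the model's candidate list, assuming the Maven usage data are available as a dictionary (names and the re-ranking step are illustrative, not the paper's exact procedure):

```python
def penalty_score(popularity: float) -> float:
    """PenaltyScore = 1 / (PopularityRank + 1), where PopularityRank is the library's
    usage collected from Maven; heavily used libraries receive lower scores."""
    return 1.0 / (popularity + 1)

def rerank(candidates, relevance, maven_usage):
    """Down-weight popular candidates by combining relevance with the penalty.

    candidates:  list of 'groupId:artifactId' strings returned by the LLM
    relevance:   dict mapping candidate -> relevance score from the model
    maven_usage: dict mapping candidate -> usage count collected from Maven
    """
    scored = [(lib, relevance[lib] * penalty_score(maven_usage.get(lib, 0.0)))
              for lib in candidates]
    return [lib for lib, _ in sorted(scored, key=lambda item: item[1], reverse=True)]
```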

IV. EVALUATION MATERIALS

IV. 评估材料

This section discusses the methodology that we used to answer the research question defined in Section I. In particular, we aim to evaluate the effectiveness of each defined module in mitigating popularity bias in TPL recommendations.

本节讨论了我们用于回答第 I 部分中定义的研究问题的方法。特别是,我们旨在评估每个定义模块在缓解 TPL 推荐中的流行度偏差方面的有效性。

A. Metrics

A. 指标

In the following, we define metrics to assess the system’s performance across relevance, diversity, and the ability to promote less popular libraries.

在以下内容中,我们定义了评估系统在相关性、多样性以及推广较不流行库的能力方面的性能指标。

Precision and Recall measure the accuracy of recommendations. Precision ($P@N$) is calculated as:

精确率和召回率衡量推荐的准确性。精确率 ($P@N$) 的计算公式为:

$$P@N=\frac{\text{Number of relevant recommended items}}{\text{Number of recommended items}}$$

whereas Recall $(R@N)$ is defined as:

召回率 (Recall) $(R@N)$ 定义为:

$$R@N=\frac{\text{Number of relevant recommended items}}{\text{Number of items in ground-truth}}$$


where the relevant recommended items are the items in the top-N list that match those in the ground-truth data. These metrics evaluate how well the system identifies relevant libraries, even when they are not among the most popular.

其中相关推荐项是前N列表中与真实数据匹配的项集。这些指标评估系统在识别相关库时的表现,即使这些库不是最受欢迎的。

F1-Score balances Precision and Recall, offering a comprehensive view of the system’s recommendation accuracy.

F1分数平衡了精确率 (Precision) 和召回率 (Recall),提供了系统推荐准确性的全面视图。

$$F1=2\cdot\frac{P@N\cdot R@N}{P@N+R@N}$$



F1-Score is useful in scenarios where maintaining a balance between relevance and reducing popularity bias is crucial.

F1分数在保持相关性和减少流行度偏差之间平衡至关重要的场景中非常有用。
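For clarity, a direct translation of these definitions into code (a sketch assuming plain Python collections of library identifiers):

```python
def precision_recall_f1(recommended, ground_truth):
    """Compute P@N, R@N, and F1 for one project from its top-N recommendation list."""
    relevant = set(recommended) & set(ground_truth)
    p = len(relevant) / len(recommended) if recommended else 0.0
    r = len(relevant) / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1
```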

Novelty and Diversity assess how often the system recommends less popular libraries. Novelty measures the ability to introduce new, less-used libraries, addressing the core issue of popularity bias. Diversity evaluates the range of recommendations across different projects, ensuring variety in suggested libraries.

新颖性和多样性评估系统推荐不太流行的库的频率。新颖性衡量引入新的、较少使用的库的能力,解决流行性偏差的核心问题。多样性评估不同项目中的推荐范围,确保推荐的库具有多样性。

Catalog Coverage determines the extent to which the system recommends from the available library catalog. Catalog coverage (Coverage@N) is calculated as:

目录覆盖率 (Catalog Coverage) 决定了系统从可用库目录中推荐的程度。目录覆盖率 (Coverage@N) 的计算公式为:

$$Coverage@N=\frac{\left|\bigcup_{p\in P}REC_{N}(p)\right|}{|L|}$$

where $REC_{N}(p)$ represents the set of recommended items for project $p$, and $|L|$ is the total number of unique libraries available. Higher coverage implies a broader range of recommended libraries.

其中 $REC_{N}(p)$ 表示项目 $p$ 的推荐项集合,$|L|$ 是可用唯一库的总数。覆盖率越高,意味着推荐的库范围越广。


Expected Popularity Complement (EPC) measures the system’s ability to recommend libraries that are less popular but still relevant. EPC ($EPC@N$) is formally defined as:

预期流行度补充 (Expected Popularity Complement, EPC) 衡量系统推荐不太流行但仍相关的库的能力。EPC ($EPC@N$) 的正式定义为:

$$EPC@N=\frac{\sum_{p\in P}\sum_{r=1}^{N}\frac{rel(p,r)\cdot\frac{1}{1+\log_{2}(REC_{r}(p))}}{\log_{2}(r+1)}}{\sum_{p\in P}\sum_{r=1}^{N}\frac{rel(p,r)}{\log_{2}(r+1)}}$$



where $rel(p,r)$ is 1 if the library at position $r$ of the top-N list for project $p$ belongs to the ground-truth data, and 0 otherwise. $REC_{r}(p)$ reflects the popularity of the library at position $r$, ensuring that less popular but relevant libraries are prioritized in recommendations.

其中,如果项目 $p$ 的 top-N 列表中位置 $r$ 处的库属于真实数据,则 $rel(p,r)$ 为 1,否则为 0。$REC_{r}(p)$ 反映了位置 $r$ 处库的流行度,确保在推荐中优先考虑不太流行但相关的库。
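The two bias-oriented metrics can be computed as follows (a sketch that assumes per-project top-N lists, ground-truth sets, and a popularity value for each recommended library; all names are illustrative):

```python
import math

def coverage_at_n(recommendations, catalog_size):
    """Coverage@N: fraction of the library catalog appearing in at least one top-N list."""
    recommended = set()
    for top_n in recommendations.values():  # project -> ordered top-N list
        recommended.update(top_n)
    return len(recommended) / catalog_size

def epc_at_n(recommendations, ground_truth, popularity):
    """EPC@N: rank-discounted reward for relevant items, weighted by 1 / (1 + log2(popularity))."""
    numerator, denominator = 0.0, 0.0
    for project, top_n in recommendations.items():
        for r, lib in enumerate(top_n, start=1):
            rel = 1.0 if lib in ground_truth[project] else 0.0
            discount = math.log2(r + 1)
            pop = max(popularity.get(lib, 1.0), 1.0)  # guard against log2 of values below 1
            numerator += rel * (1.0 / (1.0 + math.log2(pop))) / discount
            denominator += rel / discount
    return numerator / denominator if denominator else 0.0
```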

TABLE I EXPERIMENTAL CONFIGURATIONS USED IN THE ABLATION STUDY.


表 1: 消融研究中使用的实验配置

| 配置 | 模型 | 提示技术 | 微调 | 惩罚机制 |
|------|------|----------|------|----------|
| C1   | Llama-2-7b-chat | 零样本 |      | x        |
| C2   | Llama-2-13b-chat | 少样本 |      |          |
| C3   | Llama-2-13b-chat | 少样本+历史 | x    | x        |
| C4   | Llama-3-8b-instruct | 少样本 |      | x        |
| C5   | Llama-3-8b-instruct | 少样本 | x    |          |
| C6   | Llama-3-8b-instruct | 少样本 |      |          |

B. Ablation study

B. 消融研究

V. PRELIMINARY RESULTS

V. 初步结果

This section discusses the ablation study that we conducted with the aim of investigating how the combination of prompt engineering, fine-tuning, and penalty mechanisms influences the balance between recommendation accuracy and catalog coverage. In particular, we define the various configurations of Llama models tested to evaluate their effectiveness in reducing popularity bias within software library recommender systems (see Table I). These configurations were designed to test differe