Addressing Popularity Bias in Third-Party Library Recommendations Using LLMs
Abstract—Recommender systems for software engineering (RSSEs) play a crucial role in automating development tasks by providing relevant suggestions according to the developer’s context. However, they suffer from the so-called popularity bias, i.e., the phenomenon of recommending popular items that might be irrelevant to the current task. In particular, the long-tail effect can hamper the system’s performance in terms of accuracy, thus leading to false positives in the provided recommendations. Foundation models are the most advanced generative AI-based models and achieve relevant results in several SE tasks.
This paper aims to investigate the capability of large language models (LLMs) to address the popularity bias in recommender systems of third-party libraries (TPLs). We conduct an ablation study experimenting with state-of-the-art techniques to mitigate the popularity bias, including fine-tuning and popularity penalty mechanisms. Our findings reveal that the considered LLMs cannot address the popularity bias in TPL recommenders, even though fine-tuning and a post-processing penalty mechanism contribute to increasing the overall diversity of the provided recommendations. In addition, we discuss the limitations of LLMs in this context and suggest potential improvements to address the popularity bias in TPL recommenders, thus paving the way for additional experiments in this direction.
Index Terms—recommender systems, popularity bias, large language models
I. INTRODUCTION
While the general concept of fairness, i.e., the absence of any bias or prejudice in a given decision-making process, has been widely explored in sensitive domains such as health, crime, or education [1]–[3], software fairness is an emerging term that is attracting more and more attention after the rise of AI-intensive models [3]–[5] and the related legal concerns underscored by the European Union [6].
Among the large variety of intelligent systems, recommender systems for software engineering (RSSEs) [7], [8] are at the forefront in assisting developers in several tasks, spanning from code completion to automated program repair. Among different types of RSSEs, third-party library (TPL) RSSEs provide off-the-shelf software components relevant to the project under development [9]–[13].
While previous work demonstrates their accuracy in providing ready-to-use solutions, these systems tend to present frequently seen items [14]–[16], thus undermining the novelty of the results. Referred to as popularity bias in recent literature [17]–[19], this phenomenon is particularly harmful in TPL recommendation, as it can lead to the recommendation of libraries that are not relevant to the current task, thus causing the so-called long-tail effect [20], [21] that may hamper the system’s performance in terms of accuracy [22]. Although large language models (LLMs) [23] can be seen as the most advanced intelligent assistants for developers, as showcased in different development tasks [24], [25], recent research reveals that those models suffer from the long-tail effect in different coding tasks [26]. In particular, we aim to answer the following research question:
RQ: How effectively can open-source LLMs address popularity bias in TPL recommendations?
To this end, we conduct an ablation study [27] by defining six different experimental configurations involving three versions of the Llama model [28] and adopting different strategies to mitigate the popularity bias, i.e., few-shot prompt engineering, fine-tuning, and a popularity penalty mechanism. Our initial findings confirm that the long-tail effect emerges in the recommendations provided by the baseline model, even though fine-tuning and penalty mechanisms can mitigate the popularity bias. Therefore, we see our work as a stepping stone to further investigate this issue and provide more effective solutions to mitigate the popularity bias in RSSEs, e.g., employing retrieval augmented generation [29] or involving human-in-the-loop solutions [30].
The contributions of our work can be summarized as follows:
II. MOTIVATION AND BACKGROUND
A. Motivating example
In AI-based systems, bias can originate from several factors. Mehrabi et al. proposed a fairness taxonomy [31] in which three main types of bias have been identified, i.e., Data to Algorithm, Algorithm to User, and User to Data. The first type refers to biases in the training data, e.g., unbalanced variables, missing data, or noisy data, while the second type is related to the algorithm itself, e.g., the choice of the model, the hyper-parameters used, or the optimization processes. Finally, User to Data bias represents any inherent biases in users that might be reflected in the data they generate. Following the taxonomy, popularity bias falls under the Algorithm to User category, even though we acknowledge that it can originate from bias in the training data.

Fig. 1. Popularity bias in traditional TPL RSSEs
Figure 1 illustrates an example process that employs a recommender system for TPLs [11], [13]. In the example, the user is developing a Web application that relies on local dependencies, i.e., spring-core and jackson-core. Those dependencies represent the context of the TPL recommender, and they may be included in the query ① submitted to the system, i.e., the Recommender engine. During the recommendation phase, the system exploits ② a knowledge base composed of various software artifacts, including OSS projects mined from GitHub. However, those projects use popular TPLs that might not be useful for the current task, thus lowering the TPL recommender’s overall accuracy. In the example, the system may provide ③ popular libraries, i.e., log4j and junit, instead of more relevant ones, i.e., commons-lang3 and micrometer-core.
Even though existing approaches try to mitigate this issue [9], a recent study reveals that traditional TPL recommenders still do not address the popularity bias adequately [19]. The Novelty metric assesses whether a system can retrieve libraries in the long tail and expose them to projects [11]. This increases the possibility of coming across serendipitous libraries [32], i.e., those that are seen by chance but turn out to be useful for the project under development [7]. For example, there could be a recent library, yet to be widely used, that can better interface with new hardware or achieve faster performance than popular ones.

Fig. 2. Overview of the proposed approach
On the one hand, project-specific requirements must be considered by recommender systems to provide more accurate recommendations. On the other hand, libraries that are well-documented and supported by an active community are more likely to be adopted by developers. This leads to a situation where the most popular libraries are recommended while the less popular ones are overlooked. In summary, recommending only popular TPLs would harm the novelty of the results, and a trade-off between popularity and relevance must be found to provide a more balanced set of recommendations.
III. APPROACH
This section outlines the approach employed to investigate the use of open-source LLMs for recommending TPLs, as illustrated in Figure 2. The process begins with filtering the original dataset of Java libraries to extract the most popular libraries along with relevant contextual information. Subsequently, the employed foundation model is improved using two complementary strategies: prompt engineering and fine-tuning. To address the long-tail effect in the recommendations, a popularity penalty mechanism is introduced. The following sections provide a detailed explanation of each step.
A. Data encoding
To support our analysis, we utilize an existing dataset [11] sourced from GitHub projects. We then identify the most popular libraries based on their usage, specifically TPLs that frequently appear as dependencies in other projects. For this purpose, a filtering component is employed to isolate and extract only the popular libraries along with their associated information, such as functionalities, dependencies, README files, and usage scenarios.
For each recommendation session, the system first identifies libraries that are highly popular, specifically targeting the top 20 libraries based on their usage frequency as recorded in the dataset. In addition, each library is annotated with a usage score based on the user interaction data. This allows the system to dynamically adjust recommendations to avoid these highly utilized libraries.
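The selection of the most popular libraries by usage frequency can be sketched as follows. This is a minimal illustration rather than the filtering component used in the study: the function name `most_popular_libraries`, the input layout (a mapping from each project to its dependency list), and the toy data are all assumptions.

```python
from collections import Counter


def most_popular_libraries(project_dependencies, k=20):
    """Return the k libraries used as a dependency by the most projects.

    project_dependencies: hypothetical mapping {project_id: [library, ...]}.
    """
    usage = Counter()
    for deps in project_dependencies.values():
        # Count each library at most once per project.
        usage.update(set(deps))
    return usage.most_common(k)


# Toy data, for illustration only.
projects = {
    "proj-a": ["junit:junit", "log4j:log4j", "org.apache.commons:commons-lang3"],
    "proj-b": ["junit:junit", "log4j:log4j"],
    "proj-c": ["junit:junit", "io.micrometer:micrometer-core"],
}
top = most_popular_libraries(projects, k=2)
```

The resulting usage counts could also serve as the per-library usage score mentioned above, although the paper does not state how that score is derived.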
In addition, we collect the README file for each TPL in the dataset to augment the prompt engineering with contextual information. To avoid token limitation issues during the prompting phase, we summarize the README files using an existing approach [33] that relies on the T5 pre-trained model. In particular, we filter out irrelevant information such as code snippets, URLs, and emojis. We also normalize the text by removing new lines, multiple spaces, and special characters.
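The cleaning steps listed above (dropping code snippets, URLs, and emojis, then normalizing whitespace and special characters) could look roughly like the sketch below. The regular expressions and the set of characters that are kept are assumptions, since the paper does not give the exact rules, and the T5-based summarization step is not covered here.

```python
import re

FENCE = "`" * 3  # Markdown code-fence delimiter
FENCED_CODE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
INLINE_CODE = re.compile(r"`[^`]*`")
URL = re.compile(r"https?://\S+")
# Assumption: keep letters, digits, basic punctuation, and spaces only.
NON_TEXT = re.compile(r"[^A-Za-z0-9.,;:!?()'\- ]")
WHITESPACE = re.compile(r"\s+")


def clean_readme(text):
    """Strip code, URLs, emojis, and special characters from a README."""
    text = FENCED_CODE.sub(" ", text)
    text = INLINE_CODE.sub(" ", text)
    text = URL.sub(" ", text)
    text = WHITESPACE.sub(" ", text)  # new lines and tabs become spaces
    text = NON_TEXT.sub(" ", text)    # emojis and markup symbols
    return WHITESPACE.sub(" ", text).strip()


raw = (
    "# Demo 🚀\n\nFast JSON parsing. See https://example.com/docs\n\n"
    + FENCE + "java\nint x = 1;\n" + FENCE
    + "\nUse   it   well!"
)
cleaned = clean_readme(raw)
```

The order matters in this sketch: code and URLs are removed before the character filter so that their contents do not leak into the cleaned text.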
B. Prompt engineering
After the data curation phase, we conceive a particular prompt that instructs the model to disregard popular libraries and suggest alternative ones. Prompts are dynamically generated based on the current dataset, with parameters adjusted to emphasize features that are unique to less popular libraries, such as dependencies or rare application contexts, as specified in the README files. The rationale is to ensure that the model examines a broader array of library options.
In the scope of the paper, we experiment with the two main prompt engineering strategies, i.e., zero-shot and few-shot. In addition, we embed the past conversation history in the few-shot learning to enhance the model’s ability to generate relevant recommendations.
For each prompt technique, we devise a specific template composed of two main elements, i.e., the role and instructions. The former specifies the context and the main objective of the AI assistant. The latter provides a list of fine-grained instructions defined as follows:
The above-mentioned instructions aim to guide the model in recommending lesser-known libraries that are relevant to the project. In addition, we employ the negative prompting technique [34] to prevent the LLM from generating specific content, e.g., code snippets.
Example templates for the zero-shot and few-shot prompts are reported in Listing 1 and Listing 2, respectively.
Listing 2. Few-shot prompt template
Meanwhile, Listing 3 shows the template for the few-shot learning with history. In this case, the model is provided with the past conversation history to enhance the recommendation generation process without the list of specific instructions, as we want to evaluate the model’s ability to recall past interactions with the user.


Listing 4 shows an example of the output produced, given in Maven format.
Listing 3. Few-shot prompt with history.
Here is the list in Maven format:
1. org.apache.commons:commons-text
2. io.jsonwebtoken:jsonwebtoken
3. com.fasterxml.jackson.module:jackson-module-scalars
4. org.apache.commons:commons-validator
5. org.bitbucket.direvtor:javalin
6. org.jsonwebtoken:jwt-simple
7. com.github.fge:json-schema-validator
8. io.requery:jrequery
9. org.apache.httpcomponents:httpmime
10. com.nimbusds:oauth2
Listing 4. An example output of the recommendation process.
C. Selected models and Fine-tuning
Concerning the selected open-source LLMs, we opt for the Llama architecture as its effectiveness in generating content for different tasks has been demonstrated in recent research [35], [36]. In particular, we experiment with the following models:
➢ Llama-2-7b-chat: This is part of Meta’s Llama 2 family, a collection of pre-trained and fine-tuned generative language models ranging from 7 billion to 70 billion parameters. The 7B chat model is specifically optimized for dialogue-based tasks, fine-tuned using supervised learning and reinforcement learning with human feedback (RLHF). It is designed for assistant-like chat applications and outperforms many open-source models on various benchmarks.
➢ Llama-2-13b-chat: Similar to the 7B variant, this 13-billion-parameter model is fine-tuned for dialogue tasks. It benefits from a larger parameter size, which improves its ability to handle complex natural language tasks with better context understanding and response generation.
➢ Llama-3-8b-instruct: It is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Fig. 3. Most popular TPLs in the dataset.

It is worth mentioning that we opt for small, optimized open-source models to allow easy deployment on a local machine. To support the fine-tuning process, we first tokenized the pre-processed datasets mentioned in Section III-A. After the tokenization phase, we adopt the LoRA approach [37], which is a parameter-efficient fine-tuning technique that adjusts specific layers while freezing most of the model’s original parameters. In particular, we used a rank of 16, a LoRA alpha of 32, and a dropout rate of 0.05. The bitsandbytes library was used to apply 4-bit quantization to reduce memory usage, and inputs were tokenized with the AutoTokenizer Python library, with each input truncated or padded to a maximum sequence length of 512 tokens. Table II reports the fine-tuning settings in terms of parameters.
D. Popularity penalty mechanism
As discussed in Section II-A, existing approaches adopt re-weighting strategies to penalize popular libraries compared to specific ones [9], [38]. Similarly, we introduce a popularity penalty mechanism to reduce bias in TPL recommender systems by using data collected from Maven. We define the penalty mechanism as follows:
$$\mathrm{PenaltyScore} = \frac{1}{\mathrm{PopularityRank} + 1}$$
where the popularity rank is the usage of each library collected from Maven. Roughly speaking, we penalize popular libraries by assigning them a lower score than less popular ones. The penalty score is then used to adjust the recommendation generation process by reducing the likelihood of recommending popular libraries.
# IV. EVALUATION MATERIALS
This section discusses the methodology that we used to answer the research question defined in Section I. In particular, we aim to evaluate the effectiveness of each defined module in mitigating popularity bias in TPL recommendations.
# A. Metrics
In the following, we define metrics to assess the system’s performance in terms of relevance, diversity, and the ability to promote less popular libraries.
Precision and Recall measure the accuracy of recommendations. Precision ($P@N$) is calculated as:
$$P@N = \frac{\text{Number of relevant recommended items}}{\text{Number of recommended items}}$$
whereas Recall ($R@N$) is defined as:
$$R@N = \frac{\text{Number of relevant recommended items}}{\text{Number of items in ground-truth}}$$
where the relevant recommended items are the items in the top-N list that match those in the ground-truth data. These metrics evaluate how well the system identifies relevant libraries, even when they are not among the most popular.
F1-Score balances Precision and Recall, offering a comprehensive view of the system’s recommendation accuracy.
$$F1 = 2 \cdot \frac{P@N \cdot R@N}{P@N + R@N}$$
F1-Score is useful in scenarios where maintaining a balance between relevance and reducing popularity bias is crucial.
Novelty and Diversity assess how often the system recommends less popular libraries. Novelty measures the ability to introduce new, less-used libraries, addressing the core issue of popularity bias. Diversity evaluates the range of recommendations across different projects, ensuring variety in suggested libraries.
Catalog Coverage measures the extent to which the system draws its recommendations from the available library catalog. Catalog coverage (Coverage@N) is calculated as:
$$\mathrm{Coverage@N} = \frac{\left|\bigcup_{p \in P} REC_{N}(p)\right|}{|L|}$$
where $P$ is the set of projects, $REC_{N}(p)$ represents the set of recommended items for project $p$, and $|L|$ is the total number of unique libraries available. Higher coverage implies a broader range of recommended libraries.
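The accuracy and coverage metrics defined so far can be sketched in a few lines. This is an illustrative implementation under assumed data layouts (ranked lists of library names per project, ground-truth sets, and a catalog list); the function names and the toy data are not taken from the paper.

```python
def precision_recall_f1(recommended, ground_truth, n):
    """Compute P@N, R@N, and F1 for one project.

    recommended: ranked list of library names.
    ground_truth: set of libraries actually used by the project.
    """
    top_n = recommended[:n]
    hits = len(set(top_n) & set(ground_truth))
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def catalog_coverage(recommendations, catalog, n):
    """Fraction of the catalog that appears in at least one top-N list."""
    covered = set()
    for recs in recommendations.values():
        covered.update(recs[:n])
    return len(covered & set(catalog)) / len(catalog)


# Toy data, for illustration only.
recs = {"p1": ["a", "b", "c"], "p2": ["a", "d", "e"]}
truth = {"p1": {"a", "c", "x", "y"}, "p2": {"d"}}
catalog = list("abcdefghij")

p, r, f1 = precision_recall_f1(recs["p1"], truth["p1"], n=3)
coverage = catalog_coverage(recs, catalog, n=3)
```

In this toy case two of the three recommendations for `p1` are relevant, and the two lists together touch five of the ten catalog entries.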
Expected Popularity Complement (EPC) measures the system’s ability to recommend libraries that are less popular but still relevant. EPC ($EPC@N$) is formally defined as:
$$EPC@N = \frac{\sum_{p \in P} \sum_{r=1}^{N} \frac{rel(p,r) \cdot \frac{1}{1 + \log_{2}(REC_{r}(p))}}{\log_{2}(r+1)}}{\sum_{p \in P} \sum_{r=1}^{N} \frac{rel(p,r)}{\log_{2}(r+1)}}$$
where $rel(p,r)$ is 1 if the library at position $r$ of the top-N list for project $p$ belongs to the ground-truth data, and 0 otherwise. $REC_{r}(p)$ reflects the popularity of the library at position $r$, so that less popular but relevant libraries contribute more to the score.
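A direct transcription of the EPC formula above is given below as a sketch. It assumes that the popularity of a library is a usage count of at least 1, so that the logarithm is defined and non-negative; the function name `epc_at_n` and the toy data are illustrative.

```python
import math


def epc_at_n(recommendations, ground_truth, popularity, n):
    """EPC@N as written above; popularity values are assumed to be >= 1."""
    numerator = 0.0
    denominator = 0.0
    for project, recs in recommendations.items():
        relevant = ground_truth.get(project, set())
        for r, lib in enumerate(recs[:n], start=1):
            if lib not in relevant:
                continue  # rel(p, r) = 0 contributes nothing
            discount = 1.0 / math.log2(r + 1)
            novelty = 1.0 / (1.0 + math.log2(popularity[lib]))
            numerator += novelty * discount
            denominator += discount
    return numerator / denominator if denominator else 0.0


# Toy data, for illustration only.
recs = {"p1": ["junit", "tail-lib"]}
truth = {"p1": {"junit", "tail-lib"}}

mixed = epc_at_n(recs, truth, {"junit": 1024, "tail-lib": 1}, n=2)
all_rare = epc_at_n(recs, truth, {"junit": 1, "tail-lib": 1}, n=2)
```

With these inputs the score reaches 1.0 only when every relevant hit has the minimum popularity, and it drops as soon as a heavily used library occupies a top position.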
TABLE I EXPERIMENTAL CONFIGURATIONS USED IN THE ABLATION STUDY.
| Configuration | Model | Prompt technique | Fine-tuning | Penalty mechanism |
|------|------|----------|------|----------|
| C1 | Llama-2-7b-chat | Zero-shot | | x |
| C2 | Llama-2-13b-chat | Few-shot | | |
| C3 | Llama-2-13b-chat | Few-shot + history | x | x |
| C4 | Llama-3-8b-instruct | Few-shot | | x |
| C5 | Llama-3-8b-instruct | Few-shot | x | |
| C6 | Llama-3-8b-instruct | Few-shot | | |
# B. Ablation study
This section discusses the ablation study that we conducted with the aim of investigating how the combination of prompt engineering, fine-tuning, and penalty mechanisms influences the balance between recommendation accuracy and catalog coverage. In particular, we define various configurations of Llama models to evaluate their effectiveness in reducing popularity bias within software library recommender systems (see Table I). These configurations were designed to test different aspects of the models’ capabilities and refine our approach based on specific challenges identified in preliminary tests.
The first three configurations consider the prompts defined in Section III-B and two different versions of the Llama 2 model, i.e., Llama-2-7b-chat and Llama-2-13b-chat. The rationale behind this choice is to investigate how relatively basic Llama models can handle the popularity bias without any specific countermeasures. We consider $C_{1}$ as the baseline configuration since it uses the basic prompt technique and the Llama-2-7b-chat model without any enhancement.
Afterward, the two advanced modules, i.e., fine-tuning and the penalty mechanism, have been tested on Llama-3-8b-instruct in the last two configurations, i.e., $C_{5}$ and $C_{6}$. As stated in Section III-C, the Llama-3-8b-instruct model exploits RLHF as the underpinning training strategy. Thus, we are interested in understanding how this can affect the popularity bias in TPL recommendations.
For each configuration, we split the dataset into training and testing sets composed of 80% and 20% of the data, respectively. Table II summarizes the employed hyperparameters and their corresponding values.
TABLE II FINE-TUNING SETTINGS
| Hyperparameter | Value |
| --- | --- |
| Batch size | 4 (for both training and validation) |
| Training epochs | 3 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Gradient accumulation steps | 1 |
| Optimization algorithm | PagedAdamW (8-bit) optimizer |
# V. PRELIMINARY RESULTS
# A. RQ: How effectively can open-source LLMs address popularity bias in TPL recommendations?
Table III shows the results of precision, recall, and catalog coverage for each configuration. Overall, the results demonstrate that the models’ performance improved as we introduced more advanced modules compared to the baseline configuration, i.e., $C_{1}$. In particular, using Llama-2-7b-chat without any enhancement confirms its limitation in handling popularity bias, i.e., precision and recall scores are very low. In addition, this configuration achieves only 26% of catalog coverage, meaning that the recommended TPLs are not well diversified. This is due to the limited token size of the model, which prevents it from capturing the full context of the project and generating relevant recommendations. In this respect, we try to enlarge the token size by considering the previous interactions with the model, i.e., including the history in $C_{3}$. Even though a slight improvement in all the metrics has been achieved, the overall results remain low, e.g., the maximum value of catalog coverage is 36%.
In contrast, results for the last three configurations show that using the advanced version of the Llama model contributes to improving the overall recommendations, even though the combination of different techniques is not yet optimal. On the one hand, the best configuration in terms of accuracy metrics and catalog coverage is $C_{4}$, where the Llama-3-8b-instruct model is considered with only the few-shot technique. On the other hand, the penalty mechanism achieves the best results in terms of precision, although the catalog coverage decreases to 40%. Such a negative effect is reduced by introducing the fine-tuning dataset, i.e., the catalog coverage and EPC increase up to 55% and 60%, respectively. Nonetheless, the recall score is low for all the considered configurations, meaning that even with the conceived advanced techniques, false negatives still have a relevant impact on the recommendations.
TABLE III RESULTS OF THE ABLATION STUDY.
| Configuration | PR@N | REC@N | F1 | Coverage@N | EPC |
|------|-------|-------|----|------------|-----|
| C1 | 0.12 | 0.08 | 0.09 | 26% | 15% |
| C2 | 0.17 | 0.12 | 0.14 | 30% | 20% |
| C3 | 0.24 | 0.16 | 0.19 | 34% | 29% |
| C4 | 0.47 | 0.20 | 0.28 | 58% | 40% |
| C5 | 0.67 | 0.16 | 0.26 | 40% | 10% |
| C6 | 0.45 | 0.17 | 0.25 | 55% | 60% |
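The effect of the penalty mechanism on which libraries reach the top-N list is easier to reason about with a concrete picture of the score from Section III-D. The sketch below is only one possible reading: the paper does not specify how the penalty score adjusts the generation process, so the code assumes a pure post-processing re-ranking of the model's candidates by PenaltyScore = 1 / (usage + 1). The usage counts are invented for illustration, and a realistic system would probably combine this score with a relevance signal.

```python
def penalty_score(usage_count):
    """PenaltyScore = 1 / (PopularityRank + 1), with usage as the rank."""
    return 1.0 / (usage_count + 1)


def rerank_with_penalty(candidates, usage_counts, n):
    """Reorder LLM candidates so that less-used libraries come first.

    Assumption: libraries missing from usage_counts have usage 0.
    Ties keep the model's original order because the sort is stable.
    """
    scored = [(penalty_score(usage_counts.get(lib, 0)), lib) for lib in candidates]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [lib for _, lib in ranked[:n]]


# Hypothetical usage counts, not taken from Maven.
usage = {
    "junit:junit": 9000,
    "log4j:log4j": 7000,
    "org.apache.commons:commons-lang3": 300,
    "io.micrometer:micrometer-core": 40,
}
candidates = [
    "junit:junit",
    "org.apache.commons:commons-lang3",
    "log4j:log4j",
    "io.micrometer:micrometer-core",
]
top2 = rerank_with_penalty(candidates, usage, n=2)
```

Because this variant ignores relevance entirely, it pushes every list toward the long tail, which changes the top-N contents without guaranteeing that the promoted libraries match the ground truth.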
Answer to RQ: The ablation study reveals that introducing advanced techniques and prompts has a positive effect on the diversity of the recommendations. However, the recall values are still far from the optimal value, motivating further studies to improve the results.
# B. Discussion and improvements
Despite the advanced techniques proposed in this work, we report that the popularity bias is still a relevant issue in open-source LLMs. We acknowledge that further techniques and methodologies are needed to improve TPL recommendations. In particular, we plan to investigate the following directions:
▶ Applying bias mitigation algorithms: Fairness research highlights that a wide range of debiasing algorithms have been successfully applied to traditional ML models [39]–[41]. These well-established techniques can be leveraged as a post-processing step to further minimize popularity bias.
▶ Integrating contextual information with RAG: The Retrieval Augmented Generation (RAG) technique has been proven to be effective when contextual information plays a relevant role in SE-specific tasks [42]. In the context of TPL recommenders, we foresee the usage of this technique to improve the recommendations by selecting unbiased sources of information. For instance, few-shot prompts plus past conversations can be substituted with this advanced technique, which can leverage the context directly provided by the user, thus leading to more control over the training data.
▶ Leveraging user feedback during recommendations: A relevant field of study in RSSEs is the exploitation of user feedback to improve the recommendations. In the context of our study, we consider the history as a source of implicit feedback, i.e., the model can learn from the previous interactions to improve the recommendations [43]–[45]. Future research can investigate how explicit feedback can be introduced in LLM-based TPL RSSEs, e.g., conceiving different recommendation sessions instead of a single one.
# VI. THREATS TO VALIDITY
In this section, we discuss the potential threats to the validity of our study and the measures taken to mitigate them.
Internal validity focuses on two primary aspects: the dataset used for the experiments and the prompt engineering techniques. Regarding the dataset, we utilize a state-of-the-art dataset widely adopted by several TPL recommendation systems. Additionally, we identify the most popular libraries in the dataset and leverage their ranking to design the penalty mechanism. As for the prompt techniques, we adhere to well-established guidelines in prompt engineering by employing a consistent prompt across all models and configurations. Furthermore, we enhance the basic prompts by incorporating negative instructions and contextual information from previous interactions with the user.
External validity concerns the generalizability of the study's findings to other contexts, i.e., the results may vary if other datasets or LLMs are considered. To mitigate this, we employ a state-of-the-art dataset used by several TPL recommendation systems, thus ensuring that the obtained recommendations can be compared with those of traditional systems. As we focus on open-source LLMs, we opt for Llama 2 and Llama 3, which are widely used in the SE community.
Construct validity refers to the conducted ablation study and its design. While we acknowledge that not all possible combinations have been assessed, we carefully isolate the different components, showing their contributions in terms of well-founded metrics, i.e., accuracy, novelty, and diversity.
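To make the three metric families named above concrete, the sketch below computes one common representative of each: precision at k for accuracy, catalog coverage for diversity, and an average "one minus popularity share" for novelty. These definitions, the function names, and the toy data are assumptions chosen for illustration; the exact formulations used in the study may differ.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended libraries that are relevant."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(1 for lib in top_k if lib in relevant) / len(top_k)


def catalog_coverage(all_recommendations, catalog):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = set()
    for recs in all_recommendations:
        recommended.update(recs)
    return len(recommended & set(catalog)) / len(catalog)


def mean_novelty(recommended, popularity, n_projects):
    """Average of (1 - popularity share) over the recommended libraries."""
    if not recommended:
        return 0.0
    total = sum(1.0 - popularity.get(lib, 0) / n_projects for lib in recommended)
    return total / len(recommended)


# Toy data (hypothetical): library -> number of projects using it.
popularity = {"junit": 4, "slf4j": 3, "jsoup": 1, "guava": 1}
n_projects = 4
catalog = list(popularity)

# Two hypothetical recommendation lists and one ground-truth set.
recs = [["junit", "slf4j"], ["junit", "jsoup"]]
relevant = {"junit", "guava"}

print(precision_at_k(recs[0], relevant, k=2))       # accuracy
print(catalog_coverage(recs, catalog))              # diversity
print(mean_novelty(recs[1], popularity, n_projects))  # novelty
```

Reporting the three values side by side makes the trade-off visible: a configuration that only recommends `junit` would score well on precision for many projects while scoring poorly on coverage and novelty.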
# VII. RELATED WORKS
TPL recommendation systems: LibSeek [9] is a TPL recommender that provides diversified libraries. The system uses an adaptive weighting mechanism as a post-processing module to neutralize popularity bias by promoting less popular TPLs. The evaluation conducted on a curated Android dataset shows that LibSeek succeeds in providing a wide range of libraries, thus increasing novelty in the recommendation outcomes. Similarly, Rubei et al. [30] investigated the usage of a learning-to-rank mechanism to embody explicit user feedback in TPL recommenders. In particular, the underpinning algorithm is used to re-sort the suggested libraries according to their popularity. They exploit the same dataset used in our study as it contains popular libraries. LibRec [13] builds on a lightweight collaborative-filtering technique and association mining, retrieving libraries that are used by popular projects. Req2Lib [46] suggests relevant TPLs starting from the textual description of the requirements, handling the cold-start problem by combining a Sequence-to-Sequence network with a doc2vec pre-trained model. Similarly, GRec [47] encodes mobile apps, TPLs, and their interactions in an app-library graph. Afterward, it uses a graph neural network to distill the relevant features to increase the overall accuracy. Chen et al. [48] proposed an unsupervised deep learning approach to embed both usage and description semantics of TPLs to infer mappings at the API level. The model is trained using the information encoded as vectors from 135,127 GitHub projects. An approach [49] based on Stack Overflow was proposed to recommend analogical libraries, i.e., libraries similar to the ones that developers already use. Compared to those approaches, we rely on Llama models to handle the popularity bias in TPL recommendations.
Analysis of the long-tail effect in SE: In [26], the authors explore several techniques to mitigate the long-tail effect using different pre-trained models in three code-related tasks, i.e., API completion, code revision, and vulnerability prediction. The experimental results reveal that the long-tailed distribution has a negative impact on the overall prediction performance, reducing the effectiveness by up to 254.0% in the examined tasks. Nguyen et al. [19] compare three different TPL RSSEs, i.e., LibSeek, CrossRec, and LibRec, to investigate the long-tail effect in TPL recommendations. The authors found that all the systems are prone to this bias. Lopes and Ossher [50] conduct a large-scale analysis on 30,911 Java projects, revealing that they adhere to a long-tail distribution in terms of size, i.e., the majority of the projects are small or medium. Borges et al. [51] identified four main growth patterns that increase popularity on the GitHub platform. By considering 2,500 top-ranked projects, the conducted study reveals a long-tailed distribution in the examined data. To the best of our knowledge, our study is the first to investigate the long-tail effect in TPL recommendation systems using LLMs.
# VIII. CONCLUSION
While cutting-edge generative AI models have demonstrated promising results across various software engineering tasks, recent studies highlight a significant challenge: the long-tail effect, which limits diversity in recommendations within recommender systems for software engineering (RSSEs).
In this paper, we presented an initial investigation into the impact of popularity bias on TPL recommendations generated by open-source LLMs, specifically Llama 2 and Llama 3. To explore this, we applied multiple mitigation strategies aimed at reducing the influence of the most popular libraries in Java projects, including advanced prompt engineering techniques, fine-tuning, and popularity penalty mechanisms. Our results reveal that the long-tail effect is evident in the recommendations produced by the baseline model (Llama 2). However, fine-tuning and penalty mechanisms demonstrate the potential to mitigate this bias in the more advanced Llama 3 model. We hypothesize that incorporating more sophisticated approaches, such as retrieval-augmented generation (RAG) or human-in-the-loop solutions, could further reduce popularity bias.
For future work, we plan to expand our experiments by integrating these advanced solutions, such as bias mitigation techniques, RAG, and user feedback. Additionally, we aim to explore open-source models optimized for code-related tasks, such as Code Mistral and CodeLlama. Finally, we will broaden the dataset to include a larger number of Java projects and extend the analysis to other programming languages that heavily rely on TPLs, such as Python and JavaScript.
# ACKNOWLEDGMENTS
This work was partially supported by the following Italian research projects: EMELIOT (PRIN 2020, grant n. 2020W3A5FY), TRex SE (PRIN 2022, grant n. 2022LKJWHC), the MATTERS project, funded under the cascade scheme of the SERICS program (CUP J33C22002810001), Spoke 8, within the Italian PNRR Mission 4, Component 2, and the FRINGE project (PRIN 2022 PNRR, grant n. P2022553SL).
# REFERENCES
