[Paper translation] A Generative AI-driven Metadata Modelling Approach


Original paper: https://arxiv.org/pdf/2501.04008


A Generative AI-driven Metadata Modelling Approach

Mayukh Bagchi

DISI, University of Trento, Italy. Institute for Globally Distributed Open Research and Education (IGDORE).

mayukh.bagchi@igdore.org

Abstract

For decades, the modelling of metadata has been core to the functioning of any academic library. Its importance has only grown with the increasing pervasiveness of Generative Artificial Intelligence (AI)-driven information activities and services which constitute a library’s outreach. However, with the rising importance of metadata, several outstanding problems have arisen in the process of designing a library metadata model, impacting its reusability, crosswalk and interoperability with other metadata models. This paper posits that the above problems stem from an underlying thesis that there should only be a few core metadata models which would be necessary and sufficient for any information service using them, irrespective of the heterogeneity of intra-domain or inter-domain settings. To that end, this paper advances a contrary view of the above thesis and substantiates its argument in three key steps. First, it introduces a novel way of thinking about a library metadata model as an ontology-driven composition of five functionally interlinked representation levels from perception to its intensional definition via properties. Second, it introduces the representational manifoldness implicit in each of the five levels which cumulatively contributes to a conceptually entangled library metadata model. Finally, and most importantly, it proposes a Generative AI-driven Human-Large Language Model (LLM) collaboration based metadata modelling approach to disentangle the entanglement inherent in each representation level, leading to the generation of a conceptually disentangled metadata model. Throughout the paper, the arguments are exemplified by motivating scenarios and examples from representative libraries handling cancer information.

Keywords

Generative AI, Metadata, Ontology-Driven Metadata Models, Academic Libraries and AI, Large Language Models, Human-LLM Collaboration, Knowledge Organization, Knowledge Representation.

Introduction

Since the 1980s (Yu and Breivold 2008) and increasingly from the 2000s (Tedd and Large 2004), the development and continual management of (digitised) metadata (Satija, Bagchi and Martínez-Ávila 2020) has been core to the mundane functioning of any well-administered academic library. With the advent and increasing pervasiveness of AI (Cox, Pinfield and Rutter 2019) and Generative AI (Banh and Strobel 2023), however, the scope of an academic library has radically expanded to include the deployment of AI-based data-driven information services, e.g., research data management (Andrikopoulou, Rowley and Walton 2022), ontology-driven content management (Bagchi 2021a; Bagchi 2021b), chatbots (Bagchi 2020), all of which crucially depend on a well-founded metadata model (semantically) annotating and exposing the underlying data. For instance, let us consider the motivating scenario of semantically annotating scientific information in cancer research (e.g., cancer big data (Jiang et al. 2022)) within the framework of an academic library. Such an exercise is clearly dependent on a multitude of factors, e.g., the precise purpose, the target user base, the technical features, etc., each of which mutually defines the information service(s) that will harness such information. To that end, for example, a metadata model designed by a metadata librarian (Han and Hswe 2011) annotating, e.g., cancer big data, for a specialised cancer research library will be significantly (if not completely) different from one suited for an oncology policy library which, again, would be considerably different from a metadata model employed by a medical college library.

The general thesis advanced by this paper, as evidenced from the aforementioned motivating scenario and several other similar use-cases (see, e.g., (Van Dijck 2014; Gartner 2016; Ulrich et al. 2022)), is that there is no unique metadata model which is necessary and sufficient for semantic data annotation within a single domain, and certainly not for the heterogeneity inherent in cross-domain use-case scenarios (see also (EU-NSF 2024)). Let us understand, via the lens of the motivating scenario, the interlinked representation levels which cumulatively constitute a metadata model as indicated in the above thesis. First, the perception of the (expert) users of a cancer research library towards cancer research data would be highly specialised (e.g., uncovering multi-omics workflows (Ulrich et al. 2022)) and hence considerably different to how it is perceived by the (expert) users of an oncology policy library or a medical college library. Second, as a partial consequence of the first reason, the terminology employed by the users of the three types of libraries to describe the various perceived concepts of cancer research data would be mutually different. Third, the decision to ontologically characterise a perceived data concept as, e.g., a function or a process (Arp and Smith 2008) or an object property or a data property (Bagchi 2022) etc., by users of the three different types of libraries would be mutually different. Fourth, as a direct consequence of the ontological characterization, the taxonomy (Bagchi and Madalli 2019) assumed by the users of the three types of libraries would be markedly different. Fifth, the intensional characterization (see, e.g., (Von Fintel and Heim 2011), for an overview of intensional semantics) of the taxonomy in terms of interrelating and describing its constituent concepts via object properties and data properties would be different for different sets of (expert) library users. Finally, it is also key to note that the ordered representation levels, as briefed above, each independently as well as cumulatively compound the design of any metadata model (e.g., developed by one of the three aforementioned libraries) and complicate its crosswalk with any other metadata model (e.g., developed by the other two).

A knowledge model (e.g., a metadata model) is representationally manifold by design and cannot be necessary and sufficient for all use-cases irrespective of their intra-domain or inter-domain setting. Let us understand the above in terms of the motivating scenario. First, there is always a many-to-many correspondence (hereafter, referred to as representational manifoldness) between entities and how they are perceived as concepts (e.g., by sets of users of the three different libraries). Second, given perception, there is always a representational manifoldness between the perceived concepts and how they are linguistically labelled (e.g., by the same sets of users) using some terminology. Third, given labelling, there is always a representational manifoldness between the labelled concepts and their ontological status (e.g., by the same sets of users). Fourth, as a consequence of the manifoldness in ontological characterisation, there is always a representational manifoldness between the ontologically characterised concepts and how they ought to be taxonomically classified (e.g., as per the warrant of the same set of users). Fifth, there is always a representational manifoldness between the taxonomically classified concepts and how they are intensionally interrelated and described via properties. Two observations follow. Firstly, it is interesting to note how the individual as well as the cumulative impact of the representation layers magnifies the entanglement in the final metadata model. Secondly, and perhaps most importantly, notice the need to explicate the decision of the modeller (e.g., a metadata librarian) at each level which, in an overwhelming majority of cases, remains implicit (Bagchi and Das 2022; Bagchi and Das 2023).

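The notion of representational manifoldness can be made concrete with a small sketch. The Python snippet below is only an illustration under stated assumptions: the function name `is_manifold` and all of the sample correspondences are hypothetical and do not come from the paper or its linked example. It treats one representation level as a set of (source, target) pairs and flags the level as manifold whenever the correspondence is not one-to-one.

```python
from collections import defaultdict


def is_manifold(pairs):
    """Return True when a correspondence is not one-to-one.

    `pairs` is an iterable of (source, target) tuples for one representation level.
    """
    forward = defaultdict(set)
    backward = defaultdict(set)
    for source, target in pairs:
        forward[source].add(target)
        backward[target].add(source)
    one_to_many = any(len(targets) > 1 for targets in forward.values())
    many_to_one = any(len(sources) > 1 for sources in backward.values())
    return one_to_many or many_to_one


# Hypothetical correspondences, invented for illustration only.
levels = {
    # One entity perceived as three different concepts by three user communities.
    "perceptual": [
        ("tumour sample record", "Multi-omics Dataset"),
        ("tumour sample record", "Policy Evidence"),
        ("tumour sample record", "Teaching Case"),
    ],
    # Each perceived concept carries exactly one label here.
    "terminological": [
        ("Multi-omics Dataset", "omics dataset"),
        ("Policy Evidence", "evidence brief"),
    ],
}

manifold_levels = [name for name, pairs in levels.items() if is_manifold(pairs)]
print(manifold_levels)
```

In this toy input only the perceptual level is flagged, because a single entity is perceived in three different ways.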

The solution proposed in this paper is a Generative AI-driven LLM-based enhancement of an early version of the approach termed as Conceptual Disentanglement (Bagchi and Das 2022; Bagchi and Das 2023). The key focus of the approach is to explicitly disentangle the decisions made by the modeller (e.g., a metadata librarian), at each (knowledge) representation level mentioned before, which would otherwise implicitly entangle the final conceptual knowledge artefact (e.g., the metadata model). To that end, the general strategy of the approach is to explicitly enforce one-to-one correspondence (hereafter, referred to as representational bijection) out of the potentially many representational manifold possibilities at each level. To explain using the motivating scenario, first, the metadata librarian (who is the modeller) should explicate a representational bijection between entities and their consensual perception as concepts (e.g., by a particular set of users of one of the libraries). Second, given the fixation of perception, the metadata librarian should explicate a representational bijection between the perceived concepts and their linguistic labelling via a consensual body of terminology. Third, given the fixation of terminology, the metadata librarian should explicate a representational bijection between the labelled concepts and their ontological commitments (Guarino, Carrara and Giaretta 1994), i.e., whether they are events or processes or properties, etc. In fact, an ontology-driven metadata model is key to underpin the precise semantics represented within (meta)data concepts and properties (see, e.g., the arguments advanced in (Dutta 2014; Leipzig et al. 2021)). Fourth, the metadata librarian should explicate a representational bijection between the ontologically characterised concepts and their exact taxonomy. Fifth, given the taxonomy, the metadata librarian should explicate a representational bijection between each concept in the taxonomy and their exact intensional characterization. Finally, it is interesting to note how the individual as well as the cumulative disentanglement of the representation layers minimise the entanglement in the final metadata model. Notice that the decision of the modeller is also explicit at each representation level thereby crucially impacting the minimization of the conceptual entanglement.

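One way to picture the enforcement of a representational bijection at a single level is sketched below in Python. This is an assumption-laden illustration and not the paper's implementation: the helper `fix_bijection`, the candidate sets and the recorded decisions are all hypothetical. The sketch accepts a mapping only if the modeller has recorded an explicit choice for every item and no two items share the same choice.

```python
def fix_bijection(candidates, decisions):
    """Turn a one-to-many candidate relation into an explicit one-to-one mapping.

    candidates: dict mapping each item to the set of options it could take.
    decisions:  dict mapping each item to the option the modeller chose.
    Raises ValueError if a decision is missing, invalid, or reused.
    """
    mapping = {}
    taken = {}
    for item, options in candidates.items():
        if item not in decisions:
            raise ValueError(f"no explicit decision recorded for {item!r}")
        choice = decisions[item]
        if choice not in options:
            raise ValueError(f"{choice!r} is not a candidate for {item!r}")
        if choice in taken:
            raise ValueError(f"{choice!r} is already assigned to {taken[choice]!r}")
        taken[choice] = item
        mapping[item] = choice
    return mapping


# Hypothetical terminological level: perceived concepts and candidate labels.
candidates = {
    "Clinical Trial": {"trial", "clinical study"},
    "Biomarker": {"marker", "biomarker"},
}
decisions = {"Clinical Trial": "clinical study", "Biomarker": "biomarker"}

labelling = fix_bijection(candidates, decisions)
print(labelling)
```

Raising an error, rather than silently picking an option, mirrors the point made above that the modeller's decision should be explicit at every level.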

While the contextualization, problem and solution approach is clear to some extent, it is not yet clear as to how the solution approach can be potentially implemented by, e.g., a metadata librarian, within an academic library setting. In fact, there are two critical complementary highlights which should characterise any implementation strategy of the conceptual disentanglement approach for modelling library metadata. First, while the approach attempts to disentangle, level-by-level, the entanglement inherent by design in a metadata model, it runs the risk of considerable intellectual work on part of the metadata librarian in terms of manually factoring and refactoring constituent knowledge management nitty-gritties (Bagchi 2019a; Bagchi 2019b). On the other hand, given the justified complexity of the conceptual disentanglement approach (e.g., in layers like perception or ontology), an overtly (semi-)automatic implementation is not poised to produce an ontology-driven metadata model of requisite quality. To harness the best of both worlds (i.e., human and machine), the paper proposes a novel implementation of the conceptual disentanglement approach via Generative AI-driven Human-LLM Collaboration, wherein, the metadata librarian, using prompt engineering (Ekin 2023), exploits an LLM (Liang et al. 2022; Chang et al. 2024) to generate a conceptually (dis)entangled metadata model which, for each representational level, is validated/repaired/enriched by him/her. Notice that the validation or repair or enrichment of the Human-LLM collaboratively generated metadata model by the metadata librarian can involve several standard dimensions, e.g., metadata quality (Park 2009), semantic quality (Poels et al. 2005), and additionally purpose-specific dimensions such as, e.g., measuring the fit of the metadata model with actual cancer data/information. Overall, a widespread acceptance and implementation of the conceptual disentanglement approach via Generative AI-driven Human-LLM collaboration is poised to bring to the fore several advantages in terms of, e.g., metadata development methodologies, metadata crosswalk and interoperability (Khoo and Hall 2010), semantic heterogeneity (Hull 1997), FAIR data (Wilkinson et al. 2016), etc.

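The level-by-level collaboration described above might be organised roughly as in the following Python sketch. It is a minimal illustration under explicit assumptions: `disentangle`, `LevelRecord`, `fake_llm` and `fake_review` are hypothetical names, the LLM is replaced by a stub that returns a placeholder string (no real LLM API is called), and the librarian's review is reduced to a function that returns an action and the accepted text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# The five representation levels, in the order discussed in the text.
LEVELS = ["perceptual", "terminological", "ontological", "taxonomical", "intensional"]


@dataclass
class LevelRecord:
    level: str
    draft: str      # what the LLM proposed
    accepted: str   # what the librarian finally fixed for this level
    action: str     # "validated", "repaired" or "enriched"


def disentangle(
    generate: Callable[[str, List[LevelRecord]], str],
    review: Callable[[str, str], Tuple[str, str]],
) -> List[LevelRecord]:
    """Run one pass over the levels, recording the librarian's decision at each."""
    records: List[LevelRecord] = []
    for level in LEVELS:
        # The LLM drafts the current level, given the levels already fixed.
        draft = generate(level, records)
        # The librarian validates, repairs or enriches the draft.
        action, accepted = review(level, draft)
        records.append(LevelRecord(level, draft, accepted, action))
    return records


def fake_llm(level: str, context: List[LevelRecord]) -> str:
    """Stub standing in for an LLM call; it only returns a placeholder string."""
    return f"draft for {level} level given {len(context)} fixed level(s)"


def fake_review(level: str, draft: str) -> Tuple[str, str]:
    """Stub librarian: repairs the ontological draft, validates the others."""
    if level == "ontological":
        return "repaired", draft + " [repaired]"
    return "validated", draft


records = disentangle(fake_llm, fake_review)
for record in records:
    print(record.level, "->", record.action)
```

Keeping both the draft and the accepted text in each record is one possible way to make the modeller's decision at every level explicit and auditable.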

The remainder of the paper is organised as follows: The second section describes the different knowledge representation levels involved in designing a metadata model and how such levels are ordered and interlinked to cumulatively impact its design. The third section details the problem of conceptual entanglement individually and cumulatively across the representation levels and their confounding impact on the design of the metadata model. The Generative AI-driven Human-LLM collaboration based conceptual disentanglement approach individually and cumulatively across the representation levels and the way it minimises the entanglement and confusion in the design of the metadata model is elucidated in the fourth section. Finally, the fifth section discusses research implications from the related work in terms of some of the (generative) AI, metadata and semantics-based research issues academic libraries face and the sixth section concludes the paper. Throughout all the sections, especially for the second, third and fourth sections, an adapted instantiation of the motivating scenario in terms of a simplified ontology-driven metadata model for the cancer domain would be employed to exemplify and highlight the issues and approaches. Further, several of the fine-grained technical intricacies, e.g., of knowledge representation, are skipped in accordance with the broader methodological scope of the paper.

Representation Levels in Modelling Metadata

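Let us now expand the discussion on the characteristically independent but functionally interlinked knowledge organization and representation levels which cumulatively compose any metadata model within the framework of an academic library. To that end, we have the following five levels, in order: the perceptual level, the terminological level, the ontological level, the taxonomical level and the intensional level.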

Additionally, it is also important to note that the motivating example considered for illustrating the above (and for the remainder of the paper) consists of three equally probable variations of an ontology-driven library metadata model for the cancer domain (e.g., encoding concepts such as Clinical Trial, Patient, Biomarker, Imaging Test, Histopathology Report, etc.) generated by prompting the Generative-AI based LLM (interface) - ChatGPT 3.5 - using a series of prompts. Please note two observations (valid throughout the paper). The metadata models cannot be reproduced completely within the full-text of this paper due to constraints of space and to that end, please follow the link - motivating example - wherein readers can get a fuller understanding of the concept/property exemplified in the paper and its page location within the linked document describing the generated metadata models. Also, while the metadata model generated by ChatGPT 3.5 is informal in nature, this is not relevant to the scope of this paper as it can be directly formalised in any formal language of choice (e.g., the web ontology language - OWL) for concrete implementation purposes. The above levels are elucidated as follows.

First, let us concentrate on the perceptual level which, while not being explicitly pronounced in state-of-the-art library metadata research and implementation (see, e.g., (Haynes 2018; Strecker et al. 2021)), is nonetheless the very first level where the crucial representational choice as to how a set of entities should be perceived as concepts (later) composing a library metadata model is made. Notice two key dimensions which inform the decision-making of a metadata librarian at this level. First, it goes without saying that perception is highly egocentric to an individual academic library user and therefore is usually incomplete (Bagchi 2021a) and cannot be fully captured in a (semi-)formal manner. Second, while the first dimension holds, it is equally the case that various communities of practice (Wenger 1999; Cox 2005) (e.g., users of different types of libraries) commit to a shared and homogeneous perception as to how certain entities should be perceived as concepts. Consider the case of the motivating example. In it, pages 1-7 elaborate a general (ontology-driven) library metadata model for the cancer domain generated by ChatGPT 3.5 with constituent concepts (classes and properties) at three levels of abstraction. Further, note how the general model, via successive prompts, is tuned to the potential perceptions of a representative library user belonging to the specialised cancer research community (pages 8-10), oncology policy community (pages 10-12) and medical college student community (pages 12-14), respectively. It is interesting to observe that even for the same macro domain (i.e., cancer), each of the three metadata models encodes concepts and properties uniquely relevant to the perception of a specific user community.

Second, given the understanding as to why perception is crucial for modelling metadata, let us now turn to the terminological level where the key representational choice is to decide how a set of perceived concepts should be labelled using a body of terminology. This choice is not trivial for the metadata librarian due to the very nature of the interaction of language with the perceptual level. Languages are “itemized [terminological] inventories” (Brown and Lenneberg 1954) of the entities we perceive, and each such “inventory” generates either a similar (but not the same) or a different lexicalization of the perception due to issues following from linguistic relativity (Boroditsky 2011). Such a lexicalization might manifest amongst different sets of academic library users in terms of different linguistic phenomena. For example, two well-known linguistic phenomena are polysemy (the coexistence of several potential meanings of a term) and synonymy (terms having similar meanings) (Glynn and Robinson 2014). Lexical gaps (Bentivogli and Pianta 2000) occur when, for a concept, a specific language/vocabulary doesn’t have a referent term. The scenario might get even more confounded when library user communities communicate with each other or with the computer system by linguistically referring to the perceived entities with terms from specialised terminological standards (Suonuuti 1997) or glossaries (Sarginson et al. 2012), possibly from different languages. There might be issues with both mapping and interoperability of terms amongst the standards due to issues of linguistic phenomena or non-existent terminological crosswalks. To exemplify, in the motivating example, please refer to pages 20-23, which detail the LLM response to the prompt clarifying the terminology employed in the three examples. Notice that parts of the cancer research library metadata model use terms from the Dublin Core (Weibel et al. 1998) to express its perceived concepts. Similarly, the oncology policy library and the medical college library metadata models use terms from standards such as the HL7 (Health Level Seven) standard, the National Comprehensive Cancer Network (NCCN) standard, SNOMED-CT, etc. (Schulz, Stegwee and Chronaki 2019).

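The three linguistic phenomena named above can be detected mechanically once terms and concepts are written down as pairs. The following Python sketch is illustrative only: the function `analyse_terminology` and every term and concept in the sample data are hypothetical and are not taken from the linked motivating example. A term attached to several concepts is reported as polysemous, a concept with several terms as having synonyms, and a concept with no term as a lexical gap.

```python
from collections import defaultdict


def analyse_terminology(labels, concepts):
    """Report polysemy, synonymy and lexical gaps in a labelling.

    labels:   iterable of (term, concept) pairs.
    concepts: iterable of all perceived concepts that need a label.
    """
    term_to_concepts = defaultdict(set)
    concept_to_terms = defaultdict(set)
    for term, concept in labels:
        term_to_concepts[term].add(concept)
        concept_to_terms[concept].add(term)

    # A term with several meanings is polysemous.
    polysemous = {t: sorted(c) for t, c in term_to_concepts.items() if len(c) > 1}
    # A concept with several terms has synonyms.
    synonyms = {c: sorted(t) for c, t in concept_to_terms.items() if len(t) > 1}
    # A concept without any term is a lexical gap in this vocabulary.
    gaps = sorted(set(concepts) - set(concept_to_terms))
    return polysemous, synonyms, gaps


# Hypothetical vocabulary, invented for illustration only.
labels = [
    ("trial", "Clinical Trial"),
    ("trial", "Pilot Policy Programme"),
    ("tumour", "Neoplasm"),
    ("neoplasm", "Neoplasm"),
]
concepts = ["Clinical Trial", "Pilot Policy Programme", "Neoplasm", "Survivorship Plan"]

polysemous, synonyms, gaps = analyse_terminology(labels, concepts)
print(polysemous)
print(synonyms)
print(gaps)
```

A report of this kind does not resolve anything by itself; it only surfaces the places where the metadata librarian has a labelling decision to make.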

Third, given the central importance of terminology in the design of any library metadata model, the next dimension to understand is the ontological level where the key representational choice regarding the ontological commitment (Guarino, Carrara and Giaretta 1994) of the (newly) labelled concepts is made. This choice, again, is non-trivial for the metadata librarian due to the theoretical root of the interaction of ontology with the terminological level, in the sense that committing to a terminology results in committing to an ontology (Moltmann 2019) which might not necessarily be completely compatible with another ontology/terminology pairing. The chief importance of this representation level is to explicate the otherwise implicit commitment of the body of terms to a top-level ontology (Guarino 1997), wherein such an ontology, based on an explicitly specified philosophical doctrine, reveals the underlying nature of domain-level concepts. There can be several such philosophical doctrines such as three-dimensionalism, four-dimensionalism, etc. (McCall and Lowe 2006), and according to the chosen doctrine, the domain-level labelled concepts can be classified as a kind, a part-of, an event, a process, a role, a function or a property (Gangemi et al. 2001). Two observations follow. First, this representation layer is by far the most demanding in terms of the involvement of the human modeller, i.e., the metadata librarian, for the simple reason that while different library user communities might, most possibly, conceptualise the same set of entities into different top-level ontological categories, it is chiefly the responsibility of the metadata librarian to explicate such assumptions. Secondly, as evidenced in decades' worth of research in semantic heterogeneity (Hull 1997), the ontological level is key to achieving a range of semantics-intensive tasks like semantic mapping (Beneventano et al. 2008) and semantic interoperability (Bittner, Donnelly and Winter 2005) with respect to (meta)data models. To exemplify, in the motivating example, please refer to pages 23-29, which detail the LLM response to the prompt requesting the ontological categories employed in the three examples. For example, notice that on page 26, the LLM response aligns concepts and properties such as Document and hasAuthor to top-level ontological categories like Information Object and Quality.

Fourth, given the importance of ontological commitment of constituent concepts and properties in a library metadata model, let us concentrate on the taxonomical level where the key representational choice regarding the ontologically characterised concepts and their classification into a taxonomical hierarchy is made. This choice by the metadata librarian is guided by two key factors. First, the ontological level already induces a very abstract high-level taxonomy for the concepts due to the mandatory ontological constraints (Guarino, Carrara and Giaretta 1994) they pose while the metadata librarian commits to a specific top-level ontology and its constituent top-level ontological categories. For example, if the concepts Person and Patient belong to the top-level ontological categories Kind and Role, respectively, it is an ontological constraint that a Role can never be the taxonomical parent of a Kind, and, therefore, Patient can never be the taxonomical parent of Person. Second, within the ontological constraints, the taxonomy can be designed by the metadata librarian in accordance with how its hierarchy and metadata properties will be exploited by the target information service (e.g., a data catalog (Guptill 1999), a chatbot (Bagchi 2020)). To exemplify, in the motivating example, please refer to pages 30-34, which detail the LLM response to the prompt requesting the taxonomical choices made (for concepts as well as properties). For example, on page 30, Library Resource is specialised into Research Paper and Dataset. Notice that, alongside the ontological level, the taxonomical level is also equally demanding, if not more, in terms of the involvement of the metadata librarian in modelling the classification hierarchy.

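The Person/Patient constraint lends itself to a simple automated check. The Python sketch below is a minimal illustration: the constraint that a Role can never be the taxonomical parent of a Kind is taken from the text, whereas the function `taxonomy_violations`, the table `FORBIDDEN_PARENT_CHILD` and the two sample taxonomies are assumptions introduced here.

```python
# Pairs of (parent category, child category) that the top-level ontology forbids.
# Only the Role-over-Kind constraint comes from the text; the table is a sketch.
FORBIDDEN_PARENT_CHILD = {("Role", "Kind")}


def taxonomy_violations(category_of, parent_of):
    """List (parent, child) links that break an ontological constraint.

    category_of: dict mapping each concept to its top-level category.
    parent_of:   dict mapping each child concept to its taxonomical parent.
    """
    violations = []
    for child, parent in parent_of.items():
        if (category_of[parent], category_of[child]) in FORBIDDEN_PARENT_CHILD:
            violations.append((parent, child))
    return violations


category_of = {"Person": "Kind", "Patient": "Role"}

# Patient as a specialisation of Person respects the constraint.
valid_taxonomy = {"Patient": "Person"}
# Person as a specialisation of Patient places a Role above a Kind.
invalid_taxonomy = {"Person": "Patient"}

print(taxonomy_violations(category_of, valid_taxonomy))
print(taxonomy_violations(category_of, invalid_taxonomy))
```

A check of this sort could be run on an LLM-generated taxonomy before the metadata librarian reviews it, so that purely structural errors are caught early.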

Last but not least, given the key importance of the taxonomic classification in the composition of a library metadata model, let us focus on the intensional level where the key representational choice is how to interrelate and describe each individual concept in the taxonomy. This is done by first finalising the specialisation of the already classified properties (in the ontological level) into object properties and data properties (Bagchi and Madalli 2019). Note that, even if the properties are informally classified as object or data properties by the metadata librarian before this level, the final decision is taken at this level. Second, each concept in the taxonomy is assigned a set of object properties which encode how the concept is interlinked with other concepts in the overall metadata model. Equally, and perhaps most importantly, every taxonomic concept is assigned a set of data properties which describe its attributes. Two observations follow. First, this level is, in implementation terms, perhaps the most visible and pronounced level in a library metadata model as it decides on the data properties which ultimately encode real-world library metadata. Secondly, this level is central to a library metadata model in the sense that it decides the inheritance of data properties by classes (aka concepts) in the taxonomy, thereby considerably influencing the decision concerning the description of each taxonomic concept by a set of data properties and data types. To exemplify, in the motivating example, please refer to pages 34-39, which detail the LLM response to the prompt requesting the object and data property choices made. For example, on page 35, the concept Policy Document is described with data properties like {policyDocumentID, policyDocumentTitle, policyDocumentText}.

最后但同样重要的是,鉴于分类法在图书馆元数据模型组成中的关键重要性,让我们关注内涵层面,其中关键的表征选择是如何相互关联并描述分类法中的每个单独概念。这是通过首先将已分类的属性(在本体层面)细化为对象属性和数据属性来完成的(Bagchi 和 Madalli 2019)。请注意,即使元数据管理员在此层面之前非正式地将属性分类为对象属性或数据属性,最终决策仍在此层面做出。其次,分类法中的每个概念都被分配了一组对象属性,这些属性编码了该概念在整个元数据模型中如何与其他概念相互关联。同样,也许是最重要的方面,每个分类法概念都被分配了一组描述其属性的数据属性。有两点观察。首先,从实现角度来看,这一层面可能是图书馆元数据模型中最明显和突出的层面,因为它决定了最终编码现实世界图书馆元数据的数据属性。其次,这一层面在图书馆元数据模型中处于核心地位,因为它决定了分类法中类(即概念)对数据属性的继承,从而显著影响了通过一组数据属性和数据类型描述每个分类法概念的决策。例如,在激励示例中,请参阅第34-39页,其中详细描述了大语言模型对请求对象和数据属性选择的提示的响应。例如,在第35页中,概念“政策文档”被描述为具有诸如{policy Document ID, policy Document Title, policy Document Text}等数据属性。
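The inheritance of data properties along the taxonomy can be illustrated with a small sketch. The `Concept` class, the datatypes and the `resourceID` property are assumptions made for illustration; only the concept names and the three Policy Document properties follow the motivating example as quoted in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    parent: "Concept | None" = None
    data_properties: dict = field(default_factory=dict)  # property -> datatype

    def all_data_properties(self):
        """Own data properties plus everything inherited from ancestors."""
        inherited = self.parent.all_data_properties() if self.parent else {}
        return {**inherited, **self.data_properties}


# 'resourceID' and all datatypes are hypothetical.
resource = Concept("Library Resource", data_properties={"resourceID": "string"})
paper = Concept("Research Paper", parent=resource)
policy = Concept(
    "Policy Document",
    data_properties={
        "policyDocumentID": "string",
        "policyDocumentTitle": "string",
        "policyDocumentText": "string",
    },
)

print(paper.all_data_properties())
print(policy.all_data_properties())
```

Research Paper declares nothing of its own here, yet it is still described by the property of its taxonomic parent, which is the effect the paragraph attributes to this level.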

Finally, it is interesting to note two crucial highlights about the aforementioned levels of knowledge organization and representation which constitute any library (or even a generic) metadata model. First, the ordering from perception to intensionality is not just a linear, ordered sequence but is essentially an ordered and continually interlinked scientific spiral (Ranganathan 1957) from perception to intensionality and back. To that end, successive cycles can be utilised by the metadata librarian to continuously validate/repair/update/enrich the library metadata model with new potential changes in any of the representation levels. This also reinforces the notion of the non-validity of a single or a few metadata models as necessary and sufficient for any and every domain and application scenario. Second, notice that the above levels of representation are a general framework to understand the composition of a library metadata model. It might well be the case that for a specific application scenario, the perceptual level or the terminological level or both are quite homogeneous and, therefore, of much less relative importance to the model than, e.g., the ontological level or the intensional level. In any case, such permutations and combinations of the relative importance of each representation level to the final library metadata model should be decided on a case-by-case basis.

最后,值得注意的是,关于构成任何图书馆(或甚至通用)元数据模型的上述知识组织和表示层次的两个关键要点。首先,从感知到内涵的排序不仅仅是一个线性的有序序列,而本质上是一个从感知到内涵并返回的有序且持续互连的科学螺旋(Ranganathan 1957)。为此,元数据管理员可以利用连续的周期来持续验证/修复/更新/丰富图书馆元数据模型,以应对任何表示层次中的新潜在变化。这也强化了一个观点,即单一或少数元数据模型对于任何和每个领域和应用场景来说都不是必要且充分的。其次,注意到上述表示层次是理解图书馆元数据模型组成的一般框架。对于特定的应用场景,感知层次或术语层次或两者可能非常同质,因此相对于模型的重要性远不如本体层次或内涵层次。无论如何,每个表示层次对最终图书馆元数据模型的相对重要性的排列组合应基于具体情况来决定。

Entanglement in Modelling Metadata

建模元数据中的纠缠

Given the central and irreplaceable impact of the five functionally interlinked representation levels on the modelling of metadata, let us now focus on how the conceptual entanglement problem is implicitly instantiated at each successive level within the metadata design strategy of an academic library. To that end, we have the following:

鉴于五个功能互相关联的表示层对元数据建模的核心且不可替代的影响,我们现在将重点关注学术图书馆元数据设计策略中,概念纠缠问题如何在每个连续层次中隐式实例化。为此,我们有以下内容:

In order to exemplify the stratification above, the same motivating example of the ontology-driven library metadata model for the cancer domain is employed (pages 8-43). Note also that the stratification of conceptual entanglement proposed above is in sync with the stratification of representation in metadata proposed in the second section. The above levels are elucidated as follows.

为了举例说明上述分层结构,我们采用了癌症领域本体驱动的图书馆元数据模型的相同激励示例(第8-43页)。还需注意的是,上述提出的概念纠缠分层与第二部分中提出的元数据表示分层是一致的。上述层次结构如下所述。

First, let us concentrate on perceptual entanglement which, in essence, refers to the many-to-many mapping between entities and their perception as concepts which would eventually compose a library metadata model. There are two key factors which underlie the entanglement in the decision-making of a metadata librarian at this level. First, as already briefed before, perception is egocentric and can be unique to communities of practice. This premise leads to the notion of perception as a cognitive filter (Guarino, Guizzardi and Mylopoulos 2020), i.e., in our terms, the fact that the same entity and its properties can be perceived differently by different groups of library users depending on their overall purpose and goals. There can be an overlap in the concepts perceived between, say, two groups of library users and, equally, there can be ranges of mutual exclusion in their perception (e.g., due to different purposes or goals). Second, the highly implicit nature and consideration of perception within academic library metadata design (see, e.g., (Fleming, Mering and Wolfe 2008; Lopatin 2010)) all but leads to the necessary-and-sufficient assumption (explained before) on the surface, while the underlying multiplicities of perception remain unaddressed. Consider the case of the motivating example generated by ChatGPT 3.5 via Human-LLM collaboration. Three different user perceptions, viz., of a representative cancer research library (pages 8-10), an oncology policy library (pages 10-12) and a medical college library (pages 12-14), of the same macro domain cancer are illustrated in the example. Notice that the commonality in the perception of the three different communities is indicated by their shared perceived concept of cancer, specialised by its various types (breast cancer, lung cancer, etc.) and described by common properties (e.g., biomarkers associated with a cancer). However, major differences emerge in the three perceptions due to their very purpose in studying the same entity cancer. The cancer research library user community is interested in concepts like research study and treatment, and properties like clinical trial ID, pathology report text, etc. The oncology policy library user community, on the other hand, is interested in policy documents like clinical guidelines, best practice, regulatory policy, etc. The medical college library user community is interested in medical study, study material, treatment approach, etc. Please see pages 15-18 in the motivating example for more details on differences in perception. It is interesting to notice from the above the representational manifoldness between the same (domain) entities and how they are variously perceived as concepts.

首先,让我们关注感知纠缠(perceptual entanglement),它本质上指的是实体与其作为概念感知之间的多对多映射,这些概念最终将构成图书馆元数据模型。在这一层面上,元数据馆员决策中的纠缠有两个关键因素。首先,正如之前简要提到的,感知是自我中心的,并且可能因实践社区而异。这一前提导致了感知作为认知过滤器的概念(Guarino, Guizzardi 和 Mylopoulos 2020),即在我们看来,同一实体及其属性可能因图书馆用户群体的整体目的和目标而被不同地感知。例如,两个图书馆用户群体之间可能存在感知概念的重叠,同样,他们的感知之间也可能存在相互排斥的范围(例如,由于不同的目的或目标)。其次,学术图书馆元数据设计中感知的高度隐含性和考虑(参见,例如 (Fleming, Mering 和 Wolfe 2008; Lopatin 2010))几乎导致了表面上必要且充分的假设(如前所述),而感知的潜在多样性仍未得到解决。考虑由 ChatGPT 3.5 通过人-大语言模型协作生成的激励示例。该示例展示了同一宏观领域癌症的三种不同用户感知,即代表性癌症研究图书馆(第8-10页)、肿瘤政策图书馆(第10-12页)和医学院图书馆(第12-14页)。请注意,三个不同社区感知中的共同性体现在他们对癌症的共同感知概念上,该概念由其各种类型(乳腺癌、肺癌等)专门化,并由共同属性(例如,与癌症相关的生物标志物)描述。然而,由于他们研究同一实体癌症的目的不同,三种感知之间出现了重大差异。癌症研究图书馆用户群体对研究、治疗以及临床试验ID、病理报告文本等属性感兴趣。另一方面,肿瘤政策图书馆用户群体对临床指南、最佳实践、监管政策等政策文件感兴趣。医学院图书馆用户群体则对医学研究、学习材料、治疗方法等感兴趣。有关感知差异的更多详细信息,请参见激励示例的第15-18页。从上述内容中,有趣的是注意到相同(领域)实体之间的表示多样性以及它们如何被不同地感知为概念。
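The overlap and mutual exclusion between the three perceptions can be made concrete with plain set operations. This is a minimal sketch: the concept sets below only paraphrase the few concepts named in the text, and the dictionary layout is an assumption.

```python
# Concepts perceived by each community, as summarised in the text.
PERCEPTION = {
    "cancer research library": {
        "cancer", "breast cancer", "lung cancer", "research study", "treatment",
    },
    "oncology policy library": {
        "cancer", "breast cancer", "lung cancer",
        "clinical guideline", "best practice", "regulatory policy",
    },
    "medical college library": {
        "cancer", "breast cancer", "lung cancer",
        "medical study", "study material", "treatment approach",
    },
}

# Concepts every community perceives.
shared = set.intersection(*PERCEPTION.values())

# Concepts perceived by one community only.
exclusive = {
    community: concepts
    - set.union(*(c for name, c in PERCEPTION.items() if name != community))
    for community, concepts in PERCEPTION.items()
}

print(sorted(shared))
for community, concepts in exclusive.items():
    print(community, sorted(concepts))
```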

Second, let us focus on terminological entanglement, which implies the necessary existence of a many-to-many mapping between perceived concepts and their lexicalization using a terminological label. As briefly mentioned earlier, linguistic phenomena are a key inducer of representational manifoldness at this level. For example, polysemous terms labelling a perceived concept provide at best an ambiguous notion of its meaning and magnify the many-to-many possibilities for the metadata librarian. On the other hand, the central problem with synonyms is the establishment of mapping. Given implicitly synonymous terms, it might not always be straightforward for the metadata librarian to infer whether they are the same, synonymous, or a broader/narrower/distinct term. The above problems are further compounded in settings which might involve multiple languages and/or multiple communities of different genres of academic library users. To exemplify, in the motivating example, let us consider the singular case of the concept Breast Cancer (page 8). It can be termed variously by different communities of library users as Carcinoma of the Breast, Mammary Carcinoma, Breast Tissue Malignant Neoplasm or Lobular Adenocarcinoma. In multilingual metadata modelling settings, the same concept can be referred to as, for instance, cancer du sein in French, càncer de mama in Catalan or rakovina prsu in Czech. While technologies from Natural Language Processing (NLP) can be harnessed to implement some of these semantic (non)equivalences, it is crucial for the metadata librarian to first diagnose and understand the entanglement and manifoldness existent in the equivalency mapping.

其次,让我们关注术语的纠缠问题,这意味着感知概念与其使用术语标签进行词汇化之间存在必然的多对多映射关系。正如前面简要提到的,语言现象是这一层次上表示多样性的关键诱因。例如,多义词标签最多只能提供其含义的模糊概念,并放大了元数据图书管理员面临的多对多可能性。另一方面,同义词的核心问题在于映射的建立。对于隐含的同义词,元数据图书管理员可能并不总是能够直接推断它们是相同的、同义的,还是更广泛/更狭窄/不同的术语。在涉及多种语言和/或多个不同学术图书馆用户群体的环境中,上述问题会进一步复杂化。举例来说,在动机示例中,让我们考虑乳腺癌 (Breast Cancer) 这一概念 (第 8 页)。不同的图书馆用户群体可能会将其称为乳腺癌 (Carcinoma of the Breast)、乳腺腺癌 (Mammary Carcinoma)、乳腺组织恶性肿瘤 (Breast Tissue Malignant Neoplasm) 或小叶腺癌 (Lobular Adenocarcinoma)。在多语言元数据建模环境中,同一概念可以被称为法语中的 cancer du sein、加泰罗尼亚语中的 cáncer de mama 或捷克语中的 rakovina prsu。虽然可以利用自然语言处理 (NLP) 技术来实现其中一些语义的(非)等价性,但元数据图书管理员首先需要诊断并理解等价映射中存在的纠缠和多样性。
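A small sketch can show how synonymy and polysemy together produce the many-to-many mapping. The Breast Cancer terms are taken from the text; the "Male Breast Cancer" row is a hypothetical addition whose only purpose is to make one term polysemous.

```python
from collections import defaultdict

# (concept, term) pairs; the last row is hypothetical.
LABELS = [
    ("Breast Cancer", "Carcinoma of the Breast"),
    ("Breast Cancer", "Mammary Carcinoma"),
    ("Breast Cancer", "Breast Tissue Malignant Neoplasm"),
    ("Breast Cancer", "cancer du sein"),
    ("Breast Cancer", "rakovina prsu"),
    ("Male Breast Cancer", "Mammary Carcinoma"),
]


def index(pairs):
    """Build both directions of the concept/term mapping."""
    terms_of, concepts_of = defaultdict(set), defaultdict(set)
    for concept, term in pairs:
        terms_of[concept].add(term)
        concepts_of[term].add(concept)
    return terms_of, concepts_of


terms_of, concepts_of = index(LABELS)

# One concept, many terms: synonymy (including cross-lingual equivalents).
synonym_sets = {c: t for c, t in terms_of.items() if len(t) > 1}
# One term, many concepts: polysemy.
polysemous = {t: c for t, c in concepts_of.items() if len(c) > 1}

print(synonym_sets)
print(polysemous)
```

Any non-empty result in either dictionary signals entanglement that the librarian has to diagnose before an equivalence mapping is implemented.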

Third, given the elucidation of terminological entanglement, let us now elaborate the notion of ontological entanglement, which refers to the necessary existence of a many-to-many mapping between labelled concepts and their ontological commitment. There are two interlinked dimensions underlying the above entanglement. First, due to perceptual and subsequent terminological entanglement, different communities of library users can perceive and label the same entity differently, thereby bootstrapping the many-to-many mapping between different entities/concepts and the different top-level ontological categories they might potentially be categorised into. Second, as a consequence of the first reason, different communities of library users, unknowingly and implicitly, commit to the philosophical doctrine of one of the state-of-the-art top-level ontologies such as DOLCE (Gangemi et al. 2002), BFO (Smith, Kumar and Bittner 2005), UFO (Guizzardi et al. 2022), etc., thereby adding a second layer of many-to-many entanglement between entities and top-level ontological theories. The ontological entanglement, resulting in a multiplicity of representational manifoldness between labelled concepts and their top-level ontological characterization, in effect translates into application-level implementational entanglements such as the management of interoperability and harvesting of metadata in networked academic library settings (Taha 2012). To exemplify, in the motivating example, let us consider the ontological entanglement in the oncology policy library metadata model as detailed on page 27. Notice that the same concept of a Policy Document has been categorised into different ontological categories (Information Object, Artefact, Continuant) by different top-level ontologies (DOLCE, UFO, BFO), respectively. Further, each of these categories can be instantiated for various other labelled concepts, e.g., textbook on page 28 (having a different semantics).

第三,在阐明了术语纠缠之后,让我们现在详细阐述本体论纠缠的概念,它指的是标记概念与其本体论承诺之间必然存在的多对多映射关系。上述纠缠背后有两个相互关联的维度。首先,由于感知和随后的术语纠缠,不同的图书馆用户群体可以对同一实体进行不同的感知和标记,从而在不同实体/概念与它们可能被分类的不同顶级本体类别之间形成多对多映射。其次,由于第一个原因,不同的图书馆用户群体在不知不觉中隐式地承诺了诸如DOLCE (Gangemi et al. 2002)、BFO (Smith, Kumar and Bittner 2005)、UFO (Guizzardi et al. 2022) 等顶级本体的哲学学说,从而在实体与顶级本体理论之间增加了第二层的多对多纠缠。本体论纠缠导致标记概念与其顶级本体表征之间的多重表示多样性,实际上转化为应用层面的实现纠缠,例如在网络学术图书馆环境中管理互操作性和元数据收割 (Taha 2012)。举例来说,在动机示例中,让我们考虑第27页详细描述的肿瘤学政策图书馆元数据模型中的本体论纠缠。请注意,同一个政策文档概念被不同的顶级本体 (DOLCE、UFO、BFO) 分别分类为不同的本体类别 (信息对象、人工制品、持续体)。此外,这些类别中的每一个都可以为其他各种标记概念实例化,例如第28页的教科书 (具有不同的语义)。

Fourth, with the assumption of ontological entanglement and ontological constraints, let us now elucidate the taxonomical entanglement sub-problem, which refers to the necessary existence of a many-to-many mapping between ontologically enriched concepts and their taxonomic classification which would eventually constitute the backbone of a library metadata model. There are four key parameters, from Ranganathan's faceted classification theory (Ranganathan 1967; Ranganathan 1989), which induce representational manifoldness at this stage. First, with respect to a concept at a specific level of abstraction in the taxonomy, there are always multiple characteristics which can be employed to taxonomically specialise that concept into (potentially many) subordinate concepts. Second, the successive application of characteristics across the entire depth of the taxonomy (with the possibility of multiple classificatory characteristics at each level of abstraction) leads to potentially infinite entangled classifications. Third and fourth, there can also be multiple ways in which concepts can be organised horizontally across a specific level of taxonomic abstraction (termed arrays in (Ranganathan 1967)) and vertically across a taxonomic path (termed chains in (Ranganathan 1967)), respectively. To exemplify the above, in the motivating example, please refer to pages 30-34, which detail the LLM response to the prompt requesting the taxonomical choices made. Although simplistic, note that the Human-LLM prompt collaboration returns two equally probable but different entangled taxonomies (page 31: example 1 and example 2) for the oncology policy library user community. It is interesting to observe that in example 1, the taxonomy is implicitly suited for a library metadata model from a social-healthcare policy perspective, whereas in example 2, the taxonomy is implicitly suited for a library metadata model from a governmental-healthcare policy perspective.

第四,在本体纠缠和本体约束的假设下,我们现在来阐述分类学纠缠子问题,该问题指的是在本体丰富的概念与其分类学分类之间必然存在多对多的映射关系,这种映射最终将构成图书馆元数据模型的骨架。根据Ranganathan的分面分类理论(Ranganathan 1967; Ranganathan 1989),有四个关键参数在这一阶段引发了表示的多样性。首先,对于分类学中某一特定抽象层次的概念,总是存在多个特征,可以用来将该概念分类为(可能多个)下属概念。其次,在整个分类学深度上连续应用特征(在每个抽象层次上可能存在多个分类特征)会导致潜在的无限纠缠分类。第三和第四,概念在特定分类学抽象层次上的水平组织方式(在Ranganathan 1967中称为数组)和沿分类路径的垂直组织方式(在Ranganathan 1967中称为链)也可以有多种方式。为了举例说明上述内容,请参考第30-34页,详细描述了大语言模型对请求分类选择的提示的响应。尽管简单,但请注意,人类与大语言模型的提示协作返回了两个概率相等但不同的纠缠分类(第31页:示例1和示例2),适用于肿瘤学政策图书馆用户社区。有趣的是,在示例1中,分类学隐含地适合社会医疗政策视角的图书馆元数据模型,而在示例2中,分类学隐含地适合政府医疗政策视角的图书馆元数据模型。
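The first parameter, the choice of classificatory characteristic, can be illustrated with a short sketch. The three policy document types are named in the text; the characteristics `focus` and `issuer` and their values are hypothetical, chosen only so that the two groupings differ.

```python
from collections import defaultdict

# Characteristics and their values are hypothetical.
CONCEPTS = [
    {"name": "Clinical Guideline", "focus": "care delivery", "issuer": "professional body"},
    {"name": "Best Practice", "focus": "care delivery", "issuer": "government"},
    {"name": "Regulatory Policy", "focus": "compliance", "issuer": "government"},
]


def specialise(concepts, characteristic):
    """Group sibling concepts under one classificatory characteristic."""
    groups = defaultdict(list)
    for concept in concepts:
        groups[concept[characteristic]].append(concept["name"])
    return dict(groups)


by_focus = specialise(CONCEPTS, "focus")
by_issuer = specialise(CONCEPTS, "issuer")

print(by_focus)
print(by_issuer)
```

The same three concepts fall into different groupings depending on the characteristic, and repeating this choice at every level of abstraction multiplies the candidate taxonomies.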

Last but not least, with the assumption of taxonomical entanglement, let us now elucidate the problem of intensional entanglement, which refers to the necessary existence of a many-to-many mapping between the taxonomic concepts and their intensional interrelation and description. There are two key factors which magnify representational manifoldness in this final stage. First, during the final decision to characterise a property (uncovered at the ontological level) as an object or a data property, there exists a many-to-many mapping, as the same property can be represented and refactored as an object or a data property depending on the purposes and goals to be served by the library metadata model. Second, a concept in the taxonomy can be interrelated and described via multiple possible combinations of sets of object and data properties, again depending on the precise purpose of the final metadata model. To exemplify the entanglement, in the motivating example, please refer to pages 34-39, which detail the LLM response to the prompt requesting the object and data property choices made. For example, the Human-LLM collaboration generated two equally probable but different sets of object and data properties (page 36), implicitly suited for the medical students and medical study participants user communities of a medical college library, respectively.

最后但同样重要的是,在假设分类学纠缠的前提下,我们现在来阐明内涵纠缠的问题,这指的是分类学概念与其内涵相互关系和描述之间必然存在多对多的映射关系。在这个最后阶段,有两个关键因素放大了表示的多样性。首先,在最终决定将(在本体层面揭示的)属性特征化为对象或数据属性时,存在多对多的映射,因为相同的属性可以根据图书馆元数据模型所要服务的目的和目标,被表示和重构为对象或数据属性。其次,分类学中的一个概念可以通过对象和数据属性集合的多种可能组合来相互关联和描述,这同样取决于最终元数据模型的具体目的。为了举例说明这种纠缠,在动机示例中,请参阅第34-39页,这些页面详细描述了大语言模型对请求对象和数据属性选择的提示的响应。例如,Human-LLM协作示例生成了两组概率相等但不同的对象和数据属性集合(第36页),分别隐含地适用于医学院图书馆的医学生和医学研究参与者用户群体。

Finally, notice that the aforementioned level-by-level representational manifoldness is cumulative in nature and, therefore, there is also an overall manifoldness between entities and their taxonomy and intensional characterization. In the motivating example, pages 39-43 detail at length the LLM response to the prompt requesting the overall many-to-many mapping between entities, taxonomical hierarchies and intensional property-based characterization. For example, even within the same macro domain of discourse cancer, different perceptions by the three different communities of library users (cancer specialists, oncology policy experts and medical college professionals) generate multiple terminologies, ontological categories, taxonomies and property predications, most of which are not in correspondence with each other. Further, as also previously noted in the second section, it might well be the case that for an application scenario, there is no manifoldness in perception or in terminology or both (with, e.g., the other entanglements still holding). In any case, the conceptual entanglement problem is a generic characterization, inherent and implicit by design in any library metadata model, and should be adapted as necessary on a case-by-case basis.

最后,请注意,上述逐层的表示多样性本质上是累积的,因此,在实体与其分类学和内涵特征化之间也存在整体的多样性。在动机示例中,第39-43页详细描述了大语言模型对请求实体、分类层次结构和基于内涵属性的特征化之间整体多对多映射的提示的响应。例如,即使在癌症这一宏观话语领域内,三个不同的图书馆用户群体(癌症专家、肿瘤政策专家和医学院专业人士)的不同感知也会产生多种术语、本体类别、分类法和属性断言,其中大多数彼此之间并不对应。此外,正如第二部分中提到的,对于某个应用场景,可能在感知或术语或两者上都不存在多样性(而例如其他纠缠仍然存在)。无论如何,概念纠缠问题是任何图书馆元数据模型在设计上固有且隐含的通用特征,应根据具体情况适当调整。

Towards Generative AI-driven Disentangled Metadata Modelling

迈向生成式 AI 驱动的解耦元数据建模

After the elucidation of how the conceptual entanglement problem is inherent, by design, at each representation level, let us now describe in detail how a metadata librarian can exploit the conceptual disentanglement approach, via Generative AI-driven Human-LLM collaboration, to disentangle entangled metadata within the framework of an academic library. To that end, we have the following:

在阐明了概念纠缠问题如何在每个表示层级上固有存在之后,我们现在详细描述元数据管理员如何通过生成式 AI (Generative AI) 驱动的人与大语言模型 (LLM) 协作,利用概念解缠方法,在学术图书馆框架内解缠纠缠的元数据。为此,我们有以下内容:

In order to exemplify the stratification above, the same motivating example of the ontology-driven library metadata model for the cancer domain is employed (pages 8-43). Note also that the stratification of conceptual disentanglement proposed above is in sync with both the stratification of representation in metadata and the stratification of conceptual entanglement proposed in the second and the third sections, respectively. The above levels are elucidated as follows.

为了举例说明上述分层结构,我们采用了癌症领域本体驱动的图书馆元数据模型的相同激励示例(第8-43页)。还需注意的是,上述提出的概念解缠分层结构与第二和第三部分分别提出的概念纠缠分层和元数据表示分层是同步的。上述层次结构如下所述。

First, let us concentrate on the perceptual disentanglement sub-approach, which advocates that the metadata librarian (i.e., the modeller) decide on an explicit one-to-one mapping between entities and their perception as concepts. As described and exemplified in the third section, even for the same (domain) entities, there is an inherent representational manifoldness which entangles the resultant library metadata model. There are two key factors which underlie the disentanglement approach of a metadata librarian at this level. First, the metadata librarian has to generate, via prompt engineering (similar to that in the motivating example), an initial characterization of the perception he/she wants to follow in designing the metadata model. Notice that the goals and purposes which such a model would potentially serve are a key influence on the design of prompts at this stage to elicit a reasonable response from the LLM. Given an initial record of perception generated via the above Human-LLM collaboration, the next crucial step for the metadata librarian is to employ (a combination of)

首先,让我们专注于感知解耦子方法,该方法主张元数据馆员(即建模者)在实体与其作为概念的感知之间决定一个明确的一对一映射。正如第三部分所描述和示例的那样,即使对于相同的(领域)实体,也存在一种固有的表示多样性,这会导致最终的库元数据模型变得复杂。元数据馆员在这一层次的解耦方法有两个关键因素。首先,元数据馆员必须通过提示工程(类似于激励示例中的方法)生成他/她希望在设计元数据模型时遵循的感知的初始特征描述。需要注意的是,这样的模型可能服务的目标和目的是影响此阶段提示设计的关键因素,以便从大语言模型中获得合理的响应。在通过上述人机协作生成感知的初始记录后,元数据馆员的下一个关键步骤是采用(组合的)

several standard techniques to validate and consolidate the generated perception depending on the criticality of the use of the metadata model. To that end, he/she can conduct a lightweight digital ethnographic exercise (Varis 2015) about the user community and its perception of a domain. He/she can also employ (digital) focus groups (Morgan 1996) and/or one-to-one consultation with domain experts (Drabenstott 2003) to better understand a community's viewpoint about entities of a specific domain. Further, independently or in addition to the above, the metadata librarian can also perform an information and data validation exercise, wherein he/she can explore real-life information and data resources produced by a user community to gauge its perception of a domain or even, if available, reuse existing conceptual disentanglement documentation (thereby reinforcing the spiral nature of the levels). Finally, the above inputs can be consolidated to validate/repair/enrich the initial perception and, hence, facilitate disentangling the perceptual entanglement. To exemplify, in the motivating example, the metadata librarian, after the perception validation exercise, might choose to add a new concept multi-omics workflow and related properties to the perception already generated by the Human-LLM collaboration for the cancer research library user community. Notice also that the initial perception generated via LLM prompting reduces to a significant extent the otherwise laborious exercise of domain analysis (Hjørland and Albrechtsen 1995) on the part of the metadata librarian.

根据元数据模型使用的关键性,可以采用几种标准技术来验证和巩固生成的感知。为此,他/她可以进行轻量级的数字民族志研究(Varis 2015),了解用户社区及其对某个领域的感知。他/她还可以采用(数字)焦点小组(Morgan 1996)和/或与领域专家进行一对一咨询(Drabenstott 2003),以更好地理解社区对特定领域实体的看法。此外,元数据管理员还可以独立或与上述方法结合,进行信息和数据验证工作,探索用户社区产生的现实信息和数据资源,以评估其对某个领域的感知,甚至(如果可用)重用现有的概念解耦文档(从而强化层次的螺旋性质)。最后,可以整合上述输入以验证/修复/丰富初始感知,从而促进感知纠缠的解耦。举例来说,在激励示例中,元数据管理员在感知验证工作后,可能会选择为癌症研究图书馆用户社区的人类-大语言模型协作生成的感知添加一个新概念“多组学工作流”及相关属性。还需注意的是,通过大语言模型提示生成的初始感知也在很大程度上减少了元数据管理员在领域分析(Hjørland 和 Albrechtsen 1995)方面的繁琐工作。

Second, assuming the representational bijection at the perceptual level, let us now focus on terminological disentanglement, which advocates that the metadata librarian decide on an explicit one-to-one mapping between perceived concepts and the terminology which would compose the disentangled library metadata model. The guiding principle which a metadata librarian can employ for disentanglement at the terminology level can be referred to as the principle of user warrant. It is based on the generalised notion of warrant in information science (Nylund 2020; Lancaster 1972) and can be understood as the principle of assigning to a perceived concept a linguistically disambiguated and semantically explicit term which is based on the documented warrant of the (expert) library users. To that end, each (meta)data concept/property term label should embody standard terminological quality, e.g., having a natural language gloss, examples, identifiers, etc. (Miller 1995). There are several (combinations of) options which a metadata librarian can exploit to achieve representational bijection of terminology. If the user warrant is to use relatively commonsense terms to refer to concepts of cancer in, e.g., an oncology policy library, lexical-semantic resources such as WordNet (Miller 1995) might be exploited to achieve the representational bijection. On the other hand, if the user warrant in a specialised cancer research library is to refer to concepts of cancer using scientific terms, resources such as specialised glossaries and healthcare/clinical terminological standards (Schulz, Stegwee and Chronaki 2019) can be exploited to achieve an explicit one-to-one mapping. Further, in multilingual settings, NLP resources like multilingual variations of WordNets and WordNet-like resources (Pianta, Bentivogli and Girardi 2002; Dash, Bhattacharyya and Pawar 2017; Navigli and Ponzetto 2012) and/or open multilingual knowledge bases like Wikidata (Vrandečić and Krötzsch 2014) can be exploited to disentangle and disambiguate terms.

其次,假设在感知层面存在表示双射,我们现在关注术语解耦,这要求元数据管理员在感知概念与构成解耦库元数据模型的术语之间建立明确的一对一映射。元数据管理员在术语层面进行解耦时可以参考的指导原则是用户授权原则。该原则基于信息科学中的广义授权概念(Nylund 2020; Lancaster 1972),可以理解为根据(专家)图书馆用户的文档授权,为感知概念分配一个语言上无歧义且语义明确的术语。为此,每个(元)数据概念/属性术语标签应体现标准的术语质量,例如具有自然语言解释、示例、标识符等(Miller 1995)。元数据管理员可以利用多种(组合)选项来实现术语的表示双射。如果用户授权是在肿瘤学政策库中使用相对常识性的术语来指代癌症概念,那么可以利用诸如WordNet(Miller 1995)等词汇语义资源来实现表示双射。另一方面,如果用户授权是在专门的癌症研究库中使用科学术语来指代癌症概念,那么可以利用诸如专门词汇表和医疗/临床术语标准(Schulz, Stegwee 和 Chronaki 2019)等资源来实现明确的一对一映射。此外,在多语言环境中,可以利用诸如WordNet的多语言变体和类似WordNet的资源(Pianta, Bentivogli 和 Girardi 2002; Dash, Bhattacharyya 和 Pawar 2017; Navigli 和 Ponzetto 2012)和/或开放的多语言知识库如Wikidata(Vrandečić 和 Krötzsch 2014)等NLP资源来解耦和消除术语歧义。

Third, assuming the representational bijection at the terminological level, let us move to the next activity of ontological disentanglement, according to which the metadata librarian has to decide on an explicit one-to-one mapping between labelled concepts and their ontological commitment. Notice that, similar to the disentanglement strategy at the terminological level, the general guidance of user warrant is key to uncovering the ontological commitment of library users at this level. To that end, the metadata librarian can apply a four-stage approach. First, the librarian can prompt an LLM like ChatGPT 3.5 appropriately (see, for instance, the examples on pages 23-29 of the motivating example) to bootstrap an initial 'entangled' version of the alignment of the library metadata model with different state-of-the-art top-level ontologies. Second, the librarian can reuse the outputs and documentation of focus group interviews and/or domain expert consultation and/or relevant conceptual disentanglement documentation (as discussed before in the perceptual disentanglement level) to understand the ontological warrant of the community of library users he/she is attending to. Third, he/she uses the results of the second step to explicitly choose and enrich the best possible ontological fit amongst the different ontology alignments produced in the first step. Finally, the metadata librarian should also explicate the ontological categories of each individual labelled concept and property under consideration with respect to the chosen top-level ontology.

第三,假设在术语层面存在表示的双射,让我们转向本体解缠的下一个活动,即元数据馆员需要决定标记概念与其本体承诺之间的显式一对一映射。请注意,与术语层面的解缠策略类似,用户授权的总体指导是揭示图书馆用户在这一层面的本体承诺的关键。为此,元数据馆员可以应用一个四阶段的方法。首先,馆员可以适当地提示像 ChatGPT 3.5 这样的大语言模型(例如,参见动机示例的第 23-29 页的示例),以引导图书馆元数据模型与不同最先进的顶级本体对齐的初始“纠缠”版本。其次,馆员可以重用焦点小组访谈和/或领域专家咨询的输出和文档,以及/或相关的概念解缠文档(如之前在感知解缠层面讨论的那样),以了解他/她所服务的图书馆用户社区的本体授权。第三,他/她使用第二步的结果来明确选择并丰富在第一步中产生的不同本体对齐中的最佳本体匹配。最后,元数据馆员还应明确每个单独标记的概念和属性的本体类别,这些类别与所选的顶级本体相关。
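The third and fourth stages, choosing one alignment and explicating one category per concept, can be sketched as follows. The candidate categories for Policy Document are those listed earlier in the text; the `commit` helper and the choice of UFO are assumptions made for illustration only.

```python
# Candidate alignments per labelled concept (the 'entangled' LLM output).
CANDIDATES = {
    "Policy Document": {
        "DOLCE": "Information Object",
        "UFO": "Artefact",
        "BFO": "Continuant",
    },
}


def commit(candidates, chosen_ontology):
    """Keep exactly one category per concept under the chosen ontology."""
    committed = {}
    for concept, by_ontology in candidates.items():
        if chosen_ontology not in by_ontology:
            # A gap means the concept still needs a human-made alignment.
            raise KeyError(f"{concept!r} has no {chosen_ontology} alignment")
        committed[concept] = by_ontology[chosen_ontology]
    return committed


# The choice itself is driven by user warrant; 'UFO' is an arbitrary pick here.
print(commit(CANDIDATES, "UFO"))
```

Failing loudly on a missing alignment keeps the human modeller in the loop, instead of silently mixing categories from different top-level ontologies.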

Fourth, assuming the representational bijection at the ontological level, let us now concentrate on taxonomical disentanglement, which advocates that the metadata librarian decide on an explicit one-to-one mapping between ontologically characterised concepts and their precise taxonomy. Given the general direction from the needs of the library and information service(s) which would exploit the final library metadata model, the metadata librarian can follow a four-step disentanglement strategy in response to the four stages of taxonomical entanglement (see the third section). The solution is grounded in the canons of knowledge classification proposed by Ranganathan in his faceted classification theory (Ranganathan 1967; Ranganathan 1989). First, the canons of characteristics (Ranganathan 1967), like the canons of relevance and ascertainability, should be applied to eliminate manifoldness at the level of selecting a classificatory characteristic for specialising concepts at a specific level of abstraction in the taxonomy. Second, the canons of succession of characteristics (Ranganathan 1967), like the canon of relevant succession, should be applied to disentangle the multiplicity existent in how classificatory characteristics are successively applied to design the conceptual depth of the taxonomy. Third and fourth, the canons of arrays and chains (Ranganathan 1967) should be applied to disentangle the manifoldness existent while modelling concepts across a specific horizontal level and across a path of the taxonomy, respectively. To aid the above exercise, the metadata librarian can also consult and reuse relevant state-of-the-art taxonomies if available as open source code. Finally, the above procedure is expected to generate an enriched and disentangled version of one of the many possibilities initially generated via Human-LLM collaboration, e.g., as exemplified in the motivating example (pages 30-34).

第四,假设在本体层面存在表示的双射关系,我们现在专注于分类解耦,这要求元数据图书馆员决定在本体特征化的概念与其精确分类之间建立明确的一对一映射。根据图书馆和信息服务利用最终图书馆元数据模型的需求,元数据图书馆员可以遵循四步解耦策略,以应对分类纠缠的四个阶段(见第三部分)。该解决方案基于Ranganathan在其分面分类理论中提出的知识分类准则(Ranganathan 1967; Ranganathan 1989)。首先,应应用特征准则(Ranganathan 1967),如相关性和可确定性准则,以消除在选择分类特征时的多样性,这些特征用于在分类的特定抽象层次上专门化概念。其次,应应用特征继承准则(Ranganathan 1967),如相关继承准则,以解耦在如何连续应用分类特征来设计分类的概念深度时存在的多样性。第三和第四,应应用数组和链准则(Ranganathan 1967),以分别解耦在特定水平层次和分类路径上建模概念时存在的多样性。为了辅助上述工作,元数据图书馆员还可以参考并重用相关的先进分类法(如果它们作为开源代码可用)。最后,上述过程预计将生成一个丰富且解耦的版本,该版本是初始通过人类与大语言模型协作生成的众多可能性之一,例如在动机示例中所示(第30-34页)。

Last but not least, assuming the representational bijection at the taxonomical level, let us move to the final activity of intensional disentanglement, according to which the metadata librarian has to decide on an explicit one-to-one mapping between the taxonomic concepts and their intensional interrelation and description. To that end, he/she can proceed with a two-step strategy. First, he/she should make explicit a final decision on the exact split of properties into two distinct sets: a set of object properties and a set of data properties. This decision is critically influenced by the information service applications driven by the warrant of a specific community of library users. Given the first step, the next step is to make explicit the decisions on exactly how the object properties would be exploited to interlink the concepts in the disentangled taxonomy, i.e., determining the precise conceptual domains and ranges of the object properties. Further, perhaps most importantly, he/she should determine the exact set of data properties which should describe each concept in the taxonomy (also factoring in the taxonomic inheritance). Together, the above steps can facilitate a representational bijection and, thereby, generate the final disentangled ontology-driven library metadata model out of the many intensionally entangled possibilities initially generated via Generative AI-driven Human-LLM collaboration, e.g., as exemplified in the motivating example (pages 34-39).

最后但同样重要的是,假设在分类学层面存在表示双射,让我们转向最终的内涵解耦活动,根据这一活动,元数据馆员必须决定分类学概念与其内涵相互关系及描述之间的明确一对一映射。为此,他/她可以采取两步策略。首先,他/她应明确决定将属性精确划分为两个不同的集合:对象属性集合和数据属性集合。这一决定受到由特定图书馆用户社区的授权驱动的信息服务应用的关键影响。在完成第一步后,下一步是明确决定如何利用对象属性来互连解耦分类法中的概念,即确定对象属性的精确概念域和范围。此外,也许最重要的是,他/她应确定描述分类法中每个概念的确切数据属性集合(同时考虑分类继承)。上述步骤共同促进表示双射,从而从最初通过生成式AI驱动的人-大语言模型协作生成的众多内涵纠缠可能性中生成最终的解耦本体驱动图书馆元数据模型,例如在激励示例(第34-39页)中所展示的那样。
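The two-step strategy can be sketched as a consistency check over the declared properties. All property names and datatypes below are hypothetical; the sketch only shows that each property should fall into exactly one of the two sets and that every domain and range should point at a concept in the disentangled taxonomy.

```python
CONCEPTS = {"Research Paper", "Dataset"}

# Object properties: name -> (domain concept, range concept). Hypothetical.
OBJECT_PROPERTIES = {"usesDataset": ("Research Paper", "Dataset")}
# Data properties: name -> (domain concept, datatype). Hypothetical.
DATA_PROPERTIES = {"title": ("Research Paper", "string")}


def check(concepts, object_props, data_props):
    """Return a list of problems; an empty list means the split is explicit."""
    problems = []
    for name in sorted(object_props.keys() & data_props.keys()):
        problems.append(f"{name}: declared as both object and data property")
    for name, (domain, range_) in object_props.items():
        for end in (domain, range_):
            if end not in concepts:
                problems.append(f"{name}: unknown concept {end!r}")
    for name, (domain, _datatype) in data_props.items():
        if domain not in concepts:
            problems.append(f"{name}: unknown domain {domain!r}")
    return problems


print(check(CONCEPTS, OBJECT_PROPERTIES, DATA_PROPERTIES))
```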

There are three key observations with respect to the aforementioned elucidation of the levels of conceptual disentanglement. First, as with the second and the third sections, it might well be the case that for a specific application scenario, the conceptual disentanglement exercise needs to be implemented only for specific representation levels and not for all five levels together. Second, notice that the disentanglement at each of the above levels is effectuated by Human-LLM collaboration via appropriate prompt engineering. It is definitely the case that in some levels, e.g., ontological and taxonomical disentanglement, human modellers should be more involved than LLMs due to the very conceptual and semantics-intensive nature of disentanglement. On the other hand, the initial responses generated by LLMs can considerably speed up the bootstrapping of metadata modelling. Notice also that human oversight, even if not demanding, is crucial at each level to minimise the effect of so-called LLM hallucinations (Rawte, Sheth and Das 2023; Chomsky, Roberts and Watamull 2023). Finally, the conceptual disentanglement exercise should be formally documented as a datasheet (Gebru et al. 2021) (for possible future reuse), with all the decisions concerning the disentanglement of the manifoldness into one-to-one correspondences clearly explicated.

关于上述概念解缠层次的阐述,有三个关键观察点。首先,与第二和第三部分一样,对于特定的应用场景,概念解缠练习可能仅与特定的表示层次相关,而不需要同时针对所有五个层次进行。其次,注意到上述每个层次的解缠都是通过人类与大语言模型(LLM)的协作,借助适当的提示工程实现的。在某些层次,例如本体论和分类学解缠,由于解缠的概念性和语义密集性,人类建模者应比大语言模型更深入地参与。另一方面,大语言模型生成的初始响应可以显著加快建模元数据的引导过程。还需注意的是,在每个层次上,即使不要求严格的人类监督,也是至关重要的,以最小化所谓的大语言模型幻觉的影响(Rawte, Sheth 和 Das 2023; Chomsky, Roberts 和 Watamull 2023)。最后,概念解缠练习应正式记录为数据表(Gebru 等人 2021)(以便未来可能的复用),并明确解释所有关于将多样性解缠为一对一对应关系的决策。
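As a rough illustration of the documentation step, the following sketch records disentanglement decisions as a simple datasheet. The record layout, the field names and the example entry are assumptions made for illustration; they are not prescribed by the paper or by Gebru et al. (2021). Each entry keeps the candidate alternatives (the manifoldness), the single alternative chosen, the rationale and who decided.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# The five representation levels named in the paper (identifiers are assumed).
LEVELS = ("perception", "terminology", "ontology", "taxonomy", "intension")


@dataclass
class Decision:
    """One disentanglement decision (hypothetical record layout)."""
    level: str             # representation level the decision belongs to
    candidates: List[str]  # alternatives considered, e.g. suggested by an LLM
    chosen: str            # the single alternative retained
    rationale: str         # why it was retained
    decided_by: str        # human modeller who validated the choice


def record_decision(datasheet: List[Decision], decision: Decision) -> None:
    """Append a decision after basic consistency checks."""
    if decision.level not in LEVELS:
        raise ValueError(f"unknown representation level: {decision.level}")
    if decision.chosen not in decision.candidates:
        raise ValueError("the chosen alternative must be one of the candidates")
    datasheet.append(decision)


def export_datasheet(datasheet: List[Decision]) -> str:
    """Serialise the datasheet as JSON for future reuse."""
    return json.dumps([asdict(d) for d in datasheet], indent=2, ensure_ascii=False)


datasheet: List[Decision] = []
record_decision(datasheet, Decision(
    level="terminology",
    candidates=["tumour", "neoplasm", "cancer"],
    chosen="neoplasm",
    rationale="Term preferred by the target user community (illustrative).",
    decided_by="metadata librarian",
))
print(export_datasheet(datasheet))
```

Keeping the rejected candidates alongside the chosen one is a deliberate choice in this sketch: it preserves the one-to-many starting point so that a later modeller can revisit the decision.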

Research Implications for Libraries and Beyond

图书馆及其他领域的研究意义

The deconstruction of a library metadata model as composed of five disentangled representation levels is a novel way of thinking about library metadata and is directly linked to some of the AI, metadata and semantics-based research issues which academic libraries are increasingly grappling with. To that end, let us position the conceptual entanglement and disentanglement proposal advanced in the current work with respect to three perspectives: library metadata and library knowledge discovery; FAIR data and metadata; and AI, Generative AI and academic libraries. Each of the perspectives is discussed individually as follows.

将图书馆元数据模型解构为五个解耦的表示层次,是一种新颖的思考图书馆元数据的方式,并且直接关联到学术图书馆日益关注的一些基于人工智能、元数据和语义的研究问题。为此,让我们聚焦于与图书馆元数据和图书馆知识发现、FAIR数据和元数据、以及人工智能、生成式AI和学术图书馆相关的总体视角,这些视角与当前工作中提出的概念纠缠和解缠元数据建模建议密切相关。每个视角将分别讨论如下。

Knowledge Discovery, according to the highly cited work in (Fayyad, Piatetsky-Shapiro and Smyth 1996), can be defined as “the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns or knowledge in data”. Notice that what the above definition assumes a priori is data modelling and data annotation based on user warrant, without which eliciting patterns from such data would be an uphill task. In fact, some of these assumptions (alongside many others) were alluded to in later works such as (Pazzani 2000). Notice that metadata-based data modelling and annotation is key to the digital infrastructures of academic libraries, e.g., digital libraries (Borgman 1999), discovery services (Vaughan 2011), etc., which host and expose key assets for the academic library user community, especially the community of data science researchers, to engage in algorithmic knowledge discovery. To that end, the mainstream research in library metadata (see, e.g., (Dunsire and Willer 2010; Dunsire and Willer 2011; Coyle 2015; Peponakis 2013)) has overwhelmingly concentrated on developing metadata schemas (which is key) but has, at best, under-considered the spectrum of human factors in modelling metadata. This has led to long-standing issues with knowledge discovery in academic libraries (Richardson, Srinivasan and Fox 2008), metadata crosswalks (Khoo and Hall 2010), metadata and data interoperability (Pagano, Candela and Castelli 2013), etc. More recently, the CILIP research report (Cox 2021) suggested enhanced emphasis on the human and social aspect and termed knowledge discovery a socio-technical process. The Generative AI-driven conceptual disentanglement proposed in this work, spanning the five representation levels from (human) perception to the formalisation of a library metadata model, can contribute as a research catalyst to this under-researched theme. Moreover, the level-by-level disentanglement of a library metadata model can form the base for developing incremental frameworks and methodologies for attenuating the impact of, e.g., metadata non-interoperability and skewed crosswalks.

知识发现 (Knowledge Discovery) 根据 (Fayyad, Piatetsky-Shapiro 和 Smyth 1996) 中高度引用的工作,可以定义为“在数据中识别有效、新颖、潜在有用且最终可理解的模式或知识的非平凡过程”。需要注意的是,在上述引用的定义中,先验假设的是基于用户授权 (user warrant) 的数据建模和数据标注概念,没有这些概念,从数据中提取模式将是一项艰巨的任务。事实上,一些假设(以及其他许多假设)在后续的工作中被提及,例如 (Pazzani 2000)。需要注意的是,基于元数据 (metadata) 的数据建模和标注是数字基础设施的关键,例如数字图书馆 (Borgman 1999)、发现服务 (Vaughan 2011) 等,这些基础设施为学术图书馆用户社区(尤其是数据科学研究人员社区)提供了关键资源,以进行算法知识发现。为此,图书馆元数据的主流研究(参见,例如 (Dunsire 和 Willer 2010; Dunsire 和 Willer 2011; Coyle 2015; Peponakis 2013))主要集中在开发元数据模式(这是关键),但在建模元数据时,最多只是低估了人类因素的关键范围。这导致了学术图书馆中知识发现的长期问题 (Richardson, Srinivasan 和 Fox 2008)、元数据交叉映射 (Khoo 和 Hall 2010)、元数据和数据互操作性 (Pagano, Candella 和 Castelli 2013) 等。最近,CILIP 研究报告 (Cox 2021) 建议加强对人类和社会方面的重视,并将知识发现称为一个社会技术过程。本文提出的生成式 AI 驱动的概念解耦 (conceptual disentanglement) 从(人类)感知到图书馆元数据模型的形式化 (formalization) 的五个表示层次,可以作为这一研究不足主题的研究催化剂。此外,图书馆元数据模型的逐层解耦可以为开发增量框架和方法奠定基础,以减轻元数据非互操作性和偏差交叉映射等影响。

Next, let us focus on how the current work is directly linked to the upsurge of research in FAIR data (Wilkinson et al. 2016), i.e., Findable, Accessible, Interoperable and Reusable data. As stated in its original vision (Wilkinson et al. 2016), the effort to make data (or, in general, an information resource) FAIR involves a dedicated metadata strategy on the part of the implementing entity. Academic libraries and the library metadata community have been quick to incorporate the FAIR vision (especially the Findability and Accessibility dimensions) and have been successful in building the technological infrastructure (e.g., data catalogues) backing a standalone FAIR implementation (see, for example, (Hettne et al. 2020; Koster and Woutersen-Windhouwer 2018)). However, as the current proposal of conceptual disentanglement shows, FAIRification is a multi-level strategy encompassing all five representation levels: perception, terminology, ontology, taxonomy and intensional characterization. This is key especially for the Interoperability and Reusability dimensions, as representational manifoldness can cause non-interoperability at each of the five representation levels, leading to non-reusability of digital data and information (thereby also defeating the overall purpose of FAIR). Additionally, the above aspect reinforces the importance of ontology-driven metadata modelling: without an ontology structuring and consolidating the (meta)data properties, it is next to impossible, as per the current technology toolbox, to achieve interoperability and reusability both for data and ontology (see, e.g., (Fernández-López et al. 2019; Carriero et al. 2020) for related issues and problems). Finally, while the conceptual disentanglement framework is innovative in its vision of level-by-level FAIRification, it runs the risk of imposing overwhelming intellectual work on, e.g., a metadata librarian. The Generative AI-driven Human-LLM collaboration strategy to achieve conceptual disentanglement via prompt engineering is advanced exactly to mitigate this risk: a substantial share of the initial work can be automated, and the human modeller can then validate and enrich it with his/her expertise, differently in different representation levels.

接下来,让我们关注当前工作如何与FAIR数据(Wilkinson等人,2016)研究热潮直接相关,即可查找、可访问、可互操作和可重用的数据。正如其最初愿景所述(Wilkinson等人,2016),使数据(或一般信息资源)FAIR化的努力涉及实施实体的专用元数据策略。学术图书馆和图书馆元数据社区迅速采纳了FAIR愿景(特别是可查找性和可访问性维度),并在支持独立FAIR实施的技术基础设施(例如数据目录)方面取得了成功(例如,参见Hettne等人,2020;Koster和Woutersen-Windhouwer,2018)。然而,正如当前的概念解耦提案所示,FAIR化是一个多层次策略,涵盖所有五个表示层次:感知、术语、本体、分类和内在特征。这一点尤为关键,特别是对于互操作性和可重用性维度,因为在每个表示层次中,由于表示的多样性可能导致数字数据和信息的不可重用性(从而也违背了FAIR的总体目的)。此外,上述方面也强调了本体驱动的元数据建模的重要性,因为如果没有本体来结构化和巩固(元)数据属性,根据当前的技术工具箱,几乎不可能实现数据和本体的互操作性和可重用性(例如,参见Fernández-López等人,2019;Carriero等人,2020,了解相关问题)。最后,虽然概念解耦框架在其逐层FAIR化的愿景中是创新的,但它可能会给元数据馆员等带来巨大的智力工作负担。生成式AI驱动的人-大语言模型协作策略通过提示工程实现概念解耦,正是为了减轻可以自动化的初始工作的很大一部分,人类建模者可以在不同的表示层次中验证并利用其专业知识丰富它。

Finally, let us also move to a more abstract plane and position the current work in terms of how AI in general and Generative AI in particular have impacted academic librarians, information professionals and users. The CILIP report (Cox 2021), as cited before, provides a well-rounded bird’s-eye view of how AI is going to shape the academic library space in the years to come. The possibilities of generative AI applications in (academic) libraries encompass a rich spectrum of information-intensive activities including, amongst others, intelligent web and mobile search, socio-technical infrastructures for knowledge discovery, chatbots, marketing and user behaviour elicitation, robotics, data science community building and data stewardship. Notice three dimensions which are key to the socio-technical toolbox required to conceptualise and implement all of the above activities. First, all of the above activities depend on the involvement of humans, particularly the warrant of library users but also the technical expertise and knowledge of academic librarians and other information professionals. Second, all of the above activities also depend on the involvement of machines, i.e., computers and computing infrastructures, to semi-autonomously implement digital information services and increase their usage penetration amongst users. Third, all of the above activities crucially depend on metadata, which forms the glue between humans (helping them discover and retrieve data, information and knowledge from machines) and machines (to exploit and fine-tune user warrant, services and information discoverability). In this context, the current proposal of a conceptual disentanglement strategy for modelling library metadata via Generative AI-driven Human-LLM collaboration contributes to all three dimensions. The strategy involves humans (metadata librarians as semantic modellers and users), machines (generative AI technology (Sætra 2023; Jo 2023; Fui-Hoon Nah et al. 2023) in the form of LLMs) and a strategy to model disentangled metadata via greater Human-AI collaboration.

最后,让我们从更抽象的角度出发,探讨当前工作如何定位,以及人工智能(AI)尤其是生成式 AI 对学术图书馆员、信息专业人员和用户的影响。CILIP 报告 (Cox 2021) 提供了一个全面的鸟瞰图,展示了 AI 将如何塑造未来几年的学术图书馆空间。生成式 AI 在(学术)图书馆中的应用可能性涵盖了丰富的信息密集型活动,包括智能网页和移动搜索、知识发现的社会技术基础设施、聊天机器人、营销和用户行为分析、机器人技术、数据科学社区建设和数据管理。需要注意的是,这些活动的概念化和实施所需的社会技术工具箱有三个关键维度。首先,所有这些活动都依赖于人类的参与,特别是图书馆用户的授权,以及学术图书馆员和其他信息专业人员的技术专长和知识。其次,所有这些活动也依赖于机器的参与,即计算机和计算基础设施,以半自主地实施数字信息服务并提高其在用户中的使用渗透率。第三,所有这些活动都关键依赖于元数据,元数据是连接人类(帮助他们从机器中发现和检索数据、信息和知识)和机器(以利用和微调用户授权、服务和信息发现能力)的粘合剂。在此背景下,当前提出的通过生成式 AI 驱动的人类-大语言模型协作来建模图书馆元数据的概念解耦策略,对这三个维度都有贡献。该策略涉及人类(作为语义建模者和用户的元数据图书馆员)、机器(以 LLM 形式存在的生成式 AI 技术 (Sætra 2023; Jo 2023; Fui-Hoon Nah 等 2023))以及通过更大的人类-AI 协作来建模解耦元数据的策略。

Conclusion and Future Work

结论与未来工作

To summarise, the paper presented a novel and completely general Generative AI-driven Human-LLM collaboration approach to model disentangled library metadata free from several levels of representational entanglement which can otherwise have unintended technological consequences. To that end, the paper advanced three key contributions. First, it introduced a novel way of thinking about library metadata in terms of a composition of five functionally interlinked representation levels from perception to intensional definition. Second, it introduced the implicit representational manifoldness existent in each of the five levels, which cumulatively contributes to a conceptually entangled library metadata model. Third, it proposed a Generative AI-driven Human-LLM collaboration based conceptual disentanglement approach to disentangle, level-by-level, the entanglement inherent in the metadata model, leading to a disentangled metadata model which is validated, explanatory and continually enriched via a spiral process. Future research lines in this direction can include a new conceptually disentangled interpretation of FAIR (meta)data in libraries, elevating the approach to a full-fledged general metadata modelling methodology beyond libraries, and a detailed cost-benefit analysis of the Human-LLM collaboration dimension in conceptual disentanglement.

总结来说,本文提出了一种新颖且完全通用的生成式 AI (Generative AI) 驱动的人与大语言模型 (LLM) 协作方法,用于建模解耦的图书馆元数据,避免了多个层次的表示纠缠,否则可能会产生意外的技术后果。为此,本文提出了三个关键贡献。首先,它引入了一种新的思考方式,将图书馆元数据视为从感知到内在定义的五个功能互连表示层次的组合。其次,它引入了存在于这五个层次中的隐含表示流形性,这些流形性共同促成了一个概念上纠缠的图书馆元数据模型。第三,它提出了一种基于生成式 AI 驱动的人与 LLM 协作的概念解耦方法,逐层解耦元数据模型中固有的纠缠,从而形成一个经过验证、具有解释性并通过螺旋过程不断丰富的解耦元数据模型。未来在这一方向的研究可以包括对图书馆中 FAIR (元) 数据的新概念解耦解释,将该方法提升为超越图书馆的完整通用元数据建模方法,以及对人与 LLM 协作在概念解耦中的详细成本效益分析。

References

参考文献

Andrikopoulou, Angeliki, Jennifer Rowley, and Geoff Walton. "Research data management (RDM) and the evolving identity of academic libraries and librarians: A literature review." New Review of Academic Librarianship 28, no. 4 (2022): 349-365.

Andrikopoulou, Angeliki, Jennifer Rowley, 和 Geoff Walton. "研究数据管理 (RDM) 与学术图书馆及图书馆员身份的演变:文献综述." 《新学术图书馆评论》 28, no. 4 (2022): 349-365.

Arp, Robert, and Barry Smith. "Function, role, and disposition in basic formal ontology." Nature Precedings (2008): 1-1.

Arp, Robert, 和 Barry Smith. "基本形式本体论中的功能、角色和倾向." Nature Precedings (2008): 1-1.

Bagchi, Mayukh. "Smart Cities, Smart Libraries and Smart Knowledge Managers: Ushering in the neo-Knowledge Society." 50 Years of LIS Education in North East India (2019b).

Bagchi, Mayukh. "智慧城市、智慧图书馆与智慧知识管理者:引领新知识社会。" 印度东北部图书馆与信息科学教育50年 (2019b).

Bagchi, Mayukh, and D. P. Madalli. "Domain visualization using knowledge cartography in the big data era: A knowledge graph based alternative." In International Conference on Future of Libraries: Jointly organized by Indian Institute of Management (IIM) Bangalore and Indian Statistical Institute (ISI) Kolkata, Bangalore. 2019.

Bagchi, Mayukh, 和 D. P. Madalli. "大数据时代使用知识制图进行领域可视化:一种基于知识图谱的替代方案。" 在由印度管理学院 (IIM) 班加罗尔和印度统计学院 (ISI) 加尔各答联合组织的国际图书馆未来会议上,班加罗尔。2019。

Bagchi, Mayukh. "Conceptualising a Library Chatbot using Open Source Conversational Artificial Intelligence." DESIDOC Journal of Library & Information Technology 40, no. 6 (2020).

Bagchi, Mayukh. "使用开源对话式人工智能设计图书馆聊天机器人的概念。" DESIDOC 图书馆与信息技术杂志 40, no. 6 (2020).

Bagchi, Mayukh. "A large scale, knowledge intensive domain development methodology." Knowledge Organization 48, no. 1 (2021a): 8-23.

Bagchi, Mayukh. "大规模知识密集型领域开发方法论。" Knowledge Organization 48, no. 1 (2021a): 8-23

Bagchi, Mayukh. "Towards knowledge organization ecosystem (KOE)." Cataloging & Classification Quarterly 59, no. 8 (2021b): 740-756.

Bagchi, Mayukh. "迈向知识组织生态系统 (KOE)。" 《编目与分类季刊》 59, no. 8 (2021b): 740-756.

Bagchi, Mayukh. "A Diversity-Aware Domain Development Methodology." In the 41st International Conference on Conceptual Modelling (ER Conference). CEUR Proceedings, India, 2022.

Bagchi, Mayukh. "一种多样性感知的领域开发方法论。" 在第41届国际概念建模会议(ER会议)上。CEUR会议论文集,印度,2022年。

Bagchi, Mayukh, and Subhashis Das. "Conceptually Disentangled Classificatory Ontologies." In the 15th Seminar on Ontology Research in Brazil (ONTOBRAS). CEUR Proceedings, Brazil, 2022.

Bagchi, Mayukh, 和 Subhashis Das. "概念上解耦的分类本体论。" 在第十五届巴西本体论研究研讨会 (ONTOBRAS) 上。CEUR 会议论文集,巴西,2022。

Bagchi, Mayukh, and Subhashis Das. "Disentangling Domain Ontologies." In the 19th Italian Research Conference on Digital Libraries (IRCDL). CEUR Proceedings, Italy, 2023.

Bagchi, Mayukh 和 Subhashis Das. "解耦领域本体。" 在第19届意大利数字图书馆研究会议 (IRCDL) 上。CEUR 会议录,意大利,2023年。

Banh, Leonardo, and Gero Strobel. "Generative artificial intelligence." Electronic Markets 33, no. 1 (2023): 63.

Banh, Leonardo, and Gero Strobel. "生成式人工智能 (Generative Artificial Intelligence)." Electronic Markets 33, no. 1 (2023): 63.

Beneventano, Domenico, Nikolai Dahlem, Sabina El Haoum, Axel Hahn, Daniele Montanari, and Matthias Reinelt. "Ontology-driven semantic mapping." In Enterprise Interoperability III: New Challenges and Industrial Approaches, pp. 329-341. Springer London, 2008.

Beneventano, Domenico, Nikolai Dahlem, Sabina El Haoum, Axel Hahn, Daniele Montanari, and Matthias Reinelt. "本体驱动的语义映射。" 载于《企业互操作性 III:新挑战与工业方法》,第 329-341 页。Springer London, 2008.

Bentivogli, Luisa, and Emanuele Pianta. "Looking for lexical gaps." In Proceedings of the ninth EURALEX International Congress, pp. 8-12. Stuttgart: Universität Stuttgart, 2000.

Bentivogli, Luisa, 和 Emanuele Pianta. "寻找词汇空缺。" 在第九届 EURALEX 国际会议论文集, 第 8-12 页. 斯图加特: 斯图加特大学, 2000.

Biagetti, Maria Teresa. "Ontologies as knowledge organization systems." Knowledge Organization 48, no. 2 (2021): 152-176.

Biagetti, Maria Teresa. "本体作为知识组织系统。" Knowledge Organization 48, no. 2 (2021): 152-176.

Bittner, Thomas, Maureen Donnelly, and Stephan Winter. "Ontology and semantic interoperability." In Large-scale 3D data integration, pp. 139-160. CRC Press, 2005.

Bittner, Thomas, Maureen Donnelly, 和 Stephan Winter. "本体论与语义互操作性." 在《大规模 3D 数据集成》中, 第 139-160 页. CRC Press, 2005.

Borgman, Christine L. "What are digital libraries? Competing visions." Information processing & management 35, no. 3 (1999): 227-243.

Borgman, Christine L. "什么是数字图书馆?竞争中的愿景。" 《信息处理与管理》 35, no. 3 (1999): 227-243.

Boroditsky, Lera. "How language shapes thought." Scientific American 304, no. 2 (2011): 62-65.

Boroditsky, Lera. "语言如何塑造思维。" 《科学美国人》 304, no. 2 (2011): 62-65.

Brown, Roger W., and Eric H. Lenneberg. "A study in language and cognition." The Journal of Abnormal and Social Psychology 49, no. 3 (1954): 454.

Brown, Roger W., 和 Eric H. Lenneberg. "语言与认知研究." The Journal of Abnormal and Social Psychology 49, no. 3 (1954): 454.

Carriero, Valentina Anita, Marilena Daquino, Aldo Gangemi, Andrea Giovanni Nuzzolese, Silvio Peroni, Valentina Presutti, and Francesca Tomasi. "The landscape of ontology reuse approaches." In Applications and practices in ontology design, extraction, and reasoning, pp. 21-38. IOS Press, 2020.

Carriero, Valentina Anita, Marilena Daquino, Aldo Gangemi, Andrea Giovanni Nuzzolese, Silvio Peroni, Valentina Presutti, 和 Francesca Tomasi. "本体复用方法的现状。" 在《本体设计、提取和推理的应用与实践》中,第21-38页。IOS出版社,2020年。

Chang, Yupeng, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen et al. "A survey on evaluation of large language models." ACM Transactions on Intelligent Systems and Technology 15, no. 3 (2024): 1-45.

Chang, Yupeng, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen 等. "大语言模型评估综述." ACM Transactions on Intelligent Systems and Technology 15, no. 3 (2024): 1-45.

Chomsky, N., I. Roberts, and J. Watamull. "The False Promise of ChatGPT/New York Times." New York Times (2023).

Chomsky, N., I. Roberts, 和 J. Watamull. "ChatGPT 的虚假承诺/纽约时报." 纽约时报 (2023).

Cox, Andrew. "What are communities of practice? A comparative review of four seminal works." Journal of information science 31, no. 6 (2005): 527-540.

Cox, Andrew. "什么是实践社区?对四部开创性著作的比较回顾。" 信息科学杂志 31, no. 6 (2005): 527-540.

Cox, Andrew. "The impact of AI, machine learning, automation and robotics on the information professions: a report for CILIP." (2021).

Cox, Andrew. "人工智能、机器学习、自动化和机器人技术对信息职业的影响:CILIP 报告。" (2021)。

Cox, Andrew M., Stephen Pinfield, and Sophie Rutter. "The intelligent library: Thought leaders’ views on the likely impact of artificial intelligence on academic libraries." Library Hi Tech 37, no. 3 (2019): 418-435.

Cox, Andrew M., Stephen Pinfield, 和 Sophie Rutter. "智能图书馆:思想领袖对人工智能对学术图书馆可能影响的看法." Library Hi Tech 37, no. 3 (2019): 418-435.

Coyle, Karen. "FRBR, twenty years on." Cataloging & Classification Quarterly 53, no. 3-4 (2015): 265-285.

Coyle, Karen. "FRBR,二十年回顾。" 编目与分类季刊 53, no. 3-4 (2015): 265-285.

Dash, Niladri Sekhar, Pushpak Bhattacharyya, and Jyoti D. Pawar, eds. The WordNet in Indian Languages. Springer Singapore, 2017.

Dash, Niladri Sekhar, Pushpak Bhattacharyya, 和 Jyoti D. Pawar 编。《印度语言中的 WordNet》。Springer 新加坡出版社,2017 年。

Drabenstott, Karen M. "Do nondomain experts enlist the strategies of domain experts?." Journal of the American Society for Information Science and Technology 54, no. 9 (2003): 836-854.

Drabenstott, Karen M. "非领域专家是否会采用领域专家的策略?" 《美国信息科学与技术学会期刊》 54, no. 9 (2003): 836-854.

Dunsire, Gordon, and Mirna Willer. "Initiatives to make standard library metadata models and structures available to the Semantic Web." In Proceedings of the IFLA World Library and Information Congress (IFLA’10), Gothenborg. 2010.

Dunsire, Gordon, 和 Mirna Willer. "将标准图书馆元数据模型和结构提供给语义网的举措。" 在 IFLA 世界图书馆和信息大会 (IFLA’10) 的会议录中,哥德堡。2010年。

Dunsire, Gordon, and Mirna Willer. "Standard library metadata models and structures for the Semantic Web." Library hi tech news 28, no. 3 (2011): 1-12.

Dunsire, Gordon, 和 Mirna Willer. "标准图书馆元数据模型和语义网结构." Library hi tech news 28, no. 3 (2011): 1-12.

Dutta, Biswanath. "Symbiosis between Ontology and Linked Data." Librarian 21, no. 2 (2014): 15-24.

Dutta, Biswanath. "本体与关联数据之间的共生关系。" Librarian 21, no. 2 (2014): 15-24.

EU-NSF Working Group on Metadata. "Metadata for Digital Libraries: a Research Agenda". https://www.ercim.eu/publication/ws-proceedings/EU-NSF/metadata.html. Accessed June 9, 2024.

EU-NSF 元数据工作组。《数字图书馆元数据:研究议程》。https://www.ercim.eu/publication/ws-proceedings/EU-NSF/metadata.html。访问日期:2024年6月9日。

Ekin, Sabit. "Prompt engineering for ChatGPT: a quick guide to techniques, tips, and best practices." Authorea Preprints (2023).

Ekin, Sabit. "ChatGPT 的提示工程:技术、技巧和最佳实践的快速指南。" Authorea 预印本 (2023)。

Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From data mining to knowledge discovery in databases." AI magazine 17, no. 3 (1996): 37-37.

Fayyad, Usama, Gregory Piatetsky-Shapiro, 和 Padhraic Smyth. "从数据挖掘到数据库中的知识发现." AI杂志 17, no. 3 (1996): 37-37.

Fernández-López, Mariano, María Poveda-Villalón, Mari Carmen Suárez-Figueroa, and Asunción Gómez-Pérez. "Why are ontologies not reused across the same domain?." Journal of Web Semantics 57 (2019): 100492.

Fernández-López, Mariano, María Poveda-Villalón, Mari Carmen Suárez-Figueroa, and Asunción Gómez-Pérez. "为什么同一领域内的本体没有被重用?" Journal of Web Semantics 57 (2019): 100492.

Fleming, Adonna, Margaret Mering, and Judith A. Wolfe. "Library personnel's role in the creation of metadata: A survey of academic libraries." Technical Services Quarterly 25, no. 4 (2008): 1-15.

Fleming, Adonna, Margaret Mering, 和 Judith A. Wolfe. "图书馆人员在元数据创建中的角色:一项学术图书馆的调查。" Technical Services Quarterly 25, no. 4 (2008): 1-15.

Fui-Hoon Nah, Fiona, Ruilin Zheng, Jingyuan Cai, Keng Siau, and Langtao Chen. "Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration." Journal of Information Technology Case and Application Research 25, no. 3 (2023): 277-304.

Fui-Hoon Nah, Fiona, Ruilin Zheng, Jingyuan Cai, Keng Siau, 和 Langtao Chen. "生成式 AI (Generative AI) 与 ChatGPT: 应用、挑战及人机协作." 《信息技术案例与应用研究期刊》 25, no. 3 (2023): 277-304.

Gangemi, Aldo, Nicola Guarino, Claudio Masolo, and Alessandro Oltramari. "Understanding top-level ontological distinctions." OIS@IJCAI 47 (2001).

Gangemi, Aldo, Nicola Guarino, Claudio Masolo, 和 Alessandro Oltramari. "理解顶层本体论区别." OIS@IJCAI 47 (2001).

Gangemi, Aldo, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, and Luc Schneider. "Sweetening ontologies with DOLCE." In International conference on knowledge engineering and knowledge management, pp. 166-181. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002.

Gangemi, Aldo, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, 和 Luc Schneider. "Sweetening ontologies with DOLCE." 在国际知识工程与知识管理会议上,第166-181页。柏林,海德堡:Springer Berlin Heidelberg,2002年。

Gartner, Richard. Metadata. Springer, 2016.

Gartner, Richard. 元数据. Springer, 2016.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. "Datasheets for datasets." Communications of the ACM 64, no. 12 (2021): 86-92.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, 和 Kate Crawford. "数据集的数据表." 《ACM通讯》 64, no. 12 (2021): 86-92.

Glynn, Dylan, and Justyna A. Robinson, eds. Corpus methods for semantics: Quantitative studies in polysemy and synonymy. Vol. 43. John Benjamins Publishing Company, 2014.

Glynn, Dylan 和 Justyna A. Robinson 编. 语义的语料库方法:多义性和同义性的定量研究. 第43卷. John Benjamins 出版公司, 2014.

Guarino, Nicola. "Some organizing principles for a unified top-level ontology." In AAAI Spring Symposium on Ontological Engineering, pp. 57-63. Menlo Park, CA, USA: AAAI Press, 1997.

Guarino, Nicola. "统一顶层本体的一些组织原则。" 在AAAI春季研讨会本体工程中,第57-63页。美国加利福尼亚州门洛帕克:AAAI出版社,1997年。

Guarino, Nicola, Giancarlo Guizzardi, and John Mylopoulos. "On the philosophical foundations of conceptual models." Information Modelling and Knowledge Bases 31, no. 321 (2020): 1.

Guarino, Nicola, Giancarlo Guizzardi, 和 John Mylopoulos. "论概念模型的哲学基础." 《信息建模与知识库》 31, no. 321 (2020): 1.

Guarino, Nicola, Massimiliano Carrara, and Pierdaniele Giaretta. "Formalizing ontological commitment." In AAAI, vol. 94, pp. 560-567. 1994.

Guarino, Nicola, Massimiliano Carrara, 和 Pierdaniele Giaretta. "形式化本体承诺." 在 AAAI, 卷 94, 页 560-567. 1994.

Guizzardi, Giancarlo, Alessander Botti Benevides, Claudenir M. Fonseca, Daniele Porello, João Paulo A. Almeida, and Tiago Prince Sales. "UFO: Unified foundational ontology." Applied ontology 17, no. 1 (2022): 167-210.

Guizzardi, Giancarlo, Alessander Botti Benevides, Claudenir M. Fonseca, Daniele Porello, João Paulo A. Almeida, and Tiago Prince Sales. "UFO: 统一基础本体论." Applied ontology 17, no. 1 (2022): 167-210.

Guptill, Stephen C. "Metadata and data catalogues." Geographical information systems 2 (1999): 677-692.

Guptill, Stephen C. "元数据与数据目录。" 地理信息系统 2 (1999): 677-692.

Han, Myung-Ja, and Patricia Hswe. "The evolving role of the metadata librarian." Library Resources & Technical Services 54, no. 3 (2011): 129-141.

Han, Myung-Ja, 和 Patricia Hswe. "元数据馆员角色的演变." Library Resources & Technical Services 54, no. 3 (2011): 129-141.

Haynes, David. Metadata for Information Management and Retrieval: Understanding metadata and its use. Facet Publishing, 2018.

Haynes, David. 信息管理与检索的元数据:理解元数据及其应用。Facet Publishing, 2018.

Hettne, Kristina Maria, Peter Verhaar, Erik Schultes, and Laurents Sesink. "From FAIR leading practices to FAIR implementation and back: An inclusive approach to FAIR at Leiden University Libraries." Data science journal 19 (2020): 40-40.

Hettne, Kristina Maria, Peter Verhaar, Erik Schultes, 和 Laurents Sesink. "从FAIR领先实践到FAIR实施及反馈:莱顿大学图书馆的包容性FAIR方法." 数据科学期刊 19 (2020): 40-40.

Hjørland, Birger. "Knowledge organization (KO)." Knowledge Organization 43, no. 6 (2016): 475-484.

Hjørland, Birger. "知识组织 (Knowledge Organization, KO)." 《知识组织》 43, no. 6 (2016): 475-484.

Hjørland, Birger. "Reviews of Concepts in Knowledge Organization." Knowledge Organization 44, no. 2 (2017).

Hjørland, Birger. "知识组织中的概念评述。" 《知识组织》 44, no. 2 (2017).

Hjørland, Birger, and Hanne Albrechtsen. "Toward a new horizon in information science: Domain‐analysis." Journal of the American society for information science 46, no. 6 (1995): 400-425.

Hjørland, Birger 和 Hanne Albrechtsen. "迈向信息科学的新视野:领域分析 (Domain‐analysis) ." 《美国信息科学学会杂志》46, 第 6 期 (1995): 400-425.

Hull, Richard. "Managing semantic heterogeneity in databases: a theoretical prospective." In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pp. 51-61. 1997.

Hull, Richard. "管理数据库中的语义异构性:理论展望。" 在第十六届ACM SIGACT-SIGMOD-SIGART数据库系统原理研讨会论文集,第51-61页。1997年。

Jiang, Peng, Sanju Sinha, Kenneth Aldape, Sridhar Hannenhalli, Cenk Sahinalp, and Eytan Ruppin. "Big data in basic and translational cancer research." Nature Reviews Cancer 22, no. 11 (2022): 625-639.

Jiang, Peng, Sanju Sinha, Kenneth Aldape, Sridhar Hannenhalli, Cenk Sahinalp, 和 Eytan Ruppin. "基础与转化癌症研究中的大数据." 《自然评论·癌症》 22, 第 11 期 (2022): 625-639.

Jo, A. "The promise and peril of generative AI." Nature 614, no. 1 (2023): 214-216.

Jo, A. "生成式 AI 的承诺与风险。" 《自然》614, no. 1 (2023): 214-216.

Khoo, Michael, and Catherine Hall. "Merging metadata: a sociotechnical study of crosswalking and interoperability." In Proceedings of the 10th annual joint conference on Digital libraries, pp. 361-364. 2010.

Khoo, Michael, 和 Catherine Hall. "合并元数据:跨映射和互操作性的社会技术研究." 在第十届数字图书馆联合年会论文集, 第 361-364 页. 2010.

Koster, Lukas, and Saskia Woutersen-Windhouwer. "FAIR Principles for Library, Archive and Museum Collections: A proposal for standards for reusable collections." Code4Lib Journal 40 (2018).

Koster, Lukas, 和 Saskia Woutersen-Windhouwer. "图书馆、档案馆和博物馆藏品的FAIR原则:可重用藏品的标准提案。" Code4Lib Journal 40 (2018).

Lancaster, Frederick Wilfrid. "Vocabulary control for information retrieval." (1972).

Lancaster, Frederick Wilfrid. "信息检索的词汇控制。" (1972)。

Leipzig, Jeremy, Daniel Nüst, Charles Tapley Hoyt, Karthik Ram, and Jane Greenberg. "The role of metadata in reproducible computational research." Patterns 2, no. 9 (2021).

Leipzig, Jeremy, Daniel Nüst, Charles Tapley Hoyt, Karthik Ram, 和 Jane Greenberg. "元数据在可重复计算研究中的作用。" Patterns 2, no. 9 (2021).

Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang et al. "Holistic evaluation of language models." arXiv preprint arXiv:2211.09110 (2022).

Liang, Percy, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang 等. "语言模型的整体评估." arXiv 预印本 arXiv:2211.09110 (2022).

Lopatin, Laurie. "Metadata practices in academic and non-academic libraries for digital projects: A survey." Cataloging & Classification Quarterly 48, no. 8 (2010): 716-742.

Lopatin, Laurie. "学术与非学术图书馆数字项目中的元数据实践:一项调查。" Cataloging & Classification Quarterly 48, no. 8 (2010): 716-742.

McCall, Storrs, and Edward Jonathan Lowe. "The 3D/4D controversy: a storm in a teacup." Noûs 40, no. 3 (2006): 570-578.

McCall, Storrs, 和 Edward Jonathan Lowe. "The 3D/4D controversy: a storm in a teacup." Noûs 40, no. 3 (2006): 570-578.

Miller, George A. "WordNet: a lexical database for English." Communications of the ACM 38, no. 11 (1995): 39-41.

Miller, George A. "WordNet: 一个英语词汇数据库。" Communications of the ACM 38, no. 11 (1995): 39-41.

Moltmann, Friederike. "Natural language and its ontology." Metaphysics and cognitive science (2019): 206-32.

Moltmann, Friederike. "自然语言及其本体论。" 形而上学与认知科学 (2019): 206-32.

Morgan, David L. "Focus groups." Annual review of sociology 22, no. 1 (1996): 129-152.

Morgan, David L. "焦点小组." 《社会学年度评论》 22, 第1期 (1996): 129-152.

Navigli, Roberto, and Simone Paolo Ponzetto. "BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network." Artificial intelligence 193 (2012): 217-250.

Navigli, Roberto, 和 Simone Paolo Ponzetto. "BabelNet: 自动构建、评估和应用广泛覆盖的多语言语义网络." 人工智能 193 (2012): 217-250.

Nylund, Inger Beate. "Using the Concept of Warrant in Designing Metadata for Enterprise Search." In Knowledge Organization at the Interface, pp. 328-337. Ergon-Verlag, 2020.

Nylund, Inger Beate. "在企业搜索元数据设计中使用授权概念。" In Knowledge Organization at the Interface, pp. 328-337. Ergon-Verlag, 2020.

Pagano, Pasquale, Leonardo Candela, and Donatella Castelli. "Data Interoperability." Data Science Journal 12 (2013).

Pagano, Pasquale, Leonardo Candela, 和 Donatella Castelli. "数据互操作性 (Data Interoperability)." Data Science Journal 12 (2013).

Park, Jung-Ran. "Metadata quality in digital repositories: A survey of the current state of the art." Cataloging & classification quarterly 47, no. 3-4 (2009): 213-228.

Park, Jung-Ran. "数字存储库中的元数据质量:当前技术现状调查." Cataloging & classification quarterly 47, no. 3-4 (2009): 213-228.

Pazzani, Michael J. "Knowledge discovery from data?." IEEE intelligent systems and their applications 15, no. 2 (2000): 10-12.

Pazzani, Michael J. "从数据中发现知识?." IEEE 智能系统及其应用 15, no. 2 (2000): 10-12.

Peponakis, Manolis. "Libraries’ metadata as data in the era of the semantic web: modeling a repository of master theses and PhD dissertations for the web of data." Journal of Library Metadata 13, no. 4 (2013): 330-348.

Peponakis, Manolis. "语义网时代图书馆元数据作为数据:为数据网建模硕士论文和博士论文的存储库。" Journal of Library Metadata 13, no. 4 (2013): 330-348.

Pianta, Emanuele, Luisa Bentivogli, and Christian Girardi. "MultiWordNet: developing an aligned multilingual database." In First international conference on global WordNet, pp. 293-302. 2002.

Pianta, Emanuele, Luisa Bentivogli, 和 Christian Girardi. "MultiWordNet: 开发一个对齐的多语言数据库." 在首届全球WordNet国际会议上, 第293-302页. 2002.

Poels, Geert, Ann Maes, Frederik Gailly, and Roland Paemeleire. "Measuring the perceived semantic quality of information models." In Perspectives in Conceptual Modeling: ER 2005 Workshops AOIS, BP-UML, CoMoGIS, eCOMO, and QoIS, Klagenfurt, Austria, October 24-28, 2005. Proceedings 24, pp. 376-385. Springer Berlin Heidelberg, 2005.

Ranganathan, Shiyali Ramamrita. "Library science and scientific method." Annals of Library Science (1957): 19-32.

Ranganathan, Shiyali Ramamrita. Prolegomena to library classification. Asia Publishing House (New York), 1967.

Ranganathan, Shiyali Ramamrita. Philosophy of library classification. Sarada Ranganathan Endowment for Library Science (Bangalore, India), 1989.

Rawte, Vipula, Amit Sheth, and Amitava Das. "A survey of hallucination in large foundation models." arXiv preprint arXiv:2309.05922 (2023).

Sætra, Henrik Skaug. "Generative AI: Here to stay, but for good?." Technology in Society 75 (2023): 102372.

Richardson, W. Ryan, Venkat Srinivasan, and Edward A. Fox. "Knowledge discovery in digital libraries of electronic theses and dissertations: an NDLTD case study." International Journal on Digital Libraries 9 (2008): 163-171.

Sarginson, R. E., N. Taylor, M. A. De La Cal, and H. K. F. Van Saene. "Glossary of terms and definitions." Infection Control in the Intensive Care Unit (2012): 3-16.

Satija, M. P., Mayukh Bagchi, and Daniel Martínez-Ávila. "Metadata management and application." Library Herald 58, no. 4 (2020): 84-107.

Schulz, Stefan, Robert Stegwee, and Catherine Chronaki. "Standards in healthcare data." Fundamentals of clinical data science (2019): 19-36.

Smith, Barry, Anand Kumar, and Thomas Bittner. "Basic formal ontology for bioinformatics." (2005).

Strecker, Dorothea, Roland Bertelmann, Helena Cousijn, Kirsten Elger, Lea Maria Ferguson, David Fichtmüller, Hans-Jürgen Goebelbecker et al. "Metadata schema for the description of research data repositories: Version 3.1." (2021).

Suonuuti, Heidi. Guide to terminology. Helsinki: Tekniikan Sanastokeskus, 1997.

Taha, Ahmed. "Networked library services in a research‐intensive university." The Electronic Library 30, no. 6 (2012): 844-856.

Tedd, Lucy A., and J. Andrew Large. Digital libraries: principles and practice in a global environment. Walter de Gruyter–KG Saur, 2004.

Ulrich, Hannes, Ann-Kristin Kock-Schoppenhauer, Noemi Deppenwiese, Robert Gött, Jori Kern, Martin Lablans, Raphael W. Majeed et al. "Understanding the nature of metadata: systematic review." Journal of medical Internet research 24, no. 1 (2022): e25440.

Van Dijck, José. "Datafication, dataism and dataveillance: Big Data between scientific paradigm and ideology." Surveillance & society 12, no. 2 (2014): 197-208.

Varis, Piia. "Digital ethnography." In The Routledge handbook of language and digital communication, pp. 55-68. Routledge, 2015.

Vaughan, Jason. Web scale discovery services. American Library Association, 2011.

Von Fintel, Kai, and Irene Heim. "Intensional semantics." Unpublished lecture notes (2011).

Vrandečić, Denny, and Markus Krötzsch. "Wikidata: a free collaborative knowledge base." Communications of the ACM 57, no. 10 (2014): 78-85.

Weibel, Stuart, John Kunze, Carl Lagoze, and Misha Wolf. "RFC2413: Dublin Core Metadata for Resource Discovery." (1998).

Wenger, Etienne. Communities of practice: Learning, meaning, and identity. Cambridge university press, 1999.

Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg et al. "The FAIR Guiding Principles for scientific data management and stewardship." Scientific data 3, no. 1 (2016): 1-9.

Yu, Holly, and Scott Breivold, eds. Electronic Resource Management in Libraries: Research and Practice. IGI Global, 2008.

Zeng, Marcia Lei. "Knowledge organization systems (KOS)." Knowledge Organization 35, no. 2-3 (2008): 160-182.

Author Bio

Mayukh Bagchi is a researcher specializing in Information Science and Artificial Intelligence at the Department of Information Engineering and Computer Science (DISI) at the University of Trento, Italy, where he has also been a Graduate Teaching Assistant for the Knowledge Graph Engineering course of the Master of AI and Data Science since 2020. He is also an affiliated researcher at the Institute for Globally Distributed Open Research and Education (IGDORE). He holds graduate degrees in Mathematics, Computer Science and Information Science. He has participated in several international AI-based research projects, including JIDEP (Horizon Europe Research and Innovation Project), the MIUR-Italy DELPHI Project and the DataScientia Data Space Research Project. His research interests are Interdisciplinary AI, Interdisciplinary Information Science, Human-AI Interaction, Philosophy of Machine and Deep Learning, Foundations of Mathematics and Literary Epistemology.
