A Generative AI-driven Metadata Modelling Approach
Mayukh Bagchi
DISI, University of Trento, Italy. Institute for Globally Distributed Open Research and Education (IGDORE).
Abstract
For decades, the modelling of metadata has been core to the functioning of any academic library. Its importance has only grown with the increasing pervasiveness of Generative Artificial Intelligence (AI)-driven information activities and services which constitute a library’s outreach. However, with the rising importance of metadata, several outstanding problems have arisen in the process of designing a library metadata model, impacting its reusability, crosswalk and interoperability with other metadata models. This paper posits that the above problems stem from an underlying thesis that there should only be a few core metadata models which would be necessary and sufficient for any information service using them, irrespective of the heterogeneity of intra-domain or inter-domain settings. To that end, this paper advances a contrary view of the above thesis and substantiates its argument in three key steps. First, it introduces a novel way of thinking about a library metadata model as an ontology-driven composition of five functionally interlinked representation levels, from perception to its intensional definition via properties. Second, it introduces the representational manifoldness implicit in each of the five levels, which cumulatively contributes to a conceptually entangled library metadata model. Finally, and most importantly, it proposes a Generative AI-driven Human-Large Language Model (LLM) collaboration based metadata modelling approach to disentangle the entanglement inherent in each representation level, leading to the generation of a conceptually disentangled metadata model. Throughout the paper, the arguments are exemplified by motivating scenarios and examples from representative libraries handling cancer information.
Keywords
Generative AI, Metadata, Ontology-Driven Metadata Models, Academic Libraries and AI, Large Language Models, Human-LLM Collaboration, Knowledge Organization, Knowledge Representation.
Introduction
Since the 1980s (Yu and Breivold 2008) and increasingly from the 2000s (Tedd and Large 2004), the development and continual management of (digitised) metadata (Satija, Bagchi and Martínez-Ávila 2020) has been core to the mundane functioning of any well-administered academic library. With the advent and increasing pervasiveness of AI (Cox, Pinfield and Rutter 2019) and Generative AI (Banh and Strobel 2023), however, the scope of an academic library has radically expanded to include the deployment of AI-based data-driven information services, e.g., research data management (Andrikopoulou, Rowley and Walton 2022), ontology-driven content management (Bagchi 2021a; Bagchi 2021b), chatbots (Bagchi 2020), all of which crucially depend on a well-founded metadata model (semantically) annotating and exposing the underlying data. For instance, let us consider the motivating scenario of semantically annotating scientific information in cancer research (e.g., cancer big data (Jiang et al. 2022)) within the framework of an academic library. Such an exercise is clearly dependent on a multitude of factors, e.g., the precise purpose, the target user base, the technical features, etc., each of which mutually defines the information service(s) that will harness such information. To that end, for example, a metadata model designed by a metadata librarian (Han and Hswe 2011) annotating, e.g., cancer big data, for a specialised cancer research library will be significantly (if not completely) different from one suited for an oncology policy library which, again, would be considerably different from a metadata model employed by a medical college library.
The general thesis advanced by this paper, as evidenced from the aforementioned motivating scenario and several other similar use-cases (see, e.g., (Van Dijck 2014; Gartner 2016; Ulrich et al. 2022)), is that there is no unique metadata model which is necessary and sufficient for semantic data annotation within a single domain, and certainly not for the heterogeneity inherent in cross-domain use-case scenarios (see also (EU-NSF 2024)). Let us understand, via the lens of the motivating scenario, the interlinked representation levels which cumulatively constitute a metadata model as indicated in the above thesis. First, the perception of the (expert) users of a cancer research library towards cancer research data would be highly specialised (e.g., uncovering multi-omics workflows (Ulrich et al. 2022)) and hence considerably different from how it is perceived by the (expert) users of an oncology policy library or a medical college library. Second, as a partial consequence of the first reason, the terminology employed by the users of the three types of libraries to describe the various perceived concepts of cancer research data would be mutually different. Third, the decision to ontologically characterise a perceived data concept as, e.g., a function or a process (Arp and Smith 2008) or an object property or a data property (Bagchi 2022) etc., by users of the three different types of libraries would be mutually different. Fourth, as a direct consequence of the ontological characterization, the taxonomy (Bagchi and Madalli 2019) assumed by the users of the three types of libraries would be markedly different. Fifth, the intensional characterization (see, e.g., (Von Fintel and Heim 2011), for an overview of intensional semantics) of the taxonomy in terms of interrelating and describing its constituent concepts via object properties and data properties would be different for different sets of (expert) library users.
Finally, it is also key to note that the ordered representation levels, as briefed above, each independently as well as cumulatively compound the design of any metadata model (e.g., developed by one of the three aforementioned libraries) and complicate its crosswalk with any other metadata model (e.g., developed by the other two).
A knowledge model (e.g., a metadata model) is representationally manifold by design and cannot be necessary and sufficient for all use-cases irrespective of their intra-domain or inter-domain setting. Let us understand the above in terms of the motivating scenario. First, there is always a many-to-many correspondence (hereafter referred to as representational manifoldness) between entities and how they are perceived as concepts (e.g., by sets of users of the three different libraries). Second, given perception, there is always a representational manifoldness between the perceived concepts and how they are linguistically labelled (e.g., by the same sets of users) using some terminology. Third, given labelling, there is always a representational manifoldness between the labelled concepts and their ontological status (e.g., by the same sets of users). Fourth, as a consequence of the manifoldness in ontological characterisation, there is always a representational manifoldness between the ontologically characterised concepts and how they ought to be taxonomically classified (e.g., as per the warrant of the same set of users). Fifth, there is always a representational manifoldness between the taxonomically classified concepts and how they are intensionally interrelated and described via properties. Two observations follow. Firstly, it is interesting to note how the individual as well as the cumulative impact of the representation layers magnifies the entanglement in the final metadata model. Secondly, and perhaps most importantly, notice the need to explicate the decision of the modeller (e.g., a metadata librarian) at each level which, in an overwhelming majority of cases, remains implicit (Bagchi and Das 2022; Bagchi and Das 2023).
The solution proposed in this paper is a Generative AI-driven LLM-based enhancement of an early version of the approach termed Conceptual Disentanglement (Bagchi and Das 2022; Bagchi and Das 2023). The key focus of the approach is to explicitly disentangle the decisions made by the modeller (e.g., a metadata librarian), at each (knowledge) representation level mentioned before, which would otherwise implicitly entangle the final conceptual knowledge artefact (e.g., the metadata model). To that end, the general strategy of the approach is to explicitly enforce one-to-one correspondence (hereafter referred to as representational bijection) out of the potentially many representational manifold possibilities at each level. To explain using the motivating scenario, first, the metadata librarian (who is the modeller) should explicate a representational bijection between entities and their consensual perception as concepts (e.g., by a particular set of users of one of the libraries). Second, given the fixation of perception, the metadata librarian should explicate a representational bijection between the perceived concepts and their linguistic labelling via a consensual body of terminology. Third, given the fixation of terminology, the metadata librarian should explicate a representational bijection between the labelled concepts and their ontological commitments (Guarino, Carrara and Giaretta 1994), i.e., whether they are events or processes or properties, etc. In fact, an ontology-driven metadata model is key to underpinning the precise semantics represented within (meta)data concepts and properties (see, e.g., the arguments advanced in (Dutta 2014; Leipzig et al. 2021)). Fourth, the metadata librarian should explicate a representational bijection between the ontologically characterised concepts and their exact taxonomy. Fifth, given the taxonomy, the metadata librarian should explicate a representational bijection between each concept in the taxonomy and its exact intensional characterization. Finally, it is interesting to note how the individual as well as the cumulative disentanglement of the representation layers minimises the entanglement in the final metadata model. Notice that the decision of the modeller is also explicit at each representation level, thereby crucially impacting the minimization of the conceptual entanglement.
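The contrast between a many-to-many correspondence and a representational bijection can be made concrete with a small check. The following is only an illustrative sketch, not part of the approach as published: the function name correspondence_report and the sample entities and concepts are hypothetical, and a pair list stands in for whatever record the modeller keeps at a given level.

```python
from collections import defaultdict


def correspondence_report(pairs):
    """Report whether (source, target) pairs form a one-to-one correspondence.

    `pairs` is an iterable of 2-tuples, e.g. (entity, concept) at the
    perceptual level or (concept, term) at the terminological level.
    """
    forward = defaultdict(set)   # source -> targets it is mapped to
    backward = defaultdict(set)  # target -> sources mapped onto it
    for source, target in pairs:
        forward[source].add(target)
        backward[target].add(source)
    ambiguous_sources = {s: sorted(t) for s, t in forward.items() if len(t) > 1}
    overloaded_targets = {t: sorted(s) for t, s in backward.items() if len(s) > 1}
    return {
        "is_bijection": not ambiguous_sources and not overloaded_targets,
        "ambiguous_sources": ambiguous_sources,
        "overloaded_targets": overloaded_targets,
    }


# Hypothetical perceptual-level data: one entity perceived as two concepts.
manifold = [
    ("tumour sample", "Biospecimen"),
    ("tumour sample", "Teaching Case"),
    ("trial record", "Clinical Trial"),
]

# The modeller's explicit choice for one particular user community.
disentangled = [
    ("tumour sample", "Biospecimen"),
    ("trial record", "Clinical Trial"),
]

print(correspondence_report(manifold))
print(correspondence_report(disentangled))
```

The same check could be reused at each of the five levels, since each level is described above as a correspondence between the output of the previous level and a new representational choice.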
While the contextualization, problem and solution approach are clear to some extent, it is not yet clear how the solution approach can be implemented by, e.g., a metadata librarian, within an academic library setting. In fact, there are two critical complementary highlights which should characterise any implementation strategy of the conceptual disentanglement approach for modelling library metadata. First, while the approach attempts to disentangle, level-by-level, the entanglement inherent by design in a metadata model, it runs the risk of considerable intellectual work on the part of the metadata librarian in terms of manually factoring and refactoring constituent knowledge management nitty-gritties (Bagchi 2019a; Bagchi 2019b). On the other hand, given the justified complexity of the conceptual disentanglement approach (e.g., in layers like perception or ontology), an overtly (semi-)automatic implementation is not poised to produce an ontology-driven metadata model of requisite quality. To harness the best of both worlds (i.e., human and machine), the paper proposes a novel implementation of the conceptual disentanglement approach via Generative AI-driven Human-LLM Collaboration, wherein the metadata librarian, using prompt engineering (Ekin 2023), exploits an LLM (Liang et al. 2022; Chang et al. 2024) to generate a conceptually (dis)entangled metadata model which, for each representational level, is validated/repaired/enriched by the librarian. Notice that the validation, repair or enrichment of the Human-LLM collaboratively generated metadata model by the metadata librarian can involve several standard dimensions, e.g., metadata quality (Park 2009), semantic quality (Poels et al. 2005), and additionally purpose-specific dimensions such as, e.g., measuring the fit of the metadata model with actual cancer data/information. Overall, a widespread acceptance and implementation of the conceptual disentanglement approach via Generative AI-driven Human-LLM collaboration is poised to bring to the fore several advantages in terms of, e.g., metadata development methodologies, metadata crosswalk and interoperability (Khoo and Hall 2010), semantic heterogeneity (Hull 1997), FAIR data (Wilkinson et al. 2016), etc.
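One way to picture the Human-LLM collaboration described above is as a loop over the five representation levels in which an LLM drafts and the librarian decides. The sketch below is a hedged illustration under stated assumptions: disentangle, fake_llm and librarian are hypothetical names, the generate and review callables are placeholders supplied by the caller, and no real LLM service or API is invoked.

```python
LEVELS = ["perceptual", "terminological", "ontological", "taxonomical", "intensional"]


def disentangle(generate, review, purpose):
    """Run one pass over the five representation levels.

    generate(level, purpose, accepted) -> draft for that level (stands in
        for a prompt sent to an LLM; hypothetical signature).
    review(level, draft) -> the draft as validated/repaired/enriched by
        the metadata librarian.
    """
    accepted = {}
    for level in LEVELS:
        # Decisions already fixed at earlier levels are passed on as context.
        draft = generate(level, purpose, dict(accepted))
        # The human decision at this level is recorded explicitly.
        accepted[level] = review(level, draft)
    return accepted


# Stand-ins so the sketch runs without any external service.
def fake_llm(level, purpose, accepted):
    return f"draft {level} choices for {purpose} (given {len(accepted)} fixed levels)"


def librarian(level, draft):
    return draft.replace("draft", "approved")


model = disentangle(fake_llm, librarian, "cancer research library")
for level, decision in model.items():
    print(level, "->", decision)
```

Keeping the reviewed output of each level in a single dictionary is a design choice of this sketch; it makes the modeller's decision at every level explicit and inspectable, which is the property the approach asks for.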
The remainder of the paper is organised as follows. The second section describes the different knowledge representation levels involved in designing a metadata model and how such levels are ordered and interlinked to cumulatively impact its design. The third section details the problem of conceptual entanglement, individually and cumulatively across the representation levels, and its confounding impact on the design of the metadata model. The fourth section elucidates the Generative AI-driven Human-LLM collaboration based conceptual disentanglement approach, individually and cumulatively across the representation levels, and the way it minimises the entanglement and confusion in the design of the metadata model. Finally, the fifth section discusses research implications from the related work in terms of some of the (generative) AI, metadata and semantics-based research issues academic libraries face, and the sixth section concludes the paper. Throughout all the sections, especially the second, third and fourth, an adapted instantiation of the motivating scenario in terms of a simplified ontology-driven metadata model for the cancer domain is employed to exemplify and highlight the issues and approaches. Further, several of the fine-grained technical intricacies, e.g., of knowledge representation, are skipped in accordance with the broader methodological scope of the paper.
Representation Levels in Modelling Metadata
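Let us now expand the discussion on the characteristically independent but functionally interlinked knowledge organization and representation levels which cumulatively compose any metadata model within the framework of an academic library. To that end, we have the following five levels, in order: the perceptual level, the terminological level, the ontological level, the taxonomical level and the intensional level.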
Additionally, it is also important to note that the motivating example considered for illustrating the above (and for the remainder of the paper) comprises three equally probable variations of an ontology-driven library metadata model for the cancer domain (e.g., encoding concepts such as Clinical Trial, Patient, Biomarker, Imaging Test, Histopathology Report, etc.) generated by prompting the Generative AI-based LLM (interface) - ChatGPT 3.5 - using a series of prompts. Please note two observations (valid throughout the paper). First, the metadata models cannot be reproduced completely within the full text of this paper due to constraints of space; to that end, please follow the link - motivating example - wherein readers can get a fuller understanding of the concept/property exemplified in the paper and its page location within the linked document describing the generated metadata models. Second, while the metadata model generated by ChatGPT 3.5 is informal in nature, this is not relevant to the scope of this paper as it can be directly formalised in any formal language of choice (e.g., the Web Ontology Language, OWL) for concrete implementation purposes. The above levels are elucidated as follows.
First, let us concentrate on the perceptual level which, while not explicitly pronounced in state-of-the-art library metadata research and implementation (see, e.g., (Haynes 2018; Strecker et al. 2021)), is nonetheless the very first level where the crucial representational choice is made as to how a set of entities should be perceived as concepts (later) composing a library metadata model. Notice two key dimensions which inform the decision-making of a metadata librarian at this level. First, it goes without saying that perception is highly egocentric to an individual academic library user and therefore is usually incomplete (Bagchi 2021a) and cannot be fully captured in a (semi-)formal manner. Second, while the first dimension holds, it is equally the case that various communities of practice (Wenger 1999; Cox 2005) (e.g., users of different types of libraries) commit to a shared and homogeneous perception as to how certain entities should be perceived as concepts. Consider the case of the motivating example. In it, pages 1-7 elaborate a general (ontology-driven) library metadata model for the cancer domain generated by ChatGPT 3.5 with constituent concepts (classes and properties) at three levels of abstraction. Further, note how the general model, via successive prompts, is tuned to the potential perceptions of a representative library user belonging to the specialised cancer research community (pages 8-10), oncology policy community (pages 10-12) and medical college student community (pages 12-14), respectively. It is interesting to observe that even for the same macro domain (i.e., cancer), each of the three metadata models encodes concepts and properties uniquely relevant to the perception of a specific user community.
Second, given the understanding as to why perception is crucial for modelling metadata, let us now turn to the terminological level where the key representational choice is to decide how a set of perceived concepts should be labelled using a body of terminology. This choice is not trivial for the metadata librarian due to the very nature of the interaction of language with the perceptual level. Languages are “itemized [terminological] inventories” (Brown and Lenneberg 1954) of the entities we perceive, and each such “inventory” generates either a similar (but not the same) or a different lexicalization of the perception due to issues following from linguistic relativity (Boroditsky 2011). Such a lexicalization might manifest amongst different sets of academic library users in terms of different linguistic phenomena. For example, two well-known linguistic phenomena are polysemy (the coexistence of several potential meanings of a term) and synonymy (terms having similar meanings) (Glynn and Robinson 2014). Lexical gaps (Bentivogli and Pianta 2000) occur when, for a concept, a specific language/vocabulary does not have a referent term. The scenario might get even more confounded when library user communities communicate with each other or with the computer system by linguistically referring to the perceived entities with terms from specialised terminological standards (Suonuuti 1997) or glossaries (Sarginson et al. 2012), possibly from different languages. There might be issues with both mapping and interoperability of terms amongst the standards due to linguistic phenomena or non-existent terminological crosswalks. To exemplify, in the motivating example, please refer to pages 20-23, which detail the LLM response to the prompt clarifying the terminology employed in the three examples. Notice that parts of the cancer research library metadata model use terms from the Dublin Core (Weibel et al. 1998) to express its perceived concepts. Similarly, the oncology policy library and the medical college library metadata models use terms from standards such as the HL7 (Health Level Seven) standard, the National Comprehensive Cancer Network (NCCN) standard, SNOMED-CT, etc. (Schulz, Stegwee and Chronaki 2019).
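The three linguistic phenomena named above (polysemy, synonymy and lexical gaps) can each be detected mechanically once a labelling is written down as term-concept pairs. The following is a minimal sketch under that assumption; terminology_issues and the sample terms are hypothetical and are not drawn from the motivating example or from any of the cited standards.

```python
from collections import defaultdict


def terminology_issues(labels, concepts):
    """Report polysemy, synonymy and lexical gaps in a labelling.

    labels: iterable of (term, concept) pairs used by one user community.
    concepts: all perceived concepts that need a label.
    """
    by_term = defaultdict(set)
    by_concept = defaultdict(set)
    for term, concept in labels:
        by_term[term].add(concept)
        by_concept[concept].add(term)
    return {
        # One term carrying several meanings.
        "polysemy": {t: sorted(c) for t, c in by_term.items() if len(c) > 1},
        # Several terms carrying the same meaning.
        "synonymy": {c: sorted(t) for c, t in by_concept.items() if len(t) > 1},
        # Concepts for which the vocabulary offers no term at all.
        "lexical_gaps": sorted(c for c in concepts if c not in by_concept),
    }


# Hypothetical labelling, for illustration only.
labels = [
    ("trial", "Clinical Trial"),
    ("trial", "Pilot Study"),
    ("study", "Clinical Trial"),
]
concepts = ["Clinical Trial", "Pilot Study", "Biomarker"]

print(terminology_issues(labels, concepts))
```

A report of this kind does not resolve the issues; it only surfaces where the metadata librarian still has to make an explicit terminological choice.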
Third, given the central importance of terminology in the design of any library metadata model, the next dimension to understand is the ontological level where the key representational choice regarding the ontological commitment (Guarino, Carrara and Giaretta 1994) of the (newly) labelled concepts is made. This choice, again, is non-trivial for the metadata librarian due to the theoretical root of the interaction of ontology with the terminological level, in the sense that committing to a terminology results in committing to an ontology (Moltmann 2019) which might not necessarily be completely compatible with another ontology/terminology pairing. The chief importance of this representation level is to explicate the otherwise implicit commitment of the body of terms to a top-level ontology (Guarino 1997), wherein such an ontology, based on an explicitly specified philosophical doctrine, reveals the underlying nature of domain-level concepts. There can be several such philosophical doctrines, such as three-dimensionalism, four-dimensionalism, etc. (McCall and Lowe 2006), and according to the chosen doctrine, the domain-level labelled concepts can be classified as a kind, a part-of, an event, a process, a role, a function or a property (Gangemi et al. 2001). Two observations follow. First, this representation layer is by far the most demanding in terms of the involvement of the human modeller, i.e., the metadata librarian, for the simple reason that while different library user communities might, quite possibly, conceptualise the same set of entities into different top-level ontological categories, it is chiefly the responsibility of the metadata librarian to explicate such assumptions. Second, as evidenced in decades' worth of research in semantic heterogeneity (Hull 1997), the ontological level is key to achieving a range of semantics-intensive tasks like semantic mapping (Beneventano et al. 2008) and semantic interoperability (Bittner, Donnelly and Winter 2005) with respect to (meta)data models. To exemplify, in the motivating example, please refer to pages 23-29, which detail the LLM response to the prompt requesting the ontological categories employed in the three examples. For example, notice that on page 26, the LLM response aligns concepts and properties such as Document and hasAuthor to top-level ontological categories like Information Object and Quality.
Fourth, given the importance of ontological commitment of constituent concepts and properties in a library metadata model, let us concentrate on the taxonomical level where the key representational choice regarding the ontologically characterised concepts and their classification into a taxonomical hierarchy is made. This choice by the metadata librarian is guided by two key factors. First, the ontological level already induces a very abstract high-level taxonomy for the concepts due to the mandatory ontological constraints (Guarino, Carrara and Giaretta 1994) they pose when the metadata librarian commits to a specific top-level ontology and its constituent top-level ontological categories. For example, if the concepts Person and Patient belong to the top-level ontological categories Kind and Role, respectively, it is an ontological constraint that a Role can never be the taxonomical parent of a Kind, and, therefore, Patient can never be the taxonomical parent of Person. Second, within the ontological constraints, the taxonomy can be designed by the metadata librarian in accordance with how its hierarchy and metadata properties will be exploited by the target information service (e.g., a data catalog (Guptill 1999), a chatbot (Bagchi 2020)). To exemplify, in the motivating example, please refer to pages 30-34, which detail the LLM response to the prompt requesting the taxonomical choices made (for concepts as well as properties). For example, on page 30, Library Resource is specialised into Research Paper and Dataset. Notice that, alongside the ontological level, the taxonomical level is equally demanding, if not more so, in terms of the involvement of the metadata librarian in modelling the classification hierarchy.
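The Person/Patient constraint above lends itself to a simple automated check over a draft taxonomy. The sketch below is illustrative only and assumes the taxonomy is held as a child-to-parent dictionary; the function taxonomy_violations and the forbidden set are hypothetical constructs, and only the single Kind/Role rule stated in the text is encoded.

```python
def taxonomy_violations(category_of, parent_of, forbidden):
    """List subclass links that break a top-level ontological constraint.

    category_of: concept -> top-level category (e.g. "Kind", "Role").
    parent_of: child concept -> parent concept.
    forbidden: set of (parent_category, child_category) pairs that may not occur.
    """
    violations = []
    for child, parent in parent_of.items():
        pair = (category_of[parent], category_of[child])
        if pair in forbidden:
            violations.append((child, parent))
    return violations


category_of = {"Person": "Kind", "Patient": "Role"}
forbidden = {("Role", "Kind")}  # a Role may not be the parent of a Kind

valid = {"Patient": "Person"}    # Patient is placed under Person
invalid = {"Person": "Patient"}  # Person is placed under Patient

print(taxonomy_violations(category_of, valid, forbidden))
print(taxonomy_violations(category_of, invalid, forbidden))
```

Such a check covers only the mandatory ontological constraints; the second factor, fitting the hierarchy to the target information service, remains a judgement for the metadata librarian.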
Last but not least, given the key importance of the taxonomic classification in the composition of a library metadata model, let us focus on the intensional level where the key representational choice is how to interrelate and describe each individual concept in the taxonomy. This is done by first finalising the specialisation of the already classified properties (in the ontological level) into object properties and data properties (Bagchi and Madalli 2019). Note that, even if the properties are informally classified as object or data properties by the metadata librarian before this level, the final decision is taken at this level. Second, each concept in the taxonomy is assigned a set of object properties which encode how the concept is interlinked with other concepts in the overall metadata model. Equally, and perhaps most importantly, every taxonomic concept is assigned a set of data properties which describe its attributes. Two observations follow. First, this level is, in implementation terms, perhaps the most visible and pronounced level in a library metadata model as it decides on the data properties which ultimately encode real-world library metadata. Second, this level is central to a library metadata model in the sense that it decides the inheritance of data properties by classes (aka concepts) in the taxonomy, thereby considerably influencing the decision concerning the description of each taxonomic concept by a set of data properties and data types. To exemplify, in the motivating example, please refer to pages 34-39, which detail the LLM response to the prompt requesting the object and data property choices made. For example, on page 35, the concept Policy Document is described with data properties like {policy Document ID, policy Document Title, policy Document Text}.
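The inheritance of data properties along the taxonomy can be sketched as a walk from a concept up to its ancestors. The example below is a hedged illustration: placing Policy Document under Library Resource, the dateAdded property, and the camel-case spellings of the three Policy Document properties are all assumptions made for the sketch, not details confirmed by the motivating example.

```python
def inherited_data_properties(concept, parent_of, own_properties):
    """Collect a concept's data properties, ancestors' properties first."""
    collected = []
    seen = set()
    current = concept
    while current is not None and current not in seen:
        seen.add(current)  # guards against accidental cycles in the taxonomy
        collected = own_properties.get(current, []) + collected
        current = parent_of.get(current)
    return collected


# Hypothetical taxonomy fragment and property assignment.
parent_of = {"Policy Document": "Library Resource"}
own_properties = {
    "Library Resource": ["dateAdded"],
    "Policy Document": [
        "policyDocumentID",
        "policyDocumentTitle",
        "policyDocumentText",
    ],
}

print(inherited_data_properties("Policy Document", parent_of, own_properties))
```

The sketch shows why the taxonomical and intensional levels are interlinked: moving a concept to a different parent changes the set of data properties by which it is described.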
Finally, it is interesting to note two crucial points about the aforementioned levels of knowledge organization and representation which constitute any library (or even a generic) metadata model. First, the ordering from perception to intensionality is not just a linear, ordered sequence but is essentially an ordered and continually interlinked scientific spiral (Ranganathan 1957) from perception to intensionality and back. To that end, successive cycles can be utilised by the metadata librarian to continuously validate/repair/update/enrich the library metadata model with new potential changes in any of the representation levels. This also reinforces the view that no single metadata model, or small set of metadata models, can be necessary and sufficient for any and every domain and application scenario. Second, notice that the above levels of representation are a general framework for understanding the composition of a library metadata model. It might well be the case that, for a specific application scenario, the perceptual level or the terminological level or both are quite homogeneous and, therefore, of much less relative importance to the model than, e.g., the ontological level or the intensional level. In any case, the relative importance of each representation level to the final library metadata model should be decided on a case-by-case basis.
Entanglement in Modelling Metadata
Given the central and irreplaceable impact of the five functionally interlinked representation levels on the modelling of metadata, let us now focus on how the conceptual entanglement problem is implicitly instantiated at each successive level within the metadata design strategy of an academic library. To that end, we have the following:
In order to exemplify the stratification above, the same motivating example of the ontology-driven library metadata model for the cancer domain is employed (pages 8-43). Notice also that the stratification of conceptual entanglement proposed above is in sync with the stratification of representation in metadata proposed in the second section. The above levels are elucidated as follows.
First, let us concentrate on perceptual entanglement which, in essence, refers to the many-to-many mapping between entities and their perception as concepts which would eventually compose a library metadata model. Two key factors underlie the entanglement in the decision-making of a metadata librarian at this level. First, as already noted, perception is egocentric and can be unique to communities of practice. This premise leads to the notion of perception as a cognitive filter (Guarino, Guizzardi and Mylopoulos 2020), i.e., in our terms, the fact that the same entity and its properties can be perceived differently by different groups of library users depending on their overall purpose and goals. There can be an overlap in the concepts perceived by, say, two groups of library users and, equally, there can be ranges of mutual exclusion in their perception (e.g., due to different purposes or goals). Second, because perception is considered only implicitly within academic library metadata design (see, e.g., (Fleming, Mering and Wolfe 2008; Lopatin 2010)), the necessary and sufficient assumption (explained before) is almost taken for granted on the surface, while the underlying multiplicities of perception remain unaddressed. Consider the case of the motivating example generated by ChatGPT 3.5 via Human-LLM collaboration. Three different user perceptions of the same macro domain cancer are illustrated in the example, viz., those of a representative cancer research library (pages 8-10), an oncology policy library (pages 10-12) and a medical college library (pages 12-14). Notice that the commonality in the perception of the three different communities is indicated by their shared perceived concept of cancer, specialised by its various types (breast cancer, lung cancer, etc.) and described by common properties (e.g., biomarkers associated with a cancer).
However, major differences emerge in the three perceptions due to their differing purposes in studying the same entity cancer. The cancer research library user community is interested in concepts like research study and treatment, and in properties like clinical trial ID, pathology report text, etc. The oncology policy library user community, on the other hand, is interested in policy documents like clinical guidelines, best practice, regulatory policy, etc. The medical college library user community is interested in medical study, study material, treatment approach, etc. Please see pages 15-18 in the motivating example for more details on the differences in perception. The above illustrates the representational manifoldness between the same (domain) entities and the various ways in which they are perceived as concepts.
Second, let us focus on terminological entanglement, which implies the necessary existence of a many-to-many mapping between perceived concepts and their lexicalization using a terminological label. As briefly mentioned earlier, linguistic phenomena are a key inducer of representational manifoldness at this level. For example, polysemous terms labelling a perceived concept provide at best an ambiguous notion of its meaning and magnify the many-to-many possibilities for the metadata librarian. On the other hand, the central problem with synonyms is the establishment of a mapping. Given implicitly synonymous terms, it might not always be straightforward for the metadata librarian to infer whether they are the same term, synonymous terms, or broader/narrower/distinct terms. The above problems are further compounded in settings which might involve multiple languages and/or multiple communities of different kinds of academic library users. To exemplify, in the motivating example, let us consider the single case of the concept Breast Cancer (page 8). It can be termed variously by different communities of library users as Carcinoma of the Breast, Mammary Carcinoma, Breast Tissue Malignant Neoplasm or Lobular Adenocarcinoma. In multilingual metadata modelling settings, the same concept can be referred to as, for instance, cancer du sein in French, cáncer de mama in Spanish or rakovina prsu in Czech. While technologies from Natural Language Processing (NLP) can be harnessed to implement some of these semantic (non)equivalences, it is crucial for the metadata librarian to first diagnose and understand the entanglement and manifoldness existent in the equivalency mapping.
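As a minimal sketch of how such a diagnosis could be supported, the Python below scans a list of (concept, term) pairs and reports both directions of the many-to-many mapping: concepts carrying several terms and terms labelling several concepts. The function name `find_entanglement` and the polysemous label "Study" are assumptions introduced for this illustration; the Breast Cancer terms and the concepts Research Study and Study Material are taken from the motivating example.

```python
from collections import defaultdict


def find_entanglement(labelled_pairs):
    """Return (concepts with several terms, terms with several concepts)."""
    terms_of = defaultdict(set)
    concepts_of = defaultdict(set)
    for concept, term in labelled_pairs:
        terms_of[concept].add(term)
        concepts_of[term].add(concept)
    many_terms = {c: sorted(ts) for c, ts in terms_of.items() if len(ts) > 1}
    many_concepts = {t: sorted(cs) for t, cs in concepts_of.items() if len(cs) > 1}
    return many_terms, many_concepts


pairs = [
    ("Breast Cancer", "Carcinoma of the Breast"),
    ("Breast Cancer", "Mammary Carcinoma"),
    ("Breast Cancer", "cancer du sein"),
    ("Breast Cancer", "rakovina prsu"),
    # Hypothetical polysemy: one label used for two different concepts.
    ("Research Study", "Study"),
    ("Study Material", "Study"),
]

many_terms, many_concepts = find_entanglement(pairs)
print(many_terms)     # concepts lexicalized by more than one term
print(many_concepts)  # terms that label more than one concept
```

Such a report only surfaces candidate entanglements. Deciding whether two terms are the same, synonymous, broader, narrower or distinct remains a judgement for the metadata librarian.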
Third, given the elucidation of terminological entanglement, let us now elaborate the notion of ontological entanglement, which refers to the necessary existence of a many-to-many mapping between labelled concepts and their ontological commitment. There are two interlinked dimensions underlying the above entanglement. First, due to perceptual and subsequent terminological entanglement, different communities of library users can perceive and label the same entity differently, thereby bootstrapping the many-to-many mapping between different entities/concepts and the different top-level ontological categories into which they might potentially be categorised. Second, as a consequence of the first, different communities of library users, unknowingly and implicitly, commit to the philosophical doctrine of one of the state-of-the-art top-level ontologies such as DOLCE (Gangemi et al. 2002), BFO (Smith, Kumar and Bittner 2005), UFO (Guizzardi et al. 2022), etc., thereby adding a second layer of many-to-many entanglement between entities and top-level ontological theories. The ontological entanglement, which results in representational manifoldness between labelled concepts and their top-level ontological characterization, in effect translates into application-level implementational entanglements such as the management of interoperability and the harvesting of metadata in networked academic library settings (Taha 2012). To exemplify, in the motivating example, let us consider the ontological entanglement in the oncology policy library metadata model as detailed on page 27. Notice that the same concept of a Policy Document has been categorised into different ontological categories (Information Object, Artefact, Continuant) by different top-level ontologies (DOLCE, UFO, BFO), respectively. Further, each of these categories can be instantiated for various other labelled concepts, e.g., textbook on page 28 (having different semantics).
Fourth, assuming ontological entanglement and ontological constraints, let us now elucidate the taxonomical entanglement sub-problem, which refers to the necessary existence of a many-to-many mapping between ontologically enriched concepts and their taxonomic classification, which would eventually constitute the backbone of a library metadata model. There are four key parameters, from Ranganathan’s faceted classification theory (Ranganathan 1967; Ranganathan 1989), which induce representational manifoldness at this stage. First, with respect to a concept at a specific level of abstraction in the taxonomy, there are always multiple characteristics which can be employed to taxonomically specialise that concept into (potentially many) subordinate concepts. Second, the successive application of characteristics across the entire depth of the taxonomy (with the possibility of multiple classificatory characteristics at each level of abstraction) leads to potentially infinite entangled classifications. Third and fourth, there can also be multiple ways in which concepts can be organised horizontally across a specific level of taxonomic abstraction (termed arrays in (Ranganathan 1967)) and vertically across a taxonomic path (termed chains in (Ranganathan 1967)), respectively. To exemplify the above, in the motivating example, please refer to pages 30-34, which detail the LLM response to the prompt requesting the taxonomical choices. Although simplistic, note that the Human-LLM prompt collaboration returns two equally probable but different entangled taxonomies (page 31: example 1 and example 2) for the oncology policy library user community. It is interesting to observe that in example 1 the taxonomy is implicitly suited to a library metadata model with a social-healthcare policy perspective, whereas in example 2 the taxonomy is implicitly suited to a library metadata model with a governmental-healthcare policy perspective.
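The first parameter, the choice among several classificatory characteristics, can be illustrated with a short sketch. In the Python below, the same three policy document types are specialised once by issuer and once by function, which yields two different arrays of subclasses. The helper `specialise`, the characteristic names and all facet values are hypothetical and were chosen only for this illustration; they are not the two taxonomies on page 31 of the motivating example.

```python
from collections import defaultdict


def specialise(concepts, characteristic):
    """Group concepts into one array of subclasses by a single characteristic."""
    array = defaultdict(list)
    for name, facets in concepts.items():
        array[facets[characteristic]].append(name)
    return {value: sorted(names) for value, names in array.items()}


# The document types come from the example; the facet values are assumptions.
policy_documents = {
    "Clinical Guideline": {"issuer": "Professional Body", "function": "Advisory"},
    "Best Practice": {"issuer": "Hospital", "function": "Advisory"},
    "Regulatory Policy": {"issuer": "Government", "function": "Binding"},
}

by_issuer = specialise(policy_documents, "issuer")
by_function = specialise(policy_documents, "function")

print(by_issuer)    # three subclasses, one per issuer
print(by_function)  # two subclasses: Advisory and Binding
```

Each further level of the taxonomy repeats this choice, so the number of possible taxonomies grows with every characteristic that could be applied next.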
Last but not least, assuming taxonomical entanglement, let us now elucidate the problem of intensional entanglement, which refers to the necessary existence of a many-to-many mapping between the taxonomic concepts and their intensional interrelation and description. There are two key factors which magnify representational manifoldness in this final stage. First, during the final decision to characterise a property (uncovered at the ontological level) as an object or a data property, there exists a many-to-many mapping, as the same property can be represented and refactored as an object or a data property depending on the purposes and goals to be served by the library metadata model. Second, a concept in the taxonomy can be interrelated and described via multiple possible combinations of sets of object and data properties, again depending on the precise purpose of the final metadata model. To exemplify the entanglement, in the motivating example, please refer to pages 34-39, which detail the LLM response to the prompt requesting the object and data property choices. For example, the Human-LLM collaboration generated two equally probable but different sets of object and data properties (page 36), implicitly suited to the medical students and the medical study participants user communities of a medical college library, respectively.
Finally, notice that the aforementioned level-by-level representational manifoldness is cumulative in nature and, therefore, there is also an overall manifoldness between entities and their taxonomy and intensional characterization. In the motivating example, pages 39-43 detail at length the LLM response to the prompt requesting the overall many-to-many mapping between entities, taxonomical hierarchies and intensional property-based characterization. For example, even within the same macro domain of discourse cancer, the different perceptions of the three different communities of library users (cancer specialists, oncology policy experts and medical college professionals) generate multiple terminologies, ontological categories, taxonomies and property predications, most of which do not correspond to each other. Further, as previously noted in the second section, it might well be the case that, for an application scenario, there is no manifoldness in perception or in terminology or both (with, e.g., the other entanglements still holding). In any case, the conceptual entanglement problem is a generic characterization, inherent and implicit by design in any library metadata model, and should be adapted as necessary on a case-by-case basis.
Towards Generative AI-driven Disentangled Metadata Modelling
After the elucidation of how the conceptual entanglement problem is inherent, by design, at each representation level, let us now describe in detail how a metadata librarian can exploit the conceptual disentanglement approach, via Generative AI-driven Human-LLM collaboration, to disentangle entangled metadata within the framework of an academic library. To that end, we have the following:
In order to exemplify the stratification above, the same motivating example of the ontology-driven library metadata model for the cancer domain is employed (pages 8-43). Notice also that the stratification of conceptual disentanglement proposed above is in sync with both the stratification of representation in metadata and the stratification of conceptual entanglement proposed in the second and the third section, respectively. The above levels are elucidated as follows.
First, let us concentrate on the perceptual disentanglement sub-approach, which requires the metadata librarian (i.e., the modeller) to decide on an explicit one-to-one mapping between entities and their perception as concepts. As described and exemplified in the third section, even for the same (domain) entities, there is an inherent representational manifoldness which entangles the resultant library metadata model. Two key steps underlie the disentanglement approach of a metadata librarian at this level. First, the metadata librarian has to generate, via prompt engineering (similar to that in the motivating example), an initial characterization of the perception he/she wants to follow in designing the metadata model. Notice that the goals and purposes which such a model would potentially serve are a key influence on the design of prompts at this stage to elicit a reasonable response from the LLM. Second, given an initial record of perception generated via the above Human-LLM collaboration, the next crucial step for the metadata librarian is to employ (a combination of) several standard techniques to validate and consolidate the generated perception, depending on the criticality of the use of the metadata model. To that end, he/she can conduct a lightweight digital ethnographic exercise (Varis 2015) about the user community and its perception of a domain. He/she can also employ (digital) focus groups (Morgan 1996) and/or one-to-one consultation with domain experts (Drabenstott 2003) to better understand a community’s viewpoint about entities of a specific domain. Further, independently or in addition to the above, the metadata librarian can also perform an information and data validation exercise, wherein he/she can explore real-life information and data resources produced by a user community to gauge its perception of a domain or even, if available, reuse existing conceptual disentanglement documentation (thereby reinforcing the spiral nature of the levels). Finally, the above inputs can be consolidated to validate/repair/enrich the initial perception and, hence, facilitate disentangling the perceptual entanglement. To exemplify, in the motivating example, the metadata librarian, after the perception validation exercise, might choose to add a new concept multi-omics workflow and related properties to the perception already generated by the Human-LLM collaboration for the cancer research library user community. Notice also that the initial perception generated via LLM prompting reduces, to a significant extent, the otherwise laborious exercise of domain analysis (Hjørland and Albrechtsen 1995) on the part of the metadata librarian.
Second, assuming the representational bijection at the perceptual level, let us now focus on terminological disentanglement, which requires the metadata librarian to decide on an explicit one-to-one mapping between perceived concepts and the terminology which would compose the disentangled library metadata model. The guiding principle which a metadata librarian can employ for disentanglement at the terminological level can be referred to as the principle of user warrant. It is based on the generalised notion of warrant in information science (Nylund 2020; Lancaster 1972) and can be understood as the principle of assigning to a perceived concept a linguistically disambiguated and semantically explicit term which is based on the documented warrant of the (expert) library users. To that end, each (meta)data concept/property term label should embody standard terminological quality, e.g., having a natural language gloss, examples, identifiers, etc. (Miller 1995). There are several (combinations of) options which a metadata librarian can exploit to achieve representational bijection of terminology. If the user warrant is to use relatively commonsense terms to refer to concepts of cancer in, e.g., an oncology policy library, lexical-semantic resources such as WordNet (Miller 1995) might be exploited to achieve the representational bijection. On the other hand, if the user warrant in a specialised cancer research library is to refer to concepts of cancer using scientific terms, resources such as specialised glossaries and healthcare/clinical terminological standards (Schulz, Stegwee and Chronaki 2019) can be exploited to achieve an explicit one-to-one mapping. Further, in multilingual settings, NLP resources like multilingual variations of WordNet and WordNet-like resources (Pianta, Bentivogli and Girardi 2002; Dash, Bhattacharyya and Pawar 2017; Navigli and Ponzetto 2012) and/or open multilingual knowledge bases like Wikidata (Vrandečić and Krötzsch 2014) can be exploited to disentangle and disambiguate terms.
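One simple way to operationalise the principle of user warrant is sketched below in Python. For each perceived concept, the sketch keeps the candidate term with the highest documented warrant and then checks that no chosen term labels two concepts, i.e., that the result is one-to-one. The function `disentangle_terms`, the reduction of warrant to a usage count, the counts themselves and the Lung Cancer terms are all assumptions made for this illustration, not a procedure prescribed by the paper.

```python
def disentangle_terms(candidates):
    """Pick the best-warranted term per concept and enforce a one-to-one mapping.

    candidates: concept -> {term: warrant count}. The counts are assumed to
    come from documented usage by the (expert) library users.
    """
    preferred = {}
    for concept, terms in candidates.items():
        # Highest warrant wins; ties are broken alphabetically for determinism.
        preferred[concept] = min(terms, key=lambda t: (-terms[t], t))
    used = list(preferred.values())
    duplicates = {t for t in used if used.count(t) > 1}
    if duplicates:
        raise ValueError(f"terms label several concepts: {sorted(duplicates)}")
    return preferred


# Hypothetical warrant counts for a specialised cancer research library.
candidates = {
    "Breast Cancer": {
        "Carcinoma of the Breast": 30,
        "Mammary Carcinoma": 12,
        "Breast Tissue Malignant Neoplasm": 4,
    },
    "Lung Cancer": {
        "Carcinoma of the Lung": 25,
        "Pulmonary Carcinoma": 9,
    },
}

preferred_terms = disentangle_terms(candidates)
print(preferred_terms)
```

In practice the non-preferred terms would be retained as documented alternatives, and the warrant would be drawn from glossaries, terminological standards or lexical resources of the kind cited above rather than from raw counts.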
Third, assuming the representational bijection at the terminological level, let us move to the next activity of ontological disentanglement, according to which the metadata librarian has to decide on an explicit one-to-one mapping between labelled concepts and their ontological commitment. Notice that, similar to the disentanglement strategy at the terminological level, the general guidance of user warrant is key to uncovering the ontological commitment of library users at this level. To that end, the metadata librarian can apply a four-stage approach. First, the librarian can prompt an LLM like ChatGPT 3.5 appropriately (see, for instance, the examples on pages 23-29 of the motivating example) to bootstrap an initial ‘entangled’ version of the alignment of the library metadata model with different state-of-the-art top-level ontologies. Second, the librarian can reuse the outputs and documentation of focus group interviews and/or domain expert consultation and/or relevant conceptual disentanglement documentation (as discussed before at the perceptual disentanglement level) to understand the ontological warrant of the community of library users he/she is serving. Third, he/she uses the results of the second step to explicitly choose and enrich the best possible ontological fit amongst the different ontology alignments produced in the first step. Finally, the metadata librarian should also explicate the ontological category of each individual labelled concept and property under consideration with respect to the chosen top-level ontology.
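The four stages can be sketched as a small Python routine. The sketch assumes that the entangled alignment from the first stage is available as a dictionary and that the ontological warrant from the second stage has been summarised as a score per top-level ontology. The function `disentangle_ontology`, the warrant scores and the categories listed for Textbook are hypothetical; the categories for Policy Document are those reported on page 27 of the motivating example.

```python
def disentangle_ontology(alignment, warrant):
    """Choose one top-level ontology and explicate one category per concept.

    alignment: concept -> {ontology: category} (stage 1, 'entangled').
    warrant:   ontology -> score summarising the users' warrant (stage 2).
    """
    # Stage 3: best ontological fit; ties are broken alphabetically.
    chosen = max(sorted(warrant), key=warrant.get)
    missing = [c for c, cats in alignment.items() if chosen not in cats]
    if missing:
        raise ValueError(f"no {chosen} category for: {missing}")
    # Stage 4: exactly one explicit category per labelled concept.
    return chosen, {c: cats[chosen] for c, cats in alignment.items()}


# Stage 1: alignment as it might be bootstrapped via an LLM prompt.
entangled_alignment = {
    "Policy Document": {
        "DOLCE": "Information Object",
        "UFO": "Artefact",
        "BFO": "Continuant",
    },
    # Assumed for illustration: Textbook receives the same categories.
    "Textbook": {
        "DOLCE": "Information Object",
        "UFO": "Artefact",
        "BFO": "Continuant",
    },
}

# Stage 2: hypothetical warrant scores, e.g. from focus group consultation.
ontological_warrant = {"DOLCE": 2, "UFO": 5, "BFO": 1}

chosen, categories = disentangle_ontology(entangled_alignment, ontological_warrant)
print(chosen, categories)
```

The enrichment mentioned in the third stage, such as adding categories that the LLM output omitted, is left out of the sketch. Here a missing category simply raises an error so that the librarian can repair the alignment.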
Fourth, assuming the representational bijection at the ontological level, let us now concentrate on taxonomical disentanglement, which requires the metadata librarian to decide on an explicit one-to-one mapping between ontologically characterised concepts and their precise taxonomy. Given the general direction from the needs of the library and information service(s) which would exploit the final library metadata model, the metadata librarian can follow a four-step disentanglement strategy in response to the four parameters of taxonomical entanglement (see the third section). The solution is grounded in the canons of knowledge classification proposed by Ranganathan in his faceted classification theory (Ranganathan 1967; Ranganathan 1989). First, the canons of characteristics (Ranganathan 1967), like the canons of relevance and ascertainability, should be applied to eliminate manifoldness at the level of selecting a classificatory characteristic for specialising concepts at a specific level of abstraction in the taxonomy. Second, the canons of succession of characteristics (Ranganathan 1967), like the canon of relevant succession, should be applied to disentangle the multiplicity existent in how classificatory characteristics are successively applied to design the conceptual depth of the taxonomy. Third and fourth, the canons of arrays and chains (Ranganathan 1967) should be applied to disentangle the manifoldness existent while modelling concepts across a specific horizontal level and across a path of the taxonomy, respectively. To aid the above exercise, the metadata librarian can also consult and reuse relevant state-of-the-art taxonomies if available as open source. Finally, the above procedure is expected to generate an enriched and disentangled version of one of the many possibilities initially generated via Human-LLM collaboration, e.g., as exemplified in the motivating example (pages 30-34).
Last but not least, assuming the representational bijection at the taxonomical level, let us move to the final activity of intensional disentanglement, according to which the metadata librarian has to decide on an explicit one-to-one mapping between the taxonomic concepts and their intensional interrelation and description. To that end, he/she can follow a two-step strategy. First, he/she should make explicit a final decision on the exact split of properties into two distinct sets: a set of object properties and a set of data properties. This decision is critically influenced by the information service applications driven by the warrant of a specific community of library users. Given the first step, the