[论文翻译]LLaVA-Med: 一天内训练出用于生物医学的大语言视觉助手


原文地址:https://arxiv.org/pdf/2306.00890


LLaVA-Med: Training a Large Language-and-Vision Assistant for Bio medicine in One Day

LLaVA-Med: 一天内训练出用于生物医学的大语言视觉助手

Chunyuan $\mathbf{Li}^{* }$ , Cliff Wong∗, Sheng Zhang∗, Naoto Usuyama, Haotian Liu, Jianwei Yang Tristan Naumann, Hoifung Poon, Jianfeng Gao

Chunyuan $\mathbf{Li}^{* }$, Cliff Wong∗, Sheng Zhang∗, Naoto Usuyama, Haotian Liu, Jianwei Yang Tristan Naumann, Hoifung Poon, Jianfeng Gao

Microsoft https://aka.ms/llava-med

Microsoft https://aka.ms/llava-med

Abstract

摘要

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a visionlanguage conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for Bio Medicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal convers at ional capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.

对话式生成式AI(Generative AI)在赋能生物医学实践方面展现出巨大潜力,但现有研究主要集中于单模态文本。通过利用公共网络的数十亿图文对,多模态对话式AI已取得快速进展,然而这类通用领域视觉语言模型在理解和讨论生物医学图像方面仍显不足。本文提出一种高效训练视觉语言对话助手的方法,使其能够回答生物医学图像的开放式研究问题。核心思路是从PubMed Central提取大规模、广覆盖的生物医学图文数据集,利用GPT-4从图注自动生成开放式指令遵循数据,再通过新颖的课程学习方式微调通用领域视觉语言大模型。具体而言,模型先通过原始图文对学习生物医学术语对齐,再通过GPT-4生成的指令数据掌握开放式对话语义,模拟普通人逐步获取生物医学知识的过程。该方法仅需不到15小时(使用8块A100)即可训练出生物医学大语言视觉助手(LLaVA-Med)。LLaVA-Med展现出卓越的多模态对话能力,能遵循开放式指令协助解析生物医学图像。在三个标准生物医学视觉问答数据集上,微调后的LLaVA-Med在部分指标上超越了此前有监督的最先进方法。为促进生物医学多模态研究,我们将公开指令数据集和LLaVA-Med模型。

1 Introduction

1 引言

Parallel image-text data is abundantly available in the general domain, such as web images and their associated captions. Generative pre training has proven effective to leverage this parallel data for self-supervised vision-language modeling, as demonstrated by multimodal GPT-4 [32] and opensourced efforts such as LLaVA [24]. By instruction-tuning models to align with human intents based on multimodal inputs, the resulting large multimodal models (LMMs) exhibit strong zero-shot task completion performance on a variety of user-oriented vision-language tasks such as image understanding and reasoning, paving the way to develop general-purpose multimodal conversational assistants [2, 21, 9].

通用领域存在大量并行的图文数据,例如网络图片及其关联标题。生成式预训练 (Generative Pre-training) 已被证明能有效利用这种并行数据进行自监督视觉语言建模,多模态GPT-4 [32] 和开源项目LLaVA [24] 等成果已验证了这一点。通过基于多模态输入对模型进行指令微调 (instruction-tuning) 以对齐人类意图,最终得到的大规模多模态模型 (LMMs) 在面向用户的视觉语言任务(如图像理解与推理)中展现出强大的零样本任务完成能力,为开发通用多模态对话助手铺平了道路 [2, 21, 9]。

While successful in the general domains, such LMMs are less effective for biomedical scenarios because biomedical image-text pairs are drastically different from general web content. As a result, general-domain visual assistants may behave like a layperson, who would refrain from answering biomedical questions, or worse, produce incorrect responses or complete hallucinations. Much progress has been made in biomedical visual question answering (VQA), but prior methods typically formulate the problem as classification (e.g., among distinct answers observed in the training set) and are not well equipped for open-ended instruction-following. Consequently, although conversational generative AI has demonstrated great potential for biomedical applications [19, 30, 18], current investigations are often limited to unimodal text.

虽然在通用领域取得了成功,但此类大语言模型在生物医学场景中的效果较差,因为生物医学图文对与通用网络内容截然不同。因此,通用领域的视觉助手可能表现得像外行,要么回避回答生物医学问题,要么更糟,产生错误响应或完全幻觉。生物医学视觉问答 (VQA) 已取得很大进展,但现有方法通常将问题表述为分类任务 (例如在训练集中观察到的不同答案之间进行选择),并不擅长处理开放式指令跟随。因此,尽管对话式生成式 AI (Generative AI) 在生物医学应用中展现出巨大潜力 [19, 30, 18],但目前的研究往往局限于单模态文本。

In this paper, we present Large Language and Vision Assistant for Bio Medicine (LLaVA-Med), a first attempt to extend multimodal instruction-tuning to the biomedical domain for end-to-end training of a biomedical multimodal conversational assistant. Domain-specific pre training has been shown to be effective for biomedical natural language processing (NLP) applications [17, 14, 10, 28] and biomedical vision-language (VL) tasks [15, 7, 38, 49, 8]. Most recently, large-scale biomedical VL learning has been made possible by the creation of PMC-15M [49], a broad-coverage dataset with 15 million biomedical image-text pairs extracted from PubMed Central1. This dataset is two orders of magnitude larger than the next largest public dataset, MIMIC-CXR [15], and covers a diverse image types. Inspired by recent work in instruction-tuning [34, 24], LLaVA-Med uses GPT-4 to generate diverse biomedical multimodal instruction-following data using image-text pairs from PMC-15M, and fine-tune a large biomedical-domain VL model [24] using a novel curriculum learning method.

本文介绍了生物医学领域的大语言与视觉助手(LLaVA-Med),这是将多模态指令调优技术扩展到生物医学领域、实现端到端生物医学多模态对话助手训练的首次尝试。领域特异性预训练已被证明对生物医学自然语言处理(NLP)应用[17,14,10,28]和生物医学视觉语言(VL)任务[15,7,38,49,8]具有显著效果。最近,通过创建PMC-15M[49]这一包含1500万对生物医学图文数据的大型数据集,使得大规模生物医学VL学习成为可能。该数据集规模比第二大公开数据集MIMIC-CXR[15]大两个数量级,且涵盖多样化的图像类型。受指令调优领域最新研究[34,24]启发,LLaVA-Med利用GPT-4基于PMC-15M的图文对生成多样化生物医学多模态指令跟随数据,并采用新颖课程学习方法对大型生物医学领域VL模型[24]进行微调。

Specifically, our paper makes the following contributions:

具体而言,本文的贡献如下:

2 Related Work

2 相关工作

Biomedical Chatbots. Inspired by ChatGPT [31]/GPT-4 [32] and the success of open-sourced instruction-tuned large language models (LLMs) in the general domain, several biomedical LLM chatbots have been developed, including ChatDoctor [47], Med-Alpaca [12], PMC-LLaMA [45], Clinical Camel [1], DoctorGLM [46], and Huatuo [44]. They are initialized with open-sourced LLM and fine-tuned on customized sets of biomedical instruction-following data. The resulting LLMs emerge with great potential to offer assistance in a variety of biomedical-related fields/settings, such as understanding patients’ needs and providing informed advice.

生物医学聊天机器人。受ChatGPT [31]/GPT-4 [32]以及开源指令调优大语言模型(LLM)在通用领域成功的启发,目前已开发出多款生物医学LLM聊天机器人,包括ChatDoctor [47]、Med-Alpaca [12]、PMC-LLaMA [45]、Clinical Camel [1]、DoctorGLM [46]和华佗(Huatuo) [44]。这些模型基于开源LLM初始化,并在定制化的生物医学指令遵循数据集上进行微调。最终形成的LLM展现出在多种生物医学相关领域/场景中提供协助的巨大潜力,例如理解患者需求并提供专业建议。

To our knowledge, Visual Med-Alpaca [39] is the only existing multimodal biomedical chatbot that accepts image inputs. Though Visual Med-Alpaca and the proposed LLaVA-Med share a similar input-output data format, they differ in key aspects: (i) Model architectures. LLaVA-Med is an end-to-end neural model and Visual Med-Alpaca is a system that connect multiple image captioning models with a LLM, using a classifier to determine if or which biomedical captioning model is responsible for the image. The text prompt subsequently merges the converted visual information with the textual query, enabling Med-Alpaca to generate an appropriate response. $(i i)$ Biomedical instruction-following data. While Visual Med-Alpaca is trained on 54K samples from limited biomedical subject domains, LLaVA-Med is trained a more diverse set.

据我们所知,Visual Med-Alpaca [39] 是目前唯一支持图像输入的多模态生物医学对话系统。尽管Visual Med-Alpaca与本文提出的LLaVA-Med具有相似的输入输出数据格式,但二者存在关键差异:(i) 模型架构。LLaVA-Med是端到端神经网络模型,而Visual Med-Alpaca是通过分类器判断图像归属、串联多个图像描述模型与大语言模型的系统,其文本提示会将转换后的视觉信息与文本查询合并,使Med-Alpaca生成响应。$(ii)$ 生物医学指令数据。Visual Med-Alpaca仅基于54K有限生物医学领域样本训练,而LLaVA-Med采用了更丰富多样的训练集。

Biomedical Visual Question Answering. An automated approach to building models that can answer questions based on biomedical images stands to support clinicians and patients. To describe existing biomedical VQA methods, we make a distinction between disc rim i native and generative methods. For disc rim i native methods, VQA is treated a classification problem: models make predictions from a predefined set of answers. While disc rim i native methods yield good performance, they deal with closed-set predictions [13], and require mitigation when a customized answer set is provided in at inference [22, 49, 8]. The disc rim i native formulation is suboptimal towards the goal of developing a general-purpose biomedical assistant that can answer open questions in the wild. To this end, generative methods have been developed to predict answers as a free-form text sequence [5, 26, 41]. Generative methods are more versatile because they naturally cast the close-set questions as as special case where candidate answers are in language instructions.

生物医学视觉问答。构建能基于生物医学图像回答问题的自动化模型方法,有望为临床医生和患者提供支持。为描述现有生物医学VQA方法,我们区分了判别式(discriminative)与生成式(generative)方法。对于判别式方法,VQA被视为分类问题:模型从预定义答案集中进行预测。虽然判别式方法表现良好,但其处理的是封闭集预测[13],在推理时提供定制化答案集时需要额外处理[22,49,8]。这种判别式框架对于开发能回答开放问题的通用生物医学助手目标而言并非最优。为此,研究者开发了生成式方法,将答案预测为自由文本序列[5,26,41]。生成式方法更具通用性,因为它们天然将封闭集问题视为语言指令中包含候选答案的特殊情况。

Model Architecture. LLaVA-Med is similar to prefix tuning of language models (LMs) in [41] in that a new trainable module connects frozen image encoder and causal LM. In [41], a three-layer MLP network is used to map the visual features into a visual prefix, and the pre-trained LM are GPT2-XL [37], BioMedLM [42] and BioGPT [28], with size varying from 1.5B to 2.7B. By contrast, LLaVA-Med uses a linear projection and a 7B LM [43, 40]. Most importantly, [41] only considers standard supervised fine-tuning and focuses efforts on exploring various modeling choices. Our main contributions instead comprise proposing a novel data generation method that uses GPT-4 to self-instruct biomedical multimodal instruction-following data using freely-available broad-coverage biomedical image-text pairs extracted from PubMed Central [49].

模型架构。LLaVA-Med 类似于 [41] 中语言模型 (LM) 的前缀调优方法,通过一个新的可训练模块连接冻结的图像编码器和因果 LM。在 [41] 中,使用了一个三层 MLP 网络将视觉特征映射为视觉前缀,预训练的 LM 包括 GPT2-XL [37]、BioMedLM [42] 和 BioGPT [28],模型大小从 1.5B 到 2.7B 不等。相比之下,LLaVA-Med 使用了线性投影和一个 7B 的 LM [43, 40]。最重要的是,[41] 仅考虑了标准的有监督微调,并将重点放在探索各种建模选择上。而我们的主要贡献在于提出了一种新颖的数据生成方法,利用 GPT-4 通过自指令方式生成生物医学多模态指令跟随数据,这些数据基于从 PubMed Central [49] 提取的广泛覆盖的生物医学图文对。

3 Biomedical Visual Instruction-Following Data

3 生物医学视觉指令跟随数据

There are a lack of multimodal biomedical datasets to train an instruction-following assistant. To fill this gap, we create the first dataset of its kind from widely existing biomedical image-text pairs, through a machine-human co-curation procedure. It consists of two sets, concept alignment and instruction-following, which are used at different training stages, described in Section 4.

缺乏用于训练遵循指令助手的多模态生物医学数据集。为填补这一空白,我们通过机器-人工协同筛选流程,从广泛存在的生物医学图文对中创建了首个此类数据集。该数据集包含概念对齐和遵循指令两个子集,分别用于不同训练阶段(详见第4节)。

Biomedical Concept Alignment Data. For a biomedical image $\mathbf{X}_ {\mathrm{v}}$ and its associated caption $\mathbf{X}_ {\mathtt{c}}$ , we sample a question $\mathbf{X}_ {\mathfrak{q}}$ , which asks to describe the biomedical image. With $(\mathbf{X}_ {\mathtt{v}},\mathbf{X}_ {\mathtt{c}},\mathbf{X}_ {\mathtt{q}})$ , we create a single-round instruction-following example:

生物医学概念对齐数据。对于一张生物医学图像 $\mathbf{X}_ {\mathrm{v}}$ 及其关联标题 $\mathbf{X}_ {\mathtt{c}}$,我们采样一个问题 $\mathbf{X}_ {\mathfrak{q}}$,要求描述该生物医学图像。基于 $(\mathbf{X}_ {\mathtt{v}},\mathbf{X}_ {\mathtt{c}},\mathbf{X}_ {\mathtt{q}})$,我们构建一个单轮指令跟随示例:

$$
\mathrm{Human:X_ {q}X_ {v}{<}S T O P{>}\backslash n A s s i s t a n t:X_ {c}{<}S T O P{>}\backslash n}
$$

$$
\mathrm{Human:X_ {q}X_ {v}{<}S T O P{>}\backslash n A s s i s t a n t:X_ {c}{<}S T O P{>}\backslash n}
$$

Depending on the length of caption, the question that is sampled either asks to describe the image concisely or in detail. Two lists of questions are provided in Appendix A. In practice, $25%$ of captions have length less than 30 words in PMC-15M [49], and thus 30 words is used as the cutoff point to determine which list to choose. We sample 600K image-text pairs from PMC-15M. Though this dataset only presents one-single task instructions, i.e., image captioning, it contains a diverse and representative set of biomedical concept samples from the original PMC-15M [49].

根据标题的长度,采样到的问题会要求简洁或详细地描述图像。附录A中提供了两个问题列表。实际上,在PMC-15M [49]中,25%的标题长度少于30个词,因此以30词作为选择问题列表的分界点。我们从PMC-15M中采样了60万张图像-文本对。尽管该数据集仅包含单一任务指令(即图像描述),但它从原始PMC-15M [49]中提取了多样化且具有代表性的生物医学概念样本。

Biomedical Instruction-Tuning Data. To align the model to follow a variety of instructions, we present and curate diverse instruction-following data with multi-round conversations about the provided biomedical images, by prompting language-only GPT-4. Specifically, given an image caption, we design instructions in a prompt that asks GPT-4 to generate multi-round questions and answers in a tone as if it could see the image (even though it only has access to the text). Sometimes the image caption is too short for GPT-4 to generate meaningful questions and answers. To provide more context regarding the image, we also create a prompt that includes not only captions but also sentences from the original PubMed paper that mentions the image. We also manually curate few-shot examples in the prompt to demonstrate how to generate high-quality conversations based on the provided caption and context. See Appendix B for the prompt and few-shot examples. To collect image captions and their context, we filter PMC-15M to retain the images that only contain a single plot. From them, we sample 60K image-text pairs from the five most common imaging modalities: CXR (chest X-ray), CT (computed tomography), MRI (magnetic resonance imaging), his to pathology, and gross (i.e., macroscopic) pathology. We then extract sentences that mention the image from the original PubMed paper as additional context to the caption, inspired by the observations that external knowledge helps generalization [20, 25].

生物医学指令微调数据。为使模型能够遵循多样化指令,我们通过仅使用文本的GPT-4生成多轮对话,整理出关于给定生物医学图像的多样化指令跟随数据。具体而言,给定图像描述时,我们设计提示词要求GPT-4以仿佛能看到图像的语调生成多轮问答(尽管其仅能访问文本)。当图像描述过短时,我们会创建同时包含描述和提及该图像的PubMed论文句子的提示词以提供更多上下文。我们还手动编写了提示词中的少样本示例,展示如何基于给定描述和上下文生成高质量对话(详见附录B的提示词和少样本示例)。为收集图像描述及其上下文,我们从PMC-15M筛选仅含单个图表图像,从中按五种最常见成像模态(胸透X光(CXR)、计算机断层扫描(CT)、磁共振成像(MRI)、组织病理学和大体病理学)抽取6万图文对,并借鉴外部知识提升泛化能力的研究[20,25],从原始PubMed论文提取提及图像的句子作为描述的补充上下文。

An example of instruction-following data is shown in Figure 1 shows, and the data statistics is shown Figure 2. We have produced three versions of instruct data when iterative ly improving the data

指令跟随数据的示例如图1所示,数据统计信息如图2所示。我们在迭代改进数据时生成了三个版本的指令数据。

Biomedical Visual Instruction-Following Example

生物医学视觉指令跟随示例

Caption:

标题:

Figure 4: Contrast-enhanced CT scan of the chest for patient #1. A large cavitating lesion was identified in the right upper lobe with pleural involvement and likely extension into the chest wall. In addition, right hilar lymph aden opa thy was suspected. This was the primary malignant lesion in this patient.

图 4: 患者#1的胸部增强CT扫描。右肺上叶发现一个大型空洞性病变,累及胸膜并可能延伸至胸壁。此外,怀疑存在右肺门淋巴结病变。这是该患者的主要恶性病灶。

Sentences with in-line figure mentions:

带内联图表引用的句子:

  • Computerized tomography ( CT ) scans of the chest ( Figure 4 ) were obtained. - ( Figure 4 ) and demonstrated a large cavitating lesion in the posterior aspect of the right upper lobe with probable pleural involvement likely representing the primary lesion.
  • 进行了胸部计算机断层扫描 (CT) (图 4: )
  • (图 4: ) 显示右肺上叶后部有一个大型空洞性病变,可能累及胸膜,很可能代表原发病灶。

Conversation:

对话:

User: Where is the lesion located?

病灶位于何处?

Assistant: The lesion is located in the posterior aspect of the right upper lobe of the lung.

病变位于右肺上叶后部。

User: Is it extending into the chest wall?

用户:是否已扩散至胸壁?

Assistant: Yes, the CT scan suggests that the large cavitating lesion in the right upper lobe is likely extending into the chest wall.

CT扫描提示右肺上叶的大空洞性病变可能已侵犯胸壁。

User: Can you explain more about the cavitating lesion?

用户: 能详细解释一下空泡性病变吗?

Assistant: Certainly. A cavitating lesion refers to an abnormal mass or growth within the lung tissue that has a cavity or hollow space inside it. This cavity can be formed due to the necrosis (death) of the tissue inside the lesion, leading to the breakdown and disintegration of the affected tissue. Cavitating lesions can be caused by various factors, such as infections (like tuberculosis or fungal infections), autoimmune diseases, or malignancies (cancer). In the context of the CT scan provided, the large cavitating lesion in the right upper lobe of the lung is suggestive of a malignant (cancerous) growth. The cavity within the lesion may be a result of the rapid and uncontrolled growth of cancer cells, which can lead to the destruction of surrounding healthy tissue. It’s important to note that further diagnostic tests, such as a biopsy, would be required to confirm the malignancy and determine the specific type of cancer involved.

规则:

  • 输出中文翻译部分时,仅保留翻译标题,不包含任何冗余内容、重复或解释。
  • 不输出与英文无关的内容。
  • 保留原始段落格式及术语(如FLAC、JPEG)、公司缩写(如Microsoft、Amazon、OpenAI)。
  • 人名不翻译。
  • 保留论文引用格式(如[20])。
  • Figure/Table翻译为“图/表”并保留编号(如“图1: ”、“表1: ”)。
  • 全角括号替换为半角括号,左右括号前后添加半角空格。
  • 专业术语首次出现时标注英文(如“生成式AI (Generative AI)”),后续仅用中文。

策略:

  1. 特殊字符/公式原样保留
  2. HTML表格转为Markdown格式
  3. 翻译内容需符合中文表达习惯且无信息遗漏

最终仅返回Markdown格式译文。


    • 翻译结果:* *

空泡性病变指肺组织内存在空腔或中空结构的异常肿块。该空腔可能由病变内部组织坏死(组织死亡)导致,引发受累组织的分解与破坏。空泡性病变的成因多样,包括感染(如结核或真菌感染)、自身免疫性疾病或恶性肿瘤(癌症)。根据提供的CT扫描结果,右肺上叶的大型空泡性病变提示恶性(癌性)肿块。病变内的空腔可能是癌细胞快速无序增殖破坏周围健康组织所致。需注意,需通过活检等进一步诊断确认恶性程度及具体癌症类型。

Figure 1: An instance of our GPT-4 generated instruction-following data. Top: The figure and caption were extracted from a PubMed Central full-text article [35], along with the corresponding citances (mentions of the given figure in the article). Bottom: The instruction-following data generated by GPT-4 using the text only (caption and citances). Note that the image is not used to prompt GPT-4; we only show it here as a reference.

图 1: 我们的GPT-4生成的指令遵循数据示例。顶部:该图和标题是从PubMed Central全文文章[35]中提取的,同时包含相应的引用(文章中对该图的提及)。底部:GPT-4仅使用文本(标题和引用)生成的指令遵循数据。请注意,图像并未用于提示GPT-4;我们在此仅将其作为参考展示。

quality: (i) 60K-IM. The a foremen i one d dataset that considers inline mentions (IM) as the context. (ii) 60K. A dataset of similar size (60K samples) without IM in self-instruct generation. (iii) 10K. A smaller dataset (10 samples) without IM. They are used to ablate our data generation strategies and their impact on trained LLaVA-Med in experiments.

质量:(i) 60K-IM。该数据集将内联提及(IM)作为上下文。(ii) 60K。一个规模相似(6万样本)但不含IM的自指令生成数据集。(iii) 10K。不含IM的小型数据集(10个样本)。这些数据集用于实验中消融研究我们的数据生成策略及其对训练的LLaVA-Med模型的影响。

4 Adapting Multimodal Conversational Models to the Biomedical Domain

4 将多模态对话模型适配到生物医学领域

We employ LLaVA,a general-domain multimodal conversation model [24], as the initial generaldomain LM, and continuously train the model to the biomedical domain. The same network architecture is utilized, where a linear projection layer connects the vision encoder and the language model. For LLaVA-Med model training, we use a two-stage procedure, illustrated in Figure 3.

我们采用LLaVA(一种通用领域多模态对话模型[24])作为初始通用领域大语言模型,并持续训练该模型至生物医学领域。该方案沿用相同网络架构,即通过线性投影层连接视觉编码器与语言模型。LLaVA-Med模型训练采用图3所示的两阶段流程:

Stage 1: Biomedical Concept Feature Alignment. To balance between concept coverage and training efficiency, we filter PMC-15M to 600K image-text pairs. These pairs are converted to instruction-following data using a naive expansion method: instructions simply presents the task of describing the image. For each sample, given the language instruction and image input, we ask the model to predict the original caption. In training, we keep both the visual encoder and LM weights frozen, and only update the projection matrix. In this way, the image features of vast novel biomedical visual concepts can be aligned to their textual word embeddings in the pre-trained LM. This stage can be understood as expanding the vocabulary of aligned image-text tokens to the biomedical domain.

阶段1:生物医学概念特征对齐。为平衡概念覆盖范围与训练效率,我们从PMC-15M数据集中筛选出60万张图文对。这些配对通过基础扩展方法转化为指令跟随数据:指令仅描述图像内容生成任务。针对每个样本,在给定语言指令和图像输入后,要求模型预测原始标题。训练过程中保持视觉编码器和大语言模型权重冻结,仅更新投影矩阵。通过这种方式,大量新型生物医学视觉概念的特征可与预训练大语言模型中的文本词向量对齐。该阶段可理解为将已对齐的图像-文本token词汇表扩展至生物医学领域。


Figure 2: The data statistics of biomedical multimodal instruction-following data: (a,b) The root verb-noun pairs of instruction and responses, where the inner circle of the plot represents the root verb of the output response, and the outer circle represents the direct nouns. (c) The distribution of images and QA pairs on the five domains, one image is shown per domain. The domain example images are from [3, 33, 4, 29, 48].

图 2: 生物医学多模态指令跟随数据统计: (a,b) 指令与响应的根动词-名词对,其中图表内圈表示输出响应的根动词,外圈表示直接宾语名词。 (c) 五个领域上图像和问答对的分布,每个领域展示一张示例图像。领域示例图像来自 [3, 33, 4, 29, 48]。


Figure 3: LLaVA-Med was initialized with the general-domain LLaVA and then continuously trained in a curriculum learning fashion (first biomedical concept alignment then full-blown instructiontuning). We evaluated LLaVA-Med on standard visual conversation and question answering tasks.

图 3: LLaVA-Med 基于通用领域 LLaVA 初始化,随后采用课程学习方式进行持续训练 (先进行生物医学概念对齐,再进行完整指令微调)。我们在标准视觉对话和问答任务上评估了 LLaVA-Med。

Stage 2: End-to-End Instruction-Tuning. We only keep the visual encoder weights frozen, and continue to update both the pre-trained weights of the projection layer and LM. To train the model to follow various instructions and complete tasks in a conversational manner, we develop a biomedical chatbot by fine-tuning our model on the biomedical language-image instruction-following data collected in Section 3. As demonstrated in the experiments to be described later, the LLaVA-Med model at this stage is able to not only be served as a biomedical visual assistant to interact with users, but also achieve good zero-shot task transfer performance when evaluated on well-established biomedical VQA datasets.

阶段2:端到端指令微调。我们仅冻结视觉编码器的权重,继续更新投影层和大语言模型的预训练权重。为了让模型学会以对话方式遵循各类指令并完成任务,我们在第3节收集的生物医学图文指令跟随数据上微调模型,开发出生物医学聊天机器人。如后续实验所示,此阶段的LLaVA-Med模型不仅能作为生物医学视觉助手与用户交互,在成熟的生物医学VQA数据集上进行零样本任务迁移评估时也表现出色。

Fine-tuning to Downstream Datasets. For some specific biomedical scenarios, there is a need of developing highly accurate and dataset-specific models to improve the service quality of the assistant. We fine-tune LLaVA-Med after the two-stage training on three biomedical VQA datasets [27], covering varied dataset sizes and diverse biomedical subjects. Given a biomedical image as context, multiple natural language questions are provided, the assistant responds in free-form text for both the close-set and open-set questions, with a list of candidate answers constructed in the prompt for each close-set question.

针对下游数据集的微调。在某些特定生物医学场景中,需要开发高精度且针对特定数据集的模型以提升助手服务质量。我们在两阶段训练后,基于三个生物医学VQA数据集[27]对LLaVA-Med进行微调,这些数据集涵盖不同规模数据和多样化的生物医学主题。给定生物医学图像作为上下文,系统会提供多个自然语言问题,助手以自由文本形式回答闭集和开集问题,其中每个闭集问题的提示中会构建候选答案列表。

Discussion. We discuss three favorable properties/implications of LLaVA-Med: (i) Affordable development cost. Instead of scaling up data/model for the best performance, we aim to provide affordable and reasonable solutions with low development cost: it takes 7 and 8 hours for stage 1 and 2 on 8 40G A100 GPUs, respectively (see Table 5 for detailed numbers). $(i i)A$ recipe for many domains. Though this paper focuses on biomedical domains, the proposed adaptation procedure is general iz able to other vertical domains such as gaming and education, where novel concepts and domain knowledge are needed to build a helpful assistant. Similar to the don’t stop pre-training argument in [11], we consider a scalable pipeline to create domain-specific instruct data from large unlabelled data, and advocate don’t stop instruction-tuning to build customized LMM. (iii) Low serving cost. While the model size of general LMM can be giant and serving cost can be prohibitively high, customized LMM has its unique advantages in low serving cost. (iv) Smooth Model Adaptation. Alternatively, the network architecture allows us to initialize the vision encoder from BioMedCLIP [49], or initialize the language model from Vicuna [43], which may lead to higher performance. However, adapting from LLaVA smooth adaptation as a chatbot, where model’s behaviors transit from layperson to a professional assistant that is able to provide helpful domain-specific response.

讨论。我们探讨LLaVA-Med的三个优势特性/影响:(i) 可负担的开发成本。我们不追求通过扩大数据/模型规模来获得最佳性能,而是旨在提供开发成本低廉的合理解决方案:在8块40G A100 GPU上,阶段1和阶段2分别耗时7小时和8小时(具体数据见表5)。$(ii)A$ 多领域适配方案。虽然本文聚焦生物医学领域,但提出的适配流程可推广至游戏、教育等需要新概念和领域知识来构建实用助手的垂直领域。类似文献[11]提出的"不要停止预训练"观点,我们设计了一个可扩展的流程来从大规模无标注数据生成领域特定的指令数据,并主张通过"不要停止指令微调"来构建定制化大型多模态模型。(iii) 低服务成本。通用大型多模态模型的参数量可能极其庞大,导致服务成本过高,而定制化模型在低服务成本方面具有独特优势。(iv) 平滑模型适配。网络架构允许我们从BioMedCLIP[49]初始化视觉编码器,或从Vicuna[43]初始化语言模型,这可能带来更高性能。但采用LLaVA进行平滑适配时,模型行为会从外行自然过渡到能够提供专业领域响应的智能助手。

5 Experiments

5 实验

We conduct experiments to study two key components, the quality of the produced multimodal biomedical instruction-following data, and performance of LLaVA-Med. We consider two research evaluation settings: (1) What is the performance of LLaVA-Med as an open-ended biomedcal visual chatbot? (2) How does LLaVA-Med compare to existing methods on standard benchmarks? To clarify, throughout the entire experiments, we only utilize the language-only GPT-4.

我们通过实验研究两个关键组成部分:生成的多模态生物医学指令跟随数据的质量,以及LLaVA-Med的性能表现。我们考虑两种研究评估设置:(1) LLaVA-Med作为开放式生物医学视觉聊天机器人的性能如何?(2) LLaVA-Med在标准基准测试中与现有方法相比表现如何?需要说明的是,在整个实验过程中,我们仅使用了纯文本版本的GPT-4。

5.1 Biomedical Visual Chatbot

5.1 生物医学视觉聊天机器人

To evaluate the performance of LLaVA-Med on biomedical multimodal conversation, we construct an evaluation dataset with 193 novel questions. For this test dataset, we randomly selected 50 unseen image and caption pairs from PMC-15M, and generate two types of questions: conversation and detailed description. The conversation data is collected using the same self-instruct data generation pipeline as for the 2nd stage. Detailed description questions were randomly selected from a fixed set [24] of questions to elicit detailed description responses.

为了评估LLaVA-Med在生物医学多模态对话中的表现,我们构建了一个包含193个新问题的评估数据集。针对该测试集,我们从PMC-15M中随机选取了50组未见过的图像及标题对,并生成两类问题:对话型问题和详细描述型问题。对话数据采用与第二阶段相同的自指令数据生成流程进行收集。详细描述问题则从固定问题集[24]中随机选取,以获取详细描述性回答。

We leverage GPT-4 to quantify the correctness of the model answer to a question when given the image context and caption. GPT-4 makes a reference prediction, setting the upper bound answer for the teacher model. We then generate response to the same question from another LMM. Given responses from the two assistants (the candidate LMM and GPT-4), the question, figure caption, and figure context, we ask GPT-4 to score the helpfulness, relevance, accuracy, and level of details of the responses from the two assistants, and give an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. GPT-4 is also asked to provide a comprehensive explanation the evaluation, for us to better understand the models. We then compute the relative score using GPT-4 reference score for normalization.

我们利用GPT-4在给定图像上下文和标题时量化模型对问题回答的正确性。GPT-4生成参考预测,为教师模型设定答案上限。接着,我们从另一个大语言模型生成对同一问题的回答。给定两个助手(候选大语言模型和GPT-4)的回答、问题、图标题和图像上下文,我们要求GPT-4对两个回答的有用性、相关性、准确性和细节程度进行评分,并按1到10分给出总体评分(分数越高表示整体表现越好)。同时要求GPT-4提供评估的详细解释,以便我们更好地理解模型性能。最后,我们使用GPT-4的参考分数进行归一化计算相对得分。

QuestionTypesDomainsOverall
ConversationDescriptionCXRMRIHistologyGrossCT
(Question Count) LLaVA(143) 39.4(50) 26.2(37) 41.6(38) 33.4(44) 38.4(34) 32.9(40) 33.4(193) 36.1
LLaVA-Med
Stage 122.625.225.819.024.824.722.223.3
10K42.432.546.136.743.534.737.539.9
60K53.736.957.339.849.847.452.449.4
60K-IM55.136.456.240.452.751.850.150.2

Table 1: Performance comparison of mulitmodal chat instruction-following abilities, measured by the relative score via language GPT-4 evaluation.

问题类型 领域 总体
对话 描述 CXR MRI 组织学 大体 CT
(问题数量) LLaVA (143) 39.4 (50) 26.2 (37) 41.6 (38) 33.4 (44) 38.4 (34) 32.9 (40) 33.4 (193) 36.1
LLaVA-Med
阶段1 22.6 25.2 25.8 19.0 24.8 24.7 22.2 23.3
10K 42.4 32.5 46.1 36.7 43.5 34.7 37.5 39.9
60K 53.7 36.9 57.3 39.8 49.8 47.4 52.4 49.4
60K-IM 55.1 36.4 56.2 40.4 52.7 51.8 50.1 50.2

表 1: 多模态聊天指令跟随能力的性能比较,通过语言GPT-4评估的相对得分衡量。

Visual input example, Biomedical image from [6]

视觉输入示例,生物医学图像来自 [6]

Caption:Chest x-ray(CXR) on Day 2 of
User LLaVA-Med User LLaVA-Med procedures.seenthroughoutthelungswithrelative subpleuralsparing(redarrows) Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7398722/ Whatisshowninthisimage? The image is a chest X-ray (CXR) that shows bilateral patchy infiltrates,which are areas of increased opacity in the lungs. These infiltrates can be indicative of various lung conditions, such as infections, inflammation,or otherlung diseases. Are there any devices implanted in this patient? Yes, it appears that the patient has several devices implanted. These devices may include endotracheal tubes,central venous catheters, and/or pulmonary artery catheters.These devices are typically used for monitoring and managing various medical conditions, as well as for providing support during surgical What is shown in this image?
User LLaVA
diagram. User LLaVA for treatment or monitoring.The image shows a ribcage with wires coming out of it, indicating that it is likely a medical illustration or Are there any devices implanted in this patient? Yes, there are several devices protruding from the patient's chest, which are likely medical devices used
User Language GPT-4What is shown in this image? Theimageshows a chest x-ray taken onDay 2of apatient's admission after intubation.You cansee the endotracheal tube marked by a yellow line, and rapidly worsening ground-glass opacities throughout the lungs. There is also relative subpleural sparing, which is indicated by red arrows.

Table 2: Example comparison of medical visual chat and reasoning capabilities. The language-only GPT-4 is considered as the performance upper bound, as the golden captions and inline mentions are fed into GPT-4 as the context, without requiring the model to understand the raw image.

表 2: 医疗视觉对话与推理能力示例对比。纯语言模型 GPT-4 被视为性能上限,因为黄金标注和文中提及内容会作为上下文输入 GPT-4,无需模型理解原始图像。

The results are reported in Table 1. LLaVA-Med with Stage-1 training alone is insufficient as a chatbot, as it loses its ability to follow diverse instructions, though biomedical concept coverage is improved. LLaVA-Med with the full two-stage training consistently outperforms the general domain LLaVA, and training with larger instruct data (from 10K to 60K samples) leads to higher performance. When inline mentions are considered in self-instruct, the generated data 60K-IM slightly improves the chat ability. The results demonstrate the effectiveness of the strategies in biomedical instructionfollowing data collection as well as the value of dataset assets. Overall, for the best LLaVA-Med, it matches the $50.2%$ performance of GPT-4. Note that GPT-4 generates response by considering ground-truth caption and golden inline mentions, without understanding the images. Though not a fair comparison between LMMs and GPT-4, GPT-4 is a consistent and reliable evaluation tool.

结果如表 1 所示。仅进行第一阶段训练的 LLaVA-Med 作为聊天机器人表现不足,虽然生物医学概念覆盖有所提升,但丧失了遵循多样化指令的能力。完整两阶段训练的 LLaVA-Med 始终优于通用领域的 LLaVA,且使用更大规模指令数据 (从 10K 到 60K 样本) 训练会带来更高性能。当自指令生成考虑内联提及时,60K-IM 生成数据能略微提升对话能力。这些结果证明了生物医学指令跟随数据收集策略的有效性以及数据集资产的价值。最佳性能的 LLaVA-Med 达到了 GPT-4 50.2% 的水平。需注意 GPT-4 在生成响应时会参考真实标注和黄金内联提及,但并不理解图像内容。尽管大语言模型与 GPT-4 的对比并不公平,但 GPT-4 仍是稳定可靠的评估工具。

In Table 2, we provide examples on the biomed visual conversations of different chatbots. LLaVAMed precisely answers the questions with biomedical knowledge, while LLaVA behaves like a layperson, who hallucinate based on commonsense. Since the multimodal GPT-4 is not publicly available, we resort to language-only GPT-4 for comparison. We feed golden captions and inline mentions into GPT-4 as the context, it generates knowledgeable response through re-organizing the information in the conversational manner.

在表2中,我们提供了不同聊天机器人在生物医学视觉对话中的示例。LLaVAMed能基于生物医学知识准确回答问题,而LLaVA表现得像外行,会基于常识产生幻觉。由于多模态GPT-4尚未公开,我们采用纯语言版本的GPT-4进行对比。将黄金标注和文中提及内容作为上下文输入GPT-4后,它能通过对话式重组信息生成专业回答。

5.2 Performance on Established Benchmarks

5.2 在现有基准测试上的表现

Dataset Description. We train and evaluate LLaVA-Med on three biomedical VQA datasets. The detailed data statistics are summarized in Table 3.

数据集描述。我们在三个生物医学视觉问答(VQA)数据集上训练和评估LLaVA-Med。详细数据统计信息总结如表3所示。

• VQA-RAD [16] contains $3515\mathrm{QA}$ pairs generated by clinicians and 315 radiology images that are evenly distributed over the head, chest, and abdomen. Each image is associated with multiple questions. Questions are categorized into 11 categories: abnormality, attribute, modality, organ system, color, counting, object/condition presence, size, plane, positional reasoning, and other. Half of the answers are closed-ended (i.e., yes/no type), while the rest are open- ended with either one-word or short phrase answers.

• VQA-RAD [16] 包含由临床医生生成的3515个QA对和315张均匀分布在头部、胸部和腹部的放射影像。每张影像关联多个问题,问题分为11类:异常 (abnormality)、属性 (attribute)、模态 (modality)、器官系统 (organ system)、颜色 (color)、计数 (counting)、对象/条件存在 (object/condition presence)、尺寸 (size)、平面 (plane)、位置推理 (positional reasoning) 及其他 (other)。半数答案为封闭式(即是非型),其余为开放式(需回答单词或短语)。

DatasetVQA-RAD Train TestSLAKEPathVQA
TrainValTestTrainValTest
# Images31320345096962599858858
#QAPairs179745149191053106119,75562796761
#Open7701792976631645994931443370
#Closed10272721943422416980631353391

Table 3: Dataset statistics. For SLAKE, only the English subset is considered for head-to-head comparison with existing methods.

Dataset VQA-RAD Train Test SLAKE PathVQA
Train Val Test Train Val Test
# Images 313 203 450 96 96 2599 858 858
#QAPairs 1797 451 4919 1053 1061 19,755 6279 6761
#Open 770 179 2976 631 645 9949 3144 3370
#Closed 1027 272 1943 422 416 9806 3135 3391

表 3: 数据集统计。对于SLAKE数据集,仅考虑英文子集以便与现有方法进行直接对比。

• SLAKE [23] is a Semantically-Labeled Knowledge-Enhanced dataset for medical VQA. It consists of 642 radiology images and over 7000 diverse QA pairs annotated by experienced physicians, where the questions may involve external medical knowledge (solved by provided medical knowledge graph), and the images are associated with rich visual annotations, including semantic segmentation masks and object detection bounding boxes. Besides, SLAKE includes richer modalities and covers more human body parts than the currently available dataset, including brain, neck, chest, abdomen, and pelvic cavity. Note SLAKE is bilingual dataset with English and Chinese. When compared with existing methods, we only consider the English subset. • PathVQA [13] is a dataset of pathology images. It contains a total of 4998 pathology images with 32,799 QA pairs. Every image has several questions that relate to multiple aspects such as location, shape, color, appearance, etc. The questions are categorized into two types, with several varieties: open-ended questions such as why, what, how, where, etc., and closed-ended questions.

• SLAKE [23] 是一个面向医学视觉问答 (VQA) 的语义标注知识增强数据集。它包含642张放射影像和由经验丰富的医师标注的7000多个多样化问答对,其中问题可能涉及外部医学知识(通过提供的医学知识图谱解决),影像关联了丰富的视觉标注,包括语义分割掩码和物体检测边界框。此外,SLAKE比现有数据集涵盖更丰富的模态和更多人体部位,包括脑部、颈部、胸部、腹部和盆腔。需要注意的是,SLAKE是包含英文和中文的双语数据集。在与现有方法对比时,我们仅考虑英文子集。

• PathVQA [13] 是一个病理影像数据集。它包含4998张病理学图像和32,799个问答对。每张图像都有涉及位置、形状、颜色、外观等多方面的问题。问题分为开放式(如为什么、是什么、怎么样、在哪里等)和封闭式两大类,并包含多个变体。

Evaluation Metrics. For the closed-set questions, we report the accuracy. For open-set questions, we use recall to evaluate the ratio that ground-truth tokens appear in the generated sequences. In the literature, the unique answers in the training set are considered as the answer candidates, from which the models can select to predict answers for testing questions. Since we do not provide any constraint for the responses to open-set questions, our formulation is closer to open-set nature, but is intrinsically harder.

评估指标。对于封闭式问题,我们报告准确率。对于开放式问题,我们使用召回率来评估真实答案token出现在生成序列中的比例。在现有文献中,训练集中的唯一答案被视为候选答案集,模型可以从中选择以预测测试问题的答案。由于我们对开放式问题的回答未施加任何约束,我们的设定更贴近开放式本质,但本质上更具挑战性。

Comparisons with SoTA. We compare LLaVA-Med with the general domain LLaVA and existing representative methods in Table 4 (a). First, All LLaVA-Med variants outperform LLaVA. While the difference of language model initialization from LLaVA or Vicuna is minor, the initialization of vision encoder from BioMed CLIP is slightly better than from general-domain CLIP. Second, the fine-tuning performance of LLaVA-Med is higher than supervised SoTA on the closed-set questions on VQA-RAD and PathVQA. This validates LLaVA-Med’s strong ability in following instruction to complete biomedical tasks, when clear instructions are provided (e.g., , yes or no). Third, for open-set questions, LLaVA-Med achieves SoTA on SLAKE, while its performance is limited on other datasets, especially compared with existing methods. This is perhaps because the open-set biomedical questions can be ambiguous without constraining their excepted answer options.

与现有最优技术(SoTA)的对比。我们在表4(a)中将LLaVA-Med与通用领域LLaVA及现有代表性方法进行对比。首先,所有LLaVA-Med变体均优于LLaVA。虽然语言模型初始化使用LLaVA或Vicuna的差异较小,但视觉编码器使用BioMed CLIP初始化的效果略优于通用领域CLIP。其次,在VQA-RAD和PathVQA的封闭式问题上,LLaVA-Med的微调性能高于监督学习的SoTA方法。这验证了当提供明确指令时(例如是/否问题),LLaVA-Med在执行生物医学任务方面具有强大能力。第三,对于开放式问题,LLaVA-Med在SLAKE数据集上达到SoTA水平,但在其他数据集(尤其是与现有方法相比)表现有限。这可能是因为开放式生物医学问题在没有限定预期答案选项时容易产生歧义。

Ablation Studies. To study the impact of our curated instruction data and hyper-parameters in the training pipeline, we report the performance of different model variants in Table 4 (b). Several findings are confirmed: $(i)$ LLaVA-Med consistently outperforms LLaVA by a large margin, indicating the effectiveness of our biomedical domain-specific adaptation. The performance gaps on zero-shot are larger than that in fine-tuned settings, showing that LLaVA-Med is clearly a better option than LLaVA when deploying one model for various scenarios in the wild. $(i i)$ Training longer in Stage 1 improves zero-shot transfer, but Stage 1 alone is not sufficient, because the single image captioning instruction in Stage 1 may encourage the model to lose its ability in follow diverse instructions. (iii) Instruction-following data in Stage 2 is critical, and the performance is generally improved, when the instruct data amount increases from 10K to 60K. The 60K-IM data provides the best averaged zero-shot and fine-tuned performance, respectively, validating the effectiveness of considering inline mention as external knowledge in data creation. $(i v)$ Fine-tuning longer on downstream datasets till 9 epochs benefits the performance, especially on checkpoints with 3-epoch training in Stage 2. Increasing language model size from 7B to 13B improves the overall zero-shot performance and fine-tuned performance. We suggest practitioners to choose the appropriate quality-cost trade-off, by referring to the running time in Table 5.

消融实验。为研究精选指令数据及训练流程超参数的影响,我们在表4(b)中报告了不同模型变体的性能。主要发现如下:$(i)$ LLaVA-Med始终以显著优势超越LLaVA,印证了生物医学领域自适应策略的有效性。零样本场景下的性能差距大于微调场景,表明在开放场景部署单一模型时LLaVA-Med是更优选择。$(ii)$ 延长第一阶段训练能提升零样本迁移能力,但仅靠第一阶段不足以保证指令多样性,因该阶段单一图像描述任务可能导致模型丧失多指令响应能力。(iii) 第二阶段指令跟随数据至关重要,当指令数据从10K增至60K时性能普遍提升。60K-IM数据分别在零样本和微调场景取得最佳平均性能,验证了在数据构建中将文中提及内容作为外部知识的有效性。$(iv)$ 在下游数据集进行9轮微调可提升性能(尤其对第二阶段仅训练3轮的检查点),将语言模型规模从7B扩展至13B能全面提高零样本与微调性能。建议实践者参考表5的运行时间,权衡质量与成本选择合适方案。

MethodVQA-RAD Ref Open ClosedSLAKE Ref Open ClosedPathVQA Ref Open Closed
Supervisedfinet-tuningresultswithourownexperimentruns
LLaVA50.00 65.0778.18 63.227.74 63.20
LLaVA-Med (From LLaVA)61.52 84.1983.08 85.3437.95 91.21
LLaVA-Med(FromVicuna)64.39 81.9884.71 83.1738.87 91.65
LLaVA-Med(BioMed CLIP)64.75 83.0987.11 86.7839.60 91.09
Representative&SoTAmethodswithnumbersreportedintheliterature
VLEncoder-Decoder[5] 71.4982.4771.49 85.61
Q2ATransformer[26]79.19 81.2054.85 88.85
Prefix T. Medical LM [41]84.30 82.0140.00 87.00
PubMedCLIP [8]60.10 80.0078.40 82.50
BiomedCLIP [49]67.60 79.8082.05 89.70
M212 [22]66.50 83.5074.70 91.1036.30 88.00
方法 VQA-RAD 参考 开放 封闭 SLAKE 参考 开放 封闭 PathVQA 参考 开放 封闭
监督微调结果(未包含我们自己的实验运行)
LLaVA 50.00 65.07 78.18 63.22 7.74 63.20
LLaVA-Med (基于LLaVA) 61.52 84.19 83.08 85.34 37.95 91.21
LLaVA-Med (基于Vicuna) 64.39 81.98 84.71 83.17 38.87 91.65
LLaVA-Med (BioMed CLIP) 64.75 83.09 87.11 86.78 39.60 91.09
代表性及最先进方法(文献报道数据)
VL编码器-解码器 [5] 71.49 82.47 71.49 85.61
Q2A Transformer [26] 79.19 81.20 54.85 88.85
前缀T.医学LM [41] 84.30 82.01 40.00 87.00
PubMedCLIP [8] 60.10 80.00 78.40 82.50
BiomedCLIP [49] 67.60 79.80 82.05 89.70
M212 [22] 66.50 83.50 74.70 91.10 36.30 88.00

(a) Comparison with prior state-of-the-art supervised methods. For open-ended questions, prior methods still formulate the problem as classification among distinct answers in the training set, which may overestimate their general iz ability as these datasets are unusual in that the test answers are almost always present in training.

(a) 与现有最优监督方法的对比。对于开放式问题,现有方法仍将问题表述为训练集中不同答案的分类任务,这可能高估其泛化能力,因为这些数据集具有特殊性——测试答案几乎总是存在于训练集中。

LLaVA-MedModelVariantsVQA-RADSLAKEPathVQAAverage
Instruct Stage 1Stage 2FTOpen(ClosedOpenClosedOpenClosed
CLIP Vision Encoder [36],7B Language Model0015.2712.5018.5513.466.2613.5113.26
0 01 3 0015.3315.4423.6115.386.3514.7415.14
10K1 3025.7957.3531.5051.688.4959.6639.08
10K3 3028.4459.5622.6343.995.4052.6735.45
10K1 3136.3955.8871.6456.4925.5082.8754.79
10K318.5955.5178.6063.4634.0286.94
1 329.8055.1538.0850.0011.7059.6656.19
60K1 1029.6760.2935.5311.7640.73
60K1 3022.6358.0972.7553.8553.2040.72
60K13 154.1279.3354.3324.1971.6050.60
60K13 364.7138.4464.9017.1871.3758.60
60K-IM11 029.6761.4052.4011.4156.2441.59
60K-IM13 028.2361.4039.1752.1612.3054.0541.22
60K-IM13 128.6156.2570.5854.5711.1759.1946.73
60K-IM13 355.5066.5480.5764.1835.8889.1565.30
60K-IM13 966.2680.8882.3084.8637.5991.5473.90
60K-IM13 1561.5384.1983.0885.3437.9591.2173.88
60K-IM13 1861.3781.2584.2483.1737.8891.3973.22
CLIPVisionEncoder[36], 13BLanguageModel
60K-IM13 031.6661.4037.7149.7611.3449.6340.25
60K-IM13 964.5877.9484.9785.5838.8292.3974.05
BioMedCLIPVisionEncoder[49],7B LanguageModel
60K-IM13037.8460.6639.7354.3311.6549.0742.21
60K-IM139 64.7583.0987.1186.7839.6091.0975.40
LLaVA00020.7459.1926.8250.248.7445.6535.23
LLaVA-Med模型变体 指令阶段1阶段2 微调 VQA-RAD开放 VQA-RAD封闭 SLAKE开放 SLAKE封闭 PathVQA开放 PathVQA封闭 平均
CLIP视觉编码器[36], 7B大语言模型 0 0 15.27 12.50 18.55 13.46 6.26 13.51 13.26
0 1 3 0 0 15.33 15.44 23.61 15.38 6.35 14.74 15.14
10K 1 3 0 25.79 57.35 31.50 51.68 8.49 59.66 39.08
10K 3 3 0 28.44 59.56 22.63 43.99 5.40 52.67 35.45
10K 1 3 1 36.39 55.88 71.64 56.49 25.50 82.87 54.79
10K 3 18.59 55.51 78.60 63.46 34.02 86.94
1 3 29.80 55.15 38.08 50.00 11.70 59.66 56.19
60K 1 1 0 29.67 60.29 35.53 11.76 40.73
60K 1 3 0 22.63 58.09 72.75 53.85 53.20 40.72
60K 1 3 1 54.12 79.33 54.33 24.19 71.60 50.60
60K 1 3 3 64.71 38.44 64.90 17.18 71.37 58.60
60K-IM 1 1 0 29.67 61.40 52.40 11.41 56.24 41.59
60K-IM 1 3 0 28.23 61.40 39.17 52.16 12.30 54.05 41.22
60K-IM 1 3 1 28.61 56.25 70.58 54.57 11.17 59.19 46.73
60K-IM 1 3 3 55.50 66.54 80.57 64.18 35.88 89.15 65.30
60K-IM 1 3 9 66.26 80.88 82.30 84.86 37.59 91.54 73.90
60K-IM 1 3 15 61.53 84.19 83.08 85.34 37.95 91.21 73.88
60K-IM 1 3 18 61.37 81.25 84.24 83.17 37.88 91.39 73.22
CLIP视觉编码器[36], 13B大语言模型
60K-IM 1 3 0 31.66 61.40 37.71 49.76 11.34 49.63 40.25
60K-IM 1 3 9 64.58 77.94 84.97 85.58 38.82 92.39 74.05
BioMedCLIP视觉编码器[49], 7B大语言模型
60K-IM 1 3 0 37.84 60.66 39.73 54.33 11.65 49.07
60K-IM 1 3 9 64.75 83.09 87.11 86.78 39.60 91.09
LLaVA 0 0 0 20.74 59.19 26.82 50.24 8.74 45.65

(b) Ablation studies with varying number of training epochs at different stages. “FT” is Fine-Tuning. 60K-IM indicates the instruct data generated with inline mentions. The gray rows are zero-shot performance of LLaVAMed trained with different instruct data, they are selected to show in subtable (a).

(b) 不同训练阶段epoch数量消融实验。 "FT"表示微调(Fine-Tuning)。60K-IM代表使用内联提及生成的指令数据。灰色行展示的是使用不同指令数据训练的LLaVAMed的零样本性能,这些数据在子表(a)中被选中展示。

Table 4: Quantitative results on three established biomedical VQA datasets. For open-set questions, we report the recall for our free-form text generation method in column Open. For closed-set questions, we report the accuracy in column Closed. Bold indicates LLaVA-Med achieves new SoTA.

表 4: 三个成熟生物医学VQA数据集的量化结果。对于开放集问题,我们在Open列报告了自由文本生成方法的召回率。对于封闭集问题,我们在Closed列报告了准确率。加粗表示LLaVA-Med取得了新的SoTA (state-of-the-art) 成果。

Table 5: Running time (hours) for 1 and 3-epoch training at each stage, with batch size 128 on eight A100 GPUs.

Stage 1 1 3Stage 2 Instruct 1 3VQA-RAD 1 3SLAKE 1 3PathVQA 1 3
6.8 19.410K 0.6 1.8 60K 2.6 8.00.3 0.60.6 1.01.0 2.5

表 5: 在八块A100 GPU上以128的批次大小进行1轮和3轮训练的运行时间(小时)。

Stage 1 1 3 Stage 2 Instruct 1 3 VQA-RAD 1 3 SLAKE 1 3 PathVQA 1 3
6.8 19.4 10K 0.6 1.8 60K 2.6 8.0 0.3 0.6 0.6 1.0 1.0 2.5
Biomedical image from the SLAKE Bilingual dataset. [23]
User LLaVA-Med这张图片的成像方式是什么? Theimageisanabdominalcomputed tomography(CT) scan,whichisacross-sectionalviewof thepatient's ab- domen.The image is taken in the portal phase, which meansitis captured after the administration ofcontrastma- terialtoenhancethevisibilityofcertainstructureswithin theabdomen. CT
User LLaVA这张图片展示的是核磁共振的哪种类型? The image shows a typical appearance of a nodular lesion on T1-weighted MR images. The lesion appears hyper- intense,which means it is brighter than the surrounding tissue. This can be indicative of various types of lesions, andfurtherevaluationwouldbeneededtodeterminethe specific nature and cause of the lesion.

Table 6: Examples on zero-shot on Chinese questions. Though LLaVA-Med training data itself does not include Chinese, it is able to correctly understand the Chinese questions.

SLAKE双语数据集中的生物医学图像。[23]
用户LLaVA-Med 这张图片的成像方式是什么?该图像是一幅腹部计算机断层扫描(CT)图像,呈现患者腹部的横截面视图。图像拍摄于门静脉期,这意味着是在注射造影剂以增强腹部某些结构可见度后捕获的。
用户LLaVA 这张图片展示的是核磁共振的哪种类型?该图像显示了T1加权磁共振图像上结节性病灶的典型表现。病灶呈现高信号,意味着比周围组织更亮。这可能提示多种类型的病变,需要进一步评估以确定病灶的具体性质和成因。

表6: 中文问题零样本示例。尽管LLaVA-Med训练数据本身不包含中文,但能够正确理解中文提问。

Case Study I: Zero-shot on Chinese Questions. For the LLaVA-Med trained on 60K-IM data, we provide Chinese questions on SLAKE dataset. Though LLaVA-Med training does not include Chinese instruction-following data, we show in Table 6 that LLaVA-Med is able to correctly understand the Chinese questions and respond the correct answers, probably due to the multilingual knowledge learned in LLaMA/Vicuna. Existing models will fail when zero-shot transfer cross languages.

案例研究一:中文问题零样本测试。针对基于60K-IM数据训练的LLaVA-Med模型,我们在SLAKE数据集上提出中文问题。如表6所示,尽管LLaVA-Med训练过程中未包含中文指令跟随数据,该模型仍能正确理解中文问题并给出准确回答,这可能是由于LLaMA/Vicuna已习得多语言知识所致。现有模型在跨语言零样本迁移时均会失效。

6 Conclusions

6 结论

We present LLaVA-Med, a large language-and-vision model for the biomedical domain. To create this model, we create high-quality biomedical language-image instruction-following dataset using a self-instruct approach to build a data curation pipeline using language-only GPT-4 and external knowledge. LLaVA-Med demonstrates strong excellent chat abilities with domain knowledge, and outperforms previous supervised SoTA on three VQA datasets on certain metrics with subsequent fine-tuning.

我们推出了LLaVA-Med,这是一个面向生物医学领域的大语言视觉模型。为构建该模型,我们采用自指导方法创建了高质量的生物医学语言-图像指令跟随数据集,利用纯语言GPT-4和外部知识搭建数据筛选流程。LLaVA-Med展现出卓越的领域知识对话能力,经过后续微调后,在三个视觉问答(VQA)数据集的特定指标上超越了此前监督学习的最先进(SoTA)方法。

While we believe that LLaVA-Med represents a significant step towards building a useful biomedical visual assistant, we note that LLaVA-Med is limited by hallucinations and weak in-depth reasoning common to many LMMs. Future work is directed toward improving quality and reliability.

虽然我们相信LLaVA-Med在构建实用的生物医学视觉助手方面迈出了重要一步,但我们也注意到LLaVA-Med存在许多大语言模型共有的幻觉问题和深度推理能力不足的局限。未来的工作将致力于提升质量和可靠性。

References

参考文献

A Data

数据

Instructions for brief image description. The list of instructions used to briefly describe the image content are shown in Table 7. They present the same meaning with natural language variance.

简要图像描述说明。用于简要描述图像内容的指令列表如表7所示。这些指令含义相同,但采用自然语言变体表达。

• "Describe the image concisely." • "Provide a brief description of the given image." • "Offer a succinct explanation of the picture presented." • "Summarize the visual content of the image." • "Give a short and clear explanation of the subsequent image." • "Share a concise interpretation of the image provided." • "Present a compact description of the photo’s key features." • "Relay a brief, clear account of the picture shown." • "Render a clear and concise summary of the photo." • "Write a terse but informative summary of the picture." • "Create a compact narrative representing the image presented."

• "简要描述图像内容。"
• "对给定图像进行简要描述。"
• "对呈现的图片提供简洁说明。"
• "概括图像的视觉内容。"
• "对后续图像给出简短清晰的解释。"
• "对提供的图像进行简明解读。"
• "对照片的关键特征进行紧凑描述。"
• "对展示的图片进行简短清晰的说明。"
• "对照片进行清晰简洁的总结。"
• "撰写简洁但信息丰富的图片摘要。"
• "为呈现的图像构建紧凑的叙述。"

Table 7: The list of instructions for brief image description.

表 7: 简短图像描述指令列表

Instructions for detailed image description. The list of instructions used to describe the image content in detail are shown in Table 8. They present the same meaning with natural language variance.

详细图像描述说明。用于详细描述图像内容的指令列表如表 8 所示。这些指令在自然语言表达上存在差异但含义相同。

• "Describe the following image in detail" • "Provide a detailed description of the given image" • "Give an elaborate explanation of the image you see" • "Share a comprehensive rundown of the presented image" • "Offer a thorough analysis of the image" • "Explain the various aspects of the image before you" • "Clarify the contents of the displayed image with great detail" • "Characterize the image using a well-detailed description" • "Break down the elements of the image in a detailed manner" • "Walk through the important details of the image" • "Portray the image with a rich, descriptive narrative" • "Narrate the contents of the image with precision" • "Analyze the image in a comprehensive and detailed manner" • "Illustrate the image through a descriptive explanation" • "Examine the image closely and share its details" • "Write an exhaustive depiction of the given image"

• "详细描述以下图像"
• "对给定图像进行详细描述"
• "对你看到的图像进行详尽解释"
• "对呈现的图像进行全面概述"
• "对图像进行深入分析"
• "解释你面前图像的各个方面"
• "以极大细节阐明所显示图像的内容"
• "通过细致描述来刻画图像特征"
• "以详细方式分解图像中的元素"
• "逐步讲解图像的重要细节"
• "用丰富的描述性叙述描绘图像"
• "精确叙述图像内容"
• "以全面且详细的方式分析图像"
• "通过描述性解释说明图像"
• "仔细检查图像并分享其细节"
• "对给定图像进行详尽描述"

B Prompts

B 提示词

Prompting GPT-4 to generate medical visual instruction-following data

提示 GPT-4 生成医疗视觉指令跟随数据

messages $=$ [ {"role":"system", "content": """You are an AI assistant specialized in biomedical topics.

消息 = [ {"role":"system", "content": """你是一位专注于生物医学主题的AI助手。

You are provided with a text description (Figure Caption) of a figure image from a biomedical research paper. In some cases, you may have additional text (Figure Context) that mentions the image. Unfortunately, you don’t have access to the actual image.

你获得了一份生物医学研究论文中某个图像的文本描述(Figure Caption)。在某些情况下,你可能还会看到提及该图像的其他文本(Figure Context)。遗憾的是,你无法查看实际图像。

Below are requirements for generating the questions and answers in the conversation:

以下是对话中生成问题和答案的要求:


Figure 4: messages we use to prompt GPT-4 to generate medical visual instruction-following data. Manually curated few-shot examples are included in the prompt, where each example has input sample[‘context’] and output sample[‘response’]. Please see Figure 5 for one of the few-shot examples.

图 4: 用于提示 GPT-4 生成医疗视觉指令跟随数据的消息。提示中包含人工整理的少样本示例,每个示例包含输入样本['context']和输出样本['response']。具体示例请参见图 5。

One of the few-shot examples used in the prompt

提示中使用的少样本示例之一

Example input:

示例输入:

Figure Caption:

图注:

Figure 2: Chest X-ray. Card iomega ly with diffuse bilateral interstitial infiltrates and a right-sided pleural effusion.

图 2: 胸部X光片。心脏增大伴弥漫性双侧间质浸润及右侧胸腔积液。

Figure Context:

图上下文:

  • His respiratory status continued to worsen and on hospital day number 14 , he required intubation and mechanical ventilation for hypoxemic respiratory failure ( Figure 2 ) .
  • 他的呼吸状况持续恶化,在住院第14天时,因低氧性呼吸衰竭需要插管并接受机械通气 (图 2)。

Example output:

示例输出:

User: Which side of the lung has a pleural effusion?

用户:肺部哪一侧出现胸腔积液?

Assistant: The right side of the lung has a pleural effusion.

右肺出现胸腔积液。

User: According to the CT image, does the patient has card iomega ly?

用户:根据CT图像,患者是否有心脏肥大?

Assistant: The image appeares to be a chest x-ray. However, it indicates that the patient has card iomega ly, which means an enlarged heart. Additionally, there are diffuse bilateral interstitial infiltrates and a right-sided pleural effusion present in the image.

该图像疑似为胸部X光片。结果显示患者存在心脏肥大(cardiomegaly),即心脏扩大。此外,图像中可见弥漫性双侧间质浸润及右侧胸腔积液。

User: Additional Info: The patient’s respiratory status continued to worsen and on hospital day number 14, requiring intubation and mechanical ventilation for hypoxemic respiratory failure.

患者呼吸状态持续恶化,于住院第14天因低氧性呼吸衰竭需行气管插管及机械通气。

Assistant: Thank you for providing additional information about the patient’s condition. Based on the information you provided, it appears that the patient’s respiratory status continued to worsen and on hospital day number 14, they required intubation and mechanical ventilation for hypoxemic respiratory failure. This suggests that the patient’s lung function has significantly declined, and they may be experiencing respiratory distress. It is important for the medical team to closely monitor the patient’s condition and provide appropriate treatment to manage their symptoms and improve their lung function.

感谢您提供患者病情的补充信息。根据您所述,患者的呼吸状况持续恶化,在住院第14天因低氧性呼吸衰竭需要插管和机械通气。这表明患者肺功能已显著下降,可能正经历呼吸窘迫。医疗团队需密切监测患者状况,并提供适当治疗以控制症状并改善肺功能。

Visual input example, Biomedical image:

视觉输入示例,生物医学图像:

Table 9: Example comparison of medical visual chat and reasoning capabilities.

User LLaVA-MedWhat is this image about? The image is a coronal view of a computed tomography (CT) scan of the abdomen and pelvis. It provides a detailed view of the internal organs and structures within the abdominal
User LLaVA-Medand pelvic cavities. Whatisabnormal aboutthisimage? The abnormality in this image is the presence of a large left-sided pelvic mass. This mass could be due to various reasons, such as a tumor, cyst, or other growth. Further evaluation and tests would be needed to determine the exact cause and appropriate treatment for this abnormality.
User LLaVAWhat is this image about? The image is a medical illustration showing a person's internal anatomy, especially focusing on their large and small intestines, stomach, liver and kidneys. The organs are highlighted in different colors, making it easy to identify each organ
User LLaVAWhatisabnormalaboutthisimage? As a visual AI assistant, I cannot determine the specific abnormality on the chest X-ray based on the provided image description. The image description mainly focuses on the organs and their colors...

表 9: 医学视觉对话与推理能力示例对比

| User LLaVA-Med | 这张图像是关于什么的? 该图像是腹部和骨盆的计算机断层扫描 (CT) 冠状视图。它提供了腹腔内部器官和结构的详细视图 |
| User LLaVA-Med | 以及盆腔腔体。这张图像有什么异常? 该图像的异常之处在于存在一个较大的左侧盆腔肿块。这个肿块可能由多种原因引起,例如肿瘤、囊肿或其他增生。需要进一步评估和检查以确定该异常的确切原因和适当治疗方案。 |
| User LLaVA | 这张图像是关于什么的? 该图像是一幅医学插图,展示了一个人的内部解剖结构,尤其侧重于其大小肠、胃、肝脏和肾脏。器官用不同颜色高亮显示,便于识别每个器官 |
| User LLaVA | 这张图像有什么异常? 作为视觉AI助手,我无法根据提供的图像描述确定胸部X光片的具体异常。图像描述主要关注器官及其颜色... |

阅读全文(20积分)