[Paper translation] Prompt-based bioinformatics: a new interface for multi-omics analysis




Prompt-based bioinformatics: a new interface for multi-omics analysis

Ali R. Awan, Mehrdad Oveisi & Mohammad M. Karimi

Prompt-based bioinformatics redefines how scientists interact with biological data, enabling natural language queries across multi-omics layers. By removing coding barriers and streamlining integration, this paradigm facilitates accessible, hypothesis-driven discovery. We call for community standards, educational adoption and collaborative development to realize its full potential in research and clinical settings.

Natural language processing has long supported bioinformatics, aiding in the extraction of insights from unstructured text and biological sequences. Rule-based methods and early statistical approaches enabled structured analysis of scientific literature, gene and protein annotations, and biological pathways. A breakthrough came in 2017 with the introduction of the transformer deep neural network model, which offered superior performance in learning contextual relationships within text. The introduction of transformer models laid the foundation for large language models (LLMs). The scale and capabilities of LLMs gave rise to prompting, which offers a more intuitive way to interact with computational systems compared with traditional programming. As LLMs advanced, they began demonstrating emergent abilities such as few-shot learning and reasoning. The release of ChatGPT in 2022 showcased the power of LLMs in delivering coherent, context-aware outputs, prompting widespread exploration of their use in scientific domains, including bioinformatics.

Prompting as a new programming paradigm

Prompting introduces an accessible interface for computational tasks. Instead of coding in languages such as Python or R, users specify tasks in natural language. This shift is operationalized through LLM-based ‘agent’ systems that connect prompts to executable tools [1]. These systems interpret user intent, select appropriate functions and orchestrate analysis steps without requiring the user to understand syntax or pipeline logic. Whereas conventional workflows require scripting or clicking through graphical user interfaces, prompting enables seamless, adaptive task execution from a single input line, reducing the cognitive and technical burden for end users.
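The idea of connecting a prompt to executable tools can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tool names, the `TOOLS` registry and the keyword matching in `plan` are hypothetical, and the keyword rule only stands in for the intent parsing that a real agent system would delegate to an LLM.

```python
from typing import Callable, Dict, List


# Hypothetical tools: plain functions that the toy agent is allowed to call.
def differential_expression(samples: str) -> str:
    return f"ran differential expression on {samples}"


def pathway_enrichment(samples: str) -> str:
    return f"ran pathway enrichment on {samples}"


# Registry mapping a trigger keyword to a tool (illustrative, not a real API).
TOOLS: Dict[str, Callable[[str], str]] = {
    "expression": differential_expression,
    "pathway": pathway_enrichment,
}


def plan(prompt: str) -> List[str]:
    """Stand-in for LLM intent parsing: select tools whose keyword occurs in the prompt."""
    text = prompt.lower()
    return [name for name in TOOLS if name in text]


def run_agent(prompt: str, samples: str) -> List[str]:
    """Run the selected tools in registry order and collect their outputs."""
    return [TOOLS[name](samples) for name in plan(prompt)]


if __name__ == "__main__":
    query = ("Compare gene expression between treatment and control samples "
             "and summarize key pathways involved")
    for line in run_agent(query, "cohort_A"):
        print(line)
```

The user supplies one sentence and never sees the function names. Selection and ordering of tools happen inside `plan` and `run_agent`, which is the part an LLM-based agent would replace with learned behaviour.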

LLMs are probabilistic and context sensitive. Therefore, prompt phrasing considerably affects output quality. Prompt engineering encompasses techniques such as in-context learning, structured formatting and self-critique to enhance consistency [2]. Retrieval-augmented generation complements prompt engineering by enabling models to incorporate external documents or datasets into their responses.

Nature Reviews Genetics

Retrieval-augmented generation is particularly relevant in bioinformatics, in which up-to-date datasets and unpublished results are often required.
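As a rough illustration of how in-context examples, structured formatting and retrieval can be combined into one prompt, consider the sketch below. The token-overlap retriever is a deliberately simple stand-in for the embedding-based search that production systems typically use, and the function names and prompt layout are assumptions, not part of any tool named in this article.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank documents by the number of tokens shared with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str,
                 examples: List[Tuple[str, str]],
                 documents: List[str],
                 k: int = 1) -> str:
    """Assemble retrieved context, few-shot examples and the question in a fixed layout."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"Context:\n{context}\n\nExamples:\n{shots}\n\nQ: {query}\nA:"


if __name__ == "__main__":
    docs = [
        "ATAC-seq peaks were called per sample",
        "RNA-seq counts were normalized per sample",
    ]
    shots = [("Which assay measures chromatin accessibility?", "ATAC-seq")]
    print(build_prompt("How were RNA-seq counts normalized?", shots, docs))
```

Because the retrieved context is injected at query time, the model can draw on a laboratory's own notes or unpublished tables without being retrained.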

Prompt-based versus conventional bioinformatics

Conventional bioinformatics workflows rely on well-defined pipelines built using command-line tools, scripting languages, such as R or Python, and modular platforms, such as Galaxy or Nextflow. These workflows require substantial programming knowledge, domain expertise and familiarity with data formats and preprocessing steps. Integration across data modalities (for example, genomics and transcriptomics) often demands extensive manual curation, metadata alignment and file conversion.

Prompt-based bioinformatics disrupts this paradigm by enabling researchers to articulate complex analysis tasks in plain language. The core distinction lies in the user interface: instead of constructing or navigating pipelines, users interact with agentic systems that parse prompts and assemble the necessary components in real time. For instance, rather than writing a script to run differential expression analysis followed by gene set enrichment, a user might input: “Compare gene expression between treatment and control samples and summarize key pathways involved”. The system then autonomously executes a multi-step workflow, using the appropriate tools behind the scenes.
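For readers unfamiliar with the script that such a prompt replaces, the following sketch shows the two steps in their simplest form. It is a toy under stated assumptions: the fold change is a plain ratio of group means with a pseudocount (no variance modelling, unlike dedicated differential expression packages), the enrichment test is a one-sided hypergeometric tail, and all gene names and values are invented. It requires Python 3.8 or later for `math.comb`.

```python
import math
from statistics import mean
from typing import Dict, List, Set


def log2_fold_changes(treated: Dict[str, List[float]],
                      control: Dict[str, List[float]],
                      pseudocount: float = 1.0) -> Dict[str, float]:
    """Per-gene log2 ratio of mean treated to mean control expression.

    Assumes every gene in `treated` is also present in `control`.
    """
    return {
        gene: math.log2((mean(treated[gene]) + pseudocount)
                        / (mean(control[gene]) + pseudocount))
        for gene in treated
    }


def select_genes(lfc: Dict[str, float], threshold: float = 1.0) -> Set[str]:
    """Keep genes whose absolute log2 fold change reaches the threshold."""
    return {gene for gene, value in lfc.items() if abs(value) >= threshold}


def enrichment_p_value(hits: Set[str], pathway: Set[str], universe: int) -> float:
    """Hypergeometric tail: P(overlap >= observed) for len(hits) draws from the universe."""
    observed = len(hits & pathway)
    in_pathway, drawn = len(pathway), len(hits)
    total = math.comb(universe, drawn)
    tail = sum(
        math.comb(in_pathway, i) * math.comb(universe - in_pathway, drawn - i)
        for i in range(observed, min(in_pathway, drawn) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Invented expression values for four hypothetical genes.
    treated = {"A": [7, 7], "B": [3, 3], "C": [1, 1], "D": [3, 3]}
    control = {"A": [1, 1], "B": [3, 3], "C": [7, 7], "D": [3, 3]}
    changed = select_genes(log2_fold_changes(treated, control))
    print(sorted(changed), enrichment_p_value(changed, {"A", "C"}, universe=4))
```

Even this reduced version forces the analyst to choose a pseudocount, a threshold and a background universe. A prompt-based system makes those choices on the user's behalf, which is convenient but also a reason to record them.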

This new model also affects how users interact with data. Recently, graphical user interface-based platforms, such as BiomiX, have aimed to simplify multi-omics analysis for non-programmers by providing visual interfaces and dropdown workflows [3]. However, such tools still require manual coordination of steps, whereas prompt-based systems avoid these choices entirely. In conventional workflows, integrating data types such as RNA-sequencing and ATAC-seq data often involves separate pipelines followed by joint analysis, which requires manual harmonization of identifiers, resolutions and normalization strategies [4]. Prompt-based systems, such as PromptBio [5], streamline this process by enabling cross-modal queries, for example: “Identify genes with increased expression and chromatin accessibility in responders”. The agentic system handles the underlying data integration and statistical modelling, removing the need for manual harmonization.
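The manual harmonization step mentioned above can be made concrete with a small sketch. It assumes that both modalities have already been summarized as per-gene log2 fold changes (for ATAC-seq, that peaks have been assigned to genes), and that the only identifier mismatch is a trailing version suffix. The identifiers, values and function names are hypothetical and do not describe how PromptBio works internally.

```python
from typing import Dict, Set


def strip_version(gene_id: str) -> str:
    """Drop a trailing '.N' version suffix so identifiers from different sources match."""
    base, dot, suffix = gene_id.rpartition(".")
    return base if dot and suffix.isdigit() else gene_id


def harmonize(scores: Dict[str, float]) -> Dict[str, float]:
    """Re-key a score table by unversioned identifiers."""
    return {strip_version(gene): value for gene, value in scores.items()}


def concordant_up(rna_lfc: Dict[str, float],
                  atac_lfc: Dict[str, float],
                  threshold: float = 1.0) -> Set[str]:
    """Genes present in both tables whose expression and accessibility both increase."""
    rna, atac = harmonize(rna_lfc), harmonize(atac_lfc)
    return {
        gene for gene in rna.keys() & atac.keys()
        if rna[gene] >= threshold and atac[gene] >= threshold
    }


if __name__ == "__main__":
    # Invented per-gene log2 fold changes in responders versus non-responders.
    rna = {"GENE1.3": 2.0, "GENE2.1": 0.2, "GENE3.7": 1.5}
    atac = {"GENE1": 1.4, "GENE2": 2.0, "GENE4": 3.0}
    print(concordant_up(rna, atac))
```

In this example only `GENE1` passes, because `GENE2` is not upregulated at the RNA level and `GENE3` and `GENE4` each lack a counterpart in the other table.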

Potential for integrative multi-omics analysis

Integrative analysis across omics layers, including genomics, transcriptomics, epigenomics and proteomics, is a longstanding goal in systems biology. Yet, traditional approaches face hurdles in harmonizing data formats, dealing with missing modalities and tuning multi-view models [4]. Prompt-based systems offer unique advantages in this context by abstracting data handling and analysis logic.

For example, PromptBio enables users to issue high-level prompts such as: “Compare immune cell composition and DNA methylation between tumour subtypes and suggest candidate biomarkers”. This single query can launch a sequence of integrated analyses involving cell type deconvolution, differential methylation and pathway annotation. Similarly, AutoBA autonomously adapts workflows when errors occur or data quality varies, improving robustness in real-world integrative studies [6].

By enabling users to describe multi-modal goals in natural language, prompt-based systems also support hypothesis generation. For instance, a researcher might query: “Suggest genes that could link increased DNA methylation to reduced tumour suppressor gene expression in chemo-resistant tumours”. Traditional methods would require coordinating results from several separate tools; a prompt-based system can automate this integration.

In addition, multi-agentic frameworks, such as Agentomics-ML [7], distribute subtasks to specialized agents, which then communicate, critique each other’s outputs and converge on a shared result. These architectures mirror collaborative scientific reasoning, offering a powerful model for integrative analysis. Interactive multi-agent chatbots such as DrBioRight 2.0, which are specifically designed for proteogenomic data, further demonstrate how users can refine queries iteratively: asking questions, receiving plots, revising focus [8]. This conversational loop contrasts with traditional analysis pipelines, in which iterations require re-running scripts or re-parameterizing interfaces. Prompt-based systems thus facilitate rapid hypothesis testing and data exploration.
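The propose, critique and converge pattern can be sketched in a few lines. This is a loose illustration and not a description of Agentomics-ML or DrBioRight 2.0: the `Agent` class, its fixed score cutoff and the majority rule in `converge` are all assumptions, and real agents would be LLM-driven rather than threshold-based.

```python
from typing import Dict, List, Set


class Agent:
    """Toy specialist that holds one evidence table and a score cutoff."""

    def __init__(self, name: str, evidence: Dict[str, float], cutoff: float = 1.0):
        self.name = name
        self.evidence = evidence
        self.cutoff = cutoff

    def propose(self) -> Set[str]:
        """Candidates supported by this agent's own evidence."""
        return {gene for gene, score in self.evidence.items() if score >= self.cutoff}

    def critique(self, candidate: str) -> bool:
        """Endorse a candidate only if this agent's own evidence also supports it."""
        return self.evidence.get(candidate, 0.0) >= self.cutoff


def converge(agents: List[Agent]) -> Set[str]:
    """Pool all proposals, then keep those endorsed by a strict majority of agents."""
    proposals: Set[str] = set().union(*(agent.propose() for agent in agents))
    needed = len(agents) // 2 + 1
    return {
        candidate for candidate in proposals
        if sum(agent.critique(candidate) for agent in agents) >= needed
    }


if __name__ == "__main__":
    # Invented evidence scores for three hypothetical data layers.
    team = [
        Agent("rna", {"G1": 2.0, "G2": 0.1, "G3": 1.5}),
        Agent("methylation", {"G1": 1.2, "G3": 0.2}),
        Agent("protein", {"G1": 1.1, "G2": 1.3}),
    ]
    print(converge(team))
```

Here `G2` and `G3` are each proposed by one agent but rejected by the other two, so only `G1` survives as the shared result.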

Outlook and conclusions

Moving forward, community-driven development, such as the BioChatter framework for developing LLM-enabled biomedical applications [9], will be essential. Platforms such as BioMedGPT [10] highlight the need for foundation models trained on biomedical data, but domain-specific fine-tuning and evaluation will require collaboration across computational and experimental labs. Similarly, open-source systems such as PromptBio and AutoBA should be extended with application programming interfaces and plugins to integrate into institutional workflows and cloud infrastructures.

Prompt-based bioinformatics reimagines how researchers interact with data, lowering the barrier to entry while opening new avenues for exploration. Unlike traditional workflows that require specialized training, these systems enable anyone to ask complex questions about multi-omics data using natural language. For expert users, they provide a faster way to prototype ideas and customize analyses.

As the field progresses, we anticipate that prompt-based systems will not replace but rather augment conventional pipelines, acting as interactive layers that bridge users and algorithms. To fully realize their potential, we need shared standards, evaluation frameworks and integration with laboratory and clinical systems. If successful, prompt-based approaches could become the default interface for bioinformatics, catalysing a new era of integrative and accessible biological discovery.

As these tools mature, it is likely that life science and biology departments will begin incorporating modules or courses on prompt-based bioinformatics into undergraduate and postgraduate curricula, reflecting the growing need to equip students with the skills to engage with these emerging systems.

Open questions

Despite the promise of prompt-based systems for bioinformatics, crucial questions remain. First, what are the best practices for designing prompt-based systems that ensure reproducibility and accuracy? Unlike static pipelines, prompt-based workflows are probabilistic and inherently flexible, and that flexibility risks inconsistency across users or sessions. Developing logging, versioning and validation protocols will be key.

Second, how can we benchmark the performance of prompt-based systems? At present, few studies rigorously compare LLM-generated outputs to gold-standard results across standard bioinformatics tasks. As these systems mature, we need shared datasets and evaluation metrics that assess accuracy, robustness and computational efficiency.

Third, what tasks are best suited to prompt-based systems? Early results suggest that exploratory analysis, visualization and hypothesis generation benefit most from natural-language interaction. Tasks that require strict parameter control or large-scale batch processing may still be better served by traditional workflows, although current work in advanced prompt-based systems will likely make this possible in the near future.

Fourth, what is the role of human oversight? Although prompt-based systems automate much of the workflow, critical thinking and biological interpretation remain essential. Interfaces that allow users to inspect intermediate steps, modify tool choices or override decisions will help maintain scientific rigour.

Finally, how will prompt-based systems integrate with experimental workflows? One possibility is that experimentalists could use prompts to describe their study design and expectations in plain language, enabling LLM-based systems to initiate appropriate analyses without needing detailed technical specifications. This approach could reduce communication bottlenecks and ensure analysis pipelines are aligned with biological goals.
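Two of these questions, logging for reproducibility and benchmarking against a gold standard, lend themselves to a short sketch. The record fields, the `run_record` helper and the choice of precision and recall as metrics are assumptions made for illustration; no community standard for such logs is described in this article.

```python
import hashlib
import json
from typing import Any, Dict, List, Set, Tuple


def run_record(prompt: str,
               model_version: str,
               tool_calls: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Bundle what is needed to replay a session, plus a hash that changes if any field does."""
    payload = {"prompt": prompt, "model_version": model_version, "tool_calls": tool_calls}
    # Sorted keys and fixed separators give a canonical string, so equal inputs hash equally.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**payload, "sha256": digest}


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compare a system's reported gene set with a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical session: model name, tool name and parameters are invented.
    record = run_record(
        "Compare gene expression between treatment and control samples",
        "example-model-2025-01",
        [{"tool": "differential_expression", "params": {"threshold": 1.0}}],
    )
    print(record["sha256"][:12])
    print(precision_recall({"A", "B", "C", "D"}, {"A", "B", "E"}))
```

Storing the hash alongside each result lets two sessions be checked for identical inputs at a glance, and the second function shows the kind of metric a shared benchmark dataset would make comparable across systems.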
