Prompt-based bioinformatics: a new interface for multi-omics analysis
Ali R. Awan, Mehrdad Oveisi & Mohammad M. Karimi
Prompt-based bioinformatics redefines how scientists interact with biological data, enabling natural language queries across multi-omics layers. By removing coding barriers and streamlining integration, this paradigm facilitates accessible, hypothesis-driven discovery. We call for community standards, educational adoption and collaborative development to realize its full potential in research and clinical settings.
Natural language processing has long supported bioinformatics, aiding in the extraction of insights from unstructured text and biological sequences. Rule-based methods and early statistical approaches enabled structured analysis of scientific literature, gene and protein annotations, and biological pathways. A breakthrough came in 2017 with the introduction of the transformer deep neural network model, which offered superior performance in learning contextual relationships within text. The introduction of transformer models laid the foundation for large language models (LLMs). The scale and capabilities of LLMs gave rise to prompting, which offers a more intuitive way to interact with computational systems compared with traditional programming. As LLMs advanced, they began demonstrating emergent abilities such as few-shot learning and reasoning. The release of ChatGPT in 2022 showcased the power of LLMs in delivering coherent, context-aware outputs, prompting widespread exploration of their use in scientific domains, including bioinformatics.
Prompting as a new programming paradigm
Prompting introduces an accessible interface for computational tasks. Instead of coding in languages such as Python or R, users specify tasks in natural language. This shift is operationalized through LLM-based ‘agent’ systems that connect prompts to executable tools1. These systems interpret user intent, select appropriate functions and orchestrate analysis steps without requiring the user to understand syntax or pipeline logic. Whereas conventional workflows require scripting or clicking through graphical user interfaces, prompting enables seamless, adaptive task execution from a single input line, reducing the cognitive and technical burden for end users.
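The routing step described above can be illustrated with a minimal sketch. It assumes a simple keyword matcher standing in for the LLM planner; the `Tool` record, the tool names and the shared `state` dictionary are hypothetical and are not taken from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    name: str
    keywords: List[str]          # words in a prompt that suggest this tool is needed
    run: Callable[[dict], dict]  # takes and returns a shared state dictionary


def differential_expression(state: dict) -> dict:
    # Placeholder result: a real tool would call a statistics package here.
    state["de_genes"] = ["GENE_A", "GENE_B"]
    return state


def pathway_enrichment(state: dict) -> dict:
    # Placeholder result: consumes the gene list produced by the previous step.
    state["pathways"] = [f"pathway_for_{g}" for g in state.get("de_genes", [])]
    return state


# Hypothetical tool registry; order defines execution order.
REGISTRY: List[Tool] = [
    Tool("differential_expression", ["expression", "compare"], differential_expression),
    Tool("pathway_enrichment", ["pathway", "enrichment"], pathway_enrichment),
]


def plan(prompt: str) -> List[Tool]:
    """Stand-in for the LLM planner: select tools whose keywords occur in the prompt."""
    text = prompt.lower()
    return [tool for tool in REGISTRY if any(k in text for k in tool.keywords)]


def execute(prompt: str) -> dict:
    """Run the selected tools in order, passing the shared state along."""
    state: dict = {"prompt": prompt}
    for tool in plan(prompt):
        state = tool.run(state)
    return state
```

In an actual agent system the `plan` function would be replaced by a model call that returns the tool sequence, but the separation between planning and execution would look broadly similar.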
LLMs are probabilistic and context sensitive. Therefore, prompt phrasing considerably affects output quality. Prompt engineering encompasses techniques such as in-context learning, structured formatting and self-critique to enhance consistency2. Retrieval-augmented generation complements prompt engineering by enabling models to incorporate external documents or datasets into their responses.
Nature Reviews Genetics
This is particularly relevant in bioinformatics, in which up-to-date datasets and unpublished results are often required.
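As a rough illustration of how in-context examples, structured formatting and retrieval can be combined in one prompt, consider the sketch below. The word-overlap retriever is a deliberately naive assumption (production systems typically use embedding-based search), and the function names, instruction wording and gene labels are all made up for the example.

```python
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and strip basic punctuation from each word."""
    return {w.strip(".,;:?!").lower() for w in text.split()}


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by number of words shared with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: List[str],
                 examples: List[Tuple[str, str]]) -> str:
    """Assemble instructions, retrieved context and few-shot examples."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (
        "Answer using only the context below. "
        "Reply as a JSON list of gene symbols.\n"
        f"Context:\n{context}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Q: {query}\nA:"
    )
```

The resulting string would then be sent to a model; restricting the answer to the retrieved context and to a fixed output format is one way to reduce variability between runs.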
Prompt-based versus conventional bioinformatics
Conventional bioinformatics workflows rely on well-defined pipelines built using command-line tools, scripting languages, such as R or Python, and modular platforms, such as Galaxy or Nextflow. These workflows require substantial programming knowledge, domain expertise and familiarity with data formats and preprocessing steps. Integration across data modalities (for example, genomics and transcriptomics) often demands extensive manual curation, metadata alignment and file conversion.
Prompt-based bioinformatics disrupts this paradigm by enabling researchers to articulate complex analysis tasks in plain language. The core distinction lies in the user interface: instead of constructing or navigating pipelines, users interact with agentic systems that parse prompts and assemble the necessary components in real time. For instance, rather than writing a script to run differential expression analysis followed by gene set enrichment, a user might input: “Compare gene expression between treatment and control samples and summarize key pathways involved”. The system then autonomously executes a multi-step workflow, using the appropriate tools behind the scenes.
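To make the hidden multi-step workflow concrete, the sketch below chains a crude differential expression step into a gene set over-representation test. It is a toy under stated assumptions: a log2 fold-change cut-off stands in for proper statistical modelling, no multiple-testing correction is applied, and the function names, threshold and pseudocount are illustrative choices rather than anything specified by the systems discussed here.

```python
import math
from statistics import mean
from typing import Dict, List, Set, Tuple


def log2_fold_changes(treatment: Dict[str, List[float]],
                      control: Dict[str, List[float]],
                      pseudocount: float = 1.0) -> Dict[str, float]:
    """Per-gene log2 ratio of mean treatment to mean control expression."""
    return {
        gene: math.log2((mean(treatment[gene]) + pseudocount)
                        / (mean(control[gene]) + pseudocount))
        for gene in treatment
    }


def hypergeom_pvalue(hits: int, selected: int, in_set: int, universe: int) -> float:
    """P(X >= hits) when `selected` genes are drawn from `universe` genes,
    of which `in_set` belong to the gene set (one-sided over-representation)."""
    upper = min(selected, in_set)
    total = math.comb(universe, selected)
    tail = sum(math.comb(in_set, k) * math.comb(universe - in_set, selected - k)
               for k in range(hits, upper + 1))
    return tail / total


def run_workflow(treatment: Dict[str, List[float]],
                 control: Dict[str, List[float]],
                 gene_sets: Dict[str, Set[str]],
                 lfc_threshold: float = 1.0) -> Tuple[Set[str], Dict[str, float]]:
    """Step 1: flag changed genes. Step 2: test each gene set for enrichment."""
    lfc = log2_fold_changes(treatment, control)
    changed = {g for g, v in lfc.items() if abs(v) >= lfc_threshold}
    results = {}
    for name, members in gene_sets.items():
        members = members & set(lfc)  # restrict to measured genes
        hits = len(changed & members)
        results[name] = hypergeom_pvalue(hits, len(changed), len(members), len(lfc))
    return changed, results
```

In a prompt-based system, a call comparable to `run_workflow` would be assembled and executed on the user's behalf, with the output summarized back in natural language.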
This new model also affects how users interact with data. Recently, graphical user interface-based platforms, such as BiomiX, have aimed to simplify multi-omics analysis for non-programmers by providing visual interfaces and dropdown workflows3. However, such tools still require manual coordination of steps, whereas prompt-based systems remove the need for the user to make these choices. In conventional workflows, integrating data types such as RNA-sequencing and ATAC-seq data often involves separate pipelines followed by joint analysis, which requires manual harmonization of identifiers, resolutions and normalization strategies4. Prompt-based systems, such as PromptBio5, streamline this process by enabling cross-modal queries, for example: “Identify genes with increased expression and chromatin accessibility in responders”. The agentic system handles the underlying data integration and statistical modelling, removing the need for manual harmonization.
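The identifier harmonization that such a cross-modal query hides from the user might look roughly like the following sketch. It assumes RNA-seq results keyed by versioned gene identifiers and ATAC-seq results already summarized per gene symbol. The identifiers, the mapping table, the threshold and the function names are hypothetical, and this is not a description of how PromptBio works internally.

```python
from typing import Dict, List


def strip_version(gene_id: str) -> str:
    """Drop a trailing version suffix, e.g. 'ENSGTOY0001.5' -> 'ENSGTOY0001'."""
    return gene_id.split(".")[0]


def harmonize(rna_lfc: Dict[str, float],
              id_to_symbol: Dict[str, str]) -> Dict[str, float]:
    """Re-key RNA-seq fold changes by gene symbol; unmapped identifiers are dropped."""
    by_symbol: Dict[str, float] = {}
    for raw_id, lfc in rna_lfc.items():
        symbol = id_to_symbol.get(strip_version(raw_id))
        if symbol is not None:
            by_symbol[symbol] = lfc
    return by_symbol


def concordant_genes(rna_by_symbol: Dict[str, float],
                     atac_by_symbol: Dict[str, float],
                     min_lfc: float = 1.0) -> List[str]:
    """Genes whose expression and accessibility both increase by at least min_lfc."""
    shared = rna_by_symbol.keys() & atac_by_symbol.keys()
    return sorted(g for g in shared
                  if rna_by_symbol[g] >= min_lfc and atac_by_symbol[g] >= min_lfc)
```

Even in this reduced form, the silent dropping of unmapped identifiers is a decision that an agentic system makes on the user's behalf, which is one reason transparent reporting of such steps matters.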
Potential for integrative multi-omics analysis
Integrative analysis across omics layers, including genomics, transcriptomics, epigenomics and proteomics, is a longstanding goal in systems biology. Yet, traditional approaches face hurdles in harmonizing data formats, dealing with missing modalities and tuning multi-view models4. Prompt-based systems offer unique advantages in this context by abstracting data handling and analysis logic.
For example, PromptBio enables users to issue high-level prompts such as: “Compare immune cell composition and DNA methylation between tumour subtypes and suggest candidate biomarkers”. This single query can launch a sequence of integrated analyses involving cell type deconvolution, differential methylation and pathway annotation. Similarly, AutoBA autonomously adapts workflows when errors occur or data quality varies, improving robustness in real-world integrative studies6.
By enabling users to describe multi-modal goals in natural language, prompt-based systems also support hypothesis generation. For instance, a researcher might query: “Suggest genes that could link increased DNA methylation to reduced tumour suppressor gene expression in chemo-resistant tumours”. Traditional methods would require coordinating results from several separate tools; a prompt-based system can automate this integration.
In addition, multi-agentic frameworks, such as Agentomics-ML7, distribute subtasks to specialized agents, which then communicate, critique each other’s outputs and c