Biomni Source Code Walkthrough

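1 Introduction

Biomni is an open-source biomedical AI agent from the Stanford SNAP lab. Its goal is to have an LLM work the way a real researcher does: given a question, it queries databases, runs code, and calls tools on its own, then hands back an organized result. Biomni does three things toward that goal:

Bundle everything: bioinformatics tools, databases, and Python/R packages are all packed into one environment so the agent has something to work with.
ReAct loop: two tags, <execute> and <solution>, let the LLM decide for itself whether to "keep working" or "hand in the answer".
Dynamic prompt: use_tool_retriever selects tools on demand, so that 200+ tools are not all dumped into the context.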

Project repository: https://github.com/snap-stanford/Biomni

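2 Data Lake

The first time you run Biomni, the slowest part is not startup but waiting for the data lake to finish downloading. The README's "about 11 GB" is no exaggeration: BindingDB_All_202409.tsv alone is 6.25 GB of protein-small molecule binding affinity data.

The data lake lives under biomni_data/data_lake/ and falls roughly into the following categories: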

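Category	File name	Short description
Protein interactions / proteomics	`affinity_capture-ms.parquet`	Protein-protein interactions detected by affinity capture + mass spectrometry
Protein interactions / proteomics	`affinity_capture-rna.parquet`	Protein-RNA interactions detected by affinity capture
Protein interactions / proteomics	`co-fractionation.parquet`	Protein-protein interactions from co-fractionation experiments
Protein interactions / proteomics	`proximity_label-ms.parquet`	Protein interactions detected by proximity labeling + mass spectrometry
Protein interactions / proteomics	`two-hybrid.parquet`	Protein-protein interactions detected by yeast two-hybrid
Protein interactions / proteomics	`reconstituted_complex.parquet`	Protein complexes reconstituted in vitro
Virus-host interactions	`Virus-Host_PPI_P-HIPSTER_2020.parquet`	Virus-host protein interactions from P-HIPSTER
Protein expression	`proteinatlas.tsv`	Human protein expression data (Human Protein Atlas)
Small molecules / drug discovery	`BindingDB_All_202409.tsv`	Protein-small molecule binding affinity measurements (6.25 GB)
Small molecules / drug discovery	`broad_repurposing_hub_molecule_with_smiles.parquet`	Broad Repurposing Hub molecules with SMILES annotations
Small molecules / drug discovery	`broad_repurposing_hub_phase_moa_target_info.parquet`	Drug phase, mechanism of action, and target information
Small molecules / drug discovery	`enamine_cloud_library_smiles.pkl`	Enamine REAL compound library (with SMILES)
Screening / experimental results (EveBio)	`evebio_*.csv` (8 files)	Metadata, compounds, controls, raw data points, and result summaries for screening/profiling assays
Drug interactions (DDInter)	`ddinter_*.csv` (8 files)	Drug-drug interactions grouped by drug category
Gene sets / ontologies	`go-plus.json`	Gene Ontology (GO) data
Gene sets / ontologies	`msigdb_human_*.parquet`	MSigDB human C1-C8 and Hallmark gene sets
Gene sets / ontologies	`mousemine_*.parquet`	MouseMine M1/M2/M3/M5/M8/MH gene sets
Genomics / variants / genetics	`variant_table.parquet`	Annotated genetic variant table
Genomics / variants / genetics	`genebass_*.pkl`	GeneBass filtered missense / pLoF / synonymous variants
Genomics / variants / genetics	`gwas_catalog.pkl`	Collection of genome-wide association study (GWAS) results
Genomics / variants / genetics	`omim.parquet`	OMIM genetic disorders and associated genes
Genomics / variants / genetics	`DisGeNET.parquet`	Gene-disease associations from multiple sources
Phenotypes / ontologies	`hp.obo`	Human Phenotype Ontology (HPO), obographs format
Genes / expression	`gene_info.parquet`	Comprehensive gene information
Genes / expression	`gtex_tissue_gene_tpm.parquet`	GTEx gene expression by tissue (TPM)
Cell lines / cancer dependency (DepMap)	`DepMap_*.csv` (4 files)	DepMap cell-line CRISPR dependency, gene effect, model metadata, and expression
Cells / single-cell resources	`czi_census_datasets_v4.parquet`	Collection of CZI Cell Census datasets
Cells / markers	`marker_celltype.parquet`	Cell-type marker gene sets
Immunology / TCR	`McPAS-TCR.parquet`	T-cell receptor sequences and specificity data (McPAS)
miRNA	`miRDB_v6.0_results.parquet`	microRNA targets predicted by miRDB
miRNA	`miRTarBase_*.parquet`	Validated miRNA-target gene interactions and binding sites
miRNA	`miRTarBase_microRNA_target_interaction_pubmed_abtract.txt`	PubMed abstract text for miRTarBase interactions
Knowledge graph / integrated resources	`kg.csv`	Precision medicine knowledge graph (diseases and multi-scale relations)
CRISPR / sgRNA libraries	`sgRNA_KO_SP_mouse.txt`	Mouse sgRNA knockout (KO) library
CRISPR / sgRNA libraries	`sgRNA_KO_SP_human.txt`	Human sgRNA knockout (KO) library
Genetic interactions / synthetic lethality	`genetic_interaction.parquet`	Gene-gene genetic interactions
Genetic interactions / synthetic lethality	`dosage_growth_defect.parquet`	Genes whose dosage changes affect growth
Genetic interactions / synthetic lethality	`synthetic_growth_defect.parquet`	Synthetic growth defect data
Genetic interactions / synthetic lethality	`synthetic_lethality.parquet`	Synthetic lethal interactions
Genetic interactions / synthetic lethality	`synthetic_rescue.parquet`	Genetic interactions that rescue a phenotype
Machine learning / prediction (TXGNN)	`txgnn_name_mapping.pkl`	TXGNN name mapping
Machine learning / prediction (TXGNN)	`txgnn_prediction.pkl`	TXGNN model predictions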

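If you have disk space to spare, download everything; if space is tight, you can skip files as needed. I would still recommend the full download, so you do not discover missing data halfway through a run.

Besides the data files, Biomni maintains a library_content_dict that lists the names and short descriptions of the preinstalled Python and R packages. The list is passed to the LLM in the System Prompt, so the agent can import the right package when calling tools instead of guessing.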
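As a rough illustration of how such a mapping could be rendered into a prompt section, here is a minimal sketch. Only the name library_content_dict comes from Biomni; the three entries, their descriptions, and the helper render_library_section are hypothetical and not taken from the repository.

```python
# Hypothetical excerpt; the real library_content_dict is much longer.
library_content_dict = {
    "scanpy": "Single-cell RNA-seq analysis toolkit.",
    "biopython": "Tools for biological sequence computation.",
    "pysam": "Reading and writing SAM/BAM/VCF files.",
}


def render_library_section(libraries: dict[str, str]) -> str:
    """Render package names and descriptions as one prompt section."""
    lines = ["Software Library:"]
    for name in sorted(libraries):
        lines.append(f"- {name}: {libraries[name]}")
    return "\n".join(lines)


section = render_library_section(library_content_dict)
print(section)
```

Sorting the names keeps the rendered prompt stable between runs, which makes prompt diffs easier to read.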

3 Benchmark

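Tools and data alone are not enough: how do you show that the agent can actually do the work? Biomni ships three question sets under data/biomni_data/benchmark/:

Dataset	Questions	Original format	Evaluation format	Source
`hle`	52	Native multiple choice	Multiple choice	Humanity's Last Exam
`DbQA`	60 (test set)	Q&A + distractors	Multiple choice	Lab Bench (database QA)
`SeqQA`	70 (test set)	Q&A + distractors	Multiple choice	Lab Bench (sequence QA)

DbQA and SeqQA each come with three parquet files: the full set (520/600 questions), a sampled set (65/75), and a test set (60/70). Scoring uses _test.parquet; do not score against the full set by mistake, or the result will be meaningless.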

3.1 HLE

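The file test_sampled_biology_medicine.parquet comes from Humanity's Last Exam and has a fairly rich schema: id, question, answer, rationale, and category, plus image-related fields. The canary field is a dataset watermark used to detect whether the benchmark has leaked into training data; the evaluation itself does not depend on it.

Here is a sample question: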

Which of the following mechanisms most accurately describes the role of intron 
length variability in influencing eukaryotic gene expression efficiency?

A. Intron length variability impacts gene expression by altering the DNA 
   methylation patterns, thereby influencing transcriptional silencing.
B. Variable intron lengths change the DNA torsional stress distribution, 
   indirectly affecting RNA polymerase II elongation speed and efficiency.
C. Longer introns increase the probability of alternative splicing events 
   by providing more splice site positions.
D. Intron variability primarily affects the spatial configuration of 
   enhancers, thereby modifying gene expression through changes in 
   enhancer-promoter interactions.
E. Shorter introns enhance transcriptional efficiency by facilitating RNA 
   polymerase II access to promoter regions more quickly.

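The correct answer is B; the other options are clearly written as deliberate traps.

3.2 DbQA and SeqQA

DbQA focuses on what a database lookup returns, SeqQA on whether a sequence analysis was done correctly. The two datasets share one schema: id, question, ideal (the reference answer), distractors (a list of wrong options), canary, and subtask.

DbQA subtasks include multi-sequence variant analysis (11 questions), mouse tumor gene sets (10), viral protein interactions (7), differential gene analysis, gene location, miRNA target prediction, and more. SeqQA covers sequence-analysis questions such as restriction enzyme fragment counts, PCR primer design, ORF amino acid identification, and GC content calculation.

A DbQA example: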

Which of the following genes is associated with achromatopsia according to 
DisGeNet but not according to OMIM?

ideal: CNGA3
distractors: ["CNGB3", "GNAT2", "PDE6C", "ATF7IP"]

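3.3 Evaluation Method

The evaluation pipeline is straightforward: convert each open-ended question into a multiple-choice question, then score it.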

options = shuffle(distractors + [ideal, "Insufficient information to answer the question."])

Each question ends up looking like this:

Question: {question}
Options:
A. {option1}
B. {option2}
C. {option3}
D. {option4}
E. {option5}
F. Insufficient information to answer the question.
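The conversion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Biomni's evaluation code: build_question and its seed parameter are hypothetical, and the sketch pins the refusal option to the last slot as the template does, whereas the one-liner above shuffles it together with the other options.

```python
import random
import string

REFUSAL = "Insufficient information to answer the question."


def build_question(question, ideal, distractors, seed=0):
    """Turn an open-ended item into a multiple-choice prompt.

    Returns the prompt text and the letter of the correct option.
    """
    rng = random.Random(seed)
    options = distractors + [ideal]  # new list, the input is not mutated
    rng.shuffle(options)
    options.append(REFUSAL)  # assumption: refusal stays last, as in the template
    letters = string.ascii_uppercase[: len(options)]
    body = "\n".join(f"{letter}. {option}" for letter, option in zip(letters, options))
    answer = letters[options.index(ideal)]
    return f"Question: {question}\nOptions:\n{body}", answer


prompt, answer = build_question(
    "Which of the following genes is associated with achromatopsia "
    "according to DisGeNet but not according to OMIM?",
    ideal="CNGA3",
    distractors=["CNGB3", "GNAT2", "PDE6C", "ATF7IP"],
)
print(prompt)
print("correct letter:", answer)
```

A fixed seed makes the option order reproducible, so two runs of the same benchmark present identical prompts.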

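The "Insufficient information to answer the question." option is the key: it lets the agent decline gracefully when it lacks information instead of forcing a guess. As a result, the evaluation can report three dimensions at once:

Metric	Meaning
accuracy	Share of all questions answered correctly
coverage	Share of questions the agent actually answered (rather than choosing "insufficient information")
precision	Share of answered questions that are correct

A correct answer scores 1.0 and a wrong one 0.0; there is no partial credit.

I like this design: it measures "can it answer" separately from "will it answer". Accuracy alone cannot tell you whether the agent is simply refusing everything.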
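To make the three metrics concrete, here is a minimal sketch of how they could be computed from option letters. The function score and the convention that "F" is the refusal letter are assumptions for illustration, and the sketch assumes the refusal option is never the reference answer.

```python
def score(predictions, answers, refusal="F"):
    """Compute accuracy, coverage and precision for multiple-choice runs.

    predictions and answers are equal-length lists of option letters;
    refusal is the letter of the "insufficient information" option.
    """
    total = len(answers)
    answered = [(p, a) for p, a in zip(predictions, answers) if p != refusal]
    correct = sum(1 for p, a in answered if p == a)
    return {
        "accuracy": correct / total if total else 0.0,
        "coverage": len(answered) / total if total else 0.0,
        "precision": correct / len(answered) if answered else 0.0,
    }


# Four questions: two right, one refused, one wrong.
# accuracy = 2/4, coverage = 3/4, precision = 2/3
metrics = score(["A", "C", "F", "B"], ["A", "C", "D", "E"])
print(metrics)
```

An agent that refuses every question would get coverage 0.0 here, which is exactly the behavior that accuracy alone would hide behind a low score.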

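4 Tool List

Biomni ships 224 built-in tools spread across 22 modules. The number surprised me the first time I saw it; that is a lot of tools.

Putting all 200+ tools into the System Prompt means the tool descriptions alone eat up a large share of the context. Biomni's compromise is use_tool_retriever, though frankly, turning the tools into on-demand Skills would be more elegant.

The tools are distributed across modules as follows: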

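Module	Tools	Main functions
`database`	40	Biomedical database queries (UniProt, PDB, KEGG, ClinVar, GEO, Ensembl, gnomAD, ChEMBL, PubChem, etc.)
`pharmacology`	25	Drug development (molecular docking, ADMET prediction, drug repurposing, drug interactions, FDA adverse events, etc.)
`genomics`	19	Genomics (scRNA-seq annotation, embedding generation, ChIP-seq, gene set enrichment, cross-species gene conversion, etc.)
`molecular_biology`	18	Molecular biology (ORF annotation, plasmid annotation, PCR simulation, restriction digestion, CRISPR sgRNA design, primer design, etc.)
`microbiology`	12	Microbiology (bacterial growth modeling, biofilm analysis, genome annotation, RNA secondary structure prediction, etc.)
`physiology`	11	Physiology (hemodynamic analysis, brain ADC maps, circadian rhythm analysis, fatty acid composition analysis, etc.)
`bioimaging`	10	Bioimaging (medical image segmentation, image registration, similarity metrics, etc.)
`immunology`	10	Immunology (ATAC-seq, immune cell sorting, cytokine analysis, antibody titer analysis, etc.)
`genetics`	9	Genetics (genomic coordinate conversion, Bayesian fine-mapping, CRISPR analysis, population simulation, etc.)
`literature`	8	Literature search (PubMed, arXiv, Google Scholar, PDF extraction, web search, etc.)
`synthetic_biology`	8	Synthetic biology (bacterial genome engineering, codon optimization, gene circuit simulation, etc.)
`cancer_biology`	6	Cancer biology (DDR network analysis, somatic mutation detection, structural variant detection, NMF analysis, etc.)
`biochemistry`	6	Biochemistry (CD spectroscopy analysis, enzyme kinetics, ITC binding thermodynamics, protein conservation analysis, etc.)
`cell_biology`	5	Cell biology (cell cycle analysis, cell migration analysis, flow cytometry, mitochondrial morphology analysis, etc.)
`systems_biology`	7	Systems biology (flux balance analysis, signaling network simulation, metabolic network perturbation analysis, etc.)
`bioengineering`	7	Bioengineering (cell migration analysis, CRISPR simulation, calcium imaging analysis, ODE model simulation, etc.)
`pathology`	7	Pathology (aortic analysis, thrombus histology, bone micro-CT analysis, multiplex image analysis, etc.)
`protocols`	4	Experimental protocols (protocols.io search, reading local protocols)
`biophysics`	3	Biophysics (protein disorder prediction, cell morphology analysis, tissue deformation analysis)
`glycoengineering`	3	Glycoengineering (N-glycosylation site prediction, O-glycosylation prediction)
`support_tools`	3	Support tools (Python REPL execution, source code reading, Synapse data download)
`lab_automation`	3	Lab automation (PyLabRobot script testing, documentation retrieval)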
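As a quick arithmetic check, the per-module counts in the table can be copied into a dictionary and summed. The dictionary below is transcribed from the table, not read from the Biomni code base.

```python
# Per-module tool counts, transcribed from the table above.
tool_counts = {
    "database": 40, "pharmacology": 25, "genomics": 19,
    "molecular_biology": 18, "microbiology": 12, "physiology": 11,
    "bioimaging": 10, "immunology": 10, "genetics": 9,
    "literature": 8, "synthetic_biology": 8, "cancer_biology": 6,
    "biochemistry": 6, "cell_biology": 5, "systems_biology": 7,
    "bioengineering": 7, "pathology": 7, "protocols": 4,
    "biophysics": 3, "glycoengineering": 3, "support_tools": 3,
    "lab_automation": 3,
}

total = sum(tool_counts.values())
print(f"{len(tool_counts)} modules, {total} tools")  # prints: 22 modules, 224 tools
```

The totals match the 22 modules and 224 tools quoted at the start of this section.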

5 System Prompt Structure

Biomni's System Prompt is essentially a manual that spells out "what you can do" and hands it to the LLM. It is assembled from the following parts:

5.1 Core Instructions

You are a helpful biomedical assistant assigned with the task of problem-solving.
To achieve this, you will be using an interactive coding environment equipped
with a variety of tool functions, data, and softwares to assist you throughout
the process.

Given a task, make a plan first. The plan should be a numbered list of steps
that you will take to solve the task. Be specific and detailed.
Format your plan as a checklist with empty checkboxes like this:

- [ ] First step
- [ ] Second step
- [ ] Third step

Follow the plan step by step. After completing each step, update the checklist
by replacing the empty checkbox with a checkmark:

1. [✓] First step (completed)
- [ ] Second step
...

If a step fails or needs modification, mark it with an X and explain why:
...

At each turn, you should first provide your thinking and reasoning given the
conversation history.
After that, you have two options:

1. Interact with a programming environment and receive the corresponding
   output within . Your code should be enclosed using "<execute>" tag, for example:
   <execute> print("Hello World!") </execute>.
   - For Python code (default): <execute> print("Hello World!") </execute>
   - For R code: <execute> #!R\nlibrary(ggplot2)\nprint("Hello from R") </execute>
   - For Bash scripts: <execute> #!BASH\necho "Hello from Bash"\nls -la </execute>
2. When you think it is ready, directly provide a solution that adheres to
   the required format for the given task to the user. Your solution should
   be enclosed using "<solution>" tag, for example: The answer is <solution> A </solution>.

In each response, you must include EITHER <execute> or <solution> tag. Not both at the
same time. Do not respond with messages without any tags. No empty messages.

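The two core tags are easy to remember: execute does the work, solution wraps it up.

<execute>...</execute>: the agent wants to run code and puts its Python / R / Bash inside.
<solution>...</solution>: the agent believes it is done and puts the final answer inside.

Every reply must contain exactly one of the two, never both and never neither, because Biomni parses the agent's output through these tags. A reply with no tag cannot be parsed: it either raises an error or the turn is discarded.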
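The either-or rule can be sketched as a small parser. This is an illustration of the rule as described here, not Biomni's actual parsing code; parse_reply and the choice to raise ValueError are assumptions.

```python
import re

EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def parse_reply(reply: str) -> tuple[str, str]:
    """Classify a model reply as ("execute", code) or ("solution", answer).

    Raises ValueError when the reply carries both tags or neither.
    """
    execute = EXECUTE_RE.search(reply)
    solution = SOLUTION_RE.search(reply)
    if execute and solution:
        raise ValueError("reply contains both <execute> and <solution>")
    if execute:
        return "execute", execute.group(1).strip()
    if solution:
        return "solution", solution.group(1).strip()
    raise ValueError("reply contains neither <execute> nor <solution>")


kind, payload = parse_reply('Let me check.\n<execute>\nprint("Hello World!")\n</execute>')
print(kind, payload)
```

The non-greedy `(.*?)` with `re.DOTALL` lets the payload span several lines while stopping at the first closing tag.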

5.2 Self-Critic (optional)

You may or may not receive feedbacks from human. If so, address the feedbacks
by following the same procedure of multiple rounds of thinking, execution,
and then coming up with a new solution.

In short: if human feedback arrives, run the same procedure again (think → execute → answer).

5.3 Protocol Generation

PROTOCOL GENERATION:
If the user requests an experimental protocol, use search_protocols(),
advanced_web_search_claude(), list_local_protocols(), and read_local_protocol()
to generate an accurate protocol. Include details such as reagents (with
catalog numbers if available), equipment specifications, replicate
requirements, error handling, and troubleshooting - but ONLY include
information found in these resources. Do not make up specifications, catalog
numbers, or equipment details.
Prioritize accuracy over completeness.

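When generating an experimental protocol, the agent may only use information it actually found and must not invent catalog numbers or equipment parameters, which is a reassuringly strict rule.

5.4 Custom Resources (if any)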

PRIORITY CUSTOM RESOURCES
==============================
IMPORTANT: The following custom resources have been specifically added for
your use. PRIORITIZE using these resources as they are directly relevant to
your task. Always consider these FIRST and in the meantime using default
resources.

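This section is optional. It injects project-specific know-how documents, custom tools, datasets, and software libraries, in effect giving the agent some private tutoring.

5.5 Environment Resources

The last part tells the LLM exactly what the lab has on hand:

Function Dictionary: function signatures for all 224 tools (module path, function name, parameters, default values), grouped by module.
Biological Data Lake: short descriptions of 100+ datasets, telling the agent where the data is and what it looks like.
Software Library: about 50 directly importable Python packages (scanpy, anndata, pysam, biopython, and others), so the agent knows what is already installed.
R packages and Bash: a note that R code is marked with #!R and Bash with #!BASH.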
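The marker convention in the last item implies some dispatch step on the executor side. A minimal sketch of that idea follows; detect_language is a hypothetical helper, and the assumption that the marker must sit on the first line is mine, not something confirmed by the Biomni source.

```python
def detect_language(code: str) -> tuple[str, str]:
    """Return (language, source) based on the first-line marker.

    "#!R" selects R, "#!BASH" selects Bash, anything else is Python.
    """
    stripped = code.strip()
    first, _, rest = stripped.partition("\n")
    marker = first.strip()
    if marker == "#!R":
        return "r", rest
    if marker == "#!BASH":
        return "bash", rest
    return "python", stripped


print(detect_language('#!R\nlibrary(ggplot2)\nprint("Hello from R")'))
print(detect_language('#!BASH\necho "Hello from Bash"\nls -la'))
print(detect_language('print("Hello World!")'))
```

Defaulting to Python when no marker is present mirrors the "Python code (default)" wording in the core instructions.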

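6 Dynamic Prompt: `use_tool_retriever`

The cost of putting every tool, dataset, and library into the System Prompt is that the prompt becomes very long. It is close to 30K tokens before any task is added, so merely laying all the tools on the table consumes a good part of the budget.

Biomni provides a switch, use_tool_retriever:

Mode	`use_tool_retriever=False`	`use_tool_retriever=True`
System prompt	All 224 tools + all datasets	Only the retrieved, relevant tools/datasets
Prompt length	~30K+ tokens	~5K tokens
Best suited for	Simple queries, short conversations	Complex queries, long conversations
Initialization speed	Fast	Needs one extra LLM call for retrieval

The essential difference: in False mode the whole tool menu is laid out and the LLM picks from it; in True mode one LLM call first estimates which tools and datasets the task is likely to need, and only that trimmed menu goes into the prompt.

This is the RAG idea applied to tool selection, in the same lineage as Toolformer and ToolLLM. Which mode to choose in practice depends on whether accuracy or cost matters more to you.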
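The retrieve-then-prompt idea can be illustrated with a toy retriever. Note the swap: Biomni uses an LLM call for this step, while the sketch below substitutes plain keyword overlap so that it runs without a model. The function retrieve_tools and the three tool names and descriptions are hypothetical.

```python
def retrieve_tools(task: str, tools: dict[str, str], top_k: int = 2) -> list[str]:
    """Rank tools by word overlap between the task and each description.

    This keyword scorer is a stand-in for the LLM call described above.
    """
    task_words = set(task.lower().split())
    scored = []
    for name, description in tools.items():
        overlap = len(task_words & set(description.lower().split()))
        scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for overlap, name in scored[:top_k] if overlap > 0]


# Hypothetical tool names and descriptions, for illustration only.
tools = {
    "query_uniprot": "query protein sequence and annotation from uniprot",
    "design_primer": "design pcr primer for a dna sequence",
    "search_pubmed": "search pubmed literature abstracts",
}
selected = retrieve_tools("design a pcr primer for this dna sequence", tools)
print(selected)  # ['design_primer', 'query_uniprot']
```

Only the selected entries would then be rendered into the prompt, which is where the drop from roughly 30K to roughly 5K tokens comes from.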

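7 Execution Flow

The prompt alone is still abstract, so here is a "BRCA1 mutation analysis" task traced from start to finish:

Turn 1:
  LLM:<execute>
       from biomni.tool.genetics import analyze_crispr_genome_editing
       # ... code
       </execute>
  Execution: run the Python code → return the result

Turn 2:
  LLM:<execute>
       # visualization code
       </execute>
  Execution: generate the plot

Turn 3:
  LLM:<solution>
       BRCA1 gene analysis results:
       - 12 mutation sites detected
       - 5 of them are pathogenic
       - ...
       </solution>
  Stop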

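As a flowchart:

flowchart TD
    A[User task] --> B[LLM reasons + makes a plan]
    B --> C{Continue executing?}
    C -->|Yes| D[Emit execute tag]
    D --> E[Run Python / R / Bash]
    E --> F[Append output to context]
    F --> B
    C -->|No| G[Emit solution tag]
    G --> H[Return final answer]

The whole thing is a loop that keeps asking "should I continue?" and stops only when the agent decides it has enough information and emits <solution>.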
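The loop can be sketched in plain Python with a scripted stand-in for the model. Everything here is an assumption made for illustration: agent_loop, run_python, scripted_llm, the "Output: " prefix, and the turn limit are hypothetical, the code runs without any sandboxing, and only Python execution is covered.

```python
import contextlib
import io
import re


def run_python(code: str) -> str:
    """Execute code and capture stdout (no sandboxing; illustration only)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


def agent_loop(llm, task: str, max_turns: int = 5) -> str:
    """Alternate between model replies and code execution until a solution."""
    history = [task]
    for _ in range(max_turns):
        reply = llm(history)
        history.append(reply)
        solution = re.search(r"<solution>(.*?)</solution>", reply, re.DOTALL)
        if solution:
            return solution.group(1).strip()
        execute = re.search(r"<execute>(.*?)</execute>", reply, re.DOTALL)
        if execute:
            # Feed the output back so the next turn can build on it.
            history.append("Output: " + run_python(execute.group(1).strip()))
    return "no solution within the turn limit"


def scripted_llm(history):
    """Stand-in for a real model: run code once, then answer."""
    if len(history) == 1:
        return "<execute>print(6 * 7)</execute>"
    return f"<solution>{history[-1].removeprefix('Output: ').strip()}</solution>"


result = agent_loop(scripted_llm, "What is 6 * 7?")
print(result)  # 42
```

The max_turns cap is a safety net for the case where the model never emits a solution tag; a real deployment would also need timeouts and isolation around the execution step.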

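8 Hands-On Impressions

Environment setup is costly. You have to get both the Python and the R side working yourself, and first-time setup fails easily (dependency conflicts, missing system libraries, and R package build failures are all common). I suggest starting from the biomni_env directory that Biomni provides or from the upstream Docker image.
The prompt is long. It is close to 30K tokens before any task is added, and still around 5K with use_tool_retriever enabled. A more elegant approach would be to turn the tools into on-demand Skills, which is also the direction Anthropic / Claude Code is currently pushing.
LLM integration could be more flexible. llm.py currently targets mainly Claude; a unified integration through litellm would be much more convenient.
The first data lake download is slow. BindingDB_All_202409.tsv alone is 6.25 GB, so plan disk space and bandwidth ahead of time.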