120M+ cells and 225,000 perturbation interactions will accelerate virtual cell model development in a first-of-its-kind partnership and will be released open source as part of a shared commitment to open science.
SAN FRANCISCO & PALO ALTO, Calif. & REDWOOD CITY, Calif.–(BUSINESS WIRE)–Tahoe Therapeutics, Arc Institute, and Biohub announced a landmark initiative to generate the largest and most perturbation-rich single-cell dataset for virtual cell models. The effort is supported by a multimillion-dollar commitment from each of the three organizations and represents the first collaboration of its kind at this scale. It builds on the breakthroughs established with datasets including Tahoe-100M, Arc’s scBaseCount, and Biohub’s CELLxGENE.
The three organizations have jointly curated the perturbations and disease models to be included in the dataset, aiming to maximize impact on the field of virtual cell modeling to benefit the broader scientific community. As part of the deal, Tahoe will generate more than 120 million single-cell data points across 225,000 drug–patient interactions using its proprietary Mosaic technology – the same platform used to produce the widely adopted Tahoe-100M dataset in 2025.
The new dataset will be over 4× more perturbation-rich than Tahoe-100M and represents one of the most ambitious biological data-generation efforts ever undertaken in support of AI-driven virtual biology.
Tahoe, Arc, and Biohub have played leading roles in advancing virtual cell models: AI systems capable of simulating cellular behavior, drug response, and disease biology. This collaboration marks the first time major players in the space have joined forces to develop a foundational dataset at this scale.
“Virtual cells are reaching a turning point, and data is the bottleneck,” said Johnny Yu, co-founder and Chief Science Officer at Tahoe Therapeutics. “In a field that’s been fundamentally data-starved, this dataset provides the fuel needed to accelerate real progress. Arc, Biohub, and Tahoe share that belief, and this strategic, multi-institutional deal reflects it.”
“We are pleased to join forces with Biohub and Tahoe to create critical resources in this promising field,” said Arc’s Executive Director and Co-Founder Silvana Konermann. “As we are seeing in our virtual cell modeling efforts and in the Virtual Cell Challenge, diverse, high-volume, high-quality perturbational data is still scarce and needed to unlock advances in ML-driven biological discovery.”
“Biohub is excited to partner with Tahoe and Arc to build this foundational dataset and make it available to the worldwide scientific community,” said Biohub Head of Science Alex Rives. “We have had a long commitment to open science and will continue that as we build the first large-scale initiative combining frontier AI and frontier biology to accelerate scientific discovery to cure or prevent disease.”
Released in 2025 on the Arc Virtual Cell Atlas, Tahoe-100M rapidly became one of the most influential datasets in computational biology, with 250,000+ downloads and widespread use across model development and benchmarking. Together with Biohub’s CELLxGENE and Arc’s scBaseCount, it has served as training or reference data for leading virtual cell architectures and AI models of biology developed by Arc, Tahoe, and Biohub, including STATE, Tahoe-x1, and TranscriptFormer.
The defining strengths of these datasets – scale and diversity of biological contexts, in many cases resulting from drug perturbations – have proven crucial for training AI models capable of predicting drug response and mechanism of action.
The new Arc–Biohub–Tahoe dataset will build on these lessons by dramatically expanding perturbative diversity, cell-type representation, and patient-relevant context.
Data generated under this large-scale, jointly funded effort will first be shared among Tahoe, Arc, and Biohub, and will ultimately be released as open source as part of the three organizations’ shared commitment to open science and community-driven scientific progress.
About Tahoe Therapeutics
Tahoe Therapeutics is building AI-powered models of the human cell to design better drugs for more patients. Its technology platform generates large-scale, perturbative single-cell datasets that enable a new generation of biological foundation models. Based in South San Francisco, Tahoe was founded by a team of scientists and technologists advancing the frontiers of drug discovery, genomics, and machine learning. Learn more at tahoebio.ai.
About Arc Institute
Arc Institute is an independent nonprofit research organization based in Palo Alto, California, that aims to accelerate scientific progress and understand the root causes of complex diseases. Arc’s investigators are supported by long-term funding and freedom to pursue bold ideas. Its Technology Centers leverage multi-omics, genome engineering, and cellular, mammalian and computational models to advance discoveries at the intersection of biology and artificial intelligence. Founded in 2021, Arc partners with Stanford, UC Berkeley, and UCSF.
About Biohub
Biohub is a 501(c)(3) biomedical research organization building the first large-scale initiative to combine frontier AI and frontier biology to solve disease. With its compute capacity, AI research and engineering, and state-of-the-art technology for measuring, imaging, and programming biology, Biohub is enabling scientists worldwide to use AI-powered biology to study how cells operate and organize as systems — ultimately understanding why disease happens and how to cure or prevent it. Learn more at biohub.org.
Contacts
Press contact: Kayleigh Karutis, kayleigh@tahoebio.com | 518 466 5265


