Join the club for FREE to access the whole archive and other member benefits.

Basecamp reveals million-species protein dataset

Vast new dataset could supercharge AI-driven protein and drug design across biology and medicine

11-Jun-2025

Key points from article :

UK-based biotech startup Basecamp Research has launched BaseData, the largest known dataset of protein sequences, containing 9.8 billion new sequences and sourced from one million newly discovered species. This unprecedented trove increases known protein diversity tenfold compared to all public databases combined. The goal is to meet the rising data demands of generative biology, where AI models are used to design proteins and therapeutics, much like ChatGPT generates language. Current biological databases are heavily skewed—70% of public sequence data comes from just 10 species—limiting model performance and discovery potential.

To overcome this, Basecamp developed a global network of partnerships with over 125 communities in 26 countries, collecting samples from extreme and diverse environments—from volcanic hot springs to Antarctic soil and shipwrecks. Their mobile molecular biology tools allow for real-time DNA extraction in remote locations, enabling broad access to genetic information. Notable discoveries include bacteria that generate water from hydrogen, species that could help tackle pollution and antibiotic resistance, and organisms that survive near-boiling temperatures, offering insights for industrial and medical applications.

BaseData not only focuses on the genetic "words" but also the full "sentences"—capturing long genome contexts (over 10,000 base pairs)—to better understand how biological systems work together. This approach mirrors how large language models understand meaning in text. The company plans to incorporate evolutionary-aware metagenome models, helping to further push the boundaries of programmable genetic medicine.

Since its founding in 2019, Basecamp Research has grown rapidly, raised $85 million, and established a presence in Cambridge’s Kendall Square, collaborating with notable scientists like genome-editing pioneer David Liu. They’re currently offering early access to BaseData for academic researchers and engaging with major pharmaceutical companies to integrate the data into AI-driven drug discovery pipelines.

Mentioned in this article:

Click on resource name for more details.

Basecamp Research

London-based biotech startup focused on accelerating biological discovery through AI

David Liu

Richard Merkin Professor and Vice-Chair of the Faculty at Broad Institute of MIT and Harvard

Topics mentioned on this page:
AI in Medical Research, Proteomics
Basecamp reveals million-species protein dataset