Key points from article:
UK-based biotech startup Basecamp Research has launched BaseData, the largest known dataset of protein sequences, containing 9.8 billion new sequences sourced from one million newly discovered species. This trove increases known protein diversity tenfold compared with all public databases combined. The goal is to meet the rising data demands of generative biology, where AI models are used to design proteins and therapeutics, much as ChatGPT generates language. Current biological databases are heavily skewed (70% of public sequence data comes from just 10 species), which limits model performance and discovery potential.
To overcome this, Basecamp developed a global network of partnerships with over 125 communities in 26 countries, collecting samples from extreme and diverse environments—from volcanic hot springs to Antarctic soil and shipwrecks. Their mobile molecular biology tools allow for real-time DNA extraction in remote locations, enabling broad access to genetic information. Notable discoveries include bacteria that generate water from hydrogen, species that could help tackle pollution and antibiotic resistance, and organisms that survive near-boiling temperatures, offering insights for industrial and medical applications.
BaseData captures not only the genetic "words" but also the full "sentences": long genomic contexts of more than 10,000 base pairs that help show how biological systems work together. This approach mirrors how large language models derive meaning from text. The company plans to incorporate evolutionary-aware metagenome models to push the boundaries of programmable genetic medicine further.
Since its founding in 2019, Basecamp Research has grown rapidly, raised $85 million, and established a presence in Cambridge’s Kendall Square, where it collaborates with notable scientists such as genome-editing pioneer David Liu. The company is currently offering academic researchers early access to BaseData and is engaging with major pharmaceutical companies to integrate the data into AI-driven drug discovery pipelines.