Key points from article:
The Arc Institute has launched the Virtual Cell Challenge, a public competition aimed at accelerating the development of AI models that can predict how cells change their gene expression in response to genetic perturbations. These so-called “virtual cells” have the potential to revolutionize drug discovery by identifying how to shift cells from diseased to healthy states with fewer side effects. However, creating accurate virtual cell models is challenging due to the complexity of living cells and the technical noise in current single-cell datasets. A key goal of the challenge is to establish standard benchmarks that ensure models capture real biological patterns rather than dataset-specific artifacts. The initiative is described in a Cell commentary led by Yusuf Roohani, PhD, machine learning lead at Arc.
The competition, which is backed by sponsors such as NVIDIA and 10x Genomics, invites researchers to build models that generalize well to unseen cell contexts using a newly generated dataset of 300,000 human embryonic stem cells subjected to 300 genetic perturbations. The models will be judged based on their ability to predict gene expression changes, distinguish between perturbations, and minimize prediction error. Prizes totalling $175,000, including NVIDIA DGX Cloud credits, will be awarded to the top three models. Real-time leaderboards will track progress, and a final test set will determine the winners by December.
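To make the three judging criteria concrete, the sketch below shows one way such scores could be computed on per-perturbation average expression profiles. It is an illustration only, not the challenge's official scoring code: the function names, the L1 distance, the top-k gene overlap, and the toy numbers are all assumptions introduced here.

```python
# Hypothetical sketch of three evaluation ideas; not the official challenge metrics.
# Each profile is a list of average expression values, one entry per gene.

def mean_absolute_error(pred, true):
    """Prediction error: average absolute gap over all perturbations and genes."""
    diffs = [abs(p - t)
             for p_row, t_row in zip(pred, true)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)


def l1(a, b):
    """L1 distance between two expression profiles (an assumed choice)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def discrimination_score(pred, true):
    """Distinguishing perturbations: 1.0 when every prediction is closer to its
    own measured profile than to any other perturbation's measured profile."""
    n = len(pred)
    closer = 0
    for i, p in enumerate(pred):
        own = l1(p, true[i])
        closer += sum(1 for t in true if l1(p, t) < own)
    return 1.0 - closer / (n * n)


def top_k_overlap(pred, true, control, k):
    """Expression changes: share of the k most-changed genes (relative to the
    unperturbed control profile) that prediction and measurement agree on."""
    def top_genes(profile):
        order = sorted(range(len(profile)),
                       key=lambda g: -abs(profile[g] - control[g]))
        return set(order[:k])

    scores = [len(top_genes(p) & top_genes(t)) / k for p, t in zip(pred, true)]
    return sum(scores) / len(scores)


# Toy data: 3 perturbations, 4 genes (made-up numbers).
control = [1.0, 1.0, 1.0, 1.0]
true = [[3.0, 1.0, 1.0, 0.5],
        [1.0, 0.2, 1.0, 1.0],
        [1.0, 1.0, 2.5, 1.0]]
pred = [[2.6, 1.1, 1.0, 0.7],
        [1.0, 0.4, 1.1, 1.0],
        [0.9, 1.0, 2.0, 1.0]]

print("MAE:", mean_absolute_error(pred, true))
print("Discrimination:", discrimination_score(pred, true))
print("Top-2 overlap:", top_k_overlap(pred, true, control, k=2))
```

Scoring all three together matters because a model can achieve low average error by predicting almost no change for every perturbation, yet such a model would fail to tell perturbations apart or recover the genes that actually shift.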
Entrants will compete against Arc’s own AI model, STATE, which demonstrated strong performance by using over 100 million perturbed cells to model transition effects and 167 million observational cells to understand gene expression variation. Unlike previous models that analyse cells individually, STATE uses a bi-directional transformer to make population-level predictions, offering robustness across diverse biological contexts and technical variations. Its modular design allows STATE to flexibly integrate large datasets while improving accuracy over previous approaches.
To help competitors train their models, Arc has also released the Virtual Cell Atlas, a massive resource combining datasets like scBaseCount and Tahoe-100M. The initiative has received support from leaders in the field, including Fabian Theis and Ci Chu, who stress the importance of large, high-quality datasets in enabling next-generation predictive models. With new entrants like Xaira Therapeutics and AI experts such as Bo Wang pushing the boundaries of the field, the Virtual Cell Challenge could pave the way for breakthroughs in understanding disease and identifying better therapeutic targets.