Foundation models in biological and chemical domains
AI + Medicine Newsletter 2024-02-06
A recent review, Scientific Large Language Models: A Survey on Biological & Chemical Domains, surveys advances in large foundation models for biological and chemical domains, covering models for text, small molecules, proteins, and genomic sequences. For each of these data types, it outlines the models, datasets, and evaluation methods. At 75 pages, it is a valuable resource for researchers navigating the landscape.
With hundreds of large biological and chemical models being developed, it may seem that the field has achieved a lot. However, as the review points out, this area is still in its nascent stage. The authors highlight several challenges: the lack of large-scale, high-quality training data, the difficulty of integrating domain-specific information into model architectures, and the need for reliable computational and experimental evaluation. The lack of high-quality data is probably the most critical issue. After all, your results are only as good as your data.
One notable omission: the review does not cover RNA sequences (or transcriptomic data), an important data type. Many RNA foundation models can be found in this RNA-FM GitHub repository. These models could be very useful for understanding gene…