Large Language Models for Protein
AlphaFold3: Accurate structure prediction of biomolecular interactions with AlphaFold3. Nature, 2024. [PDF] [Web Server]
ESM3: Simulating 500 million years of evolution with a language model. bioRxiv, 2024. [PDF] [Code]
AlphaFold-latest: Performance and structural coverage of the latest, in-development AlphaFold model. Report, 2023. [PDF]
ESMFold: Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 2023. [PDF] [Code]
xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein. bioRxiv, 2023. [PDF]
Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling. bioRxiv, 2023. [PDF] [Code]
CARP: Convolutions are competitive with transformers for protein sequence pretraining. bioRxiv, 2023. [PDF] [Code]
ESM-GearNet: A systematic study of joint representation learning on protein sequences and structures. bioRxiv, 2023. [PDF] [Code]
SaProt: Protein language modeling with structure-aware vocabulary. The Twelfth International Conference on Learning Representations, 2024. [PDF] [Code]
GearNet: Protein representation learning by geometric structure pretraining. The Eleventh International Conference on Learning Representations, 2023. [PDF] [Code]
ProGen2: Exploring the boundaries of protein language models. Cell Systems, 2023. [PDF] [Code]
ProGen: Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 2023. [PDF] [Code]
ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 2022. [PDF] [Code]
OntoProtein: Protein pretraining with gene ontology embedding. International Conference on Machine Learning, 2022. [PDF] [Code]
ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [PDF] [Code]
ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics, 2022. [PDF] [Code]
AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 2022. [PDF] [Database]
AlphaFold2: Highly accurate protein structure prediction with AlphaFold. Nature, 2021. [PDF] [Code] [Web Server]
MSA Transformer. International Conference on Machine Learning, 2021. [PDF] [Code] [Framework] [Details]
ESM-1b: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 2021. [PDF] [Code] [Framework] [Details]
ProSE: Learning the protein language: evolution, structure, and function. Cell Systems, 2021. [PDF] [Code]
Evaluating protein transfer learning with TAPE. Advances in Neural Information Processing Systems, 2019. [PDF] [Code]
Learning protein sequence embeddings using information from structure. arXiv preprint, 2019. [PDF] [Code]
SeqVec: Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics, 2019. [PDF] [Code]
Large Language Models for DNA
Evo: Sequence modeling and design from molecular to genome scale. bioRxiv, 2024. [PDF] [Code]
CALM: Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence, 2024. [PDF] [Code]
GPN: DNA language models are powerful predictors of genome-wide variant effects. Proceedings of the National Academy of Sciences, 2023. [PDF] [Web Server] [Code]
GPN-MSA: An alignment-based DNA language model for genome-wide variant effect prediction. bioRxiv, 2023. [PDF] [Web Server] [Code]
The Nucleotide Transformer: Building and evaluating robust foundation models for human genomics. bioRxiv, 2023. [PDF] [Code1] [Code2]
DNABERT-2: Efficient foundation model and benchmark for multi-species genome. bioRxiv, 2023. [PDF] [Code]
HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. NeurIPS, 2023. [PDF] [Code]
Enformer: Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 2021. [PDF] [Code]
DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 2021. [PDF] [Code]
Large Language Models for RNA
RNA-MSM: Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Research, 2023. [PDF] [Web Server] [Code]
RNA-FM: Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. bioRxiv, 2022. [PDF] [Code]
RNABERT: Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genomics and Bioinformatics, 2022. [PDF] [Code]
Large Language Models for scRNA-seq data
Geneformer: Transfer learning enables predictions in network biology. Nature, 2023. [PDF] [Code]
scFoundation: Large scale foundation model on single-cell transcriptomics. bioRxiv, 2023. [PDF] [Code]
scGPT: Towards building a foundation model for single-cell multi-omics using generative AI. bioRxiv, 2023. [PDF] [Code]
scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence, 2022. [PDF] [Code]
Large Language Models for gene expression data
AD-AE: Adversarial deconfounding autoencoder for learning robust gene expression embeddings. Bioinformatics, 2020. [PDF] [Code]