Large Language Models for Protein

  • AlphaFold3: Accurate structure prediction of biomolecular interactions with AlphaFold3. Nature, 2024. [PDF] [Web Server]

  • ESM3: Simulating 500 million years of evolution with a language model. bioRxiv, 2024. [PDF] [Code]

  • AlphaFold-latest: Performance and structural coverage of the latest, in-development AlphaFold model. Report, 2023. [PDF]

  • ESMFold: Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 2023. [PDF] [Code]

  • xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein. bioRxiv, 2023. [PDF]

  • Ankh: Optimized protein language model unlocks general-purpose modelling. bioRxiv, 2023. [PDF] [Code]

  • CARP: Convolutions are competitive with transformers for protein sequence pretraining. bioRxiv, 2023. [PDF] [Code]

  • ESM-GearNet: A systematic study of joint representation learning on protein sequences and structures. bioRxiv, 2023. [PDF] [Code]

  • SaProt: Protein language modeling with structure-aware vocabulary. The Twelfth International Conference on Learning Representations, 2024. [PDF] [Code]

  • GearNet: Protein representation learning by geometric structure pretraining. The Eleventh International Conference on Learning Representations, 2023. [PDF] [Code]

  • ProGen2: Exploring the boundaries of protein language models. Cell Systems, 2023. [PDF] [Code]

  • ProGen: Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 2023. [PDF] [Code]

  • ProtGPT2: ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 2022. [PDF] [Code]

  • OntoProtein: Protein pretraining with gene ontology embedding. International Conference on Learning Representations, 2022. [PDF] [Code]

  • ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [PDF] [Code]

  • ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics, 2022. [PDF] [Code]

  • AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 2022. [PDF] [Database]

  • AlphaFold2: Highly accurate protein structure prediction with AlphaFold. Nature, 2021. [PDF] [Code] [Web Server]

  • MSA Transformer. International Conference on Machine Learning, 2021. [PDF] [Code] [Framework] [Details]

  • ESM-1b: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 2021. [PDF] [Code] [Framework] [Details]

  • ProSE: Learning the protein language: evolution, structure, and function. Cell Systems, 2021. [PDF] [Code]

  • Evaluating protein transfer learning with TAPE. Advances in Neural Information Processing Systems, 2019. [PDF] [Code]

  • Learning protein sequence embeddings using information from structure. arXiv preprint, 2019. [PDF] [Code]

  • SeqVec: Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics, 2019. [PDF] [Code]

Large Language Models for DNA

  • Evo: Sequence modeling and design from molecular to genome scale. bioRxiv, 2024. [PDF] [Code]

  • CALM: Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence, 2024. [PDF] [Code]

  • GPN: DNA language models are powerful predictors of genome-wide variant effects. Proceedings of the National Academy of Sciences, 2023. [PDF] [Web Server] [Code]

  • GPN-MSA: An alignment-based DNA language model for genome-wide variant effect prediction. bioRxiv, 2023. [PDF] [Web Server] [Code]

  • The Nucleotide Transformer: Building and evaluating robust foundation models for human genomics. bioRxiv, 2023. [PDF] [Code1] [Code2]

  • DNABERT-2: Efficient foundation model and benchmark for multi-species genome. bioRxiv, 2023. [PDF] [Code]

  • HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. NeurIPS, 2023. [PDF] [Code]

  • Enformer: Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 2021. [PDF] [Code]

  • DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 2021. [PDF] [Code]

Large Language Models for RNA

  • RNA-MSM: Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Research, 2023. [PDF] [Web Server] [Code]

  • RNA-FM: Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. bioRxiv, 2022. [PDF] [Code]

  • RNABERT: Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genomics and Bioinformatics, 2022. [PDF] [Code]

Large Language Models for scRNA-seq data

  • Geneformer: Transfer learning enables predictions in network biology. Nature, 2023. [PDF] [Code]

  • scFoundation: Large scale foundation model on single-cell transcriptomics. bioRxiv, 2023. [PDF] [Code]

  • scGPT: Towards building a foundation model for single-cell multi-omics using generative AI. bioRxiv, 2023. [PDF] [Code]

  • scBERT: scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence, 2022. [PDF] [Code]

Large Language Models for gene expression data

  • AD-AE: Adversarial deconfounding autoencoder for learning robust gene expression embeddings. Bioinformatics, 2020. [PDF] [Code]