Association · arcadia
Taxonomic sampling bias affects protein language models
Claim that uneven species sampling results in systematic biases in the output of protein language models and negatively influences aspects of protein design.
Confidence
80%
active
Evidence Quote
“uneven sampling led to systematic biases in the output of protein language models and negatively influenced aspects of protein design”
Relationship
Phylogenetic bias → systematic bias → Machine learning models for protein design
Connections (4)
Taxonomic bias reasoning for AFDB and model outcomes · InferenceChain
Structural similarity often diverges from sequence similarity · Association
Reasoning: Language models and deep learning in protein structure prediction and design · InferenceChain
Tree-of-life sampling and algorithmic biases shape the performance of protein language/design models · InferenceChain
Evidence
“Evidence summarizing that language models for protein sequences can generalize beyond naturally occurring proteins, as found by Verkuil et al. (2022).”
Verkuil R et al. (2022). Language models generalize beyond natural proteins. doi:10.1101/2022.12.21.521521
“Evidence for ProtGPT2, a deep unsupervised language model developed for protein design, as shown by Ferruz et al. (2022).”
Ferruz N et al. (2022). ProtGPT2 is a deep unsupervised language model for protein design. doi:10.1038/s41467-022-32007-7
“Reference to Ding & Steinhardt (2024) describing bias in protein language models from sequence sampling.”
Ding F & Steinhardt J (2024). Protein language models are biased by unequal sequence sampling across the tree of life. doi:10.1101/2024.03.07.584001
“Reference to Madani et al. (2023) about language models generating functional protein sequences.”
Madani A et al. (2023). Large language models generate functional protein sequences across diverse families. doi:10.1038/s41587-022-01618-2
“Supports claims that protein language models learn evolutionarily and functionally relevant patterns.”
(2021). Learning the protein language: Evolution, structure, and function. doi:10.1016/j.cels.2021.05.017