Association·arcadia

Taxonomic sampling bias affects protein language models

Claim: uneven species sampling produces systematic biases in the output of protein language models and negatively affects aspects of protein design.

Confidence
80%
active

Evidence Quote

“uneven sampling led to systematic biases in the output of protein language models and negatively influenced aspects of protein design”
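To make "uneven sampling" concrete, the sketch below quantifies how skewed a set of species labels is and computes inverse-frequency weights that would give each species equal total influence. It is an illustrative assumption, not the method of any cited paper; the function names `effective_species` and `species_weights` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def effective_species(species_labels):
    """Exponential of the Shannon entropy of the species distribution.

    Equals the number of species when sampling is perfectly even and
    shrinks toward 1 as a single species dominates the dataset.
    """
    counts = Counter(species_labels)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


def species_weights(species_labels):
    """Per-sequence weights so that every species has the same total weight.

    This is plain inverse-frequency reweighting (an assumption chosen for
    illustration); weights across the whole dataset sum to 1.
    """
    counts = Counter(species_labels)
    n_species = len(counts)
    return [1.0 / (n_species * counts[s]) for s in species_labels]


# Toy dataset: one species contributes 8 of 10 sequences.
labels = ["E. coli"] * 8 + ["H. sapiens"] + ["A. thaliana"]
weights = species_weights(labels)
skew = effective_species(labels)  # well below 3 for this skewed sample
```

An effective species count far below the raw species count signals the kind of imbalance the claim describes; whether reweighting actually removes model bias is an empirical question this sketch does not answer.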

Relationship

Phylogenetic bias → systematic bias → Machine learning models for protein design

Arguments

Phylogenetic bias (subject)
Machine learning models for protein design (object)

Connections (4)

Taxonomic bias reasoning for AFDB and model outcomes (InferenceChain)
Structural similarity often diverges from sequence similarity (Association)
Reasoning: Language models and deep learning in protein structure prediction and design (InferenceChain)
Tree-of-life sampling and algorithmic biases shape the performance of protein language/design models (InferenceChain)

Evidence

“Evidence summarizing that language models for protein sequences can generalize beyond naturally occurring proteins, as found by Verkuil et al. (2022).”

Verkuil R et al. (2022). Language models generalize beyond natural proteins. doi:10.1101/2022.12.21.521521

“Evidence for ProtGPT2, a deep unsupervised language model developed for protein design, as shown by Ferruz et al. (2022).”

Ferruz N et al. (2022). ProtGPT2 is a deep unsupervised language model for protein design. doi:10.1038/s41467-022-32007-7

“Reference to Ding & Steinhardt (2024) describing bias in protein language models from sequence sampling.”

Ding F & Steinhardt J (2024). Protein language models are biased by unequal sequence sampling across the tree of life. doi:10.1101/2024.03.07.584001

“Reference to Madani et al. (2023) about language models generating functional protein sequences.”

Madani A et al. (2023). Large language models generate functional protein sequences across diverse families. doi:10.1038/s41587-022-01618-2

“Supports claims that protein language models learn evolutionarily and functionally relevant patterns.”

(2021). Learning the protein language: Evolution, structure, and function. doi:10.1016/j.cels.2021.05.017