Association·arcadia

Taxonomic bias alters Foldseek, protein language model, and protein design outcomes

Claims that taxonomic bias in AlphaFold and related datasets alters the outputs of Foldseek and of protein language models (such as ProGen2 and ESM2), and can negatively affect protein design.

Confidence
80%
active

Evidence Quote

Taxonomic makeup of AlphaFold and representative proteins used in Foldseek’s clustering workflow reflect biases. Uneven sampling led to systematic biases in the output of protein language models and negatively influenced protein design.

Relationship

Phylogenetic bias affects Machine learning models for protein design

Evidence

Evidence demonstrating evolutionary-scale prediction of atomic-level protein structure using a language model, as described by Lin et al. (2023).

Lin Z et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. doi:10.1126/science.ade2574