Taxonomic bias alters Foldseek, protein language model, and protein design outcomes

Association · arcadia

Claim: taxonomic bias in AlphaFold and related datasets alters the outputs of Foldseek and of protein language models (such as ProGen2 and ESM-2), and can negatively affect protein design.

Confidence

80%

active

Evidence Quote

“Taxonomic makeup of AlphaFold and representative proteins used in Foldseek’s clustering workflow reflect biases. Uneven sampling led to systematic biases in the output of protein language models and negatively influenced protein design.”

Relationship

Phylogenetic bias affects Machine learning models for protein design

Arguments

Phylogenetic bias (subject)

Machine learning models for protein design (object)
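
The relationship and its two arguments read as a subject-predicate-object triple carrying a confidence and a status. A minimal sketch of that structure in Python follows. The `Association` class and its field names are hypothetical, chosen for illustration only, and do not reflect how the knowledge graph itself stores records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """Hypothetical container for one association record (illustrative only)."""
    subject: str
    predicate: str
    obj: str
    confidence: float  # fraction in [0, 1]; the record lists 80%
    status: str

    def as_triple(self) -> tuple[str, str, str]:
        # Return the bare subject-predicate-object triple.
        return (self.subject, self.predicate, self.obj)


# Values copied from the record above; only the layout is an assumption.
record = Association(
    subject="Phylogenetic bias",
    predicate="affects",
    obj="Machine learning models for protein design",
    confidence=0.80,
    status="active",
)

print(record.as_triple())
```

Freezing the dataclass keeps a record immutable once created, which suits a claim that is cited by other entries.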

Connections (3)

Reasoning: Language models and deep learning in protein structure prediction and design (InferenceChain)

Tree-of-life sampling and algorithmic biases shape the performance of protein language/design models (InferenceChain)

Evidence

“Evidence demonstrating evolutionary-scale prediction of atomic-level protein structure using a language model, as described by Lin et al. (2023).”

Lin Z et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. doi:10.1126/science.ade2574
