Tree-of-life sampling and algorithmic biases shape the performance of protein language/design models

InferenceChain · arcadia

This reasoning chain explains how phylogenetic and taxonomic sampling biases, together with biases introduced by algorithmic curation and model architecture, systematically distort machine learning and generative models for protein design. It synthesizes evidence and claims about data imbalance, representation gaps, and the resulting limits on model generalizability and accuracy.

Confidence

80%

◑ partial · active · complexity: mid

Reasoning Steps (3)

Step 1: Uneven sampling across the tree of life distorts sequence/structure space

Step 2: Algorithmic curation and model training amplify data bias

Step 3: Model generalizability and accuracy limited by non-independence and bias
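
The three steps above can be made concrete with a small sketch. It is illustrative only and is not taken from the source: the function names, the toy taxon labels, and the choice of inverse-frequency weighting are all assumptions. The first function counteracts uneven sampling by giving every taxon the same total weight; the second holds out whole taxa so that closely related, non-independent sequences do not appear on both sides of a train/test split.

```python
from collections import Counter

def taxon_balanced_weights(taxa):
    """Return one weight per sequence so that every taxon carries equal total weight."""
    counts = Counter(taxa)
    n_taxa = len(counts)
    # Each taxon gets total mass 1/n_taxa, shared equally among its sequences.
    return [1.0 / (n_taxa * counts[t]) for t in taxa]

def split_by_taxon(taxa, held_out):
    """Index split that keeps every sequence of a held-out taxon out of training."""
    train = [i for i, t in enumerate(taxa) if t not in held_out]
    test = [i for i, t in enumerate(taxa) if t in held_out]
    return train, test

# Hypothetical toy labels: one heavily sampled group and two sparse ones.
taxa = ["Bacteria"] * 6 + ["Archaea"] * 2 + ["Eukaryota"] * 1
weights = taxon_balanced_weights(taxa)
train_idx, test_idx = split_by_taxon(taxa, held_out={"Archaea"})
```

In practice the grouping would more plausibly come from a phylogeny or from sequence-identity clusters than from coarse labels like these; the point of the sketch is only that weighting and splitting are done per group rather than per sequence.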

Source

Synthesis for current paper

Connections (5)

Phylogenetic bias in databases influences protein model outcomes (Association)

Taxonomic sampling bias affects protein language models (Association)

Taxonomic bias alters Foldseek, protein language model, and protein design outcomes (Association)

Phylogenetic biases and non-independence cap model generalizability (Association)

Pseudoreplication and non-independence limit language model generalizability (Association)
