Association·arcadia

Species abundance causes model bias

Protein language models preferentially generate proteins from abundant species, creating a bias toward those taxa

Confidence
90%
active

Evidence Quote

Species abundance disparities cause protein language model biases favoring abundant species

Relationship

Species abundance disparities in protein databases cause pLM performance bias

Evidence

Preprint demonstrating that protein language models reflect biases due to unequal sequence sampling in protein databases across taxa

Ding F & Steinhardt J (2024). Protein language models are biased by unequal sequence sampling across the tree of life. doi:10.1101/2024.03.07.584001