Association·arcadia
Data leakage affects biological foundation models
Claim: data leakage caused by nonindependence in training datasets leads to overly optimistic error estimates and overfitting in biological foundation models such as protein language models.
Confidence
80%
active
Evidence Quote
“Data leakage occurs when information intended to be restricted to test sets is learned by models, leading to overfitting and biased error estimates in biological foundation models”
Relationship
Data leakage causes Model bias
Arguments
Connections (7)
Signatures of nonindependence affect biological foundation models · InferenceChain
Genome contamination yields HGT false positives · Association
Limitations of phylogeny inference and utility of perplexity in BFMs · InferenceChain
Evolutionary patterns explain COX1 sequence diversification and model limitations · InferenceChain
Data leakage and training data bias impact model performance · InferenceChain
Data leakage and training data biases impact model performance · InferenceChain
Data leakage and biases impact biological foundation model performance · InferenceChain
Evidence
“Preprint reporting on strategies to detect and avoid homology-based data leakage in sequence models trained on genome data”
Rafi AM et al. (2025). Detecting and avoiding homology-based data leakage in genome-trained sequence models. doi:10.1101/2025.01.22.634321
“Paper describing recommended questions and guidelines to prevent data leakage in biological machine learning”
Bernett J et al. (2024). Guiding questions to avoid data leakage in biological machine learning applications. doi:10.1038/s41592-024-02362-y
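Illustration
The claim above turns on related sequences ending up on both sides of a train/test split. The sketch below is a minimal illustration of that idea, not the method of either cited work. It assumes each sequence has already been assigned to a homology cluster by some external tool. The sequence IDs, cluster IDs, and the function names `split_by_cluster` and `shared_clusters` are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so related sequences never straddle the split.

    cluster_of maps a sequence ID to its (hypothetical) homology cluster ID.
    """
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    n_total = len(cluster_of)
    train, test = [], []
    for cluster_id in cluster_ids:
        # Fill the test set with whole clusters until it reaches the target size.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(members[cluster_id])
    return train, test


def shared_clusters(train, test, cluster_of):
    """Return cluster IDs present in both sets; a non-empty result signals leakage."""
    return {cluster_of[s] for s in train} & {cluster_of[s] for s in test}


# Hypothetical toy data: six sequences in four homology clusters.
cluster_of = {
    "seqA": "c1", "seqB": "c1",
    "seqC": "c2",
    "seqD": "c3", "seqE": "c3",
    "seqF": "c4",
}

# A naive positional split separates seqD and seqE, which share cluster c3.
ids = list(cluster_of)
naive_train, naive_test = ids[:4], ids[4:]
naive_leak = shared_clusters(naive_train, naive_test, cluster_of)

# A cluster-aware split keeps every cluster on one side only.
train, test = split_by_cluster(cluster_of)
grouped_leak = shared_clusters(train, test, cluster_of)
```

Here `naive_leak` contains the cluster split across both sets, while `grouped_leak` is empty. Test error measured on the naive split would partly reflect memorized relatives rather than generalization.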