Association·arcadia
Phylogenetic data engineering optimizes protein diversity in databases
Claim: intentionally collecting protein sequence and structure data to maximize biological diversity across taxa (phylogenetic data engineering) will help recover the true distribution of protein diversity and improve the generalization of protein prediction models.
Confidence
90%
active
Evidence Quote
“Future collection of protein data should optimize biological diversity; undersampled taxa prioritized across the tree of life; taxonomic completeness can guide efforts.”
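The prioritization described in the quote can be sketched in a few lines. This is a minimal illustration, not the source's method: it assumes "taxonomic completeness" means the fraction of a taxon's known species that have at least one protein entry in the database, and the names `TaxonCoverage` and `prioritize_undersampled`, the `budget` parameter, and the toy counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonCoverage:
    """Sampling status of one taxon (all fields are illustrative assumptions)."""
    name: str
    known_species: int    # species described for the taxon
    sampled_species: int  # species with at least one protein entry

    def __post_init__(self):
        # Reject counts that would make the completeness ratio meaningless.
        if self.known_species <= 0:
            raise ValueError("known_species must be positive")
        if not 0 <= self.sampled_species <= self.known_species:
            raise ValueError("sampled_species must be in [0, known_species]")

    @property
    def completeness(self) -> float:
        # Assumed definition: fraction of known species already sampled.
        return self.sampled_species / self.known_species


def prioritize_undersampled(taxa, budget):
    """Return up to `budget` taxa, least complete first (ties broken by name)."""
    ranked = sorted(taxa, key=lambda t: (t.completeness, t.name))
    return ranked[:budget]


if __name__ == "__main__":
    # Toy counts, not real database statistics.
    taxa = [
        TaxonCoverage("taxon_A", known_species=1000, sampled_species=900),
        TaxonCoverage("taxon_B", known_species=500, sampled_species=25),
        TaxonCoverage("taxon_C", known_species=200, sampled_species=50),
    ]
    for taxon in prioritize_undersampled(taxa, budget=2):
        print(f"{taxon.name}: {taxon.completeness:.2f}")
```

Ranking by completeness alone ignores how much novel protein diversity a taxon is likely to hold. A fuller scheme might also weight taxa by phylogenetic distance from well-sampled clades, which this sketch does not attempt.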
Relationship
Dataset curation increases Protein diversity
Connections (2)
Evidence
“Evidence describing the Universal Protein Resource (UniProt) as a foundational resource for protein data, as outlined by Bairoch (2004).”
“Supports claims about protein diversity and how it informs dataset curation and structural analyses in protein databases.”
(2023). Clustering predicted structures at the scale of the known protein universe. doi:10.1038/s41586-023-06622-3
“Supports claims involving sensitive protein sequence searching and analysis of massive sequence datasets.”
(2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. doi:10.1038/nbt.3988