Association·arcadia

Phylogenetic data engineering optimizes protein diversity in databases

Claim: intentionally collecting protein sequence and structure data to optimize biological diversity across taxa (phylogenetic data engineering) will help recover the true distribution of protein diversity and help protein prediction models generalize better.

Confidence
90%
active

Evidence Quote

“Future collection of protein data should optimize biological diversity; undersampled taxa prioritized across the tree of life; taxonomic completeness can guide efforts.”
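One way to read "taxonomic completeness can guide efforts" is as a ranking problem: estimate how completely each taxon is sampled, then collect from the least complete taxa first. The sketch below is a minimal illustration under stated assumptions, not the method used by the source. The `TaxonCoverage` class, the `prioritize` function, the definition of completeness as sampled species divided by known species, and all taxon names and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonCoverage:
    """Hypothetical record of how well one taxon is represented in a database."""
    name: str
    known_species: int    # assumed estimate of described species in the taxon
    sampled_species: int  # species with at least one protein record

    @property
    def completeness(self) -> float:
        # Assumed definition: fraction of known species that have been sampled.
        if self.known_species == 0:
            return 1.0  # nothing known to sample, so treat as complete
        return self.sampled_species / self.known_species


def prioritize(taxa):
    """Order taxa from least to most complete; ties go to the larger taxon."""
    return sorted(taxa, key=lambda t: (t.completeness, -t.known_species))


# Made-up counts for illustration only.
taxa = [
    TaxonCoverage("taxon_A", known_species=1000, sampled_species=900),
    TaxonCoverage("taxon_B", known_species=5000, sampled_species=50),
    TaxonCoverage("taxon_C", known_species=200, sampled_species=20),
]

ranked = [t.name for t in prioritize(taxa)]
print(ranked)
```

With these made-up counts, `taxon_B` (1% sampled) ranks ahead of `taxon_C` (10%) and `taxon_A` (90%). A real prioritization would likely also weigh phylogenetic distance between taxa, which this sketch ignores.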

Relationship

Dataset curation increases Protein diversity

Arguments

Dataset curation (subject)
Protein diversity (object)

Connections (2)

How phylogenetic data engineering expands protein model generalizability (InferenceChain)
Structural similarity often diverges from sequence similarity (Association)

Evidence

“Evidence describing the Universal Protein Resource (UniProt) as a foundational resource for protein data, as outlined by Bairoch (2004).”

Bairoch A. (2004). The Universal Protein Resource (UniProt). doi:10.1093/nar/gki070

“Supports claims about protein diversity and how it informs dataset curation and structural analyses in protein databases.”

(2023). Clustering predicted structures at the scale of the known protein universe. doi:10.1038/s41586-023-06622-3

“Supports claims involving sensitive protein sequence searching and analysis of massive sequence datasets.”

(2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. doi:10.1038/nbt.3988