ReasoningCheckpoint·arcadia

Limitations of clustering-based filtering methods

Filtering sequences by similarity through clustering can be insensitive to phylogenetic structure. As a result, sequences may be retained unevenly across protein families, which skews the distribution of the training data.
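
The effect can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the source: the toy sequences, the family labels "A" and "B", the 0.8 identity threshold, and the simplified greedy clustering, which only loosely resembles what real clustering tools do. Family A stands for a densely sampled family of near-identical sequences, and family B for a divergent one.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions (assumes equal-length, pre-aligned sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(records, threshold):
    """Keep a sequence as a cluster representative only if its identity to
    every earlier representative is below `threshold` (simplified greedy scheme)."""
    reps = []
    for family, seq in records:
        if all(identity(seq, rep_seq) < threshold for _, rep_seq in reps):
            reps.append((family, seq))
    return reps


def retention_by_family(records, reps):
    """Fraction of each family's sequences that survive filtering."""
    total = Counter(family for family, _ in records)
    kept = Counter(family for family, _ in reps)
    return {family: kept[family] / total[family] for family in total}


# Hypothetical toy data: (family label, sequence).
records = [
    ("A", "MKTAYIAKQR"),  # family A: near-identical variants
    ("A", "MKTAYIAKQK"),
    ("A", "MKTAYIAKQL"),
    ("A", "MKTAYLAKQR"),
    ("B", "MSLLPEDFGH"),  # family B: divergent members
    ("B", "MALVPQDWGH"),
    ("B", "MTLIPKDYGH"),
]

reps = greedy_cluster(records, threshold=0.8)
retention = retention_by_family(records, reps)
print(retention)  # {'A': 0.25, 'B': 1.0}
```

In this toy case family A drops from 4 of 7 input sequences to 1 of 4 retained sequences, while family B keeps all of its members, so the same threshold changes the family balance of the filtered set.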

Confidence
80%
◑ partial · active

Part of Chain

Impact of sequence similarity filtering on data leakage and sequence diversity