cluster_selection_persistence does not prevent low-persistence clusters from appearing #678

Open
@notluquis

Description

I’m observing that even when I set:

import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=XX,
    min_samples=YY,
    cluster_selection_persistence=0.2,
    algorithm='best',
    core_dist_n_jobs=-1,
)
labels = clusterer.fit_predict(X)

some of the final clusters have a measured persistence (cluster_persistence_) below 0.2. In other words, branches with low persistence still survive and become clusters.

Steps to reproduce
1. Prepare any dataset X.
2. Run:

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,
    min_samples=10,
    cluster_selection_persistence=0.2,
    algorithm='best',
    core_dist_n_jobs=-1,
)
labels = clusterer.fit_predict(X)
print(clusterer.cluster_persistence_)
# observe values < 0.2
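To make the check in the last step explicit, here is a minimal sketch of how the offending clusters can be listed. It assumes that `cluster_persistence_[i]` is the score for cluster label `i`; the helper name `clusters_below_threshold` is hypothetical, and the example values are made up for illustration rather than taken from a real run.

```python
def clusters_below_threshold(persistence, threshold):
    """Return (label, persistence) pairs for clusters scoring below threshold.

    Assumption: persistence[i] is the score of cluster label i, as with
    clusterer.cluster_persistence_ after fitting.
    """
    return [
        (label, float(score))
        for label, score in enumerate(persistence)
        if score < threshold
    ]


# Illustrative values only; with a fitted model, pass
# clusterer.cluster_persistence_ instead.
example_persistence = [0.35, 0.08, 0.50, 0.19]
offenders = clusters_below_threshold(example_persistence, 0.2)
print(offenders)
```

Any non-empty result here is a cluster that survived despite scoring below the threshold passed as `cluster_selection_persistence`.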

What I’ve tried
1. Default persistence flag

HDBSCAN(..., cluster_selection_persistence=0.2)

→ low-persistence clusters still appear in labels_, cluster_persistence_, and in the condensed tree.

2. Manual patch in _tree_to_labels

I tried to insert a filter immediately after condensation:

def _tree_to_labels(...):
    condensed = condense_tree(single_linkage_tree, min_cluster_size)
-   if cluster_selection_persistence > 0.0 and condensed.shape[0] > 0:
-       condensed = simplify_hierarchy(condensed, cluster_selection_persistence)
+   # attempt to filter low-persistence branches too early
+   if cluster_selection_persistence > 0.0 and condensed.size > 0:
+       condensed = condensed[condensed["lambda_val"] >= cluster_selection_persistence]
+   else:
+       condensed = simplify_hierarchy(condensed, cluster_selection_persistence)

After reinstalling and reloading the package, the behavior did not change: branches with lambda_val < 0.2 still become clusters.

Expected behavior / Requested behavior

I would expect that setting cluster_selection_persistence=0.2 would prevent any cluster whose true persistence is below 0.2 from appearing in the final labels or tree.

Actual behavior

Clusters with cluster_persistence_ < 0.2 still appear, and their low persistence values are included in clusterer.cluster_persistence_ and shown in the condensed tree.

Questions

  1. How exactly does cluster_selection_persistence interact with the EOM/leaf selection?
  2. At what point is persistence applied, and why are low-persistence branches still surviving?
  3. What is the intended mechanism to ensure that any branch below the persistence threshold is treated as noise (label -1)?

Any pointers to the relevant code paths or suggestions for how to enforce this behavior would be greatly appreciated.
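
For reference, the requested behavior can be approximated after fitting with a post-hoc filter on the labels. This is only a sketch of a possible workaround, not how the library implements `cluster_selection_persistence`: it assumes `cluster_persistence_[i]` is the score for label `i`, the helper name `drop_low_persistence` is hypothetical, and the arrays below are made-up illustrative values.

```python
import numpy as np


def drop_low_persistence(labels, persistence, threshold):
    """Relabel points of clusters scoring below threshold as noise (-1).

    Assumption: persistence[i] is the score of cluster label i, matching
    the layout of clusterer.cluster_persistence_.
    """
    labels = np.asarray(labels)
    persistence = np.asarray(persistence, dtype=float)

    filtered = labels.copy()
    clustered = labels >= 0  # points already labelled -1 are left untouched

    # Look up each clustered point's cluster score and flag the low ones.
    low = np.zeros(labels.shape, dtype=bool)
    low[clustered] = persistence[labels[clustered]] < threshold

    filtered[low] = -1
    return filtered


# Illustrative values only; with a fitted model, pass clusterer.labels_
# and clusterer.cluster_persistence_ instead.
labels = np.array([0, 0, 1, 1, 2, -1])
persistence = np.array([0.35, 0.08, 0.50])
filtered = drop_low_persistence(labels, persistence, 0.2)
print(filtered)
```

Note that this only rewrites the label array. The surviving labels are not renumbered, and the condensed tree, membership probabilities, and `cluster_persistence_` itself are left unchanged, so it does not answer the questions above about where the threshold is applied inside the library.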
