pykanto.signal.cluster

Perform dimensionality reduction and clustering.

Functions

hdbscan_cluster(embedding[, ...])

Perform HDBSCAN clustering from vector array or distance matrix.

reduce_and_cluster(dataset, ID[, ...])

Perform UMAP dimensionality reduction followed by HDBSCAN clustering for a given ID.

reduce_and_cluster_parallel(dataset[, ...])

Parallel implementation of reduce_and_cluster().

umap_reduce(data[, n_neighbors, ...])

Uniform Manifold Approximation and Projection.

pykanto.signal.cluster.umap_reduce(data: np.ndarray, n_neighbors: int = 15, n_components: int = 2, min_dist: float = 0.1, random_state: int | None = None, verbose: bool = False, **kwargs_umap) → Tuple[np.ndarray, umap.UMAP]

Uniform Manifold Approximation and Projection. Uses either the cuml GPU-accelerated version or the ‘regular’ umap version. See the documentation of either for valid kwargs.

Parameters
  • data (array-like, shape = (n_samples, n_features)) – Data to reduce.

  • n_neighbors (int, optional) – See UMAP docs. Defaults to 15.

  • n_components (int, optional) – See UMAP docs. Defaults to 2.

  • min_dist (float, optional) – See UMAP docs. Defaults to 0.1.

  • random_state (int, optional) – See UMAP docs. Defaults to None.

  • kwargs_umap – Extra named arguments passed to umap.UMAP or cuml.umap.UMAP.

Returns

Embedding coordinates and UMAP reducer.

Return type

Tuple[np.ndarray, umap.UMAP]
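The call documented above can be sketched as follows. This is a minimal illustration based only on the signature shown here: the wrapper name embed_2d and the seed value are hypothetical, and the import is done inside the function so the sketch can be loaded on a machine without pykanto installed.

```python
def embed_2d(features, seed=0):
    """Project `features` (n_samples, n_features) to two dimensions.

    `embed_2d` and `seed` are illustrative names, not part of pykanto.
    """
    # Lazy import: only needed when the function is actually called.
    from pykanto.signal.cluster import umap_reduce

    # Arguments mirror the documented defaults, except for random_state,
    # which is fixed here so that repeated runs give the same embedding.
    embedding, reducer = umap_reduce(
        features,
        n_neighbors=15,
        n_components=2,
        min_dist=0.1,
        random_state=seed,
    )
    # `embedding` holds the coordinates; `reducer` is the fitted UMAP object.
    return embedding, reducer
```

UMAP is stochastic, so passing an integer random_state is the usual way to make an embedding repeatable; leaving it as None gives a different layout on each run.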

pykanto.signal.cluster.hdbscan_cluster(embedding: np.ndarray, min_cluster_size: int = 5, min_samples: None | int = None, **kwargs_hdbscan) → HDBSCAN

Perform HDBSCAN clustering from vector array or distance matrix. Convenience wrapper. See the HDBSCAN* docs.

Parameters
  • embedding (np.ndarray) – Data to cluster. See hdbscan documentation for more.

  • min_cluster_size (int, optional) – Minimum number of samples a group needs to be considered a cluster. Defaults to 5.

  • min_samples (int, optional) – Controls how ‘conservative’ the clustering is: larger values cause more points to be declared as noise. Defaults to None.

  • kwargs_hdbscan – Extra named arguments passed to HDBSCAN.

Returns

Fitted HDBSCAN object. Cluster labels are stored in its labels_ attribute.

Return type

HDBSCAN
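Once a clusterer is returned, its labels_ attribute can be summarised with a few lines of standard-library code. The helper below is a hypothetical sketch, not part of pykanto; it relies on the HDBSCAN convention that points left unclustered receive the label -1.

```python
from collections import Counter
from typing import Dict, Iterable

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it treats as noise


def summarise_labels(labels: Iterable[int]) -> Dict[str, object]:
    """Count clusters and noise points in a sequence of cluster labels."""
    # int() also accepts NumPy integer scalars, as found in labels_.
    counts = Counter(int(label) for label in labels)
    # Remove the noise entry so only real clusters remain in `counts`.
    n_noise = counts.pop(NOISE_LABEL, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(sorted(counts.items())),
    }


# Toy labels standing in for clusterer.labels_
summary = summarise_labels([0, 0, 1, -1, 1, 1, -1])
```

A high noise count relative to the number of samples is one sign that min_cluster_size or min_samples may be set too high for the data.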

pykanto.signal.cluster.reduce_and_cluster(dataset: KantoData, ID: str, song_level: bool = False, min_sample: int = 10, kwargs_umap: Dict[str, Any] = {}, kwargs_hdbscan: Dict[str, Any] = {}) → pd.DataFrame | None

Perform UMAP dimensionality reduction followed by HDBSCAN clustering for a given ID.

Parameters
  • dataset (KantoData) – Data to be used.

  • ID (str) – Grouping factor.

  • song_level (bool, optional) – Whether to use the average of all units in each vocalisation instead of all units. Defaults to False.

  • min_sample (int, optional) – Minimum number of vocalisations or units. Defaults to 10.

  • kwargs_umap (dict) – Dictionary of UMAP parameters.

  • kwargs_hdbscan (dict) – Dictionary of HDBSCAN parameters.

Returns

DataFrame with columns [‘vocalisation_key’, ‘ID’, ‘idx’, ‘umap_x’, ‘umap_y’, ‘auto_class’], or None if the sample size is too small.

Return type

pd.DataFrame | None
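Because the return value is None when an ID has too few samples, code that loops over several IDs needs to separate usable results from skipped ones. The helper below is a hypothetical, library-independent sketch of that pattern; collect_results and run_one are illustrative names, and a dictionary lookup stands in for the real function call.

```python
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def collect_results(
    ids: Iterable[str], run_one: Callable[[str], Optional[T]]
) -> Tuple[List[T], List[str]]:
    """Call `run_one` for each ID; return (results, IDs that yielded None)."""
    results: List[T] = []
    skipped: List[str] = []
    for id_ in ids:
        out = run_one(id_)
        if out is None:
            # Mirrors the documented "sample size is too small" case.
            skipped.append(id_)
        else:
            results.append(out)
    return results, skipped


# Stand-in for reduce_and_cluster: ID "B" has too few samples.
fake_outputs = {"A": ["row_a1", "row_a2"], "B": None, "C": ["row_c1"]}
results, skipped = collect_results(["A", "B", "C"], fake_outputs.get)
```

With pykanto, run_one could be something like lambda i: reduce_and_cluster(dataset, i), and the collected data frames could then be combined with pd.concat(results); both uses are assumptions based on the signature and return type documented above.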

pykanto.signal.cluster.reduce_and_cluster_parallel(dataset: KantoData, kwargs_umap: dict = {}, kwargs_hdbscan: dict = {}, min_sample: int = 10, num_cpus: float | None = None) → pd.DataFrame | None

Parallel implementation of reduce_and_cluster().