pykanto.signal.cluster
Perform dimensionality reduction and clustering.
Functions
hdbscan_cluster – Perform HDBSCAN clustering from vector array or distance matrix.
reduce_and_cluster
reduce_and_cluster_parallel – Parallel implementation of reduce_and_cluster().
umap_reduce – Uniform Manifold Approximation and Projection.
- pykanto.signal.cluster.umap_reduce(data: np.ndarray, n_neighbors: int = 15, n_components: int = 2, min_dist: float = 0.1, random_state: int | None = None, verbose: bool = False, **kwargs_umap) → Tuple[np.ndarray, umap.UMAP]
 Uniform Manifold Approximation and Projection (UMAP). Uses either the GPU-accelerated cuml implementation or the ‘regular’ umap implementation. See the documentation of whichever is used for valid keyword arguments.
- Parameters
 data (array-like, shape = (n_samples, n_features)) – Data to reduce.
n_neighbors (int, optional) – See UMAP docs. Defaults to 15.
n_components (int, optional) – See UMAP docs. Defaults to 2.
min_dist (float, optional) – See UMAP docs. Defaults to 0.1.
random_state (int, optional) – See UMAP docs. Defaults to None.
verbose (bool, optional) – Whether to print progress output. Defaults to False.
kwargs_umap – Extra named arguments passed to umap.UMAP or cuml.umap.UMAP.
- Returns
 Embedding coordinates and UMAP reducer.
- Return type
 Tuple[np.ndarray, umap.UMAP]
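 The choice between the cuml and umap backends described above could be arranged roughly as follows. This is a minimal sketch, not pykanto's actual implementation: pick_umap_backend, build_umap_params and reduce_sketch are hypothetical helpers. It assumes that whichever of cuml or umap (umap-learn) is installed exposes a UMAP class with a fit_transform method, and that the GPU version is preferred when both are present.

```python
from importlib.util import find_spec
from typing import Any, Dict, Optional


def pick_umap_backend() -> Optional[str]:
    """Return 'cuml' if installed, else 'umap' if installed, else None."""
    for name in ("cuml", "umap"):
        if find_spec(name) is not None:
            return name
    return None


def build_umap_params(
    n_neighbors: int = 15,
    n_components: int = 2,
    min_dist: float = 0.1,
    random_state: Optional[int] = None,
    verbose: bool = False,
    **kwargs_umap: Any,
) -> Dict[str, Any]:
    """Merge the documented defaults with any extra keyword arguments."""
    params: Dict[str, Any] = {
        "n_neighbors": n_neighbors,
        "n_components": n_components,
        "min_dist": min_dist,
        "random_state": random_state,
        "verbose": verbose,
    }
    params.update(kwargs_umap)
    return params


def reduce_sketch(data, **params):
    """Fit a UMAP reducer on `data` and return (embedding, reducer)."""
    backend = pick_umap_backend()
    if backend == "cuml":
        from cuml import UMAP  # GPU-accelerated implementation
    elif backend == "umap":
        from umap import UMAP  # CPU implementation from umap-learn
    else:
        raise ImportError("Neither cuml nor umap-learn is installed.")
    reducer = UMAP(**build_umap_params(**params))
    embedding = reducer.fit_transform(data)
    return embedding, reducer
```

 Extra keyword arguments such as metric="cosine" pass straight through build_umap_params to the reducer, mirroring the role of kwargs_umap.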
- pykanto.signal.cluster.hdbscan_cluster(embedding: np.ndarray, min_cluster_size: int = 5, min_samples: None | int = None, **kwargs_hdbscan) → HDBSCAN
 Perform HDBSCAN clustering from vector array or distance matrix. Convenience wrapper. See the HDBSCAN* docs.
- Parameters
 embedding (np.ndarray) – Data to cluster. See hdbscan documentation for more.
min_cluster_size (int, optional) – Minimum number of samples to consider a cluster. Defaults to 5.
min_samples (int, optional) – Controls how ‘conservative’ clustering is. Larger values = more points will be declared as noise. Defaults to None.
kwargs_hdbscan – Extra named arguments passed to HDBSCAN.
- Returns
 HDBSCAN object. Cluster labels are available in its labels_ attribute.
- Return type
 HDBSCAN
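 Once the clusterer has been returned, its labels_ array can be summarised. The sketch below relies on the hdbscan convention that noise points are labelled -1. summarise_labels is a hypothetical helper, and the label list is made-up illustration data rather than output from a real run.

```python
from collections import Counter
from typing import Dict, Iterable


def summarise_labels(labels: Iterable[int]) -> Dict[str, object]:
    """Count clusters and noise points in an HDBSCAN label array."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # hdbscan marks noise points with -1
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(sorted(counts.items())),
    }


# With a real embedding this would be:
#   clusterer = hdbscan_cluster(embedding, min_cluster_size=5)
#   labels = clusterer.labels_
labels = [0, 0, 1, -1, 1, -1, 2]  # made-up labels for illustration
summary = summarise_labels(labels)
```

 A large n_noise relative to the number of points may indicate that min_samples is set too conservatively for the data.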
- pykanto.signal.cluster.reduce_and_cluster(dataset: KantoData, ID: str, song_level: bool = False, min_sample: int = 10, kwargs_umap: Dict[str, Any] = {}, kwargs_hdbscan: Dict[str, Any] = {}) → pd.DataFrame | None
- Parameters
 dataset (KantoData) – Data to be used.
ID (str) – Grouping factor.
song_level (bool, optional) – Whether to use the average of all units in each vocalisation instead of all units. Defaults to False.
min_sample (int, optional) – Minimum number of vocalisations or units. Defaults to 10.
kwargs_umap (dict) – Dictionary of UMAP parameters.
kwargs_hdbscan (dict) – Dictionary of HDBSCAN parameters.
- Returns
 Dataframe with columns [‘vocalisation_key’, ‘ID’, ‘idx’, ‘umap_x’, ‘umap_y’, ‘auto_class’], or None if the sample size is too small.
- Return type
 pd.DataFrame | None
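 The effect of song_level and min_sample can be pictured with a dependency-free sketch. It assumes, beyond what is documented here, that each unit is a fixed-length feature vector, that units are averaged element-wise per vocalisation when song_level is True, and that a sample count below min_sample yields None. collect_samples and its input layout are hypothetical and are not pykanto's implementation.

```python
from statistics import fmean
from typing import Dict, List, Optional, Sequence, Tuple

Sample = Tuple[str, Optional[int], List[float]]


def collect_samples(
    units: Dict[str, List[Sequence[float]]],
    song_level: bool = False,
    min_sample: int = 10,
) -> Optional[List[Sample]]:
    """Gather (vocalisation_key, idx, vector) samples, or None if too few."""
    if song_level:
        # One sample per vocalisation: the element-wise mean of its units.
        samples: List[Sample] = [
            (key, None, [fmean(column) for column in zip(*vectors)])
            for key, vectors in units.items()
            if vectors
        ]
    else:
        # One sample per unit, indexed by its position in the vocalisation.
        samples = [
            (key, idx, list(vector))
            for key, vectors in units.items()
            for idx, vector in enumerate(vectors)
        ]
    # Assumed guard: too few samples means nothing is returned.
    return samples if len(samples) >= min_sample else None


# Made-up data: two vocalisations with two units and one unit respectively.
units = {"a": [[1.0, 2.0], [3.0, 4.0]], "b": [[5.0, 6.0]]}
unit_level = collect_samples(units, min_sample=3)
song_level = collect_samples(units, song_level=True, min_sample=2)
```

 Under these assumptions, averaging reduces the sample count from the number of units to the number of vocalisations, so the same min_sample is more likely to return None when song_level is True.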
- pykanto.signal.cluster.reduce_and_cluster_parallel(dataset: KantoData, kwargs_umap: dict = {}, kwargs_hdbscan: dict = {}, min_sample: int = 10, num_cpus: float | None = None) → pd.DataFrame | None
 Parallel implementation of reduce_and_cluster().
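 One way to picture the parallel variant is as a per-ID map whose non-empty results are concatenated. The sketch below uses the standard-library concurrent.futures module purely for illustration. It makes no claim about how pykanto distributes work, and run_per_id and fake_worker are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]


def run_per_id(
    ids: Iterable[str],
    worker: Callable[[str], Optional[List[Row]]],
    max_workers: Optional[int] = None,
) -> Optional[List[Row]]:
    """Apply `worker` to every ID concurrently and merge non-empty results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, ids))  # map preserves input order
    rows = [row for result in results if result is not None for row in result]
    return rows or None  # None when every ID was skipped


def fake_worker(ID: str) -> Optional[List[Row]]:
    """Stand-in for a per-ID reduce-and-cluster step."""
    if ID == "B2":  # pretend this ID has too few samples
        return None
    return [{"ID": ID, "auto_class": 0}]


combined = run_per_id(["A1", "B2", "C3"], fake_worker, max_workers=2)
```

 IDs whose worker returns None are dropped from the combined result, which is consistent with the pd.DataFrame | None return type above.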