pykanto.signal.cluster
Perform dimensionality reduction and clustering.
Functions
hdbscan_cluster() – Perform HDBSCAN clustering from vector array or distance matrix.
reduce_and_cluster()
reduce_and_cluster_parallel() – Parallel implementation of reduce_and_cluster().
umap_reduce() – Uniform Manifold Approximation and Projection.
- pykanto.signal.cluster.umap_reduce(data: np.ndarray, n_neighbors: int = 15, n_components: int = 2, min_dist: float = 0.1, random_state: int | None = None, verbose: bool = False, **kwargs_umap) -> Tuple[np.ndarray, umap.UMAP]
Uniform Manifold Approximation and Projection. Uses either the cuml GPU-accelerated version or the ‘regular’ umap version. See the documentation of either for valid kwargs.
- Parameters
data (array-like, shape = (n_samples, n_features)) – Data to reduce.
n_neighbors (int, optional) – See UMAP docs. Defaults to 15.
n_components (int, optional) – See UMAP docs. Defaults to 2.
min_dist (float, optional) – See UMAP docs. Defaults to 0.1.
random_state (int, optional) – See UMAP docs. Defaults to None.
verbose (bool, optional) – See UMAP docs. Defaults to False.
kwargs_umap – Extra named arguments passed to umap.UMAP or cuml.umap.UMAP.
- Returns
Embedding coordinates and UMAP reducer.
- Return type
Tuple[np.ndarray, umap.UMAP]
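The choice between the two backends and the merging of extra keyword arguments can be sketched as follows. This is a minimal illustration that uses only the standard library; `pick_umap_backend` and `build_umap_kwargs` are hypothetical helpers, and how pykanto itself selects a backend is not stated in this page.

```python
from importlib.util import find_spec
from typing import Any, Dict, Optional


def pick_umap_backend() -> Optional[str]:
    """Return 'cuml' if it is importable, else 'umap', else None.

    Assumption: the GPU version is preferred whenever it is installed.
    """
    for name in ("cuml", "umap"):
        if find_spec(name) is not None:
            return name
    return None


def build_umap_kwargs(
    n_neighbors: int = 15,
    n_components: int = 2,
    min_dist: float = 0.1,
    random_state: Optional[int] = None,
    verbose: bool = False,
    **kwargs_umap: Any,
) -> Dict[str, Any]:
    """Combine the documented defaults with any extra UMAP arguments."""
    params: Dict[str, Any] = {
        "n_neighbors": n_neighbors,
        "n_components": n_components,
        "min_dist": min_dist,
        "random_state": random_state,
        "verbose": verbose,
    }
    # Extra named arguments are forwarded unchanged to the reducer.
    params.update(kwargs_umap)
    return params


backend = pick_umap_backend()
params = build_umap_kwargs(metric="euclidean")
```

The resulting dictionary holds the arguments that would be handed to whichever UMAP class is available; `metric` is shown only as an example of an extra argument.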
- pykanto.signal.cluster.hdbscan_cluster(embedding: np.ndarray, min_cluster_size: int = 5, min_samples: None | int = None, **kwargs_hdbscan) -> HDBSCAN
Perform HDBSCAN clustering from vector array or distance matrix. Convenience wrapper. See the HDBSCAN* docs.
- Parameters
embedding (np.ndarray) – Data to cluster. See hdbscan documentation for more.
min_cluster_size (int, optional) – Minimum number of samples required to form a cluster. Defaults to 5.
min_samples (int, optional) – Controls how ‘conservative’ the clustering is: larger values cause more points to be declared noise. Defaults to None.
kwargs_hdbscan – Extra named arguments passed to HDBSCAN.
- Returns
HDBSCAN object. Labels are at self.labels_.
- Return type
HDBSCAN
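As a rough illustration of how the returned labels can be read, the sketch below summarises a label sequence such as the one found in `labels_`. It relies on the HDBSCAN convention that noise points are labelled -1; `summarise_labels` is a hypothetical helper and not part of pykanto.

```python
from collections import Counter
from typing import Dict, Sequence, Union

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points that belong to no cluster


def summarise_labels(labels: Sequence[int]) -> Dict[str, Union[int, float]]:
    """Count clusters and noise points in a sequence of cluster labels."""
    n_total = len(labels)
    counts = Counter(int(label) for label in labels)
    # Remove the noise entry so that only real clusters remain in `counts`.
    n_noise = counts.pop(NOISE_LABEL, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "noise_fraction": n_noise / n_total if n_total else 0.0,
    }


# Toy labels standing in for a fitted clusterer's labels_ attribute.
summary = summarise_labels([0, 0, 1, -1, 1, 2, -1, 0])
```

A high noise fraction is one sign that min_cluster_size or min_samples may be set too conservatively for the data.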
- pykanto.signal.cluster.reduce_and_cluster(dataset: KantoData, ID: str, song_level: bool = False, min_sample: int = 10, kwargs_umap: Dict[str, Any] = {}, kwargs_hdbscan: Dict[str, Any] = {}) -> pd.DataFrame | None
- Parameters
dataset (KantoData) – Data to be used.
ID (str) – Grouping factor.
song_level (bool, optional) – Whether to use the average of all units in each vocalisation instead of all units. Defaults to False.
min_sample (int, optional) – Minimum number of vocalisations or units. Defaults to 10.
kwargs_umap (dict) – Dictionary of UMAP parameters.
kwargs_hdbscan (dict) – Dictionary of HDBSCAN parameters.
- Returns
Dataframe with columns ['vocalisation_key', 'ID', 'idx', 'umap_x', 'umap_y', 'auto_class'], or None if the sample size is too small.
- Return type
pd.DataFrame | None
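The effect of song_level and min_sample can be sketched with plain Python lists. This is only an illustration under stated assumptions: `prepare_samples` is a hypothetical helper, the input layout (a dict mapping vocalisation keys to lists of unit feature vectors) is invented for the example, and treating "too small" as strictly fewer than min_sample is a guess.

```python
from statistics import fmean
from typing import Dict, List, Optional

Vector = List[float]


def prepare_samples(
    units_by_vocalisation: Dict[str, List[Vector]],
    song_level: bool = False,
    min_sample: int = 10,
) -> Optional[List[Vector]]:
    """Return the vectors to reduce and cluster, or None if there are too few."""
    if song_level:
        # One vector per vocalisation: the element-wise mean of its units.
        samples = [
            [fmean(column) for column in zip(*units)]
            for units in units_by_vocalisation.values()
            if units
        ]
    else:
        # One vector per unit, pooled across all vocalisations.
        samples = [
            unit
            for units in units_by_vocalisation.values()
            for unit in units
        ]
    # Assumption: the sample is too small when it has fewer than min_sample rows.
    if len(samples) < min_sample:
        return None
    return samples


toy = {"a": [[0.0, 2.0], [2.0, 4.0]], "b": [[1.0, 1.0]]}
averaged = prepare_samples(toy, song_level=True, min_sample=2)
pooled = prepare_samples(toy, song_level=False, min_sample=3)
too_small = prepare_samples(toy, song_level=True, min_sample=3)
```

With song_level=True the toy data yields two rows (one per vocalisation), so the same min_sample threshold is harder to meet than when all three units are pooled.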
- pykanto.signal.cluster.reduce_and_cluster_parallel(dataset: KantoData, kwargs_umap: dict = {}, kwargs_hdbscan: dict = {}, min_sample: int = 10, num_cpus: float | None = None) -> pd.DataFrame | None
Parallel implementation of reduce_and_cluster().
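The general pattern of running a per-ID function in parallel and discarding IDs that return None can be sketched as below. The parallel backend pykanto uses is not stated in this page, so this sketch swaps in the standard library's ThreadPoolExecutor; `map_over_ids` and `toy_worker` are hypothetical names, and the worker is a stand-in rather than a real call to reduce_and_cluster().

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, str]


def map_over_ids(
    ids: Iterable[str],
    worker: Callable[[str], Optional[Row]],
    max_workers: Optional[int] = None,
) -> List[Row]:
    """Apply `worker` to every ID concurrently and keep non-None results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input IDs.
        results = list(pool.map(worker, ids))
    return [result for result in results if result is not None]


def toy_worker(ID: str) -> Optional[Row]:
    """Stand-in for a per-ID job; returns None to mimic a too-small sample."""
    if ID.startswith("small"):
        return None
    return {"ID": ID, "auto_class": f"{ID}_0"}


rows = map_over_ids(["B1", "small_B2", "B3"], toy_worker, max_workers=2)
```

In a real pipeline the per-ID results would be dataframes that are concatenated afterwards; the filtering step mirrors the documented None return for groups with too few samples.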