pykanto.dataset#

Build the main dataset class and its methods to visualise, segment and label animal vocalisations.

Classes

KantoData(DIRS[, parameters, random_subset, ...])

Main dataset class.

class pykanto.dataset.KantoData(DIRS: ProjDirs, parameters: None | Parameters = None, random_subset: None | int = None, overwrite_dataset: bool = False, overwrite_data: bool = False)[source]#

Main dataset class. See __init__ docstring.

__init__(DIRS: ProjDirs, parameters: None | Parameters = None, random_subset: None | int = None, overwrite_dataset: bool = False, overwrite_data: bool = False) None[source]#

Instantiates the main dataset class.

Note

If your dataset contains noise samples (useful when training a neural network), these should be labelled as ‘NOISE’ in the corresponding json file. i.e., json_file["label"] == 'NOISE'.

Parameters
  • DIRS (ProjDirs) – Project directory structure. Must contain a ‘SEGMENTED’ attribute pointing to a directory that contains the segmented data organised in two folders: ‘WAV’ and ‘JSON’ for audio and metadata, respectively.

  • parameters (Parameters, optional) – Parameters for dataset. Defaults to None. See Parameters.

  • random_subset (int, optional) – Size of random subset for testing. Defaults to None.

  • overwrite_dataset (bool, optional) – Whether to overwrite dataset if it already exists. Defaults to False.

  • overwrite_data (bool, optional) – Whether to overwrite spectrogram files if they already exist. Defaults to False.

Raises

FileExistsError – DATASET_ID already exists. You can overwrite it by setting overwrite_dataset=True
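Examples

A minimal instantiation sketch. Paths and the dataset name are illustrative; the import locations follow the pykanto package layout, but check ProjDirs and Parameters for their exact arguments:

```python
from pathlib import Path

from pykanto.utils.paths import ProjDirs
from pykanto.parameters import Parameters
from pykanto.dataset import KantoData

# Illustrative project layout: the 'SEGMENTED' directory must contain
# 'WAV' and 'JSON' subfolders with audio and metadata, respectively.
DIRS = ProjDirs(Path("my_project"), Path("my_project") / "raw_data", "BIG_BIRD_2024")

params = Parameters()  # default parameters; customise as needed
dataset = KantoData(DIRS, parameters=params, random_subset=100)
```

Setting random_subset here keeps only 100 randomly chosen vocalisations, which is handy when testing a pipeline before a full run.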

plot_summary(nbins: int = 50, variable: str = 'frequency') None[source]#

Plots a histogram and kernel density estimate of the distribution of vocalisation durations and frequencies.

Note

Durations and frequencies come from bounding boxes, not vocalisations. This function, along with plot_example(), is useful to spot any outliers and to quickly explore the full range of the data.

Parameters
  • nbins (int, optional) – Number of bins in histogram. Defaults to 50.

  • variable (str, optional) – One of ‘frequency’, ‘duration’, ‘sample_size’, ‘all’. Defaults to ‘frequency’.

Raises

ValueError – variable must be one of [‘frequency’, ‘duration’, ‘sample_size’, ‘all’]
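Examples

A short usage sketch, assuming dataset is an existing KantoData instance:

```python
# Histogram + KDE of vocalisation durations, with finer binning:
dataset.plot_summary(nbins=100, variable="duration")

# Plot every summary panel at once:
dataset.plot_summary(variable="all")
```

Any other value for variable raises the ValueError documented above.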

plot(key: str, segmented: bool = False, **kwargs) None[source]#

Plot a spectrogram from the dataset.

Parameters
  • key (str) – Key of the spectrogram.

  • segmented (bool, optional) – Whether to overlay onset/offset information. Defaults to False.

  • kwargs – passed to melspectrogram()

Examples

Plot the first 10 spectrograms in the vocalisations dataframe:

>>> for spec in dataset.data.index[:10]:
...     dataset.plot(spec)

sample_info() None[source]#

Prints the length of the KantoData and other information.

plot_example(n_songs: int = 1, query: str = 'maxfreq', order: str = 'descending', return_keys: bool = False, **kwargs) None | List[str][source]#

Show mel spectrograms for songs at the ends of the time or frequency distribution.

Note

Durations and frequencies come from bounding boxes, not vocalisations. This function, along with plot_summary(), is useful to spot any outliers, and to quickly explore the full range of data.

Parameters
  • n_songs (int, optional) – Number of songs to return. Defaults to 1.

  • query (str, optional) –

    What to query the database for. Defaults to ‘maxfreq’. One of:

    • ’duration’

    • ’maxfreq’

    • ’minfreq’

  • order (str, optional) –

    Defaults to ‘descending’. One of:

    • ’ascending’

    • ’descending’

  • return_keys (bool, optional) – Whether to return the keys of the displayed spectrograms. Defaults to False.

  • **kwargs – Keyword arguments to be passed to melspectrogram()

segment_into_units(overwrite: bool = False) None[source]#

Adds segment onsets, offsets, unit and silence durations to data.

Warning

If segmentation fails for a vocalisation it will be dropped from the database so that it doesn’t cause trouble downstream.

Parameters
  • overwrite (bool, optional) – Whether to overwrite unit segmentation information if it already exists. Defaults to False.

Raises

FileExistsError – The vocalisations in this dataset have already been segmented.
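Examples

A usage sketch, assuming dataset is an existing KantoData instance:

```python
# Segment every vocalisation into units (adds onsets, offsets,
# and unit/silence durations to dataset.data):
dataset.segment_into_units()

# Re-running once segmentation exists requires overwrite=True,
# otherwise a FileExistsError is raised:
dataset.segment_into_units(overwrite=True)
```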

subset(ids: List[str], new_dataset: str) pykanto.dataset.KantoData[source]#

Creates a new dataset containing a subset of the IDs present in the original dataset (e.g., different individual birds).

Note

  • Existing files common to both datasets (vocalisation spectrograms, unit spectrograms, wav files) will not be copied or moved from their original location.

  • Any newly generated files (e.g., by running a function that saves spectrograms to disk) will be exclusive to the new dataset.

Parameters
  • ids (List[str]) – IDs to keep.

  • new_dataset (str) – Name of new dataset.

Returns

A subset of the dataset.

Return type

KantoData
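Examples

A sketch, assuming dataset is an existing KantoData instance; the ID strings are illustrative:

```python
# Keep only two focal IDs in a new, separately saved dataset.
# Remember to assign the returned KantoData object:
focal = dataset.subset(ids=["bird_01", "bird_02"], new_dataset="FOCAL_PAIR")
```

Existing spectrogram and wav files shared with the source dataset are referenced in place, not copied (see the note above).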

save_to_disk(verbose: bool = True) None[source]#

Save dataset to disk.

to_csv(path: pathlib.Path, timestamp: bool = True) None[source]#

Output vocalisation (and, if present, unit) metadata in the dataset as a .csv file.

Parameters
  • path (Path) – Directory where to save the file(s).

  • timestamp (bool, optional) – Whether to add timestamp to file name. Defaults to True.
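Examples

A sketch, assuming dataset is an existing KantoData instance and the output directory is illustrative:

```python
from pathlib import Path

# Export vocalisation (and, if present, unit) metadata as .csv,
# without appending a timestamp to the file name:
dataset.to_csv(Path("exports"), timestamp=False)
```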

reload() pykanto.dataset.KantoData[source]#

Load the current dataset from disk. Remember to assign the output to a variable!

Warning

You will lose any changes that happened after the last time you saved the dataset to disk.

Returns

Last saved version of the dataset.

Return type

KantoData

Examples

>>> dataset = dataset.reload()
get_units(pad: bool = False) None[source]#

Creates and saves a dictionary containing spectrogram representations of the units or the average of the units present in the vocalisations of each individual ID in the dataset.

Parameters

pad (bool, optional) – Whether to pad spectrograms to the maximum length (per ID). Defaults to False.

Note

If pad = True, unit spectrograms are padded to the maximum duration found among units that belong to the same ID, not the dataset maximum.

Note

If each of your IDs (grouping factor, such as individual or population) has a lot of data and you are using a machine with very limited memory, you will likely run out of it when executing this function in parallel. If this is the case, you can limit the number of CPUs used at once by setting the num_cpus parameter to a smaller number.
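Examples

A sketch, assuming dataset is an existing, already segmented KantoData instance (num_cpus, mentioned above, is a setting in the dataset's Parameters):

```python
# Build unit spectrograms per ID, padded to the longest unit
# within each ID (not across the whole dataset):
dataset.get_units(pad=True)
```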

write_to_json() None[source]#

Write the dataset to the existing JSON files for each vocalisation.

cluster_ids(min_sample: int = 10, kwargs_umap: Dict[str, Any] = {}, kwargs_hdbscan: Dict[str, Any] = {}) None[source]#

Dimensionality reduction using UMAP + unsupervised clustering using HDBSCAN. This will fail if the sample size for an ID (grouping factor, such as individual or population) is too small.

Adds cluster membership information and 2D UMAP coordinates to data if song_level=True in parameters, else to units.

Parameters
  • min_sample (int) – Minimum sample size below which an ID will be skipped. Defaults to 10, but you can realistically expect good automatic results above ~100.

  • kwargs_umap (dict) – Dictionary of UMAP parameters.

  • kwargs_hdbscan (dict) – Dictionary of HDBSCAN parameters.
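Examples

A sketch, assuming dataset is an existing KantoData instance; the keyword dictionaries are passed through to the UMAP and HDBSCAN constructors, so any of their parameters can be supplied:

```python
# Reduce dimensionality with UMAP, then cluster with HDBSCAN.
# Skips any ID with fewer than 100 samples:
dataset.cluster_ids(
    min_sample=100,
    kwargs_umap={"n_neighbors": 15, "min_dist": 0.1},
    kwargs_hdbscan={"min_cluster_size": 20},
)
```

Cluster membership and 2D UMAP coordinates are then available in dataset.data (if song_level=True in parameters) or dataset.units.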

prepare_interactive_data(spec_length: float | None = None) None[source]#

Prepare lightweight representations of each vocalisation or unit (if song_level=False in parameters) for each individual in the dataset.

Parameters
  • spec_length (float, optional) – Duration, in seconds, of the spectrograms that will be produced. Shorter segments will be padded and longer ones trimmed. Defaults to the maximum note duration in the dataset.

open_label_app(palette: Tuple[str, ...] = ('#8dd3c7', '#ffffb3', '#bebada', '#fb8072', '#80b1d3', '#fdb462', '#b3de69', '#fccde5', '#d9d9d9', '#bc80bd', '#ccebc5', '#ffed6f'), max_n_labs: int = 10) None[source]#

Opens a new web browser tab with an interactive app that can be used to check the accuracy of the automatically assigned labels. Uses average units per vocalisation (if song_level=True in self.parameters) or individual units.

Note

Starting this app requires the output of prepare_interactive_data(); you will be prompted to choose whether to run it if it is missing.

Note

The app will try to create a palette based on the maximum number of categories in any individual in the dataset. You will get a ValueError if the palette you provided is not large enough (the default palette allows for a maximum of 12 categories). You can use your own palettes or import existing ones, see ‘Examples’ below.

Parameters
  • palette (List[str], optional) – A colour palette of length >= max_n_labs. Defaults to list(Set3_12).

  • max_n_labs (int, optional) – maximum number of classes expected in the dataset. Defaults to 10.

Examples

Allow for a maximum of 20 labels by setting max_n_labs = 20 and palette = list(Category20_20).

>>> from bokeh.palettes import Category20_20
>>> dataset.open_label_app(max_n_labs=20,
...     palette=list(Category20_20))