pykanto.dataset
Build the main dataset class and its methods to visualise, segment and label animal vocalisations.
Classes
KantoData – Main dataset class.
- class pykanto.dataset.KantoData(DIRS: ProjDirs, parameters: None | Parameters = None, random_subset: None | int = None, overwrite_dataset: bool = False, overwrite_data: bool = False)
Main dataset class. See the __init__ docstring.
- __init__(DIRS: ProjDirs, parameters: None | Parameters = None, random_subset: None | int = None, overwrite_dataset: bool = False, overwrite_data: bool = False) → None
Instantiates the main dataset class.
Note
If your dataset contains noise samples (useful when training a neural network), these should be labelled as ‘NOISE’ in the corresponding JSON file, i.e., json_file["label"] == 'NOISE'.
- Parameters
DIRS (ProjDirs) – Project directory structure. Must contain a ‘SEGMENTED’ attribute pointing to a directory that contains the segmented data organised in two folders: ‘WAV’ and ‘JSON’ for audio and metadata, respectively.
parameters (Parameters, optional) – Parameters for dataset. Defaults to None. See Parameters.
random_subset (int, optional) – Size of random subset for testing. Defaults to None.
overwrite_dataset (bool, optional) – Whether to overwrite dataset if it already exists. Defaults to False.
overwrite_data (bool, optional) – Whether to overwrite spectrogram files if they already exist. Defaults to False.
- Raises
FileExistsError – DATASET_ID already exists. You can overwrite it by setting overwrite_dataset=True.
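The directory layout and noise-labelling convention described above can be sketched with the standard library alone. This is an illustration, not pykanto code: only the ‘WAV’ and ‘JSON’ folder names and the "label" key come from the text, while the file names, the ‘SONG’ label and the helper is_noise are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def is_noise(json_path: Path) -> bool:
    """Return True if a metadata file marks its segment as noise (hypothetical helper)."""
    with open(json_path) as f:
        return json.load(f).get("label") == "NOISE"


# Build a throwaway SEGMENTED directory with the two expected subfolders.
segmented = Path(tempfile.mkdtemp()) / "SEGMENTED"
for sub in ("WAV", "JSON"):
    (segmented / sub).mkdir(parents=True)

# Hypothetical metadata files; only the "label" key is taken from the note above.
(segmented / "JSON" / "song_001.json").write_text(json.dumps({"label": "SONG"}))
(segmented / "JSON" / "noise_001.json").write_text(json.dumps({"label": "NOISE"}))

# Collect the names of the metadata files that are labelled as noise.
noise_files = sorted(p.name for p in (segmented / "JSON").iterdir() if is_noise(p))
print(noise_files)
```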
- plot_summary(nbins: int = 50, variable: str = 'frequency') → None
Plots a histogram and kernel density estimate of the distribution of vocalisation durations and frequencies.
Note
Durations and frequencies come from bounding boxes, not vocalisations. This function, along with show_extreme_songs(), is useful to spot any outliers, and to quickly explore the full range of data.
- Parameters
nbins (int, optional) – Number of bins in histogram. Defaults to 50.
variable (str, optional) – One of ‘frequency’, ‘duration’, ‘sample_size’, ‘all’. Defaults to ‘frequency’.
- Raises
ValueError – variable must be one of [‘frequency’, ‘duration’, ‘sample_size’, ‘all’].
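To make the roles of nbins and variable concrete, here is a minimal sketch of argument validation and equal-width binning in plain Python. It does not reproduce pykanto's plotting code; check_variable and histogram_counts are hypothetical helpers, and the kernel density estimate is left out.

```python
from typing import List

VALID_VARIABLES = ("frequency", "duration", "sample_size", "all")


def check_variable(variable: str) -> str:
    """Reject anything that is not one of the documented options."""
    if variable not in VALID_VARIABLES:
        raise ValueError(f"`variable` must be one of {list(VALID_VARIABLES)}")
    return variable


def histogram_counts(values: List[float], nbins: int = 50) -> List[int]:
    """Count values in `nbins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / nbins or 1.0
    counts = [0] * nbins
    for v in values:
        # The maximum value falls into the last bin rather than past it.
        idx = min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    return counts


durations = [1.0, 2.0, 3.0, 4.0]  # made-up durations in seconds
print(histogram_counts(durations, nbins=2))
```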
- plot(key: str, segmented: bool = False, **kwargs) → None
Plot a spectrogram from the dataset.
- Parameters
key (str) – Key of the spectrogram.
segmented (bool, optional) – Whether to overlay onset/offset information. Defaults to False.
**kwargs – Passed to melspectrogram().
Examples
Plot the first 10 spectrograms in the vocalisations dataframe:
>>> for spec in dataset.data.index[:10]:
...     dataset.plot(spec)
- plot_example(n_songs: int = 1, query: str = 'maxfreq', order: str = 'descending', return_keys: bool = False, **kwargs) → None | List[str]
Show mel spectrograms for songs at the ends of the time or frequency distribution.
Note
Durations and frequencies come from bounding boxes, not vocalisations. This function, along with plot_summary(), is useful to spot any outliers, and to quickly explore the full range of data.
- Parameters
n_songs (int, optional) – Number of songs to return. Defaults to 1.
query (str, optional) – What to query the database for. One of ‘duration’, ‘maxfreq’, ‘minfreq’. Defaults to ‘maxfreq’.
order (str, optional) – Sort order. One of ‘ascending’, ‘descending’. Defaults to ‘descending’.
return_keys (bool, optional) – Whether to return the keys of the displayed spectrograms. Defaults to False.
**kwargs – Keyword arguments to be passed to melspectrogram().
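The way query, order and n_songs interact can be sketched as a simple sort over bounding-box metadata. This is an assumption about the selection logic, not pykanto's implementation; the boxes dictionary and the extreme_keys helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical bounding-box metadata, keyed by vocalisation key.
boxes: Dict[str, Dict[str, float]] = {
    "a": {"duration": 1.2, "maxfreq": 7800.0, "minfreq": 2100.0},
    "b": {"duration": 3.4, "maxfreq": 9100.0, "minfreq": 1800.0},
    "c": {"duration": 0.6, "maxfreq": 6500.0, "minfreq": 2500.0},
}


def extreme_keys(
    data: Dict[str, Dict[str, float]],
    n_songs: int = 1,
    query: str = "maxfreq",
    order: str = "descending",
) -> List[str]:
    """Return the keys of the `n_songs` most extreme entries for `query`."""
    if query not in ("duration", "maxfreq", "minfreq"):
        raise ValueError("query must be one of 'duration', 'maxfreq', 'minfreq'")
    if order not in ("ascending", "descending"):
        raise ValueError("order must be one of 'ascending', 'descending'")
    # Sort keys by the queried value; 'descending' puts the largest first.
    ranked = sorted(data, key=lambda k: data[k][query], reverse=(order == "descending"))
    return ranked[:n_songs]


print(extreme_keys(boxes))
print(extreme_keys(boxes, n_songs=2, query="duration", order="ascending"))
```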
- segment_into_units(overwrite: bool = False) → None
Adds segment onsets, offsets, unit and silence durations to data.
Warning
If segmentation fails for a vocalisation it will be dropped from the database so that it doesn’t cause trouble downstream.
- Parameters
overwrite (bool, optional) – Whether to overwrite unit segmentation information if it already exists. Defaults to False.
- Raises
FileExistsError – The vocalisations in this dataset have already been segmented.
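As an illustration of how unit and silence durations relate to onsets and offsets, the sketch below derives both from two lists of times in seconds. It assumes that a silence is the gap between one unit's offset and the next unit's onset; the function name and the example values are hypothetical and not taken from pykanto.

```python
from typing import List, Tuple


def unit_and_silence_durations(
    onsets: List[float], offsets: List[float]
) -> Tuple[List[float], List[float]]:
    """Derive unit and silence durations (seconds) from onset/offset times."""
    if len(onsets) != len(offsets):
        raise ValueError("onsets and offsets must have the same length")
    # A unit lasts from its onset to its offset.
    unit_durations = [off - on for on, off in zip(onsets, offsets)]
    # A silence lasts from one unit's offset to the next unit's onset.
    silence_durations = [
        next_on - off for off, next_on in zip(offsets[:-1], onsets[1:])
    ]
    return unit_durations, silence_durations


onsets = [0.0, 0.5, 1.25]
offsets = [0.25, 1.0, 1.5]
units, silences = unit_and_silence_durations(onsets, offsets)
print(units, silences)
```

A vocalisation with n units yields n unit durations and n - 1 silence durations.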
- subset(ids: List[str], new_dataset: str) → pykanto.dataset.KantoData
Creates a new dataset containing a subset of the IDs present in the original dataset (e.g., different individual birds).
Note
Existing files common to both datasets (vocalisation spectrograms, unit spectrograms, wav files) will not be copied or moved from their original location.
Any newly generated files (e.g., by running a function that saves spectrograms to disk) will be exclusive to the new dataset.
- Parameters
dataset ([type]) – Source dataset.
ids (List[str]) – IDs to keep.
new_dataset (str) – Name of new dataset.
- Returns
A subset of the dataset.
- Return type
KantoData
- to_csv(path: pathlib.Path, timestamp: bool = True) → None
Output vocalisation (and, if present, unit) metadata in the dataset as a .csv file.
- Parameters
path (Path) – Directory where to save the file(s).
timestamp (bool, optional) – Whether to add timestamp to file name. Defaults to True.
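A rough sketch of what writing metadata to a directory with an optional timestamp in the file name might look like, using only the standard library. The file-name pattern, the csv_path and write_rows helpers and the example rows are assumptions made for illustration; pykanto's actual naming scheme may differ.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Dict, List


def csv_path(directory: Path, stem: str, timestamp: bool = True) -> Path:
    """Build an output path, optionally tagging the file name with the current time."""
    suffix = f"_{datetime.now():%Y%m%d_%H%M%S}" if timestamp else ""
    return directory / f"{stem}{suffix}.csv"


def write_rows(path: Path, rows: List[Dict[str, object]]) -> None:
    """Write a list of row dictionaries to `path`, using the first row's keys as header."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


out_dir = Path(tempfile.mkdtemp())
rows = [{"key": "a", "duration": 1.2}, {"key": "b", "duration": 3.4}]  # made-up metadata
plain = csv_path(out_dir, "vocalisations", timestamp=False)
write_rows(plain, rows)
print(plain.name)
```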
- reload() → pykanto.dataset.KantoData
Load the current dataset from disk. Remember to assign the output to a variable!
Warning
You will lose any changes that happened after the last time you saved the dataset to disk.
- Returns
Last saved version of the dataset.
- Return type
KantoData
Examples
>>> dataset = dataset.reload()
- get_units(pad: bool = False) → None
Creates and saves a dictionary containing spectrogram representations of the units or the average of the units present in the vocalisations of each individual ID in the dataset.
- Parameters
pad (bool, optional) – Whether to pad spectrograms to the maximum length (per ID). Defaults to False.
Note
If pad = True, unit spectrograms are padded to the maximum duration found among units that belong to the same ID, not the dataset maximum.
Note
If each of your IDs (grouping factor, such as individual or population) has lots of data and you are using a machine with very limited memory, you will likely run out of memory when executing this function in parallel. If this is the case, you can limit the number of CPUs used at once by setting the num_cpus parameter to a smaller number.
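The difference between padding to the per-ID maximum and padding to the dataset maximum can be illustrated with nested lists standing in for spectrograms. The sketch assumes right-padding with zeros along the time axis, which is an illustrative choice rather than a description of pykanto's behaviour; pad_to_width, pad_per_id and the example data are hypothetical.

```python
from typing import Dict, List

Spectrogram = List[List[float]]  # rows = frequency bins, columns = time frames


def pad_to_width(spec: Spectrogram, width: int, fill: float = 0.0) -> Spectrogram:
    """Right-pad every row of `spec` with `fill` until it has `width` frames."""
    return [row + [fill] * (width - len(row)) for row in spec]


def pad_per_id(units_by_id: Dict[str, List[Spectrogram]]) -> Dict[str, List[Spectrogram]]:
    """Pad each ID's unit spectrograms to that ID's own maximum width."""
    padded = {}
    for id_, specs in units_by_id.items():
        max_width = max(len(spec[0]) for spec in specs)
        padded[id_] = [pad_to_width(spec, max_width) for spec in specs]
    return padded


# Two made-up IDs with units of different lengths (2 frequency bins each).
units_by_id = {
    "bird_1": [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]],
    "bird_2": [[[1.0] * 5, [1.0] * 5]],
}
padded = pad_per_id(units_by_id)
print({id_: [len(spec[0]) for spec in specs] for id_, specs in padded.items()})
```

Here the units of bird_1 are padded to 3 frames even though the longest unit in the whole dataset has 5.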
- cluster_ids(min_sample: int = 10, kwargs_umap: Dict[str, Any] = {}, kwargs_hdbscan: Dict[str, Any] = {}) → None
Dimensionality reduction using UMAP + unsupervised clustering using HDBSCAN. This will fail if the sample size for an ID (grouping factor, such as individual or population) is too small.
Adds cluster membership information and 2D UMAP coordinates to data if song_level=True in parameters, else to units.
- Parameters
min_sample (int) – Minimum sample size below which an ID will be skipped. Defaults to 10, but you can realistically expect good automatic results above ~100.
kwargs_umap (dict) – Dictionary of UMAP parameters.
kwargs_hdbscan (dict) – Dictionary of HDBSCAN parameters.
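The effect of min_sample can be sketched as a simple count-and-filter step over the ID column. This only illustrates the documented rule that IDs with fewer than min_sample samples are skipped; ids_to_cluster and the example IDs are hypothetical, and the UMAP and HDBSCAN steps themselves are not shown.

```python
from collections import Counter
from typing import Iterable, List


def ids_to_cluster(ids: Iterable[str], min_sample: int = 10) -> List[str]:
    """Return the IDs that have at least `min_sample` rows; the rest are skipped."""
    counts = Counter(ids)
    return sorted(id_ for id_, n in counts.items() if n >= min_sample)


# Made-up ID column: one ID per vocalisation (or per unit).
id_column = ["bird_1"] * 12 + ["bird_2"] * 3 + ["bird_3"] * 10
print(ids_to_cluster(id_column))
```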
- prepare_interactive_data(spec_length: float | None = None) → None
Prepare lightweight representations of each vocalisation or unit (if song_level=False in parameters) for each individual in the dataset.
- Parameters
spec_length (float, optional) – In seconds, duration of spectrogram that will be produced. Shorter segments will be padded, longer segments trimmed. Defaults to maximum note duration in the dataset.
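Padding shorter segments and trimming longer ones to a fixed duration can be sketched as follows. The conversion from seconds to frames assumes a sample rate and hop length, both of which are hypothetical arguments here, as are the helper names; padding on the right with zeros is also an assumption.

```python
from typing import List

Spectrogram = List[List[float]]  # rows = frequency bins, columns = time frames


def seconds_to_frames(seconds: float, sr: int, hop_length: int) -> int:
    """Convert a duration in seconds to a number of spectrogram frames."""
    return int(round(seconds * sr / hop_length))


def fit_to_frames(spec: Spectrogram, n_frames: int, fill: float = 0.0) -> Spectrogram:
    """Trim rows longer than `n_frames` and right-pad shorter ones with `fill`."""
    return [row[:n_frames] + [fill] * max(0, n_frames - len(row)) for row in spec]


n_frames = seconds_to_frames(1.0, sr=16000, hop_length=160)
short = [[1.0] * 40, [1.0] * 40]    # made-up segment shorter than the target
long = [[1.0] * 250, [1.0] * 250]   # made-up segment longer than the target
print(len(fit_to_frames(short, n_frames)[0]), len(fit_to_frames(long, n_frames)[0]))
```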
- open_label_app(palette: typing.Tuple[typing.Literal[<class 'str'>], ...] = ('#8dd3c7', '#ffffb3', '#bebada', '#fb8072', '#80b1d3', '#fdb462', '#b3de69', '#fccde5', '#d9d9d9', '#bc80bd', '#ccebc5', '#ffed6f'), max_n_labs: int = 10) → None
Opens a new web browser tab with an interactive app that can be used to check the accuracy of the automatically assigned labels. Uses average units per vocalisation (if song_level=True in self.parameters) or individual units.
Note
Starting this app requires the output of prepare_interactive_data(); you will be prompted to choose whether to run it if it is missing.
Note
The app will try to create a palette based on the maximum number of categories in any individual in the dataset. You will get a ValueError if the palette you provided is not large enough (the default palette allows for a maximum of 12 categories). You can use your own palettes or import existing ones, see ‘Examples’ below.
- Parameters
palette (List[str], optional) – A colour palette of length >= max_n_labs. Defaults to list(Set3_12).
max_n_labs (int, optional) – Maximum number of classes expected in the dataset. Defaults to 10.
Examples
Allow for a maximum of 20 labels by setting max_n_labs = 20 and palette = list(Category20_20).
>>> from bokeh.palettes import Category20_20
>>> dataset.open_label_app(max_n_labs=20,
...                        palette=list(Category20_20))
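The palette-size requirement described in the note above amounts to a length check. The sketch below is an assumption about how such a check could look, not pykanto's code: check_palette is a hypothetical helper, and SET3_12 simply copies the default colours from the signature.

```python
from typing import Sequence

# The 12 default colours listed in the open_label_app signature.
SET3_12 = (
    "#8dd3c7", "#ffffb3", "#bebada", "#fb8072", "#80b1d3", "#fdb462",
    "#b3de69", "#fccde5", "#d9d9d9", "#bc80bd", "#ccebc5", "#ffed6f",
)


def check_palette(palette: Sequence[str], max_n_labs: int = 10) -> None:
    """Raise ValueError if `palette` has fewer colours than `max_n_labs`."""
    if len(palette) < max_n_labs:
        raise ValueError(
            f"Palette has {len(palette)} colours but max_n_labs is {max_n_labs}"
        )


check_palette(SET3_12, max_n_labs=10)  # passes: 12 colours >= 10 labels

try:
    check_palette(SET3_12, max_n_labs=20)  # fails: 12 colours < 20 labels
    palette_ok = True
except ValueError:
    palette_ok = False
print(palette_ok)
```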