Segmenting vocalisations

‘Vocalisation’ is a common but slightly vague term. In this guide, I use it to refer to a single song or call, while ‘unit’ refers to a single note or syllable.

Segmenting vocalisations into units

Following the example in the previous section, we can first create a project structure:

from pathlib import Path
import pkg_resources
from pykanto.utils.paths import ProjDirs, get_file_paths, get_wavs_w_annotation
from pykanto.dataset import KantoData
from pykanto.parameters import Parameters

# Prepare a project: use the sample Bengalese finch data bundled with pykanto
DATASET_ID = "BENGALESE_FINCH"
DATA_PATH = Path(pkg_resources.resource_filename("pykanto", "data"))
PROJECT_ROOT = DATA_PATH.parent
RAW_DATA = DATA_PATH / "raw" / DATASET_ID

# Build the project directory structure (mkdir=True creates missing folders)
DIRS = ProjDirs(PROJECT_ROOT, RAW_DATA, DATASET_ID, mkdir=True)
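ProjDirs stores the paths the dataset will use and, with mkdir=True, creates any folders that do not yet exist. A quick way to check what was set up is simply to print the object:

# Print the project directory structure that was just created
print(DIRS)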

Then, we can set audio parameters and load the audio files into a KantoData object:

params = Parameters(
    # Spectrogram settings
    sr=32000,
    window_length=512,
    hop_length=128,
    fft_size=2048,
    top_dB=120,
    num_mel_bins=224,
    # Bandpass filter limits (Hz)
    lowcut=500,
    highcut=11000,
    # Noise reduction and unit segmentation settings
    dereverb=False,
    silence_threshold=0.1,
)

dataset = KantoData(
    DIRS,
    parameters=params,
    overwrite_dataset=True,
    overwrite_data=True,
)
2023-02-28 11:38:21,187	INFO worker.py:1519 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8266 
Done
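Before plotting anything, it can be useful to take a quick look at what was loaded. The dataset keeps its vocalisation-level metadata in a pandas DataFrame under dataset.data, with one row per vocalisation; the exact columns depend on your audio and annotations, so treat this as a minimal sketch rather than a fixed schema:

# Inspect the vocalisation-level metadata (one row per vocalisation)
print(len(dataset.data))
print(dataset.data.head())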

You can check any vocalisation in the newly created dataset by using the plot method:

# Plot the first vocalisation in the dataset
key = dataset.data.index[0]
dataset.plot(key, max_lenght=5)
[Figure: spectrogram of the first vocalisation in the dataset]
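To skim through several vocalisations at once you can loop over the first few keys in the index; this is just a convenience sketch built on the same plot call used above:

# Plot the first three vocalisations in the dataset
for key in dataset.data.index[:3]:
    dataset.plot(key)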

Finally, to segment the vocalisations into units, we do the following:

dataset.segment_into_units()
Found and segmented 268 units.
# Plot another vocalisation in the dataset, now segmented:
key = dataset.data.index[1]
dataset.plot(key, segmented=True, max_lenght=5)
[Figure: spectrogram of a vocalisation with the segmented units overlaid]

If you want to access the unit onsets or offsets directly, you can retrieve them from the dataset:

print(f"First 10 note onset times: \n{dataset.data.loc[key, 'onsets'][:10]}")
First 10 note onset times: 
[0.516 0.68  0.836 1.02  1.204 1.376 1.536 1.712 1.876 2.048]
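Onsets and offsets are stored as arrays of times in seconds, so per-unit durations can be computed directly. The snippet below assumes that the 'offsets' column is populated alongside 'onsets' after segmentation:

# Compute the duration of each unit in this vocalisation (in seconds)
durations = dataset.data.loc[key, 'offsets'] - dataset.data.loc[key, 'onsets']
print(f"First 10 unit durations: \n{durations[:10]}")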

Note:

This method works reasonably well, especially after some fine-tuning of the parameters (see Parameters). However, it will inevitably fail in difficult cases, for example if the signal-to-noise ratio is very poor or there is a lot of amplitude modulation within notes. In such cases you might have to resort to manual segmentation, using a tool such as Sonic Visualiser, or to training a model that works with your specific species and recording conditions. Here is a very good option if you want to do the latter:

TweetyNet: a single neural network model that learns how to segment spectrograms of birdsong into annotated syllables. elifesciences.org/articles/63853
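If you do end up segmenting by hand, the resulting annotations are easy to bring back into a Python workflow. The sketch below assumes a hypothetical CSV export with 'start' and 'end' columns in seconds, one row per unit, such as you might produce from an annotation layer in Sonic Visualiser; the file name and column names are illustrative, not part of pykanto's API:

# Load manually created unit boundaries from a CSV (hypothetical file and columns)
import pandas as pd

annotations = pd.read_csv("manual_units.csv")  # columns: start, end (in seconds)
onsets = annotations["start"].to_numpy()
offsets = annotations["end"].to_numpy()
print(f"Loaded {len(onsets)} manually segmented units")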