The KantoData dataset

Useful attributes

KantoData datasets have a number of attributes. These are the ones you are most likely to access:

| Attribute | Description |
|---|---|
| KantoData.data | Dataframe containing information about each vocalization |
| KantoData.files | List of files associated with the dataset |
| KantoData.parameters | A Parameters instance containing the params used to generate the dataset |
| KantoData.metadata | A dictionary of metadata associated with the dataset |
| KantoData.units | Dataframe of single sound units in dataset, created after running KantoData.cluster_ids() if song_level is set to False in the parameters |

Common operations with datasets


| Method | Description |
|---|---|
| dataset = load_dataset() | Load an existing dataset |
| dataset.save_to_disk() | Save an existing dataset |
| dataset.to_csv() | Save a dataset to csv |
| dataset.write_to_json() | Save new metadata to JSON files |

You can get some basic information about the contents of the dataset by running:

dataset.sample_info()
dataset.data['ID'].value_counts()
Total length: 20
Unique IDs: 2
B32     11
SW83     9
Name: ID, dtype: int64
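
The numbers above are plain pandas summaries of the `ID` column. As a rough sketch of what they represent, the snippet below reproduces them on a toy dataframe standing in for `dataset.data`. The toy data is hypothetical and only the `ID` column name is taken from the output above. This is not a description of how `sample_info()` is implemented.

```python
import pandas as pd

# Hypothetical stand-in for dataset.data: 11 vocalisations from B32, 9 from SW83
toy_data = pd.DataFrame({"ID": ["B32"] * 11 + ["SW83"] * 9})

total_length = len(toy_data)            # number of rows (vocalisations)
unique_ids = toy_data["ID"].nunique()   # number of distinct IDs
counts = toy_data["ID"].value_counts()  # vocalisations per ID, most frequent first

print(f"Total length: {total_length}")
print(f"Unique IDs: {unique_ids}")
print(counts)
```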

KantoData.data and KantoData.units are pandas.DataFrame instances. I chose this format because it is very flexible and most users are already familiar with it. You can query and modify them as you would any other pandas dataframe. For example, to see the first three rows and a subset of columns:

dataset.data[['date', 'recordist', 'unit_durations']].head(3)
date recordist unit_durations
2021-B32-0415_05-11 2021-04-15 Nilo Merino Recalde [0.0986848072562358, 0.10448979591836727, 0.10...
2021-B32-0415_05-15 2021-04-15 Nilo Merino Recalde [0.1102947845804989, 0.09868480725623585, 0.12...
2021-B32-0415_05-21 2021-04-15 Nilo Merino Recalde [0.1219047619047619, 0.10448979591836738, 0.14...
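
Filtering rows and adding derived columns works the same way. The sketch below uses a small, made-up dataframe in place of `dataset.data`: the index labels and values are hypothetical, and only the `ID`, `date` and `unit_durations` column names come from the examples above.

```python
import pandas as pd

# Hypothetical stand-in for dataset.data, with one list of unit durations per row
toy_data = pd.DataFrame(
    {
        "ID": ["B32", "B32", "SW83"],
        "date": ["2021-04-15", "2021-04-15", "2021-04-16"],
        "unit_durations": [
            [0.125, 0.125, 0.25],
            [0.125, 0.25],
            [0.25, 0.25, 0.125, 0.125],
        ],
    },
    index=["voc_a", "voc_b", "voc_c"],
)

# Query: keep only the vocalisations of one individual
b32 = toy_data.query("ID == 'B32'")

# Modify: add columns derived from the per-row lists
toy_data["n_units"] = toy_data["unit_durations"].apply(len)
toy_data["total_unit_duration"] = toy_data["unit_durations"].apply(sum)

print(b32[["date", "unit_durations"]])
print(toy_data[["ID", "n_units", "total_unit_duration"]])
```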

Or to extract the length of each vocalisation and calculate inter-onset intervals:

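import numpy as np

# Offset of the last unit in each vocalisation, used here as its duration
last_offsets = dataset.data["offsets"].apply(lambda x: x[-1]).to_list()
# Inter-onset intervals: differences between consecutive unit onsets
iois = dataset.data["onsets"].apply(np.diff)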
print("Vocalisation durations: ", [f"{x:.2f}" for x in last_offsets[:5]])
print("IOIs: ", [f"{x:.2f}" for x in iois.iloc[0][:5]])
Vocalisation durations:  ['2.12', '1.99', '2.16', '2.32', '1.81']
IOIs:  ['0.22', '0.23', '0.25', '0.24', '0.26']
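
To make the two quantities concrete, here is a minimal self-contained sketch on made-up onset and offset times. The toy values and index labels are hypothetical, and only the `onsets` and `offsets` column names come from the code above. It also shows one possible variant: if the first unit does not start at time zero, subtracting the first onset gives the span from first onset to last offset rather than the last offset alone.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataset.data: unit onset/offset times in seconds
toy_data = pd.DataFrame(
    {
        "onsets": [np.array([0.0, 0.25, 0.5]), np.array([0.5, 1.0, 1.75])],
        "offsets": [np.array([0.125, 0.375, 0.75]), np.array([0.75, 1.25, 2.0])],
    },
    index=["voc_a", "voc_b"],
)

# Inter-onset intervals: time between the starts of consecutive units
iois = toy_data["onsets"].apply(np.diff)

# Span from the first onset to the last offset of each vocalisation
first_onsets = toy_data["onsets"].apply(lambda x: x[0])
last_offsets = toy_data["offsets"].apply(lambda x: x[-1])
durations = last_offsets - first_onsets

print(iois)
print(durations)
```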