{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# The KantoData dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "13b1f7d522f34950965738258402ae21",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading JSON files: 0%| | 0/20 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2023-02-09 14:10:37,986\tINFO services.py:1456 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3ced60ffb5b44b3eb734f00888efa919",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Preparing spectrograms: 0%| | 0/10 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "423786351412488b83ad1cb11393d8c2",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Finding units in vocalisations: 0%| | 0/10 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found and segmented 169 units.\n",
"Saved dataset to /home/nilomr/projects/pykanto/pykanto/data/datasets/GREAT_TIT/GREAT_TIT.db\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "159fc62be68546d8b0f725682c6d9eef",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Adding new metadata to .json files in /home/nilomr/projects/pykanto/pykanto/data/segmented/great_tit/JSON: …"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from pathlib import Path\n",
"from pykanto.dataset import KantoData\n",
"from pykanto.parameters import Parameters\n",
"from pykanto.utils.paths import ProjDirs, pykanto_data\n",
"from pykanto.utils.io import load_dataset\n",
"import numpy as np\n",
"\n",
"DATASET_ID = \"GREAT_TIT\"\n",
"DIRS = pykanto_data(dataset=DATASET_ID)\n",
"\n",
"params = Parameters(dereverb=False, verbose=False)\n",
"dataset = KantoData(\n",
" DIRS,\n",
" parameters=params,\n",
" overwrite_dataset=True,\n",
" overwrite_data=True\n",
")\n",
"\n",
"dataset.segment_into_units(overwrite=True)\n",
"dataset.save_to_disk()\n",
"dataset = load_dataset(dataset.DIRS.DATASET, DIRS)\n",
"dataset.to_csv(dataset.DIRS.DATASET.parent)\n",
"dataset.write_to_json()\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Useful attributes"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"{py:class}`~pykanto.dataset.KantoData` datasets contain a series of attributes: these are some of the ones you are most likely to access:\n",
"\n",
"| Attribute | Description |\n",
"|-----------|-------------|\n",
"| `KantoData.data` | Dataframe containing information about each vocalization |\n",
"| `KantoData.files` | List of files associated with the dataset |\n",
"| `KantoData.parameters` | A {py:class}`~pykanto.parameters.Parameters` instance containing the params used to generate the dataset |\n",
"| `KantoData.metadata` | A dictionary of metadata associated with the dataset |\n",
"| `KantoData.units` | Dataframe of single sound units in dataset, created after running `KantoData.cluster_ids()` if `song_level` is set to `False` in the parameters |"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Common operations with datasets\n",
"
"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"| Method | Description |\n",
"| --- | --- |\n",
"|```dataset = load_dataset()``` | Load an existing dataset |\n",
"|```dataset.save_to_disk()``` | Save an existing dataset | \n",
"|```dataset.to_csv()``` | Save a dataset to csv |\n",
"|```dataset.write_to_json()``` | Save new metadata to JSON files |"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can get some basic information about the contents of the dataset by running:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total length: 20\n",
"Unique IDs: 2\n"
]
},
{
"data": {
"text/plain": [
"B32 11\n",
"SW83 9\n",
"Name: ID, dtype: int64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.sample_info()\n",
"dataset.data['ID'].value_counts()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`KantoData.data` and `KantoData.units` are {py:class}`pandas.DataFrame`\n",
"instances: I have chosen this format because it is a very flexible and most users are\n",
"already familiar with it. You can query and modify it as you would any other\n",
"pandas dataframe. For example, to see the first three rows and a subset of columns:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | date | \n", "recordist | \n", "unit_durations | \n", "
---|---|---|---|
2021-B32-0415_05-11 | \n", "2021-04-15 | \n", "Nilo Merino Recalde | \n", "[0.0986848072562358, 0.10448979591836727, 0.10... | \n", "
2021-B32-0415_05-15 | \n", "2021-04-15 | \n", "Nilo Merino Recalde | \n", "[0.1102947845804989, 0.09868480725623585, 0.12... | \n", "
2021-B32-0415_05-21 | \n", "2021-04-15 | \n", "Nilo Merino Recalde | \n", "[0.1219047619047619, 0.10448979591836738, 0.14... | \n", "