utils module

This module provides utility functions for printing timestamped messages, reporting errors, and downloading and processing the dataset.

utils.download_dataset(url: str, outpath: str)

This function downloads a dataset from a given URL and extracts it to the specified output path. The dataset is assumed to be a zip archive that itself contains multiple zip files, one per subset of the data. The function extracts every inner archive and deletes the archives once extraction is complete.

Parameters:
  • url (str) – URL of the dataset to download

  • outpath (str) – output path where the dataset will be extracted
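The nested-zip handling can be sketched as follows. `extract_nested_zips` is a hypothetical helper, not the module's actual code, and the download step itself (e.g. via `urllib.request.urlretrieve`) is omitted:

```python
import os
import zipfile

def extract_nested_zips(archive_path: str, outpath: str) -> None:
    # Unpack the outer archive, then unpack and delete every inner
    # .zip it contained, mirroring the behaviour described above.
    os.makedirs(outpath, exist_ok=True)
    with zipfile.ZipFile(archive_path) as outer:
        outer.extractall(outpath)
    for name in os.listdir(outpath):
        if name.endswith(".zip"):
            inner_path = os.path.join(outpath, name)
            with zipfile.ZipFile(inner_path) as inner:
                inner.extractall(outpath)
            os.remove(inner_path)  # delete inner archives after extraction
```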

utils.excerr(msg: str)

Print an error message with the current date and time.

Parameters:
  • msg (str) – error message to print

utils.printt(msg: str)

Print a message with the current date and time.

Parameters:
  • msg (str) – message to print
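Both helpers can be sketched as below. The timestamp format, and the choice to send errors to stderr, are assumptions for illustration, not the module's actual behaviour:

```python
import sys
from datetime import datetime

def printt(msg: str) -> None:
    # Prefix the message with the current date and time.
    print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] {msg}")

def excerr(msg: str) -> None:
    # Same timestamp prefix, but marked as an error and sent to stderr.
    print(f"[{datetime.now():%Y-%m-%d %H:%M:%S}] ERROR: {msg}", file=sys.stderr)
```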

utils.process_dataset(inpath: str, outpath: str, remove_source: bool = False)

This function processes a dataset from a given input path and saves the result to the specified output path. The dataset is assumed to contain multiple subsets of data, each with multiple experiments (e.g., Learning Set, Test Set…). The function iterates over every subset and experiment, using the to_parquet function to convert each experiment's CSV files to parquet files, and can optionally remove the source files after the conversion.

Parameters:
  • inpath (str) – input path of the dataset to process

  • outpath (str) – output path where the processed data will be saved

  • remove_source (bool) – boolean flag that indicates whether to remove the source files after the conversion. Default is False

utils.process_features(inpath: str, outpath: str, conf_mfcc: dict, skip_full: bool = True)

This function extracts MFCC features from a dataset of sensor data using the get_mfcc function. The dataset is assumed to contain multiple subsets of data, each with multiple experiments. For every experiment, the function reads the parquet files of the sensor data, applies get_mfcc to obtain a data frame of MFCC features, and saves that data frame as a parquet file under the output path. It can optionally skip the subset containing the full dataset, i.e., the one with the test ground truth.

Parameters:
  • inpath (str) – input path of the dataset to process

  • outpath (str) – output path where the processed data will be saved

  • conf_mfcc (dict) – dictionary of librosa mfcc function parameters. For parameters setting please refer to the official documentation: https://librosa.org/doc/latest/generated/librosa.feature.mfcc.html

  • skip_full (bool) – boolean flag that indicates whether to skip the subset containing the test ground truth. Default is True.
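A possible conf_mfcc dictionary, built from standard keyword arguments of librosa.feature.mfcc. The values below are illustrative, not the project's defaults:

```python
# Keyword arguments forwarded to librosa.feature.mfcc; sr and n_mfcc are
# direct parameters, while n_fft and hop_length are passed through to the
# underlying mel spectrogram. Values are illustrative only.
conf_mfcc = {
    "sr": 25600,        # sampling rate of the sensor signal, in Hz
    "n_mfcc": 20,       # number of MFCC coefficients to return
    "n_fft": 2048,      # FFT window length, in samples
    "hop_length": 512,  # hop between successive frames, in samples
}
```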

utils.process_sample(sample_path: str, experiment: str) → DataFrame

This function processes a sample of data from a given experiment and returns a data frame of the processed data. The sample is assumed to be a CSV file whose first four columns contain date and time information; only the microseconds (fourth column) are relevant here, so the function drops the other timestamp columns and sets the elapsed time in seconds as the index. The remaining two columns are the x and y coordinates of the data. The function also adds a column with the experiment name, which is useful for the partitioning utilities of the parquet format, and returns the resulting data frame.

Parameters:
  • sample_path (str) – path of the sample CSV file to process

  • experiment (str) – name of the experiment that the sample belongs to

Return type:

pd.DataFrame

Returns:

dataframe of the processed sample data, with time as the index and x, y, and experiment as the columns
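The steps above can be sketched as follows. The column names and the way elapsed seconds are derived from the microseconds column are assumptions for illustration, not the module's actual code:

```python
import pandas as pd

def process_sample(sample_path: str, experiment: str) -> pd.DataFrame:
    # The raw CSV has no header row; column names here are illustrative.
    df = pd.read_csv(
        sample_path,
        header=None,
        names=["hour", "minute", "second", "microsecond", "x", "y"],
    )
    # Elapsed time in seconds, derived from the microseconds column only
    # (one plausible reading of the description above).
    df["time"] = (df["microsecond"] - df["microsecond"].iloc[0]) / 1e6
    df = df.drop(columns=["hour", "minute", "second", "microsecond"]).set_index("time")
    # The experiment column enables parquet partitioning downstream.
    df["experiment"] = experiment
    return df
```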

utils.set_environment(seed: int)

Set up the execution environment by tuning PyTorch parameters and by setting a random seed for reproducibility.

Parameters:
  • seed (int) – random seed to set
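An illustrative sketch of such a setup, not the module's code: it seeds the Python and NumPy RNGs, and guards the PyTorch-specific tuning so the sketch degrades gracefully when torch is not installed:

```python
import random

import numpy as np

def set_environment(seed: int) -> None:
    # Seed every RNG the pipeline may rely on.
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        # Trade speed for reproducible cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # without torch, only the Python/NumPy RNGs are seeded
```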

utils.to_parquet(experiment_path: str, outdir: str, experiment: str)

This function converts the CSV files of a given experiment to parquet files and saves them to the specified output directory. The CSV files are assumed to contain data from different sensors (e.g. accelerometer, temperature) and to be named accordingly (e.g. acc_00001.csv, temp_00001.csv). Each CSV file is processed with the process_sample function, and the resulting data frame is saved as a parquet file partitioned by the experiment name.

Warning: temperature measurements are dropped for simplicity!

Parameters:
  • experiment_path (str) – path of the directory that contains the CSV files from the experiment

  • outdir (str) – output directory where the parquet files will be saved

  • experiment (str) – name of the experiment that the CSV files belong to