Machine Learning on Sound and Audio data | Towards Data Science

In this article, you will learn the basics of doing Machine Learning on sound data. Sound data (often called audio data) is a data type that is not very intuitive to work with. By the end of this article, you will understand what sound data looks like, how to prepare it for Machine Learning, and how to build a genre classification model with a CNN.

Sound data

The big difficulty when starting with sound is that, unlike tabular data or images, it is not easy to represent in a standard matrix format.

As you may or may not know, images can be easily represented as matrices, because they are based on pixels. Each pixel has a value that indicates its intensity: a single grayscale value for black-and-white images, or separate intensities for red, green, and blue in color images.
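For instance, a tiny grayscale image is just a 2D NumPy array of pixel intensities, and a color image adds a third axis for the channels (an illustrative sketch):

```python
import numpy as np

# a tiny 2x2 grayscale image: one intensity value per pixel
gray = np.array([[0, 255],
                 [128, 64]])

# a tiny 2x2 color image: one intensity per channel (red, green, blue)
color = np.zeros((2, 2, 3), dtype=np.uint8)

print(gray.shape, color.shape)  # (2, 2) and (2, 2, 3)
```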

Sound, on the other hand, is a much more complex data format to work with: it is a signal sampled over time, and the patterns that matter, such as frequencies, are not directly visible in the raw waveform.

An Example Sound Data Set for Genre Classification

Before going into depth, let’s first introduce an example data set. As we are working with songs, it is difficult to obtain a dataset that is guaranteed to be free of copyright problems. For this article, I have therefore decided to create a small dataset myself, using only beats from Uppbeat, a website for copyright-free music.

I do not want to re-distribute the songs, but if you want to follow along, you can easily create a free account and download a number of songs from the two categories.

How to organize your data and environment for working with sound

For following along in your own environment, I advise you to create a folder called tracks, in which you create two folders: one per genre, each containing the mp3 files you downloaded for that genre.

For this tutorial, I advise working with Jupyter Notebooks, as they have some great features for playing sound directly from a notebook. If you want to avoid environment or package installation problems, you may want to work on Google Colab: a great, free, online notebook environment.

Genre Classification with Python

In the described data set, you have tracks of two genres: “Silly Energy” and “Dramatic”. The goal of this article is to give you the basic steps to build a genre classification model.

The sounds in those two genres should be quite different, and it should therefore be relatively easy to train a Machine Learning model that can separate the two classes from each other.

Content-Based vs Metadata-Based Approach

In this tutorial, we will build a ‘real’ sound-based machine learning model: we will use the numerical transcription of the sound itself as input to our Machine Learning models.

However, I want to add that it is also possible to build genre classification models using metadata of the sound rather than the actual sound. In this case, you could use features like the tempo, the duration, the artist, or the release year of a track.

Although this can be a great approach as well, it is outside the scope of this article.

Preparing sound data for Machine Learning

Let’s start using Python and dive into the practical side now. By now, you should already have the tracks in your environment. Let’s now start preparing them for Machine Learning. This is the tricky part.

As sound, or music, is something that we hear, it is difficult to imagine how to make it digital. Yet, digital music is everywhere and it is a problem that has been solved well. If you want more background information on this, you can check out this deep dive article into Fourier Transforms or this article on how music is converted into numbers.

In this article, we will use the librosa library: a great library for working with sound and music in Python.

Load and play a song using Jupyter Notebook

You can use the following code to import an example song and listen to it directly in the notebook.
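A minimal sketch (the file path is a placeholder for one of your downloaded tracks):

```python
import IPython.display as ipd

# renders an audio player for the track directly in the notebook
ipd.Audio("tracks/silly_energy/example_track.mp3")
```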

You can also see what the song looks like numerically by importing it with librosa.
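A sketch, again with a placeholder path:

```python
import librosa

# librosa.load decodes the mp3 into a 1D NumPy array of samples
# and also returns the sampling rate (22050 Hz by default)
audio, sr = librosa.load("tracks/silly_energy/example_track.mp3")
print(audio)
print(audio.shape, sr)
```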

You will obtain a NumPy array that represents the music.

Cutting the songs into equally long pieces

To prepare the sound data for Machine Learning, we have just generated numerical arrays that represent the music using the librosa load function. Now that has been done, it is relatively easy to split the sound into equally long pieces.

For this project, we create arrays of width 100,000 samples, as this creates a good number of splits in the songs (around 20 to 30 segments per song).

In real-life use cases, how you split and reuse files is a choice that depends on your use case, your data, and your models. For example, if you need to predict on short sound segments, you should train on short segments. Also, if you have more data, you may need fewer segments per song: having many similar segments is not necessarily optimal for Machine Learning.

Creating melspectrograms for genre classification with CNNs

The second step of our sound preparation is the conversion into melspectrograms. Although this is not the only method for sound preparation, this approach is commonly used and a good starting point.

Spectrograms are created using a Fourier Transform. In short, a Fourier Transform isolates the different frequencies that are present in a sound and summarizes them in a strength-per-frequency matrix, which we call a spectrogram. You can read more about Fourier Transforms on sound in this article as well.

The following code does the two described steps at the same time.
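A minimal sketch, assuming the audio and sr variables from the load step above and librosa’s default melspectrogram settings:

```python
import numpy as np
import librosa

SEGMENT_LENGTH = 100_000  # samples per segment, as discussed above

# cut the track into equally long, non-overlapping pieces and
# convert each piece into a melspectrogram (a 2D matrix)
n_segments = len(audio) // SEGMENT_LENGTH
spectrograms = []
for i in range(n_segments):
    segment = audio[i * SEGMENT_LENGTH : (i + 1) * SEGMENT_LENGTH]
    mel = librosa.feature.melspectrogram(y=segment, sr=sr)
    spectrograms.append(mel)

spectrograms = np.array(spectrograms)  # shape: (n_segments, n_mels, time_steps)
```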

CNNs for Machine Learning on sound data

The spectrogram approach that was just described converts each song (or song segment) into a spectrogram: a two-dimensional matrix. To do Machine Learning on two-dimensional input data, the best approach is to use CNNs, Convolutional Neural Networks.

CNNs are very well known for being performant on image data. You could consider spectrograms to be a sort of image, and CNNs are therefore a state-of-the-art Machine Learning model choice for sound data as well.

Let’s move on to a simple CNN definition to see how to put the prepared sound data into a CNN neural network.

Generate sound data for Keras

To create a data format that is suited for input into Keras, we can use the following code. Be aware that this will only work if you created the file structure as described above.
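A sketch of how this could look; the subfolder names silly_energy and dramatic are assumptions, so adapt them to your own folders:

```python
import os
import numpy as np
import librosa

SEGMENT_LENGTH = 100_000
GENRES = ["silly_energy", "dramatic"]  # assumed subfolder names under tracks/

X, y = [], []
for label, genre in enumerate(GENRES):
    folder = os.path.join("tracks", genre)
    for filename in os.listdir(folder):
        if not filename.endswith(".mp3"):
            continue
        audio, sr = librosa.load(os.path.join(folder, filename))
        # split each track into equal segments and convert to melspectrograms
        for i in range(len(audio) // SEGMENT_LENGTH):
            segment = audio[i * SEGMENT_LENGTH : (i + 1) * SEGMENT_LENGTH]
            X.append(librosa.feature.melspectrogram(y=segment, sr=sr))
            y.append(label)  # 0 or 1: the genre index

X = np.array(X)[..., np.newaxis]  # add a channel axis for the CNN
y = np.array(y)
```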

You need a folder called ‘tracks’ that contains one subfolder per genre, each holding the mp3 files of that genre.

If you have other songs and other genres, you can easily adapt the code for your specific situation.

Train/validate/test split

We can create a train/validate/test split, as is usual in standard Machine Learning pipelines, using the following code.
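A sketch using scikit-learn’s train_test_split applied twice (the split ratios are illustrative choices):

```python
from sklearn.model_selection import train_test_split

# first set aside a test set, then split the rest into train and validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)
```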

We will use the training data to train the model, passing the validation data alongside it to monitor training. The test set will be kept separate until the end for a final evaluation of our model.

Building a Genre Classification CNN using Keras

We will only use a very basic and standard CNN here. The goal of this article is not to spend much time on tweaking the Neural Network, but rather to give a full understanding of Machine Learning pipelines for sound data. Feel free to build a much better model!

Genre Classification CNN model architecture

The model used below has a very standard CNN architecture, with three convolutional layers and two MaxPooling layers in between.

This architecture is really only a starting point and would need much tweaking in real life use cases. For the example pipeline, it will at least show you how to fit a CNN on sound data, which is the goal of this article.
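A minimal sketch of such an architecture in Keras; the filter counts and layer sizes here are illustrative choices, and the input shape is taken from the X_train array built above:

```python
from tensorflow import keras
from tensorflow.keras import layers

# three convolutional layers with two max-pooling layers in between,
# followed by a small dense head for binary classification
model = keras.Sequential([
    layers.Input(shape=X_train.shape[1:]),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # one output unit: binary genre
])
model.summary()
```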

Calling model.summary() prints the architecture overview. If you are not following the exact example, be aware that this model architecture will only work for binary classification: if you want to use more than two genres, you will need to change the output shape of your last Dense layer.

Fitting the Keras CNN for Genre Classification

Now that the model architecture is defined, we need to compile it, choosing a loss function, an optimizer, and a metric. In this example, we are working with binary classification (two possible outcomes), so we select ‘binary_crossentropy’ as the loss. As the optimizer we use RMSprop, and we set ‘accuracy’ as the metric.
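In Keras, this translates to something like the following:

```python
model.compile(
    loss="binary_crossentropy",  # binary classification loss
    optimizer="rmsprop",
    metrics=["accuracy"],
)
```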

We then fit the model using X_train (the training segments) and y_train (the training labels). We set the validation data to X_val and y_val.

We train the model for 10 epochs; Keras prints the loss and accuracy for each epoch during training.
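A sketch of the fit call, keeping the returned history object for the plot further below:

```python
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
)
```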

Showing the model history

A better way to inspect the training history is to make the very common graph showing training accuracy and validation accuracy over the epochs. You can use the following code for this.
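A sketch using matplotlib and the history object returned by model.fit:

```python
import matplotlib.pyplot as plt

# plot training vs validation accuracy for each epoch
plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```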

For the current CNN, the resulting graph shows that the training accuracy goes up: the model is learning something. However, the validation accuracy does not improve much, which indicates that the model will need some tuning to become useful.

Conclusion and Next Steps for Improving the model

In this article, you have seen a simple and understandable pipeline for working with sound data in Python. You have learned how to prepare the data from mp3 files into NumPy arrays. You have also seen how to start working with CNNs to build a genre classification model on this data.

I hope that this article has given you a good understanding of the different elements of a genre classification pipeline and that it gives you the fundamental knowledge to go further in this domain.

If you want to go further with this model, let me give you some pointers. You could work on the data (more songs, more genres, reviewing the number of segments per song and the segment length), and you could also work on the model (trying other architectures, other hyperparameters, etc.).

For now, I want to thank you for reading. Don’t hesitate to stay tuned for more stats, maths, ML, and data science content!
