Speaker Recognition using Python and Machine Learning

Introduction

I had been wanting to learn more about machine learning (as well as Python), and I was really excited to take a “Neural Network and Fuzzy Logic” course my last semester. But unfortunately, it got canceled due to low enrollment. So when I found out that my Fractals class had a final project with a free choice of topic, I took advantage of the opportunity and did mine on machine learning.

For this project, I used the popular machine learning algorithm Gaussian Mixture Models (GMM) to train models to recognize the speakers of some commonly used “command” words. This project was implemented in Python, which I’m still new to, so I followed a base code and modified it to run on Python 3.

Full project details

Project Overview

Type: Individual
Timeframe: Approx. 1 month
Relevant Concepts/Skills: Research, machine learning, Gaussian Mixture Model, (audio) signal processing
Tools: Audacity, Python, Jupyter Notebook, NumPy, SciPy, Python Speech Features, scikit-learn, Librosa
Deliverable(s): Final presentation

Procedures

Researched machine learning and algorithmns
Pre-processed voice clips from selected dataset
Performed feature extraction with Mel-frequency cepstral coefficients and delta-Mel-frequency cepstral coefficients
Trained models using Gaussian Mixture Models (GMM) algorithm
Tested validity of models by using unseen voice clips

Results

Trained models achieved 67-100% accuracy in speaker recognition depending on test data used.

Due to the small dataset used (about 5-13 seconds, including silence, of voice clips per speaker), in general the model is only successful about 4-5 out of 6 times at identifying the speaker. The model tends to be more successful if the test word is one that it has “heard” at least once.

However, when the model is trained on the same word (4x) used for testing, even though the test clip is previously unseen, the model is able to identify the speaker 100% of the time.

It was also accidentally found that the system seemed able to recognize between different words, implying that the system could be used to perform speech recognition instead/as well.

Future Work

To improve performance, more voice samples (available in the main Speech Commands Dataset) will likely be needed for training. Data can also be augmented (for example by introducing noise) in order to increases the robustness of the model.

Other ML/DL algorithms could be implemented instead/in addition to GMM.

It should also be possible to extend this model to perform speech recognition and/or speaker verification.

Repository

The full code can be found in the repository here.