Speaker Recognition System using Python

I'm trying to build a speaker recognition (not speech recognition, but speaker recognition) system using Python. I've extracted MFCC features from both the training audio file and the test audio file and have built a GMM model for each. I'm not sure how to compare the models to compute a similarity score, based on which I can program the system to validate the test audio. I've been struggling for 4 days to get this done. I would be glad if someone could help.

From what I can understand from the question, you are describing an aspect of the cocktail party problem.
I have found a whitepaper with a solution to your problem, using a modified iterative Wiener filter and a multi-layer perceptron neural network that can separate speakers into separate channels.
Interestingly, the cocktail party problem can be solved in one line in Octave: [W,s,v]=svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
You can read more about it in this Stack Overflow post.
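For the GMM comparison the question actually asks about, a common approach is to evaluate the average log-likelihood of the test MFCCs under each enrolled speaker's GMM and pick the best-scoring speaker. A minimal sketch with scikit-learn; the array shapes, speaker names, and n_components are placeholder assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical MFCC arrays of shape (n_frames, n_mfcc); real features replace the random data.
train_mfcc = {"alice": np.random.rand(500, 13), "bob": np.random.rand(500, 13)}
test_mfcc = np.random.rand(200, 13)  # features from the unknown test utterance

# Fit one GMM per enrolled speaker on that speaker's training features.
models = {
    name: GaussianMixture(n_components=16, covariance_type="diag").fit(feats)
    for name, feats in train_mfcc.items()
}

# score() is the average per-frame log-likelihood; the highest-scoring model wins.
scores = {name: gmm.score(test_mfcc) for name, gmm in models.items()}
best = max(scores, key=scores.get)
print(best, scores)  # accept `best` only if its score clears a threshold tuned on held-out data
```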

Related

Is there a time series ML model (like TimesFormer) which extracts features from 4 channel input images?

Regular TimesFormer takes 3-channel input images, while I have 4-channel images (RGBD). I am struggling to find a TimesFormer (or a model similar to TimesFormer) that takes 4-channel input images and extracts features from them.
Does anybody know of such a model? Preferably, I would like to find a pretrained model with weights.
MORE CONTEXT:
I am working with RGBD video frames and have a multiclass classification problem at the end. My videos are fairly large, between 2 and 4 minutes, so classical time-series models don't work for me. So my inputs are RGBD frames/images from the video, and at the end I would like to get a class prediction.
My idea was to divide the problem into 2 stages:
Extract features from the video into a smaller dimension with a TimesFormer-like model. Result: I would get a new data representation (dataset).
Train a classification ML network on the new data to get a class prediction.
As of Jan 2023, I don't think there's any readily available TimeSformer model/code that works on 4-channel RGBD images.
Alternatively, if you are looking for Vision Transformers that can work with depth as well (RGBD data), you have the entire list of state-of-the-art approaches and corresponding code (wherever available) here.
One good approach to start with is DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation. You can find the pre-trained models for this approach here.
If you're looking for 3D-CNN-based object detectors that can work on RGBD data, RGB-D Salient Object Detection via 3D Convolutional Neural Networks is one of the good ones to start with. Code and pre-trained models can be found here.
Since I don't fully understand your problem statement or exact requirements, I've proposed a few things that I thought could be helpful.
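If adapting an existing pretrained model is acceptable, one common workaround is to inflate the first patch-embedding layer from 3 to 4 input channels and initialize the extra depth channel from the mean of the pretrained RGB filters. A sketch using a timm ViT as a stand-in (the model name is an assumption; TimeSformer's patch embedding can be adapted the same way):

```python
import torch
import timm

# Load any pretrained 3-channel ViT; TimeSformer's patch embedding is adapted analogously.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
old = model.patch_embed.proj  # Conv2d(3, embed_dim, kernel_size=(16, 16), stride=(16, 16))

new = torch.nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size, stride=old.stride)
with torch.no_grad():
    new.weight[:, :3] = old.weight                            # keep pretrained RGB filters
    new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)  # init depth channel from RGB mean
    new.bias.copy_(old.bias)
model.patch_embed.proj = new  # model now accepts (B, 4, 224, 224) input
```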

How to do speaker identification using voice?

I was wondering if you can detect animals using their voice. For example, we feed some data into our program, like "this is the voice of a cat", and when it detects it, it says "hello cat" or something.
Sure! Machine learning can do that. I suggest using deep learning, such as a CNN or LSTM model. What you need to prepare is a dataset of your cat voices and an appropriate model architecture.
Rough process:
Prepare the dataset. For example: 300 audio files, each containing 3 seconds of different cat voices.
Train the model.
Evaluate the model.
Write a script to respond to the model's classification output, e.g. say "hello cat" when the prediction output is a cat.
But it's not that easy; I hope this example is helpful.
More detail: https://www.analyticsvidhya.com/blog/2022/03/implementing-audio-classification-project-using-deep-learning/
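To make the rough process above concrete, here is a minimal end-to-end sketch using librosa MFCCs and a small Keras LSTM; the file paths, labels, and hyperparameters are placeholders, not a tuned setup:

```python
import numpy as np
import librosa
import tensorflow as tf

# Hypothetical dataset: paths to 3-second clips and binary labels (1 = cat, 0 = not cat).
paths = ["clips/cat_001.wav", "clips/dog_001.wav"]
labels = [1, 0]

def extract_mfcc(path, sr=22050, duration=3.0):
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = np.pad(y, (0, max(0, int(sr * duration) - len(y))))  # pad short clips to a fixed length
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # shape: (13, n_frames)
    return mfcc.T                                            # (n_frames, 13), time-major for the LSTM

X = np.stack([extract_mfcc(p) for p in paths])
y = np.array(labels)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=X.shape[1:]),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10)

# Final step above: respond to the model's output.
if model.predict(X[:1])[0, 0] > 0.5:
    print("hello cat")
```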
Of course you can use machine learning to solve that problem. You can either treat it as a signal-processing/time-series problem and slide a window over your audio, then feed the window to your model to classify it (cat, dog, ...); LSTMs and RNNs are your go-to in this case. Or transform your audio into spectrograms to turn it into images, then feed those images to your typical CNN architecture (NASNet, DenseNet, ...).
Here are some references you can check:
UrbanSound (audio/signal): https://www.kaggle.com/datasets/raghavrawat/segregatedurban8ksounds
UrbanSound (spectrograms, which I created long ago): https://www.kaggle.com/datasets/skywolfmo/urban8kspectograms
There might be other ideas and methods besides the ones I mentioned above.
Good luck
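For the spectrogram route mentioned above, a minimal sketch of converting one clip into a log-mel spectrogram image that a CNN can consume (the file names are placeholders):

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Convert one clip (hypothetical path) into a log-mel spectrogram image for a CNN.
y, sr = librosa.load("clips/cat_001.wav")
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)  # decibel scale, the usual CNN input
plt.imsave("cat_001.png", S_db, cmap="magma", origin="lower")
```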

How to add a new category in the CNN for Attendance by AI

I am working on a project with my group, and we have decided to build an 'Automatic Attendance System by AI'.
I have learned how to use CNNs to categorize objects, i.e. dogs and cats.
With that knowledge, we decided to build the attendance system based on a CNN. (Please tell me if we shouldn't take this path or this technology if you find something wrong here...)
Continuing with the CNN: let's say we have trained the model on 2 students, and on the last layer we put two neurons, as there are just two, right...?
Now a third student arrives; to train the network on his face, I have to change the model's structure and retrain on every face again...
If we apply the project to a big institute where there are hundreds of students, and we want to train the model for each individual student, then recreating the model is not a feasible solution.
So we thought we would fix the model's output layer size at, let's say, 50.
So only 50 faces can be trained per model.
But it is not guaranteed that there will always be exactly 50.
There may be 40, or if one more student is admitted, then 41.
So how do we re-train the network with the existing weights?
(I know the same question has been asked elsewhere, but please guide me in my specific situation.)
Or is there any other technology to use...?
Please direct me...
You don't need classification. Classification is not the solution to every problem.
You should look into these:
Cosine Similarity
Siamese Network
You can use existing models such as FaceNet or those in OpenCV.
Since they are already trained on a huge dataset of faces, you can extract a feature vector easily.
Store the feature vector for every new student.
Then compute the similarity between the stored image's vector and the current image's vector, and mark attendance based on the distance or similarity score.
This is a scalable and much faster approach. No training or retraining.
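A minimal sketch of this store-and-compare idea, assuming each student is enrolled with one feature vector (e.g. from FaceNet); the 128-dimensional vectors and the 0.7 threshold are placeholder assumptions to tune on real data:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One stored feature vector per enrolled student (e.g. from FaceNet); random placeholders here.
enrolled = {"alice": np.random.rand(128), "bob": np.random.rand(128)}

def mark_attendance(query_vec, threshold=0.7):
    # Compare the incoming face's vector against every enrolled vector; no retraining needed.
    name, score = max(
        ((n, cosine_similarity(query_vec, v)) for n, v in enrolled.items()),
        key=lambda t: t[1],
    )
    return name if score >= threshold else None  # None -> unknown face
```

Adding a new student is then just one more entry in the store, which is why the output layer never has to change.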
IMHO, you don't need a classifier for that. What you need are the vectors, or encodings, described in this article by Adam Geitgey. Facial recognition then depends on the cosine similarity of encodings instead of being a classification problem.
For this purpose, Adam Geitgey's wrapper around Dlib's facial recognition is pretty good. It can give you a similarity measure between a pair of faces, which can help match a face to a student. Additionally, for your use case, you can store the face encodings of the students and match them against incoming image data.
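A short usage sketch with the face_recognition package (Adam Geitgey's Dlib wrapper); the image paths are hypothetical, and 0.6 is the library's commonly used default tolerance:

```python
import face_recognition

# Enroll: one reference photo per student (paths are hypothetical).
known_image = face_recognition.load_image_file("students/alice.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]  # 128-d vector

# Match every face found in an incoming camera frame against the enrolled encoding.
frame = face_recognition.load_image_file("camera/frame.jpg")
for encoding in face_recognition.face_encodings(frame):
    distance = face_recognition.face_distance([known_encoding], encoding)[0]
    if distance < 0.6:  # the library's commonly used default tolerance
        print("Mark Alice present")
```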

Decoding tracebacks using machine learning

I am trying to solve a problem where I have files which contain decoded tracebacks (stack call traces) produced whenever there is a crash (in the Linux world), and I have a unique ID to track each crash occurrence.
I want to build a classifier which will learn from the previous decoded tracebacks and predict whether an already-existing ID matches the current traceback.
This is my first machine learning project. I did a trial using the CountVectorizer and TF-IDF approaches in Python.
I want to know which features to consider for classification and which text-classification algorithm is suitable for solving this problem.
Great to hear that this is your first machine learning project! For my first NLP project, I used the Amazon product reviews dataset. Have you tried the bag-of-words (BOW) model? You can try n-grams too, and you can consider using a Naive Bayes classifier and evaluating your classification. Then you will know which algorithm solves the problem best.
Extra reading (if you like): https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/
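A minimal sketch combining these suggestions, TF-IDF over word n-grams fed into a Naive Bayes classifier with scikit-learn; the two toy tracebacks and crash IDs are placeholders for your real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy placeholders: in practice, use your decoded tracebacks and their known crash IDs.
tracebacks = [
    "main -> parse_config -> strcpy (segfault)",
    "main -> net_init -> recv (timeout)",
]
crash_ids = ["CRASH-001", "CRASH-002"]

# Word unigrams + bigrams: frame names and their order are the informative features here.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"[A-Za-z_]\w*"),
    MultinomialNB(),
)
model.fit(tracebacks, crash_ids)
print(model.predict(["main -> parse_config -> strcpy (crash)"]))  # -> ['CRASH-001']
```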

Software for Image classification

Currently I am working on a project to classify a given set of test images into one of 5 predefined categories. I implemented logistic regression with a feature vector of 240 features for each image and trained it using 100 images per category. The training accuracy I achieved was ~98% for each category, whereas when tested on a validation set consisting of 500 images (100 images per category), only ~57% of the images were correctly classified.
Please suggest a few libraries/tools I can use (preferably based on neural networks) in order to attain higher accuracy.
I tried using a Java-based tool, Neuroph (neuroph.sourceforge.net), on Windows, but it didn't run as expected.
Edit: The feature vectors were already provided for the project. I am also looking for a better feature-extraction tool for images.
You can get help from this paper: Image Classification.
In my opinion, SVM is relatively better than logistic regression when it comes to multi-class problems. We use it in e-commerce product classification, where there are thousands of response levels and thousands of features.
Based on your tags, I assume you would like a Python package; scikit-learn has good classification routines: scikit-learn.org.
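Putting the two suggestions above together, a minimal scikit-learn sketch with an SVM and cross-validation (the random arrays stand in for the provided 240-feature vectors):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Random placeholders standing in for the provided 240-feature vectors and 5 class labels.
X = np.random.rand(500, 240)
y = np.random.randint(0, 5, 500)

# Scaling matters for SVMs; cross-validation exposes the train/validation gap described above.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(clf, X, y, cv=5).mean())
```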
I have had good success using the WEKA tools; you need to isolate the feature set that you are interested in and then apply a classifier from that library. The examples are very clear. http://weka.wikispaces.com
