How to get GFCC instead of MFCC in Python?

I am currently using MFCC from librosa in Python with the code below. It gives an array with dimension (40, 40).
import librosa
sound_clip, s = librosa.load("filename.wav")
mfcc = librosa.feature.mfcc(y=sound_clip, sr=s, n_mfcc=40, n_mels=60)
Is there a similar way to extract the GFCC from another library? I cannot find it in librosa.
For example essentia:
https://essentia.upf.edu/documentation/essentia_python_tutorial.html
https://essentia.upf.edu/documentation/reference/std_GFCC.html
import essentia
import essentia.standard
essentia.standard.GFCC
# goal: get an array with dimension (40, 40), as with the MFCCs above

I have been facing similar problems, so I wrote a small library called spafe that simplifies feature extraction from audio files. Among the supported features is GFCC. The extraction can be done as follows:
import scipy
from spafe.features.gfcc import gfcc
# read wav
fs, sig = scipy.io.wavfile.read("test.wav")
# compute features
gfccs = gfcc(sig, fs=fs, num_ceps=13)
You can find a thorough example of GFCC extraction (as a Jupyter notebook) under gfcc-features-example.
The documentation of all the possible input variables and their significance is available under gfcc-docs.
The GFCC implementation follows the approach described in the referenced paper.

https://github.com/jsingh811/pyAudioProcessing provides GFCC, MFCC, spectral, and chroma feature extraction, along with classification, cross-validation, and hyperparameter tuning.
The README describes getting-started steps as well as examples of how to run classification.

Related

What libraries can I use for anomaly detection in time-series data in Python?

I am working with data consisting of two variables:
Date-Time (in 15minutes intervals) and
Demand
With these variables, I need to formulate a model trained on the data to detect anomalies. Currently I am using the Pandas library, but are there any other libraries I might use?
Thank you.
The Python libraries pyod, pycaret, fbprophet, and scipy are good for automating anomaly detection.
There is a good article on how to do a variety of anomaly detection exercises on a sample dataset from Expedia. Although it isn't explained in the article, the author used the Pandas library to load and analyze time series data. This is a good article to make sure you better understand some of the capabilities of the library that you are already using for anomaly detection.
Another good article uses Pandas for the time series data and uses additional libraries for anomaly detection analysis. I found this article useful when starting out since it uses Faker and NumPy to create fake data, so it is easy to duplicate the tests in the article.
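Beyond those libraries, a useful baseline you can build with only Pandas and NumPy is a rolling z-score detector. The sketch below uses synthetic 15-minute demand data; the window size and the 4-sigma threshold are illustrative assumptions, not recommendations for your data:

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute demand series with two injected spikes
idx = pd.date_range("2023-01-01", periods=500, freq="15min")
rng = np.random.default_rng(42)
demand = pd.Series(100 + 10 * rng.standard_normal(500), index=idx)
demand.iloc[[100, 300]] += 80  # inject two anomalies

# Rolling z-score: flag points far from the local rolling mean
window = 96  # one day of 15-minute intervals (illustrative)
mean = demand.rolling(window, min_periods=10).mean()
std = demand.rolling(window, min_periods=10).std()
z = (demand - mean) / std
anomalies = demand[z.abs() > 4]  # 4-sigma threshold is illustrative
```

This catches point anomalies against the recent local level; seasonal or contextual anomalies are where the dedicated libraries above earn their keep.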

Implement the MATLAB 'fitdist' in Python

I am working on an image processing tool, and I am having some trouble finding a good Python substitute for MATLAB's fitdist.
The matlab code works something like this:
pdR = fitdist(Red,'Kernel','Support','positive');
Has anyone found a good implementation of this in Python?
Generally SciPy is useful in your case:
import scipy.stats as st
# KDE
st.gaussian_kde(data)
# Fit to specified distribution (Normal distribution in this example)
st.norm.fit(data)
Full reference is here: https://docs.scipy.org/doc/scipy/reference/stats.html
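Note that gaussian_kde has no equivalent of MATLAB's 'Support','positive' restriction. One common workaround is a parametric fit to a distribution with strictly positive support; the sketch below uses synthetic data, and the gamma distribution is an illustrative assumption, not a direct translation of fitdist's kernel option:

```python
import numpy as np
import scipy.stats as st

# Synthetic positive-valued data (stand-in for an image channel)
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=1.5, size=2000)

# Nonparametric KDE (unbounded support, unlike 'Support','positive')
kde = st.gaussian_kde(data)

# Parametric alternative with positive support: fit a gamma
# distribution, fixing the location parameter at 0
shape, loc, scale = st.gamma.fit(data, floc=0)
```

The fitted (shape, scale) should land near the generating values (2.0, 1.5); for a bounded-support KDE you would need to transform the data (e.g. fit in log space) or use a dedicated package.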

Why is scikit-learn SVM.SVC() extremely slow?

I tried to use an SVM classifier to train on data with about 100k samples, but I found it to be extremely slow; even after two hours there was no response. When the dataset has around 1k samples, I get the result immediately. I also tried SGDClassifier and naïve Bayes, which are quite fast and gave me results within a couple of minutes. Could you explain this phenomenon?
General remarks about SVM-learning
SVM training with nonlinear kernels, which is the default in sklearn's SVC, is approximately O(n_samples^2 * n_features) in complexity (an approximation given by one of sklearn's developers). This applies to the SMO algorithm used within libsvm, which is the core solver in sklearn for this type of problem.
This changes considerably when no kernel is used and one uses sklearn.svm.LinearSVC (based on liblinear) or sklearn.linear_model.SGDClassifier instead.
So we can do some math to approximate the time-difference between 1k and 100k samples:
1k samples:   1,000^2   =      1,000,000 steps = time X
100k samples: 100,000^2 = 10,000,000,000 steps = time X * 10,000 (!)
This is only an approximation, and the reality can be better or worse (e.g. setting the cache size trades memory for speed gains)!
Scikit-learn specific remarks
The situation can also be more complex because of all the nice things scikit-learn does for us behind the scenes. The above holds for the classic 2-class SVM. If you are trying to learn multi-class data, scikit-learn will automatically wrap the core SVM algorithm in a one-vs-one or one-vs-rest scheme (as the core algorithm does not support multi-class problems itself). Read scikit-learn's docs to understand this part.
The same warning applies to generating probabilities: SVMs do not naturally produce probabilities for their final predictions, so to obtain them (activated by a parameter) scikit-learn uses an expensive internal cross-validation procedure (Platt scaling), which also takes a lot of time!
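To see the gap concretely, here is a small benchmark sketch along these lines; the sample size and model choices are illustrative, and absolute timings depend on your machine:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier

# Synthetic 2-class problem; the sample size is illustrative
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

timings, scores = {}, {}
for name, clf in [("SVC (kernelized, libsvm)", SVC()),
                  ("LinearSVC (liblinear)", LinearSVC()),
                  ("SGDClassifier", SGDClassifier())]:
    t0 = time.perf_counter()
    clf.fit(X, y)
    timings[name] = time.perf_counter() - t0
    scores[name] = clf.score(X, y)  # training accuracy, as a sanity check
```

Scaling n_samples up is where the quadratic cost of the kernelized SVC separates it from the two linear solvers.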
Scikit-learn documentation
Because sklearn has some of the best documentation around, there is a good section in those docs explaining exactly this (see the SVM user guide's notes on complexity).
If you are using an Intel CPU, then Intel has provided a solution for this.
The Intel Extension for Scikit-learn offers a way to accelerate existing scikit-learn code. The acceleration is achieved through patching: replacing the stock scikit-learn algorithms with optimized versions provided by the extension.
Follow these steps:
First, install the intelex package for sklearn:
pip install scikit-learn-intelex
Then add the following lines at the top of the program:
from sklearnex import patch_sklearn
patch_sklearn()
Now run the program; it will be much faster than before.
You can read more about it from the following link:
https://intel.github.io/scikit-learn-intelex/

Multidimensional/multivariate dynamic time warping (DTW) library/code in Python

I am working on a time series data. The data available is multi-variate. So for every instance of time there are three data points available.
Format:
| X | Y | Z |
So one time series in the above format would be generated in real time. I am trying to find a good match of this real-time-generated series within another, already-stored base time series (which is much larger and was collected at a different frequency). If I apply standard DTW to each of the series (X, Y, Z) individually, they might end up matching at different points within the base data, which is unfavorable. So I need to find a point in the base data where all three components (X, Y, Z) match well, and at the same point.
I have researched the matter and found that multidimensional DTW is a perfect fit for this problem. In R the dtw package does include multidimensional DTW, but I have to implement it in Python. The R-Python bridging package rpy2 can probably be of help here, but I have no experience in R. I have looked through the available DTW packages in Python, like mlpy and dtw, but they do not help. Can anyone suggest a Python package that does the same, or code for multidimensional DTW using rpy2?
Thanks in advance!
Thanks @lgautier, I dug deeper and found an implementation of multivariate DTW using rpy2 in Python. Simply passing the template and query as 2-D matrices (matrices as in R) allows the rpy2 dtw package to do a multivariate DTW. Also, if you have R installed, loading the R dtw library and running "?dtw" gives access to the library's documentation and the different functionalities it provides.
For future reference to other users with similar questions:
Official documentation of R dtw package: https://cran.r-project.org/web/packages/dtw/dtw.pdf
Sample code, passing two 2-D matrices for multivariate DTW; the open_begin and open_end arguments enable subsequence matching:
import numpy as np
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
import rpy2.robjects as robj
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
template = np.array([[1,2,3,4,5],[1,2,3,4,5]]).transpose()
rt,ct = template.shape
query = np.array([[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]]).transpose()
rq,cq = query.shape
# convert NumPy arrays to R matrices
templateR = R.matrix(template, nrow=rt, ncol=ct)
queryR = R.matrix(query, nrow=rq, ncol=cq)
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(templateR,queryR,keep=True, step_pattern=R.rabinerJuangStepPattern(4,"c"),open_begin=True,open_end=True)
dist = alignment.rx('distance')[0][0]
print(dist)
It seems like tslearn's dtw_path() is exactly what you are looking for. To quote the docs linked before:
Compute Dynamic Time Warping (DTW) similarity measure between (possibly multidimensional) time series and return both the path and the similarity.
[...]
It is not required that both time series share the same size, but they must be the same dimension. [...]
The implementation they provide follows:
H. Sakoe, S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26(1), pp. 43–49, 1978.
I think it is a good idea to try out a method in whatever implementation is already available before considering whether it is worth working on a reimplementation.
Did you try the following ?
from rpy2.robjects.packages import importr
# You'll obviously need the R package "dtw" installed with your R
dtw = importr("dtw")
# all functions and objects in the R package "dtw" are now available
# with `dtw.<function or object>`
I happened upon this post and thought I would provide some updated information in case anyone else is trying to find a way to do multivariate DTW in Python. The DTAIDistance package has the option to perform multivariate DTW.
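For reference, and independent of any particular package, multivariate ("dependent") DTW is just ordinary DTW with a vector distance between time points, which is what forces all components (X, Y, Z) to match at the same warping point. A minimal, unoptimized NumPy sketch:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two multivariate series a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance over all dimensions at once, so every
            # component must match at the same warping point
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# two 3-dimensional series of different lengths
series_a = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]], dtype=float)
series_b = np.array([[0, 0, 0], [2, 2, 2]], dtype=float)
dist = dtw_distance(series_a, series_b)
```

This O(n*m) double loop is fine for sanity checks; for production use, the packages above offer windowing constraints and C-level speed.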

How can I classify data with the nearest-neighbor algorithm using Python?

I need to classify some data with (I hope) nearest-neighbour algorithm. I've googled this problem and found a lot of libraries (including PyML, mlPy and Orange), but I'm unsure of where to start here.
How should I go about implementing k-NN using Python?
Particularly given the technique (k-nearest neighbors) that you mentioned in your question, I would strongly recommend scikits.learn (now scikit-learn). [Note: after this answer was posted, the lead developer of the project informed me of a new homepage for the project.]
A few features that I believe distinguish this library from the others (at least the other Python ML libraries that I have used, which is most of them):
an extensive diagnostics & testing library (including plotting modules via Matplotlib), with feature-selection algorithms, confusion matrices, ROC curves, precision-recall, etc.;
a nice selection of 'batteries-included' data sets (including handwritten digits, facial images, etc.) particularly suited to ML techniques;
extensive documentation (a nice surprise given that the project is only about two years old), including tutorials and step-by-step example code that uses the supplied data sets.
Without exception (at least none that I can think of at this moment), the Python ML libraries are superb. (See the PyMVPA homepage for a list of the dozen or so most popular Python ML libraries.)
In the past 12 months, for instance, I have used ffnet (MLP), neurolab (also MLP), PyBrain (Q-learning), and PyMVPA (SVM), all available from the Python Package Index. They vary significantly from each other with respect to maturity, scope, and supplied infrastructure, but I found them all to be of very high quality.
Still, the best of these might be scikit-learn; for instance, I am not aware of any Python ML library other than scikit-learn that includes all three of the features mentioned above (though a few have solid example code and/or tutorials, none that I know of integrates these with a library of research-grade data sets and diagnostic algorithms).
Second, given the technique you intend to use (k-nearest neighbors), scikit-learn is a particularly good choice. It includes kNN algorithms for both regression (returns a score) and classification (returns a class label), as well as detailed sample code for each.
Using the scikit-learn k-nearest neighbor module (literally) couldn't be any easier:
>>> # import the relevant scikit-learn class
>>> from sklearn.neighbors import KNeighborsClassifier
>>> # load one of the sklearn-supplied data sets
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> # the call to load_iris() loaded both the data and the class labels, so
>>> # bind each to its own variable
>>> data = iris.data
>>> class_labels = iris.target
>>> # construct a classifier by instantiating the kNN module's primary class
>>> knn1 = KNeighborsClassifier()
>>> # now 'train' the classifier by passing the data and class labels to fit()
>>> knn1.fit(data, class_labels)
KNeighborsClassifier()
What's more, unlike nearly all other ML techniques, the crux of k-nearest neighbors is not coding a working classifier builder; rather, the difficult step in building a production-grade k-nearest neighbor classifier/regressor is the persistence layer, i.e., storage and fast retrieval of the data points from which the nearest neighbors are selected. For the kNN data storage layer, scikit-learn includes an implementation of the ball tree, which is apparently superior to the k-d tree (the traditional data structure for k-NN) because its performance does not degrade as badly in higher-dimensional feature spaces.
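In current scikit-learn the ball tree is exposed directly as sklearn.neighbors.BallTree, so the storage-and-retrieval step can be sketched like this (the data and leaf_size are illustrative):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))  # 100 points in an 8-D feature space

# build the tree once, then query it repeatedly
tree = BallTree(X, leaf_size=20)
# 3 nearest neighbors of the first point (the point itself is included)
dist, ind = tree.query(X[:1], k=3)
```

KNeighborsClassifier builds such a structure internally (algorithm='auto' picks between ball tree, k-d tree, and brute force), but the direct interface is handy when you only need retrieval.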
Additionally, k-nearest neighbors requires an appropriate similarity metric (Euclidean distance is the usual choice, though not always the best one). Scikit-learn includes a stand-alone module with various distance metrics, as well as tools for selecting an appropriate one.
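In current scikit-learn, choosing the metric is just a constructor parameter; here is a small sketch comparing two common metrics on the iris data (the particular metrics and k=5 are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare cross-validated accuracy under two distance metrics
accuracies = {}
for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    accuracies[metric] = cross_val_score(knn, X, y, cv=5).mean()
```

Trying a handful of metrics under cross-validation like this is a simple, principled way to pick one for your data.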
Finally, there are a few libraries I have not mentioned, either because they are out of scope (e.g., PyML, Bayesian libraries); because they are not primarily libraries for developers but rather applications for end users (e.g., Orange); or because they have unusual or difficult-to-install dependencies (e.g., mlpy requires the GSL, which in turn must be built from source), at least for my OS, Mac OS X.
(Note: I am not a developer/committer for scikit-learn.)
