Scala: Class that is similar to QuantileTransformer in Python

I am looking for a Scala implementation of Python's sklearn.preprocessing.QuantileTransformer class. There doesn't seem to be a single class in Scala that implements the entire functionality.
The Python implementation has three major parts:
1) Compute quantiles for the given data and percentile array using numpy.percentile(). If a quantile lies between two input data points, linear interpolation is used. The closest I can find in Scala is breeze, which has a percentile() function. (Observation: DataFrame.stat.approxQuantile() does not perform the linear interpolation and thus can't be used here.)
2) Use numpy.interp() to map the input range of values onto a given range. E.g. if the input data range is 1-100, it can be mapped onto any given range, say 0-1. Again, linear interpolation is used when an input value falls between two quantiles. The closest I can find in Scala is the breeze.interpolation class.
3) Calculate the inverse CDF using scipy.stats.norm.ppf() (numpy itself has no ppf). I believe for this I can use the NormalDistribution class, as suggested in one answer below, or the StandardScaler class. (A numpy sketch of all three steps is below.)
Anything better to make the coding short and simple?
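For reference, here is a minimal numpy/scipy sketch of the three steps above, in case it helps pin down the target behavior to port; the data and clipping constant are placeholders:

import numpy as np
from scipy import stats

x = np.random.randn(1000)  # placeholder training data

# 1) reference quantiles, with linear interpolation between data points
refs = np.linspace(0, 100, 1000)
quantiles = np.percentile(x, refs)

# 2) map new values onto [0, 1] by interpolating against those quantiles
u = np.interp(x, quantiles, refs / 100.0)

# 3) inverse normal CDF (what sklearn does for output_distribution='normal');
#    clip to avoid infinities at the boundaries
z = stats.norm.ppf(np.clip(u, 1e-7, 1 - 1e-7))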

The Apache Commons Math library has a NormalDistribution class, which has an inverseCumulativeProbability method that calculates the specified quantile value. That should suit your purposes.

Related

scipy.optimize.minimize only iterates over some variables

I have written Python (2.7.3) code in which I aim to create a weighted sum of 16 data sets and compare the result to some expected value. My problem is to find the weighting coefficients which will produce the best fit to the model. To do this, I have been experimenting with scipy's optimize.minimize routines, but have had mixed results.
Each of my individual data sets is stored as a 15x15 ndarray, so their weighted sum is also a 15x15 array. I define my own 'model' of what the sum should look like (also a 15x15 array), and quantify the goodness of fit between my result and the model using a basic least squares calculation.
R=np.sum(np.abs(model/np.max(model)-myresult)**2)
'myresult' is produced as a function of some set of parameters 'wts'. I want to find the set of parameters 'wts' which will minimise R.
To do so, I have been trying this:
res = minimize(get_best_weightings,wts,bounds=bnds,method='SLSQP',options={'disp':True,'eps':100})
Where my objective function is:
def get_best_weightings(wts):
    # split the 32 free parameters into real and imaginary weights
    wts_tr = wts[0:16]
    wts_ti = wts[16:32]
    for i, j in enumerate(portlist):
        originalwtsr[j] = wts_tr[i]
        originalwtsi[j] = wts_ti[i]
    realwts = originalwtsr
    imagwts = originalwtsi
    myresult = make_weighted_beam(realwts, imagwts, 1)
    # least-squares residual against the normalised model
    R = np.sum((np.abs(modelbeam / np.max(modelbeam) - myresult)) ** 2)
    return R
The input (wts) is an ndarray of shape (32,), and the output, R, is just some scalar, which should get smaller as my fit gets better. By my understanding, this is exactly the sort of problem ("Minimization of scalar function of one or more variables.") which scipy.optimize.minimize is designed to optimize (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html).
However, when I run the code, although the optimization routine seems to iterate over different values of all the elements of wts, only a few of them seem to 'stick'; i.e., all but four of the values are returned with the same values as my initial guess. To illustrate, I plot the values of my initial guess for wts (in blue) and the optimized values (in red). You can see that for most elements the two lines overlap.
Image: http://imgur.com/p1hQuz7
Changing just these few parameters is not enough to get a good answer, and I can't understand why the other parameters aren't also being optimised. I suspect that maybe I'm not understanding the nature of my minimization problem, so I'm hoping someone here can point out where I'm going wrong.
I have experimented with a variety of minimize's built-in methods (I am by no means committed to SLSQP, or certain that it's the most appropriate choice), and with a variety of 'step sizes' eps. The bounds I am using for my parameters are all (-4000, 4000). I only have scipy version 0.11, so I haven't tested a basinhopping routine to get the global minimum (this needs 0.12). I have looked at scipy.optimize.brute, but haven't tried implementing it yet - thought I'd check if anyone can steer me in a better direction first.
Any advice appreciated! Sorry for the wall of text and the possibly (probably?) idiotic question. I can post more of my code if necessary, but it's pretty long and unpolished.
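No answer is recorded here, but a self-contained toy version of the call above (a hypothetical quadratic stands in for get_best_weightings) can be used to probe how the finite-difference step eps interacts with the (-4000, 4000) parameter scale:

import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
target = rng.uniform(-4000, 4000, size=32)  # hypothetical "true" weights

def toy_objective(wts):
    # smooth least-squares surface standing in for get_best_weightings
    return np.sum((wts - target) ** 2)

wts0 = np.zeros(32)
bnds = [(-4000, 4000)] * 32

for eps in (100.0, 1e-3):
    res = minimize(toy_objective, wts0, bounds=bnds,
                   method='SLSQP', options={'eps': eps})
    print(eps, res.fun)  # a large eps biases the forward-difference gradient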

What is scipy's equivalent to matlab's `mle` function?

I'm trying to fit some data to a mixed model using an expectation maximization approach. In Matlab, the code is as follows
% mixture model's PDF
mixtureModel = ...
    @(x,pguess,kappa) pguess/180 + (1-pguess)*exp(kappa*cos(2*x/180*pi))/(180*besseli(0,kappa));
% Set up parameters for the MLE function
options = statset('mlecustom');
options.MaxIter = 20000;
options.MaxFunEvals = 20000;
% fit the model using maximum likelihood estimate
params = mle(data, 'pdf', mixtureModel, 'start', [.1 1/10], ...
'lowerbound', [0 1/50], 'upperbound', [1 50], ...
'options', options);
The data parameter is a 1-D vector of floats.
I'm wondering how the equivalent computation can be achieved in Python. I looked into scipy.optimize.minimize, but this doesn't seem to be a drop-in replacement for Matlab's mle.
I'm a bit lost and overwhelmed, can somebody point me in the right direction (ideally with some example code?)
Thanks very much in advance!
Edit: In the meantime I've found this, but I'm still rather lost as (1) this seems primarily focused on Gaussian mixture models (which mine is not) and (2) my mathematical skills are severely lacking. That said, I'll happily accept an answer that elucidates how this notebook relates to my specific problem!
This is a mixture model (not a mixed model) of uniform and von Mises distributions, whose parameters you are trying to infer by direct maximum likelihood estimation (not EM, although that may be more appropriate). You can find theses written on this exact problem if you search the internet. SciPy doesn't have anything that is as clear a choice as Matlab's fmincon, which mle uses as its default solver in your code, but you can look for SciPy optimization methods that allow bounds on parameters. The SciPy interface differs from Matlab's mle: you pass the data through the 'args' argument of the SciPy minimization functions, while pguess and kappa are represented as a parameter array of length 2.
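A minimal sketch of that approach, translating the Matlab model above and using L-BFGS-B (one of the SciPy methods that supports bounds); the placeholder data line is an assumption:

import numpy as np
from scipy.special import i0  # besseli(0, .) in Matlab
from scipy.optimize import minimize

def mixture_pdf(x, pguess, kappa):
    # uniform component plus von Mises-style component, x in degrees
    return (pguess / 180.0
            + (1 - pguess) * np.exp(kappa * np.cos(2 * x / 180.0 * np.pi))
            / (180.0 * i0(kappa)))

def nll(params, data):
    # negative log-likelihood; the data arrives via the `args` mechanism
    pguess, kappa = params
    return -np.sum(np.log(mixture_pdf(data, pguess, kappa)))

data = np.random.uniform(-90, 90, 1000)  # placeholder 1-D float data

res = minimize(nll, x0=[0.1, 0.1], args=(data,), method='L-BFGS-B',
               bounds=[(0, 1), (1.0 / 50, 50)])
pguess_hat, kappa_hat = res.x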
I believe the scikit-learn toolkit has what you need:
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html
Gaussian Mixture Model
Representation of a Gaussian mixture model probability distribution. This class allows for easy evaluation of, sampling from, and maximum-likelihood estimation of the parameters of a GMM distribution.
Initializes parameters such that every mixture component has zero mean and identity covariance.
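For what it's worth, in later scikit-learn releases that class became sklearn.mixture.GaussianMixture; a minimal fit looks roughly like this (placeholder data, and note the questioner's mixture is not Gaussian, so this only applies if a Gaussian mixture is an acceptable approximation):

import numpy as np
from sklearn.mixture import GaussianMixture  # successor of sklearn.mixture.GMM

data = np.random.randn(1000, 1)  # placeholder: shape (n_samples, n_features)
gmm = GaussianMixture(n_components=2).fit(data)
print(gmm.means_, gmm.covariances_)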

DBSCAN plotting Non-geometrical-Data

I used the sklearn clustering algorithm DBSCAN to get clusters of my data.
Data: non-geometrical objects based on hexadecimal strings
I used a simple distance to create a distance matrix as input for DBSCAN, which resulted in the expected clusters.
Question: Is it possible to create a plot of these cluster results like in the demo?
I didn't find a solution through searching.
I need to graphically demonstrate the similarities of the objects and clusters to each other.
Since I am using Python for everything in that project, I would appreciate a solution in Python.
I don't use Python, so I cannot give you example code.
If your data isn't 2-dimensional, you can try to find a good 2-dimensional approximation using Multidimensional Scaling (MDS).
Essentially, it takes an input distance matrix (which should satisfy the triangle inequality, and ideally be derived from a Euclidean distance in some vector space; but you can often get good results even if this does not strictly hold) and tries to find the 2-dimensional embedding that best preserves those distances.
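Since that answer deliberately omits code, here is a minimal Python sketch of the suggestion using scikit-learn's MDS on a precomputed distance matrix (the matrix and labels are placeholders; in practice, use the same matrix fed to DBSCAN and the labels it returned):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# placeholder symmetric distance matrix with zero diagonal
D = np.random.rand(20, 20)
D = (D + D.T) / 2.0
np.fill_diagonal(D, 0.0)

# embed into 2-D coordinates that approximately preserve the distances
coords = MDS(n_components=2, dissimilarity='precomputed').fit_transform(D)

labels = np.random.randint(0, 3, 20)  # placeholder DBSCAN cluster labels
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.show()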

Where can I see the list of built-in wavelet functions that I can pass to scipy.signal.cwt?

scipy.signal.cwt's documentation says:
scipy.signal.cwt(data, wavelet, widths)
wavelet : function
    Wavelet function, which should take 2 arguments. The first argument
    is the number of points that the returned vector will have
    (len(wavelet(length, width)) == length). The second is a width
    parameter, defining the size of the wavelet (e.g. standard
    deviation of a Gaussian). See ricker, which satisfies these
    requirements.
Beyond scipy.signal.ricker, what are the other built-in wavelet functions that I can pass to scipy.signal.cwt?
I see in scipy/scipy/signal/wavelets.py:
__all__ = ['daub', 'qmf', 'cascade', 'morlet', 'ricker', 'cwt']
and looking at the arguments of each of those wavelet functions, only ricker seems to work with scipy.signal.cwt(data, wavelet, widths) (as only ricker takes precisely 2 arguments).
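For reference, a minimal sketch of the one built-in combination that does work, in SciPy versions that ship signal.cwt (the test signal is a placeholder):

import numpy as np
from scipy import signal

t = np.linspace(-1, 1, 200)
data = np.cos(2 * np.pi * 7 * t) + 0.1 * np.random.randn(200)
widths = np.arange(1, 31)

# ricker(points, width) matches the two-argument contract cwt expects
cwtmatr = signal.cwt(data, signal.ricker, widths)
print(cwtmatr.shape)  # (len(widths), len(data))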
I asked the question on the SciPy Users List. Answer 1:
I found the module for CWT quite confusing, so I rolled my own:
https://github.com/Dapid/fast-pycwt
It is built for speed (I got my running time from 4 h down to 20 min).
It is not thoroughly tested, and it is limited to single and double;
but for me it is in a "good enough" state.
Answer 2:
You might also find my version useful:
https://github.com/aaren/wavelets
I also found scipy wavelets confusing. My version includes a faster
cwt that can take wavelets expressed in either frequency or time.
I found it more intuitive to have wavelet functions that take
time/frequency and width as arguments rather than the present method
(I prefer thinking in real space rather than sample space).
Presently, the morlet wavelet that comes with scipy,
scipy.signal.wavelets.morlet, cannot be used as input to cwt. This
is unfortunate I think.
Additionally, the present cwt doesn't allow complex output. This
doesn't make a difference for ricker but wavelet functions are complex
in general.
My modified 'cwt' method is here:
https://github.com/aaren/wavelets/blob/master/wavelets.py#L15
It can accept wavelet functions defined in time or frequency space,
uses fftconvolve, and allows complex output.
My background on this is based on a reading of Torrence and Compo:
Torrence and Compo, 'A Practical Guide to Wavelet Analysis' (BAMS,
1998)
http://paos.colorado.edu/research/wavelets/
hope that helps a bit,
aaron

Is there an equivalent of the matlab 'idealfilter' for Python in Scipy (or other libraries)?

I am looking for a Python equivalent of the time series idealfilter that is implemented in Matlab.
My goal is to implement an ideal filter using the Discrete Cosine Transform, as used in the Eulerian Video Magnification paper, in order to obtain the heartbeat of a human being from standard video. I am using their video as my input, and I have implemented the bandpass filter method, but I have not been able to find an idealfilter method to use in my script.
They state that they implement an ideal filter using a DCT from 0.83 to 1.0 Hz.
My problem is that Matlab's idealfilter takes the cutoff frequencies as input, but I don't think it is implemented with a DCT.
In contrast, the DCT found in scipy.fftpack does not take frequency cutoffs as input.
If I have to use these in some kind of succession, please let me know.
If such an equivalent function exists, I would like to try it and see whether it yields results similar to theirs.
Non-causal means that your filter depends on future inputs.
DCT is a transform, not a filter. You want a filter.
You want to apply a bandpass filter to your data within the range you specified, so I would use a Butterworth filter.
Here is some example code: https://stackoverflow.com/a/12233959/1097117
The trickiest part of all of this is getting everything in terms of your Nyquist frequency.
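A minimal sketch of that suggestion, including the Nyquist normalization (the 30 Hz sampling rate and the test signal are assumptions; use the actual frame rate of the video):

import numpy as np
from scipy.signal import butter, filtfilt

fs = 30.0                          # assumed video frame rate in Hz
nyq = fs / 2.0                     # Nyquist frequency
low, high = 0.83 / nyq, 1.0 / nyq  # normalize the cutoffs by Nyquist

b, a = butter(4, [low, high], btype='band')

x = np.random.randn(900)           # placeholder per-frame signal
y = filtfilt(b, a, x)              # zero-phase, i.e. non-causal, filtering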
It may be worth having a look at the time series analysis module of the statsmodels library. It implements several time series filters, including the Hodrick-Prescott filter, which I think is non-causal.
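For instance, a minimal sketch of the Hodrick-Prescott filter from statsmodels (the series and smoothing parameter are placeholders):

import numpy as np
import statsmodels.api as sm

x = np.random.randn(300).cumsum()  # placeholder time series
cycle, trend = sm.tsa.filters.hpfilter(x, lamb=1600)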
