Efficient regression of multiple outputs in Python

Say I have a predictor array x of shape (n, px) and a predicted array y of shape (n, py).
What would be the best way in Python to compute a linear regression from x to each dimension of y (1...py)?
The output of the whole thing would be a matrix of shape (py, px) (px parameters for each output).
I could of course iterate over the output dimensions, computing an ordinary single-output, multivariate-input OLS for each, but that would be inefficient, since I would recompute the pseudo-inverse of x every time.
Is there an efficient implementation out there?
I could not find one (not even http://wiki.scipy.org/Cookbook/OLS).

I figured scikit-learn would have done this already, so I looked at the source code and discovered that they use scipy.linalg.lstsq (see line 379).
According to the docs, the scipy version of lstsq does indeed accept a matrix as the b parameter. (Actually the numpy version accepts a matrix value as well.)
Maybe these are what you're looking for?
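For example, here is a minimal sketch (made-up shapes, random data) of solving all the regressions with a single lstsq call by passing a 2-D right-hand side:
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
n, px, py = 100, 3, 4
x = rng.normal(size=(n, px))
y = rng.normal(size=(n, py))

# lstsq accepts a 2-D b, so all py regressions share one factorization of x.
coef, residues, rank, sv = lstsq(x, y)   # coef has shape (px, py)
coef = coef.T                            # transpose to the requested (py, px) layout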

The fit() method of sklearn.linear_model.LinearRegression accepts a multi-target y, so this is now handled natively in sklearn. Just pass a 2-dimensional array of shape (n_samples, n_targets) as the y argument of fit(X, y).
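As a quick sketch (random data, shapes chosen to match the question), the multi-target path looks like this; coef_ comes back as (n_targets, n_features), i.e. the (py, px) matrix the question asks for:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # (n_samples, n_features)
y = rng.normal(size=(100, 4))        # (n_samples, n_targets)

model = LinearRegression().fit(X, y)
print(model.coef_.shape)             # (4, 3), i.e. (n_targets, n_features)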

Related

What parameters do I provide to the fit function of the FeatureAgglomeration in sklearn for precomputed distances?

I want to use FeatureAgglomeration in SKLearn with a number of data points (each defined by multiple features) and a pre-defined affinity (or distance) matrix.
I've used AgglomerativeClustering with affinity set to precomputed, and in the fit function I've provided a matrix with my precomputed distances, with much success.
With FeatureAgglomeration I understand that I have to provide both the features of each data point and the precomputed distances, but I can't work out what input is meant to be provided to the fit function (fit(X, y=None)), and it doesn't appear to be documented (specifically for the case where you are using precomputed distances).
I faced the same issue, and the documentation is not helpful here. However, I figured out that fit() expects the precomputed distance matrix, while transform() expects the feature matrix whose dimensionality is to be reduced. This becomes clear when reading the source code of FeatureAgglomeration, particularly the following two parts:
https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/cluster/_agglomerative.py#L1237
and
https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/cluster/_feature_agglomeration.py#L18
Hope this helps.
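For what it's worth, here is a minimal sketch of that reading on synthetic data. The affinity='precomputed' spelling and the exact semantics may differ between scikit-learn versions (newer releases use metric= instead), so treat this as an assumption rather than a definitive recipe:
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                  # (n_samples, n_features)

# Precomputed pairwise distance between the 6 features (here: 1 - |correlation|).
D = 1.0 - np.abs(np.corrcoef(X.T))

agglo = FeatureAgglomeration(n_clusters=3, affinity='precomputed', linkage='average')
agglo.fit(D)                                  # fit() receives the distance matrix
X_reduced = agglo.transform(X)                # transform() receives the feature matrix
print(X_reduced.shape)                        # (50, 3)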

How to change scipy curve_fit/least_squares step size?

I have a Python function that takes a bunch (1 or 2) of arguments and returns a 2D array. I have been trying to use scipy curve_fit and least_squares to optimize the input arguments so that the resultant 2D array matches another 2D array that has been pre-made. I ran into the problem of both methods returning the initial guess as the converged solution. After ripping much hair from my head, I figured out that the small increment the optimizer makes to the initial guess is too small to make any difference in the 2D array that my function returns (the cell values in the array are quantized, not continuous), so scipy assumes it has reached convergence (or a local minimum) at the initial guess.
I was wondering if there is a way around this (such as forcing it to use a bigger increment while guessing).
Thanks.
I ran into a very similar problem recently, and it turns out that these kinds of optimizers only work for continuously differentiable functions. That's why they return the initial parameters: the function you want to fit cannot be differentiated. In my case, I could manually make my fit function differentiable by first fitting a polynomial function to it before plugging it into the curve_fit optimizer.
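One possible reading of that workaround, sketched on a made-up 1-D example (the quantized_model, the parameter grid and the polynomial degree below are all placeholders): sample the quantized model over a grid of parameter values, fit a low-order polynomial to the sampled response, and hand the smooth surrogate to curve_fit instead of the original function.
import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(0, 10, 200)
y_target = 0.5 * x + np.random.default_rng(0).normal(scale=0.2, size=x.size)

def quantized_model(x, a):
    # Step-like output: a tiny change in a often does not change the result,
    # so curve_fit's finite-difference Jacobian is zero at the initial guess.
    return np.round(a * x, 1)

# Smooth surrogate: sample the quantized model over a grid of parameter values
# and fit a cubic polynomial (in a) to the response at every x position.
a_grid = np.linspace(0.1, 1.0, 30)
samples = np.array([quantized_model(x, a) for a in a_grid])   # shape (30, 200)
coeffs = np.polyfit(a_grid, samples, deg=3)                   # shape (4, 200)

def smooth_model(x_unused, a):
    powers = np.array([a**3, a**2, a, 1.0])   # highest power first, as polyfit returns
    return powers @ coeffs                     # smooth approximation, shape (200,)

popt, _ = curve_fit(smooth_model, x, y_target, p0=[0.3])
print(popt)   # now moves away from the initial guess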

SciPy: n-dimensional interpolation of sparse data

I currently have a collection of n-dimensional data points, each with a value associated with it (n typically will range from 2 to 4).
I would like to employ some form of non-linear interpolation on the data points I am supplied with, so that I can try to minimise this value. Of course, I am open to better methods of minimising the value.
At the moment, I have code that works for 1D and 2D arrays:
import numpy as np
from scipy.interpolate import griddata
mesh = np.meshgrid(*[i['grid2'] for i in self.cambParams], indexing='ij')  # one grid axis per parameter
chi2 = griddata(data[:,:-1], data[:,-1], tuple(mesh), method='cubic')      # rows of data: (x1, ..., xn, value)
However, scipy.interpolate.griddata only supports linear interpolation in more than two dimensions, which makes interpolation useless for my purpose: the minimum of a piecewise-linear interpolant is always one of the supplied data points. Does anyone know of an alternative interpolation method that might work, or of a better way of solving the problem in general?
Cheers
I received a tip from an external source that works, so I'm posting the answer in case it helps anyone in the future.
SciPy has an Rbf interpolation method (radial basis function) which allows better than linear interpolation at arbitrary dimensions.
Taking a variable data with rows of (x1, x2, x3, ..., xn, v) values, the following modification to the code in the original post allows for interpolation:
from scipy.interpolate import Rbf

rbfi = Rbf(*data.T)                      # columns of data are x1, ..., xn, v
mesh = np.meshgrid(*[i['grid2'] for i in self.cambParams], indexing='ij')
chi2 = rbfi(*mesh)                       # evaluate the interpolant on the full grid
The Rbf documentation is useful, and there is a simple, easy-to-follow example in the SciPy docs, which will make more sense than the code snippet above.
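For a self-contained illustration (made-up 3-D data, arbitrary grid resolution), the same idea can be tested without the original self.cambParams structure; the dense-grid argmin at the end is only a rough way of locating the minimum:
import numpy as np
from scipy.interpolate import Rbf

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(200, 3))          # 200 points in 3-D
v = (pts ** 2).sum(axis=1)                       # value attached to each point
data = np.column_stack([pts, v])                 # rows of (x1, x2, x3, v)

rbfi = Rbf(*data.T)                              # radial basis interpolant

grid = np.meshgrid(*[np.linspace(-1, 1, 25)] * 3, indexing='ij')
chi2 = rbfi(*grid)                               # interpolated values on the grid
imin = np.unravel_index(np.argmin(chi2), chi2.shape)
print([g[imin] for g in grid], chi2[imin])       # rough location and value of the minimum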

Generate two-dimensional normal distribution given a mean and standard deviation

I'm looking for a two-dimensional analog of the numpy.random.normal routine: numpy.random.normal generates a one-dimensional array given a mean, a standard deviation and a sample count as input, and I want a way to generate points in two-dimensional space from those same input parameters.
Looks like numpy.random.multivariate_normal can do this, but I don't quite understand what the cov parameter is supposed to be. The following excerpt describes this parameter in more detail and is from the scipy docs:
Covariance matrix of the distribution. Must be symmetric and
positive-semidefinite for “physically meaningful” results.
Later in the page, in the examples section, a sample cov value is given:
cov = [[1,0],[0,100]] # diagonal covariance, points lie on x or y-axis
The concept is still quite opaque to me, however.
If someone could clarify what the cov should be or suggest another way to generate points in two-dimensional space given a mean and standard deviation using python I would appreciate it.
If you pass size=[1, 2] to the normal() function, you get a 2D array, which is actually what you're looking for:
>>> numpy.random.normal(size=[1, 2])
array([[-1.4734477 , -1.50257962]])
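That said, if you do want the multivariate_normal route mentioned in the question, a sketch with explicit per-axis standard deviations looks like this (the diagonal of cov holds the variances, i.e. the squared standard deviations; the numbers are arbitrary):
import numpy as np

mean = [2.0, 5.0]              # mean of x and of y
std = [1.0, 3.0]               # standard deviation of x and of y
cov = [[std[0] ** 2, 0.0],
       [0.0, std[1] ** 2]]     # variances on the diagonal; zero off-diagonals = uncorrelated

points = np.random.multivariate_normal(mean, cov, size=1000)
print(points.shape)            # (1000, 2): 1000 points in 2-D space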

What is the corresponding function for corrmtx (in MATLAB) in Python?

I'm translating some code from MATLAB to Python and I'm stuck with the corrmtx() MATLAB function. Is there any similar function in Python, or how could I replace it?
The spectrum package has such a function.
How about:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.toeplitz.html
The matlab docs for corrmtx state:
X = corrmtx(x,m) returns an (n+m)-by-(m+1) rectangular Toeplitz matrix
X, such that X'X is a (biased) estimate of the autocorrelation matrix
for the length n data vector x.
The scipy function gives the Toeplitz matrix, although I'm not sure if the implementations are identical.
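As an untested sketch of how the default 'autocorrelation' method might be reproduced with scipy.linalg.toeplitz (the zero-padding and 1/sqrt(n) scaling below are my reading of the MATLAB docs quoted above, so do verify against MATLAB output):
import numpy as np
from scipy.linalg import toeplitz

def corrmtx_autocorr(x, m):
    x = np.asarray(x, dtype=float)
    n = x.size
    first_col = np.concatenate([x, np.zeros(m)])      # length n + m
    first_row = np.concatenate([x[:1], np.zeros(m)])  # length m + 1
    return toeplitz(first_col, first_row) / np.sqrt(n)

x = np.arange(1.0, 6.0)        # n = 5
X = corrmtx_autocorr(x, 2)     # shape (7, 3)
print(X.T @ X)                 # (biased) autocorrelation matrix estimate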
Here is a list of alternatives that can help you in translating your code, all of which contain that function:
scipy (toeplitz | corrmtx)
spectrum (corrmtx)
The following is a link to another post that tells you how to use numpy for the autocorrelation, since that seems to be the default functionality of corrmtx.
Additional Information:
Finding the correlation matrix in Python
Unbiased Estimation of Covariance Matrix
