Get parameters of fitted model of KernelRidge class scikit learn library - python

I want to use the KernelRidge class of the scikit-learn library to fit a nonlinear regression model to my data, but I am confused about how to do that.
from sklearn.kernel_ridge import KernelRidge
import numpy as np
n_samples, n_features = 20,1
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
Krr = KernelRidge(alpha=1.0, kernel='linear',degree = 4)
Krr.fit(X, y)
I am expecting 5 coefficients to be set for this model, how can I get them?
The above code will transform 1-D data to 4-D space and fit the model to the data. I think it should find best c0,c1,c2,c3,c4 according to the training data. My question is how can I access c0,c1,c2,c3,c4?
EDIT:
I made a mistake in my code above: the kernel parameter should be "polynomial" instead of "linear" in line 7.
Krr = KernelRidge(alpha=1.0, kernel='polynomial',degree = 4)
But my question is same as before.

http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.KernelRidge
dual_coef_ : array, shape = [n_features] or [n_targets, n_features]
so
Krr.dual_coef_
should do it.
EDIT:
Ok, so dual_coef_ holds the coefficients in the kernel space. For a linear kernel, the kernel K(X, X') is X * X'.T, so on the training data it is an NxN matrix (N being the number of samples), hence the number of coefficients equals the length of y.
There are 3 equations we need to understand.
The first is the standard ridge regression weight estimate.
The second is the partially kernelised version, and the third is the relation linking the two.
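In their standard textbook form (a reconstruction under the usual ridge-regression notation, assuming λ is the regularisation strength that scikit-learn calls alpha, and α is the dual coefficient vector), these three relations are usually written as:

```latex
\begin{align}
w &= (X^\top X + \lambda I)^{-1} X^\top y \\
\alpha &= (K + \lambda I)^{-1} y, \qquad K = X X^\top \\
w &= X^\top \alpha
\end{align}
```

The third line follows from the first two through the identity $(X^\top X + \lambda I)^{-1} X^\top = X^\top (X X^\top + \lambda I)^{-1}$.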
dual_coef_ returns the alpha of equation 2. Therefore, to get the weight vector in the 'normal' space rather than in the kernel space in which it is returned, you need to compute X.T * Krr.dual_coef_
We can check this is correct because KRR and Ridge Regression are the same if the kernel is linear.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
Krr = KernelRidge(alpha=1.0, kernel='linear', coef0=0)
R = Ridge(alpha=1.0,fit_intercept=False)
Krr.fit(X, y)
R.fit(X, y)
print(np.dot(X.transpose(), Krr.dual_coef_))
print(R.coef_)
This outputs:
[-0.03997686]
[-0.03997686]
This shows they are equivalent (you have to set fit_intercept=False on Ridge, because the defaults differ: KernelRidge does not fit an intercept, while Ridge does by default).
As the degree parameter is ignored for a linear kernel, as I mentioned in the comments, the coefficient should be 1x1 in this case (as it is).
If you want to know exactly what a particular model returns, I recommend looking at the source code on github, which I think is the only way to gain a deeper understanding of how this stuff works. https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/kernel_ridge.py
Additionally, for a non-linear kernel, the intuition of the weights can easily be lost, so always start from first principles if you do this.

Illustration of how KernelRidge prediction works. Hope it will help someone to understand the model.
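As a rough sketch of the same idea in code: a KernelRidge prediction is a kernel-weighted sum over the training points, f(x) = sum_i dual_coef_[i] * K(x, x_i). The example below assumes a polynomial kernel with gamma=1 and coef0=1 set explicitly (these values are my choice for illustration, not taken from the question), so that for 1-D input the kernel is exactly (x * x_i + 1) ** 4 and the five polynomial coefficients c0..c4 can be recovered with the binomial theorem.

```python
from math import comb

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.RandomState(0)
X = rng.randn(20, 1)
y = rng.randn(20)

# gamma and coef0 are set explicitly so the kernel is exactly (x * x' + 1) ** 4
krr = KernelRidge(alpha=1.0, kernel="polynomial", degree=4, gamma=1.0, coef0=1.0)
krr.fit(X, y)

# Prediction by hand: kernel between new and training points, times dual_coef_
X_new = np.linspace(-2.0, 2.0, 7).reshape(-1, 1)
K_new = polynomial_kernel(X_new, X, degree=4, gamma=1.0, coef0=1.0)
manual_pred = K_new @ krr.dual_coef_

# Binomial expansion of (x * x_i + 1) ** 4 gives
# c_k = C(4, k) * sum_i dual_coef_[i] * x_i ** k
x_train = X.ravel()
coeffs = np.array(
    [comb(4, k) * np.sum(krr.dual_coef_ * x_train ** k) for k in range(5)]
)
# np.polyval expects the highest-degree coefficient first
poly_pred = np.polyval(coeffs[::-1], X_new.ravel())

print(coeffs)  # c0, c1, c2, c3, c4
print(np.allclose(manual_pred, krr.predict(X_new)))
print(np.allclose(poly_pred, krr.predict(X_new)))
```

Note that these coefficients come from a ridge penalty applied in the kernel's feature space, so they will generally differ from an ordinary least-squares polynomial fit.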

Related

Calculate odds ratio with different method in python [duplicate]

When I perform a logistic regression using the two APIs (statsmodels and scikit-learn), they give different coefficients.
Even this simple example doesn't produce the same coefficients. I followed advice from older questions on the same topic, such as setting a large value for the parameter C in sklearn so that the penalization almost vanishes (or setting penalty="none").
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
n = 200
x = np.random.randint(0, 2, size=n)
y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)
display(pd.crosstab( y, x ))
max_iter = 100
#### Statsmodels
res_sm = sm.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sm.params)
#### Scikit-Learn
res_sk = LogisticRegression( solver='newton-cg', multi_class='multinomial', max_iter=max_iter, fit_intercept=True, C=1e8 )
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.coef_)
For example, I just ran the above code and got 1.72276655 for statsmodels and 1.86324749 for sklearn. When run multiple times, it always gives different coefficients (sometimes closer than others, but still different).
Thus, even with this toy example the two APIs give different coefficients (and so different odds ratios), and with real data (not shown here) it almost gets "out of control"...
Am I missing something? How can I produce similar coefficients, agreeing to at least one or two decimal places?
There are some issues with your code.
To start with, the two models you show here are not equivalent: although you fit your scikit-learn LogisticRegression with fit_intercept=True (which is the default setting), you don't do so with your statsmodels one; from the statsmodels docs:
An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
It seems that this is a frequent point of confusion - see for example scikit-learn & statsmodels - which R-squared is correct? (and own answer there as well).
The other issue is that, although you are in a binary classification setting, you ask for multi_class='multinomial' in your LogisticRegression, which should not be the case.
The third issue is that, as explained in the relevant Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels:
There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number.
which makes the two models again non-comparable in principle, but you have successfully addressed it here by setting C=1e8. In fact, since then (2016), scikit-learn has indeed added a way to switch regularization off, by setting penalty='none' since, according to the docs:
If ‘none’ (not supported by the liblinear solver), no regularization is applied.
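which should now be considered the canonical way to switch off the regularization (from scikit-learn 1.2 on, this is spelled penalty=None; the string 'none' was deprecated and later removed).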
So, incorporating these changes in your code, we have:
np.random.seed(42) # for reproducibility
#### Statsmodels
# first artificially add intercept to x, as advised in the docs:
x_ = sm.add_constant(x)
res_sm = sm.Logit(y, x_).fit(method="ncg", maxiter=max_iter) # x_ here
print(res_sm.params)
Which gives the result:
Optimization terminated successfully.
Current function value: 0.403297
Iterations: 5
Function evaluations: 6
Gradient evaluations: 10
Hessian evaluations: 5
[-1.65822763 3.65065752]
with the first element of the array being the intercept and the second the coefficient of x. While for scikit learn we have:
#### Scikit-Learn
res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, penalty='none')
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.intercept_, res_sk.coef_)
with the result being:
[-1.65822806] [[3.65065707]]
These results are practically identical: they agree to about six decimal places, which is within the solvers' convergence tolerance.
Repeating the procedure for different values of np.random.seed() does not change the essence of the results shown above.
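One further sanity check, sketched under my own assumptions (a seeded generator and a large C instead of penalty='none', so that it runs on old and new scikit-learn versions alike): with a single binary predictor, the unpenalized maximum-likelihood coefficient equals the log of the odds ratio of the 2x2 table, and the intercept equals the log odds of y at x = 0. Both can be computed directly from the counts and compared with the fitted model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200
x = rng.integers(0, 2, size=n)
y = (x > 0.5 + rng.normal(0, 0.5, n)).astype(int)

# 2x2 table counts: n_yx = number of samples with outcome y and predictor x
n00 = np.sum((y == 0) & (x == 0))
n01 = np.sum((y == 0) & (x == 1))
n10 = np.sum((y == 1) & (x == 0))
n11 = np.sum((y == 1) & (x == 1))

# Closed-form maximum-likelihood estimates for a single binary predictor
log_odds_ratio = np.log((n11 * n00) / (n10 * n01))
intercept = np.log(n10 / n00)

# A very large C makes the L2 penalty negligible
clf = LogisticRegression(solver="newton-cg", C=1e8, max_iter=100)
clf.fit(x.reshape(-1, 1), y)

print(intercept, log_odds_ratio)
print(clf.intercept_[0], clf.coef_[0, 0])
print("odds ratio:", np.exp(clf.coef_[0, 0]))
```

If the fitted values and the table-based values disagree noticeably, the model being fitted is not the plain logistic regression with intercept (for example, the intercept or the penalty setting differs).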

How PCA computes the transformed version in `sklearn`?

I'm confused about sklearn's PCA (here is the documentation) and its relation to the Singular Value Decomposition (SVD).
In Wikipedia we have,
The full principal components decomposition of X can, therefore, be given as $T = XW$,
where W is a p-by-p matrix of weights whose columns are the eigenvectors of $X^T X$. The transpose of W is sometimes called the whitening or sphering transformation.
Later, once it explains the relationship with the SVD, we have:
$X = U \Sigma W^T$
So I assume that the matrix W embeds samples into the latent space (which makes sense given the dimensions of the matrices), and that the transform method of the PCA class in sklearn should give the same result as multiplying the observation matrix by W. However, I checked them and they don't match.
Is there something I'm missing, or is there a bug in the code?
import numpy as np
from sklearn.decomposition import PCA
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
pca = PCA().fit(x)
# transformed version based on WIKI: t = X@vh.T = u@np.diag(s)
t_svd1= x@vh.T
t_svd2= u@np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # should be a small value, but it's not :(
print(np.abs(t_svd2-t_pca).max()) # should be a small value, but it's not :(
There is a difference between the theoretical Wikipedia description and the practical sklearn implementation, but it is not a bug, merely a stability and reproducibility enhancement.
You have pretty much nailed the exact implementation of the PCA. However, to make the computation fully reproducible, the sklearn developers added one more constraint to their implementation. The problem stems from the non-uniqueness of the SVD, i.e. the SVD does not have a unique solution. That can easily be seen from your equation by setting U_s = -U and W_s = -W; then U_s and W_s also satisfy:
$X = U_s \Sigma W_s^T$
More importantly, this also holds when switching the signs of individual columns of U and W: if we reverse the signs of just the k-th column of U and of W, the equality still holds. You can read more about this issue, for example, here: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2007/076422.pdf.
The PCA implementation deals with this problem by forcing the entry with the largest absolute value in each component to be positive; specifically, sklearn.utils.extmath.svd_flip is used. This way, no matter which signs np.linalg.svd happens to return, the signs of the resulting matrices are always the same.
Thus in order for your code to have the same result as the PCA implementation:
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(41)
x = np.random.rand(200).reshape(20,10)
x = x-x.mean(axis=0)
u, s, vh = np.linalg.svd(x, full_matrices=False)
max_abs_cols = np.argmax(np.abs(u), axis=0)
signs = np.sign(u[max_abs_cols, range(u.shape[1])])
u *= signs
vh *= signs.reshape(-1,1)
pca = PCA().fit(x)
# transformed version based on WIKI: t = X@vh.T = u@np.diag(s)
t_svd1= x@vh.T
t_svd2= u@np.diag(s)
# the pca transform
t_pca = pca.transform(x)
print(np.abs(t_svd1-t_pca).max()) # pretty small value :)
print(np.abs(t_svd2-t_pca).max()) # pretty small value :)
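The sign ambiguity itself can be checked in a few lines. The sketch below (the seed and the column index k are arbitrary choices of mine) flips one column of U together with the matching row of W^T, confirms that the product still reconstructs X, and then compares the scores with PCA's in absolute value, which sidesteps whichever sign convention the installed scikit-learn version applies.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.random((20, 10))
x = x - x.mean(axis=0)

u, s, vh = np.linalg.svd(x, full_matrices=False)

# Flip the sign of the k-th column of U and the k-th row of W^T:
# the product U @ diag(s) @ W^T is unchanged, so both are valid SVDs.
k = 3
u_flipped, vh_flipped = u.copy(), vh.copy()
u_flipped[:, k] *= -1
vh_flipped[k, :] *= -1
same_x = np.allclose(u_flipped @ np.diag(s) @ vh_flipped, x)

# The scores can therefore differ from PCA's only by a per-column sign,
# so their absolute values agree whatever sign convention is used.
t_svd = x @ vh.T
t_pca = PCA().fit_transform(x)
same_up_to_sign = np.allclose(np.abs(t_svd), np.abs(t_pca))

print(same_x, same_up_to_sign)
```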

Scaling wide range datas in scikit learn

I'm trying to use an MLPRegressor from scikit-learn to do a nonlinear regression on a set of 260 examples (X, Y). Each example consists of 200 features for X and 1 target value for Y.
File containing X
File containing Y
The link between X and Y is not obvious if directly plotted together but if we plot x=log10(sum(X)) and y=log10(Y), the link between both is almost linear.
As a first approach, I tried to apply my neural network directly on X and Y without success.
I have read that scaling would improve the regression. In my case, Y contains data over a very wide range of values (from 10e-12 to 10e-5). When computing the error, 10e-5 of course has much more weight than 10e-12, but I would like my neural network to approximate both correctly. With a linear scaling, say preprocessing.MinMaxScaler from scikit-learn, 10e-8 ~ -0.99 and 10e-12 ~ -1, so I'm losing all the information in my target.
My question here is: what kind of scaling could I use to get consistent results?
The only solution I have found is to apply log10(Y) but of course, error is increased exponentially.
The best I could get is with the code below:
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"]=(20,10)
freqIter=[]
for i in np.arange(0,0.2,0.001):
    freqIter.append([i,i+0.001])
#############################################################################
X = np.zeros((len(learningFiles),len(freqIter)))
Y = np.zeros(len(learningFiles))
# Import X: loadtxt()
# Import Y: loadtxt
maxy = np.amax(Y)
Y *= 1/maxy
Y = Y.reshape(-1, 1)
maxx = np.amax(X)
X *= 1/maxx
#############################################################################
reg = MLPRegressor(hidden_layer_sizes=(8,2), activation='tanh', solver='adam', alpha=0.0001, learning_rate='adaptive', max_iter=10000, verbose=False, tol = 1e-7)
reg.fit(X, Y)
#############################################################################
plt.scatter([np.log10(np.sum(kou*maxx)) for kou in X],Y*maxy,label = 'INPUTS',color='blue')
plt.scatter([np.log10(np.sum(kou*maxx)) for kou in X],reg.predict(X)*maxy,label='Predicted',color='red')
plt.grid()
plt.legend()
plt.show()
Result:
Thanks for your help.
You may want to look at a FunctionTransformer. The example given applies a logarithmic transformation as part of pre-processing. You can also do it for an arbitrary mathematical function.
I would also suggest trying a ReLU activation function if you scale logarithmically. After the transformation your data looks fairly linear, so it may converge a little faster -- but that's just a hunch.
I've finally found something that works well in my case.
First, I used a log scaling for Y. I think it is the most suitable scaling when the range of values is as wide as mine (from 10e-12 to 10e-5). The target is then between -5 and -12.
Secondly, my mistake in scaling X was to apply the same scaling to all features. My X contains 200 features, and I was dividing by the maximum over all features of all examples. My solution is to scale feature1 by the maximum of feature1 over all examples, and then to repeat this for every feature. This gives me feature1 between 0 and 1 across the examples, instead of far less previously (feature1 could be between 0 and 0.0001 with my previous scaling).
I get better results; my main issue now is to select the correct parameters (number of layers, tolerance, ...), but that is another problem.
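The two steps described above (log-scaling the target and scaling each feature by its own maximum) can be sketched with scikit-learn's own tools. The data here is synthetic stand-in data, and the choice of MaxAbsScaler and TransformedTargetRegressor is mine rather than the poster's; the network size is also only illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 260 samples, 5 features with very different ranges
feature_ranges = np.array([1.0, 1e-2, 1e-4, 1e-1, 1e-3])
X = rng.random((260, 5)) * feature_ranges
# Target spanning many orders of magnitude (roughly 1e-12 to 1e-5)
y = 10.0 ** (-12.0 + 7.0 * (X / feature_ranges).mean(axis=1))

# Dividing by one global maximum leaves the small features squashed near zero
X_global = X / X.max()
# MaxAbsScaler divides each column by its own maximum absolute value
X_per_feature = MaxAbsScaler().fit_transform(X)
print(X_global.max(axis=0))
print(X_per_feature.max(axis=0))


def pow10(z):
    """Inverse of np.log10."""
    return 10.0 ** z


# The regressor is trained on log10(y); predictions are mapped back with 10 ** z
model = TransformedTargetRegressor(
    regressor=make_pipeline(
        MaxAbsScaler(),
        MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
    ),
    func=np.log10,
    inverse_func=pow10,
)
model.fit(X, y)
pred = model.predict(X)
print(np.log10(pred[:5]), np.log10(y[:5]))
```

Wrapping the scaler and the regressor in one object means the same transformations are applied at prediction time, so there is no need to multiply by stored maxima by hand.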

How do I improve a Gaussian/Normal fit in Python 3.X by using a running median?

I have an array of 100x100 data points, where I'm trying to perform a Gaussian fit to each column of 100 values in the array. I then want the parameters of the Gaussian found by using the fit of the first column to be the initial parameters of the starting point for the next column to use. Let's say I start with the initial parameters of 1000, 0, and 1, and the fit finds values of 800, 3, and 1.5. I then want the fitter to use these three parameters as initial values for the next column.
My code is:
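import numpy as np
from astropy.modeling import models, fitting  # assuming astropy.modeling, which provides these names
x = np.linspace(-50,50,100)
Gauss_Model = models.Gaussian1D(amplitude = 1000., mean = 0, stddev = 1.)
Fitting_Model = fitting.LevMarLSQFitter()
Fit_Data = []
for i in range(0, Data_Array.shape[1]):  # loop over the columns
    Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))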
Right now it uses the same initial values for every fit. Does anyone know how to perform such a running median/mean for a Gaussian fitting method? Would really appreciate any help or being pointed in the right direction, thanks!
I'm not familiar with the specific library you are using, but if you can get your fitted parameters out with something like Fit_Data[-1].amplitude or Fit_Data[-1].mean, then you could modify your loop to use something like:
for i in range(0, Data_Array.shape[1]):
    if Fit_Data:  # true if not an empty list
        Gauss_Model = models.Gaussian1D(amplitude=Fit_Data[-1].amplitude,
                                        mean=Fit_Data[-1].mean,
                                        stddev=Fit_Data[-1].stddev)
    Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
basically checking whether you have already fit a model, and if you have, use the most recent fitted amplitude, mean, and standard deviation as the starting point for your next Gauss_Model.
A thought: this might speed up your fitting, but it shouldn't result in a "better" fit to the 100 data points in each fit operation. Your resulting model is probably already the best-fit model for the data it was given. If you want to estimate the error in the parameters of your model, you can use the fact that, for two independent normal distributions A ~ N(m_a, v_a) and B ~ N(m_b, v_b), the distribution of A + B has mean m_a + m_b and variance v_a + v_b. Thus, the average of your n fitted means is distributed as N(sum(means)/n, sum(variances)/n^2). Basically, you can say that your true mean is centered at the mean of your means, with standard deviation sqrt(sum(variances))/n.
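As a small worked example of combining per-column estimates: for independent estimates, the variance of a sum is the sum of the variances, and dividing the sum by n divides the variance by n**2. The numbers below are hypothetical, made up purely for illustration.

```python
import numpy as np

# Hypothetical per-column estimates of the Gaussian centre and their standard errors
means = np.array([2.9, 3.1, 3.0, 3.2])
std_errors = np.array([0.2, 0.2, 0.1, 0.3])

n = len(means)
combined_mean = means.sum() / n
# Var(sum) = sum of variances; dividing the sum by n divides the variance by n**2
combined_std = np.sqrt(np.sum(std_errors ** 2)) / n

print(combined_mean, combined_std)
```

This only holds if the columns are independent measurements of the same underlying centre; if the centre drifts from column to column, the average is not a meaningful estimate.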
I also cannot tell what library you are using, and the details of how to do this probably depend on the details of how that library stores the fitted values. I can say that for lmfit (https://lmfit.github.io/lmfit-py/) we struggled with this sort of usage and arrived at a design that makes what you are trying to do pretty easy. With lmfit, you might compose this problem as:
import numpy as np
from lmfit.models import GaussianModel
x = np.linspace(-50,50,100)
# get Data_Array from somewhere....
# create a model for a Gaussian
Gauss_Model = GaussianModel()
# make a set of parameters, setting initial values
params = Gauss_Model.make_params(amplitude=1000, center=0, sigma=1.0)
Fit_Results = []
for i in range(Data_Array.shape[1]):
    result = Gauss_Model.fit(Data_Array[:, i], params, x=x)
    Fit_Results.append(result)
    # update `params` with the current best fit params for the next column
    params = result.params
Note that this works because lmfit is careful that Model.fit() will not alter the input parameters, and will put the resulting best-fit parameters for each fit in result.params.
And, if you decide you do want to have all columns use the original initial values, just comment out that last params = result.params.
Lmfit has a lot more bells and whistles, but I hope that helps you do what you need.
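The same warm-start pattern can also be sketched with scipy.optimize.curve_fit, which is a swap-in of mine rather than the library used in the question. The data is a synthetic stand-in for Data_Array, and the gaussian helper and the initial guess are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian(x, amplitude, mean, stddev):
    return amplitude * np.exp(-0.5 * ((x - mean) / stddev) ** 2)


# Synthetic stand-in for Data_Array: each column is a noisy Gaussian whose
# centre drifts slowly from one column to the next
rng = np.random.default_rng(0)
x = np.linspace(-50, 50, 100)
true_means = 3.0 + 0.5 * np.arange(10)
Data_Array = np.column_stack(
    [gaussian(x, 800.0, m, 4.0) + rng.normal(0.0, 5.0, x.size) for m in true_means]
)

p0 = (1000.0, 0.0, 3.0)  # initial guess, used for the first column only
fitted = []
for i in range(Data_Array.shape[1]):
    popt, _ = curve_fit(gaussian, x, Data_Array[:, i], p0=p0)
    fitted.append(popt)
    p0 = popt  # warm start: the next column begins from this fit

fitted = np.array(fitted)
print(fitted[:, 1])  # fitted centres, one per column
```

Carrying the previous fit forward mainly helps when neighbouring columns are similar; if one column fits badly, its parameters become a poor starting point for the next, so it can be worth falling back to the original guess in that case.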

Fit mixture of Gaussians with fixed covariance in Python

I have some 2D data (GPS data) with clusters (stop locations) that I know resemble Gaussians with a characteristic standard deviation (proportional to the inherent noise of GPS samples). The figure below visualizes a sample that I expect has two such clusters. The image is 25 meters wide and 13 meters tall.
The sklearn module has a class sklearn.mixture.GaussianMixture which allows you to fit a mixture of Gaussians to data. It has a parameter, covariance_type, that lets you make different assumptions about the shape of the Gaussians. You can, for example, make all components share the same covariance matrix using the 'tied' argument.
However, it does not appear directly possible to assume the covariance matrices to remain constant. From the sklearn source code it seems trivial to make a modification that enables this but it feels a bit excessive to make a pull request with an update that allows this (also I don't want to accidentally add bugs in sklearn). Is there a better way to fit a mixture to data where the covariance matrix of each Gaussian is fixed?
I want to assume that the SD should remain constant at around 3 meters for each component, since that is roughly the noise level of my GPS samples.
It is simple enough to write your own implementation of the EM algorithm. It would also give you a good intuition for the process. I assume that the covariance is known and that the prior probabilities of the components are equal, and fit only the means.
The class would look like this (in Python 3):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

class FixedCovMixture:
    """ The model to estimate gaussian mixture with fixed covariance matrix. """
    def __init__(self, n_components, cov, max_iter=100, random_state=None, tol=1e-10):
        self.n_components = n_components
        self.cov = cov
        self.random_state = random_state
        self.max_iter = max_iter
        self.tol=tol

    def fit(self, X):
        # initialize the process:
        np.random.seed(self.random_state)
        n_obs, n_features = X.shape
        self.mean_ = X[np.random.choice(n_obs, size=self.n_components)]
        # make EM loop until convergence
        i = 0
        for i in range(self.max_iter):
            new_centers = self.updated_centers(X)
            if np.sum(np.abs(new_centers-self.mean_)) < self.tol:
                break
            else:
                self.mean_ = new_centers
        self.n_iter_ = i

    def updated_centers(self, X):
        """ A single iteration """
        # E-step: estimate probability of each cluster given cluster centers
        cluster_posterior = self.predict_proba(X)
        # M-step: update cluster centers as weighted average of observations
        weights = (cluster_posterior.T / cluster_posterior.sum(axis=1)).T
        new_centers = np.dot(weights, X)
        return new_centers

    def predict_proba(self, X):
        likelihood = np.stack([multivariate_normal.pdf(X, mean=center, cov=self.cov)
                               for center in self.mean_])
        cluster_posterior = (likelihood / likelihood.sum(axis=0))
        return cluster_posterior

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=0)
On the data like yours, the model would converge quickly:
np.random.seed(1)
X = np.random.normal(size=(100,2), scale=3)
X[50:] += (10, 5)
model = FixedCovMixture(2, cov=[[3,0],[0,3]], random_state=1)
model.fit(X)
print(model.n_iter_, 'iterations')
print(model.mean_)
plt.scatter(X[:,0], X[:,1], s=10, c=model.predict(X))
plt.scatter(model.mean_[:,0], model.mean_[:,1], s=100, c='k')
plt.axis('equal')
plt.show();
and output
11 iterations
[[9.92301067 4.62282807]
[0.09413883 0.03527411]]
You can see that the estimated centers ((9.9, 4.6) and (0.09, 0.03)) are close to the true centers ((10, 5) and (0, 0)).
I think the best option would be to "roll your own" GMM by defining a new class that inherits from scikit-learn's GaussianMixture and overrides the methods needed to get the behavior you want. This way you have your own implementation and you don't have to change the scikit-learn code (or create a pull request).
Another option that might work is to look at the Bayesian version of GMM in scikit-learn. You might be able to set the prior for the covariance matrix so that the covariance is fixed. It seems to use the Wishart distribution as a prior for the covariance. However I'm not familiar enough with this distribution to help you out more.
First, you can use the spherical option, which gives you a single variance value for each component. This way you can check yourself: if the fitted variances differ too much from each other, something went wrong.
If you want to preset the variance, your problem reduces to finding only the best centers for your components. You can do that with k-means, for example. If you don't know the number of components, you can sweep over all plausible values (say 1 to 20) and evaluate the decrease in fitting error. Or you can write your own EM routine that finds the centers and the number of components simultaneously.
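Both suggestions can be tried quickly on synthetic data. This is only a sketch: the two clusters, their standard deviation of 3 and the seeds are my own stand-ins for the GPS data, not values from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: two clusters with a standard deviation of 3
rng = np.random.default_rng(1)
X = rng.normal(scale=3.0, size=(1000, 2))
X[500:] += (10.0, 5.0)

# Check: a spherical mixture estimates one variance per component
gmm = GaussianMixture(n_components=2, covariance_type="spherical", random_state=0)
gmm.fit(X)
fitted_sd = np.sqrt(gmm.covariances_)  # shape (n_components,)
print(fitted_sd)

# With the variance treated as known, only the centres remain to be found
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
print(centers)
```

If the fitted standard deviations come out far from the expected noise level, that suggests the assumed number of components is wrong or the clusters are not as Gaussian as assumed.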
