R lm versus Python sklearn linear_model

When studying Python's scikit-learn, the first example I come across is Generalized Linear Models.
Here is the code of its very first example:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2,2]], [0, 1,2])
reg.coef_
array([ 0.5, 0.5])
Here I assume [[0, 0], [1, 1], [2,2]] represents a data frame with x1 = c(0,1,2) and x2 = c(0,1,2), and y = c(0,1,2) as the response.
Immediately, I take array([ 0.5, 0.5]) to be the coefficients for x1 and x2.
But are there standard errors for those estimates? What about t-tests, p-values, R², and other statistics?
Then I try to do the same thing in R.
X = data.frame(x1 = c(0,1,2),x2 = c(0,1,2),y = c(0,1,2))
lm(data=X, y~x1+x2)
Call:
lm(formula = y ~ x1 + x2, data = X)
#Coefficients:
#(Intercept) x1 x2
# 1.282e-16 1.000e+00 NA
Obviously x1 and x2 are perfectly collinear, so OLS fails for one of them. Why does sklearn still work and give this result? Am I using sklearn the wrong way? Thanks.

Both solutions are correct (assuming that NA behaves like a zero). Which solution is favored depends on the numerical solver used by the OLS estimator.
sklearn.linear_model.LinearRegression is based on scipy.linalg.lstsq which in turn calls the LAPACK gelsd routine which is described here:
http://www.netlib.org/lapack/lug/node27.html
In particular, it says that when the problem is rank deficient, it seeks the minimum-norm least squares solution.
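To make that concrete, here is a minimal sketch (ignoring sklearn's intercept handling, which is effectively zero here) that solves the same rank-deficient system directly with scipy.linalg.lstsq and recovers the minimum-norm solution:
import numpy as np
from scipy.linalg import lstsq

# The same rank-deficient design: both columns are identical.
X = np.array([[0, 0], [1, 1], [2, 2]], dtype=float)
y = np.array([0, 1, 2], dtype=float)

coef, residues, rank, singular_values = lstsq(X, y)
print(rank)   # 1 -> the design matrix is rank deficient
print(coef)   # approximately [0.5, 0.5], the minimum-norm least squares solution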
If you want to favor the other solution, you can use a coordinate descent solver with a tiny bit of L1 penalty, as implemented in the Lasso class:
>>> from sklearn.linear_model import Lasso
>>> reg = Lasso(alpha=1e-8)
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
Lasso(alpha=1e-08, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
>>> reg.coef_
array([ 9.99999985e-01, 3.97204719e-17])

Related

Use of pykalman

I want to use pykalman to apply a Kalman filter to data from sensor variables. Now I have a doubt about the observation data: in the example, are the 3 observations two variables measured at three instants of time, or 3 variables measured at a single instant?
>>> from pykalman import KalmanFilter
>>> import numpy as np
>>> kf = KalmanFilter(transition_matrices = [[1, 1], [0, 1]], observation_matrices = [[0.1, 0.5], [-0.3, 0.0]])
>>> measurements = np.asarray([[1,0], [0,0], [0,1]]) # 3 observations
>>> kf = kf.em(measurements, n_iter=5)
>>> (filtered_state_means, filtered_state_covariances) = kf.filter(measurements)
>>> (smoothed_state_means, smoothed_state_covariances) = kf.smooth(measurements)
Let's see:
transition_matrices = [[1, 1], [0, 1]] means the state-transition matrix is the 2×2 matrix [[1, 1], [0, 1]]. Its dimension should be [n_dim_state, n_dim_state], so your state vector consists of 2 elements (for example, a position and a velocity).
observation_matrices = [[0.1, 0.5], [-0.3, 0.0]] means the observation matrix is the 2×2 matrix [[0.1, 0.5], [-0.3, 0.0]]. The dimension of an observation matrix should be [n_dim_obs, n_dim_state], so your measurement vector also consists of 2 elements.
Conclusion: the code has 3 observations of two variables measured at 3 different points in time.
You can change the given code so that it processes one measurement per time step: use kf.filter_update() for each measurement instead of kf.filter() on all measurements at once:
from pykalman import KalmanFilter
import numpy as np
kf = KalmanFilter(transition_matrices = [[1, 1], [0, 1]], observation_matrices = [[0.1, 0.5], [-0.3, 0.0]])
measurements = np.asarray([[1,0], [0,0], [0,1]]) # 3 observations
kf = kf.em(measurements, n_iter=5)
filtered_state_means = kf.initial_state_mean
filtered_state_covariances = kf.initial_state_covariance
for m in measurements:
    filtered_state_means, filtered_state_covariances = (
        kf.filter_update(
            filtered_state_means,
            filtered_state_covariances,
            observation=m)
    )

print(filtered_state_means)
Output:
[-1.69112511 0.30509999]
The result is slightly different from the one kf.filter() gives, because kf.filter() does not perform a prediction step on the first measurement (while filter_update() does), but I think it should.

Howto: CVXPY Matrix Inequality Constraints

I am trying to formulate an optimization problem in the following way:
My optimization variable x is an n*n matrix.
x should be PSD.
It should satisfy 0 <= x <= I; that is, it should range from the all-zeros square matrix to the n-dimensional identity matrix.
Here is what I have come up with so far:
import cvxpy as cp
import numpy as np
import cvxopt
x = cp.Variable((2, 2), PSD=True)
a = cvxopt.matrix([[1, 0], [0, 0]])
b = cvxopt.matrix([[.5, .5], [.5, .5]])
identity = cvxopt.matrix([[1, 0], [0, 1]])
zeros = cvxopt.matrix([[0, 0], [0, 0]])
constraints = [x >= zeros, x <= identity]
objective = cp.Maximize(cp.trace(x*a - x * b))
prob = cp.Problem(objective, constraints)
prob.solve()
This gives me [[1, 0], [0, 0]] as the optimal x, with a maximum objective value of 0.5. But that should not be the case: I ran the same program in CVX in MATLAB and got the matrix [[.85, -.35], [-.35, .14]] with an optimal value of 0.707, which is correct.
I think my constraint formulation is not correct or not following cvxpy standards. How do I enforce the constraints in my program correctly?
(Here is my matlab version of the code:)
a = [1, 0; 0, 0];
b = [.5, .5; .5, .5];
cvx_begin sdp
variable x(2, 2) hermitian;
maximize(trace(x*a - x*b))
subject to
x >= 0;
x <= eye(2);
cvx_end
TIA
You need to use a PSD constraint. When you compare matrix expressions with >= or <=, cvxpy applies elementwise inequalities; for the semidefinite order you use >> or <<. You have already constrained x to be PSD when you created it (which covers the lower bound), so all you need to change is:
constraints = [x << np.eye(2)]
Then I get your solution:
array([[ 0.85355339, -0.35355339],
       [-0.35355339,  0.14644661]])
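For completeness, a minimal sketch of the whole corrected program (same data and objective; plain NumPy arrays instead of cvxopt matrices, and @ for matrix multiplication, which newer cvxpy versions prefer):
import cvxpy as cp
import numpy as np

# x is declared PSD, which already enforces the lower bound 0 <= x in the
# semidefinite order; the upper bound x <= I uses the semidefinite operator <<.
x = cp.Variable((2, 2), PSD=True)
a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.array([[0.5, 0.5], [0.5, 0.5]])

constraints = [x << np.eye(2)]
objective = cp.Maximize(cp.trace(x @ a - x @ b))

prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.value)   # approximately 0.707
print(x.value)      # approximately [[ 0.854, -0.354], [-0.354, 0.146]]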

Return associated probability for samples from tf.multinomial in Tensorflow

I'm generating samples in TensorFlow with tf.multinomial, and I'm looking for a way to return the probability associated with each randomly selected element. So in the following case:
logits = [[-1., 0., 1], [1, 1, 1], [0, 1, 2]]
samples = tf.multinomial(logits, 2)
with tf.Session() as sess:
    sess.run(samples)
Instead of having
[[1, 2], [0, 1], [1, 1]]
as the result, I'd like to see something like
[[(1, 0.244728), (2, 0.66524)],
[(0, 0.33333), (1, 0.33333)],
[(1, 0.244728), (1, 0.244728)]]
Is there any way to achieve this?
I'm confused: does TensorFlow do some sort of transformation internally that turns your logits into probabilities? The multinomial distribution takes as parameters a set of positional probabilities that determine, probabilistically, how likely each outcome (by position) is to be sampled, i.e.
# this is all pseudocode
x = multinomial([.2, .3, .5])
y ~ x
# this will give a value of 0 20% of the time
# a value of 1 30% of the time
# and a value of 2 50% of the time
Therefore your probabilities might be your logits.
Looking at https://www.tensorflow.org/api_docs/python/tf/multinomial,
you see it states they are "unnormalized log probabilities", so if you apply that transformation (exponentiate and normalize, i.e. a softmax), you have the probabilities.
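Concretely, that transformation is a row-wise softmax. A minimal sketch (using TF 1.x eager execution, to match the style of the code elsewhere in this thread) applied to the logits from the question:
import tensorflow as tf

tf.enable_eager_execution()   # TF 1.x

logits = tf.constant([[-1., 0., 1.], [1., 1., 1.], [0., 1., 2.]])
probs = tf.nn.softmax(logits)   # exponentiate and normalize the unnormalized log probabilities
print(probs.numpy())
# rows 0 and 2: approximately [0.090, 0.245, 0.665]; row 1: [1/3, 1/3, 1/3]
# these are the per-class probabilities the question wants to attach to each sample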
You can try tf.gather_nd:
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> probs = tf.constant([[0.5, 0.2, 0.1, 0.2], [0.6, 0.1, 0.1, 0.1]], dtype=tf.float32)
>>> idx = tf.multinomial(probs, 1)
>>> row_indices = tf.range(probs.get_shape()[0], dtype=tf.int64)
>>> full_indices = tf.stack([row_indices, tf.squeeze(idx)], axis=1)
>>> rs = tf.gather_nd(probs, full_indices)
Or you can use tf.distributions.Multinomial; the advantage is that you do not need to care about the batch_size as in the above code. It works with a varying batch_size when you set batch_size=None. Here is a simple example:
multinomial = tf.distributions.Multinomial(
    total_count=tf.constant(1, dtype=tf.float32),  # draw one sample per row
    probs=probs)
sampled_actions = multinomial.sample()          # one-hot sample for each row in the batch
predicted_actions = tf.argmax(sampled_actions, axis=-1)
action_probs = sampled_actions * probs          # keeps the probability of the sampled action
action_probs = tf.reduce_sum(action_probs, axis=-1)
I think this is what you want to do. I prefer the latter one because it is flexible and elegant.

Standardizing X different in Python Lasso and R glmnet?

I was trying to get the same result when fitting the lasso with Python's scikit-learn and with R's glmnet. A helpful link
If I specify normalize=True in Python and standardize=T in R, they give me the same result.
Python:
from sklearn.linear_model import Lasso
X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Lasso(alpha =0.01, fit_intercept = True, normalize =True)
reg.fit(X, y)
np.hstack((reg.intercept_, reg.coef_))
Out[95]: array([-0.89607695, 0. , -0.24743375, 1.03286824])
R:
reg_glmnet = glmnet(X, y, alpha = 1, lambda = 0.02,standardize = T)
coef(reg_glmnet)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.8960770
V1 .
V2 -0.2474338
V3 1.0328682
However, if I don't want to standardize the variables and set normalize=False and standardize=F, they give me quite different results.
Python:
from sklearn.linear_model import Lasso
Z = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Lasso(alpha =0.01, fit_intercept = True, normalize =False)
reg.fit(Z, y)
np.hstack((reg.intercept_, reg.coef_))
Out[96]: array([-0.88 , 0.09384212, -0.36159299, 1.05958478])
R:
reg_glmnet = glmnet(X, y, alpha = 1, lambda = 0.02,standardize = F)
coef(reg_glmnet)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.76000000
V1 0.04441697
V2 -0.29415542
V3 0.97623074
What's the difference between "normalize" in Python's Lasso and "standardize" in R's glmnet?
Currently, with regard to the normalize parameter, the docs state: "If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False."
So evidently normalize and standardize are not the same thing in sklearn.linear_model.Lasso. Having read the StandardScaler docs, I fail to understand the exact difference, but the fact that there is one is implied by the description of the normalize parameter.
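For reference, a minimal sketch of the StandardScaler route the docs point to, on the same toy data as above (illustrative only; it is not guaranteed to reproduce the glmnet numbers, and in recent sklearn versions the normalize parameter is deprecated, so explicit scaling is the recommended path):
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]], dtype=float)
y = np.array([1, 0, 0, 1], dtype=float)

# Standardize explicitly (zero mean, unit variance per column), then fit the Lasso
# without any internal rescaling of its own.
X_scaled = StandardScaler().fit_transform(X)
reg = Lasso(alpha=0.01, fit_intercept=True)
reg.fit(X_scaled, y)
print(np.hstack((reg.intercept_, reg.coef_)))   # coefficients are on the standardized scale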

Multiple linear regression for a surface using NumPy - example

This question is close to: fitting a linear surface with numpy least squares, but that question has no sample data. I must be terribly slow, but I can't seem to get it to work.
I have the following code:
import numpy as np
XYZ = np.array([[0, 1, 0, 1],
                [0, 0, 1, 1],
                [1, 1, 1, 1]])
A = np.row_stack((np.ones(len(XYZ[0])), XYZ[0, :], XYZ[1:]))
coeffs = np.linalg.lstsq(A.T, XYZ[2, :])[0]
print coeffs
The output is:
[ 5.00000000e-01 5.55111512e-17 9.71445147e-17 5.00000000e-01]
I want z = a + bx + cy, i.e. three coefficients, but the output gives me four. Where do I go wrong here? I expected coeffs to be something like:
[ 1.0 0.0 0.0]
Any help appreciated.
Peter Schneider (in the comments) is right: you'll want to feed XYZ[1, :], not XYZ[1:], to row_stack (XYZ[1:] contains both the y row and the z row, which is why you ended up with four columns and four coefficients):
>>> A = np.row_stack((np.ones(len(XYZ[0])), XYZ[0, :], XYZ[1, :]))
>>> np.linalg.lstsq(A.T, XYZ[2, :])[0]
array([ 1.00000000e+00, -7.85046229e-17, -7.85046229e-17])
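If it helps, here is a self-contained variant of the same fix, building the design matrix with np.column_stack and passing rcond=None (which newer NumPy expects), then reading the result back as z = a + b*x + c*y:
import numpy as np

# x, y, z coordinates of the four sample points (they lie on the flat plane z = 1).
XYZ = np.array([[0, 1, 0, 1],
                [0, 0, 1, 1],
                [1, 1, 1, 1]], dtype=float)

# Design matrix: intercept column, then the x row and the y row.
A = np.column_stack((np.ones(XYZ.shape[1]), XYZ[0, :], XYZ[1, :]))
coeffs, residuals, rank, sv = np.linalg.lstsq(A, XYZ[2, :], rcond=None)
print(coeffs)   # approximately [1., 0., 0.], i.e. z = 1 + 0*x + 0*y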
