By default, all regularised linear regression techniques in scikit-learn pull the model coefficients w towards 0 as alpha increases. Is it possible to instead pull the coefficients towards predefined values? In my application I do have such values, obtained from a previous analysis of a similar but much larger dataset. In other words, can I transfer the knowledge from one model to another?
The documentation of LassoCV says:
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
In theory it's easy to incorporate previously obtained coefficients w0 by changing the above to
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w - w0||_1
The problem is that the actual optimisation is carried out by the Cython function enet_coordinate_descent (called via lasso_path and enet_path). If I want to change it, do I need to fork, modify, and recompile the whole sklearn.linear_model package or reimplement the whole optimisation routine?
Toy example
The following code defines a dataset X with 4 features and a matching response vector y.
import numpy as np
from sklearn.linear_model import LassoCV
n = 50
x1 = np.random.normal(10, 8, n)
x2 = np.random.normal(8, 6, n)
X = np.column_stack([x1, x1 ** 2, x2, x2 ** 2])
y = .8 * x1 + .2 * x2 + .7 * x2**2 + np.random.normal(0, 3, n)
cv = LassoCV(cv=10).fit(X, y)
The resulting coefficients and alpha are
>>> print(cv.coef_)
[ 0.46262115 0.01245427 0. 0.70642803]
>>> print(cv.alpha_)
7.63613474003
If we had prior knowledge regarding two of the coefficients w0 = np.array([.8, 0, .2, 0]), how could that be incorporated?
My final solution, based on #lejlot's answer
Rather than using vanilla GD I eventually arrived at using Adam.
This solution just fits a lasso for a given value of alpha; it does not search for alpha itself the way LassoCV does (but it is easy to add a layer of CV on top, as sketched after the example output below).
from autograd import numpy as np
from autograd import grad
from autograd.optimizers import adam

def fit_lasso(X, y, alpha=0, W0=None):
    if W0 is None:
        W0 = np.zeros(X.shape[1])

    def l1_loss(W, i):
        # i is only used for compatibility with adam
        return np.mean((np.dot(X, W) - y) ** 2) + alpha * np.sum(np.abs(W - W0))

    gradient = grad(l1_loss)

    def print_w(w, i, g):
        if (i + 1) % 250 == 0:
            print("After %i step: w = %s" % (i + 1, np.array2string(w.T)))

    W_init = np.random.normal(size=(X.shape[1], 1))
    W = adam(gradient, W_init, step_size=.1, num_iters=1000, callback=print_w)
    return W
n = 50
x1 = np.random.normal(10, 8, n)
x2 = np.random.normal(8, 6, n)
X = np.column_stack([x1, x1 ** 2, x2, x2 ** 2])
y = .8 * x1 + .2 * x2 + .7 * x2 ** 2 + np.random.normal(0, 3, n)
fit_lasso(X, y, alpha=30)
fit_lasso(X, y, alpha=30, W0=np.array([.8, 0, .2, 0]))
After 250 step: w = [[ 0.886 0.131 0.005 0.291]]
After 500 step: w = [[ 0.886 0.131 0.003 0.291]]
After 750 step: w = [[ 0.886 0.131 0.013 0.291]]
After 1000 step: w = [[ 0.887 0.131 0.013 0.292]]
After 250 step: w = [[ 0.868 0.129 0.728 0.247]]
After 500 step: w = [[ 0.803 0.132 0.717 0.249]]
After 750 step: w = [[ 0.801 0.132 0.714 0.249]]
After 1000 step: w = [[ 0.801 0.132 0.714 0.249]]
The results are quite similar on this example, but you can at least tell that specifying a W0 prevented the model from killing the third coefficient.
The effect is only apparent if you use an alpha > 20 or thereabouts.
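As mentioned above, a cross-validation layer can be added on top of fit_lasso to choose alpha. Here is a minimal sketch of what that could look like; the fit_lasso_cv helper and the candidate alpha grid are illustrative additions, not part of the solution above:

from sklearn.model_selection import KFold

def fit_lasso_cv(X, y, alphas=(1, 10, 30, 100), W0=None, n_splits=5):
    # Pick alpha by K-fold CV on mean squared error, then refit on all data.
    # Note: fit_lasso prints its progress, so this will be verbose.
    best_alpha, best_mse = None, np.inf
    for alpha in alphas:
        fold_mses = []
        for train, test in KFold(n_splits=n_splits).split(X):
            W = fit_lasso(X[train], y[train], alpha=alpha, W0=W0)
            preds = np.dot(X[test], W).ravel()
            fold_mses.append(np.mean((preds - y[test]) ** 2))
        if np.mean(fold_mses) < best_mse:
            best_alpha, best_mse = alpha, np.mean(fold_mses)
    return fit_lasso(X, y, alpha=best_alpha, W0=W0), best_alpha

For example, fit_lasso_cv(X, y, W0=np.array([.8, 0, .2, 0])) would return the coefficients fitted with the best alpha from the grid, together with that alpha.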
In short: yes, to change the objective inside scikit-learn you would have to fork, modify and recompile it. Scikit-learn is not a library for customizable ML models; it is about providing simple, typical models with an easy-to-use interface. If you want customization you should look at things like tensorflow, keras etc., or at least autograd. In fact, with autograd this is extremely simple, since you can write your code with numpy and use autograd to compute the gradients.
X = ...   # your data
y = ...   # your targets
W0 = ...  # target weights
alpha = ...  # pulling strength
lr = ...  # learning rate (step size of gradient descent)

from autograd import numpy as np
from autograd import grad

def your_loss(W):
    return np.mean((np.dot(X, W) - y)**2) + alpha * np.sum(np.abs(W - W0))

g = grad(your_loss)

W = np.random.normal(size=(X.shape[1], 1))
for i in range(100):
    W = W - lr * g(W)

print(W)
I have to do logistic regression using batch gradient descent.
import numpy as np

X = np.asarray([
    [0.50], [0.75], [1.00], [1.25], [1.50], [1.75], [1.75],
    [2.00], [2.25], [2.50], [2.75], [3.00], [3.25], [3.50],
    [4.00], [4.25], [4.50], [4.75], [5.00], [5.50]])

y = np.asarray([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

m = len(X)

def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

def gradient_Descent(theta, alpha, X, y):
    for i in range(0, m):
        cost = ((-y) * np.log(sigmoid(X[i]))) - ((1 - y) * np.log(1 - sigmoid(X[i])))
        grad = theta - alpha * (1.0/m) * (np.dot(cost, X[i]))
        theta = theta - alpha * grad

gradient_Descent(0.1, 0.005, X, y)
This is the way I'm supposed to do it, but I can't figure out how to make it work.
It looks like you have some things mixed up here. It's critical when doing this that you keep track of the shapes of your vectors and make sure you're getting sensible results. For example, you are calculating the cost with:
cost = ((-y) * np.log(sigmoid(X[i]))) - ((1 - y) * np.log(1 - sigmoid(X[i])))
In your case y is a vector with 20 items and X[i] is a single value. This makes your cost calculation a 20-item vector, which doesn't make sense. Your cost should be a single value. (You're also calculating this cost a bunch of times for no reason in your gradient descent function.)
Also, if you want this to be able to fit your data you need to add a bias term to X. So let's start there.
X = np.asarray([
[0.50],[0.75],[1.00],[1.25],[1.50],[1.75],[1.75],
[2.00],[2.25],[2.50],[2.75],[3.00],[3.25],[3.50],
[4.00],[4.25],[4.50],[4.75],[5.00],[5.50]])
ones = np.ones(X.shape)
X = np.hstack([ones, X])
# X.shape is now (20, 2)
Theta will now need two values, one for each column of X. So initialize Theta and Y:
Y = np.array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]).reshape([-1, 1])
# reshape Y so it's a column vector, which makes the matrix multiplication easier
Theta = np.array([[0], [0]])
Your sigmoid function is good. Let's also make a vectorized cost function:
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

def cost(x, y, theta):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    cost = (np.matmul(-y.T, np.log(h)) - np.matmul((1 - y.T), np.log(1 - h))) / m
    return cost
The cost function works because Theta has a shape of (2, 1) and X has a shape of (20, 2), so matmul(X, Theta) will have shape (20, 1). We then matrix-multiply by the transpose of Y (y.T has shape (1, 20)), which results in a single value: our cost for a given value of Theta.
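If you want to convince yourself of these shapes, you can print them directly; this small check is just for illustration and assumes the X, Y and Theta defined above:

print(np.matmul(X, Theta).shape)  # (20, 1)
print(cost(X, Y, Theta))          # [[ 0.69314718]] for Theta = [[0], [0]], i.e. log(2)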
We can then write a function that performs a single step of batch gradient descent:
def gradient_Descent(theta, alpha, x, y):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    grad = np.matmul(x.T, (h - y)) / m
    theta = theta - alpha * grad
    return theta
Notice that np.matmul(x.T, (h - y)) multiplies shapes (2, 20) and (20, 1), which results in a shape of (2, 1), the same shape as Theta, which is what you want from your gradient. This lets you multiply it by your learning rate and subtract it from the current Theta, which is what gradient descent is supposed to do.
So now you just write a loop for a number of iterations and update Theta until it looks like it converges:
n_iterations = 500
learning_rate = 0.5

for i in range(n_iterations):
    Theta = gradient_Descent(Theta, learning_rate, X, Y)
    if i % 50 == 0:
        print(cost(X, Y, Theta))
This prints the cost every 50 iterations; it decreases steadily, which is what you hope for:
[[ 0.6410409]]
[[ 0.44766253]]
[[ 0.41593581]]
[[ 0.40697167]]
[[ 0.40377785]]
[[ 0.4024982]]
[[ 0.40195]]
[[ 0.40170533]]
[[ 0.40159325]]
[[ 0.40154101]]
You can try different initial values of Theta and you will see it always converges to the same thing.
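For instance (an illustrative re-run, not part of the original code), starting from a random Theta instead of zeros should end up at essentially the same place:

Theta = np.random.randn(2, 1)   # random starting point instead of zeros
for i in range(n_iterations):
    Theta = gradient_Descent(Theta, learning_rate, X, Y)
print(Theta)                    # essentially the same values as before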
Now you can use your newly found values of Theta to make predictions:
h = sigmoid(np.matmul(X, Theta))
print((h > .5).astype(int) )
This prints what you would expect for a linear decision boundary on your data:
[[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]]
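As an optional sanity check (an addition, not part of the original answer), you can compare against scikit-learn's LogisticRegression; with a large C it is essentially unregularised, so its predictions should match the 0/1 pattern above and its coefficients should be in the same ballpark as Theta:

from sklearn.linear_model import LogisticRegression

# X[:, 1:] drops the bias column, since sklearn adds its own intercept
clf = LogisticRegression(C=1e6).fit(X[:, 1:], Y.ravel())
print(clf.intercept_, clf.coef_)               # comparable to Theta after convergence
print(clf.predict(X[:, 1:]).reshape(-1, 1))    # same 0/1 pattern as above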
I'm trying to understand the gradient descent algorithm.
Can someone please explain why I'm getting such high MSE values with the following code, or point out any concept I've missed?
import numpy as np
import pandas as pd

my_data = pd.DataFrame({'x': np.arange(0, 100),
                        'y': np.arange(0, 100)})

X = my_data.iloc[:, 0:1].values
y = my_data.iloc[:, 1].values

def gradientDescent(X, y, lr=0.001, n=1000):
    n_samples, n_features = X.shape
    cost = []
    weight = np.zeros([n_features])
    b = 0

    for _ in range(n):
        # predict
        y_hat = np.dot(X, weight) + b  # y = ax + b
        residual = y - y_hat
        db = -(2/n_samples) * np.sum(residual)
        dw = -(2/n_samples) * np.sum(X.T * residual, axis=1)

        # update weights
        weight -= (lr * dw)
        b -= (lr * db)
        cost.append(((y - y_hat) ** 2).mean())

    return weight, b, cost

gradientDescent(X, y)
Not an expert, but I think you are running into the exploding gradient problem. If you step through your code you will notice that your weight value swings from positive to negative in increasing steps. I believe you cannot find the minimum because using MSE on this unscaled dataset causes the updates to jump back and forth, never converging. Your x and y values range up to 100, so when you look at the cost it just blows up.
If you want to use MSE with your current x and y values you should normalize your data. You can do this by subtracting the mean and dividing by the standard deviation, or just scale both x and y to the range [0, 1].
For example:
my_data.x = my_data.x.transform(lambda x: x / x.max())
my_data.y = my_data.y.transform(lambda x: x / x.max())
If you do this you should see your cost converge to ~0 with enough iterations.
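For completeness, here is what the other option mentioned above (subtracting the mean and dividing by the standard deviation) could look like; this snippet is illustrative and not part of the original answer:

import numpy as np
import pandas as pd

my_data = pd.DataFrame({'x': np.arange(0, 100),
                        'y': np.arange(0, 100)})

# Standardize both columns: subtract the mean, divide by the standard deviation
my_data.x = (my_data.x - my_data.x.mean()) / my_data.x.std()
my_data.y = (my_data.y - my_data.y.mean()) / my_data.y.std()

X = my_data.iloc[:, 0:1].values
y = my_data.iloc[:, 1].values
weight, b, cost = gradientDescent(X, y)
print(cost[-1])   # much smaller than with the raw data; shrinks further with more iterations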
I have to implement scipy.special.expi in TensorFlow but I don't know how. There is no direct equivalent of this function in TensorFlow, so I'm stuck. Please, can someone help?
I don't really know much about this function, but based on the Fortran implementation in SciPy (the function EIX in scipy/special/specfun/specfun.f), I have put together a TensorFlow implementation following each step there. It only covers positive values, though, since the computation for negative values includes a loop that is harder to vectorize.
import math

import tensorflow as tf

def expi(x):
    x = tf.convert_to_tensor(x)
    # When X is zero
    m_0 = tf.equal(x, 0)
    y_0 = -math.inf + tf.zeros_like(x)
    # When X is negative
    m_neg = x < 0
    # This should be -e1xb(-x) according to SciPy
    # (negative exponential integral -1)
    # Here it is just left as NaN
    y_neg = math.nan + tf.zeros_like(x)
    # When X is less than or equal to 40 - Power series around x = 0
    m_le40 = x <= 40
    k = tf.range(1, 101, dtype=x.dtype)
    r = tf.cumprod(tf.expand_dims(x, -1) * k / tf.square(k + 1), axis=-1)
    ga = tf.constant(0.5772156649015328, dtype=x.dtype)
    y_le40 = ga + tf.log(x) + x * (1 + tf.reduce_sum(r, axis=-1))
    # Otherwise (X is greater than 40) - Asymptotic expansion (the series is not convergent)
    k = tf.range(1, 21, dtype=x.dtype)
    r = tf.cumprod(k / tf.expand_dims(x, -1), axis=-1)
    y_gt40 = tf.exp(x) / x * (1 + tf.reduce_sum(r, axis=-1))
    # Select values
    return tf.where(
        m_0, y_0, tf.where(
            m_neg, y_neg, tf.where(
                m_le40, y_le40, y_gt40)))
A small test
import tensorflow as tf
import scipy.special
import numpy as np
# Test
x = np.linspace(0, 100, 20)
y = scipy.special.expi(x)
with tf.Graph().as_default(), tf.Session() as sess:
    y_tf = sess.run(expi(x))
print(np.allclose(y, y_tf))
# True
Note, however, that this will use more memory than SciPy, because it unrolls the approximation loops in memory instead of computing one step at a time.
I'm trying to implement the gradient descent algorithm from scratch on a toy problem. My code always returns a vector of NaN's:
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(45)
x = np.linspace(0, 1000, num=1000)
y = 3*x + 2 + np.random.randn(len(x))

# sklearn output - This works (returns intercept = 1.6, coef = 3)
lm = LinearRegression()
lm.fit(x.reshape(-1, 1), y.reshape(-1, 1))
print("Intercept = {:.2f}, Coef = {:.2f}".format(lm.intercept_[0], lm.coef_[0][0]))

# BGD output
theta = np.array((0, 0)).reshape(-1, 1)
X = np.hstack([np.ones_like(x.reshape(-1, 1)), x.reshape(-1, 1)])  # [1, x]
Y = y.reshape(-1, 1)  # Column vector
alpha = 0.05
for i in range(100):
    # Update: theta <- theta - alpha * ([X.T][X][theta] - [X.T][Y])
    h = np.dot(X, theta)  # Hypothesis
    loss = h - Y
    theta = theta - alpha*np.dot(X.T, loss)

theta
The sklearn part runs fine, so I must be doing something wrong in the for loop. I've tried various different alpha values and none of them converge.
The problem is that theta keeps getting bigger and bigger throughout the loop, and eventually overflows, which produces the NaNs.
Here's a contour plot of the cost function:
J = np.dot((np.dot(X, theta) - y).T, (np.dot(X, theta) - y))
plt.contour(J)
Clearly there's no minimum here. Where have I gone wrong?
Thanks
In the theta update, the second term should be divided by the size of the training set. More details here: gradient descent using python and numpy
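A minimal sketch of that fix, assuming the X and Y built in the question (the feature rescaling below is an extra step added here for illustration, since with values up to 1000 even the corrected update needs either scaling or a very small learning rate):

m = X.shape[0]
X_scaled = X.copy()
X_scaled[:, 1] = X_scaled[:, 1] / X_scaled[:, 1].max()   # feature now in [0, 1]

theta = np.zeros((2, 1))
alpha = 0.5
for i in range(5000):
    grad = np.dot(X_scaled.T, np.dot(X_scaled, theta) - Y) / m   # note the division by m
    theta = theta - alpha * grad

slope = theta[1, 0] / X[:, 1].max()   # undo the feature scaling
intercept = theta[0, 0]
print(intercept, slope)               # close to the sklearn result above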
I am currently working with some Raman spectra data, and I am trying to correct skewing in my data caused by fluorescence. Take a look at the graph below:
I am pretty close to achieving what I want. As you can see, I am trying to fit a polynomial to all of my data, whereas I should really just be fitting a polynomial at the local minima.
Ideally I would want to have a polynomial fitting which when subtracted from my original data would result in something like this:
Are there any built in libs that does this already?
If not, any simple algorithm one can recommend for me?
I found an answer to my question, just sharing it for everyone who stumbles upon this.
There is an algorithm called "Asymmetric Least Squares Smoothing" by P. Eilers and H. Boelens (2005). The paper is free and you can find it on Google.
# Note: this uses xrange, i.e. Python 2 (a Python 3 adaptation follows below)
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    L = len(y)
    D = sparse.csc_matrix(np.diff(np.eye(L), 2))
    w = np.ones(L)
    for i in xrange(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
The following code works on Python 3.6.
It is adapted from the accepted answer to avoid the dense matrix diff computation (which can easily cause memory issues), and it uses range (not xrange).
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L-2))
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
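A minimal usage sketch (the synthetic spectrum and the lam/p values below are only illustrative; both parameters need tuning for real data):

x = np.linspace(0, 1000, 1000)
# A fake spectrum: one peak on a slowly rising background, plus noise
signal = 100 * np.exp(-((x - 500) / 20) ** 2) + 0.05 * x + np.random.randn(len(x))

baseline = baseline_als(signal, lam=1e5, p=0.01)
corrected = signal - baseline      # background removed, peak preserved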
There is a Python library available for baseline correction/removal. It provides the ModPoly, IModPoly and ZhangFit algorithms, which return baseline-corrected results when you pass in the original values as a Python list or pandas Series and specify the polynomial degree.
Install the library with pip install BaselineRemoval. Below is an example:
from BaselineRemoval import BaselineRemoval
input_array=[10,20,1.5,5,2,9,99,25,47]
polynomial_degree=2 #only needed for Modpoly and IModPoly algorithm
baseObj=BaselineRemoval(input_array)
Modpoly_output=baseObj.ModPoly(polynomial_degree)
Imodpoly_output=baseObj.IModPoly(polynomial_degree)
Zhangfit_output=baseObj.ZhangFit()
print('Original input:',input_array)
print('Modpoly base corrected values:',Modpoly_output)
print('IModPoly base corrected values:',Imodpoly_output)
print('ZhangFit base corrected values:',Zhangfit_output)
Original input: [10, 20, 1.5, 5, 2, 9, 99, 25, 47]
Modpoly base corrected values: [-1.98455800e-04 1.61793368e+01 1.08455179e+00 5.21544654e+00
7.20210508e-02 2.15427531e+00 8.44622093e+01 -4.17691125e-03
8.75511661e+00]
IModPoly base corrected values: [-0.84912125 15.13786196 -0.11351367 3.89675187 -1.33134142 0.70220645
82.99739548 -1.44577432 7.37269705]
ZhangFit base corrected values: [ 8.49924691e+00 1.84994576e+01 -3.31739230e-04 3.49854060e+00
4.97412948e-01 7.49628529e+00 9.74951576e+01 2.34940300e+01
4.54929023e+01]
Recently, I needed to use this method. The code from the answers works well, but it clearly uses more memory than necessary. So, here is my version with optimized memory usage.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als_optimized(y, lam, p, niter=10):
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L-2))
    D = lam * D.dot(D.transpose())  # Precompute this term since it does not depend on `w`
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    for i in range(niter):
        W.setdiag(w)  # Do not create a new matrix, just update diagonal values
        Z = W + D
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
According to my benchmarks below, it is also about 1.5 times faster.
%%timeit -n 1000 -r 10 y = randn(1000)
baseline_als(y, 10000, 0.05) # function from #jpantina's answer
# 20.5 ms ± 382 µs per loop (mean ± std. dev. of 10 runs, 1000 loops each)
%%timeit -n 1000 -r 10 y = randn(1000)
baseline_als_optimized(y, 10000, 0.05)
# 13.3 ms ± 874 µs per loop (mean ± std. dev. of 10 runs, 1000 loops each)
NOTE 1: The original article says:
To emphasize the basic simplicity of the algorithm, the number of iterations has been fixed to 10. In practical applications one should check whether the weights show any change; if not, convergence has been attained.
So the more correct way to stop iterating is to check that ||w_new - w|| < tolerance.
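A minimal sketch of that stopping rule, reusing the imports above (the function name, max_iter and tol here are illustrative additions):

def baseline_als_conv(y, lam, p, max_iter=50, tol=1e-6):
    # Same model as above, but stops once the weights stop changing
    L = len(y)
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    D = lam * D.dot(D.transpose())
    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)
    for i in range(max_iter):
        W.setdiag(w)
        z = spsolve(W + D, w * y)
        w_new = p * (y > z) + (1 - p) * (y < z)
        if np.linalg.norm(w_new - w) < tol * np.linalg.norm(w):  # weights unchanged: converged
            w = w_new
            break
        w = w_new
    return z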
NOTE 2: Another useful quote (from #glycoaddict's comment) gives an idea how to choose values of the parameters.
There are two parameters: p for asymmetry and λ for smoothness. Both have to be tuned to the data at hand. We found that generally 0.001 ≤ p ≤ 0.1 is a good choice (for a signal with positive peaks) and 10² ≤ λ ≤ 10⁹, but exceptions may occur. In any case one should vary λ on a grid that is approximately linear for log λ. Often visual inspection is sufficient to get good parameter values.
I worked out the version of the algorithm referenced by glinka in a previous comment, which is an improvement of the penalized weighted least squares method published in a relatively recent paper. I took Rustam Guliev's code to build this one:
from scipy import sparse
from scipy.sparse import linalg
import numpy as np
from numpy.linalg import norm

def baseline_arPLS(y, ratio=1e-6, lam=100, niter=10, full_output=False):
    L = len(y)

    diag = np.ones(L - 2)
    D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L - 2)

    H = lam * D.dot(D.T)  # The transposes are flipped w.r.t the Algorithm on pg. 252

    w = np.ones(L)
    W = sparse.spdiags(w, 0, L, L)

    crit = 1
    count = 0

    while crit > ratio:
        z = linalg.spsolve(W + H, W * y)
        d = y - z
        dn = d[d < 0]

        m = np.mean(dn)
        s = np.std(dn)

        w_new = 1 / (1 + np.exp(2 * (d - (2*s - m))/s))
        crit = norm(w_new - w) / norm(w)

        w = w_new
        W.setdiag(w)  # Do not create a new matrix, just update diagonal values

        count += 1
        if count > niter:
            print('Maximum number of iterations exceeded')
            break

    if full_output:
        info = {'num_iter': count, 'stop_criterion': crit}
        return z, d, info
    else:
        return z
In order to test the algorithm, I created a spectrum similar to the one shown in Fig. 3 of the paper, by first generating a simulated spectrum consisting of multiple Gaussian peaks:
def spectra_model(x):
    coeff = np.array([100, 200, 100])
    mean = np.array([300, 750, 800])
    stdv = np.array([15, 30, 15])

    terms = []
    for ind in range(len(coeff)):
        term = coeff[ind] * np.exp(-((x - mean[ind]) / stdv[ind])**2)
        terms.append(term)

    spectra = sum(terms)
    return spectra

x_vals = np.arange(1, 1001)
spectra_sim = spectra_model(x_vals)
Then, I created a third-order interpolating polynomial using 4 points taken directly from the paper:
from scipy.interpolate import CubicSpline
x_poly = np.array([0, 250, 700, 1000])
y_poly = np.array([200, 180, 230, 200])
poly = CubicSpline(x_poly, y_poly)
baseline = poly(x_vals)
noise = np.random.randn(len(x_vals)) * 0.1
spectra_base = spectra_sim + baseline + noise
Finally, I used the baseline correction algorithm to subtract the baseline out of the altered spectra (spectra_base):
_, spectra_arPLS, info = baseline_arPLS(spectra_base, lam=1e4, niter=10,
full_output=True)
The results were (for reference, I compared with the pure ALS implementation by Rustam Guliev, using lam = 1e4 and p = 0.001):
I know this is an old question, but I stumbled upon it a few months ago and implemented the equivalent answer using scipy.sparse routines.
# Baseline removal
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    s = len(y)
    # assemble difference matrix
    D0 = sparse.eye(s)
    d1 = [np.ones(s - 1) * -2]
    D1 = sparse.diags(d1, [-1])
    d2 = [np.ones(s - 2) * 1]
    D2 = sparse.diags(d2, [-2])

    D = D0 + D2 + D1
    w = np.ones(s)
    for i in range(niter):
        W = sparse.diags([w], [0])
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w * y)
        w = p * (y > z) + (1 - p) * (y < z)
    return z
Cheers,
Pedro.