How to find regression curve equation for a fitted PolynomialFeatures model - python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
data=pd.DataFrame(
{"input":
[0.001,0.015,0.066,0.151,0.266,0.402,0.45,0.499,0.598,0.646,0.738,0.782,0.86,0.894,0.924,0.95],
"output":[0.5263157894736842,0.5789473684210524,0.6315789473684206,0.6842105263157897,
0.6315789473684206, 0.7894736842105263, 0.8421052631578945, 0.7894736842105263, 0.736842105263158,
0.6842105263157897, 0.736842105263158, 0.736842105263158,0.6842105263157897, 0.6842105263157897,
0.6315789473684206,0.5789473684210524]})
I have the above data, which contains input and output values, and I want to fit a curve to it. First, here is a plot of the input and output values:
I wrote this code:
X=data.iloc[:,0].to_numpy()
X=X.reshape(-1,1)
y=data.iloc[:,1].to_numpy()
y=y.reshape(-1,1)
poly=PolynomialFeatures(degree=2)
poly.fit(X,y)
X_poly=poly.transform(X)
reg=LinearRegression().fit(X_poly,y)
plt.scatter(X,y,color="blue")
plt.plot(X,reg.predict(X_poly),color="orange",label="Polynomial Linear Regression")
plt.xlabel("Temperature")
plt.ylabel("Pressure")
plt.legend(loc="upper left")
The resulting plot is:
But I can't find the equation of the above (orange) curve. How can I find it?

Your plot actually corresponds to your code run with
poly=PolynomialFeatures(degree=7)
and not to degree=2. Indeed, running your code with the above change, we get:
Now, your polynomial features are:
poly.get_feature_names() # in scikit-learn >= 1.0, use poly.get_feature_names_out()
# ['1', 'x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7']
and the respective coefficients of your linear regression are:
reg.coef_
# array([[ 0. , 5.43894411, -68.14277256, 364.28508827,
# -941.70924401, 1254.89358662, -831.27091422, 216.43304954]])
plus the intercept:
reg.intercept_
# array([0.51228593])
Given the above, and setting
coef = reg.coef_[0]
since here we have a single feature in the initial data, your regression equation is:
y = reg.intercept_ + coef[0] + coef[1]*x + coef[2]*x**2 + coef[3]*x**3 + coef[4]*x**4 + coef[5]*x**5 + coef[6]*x**6 + coef[7]*x**7
For visual verification, we can plot the above function with some x data in [0, 1]
x = np.linspace(0, 1, 15)
Running the above expression for y and
plt.plot(x, y)
gives:
Using some randomly generated data x, we can verify that the results of the equation y_eq are indeed equal to the results produced by the regression model y_reg within the limits of numerical precision:
x = np.random.rand(1,10)
y_eq = reg.intercept_ + coef[0] + coef[1]*x + coef[2]*x**2 + coef[3]*x**3 + coef[4]*x**4 + coef[5]*x**5 + coef[6]*x**6 + coef[7]*x**7
y_reg = np.concatenate(reg.predict(poly.transform(x.reshape(-1,1))))
y_eq
# array([[0.72452703, 0.64106819, 0.67394222, 0.71756648, 0.71102853,
# 0.63582055, 0.54243177, 0.71104983, 0.71287962, 0.6311952 ]])
y_reg
# array([0.72452703, 0.64106819, 0.67394222, 0.71756648, 0.71102853,
# 0.63582055, 0.54243177, 0.71104983, 0.71287962, 0.6311952 ])
np.allclose(y_reg, y_eq)
# True
Unrelated to the question: I guess you already know that fitting such a high-order polynomial to so few data points is not a good idea, and you should probably stick to a low degree of 2 or 3.

Not sure how you produced the plot shown in the question. When I ran your code I got the following (degree=2) polynomial fitted to the data, as expected:
Now that you have fitted the data you can see the coefficients of the model thus:
print(reg.coef_)
print(reg.intercept_)
# [[ 0. 0.85962436 -0.83796885]]
# [0.5523586]
Note that the data that was used to fit this model is equivalent to the following:
X_poly = np.concatenate([np.ones((16,1)), X, X**2], axis=1)
Therefore a single data point is a vector created as follows:
temp = 0.5
x = np.array([1, temp, temp**2]).reshape((1,3))
Your polynomial model is simply a linear model of the polynomial features:
y = A.x + B
or
y = reg.coef_.dot(x.T) + reg.intercept_
print(y) # [[0.77267856]]
Verification:
print(reg.predict(x)) # array([[0.77267856]])
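The same idea can be wrapped in a small helper that turns the fitted numbers into a polynomial object and a printable equation. This is only a sketch: the helper names `fitted_polynomial` and `equation_string` are made up for illustration, it assumes a single input feature (so the PolynomialFeatures columns are 1, x, x**2, ...), and it uses the degree=2 coefficients quoted above instead of refitting the model.

```python
import numpy as np
from numpy.polynomial import Polynomial

def fitted_polynomial(intercept, coef):
    """Combine a LinearRegression intercept with the coefficients of the
    single-input PolynomialFeatures columns [1, x, x**2, ...]."""
    c = np.asarray(coef, dtype=float).ravel().copy()
    # The bias column and the intercept are both constants, so add them up.
    c[0] += float(np.ravel(intercept)[0])
    return Polynomial(c)  # coefficients ordered by increasing power

def equation_string(p):
    """Format the polynomial as a readable equation."""
    terms = [f"{c:+.6g}*x**{k}" for k, c in enumerate(p.coef)]
    return "y = " + " ".join(terms)

# reg.intercept_ and reg.coef_ as printed above for the degree=2 fit
p = fitted_polynomial([0.5523586], [[0.0, 0.85962436, -0.83796885]])
print(equation_string(p))
print(p(0.5))  # should agree with reg.predict for temp = 0.5
```

With a fitted model you would pass `reg.intercept_` and `reg.coef_` directly; the helper works for any degree as long as there is only one input feature.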

Related

Can I inverse transform the intercept and coefficients of LASSO regression after using Robust Scaler?

Is it possible to inverse transform the intercept and coefficients in LASSO regression, after fitting the model on scaled data using Robust Scaler?
I'm using LASSO regression to predict values on data that is not normalized and doesn't perform well with LASSO unless it's scaled beforehand. After scaling the data and fitting the LASSO model, I ideally want to be able to see what the model intercept and coefficients are but in the original units (not the scaled versions). I asked a similar question here and it doesn't appear this is possible. If not, why? Can someone explain this to me? I'm trying to broaden my understanding of how LASSO and Robust Scaler work.
Below is the code I was using. Here I was trying to inverse transform the coefficients using transformer_x and the intercept using transformer_y. However, it sounds like this is incorrect.
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import Lasso
df = pd.DataFrame({'Y':[5, -10, 10, .5, 2.5, 15], 'X1':[1., -2., 2., .1, .5, 3], 'X2':[1, 1, 2, 1, 1, 1],
'X3':[6, 6, 6, 5, 6, 4], 'X4':[6, 5, 4, 3, 2, 1]})
X = df[['X1','X2', 'X3' ,'X4']]
y = df[['Y']]
#Scaling
transformer_x = RobustScaler().fit(X)
transformer_y = RobustScaler().fit(y)
X_scal = transformer_x.transform(X)
y_scal = transformer_y.transform(y)
#LASSO
lasso = Lasso()
lasso = lasso.fit(X_scal, y_scal)
def pred_val(X1,X2,X3,X4):
    print('X1 entered: ', X1)
    #Scale X value that user entered - by hand
    med_X = X.median()
    Q1_X = X.quantile(0.25)
    Q3_X = X.quantile(0.75)
    IQR_X = Q3_X - Q1_X
    X_scaled = (X1 - med_X)/IQR_X
    print('X1 scaled by hand: ', X_scaled[0].round(2))
    #Scale X value that user entered - by function (the scaler was fitted on all four columns)
    X_scaled2 = transformer_x.transform(np.array([[X1,X2,X3,X4]]))
    print('X1 scaled by function: ', X_scaled2[0][0].round(2))
    #Intercept by hand
    med_y = y.median()
    Q1_y = y.quantile(0.25)
    Q3_y = y.quantile(0.75)
    IQR_y = Q3_y - Q1_y
    inv_int = med_y + IQR_y*lasso.intercept_[0]
    #Intercept by function
    inv_int2 = transformer_y.inverse_transform(lasso.intercept_.reshape(-1, 1))[0][0]
    #Coefficient by hand
    inv_coef = lasso.coef_[0]*IQR_y
    #Coefficient by function
    inv_coef2 = transformer_x.inverse_transform(lasso.coef_.reshape(1,-1))[0]
    #Prediction by hand
    preds = inv_int + inv_coef*X_scaled[0]
    #Prediction by function
    preds_inner = lasso.predict(X_scaled2)
    preds_f = transformer_y.inverse_transform(preds_inner.reshape(-1, 1))[0][0]
    print('\nIntercept by hand: ', inv_int[0].round(2))
    print('Intercept by function: ', inv_int2.round(2))
    print('\nCoefficients by hand: ', inv_coef[0].round(2))
    print('Coefficients by function: ', inv_coef2[0].round(2))
    print('\nYour predicted value by hand is: ', preds[0].round(2))
    print('Your predicted value by function is: ', preds_f.round(2))
    print('Perfect Prediction would be 80')
pred_val(10,1,1,1)
Update: I've updated my code to show the type of prediction function I'm trying to create. I'm just trying to create a function that does exactly what .predict does, but also shows the intercept and coefficients in their unscaled units.
Current output:
Out[1]:
X1 entered: 10
X1 scaled by hand: 5.97
X1 scaled by function: 5.97
Intercept by hand: 34.19
Intercept by function: 34.19
Coefficients by hand: 7.6
Coefficients by function: 8.5
Your predicted value by hand is: 79.54
Your predicted value by function is: 79.54
Perfect Prediction would be 80
Ideal output:
Out[1]:
X1 entered: 10
X1 scaled by hand: 5.97
X1 scaled by function: 5.97
Intercept by hand: 34.19
Intercept by function: 34.19
Coefficients by hand: 7.6
Coefficients by function: 7.6
Your predicted value by hand is: 79.54
Your predicted value by function is: 79.54
Perfect Prediction would be 80
Based on the linked SO thread, all you want to do is to get the unscaled prediction value. Is that right?
If yes, then all you need to do is:
# Scale the test dataset
X_test_scaled = transformer_x.transform(X_test)
# Predict with the trained model
prediction = lasso.predict(X_test_scaled)
# Inverse transform the prediction
# Inverse transform expects a 2D array, so reshape the 1D prediction first
prediction_in_dollars = transformer_y.inverse_transform(prediction.reshape(-1, 1))
UPDATE:
Suppose the train data contain just a single feature named X. Here is what the RobustScaler will do:
X_scaled = (X - median(X))/IQR(X)
y_scaled = (y - median(y))/IQR(y)
Then, the lasso regression will give a prediction like this:
a * X_scaled + b = y_scaled
You have to work through the equations to see what the model coefficients are on the unscaled data:
# Substituting X_scaled and y_scaled from the first two equations.
# Here median(X), IQR(X), median(y) and IQR(y) are plain numbers you already know from the training phase.
a * (X - median(X))/IQR(X) + b = (y - median(y))/IQR(y)
If you try to rearrange this into an equation of the form a_new * X + b_new = y, you end up with:
a_new = (a * (X - median(X)) / (X * IQR(X))) * IQR(y)
b_new = b * IQR(y) + median(y)
a_new * X + b_new = y
You can see that the unscaled coefficient (a_new) depends on X. So, you can use the unscaled X to make predictions directly but in between you are applying the transformation indirectly.
UPDATE 2
I've adapted your code and it now shows how you can get the coefficients in the original scale. The script is just the implementation of the formulas I'm showing above.
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import Lasso
df = pd.DataFrame({'Y':[5, -10, 10, .5, 2.5, 15], 'X1':[1., -2., 2., .1, .5, 3], 'X2':[1, 1, 2, 1, 1, 1],
'X3':[6, 6, 6, 5, 6, 4], 'X4':[6, 5, 4, 3, 2, 1]})
X = df[['X1','X2','X3','X4']]
y = df[['Y']]
#Scaling
transformer_x = RobustScaler().fit(X)
transformer_y = RobustScaler().fit(y)
X_scal = transformer_x.transform(X)
y_scal = transformer_y.transform(y)
#LASSO
lasso = Lasso()
lasso = lasso.fit(X_scal, y_scal)
def pred_val(X_test):
    print('X entered: ',)
    print (X_test.values[0])
    #Scale X value that user entered - by hand
    med_X = X.median()
    Q1_X = X.quantile(0.25)
    Q3_X = X.quantile(0.75)
    IQR_X = Q3_X - Q1_X
    X_scaled = ((X_test - med_X)/IQR_X).fillna(0).values
    print('X_test scaled by hand: ',)
    print (X_scaled[0])
    #Scale X value that user entered - by function
    X_scaled2 = transformer_x.transform(X_test)
    print('X_test scaled by function: ',)
    print (X_scaled2[0])
    #Intercept by hand
    med_y = y.median()
    Q1_y = y.quantile(0.25)
    Q3_y = y.quantile(0.75)
    IQR_y = Q3_y - Q1_y
    a = lasso.coef_
    coef_new = ((a * (X_test - med_X).values) / (X_test * IQR_X).values) * float(IQR_y)
    coef_new = np.nan_to_num(coef_new)[0]
    b = lasso.intercept_[0]
    intercept_new = b * float(IQR_y) + float(med_y)
    custom_pred = sum((coef_new * X_test.values)[0]) + intercept_new
    pred = lasso.predict(X_scaled2)
    final_pred = transformer_y.inverse_transform(pred.reshape(-1, 1))[0][0]
    print('Original intercept: ', lasso.intercept_[0].round(2))
    print('New intercept: ', intercept_new.round(2))
    print('Original coefficients: ', lasso.coef_.round(2))
    print('New coefficients: ', coef_new.round(2))
    print('Your predicted value by function is: ', final_pred.round(2))
    print('Your predicted value by hand is: ', custom_pred.round(2))
X_test = pd.DataFrame([10,1,1,1]).T
X_test.columns = ['X1', 'X2', 'X3', 'X4']
pred_val(X_test)
You can see that the custom prediction uses the original values (X_test.values).
Result:
X entered:
[10 1 1 1]
X_test scaled by hand:
[ 5.96774194 0. -6.66666667 -1. ]
X_test scaled by function:
[ 5.96774194 0. -6.66666667 -1. ]
Original intercept: 0.01
New intercept: 3.83
Original coefficients: [ 0.02 0. -0. -0. ]
New coefficients: [0.1 0. 0. 0. ]
Your predicted value by function is: 4.83
Your predicted value by hand is: 4.83
As I explained above, the new coefficients depend on X_test. This means that you cannot use their current values with another test sample. Their values will be different for different inputs.
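The dependence on X_test comes from how the terms of the equation are grouped. An alternative grouping of the same equation moves the constant part, -a * median(X) / IQR(X), into the intercept, which leaves a coefficient of a * IQR(y) / IQR(X) that is the same for every input. The sketch below checks this with NumPy only. The helper names are made up for illustration, the scaled-space coefficients `a` and `b` are hypothetical stand-ins for lasso.coef_ and lasso.intercept_, and replacing a zero IQR by 1 is an assumption meant to mirror how RobustScaler appears to treat the X2 column above.

```python
import numpy as np

def robust_stats(values):
    """Median and IQR per column; a zero IQR is replaced by 1
    (assumption: this mirrors the scaler's handling of zero spread)."""
    values = np.asarray(values, dtype=float)
    med = np.median(values, axis=0)
    q1, q3 = np.percentile(values, [25, 75], axis=0)
    iqr = q3 - q1
    return med, np.where(iqr == 0, 1.0, iqr)

def unscale_linear_model(a, b, med_X, iqr_X, med_y, iqr_y):
    """Map y_scaled = a . X_scaled + b to y = coef . X + intercept."""
    coef = a * iqr_y / iqr_X
    intercept = med_y + iqr_y * (b - np.sum(a * med_X / iqr_X))
    return coef, intercept

# Same data as in the question (columns X1..X4 and Y)
X = np.array([[1., 1, 6, 6], [-2., 1, 6, 5], [2., 2, 6, 4],
              [.1, 1, 5, 3], [.5, 1, 6, 2], [3., 1, 4, 1]])
y = np.array([5, -10, 10, .5, 2.5, 15])
med_X, iqr_X = robust_stats(X)
med_y, iqr_y = robust_stats(y)

# Hypothetical scaled-space model, not the output of an actual Lasso fit
a = np.array([0.8, 0.0, -0.1, 0.05])
b = 0.01
coef, intercept = unscale_linear_model(a, b, med_X, iqr_X, med_y, iqr_y)

# Both routes should give the same prediction for any input
x_new = np.array([10., 1., 1., 1.])
via_scaling = (a @ ((x_new - med_X) / iqr_X) + b) * iqr_y + med_y
direct = coef @ x_new + intercept
print(np.isclose(via_scaling, direct))
```

Grouped this way, `coef` and `intercept` can be reused for other test samples, at the cost of an intercept that no longer equals b * IQR(y) + median(y).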

Poisson Regression in statsmodels and R

Given some randomly generated data with
2 columns,
50 rows and
integer range between 0-100
With R, the poisson glm and diagnostics plot can be achieved as such:
> col=2
> row=50
> range=0:100
> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))
> model <- glm(X2 ~ X1, data = df, family = poisson)
> glm.diag.plots(model)
In Python, this would give me the line predictor vs residual plot:
import numpy as np
import pandas as pd
import statsmodels.formula.api
from statsmodels.genmod.families import Poisson
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randint(100, size=(50,2)))
df.rename(columns={0:'X1', 1:'X2'}, inplace=True)
glm = statsmodels.formula.api.gee
model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())
results = model.fit()
And to plot the diagnostics in Python:
model_fitted_y = results.fittedvalues # fitted values (need a constant term for intercept)
model_residuals = results.resid # model residuals
model_abs_resid = np.abs(model_residuals) # absolute residuals
plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_xlabel('Line Predictor')
plot_lm_1.axes[0].set_ylabel('Residuals')
plt.show()
But when I try to get the cook statistics,
# cook's distance, from statsmodels internals
model_cooks = results.get_influence().cooks_distance[0]
it threw an error saying:
AttributeError Traceback (most recent call last)
<ipython-input-66-0f2bedfa1741> in <module>()
4 model_residuals = results.resid
5 # normalized residuals
----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal
7 # absolute squared normalized residuals
8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)
33 pass
34
---> 35 obj = getattr(results, attr)
36 data = results.model.data
37 how = self._wrap_attrs.get(attr)
AttributeError: 'GEEResults' object has no attribute 'get_influence'
Is there a way to plot out all 4 diagnostic plots in Python like in R?
How do I retrieve the cook statistics of the fitted model results in Python using statsmodels?
The generalized estimating equations API should give you a different result than R's GLM model estimation. To get similar estimates in statsmodels, you need to use something like:
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Read data generated in R using pandas or something similar
df = pd.read_csv(...) # file name goes here
# Add a column of ones for the intercept to create input X
X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )
# Relabel dependent variable as y (standard notation)
y = df.X2
# Fit GLM in statsmodels using the Poisson family (log link by default)
sm.GLM(y, X, family=sm.families.Poisson()).fit().summary()
EDIT -- Here is the rest of the answer on how to get Cook's distance in Poisson regression. This is a script I wrote based on some data generated in R. I compared my values against those in R calculated using the cooks.distance function and the values matched.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import statsmodels.api as sm
PATH = '/Users/robertmilletich/test_reg.csv'
def _weight_matrix(fitted_model):
    """Calculates weight matrix in Poisson regression
    Parameters
    ----------
    fitted_model : statsmodel object
        Fitted Poisson model
    Returns
    -------
    W : 2d array-like
        Diagonal weight matrix in Poisson regression
    """
    return np.diag(fitted_model.fittedvalues)
def _hessian(X, W):
    """Hessian matrix calculated as -X'*W*X
    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates
    W : 2d array-like
        Weight matrix
    Returns
    -------
    hessian : 2d array-like
        Hessian matrix
    """
    return -np.dot(X.T, np.dot(W, X))
def _hat_matrix(X, W):
    """Calculate hat matrix = W^(1/2) * X * (X'*W*X)^(-1) * X'*W^(1/2)
    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates
    W : 2d array-like
        Diagonal weight matrix
    Returns
    -------
    hat : 2d array-like
        Hat matrix
    """
    # W^(1/2) (element-wise power is fine because W is diagonal)
    Wsqrt = W**(0.5)
    # (X'*W*X)^(-1)
    XtWX = -_hessian(X = X, W = W)
    XtWX_inv = np.linalg.inv(XtWX)
    # W^(1/2)*X
    WsqrtX = np.dot(Wsqrt, X)
    # X'*W^(1/2)
    XtWsqrt = np.dot(X.T, Wsqrt)
    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))
def main():
    # Load data and separate into X and y
    df = pd.read_csv(PATH)
    X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )
    y = df.X2
    # Fit model
    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    # Weight matrix
    W = _weight_matrix(model)
    # Hat matrix
    H = _hat_matrix(X, W)
    hii = np.diag(H) # Diagonal values of hat matrix
    # Pearson residuals
    r = model.resid_pearson
    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))
    # Note: dispersion is 1 since we aren't modeling overdispersion; p = 2 parameters
    cooks_d = (r/(1 - hii))**2 * hii/(1*2)
    return cooks_d
if __name__ == "__main__":
    main()
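The script above builds full n x n weight and hat matrices, although only the diagonal of the hat matrix is needed. As a sketch of an alternative (not part of the original script), the leverages can be computed row by row with NumPy. The function names are made up for illustration, and the fitted means `mu` below are hypothetical values, not the result of fitting a model.

```python
import numpy as np

def hat_diagonal(X, mu):
    """Leverages h_ii for a Poisson GLM with log link, where the working
    weights equal the fitted means mu, without forming n x n matrices."""
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)
    XtWX_inv = np.linalg.inv(X.T @ (mu[:, None] * X))
    # h_ii = mu_i * x_i' (X'WX)^(-1) x_i
    return mu * np.einsum("ij,jk,ik->i", X, XtWX_inv, X)

def cooks_distance(pearson_resid, hii, n_params, dispersion=1.0):
    """Same formula as in the script above: (r/(1-h))^2 * h / (dispersion * p)."""
    return (pearson_resid / (1 - hii)) ** 2 * hii / (dispersion * n_params)

rng = np.random.default_rng(0)
x1 = rng.integers(0, 100, size=50)
X = np.column_stack([np.ones(50), x1])
mu = np.exp(X @ np.array([3.0, 0.01]))  # hypothetical fitted means
y = rng.poisson(mu)

hii = hat_diagonal(X, mu)
r = (y - mu) / np.sqrt(mu)  # Pearson residuals for a Poisson model
d = cooks_distance(r, hii, n_params=X.shape[1])
print(hii.sum())  # the leverages sum to the number of parameters
```

In practice `mu` would be `model.fittedvalues` and `r` would be `model.resid_pearson` from the fitted GLM.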
As an update:
since version 0.10, statsmodels also provides a get_influence method for GLMResults.
https://www.statsmodels.org/dev/examples/notebooks/generated/influence_glm_logit.html
for example:
Print influence and outlier measures for 10 observations with largest cook distance:
infl = res.get_influence(observed=False)
summ_df = infl.summary_frame()
summ_df.sort_values("cooks_d", ascending=False)[:10]
There are no combination plots, but influence plot infl.plot_influence() and index plot infl.plot_index(...) for any of the measures are available.
Generic influence measures for maximum likelihood models are, or will become, available for discrete and other models.
MLE influence measures are based on hessian, i.e. observed information matrix, while for GLM both expected information matrix and hessian versions are available.
In GLM, the distinction is only relevant when non-canonical links are used.

power-law curve fitting scipy, numpy not working

I ran into a problem fitting a power-law curve to my data. I have two data sets: bins1 and bins2.
bins1 behaves fine when curve-fitting with numpy.linalg.lstsq (I then use np.exp(coefs[0])*x**coefs[1] to get the power-law equation).
bins2, on the other hand, behaves strangely and shows a bad R-squared.
Both data sets give different equations from what Excel shows me (and worse R-squared values).
here is the code (and data):
import numpy as np
import matplotlib.pyplot as plt
bins1 = np.array([[6.769318871738219667e-03,
1.306418618130891773e-02,
1.912138120913448383e-02,
2.545189874466026111e-02,
3.214689891729670401e-02,
4.101898933375244805e-02,
5.129862592803200588e-02,
6.636505322669797313e-02,
8.409809827572585494e-02,
1.058164348650862258e-01,
1.375849753230810046e-01,
1.830664031837437311e-01,
2.682454535427478137e-01,
3.912508246490400410e-01,
5.893271848997768680e-01,
8.480213305038615257e-01,
2.408136266017391058e+00,
3.629192766488219313e+00,
4.639246557509275171e+00,
9.901792214343277720e+00],
[8.501658465758301112e-04,
1.562697718429977012e-03,
1.902062808421856087e-04,
4.411817741488644959e-03,
3.409236963162485048e-03,
1.686099657013027898e-03,
3.643231240239608402e-03,
2.544120616413291154e-04,
2.549036204611017029e-02,
3.527340723977697573e-02,
5.038482027310990652e-02,
5.617932487522721979e-02,
1.620407270423956103e-01,
1.906538999080910068e-01,
3.180688368126549093e-01,
2.364903188268162038e-01,
3.267322385964683273e-01,
9.384571074801122403e-01,
4.419747716107813029e-01,
9.254710022316929852e+00]]).T
bins2 = np.array([[6.522512685133712192e-03,
1.300415548684437199e-02,
1.888928895701269539e-02,
2.509905819337970856e-02,
3.239654633369139919e-02,
4.130706234846069635e-02,
5.123820846515786398e-02,
6.444380072984744190e-02,
8.235238352205621892e-02,
1.070907072127811749e-01,
1.403438221033725120e-01,
1.863115065963684147e-01,
2.670209758710758163e-01,
4.003337413814173074e-01,
6.549054078382223754e-01,
1.116611087124244062e+00,
2.438604844718367914e+00,
3.480674117919704269e+00,
4.410201659398489404e+00,
6.401903059926267403e+00],
[1.793454543936148608e-03,
2.441092334386309615e-03,
2.754373929745804715e-03,
1.182752729942167062e-03,
1.357797177773524414e-03,
6.711673916715021199e-03,
1.392761674092503343e-02,
1.127957613093066511e-02,
7.928803089359596004e-03,
2.524609593305639915e-02,
5.698702885370290905e-02,
8.607729156137132465e-02,
2.453761830112021203e-01,
9.734443815196883176e-02,
1.487480479168299119e-01,
9.918002699934079791e-01,
1.121298151253063535e+00,
1.389239135742518227e+00,
4.254082922056571237e-01,
2.643453492951096440e+00]]).T
bins = bins1 #change to bins2 to see results for bins2
def fit(x,a,m): # power-law fit (based on previous studies)
    return a*(x**m)
coefs= np.linalg.lstsq(np.vstack([np.ones(len(bins[:,0])), np.log(bins[:,0]), bins[:,0]]).T, np.log(bins[:,1]))[0] # calculating fitting coefficients (a,m)
y_predict = fit(bins[:,0],np.exp(coefs[0]),coefs[1]) # prediction based of fitted model
model_plot = plt.loglog(bins[:,0],bins[:,1],'o',label="error")
fit_line = plt.plot(bins[:,0],y_predict,'r', label="fit")
plt.ylabel('Y (bins[:,1])')
plt.xlabel('X (bins[:,0])')
plt.title('model')
plt.legend(loc='best')
plt.show(model_plot,fit_line)
def R_sqr (y,y_predict): # calculating R squared value to measure fitting accuracy
    rsdl = y - y_predict
    ss_res = np.sum(rsdl**2)
    ss_tot = np.sum((y-np.mean(y))**2)
    R2 = 1-(ss_res/ss_tot)
    R2 = np.around(R2,decimals=4)
    return R2
R2= R_sqr(bins[:,1],y_predict)
print ('(R^2 = %s)' % (R2))
The fit formula for bins1[[x],[y]]: python: y = 0.337*(x)^1.223 (R^2 = 0.7773), excel: y = 0.289*(x)^1.174 (R^2 = 0.8548)
The fit formula for bins2[[x],[y]]: python: y = 0.509*(x)^1.332 (R^2 = -1.753), excel: y = 0.311*(x)^1.174 (R^2 = 0.9116)
These are two sample data sets out of 30. I see this fitting problem seemingly at random in my data, and some fits have an R-squared around -150!
I tried scipy's curve_fit but didn't get better results; in fact, they were worse.
Does anyone know how to get an Excel-like fit in Python?
You are trying to calculate an R-squared using Y's that have not been converted to log-space. The following change gives reasonable R-squared values:
R2 = R_sqr(np.log(bins[:,1]), np.log(y_predict))
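To make the log-space idea concrete, here is a minimal sketch that fits the power law as a straight line in log-log coordinates and computes R-squared on the logged values. It uses synthetic data, not the bins1/bins2 arrays, and the function names are made up for illustration. Note also that the design matrix in the question contains a third column, bins[:,0], whose coefficient coefs[2] is never used by fit(); the sketch below uses only an intercept and log(x), which may be closer to what a spreadsheet power trendline does.

```python
import numpy as np

def fit_power_law(x, y):
    """Least-squares fit of log(y) = log(a) + m*log(x); returns (a, m)."""
    m, log_a = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(log_a), m

def r_squared(y, y_pred):
    """Coefficient of determination for the given pair of arrays."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic data: y = 0.3 * x**1.2 with mild multiplicative noise
rng = np.random.default_rng(1)
x = np.logspace(-2, 1, 20)
y = 0.3 * x ** 1.2 * np.exp(rng.normal(0, 0.1, x.size))

a, m = fit_power_law(x, y)
y_fit = a * x ** m
r2_log = r_squared(np.log(y), np.log(y_fit))  # R-squared in log space
r2_raw = r_squared(y, y_fit)                  # R-squared on the raw values
print(a, m, r2_log, r2_raw)
```

The two R-squared values answer different questions: the log-space one measures the fit that was actually minimised, while the raw one is dominated by the largest y values and can even be negative.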

[scikit learn]: Anomaly Detection - Alternative for OneClassSVM

I have implemented LinearSVC and SVC from the sklearn-framework for text classification.
I am using TfidfVectorizer to get a sparse representation of the input data, which consists of two different classes (benign data and malicious data). This part is working fine, but now I want to implement some kind of anomaly detection by using the OneClassSVM classifier and training a model with only one class (outlier detection). Unfortunately, it does not work with sparse data. Some developers are working on a patch (https://github.com/scikit-learn/scikit-learn/pull/1586), but there are some bugs, so there is no solution yet for using the OneClassSVM implementation.
Are there any other methods in the sklearn-framework for doing something like that? I am looking over the examples but nothing seems to fit.
Thanks!
A bit late, but in case anyone else is looking for information on this... There's a third-party anomaly detection module for sklearn here: http://www.cit.mak.ac.ug/staff/jquinn/software/lsanomaly.html, based on least-squares methods. It should be a plug-in replacement for OneClassSVM.
Unfortunately, scikit-learn currently implements only one-class SVM and a robust covariance estimator for outlier detection (later releases added further estimators such as IsolationForest and LocalOutlierFactor).
You can try a comparison of these methods (as provided in the docs) by examining the differences on 2D data:
import numpy as np
import pylab as pl
import matplotlib.font_manager
from scipy import stats
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]
# define two outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "robust covariance estimator": EllipticEnvelope(contamination=.1)}
# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = 0
# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    np.random.seed(42)
    # Data generation (array sizes must be integers)
    X1 = 0.3 * np.random.randn(int(0.5 * n_inliers), 2) - offset
    X2 = 0.3 * np.random.randn(int(0.5 * n_inliers), 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]
    # Fit the model with the One-Class SVM
    pl.figure(figsize=(10, 5))
    # dict.items() replaces the Python 2 iteritems(); j avoids shadowing the outer i
    for j, (clf_name, clf) in enumerate(classifiers.items()):
        # fit the data and tag outliers
        clf.fit(X)
        y_pred = clf.decision_function(X).ravel()
        threshold = stats.scoreatpercentile(y_pred,
                                            100 * outliers_fraction)
        y_pred = y_pred > threshold
        n_errors = (y_pred != ground_truth).sum()
        # plot the levels lines and the points
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        subplot = pl.subplot(1, 2, j + 1)
        subplot.set_title("Outlier detection")
        subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                         cmap=pl.cm.Blues_r)
        a = subplot.contour(xx, yy, Z, levels=[threshold],
                            linewidths=2, colors='red')
        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                         colors='orange')
        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white')
        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black')
        subplot.axis('tight')
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=11))
        subplot.set_xlabel("%d. %s (errors: %d)" % (j + 1, clf_name, n_errors))
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    pl.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
pl.show()

Detrending a time-series of a multi-dimensional array without the for loops

I have a 3D array which has a time-series of air-sea carbon flux for each grid point on the earth's surface (model output). I want to remove the trend (linear) in the time series. I came across this code:
from matplotlib import mlab
for x in range(40):
    for y in range(182):
        cflux_detrended[:, x, y] = mlab.detrend_linear(cflux[:, x, y])
Can I speed this up by not using for loops?
SciPy has a lot of signal processing tools.
scipy.signal.detrend() removes the linear trend along an axis of the data. With axis=0, a separate least-squares line is fitted to, and subtracted from, the time series at each grid point.
import scipy.signal
cflux_detrended = scipy.signal.detrend(cflux, axis=0)
Using scipy.signal will get the same result as using the method in the original post. Using Josef's detrend_separate() function will also return the same result.
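To illustrate what per-series detrending along the time axis amounts to, here is a NumPy-only sketch that compares a vectorised version with the explicit double loop. The function name `detrend_linear_axis0` is made up, and the random array is a hypothetical stand-in for cflux with shape (time, x, y).

```python
import numpy as np

def detrend_linear_axis0(a):
    """Subtract a separate least-squares line from every series along axis 0."""
    a = np.asarray(a, dtype=float)
    t = np.arange(a.shape[0])
    flat = a.reshape(a.shape[0], -1)           # shape (time, n_series)
    slope, intercept = np.polyfit(t, flat, 1)  # one straight-line fit per column
    trend = t[:, None] * slope + intercept
    return (flat - trend).reshape(a.shape)

# Hypothetical stand-in for cflux: noise plus a linear trend in time
rng = np.random.default_rng(0)
cflux = rng.normal(size=(30, 4, 3)) + np.arange(30)[:, None, None]
vectorised = detrend_linear_axis0(cflux)

# Reference result from the explicit double loop
looped = np.empty_like(cflux)
t = np.arange(30)
for i in range(4):
    for j in range(3):
        s, c = np.polyfit(t, cflux[:, i, j], 1)
        looped[:, i, j] = cflux[:, i, j] - (s * t + c)
print(np.allclose(vectorised, looped))
```

The reshape to (time, n_series) is what removes the loops: np.polyfit accepts a 2D y and fits every column in one least-squares solve.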
Here are two versions using numpy.linalg.lstsq. This version uses np.vander to create any polynomial trend.
Warning: not tested except on the example.
I think something like this will be added to scikits.statsmodels, which doesn't have yet a multivariate version for detrending either. For the common trend case, we could use scikits.statsmodels OLS and we would also get all the result statistics for the estimation.
# -*- coding: utf-8 -*-
"""Detrending multivariate array
Created on Fri Dec 02 15:08:42 2011
Author: Josef Perktold
http://stackoverflow.com/questions/8355197/detrending-a-time-series-of-a-multi-dimensional-array-without-the-for-loops
I should also add the multivariate version to statsmodels
"""
import numpy as np
import matplotlib.pyplot as plt
def detrend_common(y, order=1):
    '''detrend multivariate series by common trend
    Parameters
    ----------
    y : ndarray
        data, can be 1d or nd. if ndim is greater than 1, then observations
        are along zero axis
    order : int
        degree of polynomial trend, 1 is linear, 0 is constant
    Returns
    -------
    y_detrended : ndarray
        detrended data in same shape as original
    '''
    nobs = y.shape[0]
    shape = y.shape
    y_ = y.ravel()
    nobs_ = len(y_)
    # repeat count must be an integer
    t = np.repeat(np.arange(nobs), nobs_ // nobs)
    exog = np.vander(t, order+1)
    params = np.linalg.lstsq(exog, y_, rcond=None)[0]
    fittedvalues = np.dot(exog, params)
    resid = (y_ - fittedvalues).reshape(*shape)
    return resid, params
def detrend_separate(y, order=1):
    '''detrend multivariate series by series specific trends
    Parameters
    ----------
    y : ndarray
        data, can be 1d or nd. if ndim is greater than 1, then observations
        are along zero axis
    order : int
        degree of polynomial trend, 1 is linear, 0 is constant
    Returns
    -------
    y_detrended : ndarray
        detrended data in same shape as original
    '''
    nobs = y.shape[0]
    shape = y.shape
    y_ = y.reshape(nobs, -1)
    kvars_ = len(y_)
    t = np.arange(nobs)
    exog = np.vander(t, order+1)
    params = np.linalg.lstsq(exog, y_, rcond=None)[0]
    fittedvalues = np.dot(exog, params)
    resid = (y_ - fittedvalues).reshape(*shape)
    return resid, params
nobs = 30
sige = 0.1
y0 = 0.5 * np.random.randn(nobs,4,3)
t = np.arange(nobs)
y_observed = y0 + t[:,None,None]
for detrend_func, name in zip([detrend_common, detrend_separate],
                              ['common', 'separate']):
    y_detrended, params = detrend_func(y_observed, order=1)
    print('\n\n', name)
    print('params for detrending')
    print(params)
    print('std of detrended', y_detrended.std()) #should be roughly sig=0.5 (var of y0)
    print('maxabs', np.max(np.abs(y_detrended - y0)))
    print('observed')
    print(y_observed[-1])
    print('detrended')
    print(y_detrended[-1])
    print('original "true"')
    print(y0[-1])
    plt.figure()
    for i in range(4):
        for j in range(3):
            plt.plot(y0[:,i,j], 'bo', alpha=0.75)
            plt.plot(y_detrended[:,i,j], 'ro', alpha=0.75)
    plt.title(name + ' detrending: blue - original, red - detrended')
plt.show()
Since Nicholas pointed out scipy.signal.detrend: my detrend_separate is basically the same as scipy.signal.detrend, with fewer (no axis or breaks) or different (polynomial order) options.
>>> res = signal.detrend(y_observed, axis=0)
>>> (res - y0).var()
0.016931858083279336
>>> (y_detrended - y0).var()
0.01693185808327945
>>> (res - y_detrended).var()
8.402584948582852e-30
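I think a plain old list comprehension is easiest (the final .T restores the original (time, x, y) layout, since iterating over cflux.T puts time on the last axis):
cflux_detrended = np.array([[mlab.detrend_linear(t) for t in kk] for kk in cflux.T]).T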
