Lasso regression on Boston crime dataset in Python

Lasso regression solution in R
The above link contains the code for a Lasso regression solution in R. I am trying to solve the same problem in Python. Can someone help me solve it in Python?
Output
The desired output is shown in the picture above.

Try using the below approach
# train_x, train_y, df and crimerate_df come from the question's data preparation
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV, Lasso

# Pick the regularisation strength by cross-validation
new_module = LassoCV(cv=5, random_state=0, max_iter=10000)
new_module.fit(train_x, train_y)
new_module.alpha_

# Refit at the chosen alpha and rank features by absolute coefficient
BestLassofit = Lasso(alpha=new_module.alpha_)
BestLassofit.fit(train_x, train_y)
importance = np.abs(BestLassofit.coef_)[1:]
importance[:10]

# Keep only the columns the Lasso did not shrink to zero
Col = np.array(df.Col)[importance > 0]
x = sm.add_constant(df[Col])
train_x, test_x, train_y, test_y = train_test_split(
    x, crimerate_df, test_size=0.2, random_state=123)

# Refit with statsmodels OLS on the selected features
train_x_tmp = sm.add_constant(train_x)
lmod = sm.OLS(train_y, train_x_tmp).fit()
lmod.summary()
lmod.predict()[:10]
lmod.get_prediction().summary_frame()[:10]
# Q-Q plot of the residuals
sm.qqplot(lmod.resid, line="q")
plt.title("Q-Q plot of Standardized Residuals")
plt.show()

I'm a huge fan of scikit-learn's linear models module, where you can find sklearn.linear_model.Lasso for an out-of-the-box Lasso regression implementation.
Example from the docs:
>>> from sklearn import linear_model
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
Lasso(alpha=0.1)
>>> print(clf.coef_)
[0.85 0. ]
>>> print(clf.intercept_)
0.15...
The link you sent seems to want you to tune the "shrinkage" parameter (which I imagine is alpha), so you could create a loop where you iterate over values of alpha, record the score (i.e. dataset error), and create the plot they display in the link.
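For instance, a minimal sketch of that loop, assuming train_x, train_y, test_x and test_y already exist as in the snippet above (the alpha grid here is arbitrary):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# Hypothetical alpha grid; train_x/train_y/test_x/test_y come from your own split
alphas = np.logspace(-4, 1, 30)
scores = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(train_x, train_y)
    scores.append(lasso.score(test_x, test_y))  # R^2 on the held-out split

plt.plot(alphas, scores)
plt.xscale("log")
plt.xlabel("alpha (shrinkage strength)")
plt.ylabel("R^2 on test split")
plt.show()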

Related

SVR/SVM output predictions are very similar to each other but far from true value

The main idea is to predict 2 target outputs based on the input features.
The input features are already scaled using StandardScaler() from sklearn.
The size of X_train is (190 x 6), Y_train is (190 x 2), X_test is (20 x 6), Y_test is (20 x 2).
For both the linear and rbf kernels I use GridSearchCV to find the best C (linear) and the best C and gamma (rbf).
[PROBLEM] I perform SVR with MultiOutputRegressor on both the linear and rbf kernels, but the predicted outputs are very similar to each other (not exactly a constant prediction) and pretty far from the true values of y.
Below are the plots, where the scatter points represent the true values of Y. The first picture corresponds to the first target, Y[:,0], and the second picture to the second target, Y[:,1].
Do I have to scale my target outputs? Is there any other model that could help improve test accuracy?
I have tried a random forest regressor and performed tuning as well, and test accuracy is about the same as what I'm getting with SVR (results from SVR below).
Best parameter: {'estimator__C': 1}
MAE: [18.51151192 9.604601 ] #from linear kernel
Best parameter (rbf): {'estimator__C': 1, 'estimator__gamma': 1e-09}
MAE (rbf): [17.80482033 9.39780134] #from rbf kernel
Thank you so much! Any help and input is greatly appreciated! ^__^
---------------- Code -----------------------------
import numpy as np
from numpy import load
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=3)
#input features - HR, HRV, PTT, breathing_rate, LASI, AI
X = load('200_patient_input_scaled.npy')
#Output features - SBP, DBP
Y = load('200_patient_output_raw.npy')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.095, random_state = 43)
epsilon = 0.1
#--------------------------- Linear SVR kernel Model ------------------------------------------------------
linear_svr = SVR(kernel='linear', epsilon = epsilon)
multi_output_linear_svr = MultiOutputRegressor(linear_svr)
#multi_output_linear_svr.fit(X_train, Y_train) #just to see the output
#GridSearch - find the best C
grid = {'estimator__C': [1, 10, 100, 1000]}
grid_linear_svr = GridSearchCV(multi_output_linear_svr, grid, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_linear_svr.fit(X_train, Y_train)
#Prediction
Y_predict = grid_linear_svr.predict(X_test)
print("\nBest parameter:", grid_linear_svr.best_params_ )
print("MAE:", mean_absolute_error(Y_predict,Y_test, multioutput='raw_values'))
#-------------------------- RBF SVR kernel Model --------------------------------------------------------
rbf_svr = SVR(kernel='rbf', epsilon = epsilon)
multi_output_rbf_svr = MultiOutputRegressor(rbf_svr)
#Grid search - Find best combination of C and gamma
grid_rbf = {'estimator__C': [1, 10, 100, 1000], 'estimator__gamma': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]}
grid_rbf_svr = GridSearchCV(multi_output_rbf_svr, grid_rbf, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_rbf_svr.fit(X_train, Y_train)
#Prediction
Y_predict_rbf = grid_rbf_svr.predict(X_test)
print("\nBest parameter (rbf):", grid_rbf_svr.best_params_ )
print("MAE (rbf):", mean_absolute_error(Y_predict_rbf,Y_test, multioutput='raw_values'))
#Plotting
plot_y_predict = Y_predict_rbf[:,1]
plt.scatter( np.linspace(0, 20, num = 20), Y_test[:,1], color = 'red')
plt.plot(np.linspace(0, 20, num = 20), plot_y_predict)
A common mistake with StandardScaler is scaling along the wrong axis of the data: scaling the whole array at once, or row by row instead of column by column. Please make sure you've done this right! I would do the scaling by hand to be sure, because otherwise I think it needs a separate StandardScaler fit for each feature.
[RESPONSE/EDIT]: I think that just negates what StandardScaler did by inverting the transformation. I'm not entirely sure of StandardScaler's behaviour; I'm saying all this from experience of having trouble scaling multi-feature data. If I were you (for example, for min-max scaling) I would prefer something like this:
# Min-max scale each feature (column) independently
columnsX = X.shape[1]
for i in range(columnsX):
    X[:, i] = (X[:, i] - X[:, i].min()) / (X[:, i].max() - X[:, i].min())
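For reference, a minimal sketch of the usual column-wise pattern with StandardScaler itself, assuming the X_train, X_test and Y_train arrays from the question; by default StandardScaler standardises each column (feature) independently, and the important part is fitting on the training split only:
from sklearn.preprocessing import StandardScaler

scaler_x = StandardScaler()
X_train_scaled = scaler_x.fit_transform(X_train)  # fit on the training split only
X_test_scaled = scaler_x.transform(X_test)        # reuse the same means/stds

# If you also scale the targets, keep a separate scaler so predictions can be
# mapped back to the original units with scaler_y.inverse_transform(...)
scaler_y = StandardScaler()
Y_train_scaled = scaler_y.fit_transform(Y_train)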

Unexpected behavior of ML Logistic Regression?

I am new to AI and ML so apologies if this is a stupid question.
I was reading about Logistic Regression, and found out it is a classification supervised ML model.
So I tried to code an example to give it a try. My idea was to see if the program was able to figure out the "rule" behind the label (Y) I established, which is: "Y = 1 if and only if X1 or X2 is a multiple of 3 but not both, 0 otherwise".
But as you can see, the accuracy is very poor. Am I doing something wrong? Did I misunderstand the concept of Logistic Regression?
DATASET:
3,1,1
2,3,1
1,1,0
2,4,0
5,6,1
9,3,1
8,9,1
5,5,0
9,9,0
5,7,0
3,3,0
5,3,1
2,4,0
7,7,0
4,9,1
7,3,1
6,2,1
8,1,0
6,4,0
9,4,1
CODE:
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
col_names = ['x1', 'x2', 'y']
multi3 = pd.read_csv("1.csv", header=None, names=col_names)
feature_cols = ['x1', 'x2']
X = multi3[feature_cols]
y = multi3.y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
OUTPUT:
[[1 2]
[1 1]]
Accuracy: 0.4
Precision: 0.3333333333333333
Recall: 0.5
EDIT:
Source code of my comment below.
You can visualize your data:
multi3.plot.scatter(x="x1", y="x2", c="y", cmap="viridis")
You can see there is no clear separation between your two classes (0 and 1). So the accuracy you get, even allowing for the small test set, will be low, because x1 and x2 on their own are not useful for discriminating the labels.
In the code you posted in your comment, you did it over a larger, simulated dataset; if we do something similar,
import numpy as np
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,10,(60,2)),columns=['x1', 'x2'])
df['y'] = ((df['x1']>5) & (df['x2'] > 5)).astype(int)
logreg = LogisticRegression()
logreg.fit(df[['x1','x2']], df['y'])
y_pred = logreg.predict(df[['x1','x2']])
cnf_matrix = metrics.confusion_matrix(df['y'], y_pred)
cnf_matrix
array([[49, 2],
[ 2, 7]])
And of course, plotting this simulated data the same way, you can see that there is separation.
My guess is that the original dataset is wrong or has nothing to do with what you posted as an image.
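For what it's worth, the stated rule is an exclusive-or of two divisibility conditions, which is not linearly separable in the raw x1 and x2 values, so a plain logistic regression on those two columns cannot represent it. A minimal sketch, with hypothetical feature engineering, of making the rule learnable by handing the model the divisibility indicators and their interaction:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df2 = pd.DataFrame(rng.integers(1, 10, (200, 2)), columns=['x1', 'x2'])
df2['a'] = (df2['x1'] % 3 == 0).astype(int)   # x1 is a multiple of 3
df2['b'] = (df2['x2'] % 3 == 0).astype(int)   # x2 is a multiple of 3
df2['ab'] = df2['a'] * df2['b']               # interaction term
df2['y'] = df2['a'] ^ df2['b']                # the "exactly one" (XOR) rule

clf = LogisticRegression()
clf.fit(df2[['a', 'b', 'ab']], df2['y'])
clf.score(df2[['a', 'b', 'ab']], df2['y'])    # should be close to 1.0 on these features
The interaction term matters because a XOR b equals a + b - 2ab for 0/1 values, so the rule only becomes linear once the product ab is available as a feature.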

How to implement polynomial logistic regression in scikit-learn?

I'm trying to create a non-linear logistic regression, i.e. polynomial logistic regression, using scikit-learn. But I couldn't find how to define the degree of the polynomial. Has anybody tried it?
Thanks a lot!
For this you will need to proceed in two steps. Let us assume you are using the iris dataset (so you have a reproducible example):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
Step 1
First you need to convert your data to polynomial features. Originally, our data has 4 columns:
X_train.shape
>>> (112,4)
You can create the polynomial features with scikit learn (here it is for degree 2):
poly = PolynomialFeatures(degree = 2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_train)
X_poly.shape
>>> (112,14)
We now have 14 features (the original 4, their squares, and the 6 pairwise interaction terms).
Step 2
On this you can now fit your logistic regression, using X_poly:
lr = LogisticRegression()
lr.fit(X_poly,y_train)
Note: if you then want to evaluate your model on the test data, you also need to follow these 2 steps and do:
lr.score(poly.transform(X_test), y_test)
Putting everything together in a Pipeline (optional)
You may want to use a Pipeline instead, which processes these two steps in one object and avoids building intermediate objects:
pipe = Pipeline([('polynomial_features',poly), ('logistic_regression',lr)])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

sklearn - how do you get a probability instead of a label?

I know the SVM (specifically linear SVC) has an option, namely the probability=True parameter you can set when you instantiate the model, so that model.predict_proba() is supposed to give the probability of each prediction along with the label (1 or 0). However, I keep getting the numpy error "use all() on a 1-dimensional array" when I call predict_proba(), and I can only figure out how to get a prediction in the form of a label (1 or 0) using model.predict().
The documentation example works fine for me when setting the flag probability=True. The problem has to be in your input data. Try this very simple example:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(probability=True)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
print(clf.predict_proba([[-0.8, -1]]))
You can use CalibratedClassifierCV.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

model_svc = LinearSVC()
model = CalibratedClassifierCV(model_svc)
model.fit(X_train, y_train)
pred_class = model.predict(X_test)
probability = model.predict_proba(X_test)
You will get the predicted probability scores as an array.

SKLearn how to get decision probabilities for LinearSVC classifier

I am using scikit-learn's LinearSVC classifier for text mining. The y value is a 0/1 label and the X value is the TfidfVectorizer output of the text documents.
I use a pipeline like the one below:
pipeline = Pipeline([
('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
('classifier', LinearSVC())
])
For a prediction, I would like to get the confidence score or probability of a data point being classified as 1, in the range (0, 1).
I currently use the decision function:
pipeline.decision_function(test_X)
However, it returns positive and negative values that seem to indicate confidence, and I am not too sure what they mean.
Is there a way to get the values in the range 0-1?
For example, here is the output of the decision function for some of the data points:
-0.40671879072078421,
-0.40671879072078421,
-0.64549376401063352,
-0.40610652684648957,
-0.40610652684648957,
-0.64549376401063352,
-0.64549376401063352,
-0.5468745098794594,
-0.33976011539714374,
0.36781572474117097,
-0.094943829974515004,
0.37728641897721765,
0.2856211778200019,
0.11775493140003235,
0.19387473663623439,
-0.062620918785563556,
-0.17080866610522819,
0.61791016307670399,
0.33631340372946961,
0.87081276844501176,
1.026991628346146,
0.092097790098391641,
-0.3266704728249083,
0.050368652422013376,
-0.046834129250376291,
You can't.
However, you can use sklearn.svm.SVC with kernel='linear' and probability=True instead.
It may take longer to run, but you can then get probabilities from this classifier by using its predict_proba method.
import sklearn.svm

clf = sklearn.svm.SVC(kernel='linear', probability=True)
clf.fit(X, y)
clf.predict_proba(X_test)
If you insist on using the LinearSVC class, you can wrap it in a sklearn.calibration.CalibratedClassifierCV object and fit the calibrated classifier, which will give you a probabilistic classifier.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets
#Load iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # Using only two features
y = iris.target #3 classes: 0, 1, 2
linear_svc = LinearSVC() #The base estimator
# This is the calibrated classifier which can give probabilistic classifier
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # sigmoid uses Platt scaling; see the docs for other methods
                                        cv=3)
calibrated_svc.fit(X, y)

# predict
prediction_data = [[2.3, 5],
                   [4, 7]]
predicted_probs = calibrated_svc.predict_proba(prediction_data)  # important to use predict_proba
print(predicted_probs)
Here is the output:
[[ 9.98626760e-01 1.27594869e-03 9.72912751e-05]
[ 9.99578199e-01 1.79053170e-05 4.03895759e-04]]
which shows probabilities for each class for each data point.
