I'm trying to create a non-linear logistic regression, i.e. a polynomial logistic regression, using scikit-learn. But I couldn't find how to define the degree of the polynomial. Has anybody tried it?
Thanks a lot!
For this you will need to proceed in two steps. Let us assume you are using the iris dataset (so you have a reproducible example):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
data = load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
Step 1
First you need to convert your data to polynomial features. Originally, our data has 4 columns:
X_train.shape
>>> (112,4)
You can create the polynomial features with scikit-learn (here for degree 2):
poly = PolynomialFeatures(degree = 2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_train)
X_poly.shape
>>> (112,14)
We now have 14 features (the original 4, their squares, and the 6 pairwise interaction terms).
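If you want to see which term each of the 14 columns corresponds to, you can ask the transformer for the generated feature names (get_feature_names_out exists in recent scikit-learn releases; older versions use get_feature_names):
# maps each output column back to the original features, e.g. 'x0', 'x1', ..., 'x0^2', 'x0 x1', ...
print(poly.get_feature_names_out())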
Step 2
On top of this you can now build your logistic regression, fitting it on X_poly:
lr = LogisticRegression()
lr.fit(X_poly,y_train)
Note: if you then want to evaluate your model on the test data, you also need to follow these 2 steps and do:
lr.score(poly.transform(X_test), y_test)
Putting everything together in a Pipeline (optional)
You may want to use a Pipeline instead, which chains these two steps in one object and avoids building intermediate objects:
pipe = Pipeline([('polynomial_features',poly), ('logistic_regression',lr)])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
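A side benefit of the pipeline is that the polynomial degree becomes a tunable hyperparameter. As a sketch (the step name polynomial_features matches the one chosen above), you could pick the degree by cross-validation:
from sklearn.model_selection import GridSearchCV
# search over the degree of the polynomial expansion
param_grid = {'polynomial_features__degree': [1, 2, 3]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)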
Related
So I have this small dataset and I want to perform multiple linear regression on it.
First I drop the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed, I don't remove it so that I can perform multiple linear regression rather than simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and Y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find out the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficient = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coefficients every time I printed them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
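To see why #2 and #3 agree, you can fit both on the same data and compare the coefficients. A minimal sketch, assuming a DataFrame df with the numeric columns miles, gasprice and hours from the last snippet of your question:
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
X_full = df[['miles', 'gasprice']]
y_full = df['hours']
# case #3: sklearn on the full dataset
sk_model = LinearRegression().fit(X_full, y_full)
print(sk_model.intercept_, sk_model.coef_)
# case #2: statsmodels OLS on the same full dataset
sm_model = sm.OLS(y_full, sm.add_constant(X_full)).fit()
print(sm_model.params)  # 'const' matches intercept_, the rest match coef_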
In order to evaluate whether any of them are "correct", you will need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence you need the holdout (test) set). If you want to improve the model further, you can try to understand the interactions of the features in the linear model better. Statsmodels has a neat "R-like" formula syntax for specifying your model: https://www.statsmodels.org/dev/example_formulas.html
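For example (again a sketch using the hypothetical df with miles, gasprice and hours columns):
import statsmodels.formula.api as smf
# R-style formula: model hours as a function of miles, gasprice and their interaction
est = smf.ols('hours ~ miles + gasprice + miles:gasprice', data=df).fit()
print(est.summary())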
I am trying to perform linear regression to predict an outcome y based on about ten independent variables X. I have 125 data points with data for y and each of X.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
X_array = X.values.reshape(-1,X.shape[1]) #converting dataframe into array
#125 x 9
y_array = y.values.reshape(-1,1)
#125 x 1
#Split data
X_train, X_test, y_train, y_test = train_test_split(X_array, y_array, test_size=0.2, random_state=100)
#define linear regression and fit to the data
regressor = LinearRegression()
regressor.fit(X_train, y_train)
r_sqr = regressor.score(X_test,y_test,sample_weight=None)
The problem is that when I run this code for just one variable within X, I get a better R-squared score (0.52) than when I run it for all of X (0.33). How can this be? Shouldn't the model perform better with more independent variables?
I've tried the r2_score function and tried running train_test_split on the data as a DataFrame instead of an array, but nothing changes.
From what I know, Linear Discriminant Analysis (LDA) is a technique to reduce the number of input features. Wikipedia also states the same:
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
However, when I try to use the LinearDiscriminantAnalysis from sklearn.discriminant_analysis, I failed to get the data with reduced features.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X, y = make_blobs(40000,600,2,cluster_std=20,random_state=101)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
model = LinearDiscriminantAnalysis(n_components=100)
model.fit(X_train,y_train)
X_train_new = model.transform(X_train)
print(X_train_new.shape)
>>> (28000, 1)
My original data has 600 features, and I would like to reduce it to only 100 features with LDA. But the LDA from sklearn gave me the shape (28000, 1) instead.
Why is there only 1 feature after the LDA transformation? What am I doing wrong?
Your LDA transforms your dataset to only one feature because LDA can produce at most n_classes - 1 components, so it will not honor n_components > (n_classes - 1).
Here you have two classes => 2 - 1 = 1 feature.
Please refer to LDA for two classes on Wikipedia
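You can verify the limit directly with the variables from your code (a quick sketch):
import numpy as np
n_classes = len(np.unique(y_train))
n_features = X_train.shape[1]
# LDA can produce at most min(n_features, n_classes - 1) components
print(min(n_features, n_classes - 1))
>>> 1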
Change your number of centers to 200, for example, and you'll see the difference:
Xx, yy = make_blobs(40000, 600, centers=200, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(Xx, yy, test_size=0.3)
model = LinearDiscriminantAnalysis(n_components=100)
model.fit(X_train, y_train)
X_train_new = model.transform(X_train)
print(X_train_new.shape)
>>> (28000, 100)
Otherwise, use PCA or truncated SVD:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=100)
X_train_new = svd.fit_transform(X_train)
svd.explained_variance_ratio_.sum() # should be > 0.90
print(X_train_new.shape)
>>> (28000, 100)
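PCA works the same way if you prefer it over truncated SVD (a sketch with the same shapes as above):
from sklearn.decomposition import PCA
pca = PCA(n_components=100)
X_train_new = pca.fit_transform(X_train)
pca.explained_variance_ratio_.sum()  # again, ideally > 0.90
print(X_train_new.shape)
>>> (28000, 100)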
Logistic Regression with inputs from the "Machine Learning.csv" file.
#Import Libraries
import pandas as pd
#Import Dataset
dataset = pd.read_csv('Machine Learning Data Set.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 10]
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
I have machine learning / logistic regression code (Python) as above. It has properly trained my model and gives a really good match with the test data. But unfortunately it is only giving me 0/1 (binary) results when I test with some other random values. (The training set has only 0/1 values, as in failed/succeeded.)
How can I get a probability result instead of a binary result from this algorithm? I have tried very different sets of numbers and would like to find out the probability of failing instead of a 0 or 1.
Any help is strongly appreciated :) Thanks a lot!
Just replace
y_pred = classifier.predict(X_test)
with
y_pred = classifier.predict_proba(X_test)
For details, refer to Logistic Regression Probability.
predict_proba(X_test) will give you the probability of each sample for each class, i.e. if X_test contains n_samples and you have 2 classes, the output of the above function will be an "n_samples x 2" matrix, and the two predicted class probabilities for each sample will sum to 1. For more details, have a look at the documentation here.
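For example, to pull out just the probability of one class from your classifier above (a sketch; check classes_ to see which column is which):
proba = classifier.predict_proba(X_test)  # shape: (n_samples, 2)
print(classifier.classes_)                # column order of proba, e.g. [0 1]
p_class1 = proba[:, 1]                    # probability of the second listed class for each sample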
I have 4 features and one target variable. I am using RandomForestRegressor instead of RandomForestClassifier as my target variable is a float. When I try to fit my model and then output the features in sorted order to get the important features, I get a NotFittedError. How do I fix it?
Code:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
feat_labels = data.columns[:4]
regr = RandomForestRegressor(max_depth=2, random_state=0)
#clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
#clf.fit(X_train, y_train)
regr.fit(X, y)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
You are fitting to regr but calling the feature importances on clf. Try calling this instead:
importances = regr.feature_importances_
I noticed that previously your classifier was being fit with the training data you set up, but the regressor is now being fit with X and y.
However, I don't see where you're setting X and y in the first place, or, for that matter, where you actually load in a dataset. Could it be you forgot this step, as well as what Harpal mentioned in another answer?
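For completeness, here is a minimal sketch of what that missing setup might look like; the file name and column layout are assumptions, since they are not shown in the question:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# hypothetical: load the data and separate the 4 features from the float target
data = pd.read_csv('my_data.csv')  # assumed file name
feat_labels = data.columns[:4]
X = data[feat_labels].values
y = data.iloc[:, 4].values         # assumed target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_train, y_train)         # fit the same estimator you later query
importances = regr.feature_importances_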