Logistic Regression: Train using past data and predict using current data? - python

I've trained and tested my logistic regression using available data but now need to output a future prediction. I want to include the 2017 values that I used in my training and test set to predict the 2018 probability.
This is the code I used to train and test my model:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
# Features: the 2016 transaction count plus the dummy-coded columns
Xadj = train.loc[:, ['2016 transaction count', 'critical_CI', 'critical_CN', 'critical_CS',
                     'critical_FI', 'critical_IN', 'critical_OI', 'critical_RA', 'create_year_2012', 'create_year_2013',
                     'create_year_2014', 'create_year_2015', 'create_year_2016']]
# Coded is the transformation of 2017 transaction count to a binary variable
y = train.loc[:, '2017 transaction count coded']
logit_model = sm.Logit(y, Xadj)
result = logit_model.fit()
print(result.summary())
X_train, X_test, y_train, y_test = train_test_split(Xadj, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
#Cross Validation
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))
In an attempt to export predictions for 2018, I have done the following:
#Create 2018 Purchase Probability
train['2018 Purchase Probability']=pd.DataFrame({'2018 Purchase Probability' : []})
yact = train.loc[:, '2018 Purchase Probability']
#Adding in 2017 values
X = train.loc[:, ['2017 transaction count', 'critical_CI', 'critical_CN', 'critical_CS',
                  'critical_FI', 'critical_IN', 'critical_OI', 'critical_RA', 'create_year_2012', 'create_year_2013',
                  'create_year_2014', 'create_year_2015', 'create_year_2016', 'create_year_2017']]
from sklearn.preprocessing import scale, StandardScaler
scaler = StandardScaler()
scaler.fit(Xadj)
X = scaler.transform(Xadj)
X_pred = scaler.transform(X)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(Xadj, y)
#Generate 0/1 prediction
prediction = logreg.predict(X= X)
#Generate odds ratio
precent_prediction = logreg.predict_proba(X= X)
prediction = pd.DataFrame(prediction)
I'm not sure if I've done this correctly and judging from my output (which is mostly 1's) I don't think I have. I am new to coding in Python and am struggling to turn my tested model into a future prediction that can be used to make decisions.
Thanks in advance for any help!
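One way to read the goal above, sketched under assumptions: fit the model on rows where the features come from the previous year and the label from the following year (2016 to 2017), then build the 2018 feature matrix from the 2017 values with the same columns in the same order and call predict_proba. The tiny synthetic table, the reduced column set, and the year-free name prev_count are all illustrative assumptions, not the asker's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real table (hypothetical values).
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "2016 transaction count": rng.poisson(3, n),
    "critical_CI": rng.integers(0, 2, n),
})
lam = 1 + 0.5 * train["2016 transaction count"].to_numpy()
train["2017 transaction count"] = rng.poisson(lam)
train["2017 transaction count coded"] = (train["2017 transaction count"] > 1).astype(int)

# Fit: previous year's count (2016) -> outcome in the following year (2017).
# A year-free column name (an assumption) keeps the layout identical at scoring time.
X_fit = train[["2016 transaction count", "critical_CI"]].rename(
    columns={"2016 transaction count": "prev_count"})
y_fit = train["2017 transaction count coded"]

# The pipeline applies the same scaling at fit time and at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_fit, y_fit)

# Score: the previous year is now 2017, the outcome is the unknown 2018.
X_2018 = train[["2017 transaction count", "critical_CI"]].rename(
    columns={"2017 transaction count": "prev_count"})
train["2018 Purchase Probability"] = model.predict_proba(X_2018)[:, 1]
```

The fitted model only knows the columns it was trained on, so a dummy such as create_year_2017 that was absent at fit time cannot simply be appended at prediction time. Keeping the scaler and the classifier in one pipeline also avoids fitting on unscaled data and predicting on scaled data.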

Related

How to fit long time period data into Regression models in scikit-learn?

I'm working on a regression model with population and demand values. My data covers the period from 1980 to 2021 by country; in the example below, each year column holds the population and each year_dem column holds the demand for the item.
The task is to create a prediction model to forecast future demand for each country.
# Import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Load the dataset containing past data on vaccine demand and supply
data = df.iloc[0]
X = data.drop(['Country','ISO','1980_dem', '1981_dem', '1982_dem','1983_dem','1984_dem','1985_dem','1986_dem','1987_dem','1988_dem','1989_dem','1990_dem','1991_dem','1992_dem','1993_dem','1994_dem','1995_dem','1996_dem','1997_dem','1998_dem','1999_dem','2000_dem','2001_dem','2002_dem','2003_dem','2004_dem','2005_dem','2006_dem','2007_dem','2008_dem','2009_dem','2010_dem','2011_dem','2012_dem','2013_dem','2014_dem','2015_dem','2016_dem','2017_dem','2018_dem','2019_dem','2020_dem','2021_dem'])
y = data['1980_dem']
# Split the DataFrame into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)
# Fit the model on the training set only
model = RandomForestRegressor(n_estimators=50, max_features=1.0, random_state=44)
model.fit(X_train, y_train)
# Use the trained model to make predictions on the test set
#predictions = model.predict(X_test)
# Calculate the accuracy of the predictions
#accuracy = model.score(X_test, y_test)
#print('Accuracy:', round(accuracy,2),'%.')
I expect to end up with a model that prints its accuracy and can be used to predict future values.
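A minimal sketch of one possible data layout for this, assuming the wide table has one row per country, "<year>" columns for population and "<year>_dem" columns for demand. The miniature DataFrame and its values are hypothetical, and LinearRegression is used only as a simple baseline in place of the RandomForestRegressor above. The idea is to reshape to one row per (country, year) so the model sees many samples instead of a single row.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical miniature of the wide table: one row per country.
df = pd.DataFrame({
    "Country": ["A", "B"],
    "ISO": ["AAA", "BBB"],
    "1980": [100, 200], "1981": [110, 210], "1982": [120, 220],
    "1980_dem": [10, 20], "1981_dem": [11, 21], "1982_dem": [12, 22],
})

# Population columns are the ones named with a plain year.
years = [c for c in df.columns if c.isdigit()]

# Long format: one row per (country, year).
pop = df.melt(id_vars=["Country", "ISO"], value_vars=years,
              var_name="year", value_name="population")
dem = df.melt(id_vars=["Country", "ISO"],
              value_vars=[f"{y}_dem" for y in years],
              var_name="year", value_name="demand")
dem["year"] = dem["year"].str.replace("_dem", "", regex=False)

long_df = pop.merge(dem, on=["Country", "ISO", "year"])
long_df["year"] = long_df["year"].astype(int)

# Baseline: demand as a function of year and population.
features = long_df[["year", "population"]]
model = LinearRegression().fit(features, long_df["demand"])
r2 = model.score(features, long_df["demand"])
```

Forecasting a future year with this layout still requires a population value for that year, and for time-ordered data it is usually safer to hold out the most recent years for testing rather than a random split.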

Why are my CatBoost fit metrics different from the sklearn evaluation metrics?

I'm still not sure this should be a question for this forum or for Cross-Validated, but I'll try this one, since it's more about the output of the code than the technique per se. Here's the thing, I'm running a CatBoost Classifier, just like this:
# import libraries
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
# import data
train = pd.read_csv("train.csv")
# get features and label
X = train[["Pclass", "Sex", "SibSp", "Parch", "Fare"]]
y = train[["Survived"]]
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# model parameters
model_cb = CatBoostClassifier(
    cat_features=["Pclass", "Sex"],
    loss_function="Logloss",
    eval_metric="AUC",
    learning_rate=0.1,
    iterations=500,
    od_type="Iter",
    od_wait=200
)
# fit model
model_cb.fit(
    X_train,
    y_train,
    plot=True,
    eval_set=(X_test, y_test),
    verbose=50,
)
y_pred = model_cb.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(roc_auc_score(y_test, y_pred))
The dataframe I'm using is from the Titanic competition.
The problem is that the model_cb.fit step shows an AUC of 0.87, but the last line, roc_auc_score from sklearn, shows an AUC of 0.73, i.e., much lower. From what I understand, the AUC reported by CatBoost is already computed on the test set.
Any ideas what the problem is and how I could fix it?
The ROC curve needs predicted probabilities or some other sort of confidence measure, not hard class predictions. Use
y_pred = model_cb.predict_proba(X_test)[:, 1]
See Scikit-learn : roc_auc_score and Why does roc_curve return only 3 values?.
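To see why hard predictions lower the score, here is a small dependency-free sketch. It uses the pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as one half), which is the quantity ROC AUC measures for binary labels. The labels and probabilities are made-up values, and pairwise_auc is a helper written for this illustration, not a library function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
proba = [0.1, 0.6, 0.7, 0.9]            # hypothetical predicted probabilities
hard = [int(p >= 0.5) for p in proba]   # what predict() yields at a 0.5 cutoff

auc_proba = pairwise_auc(y_true, proba)  # 1.0: every positive outranks every negative
auc_hard = pairwise_auc(y_true, hard)    # 0.75: thresholding creates ties
```

Thresholding collapses 0.6, 0.7 and 0.9 into the same value 1, so the ranking information that the AUC rewards is lost.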

how to run the same linear model n times?

I built a linear model with the sklearn based on the Cement and Concrete Composites dataset.
Initially, I used train_test_split(X, Y, test_size=0.3, shuffle=False) and computed the train and test error.
Now I want to run the same model 10 times with shuffle=True and compute the mean and standard deviation of the errors. The new results should be compared to the first ones.
How could I loop the same model n times and save the errors in a list?
Try something like this:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
errors = []
for i in range(10):
    # shuffle=True draws a different random split on every iteration
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, shuffle=True)
    model = LinearRegression() # the model you want to use here
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    error = mean_squared_error(y_test, y_pred) # the error metric you want to use here
    errors.append(error)
What you need is cross-validation: repeated evaluation of the model on different splits of the same data. train_test_split in this case is a wrapper around ShuffleSplit cross-validation.
In your case it might look like this:
from sklearn.model_selection import ShuffleSplit, cross_val_score
import numpy as np
from sklearn.linear_model import LinearRegression
X, y = ... # read dataset
model = LinearRegression()
# n_splits=10 is for 10 random shuffled train-test splits
cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
np.mean(scores), np.std(scores)
If you want to compute the error on your own or do anything else with models/results, you could do it like this:
for train_ids, test_ids in cv.split(X):
    model.fit(X[train_ids], y[train_ids])
    model.score(X[test_ids], y[test_ids])
    ...
More about this:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html

Best Practice for Executing Ensemble Methods

I am running the sample code below.
import pandas as pd
df = pd.read_csv('C:\\my_path\\test.csv', header=0, encoding = 'unicode_escape')
df = df.fillna(0)
X = df.drop(columns = ['PRICE','MATURITYDATE'])
y = df['PRICE']
from sklearn.model_selection import train_test_split
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
#create a new knn model
knn = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5)
#fit model to training data
knn_gs.fit(X_train, y_train)
#save best model
knn_best = knn_gs.best_estimator_
#check best n_neighbors value
print(knn_gs.best_params_)
# RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
#create a new random forest classifier
rf = RandomForestClassifier()
#create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}
#use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)
#fit model to training data
rf_gs.fit(X_train, y_train)
#save best model
rf_best = rf_gs.best_estimator_
#check best n_estimators value
print(rf_gs.best_params_)
# REGRESSION
from sklearn.linear_model import LogisticRegression
#create a new logistic regression model
log_reg = LogisticRegression()
#fit the model to the training data
log_reg.fit(X_train, y_train)
#test the three models with the test data and print their accuracy scores
print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(log_reg.score(X_test, y_test)))
# VOTING CLASSIFIER
from sklearn.ensemble import VotingClassifier
#create a dictionary of our models
estimators=[('knn', knn_best), ('rf', rf_best), ('log_reg', log_reg)]
#create our voting classifier, inputting our models
ensemble = VotingClassifier(estimators, voting='hard')
It's all from the link below
https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a
The problem that I'm running into with each of these methods is always this:
ValueError: Unknown label type: 'continuous'
I guess everything needs to be converted into a categorical type or, perhaps, one hot encoding needs to be applied. Is this correct? What is the best way to deal with this kind of issue? I'm hoping to keep things simple and very generic, without introducing custom coding. This is why I am leaning towards the scikit-learn libraries. I'd greatly appreciate any/all thoughts and insights. Thanks so much!
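A possible direction, sketched under the assumption that PRICE is a continuous number: scikit-learn classifiers expect discrete class labels and raise this error for float targets, so the regressor counterparts of the same estimators may be a better fit than encoding the target. The synthetic data below stands in for the real CSV, and LinearRegression replaces LogisticRegression because the latter is a classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic continuous target standing in for PRICE (hypothetical data).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Regressor counterparts of the three classifiers, averaged by VotingRegressor.
ensemble = VotingRegressor([
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("lin", LinearRegression()),
])
ensemble.fit(X_train, y_train)

# For regressors, score() returns R^2 rather than accuracy.
r2 = ensemble.score(X_test, y_test)
predictions = ensemble.predict(X_test)
```

The GridSearchCV steps from the question work the same way with regressors; only the estimator classes and the meaning of the score change.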

Logistic Regression - Machine Learning

Logistic Regression with inputs of "Machine Learning.csv" file.
#Import Libraries
import pandas as pd
#Import Dataset
dataset = pd.read_csv('Machine Learning Data Set.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 10]
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
I have the machine learning / logistic regression code (Python) above. It trains my model properly and gives a really good match with the test data. Unfortunately, it only gives me 0/1 (binary) results when I test it with other random values. (The training set contains only 0/1, as in failed/succeeded.)
How can I get a probability instead of a binary result from this algorithm? I have tried very different sets of numbers and would like to find out the probability of failing, instead of a 0 or 1.
Any help is strongly appreciated :) Thanks a lot!
Just replace
y_pred = classifier.predict(X_test)
with
y_pred = classifier.predict_proba(X_test)
For details, refer to Logistic Regression Probability.
predict_proba(X_test) gives you the probability of each class for each sample, i.e., if X_test contains n_samples rows and you have 2 classes, the output is an n_samples x 2 matrix, and the two class probabilities in each row sum to 1. For more details, have a look at the documentation.
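A short sketch of what that output looks like, using made-up data rather than the asker's CSV. The columns of predict_proba follow the order of classifier.classes_, so which column is the "probability of failing" depends on how failure is coded; here it is assumed to be class 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 100 samples, 3 features, binary 0/1 outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

classifier = LogisticRegression(random_state=0).fit(X, y)

proba = classifier.predict_proba(X[:5])   # shape (5, 2), one row per sample
classes = classifier.classes_             # column order of proba

# Assuming 0 means "failed": take the column that corresponds to class 0.
fail_col = list(classes).index(0)
p_fail = proba[:, fail_col]
```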
