How to get coefficients of multinomial logistic regression? - python

I need to calculate the coefficients of a multinomial logistic regression using sklearn:
X =
x1 x2 x3 x4 x5 x6
0.300000 0.100000 0.0 0.0000 0.5 0.0
0.000000 0.006000 0.0 0.0000 0.2 0.0
0.010000 0.678000 0.0 0.0000 2.0 0.0
0.000000 0.333000 1.0 12.3966 0.1 4.0
0.200000 0.005000 1.0 0.4050 1.0 0.0
0.000000 0.340000 1.0 15.7025 0.5 0.0
0.000000 0.440000 1.0 8.2645 0.0 4.0
0.500000 0.055000 1.0 18.1818 0.0 4.0
The values of y are categorical, in the range [1, 4].
y =
1
2
1
3
4
1
2
3
This is what I do:
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
h = .02
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)
# print the coefficients
print(logreg.intercept_)
print(logreg.coef_)
However, I get 6 columns in the output of logreg.intercept_ and 6 columns in the output of logreg.coef_. How can I get one coefficient per feature, e.g. the a to f values in the following?
y = a*x1 + b*x2 + c*x3 + d*x4 + e*x5 + f*x6
Also, probably I am doing something wrong, because y_pred = logreg.predict(X) gives me the value of 1 for all rows.

Check the online documentation:
coef_ : array, shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
As @Xochipilli has already mentioned in the comments, you are going to have (n_classes, n_features), or in your case (4, 6), coefficients and 4 intercepts (one for each class).
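A minimal sketch of what those shapes mean, assuming X and y are the arrays from the question: each of the 4 classes gets its own row of 6 coefficients plus its own intercept, and predict picks the class with the highest linear score.
import numpy as np
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logreg.fit(X, y)
print(logreg.coef_.shape)       # (4, 6): one row of coefficients per class
print(logreg.intercept_.shape)  # (4,): one intercept per class
# Each class k has its own score a_k*x1 + b_k*x2 + ... + f_k*x6 + intercept_k;
# the predicted class is the one with the highest score.
scores = X @ logreg.coef_.T + logreg.intercept_     # shape (n_samples, 4)
print(logreg.classes_[np.argmax(scores, axis=1)])   # matches logreg.predict(X)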
Probably I am doing something wrong, because y_pred =
logreg.predict(X) gives me the value of 1 for all rows.
Yes, you shouldn't use the data that you've used for training your model for prediction. Split your data into training and test sets, train your model on the training set, and check its accuracy on the test set, as sketched below.
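A minimal sketch of that workflow (the 8-row example above is really too small for a meaningful split, so treat this as the pattern rather than something to run on the toy data):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Hold out a quarter of the rows so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))  # accuracy on the held-out rows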

Related

Why does "StratifiedShuffleSplit" give the same result for every split of dataset?

I'm using StratifiedShuffleSplit to repeat the procedure of splitting the dataset, fitting, predicting, and computing a metric. Could you please explain why it gives the same result for every split?
import csv
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report
clf = RandomForestClassifier(max_depth = 5)
df = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/BigData/main/cll_dataset.csv")
X, y = df.iloc[:, 1:], df.iloc[:, 0]
sss = StratifiedShuffleSplit(n_splits = 5, test_size = 0.25, random_state = 0).split(X, y)
for train_ind, test_ind in sss:
    X_train, X_test = X.loc[train_ind], X.loc[test_ind]
    y_train, y_test = y.loc[train_ind], y.loc[test_ind]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    report = classification_report(y_test, y_pred, zero_division = 0, output_dict = True)
    report = pd.DataFrame(report).T
    report = report[:2]
    print(report)
The result is
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
Every model you build predicts class 0 for every sample, and because the split is stratified (each split keeps the same proportion of class 0 and class 1 as X), you get exactly the same report every time.
The model obtains better accuracy by always predicting class 0 than by learning any actual pattern or rule. This is a real problem. To address it you have some options:
Try modifying some hyperparameters of the random forest algorithm.
Collect more data to obtain a bigger dataset; you only test with 8 samples (though obtaining new data may be difficult for you).
Your data is imbalanced (more samples of class 0 than class 1); consider balancing it with SMOTE from the imbalanced-learn library, as sketched below.
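A minimal sketch of the SMOTE option, assuming the imbalanced-learn package (imported as imblearn) is installed and reusing one train/test split from the question; note that SMOTE's default of 5 neighbors needs more minority samples than this tiny dataset has, hence k_neighbors=1:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
# Oversample only the training split, never the test split,
# so the evaluation still reflects the true class balance.
sm = SMOTE(k_neighbors=1, random_state=0)
X_res, y_res = sm.fit_resample(X_train, y_train)
clf = RandomForestClassifier(max_depth=5)
clf.fit(X_res, y_res)
print(clf.score(X_test, y_test))  # scored on the untouched test split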

how to test the data after training the train data with k-fold cross validation?

Here in the code, I have:
Split the dataset into two parts: a train set and a test set (7:3). The dataset consists of 200 rows and 9394 columns.
Define the model
cross validation used: 10 folds on train set
accuracy obtained for each fold
mean accuracy obtained: 94.29%
My confusion is:
Am I doing this the right way?
Is cross_val_predict() used in the right way to predict over the test data?
Tasks remaining:
To plot accuracy of model.
To plot loss of model.
Can anyone suggest anything in this regard?
Sorry for the long notes!
The dataset looks like this (the values are the tf-idf of each word in the title and body of the news):
Unnamed: 0 Unnamed: 0.1 Label Cosine_Similarity c0 c1 c2 c3 c4 c5 ... c9386 c9387 c9388 c9389 c9390 c9391 c9392 c9393 c9394 c9395
0 0 0 Real 0.180319 0.000000 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1 1 Real 0.224159 0.166667 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 2 Real 0.233877 0.142857 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 3 3 Real 0.155789 0.111111 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4 4 Real 0.225480 0.000000 0.0 0.111111 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The code and output are:
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder
df_all = pd.read_csv("C:/Users/shiva/Desktop/allinone200.csv")
dataset = df_all.values
x = dataset[0:, 3:]
Y = dataset[0:, 2]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
y = np_utils.to_categorical(encoded_Y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=15, shuffle=True)
x_train.shape, y_train.shape
def baseline_model():
    model = Sequential()
    model.add(Dense(512, activation='relu', input_dim=x_train.shape[1]))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
code for fitting the model:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold
estimator = KerasClassifier(build_fn=baseline_model, epochs=5, batch_size=4, verbose=1)
kf = KFold(n_splits=10, shuffle=True, random_state=15)
for train_index, test_index in kf.split(x_train, y_train):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
code for taking out the results:
from sklearn.model_selection import cross_val_score
results = cross_val_score(estimator, x_train, y_train, cv=kf)
print(results)
Output:
[0.9285714 1. 0.9285714 1. 0.78571427 0.85714287
1. 1. 0.9285714 1. ]
Mean accuracy:
print("Accuracy: %0.2f (+/-%0.2f)" % (results.mean()*100, results.std()*2))
Output:
Accuracy: 94.29 (+/-0.14)
code for prediction:
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(estimator, x_test, y_test,cv=kf)
print(y_test[0])
print(y_pred[0])
Output (after processing):
[1. 0.]
0
Here the prediction seems okay, because 1 is REAL and 0 is FALSE: y_test is 0 and y_pred is also 0.
Confusion matrix:
import numpy as np
y_test=np.argmax(y_test, axis=1)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
Output:
array([[32, 0],
[ 1, 27]], dtype=int64)
Subject to Andreas's comment about the number of your observations, does this help you in any way: Keras - Plot training, validation and test set accuracy
Best
Unfortunately my comment became too long, therefore I'll try it here:
Please have a look at this: https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e In short, larger batch sizes often give worse results but are faster, which in your case might be irrelevant (200 rows).
Secondly, you do not have a (reusable) hold-out set, which can give you false assumptions about your true accuracy. Getting an accuracy of over 90% on the first try can mean: overfitting, leakage, or imbalanced data (see https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html), or that you were really lucky.
K-fold in combination with small sample sizes can lead to wrong assumptions:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224365
A few rules of thumb:
1. You want to have at least 2x as many data points (rows) as features (columns).
2. If you still get a good result, it can mean multiple things. Most likely it's an error in code or methodology.
Imagine you have to predict the fraud risk for a bank. If the chance that a fraud happens is 1%, I can build you a model which is right 99% of the time by simply saying there is never any fraud; the sketch below makes this baseline concrete.
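A hedged illustration of that baseline using sklearn's DummyClassifier, reusing the x_train/x_test split from the question:
from sklearn.dummy import DummyClassifier
# Always predicts the majority class; with 1% fraud this is "right" 99% of the time.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x_train, y_train)
print(baseline.score(x_test, y_test))  # any real model should beat this score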
Neural nets are extremely powerful, which is both good and bad. The bad part is that they nearly always find some kind of pattern, even if there isn't one. If you give them 2000 columns, it essentially becomes a bit like the digits of pi: if you search long enough after the decimal point, you will find any number combination you want.
It's explained in a bit more detail here:
https://medium.com/@jennifer.zzz/more-features-than-data-points-in-linear-regression-5bcabba6883e
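A minimal sketch of the hold-out workflow described above, reusing the estimator and split from the question: cross-validate on the training portion only, then fit once and score once on the untouched test set.
from sklearn.model_selection import KFold, cross_val_score
# Model selection: cross-validation sees only the training portion.
kf = KFold(n_splits=10, shuffle=True, random_state=15)
cv_scores = cross_val_score(estimator, x_train, y_train, cv=kf)
print("CV accuracy: %.2f%%" % (cv_scores.mean() * 100))
# Final check: fit on all training data, evaluate a single time on the hold-out.
estimator.fit(x_train, y_train)
print("Hold-out accuracy:", estimator.score(x_test, y_test))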

statsmodels anova_lm returns PR>F 0.0: How to change the number of decimals?

I'm performing ANOVA on a pandas dataframe using statsmodels anova_lm.
The returned significance level PR(>F) is 0.0. I assume this is a rounded value, but rounded to how many decimal places?
Is there a way to specify the number of decimal places in statsmodels?
my code:
from statsmodels.formula.api import ols
import statsmodels.api as sm
formula = 'consensus_rate ~ C(strategy) + np.power(nr_clues,' + str(exp) +') + shared_ratio + primacy_weight + edges_per_node '
lm = ols(formula, data=bigdf).fit()
sm.stats.anova_lm(lm, typ=2)
returns
>>>> sum_sq df F PR(>F)
C(strategy) 1.909980e+06 3.0 15196.209763 0.0
np.power(nr_clues, 0.1) 5.159021e+05 1.0 12313.884367 0.0
shared_ratio 7.383109e+05 1.0 17622.480378 0.0
primacy_weight 2.099998e+05 1.0 5012.410347 0.0
edges_per_node 8.457493e+04 1.0 2018.689015 0.0
Residual 3.013158e+05 7192.0 NaN NaN
PR(>F) is probably just smaller than 0.000001; anything below that rounds to zero at six displayed decimals.
Looking at other statsmodels anova tables, it seems statsmodels displays floats with 6 decimals.
For example:
df sum_sq mean_sq F PR(>F)
C(Fitness) 2.0 672.0 336.000000 16.961538 0.000041
Residual 21.0 416.0 19.809524 NaN NaN
sourced from: https://www.statsmodels.org/stable/examples/notebooks/generated/interactions_anova.html
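Since anova_lm returns an ordinary pandas DataFrame, the displayed precision is a pandas setting rather than a statsmodels one. A minimal sketch, reusing lm from the question:
import pandas as pd
table = sm.stats.anova_lm(lm, typ=2)
# Widen the display; the stored values are full-precision floats.
pd.set_option('display.float_format', '{:.3e}'.format)
print(table)
# Or read the exact p-values directly, bypassing display rounding.
print(table['PR(>F)'].values)
If a stored value really is exactly 0.0, the p-value underflowed double precision rather than being rounded for display.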

Python - How to do prediction and testing over multiple files using sklearn

I want to train a model and finally predict a truth value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I wanted to predict the current value of Y (the true value) using the last (for example: 5, 10, 100, 300, 1000, etc.) data points of X, using sklearn's random forest model in Python. Meaning, taking [0, 0, 1, 2, 3] of the X column as input for the first window, I want to predict the 5th row's value of Y, trained on the previous values of Y.
Let's say we have 5 traces of the dataset (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) in the current directory. For a single trace (for example, a1.csv) I can do the prediction with a window of 5 as follows:
import pandas as pd
import numpy as np
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
df = pd.read_csv('a1.csv')
for i in range(1, 5):
    df['X_t' + str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X=pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
reg = RandomForestRegressor(criterion='mse')
reg.fit(X,y)
modelPred = reg.predict(X)
print(modelPred)
print("Number of predictions:",len(modelPred))
modelPred.tofile('predictedValues1.txt',sep="\n",format="%s")
meanSquaredError=mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)
I have solved this problem with random forest, which yields df:
time X Y X_t1 X_t2 X_t3 X_t4
0 0.000543 0 10 NaN NaN NaN NaN
1 0.000575 0 10 0.0 NaN NaN NaN
2 0.041324 1 10 0.0 0.0 NaN NaN
3 0.041331 2 10 1.0 0.0 0.0 NaN
4 0.041336 3 10 2.0 1.0 0.0 0.0
5 0.041340 4 10 3.0 2.0 1.0 0.0
6 0.041345 5 10 4.0 3.0 2.0 1.0
7 0.041350 6 10 5.0 4.0 3.0 2.0
.........................................................
[2845 rows x 7 columns]
[ 10. 10. 10. ..., 20. 20. 20.]
RMSE: 0.5136564734333562
However, now I want to do the prediction over all of the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv), dividing them into 60% for training and the remaining 40% for testing (meaning 3 traces will be used for training and 2 files for testing), using sklearn in Python. How can I do that?
PS: All the files have the same structure but they are with different lengths for they are generated with different parameters.
import glob, os
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old versions
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# get your X and Y Df's
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.40)
To read in multiple files, you'll need a slight extension. Aggregate data from each csv, then call pd.concat to join them:
df_list = []
for i in range(1, 6):
    df_list.append(pd.read_csv('a%d.csv' % i))
df = pd.concat(df_list)
This will read in all your csvs, and you can carry on as usual. Get X and y:
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
Use train_test_split (from sklearn.model_selection in current versions; the old sklearn.cross_validation module has been removed) to segment your data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
You can also look at StratifiedKFold.
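If the 60/40 split should be by file (3 traces for training, 2 for testing) rather than by row, here is a hedged sketch; it builds the lag features per file so no window spans two traces, assuming file names a1.csv through a5.csv as in the question:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
def lag_features(df, window=5):
    # Build the lagged X columns within a single trace.
    out = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(window)})
    out['Y'] = df['Y']
    return out.dropna()
train = pd.concat(lag_features(pd.read_csv('a%d.csv' % i)) for i in (1, 2, 3))
test = pd.concat(lag_features(pd.read_csv('a%d.csv' % i)) for i in (4, 5))
reg = RandomForestRegressor()
reg.fit(train.drop(columns='Y'), train['Y'])
pred = reg.predict(test.drop(columns='Y'))
print("Test MSE:", mean_squared_error(test['Y'], pred))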

How to train and predict a model using Random Forest?

How can we train and use a random forest model for prediction? I want to train a model and finally predict a truth value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I wanted to predict the current value of Y (the true value) using the last (for example: 5, 10, 100, 300, 1000, etc.) data points of X, using sklearn's random forest model in Python. Meaning, taking [0, 0, 1, 2, 3] of the X column as input for the first window, I want to predict the 5th row's value of Y, trained on the previous values of Y. Similarly, with a simple rolling OLS regression model we can do it as follows, but I wanted to do it with a random forest model.
import pandas as pd
df = pd.read_csv('data_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
                               window_type='rolling', window=5, intercept=True)
I have solved this problem with random forest, which yields df:
t_stamp X Y X_t1 X_t2 X_t3 X_t4 X_t5
0.000543 0 10 NaN NaN NaN NaN NaN
0.000575 0 10 0.0 NaN NaN NaN NaN
0.041324 1 10 0.0 0.0 NaN NaN NaN
0.041331 2 10 1.0 0.0 0.0 NaN NaN
0.041336 3 10 2.0 1.0 0.0 0.0 NaN
0.041340 4 10 3.0 2.0 1.0 0.0 0.0
0.041345 5 10 4.0 3.0 2.0 1.0 0.0
0.041350 6 10 5.0 4.0 3.0 2.0 1.0
0.041354 7 10 6.0 5.0 4.0 3.0 2.0
.........................................................
[ 10. 10. 10. 10. .................................]
MSE: 1.3273548431
This seems to work fine for window sizes 5, 10, 15, 20 and 22. However, it doesn't work for windows larger than 23 (it prints MSE: 0.0), because, as you can see from the dataset, the values of Y are fixed at 10 from rows 1 to 23 and then change to another value (20, and so on) from row 24. How can we train and predict a model in such cases based on the last data points?
It seems that in the existing code, when calling dropna you truncate X but not y. You also train and test on the same data. Fixing both gives a non-zero MSE.
Code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
df = pd.read_csv('/Users/shivadeviah/Desktop/estimated_pred.csv')
df1 = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(25)})
df1['Y'] = df['Y']
df1 = df1.sample(frac=1).reset_index(drop=True)
df1.dropna(inplace=True)
X = df1.iloc[:, :-1].values
y = df1.iloc[:, -1].values
x = int(len(X) * 0.66)
X_train = X[:x]
X_test = X[x:]
y_train = y[:x]
y_test = y[x:]
reg = RandomForestRegressor(criterion='mse')
reg.fit(X_train, y_train)
modelPred = reg.predict(X_test)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError = mean_squared_error(y_test, modelPred)
print("MSE:", meanSquaredError)
print(df1.size)
df2 = df1.iloc[x:, :].copy()
df2['pred'] = modelPred
df2.head()
Output:
[ 267.7 258.26608241 265.07037249 ..., 267.27370169 256.7 272.2 ]
Number of predictions: 87891
MSE: 1954.9271256
6721026
X_0 pred
170625 48 267.700000
170626 66 258.266082
170627 184 265.070372
170628 259 294.700000
170629 271 281.966667
