How to train and predict a model using Random Forest? - python

How can we train a random forest model and use it for prediction? I want to train a model and then predict the true value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I want to predict the current value of Y (the true value) from the last N data points of X (for example N = 5, 10, 100, 300, 1000, etc.) using the random forest model of sklearn in Python. That is, taking [0,0,1,2,3] of the X column as the input for the first window, I want to predict the 5th-row value of Y, with the model trained on the previous values of Y. With a simple rolling OLS regression model we can do it as follows, but I want to do it with a random forest model:
import pandas as pd
df = pd.read_csv('data_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
window_type='rolling', window=5, intercept=True)
I have solved this problem with random forest, which yields df:
t_stamp X Y X_t1 X_t2 X_t3 X_t4 X_t5
0.000543 0 10 NaN NaN NaN NaN NaN
0.000575 0 10 0.0 NaN NaN NaN NaN
0.041324 1 10 0.0 0.0 NaN NaN NaN
0.041331 2 10 1.0 0.0 0.0 NaN NaN
0.041336 3 10 2.0 1.0 0.0 0.0 NaN
0.041340 4 10 3.0 2.0 1.0 0.0 0.0
0.041345 5 10 4.0 3.0 2.0 1.0 0.0
0.041350 6 10 5.0 4.0 3.0 2.0 1.0
0.041354 7 10 6.0 5.0 4.0 3.0 2.0
.........................................................
[ 10. 10. 10. 10. .................................]
MSE: 1.3273548431
This seems to work fine for window sizes 5, 10, 15, 20 and 22. However, it does not seem to work for window sizes greater than 23 (it prints MSE: 0.0). As you can see from the dataset, the value of Y is fixed (10) for rows 1 - 23 and then changes to another value (20, and so on) from row 24. How can we train a model and predict in such cases based on the last data points?

It seems that with the existing code, when calling dropna, you truncate X but not y. You also train and test on the same data.
Fixing both issues will give a non-zero MSE.
Code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
df = pd.read_csv('/Users/shivadeviah/Desktop/estimated_pred.csv')
df1 = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(25)})
df1['Y'] = df['Y']
df1 = df1.sample(frac=1).reset_index(drop=True)
df1.dropna(inplace=True)
X = df1.iloc[:, :-1].values
y = df1.iloc[:, -1].values
x = int(len(X) * 0.66)
X_train = X[:x]
X_test = X[x:]
y_train = y[:x]
y_test = y[x:]
reg = RandomForestRegressor()  # default criterion is squared error (called 'mse' in older scikit-learn)
reg.fit(X_train, y_train)
modelPred = reg.predict(X_test)
print(modelPred)
print("Number of predictions:",len(modelPred))
meanSquaredError = mean_squared_error(y_test, modelPred)
print("MSE:", meanSquaredError)
print(df1.size)
df2 = df1.iloc[x:, :].copy()
df2['pred'] = modelPred
df2.head()
Output:
[ 267.7 258.26608241 265.07037249 ..., 267.27370169 256.7 272.2 ]
Number of predictions: 87891
MSE: 1954.9271256
6721026
X_0 pred
170625 48 267.700000
170626 66 258.266082
170627 184 265.070372
170628 259 294.700000
170629 271 281.966667
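The two fixes named in this answer (dropping the same rows from features and target, and testing on data the model has not seen) can be sketched as follows. This is only an illustration under assumptions: the data is synthetic rather than the asker's CSV, the step-shaped Y and the window size of 5 are made up, and the split is chronological rather than shuffled.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the CSV (assumption): X is a noisy ramp, Y a step function of X.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"X": np.arange(n) + rng.normal(0, 2, n)})
df["Y"] = 10 * (1 + (df["X"] // 100).clip(lower=0))

window = 5  # number of X values (current plus previous) used as features

# Keep the lagged features and Y in ONE frame, so that dropna()
# removes the same rows from both features and target.
lagged = pd.DataFrame({f"X_{i}": df["X"].shift(i) for i in range(window)})
lagged["Y"] = df["Y"]
lagged = lagged.dropna()

X = lagged.drop(columns="Y").to_numpy()
y = lagged["Y"].to_numpy()

# Chronological split: fit on the first 66 %, evaluate on the remainder.
split = int(len(X) * 0.66)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
```

One caveat with a chronological split: a random forest cannot predict target values outside the range it saw during training, so if Y keeps stepping up to new levels in the test period, the error will be large regardless of the window size.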

Related

Fitting model with NaN values in the features

I would like to know if there is a way to fit a model even when some features contain NaN values.
X
Feature1 Feature2 Feature3 Feature4 Feature5
0 0.1 NaN 0.3 NaN 4.0
1 4.0 6.0 6.6 99.0 2.0
2 11.0 15.0 2.2 3.3 NaN
3 1.0 6.0 2.0 2.5 4.0
4 5.0 11.2 NaN 3.0 NaN
Code
model = LogisticRegression()
model.fit(X_train, y_train)
Error ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Usually, tree-based classifiers can handle NaNs as they just split the dataset based on the feature values. Of course, it also depends on how the algorithm is implemented.
I am not sure about sklearn, but if you really want to classify the samples while preserving the NaN values, your best choice is to use XGBoost. It is not part of sklearn, but it has a very good Python library that is easy to use. It is also one of the most powerful classifiers, so you should definitely try it!
https://xgboost.readthedocs.io/en/latest/python/python_intro.html
You can use SimpleImputer() to replace NaN with the mean value or with a constant before fitting the model. Have a look at the documentation to find the strategy that works for your use case.
In your case, if you want to keep the rows that have NaN values but take those values out of the equation, you can simply replace NaN with 0 using SimpleImputer(strategy='constant', fill_value=0)
As follows:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(
    SimpleImputer(strategy='constant', fill_value=0),
    LinearRegression()
)
model.fit(X, y)
Note: I am using a pipeline here to run all the steps in one go.
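The same pipeline idea also works with the classifier from the question. Here is a minimal sketch using mean imputation and LogisticRegression on the five rows shown in the question; the labels y are hypothetical, since the question does not show them.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Feature matrix from the question; the labels are made up for illustration.
X = np.array([
    [0.1, np.nan, 0.3, np.nan, 4.0],
    [4.0, 6.0, 6.6, 99.0, 2.0],
    [11.0, 15.0, 2.2, 3.3, np.nan],
    [1.0, 6.0, 2.0, 2.5, 4.0],
    [5.0, 11.2, np.nan, 3.0, np.nan],
])
y = np.array([0, 1, 1, 0, 1])  # hypothetical labels

model = make_pipeline(
    SimpleImputer(strategy="mean"),      # NaN -> column mean, learned in fit()
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
print(model.predict(X))

# The fitted imputer stores the per-column means it uses as fill values.
fill_values = model.named_steps["simpleimputer"].statistics_
print(fill_values)
```

Because the imputer sits inside the pipeline, the column means are learned from the training data only and are then reused unchanged when predicting on new data.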

how to test the data after training the train data with k-fold cross validation?

Here in the code, I have:
Split the dataset into two parts: train set and test set (7:3). The dataset consists of 200 rows and 9394 columns.
Defined the model
Used cross-validation: 10 folds on the train set
Obtained the accuracy for each fold
Obtained the mean accuracy: 94.29%
My questions are:
Am I doing this the right way?
Is cross_val_predict() used in the right way to predict over the test data?
Tasks remaining:
To plot the accuracy of the model.
To plot the loss of the model.
Can anyone advise on this?
Sorry for the long post!
The dataset is as follows (these are the tf-idf values of each word in the title and body of the news items):
Unnamed: 0 Unnamed: 0.1 Label Cosine_Similarity c0 c1 c2 c3 c4 c5 ... c9386 c9387 c9388 c9389 c9390 c9391 c9392 c9393 c9394 c9395
0 0 0 Real 0.180319 0.000000 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1 1 Real 0.224159 0.166667 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 2 Real 0.233877 0.142857 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 3 3 Real 0.155789 0.111111 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4 4 Real 0.225480 0.000000 0.0 0.111111 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The code and output are:
df_all = pd.read_csv("C:/Users/shiva/Desktop/allinone200.csv")
dataset=df_all.values
x=dataset[0:,3:]
Y= dataset[0:,2]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
y = np_utils.to_categorical(encoded_Y)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=15,shuffle=True)
x_train.shape,y_train.shape
def baseline_model():
    model = Sequential()
    model.add(Dense(512, activation='relu',input_dim=x_train.shape[1]))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
code for fitting the model:
estimator = KerasClassifier(build_fn=baseline_model, epochs=5, batch_size=4, verbose=1)
kf = KFold(n_splits=10, shuffle=True,random_state=15)
for train_index, test_index in kf.split(x_train,y_train):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
code for taking out the results:
results = cross_val_score(estimator, x_train, y_train, cv=kf)
print(results)
Output:
[0.9285714 1. 0.9285714 1. 0.78571427 0.85714287
1. 1. 0.9285714 1. ]
Mean accuracy:
print("Accuracy: %0.2f (+/-%0.2f)" % (results.mean()*100, results.std()*2))
Output:
Accuracy: 94.29 (+/-0.14)
code for prediction:
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(estimator, x_test, y_test,cv=kf)
print(y_test[0])
print(y_pred[0])
Output (after processing):
[1. 0.]
0
Here the prediction seems okay, because 1 is REAL and 0 is FALSE. y_test is 0 and y_pred is also 0.
Confusion matrix:
import numpy as np
y_test=np.argmax(y_test, axis=1)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
Output:
array([[32, 0],
[ 1, 27]], dtype=int64)
Subject to Andreas' comment related to the number of your observations, does this help you in any way: Keras - Plot training, validation and test set accuracy
Bests
Unfortunately my comment became too long, so I will try it here:
Please have a look at this: https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e In short, larger batch sizes often give worse results but are faster, which in your case might be irrelevant (200 rows).
Secondly, you do not have a (reusable) hold-out set, which might give you false assumptions regarding your true accuracy. An accuracy of over 90% on your first try can mean overfitting, leakage or imbalanced data (e.g. here: https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html), or that you were really lucky.
K-fold in combination with small sample sizes can lead to wrong assumptions:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224365
A few rules of thumb:
1. You want to have 2x as many data points (rows) as features (columns).
2. If you still get a good result, this can mean multiple things. Most likely it's an error in the code or the methodology.
Imagine you have to predict the fraud risk of a bank. If the chance that a fraud happens is 1%, I can build you a model which is right 99% of the time by simply saying there is never any fraud...
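The fraud example can be checked with a few lines of NumPy. The labels here are hypothetical (10 fraud cases in 1000 transactions), and the "model" simply predicts "no fraud" every time.

```python
import numpy as np

# Hypothetical labels: 1 % fraud (1), 99 % legitimate (0).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# "Model" that always predicts "no fraud".
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall: fraction of the actual fraud cases that were detected.
recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks excellent, yet not a single fraud case is found, which is why accuracy alone is a poor metric on imbalanced data.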
Neural nets are extremely powerful, which is both good and bad. The bad thing is that they nearly always find some kind of pattern, even if there isn't one. If you give them 2000 columns, it gets a bit like the number pi: if you search long enough in the digits after the decimal point, you will find any number combination you want.
Here it's explained in a bit more detail:
https://medium.com/@jennifer.zzz/more-features-than-data-points-in-linear-regression-5bcabba6883e
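On the cross_val_predict question itself: calling cross_val_predict(estimator, x_test, y_test, cv=kf) refits fresh copies of the model on folds of the test data, so the model validated on the training set is never the one being tested. A common alternative is sketched below: cross-validate on the training set, refit on the whole training set, then predict the hold-out set once. The sketch swaps in LogisticRegression and synthetic data for the Keras model and the tf-idf dataset, purely to stay short and self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the 200-row dataset (assumption).
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=15)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=15, shuffle=True)

clf = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=10, shuffle=True, random_state=15)

# 1) Estimate performance with cross-validation on the training set only.
scores = cross_val_score(clf, X_train, y_train, cv=kf)
print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

# 2) Refit on the whole training set, then predict the untouched test set once.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(cm)
```

If the test accuracy is much lower than the cross-validated accuracy, that gap is a hint of the overfitting or leakage discussed above.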

I want to fill the missing values in my z matrix with the mean of all the values of the z matrix

I want to fill missing data in columns with the means of their respective columns, using the code below:
#Data Preprocessing
#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Importing dataset
dataset = pd.read_csv('Book1.csv')
x = dataset.iloc[:, :-2].values
y = dataset.iloc[:, -2].values
z = dataset.iloc[:, 4].values
#Dealing with missing data
from sklearn.preprocessing import Imputer
imputer = Imputer()
imputer = imputer.fit(x[:,1:3])
imputer = imputer.fit(z[:])
x[:, 1:3] = imputer.transform(x[:, 1:3])
z[:] = imputer.transform(z[:])
When I try to run this I get an error:
Traceback (most recent call last):
File "<ipython-input-24-f33b6b1880df>", line 15, in <module>
imputer = imputer.fit(z[:])
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 1. 3. 4. nan 5. 7. 6. 9. 8. 10.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample
The dataset:
dataset
Out[37]:
Country Age Salary Testing Purchased
0 France 44.0 72000.0 1.0 No
1 Spain 27.0 48000.0 3.0 Yes
2 Germany 30.0 54000.0 4.0 No
3 Spain 38.0 61000.0 NaN No
4 Germany 40.0 NaN 5.0 Yes
5 France 35.0 58000.0 7.0 Yes
6 Spain NaN 52000.0 6.0 No
7 France 48.0 79000.0 9.0 Yes
8 Germany 50.0 83000.0 8.0 No
9 France 37.0 67000.0 10.0 Yes
What should I change in my code to fill the missing data in the 'Testing' column? I tried including the 'Testing' column in x
Apparently, you are using a single Imputer instance to impute both x (2D) and z (1D) arrays. You should create a separate imputer for each variable. The imputer also expects a 2D array, so reshape z as the error message suggests:
imputer_x = Imputer()
imputer_z = Imputer()
imputer_x.fit(x[:,1:3])
imputer_z.fit(z.reshape(-1, 1))
x[:, 1:3] = imputer_x.transform(x[:, 1:3])
z[:] = imputer_z.transform(z.reshape(-1, 1)).ravel()
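Note that Imputer was removed from sklearn.preprocessing in later scikit-learn releases; its replacement is sklearn.impute.SimpleImputer. A minimal sketch of mean-imputing a single 1D column with it, using the values from the traceback above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# 'Testing' column from the question, as a 1D array with one missing value.
z = np.array([1., 3., 4., np.nan, 5., 7., 6., 9., 8., 10.])

imputer_z = SimpleImputer(strategy="mean")
# reshape(-1, 1): one feature, many samples; ravel() flattens the result back to 1D.
z_filled = imputer_z.fit_transform(z.reshape(-1, 1)).ravel()
print(z_filled)
```

The missing entry is replaced by the mean of the nine observed values, 53/9 (about 5.89).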

How to get coefficients of multinomial logistic regression?

I need to calculate coefficients of a multiple logistic regression using sklearn:
X =
x1 x2 x3 x4 x5 x6
0.300000 0.100000 0.0 0.0000 0.5 0.0
0.000000 0.006000 0.0 0.0000 0.2 0.0
0.010000 0.678000 0.0 0.0000 2.0 0.0
0.000000 0.333000 1.0 12.3966 0.1 4.0
0.200000 0.005000 1.0 0.4050 1.0 0.0
0.000000 0.340000 1.0 15.7025 0.5 0.0
0.000000 0.440000 1.0 8.2645 0.0 4.0
0.500000 0.055000 1.0 18.1818 0.0 4.0
The values of y are categorical in range [1; 4].
y =
1
2
1
3
4
1
2
3
This is what I do:
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
h = .02
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)
# print the coefficients
print(logreg.intercept_)
print(logreg.coef_)
However, I get 6 columns in the output of logreg.intercept_ and 6 columns in the output of logreg.coef_. How can I get 1 coefficient per feature, e.g. the a - f values in:
y = a*x1 + b*x2 + c*x3 + d*x4 + e*x5 + f*x6
Also, probably I am doing something wrong, because y_pred = logreg.predict(X) gives me the value of 1 for all rows.
Check the online documentation:
coef_ : array, shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
As @Xochipilli has already mentioned in the comments, you are going to have (n_classes, n_features) coefficients, or in your case (4, 6), and 4 intercepts (one for each class).
"Probably I am doing something wrong, because y_pred = logreg.predict(X) gives me the value of 1 for all rows."
Yes, you shouldn't evaluate predictions on the same data that you used for training your model. Split your data into training and test sets, train your model on the training set and check its accuracy on the test set.
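To make the shapes concrete, here is a small sketch fitted on the eight rows from the question. It uses the default regularization rather than C=1e5, and max_iter=5000 is an arbitrary choice to avoid convergence warnings. With four classes there is one row of six coefficients per class, not a single a - f set, and predict() returns the class whose linear score is largest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X and y as listed in the question.
X = np.array([
    [0.30, 0.100, 0.0,  0.0000, 0.5, 0.0],
    [0.00, 0.006, 0.0,  0.0000, 0.2, 0.0],
    [0.01, 0.678, 0.0,  0.0000, 2.0, 0.0],
    [0.00, 0.333, 1.0, 12.3966, 0.1, 4.0],
    [0.20, 0.005, 1.0,  0.4050, 1.0, 0.0],
    [0.00, 0.340, 1.0, 15.7025, 0.5, 0.0],
    [0.00, 0.440, 1.0,  8.2645, 0.0, 4.0],
    [0.50, 0.055, 1.0, 18.1818, 0.0, 4.0],
])
y = np.array([1, 2, 1, 3, 4, 1, 2, 3])

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X, y)

print(logreg.classes_)          # [1 2 3 4]
print(logreg.coef_.shape)       # (4, 6)
print(logreg.intercept_.shape)  # (4,)

# Per-class linear scores: one row of coefficients per class.
scores = X @ logreg.coef_.T + logreg.intercept_
# predict() returns the class whose score is largest.
manual_pred = logreg.classes_[scores.argmax(axis=1)]
print((manual_pred == logreg.predict(X)).all())
```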

Python - How to do prediction and testing over multiple files using sklearn

I want to train a model and then predict the true value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I want to predict the current value of Y (the true value) from the last N data points of X (for example N = 5, 10, 100, 300, 1000, etc.) using the random forest model of sklearn in Python. That is, taking [0,0,1,2,3] of the X column as the input for the first window, I want to predict the 5th-row value of Y, with the model trained on the previous values of Y.
Let's say we have 5 traces (datasets) in the current directory: a1.csv, a2.csv, a3.csv, a4.csv and a5.csv. For a single trace (for example, a1.csv), I can do the prediction with a window of 5 as follows:
import pandas as pd
import numpy as np
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
df = pd.read_csv('a1.csv')
for i in range(1,5):
    df['X_t'+str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X=pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
reg = RandomForestRegressor()  # default criterion is squared error (called 'mse' in older scikit-learn)
reg.fit(X,y)
modelPred = reg.predict(X)
print(modelPred)
print("Number of predictions:",len(modelPred))
modelPred.tofile('predictedValues1.txt',sep="\n",format="%s")
meanSquaredError=mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)
I have solved this problem with random forest, which yields df:
time X Y X_t1 X_t2 X_t3 X_t4
0 0.000543 0 10 NaN NaN NaN NaN
1 0.000575 0 10 0.0 NaN NaN NaN
2 0.041324 1 10 0.0 0.0 NaN NaN
3 0.041331 2 10 1.0 0.0 0.0 NaN
4 0.041336 3 10 2.0 1.0 0.0 0.0
5 0.041340 4 10 3.0 2.0 1.0 0.0
6 0.041345 5 10 4.0 3.0 2.0 1.0
7 0.041350 6 10 5.0 4.0 3.0 2.0
.........................................................
[2845 rows x 7 columns]
[ 10. 10. 10. ..., 20. 20. 20.]
RMSE: 0.5136564734333562
However, now I want to do the prediction over all of the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) with sklearn in Python, using 60% of the datasets whose file names start with a for training and the remaining 40% for testing (meaning 3 traces will be used for training and 2 for testing). How can I do that?
PS: All the files have the same structure, but they have different lengths because they were generated with different parameters.
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# get your X and Y Df's
x_train,x_test,y_train,y_test=sklearn.model_selection.train_test_split(X,Y,test_size=0.40)
To read in multiple files, you'll need a slight extension. Aggregate data from each csv, then call pd.concat to join them:
df_list = []
for i in range(1, 6):
    df_list.append(pd.read_csv('a%d.csv' %i))
df = pd.concat(df_list)
This will read in all your csvs, and you can carry on as usual. Get X and y:
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
Use sklearn.model_selection.train_test_split to segment your data (it lived in sklearn.cross_validation in older scikit-learn releases):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
You can also look at StratifiedKFold.
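Two details are worth keeping in mind with the concat-then-split approach: train_test_split shuffles individual rows, so every file ends up in both the training and the test set, and shift() on a concatenated frame lets a window straddle the boundary between two files. If the goal is really "3 traces for training, 2 for testing", one option is to build the lag features per file and split by file. The sketch below is only an illustration: make_trace is a hypothetical stand-in for pd.read_csv('aN.csv'), and the trace lengths are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def make_trace(n, seed):
    """Hypothetical stand-in for pd.read_csv('aN.csv'): columns X and Y."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.integers(0, 3, size=n))
    return pd.DataFrame({"X": x, "Y": 10 * (1 + x // 50)})

def lag_features(df, window=5):
    """Lag X within ONE trace and drop the incomplete first rows."""
    out = pd.DataFrame({f"X_{i}": df["X"].shift(i) for i in range(window)})
    out["Y"] = df["Y"]
    return out.dropna()

# Five traces of different lengths, as in the question.
traces = [lag_features(make_trace(n, seed))
          for seed, n in enumerate([200, 250, 300, 220, 270])]

# File-wise split: 3 traces for training, 2 for testing.
train = pd.concat(traces[:3], ignore_index=True)
test = pd.concat(traces[3:], ignore_index=True)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(train.drop(columns="Y"), train["Y"])
pred = reg.predict(test.drop(columns="Y"))
print("RMSE:", np.sqrt(mean_squared_error(test["Y"], pred)))
```

With real files, replacing make_trace(n, seed) by pd.read_csv on each file name should be the only change needed, assuming every file has X and Y columns.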
