I want to train a model and finally predict a truth value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I want to predict the current value of Y (the true value) using the last N (for example: 5, 10, 100, 300, 1000, etc.) data points of X, using sklearn's random forest model in Python. That is, taking [0,0,1,2,3] from the X column as the input for the first window, I want to predict the value of Y in the 5th row, trained on the previous values of Y.
Let's say we have 5 traces of the dataset (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) in the current directory. For a single trace (for example, a1.csv), I can do the prediction with a window of 5 as follows:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

df = pd.read_csv('a1.csv')

# Add the previous 4 values of X as lagged feature columns.
for i in range(1, 5):
    df['X_t' + str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)

# Feature matrix: current X plus its 4 lags; target: current Y.
X = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values

reg = RandomForestRegressor(criterion='mse')
reg.fit(X, y)
modelPred = reg.predict(X)
print(modelPred)
print("Number of predictions:", len(modelPred))
modelPred.tofile('predictedValues1.txt', sep="\n", format="%s")

meanSquaredError = mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)
I have solved this problem with random forest, which yields df:
time X Y X_t1 X_t2 X_t3 X_t4
0 0.000543 0 10 NaN NaN NaN NaN
1 0.000575 0 10 0.0 NaN NaN NaN
2 0.041324 1 10 0.0 0.0 NaN NaN
3 0.041331 2 10 1.0 0.0 0.0 NaN
4 0.041336 3 10 2.0 1.0 0.0 0.0
5 0.041340 4 10 3.0 2.0 1.0 0.0
6 0.041345 5 10 4.0 3.0 2.0 1.0
7 0.041350 6 10 5.0 4.0 3.0 2.0
.........................................................
[2845 rows x 7 columns]
[ 10. 10. 10. ..., 20. 20. 20.]
RMSE: 0.5136564734333562
However, now I want to do the prediction over all of the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) by splitting the files whose names start with 'a' into 60% for training and the remaining 40% for testing, using sklearn in Python (meaning 3 traces will be used for training and 2 files for testing). How can I do this?
PS: All the files have the same structure but they are with different lengths for they are generated with different parameters.
import glob, os
from sklearn.model_selection import train_test_split

df = pd.read_csv('a1.csv')
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# get your X and y from df first, then:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
To read in multiple files, you'll need a slight extension. Aggregate data from each csv, then call pd.concat to join them:
df_list = []
for i in range(1, 6):
    df_list.append(pd.read_csv('a%d.csv' % i))
df = pd.concat(df_list)
This will read in all your csvs, and you can carry on as usual. Get X and y:
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
Use train_test_split (in sklearn.model_selection; it lived in sklearn.cross_validation in older scikit-learn releases) to segment your data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
You can also look at StratifiedKFold.
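Note that a random row-level split mixes samples from all five traces. If you want the split the question literally asks for (3 whole traces for training, 2 for testing), a minimal sketch might look like the following; the filenames and the window size of 5 are assumptions, and the lag features are built per file so that no window spans two traces:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def make_xy(df, window=5):
    # Build lag features X_0..X_{window-1}; drop rows without full history.
    lags = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(window)})
    lags['Y'] = df['Y']
    lags = lags.dropna()
    return lags.drop(columns='Y').values, lags['Y'].values

train_files = ['a1.csv', 'a2.csv', 'a3.csv']  # 60% of the traces
test_files = ['a4.csv', 'a5.csv']             # the remaining 40%

X_train, y_train = map(np.concatenate, zip(*(make_xy(pd.read_csv(f)) for f in train_files)))
X_test, y_test = map(np.concatenate, zip(*(make_xy(pd.read_csv(f)) for f in test_files)))

reg = RandomForestRegressor()
reg.fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, reg.predict(X_test)))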
I am working on a data science project with the FIFA dataset. I cleaned the data and took care of any NaN values to get it ready to be split into test and train sets. I need to use StratifiedShuffleSplit to split the data. I have updated the code to a cleaner way of dividing the value data into groups, but I am still getting NaN values once it goes through the split.
Link to the data set I am using: https://www.kaggle.com/karangadiya/fifa19
n = fifa['value'].count()
folds = 3
fifa.sort_values('value', ascending=False, inplace=True)
fifa['group_id'] = np.floor(np.arange(n)/folds)
fifa['value_cat'] = fifa.groupby('group_id', as_index = False)['name'].transform(lambda x: np.random.choice(v_cats, size=x.size, replace = False))
At this point, when I check the test and train data, I have mystery NaN values introduced. I think the NaN values may be a result of .loc, since I am getting a warning in Jupyter:
c:\python37\lib\site-packages\ipykernel_launcher.py:6: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Code below:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(fifa, fifa['value_cat']):
    strat_train_set = fifa.loc[train_index]
    strat_test_set = fifa.loc[test_index]
fifa = strat_train_set.drop('value', axis=1)
value_labels = strat_train_set['value'].copy()
PLEASE HELP MY POOR SOUL!!
Here's one solution.
import numpy as np
import pandas as pd
n = 100
folds = 3
# Make some data
df = pd.DataFrame({'id':np.arange(n), 'value':np.random.lognormal(mean=10, sigma=1, size=n)})
# Sort by value
df.sort_values('value', ascending=False, inplace=True)
# Insert 'group' ids, 0, 0, 0, 1, 1, 1, 2, 2, 2, ...
df['group_id'] = np.floor(np.arange(n)/folds)
# Randomly assign folds within each group
df['fold'] = df.groupby('group_id', as_index=False)['id'].transform(lambda x: np.random.choice(folds, size=x.size, replace=False))
# Inspect
df.head(10)
id value group_id fold
46 46 208904.679048 0.0 0
3 3 175730.118616 0.0 2
0 0 137067.103600 0.0 1
87 87 101894.243831 1.0 2
11 11 100570.573379 1.0 1
90 90 93681.391254 1.0 0
73 73 92462.150435 2.0 2
13 13 90349.408620 2.0 1
86 86 87568.402021 2.0 0
88 88 82581.010789 3.0 1
Assuming you want k folds, the idea is to sort the data by value, then randomly assign the fold labels 0, 1, ..., k-1 to the first k rows, do the same for the next k rows, and so on. Each fold then gets a stratified sample across the whole range of values.
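To turn the fold column into an actual split, hold out one fold label (using fold 0 as the test set here is an arbitrary choice):
# Hypothetical usage: hold out fold 0 and train on the rest.
test_set = df[df['fold'] == 0]
train_set = df[df['fold'] != 0]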
By the way, you will have more luck getting answers to questions here if you can create reproducible examples with data that make it easy for others to tinker with. :)
I want to fill missing data in columns with the means of their respective columns, using the code below:
#Data Preprocessing
#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Importing dataset
dataset = pd.read_csv('Book1.csv')
x = dataset.iloc[:, :-2].values
y = dataset.iloc[:, -2].values
z = dataset.iloc[:, 4].values
#Dealing with missing data
from sklearn.preprocessing import Imputer
imputer = Imputer()
imputer = imputer.fit(x[:,1:3])
imputer = imputer.fit(z[:])
x[:, 1:3] = imputer.transform(x[:, 1:3])
z[:] = imputer.transform(z[:])
When I try to run this I get an error:
Traceback (most recent call last):
File "<ipython-input-24-f33b6b1880df>", line 15, in <module>
imputer = imputer.fit(z[:])
File "C:\ProgramData\Anaconda3\lib\site- packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 1. 3. 4. nan 5. 7. 6. 9. 8. 10.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample
The dataset:
dataset
Out[37]:
Country Age Salary Testing Purchased
0 France 44.0 72000.0 1.0 No
1 Spain 27.0 48000.0 3.0 Yes
2 Germany 30.0 54000.0 4.0 No
3 Spain 38.0 61000.0 NaN No
4 Germany 40.0 NaN 5.0 Yes
5 France 35.0 58000.0 7.0 Yes
6 Spain NaN 52000.0 6.0 No
7 France 48.0 79000.0 9.0 Yes
8 Germany 50.0 83000.0 8.0 No
9 France 37.0 67000.0 10.0 Yes
What should I change in my code to fill the missing data in the 'Testing' column? I tried including the 'Testing' column in x.
Apparently, you are using a single Imputer instance to impute both the x (2D) and z (1D) arrays, and Imputer only accepts 2D input, which is what the traceback complains about. Create a separate imputer for each variable and reshape z into a single-column 2D array before fitting:
imputer_x = Imputer()
imputer_z = Imputer()
imputer_x.fit(x[:, 1:3])
imputer_z.fit(z.reshape(-1, 1))
x[:, 1:3] = imputer_x.transform(x[:, 1:3])
z[:] = imputer_z.transform(z.reshape(-1, 1)).ravel()
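As a side note, Imputer was deprecated in scikit-learn 0.20 and later removed in favor of SimpleImputer; a minimal sketch of the same fix with the newer API:
from sklearn.impute import SimpleImputer

# Mean imputation (the default strategy), one imputer per array.
imputer_x = SimpleImputer(strategy='mean')
imputer_z = SimpleImputer(strategy='mean')
x[:, 1:3] = imputer_x.fit_transform(x[:, 1:3])
z[:] = imputer_z.fit_transform(z.reshape(-1, 1)).ravel()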
I need to calculate coefficients of a multiple logistic regression using sklearn:
X =
x1 x2 x3 x4 x5 x6
0.300000 0.100000 0.0 0.0000 0.5 0.0
0.000000 0.006000 0.0 0.0000 0.2 0.0
0.010000 0.678000 0.0 0.0000 2.0 0.0
0.000000 0.333000 1.0 12.3966 0.1 4.0
0.200000 0.005000 1.0 0.4050 1.0 0.0
0.000000 0.340000 1.0 15.7025 0.5 0.0
0.000000 0.440000 1.0 8.2645 0.0 4.0
0.500000 0.055000 1.0 18.1818 0.0 4.0
The values of y are categorical, in the range [1, 4].
y =
1
2
1
3
4
1
2
3
This is what I do:
import pandas as pd
import numpy as np
from sklearn import linear_model

logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)
# print the coefficients
print(logreg.intercept_)
print(logreg.coef_)
However, I get 6 columns in the output of logreg.intercept_ and 6 columns in the output of logreg.coef_. How can I get 1 coefficient per feature, e.g. the values a-f in:
y = a*x1 + b*x2 + c*x3 + d*x4 + e*x5 + f*x6
Also, probably I am doing something wrong, because y_pred = logreg.predict(X) gives me the value of 1 for all rows.
Check the online documentation:
coef_ : array, shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
As @Xochipilli has already mentioned in the comments, you are going to have (n_classes, n_features), or in your case (4, 6), coefficients and 4 intercepts (one for each class).
Probably I am doing something wrong, because y_pred =
logreg.predict(X) gives me the value of 1 for all rows.
Yes, you shouldn't use the data you trained your model on to judge its predictions. Split your data into training and test sets, train your model on the training set, and check its accuracy on the test set.
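A minimal sketch of that workflow, assuming X and y are defined as above:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, logreg.predict(X_test)))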
How can we make predictions with a random forest model? I want to train a model and finally predict a truth value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I want to predict the current value of Y (the true value) using the last N (for example: 5, 10, 100, 300, 1000, etc.) data points of X, using sklearn's random forest model in Python. That is, taking [0,0,1,2,3] from the X column as the input for the first window, I want to predict the value of Y in the 5th row, trained on the previous values of Y. Similarly, using a simple rolling OLS regression model, we can do it as follows, but I wanted to do it using a random forest model:
import pandas as pd
df = pd.read_csv('data_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
                               window_type='rolling', window=5, intercept=True)
I have solved this problem with random forest, which yields df:
t_stamp X Y X_t1 X_t2 X_t3 X_t4 X_t5
0.000543 0 10 NaN NaN NaN NaN NaN
0.000575 0 10 0.0 NaN NaN NaN NaN
0.041324 1 10 0.0 0.0 NaN NaN NaN
0.041331 2 10 1.0 0.0 0.0 NaN NaN
0.041336 3 10 2.0 1.0 0.0 0.0 NaN
0.041340 4 10 3.0 2.0 1.0 0.0 0.0
0.041345 5 10 4.0 3.0 2.0 1.0 0.0
0.041350 6 10 5.0 4.0 3.0 2.0 1.0
0.041354 7 10 6.0 5.0 4.0 3.0 2.0
.........................................................
[ 10. 10. 10. 10. .................................]
MSE: 1.3273548431
This seems to work fine for window sizes 5, 10, 15, 20 and 22. However, it doesn't work for sizes greater than 23 (it prints MSE: 0.0). This is because, as you can see from the dataset, the values of Y are fixed at 10 from rows 1-23 and then change to another value (20, and so on) from row 24. How can we train and predict a model in such cases based on the last data points?
It seems with the existing code, when calling dropna, you truncate X but not y. You also train and test on the same data.
Fixing this will give non-zero MSE.
Code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv('/Users/shivadeviah/Desktop/estimated_pred.csv')

# Build 25 lag columns of X, attach Y, shuffle the rows, and drop
# the rows that don't have a complete window of history.
df1 = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(25)})
df1['Y'] = df['Y']
df1 = df1.sample(frac=1).reset_index(drop=True)
df1.dropna(inplace=True)

X = df1.iloc[:, :-1].values
y = df1.iloc[:, -1].values

# 66/34 positional train-test split.
x = int(len(X) * 0.66)
X_train, X_test = X[:x], X[x:]
y_train, y_test = y[:x], y[x:]

reg = RandomForestRegressor(criterion='mse')
reg.fit(X_train, y_train)
modelPred = reg.predict(X_test)
print(modelPred)
print("Number of predictions:", len(modelPred))
meanSquaredError = mean_squared_error(y_test, modelPred)
print("MSE:", meanSquaredError)
print(df1.size)

df2 = df1.iloc[x:, :].copy()
df2['pred'] = modelPred
df2.head()
Output:
[ 267.7 258.26608241 265.07037249 ..., 267.27370169 256.7 272.2 ]
Number of predictions: 87891
MSE: 1954.9271256
6721026
X_0 pred
170625 48 267.700000
170626 66 258.266082
170627 184 265.070372
170628 259 294.700000
170629 271 281.966667
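One design caveat worth noting: sample(frac=1) shuffles the rows, so the 66/34 split above mixes past and future observations. If you want a strictly chronological hold-out instead (often preferable for time series), a minimal variant of the same code just skips the shuffle:
# Variant: keep rows in time order so the test set is strictly later data.
df1 = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(25)})
df1['Y'] = df['Y']
df1.dropna(inplace=True)  # no shuffle before the positional split
x = int(len(df1) * 0.66)
X_train, y_train = df1.iloc[:x, :-1].values, df1.iloc[:x, -1].values
X_test, y_test = df1.iloc[x:, :-1].values, df1.iloc[x:, -1].values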
I am trying to one-hot encode the categorical variables of my Pandas dataframe, which includes both categorical and continuous variables. I realise this can be done easily with the pandas .get_dummies() function, but I need to use a pipeline so I can generate a PMML file later on.
This is the code to create a mapper. The categorical variables I would like to encode are stored in a list called 'dummies'.
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies] +
    [(d, OneHotEncoder()) for d in dummies]
)
And this is the code to create a pipeline, including the mapper and linear regression.
from sklearn2pmml import PMMLPipeline
from sklearn.linear_model import LinearRegression
lm = PMMLPipeline([("mapper", mapper),
                   ("regressor", LinearRegression())])
When I now try to fit (with 'features' being a dataframe and 'targets' a series), it gives the error 'could not convert string to float':
lm.fit(features, targets)
OneHotEncoder doesn't support string features (in scikit-learn versions before 0.20), and with [(d, OneHotEncoder()) for d in dummies] you are applying it to all of the dummies columns. Use LabelBinarizer instead:
mapper = DataFrameMapper(
    [(d, LabelBinarizer()) for d in dummies]
)
An alternative would be to use the LabelEncoder with a second OneHotEncoder step.
mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies]
)
lm = PMMLPipeline([("mapper", mapper),
                   ("onehot", OneHotEncoder()),
                   ("regressor", LinearRegression())])
LabelEncoder and LabelBinarizer are intended for encoding/binarizing the target (label) of your data, i.e. the y vector. They do more or less the same thing as OneHotEncoder; the main difference is that the Label* preprocessing steps only accept 1-D vectors, not matrices.
import numpy as np
import pandas as pd

example = pd.DataFrame({'x': np.arange(2, 14, 2),
                        'cat1': ['A', 'B', 'A', 'B', 'C', 'A'],
                        'cat2': ['p', 'q', 'w', 'p', 'q', 'w']})
dummies = ['cat1', 'cat2']
x cat1 cat2
0 2 A p
1 4 B q
2 6 A w
3 8 B p
4 10 C q
5 12 A w
As an example, LabelEncoder().fit_transform(example['cat1']) works, but LabelEncoder().fit_transform(example[dummies]) throws a ValueError exception.
In contrast, OneHotEncoder accepts multiple columns:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(example[dummies])
<6x6 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
This can be incorporated into a pipeline using a ColumnTransformer, passing through (or alternatively applying different transformations to) the other columns:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('encode_cats', OneHotEncoder(), dummies)],
                       remainder='passthrough')
pd.DataFrame(ct.fit_transform(example), columns=ct.get_feature_names_out())
encode_cats__cat1_A encode_cats__cat1_B ... encode_cats__cat2_w remainder__x
0 1.0 0.0 ... 0.0 2.0
1 0.0 1.0 ... 0.0 4.0
2 1.0 0.0 ... 1.0 6.0
3 0.0 1.0 ... 0.0 8.0
4 0.0 0.0 ... 0.0 10.0
5 1.0 0.0 ... 1.0 12.0
Finally, slot this into a pipeline:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

Pipeline([('preprocessing', ct),
          ('regressor', LinearRegression())])
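Hypothetical usage, fitting the pipeline on the toy frame above with a made-up target just to show the mechanics:
import numpy as np

pipe = Pipeline([('preprocessing', ct),
                 ('regressor', LinearRegression())])
y = np.arange(6)  # placeholder target, one value per row of `example`
pipe.fit(example, y)
print(pipe.predict(example))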