Different results when using train_test_split vs manually splitting the data - python

I have a pandas dataframe that I want to make predictions on and get the root mean squared error for each feature. I'm following an online guide that splits the dataset manually, but I thought it would be more convenient to use train_test_split from sklearn.model_selection. Unfortunately, I'm getting different results when looking at the rmse values after splitting the data manually vs using train_test_split.
A (hopefully) reproducible example:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['feature_1','feature_2','feature_3','feature_4'])
df['target'] = np.random.randint(2,size=100)
df2 = df.copy()
Here is a function, knn_train_test, that splits the data manually, fits the model, makes predictions, etc:
def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(0)
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    # Fit a KNN model using default k value.
    knn.fit(train_df[[train_col]], train_df[target_col])
    # Make predictions using model.
    predicted_labels = knn.predict(test_df[[train_col]])
    # Calculate and return RMSE.
    mse = mean_squared_error(test_df[target_col], predicted_labels)
    rmse = np.sqrt(mse)
    return rmse
rmse_results = {}
train_cols = df.columns.drop('target')
# For each column (minus `target`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test(col, 'target', df)
    rmse_results[col] = rmse_val
# Create a Series object from the dictionary so
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
#Output
feature_3 0.541110
feature_2 0.548452
feature_4 0.559285
feature_1 0.569912
dtype: float64
Now, here is a function, knn_train_test2, that splits the data using train_test_split:
def knn_train_test2(train_col, target_col, df2):
    knn = KNeighborsRegressor()
    np.random.seed(0)
    X_train, X_test, y_train, y_test = train_test_split(df2[[train_col]], df2[[target_col]], test_size=0.5)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    return rmse
rmse_results = {}
train_cols = df2.columns.drop('target')
for col in train_cols:
    rmse_val = knn_train_test2(col, 'target', df2)
    rmse_results[col] = rmse_val
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
# Output
feature_4 0.522303
feature_3 0.556417
feature_1 0.569210
feature_2 0.572713
dtype: float64
Why am I getting different results? I think I'm misunderstanding the split > train > test process in general, or maybe misunderstanding/mis-specifying train_test_split. Thank you in advance

Your manual splitting routine is not the same as scikit-learn's implementation, which is why you get different results even with the same seed.
If you look at the official implementation in the scikit-learn source, train_test_split is built on top of ShuffleSplit and only uses the first shuffled split it generates; that shuffling consumes the random number generator differently from your np.random.permutation call.
Only if your approach did exactly the same thing as the scikit-learn approach could you expect the same result for the same seed.
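If the goal is simply to make the two pipelines agree, one option (a sketch, not from the original post; knn_train_test_matched is a made-up name) is to shuffle the frame once yourself, exactly as the manual version does, and then tell train_test_split not to shuffle again:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_train_test_matched(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(0)
    # Shuffle once, exactly like the manual version ...
    rand_df = df.reindex(np.random.permutation(df.index))
    # ... then split without re-shuffling: the first half becomes the training
    # set and the second half the test set, matching the iloc-based split above.
    X_train, X_test, y_train, y_test = train_test_split(
        rand_df[[train_col]], rand_df[target_col], test_size=0.5, shuffle=False)
    knn.fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))
With shuffle=False and test_size=0.5 the split is the same half/half cut as the manual version, so the RMSE values should come out identical to knn_train_test.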

This is just the nature of the process. When you split the data manually you end up with one version of the training and testing sets; when you use the sklearn function you end up with another. The model makes predictions based on whatever training data it receives, so the final results differ between the two.
If you want reproducible results, set a seed (random_state) in train_test_split so that it produces the same split every time. Then set a seed in your ML function as well, since some models also start training from random initial values. Run your model on these datasets with the same seeds and you will get the same results each time.

Splitting the data manually is essentially just slicing, whereas train_test_split shuffles the data with its own randomization before slicing. Try fixing the random seed (random_state) and check whether you get the same results each time you use train_test_split.
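A minimal sketch of that check, using df2 from the question: passing random_state pins train_test_split's own shuffling, so repeated runs produce identical splits (even though the split is still different from the manual one above).
from sklearn.model_selection import train_test_split

# Same random_state -> the same rows end up in train and test on every run.
split_a = train_test_split(df2[['feature_1']], df2['target'],
                           test_size=0.5, random_state=0)
split_b = train_test_split(df2[['feature_1']], df2['target'],
                           test_size=0.5, random_state=0)
print(split_a[0].index.equals(split_b[0].index))   # True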

Related

Returning a trained scikit learn (random forest) model from a function?

I am training a random forest model and have found that returning the trained model object from a function consistently results in different .predict behavior. Is this intended or not?
I think this is completely reproducible code. Input data is just 1000 rows of 6 columns of floats:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
def as_a_function():
    df = pd.read_csv(...)  # read file
    lcscols = ...  # just a list of 6/12 of the columns in the csv file that are used to build the model (ignore timestamp, etc)
    selcol = ...  # y 'real' data
    train_df = df.sample(frac=testsize, random_state=42)
    test_df = df.drop(train_df.index)  # test/train split
    rfmodel, fitvals_mid = RF_model(train_df, test_df, selcol, lcscols)
    tempdf = df.copy(deep=True)  # new copy, not totally necessary but helpful in edge cases
    tempdf.dropna(inplace=True)
    selcolname = selcol + '_cal'
    mid_cal = pd.DataFrame(data=rfmodel.predict(tempdf[lcscols]), index=tempdf.index, columns=[selcolname])
    # new df just made from a .predict call
    # note that input order of columns matters, needs to be identical to training order??

def RF_model(train_df, test_df, ycol, xcols):
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    rfmodel = rf.fit(train_df[xcols], train_df[ycol])
    y_pred_test = rfmodel.predict(test_df[xcols])
    # missing code to test predicted values of testing set
    return rfmodel
#################################
def inline():
    df = pd.read_csv(...)  # read file
    lcscols = ...  # just a list of 6/12 of the columns in the csv file that are used to build the model (ignore timestamp, etc)
    refcol = ...  # 'true' data
    X = df[lcscols].values
    y = df[[refcol]].values
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    ramp = rf.fit(x_train, y_train.flatten())
    y_pred_test = ramp.predict(x_test)
    # missing code to check prediction on test values
    tempdf = df.copy(deep=True)[lcscols]
    tempdf.dropna(axis=1, how='all', inplace=True)
    tempdf.dropna(axis=0, inplace=True)
    df_cal = pd.DataFrame(data=ramp.predict(tempdf), index=tempdf.index, columns=['name'])
    return df_cal
The problem is that rfmodel.predict(tempdf[lcscols]) produces different output than ramp.predict(tempdf).
I imagine the results will differ somewhat, given that pd.DataFrame.sample will not produce exactly the same split as train_test_split, but I get an r^2 of 0.98 when .predict is called on the trained model inside the same function, versus r^2 = 0.5 when .predict is called on the returned model object. That seems like far too large a gap to be attributable to a different splitting method?
Try using np.random.seed(42) before you call the method (make sure NumPy is imported first). Anything that draws from NumPy's global random number generator, such as a .sample call without an explicit random_state, will use different random values on every run; once you call np.random.seed(42), every run of your code will draw the same random values, so the two pipelines can be compared like for like.
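As a self-contained illustration of that point (a toy frame, not the poster's data): a .sample call without an explicit random_state draws from NumPy's global generator, so seeding it makes the draw repeatable.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(10)})

np.random.seed(42)
first = df.sample(frac=0.8)    # draws from NumPy's global RNG

np.random.seed(42)
second = df.sample(frac=0.8)   # same seed, so the same rows are drawn

print(first.index.equals(second.index))   # True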

Why are the results inaccurate when I use a different dataset for testing a model in machine learning?

I am trying to do forecasting based on a time series: temperature forecasting using the past three years of hourly data.
Instead of using the X_test produced by train_test_split, I am using my own test dataset because I need a seven-day-ahead forecast.
Problem: When I use my dummy test dataset for forecasting it gives incorrect values, but when I use the test set from train_test_split it gives accurate values. I don't understand why this is happening.
What I tried to fix this problem: First, I thought this was happening because I was not applying feature scaling, but after implementing feature scaling the results were the same. Then I thought that when train_test_split splits the data it also shuffles it, so I shuffled my dummy test data as well, but the results still did not change.
My question: How can I use a different dataframe for testing a model, and how can I get accurate results?
Program:
df = pd.read_csv("Timeseries_47.999_7.850_SA_0deg_0deg_2013_2016.csv", sep=",")
time_mod = []
for i in range(0, len(df['time'])):
    ss = pd.to_datetime(df['time'][i], format="%Y%m%d:%H%M")
    time_mod.append(ss)
df['datetime'] = time_mod
df["Hour"] = pd.to_datetime(df["datetime"]).dt.hour
df["Month"] = pd.to_datetime(df["datetime"]).dt.month
df["Day_of_year"] = pd.to_datetime(df["datetime"]).dt.dayofyear
df["Day_of_month"] = pd.to_datetime(df["datetime"]).dt.day
df["week_of_year"] = pd.to_datetime(df["datetime"]).dt.week
X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
y = df[{"T2m"}].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
## Creating dummy datetime for Test data
df.set_index('datetime',inplace=True)
future_dates = [df.index[-1]+DateOffset(hours=x) for x in range(0,168)]
future_dates_df = pd.DataFrame({'Data':future_dates})
future_dates_df["Hour"] = pd.to_datetime(future_dates_df["Data"]).dt.hour
future_dates_df["Month"] = pd.to_datetime(future_dates_df["Data"]).dt.month
future_dates_df["Day_of_year"] = future_dates_df["Data"].dt.dayofyear
future_dates_df["Day_of_month"] = pd.to_datetime(future_dates_df["Data"]).dt.day
future_dates_df["Date"] = pd.to_datetime(future_dates_df["Data"]).dt.date
future_dates_df["week_of_year"] = future_dates_df["Data"].dt.week
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values
#Model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test_dum)
plt.plot(y_test, color="r", label="actual")
plt.plot(y_pred, label="forecasted")
sns.set(rc={'figure.figsize':(20,10)})
plt.legend()
plt.show()
The reason you are getting inaccurate results could be:
The dummy dataset's variables are not arranged in the same order as in the actual dataset. Selecting columns with a set literal ({...}) does not guarantee any particular column order, while the dummy test matrix is built from an ordered list, so the features end up misaligned:
X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values
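A minimal fix sketch (same column names as in the question): build both matrices from one ordered list so the features line up position by position.
feature_cols = ["Hour", "Day_of_year", "Day_of_month", "week_of_year", "Month"]
# A list, unlike a set, preserves order, so the columns of X and X_test_dum match.
X = df[feature_cols].values
y = df["T2m"].values
X_test_dum = future_dates_df[feature_cols].values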
I also notice that you are applying linear regression, but the data does not look linear. Try polynomial regression, a decision tree, a random forest, or another model that handles non-linear data well.
I think eliminating some non-essential independent variables could also improve your results.
Only consider: Hour and Day_of_year.
Lastly, you could try creating the dummy dataset directly in the csv file and then separating the train and test datasets in Python.

Why is Multi Class Machine Learning Model Giving Bad Results?

I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine number of unique categories, and number of cases for each category
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Testing on 75% of the data
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using, but I am trying to predict 'size_womenswear'. There are 8 different sizes that I have encoded to predict, and I have moved this column to the end of the dataframe, so y is the dependent variable and x contains the independent variables (all the other columns).
I am using a Gaussian Naive Bayes classifier to try and classify the 8 different sizes and then test on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
Can't comment, so just throwing out some ideas:
Maybe you need to deal with class imbalance, or try another model that fits the data better. The xgboost or lightgbm packages usually perform well given good data, but it really depends on the data.
Also check the way you split train and test: do the resulting train and test sets have a similar distribution for your y? That's very important.
Last thing: for classification models the performance measurement can be tricky, so try some other metrics. Look at F1 scores, or draw a confusion matrix and see what your predictions look like versus y; perhaps your model is predicting everything as one or just a few classes.
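A rough sketch of those checks, reusing x and y from the question (stratify and the metrics imports are the only additions):
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix

# Stratify so train and test keep roughly the same class distribution.
xTrain, xTest, yTrain, yTest = train_test_split(
    x, y, test_size=0.25, random_state=0, stratify=y)
print(Counter(yTrain), Counter(yTest))   # compare class balance across the split

model = GaussianNB().fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
print(confusion_matrix(yTest, yPredicted))        # is everything collapsing into a few classes?
print(classification_report(yTest, yPredicted))   # per-class precision, recall and F1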

Logistic regression sklearn - train and apply model

I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.iloc[:, :-1].values
# Target
df_target = df.iloc[:, -1].values
# Values to predict
df_test = df_pred.iloc[:, :-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print(scores)
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
In general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale the training and test data independently, which isn't correct: both datasets must be scaled with the same scaler. scale is a simple function; it is better to use a stateful transformer such as StandardScaler (from sklearn.preprocessing):
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting.
How does your code work? You split the data 10 times into a train set and a hold-out set, and 10 times you fit the model on the train set and calculate the score on the hold-out set. This way you get cross-validation scores, but after the loop the model has been fitted only on part of the data. So it is better to fit the model on the whole training set and then make predictions:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
Note that there are more advanced techniques such as stacking and boosting, but if you are learning sklearn it is better to stick to the basics.
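Putting both points together, a minimal sketch (reusing df_data, df_test and df_target from the question; cross_val_score stands in for the manual KFold loop) might look like this:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then reuse it for the new data.
scaler = StandardScaler().fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
y = df_target

LogReg = LogisticRegression()
print(cross_val_score(LogReg, X, y, cv=10))   # 10-fold cross-validation scores

LogReg.fit(X, y)                # final model trained on all the training data
novel = LogReg.predict(X_pred)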

Time Series Classification

you can access the data set at this link https://drive.google.com/file/d/0B9Hd-26lI95ZeVU5cDY0ZU5MTWs/view?usp=sharing
My Task is to predict the price movement of a sector fund. How much it goes up or down doesn't really matter, I only want to know whether it's going up or down. So I define it as a classification problem.
Since this dataset is time-series data, I have run into many problems. I have read articles about these problems, for example that I can't use k-fold cross-validation on time-series data because you can't ignore the order of the data.
my code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.linear_model import LinearRegression
from math import sqrt
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
lag1 = pd.read_csv(..., parse_dates=['Date'])  # local file path
# Trend: True if the price is going up, otherwise False
lag1['Trend'] = lag1.XLF > lag1.XLF.shift()
train_size = round(len(lag1)*0.50)
train = lag1[0:train_size]
test = lag1[train_size:]
variable_to_use= ['rGDP','interest_rate','private_auto_insurance','M2_money_supply','VXX']
y_train = train['Trend']
X_train = train[variable_to_use]
y_test = test['Trend']
X_test = test[variable_to_use]
#SVM Lag1
this_C = 1.0
clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
print('XLF Lag1 dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
#Check prediction results
clf.predict(X_test)
First of all, is my approach right: generating a column of True and False first? I am afraid the model can't understand this column if I simply feed it in. Should I first perform a regression and then compare the numeric results to generate a list of up or down movements?
The accuracy on the training set is very low, at 0.58. I am also getting an array of all True values from clf.predict(X_test), and I don't know why.
And I don't know how the resulting accuracy is calculated: for example, I think my current accuracy only counts the number of True and False predictions while ignoring their order? Since this is time-series data, ignoring the order is not right and gives me no information about predicting price movement. Let's say I have 40 examples in the test set and I got 20 Trues; I would get 50% accuracy. But I guess the Trues are not in the right positions as they appear in the ground-truth set. (Tell me if I am wrong.)
I am also considering using a gradient boosted tree for the classification; would that be better?
Some preprocessing of this data would probably be helpful. Step one might go something like:
df = pd.read_csv('YOURLOCALFILEPATH',header=0)
#more code than your method but labels rows as 0 or 1 and easy to output to new file for later reference
df['Date'] = pd.to_datetime(df['date'], unit='d')
df = df.set_index('Date')
df['compare'] = df['XLF'].shift(-1)
df['Label'] = np.where(df['XLF'] > df['compare'], 1, 0)
df.drop('compare', axis=1, inplace=True)
Step two can use one of sklearn's built-in scalers, such as MinMaxScaler, to preprocess the data by scaling your feature inputs before feeding them into your model.
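A sketch of that step, assuming the df and Label column from step one and the feature names from the question; the split is kept chronological rather than shuffled, and the scaler is fitted on the training window only so nothing from the test period leaks into the scaling.
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

features = ['rGDP', 'interest_rate', 'private_auto_insurance', 'M2_money_supply', 'VXX']

# Chronological split: earlier rows train the model, later rows test it.
split = round(len(df) * 0.5)
X_train, X_test = df[features][:split], df[features][split:]
y_train, y_test = df['Label'][:split], df['Label'][split:]

# Fit the scaler on the training window only, then apply it to both sets.
scaler = MinMaxScaler().fit(X_train)
clf = SVC(kernel='linear', C=1.0).fit(scaler.transform(X_train), y_train)
print(clf.score(scaler.transform(X_test), y_test))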
