xgboost - feature mismatch when I predict on my test data - python

I'm using xgboost to train a model and then I want to score it on a test set.
My data is a combination of categorical and numeric variables, so I used pd.get_dummies to dummy all my categorical variables. Training is fine, but the problem happens when I score the model on the test set.
I get a "feature_names_mismatch" error that lists the columns that are missing. My dataset is already in a matrix (numpy array format).
The mismatch in feature names is valid, since some dummy categories may not be present in the test set. If this happens, is there a way for the model to still work?

If I understood your problem correctly, you have some categorical values that appear in the train set but not in the test set. This usually happens when you create dummy variables (converting categorical features with one-hot encoding, etc.) separately for train and test instead of doing it on the entire dataset. The following code can help:
for col in featurs_object:
    # take the category list from the full dataset so every dummy column is always created
    X[col] = pd.Categorical(X[col], categories=df[col].dropna().unique())
    X_col = pd.get_dummies(X[col])
    X = X.drop(col, axis=1)
    X_col.columns = X_col.columns.tolist()
    frames = [X_col, X]
    X = pd.concat(frames, axis=1)
# add the continuous features back, then split
X = pd.concat([X, df_continous], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=1)
featurs_object : consists of all categorical columns which you want to include for model building.
df : your entire dataset (post cleanup)
df_continous : Subset of df, with only continuous features.
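The idea above can be sketched on a toy frame. This is only an illustration under assumptions: the column names "color" and "size" and the tiny dataset are made up, and the train/test rows are cut by position rather than with train_test_split. Once the category list is fixed from the full data, get_dummies produces the same columns for both parts, even when a category is absent from one of them.

```python
import pandas as pd

# Hypothetical toy data: "color" is categorical, "size" is continuous.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "size": [1.0, 2.0, 3.0, 4.0],
})

# Fix the category list once, from the full dataset.
categories = df["color"].dropna().unique()

train = df.iloc[:3].copy()
test = df.iloc[3:].copy()   # contains only "red"

for part in (train, test):
    part["color"] = pd.Categorical(part["color"], categories=categories)

# get_dummies creates one column per declared category, observed or not.
X_train = pd.get_dummies(train, columns=["color"])
X_test = pd.get_dummies(test, columns=["color"])

print(list(X_train.columns))
print(list(X_test.columns))
```

If the test data arrives later and the full category list is not available, reindexing the test dummies to the training columns with a fill value of 0 is another common option.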

Related

Why results are inaccurate when I am using different dataset for testing a model in Machine Learning?

I am trying to do forecasting based on time series. I am doing temperature forecasting by using the past three years of hourly data.
Instead of using X_test from train_test_split method, I am using my own test dataset because I need seven-day ahead forecasting.
Problem: When I use my dummy test dataset for forecasting, it gives incorrect values. But when I use the test dataset from the train_test_split method, it gives accurate values. I don't understand why this is happening.
What I tried to fix this problem: First, I thought this was happening because I was not applying feature scaling, but after implementing feature scaling the results are the same. Then I thought that when train_test_split splits the data it also adds some randomness, so I shuffled my dummy test data, but the results are still the same.
My question: How can I use a different dataframe for testing a model? And how can I get accurate results?
Program:
df = pd.read_csv("Timeseries_47.999_7.850_SA_0deg_0deg_2013_2016.csv", sep=",")
time_mod = []
for i in range(0,len(df['time'])):
    ss=pd.to_datetime(df['time'][i], format= "%Y%m%d:%H%M")
    time_mod.append(ss)
df['datetime'] = time_mod
df["Hour"] = pd.to_datetime(df["datetime"]).dt.hour
df["Month"] = pd.to_datetime(df["datetime"]).dt.month
df["Day_of_year"] = pd.to_datetime(df["datetime"]).dt.dayofyear
df["Day_of_month"] = pd.to_datetime(df["datetime"]).dt.day
df["week_of_year"] = pd.to_datetime(df["datetime"]).dt.week
X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
y = df[{"T2m"}].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
## Creating dummy datetime for Test data
df.set_index('datetime',inplace=True)
future_dates = [df.index[-1]+DateOffset(hours=x) for x in range(0,168)]
future_dates_df = pd.DataFrame({'Data':future_dates})
future_dates_df["Hour"] = pd.to_datetime(future_dates_df["Data"]).dt.hour
future_dates_df["Month"] = pd.to_datetime(future_dates_df["Data"]).dt.month
future_dates_df["Day_of_year"] = future_dates_df["Data"].dt.dayofyear
future_dates_df["Day_of_month"] = pd.to_datetime(future_dates_df["Data"]).dt.day
future_dates_df["Date"] = pd.to_datetime(future_dates_df["Data"]).dt.date
future_dates_df["week_of_year"] = future_dates_df["Data"].dt.week
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values
#Model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test_dum)
plt.plot(y_test, color="r", label="actual")
plt.plot(y_pred, label="forecasted")
sns.set(rc={'figure.figsize':(20,10)})
plt.legend()
plt.show()
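The reason for the inaccurate results could be:
The variables in the dummy dataset are not arranged in the same order as in the actual dataset. The training features are selected with a set, which has no defined order, while the dummy test features are selected with a list: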
X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values
I also notice that you are applying Linear Regression, but the data does not look linear. Try Polynomial Regression, a Decision Tree, a Random Forest, or another model that handles non-linear data well.
I think eliminating some non-essential independent variables can also improve your results.
Only consider: Hour and Day_of_year
Last, try to create the dummy dataset directly in the csv file and then separate the train and test datasets in Python.
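One way to rule out the column-order problem is to build the features for both the historical and the future timestamps with a single helper and a single ordered list. The sketch below is an assumption-laden illustration: make_features and FEATURES are hypothetical names, the timestamps are synthetic, and isocalendar().week is used in place of dt.week, which was removed in pandas 2.0.

```python
import pandas as pd

# One ordered list of feature names, shared by training and forecasting.
FEATURES = ["Hour", "Day_of_year", "Day_of_month", "week_of_year", "Month"]

def make_features(timestamps):
    """Build calendar features in a fixed column order (hypothetical helper)."""
    ts = pd.Series(pd.to_datetime(timestamps))
    out = pd.DataFrame({
        "Hour": ts.dt.hour,
        "Month": ts.dt.month,
        "Day_of_year": ts.dt.dayofyear,
        "Day_of_month": ts.dt.day,
        "week_of_year": ts.dt.isocalendar().week.astype(int),
    })
    return out[FEATURES]  # a list keeps the order; a set would not

one_hour = pd.Timedelta(hours=1)
history = pd.date_range("2016-12-01", periods=48, freq=one_hour)
future = pd.date_range(history[-1] + one_hour, periods=168, freq=one_hour)

X_hist = make_features(history).values
X_future = make_features(future).values
print(X_hist.shape, X_future.shape)
```

Both arrays now have their columns in the same order, so a model fitted on X_hist sees each feature in the position it expects when predicting on X_future.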

Adding predict_disable_shape_check=true as a parameter on LightGBM

In short, my initial df has a column that has probabilities from an external predictive model that I would like to compare to the predictions generated from my lightGBM model. First I used the train test split on my data, which included my column old_predictions
X = A, B, C, old_predictions
Y = outcome
seed=47
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,random_state=seed)
I do not, however, want old_predictions to be included as a feature in my lightGBM model, so I made a separate df from the X_test data (which I will later append the light GBM prediction probabilities to) and dropped the old_predictions from the X_test and X_train
pred_df = X_test
X_test.drop(['old_predictions'], axis = 1, inplace = True)
X_train.drop(['old_predictions'], axis = 1, inplace = True)
When I try to train my model, however, I receive the following error:
LightGBMError: The number of features in data (4) is not the same as it was in training data (3).
You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.
The two questions I have are
using the logic I described about why I dropped this variable, would you agree that it is indeed fine to disregard this error?
Where do I add predict_disable_shape_check=true to disregard the error? I have tried the below, but none have been successful and the same error reappears. I tried reading the docs but am having trouble finding clarity
model = lgb.LGBMClassifier(**parameters1, predict_disable_shape_check=True)
y_pred=model.predict(X_test, predict_disable_shape_check=True)
predictions = model.predict_proba(X_test, predict_disable_shape_check=True)[:, 1]
predictions_train = model.predict_proba(X_train, predict_disable_shape_check=True)[:, 1]
I have also added it directly to the parameters list and this did not work either.
This should work.
y_pred=model.predict(X_test, predict_disable_shape_check=True)
Make sure X_test contains only the columns which were used to train the model.
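A small pandas-only sketch of that last point, using made-up values for A, B, C and old_predictions. Note that `pred_df = X_test` only creates a second name for the same DataFrame, so an in-place drop changes both. Taking a copy first and selecting the feature columns explicitly keeps the two apart. The name feature_cols is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for the X_test produced by train_test_split.
X_test = pd.DataFrame({
    "A": [1, 2],
    "B": [3, 4],
    "C": [5, 6],
    "old_predictions": [0.2, 0.7],
})

feature_cols = ["A", "B", "C"]          # columns the model is trained on

pred_df = X_test.copy()                 # independent copy that keeps old_predictions
X_test_features = X_test[feature_cols]  # what gets passed to fit/predict

print(list(pred_df.columns))
print(list(X_test_features.columns))
```

Using the same feature_cols list for X_train and X_test means the column count seen at prediction time matches the one seen at training time.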

Splitting test/training data for scikit?

I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?
The starter code looks like so:
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])
##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
I need to change the acc_svc variable to be using X_test and Y_test, however. X_test is given to us, but how do I come up with a Y_test? I know the Y_test should correspond to labels, and I'm having some size mismatching going on when I attempt to do so. Should be a simple question, anyone mind pointing me in the right direction?
The test_preprocessed.csv shouldn't be used to check your model performance. Split your train_df using train_test_split() in scikit-learn into train and validation datasets. You have to check your model performance on validation dataset i.e. y of validation. Please refer to: scikit-learn documentation
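A minimal sketch of that suggestion, using synthetic data in place of train_preprocessed.csv (the random features and the label rule are assumptions made only so the example runs on its own). The labelled training data is split into a training part and a validation part, and the accuracy is measured on the validation part.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the labelled training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the labelled rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

svc = SVC()
svc.fit(X_train, y_train)

# Accuracy on rows the model has not seen during fitting.
acc_val = round(svc.score(X_val, y_val) * 100, 2)
print("svm validation accuracy is:", acc_val)
```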
First of all, you have to understand and clarify your target variable. Your "Y_test" seems to be your already existing "Y_pred" variable, which seems to correspond to the "Survived" label (in your test set). However, although you are dropping it from "X_train" so that you can use it as a target, you don't seem to do the same in "X_test", where instead you are dropping "PassengerId".
Another basic concept here is that your dataset is already split into train-test subsets (your CSV files). I assume that your test set has already one less column compared to the train set, and that should be the "Survived" variable as a continuation from the train CSV file. Otherwise, you should drop it to avoid mismatching and keep that as your test target variable. You don't have to come up with a "Y_test", the result from your equation "Y_pred = svc.predict(X_test)" will give you the "Y_test" which would be the result of the "Y_pred".
One possible reason you get size mismatching is that the number of columns (x-axis) in your train set is not equal with that of the test set.
If you want to split into train/test subsets based on Scikit-learn you would first merge your CSV files, then do the data analysis in the merged dataset, and finally, do the split. One way to keep track of these changes and maintain the same original size of the train-test split could be to keep key-value pairs originated from the train-test merge. One way to do that could be via the pandas.concat, using the parameter "keys".
Incorporating the above, one recommended simple solution might be:
# reading csv files
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])
# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])
# now you can eject the train and test sets, using the key-value pairs from the train-test merge
X_train = merged_data.xs(0)
X_test = merged_data.xs(1)
# setting up predictors - target
X= X_train.loc[:, X_train.columns!="Survived"]
y= X_train.loc[:, "Survived"]
# train-test split
# If test_size is None (and train_size is also None), it defaults to 0.25 according to the documentation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
# score on the held-out split rather than on the training data
acc_svc = round(svc.score(X_test, y_test) * 100, 2)
print("svm accuracy is:", acc_svc)
In my opinion, after understanding the above you could further estimate and compare your model's performance using the cross_val_score function, in the way @SunilG mentions. For example, for a 3-fold (cv=3) cross-validation, you could:
from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
If you do not want to proceed to the above and you want to be close to your starter code, then you should delete your 5th line of code and I suppose it would run (if your test set does not include your target variable, otherwise drop it). However in this case you would not be able to split your train-test on your own, since it is already split, hence the title of your main question/post should be altered.

Train, Test, Validate split Python. Three sets

Someone presented a solution to split a dataset into three sets. I wonder where is the label in this case. Or how to set the labels then.
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
I will answer the question based on comments:
Using this method for splitting:
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
You get 3 different objects: train holds the first 60% of the shuffled rows of df, validate holds the rows between 60% and 80%, and test holds the last 20% (80%-100%). The labels are within these dataframes.
In train_test_split you are passing two objects, X and Y, which have been most likely previously split from an original dataset and getting in return 4 objects, 2 corresponding to train and two corresponding to test. Keep in mind this: You are first splitting your dataset into independent variables and explained/target variable, and then splitting these two objects into train and test.
With np.split you are going the other way around: you first split your dataset into 3 objects (train, validate and test), each of which later needs to be split into the independent variables, commonly known as X, and the target variable, known as Y. You are doing the same splits, just in reverse order.
However, keep in mind that np.split itself does not shuffle: it only cuts at the positions you pass, so the randomness here comes entirely from df.sample(frac=1), whereas train_test_split shuffles for you by default. np.split, on the other hand, allows for more flexibility, for instance, as your example shows, creating more than 2 subsets.
Maybe this will help!
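As a rough sketch of the "split into three frames first, then separate X and Y" order described above: the toy frame, the column name "label" and the helper split_xy are assumptions for illustration, and positional iloc slicing is used here instead of np.split to cut the shuffled frame at the same 60% and 80% marks.

```python
import pandas as pd

# Toy dataset: two feature columns and a target column named "label".
df = pd.DataFrame({
    "a": range(10),
    "b": range(10, 20),
    "label": [0, 1] * 5,
})

# Shuffle once, then cut at 60% and 80%.
shuffled = df.sample(frac=1, random_state=0)
n = len(shuffled)
train = shuffled.iloc[: int(0.6 * n)]
validate = shuffled.iloc[int(0.6 * n): int(0.8 * n)]
test = shuffled.iloc[int(0.8 * n):]

def split_xy(frame, target="label"):
    """Separate features and target for one subset (hypothetical helper)."""
    return frame.drop(columns=target), frame[target]

X_train, y_train = split_xy(train)
X_validate, y_validate = split_xy(validate)
X_test, y_test = split_xy(test)

print(len(X_train), len(X_validate), len(X_test))
```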
Try this. Feed the output of one of the train_test_split into a second one as input
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_test, X_validate, y_test, y_validate = train_test_split(X_test, y_test, test_size=0.5)
The function randomly splits 2 arrays into 4 arrays, and test_size determines the size of the split allocated to the test output vs train. The y input is meant to be a target for building a machine learning model and X is meant to be the features for the model. If you want them combined, then just concat the equivalent X and y outputs.
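To make the resulting proportions concrete, here is the same two-step split on 100 synthetic rows (the arrays are made up for illustration, and the intermediate names X_rest and y_rest are assumptions). The first call keeps 60% for training, and the second call halves the remaining 40% into test and validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 2 features and a binary target.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Step 1: 60% train, 40% held back.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Step 2: split the held-back 40% in half.
X_test, X_validate, y_test, y_validate = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_validate), len(X_test))
```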

What does this error mean with StratifiedShuffleSplit?

I'm totally new to Data Science in general and was hoping someone could explain why this does not work:
I'm using the Advertising dataset from the following url: "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv" which has 3 feature columns ("TV", "Radio", "Newspaper") and 1 label column ("sales"). My complete dataset is named data.
Next, I try to use sklearn's StratifiedShuffleSplit function to divide the data into training and testing sets.
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, random_state=0) # can use test_size=0.8
for train_index, test_index in split.split(data.drop("sales", axis=1), data["sales"]): # Generate indices to split data into training and test set.
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
I get this ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Using the same code on another dataset which has 14 feature columns and 1 label column separates the data appropriately. Why doesn't it work here? Thanks.
I think the problem is that your data_y is a 2D matrix.
As I see in the sklearn.model_selection.StratifiedShuffleSplit docs, it should be a 1D vector. Try to encode each row of data_y as an integer (it will be interpreted as a class), and then use split.
Or possibly your y is a regression variable (continuous numerical data). (Vivek's link)
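If the target is continuous, as "sales" appears to be, each distinct value is treated as its own class, which would explain the "only 1 member" error. One common workaround, sketched here on synthetic data with the same column names, is to stratify on a binned version of the target. The choice of 5 quantile bins and the name sales_bin are assumptions, not something from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the Advertising data.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["sales"] = 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1, 200)

# Bin the continuous target into 5 quantile groups to act as strata.
sales_bin = pd.qcut(data["sales"], q=5, labels=False)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(split.split(data.drop(columns="sales"), sales_bin))

# split() returns positional indices, so iloc is used here.
strat_train_set = data.iloc[train_index]
strat_test_set = data.iloc[test_index]

print(len(strat_train_set), len(strat_test_set))
```

If stratification is not actually needed for a regression target, a plain ShuffleSplit or train_test_split avoids the issue altogether.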
