In short, my initial df has a column containing probabilities from an external predictive model that I would like to compare to the predictions generated by my LightGBM model. First I did a train/test split on my data, which included my column old_predictions:
X = A, B, C, old_predictions
Y = outcome
seed=47
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=seed)
I do not, however, want old_predictions to be included as a feature in my LightGBM model, so I made a separate df from the X_test data (to which I will later append the LightGBM prediction probabilities) and dropped old_predictions from X_test and X_train:
pred_df = X_test
X_test.drop(['old_predictions'], axis = 1, inplace = True)
X_train.drop(['old_predictions'], axis = 1, inplace = True)
When I try to train my model, however, I receive the following error:
LightGBMError: The number of features in data (4) is not the same as it was in training data (3).
You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.
The two questions I have are:
Using the logic I described for why I dropped this variable, would you agree that it is indeed fine to disregard this error?
Where do I add predict_disable_shape_check=true to disregard the error? I have tried the lines below, but none of them have been successful and the same error reappears. I tried reading the docs but am having trouble finding clarity.
model = lgb.LGBMClassifier(**parameters1, predict_disable_shape_check=True)
y_pred=model.predict(X_test, predict_disable_shape_check=True)
predictions = model.predict_proba(X_test, predict_disable_shape_check=True)[:, 1]
predictions_train = model.predict_proba(X_train, predict_disable_shape_check=True)[:, 1]
I have also added it directly to the parameters list and this did not work either.
This should work.
y_pred=model.predict(X_test, predict_disable_shape_check=True)
Make sure X_test contains only the columns which were used to train the model.
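More generally, the shape check usually points at a real mismatch rather than something to silence. Below is a minimal sketch of the flow described in the question, with the copy made explicit so the dropped column survives in pred_df (X, Y and parameters1 are taken from the question and assumed to be defined):
import lightgbm as lgb
from sklearn.model_selection import train_test_split

seed = 47
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=seed)

# take a real copy (not a reference), so dropping columns below does not touch pred_df
pred_df = X_test.copy()

# drop the external predictions before fitting, so they are never used as a feature
X_train = X_train.drop(columns=['old_predictions'])
X_test = X_test.drop(columns=['old_predictions'])

model = lgb.LGBMClassifier(**parameters1)
model.fit(X_train, y_train)

# predict on the same feature set the model was trained on; no shape-check override needed
pred_df['lgbm_probability'] = model.predict_proba(X_test)[:, 1]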
I'm not yet used to LOOCV.
I've been curious about the problem in the title.
I did leave-one-out cross-validation for my random forest model.
My code looks like this:
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
y_tests = []
y_preds = []
for train_index, test_index in loo.split(x):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = RandomForestRegressor()
    model.fit(x_train, y_train.values.ravel())
    y_pred = model.predict(x_test)
    # accumulate the single held-out target and its prediction for this fold
    y_tests += y_test.values.tolist()[0]
    y_preds += list(y_pred)

rr = metrics.r2_score(y_tests, y_preds)
ms_error = metrics.mean_squared_error(y_tests, y_preds) ** 0.5  # RMSE
After that, I wanted to get the feature importances of my model, like this:
features = x.columns
sorted_idx = model.feature_importances_.argsort()
The result is pretty different from what I expected.
During LOOCV, my computer built many different models, each using a different train/test split of my original data, and the number of splits is literally the same as the length of the original data.
So I was thinking that there should be as many sets of feature importances as there are rows in the original data, because the test set of each LOOCV iteration contains just one sample (I don't know which word best explains this in English; English is not my mother tongue).
But the result was not multiple importances, just one.
There was only one feature importance vector, as if it had been calculated for a single model (as if LOOCV had never been added to my code).
Then why did I get only one set of importances? I want to understand the reason, and I have sketched below what I expected.
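To make this concrete, here is a rough sketch of what I expected to be able to collect, one importance vector per fold (importances_per_fold is a name I made up; it is not in my code above):
importances_per_fold = []
for train_index, test_index in loo.split(x):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = RandomForestRegressor()
    model.fit(x_train, y_train.values.ravel())
    # keep this fold's importances before the next iteration rebinds `model`
    importances_per_fold.append(model.feature_importances_)

# I expected as many importance vectors as there are rows in x
print(len(importances_per_fold), len(x))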
Thank you for reading my question.
In short: I want to know why I got only one set of feature importances even though LOOCV was part of my code.
I am trying to do forecasting based on time series. I am doing temperature forecasting by using the past three years of hourly data.
Instead of using the X_test from the train_test_split method, I am using my own test dataset because I need a seven-day-ahead forecast.
Problem: When I use my dummy test dataset for forecasting, it gives incorrect values. But when I use the test dataset from the train_test_split method, it gives accurate values. I don't understand why this is happening.
What I tried to fix this problem: First, I thought this was happening because I was not applying feature scaling, but after implementing feature scaling the results were the same. Then I thought that, since train_test_split also shuffles the data, I should shuffle my dummy test data too, but the results were still the same.
My question: How can I use a different dataframe for testing a model, and how can I get accurate results?
Program:
df = pd.read_csv("Timeseries_47.999_7.850_SA_0deg_0deg_2013_2016.csv", sep=",")
time_mod = []
for i in range(0, len(df['time'])):
    ss = pd.to_datetime(df['time'][i], format="%Y%m%d:%H%M")
    time_mod.append(ss)
df['datetime'] = time_mod
df["Hour"] = pd.to_datetime(df["datetime"]).dt.hour
df["Month"] = pd.to_datetime(df["datetime"]).dt.month
df["Day_of_year"] = pd.to_datetime(df["datetime"]).dt.dayofyear
df["Day_of_month"] = pd.to_datetime(df["datetime"]).dt.day
df["week_of_year"] = pd.to_datetime(df["datetime"]).dt.week
X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
y = df[{"T2m"}].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
## Creating dummy datetime for Test data
df.set_index('datetime',inplace=True)
future_dates = [df.index[-1]+DateOffset(hours=x) for x in range(0,168)]
future_dates_df = pd.DataFrame({'Data':future_dates})
future_dates_df["Hour"] = pd.to_datetime(future_dates_df["Data"]).dt.hour
future_dates_df["Month"] = pd.to_datetime(future_dates_df["Data"]).dt.month
future_dates_df["Day_of_year"] = future_dates_df["Data"].dt.dayofyear
future_dates_df["Day_of_month"] = pd.to_datetime(future_dates_df["Data"]).dt.day
future_dates_df["Date"] = pd.to_datetime(future_dates_df["Data"]).dt.date
future_dates_df["week_of_year"] = future_dates_df["Data"].dt.week
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values
#Model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test_dum)
plt.plot(y_test, color="r", label="actual")
plt.plot(y_pred, label="forecasted")
sns.set(rc={'figure.figsize':(20,10)})
plt.legend()
plt.show()
The reason behind getting inaccurate results could be:
The columns of the dummy test dataset are not arranged in the same order as in the actual dataset. In X you select the columns with a set (curly braces), so their order is not guaranteed, while X_test_dum uses a list in a different order:
X = df[{"Hour", "Day_of_year", "Day_of_month", 'week_of_year', 'Month'}].values
X_test_dum = future_dates_df[["Hour",'Month','Day_of_year','week_of_year','Day_of_month']].values
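A minimal sketch of one way to keep the ordering consistent, assuming the same column names as above, is to select the features through one explicit list for both frames:
# one explicit list, so the training features and the dummy test features
# share exactly the same columns in exactly the same order
feature_cols = ["Hour", "Month", "Day_of_year", "Day_of_month", "week_of_year"]

X = df[feature_cols].values
X_test_dum = future_dates_df[feature_cols].values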
I also notice that you are applying linear regression, but the data does not look linear. Try polynomial regression, a decision tree, a random forest, or another model that works well with non-linear data.
I think eliminating some non-essential independent variables can also improve your results.
Only consider: Hour and Day_of_year
Finally, try creating the dummy dataset directly in the CSV file and then separating the train and test datasets in Python.
I"m using xgboost to train some data and then I want to score it on a test set.
My data is a combination of categorical and numeric variables, so I used pd.get_dummies to dummy all my categorical variables. training is fine, but the problem happens when I score the model on the test set.
I get an error of "feature_names_mismatch" and it lists the columns that are missing. My dataset is already in a matrix (numpy array format).
the mismatch in feature name is valid since some dummy-categories may not be present in the test set. So if this happens, is there a way for the model to still work?
If I understood your problem correctly, you have some categorical values which appear in the train set but not in the test set. This usually happens when you create dummy variables (converting categorical features using one-hot encoding, etc.) separately for train and test instead of doing it on the entire dataset. The following code can help:
for col in featurs_object:
    X[col] = pd.Categorical(X[col], categories=df[col].dropna().unique())
    X_col = pd.get_dummies(X[col])
    X = X.drop(col, axis=1)
    X_col.columns = X_col.columns.tolist()
    frames = [X_col, X]
    X = pd.concat(frames, axis=1)

X = pd.concat([X, df_continous], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=1)
featurs_object: all the categorical columns you want to include for model building.
df: your entire dataset (post cleanup).
df_continous: a subset of df with only the continuous features.
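If the dummies have already been created separately for train and test, another option (a sketch, not part of the code above; train_raw and test_raw are hypothetical names for the raw categorical frames) is to align the test frame to the training columns:
import pandas as pd

# dummy-encode train and test separately (this is what causes the mismatch)
X_train_d = pd.get_dummies(train_raw)
X_test_d = pd.get_dummies(test_raw)

# add dummy columns missing from the test set as zeros and drop any extras,
# so both frames end up with identical columns in identical order
X_test_d = X_test_d.reindex(columns=X_train_d.columns, fill_value=0)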
I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?
The starter code looks like so:
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])
##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
However, I need to change the acc_svc variable to use X_test and Y_test. X_test is given to us, but how do I come up with a Y_test? I know Y_test should correspond to the labels, and I get some size mismatching when I try to create one. This should be a simple question; would anyone mind pointing me in the right direction?
The test_preprocessed.csv shouldn't be used to check your model's performance. Split your train_df into train and validation datasets using train_test_split() from scikit-learn, and check your model's performance on the validation dataset, i.e. its y. Please refer to the scikit-learn documentation.
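A minimal sketch of that suggestion, reusing the variables from the starter code (the 0.2 validation fraction and the random_state are just example values):
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# hold out part of the labelled training data as a validation set
X_tr, X_val, y_tr, y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=0)

svc = SVC()
svc.fit(X_tr, y_tr)

# score on data the model did not see during fitting
acc_val = round(svc.score(X_val, y_val) * 100, 2)
print("svm validation accuracy is:", acc_val)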
First of all, you have to understand and clarify your target variable. Your "Y_test" seems to be your already existing "Y_pred" variable, which should correspond to the "Survived" label (in your test set). However, although you are dropping it from "X_train" so that you can use it as a target, you don't seem to do the same for "X_test", where you are instead dropping "PassengerId".
Another basic point here is that your dataset is already split into train and test subsets (your CSV files). I assume that your test set already has one column fewer than the train set, and that the missing column should be the "Survived" variable, as a continuation of the train CSV file. Otherwise, you should drop it to avoid a mismatch and keep it as your test target variable. You don't have to come up with a "Y_test": the result of "Y_pred = svc.predict(X_test)" gives you the predictions for the test set.
One possible reason you get a size mismatch is that the number of columns in your train set is not equal to that of the test set.
If you want to split into train/test subsets with scikit-learn, you would first merge your CSV files, then do the data analysis on the merged dataset, and finally do the split. One way to keep track of these changes and preserve the original sizes of the train and test parts is to keep key-value pairs originating from the train-test merge, for example via pandas.concat, using the "keys" parameter.
Incorporating the above, one recommended simple solution might be:
# reading csv files
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])
# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])
# now you can eject the train and test sets, using the key-value pairs from the train-test merge
X_train = merged_data.xs(0)
X_test = merged_data.xs(1)
# setting up predictors - target
X= X_train.loc[:, X_train.columns!="Survived"]
y= X_train.loc[:, "Survived"]
# train-test split
# per the documentation, test_size defaults to 0.25 when both test_size and train_size are None
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
In my opinion, after understanding the above, you could further estimate and compare your model's performance using the cross_val_score function, in the way #SunilG mentions. For example, for a 3-fold (cv=3) cross-validation, you could do:
from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
If you do not want to proceed with the above and you want to stay close to your starter code, then you should delete the 5th line of your code, and I suppose it would run (if your test set does not include your target variable; otherwise drop it). However, in this case you would not be able to do the train/test split on your own, since the data is already split, so the title of your main question/post should be altered.
I have a dataset that shows whether a person has diabetes based on several indicators; it looks like this (original dataset):
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for blank outcomes (rows that still need to be predicted)? That is to say, should I upload the file again? Let's say I want to predict the following:
Rows to predict:
Please let me know if my questions are clear!
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
What you need to do is:
Get predictions using X instead of data
Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means that your predictions will come both from data that have already been used for model fitting (i.e. the X_train part) and from data unseen by the model during training (the X_test part). I'm not quite sure what your final objective is (nor is that what the question is about), but this is a rather unusual situation from an ML point of view.
If you have a new dataset data_new for which you want to predict the outcome, you do it in a similar way, always assuming that X_new has the same features as X (i.e. again removing the Outcome column, as you did for X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
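One detail worth adding, which is not covered above: in the question's code the model was fitted on features transformed by a StandardScaler, so any new data should be passed through the same fitted scaler before predicting. A minimal sketch, assuming data_new holds the same 8 feature columns first (in the same order) and sc is the scaler fitted earlier:
# keep only the 8 feature columns, in the same order as during training
X_new = data_new.iloc[:, :8].values

# reuse the already-fitted scaler; do not call fit_transform again on new data
X_new_scaled = sc.transform(X_new)

y_new = model.predict(X_new_scaled)
data_new['prediction'] = y_new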