Saving xgboost binary prediction to submission csv file - python

I have 'train.csv' and 'test.csv' files. The former contains 'Id', a list of features, and a 'Status' column; the 'test.csv' file contains the same columns except 'Status'.
My task is to train an XGBoost model on 'train.csv', predict the binary outcome of 'Status' for 'test.csv', and then save 'Id' and 'Status' to a separate csv file for submission.
I am able to train XGBoost on the 'train' file, and the roc_auc score is pretty good (above 0.8). I have spent hours searching the internet for how to make predictions for the 'test' file and save them to the 'submission' file. To my surprise, although this is quite a common task, I couldn't find any scripts that reliably perform the operations described above.
My working code for the 'train.csv' file just in case:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

predict = pd.read_csv("train.csv")
predictors = ['par48','par52','par75','par82','par84','par85','par86','par87','par89','par108','par109','par132','par156','par165','par167','par175','par190','par197']
X, y = predict[predictors], predict['Status']
# hold out 20% of the training data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train, y_train)
# evaluate on the held-out split
preds = xg_cl.predict(X_test)
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))
print(xg_cl.feature_importances_)
print(roc_auc_score(y_test, xg_cl.predict_proba(X_test)[:, 1]))
Do you have a working code to share? Thanks!

Well, model.predict returns the predictions as an array. So, first read the separate test file, then use the model you built on the training data to predict the output. Finally, add that array of predictions to the pandas DataFrame you just read as a new column and write it to a csv file:
#Separate test (evaluation) dataset that doesn't include the output
test_data = pd.read_csv('test.csv')
#Choose the same columns you trained the model with
X = test_data[predictors]
test_data['predictions'] = xg_cl.predict(X)
test_data.to_csv('submission.csv')
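Since the submission should contain only 'Id' and 'Status', a minimal sketch of that last step could look like the following (assuming 'test.csv' has the 'Id' column described in the question, and reusing test_data, predictors and xg_cl from above):
# build the submission frame from the test file's Id column and the
# predicted class labels, then write it without the DataFrame index
submission = pd.DataFrame({
    'Id': test_data['Id'],
    'Status': xg_cl.predict(test_data[predictors])
})
submission.to_csv('submission.csv', index=False)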

Related

xgboost - feature mismatch when I predict on my test data

I'm using xgboost to train some data and then I want to score it on a test set.
My data is a combination of categorical and numeric variables, so I used pd.get_dummies to dummy all my categorical variables. Training is fine, but the problem happens when I score the model on the test set.
I get a "feature_names_mismatch" error, and it lists the columns that are missing. My dataset is already in matrix (numpy array) format.
The mismatch in feature names is valid, since some dummy categories may not be present in the test set. So if this happens, is there a way for the model to still work?
If I understood your problem correctly, you have some categorical values which appear in the train set but not in the test set. This usually happens when you create dummy variables (converting categorical features with one-hot encoding etc.) separately for train and test instead of doing it on the entire dataset. The following code can help:
for col in featurs_object:
    # fix the full set of categories from the entire dataset, so every
    # dummy column is created even if a category is absent from this subset
    X[col] = pd.Categorical(X[col], categories=df[col].dropna().unique())
    X_col = pd.get_dummies(X[col])
    X = X.drop(col, axis=1)
    X_col.columns = X_col.columns.tolist()
    X = pd.concat([X_col, X], axis=1)

# add back the continuous features, then split
X = pd.concat([X, df_continous], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=1)
featurs_object : all the categorical columns you want to include in model building.
df : your entire dataset (post cleanup).
df_continous : the subset of df with only the continuous features.
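Another common way to achieve the same thing (a sketch of the idea rather than the answer's exact approach; train_df, test_df and categorical_cols are placeholder names) is to build the dummies separately and then align the test columns to the training columns:
# one-hot encode train and test separately, then force the test frame to
# expose exactly the training columns; dummies missing from the test set become 0
X_train_enc = pd.get_dummies(train_df[categorical_cols])
X_test_enc = pd.get_dummies(test_df[categorical_cols])
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)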

Splitting test/training data for scikit?

I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?
The starter code looks like so:
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])
##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
I need to change the acc_svc variable to use X_test and Y_test, however. X_test is given to us, but how do I come up with a Y_test? I know Y_test should correspond to the labels, and I get some size mismatching when I attempt to create one. It should be a simple question; does anyone mind pointing me in the right direction?
The test_preprocessed.csv shouldn't be used to check your model's performance. Split your train_df into train and validation datasets using train_test_split() from scikit-learn, and check your model's performance on the validation dataset, i.e. the validation y. Please refer to the scikit-learn documentation.
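A minimal sketch of that validation split, reusing the variables from the starter code:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# hold out 20% of the labelled training data as a validation set
X_tr, X_val, y_tr, y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=0)
svc = SVC()
svc.fit(X_tr, y_tr)
# accuracy on data the model did not see during fitting
acc_val = round(svc.score(X_val, y_val) * 100, 2)
print("validation accuracy is:", acc_val)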
First of all, you have to understand and clarify your target variable. Your "Y_test" seems to be your existing "Y_pred" variable, which seems to correspond to the "Survived" label (in your test set). However, although you are dropping "Survived" when building "X_train" so that you can use it as a target, you don't seem to do the same for the test set, where instead you are dropping "PassengerId".
Another basic concept here is that your dataset is already split into train and test subsets (your CSV files). I assume that your test set already has one column less than the train set, and that column should be the "Survived" variable, as a continuation from the train CSV file. Otherwise, you should drop it to avoid mismatching and keep it as your test target variable. You don't have to come up with a "Y_test": the result of "Y_pred = svc.predict(X_test)" is exactly your prediction of what "Y_test" would be.
One possible reason you get size mismatching is that the number of columns in your train set is not equal to the number of columns in your test set.
If you want to split into train/test subsets with scikit-learn, you would first merge your CSV files, then do the data analysis on the merged dataset, and finally do the split. One way to keep track of these changes and preserve the original train/test membership is to keep key-value pairs originating from the train-test merge, for example via pandas.concat with the "keys" parameter.
Incorporating the above, one recommended simple solution might be:
# reading csv files
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])
# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])
# now you can eject the train and test sets, using the key-value pairs from the train-test merge
X_train = merged_data.xs(0)
X_test = merged_data.xs(1)
# setting up predictors - target
X= X_train.loc[:, X_train.columns!="Survived"]
y= X_train.loc[:, "Survived"]
# train-test split
# If train_size is None, it will be set to 0.25 based on the documentation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
In my opinion, after understanding the above you could further estimate and compare your model's performance using the cross_val_score function, in the way #SunilG mentions. For example, for a 3-fold (cv=3) cross-validation you could do:
from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
If you do not want to go through the above and you prefer to stay close to your starter code, then you should delete the 5th line of your code and I suppose it would run (if your test set does not include your target variable; otherwise drop it). However, in that case you would not be able to do your own train-test split, since the data are already split, and so the title of your main question/post should be altered.

Semi-supervised sentiment analysis in Python?

I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to create a sentiment analysis in Python. However, here's what I don't understand: it seems to me that the data they use is already labeled? So, how do I use the training I did on the labeled data to then apply it to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data, df2 is a big one with unlabeled data. I just finished training with df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question seems stupid, but I don't have a lot of experience with machine learning and nltk...
EDIT:
Here is the code I'm working on:
enc = preprocessing.LabelEncoder()
# chat_data = chat_data[:180]
# chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)
vectorizer = TfidfVectorizer(max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)
X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:, 1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
# features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken exactly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data because otherwise I got a ValueError telling me it can't convert from string to float.
how do I use the training I did on the labeled data to then apply to unlabeled data?
This is really the problem that supervised ML tries to solve: given known labeled data as inputs of the form (sample, label), a model tries to discover the generic patterns that exist in these data. Those patterns will hopefully be useful for predicting the labels of unseen, unlabeled data.
For example, in a (sad, happy) sentiment-analysis problem, the patterns a model might discover after the training process could be:
The presence of one or more of these words means sad:
("misery", 'sad', 'displaced people', 'homeless', ...)
The presence of one or more of these words means happy:
("win", "delightful", "wedding", ...)
If a new textual document is given, we search for these patterns inside the document and label it accordingly.
As a side note: we usually do not use the whole labeled dataset for training; instead we take a small portion of the dataset (separate from the training set) to validate our model and verify that it discovered genuinely generic patterns, not ones tailored specifically to the training data.
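Applied to the code in the question, one minimal sketch (reusing the asker's chatData, vectorizer, enc and text_classifier; just one reasonable way to do it) is to transform the unlabeled text with the already-fitted vectorizer rather than fitting it again, so the features match the space the classifier was trained on:
unlabeled_text = chatData.iloc[:, 1].values
# use transform (not fit_transform) so the unlabeled text is mapped into
# the same feature space the classifier was trained on
unlabeled_features = vectorizer.transform(unlabeled_text.astype('U')).toarray()
unlabeled_predictions = text_classifier.predict(unlabeled_features)
# map the numeric labels back to the original class names if desired
print(enc.inverse_transform(unlabeled_predictions))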

How does one code data from a different (test) file vs all the data in one file?

All the examples I've ever come across conveniently have the data in one file to show how train_test_split works (or any model, really). But quite often the training data and testing data are two separate files.
So I made an ultra-basic logistic regression train file and test file consisting of two columns, 'age' and 'insurance', and named the DataFrames df_train and df_test.
I realize df_test hasn't been trained on, hence the error, but... isn't that the point?
I know model.predict(X_test) doesn't throw an error, but that is based on the training data, not the test data.
Word of warning: this is what happens when you're old and trying to learn new things. Don't get old.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['age']],df.insurance,test_size=0.1)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
model.predict(df_test)
Thanks,
Old fart
As you stated:
train file and test file consisting of two columns, 'age', 'insurance'.
So if the test file contains both the age and insurance columns and is used as-is, the predict function will not work because the prediction input does not match the training input.
Also, model.predict expects only the independent variable (in your case, age), in the format below:
predict(self, X)
Predict class labels for samples in X.
Parameters:
X : array_like or sparse matrix, shape (n_samples, n_features)
    Samples.
Now coming to the modification:
model.predict(df_test[["age"]])  # select the column as a one-column frame so the shape is (n_samples, 1)
Edit: try this:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# keep the feature matrix two-dimensional: shape (n_samples, 1)
X = df[["age"]].values
y = df["insurance"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

model = LogisticRegression()
model.fit(X_train, y_train)
# predict on the separate test file, again as a 2-D array of ages
model.predict(df_test[["age"]].values)
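As a side note on shapes: scikit-learn estimators expect X with shape (n_samples, n_features), as quoted above, so a single feature has to be passed as a one-column 2-D array rather than a flat 1-D array. A quick illustration:
import numpy as np

ages = np.array([25, 40, 61])   # 1-D, shape (3,)
ages_2d = ages.reshape(-1, 1)   # 2-D, shape (3, 1) -- what fit() and predict() expect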

Train-test Split of a CSV file in Python

I have a .csv file that contains my data. I would like to do Logistic Regression, Naive Bayes and Decision Trees. I already know how to implement these.
However, my teacher wants me to split the data in my .csv file into 80% and let my algorithms predict the other 20%. I would like to know how to actually split the data in that way.
diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

with open("diabetes.csv", "rb") as f:
    data = f.read().split()
train_data = data[:80]
test_data = data[20:]
I tried to split it like this (I'm sure it isn't working).
Workflow
Load the data (see How do I read and write CSV files with Python?)
Preprocess the data (e.g. filtering / creating new features)
Make the train-test (validation and dev-set) split
Code
scikit-learn's sklearn.model_selection.train_test_split is what you are looking for:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
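Applied to the diabetes file from the question, a minimal sketch could look like this (the label column name 'Outcome' is an assumption; substitute whatever your target column is actually called):
import pandas as pd
from sklearn.model_selection import train_test_split

diabetes_df = pd.read_csv("diabetes.csv")

# 'Outcome' is a placeholder for your label column
X = diabetes_df.drop(columns=["Outcome"])
y = diabetes_df["Outcome"]

# 80% train / 20% test, as the assignment asks
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)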
splitted_csv = "value1,value2,value3".split(',')
print(str(splitted_csv)) #["value1", "value2", "value3"]
print(splitted_csv[0]) #value1
print(splitted_csv[1]) #value2
print(splitted_csv[2]) #value3
There are also libraries that parse csv files and let you access values by column name, but from your example I thought you needed some "low level" way to do it.
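For example, the standard-library csv module already provides access by column name:
import csv

with open("diabetes.csv", newline="") as f:
    reader = csv.DictReader(f)   # each row becomes a dict keyed by the header names
    for row in reader:
        print(row)               # access values as row["<column name>"]
        break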
