Train-test Split of a CSV file in Python

I have a .csv file that contains my data. I would like to do Logistic Regression, Naive Bayes and Decision Trees. I already know how to implement these.
However, my teacher wants me to split the data in my .csv file into 80% and let my algorithms predict the other 20%. I would like to know how to actually split the data in that way.
diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()
with open("diabetes.csv", "rb") as f:
    data = f.read().split()
train_data = data[:80]
test_data = data[20:]
I tried to split it like this (I know it isn't working).

Workflow
Load the data (see How do I read and write CSV files with Python?)
Preprocess the data (e.g. filtering / creating new features)
Make the train-test (validation and dev-set) split
Code
Scikit-learn's sklearn.model_selection.train_test_split is what you are looking for:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=0)
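For the diabetes.csv from the question, a minimal sketch might look like this (assuming the label column is named "Outcome"; substitute the actual name of your target column):
import pandas as pd
from sklearn.model_selection import train_test_split

diabetes_df = pd.read_csv("diabetes.csv")
# "Outcome" is an assumption here; use whichever column holds your labels
X = diabetes_df.drop("Outcome", axis=1)
y = diabetes_df["Outcome"]
# 80% for training, 20% held out for your algorithms to predict
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)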

splitted_csv = "value1,value2,value3".split(',')
print(splitted_csv)     # ['value1', 'value2', 'value3']
print(splitted_csv[0])  # value1
print(splitted_csv[1])  # value2
print(splitted_csv[2])  # value3
There are also libraries that parse CSV and let you access values by column name, but from your example I thought you needed a "low level" way to do it.
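For example, the standard-library csv module can parse the file and give you access by column name through csv.DictReader; a minimal sketch, assuming diabetes.csv has a header row:
import csv

with open("diabetes.csv", newline="") as f:
    reader = csv.DictReader(f)   # maps each row to {column name: value}
    rows = list(reader)

print(rows[0].keys())  # the column names taken from the header
print(rows[0])         # first data row, accessible by column name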

Related

Splitting test/training data for scikit?

I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?
The starter code looks like so:
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])
##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
I need to change the acc_svc variable to be using X_test and Y_test, however. X_test is given to us, but how do I come up with a Y_test? I know the Y_test should correspond to labels, and I'm having some size mismatching going on when I attempt to do so. Should be a simple question, anyone mind pointing me in the right direction?
The test_preprocessed.csv file shouldn't be used to check your model performance. Split your train_df using train_test_split() from scikit-learn into train and validation datasets, and check your model performance on the validation dataset, i.e. the y of the validation split. Please refer to the scikit-learn documentation.
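A minimal sketch of that idea, reusing the names from the starter code (the 20% validation fraction is only an example):
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# hold out part of the labelled training data as a validation set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, Y_train, test_size=0.2, random_state=0)

svc = SVC()
svc.fit(X_tr, y_tr)

# score on data the model has not seen during fitting
acc_val = round(svc.score(X_val, y_val) * 100, 2)
print("validation accuracy is:", acc_val)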
First of all, you have to understand and clarify your target variable. Your "Y_test" is essentially what your existing "Y_pred" variable tries to estimate: the "Survived" label of your test set. However, although you drop "Survived" from the train set to build "X_train" and use it as the target, you don't do the same for "X_test", where you instead drop "PassengerId".
Another basic concept here is that your dataset is already split into train and test subsets (your two CSV files). I assume that your test set already has one column fewer than the train set, the missing one being the "Survived" variable, as a continuation of the train CSV file. Otherwise, you should drop it to avoid a mismatch and keep it as your test target variable. You don't have to come up with a "Y_test": the call "Y_pred = svc.predict(X_test)" gives you the predicted labels for the test set, which is the closest you can get to a "Y_test" when the true test labels are not provided.
One possible reason you get a size mismatch is that the number of columns in your train set is not equal to that of the test set.
If you want to split into train/test subsets with scikit-learn yourself, you would first merge your CSV files, then do the data analysis on the merged dataset, and finally do the split. One way to keep track of these changes and preserve the original train/test partition is to keep the key-value pairs that originate from the train-test merge, for example via pandas.concat with the "keys" parameter.
Incorporating the above, one recommended simple solution might be:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# reading csv files
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])
# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])
# now you can eject the train and test sets, using the key-value pairs from the train-test merge
X_train = merged_data.xs(0)
X_test = merged_data.xs(1)
# setting up predictors - target
X = X_train.loc[:, X_train.columns != "Survived"]
y = X_train.loc[:, "Survived"]
# train-test split
# if test_size is None, it defaults to 0.25 per the documentation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
In my opinion, after understanding the above you could further estimate and compare your model's performance using the cross_val_score function, in the way #SunilG mentions. For example, for a 3-fold (cv=3) cross-validation, you could run:
from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
If you do not want to go through the above and prefer to stay close to your starter code, then you could delete the 5th line of your code and I suppose it would run (provided your test set does not include your target variable; otherwise drop it). However, in this case you would not be splitting train and test on your own, since the data is already split, so the title of your main question/post should be altered.

Split data into testing and training and convert to csv or excel files

I have a large dataset (around 200k rows) and I want to split it randomly into 2 parts: 70% as the training data and 30% as the testing data. Is there a way to do this in Python? Note that I also want to save these datasets as Excel or CSV files on my computer. Thanks!
from sklearn.model_selection import train_test_split
#split the data into train and test set
train,test = train_test_split(data, test_size=0.30, random_state=0)
#save the data
train.to_csv('train.csv',index=False)
test.to_csv('test.csv',index=False)
Start by importing the following:
from sklearn.model_selection import train_test_split
import pandas as pd
In order to split you can use the train_test_split function from sklearn package:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
where X and y are taken from your original dataframe.
Later, you can export each of them as CSV using the pandas package:
X_train.to_csv('X_train.csv', index=False)  # the file names are just examples
X_test.to_csv('X_test.csv', index=False)
The same goes for the y data as well.
EDIT: since you clarified the question and need both the X and y factors in the same file, you can do the following:
train, test = train_test_split(yourdata, test_size=0.3, random_state=42)
and then export them to csv as I mentioned above.
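Putting it together, a minimal sketch might be (the file names are only examples, and to_excel needs an Excel engine such as openpyxl installed):
import pandas as pd
from sklearn.model_selection import train_test_split

yourdata = pd.read_csv('yourdata.csv')   # hypothetical input file

# 70% training / 30% testing, rows shuffled randomly
train, test = train_test_split(yourdata, test_size=0.3, random_state=42)

# save as CSV
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)

# or save as Excel (requires an engine such as openpyxl)
train.to_excel('train.xlsx', index=False)
test.to_excel('test.xlsx', index=False)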

Split Random Data Training And Data Testing in Pandas CSV

I'm a newbie using Python.
First of all, I would like to split my data into training data and testing data, where
Data Training = 6 rows and Data Testing = 2 rows.
I'm confused about how to randomize the training and testing data from the CSV file.
I've been trying to split the training and testing data, but the rows stay in the same order as in the CSV file.
Here is how I get my training and testing data:
def ambilData():
    df = pd.read_csv("datalatihnodummy.csv", sep=';')
    dropdata = df.drop(['data', 'Klasifikasi'], axis=1)
    datalatih = dropdata.iloc[:6]
    datauji = dropdata.iloc[6:]
    return datalatih, datauji
(The training and testing outputs were shown as images in the original post.)
I would like to test Hepatitis B only or Hepatitis A only.
Does anyone know how to randomize my dataset? Thanks!
here is my data: https://drive.google.com/open?id=1tD3h0aS-AB4qrMg2vw0fHcMx6F3jzJCx
try this:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)
it will randomly split your data into training and testing data.
the x_data holds the independent variables - so here you want to drop the 'Klasifikasi' column
the y_data is the dependent variable - in your df it's the 'Klasifikasi' column
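Applied to your file, a minimal sketch might look like this (the column names come from your code, and a test_size of 0.25 gives the 6/2 split you described for 8 rows):
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("datalatihnodummy.csv", sep=';')

# features: everything except the identifier and the label
x_data = df.drop(['data', 'Klasifikasi'], axis=1)
# label: the class you want to predict
y_data = df['Klasifikasi']

# 6 training rows and 2 testing rows out of 8, chosen at random
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.25, random_state=1)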
hope it will help

Semi-supervised sentiment analysis in Python?

I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to create a sentiment analysis in python. However, here's what I don't understand: It seems to me that the data they use is already labeled? So, how do I use the training I did on the labeled data to then apply to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data, df2 is a big one with unlabeled data. I just finished training with df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question may seem stupid, but I don't have a lot of experience with machine learning and nltk ...
EDIT:
Here is the code I'm working on:
enc = preprocessing.LabelEncoder()
# chat_data = chat_data[:180]
# chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)
vectorizer = TfidfVectorizer(max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)
X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:,1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
# features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken exactly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data because I got a ValueError telling me it can't convert from string to float otherwise.
how do I use the training I did on the labeled data to then apply to unlabeled data?
This is really the problem that supervised ML tries to solve: having known labeled data as inputs of the form (sample, label), a model tries to discover the generic patterns that exist in these data. These patterns hopefully will be useful to predict the labels of unseen unlabeled data.
For example, in a sentiment-analysis (sad, happy) problem, the patterns a model might discover after the training process could be:
The presence of one or more of these words suggests sad:
("misery", 'sad', 'displaced people', 'homeless', ...)
The presence of one or more of these words suggests happy:
("win", "delightful", "wedding", ...)
When a new text document is given, we search for these patterns inside it and label it accordingly.
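Concretely, for the code in the question, a minimal sketch of the prediction step might look like this; the key detail is to reuse the vectorizer that was fitted on the labeled data via transform (not fit_transform), so the unlabeled text is mapped into the same feature space the classifier was trained on:
# read the unlabeled chats (df2 in the question's terms)
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled_text = chatData.iloc[:, 1].values.astype('U')

# transform, not fit_transform: reuse the vocabulary learned from the labeled data
unlabeled_features = vectorizer.transform(unlabeled_text).toarray()

# predict with the classifier trained on the labeled data
predictions = text_classifier.predict(unlabeled_features)

# map the numeric predictions back to the original label names
print(enc.inverse_transform(predictions))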
As a side note: we usually do not use the whole labeled dataset in the training process; instead we hold out a small portion of the dataset (separate from the training set) to validate our model and verify that it discovered genuinely generic patterns, not ones tailored specifically to the training data.

Saving xgboost binary prediction to submission csv file

I have 'train.csv' and 'test.csv' files. The former contains 'Id', a list of features, and a 'Status' column with values in it, the 'test.csv' file contains the same columns except the 'Status' one.
My task is to train an XGboost model on the 'train.csv' file and predict binary outcome of 'Status' for the 'test.csv' file, then to save 'Id' and 'Status' to a separate csv file for submission.
I am able to train XGboost on the 'train' file, and the roc_auc score is pretty good (above 0.8). I have spent hours searching the internet for how to make predictions for the 'test' file and save them to the 'submission' file. To my surprise, and although this is quite a common task, I couldn't find any scripts that would reliably perform the operations described above.
My working code for the 'train.csv' file just in case:
predict = pd.read_csv("train.csv")
predictors =['par48','par52','par75','par82','par84','par85','par86','par87','par89','par108','par109','par132','par156','par165','par167','par175','par190','par197']
X, y = predict[predictors], predict['Status']
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=123)
xg_cl=xgb.XGBClassifier(objective='binary:logistic',n_estimators=10,seed=123)
xg_cl.fit(X_train, y_train)
preds=xg_cl.predict(X_test)
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
print(xg_cl.feature_importances_)
print(roc_auc_score(y_test, xg_cl.predict_proba(X_test)[:,1]))
Do you have a working code to share? Thanks!
Well, model.predict returns the predicted results as an array. First you need to read the separate test file if it exists, then you can use the model you built from the training data to predict the output. Finally, you can add that array of predictions to the pandas DataFrame you read in as a new column and then write it to a csv file:
#Separate test (evaluation) dataset that doesn't include the output
test_data = pd.read_csv('test.csv')
#Choose the same columns you trained the model with
X = test_data[predictors]
test_data['predictions'] = xg_cl.predict(X)
test_data.to_csv('submission.csv')
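If the submission file should contain only the 'Id' and 'Status' columns mentioned in the question, a variant of the last step might be:
# keep only the 'Id' column and add the predicted 'Status'
submission = test_data[['Id']].copy()
submission['Status'] = xg_cl.predict(X)
submission.to_csv('submission.csv', index=False)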
