I am trying to split a pandas dataframe of size 610x9724 (610 users x 9724 movies) so that 80% of the non-null values go into a training set and the remaining 20% of the non-null values go into a test set. In the training set, the 20% of values that were removed would be replaced with null, and likewise in the test set the values that belong to training would be replaced with null (both sets would still be 610x9724, just with more nulls than the original dataset).
I would then use SVD on the test set (610x9724) to predict the removed values which are in the test set.
I have tried using sklearn train_test_split but after splitting, the train set becomes dimension 549x9724 and the validation set becomes 61x9724 which makes it difficult to take the RMSE between predicted and test set. Is there an easy way to do this split?
from sklearn import model_selection

data = df.pivot_table(index='userId', columns='movieId', values='rating')
data_train, data_valid = model_selection.train_test_split(
    data, test_size=0.1, random_state=42
)
print(data.shape) # (610, 9724)
print(data_train.shape) # (549, 9724)
print(data_valid.shape) # (61, 9724)
You can reindex your dataframes to restore the initial dimensions. Every row whose index is missing will be filled with NaN:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=42)
train = train.reindex(data.index)
test = test.reindex(data.index)
Output:
>>> train.shape
(610, 9724)
>>> test.shape
(610, 9724)
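Note that splitting by rows and reindexing holds out whole users rather than individual ratings. If the goal is the element-wise split described in the question, one possible approach is to draw a random mask over the non-null cells. A minimal sketch on a small made-up ratings matrix (the toy data, the seed, and the variable names are assumptions for illustration, not the original dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 610x9724 pivot table (hypothetical data).
rng = np.random.default_rng(42)
values = rng.integers(1, 6, size=(6, 8)).astype(float)
values[rng.random(values.shape) < 0.4] = np.nan
data = pd.DataFrame(values)

# Positions (row, column) of every observed rating.
rows, cols = np.nonzero(data.notna().to_numpy())

# Randomly pick 20% of the observed ratings for the test set.
n_test = int(round(0.2 * len(rows)))
test_idx = rng.choice(len(rows), size=n_test, replace=False)

test_mask = np.zeros(data.shape, dtype=bool)
test_mask[rows[test_idx], cols[test_idx]] = True

# where() keeps values where the condition holds and writes NaN elsewhere.
train = data.where(~test_mask)  # held-out ratings become NaN
test = data.where(test_mask)    # everything except held-out ratings is NaN
```

Both frames keep the original shape, so an RMSE can be computed over just the positions where test_mask is True.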
Related
I want to create id_set.csv. This file will contain the split of data between train/validation/test. It will have 2 columns: ID and set. The IDs must be identical to the ones in dataset.csv. The set value must be one of "train", "validation" or "test". Data will be randomly split, with 50-70% going to the training set, 20-30% to the validation set and 10-20% to the test set.
# Train-test-validation split
train, test = train_test_split(self.df, test_size=0.2, random_state=1)
train, val = train_test_split(train, test_size=0.25, random_state=1)
Desired output:
ID        set
r2_HG_3   train
r2_HG_4   train
r2_HG_5   validation
r2_HG_6   validation
r2_HG_7   test
r2_HG_8   test
If you're working with pandas you could shuffle by rows using df.sample(frac=1), and then set the first 50-70% of the rows as training set, followed by 20-30% as validation set, and the final 10-20% as test set.
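A minimal sketch of that idea, assuming the IDs live in a column named ID and using a 60/20/20 split (the column name, the toy IDs, and the exact fractions are assumptions; any fractions within the stated ranges would do):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataset.csv with an "ID" column.
df = pd.DataFrame({"ID": [f"r2_HG_{i}" for i in range(3, 23)]})

# Shuffle the rows; random_state makes the shuffle reproducible.
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

# Cut points: first 60% train, next 20% validation, the rest test.
n = len(shuffled)
n_train = int(0.6 * n)
n_val = int(0.2 * n)

labels = np.array(["test"] * n, dtype=object)
labels[:n_train] = "train"
labels[n_train:n_train + n_val] = "validation"

id_set = pd.DataFrame({"ID": shuffled["ID"], "set": labels})
# id_set.to_csv("id_set.csv", index=False)  # uncomment to write the file
```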
If I have understood correctly, the input for the split is a dataframe that already contains the ID column. In that case:
# Train-test-validation split (60/20/20 overall)
train, test = train_test_split(self.df, test_size=0.2, random_state=1)
train, val = train_test_split(train, test_size=0.25, random_state=1)
# Assuming train, val, test are dataframes.
# assign() returns a new dataframe with the "set" column added,
# which avoids writing into a slice of self.df.
train = train.assign(set='train')
val = val.assign(set='validation')
test = test.assign(set='test')
# Concatenate the dataframes and keep only the two required columns
# (assuming the ID column is named 'ID')
id_set = pd.concat([train, val, test], axis=0)
id_set[['ID', 'set']].to_csv('id_set.csv', index=False)
I'm using xgboost to train some data and then I want to score it on a test set.
My data is a combination of categorical and numeric variables, so I used pd.get_dummies to dummy all my categorical variables. Training is fine, but the problem happens when I score the model on the test set.
I get an error of "feature_names_mismatch" and it lists the columns that are missing. My dataset is already in a matrix (numpy array format).
The mismatch in feature names is valid, since some dummy categories may not be present in the test set. If this happens, is there a way for the model to still work?
If I understood your problem correctly, you have some categorical values that appear in the train set but not in the test set. This usually happens when you create dummy variables (converting categorical features with one-hot encoding etc.) separately for train and test instead of doing it on the entire dataset. The following code can help:
for col in featurs_object:
    X[col] = pd.Categorical(X[col], categories=df[col].dropna().unique())
    X_col = pd.get_dummies(X[col])
    X = X.drop(col, axis=1)
    X_col.columns = X_col.columns.tolist()
    frames = [X_col, X]
    X = pd.concat(frames, axis=1)
X = pd.concat([X, df_continous], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=1)
featurs_object : consists of all categorical columns which you want to include for model building.
df : your entire dataset (post cleanup)
df_continous : Subset of df, with only continuous features.
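If the training and test sets are already separate, another common pattern is to dummy-encode each one and then align the test columns to the training columns. A rough sketch with made-up data (the frames and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical train/test frames; "yellow" never appears in training
# and "blue"/"green" never appear in test.
train = pd.DataFrame({"color": ["red", "blue", "green"], "size": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"color": ["red", "yellow"], "size": [4.0, 5.0]})

X_train = pd.get_dummies(train, columns=["color"])
X_test = pd.get_dummies(test, columns=["color"])

# Force the test frame to have exactly the training columns, in the same order.
# Categories missing from test become all-zero columns; categories unseen in
# training (here "yellow") are dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

The alignment has to happen while the data is still a dataframe, before converting to a numpy array, because the column names are what make the matching possible.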
I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?
The starter code looks like so:
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])
##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
I need to change the acc_svc variable to be using X_test and Y_test, however. X_test is given to us, but how do I come up with a Y_test? I know the Y_test should correspond to labels, and I'm having some size mismatching going on when I attempt to do so. Should be a simple question, anyone mind pointing me in the right direction?
The test_preprocessed.csv shouldn't be used to check your model performance. Split your train_df into training and validation datasets using train_test_split() from scikit-learn, and check your model performance on the validation dataset; the validation labels play the role of your Y_test. Please refer to: scikit-learn documentation
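A minimal sketch of such a validation split. The dataframe below is a synthetic stand-in for train_preprocessed.csv (its feature columns f1 to f4 and the way "Survived" is generated are assumptions made only so the example runs on its own):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in for train_preprocessed.csv.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
train_df["Survived"] = (train_df["f1"] + train_df["f2"] > 0).astype(int)

X = train_df.drop("Survived", axis=1)
y = train_df["Survived"]

# Hold out 25% of the labelled rows; these act as X_test / Y_test.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

svc = SVC()
svc.fit(X_train, y_train)

# Accuracy on rows the model did not see during fitting.
acc_svc = round(svc.score(X_val, y_val) * 100, 2)
print("svm validation accuracy is:", acc_svc)
```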
First of all, you have to be clear about your target variable. Your "Y_test" seems to be your existing "Y_pred" variable, which corresponds to the "Survived" label (in your test set). However, although you drop "Survived" from "X_train" so that you can use it as the target, you don't do the same for "X_test", where you drop "PassengerId" instead.
Another basic point is that your dataset is already split into train and test subsets (your CSV files). I assume that your test set already has one column fewer than the train set, and that the missing column is "Survived". Otherwise, you should drop it from the test features to avoid a mismatch and keep it as your test target variable. You don't have to come up with a "Y_test" for that file: "Y_pred = svc.predict(X_test)" gives you the predicted "Survived" values for the test set.
One possible reason for the size mismatch is that the number of columns in your train set is not equal to the number of columns in the test set.
If you want to split into train/test subsets based on Scikit-learn you would first merge your CSV files, then do the data analysis in the merged dataset, and finally, do the split. One way to keep track of these changes and maintain the same original size of the train-test split could be to keep key-value pairs originated from the train-test merge. One way to do that could be via the pandas.concat, using the parameter "keys".
Incorporating the above, one recommended simple solution might be:
# reading csv files
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])
# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])
# now you can eject the train and test sets, using the key-value pairs from the train-test merge
X_train = merged_data.xs(0)
X_test = merged_data.xs(1)
# setting up predictors - target
X= X_train.loc[:, X_train.columns!="Survived"]
y= X_train.loc[:, "Survived"]
# train-test split
# If test_size and train_size are both None, test_size defaults to 0.25 per the documentation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_test, y_test) * 100, 2)
print("svm accuracy is:", acc_svc)
In my opinion, after understanding the above you could further estimate and compare your model's performance using the cross_val_score function, in the way @SunilG mentions. For example, for 3-fold cross-validation (cv=3), you could use:
from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
If you do not want to proceed to the above and you want to be close to your starter code, then you should delete your 5th line of code and I suppose it would run (if your test set does not include your target variable, otherwise drop it). However in this case you would not be able to split your train-test on your own, since it is already split, hence the title of your main question/post should be altered.
Someone presented a solution to split a dataset into three sets. I wonder where the label is in this case, or how to set the labels afterwards.
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
I will answer the question based on comments:
Using this method for splitting:
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
You are getting 3 different objects: train holds the first 60% of the rows of the shuffled df, validate holds the rows between 60% and 80%, and test holds the last 20% (80%-100%). The labels are within these dataframes.
In train_test_split you are passing two objects, X and Y, which have been most likely previously split from an original dataset and getting in return 4 objects, 2 corresponding to train and two corresponding to test. Keep in mind this: You are first splitting your dataset into independent variables and explained/target variable, and then splitting these two objects into train and test.
With np.split you are going the other way around: you first split your dataset into 3 objects (train, validate and test), each of which later needs to be split individually into the independent variables, commonly known as X, and the target variable, known as Y. You are doing the same splits, just in reverse order.
However, keep in mind that np.split itself does not split randomly: it cuts at the positions you pass, so any randomness comes from shuffling the rows first with df.sample(frac=1), whereas train_test_split shuffles for you and returns random train and test subsets. np.split, on the other hand, allows more flexibility, for instance, as your example shows, creating more than 2 subsets.
Maybe this will help!
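To illustrate the reverse order described above, here is a small sketch that cuts a shuffled dataframe into three parts and only then separates features from labels. The column name "label" and the toy data are assumptions, and the cut is written with iloc at the same 60%/80% positions rather than with np.split, so it does not depend on how NumPy handles dataframes:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: two feature columns and a "label" column.
df = pd.DataFrame({
    "x1": np.arange(10, dtype=float),
    "x2": np.arange(10, dtype=float) * 2,
    "label": [0, 1] * 5,
})

# Shuffle first; the cut points below are fixed positions, not random.
shuffled = df.sample(frac=1, random_state=0)
i, j = int(0.6 * len(df)), int(0.8 * len(df))

# Same 60/20/20 cut points as the np.split call, written with iloc.
train, validate, test = shuffled.iloc[:i], shuffled.iloc[i:j], shuffled.iloc[j:]

# The labels travel with each part; separate them afterwards.
X_train, y_train = train.drop(columns="label"), train["label"]
X_validate, y_validate = validate.drop(columns="label"), validate["label"]
X_test, y_test = test.drop(columns="label"), test["label"]
```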
Try this. Feed the output of one of the train_test_split into a second one as input
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_test, X_validate, y_test, y_validate = train_test_split(X_test, y_test, test_size=0.5)
The function randomly splits 2 arrays into 4 arrays, and test_size determines the size of the split allocated to the test output vs train. The y input is meant to be a target for building a machine learning model and X is meant to be the features for the model. If you want them combined, then just concat the equivalent X and y outputs.
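A short sketch of the chained split followed by that concat step, on hypothetical data (the column names, the sizes, and the intermediate names X_rest and y_rest are made up for illustration). With test_size=0.4 followed by test_size=0.5, the result is 60% train, 20% test and 20% validation:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and target.
X = pd.DataFrame({"a": range(20), "b": range(20, 40)})
y = pd.Series([0, 1] * 10, name="label")

# 60% train, then the remaining 40% is split evenly.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_test, X_validate, y_test, y_validate = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Rows keep their original index, so X and y line up when concatenated.
train = pd.concat([X_train, y_train], axis=1)
validate = pd.concat([X_validate, y_validate], axis=1)
test = pd.concat([X_test, y_test], axis=1)
```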
I converted two columns of a pandas dataframe into numpy arrays to use as the features and labels for a machine learning problem.
Code:
splitter = ShuffleSplit(n_splits=1, train_size=0.2, test_size=0.80, random_state=42)
train_index, test_index = next(splitter.split(X))
features_train, features_test = X[train_index], X[test_index]
labels_train, labels_test = labels[train_index], labels[test_index]
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features)
print(pred)
Features is currently an array of frequency counts (I used a CountVectorizer earlier to fit and transform my original pandas dataframe column). I have the full list of labels stored as pred, but I would like the corresponding feature to each label, so that I may return the list of labels to my pandas dataframe.
The ordering of the predictions is the same as that of the data you passed in. (As @Ulf pointed out, you are using the term "feature" incorrectly here: a feature is a column of your matrix, a particular object that you are counting using CountVectorizer; rows are observations, samples, data points, and this is what you currently call features.) Thus, to see sample-label pairs you can simply zip them together:
pred = clf.predict(features)
for sample, label in zip(features, pred):
    print(sample, label)
If you actually want to recover what each column means, your CountVectorizer is your guy. Somewhere in your code you created it
vectorizer = CountVectorizer( ... )
and later used it
... = vectorizer.fit_transform( ... )
now you can use it to transform your samples back through
pred = clf.predict(features)
for sample, label in zip(features, pred):
    print(vectorizer.inverse_transform(np.array([sample])), label)
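Since the prediction order matches the row order, the labels can also be written straight back into the dataframe. A small self-contained sketch, assuming a dataframe with a text column and a label column (the data and column names here are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataframe with a text column and a label column.
df = pd.DataFrame({
    "text": ["good movie", "bad movie", "good plot", "bad acting"],
    "label": ["pos", "neg", "pos", "neg"],
})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["text"])  # one row per dataframe row

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, df["label"])

# Row i of the predictions corresponds to row i of X, and so to row i of df.
df["predicted"] = clf.predict(X)

# Tokens present in each sample, recovered from the count matrix.
df["tokens"] = [list(t) for t in vectorizer.inverse_transform(X)]
```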