For a machine learning project, I'm trying to predict a categorical outcome variable using features extracted from text.
Using cross-validation, I split my X and Y into a training set and a test set. The training set is fitted using a pipeline. However, when I compute the performance using X from my test set, my score is 0.0. At that point, no features have yet been extracted from X_test.
Is it possible to split the dataset within the pipeline?
My code:
X, Y = read_data('development2.csv')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
train_pipeline = Pipeline([('vect', CountVectorizer()), #ngram_range=(1,2), analyzer='word'
('tfidf', TfidfTransformer(use_idf=False)),
('clf', OneVsRestClassifier(SVC(kernel='linear', probability=True))),
])
train_pipeline.fit(X_train, Y_train)
predicted = train_pipeline.predict(X_test)
print accuracy_score(Y_test, predicted)
The traceback when using SVC:
File "/Users/Robbert/Documents/pipeline.py", line 62, in <module>
train_pipeline.fit(X_train, Y_train)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
y_ = column_or_1d(y, warn=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)
I solved the problem.
The target variable (Y) did not have the appropriate format: the labels were stored as one-hot rows like [[0 0 0 0 1], [0 0 1 0 0]]. I converted them to a flat array of integer labels like [5, 3].
This did the trick for me.
Thanks for all answers.
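For reference, a minimal sketch of that conversion, assuming Y is the one-hot array described above (np.argmax returns 0-based column indices, so the +1 matches the 1-based labels in the example):

import numpy as np

# One-hot rows such as [0 0 0 0 1] become integer class labels.
Y = np.asarray([[0, 0, 0, 0, 1], [0, 0, 1, 0, 0]])
Y_labels = np.argmax(Y, axis=1) + 1  # -> array([5, 3])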
Related
I have the following Python code that I am using after preprocessing the data, where data has two columns: one is the label (either positive or negative) and the other has the tweet texts.
X_train, X_test, y_train, y_test = train_test_split(data['Tweet'], data['Label'], test_size=0.20, random_state=0)
tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)
x_traintf = tf_idf.transform(X_train)
x_testtf = tf_idf.fit_transform(X_test)
x_testtf = tf_idf.transform(X_test)
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(x_traintf, y_train)
y_pred = naive_bayes_classifier.predict(x_testtf)
print(metrics.classification_report(y_test, y_pred, target_names=['pos', 'neg']))
Here is the full error:
Traceback (most recent call last):
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\main.py", line 72, in <module>
naive_bayes_classifier.fit(x_traintf, y_train)
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\naive_bayes.py", line 749, in fit
X, y = self._check_X_y(X, y)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\naive_bayes.py", line 583, in _check_X_y
return self._validate_data(X, y, accept_sparse="csr", reset=reset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\base.py", line 565, in _validate_data
X, y = check_X_y(X, y, **check_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 1122, in check_X_y
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 1144, in _check_y
_assert_all_finite(y, input_name="y", estimator_name=estimator_name)
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 111, in _assert_all_finite
raise ValueError("Input contains NaN")
ValueError: Input contains NaN
I've tried this as well but got similar results:
X_train, X_test, y_train, y_test = train_test_split(all_X, all_y, test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
X_test_tfidf = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_tfidf)
print(metrics.accuracy_score(y_test, y_pred))
edit: I've used data.dropna(inplace=True), and it appears to treat my strings as null because they are in Arabic.
Ok, so first of all, there must be some confusion around what these methods do, because they are being used incorrectly; there is a chance that this is the issue.
The .fit_transform() method fits to the data and then returns the transformed version of it; it is equivalent to calling fit() followed by transform(). Have a look at the documentation for more clarity: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform
Your code for the transformation of the data should be something like the following:
tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)
x_testtf = tf_idf.transform(X_test)
You fit (learn from) the training data only, and transform both the train and the test data with what was learned. Your original code was transforming the train data twice, and it was also learning from the test data and transforming it twice.
Do let me know if this is the issue. It might be that this double fitting and transforming caused problems. Please print the output after using the transformer, to inspect the resulting train and test features.
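Separately, since the traceback complains about NaN in y rather than a tf-idf problem, it may also help to drop rows with missing values before splitting. A minimal sketch, assuming the data frame and column names from your question:

# Drop rows where the tweet text or the label is missing, then split.
data = data.dropna(subset=['Tweet', 'Label'])
X_train, X_test, y_train, y_test = train_test_split(
    data['Tweet'], data['Label'], test_size=0.20, random_state=0)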
I am working on using sklearn's train_test_split to create a training set and testing set of my data.
My script is below:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import neighbors
# function to perform one hot encoding and dropping the original item
# in this case its the part number
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
# read in data from csv
data = pd.read_csv('export2.csv')
# one hot encode the part number
new = encode_and_bind(data, 'PART_NO')
# create the labels, or field we are trying to estimate
label = new['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]
# create the data, or the data that is to be estimated
thedata = new.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# remove the header
thedata = thedata[1:]
print(label.shape)
print(thedata.shape)
# # split into training and testing sets
train_data, train_classes, test_data, test_classes = train_test_split(thedata, label, test_size = 0.3)
# create a knn model
knn = neighbors.KNeighborsRegressor()
# fit it with our data
knn.fit(train_data, train_classes)
Running it, I get the following:
C:\Users\jerry\Desktop>python test.py
(6262,) (6262, 253)
Traceback (most recent call last):
File "test.py", line 37, in <module>
knn.fit(train_data, train_classes)
File "C:\Python367-64\lib\site-packages\sklearn\neighbors\base.py", line 872, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 729, in check_X_y
check_consistent_length(X, y)
File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [4383, 1879]
So, it looks like both my X and Y have the same number of rows (6262) but a different number of columns, even though I thought Y was just supposed to be one column holding the label or value you are trying to predict.
How can I use train_test_split to give me a training and testing dataset that I can use for a KNN Regressor?
You've swapped the outputs of train_test_split, from what I can tell.
The function returns, in order: training features, testing features, training labels, testing labels.
The common naming convention is X_train, X_test, y_train, y_test = ..., where X is the features and y is the targets (labels, or, I'm assuming, "classes" in your code).
You appear to be trying to get it to return X_train, y_train, X_test, y_test instead.
Try this and see if it works for you:
train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size = 0.3)
I am trying to predict y values based on X values. I have an Excel file which records how many siblings and spouses a person has. The file also contains a survival outcome, which is y (1 = survived, 0 = died).
The code snippet below shows how I do this
dataSet = pd.read_excel("TitanicData.xlsx", sheet_name="TitanicData")
dataSet.head()
dataSet.columns
SibSp = dataSet.iloc[:, 6]
Parch = dataSet.iloc[:, 7]
Stack = np.column_stack((SibSp, Parch, SibSp + Parch))
Family = pd.DataFrame(Stack, columns=['SibSp', 'Parch', 'Family'])
X = Family.iloc[:, 2]
y = dataSet.iloc[:, 1]
This now gives me the values I expect: y is a column of 1s and 0s depicting whether the person survived, and X holds the sum of the SibSp and Parch columns.
I then split the data into training and testing sets, which is done like so (updated to show where X_train and X_test derive from):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
However, when I then try to use sklearn.linear_model.LinearRegression, I start getting errors
classifier = LinearRegression()
classifier.fit(X_train, y_train)
classifier.predict(X_test)
ValueError: Expected 2D array, got 1D array instead: array=[ 1 2 0 1 0 0 0 0 4 ...] Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I tried taking a look at similar questions on SO, but the line throwing this exception is
classifier.fit(X_train, y_train)
How can I fit my training values into my classifier?
Update:
print(X_train.shape, y_train.values.reshape(-1,1).shape)
Gives me (534,) (534, 1)
Update to show full debug trace
File "<ipython-input-56-2da0ffaf5447>", line 1, in <module>
train()
File "C:/Users/user/Desktop/dantitanic/AnotherTest.py", line 41, in train
classifier.fit(X_train, y_train)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 458, in fit
y_numeric=True, multi_output=True)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
estimator=estimator)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 552, in check_array
"if it contains a single sample.".format(array))
You need to reshape X_train and X_test before fitting. Since each contains a single feature, the shape has to be (n_samples, 1), as the error message suggests:
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
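Alternatively, you can reshape once before the split. A minimal end-to-end sketch, assuming X and y as defined in the question (single feature, 1D target):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# A single feature must be shaped (n_samples, 1) before fitting.
X_2d = X.values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X_2d, y, test_size=0.4, random_state=101)

classifier = LinearRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)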
I'm trying to calculate the ROC and AUC for an SVM model I'm building. I'm following the code from this sklearn example. One of the requirements is that the output labels y need to be binarized. I do this by creating a MultiLabelBinarizer and encoding all the labels, which works fine. However, this creates an (n_samples, n_classes) ndarray, while the classifier.fit(X, y) function assumes y.shape = (n_samples,). I want to essentially "smush" the columns of y together, so that y[0] would return the entire label vector V instead of just its first value.
Here's my code:
enc = MultiLabelBinarizer()
print("Encoding data...")
# Fit the encoder onto all possible data values
print(pandas.DataFrame(enc.fit_transform(df["present"] + df["member"].apply(str).apply(lambda x: [x])),
columns=enc.classes_, index=df.index))
X, y = enc.transform(df["present"]), list(df["member"].apply(str))
print("Training svm...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
y_train = enc.transform([[x] for x in y_train]) # Strings to 1HotVectors
svc = svm.SVC(C=1.1, kernel="linear", probability=True, class_weight='balanced')
svc.fit(X_train, y_train) # y_train should be 1D but isn't
The exception I get is:
Traceback (most recent call last):
File "C:/Users/SawyerPC/PycharmProjects/DiscordSocialGraph/encode_and_train.py", line 129, in <module>
enc, clf, split_data = encode_and_train(df)
File "C:/Users/SawyerPC/PycharmProjects/DiscordSocialGraph/encode_and_train.py", line 57, in encode_and_train
svc.fit(X_train, y_train) # TODO y_train needs to be flattened to (n_samples,)
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\svm\base.py", line 149, in fit
X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 547, in check_X_y
y = column_or_1d(y, warn=True)
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (5000, 10)
I ended up fixing this by using a LabelEncoder. Thanks @G.Anderson. The flat_member_list is just a list of all the unique user ids encountered, both in the labels y and in the vectors X.
# Encode "present" users as OneHotVectors
mlb = MultiLabelBinarizer()
print("Encoding data...")
mlb.fit(df["present"] + df["member"].apply(str).apply(lambda x: [x]))
# Encode user labels as ints
enc = LabelEncoder()
flat_member_list = df["member"].apply(str).append(pandas.Series(np.concatenate(df["present"]).ravel()))
enc.fit(flat_member_list)
X, y = mlb.transform(df["present"]), enc.transform(df["member"].apply(str))
print("Training svm...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0, stratify=y)
svc = svm.SVC(C=0.317, kernel="linear", probability=True)
svc.fit(X_train, y_train)
I'm performing a multi-class classification task using scikit-learn. In the setup I created, I want to compare different classification algorithms.
I use a pipeline where text is inserted as X and Y is the class (multi-class, N = 5). Textual features are extracted in the pipeline using TfidfVectorizer().
KNN does the job, but other classifiers give this: ValueError: bad input shape (670, 5)
Full traceback:
"/Users/Robbert/pipeline.py", line 62, in <module>
train_pipeline.fit(X_train, Y_train)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
y_ = column_or_1d(y, warn=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)
The code I use:
def read_data(f):
    data = []
    for row in csv.reader(open(f), delimiter=';'):
        if row:
            plottext = row[8]
            target = {'Age': row[4]}
            data.append((plottext, target))
    (X, Ycat) = zip(*data)
    Y = DictVectorizer().fit_transform(Ycat)
    Y = preprocessing.LabelBinarizer().fit_transform(Y)
    return (X, Y)
X, Y = read_data('development2.csv')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
###KNN Pipeline
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', KNeighborsClassifier(n_neighbors=350, weights='uniform'))])
###Logistic regression Pipeline
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', LogisticRegression())])
##SVC
train_pipeline = Pipeline([
('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
('clf', SVC(C=1, kernel='rbf', gamma=0.001, probability=True))])
##Decision tree
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', DecisionTreeClassifier(random_state=0))])
train_pipeline.fit(X_train, Y_train)
predicted = train_pipeline.predict(X_test)
print accuracy_score(Y_test, predicted)
How is it possible that KNN accepts the shape of the array and other classifiers don't? And how to change this shape?
If you compare the documentation for the fit(X, y) method in KNeighborsClassifier and SVC, you will see that only the former accepts y in the form [n_samples, n_outputs].
Possible solution: why do you need LabelBinarizer at all? Just do not use it.
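If you drop the binarization, read_data can return the raw class labels directly, which gives y the flat (n_samples,) shape that column_or_1d expects. A minimal sketch, reusing the column indices from your code:

def read_data(f):
    data = []
    for row in csv.reader(open(f), delimiter=';'):
        if row:
            # Keep the raw Age label as a plain string instead of binarizing it.
            data.append((row[8], row[4]))
    (X, Y) = zip(*data)
    return (X, Y)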
If your Y matrix is of size (n_samples, n_classes) and contains at least a single row with more than one non-zero element, then you are solving a multi-label classification problem. If that is the case, the multiclass and multilabel algorithms page in the scikit-learn docs lists KNN as one of the classifiers that support multi-label classification, and you might want to try the other classifiers from that list (a sketch using one of them follows the list):
* sklearn.tree.DecisionTreeClassifier
* sklearn.tree.ExtraTreeClassifier
* sklearn.ensemble.ExtraTreesClassifier
* sklearn.neural_network.MLPClassifier
* sklearn.neighbors.RadiusNeighborsClassifier
* sklearn.ensemble.RandomForestClassifier
* sklearn.linear_model.RidgeClassifierCV
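For instance, a minimal sketch that swaps the SVC step for RandomForestClassifier from that list and keeps the binarized Y_train. Note that, depending on your scikit-learn version, tree-based models may require dense input, in which case convert the tf-idf matrix with .toarray() first:

from sklearn.ensemble import RandomForestClassifier

# Same pipeline as in the question, with a multi-label-capable classifier.
train_pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=0))])
train_pipeline.fit(X_train, Y_train)
predicted = train_pipeline.predict(X_test)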