train_test_split producing inconsistent samples - python

I am working on using sklearn's train_test_split to create a training set and testing set of my data.
My script is below:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import neighbors
# function to perform one-hot encoding and drop the original column
# (in this case, the part number)
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
# read in data from csv
data = pd.read_csv('export2.csv')
# one hot encode the part number
new = encode_and_bind(data, 'PART_NO')
# create the labels, or field we are trying to estimate
label = new['TOTAL_DAYS_TO_COMPLETE']
# drop the first row (note: [1:] removes a data row, not the header)
label = label[1:]
# create the data, or the data that is to be estimated
thedata = new.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# drop the first row here too (same note as above)
thedata = thedata[1:]
print(label.shape)
print(thedata.shape)
# split into training and testing sets
train_data, train_classes, test_data, test_classes = train_test_split(thedata, label, test_size = 0.3)
# create a knn model
knn = neighbors.KNeighborsRegressor()
# fit it with our data
knn.fit(train_data, train_classes)
Running it, I get the following:
C:\Users\jerry\Desktop>python test.py
(6262,)
(6262, 253)
Traceback (most recent call last):
  File "test.py", line 37, in <module>
    knn.fit(train_data, train_classes)
  File "C:\Python367-64\lib\site-packages\sklearn\neighbors\base.py", line 872, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 729, in check_X_y
    check_consistent_length(X, y)
  File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [4383, 1879]
So it looks like my X and Y have the same number of rows (6262) but a different number of columns, which I expected, since Y is supposed to be just the one column holding the label or value you are trying to predict.
How can I use train_test_split to give me a training and testing dataset that I can use for a KNN Regressor?

You've switched the outputs of train_test_split, from what I can tell.
The function returns, in order: Training features, Testing features, Training labels, Testing labels.
The common naming convention is X_train, X_test, y_train, y_test = ..., where X is the features (the columns) and y is the target (the labels, or what your code calls "classes").
You appear to be expecting it to return X_train, y_train, X_test, y_test instead.
Try this and see if it works for you:
train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size = 0.3)
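As a quick sanity check (a minimal sketch reusing the thedata and label variables from your script), the row counts of each features/labels pair should now match:
from sklearn.model_selection import train_test_split
train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size=0.3)
# each split's features and labels must have the same number of rows
print(train_data.shape, train_classes.shape)  # e.g. (4383, 253) (4383,)
print(test_data.shape, test_classes.shape)    # e.g. (1879, 253) (1879,)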

Related

Tf-IDF vectorized data won't work with naive bayes classifier

I have the following Python code that I am using after preprocessing the data, where data has two columns: one is the label (either positive or negative) and the other has the tweet text.
X_train, X_test, y_train, y_test = train_test_split(data['Tweet'], data['Label'], test_size=0.20, random_state=0)
tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)
x_traintf = tf_idf.transform(X_train)
x_testtf = tf_idf.fit_transform(X_test)
x_testtf = tf_idf.transform(X_test)
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(x_traintf, y_train)
y_pred = naive_bayes_classifier.predict(x_testtf)
print(metrics.classification_report(y_test, y_pred, target_names=['pos', 'neg']))
Here is the full error:
Traceback (most recent call last):
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\main.py", line 72, in <module>
naive_bayes_classifier.fit(x_traintf, y_train)
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\naive_bayes.py", line 749, in fit
X, y = self._check_X_y(X, y)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\naive_bayes.py", line 583, in _check_X_y
return self._validate_data(X, y, accept_sparse="csr", reset=reset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\base.py", line 565, in _validate_data
X, y = check_X_y(X, y, **check_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 1122, in check_X_y
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 1144, in _check_y
_assert_all_finite(y, input_name="y", estimator_name=estimator_name)
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 111, in _assert_all_finite
raise ValueError("Input contains NaN")
ValueError: Input contains NaN
I've tried this as well but got similar results:
X_train, X_test, y_train, y_test = train_test_split(all_X, all_y, test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
X_test_tfidf = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_tfidf)
print(metrics.accuracy_score(y_test, y_pred))
edit: I've used data.dropna(inplace=True) and it appears to think that my strings are null because they are in Arabic.
Ok, so first of all, there must be some confusion around what these methods do, because they are being used incorrectly; there's a chance that this is the issue.
The .fit_transform() method fits the vectorizer to the data and returns the transformed data; it is equivalent to calling fit() and then transform(). Have a look at the documentation for more clarity: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform
Your code, for the transformation of the data, should be something like the following:
tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)
x_testtf = tf_idf.transform(X_test)
You fit (learn the vocabulary) on the training data only, then transform both the training and the test data with that fitted vectorizer. Your original code transformed the training data twice, and also re-fit the vectorizer on the test data before transforming it twice as well, which puts the test matrix in a different, incompatible feature space.
Do let me know if this is the issue; the double fitting and transforming may well be the cause. Print the output after using the transformer, to inspect the resulting train and test feature matrices.
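Separately, since the traceback points at NaN in y, it may also be worth dropping rows with missing labels before splitting. A minimal sketch, assuming the column names 'Tweet' and 'Label' from your snippet:
from sklearn.model_selection import train_test_split
# drop rows where either the tweet or the label is missing,
# so neither X nor y contains NaN after the split
data = data.dropna(subset=['Tweet', 'Label'])
X_train, X_test, y_train, y_test = train_test_split(data['Tweet'], data['Label'], test_size=0.20, random_state=0)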

Scikit-learn - What am I predicting?

My aim is to predict between five and six numbers in an array, based on CSV data with six columns. The script below is supposed to predict only one number, from an array of 5. I assumed I could work my way up to the full 5 or 6 from there, but I might be wrong about that.
Minimal reproducible example:
import csv
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('subdata.csv')
ft = [9,8,15,4,6]
fintest = np.array(ft)
def train():
    df.astype(np.float64)
    df.drop(['One'], axis = 1)
    X = df
    y = X['One']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(X_train)
    test_scaled = scaler.transform(X_test)
    tree_model = DecisionTreeRegressor()
    rf_model = RandomForestRegressor()
    tree_model.fit(train_scaled, y_train)
    rf_model.fit(train_scaled, y_train)
    rfp = rf_model.predict(fintest.reshape(1, -1))
    tmp = tree_model.predict(fintest.reshape(1, -1))
    print(rfp)
    print(tmp)
train()
Could you please clarify, what I am asking this script to predict in the final rfp and tmp lines?
My data looks like this: (screenshot of the CSV omitted)
The script as-is currently gives an error:
Traceback (most recent call last):
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 43, in <module>
train()
File "C:\Users\conra\Desktop\Code\lotto\pie.py", line 37, in train
rfp = rf_model.predict(fintest.reshape(1, -1))
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 784, in predict
X = self._validate_X_predict(X)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\ensemble\_forest.py", line 422, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\tree\_classes.py", line 402, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 437, in _validate_data
self._check_n_features(X, reset=reset)
File "C:\Users\conra\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py", line 365, in _check_n_features
raise ValueError(
ValueError: X has 5 features, but DecisionTreeRegressor is expecting 6 features as input.
By adding a sixth number to the ft array I can get around this error, but I then receive wildly inaccurate outputs that appear to have no correlation with the data whatsoever. For example, setting ft to [9,8,15,4,6,2], which is the first row in the CSV file, and setting X and y to use the 'Four' label, I get outputs of [37.22] and [37.].
My other questions will probably be answered by my first. But here they are:
Could you also please clarify why I need to pass an array of 6?
And why are my predictions so close together (all ~35), no matter what array I pass for the prediction?
The way you defined your X is wrong: it contains all 6 features. Your y is also still contained in your X the way you defined it:
X = df        # 6 features, 'One' included
y = X['One']  # 1 feature
I think what you wanted to do was something like this :
X = df[['Two', 'Three', 'Four', 'Five', 'Zero']]
y = df['One']
It depends on your data, and from what I can see your data is an example without context, so training two different models to predict the 'One' column doesn't make much sense to me.
The error itself arises because df.drop(['One'], axis=1) does not modify df in place (its result is discarded), so X keeps all 6 columns while your fintest array has only 5.
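Putting that together, a minimal corrected sketch (reusing df, fintest, and the imports from the question; column names as above):
X = df.drop(['One'], axis=1)   # drop returns a new frame; 5 feature columns remain
y = df['One']                  # the single target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train)
test_scaled = scaler.transform(X_test)
rf_model = RandomForestRegressor()
rf_model.fit(train_scaled, y_train)
# the prediction input must also have 5 features and be scaled the same way
# (the original script predicted on unscaled input, which also skews results)
rfp = rf_model.predict(scaler.transform(fintest.reshape(1, -1)))
print(rfp)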

ValueError: Found input variables with inconsistent numbers of samples: [29675, 9574, 29675]

I am a beginner in ML. The problem is that I have the training and test data in different files, and they are of different lengths, due to which I am getting the following error:
Traceback (most recent call last):
  File "C:/Users/Ellen/Desktop/Python/ML_4.py", line 35, in <module>
    X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
  File "C:\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py", line 2184, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 260, in indexable
    check_consistent_length(*result)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 235, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [29675, 9574, 29675]
I don't know how to resolve these errors. Below is my code:
tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels = tweets_train.iloc[:, 0].values
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()
tweets_test = pd.read_csv('DataF1.csv')
features_test = tweets_test.iloc[:, 1].values.astype('U')
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_test = vectorizer.fit_transform(features_test).toarray()
X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
#regr.fit(X_train, y_train)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
The line producing the error is:
X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
processed_features_train.shape produces (29675, 28148), whereas processed_features_test.shape produces (9574, 11526).
The sample data is as follows (the first column is 'labels' and the second column is 'text'):
neutral tap to explore the biggest change to world wars since world war
neutral tap to explore the biggest change to sliced bread.
negative apple blocked
neutral apple applesupport can i have a yawning emoji ? i think i am asking for the 3rd or 5th time
neutral apple made with 20 more child labor
negative apple is not she the one who said she hates americans ?
There are only 3 labels (Positive, Negative, Neutral) in both the train and test data files.
Since your test set is in a separate file, there's no need to split the data (unless you want a validation set, or unless the test set is unlabelled, as in competitions).
You shouldn't fit a new vectorizer on the test data; doing so means there is no connection between the columns of the training and test matrices. Instead, use vectorizer.transform(features_test), with the same vectorizer object that you fit_transformed the training data with.
So, try:
tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels_train = tweets_train.iloc[:, 0].values
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()
tweets_test = pd.read_csv('DataF1.csv')
features_test= tweets_test.iloc[:, 1].values.astype('U')
labels_test = tweets_test.iloc[:, 0].values
processed_features_test = vectorizer.transform(features_test).toarray()
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(processed_features_train, labels_train)
predictions = text_classifier.predict(processed_features_test)
print(confusion_matrix(labels_test,predictions))
print(classification_report(labels_test,predictions))
Make sure the number of samples is equal to the number of labels
I had the same error, and found it was because the number of samples was not equal to the number of labels.
More specifically, I had this code:
clf = MultinomialNB().fit(X_train, Y_train)
and the size of X_train was not equal to that of Y_train.
I then reviewed my code and fixed the mistake.
It's because you're passing three datasets into train_test_split, instead of just X and y as its arguments.
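For context, train_test_split splits every array you pass it (they must all have the same number of rows) and returns a train/test pair for each. A minimal sketch with made-up toy arrays:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 labels, matching X's row count
# one input array yields two outputs, two arrays yield four, and so on
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)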

How may I un-encode the features from a decision tree to see the important features?

I have a dataset that I am working with. I am converting its features from categorical to numerical for my decision tree. The conversion happens on the entire data frame with the following lines:
le = LE()
df = df.apply(le.fit_transform)
I later take this data and split it into training and testing data with the following:
target = ['label']
df_y = df['label']
df_x = df.drop(target, axis=1)
# Split into training and testing data
train_x, test_x, train_y, test_y = tts(df_x, df_y, test_size=0.3, random_state=42)
Then I am passing it to a method to train a decision tree:
def Decision_Tree_Classifier(train_x, train_y, test_x, test_y, le):
    print " - Candidate: Decision Tree Classifier"
    dec_tree_classifier = DecisionTreeClassifier(random_state=0)  # Load Module
    dec_tree_classifier.fit(train_x, train_y)  # Fit
    accuracy = dec_tree_classifier.score(test_x, test_y)  # Acc
    predicted = dec_tree_classifier.predict(test_x)
    mse = mean_squared_error(test_y, predicted)
    tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))
    print "Tree Features:"
    print tree_feat
    print "Tree Thresholds:"
    print dec_tree_classifier.tree_.threshold
    scores = cross_val_score(dec_tree_classifier, test_x, test_y.values.ravel(), cv=10)
    return (accuracy, mse, scores.mean(), scores.std())
In the above method, I am passing the LabelEncoder object originally used to encode the dataframe. I have the line
tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))
to try to convert the features back to their original categorical representation, but I keep getting this stack trace error:
File "<ipython-input-6-c2005f8661bc>", line 1, in <module>
runfile('main.py', wdir='/Users/mydir)
File "/Users/me/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
execfile(filename, namespace)
File "/Users/me/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 100, in execfile
builtins.execfile(filename, *where)
File "/Users/me/mydir/main.py", line 125, in <module>
main() # Run main routine
File "candidates.py", line 175, in get_baseline
dec_tre_acc = Decision_Tree_Classifier(train_x, train_y, test_x, test_y, le)
File "candidates.py", line 40, in Decision_Tree_Classifier
tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))
File "/Users/me/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 281, in inverse_transform
"y contains previously unseen labels: %s" % str(diff))
ValueError: y contains previously unseen labels: [-2]
What do I need to change to be able to look at the actual features themselves?
When you do this:
df = df.apply(le.fit_transform)
you are using a single LabelEncoder instance for all of your columns. Each time fit() or fit_transform() is called, le forgets the previous data and learns only the current data. So the le you have only stores information about the last column it saw, not all columns.
There are multiple ways to solve this:
You can maintain multiple LabelEncoder objects (one for each column). See this excellent answer here:
Label encoding across multiple columns in scikit-learn
from collections import defaultdict
d = defaultdict(LabelEncoder)
df = df.apply(lambda x: d[x.name].fit_transform(x))
If you want to keep a single object to handle all columns, you can use the OrdinalEncoder if you have the latest version of scikit-learn installed.
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
df = enc.fit_transform(df)
But this still will not solve the error, because tree_.feature does not correspond to values of the features; it holds the index (the column in df) that was used for splitting at each node. So if you have 3 features (columns) in the data, irrespective of the values in those columns, tree_.feature can contain the values:
0, 1, 2, -2
-2 is a special placeholder value denoting that the node is a leaf, so no feature is used to split anything there.
tree_.threshold will contain the split thresholds, which do correspond to the (encoded) values of your data. But those are floats, so you will have to map them back according to your categories-to-numbers conversion.
See this example for understanding the tree structure in detail:
https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html
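To illustrate, a minimal sketch on a hypothetical toy frame (not the asker's data) that maps the node feature indices back to column names instead of calling inverse_transform:
import pandas as pd
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
# hypothetical toy frame: two categorical feature columns and a label
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                   'size': ['S', 'L', 'M', 'L'],
                   'label': ['yes', 'no', 'yes', 'no']})
d = defaultdict(LabelEncoder)
enc = df.apply(lambda x: d[x.name].fit_transform(x))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(enc[['color', 'size']], enc['label'])
cols = ['color', 'size']
# tree_.feature holds column indices; -2 marks a leaf node
print([cols[i] if i >= 0 else 'LEAF' for i in clf.tree_.feature])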

Cross validation and pipeline in sci-kit learn

For a machine learning project, I'm trying to predict a categorical outcome variable using features extracted from text.
Using cross validation, I split my X and Y into a test set and a training set. The training set is trained using a pipeline. However, when I compute the performance using X from my test set, my performance is 0.0. This happens while no features have been extracted from X_test yet.
Is it possible to split the dataset within the pipeline?
My code:
X, Y = read_data('development2.csv')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
train_pipeline = Pipeline([('vect', CountVectorizer()), #ngram_range=(1,2), analyzer='word'
('tfidf', TfidfTransformer(use_idf=False)),
('clf', OneVsRestClassifier(SVC(kernel='linear', probability=True))),
])
train_pipeline.fit(X_train, Y_train)
predicted = train_pipeline.predict(X_test)
print accuracy_score(Y_test, predicted)
The traceback when using SVC:
File "/Users/Robbert/Documents/pipeline.py", line 62, in <module>
train_pipeline.fit(X_train, Y_train)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
y_ = column_or_1d(y, warn=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)
I solved the problem.
The target variable (Y) did not have the appropriate format. The labels were stored one-hot encoded, like this: [[0 0 0 0 1], [0 0 1 0 0]]. I converted them to a flat array of class labels instead, like this: [5, 3].
This did the trick for me.
Thanks for all answers.
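For reference, a minimal sketch of that conversion (assuming one-hot rows and, matching the numbers above, 1-based class labels):
import numpy as np
Y = np.array([[0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0]])
# argmax finds the position of the 1 in each row; +1 makes the classes 1-based
y = Y.argmax(axis=1) + 1
print(y)  # [5 3]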
