I am trying to predict y values based on X values. I have an Excel file which lists how many siblings and spouses each person has. The file also contains the survival outcome, which is y (1 = Survived, 0 = Died).
The code snippet below shows how I do this:
dataSet = pd.read_excel("TitanicData.xlsx", sheet_name="TitanicData")
dataSet.head()
dataSet.columns
SibSp = dataSet.iloc[:, 6]
Parch = dataSet.iloc[:, 7]
Stack = np.column_stack((SibSp, Parch, SibSp + Parch))
Family = pd.DataFrame(Stack, columns=['SibSp', 'Parch', 'Family'])
X = Family.iloc[:, 2]
y = dataSet.iloc[:, 1]
This gives me the values I expect: y is a column of 1s and 0s indicating whether the person survived, and X holds the sum of the SibSp and Parch columns.
I then split the data into training and testing sets like so (updated to show where X_train and X_test come from):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
However, when I then try to use sklearn.linear_model.LinearRegression, I start getting errors
classifier = LinearRegression()
classifier.fit(X_train, y_train)
classifier.predict(X_test)
ValueError: Expected 2D array, got 1D array instead: array=[ 1 2 0 1 0 0 0 0 4 ...] Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I tried taking a look at similar questions on SO but the line throwing this exception is
classifier.fit(X_train, y_train)
How can I fit my training values into my classifier?
Update:
print(X_train.shape, y_train.values.reshape(-1,1).shape)
Gives me (534,) (534, 1)
Update to show full debug trace
File "<ipython-input-56-2da0ffaf5447>", line 1, in <module>
train()
File "C:/Users/user/Desktop/dantitanic/AnotherTest.py", line 41, in train
classifier.fit(X_train, y_train)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 458, in fit
y_numeric=True, multi_output=True)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
estimator=estimator)
File "C:\Users\user\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 552, in check_array
"if it contains a single sample.".format(array))
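You need to reshape X_train and X_test into 2D arrays with a single column before fitting, because your data has a single feature. Since they are pandas Series, go through .values first:
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)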
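For a single feature, the variant the error message calls array.reshape(-1, 1) is the one that applies: it turns shape (n_samples,) into (n_samples, 1). Here is a minimal sketch of that, assuming made-up values in place of the real Family and Survived columns:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the single "Family" feature and the survival column.
X = pd.Series([1, 2, 0, 1, 0, 0, 4, 3], name="Family")
y = pd.Series([1, 1, 0, 1, 0, 0, 0, 1], name="Survived")

# A Series is 1D, shape (n_samples,). Estimators expect (n_samples, n_features).
X_2d = X.to_numpy().reshape(-1, 1)

model = LinearRegression()
model.fit(X_2d, y)
predictions = model.predict(X_2d)

print(X.shape, X_2d.shape, predictions.shape)
```

Keeping the feature as a one-column DataFrame (for example with X.to_frame()) should work as well, since a DataFrame is already 2D.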
I have the following Python code that I run after preprocessing the data, where data has two columns: one is the label, either positive or negative, and the other holds the tweet text.
X_train, X_test, y_train, y_test = train_test_split(data['Tweet'], data['Label'], test_size=0.20, random_state=0)
tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)
x_traintf = tf_idf.transform(X_train)
x_testtf = tf_idf.fit_transform(X_test)
x_testtf = tf_idf.transform(X_test)
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(x_traintf, y_train)
y_pred = naive_bayes_classifier.predict(x_testtf)
print(metrics.classification_report(y_test, y_pred, target_names=['pos', 'neg']))
Here is the full error:
Traceback (most recent call last):
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\main.py", line 72, in <module>
naive_bayes_classifier.fit(x_traintf, y_train)
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\naive_bayes.py", line 749, in fit
X, y = self._check_X_y(X, y)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\naive_bayes.py", line 583, in _check_X_y
return self._validate_data(X, y, accept_sparse="csr", reset=reset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\base.py", line 565, in _validate_data
X, y = check_X_y(X, y, **check_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 1122, in check_X_y
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 1144, in _check_y
_assert_all_finite(y, input_name="y", estimator_name=estimator_name)
File "C:\Users\Lenovo\PycharmProjects\pythonProject1\venv\Lib\site-packages\sklearn\utils\validation.py", line 111, in _assert_all_finite
raise ValueError("Input contains NaN")
ValueError: Input contains NaN
I've tried this as well but got similar results:
X_train, X_test, y_train, y_test = train_test_split(all_X, all_y, test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
X_test_tfidf = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_tfidf)
print(metrics.accuracy_score(y_test, y_pred))
edit: I've used data.dropna(inplace=True), and it appears to treat my strings as null because they are in Arabic.
OK, so first of all, there seems to be some confusion about what these methods do, because they are being used incorrectly. There is a chance that this is the issue.
The .fit_transform() method fits to the data and transforms it in a single call; it is like calling fit() and then transform(). Have a look at the documentation for more clarity: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform
Your code, for the transformation of the data, should be something like the following:
tf_idf = TfidfVectorizer()
x_traintf = tf_idf.fit_transform(X_train)
x_testtf = tf_idf.transform(X_test)
You fit (learn) only on the training data, then transform both the training and the test data with what was learned. Your original code was transforming the training data twice, then fitting again on the test data and transforming that twice as well.
Do let me know if this is the issue. It might be that this double transformation caused problems. Please print the output of the transformer so we can see the resulting train and test features.
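One more thing worth checking: the traceback above is raised from _check_y, so the NaN is reported for the labels rather than for the TF-IDF features. A rough sketch of the kind of inspection that could help, assuming a made-up DataFrame with the same Tweet and Label column names (the None entry is there only to imitate a label that failed to load):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data; the None label imitates a row whose label is missing.
data = pd.DataFrame({
    "Tweet": ["good day", "bad day", "good good film", "bad film"],
    "Label": ["pos", "neg", None, "neg"],
})

# Count missing labels before splitting or fitting anything.
missing_labels = int(data["Label"].isna().sum())
print(missing_labels)

# Drop rows where either the text or the label is missing.
clean = data.dropna(subset=["Tweet", "Label"])

tf_idf = TfidfVectorizer()
features = tf_idf.fit_transform(clean["Tweet"])

# The number of feature rows should equal the number of labels.
print(features.shape, len(clean["Label"]))
```

If the count of missing labels is not zero on the real data, looking at those rows directly (data[data["Label"].isna()]) should show whether the problem comes from how the file was read.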
I am a beginner in ML. My training and test data are in different files and have different lengths, and because of that I am getting the following error:
Traceback (most recent call last):
File "C:/Users/Ellen/Desktop/Python/ML_4.py", line 35, in <module>
X_train, X_test, y_train, y_test =
train_test_split(processed_features_train, processed_features_test,
labels, test_size=1, random_state=0)
File "C:\Python\Python37\lib\site-
packages\sklearn\model_selection\_split.py", line 2184, in
train_test_split
arrays = indexable(*arrays)
File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py",
line 260, in indexable
check_consistent_length(*result)
File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py",
line 235, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples:
[29675, 9574, 29675]
I don't know how to resolve these errors. Below is my code:
tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels= tweets_train.iloc[:, 0].values
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()
tweets_test = pd.read_csv('DataF1.csv')
features_test= tweets_test.iloc[:, 1].values.astype('U')
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_test = vectorizer.fit_transform(features_test).toarray()
X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
#regr.fit(X_train, y_train)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
The line producing the error is:
X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
processed_features_train.shape is (29675, 28148), whereas processed_features_test.shape is (9574, 11526).
The sample data is as follows-(First column is 'labels' and the second column is 'text')
neutral tap to explore the biggest change to world wars since world war
neutral tap to explore the biggest change to sliced bread.
negative apple blocked
neutral apple applesupport can i have a yawning emoji ? i think i am
asking for the 3rd or 5th time
neutral apple made with 20 more child labor
negative apple is not she the one who said she hates americans ?
There are only 3 labels (Positive, Negative, Neutral) in train data file and test data file.
Since your test set is in a separate file, there's no need to split the data (unless you want a validation set, or the test set is unlabelled, as in competitions).
You shouldn't fit a new vectorizer on the test data; doing so means there is no connection between the columns in the training and testing sets. Instead, use vectorizer.transform(features_test), with the same vectorizer object that you called fit_transform on for the training data.
So, try:
tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels_train = tweets_train.iloc[:, 0].values
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()
tweets_test = pd.read_csv('DataF1.csv')
features_test= tweets_test.iloc[:, 1].values.astype('U')
labels_test = tweets_test.iloc[:, 0].values
processed_features_test = vectorizer.transform(features_test).toarray()
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(processed_features_train, labels_train)
predictions = text_classifier.predict(processed_features_test)
print(confusion_matrix(labels_test,predictions))
print(classification_report(labels_test,predictions))
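To see why the two fits produced 28148 and 11526 columns, here is a small sketch with made-up texts (loosely based on the sample rows above, and without the stop-word list) comparing two independently fitted vectorizers against a single shared one:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts, only for illustration.
train_texts = ["apple blocked", "apple made with child labor", "tap to explore"]
test_texts = ["apple support", "explore the change"]

# Two independent vectorizers learn two unrelated vocabularies,
# so the column counts (and the meaning of each column) differ.
separate_train = CountVectorizer().fit_transform(train_texts)
separate_test = CountVectorizer().fit_transform(test_texts)

# One vectorizer: the vocabulary comes from the training texts only,
# and the test texts are mapped onto those same columns.
shared = CountVectorizer()
shared_train = shared.fit_transform(train_texts)
shared_test = shared.transform(test_texts)

print(separate_train.shape[1], separate_test.shape[1])
print(shared_train.shape[1], shared_test.shape[1])
```

Words that appear only in the test texts are simply ignored by transform, which is the expected behaviour: the model never saw them during training.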
Make sure the number of samples is equal to the number of labels
I had the same error and found that it was because the number of samples was not equal to the number of labels.
More specifically, I had this code:
clf = MultinomialNB().fit(X_train, Y_train)
And the size of X_train was not equal to Y_train.
Then, I reviewed my code and fixed the mistake.
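A quick way to catch this before calling fit is to run the same length check scikit-learn performs internally. This is only a sketch, assuming dummy arrays whose lengths differ on purpose:

```python
import numpy as np
from sklearn.utils import check_consistent_length

# Dummy data: 5 samples but only 4 labels, mismatched on purpose.
X_train = np.zeros((5, 3))
Y_train = np.zeros(4)

try:
    check_consistent_length(X_train, Y_train)
    lengths_match = True
except ValueError as err:
    # Same "inconsistent numbers of samples" error that fit() would raise.
    lengths_match = False
    print(err)
```

Printing X_train.shape[0] and len(Y_train) side by side gives the same information if you prefer not to import the helper.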
It's because you're passing three datasets into train_test_split instead of just X and y as its arguments.
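To illustrate with toy arrays (the names here are made up): train_test_split returns two pieces per input array, and every input must have the same number of rows. Note also that an integer test_size, such as the test_size=1 in the question, is read as an absolute number of test samples rather than a fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
extra = np.arange(10) * 10  # a third array with the same number of rows

# Three inputs give six outputs (a train and a test part for each input).
parts = train_test_split(X, y, extra, test_size=0.3, random_state=0)
print(len(parts))

# The usual call: two inputs, four outputs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With arrays of 29675, 9574 and 29675 rows the length check fails before any splitting happens, which matches the error shown in the question.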
I am working on using sklearn's train_test_split to create a training set and testing set of my data.
My script is below:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import neighbors
# function to perform one hot encoding and dropping the original item
# in this case its the part number
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
# read in data from csv
data = pd.read_csv('export2.csv')
# one hot encode the part number
new = encode_and_bind(data, 'PART_NO')
# create the labels, or field we are trying to estimate
label = new['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]
# create the data, or the data that is to be estimated
thedata = new.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# remove the header
thedata = thedata[1:]
print(label.shape)
print(thedata.shape)
# # split into training and testing sets
train_data, train_classes, test_data, test_classes = train_test_split(thedata, label, test_size = 0.3)
# create a knn model
knn = neighbors.KNeighborsRegressor()
# fit it with our data
knn.fit(train_data, train_classes)
Running it, I get the following:
C:\Users\jerry\Desktop>python test.py
(6262,)
(6262, 253)
Traceback (most recent call last):
File "test.py", line 37, in <module>
knn.fit(train_data, train_classes)
File "C:\Python367-64\lib\site-packages\sklearn\neighbors\base.py", line 872, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 729, in check_X_y
check_consistent_length(X, y)
File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [4383, 1879]
So, it looks like both my X and Y have the same number of rows (6262), but different # of columns, since I thought Y was just supposed to be one column of the label or value you are trying to predict.
How can I use train_test_split to give me a training and testing dataset that I can use for a KNN Regressor?
You've switched the outputs of train_test_split, from what I can tell.
The function returns, in order: training features, testing features, training labels, testing labels.
The common naming convention is X_train, X_test, y_train, y_test = ..., where X is the features and y is the targets (labels or, I'm assuming, "classes" in your code).
You appear to be trying to get it to return, instead, X_train, y_train, X_test, y_test
Try this and see if it works for you:
train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size = 0.3)
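The numbers in the error fit this explanation: with test_size=0.3 and 6262 rows, the split is 4383 training rows and 1879 testing rows, so the fit call was receiving the training features together with the testing features. A small sketch with toy arrays (not your data) showing the order of the outputs:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for thedata (features) and label (targets).
features = np.arange(40).reshape(10, 4)
targets = np.arange(10)

# Order of outputs: features train, features test, targets train, targets test.
train_data, test_data, train_classes, test_classes = train_test_split(
    features, targets, test_size=0.3, random_state=0
)

# Matching pairs have the same number of rows.
print(train_data.shape, train_classes.shape)
print(test_data.shape, test_classes.shape)
```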
I'm trying to calculate the ROC and AUC for an SVM model I'm building. I'm following the code from this sklearn example. One of the requirements is that the output labels y need to be binarized. I do this by creating a MultiLabelBinarizer and encoding all the labels, which works fine. However, this creates a (n_samples, n_features) ndarray, while the classifier.fit(X, y) function assumes y.shape = (n_samples,). I essentially want to "smush" the columns of y together, so that y[0][0] would return the entire feature vector, V, instead of just the first value of V.
Here's my code:
enc = MultiLabelBinarizer()
print("Encoding data...")
# Fit the encoder onto all possible data values
print(pandas.DataFrame(enc.fit_transform(df["present"] + df["member"].apply(str).apply(lambda x: [x])),
columns=enc.classes_, index=df.index))
X, y = enc.transform(df["present"]), list(df["member"].apply(str))
print("Training svm...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
y_train = enc.transform([[x] for x in y_train]) # Strings to 1HotVectors
svc = svm.SVC(C=1.1, kernel="linear", probability=True, class_weight='balanced')
svc.fit(X_train, y_train) # y_train should be 1D but isn't
The exception I get is:
Traceback (most recent call last):
File "C:/Users/SawyerPC/PycharmProjects/DiscordSocialGraph/encode_and_train.py", line 129, in <module>
enc, clf, split_data = encode_and_train(df)
File "C:/Users/SawyerPC/PycharmProjects/DiscordSocialGraph/encode_and_train.py", line 57, in encode_and_train
svc.fit(X_train, y_train) # TODO y_train needs to be flattened to (n_samples,)
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\svm\base.py", line 149, in fit
X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 547, in check_X_y
y = column_or_1d(y, warn=True)
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (5000, 10)
I ended up fixing this by using a LabelEncoder. Thanks, @G.Anderson. The flat_member_list is just a list of all the user ids encountered in both the labels y and the vectors X.
# Encode "present" users as OneHotVectors
mlb = MultiLabelBinarizer()
print("Encoding data...")
mlb.fit(df["present"] + df["member"].apply(str).apply(lambda x: [x]))
# Encode user labels as ints
enc = LabelEncoder()
# Series.append was removed in pandas 2.0; pandas.concat does the same job
flat_member_list = pandas.concat([df["member"].apply(str), pandas.Series(np.concatenate(df["present"]).ravel())])
enc.fit(flat_member_list)
X, y = mlb.transform(df["present"]), enc.transform(df["member"].apply(str))
print("Training svm...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0, stratify=y)
svc = svm.SVC(C=0.317, kernel="linear", probability=True)
svc.fit(X_train, y_train)
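For anyone landing here with the same "bad input shape" error, this is the idea in isolation: SVC wants y as a 1D array of class labels, and LabelEncoder turns string ids into integers of exactly that shape. A minimal sketch, assuming made-up user ids and feature rows:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Made-up user ids (the labels) and made-up binary feature rows.
member_ids = ["u3", "u1", "u2", "u1", "u3", "u2"]
X = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
])

# Strings become integer class labels with shape (n_samples,).
enc = LabelEncoder()
y = enc.fit_transform(member_ids)
print(y.shape)

svc = SVC(kernel="linear")
svc.fit(X, y)

# Predictions come back as integers; inverse_transform recovers the ids.
decoded = enc.inverse_transform(svc.predict(X))
print(decoded)
```

A one-hot matrix of shape (n_samples, n_classes) carries the same information, but it is the multilabel format, which SVC.fit does not accept directly.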
For a machine learning project, I'm trying to predict a categorical outcome variable using features extracted from text.
Using cross validation, I split my X and Y into a test set and a training set. The training set is fit using a pipeline. However, when I compute the performance using X from my test set, my performance is 0.0. Note that no features have been extracted from X_test at that point.
Is it possible to split the dataset within the pipeline?
My code:
X, Y = read_data('development2.csv')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
train_pipeline = Pipeline([('vect', CountVectorizer()), #ngram_range=(1,2), analyzer='word'
('tfidf', TfidfTransformer(use_idf=False)),
('clf', OneVsRestClassifier(SVC(kernel='linear', probability=True))),
])
train_pipeline.fit(X_train, Y_train)
predicted = train_pipeline.predict(X_test)
print accuracy_score(Y_test, predicted)
The traceback when using SVC:
File "/Users/Robbert/Documents/pipeline.py", line 62, in <module>
train_pipeline.fit(X_train, Y_train)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
y_ = column_or_1d(y, warn=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)
I solved the problem.
The target variable (Y) did not have the appropriate format. The targets were stored as one-hot rows like this: [[0 0 0 0 1],[0 0 1 0 0]]. I converted them to a 1D array of class labels like this: [5, 3].
This did the trick for me.
Thanks for all answers.
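In case it helps someone else, one possible way to do that conversion is NumPy's argmax along the rows. This is a sketch using the two example rows above; the original post does not say how the conversion was actually done.

```python
import numpy as np

# The two one-hot rows from the example above.
Y_one_hot = np.array([
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
])

# argmax gives the 0-based position of the 1 in each row.
Y_labels = Y_one_hot.argmax(axis=1)
print(Y_labels)      # [4 2]

# Add 1 if you want 1-based class numbers like [5, 3].
print(Y_labels + 1)  # [5 3]
```

Either numbering works for the classifier, as long as the same one is used for Y_train and Y_test.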