I am getting an error on this line of code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Error: ValueError: Found input variables with inconsistent numbers of samples: [40000, 10000].
It seems that after the vectorization the array size changes and no longer matches y.
Any support in resolving the error would be appreciated. Thanks in advance.
Output:
(10000, 4)
(10000,)
(40000, 1500)
(10000,)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
# Import dataset:
dataset = pd.read_excel(r"C:\Users\HPS1RT\Downloads\test\Safety_Prediction.xlsx", nrows=10000)
dataset[["Safety"]] *= 1
# Assign values to the X and y variables:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
print(X.shape)
print(y.shape)
#vectorization
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7)
X = vectorizer.fit_transform(X.ravel()).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()
print(X.shape)
print(y.shape)
# Split dataset into random train and test subsets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train)
print(y_train)
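The shapes explain the mismatch: X has four columns, so X.ravel() turns the 10,000 rows into 40,000 separate "documents", and the vectorizer therefore returns 40,000 rows while y still has 10,000. The usual fix is to vectorize only the column that actually contains the text. A minimal sketch, assuming the free text lives in a single column (the column name "Description" below is hypothetical; TfidfVectorizer simply combines CountVectorizer and TfidfTransformer):
# Hypothetical column name -- replace "Description" with the actual text column.
from sklearn.feature_extraction.text import TfidfVectorizer
docs = dataset["Description"].astype(str)     # 10000 documents, one per row
y = dataset["Safety"].values                  # 10000 labels
vectorizer = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7)
X = vectorizer.fit_transform(docs).toarray()  # shape (10000, 1500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)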
Related
I am trying to write the code for K-NN.
Below is my code. I know that the issue is in predict(), but I am not able to figure out how to fix it.
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('UniversalBank.csv')
X = dataset.iloc[:, [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13]].values
y = dataset.iloc[:,9].values
#Splitting the dataset to training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state= 0)
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Fitting the classifier to training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train,y_train)
#Predicting the test results
y_pred = classifier.predict(X_test)
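Not part of the original question, but as a quick sanity check of the predictions, the standard sklearn metrics can be applied directly to y_test and y_pred:
from sklearn.metrics import accuracy_score, confusion_matrix
print(accuracy_score(y_test, y_pred))    # fraction of test rows classified correctly
print(confusion_matrix(y_test, y_pred))  # rows = actual classes, columns = predicted classes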
Hey, I am trying to do a simple logistic regression on my data, which is returns (y) versus market indices (x).
import numpy as np
import pandas as pd
from sklearn import metrics
data = pd.read_excel('Datafile.xlsx', index_col=0)
#split dataset into features and target variable
col_features = ['Market Beta','Value','Size','High-Yield Spread','Term Spread','Momentum','Bid-Ask Spread']
target=['Return']
x = data[col_features] #features
y = data[target] #target
#split x and y into training and testing datasets
from sklearn.model_selection import train_test_split
x_train, y_train, x_test, y_test = train_test_split (x, y, test_size = 0.25, random_state = 0)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
y_train = np.argmax(y_train, axis=1)
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
The error I get is
ValueError: Shape of passed values is (39, 1), indices imply (39, 7)
Thank you.
You just mixed up the order of the train_test_split return values, so x_test and y_train got swapped. The proper order is:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
I have a problem training my code using Stochastic Gradient Descent and the MNIST database.
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import SGDClassifier
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
X_train, X_train, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
The error at the end of the process (in my opinion the last line of code is the problem):
ValueError: Found input variables with inconsistent numbers of samples: [10000, 60000]
It's a typo on your side, you are assigning to X_train twice:
X_train, X_train, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Correct answer would be:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
BTW. fetch_mldata will be deprecated soon, it would be a better idea to use:
from sklearn.datasets import fetch_openml
X, y = fetch_openml("mnist_784", version=1, return_X_y=True)
I would suggest using a stratified split between the train and test datasets, because some classes might otherwise have a skewed representation in the training set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
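To verify that the stratification worked (a quick check, not from the original answer), you can compare the class frequencies of the full target and of the training split:
import numpy as np
# Class frequencies in the full target vs. the training split; with
# stratify=y the two distributions should be nearly identical.
print(np.unique(y, return_counts=True))
print(np.unique(y_train, return_counts=True))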
I want to find the best k for k-nearest-neighbors. I am using LeaveOneOut to split my data into train and test sets. In the code below I have 150 data entries, so I get 150 different train/test splits. k should be between 1 and 40.
I want to plot the average cross-validation classification error as a function of k, to see which k is best for KNN.
Here is my code:
import scipy.io as sio
import seaborn as sn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut
error = []
array = np.array(range(1,41))
dataset = pd.read_excel('Data/iris.xls')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
loo = LeaveOneOut()
loo.get_n_splits(X)
for train_index, test_index in loo.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)
    for i in range(1, 41):
        classifier = KNeighborsClassifier(n_neighbors=i)
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        error.append(np.mean(y_pred != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 41), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
You are appending an error value at every single prediction, that's why you end up with 6000 points in your error array (150 splits × 40 values of k). You need to collect the predictions for all points across the folds for a given n_neighbors and then calculate the error once for that value.
You can do this:
# Loop over possible values of "n_neighbors"
for i in range(1, 41):
    # Collect the actual and predicted values for all splits for a single "n_neighbors"
    actual = []
    predicted = []
    for train_index, test_index in loo.split(X):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        classifier = KNeighborsClassifier(n_neighbors=i)
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        # Append the single prediction and actual value here.
        actual.append(y_test[0])
        predicted.append(y_pred[0])
    # Outside the inner loop, calculate the error for this value of n_neighbors.
    error.append(np.mean(np.array(predicted) != np.array(actual)))
Rest of your code is okay.
There is a more compact way to do this if you use cross_val_predict:
from sklearn.model_selection import cross_val_predict
for i in range(1, 41):
    classifier = KNeighborsClassifier(n_neighbors=i)
    y_pred = cross_val_predict(classifier, X, y, cv=loo)
    error.append(np.mean(y_pred != y))
I have a data sample of 750x256.
Rows = 750
Columns = 256
If I split off 20% of my data for the test set, I end up with 600 samples in X_train but only 150 samples in y_train.
The problem then occurs when fitting the DecisionTreeRegressor:
it says that the number of labels in y_train (150) does not match the number of samples (600).
But if I set test_size to 50%, it works.
Is there a way around this? I don't want to use a 50% test_size.
Any help would be great!
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
#Load the data
dataset = pd.read_csv('new_york.csv')
dataset['Higher'] = dataset['2016-12'].gt(dataset['2016-11']).astype(int)
X = dataset.iloc[:, 6:254].values
y = dataset.iloc[:, 255].values
#Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, :248])
X[:, :248] = imputer.transform(X[:, :248])
#Split the data into train and test sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_test, y_train = train_test_split(X, y, test_size = .2, random_state = 0)
#let's build our first model
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier(max_depth=6)
clf.fit(X_train, y_train)
clf.score(X_train, y_train)
train_test_split() returns X_train, X_test, y_train, y_test; you have y_train and y_test in the wrong order.
With a 50% split this does not raise an error, because y_train and y_test then have the same size (but obviously still hold the wrong values).
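For reference, the corrected line (same call, names just put back in the order train_test_split actually returns them):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 0)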