ValueError: Found input variables with inconsistent numbers of samples: [2935848, 2935849] - python

When I run this code:
feature_names = ["date","shop_id", "item_id", "item_price", "item_cnt_day"]
feature_names
X_train = train[feature_names]
print(X_train.shape)
X_train.head()
X_sales = sales[feature_names]
print(X_sales.shape)
X_sales.head()
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_sales, y_train, y_sales = train_test_split(X_train, X_sales, test_size=0.3)
(2935848, 5)
(2935849, 5)
I get this ValueError:
ValueError Traceback (most recent call last) in
     13 from sklearn.metrics import mean_squared_error
     14
---> 15 X_train, X_sales, y_train, y_sales = train_test_split(X_train, X_sales, test_size=0.3)
     16
~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
   2125         raise TypeError("Invalid parameters passed: %s" % str(options))
   2126
-> 2127     arrays = indexable(*arrays)
   2128
   2129     n_samples = _num_samples(arrays[0])
~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    291     """
    292     result = [_make_indexable(X) for X in iterables]
--> 293     check_consistent_length(*result)
    294     return result
    295
~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    255     if len(uniques) > 1:
    256         raise ValueError("Found input variables with inconsistent numbers of"
--> 257                          " samples: %r" % [int(l) for l in lengths])
    258
    259
ValueError: Found input variables with inconsistent numbers of samples: [2935848, 2935849]

The error occurs because your two DataFrames (train and sales) have different lengths: train has 2935848 rows and sales has 2935849. train_test_split requires every array it receives to have the same number of samples. Find out why the lengths differ, then add or drop a row so that they match.
Second, but no less important, make sure you understand what train_test_split does and what your goal is. The function takes X and y as inputs and returns X_train, X_test, y_train, y_test. In your code you are passing two X matrices (X_train and X_sales) with the same 5 features. If that is intentional, fine, but be aware of it.
X holds all the samples with their features, and y holds the corresponding target values you want to predict. Check that, and consider whether train_test_split is really the function you are looking for.
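As a rough illustration of the X / y convention described above, here is a minimal sketch. The tiny DataFrame is made up, and treating item_cnt_day as the target is only an assumption about what you want to predict. It also reproduces the error by passing arrays of different lengths.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the real `train` DataFrame (10 rows).
train = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "item_id": [10, 11, 10, 12, 13, 11, 10, 14, 12, 13],
    "item_price": [99.0, 149.0, 99.0, 20.0, 5.5, 149.0, 99.0, 30.0, 20.0, 5.5],
    "item_cnt_day": [1, 2, 1, 3, 1, 1, 2, 1, 4, 1],
})

# X: the features. y: the value to predict (assumed here to be item_cnt_day).
X = train[["shop_id", "item_id", "item_price"]]
y = train["item_cnt_day"]

# Both inputs must have the same number of rows.
assert len(X) == len(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Passing inputs of different lengths raises the ValueError from the question.
try:
    train_test_split(X, y.iloc[:-1], test_size=0.3)
except ValueError as exc:
    message = str(exc)
    print(message)
```

Comparing len(train) and len(sales) before the split, as the assert does here, is a cheap way to catch the mismatch early.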

I have this error while I'm trying to do my confusion matrix:
Found input variables with inconsistent numbers of samples: [1527, 1]
This is my code:
x = df[['gender', 'age', 'hypertension', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'work_type_cat', 'gender_cat', 'Residence_type_cat']]
y = df['stroke']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=20)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
scaler = StandardScaler()
x_train_scale = scaler.fit_transform(x_train)
x_test_scale = scaler.fit_transform(x_test)
KNN = KNeighborsClassifier()
x = df[['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'work_type_cat', 'gender_cat', 'Residence_type_cat']]
y = df['stroke']
print(x.head())
print(y.head())
KNN = KNN.fit(x, y)
test = pd.DataFrame()
test['gender'] = [2]
test['age'] = [3]
test['hypertension'] = [0]
test['heart_disease'] = [0]
test['ever_married'] = [2]
test['work_type'] = [4]
test['Residence_type'] = [2]
test['avg_glucose_level'] = [95.12]
test['bmi'] = [18]
test['smoking_status'] = [2]
test['work_type_cat'] = [4]
test['gender_cat'] = [1]
test['Residence_type_cat'] = [1]
y_predict = KNN.predict(test)
print(y_predict)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_predict))
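The two numbers in the message are probably the lengths of the two arguments to confusion_matrix: y_test has 1527 rows, while y_predict holds the prediction for the single hand-built test row. A minimal sketch of the usual pattern, predicting on the held-out x_test so both arrays have the same length, is below. The data comes from make_classification as a stand-in for the stroke DataFrame, and fitting the scaler on the training split only is my own choice, not something taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke data (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=6, random_state=20)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=20
)

# Fit the scaler on the training split, then reuse it on the test split.
scaler = StandardScaler()
x_train_scale = scaler.fit_transform(x_train)
x_test_scale = scaler.transform(x_test)

# Train on the training split and predict every row of the test split.
knn = KNeighborsClassifier().fit(x_train_scale, y_train)
y_predict = knn.predict(x_test_scale)

# y_test and y_predict now have the same length.
cm = confusion_matrix(y_test, y_predict)
print(cm)
```

The single-row prediction can still be made separately. It just cannot be compared against the whole y_test.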


how to change my code to use k fold cross validation with k = 5

I want to change my code so that instead of this part:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, test_size=0.2)
train_data = X_train.copy()
train_data.loc[:, 'target'] = y_train
test_data = X_test.copy()
test_data.loc[:, 'target'] = y_test
data_config = DataConfig(
    # target should always be a list. Multi-targets are only supported for
    # regression. Multi-Task Classification is not implemented
    target=['target'],
    continuous_cols=train_data.columns.tolist(),
    categorical_cols=[],
    normalize_continuous_features=True
)
trainer_config = TrainerConfig(
    auto_lr_find=True,
    batch_size=64,
    max_epochs=10,
)
optimizer_config = {'optimizer':'Adam',
                    'optimizer_params':{'weight_decay': 0, 'amsgrad': False},
                    'lr_scheduler':None, 'lr_scheduler_params':{},
                    'lr_scheduler_monitor_metric':'valid_loss'}
model_config = NodeConfig(
    task="classification",
    num_layers=2,
    num_trees=512,
    learning_rate=1,
    embed_categorical=True,
)
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
tabular_model.fit(train=train_data, test=test_data)
pred = tabular_model.predict(test_data)
pred['prediction'] = pred['prediction'].astype(int)
pred.loc[(pred['prediction'] >= 1 )] = 1
print_metrics(test_data['target'], pred["prediction"].astype('int'), tag="Holdout")
I want to use k-fold cross-validation with k = 5 or 10 instead.
Thank you for your advice.
The complete code, which uses train_test_split, is above.
Here is the example I started from (it also uses a single train/test split):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.4, random_state=0)
X_train.shape, y_train.shape
X_test.shape, y_test.shape
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
result (in this example):
0.9666666666666667
The example is from here: https://scikit-learn.org/stable/modules/cross_validation.html
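For comparison, here is a minimal sketch of 5-fold cross-validation on the same iris / SVC example, once with cross_val_score and once with an explicit KFold loop. The explicit loop is probably closer to what you need when the model is not a plain scikit-learn estimator. The shuffle and random_state settings are my own choices.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Option 1: let scikit-learn run the folds.
# For a classifier, an integer cv uses stratified folds.
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
print(scores, scores.mean())

# Option 2: explicit loop, one fresh model per fold.
kf = KFold(n_splits=5, shuffle=True, random_state=100)
fold_scores = []
for train_index, test_index in kf.split(X):
    model = svm.SVC(kernel='linear', C=1)
    model.fit(X[train_index], y[train_index])
    fold_scores.append(model.score(X[test_index], y[test_index]))
print(np.mean(fold_scores))
```

In the TabularModel code, the body of the loop would presumably build train_data and test_data from the rows selected by train_index and test_index (with .iloc if X is a DataFrame) and create a new TabularModel for each fold. I have not run that part, so treat it as an outline only.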

ValueError: Found input variables with inconsistent numbers of samples: [40000, 10000]

I am getting the error on this line of code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Error: ValueError: Found input variables with inconsistent numbers of samples: [40000, 10000]
It seems that after the vectorization the array size changes and no longer matches y.
I would appreciate help resolving the error. Thanks in advance.
Output:
(10000, 4)
(10000,)
(40000, 1500)
(10000,)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
# Import dataset:
dataset = pd.read_excel(r"C:\Users\HPS1RT\Downloads\test\Safety_Prediction.xlsx", nrows=10000)
dataset[["Safety"]] *= 1
# Assign values to the X and y variables:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
print(X.shape)
print(y.shape)
#vectorization
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7)
X = vectorizer.fit_transform(X.ravel()).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()
print(X.shape)
print(y.shape)
# Split dataset into random train and test subsets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train)
print(y_train)
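A likely cause, judging from the printed shapes: X has shape (10000, 4), and X.ravel() flattens it into 40000 separate values, so CountVectorizer treats every cell as its own document and returns 40000 rows while y still has 10000. One possible fix is to join the columns of each row into a single string before vectorizing, as in the minimal sketch below. The tiny DataFrame and its column names are made up for illustration, and TfidfVectorizer is used as a stand-in for CountVectorizer followed by TfidfTransformer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in for the Excel sheet: two text columns and a label column.
dataset = pd.DataFrame({
    "col_a": ["wet floor", "loose cable", "wet stairs", "blocked exit"],
    "col_b": ["near door", "near desk", "near door", "near stairs"],
    "Safety": [1, 0, 1, 0],
})

X_raw = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# ravel() yields one entry per cell, not one per row.
flattened = X_raw.ravel()
print(flattened.shape, y.shape)

# Join the columns of each row so there is exactly one document per sample.
texts = [" ".join(map(str, row)) for row in X_raw]
X = TfidfVectorizer().fit_transform(texts)
print(X.shape, y.shape)
```

After this, X and y have the same number of rows and can be passed to train_test_split together.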

operand could not be broadcast together with shapes <56962,2> .. error

I am trying logistic regression classification using k-fold cross-validation in Python.
My code:
`import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix,roc_auc_score
data = pd.read_csv('xxx.csv')
X = data[["a","b","c",...]]
y = data["Class"]
def get_predictions(clf, X_train, y_train, X_test):
    clf = clf
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    y_pred_prob = clf.predict_proba(X_test)
    train_pred = clf.predict(X_train)
    print('train-set confusion matrix:\n', confusion_matrix(y_train,train_pred))
    return y_pred, y_pred_prob
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)
pred_test_full=0
cv_score=[]
i=1
for train_index, test_index in skf.split(X, y):
    X_train, y_train = X.loc[train_index], y.loc[train_index]
    X_test, y_test = X.loc[test_index], y.loc[test_index]
    log_cfl = LogisticRegression(C=2);
    log_cfl.fit(X_train, y_train)
    y_pred, y_pred_prob = get_predictions(LogisticRegression(C=2), X_train, y_train, X_test)
    score=roc_auc_score(y_test,log_cfl.predict(X_test))
    print('ROC AUC score: ',score)
    cv_score.append(score)
    pred_test_full = pred_test_full + y_pred_prob
    i+=1`
I get an error at this line of code:
`pred_test_full = pred_test_full + y_pred_prob`
The for loop runs twice. On the third iteration, I get this error:
'operands could not be broadcast together with shapes <56962,2> <5696..' error.
I can't figure out what is wrong. Could you help me?
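A probable explanation: when the number of rows is not divisible by 5, the test folds do not all have the same size, so y_pred_prob has one row fewer in some folds and cannot be added element-wise to the running total. Each fold also predicts a different set of rows, so summing the arrays would not be meaningful even if the shapes matched. A minimal sketch of one alternative is to keep an out-of-fold array and fill in each fold's rows by index. The data below comes from make_classification as a stand-in for xxx.csv, and scoring ROC AUC on the predicted probabilities rather than the hard predictions is my own choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# 1002 rows: deliberately not divisible by 5, so fold sizes differ.
X, y = make_classification(n_samples=1002, n_features=5, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_prob = np.zeros((len(y), 2))  # one row of probabilities per sample
fold_sizes = []
cv_score = []

for train_index, test_index in skf.split(X, y):
    clf = LogisticRegression(C=2, max_iter=1000)
    clf.fit(X[train_index], y[train_index])
    prob = clf.predict_proba(X[test_index])
    # Store this fold's predictions in the rows they belong to.
    oof_prob[test_index] = prob
    fold_sizes.append(len(test_index))
    cv_score.append(roc_auc_score(y[test_index], prob[:, 1]))

print(fold_sizes)
print(np.mean(cv_score))
```

With a DataFrame, use X.iloc[train_index] and y.iloc[train_index], since the splitter returns positional indices. Averaging probabilities across folds only makes sense when every fold's model predicts the same separate test set.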

LeaveOneOut to determine k of knn

I want to find the best k for k-nearest neighbors. I am using LeaveOneOut to divide my data into train and test sets. In the code below I have 150 data entries, so I get 150 different train and test sets. k should be between 1 and 40.
I want to plot the cross-validation average classification error as a function of k, to see which k is best for KNN.
Here is my code:
import scipy.io as sio
import seaborn as sn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut
error = []
array = np.array(range(1,41))
dataset = pd.read_excel('Data/iris.xls')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
loo = LeaveOneOut()
loo.get_n_splits(X)
for train_index, test_index in loo.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)
    for i in range(1, 41):
        classifier = KNeighborsClassifier(n_neighbors=i)
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        error.append(np.mean(y_pred != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 41), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
You are calculating the error for each individual prediction, which is why you have 6000 points in your error array (150 splits times 40 values of k). You need to collect the predictions for all splits for a given n_neighbors and then calculate the error for that value.
You can do this:
# Loop over possible values of "n_neighbors"
for i in range(1, 41):
    # Collect the actual and predicted values for all splits for a single "n_neighbors"
    actual = []
    predicted = []
    for train_index, test_index in loo.split(X):
        #print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        classifier = KNeighborsClassifier(n_neighbors=i)
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        # Append the single predictions and actual values here.
        actual.append(y_test[0])
        predicted.append(y_pred[0])
    # Outside the inner loop, calculate the error.
    error.append(np.mean(np.array(predicted) != np.array(actual)))
The rest of your code is fine.
There is a more compact way to do this using cross_val_predict:
from sklearn.model_selection import cross_val_predict
for i in range(1, 41):
    classifier = KNeighborsClassifier(n_neighbors=i)
    y_pred = cross_val_predict(classifier, X, y, cv=loo)
    error.append(np.mean(y_pred != y))
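Putting the pieces together, a self-contained sketch could look like the following. It loads iris with load_iris instead of reading Data/iris.xls (an assumption, so that it runs without the file) and uses cross_val_score, where the error is one minus the mean accuracy over the 150 leave-one-out splits.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

k_values = list(range(1, 41))
error = []
for k in k_values:
    classifier = KNeighborsClassifier(n_neighbors=k)
    # One accuracy value (0 or 1) per left-out sample.
    accuracy = cross_val_score(classifier, X, y, cv=loo)
    error.append(1.0 - accuracy.mean())

# k with the lowest leave-one-out error (the first one if there are ties).
best_k = k_values[int(np.argmin(error))]
print(best_k, min(error))
```

The error list now has exactly 40 entries, one per k, so it can be passed straight to the plt.plot call from the question.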

writing a train_test_split function with numpy

I am trying to write my own train test split function using numpy instead of using sklearn's train_test_split function. I am splitting the data into 70% training and 30% test. I am using the boston housing data set from sklearn.
This is the shape of the data:
housing_features.shape #(506,13) where 506 is sample size and it has 13 features.
This is my code:
city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data
def shuffle_split_data(X, y):
    split = np.random.rand(X.shape[0]) < 0.7
    X_Train = X[split]
    y_Train = y[split]
    X_Test = X[~split]
    y_Test = y[~split]
    print(len(X_Train), len(y_Train), len(X_Test), len(y_Test))
    return X_Train, y_Train, X_Test, y_Test
try:
    X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
    print("Successful")
except:
    print("Fail")
The print output I got is:
362 362 144 144
"Successful"
But I know it was not successful, because I get different lengths each time I run it, whereas sklearn's train_test_split always gives 354 for the length of X_train.
# correct output
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print(len(X_train))
# 354
What am I missing in my function?
Because np.random.rand gives you random numbers, the fraction of them that falls below 0.7 is only approximately 70%; it gets close to exactly 70% only for very large samples. You could use np.percentile to get the 70th percentile of your random numbers and then compare against that value as you did:
def shuffle_split_data(X, y):
    arr_rand = np.random.rand(X.shape[0])
    # Exactly 70% of the random values lie below their own 70th percentile.
    split = arr_rand < np.percentile(arr_rand, 70)
    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]
    print(len(X_train), len(y_train), len(X_test), len(y_test))
    return X_train, y_train, X_test, y_test
EDIT
Alternatively, you could use np.random.choice to select the desired number of indices. Pass replace=False so that no index is drawn twice. For your case:
np.random.choice(range(X.shape[0]), int(0.7*X.shape[0]), replace=False)
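Another way to get an exact 70/30 split is to shuffle the row indices once and slice them. Here is a minimal sketch of that idea. The train_fraction and seed parameters are my own additions, and the arrays are synthetic stand-ins with the same shape as the housing data. A fixed seed makes the split reproducible, but it will not select the same rows as sklearn's random_state=42.

```python
import numpy as np

def shuffle_split_data(X, y, train_fraction=0.7, seed=None):
    """Shuffle row indices once, then slice them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    n_train = int(train_fraction * X.shape[0])
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Synthetic stand-in with the same shape as the housing data: (506, 13).
X = np.arange(506 * 13, dtype=float).reshape(506, 13)
y = np.arange(506, dtype=float)

X_train, y_train, X_test, y_test = shuffle_split_data(X, y, seed=42)
print(len(X_train), len(y_train), len(X_test), len(y_test))
# 354 354 152 152
```

Because every index appears exactly once in the permutation, the train and test sets never overlap and together cover all rows.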
