I have preprocessed some data, ready to train a Multinomial Naive Bayes classifier. The training data is 80% of my data and the test data is 20%.
The training data is an array of size 8452 and the test data is an array of size 4231.
If I want to see the predictions on the training data, the following code runs just fine:
multiNB = MultinomialNB()
model = multiNB.fit(x_train, y_train)
y_preds = model.predict(x_train)
but if I want to predict on my test set, i.e.
y_preds = model.predict(x_test)
I get the following error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 8452 is different from 4231)
If I need to provide more information about my code, please ask; I am stuck here, I do not really understand what is causing this error, and any help is welcome.
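A quick sanity check, assuming x_train and x_test are NumPy arrays or SciPy sparse matrices, is to print the shapes that go into fit and predict:
print(x_train.shape)  # shape the model was fitted on
print(x_test.shape)   # shape passed to predict()
MultinomialNB can only predict on data with the same number of features (columns) it was trained on, so if the second dimensions differ, the test-set preprocessing needs to produce the same feature space as the training set.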
This is how I obtained my train-test sets:
total_count = len(tokenised_reviews)
split = int(total_count * 0.8)
shuffle = np.random.permutation(total_count)
x = []
y = []
for i in range(total_count):
    x.append(x_data[shuffle[i]])
    y.append(y_data[shuffle[i]])
x_train = x[:split]
x_test = x[split:]
y_train = y[:split]
y_test = y[split:]
Too long to type as a comment: I got a very weird structure when I tried your code. I have no idea what x_data is, so it is hard to say what the exact error is.
I suspect something went wrong with putting the data back into a list again, so if you do this instead (assuming x_data and y_data are NumPy arrays, so the fancy indexing below works):
total_count = len(x_data)
split = int(total_count * 0.8)
shuffle = np.random.permutation(total_count)
x_train = x_data[shuffle[:split]]
x_test = x_data[shuffle[split:]]
y_train = y_data[shuffle[:split]]
y_test = y_data[shuffle[split:]]
you should get x_train and x_test as subsets of the original data, with 80% of the shuffled rows going to the training set and the remaining 20% to the test set.
Or you can simply do:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2)
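If you want the split to be reproducible across runs, train_test_split also accepts a random_state (the value 42 below is just an example):
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)
This also keeps x and y aligned row by row, so no manual shuffling bookkeeping is needed.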
Related
I am evaluating an SOM/Kohonen map as a regressor for a dataset. Unfortunately it performs extremely badly, so badly that I think I might have an error in my code. While the R2 score for the training dataset is usually only around 1-5%, the R2 score for the test dataset is ALWAYS extremely negative; for example:
Train: 1.09 %
Test: -5668908.61 %
Even though I have gone over my code again and again, I just want to make sure that I did not make a mistake with scaling the data or something similar that might be causing the bad performance. Basically I split the data into X and y and then use sklearn's train_test_split() to get the respective datasets.
I use sklearn's MinMaxScaler() to fit_transform() X_train and apply the same transformation to X_test so that there is no data leakage. For y_train I use a separate scaler (scalery).
After each model is trained, I use the y_train scaler (scalery) to invert the scaling on y_pred, y_pred_train and y_train.
Is there some mistake in my approach? I just want to make sure that this type of model simply performs inherently badly here, and that it is not because of an error on my side.
Here is my code:
data = load_dataset(currency, 1440, predictor, data_range)
X = data.drop(predictor, axis =1)
y = data[[predictor]]
scaler = MinMaxScaler(feature_range=(0, 1))
scalery = MinMaxScaler(feature_range=(0, 1))
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    shuffle=False,
)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scalery.fit_transform(y_train)
map_size = int(5 * math.sqrt(X_test.shape[0]))  # Vesanto's heuristic for map size
info_dict = {
    'currency': currency,
    'data_range': data_range,
    'epochs': 0
}
for i in range(100, 2100, 100):
    info_dict['epochs'] = i
    print(f"GridSearch Configuration: {map_size}x{map_size}")
    print(currency, data_range, i)
    som = susi.SOMRegressor(
        n_rows=map_size,
        n_columns=map_size,
        n_iter_unsupervised=i,
        n_iter_supervised=i,
        neighborhood_mode_unsupervised="linear",
        neighborhood_mode_supervised="linear",
        learn_mode_unsupervised="min",
        learn_mode_supervised="min",
        learning_rate_start=0.5,
        learning_rate_end=0.05,
        # do_class_weighting=True,
        random_state=None,
        n_jobs=1)
    som.fit(X_train, y_train.ravel())
    y_pred = som.predict(X_test)
    y_pred_train = som.predict(X_train)
    y_pred = scalery.inverse_transform(pd.DataFrame(y_pred))
    y_train = scalery.inverse_transform(pd.DataFrame(y_train))
    y_pred_train = scalery.inverse_transform(pd.DataFrame(y_pred_train))
    print("Train: {0:.2f} %".format(r2_score(y_train, y_pred_train) * 100))
    print("Test: {0:.2f} %".format(r2_score(y_test, y_pred) * 100))
I have a dataset with more than 2000 rows and 23 columns, including the age column. I have completed all of the preprocessing for SVR. Now I want to make predictions with the trained SVR model; is this where I need to pass X_test to the model? I am facing this error:
ValueError: X.shape[1] = 1 should be equal to 22, the number of features at training time
How can I resolve this problem? How should I write the code to make predictions with the trained SVR model?
import pandas as pd
import numpy as np
from sklearn.svm import SVR
# Make fake dataset
dataset = pd.DataFrame(data= np.random.rand(2000,22))
dataset['age'] = np.random.randint(2, size=2000)
# Separate the target from the other features
target = dataset['age']
data = dataset.drop('age', axis = 1)
X_train, y_train = data.loc[:1000], target.loc[:1000]
X_test, y_test = data.loc[1001], target.loc[1001]
X_test = np.array(X_test).reshape((len(X_test), 1))
print(X_test.shape)
SupportVectorRefModel = SVR()
SupportVectorRefModel.fit(X_train, y_train)
y_pred = SupportVectorRefModel.predict(X_test)
Output:
ValueError: X.shape[1] = 1 should be equal to 22, the number of features at training time
Your reshaping of X_test is not correct: reshape((len(X_test), 1)) turns the single 22-feature sample into 22 rows of one feature each, while the model expects one row with 22 features. It should be:
X_test = np.array(X_test).reshape(1, -1)
print(X_test.shape)
# (1, 22)
With that change, the rest of your code runs OK:
y_pred = SupportVectorRefModel.predict(X_test)
y_pred
# array([0.90156667])
UPDATE
In the case shown in your code, X_test obviously consists of one single sample, as defined here:
X_test, y_test = data.loc[1001], target.loc[1001]
But if (as I suspect) this is not what you actually want, and you in fact want the rest of your data as your test set, you should change the definition to:
X_test, y_test = data.loc[1001:], target.loc[1001:]
X_test.shape
# (999, 22)
and, without any reshaping:
y_pred = SupportVectorRefModel.predict(X_test)
y_pred.shape
# (999,)
i.e. a y_pred of 999 predictions.
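Alternatively, you can avoid the reshape altogether by selecting the single row with a list of labels, which keeps it as a one-row DataFrame:
X_test = data.loc[[1001]]     # double brackets: returns a (1, 22) DataFrame, not a length-22 Series
y_test = target.loc[[1001]]
y_pred = SupportVectorRefModel.predict(X_test)  # works without any reshaping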
I am new to Machine Learning and to Python. I am trying to build a Random Forest model in order to predict cement strength.
There are two .csv files: train_data.csv and test_data.csv.
This is what I have done; I am trying to compute the r2_score here.
df=pd.read_csv("train_data(1).csv")
X=df.drop('strength',axis=1)
y=df['strength']
model=RandomForestRegressor()
model.fit(X,y)
X_test=pd.read_csv("test_data.csv")
y_pred=model.predict(X_test)
acc_R=metrics.r2_score(y,y_pred)
acc_R
The problem here is that the shapes of y and y_pred are different, so I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [721, 309]
How do I correct this? Can someone explain to me what I am doing wrong?
df_train = pd.read_csv("train_data(1).csv")
X_train = df_train.drop('strength', axis=1)
y_train = df_train['strength']
model = RandomForestRegressor()
model.fit(X_train, y_train)
df_test = pd.read_csv("test_data.csv")
X_test = df_test.drop('strength', axis=1)  # if your test data contains a 'strength' column
y_test = df_test['strength']               # if your test data contains a 'strength' column
y_pred = model.predict(X_test)
acc_R = metrics.r2_score(y_test, y_pred)
acc_R
You need to compare y_pred with y_test, not with the y you used to train the model:
acc_R=metrics.r2_score(y_test,y_pred)
There should be another set of labels for y_test in test_data.csv.
Try the following:
df=pd.read_csv("train_data(1).csv")
X=df.drop('strength',axis=1)
y=df['strength']
model=RandomForestRegressor()
model.fit(X,y)
df1=pd.read_csv("test_data.csv") # we read the csv data from test
X_test=df1.drop('strength',axis=1) # get the fields that we will predict
y_test=df1['strength'] # get the correct labels for X_test
y_pred=model.predict(X_test) # get the predicted results
acc_R=metrics.r2_score(y_test,y_pred) # compare
acc_R
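If test_data.csv does not actually contain a 'strength' column (as is common when the test file is meant only for submission), one alternative is to hold out a validation set from the training data and score on that instead; a sketch under that assumption (variable names here are just for illustration):
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor()
model.fit(X_tr, y_tr)
val_pred = model.predict(X_val)
print(metrics.r2_score(y_val, val_pred))  # r2 on held-out labelled data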
I was using train_test_split in my code and then wanted to change it to cross-validation, but something strange is happening.
train, test = train_test_split(data, test_size=0)
x_train = train.drop('CRO', axis=1)
y_train = train['CRO']
scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)
for k in range(1, 5):
    knn = neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform')
    scores = model_selection.cross_val_score(knn, x_train, y_train, cv=5)
    print(scores.mean(), 'score for k = ', k)
This code gives scores of around 0.8, but when I delete that first line and use the 'data' set instead of the 'train' set in the 2nd and 3rd lines, the score drops to 0.2, which is strange because I even set test_size to 0, so train should be equal to the whole data.
What is happening?
One thing to be aware of is the implicit arguments passed to train_test_split.
By default shuffle=True, so that first line shuffles your data before you run cross-validation. When you pass data in directly, cross_val_score with cv=5 splits it in its original order; if the rows are ordered (for example by target value or by time), the folds are not representative of each other and the scores drop, which would explain the much lower result.
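If you want the two setups to be comparable, one option is to let the cross-validator do the shuffling itself, for example with a shuffled KFold (n_neighbors=3 below is just an illustrative value):
from sklearn import model_selection, neighbors

knn = neighbors.KNeighborsRegressor(n_neighbors=3, weights='uniform')
cv = model_selection.KFold(n_splits=5, shuffle=True, random_state=0)
scores = model_selection.cross_val_score(knn, x_train, y_train, cv=cv)
print(scores.mean())
Passing a shuffled KFold removes the dependence on the row order of the data, so both variants should give similar scores.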
I am working on a classification problem whose evaluation metric is ROC AUC. So far I have tried xgb with different parameters. Here is the function which I used to sample the data, and you can find the relevant notebook here (Google Colab).
def get_data(x_train, y_train, shuffle=False):
    if shuffle:
        total_train = pd.concat([x_train, y_train], axis=1)
        # generate n random numbers in range(0, len(data))
        n = np.random.randint(0, len(total_train), size=len(total_train))
        x_train = total_train.iloc[n]
        y_train = total_train.iloc[n]['is_pass']
        x_train.drop('is_pass', axis=1, inplace=True)
        # keep the first 1000 rows as test data
        x_test = x_train.iloc[:1000]
        # keep rows 1000 to 10000 as validation data
        x_valid = x_train.iloc[1000:10000]
        x_train = x_train.iloc[10000:]
        y_test = y_train[:1000]
        y_valid = y_train[1000:10000]
        y_train = y_train.iloc[10000:]
        return x_train, x_valid, x_test, y_train, y_valid, y_test
    else:
        # keep the first 1000 rows as test data
        x_test = x_train.iloc[:1000]
        # keep rows 1000 to 10000 as validation data
        x_valid = x_train.iloc[1000:10000]
        x_train = x_train.iloc[10000:]
        y_test = y_train[:1000]
        y_valid = y_train[1000:10000]
        y_train = y_train.iloc[10000:]
        return x_train, x_valid, x_test, y_train, y_valid, y_test
Here are the two outputs that I get after running on the shuffled and non-shuffled data:
AUC with shuffling: 0.9021756235738453
AUC without shuffling: 0.8025162142685565
Can you figure out what the issue is here?
The problem is in your implementation of shuffling: np.random.randint generates random numbers, but they can repeat, so the same rows end up in your train and in your test+validation sets. You should use np.random.permutation instead (and consider using np.random.seed to make the outcome reproducible).
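A minimal sketch of that fix, reusing the names from your get_data function:
np.random.seed(0)                             # optional, for reproducibility
n = np.random.permutation(len(total_train))   # each index appears exactly once, no repeats
x_train = total_train.iloc[n].drop('is_pass', axis=1)
y_train = total_train.iloc[n]['is_pass']
Since every row index is used exactly once, the subsequent test/validation/train slices no longer share rows.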
Another note: you have a very large gap in performance between the training and validation/test sets (training shows an almost perfect ROC AUC). I would guess this is due to the maximum tree depth (14) being too high for the size of the dataset (~60K rows) you have in hand.
P.S. Thanks for sharing the Colaboratory link; I was not aware of it, but it is very useful.