2D arrays in Random Forest fitting and scoring - python

I'm trying to fit a Random Forest regression model. These are the steps I've followed (please see the code below with comments):
Before fitting the model, I split the data into training and test sets
I converted the results into arrays
I reshaped them into 2D arrays, since that is what the regressor expects, using the reshape function
I'm getting the following error (it seems there is still a 1D array even though I reshaped everything at the beginning):
ValueError: Expected 2D array, got 1D array instead:
array=[183. 27. 520. ... 23. 28. 34.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or
array.reshape(1, -1) if it contains a single sample.
Here's the code I've used:
#train & test split
X = order_final.loc[:, ~order_final.columns.isin(['lag','observed'])]
y = order_final['lag']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#convert X,y train and X test into arrays
X_train = X_train.to_numpy()
y_train = y_train.to_numpy()
X_test = X_test.to_numpy()
#make them 2D-arrays
X_train.reshape(-1,1)
y_train.reshape(-1,1)
X_test.reshape(-1,1)
# Fitting Random Forest Regression to the dataset
# import the regressor
from sklearn.ensemble import RandomForestRegressor
# create regressor object
RF = RandomForestRegressor(n_estimators = 100, random_state = 0)
# fit the regressor with x and y data
RF.fit(X_train, y_train)
#Prediction of test set
y_pred = RF.predict(X_test)
# View accuracy score
RF.score(y_test, y_pred)
And here are the shapes of my arrays (which look good to me, but...):
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(y_pred.shape)
(7326, 10)
(7326,)
(1832, 10)
(1832,)
(1832,)
Can someone please help me out and point out where the error is? Thanks in advance!
Stefano

RF.score() needs the inputs to be reshaped:
RF.score(y_test.reshape(-1,1), y_pred.reshape(-1,1))

You need to pass the inputs used for making your predictions to the score() method instead, like so:
RF.score(X_test, y_test)
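For a regressor, score() computes the predictions from X_test internally and returns the R² value. If you want to evaluate predictions you have already computed (y_pred), a metric function from sklearn.metrics can be used instead; a minimal sketch:
from sklearn.metrics import r2_score

# R^2 from the true targets and the precomputed predictions;
# for a regressor this matches RF.score(X_test, y_test)
print(r2_score(y_test, y_pred))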

Related

Getting shape errors for the .score method from sklearn

df = pd.read_csv('../input/etu-ai-club-competition-2/train.csv')
df.shape
(750000, 77)
X = df.drop(columns = 'Target')
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
model = MLPRegressor(hidden_layer_sizes = 60, activation = "relu", solver = "adam")
model
model.fit(X_train, y_train)
pr = model.predict(X_test)
pr.shape
(187500,)
model.score(y_test, pr)
ValueError: Expected 2D array, got 1D array instead:
array=[-120.79511811 -394.11307519 -449.59524477 ... -432.46130084 -492.81440014
-753.02016315].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I just started getting into ML. I don't really understand why I need a 2D array to get the score, or how to convert mine into one. I did try to reshape it as the error suggests, but when I do that I get ValueError: X has 1 features, but MLPRegressor is expecting 76 features as input. and ValueError: X has 187500 features, but MLPRegressor is expecting 76 features as input. for reshaping into (-1, 1) and (1, -1) respectively.
The correct way to call the score method would be:
model.score(X_test, y_test)
Internally, it first computes the predictions and then passes the predictions to a scoring function.
If you want to pass the predictions directly, you need to use one of the scoring functions in the metrics package, as explained here:
https://scikit-learn.org/0.15/modules/model_evaluation.html
Note: you might also want to have a look at the example code in the MLPRegressor documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html
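Along those lines, here is a minimal, self-contained sketch of the intended call pattern (synthetic data from make_regression is used purely for illustration, and max_iter is raised only to avoid convergence warnings):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic data just to illustrate the call pattern
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = MLPRegressor(hidden_layer_sizes=60, activation="relu", solver="adam", max_iter=2000)
model.fit(X_train, y_train)

# score() takes the test features and the true targets, not the predictions
print(model.score(X_test, y_test))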

Trying to train simple linear regression algorithm. Keep getting error [duplicate]

While practicing a simple linear regression model I got this error;
I think there is something wrong with my data set.
(Screenshots of the data set, the independent variable X, the dependent variable Y, X_train, and Y_train were attached to the original question but are not reproduced here.)
This is the error body:
ValueError: Expected 2D array, got 1D array instead:
array=[ 7. 8.4 10.1 6.5 6.9 7.9 5.8 7.4 9.3 10.3 7.3 8.1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
And this is my code:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
#Spliting the dataset into Training set and Test Set
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#linnear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
Thank you
You need to give both the fit and predict methods 2D arrays. Your x_train and x_test are currently only 1 dimensional. What is suggested by the console should work:
x_train= x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
This uses numpy's reshape to transform your array. For example, x = [1, 2, 3] would be transformed to a matrix x' = [[1], [2], [3]] (-1 tells reshape to infer that dimension from the length of the array and the remaining dimensions, and 1 is the number of columns, giving us an n x 1 matrix where n is the input length).
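For example, a quick check in NumPy shows what this does to a 1D array:
import numpy as np

x = np.array([1, 2, 3])
print(x.shape)                 # (3,)  -> 1D
print(x.reshape(-1, 1).shape)  # (3, 1) -> 2D, one column
print(x.reshape(-1, 1))        # [[1]
                               #  [2]
                               #  [3]]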
Questions about reshape have been answered in the past; this one, for example, should explain fully what reshape(-1,1) means: What does -1 mean in numpy reshape? (some of the other answers below explain this very well too)
A lot of times when doing linear regression problems, people like to envision the familiar graph of a single input variable plotted against the target, with an input like X = [1, 2, 3, 4, 5].
However, many regression problems have multidimensional inputs. Consider the prediction of housing prices: it's not one attribute that determines the price, it's multiple features (e.g. number of rooms, location, etc.).
If you look at the documentation for fit, it tells us that the rows are the samples while the columns are the features.
However, consider what happens when we have only one feature as our input. Then we need an n x 1 dimensional input, where n is the number of samples and the single column represents our only feature.
Why does the array.reshape(-1, 1) suggestion work? -1 means choose a number of rows that works based on the number of columns provided, so the 1D input becomes an n x 1 array.
If you look at the documentation of LinearRegression in scikit-learn:
fit(X, y, sample_weight=None)
X : numpy array or sparse matrix of shape [n_samples,n_features]
predict(X)
X : {array-like, sparse matrix}, shape = (n_samples, n_features)
As you can see, X has 2 dimensions, whereas your x_train and x_test clearly have one.
As suggested, add:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
before fitting the model and making predictions.
Use
y_pred = regressor.predict([[x_test]])
I would suggest reshaping X at the beginning, before you split into train and test datasets:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Here is the trick
x = x.reshape(-1,1)
#Splitting the dataset into Training set and Test Set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed in newer scikit-learn versions
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
#Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
This is what I use
X_train = X_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)
This is the solution
regressor.predict([[x_test]])
And for polynomial regression:
regressor_2.predict(poly_reg.fit_transform([[x_test]]))
Modify
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
to
regressor.fit(x_train.values.reshape(-1,1),y_train)
y_pred = regressor.predict(x_test.values.reshape(-1,1))

After using logistic regression I got ValueError: y should be a 1d array, got an array of shape (295, 7) instead

#splitting the dataset into dependent(y) and independent variable(x)
x = training_data.iloc[:,[0,2,3,4,5,6,7]].values
y = training_data.iloc[:,1].values
from sklearn.model_selection import train_test_split
x_train,y_train,x_test,y_test = train_test_split(x,y,test_size = 0.3,random_state = 0)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)
I am trying to use logistic regression to train on the independent (x_train) and dependent (y_train) variables, but every time I run the code I see the error
ValueError: y should be a 1d array, got an array of shape (295, 7) instead.
I don't know what to do.
You have an error when making the train_test_split.
Be aware of the order of the output variables; the correct unpacking is shown below:
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size = 0.3,random_state=0)
Just by changing this line, your problem should disappear.
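As a quick sanity check (a sketch with made-up data of the same layout: 7 feature columns and a 1D target), printing the shapes after splitting makes the return order obvious:
import numpy as np
from sklearn.model_selection import train_test_split

# Fake data shaped like the question: 7 feature columns, binary target
x = np.random.rand(1000, 7)
y = np.random.randint(2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
print(X_train.shape, y_train.shape)  # (700, 7) (700,)
print(X_test.shape, y_test.shape)    # (300, 7) (300,)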

I'm trying to create an image classifier machine learning model. While fitting my training data to the pipeline, it shows me this error

the shape of X is
X = np.array(X).reshape(len(X),4096).astype(float)
X.shape
(529, 4096)
the code
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel = 'rbf', C = 10))])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
error
ValueError: Found array with dim 3. StandardScaler expected <= 2.
I don't know if you need an explanation of what the error message means or how to solve it... but I guess it's both ^^
First, the StandardScaler() function accepts 2D NumPy arrays and fails when processing arrays with 3 or more dimensions.
My first guess would have been that you were trying to process some kind of picture array (like (100, 380, 240, 3)), but since your X.shape returns a 2D array, this is not so obvious.
I would recommend double-checking the X_train and y_train shapes as well, just before you run the pipe.
If their shapes are 2D, you might as well try to apply StandardScaler() as a stand-alone function to see if that works.
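For instance, something along these lines (reusing the variable names from the question) would confirm the shapes and exercise the scaler on its own:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Confirm what actually reaches the pipeline
print(np.asarray(X_train).shape, np.asarray(y_train).shape)

# Run the scaler by itself; this raises the same
# "Found array with dim 3" ValueError if X_train is not 2D
X_scaled = StandardScaler().fit_transform(X_train)
print(X_scaled.shape)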

How can I predict with the trained SVR model and resolve the error ValueError: X.shape[1] = 1 should be equal to 22

I have a dataset with more than 2000 rows and 23 columns, including the age column. I have completed all of the preprocessing for SVR. Now I want to make predictions with the trained SVR model; is this the point where I need to feed X_test to the model? I have run into the following error:
ValueError: X.shape[1] = 1 should be equal to 22, the number of features at training time
How can I resolve this problem, and how should I write the code for making predictions with the trained SVR model?
import pandas as pd
import numpy as np
from sklearn.svm import SVR
# Make fake dataset
dataset = pd.DataFrame(data= np.random.rand(2000,22))
dataset['age'] = np.random.randint(2, size=2000)
# Separate the target from the other features
target = dataset['age']
data = dataset.drop('age', axis = 1)
X_train, y_train = data.loc[:1000], target.loc[:1000]
X_test, y_test = data.loc[1001], target.loc[1001]
X_test = np.array(X_test).reshape((len(X_test), 1))
print(X_test.shape)
SupportVectorRefModel = SVR()
SupportVectorRefModel.fit(X_train, y_train)
y_pred = SupportVectorRefModel.predict(X_test)
Output:
ValueError: X.shape[1] = 1 should be equal to 22, the number of features at training time
Your reshaping of X_test is not correct; it should be:
X_test = np.array(X_test).reshape(1, -1)
print(X_test.shape)
# (1, 22)
With that change, the rest of your code runs OK:
y_pred = SupportVectorRefModel.predict(X_test)
y_pred
# array([0.90156667])
UPDATE
In the case shown in your code, X_test obviously consists of one single sample, as defined here:
X_test, y_test = data.loc[1001], target.loc[1001]
But if (as I suspect) this is not what you actually want, but in fact you want the rest of your data as your test set, you should change the definition to:
X_test, y_test = data.loc[1001:], target.loc[1001:]
X_test.shape
# (999, 22)
and without any reshaping
y_pred = SupportVectorRefModel.predict(X_test)
y_pred.shape
# (999,)
i.e. a y_pred of 999 predictions.
