DecisionTree - Python - python

I am new to this, anything will be helpful. The data size is large...
I am not sure where the error could be coming from. I dont even know if this is a good idea hahah, I am using longitude and latitude for my x and y.
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
df = pd.read_csv('aug.csv')
X = df.Lon
y = df.Lat
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)```
ValueError: Expected 2D array, got 1D array instead:
array=[-73.9713 -74.0635 -73.9881 ... -74.1777 -73.9923 -73.9661].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Your X variable for the inputs needs to be an array of features. You have a single column in that csv so it interprets that as a 1D array. The error message you are getting is correct, so change that line "X = df.Lon" to be:
"X = df.Lon.reshape(-1, 1)"
One thing to note: what you're doing doesn't make a ton of sense. What this code is trying to do is predict the Y (lat) given the X (lon). These really should be independent variables, so predicting one from the other will probably not yield any meaningful results.

Related

Trying to train simple linear regression algorithm. Keep getting error [duplicate]

While practicing Simple Linear Regression Model I got this error,
I think there is something wrong with my data set.
Here is my data set:
Here is independent variable X:
Here is dependent variable Y:
Here is X_train
Here Is Y_train
This is error body:
ValueError: Expected 2D array, got 1D array instead:
array=[ 7. 8.4 10.1 6.5 6.9 7.9 5.8 7.4 9.3 10.3 7.3 8.1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
And this is My code:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
#Spliting the dataset into Training set and Test Set
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#linnear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
Thank you
You need to give both the fit and predict methods 2D arrays. Your x_train and x_test are currently only 1 dimensional. What is suggested by the console should work:
x_train= x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
This uses numpy's reshape to transform your array. For example, x = [1, 2, 3] wopuld be transformed to a matrix x' = [[1], [2], [3]] (-1 gives the x dimension of the matrix, inferred from the length of the array and remaining dimensions, 1 is the y dimension - giving us a n x 1 matrix where n is the input length).
Questions about reshape have been answered in the past, this for example should answer what reshape(-1,1) fully means: What does -1 mean in numpy reshape? (also some of the other below answers explain this very well too)
A lot of times when doing linear regression problems, people like to envision this graph
On the input, we have an X of X = [1,2,3,4,5]
However, many regression problems have multidimensional inputs. Consider the prediction of housing prices. It's not one attribute that determines housing prices. It's multiple features (ex: number of rooms, location, etc. )
If you look at the documentation you will see this
It tells us that rows consist of the samples while the columns consist of the features.
However, consider what happens when he have one feature as our input. Then we need an n x 1 dimensional input where n is the number of samples and the 1 column represents our only feature.
Why does the array.reshape(-1, 1) suggestion work? -1 means choose a number of rows that works based on the number of columns provided. See the image for how it changes in the input.
If you look at documentation of LinearRegression of scikit-learn.
fit(X, y, sample_weight=None)
X : numpy array or sparse matrix of shape [n_samples,n_features]
predict(X)
X : {array-like, sparse matrix}, shape = (n_samples, n_features)
As you can see X has 2 dimensions, where as, your x_train and x_test clearly have one.
As suggested, add:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
Before fitting and predicting the model.
Use
y_pred = regressor.predict([[x_test]])
I would suggest to reshape X at the beginning before you do the split into train and test dataset:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Here is the trick
x = x.reshape(-1,1)
#Spliting the dataset into Training set and Test Set
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#linnear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
This is what I use
X_train = X_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)
This is the solution
regressor.predict([[x_test]])
And for polynomial regression:
regressor_2.predict(poly_reg.fit_transform([[x_test]]))
Modify
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
to
regressor.fit(x_train.values.reshape(-1,1),y_train)
y_pred = regressor.predict(x_test.values.reshape(-1,1))

after using logistics regression i got ValueError: y should be a 1d array, got an array of shape (295, 7) instead

#splitting the dataset into dependent(y) and independent variable(x)
x = training_data.iloc[:,[0,2,3,4,5,6,7]].values
y = training_data.iloc[:,1].values
from sklearn.model_selection import train_test_split
x_train,y_train,x_test,y_test = train_test_split(x,y,test_size = 0.3,random_state = 0)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)
i am trying to use logistic regression to train independent(x_train) and dependent variable(y_train) but everytime i run the code i see error
ValueError: y should be a 1d array, got an array of shape (295, 7) instead.
i don't know what to do
You have an error when making the train_test_split.
Be aware of output variables order, the correct output is like below:
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size = 0.3,random_state=0)
Just changing this line, your problem should disappear.

how to fix ValueError: Expected 2D array, got 1D array instead?

from sklearn.linear_model import LinearRegression
X=data['reck']
y=data['price']
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)
linreg = LinearRegression().fit(X, y)
I wrote codes for linear regression problem but this error appeared when i want to see result this error is:
ValueError: Expected 2D array, got 1D array instead:
array=[122360. 122365. 49800. ... 2696. 2357. nan].
Reshape your data either using
array.reshape(-1, 1) if your data has a single feature or
array.reshape(1, -1) if it contains a single sample.
My model is just 1D. It tries to find relation between reception kilometer of cars and the price of services they have received.
chasis number reck price
0 999.JACJ5AT.SPC00 122360.0 330000
1 999.JACJ5AT.SPC00 122365.0 385000
2 999.JACS5AT.SPC00 49800.0 753500
3 999.JACS5AT.SPC00 49805.0 1732500
4 999.JACS5AT.SPC00 49908.0 1375000
The problem is the way you are declaring the X and Y
if you print shape of X or Y
X.shape
it would come something like this
(49,)
Which says 49 rows , but column is blank
to avoid this , you can edit your code like this
X=data[['reck']]
y=data[['price']]
when you print the shape
X.shape
the value would come something like this
(49,1)
When you pass these values to your model, model will not throw any error.
PS : i am also a new contributor, i tried to explain it as much i understand, however there could be more logical explanation to this
What about reshapeing array to 2D? (Note that the error message is verbose enough to propose it as well!)
from sklearn.linear_model import LinearRegression
X=data['reck'].reshape(-1, 1)
y=data['price']
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)
linreg = LinearRegression().fit(X, y)

how to pass float argument in predict function in Python?

I was following a course on machine learning where the instructor passes a float argument in predict function for polynomial linear regression and it works for him. However, when I pass the code it throws an error stating
"Expected 2D array, got scalar array instead".
I have tried to use the scalar into an array but it does not seem to work.
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)"""
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
The code seems to run smoothly for the instructor. However, I am getting the following error:
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
This is the error that I am getting.
Actually the predict function accepts 2D array as an input, so u can put 6.5 inside big brackets like this [[6.5]]
lin_reg.predict([[6.5]])
This will work.
Welcome to stackoverflow! You're more likely to get your question answered with a minimal reproducible example, and show at least a portion of any required external files. In this case, I think I've boiled it down to the essentials:
import pandas as pd
# Importing the dataset
salaries = [('Junior', 1, 50000),
('Associate', 2, 60000),
('Senior', 3, 70000),
('Manager', 4, 80000)]
df = pd.DataFrame(salaries)
X = df.iloc[:, 1:2].values
y = df.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Predicting a new result with Linear Regression
print(lin_reg.predict(6.5))
Although I can't be sure exactly what is in the Position_Salaries.csv, I assume based on other arguments that it looks something like what I've shown. Running that example returns the expected result of 76100 in python 3.6 with sklearn 0.19. If you still get an error, try updating sklearn
pip update sklearn
If you're still getting an error after that, not sure where the difference is, but you can spoof a 2d array by passing the argument like this: lin_reg.predict([[6.5]])

LinearRegression in Python giving incorrect results?

I have a comma-separated CSV file with two numerical columns - inputs and outputs. They are correlated in a (more or less linear function), see below. The sample I have is very small.
Below, is the Python code I wrote using sklearn in order to predict values. Somehow it's not giving me the correct values (reasonable predictions). I am quite new to this, so please bear with me.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Data.
89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259
Actually you are having problem understanding your own code.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
Until here what you have done is that you have loaded the dataframe. After that you seprated X and y from the dataset.
labels represent the y values.
train1 represent the x values.
Since you wrote you can't understand :- train1 = data.drop(['kg'], axis=1)
Let me explain this. What this does is that from the dataframe which consist both column 'kg' and 'cm'. It removes 'kg' column (axis = 1 means column, axis = 0 means row). Hence only 'cm' is remaining which is your x.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Now you train the model on x values which represents 'cm' and y values which represent 'kg'.
When you predict(80) what happens is that you input the 'cm' value to be 80. Let me just plot the 'cm' vs 'kg' for training data.
When you input height as 80 this means that you are going more left, even more left than your plot. Hence as you can see x decrease y increase. It means that as 'cm' decrease means 'kg' increase. Hence ouput is 110 which is more.
from io import StringIO
input_data=StringIO("""89,155\n
86,161\n
82.5,168\n
79.25,174\n
76.25,182\n
73,189\n
70,198\n
66.66,207\n
63.5,218\n
60.25,229\n
57,241\n
54,257\n
51,259""")
import pandas as pd
data = pd.read_csv(input_data, header=None, names=['kg', 'cm'])
labels = data['cm']
train1 = data.drop(['cm'], axis=1) #This is similar to selecting the kg column
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
import numpy as np
reg.predict(np.array([80]).reshape(-1, 1)) # 172.65013306.
I think you are having problems with small data size. The code flow looks normal to me, I would suggest you try to find the p-value for the input-output. This will tell you if the correlation found from your linear regression is significant or not (p-value <0.05).
You can find p-value using:
from scipy.stats import linregress
print(linregress(input, output))
To find p-value using scikit learn you probably need to use the formula to find p-value. Good luck.

Categories