I don't know whether my code is correct or not, but I got this error:
bad input shape (1, 301)
from sklearn import svm
import pandas as pd
clf = svm.SVC(gamma='scale')
df = pd.read_csv('C:\\Users\\Armin\\Desktop\\heart.csv')
x = [df.age[1:302], df.sex[1:302], df.cp[1:302], df.trestbps[1:302], df.chol[1:302], df.fbs[1:302], df.restecg[1:302], df.thalach[1:302], df.exang[1:302], df.oldpeak[1:302], df.slope[1:302], df.ca[1:302], df.thal[1:302]]
y = [df.target[1:302]]
clf.fit(x, y)
This is a very simple fix.
You need all the columns from df in x except the target column; for that, just do:
x = df.drop('target', axis=1)
And your target column will be:
y = df['target']
And now do your fit:
clf.fit(x, y)
It will work.
PS: What you were trying to do is pass a list of Series holding the feature values. What you actually need to do is pass the feature set and targets from the dataframe directly.
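Putting the pieces together, a minimal end-to-end sketch (assuming heart.csv has the columns shown in your snippet, with target as the label column):
from sklearn import svm
import pandas as pd
df = pd.read_csv('C:\\Users\\Armin\\Desktop\\heart.csv')
# features: every column except the target; shape (n_samples, n_features)
x = df.drop('target', axis=1)
# target: a 1-D Series of shape (n_samples,)
y = df['target']
clf = svm.SVC(gamma='scale')
clf.fit(x, y)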
Some more references for you to get started and keep going:
Read more about what to pass to the fit method here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit
Here is a super basic tutorial from the scikit-learn folks themselves: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
Related
I have imported values into Python from a PostgreSQL DB.
data = cur.fetchall()
The list looks like this:
[('Ending Crowds', 85, Decimal('50.49')), ('Salute Apollo', 73, Decimal('319.93'))][0]
I need to use 85 as X and Decimal('50.49') as Y in a LinearRegression model.
Then I imported the packages and class:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
I provide the data and perform linear regression:
X = data.iloc[:, 1].values.reshape(-1, 1)
Y = data.iloc[:, 2].values.reshape(-1, 1)
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
I am getting the error:
AttributeError: 'list' object has no attribute 'iloc'
I am a beginner to Python (I started just two days back) but need to do linear regression in Python for a project at my job. I think iloc can't be used on a list object, but I am not able to figure out how to pass the X and Y values to linear_regressor. All the examples of linear regression I can find use .CSV files. Please help me out.
No, you can't use .iloc on a list; it is for DataFrames.
Convert the list into a DataFrame and then try .iloc.
Your solution is below; please accept it if it is correct, because it's my first answer on Stack Overflow.
import pandas as pd
from decimal import Decimal
from sklearn.linear_model import LinearRegression
# I'm not sure what that trailing "[0]" in your list is, because I haven't
# worked with data fetched from PostgreSQL. Remove it first and store the list in temp.
temp = [('Ending Crowds', 85, Decimal('50.49')), ('Salute Apollo', 73, Decimal('319.93'))]
# Convert every field to a string; this strips the "Decimal" wrapper
data = []
for row in temp:
    data.append(list(map(str, row)))
data = pd.DataFrame(data, columns=["no_use", "X", "Y"])
X = data['X'].values.astype(float).reshape(-1, 1)
Y = data['Y'].values.astype(float).reshape(-1, 1)
print(X, Y)
linear_regressor = LinearRegression()  # create object for the class
linear_regressor.fit(X, Y)             # perform linear regression
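As an aside, a more direct sketch (assuming cur.fetchall() returns tuples shaped like the ones above): build the DataFrame straight from the rows and cast the Decimal column to float, which avoids the string round-trip:
import pandas as pd
from decimal import Decimal
from sklearn.linear_model import LinearRegression
rows = [('Ending Crowds', 85, Decimal('50.49')), ('Salute Apollo', 73, Decimal('319.93'))]  # e.g. rows = cur.fetchall()
data = pd.DataFrame(rows, columns=["name", "X", "Y"])  # "name" is unused below
X = data['X'].values.reshape(-1, 1)                # shape (n_samples, 1)
Y = data['Y'].astype(float).values.reshape(-1, 1)  # Decimal -> float
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)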
The Issue
To begin with, I'm pretty new to machine learning. I have decided to test some of the things that I have learned on some financial data. My machine learning model looks like this:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
df = pd.read_csv("/Users/Documents/Trading.csv")
poly_features = PolynomialFeatures(degree=2, include_bias=False)
linear_reg = LinearRegression(fit_intercept = True)
X = df[["open", "volume", "base volume", "RSI_14"]]
X_poly = poly_features.fit_transform(X)[1]
y = df[["high"]]
linear_reg.fit(X_poly, y)
x = linear_reg.predict([[1.905E-05, 18637.07503453,0.35522205, 69.95820948552947]])
print(x)
All works great until the moment I try to apply PolynomialFeatures, which produces the following error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Attempts to solve the issue:
Attempt 1
I've tried adding .values to X but the same error still comes up:
X_poly = poly_features.fit_transform(X.values)[1]
Attempt 2
I tried solving this problem by adding reshape(-1, 1) at the end of X_poly:
X_poly = poly_features.fit_transform(X)[1].reshape(-1, 1)
but it just replaces the previous error with this one:
ValueError: Found input variables with inconsistent numbers of samples: [14, 5696]
Thank you very much in advance for your help.
The error wants you to reshape your input. Try using X_poly = poly_features.fit_transform(X.values.reshape(1, -1))[1]
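If the goal is to fit on every row of the dataframe, the more common pattern is to keep the whole transformed matrix (no [1] indexing) and to run any new sample through the same transform before predicting. A minimal sketch, with random toy data standing in for the four trading columns:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
X = np.random.rand(100, 4)  # 100 samples, 4 features (toy stand-in)
y = np.random.rand(100)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)  # keep ALL rows: shape (100, 14)
linear_reg = LinearRegression()
linear_reg.fit(X_poly, y)
# a new sample must go through the same transform before predicting
x_new = poly_features.transform([[1.905E-05, 18637.07503453, 0.35522205, 69.95820948552947]])
print(linear_reg.predict(x_new))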
I use scikit-learn linear regression, and if I change the order of the features, the coefficients are still printed in the same order; hence I would like to know the mapping of the features to the coefficients.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print(model_1.coef_)
print(model_2.coef_)
print(model_3.coef_)
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_, model_1_features):
    coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
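For instance, a quick check on synthetic data (hypothetical column names) shows the zip mapping stays correct even when the feature order is shuffled:
import numpy as np
import pandas as pd
from sklearn import linear_model
rng = np.random.RandomState(0)
train_data = pd.DataFrame(rng.rand(50, 3), columns=['a', 'b', 'c'])
train_data['price'] = 2*train_data['a'] - 3*train_data['b'] + 5*train_data['c']
for features in (['a', 'b', 'c'], ['c', 'a', 'b']):  # same features, shuffled order
    m = linear_model.LinearRegression()
    m.fit(train_data[features], train_data['price'])
    print(dict(zip(features, m.coef_)))  # same feature-to-coefficient mapping either way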
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# assumes X_train is a DataFrame and y_train the matching target
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# one row per feature: the column name next to its coefficient
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns), "Coefs", regressor.coef_.transpose())
Robin posted a great answer, but for me I had to make one tweak for it to work the way I wanted: refer to the dimension of the coef_ array that I need, namely model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:], model_1_features):
    coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1, 1)
coeffs = np.reshape(np.round(clf.coef_, 5), (-1, 1))
coeffs = np.concatenate((fieldList, coeffs), axis=1)
print(pd.DataFrame(coeffs, columns=['Field', 'Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, so it is especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, zip silently drops the values at the extra index positions. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
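If you are on Python 3.10 or newer, passing strict=True makes zip raise instead of silently truncating, which turns the length mismatch into an explicit error:
In [2]: [i for i in zip([1, 2, 3], [4, 5, 6, 7], strict=True)]
ValueError: zip() argument 2 is longer than argument 1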
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great, but what personally worked for me was this, as the feature names I needed were the columns of my train_data dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in the attribute model.coef_[0]. We can iterate over the column names and store each column name and its coefficient value in a dictionary.
model.fit(X_train, y)
# assuming all the columns except the last one were used in training
columns = data.iloc[:, :-1].columns
coef_dict = {}
for i in range(0, len(columns)):
    coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
    data=model_1.coef_,
    index=model_1.feature_names_in_
)
A minimal reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
    data=np.random.random((10, 3)),
    columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coefficients and their respective feature names
coef_series = pd.Series(
    data=lr.coef_,
    index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64
My code is:
import pandas as pd
import numpy as np
from sklearn import svm
name = '../CLIWOC/CLIWOC15.csv'
data = pd.read_csv(name)
# Get info into dataframe and drop NaNs
data = pd.concat([data.UTC, data.Lon3, data.Lat3, data.Rain]).dropna(how='any')
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']]
y = data['Rain']
# Partition a test set
Xtest = X[-1]
ytest = y[-1]
X = X[1:-2]
y = y[1:-2]
# Train classifier
classifier = svm.svc(gamma=0.01, C=100.)
classifier.fit(X, y)
classifier.predict(Xtest)
y
Arriving at the 'set target' section, the interpreter raises the error 'Too Many Indexers'. I lifted this syntax directly from the documentation, so I'm unsure what could be wrong.
The csv is organized with these headers for columns of data.
Without your data, it is hard to verify. My immediate suspicion, however, is that you need to pass a numpy array instead of a DataFrame.
Try this to extract them:
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']].values
y = data['Rain'].values
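If you go the numpy route, the rest of your snippet would then use plain positional slicing; note also that the classifier class is svm.SVC (capitalized), since svm.svc raises an AttributeError. A rough sketch:
Xtest = X[-1:]          # keep it 2-D: one sample, three features
ytest = y[-1]
X, y = X[:-1], y[:-1]
classifier = svm.SVC(gamma=0.01, C=100.)
classifier.fit(X, y)
print(classifier.predict(Xtest))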
Use data.loc[['UTC', 'Lon3', 'Lat3']]. The same approach works with the iloc method as well.
Do not use something like data.loc[:, 0], etc.
I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:
data = pd.read_csv('xxxx.csv')
After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered
X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)
which resulted in the following error
IndexError: tuple index out of range
What's wrong here? Also, I'd like to know how to:
1. visualize the result
2. make predictions based on the result
I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.
Can you please help? Thank you very much for your time.
PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.
Let's assume your csv looks something like:
c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...
I generated the data as such:
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))
This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print(x) # prints: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
You need to take a look at the shape of the data you are feeding into .fit().
Here x.shape = (10,) but we need it to be (10, 1), see sklearn. Same goes for y. So we reshape:
x = x.reshape(length, 1)
y = y.reshape(length, 1)
Now we create the regression object and then call fit():
regr = linear_model.LinearRegression()
regr.fit(x, y)
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y, color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
See sklearn linear regression example.
Dataset
Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
Importing the dataset
dataset = pd.read_csv('1.csv')
X = dataset[["mark1"]]
y = dataset[["mark2"]]
Fitting Simple Linear Regression to the set
regressor = LinearRegression()
regressor.fit(X, y)
Predicting the set results
y_pred = regressor.predict(X)
Visualising the set results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('mark1 vs mark2')
plt.xlabel('mark1')
plt.ylabel('mark2')
plt.show()
I'm posting an answer that addresses exactly the error you got:
IndexError: tuple index out of range
Scikit-learn expects 2D inputs. Just reshape the X and Y.
Replace:
X=data['c1'].values # this has shape (XXX, ) - It's 1D
Y=data['c2'].values # this has shape (XXX, ) - It's 1D
linear_model.LinearRegression().fit(X,Y)
with
X=data['c1'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
Y=data['c2'].values.reshape(-1,1) # this has shape (XXX, 1) - it's 2D
linear_model.LinearRegression().fit(X,Y)
make predictions based on the result?
To predict,
lr = linear_model.LinearRegression().fit(X,Y)
lr.predict(X)
Is there any way I can view details of the regression?
The LinearRegression has coef_ and intercept_ attributes.
lr.coef_
lr.intercept_
show the slope and intercept.
You really should have a look at the docs for the fit method, which you can view here.
For how to visualize a linear regression, play with the example here. I'm guessing you haven't used IPython (now called Jupyter) much either, so you should definitely invest some time into learning it. It's a great tool for exploring data and machine learning. You can literally copy/paste the example from the scikit-learn linear regression page into an IPython notebook and run it.
For your specific problem with the fit method, by referring to the docs, you can see that the format of the data you are passing in for your X values is wrong.
Per the docs,
"X : numpy array or sparse matrix of shape [n_samples,n_features]"
You can fix your code like this:
X = [[x] for x in data['c1'].values]
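Putting it together, a minimal sketch of fitting and predicting with that fix:
from sklearn import linear_model
X = [[x] for x in data['c1'].values]  # list of single-feature rows
Y = data['c2'].values
lr = linear_model.LinearRegression().fit(X, Y)
print(lr.predict([[3.0]]))  # predict c2 for a new c1 value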