Error: Found array with dim 3. Estimator expected <= 2 - python

I have a 14x5 data matrix titled data. The first column (Y) is the dependent variable followed by 4 independent variables (X,S1,S2,S3). When trying to fit a regression model to a subset of the independent variables ['S2'][:T] I get the following error:
ValueError: Found array with dim 3. Estimator expected <= 2.
I'd appreciate any insight on a fix. Code below.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
data = pd.read_csv('C:/path/Macro.csv')
T=len(data['X'])-1
#Fit variables
X = data['X'][:T]
S1 = data['S1'][:T]
S2 = data['S2'][:T]
S3 = data['S3'][:T]
Y = data['Y'][:T]
regressor = LinearRegression()
regressor.fit([[X,S1,S2,S3]], Y)

You are passing a 3-dimensional array as the first argument to fit(). X, S1, S2, S3 are all Series objects (1-dimensional), so the following
[[X, S1, S2, S3]]
is 3-dimensional. sklearn estimators expect an array of feature vectors (2-dimensional).
Try something like this:
# pandas indexing syntax (.ix was removed in pandas 1.0; use .loc / .iloc)
# .iloc[:T] slices by position and excludes the end point, like data['X'][:T]
X = data[['X', 'S1', 'S2', 'S3']].iloc[:T] # first T rows, feature columns
y = data['Y'].iloc[:T] # first T rows, Y column
regressor = LinearRegression()
regressor.fit(X, y)
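To make the dimensionality problem concrete, here is a minimal sketch. It assumes random stand-in data in place of Macro.csv (only the column names come from the question; the values and the seed are made up) and compares the shape of the nested list with the shape of a column selection.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for Macro.csv: 14 rows, columns Y, X, S1, S2, S3 (values are made up)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(14, 5)), columns=['Y', 'X', 'S1', 'S2', 'S3'])
T = len(data) - 1

# A doubly nested list of four Series converts to a 3-D array: (1, 4, T)
bad = np.asarray([[data['X'][:T], data['S1'][:T], data['S2'][:T], data['S3'][:T]]])
print(bad.shape)

# Selecting the columns from the DataFrame keeps the (n_samples, n_features) layout
features = data[['X', 'S1', 'S2', 'S3']].iloc[:T]
target = data['Y'].iloc[:T]
print(features.shape, target.shape)

# With a 2-D feature matrix the fit goes through
regressor = LinearRegression().fit(features, target)
print(regressor.coef_.shape)
```

The first shape has three axes, which is what the estimator rejects; the second has rows as samples and columns as features.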

Related

Not matching sample in y axis for knn

I'm trying to build a slightly more flexible knn input script than the tutorials based on the iris dataset, but I'm having some trouble (I think) adding the matching 2nd dimension to the numpy array in #6, and again at #11, the fitting.
File "G:\PROGRAMMERING\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 212, in check_consistent_length
" samples: %r" % [int(l) for l in lengths]) ValueError: Found input variables with inconsistent numbers of samples: [150, 1]
x is (150,5) and y is (150,1). 150 is the number of samples in both, but they differ in number of fields, is this the problem and if so how do I fix it?
#1. Loading the Pandas libraries as pd
import pandas as pd
import numpy as np
#2. Read data from the file 'custom.csv' placed in your code directory
data = pd.read_csv("custom.csv")
#3. Preview the first 5 lines of the loaded data
print(data.head())
print(type(data))
#4.Test the shape of the data
print(data.shape)
df = pd.DataFrame(data)
print(df)
#5. Convert non-numericals to numericals
print(df.dtypes)
# Any object should be converted to numerical
df['species'] = pd.Categorical(df['species'])
df['species'] = df.species.cat.codes
print("outcome:")
print(df.dtypes)
#6.Convert df to numpy.ndarray
np = df.to_numpy()
print(type(np)) #this should state <class 'numpy.ndarray'>
print(data.shape)
print(np)
x = np.data
y = [df['species']]
print(y)
#K-nearest neighbor (find closest) - search for the K nearest observations in the dataset
#The model calculates the distance to all, and selects the K nearest ones.
#8. Import the class you plan to use
from sklearn.neighbors import (KNeighborsClassifier)
#9. Pick a value for K
k = 2
#10. Instantiate the "estimator" (make an instance of the model)
knn = KNeighborsClassifier(n_neighbors=k)
print(knn)
#11. fit the model with data/model training
knn.fit(x, y)
#12. Predict the response for a new observation
print(knn.predict([[3, 5, 4, 2]]))
This is how I used the scikit-learn KNeighborsClassifier to fit the knn model:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
df = datasets.load_iris()
X = pd.DataFrame(df.data)
y = df.target
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X,y)
print(knn.predict([[6, 3, 5, 2]]))
#prints output class [2]
print(knn.predict([[3, 5, 4, 2]]))
#prints output class [1]
You don't need to convert the DataFrame to a numpy array; you can fit the model on the DataFrame directly. Also, when converting the DataFrame to a numpy array you named the result np, which shadows the numpy module imported at the top with import numpy as np.
The prediction input has 4 columns, which leaves the fifth column, 'species', out of the prediction. Also, if 'species' is the target, it cannot be given to the knn as an input at the same time. The pop call removes this column from the DataFrame df.
#npdf = df.to_numpy()
df = df.apply(lambda x:pd.Series(x))
y = np.asarray(df['species'])
#removes the target from the sample
df.pop('species')
x = df.to_numpy()
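Putting the pieces together, a minimal sketch might look like the following. It assumes custom.csv resembles the iris data (four measurement columns plus 'species'), so the DataFrame is built from scikit-learn's bundled iris dataset instead; the measurement column names are made up for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for custom.csv (assumption): four measurement columns plus an encoded 'species' column
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['species'] = iris.target

# Wrapping the Series in a list gives a list of length 1,
# which is where the "[150, 1]" in the error message comes from
wrapped = [df['species']]
print(len(wrapped))

# Separate the target from the features so both have 150 samples
y = df['species'].to_numpy()
x = df.drop(columns='species').to_numpy()
print(x.shape, y.shape)

knn = KNeighborsClassifier(n_neighbors=2).fit(x, y)
pred = knn.predict([[3, 5, 4, 2]])
print(pred)
```

Using a separate name for the converted array also leaves np free to refer to the numpy module.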

Bad Input Shape

I don't know whether my code is correct or not, but I got the error:
bad input shape (1, 301)
from sklearn import svm
import pandas as pd
clf = svm.SVC(gamma='scale')
df = pd.read_csv('C:\\Users\\Armin\\Desktop\\heart.csv')
x = [df.age[1:302], df.sex[1:302], df.cp[1:302], df.trestbps[1:302], df.chol[1:302], df.fbs[1:302], df.restecg[1:302], df.thalach[1:302], df.exang[1:302], df.oldpeak[1:302], df.slope[1:302], df.ca[1:302], df.thal[1:302]]
y = [df.target[1:302]]
clf.fit(x, y)
This is a very simple fix.
You need all the columns from df in x except the target column. For that, just do:
x = df.drop('target', axis=1)
And your target column will be:
y = df['target']
And now do your fit:
clf.fit(x, y)
It will work.
PS: What you were doing is passing a list of Series holding the feature values. What you need to do instead is pass the feature set and the targets from the dataframe directly.
Some more references for you to get started and keep going:
Read more about what to pass to the fit method here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit
Here is a super basic tutorial from the folks of scikit themselves: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
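To illustrate why the list-of-Series version fails, here is a small sketch. It assumes a made-up frame with three of the feature columns and 20 rows in place of heart.csv (the values are random); it compares the shapes and then fits on the DataFrame directly.

```python
import numpy as np
import pandas as pd
from sklearn import svm

# Made-up stand-in for heart.csv: three feature columns and a binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age': rng.integers(30, 80, size=20),
    'sex': rng.integers(0, 2, size=20),
    'chol': rng.integers(150, 300, size=20),
    'target': [0, 1] * 10,
})

# A list of Series is laid out as (n_features, n_samples), and [target] as (1, n_samples)
as_list_x = np.asarray([df.age, df.sex, df.chol])
as_list_y = np.asarray([df.target])
print(as_list_x.shape, as_list_y.shape)  # (3, 20) (1, 20)

# Taking the columns from the DataFrame gives (n_samples, n_features) and (n_samples,)
x = df.drop('target', axis=1)
y = df['target']
print(x.shape, y.shape)  # (20, 3) (20,)

clf = svm.SVC(gamma='scale').fit(x, y)
```

The (1, 20) shape here plays the role of the (1, 301) in the original error message.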

MLPClassifier: Expected 2D array got 1D array instead

Sup guys, I'm new to Python and new to Neural Networks as well. I'm trying to implement a Neural Network to predict the Close price of Bitcoin on a given day, based on the Open price of the same day. So I have a CSV file, and I'm trying to use the 'Open' column as input and the 'Close' column as target, as you can see in the code below:
from sklearn.neural_network import MLPClassifier
import numpy as np
import pandas as pd
dataset = pd.read_csv('BTC_USD.csv')
X = dataset['Open']
y = dataset['Close']
NeuralNetwork = MLPClassifier(verbose = True,
max_iter = 1000,
tol = 0,
activation = 'logistic')
NeuralNetwork.fit(X, y)
When I run the code I get this error:
ValueError: Expected 2D array, got 1D array instead:
array=[4.95100000e-02 4.95100000e-02 8.58400000e-02 ... 6.70745996e+03
6.66883984e+03 7.32675977e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
After this error, I did some research here in stackoverflow, and I tried some solutions proposed in other posts, like this one:
from sklearn.neural_network import MLPClassifier
import numpy as np
import pandas as pd
dataset = pd.read_csv('BTC_USD.csv')
X = np.array(dataset[['Open']])
X = X.reshape(-1, 1)
y = np.array(dataset[['Close']])
y = y.reshape(-1, 1)
NeuralNetwork = MLPClassifier(verbose = True,
max_iter = 1000,
tol = 0,
activation = 'logistic')
NeuralNetwork.fit(X, y)
After running this code, I get this new error:
ValueError: Unknown label type: (array([4.95100000e-02, 8.58400000e-02, 8.08000000e-02, ...,
6.66883984e+03, 6.30685010e+03, 7.49379980e+03]),)
and this ''warning'' at the first line (which contains the directory):
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Could you help me please? I tried many solutions, but none of them worked.
You should use the values attribute of a data frame column to get its elements as an array. In addition, what you want to achieve is a regression, not a classification, so you must use a regressor such as MLPRegressor, as follows:
from sklearn.neural_network import MLPRegressor
import numpy as np
import pandas as pd
dataset = pd.read_csv('BTC_USD.csv')
X = dataset["Open"].values.reshape(-1, 1)
y = dataset["Close"].values
NeuralNetwork = MLPRegressor(verbose = True,
max_iter = 1000,
tol = 0,
activation = "logistic")
NeuralNetwork.fit(X, y)
The code works now, but the results are not correct as you will need to work on the features and your network hyperparameters. But this is beyond the scope of SO.
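The shape requirements behind both errors can be checked on a tiny example. The sketch below assumes four made-up price rows in place of BTC_USD.csv; it shows the 1-D versus 2-D shapes and uses scikit-learn's type_of_target helper to show why a classifier rejects float prices as labels.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Made-up stand-in for BTC_USD.csv
dataset = pd.DataFrame({'Open': [1.0, 2.0, 3.0, 4.0], 'Close': [1.5, 2.5, 2.0, 4.5]})

# A single column is 1-D; estimators want X as (n_samples, n_features)
X_1d = dataset['Open'].values
X_2d = X_1d.reshape(-1, 1)
print(X_1d.shape, X_2d.shape)  # (4,) (4, 1)

# Double brackets give a column vector; ravel() flattens it to (n_samples,)
y_col = dataset[['Close']].values
y_1d = y_col.ravel()
print(y_col.shape, y_1d.shape)  # (4, 1) (4,)

# Float prices are a continuous target, which is what a regressor expects
kind = type_of_target(y_1d)
print(kind)
```

A 'continuous' target is the reason for the "Unknown label type" error with MLPClassifier, and the column-vector shape of y is the reason for the DataConversionWarning.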

Why does this return 'Too Many Indexers'?

My code is:
import pandas as pd
import numpy as np
from sklearn import svm
name = '../CLIWOC/CLIWOC15.csv'
data = pd.read_csv(name)
# Get info into dataframe and drop NaNs
data = pd.concat([data.UTC, data.Lon3, data.Lat3, data.Rain]).dropna(how='any')
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']]
y = data['Rain']
# Partition a test set
Xtest = X[-1]
ytest = y[-1]
X = X[1:-2]
y = y[1:-2]
# Train classifier
classifier = svm.svc(gamma=0.01, C=100.)
classifier.fit(X, y)
classifier.predict(Xtest)
y
On reaching the 'set target' section, the interpreter raises the error 'Too Many Indexers'. I lifted this syntax directly from the documentation, so I'm unsure what could be wrong.
The csv is organized with these headers for columns of data.
Without your data, it is hard to verify. My immediate suspicion, however, is that you need to pass a numpy array instead of a DataFrame.
Try this to extract them:
# Set target
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']].values
y = data['Rain'].values
Use data.loc[['UTC', 'Lon3', 'Lat3']]. This will also work in iloc method as well.
Do not use like data.loc[:, 0] etc...
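Another possible cause, which cannot be confirmed without the CSV: pd.concat without axis=1 stacks the four Series end to end into one long Series, and a Series only accepts a single indexer in .loc. The sketch below assumes a made-up four-row frame with the column names from the question and shows the difference.

```python
import pandas as pd

# Made-up stand-in for CLIWOC15.csv with one missing value
raw = pd.DataFrame({
    'UTC': [1.0, 2.0, 3.0, 4.0],
    'Lon3': [10.0, None, 12.0, 13.0],
    'Lat3': [50.0, 51.0, 52.0, 53.0],
    'Rain': [0.0, 1.0, 0.0, 1.0],
})

# Default axis=0 stacks the columns end to end: a single Series of length 16
stacked = pd.concat([raw.UTC, raw.Lon3, raw.Lat3, raw.Rain])
print(type(stacked).__name__, stacked.shape)

# Selecting the columns keeps a DataFrame, so row/column indexing works
data = raw[['UTC', 'Lon3', 'Lat3', 'Rain']].dropna(how='any')
X = data.loc[:, ['UTC', 'Lon3', 'Lat3']]
y = data['Rain']
print(X.shape, y.shape)

# Positional access to the last row goes through .iloc; the list keeps it 2-D
Xtest = X.iloc[[-1]]
print(Xtest.shape)
```

pd.concat([...], axis=1) would also produce a DataFrame with the columns side by side.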

How to fit multidimensional output using scikit-learn?

I am trying to fit one-vs-all classification output in training data; the rows of the output add up to 1.
One possible way is to read all the rows, find which column has the highest value, and prepare the data for training.
E.g. y = [[0.2,0.8,0],[0,1,0],[0,0.3,0.7]] can be reduced to y = [b,b,c], considering a, b, c as the classes corresponding to columns 0, 1, 2 respectively.
Is there a function in scikit-learn which helps to achieve such transformations?
This code does what you want:
import numpy as np
y = np.array([[0.2,0.8,0],[0,1,0],[0,0.3,0.7]])
def transform(y, labels):
    # map the index of the largest value in each row to its label
    labels = np.array(list(labels))
    return labels[y.argmax(axis=1)]
y = transform(y, 'abc')
EDIT: Using the comment by alko, I made it more general by letting the user supply the labels to the transform function.
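As for a built-in, one possible option (a sketch, not necessarily what the answer above had in mind) is LabelBinarizer from sklearn.preprocessing. Its inverse_transform picks the class with the greatest value when the rows are fractional. The class names 'a', 'b', 'c' are taken from the question.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

y = np.array([[0.2, 0.8, 0], [0, 1, 0], [0, 0.3, 0.7]])

# Fit on the class names so that columns 0, 1, 2 correspond to 'a', 'b', 'c'
lb = LabelBinarizer().fit(['a', 'b', 'c'])

# For fractional rows, inverse_transform returns the class of the largest entry
labels = lb.inverse_transform(y)
print(labels)
```

The classes are stored in sorted order, so the column order of y has to match sorted(class names) for this to line up.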
