I am beginning to learn SVM and PCA.I tried to apply SVM on the Sci-Kit Learn 'load_digits' dataset.
When i apply the .fit method to SVC,i get an error:
"Expected 2D array, got 1D array instead:
array=[ 1.9142151 0.58897807 1.30203491 ... 1.02259477 1.07605691
-1.25769703].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature
or array.reshape(1, -1) if it contains a single sample."
Here is the code i wrote:**
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
from sklearn.svm import SVC
clf = SVC(kernel='rbf', C=1E6)
X=[reduced_data[:,0]
y=reduced_data[:,1]
clf.fit(X, y)
Can someone help me out?Thank you in advance.
Your error results from the fact that clf.fit() requires the array X to be of dimension 2 (currently it is 1 dimensional), and by using X.reshape(-1, 1), X becomes a (N,1) (2D - as we would like) array, as opposed to (N,) (1D), where N is the number of samples in the dataset. However, I also believe that your interpretation of reduced_data may be incorrect (from my limited experience of sklearn):
The reduced_data array that you have contains two principle components (the two most important features in the dataset, n_components=2), which you should be using as the new "data" (X).
Instead, you have taken the first column of reduced_data to be the samples X, and the second column to be the target values y. It is to my understanding that a better approach would be to make X = reduced_data since the sample data should consist of both PCA features, and make y = y_digits, since the labels (targets) are unchanged by PCA.
(I also noticed you defined pca = PCA(n_components=10).fit_transform(data), but did not go on to use it, so I have removed it from the code in my answer).
As a result, you would have something like this:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.svm import SVC
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
# pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
clf = SVC(kernel='rbf', C=1e6)
clf.fit(reduced_data, y_digits)
I hope this has helped!
Related
I am fitting a sklearn model on pandas dataframe and then trying to predict it on each row. Because the fitting and prediction dimension is different, I am facing following error.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris_dict = load_iris()
X = pd.DataFrame(iris_dict['data'])
y = pd.Series(iris_dict['target'])
clf = LogisticRegression()
clf.fit(X, y)
y_pred = clf.predict(X.loc[0,:])
Prediction on single row gives me an error
Expected 2D array, got 1D array instead:
How can I predict on each pandas row, one at a time. I have tried reshaping, it didn't work
Sklearn works with columns and primarily numpy. Your X.loc[0,:] is a series meaning when it is converted to numpy it is a 1D array. I believe simply calling X.loc[0,:].to_numpy().reshape(1,-1) would work
I am trying to learn Python and Data Science out of scratch using on line material.
I have just tried to create a simple linear regression model to get some hands on practice after reading a lot of material. However, I get the following error while trying to do it.
Can you kindly help to understand this error and see what I have done wrong.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import randn
np.random.seed(101)
df3=pd.DataFrame(randn(5,2),index ='0 1 2 3 4'.split(), columns='Test Price'.split())
y= df3['Price']
x= df3['Test']
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2, random_state=101)
from sklearn.linear_model import LinearRegression
lm2= LinearRegression()
lm2.fit(X_train,y_train)
Error
ValueError: Expected 2D array, got 1D array instead:
array=[-2.01816824 0.65111795 0.90796945 -0.84807698].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
check the doc Link
Parameters
X: {array-like, sparse matrix} of shape (n_samples, n_features) Training
data
So you will have to reshape your X to (n_samples, 1) in your case.
Use
lm2.fit(X_train.values.reshape(-1,1),y_train)
Hey there I'm using Label Encoder and Onehotencoder in my machine learning project sample but an error appeared while executing the code at the part where Onehotencoder executed and the error was Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample. and my feature column has only two attributes Negative or Positive.
What does this error message mean and how do I fix it
#read data set from excel
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('diab.csv')
feature=dataset.iloc[:,:-1].values
lablel=dataset.iloc[:,-1].values
#convert string data to binary
#transform sting data in lablel column to decimal/binary 0 /1
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
lab=LabelEncoder()
lablel=lab.fit_transform(lablel)
onehotencoder=OneHotEncoder()
lablel=onehotencoder.fit_transform(lablel).toarray()
#create trainning model and test it
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(feature,lablel,test_size=0.30)
#fitting SVM to trainnong set
from sklearn.svm import SVC
classifier=SVC(kernel='linear',random_state=0)
classifier.fit(x_train,y_train)
y_pred=classifier.predict(x_test)
#making the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
from sklearn.neighbors import KNeighborsClassifier
my_classifier=KNeighborsClassifier()
my_classifier.fit(x_train,y_train)
prediction=my_classifier.predict(x_test)
print(prediction)
from sklearn.metrics import accuracy_score
print (accuracy_score(y_test,prediction))
plot=plt.plot((prediction), 'b', label='GreenDots')
plt.show()
I suspect the issue is that you have 2 possible labels and are treating them as separate values. The output of an SVM is usually a single value, so your labels need to be a single value for each sample. Instead of mapping the labels to one hot vectors, instead just use a single value of 1 when the label is positive and a value of 0 when the label is negative.
I was following a course on machine learning where the instructor passes a float argument in predict function for polynomial linear regression and it works for him. However, when I pass the code it throws an error stating
"Expected 2D array, got scalar array instead".
I have tried to use the scalar into an array but it does not seem to work.
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)"""
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
The code seems to run smoothly for the instructor. However, I am getting the following error:
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
This is the error that I am getting.
Actually the predict function accepts 2D array as an input, so u can put 6.5 inside big brackets like this [[6.5]]
lin_reg.predict([[6.5]])
This will work.
Welcome to stackoverflow! You're more likely to get your question answered with a minimal reproducible example, and show at least a portion of any required external files. In this case, I think I've boiled it down to the essentials:
import pandas as pd
# Importing the dataset
salaries = [('Junior', 1, 50000),
('Associate', 2, 60000),
('Senior', 3, 70000),
('Manager', 4, 80000)]
df = pd.DataFrame(salaries)
X = df.iloc[:, 1:2].values
y = df.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Predicting a new result with Linear Regression
print(lin_reg.predict(6.5))
Although I can't be sure exactly what is in the Position_Salaries.csv, I assume based on other arguments that it looks something like what I've shown. Running that example returns the expected result of 76100 in python 3.6 with sklearn 0.19. If you still get an error, try updating sklearn
pip update sklearn
If you're still getting an error after that, not sure where the difference is, but you can spoof a 2d array by passing the argument like this: lin_reg.predict([[6.5]])
from tensorflow.examples.tutorials.mnist import input_data
mnist=input_data.read_data_sets('data/MNIST/', one_hot=True)
numpy implementation
# Entire Data set
Data=np.array(mnist.train.images)
#centering the data
mu_D=np.mean(Data, axis=0)
Data-=mu_D
COV_MA = np.cov(Data, rowvar=False)
eigenvalues, eigenvec=scipy.linalg.eigh(COV_MA, eigvals_only=False)
together = zip(eigenvalues, eigenvec)
together = sorted(together, key=lambda t: t[0], reverse=True)
eigenvalues[:], eigenvec[:] = zip(*together)
n=3
pca_components=eigenvec[:,:n]
print(pca_components.shape)
data_reduced = Data.dot(pca_components)
print(data_reduced.shape)
data_original = np.dot(data_reduced, pca_components.T) # inverse_transform
print(data_original.shape)
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
sklearn implementation
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(Data)
data_reduced = np.dot(Data, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
I'd like to implement PCA algorithms by using numpy. However I don't know how to reconstruct the images from that and I don't even know if this code is correct.
Actually, when I used sklearn.decomposition.PCA, the result is different from the numpy implementation.
Can you explain the differences?
I can spot a few differences already.
For one:
n=300
projections = only_2.dot(eigenvec[:,:n])
Xhat = np.dot(projections, eigenvec[:,:n].T)
Xhat += mu_D
plt.imshow(Xhat[5].reshape(28,28),cmap='Greys',interpolation='nearest')
The point I'm trying to make is, if my understanding is correct n = 300, you are trying to fit 300 eigen vectors whose eigen values go from high to low.
But in sklearn
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(only_2)
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # invers
It seems to me you are fitting just the FIRST component (the component that maximizes variance) and you're not taking all 300.
Further more:
One thing I can clearly, say is that you seem to understand what's happening in PCA but you're having trouble implementing it. Correct me if I'm wrong but:
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
In this part, you are trying to PROJECT your eigenvectors to your data which is what you should go about doing in PCA, but in sklearn, what you should do is the following:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=300)
pca.fit_transform(only_2)
If you could tell me how you created only_2, I can give you a much more specific answer tomorrow.
Here is what sklearn says about fit_transform for PCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform:
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
y : Ignored
Returns:
X_new : array-like, shape (n_samples, n_components)