I am trying to perform a PCA analysis on data in a CSV file, but I keep getting a warning when I attempt to scale the data.
import pandas as pd
from sklearn import decomposition, preprocessing

def prepare_data(filename):
    df = pd.read_csv(filename, index_col=0)
    df.dropna(axis=0, how='any', inplace=True)
    return df

def perform_PCA(df):
    threshold = 0.3
    component = 1  # Second of two right now
    pca = decomposition.PCA(n_components=2)
    print(df.head())
    scaled_data = preprocessing.scale(df)
    #pca.fit(scaled_data)
    #transformed = pca.transform(scaled_data)
    #pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
This is the warning I keep getting.
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\utils\validation.py:498: UserWarning: The scale function assumes floating point values as input, got int64
"got %s" % (estimator, X.dtype))
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:145: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
Xr -= mean_
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:153: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:158: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
Xr -= mean_1
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:160: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
Xr /= std_
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:169: UserWarning: Numerical issues were encountered when scaling the data and might not be solved. The standard deviation of the data is probably very close to 0.
warnings.warn("Numerical issues were encountered "
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:174: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
Xr -= mean_2
All the values in the CSV file are numbers. This is what the head looks like:
TOOLS/TEST EQUIPMENT WIN PRODUCTIVITY/UTILITY \
HouseholdID
144748819 0 0
144764123 0 0
144765100 0 0
144765495 2 0
144765756 0 2
Can somebody please tell me why I am getting this warning and how I can fix it?
I figured it out. I had to convert my DataFrame into a NumPy matrix and then cast the type to float.
numpyMatrix = df.as_matrix().astype(float)
scaled_data = preprocessing.scale(numpyMatrix)
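The same fix can also be written without the intermediate NumPy matrix, by casting the DataFrame's values to float before scaling. A minimal sketch, reusing the names from the code above (df is the DataFrame returned by prepare_data):
import pandas as pd
from sklearn import decomposition, preprocessing

scaled_data = preprocessing.scale(df.values.astype(float))   # float array, so no int64 warning
pca = decomposition.PCA(n_components=2)
transformed = pca.fit_transform(scaled_data)
pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)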
I have a problem using the MinMaxScaler from scikit-learn and cannot interpret the error message correctly, nor can I find information about it.
TypeError: ufunc 'subtract' output (typecode 'O') could not be coerced to provided output parameter (typecode 'd') according to the casting rule ''same_kind''
For the fit I used a matrix with the following formats:
(275, 821), numpy.ndarray, numpy.float64
The correctly transformed output was:
(275, 821), numpy.ndarray, numpy.float64
My input for the "way back" with inverse_transform:
(206, 821), numpy.ndarray, numpy.float64
I have done this before and it worked without any problems.
Apparently something about my data is different now that I can't see, and it does not seem to be related to the format.
I would be happy if someone could explain the error message or give me another hint about what went wrong.
numpy 1.13.1,
pandas 0.20.3,
scikit-learn 0.19.0,
python 2.7.6
Thank you very much!
I discovered that the MinMaxScaler's attributes have been saved separately by joblib. The min_ attribute appears to be: joblib.numpy_pickle.NDArrayWrapper at 0x7fbc302253d0
For clarification: I save the scaler with joblib and load it before inverse_transform.
### X and Y are two matrices with values between 0-6000
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.externals import joblib

X_frame = pd.DataFrame(X)
Y_frame = pd.DataFrame(Y)
XYdata = pd.concat([X_frame, Y_frame], axis=1)
XYdata = XYdata.as_matrix()

mm = MinMaxScaler((0, 1))
XY_new = mm.fit_transform(XYdata)
np.save('data', XY_new)

filename_scaler = 'scaler.sav'
joblib.dump(mm, filename_scaler)

### There's a prediction algorithm in between that I can't add because of company restrictions; the code returns the matrix data
scaler = joblib.load('scaler.sav')
new_data1 = scaler.inverse_transform(data)
Solved it.
joblib.dump(mm, filename_scaler, compress =1)
The problem was that joblib saved all attributes of the scaler separately and apparently didn't load them back.
Setting the compression level to 1 (True) solved the issue.
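As a quick sanity check (a sketch reusing the names above), you can verify after dumping with compress=1 that the reloaded scaler's attributes come back as plain NumPy arrays rather than NDArrayWrapper objects, and that inverse_transform round-trips the scaled data:
joblib.dump(mm, filename_scaler, compress=1)
scaler = joblib.load(filename_scaler)
print(type(scaler.min_), type(scaler.scale_))   # expect numpy.ndarray for both
restored = scaler.inverse_transform(XY_new)     # should reproduce XYdata (up to float error)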
Thanks to everyone who read and/or replied.
I am trying to read in the complete Titanic dataset, which can be found here:
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('titanic3.xls')
y = dataset.iloc[:, 1].values
x = dataset.iloc[:, 2:14].values
# Create Dataset for Men
men_on_board = dataset[dataset['sex'] == 'male']
male_fatalities = men_on_board[men_on_board['survived'] ==0]
X_male = male_fatalities.iloc[:, [4,8]].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X_male[:,0])
X_male[:,0] = imputer.transform(X_male[:,0])
When I run all but the last line, I get the following warning:
/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
When I run the last line, it throws the following error:
File "<ipython-input-2-07afef05ee1c>", line 1, in <module>
X_male[:,0] = imputer.transform(X_male[:,0])
ValueError: could not broadcast input array from shape (523) into shape (682)
I've used the above code snippet for imputation on other projects, not sure why it's not working.
A quick solution is to change axis = 0 to axis = 1. This will make it work, though I'm not sure if that's what you want. So I want to give some explanation of what happened here, as follows:
The warning basically tells you that sklearn estimators now require 2D data arrays rather than 1D data arrays, because it matters whether data is interpreted as samples (rows) or as features (columns). During this deprecation period, the requirement is enforced by np.atleast_2d, which assumes your data has a single sample (one row). Meanwhile, you passed axis = 0 to the Imputer, which imputes along columns using strategy = 'mean'. However, you now have only one row. When the imputer comes across a missing value, there is no column mean to replace it with, so the entire column (which contains just that missing value) is discarded. As you can see, this is equivalent to
X_male[:,0][~np.isnan(X_male[:,0])].reshape(1, -1)
That's why the assignment X_male[:,0] = imputer.transform(X_male[:,0]) failed: X_male[:,0] has shape (682,) while imputer.transform(X_male[:,0]) has shape (523,). My quick solution above basically changes it to "impute along rows", where you do have a mean to replace missing values. Nothing is dropped this time, and imputer.transform(X_male[:,0]) has shape (682,), which can be assigned to X_male[:,0].
Now I don't know why your code snippet for imputation works in other projects. For your specific case here, a (logically) better way, with regard to the deprecation warning, is to use X.reshape(-1, 1), since your data has a single feature and 682 samples. However, you need to reshape the transformed data back before it can be assigned to X_male[:,0]:
imputer = imputer.fit(X_male[:,0].reshape(-1, 1))
X_male[:,0] = imputer.transform(X_male[:,0].reshape(-1, 1)).reshape(-1)
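Equivalently, fit and transform can be combined into a single call. A compact sketch, reusing the question's variable names and assuming X_male is the two-column numeric array built above:
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
age_column = X_male[:, 0].reshape(-1, 1)                   # 2-D, shape (682, 1)
X_male[:, 0] = imputer.fit_transform(age_column).ravel()   # flatten back to 1-D for assignment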
df:
cont1 cont2 cont3 cont4 cont5 cont6 cont7
0 0.726300 0.245921 0.187583 0.789639 0.310061 0.718367 0.335060
1 0.330514 0.737068 0.592681 0.614134 0.885834 0.438917 0.436585
2 0.261841 0.358319 0.484196 0.236924 0.397069 0.289648 0.315545
3 0.321594 0.555782 0.527991 0.373816 0.422268 0.440945 0.391128
4 0.273204 0.159990 0.527991 0.473202 0.704268 0.178193 0.247408
Code:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
for each_column in df.columns:
    df[each_column].reshape(1, -1)  # suggested solution
    df[each_column] = min_max_scaler.fit_transform(df[each_column])
Warning:
validation.py:395: DeprecationWarning: Passing 1d arrays as
data is deprecated in 0.17 and will raise ValueError in 0.19.
Reshape your data either using X.reshape(-1, 1) if your data
has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
Please suggest what the mistake is. Is it because I am not passing the data to the preprocessor as a numpy array?
I have tried the suggested solutions but am still getting the same warning.
The deprecation warning is telling you what to do.
Use either df[each_column].reshape(-1, 1) or df[each_column].reshape(1, -1).
If you read the documentation for Series you'll also see that Pandas uses ndarray internally.
When something is deprecated, it means that it is no longer planned to be supported in future versions. As the message explains, passing a 1D array will start giving you an error in version 0.19. If you're writing new code, you should try to avoid using deprecated functions, and follow the recommendation of the message (use the reshape method for arrays).
Whether you call df[each_column].reshape(-1, 1) or df[each_column].reshape(1, -1) depends on the nature of the data contained in df[each_column], as explained by the deprecation warning message. It'll turn your 1D array into either a "column" or a "row" vector.
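For the per-column loop in the question, one way that silences the warning (a sketch, assuming df holds only the numeric columns shown above) is to reshape each column's underlying array into a 2-D column vector before scaling, or simply to pass the whole DataFrame, which is already 2-D:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()

# Per column: reshape the 1-D values to shape (n_rows, 1), then flatten the result back.
for each_column in df.columns:
    df[each_column] = min_max_scaler.fit_transform(
        df[each_column].values.reshape(-1, 1)).ravel()

# Or in one call, since a DataFrame is already a 2-D array of features:
df[:] = min_max_scaler.fit_transform(df)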
Code:
import numpy
from matplotlib.mlab import PCA
file_name = "store1_pca_matrix.txt"
ori_data = numpy.loadtxt(file_name,dtype='float', comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0)
result = PCA(ori_data)
This is my code. Though my input matrix is devoid of NaN and inf values, I still get the error stated below.
raise LinAlgError("SVD did not converge")
LinAlgError: SVD did not converge
What's the problem?
This can happen when there are inf or nan values in the data.
Use this to remove NaN values (dropna is a pandas method, so this assumes ori_data is a DataFrame):
ori_data.dropna(inplace=True)
I know this post is old, but in case someone else encounters the same problem: @jseabold was right when he said that the problem is NaN or inf, and the OP was probably right when he said that the data did not have NaNs or infs. However, if one of the columns in ori_data always has the same value, the data will get NaNs, since the implementation of PCA in mlab normalizes the input data by doing
ori_data = (ori_data - mean(ori_data)) / std(ori_data)
The solution is to do:
result = PCA(ori_data, standardize=False)
In this way, only the mean will be subtracted without dividing by the standard deviation.
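As a quick diagnostic (a sketch, assuming ori_data is the array loaded above), you can list the columns whose standard deviation is zero; those are the ones that turn into NaNs when mlab's PCA standardizes the data:
import numpy as np

constant_columns = np.where(ori_data.std(axis=0) == 0)[0]
print(constant_columns)   # any index listed here will break the default standardization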
If there are no inf or NaN values, it is possibly a memory issue. Please try on a machine with more RAM.
I do not have an answer to this question, but I have a reproduction scenario with no NaNs and infs. Unfortunately, the dataset is pretty large (96 MB gzipped).
import numpy as np
from StringIO import StringIO
from scipy import linalg
import urllib2
import gzip
url = 'http://physics.muni.cz/~vazny/gauss/X.gz'
X = np.loadtxt(gzip.GzipFile(fileobj=StringIO(urllib2.urlopen(url).read())), delimiter=',')
linalg.svd(X, full_matrices=False)
which raises:
LinAlgError: SVD did not converge
on:
>>> np.__version__
'1.8.1'
>>> import scipy
>>> scipy.__version__
'0.10.1'
but did not raise an exception on:
>>> np.__version__
'1.8.2'
>>> import scipy
>>> scipy.__version__
'0.14.0'
Following on @c-chavez's answer, what worked for me was first replacing inf and -inf with NaN, then removing the NaNs.
For example:
data = data.replace(np.inf, np.nan).replace(-np.inf, np.nan).dropna()
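If the data is a plain NumPy array rather than a DataFrame (as with the loadtxt result earlier in this thread), a rough equivalent is to keep only the rows whose values are all finite:
import numpy as np

mask = np.isfinite(ori_data).all(axis=1)   # False for rows containing NaN, +inf or -inf
clean_data = ori_data[mask]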
Even if your data is correct, this may happen because the computation runs out of memory. In my case, moving from a 32-bit machine to a 64-bit machine with more memory solved the problem.
I had this error multiple times:
- If the length of the data is 1: then it can't fit anything.
- If a value is infinity: did you divide by 0 somewhere in your processing?
- If a value is None: this is very common.
This may be due to the singular nature of your input data matrix (which you are feeding to PCA).
This happened to me when I accidentally resized an image dataset to (0, 64, 3). Try checking the shape of your dataset to see if one of the dimensions is 0.
I am using numpy 1.11.0. If the matrix has more than one eigenvalue equal to 0, then 'SVD did not converge' is raised.
I am generating a random.uniform(low=0.0, high=100.0, size=(150,150)) array.
I input this into a function that generates the X, x, and y.
However, if the random test matrix is greater than 100, I get the error below.
I have tried playing around with theta values.
Has anyone had this problem? Is this a bug?
I am using Python 2.6 and scikit-learn 0.10. Should I try Python 3?
Any suggestions or comments are welcome.
Thank you.
gp.fit( XKrn, yKrn )
File "/usr/lib/python2.6/scikit_learn-0.10_git-py2.6-linux-x86_64.egg/sklearn/gaussian_process/gaussian_process.py", line 258, in fit
raise ValueError("X and y must have the same number of rows.")
ValueError: X and y must have the same number of rows.
The error ValueError: X and y must have the same number of rows means that in your case XKrn.shape[0] should be equal to yKrn.shape[0]. You probably have an error in the code generating the dataset.
Here is a working example:
In [1]: from sklearn.gaussian_process import GaussianProcess
In [2]: import numpy as np
In [3]: X, y = np.random.randn(150, 10), np.random.randn(150)
In [4]: GaussianProcess().fit(X, y)
Out[4]:
GaussianProcess(beta0=None,
corr=<function squared_exponential at 0x10d42aaa0>, normalize=True,
nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla',
random_start=1,
random_state=<mtrand.RandomState object at 0x10b4c8360>,
regr=<function constant at 0x10d42a488>, storage_mode='full',
theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
Python 3 is not supported yet and the latest released version of scikit-learn is 0.12.1 at this time.
I had the same problem, and the number of rows I was passing in was the same in my X and y.
In my case, the problem was in fact that I was passing in multiple output features to fit against. This GaussianProcess implementation fits to a single output feature.
The "number of rows" error was misleading and stemmed from the fact that I wasn't using the package correctly. To fit multiple output features like this, you'll need a GP for each feature.