df:
cont1 cont2 cont3 cont4 cont5 cont6 cont7
0 0.726300 0.245921 0.187583 0.789639 0.310061 0.718367 0.335060
1 0.330514 0.737068 0.592681 0.614134 0.885834 0.438917 0.436585
2 0.261841 0.358319 0.484196 0.236924 0.397069 0.289648 0.315545
3 0.321594 0.555782 0.527991 0.373816 0.422268 0.440945 0.391128
4 0.273204 0.159990 0.527991 0.473202 0.704268 0.178193 0.247408
Code:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
for each_column in df.columns:
    df[each_column].reshape(1, -1)  # suggested solution
    df[each_column] = min_max_scaler.fit_transform(df[each_column])
Warning:
validation.py:395: DeprecationWarning: Passing 1d arrays as
data is deprecated in 0.17 and will raise ValueError in 0.19.
Reshape your data either using X.reshape(-1, 1) if your data
has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
Please tell me what my mistake is. Is it because I am not passing the data to the preprocessor as a NumPy array?
I have tried the suggested solutions but still get the same warning.
The deprecation warning is telling you what to do.
Use either df[each_column].reshape(-1, 1) or df[each_column].reshape(1, -1)
If you read the documentation for Series, you'll also see that pandas uses an ndarray internally.
When something is deprecated, it means that it is no longer planned to be supported in future versions. As the message explains, passing a 1D array will start giving you an error in version 0.19. If you're writing new code, you should try to avoid using deprecated functions, and follow the recommendation of the message (use the reshape method for arrays).
Whether you call df[each_column].reshape(-1, 1) or df[each_column].reshape(1, -1) depends on the nature of the data contained in df[each_column], as explained by the deprecation warning message. It'll turn your 1D array into either a "column" or a "row" vector.
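As a minimal sketch of that fix (assuming the same df and scaler as in the question, and going through .values so reshape is called on the underlying ndarray rather than the Series):
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
for each_column in df.columns:
    # each column is a single feature, so reshape it into a column vector
    column_2d = df[each_column].values.reshape(-1, 1)
    df[each_column] = min_max_scaler.fit_transform(column_2d).ravel()
Scaling the whole frame in one call, min_max_scaler.fit_transform(df), would also avoid the loop entirely, since the input is then already 2D.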
Related
I use sklearn to impute some time-series which include NaN values. At the moment, I use the following:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean')
signals = imp.fit_transform(array)
in which array is a numpy array of shape n_points x n_time_steps. It works fine but I get a deprecation warning which suggest I should use SimpleImpute from sklearn.impute. Hence I replaced those lines with the following:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values='NaN', strategy='mean')
signals = imp.fit_transform(array)
but I get the following error on the last line:
ValueError: 'X' and 'missing_values' types are expected to be both
numerical. Got X.dtype=float32 and type(missing_values)=<class 'str'>.
If anybody has any idea what the cause of this error is, I'd be glad if you let me know. I am using Python 3.6.7 with sklearn 0.20.1. Thanks!
If array contains missing values represented as np.nan, you should use np.nan as the argument to the constructor of SimpleImputer. That's the default argument, so this works:
imp = SimpleImputer(strategy='mean')
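Equivalently, you can pass it explicitly (a minimal sketch, assuming array is a float ndarray whose missing entries are np.nan):
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
signals = imp.fit_transform(array)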
I have a problem using the MinMaxScaler from scikit-learn and cannot interpret the error message correctly, nor can I find information about it.
TypeError: ufunc 'subtract' output (typecode 'O') could not be coerced to provided output parameter (typecode 'd') according to the casting rule ''same_kind''
For the fit I used a matrix with the following formats:
(275, 821), numpy.ndarray, numpy.float64
The correctly transformed output was:
(275, 821), numpy.ndarray, numpy.float64
My input for the "way back" with inverse_transform:
(206, 821), numpy.ndarray, numpy.float64
I have done this before and it worked without any problems.
Obviously something about my data is different now, which I can't see and is not related to the format?
I would be happy if someone could explain the error message or give me another hint about what went wrong.
numpy 1.13.1,
pandas 0.20.3,
scikit-learn 0.19.0,
python 2.7.6
Thank you very much!
I discovered that the MinMaxScaler's attributes have been saved separately by joblib. The min_ attribute appears to be: joblib.numpy_pickle.NDArrayWrapper at 0x7fbc302253d0
For clarification: I save the scaler with joblib and load it before inverse_transform.
### X and Y are two matrices with values between 0-6000
X_frame = pd.DataFrame(X)
Y_frame = pd.DataFrame(Y)
XYdata = pd.concat([X_frame, Y_frame], axis=1)
XYdata = XYdata.as_matrix()
mm = MinMaxScaler((0,1))
XY_new = mm.fit_transform(XYdata)
np.save('data',XY_new)
filename_scaler = 'scaler.sav'
joblib.dump(mm, filename_scaler)
### There's a prediction algorithm in between I can't add because of company restrictions, the code returns the matrix data
scaler = joblib.load('scaler.sav')
new_data1 = scaler.inverse_transform(data)
Solved it.
joblib.dump(mm, filename_scaler, compress=1)
The problem was that joblib saved all attributes of the scaler separately and evidently didn't load them back.
Setting the compression level to 1 (True) solved the issue.
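For reference, a minimal round-trip sketch with compression enabled (same mm and XY_new as above; older code imported joblib from sklearn.externals instead of the standalone package):
import joblib

joblib.dump(mm, 'scaler.sav', compress=1)      # one compressed file, all attributes included
scaler = joblib.load('scaler.sav')
restored = scaler.inverse_transform(XY_new)    # back to the original 0-6000 range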
Thanks everyone who read and/or who replied.
I am trying to read in the complete Titanic dataset, which can be found here:
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('titanic3.xls')
y = dataset.iloc[:, 1].values
x = dataset.iloc[:, 2:14].values
# Create Dataset for Men
men_on_board = dataset[dataset['sex'] == 'male']
male_fatalities = men_on_board[men_on_board['survived'] ==0]
X_male = male_fatalities.iloc[:, [4,8]].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X_male[:,0])
X_male[:,0] = imputer.transform(X_male[:,0])
When I run all but the last line, I get the following warning:
/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
When I run the last line, it throws the following error:
File "<ipython-input-2-07afef05ee1c>", line 1, in <module>
X_male[:,0] = imputer.transform(X_male[:,0])
ValueError: could not broadcast input array from shape (523) into shape (682)
I've used the above code snippet for imputation on other projects, not sure why it's not working.
A quick solution is to change axis = 0 to axis = 1. This will make it work, though I'm not sure if that's what you want. So I want to explain what happened here, as follows:
The warning basically tells you that sklearn estimators now require 2D data arrays rather than 1D data arrays, because it matters whether the data is interpreted as samples (rows) or as features (columns). During this deprecation period, the requirement is enforced by np.atleast_2d, which assumes your data has a single sample (row). Meanwhile, you passed axis = 0 to the Imputer, which imputes along columns with strategy = 'mean'. However, you now have only one row, so when the imputer comes across a missing value there is no column mean to replace it with, and the entire column (which contains just that missing value) is discarded. As you can see, this is equivalent to
X_male[:,0][~np.isnan(X_male[:,0])].reshape(1, -1)
That's why the assignment X_male[:,0] = imputer.transform(X_male[:,0]) failed: X_male[:,0] has shape (682,) while imputer.transform(X_male[:,0]) has shape (523,). The quick fix above changes it to "impute along rows", where there is a mean available to replace missing values. Nothing is dropped this time, and imputer.transform(X_male[:,0]) has shape (682,), which can be assigned back to X_male[:,0].
I don't know why your imputation snippet worked on other projects. For your specific case here, a (logically) better way, with regard to the deprecation warning, is to use X.reshape(-1, 1), since your data has a single feature and 682 samples. However, you need to reshape the transformed data back before it can be assigned to X_male[:,0]:
imputer = imputer.fit(X_male[:,0].reshape(-1, 1))
X_male[:,0] = imputer.transform(X_male[:,0].reshape(-1, 1)).reshape(-1)
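For what it's worth, the fit and transform can also be collapsed into one step (a sketch, assuming the same Imputer settings and that column 0 is the column containing NaNs):
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
X_male[:, 0] = imputer.fit_transform(X_male[:, 0].reshape(-1, 1)).ravel()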
I am trying to perform a PCA Analysis on Data in a CSV File but I keep getting a weird warning when I attempt to scale the data.
def prepare_data(filename):
    df = pd.read_csv(filename, index_col=0)
    df.dropna(axis=0, how='any', inplace=True)
    return df

def perform_PCA(df):
    threshold = 0.3
    component = 1  # Second of two right now
    pca = decomposition.PCA(n_components=2)
    print df.head()
    scaled_data = preprocessing.scale(df)
    #pca.fit(scaled_data)
    #transformed = pca.transform(scaled_data)
    #pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
This is the warning I keep getting.
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\utils\validation.py:498: UserWarning: The scale function assumes floating point values as input, got int64
"got %s" % (estimator, X.dtype))
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:145: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
Xr -= mean_
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:153: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:158: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
Xr -= mean_1
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:160: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
Xr /= std_
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:169: UserWarning: Numerical issues were encountered when scaling the data and might not be solved. The standard deviation of the data is probably very close to 0.
warnings.warn("Numerical issues were encountered "
C:\Users\mbellissimo\AppData\Local\Continuum\Anaconda\lib\site-packages\sklearn\preprocessing\data.py:174: DeprecationWarning: Implicitly casting between incompatible kinds. In a future numpy release, this will raise an error. Use casting="unsafe" if this is intentional.
Xr -= mean_2
All the values in the CSV file are numbers. This is what the head looks like
TOOLS/TEST EQUIPMENT WIN PRODUCTIVITY/UTILITY \
HouseholdID
144748819 0 0
144764123 0 0
144765100 0 0
144765495 2 0
144765756 0 2
Can somebody please tell me why I am getting this warning and how I can fix it?
I figured it out. I had to convert my DataFrame into a NumPy matrix and cast its type to float.
numpyMatrix = df.as_matrix().astype(float)
scaled_data = preprocessing.scale(numpyMatrix)
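On newer pandas versions, where as_matrix has been removed, an equivalent sketch (assuming pandas 0.24+ with DataFrame.to_numpy available) would be:
numpyMatrix = df.to_numpy(dtype=float)
scaled_data = preprocessing.scale(numpyMatrix)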
I am generating a random.uniform(low=0.0, high=100.0, size=(150,150)) array.
I input this into a function that generates the X, x, and y.
However, if the random test matrix is greater than 100, I get the error below.
I have tried playing around with theta values.
Has anyone had this problem? Is this a bug?
I am using python2.6 and scikit-learn-0.10. Should I try python3?
Any suggestions or comments are welcome.
Thank you.
gp.fit( XKrn, yKrn )
File "/usr/lib/python2.6/scikit_learn-0.10_git-py2.6-linux-x86_64.egg/sklearn/gaussian_process/gaussian_process.py", line 258, in fit
raise ValueError("X and y must have the same number of rows.")
ValueError: X and y must have the same number of rows.
ValueError: X and y must have the same number of rows. means that in your case XKrn.shape[0] should be equal to yKrn.shape[0]. You probably have an error in the code generating the dataset.
Here is a working example:
In [1]: from sklearn.gaussian_process import GaussianProcess
In [2]: import numpy as np
In [3]: X, y = np.random.randn(150, 10), np.random.randn(150)
In [4]: GaussianProcess().fit(X, y)
Out[4]:
GaussianProcess(beta0=None,
corr=<function squared_exponential at 0x10d42aaa0>, normalize=True,
nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla',
random_start=1,
random_state=<mtrand.RandomState object at 0x10b4c8360>,
regr=<function constant at 0x10d42a488>, storage_mode='full',
theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
Python 3 is not supported yet and the latest released version of scikit-learn is 0.12.1 at this time.
I had the same problem, and the number of rows I was passing in was the same for my X and y.
In my case, the problem was that I was passing multiple output features in my y. Gaussian processes fit a single output feature.
The "number of rows" error was misleading and stemmed from the fact that I wasn't using the package correctly. To fit multiple output features like this, you'll need a separate GP for each feature, as sketched below.