I use sklearn to impute some time-series which include NaN values. At the moment, I use the following:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean')
signals = imp.fit_transform(array)
in which array is a numpy array of shape n_points x n_time_steps. It works fine, but I get a deprecation warning which suggests I should use SimpleImputer from sklearn.impute. Hence I replaced those lines with the following:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values='NaN', strategy='mean')
signals = imp.fit_transform(array)
but I get the following error on the last line:
ValueError: 'X' and 'missing_values' types are expected to be both
numerical. Got X.dtype=float32 and type(missing_values)=<class 'str'>.
If anybody has any idea what the cause of this error is, I'd be glad if you let me know. I am using Python 3.6.7 with sklearn 0.20.1. Thanks!
If array contains missing values represented as np.nan, you should pass np.nan as the missing_values argument to the constructor of SimpleImputer. That's the default argument, so this works:
imp = SimpleImputer(strategy='mean')
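Equivalently, since np.nan is the default, you can spell it out explicitly (reusing array from the question):
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
signals = imp.fit_transform(array)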
I am currently following along with a Machine Learning Full Course sponsored by Simplilearn to get a better understanding of regression, and am running into this error:
TypeError: __init__() got an unexpected keyword argument 'categorical_features'
From this code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
companies = pd.read_csv('Companies_1000.csv')
X = companies.iloc[:, :-1].values
X = companies.iloc[:, :4].values
companies.head()
cmap = sns.cm.rocket_r
sns.heatmap(companies.corr(), cmap = cmap)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()
print(X)
This is the csv file: https://raw.githubusercontent.com/boosuro/profit_estimation_of_companies/master/1000_Companies.csv
The video does not get the same error that I do, so I assume it is outdated; however, after crawling through the sklearn docs, I have come up empty-handed for a solution. I am using Python 3. If you want to check out exactly the info and code from the video, here it is:
https://www.youtube.com/watch?v=9f-GarcDY58
My error appears around the 47:25 mark. Thank you for checking this out, and thanks for your answers.
The error is due to the following line
onehotencoder = OneHotEncoder(categorical_features = [3])
There is no parameter named "categorical_features". Instead there is "categories", where you can pass a list of categories. By default, "categories" is set to "auto", which means the categories are determined automatically from the training data.
So you need not pass anything to the OneHotEncoder() function; just leave it as it is.
Change the line as below
onehotencoder = OneHotEncoder()
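If you still want to one-hot encode only column 3 (what categorical_features used to select), the replacement scikit-learn recommends is ColumnTransformer. A minimal sketch, reusing X from the question:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    [('encoder', OneHotEncoder(), [3])],  # one-hot encode column 3 only
    remainder='passthrough')              # keep the other columns unchanged
X = ct.fit_transform(X)
With this, the LabelEncoder step is no longer needed, since OneHotEncoder handles string categories directly.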
Trying to learn sklearn in Python. But the Jupyter notebook is giving an error saying "ValueError: Expected 2D array, got scalar array instead:
array=750.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
But I have already defined x to be a 2D array using x.values.reshape(-1,1).
You can find the CSV file and screenshot of the Error Code here -> https://github.com/CaptainRD/CSV-for-StackOverflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
data = pd.read_csv('1.02. Multiple linear regression.csv')
data.head()
x = data[['SAT','Rand 1,2,3']]
y = data['GPA']
reg = LinearRegression()
reg.fit(x,y)
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
reg.predict(1750)
As you can see in your code, your x has two variables, SAT and Rand 1,2,3, which means you need to provide a two-dimensional input to your predict method, for example:
reg.predict([[1750, 1]])
which returns:
>>> array([1.88])
You are facing this error because you did not provide the second value (for the Rand 1,2,3 variable). Note that if this variable is not important, you should remove it from your x data, as sketched below.
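A minimal sketch of that single-feature version, reusing the names from the question:
x = data[['SAT']]                  # keep only the SAT feature
reg = LinearRegression().fit(x, y)
reg.predict([[1750]])              # predict still needs a 2D input, one value per sample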
This model is mapping two inputs (SAT and Rand 1,2,3) to a single output (GPA), and thus requires a list of two elements as input for a valid prediction. I'm guessing the 1750 that you're supplying is meant to be the SAT value, but you also need to provide the Rand 1,2,3 value. Something like [1750, 1] would work.
I am taking my first steps with scikit library and found myself in need of backfilling only some columns in my data frame.
I have read carefully the documentation but I still cannot figure out how to achieve this.
To make this more specific, let's say I have:
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
And that I would like to fill in the second column with the mean but not the third. How can I do this with SimpleImputer (or another helper class)?
An evolution of this, and the natural follow-up question, is: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?
There is no need to use the SimpleImputer.
DataFrame.fillna() can do the work as well
For the second column, use
column.fillna(column.mean(), inplace=True)
For the third column, use
column.fillna(constant, inplace=True)
Of course, you will need to replace column with your DataFrame's column you want to change and constant with your desired constant.
Edit
Since the use of inplace is discouraged and will be deprecated, the syntax should be
column = column.fillna(column.mean())
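Putting this together on the array from the question (the column names and the constant 29 are chosen here just for illustration):
import numpy as np
import pandas as pd

A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
df = pd.DataFrame(A, columns=['a', 'b', 'c'])
df['b'] = df['b'].fillna(df['b'].mean())  # second column: fill with the mean
df['c'] = df['c'].fillna(29)              # third column: fill with a constant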
Following Dan's advice, an example of using ColumnTransformer and SimpleImputer to backfill the columns is:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
column_trans = ColumnTransformer(
    [('imp_col1', SimpleImputer(strategy='mean'), [1]),
     ('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
    remainder='passthrough')
print(column_trans.fit_transform(A)[:, [2,0,1]])
# [[7 2.0 3]
# [4 3.5 6]
# [10 5.0 29]]
This approach helps with constructing pipelines which are more suitable for larger applications.
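For instance, the transformer can be dropped straight into a Pipeline in front of an estimator (the LinearRegression here is just a hypothetical downstream model):
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

pipe = Pipeline([('impute', column_trans), ('model', LinearRegression())])
# pipe.fit(X_train, y_train) would now impute and fit in one call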
This is the method I use; you can replace low_cardinality_cols with whichever columns you want to encode. It also works if you just raise the uniqueness threshold to max(df.columns.nunique()).
# check the cardinality of the columns to encode
low_cardinality_cols = [cname for cname in df.columns if df[cname].nunique() < 16 and
                        df[cname].dtype == "object"]
Why these columns? It is recommended to encode only columns with cardinality near 10.
# Replace NaN first, otherwise you'll get stuck
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # feel free to use another strategy
df[low_cardinality_cols] = imp.fit_transform(df[low_cardinality_cols])
# Apply label encoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in low_cardinality_cols:
df[col] = label_encoder.fit_transform(df[col])
I am assuming you have your data as a pandas dataframe.
In this case, all you need to do to use the SimpleImputer from scikit-learn is to pick the specific column where you're looking to impute NaNs, say using the 'most_frequent' value, convert it to a NumPy array, and reshape it into a column vector.
An example of this is,
# Impute the missing values using the 'most_frequent' strategy
# We are using the California housing dataset in this example
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.read_csv('housing.csv')
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# SimpleImputer expects a column vector, so convert the pandas Series
housing['total_bedrooms'] = imp.fit_transform(housing['total_bedrooms'].to_numpy().reshape(-1,1))
Similarly, you can pick any column in your dataset, convert it into a NumPy array, reshape it, and use the SimpleImputer.
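A hypothetical helper wrapping those steps (the function name and its defaults are made up for illustration):
def impute_column(df, col, strategy='most_frequent'):
    # Impute a single DataFrame column and write the result back
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    df[col] = imp.fit_transform(df[col].to_numpy().reshape(-1, 1)).ravel()
    return df

housing = impute_column(housing, 'total_bedrooms')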
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean',axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3]=imputer.transform(X[:, 1:3])
Can you help me understand what the above code does? I don't know much about Imputer. Kindly help!
The confusing part is fit and transform.
# here the fit method calculates the required parameters (in this case the mean)
# and stores them in the imputer object
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3]=imputer.transform(X[:, 1:3])
# imputer.transform actually does the work of replacing NaN with the mean.
# This can be done in one step using fit_transform
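For example, the one-step version hinted at above would be:
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])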
Imputer is used to replace missing values. The fit method calculates the parameters, while the transform method changes the data, replacing those NaNs with the mean, and outputs a new matrix X.
# Import the Imputer class from the library
from sklearn.preprocessing import Imputer
# Create a new instance of the Imputer object
# Missing values are identified by NaN
# They will be replaced by the mean later on
# axis determines whether you work column-wise (axis=0) or row-wise (axis=1)
imputer = Imputer(missing_values='NaN', strategy='mean',axis=0)
# Fit the imputer to X
imputer = imputer.fit(X[:, 1:3])
# Replace in the original matrix X
# with the new values after the transformation of X
X[:, 1:3]=imputer.transform(X[:, 1:3])
I added comments to the code for you; I hope this makes it a bit clearer. You need to think of X as a matrix that you have to transform in order to have no more NaNs (missing values).
Refer to the documentation for more information.
Your comments tell you the difference. Without imputer.fit, you can't do the replacement of NaN by some statistic, for example the mean or the median, because that statistic has not been computed yet. To apply this process, you use imputer.transform after imputer.fit, and then you will have a new dataset without NaN values.
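The reason fit and transform are separate steps is that statistics learned on one dataset can be reused on another. A minimal sketch, assuming hypothetical X_train/X_test splits:
imputer = imputer.fit(X_train[:, 1:3])                # learn the column means from the training data
X_train[:, 1:3] = imputer.transform(X_train[:, 1:3])  # fill NaNs with those means
X_test[:, 1:3] = imputer.transform(X_test[:, 1:3])    # reuse the training means on the test data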
As far as I have understood:
Import a specific class from the library:
from sklearn.preprocessing import Imputer
Create an object of the class, configured for our particular data:
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
Apply it (as in applying a function to data) to the matrix X. For example, for an operator e applied to data d, Imputer.fit prepares e from d:
imputer = imputer.fit(X[:, 1:3])
Now Imputer.transform computes the value of e(d) and assigns it to the given matrix:
X[:, 1:3] = imputer.transform(X[:, 1:3])
I am trying to read in the complete Titanic dataset, which can be found here:
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('titanic3.xls')
y = dataset.iloc[:, 1].values
x = dataset.iloc[:, 2:14].values
# Create Dataset for Men
men_on_board = dataset[dataset['sex'] == 'male']
male_fatalities = men_on_board[men_on_board['survived'] ==0]
X_male = male_fatalities.iloc[:, [4,8]].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X_male[:,0])
X_male[:,0] = imputer.transform(X_male[:,0])
When I run all but the last line, I get the following warning:
/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
When I run the last line, it throws the following error:
File "<ipython-input-2-07afef05ee1c>", line 1, in <module>
X_male[:,0] = imputer.transform(X_male[:,0])
ValueError: could not broadcast input array from shape (523) into shape (682)
I've used the above code snippet for imputation on other projects; I'm not sure why it's not working here.
A quick solution is to change axis = 0 to axis = 1. This will make it work, though I'm not sure if that's what you want. So I want to give some explanation of what happened here, as follows:
The warning basically tells you that sklearn estimators now require 2D data arrays rather than 1D data arrays, because interpreting data as samples (rows) vs. as features (columns) matters. During this deprecation process, the requirement is enforced by np.atleast_2d, which assumes your data has a single sample (row). Meanwhile, you passed axis = 0 to the Imputer, which "imputes along columns" with strategy = 'mean'. However, you have only one row now, so when it comes across a missing value there is no mean with which to replace it. Therefore the entire column (which contains just that missing value) is discarded. As you can see, this is equal to
X_male[:,0][~np.isnan(X_male[:,0])].reshape(1, -1)
That's why the assignment X_male[:,0] = imputer.transform(X_male[:,0]) failed: X_male[:,0] has shape (682,) while imputer.transform(X_male[:,0]) has shape (523,). My previous solution basically changes it to "impute along rows", where you do have a mean with which to replace missing values. You won't drop anything this time, and imputer.transform(X_male[:,0]) has shape (682,), which can be assigned to X_male[:,0].
Now, I don't know why your code snippet for imputation works on other projects. For your specific case here, a (logically) better way with regard to the deprecation warning could be using X.reshape(-1, 1), since your data has a single feature and 682 samples. However, you need to reshape the transformed data back before it can be assigned to X_male[:,0]:
imputer = imputer.fit(X_male[:,0].reshape(-1, 1))
X_male[:,0] = imputer.transform(X_male[:,0].reshape(-1, 1)).reshape(-1)
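On recent scikit-learn versions, where Imputer has been removed, a sketch of the equivalent using SimpleImputer (which has no axis parameter and always imputes column-wise) would be:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_male[:, 0] = imputer.fit_transform(X_male[:, 0].reshape(-1, 1)).reshape(-1)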