I have a machine learning classification problem with 80% categorical variables. Must I use one-hot encoding if I want to use some classifier for the classification? Can I pass the data to a classifier without the encoding?
I am trying to do the following for feature selection:
I read the train file:
num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
I change the type of the categorical features to 'category':
non_categorial_features = ['orig_destination_distance',
'srch_adults_cnt',
'srch_children_cnt',
'srch_rm_cnt',
'cnt']
for categorical_feature in list(train_small.columns):
if categorical_feature not in non_categorial_features:
train_small[categorical_feature] = train_small[categorical_feature].astype('category')
I use one hot encoding:
train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
The problem is that the third part often gets stuck, even though I am using a strong machine.
Thus, without the one-hot encoding I can't do any feature selection to determine the importance of the features.
What do you recommend?
Approach 1: You can use pandas' pd.get_dummies.
Example 1:
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]:
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
Example 2:
The following will transform a given column into a one-hot encoding. Use the prefix argument so the dummy columns stay distinguishable when you encode several columns.
import pandas as pd
df = pd.DataFrame({
'A':['a','b','a'],
'B':['b','a','c']
})
df
Out[]:
A B
0 a b
1 b a
2 a c
# Get one hot encoding of column B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df
Out[]:
A a b c
0 a 0 1 0
1 b 1 0 0
2 a 0 0 1
Approach 2: Use Scikit-learn
Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.
Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
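Note that the attributes shown above (n_values_, feature_indices_) come from an older scikit-learn release; in recent versions the encoder infers the categories itself and exposes them via categories_. A minimal sketch of the current API, with made-up data, might look like this:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
X_train = [['red', 'S'], ['green', 'M'], ['blue', 'L']]
enc.fit(X_train)

print(enc.categories_)  # categories inferred per column
# 'purple' was never seen during fit; with handle_unknown='ignore'
# its colour block simply becomes all zeros instead of raising an error.
print(enc.transform([['green', 'S'], ['purple', 'L']]).toarray())
# [[0. 1. 0. 0. 0. 1.]
#  [0. 0. 0. 1. 0. 0.]]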
It is much easier to use Pandas for basic one-hot encoding. If you're looking for more options, you can use scikit-learn.
For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.
For example, if I have a dataframe called imdb_movies:
...and I want to one-hot encode the Rated column, I do this:
pd.get_dummies(imdb_movies.Rated)
This returns a new dataframe with a column for every "level" of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.
Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy-coded frame onto the original frame using "column-binding".
We can column-bind by using Pandas' concat function:
rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)
We can now run an analysis on our full dataframe.
SIMPLE UTILITY FUNCTION
I would recommend making yourself a utility function to do this quickly:
def encode_and_bind(original_dataframe, feature_to_encode):
dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
res = pd.concat([original_dataframe, dummies], axis=1)
return(res)
Usage:
encode_and_bind(imdb_movies, 'Rated')
Result:
Also, as per @pmalbu's comment, if you would like the function to remove the original feature_to_encode, then use this version:
def encode_and_bind(original_dataframe, feature_to_encode):
dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
res = pd.concat([original_dataframe, dummies], axis=1)
res = res.drop([feature_to_encode], axis=1)
return(res)
You can encode multiple features at the same time as follows (note that the result of each iteration is fed back in, otherwise only the last feature's encoding would be kept):
features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                      'feature_4']
res = train_set
for feature in features_to_encode:
    res = encode_and_bind(res, feature)
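Alternatively, pd.get_dummies can encode several columns in one call and drop the originals for you via its columns argument; a small sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'Rated': ['PG', 'R', 'PG'], 'Year': [1999, 2001, 2005]})
# Only the listed columns are encoded and replaced; the rest are left untouched.
encoded = pd.get_dummies(df, columns=['Rated'])
print(encoded)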
You can do it with numpy.eye, using the array element selection mechanism:
import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]
def indices_to_one_hot(data, nb_classes):
"""Convert an iterable of indices to one-hot encoded labels."""
targets = np.array(data).reshape(-1)
return np.eye(nb_classes)[targets]
The return value of indices_to_one_hot(data, nb_classes) is now
array([[[ 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0.],
[ 1., 0., 0., 0., 0., 0.]]])
The .reshape(-1) is there to make sure you have the right labels format (you might also have [[2], [3], [4], [0]]).
One hot encoding with pandas is very easy:
def one_hot(df, cols):
    """
    :param df: pandas DataFrame
    :param cols: a list of columns to encode
    :return: a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df
EDIT:
Another way to one-hot encode, using sklearn's LabelBinarizer:
from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later
def one_hot_encode(x):
"""
One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
: x: List of sample Labels
: return: Numpy array of one-hot encoded labels
"""
return label_binarizer.transform(x)
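For instance, a possible usage (the label list here is made up):
from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer()
label_binarizer.fit(['cat', 'dog', 'mouse'])   # fit once on all labels you expect to see
print(label_binarizer.transform(['dog', 'cat']))
# [[0 1 0]
#  [1 0 0]]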
Firstly, the easiest way to one-hot encode: use sklearn.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
Secondly, I don't think using pandas for one-hot encoding is that simple (though I haven't confirmed this):
Creating dummy variables in pandas for python
Lastly, is it necessary for you to one-hot encode? One-hot encoding greatly increases the number of features, drastically increasing the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead you can use integer (label) coding, which is what the code below does.
Using label encoding usually works well, for much less run time and complexity. A wise prof once told me, 'Less is More'.
Here's the code for my custom encoding function if you want.
from sklearn.preprocessing import LabelEncoder
#Auto encodes any dataframe column of type category or object.
def dummyEncode(df):
columnsToEncode = list(df.select_dtypes(include=['category','object']))
le = LabelEncoder()
for feature in columnsToEncode:
try:
df[feature] = le.fit_transform(df[feature])
except:
print('Error encoding '+feature)
return df
EDIT: A comparison to make it clearer:
One-hot encoding: converts a column with n levels into n-1 indicator columns (one level serves as the reference).
Index Animal Index cat mouse
1 dog 1 0 0
2 cat --> 2 1 0
3 mouse 3 0 1
You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.
Label encoding (what the code above does):
Index Animal Index Animal
1 dog 1 0
2 cat --> 2 1
3 mouse 3 2
It converts the categories to numerical representations instead. This greatly saves feature space, at the cost of a bit of accuracy, since it imposes an artificial ordering on the levels.
You can use the numpy.eye function:
import numpy as np
def one_hot_encode(x, n_classes):
"""
One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
: x: List of sample Labels
: return: Numpy array of one-hot encoded labels
"""
return np.eye(n_classes)[x]
def main():
    labels = [0, 1, 2, 3, 4, 3, 2, 1, 0]   # avoid shadowing the built-in list
    n_classes = 5
    one_hot_list = one_hot_encode(labels, n_classes)
    print(one_hot_list)
if __name__ == "__main__":
main()
Result
D:\Desktop>python test.py
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 0.]]
pandas has a built-in function, get_dummies, to get the one-hot encoding of a particular column (or columns).
One line of code for one-hot encoding:
df=pd.concat([df,pd.get_dummies(df['column name'],prefix='column name')],axis=1).drop(['column name'],axis=1)
Here is a solution using DictVectorizer and the Pandas DataFrame.to_dict('records') method.
>>> import pandas as pd
>>> X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],
'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']
})
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> qualitative_features = ['country','race']
>>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
>>> v.vocabulary_
{'country=CAN': 0,
'country=MEX': 1,
'country=US': 2,
'race=Black': 3,
'race=Latino': 4,
'race=White': 5}
>>> X_qual.toarray()
array([[ 0., 0., 1., 0., 0., 1.],
[ 1., 0., 0., 1., 0., 0.],
[ 0., 0., 1., 0., 1., 0.],
[ 1., 0., 0., 0., 0., 1.],
[ 0., 1., 0., 0., 0., 1.],
[ 0., 0., 1., 1., 0., 0.]])
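Because the vectorizer is now fitted, the same instance can be reused on new rows so the column layout stays consistent; continuing the session above (the extra rows are made up):
>>> X_new = pd.DataFrame({'income': [42000, 51000],
                          'country': ['US', 'MEX'],
                          'race': ['Black', 'White']})
>>> v.transform(X_new[qualitative_features].to_dict('records')).toarray()
array([[ 0.,  0.,  1.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.]])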
One-hot encoding requires a bit more than converting the values to indicator variables. Typically, an ML process requires you to apply this encoding several times to validation or test data sets, and to apply the model you construct to real-time observed data. You should store the mapping (transform) that was used to construct the model. A good solution would use DictVectorizer or LabelEncoder (followed by get_dummies). Here is a function that you can use:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

def oneHotEncode2(df, le_dict = {}):
if not le_dict:
columnsToEncode = list(df.select_dtypes(include=['category','object']))
train = True;
else:
columnsToEncode = le_dict.keys()
train = False;
for feature in columnsToEncode:
if train:
le_dict[feature] = LabelEncoder()
try:
if train:
df[feature] = le_dict[feature].fit_transform(df[feature])
else:
df[feature] = le_dict[feature].transform(df[feature])
df = pd.concat([df,
pd.get_dummies(df[feature]).rename(columns=lambda x: feature + '_' + str(x))], axis=1)
df = df.drop(feature, axis=1)
except:
print('Error encoding '+feature)
#df[feature] = df[feature].convert_objects(convert_numeric='force')
df[feature] = df[feature].apply(pd.to_numeric, errors='coerce')
return (df, le_dict)
This works on a pandas dataframe; for each column of the dataframe it creates a mapping and returns it. So you would call it like this:
train_data, le_dict = oneHotEncode2(train_data)
Then on the test data, the call is made by passing the dictionary returned back from training:
test_data, _ = oneHotEncode2(test_data, le_dict)
An equivalent method is to use DictVectorizer. A related post on this is on my blog. I mention it here since it provides some reasoning behind this approach over simply using get_dummies (disclosure: it is my own blog).
You can pass the data to the CatBoost classifier without encoding. CatBoost handles categorical variables itself by performing one-hot encoding and target-based expanding mean encoding.
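For example, a minimal sketch (it assumes the catboost package is installed; the data and column names are made up):
from catboost import CatBoostClassifier
import pandas as pd

X = pd.DataFrame({'colour': ['red', 'green', 'red', 'blue'],
                  'size': [1, 2, 3, 4]})
y = [0, 1, 0, 1]

# cat_features tells CatBoost which columns to treat (and encode) as categorical.
model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(X, y, cat_features=['colour'])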
You can do the following as well. Note that for the example below you don't have to use pd.concat.
import pandas as pd
# initialise a dict of lists.
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
'Group':[1,2,1,2]}
# Create DataFrame
df = pd.DataFrame(data)
for _c in df.select_dtypes(include=['object']).columns:
print(_c)
df[_c] = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed
You can also explicitly change particular columns to categorical. For example, here I am changing the Color and Group columns:
import pandas as pd
# initialise a dict of lists.
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
'Group':[1,2,1,2]}
# Create DataFrame
df = pd.DataFrame(data)
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
print(_c)
df[_c] = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed
I know I'm late to this party, but the simplest way to one-hot encode a dataframe in an automated way is to use this function:
def hot_encode(df):
obj_df = df.select_dtypes(include=['object'])
return pd.get_dummies(df, columns=obj_df.columns).values
This works for me:
pandas.factorize( ['B', 'C', 'D', 'B'] )[0]
Output:
[0, 1, 2, 0]
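Note that factorize produces integer codes (a label encoding), not an indicator matrix; if you need actual one-hot columns you can feed those codes into np.eye, for example (a sketch):
import numpy as np
import pandas as pd

codes, uniques = pd.factorize(['B', 'C', 'D', 'B'])
one_hot = np.eye(len(uniques))[codes]
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.],
#        [1., 0., 0.]])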
I used this in my acoustic model; perhaps it helps with yours.
import numpy as np

def one_hot_encoding(x, n_out):
x = x.astype(int)
shape = x.shape
x = x.flatten()
N = len(x)
x_categ = np.zeros((N,n_out))
x_categ[np.arange(N), x] = 1
return x_categ.reshape((shape)+(n_out,))
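For reference, a quick usage sketch of the function above (the values are made up):
import numpy as np

x = np.array([2, 0, 1])
print(one_hot_encoding(x, n_out=3))
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]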
Short Answer
Here is a function to do one-hot-encoding without using numpy, pandas, or other packages. It takes a list of integers, booleans, or strings (and perhaps other types too).
import typing
def one_hot_encode(items: list) -> typing.List[list]:
results = []
# find the unique items (we want unique items because duplicate items will have the same encoding)
unique_items = list(set(items))
# sort the unique items
sorted_items = sorted(unique_items)
# find how long the list of each item should be
max_index = len(unique_items)
for item in items:
# create a list of zeros the appropriate length
one_hot_encoded_result = [0 for i in range(0, max_index)]
# find the index of the item
one_hot_index = sorted_items.index(item)
# change the zero at the index from the previous line to a one
one_hot_encoded_result[one_hot_index] = 1
# add the result
results.append(one_hot_encoded_result)
return results
Example:
one_hot_encode([2, 1, 1, 2, 5, 3])
# [[0, 1, 0, 0],
# [1, 0, 0, 0],
# [1, 0, 0, 0],
# [0, 1, 0, 0],
# [0, 0, 0, 1],
# [0, 0, 1, 0]]
one_hot_encode([True, False, True])
# [[0, 1], [1, 0], [0, 1]]
one_hot_encode(['a', 'b', 'c', 'a', 'e'])
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
Long(er) Answer
I know there are already a lot of answers to this question, but I noticed two things. First, most of the answers use packages like numpy and/or pandas. And this is a good thing. If you are writing production code, you should probably be using robust, fast algorithms like those provided in the numpy/pandas packages. But, for the sake of education, I think someone should provide an answer which has a transparent algorithm and not just an implementation of someone else's algorithm. Second, I noticed that many of the answers do not provide a robust implementation of one-hot encoding because they do not meet one of the requirements below. Below are some of the requirements (as I see them) for a useful, accurate, and robust one-hot encoding function:
A one-hot encoding function must:
handle a list of various types (e.g. integers, strings, floats, etc.) as input
handle an input list with duplicates
return a list of lists corresponding to the inputs (in the same order)
return a list of lists where each list is as short as possible
I tested many of the answers to this question and most of them fail on one of the requirements above.
Try this:
!pip install category_encoders
import category_encoders as ce
categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)
df_train_encoded.head()
The resulting dataframe df_train_encoded is the same as the original, but the categorical features are now replaced with their one-hot-encoded versions.
More information on category_encoders here.
To add to the other answers, here is how I did it with a Python 2 function using NumPy:
import numpy as np

def one_hot(y_):
# Function to encode output labels from number indexes
# e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
y_ = y_.reshape(len(y_))
n_values = np.max(y_) + 1
return np.eye(n_values)[np.array(y_, dtype=np.int32)] # Returns FLOATS
The line n_values = np.max(y_) + 1 could be hard-coded so that you always use the right number of output neurons, for example when you use mini-batches (a single batch might not contain every class).
Demo project/tutorial where this function has been used:
https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition
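If you do want to pin the number of classes explicitly, as discussed above, a small variant might look like this (passing n_values as an argument is my own sketch, not the original function):
import numpy as np

def one_hot_fixed(y_, n_values):
    """Like one_hot above, but with the number of classes given explicitly."""
    y_ = np.asarray(y_, dtype=np.int32).reshape(-1)
    return np.eye(n_values)[y_]

print(one_hot_fixed([[5], [0], [3]], 6))
# [[0. 0. 0. 0. 0. 1.]
#  [1. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0.]]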
It can and should be as easy as:
class OneHotEncoder:
def __init__(self,optionKeys):
length=len(optionKeys)
self.__dict__={optionKeys[j]:[0 if i!=j else 1 for i in range(length)] for j in range(length)}
Usage :
ohe=OneHotEncoder(["A","B","C","D"])
print(ohe.A)
print(ohe.D)
Expanding @Martin Thoma's answer:
import numpy as np

def one_hot_encode(y):
"""Convert an iterable of indices to one-hot encoded labels."""
y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
# the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
nb_classes = len(np.unique(y)) # get the number of unique classes
standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
# which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
# directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
# standardised labels fixes this issue by returning a dictionary;
# standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
# standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
# cannot be called by an integer index e.g y[1.0] - throws an index error.
targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
return np.eye(nb_classes)[targets]
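A quick check with non-contiguous float labels, using the one_hot_encode function above (a sketch):
y = np.array([4.0, 7.0, 9.0, 7.0])
print(one_hot_encode(y))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]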
Let's assume that, out of 10 variables, you have 3 categorical variables in your data frame, named cname1, cname2 and cname3.
Then the following code will automatically create one-hot encoded variables in a new dataframe.
import category_encoders as ce
encoder_var=ce.OneHotEncoder(cols=['cname1','cname2','cname3'],handle_unknown='return_nan',return_df=True,use_cat_names=True)
new_df = encoder_var.fit_transform(old_df)
A simple example using numpy's vectorize and pandas' apply:
import numpy as np
a = np.array(['male','female','female','male'])
#define function
onehot_function = lambda x: 1.0 if (x=='male') else 0.0
onehot_a = np.vectorize(onehot_function)(a)
print(onehot_a)
# [1., 0., 0., 1.]
# -----------------------------------------
import pandas as pd
s = pd.Series(['male','female','female','male'])
onehot_s = s.apply(onehot_function)
print(onehot_s)
# 0 1.0
# 1 0.0
# 2 0.0
# 3 1.0
# dtype: float64
Here I tried this approach:
import numpy as np
#converting to one_hot
def one_hot_encoder(value, datal):
datal[value] = 1
return datal
def _one_hot_values(labels_data):
    n_values = np.max(labels_data) + 1   # width of each one-hot vector
    encoded = [0] * len(labels_data)
    for j, i in enumerate(labels_data):
        encoded[j] = one_hot_encoder(i, [0] * n_values)
    return np.array(encoded)
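A quick sanity check of the functions above (a sketch):
print(_one_hot_values([2, 0, 1]))
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]]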
Let's say I have the following arrays, which contain the X and Y values for a bunch of vectors, respectively:
xdat = np.array([3,2,7,4])
ydat = np.array([2,4,4,9])
Let's say that I wanted to draw the sum total of these vectors (a+b+c+d), not only as a single line from the origin, but drawn sequentially, adding each individual vector in turn.
How do I do this?
My idea is to use plt.plot for the values of two new arrays which contain the X and Y coordinates for each start/end point of all the vectors. The specific coordinates would be calculated from xdat and ydat. Assuming this was the most efficient method (without resorting to some easy-to-use function already built into Python), how would I code this?
It sounds like you want numpy.cumsum
import numpy as np
xdat = np.array([3,2,7,4])
ydat = np.array([2,4,4,9])
dat = np.vstack((xdat, ydat))
# array([[3, 2, 7, 4],
# [2, 4, 4, 9]])
dat = np.cumsum(dat, axis=1)
# array([[ 3, 5, 12, 16],
# [ 2, 6, 10, 19]], dtype=int32)
# optionally start at 0, 0 (can do this before or after cumsum)
dat = np.hstack([np.zeros((2, 1)), dat])
# array([[ 0., 3., 5., 12., 16.],
# [ 0., 2., 6., 10., 19.]])
I stacked them up for convenience, but you could also run cumsum on the 1-D arrays. The axis argument selects whether to run over the whole flattened array (None, the default) or along the n-th axis (row = 0, column = 1).
If you want to plot the X-Y coordinates, I'd do so with plt.plot(*dat), which will unpack the X and Y rows as arguments to plot.
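A minimal plotting sketch of that idea (assuming matplotlib is available):
import matplotlib.pyplot as plt
import numpy as np

xdat = np.array([3, 2, 7, 4])
ydat = np.array([2, 4, 4, 9])
dat = np.hstack([np.zeros((2, 1)), np.cumsum(np.vstack((xdat, ydat)), axis=1)])

# dat[0] holds the X coordinates, dat[1] the Y coordinates of each segment end.
plt.plot(*dat, marker='o')
plt.show()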
First things first: this is not a duplicate of NumPy: calculate averages with NaNs removed; I'll explain why:
Suppose I have an array
a = array([1,2,3,4])
and I want to average over it with the weights
weights = [4,3,2,1]
output = average(a, weights=weights)
print output
2.0
OK, so this is pretty straightforward. But now I have something like this:
a = array([1,2,nan,4])
Calculating the average with the usual method of course yields nan. Can I avoid this?
In principle I want to ignore the nans, so I'd like to have something like this:
a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print output
1.75
Alternatively, you can use a MaskedArray as such:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
First find out indices where the items are not nan, and then pass the filtered versions of a and weights to numpy.average:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75
As suggested by @mtrw in the comments, it would be cleaner to use a masked array here instead of an index array:
>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
I would offer another solution, which is more scalable to bigger dimensions (e.g. when averaging over a different axis). The attached code works with a 2D array, which possibly contains nans, and takes the average over axis=0.
import numpy as np

a = np.random.randint(5, size=(3,2)) # let's generate some random 2D array
# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a))
# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec
# mean_vec is vector with weighted nan-averages of array a taken along axis=0
Expanding on @Ashwini's and @Nicolas' answers, here is a version that can also handle an edge case where all the data values are np.nan, and that is designed to also work with a pandas DataFrame without type-related issues:
from typing import List, Union

import numpy as np
import pandas as pd

def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
weights: List[Union[float, int]]) -> np.ndarray:
""" Calculates the weighted average of `measures`' values, ex-nans.
When nans are present in `measures`' values,
the weights are recalculated based only on the weights for non-nan measures.
Note:
The calculation used is NOT the same as just ignoring nans.
For example, if we had data and weights:
data = [2, 3, np.nan]
weights = [0.5, 0.2, 0.3]
calc_wa_ignore_nan approach:
(2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
The ignoring nans approach:
(2*0.5) + (3*0.2) == 1.6
Args:
data: Multiple rows of numeric data values with `measures` as column headers.
measures: The str names of values to select from `row`.
weights: The numeric weights associated with `measures`.
Example:
>>> df = pd.DataFrame({"meas1": [1, 1],
"meas2": [2, 2],
"meas3": [3, 3],
"meas4": [np.nan, 0],
"meas5": [5, 5]})
>>> measures = ["meas2", "meas3", "meas4"]
>>> weights = [0.5, 0.2, 0.3]
>>> calc_wa_ignore_nan(df, measures, weights)
array([2.28571429, 1.6])
"""
assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
# Need to coerce type to np.float64 instead of python's float
# to avoid "ufunc 'isnan' not supported for the input types ..." error
data = np.array(df[measures].values, dtype=np.float64)
# Make a 2d array with the same weights for each row
# cast for safety and better errors
weights = np.array([weights, ] * data.shape[0], dtype=np.float64)
mask = np.isnan(data)
masked_data = np.ma.masked_array(data, mask=mask)
masked_weights = np.ma.masked_array(weights, mask=mask)
# np.nanmean doesn't support weights
weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
# Replace masked elements with np.nan
# otherwise those elements will be interpretted as 0 when read into a pd.DataFrame
weighted_avgs = weighted_avgs.filled(np.nan)
return weighted_avgs
All the solutions above are very good, but they don't handle the case where there is a nan in the weights. To handle that, using pandas:
import numpy as np

def weighted_average_ignoring_nan(df, col_value, col_weight):
den = 0
num = 0
for index, row in df.iterrows():
if(~np.isnan(row[col_weight]) & ~np.isnan(row[col_value])):
den = den + row[col_weight]
num = num + row[col_weight]*row[col_value]
return num/den
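A vectorized equivalent that avoids the row loop (a sketch, with hypothetical column names in the usage line):
import numpy as np
import pandas as pd

def weighted_average_ignoring_nan_vec(df, col_value, col_weight):
    # Keep only rows where both the value and the weight are present.
    valid = df[col_value].notna() & df[col_weight].notna()
    w = df.loc[valid, col_weight]
    return (df.loc[valid, col_value] * w).sum() / w.sum()

df = pd.DataFrame({'v': [1, 2, np.nan, 4], 'w': [4, 3, 2, np.nan]})
print(weighted_average_ignoring_nan_vec(df, 'v', 'w'))  # (1*4 + 2*3) / (4 + 3) ≈ 1.43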
Since you're looking for the mean, another idea is to simply replace all the nan values with 0's:
>>>import numpy as np
>>>a = np.array([[ 3., 2., 5.], [np.nan, 4., np.nan], [np.nan, np.nan, np.nan]])
>>>w = np.array([[ 1., 2., 3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
>>>a[np.isnan(a)] = 0
>>>w[np.isnan(w)] = 0
>>>np.average(a, weights=w)
3.6666666666666665
This can be used with the axis functionality of the average function, but be careful that your weights don't sum up to 0.
Is there a quick way of replacing all NaN values in a numpy array with (say) the linearly interpolated values?
For example,
[1 1 1 nan nan 2 2 nan 0]
would be converted into
[1 1 1 1.3 1.6 2 2 1 0]
Let's first define a simple helper function to make it more straightforward to handle indices and logical indices of NaNs:
import numpy as np
def nan_helper(y):
"""Helper to handle indices and logical indices of NaNs.
Input:
- y, 1d numpy array with possible NaNs
Output:
- nans, logical indices of NaNs
- index, a function, with signature indices= index(logical_indices),
to convert logical indices of NaNs to 'equivalent' indices
Example:
>>> # linear interpolation of NaNs
>>> nans, x= nan_helper(y)
>>> y[nans]= np.interp(x(nans), x(~nans), y[~nans])
"""
return np.isnan(y), lambda z: z.nonzero()[0]
Now nan_helper(.) can be utilized like:
>>> y= array([1, 1, 1, NaN, NaN, 2, 2, NaN, 0])
>>>
>>> nans, x= nan_helper(y)
>>> y[nans]= np.interp(x(nans), x(~nans), y[~nans])
>>>
>>> print y.round(2)
[ 1. 1. 1. 1.33 1.67 2. 2. 1. 0. ]
---
Although it may at first seem a little bit of an overkill to specify a separate function just to do things like this:
>>> nans, x= np.isnan(y), lambda z: z.nonzero()[0]
it will eventually pay dividends.
So, whenever you are working with NaNs related data, just encapsulate all the (new NaN related) functionality needed, under some specific helper function(s). Your code base will be more coherent and readable, because it follows easily understandable idioms.
Interpolation, indeed, is a nice context to see how NaN handling is done, but similar techniques are utilized in various other contexts as well.
I came up with this code:
import numpy as np
nan = np.nan
A = np.array([1, nan, nan, 2, 2, nan, 0])
ok = ~np.isnan(A)
xp = ok.ravel().nonzero()[0]
fp = A[~np.isnan(A)]
x = np.isnan(A).ravel().nonzero()[0]
A[np.isnan(A)] = np.interp(x, xp, fp)
print(A)
It prints
[ 1. 1.33333333 1.66666667 2. 2. 1. 0. ]
Just use numpy's logical indexing and the where statement to apply a 1D interpolation.
import numpy as np
from scipy import interpolate
def fill_nan(A):
'''
interpolate to fill nan values
'''
inds = np.arange(A.shape[0])
good = np.where(np.isfinite(A))
f = interpolate.interp1d(inds[good], A[good],bounds_error=False)
B = np.where(np.isfinite(A),A,f(inds))
return B
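A quick usage sketch of fill_nan above (note that with bounds_error=False and no fill_value, leading or trailing NaNs stay NaN):
import numpy as np

A = np.array([1, np.nan, np.nan, 2, 2, np.nan, 0])
print(fill_nan(A))
# approximately [1.  1.33  1.67  2.  2.  1.  0.]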
For two-dimensional data, SciPy's griddata works fairly well for me:
>>> import numpy as np
>>> from scipy.interpolate import griddata
>>>
>>> # SETUP
>>> a = np.arange(25).reshape((5, 5)).astype(float)
>>> a
array([[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.],
[ 10., 11., 12., 13., 14.],
[ 15., 16., 17., 18., 19.],
[ 20., 21., 22., 23., 24.]])
>>> a[np.random.randint(2, size=(5, 5)).astype(bool)] = np.NaN
>>> a
array([[ nan, nan, nan, 3., 4.],
[ nan, 6., 7., nan, nan],
[ 10., nan, nan, 13., nan],
[ 15., 16., 17., nan, 19.],
[ nan, nan, 22., 23., nan]])
>>>
>>> # THE INTERPOLATION
>>> x, y = np.indices(a.shape)
>>> interp = np.array(a)
>>> interp[np.isnan(interp)] = griddata(
... (x[~np.isnan(a)], y[~np.isnan(a)]), # points we know
... a[~np.isnan(a)], # values we know
... (x[np.isnan(a)], y[np.isnan(a)])) # points to interpolate
>>> interp
array([[ nan, nan, nan, 3., 4.],
[ nan, 6., 7., 8., 9.],
[ 10., 11., 12., 13., 14.],
[ 15., 16., 17., 18., 19.],
[ nan, nan, 22., 23., nan]])
I am using it on 3D images, operating on 2D slices (4000 slices of 350x350). The whole operation still takes about an hour :/
Or, building on Winston's answer:
import numpy as np

def pad(data):
bad_indexes = np.isnan(data)
good_indexes = np.logical_not(bad_indexes)
good_data = data[good_indexes]
interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)
data[bad_indexes] = interpolated
return data
A = np.array([[1, 20, 300],
              [np.nan, np.nan, np.nan],
              [3, 40, 500]])
A = np.apply_along_axis(pad, 0, A)
print(A)
Result
[[ 1. 20. 300.]
[ 2. 30. 400.]
[ 3. 40. 500.]]
It might be easier to change how the data is being generated in the first place, but if not:
bad_indexes = np.isnan(data)
Create a boolean array indicating where the nans are
good_indexes = np.logical_not(bad_indexes)
Create a boolean array indicating where the good values area
good_data = data[good_indexes]
A restricted version of the original data excluding the nans
interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)
Run all the bad indexes through interpolation
data[bad_indexes] = interpolated
Replace the original data with the interpolated values.
I use interpolation to replace all NaN values:
import numpy as np

A = np.array([1, np.nan, np.nan, 2, 2, np.nan, 0])
np.interp(np.arange(len(A)),
np.arange(len(A))[np.isnan(A) == False],
A[np.isnan(A) == False])
Output :
array([1. , 1.33333333, 1.66666667, 2. , 2. , 1. , 0. ])
I needed an approach that would also fill in NaNs at the start or end of the data, which the main answer does not appear to do.
The function I came up with uses a linear regression to fill in the NaN's. This overcomes my problem:
import numpy as np
def linearly_interpolate_nans(y):
# Fit a linear regression to the non-nan y values
# Create X matrix for linreg with an intercept and an index
X = np.vstack((np.ones(len(y)), np.arange(len(y))))
# Get the non-NaN values of X and y
X_fit = X[:, ~np.isnan(y)]
y_fit = y[~np.isnan(y)].reshape(-1, 1)
# Estimate the coefficients of the linear regression
beta = np.linalg.lstsq(X_fit.T, y_fit, rcond=None)[0]
# Fill in all the nan values using the predicted coefficients
y.flat[np.isnan(y)] = np.dot(X[:, np.isnan(y)].T, beta)
return y
Here's an example usage case:
# Make an array according to some linear function
y = np.arange(12) * 1.5 + 10.
# First and last value are NaN
y[0] = np.nan
y[-1] = np.nan
# 30% of other values are NaN
for i in range(len(y)):
if np.random.rand() > 0.7:
y[i] = np.nan
# NaN's are filled in!
print (y)
print (linearly_interpolate_nans(y))
Slightly optimized version based on the response of Bryan Woods. It handles starting and ending values of the source data correctly, and it is 25-30% faster than the original version. Also you may use different kinds of interpolation (see the scipy.interpolate.interp1d documentation for details).
import numpy as np
from scipy.interpolate import interp1d
def fill_nans_scipy1(padata, pkind='linear'):
"""
Interpolates data to fill nan values
Parameters:
padata : nd array
source data with np.NaN values
Returns:
nd array
resulting data with interpolated values instead of nans
"""
aindexes = np.arange(padata.shape[0])
agood_indexes, = np.where(np.isfinite(padata))
f = interp1d(agood_indexes
, padata[agood_indexes]
, bounds_error=False
, copy=False
, fill_value="extrapolate"
, kind=pkind)
return f(aindexes)
In [17]: adata = np.array([1, 2, np.NaN, 4])

In [18]: adata
Out[18]: array([ 1.,  2., nan,  4.])
In [19]: fill_nans_scipy1(adata)
Out[19]: array([1., 2., 3., 4.])
Building on the answer by Bryan Woods, I modified his code to also convert lists consisting only of NaN to a list of zeros:
import numpy as np
from scipy.interpolate import interp1d

def fill_nan(A):
'''
interpolate to fill nan values
'''
inds = np.arange(A.shape[0])
good = np.where(np.isfinite(A))
if len(good[0]) == 0:
return np.nan_to_num(A)
f = interp1d(inds[good], A[good], bounds_error=False)
B = np.where(np.isfinite(A), A, f(inds))
return B
Simple addition, I hope it will be of use to someone.
Interpolation and extrapolation with padding keywords
The following solution interpolates the nan values in an array by np.interp, if a finite value is present on both sides. Nan values at the borders are handled by np.pad with modes like constant or reflect.
import numpy as np
import matplotlib.pyplot as plt
def extrainterpolate_nans_1d(
arr, kws_pad=({'mode': 'edge'}, {'mode': 'edge'})
):
"""Interpolates and extrapolates nan values.
Interpolation is linear, compare np.interp(..).
Extrapolation works with pad keywords, compare np.pad(..).
Parameters
----------
arr : np.ndarray, shape (N,)
Array to replace nans in.
kws_pad : dict or (dict, dict)
kwargs for np.pad on left and right side
Returns
-------
np.ndarray
Array with NaNs replaced by interpolated / extrapolated values.
See Also
--------
https://numpy.org/doc/stable/reference/generated/numpy.interp.html
https://numpy.org/doc/stable/reference/generated/numpy.pad.html
https://stackoverflow.com/a/43821453/7128154
"""
assert arr.ndim == 1
if isinstance(kws_pad, dict):
kws_pad_left = kws_pad
kws_pad_right = kws_pad
else:
assert len(kws_pad) == 2
assert isinstance(kws_pad[0], dict)
assert isinstance(kws_pad[1], dict)
kws_pad_left = kws_pad[0]
kws_pad_right = kws_pad[1]
arr_ip = arr.copy()
# interpolation
inds = np.arange(len(arr_ip))
nan_msk = np.isnan(arr_ip)
arr_ip[nan_msk] = np.interp(inds[nan_msk], inds[~nan_msk], arr[~nan_msk])
# determine pad range
i0 = next(
(ids for ids, val in np.ndenumerate(arr) if not np.isnan(val)), 0)[0]
i1 = next(
(ids for ids, val in np.ndenumerate(arr[::-1]) if not np.isnan(val)), 0)[0]
i1 = len(arr) - i1
# print('pad in range [0:{:}] and [{:}:{:}]'.format(i0, i1, len(arr)))
# pad
arr_pad = np.pad(
arr_ip[i0:], pad_width=[(i0, 0)], **kws_pad_left)
arr_pad = np.pad(
arr_pad[:i1], pad_width=[(0, len(arr) - i1)], **kws_pad_right)
return arr_pad
# setup data
ys = np.arange(30, dtype=float)**2/20
ys[:5] = np.nan
ys[20:] = 20
ys[28:] = np.nan
ys[[7, 13, 14, 18, 22]] = np.nan
ys_ie0 = extrainterpolate_nans_1d(ys)
kws_pad_sym = {'mode': 'symmetric'}
kws_pad_const7 = {'mode': 'constant', 'constant_values':7.}
ys_ie1 = extrainterpolate_nans_1d(ys, kws_pad=(kws_pad_sym, kws_pad_const7))
ys_ie2 = extrainterpolate_nans_1d(ys, kws_pad=(kws_pad_const7, kws_pad_sym))
fig, ax = plt.subplots()
ax.scatter(np.arange(len(ys)), ys, s=15**2, label='ys')
ax.scatter(np.arange(len(ys)), ys_ie0, s=8**2, label='ys_ie0, left_pad edge, right_pad edge')
ax.scatter(np.arange(len(ys)), ys_ie1, s=6**2, label='ys_ie1, left_pad symmetric, right_pad 7')
ax.scatter(np.arange(len(ys)), ys_ie2, s=4**2, label='ys_ie2, left_pad 7, right_pad symmetric')
ax.legend()
As suggested by an earlier comment, the best way to do this is to use a peer reviewed implementation. The pandas library has an interpolation method for 1d data, which interpolates np.nan values in Series or DataFrame:
pandas.Series.interpolate or pandas.DataFrame.interpolate
The documentation is very concise; I recommend reading through it! My implementation:
import pandas as pd
magnitudes_series = pd.Series(magnitudes) # Convert np.array to pd.Series
magnitudes_series.interpolate(
# I used "akima" because the second derivative of my data has frequent drops to 0
method="akima",
# Interpolate from both sides of the sequence, up to you (made sense for my data)
limit_direction="both",
# Interpolate only np.nan sequences that have number sequences at the ends of the respective np.nan sequences
limit_area="inside",
inplace=True,
)
# I chose to remove np.nan at the tails of data sequence
magnitudes_series.dropna(inplace=True)
result_in_numpy_array = magnitudes_series.values
Importing scipy looks like overkill to me. Here's a simple way using numpy, maintaining the same conventions as np.interp:
import numpy as np

def interp_nans(x: list, left=None, right=None, period=None) -> list:
"""
e.g. [1 1 1 nan nan 2 2 nan 0] -> [1 1 1 1.3 1.6 2 2 1 0]
"""
xp = [i for i, yi in enumerate(x) if np.isfinite(yi)]
fp = [yi for i, yi in enumerate(x) if np.isfinite(yi)]
return list(np.interp(x=list(range(len(x))), xp=xp, fp=fp,left=left,right=right,period=period))
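For example, a quick usage sketch of interp_nans above:
print(interp_nans([1, 1, 1, np.nan, np.nan, 2, 2, np.nan, 0]))
# approximately [1.0, 1.0, 1.0, 1.33, 1.67, 2.0, 2.0, 1.0, 0.0]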