I want to process a pandas dataframe with rank-hot encoding instead of one-hot encoding.
For example take this pandas dataframe:
df = pd.DataFrame([[1,2],[3,0],[2,3]], columns=['colA', 'colB'])
print(df)
>> colA colB
0 1 2
1 3 0
2 2 3
How it should look in the end:
print(df)
>> colA_0 colA_1 colA_2 colA_3 colB_0 colB_1 colB_2 colB_3
0 1 1 0 0 1 1 1 0
1 1 1 1 1 1 0 0 0
2 1 1 1 0 1 1 1 1
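This worked on small DataFrames:
# MULTIPLYFEATURES (the number of rank columns per feature) is defined elsewhere
def rankHotEncode(row):
    newFeatures = {}
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, v in row.items():
        for k in range(MULTIPLYFEATURES):
            newFeatures[i + repr(k)] = 1 if v >= k else 0
    return pd.Series(newFeatures)
df.apply(rankHotEncode, axis=1)
The solution should not be hardcoded and should be efficient for on the order of 100,000 rows.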
How can I improve the provided solution to make it more efficient or what is the best way to do this?
You can use scikit-learn's OneHotEncoder with numpy.cumsum. While it involves some copies, it is quite efficient because it does not process the matrix row by row. Here is some sample code using it.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2],[3,0],[2,3]], columns=['colA', 'colB'])
print(df)
n_values = df.max().values + 1
enc = OneHotEncoder(sparse=False, n_values=n_values, dtype=int)
enc.fit(df)
encoded_columns = [
    '{}_{}'.format(col_name, i)
    for col_name, n_value in zip(df.columns, n_values)
    for i in range(n_value)
]
one_hot = enc.transform(df)
rank_hot = np.zeros_like(one_hot)
for col_start, col_end in zip(enc.feature_indices_[:-1], enc.feature_indices_[1:]):
    one_hot_col_reversed = one_hot[:, col_start: col_end][:, ::-1]
    rank_hot[:, col_start: col_end] = np.cumsum(one_hot_col_reversed, axis=1)[:, ::-1]
encoded_df = pd.DataFrame(rank_hot, columns=encoded_columns)
It outputs for your example
print(encoded_df)
>> colA_0 colA_1 colA_2 colA_3 colB_0 colB_1 colB_2 colB_3
0 1 1 0 0 1 1 1 0
1 1 1 1 1 1 0 0 0
2 1 1 1 0 1 1 1 1
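The n_values argument and the feature_indices_ attribute used above are no longer available in recent scikit-learn releases, so here is a minimal sketch of the same idea using only NumPy broadcasting. The function name rank_hot_encode is hypothetical, and the sketch assumes every column holds non-negative integers and that the number of levels per column is its maximum value plus one.

```python
import numpy as np
import pandas as pd


def rank_hot_encode(frame):
    """Rank-hot encode every column of a frame of non-negative integers."""
    values = frame.to_numpy()
    blocks, names = [], []
    for j, col in enumerate(frame.columns):
        # One output column per level 0..max of this input column (assumption).
        levels = np.arange(values[:, j].max() + 1)
        # Broadcasting (n_rows, 1) >= (n_levels,) flags every level up to the value.
        blocks.append((values[:, [j]] >= levels).astype(int))
        names.extend(f"{col}_{k}" for k in levels)
    return pd.DataFrame(np.hstack(blocks), columns=names, index=frame.index)


df = pd.DataFrame([[1, 2], [3, 0], [2, 3]], columns=["colA", "colB"])
encoded_df = rank_hot_encode(df)
print(encoded_df)
```

The loop runs once per column rather than once per row, so the cost for around 100,000 rows should be dominated by the vectorized comparisons.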
Related
I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix that has as header all the unique values from column Data, meaning [a,b,c,d,e]. Then as rows a flag indicating if the value is at that particular row.
The matrix should look like this:
Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1
To separate the Data column, what I did is:
df['Data'].str.split(',', expand = True)
After that, I don't know how to proceed to assign the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep = r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, and then apply value_counts to each row with axis = 1.
Finally, fill the missing values with zero using fillna and convert the data to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
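As a small self-contained sketch of this approach on the sample data from the question (the names flags and result are only illustrative), the indicator columns can be joined back to the original column:

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One 0/1 column per distinct token found after splitting on the separator.
flags = df["Data"].str.get_dummies(sep=",")

# Keep the original column next to the indicator columns.
result = df.join(flags)
print(result)
```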
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from this dataframe. If I have the column names in lists instead of the one-hot format, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
              .reindex(unique_categories, axis=0).reindex(unique_categories, axis=1)).fillna(0)
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How to get the occurrence matrix directly from one hot dataframe?
You can have some fun with matrix math!
# data is the one-hot dataframe from the question
u = np.diag(np.ones(data.shape[1], dtype=bool))
data.T.dot(data) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
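To spell out why this works: the product of the transposed one-hot matrix with itself counts, for each pair of columns, the rows in which both are 1, and the diagonal holds each column's own total, which is then zeroed. A minimal sketch on the question's data follows; it uses np.fill_diagonal instead of the boolean mask, which is a swapped-in but equivalent step, and the names co, values and adj are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [0, 0, 1, 0],
                     "b": [1, 1, 1, 0],
                     "c": [0, 1, 0, 1],
                     "d": [1, 1, 1, 0]})

# Entry (i, j) counts the rows where columns i and j are both 1.
co = data.T.dot(data)

# The diagonal holds per-column totals, not co-occurrences, so clear it.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)
adj = pd.DataFrame(values, index=data.columns, columns=data.columns)
print(adj)
```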
I have a numeric DataFrame, for example:
x = np.array([[1,2,3],[-1,-1,1],[0,0,0]])
df = pd.DataFrame(x, columns=['A','B','C'])
df
A B C
0 1 2 3
1 -1 -1 1
2 0 0 0
And I want to count, for each row, the number of positive values, negative values and values equal to 0. I've been trying the following:
df['positive_count'] = df.apply(lambda row: (row > 0).sum(), axis = 1)
df['negative_count'] = df.apply(lambda row: (row < 0).sum(), axis = 1)
df['zero_count'] = df.apply(lambda row: (row == 0).sum(), axis = 1)
But I'm getting the following result, which is obviously incorrect:
A B C positive_count negative_count zero_count
0 1 2 3 3 0 1
1 -1 -1 1 1 2 0
2 0 0 0 0 0 5
Does anyone know what might be going wrong, or could you help me find the best way to do what I'm looking for?
Thank you.
There are some ways, but one option is using np.sign and get_dummies:
# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
u = (pd.get_dummies(np.sign(df.stack()))
       .groupby(level=0).sum()
       .rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))
u
negative_count zero_count positive_count
0 0 0 3
1 2 0 1
2 0 3 0
df = pd.concat([df, u], axis=1)
df
A B C negative_count zero_count positive_count
0 1 2 3 0 0 3
1 -1 -1 1 2 0 1
2 0 0 0 0 3 0
np.sign treats zero differently from positive and negative values, so it is ideal to use here.
Another option is groupby and value_counts:
(np.sign(df)
   .stack()
   .groupby(level=0)
   .value_counts()
   .unstack(1, fill_value=0)
   .rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))
negative_count zero_count positive_count
0 0 0 3
1 2 0 1
2 0 3 0
Slightly more verbose but still worth knowing about.
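As for what goes wrong in the question's attempt: each apply call runs over all current columns, so once positive_count has been added it is itself counted by the later calls, which is how the last row ends up with a zero count of 5. A minimal sketch that avoids this by fixing the set of value columns first is shown below; the names value_cols, vals and counts are illustrative.

```python
import numpy as np
import pandas as pd

x = np.array([[1, 2, 3], [-1, -1, 1], [0, 0, 0]])
df = pd.DataFrame(x, columns=["A", "B", "C"])

# Freeze the columns to count over before adding any new ones.
value_cols = ["A", "B", "C"]
vals = df[value_cols]

# Boolean comparisons summed across each row; no row-wise apply needed.
counts = pd.DataFrame({
    "positive_count": (vals > 0).sum(axis=1),
    "negative_count": (vals < 0).sum(axis=1),
    "zero_count": (vals == 0).sum(axis=1),
})
result = df.join(counts)
print(result)
```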
I have a dataframe that has multiple columns that represent whether or not something had existed, but they are ordinal in nature. Something could have existed in all 3 categories, but I only want to indicate the highest level that it existed in.
So for a given row, I only want a single '1' value, but I want it to be kept at the highest level it was found at.
For this row:
1,1,0, I would want the row to be changed to 1,0,0
and for this row:
0,1,1, I would want the row to be changed to 0,1,0
Here is a sample of what the data could look like, and expected output:
import pandas as pd
#input data
df = pd.DataFrame({'id':[1,2,3,4,5],
                   'level1':[0,0,0,0,1],
                   'level2':[1,0,1,0,1],
                   'level3':[0,1,1,1,0]})
#expected output:
new_df = pd.DataFrame({'id':[1,2,3,4,5],
                       'level1':[0,0,0,0,1],
                       'level2':[1,0,1,0,0],
                       'level3':[0,1,0,1,0]})
Using numpy.zeros and filling via numpy.argmax:
out = np.zeros(df.iloc[:, 1:].shape, dtype=int)
out[np.arange(len(out)), np.argmax(df.iloc[:, 1:].values, 1)] = 1
df.iloc[:, 1:] = out
Using broadcasting with argmax:
a = df.iloc[:, 1:].values
df.iloc[:, 1:] = (a.argmax(axis=1)[:,None] == range(a.shape[1])).astype(int)
Both produce:
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
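One caveat with argmax: for a row that contains no 1 at all, it returns position 0, so the approaches above would set a flag in the first level column. The sample data has no such row, but a hedged sketch that guards against it could look like the following. The helper name keep_first_flag is hypothetical, and leaving all-zero rows untouched is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd


def keep_first_flag(frame, cols):
    """Keep only the first 1 in each row of cols; all-zero rows stay all zero."""
    a = frame[cols].to_numpy()
    out = np.zeros_like(a)
    # Only rows that contain at least one flag get a 1 written back.
    rows = np.flatnonzero(a.any(axis=1))
    out[rows, a[rows].argmax(axis=1)] = 1
    result = frame.copy()
    result[cols] = out
    return result


df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "level1": [0, 0, 0, 0, 1],
                   "level2": [1, 0, 1, 0, 1],
                   "level3": [0, 1, 1, 1, 0]})
level_cols = ["level1", "level2", "level3"]
new_df = keep_first_flag(df, level_cols)
print(new_df)
```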
You can use advanced indexing with NumPy. Updating underlying NumPy array works here since you have a dataframe of int dtype.
idx = df.iloc[:, 1:].eq(1).values.argmax(1)
df.iloc[:, 1:] = 0
df.values[np.arange(df.shape[0]), idx+1] = 1
print(df)
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
numpy.eye
v = df.iloc[:, 1:].values
i = np.eye(3, dtype=np.int64)
a = v.argmax(1)
df.iloc[:, 1:] = i[a]
df
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
cumsum and mask
df.set_index('id').pipe(
lambda d: d.mask(d.cumsum(1) > 1, 0)
).reset_index()
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
You can use get_dummies() by assigning a 1 to the maximum index
df[df.filter(like='level').columns] = pd.get_dummies(df.filter(like='level').idxmax(1))
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
I have a pandas dataframe df containing categorical variables.
df=pandas.DataFrame(data=[['male','blue'],['female','brown'],
['male','black']],columns=['gender','eyes'])
df
Out[16]:
gender eyes
0 male blue
1 female brown
2 male black
Using the function get_dummies, I get the following dataframe:
df_dummies = pandas.get_dummies(df)
df_dummies
Out[18]:
gender_female gender_male eyes_black eyes_blue eyes_brown
0 0 1 0 1 0
1 1 0 0 0 1
2 0 1 1 0 0
However, the columns gender_female and gender_male contain the same information, because the original column can only take two values. Is there a (smart) way to keep only one of the 2 columns?
UPDATED
The use of
df_dummies = pandas.get_dummies(df,drop_first=True)
Would give me
df_dummies
Out[21]:
gender_male eyes_blue eyes_brown
0 1 1 0
1 0 0 1
2 1 0 0
but I would like to drop a column only for the variables that originally had just 2 possible values.
The desired result should be
df_dummies
Out[18]:
gender_male eyes_black eyes_blue eyes_brown
0 1 0 1 0
1 0 0 0 1
2 1 1 0 0
Yes, you can use the argument drop_first:
drop_first=True
From the documentation:
pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
To have all dummy columns for eyes, and one for gender, use this:
df = pd.get_dummies(df, prefix=['eyes'], columns=['eyes'])
df = pd.get_dummies(df,drop_first=True)
Output:
eyes_black eyes_blue eyes_brown gender_male
0 0 1 0 1
1 0 0 1 0
2 1 0 0 1
More general:
gender eyes heigh
0 male blue tall
1 female brown short
2 male black average
for i in df.columns:
    if len(df.groupby([i]).size()) > 2:
        df = pd.get_dummies(df, prefix=[i], columns=[i])
df = pd.get_dummies(df, drop_first=True)
Output:
eyes_black eyes_blue eyes_brown heigh_average heigh_short heigh_tall \
0 0 1 0 0 0 1
1 0 0 1 0 1 0
2 1 0 0 1 0 0
gender_male
0 1
1 0
2 1
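The same idea can be sketched without modifying df inside the loop, by splitting the columns on their number of distinct values first. This is only an illustration under the assumption that every column is categorical; the names binary_cols and multi_cols are hypothetical, and dtype=int is passed so the output is 0/1 rather than boolean.

```python
import pandas as pd

df = pd.DataFrame(data=[["male", "blue"], ["female", "brown"], ["male", "black"]],
                  columns=["gender", "eyes"])

# Columns with exactly two distinct values need only one dummy column.
binary_cols = [c for c in df.columns if df[c].nunique() == 2]
multi_cols = [c for c in df.columns if c not in binary_cols]

parts = []
if binary_cols:
    parts.append(pd.get_dummies(df[binary_cols], drop_first=True, dtype=int))
if multi_cols:
    parts.append(pd.get_dummies(df[multi_cols], dtype=int))

df_dummies = pd.concat(parts, axis=1)
print(df_dummies)
```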
You could use itertools.combinations to find all pairs of columns, then any potentially redundant pair of columns will be one where for every row one column is True and the other is False - i.e. an XOR:
import pandas as pd
from itertools import combinations
df = pd.DataFrame(data=[['male','blue'],['female','brown'],['male','black']],
                  columns=['gender','eyes'])
dummies = pd.get_dummies(df)
for c1, c2 in combinations(dummies.columns, 2):
    if all(dummies[c1] ^ dummies[c2]):
        print(c1,c2)
However, this also notices that in your examples all females have brown eyes, hence we get the following printed:
gender_female gender_male
gender_male eyes_brown
Alternatively, you can split the dataframe into the part you want to apply drop_first=True to and the part you don't, then concatenate them.
df1 = df.iloc[:, 0:2]
df2 = df.iloc[:, 2:]
df1 = pd.get_dummies(df1 ,drop_first=True)
df = pd.concat([df1, df2], axis=1)