pd.get_dummies() for DataFrames with multi-index columns - python

Using get_dummies(), it is possible to create one-hot encoded dummy variables for categorical data. For example:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
'B': ['b', 'a', 'c']})
print(pd.get_dummies(df))
# A_a A_b B_a B_b B_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
So far, so good. But how can I use get_dummies() in combination with multi-index columns? The default behavior is not very practical: The multi-index tuple is converted into a string and the same suffix mechanism applies as with the simple-index columns.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df)
print(ret)
print(type(ret.columns[0]))
# ('i','A')_a ('i','A')_b ('ii','B')_a ('ii','B')_b ('ii','B')_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# str
What I would like to get, however, is that the dummies create a new column level:
ret = pd.get_dummies(df, ???)
print(ret)
print(type(ret.columns[0]))
#    i    ii
#    A     B
#    a  b  a  b  c
# 0  1  0  0  1  0
# 1  0  1  1  0  0
# 2  1  0  0  0  1
#
# tuple
#
# Note that the ret would be equivalent to the following:
# ('i','A','a') ('i','A','b') ('ii','B','a') ('ii','B','b') ('ii','B','c')
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
How could this be achieved?
Update: I placed a feature request for better support of multi-index data-frames in get_dummies: https://github.com/pandas-dev/pandas/issues/26560

You can parse the column names and rename them:
import ast
def parse_dummy(x):
    parts = x.split('_')
    return ast.literal_eval(parts[0]) + (parts[1],)
ret.columns = pd.Series(ret.columns).apply(parse_dummy)
# (i, A, a) (i, A, b) (ii, B, a) (ii, B, b) (ii, B, c)
#0 1 0 0 1 0
#1 0 1 1 0 0
#2 1 0 0 0 1
Note that this DataFrame is not the same as a DataFrame with three-level multiindex column names.
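If a real three-level MultiIndex is wanted, one option is to pass the parsed tuples to pd.MultiIndex.from_tuples instead of assigning them as plain labels. The following is only a minimal sketch: it assumes the flat labels look like "('i', 'A')_a", it splits on the last underscore only (so it breaks if a category value itself contains an underscore), and dtype=int is there just to get 0/1 instead of booleans on recent pandas versions.

```python
import ast
import pandas as pd

df = pd.DataFrame({('i', 'A'): ['a', 'b', 'a'],
                   ('ii', 'B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df, dtype=int)

def parse_dummy(label):
    # "('i', 'A')_a" -> ('i', 'A', 'a'); split on the last underscore only
    prefix, value = label.rsplit('_', 1)
    return ast.literal_eval(prefix) + (value,)

# Build a genuine MultiIndex from the parsed tuples
ret.columns = pd.MultiIndex.from_tuples([parse_dummy(c) for c in ret.columns])
print(ret)
print(ret.columns.nlevels)
```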

I had a similar need, but with a more complex DataFrame: a MultiIndex as the row index, plus numerical columns that should not be converted to dummies. So in my case I had to scan through the columns, expand only the columns of dtype='object' into dummies, and build the new column labels by concatenating the name of the original column with the value of the dummy variable. I did it this way because I didn't want to add a new column index level.
Here is the code.
First, build a DataFrame in the format I need:
import pandas as pd
import numpy as np
df_size = 3
objects = ['obj1','obj2']
attributes = ['a1','a2','a3']
cols = pd.MultiIndex.from_product([objects, attributes], names=['objects', 'attributes'])
lab1 = ['car','truck','van']
lab2 = ['bay','zoe','ros','lol']
time = np.arange(df_size)
node = ['n1','n2']
idx = pd.MultiIndex.from_product([time, node], names=['time', 'node'])
df = pd.DataFrame(np.random.randint(10,size=(len(idx),len(cols))),columns=cols,index=idx)
c1 = map(lambda i:lab1[i],np.random.randint(len(lab1),size=len(idx)))
c2 = map(lambda i:lab2[i],np.random.randint(len(lab2),size=len(idx)))
df[('obj1','a3')]=list(c1)
df[('obj2','a2')]=list(c2)
print(df)
objects     obj1             obj2
attributes    a1  a2     a3    a1   a2  a3
time node
0    n1        6   5  truck     3  ros   3
     n2        5   6    car     9  lol   7
1    n1        0   8  truck     7  zoe   8
     n2        4   3  truck     8  bay   3
2    n1        5   8    van     0  bay   0
     n2        4   8    car     5  lol   4
And here is the code to dummify only the object columns
for t in [df.columns[i] for i,dt in enumerate(df.dtypes) if dt=='object']:
    dummy_block = pd.get_dummies( df[t] )
    dummy_block.columns = pd.MultiIndex.from_product([[t[0]],[f'{t[1]}_{c}' for c in dummy_block.columns]],
                                                     names=df.columns.names)
    df = pd.concat([df.drop(t,axis=1),dummy_block],axis=1).sort_index(axis=1)
print(df)
objects     obj1                                obj2
attributes    a1  a2  a3_car  a3_truck  a3_van    a1  a2_bay  a2_lol  a2_ros  a2_zoe  a3
time node
0    n1        6   5       0         1       0     3       0       0       1       0   3
     n2        5   6       1         0       0     9       0       1       0       0   7
1    n1        0   8       0         1       0     7       0       0       0       1   8
     n2        4   3       0         1       0     8       1       0       0       0   3
2    n1        5   8       0         0       1     0       1       0       0       0   0
     n2        4   8       1         0       0     5       0       1       0       0   4
It can easily be adapted to the original use case by adding one more level to the columns MultiIndex.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
print(df)
   i ii
   A  B
0  a  b
1  b  a
2  a  c
df.columns = pd.MultiIndex.from_tuples([t+('',) for t in df.columns])
for t in [df.columns[i] for i,dt in enumerate(df.dtypes) if dt=='object']:
    dummy_block = pd.get_dummies( df[t] )
    dummy_block.columns = pd.MultiIndex.from_product([[t[0]],[t[1]],[c for c in dummy_block.columns]],
                                                     names=df.columns.names)
    df = pd.concat([df.drop(t,axis=1),dummy_block],axis=1).sort_index(axis=1)
print(df)
   i    ii
   A     B
   a  b  a  b  c
0  1  0  0  1  0
1  0  1  1  0  0
2  1  0  0  0  1
Note that it still works if there are numerical columns; it just adds an empty additional level to them in the columns index as well.
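A more compact variant of the same idea, sketched here under the assumption that every column should be encoded (no numerical columns to preserve): build one block of dummies per original column and let pd.concat prepend the column tuple as the outer levels. The name blocks is just illustrative, and dtype=int is only used to get 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({('i', 'A'): ['a', 'b', 'a'],
                   ('ii', 'B'): ['b', 'a', 'c']})

# One block of dummies per original column; the tuple keys of the dict
# become the outer column levels of the concatenated result.
blocks = {col: pd.get_dummies(df[col], dtype=int) for col in df.columns}
ret = pd.concat(blocks, axis=1)

print(ret)
print(type(ret.columns[0]))
```

Because the keys are the original column tuples, the dummy values end up as a new innermost level without any string parsing.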

Related

Match a data frame columns to another data frame rows content

I have a pandas data frame as follows
A  B  C  D  ...  Z
and another data frame in which every row has zero or more letters, as follows:
Letters
A,C,D
A,B,F
A,H,G
A
B,F
None
I want to match the two dataframes to have something like this
A  B  C  D  ...  Z
1  0  1  1    0  0
Please provide a minimal example and the desired output so that the question can be answered.
Example:
data = ['A,C,D', 'A,B,F', 'A,E,G', None]
df = pd.DataFrame(data, columns=['letter'])
df :
letter
0 A,C,D
1 A,B,F
2 A,E,G
3 None
get_dummies and groupby
pd.get_dummies(df['letter'].str.split(',').explode()).groupby(level=0).sum()
output:
A B C D E F G
0 1 0 1 1 0 0 0
1 1 1 0 0 0 1 0
2 1 0 0 0 1 0 1
3 0 0 0 0 0 0 0
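The output above only has columns for letters that actually occur. If the full A to Z layout from the question is needed, one possible extension is to reindex the columns against the whole alphabet. This is a sketch that assumes the target columns are exactly the uppercase ASCII letters; dtype=int is used so the dummies are 0/1 rather than booleans.

```python
import string
import pandas as pd

data = ['A,C,D', 'A,B,F', 'A,E,G', None]
df = pd.DataFrame(data, columns=['letter'])

# One row per (original row, letter), then collapse back to one row per index
dummies = (pd.get_dummies(df['letter'].str.split(',').explode(), dtype=int)
             .groupby(level=0).sum())

# Add the letters that never occur as all-zero columns
full = dummies.reindex(columns=list(string.ascii_uppercase), fill_value=0)
print(full.shape)
```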

Create an ordering dataframe depending on the ordering of items in a smaller dataframe

I have a dataframe that looks something like this:
i j
0 a b
1 a c
2 b c
I would like to convert it to another dataframe that looks like this:
a b c
0 1 -1 0
1 1 0 -1
2 0 1 -1
The idea is to look at each row in the first dataframe and assign the value 1 to the item in the first column and the value -1 for the item in the second column and 0 for all other items in the new dataframe.
The second dataframe will have as many rows as the first and as many columns as the number of unique entries in the first dataframe. Thank you.
Couldn't really get a start on this.
example
data = {'i': {0: 'a', 1: 'a', 2: 'b'}, 'j': {0: 'b', 1: 'c', 2: 'c'}}
df = pd.DataFrame(data)
df
i j
0 a b
1 a c
2 b c
First, create the dummy variables (dtype=int keeps them as 0/1 so they can be negated later):
df1 = pd.get_dummies(df, dtype=int)
df1
i_a i_b j_b j_c
0 1 0 1 0
1 1 0 0 1
2 0 1 0 1
Second, convert the columns of df1 to a MultiIndex:
df1.columns = df1.columns.map(lambda x: tuple(x.split('_')))
df1
i j
a b b c
0 1 0 1 0
1 1 0 0 1
2 0 1 0 1
Third, negate the values in the j block:
df1.loc[:, 'j'] = df1.loc[:, 'j'].mul(-1).to_numpy()
df1
i j
a b b c
0 1 0 -1 0
1 1 0 0 -1
2 0 1 0 -1
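Finally, sum i and j by grouping on the second column level (DataFrame.sum(level=...) was removed in pandas 2.0):
df1.T.groupby(level=1).sum().T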
a b c
0 1 -1 0
1 1 0 -1
2 0 1 -1
we can put multiple columns as list instead of i and j
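columns = ['a', 'b', 'c']
def get_order(input_row):
    output_row = dict.fromkeys(columns, 0)  # start with 0 for every item
    output_row[input_row['i']] = 1
    output_row[input_row['j']] = -1
    return pd.Series(output_row)
ordering_df = original_df.apply(get_order, axis = 1)  # original_df: the input frame with columns i and j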
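Another way to get the same result, shown here only as a sketch, is to encode each column separately and subtract the two dummy frames. It relies on DataFrame.sub with fill_value=0 to treat items that appear in only one of the two columns as zeros; the names first and second are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'i': ['a', 'a', 'b'], 'j': ['b', 'c', 'c']})

first = pd.get_dummies(df['i'], dtype=int)    # +1 for the item in column i
second = pd.get_dummies(df['j'], dtype=int)   # -1 for the item in column j

# Align on the union of items; missing columns count as 0 on either side
ordering = first.sub(second, fill_value=0).astype(int)
print(ordering)
```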

How to split comma separated text into columns on pandas dataframe?

I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix that has as header all the unique values from column Data, meaning [a,b,c,d,e]. Then as rows a flag indicating if the value is at that particular row.
The matrix should look like this:
Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1
To separate column Data what I did is:
df['Data'].str.split(',', expand = True)
Then I don't know how to proceed to allocate the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep = r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, then apply value_counts to each row with axis = 1.
Finally, fill the missing values with zero using fillna and convert the data to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
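As a small self-contained sketch of this approach, the flags can be joined back onto the original column to get the layout asked for in the question. The DataFrame is rebuilt here from the sample data, and the names flags and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Data': ['a,b,c', 'a,c,d', 'd,e', 'a,e', 'a,b,c,d,e']})

# One 0/1 column per distinct value found between the separators
flags = df['Data'].str.get_dummies(sep=',')

# Keep the original column next to the flags
result = df.join(flags)
print(result)
```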
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
.explode('data_list')
.pivot_table(index='Data',
columns='data_list',
aggfunc=lambda x: 1,
fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)

Pandas: occurrence matrix from one hot encoding from pandas dataframe

I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from this dataframe. If I have the column names in lists instead of the one-hot format, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
.reindex(unique_categories, axis=0).reindex(unique_categories, axis=1)).fillna(0)
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How to get the occurrence matrix directly from one hot dataframe?
You can have some fun with matrix math!
u = np.diag(np.ones(data.shape[1], dtype=bool))  # True on the diagonal; data is the one-hot frame from the question
data.T.dot(data) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
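The same computation can also be written with np.fill_diagonal, which some may find easier to read than the boolean mask. This is only a sketch; it assumes the one-hot frame from the question, and the names co, values and occurrence are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [1, 1, 1, 0],
                     'c': [0, 1, 0, 1], 'd': [1, 1, 1, 0]})

# Entry (x, y) counts the rows in which both x and y are set
co = data.T.dot(data)

# The diagonal holds each column's own count, so zero it out on a copy
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)
occurrence = pd.DataFrame(values, index=co.index, columns=co.columns)
print(occurrence)
```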

Python - Column-wise keep first unique value

I have a dataframe that has multiple columns that represent whether or not something had existed, but they are ordinal in nature. Something could have existed in all 3 categories, but I only want to indicate the highest level that it existed in.
So for a given row, I only want a single '1' value, kept at the highest level at which it was found.
For the row 1,1,0, I would want the row to be changed to 1,0,0,
and for the row 0,1,1, I would want the row to be changed to 0,1,0.
Here is a sample of what the data could look like, and expected output:
import pandas as pd
#input data
df = pd.DataFrame({'id':[1,2,3,4,5],
'level1':[0,0,0,0,1],
'level2':[1,0,1,0,1],
'level3':[0,1,1,1,0]})
#expected output:
new_df = pd.DataFrame({'id':[1,2,3,4,5],
'level1':[0,0,0,0,1],
'level2':[1,0,1,0,0],
'level3':[0,1,0,1,0]})
Using numpy.zeros and filling via numpy.argmax (with numpy imported as np):
out = np.zeros(df.iloc[:, 1:].shape, dtype=int)
out[np.arange(len(out)), np.argmax(df.iloc[:, 1:].values, 1)] = 1
df.iloc[:, 1:] = out
Using broadcasting with argmax:
a = df.iloc[:, 1:].values
df.iloc[:, 1:] = (a.argmax(axis=1)[:,None] == range(a.shape[1])).astype(int)
Both produce:
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
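You can use advanced indexing with NumPy. Updating the underlying NumPy array works here since you have a dataframe of int dtype. Note that this relies on df.values returning a writable view of the data, which is not the case when pandas Copy-on-Write is enabled.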
idx = df.iloc[:, 1:].eq(1).values.argmax(1)
df.iloc[:, 1:] = 0
df.values[np.arange(df.shape[0]), idx+1] = 1
print(df)
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
numpy.eye
v = df.iloc[:, 1:].values
i = np.eye(3, dtype=np.int64)
a = v.argmax(1)
df.iloc[:, 1:] = i[a]
df
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
cumsum and mask
df.set_index('id').pipe(
lambda d: d.mask(d.cumsum(1) > 1, 0)
).reset_index()
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
You can use get_dummies() by assigning a 1 to the maximum index
df[df.filter(like='level').columns] = pd.get_dummies(df.filter(like='level').idxmax(1))
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
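One difference between these answers only shows up for rows that contain no 1 at all, which the sample data does not include. argmax returns 0 for an all-zero row, so the argmax-based approaches would mark level1, while cumsum and mask leaves the row untouched. The sketch below uses a hypothetical input with such a row and shows one possible way to repair the argmax result.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the last row has no level set at all
df = pd.DataFrame({'id': [1, 2, 3],
                   'level1': [1, 0, 0],
                   'level2': [1, 1, 0],
                   'level3': [0, 1, 0]})
levels = df.set_index('id')
arr = levels.to_numpy()

# argmax-based: an all-zero row gets argmax 0, so level1 is set spuriously
via_argmax = np.eye(arr.shape[1], dtype=int)[arr.argmax(axis=1)]

# cumsum/mask: zero out everything after the first 1; all-zero rows stay zero
via_mask = levels.mask(levels.cumsum(axis=1) > 1, 0)

# Possible repair: blank out rows that had no 1 to begin with
via_argmax_fixed = via_argmax * arr.any(axis=1, keepdims=True)

print(via_argmax)
print(via_mask)
print(via_argmax_fixed)
```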
