Pandas: occurrence matrix from one hot encoding from pandas dataframe - python

I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from this dataframe. If I have the column names as lists instead of one-hot encoding, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
.reindex(unique_categories, axis=0).reindex(unique_categories, axis=1)).fillna(0)
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How can I get the occurrence matrix directly from the one-hot dataframe?

You can have some fun with matrix math!
import numpy as np
# data is the one-hot dataframe from the question
u = np.diag(np.ones(data.shape[1], dtype=bool))
data.T.dot(data) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
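As a variation on the same idea, here is a minimal sketch that zeroes the diagonal with np.fill_diagonal instead of a boolean mask. It assumes the one-hot frame is the `data` variable from the question; the names `co` and `occurrence` are only illustrative.

```python
import numpy as np
import pandas as pd

# One-hot frame from the question.
data = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [1, 1, 1, 0],
                     'c': [0, 1, 0, 1], 'd': [1, 1, 1, 0]})

# For every pair of columns, count the rows where both are 1.
co = data.T.dot(data)

# The diagonal holds each column's own total, so zero it on a copy.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)
occurrence = pd.DataFrame(values, index=co.index, columns=co.columns)
print(occurrence)
```

The diagonal of `co` (for example `co.loc['b', 'b']`) is the number of rows in which that category appears, which can be useful to keep around before it is zeroed.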

Related

Match a data frame columns to another data frame rows content

I have a pandas data frame with the following columns:
A B C D ... Z
and another data frame in which every row has zero or more letters as follows:
Letters
A,C,D
A,B,F
A,H,G
A
B,F
None
I want to match the two dataframes to have something like this:
A B C D ... Z
1 0 1 1 0 0
Please include a minimal example and the desired output in the question. For this answer I will use the following example:
data = ['A,C,D', 'A,B,F', 'A,E,G', None]
df = pd.DataFrame(data, columns=['letter'])
df :
letter
0 A,C,D
1 A,B,F
2 A,E,G
3 None
Use get_dummies and groupby:
pd.get_dummies(df['letter'].str.split(',').explode()).groupby(level=0).sum()
output:
A B C D E F G
0 1 0 1 1 0 0 0
1 1 1 0 0 0 1 0
2 1 0 0 0 1 0 1
3 0 0 0 0 0 0 0
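Since the question asks for the full set of columns A to Z, the result above could be aligned to that alphabet with reindex. This is only a sketch: it assumes the target columns are the 26 uppercase ASCII letters, and the names `all_letters` and `full` are made up for illustration.

```python
import string
import pandas as pd

df = pd.DataFrame(['A,C,D', 'A,B,F', 'A,E,G', None], columns=['letter'])

# Same get_dummies + groupby step as in the answer.
dummies = (pd.get_dummies(df['letter'].str.split(',').explode())
             .groupby(level=0).sum())

# Assumption: the wanted columns are exactly A..Z.
all_letters = list(string.ascii_uppercase)

# Letters that never occur get an all-zero column.
full = dummies.reindex(columns=all_letters, fill_value=0).astype(int)
print(full.loc[:, 'A':'H'])
```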

How to split comma separated text into columns on pandas dataframe?

I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix that has as header all the unique values from column Data, meaning [a,b,c,d,e]. Then as rows a flag indicating if the value is at that particular row.
The matrix should look like this:
Data a b c d e
a,b,c 1 1 1 0 0
a,c,d 1 0 1 1 0
d,e 0 0 0 1 1
a,e 1 0 0 0 1
a,b,c,d,e 1 1 1 1 1
To split the Data column, I did:
df['Data'].str.split(',', expand=True)
But I don't know how to proceed from there to assign the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep=r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, then call value_counts on each row with axis=1.
Finally, fill the NaN values with zero using fillna and convert the data to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
Then concatenate it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
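To keep the original column next to the flags, the get_dummies result can be joined back onto the frame. A minimal sketch, assuming the column is named "Data" as in the question; `flags` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One 0/1 column per distinct comma-separated value.
flags = df["Data"].str.get_dummies(sep=",")

# Both frames share the same index, so join lines the rows up.
result = df.join(flags)
print(result)
```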
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
.explode('data_list')
.pivot_table(index='Data',
columns='data_list',
aggfunc=lambda x: 1,
fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
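One caveat with str.count is that it counts substrings, so it only gives 0/1 flags when no key is contained in another. The sketch below checks membership in the split list instead; the sample data with the key "ab" is hypothetical and chosen only to show the difference.

```python
import pandas as pd

# Hypothetical data where one key ("a") is a substring of another ("ab").
df = pd.DataFrame({"Data": ["a,ab", "ab", "b,a"]})

# Split once, then test exact membership for each key.
tokens = df["Data"].str.split(",")
for k in ["a", "ab", "b"]:
    df[k] = tokens.apply(lambda items, k=k: int(k in items))

print(df)
# Substring counting would report 2 for "a" in the first row:
print("a,ab".count("a"))
```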

pd.get_dummies() for DataFrames with multi-index columns

Using get_dummies(), it is possible to create one-hot encoded dummy variables for categorical data. For example:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
'B': ['b', 'a', 'c']})
print(pd.get_dummies(df))
# A_a A_b B_a B_b B_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
So far, so good. But how can I use get_dummies() in combination with multi-index columns? The default behavior is not very practical: The multi-index tuple is converted into a string and the same suffix mechanism applies as with the simple-index columns.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df)
print(ret)
print(type(ret.columns[0]))
# ('i','A')_a ('i','A')_b ('ii','B')_a ('ii','B')_b ('ii','B')_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# <class 'str'>
What I would like to get, however, is that the dummies create a new column level:
ret = pd.get_dummies(df, ???)
print(ret)
print(type(ret.columns[0]))
# i ii
# A B
# a b a b c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# <class 'tuple'>
#
# Note that the ret would be equivalent to the following:
# ('i','A','a') ('i','A','b') ('ii','B','a') ('ii','B','b') ('ii','B','c')
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
How could this be achieved?
Update: I placed a feature request for better support of multi-index data-frames in get_dummies: https://github.com/pandas-dev/pandas/issues/26560
You can parse the column names and rename them:
import ast
def parse_dummy(x):
    parts = x.split('_')
    return ast.literal_eval(parts[0]) + (parts[1],)
ret.columns = pd.Series(ret.columns).apply(parse_dummy)
# (i, A, a) (i, A, b) (ii, B, a) (ii, B, b) (ii, B, c)
#0 1 0 0 1 0
#1 0 1 1 0 0
#2 1 0 0 0 1
Note that this DataFrame is not the same as a DataFrame with three-level multiindex column names.
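If a genuine three-level MultiIndex is wanted, one possible route (a sketch, not necessarily what the answer above intended) is to build the dummies column by column and let pd.concat use the original column tuples as keys. It relies on pd.concat accepting tuple keys for the outer levels.

```python
import pandas as pd

df = pd.DataFrame({('i', 'A'): ['a', 'b', 'a'],
                   ('ii', 'B'): ['b', 'a', 'c']})

# Each original column tuple becomes the two outer levels;
# the dummy values become the innermost level.
ret = pd.concat({col: pd.get_dummies(df[col]) for col in df.columns},
                axis=1).astype(int)

print(ret)
print(ret.columns.nlevels)
```

With a real MultiIndex, selections such as `ret['i']` or `ret[('ii', 'B')]` work level by level, which is not the case when the column labels are plain strings.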
I had a similar need, but with a more complex DataFrame: a multi-index as the row index, plus numerical columns that should not be converted to dummies. So my case required scanning through the columns, expanding only the columns of dtype='object' into dummies, and building a new column index by concatenating the name of the original column with the dummy value itself. I did this because I didn't want to add a new column index level.
Here is the code. First, build a dataframe in the format I need:
import pandas as pd
import numpy as np
df_size = 3
objects = ['obj1','obj2']
attributes = ['a1','a2','a3']
cols = pd.MultiIndex.from_product([objects, attributes], names=['objects', 'attributes'])
lab1 = ['car','truck','van']
lab2 = ['bay','zoe','ros','lol']
time = np.arange(df_size)
node = ['n1','n2']
idx = pd.MultiIndex.from_product([time, node], names=['time', 'node'])
df = pd.DataFrame(np.random.randint(10,size=(len(idx),len(cols))),columns=cols,index=idx)
c1 = map(lambda i:lab1[i],np.random.randint(len(lab1),size=len(idx)))
c2 = map(lambda i:lab2[i],np.random.randint(len(lab2),size=len(idx)))
df[('obj1','a3')]=list(c1)
df[('obj2','a2')]=list(c2)
print(df)
objects obj1 obj2
attributes a1 a2 a3 a1 a2 a3
time node
0 n1 6 5 truck 3 ros 3
n2 5 6 car 9 lol 7
1 n1 0 8 truck 7 zoe 8
n2 4 3 truck 8 bay 3
2 n1 5 8 van 0 bay 0
n2 4 8 car 5 lol 4
And here is the code to dummify only the object columns
for t in [df.columns[i] for i,dt in enumerate(df.dtypes) if dt=='object']:
    dummy_block = pd.get_dummies( df[t] )
    dummy_block.columns = pd.MultiIndex.from_product([[t[0]],[f'{t[1]}_{c}' for c in dummy_block.columns]],
                                                     names=df.columns.names)
    df = pd.concat([df.drop(t,axis=1),dummy_block],axis=1).sort_index(axis=1)
print(df)
objects obj1 obj2
attributes a1 a2 a3_car a3_truck a3_van a1 a2_bay a2_lol a2_ros a2_zoe a3
time node
0 n1 6 5 0 1 0 3 0 0 1 0 3
n2 5 6 1 0 0 9 0 1 0 0 7
1 n1 0 8 0 1 0 7 0 0 0 1 8
n2 4 3 0 1 0 8 1 0 0 0 3
2 n1 5 8 0 0 1 0 1 0 0 0 0
n2 4 8 1 0 0 5 0 1 0 0 4
It can easily be adapted to the original use case by adding one more level to the column multi-index.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
print(df)
i ii
A B
0 a b
1 b a
2 a c
df.columns = pd.MultiIndex.from_tuples([t+('',) for t in df.columns])
for t in [df.columns[i] for i,dt in enumerate(df.dtypes) if dt=='object']:
    dummy_block = pd.get_dummies( df[t] )
    dummy_block.columns = pd.MultiIndex.from_product([[t[0]],[t[1]],[c for c in dummy_block.columns]],
                                                     names=df.columns.names)
    df = pd.concat([df.drop(t,axis=1),dummy_block],axis=1).sort_index(axis=1)
print(df)
i ii
A B
a b a b c
0 1 0 0 1 0
1 0 1 1 0 0
2 1 0 0 0 1
Note that it still works if there are numerical columns: it just adds an empty additional level to them in the column index as well.

Splitting a concatenated string into separated columns using pandas

I have a pandas dataframe consisting of one column containing strings separated by "/". I would like to split these separated strings into new columns, each holding a boolean flag (whether the value exists in that row).
d = {'col1': ["A/B/C", "B/C", "D/B/A", "C/B"]}
dataFrame = pd.DataFrame(data=d)
col1
0 A/B/C
1 B/C
2 D/B/A
3 C/B
The result would be as follows:
d = {'A': [1, 0, 1, 0], 'B':[1,1,1,1], 'C':[1,1,0,1], 'D':[0,0,1,0]}
dataFrame = pd.DataFrame(data=d)
A B C D
0 1 1 1 0
1 0 1 1 0
2 1 1 0 1
3 0 1 1 0
I have attempted with pandas.Series.str.split and pandas.pivot but nothing quite returns the result I am looking for. Any help or nudges in the right direction, would be highly appreciated!
Use pandas.Series.str.get_dummies
df.col1.str.get_dummies('/')
A B C D
0 1 1 1 0
1 0 1 1 0
2 1 1 0 1
3 0 1 1 0
Setup
d = {'col1': ["A/B/C", "B/C", "D/B/A", "C/B"]}
df = pd.DataFrame(data=d)
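Putting the setup and the call together, and assuming real booleans are wanted rather than 0/1 integers, a minimal sketch could look like this (`flags` and `as_bool` are illustrative names):

```python
import pandas as pd

d = {'col1': ["A/B/C", "B/C", "D/B/A", "C/B"]}
df = pd.DataFrame(data=d)

# 0/1 indicator columns, one per distinct "/"-separated value.
flags = df['col1'].str.get_dummies(sep='/')

# Optional: convert the indicators to True/False.
as_bool = flags.astype(bool)
print(as_bool)
```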

Transform the relationship data with weight into a Matrix in python

The input data (data.txt) has this format:
col1 col2 weight
a b 1
a c 2
a d 0
b c 3
b d 0
c d 0
I want the output data (result.txt) to have this format:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
I would use pandas in this way
import pandas as pd
# Read your data from a .csv file
df = pd.read_csv('yourdata.csv')
# Pivot table
mat = pd.pivot_table(df,index='col1',columns='col2',values='weight')
# Rebuild the index
index = mat.index.union(mat.columns)
# Build the new full matrix and fill NaN values with 0
mat = mat.reindex(index=index, columns=index).fillna(0)
# Make the matrix symmetric
m = mat + mat.T
This returns:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
EDIT: instead of pivot_table() you can also use:
mat = df.pivot(index='col1',columns='col2',values='weight')
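For a quick check without a file on disk, the same steps can be run on the sample from the question. This sketch assumes the input is whitespace-separated, as data.txt appears to be, and reads it from an in-memory string; the final cast to int is an addition for tidier output.

```python
import io
import pandas as pd

text = """col1 col2 weight
a b 1
a c 2
a d 0
b c 3
b d 0
c d 0
"""

# Assumption: columns are separated by whitespace.
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

mat = df.pivot(index='col1', columns='col2', values='weight')

# Use the union of row and column labels so the matrix is square.
index = mat.index.union(mat.columns)
mat = mat.reindex(index=index, columns=index).fillna(0)

# Mirror the upper triangle into the lower one.
m = (mat + mat.T).astype(int)
print(m)
```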
Assign integer indices to a, b, c, d, treat col1 as i and col2 as j, and fill the matrix row by row. For example, for the first row, i = 0 and j = 1, so weights(i, j) = 1.
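The index-based approach could be sketched with NumPy as follows. The `rows` list simply restates the sample data; `pos` and `weights` are illustrative names, and mirroring each entry to (j, i) is an assumption made to match the symmetric output in the question.

```python
import numpy as np

rows = [("a", "b", 1), ("a", "c", 2), ("a", "d", 0),
        ("b", "c", 3), ("b", "d", 0), ("c", "d", 0)]

# Map each label to an integer position: a -> 0, b -> 1, ...
labels = sorted({label for r in rows for label in r[:2]})
pos = {label: i for i, label in enumerate(labels)}

weights = np.zeros((len(labels), len(labels)), dtype=int)
for col1, col2, w in rows:
    i, j = pos[col1], pos[col2]
    weights[i, j] = w
    weights[j, i] = w  # keep the matrix symmetric

print(labels)
print(weights)
```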
