How to index data by elements of lists in columns? - python

I have the following DataFrame (named df2 later):
recipe_id ingredients
0 3332 [11307, 11322, 11632, 11338, 11478, 11438]
1 3333 [11322, 11338, 11632, 11314, 11682, 11478, 108...
2 3334 [11632, 11682, 11338, 11337, 10837, 11435, 113...
3 3335 [11149, 11322, 11532, 11996, 10616, 10837, 113...
4 3336 [11330, 11632, 11422, 11256, 11338, 11314, 114...
5 3812 [959, 92, 3, 554, 12271, 202]
...
I want to create another DataFrame which will have the following columns: ['ingredient', "recipe_id1", "recipe_id2", ..., "recipe_idn"], where n is the total number of recipes in the database. I did that with the following snippet:
columns = ['ingredient'] + (list(df2['recipe_id'].unique()))
ingredient_df = pd.DataFrame(columns=columns)
After I create this DataFrame (which I have already done) and populate it (the part I'm stuck on), the output should look like this:
In [1]:
# Create and populate ingredient_df by some method
columns = ['ingredient'] + (list(df2['recipe_id'].unique()))
ingredient_df = pd.DataFrame(columns=columns)
ingredient_df = populate_df(ingredient_df, df2)
Out [1]:
In [2]:
ingredient_df
Out[2]:
ingredient ... 3332 3333 3334 3335 3336 ...
...
11322 ... 1 1 0 1 0 ...
...
In the example above, value at (11322, 3334) is 0 because ingredient 11322 is not present in recipe with id 3334.
In other words, for every ingredient I want the mapping (ingredient, recipe_id) = 1 if the ingredient is present in that recipe, and 0 otherwise.
I've managed to do this by iterating over all recipes and through all ingredients, but this is very slow. How can I do this in a more robust and elegant way using Pandas methods (if this is possible at all)?
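With pandas 0.25 or later, the loop can be replaced by explode plus crosstab. A minimal sketch, using a hypothetical miniature stand-in for df2 (two recipes, a few ingredients):

```python
import pandas as pd

# Hypothetical miniature stand-in for df2 from the question.
df2 = pd.DataFrame({
    'recipe_id': [3332, 3333],
    'ingredients': [[11307, 11322], [11322, 11338]],
})

# explode yields one (recipe_id, ingredient) pair per row;
# crosstab then counts pairs into an ingredient x recipe matrix.
long_df = df2.explode('ingredients')
ingredient_df = pd.crosstab(long_df['ingredients'], long_df['recipe_id'])
```

Here the ingredient ids become the index and the recipe ids the columns, with 1 where the pair occurs and 0 otherwise.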

setup
import pandas as pd
import numpy as np

df = pd.DataFrame(
    dict(
        recipe_id=list('abcde'),
        ingredients=[list('xyz'),
                     list('tuv'),
                     list('ytw'),
                     list('vy'),
                     list('zxs')]
    )
)[['recipe_id', 'ingredients']]
df
df
recipe_id ingredients
0 a [x, y, z]
1 b [t, u, v]
2 c [y, t, w]
3 d [v, y]
4 e [z, x, s]
method 1
df.set_index('recipe_id').ingredients.apply(pd.value_counts) \
    .fillna(0).astype(int).T.rename_axis('ingredients')
recipe_id a b c d e
ingredients
s 0 0 0 0 1
t 0 1 1 0 0
u 0 1 0 0 0
v 0 1 0 1 0
w 0 0 1 0 0
x 1 0 0 0 1
y 1 0 1 1 0
z 1 0 0 0 1
method 2
idx = np.repeat(df.index.values, df.ingredients.str.len())
df1 = df.drop('ingredients', axis=1).loc[idx]
df1['ingredients'] = df.ingredients.sum()
df1.groupby('ingredients').recipe_id.apply(pd.value_counts) \
    .unstack(fill_value=0).rename_axis('recipe_id', axis=1)
recipe_id a b c d e
ingredients
s 0 0 0 0 1
t 0 1 1 0 0
u 0 1 0 0 0
v 0 1 0 1 0
w 0 0 1 0 0
x 1 0 0 0 1
y 1 0 1 1 0
z 1 0 0 0 1
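The same matrix can be cross-checked with a third route: join each ingredient list into a delimited string and let Series.str.get_dummies split it back out. A sketch over the same setup frame:

```python
import pandas as pd

df = pd.DataFrame(dict(
    recipe_id=list('abcde'),
    ingredients=[list('xyz'), list('tuv'), list('ytw'),
                 list('vy'), list('zxs')],
))[['recipe_id', 'ingredients']]

# Join each list with '|' (get_dummies' default separator), one-hot
# encode, then transpose so ingredients become the rows.
result = (df.set_index('recipe_id')['ingredients']
            .str.join('|')
            .str.get_dummies()
            .T)
```

This produces the same ingredient-by-recipe 0/1 table as methods 1 and 2.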

Related

Match a data frame columns to another data frame rows content

I have a pandas data frame whose columns are as follows:
A  B  C  D  ...  Z
and another data frame in which every column has zero or more letters as follows:
Letters
A,C,D
A,B,F
A,H,G
A
B,F
None
I want to match the two dataframes to have something like this:
A  B  C  D  ...  Z
1  0  1  1  0    0
First, make an example and the desired output for the answer.
Example:
data = ['A,C,D', 'A,B,F', 'A,E,G', None]
df = pd.DataFrame(data, columns=['letter'])
df :
letter
0 A,C,D
1 A,B,F
2 A,E,G
3 None
get_dummies and groupby
pd.get_dummies(df['letter'].str.split(',').explode()).groupby(level=0).sum()
output:
A B C D E F G
0 1 0 1 1 0 0 0
1 1 1 0 0 0 1 0
2 1 0 0 0 1 0 1
3 0 0 0 0 0 0 0

How to split comma separated text into columns on pandas dataframe?

I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix that has as header all the unique values from column Data, meaning [a, b, c, d, e], and as rows a flag indicating whether each value is present in that particular row.
The matrix should look like this:
Data
a
b
c
d
e
a,b,c
1
1
1
0
0
a,c,d
1
0
1
1
0
d,e
0
0
0
1
1
a,e
1
0
0
0
1
a,b,c,d,e
1
1
1
1
1
To separate column Data what I did is:
df['data'].str.split(',', expand = True)
Then I don't know how to proceed to allocate the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep=r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, then apply value_counts to each row with axis=1.
Finally, fill NaN with zero and cast to integer with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
You could apply a custom count function for each key:
for k in ["a", "b", "c", "d", "e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
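One caveat: str.count(k) counts substring occurrences, which is fine for single-character keys but would over-count keys that are substrings of one another (e.g. 'a' inside 'ab'). A split-based variant avoids this; a sketch, assuming the same 'Data' column:

```python
import pandas as pd

df = pd.DataFrame({'Data': ['a,b,c', 'a,c,d', 'd,e', 'a,e', 'a,b,c,d,e']})

for k in ['a', 'b', 'c', 'd', 'e']:
    # Membership test on the split list: exact token match only.
    df[k] = df['Data'].str.split(',').apply(lambda items: int(k in items))
```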

Pandas: occurrence matrix from one hot encoding from pandas dataframe

I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from dataframe, but if I have columns name in list instead of one hot like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
                .reindex(unique_categories, axis=0)
                .reindex(unique_categories, axis=1)).fillna(0)
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How to get the occurrence matrix directly from one hot dataframe?
You can have some fun with matrix math!
import numpy as np

u = np.diag(np.ones(data.shape[1], dtype=bool))  # boolean identity matrix
data.T.dot(data) * (~u)                          # zero out the diagonal
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
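Why this works: data.T.dot(data) is the Gram matrix, whose entry (i, j) counts the rows where columns i and j are both 1; the diagonal (each column with itself) is then masked out. An equivalent variant zeroes the diagonal in place with np.fill_diagonal, sketched on the question's dummy_data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [1, 1, 1, 0],
                     'c': [0, 1, 0, 1], 'd': [1, 1, 1, 0]})

# Gram matrix: entry (i, j) counts rows where columns i and j are both 1.
co = data.T.dot(data)
np.fill_diagonal(co.values, 0)  # drop self-co-occurrence
```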

Python - Column-wise keep first unique value

I have a dataframe that has multiple columns that represent whether or not something had existed, but they are ordinal in nature. Something could have existed in all 3 categories, but I only want to indicate the highest level that it existed in.
So for a given row, I only want a single '1' value, but I want it kept at the highest level it was found at.
For this row:
1,1,0, I would want the row to be changed to 1,0,0
and for this row:
0,1,1, I would want the row to be changed to 0,1,0
Here is a sample of what the data could look like, and expected output:
import pandas as pd
import numpy as np

# input data
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'level1': [0, 0, 0, 0, 1],
                   'level2': [1, 0, 1, 0, 1],
                   'level3': [0, 1, 1, 1, 0]})

# expected output:
new_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                       'level1': [0, 0, 0, 0, 1],
                       'level2': [1, 0, 1, 0, 0],
                       'level3': [0, 1, 0, 1, 0]})
Using numpy.zeros and filling via numpy.argmax:
out = np.zeros(df.iloc[:, 1:].shape, dtype=int)
out[np.arange(len(out)), np.argmax(df.iloc[:, 1:].values, 1)] = 1
df.iloc[:, 1:] = out
Using broadcasting with argmax:
a = df.iloc[:, 1:].values
df.iloc[:, 1:] = (a.argmax(axis=1)[:,None] == range(a.shape[1])).astype(int)
Both produce:
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
You can use advanced indexing with NumPy. Updating underlying NumPy array works here since you have a dataframe of int dtype.
idx = df.iloc[:, 1:].eq(1).values.argmax(1)
df.iloc[:, 1:] = 0
df.values[np.arange(df.shape[0]), idx+1] = 1
print(df)
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
numpy.eye
v = df.iloc[:, 1:].values
i = np.eye(3, dtype=np.int64)
a = v.argmax(1)
df.iloc[:, 1:] = i[a]
df
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
cumsum and mask
df.set_index('id').pipe(
    lambda d: d.mask(d.cumsum(1) > 1, 0)
).reset_index()
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
You can use get_dummies() on idxmax(1), which places a 1 in the column of each row's first maximum:
df[df.filter(like='level').columns] = pd.get_dummies(df.filter(like='level').idxmax(1))
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0

Pandas - Map - Dummy Variables - Assign value of 1

I have two dataframes, x.head() looks like this:
top mid adc support jungle
Irelia Ahri Jinx Janna RekSai
Gnar Ahri Caitlyn Leona Rengar
Renekton Fizz Sivir Annie Rengar
Irelia Leblanc Sivir Thresh JarvanIV
Gnar Lissandra Tristana Janna JarvanIV
and dataframe fullmatrix.head() that I have created looks like this:
Irelia Gnar Renekton Kassadin Sion Jax Lulu Maokai Rumble Lissandra ... XinZhao Amumu Udyr Ivern Shaco Skarner FiddleSticks Aatrox Volibear MonkeyKing
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ...
Now what I cannot figure out is how to assign a value of 1 for each name in the x dataframe to the respective column that has the same name in the fullmatrix dataframe row by row (both dataframes have the same number of rows).
I'm sure this can be improved but one advantage is that it only requires the first DataFrame, and it's conceptually nice to chain operations until you get the desired solution.
fullmatrix = (x.stack()
               .reset_index(name='names')
               .pivot(index='level_0', columns='names', values='names')
               .notna()
               .astype(int)
               .reset_index(drop=True))
Note that only the names that appear in your x DataFrame will appear as columns in fullmatrix. If you want the additional columns, you can simply perform a join.
Consider adding a key = 1 column and then iterating through each column for a list of pivoted dfs which you then horizontally merge with pd.concat. Finally run a DataFrame.update() to update original fullmatrix with values from pvt_df, aligned to indices.
x['key'] = 1
dfs = []
for col in x.columns[:-1]:
    dfs.append(x.pivot_table(index=x.index, columns=[col], values='key').fillna(0))

pvt_df = pd.concat(dfs, axis=1).astype(int)

fullmatrix.update(pvt_df)
fullmatrix = fullmatrix.astype(int)
fullmatrix # ONLY FOR VISIBLE COLUMNS IN ORIGINAL POST
# Irelia Gnar Renekton Kassadin Sion Jax Lulu Maokai Rumble Lissandra XinZhao Amumu Udyr Ivern Shaco Skarner FiddleSticks Aatrox Volibear MonkeyKing
# 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The OP wants to build a table of dummy variables from a set of data points; each data point carries 5 attributes, and there are N unique attributes in total.
We will use a simplified dataset to demonstrate how to do it:
5 unique attributes
3 data entries
each data entry contains 3 attributes
x = pd.DataFrame([['a', 'b', 'c'],
                  ['b', 'd', 'e'],
                  ['e', 'b', 'a']])
fullmatrix = pd.DataFrame([[0 for _ in range(5)] for _ in range(3)],
                          columns=['a', 'b', 'c', 'd', 'e'])
""" fullmatrix:
a b c d e
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
"""
# each row of x joined into one string of attributes delimited by ","
x_row_joined = pd.Series((",".join(row[1]) for row in x.iterrows()))
fullmatrix = x_row_joined.str.get_dummies(sep=',')
The method is inspired by offbyone's answer. It uses pandas.Series.str.get_dummies: we first join each row of x with a chosen delimiter, then let Series.str.get_dummies split on that same delimiter and generate the dummy-variable table for you. (Caution: don't pick a sep that already occurs in x.)
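If fullmatrix must keep columns for names that never occur in x, the dummy table can be reindexed to the full column set afterwards. A sketch with a hypothetical extra column 'f':

```python
import pandas as pd

x = pd.DataFrame([['a', 'b', 'c'],
                  ['b', 'd', 'e'],
                  ['e', 'b', 'a']])
all_cols = ['a', 'b', 'c', 'd', 'e', 'f']  # 'f' never appears in x

# Join each row into one comma-delimited string, one-hot encode,
# then reindex so missing columns appear as all-zero.
joined = x.apply(','.join, axis=1)
fullmatrix = (joined.str.get_dummies(sep=',')
                    .reindex(columns=all_cols, fill_value=0))
```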
