How to build "many-hot" in Python/Pandas?


I need to combine three columns of categorical data into a single set of binary category-named columns. This is similar to a "one-hot" but the source rows have up to three categories instead of just one. Also, note that there are 100+ categories and I will not know them beforehand.
id, fruit1, fruit2, fruit3
1, apple, orange,
2, orange, ,
3, banana, apple,
should generate...
id, apple, banana, orange
1, 1, 0, 1
2, 0, 0, 1
3, 1, 1, 0

You could use pd.melt to combine all the fruit columns into one column, and then use pd.crosstab to create a frequency table:
import numpy as np
import pandas as pd
df = pd.read_csv('data')
df = df.replace(r' ', np.nan)
# id fruit1 fruit2 fruit3
# 0 1 apple orange NaN
# 1 2 orange NaN NaN
# 2 3 banana apple NaN
melted = pd.melt(df, id_vars=['id'])
result = pd.crosstab(melted['id'], melted['value'])
print(result)
yields
value apple banana orange
id
1 1 0 1
2 0 0 1
3 1 1 0
Explanation: The melted DataFrame looks like this:
In [148]: melted = pd.melt(df, id_vars=['id']); melted
Out[148]:
id variable value
0 1 fruit1 apple
1 2 fruit1 orange
2 3 fruit1 banana
3 1 fruit2 orange
4 2 fruit2 NaN
5 3 fruit2 apple
6 1 fruit3 NaN
7 2 fruit3 NaN
8 3 fruit3 NaN
We can ignore the variable column; only the id and value columns matter.
pd.crosstab can be used to create a frequency table with melted['id'] values in the index and melted['value'] values as the columns:
In [150]: pd.crosstab(melted['id'], melted['value'])
Out[150]:
value apple banana orange
id
1 1 0 1
2 0 0 1
3 1 1 0
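For anyone who wants to run this without a CSV file, here is a minimal self-contained sketch of the same melt + crosstab idea. The inline DataFrame is just the question's sample data typed by hand, and the final clip(upper=1) is my own addition (not part of the answer above): it caps the counts at 1 in case the same fruit appears twice in one row.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built inline instead of read from a file.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'fruit1': ['apple', 'orange', 'banana'],
    'fruit2': ['orange', np.nan, 'apple'],
    'fruit3': [np.nan, np.nan, np.nan],
})

# Long format: one row per (id, fruit column) pair.
melted = df.melt(id_vars='id')

# Count how often each fruit occurs per id; NaN values are ignored.
counts = pd.crosstab(melted['id'], melted['value'])

# Assumption: for a strict 0/1 "many-hot" table, cap repeated fruits at 1.
many_hot = counts.clip(upper=1)
print(many_hot)
```

Categories never have to be listed in advance: crosstab creates one column per distinct value it finds.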

You can apply value counts to each row:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'fruit1': ['Apple', 'Banana', np.nan],
'fruit2': ['Banana', np.nan, 'Apple'],
'fruit3': ['Grape', np.nan, np.nan],
})
df = df.apply(lambda row: row.value_counts(), axis=1).fillna(0).astype(int)
Before:
fruit1 fruit2 fruit3
0 Apple Banana Grape
1 Banana NaN NaN
2 NaN Apple NaN
After:
Apple Banana Grape
0 1 1 1
1 0 1 0
2 1 0 0
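A row-wise apply can be slow on large frames. As a possible alternative (a sketch, not something from the answer above), you can stack the columns into one long Series, one-hot encode it with pd.get_dummies, and take the per-row maximum. The variable names long and many_hot are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fruit1': ['Apple', 'Banana', np.nan],
    'fruit2': ['Banana', np.nan, 'Apple'],
    'fruit3': ['Grape', np.nan, np.nan],
})

# Stack the fruit columns into one long Series indexed by (row, column).
long = df.stack()

# One-hot encode every entry, then collapse back to one row per original
# index; max() yields 1 if the fruit appeared in any column of that row.
many_hot = pd.get_dummies(long, dtype=int).groupby(level=0).max()
print(many_hot)
```

Using max() instead of sum() keeps the result strictly 0/1 even if a fruit is repeated within a row.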

Related

Co-occurrence matrix from two data frames. Python

I have two data frames, Food and Drink.
food = {'fruit':['Apple', np.nan, 'Apple'],
'food':['Cake', 'Bread', np.nan]}
# Create DataFrame
food = pd.DataFrame(food)
fruit food
0 Apple Cake
1 NaN Bread
2 Apple NaN
drink = {'smoothie':['S_Strawberry', 'S_Watermelon', np.nan],
'tea':['T_white', np.nan, 'T_green']}
# Create DataFrame
drink = pd.DataFrame(drink)
smoothie tea
0 S_Strawberry T_white
1 S_Watermelon NaN
2 NaN T_green
The rows represent specific customers.
I would like to make a co-occurrence matrix of food and drinks.
expected outcome: (columns and ids do not have to be in this order)
Apple Bread Cake
S_Strawberry 1.0 NaN 1.0
S_Watermelon NaN 1.0 NaN
T_white 1.0 NaN 1.0
T_green 1.0 NaN NaN
So far I can make a co-occurrence matrix for each DataFrame separately, but I don't know how to combine the two data frames.
Thank you.

I think you want pd.get_dummies and matrix multiplication:
pd.get_dummies(drink, dtype=int).T @ pd.get_dummies(food, dtype=int)
Output:
fruit_Apple food_Bread food_Cake
smoothie_S_Strawberry 1 0 1
smoothie_S_Watermelon 0 1 0
tea_T_green 1 0 0
tea_T_white 1 0 1
You can get rid of the prefixes with:
pd.get_dummies(drink, prefix='', prefix_sep='', dtype=int).T @ pd.get_dummies(food, prefix='', prefix_sep='', dtype=int)
Output:
Apple Bread Cake
S_Strawberry 1 0 1
S_Watermelon 0 1 0
T_green 1 0 0
T_white 1 0 1
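Here is a self-contained sketch of the same idea using the question's data. The names drink_hot, food_hot, co and co_nan are mine, and the last step (turning zeros into NaN to match the asker's expected table) is an optional extra, not part of the answer above.

```python
import numpy as np
import pandas as pd

food = pd.DataFrame({'fruit': ['Apple', np.nan, 'Apple'],
                     'food': ['Cake', 'Bread', np.nan]})
drink = pd.DataFrame({'smoothie': ['S_Strawberry', 'S_Watermelon', np.nan],
                      'tea': ['T_white', np.nan, 'T_green']})

# One indicator column per item, one row per customer. dtype=int keeps the
# matrix product numeric (counts) rather than boolean.
drink_hot = pd.get_dummies(drink, prefix='', prefix_sep='', dtype=int)
food_hot = pd.get_dummies(food, prefix='', prefix_sep='', dtype=int)

# (drinks x customers) @ (customers x foods): number of customers per pair.
co = drink_hot.T @ food_hot

# Optional: show NaN instead of 0 for pairs that never co-occur.
co_nan = co.where(co > 0)
print(co_nan)
```

This relies on both frames having the same row index, since each row stands for the same customer.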

Handling values with multiple items in dataframe

Suppose my dataframe is:
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to list the items for each name, with duplicates removed.
Desired output:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
Thank you for reading.

Changed sample data with duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed, first use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
         .groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
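If you prefer to do the de-duplication inside the aggregation, something like the following should also work. It is only a sketch: the ', ' separator and the Values column name are taken from the question's desired output, and out is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name':  ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'E'],
    'Value': ['apple', 'apple', 'banana', 'banana', 'orange',
              'grape', 'apple', 'apple', 'apple', 'orange', 'banana'],
})

# Within each Name group, drop repeated values (keeping first-seen order)
# and join what is left into a single comma-separated string.
out = (df.groupby('Name')['Value']
         .agg(lambda s: ', '.join(s.drop_duplicates()))
         .reset_index(name='Values'))
print(out)
```

Because drop_duplicates keeps the first occurrence, the items stay in their original order (for example 'grape, apple' for B).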

Update pandas dataframe from another dataframe

I have two dataframes.
df1 is
name colour count
0 apple red 3
1 orange orange 2
3 kiwi green 12
df2 is
name count
0 kiwi 2
1 apple 1
I want to update df1 with count from df2 based on name match.
Expected result:
name colour count
0 apple red 1
1 orange orange 2
3 kiwi green 2
I am using this but the results are incorrect.
df3 = df2.combine_first(df1).reindex(df1.index)
How can I do this correctly?

Set name as the index in both DataFrames with DataFrame.set_index so that rows are matched by name, then use DataFrame.combine_first, and finally DataFrame.reset_index to turn the index back into a column:
df = df2.set_index('name').combine_first(df1.set_index('name')).reset_index()
print (df)
name colour count
0 apple red 1.0
1 kiwi green 2.0
2 orange orange 2.0
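Note that the result above is sorted by name and count becomes float. If you need to keep df1's original row order, index and integer dtype, one possible alternative (a sketch using Series.map, not the method from the answer above) is:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['apple', 'orange', 'kiwi'],
                    'colour': ['red', 'orange', 'green'],
                    'count': [3, 2, 12]},
                   index=[0, 1, 3])
df2 = pd.DataFrame({'name': ['kiwi', 'apple'], 'count': [2, 1]})

# Look up each name in df2; names missing from df2 come back as NaN.
new_counts = df1['name'].map(df2.set_index('name')['count'])

# Fall back to the old count where df2 had no entry, then restore int dtype.
df1['count'] = new_counts.fillna(df1['count']).astype(int)
print(df1)
```

This assumes the names in df2 are unique; with duplicate names the lookup is ambiguous and map will raise an error.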

how to map multiple records to one unique id

I have two data sets with a common unique ID (with duplicates in the second data frame).
I want to map all matching records to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
I want the output to look like:
df1
id col1
1 mango
straw
2 melon
3 banana
papaya
4 not available
5 not available
Thanks in advance

You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
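A runnable version with the question's data is sketched below. It uses how='left' rather than 'outer' (my choice, on the assumption that only ids present in df1 should appear in the result); for this sample both give the same rows. The name merged is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({
    'id': [1, 2, 1, 3, 3],
    'col1': ['mango', 'melon', 'straw', 'banana', 'papaya'],
})

# A left merge keeps every id from df1, repeats ids that have several
# matches in df2, and leaves NaN where df2 has no match.
merged = df1.merge(df2, on='id', how='left')

# Replace the unmatched entries with a placeholder string.
merged['col1'] = merged['col1'].fillna('not available')
print(merged)
```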

How to assign values based on multiple columns in pandas?

Is there an elegant way to assign values based on multiple columns in a dataframe in pandas? Let's say I have a dataframe with 2 columns: FruitType and Color.
import pandas as pd
df = pd.DataFrame({'FruitType':['apple', 'banana','kiwi','orange','loquat'],
'Color':['red_black','yellow','greenish_yellow', 'orangered','orangeyellow']})
I would like to assign the value of a third column, 'isYellowSeedless', based on both 'FruitType' and 'Color' columns.
I have a list of fruits that I consider seedless, and would like to check the Color column to see if it contains the str "yellow".
seedless = ['banana', 'loquat']
How do I string this all together elegantly?
This is my attempt that didn't work:
df[(df['FruitType'].isin(seedless)) & (culture_table['Color'].str.contains("yellow"))]['isYellowSeedless'] = True

Use loc with mask:
m = (df['FruitType'].isin(seedless)) & (df['Color'].str.contains("yellow"))
df.loc[m, 'isYellowSeedless'] = True
print (df)
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
If you need True and False output:
df['isYellowSeedless'] = m
print (df)
Color FruitType isYellowSeedless
0 red_black apple False
1 yellow banana True
2 greenish_yellow kiwi False
3 orangered orange False
4 orangeyellow loquat True
For an if-else choice between two scalars, use numpy.where:
df['isYellowSeedless'] = np.where(m, 'a', 'b')
print (df)
Color FruitType isYellowSeedless
0 red_black apple b
1 yellow banana a
2 greenish_yellow kiwi b
3 orangered orange b
4 orangeyellow loquat a
And to convert to 0 and 1:
df['isYellowSeedless'] = m.astype(int)
print (df)
Color FruitType isYellowSeedless
0 red_black apple 0
1 yellow banana 1
2 greenish_yellow kiwi 0
3 orangered orange 0
4 orangeyellow loquat 1
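The attempt in the question fails because df[mask]['col'] = value is chained indexing: df[mask] returns a new object, so the assignment lands on a temporary copy and df itself is unchanged. Below is a self-contained sketch of the mask approach. The na=False argument is my own addition; it makes a missing Color count as "no match" so the mask stays purely boolean.

```python
import pandas as pd

df = pd.DataFrame({
    'FruitType': ['apple', 'banana', 'kiwi', 'orange', 'loquat'],
    'Color': ['red_black', 'yellow', 'greenish_yellow', 'orangered', 'orangeyellow'],
})
seedless = ['banana', 'loquat']

# Build each condition separately so the combined mask is easy to read.
is_seedless = df['FruitType'].isin(seedless)
is_yellow = df['Color'].str.contains('yellow', na=False)

# Assigning the boolean mask directly creates the column in one step.
df['isYellowSeedless'] = is_seedless & is_yellow
print(df)
```

Note that kiwi is False even though its colour contains "yellow", because it is not in the seedless list.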

Or you can try
df['isYellowSeedless']=df.loc[df.FruitType.isin(seedless),'Color'].str.contains('yellow')
df
Out[546]:
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
