How to build "many-hot" in Python/Pandas? - python

I need to combine three columns of categorical data into a single set of binary category-named columns. This is similar to a "one-hot" but the source rows have up to three categories instead of just one. Also, note that there are 100+ categories and I will not know them beforehand.
id, fruit1, fruit2, fruit3
1, apple, orange,
2, orange, ,
3, banana, apple,
should generate...
id, apple, banana, orange
1, 1, 0, 1
2, 0, 0, 1
3, 1, 1, 0

You could use pd.melt to combine all the fruit columns into one column, and then use pd.crosstab to create a frequency table:
import numpy as np
import pandas as pd
df = pd.read_csv('data')
df = df.replace(r' ', np.nan)
# id fruit1 fruit2 fruit3
# 0 1 apple orange NaN
# 1 2 orange NaN NaN
# 2 3 banana apple NaN
melted = pd.melt(df, id_vars=['id'])
result = pd.crosstab(melted['id'], melted['value'])
print(result)
yields
value apple banana orange
id
1 1 0 1
2 0 0 1
3 1 1 0
Explanation: The melted DataFrame looks like this:
In [148]: melted = pd.melt(df, id_vars=['id']); melted
Out[148]:
id variable value
0 1 fruit1 apple
1 2 fruit1 orange
2 3 fruit1 banana
3 1 fruit2 orange
4 2 fruit2 NaN
5 3 fruit2 apple
6 1 fruit3 NaN
7 2 fruit3 NaN
8 3 fruit3 NaN
We can ignore the variable column; only the id and value columns matter.
pd.crosstab can be used to create a frequency table with melted['id'] values in the index and melted['value'] values as the columns:
In [150]: pd.crosstab(melted['id'], melted['value'])
Out[150]:
value apple banana orange
id
1 1 0 1
2 0 0 1
3 1 1 0
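For readers who want to run the melt/crosstab approach without a CSV file, here is a minimal sketch. It assumes the sample data is typed in by hand with None for the empty cells, and it adds a clip(upper=1) step (not part of the answer above) so that a fruit repeated within one row would still give a 0/1 flag rather than a count. The names fruit and many_hot are illustrative.

```python
import pandas as pd

# Inline copy of the sample data, so no CSV file is needed.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "fruit1": ["apple", "orange", "banana"],
    "fruit2": ["orange", None, "apple"],
    "fruit3": [None, None, None],
})

# Long format: one row per (id, fruit slot); empty slots are dropped.
melted = df.melt(id_vars="id", value_name="fruit").dropna(subset=["fruit"])

# Count each fruit per id, then cap at 1 so repeats still yield a 0/1 flag.
many_hot = pd.crosstab(melted["id"], melted["fruit"]).clip(upper=1)
print(many_hot)
```

Because the categories are discovered from the data, nothing has to be listed beforehand, which fits the "100+ unknown categories" requirement in the question.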

You can apply value counts to each row:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'fruit1': ['Apple', 'Banana', np.nan],
'fruit2': ['Banana', np.nan, 'Apple'],
'fruit3': ['Grape', np.nan, np.nan],
})
# astype(int) replaces applymap(int), which is deprecated in recent pandas versions
df = df.apply(lambda row: row.value_counts(), axis=1).fillna(0).astype(int)
Before:
fruit1 fruit2 fruit3
0 Apple Banana Grape
1 Banana NaN NaN
2 NaN Apple NaN
After:
Apple Banana Grape
0 1 1 1
1 0 1 0
2 1 0 0
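A row-wise apply can be slow on large frames. As a possible alternative (a sketch, not the answerer's method), pd.get_dummies can encode all columns at once, and duplicate category labels can then be merged with a max. The variable names dummies and many_hot are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fruit1": ["Apple", "Banana", np.nan],
    "fruit2": ["Banana", np.nan, "Apple"],
    "fruit3": ["Grape", np.nan, np.nan],
})

# One indicator column per (source column, category); the empty prefix
# leaves bare category names, so labels may repeat across source columns.
dummies = pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)

# Merge columns that share a label: a category counts as present
# if any of the source columns contained it.
many_hot = dummies.T.groupby(level=0).max().T
print(many_hot)
```

The max makes the result strictly 0/1 even if the same fruit appears twice in one row; use sum instead if counts are wanted.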

Related

Co-occurrence matrix from two data frames. Python

I have two data frames, Food and Drink.
food = {'fruit':['Apple', np.nan, 'Apple'],
'food':['Cake', 'Bread', np.nan]}
# Create DataFrame
food = pd.DataFrame(food)
fruit food
0 Apple Cake
1 NaN Bread
2 Apple NaN
drink = {'smoothie':['S_Strawberry', 'S_Watermelon', np.nan],
'tea':['T_white', np.nan, 'T_green']}
# Create DataFrame
drink = pd.DataFrame(drink)
smoothie tea
0 S_Strawberry T_white
1 S_Watermelon NaN
2 NaN T_green
The rows represent specific customers.
I would like to make a co-occurrence matrix of food and drinks.
expected outcome: (columns and ids do not have to be in this order)
Apple Bread Cake
S_Strawberry 1.0 NaN 1.0
S_Watermelon NaN 1.0 NaN
T_white 1.0 NaN 1.0
T_green 1.0 NaN NaN
So far I can make a co-occurrence matrix for each df, but I don't know how to combine the two data frames.
Thank you.
I think you want pd.get_dummies and matrix multiplication:
pd.get_dummies(drink).T @ pd.get_dummies(food)
Output:
fruit_Apple food_Bread food_Cake
smoothie_S_Strawberry 1 0 1
smoothie_S_Watermelon 0 1 0
tea_T_green 1 0 0
tea_T_white 1 0 1
You can get rid of the prefixes with:
pd.get_dummies(drink, prefix='', prefix_sep='').T @ pd.get_dummies(food, prefix='', prefix_sep='')
Output:
Apple Bread Cake
S_Strawberry 1 0 1
S_Watermelon 0 1 0
T_green 1 0 0
T_white 1 0 1
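The dummy-encoding and matrix-multiplication idea can be run end to end as follows. This is a sketch that assumes the two frames share the same row index (one row per customer). It also passes dtype=int, which is an addition here: recent pandas versions return boolean dummies by default, and integer indicators keep the product a count. The names drink_hot, food_hot and cooc are illustrative.

```python
import numpy as np
import pandas as pd

food = pd.DataFrame({"fruit": ["Apple", np.nan, "Apple"],
                     "food": ["Cake", "Bread", np.nan]})
drink = pd.DataFrame({"smoothie": ["S_Strawberry", "S_Watermelon", np.nan],
                      "tea": ["T_white", np.nan, "T_green"]})

# Integer indicator columns, one per item; NaN cells produce no indicator.
drink_hot = pd.get_dummies(drink, prefix="", prefix_sep="", dtype=int)
food_hot = pd.get_dummies(food, prefix="", prefix_sep="", dtype=int)

# (drinks x customers) @ (customers x foods): each cell counts the
# customers who have both that drink and that food.
cooc = drink_hot.T @ food_hot
print(cooc)
```

Cells with no shared customer come out as 0 rather than the NaN shown in the question; cooc.replace(0, np.nan) would reproduce that layout if it is required.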

Handling values with multiple items in dataframe

Suppose my dataframe is:
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items for each name, with duplicates removed.
The output I want:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
thank you for reading
Changed sample data with duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed, first use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If the duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
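The deduplication can also happen inside the aggregation itself. The sketch below is one possible variant, not the answerer's code: it calls Series.unique within each group and joins with ", " to match the separator in the requested output. The result column name Values is taken from the question; out is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "A", "B", "B", "C", "D", "D", "E"],
    "Value": ["apple", "apple", "banana", "banana", "orange",
              "grape", "apple", "apple", "apple", "orange", "banana"],
})

# Series.unique keeps first-appearance order, so each group lists
# every item once, in the order it first occurred.
out = (df.groupby("Name")["Value"]
         .agg(lambda s: ", ".join(s.unique()))
         .reset_index(name="Values"))
print(out)
```

This leaves the original frame untouched, which may be convenient if the duplicate rows are still needed elsewhere.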

Update pandas dataframe from another dataframe

I have two dataframes.
df1 is
name colour count
0 apple red 3
1 orange orange 2
3 kiwi green 12
df2 is
name count
0 kiwi 2
1 apple 1
I want to update df1 with count from df2 based on name match.
Expected result:
name colour count
0 apple red 1
1 orange orange 2
3 kiwi green 2
I am using this but the results are incorrect.
df3 = df2.combine_first(df1).reindex(df1.index)
How can I do this correctly?
Set name as the index in both DataFrames with DataFrame.set_index so that rows are matched by name, then use DataFrame.combine_first, and finally DataFrame.reset_index to turn the index back into a column:
df = df2.set_index('name').combine_first(df1.set_index('name')).reset_index()
print (df)
name colour count
0 apple red 1.0
1 kiwi green 2.0
2 orange orange 2.0
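Note that the combine_first result above is sorted by name and has float counts, while the expected result in the question keeps df1's row order, index and integer counts. If that matters, one possible alternative (a sketch, assuming each name appears at most once in df2) is to look the new counts up with Series.map. The name new_counts is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["apple", "orange", "kiwi"],
                    "colour": ["red", "orange", "green"],
                    "count": [3, 2, 12]}, index=[0, 1, 3])
df2 = pd.DataFrame({"name": ["kiwi", "apple"], "count": [2, 1]})

# Look up each name in df2; names missing from df2 come back as NaN.
new_counts = df1["name"].map(df2.set_index("name")["count"])

# Keep the old count where df2 has no entry, then restore the integer dtype.
df1["count"] = new_counts.fillna(df1["count"]).astype(int)
print(df1)
```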

how to map multiple records to one unique id

I have 2 data sets with a common unique ID (with duplicates in the 2nd data frame).
I want to map all records to their corresponding ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
I want the output to look like:
df1
id col1
1 mango
straw
2 melon
3 banana
papaya
4 not available
5 not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
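A self-contained version of the merge is sketched below. It differs from the one-liner above in two ways, both of which are choices made here rather than part of the answer: it uses how='left' with an explicit on='id' so that only ids present in df1 are kept, in df1's order, and it fills only col1 so that any other columns would be left alone. The name merged is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"id": [1, 2, 1, 3, 3],
                    "col1": ["mango", "melon", "straw", "banana", "papaya"]})

# A left merge keeps every id from df1 and repeats an id once
# for each matching row in df2.
merged = df1.merge(df2, on="id", how="left")

# Ids with no match in df2 have NaN in col1; label them explicitly.
merged["col1"] = merged["col1"].fillna("not available")
print(merged)
```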

How to assign values based on multiple columns in pandas?

Is there an elegant way to assign values based on multiple columns in a dataframe in pandas? Let's say I have a dataframe with 2 columns: FruitType and Color.
import pandas as pd
df = pd.DataFrame({'FruitType':['apple', 'banana','kiwi','orange','loquat'],
'Color':['red_black','yellow','greenish_yellow', 'orangered','orangeyellow']})
I would like to assign the value of a third column, 'isYellowSeedless', based on both 'FruitType' and 'Color' columns.
I have a list of fruits that I consider seedless, and would like to check the Color column to see if it contains the str "yellow".
seedless = ['banana', 'loquat']
How do I string this all together elegantly?
This is my attempt that didn't work:
df[(df['FruitType'].isin(seedless)) & (culture_table['Color'].str.contains("yellow"))]['isYellowSeedless'] = True
Use loc with mask:
m = (df['FruitType'].isin(seedless)) & (df['Color'].str.contains("yellow"))
df.loc[m, 'isYellowSeedless'] = True
print (df)
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
If you need True and False output:
df['isYellowSeedless'] = m
print (df)
Color FruitType isYellowSeedless
0 red_black apple False
1 yellow banana True
2 greenish_yellow kiwi False
3 orangered orange False
4 orangeyellow loquat True
For an if-else choice between two scalars, use numpy.where (this needs import numpy as np):
df['isYellowSeedless'] = np.where(m, 'a', 'b')
print (df)
Color FruitType isYellowSeedless
0 red_black apple b
1 yellow banana a
2 greenish_yellow kiwi b
3 orangered orange b
4 orangeyellow loquat a
And to convert to 0 and 1:
df['isYellowSeedless'] = m.astype(int)
print (df)
Color FruitType isYellowSeedless
0 red_black apple 0
1 yellow banana 1
2 greenish_yellow kiwi 0
3 orangered orange 0
4 orangeyellow loquat 1
Or you can try
df['isYellowSeedless']=df.loc[df.FruitType.isin(seedless),'Color'].str.contains('yellow')
df
Out[546]:
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
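The mask-based approach can be put together as a runnable sketch. The attempt in the question fails because df[...]['col'] = value assigns to a temporary copy produced by the first indexing step (and it also refers to culture_table instead of df). The na=False argument is an addition made here so that a missing Color counts as "no match"; the sample data has no missing values, so it does not change the result.

```python
import pandas as pd

df = pd.DataFrame({
    "FruitType": ["apple", "banana", "kiwi", "orange", "loquat"],
    "Color": ["red_black", "yellow", "greenish_yellow",
              "orangered", "orangeyellow"],
})
seedless = ["banana", "loquat"]

# Both conditions must hold: the fruit is in the seedless list
# and its colour string contains "yellow".
mask = df["FruitType"].isin(seedless) & df["Color"].str.contains("yellow", na=False)

# Assign through the frame itself, not through a chained df[mask][...].
df["isYellowSeedless"] = mask
print(df)
```

Here kiwi is False even though its colour contains "yellow", because it is not in the seedless list.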
