Handling values with multiple items in dataframe - python

Suppose my dataframe is:
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items for each name, with duplicates removed.
The output I want:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
thank you for reading

Changed sample data with duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed, first use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
         .groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If the duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
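The outputs above use ',' as the separator, while the question asks for ', '. A minimal sketch of a one-step variant, assuming that first-appearance order within each name should be kept; the names df3 and Values are chosen here for illustration and are not part of the original answer:

```python
import pandas as pd

# Sample data with duplicates, copied from the example above.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "A", "A", "B", "B", "C", "D", "D", "E"],
    "Value": ["apple", "apple", "banana", "banana", "orange",
              "grape", "apple", "apple", "apple", "orange", "banana"],
})

# Series.unique() keeps first-appearance order, so each group's items
# are deduplicated without being re-sorted.
df3 = (df.groupby("Name")["Value"]
         .agg(lambda s: ", ".join(s.unique()))
         .reset_index(name="Values"))
print(df3)
```

This avoids the separate drop_duplicates call, at the cost of a Python-level lambda per group, which may be slower on very large frames.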

Related

Generate combinations by systematically selecting rows from groups (using pandas)

I have a pandas dataframe df which looks like the following (toy version below; the real df contains many more columns and groups):
group sub fruit
a 1 apple
a 2 banana
a 3 orange
b 1 pear
b 2 strawberry
b 3 cherry
c 1 kiwi
c 2 tomato
c 3 lemon
All groups have the same number of rows. I am trying to generate a new dataframe that contains all the combinations of group and sub, such that each combo contains every group and every sub exactly once.
Desired output:
combo group sub fruit
1 a 1 apple
1 b 2 strawberry
1 c 3 lemon
2 a 1 apple
2 c 2 tomato
2 b 3 cherry
3 b 1 pear
3 a 2 banana
3 c 3 lemon
4 c 1 kiwi
4 a 2 banana
4 b 3 cherry
5 c 1 kiwi
5 b 2 strawberry
5 a 3 orange
...
So the following would be a wrong combination, since it has two rows with the same sub:
6 c 2 tomato
6 b 2 strawberry
6 a 3 orange
A previous post of mine randomly selected subs, but I realized that approach was too unconstrained: Generate combinations by randomly selecting a row from multiple groups (using pandas)
A solution could be:
import pandas as pd
import numpy as np
from itertools import permutations
df = pd.DataFrame({"group":{"0":"a","1":"a","2":"a","3":"b","4":"b","5":"b","6":"c","7":"c","8":"c"},
                   "sub":{"0":1,"1":2,"2":3,"3":1,"4":2,"5":3,"6":1,"7":2,"8":3},
                   "fruit":{"0":"apple","1":"banana","2":"orange","3":"pear","4":"strawberry",
                            "5":"cherry","6":"kiwi","7":"tomato","8":"lemon"}})
df2 = pd.DataFrame({"combo": [j for i in ([i]*3 for i in range(1,7)) for j in i],
                    "group": [j for i in permutations(["a","b","c"]) for j in i],
                    "sub":[1,2,3]*6})
pd.merge(df2, df, how="left", on=["group","sub"])
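The solution above hardcodes three groups and six combos. A rough sketch of the same permutation idea that reads the groups and subs from the data instead, assuming the number of groups equals the number of subs; the names rows and combos are illustrative only:

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    "group": list("aaabbbccc"),
    "sub": [1, 2, 3] * 3,
    "fruit": ["apple", "banana", "orange", "pear", "strawberry",
              "cherry", "kiwi", "tomato", "lemon"],
})

groups = df["group"].unique()        # ['a', 'b', 'c']
subs = sorted(df["sub"].unique())    # [1, 2, 3]

# One combo per ordering of the groups; the k-th group in an ordering
# is paired with the k-th sub, so no group or sub repeats in a combo.
rows = [
    {"combo": n, "group": g, "sub": s}
    for n, perm in enumerate(permutations(groups), start=1)
    for g, s in zip(perm, subs)
]
combos = pd.DataFrame(rows).merge(df, how="left", on=["group", "sub"])
print(combos)
```

The number of combos grows as the factorial of the number of groups, so this is only practical for a small number of groups.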

Groupby a column containing duplicates but also preserving the duplicate information

I have the following dataframe:
df=pd.DataFrame({'id':['A','A','B','C','D'],'Name':['apple','apricot','banana','orange','citrus'], 'count':[2,3,6,5,12]})
id Name count
0 A apple 2
1 A apricot 3
2 B banana 6
3 C orange 5
4 D citrus 12
I am trying to group the dataframe by the 'id' column, but also preserve the duplicated names as separate columns. Below is the expected output:
id sum(count) id1 id2
0 A 5 apple apricot
1 B 6 banana na
2 C 5 orange na
3 D 12 citrus na
I tried grouping by the id column using the following statement but that removes the name column completely.
df.groupby(['id'], as_index=False).sum()
I would appreciate any suggestions/ help.
You can use DataFrame.pivot_table for this:
g = df.groupby('id')
# Generate the new columns of the pivoted dataframe
col = g.Name.cumcount()
# Sum of count grouped by id
sum_count = g['count'].sum()
(df.pivot_table(values='Name', index='id', columns=col, aggfunc='first')
   .add_prefix('id')
   .assign(sum_count=sum_count))
id0 id1 sum_count
id
A apple apricot 5
B banana NaN 6
C orange NaN 5
D citrus NaN 12
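The same reshaping can also be done without pivot_table. A minimal sketch, assuming the within-group position from cumcount should become the column label; pos, names and out are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"id": ["A", "A", "B", "C", "D"],
                   "Name": ["apple", "apricot", "banana", "orange", "citrus"],
                   "count": [2, 3, 6, 5, 12]})

# Position of each row within its id group: 0, 1, 0, 0, 0
pos = df.groupby("id").cumcount()

# Move the within-group position into the columns; gaps become NaN.
names = (df.set_index(["id", pos])["Name"]
           .unstack()
           .add_prefix("id"))

# Attach the per-id sum and turn the id index back into a column.
out = names.assign(sum_count=df.groupby("id")["count"].sum()).reset_index()
print(out)
```

As in the answer above, the name columns are numbered from 0 (id0, id1), not from 1 as in the expected output.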

How to normalize columns with one-hot encoding efficiently in pandas dataframes?

A few columns of an example dataframe are shown:
Fruit FruitA FruitB
Apple Banana Mango
Banana Apple Apple
Mango Apple Banana
Banana Mango Banana
Mango Banana Apple
Apple Mango Mango
I want to add new columns Fruit-Apple, Fruit-Banana and Fruit-Mango to the dataframe, one-hot encoded so that each is 1 in the rows where that fruit appears in any of the three columns. So, the desired output is:
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
Apple Banana Mango 1 1 1
Banana Apple Apple 1 1 0
Mango Apple Banana 1 1 1
Banana Mango Banana 0 1 1
Mango Banana Apple 1 1 1
Apple Mango Mango 1 0 1
My code to do this is:
for i in range(len(data)):
    if (data['Fruits'][i] == 'Apple' or data['FruitsA'][i] == 'Apple' or data['FruitsB'][i] == 'Apple'):
        data['Fruits-Apple'][i]=1
        data['Fruits-Banana'][i]=0
        data['Fruits-Mango'][i]=0
    elif (data['Fruits'][i] == 'Banana' or data['FruitsA'][i] == 'Banana' or data['FruitsB'][i] == 'Banana'):
        data['Fruits-Apple'][i]=0
        data['Fruits-Banana'][i]=1
        data['Fruits-Mango'][i]=0
    elif (data['Fruits'][i] == 'Mango' or data['FruitsA'][i] == 'Mango' or data['FruitsB'][i] == 'Mango'):
        data['Fruits-Apple'][i]=0
        data['Fruits-Banana'][i]=0
        data['Fruits-Mango'][i]=1
But I notice that the time taken to run this code increases dramatically when there are many types of 'fruits'. In my actual data there are only 1074 rows, and the column I'm trying to "normalize" with one-hot encoding has 18 different values. So there are 18 if conditions inside the for loop, and the code still hasn't finished after 15 minutes. That's absurd. (It would be great to know why it's taking so long: for another column that contained only 6 different values, the code took much less time to execute, about 3 minutes.)
So, what's the best (vectorized) way to achieve this output?
Use join with get_dummies and add_prefix:
df = df.join(pd.get_dummies(df['Fruit']).add_prefix('Fruit-'))
print (df)
Fruit Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple 1 0 0
1 Banana 0 1 0
2 Mango 0 0 1
3 Banana 0 1 0
4 Mango 0 0 1
5 Apple 1 0 0
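EDIT: If the input has multiple columns, use get_dummies and take the max per column label (max(level=0, axis=1) was removed in pandas 2.0, so group the transposed frame instead):
df = (df.join(pd.get_dummies(df, prefix='', prefix_sep='', dtype=int)
                .T.groupby(level=0).max().T
                .add_prefix('Fruit-')))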
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
For better performance use MultiLabelBinarizer with DataFrame converted to lists:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.values.tolist()),
                          columns=mlb.classes_,
                          index=df.index).add_prefix('Fruit-'))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
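If scikit-learn is not available, a pandas-only route is to reshape to long form first. A minimal sketch, assuming every column of the frame holds fruit labels; cells, flags and out are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit":  ["Apple", "Banana", "Mango", "Banana", "Mango", "Apple"],
    "FruitA": ["Banana", "Apple", "Apple", "Mango", "Banana", "Mango"],
    "FruitB": ["Mango", "Apple", "Banana", "Banana", "Apple", "Mango"],
})

# melt() stacks the three columns into one long 'value' column;
# ignore_index=False keeps the original row label on every cell.
cells = df.melt(ignore_index=False)["value"]

# Encode every cell at once, then collapse back to one row per
# original row: 1 if the fruit occurs anywhere in that row.
flags = (pd.get_dummies(cells, dtype=int)
           .groupby(level=0).max()
           .add_prefix("Fruit-"))

out = df.join(flags)
print(out)
```

This replaces the row-by-row loop from the question with a handful of whole-column operations, whatever the number of distinct values.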

how to map multiple records to one unique id

I have 2 data sets with a common unique ID (the ID has duplicates in the 2nd data frame).
I want to map all records to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
I want output like:
df1
id col1
1 mango
straw
2 melon
3 banana
papaya
4 not available
5 not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
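For reference, a self-contained sketch of the same idea using the sample data from the question. It swaps in how="left" instead of the outer merge, on the assumption that every id of interest is already in df1; out is an illustrative name:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"id": [1, 2, 1, 3, 3],
                    "col1": ["mango", "melon", "straw", "banana", "papaya"]})

# how="left" keeps every id from df1, in df1's order, and repeats an id
# once for each matching row in df2. Ids with no match get NaN, which
# is then replaced by the placeholder text.
out = df1.merge(df2, on="id", how="left").fillna({"col1": "not available"})
print(out)
```

With an outer merge, ids that appear only in df2 would also be kept; with a left merge they are dropped.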

Merge Duplicates based on column?

Here's my situation -
In[9]: df
Out[9]:
fruit val1 val2
0 Orange 1 1
1 orANGE 2 2
2 apple 3 3
3 APPLE 4 4
4 mango 5 5
5 appLE 6 6
In[10]: type(df)
Out[10]: pandas.core.frame.DataFrame
How do I remove case-insensitive duplicates so that the resulting fruit values are all lowercase, with val1 as the sum of the val1 values and val2 as the sum of the val2 values for each fruit?
Expected result:
fruit val1 val2
0 orange 3 3
1 apple 13 13
2 mango 5 5
In two steps:
df['fruit'] = df['fruit'].map(lambda x: x.lower())
res = df.groupby('fruit').sum()
res
# val1 val2
# fruit
# apple 13 13
# mango 5 5
# orange 3 3
And to recover your structure:
res.reset_index()
As per the comment, the lowercasing can be done more directly like this:
df['fruit'] = df['fruit'].str.lower()
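The two steps can also be combined without modifying df. A minimal sketch, assuming the row order of the expected result (first appearance) matters, which is why sort=False is passed; res is reused here as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["Orange", "orANGE", "apple", "APPLE", "mango", "appLE"],
                   "val1": [1, 2, 3, 4, 5, 6],
                   "val2": [1, 2, 3, 4, 5, 6]})

# Group by the lowercased labels directly; sort=False keeps the groups
# in order of first appearance (orange, apple, mango).
res = (df.groupby(df["fruit"].str.lower(), sort=False)[["val1", "val2"]]
         .sum()
         .reset_index())
print(res)
```

Without sort=False, groupby sorts the keys, which gives the apple, mango, orange order shown in the answer above.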
