Generate combinations by systematically selecting rows from groups (using pandas) - python

I have a pandas dataframe df which looks as follows (a toy version; the real df contains many more columns and groups):
group  sub  fruit
a      1    apple
a      2    banana
a      3    orange
b      1    pear
b      2    strawberry
b      3    cherry
c      1    kiwi
c      2    tomato
c      3    lemon
All groups have the same number of rows. I am trying to generate a new dataframe that contains all the combinations of group and sub, constrained so that each combo contains every group exactly once and every sub exactly once.
Desired output:
combo  group  sub  fruit
1      a      1    apple
1      b      2    strawberry
1      c      3    lemon
2      a      1    apple
2      c      2    tomato
2      b      3    cherry
3      b      1    pear
3      a      2    banana
3      c      3    lemon
4      c      1    kiwi
4      a      2    banana
4      b      3    cherry
5      c      1    kiwi
5      b      2    strawberry
5      a      3    orange
...
So the below would be a wrong combination, since it has two values of the same sub:
6      c      2    tomato
6      b      2    strawberry
6      a      3    orange
A previous post of mine randomly selected subs, but I realized that was too unconstrained: Generate combinations by randomly selecting a row from multiple groups (using pandas)

A solution could be:
import pandas as pd
from itertools import permutations

df = pd.DataFrame({"group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
                   "sub": [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   "fruit": ["apple", "banana", "orange", "pear", "strawberry",
                             "cherry", "kiwi", "tomato", "lemon"]})

# One combo per permutation of the groups: position i of each permutation
# is paired with sub i, so every combo uses each group and each sub once.
df2 = pd.DataFrame({"combo": [n for n in range(1, 7) for _ in range(3)],
                    "group": [g for p in permutations(["a", "b", "c"]) for g in p],
                    "sub": [1, 2, 3] * 6})

pd.merge(df2, df, how="left", on=["group", "sub"])
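The hard-coded lists above only work for three groups; as a sketch, the same idea generalizes by reading the group and sub labels from df itself (assuming each group has exactly one row per sub):

```python
import pandas as pd
from itertools import permutations

df = pd.DataFrame({"group": list("aaabbbccc"),
                   "sub": [1, 2, 3] * 3,
                   "fruit": ["apple", "banana", "orange", "pear", "strawberry",
                             "cherry", "kiwi", "tomato", "lemon"]})

groups = sorted(df["group"].unique())  # ['a', 'b', 'c']
subs = sorted(df["sub"].unique())      # [1, 2, 3]

# One combo per permutation: position i of the permutation gets sub i,
# so each combo uses every group and every sub exactly once.
df2 = pd.DataFrame(
    [{"combo": n, "group": g, "sub": s}
     for n, perm in enumerate(permutations(groups), start=1)
     for g, s in zip(perm, subs)]
)
result = df2.merge(df, how="left", on=["group", "sub"])
```

With k groups this produces k! combos, so it scales poorly for large k, but it stays correct for any group count.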

Related

merging 2 data frames that don't have a shared column

I am having some trouble merging 2 data frames that don't have any column in common.
I have 2 data frames and I want to combine them (they both have the same number of rows).
For example, I have these two data frames:
A:
store   red candy  apple
first   5          3
second  1          2
third   4          2
B:
yellow candy  banana  green candy
10            5       4
5             3       3
1             1       0
and I want to merge them so I will have one data frame that looks like this:
store   red candy  apple  yellow candy  banana  green candy
first   5          3      10            5       4
second  1          2      5             3       3
third   4          2      1             1       0
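No answer body survives in this excerpt; since the frames share no key, the usual fix is to align rows by position, e.g. pd.concat along axis=1 (a sketch using the candy data above; resetting the index first guards against mismatched row labels):

```python
import pandas as pd

A = pd.DataFrame({"store": ["first", "second", "third"],
                  "red candy": [5, 1, 4],
                  "apple": [3, 2, 2]})
B = pd.DataFrame({"yellow candy": [10, 5, 1],
                  "banana": [5, 3, 1],
                  "green candy": [4, 3, 0]})

# Glue row i of A to row i of B; concat aligns on the index, so reset it
# to plain 0..n-1 positions on both sides before concatenating.
merged = pd.concat([A.reset_index(drop=True), B.reset_index(drop=True)], axis=1)
```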

Handling values with multiple items in dataframe

Suppose my dataframe is:
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items for each name (removing duplicates).
Output I want:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
Thank you for reading.
Changed sample data with duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed first, use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
         .groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
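A third option (a sketch, not from the original answer): deduplicate inside the join itself with dict.fromkeys, which keeps first-seen order without a separate drop_duplicates pass:

```python
import pandas as pd

df = pd.DataFrame({"Name": list("AAAAABBCDDE"),
                   "Value": ["apple", "apple", "banana", "banana", "orange",
                             "grape", "apple", "apple", "apple", "orange", "banana"]})

# dict.fromkeys keeps one copy of each value, in first-seen order,
# so the join deduplicates per group in a single pass.
df1 = (df.groupby("Name")["Value"]
         .agg(lambda s: ",".join(dict.fromkeys(s)))
         .reset_index())
```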

update pandas groupby group with column value

I have a test df like this:
df = pd.DataFrame({'A': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Pears', 'Pears'],
                   'B': [1, 2, 9, 6, 4, 3, 2, 1]})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in col 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this which looks promising
Pandas Groupby and Sum Only One Column
I need to find the maximum percentage change in col 'B' per group of col 'A' and insert it into the col maxpercchng for all rows in the group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add to all rows in group the maxpercchng col?
I believe you need transform, which returns a Series the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
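As a sketch, the same result can also come from a single transform with a lambda, trading the three vectorized passes for one pass of Python-level group logic (slower on large frames, but compact):

```python
import pandas as pd

df = pd.DataFrame({'A': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange',
                         'Orange', 'Pears', 'Pears'],
                   'B': [1, 2, 9, 6, 4, 3, 2, 1]})

# For each group: (max - min) / first * 100, broadcast back to every row.
df['maxpercchng'] = (df.groupby('A')['B']
                       .transform(lambda s: (s.max() - s.min()) / s.iloc[0] * 100))
```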

how to map multiple records to one unique id

I have 2 data sets with a common unique ID (duplicates in the 2nd data frame).
I want to map all records with respect to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
I want the output to look like:
df1
id  col1
1   mango
    straw
2   melon
3   banana
    papaya
4   not available
5   not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
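Made self-contained so it can be run as-is (data taken from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"id": [1, 2, 1, 3, 3],
                    "col1": ["mango", "melon", "straw", "banana", "papaya"]})

# Outer merge keeps ids with no match in df2; their col1 comes back NaN,
# which fillna then replaces with the placeholder label.
out = (df1.merge(df2, how="outer")
          .set_index("id")
          .fillna("not available"))
```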

Using Pandas Data Frame how to apply count to multi level grouped columns?

I have a data frame with multiple columns and I want to use count after group by such that it is applied to the combination of 2 or more columns. For example, let's say I have two columns:
user_id product_name
1 Apple
1 Banana
1 Apple
2 Carrot
2 Tomato
2 Carrot
2 Tomato
3 Milk
3 Cucumber
...
What I want to achieve is something like this:
user_id  product_name  Product_Count_per_User
1        Apple         2
1        Banana        1
2        Carrot        2
2        Tomato        2
3        Milk          1
3        Cucumber      1
I cannot get it. I tried this:
dcf6 = df3.groupby(['user_id','product_name'])['user_id', 'product_name'].count()
but it does not seem to get what I want: it displays 4 columns instead of 3. How do I do it? Thanks.
You are counting two columns at the same time; you can just use groupby.size:
(df.groupby(['user_id', 'product_name']).size()
   .rename('Product_Count_per_User').reset_index())
Or count only one column:
df.groupby(['user_id', 'product_name'])['user_id'].count()
Use GroupBy.size:
dcf6 = (df3.groupby(['user_id', 'product_name'])
           .size()
           .reset_index(name='Product_Count_per_User'))
print (dcf6)
   user_id product_name  Product_Count_per_User
0        1        Apple                       2
1        1       Banana                       1
2        2       Carrot                       2
3        2       Tomato                       2
4        3     Cucumber                       1
5        3         Milk                       1
What is the difference between size and count in pandas?
Based on your own code, just do this (named aggregation replaces the dict-renaming form of agg, which is deprecated in modern pandas):
dcf6 = (df.groupby(['user_id', 'product_name'])['user_id']
          .agg(Product_Count_per_User='count')
          .reset_index(level=1))
        product_name  Product_Count_per_User
user_id
1              Apple                       2
1             Banana                       1
2             Carrot                       2
2             Tomato                       2
3           Cucumber                       1
3               Milk                       1
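On pandas 1.1+, DataFrame.value_counts offers a one-call alternative (a sketch; sort=False skips the default ordering by count):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 2, 2, 3, 3],
                   "product_name": ["Apple", "Banana", "Apple", "Carrot",
                                    "Tomato", "Carrot", "Tomato", "Milk", "Cucumber"]})

# Count occurrences of each (user_id, product_name) pair in one call.
out = (df.value_counts(["user_id", "product_name"], sort=False)
         .rename("Product_Count_per_User")
         .reset_index())
```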
