insert dataframe into rows for each group in another dataframe - python

I've created an MRE for clarity.
import pandas as pd

df = pd.DataFrame({
    "region": ["Canada", "Korea", "Norway", "China", "Canada", "Korea",
               "Norway", "China", "Canada", "Korea", "Norway", "China"],
    "type": ["A", "B", "C", "D", "A", "C", "C", "A", "B", "B", "B", "B"],
    "actual fees": [1235, 422, 333, 111, 1233, 555, 23, 3, 3, 4, 1, 2],
    "total fee": [2222, 444, 67, 711, 4873, 785, 453, 7, 7, 9, 11, 352]
})
df_to_insert = pd.DataFrame({
    "region": ["Canada", "Korea", "Norway", "China"],
    "users": [55, 36, 87, 250]
})
So my df would look like:
             actual fees  total fee
region type
Canada A               2          2
       B               1          1
China  A               1          1
       B               1          1
       D               1          1
and df_to_insert looks like this:
   region  users
0  Canada     55
1   Korea     36
2  Norway     87
3   China    250
Now, what I want to do is: at the end of each region, insert a "users" row in the "type" column, with the user count under the "actual fees" column and the regional sum under the "total fee" column.
So my desired dataframe would look something like this:
             actual fees  total fee
region type
Canada A               2          2
       B               1          1
       Users          55          3
China  A               1          1
       B               1          1
       D               1          1
       Users         250          3
I hope this was clear enough. Let me know if something is not clear.
Thanks in advance!

You can melt df_to_insert first, then concat and set_index to build the MultiIndex; lastly, for "total fee", group df by region and map the sums back onto the melted dataframe:
mlt = df_to_insert.melt('region', var_name='type', value_name='actual fees')
mlt['total fee'] = mlt['region'].map(df.groupby('region')['total fee'].sum())
out = pd.concat((df, mlt), sort=False).set_index(['region', 'type']).sort_index()
print(out)
              actual fees  total fee
region type
Canada A             1235       2222
       A             1233       4873
       B                3          7
       users           55       7102
China  A                3          7
       B                2        352
       D              111        711
       users          250       1070
Korea  B              422        444
       B                4          9
       C              555        785
       users           36       1238
Norway B                1         11
       C              333         67
       C               23        453
       users           87        531
You can see how the melt works and helps with the concat:
print(df_to_insert.melt('region', var_name='type', value_name='actual fees'))

   region   type  actual fees
0  Canada  users           55
1   Korea  users           36
2  Norway  users           87
3   China  users          250
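A side note on the ordering: sort_index only places the users rows last within each region because lowercase 'u' happens to sort after the uppercase type letters. If you want the label capitalized as 'Users' (as in the desired output), or an ordering that doesn't rely on case, a small tweak on top of the answer above (a sketch, not part of the original answer) is:

# label the inserted rows 'Users' and pin the row order explicitly with an
# ordered categorical instead of relying on lexicographic sorting
mlt['type'] = 'Users'
out = pd.concat((df, mlt), sort=False)
out['type'] = pd.Categorical(out['type'],
                             categories=['A', 'B', 'C', 'D', 'Users'],
                             ordered=True)
out = out.set_index(['region', 'type']).sort_index()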

Related

Group by, Pivot with multiple columns and condition count

I have the dataframe:
df = pd.DataFrame({
    "Agreement": ["Peace", "Peace", "Love", "Love", "Sun", "Sun", "Sun"],
    "country1": ["USA", "UK", "Germany", "Spain", "Italy", "India", "China"],
    "country2": ["Canada", "France", "Portugal", "Italy", "India", "Spain", "UK"],
    "EP1": [1, 0, 1, 0, 0, 1, 1],
    "EP2": [0, 0, 0, 0, 0, 0, 0],
    "EP3": [1, 0, 1, 0, 1, 1, 1]
})
I would like to group by or pivot so that I get the count of times a country is in an agreement with at least one EP equal to or greater than 1. I would like as output:
df = pd.DataFrame({
    "Country": ["USA", "UK", "Germany", "Spain", "Italy", "India", "China", "Canada", "France", "Portugal"],
    "Agreement with at least one EP per country": [1, 1, 1, 1, 1, 2, 1, 1, 0, 1]
})
I have tried with pivot, group by, and loops, but I never reach the desired output. Thanks.
Summarize the 'EPx' columns into a single 'Agreement' flag, then flatten your dataframe. Finally, group by Country to count the number of agreements.
cols = ['country1', 'country2', 'Agreement']
out = (df.assign(Agreement=df.filter(like='EP').any(axis=1))[cols]
         .melt('Agreement', value_name='Country')
         .groupby('Country', sort=False)['Agreement'].sum().reset_index())
print(out)
# Output
    Country  Agreement
0       USA          1
1        UK          1
2   Germany          1
3     Spain          1
4     Italy          1
5     India          2
6     China          1
7    Canada          1
8    France          0
9  Portugal          1
Update
I am interested in the count of times a country is in a unique agreement with at least one EP equal or greater than 1.
cols = ['country1', 'country2', 'Agreement']
out = (df.assign(Agreement=df.filter(like='EP').any(axis=1))[cols]
         .melt('Agreement', value_name='Country')
         .groupby('Country', sort=False)['Agreement'].max().astype(int).reset_index())
print(out)
# Output
    Country  Agreement
0       USA          1
1        UK          1
2   Germany          1
3     Spain          1
4     Italy          1
5     India          1
6     China          1
7    Canada          1
8    France          0
9  Portugal          1
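If a country could appear in several distinct qualifying agreements and you want that count (rather than the 0/1 indicator above), one possible sketch keeps the agreement name through the melt and deduplicates on (Agreement, Country) pairs first. This assumes the EP flags are consistent within an agreement and that each (Agreement, Country) pair should count at most once:

# 'ok' flags rows with at least one EP >= 1; dedupe so a repeated
# (Agreement, Country) pair (e.g. India in Sun) counts only once
cols = ['Agreement', 'ok', 'country1', 'country2']
out = (df.assign(ok=df.filter(like='EP').any(axis=1))[cols]
         .melt(['Agreement', 'ok'], value_name='Country')
         .drop_duplicates(['Agreement', 'Country'])
         .groupby('Country', sort=False)['ok'].sum().astype(int)
         .reset_index(name='Agreement'))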

Combine text using delimiter for duplicate column values

What I'm trying to achieve is to combine Name into one comma-delimited value whenever the Country column is duplicated, and to sum the values in the Salary column.
Current input:
df = pd.DataFrame({
    'Name': {0: 'John', 1: 'Steven', 2: 'Ibrahim', 3: 'George', 4: 'Nancy', 5: 'Mo', 6: 'Khalil'},
    'Country': {0: 'USA', 1: 'UK', 2: 'UK', 3: 'France', 4: 'Ireland', 5: 'Ireland', 6: 'Ireland'},
    'Salary': {0: 100, 1: 200, 2: 200, 3: 100, 4: 50, 5: 100, 6: 10}})
      Name  Country  Salary
0     John      USA     100
1   Steven       UK     200
2  Ibrahim       UK     200
3   George   France     100
4    Nancy  Ireland      50
5       Mo  Ireland     100
6   Khalil  Ireland      10
Expected output:
Rows 1 & 2 (in the input) got grouped into one since the Country column is duplicated, and the Salary column is summed up. The same goes for rows 4, 5 & 6.
                Name  Country  Salary
0               John      USA     100
1    Steven, Ibrahim       UK     400
2             George   France     100
3  Nancy, Mo, Khalil  Ireland     160
What I have tried (though I'm not sure how to combine the text in the Name column):
df.groupby(['Country'], as_index=False)['Salary'].sum()
[Out:]
   Country  Salary
0   France     100
1  Ireland     160
2       UK     400
3      USA     100
Use groupby() and agg():
out = df.groupby('Country', as_index=False).agg({'Name': ', '.join, 'Salary': 'sum'})
If you need unique values in the 'Name' column, then use:
out = (df.groupby('Country', as_index=False)
         .agg({'Name': lambda x: ', '.join(set(x)), 'Salary': 'sum'}))
Note: use pd.unique() in place of set() if the order of the unique values is important.
Output of out:
   Country               Name  Salary
0   France             George     100
1  Ireland  Nancy, Mo, Khalil     160
2       UK    Steven, Ibrahim     400
3      USA               John     100
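For instance, pd.unique keeps the first-seen order of names while still deduplicating; a minimal sketch with the same df:

out = (df.groupby('Country', as_index=False)
         .agg({'Name': lambda x: ', '.join(pd.unique(x)), 'Salary': 'sum'}))

One caveat worth knowing: ', '.join raises a TypeError if Name contains NaN, so if missing names are possible, drop them first, e.g. with df.dropna(subset=['Name']).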
Use agg:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary': 'sum'})
And to get the columns in the original order, you can append [df.columns]:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary': 'sum'})[df.columns]
                Name  Country  Salary
0               John      USA     100
1    Steven, Ibrahim       UK     400
2             George   France     100
3  Nancy, Mo, Khalil  Ireland     160

group by and sum values based on different rows

I have a dataset that looks like this:
       store  itemId  numberOfItemsSold
      Berlin       1                 78
   Amsterdam       3                 12
      Berlin       2                 31
   Amsterdam       1                 12
      Berlin       1                 90
I want to create a dataset or dictionary that accumulates how many of EACH item were sold in each store. For example, in Berlin, 78 + 90 = 168 items with itemId = 1 were sold, and 31 items with itemId = 2.
How can I extract such information for each store for each different product (itemId)?
You can do this using groupby(); this gives a DataFrame:
summary_df = df.groupby(['store', 'itemId']).sum()
If you want a dictionary:
summary_dict = dict(zip(summary_df.index, summary_df.numberOfItemsSold))
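Note that the keys of summary_dict are (store, itemId) tuples, so e.g. summary_dict[('Berlin', 1)] would be 168. If a nested dict per store is more convenient, one possible sketch on top of summary_df:

# one inner dict {itemId: total} per store
nested = {store: grp.droplevel('store')['numberOfItemsSold'].to_dict()
          for store, grp in summary_df.groupby(level='store')}
# nested['Berlin'] -> {1: 168, 2: 31}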
Does pd.DataFrame.groupby() work for you?
pd.DataFrame(
    [["Berlin", 1, 78],
     ["Amsterdam", 3, 12],
     ["Berlin", 2, 31],
     ["Amsterdam", 1, 12],
     ["Berlin", 1, 90]],
    columns=["store", "itemId", "numberOfItemsSold"]
).groupby(['store', 'itemId']).sum().reset_index()
Output:
       store  itemId  numberOfItemsSold
0  Amsterdam       1                 12
1  Amsterdam       3                 12
2     Berlin       1                168
3     Berlin       2                 31
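If a wide table (one row per store, one column per itemId) is easier to read, unstacking the grouped result is another option; a sketch using the same df:

# rows: store, columns: itemId, missing combinations filled with 0
wide = (df.groupby(['store', 'itemId'])['numberOfItemsSold']
          .sum().unstack(fill_value=0))
# itemId       1   2   3
# store
# Amsterdam   12   0  12
# Berlin     168  31   0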

Pandas: Merge many-to-one

Let's say I have 2 data frames:
df1:
Name  Age
Pete   19
John   30
Max    24
df2:
Name  Subject  Grade
Pete     Math     90
Pete  History    100
John  English     90
Max   History     90
Max      Math     80
I want to merge df2 into df1, many-to-one, to end up with something like this:
Name  Age  Subject  Grade
Pete   19     Math     90
Pete   19  History    100
John   30  English     90
Max    24  History     90
Max    24     Math     80
I don't want to group them by Subject and Grade; I need the Age duplicated for each row so everything is kept.
You could simply use pd.merge as follows:
import pandas as pd

if __name__ == '__main__':
    df1 = pd.DataFrame({"Name": ["Pete", "John", "Max"],
                        "Age": [19, 30, 24]})
    df2 = pd.DataFrame({"Name": ["Pete", "Pete", "John", "Max", "Max"],
                        "Subject": ["Math", "History", "English", "History", "Math"],
                        "Grade": [90, 100, 90, 90, 80]})
    df3 = pd.merge(df1, df2, how="right", on="Name")
    print(df1)
    print(df2)
    print(df3)
Result:
   Name  Age
0  Pete   19
1  John   30
2   Max   24

   Name  Subject  Grade
0  Pete     Math     90
1  Pete  History    100
2  John  English     90
3   Max  History     90
4   Max     Math     80

   Name  Age  Subject  Grade
0  Pete   19     Math     90
1  Pete   19  History    100
2  John   30  English     90
3   Max   24  History     90
4   Max   24     Math     80
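Note that how="right" keeps every row of df2 and orders the result like df2; a name present in df2 but missing from df1 would get NaN for Age. The same result can be written from df2's side with a left merge, which some find easier to read (a sketch, with the columns reordered to match):

df3 = df2.merge(df1, how="left", on="Name")[["Name", "Age", "Subject", "Grade"]]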

Column Mapping using Python

I have two dataframes; the first one has 1000 rows and looks like:
        Date  tri23_1  hsgç_T2  bbbj-1Y_jn  Family  Bonus
  2011-06-09     qwer        1        rits  Laavin    456
  2011-07-09       ww       43        mayo  Grendy    679
  2011-09-10     wwer       44       ramya  Fantol    431
  2011-11-02                 5         sam  Gondow    569
The second dataframe contains all the unique values and also the hotels that are associated with these values:
       Group        Hotel
     tri23_1        Jamel
     hsgç_T2        Frank
  bbbj-1Y_jn         Luxy
    mlkl_781  Grand Hotel
     vchs_94    Vancouver
My goal is to replace the column names of the first dataframe with the corresponding values of the Hotel column of the second dataframe, so the output should look like this:
        Date  Jamel  Frank   Luxy  Family  Bonus
  2011-06-09   qwer      1   rits  Laavin    456
  2011-07-09     ww     43   mayo  Grendy    679
  2011-09-10   wwer     44  ramya  Fantol    431
  2011-11-02             5    sam  Gondow    569
Can I achieve this using Python?
You could try this, using to_dict():
hotel_map = df2.set_index('Group').to_dict()['Hotel']
df1.columns = [hotel_map[i] if i in hotel_map else i for i in df1.columns]
print(df1)
Output:
df1
         Date  tri23_1  hsgç_T2  bbbj-1Y_jn  Family  Bonus
0  2011-06-09     qwer        1        rits  Laavin  456.0
1  2011-07-09       ww       43        mayo  Grendy  679.0
2  2011-09-10     wwer       44       ramya  Fantol  431.0
3  2011-11-02                 5         sam  Gondow    569
df2
        Group        Hotel
0     tri23_1        Jamel
1     hsgç_T2        Frank
2  bbbj-1Y_jn         Luxy
3    mlkl_781  Grand Hotel
4     vchs_94    Vancouver
df1 changed
         Date  Jamel  Frank   Luxy  Family  Bonus
0  2011-06-09   qwer      1   rits  Laavin  456.0
1  2011-07-09     ww     43   mayo  Grendy  679.0
2  2011-09-10   wwer     44  ramya  Fantol  431.0
3  2011-11-02             5    sam  Gondow    569
Update: Explanation
First, if df2['Group'] isn't the index of df2, we set it as index.
Then convert the dataframe to a dict:
df2.set_index('Group').to_dict()
>>>{'Hotel': {'tri23_1': 'Jamel', 'hsgç_T2': 'Frank', 'bbbj-1Y_jn': 'Luxy', 'mlkl_781': 'Grand Hotel', 'vchs_94': 'Vancouver'}}
Then we select the value of key 'Hotel'
df2.set_index('Group').to_dict()['Hotel']
>>>{'tri23_1': 'Jamel', 'hsgç_T2': 'Frank', 'bbbj-1Y_jn': 'Luxy', 'mlkl_781': 'Grand Hotel', 'vchs_94': 'Vancouver'}
Then, column by column, we look up its value in that dictionary, and if a column doesn't exist among the keys, we just return the same value, e.g. Date, Family, Bonus:
i = 'Date'
i in df2.set_index('Group').to_dict()['Hotel'].keys()  ---> False
return 'Date'
...
i = 'tri23_1'
i in df2.set_index('Group').to_dict()['Hotel'].keys()  ---> True
return df2.set_index('Group').to_dict()['Hotel']['tri23_1']
...
...
# And so on...
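As an aside, beyond the list comprehension above, an equivalent and arguably simpler route (a sketch, not the original answer's code) is to compute the mapping once and let DataFrame.rename do the lookup, since rename skips labels it doesn't know:

hotel_map = df2.set_index('Group')['Hotel'].to_dict()
df1 = df1.rename(columns=hotel_map)  # Date, Family, Bonus are left untouched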
