Group by, Pivot with multiple columns and condition count - python

I have the dataframe:
df = pd.DataFrame({
"Agreement": ["Peace", "Peace", "Love", "Love", "Sun","Sun","Sun"],
"country1": ["USA", "UK", "Germany", "Spain", "Italy","India","China"],
"country2": ["Canada", "France", "Portugal", "Italy","India","Spain","UK"],
"EP1": [1, 0, 1, 0, 0,1,1],
"EP2": [0, 0, 0, 0,0,0,0],
"EP3": [1, 0, 1, 0,1,1,1]
})
I would like to group by or pivot so that I get the number of times a country appears in an agreement where at least one EP is greater than or equal to 1. The desired output is:
df = pd.DataFrame({
"Country": ["USA", "UK", "Germany", "Spain", "Italy","India","China", "Canada","France","Portugal"],
"Agreement with at least one EP per country": [1, 1, 1, 1,1,2,1,1,0,1]
})
I have tried pivot, group by, and loops, but I never reach the desired output. Thanks.

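Summarize the 'EPx' columns into a single boolean 'Agreement' column (True when any EP is nonzero), then flatten your dataframe with melt so that each country gets its own row. Finally, group by Country and sum the flags to count the number of agreements.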
cols = ['country1', 'country2', 'Agreement']
out = (df.assign(Agreement=df.filter(like='EP').any(axis=1))[cols]
.melt('Agreement', value_name='Country')
.groupby('Country', sort=False)['Agreement'].sum().reset_index())
print(out)
# Output
Country Agreement
0 USA 1
1 UK 1
2 Germany 1
3 Spain 1
4 Italy 1
5 India 2
6 China 1
7 Canada 1
8 France 0
9 Portugal 1
Update
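I am interested in the count of times a country is in a unique agreement with at least one EP equal to or greater than 1.
Taking the max instead of the sum gives 1 if the country appears in at least one such row and 0 otherwise: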
cols = ['country1', 'country2', 'Agreement']
out = (df.assign(Agreement=df.filter(like='EP').any(axis=1))[cols]
.melt('Agreement', value_name='Country')
.groupby('Country', sort=False)['Agreement'].max().astype(int).reset_index())
print(out)
# Output
Country Agreement
0 USA 1
1 UK 1
2 Germany 1
3 Spain 1
4 Italy 1
5 India 1
6 China 1
7 Canada 1
8 France 0
9 Portugal 1
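If "unique agreement" is read as the number of distinct agreement names per country (an interpretation, not something the question confirms), a minimal sketch could keep the agreement name during the melt and count it with nunique. The names has_ep, long and unique_counts are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Agreement": ["Peace", "Peace", "Love", "Love", "Sun", "Sun", "Sun"],
    "country1": ["USA", "UK", "Germany", "Spain", "Italy", "India", "China"],
    "country2": ["Canada", "France", "Portugal", "Italy", "India", "Spain", "UK"],
    "EP1": [1, 0, 1, 0, 0, 1, 1],
    "EP2": [0, 0, 0, 0, 0, 0, 0],
    "EP3": [1, 0, 1, 0, 1, 1, 1],
})

# True for rows where at least one EP column is >= 1.
has_ep = df.filter(like="EP").ge(1).any(axis=1)

# One row per (agreement, country) pair, carrying the per-row flag along.
long = (df.assign(has_ep=has_ep)
          .melt(id_vars=["Agreement", "has_ep"],
                value_vars=["country1", "country2"],
                value_name="Country"))

# Count distinct agreement names per country among the flagged rows;
# reindex so countries without a qualifying agreement show up as 0.
countries = long["Country"].unique()
unique_counts = (long[long["has_ep"]]
                 .groupby("Country")["Agreement"].nunique()
                 .reindex(countries, fill_value=0))
print(unique_counts)
```

With the sample data this agrees with the max-based result, because no country has qualifying rows in more than one agreement; the two approaches would differ once a country does.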

Related

How to split a dataframe row into two rows in Pandas?

I have a dataframe as follows:
location | amount
---------------------------
1 new york $27.00
2 california $21.00
3 florida $19.00
4 texas $18.00
What I want to do is split the row where Location='California' into two rows where California turns into 'Sacramento' and 'Los Angeles' and the amount (21) gets divided into two, split between the two new rows.
This is the desired result:
location | amount
------------------------------
1 new york $27.00
2 los angeles $10.50
3 sacramento $10.50
4 florida $19
5 texas $18
Duplicating & Removing
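# Assumes "amount" is numeric (strip the "$" first if it is stored as text).
cal = df.loc[df["location"] == "california"]
half = cal["amount"].iloc[0] / 2
# DataFrame.append was removed in pandas 2.0, so build the new rows and concat.
new_rows = pd.DataFrame({
    "location": ["sacramento", "los angeles"],
    "amount": [half, half]
})
# Drop the original row and append the two replacements at the end.
df = pd.concat([df.drop(cal.index), new_rows], ignore_index=True)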
Sources: https://www.codeforests.com/2020/09/27/pandas-split-one-row-of-data-into-multiple-rows/
Python pandas: fill a dataframe row by row
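If the new rows should stay where the original row was, as in the desired result, one possible sketch uses explode instead of dropping and appending. It assumes the amount column is numeric, and the splits mapping is a hypothetical helper introduced only for this example.

```python
import pandas as pd

# Assumption: "amount" is numeric (strip the "$" first if it is stored as text).
df = pd.DataFrame({
    "location": ["new york", "california", "florida", "texas"],
    "amount": [27.0, 21.0, 19.0, 18.0],
})

# Hypothetical mapping from a location to the places it should be split into.
splits = {"california": ["los angeles", "sacramento"]}

# Wrap every location in a list: its replacements if it is split, else itself.
parts = df["location"].map(lambda loc: splits.get(loc, [loc]))

out = (df.assign(location=parts,
                 amount=df["amount"] / parts.str.len())  # share the amount equally
         .explode("location", ignore_index=True))
print(out)
```

Because the amount is divided by the number of parts before exploding, each new row receives an equal share and the rows that are not split keep their original value.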

Combine text using delimiter for duplicate column values

What I'm trying to achieve is to combine the Name values into one comma-delimited string whenever the Country column is duplicated, and to sum the values in the Salary column.
Current input:
pd.DataFrame({'Name': {0: 'John',1: 'Steven',2: 'Ibrahim',3: 'George',4: 'Nancy',5: 'Mo',6: 'Khalil'},
'Country': {0: 'USA',1: 'UK',2: 'UK',3: 'France',4: 'Ireland',5: 'Ireland',6: 'Ireland'},
'Salary': {0: 100, 1: 200, 2: 200, 3: 100, 4: 50, 5: 100, 6: 10}})
Name Country Salary
0 John USA 100
1 Steven UK 200
2 Ibrahim UK 200
3 George France 100
4 Nancy Ireland 50
5 Mo Ireland 100
6 Khalil Ireland 10
Expected output:
Rows 1 & 2 (in the input) are grouped into one since the Country column is duplicated, and the Salary column is summed up.
The same goes for rows 4, 5 & 6.
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
What I have tried, but I'm not sure how to combine the text in the Name column:
df.groupby(['Country'],as_index=False)['Salary'].sum()
[Out:]
Country Salary
0 France 100
1 Ireland 160
2 UK 400
3 USA 100
Use groupby() and agg():
out=df.groupby('Country',as_index=False).agg({'Name':', '.join,'Salary':'sum'})
If you need only the unique values of the 'Name' column, use:
out=(df.groupby('Country',as_index=False)
.agg({'Name':lambda x:', '.join(set(x)),'Salary':'sum'}))
Note: use pd.unique() in place of set() if the order of the unique values is important.
Output of out:
Country Name Salary
0 France George 100
1 Ireland Nancy, Mo, Khalil 160
2 UK Steven, Ibrahim 400
3 USA John 100
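The note about pd.unique() can be sketched as follows. The last input row is invented here so that there is a duplicate name to remove, and join_unique is a hypothetical helper name; sort=False is added only to keep the countries in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Steven", "Ibrahim", "George", "Nancy", "Mo", "Khalil", "Nancy"],
    "Country": ["USA", "UK", "UK", "France", "Ireland", "Ireland", "Ireland", "Ireland"],
    # The last row is made up for illustration; it is not in the question's data.
    "Salary": [100, 200, 200, 100, 50, 100, 10, 5],
})

def join_unique(names):
    # pd.unique keeps values in order of first appearance, unlike set().
    return ", ".join(pd.unique(names))

out = (df.groupby("Country", as_index=False, sort=False)
         .agg({"Name": join_unique, "Salary": "sum"}))
print(out)
```

Note that the salaries are still summed over every row, including the one with the repeated name; only the joined text is de-duplicated.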
Use agg:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})
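And to get the columns in the original order you can add [df.columns] to the chain (sort=False keeps the rows in order of first appearance, as in the output below):
df.groupby(['Country'], as_index=False, sort=False).agg({'Name': ', '.join, 'Salary':'sum'})[df.columns]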
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160

group by and sum values based on different rows

I have a dataset that looks like this:
store itemId numberOfItemsSold
Berlin 1 78
Amsterdam 3 12
Berlin 2 31
Amsterdam 1 12
Berlin 1 90
I want to create a dataset or dictionary such that I have accumulated information regarding how many of EACH item was sold in each different store. For example, in Berlin, 78+90 items were sold of itemId = 1. Then, 31 items were sold where itemId = 2.
How can I extract such information for each store for each different product (itemId)?
You can do this using groupby(), which gives a DataFrame indexed by store and itemId:
summary_df = df.groupby(['store', 'itemId']).sum()
If you want a dictionary:
summary_dict = dict(zip(summary_df.index, summary_df.numberOfItemsSold))
Does pd.DataFrame.groupby() work for you?
pd.DataFrame(
[["Berlin", 1, 78],
["Amsterdam",3, 12],
["Berlin",2, 31],
["Amsterdam", 1,12],
["Berlin", 1, 90]],
columns=["store", "itemId", "numberOfItemsSold"]).groupby(['store', 'itemId']).sum().reset_index()
output:
store itemId numberOfItemsSold
0 Amsterdam 1 12
1 Amsterdam 3 12
2 Berlin 1 168
3 Berlin 2 31
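If a per-store dictionary is easier to work with than tuple keys, a small sketch along these lines may help. The nested layout {store: {itemId: total}} and the names totals and nested are assumptions made for this example, not something the question specifies.

```python
import pandas as pd

df = pd.DataFrame(
    [["Berlin", 1, 78], ["Amsterdam", 3, 12], ["Berlin", 2, 31],
     ["Amsterdam", 1, 12], ["Berlin", 1, 90]],
    columns=["store", "itemId", "numberOfItemsSold"],
)

# Series indexed by (store, itemId) holding the summed quantities.
totals = df.groupby(["store", "itemId"])["numberOfItemsSold"].sum()

# Build {store: {itemId: total}} from the (store, itemId) MultiIndex.
nested = {}
for (store, item_id), total in totals.items():
    nested.setdefault(store, {})[int(item_id)] = int(total)
print(nested)
```

A lookup such as nested["Berlin"][1] then returns the total for item 1 in Berlin directly.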

insert dataframe into rows for each group in another dataframe

I've created MRE for clarity.
df = pd.DataFrame({
"region": ["Canada", "Korea", "Norway", "China", "Canada", "Korea", "Norway", "China", "Canada", "Korea", "Norway", "China"],
"type" :["A", "B", "C", "D", "A", "C", "C", "A", "B", "B", "B", "B"],
"actual fees": [1235, 422, 333, 111, 1233, 555, 23, 3, 3,4, 1, 2],
"total fee": [2222, 444, 67, 711, 4873, 785, 453, 7, 7,9, 11, 352]
})
df_to_insert = pd.DataFrame({
"region":["Canada", "Korea", "Norway", "China"],
"users" :[55, 36, 87, 250]
})
so my df would look like:
actual fees total fee
region type
Canada A 2 2
B 1 1
China A 1 1
B 1 1
D 1 1
and df_to_insert looks like below:
region users
0 Canada 55
1 Korea 36
2 Norway 87
3 China 250
Now, at the end of each region, I want to insert a row with "users" in the "type" column, the users value under the "actual fees" column, and the regional sum under the "total fee" column.
So my desired dataframe would look like something below:
actual fees total fee
region type
Canada A 2 2
B 1 1
Users 55 3
China A 1 1
B 1 1
D 1 1
Users 250 3
I hope this was clear enough. Let me know if something is not clear.
Thanks in advance!
You can melt df_to_insert first, then fill its total fee by mapping each region to the groupby sum from df, and finally concat, set_index for the MultiIndex, and sort the index:
mlt = df_to_insert.melt('region',var_name='type',value_name='actual fees')
mlt['total fee'] = mlt['region'].map(df.groupby('region')['total fee'].sum())
out = pd.concat((df,mlt),sort=False).set_index(['region','type']).sort_index()
print(out)
actual fees total fee
region type
Canada A 1235 2222
A 1233 4873
B 3 7
users 55 7102
China A 3 7
B 2 352
D 111 711
users 250 1070
Korea B 422 444
B 4 9
C 555 785
users 36 1238
Norway B 1 11
C 333 67
C 23 453
users 87 531
You can see how melt works and how it helps with the concat:
print(df_to_insert.melt('region',var_name='type',value_name='actual fees'))
region type actual fees
0 Canada users 55
1 Korea users 36
2 Norway users 87
3 China users 250

Replace Matrix elements with 1

I have an empty matrix and I want to replace the matrix elements with 1 if the country (index) belongs to the region (column).
I tried to write a double loop, but I get stuck when I need to write the condition. The matrix is [152 rows x 6 columns]. Thanks so much.
west europe east europe latin america
Norway 0 0 0
Denmark 0 0 0
Iceland 0 0 0
Switzerland 0 0 0
Finland 0 0 0
Netherlands 0 0 0
Sweden 0 0 0
Austria 0 0 0
Ireland 0 0 0
Germany 0 0 0
Belgium 0 0 0
I was thinking of something like:
matrix = pd.DataFrame(np.random.randint(1, size=(152, 6)), index=['# enumarate all the countries], columns=['west europe', 'east europe', 'latin america','north america','africa', 'asia'])
print (matrix)
for i in range (len(matrix)):
for j in range(len(matrix)):
if data[i] =='Africa' and data['Country'] = [ '#here enumarate all Africa countries':
matrix[i][j]==1
elif:
....
matrix[i][j]==1
else:
matrix[i][j]==0
print (matrix)
Sample data frame with countries and region:
Country Happiness Rank Happiness Score Economy Family Health Freedom Generosity Corruption Dystopia Job Satisfaction Region
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 Western Europe
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 Western Europe
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 Western Europe
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 Western Europe
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 Western Europe
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 Western Europe
If your input variable data is a DataFrame, then as @Alollz mentioned, you can use the pd.get_dummies function.
Something like this: pd.get_dummies(data, columns=['Region'])
And the output would look like:
Country HappinessRank HappinessScore Economy Family Health Freedom Generosity Corruption Dystopia JobSatisfaction Region_WesternEurope
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 1
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 1
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 1
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 1
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 1
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 1
It will take the Region category column and turn it into indicator columns. In this case it uses the column name as the prefix, but you can change that with the prefix parameter.
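To get the country-by-region matrix described in the question, one possible sketch sets Country as the index before calling pd.get_dummies on the Region column alone. The small data frame below is a stand-in, and its non-Western rows are invented for illustration.

```python
import pandas as pd

# A small stand-in for the question's `data`; the Poland and Brazil rows are
# made up for illustration and are not from the original sample.
data = pd.DataFrame({
    "Country": ["Norway", "Denmark", "Poland", "Brazil"],
    "Region": ["Western Europe", "Western Europe", "Eastern Europe", "Latin America"],
})

# Index by country so the indicator columns form the Country x Region matrix.
# Calling get_dummies on a single Series adds no column-name prefix.
matrix = pd.get_dummies(data.set_index("Country")["Region"], dtype=int)
print(matrix)
```

Each row then contains a single 1 in the column of the region the country belongs to, which avoids the double loop entirely.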
