How to groupby and collapse with pandas? - python

I have a dataframe of the following type:
    Country  Year  Age  Male  Female
0    Canada  2005   50   400      25
1    Canada  2005   51   100      25
2    Canada  2006   50   100      70
3  Columbia  2005   50    75      75
I would like, for example, to get the total number of males + females of any age, grouped by country and year. That is, I'm trying to work out what operation would give me a table such as
    Country  Year  Total over ages and sexes
0    Canada  2005                        550
1    Canada  2006                        170
2  Columbia  2005                        150
In the above example, the value 550 comes from the total number of males and females in Canada for the year 2005, regardless of age: so 550 = 400+25+100+25.
I probably need to groupby Country and Year, but I'm not sure how to collapse the ages and total the number of males and females.

df["Total"] = df.Male + df.Female
df.groupby(["Country", "Year"]).Total.sum()
Output:
Country   Year
Canada    2005    550
          2006    170
Columbia  2005    150
Name: Total, dtype: int64
Update
cᴏʟᴅsᴘᴇᴇᴅ's chained version:
(df.assign(Total=df.Male + df.Female)
   .groupby(['Country', 'Year'])
   .Total
   .sum()
   .reset_index(name='Total over ages and sexes'))
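An equivalent version that skips the helper column entirely (a sketch over the same example frame): take the per-group sums of Male and Female, then add the two columns together.
out = (df.groupby(['Country', 'Year'])[['Male', 'Female']]
         .sum()        # per-group totals for Male and Female
         .sum(axis=1)  # add the two column totals row-wise
         .reset_index(name='Total over ages and sexes'))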

Related

Python summing selected values in a column that match given condition

Here's the data after the preliminary data cleaning.
year  country  employees
2001  US               9
2001  Canada          81
2001  France          22
2001  Japan           31
2001  Chile            7
2001  Mexico          15
2001  Total          165
2002  US               5
2002  Canada          80
2002  France          20
2002  Japan           30
2002  Egypt           35
2002  Total          170
...   ...            ...
2010  US              32
...   ...            ...
What I want to get is the table below, which sums up all countries except US, Canada, France, and Japan into 'Others'. The list of countries varies from year to year between 2001 and 2010, so I was thinking of a for loop with an if condition to loop over every year.
year  country  employees
2001  US               9
2001  Canada          81
2001  France          22
2001  Japan           31
2001  Others          22
2001  Total          165
2002  US               5
2002  Canada          80
2002  France          20
2002  Japan           30
2002  Others          35
2002  Total          170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map away values that you don't recognize:
country = df["country"].where(df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others")
df.groupby([df["year"], country]).sum(numeric_only=True)
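If you do drop the precomputed Total rows first, a minimal sketch (assuming the same column names as above) would be:
# Remove the per-year "Total" rows so they don't inflate the grouped sums.
no_totals = df[df["country"] != "Total"]
# Map every country outside the fixed list to "Others", then group and sum.
keep = ["US", "Canada", "France", "Japan"]
country = no_totals["country"].where(no_totals["country"].isin(keep), "Others")
no_totals.groupby([no_totals["year"], country]).sum(numeric_only=True)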

Pandas dataframe sort by date while keeping a certain order in the second column

I have multiple time-series dataframes from different countries. I want to merge them so that the rows stay ordered by date (instead of merely stacking one dataframe below the other), and so that within each date the country column follows a consistent pattern. However, when I merge them the countries seem to be randomly distributed: for one date Australia comes first, but for another date Japan is put first.
To clarify with an example:
Australia
country crime-index
2000 AU 100
2001 AU 110
2002 AU 120
Japan
country crime-index
2000 JP 90
2001 JP 100
2002 JP 95
United Kingdom
country crime-index
2000 UK 120
2001 UK 130
2002 UK 130
Merged
country crime-index
2000 AU 100
2000 JP 90
2000 UK 120
2001 AU 110
2001 JP 100
2001 UK 130
2002 AU 120
2002 JP 95
2002 UK 130
You can use pandas' sort_values function to sort your dataframe by multiple columns, or by the index together with a column. With this, the ordering of the country column will be the same for each date.
df.rename_axis('dates').sort_values(["dates","country"])
You can also try:
df['temp'] = df.index
df = df.sort_values(['temp', 'country'])
del df['temp']
df['temp'] copies the dates into a column; the frame is then sorted by those two columns, and the helper column is deleted afterwards.
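An alternative sketch that avoids the helper column (assuming, as above, that the dates are the index): sort by country first, then do a stable sort on the index, which preserves the country order within each date.
# Stable sorts keep the prior ordering among equal keys, so countries stay
# in a consistent order within each date after the second sort.
df = df.sort_values('country', kind='stable').sort_index(kind='stable')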

Sort Sales Data by Customer Name and Year

I have a data set which contains Customer Name, Ship Date, and PO Amount.
I would like to sort the data frame to output a table with the format of
cols:[Customer Name,2016,2017,2018,2019,2020,2021]
rows: 1 row for each customer and the sum of PO's within a given year.
This is what I have tried:
The data is coming in from an Excel sheet, but assume ShipToName is a string, BillAmount is a float, and SellDate is the year extracted from a datetime.
ShipToName = ['Bob', 'Joe', 'Josh', 'Bob', 'Joe', 'Josh']
BillAmount = [30.02, 23.2, 20, 45.32, 54.23, 65]
SellDate = [2016, 2016, 2018, 2020, 2021, 2018]
dfSales = {'Customer': ShipToName, 'Total Sales': BillAmount, 'Year': SellDate}
dfSales = pd.DataFrame(dfSales, columns=['Customer', 'Year', 'Total Sales'])
dfbyyear = (dfSales.groupby(['Customer', 'Year'], as_index=False)
                   .sum()
                   .sort_values('Total Sales', ascending=False))
This gives me a new row for each customer/year combo.
I would like the output to look like:
Customer Name   2016    2017    2018    2019    2020    2021
Bob            30.02                           45.32
Joe            23.20                                   54.23
Josh                           85.00
Edit v2
Using the data from the original version, we can create a temp dataframe dx that groups the data by Customer Name and Year. Then we can pivot the data to the format you wanted.
dx = df.groupby(['Customer Name','Year'])['PO Amount'].agg(Total_Amt=sum).reset_index()
dp = dx.pivot(index='Customer Name',columns='Year',values='Total_Amt')
print (dp)
The output of this will be:
Year 2020 2021
Customer Name
Boby 6754 6371
Jack 5887 6421
Jane 5161 4411
Jill 5857 5641
Kate 6205 6457
Suzy 5027 4561
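Note that pivot leaves NaN where a customer has no sales in a year, and only years present in the data become columns. A follow-up sketch to match the requested 2016-2021 layout (assuming integer year labels):
# Make sure every year column exists, fill missing combinations with 0,
# and turn Customer Name back into a regular column.
report = dp.reindex(columns=range(2016, 2022)).fillna(0).reset_index()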
Original v1
I am making some assumptions with the data as you haven't provided me with any.
Assumptions:
There are many customers in the dataframe - my example has 6 customers
Each customer has more than one Ship Date - my example has 1 shipment each month for 2 years
The shipment amount is a dollar amount - I used an integer range from 100 to 900
The total dataframe size is 144 rows with 3 columns - Customer Name, Ship Date, and PO Amount
You are looking for an output by Customer, by Year, with the sum of all POs for that year
With these assumptions, here's the dataframe and the output.
import pandas as pd
import random
df = pd.DataFrame({'Customer Name':['Jack'] * 24 + ['Jill'] * 24 + ['Jane'] * 24 +
['Kate'] * 24 + ['Suzy'] * 24 + ['Boby'] * 24,
'Ship Date':pd.date_range('2020-01-01',periods = 24,freq='MS').tolist()*6,
'PO Amount':[random.randint(100,900) for _ in range(144)]})
print (df)
df['Year'] = df['Ship Date'].dt.year
print (df.groupby(['Customer Name','Year'])['PO Amount'].agg(Total_Amt=sum).reset_index())
Customer Name Ship Date PO Amount
0 Jack 2020-01-01 310
1 Jack 2020-02-01 677
2 Jack 2020-03-01 355
3 Jack 2020-04-01 571
4 Jack 2020-05-01 875
.. ... ... ...
139 Boby 2021-08-01 116
140 Boby 2021-09-01 822
141 Boby 2021-10-01 751
142 Boby 2021-11-01 109
143 Boby 2021-12-01 866
Each customer has data from 2020-01-01 through 2021-12-01.
The summary report will be as follows:
Customer Name Year Total_Amt
0 Boby 2020 7176
1 Boby 2021 6049
2 Jack 2020 6187
3 Jack 2021 5240
4 Jane 2020 4919
5 Jane 2021 6105
6 Jill 2020 6556
7 Jill 2021 5963
8 Kate 2020 6300
9 Kate 2021 6360
10 Suzy 2020 5969
11 Suzy 2021 4866

How to group by and sort the columns given the column value in a function

I have a data frame as below, and I need to write a function that gives me the following results:
Input Parameters:
Country, for example 'INDIA'
Age, for example 'Student'
My input dataframe looks like this:
Card Name Country Age Code Amount
0 AAA INDIA Young House 100
1 AAA Australia Old Hardware 200
2 AAA INDIA Student House 300
3 AAA US Young Hardware 600
4 AAA INDIA Student Electricity 200
5 BBB Australia Young Electricity 100
6 BBB INDIA Student Electricity 200
7 BBB Australia Young House 450
8 BBB INDIA Old House 150
9 CCC Australia Old Hardware 200
10 CCC Australia Young House 350
11 CCC INDIA Old Electricity 400
12 CCC US Young House 200
The expected output would be
Code Total Amount Frequency Average
0 Electricity 400 2 200
1 House 300 1 300
Top 10 (in our case, we can get only the top 2) Codes for the given Country (= INDIA) and Age (= Student), based on the total sum of Amount. In addition, it should give a new column 'Frequency' counting the number of records in each group, and a column 'Average' equal to the total sum divided by the Frequency.
I have tried
df.groupby(['Country','Age','Code']).agg({'Amount': sum})['Amount'].groupby(level=0, group_keys=False).nlargest(10)
which produces
Country Age Code
Australia Young House 800
Old Hardware 400
Young Electricity 100
INDIA Old Electricity 400
Student Electricity 400
House 300
Old House 150
Young House 100
US Young Hardware 600
House 200
Name: Amount, dtype: int64
which is unfortunately different from the expected output.
Given
>>> df
Card Name Country Age Code Amount
0 AAA INDIA Young House 100
1 AAA Australia Old Hardware 200
2 AAA INDIA Student House 300
3 AAA US Young Hardware 600
4 AAA INDIA Student Electricity 200
5 BBB Australia Young Electricity 100
6 BBB INDIA Student Electricity 200
7 BBB Australia Young House 450
8 BBB INDIA Old House 150
9 CCC Australia Old Hardware 200
10 CCC Australia Young House 350
11 CCC INDIA Old Electricity 400
12 CCC US Young House 200
you can filter your dataframe first:
>>> country = 'INDIA'
>>> age = 'Student'
>>> tmp = df[df.Country.eq(country) & df.Age.eq(age)].loc[:, ['Code', 'Amount']]
>>> tmp
Code Amount
2 House 300
4 Electricity 200
6 Electricity 200
... and then group:
>>> result = tmp.groupby('Code')['Amount'].agg([('Total Amount', 'sum'), ('Frequency', 'size'), ('Average', 'mean')]).reset_index()
>>> result
Code Total Amount Frequency Average
0 Electricity 400 2 200
1 House 300 1 300
If I understand your filtering criterion by the Total Amount correctly, you can then issue
result.nlargest(10, 'Total Amount')
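Since the question asks for a function taking Country and Age as input parameters, the same steps wrap up naturally; a sketch, with the n=10 default mirroring the "Top 10" requirement:
def top_codes(df, country, age, n=10):
    # Filter to the requested slice, aggregate per Code, rank by total.
    tmp = df[df.Country.eq(country) & df.Age.eq(age)]
    result = (tmp.groupby('Code')['Amount']
                 .agg([('Total Amount', 'sum'),
                       ('Frequency', 'size'),
                       ('Average', 'mean')])
                 .reset_index())
    return result.nlargest(n, 'Total Amount')

top_codes(df, 'INDIA', 'Student')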

Removing data from a column in pandas

I'm trying to prune some data from my data frame, but only the rows where there are duplicates in the "To country" column.
My data frame looks like this:
Year From country To country Points
0 2016 Albania Armenia 0
1 2016 Albania Armenia 2
2 2016 Albania Australia 12
Year From country To country Points
2129 2016 United Kingdom The Netherlands 0
2130 2016 United Kingdom Ukraine 10
2131 2016 United Kingdom Ukraine 5
[2132 rows x 4 columns]
I try this on it:
df.drop_duplicates(subset='To country', inplace=True)
And what happens is this:
Year From country To country Points
0 2016 Albania Armenia 0
2 2016 Albania Australia 12
4 2016 Albania Austria 0
Year From country To country Points
46 2016 Albania The Netherlands 0
48 2016 Albania Ukraine 0
50 2016 Albania United Kingdom 5
[50 rows x 4 columns]
While this does get rid of the duplicated 'To country' entries, it also removes all the other values of the 'From country' column. I must be using drop_duplicates() wrong, but the pandas documentation isn't helping me understand why it's dropping more than I'd expect it to.
No, this behavior is correct: assuming every team played every other team, drop_duplicates is finding the first occurrence of each 'To country', and all of those first occurrences are "From" Albania.
From what you've said below, you want to keep row 0, but not row 1 because it repeats both the To and From countries. The way to eliminate those is:
df.drop_duplicates(subset=['To country', 'From country'], inplace=True)
The simplest solution is to group by the 'To country' column and take the first (or the last, if you prefer) row from each group:
df.groupby('To country').first().reset_index()
# To country Year From country Points
#0 Armenia 2016 Albania 0
#1 Australia 2016 Albania 12
#2 The Netherlands 2016 United Kingdom 0
#3 Ukraine 2016 United Kingdom 10
Compared to aryamccarthy's solution, this one gives you more control over which duplicates to keep.
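For instance (a sketch), keeping the most recent entry per group instead:
# .last() keeps the final occurrence of each 'To country' instead of the first.
df.groupby('To country', as_index=False).last()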
