Find top n elements in pandas dataframe column by keeping the grouping - python

I am trying to find the top 5 elements of the column total_petitions while keeping the ordered grouping I did.
df = df[['fy', 'EmployerState', 'total_petitions']]
table = df.groupby(['fy','EmployerState']).mean()
table.nlargest(5, 'total_petitions')
sample output:
fy EmployerState total_petitions
2020 WA 7039.333333
2016 MD 2647.400000
2017 MD 2313.142857
... TX 2305.541667
2020 TX 2081.952381
desired output:
fy EmployerState total_petitions
2016 AL 3.875000
AR 225.333333
AZ 26.666667
CA 326.056604
CO 21.333333
... ... ...
2020 VA 36.714286
WA 7039.333333
WI 43.750000
WV 8986086.08
WY 1.000000
where the rows of total_petitions are the 5 states with the highest means per year

What you are looking for is a pivot table:
df = df.pivot_table(values='total_petitions', index=['fy','EmployerState'])
df = df.groupby(level='fy')['total_petitions'].nlargest(5).reset_index(level=0, drop=True).reset_index()
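Alternatively, a minimal sketch that stays on the grouped means without a pivot table (the apply-based variant is my suggestion, not from the answer above):
table = df.groupby(['fy', 'EmployerState'])['total_petitions'].mean()
# top 5 states per year, keeping the original (fy, EmployerState) MultiIndex
top5 = table.groupby(level='fy', group_keys=False).apply(lambda s: s.nlargest(5))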

Related

Read nested JSON in a cell with Pandas

I retrieved a list of ISO 3166-2 countries and regions from this GitHub repository.
I managed to get a first look at the regions using the following code:
import pandas as pd
import json
data = "/content/data.json"
df = pd.read_json(data)
df = df.T
Which gives the following output:
           name  divisions
AF  Afghanistan  {'AF-BDS': 'Badakhshān', 'AF-BDG': 'Bādghīs', 'AF-BGL': 'Baghlān', 'AF-BAL': 'Balkh', 'AF-BAM': 'Bāmīān', 'AF-FRA': 'Farāh', 'AF-FYB': 'Fāryāb', 'AF-GHA': 'Ghaznī', 'AF-GHO': 'Ghowr', 'AF-HEL': 'Helmand', 'AF-HER': 'Herāt', 'AF-JOW': 'Jowzjān', 'AF-KAB': 'Kabul (Kābol)', 'AF-KAN': 'Kandahār', 'AF-KAP': 'Kāpīsā', 'AF-KNR': 'Konar (Kunar)', 'AF-KDZ': 'Kondoz (Kunduz)', 'AF-LAG': 'Laghmān', 'AF-LOW': 'Lowgar', 'AF-NAN': 'Nangrahār (Nangarhār)', 'AF-NIM': 'Nīmrūz', 'AF-ORU': 'Orūzgān (Urūzgā', 'AF-PIA': 'Paktīā', 'AF-PKA': 'Paktīkā', 'AF-PAR': 'Parwān', 'AF-SAM': 'Samangān', 'AF-SAR': 'Sar-e Pol', 'AF-TAK': 'Takhār', 'AF-WAR': 'Wardak (Wardag)', 'AF-ZAB': 'Zābol (Zābul)'}
AL      Albania  {'AL-BR': 'Berat', 'AL-BU': 'Bulqizë', 'AL-DL': 'Delvinë', 'AL-DV': 'Devoll', 'AL-DI': 'Dibër', 'AL-DR': 'Durrës', 'AL-EL': 'Elbasan', 'AL-FR': 'Fier', 'AL-GR': 'Gramsh', 'AL-GJ': 'Gjirokastër', 'AL-HA': 'Has', 'AL-KA': 'Kavajë', 'AL-ER': 'Kolonjë', 'AL-KO': 'Korcë', 'AL-KR': 'Krujë', 'AL-KC': 'Kucovë', 'AL-KU': 'Kukës', 'AL-LA': 'Laç', 'AL-LE': 'Lezhë', 'AL-LB': 'Librazhd', 'AL-LU': 'Lushnjë', 'AL-MM': 'Malësia e Madhe', 'AL-MK': 'Mallakastër', 'AL-MT': 'Mat', 'AL-MR': 'Mirditë', 'AL-PQ': 'Peqin', 'AL-PR': 'Përmet', 'AL-PG': 'Pogradec', 'AL-PU': 'Pukë', 'AL-SR': 'Sarandë', 'AL-SK': 'Skrapar', 'AL-SH': 'Shkodër', 'AL-TE': 'Tepelenë', 'AL-TR': 'Tiranë', 'AL-TP': 'Tropojë', 'AL-VL': 'Vlorë'}
But I can't manage to achieve the following output because of the nested JSON.
country code  country name  region code  region name
AF            Afghanistan   AF-BDS       Badakhshān
AF            Afghanistan   AF-BDG       Bādghīs
I tried to loop inside the DataFrame with:
df = json_normalize(df['divisions']).unstack().apply(pd.Series)
But I'm not getting any satisfying result.
This should work:
# `data` here is assumed to be the parsed JSON dict, e.g. data = json.load(open("data.json")),
# not the file path from the question
df1 = (
    pd.DataFrame(data)
    .transpose()
    .reset_index(names="country code")  # reset_index(names=...) needs pandas >= 1.5
    .rename(columns={"name": "country name"})
)
divisions = [(k1, v1) for k, v in df1["divisions"].to_dict().items() for k1, v1 in v.items()]
df2 = pd.DataFrame(divisions, columns=["region code", "region name"])
final_df = (
    pd.merge(df1.explode("divisions"), df2, left_on="divisions", right_on="region code")
    .drop(columns="divisions")
)
print(final_df.head(10))
country code country name region code region name
0 AF Afghanistan AF-BDS Badakhshān
1 AF Afghanistan AF-BDG Bādghīs
2 AF Afghanistan AF-BGL Baghlān
3 AF Afghanistan AF-BAL Balkh
4 AF Afghanistan AF-BAM Bāmīān
5 AF Afghanistan AF-FRA Farāh
6 AF Afghanistan AF-FYB Fāryāb
7 AF Afghanistan AF-GHA Ghaznī
8 AF Afghanistan AF-GHO Ghowr
9 AF Afghanistan AF-HEL Helmand
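A note on why the merge key lines up (my illustration, not part of the original answer): Series.explode iterates each element, and iterating a dict yields its keys, so the exploded divisions column contains the region codes:
s = pd.Series([{'AF-BDS': 'Badakhshān', 'AF-BDG': 'Bādghīs'}])
print(s.explode().tolist())  # ['AF-BDS', 'AF-BDG'] -- the dict keys, i.e. the region codes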
You can simply read in the data one country at a time:
import json
import pandas as pd

J = json.load(open("iso-3166-2.json", "r"))
dfs = []
for country_code in J:
    df = pd.DataFrame(J[country_code])  # 'name' is broadcast over the division codes
    df.index.name = "region_code"
    df['country_code'] = country_code
    dfs.append(df)
df = pd.concat(dfs).reset_index()
# region_code name divisions country_code
#0 AF-BAL Afghanistan Balkh AF
#1 AF-BAM Afghanistan Bāmīān AF
#2 AF-BDG Afghanistan Bādghīs AF
#3 AF-BDS Afghanistan Badakhshān AF
#4 AF-BGL Afghanistan Baghlān AF
#... ... ... ... ...
#3802 ZW-MI Zimbabwe Midlands ZW
#3803 ZW-MN Zimbabwe Matabeleland North ZW
#3804 ZW-MS Zimbabwe Matabeleland South ZW
#3805 ZW-MV Zimbabwe Masvingo ZW
#3806 ZW-MW Zimbabwe Mashonaland West ZW
Let's do it in the spirit of the original post:
(
    pd.read_json('iso-3166-2.json', orient='index')
    .set_index('name', append=True)
    .squeeze()
    .apply(pd.Series)
    .stack()
    .rename_axis(['country code', 'country name', 'region code'])
    .rename('region name')
    .reset_index()
)
Some notes:
orient='index' - read the data with the country codes as the index, so no transposition is required
set_index('name', append=True) - save country codes and names together as a multi-index
instead of squeeze, we could use ['divisions'].apply
.apply(pd.Series) - transform the dictionaries in divisions into records with the region codes as column names
.stack() - unpivot the table with the region codes in columns to long format
.rename_axis(...) - at this stage, country codes, country names, and region codes make up a multi-index of a series with region names as values
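For completeness, a plain comprehension sketch of my own (assuming data holds the parsed JSON dict, as in the first answer) that builds the long table without any reshaping:
rows = [
    {"country code": cc, "country name": v["name"],
     "region code": rc, "region name": rn}
    for cc, v in data.items()
    for rc, rn in v["divisions"].items()
]
out = pd.DataFrame(rows)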

How to do cumulative division in groupby level [0,1] in pandas dataframe based on conditions?

I have a dataframe to which I want to append rows computed with groupby plus some additional conditions. I'm looking for a for loop or any other solution that works.
Or, if it's easier:
first melt df, then add a new ratio % column, then unmelt.
As the calculations are custom, I think a for loop can get there, with or without groupby.
---Rows 6, 7, 8 are my requirement.---
0-14 = child and unemployed
14-50 = young and working
50+ = old and unemployed
# ref rows 6, 7, 8 = showing which rows to add (+) and divide (/)
Currently I want to put 3 conditions in output rows 6, 7, 8:
d = {'year': [2019, 2019, 2019, 2020, 2020, 2020],
     'age group': ['(0-14)', '(14-50)', '(50+)', '(0-14)', '(14-50)', '(50+)'],
     'con': ['UK', 'UK', 'UK', 'US', 'US', 'US'],
     'population': [10, 20, 300, 400, 1000, 2000]}
df = pd.DataFrame(data=d)
df2 = df.copy()
df
year age group con population
0 2019 (0-14) UK 10
1 2019 (14-50) UK 20
2 2019 (50+) UK 300
3 2020 (0-14) US 400
4 2020 (14-50) US 1000
5 2020 (50+) US 2000
output required:
year age group con population
0 2019 (0-14) UK 10.0
1 2019 (14-50) UK 20.0
2 2019 (50+) UK 300.0
3 2020 (0-14) US 400.0
4 2020 (14-50) US 1000.0
5 2020 (50+) US 2000.0
6 2019 young vs child UK-young vs child 2.0 # 20/10
7 2019 old vs young UK-old vs young 15.0 # 300/20
8 2019 unemployed vs working UK-unemployed vs working 15.5 # (300+10)/20
My trials so far:
df2 = df.copy()
criteria = [df2['con'].str.contains('0-14'),
            df2['con'].str.contains('14-50'),
            df2['con'].str.contains('50+')]
# conditions should be according to requirements
values = ['young vs child', 'old vs young', 'unemployed vs working']
df2['con'] = df2['con'] + '_' + np.select(criteria, values, 0)
df2['age group'] = df2['age group'] + '_' + np.select(criteria, values, 0)
df.groupby(['year', 'age group', 'con']).sum().groupby(level=[0, 1]).cumdiv()
pd.concat([df, df2])
# ----errors: cumdiv() not found and missing conditions criteria-------
Also tried:
df['population'].div(df.groupby('con')['population'].shift(1))
# but looking for customisations of this,
# so it can first sum rows and then divide
# according to the unemployed condition -- row 8 reference.
CLOSEST TRIAL:
n_df_2 = df.copy()
con_list = [x for x in df.con]
year_list = [x for x in df.year]
age_list = [x for x in df['age group']]
new_list = ['young vs child', 'old vs young', 'unemployed vs working']
for country in con_list:
    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]
    bev_child.loc[:, 'population'] = bev_work.loc[:, 'population'].max() / bev_child.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[0]
    bev_child.loc[:, 'age group'] = new_list[0]
    s = n_df_2.append(bev_child, ignore_index=True)
    bev_child.loc[:, 'population'] = bev_child.loc[:, 'population'].max() + bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[2]
    bev_child.loc[:, 'age group'] = new_list[2]
    s = s.append(bev_child, ignore_index=True)
    bev_child.loc[:, 'population'] = bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[1]
    bev_child.loc[:, 'age group'] = new_list[1]
    s = s.append(bev_child, ignore_index=True)
s
year age group con population
0 2019 (0-14) UK 10.0
1 2019 (14-50) UK 20.0
2 2019 (50+) UK 300.0
3 2020 (0-14) US 400.0
4 2020 (14-50) US 1000.0
5 2020 (50+) US 2000.0
6 2020 young vs child US-young vs child 2.5
7 2020 unemployed vs working US-unemployed vs working 4.5
8 2020 old vs young US-old vs young 2.0
Also: PLEASE find the easiest way to solve it... Please...
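For what it's worth, a minimal sketch of one way to build rows 6-8 (my own illustration, not from the thread; it assumes every (year, con) group contains exactly the three age groups above):
rows = []
for (year, con), g in df.groupby(['year', 'con']):
    pop = g.set_index('age group')['population']
    child, work, old = pop['(0-14)'], pop['(14-50)'], pop['(50+)']
    rows += [{'year': year, 'age group': 'young vs child',
              'con': f'{con}-young vs child', 'population': work / child},
             {'year': year, 'age group': 'old vs young',
              'con': f'{con}-old vs young', 'population': old / work},
             {'year': year, 'age group': 'unemployed vs working',
              'con': f'{con}-unemployed vs working', 'population': (old + child) / work}]
out = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
For 2019/UK this yields 2.0, 15.0 and 15.5, matching the desired output above.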

Add a column of repeating numbers to existing dataframe

I have the following dataframe where each row is a unique state-city pair:
State City
NY Albany
NY NYC
MA Boston
MA Cambridge
I want to add a column of years ranging from 2000 to 2018:
State City Year
NY Albany 2000
NY Albany 2001
NY Albany 2002
...
NY Albany 2018
NY NYC 2000
NY NYC 2018
...
MA Cambridge 2018
I know I can create a list of numbers using Year = list(range(2000,2019))
Does anyone know how to put this list as a column in the dataframe for each state-city?
You could try adding it as a list and then performing explode. I think it should work:
df['Year'] = [list(range(2000,2019))] * len(df)
df = df.explode('Year')
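A cross join is another option (a sketch; how='cross' needs pandas 1.2+):
years = pd.DataFrame({'Year': list(range(2000, 2019))})
df = df.merge(years, how='cross')  # every (State, City) row paired with every Year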
One way is to use the DataFrame.stack() method.
Here is a sample of your current data:
import numpy as np
import pandas as pd

data = [['NY', 'Albany'],
        ['NY', 'NYC'],
        ['MA', 'Boston'],
        ['MA', 'Cambridge']]
cities = pd.DataFrame(data, columns=['State', 'City'])
print(cities)
# State City
# 0 NY Albany
# 1 NY NYC
# 2 MA Boston
# 3 MA Cambridge
First, make this into a multi-level index (this will end up in the final dataframe):
cities_index = pd.MultiIndex.from_frame(cities)
print(cities_index)
# MultiIndex([('NY', 'Albany'),
# ('NY', 'NYC'),
# ('MA', 'Boston'),
# ('MA', 'Cambridge')],
# names=['State', 'City'])
Now, make a dataframe with all the years in it (I only use 3 years for brevity):
years = list(range(2000, 2003))
n_cities = len(cities)
years_data = np.repeat(years, n_cities).reshape(len(years), n_cities).T
years_data = pd.DataFrame(years_data, index=cities_index)
years_data.columns.name = 'Year index'
print(years_data)
# Year index 0 1 2
# State City
# NY Albany 2000 2001 2002
# NYC 2000 2001 2002
# MA Boston 2000 2001 2002
# Cambridge 2000 2001 2002
Finally, use stack to transform this dataframe into a vertically-stacked series which I think is what you want:
years_by_city = years_data.stack().rename('Year')
print(years_by_city.head())
# State City Year index
# NY Albany 0 2000
# 1 2001
# 2 2002
# NYC 0 2000
# 1 2001
# Name: Year, dtype: int64
If you want to remove the index and have all the values as a dataframe, just do:
cities_and_years = years_by_city.reset_index()

Transpose subset of pandas dataframe into multi-indexed data frame

I have the following dataframe:
df.head(14)
I'd like to transpose just the yr and the ['WA_','BA_','IA_','AA_','NA_','TOM_']
variables by Label. The resulting dataframe should then be a multi-indexed frame with Label and WA_, BA_, etc., and the column names will be 2010, 2011, etc. I've tried
transpose(), groupby(), pivot_table(), long_to_wide(),
and before I roll my own nested loop going line by line through this df I thought I'd ping the community. Something like this by every Label group:
I feel like the answer is in one of those functions but I'm just missing it. Thanks for your help!
From what I can tell from your illustrated screenshots, you want WA_, BA_, etc. as rows and yr as columns, with Label remaining as a row index. If so, consider stack() and unstack():
# sample data
labels = ["Albany County","Big Horn County"]
n_per_label = 7
n_rows = n_per_label * len(labels)
years = np.arange(2010, 2017)
min_val = 10000
max_val = 40000
data = {"Label": sorted(np.array(labels * n_per_label)),
"WA_": np.random.randint(min_val, max_val, n_rows),
"BA_": np.random.randint(min_val, max_val, n_rows),
"IA_": np.random.randint(min_val, max_val, n_rows),
"AA_": np.random.randint(min_val, max_val, n_rows),
"NA_": np.random.randint(min_val, max_val, n_rows),
"TOM_": np.random.randint(min_val, max_val, n_rows),
"yr":np.append(years,years)
}
df = pd.DataFrame(data)
AA_ BA_ IA_ NA_ TOM_ WA_ Label yr
0 27757 23138 10476 20047 34015 12457 Albany County 2010
1 37135 30525 12296 22809 27235 29045 Albany County 2011
2 11017 16448 17955 33310 11956 19070 Albany County 2012
3 24406 21758 15538 32746 38139 39553 Albany County 2013
4 29874 33105 23106 30216 30176 13380 Albany County 2014
5 24409 27454 14510 34497 10326 29278 Albany County 2015
6 31787 11301 39259 12081 31513 13820 Albany County 2016
7 17119 20961 21526 37450 14937 11516 Big Horn County 2010
8 13663 33901 12420 27700 30409 26235 Big Horn County 2011
9 37861 39864 29512 24270 15853 29813 Big Horn County 2012
10 29095 27760 12304 29987 31481 39632 Big Horn County 2013
11 26966 39095 39031 26582 22851 18194 Big Horn County 2014
12 28216 33354 35498 23514 23879 17983 Big Horn County 2015
13 25440 28405 23847 26475 20780 29692 Big Horn County 2016
Now set Label and yr as indices.
df.set_index(["Label","yr"], inplace=True)
From here, unstack() will pivot the inner-most index to columns. Then, stack() can swing our value columns down into rows.
df.unstack().stack(level=0)
yr 2010 2011 2012 2013 2014 2015 2016
Label
Albany County AA_ 27757 37135 11017 24406 29874 24409 31787
BA_ 23138 30525 16448 21758 33105 27454 11301
IA_ 10476 12296 17955 15538 23106 14510 39259
NA_ 20047 22809 33310 32746 30216 34497 12081
TOM_ 34015 27235 11956 38139 30176 10326 31513
WA_ 12457 29045 19070 39553 13380 29278 13820
Big Horn County AA_ 17119 13663 37861 29095 26966 28216 25440
BA_ 20961 33901 39864 27760 39095 33354 28405
IA_ 21526 12420 29512 12304 39031 35498 23847
NA_ 37450 27700 24270 29987 26582 23514 26475
TOM_ 14937 30409 15853 31481 22851 23879 20780
WA_ 11516 26235 29813 39632 18194 17983 29692
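For reference, the same reshape can be written as melt plus pivot_table; a minimal sketch of my own (using the sample df above, after the set_index call):
out = (df.reset_index()  # undo set_index(["Label", "yr"]) from above
         .melt(id_vars=["Label", "yr"])
         .pivot_table(index=["Label", "variable"], columns="yr", values="value"))
print(out)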

How to add a row for each subindex in pandas multiindex dataframe?

Suppose I have the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        # pd.np was removed in pandas 2.0; use numpy directly
        'sales': [np.random.randint(100000, 999999) for _ in range(12)]
    }
)
Here it is:
office_id sales state
0 1 903325 CA
1 2 364594 WA
2 3 737728 CO
3 4 239378 AZ
4 5 833003 CA
5 6 501536 WA
6 1 920821 CO
7 2 879602 AZ
8 3 661818 CA
9 4 548888 WA
10 5 842459 CO
11 6 906791 AZ
Now I do a groupby operation on office_id and state:
df.groupby(["office_id", "state"]).aggregate({"sales": "sum"})
This leads to:
sales
office_id state
1 CA 903325
CO 920821
2 AZ 879602
WA 364594
3 CA 661818
CO 737728
4 AZ 239378
WA 548888
5 CA 833003
CO 842459
6 AZ 906791
WA 501536
Is it possible to add a row for each office_id, with a new index total for example, which is the sum of the sales column over the states?
I can compute it by grouping by "office_id" and summing, but I obtain a new DataFrame and I do not succeed in merging it back.
Pandas has built-in functionality to do this: pivot_table with the margins parameter set to True. Note that the rows only sort correctly because 'total' is lowercase and the uppercase state codes sort first.
df.pivot_table(index='office_id', columns='state', margins=True,
               margins_name='total', aggfunc='sum').stack()
sales
office_id state
1 CA 415727.0
CO 240142.0
total 655869.0
2 AZ 126350.0
WA 385698.0
total 512048.0
3 CA 387320.0
CO 487075.0
total 874395.0
4 AZ 978018.0
WA 878368.0
total 1856386.0
5 CA 105057.0
CO 852025.0
total 957082.0
6 AZ 130853.0
WA 435940.0
total 566793.0
total AZ 1235221.0
CA 908104.0
CO 1579242.0
WA 1700006.0
total 5422573.0
You can reshape with Series.unstack, add a new total column, and then reshape back with DataFrame.stack; if you need a MultiIndex, use Series.to_frame:
df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
print (df1)
sales
office_id state
1 CA 505047.0
CO 724412.0
total 1229459.0
2 AZ 402775.0
WA 339803.0
total 742578.0
3 CA 343655.0
CO 833474.0
total 1177129.0
4 AZ 574130.0
WA 656577.0
total 1230707.0
5 CA 122260.0
CO 207717.0
total 329977.0
6 AZ 262568.0
WA 504491.0
total 767059.0
df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
#cast if sales are always integers
df1.sales = df1.sales.astype(int)
print (df1)
sales
office_id state
1 CA 323107
CO 658336
total 981443
2 AZ 273728
WA 942249
total 1215977
3 CA 773390
CO 692275
total 1465665
4 AZ 669435
WA 735141
total 1404576
5 CA 530182
CO 232104
total 762286
6 AZ 532248
WA 951481
total 1483729
Timings:
def jez(df):
    df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
    df1['total'] = df1.sum(axis=1)
    df1 = df1.stack().to_frame('sales')
    return df1
print(jez(df))
In [339]: %timeit (df.pivot_table(index='office_id', columns='state', margins=True, margins_name='total', aggfunc='sum').stack())
100 loops, best of 3: 14.6 ms per loop
In [340]: %timeit (jez(df))
100 loops, best of 3: 2.78 ms per loop
You can also use concat to append the aggregated totals, as follows.
pd.concat([df.groupby(["office_id", "state"]).aggregate({"sales": "sum"}),
           df.groupby(["state"]).aggregate({"sales": "sum"})
             .set_index([['Total', 'Total', 'Total', 'Total']], append=True)
             .swaplevel(0, 1)])
which returns
sales
office_id state
1 CA 914776
CO 902173
2 AZ 605783
WA 865189
3 CA 280203
CO 556867
4 AZ 958747
WA 643333
5 CA 703606
CO 644399
6 AZ 768268
WA 834051
Total AZ 2332798
CA 1898585
CO 2103439
WA 2342573
Here, the DataFrame is aggregated at the office-state and state levels. These are concatenated with concat. The DataFrame aggregated to the state level must be given an additional index level prior to concatenating; this is done with set_index. In addition, the index levels must be swapped to conform with the office-state level DataFrame.
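If the number of states can vary, here is a sketch of the same idea that avoids hardcoding the four 'Total' labels (my variation, not from the answer above):
state_totals = df.groupby("state").aggregate({"sales": "sum"})
state_totals.index = pd.MultiIndex.from_product([["Total"], state_totals.index],
                                                names=["office_id", "state"])
result = pd.concat([df.groupby(["office_id", "state"]).aggregate({"sales": "sum"}),
                    state_totals])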
