Read nested JSON in a cell with Pandas - python

I retrieved a list of ISO 3166-2 countries and regions from this GitHub repository.
I managed to get a first look at the regions using the following code:
import pandas as pd
import json
data = "/content/data.json"
df = pd.read_json(data)
df = df.T
Which gives the following output:
name         divisions
Afghanistan  {'AF-BDS': 'Badakhshān', 'AF-BDG': 'Bādghīs', 'AF-BGL': 'Baghlān', 'AF-BAL': 'Balkh', 'AF-BAM': 'Bāmīān', 'AF-FRA': 'Farāh', 'AF-FYB': 'Fāryāb', 'AF-GHA': 'Ghaznī', 'AF-GHO': 'Ghowr', 'AF-HEL': 'Helmand', 'AF-HER': 'Herāt', 'AF-JOW': 'Jowzjān', 'AF-KAB': 'Kabul (Kābol)', 'AF-KAN': 'Kandahār', 'AF-KAP': 'Kāpīsā', 'AF-KNR': 'Konar (Kunar)', 'AF-KDZ': 'Kondoz (Kunduz)', 'AF-LAG': 'Laghmān', 'AF-LOW': 'Lowgar', 'AF-NAN': 'Nangrahār (Nangarhār)', 'AF-NIM': 'Nīmrūz', 'AF-ORU': 'Orūzgān (Urūzgā', 'AF-PIA': 'Paktīā', 'AF-PKA': 'Paktīkā', 'AF-PAR': 'Parwān', 'AF-SAM': 'Samangān', 'AF-SAR': 'Sar-e Pol', 'AF-TAK': 'Takhār', 'AF-WAR': 'Wardak (Wardag)', 'AF-ZAB': 'Zābol (Zābul)'}
Albania      {'AL-BR': 'Berat', 'AL-BU': 'Bulqizë', 'AL-DL': 'Delvinë', 'AL-DV': 'Devoll', 'AL-DI': 'Dibër', 'AL-DR': 'Durrës', 'AL-EL': 'Elbasan', 'AL-FR': 'Fier', 'AL-GR': 'Gramsh', 'AL-GJ': 'Gjirokastër', 'AL-HA': 'Has', 'AL-KA': 'Kavajë', 'AL-ER': 'Kolonjë', 'AL-KO': 'Korcë', 'AL-KR': 'Krujë', 'AL-KC': 'Kucovë', 'AL-KU': 'Kukës', 'AL-LA': 'Laç', 'AL-LE': 'Lezhë', 'AL-LB': 'Librazhd', 'AL-LU': 'Lushnjë', 'AL-MM': 'Malësia e Madhe', 'AL-MK': 'Mallakastër', 'AL-MT': 'Mat', 'AL-MR': 'Mirditë', 'AL-PQ': 'Peqin', 'AL-PR': 'Përmet', 'AL-PG': 'Pogradec', 'AL-PU': 'Pukë', 'AL-SR': 'Sarandë', 'AL-SK': 'Skrapar', 'AL-SH': 'Shkodër', 'AL-TE': 'Tepelenë', 'AL-TR': 'Tiranë', 'AL-TP': 'Tropojë', 'AL-VL': 'Vlorë'}
But I can't manage to achieve the following output because of the nested JSON.
country code  country name  region code  region name
AF            Afghanistan   AF-BDS       Badakhshān
AF            Afghanistan   AF-BDG       Bādghīs
I tried to loop inside the DataFrame with:
df = json_normalize(df['divisions']).unstack().apply(pd.Series)
But I'm not getting any satisfactory result.

This should work. Note that in the question data holds the file path, while the code below needs the parsed JSON dict, so load it first:
import json
import pandas as pd

with open("/content/data.json") as f:
    data = json.load(f)

df1 = (
    pd.DataFrame(data)
    .transpose()
    .reset_index(names="country code")  # requires pandas >= 1.5
    .rename(columns={"name": "country name"})
)
# flatten each country's divisions dict into (region code, region name) pairs
divisions = [(k1, v1) for k, v in df1["divisions"].to_dict().items() for k1, v1 in v.items()]
df2 = pd.DataFrame(divisions, columns=["region code", "region name"])
# exploding a dict column yields its keys, which serve as the merge key
final_df = (
    pd
    .merge(df1.explode("divisions"), df2, left_on="divisions", right_on="region code")
    .drop(columns="divisions")
)
print(final_df.head(10))
country code country name region code region name
0 AF Afghanistan AF-BDS Badakhshān
1 AF Afghanistan AF-BDG Bādghīs
2 AF Afghanistan AF-BGL Baghlān
3 AF Afghanistan AF-BAL Balkh
4 AF Afghanistan AF-BAM Bāmīān
5 AF Afghanistan AF-FRA Farāh
6 AF Afghanistan AF-FYB Fāryāb
7 AF Afghanistan AF-GHA Ghaznī
8 AF Afghanistan AF-GHO Ghowr
9 AF Afghanistan AF-HEL Helmand
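Note: reset_index(names="country code") requires pandas 1.5 or newer. On older versions, a minimal equivalent sketch:
# On pandas < 1.5, reset_index() names the new column "index", so rename it afterwards
df1 = (
    pd.DataFrame(data)
    .transpose()
    .reset_index()
    .rename(columns={"index": "country code", "name": "country name"})
)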

You can simply read in the data one country at a time:
import json
import pandas as pd

J = json.load(open("iso-3166-2.json", "r"))
dfs = []
for country_code in J:
    # each entry is {"name": ..., "divisions": {...}}; the divisions dict
    # becomes a column indexed by region code
    df = pd.DataFrame(J[country_code])
    df.index.name = "region_code"
    df['country_code'] = country_code
    dfs.append(df)
df = pd.concat(dfs).reset_index()
# region_code name divisions country_code
#0 AF-BAL Afghanistan Balkh AF
#1 AF-BAM Afghanistan Bāmīān AF
#2 AF-BDG Afghanistan Bādghīs AF
#3 AF-BDS Afghanistan Badakhshān AF
#4 AF-BGL Afghanistan Baghlān AF
#... ... ... ... ...
#3802 ZW-MI Zimbabwe Midlands ZW
#3803 ZW-MN Zimbabwe Matabeleland North ZW
#3804 ZW-MS Zimbabwe Matabeleland South ZW
#3805 ZW-MV Zimbabwe Masvingo ZW
#3806 ZW-MW Zimbabwe Mashonaland West ZW
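The columns here are region_code, name, divisions and country_code. If you want the exact column names and order from the question, a final rename is enough (a sketch reusing the df built above):
# rename to the question's column names and reorder
df = (df.rename(columns={'country_code': 'country code',
                         'name': 'country name',
                         'region_code': 'region code',
                         'divisions': 'region name'})
        [['country code', 'country name', 'region code', 'region name']])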

Let's do it following the logic of the original post:
(
    pd.read_json('iso-3166-2.json', orient='index')
    .set_index('name', append=True)
    .squeeze()
    .apply(pd.Series)
    .stack()
    .rename_axis(['country code', 'country name', 'region code'])
    .rename('region name')
    .reset_index()
)
Some notes:
orient='index' - read the data with the country codes as the index, so transposition is not required
set_index('name', append=True) - save the country codes and names together as a MultiIndex
instead of squeeze we could use ['divisions'].apply
.apply(pd.Series) - transform the dictionaries in divisions into records with the region codes as column names
.stack() - unpivot the table with the region codes in the columns to long format
.rename_axis(...) - at this stage country codes, names and region codes make up a MultiIndex of a series with the region names as values
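.apply(pd.Series) builds one row per dictionary and can be slow on large inputs. A minimal alternative sketch that flattens the parsed JSON directly with a comprehension (assuming the same iso-3166-2.json structure):
import json
import pandas as pd

with open('iso-3166-2.json') as f:
    data = json.load(f)

# one (country code, country name, region code, region name) tuple per region
rows = [(code, entry['name'], rcode, rname)
        for code, entry in data.items()
        for rcode, rname in entry['divisions'].items()]
df = pd.DataFrame(rows, columns=['country code', 'country name', 'region code', 'region name'])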

Related

How do I check multiple pandas columns against a dictionary in Python?

I have an existing pandas dataframe, consisting of a country column and a market column. I want to check if the countries are assigned to the correct markets. As such I created a dictionary where each country (key) is mapped to the correct markets (values) it can fall within. The structure of the dataframe is below:
The structure of the dictionary is {'key': ['Market 1', 'Market 2', 'Market 3']}. This is because each country has a couple of markets it could belong to.
I would like to write a function which checks the values in the Country column and sees if, according to the dictionary, the current mapping is correct. So ideally, the desired output would be as follows:
Is there a way to reference a dictionary across two columns in a function? To confirm, the keys are the country names, and the markets are the values.
I have included code required to make the dataframe:
data = {'Country': ['Mexico','Uruguay','Uruguay','Greece','Brazil','Brazil','Brazil','Brazil','Colombia','Colombia','Colombia','Japan','Japan','Brazil','Brazil','Spain','New Zealand'],
'Market': ['LATAM','LATAM','LATAM','EMEA','ASIA','ASIA','LATAM BRAZIL','LATAM BRAZIL','LATAM CASA','LATAM CASA','LATAM','LATAM','LATAM','LATAM BRAZIL','LATAM BRAZIL','SOUTHEAST ASIA','SOUTHEAST ASIA']
}
df = pd.DataFrame(data)
Thanks a lot.
First idea is to create tuples and match by Index.isin:
import numpy as np

d = {'Colombia': ['LATAM', 'LATAM CASA'], 'Brazil': ['ASIA']}
tups = [(k, x) for k, v in d.items() for x in v]
df['Market Match'] = np.where(df.set_index(['Country', 'Market']).index.isin(tups),
                              'yes', 'no')
print(df)
Country Market Market Match
0 Mexico LATAM no
1 Uruguay LATAM no
2 Uruguay LATAM no
3 Greece EMEA no
4 Brazil ASIA yes
5 Brazil ASIA yes
6 Brazil LATAM BRAZIL no
7 Brazil LATAM BRAZIL no
8 Colombia LATAM CASA yes
9 Colombia LATAM CASA yes
10 Colombia LATAM yes
11 Japan LATAM no
12 Japan LATAM no
13 Brazil LATAM BRAZIL no
14 Brazil LATAM BRAZIL no
15 Spain SOUTHEAST ASIA no
16 New Zealand SOUTHEAST ASIA no
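For reference, set_index(['Country','Market']).index yields one (Country, Market) tuple per row, which is exactly what Index.isin compares against tups. A minimal sketch of the tuples being matched:
d = {'Colombia': ['LATAM', 'LATAM CASA'], 'Brazil': ['ASIA']}
tups = [(k, x) for k, v in d.items() for x in v]
print(tups)
# [('Colombia', 'LATAM'), ('Colombia', 'LATAM CASA'), ('Brazil', 'ASIA')]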
Or by left join in DataFrame.merge with indicator=True:
d = {'Colombia': ['LATAM', 'LATAM CASA'], 'Brazil': ['ASIA']}
df1 = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
                   columns=['Country', 'Market']).drop_duplicates()
df['Market Match'] = np.where(df.merge(df1, indicator=True, how='left')['_merge'].eq('both'),
                              'yes', 'no')
The following link might help you out in checking if specific strings (e.g. "Markets") are included in your dataframe.
Check if string contains substring
For example:
fullstring = "StackAbuse"
substring = "tack"
if substring in fullstring:
    print("Found!")
else:
    print("Not found!")
df['MATCH'] = df.apply(lambda row: row['Market'] in your_dictionary[row['Country']], axis=1)
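Note that this lambda raises a KeyError for any country missing from the dictionary; a hedged variant using dict.get sidesteps that:
# .get returns [] for unknown countries, so the membership test is simply False
df['MATCH'] = df.apply(lambda row: row['Market'] in your_dictionary.get(row['Country'], []), axis=1)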

How to create a dataframe using a list of dictionaries that also consist of lists

I have a list of dictionaries that also consist of lists and would like to create a dataframe using this list. For example, the data looks like this:
lst = [{'France': [[12548, 'ABC'], [45681, 'DFG'], [45684, 'HJK']]},
       {'USA': [[84921, 'HJK'], [28917, 'KLESA']]},
       {'Japan': [[38292, 'ASF'], [48902, 'DSJ']]}]
And this is the dataframe I'm trying to create
Country Amount Code
France 12548 ABC
France 45681 DFG
France 45684 HJK
USA 84921 HJK
USA 28917 KLESA
Japan 38292 ASF
Japan 48902 DSJ
As you can see, the keys became the values of the Country column, and the numbers and strings became the Amount and Code columns. I thought I could use something like the following, but it's not working.
df = pd.DataFrame(lst)
You probably need to transform the data into a format that Pandas can read.
Original data
data = [
    {"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
    {"USA": [[84921, "HJK"], [28917, "KLESA"]]},
    {"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
Transforming the data
new_data = []
for country_data in data:
    for country, values in country_data.items():
        new_data += [{"Country": country, "Amount": amt, "Code": code} for amt, code in values]
Create the dataframe
df = pd.DataFrame(new_data)
Output
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
# note: iterate over lst (the data), not the built-in list
df = pd.concat([pd.DataFrame(elem) for elem in lst])
df = df.apply(lambda x: pd.Series(x.dropna().values)).stack()
df = df.reset_index(level=[0], drop=True).to_frame(name='vals')
df = pd.DataFrame(df["vals"].to_list(), index=df.index, columns=['Amount', 'Code']).sort_index()
print(df)
output:
Amount Code
France 12548 ABC
USA 84921 HJK
Japan 38292 ASF
France 45681 DFG
USA 28917 KLESA
Japan 48902 DSJ
France 45684 HJK
Use a nested list comprehension to flatten the data and pass it to the DataFrame constructor:
lst = [
    {"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
    {"USA": [[84921, "HJK"], [28917, "KLESA"]]},
    {"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
L = [(country, *x) for country_data in lst
                   for country, values in country_data.items()
                   for x in values]
df = pd.DataFrame(L, columns=['Country', 'Amount', 'Code'])
print(df)
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
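The (country, *x) unpacking builds one flat tuple per inner list. A minimal sketch of what each tuple looks like:
country = "France"
x = [12548, "ABC"]
print((country, *x))
# ('France', 12548, 'ABC')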
Build a new dictionary that combines the individual dicts into one, before concatenating the dataframes:
new_dict = {}
for ent in lst:
    for key, value in ent.items():
        new_dict[key] = pd.DataFrame(value, columns=['Amount', 'Code'])
pd.concat(new_dict, names=['Country']).droplevel(1).reset_index()
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ

Find top n elements in pandas dataframe column by keeping the grouping

I am trying to find the top 5 elements of the column total_petitions, while keeping the ordered grouping I did.
df = df[['fy', 'EmployerState', 'total_petitions']]
table = df.groupby(['fy','EmployerState']).mean()
table.nlargest(5, 'total_petitions')
sample output:
fy EmployerState total_petitions
2020 WA 7039.333333
2016 MD 2647.400000
2017 MD 2313.142857
... TX 2305.541667
2020 TX 2081.952381
desired output:
fy EmployerState total_petitions
2016 AL 3.875000
AR 225.333333
AZ 26.666667
CA 326.056604
CO 21.333333
... ... ...
2020 VA 36.714286
WA 7039.333333
WI 43.750000
WV 8986086.08
WY 1.000000
with the elements of total_petitions being the 5 states with the highest means per year
What you are looking for is a pivot table:
df = df.pivot_table(values='total_petitions', index=['fy','EmployerState'])
df = df.groupby(level='fy')['total_petitions'].nlargest(5).reset_index(level=0, drop=True).reset_index()
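pivot_table aggregates with the mean by default, which matches the groupby(...).mean() in the question. The same result as a sketch built directly on that groupby:
# pivot_table's default aggfunc is 'mean', so this is equivalent
table = df.groupby(['fy', 'EmployerState'])['total_petitions'].mean()
top5 = (table.groupby(level='fy')
             .nlargest(5)
             .reset_index(level=0, drop=True)  # drop the duplicate fy level added by the groupby
             .reset_index())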

data aggregation with sum pandas/ python

import pandas as pd
dane= pd.read_csv('WHO-COVID-19-global-data _2.csv')
dane
dane.groupby('Country')[['Cumulative_cases']].sum()
KeyError: 'Country'
I don't know why this code doesn't run.
There are spaces at the beginning of the column names in dane.
Remove them with the following line:
dane.rename(columns=lambda x: x.strip(), inplace=True)
dane.groupby('Country')[['Cumulative_cases']].sum()
Cumulative_cases
Country
Afghanistan 5702767
Albania 1300156
Algeria 5561691
American Samoa 0
Andorra 273756
... ...
Wallis and Futuna 14
Yemen 256353
Zambia 1323403
Zimbabwe 692447
occupied Palestinian territory, including east ... 4057017
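If the stray spaces come from the delimiter itself (e.g. a comma followed by a space in the header row), an alternative sketch is to strip them at read time:
# skipinitialspace=True tells the CSV parser to drop spaces right after each
# delimiter, so the column names arrive already clean (assuming that is where
# the spaces come from)
dane = pd.read_csv('WHO-COVID-19-global-data _2.csv', skipinitialspace=True)
dane.groupby('Country')[['Cumulative_cases']].sum()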

Python split one column into multiple columns and reattach the split columns into original dataframe

I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID': ['1','2','3','4','5','6','7'],
        'Residence': ['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON', 'USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
        'Name': ['Ann','Betty','Carl','David','Emily','Frank','George'],
        'Gender': ['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID': ['1','2','4','7'],
                        'Name': ['Ann','Betty','David','George'],
                        'Gender': ['F','F','M','M'],
                        'Country': ['USA','USA','USA','USA'],
                        'State': ['CA','MA','FL','AZ'],
                        'County': ['Los Angeles','Suffolk','Charlotte','None'],
                        'City': ['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID': ['3','6'],
                           'Name': ['Carl','Frank'],
                           'Gender': ['M','M'],
                           'Country': ['Canada','Canada'],
                           'State': ['ON','QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence includes USA or not, and attach the split columns from Residence (USA and nonUSA) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
The original data has a unique index that is left unchanged by the code below for both DataFrames, so you can use concat to join the two pieces back together and then add them to the original with DataFrame.join, or with concat using axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# changed order to avoid an error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
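To get the two separate frames the question asks for (USAdata and nonUSAdata), a final sketch selecting on the parsed Country column:
# split on the parsed country and drop the columns that don't apply;
# the 'NA' residence row lands in nonUSAdata here, so drop it first if unwanted
mask = df['Country'] == 'USA'
USAdata = df[mask].drop(columns='Residence')
nonUSAdata = df[~mask].drop(columns=['Residence', 'County', 'City'])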
