Pandas flattening nested jsons - python

So this is probably going to be a duplicate question, but I'll give it a try since I have not found anything.
I am trying to flatten a JSON with pandas - normal work.
Looking at the docs, here is the closest example to what I am trying to do:
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
                                              ['info', 'governor']])
result
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich
However, this example shows how to flatten the data inside counties alongside the state and shortname columns.
Let's say that I have n columns at the root of each JSON object (n columns like state or shortname in the example above). How do I include them all, so that counties is flattened but everything adjacent to it is kept?
First I tried things like this:
# None to treat data as a list of records
# the counties column is still nested - not working
result = pd.json_normalize(data, None, ['counties'])
or
result = pd.json_normalize(data, None, ['counties', 'name'])
Then I thought of getting the columns with DataFrame.columns and reusing them, since the meta argument of json_normalize can take an array of strings.
But I'm stuck: columns also returns nested JSON attributes, which I don't want.
# still nested
cols = pd.json_normalize(data).columns.to_list()
# exclude 'counties' because we already have it
cols = [index for index in cols if index != 'counties']
# remove nested columns if any
cols = [index for index in cols if "." not in index]
result = pd.json_normalize(data, 'counties', cols, errors="ignore")
# still nested
name population state shortname ... other6 other7 counties info.governor
0 Dade 12345 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
1 Broward 40000 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
2 Palm Beach 60000 Florida FL ... dumb_data dumb_data [{'name': 'Dade', 'population': 12345}, {'name... NaN
3 Summit 1234 Ohio OH ... dumb_data dumb_data [{'name': 'Summit', 'population': 1234}, {'nam... NaN
4 Cuyahoga 1337 Ohio OH ... dumb_data dumb_data [{'name': 'Summit', 'population': 1234}, {'nam... NaN
I would prefer not to hardcode the column names, since they change, and in this case there are 64 of them...
For better understanding, this is the real kind of data I'm working with, from the Woo REST API. I am not using it here because it's really long, but basically I am trying to flatten line_items, keeping only product_id inside it, and of course all the other columns adjacent to line_items.

Okay, so if you want to flatten a JSON while keeping everything else, you should use pd.DataFrame.explode().
Here is my logic:
import pandas as pd
data = [
    {'state': 'Florida',
     'shortname': 'FL',
     'info': {'governor': 'Rick Scott'},
     'counties': [
         {'name': 'Dade', 'population': 12345},
         {'name': 'Broward', 'population': 40000},
         {'name': 'Palm Beach', 'population': 60000}
     ]
    },
    {'state': 'Ohio',
     'shortname': 'OH',
     'info': {'governor': 'John Kasich'},
     'counties': [{'name': 'Summit', 'population': 1234},
                  {'name': 'Cuyahoga', 'population': 1337}]}
]
# No formatting, only converting to a DataFrame
result = pd.json_normalize(data)
# Explode the nested column we want
exploded = result.explode('counties')
# Keep only the name - this can be customized
exploded['county_name'] = exploded['counties'].apply(lambda x: x['name'])
# Drop the used column since we took what interested us from it
exploded = exploded.drop(['counties'], axis=1)
print(exploded)
# Duplicated rows for Florida, as wanted, with different county names
     state shortname info.governor county_name
0  Florida        FL    Rick Scott        Dade
0  Florida        FL    Rick Scott     Broward
0  Florida        FL    Rick Scott  Palm Beach
1     Ohio        OH   John Kasich      Summit
1     Ohio        OH   John Kasich    Cuyahoga
Imagine you have the content of a basket of products as a nested JSON: to explode the contents of the basket while keeping the general basket attributes, you can do the same.
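If you want to keep every field of the nested records rather than just one, a variant of the same explode idea (a sketch on the sample data above, not the only way) is to normalize the exploded column and join it back:

```python
import pandas as pd

data = [{'state': 'Florida', 'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000}]},
        {'state': 'Ohio', 'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234}]}]

# Flatten the root, then explode the nested list into one row per record
exploded = pd.json_normalize(data).explode('counties').reset_index(drop=True)

# Normalize the exploded dicts into their own columns and join them back
nested = pd.json_normalize(exploded['counties'].tolist())
flat = exploded.drop(columns=['counties']).join(nested)
print(flat)
```

Every root-level column survives without hardcoding names, and all fields of counties (here name and population) become regular columns.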


How to loop to consecutively go through a list of strings, assign value to each string and return it to a new list

Say instead of a dictionary I have these lists:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
I want to create a pd.DataFrame from this such as:
City       Continent
New York   America
Vancouver  America
London     Europe
Berlin     Europe
Tokyo      Asia
Bangkok    Asia
Note: this is a minimal reproducible example to keep it simple, but the real dataset is more like city -> country -> continent.
I understand with such a small sample it would be possible to manually create a dictionary, but in the real example there are many more data-points. So I need to automate it.
I've tried a for loop and a while loop with conditions such as "if Europe in cities", but that doesn't do anything, and I think that's because it's False, since it compares the whole list Europe against the whole list cities.
Either way, my idea was that the loop would go through every city in the cities list and return (city + continent) for each. I just don't know how to um... actually make that work.
I am very new and I wasn't able to figure anything out from looking at similar questions.
Thank you for any direction!
Problem in your code:
First of all, let's take a look at the code snippet you used: if Europe in cities: - you are correct that it returned nothing.
That is because you are comparing the whole list Europe instead of its individual elements ['London', 'Berlin'].
Solution:
First, I imported the required modules and recreated the sample data you provided.
# Import all the Important Modules
import pandas as pd
# Read Data
cities = ['New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok']
Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']
Now, as you can see in your expected output, we have 2 columns:
City [which is already available in the form of the cities list]
Continent [which we have to generate based on the other lists, in our case Europe, America and Asia]
To generate a proper continent list, follow the code below:
# Make continent list
continent = []
# Compare the Europe, America and Asia lists with cities
for city in cities:
    if city in Europe:
        continent.append('Europe')
    elif city in America:
        continent.append('America')
    elif city in Asia:
        continent.append('Asia')
    else:
        pass
# Print the continent list
continent
# Output of Above Code:
['America', 'America', 'Europe', 'Europe', 'Asia', 'Asia']
As you can see, we got the expected continent list. Now let's build the pd.DataFrame() from it:
# Make dataframe from the 'City' and 'Continent' lists
data_df = pd.DataFrame({'City': cities, 'Continent': continent})
# Print Results
data_df
# Output of the above Code:
City Continent
0 New York America
1 Vancouver America
2 London Europe
3 Berlin Europe
4 Tokyo Asia
5 Bangkok Asia
Hope this solution helps you. But if you are still facing errors, feel free to ask below.
1: Counting elements
You just count the number of cities in each continent and create a list from it:
import pandas as pd

Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')

continent = []
cities = []
for name, cont in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    continent += [name for _ in range(len(cont))]
    cities += [city for city in cont]

df = pd.DataFrame({'City': cities, 'Continent': continent})
print(df)
And this gives you the following result:
City Continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This is, I think, the best solution.
2: With a dictionary
You can create an intermediate dictionary.
Starting from your code:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
You would do this:
continent = dict()
for cont_name, cont_cities in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    for city in cont_cities:
        continent[city] = cont_name
This gives you the following result:
{
'London': 'Europe', 'Berlin': 'Europe',
'New York': 'America', 'Vancouver': 'America',
'Tokyo': 'Asia', 'Bangkok': 'Asia'
}
Then, you can create your DataFrame:
df = pd.DataFrame(continent.items())
print(df)
0 1
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This solution allows you not to overwrite your cities tuple.
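If you want named columns instead of the default 0 and 1, the same DataFrame call accepts a columns argument - a small sketch using the dictionary built above:

```python
import pandas as pd

continent = {'London': 'Europe', 'Berlin': 'Europe',
             'New York': 'America', 'Vancouver': 'America',
             'Tokyo': 'Asia', 'Bangkok': 'Asia'}

# items() yields (city, continent) tuples; columns= names the two columns
df = pd.DataFrame(continent.items(), columns=['City', 'Continent'])
print(df)
```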
I think in the long run you might want to eliminate loops for large datasets. Also, you might need to include more continents depending on the content of your data.
import pandas as pd

continent = {
    '0': 'Europe',
    '1': 'America',
    '2': 'Asia'
}
df = pd.DataFrame([Europe, America, Asia]).stack().reset_index()
df['continent'] = df['level_0'].astype(str).map(continent)
df.drop(['level_0', 'level_1'], inplace=True, axis=1)
You should get this output
0 continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
Feel free to adjust to suit your use case
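For completeness, the lookup-dict and loop-free ideas above can be combined with Series.map - a sketch on the question's sample data:

```python
import pandas as pd

cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')

# Build a city -> continent lookup with a dict comprehension
lookup = {city: name
          for name, group in (('Europe', Europe), ('America', America), ('Asia', Asia))
          for city in group}

df = pd.DataFrame({'City': cities})
df['Continent'] = df['City'].map(lookup)
print(df)
```

This keeps the original cities order and leaves NaN for any city missing from the lookup, which makes gaps easy to spot.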

Extracting a scraped list into new columns

I have this code (borrowed from an old question posted on this site):
import pandas as pd
import json
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml")
doc = BeautifulSoup(driver.page_source, "html.parser")
# (The table has an id, which makes it simpler to target)
batting = doc.find(id='misc_batting')
careers = []
for row in batting.find_all('tr')[1:]:
    dictionary = {}
    dictionary['names'] = row.find(attrs={"data-stat": "player"}).text.strip()
    dictionary['experience'] = row.find(attrs={"data-stat": "experience"}).text.strip()
    careers.append(dictionary)
Which generates a result like this:
[{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}, {'names':
How do I turn this into a column-separated dataframe like this?
Names Experience
David Adams 1
You can simplify this quite a bit with pandas: have it pull the table, then you just want the Name and Yrs columns.
import pandas as pd
url = "https://www.baseball-reference.com/leagues/MLB/2013-finalyear.shtml"
df = pd.read_html(url, attrs = {'id': 'misc_batting'})[0]
df_filter = df[['Name','Yrs']]
If you need to rename those columns, add:
df_filter = df_filter.rename(columns={'Name':'names','Yrs':'experience'})
Output:
print(df_filter)
names experience
0 David Adams 1
1 Steve Ames 1
2 Rick Ankiel 11
3 Jairo Asencio 4
4 Luis Ayala 9
.. ... ...
209 Dewayne Wise 11
210 Ross Wolf 3
211 Kevin Youkilis 10
212 Michael Young 14
213 Totals 1357
[214 rows x 2 columns]
Simply pass your list of dicts (careers) to pandas.DataFrame() to get your expected result.
Example
import pandas as pd
careers = [{'names': 'David Adams', 'experience': '1'}, {'names': 'Steve Ames', 'experience': '1'}, {'names': 'Rick Ankiel', 'experience': '11'}, {'names': 'Jairo Asencio', 'experience': '4'}, {'names': 'Luis Ayala', 'experience': '9'}, {'names': 'Brandon Bantz', 'experience': '1'}, {'names': 'Scott Barnes', 'experience': '2'}]
pd.DataFrame(careers)
Output
           names experience
0    David Adams          1
1     Steve Ames          1
2    Rick Ankiel         11
3  Jairo Asencio          4
4     Luis Ayala          9
5  Brandon Bantz          1
6   Scott Barnes          2

pandas json normalize not all fields from record path

I am trying to get just some of the fields of a record, because I do not want to have to delete the unwanted columns afterwards, but I can't figure out how to do it. My real JSON has a lot more fields in the "counties" path; this is just an example.
Example JSON
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
json_normalize
result = pd.json_normalize(
    data=data,
    record_path='counties',
    meta=['state', 'shortname',
          ['info', 'governor']])
output
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich
but I do not want the "population" in this example; I just want the names of the counties.
I have tried all kinds of combinations in the meta attribute.
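As far as I know, json_normalize has no option to select only some fields from the record path, so one workaround (a sketch on the sample data, not the only approach) is to prune the nested records before normalizing:

```python
import pandas as pd

data = [{'state': 'Florida', 'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000}]},
        {'state': 'Ohio', 'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234}]}]

keep = {'name'}  # record-path fields to keep
pruned = [{**rec, 'counties': [{k: c[k] for k in keep} for c in rec['counties']]}
          for rec in data]

result = pd.json_normalize(pruned, record_path='counties',
                           meta=['state', 'shortname', ['info', 'governor']])
print(result)
```

The population field never enters the DataFrame, so nothing has to be dropped afterwards.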

Is there a way to check if a list item is the ONLY item in the list?

I have a list of dictionaries; the whole list represents different countries, and each dictionary includes basic data about each country, as follows:
example a
df.countries[3]
"[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'US', 'name': 'United States of America'}, {'iso_3166_1': 'IN', 'name': 'India'}]"
Of course, there are other cells where countries list has only one dictionary like this:
example b
df.countries[0]
"[{'iso_3166_1': 'US', 'name': 'United States of America'}]"
Or empty list like this:
example c
df.countries[505]
'[]'
What I want to do is: delete rows where the country name is United States of America, but ONLY WHEN it's the only country in the list - not when there are other countries, as in example a.
I tried to brainstorm and came up with something like this:
countryToRemove = "United States of America"
for index, row in df.iterrows():
    if countryToRemove in row['countries']:
        # row to be removed
But it deletes any row with the US in it even if other countries were there.
Edit: My dataframe is as following:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
1 [{'iso_3166_1': 'US', 'name': 'United States o...
2 []
If you have dataframe like this:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
1 [{'iso_3166_1': 'US', 'name': 'United States o...
2 []
Then you can use boolean indexing to filter out your dataframe:
mask = df.countries.apply(
    lambda x: len(s := set(d["name"] for d in x)) == 1
    and s.pop() == "United States of America"
)
print(df[~mask])
Prints:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
2 []
EDIT: Version without := operator:
def fn(x):
    s = set(d["name"] for d in x)
    return len(s) == 1 and s.pop() == "United States of America"

mask = df.countries.apply(fn)
print(df[~mask])
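Note that in the question the countries cells look like strings rather than lists ('[]', not []). Assuming that is the case, a sketch that parses them with ast.literal_eval first and then applies a simple single-item check:

```python
import ast
import pandas as pd

df = pd.DataFrame({'countries': [
    "[{'iso_3166_1': 'DE', 'name': 'Germany'}, "
    "{'iso_3166_1': 'US', 'name': 'United States of America'}]",
    "[{'iso_3166_1': 'US', 'name': 'United States of America'}]",
    '[]',
]})

# Parse the string cells into real lists of dicts
parsed = df['countries'].apply(ast.literal_eval)

# True only when the list has exactly one entry and it is the US
mask = parsed.apply(lambda x: len(x) == 1 and x[0]['name'] == 'United States of America')
filtered = df[~mask]
print(filtered)
```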

Using a Loop to delete rows from a Dataframe

I have a dataframe of environmental data for each country in the world. I want to remove any entries for country values that do not represent individual countries, e.g. 'Africa' or 'World'. I have made a list of those values. I am trying to loop through the df and drop each row where the country equals a value in my list. There aren't that many problem entries; I have removed them before with .loc, but I'm unsure why this function is not working. I get an error: KeyError: '[(bunch of numbers)] not found in axis'
not_country = ['Africa', 'Asia', 'Asia (excl. China & India)','EU-27','EU-28', 'Europe','Europe (excl. EU-27)',
'Europe (excl. EU-28)', 'International transport', 'Kuwaiti Oil Fires', 'North America',
'North America (excl. USA)', 'World', 'South America']
def clean_countries(df, lst):
    index_names = []
    for country_name in lst:
        index_names.append(df[df['country'] == country_name].index)
    for i in df:
        df.drop(index_names, inplace=True)

clean_co2_df = clean_countries(co2_df, not_country)
One of the advantages of a dataframe is that you seldom have to iterate through it to get the job done. There are usually more efficient ways. Here is a solution to your question using a sample dataframe with world population data.
not_country = ['Africa', 'Asia', 'Asia (excl. China & India)','EU-27','EU-28', 'Europe','Europe (excl. EU-27)',
'Europe (excl. EU-28)', 'International transport', 'Kuwaiti Oil Fires', 'North America',
'North America (excl. USA)', 'World', 'South America']
pop_data = {'Country': {0: 'China', 1: 'India', 2: 'USA', 3: 'Asia'}, 'Population': {0: 1439000000, 1: 1380004385, 2: 331002651, 3: 4641054775}, 'Yearly Change %': {0: 0.39, 1: 0.99, 2: 0.59, 3: 0.86}}
df = pd.DataFrame(pop_data)
print(f'BEFORE: \n {df}')
df = df.loc[df['Country'].apply(lambda x: x not in not_country)]
print(f'AFTER: \n {df}')
#output:
BEFORE:
Country Population Yearly Change %
0 China 1439000000 0.39
1 India 1380004385 0.99
2 USA 331002651 0.59
3 Asia 4641054775 0.86
AFTER:
Country Population Yearly Change %
0 China 1439000000 0.39
1 India 1380004385 0.99
2 USA 331002651 0.59
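For what it's worth, the same filter can also be written with the built-in isin - a sketch on the answer's sample data (the exclusion list is abbreviated here):

```python
import pandas as pd

not_country = ['Africa', 'Asia', 'World']  # abbreviated exclusion list for the sketch
pop_data = {'Country': ['China', 'India', 'USA', 'Asia'],
            'Population': [1439000000, 1380004385, 331002651, 4641054775]}
df = pd.DataFrame(pop_data)

# Keep rows whose Country is NOT in the exclusion list
df = df[~df['Country'].isin(not_country)]
print(df)
```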
