I am trying to read a JSON dataset (part of it is shown below). I want to load it into a flattened pandas DataFrame so that all fields, in particular "A" and "B", are available as columns for further processing.
import pandas as pd
datajson= {
"10001": {
"extra": {"user": "Tom"},
"data":{"A":5, "B":10}
},
"10002":{
"extra": {"user": "Ben"},
"data":{"A":7, "B":20}
},
"10003":{
"extra": {"user": "Ben"},
"data":{"A":6, "B":15}
}
}
df = pd.read_json(datajson, orient='index')
# same with DataFrame.from_dict
# df2 = pd.DataFrame.from_dict(datajson, orient='index')
which results in a DataFrame that is not flattened.
I assume there is a simple way to do this without looping/appending or writing a complicated and slow decoder, for example using pandas' json_normalize().
I don't think you will be able to do that without looping through the json. You can do that relatively efficiently though if you make use of a list comprehension:
def parse_inner_dictionary(data):
    return pd.concat([pd.DataFrame(i, index=[0]) for i in data.values()], axis=1)
df = pd.concat([parse_inner_dictionary(v) for v in datajson.values()])
df.index = datajson.keys()
print(df)
user A B
10001 Tom 5 10
10002 Ben 7 20
10003 Ben 6 15
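For comparison, here is a minimal sketch of the json_normalize route the question asks about, assuming every record has the same two-level shape. The step that strips the "extra."/"data." prefixes is an addition of mine and assumes the leaf names never collide.

```python
import pandas as pd

datajson = {
    "10001": {"extra": {"user": "Tom"}, "data": {"A": 5, "B": 10}},
    "10002": {"extra": {"user": "Ben"}, "data": {"A": 7, "B": 20}},
    "10003": {"extra": {"user": "Ben"}, "data": {"A": 6, "B": 15}},
}

# Flatten each record; nested keys become dotted names such as "data.A".
df = pd.json_normalize(list(datajson.values()))
# Restore the outer keys as the index (dicts preserve insertion order).
df.index = list(datajson.keys())
# Keep only the last path component, e.g. "data.A" -> "A" (assumes no name clashes).
df.columns = [name.split(".")[-1] for name in df.columns]
print(df)
```

json_normalize still iterates over the records internally, but the loop no longer has to be written by hand.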
I've been using pandas' json_normalize for a bit but ran into a problem with a specific JSON file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice or methods to get a valid DataFrame out of data with this structure would be greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.
The desired output would look like this: Dataframe
This approach can be more efficient when dealing with large datasets:
import numpy as np
data = json.loads(data)
# Normalize each record on its own; records whose "Ats" is null get a NaN row.
# Note: [0] keeps only the first entry of each "Ats" list.
desired_data = list(
    map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
        if x["Ats"] is not None
        else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
For small datasets, a simple try/except approach may be enough: whenever json_normalize raises a TypeError, a row filled with NaN is added instead.
Example:
import numpy as np
data = json.loads(data)
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once.
frames = []
for item in data:
    try:
        frames.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        # "Ats" is null for this record: add a placeholder row with NaN values
        frames.append(pd.DataFrame([{"ID": item["ID"], "Name": np.nan, "Desc": np.nan}]))
df = pd.concat(frames, ignore_index=True)
print(df)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN
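Another option, sketched here under the assumption that a null "Ats" should simply become one row of NaN values, is to patch the parsed data before calling json_normalize so that a single call handles every record. The names EMPTY_ATS and cleaned are illustrative and not part of pandas.

```python
import json

import pandas as pd

raw = """[
  {"ID": "100", "Ats": {"Ats": [{"Name": "At1", "Desc": "Lazy At"}]}},
  {"ID": "101", "Ats": null}
]"""
data = json.loads(raw)

# Replace a null "Ats" with a stub holding one empty record, so that
# json_normalize has something to iterate over and emits a row of NaN values.
EMPTY_ATS = {"Ats": [{}]}
cleaned = [{**item, "Ats": item["Ats"] or EMPTY_ATS} for item in data]

df = pd.json_normalize(cleaned, record_path=["Ats", "Ats"], meta="ID")
print(df)
```

Unlike the per-record approaches above, this keeps every entry of each inner "Ats" list. A record whose inner list is present but empty would still produce no row.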
I have the following json file
{
"matches": [
{
"team": "Sunrisers Hyderabad",
"overallResult": "Won",
"totalMatches": 3,
"margins": [
{
"bar": 290
},
{
"bar": 90
}
]
},
{
"team": "Pune Warriors",
"overallResult": "None",
"totalMatches": 0,
"margins": null
}
],
"totalMatches": 70
}
Note: the JSON above is a fragment of the original file. The actual file contains many more attributes after 'margins', some nested and some not. I included only a few for brevity and to give an idea of the expected result.
My goal is to flatten the data and load it into CSV. Here is the code I have written so far -
import json
import pandas as pd
path = r"/Users/samt/Downloads/test_data.json"
with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')
    df = pd.DataFrame.from_dict(t_data, orient='index')
    print(df)
I know that the data is getting overwritten and that the loop is not properly structured. I am a bit new to dealing with JSON objects in Python, and I don't understand how to concatenate the results.
My goal is, once all the results are appended, to use to_csv and write them out as rows. For each margin, the entire record is to be replicated as a separate row. Here is what I expect the output to be. Can someone please help me with this?
From whatever I find on the net, it is about first gathering the dictionary items but how to transpose it to rows is something I am not able to understand. Also, is there a better way to parse the json than doing the loop twice for one attribute i.e. margins?
I can't use json_normalize as that library is not supported in our environment.
[output data]
Using the json and csv modules: create a dictionary for each team, for each margin if there is one.
import json, csv
s = '''{
"matches": [
{
"team": "Sunrisers Hyderabad",
"overallResult": "Won",
"totalMatches": 3,
"margins": [
{
"bar": 290
},
{
"bar": 90
}
]
},
{
"team": "Pune Warriors",
"overallResult": "None",
"totalMatches": 0,
"margins": null
}
],
"totalMatches": 70
}'''
j = json.loads(s)
matches = j['matches']
rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k,thing[k]) for k in ('team','overallResult','totalMatches'))
            d['margins'] = bar
            rows.append(d)
# for row in rows: print(row)
# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')
fieldnames=('team','overallResult','totalMatches','margins')
writer = csv.DictWriter(f,fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
f.seek(0)
print(f.read())
team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,
Getting multiple item values from a dictionary can be aided by using operator.itemgetter()
>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches')
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
... 'overallResult': 'Won',
... 'team': 'Sunrisers Hyderabad',
... 'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>
I like to use it and give the callable a descriptive name, but I don't see it used much here on SO.
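As a rough sketch of how itemgetter could slot into the row-building loop from the csv answer: the helper name rows_for and the constant BASE_FIELDS are made up for this illustration, and the sample data is the fragment from the question.

```python
import operator

BASE_FIELDS = ("team", "overallResult", "totalMatches")
get_base = operator.itemgetter(*BASE_FIELDS)


def rows_for(match):
    """Return one flat row per margin, or a single row if there are none."""
    base = dict(zip(BASE_FIELDS, get_base(match)))
    # A null or empty "margins" still yields one row, with no margin value.
    margins = match.get("margins") or [{}]
    return [{**base, "margins": margin.get("bar")} for margin in margins]


matches = [
    {"team": "Sunrisers Hyderabad", "overallResult": "Won", "totalMatches": 3,
     "margins": [{"bar": 290}, {"bar": 90}]},
    {"team": "Pune Warriors", "overallResult": "None", "totalMatches": 0,
     "margins": None},
]
rows = [row for match in matches for row in rows_for(match)]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be handed to csv.DictWriter.writerows exactly as in the answer above.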
You can use pd.DataFrame to create DataFrame and explode the margins column
import json
import pandas as pd
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())
df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
print(df)
team overallResult totalMatches margins
0 Sunrisers Hyderabad Won 3 {'bar': 290}
1 Sunrisers Hyderabad Won 3 {'bar': 90}
2 Pune Warriors None 0 None
Then replace the None values in the margins column with a placeholder dictionary and expand the dictionaries into columns
bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)
print(bar)
bar
0 290
1 90
2 <NA>
Finally, join the result to the original DataFrame
df = df.join(bar).drop(columns='margins')
print(df)
team overallResult totalMatches bar
0 Sunrisers Hyderabad Won 3 290
1 Sunrisers Hyderabad Won 3 90
2 Pune Warriors None 0 <NA>
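Since the stated goal is a CSV file, here is a compact sketch of the same explode idea carried through to to_csv. It writes to an in-memory buffer as a stand-in for a real path, and the column name margin is my choice rather than anything required by pandas.

```python
import io

import pandas as pd

matches = [
    {"team": "Sunrisers Hyderabad", "overallResult": "Won", "totalMatches": 3,
     "margins": [{"bar": 290}, {"bar": 90}]},
    {"team": "Pune Warriors", "overallResult": "None", "totalMatches": 0,
     "margins": None},
]

# One row per margin; teams without margins keep a single row.
df = pd.DataFrame(matches).explode("margins", ignore_index=True)
# Pull the "bar" value out of each margin dict; rows without margins get a missing value.
df["margin"] = df.pop("margins").map(lambda m: m["bar"] if isinstance(m, dict) else None)

buffer = io.StringIO()  # stand-in for a real file path
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path instead of the buffer to to_csv would write the rows to disk.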
I have a nested json (like the one reported below) of translated labels, and I want to extract the leaves in separate json files, based on the languages key (it, en, etc).
I don't know the depth or the schema of the JSON in advance ("at compile time"), because there are many files similar to the big nested one, but I know that I always have the following structure: key path/to/en/label and value content.
I tried using pandas' json_normalize function to flatten my JSON, and it works great, but afterwards I had trouble rebuilding the JSON schema. For example, with the following JSON I get a 1x12 DataFrame, but I want a DataFrame with shape 4x3, where 4 is the number of labels (index) and 3 is the number of languages (columns).
import json
import os
import pathlib
import pandas as pd

def fix_df(df: pd.DataFrame):
    assert df.shape[0] == 1
    columns = df.columns
    columns_last_piece = [s.split("/")[-1] for s in columns]
    fixed_columns = [s.split(".")[1] for s in columns_last_piece]
    index = [".".join(elem.split(".")[2:]) for elem in columns_last_piece]
    return pd.DataFrame(df.values, index=index, columns=fixed_columns)

def main():
    path = pathlib.Path(os.getenv("FIXTURE_FLATTEN_PATH"))
    assert path.exists()
    json_dict = json.load(open(path, encoding="utf-8"))
    flattened_json = pd.json_normalize(json_dict)
    flattened_json_fixed = fix_df(flattened_json)
    # do something with flattened_json_fixed
Example of my_labels.json:
{
"dynamicData": {
"bff_scoring": {
"subCollection": {
"dynamicData/bff_scoring/translations": {
"it": {
"area_title.people": "PERSONE",
"area_title.planet": "PIANETA",
"area_title.prosperity": "PROSPERITÀ",
"area_title.principle-gov": "PRINCIPI DI GOVERNANCE"
},
"en": {
"area_title.people": "PEOPLE",
"area_title.planet": "PLANET",
"area_title.prosperity": "PROSPERITY",
"area_title.principle-gov": "PRINCIPLE OF GOVERNANCE"
},
"fr":{
"area_title.people": "PERSONNES",
"area_title.planet": "PLANÈTE",
"area_title.prosperity": "PROSPERITÉ",
"area_title.principle-gov": "PRINCIPES DE GOUVERNANCE"
}
}
}
}
}
}
Example of my_labels_it.json:
{
"area_title.people": "PERSONE",
"area_title.planet": "PIANETA",
"area_title.prosperity": "PROSPERITÀ",
"area_title.principle-gov": "PRINCIPI DI GOVERNANCE"
}
I finally managed to solve this problem.
First, I need to use the melt function.
>>> df = flattened_json.melt()
>>> df
variable value
0 dynamicData.bff_scoring.subCollection.dynamicD... PERSONE
1 dynamicData.bff_scoring.subCollection.dynamicD... PIANETA
2 dynamicData.bff_scoring.subCollection.dynamicD... PROSPERITÀ
3 dynamicData.bff_scoring.subCollection.dynamicD... PRINCIPI DI GOVERNANCE
...
From here, I can extract the fields I'm interested in with a regular expression. I tried using .str.extractall and explode, but I was greeted with an exception, so I resorted to calling .str.extract twice.
>>> df2 = df.assign(language=df.variable.str.extract(r".*\.([a-z]{2})\.[\w\.-]+$"), label=df.variable.str.extract(r"(?<=\.[a-z]{2}\.)([\w\.-]+)$")).drop(columns="variable")
>>> df2
value language label
0 PERSONE it area_title.people
1 PIANETA it area_title.planet
2 PROSPERITÀ it area_title.prosperity
3 PRINCIPI DI GOVERNANCE it area_title.principle-gov
...
And then, with a pivot, I can have the dataframe with the desired schema.
>>> df3 = df2.pivot(index="label", columns="language", values="value")
>>> df3
language en ... it
label ...
area_title.people PEOPLE ... PERSONE
area_title.planet PLANET ... PIANETA
area_title.principle-gov PRINCIPLE OF GOVERNANCE ... PRINCIPI DI GOVERNANCE
area_title.prosperity PROSPERITY ... PROSPERITÀ
From this DataFrame it is very simple to obtain the expected JSON.
>>> df3["it"].to_json(force_ascii=False)
'{"area_title.people":"PERSONE","area_title.planet":"PIANETA","area_title.principle-gov":"PRINCIPI DI GOVERNANCE","area_title.prosperity":"PROSPERITÀ"}'
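Putting the steps together, a minimal end-to-end sketch might look like the following. It uses a shortened two-language, two-label version of the example file, a single regular expression with named groups instead of two extract calls, and a temporary directory as the output location. Those choices, and the my_labels_<language>.json naming, are assumptions made for the illustration.

```python
import json
import pathlib
import tempfile

import pandas as pd

nested = {
    "dynamicData": {"bff_scoring": {"subCollection": {
        "dynamicData/bff_scoring/translations": {
            "it": {"area_title.people": "PERSONE", "area_title.planet": "PIANETA"},
            "en": {"area_title.people": "PEOPLE", "area_title.planet": "PLANET"},
        }
    }}}
}

# One row per leaf: "variable" holds the dotted path, "value" the translation.
flat = pd.json_normalize(nested).melt()
# Assumes the language is a two-letter segment followed by the label.
parts = flat["variable"].str.extract(r"\.(?P<language>[a-z]{2})\.(?P<label>[\w.-]+)$")
tidy = pd.concat([parts, flat["value"]], axis=1)
wide = tidy.pivot(index="label", columns="language", values="value")

# Write one JSON file per language column.
out_dir = pathlib.Path(tempfile.mkdtemp())
for language in wide.columns:
    target = out_dir / f"my_labels_{language}.json"
    target.write_text(wide[language].to_json(force_ascii=False), encoding="utf-8")
```

The regular expression would misfire if a path segment before the language code were itself two lowercase letters, so it is worth checking against the real files.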
I'm trying to replace the 'starters' column of this DataFrame
starters
roster_id
Bob 3086
Bob 1234
Cam 6130
... ...
with the player names from a large nested dict like this. The values in my 'starters' column are the keys.
{
"3086": {
"team": "NE",
"player_id":"3086",
"full_name": "tombrady",
},
"1234": {
"team": "SEA",
"player_id":"1234",
"full_name": "RussellWilson",
},
"6130": {
"team": "BUF",
"player_id":"6130",
"full_name": "DevinSingletary",
},
...
}
I tried using DataFrame.replace(dict) and Dataframe.map(dict) but that gives me back all the player info instead of just the name.
Is there a way to do this with a nested dict? Thanks.
Let df be the DataFrame and d be the dictionary; then you can use pandas' apply on axis 1 to change the column:
df['starters'] = df.apply(lambda x: d[str(x.starters)]['full_name'], axis=1)
I am not sure if I understand your question correctly. Have you tried using dict['full_name'] instead of simply dict?
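One way to read that suggestion, sketched here on the sample data from the question: first reduce the nested dict to a flat id-to-name lookup, then map the column through it. The variable names players and names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"starters": [3086, 1234, 6130]},
    index=pd.Index(["Bob", "Bob", "Cam"], name="roster_id"),
)
players = {
    "3086": {"team": "NE", "player_id": "3086", "full_name": "tombrady"},
    "1234": {"team": "SEA", "player_id": "1234", "full_name": "RussellWilson"},
    "6130": {"team": "BUF", "player_id": "6130", "full_name": "DevinSingletary"},
}

# Flatten the nested dict to {player_id: full_name}.
names = {player_id: info["full_name"] for player_id, info in players.items()}
# The dict keys are strings, so cast the column before mapping.
df["starters"] = df["starters"].astype(str).map(names)
print(df)
```

Ids missing from the dictionary come back as NaN instead of raising a KeyError.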
Try pd.concat with series.map:
>>> pd.concat([
df,
pd.DataFrame.from_records(
df.astype(str)
.starters
.map(dct)
.values
).set_index(df.index)
], axis=1)
starters team player_id full_name
roster_id
Bob 3086 NE 3086 tombrady
Bob 1234 SEA 1234 RussellWilson
Cam 6130 BUF 6130 DevinSingletary
Here is the example JSON I'm working with.
{
":#computed_region_amqz_jbr4": "587",
":#computed_region_d3gw_znnf": "18",
":#computed_region_nmsq_hqvv": "55",
":#computed_region_r6rf_p9et": "36",
":#computed_region_rayf_jjgk": "295",
"arrests": "1",
"county_code": "44",
"county_code_text": "44",
"county_name": "Mifflin",
"fips_county_code": "087",
"fips_state_code": "42",
"incident_count": "1",
"lat_long": {
"type": "Point",
"coordinates": [
-77.620031,
40.612749
]
}
}
I have been able to pull out select columns I want except I'm having troubles with "lat_long". So far my code looks like:
# PRINTS OUT SPECIFIED COLUMNS
col_titles = ['county_name', 'incident_count', 'lat_long']
df = df.reindex(columns=col_titles)
However 'lat_long' is added to the data frame as such: {'type': 'Point', 'coordinates': [-75.71107, 4...
I thought that once I figured out how to properly add the coordinates to the DataFrame, I would create two separate columns, one for latitude and one for longitude.
Any help with this matter would be appreciated. Thank you.
If I haven't misunderstood your requirements, you can try it this way with json_normalize. I've only added a demo for a single JSON record; you can use apply or a lambda for multiple records.
import pandas as pd
df = {":#computed_region_amqz_jbr4":"587",":#computed_region_d3gw_znnf":"18",":#computed_region_nmsq_hqvv":"55",":#computed_region_r6rf_p9et":"36",":#computed_region_rayf_jjgk":"295","arrests":"1","county_code":"44","county_code_text":"44","county_name":"Mifflin","fips_county_code":"087","fips_state_code":"42","incident_count":"1","lat_long":{"type":"Point","coordinates":[-77.620031,40.612749]}}
# pd.json_normalize is the top-level entry point since pandas 1.0
df = pd.json_normalize(df)
# .copy() avoids a SettingWithCopyWarning when adding columns below
df_modified = df[['county_name', 'incident_count', 'lat_long.type']].copy()
# the coordinates are stored as [longitude, latitude]
df_modified['lat'] = df['lat_long.coordinates'][0][1]
df_modified['lng'] = df['lat_long.coordinates'][0][0]
print(df_modified)
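Here is another way to do it:
# df is assumed to be the raw JSON record (a dict)
df1 = pd.json_normalize(df)
# the coordinates are [longitude, latitude], so element 0 is 'long' and element 1 is 'lat'
pd.concat([df1, df1['lat_long.coordinates'].apply(pd.Series) \
.rename(columns={0: 'long', 1: 'lat'})], axis=1) \
.drop(columns=['lat_long.coordinates', 'lat_long.type'])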
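For several records at once, a sketch along these lines may help. It assumes each lat_long follows the GeoJSON convention of [longitude, latitude], which matches the sample point. The second record is hypothetical and exists only to show that the approach handles more than one row.

```python
import pandas as pd

records = [
    {"county_name": "Mifflin", "incident_count": "1",
     "lat_long": {"type": "Point", "coordinates": [-77.620031, 40.612749]}},
    # Hypothetical second record, made up for this example.
    {"county_name": "Hypothetical", "incident_count": "2",
     "lat_long": {"type": "Point", "coordinates": [-75.5, 40.0]}},
]

df = pd.json_normalize(records)
# Split each [longitude, latitude] pair into two columns aligned on the same index.
coords = pd.DataFrame(
    df["lat_long.coordinates"].tolist(),
    columns=["longitude", "latitude"],
    index=df.index,
)
result = df[["county_name", "incident_count"]].join(coords)
print(result)
```

Building the coordinate frame from a list of lists avoids calling apply(pd.Series) once per row, which tends to be slow on larger inputs.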