Normalizing json using pandas with inconsistent nested lists/dictionaries - python

I've been using pandas' json_normalize for a while but ran into a problem with a specific JSON file, similar to the one described here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data inside the Ats -> Ats dict and return any null values (like the one in the ID:101 entry) as NaN in the dataframe. Passing errors='ignore' to json_normalize doesn't prevent the TypeError that comes from trying to iterate over a null value.
Any advice or method for getting a valid dataframe out of data with this structure is greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which works for the entry with ID 100 but not for ID 101. I expected errors='ignore' to produce a NaN value in the dataframe, but instead I received a TypeError for trying to iterate over a null value.
The desired output is a dataframe with one row per ID, with NaN in the Name/Desc columns for the null entry.

This approach builds all the rows first and creates the DataFrame once, so it can be more efficient for large datasets.
import numpy as np

data = json.loads(data)
desired_data = list(
    map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
        if x["Ats"] is not None
        else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
For small datasets you might consider this simple try/except approach: whenever json_normalize raises an error on a null entry, a new row of NaN values is appended to the DataFrame instead.
Example:
import numpy as np

data = json.loads(data)
df = pd.DataFrame()
for item in data:
    try:
        df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        df = df.append({"ID": item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)
print(df)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
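Note that DataFrame.append was removed in pandas 2.0, so on newer versions the same idea can be written by collecting the rows in a list and concatenating once at the end (a minimal sketch of the loop above):
import numpy as np
import pandas as pd

# assuming `data` is the parsed list from json.loads above
rows = []
for item in data:
    try:
        rows.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        # null "Ats" entry: build a one-row frame with NaN placeholders
        rows.append(pd.DataFrame([{"ID": item["ID"], "Name": np.nan, "Desc": np.nan}]))
df = pd.concat(rows, ignore_index=True)
print(df)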

Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it into the requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN
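Alternatively, if you would rather keep the json_normalize call from the question, a small pre-cleaning pass also works (a sketch: it swaps each null "Ats" for a placeholder record so there is something to iterate over):
import numpy as np
import pandas as pd

# assuming `data` is the parsed list from json.loads above
for entry in data:
    if entry["Ats"] is None:
        entry["Ats"] = {"Ats": [{"Name": np.nan, "Desc": np.nan}]}

df = pd.json_normalize(data, ["Ats", "Ats"], "ID")
print(df)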

Related

Converting complex nested json to csv via pandas

I have the following json file
{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}
Note - the JSON above is a fragment of the original. The actual file contains a lot more attributes after 'margins', some of them nested and some not; I included only a few for brevity and to give an idea of the expected result.
My goal is to flatten the data and load it into CSV. Here is the code I have written so far -
import json
import pandas as pd

path = r"/Users/samt/Downloads/test_data.json"
with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')
df = pd.DataFrame.from_dict(t_data, orient='index')
print(df)
I know that the data is getting overwritten and the loop is not properly structured. I am a bit new to dealing with JSON objects in Python and I can't work out how to concatenate the results.
My goal is, once all the results are collected, to use to_csv and convert them into rows. For each margin, the rest of the record should be repeated as a separate row. Here is what I am expecting the output to be. Can someone please help with how to translate this?
From what I have found online, it is about first gathering the dictionary items, but how to transpose them into rows is what I am not able to understand. Also, is there a better way to parse the JSON than looping twice for one attribute, i.e. margins?
I can't use json_normalize as that library is not supported in our environment.
[output data]
Using the json and csv modules: create one dictionary per team, and one per margin when margins are present.
import json, csv

s = '''{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}'''

j = json.loads(s)
matches = j['matches']

rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k, thing[k]) for k in ('team', 'overallResult', 'totalMatches'))
            d['margins'] = bar
            rows.append(d)
# for row in rows: print(row)

# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')
fieldnames = ('team', 'overallResult', 'totalMatches', 'margins')
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
f.seek(0)
print(f.read())
team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,
Getting multiple item values from a dictionary can be aided by using operator.itemgetter()
>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches')
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
... 'overallResult': 'Won',
... 'team': 'Sunrisers Hyderabad',
... 'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>
I like to use it and give the callable a descriptive name, but I don't see it used much here on SO.
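For example, the dict built inside the loop above could use a named getter like this (just a sketch, using one of the parsed match dicts):
from operator import itemgetter

core_fields = ('team', 'overallResult', 'totalMatches')
get_core = itemgetter(*core_fields)

thing = {'team': 'Sunrisers Hyderabad', 'overallResult': 'Won',
         'totalMatches': 3, 'margins': [{'bar': 290}, {'bar': 90}]}
# equivalent to the dict((k, thing[k]) for k in (...)) line in the loop
d = dict(zip(core_fields, get_core(thing)))
print(d)  # {'team': 'Sunrisers Hyderabad', 'overallResult': 'Won', 'totalMatches': 3}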
You can use pd.DataFrame to create the DataFrame and then explode the margins column:
import json
import pandas as pd
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
print(df)
team overallResult totalMatches margins
0 Sunrisers Hyderabad Won 3 {'bar': 290}
1 Sunrisers Hyderabad Won 3 {'bar': 90}
2 Pune Warriors None 0 None
Then replace the None values in the margins column with a placeholder dictionary and expand the dicts into a column:
bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)
print(bar)
bar
0 290
1 90
2 <NA>
Finally, join the Series back to the original dataframe:
df = df.join(bar).drop(columns='margins')
print(df)
team overallResult totalMatches bar
0 Sunrisers Hyderabad Won 3 290
1 Sunrisers Hyderabad Won 3 90
2 Pune Warriors None 0 <NA>
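The fill-and-expand steps can also be collapsed with the .str accessor (a compact sketch of the same idea; it yields NaN for the None rows instead of <NA>):
import json
import pandas as pd

with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
# .str['bar'] pulls 'bar' out of each dict and gives NaN for the None rows
df['bar'] = df.pop('margins').str['bar']
print(df)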

How to create a single json file from two DataFrames?

I have two DataFrames that I want to post as JSON to a web service, but first I have to combine them into a single JSON document.
#first df
input_df = pd.DataFrame()
input_df['first'] = ['a', 'b']
input_df['second'] = [1, 2]
#second df
customer_df = pd.DataFrame()
customer_df['first'] = ['c']
customer_df['second'] = [3]
For converting to json, I used following code for each DataFrame;
df.to_json(
    path_or_buf='out.json',
    orient='records',  # other options are ('split', 'records', 'index', 'columns', 'values', 'table')
    date_format='iso',
    force_ascii=False,
    default_handler=None,
    lines=False,
    indent=2
)
This code gives me output like the following (for example, the exported JSON for input_df):
[
  {
    "first":"a",
    "second":1
  },
  {
    "first":"b",
    "second":2
  }
]
My desired output is like this:
{
  "input": [
    {
      "first": "a",
      "second": 1
    },
    {
      "first": "b",
      "second": 2
    }
  ],
  "customer": [
    {
      "first": "c",
      "second": 3
    }
  ]
}
How can I get output like this? I couldn't find a way :(
You can concatenate the DataFrames with appropriate key names, then group by the keys and build a dictionary of records for each group; finally build a JSON string from the whole thing:
out = (
    pd.concat([input_df, customer_df], keys=['input', 'customer'])
    .droplevel(1)
    .groupby(level=0).apply(lambda x: x.to_dict('records'))
    .to_json()
)
Output:
'{"customer":[{"first":"c","second":3}],"input":[{"first":"a","second":1},{"first":"b","second":2}]}'
or a dict, by replacing the final to_json() with to_dict().
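If you do not need the groupby machinery, a simpler route (just a sketch) is to build the outer dict yourself from to_dict('records') and dump it with the json module:
import json

payload = {
    "input": input_df.to_dict(orient="records"),
    "customer": customer_df.to_dict(orient="records"),
}
# depending on the pandas version, numbers may come back as numpy scalars;
# cast them (e.g. with int()) if json.dumps complains
out = json.dumps(payload, indent=2)
print(out)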

Convert API Reponse to Pandas DataFrame

I am making an API call with the following code:
req = urllib.request.Request(url, body, headers)
try:
    response = urllib.request.urlopen(req)
    string = response.read().decode('utf-8')
    json_obj = json.loads(string)
Which returns the following:
{"forecast": [17.588294043898163, 17.412641963452206],
"index": [
{"SaleDate": 1629417600000, "Type": "Type 1"},
{"SaleDate": 1629504000000, "Type": "Type 2"}
]
}
How can I convert this API response dict into a pandas DataFrame in the following format?
Forecast SaleDate Type
17.588294043898163 2021-08-16 Type 1
17.412641963452206 2021-08-17 Type 1
You can use the following. It uses pandas.Series to convert the dictionary to columns and pandas.to_datetime to map the correct date from the millisecond timestamp:
d = {"forecast": [17.588294043898163, 17.412641963452206],
"index": [
{"SaleDate": 1629417600000, "Type": "Type 1"},
{"SaleDate": 1629504000000, "Type": "Type 2"}
]
}
df = pd.DataFrame(d)
df = pd.concat([df['forecast'], df['index'].apply(pd.Series)], axis=1)
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
output:
forecast SaleDate Type
0 17.588294 2021-08-20 Type 1
1 17.412642 2021-08-21 Type 2
Here is a solution you can try, using a list comprehension to flatten the data.
import pandas as pd

# `resp` is the parsed API response (json_obj above)
flatten = [
    {"forecast": j, **resp['index'][i]} for i, j in enumerate(resp['forecast'])
]
pd.DataFrame(flatten)
forecast SaleDate Type
0 17.588294 1629417600000 Type 1
1 17.412642 1629504000000 Type 2
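If you also want readable dates with this second approach, the same pd.to_datetime conversion from the first answer applies (a short sketch):
import pandas as pd

df = pd.DataFrame(flatten)  # `flatten` from the list comprehension above
# convert the millisecond timestamps to datetimes
df['SaleDate'] = pd.to_datetime(df['SaleDate'], unit='ms')
print(df)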

How to import list stored as txt into pandas?

I have the following problem: I need to load a txt file into a pandas dataframe. Here is what my txt file looks like:
[
    {
        "type": "VALUE",
        "value": 93.5,
        "from": "2020-12-01T00:00:00.000+01",
        "to": "2020-12-01T00:00:01.000+01"
    },
    {
        "type": "VALUE",
        "value": 75,
        "from": "2020-12-01T00:00:01.000+01",
        "to": "2020-12-01T00:05:01.000+01"
    },
    {
        "type": "WARNING",
        "from": "2020-12-01T00:00:01.000+01",
        "to": "2020-12-01T00:05:01.000+01"
    }
]
In other words, I need to read the txt as a list of dictionaries and then turn it into a pandas dataframe. The desired output is:
type value from to
0 VALUE 93.5 2020-12-01T00:00:00.000+01 2020-12-01T00:00:01.000+01
1 VALUE 75 2020-12-01T00:00:01.000+01 2020-12-01T00:05:01.000+01
2 WARNING NaN 2020-12-01T00:00:01.000+01 2020-12-01T00:05:01.000+01
How can I do this, please? I looked at these questions, but they weren't helpful: Pandas how to import a txt file as a list, How to read a file line-by-line into a list?
import json
import pandas as pd

with open('inp_file.txt', 'r') as f:
    content = json.load(f)

df = pd.DataFrame(content)
Use read_json; even though the extension is txt, the content is still JSON:
df = pd.read_json('file.txt')
print (df)
type value from to
0 VALUE 93.5 2020-12-01T00:00:00.000+01 2020-12-01T00:00:01.000+01
1 VALUE 75.0 2020-12-01T00:00:01.000+01 2020-12-01T00:05:01.000+01
2 WARNING NaN 2020-12-01T00:00:01.000+01 2020-12-01T00:05:01.000+01

Filtering pandas dataframe by date to count views for timeline of programs

I need to count viewers per program for a streaming channel from a JSON log file.
I identify the programs by their start times.
So far I have two Dataframes like this:
The first one contains all the timestamps from the logfile
viewers_from_log = pd.read_json('sqllog.json', encoding='UTF-8')
# Convert date string to pandas datetime object:
viewers_from_log['time'] = pd.to_datetime(viewers_from_log['time'])
Source JSON file:
[
    {
        "logid": 191605,
        "time": "0:00:17"
    },
    {
        "logid": 191607,
        "time": "0:00:26"
    },
    {
        "logid": 191611,
        "time": "0:01:20"
    }
]
The second contains the starting times and titles of the programs
programs_start_time = pd.DataFrame.from_dict('programs.json', orient='index')
Source JSON file:
{
    "2019-05-29": [
        {
            "title": "\"Amiről a kövek mesélnek\"",
            "startTime_dt": "2019-05-29T00:00:40Z"
        },
        {
            "title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",
            "startTime_dt": "2019-05-29T00:22:44Z"
        },
        {
            "title": "Gubancok",
            "startTime_dt": "2019-05-29T00:48:08Z"
        }
    ]
}
So what I need to do is count the entries per program in the log file and link them to the program titles.
My approach is to slice the log data for each date range taken from the program data and take its shape. Then add a column to the program data with the collected counts:
import pandas as pd

# setup test data
log_data = {'Time': ['2019-05-30 00:00:26', '2019-05-30 00:00:50', '2019-05-30 00:05:50', '2019-05-30 00:23:26']}
log_data = pd.DataFrame(data=log_data)
program_data = {'Time': ['2019-05-30 00:00:00', '2019-05-30 00:22:44'],
                'Program': ['Program 1', 'Program 2']}
program_data = pd.DataFrame(data=program_data)

counts = []
for index, row in program_data.iterrows():
    # get counts on selected range
    try:
        log_range = log_data[(log_data['Time'] > program_data.loc[index].values[0]) & (log_data['Time'] < program_data.loc[index + 1].values[0])]
        counts.append(log_range.shape[0])
    except KeyError:
        # last programme: there is no next start time, so take everything after its start
        log_range = log_data[log_data['Time'] > program_data.loc[index].values[0]]
        counts.append(log_range.shape[0])

# add additional column with collected counts
program_data['Counts'] = counts
Output:
Time Program Counts
0 2019-05-30 00:00:00 Program 1 3
1 2019-05-30 00:22:44 Program 2 1
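If you want to avoid the explicit loop, a vectorized variant is also possible (a sketch using the same test data; numpy.searchsorted assigns each log entry to the latest programme start before it):
import numpy as np
import pandas as pd

starts = pd.to_datetime(program_data['Time']).values
log_times = pd.to_datetime(log_data['Time']).values

# index of the programme each log entry falls into
# (entries before the first start would get -1 and be dropped by the reindex)
idx = np.searchsorted(starts, log_times, side='right') - 1
program_data['Counts'] = (
    pd.Series(idx).value_counts()
    .reindex(range(len(starts)), fill_value=0)
    .values
)
print(program_data)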
A working (but maybe a little quick and dirty) method:
Use the .shift(-1) method on the timestamp column of the programs_start_time dataframe to get an additional column, say date_end, holding the end timestamp of each TV program.
Then, for each example_timestamp in the log file, you can query the TV-programs dataframe like this: df[(df['date_start'] <= example_timestamp) & (df['date_end'] > example_timestamp)] (substitute df with your dataframe's name, programs_start_time); this gives you exactly one row, from which you can extract the name of the TV program, as sketched below.
Hope this helps!
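A minimal sketch of that idea, with hypothetical column names date_start and title:
import pandas as pd

programs = pd.DataFrame({
    'title': ['Program 1', 'Program 2'],
    'date_start': pd.to_datetime(['2019-05-30 00:00:00', '2019-05-30 00:22:44']),
})
# end of each programme = start of the next one; the last one stays open-ended
programs['date_end'] = programs['date_start'].shift(-1).fillna(pd.Timestamp.max)

example_timestamp = pd.Timestamp('2019-05-30 00:05:50')
match = programs[(programs['date_start'] <= example_timestamp) &
                 (programs['date_end'] > example_timestamp)]
print(match['title'].iloc[0])  # Program 1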
Solution with histogram, using numpy:
import pandas as pd
import numpy as np
df_p = pd.DataFrame([
{
"title": "\"Amiről a kövek mesélnek\"",
"startTime_dt": "2019-05-29T00:00:40Z"
},
{
"title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",
"startTime_dt": "2019-05-29T00:22:44Z"
},
{
"title": "Gubancok",
"startTime_dt": "2019-05-29T00:48:08Z"
}
])
df_v = pd.DataFrame([
{
"logid": 191605,
"time": "2019-05-29 0:00:17"
},
{
"logid": 191607,
"time": "2019-05-29 0:00:26"
},
{
"logid": 191611,
"time": "2019-05-29 0:01:20"
}
])
df_p.startTime_dt = pd.to_datetime(df_p.startTime_dt)
df_v.time = pd.to_datetime(df_v.time)
# here's part where I convert datetime to timestamp in seconds - astype(int) casts it to nanoseconds, hence there's // 10**9
programmes_start = df_p.startTime_dt.astype(int).values // 10**9
viewings_starts = df_v.time.astype(int).values // 10**9
# make bins for histogram
# add zero to the beginning of the array
# add value that is time an hour after the start of the last given programme to the end of the array
programmes_start = np.pad(programmes_start, (1, 1), mode='constant', constant_values=(0, programmes_start.max()+3600))
histogram = np.histogram(viewings_starts, bins=programmes_start)
print(histogram[0])
# prints [2 1 0 0]
Interpretation: there were 2 log entries before 'Amiről a kövek mesélnek' started, 1 log entry between the starts of 'Amiről a kövek mesélnek' and 'Koffer - Kedvcsináló Kul(t)túrák Külföldön', 0 log entries between the starts of 'Koffer - Kedvcsináló Kul(t)túrák Külföldön' and 'Gubancok', and 0 entries after the start of 'Gubancok'. Which, looking at the data you provided, seems correct :) Hope this helps.
NOTE: I assume that you have the dates of the viewings. They are not in the example log file, but they appear in the screenshot, so I assumed you can compute/get them somehow and added them by hand to the input dict.
