Flattening nested JSON including embedded array in Python using Pandas - python

I have a JSON array from a mongoexport containing data from the Beddit sleep tracker. Below is a truncated example of one of the documents (some unneeded detail removed):
{
    "user" : "xxx",
    "provider" : "beddit",
    "date" : ISODate("2016-11-30T23:00:00.000Z"),
    "data" : [
        {
            "end_timestamp" : 1480570804.26226,
            "properties" : {
                "sleep_efficiency" : 0.8772404,
                "resting_heart_rate" : 67.67578,
                "short_term_resting_heart_rate" : 61.36963,
                "activity_index" : 50.51958,
                "average_respiration_rate" : 16.25667,
                "total_sleep_score" : 64
            },
            "date" : "2016-12-01",
            "session_range_start" : 1480545636.55059,
            "start_timestamp" : 1480545636.55059,
            "session_range_end" : 1480570804.26226,
            "tags" : [
                "not_enough_sleep",
                "long_sleep_latency"
            ],
            "updated" : 1480570805.25201
        }
    ],
    "__v" : 0
}
Several related questions like this and this do not seem to work for the data structure above. As recommended in other related questions I am trying to stay away from looping over each row for performance reasons (the full dataset is ~150MB). How would I flatten out the "data"-key with json_normalize so that each key is at the top-level? I would prefer one DataFrame where e.g. total_sleep_score is a column.
Any help is much appreciated! Even though I know how to 'prepare' the data using JavaScript, I would like to be able to understand and do it using Python.
Edit (requested in a comment, to show the preferred structure):
{
    "user" : "xxx",
    "provider" : "beddit",
    "date" : ISODate("2016-11-30T23:00:00.000Z"),
    "end_timestamp" : 1480570804.26226,
    "properties.sleep_efficiency" : 0.8772404,
    "properties.resting_heart_rate" : 67.67578,
    "properties.short_term_resting_heart_rate" : 61.36963,
    "properties.activity_index" : 50.51958,
    "properties.average_respiration_rate" : 16.25667,
    "properties.total_sleep_score" : 64,
    "date" : "2016-12-01",
    "session_range_start" : 1480545636.55059,
    "start_timestamp" : 1480545636.55059,
    "session_range_end" : 1480570804.26226,
    "updated" : 1480570805.25201,
    "__v" : 0
}
The 'properties.' prefix is not necessary, but would be nice.

Try this algorithm to flatten it:
def flattenPattern(pattern):
    """Recursively flatten a nested dict, joining nested keys with '.'."""
    newPattern = {}
    # Unwrap lists by taking their first element; the 'data' array here
    # holds a single session document. Lists of plain strings (like 'tags')
    # fall through the str check below and are dropped.
    if type(pattern) is list:
        pattern = pattern[0]
    if type(pattern) is not str:
        for key, value in pattern.items():
            if type(value) in (list, dict):
                returnedData = flattenPattern(value)
                for i, j in returnedData.items():
                    # Keys found under 'data' are promoted to the top level;
                    # everything else keeps a dotted prefix.
                    if key == "data":
                        newPattern[i] = j
                    else:
                        newPattern[key + "." + i] = j
            else:
                newPattern[key] = value
    return newPattern

print(flattenPattern(dictFromJson))
Output:
{
    'session_range_start': 1480545636.55059,
    'start_timestamp': 1480545636.55059,
    'properties.average_respiration_rate': 16.25667,
    'session_range_end': 1480570804.26226,
    'properties.resting_heart_rate': 67.67578,
    'properties.short_term_resting_heart_rate': 61.36963,
    'updated': 1480570805.25201,
    'properties.total_sleep_score': 64,
    'properties.activity_index': 50.51958,
    '__v': 0,
    'user': 'xxx',
    'provider': 'beddit',
    'date': '2016-12-01',
    'properties.sleep_efficiency': 0.8772404,
    'end_timestamp': 1480570804.26226
}
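To get from this single flattened dict to the one-row-per-document DataFrame the question asks for, a minimal follow-up sketch (assuming the mongoexport has been parsed into a list of dicts, here called documents, a name introduced only for illustration):

import pandas as pd

# Flatten every exported document; one row per document.
df = pd.DataFrame([flattenPattern(doc) for doc in documents])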

Although not explicitly what I asked for, the following worked for me so far:
Step 1
Normalize the 'data' records using json_normalize on the original parsed JSON (not on a Pandas DataFrame) and prefix the resulting columns with 'data.'.
beddit_data = pd.io.json.json_normalize(beddit, record_path='data', record_prefix='data.', meta='_id')
Step 2
The 'data.properties' column is a Series of dicts, so these can be expanded into columns with .apply(pd.Series):
beddit_data_properties = beddit_data['data.properties'].apply(pd.Series)
Step 3
The final step is to merge both DataFrames. In Step 1 I kept meta='_id' so that this DataFrame can be merged with the original DataFrame from Beddit. I haven't done that merge in the final step yet, because I first want to spend some time on the results so far.
beddit_final = pd.concat([beddit_data_properties, beddit_data], axis=1)
If anyone is interested, I can share the final Jupyter Notebook when it is ready :)
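For reference, on recent pandas versions (json_normalize has been a top-level function since pandas 1.0) the three steps can plausibly be collapsed into a single call. A sketch, assuming beddit is the parsed list of documents as above; meta_prefix is needed because the top-level 'date' clashes with the per-record 'date':

import pandas as pd

beddit_final = pd.json_normalize(
    beddit,
    record_path='data',                        # explode the embedded 'data' array
    meta=['user', 'provider', 'date', '__v'],  # repeat top-level fields on every row
    meta_prefix='doc.',                        # 'doc.date' avoids the clash with the record-level 'date'
)
# Nested dicts inside each record are flattened automatically, so
# 'properties.total_sleep_score' comes out as its own column.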

Related

How to extract from_dict to pandas dataframe with varying array length?

I have a dict that I need to convert to a pandas DataFrame. The dict contains arrays; if the arrays are all the same length it works fine, but arrays of different lengths throw a ValueError. My second question: I need to access only a few key-value pairs from the dict.
This case works; as expected, I get two rows:
my_dict = {
    "ColA" : "No",
    "ColB" : [
        {
            "ColB_a" : "2011-10-26T00:00:00Z",
            "ColB_b" : 8.3
        },
        {
            "ColB_a" : "2013-10-26T00:00:00Z",
            "ColB_b" : 5.3
        }
    ],
    "ColC" : "Graduate",
    "ColD" : [
        {
            "ColD_a" : 5436.0,
            "ColD_b" : "RD"
        },
        {
            "ColD_a" : 4658.0,
            "ColD_b" : "DV"
        }
    ],
    "ColE" : "Work"
}
sa = pd.DataFrame(my_dict)
In this case, ColB has only one value:
my_dict = {
    "ColA" : "No",
    "ColB" : [
        {
            "ColB_a" : "2011-10-26T00:00:00Z",
            "ColB_b" : 8.3
        }
    ],
    "ColC" : "Graduate",
    "ColD" : [
        {
            "ColD_a" : 5436.0,
            "ColD_b" : "RD"
        },
        {
            "ColD_a" : 4658.0,
            "ColD_b" : "DV"
        }
    ],
    "ColE" : "Work"
}
sa = pd.DataFrame(my_dict)
This throws ValueError: arrays must all be same length. How can this be fixed?
expected output is
I can do
sa = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
But I have to melt and join again.
Second question: if I need to choose only ColA and ColB from the dict to create the DataFrame, how can this be done?
For your second question, you can select specific columns from your dictionary using the columns parameter. For example:
sa = pd.DataFrame(my_dict, columns = ['ColA', 'ColD'])
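The first question (the ValueError) can be handled by padding the shorter lists to a common length before building the DataFrame. A sketch, assuming missing entries should simply become None/NaN:

import pandas as pd

# Pad list values to the length of the longest list; replicate scalars.
max_len = max(len(v) if isinstance(v, list) else 1 for v in my_dict.values())
padded = {
    k: v + [None] * (max_len - len(v)) if isinstance(v, list) else [v] * max_len
    for k, v in my_dict.items()
}
sa = pd.DataFrame(padded)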

Complex json to pandas dataframe

There are lots of questions about JSON to pandas DataFrame, but none of them solved my issue. I am practicing on this complex JSON file, which looks like this:
{
    "type" : "FeatureCollection",
    "features" : [ {
        "Id" : 265068000,
        "type" : "Feature",
        "geometry" : {
            "type" : "Point",
            "coordinates" : [ 22.170376666666666, 65.57273333333333 ]
        },
        "properties" : {
            "timestampExternal" : 1529151039629
        }
    }, {
        "Id" : 265745760,
        "type" : "Feature",
        "geometry" : {
            "type" : "Point",
            "coordinates" : [ 20.329506666666667, 63.675425000000004 ]
        },
        "properties" : {
            "timestampExternal" : 1529151278287
        }
    } ]
}
I want to convert this JSON directly to a pandas DataFrame using pd.read_json(). My primary goal is to extract Id, Coordinates and timestampExternal. As this is a very complex JSON, the normal pd.read_json() approach simply doesn't give the correct output. Can you suggest how I can approach this kind of situation? The expected output is something like this:
Id,Coordinates,timestampExternal
265068000,[22.170376666666666, 65.57273333333333],1529151039629
265745760,[20.329506666666667, 63.675425000000004],1529151278287
You can read the json to load it into a dictionary. Then, using a list comprehension over the features, extract the attributes you want as columns -
import json
import pandas as pd

_json = json.load(open('/path/to/json'))
df_dict = [{'id': item['Id'],
            'coordinates': item['geometry']['coordinates'],
            'timestampExternal': item['properties']['timestampExternal']}
           for item in _json['features']]
extracted_df = pd.DataFrame(df_dict)
>>>
coordinates id timestampExternal
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287
You can read the json directly, and then feed the features array to pandas as a list of dicts like:
Code:
import json
import pandas as pd

with open('test.json', 'r') as f:
    data = json.load(f)

df = pd.DataFrame([dict(id=datum['Id'],
                        coords=datum['geometry']['coordinates'],
                        ts=datum['properties']['timestampExternal'],
                        )
                   for datum in data['features']])
print(df)
Results:
coords id ts
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287
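On newer pandas (a sketch assuming pandas >= 1.0, where json_normalize is a top-level function) the same extraction works without the explicit comprehension, since nested dicts are flattened into dotted column names:

import pandas as pd

df = pd.json_normalize(data['features'])
# Columns include 'Id', 'geometry.coordinates' and 'properties.timestampExternal';
# keep the three of interest and rename them.
df = df[['Id', 'geometry.coordinates', 'properties.timestampExternal']]
df.columns = ['Id', 'Coordinates', 'timestampExternal']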

Working with JSON Nested in JSON Using Python/Pandas

I am trying to load JSON data using Python; however, it looks like this:
{
    "instrument" : "EUR_USD",
    "granularity" : "D",
    "candles" : [
        {
            "time" : "2014-07-02T04:00:00.000000Z", // time in RFC3339 format
            "openMid" : 1.36803,
            "highMid" : 1.368125,
            "lowMid" : 1.364275,
            "closeMid" : 1.365315,
            "volume" : 28242,
            "complete" : true
        },
        {
            "time" : "2014-07-03T04:00:00.000000Z", // time in RFC3339 format
            "openMid" : 1.36532,
            "highMid" : 1.366445,
            "lowMid" : 1.35963,
            "closeMid" : 1.3613,
            "volume" : 30487,
            "complete" : false
        }
    ]
}
My problem is that when I load it using Pandas, instrument, granularity, and candles are processed as the column titles. However, I want to use time, openMid, highMid, lowMid, closeMid, volume, and complete to create my columns; instead they are processed as belonging to candles. Any ideas on how I can accomplish this? Thanks
You'll have to read the string using the json library first:
import json
data = json.loads(string)
And then you can extract the candles data from the resulting dictionary and build your DataFrame that way, e.g.:
import pandas as pd

candles_data = data.pop('candles')
df = pd.DataFrame(candles_data)
# Attach the remaining top-level fields as constant columns
# (the original used dict.iteritems(), which is Python 2 only).
for k, v in data.items():
    df[k] = v
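Alternatively, a one-call sketch with pd.json_normalize (a top-level function since pandas 1.0), which repeats instrument and granularity on every candle row:

import json
import pandas as pd

data = json.loads(string)  # string is the raw JSON text, as above
df = pd.json_normalize(data, record_path='candles',
                       meta=['instrument', 'granularity'])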

How To Access Fields of nested dictionary in PyMongo?

My problem is a bit atypical. My Mongo instance's records appear as follows:
{
    "_id" : ObjectId("559670400084d37ea4cafa29"),
    "('7412791816', '3838144', '723031613')" : {
        "Customer_Loc_PinCode" : "110035",
        "Net_Delivery_Time" : 3,
        "Manifest_Date" : ISODate("2015-04-04T00:00:00Z"),
        "Shipping_Date" : ISODate("2015-04-05T00:00:00Z"),
        "Shipping_Method_Code" : "COD",
        "Origin_PinCode" : "382470",
        "Net_Manifest_Time" : 0,
        "Transition_State" : [
            [
                "DNE",
                "CTD",
                "NULL",
                "2015-04-05 15:23:22",
                "NULL"
            ],
            ... # many more such tuples present within this list
        ],
        "Net_Shipping_Time" : 2,
        "RTD_Date" : "NULL",
        "Delivery_Date" : ISODate("2015-04-07T00:00:00Z"),
        "Intervening_Distance" : 522.3881079330106,
        "Awb_Number" : "723031613",
        "SubOrder_Number" : "7412791816",
        "Last_Status" : "SHP",
        "Customer_LatLong" : [
            -, # some float value
            -  # some float value
        ],
        "Order_Date" : ISODate("2015-04-04T00:00:00Z"),
        "RTA_Date" : "NULL",
        "Return_Direction" : 0,
        "New_Status" : "DEL",
        "Origin_LatLong" : [
            -, # some float value
            -
        ],
        "Rec_ID" : "3838144",
        "RTU_Date" : "NULL"
    }
}
Now I need to obtain the dates and Net_Delivery_Time (as an example here) for all the records, for further processing (plotting).
The major obstacle, however, is that each such dictionary is referenced by a composite key, i.e. a tuple consisting of 3 fields, and each such key uniquely identifies the associated record. I wish to extract the required fields from each such dictionary, but I have no means of iterating through all the keys.
I tried an approach of first collecting all the keys and then retrieving the concerned fields, but that method didn't work, as there is no associated support for it in PyMongo.
If I were to use the db.collection_name.find() method, how would I craft the query? Can the uniqueness of each key present any potential problems? And what approach should I employ to achieve this task?
Thank You
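A minimal sketch of one possible approach: since the composite key names are not known in advance, iterate over the documents client-side and pull the fields out of whatever sub-dict each document holds (database and collection names below are hypothetical):

from pymongo import MongoClient

client = MongoClient()                 # assumed local instance
coll = client['mydb']['shipments']     # hypothetical database/collection names
rows = []
for doc in coll.find({}, {'_id': 0}):  # project away _id; the composite key remains
    for composite_key, record in doc.items():
        rows.append((record['Order_Date'], record['Net_Delivery_Time']))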

Turning Json object to an array

I have a JSON object that is written this way:
Data = [{
    "code" : "001",
    "city" : "Boston",
    "zipcode" : "067"
}, {
    "code" : "002",
    "city" : "NY",
}, {
    "city" : "NewJersey",
}]
In order to get a specific value, "code", into an array, I do it this way:
ar = Data["code"].Values
When I print the results I get:
ar = [001, 002, 'nan']
'nan' seems to be the empty value.
How can I get only the data that exists, without taking into consideration the entries where the "code" attribute is missing?
What you have shown us is not JSON. It's a list of dicts. If you want to get all the 'code' values out of those dicts, then you could do this:
[d['code'] for d in Data if 'code' in d]
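Applied to the Data above, that comprehension returns ['001', '002']: the values keep their string form from the dicts, and the entry without a "code" key is skipped entirely.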
