Working with JSON Nested in JSON Using Python/Pandas

I am trying to load JSON data using Python; however, it looks like this:
{
"instrument" : "EUR_USD",
"granularity" : "D",
"candles" : [
{
"time" : "2014-07-02T04:00:00.000000Z", // time in RFC3339 format
"openMid" : 1.36803,
"highMid" : 1.368125,
"lowMid" : 1.364275,
"closeMid" : 1.365315,
"volume" : 28242,
"complete" : true
},
{
"time" : "2014-07-03T04:00:00.000000Z", // time in RFC3339 format
"openMid" : 1.36532,
"highMid" : 1.366445,
"lowMid" : 1.35963,
"closeMid" : 1.3613,
"volume" : 30487,
"complete" : false
}
]
}
My problem is that when I load it using pandas, instrument, granularity, and candles are treated as the column titles. However, I want time, openMid, highMid, lowMid, closeMid, volume, and complete to become my columns; instead they are just treated as belonging to candles. Any ideas on how I can accomplish this? Thanks

You'll have to parse the string using the json library first:
import json
data = json.loads(string)
Then you can extract the candles data from the resulting dictionary and build your DataFrame from it, e.g.:
import pandas as pd

candles_data = data.pop('candles')
df = pd.DataFrame(candles_data)
for k, v in data.items():  # data.iteritems() on Python 2
    df[k] = v
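If you prefer a one-liner, pandas' json_normalize can do the same flattening; a minimal sketch, assuming pandas >= 1.0 (where json_normalize is a top-level function) and that the raw JSON text is in the variable string:
import json
import pandas as pd

data = json.loads(string)
# record_path explodes the nested candle records into rows;
# meta repeats the top-level fields alongside each row.
df = pd.json_normalize(data, record_path='candles',
                       meta=['instrument', 'granularity'])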

Related

How to extract from_dict to pandas dataframe with varying array length?

I have a dict that I need to convert to a pandas DataFrame. The dict contains arrays; if the arrays are all the same length it works fine, but arrays of different lengths throw a ValueError. My second question is that I need to use only a few key/value pairs from the dict.
This case works; as expected I get two rows:
my_dict = {
"ColA" : "No",
"ColB" : [
{
"ColB_a" : "2011-10-26T00:00:00Z",
"ColB_b" : 8.3
},
{
"ColB_a" : "2013-10-26T00:00:00Z",
"ColB_b" : 5.3
}
],
"ColC" : "Graduate",
"ColD" : [
{
"ColD_a" : 5436.0,
"ColD_b" : "RD"
},
{
"ColD_a" : 4658.0,
"ColD_b" : "DV"
}
],
"ColE" : "Work"
}
sa = pd.DataFrame(my_dict)
In this case ColB has only one value
my_dict = {
"ColA" : "No",
"ColB" : [
{
"ColB_a" : "2011-10-26T00:00:00Z",
"ColB_b" : 8.3
}
],
"ColC" : "Graduate",
"ColD" : [
{
"ColD_a" : 5436.0,
"ColD_b" : "RD"
},
{
"ColD_a" : 4658.0,
"ColD_b" : "DV"
}
],
"ColE" : "Work"
}
sa = pd.DataFrame(my_dict)
This throws ValueError: arrays must all be same length. How can this be fixed? The expected output is still two rows, with the missing ColB entries as NaN.
I can do
sa = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
but then I have to melt and join again.
Second question: if I need to choose only ColA and ColB from the dict to create the DataFrame, how should that be done?
For your second question, you can select a subset of columns from your dictionary using the 'columns' parameter. For example:
sa = pd.DataFrame(my_dict, columns = ['ColA', 'ColD'])
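For the first question, one possible approach (a sketch, not from the original answer) is to normalize each array-valued key separately and concatenate; pd.concat pads the shorter frame with NaN:
import pandas as pd

# Explode each nested record list into its own frame; meta repeats the
# scalar top-level keys on every row of the first frame.
df_b = pd.json_normalize(my_dict, record_path='ColB',
                         meta=['ColA', 'ColC', 'ColE'])
df_d = pd.json_normalize(my_dict, record_path='ColD')
sa = pd.concat([df_b, df_d], axis=1)  # mismatched lengths become NaN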

Complex json to pandas dataframe

There are lots of questions about JSON to pandas DataFrame, but none of them solved my issue. I am practicing on this complex JSON file, which looks like this
{
"type" : "FeatureCollection",
"features" : [ {
"Id" : 265068000,
"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ 22.170376666666666, 65.57273333333333 ]
},
"properties" : {
"timestampExternal" : 1529151039629
}
}, {
"Id" : 265745760,
"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ 20.329506666666667, 63.675425000000004 ]
},
"properties" : {
"timestampExternal" : 1529151278287
}
} ]
}
I want to convert this JSON directly to a pandas DataFrame using pd.read_json(). My primary goal is to extract Id, coordinates, and timestampExternal. As this is a very complex JSON, the normal pd.read_json() call simply doesn't give the correct output. Can you suggest how to approach this kind of situation? The expected output is something like this
Id,Coordinates,timestampExternal
265068000,[22.170376666666666, 65.57273333333333],1529151039629
265745760,[20.329506666666667, 63.675425000000004],1529151278287
You can read the JSON to load it into a dictionary. Then, using a list comprehension, extract the attributes you want as columns:
import json
import pandas as pd
with open('/path/to/json') as f:
    _json = json.load(f)

df_dict = [{'id': item['Id'],
            'coordinates': item['geometry']['coordinates'],
            'timestampExternal': item['properties']['timestampExternal']}
           for item in _json['features']]
extracted_df = pd.DataFrame(df_dict)
>>>
coordinates id timestampExternal
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287
You can read the JSON directly, and then give the features array to pandas as a list of dicts like:
Code:
import json
import pandas as pd

with open('test.json') as f:  # 'rU' mode is deprecated in Python 3
    data = json.load(f)

df = pd.DataFrame([dict(id=datum['Id'],
                        coords=datum['geometry']['coordinates'],
                        ts=datum['properties']['timestampExternal'])
                   for datum in data['features']])
print(df)
Results:
coords id ts
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287
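Alternatively, pd.json_normalize (top-level in pandas >= 1.0) can flatten each feature automatically, turning nested keys into dotted column names; a minimal sketch:
import json
import pandas as pd

with open('test.json') as f:
    data = json.load(f)

# Nested dicts become columns like 'geometry.coordinates';
# lists are kept as-is inside the cells.
df = pd.json_normalize(data['features'])
df = df[['Id', 'geometry.coordinates', 'properties.timestampExternal']]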

Pandas - set numeric value from a JSON as float with 4 digits precision

I am loading data from a JSON file whose values are floats like 13.4567.
The problem is that when I load it in pandas and print that field, I get scientific notation like 1.34567e+06.
Is there a way to set the field as a regular float with 4 digits of precision, as in the JSON source file, when I create the DataFrame? Or do I have to massage the data after creating the DataFrame?
This is the code used to load the JSON:
import json
import pandas as pd

with open('myjson.json') as f:
    loaded_file = json.load(f)
df = pd.json_normalize(loaded_file)  # json_normalize already returns a DataFrame
print(df.result)
The data file looks like this:
[
{
"info" : {
"timestamp" : "2018-01-22 00:04:00.637000",
"commit_hash" : "234234fdsfk3"
},
"action" : {
"time_sec" : 14.7584,
"diff" : 12345.375,
}
},
{
"info" : {
"timestamp" : "2018-01-22 01:04:00.543000",
"commit_hash" : "23eqdasfdsfk3"
},
"action" : {
"time_sec" : 12.3456,
"diff" : 543546.2234,
}
}
]
Perhaps this is what you are looking for:
pd.options.display.float_format = '{:.4f}'.format
or
pd.set_option('display.precision',4)
This tells pandas to change the display formatting of your floats.
If you really need to round the numbers themselves, you would have to round the columns, since importing will automatically detect the dtype as float, which comes with a default precision. Unless you need the formatted floats for something other than printing, which seems to be the case here, I would stick with display formatting.
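A minimal sketch of the difference, using made-up values: the display option only changes printing, while round() changes the stored values.
import pandas as pd

pd.options.display.float_format = '{:.4f}'.format
df = pd.DataFrame({'time_sec': [14.7584, 12.3456]})
print(df)  # printed with 4 digits after the decimal point

# Rounding changes the data itself, not just the display:
df['time_sec'] = df['time_sec'].round(4)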

Flattening nested JSON including an embedded array in Python using Pandas

I have a JSON-array from a mongoexport containing data from the Beddit sleeptracker. Below is an example of one of the truncated documents (removed some unneeded detail).
{
"user" : "xxx",
"provider" : "beddit",
"date" : ISODate("2016-11-30T23:00:00.000Z"),
"data" : [
{
"end_timestamp" : 1480570804.26226,
"properties" : {
"sleep_efficiency" : 0.8772404,
"resting_heart_rate" : 67.67578,
"short_term_resting_heart_rate" : 61.36963,
"activity_index" : 50.51958,
"average_respiration_rate" : 16.25667,
"total_sleep_score" : 64,
},
"date" : "2016-12-01",
"session_range_start" : 1480545636.55059,
"start_timestamp" : 1480545636.55059,
"session_range_end" : 1480570804.26226,
"tags" : [
"not_enough_sleep",
"long_sleep_latency"
],
"updated" : 1480570805.25201
}
],
"__v" : 0
}
Several related questions do not seem to work for the data structure above. As recommended in other related questions, I am trying to stay away from looping over each row for performance reasons (the full dataset is ~150MB). How would I flatten out the "data" key with json_normalize so that each key is at the top level? I would prefer one DataFrame where e.g. total_sleep_score is a column.
Any help is much appreciated! Even though I know how to 'prepare' the data using JavaScript, I would like to be able to understand and do it using Python.
edit (request from comment to show preferred structure):
{
"user" : "xxx",
"provider" : "beddit",
"date" : ISODate("2016-11-30T23:00:00.000Z"),
"end_timestamp" : 1480570804.26226,
"properties.sleep_efficiency" : 0.8772404,
"properties.resting_heart_rate" : 67.67578,
"properties.short_term_resting_heart_rate" : 61.36963,
"properties.activity_index" : 50.51958,
"properties.average_respiration_rate" : 16.25667,
"properties.total_sleep_score" : 64,
"date" : "2016-12-01",
"session_range_start" : 1480545636.55059,
"start_timestamp" : 1480545636.55059,
"session_range_end" : 1480570804.26226,
"updated" : 1480570805.25201,
"__v" : 0
}
The 'properties.' prefix is not necessary but would be nice.
Try this algorithm to flatten it:
def flattenPattern(pattern):
    newPattern = {}
    if type(pattern) is list:
        pattern = pattern[0]
    if type(pattern) is not str:
        for key, value in pattern.items():
            if type(value) in (list, dict):
                returnedData = flattenPattern(value)
                for i, j in returnedData.items():
                    # lift the contents of "data" to the top level;
                    # prefix everything else with its parent key
                    if key == "data":
                        newPattern[i] = j
                    else:
                        newPattern[key + "." + i] = j
            else:
                newPattern[key] = value
    return newPattern

print(flattenPattern(dictFromJson))
Output:
{
'session_range_start':1480545636.55059,
'start_timestamp':1480545636.55059,
'properties.average_respiration_rate':16.25667,
'session_range_end':1480570804.26226,
'properties.resting_heart_rate':67.67578,
'properties.short_term_resting_heart_rate':61.36963,
'updated':1480570805.25201,
'properties.total_sleep_score':64,
'properties.activity_index':50.51958,
'__v':0,
'user':'xxx',
'provider':'beddit',
'date':'2016-12-01',
'properties.sleep_efficiency':0.8772404,
'end_timestamp':1480570804.26226
}
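A hypothetical usage sketch (assuming the mongoexport has been saved as strict JSON, i.e. with the ISODate(...) wrapper replaced by a plain string, in a file named beddit.json):
import json

with open('beddit.json') as f:
    dictFromJson = json.load(f)

flat = flattenPattern(dictFromJson)
print(flat['properties.total_sleep_score'])  # 64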
Although not explicitly what I asked for, the following worked for me so far:
Step 1
Normalize the data records using json_normalize on the original dataset (not inside a pandas DataFrame), prefixing the new columns with 'data.'.
beddit_data = pd.io.json.json_normalize(beddit, record_path='data', record_prefix='data.', meta='_id')
Step 2
The resulting 'data.properties' column is a Series of dicts, so these can be expanded into columns with .apply(pd.Series):
beddit_data_properties = beddit_data['data.properties'].apply(pd.Series)
Step 3
The final step is to merge both DataFrames. In step 1, I kept meta='_id' so that the DataFrame can be merged with the original DataFrame from Beddit. I didn't include that merge in the final step yet because I first want to spend some time on the results so far.
beddit_final = pd.concat([beddit_data_properties[:], beddit_data[:]], axis=1)
If anyone is interested, I can share the final Jupyter Notebook when it is ready :)
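Note that in recent pandas (>= 1.0, with the top-level pd.json_normalize), steps 1 and 2 can likely be combined, since json_normalize flattens nested dicts inside the records by default; a sketch:
import pandas as pd

# One call: nested dicts in each record become dotted columns such as
# 'data.properties.total_sleep_score'.
beddit_data = pd.json_normalize(beddit, record_path='data',
                                record_prefix='data.', meta='_id')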

Apply $geoNear to find nearest latlong using aggregation in pymongo?

The documents in my collection look like this:
{'_id' : 'Delhi1', 'loc' : [28.34242,77.656565] }
{'_id' : 'Delhi2', 'loc' : [27.34242,78.626523] }
{'_id' : 'Delhi3', 'loc' : [25.34242,77.612345] }
{'_id' : 'Delhi4', 'loc' : [28.34242,77.676565] }
I want to apply aggregation using pymongo to find the relevant documents based on an input lat/long. I have created the index on 'loc'. Here is what I have done so far:
pipeline = [{'$geoNear':{'near': [27.8787, 78.2342],
'distanceField': "distance",
'maxDistance' : 2000 }}]
db['mycollection'].aggregate(pipeline)
But this is not working for me. How do I use this correctly?
Actually, I had created a '2dsphere' index on the collection, and to use $geoNear with a 2dsphere index you need to specify 'spherical': True in the pipeline:
pipeline = [{'$geoNear':{'near': [27.8787, 78.2342],
'distanceField': "distance",
'maxDistance' : 2000,
'spherical' : True }}]
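A more complete, runnable sketch of the same fix (the database and collection names here are hypothetical):
from pymongo import MongoClient, GEOSPHERE

client = MongoClient()
coll = client['mydb']['mycollection']
coll.create_index([('loc', GEOSPHERE)])  # the 2dsphere index on 'loc'

# Note: with legacy [x, y] coordinate pairs and spherical=True,
# maxDistance and the returned distances are in radians, not meters.
pipeline = [{'$geoNear': {'near': [27.8787, 78.2342],
                          'distanceField': 'distance',
                          'maxDistance': 2000,
                          'spherical': True}}]
for doc in coll.aggregate(pipeline):
    print(doc['_id'], doc['distance'])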
It looks like you have a few formatting mistakes (this answer uses mongo shell syntax): 1) neither the collection name nor the operators need brackets or quotes, 2) booleans like true are lowercase.
db.mycollection.aggregate([
{
$geoNear: {
near: { coordinates: [ 27.8787 , 78.2342 ] },
distanceField: "distance",
maxDistance: 2000,
spherical: true
}
}
])
