Complex json to pandas dataframe - python

There are lots of questions about converting JSON to a pandas DataFrame, but none of them solved my issue. I am practicing on this complex JSON file, which looks like this:
{
  "type" : "FeatureCollection",
  "features" : [ {
    "Id" : 265068000,
    "type" : "Feature",
    "geometry" : {
      "type" : "Point",
      "coordinates" : [ 22.170376666666666, 65.57273333333333 ]
    },
    "properties" : {
      "timestampExternal" : 1529151039629
    }
  }, {
    "Id" : 265745760,
    "type" : "Feature",
    "geometry" : {
      "type" : "Point",
      "coordinates" : [ 20.329506666666667, 63.675425000000004 ]
    },
    "properties" : {
      "timestampExternal" : 1529151278287
    }
  } ]
}
I want to convert this JSON directly to a pandas DataFrame using pd.read_json(). My primary goal is to extract Id, Coordinates and timestampExternal. As this is a fairly complex JSON, the normal pd.read_json() approach simply doesn't give the correct output. Can you suggest how I can approach this kind of situation? The expected output is something like this:
Id,Coordinates,timestampExternal
265068000,[22.170376666666666, 65.57273333333333],1529151039629
265745760,[20.329506666666667, 63.675425000000004],1529151278287

You can read the JSON to load it into a dictionary. Then, using a list comprehension, extract the attributes you want as columns -
import json
import pandas as pd

_json = json.load(open('/path/to/json'))
df_dict = [{'id': item['Id'],
            'coordinates': item['geometry']['coordinates'],
            'timestampExternal': item['properties']['timestampExternal']}
           for item in _json['features']]
extracted_df = pd.DataFrame(df_dict)
>>>
coordinates id timestampExternal
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287
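If you are on pandas 1.0 or newer, json_normalize can also do the flattening for you; this is a minimal sketch assuming the field names shown in the question and the same file path as above:
import json
import pandas as pd

# Hedged sketch: json_normalize flattens nested dicts, so the nested fields
# come out as 'geometry.coordinates' and 'properties.timestampExternal' columns.
with open('/path/to/json') as f:
    _json = json.load(f)

flat = pd.json_normalize(_json['features'])
extracted_df = flat[['Id', 'geometry.coordinates', 'properties.timestampExternal']]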

You can read the JSON directly and then give the features array to pandas as a list of dicts, like:
Code:
import json
import pandas as pd

with open('test.json') as f:
    data = json.load(f)

df = pd.DataFrame([dict(id=datum['Id'],
                        coords=datum['geometry']['coordinates'],
                        ts=datum['properties']['timestampExternal'])
                   for datum in data['features']])
print(df)
Results:
coords id ts
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287

Related

Excel to nested Json including child elements into array

I am trying to convert an Excel file to nested JSON using Python, where the repeated values go in as an array of elements.
Ex: structure of CSV
Manufacturer,oilType,viscosity
shell,superOil,1ova
shell,superOil,2ova
shell,normalOil,1ova
bp, power, 10bba
It should be displayed in JSON (expected output) as:
elements: [
  {
    "Manufacturer": "shell",
    "details": [
      {
        "OilType": "superOil",
        "Viscosity": [
          "1ova",
          "2ova"
        ]
      },
      {
        "OilType": "normalOil",
        "Viscosity": [
          "1ova"
        ]
      }
    ]
  },
  {
    "Manufacturer": "bp",
    "details": [
      {
        "OilType": "power",
        "Viscosity": [
          "10bba"
        ]
      }
    ]
  }
]
I have currently converted the CSV into JSON using openpyxl, and the values are displayed for each of the headers in a format like this (current output):
[{Manufacturer: "shell", oilType: "superOil", Viscosity:"1ova"},{...},{...},...]
Please help in getting the expected output.
Hi and welcome to StackOverflow.
Your question actually has nothing to do with openpyxl, because you don't need to save to an Excel file.
What you can do, though, is:
1. Load the CSV (or Excel file) into a pandas DataFrame
2. Group by Manufacturer and oilType
3. Reshape into the nested format you want
4. Transform it to JSON (either a string or a file)
In practice, that gives something like this:
import json
import pandas as pd

df = pd.read_csv("oil.csv")  # or read_excel if this is an Excel file
oils = df.groupby(["Manufacturer", "oilType"]).aggregate(pd.Series.to_list)
elements = [
    {
        "Manufacturer": manufacturer,
        "Details": [
            {"OilType": o, "Viscosity": v}
            for o, v in data.droplevel(0).viscosity.items()
        ],
    }
    for manufacturer, data in oils.groupby(level="Manufacturer")
]
with open("oil.json", "w") as f:
    json.dump({"elements": elements}, f)
For information, oils would look like this:
                           viscosity
Manufacturer oilType
bp           power           [10bba]
shell        normalOil        [1ova]
             superOil   [1ova, 2ova]
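If you want to eyeball the result against the expected output before writing the file, the same dictionary can be pretty-printed first:
# Optional check: pretty-print the nested structure built above
print(json.dumps({"elements": elements}, indent=2))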

Pyspark JSON array of objects into columns

I'm ingesting JSON files into Spark and I have come across an object like the one below in the nested JSON from the file:
"data": {
"key1" :"v1"
"key2" : [
{"nk1" :"nv1"},
{"nk2" :"nv2" },
{"nk3" :"nv3" }
]
}
After reading it in Spark, it changes into the format below:
"data": {
"key1" :"v1"
"key2" : [
{"nk1" :"nv1", "nk2" :null, "nk3" :null},
{"nk1" :null, "nk2" :"nv2", "nk3" :null},
{"nk1" :null, "nk2" :null, "nk3" :"nv3"}
]
}
I need them as columns in the Spark DataFrame:
"key1"  "nk1"  "nk2"  "nk3"
"v1"    "kv1"  "kv2"  "kv3"
Please help me with any solution for this. I'm thinking of converting this to a string and using regex. Is there a better solution?
You can explode the array and pivot key2:
import pyspark.sql.functions as F

df2 = df.select(
    F.col('data.key1').alias('key1'),
    F.explode('data.key2').alias('key2')
).select(
    'key1',
    F.map_keys('key2')[0].alias('key'),
    F.map_values('key2')[0].alias('val')
).groupBy('key1').pivot('key').agg(F.first('val'))

df2.show()
+----+---+---+---+
|key1|nk1|nk2|nk3|
+----+---+---+---+
| v1|nv1|nv2|nv3|
+----+---+---+---+
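The map_keys/map_values approach above assumes key2 was read as an array of maps. If Spark instead inferred it as an array of structs (which the null-padded output in the question suggests), a hedged alternative sketch under that assumption is to explode and keep the first non-null value per field:
import pyspark.sql.functions as F

# Sketch assuming data.key2 is array<struct<nk1,nk2,nk3>>: each exploded row
# carries exactly one non-null field, so first(..., ignorenulls=True) collapses
# them back into a single row per key1.
df3 = (
    df.select(F.col('data.key1').alias('key1'), F.explode('data.key2').alias('k2'))
      .groupBy('key1')
      .agg(
          F.first('k2.nk1', ignorenulls=True).alias('nk1'),
          F.first('k2.nk2', ignorenulls=True).alias('nk2'),
          F.first('k2.nk3', ignorenulls=True).alias('nk3'),
      )
)
df3.show()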

How to extract from_dict to pandas dataframe with varying array length?

I have a dict that I need to convert to a pandas DataFrame. The dict contains arrays; if the arrays are of the same length it works fine, but arrays of different lengths throw a ValueError. My second question is that I need to access only a few key/value pairs from the dict.
This case works; as expected, I get two rows:
my_dict = {
    "ColA": "No",
    "ColB": [
        {
            "ColB_a": "2011-10-26T00:00:00Z",
            "ColB_b": 8.3
        },
        {
            "ColB_a": "2013-10-26T00:00:00Z",
            "ColB_b": 5.3
        }
    ],
    "ColC": "Graduate",
    "ColD": [
        {
            "ColD_a": 5436.0,
            "ColD_b": "RD"
        },
        {
            "ColD_a": 4658.0,
            "ColD_b": "DV"
        }
    ],
    "ColE": "Work"
}
sa = pd.DataFrame(my_dict)
In this case ColB has only one value
my_dict = {
    "ColA": "No",
    "ColB": [
        {
            "ColB_a": "2011-10-26T00:00:00Z",
            "ColB_b": 8.3
        }
    ],
    "ColC": "Graduate",
    "ColD": [
        {
            "ColD_a": 5436.0,
            "ColD_b": "RD"
        },
        {
            "ColD_a": 4658.0,
            "ColD_b": "DV"
        }
    ],
    "ColE": "Work"
}
sa = pd.DataFrame(my_dict)
This throws ValueError: arrays must all be same length. How can this be fixed?
I can do
sa = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
But I have to melt and join again.
Second question: if I need to choose only ColA and ColB from the dict to create the DataFrame, how can this be done?
For your second question, you could select a couple of columns from your dictionary using the 'columns' parameter. For example:
sa = pd.DataFrame(my_dict, columns = ['ColA', 'ColD'])
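For the first question (arrays of different lengths), one hedged sketch is to wrap each value in a Series so that pandas aligns the columns by index and pads the shorter ones with NaN:
# Minimal sketch: scalars become length-1 Series, lists keep their length,
# and the DataFrame constructor pads missing positions with NaN.
sa = pd.DataFrame({k: pd.Series(v) for k, v in my_dict.items()})

# Combining both questions, the comprehension can be filtered to just ColA and ColB:
sub = pd.DataFrame({k: pd.Series(v) for k, v in my_dict.items() if k in ('ColA', 'ColB')})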

Flattening nested JSON including embedded array in Python using Pandas

I have a JSON-array from a mongoexport containing data from the Beddit sleeptracker. Below is an example of one of the truncated documents (removed some unneeded detail).
{
    "user" : "xxx",
    "provider" : "beddit",
    "date" : ISODate("2016-11-30T23:00:00.000Z"),
    "data" : [
        {
            "end_timestamp" : 1480570804.26226,
            "properties" : {
                "sleep_efficiency" : 0.8772404,
                "resting_heart_rate" : 67.67578,
                "short_term_resting_heart_rate" : 61.36963,
                "activity_index" : 50.51958,
                "average_respiration_rate" : 16.25667,
                "total_sleep_score" : 64
            },
            "date" : "2016-12-01",
            "session_range_start" : 1480545636.55059,
            "start_timestamp" : 1480545636.55059,
            "session_range_end" : 1480570804.26226,
            "tags" : [
                "not_enough_sleep",
                "long_sleep_latency"
            ],
            "updated" : 1480570805.25201
        }
    ],
    "__v" : 0
}
Several related questions like this and this do not seem to work for the data structure above. As recommended in other related questions I am trying to stay away from looping over each row for performance reasons (the full dataset is ~150MB). How would I flatten out the "data"-key with json_normalize so that each key is at the top-level? I would prefer one DataFrame where e.g. total_sleep_score is a column.
Any help is much appreciated! Even though I know how to 'prepare' the data using JavaScript, I would like to be able to understand and do it using Python.
edit (request from comment to show preferred structure):
{
    "user" : "xxx",
    "provider" : "beddit",
    "date" : ISODate("2016-11-30T23:00:00.000Z"),
    "end_timestamp" : 1480570804.26226,
    "properties.sleep_efficiency" : 0.8772404,
    "properties.resting_heart_rate" : 67.67578,
    "properties.short_term_resting_heart_rate" : 61.36963,
    "properties.activity_index" : 50.51958,
    "properties.average_respiration_rate" : 16.25667,
    "properties.total_sleep_score" : 64,
    "date" : "2016-12-01",
    "session_range_start" : 1480545636.55059,
    "start_timestamp" : 1480545636.55059,
    "session_range_end" : 1480570804.26226,
    "updated" : 1480570805.25201,
    "__v" : 0
}
The 'properties' prefix is not necessary but would be nice.
Try this algorithm to flatten it:
def flattenPattern(pattern):
    newPattern = {}
    if type(pattern) is list:
        pattern = pattern[0]

    if type(pattern) is not str:
        for key, value in pattern.items():
            if type(value) in (list, dict):
                returnedData = flattenPattern(value)
                for i, j in returnedData.items():
                    if key == "data":
                        newPattern[i] = j
                    else:
                        newPattern[key + "." + i] = j
            else:
                newPattern[key] = value
    return newPattern
print(flattenPattern(dictFromJson))
Output:
{
    'session_range_start': 1480545636.55059,
    'start_timestamp': 1480545636.55059,
    'properties.average_respiration_rate': 16.25667,
    'session_range_end': 1480570804.26226,
    'properties.resting_heart_rate': 67.67578,
    'properties.short_term_resting_heart_rate': 61.36963,
    'updated': 1480570805.25201,
    'properties.total_sleep_score': 64,
    'properties.activity_index': 50.51958,
    '__v': 0,
    'user': 'xxx',
    'provider': 'beddit',
    'date': '2016-12-01',
    'properties.sleep_efficiency': 0.8772404,
    'end_timestamp': 1480570804.26226
}
Although not explicitly what I asked for, the following worked for me so far:
Step 1
Normalize the data record using json_normalize on the original dataset (not inside a Pandas DataFrame) and prefix the data.
beddit_data = pd.io.json.json_normalize(beddit, record_path='data', record_prefix='data.', meta='_id')
Step 2
The 'data.properties' column was a Series of dicts, so these can be expanded with .apply(pd.Series):
beddit_data_properties = beddit_data['data.properties'].apply(pd.Series)
Step 3
The final step is to merge both DataFrames. In step 1, I kept meta='_id' so that the DataFrame can be merged with the original DataFrame from Beddit. I haven't included that in the final step yet because I still want to spend some time on the results so far.
beddit_final = pd.concat([beddit_data_properties[:], beddit_data[:]], axis=1)
If anyone is interested, I can share the final Jupyter Notebook when it is ready :)
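As a side note (not part of the original answer): on pandas 1.0+, steps 1 and 2 can usually be collapsed into a single call, because json_normalize also flattens dicts nested inside each record. A minimal sketch, assuming beddit is the parsed list of documents from the mongoexport:
import pandas as pd

# Hedged sketch: the nested 'properties' dict becomes columns such as
# 'properties.total_sleep_score'; 'user' and 'provider' are repeated on each row.
flat = pd.json_normalize(beddit, record_path='data', meta=['user', 'provider'])
Note that the top-level 'date' field clashes with the per-record 'date', so it would need a meta_prefix (or a rename) to be carried along as well.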

Working with JSON Nested in JSON Using Python/Pandas

I am trying to load JSON data using Python; however, it looks like this:
{
    "instrument" : "EUR_USD",
    "granularity" : "D",
    "candles" : [
        {
            "time" : "2014-07-02T04:00:00.000000Z",  // time in RFC3339 format
            "openMid" : 1.36803,
            "highMid" : 1.368125,
            "lowMid" : 1.364275,
            "closeMid" : 1.365315,
            "volume" : 28242,
            "complete" : true
        },
        {
            "time" : "2014-07-03T04:00:00.000000Z",  // time in RFC3339 format
            "openMid" : 1.36532,
            "highMid" : 1.366445,
            "lowMid" : 1.35963,
            "closeMid" : 1.3613,
            "volume" : 30487,
            "complete" : false
        }
    ]
}
My problem is that when I load it using pandas, instrument, granularity, and candles are processed as the column titles. However, I want to use time, openMid, highMid, lowMid, closeMid, volume, and complete to create my columns; they are just processed as belonging to candles. Any ideas on how I can accomplish this? Thanks
You'll have to read the string using the json library first:
import json
data = json.loads(string)
And then you can extract the candles data from the resulting dictionary and build your DataFrame that way, e.g.:
import pandas as pd

candles_data = data.pop('candles')
df = pd.DataFrame(candles_data)
for k, v in data.items():  # data.iteritems() on Python 2
    df[k] = v
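If you want instrument and granularity attached in one step, a hedged alternative on pandas 1.0+ (assuming the string has already been parsed into data as above) is:
# Sketch: one row per candle, with the top-level fields repeated on each row.
df = pd.json_normalize(data, record_path='candles', meta=['instrument', 'granularity'])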
