I'm trying to normalize sample data like this:
{
    "2018-04-26 10:09:33": [
        {
            "user_id": "M8BE957ZA",
            "ts": "2018-04-26 10:06:33",
            "message": "Hello"
        }
    ],
    "2018-04-27 19:10:55": [
        {
            "user_id": "M5320QS1X",
            "ts": "2018-04-27 19:10:55",
            "message": "Thank you"
        }
    ]
}
I know I can use json_normalize(data, '2018-04-26 10:09:33', record_prefix='') to create a table in pandas, but the date/time key keeps changing. How can I normalize it so I get the following? Any suggestions?
                     user_id    ts                   message
2018-04-26 10:09:33  M8BE957ZA  2018-04-26 10:06:33  Hello
2018-04-27 19:10:55  M5320QS1X  2018-04-27 19:10:55  Thank you
import pandas as pd

test = {
    "2018-04-26 10:09:33": [
        {
            "user_id": "M8BE957ZA",
            "ts": "2018-04-26 10:06:33",
            "message": "Hello"
        }
    ],
    "2018-04-27 19:10:55": [
        {
            "user_id": "M5320QS1X",
            "ts": "2018-04-27 19:10:55",
            "message": "Thank you"
        }
    ]
}
df = pd.DataFrame(test).melt()
variable value
0 2018-04-26 10:09:33 {'user_id': 'M8BE957ZA', 'ts': '2018-04-26 10:...
1 2018-04-27 19:10:55 {'user_id': 'M5320QS1X', 'ts': '2018-04-27 19:...
Read your dict into a DataFrame, then melt it to get the structure above. Next, run json_normalize on the value column and join the result back to the variable column like so:
df.join(pd.json_normalize(df['value'])).drop(columns='value').rename(columns={'variable': 'date'})
date user_id ts message
0 2018-04-26 10:09:33 M8BE957ZA 2018-04-26 10:06:33 Hello
1 2018-04-27 19:10:55 M5320QS1X 2018-04-27 19:10:55 Thank you
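As an aside, pd.DataFrame(test) only works here because every timestamp maps to the same number of records; if the lists have different lengths it raises a ValueError. A minimal sketch of a comprehension over the dict that handles the ragged case (same test dict as above):

import pandas as pd

# Flatten {date: [records]} into one row per record, keeping the outer key.
rows = [{"date": date, **record} for date, records in test.items() for record in records]
df = pd.DataFrame(rows)  # columns: date, user_id, ts, message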
I have this json file:
{
    "walk": [
        {
            "date": "2021-01-10",
            "duration": 301800,
            "levels": {
                "data": [
                    {
                        "timestamp": "2021-01-10T13:16:00.000",
                        "level": "slow",
                        "seconds": 360
                    },
                    {
                        "timestamp": "2021-01-10T13:22:00.000",
                        "level": "moderate",
                        "seconds": 2940
                    },
                    {
                        "timestamp": "2021-01-10T14:11:00.000",
                        "level": "fast",
                        "seconds": 300
                    }
                ]
            }
        }
    ]
}
and I want to parse it into a 1-minute-level time series, i.e. 360 seconds = 6 minutes, so 6 data points at level "slow":
timestamp level
2021-01-10 13:16:00 slow
2021-01-10 13:17:00 slow
.......
2021-01-10 13:22:00 moderate
I have this right now:
import json

import pandas as pd

with open('walks.json') as f:
    df = pd.json_normalize(json.load(f), record_path=['walk'])
but that returns levels nested in one cell for each day. How can I achieve this?
You need to nest the record_path levels:
df = pd.json_normalize(data=data, record_path=["walk", ["levels", "data"]])
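That flattens each walk into one row per level entry. To go from there to the 1-minute series the question asks for, here is a minimal sketch; it assumes the flattened columns are named timestamp, level, and seconds, and that every duration is a whole number of minutes:

import json

import pandas as pd

with open("walks.json") as f:
    data = json.load(f)

df = pd.json_normalize(data=data, record_path=["walk", ["levels", "data"]])
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Repeat each row once per minute of its duration...
minutes = df["seconds"] // 60
expanded = df.loc[df.index.repeat(minutes)]
# ...then shift each repeated timestamp by its position within the original row.
offsets = expanded.groupby(level=0).cumcount()
expanded = expanded.assign(timestamp=expanded["timestamp"] + pd.to_timedelta(offsets, unit="min"))

print(expanded[["timestamp", "level"]].reset_index(drop=True))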
I have a dataframe with these values:
df
name rank subject marks age
tom 123 math 25 10
mark 124 math 50 10
How can I insert the dataframe into MongoDB using pymongo, with the first two columns as regular fields and the other three nested inside a "scores" array?
{
    "_id": "507f1f77bcf86cd799439011",
    "name": "tom",
    "rank": "123",
    "scores": [{
        "subject": "math",
        "marks": 25,
        "age": 10
    }]
}
{
    "_id": "507f1f77bcf86cd799439012",
    "name": "mark",
    "rank": "124",
    "scores": [{
        "subject": "math",
        "marks": 50,
        "age": 10
    }]
}
I tried this:
convert_dict = df.to_dict("records")
mydb.school_data.insert_many(convert_dict)
I used this solution:
convert_dict = df.to_dict(orient="records")
mydb.school_data.insert_many(convert_dict)
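Note that to_dict(orient="records") inserts each row flat, so there is no nested "scores" array. A sketch of one way to build the nested documents first (df and the mydb collection are taken from the question):

records = [
    {
        "name": row["name"],
        "rank": row["rank"],
        # Fold the remaining three columns into a one-element "scores" array.
        "scores": [{"subject": row["subject"], "marks": row["marks"], "age": row["age"]}],
    }
    for row in df.to_dict(orient="records")
]
mydb.school_data.insert_many(records)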
I am trying to read this JSON data
{
    "values": [
        {
            "1510122047": [
                35.7,
                256
            ]
        },
        {
            "1510125000": [
                41.7,
                7
            ]
        },
        {
            "1510129000": [
                31.7,
                0
            ]
        }
    ]
}
and normalize it into a pandas data frame with one row per entry, i.e. a timestamp column plus the two values (timestamp, value_1, value_2).
I tried it with json_normalize but I was not able to get the result I need.
Here is what I tried, but it's not very efficient. I would like a solution that uses pandas' built-in functions to do this. I'd appreciate ideas!
import pandas
import json
s = """
{"values": [
{
"1510122047": [35.7, 256]
},
{
"1510125000": [41.7, 7]
},
{
"1510129000": [31.7, 0]
}
]}
"""
data = json.loads(s)
normalized_data = []
for value in data['values']:
    timestamp = list(value.keys())[0]
    normalized_data.append({'timestamp': timestamp, 'value_1': value[timestamp][0], 'value_2': value[timestamp][1]})
pandas.DataFrame(normalized_data)
Thanks
EDIT
Thanks for your suggestions. Unfortunately none were faster than my original solution. Here is what I did to generate a bigger payload and test for speed (I guess it's just the nature of JSON to be slow for this application):
import pandas
import json
import time
s1 = """{
"1510122047": [35.7, 256]
},
{
"1510125000": [41.7, 7]
},
{
"1510129000": [31.7, 0]
}"""
s = """
{"values": [
{
"1510122047": [35.7, 256]
},
{
"1510125000": [41.7, 7]
},
{
"1510129000": [31.7, 0]
},
""" + ",".join([s1]*1000000) + "]}"
data = json.loads(s)
tic = time.time()
normalized_data = []
for value in data['values']:
    timestamp = list(value.keys())[0]
    normalized_data.append({'timestamp': timestamp, 'value_1': value[timestamp][0], 'value_2': value[timestamp][1]})
print(time.time() - tic)
pandas.DataFrame(normalized_data)
Here is one approach using a nested list comprehension.
Ex:
df = pd.DataFrame(
    [[key] + value for item in data['values'] for key, value in item.items()],
    columns=["Timestamp", "Val_1", "Val_2"],
)
print(df)
Output:
Timestamp Val_1 Val_2
0 1510122047 35.7 256
1 1510125000 41.7 7
2 1510129000 31.7 0
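If the keys are Unix epoch seconds (they look like it), the Timestamp column can be converted to real datetimes afterwards; a small sketch under that assumption:

# Assumes the keys are Unix timestamps in seconds.
df["Timestamp"] = pd.to_datetime(df["Timestamp"].astype(int), unit="s")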
data = {'values': [{'1510122047': [35.7, 256]},
                   {'1510125000': [41.7, 7]},
                   {'1510129000': [31.7, 0]}]}
import pandas as pd

dfn = pd.json_normalize(data, 'values').stack()
df_output = pd.DataFrame(dfn.tolist(), index=dfn.index)
df_output = df_output.reset_index().iloc[:, 1:]
# rename columns
df_output.columns = 'value_' + df_output.columns.astype(str)
df_output.rename(columns={'value_level_1':'timestamp'}, inplace=True)
print(df_output)
# timestamp value_0 value_1
# 0 1510122047 35.7 256
# 1 1510125000 41.7 7
# 2 1510129000 31.7 0
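For completeness, a dict-comprehension variant that skips the explicit row loop, assuming the timestamps are unique across records:

import pandas as pd

# Merge the one-key dicts into a single {timestamp: [v1, v2]} mapping.
flat = {k: v for item in data["values"] for k, v in item.items()}
df = (
    pd.DataFrame.from_dict(flat, orient="index", columns=["value_1", "value_2"])
      .rename_axis("timestamp")
      .reset_index()
)
print(df)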
I am trying to iterate over a list of unique column values to build a dictionary with three keys, each holding a list of record dictionaries. This is the code I have now:
import pandas as pd
dataDict = {}
metrics = frontendFrame['METRIC'].unique()
for metric in metrics:
    dataDict[metric] = frontendFrame[frontendFrame['METRIC'] == metric].to_dict('records')
print(dataDict)
This works fine for small amounts of data, but as the amount of data grows it can take almost a second (!!!!).
I've tried groupby in pandas, which is even slower, and also map, but I don't want to collect the results into a list. How can I iterate over this and build what I want faster? I am using Python 3.6.
UPDATE:
Input:
DATETIME METRIC ANOMALY VALUE
0 2018-02-27 17:30:32 SCORE 2.0 -1.0
1 2018-02-27 17:30:32 VALUE NaN 0.0
2 2018-02-27 17:30:32 INDEX NaN 6.6613381477499995E-16
3 2018-02-27 17:31:30 SCORE 2.0 -1.0
4 2018-02-27 17:31:30 VALUE NaN 0.0
5 2018-02-27 17:31:30 INDEX NaN 6.6613381477499995E-16
6 2018-02-27 17:32:30 SCORE 2.0 -1.0
7 2018-02-27 17:32:30 VALUE NaN 0.0
8 2018-02-27 17:32:30 INDEX NaN 6.6613381477499995E-16
Output:
{
    "INDEX": [
        {
            "DATETIME": 1519759710000,
            "METRIC": "INDEX",
            "ANOMALY": null,
            "VALUE": "6.6613381477499995E-16"
        },
        {
            "DATETIME": 1519759770000,
            "METRIC": "INDEX",
            "ANOMALY": null,
            "VALUE": "6.6613381477499995E-16"
        }
    ],
    "SCORE": [
        {
            "DATETIME": 1519759710000,
            "METRIC": "SCORE",
            "ANOMALY": 2,
            "VALUE": "-1.0"
        },
        {
            "DATETIME": 1519759770000,
            "METRIC": "SCORE",
            "ANOMALY": 2,
            "VALUE": "-1.0"
        }
    ],
    "VALUE": [
        {
            "DATETIME": 1519759710000,
            "METRIC": "VALUE",
            "ANOMALY": null,
            "VALUE": "0.0"
        },
        {
            "DATETIME": 1519759770000,
            "METRIC": "VALUE",
            "ANOMALY": null,
            "VALUE": "0.0"
        }
    ]
}
One possible solution using collections.defaultdict:

from collections import defaultdict

a = defaultdict(list)
for x in frontendFrame.to_dict('records'):
    a[x['METRIC']].append(x)
a = dict(a)

The same grouping can be squeezed into a dict comprehension used for its side effects, although the explicit loop is clearer:

a = defaultdict(list)
_ = {x['METRIC']: a[x['METRIC']].append(x) for x in frontendFrame.to_dict('records')}
a = dict(a)

Slower, for comparison:

dataDict = frontendFrame.groupby('METRIC').apply(lambda x: x.to_dict('records')).to_dict()
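The desired output shows DATETIME as epoch milliseconds; one way to get that before building the dictionary, assuming the DATETIME column is parseable as datetimes:

# Convert datetimes to epoch milliseconds (nanoseconds // 10**6).
frontendFrame["DATETIME"] = (
    pd.to_datetime(frontendFrame["DATETIME"]).astype("int64") // 10**6
)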
My dataframe df looks like this:
count_arena_users count_users event timestamp
0 4458 12499 football 2017-04-30
1 2706 4605 cricket 2015-06-30
2 592 4176 tennis 2016-06-30
3 3427 10126 badminton 2017-05-31
4 717 2313 football 2016-03-31
5 101 155 hockey 2016-01-31
6 45923 191180 tennis 2015-12-31
7 1208 2824 badminton 2017-01-31
8 5577 8906 cricket 2016-02-29
9 111 205 football 2016-03-31
10 4 8 hockey 2017-09-30
The data is fetched from a PostgreSQL database. Now I want to generate the output of "select * from tbl_arena" in JSON format, but the desired JSON has to look something like this:
[
    {
        "event": "football",
        "data_to_plot": [
            {
                "count_arena_users": 717,
                "count_users": 2313,
                "timestamp": "2016-03-31"
            },
            {
                "count_arena_users": 111,
                "count_users": 205,
                "timestamp": "2016-03-31"
            },
            {
                "count_arena_users": 4458,
                "count_users": 12499,
                "timestamp": "2017-04-30"
            }
        ]
    },
    {
        "event": "cricket",
        "data_to_plot": [
            {
                "count_arena_users": 2706,
                "count_users": 4605,
                "timestamp": "2015-06-30"
            },
            {
                "count_arena_users": 5577,
                "count_users": 8906,
                "timestamp": "2016-02-29"
            }
        ]
    },
    ...
]
The values of all the columns are grouped by the event column, and the order of the sub-dictionaries is decided by the timestamp column, i.e. earlier dates appear first and newer/latest dates below them.
I'm using Python 3.x and json.dumps to format the data as JSON.
A high-level process is as follows:
Aggregate all data with respect to events. We'd need a groupby + apply for this.
Convert the result to a series of records, one record for each event and its associated data, using to_json with orient='records'.
df.groupby('event', sort=False)\
  .apply(lambda x: x.drop(columns='event').sort_values('timestamp').to_dict('records'))\
  .reset_index(name='data_to_plot')\
  .to_json(orient='records')
[
    {
        "event": "football",
        "data_to_plot": [
            {
                "count_arena_users": 717,
                "timestamp": "2016-03-31",
                "count_users": 2313
            },
            {
                "count_arena_users": 111,
                "timestamp": "2016-03-31",
                "count_users": 205
            },
            {
                "count_arena_users": 4458,
                "timestamp": "2017-04-30",
                "count_users": 12499
            }
        ]
    },
    ...
]
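If you need a Python object rather than a JSON string (for example to pretty-print it with json.dumps, as the question mentions), you can round-trip through json.loads; a sketch:

import json

json_str = (
    df.groupby("event", sort=False)
      .apply(lambda x: x.drop(columns="event").sort_values("timestamp").to_dict("records"))
      .reset_index(name="data_to_plot")
      .to_json(orient="records")
)
result = json.loads(json_str)
print(json.dumps(result, indent=4))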