How to normalize uneven JSON structures in pandas? - python

I am using the Google Maps Distance Matrix API to get several distances from multiple origins. The API response comes in a JSON structured like:
{
"destination_addresses": [
"Destination 1",
"Destination 2",
"Destination 3"
],
"origin_addresses": [
"Origin 1",
"Origin 2"
],
"rows": [
{
"elements": [
{
"distance": {
"text": "8.7 km",
"value": 8687
},
"duration": {
"text": "19 mins",
"value": 1129
},
"status": "OK"
},
{
"distance": {
"text": "223 km",
"value": 222709
},
"duration": {
"text": "2 hours 42 mins",
"value": 9704
},
"status": "OK"
},
{
"distance": {
"text": "299 km",
"value": 299156
},
"duration": {
"text": "4 hours 17 mins",
"value": 15400
},
"status": "OK"
}
]
},
{
"elements": [
{
"distance": {
"text": "216 km",
"value": 215788
},
"duration": {
"text": "2 hours 44 mins",
"value": 9851
},
"status": "OK"
},
{
"distance": {
"text": "20.3 km",
"value": 20285
},
"duration": {
"text": "21 mins",
"value": 1283
},
"status": "OK"
},
{
"distance": {
"text": "210 km",
"value": 210299
},
"duration": {
"text": "2 hours 45 mins",
"value": 9879
},
"status": "OK"
}
]
}
],
"status": "OK"
}
Note that the rows array has the same number of elements as origin_addresses (2), while each elements array has the same number of elements as destination_addresses (3).
Is one able to use the pandas API to normalize everything inside rows while fetching the corresponding data from origin_addresses and destination_addresses?
The output should be:
status distance.text distance.value duration.text duration.value origin_addresses destination_addresses
0 OK 8.7 km 8687 19 mins 1129 Origin 1 Destination 1
1 OK 223 km 222709 2 hours 42 mins 9704 Origin 1 Destination 2
2 OK 299 km 299156 4 hours 17 mins 15400 Origin 1 Destination 3
3 OK 216 km 215788 2 hours 44 mins 9851 Origin 2 Destination 1
4 OK 20.3 km 20285 21 mins 1283 Origin 2 Destination 2
5 OK 210 km 210299 2 hours 45 mins 9879 Origin 2 Destination 3
If pandas does not provide a relatively simple way to do it, how would one accomplish this operation?

If data contains the dictionary from the question, you can try:
import pandas as pd
# one row per origin, then one row per (origin, destination) element
df = pd.DataFrame(data["rows"])
df["origin_addresses"] = data["origin_addresses"]
df = df.explode("elements")
# expand the nested dicts into columns
df = pd.concat([df.pop("elements").apply(pd.Series), df], axis=1)
df = pd.concat(
    [df.pop("distance").apply(pd.Series).add_prefix("distance."), df], axis=1
)
df = pd.concat(
    [df.pop("duration").apply(pd.Series).add_prefix("duration."), df], axis=1
)
# the destinations repeat in the same order for every origin
df["destination_addresses"] = data["destination_addresses"] * len(
    data["origin_addresses"]
)
print(df)
Prints:
duration.text duration.value distance.text distance.value status origin_addresses destination_addresses
0 19 mins 1129 8.7 km 8687 OK Origin 1 Destination 1
0 2 hours 42 mins 9704 223 km 222709 OK Origin 1 Destination 2
0 4 hours 17 mins 15400 299 km 299156 OK Origin 1 Destination 3
1 2 hours 44 mins 9851 216 km 215788 OK Origin 2 Destination 1
1 21 mins 1283 20.3 km 20285 OK Origin 2 Destination 2
1 2 hours 45 mins 9879 210 km 210299 OK Origin 2 Destination 3
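As an alternative sketch (not part of the answer above), pd.json_normalize with record_path can flatten rows in a single call, and the two address columns can then be filled in by position. This assumes every elements array holds exactly one entry per destination, in order, as the question states. The elem helper is hypothetical and only keeps the sample data short.

```python
import pandas as pd


def elem(dist_text, dist_value, dur_text, dur_value):
    """Hypothetical helper that builds one element of the API response."""
    return {
        "distance": {"text": dist_text, "value": dist_value},
        "duration": {"text": dur_text, "value": dur_value},
        "status": "OK",
    }


data = {
    "destination_addresses": ["Destination 1", "Destination 2", "Destination 3"],
    "origin_addresses": ["Origin 1", "Origin 2"],
    "rows": [
        {"elements": [elem("8.7 km", 8687, "19 mins", 1129),
                      elem("223 km", 222709, "2 hours 42 mins", 9704),
                      elem("299 km", 299156, "4 hours 17 mins", 15400)]},
        {"elements": [elem("216 km", 215788, "2 hours 44 mins", 9851),
                      elem("20.3 km", 20285, "21 mins", 1283),
                      elem("210 km", 210299, "2 hours 45 mins", 9879)]},
    ],
    "status": "OK",
}

origins = data["origin_addresses"]
destinations = data["destination_addresses"]

# One row per element; nested dicts become dotted column names.
flat = pd.json_normalize(data["rows"], record_path="elements")

# Rows come out origin by origin, so repeat each origin once per destination
# and cycle through the destinations once per origin.
flat["origin_addresses"] = [o for o in origins for _ in destinations]
flat["destination_addresses"] = destinations * len(origins)

print(flat)
```

Unlike the explode-based version, this yields a fresh 0..5 index, which matches the output requested in the question.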

Related

How to parse through json file and transform it into time series

I have this json file:
{
"walk": [
{
"date": "2021-01-10",
"duration": 301800,
"levels": {
"data": [
{
"timestamp": "2021-01-10T13:16:00.000",
"level": "slow",
"seconds": 360
},
{
"timestamp": "2021-01-10T13:22:00.000",
"level": "moderate",
"seconds": 2940
},
{
"dateTime": "2021-01-10T14:11:00.000",
"level": "fast",
"seconds": 300
and I want to parse it into a 1-minute time series, i.e. the first entry (360 seconds = 6 minutes) should become 6 data points with level "slow":
timestamp level
2021-01-10 13:16:00 slow
2021-01-10 13:17:00 slow
.......
2021-01-10 13:22:00 moderate
I have right now:
import json
import pandas as pd
with open('walks.json') as f:
    df = pd.json_normalize(json.load(f), record_path=['walk'])
but that returns levels nested in one cell for each day. How can I achieve this?
You need to nest the record_path levels (here data is the parsed JSON, i.e. the result of json.load(f)):
df = pd.json_normalize(data=data, record_path=["walk", ["levels", "data"]])
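The nested record_path gives one row per level segment. To get the 1-minute series the question asks for, each segment still has to be expanded into one row per minute. A minimal sketch, assuming every entry carries a "timestamp" key and that seconds is a whole number of minutes (the sample below keeps only the first two entries from the question):

```python
import pandas as pd

data = {
    "walk": [
        {
            "date": "2021-01-10",
            "duration": 301800,
            "levels": {
                "data": [
                    {"timestamp": "2021-01-10T13:16:00.000", "level": "slow", "seconds": 360},
                    {"timestamp": "2021-01-10T13:22:00.000", "level": "moderate", "seconds": 2940},
                ]
            },
        }
    ]
}

# One row per segment: timestamp, level, seconds.
segments = pd.json_normalize(data, record_path=["walk", ["levels", "data"]])
segments["timestamp"] = pd.to_datetime(segments["timestamp"])

# Expand every segment into one row per full minute it covers.
rows = []
for seg in segments.itertuples(index=False):
    minutes = int(seg.seconds) // 60
    for ts in pd.date_range(seg.timestamp, periods=minutes, freq="min"):
        rows.append({"timestamp": ts, "level": seg.level})

series = pd.DataFrame(rows)
print(series.head(8))
```

The 360-second segment produces the six rows 13:16 through 13:21, and the "moderate" rows start at 13:22.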

Python - Grab specific value from known key inside large json

I need to get just 2 entries from a very large JSON array. I don't know their positions in the array, but I do know key:value pairs of each entry I want to find, and I want another value from that entry.
In this example there are only 4 entries, but in the original there are over 1000, and I need only 2 of them, for which I know "name" and "symbol". I need to get the value of quotes->ASK->time.
x = requests.get('http://example.org/data.json')
parsed = x.json()
gettime= str(parsed[0]["quotes"]["ASK"]["time"])
print(gettime)
I know that I can get it that way, and then loop through that a thousand times, but that seems like an overkill for just 2 values. Is there a way to do something like parsed["symbol":"kalo"]["quotes"]["ASK"]["time"] which would give me kalo time without using a loop, without going through all thousand entries?
[
{
"id": "nem-cri",
"name": "nemlaaoo",
"symbol": "nem",
"rank": 27,
"owner": "marcel",
"quotes": {
"ASK": {
"price": 19429,
"time": 319250866,
"duration": 21
}
}
},
{
"id": "kalo-lo-leek",
"name": "kalowaaa",
"symbol": "kalo",
"rank": 122,
"owner": "daniel",
"quotes": {
"ASK": {
"price": 12928,
"time": 937282932,
"duration": 9
}
}
},
{
"id": "reewmaarwl",
"name": "reeqooow",
"symbol": "reeq",
"rank": 4,
"owner": "eric",
"quotes": {
"ASK": {
"price": 9989,
"time": 124288222,
"duration": 19
}
}
},
{
"id": "sharkooaksj",
"name": "sharkmaaa",
"symbol": "shark",
"rank": 22,
"owner": "eric",
"quotes": {
"ASK": {
"price": 11122,
"time": 482773882,
"duration": 22
}
}
}
]
If you are OK with using pandas I would just create a DataFrame.
import pandas as pd
df = pd.json_normalize(parsed)
print(df)
id name symbol rank owner quotes.ASK.price \
0 nem-cri nemlaaoo nem 27 marcel 19429
1 kalo-lo-leek kalowaaa kalo 122 daniel 12928
2 reewmaarwl reeqooow reeq 4 eric 9989
3 sharkooaksj sharkmaaa shark 22 eric 11122
quotes.ASK.time quotes.ASK.duration
0 319250866 21
1 937282932 9
2 124288222 19
3 482773882 22
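If you want the kalo value, filter on symbol and take the first match:
print(df.loc[df['symbol'] == 'kalo', 'quotes.ASK.price'].iloc[0]) # -> 12928
The time asked for in the question is in the 'quotes.ASK.time' column (937282932 for kalo).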
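Without pandas, a plain list cannot be addressed by a key:value pair, so the entries have to be scanned at least once. A common sketch is to build a dictionary keyed by symbol in one pass and then look entries up directly. This assumes symbols are unique; the sample is a shortened copy of the question's data.

```python
parsed = [
    {"name": "nemlaaoo", "symbol": "nem",
     "quotes": {"ASK": {"price": 19429, "time": 319250866}}},
    {"name": "kalowaaa", "symbol": "kalo",
     "quotes": {"ASK": {"price": 12928, "time": 937282932}}},
    {"name": "reeqooow", "symbol": "reeq",
     "quotes": {"ASK": {"price": 9989, "time": 124288222}}},
]

# Single pass over the list; later lookups are plain dict access.
by_symbol = {entry["symbol"]: entry for entry in parsed}

kalo_time = by_symbol["kalo"]["quotes"]["ASK"]["time"]
nem_time = by_symbol["nem"]["quotes"]["ASK"]["time"]
print(kalo_time, nem_time)
```

For a thousand entries the single pass is cheap, and it avoids pulling in pandas for two lookups.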

How to normalize the column which contains JSON in data frame and get a complete data frame

I have a pandas dataframe in which one column contains JSON data
Student_Id
V_Id
Json_result
32101
35
[{"q_id":"8007","q_text":"வேறுபட்ட பொம்மை எது?","q_img":"","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"105","MC":"1","LO Text":"ஓவியம் மற்றும் படங்களின் வெளிப்படையான மற்றும் மறைமுகமான கூறுகளை நுட்பமாக உற்று நோக்குதல்.","notes":"","isAnswered":"1","correctAnswerId":["1"],"isAnswerCorrect":"1","answer":"1"},{"q_id":"8008","q_text":"","q_img":"8008_Set_3.png","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"106","MC":"1","LO Text":"கதை, சூழல், நிகழ்வைத் தொடர்ச்சியான படங்கள் மற்றும் இவற்றில் இடம் பெறும் செயல்பாடுகள் பற்றி பேசுதல்.","notes":"(படம் பார்த்துக் கதையை மிகச் சரியாகக் கூறினால் 'சிறப்பு', சரியாகக் கூறினால் 'அருமை', கதையைக் கூறவில்லை என்றால் 'சிந்திக்க' என்பதைத் தன்னார்வலர் தேர்ந்தெடுக்கவும்.)","isAnswered":"1","correctAnswerId":["1","2"],"isAnswerCorrect":"","answer":"3"},{"q_id":"8009","q_text":"","q_img":"8009_Set_3.png","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"109","MC":"1","LO Text":"அச்சடிக்கப்பட்ட குறிப்பிட்ட எழுத்தை அடையாளம் காணுதல்.","notes":"","isAnswered":"1","correctAnswerId":["1"],"isAnswerCorrect":"1","answer":"1"}]
32102
35
[{"q_id":"8007","q_text":"வேறுபட்ட பொம்மை எது?","q_img":"","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"105","MC":"1","LO Text":"ஓவியம் மற்றும் படங்களின் வெளிப்படையான மற்றும் மறைமுகமான கூறுகளை நுட்பமாக உற்று நோக்குதல்.","notes":"","isAnswered":"1","correctAnswerId":["1"],"isAnswerCorrect":"1","answer":"1"},{"q_id":"8008","q_text":"","q_img":"8008_Set_3.png","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"106","MC":"1","LO Text":"கதை, சூழல், நிகழ்வைத் தொடர்ச்சியான படங்கள் மற்றும் இவற்றில் இடம் பெறும் செயல்பாடுகள் பற்றி பேசுதல்.","notes":"(படம் பார்த்துக் கதையை மிகச் சரியாகக் கூறினால் 'சிறப்பு', சரியாகக் கூறினால் 'அருமை', கதையைக் கூறவில்லை என்றால் 'சிந்திக்க' என்பதைத் தன்னார்வலர் தேர்ந்தெடுக்கவும்.)","isAnswered":"1","correctAnswerId":["1","2"],"isAnswerCorrect":"","answer":"3"},{"q_id":"8009","q_text":"","q_img":"8009_Set_3.png","subject":"Tamil","q_medium":"Tamil","Skill":"0","Class":"1std","LO ID":"109","MC":"1","LO Text":"அச்சடிக்கப்பட்ட குறிப்பிட்ட எழுத்தை அடையாளம் காணுதல்.","notes":"","isAnswered":"1","correctAnswerId":["1"],"isAnswerCorrect":"1","answer":"1"}]
I would like to normalize the JSON content in the Json_result column so that each JSON attribute becomes a column in the dataframe. There are more than 40k rows in the dataframe.
The JSON in a single row has the following form:
[
{
"q_id": "8007",
"q_text": "வேறுபட்ட பொம்மை எது?",
"q_img": "",
"subject": "Tamil",
"q_medium": "Tamil",
"Skill": "0",
"Class": "1std",
"LO ID": "105",
"MC": "1",
"LO Text": "ஓவியம் மற்றும் படங்களின் வெளிப்படையான மற்றும் மறைமுகமான கூறுகளை நுட்பமாக உற்று நோக்குதல்.",
"notes": "",
"isAnswered": "1",
"correctAnswerId": [
"1"
],
"isAnswerCorrect": "1",
"answer": "1"
},
{
"q_id": "8008",
"q_text": "",
"q_img": "8008_Set_3.png",
"subject": "Tamil",
"q_medium": "Tamil",
"Skill": "0",
"Class": "1std",
"LO ID": "106",
"MC": "1",
"LO Text": "கதை, சூழல், நிகழ்வைத் தொடர்ச்சியான படங்கள் மற்றும் இவற்றில் இடம் பெறும் செயல்பாடுகள் பற்றி பேசுதல்.",
"notes": "(படம் பார்த்துக் கதையை மிகச் சரியாகக் கூறினால் 'சிறப்பு', சரியாகக் கூறினால் 'அருமை', கதையைக் கூறவில்லை என்றால் 'சிந்திக்க' என்பதைத் தன்னார்வலர் தேர்ந்தெடுக்கவும்.)",
"isAnswered": "1",
"correctAnswerId": [
"1",
"2"
],
"isAnswerCorrect": "",
"answer": "3"
},
{
"q_id": "8009",
"q_text": "",
"q_img": "8009_Set_3.png",
"subject": "Tamil",
"q_medium": "Tamil",
"Skill": "0",
"Class": "1std",
"LO ID": "109",
"MC": "1",
"LO Text": "அச்சடிக்கப்பட்ட குறிப்பிட்ட எழுத்தை அடையாளம் காணுதல்.",
"notes": "",
"isAnswered": "1",
"correctAnswerId": [
"1"
],
"isAnswerCorrect": "1",
"answer": "1"
}
]
I want to link each student to the q_id entries in the JSON and get output like this:
Student_Id V_Id q_id subject q_medium Class LO_ID isAnswered correctAnswerId isAnswerCorrect answer
32101 35 8007 Tamil Tamil 1std 105 1 1 1 1
32101 35 8008 Tamil Tamil 1std 106 1 [1,2] - 3
32101 35 8009 Tamil Tamil 1std 109 1 1 1 1
32102 35 8007 Tamil Tamil 1std 105 1 1 1 1
32102 35 8008 Tamil Tamil 1std 106 1 [1,2] - 3
32102 35 8009 Tamil Tamil 1std 109 1 1 1 1
I want a dataframe like this for all 40k rows. How do I write this in Python?
You can start with df.explode() and then loop over the keys, using .apply(lambda) to pull the value of each key in Json_result into its own column, as shown in the example below:
import json
# parse each JSON string into a list of dicts, then put one dict on each row
df['Json_result'] = df['Json_result'].apply(lambda x: json.loads(x))
df = df.explode('Json_result')
keys = df['Json_result'].tolist()[0].keys() # get the list of keys from the first dict
for column in keys: # create a new column for each key by reading the value from the dict
    df[column] = df['Json_result'].apply(lambda x: x.get(column, None))
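The per-key loop can also be replaced by pd.json_normalize, which builds all columns in one call. The following is a sketch rather than a drop-in solution: the cell variable is an abbreviated stand-in for one Json_result value with only a few of the keys, and the column names are taken from the question.

```python
import json

import pandas as pd

# Abbreviated stand-in for one Json_result cell (only a few of the keys).
cell = json.dumps([
    {"q_id": "8007", "subject": "Tamil", "correctAnswerId": ["1"], "answer": "1"},
    {"q_id": "8008", "subject": "Tamil", "correctAnswerId": ["1", "2"], "answer": "3"},
])
df = pd.DataFrame(
    {"Student_Id": [32101, 32102], "V_Id": [35, 35], "Json_result": [cell, cell]}
)

# Parse the strings, put one question per row, and renumber the rows.
exploded = (
    df.assign(Json_result=df["Json_result"].map(json.loads))
    .explode("Json_result")
    .reset_index(drop=True)
)

# Turn the dicts into columns in one call and attach the id columns.
answers = pd.json_normalize(exploded["Json_result"].tolist())
result = pd.concat([exploded.drop(columns="Json_result"), answers], axis=1)
print(result)
```

Resetting the index before the concat matters: after explode the index contains duplicates, and the two frames are aligned on it.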

Formatting dataframe output into JSON records by group

My dataframe df looks like this:
count_arena_users count_users event timestamp
0 4458 12499 football 2017-04-30
1 2706 4605 cricket 2015-06-30
2 592 4176 tennis 2016-06-30
3 3427 10126 badminton 2017-05-31
4 717 2313 football 2016-03-31
5 101 155 hockey 2016-01-31
6 45923 191180 tennis 2015-12-31
7 1208 2824 badminton 2017-01-31
8 5577 8906 cricket 2016-02-29
9 111 205 football 2016-03-31
10 4 8 hockey 2017-09-30
The data is fetched from a psql database. Now I want to generate the output of "select * from tbl_arena" in JSON format, but the desired JSON has to look something like this:
[
{
"event": "football",
"data_to_plot": [
{
"count_arena_users": 717,
"count_users": 2313,
"timestamp": "2016-03-31"
},
{
"count_arena_users": 111,
"count_users": 205,
"timestamp": "2016-03-31"
},
{
"count_arena_users": 4458,
"count_users": 12499,
"timestamp": "2017-04-30"
}
]
},
{
"event": "cricket",
"data_to_plot": [
{
"count_arena_users": 2706,
"count_users": 4605,
"timestamp": "2015-06-30"
},
{
"count_arena_users": 5577,
"count_users": 8906,
"timestamp": "2016-02-29"
}
]
}
.
.
.
.
]
The values of all the columns are grouped by the event column, and the order of the sub-dictionaries within each group is determined by the timestamp column, i.e. earlier dates appear first and later dates below them.
I'm using python 3.x and json.dumps to format the data into json style.
A high-level process is as follows:
Aggregate all data with respect to events. We'd need a groupby + apply for this.
Convert the result to a series of records, one record for each event and its associated data. Use to_json with orient='records'.
df.groupby('event', sort=False)\
    .apply(lambda x: x.drop(columns='event').sort_values('timestamp').to_dict('records'))\
    .reset_index(name='data_to_plot')\
    .to_json(orient='records')
[
{
"event": "football",
"data_to_plot": [
{
"count_arena_users": 717,
"timestamp": "2016-03-31",
"count_users": 2313
},
{
"count_arena_users": 111,
"timestamp": "2016-03-31",
"count_users": 205
},
{
"count_arena_users": 4458,
"timestamp": "2017-04-30",
"count_users": 12499
}
]
},
...
]
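Since the question mentions json.dumps, the same grouping can also be sketched with a plain loop over the groups instead of groupby + apply. This assumes the timestamps are ISO-formatted strings, so that they sort chronologically and serialize as-is; datetime columns would need converting to strings first. The frame below is a subset of the question's rows.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "count_arena_users": [4458, 2706, 717, 5577, 111],
    "count_users": [12499, 4605, 2313, 8906, 205],
    "event": ["football", "cricket", "football", "cricket", "football"],
    "timestamp": ["2017-04-30", "2015-06-30", "2016-03-31", "2016-02-29", "2016-03-31"],
})

# One dict per event; each holds that event's rows ordered by timestamp.
records = [
    {
        "event": event,
        "data_to_plot": group.drop(columns="event")
        .sort_values("timestamp", kind="stable")
        .to_dict("records"),
    }
    for event, group in df.groupby("event", sort=False)
]

print(json.dumps(records, indent=2))
```

sort=False keeps the events in order of first appearance, and the stable sort keeps rows with equal timestamps in their original order.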

get value of nested lists and dictionaries of a json

I'm trying to get the value of 'description' and the first 'x', 'y' related to that description from a JSON file, so I used pandas.io.json.json_normalize and followed this example at the end of the page, but I get this error:
KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('description',))
How can I get the 'description' values "Play" and "Game" and the first 'x', 'y' related to each description, (0, 2) and (1, 2) respectively, from the following JSON and save the result as a data frame?
I edited the code and I want to get this as the result:
0 1 2 3
0 Play Game
1
2
3
4
but Game does not end up at the x, y position where it should be.
import pandas as pd
from pandas.io.json import json_normalize
data = [
{
"responses": [
{
"text": [
{
"description": "Play",
"bounding": {
"vertices": [
{
"x": 0,
"y": 2
},
{
"x": 513,
"y": -5
},
{
"x": 513,
"y": 73
},
{
"x": 438,
"y": 73
}
]
}
},
{
"description": "Game",
"bounding": {
"vertices": [
{
"x": 1,
"y": 2
},
{
"x": 307,
"y": 29
},
{
"x": 307,
"y": 55
},
{
"x": 201,
"y": 55
}
]
}
}
]
}
]
}
]
# w is the number of columns, h is the number of rows
w, h = 4, 5
Matrix = [[' ' for j in range(w)] for i in range(h)]
for row in data:
    for response in row["responses"]:
        for entry in response["text"]:
            Description = entry["description"]
            x = entry["bounding"]["vertices"][0]["x"]
            y = entry["bounding"]["vertices"][0]["y"]
            Matrix[x][y] = Description
df = pd.DataFrame(Matrix)
print(df)
You need to pass data[0]['responses'][0]['text'] to json_normalize, like this:
df = json_normalize(data[0]['responses'][0]['text'], [['bounding', 'vertices']], 'description')
With the data shown in the question, this results in:
x y description
0 0 2 Play
1 513 -5 Play
2 513 73 Play
3 438 73 Play
4 1 2 Game
5 307 29 Game
6 307 55 Game
7 201 55 Game
I hope this is what you are expecting.
EDIT:
df.groupby('description').get_group('Play').iloc[0]
will give you the first row of the group 'Play':
x 0
y 2
description Play
Name: 0, dtype: object
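If only the first vertex of each description is needed, it can also be picked out directly, without normalizing all vertices first. A minimal sketch, assuming every text entry has at least one vertex (the data below keeps only the first vertex of each entry from the question):

```python
import pandas as pd

data = [
    {
        "responses": [
            {
                "text": [
                    {"description": "Play",
                     "bounding": {"vertices": [{"x": 0, "y": 2}]}},
                    {"description": "Game",
                     "bounding": {"vertices": [{"x": 1, "y": 2}]}},
                ]
            }
        ]
    }
]

texts = data[0]["responses"][0]["text"]

# One row per description, carrying the x and y of its first vertex only.
first = pd.DataFrame(
    [{"description": t["description"], **t["bounding"]["vertices"][0]} for t in texts]
)
print(first)
```

This gives one row per description with columns description, x and y.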
