My dataframe df looks like this:
count_arena_users count_users event timestamp
0 4458 12499 football 2017-04-30
1 2706 4605 cricket 2015-06-30
2 592 4176 tennis 2016-06-30
3 3427 10126 badminton 2017-05-31
4 717 2313 football 2016-03-31
5 101 155 hockey 2016-01-31
6 45923 191180 tennis 2015-12-31
7 1208 2824 badminton 2017-01-31
8 5577 8906 cricket 2016-02-29
9 111 205 football 2016-03-31
10 4 8 hockey 2017-09-30
The data is fetched from a PostgreSQL database. Now I want to generate the output of "select * from tbl_arena" in JSON format, but the desired JSON has to look something like this:
[
{
"event": "football",
"data_to_plot": [
{
"count_arena_users": 717,
"count_users": 2313,
"timestamp": "2016-03-31"
},
{
"count_arena_users": 111,
"count_users": 205,
"timestamp": "2016-03-31"
},
{
"count_arena_users": 4458,
"count_users": 12499,
"timestamp": "2017-04-30"
}
]
},
{
"event": "cricket",
"data_to_plot": [
{
"count_arena_users": 2706,
"count_users": 4605,
"timestamp": "2015-06-30"
},
{
"count_arena_users": 5577,
"count_users": 8906,
"timestamp": "2016-02-29"
}
]
}
.
.
.
.
]
The values of all the columns are grouped based on the event column, and the order of the sub-dictionaries within each group is decided by the timestamp column, i.e. earlier dates appear first and newer/later dates appear below them.
I'm using Python 3.x and json.dumps to format the data as JSON.
A high-level process is as follows:
Aggregate all data with respect to events. We'd need a groupby + apply for this.
Convert the result to a series of records, one record for each event and its associated data, using to_json with orient='records'.
df.groupby('event', sort=False)\
  .apply(lambda x: x.drop(columns='event').sort_values('timestamp').to_dict('records'))\
  .reset_index(name='data_to_plot')\
  .to_json(orient='records')
[
{
"event": "football",
"data_to_plot": [
{
"count_arena_users": 717,
"timestamp": "2016-03-31",
"count_users": 2313
},
{
"count_arena_users": 111,
"timestamp": "2016-03-31",
"count_users": 205
},
{
"count_arena_users": 4458,
"timestamp": "2017-04-30",
"count_users": 12499
}
]
},
...
]
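Note that to_json returns a compact, single-line JSON string; the output above is shown pretty-printed for readability. Assuming result holds the string returned by the chain above, a round-trip through the json module produces the indented form:
import json

# re-parse the compact string and dump it with indentation
print(json.dumps(json.loads(result), indent=4))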
I have this json file:
{
"walk": [
{
"date": "2021-01-10",
"duration": 301800,
"levels": {
"data": [
{
"timestamp": "2021-01-10T13:16:00.000",
"level": "slow",
"seconds": 360
},
{
"timestamp": "2021-01-10T13:22:00.000",
"level": "moderate",
"seconds": 2940
},
{
"dateTime": "2021-01-10T14:11:00.000",
"level": "fast",
"seconds": 300
and I want to parse through this so that it becomes 1-minute-level time series data (i.e. 6 data points (360 seconds = 6 minutes) with level "slow"):
timestamp level
2021-01-10 13:16:00 slow
2021-01-10 13:17:00 slow
.......
2021-01-10 13:22:00 moderate
What I have right now:
import json
import pandas as pd

with open('walks.json') as f:
    df = pd.json_normalize(json.load(f), record_path=['walk'])
but that returns levels nested in one cell for each day. How can I achieve this?
You need to nest the record_path levels:
df = pd.json_normalize(data=data, record_path=["walk", ["levels", "data"]])
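From there, one way to get the 1-minute series is to repeat each record's level once per minute of its duration. A minimal sketch, assuming the records consistently use the timestamp key (the sample above mixes timestamp and dateTime) and that seconds is always a multiple of 60:
import json
import pandas as pd

with open('walks.json') as f:
    df = pd.json_normalize(json.load(f), record_path=["walk", ["levels", "data"]])

df['timestamp'] = pd.to_datetime(df['timestamp'])
# build one row per minute, starting at each record's timestamp
pieces = [
    pd.DataFrame({
        'timestamp': pd.date_range(row.timestamp, periods=int(row.seconds // 60), freq='min'),
        'level': row.level,
    })
    for row in df.itertuples()
]
series = pd.concat(pieces, ignore_index=True)
print(series)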
I need to get just 2 entries from inside a very large JSON object. I don't know the array position, but I do know key:value pairs of the entries I want to find, and I want another value from each of those entries.
In this example there are only 4 entries, but in the original there are over 1000, and I need only 2 entries, for each of which I know "name" and "symbol". I need to get the value of quotes->ASK->time.
import requests

x = requests.get('http://example.org/data.json')
parsed = x.json()
gettime = str(parsed[0]["quotes"]["ASK"]["time"])
print(gettime)
I know that I can get it that way and then loop through it a thousand times, but that seems like overkill for just 2 values. Is there a way to do something like parsed["symbol":"kalo"]["quotes"]["ASK"]["time"], which would give me the kalo time without using a loop, i.e. without going through all thousand entries?
[
{
"id": "nem-cri",
"name": "nemlaaoo",
"symbol": "nem",
"rank": 27,
"owner": "marcel",
"quotes": {
"ASK": {
"price": 19429,
"time": 319250866,
"duration": 21
}
}
},
{
"id": "kalo-lo-leek",
"name": "kalowaaa",
"symbol": "kalo",
"rank": 122,
"owner": "daniel",
"quotes": {
"ASK": {
"price": 12928,
"time": 937282932,
"duration": 09
}
}
},
{
"id": "reewmaarwl",
"name": "reeqooow",
"symbol": "reeq",
"rank": 4,
"owner": "eric",
"quotes": {
"ASK": {
"price": 9989,
"time": 124288222,
"duration": 19
}
}
},
{
"id": "sharkooaksj",
"name": "sharkmaaa",
"symbol": "shark",
"rank": 22,
"owner": "eric",
"quotes": {
"ASK": {
"price": 11122,
"time": 482773882,
"duration": 22
}
}
}
]
If you are OK with using pandas I would just create a DataFrame.
import pandas as pd
df = pd.json_normalize(parsed)
print(df)
id name symbol rank owner quotes.ASK.price \
0 nem-cri nemlaaoo nem 27 marcel 19429
1 kalo-lo-leek kalowaaa kalo 122 daniel 12928
2 reewmaarwl reeqooow reeq 4 eric 9989
3 sharkooaksj sharkmaaa shark 22 eric 11122
quotes.ASK.time quotes.ASK.duration
0 319250866 21
1 937282932 9
2 124288222 19
3 482773882 22
If you want the time for kalo, then:
print(df.loc[df['symbol'] == 'kalo', 'quotes.ASK.time'].iloc[0]) # -> 937282932
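If pulling in pandas feels heavy for two lookups, a plain-Python alternative is to index the list by symbol once and then read values directly. This still walks the list once under the hood, but only once, no matter how many symbols you look up (sketch assumes symbols are unique):
# one pass to build a symbol -> entry index, then O(1) lookups
by_symbol = {entry['symbol']: entry for entry in parsed}
print(by_symbol['kalo']['quotes']['ASK']['time'])   # -> 937282932
print(by_symbol['shark']['quotes']['ASK']['time'])  # -> 482773882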
I am using the Google Maps Distance Matrix API to get several distances from multiple origins. The API response comes in a JSON structured like:
{
"destination_addresses": [
"Destination 1",
"Destination 2",
"Destination 3"
],
"origin_addresses": [
"Origin 1",
"Origin 2"
],
"rows": [
{
"elements": [
{
"distance": {
"text": "8.7 km",
"value": 8687
},
"duration": {
"text": "19 mins",
"value": 1129
},
"status": "OK"
},
{
"distance": {
"text": "223 km",
"value": 222709
},
"duration": {
"text": "2 hours 42 mins",
"value": 9704
},
"status": "OK"
},
{
"distance": {
"text": "299 km",
"value": 299156
},
"duration": {
"text": "4 hours 17 mins",
"value": 15400
},
"status": "OK"
}
]
},
{
"elements": [
{
"distance": {
"text": "216 km",
"value": 215788
},
"duration": {
"text": "2 hours 44 mins",
"value": 9851
},
"status": "OK"
},
{
"distance": {
"text": "20.3 km",
"value": 20285
},
"duration": {
"text": "21 mins",
"value": 1283
},
"status": "OK"
},
{
"distance": {
"text": "210 km",
"value": 210299
},
"duration": {
"text": "2 hours 45 mins",
"value": 9879
},
"status": "OK"
}
]
}
],
"status": "OK"
}
Note that the rows array has the same number of elements as origin_addresses (2), while each elements array has the same number of elements as destination_addresses (3).
Is one able to use the pandas API to normalize everything inside rows while fetching the corresponding data from origin_addresses and destination_addresses?
The output should be:
status distance.text distance.value duration.text duration.value origin_addresses destination_addresses
0 OK 8.7 km 8687 19 mins 1129 Origin 1 Destination 1
1 OK 223 km 222709 2 hours 42 mins 9704 Origin 1 Destination 2
2 OK 299 km 299156 4 hours 17 mins 15400 Origin 1 Destination 3
3 OK 216 km 215788 2 hours 44 mins 9851 Origin 2 Destination 1
4 OK 20.3 km 20285 21 mins 1283 Origin 2 Destination 2
5 OK 210 km 210299 2 hours 45 mins 9879 Origin 2 Destination 3
If pandas does not provide a relatively simple way to do it, how would one accomplish this operation?
If data contains the dictionary from the question you can try:
df = pd.DataFrame(data["rows"])
df["origin_addresses"] = data["origin_addresses"]
df = df.explode("elements")
df = pd.concat([df.pop("elements").apply(pd.Series), df], axis=1)
df = pd.concat(
    [df.pop("distance").apply(pd.Series).add_prefix("distance."), df], axis=1
)
df = pd.concat(
    [df.pop("duration").apply(pd.Series).add_prefix("duration."), df], axis=1
)
df["destination_addresses"] = data["destination_addresses"] * len(
    data["origin_addresses"]
)
print(df)
Prints:
duration.text duration.value distance.text distance.value status origin_addresses destination_addresses
0 19 mins 1129 8.7 km 8687 OK Origin 1 Destination 1
0 2 hours 42 mins 9704 223 km 222709 OK Origin 1 Destination 2
0 4 hours 17 mins 15400 299 km 299156 OK Origin 1 Destination 3
1 2 hours 44 mins 9851 216 km 215788 OK Origin 2 Destination 1
1 21 mins 1283 20.3 km 20285 OK Origin 2 Destination 2
1 2 hours 45 mins 9879 210 km 210299 OK Origin 2 Destination 3
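Alternatively, a sketch using json_normalize with a nested record_path (again assuming data holds the parsed response dict); the address columns are filled in afterwards by exploiting the origin-major order of the flattened rows:
import numpy as np
import pandas as pd

# one row per (origin, destination) pair, with distance/duration dicts flattened
df = pd.json_normalize(data, record_path=["rows", "elements"])
# rows come out origin-major: all destinations for Origin 1, then for Origin 2, ...
df["origin_addresses"] = np.repeat(
    data["origin_addresses"], len(data["destination_addresses"])
)
df["destination_addresses"] = data["destination_addresses"] * len(
    data["origin_addresses"]
)
print(df)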
I'm having trouble grouping samples by hour. The data structure looks like this:
data = [
{
"pressure": "1009.7",
"timestamp": "2019-09-03 08:03:00"
},
{
"pressure": "1009.7",
"timestamp": "2019-09-03 08:18:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 08:33:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 08:56:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 09:03:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 09:18:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 09:33:00"
},
{
"pressure": "1009.7",
"timestamp": "2019-09-03 09:56:00"
},
{
"pressure": "1009.6",
"timestamp": "2019-09-03 10:03:00"
}
]
As you can see, there are 4 measurements of pressure every hour, and I would like to calculate the average value per hour. I've tried achieving this with Pandas, but no luck. What I tried was to extract the start and end timestamps, round them to the full hour, and then pass them to the DataFrame as the index with the JSON as data, but there's a shape mismatch (no wonder). I thought I would be able to pass it to the df like this and later calculate the mean, but it looks like there should be some intermediate step.
If your JSON mimics the above, then we can pass it into a dataframe:
import pandas as pd

df = pd.DataFrame.from_dict(data)
pressure timestamp
0 1009.7 2019-09-03 08:03:00
1 1009.7 2019-09-03 08:18:00
2 1009.8 2019-09-03 08:33:00
3 1009.8 2019-09-03 08:56:00
4 1009.8 2019-09-03 09:03:00
5 1009.8 2019-09-03 09:18:00
6 1009.8 2019-09-03 09:33:00
7 1009.7 2019-09-03 09:56:00
8 1009.6 2019-09-03 10:03:00
then just group by the hour and take the mean of the pressure:
hourly_avg = df.groupby(df['timestamp'].dt.hour)['pressure'].mean()
print(hourly_avg)
timestamp
8 1009.750
9 1009.775
10 1009.600
Name: pressure, dtype: float64
Note: you'll first need to make the timestamp a proper datetime and the pressure a floating-point value.
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['pressure'] = df['pressure'].astype(float)
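One caveat: grouping on dt.hour pools the same hour-of-day across different days. If the data can span more than one day, grouping on the timestamp floored to the hour keeps the days apart, e.g.:
# one group per calendar hour rather than per hour-of-day
hourly_avg = df.groupby(df['timestamp'].dt.floor('h'))['pressure'].mean()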
I would approach the problem by creating a new dictionary with the date/hour as a key and the pressures as a list (the value of the dictionary).
d = {}
for _dict in data:
    key = _dict['timestamp'][:13]  # 2019-09-03 08, etc.
    d.setdefault(key, []).append(float(_dict['pressure']))

for key, array in d.items():
    print(key, format(sum(array) / len(array), '.3f'))
Prints:
2019-09-03 08 1009.750
2019-09-03 09 1009.775
2019-09-03 10 1009.600
Check this:
import pandas as pd

df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S')
df['pressure'] = df['pressure'].astype(float)
df['hour'] = df['timestamp'].dt.hour
pressure = df.groupby([df['hour']])['pressure'].mean()
print(pressure)
Output:
hour
8     1009.750
9     1009.775
10    1009.600
Name: pressure, dtype: float64
I am trying to iterate over a list of unique column values to create three different keys, each holding a list of dictionaries, inside one dictionary. This is the code I have now:
import pandas as pd
dataDict = {}
metrics = frontendFrame['METRIC'].unique()
for metric in metrics:
    dataDict[metric] = frontendFrame[frontendFrame['METRIC'] == metric].to_dict('records')
print(dataDict)
This works fine for small amounts of data, but as the amount of data increases it can take almost one second (!!!!).
I've tried groupby in pandas, which is even slower, and also map, but I don't want to return things to a list. How can I iterate over this and create what I want in a faster way? I am using Python 3.6.
UPDATE:
Input:
DATETIME METRIC ANOMALY VALUE
0 2018-02-27 17:30:32 SCORE 2.0 -1.0
1 2018-02-27 17:30:32 VALUE NaN 0.0
2 2018-02-27 17:30:32 INDEX NaN 6.6613381477499995E-16
3 2018-02-27 17:31:30 SCORE 2.0 -1.0
4 2018-02-27 17:31:30 VALUE NaN 0.0
5 2018-02-27 17:31:30 INDEX NaN 6.6613381477499995E-16
6 2018-02-27 17:32:30 SCORE 2.0 -1.0
7 2018-02-27 17:32:30 VALUE NaN 0.0
8 2018-02-27 17:32:30 INDEX NaN 6.6613381477499995E-16
Output:
{
"INDEX": [
{
"DATETIME": 1519759710000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
},
{
"DATETIME": 1519759770000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
}],
"SCORE": [
{
"DATETIME": 1519759710000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
}],
"VALUE": [
{
"DATETIME": 1519759710000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
}]
}
One possible solution:
from collections import defaultdict

a = defaultdict(list)
# dict comprehension used only for its append side effect
_ = {x['METRIC']: a[x['METRIC']].append(x) for x in frontendFrame.to_dict('records')}
a = dict(a)
Or, the same thing written as an explicit loop:
from collections import defaultdict

a = defaultdict(list)
for x in frontendFrame.to_dict('records'):
    a[x['METRIC']].append(x)
a = dict(a)
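This is faster mainly because frontendFrame.to_dict('records') converts the frame to Python objects in a single pass, whereas the original code builds a boolean mask and filters the entire frame once per unique metric.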
Slow:
dataDict = frontendFrame.groupby('METRIC').apply(lambda x: x.to_dict('records')).to_dict()