I am trying to read this JSON data:
{
"values": [
{
"1510122047": [
35.7,
256
]
},
{
"1510125000": [
41.7,
7
]
},
{
"1510129000": [
31.7,
0
]
}
]
}
and normalize it into a pandas DataFrame with one row per timestamp and columns timestamp, value_1, value_2.
I tried it with json_normalize, but I was not able to get the result I need.
Here is what I tried. It works, but it's not very efficient, and I would like to find a solution that uses pandas' built-in functions instead. I'd appreciate ideas!
import pandas
import json
s = """
{"values": [
{
"1510122047": [35.7, 256]
},
{
"1510125000": [41.7, 7]
},
{
"1510129000": [31.7, 0]
}
]}
"""
data = json.loads(s)
normalized_data = []
for value in data['values']:
    timestamp = list(value.keys())[0]
    normalized_data.append({'timestamp': timestamp, 'value_1': value[timestamp][0], 'value_2': value[timestamp][1]})
pandas.DataFrame(normalized_data)
Thanks
EDIT
Thanks for your suggestions. Unfortunately, none were faster than my original solution. Here is what I did to generate a bigger payload and test for speed:
I guess it's just the nature of JSON to be slow for this application.
import pandas
import json
import time
s1 = """{
"1510122047": [35.7, 256]
},
{
"1510125000": [41.7, 7]
},
{
"1510129000": [31.7, 0]
}"""
s = """
{"values": [
{
"1510122047": [35.7, 256]
},
{
"1510125000": [41.7, 7]
},
{
"1510129000": [31.7, 0]
},
""" + ",".join([s1]*1000000) + "]}"
data = json.loads(s)
tic = time.time()
normalized_data = []
for value in data['values']:
    timestamp = list(value.keys())[0]
    normalized_data.append({'timestamp': timestamp, 'value_1': value[timestamp][0], 'value_2': value[timestamp][1]})
print(time.time() - tic)
pandas.DataFrame(normalized_data)
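A minimal sketch of one more variant worth timing: skip the per-row dicts and feed plain tuples straight into the DataFrame constructor. Whether it actually beats the loop above depends on the payload, so treat it as an idea rather than a guaranteed speedup.
# Sketch: build (timestamp, value_1, value_2) tuples lazily and let
# the DataFrame constructor consume the generator in one go.
rows = (
    (ts, vals[0], vals[1])
    for item in data['values']
    for ts, vals in item.items()
)
df = pandas.DataFrame(rows, columns=['timestamp', 'value_1', 'value_2'])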
This is one approach, using a nested comprehension.
Ex:
import pandas as pd

df = pd.DataFrame([[key] + value for item in data['values']
                   for key, value in item.items()],
                  columns=["Timestamp", "Val_1", "Val_2"])
print(df)
Output:
Timestamp Val_1 Val_2
0 1510122047 35.7 256
1 1510125000 41.7 7
2 1510129000 31.7 0
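If you then want real datetimes rather than strings, and assuming the keys are Unix epoch seconds, you can convert the column afterwards:
# Assumption: the keys are Unix epoch seconds.
df["Timestamp"] = pd.to_datetime(df["Timestamp"].astype("int64"), unit="s")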
data = {'values': [{'1510122047': [35.7, 256]},
{'1510125000': [41.7, 7]},
{'1510129000': [31.7, 0]}]}
# json_normalize yields one column per timestamp; stack() drops the NaNs
# and moves the timestamps into the index (pd.json_normalize needs pandas >= 1.0)
dfn = pd.json_normalize(data, 'values').stack()
df_output = pd.DataFrame(dfn.tolist(), index=dfn.index)
df_output = df_output.reset_index().iloc[:, 1:]
# rename columns
df_output.columns = 'value_' + df_output.columns.astype(str)
df_output.rename(columns={'value_level_1':'timestamp'}, inplace=True)
print(df_output)
# timestamp value_0 value_1
# 0 1510122047 35.7 256
# 1 1510125000 41.7 7
# 2 1510129000 31.7 0
Related
I'm trying to extract data from dictionaries; here's an example with one dictionary, and what I have so far (probably not the greatest solution).
import pandas as pd

def common():
    ab = {
        "names": ["Brad", "Chad"],
        "org_name": "Leon",
        "missing": 0.3,
        "con": {
            "base": "abx",
            "conditions": {"func": "**", "ref": 0},
            "results": 4,
        },
        "change": [{"func": "++", "ref": 50, "res": 31},
                   {"func": "--", "ref": 22, "res": 11}]
    }
    data = []
    if "missing" in ab.keys():
        data.append(
            {
                "names": ab["names"],
                "org_name": ab["org_name"],
                "func": "missing",
                "ref": "",
                "res": ab["missing"],
            }
        )
    if "con" in ab.keys():
        data.append(
            {
                "names": ab["names"],
                "org_name": ab["con"]["base"],
                "func": ab["con"]["conditions"]["func"],
                "ref": ab["con"]["conditions"]["ref"],
                "res": ab["con"]["results"],
            }
        )
    df = pd.DataFrame(data)
    print(df)
    return df
Output:
names org_name func ref res
0 [Brad, Chad] Leon missing 0.3
1 [Brad, Chad] abx ** 0 4.0
What I would like the output to look like:
names org_name func ref res
0 [Brad, Chad] Leon missing 0.3
1 [Brad, Chad] abx ** 0 4
2 [Brad, Chad] Leon ++ 50 31
3 [Brad, Chad] Leon -- 22 11
The dictionaries can be different lengths, and ultimately a list of several dictionaries will be passed. I'm not sure how to repeat the names and org_name values based on the ref and res values... I don't want to keep adding rows one by one; a dynamic solution is always preferred.
Try:
import pandas as pd
ab={
"names": ["Brad", "Chad"],
"org_name": "Leon",
"missing": 0.3,
"con": {
"base": "abx",
"conditions": {"func": "**", "ref": 0},
"results": 4,
},
"change": [{"func": "++", "ref": 50, "res": 31},
{"func": "--", "ref": 22, "res": 11}]
}
out = []
if 'change' in ab:
    for ch in ab['change']:
        out.append({'names': ab['names'], 'org_name': ab['org_name'], **ch})
if 'con' in ab:
    out.append({'names': ab['names'], 'org_name': ab['con']['base'], **ab['con']['conditions'], 'res': ab['con']['results']})
if 'missing' in ab:
    out.append({'names': ab['names'], 'org_name': ab['org_name'], 'func': 'missing', 'res': ab['missing']})
print(pd.DataFrame(out).fillna(''))
Prints:
names org_name func ref res
0 [Brad, Chad] Leon ++ 50.0 31.0
1 [Brad, Chad] Leon -- 22.0 11.0
2 [Brad, Chad] abx ** 0.0 4.0
3 [Brad, Chad] Leon missing 0.3
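Since a list of several dictionaries will ultimately be passed, here is a minimal sketch of the same logic wrapped in a function and applied to a whole list (the name dicts_list is hypothetical):
def extract_rows(ab):
    # Same per-dict logic as above, collected and returned instead of printed.
    out = []
    if 'change' in ab:
        for ch in ab['change']:
            out.append({'names': ab['names'], 'org_name': ab['org_name'], **ch})
    if 'con' in ab:
        out.append({'names': ab['names'], 'org_name': ab['con']['base'],
                    **ab['con']['conditions'], 'res': ab['con']['results']})
    if 'missing' in ab:
        out.append({'names': ab['names'], 'org_name': ab['org_name'],
                    'func': 'missing', 'res': ab['missing']})
    return out

# dicts_list: a hypothetical list of dictionaries shaped like `ab`.
rows = [row for d in dicts_list for row in extract_rows(d)]
df = pd.DataFrame(rows).fillna('')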
My MongoDB document structure is as follows and some of the factors are NaN.
_id :ObjectId("5feddb959297bb2625db1450")
factors: Array
0:Object
factorId:"C24"
Index:0
weight:1
1:Object
factorId:"C25"
Index:1
weight:1
2:Object
factorId:"C26"
Index:2
weight:1
name:"Growth Led Momentum"
I want to convert it to a pandas DataFrame like this, using pymongo and pandas:
| name                | factorId | Index | weight |
|---------------------|----------|-------|--------|
| Growth Led Momentum | C24      | 0     | 1      |
| Growth Led Momentum | C25      | 1     | 1      |
| Growth Led Momentum | C26      | 2     | 1      |
Thank you
Update
I broke out the ol' Python to give this a crack, and the following code works flawlessly!
from pymongo import MongoClient
import pandas as pd
uri = "mongodb://<your_mongo_uri>:27017"
database_name = "<your_database_name>"
collection_name = "<your_collection_name>"
mongo_client = MongoClient(uri)
database = mongo_client[database_name]
collection = database[collection_name]
# I used this code to insert a doc into a test collection
# before querying (just in case you wanted to know lol)
"""
data = {
"_id": 1,
"name": "Growth Lead Momentum",
"factors": [
{
"factorId": "C24",
"index": 0,
"weight": 1
},
{
"factorId": "D74",
"index": 7,
"weight": 9
}
]
}
insert_result = collection.insert_one(data)
print(insert_result)
"""
# This is the query that
# answers your question
results = collection.aggregate([
{
"$unwind": "$factors"
},
{
"$project": {
"_id": 1, # Change to 0 if you wish to ignore "_id" field.
"name": 1,
"factorId": "$factors.factorId",
"index": "$factors.index",
"weight": "$factors.weight"
}
}
])
# This is how we turn the results into a DataFrame.
# We can simply pass `list(results)` into `DataFrame(..)`,
# due to how our query works.
results_as_dataframe = pd.DataFrame(list(results))
print(results_as_dataframe)
Which outputs:
_id name factorId index weight
0 1 Growth Lead Momentum C24 0 1
1 1 Growth Lead Momentum D74 7 9
Original Answer
You could use the aggregation pipeline to unwind factors and then project the fields you want.
Something like this should do the trick.
Database Structure
[
{
"_id": 1,
"name": "Growth Lead Momentum",
"factors": [
{
factorId: "C24",
index: 0,
weight: 1
},
{
factorId: "D74",
index: 7,
weight: 9
}
]
}
]
Query
db.collection.aggregate([
{
$unwind: "$factors"
},
{
$project: {
_id: 1,
name: 1,
factorId: "$factors.factorId",
index: "$factors.index",
weight: "$factors.weight"
}
}
])
Results
(.csv friendly)
[
{
"_id": 1,
"factorId": "C24",
"index": 0,
"name": "Growth Lead Momentum",
"weight": 1
},
{
"_id": 1,
"factorId": "D74",
"index": 7,
"name": "Growth Lead Momentum",
"weight": 9
}
]
Wonderful answer by Matt! In case you want to use pandas:
Use this after you have retrieved documents from db:
df = pd.json_normalize(data)
# Explode the list of factor dicts into one row each, expand every dict
# into columns, then join back the document-level fields and drop 'factors'.
df = df['factors'].explode().apply(pd.Series).join(df).drop(columns=['factors'])
Output:
factorId Index weight name
0 C24 0 1 Growth Led Momentum
0 C25 1 1 Growth Led Momentum
0 C26 2 1 Growth Led Momentum
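Note: json_normalize can also do the explode-and-join in a single call via its record_path and meta arguments, assuming data is the list of documents retrieved from the collection:
# record_path unwinds the factors array; meta carries the parent's name along.
df = pd.json_normalize(data, record_path='factors', meta=['name'])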
I'm trying to normalize similar sample data:
{
"2018-04-26 10:09:33": [
{
"user_id": "M8BE957ZA",
"ts": "2018-04-26 10:06:33",
"message": "Hello"
}
],
"2018-04-27 19:10:55": [
{
"user_id": "M5320QS1X",
"ts": "2018-04-27 19:10:55",
"message": "Thank you"
}
    ]
}
I know I can use json_normalize(data, '2018-04-26 10:09:33', record_prefix='') to create a table in pandas, but the date/time key keeps changing. How can I normalize it so I get the following? Any suggestions?
                     user_id    ts                   message
2018-04-26 10:09:33  M8BE957ZA  2018-04-26 10:06:33  Hello
2018-04-27 19:10:55  M5320QS1X  2018-04-27 19:10:55  Thank you
test = {
"2018-04-26 10:09:33": [
{
"user_id": "M8BE957ZA",
"ts": "2018-04-26 10:06:33",
"message": "Hello"
}
],
"2018-04-27 19:10:55": [
{
"user_id": "M5320QS1X",
"ts": "2018-04-27 19:10:55",
"message": "Thank you"
}
]}
df = pd.DataFrame(test).melt()
variable value
0 2018-04-26 10:09:33 {'user_id': 'M8BE957ZA', 'ts': '2018-04-26 10:...
1 2018-04-27 19:10:55 {'user_id': 'M5320QS1X', 'ts': '2018-04-27 19:...
Read your dict into a DataFrame, then melt it to get the structure above. Next you can use json_normalize on the value column, then rejoin it to the variable column like so:
df.join(json_normalize(df['value'])).drop(columns = 'value').rename(columns = {'variable':'date'})
date user_id ts message
0 2018-04-26 10:09:33 M8BE957ZA 2018-04-26 10:06:33 Hello
1 2018-04-27 19:10:55 M5320QS1X 2018-04-27 19:10:55 Thank you
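As a variant (just a sketch, not necessarily faster), you can normalize each date's record list separately and concatenate with the dates as keys:
# Normalize per date key, concatenate with the keys as an index level,
# then turn that unnamed level into a regular 'date' column.
frames = {date: pd.json_normalize(records) for date, records in test.items()}
df = pd.concat(frames).reset_index(level=0).rename(columns={'level_0': 'date'})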
I am trying to iterate over a list of unique column values to build a dictionary with three keys, each holding the matching records as dictionaries. This is the code I have now:
import pandas as pd
dataDict = {}
metrics = frontendFrame['METRIC'].unique()
for metric in metrics:
    dataDict[metric] = frontendFrame[frontendFrame['METRIC'] == metric].to_dict('records')
print(dataDict)
This works fine for small amounts of data, but as the amount of data increases it can take almost one second (!!!!).
I've tried groupby in pandas, which is even slower, and also map, but I don't want to collect things into a list. How can I iterate over this and build what I want faster? I am using Python 3.6.
UPDATE:
Input:
DATETIME METRIC ANOMALY VALUE
0 2018-02-27 17:30:32 SCORE 2.0 -1.0
1 2018-02-27 17:30:32 VALUE NaN 0.0
2 2018-02-27 17:30:32 INDEX NaN 6.6613381477499995E-16
3 2018-02-27 17:31:30 SCORE 2.0 -1.0
4 2018-02-27 17:31:30 VALUE NaN 0.0
5 2018-02-27 17:31:30 INDEX NaN 6.6613381477499995E-16
6 2018-02-27 17:32:30 SCORE 2.0 -1.0
7 2018-02-27 17:32:30 VALUE NaN 0.0
8 2018-02-27 17:32:30 INDEX NaN 6.6613381477499995E-16
Output:
{
"INDEX": [
{
"DATETIME": 1519759710000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
},
{
"DATETIME": 1519759770000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
}],
"SCORE": [
{
"DATETIME": 1519759710000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
}],
"VALUE": [
{
"DATETIME": 1519759710000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
}]
}
One possible solution, using collections.defaultdict (here as a dict comprehension used purely for its side effect):
from collections import defaultdict

a = defaultdict(list)
_ = {x['METRIC']: a[x['METRIC']].append(x) for x in frontendFrame.to_dict('records')}
a = dict(a)
Or the same thing as a plain loop:
from collections import defaultdict

a = defaultdict(list)
for x in frontendFrame.to_dict('records'):
    a[x['METRIC']].append(x)
a = dict(a)
Slow:
dataDict = frontendFrame.groupby('METRIC').apply(lambda x: x.to_dict('records')).to_dict()
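For completeness, a dict comprehension over groupby that skips apply is sketched below; it may or may not beat the defaultdict loop on your data:
# Iterate the groups directly instead of going through .apply.
dataDict = {metric: group.to_dict('records')
            for metric, group in frontendFrame.groupby('METRIC')}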
I'm trying to get the value of 'description' and the first 'x', 'y' pair related to that description from a JSON file, so I used pandas.io.json.json_normalize and followed this example at the end of the page, but I'm getting this error:
KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('description',))
How can I get the 'description' values "Play" and "Game" and the first 'x', 'y' pair related to each, (0, 2) and (1, 2) respectively, from the following JSON file, and save the result as a DataFrame?
I edited the code; this is the result I want:
0 1 2 3
0 Play Game
1
2
3
4
but 'Game' does not end up at the x, y position it should.
import pandas as pd
from pandas.io.json import json_normalize
data = [
{
"responses": [
{
"text": [
{
"description": "Play",
"bounding": {
"vertices": [
{
"x": 0,
"y": 2
},
{
"x": 513,
"y": -5
},
{
"x": 513,
"y": 73
},
{
"x": 438,
"y": 73
}
]
}
},
{
"description": "Game",
"bounding": {
"vertices": [
{
"x": 1,
"y": 2
},
{
"x": 307,
"y": 29
},
{
"x": 307,
"y": 55
},
{
"x": 201,
"y": 55
}
]
}
}
]
}
]
}
]
# w is columns, h is rows
w, h = 4, 5
Matrix = [[' ' for j in range(w)] for i in range(h)]
for row in data:
    for response in row["responses"]:
        for entry in response["text"]:
            Description = entry["description"]
            x = entry["bounding"]["vertices"][0]["x"]
            y = entry["bounding"]["vertices"][0]["y"]
            Matrix[x][y] = Description
df = pd.DataFrame(Matrix)
print(df)
You need to pass data[0]['responses'][0]['text'] to json_normalize, like this:
df = json_normalize(data[0]['responses'][0]['text'], [['bounding', 'vertices']], 'description')
which will result in
x y description
0 438 -5 Play
1 513 -5 Play
2 513 73 Play
3 438 73 Play
4 201 29 Game
5 307 29 Game
6 307 55 Game
7 201 55 Game
I hope this is what you are expecting.
EDIT:
df.groupby('description').get_group('Play').iloc[0]
will give you the first row of the 'Play' group:
x 438
y -5
description Play
Name: 0, dtype: object
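And if you want the first row of every description group at once, a small follow-up on the same DataFrame:
# First row per description group, keeping the original order.
df.groupby('description', sort=False).head(1)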