I'm trying to extract data from dictionaries, here's an example for one dictionary. Here's what I have so far (probably not the greatest solution).
def common():
    """Flatten the nested dict ``ab`` into row dicts and return a DataFrame.

    Each optional top-level key ('missing', 'con') contributes one row with
    the columns names / org_name / func / ref / res.

    Returns:
        pd.DataFrame: one row per extracted entry (also printed as a
        side effect).
    """
    ab = {
        "names": ["Brad", "Chad"],
        "org_name": "Leon",
        "missing": 0.3,
        "con": {
            "base": "abx",
            "conditions": {"func": "**", "ref": 0},
            "results": 4,
        },
        "change": [{"func": "++", "ref": 50, "res": 31},
                   {"func": "--", "ref": 22, "res": 11}]
    }
    data = []
    # Scalar 'missing' value: there is no ref, the value itself is the result.
    if "missing" in ab:  # `in ab` tests keys directly; `.keys()` was redundant
        data.append(
            {
                "names": ab["names"],
                "org_name": ab["org_name"],
                "func": "missing",
                "ref": "",
                "res": ab["missing"],
            }
        )
    # Nested 'con' entry: org_name comes from its 'base' field, func/ref from
    # the inner 'conditions' dict.
    if "con" in ab:
        data.append(
            {
                "names": ab["names"],
                "org_name": ab["con"]["base"],
                "func": ab["con"]["conditions"]["func"],
                "ref": ab["con"]["conditions"]["ref"],
                "res": ab["con"]["results"],
            }
        )
    # NOTE(review): the 'change' list is deliberately not handled here --
    # that is exactly what the question asks how to do.
    df = pd.DataFrame(data)
    print(df)
    return df
Output:
names org_name func ref res
0 [Brad, Chad] Leon missing 0.3
1 [Brad, Chad] abx ** 0 4.0
What I would like the output to look like:
names org_name func ref res
0 [Brad, Chad] Leon missing 0.3
1 [Brad, Chad] abx ** 0 4
2 [Brad, Chad] Leon ++ 50 31
3 [Brad, Chad] Leon -- 22 11
The dictionaries can be different length, ultimately a list of several dictionaries will be passed. I'm not sure how to repeat the names and org_name values based on the ref and res values... I don't want to keep adding row by row, dynamic solution is always preferred.
Try:
import pandas as pd

# Same sample document as in the question.
ab = {
    "names": ["Brad", "Chad"],
    "org_name": "Leon",
    "missing": 0.3,
    "con": {
        "base": "abx",
        "conditions": {"func": "**", "ref": 0},
        "results": 4,
    },
    "change": [{"func": "++", "ref": 50, "res": 31},
               {"func": "--", "ref": 22, "res": 11}]
}

# Build one flat row-dict per logical entry; pandas fills the gaps with NaN.
out = []
if 'change' in ab:
    # Each change item already has func/ref/res -- just prepend the context.
    out.extend(
        {'names': ab['names'], 'org_name': ab['org_name'], **ch}
        for ch in ab['change']
    )
if 'con' in ab:
    con = ab['con']
    row = {'names': ab['names'], 'org_name': con['base']}
    row.update(con['conditions'])   # contributes 'func' and 'ref'
    row['res'] = con['results']
    out.append(row)
if 'missing' in ab:
    # No 'ref' here; fillna('') below blanks that cell in the output.
    out.append({'names': ab['names'], 'org_name': ab['org_name'],
                'func': 'missing', 'res': ab['missing']})

print(pd.DataFrame(out).fillna(''))
Prints:
names org_name func ref res
0 [Brad, Chad] Leon ++ 50.0 31.0
1 [Brad, Chad] Leon -- 22.0 11.0
2 [Brad, Chad] abx ** 0.0 4.0
3 [Brad, Chad] Leon missing 0.3
Related
I have this json file:
{
"walk": [
{
"date": "2021-01-10",
"duration": 301800,
"levels": {
"data": [
{
"timestamp": "2021-01-10T13:16:00.000",
"level": "slow",
"seconds": 360
},
{
"timestamp": "2021-01-10T13:22:00.000",
"level": "moderate",
"seconds": 2940
},
{
"dateTime": "2021-01-10T14:11:00.000",
"level": "fast",
"seconds": 300
and I want to parse through this such that it is a 1-minute-level time series (i.e., 360 seconds = 6 minutes, so the first entry yields 6 one-minute data points at level "slow"):
timestamp level
2021-01-10 13:16:00 slow
2021-01-10 13:17:00 slow
.......
2021-01-10 13:22:00 moderate
I have right now:
# Load the JSON file and flatten the top-level "walk" records into a DataFrame.
# NOTE(review): assumes `json` and `pd` are already imported and that
# 'walks.json' exists in the working directory. This only expands one level --
# the nested levels.data lists stay as single cells (the problem described).
with open('walks.json') as f:
    df = pd.json_normalize(json.load(f),
                           record_path=['walk']
                           )
but that returns levels nested in one cell for each day. How can I achieve this?
You need to nest the record_path levels
# Nested record_path: first explode each "walk" entry, then its levels.data
# list, yielding one row per timestamp/level/seconds record. `data` is the
# parsed JSON document (e.g. json.load(f)).
df = pd.json_normalize(data=data, record_path=["walk", ["levels", "data"]])
I need to get just 2 entries inside a very large json object, I don't know the array position, but I do know key:value pairs of the entry I want to find and where I want another value from this entry.
In this example there are only 4 examples, but in the original there are over 1000, and I need only 2 entries of which I do know "name" and "symbol" each. I need to get the value of quotes->ASK->time.
# Fetch the JSON array and pull quotes->ASK->time from the FIRST entry.
# NOTE(review): indexing with [0] only works when the array position is
# already known -- which is exactly what the question says is not the case.
x = requests.get('http://example.org/data.json')
parsed = x.json()
gettime= str(parsed[0]["quotes"]["ASK"]["time"])
print(gettime)
I know that I can get it that way, and then loop through that a thousand times, but that seems like an overkill for just 2 values. Is there a way to do something like parsed["symbol":"kalo"]["quotes"]["ASK"]["time"] which would give me kalo time without using a loop, without going through all thousand entries?
[
{
"id": "nem-cri",
"name": "nemlaaoo",
"symbol": "nem",
"rank": 27,
"owner": "marcel",
"quotes": {
"ASK": {
"price": 19429,
"time": 319250866,
"duration": 21
}
}
},
{
"id": "kalo-lo-leek",
"name": "kalowaaa",
"symbol": "kalo",
"rank": 122,
"owner": "daniel",
"quotes": {
"ASK": {
"price": 12928,
"time": 937282932,
"duration": 9
}
}
},
{
"id": "reewmaarwl",
"name": "reeqooow",
"symbol": "reeq",
"rank": 4,
"owner": "eric",
"quotes": {
"ASK": {
"price": 9989,
"time": 124288222,
"duration": 19
}
}
},
{
"id": "sharkooaksj",
"name": "sharkmaaa",
"symbol": "shark",
"rank": 22,
"owner": "eric",
"quotes": {
"ASK": {
"price": 11122,
"time": 482773882,
"duration": 22
}
}
}
]
If you are OK with using pandas I would just create a DataFrame.
import pandas as pd

# Flatten the parsed JSON list; the nested quote fields become dotted column
# names such as "quotes.ASK.time". `parsed` is the list loaded from the
# response in the question (x.json()).
df = pd.json_normalize(parsed)
print(df)
id name symbol rank owner quotes.ASK.price \
0 nem-cri nemlaaoo nem 27 marcel 19429
1 kalo-lo-leek kalowaaa kalo 122 daniel 12928
2 reewmaarwl reeqooow reeq 4 eric 9989
3 sharkooaksj sharkmaaa shark 22 eric 11122
quotes.ASK.time quotes.ASK.duration
0 319250866 21
1 937282932 9
2 124288222 19
3 482773882 22
If you want the kalo value then
# The question asks for quotes->ASK->time, so select the time column
# (the original line selected quotes.ASK.price, i.e. 12928, by mistake).
print(df[df['symbol'] == 'kalo']['quotes.ASK.time'])  # -> 937282932
My MongoDB document structure is as follows and some of the factors are NaN.
_id :ObjectId("5feddb959297bb2625db1450")
factors: Array
0:Object
factorId:"C24"
Index:0
weight:1
1:Object
factorId:"C25"
Index:1
weight:1
2:Object
factorId:"C26"
Index:2
weight:1
name:"Growth Led Momentum"
I want to convert it to pandas data frame as follows using pymongo and pandas.
|name | factorId | Index | weight|
----------------------------------------------------
|Growth Led Momentum | C24 | 0 | 0 |
----------------------------------------------------
|Growth Led Momentum | C25 | 1 | 0 |
----------------------------------------------------
|Growth Led Momentum | C26 | 2 | 0 |
----------------------------------------------------
Thank you
Update
I broke out the ol Python to give this a crack - the following code works flawlessly!
from pymongo import MongoClient
import pandas as pd

# Connection settings -- replace the placeholders with your own values.
uri = "mongodb://<your_mongo_uri>:27017"
database_name = "<your_database_name>"  # fixed: placeholder was missing its closing '>'
collection_name = "<your_collection_name>"

mongo_client = MongoClient(uri)
database = mongo_client[database_name]
collection = database[collection_name]

# I used this code to insert a doc into a test collection
# before querying (just incase you wanted to know lol)
"""
data = {
    "_id": 1,
    "name": "Growth Lead Momentum",
    "factors": [
        {
            "factorId": "C24",
            "index": 0,
            "weight": 1
        },
        {
            "factorId": "D74",
            "index": 7,
            "weight": 9
        }
    ]
}
insert_result = collection.insert_one(data)
print(insert_result)
"""

# This is the query that answers your question:
# $unwind emits one document per element of the "factors" array, then
# $project lifts the embedded fields up to the top level.
results = collection.aggregate([
    {
        "$unwind": "$factors"
    },
    {
        "$project": {
            "_id": 1,  # Change to 0 if you wish to ignore "_id" field.
            "name": 1,
            "factorId": "$factors.factorId",
            "index": "$factors.index",
            "weight": "$factors.weight"
        }
    }
])

# This is how we turn the results into a DataFrame.
# We can simply pass `list(results)` into `DataFrame(..)`,
# due to how our query works (each result is already a flat document).
results_as_dataframe = pd.DataFrame(list(results))
print(results_as_dataframe)
Which outputs:
_id name factorId index weight
0 1 Growth Lead Momentum C24 0 1
1 1 Growth Lead Momentum D74 7 9
Original Answer
You could use the aggregation pipeline to unwind factors and then project the fields you want.
Something like this should do the trick.
Live demo here.
Database Structure
[
{
"_id": 1,
"name": "Growth Lead Momentum",
"factors": [
{
factorId: "C24",
index: 0,
weight: 1
},
{
factorId: "D74",
index: 7,
weight: 9
}
]
}
]
Query
// Unwind the factors array (one output document per factor), then project
// the embedded fields up to the top level next to _id and name.
db.collection.aggregate([
  {
    $unwind: "$factors"
  },
  {
    $project: {
      _id: 1,               // set to 0 to drop the _id field from the output
      name: 1,
      factorId: "$factors.factorId",
      index: "$factors.index",
      weight: "$factors.weight"
    }
  }
])
Results
(.csv friendly)
[
{
"_id": 1,
"factorId": "C24",
"index": 0,
"name": "Growth Lead Momentum",
"weight": 1
},
{
"_id": 1,
"factorId": "D74",
"index": 7,
"name": "Growth Lead Momentum",
"weight": 9
}
]
Wonderful answer by Matt, In case you want to use pandas:
Use this after you have retrieved documents from db:
# Flatten the documents: one row per factor, keeping the parent fields.
# Fix: the original chain exploded the dict *values* to scalars
# (.apply(lambda x: [...]).explode()) before apply(pd.Series), which cannot
# rebuild the factorId/Index/weight columns; expanding each factor dict
# directly with apply(pd.Series) does.
df = pd.json_normalize(data)
df = df['factors'].explode().apply(pd.Series).join(df).drop(columns=['factors'])
Output:
factorId Index weight name
0 C24 0 1 Growth Led Momentum
0 C25 1 1 Growth Led Momentum
0 C26 2 1 Growth Led Momentum
I have a mongodb data like below:
code date num price money
0 2 2015-11-15 10 3.8 -38.0
1 2 2015-11-17 -10 3.7 37.0
2 2 2015-11-20 20 3.5 -70.0
3 2 2016-04-01 10 3.2 -32.0
4 2 2016-04-02 -30 3.6 108.0
5 2 2016-04-03 50 3.4 -170.0
6 2 2016-11-01 -40 3.5 140.0
7 3 2015-02-01 25 7.0 -175.0
8 3 2015-05-01 35 7.5 -262.5
9 3 2016-03-01 -15 8.0 120.0
10 5 2015-11-20 50 5.0 -250.0
11 5 2016-06-01 -50 5.5 275.0
12 6 2015-02-01 35 11.5 -402.5
I want to get the number of securities held and the funds currently occupied by the securities
If I take out the data, I can get the result I want in the following way:
import pandas as pd
import numpy as np

# Sample trade data: one row per buy (positive num) or sell (negative num).
df = pd.DataFrame({'code': [2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 5, 5, 6],
                   'date': ['2015-11-15', '2015-11-17', '2015-11-20', '2016-04-01', '2016-04-02', '2016-04-03', '2016-11-01', '2015-02-01', '2015-05-01', '2016-03-01', '2015-11-20', '2016-06-01', '2015-02-01'],
                   'num': [10, -10, 20, 10, -30, 50, -40, 25, 35, -15, 50, -50, 35],
                   'price': [3.8, 3.7, 3.5, 3.2, 3.6, 3.4, 3.5, 7, 7.5, 8, 5, 5.5, 11.5],
                   'money': [-38, 37, -70, -32, 108, -170, 140, -175, -262.5, 120, -250, 275, -402.5]
                   })
print(df, "\n------------------------------------------\n")

# Running position and running cash flow per security.
df['hold'] = df.groupby('code')['num'].cumsum()
df['type'] = np.where(df['hold'] > 0, 'B', 'S')
df['total'] = df['total1'] = df.groupby('code')['money'].cumsum()


def FiFo(dfg):
    """Rebase each group's running total on its most recent fully-closed
    position (hold == 0), so 'total' reflects only the open position."""
    closed = dfg['hold'] == 0
    if closed.any():
        baseline = dfg.loc[closed, 'total1'].iloc[-1]
        dfg['total'] = np.where(dfg['hold'] > 0, dfg['total'] - baseline, dfg['total'])
    return dfg


dfR = (df.groupby('code', as_index=False)
         .apply(FiFo)
         .drop(['type', 'total1'], axis=1)
         .reset_index(drop=True))

# Keep only the latest row per security: current holding + occupied funds.
df1 = dfR.groupby('code').tail(1)
print(df1, "\n------------------------------------------\n")
out
code date num price money *hold* *total*
6 2 2016-11-01 -40 3.5 140.0 *10* *-30.0*
9 3 2016-03-01 -15 8.0 120.0 *45* *-317.5*
11 5 2016-06-01 -50 5.5 275.0 *0* *25.0*
12 6 2015-02-01 35 11.5 -402.5 *35* *-402.5*
If use the mongodb method (such as aggregate, or other), how can i directly obtain the same result as above?
I got this answer on this website below
https://developer.mongodb.com/community/forums/t/help-writing-aggregation-query-using-pymongo-instead-of-pandas/6290/13
# Aggregation pipeline that reproduces the pandas hold/total logic server-side.
pipeline = [
    {
        # Process trades in chronological order within each code.
        '$sort': { 'code': 1, 'date': 1 }
    },
    {
        # One group per security; keep the last row's scalar fields and push
        # every (num, money) pair so $reduce below can replay the history.
        '$group': {
            '_id': '$code',
            'num': { '$last': '$num' }, 'price': { '$last': '$price' }, 'money': { '$last': '$money' },
            'code_data': { '$push': { 'n': "$num", 'm': "$money" } }
        }
    },
    {
        '$addFields': {
            'result': {
                # Fold over the trade history, carrying the running position
                # (hold) and running cash (sum_m).
                '$reduce': {
                    'input': '$code_data',
                    'initialValue': { 'hold': 0, 'sum_m': 0, 'total': 0 },
                    'in': {
                        '$let': {
                            'vars': {
                                # Updated position and cash after this trade.
                                'hold_': { '$add': [ '$$this.n', '$$value.hold' ] },
                                'sum_m_': { '$add': [ '$$this.m', '$$value.sum_m' ] }
                            },
                            'in': {
                                # A fully closed position (hold == 0) resets
                                # the running cash -- mirroring the pandas
                                # FiFo() rebase above -- but still reports it
                                # as 'total'.
                                '$cond': [ { '$eq': [ '$$hold_', 0 ] },
                                    { 'hold': '$$hold_', 'sum_m': 0, 'total': '$$sum_m_' },
                                    { 'hold': '$$hold_', 'sum_m': '$$sum_m_', 'total': '$$sum_m_' }
                                ]
                            }
                        }
                    }
                }
            }
        }
    },
    {
        # Lift the reduced values up to top-level fields.
        '$addFields': { 'code': '$_id', 'hold': '$result.hold', 'total': '$result.total' }
    },
    {
        # Drop the intermediate fields from the output.
        '$project': { 'code_data': 0, 'result': 0, '_id': 0 }
    },
    {
        '$sort': { 'code': 1 }
    }
]
I am trying to iterate over a list of unique column-values to create three different keys with dictionaries inside a dictionary. This is the code I have now:
import pandas as pd

# Split frontendFrame into one list of row-dicts per unique METRIC value.
# NOTE(review): frontendFrame is defined elsewhere; assumed to be a DataFrame
# with a 'METRIC' column.
dataDict = {}
metrics = frontendFrame['METRIC'].unique()
for metric in metrics:
    # Boolean-mask scan of the whole frame for EVERY unique metric -- this is
    # the per-metric O(rows) cost the question complains about.
    dataDict[metric] = frontendFrame[frontendFrame['METRIC'] == metric].to_dict('records')
print(dataDict)
This works fine for low amounts of data, but as fast as the amount of data increases this can take almost one second (!!!!).
I've tried with groupby in pandas which is even slower, and also with map, but I don't want to return things to a list. How can I iterate over this and create what I want in a faster way? I am using Python 3.6
UPDATE:
Input:
DATETIME METRIC ANOMALY VALUE
0 2018-02-27 17:30:32 SCORE 2.0 -1.0
1 2018-02-27 17:30:32 VALUE NaN 0.0
2 2018-02-27 17:30:32 INDEX NaN 6.6613381477499995E-16
3 2018-02-27 17:31:30 SCORE 2.0 -1.0
4 2018-02-27 17:31:30 VALUE NaN 0.0
5 2018-02-27 17:31:30 INDEX NaN 6.6613381477499995E-16
6 2018-02-27 17:32:30 SCORE 2.0 -1.0
7 2018-02-27 17:32:30 VALUE NaN 0.0
8 2018-02-27 17:32:30 INDEX NaN 6.6613381477499995E-16
Output:
{
"INDEX": [
{
"DATETIME": 1519759710000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
},
{
"DATETIME": 1519759770000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
}],
"SCORE": [
{
"DATETIME": 1519759710000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
}],
"VALUE": [
{
"DATETIME": 1519759710000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
}]
}
One possible solution:
# Group the rows by METRIC in a single pass over the records.
# Fixes vs. the original snippet: the collections import now precedes the
# first use of defaultdict, and the dict comprehension that was executed
# purely for its .append() side effects is replaced by the explicit loop
# (comprehensions are for building values, not side effects).
from collections import defaultdict

a = defaultdict(list)
for x in frontendFrame.to_dict('records'):
    a[x['METRIC']].append(x)
a = dict(a)  # plain dict so missing keys raise instead of auto-creating
Slow:
# Slower alternative: groupby + apply builds the same METRIC -> records
# mapping, but with extra per-group overhead (as the question observed).
dataDict = frontendFrame.groupby('METRIC').apply(lambda x: x.to_dict('records')).to_dict()