Fill missing timeseries data using pandas or numpy - python

I have a list of dictionaries which looks like this:
L=[
{
"timeline": "2014-10",
"total_prescriptions": 17
},
{
"timeline": "2014-11",
"total_prescriptions": 14
},
{
"timeline": "2014-12",
"total_prescriptions": 8
},
{
"timeline": "2015-1",
"total_prescriptions": 4
},
{
"timeline": "2015-3",
"total_prescriptions": 10
},
{
"timeline": "2015-4",
"total_prescriptions": 3
}
]
This is essentially the result of a SQL query which, given a start date and an end date, returns the count of total prescriptions for each month from the start date to the end month. However, for months where the prescription count is 0 (Feb 2015), it skips that month entirely. Is it possible, using pandas or numpy, to alter this list so that it adds an entry for the missing month with 0 as the total prescriptions, as follows:
[
{
"timeline": "2014-10",
"total_prescriptions": 17
},
{
"timeline": "2014-11",
"total_prescriptions": 14
},
{
"timeline": "2014-12",
"total_prescriptions": 8
{
"timeline": "2015-1",
"total_prescriptions": 4
},
{
"timeline": "2015-2", # 2015-2 to be inserted for missing month
"total_prescriptions": 0 # 0 to be inserted for total prescription
},
{
"timeline": "2015-3",
"total_prescriptions": 10
},
{
"timeline": "2015-4",
"total_prescriptions": 3
}
]

What you are talking about is called "resampling" in pandas. First convert your timeline to a datetime and set it as your index:
df = pd.DataFrame(L)
df.index=pd.to_datetime(df.timeline,format='%Y-%m')
df
timeline total_prescriptions
timeline
2014-10-01 2014-10 17
2014-11-01 2014-11 14
2014-12-01 2014-12 8
2015-01-01 2015-1 4
2015-03-01 2015-3 10
2015-04-01 2015-4 3
Then you can add in the missing months with resample('MS') ('MS' stands for "month start"); months that had no rows show up as NaN:
df = df[['total_prescriptions']].resample('MS').asfreq()
df
total_prescriptions
timeline
2014-10-01 17
2014-11-01 14
2014-12-01 8
2015-01-01 4
2015-02-01 NaN
2015-03-01 10
2015-04-01 3
To match your required output, fill the NaN values with zero using fillna(0), convert the datetime index back to strings with strftime, and then export using to_dict('records'):
df = df.fillna(0)
df['timeline'] = df.index.strftime('%Y-%m-%d')
df.to_dict('records')
[{'timeline': '2014-10-01', 'total_prescriptions': 17.0},
{'timeline': '2014-11-01', 'total_prescriptions': 14.0},
{'timeline': '2014-12-01', 'total_prescriptions': 8.0},
{'timeline': '2015-01-01', 'total_prescriptions': 4.0},
{'timeline': '2015-02-01', 'total_prescriptions': 0.0},
{'timeline': '2015-03-01', 'total_prescriptions': 10.0},
{'timeline': '2015-04-01', 'total_prescriptions': 3.0}]
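Putting the pieces above together end to end (a minimal sketch, not part of the original answer; note that strftime('%Y-%m') yields zero-padded months such as '2015-02' rather than '2015-2', and the astype(int) restores whole counts):
import pandas as pd

df = pd.DataFrame(L)
df.index = pd.to_datetime(df['timeline'], format='%Y-%m')

out = (
    df[['total_prescriptions']]
    .resample('MS').asfreq()   # insert a row for every month start, NaN where data is missing
    .fillna(0)                 # missing months become 0
    .astype(int)               # back to whole prescription counts
)
out['timeline'] = out.index.strftime('%Y-%m')
result = out.to_dict('records')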

Related

Can pymongo methods be used to obtain the funds occupied by securities and the quantity of securities held?

I have MongoDB data like below:
code date num price money
0 2 2015-11-15 10 3.8 -38.0
1 2 2015-11-17 -10 3.7 37.0
2 2 2015-11-20 20 3.5 -70.0
3 2 2016-04-01 10 3.2 -32.0
4 2 2016-04-02 -30 3.6 108.0
5 2 2016-04-03 50 3.4 -170.0
6 2 2016-11-01 -40 3.5 140.0
7 3 2015-02-01 25 7.0 -175.0
8 3 2015-05-01 35 7.5 -262.5
9 3 2016-03-01 -15 8.0 120.0
10 5 2015-11-20 50 5.0 -250.0
11 5 2016-06-01 -50 5.5 275.0
12 6 2015-02-01 35 11.5 -402.5
I want to get the number of securities currently held and the funds currently occupied by those securities.
If I pull the data out into pandas, I can get the result I want in the following way:
import pandas as pd
import numpy as np
df=pd.DataFrame({'code': [2,2,2,2,2,2,2,3,3,3,5,5,6],
'date': ['2015-11-15','2015-11-17','2015-11-20','2016-04-01','2016-04-02','2016-04-03','2016-11-01','2015-02-01','2015-05-01','2016-03-01','2015-11-20','2016-06-01','2015-02-01'],
'num' : [10,-10, 20, 10, -30,50, -40, 25, 35, -15, 50, -50, 35],
'price': [3.8,3.7,3.5,3.2, 3.6,3.4, 3.5, 7, 7.5, 8, 5, 5.5, 11.5],
'money': [-38,37,-70,-32, 108,-170, 140,-175,-262.5,120,-250, 275,-402.5]
})
print(df,"\n------------------------------------------\n")
df['hold'] = df.groupby(['code'])['num'].cumsum()
df['type'] = np.where(df['hold'] > 0, 'B', 'S')
df['total']=df['total1']= df.groupby(['code'])['money'].cumsum()
def FiFo(dfg):
    if dfg[dfg['hold'] == 0]['hold'].count():
        subT = dfg[dfg['hold'] == 0]['total1'].iloc[-1]
        dfg['total'] = np.where(dfg['hold'] > 0, dfg['total'] - subT, dfg['total'])
    return dfg
dfR = df.groupby(['code'], as_index=False)\
.apply(FiFo) \
.drop(['type', 'total1'], axis=1) \
.reset_index(drop=True)
df1=dfR.groupby(['code']).tail(1)
print(df1,"\n------------------------------------------\n")
Output:
    code        date  num  price   money  hold   total
6      2  2016-11-01  -40    3.5   140.0    10   -30.0
9      3  2016-03-01  -15    8.0   120.0    45  -317.5
11     5  2016-06-01  -50    5.5   275.0     0    25.0
12     6  2015-02-01   35   11.5  -402.5    35  -402.5
If I use MongoDB methods (such as aggregate, or others), how can I directly obtain the same result as above?
I got this answer from the MongoDB community forums:
https://developer.mongodb.com/community/forums/t/help-writing-aggregation-query-using-pymongo-instead-of-pandas/6290/13
pipeline = [
    {
        '$sort': { 'code': 1, 'date': 1 }
    },
    {
        '$group': {
            '_id': '$code',
            'num': { '$last': '$num' }, 'price': { '$last': '$price' }, 'money': { '$last': '$money' },
            'code_data': { '$push': { 'n': "$num", 'm': "$money" } }
        }
    },
    {
        '$addFields': {
            'result': {
                '$reduce': {
                    'input': '$code_data',
                    'initialValue': { 'hold': 0, 'sum_m': 0, 'total': 0 },
                    'in': {
                        '$let': {
                            'vars': {
                                'hold_': { '$add': [ '$$this.n', '$$value.hold' ] },
                                'sum_m_': { '$add': [ '$$this.m', '$$value.sum_m' ] }
                            },
                            'in': {
                                '$cond': [ { '$eq': [ '$$hold_', 0 ] },
                                    { 'hold': '$$hold_', 'sum_m': 0, 'total': '$$sum_m_' },
                                    { 'hold': '$$hold_', 'sum_m': '$$sum_m_', 'total': '$$sum_m_' }
                                ]
                            }
                        }
                    }
                }
            }
        }
    },
    {
        '$addFields': { 'code': '$_id', 'hold': '$result.hold', 'total': '$result.total' }
    },
    {
        '$project': { 'code_data': 0, 'result': 0, '_id': 0 }
    },
    {
        '$sort': { 'code': 1 }
    }
]
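To actually run this pipeline from Python, pass it to pymongo's Collection.aggregate. A minimal sketch; the connection string, database name, and collection name are assumptions:
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')   # assumed connection string
collection = client['mydb']['holdings']             # assumed database/collection names
for doc in collection.aggregate(pipeline):
    print(doc)  # one document per code: num, price, money, code, hold, total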

Calculate average value by hour of json data

I'm having trouble grouping samples by hour. The data structure looks like this:
data = [
{
"pressure": "1009.7",
"timestamp": "2019-09-03 08:03:00"
},
{
"pressure": "1009.7",
"timestamp": "2019-09-03 08:18:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 08:33:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 08:56:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 09:03:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 09:18:00"
},
{
"pressure": "1009.8",
"timestamp": "2019-09-03 09:33:00"
},
{
"pressure": "1009.7",
"timestamp": "2019-09-03 09:56:00"
},
{
"pressure": "1009.6",
"timestamp": "2019-09-03 10:03:00"
}
]
As you can see, there are 4 measurements of pressure every hour, and I would like to calculate the average value per hour. I've tried achieving this with pandas, but no luck. What I tried was to extract the start and end timestamps, round them to the full hour, and then pass them to a DataFrame as the index with the JSON as data, but there is a shape mismatch (no wonder). I thought I would be able to pass it to the df like this and later calculate the mean, but it looks like there should be some intermediate step.
If your JSON mimics the above, then we can pass it into a DataFrame:
df = pd.DataFrame.from_dict(data)
pressure timestamp
0 1009.7 2019-09-03 08:03:00
1 1009.7 2019-09-03 08:18:00
2 1009.8 2019-09-03 08:33:00
3 1009.8 2019-09-03 08:56:00
4 1009.8 2019-09-03 09:03:00
5 1009.8 2019-09-03 09:18:00
6 1009.8 2019-09-03 09:33:00
7 1009.7 2019-09-03 09:56:00
8 1009.6 2019-09-03 10:03:00
Then just group by the hour and take the mean of the pressure:
hourly_avg = df.groupby(df['timestamp'].dt.hour)['pressure'].mean()
print(hourly_avg)
timestamp
8 1009.750
9 1009.775
10 1009.600
Name: pressure, dtype: float64
Note: you'll first need to make the timestamp a proper datetime and convert pressure to a floating-point value:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['pressure'] = df['pressure'].astype(float)
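If the data ever spans more than one day, grouping on dt.hour alone would merge, say, the 08:00 readings from different days into one bucket. A resample on the full timestamp keeps the date as well; a sketch, assuming the dtype conversions above have been applied:
hourly_avg = (
    df.set_index('timestamp')['pressure']
      .resample('H')   # one bucket per calendar hour
      .mean()
)
print(hourly_avg)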
I would approach the problem by creating a new dictionary with the date/hour as a key and the pressures as a list (the value of the dictionary).
d = {}
for _dict in data:
    key = _dict['timestamp'][:13]  # 2019-09-03 08, etc.
    d.setdefault(key, []).append(float(_dict['pressure']))

for key, array in d.items():
    print(key, format(sum(array) / len(array), '.3f'))
Prints:
2019-09-03 08 1009.750
2019-09-03 09 1009.775
2019-09-03 10 1009.600
Check this:
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S')
df['pressure'] = df['pressure'].astype(float)
df['hour'] = df['timestamp'].dt.hour
pressure = df.groupby([df['hour']])['pressure'].mean()
print(pressure)
Output:
hour
8 1009.750
9 1009.775
10 1009.600

Python Pandas - Iterate over unique columns

I am trying to iterate over the unique values of a column to build a dictionary with one key per unique value, each holding that metric's rows as a list of record dictionaries. This is the code I have now:
import pandas as pd
dataDict = {}
metrics = frontendFrame['METRIC'].unique()
for metric in metrics:
    dataDict[metric] = frontendFrame[frontendFrame['METRIC'] == metric].to_dict('records')
print(dataDict)
This works fine for small amounts of data, but as the amount of data increases it can take almost one second (!!!!).
I've tried groupby in pandas, which is even slower, and also map, but I don't want to return things as a list. How can I iterate over this and build what I want in a faster way? I am using Python 3.6.
UPDATE:
Input:
DATETIME METRIC ANOMALY VALUE
0 2018-02-27 17:30:32 SCORE 2.0 -1.0
1 2018-02-27 17:30:32 VALUE NaN 0.0
2 2018-02-27 17:30:32 INDEX NaN 6.6613381477499995E-16
3 2018-02-27 17:31:30 SCORE 2.0 -1.0
4 2018-02-27 17:31:30 VALUE NaN 0.0
5 2018-02-27 17:31:30 INDEX NaN 6.6613381477499995E-16
6 2018-02-27 17:32:30 SCORE 2.0 -1.0
7 2018-02-27 17:32:30 VALUE NaN 0.0
8 2018-02-27 17:32:30 INDEX NaN 6.6613381477499995E-16
Output:
{
"INDEX": [
{
"DATETIME": 1519759710000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
},
{
"DATETIME": 1519759770000,
"METRIC": "INDEX",
"ANOMALY": null,
"VALUE": "6.6613381477499995E-16"
}],
"SCORE": [
{
"DATETIME": 1519759710000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "SCORE",
"ANOMALY": 2,
"VALUE": "-1.0"
}],
"VALUE": [
{
"DATETIME": 1519759710000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
},
{
"DATETIME": 1519759770000,
"METRIC": "VALUE",
"ANOMALY": null,
"VALUE": "0.0"
}]
}
One possible solution, using a defaultdict and a dict comprehension for its side effects:
from collections import defaultdict

a = defaultdict(list)
_ = {x['METRIC']: a[x['METRIC']].append(x) for x in frontendFrame.to_dict('records')}
a = dict(a)
Or, more readably, with an explicit loop:
from collections import defaultdict

a = defaultdict(list)
for x in frontendFrame.to_dict('records'):
    a[x['METRIC']].append(x)
a = dict(a)
Slow:
dataDict = frontendFrame.groupby('METRIC').apply(lambda x: x.to_dict('records')).to_dict()
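If the end goal is the JSON shown in the question (with DATETIME as epoch milliseconds), note that to_dict('records') leaves DATETIME as pandas Timestamp objects, which json.dumps cannot serialize directly. One option, sketched here rather than taken from the answers above, is to let to_json handle the per-metric serialization:
import json

json_ready = {
    metric: json.loads(
        frontendFrame[frontendFrame['METRIC'] == metric]
        .to_json(orient='records', date_format='epoch', date_unit='ms')
    )
    for metric in frontendFrame['METRIC'].unique()
}
print(json.dumps(json_ready, indent=2))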

Formatting dataframe output into JSON records by group

My dataframe df looks like this:
count_arena_users count_users event timestamp
0 4458 12499 football 2017-04-30
1 2706 4605 cricket 2015-06-30
2 592 4176 tennis 2016-06-30
3 3427 10126 badminton 2017-05-31
4 717 2313 football 2016-03-31
5 101 155 hockey 2016-01-31
6 45923 191180 tennis 2015-12-31
7 1208 2824 badminton 2017-01-31
8 5577 8906 cricket 2016-02-29
9 111 205 football 2016-03-31
10 4 8 hockey 2017-09-30
The data is fetched from a PostgreSQL database. Now I want to generate the output of "select * from tbl_arena" in JSON format, but the desired JSON has to look something like this:
[
{
"event": "football",
"data_to_plot": [
{
"count_arena_users": 717,
"count_users": 2313,
"timestamp": "2016-03-31"
},
{
"count_arena_users": 111,
"count_users": 205,
"timestamp": "2016-03-31"
},
{
"count_arena_users": 4458,
"count_users": 12499,
"timestamp": "2017-04-30"
}
]
},
{
"event": "cricket",
"data_to_plot": [
{
"count_arena_users": 2706,
"count_users": 4605,
"timestamp": "2015-06-30"
},
{
"count_arena_users": 5577,
"count_users": 8906,
"timestamp": "2016-02-29"
}
]
}
.
.
.
.
]
The values of all the columns are grouped based on the event column, and the order of the sub-dictionaries within each group is decided by the timestamp column, i.e. earlier dates appear first and newer/latest dates appear below them.
I'm using Python 3.x and json.dumps to format the data into JSON.
A high-level process is as follows:
1. Aggregate all data with respect to events. We'd need a groupby + apply for this.
2. Convert the result to a series of records, one record for each event and its associated data. Use to_json with orient='records'.
df.groupby('event', sort=False)\
  .apply(lambda x: x.drop(columns='event').sort_values('timestamp').to_dict('records'))\
  .reset_index(name='data_to_plot')\
  .to_json(orient='records')
[
{
"event": "football",
"data_to_plot": [
{
"count_arena_users": 717,
"timestamp": "2016-03-31",
"count_users": 2313
},
{
"count_arena_users": 111,
"timestamp": "2016-03-31",
"count_users": 205
},
{
"count_arena_users": 4458,
"timestamp": "2017-04-30",
"count_users": 12499
}
]
},
...
]
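to_json returns a JSON string. Since the question mentions json.dumps, the string can be round-tripped through the json module to get the indented form shown above; a small sketch, with the chain assigned to a variable purely for clarity:
import json

json_string = df.groupby('event', sort=False)\
                .apply(lambda x: x.drop(columns='event').sort_values('timestamp').to_dict('records'))\
                .reset_index(name='data_to_plot')\
                .to_json(orient='records')
print(json.dumps(json.loads(json_string), indent=4))  # pretty-printed, as in the desired output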

Using Pandas json_normalize on nested Json with arrays

The problem is normalizing a JSON document with a nested array of JSON objects. I have looked at similar questions and tried to use their solutions, to no avail.
This is what my JSON object looks like:
{
"results": [
{
"_id": "25",
"Product": {
"Description": "3 YEAR",
"TypeLevel1": "INTEREST",
"TypeLevel2": "LONG"
},
"Settlement": {},
"Xref": {
"SCSP": "96"
},
"ProductSMCP": [
{
"SMCP": "01"
}
]
},
{
"_id": "26",
"Product": {
"Description": "10 YEAR",
"TypeLevel1": "INTEREST",
"Currency": "USD",
"Operational": true,
"TypeLevel2": "LONG"
},
"Settlement": {},
"Xref": {
"BBT": "CITITYM9",
"TCK": "ZN"
},
"ProductSMCP": [
{
"SMCP": "01"
},
{
"SMCP2": "02"
}
]
}
]
}
Here is my code for normalizing the JSON object:
import json
import pandas as pd

data = json.load(j)  # j is the open file containing the JSON above
data = data['results']
print(pd.io.json.json_normalize(data))
The result that I WANT should look like this:
id Description TypeLevel1 TypeLevel2 Currency \
25 3 YEAR US INTEREST LONG NAN
26 10 YEAR US INTEREST NAN USD
BBT TCT SMCP SMCP2 SCSP
NAN NAN 521 NAN 01
M9 ZN 01 02 NAN
However, the result I get is this:
Product.Currency Product.Description Product.Operational Product.TypeLevel1 \
0 NaN 3 YEAR NaN INTEREST
1 USD 10 YEAR True INTEREST
Product.TypeLevel2 ProductSMCP Xref.BBT Xref.SCSP \
0 LONG [{'SMCP': '01'}] NaN 96
1 LONG [{'SMCP': '01'}, {'SMCP2': '02'}] CITITYM9 NaN
Xref.TCK _id
0 NaN 25
1 ZN 26
As you can see, the issue is at ProductSMCP: it is not completely flattening the array.
Once we get past the first normalization, I'd apply a lambda to finish the job.
from cytoolz.dicttoolz import merge
pd.io.json.json_normalize(data).pipe(
    lambda x: x.drop(columns='ProductSMCP').join(
        x.ProductSMCP.apply(lambda y: pd.Series(merge(y)))
    )
)
Product.Currency Product.Description Product.Operational Product.TypeLevel1 Product.TypeLevel2 Xref.BBT Xref.SCSP Xref.TCK _id SMCP SMCP2
0 NaN 3 YEAR NaN INTEREST LONG NaN 96 NaN 25 01 NaN
1 USD 10 YEAR True INTEREST LONG CITITYM9 NaN ZN 26 01 02
Trim Column Names
import re

pd.io.json.json_normalize(data).pipe(
    lambda x: x.drop(columns='ProductSMCP').join(
        x.ProductSMCP.apply(lambda y: pd.Series(merge(y)))
    )
).rename(columns=lambda x: re.sub(r'(Product|Xref)\.', '', x))
Currency Description Operational TypeLevel1 TypeLevel2 BBT SCSP TCK _id SMCP SMCP2
0 NaN 3 YEAR NaN INTEREST LONG NaN 96 NaN 25 01 NaN
1 USD 10 YEAR True INTEREST LONG CITITYM9 NaN ZN 26 01 02
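If you'd rather not depend on cytoolz, a plain dict-merge helper does the same job, and pd.json_normalize is the modern spelling of pd.io.json.json_normalize. A sketch under those assumptions:
import re
import pandas as pd

def merge_dicts(dicts):
    # combine [{'SMCP': '01'}, {'SMCP2': '02'}] into {'SMCP': '01', 'SMCP2': '02'}
    out = {}
    for d in dicts:
        out.update(d)
    return out

flat = pd.json_normalize(data).pipe(
    lambda x: x.drop(columns='ProductSMCP').join(
        x['ProductSMCP'].apply(lambda y: pd.Series(merge_dicts(y)))
    )
).rename(columns=lambda c: re.sub(r'(Product|Xref)\.', '', c))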
