Select Pandas dataframe rows between two dates - python

I am working on two tables as follows:
A first table df1 giving a rate and a validity period:
rates = {'rate': [ 0.974, 0.966, 0.996, 0.998, 0.994, 1.006, 1.042, 1.072, 0.954],
'Valid from': ['31/12/2018','15/01/2019','01/02/2019','01/03/2019','01/04/2019','15/04/2019','01/05/2019','01/06/2019','30/06/2019'],
'Valid to': ['14/01/2019','31/01/2019','28/02/2019','31/03/2019','14/04/2019','30/04/2019','31/05/2019','29/06/2019','31/07/2019']}
df1 = pd.DataFrame(rates)
df1['Valid to'] = pd.to_datetime(df1['Valid to'])
df1['Valid from'] = pd.to_datetime(df1['Valid from'])
rate Valid from Valid to
0 0.974 2018-12-31 2019-01-14
1 0.966 2019-01-15 2019-01-31
2 0.996 2019-01-02 2019-02-28
3 0.998 2019-01-03 2019-03-31
4 0.994 2019-01-04 2019-04-14
5 1.006 2019-04-15 2019-04-30
6 1.042 2019-01-05 2019-05-31
7 1.072 2019-01-06 2019-06-29
8 0.954 2019-06-30 2019-07-31
A second table df2 listing recorded amounts and the corresponding dates:
data = {'date': ['03/01/2019','23/01/2019','27/02/2019','14/03/2019','05/04/2019','30/04/2019','14/06/2019'],
'amount': [200,305,155,67,95,174,236,]}
df2 = pd.DataFrame(data)
df2['date'] = pd.to_datetime(df2['date'])
date amount
0 2019-03-01 200
1 2019-01-23 305
2 2019-02-27 155
3 2019-03-14 67
4 2019-05-04 95
5 2019-04-30 174
6 2019-06-14 236
The objective is to retrieve from df1 the rate applicable to each row of df2, based on the date in df2 (ideally by iterating over df2).
Example: the date in the first row of df2 is 2019-01-03, so the applicable rate would be 0.974.
The explanation given here (https://www.interviewqs.com/ddi_code_snippets/select_pandas_dataframe_rows_between_two_dates) gives me an idea of how to retrieve the rows of df2 that fall between two dates in df1, but I didn't manage to retrieve from df1 the rate applicable to each row of df2 by iteration.

If your dataframes are not very big, you can simply do the join on a dummy key and then filter to narrow it down to what you need. See the example below (note that I had to update your example a little to parse the dates correctly).
import pandas as pd
rates = {'rate': [ 0.974, 0.966, 0.996, 0.998, 0.994, 1.006, 1.042, 1.072, 0.954],
'valid_from': ['31/12/2018','15/01/2019','01/02/2019','01/03/2019','01/04/2019','15/04/2019','01/05/2019','01/06/2019','30/06/2019'],
'valid_to': ['14/01/2019','31/01/2019','28/02/2019','31/03/2019','14/04/2019','30/04/2019','31/05/2019','29/06/2019','31/07/2019']}
df1 = pd.DataFrame(rates)
df1['valid_to'] = pd.to_datetime(df1['valid_to'], format='%d/%m/%Y')
df1['valid_from'] = pd.to_datetime(df1['valid_from'], format='%d/%m/%Y')
Then your df1 would be
rate valid_from valid_to
0 0.974 2018-12-31 2019-01-14
1 0.966 2019-01-15 2019-01-31
2 0.996 2019-02-01 2019-02-28
3 0.998 2019-03-01 2019-03-31
4 0.994 2019-04-01 2019-04-14
5 1.006 2019-04-15 2019-04-30
6 1.042 2019-05-01 2019-05-31
7 1.072 2019-06-01 2019-06-29
8 0.954 2019-06-30 2019-07-31
This is your second data frame df2
data = {'date': ['03/01/2019','23/01/2019','27/02/2019','14/03/2019','05/04/2019','30/04/2019','14/06/2019'],
'amount': [200,305,155,67,95,174,236,]}
df2 = pd.DataFrame(data)
df2['date'] = pd.to_datetime(df2['date'], format='%d/%m/%Y')
Then your df2 would look like the following
date amount
0 2019-01-03 200
1 2019-01-23 305
2 2019-02-27 155
3 2019-03-14 67
4 2019-04-05 95
5 2019-04-30 174
6 2019-06-14 236
Your solution:
df1['key'] = 1
df2['key'] = 1
df_output = pd.merge(df1, df2, on='key').drop('key',axis=1)
df_output = df_output[(df_output['date'] >= df_output['valid_from']) & (df_output['date'] <= df_output['valid_to'])]
This is how the resulting df_output would look:
rate valid_from valid_to date amount
0 0.974 2018-12-31 2019-01-14 2019-01-03 200
8 0.966 2019-01-15 2019-01-31 2019-01-23 305
16 0.996 2019-02-01 2019-02-28 2019-02-27 155
24 0.998 2019-03-01 2019-03-31 2019-03-14 67
32 0.994 2019-04-01 2019-04-14 2019-04-05 95
40 1.006 2019-04-15 2019-04-30 2019-04-30 174
55 1.072 2019-06-01 2019-06-29 2019-06-14 236
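As a side note, on pandas 1.2 or newer the same cross join can be expressed without the dummy key column via how='cross'; a minimal sketch of the same idea (skipping the key columns entirely):
# pandas >= 1.2: cross join without a dummy 'key' column, then the same date filter
df_output = pd.merge(df1, df2, how='cross')
df_output = df_output[(df_output['date'] >= df_output['valid_from']) &
                      (df_output['date'] <= df_output['valid_to'])]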

Related

Efficient way to merge large Pandas dataframes between two dates

I know there are many questions like this one but I can't seem to find the relevant answer.
Let's say I have 2 data frames as follow:
df1 = pd.DataFrame(
    {
        "end": [
            "2019-08-31",
            "2019-08-28",
            "2019-09-09",
            "2019-09-08",
            "2019-09-14",
            "2019-09-14",
        ],
        "start": [
            "2019-08-27",
            "2019-08-22",
            "2019-08-04",
            "2019-09-02",
            "2019-09-06",
            "2019-09-10",
        ],
        "id": [1234, 8679, 8679, 1234, 1234, 8679],
    }
)
df2 = pd.DataFrame(
    {
        "timestamp": [
            "2019-08-30 10:00",
            "2019-08-28 10:00",
            "2019-08-27 10:30",
            "2019-08-07 12:00",
            "2019-09-12 10:00",
            "2019-09-11 14:00",
            "2019-08-29 18:00",
        ],
        "id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
        "val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
    }
)
df1["end"] = pd.to_datetime(df1["end"])
df1["start"] = pd.to_datetime(df1["start"])
df2["timestamp"] = pd.to_datetime(df2["timestamp"])
df1.sort_values(by=["end"], inplace=True)
df2.sort_values(by="timestamp", inplace=True)
Resulted as:
end start id
0 2019-08-31 2019-08-27 1234
1 2019-08-28 2019-08-22 8679
2 2019-09-09 2019-08-04 8679
3 2019-09-08 2019-09-02 1234
4 2019-09-14 2019-09-06 1234
5 2019-09-14 2019-09-10 8679
timestamp id val
0 2019-08-30 10:00 1234 AAAB
1 2019-08-28 10:00 1234 ABBA
2 2019-08-27 10:30 8679 CXXC
3 2019-08-07 12:00 1234 BBAA
4 2019-09-12 10:00 8679 XCXC
5 2019-09-11 14:00 8679 CCXX
6 2019-08-29 18:00 1234 BAAB
The classic way to merge by id, so that timestamp falls between start and end in df1, is to merge on id (or on a dummy variable) and then filter:
merged_df = pd.merge(df1, df2, how="left", on="id")
merged_df = merged_df.loc[
    (merged_df["timestamp"] >= merged_df["start"])
    & (merged_df["timestamp"] <= merged_df["end"])
]
This gives me the output I wish to have:
end start id timestamp val
0 2019-08-31 2019-08-27 1234 2019-08-30 10:00 AAAB
1 2019-08-31 2019-08-27 1234 2019-08-28 10:00 ABBA
3 2019-08-31 2019-08-27 1234 2019-08-29 18:00 BAAB
4 2019-08-28 2019-08-22 8679 2019-08-27 10:30 CXXC
7 2019-09-09 2019-08-04 8679 2019-08-27 10:30 CXXC
19 2019-09-14 2019-09-10 8679 2019-09-12 10:00 XCXC
20 2019-09-14 2019-09-10 8679 2019-09-11 14:00 CCXX
My question: I need to do the same merge and get the same results, but df1 is 200K rows and df2 is 600K.
What I have tried so far:
The classic merge-and-filter approach above fails because the initial merge creates a huge data frame that overloads memory.
I also tried the pandasql approach, which ended with my 16GB RAM PC getting stuck.
I tried merge_asof in 3 steps (left join, right join and outer join) as explained here, but I ran some tests and it seems to always return at most 2 records from df2 for a single line in df1.
Any good advice will be appreciated!
Perhaps you can write a function with groupby and find the matching date range with pd.IntervalIndex, so you don't have to merge:
def func():
    for x, y in df2.groupby("id"):
        tmp = df1.loc[df1["id"].eq(x)]
        tmp.index = pd.IntervalIndex.from_arrays(tmp['start'], tmp['end'], closed='both')
        y[["start", "end"]] = tmp.loc[y.timestamp, ["start", "end"]].to_numpy()
        yield y

print(pd.concat(func()).sort_index())
timestamp id val start end
0 2019-08-30 10:00:00 1234 AAAB 2019-08-27 2019-08-31
1 2019-08-28 10:00:00 1234 ABBA 2019-08-27 2019-08-31
2 2019-08-07 10:30:00 8679 CXXC 2019-08-04 2019-09-09
3 2019-08-27 12:00:00 1234 BBAA 2019-08-27 2019-08-31
4 2019-09-12 10:00:00 8679 XCXC 2019-09-10 2019-09-14
5 2019-09-11 14:00:00 8679 CCXX 2019-09-10 2019-09-14
6 2019-08-29 18:00:00 1234 BAAB 2019-08-27 2019-08-31
I've been working with niv-dudovitch and david-arenburg on this one, and here are our findings which I hope will be helpful to some of you out there...
The core idea was to prevent growing objects in memory by creating a list of dataframes based on subsets of the data.
First version without multi-processing.
import pandas as pd

unk = df1.id.unique()
j = [None] * len(unk)
k = 0
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)

for i in unk:
    tmp = df1.loc[df1.index.isin([i])].join(df2.loc[df2.index.isin([i])], how='left')
    j[k] = tmp.loc[tmp['timestamp'].between(tmp['start'], tmp['end'])]
    k += 1

res = pd.concat(j)
res
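If the repeated df1.loc[df1.index.isin([i])] lookups become a bottleneck, one hedged variant of the same loop is to split each frame once with groupby and join the per-id pieces; a sketch, assuming df1 and df2 are already indexed by id as above:
# sketch: split each frame once by id instead of scanning with isin() per id
g1 = dict(tuple(df1.groupby(level=0)))
g2 = dict(tuple(df2.groupby(level=0)))

parts = []
for i, left in g1.items():
    right = g2.get(i)
    if right is None:
        continue  # no df2 rows for this id, so nothing can fall between start and end
    tmp = left.join(right, how='left')
    parts.append(tmp.loc[tmp['timestamp'].between(tmp['start'], tmp['end'])])

res = pd.concat(parts)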
Using multiprocessing
In our real case we have two large data frames: df2 is about 3 million rows and df1 is slightly above 110K. The output is about 20M rows.
import multiprocessing as mp
import itertools
import concurrent
from concurrent.futures import ProcessPoolExecutor
import time
import pandas as pd
from itertools import repeat


def get_val_between(ids, df1, df2):
    """
    Locate all values between 2 dates by id
    Args:
        - ids (list): list of ids
    Returns:
        - concat list of dataframes
    """
    j = [None] * len(ids)
    k = 0
    for i in ids:
        tmp = df1.loc[df1.index.isin([i])].join(
            df2.loc[df2.index.isin([i])], how="left"
        )
        tmp = tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])]
        # add to list in location k
        j[k] = tmp
        k += 1
    # keep only not None dfs in j
    j = [i for i in j if i is not None]
    if len(j) > 0:
        return pd.concat(j)
    else:
        return None


def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)


def main():
    df1.reset_index(inplace=True, drop=True)
    df2.reset_index(inplace=True, drop=True)
    id_lst = df1.id.unique()
    iter_ids = grouper(10, list(id_lst))
    df1.set_index("id", inplace=True)
    df2.set_index("id", inplace=True)
    # set up the process pool; executor.map blocks as its results are consumed by pd.concat
    executor = concurrent.futures.ProcessPoolExecutor(20)
    result_futures = executor.map(get_val_between, iter_ids, repeat(df1), repeat(df2))
    result_concat = pd.concat(result_futures)
    print(result_concat)


if __name__ == "__main__":
    main()
Results, as expected:
end start timestamp val
id
8679 2019-08-28 2019-08-22 2019-08-27 10:30:00 CXXC
8679 2019-09-09 2019-08-04 2019-08-27 10:30:00 CXXC
8679 2019-09-14 2019-09-10 2019-09-11 14:00:00 CCXX
8679 2019-09-14 2019-09-10 2019-09-12 10:00:00 XCXC
1234 2019-08-31 2019-08-27 2019-08-28 10:00:00 ABBA
1234 2019-08-31 2019-08-27 2019-08-29 18:00:00 BAAB
1234 2019-08-31 2019-08-27 2019-08-30 10:00:00 AAAB
As a benchmark, with an output of 20 million rows, the multiprocessing approach is about 10x faster.
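A small side note on the executor usage in main(): ProcessPoolExecutor also works as a context manager, which guarantees the worker pool is shut down once the map results are consumed; a minimal sketch of the same call:
# sketch: same map call, but the pool is closed automatically when the block exits
with ProcessPoolExecutor(max_workers=20) as executor:
    result_futures = executor.map(get_val_between, iter_ids, repeat(df1), repeat(df2))
    result_concat = pd.concat(result_futures)
print(result_concat)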

Fill missing timestamps and apply different operations on different columns

I have data in the below format:
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:55:00 18 1047
Expected Output:
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:20:00 0 41
xyz 01-01-2020 00:25:00 0 41
xyz 01-01-2020 00:30:00 0 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:50:00 0 1029
xyz 01-01-2020 11:55:00 18 1047
So I want to fill in timestamps at 5-minute intervals, fill the flowers column with 0, and fill the total_flowers column with the previous value (ffill).
My efforts:
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00+05:30")
end_time = pd.to_datetime(f"{end_day} 23:55:00+05:30")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min')
df = df.set_index('timestamp').reindex(dates).reset_index(drop=False).reindex(columns=df.columns)
How do I fill the flowers column with zeros and the total_flowers column with ffill? I am also getting NaN values in the timestamp column.
Actual Output:
user timestamp flowers total_flowers
xyz Nan 15 15
xyz Nan 5 20
xyz Nan 21 41
xyz Nan Nan Nan
xyz Nan Nan Nan
xyz Nan Nan Nan
xyz Nan 1 42
...
xyz Nan 57 1029
xyz Nan Nan Nan
xyz Nan 18 1047
Reindex and refill
If you construct the dates such that you can reindex your timestamps, you can then just do some fillna and ffill operations. I had to remove the timezone information, but you should be able to add that back if your data are timezone-aware. Here's the full example using some of your data:
import pandas as pd

d = {'user': {0: 'xyz', 1: 'xyz', 2: 'xyz', 3: 'xyz'},
     'timestamp': {0: pd.Timestamp('2020-01-01 00:05:00'),
                   1: pd.Timestamp('2020-01-01 00:10:00'),
                   2: pd.Timestamp('2020-01-01 00:15:00'),
                   3: pd.Timestamp('2020-01-01 00:35:00')},
     'flowers': {0: 15, 1: 5, 2: 21, 3: 1},
     'total_flowers': {0: 15, 1: 20, 2: 41, 3: 42}}
df = pd.DataFrame(d)
# user timestamp flowers total_flowers
#0 xyz 2020-01-01 00:05:00 15 15
#1 xyz 2020-01-01 00:10:00 5 20
#2 xyz 2020-01-01 00:15:00 21 41
#3 xyz 2020-01-01 00:35:00 1 42
#as you did, but with no TZ
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00")
end_time = pd.to_datetime(f"{end_day} 00:55:00")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min', name="timestamp")
#filling the nas and reformatting
df = df.set_index('timestamp')
df = df.reindex(dates)
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.reset_index(inplace=True)
Output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0
7 2020-01-01 00:40:00 xyz 0.0 42.0
8 2020-01-01 00:45:00 xyz 0.0 42.0
9 2020-01-01 00:50:00 xyz 0.0 42.0
10 2020-01-01 00:55:00 xyz 0.0 42.0
Resample and refill
You can also use resample here with asfreq(), then do the filling as before. This is convenient for generating the dates (and should get around the timezone issues):
# resample and then fill the gaps
# same df as constructed above
df = df.set_index('timestamp')
df = df.resample('5T').asfreq()
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.index.name='timestamp'
df.reset_index(inplace=True)
Same output (though resample only extends out to the last observed timestamp):
timestamp flowers total_flowers user
0 2020-01-01 00:05:00 15 15.0 xyz
1 2020-01-01 00:10:00 5 20.0 xyz
2 2020-01-01 00:15:00 21 41.0 xyz
3 2020-01-01 00:20:00 0 41.0 xyz
4 2020-01-01 00:25:00 0 41.0 xyz
5 2020-01-01 00:30:00 0 41.0 xyz
6 2020-01-01 00:35:00 1 42.0 xyz
I couldn't find a way to do the filling during the resampling itself. For instance, using
df = df.resample('5T').agg({'flowers': 'sum',
                            'total_flowers': 'ffill',
                            'user': 'ffill'})
does not work (it gets you to the same place as asfreq, but there's more room to accidentally leave columns out). Which is odd, because when applying ffill over the whole DataFrame the missing data can be forward filled (but we only want that for some columns, and the user column also gets dropped). Simply using asfreq and doing the filling after the fact seems fine to me with only a few columns.
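One compact workaround (not the agg one-liner the paragraph above was after, just a sketch that assumes df has the timestamp index set as in the resample example) is to chain the fills after asfreq:
# sketch: asfreq, then one chained fill pass over the gaps
out = (df.resample('5T')
         .asfreq()
         .fillna({'flowers': 0})   # gaps get zero new flowers
         .ffill()                  # user and total_flowers carry forward
         .reset_index())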
crossed with #Tom
You are almost there:
df = pd.DataFrame({'user': ['xyz', 'xyz', 'xyz', 'xyz'],
                   'timestamp': ['01-01-2020 00:05:00', '01-01-2020 00:10:00', '01-01-2020 00:15:00', '01-01-2020 00:35:00'],
                   'flowers': [15, 5, 21, 1],
                   'total_flowers': [15, 20, 41, 42]
                   })
df['timestamp'] = pd.to_datetime(df['timestamp'])
r = pd.date_range(start=df['timestamp'].min(), end=df['timestamp'].max(), freq='5Min')
df = df.set_index('timestamp').reindex(r).rename_axis('timestamp').reset_index()
df['user'].ffill(inplace=True)
df['total_flowers'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
leads to the following output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0

How do I use conditional logic with Datetime columns in Pandas?

I have two datetime columns - ColumnA and ColumnB. I want to create a new column - ColumnC, using conditional logic.
Originally, I created ColumnB from a YearMonth column of dates such as 201907, 201908, etc.
When ColumnA is NaN, I want to choose ColumnB.
Otherwise, I want to choose ColumnA.
Currently, my code below is causing ColumnC to have mixed formats. I'm not sure how to get rid of all of those long numbers ending in zeros (they are nanosecond timestamps). I want the whole column to be YYYY-MM-DD.
ID YearMonth ColumnA ColumnB ColumnC
0 1 201712 2017-12-29 2017-12-31 2017-12-29
1 1 201801 2018-01-31 2018-01-31 2018-01-31
2 1 201802 2018-02-28 2018-02-28 2018-02-28
3 1 201806 2018-06-29 2018-06-30 2018-06-29
4 1 201807 2018-07-31 2018-07-31 2018-07-31
5 1 201808 2018-08-31 2018-08-31 2018-08-31
6 1 201809 2018-09-28 2018-09-30 2018-09-28
7 1 201810 2018-10-31 2018-10-31 2018-10-31
8 1 201811 2018-11-30 2018-11-30 2018-11-30
9 1 201812 2018-12-31 2018-12-31 2018-12-31
10 1 201803 NaN 2018-03-31 1522454400000000000
11 1 201804 NaN 2018-04-30 1525046400000000000
12 1 201805 NaN 2018-05-31 1527724800000000000
13 1 201901 NaN 2019-01-31 1548892800000000000
14 1 201902 NaN 2019-02-28 1551312000000000000
15 1 201903 NaN 2019-03-31 1553990400000000000
16 1 201904 NaN 2019-04-30 1556582400000000000
17 1 201905 NaN 2019-05-31 1559260800000000000
18 1 201906 NaN 2019-06-30 1561852800000000000
19 1 201907 NaN 2019-07-31 1564531200000000000
20 1 201908 NaN 2019-08-31 1567209600000000000
21 1 201909 NaN 2019-09-30 1569801600000000000
df['ColumnB'] = pd.to_datetime(df['YearMonth'], format='%Y%m', errors='coerce').dropna() + pd.offsets.MonthEnd(0)
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB'], format='%Y%m%d'), df['ColumnA'])
df['ColumnC'] = np.where(df['ColumnA'].isnull(),df['ColumnB'] , df['ColumnA'])
Just figured it out!
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB']), pd.to_datetime(df['ColumnA']))
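For what it's worth, the same result can usually be obtained without np.where, since Series.fillna (or combine_first) preserves the datetime64 dtype; a minimal sketch, assuming ColumnA and ColumnB are already datetime columns:
# fillna keeps the datetime64 dtype, so no raw nanosecond integers appear
df['ColumnC'] = df['ColumnA'].fillna(df['ColumnB'])
# equivalently: df['ColumnC'] = df['ColumnA'].combine_first(df['ColumnB'])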

Pandas : merge on date and hour from datetime index

I have two data frames like the following: data frame A has a datetime down to the minute, while data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that data frame A has all its rows filled from B based on a match on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of the pandas time series or date functionality?
Use map if you need to append only one column from B to A, with floor to set the minutes and seconds (if any) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
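If you would rather not touch the index at all, merge also accepts Series as join keys, so you can merge directly on the floored timestamps; a small sketch, assuming A and B as above:
# sketch: merge on floored timestamps passed as key Series
# (the shared key shows up as an auto-named 'key_0' column that can be dropped)
merged = A.merge(B,
                 left_on=A['dataDate'].dt.floor('H'),
                 right_on=B['dataDate'].dt.floor('H'),
                 how='left',
                 suffixes=('', '_B'))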

Optimize code to find the median of values of past 30 day for each row in a DataFrame

I'd like to find faster code to achieve the same goal: for each row, compute the median of all data in the past 30 days, but if there are fewer than 5 data points, return np.nan.
import pandas as pd
import numpy as np
import datetime
def findPastVar(df, var='var', window=30, method='median'):
    # window = # of past days
    def findPastVar_apply(row):
        pastVar = df[var].loc[(df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) &
                              (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))]
        if len(pastVar) < 5:
            return np.nan
        if method == 'median':
            return np.median(pastVar.values)
    df['past{}d_{}_median'.format(window, var)] = df.apply(findPastVar_apply, axis=1)
    return df
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
The data looks like this. In my real data there are gaps in time, and there may be more than one data point in a day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, my current code's running speed is:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run this on a large dataframe from time to time, so I want to optimize this code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time-aware rolling, which can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.timestamp.loc[50] = df.timestamp.loc[51]
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling('30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
You can try rolling_median, an O(N log(window)) implementation using a skip list:
pd.rolling_median(df, window=30, min_periods=5)
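Note that pd.rolling_median has since been removed from pandas; the modern count-based equivalent is roughly:
# modern equivalent of the deprecated pd.rolling_median call (count-based window of 30 rows)
df['median'] = df['var'].rolling(window=30, min_periods=5).median()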
