I have a timeseries consisting of a list of dicts as follows:
import pandas as pd

data = []
for i in range(10):
    d = {
        'ts': i,
        'ts_offset': 6 * 60 * 60,
        'value': 1234.0
    }
    if i >= 5:
        d['ts_offset'] = 12 * 60 * 60
    data.append(d)

frame = pd.DataFrame(data)
frame.index = pd.to_datetime(frame.ts, unit='s')
                     ts  ts_offset   value
ts
1970-01-01 00:00:00   0      21600  1234.0
1970-01-01 00:00:01   1      21600  1234.0
1970-01-01 00:00:02   2      21600  1234.0
1970-01-01 00:00:03   3      21600  1234.0
1970-01-01 00:00:04   4      21600  1234.0
1970-01-01 00:00:05   5      43200  1234.0
1970-01-01 00:00:06   6      43200  1234.0
1970-01-01 00:00:07   7      43200  1234.0
1970-01-01 00:00:08   8      43200  1234.0
1970-01-01 00:00:09   9      43200  1234.0
The index is a timestamp plus a localization-dependent offset (in seconds). As you can see, my use case is that the offset may change at any point during the time series. I would like to convert this construct to a series whose index is a localized pd.DatetimeIndex, but so far I was only able to find localization functions that work on the entire index.
Is anybody aware of an efficient method to convert each index entry with a (possibly) different timezone? The series can consist of up to a few thousand rows and this function would be called a lot, so I would like to vectorize as much as possible.
Edit:
I took the liberty of timing FLabs' grouping solution against a simple Python loop with the following script:
import pandas as pd
import numpy as np
import datetime
def to_series1(data, metric):
    idx = []
    values = []
    for i in data:
        tz = datetime.timezone(datetime.timedelta(seconds=i["ts_offset"]))
        idx.append(pd.Timestamp(i["ts"] * 10**9, tzinfo=tz))
        values.append(float(i["value"]))  # np.float was removed from NumPy; the builtin float behaves the same here
    series = pd.Series(values, index=idx, name=metric)
    return series

def to_series2(data, metric):
    frame = pd.DataFrame(data)
    frame.index = pd.to_datetime(frame.ts, unit='s', utc=True)
    grouped = frame.groupby('ts_offset')
    out = {}
    for name, group in grouped:
        out[name] = group
        tz = datetime.timezone(datetime.timedelta(seconds=name))
        out[name].index = out[name].index.tz_convert(tz)
    out = pd.concat(out, axis=0).sort_index(level='ts')
    out.index = out.index.get_level_values('ts')
    series = out.value
    series.name = metric
    series.index.name = None
    return series
metric = 'bla'
data = []
for i in range(100000):
    d = {
        'ts': i,
        'ts_offset': 6 * 60 * 60,
        'value': 1234.0
    }
    if i >= 50000:
        d['ts_offset'] = 12 * 60 * 60
    data.append(d)

%timeit to_series1(data, metric)
%timeit to_series2(data, metric)
The results were as follows:
2.59 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.03 s ± 125 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So I'm still open to suggestions that might be faster.
You can use groupby on ts_offset, so that you can apply a single offset to each group (a vectorised operation):
grouped = frame.groupby('ts_offset')
out = {}
for name, group in grouped:
    print(name)
    out[name] = group
    out[name].index = out[name].index + pd.DateOffset(seconds=name)

out = pd.concat(out, axis=0, names=['offset', 'ts']).sort_index(level='ts')
Showing the applied offset just to verify the results, you have:
Out[17]:
                            ts  ts_offset   value
      ts
21600 1970-01-01 06:00:00    0      21600  1234.0
      1970-01-01 06:00:01    1      21600  1234.0
      1970-01-01 06:00:02    2      21600  1234.0
      1970-01-01 06:00:03    3      21600  1234.0
      1970-01-01 06:00:04    4      21600  1234.0
43200 1970-01-01 12:00:05    5      43200  1234.0
      1970-01-01 12:00:06    6      43200  1234.0
      1970-01-01 12:00:07    7      43200  1234.0
      1970-01-01 12:00:08    8      43200  1234.0
      1970-01-01 12:00:09    9      43200  1234.0
Finally, you can remove the first index:
out.index = out.index.get_level_values('ts')
Problem:
For each row of a DataFrame, I want to find the nearest prior row where the 'Datetime' value is at least 20 seconds before the current 'Datetime' value.
For example: if the previous 'Datetime' (at index i-1) is at least 20s earlier than the current one, it will be chosen. Otherwise (e.g. if it is only 5 seconds earlier), move to i-2 and see if that one is at least 20s earlier. Repeat until the condition is met, or until no such row can be found.
The expected result is a concatenation of the original df and the rows that were found. When no matching row at or more than 20 s before the current Datetime has been found, then the new columns are null (NaT or NaN, depending on the type).
Example data
df = pd.DataFrame({
    'Datetime': pd.to_datetime([
        f'2016-05-15 08:{M_S}+06:00'
        for M_S in ['36:21', '36:41', '36:50', '37:10', '37:19', '37:39']]),
    'A': [21, 43, 54, 2, 54, 67],
    'B': [3, 3, 45, 23, 8, 6],
})
Example result:
>>> res
Datetime A B Datetime_nearest A_nearest B_nearest
0 2016-05-15 08:36:21+06:00 21 3 NaT NaN NaN
1 2016-05-15 08:36:41+06:00 43 3 2016-05-15 08:36:21+06:00 21.0 3.0
2 2016-05-15 08:36:50+06:00 54 45 2016-05-15 08:36:21+06:00 21.0 3.0
3 2016-05-15 08:37:10+06:00 2 23 2016-05-15 08:36:50+06:00 54.0 45.0
4 2016-05-15 08:37:19+06:00 54 8 2016-05-15 08:36:50+06:00 54.0 45.0
5 2016-05-15 08:37:39+06:00 67 6 2016-05-15 08:37:19+06:00 54.0 8.0
The last three columns are the newly created columns, and the first three columns are the original dataset.
Two vectorized solutions
Note: we assume that the rows are sorted by Datetime. If that is not the case, then sort them first (O[n log n]).
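If sorting is needed, a minimal one-liner would do (a sketch, assuming a throwaway integer index is acceptable):

df = df.sort_values('Datetime', ignore_index=True)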
For 10,000 rows:
3.3 ms, using Numpy's searchsorted.
401 ms, using a rolling window of 20s, left-open.
1. Using np.searchsorted
We use np.searchsorted to find in one call the indices of all matching rows:
import numpy as np

min_dt = '20s'  # the minimum required gap (defined here so the snippet runs on its own)
s = df['Datetime']
z = np.searchsorted(s, s - (pd.Timedelta(min_dt) - pd.Timedelta('1ns'))) - 1
E.g., for the OP's data, these indices are:
>>> z
array([-1, 0, 0, 2, 2, 4])
I.e.: z[0] == -1: no matching row; z[1] == 0: row 0 (08:36:21) is the nearest that is 20s or more before row 1 (08:36:41). z[2] == 0: row 0 is the nearest match for row 2 (row 1 is too close). Etc.
Why subtract 1? We use np.searchsorted to find the first row inside the exclusion zone (i.e., too close); we then subtract 1 to get the correct row (the nearest one at least 20s before).
Why - 1ns? This is to make the search window left-open. A row at exactly 20s before the current one will not be in the exclusion zone, and thus will end up being the one selected as the match.
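For intuition, here is the same computation on plain integers (an illustrative sketch, not part of the benchmark): the OP's timestamps expressed as seconds since the first row. Subtracting 20 minus a tiny epsilon keeps a row that is exactly 20 s earlier out of the exclusion zone, so it can be selected as the match.

# The OP's timestamps (08:36:21, 08:36:41, ...) as seconds since the first row
t = np.array([0, 20, 29, 49, 58, 78])
z = np.searchsorted(t, t - (20 - 1e-9)) - 1
print(z)   # [-1  0  0  2  2  4]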
We then use z to select the matching rows (or nulls) and concatenate into the result. Putting it all in a function:
def select_np(df, min_dt='20s'):
    newcols = [f'{k}_nearest' for k in df.columns]
    s = df['Datetime']
    z = np.searchsorted(s, s - (pd.Timedelta(min_dt) - pd.Timedelta('1ns'))) - 1
    return pd.concat([
        df,
        df.iloc[z].set_axis(newcols, axis=1).reset_index(drop=True).where(pd.Series(z >= 0))
    ], axis=1)
On the OP's example
>>> select_np(df[['Datetime', 'A', 'B']])
Datetime A B Datetime_nearest A_nearest B_nearest
0 2016-05-15 08:36:21+06:00 21 3 NaT NaN NaN
1 2016-05-15 08:36:41+06:00 43 3 2016-05-15 08:36:21+06:00 21.0 3.0
2 2016-05-15 08:36:50+06:00 54 45 2016-05-15 08:36:21+06:00 21.0 3.0
3 2016-05-15 08:37:10+06:00 2 23 2016-05-15 08:36:50+06:00 54.0 45.0
4 2016-05-15 08:37:19+06:00 54 8 2016-05-15 08:36:50+06:00 54.0 45.0
5 2016-05-15 08:37:39+06:00 67 6 2016-05-15 08:37:19+06:00 54.0 8.0
2. Using a rolling window (pure Pandas)
This was our original solution and uses pandas rolling with a Timedelta(20s) window size, left-open. It is still more optimized than a naive (O[n^2]) search, but is roughly 100x slower than select_np(), as pandas uses explicit loops in Python to find the window bounds for .rolling(): see get_window_bounds(). There is also some overhead due to having to make sub-frames, applying a function or aggregate, etc.
def select_pd(df, min_dt='20s'):
    newcols = [f'{k}_nearest' for k in df.columns]
    z = (
        df.assign(rownum=range(len(df)))
        .rolling(pd.Timedelta(min_dt), on='Datetime', closed='right')['rownum']
        .apply(min).astype(int) - 1
    )
    return pd.concat([
        df,
        df.iloc[z].set_axis(newcols, axis=1).reset_index(drop=True).where(z >= 0)
    ], axis=1)
3. Testing
First, we write an arbitrary-size test data generator:
def gen(n):
    return pd.DataFrame({
        'Datetime': pd.Timestamp('2020') +
            np.random.randint(0, 30, n).cumsum() * pd.Timedelta('1s'),
        'A': np.random.randint(0, 100, n),
        'B': np.random.randint(0, 100, n),
    })
Example
np.random.seed(0)
tdf = gen(10)
>>> select_np(tdf)
Datetime A B Datetime_nearest A_nearest B_nearest
0 2020-01-01 00:00:12 21 87 NaT NaN NaN
1 2020-01-01 00:00:27 36 46 NaT NaN NaN
2 2020-01-01 00:00:48 87 88 2020-01-01 00:00:27 36.0 46.0
3 2020-01-01 00:00:48 70 81 2020-01-01 00:00:27 36.0 46.0
4 2020-01-01 00:00:51 88 37 2020-01-01 00:00:27 36.0 46.0
5 2020-01-01 00:01:18 88 25 2020-01-01 00:00:51 88.0 37.0
6 2020-01-01 00:01:21 12 77 2020-01-01 00:00:51 88.0 37.0
7 2020-01-01 00:01:28 58 72 2020-01-01 00:00:51 88.0 37.0
8 2020-01-01 00:01:37 65 9 2020-01-01 00:00:51 88.0 37.0
9 2020-01-01 00:01:56 39 20 2020-01-01 00:01:28 58.0 72.0
Speed
tdf = gen(10_000)
%timeit select_np(tdf)
3.31 ms ± 6.79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit select_pd(tdf)
401 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> select_np(tdf).equals(select_pd(tdf))
True
Scale sweep
We can now compare speed over a range of sizes, using the excellent perfplot package:
import perfplot

perfplot.plot(
    setup=gen,
    kernels=[select_np, select_pd],
    n_range=[2**k for k in range(4, 16)],
    equality_check=lambda a, b: a.equals(b),
)
Focusing on select_np:
perfplot.plot(
    setup=gen,
    kernels=[select_np],
    n_range=[2**k for k in range(4, 24)],
)
The following solution is memory-efficient but it is not the fastest one (because it uses iteration over rows).
The fully vectorized version (that I could think of on my own) would be faster but it would use O(n^2) memory.
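For comparison, here is a sketch of what such a fully vectorised O(n^2)-memory variant could look like (the function name nearest_prior_broadcast is hypothetical, and a sorted, tz-naive 'Datetime' column is assumed, as in the example below):

def nearest_prior_broadcast(df, min_dt=pd.Timedelta(20, unit='s')):
    # Pairwise differences diff[i, j] = Datetime[i] - Datetime[j]: O(n^2) memory.
    t = df['Datetime'].to_numpy()
    ok = (t[:, None] - t[None, :]) >= min_dt.to_timedelta64()  # row j is >= 20 s before row i
    has_match = ok.any(axis=1)
    # Position of the last qualifying row for each row; rows without a match are masked out below.
    idx = len(t) - 1 - ok[:, ::-1].argmax(axis=1)
    nearest = df.iloc[idx].add_suffix('_nearest').reset_index(drop=True)
    return pd.concat([df.reset_index(drop=True),
                      nearest.where(pd.Series(has_match))], axis=1)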
Example dataframe:
timestamps = [pd.Timestamp('2016-01-01 00:00:00'),
              pd.Timestamp('2016-01-01 00:00:19'),
              pd.Timestamp('2016-01-01 00:00:20'),
              pd.Timestamp('2016-01-01 00:00:21'),
              pd.Timestamp('2016-01-01 00:00:50')]

df = pd.DataFrame({'Datetime': timestamps,
                   'A': np.arange(10, 15),
                   'B': np.arange(20, 25)})
              Datetime   A   B
0  2016-01-01 00:00:00  10  20
1  2016-01-01 00:00:19  11  21
2  2016-01-01 00:00:20  12  22
3  2016-01-01 00:00:21  13  23
4  2016-01-01 00:00:50  14  24
Solution:
times = df['Datetime'].to_numpy() # it's convenient to have it as an `ndarray`
shifted_times = times - pd.Timedelta(20, unit='s')
useful is a list of "useful" indices of df - i.e. where the appended values will NOT be nan:
useful = np.nonzero(shifted_times >= times[0])[0]
# useful == [2, 3, 4]
Truncate shifted_times from the beginning - to iterate through useful elements only:
if len(useful) == 0:
    # all new columns will be `nan`s
    first_i = 0  # this value will never actually be used
    useful_shifted_times = np.array([], dtype=shifted_times.dtype)
else:
    first_i = useful[0]  # first_i == 2
    useful_shifted_times = shifted_times[first_i:]
Find the corresponding index positions of df for each "useful" value.
(these index positions are essentially the indices of times
that are selected for each element of useful_shifted_times):
selected_indices = []
# Iterate through `useful_shifted_times` one by one
# (`i` starts at `first_i`):
for i, shifted_time in enumerate(useful_shifted_times, first_i):
    selected_index = np.nonzero(times[:i] <= shifted_time)[0][-1]
    selected_indices.append(selected_index)
# selected_indices == [0, 0, 3]
Selected rows:
df_nearest = df.iloc[selected_indices].add_suffix('_nearest')
      Datetime_nearest  A_nearest  B_nearest
0  2016-01-01 00:00:00         10         20
0  2016-01-01 00:00:00         10         20
3  2016-01-01 00:00:21         13         23
Replace indices of df_nearest to match those of the corresponding rows of df.
(basically, that is the last len(selected_indices) indices):
df_nearest.index = df.index[len(df) - len(selected_indices) : ]
      Datetime_nearest  A_nearest  B_nearest
2  2016-01-01 00:00:00         10         20
3  2016-01-01 00:00:00         10         20
4  2016-01-01 00:00:21         13         23
Append the selected rows to the original dataframe to get the final result:
new_df = df.join(df_nearest)
              Datetime   A   B     Datetime_nearest  A_nearest  B_nearest
0  2016-01-01 00:00:00  10  20                  NaT        nan        nan
1  2016-01-01 00:00:19  11  21                  NaT        nan        nan
2  2016-01-01 00:00:20  12  22  2016-01-01 00:00:00         10         20
3  2016-01-01 00:00:21  13  23  2016-01-01 00:00:00         10         20
4  2016-01-01 00:00:50  14  24  2016-01-01 00:00:21         13         23
Note: NaT stands for 'Not a Time'. It is the equivalent of nan for time values.
Note: it also works as expected even when the last 'Datetime' minus 20 seconds is still before the very first 'Datetime': in that case all new columns will be nans.
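To check that edge case, the steps above can be wrapped into a function (the name nearest_prior_iter is hypothetical) and run on a frame where every row is within 20 seconds of the first one:

def nearest_prior_iter(df, min_dt=pd.Timedelta(20, unit='s')):
    # Same steps as above, collected into one function for convenience.
    times = df['Datetime'].to_numpy()
    shifted_times = times - min_dt
    useful = np.nonzero(shifted_times >= times[0])[0]
    if len(useful) == 0:
        first_i = 0
        useful_shifted_times = np.array([], dtype=shifted_times.dtype)
    else:
        first_i = useful[0]
        useful_shifted_times = shifted_times[first_i:]
    selected_indices = []
    for i, shifted_time in enumerate(useful_shifted_times, first_i):
        selected_indices.append(np.nonzero(times[:i] <= shifted_time)[0][-1])
    df_nearest = df.iloc[selected_indices].add_suffix('_nearest')
    df_nearest.index = df.index[len(df) - len(selected_indices):]
    return df.join(df_nearest)

# Both rows are within 20 seconds of each other, so every *_nearest column is null.
small = pd.DataFrame({'Datetime': [pd.Timestamp('2016-01-01 00:00:00'),
                                   pd.Timestamp('2016-01-01 00:00:10')],
                      'A': [1, 2], 'B': [3, 4]})
print(nearest_prior_iter(small))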
I have a pandas dataframe which looks like this:
    Concentr 1  Concentr 2      Time
0         25.4        0.48  00:01:00
1         26.5        0.49  00:02:00
2         25.2        0.52  00:03:00
3         23.7        0.49  00:04:00
4         23.8        0.55  00:05:00
5         24.6        0.53  00:06:00
6         26.3        0.57  00:07:00
7         27.1        0.59  00:08:00
8         28.8        0.56  00:09:00
9         23.9        0.54  00:10:00
10        25.6        0.49  00:11:00
11        27.5        0.56  00:12:00
12        26.3        0.55  00:13:00
13        25.3        0.54  00:14:00
and I want to keep the max value of Concentr 1 of every 5-minute interval, along with the time it occurred and the value of Concentr 2 at that time. So, for the previous example I would like to have:
   Concentr 1  Concentr 2      Time
0        26.5        0.49  00:02:00
1        28.8        0.56  00:09:00
2        27.5        0.56  00:12:00
My current approach would be i) to create an auxiliary variable with an ID for each 5-min interval (e.g. 00:00 to 00:05 would be interval 1, 00:05 to 00:10 would be interval 2, etc.), ii) use the interval variable in a groupby to get the max Concentr 1 per interval, and iii) merge back to the initial df using both the interval variable and the Concentr 1 value, thus identifying the corresponding time.
I would like to ask if there is a better / more efficient / more elegant way to do it.
Thank you very much for any help.
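For reference, a minimal sketch of the three-step approach described above might look like this (the interval column is a hypothetical helper, and the buckets follow pandas' default left-closed 5-minute bins):

df['Time'] = pd.to_datetime(df['Time'])
df['interval'] = df['Time'].dt.floor('5T')                          # i) 5-minute interval id
peaks = df.groupby('interval', as_index=False)['Concentr 1'].max()  # ii) max Concentr 1 per interval
result = peaks.merge(df, on=['interval', 'Concentr 1'], how='left') # iii) recover Time and Concentr 2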
You can do a regular resample / groupby, and use the idxmax method to get the desired row for each group. Then use that to index your original data:
>>> df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
    Concentr 1  Concentr 2                Time
1         26.5        0.49 2021-10-09 00:02:00
8         28.8        0.56 2021-10-09 00:09:00
11        27.5        0.56 2021-10-09 00:12:00
This is assuming your 'Time' column is datetime like, which I did with pd.to_datetime. You can convert the time column back with strftime. So in full:
df['Time'] = pd.to_datetime(df['Time'])
result = df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
result['Time'] = result['Time'].dt.strftime('%H:%M:%S')
Giving:
    Concentr 1  Concentr 2      Time
1         26.5        0.49  00:02:00
8         28.8        0.56  00:09:00
11        27.5        0.56  00:12:00
df = df.set_index('Time')
idx = df.resample('5T').agg({'Concentr 1': 'idxmax'})
df = df.loc[idx['Concentr 1']]
Then you would probably need to reset_index() if you do not wish Time to be your index.
You can also use this:
Group by every n=5 rows and filter the original df based on the index of the max of "Concentr 1" in each chunk (this assumes exactly one row per minute, as in the example):
df = df[df.index.isin(df.groupby(df.index // 5)["Concentr 1"].idxmax())]
print(df)
Output:
    Concentr 1  Concentr 2      Time
1         26.5        0.49  00:02:00
8         28.8        0.56  00:09:00
11        27.5        0.56  00:12:00
I have a pandas DataFrame of the form:
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 2014-01-01 00:00:00
I am interested in only the year, month and day in the birth column of the dataframe. I tried to use the pandas datetime conversion, but it resulted in an error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1054-02-07 00:00:00
The birth column is an object dtype.
My guess would be that it is an incorrect date. I would not like to pass the parameter errors="coerce" into the to_datetime method, because each item is important and I need just the YYYY-MM-DD.
I tried to use a regex via the pandas string methods:
df["birth"].str.find("(\d{4})-(\d{2})-(\d{2})")
But this is returning NaNs. How can I resolve this?
Thanks
Because it is not possible to convert to datetimes, you can split by the first whitespace and then select the first value:
df['birth'] = df['birth'].str.split().str[0]
And then, if necessary, convert to periods, which can represent out-of-bounds spans.
print (df)
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 0-01-01 00:00:00
def to_per(x):
    splitted = x.split('-')
    return pd.Period(year=int(splitted[0]),
                     month=int(splitted[1]),
                     day=int(splitted[2]), freq='D')

df['birth'] = df['birth'].str.split().str[0].apply(to_per)
print (df)
id amount birth
0 4 78.0 1980-02-02
1 5 24.0 1989-03-03
2 6 49.5 2014-01-01
3 7 34.0 2014-01-01
4 8 49.5 0000-01-01
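As an aside (a sketch, not part of the original answer): the regex idea from the question does not work with str.find, which searches for a literal substring, but it does work with str.extract applied to the original string column:

# \d{1,4} also tolerates the short year in the 0-01-01 row above.
df['birth'] = df['birth'].str.extract(r'(\d{1,4}-\d{2}-\d{2})', expand=False)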
I currently have a dataframe where a UniqueID has multiple dates in another column. I want to extract the hours between consecutive dates, but ignore the weekend if the next date is after the weekend. For example, if one date is Friday at 12 pm
and the following date is Tuesday at 12 pm, then the difference in hours between these two dates would be 48 hours.
Here is my dataset with the expected output:
df = pd.DataFrame({"UniqueID": ["A","A","A","B","B","B","C","C"],
                   "Date": ["2018-12-07 10:30:00","2018-12-10 14:30:00","2018-12-11 17:30:00",
                            "2018-12-14 09:00:00","2018-12-18 09:00:00",
                            "2018-12-21 11:00:00","2019-01-01 15:00:00","2019-01-07 15:00:00"],
                   "ExpectedOutput": ["28.0","27.0","NaN","48.0","74.0","NaN","96.0","NaN"]})
df["Date"] = df["Date"].astype("datetime64[ns]")
This is what I have so far, but it includes the weekends:
df["date_diff"] = df.groupby(["UniqueID"])["Date"].apply(
    lambda x: x.diff() / np.timedelta64(1, 'h')).shift(-1)
Thanks!
The idea is to floor the datetimes to remove the time component, compute the number of business days between the start day + one day and the shifted day into the hours3 column with numpy.busday_count, and then create hours1 and hours2 columns with the partial hours on the start and end days (only when those days are not weekends). Finally, sum all the hours columns together:
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(['UniqueID','Date'])
df["shifted"] = df.groupby(["UniqueID"])["Date"].shift(-1)
df["hours1"] = df["Date"].dt.floor('d')
df["hours2"] = df["shifted"].dt.floor('d')
mask = df['shifted'].notnull()
f = lambda x: np.busday_count(x['hours1'] + pd.Timedelta(1, unit='d'), x['hours2'])
df.loc[mask, 'hours3'] = df[mask].apply(f, axis=1) * 24
mask1 = df['hours1'].dt.dayofweek < 5
hours1 = df['hours1'] + pd.Timedelta(1, unit='d') - df['Date']
df['hours1'] = np.where(mask1, hours1, np.nan) / np.timedelta64(1 ,'h')
mask1 = df['hours2'].dt.dayofweek < 5
df['hours2'] = np.where(mask1, df['shifted']-df['hours2'], np.nan) / np.timedelta64(1 ,'h')
df['date_diff'] = df['hours1'].fillna(0) + df['hours2'] + df['hours3']
print (df)
UniqueID Date ExpectedOutput shifted hours1 \
0 A 2018-12-07 10:30:00 28.0 2018-12-10 14:30:00 13.5
1 A 2018-12-10 14:30:00 27.0 2018-12-11 17:30:00 9.5
2 A 2018-12-11 17:30:00 NaN NaT 6.5
3 B 2018-12-14 09:00:00 48.0 2018-12-18 09:00:00 15.0
4 B 2018-12-18 09:00:00 74.0 2018-12-21 11:00:00 15.0
5 B 2018-12-21 11:00:00 NaN NaT 13.0
6 C 2019-01-01 15:00:00 96.0 2019-01-07 15:00:00 9.0
7 C 2019-01-07 15:00:00 NaN NaT 9.0
hours2 hours3 date_diff
0 14.5 0.0 28.0
1 17.5 0.0 27.0
2 NaN NaN NaN
3 9.0 24.0 48.0
4 11.0 48.0 74.0
5 NaN NaN NaN
6 15.0 72.0 96.0
7 NaN NaN NaN
The first solution was removed for 2 reasons - it was not accurate and it was slow:
np.random.seed(2019)
dates = pd.date_range('2015-01-01','2018-01-01', freq='H')
df = pd.DataFrame({"UniqueID": np.random.choice(list('ABCDEFGHIJ'), size=100),
"Date": np.random.choice(dates, size=100)})
print (df)
def old(df):
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.sort_values(['UniqueID','Date'])
    df["shifted"] = df.groupby(["UniqueID"])["Date"].shift(-1)

    def f(x):
        a = pd.date_range(x['Date'], x['shifted'], freq='T')
        return ((a.dayofweek < 5).sum() / 60).round()

    mask = df['shifted'].notnull()
    df.loc[mask, 'date_diff'] = df[mask].apply(f, axis=1)
    return df

def new(df):
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.sort_values(['UniqueID','Date'])
    df["shifted"] = df.groupby(["UniqueID"])["Date"].shift(-1)
    df["hours1"] = df["Date"].dt.floor('d')
    df["hours2"] = df["shifted"].dt.floor('d')

    mask = df['shifted'].notnull()
    f = lambda x: np.busday_count(x['hours1'] + pd.Timedelta(1, unit='d'), x['hours2'])
    df.loc[mask, 'hours3'] = df[mask].apply(f, axis=1) * 24

    mask1 = df['hours1'].dt.dayofweek < 5
    hours1 = df['hours1'] + pd.Timedelta(1, unit='d') - df['Date']
    df['hours1'] = np.where(mask1, hours1, np.nan) / np.timedelta64(1, 'h')

    mask1 = df['hours2'].dt.dayofweek < 5
    df['hours2'] = np.where(mask1, df['shifted'] - df['hours2'], np.nan) / np.timedelta64(1, 'h')

    df['date_diff'] = df['hours1'].fillna(0) + df['hours2'] + df['hours3']
    return df
print (new(df))
print (old(df))
In [44]: %timeit (new(df))
22.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [45]: %timeit (old(df))
1.01 s ± 8.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'd like to find faster code to achieve the same goal: for each row, compute the median of all data in the past 30 days. But if there are fewer than 5 data points, return np.nan.
import pandas as pd
import numpy as np
import datetime
def findPastVar(df, var='var', window=30, method='median'):
    # window = # of past days
    def findPastVar_apply(row):
        pastVar = df[var].loc[(df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) &
                              (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))]
        if len(pastVar) < 5:
            return(np.nan)
        if method == 'median':
            return(np.median(pastVar.values))
    df['past{}d_{}_median'.format(window, var)] = df.apply(findPastVar_apply, axis=1)
    return(df)

df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
Data looks like this. In my real data, there are gaps in time and maybe more data points in one day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, this is the running speed of my current code:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run a large dataframe from time to time, so I want to optimize this code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time aware rolling. It can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.timestamp.loc[50] = df.timestamp.loc[51]
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling(
    '30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
You can try rolling_median, an O(N log(window)) implementation using a skip list:
pd.rolling_median(df, window=30, min_periods=5)
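Note that pd.rolling_median has since been removed from pandas; the equivalent spelling with the newer rolling API (a row-count window of 30 observations, assuming the median is wanted on the var column) is:

df['var'].rolling(window=30, min_periods=5).median()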