I need to count events on a pandas dataframe using rolling windows with overlaps.
In particular, I have a dataframe with discontinuous events in time, like this:
Ma
2000-01-04 2.2
2000-01-05 2.6
2000-01-06 3.1
2000-01-16 2.4
2000-01-22 2.1
2000-01-27 2.5
2000-02-12 2.3
2000-02-19 3.5
2000-02-21 2.4
2000-02-27 2.4
and I want to count how many events occurred in time windows of 10 days with an overlap of 5 days.
This is the result I'm looking for:
Events
from 2000-01-04 to 2000-01-14 3
from 2000-01-09 to 2000-01-19 1
from 2000-01-14 to 2000-01-24 2
from 2000-01-19 to 2000-01-29 2
Have you got any suggestions?
I tried to use groupby, but I can only count data in non-overlapping windows, using this line: df.groupby(pd.DatetimeIndex(df.Time).to_period("10d")).size()
I also tried Rolling.count from pandas, but again without success.
My approach is to create a data frame with a 5-day date interval based on the minimum and maximum dates in the original data frame.
Based on that, for each row i I count the items whose dates fall between the date at row i and the date at row i+2 (10 days later), collecting the counts in a loop. Finally, I put the results in a data frame. I'm sure there are better approaches.
import pandas as pd
import numpy as np
import io
data = '''
date Ma
2000-01-04 2.2
2000-01-05 2.6
2000-01-06 3.1
2000-01-16 2.4
2000-01-22 2.1
2000-01-27 2.5
2000-02-12 2.3
2000-02-19 3.5
2000-02-21 2.4
2000-02-27 2.4
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['date'] = pd.to_datetime(df['date'])
df2 = pd.DataFrame({'date':pd.date_range(df['date'].min(), df['date'].max(), freq='5D')})
tmp = []
for i in range(len(df2)):
    try:
        # Count events between the window start (row i) and its end (row i+2, 10 days later).
        cnt = len(df[(df['date'] >= df2.date[i]) & (df['date'] <= df2.date[i + 2])])
        fromto = 'from' + df2.date[i].strftime('%Y-%m-%d') + 'to' + df2.date[i + 2].strftime('%Y-%m-%d')
        tmp.append([fromto, cnt])
    except KeyError:
        # The window end ran past the last interval date.
        break
df3 = pd.DataFrame(tmp, columns=['FromTo', 'Events'])
df3
FromTo Events
0 from2000-01-04to2000-01-14 3
1 from2000-01-09to2000-01-19 1
2 from2000-01-14to2000-01-24 2
3 from2000-01-19to2000-01-29 2
4 from2000-01-24to2000-02-03 1
5 from2000-01-29to2000-02-08 0
6 from2000-02-03to2000-02-13 1
7 from2000-02-08to2000-02-18 1
8 from2000-02-13to2000-02-23 2
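For reference, here is a more vectorized sketch of the same counting step (my addition, not part of the answer above), assuming the df parsed earlier and using np.searchsorted; it reproduces the loop's counts and also keeps the trailing windows that the loop drops:
# Window starts every 5 days; each window spans 10 days, inclusive at both
# ends, matching the loop above.
starts = pd.date_range(df['date'].min(), df['date'].max(), freq='5D')
ends = starts + pd.Timedelta(days=10)
dates = np.sort(df['date'].values)
# For each window, count events with start <= date <= end.
counts = (np.searchsorted(dates, ends.values, side='right')
          - np.searchsorted(dates, starts.values, side='left'))
out = pd.DataFrame({'From': starts, 'To': ends, 'Events': counts})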
I am trying to combine two pandas dataframes with different timeframes. The first dataframe has a 1 hour timeseries, and the second dataframe has a 1 minute timeseries.
1 hour dataframe
get_time value
0 1599739200 123.10
1 1599742800 136.24
2 1599750000 224.14
1 minute dataframe
get_time value
0 1599739200 2.11
1 1599739260 3.11
2 1599739320 3.12
3 1599742800 4.23
4 1599742860 2.22
5 1599742920 1.11
6 1599746400 7.24
7 1599746460 22.10
8 1599746520 2.13
9 1599750000 5.14
10 1599750060 12.10
11 1599750120 21.30
I want to combine those two dataframes so that the value of the 1 hour dataframe is mapped onto the 1 minute dataframe. If there is no 1 hour value, the mapped value will be NaN.
Desired Result:
get_time value 1h mapped value
0 1599739200 2.11 123.10
1 1599739260 3.11 123.10
2 1599739320 3.12 123.10
3 1599742800 4.23 136.24
4 1599742860 2.22 136.24
5 1599742920 1.11 136.24
6 1599746400 7.24 NaN
7 1599746460 22.10 NaN
8 1599746520 2.13 NaN
9 1599750000 5.14 224.14
10 1599750060 12.10 224.14
11 1599750120 21.30 224.14
Basically I want to combine those dataframes with this logic:
if (1m_get_time >= 1h_get_time) and (1m_get_time < 1h_get_time + 60 minutes):
    1h mapped value = 1h value
else:
    1h mapped value = nan
Currently I use a recursive method, but it takes a long time on large data. Here is an example of the dataframes:
dfhigh_ = pd.DataFrame({
    'get_time': [1599739200, 1599742800, 1599750000],
    'value': [123.1, 136.24, 224.14],
})
dflow_ = pd.DataFrame({
    'get_time': [1599739200, 1599739260, 1599739320, 1599742800, 1599742860, 1599742920, 1599746400, 1599746460, 1599746520, 1599750000, 1599750060, 1599750120],
    'value': [2.11, 3.11, 3.12, 4.23, 2.22, 1.11, 7.24, 22.1, 2.13, 5.14, 12.1, 21.3],
})
Floor the get_time from dflow_ down to the nearest hour, then use Series.map to map the values from dfhigh_ onto dflow_ based on this floored timestamp:
hr = dflow_['get_time'] // 3600 * 3600
dflow_['mapped_value'] = hr.map(dfhigh_.set_index('get_time')['value'])
get_time value mapped_value
0 1599739200 2.11 123.10
1 1599739260 3.11 123.10
2 1599739320 3.12 123.10
3 1599742800 4.23 136.24
4 1599742860 2.22 136.24
5 1599742920 1.11 136.24
6 1599746400 7.24 NaN
7 1599746460 22.10 NaN
8 1599746520 2.13 NaN
9 1599750000 5.14 224.14
10 1599750060 12.10 224.14
11 1599750120 21.30 224.14
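An alternative sketch (my addition, not from the answer above) uses pd.merge_asof with a tolerance; it also leaves NaN where no hourly row falls within the same hour, and assumes both frames are sorted by get_time:
merged = pd.merge_asof(
    dflow_,
    dfhigh_.rename(columns={'value': 'mapped_value'}),
    on='get_time',
    direction='backward',  # most recent hourly row at or before each minute
    tolerance=3599)        # ...but only if it lies within the same hour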
This should work (for edge cases as well):
import pandas as pd
from datetime import datetime
import numpy as np
dfhigh_ = dfhigh_.rename(columns={'value': '1h mapped value'})
df_new = pd.merge(dflow_, dfhigh_, how='outer', on=['get_time'])
df_new.get_time = [datetime.fromtimestamp(x) for x in df_new['get_time']]
for idx, row in df_new.iterrows():
    if not np.isnan(row['1h mapped value']):
        # Spread this hourly value to every minute row in the same hour that is still NaN.
        # (Note: matching on dt.hour alone assumes the data stays within a single day.)
        current_hour, current_1h_mapped_value = row['get_time'].hour, row['1h mapped value']
        for sub_idx, sub_row in df_new.loc[(df_new.get_time.dt.hour == current_hour) & np.isnan(df_new['1h mapped value'])].iterrows():
            df_new.loc[sub_idx, '1h mapped value'] = current_1h_mapped_value
I need to combine the data series rateScore and rate into one.
This is the current DataFrame I have
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series, and for each row the index number needs to stay the same.
You can try the following with fillna(), redefining the rateScore column and then dropping rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from a second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
Let us do add with fill_value:
df['rateScore'] = df['rateScore'].add(df['rate'],fill_value=0)
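The first and third answers add the two columns, so they rely on a row never having both values filled in; combine_first simply prefers rateScore where it exists. A minimal end-to-end sketch (my addition) with sample rows from the question, using combine_first and then dropping rate:
import numpy as np
import pandas as pd
df = pd.DataFrame({'rateScore': [np.nan, 2.5, 4.5, np.nan],
                   'rate': [4.5, np.nan, np.nan, 5.0]},
                  index=[10, 11, 12, 13])
# Prefer rateScore where present, fall back to rate; the index is preserved.
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
df = df.drop(columns='rate')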
I have a pandas dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
A missing date will take the tag of the previous one. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close. You just need to set the dataframe's index to ref_date, reindex it to the business-month-end index while specifying ffill as the method, then reset the index and rename back to the original:
# First ensure the dates are pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# Create a business-month-end index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
# Set ref_date as the index, reindex to the monthly index with forward fill,
# then restore the original column layout.
(df
 .set_index('ref_date')
 .reindex(idx_monthly, method='ffill')
 .reset_index()
 .rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
Thanks to the previous person who answered this question but then deleted their answer, I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = (df.set_index('ref_date')
        .reindex(idx)
        .ffill()
        .reset_index()
        .rename(columns={'index': 'ref_date'}))
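For what it's worth, since every ref_date already falls on a business month end, the same result can be sketched more compactly with asfreq (my variation, not from the answers above; in pandas >= 2.2 the 'BM' alias is spelled 'BME'):
df['ref_date'] = pd.to_datetime(df['ref_date'])
df = (df.set_index('ref_date')
        .asfreq('BM', method='ffill')  # fill the missing month ends forward
        .reset_index())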
I have two dataframes: one has multiple levels of columns, and the other has only a single level of columns (which is the first level of the first dataframe; that is, the second dataframe is calculated by grouping the first).
These two dataframes look like the following:
first dataframe: df1 (shown as an image in the original post)
second dataframe: df2 (shown as an image in the original post)
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little. For example, during the five days (the window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 occurred on 2016/2/24, and the index of 2016/2/24 within the five-day window is 1. So, in result1, the value for stock sh600870 on 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as an example: sh600870 is in the sector '家用电器视听器材白色家电' (household appliances / AV equipment / white goods). So for 2016/2/29, I want to get the sector price on 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling window. To make the index relative to df1, add the index of the left edge of the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore may be more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so that now you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)  # in pandas >= 1.0, passing raw=True is faster
shift = (pd.Series(np.arange(len(df1)))
           .rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
I want to perform a moving window linear fit to the columns in my dataframe.
n = 5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
B A
2000-01-01 1.9 3.2
2000-01-02 2.3 1.3
2000-01-03 4.4 5.6
2000-01-04 5.6 9.4
2000-01-05 7.3 10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rows, and so on. The same goes for column A. I am only interested in the slope of each fit, so at the end I want a new dataframe with the entries above replaced by the rolling slopes.
After doing
df.reset_index()
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
EDIT:
I added a new column
df['i'] = range(0,len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling',window=3)
(it gives an error for window=2)
I am not understanding this well, because I was expecting a series of numbers but I get just one result:
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
EDIT 2:
Now I understand better what is going on. I can access the different values of the fits using
model.beta
I haven't tried it out, but I don't think you need to specify window_type='rolling': if you set window to something, window_type will automatically be set to rolling.
Source.
I have problems doing this with the DatetimeIndex you created with pd.date_range, and find datetimes a confusing pain to work with in general due to the number of types out there and apparent incompatibility between APIs. Here's how I would do it if the date were an integer (e.g. days since 12/31/99, or years) or float in your example. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your date column with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
date B A
0 1 1.9 3.2
1 2 2.3 1.3
2 3 4.4 5.6
3 4 5.6 9.4
4 5 7.3 10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
df_dbl = pd.concat([df, df]).sort_index()  # .sort() in older pandas
df_dbl = df_dbl.iloc[1:-1] # removes the first and last row
date B A
0 1 1.9 3.2 # this record is removed
0 1 1.9 3.2
1 2 2.3 1.3
1 2 2.3 1.3
2 3 4.4 5.6
2 3 4.4 5.6
3 4 5.6 9.4
3 4 5.6 9.4
4 5 7.3 10.4
4 5 7.3 10.4 # this record is removed
# Shift the index down by one so each index value pairs a row with its successor.
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c
date B A
1 1 1.9 3.2
1 2 2.3 1.3
2 2 2.3 1.3
2 3 4.4 5.6
3 3 4.4 5.6
3 4 5.6 9.4
4 4 5.6 9.4
4 5 7.3 10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
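The grouped fit itself wasn't shown; here is a minimal sketch of that step (my reconstruction, assuming df_dbl as built above and fitting column B against date), which yields the slopes below:
from scipy.stats import linregress
# Each index value now labels a pair of consecutive rows; fit each pair and keep the slope.
slopes = df_dbl.groupby(df_dbl.index).apply(
    lambda g: linregress(g['date'], g['B']).slope)
print(slopes)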
1 0.4
2 2.1
3 1.2
4 1.7