Vectorized count of daily longest consecutive streak - python

To evaluate the daily longest consecutive runtime of a power plant, I need the longest streak per day, where each day is treated as a separate time frame.
So let's say I've got the power output in the Series df:
df = pd.Series(
data=[
*np.zeros(4), *(np.full(24*5, 19.5) + np.random.rand(24*5)),
*np.zeros(4), *(np.full(8, 19.5) + np.random.rand(8)),
*np.zeros(5), *(np.full(24, 19.5) + np.random.rand(24)),
*np.zeros(27), *(np.full(24, 19.5) + np.random.rand(24))],
index=pd.date_range(start='2019-07-01 00:00:00', periods=9*24, freq='1h'))
The "cutoff power" is 1 (everything below that is considered off). I use it to mask the "on" values, then shift the mask and compare it to itself so that every run of consecutive values gets its own group number. Finally, I group those group numbers by the date in the index and count the values of each run per day, which gives consec_group:
mask = df > 1
groups = mask.ne(mask.shift()).cumsum()
consec_group = groups[mask].groupby(groups[mask].index.date).value_counts()
Which yields:
consec_group
Out[3]:
2019-07-01 2 20
2019-07-02 2 24
2019-07-03 2 24
2019-07-04 2 24
2019-07-05 2 24
2019-07-06 4 8
2 4
6 3
2019-07-07 6 21
2019-07-09 8 24
dtype: int64
But I'd like to have the maximum streak length for each day, and dates without any runtime should be displayed with zeros, as in 2019-07-08 7 0. See the expected result:
2019-07-01 20
2019-07-02 24
2019-07-03 24
2019-07-04 24
2019-07-05 24
2019-07-06 8
2019-07-07 21
2019-07-08 0
2019-07-09 24
dtype: int64
Any help will be appreciated!

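First remove the second index level with Series.reset_index, then drop the duplicated dates by boolean indexing with a callable, and finally fill in the missing days with Series.asfreq. Keeping only the first entry per date works because .value_counts sorts the counts in descending order: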
consec_group = (consec_group.reset_index(level=1, drop=True)[lambda x: ~x.index.duplicated()]
.asfreq('d', fill_value=0))
print (consec_group)
Or a solution with GroupBy.first:
consec_group = (consec_group.groupby(level=0)
.first()
.asfreq('d', fill_value=0))
print (consec_group)
2019-07-01 20
2019-07-02 24
2019-07-03 24
2019-07-04 24
2019-07-05 24
2019-07-06 8
2019-07-07 21
2019-07-08 0
2019-07-09 24
Freq: D, dtype: int64

Ok, I guess I was too close to the finish line to see the answer... Looks like I had already solved the complex part.
So right after posting the question, I tested max with the level=0 argument instead of level=1 and that was the solution:
max_consec_group = consec_group.max(level=0).asfreq('d', fill_value=0)
Thanks to jezrael for the asfreq part!
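For readers on recent pandas versions: as far as I know, the level argument of Series.max was deprecated in pandas 1.3 and removed in 2.0, so the one-liner above needs groupby(level=0).max() there. Below is a minimal end-to-end sketch under a few assumptions: a seeded random generator replaces np.random.rand so the run is reproducible, index.normalize() is used instead of index.date so the result keeps a DatetimeIndex that asfreq can work with, and the names power, on, streaks and daily_max are only illustrative.

```python
import numpy as np
import pandas as pd

# Same layout as the sample data in the question, but seeded for reproducibility
rng = np.random.default_rng(0)
values = np.concatenate([
    np.zeros(4), 19.5 + rng.random(24 * 5),
    np.zeros(4), 19.5 + rng.random(8),
    np.zeros(5), 19.5 + rng.random(24),
    np.zeros(27), 19.5 + rng.random(24),
])
power = pd.Series(
    values,
    index=pd.date_range("2019-07-01", periods=9 * 24, freq="h"),
)

mask = power > 1                              # True while the plant is "on"
groups = mask.ne(mask.shift()).cumsum()       # new label whenever on/off flips
on = groups[mask]                             # keep only the "on" hours

# Length of every run, split per calendar day
streaks = on.groupby([on.index.normalize(), on]).size()

# Longest run per day; days without any runtime are filled with 0
daily_max = streaks.groupby(level=0).max().asfreq("D", fill_value=0)
print(daily_max)
```

Because a run that crosses midnight shares one group label but falls into two different dates, grouping by (date, label) splits it per day, which is what "each day is a separate time frame" asks for.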

Related

monthly resampling pandas with specific start day

I'm creating a pandas DataFrame with random dates and random integers values and I want to resample it by month and compute the average value of integers. This can be done with the following code:
def random_dates(start='2018-01-01', end='2019-01-01', n=300):
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The problem is that the resampled months always start on day 1, and I want every month to start on day 15. I'm using pandas 1.1.4 and I've tried origin='15/01/2018' and offset='15', but neither works with the 'M' resample rule (they do work when I use 30D, but that is of no use to me). I've also tried '2SM', but it doesn't work either.
So my question is: is there a way to change the resample rule, or will I have to add an offset to my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that
the 15th day of each month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
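The same idea can be sketched without overwriting the Month column, by grouping on the monthly period of the shifted dates instead of calling resample. This is a variation on the answer, not the answer's exact code: groupby with dt.to_period('M') is swapped in so the example does not depend on the month-end resample alias, and the names shifted and res are illustrative. Unlike resample, groupby does not emit empty periods.

```python
import pandas as pd

# Sample frame from the answer above
df = pd.DataFrame({
    "Month": pd.to_datetime([
        "2020-05-05", "2020-05-14", "2020-05-15", "2020-05-20",
        "2020-05-30", "2020-06-15", "2020-06-20",
    ]),
    "Amount": [1, 1, 10, 10, 10, 20, 20],
})

# Move every date back 14 days so the 15th lands on the 1st;
# days 1..14 then fall into the previous calendar month.
shifted = df["Month"] - pd.Timedelta(days=14)

# Each period "YYYY-MM" now stands for YYYY-MM-15 up to the 14th of the next month
res = df.groupby(shifted.dt.to_period("M"))["Amount"].mean()
print(res)
```

Here the period 2020-04 covers 2020-04-15 through 2020-05-14, which is why the two rows from early May end up in it.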
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling with 'SMS', the semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum values and recalculate the weighted mean of each monthly period from its two sub-periods (for example, 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper use of an) offset, we are certain each period always runs from the 15th of one month to the 14th of the next.
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Compute a rolling sum and a rolling count, then derive the mean from them:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
Integers sum_rolling count_rolling mean
sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer

filtering date column in python

I'm new to python and I'm facing the following problem. I have a dataframe composed of 2 columns, one of them is date (datetime64[ns]). I want to keep all records within the last 12 months. My code is the following:
today=start_time.date()
last_year = today + relativedelta(months = -12)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
when I run it I get the following message:
TypeError: type object 2017-06-05
Any ideas?
last_year seems to bring me the date that I want in the following format: 2017-06-05
Create a Timedelta object in pandas to increment the date (12 months). Call pandas.Timestamp('now') to get the current date, and then create a date_range. Here is an example for getting monthly data for 12 months.
import pandas as pd
import datetime
list_1 = [i for i in range(0, 12)]
list_2 = [i for i in range(13, 25)]
list_3 = [i for i in range(26, 38)]
data_frame = pd.DataFrame({'A': list_1, 'B': list_2, 'C':list_3}, pd.date_range(pd.Timestamp('now'), pd.Timestamp('now') + pd.Timedelta (weeks=53), freq='M'))
We create a timestamp for the current date and enter that as our start date. Then we create a timedelta to increment that date by 53 weeks (or 52 if you'd like) which gets us 12 months of data. Below is the output:
A B C
2018-06-30 05:05:21.335625 0 13 26
2018-07-31 05:05:21.335625 1 14 27
2018-08-31 05:05:21.335625 2 15 28
2018-09-30 05:05:21.335625 3 16 29
2018-10-31 05:05:21.335625 4 17 30
2018-11-30 05:05:21.335625 5 18 31
2018-12-31 05:05:21.335625 6 19 32
2019-01-31 05:05:21.335625 7 20 33
2019-02-28 05:05:21.335625 8 21 34
2019-03-31 05:05:21.335625 9 22 35
2019-04-30 05:05:21.335625 10 23 36
2019-05-31 05:05:21.335625 11 24 37
Try
today = datetime.datetime.now()
You can use pandas functionality with datetime objects. The syntax is often more intuitive and obviates the need for additional imports.
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
That said, we would need to see the rest of your code to be sure of the reason behind your error; for example, how is start_time defined?
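A minimal sketch of the filter, assuming a column named mydate as in the question. The likely cause of the original error is that a datetime64 column is being compared with a plain datetime.date; keeping both sides as pandas Timestamps avoids that. The reference date is fixed here only so the example is reproducible; in real use it would be pd.Timestamp('now'). The sample rows and the names now, cutoff and recent are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "mydate": pd.to_datetime(
        ["2016-01-10", "2016-12-31", "2017-06-05", "2018-01-15"]
    ),
    "value": [1, 2, 3, 4],
})

# Fixed "now" for a reproducible example; use pd.Timestamp("now") in practice
now = pd.Timestamp("2018-06-05")

# DateOffset does calendar arithmetic, so this is exactly 12 months back
cutoff = now - pd.DateOffset(months=12)

# Both sides are datetime64/Timestamp, so the comparison is well defined
recent = df[df["mydate"] >= cutoff]
print(recent)
```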

selecting rows in a pandas dataframe starting with a certain index value

Suppose I have a dataframe, where the rows are indexed by trading days, so something like:
Date ClosingPrice
2017-3-16 10.00
2017-3-17 10.13
2017-3-20 10.19
...
I want to find $N$ rows starting with (say) 2017-2-28, so I don't know the date range, I just know that I want to do something ten rows down. What is the most elegant way of doing this? (there are plenty of ugly ways...)
my quick answer
s = df.Date.searchsorted(pd.to_datetime('2017-2-28'))[0]
df.iloc[s:s + 10]
demo
df = pd.DataFrame(dict(
Date=pd.date_range('2017-01-31', periods=90, freq='B'),
ClosingPrice=np.random.rand(90)
)).iloc[:, ::-1]
date = pd.to_datetime('2017-3-11')
s = df.Date.searchsorted(date)[0]
df.iloc[s:s + 10]
Date ClosingPrice
29 2017-03-13 0.737527
30 2017-03-14 0.411525
31 2017-03-15 0.794309
32 2017-03-16 0.578911
33 2017-03-17 0.747763
34 2017-03-20 0.081113
35 2017-03-21 0.000058
36 2017-03-22 0.274022
37 2017-03-23 0.367831
38 2017-03-24 0.100930
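A note for newer pandas versions: as far as I know, searchsorted returns a plain integer when given a scalar, so the trailing [0] in the snippets above is no longer needed there. The sketch below also assumes the dates are the index, as described in the question, rather than a column; the frame prices and the names start and window are illustrative, and the prices are just a counter instead of random numbers.

```python
import numpy as np
import pandas as pd

# 90 business days starting 2017-01-31, used as the index
prices = pd.DataFrame(
    {"ClosingPrice": np.arange(90, dtype=float)},
    index=pd.date_range("2017-01-31", periods=90, freq="B", name="Date"),
)

# Position of the first trading day on or after the given date
# (2017-03-11 is a Saturday, so this lands on Monday 2017-03-13)
start = prices.index.searchsorted(pd.Timestamp("2017-03-11"))

# The next 10 rows from that position, whatever their dates are
window = prices.iloc[start:start + 10]
print(window)
```

searchsorted relies on the index being sorted, which is the usual case for a trading-day index.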
naive time test
df[df['Date'] >= pd.Timestamp(2017, 2, 28)][:10]
I guess?

Prepare Data Frames to be compared. Index manipulation, datetime and beyond

Ok, this is a question in two steps.
Step one: I have a pandas DataFrame like this:
date time value
0 20100201 0 12
1 20100201 6 22
2 20100201 12 45
3 20100201 18 13
4 20100202 0 54
5 20100202 6 12
6 20100202 12 18
7 20100202 18 17
8 20100203 6 12
...
As you can see, for instance between rows 7 and 8 there is data missing (in this case, the value for the 0 time). Sometimes, several hours or even a full day could be missing.
I would like to convert this DataFrame to the format like this:
value
2010-02-01 00:00:00 12
2010-02-01 06:00:00 22
2010-02-01 12:00:00 45
2010-02-01 18:00:00 13
2010-02-02 00:00:00 54
2010-02-02 06:00:00 12
2010-02-02 12:00:00 18
2010-02-02 18:00:00 17
...
I want this because I have another DataFrame (let's call it the "reliable DataFrame") in this format that I am sure has no missing values.
EDIT 2016/07/28: Studying the problem it seems there were also duplicated data in the dataframe. See the solution to also address this problem.
Step two: With the previous step done I want to compare row by row the index in the "reliable DataFrame" with the index in the DataFrame with missing values.
I want to add a row with the value NaN where there are missing entries in the first DataFrame. The final check would be to be sure that both DataFrames have the same dimension.
I know this is a long question, but I am stuck. I have tried to handle the dates with dateutil.parser.parse and to use set_index to set a new index, but I get lots of errors in the code. I am afraid this is clearly above my pandas level.
Thank you in advance.
Step 1 Answer
df['DateTime'] = (df['date'].astype(str) + ' ' + df['time'].astype(str) +':'+'00'+':'+'00').apply(lambda x: pd.to_datetime(str(x)))
df.set_index('DateTime', drop=True, append=False, inplace=True, verify_integrity=False)
df.drop(['date', 'time'], axis=1, level=None, inplace=True, errors='raise')
If there are duplicates these can be removed by:
df = df.reset_index().drop_duplicates(subset='DateTime',keep='last').set_index('DateTime')
Step 2
df_join = df.join(df1, how='outer', lsuffix='x',sort=True)
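Both steps can also be sketched with vectorized date parsing and reindex instead of a join. This is an alternative under stated assumptions, not the answer's code: the time column is taken to hold whole hours, the "reliable" index is simulated here with a 6-hourly date_range, and the names raw, stamp, series, reliable_index and aligned are hypothetical.

```python
import pandas as pd

# Small frame in the question's layout, with gaps and one duplicated timestamp
raw = pd.DataFrame({
    "date": [20100201, 20100201, 20100202, 20100202, 20100203],
    "time": [0, 6, 18, 18, 6],
    "value": [12, 22, 17, 99, 12],
})

# Step 1: build timestamps from the yyyymmdd date plus the hour of day
stamp = (
    pd.to_datetime(raw["date"].astype(str), format="%Y%m%d")
    + pd.to_timedelta(raw["time"], unit="h")
)
series = raw.set_index(stamp)["value"]

# Drop duplicated timestamps, keeping the last occurrence
series = series[~series.index.duplicated(keep="last")]

# Step 2: align to the index of the reliable data (simulated here);
# timestamps missing from `series` become NaN rows
reliable_index = pd.date_range("2010-02-01", "2010-02-03 06:00", freq="6h")
aligned = series.reindex(reliable_index)
print(aligned)
```

After reindex, aligned has exactly the same length and index as the reliable data, so the final dimension check reduces to comparing the two indexes.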

Groupby with TimeGrouper 'backwards'

I have a DataFrame containing a time series:
rng = pd.date_range('2016-06-01', periods=24*7, freq='H')
ones = pd.Series([1]*24*7, rng)
rdf = pd.DataFrame({'a': ones})
Last entry is 2016-06-07 23:00:00. I now want to group this by, say two days, basically like so:
rdf.groupby(pd.TimeGrouper('2D')).sum()
However, I want to group starting from my last data point backwards, so instead of getting this result:
a
2016-06-01 48
2016-06-03 48
2016-06-05 48
2016-06-07 24
I'd much rather expect this:
a
2016-06-01 24
2016-06-03 48
2016-06-05 48
2016-06-07 48
and when grouping by '3D':
a
2016-06-01 24
2016-06-04 72
2016-06-07 72
Expected outcome when grouping by '4D' is:
a
2016-06-03 72
2016-06-07 96
I could not get this with any combination of closed, label, etc. that I could think of.
How can I achieve this?
Since I primarily want to group by 7 days, aka one week, I am using this method now to come to the desired bins:
from pandas.tseries.offsets import Week
# Let's not make full weeks
hours = 24*6*4
rng = pd.date_range('2016-06-01', periods=hours, freq='H')
# Set week start to whatever the last weekday of the range is
print("Last day is %s" % rng[-1])
freq = Week(weekday=rng[-1].weekday())
ones = pd.Series([1]*hours, rng)
rdf = pd.DataFrame({'a': ones})
rdf.groupby(pd.TimeGrouper(freq=freq, closed='right', label='right')).sum()
This gives me the desired output of
2016-06-25 96
2016-07-02 168
2016-07-09 168
Since the question now focuses on grouping by week, you can simply:
rdf.resample('W-{}'.format(rdf.index[-1].strftime('%a')), closed='right', label='right').sum()
You can use loffset to get it to work - at least for most periods (using .resample()):
for i in range(2, 7):
    print(i)
    print(rdf.resample('{}D'.format(i), closed='right', loffset='{}D'.format(i)).sum())
2
a
2016-06-01 24
2016-06-03 48
2016-06-05 48
2016-06-07 48
3
a
2016-06-01 24
2016-06-04 72
2016-06-07 72
4
a
2016-06-01 24
2016-06-05 96
2016-06-09 48
5
a
2016-06-01 24
2016-06-06 120
2016-06-11 24
6
a
2016-06-01 24
2016-06-07 144
However, you could also create custom groupings that calculate the correct values without TimeGrouper like so:
days = rdf.index.to_series().dt.day.unique()[::-1]
for n in range(2, 7):
    chunks = [days[i:i + n] for i in range(0, len(days), n)][::-1]
    grp = pd.Series({k: v for d in [zip(chunk, [idx] * len(chunk)) for idx, chunk in enumerate(chunks)] for k, v in d})
    print(n)
    print(rdf.groupby(rdf.index.to_series().dt.day.map(grp))['a'].sum())
2
groups
0 24
1 48
2 48
3 48
Name: a, dtype: int64
3
groups
0 24
1 72
2 72
Name: a, dtype: int64
4
groups
0 72
1 96
Name: a, dtype: int64
5
groups
0 48
1 120
Name: a, dtype: int64
6
groups
0 24
1 144
Name: a, dtype: int64
