Groupby with TimeGrouper 'backwards' - python

I have a DataFrame containing a time series:
rng = pd.date_range('2016-06-01', periods=24*7, freq='H')
ones = pd.Series([1]*24*7, rng)
rdf = pd.DataFrame({'a': ones})
The last entry is 2016-06-07 23:00:00. I now want to group this by, say, two days, basically like so:
rdf.groupby(pd.TimeGrouper('2D')).sum()
However, I want to group starting from my last data point backwards, so instead of getting this result:
a
2016-06-01 48
2016-06-03 48
2016-06-05 48
2016-06-07 24
I'd much rather expect this:
a
2016-06-01 24
2016-06-03 48
2016-06-05 48
2016-06-07 48
and when grouping by '3D':
a
2016-06-01 24
2016-06-04 72
2016-06-07 72
Expected outcome when grouping by '4D' is:
a
2016-06-03 72
2016-06-07 96
I am not able to get this with any combination of closed, label, etc. that I could think of.
How can I achieve this?

Since I primarily want to group by 7 days, i.e. one week, I am now using this method to arrive at the desired bins:
from pandas.tseries.offsets import Week
# Let's not make full weeks
hours = 24*6*4
rng = pd.date_range('2016-06-01', periods=hours, freq='H')
# Set week start to whatever the last weekday of the range is
print("Last day is %s" % rng[-1])
freq = Week(weekday=rng[-1].weekday())
ones = pd.Series([1]*hours, rng)
rdf = pd.DataFrame({'a': ones})
rdf.groupby(pd.TimeGrouper(freq=freq, closed='right', label='right')).sum()
This gives me the desired output of
2016-06-25 96
2016-07-02 168
2016-07-09 168
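Note that pd.TimeGrouper has since been deprecated and removed in newer pandas versions; pd.Grouper accepts the same arguments, so the grouping line above can be rewritten as the following sketch (assuming a recent pandas):
# Same anchored-week grouping, using the non-deprecated Grouper
rdf.groupby(pd.Grouper(freq=freq, closed='right', label='right')).sum()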

Since the question now focuses on grouping by week, you can simply:
rdf.resample('W-{}'.format(rdf.index[-1].strftime('%a')), closed='right', label='right').sum()
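For the hourly frame from the top of the question, whose last timestamp (2016-06-07 23:00:00) falls on a Tuesday, the format call resolves to the anchored weekly frequency 'W-Tue', i.e. weeks ending on Tuesday; a quick check, assuming that rdf:
print(rdf.index[-1].strftime('%a'))                  # 'Tue'
print('W-{}'.format(rdf.index[-1].strftime('%a')))   # 'W-Tue'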
You can use loffset to get it to work - at least for most periods (using .resample()):
for i in range(2, 7):
    print(i)
    print(rdf.resample('{}D'.format(i), closed='right', loffset='{}D'.format(i)).sum())
2
a
2016-06-01 24
2016-06-03 48
2016-06-05 48
2016-06-07 48
3
a
2016-06-01 24
2016-06-04 72
2016-06-07 72
4
a
2016-06-01 24
2016-06-05 96
2016-06-09 48
5
a
2016-06-01 24
2016-06-06 120
2016-06-11 24
6
a
2016-06-01 24
2016-06-07 144
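Note that loffset is deprecated in newer pandas (1.1+); the suggested migration, sketched here for the 2-day case, is to resample without it and shift the resulting index yourself:
res = rdf.resample('2D', closed='right').sum()
res.index = res.index + pd.Timedelta('2D')   # replaces loffset='2D'
print(res)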
However, you could also create custom groupings that calculate the correct values without TimeGrouper like so:
days = rdf.index.to_series().dt.day.unique()[::-1]   # day numbers, newest first
for n in range(2, 7):
    # Split the reversed days into chunks of n; after reversing again,
    # the oldest (possibly partial) chunk comes first
    chunks = [days[i:i + n] for i in range(0, len(days), n)][::-1]
    # Map each day number to the index of its chunk
    grp = pd.Series({k: v for d in [zip(chunk, [idx] * len(chunk)) for idx, chunk in enumerate(chunks)] for k, v in d})
    print(n)
    print(rdf.groupby(rdf.index.to_series().dt.day.map(grp))['a'].sum())
2
groups
0 24
1 48
2 48
3 48
Name: a, dtype: int64
3
groups
0 24
1 72
2 72
Name: a, dtype: int64
4
groups
0 72
1 96
Name: a, dtype: int64
5
groups
0 48
1 120
Name: a, dtype: int64
6
groups
0 24
1 144
Name: a, dtype: int64

Related

How do I replicate rows in a dataframe while increasing the values in a specific column one by one

Imagine I have a dataframe with four columns (one of them a date) and, let's say, 12 rows. For each row, I would like to add around 30 replicated rows below it, increasing the date day by day while keeping the rest static. For example, if this is my dataframe:
Video_ID date ratio_liked accomulated_views
45 2022-08-07 0.540457 0.826594
87 2021-06-14 0.979323 0.977446
34 2018-02-09 0.128068 0.1237669
25 2010-01-07 0.507959 0.378297
23 2020-09-03 0.731555 0.818380
85 2015-02-01 0.999961 0.619517
92 2019-04-07 0.129270 0.024533
51 2007-07-03 0.441010 0.741781
37 2009-12-01 0.682101 0.375660
50 2012-11-10 0.754488 0.352293
I would like something like this (the dashed lines imply there are rows in between):
Video_ID date ratio_liked accomulated_views
45 2022-08-07 0.540457 0.826594
45 2022-08-08 0.540457 0.826594
45 2022-08-09 0.540457 0.826594
45 2022-08-10 0.540457 0.826594
---------------------------------------------
45 2022-09-06 0.540457 0.826594
45 2022-09-07 0.540457 0.826594
87 2021-06-14 0.979323 0.977446
87 2021-06-15 0.979323 0.977446
87 2021-06-16 0.979323 0.977446
------------------------------------------------------
87 2021-07-14 0.979323 0.977446
34 2018-02-07 0.128068 0.1237669
34 2018-02-18 0.128068 0.1237669
34 2018-03-07 0.128068 0.1237669
---------------------------------------------
50 2012-11-10 0.754488 0.352293
----------------------------------------------
50 2012-12-10 0.754488 0.352293
The range between the two dates is given by pandas.date_range(date, date + DateOffset(months=1), freq='d').
I think it can be approached like this:
1) For each row, create a function that reproduces it 30 times, giving you a dataframe of 30 rows
2) Replace the date column with the date_range
3) Create a loop, or maybe a list comprehension, that selects each row, applies the function above, and then concatenates
It will end up giving you a dataframe of around 12x30 (or 31) rows.
I've tried to code this but have been unsuccessful.
explode is what you are looking for:
# Make sure `date` is of type Timestamp
df["date"] = pd.to_datetime(df["date"])
result = df.assign(
    # For each row, replace the existing date with a series of dates
    # This series starts from the current date and ends in 1 month
    date=df["date"].apply(lambda d: pd.date_range(d, d + pd.DateOffset(months=1)))
).explode("date")
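A possible sanity check (hypothetical, reusing the column names from the question) that each Video_ID now spans roughly one month of daily rows:
result = result.reset_index(drop=True)   # give each replicated row its own index
print(result.groupby("Video_ID")["date"].agg(["min", "max", "count"]))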

randomly subsample once every month pandas

I have the following dataframe.
data = {'bid': ['23243', '23243', '23243', '12145', '12145', '12145', '12145'],
        'lid': ['54346786', '23435687', '23218987', '43454432', '21113567', '54789876', '22898721'],
        'date': ['2021-08-11', '2021-08-12', '2021-09-17', '2021-05-02', '2021-05-11', '2021-05-20', '2021-08-13'],
        'val1': [44, 34, 54, 76, 89, 33, 27],
        'val2': [11, 55, 43, 89, 76, 44, 88]}
df = pd.DataFrame(data)
What I am looking for is to randomly pick a lid per month for each bid, and to maintain a count of past instances up to the point of the random sample.
I can think of separating the year and month into different columns and then applying groupby on the bid, year and month with pd.Series.sample, but there must be a better way of doing it.
Use GroupBy.cumcount per bid, and then per month and bid use DataFrameGroupBy.sample:
df['date'] = pd.to_datetime(df['date'])
#if necessary sorting
#df = df.sort_values(['bid','date'])
df['prev'] = df.groupby('bid').cumcount()
df1 = df.groupby(['bid', pd.Grouper(freq='M', key='date')], sort=False).sample(n=1)
print (df1)
bid lid date val1 val2 prev
1 23243 23435687 2021-08-12 34 55 1
2 23243 23218987 2021-09-17 54 43 2
5 12145 54789876 2021-05-20 33 44 2
6 12145 22898721 2021-08-13 27 88 3
IIUC, use groupby.sample, assuming the date column has datetime64 dtype:
out = df.groupby([df['date'].dt.month, 'bid']).sample(n=1).reset_index(drop=True)
print(out)
# Output
bid lid date val1 val2
0 12145 21113567 2021-05-11 89 76
1 12145 22898721 2021-08-13 27 88
2 23243 23435687 2021-08-12 34 55
3 23243 23218987 2021-09-17 54 43
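If you also want the running count of past instances from the first answer with this variant, the cumcount can be added before sampling; a sketch, assuming df as defined above:
df['date'] = pd.to_datetime(df['date'])
df['prev'] = df.groupby('bid').cumcount()   # instances seen so far per bid
out = (df.groupby([df['date'].dt.to_period('M'), 'bid'], sort=False)
         .sample(n=1)
         .reset_index(drop=True))
print(out)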

Pandas filter data for line graph

I'm trying to use Pandas to filter the dataframe. The dataset runs from 1982-01 to 2019-11, and I want to filter the data from 2010 onwards, i.e. 2010-01 to 2019-11.
mydf = pd.read_csv('temperature_mean.csv')
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
I set the index to month, and I'm able to get the mean temperature for the filtered data. However, I need the index values as my x labels for the line graph, and I'm not able to get them. I tried using a regex to get the data from 201x onwards, but there's still an error.
How do I get the labels for the months, i.e. 2010-01, 2010-02, ..., 2019-10, 2019-11, for the line graph?
Thanks!
mydf = pd.read_csv('temperature_mean.csv')
month mean_temp
______________________
0 1982-01-01 39
1 1985-04-01 29
2 1999-03-01 19
3 2010-01-01 59
4 2013-05-01 32
5 2015-04-01 34
6 2016-11-01 59
7 2017-08-01 14
8 2017-09-01 7
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
mean_temp
month
______________________
2010-01-01 59
2013-05-01 32
2015-04-01 34
2016-11-01 59
2017-08-01 14
2017-09-01 7
Drawing the line plot (default x & y arguments):
df.plot.line()
If, for some reason, you want to manually specify the column names:
df.reset_index().plot.line(x='month', y='mean_temp')
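For the partial-string slice ('2010-01':'2019-11') and readable x labels, the month column should be parsed as dates so the index is a DatetimeIndex; a minimal sketch of the whole pipeline, assuming the CSV layout shown above:
mydf = pd.read_csv('temperature_mean.csv', parse_dates=['month'])
df = mydf.set_index('month').loc['2010-01':'2019-11']
df['mean_temp'].plot.line()   # x tick labels come from the DatetimeIndex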

Vectorized count of daily longest consecutive streak

To evaluate the daily longest consecutive runtimes of a power plant, I have to find the longest streak per day, meaning that each day is considered a separate timeframe.
So let's say I've got the power output in the dataframe df:
df = pd.Series(
    data=[
        *np.zeros(4), *(np.full(24*5, 19.5) + np.random.rand(24*5)),
        *np.zeros(4), *(np.full(8, 19.5) + np.random.rand(8)),
        *np.zeros(5), *(np.full(24, 19.5) + np.random.rand(24)),
        *np.zeros(27), *(np.full(24, 19.5) + np.random.rand(24))],
    index=pd.date_range(start='2019-07-01 00:00:00', periods=9*24, freq='1h'))
And the "cutoff-power" is 1 (everything below that is considered as off). I use this to mask the "on"-values, shift and compare the mask to itself to count the number of consecutive groups. Finally I group the groups by the days of the year in the index and count the daily consecutive values consec_group:
mask = df > 1
groups = mask.ne(mask.shift()).cumsum()
consec_group = groups[mask].groupby(groups[mask].index.date).value_counts()
Which yields:
consec_group
Out[3]:
2019-07-01 2 20
2019-07-02 2 24
2019-07-03 2 24
2019-07-04 2 24
2019-07-05 2 24
2019-07-06 4 8
2 4
6 3
2019-07-07 6 21
2019-07-09 8 24
dtype: int64
But I'd like to have the maximum value of each daily consecutive streak, and dates without any runtime should be displayed with zeros, as in 2019-07-08 7 0. See the expected result:
2019-07-01 20
2019-07-02 24
2019-07-03 24
2019-07-04 24
2019-07-05 24
2019-07-06 8
2019-07-07 21
2019-07-08 0
2019-07-09 24
dtype: int64
Any help will be appreciated!
First remove the second index level with Series.reset_index, then filter out duplicated index values with a callable so only the first (largest) count per day is kept (this works because value_counts sorts the Series), and finally fill in the missing days with Series.asfreq:
consec_group = (consec_group.reset_index(level=1, drop=True)[lambda x: ~x.index.duplicated()]
                .asfreq('d', fill_value=0))
print (consec_group)
Or solution with GroupBy.first:
consec_group = (consec_group.groupby(level=0)
                .first()
                .asfreq('d', fill_value=0))
print (consec_group)
2019-07-01 20
2019-07-02 24
2019-07-03 24
2019-07-04 24
2019-07-05 24
2019-07-06 8
2019-07-07 21
2019-07-08 0
2019-07-09 24
Freq: D, dtype: int64
Ok, I guess I was too close to the finish line to see the answer... Looks like I had already solved the complex part.
So right after posting the question, I tested max with the level=0 argument instead of level=1 and that was the solution:
max_consec_group = consec_group.max(level=0).asfreq('d', fill_value=0)
Thanks at jezrael for the asfreq part!
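In newer pandas the level argument of Series.max is deprecated (and later removed) in favour of an explicit groupby, so the same one-liner becomes, as a sketch:
max_consec_group = consec_group.groupby(level=0).max().asfreq('d', fill_value=0)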

pandas groupby apply function that takes an N-column frame and returns an object

Is there a 'transform' method or something like that to apply a function to groups (all columns at once) and return an object? Anything I try seems to return one object per column in the group.
For example consider the data
Maturity Date s Term Month
0 2012-02-01 00:00:00 2012-01-03 00:00:00 2.993 29 2
18 2012-03-01 00:00:00 2012-01-03 00:00:00 3.022 58 3
57 2012-04-01 00:00:00 2012-01-03 00:00:00 3.084 89 4
117 2012-05-01 00:00:00 2012-01-03 00:00:00 3.138 119 5
...
and suppose I do a groupby on Date and apply some function to the groups labeled by (Term, Month, s). The result should be something like
Maturity result
2012-02-01 00:00:00 2012-01-03 object
2012-03-01 00:00:00 2012-01-03 object
2012-04-01 00:00:00 2012-01-03 object
....
I can obviously just iterate through the groups and aggregate the results but I imagine I'm just missing something obvious about how to use one of the transform methods.
You could apply the function and then aggregate each group manually. For example, assuming the aggregation is a mean and the function is the sum of the column, you could:
df.groupby("Date")['Term', 'Month', 's'].apply(lambda rows: np.mean(rows['Term'] + rows['Month'] + rows['s']))
So if we assume a fit method that builds some model from a dataframe having the columns "month", "Term" and "s":
import pandas as pd
import numpy as np

def fit(dataframe):
    return {"param1": np.mean(dataframe["Term"]) + np.max(dataframe["month"]),
            "param2": np.std(dataframe["s"])}
And a dataframe containing those columns for a bunch of dates:
df = pd.DataFrame({"date": ["20140101", "20140202", "20140203"] * 4, "Term": np.random.randint(100, size=12), "month": np.random.randint(12, size=12), "s": np.random.rand(12)*3})
print(df)
(outputs: )
Term date month s
0 24 20140101 6 2.364798
1 43 20140202 9 0.066188
2 59 20140203 6 1.078052
3 40 20140101 3 1.982825
4 34 20140202 4 2.089518
5 20 20140203 1 2.412956
6 84 20140101 8 0.779843
7 62 20140202 9 0.918860
8 32 20140203 11 2.613289
9 16 20140101 9 0.788347
10 23 20140202 6 0.982986
11 27 20140203 1 0.658260
Then we can apply the fit() on all the columns at once for each group of rows:
modelPerDate = df.groupby("date").apply(fit)
print(modelPerDate)
Which produces a dataframe of models, one per date:
date
20140101 {'param2': 0.70786647858131047, 'param1': 50.0}
20140202 {'param2': 0.71852297283637756, 'param1': 49.5}
20140203 {'param2': 0.83876295773013798, 'param1': 45.5}
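The result is a Series whose values are dicts; if a flat frame of parameters per date is more convenient, one way, sketched with the modelPerDate from above, is:
params = pd.DataFrame(modelPerDate.tolist(), index=modelPerDate.index)
print(params)   # columns param1 and param2, one row per date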
