Given a Pandas dataframe or series, I would like to resample it at specific points in time.
This might mean dropping values or adding new values by forward filling previous ones.
Example
Given the Series X defined by
import pandas
rng_X = pandas.to_datetime(
['2021-01-01', '2021-01-02', '2021-01-07', '2021-01-08', '2021-02-01'])
X = pandas.Series([0, 2, 4, 6, 8], rng_X)
X
2021-01-01 0
2021-01-02 2
2021-01-07 4
2021-01-08 6
2021-02-01 8
Resample X at dates
rng_Y = pandas.to_datetime(
['2021-01-02', '2021-01-03', '2021-01-07', '2021-01-08', '2021-01-09', '2021-01-10'])
The expected output is
2021-01-02 2
2021-01-03 2
2021-01-07 4
2021-01-08 6
2021-01-09 6
2021-01-10 6
2021-01-01 is dropped from the output since it isn't in rng_Y.
2021-01-03 is added to the output, with its value forward filled from 2021-01-02, since it does not exist in X.
2021-01-09 and 2021-01-10 are also added to the output, with values forward filled from 2021-01-08.
2021-02-01 is dropped from the output since it does not exist in rng_Y.
Try reindex with method set to 'ffill':
X = X.reindex(rng_Y, method='ffill')
X:
2021-01-02 2
2021-01-03 2
2021-01-07 4
2021-01-08 6
2021-01-09 6
2021-01-10 6
dtype: int32
Complete Code:
import pandas as pd
rng_X = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-07', '2021-01-08',
'2021-02-01'])
rng_Y = pd.to_datetime(['2021-01-02', '2021-01-03', '2021-01-07', '2021-01-08',
'2021-01-09', '2021-01-10'])
X = pd.Series([0, 2, 4, 6, 8], rng_X)
X = X.reindex(rng_Y, method='ffill')
print(X)
If X were a DataFrame (df) instead of a Series:
df = pd.DataFrame([0, 2, 4, 6, 8], index=rng_X, columns=['X'])
df = df.reindex(rng_Y, method='ffill')
df:
X
2021-01-02 2
2021-01-03 2
2021-01-07 4
2021-01-08 6
2021-01-09 6
2021-01-10 6
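One edge case worth noting (not covered in the original example): if rng_Y contains dates earlier than the first date in X, there is nothing to forward fill, so those rows come back as NaN and the values are upcast to float. A minimal sketch:
import pandas as pd
rng_X = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-07'])
X = pd.Series([0, 2, 4], rng_X)
# '2020-12-31' precedes every date in X, so ffill has nothing to carry forward
rng_Y = pd.to_datetime(['2020-12-31', '2021-01-02', '2021-01-05'])
Y = X.reindex(rng_Y, method='ffill')
print(Y)
# 2020-12-31    NaN
# 2021-01-02    2.0
# 2021-01-05    2.0
# dtype: float64
print(Y.dropna())  # drop the leading NaN if such dates should be excluded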
Related
I have a dataframe with the following columns:
datetime: HH:MM:SS timestamps (not continuous, there are some missing days)
date: df['datetime'].dt.date
X: various values
X_daily_cum = df.groupby(['date']).X.cumsum()
So X_daily_cum is the cumulative sum of X grouped per day; it resets every day.
Code to reproduce:
import pandas as pd
df = pd.DataFrame( [['2021-01-01 10:10', 3],
['2021-01-03 13:33', 7],
['2021-01-03 14:44', 6],
['2021-01-07 17:17', 2],
['2021-01-07 07:07', 4],
['2021-01-07 01:07', 9],
['2021-01-09 09:09', 3]],
columns=['datetime', 'X'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %M:%S')
df['date'] = df['datetime'].dt.date
df['X_daily_cum'] = df.groupby(['date']).X.cumsum()
print(df)
Now I would like a new column whose value is the cumulative sum from the previous available day, like this:
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 00:10:10 3 2021-01-01 3 NaN
1 2021-01-03 00:13:33 7 2021-01-03 7 3
2 2021-01-03 00:14:44 6 2021-01-03 13 3
3 2021-01-07 00:17:17 2 2021-01-07 2 13
4 2021-01-07 00:07:07 4 2021-01-07 6 13
5 2021-01-07 00:01:07 9 2021-01-07 15 13
6 2021-01-09 00:09:09 3 2021-01-09 3 15
Is there a clean way to do it with pandas, perhaps with an apply?
I have managed to do it in an ugly way by copying the df, removing the datetime granularity, selecting the last record of each date, and joining this new df with the previous one. I would like a more elegant solution.
Thanks for the help
Use Series.duplicated with Series.mask to set all values except the last one per date to missing, then shift the values and forward fill the missing values:
df['last_day_cum_value'] = (df['X_daily_cum'].mask(df['date'].duplicated(keep='last'))
.shift()
.ffill())
print (df)
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 00:10:10 3 2021-01-01 3 NaN
1 2021-01-03 00:13:33 7 2021-01-03 7 3.0
2 2021-01-03 00:14:44 6 2021-01-03 13 3.0
3 2021-01-07 00:17:17 2 2021-01-07 2 13.0
4 2021-01-07 00:07:07 4 2021-01-07 6 13.0
5 2021-01-07 00:01:07 9 2021-01-07 15 13.0
6 2021-01-09 00:09:09 3 2021-01-09 3 15.0
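For clarity, here is what each step of that chain produces on the question's df (a sketch, not part of the original answer):
# keep only the last cumulative value of each date, mask the rest as NaN
masked = df['X_daily_cum'].mask(df['date'].duplicated(keep='last'))
print(masked.tolist())            # [3.0, nan, 13.0, nan, nan, 15.0, 3.0]
# shift down by one row so each row sees a value from an earlier row
shifted = masked.shift()
print(shifted.tolist())           # [nan, 3.0, nan, 13.0, nan, nan, 15.0]
# forward fill so every row of a day gets the previous day's last value
print(shifted.ffill().tolist())   # [nan, 3.0, 3.0, 13.0, 13.0, 13.0, 15.0]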
Old solution:
Use DataFrame.drop_duplicates keeping the last row per date, set the index to date and use Series.shift to get each previous day's value, then use Series.map to create the new column:
s = df.drop_duplicates('date', keep='last').set_index('date')['X_daily_cum'].shift()
print (s)
date
2021-01-01 NaN
2021-01-03 3.0
2021-01-07 13.0
2021-01-09 15.0
Name: X_daily_cum, dtype: float64
df['last_day_cum_value'] = df['date'].map(s)
print (df)
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 00:10:10 3 2021-01-01 3 NaN
1 2021-01-03 00:13:33 7 2021-01-03 7 3.0
2 2021-01-03 00:14:44 6 2021-01-03 13 3.0
3 2021-01-07 00:17:17 2 2021-01-07 2 13.0
4 2021-01-07 00:07:07 4 2021-01-07 6 13.0
5 2021-01-07 00:01:07 9 2021-01-07 15 13.0
6 2021-01-09 00:09:09 3 2021-01-09 3 15.0
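An equivalent way to build s (assuming the same df) is to take the last cumulative value per day with groupby and shift it by one day:
s = df.groupby('date')['X_daily_cum'].last().shift()
df['last_day_cum_value'] = df['date'].map(s)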
Say I have this DataFrame:
      user            sub_date          unsub_date group
0    alice 2021-01-01 00:00:00 2021-02-09 00:00:00     A
1      bob 2021-02-03 00:00:00 2021-04-05 00:00:00     B
2  charlie 2021-02-03 00:00:00                 NaT     A
3     dave 2021-01-29 00:00:00 2021-09-01 00:00:00     B
What is the most efficient way to count the subbed users per date and per group? In other words, to get this DataFrame:
date        group  subbed
2021-01-01  A      1
2021-01-01  B      0
2021-01-02  A      1
2021-01-02  B      0
...         ...    ...
2021-02-03  A      2
2021-02-03  B      2
...         ...    ...
2021-02-10  A      1
2021-02-10  B      2
...         ...    ...
Here's a snippet to init the example df:
import pandas as pd
import datetime as dt
users = pd.DataFrame(
[
["alice", "2021-01-01", "2021-02-09", "A"],
["bob", "2021-02-03", "2021-04-05", "B"],
["charlie", "2021-02-03", None, "A"],
["dave", "2021-01-29", "2021-09-01", "B"],
],
columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
pd.to_datetime
)
Using a smaller date range for convenience
Note: my users df is different from the OP's. I've changed a few dates around to make the outputs smaller.
In [26]: import pandas as pd
...: import numpy as np  # needed for np.where used further below
...: import datetime as dt
...:
...: users = pd.DataFrame(
...: [
...: ["alice", "2021-01-01", "2021-01-05", "A"],
...: ["bob", "2021-01-03", "2021-01-07", "B"],
...: ["charlie", "2021-01-03", None, "A"],
...: ["dave", "2021-01-09", "2021-01-11", "B"],
...: ],
...: columns=["user", "sub_date", "unsub_date", "group"],
...: )
...:
...: users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
...: pd.to_datetime
...: )
In [81]: users
Out[81]:
user sub_date unsub_date group
0 alice 2021-01-01 2021-01-05 A
1 bob 2021-01-03 2021-01-07 B
2 charlie 2021-01-03 NaT A
3 dave 2021-01-09 2021-01-11 B
In [82]: users.melt(id_vars=['user', 'group'])
Out[82]:
user group variable value
0 alice A sub_date 2021-01-01
1 bob B sub_date 2021-01-03
2 charlie A sub_date 2021-01-03
3 dave B sub_date 2021-01-09
4 alice A unsub_date 2021-01-05
5 bob B unsub_date 2021-01-07
6 charlie A unsub_date NaT
7 dave B unsub_date 2021-01-11
# dropna to remove rows with no unsub_date
# sort_values to sort by date
# map sub_date rows to +1 and unsub_date rows to -1, then cumsum gives the number of subbed people at each date
In [85]: melted = users.melt(id_vars=['user', 'group']).dropna().sort_values('value')
...: melted['sub_value'] = np.where(melted['variable'] == 'sub_date', 1, -1) # or melted['variable'].map({'sub_date': 1, 'unsub_date': -1})
...: melted['sub_cumsum_group'] = melted.groupby('group')['sub_value'].cumsum()
...: melted
Out[85]:
user group variable value sub_value sub_cumsum_group
0 alice A sub_date 2021-01-01 1 1
1 bob B sub_date 2021-01-03 1 1
2 charlie A sub_date 2021-01-03 1 2
4 alice A unsub_date 2021-01-05 -1 1
5 bob B unsub_date 2021-01-07 -1 0
3 dave B sub_date 2021-01-09 1 1
7 dave B unsub_date 2021-01-11 -1 0
In [93]: idx = pd.date_range(melted['value'].min(), melted['value'].max(), freq='1D')
...: idx
Out[93]:
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
'2021-01-09', '2021-01-10', '2021-01-11'],
dtype='datetime64[ns]', freq='D')
In [94]: melted.set_index('value').groupby('group')['sub_cumsum_group'].apply(lambda x: x.reindex(idx).ffill().fillna(0))
Out[94]:
group
A 2021-01-01 1.0
2021-01-02 1.0
2021-01-03 2.0
2021-01-04 2.0
2021-01-05 1.0
2021-01-06 1.0
2021-01-07 1.0
2021-01-08 1.0
2021-01-09 1.0
2021-01-10 1.0
2021-01-11 1.0
B 2021-01-01 0.0
2021-01-02 0.0
2021-01-03 1.0
2021-01-04 1.0
2021-01-05 1.0
2021-01-06 1.0
2021-01-07 0.0
2021-01-08 0.0
2021-01-09 1.0
2021-01-10 1.0
2021-01-11 0.0
Name: sub_cumsum_group, dtype: float64
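To get this into the long (date, group, subbed) shape the question asks for, one could rename and reset the index (a sketch continuing from the code above):
out = (melted.set_index('value')
             .groupby('group')['sub_cumsum_group']
             .apply(lambda x: x.reindex(idx).ffill().fillna(0))
             .rename('subbed')
             .rename_axis(['group', 'date'])
             .reset_index())
print(out.head())
#   group       date  subbed
# 0     A 2021-01-01     1.0
# 1     A 2021-01-02     1.0
# 2     A 2021-01-03     2.0
# 3     A 2021-01-04     2.0
# 4     A 2021-01-05     1.0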
The data is described by step functions, and the staircase package can be used for these applications:
import staircase as sc
stepfunctions = users.groupby("group").apply(sc.Stairs, "sub_date", "unsub_date")
stepfunctions will be a pandas.Series, indexed by group, and the values are Stairs objects which represent step functions.
group
A <staircase.Stairs, id=2516834869320>
B <staircase.Stairs, id=2516112096072>
dtype: object
You could plot the step function for A if you wanted, like so:
stepfunctions["A"].plot()
The next step is to sample the step functions at whatever dates you want, e.g. for every day of January:
sc.sample(stepfunctions, pd.date_range("2021-01-01", "2021-02-01")).melt(ignore_index=False).reset_index()
The result is this
group variable value
0 A 2021-01-01 1
1 B 2021-01-01 0
2 A 2021-01-02 1
3 B 2021-01-02 0
4 A 2021-01-03 1
.. ... ... ...
59 B 2021-01-30 1
60 A 2021-01-31 1
61 B 2021-01-31 1
62 A 2021-02-01 1
63 B 2021-02-01 1
Note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
Try this?
>>> users.groupby(['sub_date','group'])[['user']].count()
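For the OP's original users frame, this counts sign-ups per exact sub_date and group (a quick check; the output below is a sketch):
print(users.groupby(['sub_date', 'group'])[['user']].count())
#                   user
# sub_date   group
# 2021-01-01 A         1
# 2021-01-29 B         1
# 2021-02-03 A         1
#            B         1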
I am trying to select the dates where the percentage change is over 1%. To do this my code is as follows:
df1 has 109 rows × 6 columns
`df1['Close'].pct_change().gt(0.01).index` produces:
DatetimeIndex(['2020-12-31', '2021-01-04', '2021-01-05', '2021-01-06',
'2021-01-07', '2021-01-08', '2021-01-11', '2021-01-12',
'2021-01-13', '2021-01-14',
...
'2021-05-25', '2021-05-26', '2021-05-27', '2021-05-28',
'2021-06-01', '2021-06-02', '2021-06-03', '2021-06-04',
'2021-06-07', '2021-06-08'],
dtype='datetime64[ns]', name='Date', length=109, freq=None)
This is not right because there are very few dates which are over 1%, but I am still getting the same length of 109 as I would get without .gt().
Could you please advise why it is showing all the dates?
Select only the rows where the condition is True with boolean indexing (for your 1% threshold, use .gt(0.01)):
df1.loc[df1["Close"].pct_change().gt(1)].index
>>> df1
Open Close
Date
2021-01-01 5 7
2021-01-02 1 3
2021-01-03 1 2
2021-01-04 10 6
2021-01-05 5 10
2021-01-06 6 9
2021-01-07 8 1
2021-01-08 1 3
2021-01-09 10 5
2021-01-10 7 3
>>> df1.loc[df1["Close"].pct_change().gt(1)].index
DatetimeIndex(['2021-01-04', '2021-01-08'], dtype='datetime64[ns]', name='Date', freq=None)
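The reason the original attempt returned all 109 dates is that .gt() only produces a boolean Series aligned to the full index; taking .index of that Series returns every label regardless of the True/False values. A minimal illustration with the demo frame above:
mask = df1["Close"].pct_change().gt(1)
print(len(mask.index))           # 10 -- the boolean Series keeps every row
print(len(df1.loc[mask].index))  # 2  -- boolean indexing keeps only the True rows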
Hello guys, I want to aggregate (sum) revenue by week. Note that the date column is daily.
Does anyone know how to do this?
Thank you kindly,
df = pd.DataFrame({'date':['2021-01-01','2021-01-02',
'2021-01-03','2021-01-04','2021-01-05',
'2021-01-06','2021-01-07','2021-01-08',
'2021-01-09'],
'revenue':[5,3,2,
10,12,2,
1,0,6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04    31
Use DataFrame.resample with an aggregate sum, but it is necessary to change the default closed side to 'left' together with the label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-Mon', on='date', closed='left', label='left').sum()
print (df1)
revenue
date
2020-12-28 10
2021-01-04 31
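The same weekly sums can also be produced with pd.Grouper inside a groupby (a sketch on the same df, assuming date has already been converted with pd.to_datetime):
df2 = (df.groupby(pd.Grouper(key='date', freq='W-Mon', closed='left', label='left'))['revenue']
         .sum())
print(df2)
# date
# 2020-12-28    10
# 2021-01-04    31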
I have two data frames (let say A and B) indexed with dates.
I define a column in B as following
B["column1"] = A.shift(1)
Later, when I add additional data to A and want to update B, it doesn't work.
B["column1"] = A.shift(1) still produces the same data as before I added the additional data to A.
How can I solve this issue?
Perform a df.reindex() before your assignment statement, as follows:
B = B.reindex(A.index)
Then, you can get your desired result with your code:
B["column1"] = A.shift(1)
Caution: If your dataframe B has other columns built with date indices other than those of dataframe A, reindexing in this way may cause loss of data in those columns. To overcome this, you can reindex B to the combined index of A and B with union(), as follows:
B = B.reindex(A.index.union(B.index))
Demo Run
A_index = pd.date_range(start='2021/1/1', periods=8)
A = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], index=A_index)
print(A)
2021-01-01 10
2021-01-02 20
2021-01-03 30
2021-01-04 40
2021-01-05 50
2021-01-06 60
2021-01-07 70
2021-01-08 80
Freq: D, dtype: int64
B = pd.DataFrame()
B["column1"] = A.shift(1)
print(B)
column1
2021-01-01 NaN
2021-01-02 10.0
2021-01-03 20.0
2021-01-04 30.0
2021-01-05 40.0
2021-01-06 50.0
2021-01-07 60.0
2021-01-08 70.0
# Add data to A (Series.append was removed in pandas 2.0, so use pd.concat)
A = pd.concat([A, pd.Series([100, 110, 120], index=pd.date_range(start='2021/1/21', periods=3))])
print(A)
2021-01-01 10
2021-01-02 20
2021-01-03 30
2021-01-04 40
2021-01-05 50
2021-01-06 60
2021-01-07 70
2021-01-08 80
2021-01-21 100 <= New data
2021-01-22 110 <= New data
2021-01-23 120 <= New data
dtype: int64
#Run new code
B = B.reindex(A.index)
#Run existing code
B["column1"] = A.shift(1)
print(B)
column1
2021-01-01 NaN
2021-01-02 10.0
2021-01-03 20.0
2021-01-04 30.0
2021-01-05 40.0
2021-01-06 50.0
2021-01-07 60.0
2021-01-08 70.0
2021-01-21 80.0 <= New data
2021-01-22 100.0 <= New data
2021-01-23 110.0 <= New data
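Continuing the demo, if B had extra rows built from dates that are not in A (a hypothetical 'other' column added purely for illustration), the union reindex mentioned above keeps them:
# hypothetical extra data in B, indexed by a date that A does not contain
extra = pd.DataFrame({'other': [99]}, index=pd.to_datetime(['2021-02-01']))
B = pd.concat([B, extra])
B = B.reindex(A.index.union(B.index))  # the union keeps 2021-02-01
B["column1"] = A.shift(1)              # aligned on B's index; 2021-02-01 stays NaN
print(B.tail(2))
#             column1  other
# 2021-01-23    110.0    NaN
# 2021-02-01      NaN   99.0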
Please use DataFrame.at
B.at["column1"] = A.shift(1)
reference DataFrame.at