Pandas — match last identical row and compute difference - python

With a DataFrame like the following:
timestamp value
0 2012-01-01 3.0
1 2012-01-05 3.0
2 2012-01-06 6.0
3 2012-01-09 3.0
4 2012-01-31 1.0
5 2012-02-09 3.0
6 2012-02-11 1.0
7 2012-02-13 3.0
8 2012-02-15 2.0
9 2012-02-18 5.0
What would be an elegant and efficient way to add a time_since_last_identical column, so that the previous example would result in:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 5 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 10 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
The important part of the problem is not necessarily the usage of time delays. Any solution that matches one particular row with the previous row of identical value, and computes something out of those two rows (here, a difference) will be valid.
Note: not interested in apply or loop-based approaches.
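For reference, the example frame can be rebuilt like this (a minimal sketch; the values are taken from the table above and the timestamps are assumed to parse to datetime64):
import pandas as pd

# rebuild the example frame from the question, with timestamp as a proper datetime column
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2012-01-01', '2012-01-05', '2012-01-06', '2012-01-09', '2012-01-31',
        '2012-02-09', '2012-02-11', '2012-02-13', '2012-02-15', '2012-02-18']),
    'value': [3.0, 3.0, 6.0, 3.0, 1.0, 3.0, 1.0, 3.0, 2.0, 5.0],
})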

A simple, clean and elegant groupby will do the trick:
df['time_since_last_identical'] = df.groupby('value')['timestamp'].diff()
Gives:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
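If you need to compute something other than a time difference between the matched rows, you can first pull the previous identical row's timestamp with shift inside the same groupby and then combine the two columns however you like. A sketch along the same lines:
# previous timestamp with the same value, aligned back to the original rows
prev_ts = df.groupby('value')['timestamp'].shift()

# any pairwise computation between a row and its match is now a plain column operation;
# here it just reproduces the difference from above
df['time_since_last_identical'] = df['timestamp'] - prev_ts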

Here is a solution using pandas groupby:
out = df.groupby(df['value'])\
        .apply(lambda x: pd.to_datetime(x['timestamp'], format="%Y-%m-%d").diff())\
        .reset_index(level=0, drop=False)\
        .reindex(df.index)\
        .rename(columns={'timestamp': 'time_since_last_identical'})
out = pd.concat([df['timestamp'], out], axis=1)
That gives the following output:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
It does not exactly match your desired output, but I guess it is a matter of conventions (e.g. whether or not to include the current day). Happy to refine if you provide more details.

Related

Pandas create new date rows and forward fill column values for maximum next 3 consecutive months

I want to forward fill rows for the next 3 consecutive months, but stop if a new data row is available for the same ID within that 3-month window.
Here is a sample data
id date value1 Value2
1 2016-09-01 5 2
1 2016-11-01 7 15
2 2015-09-01 11 6
2 2015-12-01 13 4
2 2016-05-01 3 5
I would like to get
id date value1 value2
1 2016-09-01 5 2
1 2016-10-01 5 2
1 2016-11-01 7 15
1 2016-12-01 7 15
1 2017-01-01 7 15
1 2017-02-01 7 15
2 2015-09-01 11 6
2 2015-10-01 11 6
2 2015-11-01 11 6
2 2015-12-01 13 4
2 2016-01-01 13 4
2 2016-02-01 13 4
2 2016-03-01 13 4
2 2016-05-01 3 5
...
I tried a bunch of forward-fill methods and a cross join with the calendar, but couldn't figure it out.
Any help will be appreciated!
I think it might be done like this:
import pandas as pd
import datetime as dt

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'date': [
        dt.datetime.fromisoformat(s) for s in [
            '2016-09-01',
            '2016-11-01',
            '2015-09-01',
            '2015-12-01',
            '2016-05-01'
        ]
    ],
    'value1': [5, 7, 11, 13, 3],
    'value2': [2, 15, 6, 4, 5]
}).set_index('id')

result = []
for _id, data in df.groupby('id'):
    # monthly calendar from the group's first date to roughly 3 months past its last date
    tmp_df = pd.DataFrame({
        'date': pd.period_range(
            start=min(data.date),
            end=max(data.date + dt.timedelta(days=31 * 3)),
            freq='M'
        ).to_timestamp()
    })
    tmp_df = tmp_df.join(data.set_index('date'), on='date')
    tmp_df['id'] = _id
    result.append(tmp_df.set_index('id'))

result = pd.concat(result).ffill(limit=3).dropna()
print(result)
Result:
date value1 value2
id
1 2016-09-01 5.0 2.0
1 2016-10-01 5.0 2.0
1 2016-11-01 7.0 15.0
1 2016-12-01 7.0 15.0
1 2017-01-01 7.0 15.0
1 2017-02-01 7.0 15.0
2 2015-09-01 11.0 6.0
2 2015-10-01 11.0 6.0
2 2015-11-01 11.0 6.0
2 2015-12-01 13.0 4.0
2 2016-01-01 13.0 4.0
2 2016-02-01 13.0 4.0
2 2016-03-01 13.0 4.0
2 2016-05-01 3.0 5.0
2 2016-06-01 3.0 5.0
2 2016-07-01 3.0 5.0
2 2016-08-01 3.0 5.0
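A small note on the end=max(data.date + dt.timedelta(days=31 * 3)) line: 31 * 3 days is only an approximation of three months. If you prefer exact calendar months, pd.DateOffset could be used inside the loop instead (a sketch of the same period_range call, where data is the per-id group as above):
# monthly calendar extended by exactly three calendar months past the group's last date
months = pd.period_range(
    start=data['date'].min(),
    end=data['date'].max() + pd.DateOffset(months=3),
    freq='M'
).to_timestamp()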

Forward filling missing dates into Python Panel Pandas Dataframe

Suppose I have the following pandas dataframe:
df = pd.DataFrame({'Date':['2015-01-31','2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30', '2015-04-30'], 'ID':[1,2,2,2,1,2], 'value':[1,2,3,4,5,6]})
print(df)
Date ID value
2015-01-31 1 1
2015-01-31 2 2
2015-02-28 2 3
2015-03-31 2 4
2015-04-30 1 5
2015-04-30 2 6
I want to forward fill the data such that I have the values for each end of month till 2015-05-31 (i.e. for each date - ID combination). That is, I would like the dataframe to look as follows:
Date ID value
2015-01-31 1 1
2015-01-31 2 2
2015-02-28 2 3
2015-02-28 1 1
2015-03-31 2 4
2015-03-31 1 1
2015-04-30 1 5
2015-04-30 2 6
2015-05-31 1 5
2015-05-31 2 6
Is something like this possible? I saw several similar questions on Stack Overflow about forward filling dates; however, those were without an index column (where the same date can occur many times).
You can pivot, then fill the values with reindex + ffill (with Date converted to datetime first):
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot(*df.columns).reindex(pd.date_range('2015-01-31', periods=5, freq='M')).ffill().stack().reset_index()
out.columns = df.columns
out
Out[1077]:
Date ID value
0 2015-01-31 1 1.0
1 2015-01-31 2 2.0
2 2015-02-28 1 1.0
3 2015-02-28 2 3.0
4 2015-03-31 1 1.0
5 2015-03-31 2 4.0
6 2015-04-30 1 5.0
7 2015-04-30 2 6.0
8 2015-05-31 1 5.0
9 2015-05-31 2 6.0
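For readability, the same pipeline can be spelled out step by step (a sketch of the same logic, written with explicit keyword arguments for pivot):
# one column per ID, indexed by Date
wide = df.pivot(index='Date', columns='ID', values='value')

# extend the index to all month ends through 2015-05-31, then carry values forward
full_idx = pd.date_range('2015-01-31', periods=5, freq='M')
wide = wide.reindex(full_idx).ffill()

# back to long format with the original column names
out = wide.stack().reset_index()
out.columns = df.columns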
Another solution:
idx = pd.MultiIndex.from_product(
    [
        pd.date_range(df["Date"].min(), "2015-05-31", freq="M"),
        df["ID"].unique(),
    ],
    names=["Date", "ID"],
)
df = df.set_index(["Date", "ID"]).reindex(idx).groupby(level=1).ffill()
print(df.reset_index())
Prints:
Date ID value
0 2015-01-31 1 1.0
1 2015-01-31 2 2.0
2 2015-02-28 1 1.0
3 2015-02-28 2 3.0
4 2015-03-31 1 1.0
5 2015-03-31 2 4.0
6 2015-04-30 1 5.0
7 2015-04-30 2 6.0
8 2015-05-31 1 5.0
9 2015-05-31 2 6.0

Pandas Resample Monthly data to Weekly within Groups and Split Values

I have a dataframe, below:
ID Date Volume Sales
1 2020-02 10 4
1 2020-03 8 6
2 2020-02 6 8
2 2020-03 4 10
Is there an easy way to convert this to weekly data using resampling, and to divide the Volume and Sales columns by the number of weeks in the month?
I have started the process with code that looks like this:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
grouped = df.groupby('ID').resample('W').ffill().reset_index()
print(grouped)
After this step, I get an error message: cannot insert ID, already exists.
Also, is there a way to find the number of weeks in a month, so the Volume and Sales columns can be divided by it?
The Expected output is :
ID Volume Sales Weeks
0 1 2.5 1.0 2020-02-02
0 1 2.5 1.0 2020-02-09
0 1 2.5 1.0 2020-02-16
0 1 2.5 1.0 2020-02-23
1 1 1.6 1.2 2020-03-01
1 1 1.6 1.2 2020-03-08
1 1 1.6 1.2 2020-03-15
1 1 1.6 1.2 2020-03-22
1 1 1.6 1.2 2020-03-29
2 2 1.5 2 2020-02-02
2 2 1.5 2 2020-02-09
2 2 1.5 2 2020-02-16
2 2 1.5 2 2020-02-23
3 2 0.8 2 2020-03-01
3 2 0.8 2 2020-03-08
3 2 0.8 2 2020-03-15
3 2 0.8 2 2020-03-22
3 2 0.8 2 2020-03-29
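Before the solution, a quick note on the cannot insert ID, already exists error in your attempt: after groupby('ID').resample('W'), ID is likely present both as an index level and as a regular column, so the plain reset_index() collides with it. Selecting only the value columns before resampling should avoid that collision (a sketch that only addresses the error, not the weekly splitting):
# resample only the value columns, so ID lives solely in the index and reset_index() is safe
grouped = (df.set_index('Date')
             .groupby('ID')[['Volume', 'Sales']]
             .resample('W')
             .ffill()
             .reset_index())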
After review, a much simpler solution can be used. Please refer to the subsection labeled New Solution in Part 1 below.
This task requires multiple steps. Let's break it down as follows:
Part 1: Transform Date & Resample
New Solution
Since the required weekly frequency is Sunday-based (i.e. freq='W-SUN') and each month is independent of (not related to or affected by) its adjacent months, we can use the year-month values in column Date to generate weekly date ranges directly, in one step, rather than first generating daily date ranges from the year-month and then resampling those daily ranges to weekly afterwards.
The new logic just needs pd.date_range() with freq='W', together with pd.offsets.MonthEnd(), to generate the weekly dates for a month. It does not need to call .resample() or .asfreq() like the other solutions; effectively, pd.date_range() with freq='W' does the resampling for us.
Here is the code:
df['Weeks'] = df['Date'].map(lambda x:
    pd.date_range(
        start=pd.to_datetime(x),
        end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
        freq='W'))

df = df.explode('Weeks')
Result:
print(df)
ID Date Volume Sales Weeks
0 1 2020-02 10 4 2020-02-02
0 1 2020-02 10 4 2020-02-09
0 1 2020-02 10 4 2020-02-16
0 1 2020-02 10 4 2020-02-23
1 1 2020-03 8 6 2020-03-01
1 1 2020-03 8 6 2020-03-08
1 1 2020-03 8 6 2020-03-15
1 1 2020-03 8 6 2020-03-22
1 1 2020-03 8 6 2020-03-29
2 2 2020-02 6 8 2020-02-02
2 2 2020-02 6 8 2020-02-09
2 2 2020-02 6 8 2020-02-16
2 2 2020-02 6 8 2020-02-23
3 2 2020-03 4 10 2020-03-01
3 2 2020-03 4 10 2020-03-08
3 2 2020-03 4 10 2020-03-15
3 2 2020-03 4 10 2020-03-22
3 2 2020-03 4 10 2020-03-29
With the 2 lines of code above, we already get the required result for Part 1, without going through the more complicated .groupby() and .resample() code of the old solution.
We can then continue to Part 2. As we have not created the grouped object, we can either replace grouped with df in the Part 2 code, or add a line grouped = df before continuing.
Old Solution
We use pd.date_range() with freq='D', together with pd.offsets.MonthEnd(), to produce daily entries for the full month, then turn these full-month ranges into the index before resampling to weekly frequency. We resample with closed='left' to exclude the unwanted week of 2020-04-05 that the default resample() parameters would produce.
df['Weeks'] = df['Date'].map(lambda x:
    pd.date_range(
        start=pd.to_datetime(x),
        end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
        freq='D'))

df = df.explode('Weeks').set_index('Weeks')

grouped = (df.groupby(['ID', 'Date'], as_index=False)
             .resample('W', closed='left')
             .ffill().dropna().reset_index(-1))
Result:
print(grouped)
Weeks ID Date Volume Sales
0 2020-02-02 1.0 2020-02 10.0 4.0
0 2020-02-09 1.0 2020-02 10.0 4.0
0 2020-02-16 1.0 2020-02 10.0 4.0
0 2020-02-23 1.0 2020-02 10.0 4.0
1 2020-03-01 1.0 2020-03 8.0 6.0
1 2020-03-08 1.0 2020-03 8.0 6.0
1 2020-03-15 1.0 2020-03 8.0 6.0
1 2020-03-22 1.0 2020-03 8.0 6.0
1 2020-03-29 1.0 2020-03 8.0 6.0
2 2020-02-02 2.0 2020-02 6.0 8.0
2 2020-02-09 2.0 2020-02 6.0 8.0
2 2020-02-16 2.0 2020-02 6.0 8.0
2 2020-02-23 2.0 2020-02 6.0 8.0
3 2020-03-01 2.0 2020-03 4.0 10.0
3 2020-03-08 2.0 2020-03 4.0 10.0
3 2020-03-15 2.0 2020-03 4.0 10.0
3 2020-03-22 2.0 2020-03 4.0 10.0
3 2020-03-29 2.0 2020-03 4.0 10.0
Here, we retain the column Date for some use later.
Part 2: Divide Volume and Sales by number of weeks in month
Here, the number of weeks used to divide the Volume and Sales figures should actually be the number of resampled weeks within the month, as shown in the interim result above.
If we used the actual number of calendar weeks, then Feb 2020, which has 29 days because of the leap year, would span 5 weeks instead of the 4 resampled weeks in the interim result above. That would produce inconsistent results: there are only 4 weekly entries, yet each Volume and Sales figure would be divided by 5.
Let's go to the code then:
We group by columns ID and Date, then divide each value in columns Volume and Sales by the group size (i.e. the number of resampled weeks).
grouped[['Volume', 'Sales']] = (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                       .transform(lambda x: x / x.count()))
or, in simplified form using /=:
grouped[['Volume', 'Sales']] /= (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                        .transform('count'))
Result:
print(grouped)
Weeks ID Date Volume Sales
0 2020-02-02 1.0 2020-02 2.5 1.0
0 2020-02-09 1.0 2020-02 2.5 1.0
0 2020-02-16 1.0 2020-02 2.5 1.0
0 2020-02-23 1.0 2020-02 2.5 1.0
1 2020-03-01 1.0 2020-03 1.6 1.2
1 2020-03-08 1.0 2020-03 1.6 1.2
1 2020-03-15 1.0 2020-03 1.6 1.2
1 2020-03-22 1.0 2020-03 1.6 1.2
1 2020-03-29 1.0 2020-03 1.6 1.2
2 2020-02-02 2.0 2020-02 1.5 2.0
2 2020-02-09 2.0 2020-02 1.5 2.0
2 2020-02-16 2.0 2020-02 1.5 2.0
2 2020-02-23 2.0 2020-02 1.5 2.0
3 2020-03-01 2.0 2020-03 0.8 2.0
3 2020-03-08 2.0 2020-03 0.8 2.0
3 2020-03-15 2.0 2020-03 0.8 2.0
3 2020-03-22 2.0 2020-03 0.8 2.0
3 2020-03-29 2.0 2020-03 0.8 2.0
Optionally, you can do some cosmetic work to drop the column Date and move the column Weeks to your desired position.
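For example, that cosmetic step could look like this (a sketch, assuming grouped still holds the result from above and targeting the column order of the expected output):
# drop the helper Date column and put Weeks last
grouped = grouped.drop(columns='Date')[['ID', 'Volume', 'Sales', 'Weeks']]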
Edit: (Similarity to and difference from other questions resampling from month to week)
In this review, I searched for other questions with similar titles and compared their questions and solutions.
There is another question with a similar requirement to split monthly values equally into weekly values according to the number of weeks in the resampled month. In that question, the months are represented as the first date of the month, are in datetime format, and are used as the dataframe index, whereas in this question the months are represented as YYYY-MM, which can be of string type.
A big and critical difference is that in that question, the last month period index 2018-05-01 with value 22644 was not actually processed. That is, the month 2018-05 is not resampled into weeks in May 2018, and the value 22644 is never split into weekly proportions. The accepted solution using .asfreq() does not show any entry for 2018-05 at all, and the other solution using .resample() still keeps one (un-resampled) entry for 2018-05 whose value 22644 is not split into weekly proportions.
In our question here, however, the last month listed in each group still needs to be resampled into weeks, with its values split equally across the resampled weeks.
Looking at the solutions, my new solution calls neither .resample() nor .asfreq(). It just uses pd.date_range() with freq='W', together with pd.offsets.MonthEnd(), to generate the weekly dates for a month from the 'YYYY-MM' values. This is what I could not see when I worked on the old solution built around .resample().

Year to date average in dataframe

I have a dataframe for which I am trying to calculate the year-to-date average of my value columns. Below is a sample dataframe.
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
I want to create new columns (values_ytd & values2_ytd) that will average the values from January to the latest period within the same year (April in sample data). I will need to group the data by year & name when calculating the averages. I am looking for an output similar to this.
date name values values2 values2_ytd values_ytd
0 2019-01-01 a 1 1 1 1
1 2019-02-01 a 3 3 2 2
2 2019-03-01 a 2 2 2 2
3 2019-04-01 a 6 2 2 3
I have tried unsuccessfully to use expanding().mean(), but most likely I was doing it wrong. My main dataframe has numerous name categories and many more columns. Here is the code I was attempting to use
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).expanding().mean().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
but am receiving the following error.
NotImplementedError: ops for Expanding for this dtype datetime64[ns] are not implemented
Note: The code below works perfectly when substituting cumsum() for .expanding().mean() to create a year-to-date sum of the values, but I can't figure it out for averages
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).cumsum().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
Any help is greatly appreciated.
Try this:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
ytd = df.groupby([df.index.year, 'name'])[['values', 'values2']].expanding().mean().reset_index(level=[0, 1], drop=True)
df['values_ytd'] = ytd['values']
df['values2_ytd'] = ytd['values2']
df
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
Example using multiple names and years:
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
4 2019-01-01 b 1 4
5 2019-02-01 b 3 4
6 2020-01-01 a 1 1
7 2020-02-01 a 3 3
8 2020-03-01 a 2 2
9 2020-04-01 a 6 2
Output:
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
2019-01-01 b 1 4 1.0 4.0
2019-02-01 b 3 4 2.0 4.0
2020-01-01 a 1 1 1.0 1.0
2020-02-01 a 3 3 2.0 2.0
2020-03-01 a 2 2 2.0 2.0
2020-04-01 a 6 2 3.0 2.0
You could also set the date column as the index, df.set_index('date', inplace=True), and then use df.groupby('name').resample('AS').mean()

Elegant resample for groups in Pandas

For a given pandas data frame called full_df which looks like
index id timestamp data
------- ---- ------------ ------
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-05-01 9.0
The start and end dates (and the time delta between start and end) are varying.
But I need an id-wise resampled version (added rows marked with *):
index id timestamp data
------- ---- ------------ ------ ----
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-03-01 NaN *
4 1 2017-04-01 13.0
5 2 2017-02-01 1.0
6 2 2017-03-01 2.0
7 2 2017-04-01 NaN *
8 2 2017-05-01 9.0
Because the dataset is very large, I was wondering if there is a more efficient way of doing this than:
Do full_df.groupby('id')
For each group df, do
    df.index = pd.DatetimeIndex(df['timestamp'])
    all_days = pd.date_range(df.index.min(), df.index.max(), freq='MS')
    df = df.reindex(all_days)
Combine all groups again with a new index
That's time consuming and not very elegant. Any ideas?
Using resample
In [1175]: (df.set_index('timestamp').groupby('id').resample('MS').asfreq()
              .drop(columns=['id', 'index']).reset_index())
Out[1175]:
id timestamp data
0 1 2017-01-01 10.0
1 1 2017-02-01 11.0
2 1 2017-03-01 NaN
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-04-01 NaN
7 2 2017-05-01 9.0
Details
In [1176]: df
Out[1176]:
index id timestamp data
0 1 1 2017-01-01 10.0
1 2 1 2017-02-01 11.0
2 3 1 2017-04-01 13.0
3 4 2 2017-02-01 1.0
4 5 2 2017-03-01 2.0
5 6 2 2017-05-01 9.0
In [1177]: df.dtypes
Out[1177]:
index int64
id int64
timestamp datetime64[ns]
data float64
dtype: object
Edit to add: this approach uses the min/max dates across all of full_df, not per id. If there is wide variation in start/end dates between IDs, this will unfortunately inflate the dataframe, and @JohnGalt's method is better. Nevertheless, I'll leave this here as an alternative approach, as it ought to be faster than groupby/resample in cases where it is appropriate.
I think the most efficient approach is likely going to be with stack/unstack or melt/pivot.
You could do something like this, for example:
full_df.set_index(['timestamp','id']).unstack('id').stack('id',dropna=False)
index data
timestamp id
2017-01-01 1 1.0 10.0
2 NaN NaN
2017-02-01 1 2.0 11.0
2 4.0 1.0
2017-03-01 1 NaN NaN
2 5.0 2.0
2017-04-01 1 3.0 13.0
2 NaN NaN
2017-05-01 1 NaN NaN
2 6.0 9.0
Just add reset_index().set_index('id') if you want it to display more like how you have it above. Note in particular the use of dropna=False with stack which preserves the NaN placeholders. Without that, the stack/unstack method just leaves you back where you started.
This method automatically includes the min & max dates, and all dates present for at least one id. If there are interior timestamps missing for everyone, then you need to add a resample like this:
full_df.set_index(['timestamp','id']).unstack('id')\
       .resample('MS').mean()\
       .stack('id', dropna=False)
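To get from that wide, resampled frame back to a long layout comparable to the desired output, the same reshaping as before applies; roughly (a sketch that keeps the extra index column from full_df):
out = (full_df.set_index(['timestamp', 'id'])
       .unstack('id')
       .resample('MS').mean()
       .stack('id', dropna=False)
       .reset_index()
       .sort_values(['id', 'timestamp']))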
