Calculate percent change over time in dataframe with noncontiguous dates

I have a dataframe:
df
Name Date ID Amount
0 Faye 2018-12-31 A 2
1 Faye 2019-03-31 A 1
2 Faye 2019-06-30 B 5
3 Faye 2019-09-30 B 2
4 Faye 2019-09-30 A 4
5 Faye 2020-03-31 A 1
6 Mike 2018-12-31 A 4
7 Mike 2019-03-31 B 2
8 Mike 2019-03-31 C 1
9 Mike 2019-06-30 A 3
For each (Name, ID) group, I calculate the % change of Amount from the previous Date in a new column. If there is no previous entry, I fill in New:
df['% Change'] = (df.sort_values('Date').groupby(['Name', 'ID']).Amount.pct_change())
df['% Change'] = df['% Change'].fillna('New')
But if the ID disappears for a few Dates (e.g., ID A is present for Faye in 2018-12-31 and 2019-03-31, but disappears for the 2019-06-30 period), I want the next time it appears to show as New again so that the output looks like:
Name Date ID Amount % Change
0 Faye 2018-12-31 A 2 New
1 Faye 2019-03-31 A 1 -0.5
2 Faye 2019-06-30 B 5 New
3 Faye 2019-09-30 B 2 -0.6
4 Faye 2019-09-30 A 4 New
5 Faye 2020-03-31 A 1 New
6 Mike 2018-12-31 A 4 New
7 Mike 2019-03-31 B 2 New
8 Mike 2019-03-31 C 1 New
9 Mike 2019-06-30 A 3 New
How do I achieve this?

I think your expected result contains an error; sorting it makes that easier to see:
>>> expected.sort_values(['Name', 'ID', 'Date'])
Name Date ID Amount %_Change
0 Faye 2018-12-31 A 2 New
1 Faye 2019-03-31 A 1 -0.5
4 Faye 2019-09-30 A 4 New
5 Faye 2020-03-31 A 1 -0.75 <-- shouldn't this be "New" since 2019-12-31 was missing?
2 Faye 2019-06-30 B 5 New
3 Faye 2019-09-30 B 2 -0.6
6 Mike 2018-12-31 A 4 New
9 Mike 2019-06-30 A 3 New
7 Mike 2019-03-31 B 2 New
8 Mike 2019-03-31 C 1 New
With that said, you can write a custom percent_change function to resample each (Name, ID) group to quarterly before calculating the percentage change:
def percent_change(group):
    amount = group[['Date', 'Amount']].set_index('Date')
    return amount.pct_change(fill_method=None, freq='Q').rename(columns={'Amount': '% Change'})

result = (
    df.sort_values('Date')
    .groupby(['Name', 'ID'])
    .apply(percent_change)
    .fillna('New')
    .merge(df, how='right', on=['Name', 'ID', 'Date'])
)
Result:
Name ID Date % Change Amount
0 Faye A 2018-12-31 New 2
1 Faye A 2019-03-31 -0.5 1
2 Faye A 2019-09-30 New 4
3 Faye A 2020-03-31 New 1
4 Faye B 2019-06-30 New 5
5 Faye B 2019-09-30 -0.6 2
6 Mike A 2018-12-31 New 4
7 Mike A 2019-06-30 New 3
8 Mike B 2019-03-31 New 2
9 Mike C 2019-03-31 New 1
A faster version
For the record, I randomly generated a 2M-row dataframe with the following code:
import string
import numpy as np
import pandas as pd
n = 2_000_000
qend = pd.date_range('2000-01-01', '2019-12-31', freq='Q')
np.random.seed(42)
names = list(map(''.join, np.random.choice(list(string.ascii_uppercase), (n, 3))))
dates = np.random.choice(qend, n)
ids = np.random.choice(list(string.ascii_uppercase), n)
amounts = np.random.randint(1, 100, n)
df = pd.DataFrame({
    'Name': names,
    'Date': dates,
    'ID': ids,
    'Amount': amounts
})
This version assumes that all dates are quarter-ends and that they are of Timestamp data type. The explanation is in the comments:
# Make a sorted copy of the original dataframe
tmp = df.sort_values(['Name', 'ID', 'Date'])
# When we call a GroupBy, we lose the original index
# so let's keep a copy here
tmp['OriginalIndex'] = tmp.index
# Calculate day difference between consecutive rows.
# It is a lot faster than `groupby(...)['Date'].diff()`
# but it gives wrong result for the first row of each
# group. The first row of each group should be NaN. It's
# an easy fix and we will deal with it later
tmp['DayDiff'] = tmp['Date'].diff() / pd.Timedelta(days=1)
# This has the same problem as `DayDiff` above but you will
# see that it's irrelevant to our problem
tmp['% Change'] = tmp['Amount'].pct_change()
# The index of the first row in each group
first_indexes = tmp.groupby(['Name', 'ID'])['OriginalIndex'].first()
# Fix the issue in `DayDiff`: the first row of each group
# should be NaN
tmp.loc[first_indexes, 'DayDiff'] = np.nan
# Now this is the key to the whole problem: a quarter lasts a
# maximum of 92 days. If `DayDiff <= 92`, the percentage change
# formula applies. Otherwise, `DayDiff` is either NaN or >92.
# In both cases, the percentage change is NaN.
pct_change = tmp['% Change'].where(tmp['DayDiff'] <= 92, np.nan).fillna('New')
# Assign the result back to frame
df['% Change'] = pct_change
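
As a cross-check on the small sample from the question, the same idea can be sketched with an integer quarter index in place of the 92-day threshold. This is a minimal sketch rather than the answer's code: the helper column q and the variables gap and prev are illustrative names, and it assumes every Date falls on a quarter end.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Faye'] * 6 + ['Mike'] * 4,
    'Date': pd.to_datetime([
        '2018-12-31', '2019-03-31', '2019-06-30', '2019-09-30', '2019-09-30',
        '2020-03-31', '2018-12-31', '2019-03-31', '2019-03-31', '2019-06-30',
    ]),
    'ID': ['A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'C', 'A'],
    'Amount': [2, 1, 5, 2, 4, 1, 4, 2, 1, 3],
})

tmp = df.sort_values(['Name', 'ID', 'Date'])
# Hypothetical helper column: consecutive quarters differ by exactly 1
tmp['q'] = tmp['Date'].dt.year * 4 + tmp['Date'].dt.quarter

grouped = tmp.groupby(['Name', 'ID'])
gap = grouped['q'].diff()           # NaN on the first row of each group
prev = grouped['Amount'].shift()    # previous Amount within the same group

change = (tmp['Amount'] / prev - 1).astype(object)
# Keep the change only when the previous row is the immediately preceding quarter
df['% Change'] = change.where(gap == 1, 'New')

print(df)
```

Because the assignment aligns on the original index, the rows keep the order they had in the question.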

Related

Conduct the calculation only when the date value is valid

I have a data frame dft:
Date Total Value
02/01/2022 2
03/01/2022 6
N/A 4
03/11/2022 4
03/15/2022 4
05/01/2022 4
For each date in the data frame, I want to calculate how many days ago it was from today and add these values in a new column called Days.
I have tried the following code:
newdft = []
for item in dft:
    temp = item.copy()
    timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
    temp["Days"] = timediff.days
    newdft.append(temp)
But the third date value is N/A, which caused an error. What should I add to my code so that the calculation runs only when the date value is valid?

I would convert the whole Date column to datetime using pd.to_datetime() with errors set to 'coerce', which replaces the 'N/A' string with NaT (Not a Time):
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
So the column will now look like this:
0 2022-02-01
1 2022-03-01
2 NaT
3 2022-03-11
4 2022-03-15
5 2022-05-01
Name: Date, dtype: datetime64[ns]
You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:
dft['Days'] = datetime.now() - dft['Date']
This will make dft look like below:
Date Total Value Days
0 2022-02-01 2 148 days 15:49:03.406935
1 2022-03-01 6 120 days 15:49:03.406935
2 NaT 4 NaT
3 2022-03-11 4 110 days 15:49:03.406935
4 2022-03-15 4 106 days 15:49:03.406935
5 2022-05-01 4 59 days 15:49:03.406935
If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:
dft['Days'] = (datetime.now() - dft['Date']).dt.days
Which will give you:
Date Total Value Days
0 2022-02-01 2 148.0
1 2022-03-01 6 120.0
2 NaT 4 NaN
3 2022-03-11 4 110.0
4 2022-03-15 4 106.0
5 2022-05-01 4 59.0

In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.
df['Days'] = (
    df['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
)
Result:
Date Total Value Days
0 02/01/2022 2 148.0
1 03/01/2022 6 120.0
2 N/A 4 NaN
3 03/11/2022 4 110.0
4 03/15/2022 4 106.0
5 05/01/2022 4 59.0
And for what it's worth, another option is date.today() instead of datetime.now():
.apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())
And the result is a timedelta instead of float:
Date Total Value Days
0 02/01/2022 2 148 days
1 03/01/2022 6 120 days
2 N/A 4 NaT
3 03/11/2022 4 110 days
4 03/15/2022 4 106 days
5 05/01/2022 4 59 days
See also: How do I select rows from a DataFrame based on column values?

Following up on the excellent answer by Emi OB, I would suggest using DataFrame.mask() to update the dataframe without type coercion.
import datetime
import pandas as pd
dft = pd.DataFrame({
    'Date': [
        '02/01/2022',
        '03/01/2022',
        None,
        '03/11/2022',
        '03/15/2022',
        '05/01/2022'],
    'Total Value': [2,6,4,4,4,4]})
dft['today'] = datetime.datetime.now()
dft['Days'] = 0
# assign the masked result back instead of relying on a chained in-place update
dft['Days'] = dft['Days'].mask(dft['Date'].notna(),
                               (dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
                               axis=0)
dft.drop(columns=['today'], inplace=True)
This would result in integer values in the Days column:
Date Total Value Days
0 02/01/2022 2 148
1 03/01/2022 6 120
2 None 4 None
3 03/11/2022 4 110
4 03/15/2022 4 106
5 05/01/2022 4 59
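
If whole-number day counts with an explicit missing marker are the goal, pandas' nullable Int64 dtype is another route. The following is a minimal sketch, not taken from any of the answers: the variable parsed is an illustrative name, and today is pinned to 2022-06-29 (an assumption, chosen so the numbers match the outputs shown above) instead of calling datetime.now().

```python
import pandas as pd

dft = pd.DataFrame({
    'Date': ['02/01/2022', '03/01/2022', 'N/A',
             '03/11/2022', '03/15/2022', '05/01/2022'],
    'Total Value': [2, 6, 4, 4, 4, 4],
})

# Assumption: a fixed reference date so the result is reproducible
today = pd.Timestamp('2022-06-29')

# Invalid strings such as 'N/A' become NaT instead of raising
parsed = pd.to_datetime(dft['Date'], format='%m/%d/%Y', errors='coerce')

# .dt.days yields floats when NaT is present; Int64 keeps integers plus <NA>
dft['Days'] = (today - parsed).dt.days.astype('Int64')

print(dft)
```

The original Date strings are left untouched here; only the new Days column is derived from the parsed values.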

Assign weights to observations based on date difference and sequence condition

I already asked a question about the same problem, and @mozway helped a lot.
However, my logic for assigning the weights was wrong.
I need to build the following dataframe with a weight column:
id date status weight diff_in_days comment
-----------------------------------------------------------------
0 2 2019-02-03 reserve 0.003 0 1 / diff_days
1 2 2019-12-31 reserve 0.001 331 since diff to next is 1 day
2 2 2020-01-01 reserve 0.9 1 since the next date status is sold
3 2 2020-01-02 sold 1 1 sold
4 3 2020-01-03 reserve 0.001 0 since diff to next is 1 day
5 4 2020-01-03 booked 0.9 0 since the next date status is sold
6 3 2020-02-04 reserve 0.9 1 since the next date status is sold
7 4 2020-02-06 sold 1 3 sold
7 3 2020-02-07 sold 1 3 sold
To make diff_in_days column I use:
df['diff_in_days'] = df.groupby('flat_id')['date'].diff().dt.days.fillna(0)
Is there a way to implement this pseudo-code without a for-loop?
for i in df.iterrows():
    df['weight'][i] = 1 / df['diff_in_days'][i+1]
    if df['status'][i+1] == 'sold' (for each flat_id):
        df['weight'][i] = 0.9
    if df['status'][i] == 'sold':
        df['weight'][i] = 1

Managed to do it like this:
df.sort_values(['flat_id', 'date'], inplace=True)
# find the diff in days between dates for each flat_id and shift it one row back
s = df.groupby(['flat_id'])['date'].diff().dt.days.shift(-1)
# assign weights for rows with status == 'sold'
df['weight'] = np.where(df['status'].eq('sold'), s.max()+10, s.fillna(0))
# now find rows with status == 'sold' and shift back one row to find their predecessors
s1 = df["status"].eq("sold").shift(-1)
s1 = s1.fillna(False)
# assign them the second-highest weight
df.loc[s1, "weight"] = s.max()+5
df["weight"].ffill(inplace=True)
Final dataframe:
flat_id date status weight
4 2 2019-02-04 reserve 331.0
0 2 2020-01-01 reserve 336.0
1 2 2020-01-02 sold 341.0
2 3 2020-01-03 reserve 1.0
5 3 2020-01-04 reserve 336.0
7 3 2020-02-07 sold 341.0
3 4 2020-01-03 booked 336.0
6 4 2020-02-06 sold 341.0

Compute date difference in days in pandas

I've got a dataframe that looks like this
date id
0 2019-01-15 c-15-Jan-2019-0
1 2019-01-26 c-26-Jan-2019-1
2 2019-02-02 c-02-Feb-2019-2
3 2019-02-15 c-15-Feb-2019-3
4 2019-02-23 c-23-Feb-2019-4
and I'd like to create a new column called 'days_since' that shows the number of days that have gone by since the last record. For instance, the new column would be
date id days_since
0 2019-01-15 c-15-Jan-2019-0 NaN
1 2019-01-26 c-26-Jan-2019-1 11
2 2019-02-02 c-02-Feb-2019-2 7
3 2019-02-15 c-15-Feb-2019-3 13
4 2019-02-23 c-23-Feb-2019-4 8
I tried to use
df_c['days_since'] = df_c.groupby('id')['date'].diff().apply(lambda x: x.days)
but that just returned a column full of null values. The date column is full of datetime objects. Any ideas?

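I think you are making it too complicated. Every id in your sample is unique, so groupby('id') puts each row in its own group and diff() has no previous row to compare against, which is why you got only nulls. Given that the date column contains datetime data, you can use: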
>>> df['date'].diff()
0 NaT
1 11 days
2 7 days
3 13 days
4 8 days
Name: date, dtype: timedelta64[ns]
or if you want the number of days:
>>> df['date'].diff().dt.days
0 NaN
1 11.0
2 7.0
3 13.0
4 8.0
Name: date, dtype: float64
So you can assign the number of days with:
df['days_since'] = df['date'].diff().dt.days
This gives us:
>>> df
date days_since
0 2019-01-15 NaN
1 2019-01-26 11.0
2 2019-02-02 7.0
3 2019-02-15 13.0
4 2019-02-23 8.0
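
The all-null column from the original attempt can be reproduced side by side with the plain diff. This is a small sketch on the sample data; the names per_id and overall are illustrative only.

```python
import pandas as pd

df_c = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-15', '2019-01-26', '2019-02-02',
                            '2019-02-15', '2019-02-23']),
    'id': ['c-15-Jan-2019-0', 'c-26-Jan-2019-1', 'c-02-Feb-2019-2',
           'c-15-Feb-2019-3', 'c-23-Feb-2019-4'],
})

# Each id occurs once, so every group has a single row and nothing to diff against
per_id = df_c.groupby('id')['date'].diff().dt.days

# Diffing the whole column compares each row with the one before it
overall = df_c['date'].diff().dt.days

print(per_id.isna().all())
print(overall.tolist())
```

Grouping only makes sense here if several rows share the same key, for example a customer or category column.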

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column that shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data:
expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.

You can do this as follows:
# define a function that applies a rolling-window calculation,
# taking the difference between the current value and the
# previous one
def calc_mrp(ser):
    # in case you want the relative change, just
    # divide by x[1] or x[0] in the lambda function
    # raw=True passes a NumPy array, so x[0] and x[1] are positional
    return ser.rolling(window=2).apply(lambda x: x[1]-x[0], raw=True)
# apply this to the grouped 'product_mrp' column
# and store the result in a new column; transform keeps
# the original row index so the assignment lines up
df['mrp_change']=df.groupby('product_id')['product_mrp'].transform(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
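
For a plain difference between consecutive sales of the same product, the grouped diff() method gives the same numbers without a rolling window. This is an alternative sketch, not the method used in the answer above; it reuses the product_id and product_mrp values from the example frame and assumes the rows are already in sale order.

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': [2, 0, 1, 0, 2, 4, 4, 3, 0, 4],
    'product_mrp': [647.169280, 500.641188, 647.789399, 381.278167,
                    373.685000, 553.472850, 634.482718, 536.760482,
                    690.242274, 500.515521],
})

# Difference to the previous row of the same product; NaN for a product's first sale
df['mrp_change'] = df.groupby('product_id')['product_mrp'].diff()

print(df)
```

If the frame is not ordered by sale_date, sorting it first keeps "previous" meaning the previous sale in time.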

pandas - Copy each row 'n' times depending on column value

I'd like to copy or duplicate the rows of a DataFrame based on the value of a column, in this case orig_qty. So if I have a DataFrame and using pandas==0.24.2:
import pandas as pd
d = {'a': ['2019-04-08', 4, 115.00], 'b': ['2019-04-09', 2, 103.00]}
df = pd.DataFrame.from_dict(
    d,
    orient='index',
    columns=['date', 'orig_qty', 'price']
)
Input
>>> print(df)
date orig_qty price
a 2019-04-08 4 115.0
b 2019-04-09 2 103.0
So in the example above the row with orig_qty=4 should be duplicated 4 times and the row with orig_qty=2 should be duplicated 2 times. After this transformation I'd like a DataFrame that looks like:
Desired Output
>>> print(new_df)
date orig_qty price fifo_qty
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-08 4 115.0 1
5 2019-04-09 2 103.0 1
6 2019-04-09 2 103.0 1
Note I do not really care about the index after the transformation. I can elaborate more on the use case for this, but essentially I'm doing some FIFO accounting where important changes can occur between values of orig_qty.

Use Index.repeat, DataFrame.loc, DataFrame.assign and DataFrame.reset_index
new_df = df.loc[df.index.repeat(df['orig_qty'])].assign(fifo_qty=1).reset_index(drop=True)
[output]
date orig_qty price fifo_qty
0 2019-04-08 4 115.0 1
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-09 2 103.0 1
5 2019-04-09 2 103.0 1
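
The same one-liner, spread over several lines as a self-contained sketch. The frame is rebuilt column by column here (an assumption made only so that orig_qty is certain to be an integer column); otherwise the calls are the ones named above.

```python
import pandas as pd

df = pd.DataFrame(
    {'date': ['2019-04-08', '2019-04-09'],
     'orig_qty': [4, 2],
     'price': [115.0, 103.0]},
    index=['a', 'b'],
)

new_df = (
    df.loc[df.index.repeat(df['orig_qty'])]  # label 'a' four times, 'b' twice
    .assign(fifo_qty=1)                      # constant column for the FIFO step
    .reset_index(drop=True)                  # discard the duplicated labels
)

print(new_df)
```

A row whose orig_qty is 0 is repeated zero times, so it drops out of the result.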

Use np.repeat
new_df = pd.DataFrame({col: np.repeat(df[col], df.orig_qty) for col in df.columns})
