I have a dataframe:
df
Name Date ID Amount
0 Faye 2018-12-31 A 2
1 Faye 2019-03-31 A 1
2 Faye 2019-06-30 B 5
3 Faye 2019-09-30 B 2
4 Faye 2019-09-30 A 4
5 Faye 2020-03-31 A 1
6 Mike 2018-12-31 A 4
7 Mike 2019-03-31 B 2
8 Mike 2019-03-31 C 1
9 Mike 2019-06-30 A 3
For each (Name, ID) group, I calculate the % change of Amount from the previous Date in a new column. If there is no previous entry, I fill in New:
df['% Change'] = (df.sort_values('Date').groupby(['Name', 'ID']).Amount.pct_change())
df['% Change'] = df['% Change'].fillna('New')
But if the ID disappears for a few Dates (e.g., ID A is present for Faye in 2018-12-31 and 2019-03-31, but disappears for the 2019-06-30 period), I want the next time it appears to show as New again so that the output looks like:
Name Date ID Amount % Change
0 Faye 2018-12-31 A 2 New
1 Faye 2019-03-31 A 1 -0.5
2 Faye 2019-06-30 B 5 New
3 Faye 2019-09-30 B 2 -0.6
4 Faye 2019-09-30 A 4 New
5 Faye 2020-03-31 A 1 New
6 Mike 2018-12-31 A 4 New
7 Mike 2019-03-31 B 2 New
8 Mike 2019-03-31 C 1 New
9 Mike 2019-06-30 A 3 New
How do I achieve this?
I think your expected result contains an error; sorting it makes it easier to see:
>>> expected.sort_values(['Name', 'ID', 'Date'])
Name Date ID Amount %_Change
0 Faye 2018-12-31 A 2 New
1 Faye 2019-03-31 A 1 -0.5
4 Faye 2019-09-30 A 4 New
5 Faye 2020-03-31 A 1 -0.75 <-- shouldn't this be "New" since 2019-12-31 was missing?
2 Faye 2019-06-30 B 5 New
3 Faye 2019-09-30 B 2 -0.6
6 Mike 2018-12-31 A 4 New
9 Mike 2019-06-30 A 3 New
7 Mike 2019-03-31 B 2 New
8 Mike 2019-03-31 C 1 New
With that said, you can write a custom percent_change function that resamples each (Name, ID) group to quarterly frequency before calculating the percentage change:
def percent_change(group):
    amount = group[['Date', 'Amount']].set_index('Date')
    return amount.pct_change(fill_method=None, freq='Q').rename(columns={'Amount': '% Change'})

result = (
    df.sort_values('Date')
    .groupby(['Name', 'ID'])
    .apply(percent_change)
    .fillna('New')
    .merge(df, how='right', on=['Name', 'ID', 'Date'])
)
Result:
Name ID Date % Change Amount
0 Faye A 2018-12-31 New 2
1 Faye A 2019-03-31 -0.5 1
2 Faye A 2019-09-30 New 4
3 Faye A 2020-03-31 New 1
4 Faye B 2019-06-30 New 5
5 Faye B 2019-09-30 -0.6 2
6 Mike A 2018-12-31 New 4
7 Mike A 2019-06-30 New 3
8 Mike B 2019-03-31 New 2
9 Mike C 2019-03-31 New 1
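As an alternative to resampling, the gap check can be sketched directly: number the quarters, shift within each (Name, ID) group, and keep the percentage change only when the previous row is exactly one quarter earlier. This is a minimal sketch, assuming Date is already a datetime column; the helper name pct_change_consecutive_quarters and the temporary Quarter column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Faye"] * 6 + ["Mike"] * 4,
    "Date": pd.to_datetime([
        "2018-12-31", "2019-03-31", "2019-06-30", "2019-09-30", "2019-09-30",
        "2020-03-31", "2018-12-31", "2019-03-31", "2019-03-31", "2019-06-30",
    ]),
    "ID": ["A", "A", "B", "B", "A", "A", "A", "B", "C", "A"],
    "Amount": [2, 1, 5, 2, 4, 1, 4, 2, 1, 3],
})


def pct_change_consecutive_quarters(frame):
    """Hypothetical helper: % change vs. the previous quarter, else 'New'."""
    ordered = frame.sort_values(["Name", "ID", "Date"])
    # Give every quarter a running number so that a gap shows up as a jump > 1
    ordered = ordered.assign(
        Quarter=ordered["Date"].dt.year * 4 + ordered["Date"].dt.quarter
    )
    # Previous row's quarter number and amount within each (Name, ID) group
    previous = ordered.groupby(["Name", "ID"])[["Quarter", "Amount"]].shift()
    change = ordered["Amount"] / previous["Amount"] - 1
    # False for the first row of a group (NaN) and for any skipped quarter
    is_consecutive = (ordered["Quarter"] - previous["Quarter"]).eq(1)
    labelled = change.astype(object).where(is_consecutive, "New")
    # Return the values in the caller's original row order
    return labelled.reindex(frame.index)


df["% Change"] = pct_change_consecutive_quarters(df)
```

Because the comparison is done on quarter numbers rather than on day counts, it should not depend on how many days a particular quarter has.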
A faster version
For the record, I randomly generated a 2M-row dataframe with the following code:
import string

import numpy as np
import pandas as pd

n = 2_000_000
qend = pd.date_range('2000-01-01', '2019-12-31', freq='Q')
np.random.seed(42)
names = list(map(''.join, np.random.choice(list(string.ascii_uppercase), (n, 3))))
dates = np.random.choice(qend, n)
ids = np.random.choice(list(string.ascii_uppercase), n)
amounts = np.random.randint(1, 100, n)
df = pd.DataFrame({
    'Name': names,
    'Date': dates,
    'ID': ids,
    'Amount': amounts
})
The code below assumes all dates are quarter-ends and that they are of Timestamp data type. The explanation is provided in the comments.
# Make a sorted copy of the original dataframe
tmp = df.sort_values(['Name', 'ID', 'Date'])
# When we call a GroupBy, we lose the original index
# so let's keep a copy here
tmp['OriginalIndex'] = tmp.index
# Calculate day difference between consecutive rows.
# It is a lot faster than `groupby(...)['Date'].diff()`
# but it gives wrong result for the first row of each
# group. The first row of each group should be NaN. It's
# an easy fix and we will deal with it later
tmp['DayDiff'] = tmp['Date'].diff() / pd.Timedelta(days=1)
# This has the same problem as `DayDiff` above but you will
# see that it's irrelevant to our problem
tmp['% Change'] = tmp['Amount'].pct_change()
# The index of the first row in each group
first_indexes = tmp.groupby(['Name', 'ID'])['OriginalIndex'].first()
# Fix the issue in `DayDiff`: the first row of each group
# should be NaN
tmp.loc[first_indexes, 'DayDiff'] = np.nan
# Now this is the key to the whole problem: a quarter lasts a
# maximum of 92 days. If `DayDiff <= 92`, the percentage change
# formula applies. Otherwise, `DayDiff` is either NaN or >92.
# In both cases, the percentage change is NaN.
pct_change = tmp['% Change'].where(tmp['DayDiff'] <= 92, np.nan).fillna('New')
# Assign the result back to frame
df['% Change'] = pct_change
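The 92-day threshold can be checked with a few lines. The sketch below measures the gaps between calendar quarter-ends over the same 2000 to 2019 range; the variable names are illustrative only.

```python
import pandas as pd

# Calendar quarter-ends (March, June, September, December)
quarter_ends = pd.Series(
    pd.date_range("2000-01-01", "2019-12-31", freq=pd.offsets.QuarterEnd())
)

# Days between one quarter-end and the next
gaps = quarter_ends.diff().dt.days.dropna()
shortest_gap, longest_gap = int(gaps.min()), int(gaps.max())

# Days between quarter-ends that are two quarters apart (one quarter skipped)
two_quarter_gaps = (quarter_ends - quarter_ends.shift(2)).dt.days.dropna()
shortest_skip = int(two_quarter_gaps.min())

print(shortest_gap, longest_gap, shortest_skip)
```

Consecutive quarter-ends are never more than 92 days apart, while a skipped quarter always produces a gap well above that, so the DayDiff <= 92 test separates the two cases.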
Related
I have a data frame dft:
Date Total Value
02/01/2022 2
03/01/2022 6
N/A 4
03/11/2022 4
03/15/2022 4
05/01/2022 4
For each date in the data frame, I want to calculate how many days it is from today and add these values in a new column called Days.
I have tried the following code:
newdft = []
for item in dft:
temp = item.copy()
timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
temp["Days"] = timediff.days
newdft.append(temp)
But the third date value is N/A, which caused an error. What should I add to my code so that the calculation is only performed when the date value is valid?
I would convert the whole Date column to datetime using pd.to_datetime() with errors set to 'coerce', which replaces the 'N/A' string with NaT (Not a Time):
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
So the column will now look like this:
0 2022-02-01
1 2022-03-01
2 NaT
3 2022-03-11
4 2022-03-15
5 2022-05-01
Name: Date, dtype: datetime64[ns]
You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:
dft['Days'] = datetime.now() - dft['Date']
This will make dft look like below:
Date Total Value Days
0 2022-02-01 2 148 days 15:49:03.406935
1 2022-03-01 6 120 days 15:49:03.406935
2 NaT 4 NaT
3 2022-03-11 4 110 days 15:49:03.406935
4 2022-03-15 4 106 days 15:49:03.406935
5 2022-05-01 4 59 days 15:49:03.406935
If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:
dft['Days'] = (datetime.now() - dft['Date']).dt.days
Which will give you:
Date Total Value Days
0 2022-02-01 2 148.0
1 2022-03-01 6 120.0
2 NaT 4 NaN
3 2022-03-11 4 110.0
4 2022-03-15 4 106.0
5 2022-05-01 4 59.0
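Put together, the approach can be sketched as a self-contained example. It assumes a fixed reference date (2022-06-29, chosen here only so the output is reproducible; use pd.Timestamp.now() for the real thing) and, as an optional extra, converts the result to pandas' nullable Int64 dtype so the day counts stay integers despite the missing value.

```python
import pandas as pd

dft = pd.DataFrame({
    "Date": ["02/01/2022", "03/01/2022", "N/A", "03/11/2022", "03/15/2022", "05/01/2022"],
    "Total Value": [2, 6, 4, 4, 4, 4],
})

# Assumed fixed "today" so that the example gives the same numbers every run
reference = pd.Timestamp("2022-06-29")

# Anything that does not match the format (here 'N/A') becomes NaT
parsed = pd.to_datetime(dft["Date"], format="%m/%d/%Y", errors="coerce")

# NaT propagates through the subtraction; Int64 keeps integers plus <NA>
dft["Days"] = (reference - parsed).dt.days.astype("Int64")
print(dft)
```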
In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.
dft['Days'] = (
    dft['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
)
Result:
Date Total Value Days
0 02/01/2022 2 148.0
1 03/01/2022 6 120.0
2 N/A 4 NaN
3 03/11/2022 4 110.0
4 03/15/2022 4 106.0
5 05/01/2022 4 59.0
And for what it's worth, another option is date.today() instead of datetime.now():
.apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())
And the result is a timedelta instead of float:
Date Total Value Days
0 02/01/2022 2 148 days
1 03/01/2022 6 120 days
2 N/A 4 NaT
3 03/11/2022 4 110 days
4 03/15/2022 4 106 days
5 05/01/2022 4 59 days
See also: How do I select rows from a DataFrame based on column values?
Following up on the excellent answer by Emi OB I would suggest using DataFrame.mask() to update the dataframe without type coercion.
import datetime
import pandas as pd
dft = pd.DataFrame({'Date': [
    '02/01/2022',
    '03/01/2022',
    None,
    '03/11/2022',
    '03/15/2022',
    '05/01/2022'],
    'Total Value': [2,6,4,4,4,4]})
dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'].mask(dft['Date'].notna(),
                 (dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
                 axis=0, inplace=True)
dft.drop(columns=['today'], inplace=True)
This would result in integer values in the Days column:
Date Total Value Days
0 02/01/2022 2 148
1 03/01/2022 6 120
2 None 4 None
3 03/11/2022 4 110
4 03/15/2022 4 106
5 05/01/2022 4 59
I already asked a question about the same problem, and @mozway helped a lot.
However my logic on weights assignment was wrong.
I need to form the following dataframe w/ weight column:
id date status weight diff_in_days comment
-----------------------------------------------------------------
0 2 2019-02-03 reserve 0.003 0 1 / diff_days
1 2 2019-12-31 reserve 0.001 331 since diff to next is 1 day
2 2 2020-01-01 reserve 0.9 1 since the next date status is sold
3 2 2020-01-02 sold 1 1 sold
4 3 2020-01-03 reserve 0.001 0 since diff to next is 1 day
5 4 2020-01-03 booked 0.9 0 since the next date status is sold
6 3 2020-02-04 reserve 0.9 1 since the next date status is sold
7 4 2020-02-06 sold 1 3 sold
7 3 2020-02-07 sold 1 3 sold
To make diff_in_days column I use:
df['diff_in_days'] = df.groupby('flat_id')['date'].diff().dt.days.fillna(0)
Is there a way to implement this pseudo-code without a for-loop:
for i in df.iterrows():
    df['weight'][i] = 1 / df['diff_in_days'][i+1]
    if df['status'][i+1] == 'sold' (for each flat_id):
        df['weight'][i] = 0.9
    if df['status'][i] == 'sold':
        df['weight'][i] = 1
Managed to do it like this:
df.sort_values(['flat_id', 'date'], inplace=True)
find diff in days between dates for flat_ids and shift it one row back
s = df.groupby(['flat_id'])['date'].diff().dt.days.shift(-1)
assign weights for flat_ids with status == 'sold'
df['weight'] = np.where(df['status'].eq('sold'),s.max()+10, s.fillna(0))
now find rows with status == sold and shift back one row to find their predecessors
s1 = df["status"].eq("sold").shift(-1)
s1 = s1.fillna(False)
assign them second maximum weights
df.loc[s1, "weight"] = s.max()+5
df["weight"].ffill(inplace=True)
final dataframe
flat_id date status weight
4 2 2019-02-04 reserve 331.0
0 2 2020-01-01 reserve 336.0
1 2 2020-01-02 sold 341.0
2 3 2020-01-03 reserve 1.0
5 3 2020-01-04 reserve 336.0
7 3 2020-02-07 sold 341.0
3 4 2020-01-03 booked 336.0
6 4 2020-02-06 sold 341.0
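For comparison, the pseudo-code from the question can also be followed literally without a loop. This is only a sketch of that pseudo-code (1 / days to the next row, 0.9 when the next row of the same flat_id is sold, 1 for sold rows), not of the weights in the answer above; the small sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data in the same shape as the question's frame
df = pd.DataFrame({
    "flat_id": [2, 2, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2019-02-03", "2019-12-31", "2020-01-01", "2020-01-02",
        "2020-01-03", "2020-02-04",
    ]),
    "status": ["reserve", "reserve", "reserve", "sold", "reserve", "reserve"],
})

df = df.sort_values(["flat_id", "date"])
grouped = df.groupby("flat_id")

# Look one row ahead inside each flat_id
days_to_next = (grouped["date"].shift(-1) - df["date"]).dt.days
next_is_sold = grouped["status"].shift(-1).eq("sold")
is_sold = df["status"].eq("sold")

# Later rules in the pseudo-code win, so they are checked first here
df["weight"] = np.where(
    is_sold, 1.0,
    np.where(next_is_sold, 0.9, 1 / days_to_next),
)
print(df)
```

The last row of a flat_id that was never sold has no next row and ends up as NaN, and two rows on the same day would give an infinite weight, so both cases need a rule of their own.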
I've got a dataframe that looks like this
date id
0 2019-01-15 c-15-Jan-2019-0
1 2019-01-26 c-26-Jan-2019-1
2 2019-02-02 c-02-Feb-2019-2
3 2019-02-15 c-15-Feb-2019-3
4 2019-02-23 c-23-Feb-2019-4
and I'd like to create a new column called 'days_since' that shows the number of days that have gone by since the last record. For instance, the new column would be
date id days_since
0 2019-01-15 c-15-Jan-2019-0 NaN
1 2019-01-26 c-26-Jan-2019-1 11
2 2019-02-02 c-02-Feb-2019-2 7
3 2019-02-15 c-15-Feb-2019-3 13
4 2019-02-23 c-23-Feb-2019-4 8
I tried to use
df_c['days_since'] = df_c.groupby('id')['date'].diff().apply(lambda x: x.days)
but that just returned a column full of null values. The date column is full of datetime objects. Any ideas?
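I think you are making it too complicated. Grouping by id cannot work here because every id is unique, so each group holds a single row and its diff is always NaT. Given that the date column contains datetime data, you can take the difference over the whole column: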
>>> df['date'].diff()
0 NaT
1 11 days
2 7 days
3 13 days
4 8 days
Name: date, dtype: timedelta64[ns]
or if you want the number of days:
>>> df['date'].diff().dt.days
0 NaN
1 11.0
2 7.0
3 13.0
4 8.0
Name: date, dtype: float64
So you can assign the number of days with:
df['days_since'] = df['date'].diff().dt.days
This gives us:
>>> df
date days_since
0 2019-01-15 NaN
1 2019-01-26 11.0
2 2019-02-02 7.0
3 2019-02-15 13.0
4 2019-02-23 8.0
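A small reproduction shows both behaviours side by side. It rebuilds the frame from the question (df_c is the name used there) and compares the grouped diff with the plain one.

```python
import pandas as pd

df_c = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-01-15", "2019-01-26", "2019-02-02", "2019-02-15", "2019-02-23",
    ]),
    "id": [
        "c-15-Jan-2019-0", "c-26-Jan-2019-1", "c-02-Feb-2019-2",
        "c-15-Feb-2019-3", "c-23-Feb-2019-4",
    ],
})

# Every id is unique, so each group has one row and nothing to diff against
per_id = df_c.groupby("id")["date"].diff()
print(per_id.isna().all())

# Diffing the whole column compares each row with the one before it
df_c["days_since"] = df_c["date"].diff().dt.days
print(df_c)
```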
I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data:
expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling-window calculation,
# taking the difference between the current value and the
# previous one
def calc_mrp(ser):
    # in case you want the relative change, just
    # divide by x[1] or x[0] in the lambda function;
    # raw=True passes a NumPy array, so x[0] and x[1] are positional
    return ser.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True)

# apply this to the grouped 'product_mrp' column
# and store the result in a new column;
# group_keys=False keeps the original index so the assignment aligns
df['mrp_change'] = df.groupby('product_id', group_keys=False)['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
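The same per-product change can likely be had more directly from the built-in grouped diff() and pct_change(), which avoids the rolling window altogether. The sketch below uses a tiny made-up frame (orders) with the column names from the example above.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the example above
orders = pd.DataFrame({
    "product_id": [2, 0, 1, 0, 2],
    "product_mrp": [100.0, 50.0, 70.0, 40.0, 130.0],
    "sale_date": pd.to_datetime([
        "2019-08-23", "2019-08-24", "2019-08-25", "2019-08-26", "2019-08-27",
    ]),
})

# Make sure "previous" means the previous sale in time
orders = orders.sort_values("sale_date")
by_product = orders.groupby("product_id")["product_mrp"]

# Absolute and relative change versus the previous sale of the same product
orders["mrp_change"] = by_product.diff()
orders["mrp_pct_change"] = by_product.pct_change()
print(orders)
```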
I'd like to copy or duplicate the rows of a DataFrame based on the value of a column, in this case orig_qty. So if I have a DataFrame and using pandas==0.24.2:
import pandas as pd
d = {'a': ['2019-04-08', 4, 115.00], 'b': ['2019-04-09', 2, 103.00]}
df = pd.DataFrame.from_dict(
d,
orient='index',
columns=['date', 'orig_qty', 'price']
)
Input
>>> print(df)
date orig_qty price
a 2019-04-08 4 115.0
b 2019-04-09 2 103.0
So in the example above the row with orig_qty=4 should be duplicated 4 times and the row with orig_qty=2 should be duplicated 2 times. After this transformation I'd like a DataFrame that looks like:
Desired Output
>>> print(new_df)
date orig_qty price fifo_qty
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-08 4 115.0 1
5 2019-04-09 2 103.0 1
6 2019-04-09 2 103.0 1
Note I do not really care about the index after the transformation. I can elaborate more on the use case for this, but essentially I'm doing some FIFO accounting where important changes can occur between values of orig_qty.
Use Index.repeat, DataFrame.loc, DataFrame.assign and DataFrame.reset_index
new_df = df.loc[df.index.repeat(df['orig_qty'])].assign(fifo_qty=1).reset_index(drop=True)
[output]
date orig_qty price fifo_qty
0 2019-04-08 4 115.0 1
1 2019-04-08 4 115.0 1
2 2019-04-08 4 115.0 1
3 2019-04-08 4 115.0 1
4 2019-04-09 2 103.0 1
5 2019-04-09 2 103.0 1
Use np.repeat
new_df = pd.DataFrame({col: np.repeat(df[col], df.orig_qty) for col in df.columns})
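Both variants can be tried on the sample data as follows. This sketch builds the frame column by column, converts to NumPy arrays before calling np.repeat, and adds fifo_qty to the NumPy version as well (the one-liner above leaves it out); the names by_index and by_numpy are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2019-04-08", "2019-04-09"],
        "orig_qty": [4, 2],
        "price": [115.0, 103.0],
    },
    index=["a", "b"],
)

# Variant 1: repeat the index labels, then select the rows by label
by_index = (
    df.loc[df.index.repeat(df["orig_qty"])]
    .assign(fifo_qty=1)
    .reset_index(drop=True)
)

# Variant 2: repeat each column's values with NumPy
counts = df["orig_qty"].to_numpy()
by_numpy = pd.DataFrame(
    {col: np.repeat(df[col].to_numpy(), counts) for col in df.columns}
).assign(fifo_qty=1)

print(by_index)
```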