Within group move date to same date in prior year - python

I have a pandas dataframe which looks like this
import pandas as pd
from datetime import date

pd.DataFrame({'a': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust3'],
              'date': [date(2017, 12, 15), date(2018, 12, 20), date(2020, 1, 10), date(2017, 12, 15),
                       date(2018, 12, 10), date(2017, 1, 5), date(2018, 1, 15), date(2019, 2, 20)],
              'c': [5, 6, 7, 4, 8, 6, 5, 9]})
a date c
0 cust1 2017-12-15 5
1 cust1 2018-12-20 6
2 cust1 2020-01-10 7
3 cust2 2017-12-15 4
4 cust2 2018-12-10 8
5 cust3 2017-01-05 6
6 cust3 2018-01-15 5
7 cust3 2019-02-20 9
'a' = customer
'date' = date when customer paid
'c' = amount customer paid
I need to check whether the customer paid in each year, but for customers who historically paid in December and in later years paid in January, I would like to change the January date to a December date. Looking at cust1: historically she paid in December, but she missed the December 2019 payment and paid in January 2020 instead. I would like to move that date to the same day in December of the prior year.
Note: my dataframe has thousands more customers, with pay dates all through the year, but I specifically want to apply the above rule only where payments were historically made in December but in later years are being made in January.
my resulting dataframe should look like this:
a date c
0 cust1 2017-12-15 5
1 cust1 2018-12-20 6
2 cust1 2019-12-10 7
3 cust2 2017-12-15 4
4 cust2 2018-12-10 8
5 cust3 2017-01-05 6
6 cust3 2018-01-15 5
7 cust3 2019-02-20 9
EDIT
my dataframe is slightly more complex than initially described above, the complexity being that a customer can make several payments during any one year:
a date c
0 cust1 2017-06-15 5
1 cust1 2017-12-15 5
2 cust1 2018-06-15 6
3 cust1 2019-01-20 6
4 cust1 2019-06-15 7
5 cust1 2020-01-10 7
6 cust1 2020-06-12 8
7 cust2 2017-12-15 4
8 cust2 2018-12-10 8
9 cust3 2017-01-05 6
10 cust3 2018-01-15 5
11 cust3 2019-02-20 9
So looking at cust1, she always makes two payments during the year, but the December 2018 payment was only made in January 2019. I would like to adjust a January date to a December date whenever the payments in prior years were made in December, including in any subsequent years where there is a January payment.
so my resulting dataframe should look like this:
a date c newDate
0 cust1 2017-06-15 5 2017-06-15
1 cust1 2017-12-15 5 2017-12-15
2 cust1 2018-06-15 6 2018-06-15
3 cust1 2019-01-20 6 2018-12-20
4 cust1 2019-06-15 7 2019-06-15
5 cust1 2020-01-10 7 2019-12-10
6 cust1 2020-06-12 8 2020-06-12
7 cust2 2017-12-15 4 2017-12-15
8 cust2 2018-12-10 8 2018-12-10
9 cust3 2017-01-05 6 2017-01-05
10 cust3 2018-01-15 5 2018-01-15
11 cust3 2019-02-20 9 2019-02-20
I tried the following incorporating some of the suggestions below:
import numpy as np
import pandas as pd
from datetime import date

df = pd.DataFrame({'a': ['cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust3'],
                   'date': [date(2017, 6, 15), date(2017, 12, 15), date(2018, 6, 15), date(2019, 1, 20), date(2019, 6, 15), date(2020, 1, 10), date(2020, 6, 12), date(2017, 12, 15), date(2018, 12, 10), date(2017, 1, 5), date(2018, 1, 15), date(2019, 2, 20)],
                   'c': [5, 5, 6, 6, 7, 7, 8, 4, 8, 6, 5, 9]})
year_end_month = [1, 12]
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_2 = df.loc[df['date'].dt.month.isin(year_end_month)].copy()
df_3 = pd.concat([df, df_2]).drop_duplicates(keep=False)
s = df_2.groupby('a').date.shift().dt.month
df_2['newDate'] = np.where(s.eq(12) & df_2.date.dt.month.eq(1),
                           df_2.date - pd.DateOffset(months=1), df_2.date)
df_4 = pd.concat([df_2, df_3])
df_4.newDate = df_4.newDate.fillna(df_4.date)
df_4.sort_values(by=['a', 'date'])
The problem with the above approach is that it works the first time the payment date moves from December to January, but it doesn't work for subsequent years. Looking at cust1, the first time she switched payment from December to January was from December 2018 to January 2019, and my approach captures this; but it fails to move her 2019 payment, which she made in January 2020, back to December 2019. Any idea how this can be solved?

Check with groupby and shift to find the rows that need to be fixed, then use np.where:
s = df.groupby('a').date.shift().dt.month
df['date'] = np.where(s.eq(12) & df.date.dt.month.eq(1), df.date - pd.DateOffset(months=1), df.date)
df
a date c
0 cust1 2017-12-15 5
1 cust1 2018-12-20 6
2 cust1 2019-12-10 7
3 cust2 2017-12-15 4
4 cust2 2018-12-10 8
5 cust3 2017-01-05 6
6 cust3 2018-01-15 5
7 cust3 2019-02-20 9
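For reference, a self-contained version of the above that can be run as-is (a sketch, assuming the single-payment-per-year frame from the top of the question):
import numpy as np
import pandas as pd
from datetime import date

df = pd.DataFrame({'a': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust3'],
                   'date': [date(2017, 12, 15), date(2018, 12, 20), date(2020, 1, 10),
                            date(2017, 12, 15), date(2018, 12, 10),
                            date(2017, 1, 5), date(2018, 1, 15), date(2019, 2, 20)],
                   'c': [5, 6, 7, 4, 8, 6, 5, 9]})
df['date'] = pd.to_datetime(df['date'])

# month of each customer's previous payment
s = df.groupby('a').date.shift().dt.month
# shift a January payment back a month when the previous payment was in December
df['date'] = np.where(s.eq(12) & df.date.dt.month.eq(1),
                      df.date - pd.DateOffset(months=1), df.date)
Note that shift() only looks at the immediately preceding payment, so in the multi-payment case from the EDIT it misses a second consecutive January payment; the forward-fill variant in the last question below handles that.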

Related

Copying and appending rows to a dataframe with increment to timestamp column by a minute

Here is the dataframe I have:
import pandas as pd

df = pd.DataFrame([[pd.Timestamp(2017, 1, 1, 12, 32, 0), 2, 3],
                   [pd.Timestamp(2017, 1, 2, 12, 32, 0), 4, 9]],
                  columns=['time', 'feature1', 'feature2'])
For every timestamp value found in the df (i.e. for every value of the 'time' column), I need to append 5 more rows with the time column value of each row incremented by a minute successively, while the remaining columns' values are copied as is.
So the output would look like:
time feature1 feature2
2017-01-01 12:32:00 2 3
2017-01-01 12:33:00 2 3
2017-01-01 12:34:00 2 3
2017-01-01 12:35:00 2 3
2017-01-01 12:36:00 2 3
2017-01-01 12:37:00 2 3
2017-01-02 12:32:00 4 9
2017-01-02 12:33:00 4 9
2017-01-02 12:34:00 4 9
2017-01-02 12:35:00 4 9
2017-01-02 12:36:00 4 9
2017-01-02 12:37:00 4 9
As an elegant solution, I tried the df.asfreq('1min') function, but I could not tell it to stop after appending 5 rows; instead it kept appending rows at 1-minute increments until it reached the next timestamp.
I tried the good old for loop in Python and, as expected, it is very time consuming (I am dealing with 10 million rows).
I was hoping there would be an elegant solution to this, something like df.asfreq('1min') but with a stop condition after appending 5 rows.
You can repeat the df and then do a groupby with cumcount and add the minutes like below:
out = df.loc[df.index.repeat(6)].reset_index(drop=True)
out['time'] = out['time'] + pd.to_timedelta(out.groupby("time").cumcount(), unit='m')
print(out)
time feature1 feature2
0 2017-01-01 12:32:00 2 3
1 2017-01-01 12:33:00 2 3
2 2017-01-01 12:34:00 2 3
3 2017-01-01 12:35:00 2 3
4 2017-01-01 12:36:00 2 3
5 2017-01-01 12:37:00 2 3
6 2017-01-02 12:32:00 4 9
7 2017-01-02 12:33:00 4 9
8 2017-01-02 12:34:00 4 9
9 2017-01-02 12:35:00 4 9
10 2017-01-02 12:36:00 4 9
11 2017-01-02 12:37:00 4 9
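One caveat, as an assumption about more general data: if the same timestamp occurs in two different original rows, grouping by "time" numbers all of their copies together and the increments come out wrong. Tiling a fixed 0-5 minute offset per original row avoids that, and stays vectorized, which matters at the 10-million-row scale mentioned:
import numpy as np

out = df.loc[df.index.repeat(6)].reset_index(drop=True)
# every original row contributes exactly 6 copies, so tile the 0..5 minute offsets
out['time'] += pd.to_timedelta(np.tile(np.arange(6), len(df)), unit='m')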
You could create a column containing a list of required times using pandas.date_range and explode the DataFrame on that column:
df["time"] = df["time"].apply(lambda x: pd.date_range(start=x, periods=6, freq="1min"))
df = df.explode("time")
>>> df
time feature1 feature2
0 2017-01-01 12:32:00 2 3
0 2017-01-01 12:33:00 2 3
0 2017-01-01 12:34:00 2 3
0 2017-01-01 12:35:00 2 3
0 2017-01-01 12:36:00 2 3
0 2017-01-01 12:37:00 2 3
1 2017-01-02 12:32:00 4 9
1 2017-01-02 12:33:00 4 9
1 2017-01-02 12:34:00 4 9
1 2017-01-02 12:35:00 4 9
1 2017-01-02 12:36:00 4 9
1 2017-01-02 12:37:00 4 9

Python - Create new column (summation) in date range - Rolling Sum?

I am trying to create a new column in my dataframe:
Let X be a variable number of days.
     Date                 Units Sold  Total Units sold in the last X days
0    2019-01-01 19:00:00  5
1    2019-01-01 15:00:00  4
2    2019-01-05 11:00:00  1
3    2019-01-12 12:00:00  3
4    2019-01-15 15:00:00  2
5    2019-02-04 18:00:00  7
For each row, I need to sum up units sold + all the units sold in the last 10 days (letting x = 10 days)
Desired Result:
     Date                 Units Sold  Total Units sold in the last X days
0    2019-01-01 19:00:00  5           5
1    2019-01-01 15:00:00  4           9
2    2019-01-05 11:00:00  1           10
3    2019-01-12 12:00:00  3           4
4    2019-01-15 15:00:00  2           6
5    2019-02-04 18:00:00  7           7
I have used the .rolling(window=) method before with periods, and I think something like
df = df.rolling(window='10D', on='date').sum()
can help, but I can't get the syntax right!
I have tried
df["Total Units sold in the last 10 days"] = df.rolling(on="date", window="10D", closed="both").sum()["Units Sold"] but get the error
"ValueError: Wrong number of items passed 2, placement implies 1" and "ValueError: Shape of passed values is (500, 2), indices imply (500, 1)"
Please please help!
Based on your sample data, you need to sort by Date and specify the on parameter:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': [pd.Timestamp('2019-01-01 15:00:00'),
                            pd.Timestamp('2019-01-01 19:00:00'),
                            pd.Timestamp('2019-01-05 11:00:00'),
                            pd.Timestamp('2019-01-12 12:00:00'),
                            pd.Timestamp('2019-01-15 15:00:00'),
                            pd.Timestamp('2019-02-04 18:00:00')],
                   'Units Sold': [4, 5, 1, 3, 2, 7],
                   'Total Units sold in the last X days': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
df = df.sort_values("Date")
df["Total Units sold in the last X days"] = df.rolling("10D", on="Date").sum()["Units Sold"]
df
     Date                 Units Sold  Total Units sold in the last X days
0    2019-01-01 15:00:00  4           4
1    2019-01-01 19:00:00  5           9
2    2019-01-05 11:00:00  1           10
3    2019-01-12 12:00:00  3           4
4    2019-01-15 15:00:00  2           5
5    2019-02-04 18:00:00  7           7
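As an aside on the ValueError in the question (an interpretation, not a certainty): rolling(...).sum() over the whole frame returns a sum for every numeric column, so the assignment can receive more columns than the single target, and time-based windows also require the on column to be sorted. Selecting the column from the rolling object after sorting sidesteps both:
df = df.sort_values("Date")
# take the single column from the rolling object before summing
df["Total Units sold in the last X days"] = df.rolling("10D", on="Date")["Units Sold"].sum()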

Within group move date to same date in prior year if certain condition is met

I have a pandas dataframe which looks like this
import pandas as pd
from datetime import date

df = pd.DataFrame({'a': ['cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust3'],
                   'date': [date(2017, 6, 15), date(2017, 12, 15), date(2018, 6, 15), date(2019, 1, 20), date(2019, 6, 15), date(2020, 1, 10), date(2020, 6, 12), date(2017, 12, 15), date(2018, 12, 10), date(2017, 1, 5), date(2018, 1, 15), date(2019, 2, 20)],
                   'c': [5, 5, 6, 6, 7, 7, 8, 4, 8, 6, 5, 9]})
a date c
0 cust1 2017-06-15 5
1 cust1 2017-12-15 5
2 cust1 2018-06-15 6
3 cust1 2019-01-20 6
4 cust1 2019-06-15 7
5 cust1 2020-01-10 7
6 cust1 2020-06-12 8
7 cust2 2017-12-15 4
8 cust2 2018-12-10 8
9 cust3 2017-01-05 6
10 cust3 2018-01-15 5
11 cust3 2019-02-20 9
'a' = customer
'date' = date when customer paid
'c' = amount customer paid
I need to check whether the customer paid as many times in each year as in the previous year, but for customers who historically paid in December and in later years paid in January, I would like to change the January date to a December date.
I tried the following:
import numpy as np

year_end_month = [1, 12]
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_2 = df.loc[df['date'].dt.month.isin(year_end_month)].copy()
df_3 = pd.concat([df, df_2]).drop_duplicates(keep=False)
s = df_2.groupby('a').date.shift().dt.month
df_2['newDate'] = np.where(s.eq(12) & df_2.date.dt.month.eq(1),
                           df_2.date - pd.DateOffset(months=1), df_2.date)
df_4 = pd.concat([df_2, df_3])
df_4.newDate = df_4.newDate.fillna(df_4.date)
df_4.sort_values(by=['a', 'date'])
The problem with my approach is that it works the first time the payment date moves from December to January, but it doesn't work for subsequent years. Looking at cust1, the first time she switched payment from December to January was from December 2018 to January 2019, and my approach captures this; but it fails to move her 2019 payment, which she made in January 2020, back to December 2019. Any idea how this can be solved?
my resulting dataframe should look like this:
a date c newDate
0 cust1 2017-06-15 5 2017-06-15
1 cust1 2017-12-15 5 2017-12-15
2 cust1 2018-06-15 6 2018-06-15
3 cust1 2019-01-20 6 **2018-12-20**
4 cust1 2019-06-15 7 2019-06-15
5 cust1 2020-01-10 7 **2019-12-10**
6 cust1 2020-06-12 8 2020-06-12
7 cust2 2017-12-15 4 2017-12-15
8 cust2 2018-12-10 8 2018-12-10
9 cust3 2017-01-05 6 2017-01-05
10 cust3 2018-01-15 5 2018-01-15
11 cust3 2019-02-20 9 2019-02-20
Let's try ffill() on the shifted month flag:
months = df.date.dt.month
s = months.eq(12).groupby(df['a']).shift()
df['date'] = np.where(months.eq(1) & s.where(s).groupby(df['a']).ffill(),
                      df['date'] - pd.DateOffset(months=1),
                      df['date'])
Output:
a date c
0 cust1 2017-06-15 5
1 cust1 2017-12-15 5
2 cust1 2018-06-15 6
3 cust1 2018-12-20 6
4 cust1 2019-06-15 7
5 cust1 2019-12-10 7
6 cust1 2020-06-12 8
7 cust2 2017-12-15 4
8 cust2 2018-12-10 8
9 cust3 2017-01-05 6
10 cust3 2018-01-15 5
11 cust3 2019-02-20 9
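To see why the forward-fill handles consecutive years, it helps to print the intermediate flag (a sketch reusing df from this question):
months = df['date'].dt.month
s = months.eq(12).groupby(df['a']).shift()   # was the previous payment in December?
flag = s.where(s).groupby(df['a']).ffill()   # keep only True and carry it forward per customer
print(pd.concat({'prev_dec': s, 'carried': flag}, axis=1))
For cust1 the shifted flag is True only on the first row after 2017-12-15, but after the where/ffill it stays True on every later row, so both 2019-01-20 and 2020-01-10 meet the condition; a single shift catches only the first.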

Python Pandas Dataframe: Value of second recent day for each person

I am trying to group one dataframe conditional on another dataframe using Python's pandas:
The first dataframe gives the holidays of each person:
import pandas as pd
df_holiday = pd.DataFrame({'Person': ['Alfred', 'Bob', 'Charles'], 'Last Holiday': ['2018-02-01', '2018-06-01', '2018-05-01']})
df_holiday.head()
Last Holiday Person
0 2018-02-01 Alfred
1 2018-06-01 Bob
2 2018-05-01 Charles
The second dataframe gives the sales value for each person and month:
df_sales = pd.DataFrame({'Person': ['Alfred', 'Alfred', 'Alfred','Bob','Bob','Bob','Bob','Bob','Bob','Charles','Charles','Charles','Charles','Charles','Charles'],'Date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-01-01', '2018-02-01', '2018-03-01','2018-04-01', '2018-05-01', '2018-06-01', '2018-01-01', '2018-02-01', '2018-03-01','2018-04-01', '2018-05-01', '2018-06-01'], 'Sales': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]})
df_sales.head(15)
Date Person Sales
0 2018-01-01 Alfred 1
1 2018-02-01 Alfred 2
2 2018-03-01 Alfred 3
3 2018-01-01 Bob 4
4 2018-02-01 Bob 5
5 2018-03-01 Bob 6
6 2018-04-01 Bob 7
7 2018-05-01 Bob 8
8 2018-06-01 Bob 9
9 2018-01-01 Charles 10
10 2018-02-01 Charles 11
11 2018-03-01 Charles 12
12 2018-04-01 Charles 13
13 2018-05-01 Charles 14
14 2018-06-01 Charles 15
Now, I want the sales number for each person before his last holiday, i.e. the outcome should be:
Date Person Sales
0 2018-01-01 Alfred 1
7 2018-05-01 Bob 8
12 2018-04-01 Charles 13
Any help?
We could do a merge, then filter and drop_duplicates:
df = (df_holiday.merge(df_sales)
                .loc[lambda x: x['Last Holiday'] > x['Date']]
                .drop_duplicates('Person', keep='last'))
Out[163]:
Person Last Holiday Date Sales
0 Alfred 2018-02-01 2018-01-01 1
7 Bob 2018-06-01 2018-05-01 8
12 Charles 2018-05-01 2018-04-01 13
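An alternative reaching the same result, as a sketch assuming rows stay in date order within each person as in the sample (and noting the dates here are ISO-formatted strings, so string comparison matches date order):
out = (df_sales.merge(df_holiday)
               .loc[lambda x: x['Date'] < x['Last Holiday']]
               .groupby('Person', as_index=False)
               .last())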

First value of each week in pd.Series/DataFrame

Say I have a pd.Series of daily S&P 500 values, and I would like to filter this series to get the first business day and the associated value of each week.
So, for instance, my filtered series would contain the 5 September 2017 (Tuesday - no value for the Monday), then 11 September 2017 (Monday).
Source series:
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-06 2465.54
2017-09-07 2465.10
2017-09-08 2461.43
2017-09-11 2488.11
2017-09-12 2496.48
Filtered series
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
My solution currently consists of:
mask = SP500.apply(lambda row: SP500[row.name - datetime.timedelta(days=row.name.weekday()):].index[0], axis=1).unique()
filtered = SP500.loc[mask]
This however feels suboptimal/non-pythonic. Any better/faster/cleaner solutions?
Using resample on pd.Series.index.to_series
s[s.index.to_series().resample('W').first()]
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
dtype: float64
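One caveat with this lookup, assuming the series can contain whole weeks with no data: resample('W').first() emits NaT for those weeks, and indexing with NaT raises a KeyError, so it is safer to drop them first:
first_days = s.index.to_series().resample('W').first().dropna()
filtered = s.loc[first_days]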
(df.sort_index()
   .assign(week=df.index.get_level_values(0).week)
   .drop_duplicates('week', keep='first')
   .drop('week', axis=1))
Out[774]:
price
2017-09-01 2476.55
2017-09-05 2457.85
2017-09-11 2488.11
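A note for newer pandas versions (an assumption about your environment): Index.week is deprecated there in favour of isocalendar(), and week numbers restart every year, so multi-year data would also need the ISO year. A sketch assuming df is already sorted by date, as in the sample:
week = df.index.get_level_values(0).isocalendar().week.to_numpy()
df.assign(week=week).drop_duplicates('week', keep='first').drop('week', axis=1)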
I'm not sure that the solution you give works, since the .apply method for series can't access the index, and doesn't have an axis argument. What you gave would work on a DataFrame, but this is simpler if you have a dataframe:
# Make some fake data
from datetime import date, timedelta

x = pd.DataFrame(pd.date_range(date(2017, 10, 9), date(2017, 10, 23)), columns=['date'])
x['value'] = x.index
print(x)
date value
0 2017-10-09 0
1 2017-10-10 1
2 2017-10-11 2
3 2017-10-12 3
4 2017-10-13 4
5 2017-10-14 5
6 2017-10-15 6
7 2017-10-16 7
8 2017-10-17 8
9 2017-10-18 9
10 2017-10-19 10
11 2017-10-20 11
12 2017-10-21 12
13 2017-10-22 13
14 2017-10-23 14
# filter
filtered = x.groupby(x['date'].apply(lambda d: d - timedelta(d.weekday())), as_index=False).first()
print(filtered)
date value
0 2017-10-09 0
1 2017-10-16 7
2 2017-10-23 14
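The same week-start idea also works directly on a Series without a groupby; a sketch assuming SP500 is the daily series from the question:
# map every date to the Monday of its week, then keep the first row of each week
week_start = SP500.index - pd.to_timedelta(SP500.index.weekday, unit='D')
filtered = SP500[~week_start.duplicated()]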
