Adding rows per group using Moving Average in Python - python

I have the following problem:
I have a table in which the customer number, the date and the sales are stored. The customer transactions are available on the first of each month. It may happen that a customer has not placed an order every month. The table looks like this:
ID  Date        Revenues
1   2021-05-01  100
1   2021-07-01  200
1   2021-08-01  100
1   2021-10-01  200
2   2021-12-01  300
2   2022-01-01  400
Now I want to add a certain number of rows to each group, with dates starting from today and running a certain number of months into the future. The ID should remain the same, the date should increase by one month per row, and the Revenues column should be filled using the moving average method.
The table should look like this:
ID  Date        Revenues
1   2021-05-01  100
1   2021-07-01  200
1   2021-08-01  100
1   2021-10-01  200
1   2022-04-01  150
1   2022-05-01  150
2   2021-12-01  300
2   2022-01-01  400
2   2022-04-01  350
2   2022-05-01  350
How can I solve this problem?
Thank you for your help :)

If I understand you correctly:
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])

def reindex(x):
    # build a month-start range covering the first date plus three months
    min_date = x["Date"].min()
    r = pd.date_range(min_date, min_date + pd.DateOffset(months=3), freq="MS")
    x = x.set_index("Date").reindex(r)
    # new rows have no ID or revenue yet: copy the ID, fill revenue with the group mean
    x["ID"] = x["ID"].ffill().bfill()
    x["Revenues"] = x["Revenues"].fillna(x["Revenues"].mean())
    return x

x = df.groupby("ID", as_index=False).apply(reindex).droplevel(0).reset_index()
x = x.rename(columns={"index": "Date"})
print(x.to_markdown())
Prints:
   Date                 ID  Revenues
0  2021-12-01 00:00:00  1   100
1  2022-01-01 00:00:00  1   200
2  2022-02-01 00:00:00  1   150
3  2022-03-01 00:00:00  1   150
4  2021-12-01 00:00:00  2   300
5  2022-01-01 00:00:00  2   400
6  2022-02-01 00:00:00  2   350
7  2022-03-01 00:00:00  2   350
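To get closer to the layout asked for in the question (the original rows kept as they are, plus a fixed number of future months per ID), the same idea can be sketched as follows. This is only an illustration under stated assumptions: the helper name extend_with_mean and the start date 2022-04-01 are made up for the example, and the fill value is the plain mean of each group, not a true rolling moving average.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    "Date": pd.to_datetime([
        "2021-05-01", "2021-07-01", "2021-08-01", "2021-10-01",
        "2021-12-01", "2022-01-01",
    ]),
    "Revenues": [100, 200, 100, 200, 300, 400],
})

def extend_with_mean(df, start, periods):
    # hypothetical helper: append `periods` month-start rows per ID,
    # each filled with the mean revenue of that ID (an assumption)
    future_dates = pd.date_range(start, periods=periods, freq="MS")
    means = df.groupby("ID")["Revenues"].mean()
    future = pd.DataFrame(
        [(id_, d, m) for id_, m in means.items() for d in future_dates],
        columns=["ID", "Date", "Revenues"],
    )
    return (pd.concat([df, future], ignore_index=True)
              .sort_values(["ID", "Date"])
              .reset_index(drop=True))

# assumed start date; replace with pd.Timestamp.today().normalize() if needed
extended = extend_with_mean(df, "2022-04-01", periods=2)
print(extended)
```

With the sample data from the question this appends 2022-04-01 and 2022-05-01 to each ID, filled with 150 for ID 1 and 350 for ID 2.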

Related

Conduct the calculation only when the date value is valid

I have a data frame dft:
Date Total Value
02/01/2022 2
03/01/2022 6
N/A 4
03/11/2022 4
03/15/2022 4
05/01/2022 4
For each date in the data frame, I want to calculate how many days it is from today, and I want to add these calculated values in a new column called Days.
I have tried the following code:
newdft = []
for item in dft:
    temp = item.copy()
    timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
    temp["Days"] = timediff.days
    newdft.append(temp)
But the third date value is N/A, which caused an error. What should I add to my code so that the calculation is conducted only when the date value is valid?
I would convert the whole Date column to datetime objects using pd.to_datetime() with errors set to 'coerce', which replaces the 'N/A' string with NaT (Not a Time):
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
So the column will now look like this:
0 2022-02-01
1 2022-03-01
2 NaT
3 2022-03-11
4 2022-03-15
5 2022-05-01
Name: Date, dtype: datetime64[ns]
You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:
dft['Days'] = datetime.now() - dft['Date']
This will make dft look like below:
Date Total Value Days
0 2022-02-01 2 148 days 15:49:03.406935
1 2022-03-01 6 120 days 15:49:03.406935
2 NaT 4 NaT
3 2022-03-11 4 110 days 15:49:03.406935
4 2022-03-15 4 106 days 15:49:03.406935
5 2022-05-01 4 59 days 15:49:03.406935
If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:
dft['Days'] = (datetime.now() - dft['Date']).dt.days
Which will give you:
Date Total Value Days
0 2022-02-01 2 148.0
1 2022-03-01 6 120.0
2 NaT 4 NaN
3 2022-03-11 4 110.0
4 2022-03-15 4 106.0
5 2022-05-01 4 59.0
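As a self-contained sketch of the steps above: the fixed reference date 2022-06-29 stands in for datetime.now() (an assumption, chosen so the result is reproducible and matches the day counts shown), and the cast to the nullable Int64 dtype is an optional extra that keeps the day counts as integers despite the missing value.

```python
import pandas as pd

dft = pd.DataFrame({
    "Date": ["02/01/2022", "03/01/2022", "N/A",
             "03/11/2022", "03/15/2022", "05/01/2022"],
    "Total Value": [2, 6, 4, 4, 4, 4],
})

# assumed fixed reference date instead of datetime.now()
reference = pd.Timestamp("2022-06-29")

# invalid strings such as 'N/A' become NaT instead of raising
dates = pd.to_datetime(dft["Date"], format="%m/%d/%Y", errors="coerce")

# NaT propagates through the subtraction; Int64 keeps integers plus <NA>
dft["Days"] = (reference - dates).dt.days.astype("Int64")
print(dft)
```

The original Date strings are left untouched here, because the parsed dates live in a separate variable.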
In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.
df['Days'] = (
    df['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
)
Result:
Date Total Value Days
0 02/01/2022 2 148.0
1 03/01/2022 6 120.0
2 N/A 4 NaN
3 03/11/2022 4 110.0
4 03/15/2022 4 106.0
5 05/01/2022 4 59.0
And for what it's worth, another option is date.today() instead of datetime.now():
.apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())
And the result is a timedelta instead of float:
Date Total Value Days
0 02/01/2022 2 148 days
1 03/01/2022 6 120 days
2 N/A 4 NaT
3 03/11/2022 4 110 days
4 03/15/2022 4 106 days
5 05/01/2022 4 59 days
See also: How do I select rows from a DataFrame based on column values?
Following up on the excellent answer by Emi OB I would suggest using DataFrame.mask() to update the dataframe without type coercion.
import datetime
import pandas as pd
dft = pd.DataFrame({'Date': ['02/01/2022',
                             '03/01/2022',
                             None,
                             '03/11/2022',
                             '03/15/2022',
                             '05/01/2022'],
                    'Total Value': [2, 6, 4, 4, 4, 4]})
dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'].mask(dft['Date'].notna(),
                 (dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
                 axis=0, inplace=True)
dft.drop(columns=['today'], inplace=True)
This would result in integer values in the Days column:
Date Total Value Days
0 02/01/2022 2 148
1 03/01/2022 6 120
2 None 4 None
3 03/11/2022 4 110
4 03/15/2022 4 106
5 05/01/2022 4 59

Assign weights to observations based on date difference and sequence condition

I already asked a question on the same problem and @mozway helped a lot.
However, my logic for the weight assignment was wrong.
I need to form the following dataframe with a weight column:
id date status weight diff_in_days comment
-----------------------------------------------------------------
0 2 2019-02-03 reserve 0.003 0 1 / diff_days
1 2 2019-12-31 reserve 0.001 331 since diff to next is 1 day
2 2 2020-01-01 reserve 0.9 1 since the next date status is sold
3 2 2020-01-02 sold 1 1 sold
4 3 2020-01-03 reserve 0.001 0 since diff to next is 1 day
5 4 2020-01-03 booked 0.9 0 since the next date status is sold
6 3 2020-02-04 reserve 0.9 1 since the next date status is sold
7 4 2020-02-06 sold 1 3 sold
7 3 2020-02-07 sold 1 3 sold
To make diff_in_days column I use:
df['diff_in_days'] = df.groupby('flat_id')['date'].diff().dt.days.fillna(0)
Is there a way to implement this pseudo-code without a for-loop:
for i in df.iterrows():
    df['weight'][i] = 1 / df['diff_in_days'][i+1]
    if df['status'][i+1] == 'sold' (for each flat_id):
        df['weight'][i] = 0.9
    if df['status'][i] == 'sold':
        df['weight'][i] = 1
Managed to do it like this:
import numpy as np

df.sort_values(['flat_id', 'date'], inplace=True)
# find diff in days between dates for flat_ids and shift it one row back
s = df.groupby(['flat_id'])['date'].diff().dt.days.shift(-1)
# assign weights for flat_ids with status == 'sold'
df['weight'] = np.where(df['status'].eq('sold'), s.max()+10, s.fillna(0))
# now find rows with status == sold and shift back one row to find their predecessors
s1 = df["status"].eq("sold").shift(-1)
s1 = s1.fillna(False)
# assign them the second-highest weight
df.loc[s1, "weight"] = s.max()+5
df["weight"] = df["weight"].ffill()
final dataframe
flat_id date status weight
4 2 2019-02-04 reserve 331.0
0 2 2020-01-01 reserve 336.0
1 2 2020-01-02 sold 341.0
2 3 2020-01-03 reserve 1.0
5 3 2020-01-04 reserve 336.0
7 3 2020-02-07 sold 341.0
3 4 2020-01-03 booked 336.0
6 4 2020-02-06 sold 341.0
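For comparison, the pseudo-code from the question can also be followed literally with per-group shifts and np.select. This is a sketch under assumptions: the sample rows are taken from the table in the question, the id column is called flat_id as in the code above, and the weights follow the pseudo-code (1 / days to the next row), so they do not necessarily reproduce every weight value written in the question's table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "flat_id": [2, 2, 2, 2, 3, 4, 3, 4, 3],
    "date": pd.to_datetime([
        "2019-02-03", "2019-12-31", "2020-01-01", "2020-01-02",
        "2020-01-03", "2020-01-03", "2020-02-04", "2020-02-06", "2020-02-07",
    ]),
    "status": ["reserve", "reserve", "reserve", "sold",
               "reserve", "booked", "reserve", "sold", "sold"],
})

df = df.sort_values(["flat_id", "date"]).reset_index(drop=True)
grouped = df.groupby("flat_id")

# days until the next observation of the same flat (NaN on each flat's last row)
days_to_next = (grouped["date"].shift(-1) - df["date"]).dt.days
# status of the next observation of the same flat
next_status = grouped["status"].shift(-1)

# conditions are checked in order: sold rows first, then rows followed by a sale
df["weight"] = np.select(
    [df["status"].eq("sold").to_numpy(), next_status.eq("sold").to_numpy()],
    [1.0, 0.9],
    default=(1 / days_to_next).to_numpy(),
)
print(df)
```

Note that a zero-day gap between two rows of the same flat would give an infinite weight, so such rows would need separate handling.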

Calculate how many touch points the customer had in X months

I have a problem. For a given date, for example 2022-06-01, I want to calculate how many touches the customer with customerId == 1 had in the preceding 6 months. He had touches on two dates, 2022-05-25 and 2022-05-20. I have already calculated the date from which the data should be taken into account (count_from_date). However, I don't know how to group by customer and count, for each row, how many touches fall between count_from_date and fromDate.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
"2022-06-02", "2021-03-01", "2021-02-01"]
}
df = pd.DataFrame(data=d)
print(df)
from datetime import date
from dateutil.relativedelta import relativedelta
def find_last_date(date):
    six_months = date + relativedelta(months=-6)
    return six_months
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['count_from_date'] = df['fromDate'].apply(lambda x: find_last_date(x))
print(df)
What I have
customerId fromDate count_from_date
0 1 2022-06-01 2021-12-01
1 1 2022-05-25 2021-11-25
2 1 2022-05-25 2021-11-25
3 1 2022-05-20 2021-11-20
4 1 2021-09-05 2021-03-05
5 2 2022-06-02 2021-12-02
6 3 2021-03-01 2020-09-01
7 3 2021-02-01 2020-08-01
What I want
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
2 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
3 1 2022-05-20 2021-11-20 0 # No in the last 6 months
4 1 2021-09-05 2021-03-05 0 # No in the last 6 months
5 2 2022-06-02 2021-12-02 0 # No in the last 6 months
6 3 2021-03-01 2020-09-01 1 # 2021-02-01 = 1
7 3 2021-02-01 2020-08-01 0 # No in the last 6 months
You can try groupby customerId and loop through the rows in subgroup to count number of fromDate between fromDate and count_from_date
def count(g):
    # one boolean column per row: which dates of the group fall strictly inside that row's window
    m = pd.concat([g['fromDate'].between(d1, d2, inclusive='neither')
                   for d1, d2 in zip(g['count_from_date'], g['fromDate'])], axis=1)
    g = g.assign(occur_last_6_months=m.sum().tolist())
    return g

out = df.groupby('customerId').apply(count)
print(out)
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
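The same window count can also be written without a Python loop over rows, by pairing every row with every other row of the same customer through a self-merge. This is a sketch of an alternative technique, not the method above; it assumes calendar months via pd.DateOffset(months=6) and strict inequalities on both ends, and its memory use grows with the square of the number of rows per customer.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 1, 2, 3, 3],
    "fromDate": pd.to_datetime([
        "2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20",
        "2021-09-05", "2022-06-02", "2021-03-01", "2021-02-01",
    ]),
})
df["count_from_date"] = df["fromDate"] - pd.DateOffset(months=6)

# pair each row (identified by its original index) with all rows of the same customer
pairs = df.reset_index().merge(
    df[["customerId", "fromDate"]], on="customerId", suffixes=("", "_other")
)

# a paired touch counts if it lies strictly inside (count_from_date, fromDate)
in_window = (
    (pairs["fromDate_other"] > pairs["count_from_date"])
    & (pairs["fromDate_other"] < pairs["fromDate"])
)

df["occur_last_6_months"] = (
    in_window.groupby(pairs["index"]).sum()
    .reindex(df.index, fill_value=0)
    .astype(int)
)
print(df)
```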
For this problem, the challenge for a performant solution is to reshape the data so that it has an appropriate structure to run rolling window operations on.
First of all, we need to avoid having duplicate indices. In your case, this means aggregating multiple touch points in a single day:
>>> df = df.groupby(['customerId', 'fromDate'], as_index=False).count()
customerId fromDate count_from_date
0 1 2021-09-05 1
1 1 2022-05-20 1
2 1 2022-05-25 2
3 1 2022-06-01 1
4 2 2022-06-02 1
5 3 2021-02-01 1
6 3 2021-03-01 1
Now, we can set the index to fromDate, sort it and group by customerId so that we can use rolling windows. Here I use a 180D rolling window (roughly 6 months):
>>> roll_df = (df.set_index(['fromDate'])
...            .sort_index()
...            .groupby('customerId')
...            .apply(lambda s: s['count_from_date'].rolling('180D').sum()))
The sort_index step is important to ensure your data is monotonically increasing.
However, this also counts the touch on the day itself, which seems not what you want, so we remove 1 from the result:
>>> roll_df - 1
customerId fromDate
1 2021-09-05 0.0
2022-05-20 0.0
2022-05-25 2.0
2022-06-01 3.0
2 2022-06-02 0.0
3 2021-02-01 0.0
2021-03-01 1.0
Name: count_from_date, dtype: float64
Finally, we divide the adjusted result by the initial counts to get back to the original structure:
>>> (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
customerId fromDate count_from_date
0 1 2021-09-05 0.0
1 1 2022-05-20 0.0
2 1 2022-05-25 1.0
3 1 2022-06-01 3.0
4 2 2022-06-02 0.0
5 3 2021-02-01 0.0
6 3 2021-03-01 1.0
You can always .reset_index() at the end.
The one liner solution is
(df.set_index(['fromDate'])
   .sort_index()
   .groupby('customerId')
   .apply(lambda s: s['count_from_date'].rolling('180D').sum())
 - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']

How to aggregate data in hour based on timestamp in pandas?

I have a dataframe covering a full day, from 00:00:00 to 23:59:59.
The table below is just an example; I can't paste the whole thing here because it's too long.
id sm_log_time score 1 score 2
0 2020-04-15 15:25:49 10 10
1 2020-04-15 15:38:55 10 10
2 2020-04-15 15:52:01 10 10
3 2020-04-15 16:05:07 10 10
4 2020-04-15 16:18:13 10 10
And my desired dataframe is something like this. Score 1 and score 2 is sum based on minutes in an hour
id sm_log_time score 1 score 2
0 2020-04-15 15:00:00 100 200
1 2020-04-15 16:00:00 230 200
2 2020-04-15 17:00:00 200 300
3 2020-04-15 18:00:00 100 300
4 2020-04-15 19:00:00 100 300
Someone gave me this for reference:
times = pd.to_datetime(df.timestamp_col)
df.groupby([times.hour, times.minute]).value_col.sum()
First, set the timestamp column as the index. Then use the resample method of the time-series index:
df.set_index('sm_log_time').resample('H').sum().reset_index()
Result:
sm_log_time id score 1 score 2
0 2020-04-15 15:00:00 3 30 30
1 2020-04-15 16:00:00 7 20 20
Please note that id is summed as well; you may drop it if it is not needed. The row numbers of the resulting dataframe are now in the index.
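A self-contained sketch of the same resampling, restricted to the score columns so that id is not summed. The sample rows are the five shown in the question; passing pd.Timedelta(hours=1) instead of a frequency string is a choice made here to avoid depending on the spelling of the hourly alias across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(5),
    "sm_log_time": pd.to_datetime([
        "2020-04-15 15:25:49", "2020-04-15 15:38:55", "2020-04-15 15:52:01",
        "2020-04-15 16:05:07", "2020-04-15 16:18:13",
    ]),
    "score 1": [10] * 5,
    "score 2": [10] * 5,
})

# select only the columns to aggregate, then sum them per hourly bin
hourly = (
    df.set_index("sm_log_time")[["score 1", "score 2"]]
      .resample(pd.Timedelta(hours=1))
      .sum()
      .reset_index()
)
print(hourly)
```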

Group python pandas dataframe per weeks (starting on Monday)

I have a dataframe with values per day (see df below).
I want to group the "Forecast" field per week but with Monday as the first day of the week.
Currently I can do it via pd.TimeGrouper('W') (see df_final below) but it groups the week starting on Sundays (see df_final below)
import pandas as pd
data = [("W1","G1",1234,pd.to_datetime("2015-07-1"),8),
("W1","G1",1234,pd.to_datetime("2015-07-30"),2),
("W1","G1",1234,pd.to_datetime("2015-07-15"),2),
("W1","G1",1234,pd.to_datetime("2015-07-2"),4),
("W1","G2",2345,pd.to_datetime("2015-07-5"),5),
("W1","G2",2345,pd.to_datetime("2015-07-7"),1),
("W1","G2",2345,pd.to_datetime("2015-07-9"),1),
("W1","G2",2345,pd.to_datetime("2015-07-11"),3)]
labels = ["Site","Type","Product","Date","Forecast"]
df = pd.DataFrame(data,columns=labels).set_index(["Site","Type","Product","Date"])
df
Forecast
Site Type Product Date
W1 G1 1234 2015-07-01 8
2015-07-30 2
2015-07-15 2
2015-07-02 4
G2 2345 2015-07-05 5
2015-07-07 1
2015-07-09 1
2015-07-11 3
df_final = (df
            .reset_index()
            .set_index("Date")
            .groupby(["Site","Product",pd.TimeGrouper('W')])["Forecast"].sum()
            .astype(int)
            .reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
df_final
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-05 12 6
1 W1 1234 2015-07-19 2 6
2 W1 1234 2015-08-02 2 6
3 W1 2345 2015-07-05 5 6
4 W1 2345 2015-07-12 5 6
Use W-MON instead of W, and check the anchored offsets:
df_final = (df
            .reset_index()
            .set_index("Date")
            .groupby(["Site","Product",pd.Grouper(freq='W-MON')])["Forecast"].sum()
            .astype(int)
            .reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
print (df_final)
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-06 12 0
1 W1 1234 2015-07-20 2 0
2 W1 1234 2015-08-03 2 0
3 W1 2345 2015-07-06 5 0
4 W1 2345 2015-07-13 5 0
I have three solutions to this problem as described below. First, I should state that the ex-accepted answer is incorrect. Here is why:
# let's create an example df of length 9, 2020-03-08 is a Sunday
s = pd.DataFrame({'dt': pd.date_range('2020-03-08', periods=9, freq='D'),
                  'counts': 0})
> s
   dt                   counts
0  2020-03-08 00:00:00  0
1  2020-03-09 00:00:00  0
2  2020-03-10 00:00:00  0
3  2020-03-11 00:00:00  0
4  2020-03-12 00:00:00  0
5  2020-03-13 00:00:00  0
6  2020-03-14 00:00:00  0
7  2020-03-15 00:00:00  0
8  2020-03-16 00:00:00  0
These nine days span three Monday-to-Sunday weeks: the weeks of March 2nd, 9th, and 16th. Let's try the accepted answer:
# the accepted answer
> s.groupby(pd.Grouper(key='dt',freq='W-Mon')).count()
dt                   counts
2020-03-09 00:00:00  2
2020-03-16 00:00:00  7
This is wrong because the OP wants to have "Monday as the first day of the week" (not as the last day of the week) in the resulting dataframe. Let's see what we get when we try with freq='W'
> s.groupby(pd.Grouper(key='dt', freq='W')).count()
dt                   counts
2020-03-08 00:00:00  1
2020-03-15 00:00:00  7
2020-03-22 00:00:00  1
This grouper actually grouped as we wanted (Monday to Sunday) but labeled the 'dt' with the END of the week, rather than the start. So, to get what we want, we can move the index by 6 days like:
w = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w.index -= pd.Timedelta(days=6)
or alternatively we can do:
s.groupby(pd.Grouper(key='dt',freq='W-Mon',label='left',closed='left')).count()
A third solution, arguably the most readable one, is to convert dt to a period first, then group, and finally (if needed) convert back to a timestamp:
s.groupby(s.dt.dt.to_period('W'))['counts'].count().to_timestamp()
# a variant of this solution is: s.set_index('dt').to_period('W').groupby(pd.Grouper(freq='W')).count().to_timestamp()
All of these solutions return what the OP asked for:
dt                   counts
2020-03-02 00:00:00  1
2020-03-09 00:00:00  7
2020-03-16 00:00:00  1
Explanation: when a weekly freq such as W is provided to pd.Grouper, both the closed and label kwargs default to right. Setting freq to W (short for W-SUN) works because we want our week to end on Sunday (Sunday included, which g.closed == 'right' handles). Unfortunately, the pd.Grouper docstring does not show the effective default values, but you can see them like this:
g = pd.Grouper(key='dt', freq='W')
print(g.closed, g.label)
> right right
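As a quick check of the claims above, the following sketch rebuilds the nine-day frame and compares two of the approaches. The variable names by_period and by_grouper are made up for the example.

```python
import pandas as pd

# nine consecutive days starting on Sunday 2020-03-08
s = pd.DataFrame({"dt": pd.date_range("2020-03-08", periods=9, freq="D"),
                  "counts": 0})

# approach 1: convert to weekly periods (Monday to Sunday), then back to week starts
by_period = s.groupby(s["dt"].dt.to_period("W"))["counts"].count().to_timestamp()

# approach 2: Monday-anchored bins, closed and labelled on the left edge
by_grouper = s.groupby(
    pd.Grouper(key="dt", freq="W-MON", label="left", closed="left")
)["counts"].count()

print(by_period)
print(by_grouper)
```

Both results are labelled with the Mondays 2020-03-02, 2020-03-09 and 2020-03-16 and contain 1, 7 and 1 rows respectively.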
