count of last n days per group - python

I have a DataFrame like this:
import datetime
import pandas as pd

df = pd.DataFrame({'Team': ['CHI','IND','CHI','CHI','IND','CHI','CHI','IND'],
                   'Date': [datetime.date(2015,10,27), datetime.date(2015,10,28),
                            datetime.date(2015,10,29), datetime.date(2015,10,30),
                            datetime.date(2015,11,1), datetime.date(2015,11,2),
                            datetime.date(2015,11,4), datetime.date(2015,11,4)]})
I can find the number of rest days between games using this:
df['TeamRest'] = df.groupby('Team')['Date'].diff() - datetime.timedelta(1)
I would also like to add a column to the DataFrame that keeps track of how many games each team has played in the last 5 days.

With Date converted to datetime so it can be used as a DatetimeIndex, which will be important for the rolling count with daily frequency:
df.Date = pd.to_datetime(df.Date)
1) calculate the difference in days between games per team:
df['days_between'] = df.groupby('Team')['Date'].diff() - timedelta(days=1)
2) calculate the rolling count of games for the last 5 days per team:
df['game_count'] = 1
rolling_games_count = df.set_index('Date').groupby('Team').apply(lambda x: pd.rolling_count(x, window=5, freq='D')).reset_index()
df = df.drop('game_count', axis=1).merge(rolling_games_count, on=['Team', 'Date'], how='left')
to get:
Date Team days_between game_count
0 2015-10-27 CHI NaT 1
1 2015-10-28 IND NaT 1
2 2015-10-29 CHI 1 days 2
3 2015-10-30 CHI 0 days 3
4 2015-11-01 IND 3 days 2
5 2015-11-02 CHI 2 days 3
6 2015-11-04 CHI 1 days 2
7 2015-11-04 IND 2 days 2
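As a side note, pd.rolling_count has since been removed from pandas; the same per-team count can be sketched with the modern time-based rolling API (a sketch, not the original answer's code; it assumes a pandas version supporting rolling('5D'), i.e. 0.19 or later):

```python
import datetime

import pandas as pd

df = pd.DataFrame({'Team': ['CHI', 'IND', 'CHI', 'CHI', 'IND', 'CHI', 'CHI', 'IND'],
                   'Date': [datetime.date(2015, 10, 27), datetime.date(2015, 10, 28),
                            datetime.date(2015, 10, 29), datetime.date(2015, 10, 30),
                            datetime.date(2015, 11, 1), datetime.date(2015, 11, 2),
                            datetime.date(2015, 11, 4), datetime.date(2015, 11, 4)]})
df['Date'] = pd.to_datetime(df['Date'])
df['game'] = 1

# '5D' is a time-based window: each value counts the games in the five
# calendar days up to and including that game's date, per team.
game_count = (df.set_index('Date')
                .groupby('Team')['game']
                .apply(lambda s: s.rolling('5D').count()))
```

This reproduces the game_count column of the table above (CHI: 1, 2, 3, 3, 2; IND: 1, 2, 2).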
If you were to
df = pd.DataFrame({'Team':['CHI','IND','CHI','CHI','IND','CHI','CHI','IND'], 'Date': [date(2015,10,27),date(2015,10,28),date(2015,10,29),date(2015,10,30),date(2015,11,1),date(2015,11,2),date(2015,11,4),date(2015,12,10)]})
df['game'] = 1 # initialize a game to count.
df['nb_games'] = df.groupby('Team')['game'].apply(pd.rolling_count, 5)
you get the surprising result (one Date changed to one month later)
Date Team game nb_games
0 2015-10-27 CHI 1 1
2 2015-10-29 CHI 1 2
3 2015-10-30 CHI 1 3
5 2015-11-02 CHI 1 4
6 2015-11-04 CHI 1 5
1 2015-10-28 IND 1 1
4 2015-11-01 IND 1 2
7 2015-12-10 IND 1 3
Note the nb_games=3 for the later date in December, even though there were no games during the previous five days. Unless you convert to datetime, you only count the last five rows of the DataFrame, so you will always get five for a team with more than five games played.
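To illustrate that last point with a minimal sketch (using the modern rolling API in place of the removed pd.rolling_count): a positional window of 5 over seven games with no date information always reports five once five rows have accumulated:

```python
import pandas as pd

games = pd.Series(1, index=range(7))  # seven games, no date information at all

# A positional window just counts the last five rows, whatever their dates were.
counts = games.rolling(window=5, min_periods=1).count()
# counts: 1, 2, 3, 4, 5, 5, 5
```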

Related

How to merge dataframe between dates

I have one dataframe that contains daily sales data (DF).
I have another dataframe that contains quarterly data (DF1).
This is what the quarterly dataframe (DF1) looks like:
Date Computer Sale In Person Sales Net Sales
1/29/2021 1 2 3
4/30/2021 2 4 6
7/29/2021 3 6 9
1/29/2022 4 8 12
5/1/2022 5 10 15
7/30/2022 6 12 18
This is what the daily dataframe (DF) looks like:
Date Num of people
1/30/2021 45
1/31/2021 35
2/1/2021 25
5/1/2021 20
5/2/2021 15
I have columns Computer Sales, In Person Sales, Net Sales in the quarterly dataframe.
How do I merge the columns from above into the daily dataframe, so that I can see the quarterly data on the daily dataframe? I want the final result to look like this:
Date Num of people Computer Sale In Person Sales Net Sales
1/30/2021 45 1 2 3
1/31/2021 35 1 2 3
2/1/2021 25 1 2 3
5/1/2021 20 2 4 6
5/2/2021 15 2 4 6
So, for example, I want 1/30/2021 to carry the figures from 1/29/2021, and once the daily data goes past 4/30/2021, to merge in the new quarterly data.
Please let me know if I need to be more specific.
A possible solution:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
pd.merge_asof(df2, df1, on='Date', direction='backward')
Output:
Date Num of people Computer Sale In Person Sales Net Sales
0 2021-01-30 45 1 2 3
1 2021-01-31 35 1 2 3
2 2021-02-01 25 1 2 3
3 2021-05-01 20 2 4 6
4 2021-05-02 15 2 4 6
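For completeness, a self-contained sketch of the whole flow, with both frames reconstructed from the tables in the question (column names and values are assumed from the question text):

```python
import pandas as pd

# Quarterly frame (DF1) and daily frame (DF), as given in the question.
df1 = pd.DataFrame({'Date': ['1/29/2021', '4/30/2021', '7/29/2021'],
                    'Computer Sale': [1, 2, 3],
                    'In Person Sales': [2, 4, 6],
                    'Net Sales': [3, 6, 9]})
df2 = pd.DataFrame({'Date': ['1/30/2021', '1/31/2021', '2/1/2021', '5/1/2021', '5/2/2021'],
                    'Num of people': [45, 35, 25, 20, 15]})
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# merge_asof requires both frames to be sorted on the key;
# direction='backward' picks the most recent quarterly row at or before each day.
out = pd.merge_asof(df2, df1, on='Date', direction='backward')
```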

Conduct the calculation only when the date value is valid

I have a data frame dft:
Date Total Value
02/01/2022 2
03/01/2022 6
N/A 4
03/11/2022 4
03/15/2022 4
05/01/2022 4
For each date in the data frame, I want to calculate how many days it is from today, and I want to add these calculated values in a new column called Days.
I have tried the following code:
newdft = []
for item in dft:
    temp = item.copy()
    timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
    temp["Days"] = timediff.days
    newdft.append(temp)
But the third date value is N/A, which caused an error. What should I add to my code so that the calculation is conducted only when the date value is valid?
I would convert the whole Date column to a datetime object using pd.to_datetime(), with errors set to 'coerce' to replace the 'N/A' string with NaT (Not a Timestamp):
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
So the column will now look like this:
0 2022-02-01
1 2022-03-01
2 NaT
3 2022-03-11
4 2022-03-15
5 2022-05-01
Name: Date, dtype: datetime64[ns]
You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:
dft['Days'] = datetime.now() - dft['Date']
This will make dft look like below:
Date Total Value Days
0 2022-02-01 2 148 days 15:49:03.406935
1 2022-03-01 6 120 days 15:49:03.406935
2 NaT 4 NaT
3 2022-03-11 4 110 days 15:49:03.406935
4 2022-03-15 4 106 days 15:49:03.406935
5 2022-05-01 4 59 days 15:49:03.406935
If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:
df['Days'] = (datetime.now() - df['Date']).dt.days
Which will give you:
Date Total Value Days
0 2022-02-01 2 148.0
1 2022-03-01 6 120.0
2 NaT 4 NaN
3 2022-03-11 4 110.0
4 2022-03-15 4 106.0
5 2022-05-01 4 59.0
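As a side note, if whole numbers are preferred over the floats shown above, the day counts can be cast to pandas' nullable Int64 extension dtype, which keeps the missing date as <NA> (a sketch, assuming a pandas version with nullable integer dtypes, 0.24 or later):

```python
from datetime import datetime

import pandas as pd

dft = pd.DataFrame({'Date': ['02/01/2022', '03/01/2022', 'N/A',
                             '03/11/2022', '03/15/2022', '05/01/2022'],
                    'Total Value': [2, 6, 4, 4, 4, 4]})
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')

# .dt.days comes back as float because of the NaT row; the nullable Int64
# dtype keeps whole numbers and shows <NA> for the missing date.
dft['Days'] = (datetime.now() - dft['Date']).dt.days.astype('Int64')
```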
In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.
df['Days'] = (
    df['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
)
Result:
Date Total Value Days
0 02/01/2022 2 148.0
1 03/01/2022 6 120.0
2 N/A 4 NaN
3 03/11/2022 4 110.0
4 03/15/2022 4 106.0
5 05/01/2022 4 59.0
And for what it's worth, another option is date.today() instead of datetime.now():
.apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())
And the result is a timedelta instead of float:
Date Total Value Days
0 02/01/2022 2 148 days
1 03/01/2022 6 120 days
2 N/A 4 NaT
3 03/11/2022 4 110 days
4 03/15/2022 4 106 days
5 05/01/2022 4 59 days
See also: How do I select rows from a DataFrame based on column values?
Following up on the excellent answer by Emi OB I would suggest using DataFrame.mask() to update the dataframe without type coercion.
import datetime
import pandas as pd
dft = pd.DataFrame({'Date': ['02/01/2022',
                             '03/01/2022',
                             None,
                             '03/11/2022',
                             '03/15/2022',
                             '05/01/2022'],
                    'Total Value': [2, 6, 4, 4, 4, 4]})
dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'].mask(dft['Date'].notna(),
                 (dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
                 axis=0, inplace=True)
dft.drop(columns=['today'], inplace=True)
This would result in integer values in the Days column:
Date Total Value Days
0 02/01/2022 2 148
1 03/01/2022 6 120
2 None 4 None
3 03/11/2022 4 110
4 03/15/2022 4 106
5 05/01/2022 4 59

Assign weights to observations based on date difference and sequence condition

I already asked a question about the same problem and #mozway helped a lot.
However, my logic for the weight assignment was wrong.
I need to form the following dataframe with a weight column:
id date status weight diff_in_days comment
-----------------------------------------------------------------
0 2 2019-02-03 reserve 0.003 0 1 / diff_days
1 2 2019-12-31 reserve 0.001 331 since diff to next is 1 day
2 2 2020-01-01 reserve 0.9 1 since the next date status is sold
3 2 2020-01-02 sold 1 1 sold
4 3 2020-01-03 reserve 0.001 0 since diff to next is 1 day
5 4 2020-01-03 booked 0.9 0 since the next date status is sold
6 3 2020-02-04 reserve 0.9 1 since the next date status is sold
7 4 2020-02-06 sold 1 3 sold
7 3 2020-02-07 sold 1 3 sold
To make diff_in_days column I use:
df['diff_in_days'] = df.groupby('flat_id')['date'].diff().dt.days.fillna(0)
Is there a way to implement this pseudo-code without a for-loop:
for i in df.iterrows():
    df['weight'][i] = 1 / df['diff_in_days'][i+1]
    if df['status'][i+1] == 'sold':  # for each flat_id
        df['weight'][i] = 0.9
    if df['status'][i] == 'sold':
        df['weight'][i] = 1
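The pseudo-code above can be vectorized with per-group shifts. A sketch on a small hypothetical frame (only id 2's rows; it follows the pseudo-code literally, which does not reproduce every weight in the table above):

```python
import pandas as pd

# Hypothetical data shaped like the question's id-2 rows; the real frame is larger.
df = pd.DataFrame({
    'flat_id': [2, 2, 2, 2],
    'date': pd.to_datetime(['2019-02-03', '2019-12-31', '2020-01-01', '2020-01-02']),
    'status': ['reserve', 'reserve', 'reserve', 'sold'],
}).sort_values(['flat_id', 'date'])

g = df.groupby('flat_id')
# Date and status of the *next* observation within the same flat_id.
next_date = g['date'].shift(-1)
next_status = g['status'].shift(-1)

df['weight'] = 1 / (next_date - df['date']).dt.days  # base rule: 1 / days to next
df.loc[next_status.eq('sold'), 'weight'] = 0.9       # next row is a sale
df.loc[df['status'].eq('sold'), 'weight'] = 1.0      # the sale itself
```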
Managed to do it like this:
df.sort_values(['flat_id', 'date'], inplace=True)
# find diff in days between dates per flat_id and shift it one row back
s = df.groupby('flat_id')['date'].diff().dt.days.shift(-1)
# assign weights for rows with status == 'sold'
df['weight'] = np.where(df['status'].eq('sold'), s.max() + 10, s.fillna(0))
# now find rows with status == 'sold' and shift back one row to find their predecessors
s1 = df['status'].eq('sold').shift(-1)
s1 = s1.fillna(False)
# assign them the second-largest weight
df.loc[s1, 'weight'] = s.max() + 5
df['weight'].ffill(inplace=True)
final dataframe:
flat_id date status weight
4 2 2019-02-04 reserve 331.0
0 2 2020-01-01 reserve 336.0
1 2 2020-01-02 sold 341.0
2 3 2020-01-03 reserve 1.0
5 3 2020-01-04 reserve 336.0
7 3 2020-02-07 sold 341.0
3 4 2020-01-03 booked 336.0
6 4 2020-02-06 sold 341.0

Group python pandas dataframe per weeks (starting on Monday)

I have a dataframe with values per day (see df below).
I want to group the "Forecast" field per week but with Monday as the first day of the week.
Currently I can do it via pd.TimeGrouper('W') but it groups weeks starting on Sunday (see df_final below).
import pandas as pd
data = [("W1","G1",1234,pd.to_datetime("2015-07-1"),8),
("W1","G1",1234,pd.to_datetime("2015-07-30"),2),
("W1","G1",1234,pd.to_datetime("2015-07-15"),2),
("W1","G1",1234,pd.to_datetime("2015-07-2"),4),
("W1","G2",2345,pd.to_datetime("2015-07-5"),5),
("W1","G2",2345,pd.to_datetime("2015-07-7"),1),
("W1","G2",2345,pd.to_datetime("2015-07-9"),1),
("W1","G2",2345,pd.to_datetime("2015-07-11"),3)]
labels = ["Site","Type","Product","Date","Forecast"]
df = pd.DataFrame(data,columns=labels).set_index(["Site","Type","Product","Date"])
df
Forecast
Site Type Product Date
W1 G1 1234 2015-07-01 8
2015-07-30 2
2015-07-15 2
2015-07-02 4
G2 2345 2015-07-05 5
2015-07-07 1
2015-07-09 1
2015-07-11 3
df_final = (df
.reset_index()
.set_index("Date")
.groupby(["Site","Product",pd.TimeGrouper('W')])["Forecast"].sum()
.astype(int)
.reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
df_final
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-05 12 6
1 W1 1234 2015-07-19 2 6
2 W1 1234 2015-08-02 2 6
3 W1 2345 2015-07-05 5 6
4 W1 2345 2015-07-12 5 6
Use W-MON instead of W; check anchored offsets:
df_final = (df
.reset_index()
.set_index("Date")
.groupby(["Site","Product",pd.Grouper(freq='W-MON')])["Forecast"].sum()
.astype(int)
.reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
print (df_final)
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-06 12 0
1 W1 1234 2015-07-20 2 0
2 W1 1234 2015-08-03 2 0
3 W1 2345 2015-07-06 5 0
4 W1 2345 2015-07-13 5 0
I have three solutions to this problem as described below. First, I should state that the ex-accepted answer is incorrect. Here is why:
# let's create an example df of length 9, 2020-03-08 is a Sunday
s = pd.DataFrame({'dt': pd.date_range('2020-03-08', periods=9, freq='D'),
                  'counts': 0})
> s
                   dt  counts
0 2020-03-08 00:00:00       0
1 2020-03-09 00:00:00       0
2 2020-03-10 00:00:00       0
3 2020-03-11 00:00:00       0
4 2020-03-12 00:00:00       0
5 2020-03-13 00:00:00       0
6 2020-03-14 00:00:00       0
7 2020-03-15 00:00:00       0
8 2020-03-16 00:00:00       0
These nine days span three Monday-to-Sunday weeks. The weeks of March 2nd, 9th, and 16th. Let's try the accepted answer:
# the accepted answer
> s.groupby(pd.Grouper(key='dt', freq='W-Mon')).count()
                     counts
dt
2020-03-09 00:00:00       2
2020-03-16 00:00:00       7
This is wrong because the OP wants to have "Monday as the first day of the week" (not as the last day of the week) in the resulting dataframe. Let's see what we get when we try with freq='W'
> s.groupby(pd.Grouper(key='dt', freq='W')).count()
                     counts
dt
2020-03-08 00:00:00       1
2020-03-15 00:00:00       7
2020-03-22 00:00:00       1
This grouper actually grouped as we wanted (Monday to Sunday) but labeled the 'dt' with the END of the week, rather than the start. So, to get what we want, we can move the index by 6 days like:
w = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w.index -= pd.Timedelta(days=6)
or alternatively we can do:
s.groupby(pd.Grouper(key='dt',freq='W-Mon',label='left',closed='left')).count()
a third solution, arguably the most readable one, is converting dt to period first, then grouping, and finally (if needed) converting back to timestamp:
s.groupby(s.dt.dt.to_period('W'))['counts'].count().to_timestamp()
# a variant of this solution is: s.set_index('dt').to_period('W').groupby(pd.Grouper(freq='W')).count().to_timestamp()
all of these solutions return what the OP asked for:
                     counts
dt
2020-03-02 00:00:00       1
2020-03-09 00:00:00       7
2020-03-16 00:00:00       1
Explanation: when freq is provided to pd.Grouper, both the closed and label kwargs default to 'right'. Setting freq to W (short for W-SUN) works because we want our week to end on Sunday (Sunday included; closed='right' handles this). Unfortunately, the pd.Grouper docstring does not show the default values, but you can see them like this:
g = pd.Grouper(key='dt', freq='W')
print(g.closed, g.label)
> right right

make a shift by index with a pandas dataframe

Is there a pandas way to do that:
predicted_sells = []
for row in df.values:
    index_tms = row[0]
    delta = index_tms + timedelta(hours=1)
    try:
        sells_to_predict = df.loc[delta]['cars_sold']
    except KeyError:
        sells_to_predict = None
    predicted_sells.append(sells_to_predict)
df['sell_to_predict'] = predicted_sells
example explanation:
sell is the number of cars I sold at time tms; sell_to_predict is the number of cars I sold the hour after. I want to predict that, so I want to build a new column containing, at time tms, the number of cars I will sell at time tms + 1h.
before my code it looks like that
tms sell
2015-11-23 15:00:00 6
2015-11-23 16:00:00 2
2015-11-23 17:00:00 10
after it looks like that
tms sell sell_to_predict
2015-11-23 15:00:00 6 2
2015-11-23 16:00:00 2 10
2015-11-23 17:00:00 10 NaN
I create a new column based on a shift of another column, but it is not a shift by a fixed number of rows; it is a shift based on the index (here the index is a timestamp).
Here is another example, a little more complex:
before :
sell random
store hour
1 1 1 9
2 7 7
2 1 4 3
2 2 3
after :
sell random predict
store hour
1 1 1 9 7
2 7 7 NaN
2 1 4 3 2
2 2 3 NaN
have you tried shift?
e.g.
df = pd.DataFrame(list(range(4)))
df.columns = ['sold']
df['predict'] = df.sold.shift(-1)
df
sold predict
0 0 1
1 1 2
2 2 3
3 3 NaN
The answer was to resample, so there are no holes, and then to apply the answer to this question: How do you shift Pandas DataFrame with a multiindex?
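A minimal sketch of that resample-then-shift idea on the question's data (assuming an hourly DatetimeIndex; resampling first guarantees that a positional shift of one row means exactly one hour):

```python
import pandas as pd

df = pd.DataFrame(
    {'sell': [6, 2, 10]},
    index=pd.to_datetime(['2015-11-23 15:00', '2015-11-23 16:00', '2015-11-23 17:00']))

# Resample to a regular hourly grid first (any missing hour becomes NaN),
# so that shift(-1) genuinely means "one hour later". Assigning back to df
# aligns on the original index.
hourly = df['sell'].resample('H').asfreq()
df['sell_to_predict'] = hourly.shift(-1)
```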
