Pandas cumulative countif (based on condition) - python

I have a DataFrame df and I am trying to calculate a cumulative count based on the condition that the date in the column at is greater than or equal to the dates in the column recovery_date.
Here is the original df:
           at recovery_date
0  2020-02-01    2020-03-02
1  2020-03-01    2020-03-31
2  2020-04-01    2020-05-01
3  2020-05-01    2020-05-31
4  2020-06-01    2020-07-01
Here is the desired outcome:
           at recovery_date  result
0  2020-02-01    2020-03-02       0
1  2020-03-01    2020-03-31       0
2  2020-04-01    2020-05-01       2
3  2020-05-01    2020-05-31       3
4  2020-06-01    2020-07-01       4
The interpretation is that for each at, the result counts how many recovery_dates fall on or before that date.
I am trying to avoid using a for loop as I am implementing this for a time-sensitive application.
This is a solution I was able to find; however, I am looking for something more performant:
def how_many(at: pd.Timestamp, recoveries: pd.Series) -> int:
    return (at >= recoveries).sum()

df["result"] = [how_many(row["at"], df["recovery_date"][:idx]) for idx, row in df.iterrows()]
Thanks a lot!!

You're looking for something like this:
df['result'] = df['at'].apply(lambda at: (at >= df['recovery_date']).sum())
What this does is: for each value in the at column, flag the recovery_dates that fall on or before it (at this point we have an array of True (= 1) and False (= 0) values), then sum them.
This yields your desired output:
           at recovery_date  result
0  2020-02-01    2020-03-02       0
1  2020-03-01    2020-03-31       0
2  2020-04-01    2020-05-01       2
3  2020-05-01    2020-05-31       3
4  2020-06-01    2020-07-01       4
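Since the question asks for something more performant than the quadratic all-pairs comparison, here is a sketch (my own, not part of the original answer) that sorts the recovery dates once and binary-searches each at with numpy.searchsorted; it matches the accepted answer's semantics of comparing each at against all recovery dates:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "at": pd.to_datetime(["2020-02-01", "2020-03-01", "2020-04-01",
                          "2020-05-01", "2020-06-01"]),
    "recovery_date": pd.to_datetime(["2020-03-02", "2020-03-31", "2020-05-01",
                                     "2020-05-31", "2020-07-01"]),
})

# Sort the recovery dates once, then binary-search each `at`;
# side="right" counts every recovery_date <= at.
recoveries = np.sort(df["recovery_date"].to_numpy())
df["result"] = np.searchsorted(recoveries, df["at"].to_numpy(), side="right")
```

For n rows this is O(n log n) instead of O(n²).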

Related

How to drop rows for each value in a column using a condition?

I have the following dataframe:
df = pd.DataFrame({'No': [0, 0, 0, 1, 1, 2, 2],
                   'date': ['2020-01-15', '2019-12-16', '2021-03-01', '2018-05-19',
                            '2016-04-08', '2020-01-02', '2020-03-07']})
df.date = pd.to_datetime(df.date)
   No       date
0   0 2020-01-15
1   0 2019-12-16
2   0 2021-03-01
3   1 2018-05-19
4   1 2016-04-08
5   2 2020-01-02
6   2 2020-03-07
I want to drop the rows of a number in the No column if all of its date values are earlier than 2020-01-01, i.e. I want to drop the rows with indices 3 and 4.
Is it possible to do it without a for loop?
Use groupby and transform:
>>> df[df.groupby('No')['date'].transform('max') >= '2020-01-01']
   No       date
0   0 2020-01-15
1   0 2019-12-16
2   0 2021-03-01
5   2 2020-01-02
6   2 2020-03-07
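The same idea can be written with groupby().filter, which drops whole groups at once; a sketch with the question's data (transform is usually faster on large frames, but filter reads naturally):

```python
import pandas as pd

df = pd.DataFrame({'No': [0, 0, 0, 1, 1, 2, 2],
                   'date': ['2020-01-15', '2019-12-16', '2021-03-01', '2018-05-19',
                            '2016-04-08', '2020-01-02', '2020-03-07']})
df.date = pd.to_datetime(df.date)

# Keep a group only if its latest date is on or after the cutoff.
out = df.groupby('No').filter(lambda g: g['date'].max() >= pd.Timestamp('2020-01-01'))
```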

Pandas - Number of rows to the last row in a group that meets a requirement

My data is like this
date        group  meet_criteria
2020-03-31      1             no
2020-04-01      1            yes
2020-04-02      1             no
2020-04-03      1             no
2020-04-04      1            yes
2020-04-05      1             no
2020-03-31      2            yes
2020-04-01      2             no
I would like to create another column equal to 1 divided by the number of days since the last date in the group on which meet_criteria was yes (the current row is excluded, and if a group has never met the criteria the value is 0).
My desired data will look like this
date        group  meet_criteria  last_time_met_criteria
2020-03-31      1             no                0
2020-04-01      1            yes                0
2020-04-02      1             no                1
2020-04-03      1             no                0.5
2020-04-04      1            yes                0.333333
2020-04-05      1             no                1
2020-03-31      2            yes                0
2020-04-01      2             no                1
Is there any way to do this efficiently in pandas? Thanks
This can be done using pd.merge_asof & subsequent calculations in pandas.
Here's a fully worked example with your data (the original data loaded into a variable called df, df.date converted to datetime first, and pandas/numpy imported as pd/np):
# sorting necessary for how `merge_asof` will be used
df2 = df.sort_values(['date', 'group'])

# construct the `right` data frame of dates to look up
df_meet_criteria = df2[df2.meet_criteria == 'yes'].copy()
df_meet_criteria['date_met_criteria'] = df_meet_criteria.date

# merge
# `by`: columns to do a regular merge on
# `on`: column to do the as-of merge on
# `allow_exact_matches`: True -> closed interval, False -> open interval,
#   i.e. latest date strictly before the current date
last_date = pd.merge_asof(
    df2,
    df_meet_criteria,
    by='group',
    on='date',
    allow_exact_matches=False,
    suffixes=('', '_y')
).sort_values(['group', 'date'])

# calculate the inverse days
last_date['days_since'] = (last_date.date - last_date.date_met_criteria).dt.days
last_date.loc[last_date.days_since == 0, 'days_since'] = np.nan
last_date['last_time_met_criteria'] = (1 / last_date.days_since).fillna(0)
final = last_date[['date', 'group', 'meet_criteria', 'last_time_met_criteria']]
The final dataframe looks like this:
        date  group meet_criteria  last_time_met_criteria
0 2020-03-31      1            no                0.000000
2 2020-04-01      1           yes                0.000000
4 2020-04-02      1            no                1.000000
5 2020-04-03      1            no                0.500000
6 2020-04-04      1           yes                0.333333
7 2020-04-05      1            no                1.000000
1 2020-03-31      2           yes                0.000000
3 2020-04-01      2            no                1.000000
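An alternative sketch (mine, not from the answer above) that avoids merge_asof: record the date on each 'yes' row, shift within the group so the current row is excluded, and forward-fill. It assumes rows are already sorted by date within each group:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-31', '2020-04-01', '2020-04-02', '2020-04-03',
                            '2020-04-04', '2020-04-05', '2020-03-31', '2020-04-01']),
    'group': [1, 1, 1, 1, 1, 1, 2, 2],
    'meet_criteria': ['no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
})

# Date of each 'yes' row, NaT elsewhere.
yes_date = df['date'].where(df['meet_criteria'] == 'yes')
# Shift within each group (excludes the current row), then carry forward.
last_yes = yes_date.groupby(df['group']).transform(lambda s: s.shift().ffill())
days = (df['date'] - last_yes).dt.days
df['last_time_met_criteria'] = (1 / days).fillna(0)
```

Rows with no prior 'yes' in their group get NaT, hence NaN days, hence 0 after fillna.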

Groupby count printed to individual row

I have a dataframe that is similar to:
Date        Name
2020-04-01  ABCD
2020-04-01  Test
2020-04-01  Run1
2020-04-02  Run1
2020-04-03  XXX1
2020-04-03  Test
I want to group by date and enumerate each day's datapoints. I also want a column giving the total count for that date on every row. Together the two columns give a quick reference: on 4-15-20, scan 10 of 23. This is the desired result:
Date        Name  #  Total Scans
2020-04-01  ABCD  1            3
2020-04-01  Test  2            3
2020-04-01  Run1  3            3
2020-04-02  Run1  1            1
2020-04-03  XXX1  1            2
2020-04-03  Test  2            2
So far I have:
>>> df["#"] = df.groupby(['Date']).cumcount() + 1
>>> df
Date        Name  #
2020-04-01  ABCD  1
2020-04-01  Test  2
2020-04-01  Run1  3
2020-04-02  Run1  1
2020-04-03  XXX1  1
2020-04-03  Test  2
However, I'm having trouble adding the last column without iterating over the dataset. Everything I've read says iterating over a DataFrame is a no-no, and with a file this size testing showed tremendous lag, confirming iteration is a bad idea.
Can anyone suggest an approach that avoids iteration? Thanks
Let's do:
groups = df.groupby('Date')
df['#'] = groups.cumcount() + 1
df['Total Scans'] = groups['Date'].transform('size')
output:
         Date  Name  #  Total Scans
0  2020-04-01  ABCD  1            3
1  2020-04-01  Test  2            3
2  2020-04-01  Run1  3            3
3  2020-04-02  Run1  1            1
4  2020-04-03  XXX1  1            2
5  2020-04-03  Test  2            2
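Building on the answer above, the quick-reference string from the question ("scan 10 of 23") can then be assembled without iteration; a sketch (the label column name is my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2020-04-01'] * 3 + ['2020-04-02'] + ['2020-04-03'] * 2,
    'Name': ['ABCD', 'Test', 'Run1', 'Run1', 'XXX1', 'Test'],
})

groups = df.groupby('Date')
df['#'] = groups.cumcount() + 1
df['Total Scans'] = groups['Date'].transform('size')

# Vectorized string concatenation, no iteration needed.
df['label'] = 'scan ' + df['#'].astype(str) + ' of ' + df['Total Scans'].astype(str)
```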

Creating a new dataframe from a multi index dataframe using some conditions

I have a time series dataset which is basically consumption data of materials over the past 5 years
Material No  Consumption Date  Consumption
A            2019-06-01                  1
A            2019-07-01                  2
A            2019-08-01                  3
A            2019-09-01                  4
A            2019-10-01                  0
A            2019-11-01                  0
A            2019-12-01                  0
A            2020-01-01                  1
A            2020-02-01                  2
A            2020-03-01                  3
A            2020-04-01                  0
A            2020-05-01                  0
B            2019-06-01                  0
B            2019-07-01                  0
B            2019-08-01                  0
B            2019-09-01                  4
B            2019-10-01                  0
B            2019-11-01                  0
B            2019-12-01                  0
B            2020-01-01                  4
B            2020-02-01                  2
B            2020-03-01                  8
B            2020-04-01                  0
B            2020-05-01                  0
From the above dataframe, I want to see the number of months in which the material had at least 1 unit of consumption. The output dataframe should look something like this.
Material  no_of_months(Jan2020-May2020)  no_of_months(Jun2019-May2020)
A                                     3                              7
B                                     3                              4
Currently I'm subsetting the data frame and using a groupby to count the unique entries with non-zero consumption. However, this requires creating multiple data frames for different periods and then merging them. I was wondering if this could be done in a better way, perhaps using dictionaries.
consumption_jan20_may20 = consumption.loc[consumption['Consumption Date'] >= '2020-01-01',
                                          ['Material No', 'Consumption Date', 'Consumption']]
consumption_jan20_may20 = consumption_jan20_may20.groupby(
    [pd.Grouper(key='Material No'), grouper])['Consumption'].count().reset_index()
consumption_jan20_may20 = consumption_jan20_may20.groupby('Material No').count().reset_index()
consumption_jan20_may20.columns = ['Material No', 'no_of_months(Jan2020-May2020)', 'dummy']
consumption_jan20_may20 = consumption_jan20_may20[['Material No', 'no_of_months(Jan2020-May2020)']]
You can first limit the data to the range of months you are investigating. Say you want to check the first six rows (Jun-Nov 2019 for material A):
df = df[:6]
Then you can filter out the months where the material usage is zero:
df_nonzero = df[df['Consumption'] != 0]
If you want to see in how many months the consumption was non-zero, simply take the length of the new data frame:
len(df_nonzero)
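The slicing above only inspects one contiguous block of rows. For the per-material table in the desired output, here is a sketch (my own, not the answer's code) that counts non-zero months per period with groupby; the data is reconstructed from the question and the period boundaries are fixed by hand:

```python
import pandas as pd

df = pd.DataFrame({
    'Material No': ['A'] * 12 + ['B'] * 12,
    'Consumption Date': list(pd.date_range('2019-06-01', periods=12, freq='MS')) * 2,
    'Consumption': [1, 2, 3, 4, 0, 0, 0, 1, 2, 3, 0, 0,
                    0, 0, 0, 4, 0, 0, 0, 4, 2, 8, 0, 0],
})

# One row per material per month, so counting non-zero rows counts months.
nonzero = df[df['Consumption'] > 0]
out = pd.DataFrame({
    'no_of_months(Jan2020-May2020)':
        nonzero[nonzero['Consumption Date'] >= '2020-01-01'].groupby('Material No').size(),
    'no_of_months(Jun2019-May2020)':
        nonzero.groupby('Material No').size(),
}).fillna(0).astype(int).reset_index()
```

Since the data has exactly one row per material per month, counting non-zero rows is the same as counting months with at least 1 unit of consumption.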

Check if date in one dataframe is between two dates in another dataframe, by group

I have the following problem. I've got a dataframe with start and end dates for each group. There might be more than one start and end date per group, like this:
group  start_date  end_date
    1  2020-01-03  2020-03-03
    1  2020-05-03  2020-06-03
    2  2020-02-03  2020-06-03
And another dataframe with one row per date, per group, like this:
group  date
    1  2020-01-03
    1  2020-02-03
    1  2020-03-03
    1  2020-04-03
    1  2020-05-03
    1  2020-06-03
    2  2020-02-03
    3  2020-03-03
    4  2020-04-03
.
.
So I want to create a column is_between in an efficient way, ideally avoiding loops, so I get the following dataframe
group  date        is_between
    1  2020-01-03           1
    1  2020-02-03           1
    1  2020-03-03           1
    1  2020-04-03           0
    1  2020-05-03           1
    1  2020-06-03           1
    2  2020-02-03           1
    3  2020-03-03           1
    4  2020-04-03           1
.
.
So it gets a 1 when a group's date is between the dates in the first dataframe. I'm guessing some combination of groupby, where, between and maybe map might do it, but I'm not finding the correct one. Any ideas?
Based on @YOBEN_S's and @Quang Hoang's advice, this did it:
df = df.merge(dic_dates, how='left')
df['is_between'] = np.where(df.date.between(pd.to_datetime(df.start_date),
                                            pd.to_datetime(df.end_date)), 1, 0)
df = (df.sort_values(by=['group', 'date', 'is_between'])
        .drop_duplicates(subset=['group', 'date'], keep='last'))
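For reference, here is a self-contained version of that merge-and-dedupe approach with the question's data reconstructed (note that groups 3 and 4, which have no ranges, come out as 0 with this method):

```python
import numpy as np
import pandas as pd

dic_dates = pd.DataFrame({
    'group': [1, 1, 2],
    'start_date': pd.to_datetime(['2020-01-03', '2020-05-03', '2020-02-03']),
    'end_date': pd.to_datetime(['2020-03-03', '2020-06-03', '2020-06-03']),
})
df = pd.DataFrame({
    'group': [1, 1, 1, 1, 1, 1, 2, 3, 4],
    'date': pd.to_datetime(['2020-01-03', '2020-02-03', '2020-03-03', '2020-04-03',
                            '2020-05-03', '2020-06-03', '2020-02-03', '2020-03-03',
                            '2020-04-03']),
})

# Cross every date with every range of its group, flag matches,
# then keep the best (last-sorted) flag per (group, date).
df = df.merge(dic_dates, how='left')
df['is_between'] = np.where(df['date'].between(df['start_date'], df['end_date']), 1, 0)
df = (df.sort_values(by=['group', 'date', 'is_between'])
        .drop_duplicates(subset=['group', 'date'], keep='last')
        [['group', 'date', 'is_between']]
        .reset_index(drop=True))
```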
You could try merge_asof, by group and on date against start_date, then check whether the date is on or before end_date, and finally assign the result back to the original df2:
ser = (pd.merge_asof(df2.reset_index()  # for later index alignment
                        .sort_values('date'),
                     df1.sort_values('start_date'),
                     by='group',
                     left_on='date', right_on='start_date',
                     direction='backward')
         .assign(is_between=lambda x: x.date <= x.end_date)
         .set_index(['index'])['is_between'])

df2['is_between'] = ser.astype(int)
print(df2)
group date is_between
0 1 2020-01-03 1
1 1 2020-02-03 1
2 1 2020-03-03 1
3 1 2020-04-03 0
4 1 2020-05-03 1
5 1 2020-06-03 1
6 2 2020-02-03 1
7 3 2020-03-03 0
8 4 2020-04-03 0
