I have a dataframe that is similar to:
Date Name
2020-04-01 ABCD
2020-04-01 Test
2020-04-01 Run1
2020-04-02 Run1
2020-04-03 XXX1
2020-04-03 Test
I want to group by date and number the datapoints within each day. I also want a column that holds the total number of datapoints for that date on every row. Together, the two columns give a quick reference for the data, e.g. on 4-15-20, scan 10 of 23. This is the desired result:
Date Name # Total Scans
2020-04-01 ABCD 1 3
2020-04-01 Test 2 3
2020-04-01 Run1 3 3
2020-04-02 Run1 1 1
2020-04-03 XXX1 1 2
2020-04-03 Test 2 2
So far I have:
>>> df["#"] = df.groupby(['Date']).cumcount() + 1
Date Name #
2020-04-01 ABCD 1
2020-04-01 Test 2
2020-04-01 Run1 3
2020-04-02 Run1 1
2020-04-03 XXX1 1
2020-04-03 Test 2
However, I'm having trouble adding the last column without iterating over the dataset. Everything I've read says iterating over a dataframe is a no-no, and the size of the file causes tremendous lag in testing, which confirms that iteration is a bad idea.
Can anyone give me input here without needing iteration? Thanks
Let's do:
groups = df.groupby('Date')
df['#'] = groups.cumcount() + 1
df['Total Scans'] = groups['Date'].transform('size')
output:
Date Name # Total Scans
0 2020-04-01 ABCD 1 3
1 2020-04-01 Test 2 3
2 2020-04-01 Run1 3 3
3 2020-04-02 Run1 1 1
4 2020-04-03 XXX1 1 2
5 2020-04-03 Test 2 2
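For reference, here is a self-contained version of the snippet above, run on the sample rows from the question. The frame is rebuilt by hand, and the `label` column is a hypothetical addition (not part of the answer) that only illustrates the "scan 10 of 23" quick reference the question describes.

```python
import pandas as pd

# Sample rows copied from the question; the real file is assumed to look similar.
df = pd.DataFrame({
    "Date": ["2020-04-01", "2020-04-01", "2020-04-01",
             "2020-04-02", "2020-04-03", "2020-04-03"],
    "Name": ["ABCD", "Test", "Run1", "Run1", "XXX1", "Test"],
})

groups = df.groupby("Date")
df["#"] = groups.cumcount() + 1                       # running number within each date
df["Total Scans"] = groups["Date"].transform("size")  # group size repeated on every row

# Hypothetical convenience column combining both numbers.
df["label"] = "scan " + df["#"].astype(str) + " of " + df["Total Scans"].astype(str)
print(df)
```

transform returns a result aligned with the original rows, which is why it can be assigned directly as a column without any merge.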
I have a DataFrame df and I am trying to calculate a cumulative count based on the condition that the date in the column at is greater than or equal to the dates in the column recovery_date.
Here is the original df:
at recovery_date
0 2020-02-01 2020-03-02
1 2020-03-01 2020-03-31
2 2020-04-01 2020-05-01
3 2020-05-01 2020-05-31
4 2020-06-01 2020-07-01
Here is the desired outcome:
at recovery_date result
0 2020-02-01 2020-03-02 0
1 2020-03-01 2020-03-31 0
2 2020-04-01 2020-05-01 2
3 2020-05-01 2020-05-31 3
4 2020-06-01 2020-07-01 4
The interpretation is that, for each at, result is the number of recovery_dates that fall before it or on the same day.
I am trying to avoid using a for loop as I am implementing this for a time-sensitive application.
This is a solution I was able to find, however I am looking for something more performant:
def how_many(at: pd.Timestamp, recoveries: pd.Series) -> int:
    return (at >= recoveries).sum()

df["result"] = [how_many(row["at"], df["recovery_date"][:idx]) for idx, row in df.iterrows()]
Thanks a lot!!
You're looking for something like this:
df['result'] = df['at'].apply(lambda at: (at >= df['recovery_date']).sum())
What this does: for each value in the at column, it compares that value against every recovery_date and marks the ones that are on or before it. At this point we have an array of True (=1) and False (=0) values, which are then summed.
This yields your desired output:
at recovery_date count result
0 2020-02-01 2020-03-02 1 0
1 2020-03-01 2020-03-31 1 0
2 2020-04-01 2020-05-01 1 2
3 2020-05-01 2020-05-31 1 3
4 2020-06-01 2020-07-01 1 4
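The apply call above still runs one comparison pass per row. As a possible alternative (a sketch, not something proposed in the thread), the same count can be obtained by sorting the recovery dates once and using np.searchsorted. This assumes both columns are already datetime64 and contain no missing values, and, like the apply version, it counts against all recovery dates rather than only the preceding rows.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "at": pd.to_datetime(["2020-02-01", "2020-03-01", "2020-04-01",
                          "2020-05-01", "2020-06-01"]),
    "recovery_date": pd.to_datetime(["2020-03-02", "2020-03-31", "2020-05-01",
                                     "2020-05-31", "2020-07-01"]),
})

# Sort the recovery dates once, then binary-search each `at` value.
# side="right" returns the number of sorted values that are <= the probe.
sorted_recoveries = np.sort(df["recovery_date"].to_numpy())
df["result"] = np.searchsorted(sorted_recoveries, df["at"].to_numpy(), side="right")
print(df)
```

Sorting costs O(n log n) and each lookup O(log n), instead of the O(n) comparison per row that the apply version performs.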
I have the following table and I want to count the number of active jobs, per client, on each day in 2020. A job is active if the date falls on or between its start_date and end_date.
job    client   start_date  end_date
AA001  ALPHA    2020/12/19  2020/12/28
AA002  ALPHA    2020/04/03  2020/10/10
AA003  BRAVO    2020/10/11  2020/10/11
AA004  CHARLIE  2020/04/06  2020/11/15
AA005  ALPHA    2020/04/01  2020/04/30
AA006  CHARLIE  2020/05/01  2020/06/03
AA007  BRAVO    2020/06/04  2020/06/17
AA008  BRAVO    2020/06/18  2020/07/01
AA009  CHARLIE  2020/07/02  2020/08/04
AA010  ALPHA    2020/05/05  2020/08/06
AA011  BRAVO    2020/10/12  2020/11/04
For instance, here is how many jobs were active for client ALPHA at the beginning of April:
Client  Date        Active jobs
ALPHA   2020-04-01  1
ALPHA   2020-04-02  1
ALPHA   2020-04-03  2
ALPHA   2020-04-04  2
ALPHA   2020-04-05  2
ALPHA   2020-04-06  2
ALPHA   2020-04-07  2
ALPHA   2020-04-08  2
ALPHA   2020-04-09  2
ALPHA   2020-04-10  2
I can solve this problem using nested loops, e.g.
groups = df.groupby(["client"])
dates = pd.date_range('2020-01-01','2020-12-01', freq='D')
for client, jobs in groups:
    for date in dates:
        active_jobs = jobs.loc[(jobs.start_date <= date) & (jobs.end_date >= date)]
        print(date,client,len(active_jobs))
(Explanation: group rows by client, construct a list of dates, then for each date for each client, find/count the rows where start_date <= date and end_date >= date.)
Of course my real data is much larger than this and looping is very inefficient. How do I rewrite my query to take advantage of vectorization?
Approach with broadcasting
Compare every date against the start_date and end_date columns; this produces a boolean mask with one row per job and one column per date. Build a new dataframe from this mask, using the dates as column names, then group it by client and aggregate with sum to count the active jobs for each client on each day:
start, end = df[['start_date', 'end_date']].to_numpy().T
dates = pd.date_range('2020-01-01','2020-12-01', freq='D').to_numpy()
m = (start[:, None] <= dates) & (end[:, None] >= dates)
s = pd.DataFrame(m, columns=dates).groupby(df['client']).sum().stack()
After stacking, the resulting series containing the counts of active jobs looks like this:
>>> s
client
ALPHA 2020-01-01 0
2020-01-02 0
2020-01-03 0
2020-01-04 0
2020-01-05 0
..
CHARLIE 2020-11-27 0
2020-11-28 0
2020-11-29 0
2020-11-30 0
2020-12-01 0
Length: 1008, dtype: int64
Examining the active jobs for client ALPHA for the month of April:
>>> s.loc[pd.IndexSlice['ALPHA', '2020-04-01':'2020-04-30']]
client
ALPHA 2020-04-01 1
2020-04-02 1
2020-04-03 2
2020-04-04 2
2020-04-05 2
2020-04-06 2
2020-04-07 2
2020-04-08 2
2020-04-09 2
2020-04-10 2
2020-04-11 2
2020-04-12 2
2020-04-13 2
2020-04-14 2
2020-04-15 2
2020-04-16 2
2020-04-17 2
2020-04-18 2
2020-04-19 2
2020-04-20 2
2020-04-21 2
2020-04-22 2
2020-04-23 2
2020-04-24 2
2020-04-25 2
2020-04-26 2
2020-04-27 2
2020-04-28 2
2020-04-29 2
2020-04-30 2
dtype: int64
PS: Broadcasting is faster, but it requires enough memory to hold the intermediate boolean mask (one row per job, one column per date). You also have to convert the start_date and end_date columns to pandas datetime format before using this approach.
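The datetime conversion mentioned in the PS, followed by the broadcast comparison, can be sketched end to end as follows. This is a minimal illustration on three of the jobs from the question and a five-day window; the shortened date range and the variable names mask and counts are choices made for the example, not part of the original answer.

```python
import pandas as pd

# Three of the jobs from the question, with dates still stored as strings.
df = pd.DataFrame({
    "job": ["AA002", "AA005", "AA003"],
    "client": ["ALPHA", "ALPHA", "BRAVO"],
    "start_date": ["2020/04/03", "2020/04/01", "2020/10/11"],
    "end_date": ["2020/10/10", "2020/04/30", "2020/10/11"],
})

# Convert to datetime first; comparing raw strings against dates would not work.
df["start_date"] = pd.to_datetime(df["start_date"], format="%Y/%m/%d")
df["end_date"] = pd.to_datetime(df["end_date"], format="%Y/%m/%d")

start = df["start_date"].to_numpy()
end = df["end_date"].to_numpy()
dates = pd.date_range("2020-04-01", "2020-04-05", freq="D")
dates_np = dates.to_numpy()

# Shape (n_jobs, n_dates): True where the job is active on that date.
mask = (start[:, None] <= dates_np) & (end[:, None] >= dates_np)

# One row per client, one column per date, values are active-job counts.
counts = pd.DataFrame(mask, columns=dates).groupby(df["client"]).sum()
print(counts)
```

The mask holds n_jobs x n_dates booleans, so for very large inputs it may be necessary to process the date range in chunks.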
My data is like this
date group meet_criteria
2020-03-31 1 no
2020-04-01 1 yes
2020-04-02 1 no
2020-04-03 1 no
2020-04-04 1 yes
2020-04-05 1 no
2020-03-31 2 yes
2020-04-01 2 no
I would like to create another column equal to 1 divided by the number of days since the last date in the group on which meet_criteria was yes. The current row's meet_criteria is excluded, and if the group has not met the criteria before that date, the value is 0.
My desired data will look like this
date group meet_criteria last_time_met_criteria
2020-03-31 1 no 0
2020-04-01 1 yes 0
2020-04-02 1 no 1
2020-04-03 1 no 0.5
2020-04-04 1 yes 0.333333
2020-04-05 1 no 1
2020-03-31 2 yes 0
2020-04-01 2 no 1
Is there any way to do this efficiently in pandas? Thanks
This can be done using pd.merge_asof & subsequent calculations in pandas.
Here's a fully worked example with your data (original data loaded into a variable called df, and df.date converted to datetime first)
import numpy as np
import pandas as pd

# sorting necessary for how `merge_asof` will be used
df2 = df.sort_values(['date', 'group'])

# construct the `right` data frame of dates to lookup
df_meet_criteria = df2[df2.meet_criteria == 'yes'].copy()
df_meet_criteria['date_met_criteria'] = df_meet_criteria.date

# merge
# `by`: columns to do regular merge on
# `on`: columns to do as_of merge on
# `allow_exact_matches`: True -> closed interval, False -> open interval,
#                        i.e. latest date before current date
last_date = pd.merge_asof(
    df2,
    df_meet_criteria,
    by='group',
    on='date',
    allow_exact_matches=False,
    suffixes=('', '_y')
).sort_values(['group', 'date'])

# calculate the inverse of the number of days since the criteria was last met
last_date['days_since'] = (last_date.date - last_date.date_met_criteria).dt.days
last_date.loc[last_date.days_since == 0, 'days_since'] = np.nan
last_date['last_time_met_criteria'] = (1 / last_date.days_since).fillna(0)
final = last_date[['date', 'group', 'meet_criteria', 'last_time_met_criteria']]
The final dataframe looks like this:
date group meet_criteria last_time_met_criteria
0 2020-03-31 1 no 0.000000
2 2020-04-01 1 yes 0.000000
4 2020-04-02 1 no 1.000000
5 2020-04-03 1 no 0.500000
6 2020-04-04 1 yes 0.333333
7 2020-04-05 1 no 1.000000
1 2020-03-31 2 yes 0.000000
3 2020-04-01 2 no 1.000000
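As a possible alternative to the merge (a sketch, not taken from the answer above), the previous "yes" date can also be carried forward within each group using groupby with shift and ffill. It assumes the date column is already datetime and that rows are sorted by date within each group; the intermediate names yes_date, prev_yes and last_yes are made up for the example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-31", "2020-04-01", "2020-04-02", "2020-04-03",
                            "2020-04-04", "2020-04-05", "2020-03-31", "2020-04-01"]),
    "group": [1, 1, 1, 1, 1, 1, 2, 2],
    "meet_criteria": ["no", "yes", "no", "no", "yes", "no", "yes", "no"],
})
df = df.sort_values(["group", "date"])

# Keep the date only on rows that met the criteria (NaT elsewhere).
yes_date = df["date"].where(df["meet_criteria"] == "yes")

# Shift within each group so the current row is excluded, then carry the
# most recent "yes" date forward within the group.
prev_yes = yes_date.groupby(df["group"]).shift()
last_yes = prev_yes.groupby(df["group"]).ffill()

# Days since the last "yes"; NaN where the group never met the criteria before.
days_since = (df["date"] - last_yes).dt.days
df["last_time_met_criteria"] = (1 / days_since).fillna(0)
print(df)
```

Because the shift happens before the forward fill, a row that is itself "yes" still looks back to the previous "yes" date, matching the 0.333333 value on 2020-04-04 in the desired output.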
I have a time series dataset which is basically consumption data of materials over the past 5 years
Material No Consumption Date Consumption
A 2019-06-01 1
A 2019-07-01 2
A 2019-08-01 3
A 2019-09-01 4
A 2019-10-01 0
A 2019-11-01 0
A 2019-12-01 0
A 2020-01-01 1
A 2020-02-01 2
A 2020-03-01 3
A 2020-04-01 0
A 2020-05-01 0
B 2019-06-01 0
B 2019-07-01 0
B 2019-08-01 0
B 2019-09-01 4
B 2019-10-01 0
B 2019-11-01 0
B 2019-12-01 0
B 2020-01-01 4
B 2020-02-01 2
B 2020-03-01 8
B 2020-04-01 0
B 2020-05-01 0
From the above dataframe, I want to see the number of months in which the material had at least 1 unit of consumption. The output dataframe should look something like this.
Material no_of_months(Jan2020-May2020) no_of_months(Jun2019-May2020)
A 3 7
B 3 4
Currently I'm subsetting the data frame and using a groupby to count the unique entries with non-zero consumption. However, this requires creating a separate data frame for each period and then merging them. I was wondering whether this could be done in a better way, for example using dictionaries.
# `grouper` is defined elsewhere (not shown in the post)
consumption_jan20_may20 = consumption.loc[consumption['Consumption Date']>='2020-01-01',['Material No','Consumption Date','Consumption']]
consumption_jan20_may20 = consumption_jan20_may20.groupby([pd.Grouper(key='Material No'),grouper])['Consumption'].count().reset_index()
consumption_jan20_may20 = consumption_jan20_may20.groupby('Material No').count().reset_index()
consumption_jan20_may20.columns = ['Material No','no_of_months(Jan2020-May2020)','dummy']
consumption_jan20_may20 = consumption_jan20_may20[['Material No','no_of_months(Jan2020-May2020)']]
You can first limit the data to the period you are investigating. For example, to keep only the first six rows:
df = df[:6]
Then use the code below to find the rows where the material consumption is not zero:
df_nonzero = df[df['Consumption'] != 0]
To see how many months have non-zero consumption, take the length of the new data frame:
len(df_nonzero)
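The slicing above does not separate the materials or handle several periods at once. One way to get the per-material table from the question is sketched below, using a dictionary of periods as the asker suggested. The periods dict and the variable names are assumptions made for the example, and the sketch relies on the data having exactly one row per material per month.

```python
import pandas as pd

# Sample data copied from the question (one row per material per month).
months = pd.date_range("2019-06-01", periods=12, freq="MS")
consumption = pd.DataFrame({
    "Material No": ["A"] * 12 + ["B"] * 12,
    "Consumption Date": list(months) * 2,
    "Consumption": [1, 2, 3, 4, 0, 0, 0, 1, 2, 3, 0, 0,
                    0, 0, 0, 4, 0, 0, 0, 4, 2, 8, 0, 0],
})

# Hypothetical mapping: output column label -> (first day, last day) of the period.
periods = {
    "no_of_months(Jan2020-May2020)": ("2020-01-01", "2020-05-31"),
    "no_of_months(Jun2019-May2020)": ("2019-06-01", "2020-05-31"),
}

nonzero = consumption["Consumption"] > 0
dates = consumption["Consumption Date"]

# For each period, count the rows per material that are non-zero and in range.
result = pd.DataFrame({
    label: (nonzero & dates.between(pd.Timestamp(start), pd.Timestamp(end)))
           .groupby(consumption["Material No"]).sum()
    for label, (start, end) in periods.items()
})
print(result)
```

Adding another period only requires another entry in the dictionary; no intermediate data frames or merges are needed.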
I have the following problem. I've got a dataframe with start and end dates for each group. There might be more than one start and end date per group, like this:
group start_date end_date
1 2020-01-03 2020-03-03
1 2020-05-03 2020-06-03
2 2020-02-03 2020-06-03
And another dataframe with one row per date, per group, like this:
group date
1 2020-01-03
1 2020-02-03
1 2020-03-03
1 2020-04-03
1 2020-05-03
1 2020-06-03
2 2020-02-03
3 2020-03-03
4 2020-04-03
.
.
So I want to create a column is_between in an efficient way, ideally avoiding loops, so I get the following dataframe
group date is_between
1 2020-01-03 1
1 2020-02-03 1
1 2020-03-03 1
1 2020-04-03 0
1 2020-05-03 1
1 2020-06-03 1
2 2020-02-03 1
3 2020-03-03 1
4 2020-04-03 1
.
.
So it gets a 1 when a group's date is between the dates in the first dataframe. I'm guessing some combination of groupby, where, between and maybe map might do it, but I'm not finding the correct one. Any ideas?
Based on @YOBEN_S's and @Quang Hoang's advice, this did it:
df = df.merge(dic_dates, how='left')
df['is_between'] = np.where(df.date.between(pd.to_datetime(df.start_date),
                                            pd.to_datetime(df.end_date)), 1, 0)
df = (df.sort_values(by=['group', 'date', 'is_between'])
        .drop_duplicates(subset=['group', 'date'], keep='last'))
You could try merge_asof, matching by group and on date against start_date, then check where date is less than or equal to end_date, and finally assign the result back to the original df2:
ser = (pd.merge_asof(df2.reset_index()  # for later index alignment
                        .sort_values('date'),
                     df1.sort_values('start_date'),
                     by='group',
                     left_on='date', right_on='start_date',
                     direction='backward')
         .assign(is_between=lambda x: x.date <= x.end_date)
         .set_index(['index'])['is_between']
      )
df2['is_between'] = ser.astype(int)
print (df2)
group date is_between
0 1 2020-01-03 1
1 1 2020-02-03 1
2 1 2020-03-03 1
3 1 2020-04-03 0
4 1 2020-05-03 1
5 1 2020-06-03 1
6 2 2020-02-03 1
7 3 2020-03-03 0
8 4 2020-04-03 0
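Note that merge_asof only compares each date with the single most recent start_date in its group, so it relies on the intervals within a group not overlapping. If they might overlap, a plain left merge followed by a per-row any is one possible variant. The sketch below rebuilds the sample frames from the question; the names merged and inside are made up for the example.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    "group": [1, 1, 2],
    "start_date": pd.to_datetime(["2020-01-03", "2020-05-03", "2020-02-03"]),
    "end_date": pd.to_datetime(["2020-03-03", "2020-06-03", "2020-06-03"]),
})
df2 = pd.DataFrame({
    "group": [1, 1, 1, 1, 1, 1, 2, 3, 4],
    "date": pd.to_datetime(["2020-01-03", "2020-02-03", "2020-03-03", "2020-04-03",
                            "2020-05-03", "2020-06-03", "2020-02-03", "2020-03-03",
                            "2020-04-03"]),
})

# Pair every date with every interval of its group; keep the original row
# position in the "index" column so the result can be aligned back to df2.
merged = df2.reset_index().merge(df1, on="group", how="left")

# Groups without any interval get NaT bounds, which compare as False.
inside = merged["date"].between(merged["start_date"], merged["end_date"])

# A date counts as "between" if it falls inside at least one interval.
df2["is_between"] = inside.groupby(merged["index"]).any().astype(int)
print(df2)
```

The merge creates one row per (date, interval) pair within a group, so memory use grows with the number of intervals per group; merge_asof avoids that when intervals are known not to overlap.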