I have a table like the following, but with approximately 7 million rows. What I am trying to find out is how many cases each user is working on simultaneously. I would like to group by the username and then get an average count of how many references are open concurrently between the two times.
Reference | starttime               | stoptime                | Username
1         | 2020-07-28 06:41:56.000 | 2020-07-28 07:11:25.000 | Arthur
2         | 2020-07-18 13:24:02.000 | 2020-07-18 13:38:42.000 | Arthur
3         | 2020-07-03 09:27:03.000 | 2020-07-03 10:35:24.000 | Arthur
4         | 2020-07-05 19:42:38.000 | 2020-07-05 20:07:52.000 | Bob
5         | 2020-07-04 10:22:48.000 | 2020-07-04 10:24:32.000 | Bob
Any ideas?
Someone asked a similar question just yesterday so here it is:
ends = df['starttime'].values < df['stoptime'].values[:, None]
starts = df['starttime'].values > df['starttime'].values[:, None]
same_name = (df['Username'].values == df['Username'].values[:, None])
# check for rows where all three conditions are met
# count the number of matches by summing across axis=1
df['overlap'] = (ends & starts & same_name).sum(1)
df
To answer your final question for the mean value you would then run:
df['overlap'].mean()
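Since the question asks for the average per user rather than across the whole table, a grouped version of that last step (a minimal sketch using the overlap column computed above) would be:

# average overlap count per user instead of over the whole frame
df.groupby('Username')['overlap'].mean()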
I would use the Pandas groupby function, as your tag already suggests, grouping by username. The general workflow per grouped user is:
Collect all start times and stop times as 'moments of change in activity'.
Loop over all of them within the grouped dataframe.
Use e.g. pandas.DataFrame.loc to check how many cases are 'active' at each moment of change.
Save these counts in a list to compute the average number of active cases.
I don't have your code, but in pseudo-code it would look something like:
import numpy as np
import pandas as pd

df = ...  # your raw df
grouped = df.groupby(by='Username')

for user, user_df in grouped:
    active_cases = []
    user_starts_cases = user_df['starttime'].to_numpy()
    user_stops_cases = user_df['stoptime'].to_numpy()
    # every start or stop is a moment where the number of active cases can change
    times_of_activity_changes = np.concatenate([user_starts_cases, user_stops_cases])
    for xs in times_of_activity_changes:
        # count the cases that are open at this moment (mind the brackets)
        num_activities = len(user_df.loc[(user_df['starttime'] <= xs) & (user_df['stoptime'] >= xs)])
        active_cases.append(num_activities)
    print(user, sum(active_cases) / len(active_cases))
It depends a bit on what you would call 'on average', but with this you could sample the number of active cases at the times of interest and compute an average.
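If you would rather collect the results than print them, a small variation of the same sketch (the name average_concurrency is illustrative, not from the original post) could store the per-user averages:

# collect the per-user averages into a Series instead of printing (sketch)
average_concurrency = {}
for user, user_df in grouped:
    changes = np.concatenate([user_df['starttime'].to_numpy(), user_df['stoptime'].to_numpy()])
    counts = [((user_df['starttime'] <= t) & (user_df['stoptime'] >= t)).sum() for t in changes]
    average_concurrency[user] = sum(counts) / len(counts)
result = pd.Series(average_concurrency, name='avg_concurrent_cases')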
I have data that looks like this:
id Date Time assigned_pat_loc prior_pat_loc Activity
0 45546325 2/7/2011 4:29:38 EIAB^EIAB^6 NaN Admission
1 45546325 2/7/2011 5:18:22 8W^W844^A EIAB^EIAB^6 Observation
2 45546325 2/7/2011 5:18:22 8W^W844^A EIAB^EIAB^6 Transfer to 8W
3 45546325 2/7/2011 6:01:44 8W^W858^A 8W^W844^A Bed Movement
4 45546325 2/7/2011 7:20:44 8W^W844^A 8W^W858^A Bed Movement
5 45546325 2/9/2011 18:36:03 8W^W844^A NaN Discharge-Observation
6 45666555 3/8/2011 20:22:36 EIC^EIC^5 NaN Admission
7 45666555 3/9/2011 1:08:04 53^5314^A EIC^EIC^5 Admission
8 45666555 3/9/2011 1:08:04 53^5314^A EIC^EIC^5 Transfer to 53
9 45666555 3/9/2011 17:03:38 53^5336^A 53^5314^A Bed Movement
I need to find where multiple patients (identified by the id column) were in the same room at the same time, along with the start and end times of the overlap, the dates, and the room number (assigned_pat_loc). assigned_pat_loc is the current patient location in the hospital, formatted as “unit^room^bed”.
So far I've done the following:
# Read in CSV file and remove bed number from patient location
patient_data = pd.read_csv('raw_data.csv')
patient_data['assigned_pat_loc'] = patient_data['assigned_pat_loc'].str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)
# Convert Date column to datetime type
patient_data['Date'] = pd.to_datetime(patient_data['Date'])
# Sort dataframe by date
patient_data.sort_values(by=['Date'], inplace = True)
# Identify rows with duplicate room and date assignments, indicating multiple patients shared room
same_room = patient_data.duplicated(subset = ['Date','assigned_pat_loc'])
# Assign duplicates to new dataframe
df_same_rooms = patient_data[same_room]
# Remove duplicate patient ids but keep latest one
no_dups = df_same_rooms.drop_duplicates(subset = ['id'], keep = 'last')
# Group patients in the same rooms at the same times together
df_shuf = pd.concat(group[1] for group in df_same_rooms.groupby(['Date', 'assigned_pat_loc'], sort=False))
And then I'm stuck at this point:
id Date Time assigned_pat_loc prior_pat_loc Activity
599359 42963403 2009-01-01 12:32:25 11M^11MX 4LD^W463^A Transfer
296155 42963484 2009-01-01 16:41:55 11M^11MX EIC^EIC^2 Transfer
1373 42951976 2009-01-01 15:51:09 11M^11MX NaN Discharge
362126 42963293 2009-01-01 4:56:57 11M^11MX EIAB^EIAB^6 Transfer
362125 42963293 2009-01-01 4:56:57 11M^11MX EIAB^EIAB^6 Admission
... ... ... ... ... ... ...
268266 46381369 2011-09-09 18:57:31 54^54X 11M^1138^A Transfer
16209 46390230 2011-09-09 6:19:06 10M^1028 EIAB^EIAB^5 Admission
659699 46391825 2011-09-09 14:28:20 9W^W918 EIAB^EIAB^3 Transfer
659698 46391825 2011-09-09 14:28:20 9W^W918 EIAB^EIAB^3 Admission
268179 46391644 2011-09-09 17:48:53 64^6412 EIE^EIE^3 Admission
Here you can see different patients in the same room at the same time, but I don't know how to extract those intervals of overlap between two different rows for the same room and the same times, and then format the result so that the start time and end time correspond to when the room sharing between the two patients began and ended. Below is the desired output, where r_id is the id of the other patient sharing the same room and length is the number of hours the room was shared.
As suggested, you can use groupby. One more thing you need to take care of is finding the overlapping time. Ideally you'd use datetime objects, which are easy to work with; however, your data is in a different format, so we need to convert it first to make the solution easier. Since you did not provide a workable example, I will just write the gist here:
# convert current format to datetime
# (this assumes a date column plus an integer hour column; adjust to your actual columns)
df['start_datetime'] = pd.to_datetime(df.start_date) + df.start_time.astype('timedelta64[h]')
df['end_datetime'] = pd.to_datetime(df.end_date) + df.end_time.astype('timedelta64[h]')

df = df.sort_values(['start_datetime', 'end_datetime'], ascending=[True, False])

gb = df.groupby('r_id')
for g, g_df in gb:
    g_df['overlap_group'] = (g_df['end_datetime'].cummax().shift() <= g_df['start_datetime']).cumsum()
    print(g_df)
This is a tentative example, and you might need to tweak the datetime conversion and some other minor things, but this is the gist.
The cummax() detects whether an interval overlaps the ones before it, and cumsum() increments a counter every time a new non-overlapping group starts; since it is a running counter, we can use it as a unique group identifier.
I used the following threads:
Group rows by overlapping ranges
python/pandas - converting date and hour integers to datetime
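To illustrate the cummax()/cumsum() idiom in isolation, here is a small standalone example (toy intervals, not taken from the question's data):

import pandas as pd

# toy intervals, already sorted by start time
toy = pd.DataFrame({
    'start_datetime': pd.to_datetime(['2021-01-01 09:00', '2021-01-01 09:30',
                                      '2021-01-01 11:00', '2021-01-01 11:15']),
    'end_datetime':   pd.to_datetime(['2021-01-01 10:00', '2021-01-01 09:45',
                                      '2021-01-01 11:30', '2021-01-01 12:00']),
})
# a new group starts whenever every earlier interval has already ended,
# i.e. the running maximum end time is at or before the current start time
toy['overlap_group'] = (toy['end_datetime'].cummax().shift() <= toy['start_datetime']).cumsum()
print(toy)
# rows 0-1 form overlap_group 0, rows 2-3 form overlap_group 1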
Edit
After discussing it with the OP, the idea is to take each patient's df and sort it by the date of the event. The first event will be the start_time and the last one the end_time.
Unifying the date and time columns is not necessary for finding the start and end times, since sorting by date and then by time gives the same order as sorting a single combined column would. For the overlap detection, however, having everything in one column does make life easier.
gb_patient = df.groupby('id')
patients_data_list = []
for patient_id, patient_df in gb_patient:
    patient_df = patient_df.sort_values(by=['Date', 'Time'])
    patient_data = {
        "patient_id": patient_id,
        # assumes Date and Time are strings; join with a space so they can be parsed as datetimes later
        "start_time": patient_df.Date.values[0] + ' ' + patient_df.Time.values[0],
        "end_time": patient_df.Date.values[-1] + ' ' + patient_df.Time.values[-1]
    }
    patients_data_list.append(patient_data)

new_df = pd.DataFrame(patients_data_list)
After that they can use the above code for the overlaps.
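One practical note: before feeding new_df into the overlap code, the combined date/time strings would need to be parsed into real datetimes and sorted (a minimal sketch, assuming the string formats shown in the question):

new_df['start_time'] = pd.to_datetime(new_df['start_time'])
new_df['end_time'] = pd.to_datetime(new_df['end_time'])
new_df = new_df.sort_values(['start_time', 'end_time'], ascending=[True, False])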
I have an issue where I need to track the progression of patients' insurance claim statuses based on the dates of those statuses. I also need to create a count of statuses based on certain conditions.
DF:
ClaimID
New
Accepted
Denied
Pending
Expired
Group
001
2021-01-01T09:58:35:335Z
2021-01-01T10:05:43:000Z
A
002
2021-01-01T06:30:30:000Z
2021-03-01T04:11:45:000Z
2021-03-01T04:11:53:000Z
A
003
2021-02-14T14:23:54:154Z
2021-02-15T11:11:56:000Z
2021-02-15T11:15:00:000Z
A
004
2021-02-14T15:36:05:335Z
2021-02-14T17:15:30:000Z
A
005
2021-02-14T15:56:59:009Z
2021-03-01T10:05:43:000Z
A
In the above dataset, we have 6 columns. ClaimID is simple and just indicates the ID of the claim. New, Accepted, Denied, Pending, and Expired indicate the status of the claim and the day/time those statuses were set.
What I need to do is get a count of how many claims are New on each day and how many move out of New into another status. For example, there are 2 new claims on 2021-01-01. On that same day, 1 moved to Accepted about 7 minutes later. Thus on 2021-01-01 the table of counts would read:
DF_Count:
Date       | New  | Accepted | Denied | Pending | Expired
2021-01-01 | 2    | 1        | 0      | 0       | 0
2021-01-02 | 1    | 0        | 0      | 0       | 0
2021-01-03 | 1    | 0        | 0      | 0       | 0
2021-01-04 | 1    | 0        | 0      | 0       | 0
2021-01-05 | 1    | 0        | 0      | 0       | 0
....       | .... | ....     | ....   | ....    | ....
2021-02-14 | 4    | 2        | 0      | 0       | 0
2021-02-15 | 2    | 3        | 0      | 0       | 1
2021-02-16 | 2    | 2        | 0      | 0       | 0
Few Conditions:
If a claim moves from one status to another on the same day (even if the statuses are minutes or hours apart), it would not be subtracted from the original status until the next day. This can be seen on 2021-01-01, where claim 001 moves from New to Accepted on the same day but the claim is not subtracted from New until 2021-01-02.
Until something happens to a claim, it should remain in its original status. Claim 002 will remain in New until 2021-03-01, when it is approved.
If a claim changes status on a later date than its original status, it will be subtracted on that later date. For this, see claim 003. It is New on 2/14 but Accepted on 2/15. This is why New goes down by 2 on 2/15 (the other claim is 004, which is new and accepted on the same day).
For certain statuses, I do not need to look at all columns. For example, for New I only look at the dates inside Accepted and Denied, not Pending and Expired. When I do these same steps for Accepted, I no longer need to look at New, just the other columns. How would I do that?
In the final DF_Count table, the dates should start from the earliest date in 'New' and end on today's date.
The code needs to be grouped by the Group Column as well. For example, patients in group B (not pictured) will have to have the same start and end date but for their own claims.
I need to do this separately for all of the statuses. Not just new.
Current Solution:
My current solution has been to create a dataset with just the dates from the minimum New date to today's date. Then, for each column, I use the .loc method to find dates that are greater than New in each of the other columns. For example, in the code below I look for all cases where New is equal to Accepted.
df1 = df.loc[(df['New'] == df['Accepted']) &
             ((df['Expired'].isnull()) | (df['Expired'] >= df['Accepted'])) &
             ((df['Pending'].isnull()) | (df['Pending'] >= df['Accepted'])) &
             ((df['Denied'].isnull()) | (df['Denied'] >= df['Accepted']))]
newtoaccsday = df1.loc[:, ('Group', 'Accepted')]
newtoaccsday['Date'] = newtoaccsday['Accepted']
newtoaccsday = newtoaccsday.reset_index(drop=True)
newtoaccsday = newtoaccsday.groupby(['Date', 'Group'], as_index=False)['Accepted'].value_counts()
newtoaccsday.drop(columns={'Accepted'}, inplace=True)
newtoaccsday.rename(columns={'count': 'NewAccSDay'}, inplace=True)
newtoaccsday['Date'] = newtoaccsday['Date'] + timedelta(1)
df_count = df_count.merge(newtoaccsday, how='left', on=['Date', 'Group']).fillna(0)
After doing the above steps for all conditions (where New goes to Accepted on a later date, etc.), I do the final calculation for New:
df_count['New'] = df_count.eval('New = New - (NewAccSDay + NewAccLater + NewDenSDay + NewDenLater + NewExpLater + NewPendSDay + NewPendLater)').groupby(['Tier2_ID', 'ClaimType'])['New'].cumsum()
Any and all help would be greatly appreciated. My method above is extremely inefficient and is leading to some errors. Do I need to write a for loop for this? What is the best way to go about it?
First convert the date columns with something like
for i in ['New', 'Accepted', 'Denied', 'Pending', 'Expired']:
    df[i] = pd.to_datetime(df[i], format="%Y-%m-%dT%H:%M:%S:%f%z")
Then develop the applicable date range based on your column conditions. In this logic, if Denied is present the range is New --> Denied; otherwise, if Accepted is present, it is New --> Accepted; and if there is no acceptance, it is New --> today. Code like the following (alter as per your rules):
from datetime import datetime  # needed for datetime.today()

df['new_range'] = df[['New', 'Accepted', 'Denied']].apply(
    lambda x: pd.date_range(x['New'], x['Denied']).date.tolist() if pd.notnull(x['Denied'])
    else pd.date_range(x['New'], x['Accepted']).date.tolist() if pd.notnull(x['Accepted'])
    else pd.date_range(x['New'], datetime.today()).date.tolist(),
    axis=1)
You should be able to filter on a group and see the date ranges in your df, like:
df[df['Group']=='A']['new_range']
0 [2021-01-01]
1 [2021-01-01, 2021-01-02, 2021-01-03, 2021-01-0...
2 [2021-02-14]
3 [2021-02-14]
4 [2021-02-14, 2021-02-15, 2021-02-16, 2021-02-1..
Then you can explode the date ranges and group by date to get the New counts for each day, with code like:
new = pd.to_datetime(df[df['Group'] == 'A']['new_range'].explode()).reset_index()
newc = new.groupby('new_range').count()
newc
new_range
2021-01-01 2
2021-01-02 1
2021-01-03 1
2021-01-04 1
2021-01-05 1
2021-01-06 1...
Similarly, get the counts for Accepted and Denied, then left join on date to arrive at the final table, filling NaN with 0.
By encoding your rules as expanded date ranges, then exploding over the date range and grouping to get your counts, you should be able to avoid much of the expensive row-by-row work.
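To assemble the per-status daily counts into the final table, something along these lines could work (a rough sketch; new_counts, accepted_counts and denied_counts are illustrative names for count Series with a DatetimeIndex, built with the explode/groupby pattern above, not variables from the original answer):

import pandas as pd

# full daily range from the earliest New date up to today
full_range = pd.date_range(start=df['New'].min().normalize(),
                           end=pd.Timestamp.today().normalize(), freq='D')

# align the per-status count Series on the full range and fill missing days with 0
df_count = (pd.DataFrame({'New': new_counts,
                          'Accepted': accepted_counts,
                          'Denied': denied_counts})
            .reindex(full_range)
            .fillna(0)
            .astype(int)
            .rename_axis('Date')
            .reset_index())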
I think this is what you want, or it can be easily modified to your needs:
import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import date

def dateRange(d1, d2):
    # list of days from d1 up to (but not including) d2
    return [d1 + timedelta(days=x) for x in range((d2 - d1).days)]

def addCount(dic, group, dat, cat):
    if group not in dic:
        dic[group] = {}
    if dat not in dic[group]:
        dic[group][dat] = {}
    if cat not in dic[group][dat]:
        dic[group][dat][cat] = 0
    dic[group][dat][cat] += 1

df = pd.read_csv("testdf.csv",
                 parse_dates=["New", "Accepted", "Denied", "Pending", "Expired"])

cdic = {}
for i, row in df.iterrows():
    cid = row["ClaimID"]
    dnew = row["New"].date()
    dacc = row["Accepted"].date()
    dden = row["Denied"].date()
    dpen = row["Pending"].date()
    dexp = row["Expired"].date()
    group = row["Group"]
    if not pd.isna(dacc):  # Claim has been accepted
        if dnew == dacc:
            dacc += timedelta(days=1)
        nend = dacc
        addCount(cdic, group, dacc, "acc")
    if not pd.isna(dden):  # Claim has been denied
        if dnew == dden:
            dden += timedelta(days=1)
        if pd.isna(dacc):
            nend = dden
        addCount(cdic, group, dden, "den")
    if not pd.isna(dpen):
        addCount(cdic, group, dpen, "pen")  # Claim is pending
    if not pd.isna(dexp):
        addCount(cdic, group, dexp, "exp")  # Claim is expired
    if pd.isna(dacc) and pd.isna(dden):
        nend = date.today() + timedelta(days=1)
    for d in dateRange(dnew, nend):  # Fill new status until first change
        addCount(cdic, group, d, "new")

ndfl = []
for group in cdic:
    for dat in sorted(cdic[group].keys()):
        r = cdic[group][dat]
        ndfl.append([group, dat, r.get("new", 0), r.get("acc", 0),
                     r.get("den", 0), r.get("pen", 0), r.get("exp", 0)])
ndf = pd.DataFrame(ndfl, columns=["Group", "Date", "New", "Accepted", "Denied", "Pending", "Expired"])
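If you also need every calendar day from the earliest date up to today for each group (one of the requirements in the question), the result can be reindexed onto a full daily range - a sketch building on the ndf produced above:

filled = []
for group, g in ndf.groupby("Group"):
    g = g.set_index("Date").sort_index()
    # build the full range of days for this group, up to and including today
    full_days = pd.date_range(g.index.min(), date.today(), freq="D").date
    g = g.reindex(full_days, fill_value=0).rename_axis("Date").reset_index()
    g["Group"] = group  # reindex fills the Group column with 0 on new rows, so restore it
    filled.append(g)
ndf_full = pd.concat(filled, ignore_index=True)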
I have three dataframes:
ob (Orderbook) - an orderbook containing Part Numbers, the week they are due and the hours it takes to build them.
Part Number | Due Week | Build Hours
A           | 2022-46  | 4
A           | 2022-46  | 5
B           | 2022-46  | 8
C           | 2022-47  | 1.6
osm (Operator Skill Matrix) - a skills matrix containing operators' names and part numbers
Operator   | Part number
Mr.One     | A
Mr.One     | B
Mr.Two     | A
Mr.Two     | B
Mrs. Three | C
ah (Available Hours) - a list containing how many hours an operator can work in a given week
Operator   | YYYYWW  | Hours
Mr.One     | 2022-45 | 40
Mr.One     | 2022-46 | 35
Mr.Two     | 2022-46 | 37
Mr.Two     | 2022-47 | 39
Mrs. Three | 2022-47 | 40
Mrs. Three | 2022-48 | 45
I am trying to work out, for each week, whether there are enough operators with the right skills working enough hours to complete all of the orders on the orderbook, and if not, identify the orders that can't be completed.
Step by Step it would look like this:
Take the part number of the first row of the orderbook.
Search the skills matrix to find a list of operators who can build that part.
Search the hours list and check if the operators have any hours available for the week the order is due.
If an operator has hours available, add their name to that row of the orderbook.
Subtract the build hours in the orderbook from the available hours in the available-hours df.
Repeat this for each row in the orderbook until all orders have a name against them or there are no available hours left.
The only thing I could think to try was a bunch of nested for loops, but as there are thousands of rows it takes ~45 minutes to complete one iteration and would take days if not weeks to complete the whole thing.
# for each row in the orderbook
for i, rowi in ob_sum_hours.iterrows():
    # for each row in the operator skill matrix
    for j, rowj in osm.iterrows():
        # for each row in the available operator hours
        for y, rowy in aoh.iterrows():
            if (rowi['Material'] == rowj['MATERIAL'] and rowi['ProdYYYYWW'] == rowy['YYYYWW']
                    and rowj['Operator'] == rowy['Operator'] and rowy['Hours'] > 0):
                rowy['Hours'] -= rowi['PlanHrs']
                rowi['HoursAllocated'] = rowi['Operator']
The final result would look like this:
Part Number | Due Week | Build Hours | Operator
A           | 2022-46  | 4           | Mr.One
A           | 2022-46  | 5           | Mr.One
B           | 2022-46  | 8           | Mr.Two
C           | 2022-47  | 1.6         | Mrs.Three
Is there a better way to achieve this?
Made with one loop plus apply on each row.
Orderbook.groupby(Orderbook.index) groups by index, i.e. my_func is applied to each row, which is still better than an explicit loop.
In 'aaa' we get the list of unique operators that match the part number. In 'bbb' we filter Avaliable by 'YYYYWW', by 'Operator' (using isin with the list of unique operators), and by 'Hours' greater than 0. Then, iterating over the 'bbb' indices, we check the free time and, if 'ava' is greater than or equal to zero, set the values with explicit .loc indexing.
import pandas as pd

Orderbook = pd.read_csv('Orderbook.csv', header=0)
Operator = pd.read_csv('Operator.csv', header=0)
Avaliable = pd.read_csv('Avaliable.csv', header=0)

Orderbook['Operator'] = 'no'

def my_func(x):
    aaa = Operator.loc[Operator['Part number'] == x['Part Number'].values[0], 'Operator'].unique()
    bbb = Avaliable[(Avaliable['YYYYWW'] == x['Due Week'].values[0]) &
                    (Avaliable['Operator'].isin(aaa)) & (Avaliable['Hours'] > 0)]
    for i in bbb.index:
        ava = Avaliable.loc[i, 'Hours'] - x['Build Hours'].values
        if ava >= 0:
            Avaliable.loc[i, 'Hours'] = ava
            Orderbook.loc[x.index, 'Operator'] = Avaliable.loc[i, 'Operator']
            break  # added loop interrupt

Orderbook.groupby(Orderbook.index).apply(my_func)
print(Orderbook)
print(Avaliable)
Update 18.11.2022
I did it without explicit loops. But you need to check it; if you find something incorrect, please let me know. You can also measure the exact processing time by putting this at the beginning:
import datetime
now = datetime.datetime.now()
and printing the elapsed time at the end:
time_ = datetime.datetime.now() - now
print('elapsed time', time_)
the code:
Orderbook = pd.read_csv('Orderbook.csv', header=0)
Operator = pd.read_csv('Operator.csv', header=0)
Avaliable = pd.read_csv('Avaliable.csv', header=0)

Orderbook['Operator'] = 'no'

aaa = [Operator.loc[Operator['Part number'] == Orderbook.loc[i, 'Part Number'], 'Operator'].unique()
       for i in range(len(Orderbook))]

def my_func(x):
    bbb = Avaliable[(Avaliable['YYYYWW'] == x['Due Week'].values[0]) &
                    (Avaliable['Operator'].isin(aaa[x.index[0]])) & (Avaliable['Hours'] > 0)]
    fff = Avaliable.loc[bbb.index, 'Hours'] - x['Build Hours'].values
    ind = fff[fff.ge(0)].index
    if len(ind) == 0:  # no operator has enough remaining hours for this order
        return
    Avaliable.loc[ind[0], 'Hours'] = fff[ind[0]]
    Orderbook.loc[x.index, 'Operator'] = Avaliable.loc[ind[0], 'Operator']

Orderbook.groupby(Orderbook.index).apply(my_func)
print(Orderbook)
print(Avaliable)
I am not entirely sure of the best way to ask or phrase this question, so I will lay out my problem, dataset, thoughts on the method, and end goal, and hopefully it will be clear by the end.
My problem:
My company dispatches workers and will load up dispatches onto a single employee even if they are still on their current dispatch. This is due to a limitation in the software we use. If an employee receives two dispatches within 30 minutes, we call this a double dispatch.
We are analyzing our dispatching efficiency and I am running into a bit of a head scratcher. I need to run through our 100k-row database and add an additional column that will read as a dummy variable: 1 for double, 0 for normal. But since (a) we dispatch multiple people and (b) our records do not come ordered by dispatch, I need to determine how often a dispatch occurs to the same person within 30 minutes.
Dataset:
The dataset is incredibly massive due to poor organization in our data warehouse, but in terms of what I need, these are the columns required for my calculation.
Tech Name | Dispatch Time (PST)
John Smith | 1/1/2017 12:34
Jane Smith | 1/1/2017 12:46
John Smith | 1/1/2017 18:32
John Smith | 1/1/2017 18:50
My Thoughts:
How I would do it is clunky, and it would only work in one direction, not backwards. I would more or less write my code as:
import pandas as pd

df = pd.read_excel('data.xlsx')
df.sort_values('Dispatch Time (PST)', inplace=True)

tech_name = None
dispatch_time = pd.to_datetime('1/1/1900 00:00:00')

for index, row in df.iterrows():
    if tech_name is None:
        tech_name = row['Tech Name']
    else:
        # same tech dispatched again within 30 minutes of the previous dispatch
        if dispatch_time + pd.Timedelta('0 days 00:30:00') > row['Dispatch Time (PST)'] and row['Tech Name'] == tech_name:
            row['Double Dispatch'] = 1
            dispatch_time = row['Dispatch Time (PST)']
        else:
            dispatch_time = row['Dispatch Time (PST)']
            tech_name = row['Tech Name']
This has many problems, from being slow to only tracking dates going backwards and not forwards, so I would be missing many dispatches.
End Goal:
My goal is to have a dataset I can then plug back into Tableau for my report by adding on one column that reads as that dummy variable so I can filter and calculate on that.
I appreciate your time and help and let me know if any more details are necessary.
Thank you!
------------------ EDIT -------------
Added an edit to make the question clearer, as I failed to do so earlier.
Question: Is Pandas the best tool to use to iterate over my dataframe and check, for each dispatch datetime, whether there is a record that matches the tech's name AND is less than 30 minutes away from that record?
If so, how could I improve my algorithm or approach? If not, what would the best tool be?
Desired Output - an additional column that records whether a dispatch happened within a 30-minute window, as a dummy variable: 1 for True, 0 for False. I need to see when double dispatches are occurring and how many records are true double dispatches - not just a count that says there were 100 instances of double dispatch, but that those instances involved over 200 records. I need to be able to sort and see each record.
Hello, I think I found a solution. It is slow and only compares one index before or after, but cases that have 3 dispatches within thirty minutes represent less than 0.5% for us.
import pandas as pd
import numpy as np
import datetime as dt

dispatch = 'Tech Dispatched Date-Time (PST)'
tech = 'CombinedTech'

df = pd.read_excel('combined_data.xlsx')
df.sort_values(dispatch, inplace=True)
df.reset_index(inplace=True)
df['Double Dispatch'] = np.nan

writer = pd.ExcelWriter('final_output.xlsx', engine='xlsxwriter')

dispatch_count = 0
time = dt.timedelta(minutes=30)

for index, row in df.iterrows():
    # previous record for comparison (dummy values at the start of the frame)
    try:
        tech_one = df[tech].loc[(index - 1)]
        dispatch_one = df[dispatch].loc[(index - 1)]
    except KeyError:
        tech_one = None
        dispatch_one = pd.to_datetime('1/1/1990 00:00:00')
    # next record for comparison (dummy values at the end of the frame)
    try:
        tech_two = df[tech].loc[(index + 1)]
        dispatch_two = df[dispatch].loc[(index + 1)]
    except KeyError:
        tech_two = None
        dispatch_two = pd.to_datetime('1/1/2020 00:00:00')

    first_time = dispatch_one + time
    second_time = pd.to_datetime(row[dispatch]) + time
    dispatch_pd = pd.to_datetime(row[dispatch])

    if tech_one == row[tech] or tech_two == row[tech]:
        if first_time > row[dispatch] or second_time > dispatch_two:
            df.at[index, 'Double Dispatch'] = 1  # df.set_value was removed in modern pandas; .at does the same
            dispatch_count += 1
        else:
            df.at[index, 'Double Dispatch'] = 0
            dispatch_count += 1

print(dispatch_count)  # This was to monitor the total number of records being pushed through

df.to_excel(writer, sheet_name='Sheet1')
writer.close()  # close() also saves the file in current pandas
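A vectorized alternative (a sketch, not the poster's code; it assumes columns named 'Tech Name' and 'Dispatch Time (PST)' as in the question): sort per tech, take the time difference to the previous and next dispatch by the same tech, and flag rows where either gap is within 30 minutes.

import pandas as pd

df = pd.read_excel('data.xlsx')  # assumed input, as in the question
df['Dispatch Time (PST)'] = pd.to_datetime(df['Dispatch Time (PST)'])
df = df.sort_values(['Tech Name', 'Dispatch Time (PST)'])

window = pd.Timedelta(minutes=30)
gap_prev = df.groupby('Tech Name')['Dispatch Time (PST)'].diff()           # time since previous dispatch of same tech
gap_next = df.groupby('Tech Name')['Dispatch Time (PST)'].diff(-1).abs()   # time until next dispatch of same tech

# 1 if either neighbouring dispatch of the same tech falls within 30 minutes, else 0
df['Double Dispatch'] = ((gap_prev <= window) | (gap_next <= window)).astype(int)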
I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date). I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And have began to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"
This works, it is just taking a really, really long time (over 15 minutes). I need more efficient code - is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
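If you then want the labels written into the column (a small sketch building on the boolean masks above; np.select is one straightforward way):

import numpy as np

# map the boolean masks to their labels in one vectorized step
dates['Correct'] = np.select([is_correct, is_same], ['correct', 'same'], default='incorrect')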
Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.
Something like the following may be quicker:
import pandas as pd
import datetime

df = pd.DataFrame({
    'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
    'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})

def comparison_check(df):
    start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
    end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
    if start < end:
        return "correct"
    elif start == end:
        return "same"
    return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447/3) µs * 14,000 = 149 µs * 14,000 ≈ 2.086 s, quite a bit shorter than 15 minutes :)