Imagine 2 Dataframes, where MasterDB (df1) is a Central Database and LatestResults (df2) is an incoming CSV I retrieve weekly.
df1 = pd.read_csv("Master.csv")
#with columns:
Index(['Timestamp', 'ID', 'Status', 'Username', 'URL', 'Last Seen'],
dtype='object')
Shape(1000,6)
df2 = pd.read_csv("LatestResults.csv")
#with same columns:
Index(['Timestamp', 'ID', 'Status', 'Username', 'URL', 'Last Seen'],
dtype='object')
Shape(300,6)
Based on certain conditions, I would like to iterate through my (new) items in df2 and treat them according to the following 3 conditions (assuming there are only 3 for now):
a) if new item does not pre-exist (for non-matching "URL")
append to MasterDB (df1)
set 'Status' = "Net New" (in df1)
set 'Last Seen' = today.date (in df1)
b) if new item pre-exists, edit its status to "Still Active" (for matching "URL", regardless of "ID")
set 'Status' to "Still Active"
set 'Last Seen' = today.date
c) if item in MasterDB (df1) is not found in LatestResults (only for matching "ID" AND "URL")
set 'Status' to 'Expired'
Here's the code I have so far:
import pandas as pd
from datetime import datetime

def function(input1, input2):
    df1 = pd.read_csv(input1)
    df2 = pd.read_csv(input2)
    df1['Status'] = df1['Status'].astype(str)
    df1['Last Seen'] = df1['Last Seen'].astype(str)

    for i, row in df1.iterrows():
        tmp = df2.loc[df2['URL'] == row['URL']]
        # A) STILL ACTIVE
        if not tmp.empty:
            df1.at[i, 'Last Seen'] = str(datetime.now().replace(second=0, microsecond=0))
            df1.at[i, 'Status'] = "STILL ACTIVE"

    for i, row in df1.iterrows():
        tmp = df2.loc[(df2['URL'] == row['URL']) & (df2['ID'] == row['ID'])]
        # B) NOT FOUND // EXPIRED
        if tmp.empty:
            df1.at[i, 'Status'] = "EXPIRED"

    for i, row in df2.iterrows():
        tmp = df1.loc[(df1['URL'] == row['URL']) & (df1['ID'] == row['ID'])]
        # C) NET NEW
        if tmp.empty:
            df1.at[i, 'Last Seen'] = str(datetime.now().replace(second=0, microsecond=0))
            row["Status"] = "NET NEW"
            df1 = df1.append(row, ignore_index=True)

    df1['Last Seen'].fillna("Not Set", inplace=True)
    df1.to_csv("MasterDB.csv", index=False)
This solution works for the most part. New items are appended and (some) pre-existing items are labeled correctly. The problems I'm having are:
When matching still active records, if the item is untouched (most cases), its value is overwritten to NaN. How can I fix this?
For my third case, "Expired", I want to be strict with the AND conditions and only set a row to Expired if both conditions are met. It does not seem to work. Is this the right way to think about it?
I've explored using pandas merge, join and map, but persevered with the approach above. After a few days on this, though, I realize it's bulky. I'd love to hear ideas about how to design this differently. Many thanks in advance.
I'm going to share with you how I would implement a solution for your problem. Perhaps a different point of view can help you see what's wrong in your code.
First, the data that I used. This is my LatestResults.csv
ID Status Username URL Last Seen
0 1 new Harry Potter google 2020-10-20 12:00:00
1 20 new Hermione Granger yahoo 2020-10-20 12:00:00
2 3 new Ron Weasley twitter 2020-10-20 12:00:00
3 4 new Hagrid instagram 2020-10-20 12:00:00
And this is my MasterDB.csv
ID Status Username URL Last Seen
0 1 blabla Harry Potter google 2020-10-20 12:00:00
1 2 blabla Hermione Granger yahoo 2020-10-20 12:00:00
2 3 blabla Ron Weasley bing 2020-10-20 12:00:00
3 4 blabla McGonagall facebook 2020-10-20 12:00:00
4 5 blabla Dumbledore google 2020-10-20 12:00:00
5 6 blabla You-Know-Who badoo 2020-10-20 12:00:00
I tried to include all possible combinations:
Harry Potter has the same ID and URL in both DataFrames
Hermione Granger has the same URL, but different ID
Ron Weasley has the same ID, but different URL
McGonagall has a unique URL, but the same ID as Hagrid
Dumbledore has a unique ID, but the same URL as Harry Potter
You-Know-Who has a unique ID and a unique URL
Hagrid has a unique URL, but isn't in your database yet
I know the username is not used anywhere in the code, but these names help us understand what's going on.
Now the code.
import pandas as pd
from datetime import datetime

def new_function(db_file, input_file):
    today = datetime.now().replace(second=0, microsecond=0)
    df_input = pd.read_csv(input_file)
    df_database = pd.read_csv(db_file)
    df_database['Last Seen'] = df_database['Last Seen'].astype(str)

    # Filter for input data
    net_new_filter = ~df_input['URL'].isin(df_database['URL'])

    # Filters for DB data
    still_active = df_database['URL'].isin(df_input['URL'])
    expired_filter = (pd.merge(df_database, df_input, on=['ID', 'URL'],
                               how='left', indicator=True)['_merge'] != 'both')
    expired_filter = expired_filter & (~still_active)

    # Applying filters
    df_database.loc[still_active, 'Status'] = 'Still Active'
    df_database.loc[still_active, 'Last Seen'] = str(today)
    df_database.loc[expired_filter, 'Status'] = 'Expired'

    new_data = df_input[net_new_filter].copy()
    new_data['Status'] = 'Net New'
    new_data['Last Seen'] = str(today)

    final_database = pd.concat([df_database, new_data],
                               ignore_index=True)
    final_database.to_csv("Test_Data/Output.csv", index=False)
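For completeness, calling it could look like this (the file names below are placeholders of mine, not from the original post):

new_function("Test_Data/MasterDB.csv", "Test_Data/LatestResults.csv")
result = pd.read_csv("Test_Data/Output.csv")
print(result)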
I enjoy writing DataFrame filters separately, since it makes it clearer which conditions you are applying.
You can use the .isin() method to check whether the elements of one column appear anywhere in another column. It directly captures the "Still Active" and "Net New" conditions.
Then, if you want to check whether two columns of the same row match two other columns simultaneously, you can use the .merge() method with indicator=True, as described here. However, we must not overwrite the "Still Active" condition, hence the extra expression expired_filter = expired_filter & (~still_active).
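If the indicator trick is new to you, here is a minimal, self-contained illustration of how the _merge column turns into a boolean mask (the toy frames are mine, not from the original data):

import pandas as pd

left = pd.DataFrame({'ID': [1, 2], 'URL': ['google', 'bing']})
right = pd.DataFrame({'ID': [1], 'URL': ['google']})

merged = pd.merge(left, right, on=['ID', 'URL'], how='left', indicator=True)
print(merged['_merge'].tolist())   # ['both', 'left_only']

# Rows whose (ID, URL) pair has no exact match in `right`:
no_match = merged['_merge'] != 'both'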
Once you run the code, the following DataFrame will be saved:
ID Status Username URL Last Seen
0 1 Still Active Harry Potter google 2020-10-23 20:18:00
1 2 Still Active Hermione Granger yahoo 2020-10-23 20:18:00
2 3 Expired Ron Weasley bing 2020-10-20 12:00:00
3 4 Expired McGonagall facebook 2020-10-20 12:00:00
4 5 Still Active Dumbledore google 2020-10-23 20:18:00
5 6 Expired You-Know-Who badoo 2020-10-20 12:00:00
6 3 Net New Ron Weasley twitter 2020-10-23 20:18:00
7 4 Net New Hagrid instagram 2020-10-23 20:18:00
The rows with a new URL are at the end, while those from the original database that didn't have an exact match on ID and URL were set to "Expired". Finally, the rows with a corresponding URL were kept, and their Status was set to "Still Active".
I hope this example helps you out. At least it's a little bit faster because it doesn't rely on for loops.
If the data wasn't helpful, please provide a sample of your own data, so we can test using the same dataset.
Related
I have an issue where I need to track the progression of patients' insurance claim statuses based on the dates of those statuses. I also need to create a count of statuses based on certain conditions.
DF:
| ClaimID | New | Accepted | Denied | Pending | Expired | Group |
|---------|-----|----------|--------|---------|---------|-------|
| 001 | 2021-01-01T09:58:35:335Z | 2021-01-01T10:05:43:000Z | | | | A |
| 002 | 2021-01-01T06:30:30:000Z | 2021-03-01T04:11:45:000Z | | 2021-03-01T04:11:53:000Z | | A |
| 003 | 2021-02-14T14:23:54:154Z | 2021-02-15T11:11:56:000Z | | | 2021-02-15T11:15:00:000Z | A |
| 004 | 2021-02-14T15:36:05:335Z | 2021-02-14T17:15:30:000Z | | | | A |
| 005 | 2021-02-14T15:56:59:009Z | | | 2021-03-01T10:05:43:000Z | | A |
In the above dataset, we have 6 columns. ClaimID is simple and just indicates the ID of the claim. New, Accepted, Denied, Pending, and Expired indicate the status of the claim and the day/time those statuses were set.
What I need to do is get a count of how many claims are New on each day and how many move out of New into a new status. For example, there are 2 new claims on 2021-01-01. On that same day, 1 moved to Accepted about 7 minutes later. Thus on 2021-01-01 the table of counts would read:
DF_Count:
| Date | New | Accepted | Denied | Pending | Expired |
|------|-----|----------|--------|---------|---------|
| 2021-01-01 | 2 | 1 | 0 | 0 | 0 |
| 2021-01-02 | 1 | 0 | 0 | 0 | 0 |
| 2021-01-03 | 1 | 0 | 0 | 0 | 0 |
| 2021-01-04 | 1 | 0 | 0 | 0 | 0 |
| 2021-01-05 | 1 | 0 | 0 | 0 | 0 |
| .... | .... | .... | .... | .... | .... |
| 2021-02-14 | 4 | 2 | 0 | 0 | 0 |
| 2021-02-15 | 2 | 3 | 0 | 0 | 1 |
| 2021-02-16 | 2 | 2 | 0 | 0 | 0 |
Few Conditions:
If a claim moves from one status to the other on the same day (even if they are a minutes/hours apart) it would not be subtracted from the original status until the next day. This can be seen on 2021-01-01 where claim 001 moves from new to accepted on the same day but the claim is not subtracted from new until 2021-01-02.
Until something happens to a claim, it should remain in its original status. Claim 002 will remain in new until 2021-03-01 when it is approved.
If a claim changes status on a later date than its original status, it will be subtracted on that later date. For this, see claim 003. It is new on 2/14 but accepted on 2/15. This is why New goes down by 2 on 2/15 (the other claim is 004, which is new and accepted on the same day).
For certain statuses, I do not need to look at all columns. For example, for New I only look at the dates inside Accepted and Denied, not Pending and Expired. When I do these same steps for Accepted, I no longer need to look at New, just the other columns. How would I do that?
In the final DF_Count table, the dates should start from the earliest date in 'New' and end on today's date.
The code needs to be grouped by the Group Column as well. For example, patients in group B (not pictured) will have to have the same start and end date but for their own claims.
I need to do this separately for all of the statuses. Not just new.
Current Solution:
My current solution has been to create a dataset with just the dates from the minimum New date to today's date. Then, for each column, I use the .loc method to find dates that are greater than New in each of the other columns. For example, in the code below I look for all cases where New is equal to Approved.
df1 = df.loc[(df['New'] == df['Approved']) &
((df['Expired'].isnull()) | (df['Expired'] >= df['Accepted'])) &
((df['Pending'].isnull()) | (df['Pending'] >= df['Accepted'])) &
((df['Denied'].isnull()) | (df['Denied'] >= df['Accepted']))]
newtoappsday = df1.loc[:, ('Group', 'Accepted')]
newtoappsday['Date'] = newtoappsday['Accepted']
newtoappsday = newtoappsday.reset_index(drop = True)
newtoappsday= newtoappsday.groupby(['Date', 'Group'], as_index = False)['Approved'].value_counts()
newtoappsday.drop(columns = {'Accepted'}, inplace = True)
newtoappsday.rename(columns = {'count': 'NewAppSDay'}, inplace = True)
newtoappsday['Date'] = newtoappsday['Date'] + timedelta(1)
df_count= df_count.merge(newtoappsday, how = 'left', on = ['Date', 'Group']).fillna(0)
After doing the above steps for all conditions (where New goes to Accepted on a later date, etc.), I will do the final calculation for New:
df_count['New'] = df_count.eval('New = New - (NewAccSDay + NewAccLater + NewDenSDay + NewDenLater + NewExpLater + NewPendSDay + NewPendLater)').groupby(['Tier2_ID', 'ClaimType'])['New'].cumsum()
Any and all help would be greatly appreciated. My method above is extremely inefficient and is leading to some errors. Do I need to write a for loop for this? What is the best way to go about this?
First convert the date columns with something like
for i in ['New', 'Accepted', 'Denied', 'Pending', 'Expired']:
    df[i] = pd.to_datetime(df[i], format="%Y-%m-%dT%H:%M:%S:%f%z")
Then develop the applicable date range based on your column conditions. In this logic, if Denied is present the range is New --> Denied; otherwise, if Accepted is present, it is New --> Accepted; and if there is no acceptance, it is New --> today. Code like the following (alter as per your rules):
df['new_range'] = df[['New', 'Accepted', 'Denied']].apply(
    lambda x: pd.date_range(x['New'], x['Denied']).date.tolist() if pd.notnull(x['Denied'])
    else pd.date_range(x['New'], x['Accepted']).date.tolist() if pd.notnull(x['Accepted'])
    else pd.date_range(x['New'], datetime.today()).date.tolist(),
    axis=1)
You should be able to filter on a group and see the date ranges in your df like:
df[df['Group']=='A']['new_range']
0 [2021-01-01]
1 [2021-01-01, 2021-01-02, 2021-01-03, 2021-01-0...
2 [2021-02-14]
3 [2021-02-14]
4 [2021-02-14, 2021-02-15, 2021-02-16, 2021-02-1..
Then you can explode the date ranges and group by date to get the New counts for each day, with code like:
new = pd.to_datetime(df[df['Group']=='A']['new_range'].explode()).reset_index()
newc = new.groupby('new_range').count()
newc
new_range
2021-01-01 2
2021-01-02 1
2021-01-03 1
2021-01-04 1
2021-01-05 1
2021-01-06 1...
Similarly, get the counts for Accepted and Denied, then left join them on date to arrive at the final table, and fill NA with 0.
By creating rules to expand your date range, then exploding over the date range and grouping to get your counts, you should be able to avoid most of the expensive row-by-row work.
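A rough sketch of that final assembly, assuming accc and denc are built the same way as newc above for the Accepted and Denied ranges (those variable names are mine):

# Left join the per-day counts on the date index, fill missing days with 0.
final = (newc.rename(columns={'index': 'New'})
             .join(accc.rename(columns={'index': 'Accepted'}), how='left')
             .join(denc.rename(columns={'index': 'Denied'}), how='left')
             .fillna(0)
             .astype(int)
             .reset_index()
             .rename(columns={'new_range': 'Date'}))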
I think this is what you want, or it can be easily modified to your needs:
import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import date

def dateRange(d1, d2):
    return [d1 + timedelta(days=x) for x in range((d2 - d1).days)]

def addCount(dic, group, dat, cat):
    if group not in dic:
        dic[group] = {}
    if dat not in dic[group]:
        dic[group][dat] = {}
    if cat not in dic[group][dat]:
        dic[group][dat][cat] = 0
    dic[group][dat][cat] += 1

df = pd.read_csv("testdf.csv",
                 parse_dates=["New", "Accepted", "Denied", "Pending", "Expired"])

cdic = {}
for i, row in df.iterrows():
    cid = row["ClaimID"]
    dnew = row["New"].date()
    dacc = row["Accepted"].date()
    dden = row["Denied"].date()
    dpen = row["Pending"].date()
    dexp = row["Expired"].date()
    group = row["Group"]

    if not pd.isna(dacc):  # Claim has been accepted
        if dnew == dacc:
            dacc += timedelta(days=1)
        nend = dacc
        addCount(cdic, group, dacc, "acc")
    if not pd.isna(dden):  # Claim has been denied
        if dnew == dden:
            dden += timedelta(days=1)
        if pd.isna(dacc):
            nend = dden
        addCount(cdic, group, dden, "den")
    if not pd.isna(dpen):
        addCount(cdic, group, dpen, "pen")  # Claim is pending
    if not pd.isna(dexp):
        addCount(cdic, group, dexp, "exp")  # Claim is expired
    if pd.isna(dacc) and pd.isna(dden):
        nend = date.today() + timedelta(days=1)
    for d in dateRange(dnew, nend):  # Fill new status until first change
        addCount(cdic, group, d, "new")

ndfl = []
for group in cdic:
    for dat in sorted(cdic[group].keys()):
        r = cdic[group][dat]
        ndfl.append([group, dat, r.get("new", 0), r.get("acc", 0),
                     r.get("den", 0), r.get("pen", 0), r.get("exp", 0)])
ndf = pd.DataFrame(ndfl, columns=["Group", "Date", "New", "Accepted", "Denied", "Pending", "Expired"])
I have a pandas dataframe that looks something like below.
I want to check the values in User ID to see whether each one is unique. If so, I then want to check the License Type column to see if it is 'full' and return a 1 in a new column 'Full_direct'. Otherwise, I would return a 0 in the 'Full_direct' column.
Date **User ID** Product Name License Type Month
0 2017-01-01 10431046623214402832 90295d194237 trial 2017-01
1 2017-07-09 246853380240772174 29125b243095 trial 2017-07
2 2017-07-07 13685844038024265672 47423e1485 trial 2017-07
3 2017-02-12 2475366081966194134 202400c85587 full 2017-02
4 2017-04-08 761179767639020420 168300g168004 full 2017-04
I made this attempt but wasn't able to iterate through the dataframe in this manner. I was hoping someone could advise. Thanks!
for values in main_df['User ID']:
    if values.is_unique and main_df['License Type'] == 'full':
        main_df['Full_Direct'] = 1
    else:
        main_df['Full_direct'] = 0
We do not need a for loop here; let us try duplicated:
df['Full_direct'] = ((~df['User ID'].duplicated(keep=False)) & (df['License Type'] == 'full')).astype(int)
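To see what the two halves of that expression do on the sample frame (my breakdown, not part of the original answer):

# keep=False flags every occurrence of a duplicated User ID, so the negation
# is True only for IDs that appear exactly once.
unique_id = ~df['User ID'].duplicated(keep=False)
is_full = df['License Type'] == 'full'

# In the sample above all five User IDs are distinct, so only rows 3 and 4
# (License Type 'full') end up with Full_direct = 1.
df['Full_direct'] = (unique_id & is_full).astype(int)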
Fix your code
for values in df.index:
    if df['User ID'].isin([df.loc[values, 'User ID']]).sum() == 1 and df.loc[values, 'License Type'] == 'full':
        df.loc[values, 'Full_direct'] = 1
    else:
        df.loc[values, 'Full_direct'] = 0
I have two dataframes that I want to merge. One contains data on "assessments" done on particular dates for particular clients. The second contains data on different categories of "services" performed for clients on particular dates. See sample code below:
assessments = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access']})
ClientID AssessmentDate Program
212 2018-01-04 Case Mgmt
212 2018-07-03 Case Mgmt
212 2019-06-10 Case Mgmt
292 2017-08-08 Coordinated Access
292 2017-12-21 Coordinated Access
services = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'ServiceDate' : ['2018-01-02', '2018-04-08', '2018-05-23', '2017-09-08', '2017-12-03'],
'Assistance Navigation' : ['0','1','1','0','1'],
'Basic Needs' : ['1','0','0','1','2']})
ClientID ServiceDate Assistance Navigation Basic Needs
212 2018-01-02 0 1
212 2018-04-08 1 0
212 2018-05-23 1 0
292 2017-09-08 0 1
292 2017-12-03 1 2
I want to know how many services of each service type (Assistance Navigation and Basic Needs) occur between consecutive assessments of the same program. In other words, I want to append two columns to the assessments dataframe named 'Assistance Navigation' and 'Basic Needs' that tell me how many Assistance Navigation services and how many Basic Needs services have occurred since the last assessment of the same program. The resulting dataframe would look like this:
assessmentsFinal = pd.DataFrame({'ClientID' : ['212','212','212','292','292'],
'AssessmentDate' : ['2018-01-04', '2018-07-03', '2019-06-10', '2017-08-08', '2017-12-21'],
'Program' : ['Case Mgmt','Case Mgmt','Case Mgmt','Coordinated Access','Coordinated Access'],
'Assistance Navigation' : ['0','2','0','0','1'],
'Basic Needs' : ['0','0','0','0','3']})
ClientID AssessmentDate Program Assistance Navigation Basic Needs
212 2018-01-04 Case Mgmt 0 0
212 2018-07-03 Case Mgmt 2 0
212 2019-06-10 Case Mgmt 0 0
292 2017-08-08 Coordinated Access 0 0
292 2017-12-21 Coordinated Access 1 3
Of course, the real data has many more service categories than just 'Assistance Navigation' and 'Basic Needs' and the number of services and assessments is huge. My current attempt uses loops (which I know is a Pandas sin) and takes a couple of minutes to run, which may pose problems when our dataset gets even larger. Below is the current code for reference. Basically we loop through the assessments dataframe to get the ClientID and the date range and then we go into the services sheet and tally up the service type occurrences. There's got to be a quick and easy way to do this in Pandas but I'm new to the game. Thanks in advance.
servicesDict = {}
prevClient = -1
prevDate = ""
prevProg = ""
categories = ["ClientID", "ServiceDate", "Family Needs", "Housing Navigation", "Housing Assistance",
              "Basic Needs", "Professional", "Education", "Financial Aid", "Healthcare", "Counseling",
              "Contact", "Assistance Navigation", "Referral", "Misc"]

for index, row in assessmentDF.iterrows():
    curClient = row[0]
    curDate = datetime.strptime(row[1], '%m/%d/%y')
    curProg = row[7]
    curKey = (curClient, curDate)
    if curKey not in servicesDict:
        services = [curClient, curDate, 0,0,0,0,0,0,0,0,0,0,0]
        servicesDict.update({curKey: services})
    services = servicesDict[curKey]
    # if curDate and prevDate equal each other action required
    if curClient == prevClient and curProg == prevProg:
        boundary = serviceDF[serviceDF['ClientID'] == curClient].index
        for x in boundary:
            curRowSer = serviceDF.iloc[x]
            curDateSer = datetime.strptime(curRowSer[1], '%m/%d/%y')
            if curDateSer >= prevDate and curDateSer < curDate:
                serviceCategory = curRowSer[5]
                i = categories.index(serviceCategory)
                services[i] = services[i] + 1
                servicesDict.update({curKey: services})
    prevClient = curClient
    prevDate = curDate
    prevProg = curProg

servicesCleaned = pd.DataFrame.from_dict(servicesDict, orient='index', columns=categories)
# then merge into assessments on ClientID and AssessmentDate
One way would be like this. You'll probably have to tweak it for your original dataset, and check the edge cases.
assessments['PreviousAssessmentDate'] = assessments.groupby(['ClientID', 'Program']).AssessmentDate.shift(1, fill_value='0000-00-00')
df = assessments.merge(services, on='ClientID', how='left')
df[df.columns[5:]] = df[df.columns[5:]].multiply((df.AssessmentDate > df.ServiceDate) & (df.PreviousAssessmentDate < df.ServiceDate), axis=0)
df = df.groupby(['ClientID', 'AssessmentDate', 'Program']).sum().reset_index()
ClientID AssessmentDate Program Assistance Navigation Basic Needs
0 212 2018-01-04 Case Mgmt 0 1
1 212 2018-07-03 Case Mgmt 2 0
2 212 2019-06-10 Case Mgmt 0 0
3 292 2017-08-08 Coordinated Access 0 0
4 292 2017-12-21 Coordinated Access 1 3
Logic
We shift the AssessmentDate by 1 in order to determine the
previous assessment date
We merge the two dataframes on ClientID
We set all service type columns to 0 in case the ServiceDate doesn't fall between PreviousAssessmentDate and the AssessmentDate.
We groupby ClientID, Program and AssessmentDate and do a sum()
Assumptions
Service type categories are integers
Your data frame is sorted on AssessmentDate (for the shift)
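One practical note (my addition, not part of the original answer): in the sample frames above the service columns hold strings such as '0' and '1', so a conversion along these lines may be needed for the multiply and sum steps to behave as intended:

# Cast every service-type column (everything except the key columns) to int.
service_cols = services.columns.difference(['ClientID', 'ServiceDate'])
services[service_cols] = services[service_cols].astype(int)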
I have two datasets:
One contains house energy certificates issued over the last 10 years, with an ID for the house and the date the certificate was issued. One house can have several certificates, as they can be renewed.
The other contains all house transactions over the last 10 years and the ID (which is the same ID as in the first dataset).
My problem is then to find the energy certificate of the house that was valid on the date it was sold. I am able to merge the datasets on the house ID, but I am not quite sure how to deal with the date column.
The Energy Certificates dataset has the column "DateIssued" and the Transactions dataset has the column "OfficialDateSold". The condition would then be to find the energy certificate with the right house ID and with the date closest to the sold date, but not after it.
Snippet of the dataframes:
Transactions:
address_id sold_date
0 1223632151 NaN
1 160073875 2013-09-24
2 160073875 2010-06-16
3 160073875 2009-08-05
4 160073875 2006-12-18
... ... ...
2792726 2147477357 2011-11-03
2792727 2147477357 2014-02-26
2792728 2147477579 2017-05-24
2792729 2147479054 2013-02-04
2792730 2147482539 1993-08-10
Energy Certificate
id certificate_number date_issued
0 1785963944 A2012-274656 27.11.2012 10:32:35
1 512265039 A2010-6435 30.06.2010 13:19:18
2 2003824679 A2014-459214 17.06.2014 11:00:47
3 1902877247 A2011-133593 14.10.2011 12:57:08
4 1620713314 A2009-266 25.12.2009 13:18:32
... ... ... ...
307846 753123775 A2019-1078357 30.11.2019 17:23:59
307847 1927124560 A2019-1078363 30.11.2019 20:44:22
307848 1122610963 A2019-1078371 30.11.2019 22:44:45
307849 28668673 A2019-1078373 30.11.2019 22:56:23
307850 1100393780 A2019-1078377 30.11.2019 23:38:42
I want the output:
id certificate_number date_issued sold_date
with id = address_id and date_issued <= sold_date, i.e. the certificate closest to the sold_date (the newest one issued before the sale).
(I know the dates must be in the same format.)
I am using Python with Jupyter Notebook.
I think you need merge_asof, but first it is necessary to convert the columns to datetimes with to_datetime and to remove rows with missing values in sold_date with DataFrame.dropna:
df1['sold_date'] = pd.to_datetime(df1['sold_date'])
df2['date_issued'] = pd.to_datetime(df2['date_issued'], dayfirst=True)
df1 = df1.dropna(subset=['sold_date'])
df = pd.merge_asof(df1.sort_values('sold_date'),
                   df2.sort_values('date_issued'),
                   left_on='sold_date',
                   right_on='date_issued',
                   left_by='address_id',
                   right_by='id')
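As a side note (my addition): direction='backward', the merge_asof default, is what implements "the newest certificate issued on or before the sold date". If you only want the columns from the desired output, a final selection could be:

# Keep just the columns the question asks for.
out = df[['address_id', 'certificate_number', 'date_issued', 'sold_date']]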
I have a dataframe (DeptTemplate) the .head() of which looks like:
Name Status Status change date Product
0 Bob CURRENT NaN Pencils
1 Steve CURRENT NaN Pens
2 Heather NEW JOINER 02/08/2018 Paper
3 Lizzy NEW JOINER 06/02/2018 Pens
4 Ralph LEFT NaN Paper
I am trying to identify and return all the information for records that have a non-'CURRENT' Status and also no Status change date.
The code below explains my methodology:
def checkStatusChangeDate(DeptTemplate, filename, filepath, referencePeriodStartDate, referencePeriodEndDate, writer):
    # This code checks if a status is not current that there is a status change date attached
    test = DeptTemplate[DeptTemplate.Status != "CURRENT"]
    pd.to_datetime(test['Status change date'])
    test['Status change date'].dt.strftime('%d/%m/%Y')
    statusError = test['Status change date'] == 'NaT'
    finalError = DeptTemplate.loc[statusError['Status change date']]
I first of all identify any records that are not 'CURRENT'. I then identify from this subset any records that do not have a status change date. I end up with statusError data frame that looks like:
4 False
where the only record which does not have a CURRENT status and no Status change date is for Ralph.
The bit that I get stuck on is then trying to return Ralphs entire record by then referencing the statusError dataframe against the original DeptTemplate.
I am trying to use:
either:
finalError = DeptTemplate.loc[statusError['Status change date']]
or
finalError = DeptTemplate[statusError['Status change date']]
but can't get the whole record to return in the finalError dataframe
(so that I end up with a finalError dataframe that looks like:
Name Status Status change date Product
4 Ralph LEFT NaN Paper
You are almost there, but you are trying to slice your original DataFrame using a mask built from a different DataFrame - that won't work because their indexes don't line up.
Step 1: Set boolean masks
not_current = df['Status'] != 'CURRENT'
no_date_change = df['Status change date'].isnull()
Step 2: Use masks
df[not_current & no_date_change]
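On the sample frame this combination should return exactly Ralph's record:

finalError = df[not_current & no_date_change]
#     Name Status Status change date Product
# 4  Ralph   LEFT                NaN   Paper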
If I understood you correctly: you want to find a record which has no change date and whose status is not CURRENT, and then return all other records for that person. So if there are other entries for Ralph, for example, you want to get them as well.
My solution for this would be:
import pandas as pd
data = {"Name":["Bob","Steve","Heather","Lizzy","Ralph","Ralph","Ralph"],
"Status":["CURRENT","CURRENT","NEW JOINER","NEW JOINER","LEFT","CURRENT","CURRENT"],
"Status change date": ["","","02/08/2018","06/02/2018","","06/02/2018","06/02/2018"],
"Product":["Pencils","Pens","Paper","Pens","Paper","Pencils","Pens"]}
df = pd.DataFrame(data)
df["Status change date"]=pd.to_datetime(df["Status change date"])
df.head()
Name Status Status change date Product
0 Bob CURRENT NaT Pencils
1 Steve CURRENT NaT Pens
2 Heather NEW JOINER 2018-02-08 Paper
3 Lizzy NEW JOINER 2018-06-02 Pens
4 Ralph LEFT NaT Paper
5 Ralph CURRENT 2018-06-02 Pencils
6 Ralph CURRENT 2018-06-02 Pens
Get all entries that have no CURRENT status and no status change date:
finalError = df[(df["Status"]!="CURRENT") & (df["Status change date"].isnull())]
finalError.head()
Name Status Status change date Product
4 Ralph LEFT NaT Paper
Now check for those names in the original dataframe to get all the records for Ralph.
df[df["Name"]==finalError["Name"].any()]
Name Status Status change date Product
4 Ralph LEFT NaT Paper
5 Ralph CURRENT 2018-06-02 Pencils
6 Ralph CURRENT 2018-06-02 Pens
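A more explicit variant of that last lookup (my suggestion, not from the original answer) avoids relying on what .any() returns and also copes with more than one matching name:

# Select every row whose Name appears anywhere in finalError["Name"].
df[df["Name"].isin(finalError["Name"])]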