I have data that looks like this:
id Date Time assigned_pat_loc prior_pat_loc Activity
0 45546325 2/7/2011 4:29:38 EIAB^EIAB^6 NaN Admission
1 45546325 2/7/2011 5:18:22 8W^W844^A EIAB^EIAB^6 Observation
2 45546325 2/7/2011 5:18:22 8W^W844^A EIAB^EIAB^6 Transfer to 8W
3 45546325 2/7/2011 6:01:44 8W^W858^A 8W^W844^A Bed Movement
4 45546325 2/7/2011 7:20:44 8W^W844^A 8W^W858^A Bed Movement
5 45546325 2/9/2011 18:36:03 8W^W844^A NaN Discharge-Observation
6 45666555 3/8/2011 20:22:36 EIC^EIC^5 NaN Admission
7 45666555 3/9/2011 1:08:04 53^5314^A EIC^EIC^5 Admission
8 45666555 3/9/2011 1:08:04 53^5314^A EIC^EIC^5 Transfer to 53
9 45666555 3/9/2011 17:03:38 53^5336^A 53^5314^A Bed Movement
I need to find where there were multiple patients (identified with id column) are in the same room at the same time, the start and end times for those, the dates, and room number (assigned_pat_loc). assigned_pat_loc is the current patient location in the hospital, formatted as “unit^room^bed”.
So far I've done the following:
# Read in CSV file and remove bed number from patient location
data = pd.read_csv('raw_data.csv')
data['assigned_pat_loc'] = data['assigned_pat_loc'].str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)
# Convert Date column to datetime type
patient_data['Date'] = pd.to_datetime(patient_data['Date'])
# Sort dataframe by date
patient_data.sort_values(by=['Date'], inplace = True)
# Identify rows with duplicate room and date assignments, indicating multiple patients shared room
same_room = patient_data.duplicated(subset = ['Date','assigned_pat_loc'])
# Assign duplicates to new dataframe
df_same_rooms = patient_data[same_room]
# Remove duplicate patient ids but keep latest one
no_dups = df_same_rooms.drop_duplicates(subset = ['id'], keep = 'last')
# Group patients in the same rooms at the same times together
df_shuf = pd.concat(group[1] for group in df_same_rooms.groupby(['Date', 'assigned_pat_loc'], sort=False))
And then I'm stuck at this point:
id Date Time assigned_pat_loc prior_pat_loc Activity
599359 42963403 2009-01-01 12:32:25 11M^11MX 4LD^W463^A Transfer
296155 42963484 2009-01-01 16:41:55 11M^11MX EIC^EIC^2 Transfer
1373 42951976 2009-01-01 15:51:09 11M^11MX NaN Discharge
362126 42963293 2009-01-01 4:56:57 11M^11MX EIAB^EIAB^6 Transfer
362125 42963293 2009-01-01 4:56:57 11M^11MX EIAB^EIAB^6 Admission
... ... ... ... ... ... ...
268266 46381369 2011-09-09 18:57:31 54^54X 11M^1138^A Transfer
16209 46390230 2011-09-09 6:19:06 10M^1028 EIAB^EIAB^5 Admission
659699 46391825 2011-09-09 14:28:20 9W^W918 EIAB^EIAB^3 Transfer
659698 46391825 2011-09-09 14:28:20 9W^W918 EIAB^EIAB^3 Admission
268179 46391644 2011-09-09 17:48:53 64^6412 EIE^EIE^3 Admission
Where you can see different patients in the same room at the same time, but I don't know how to extract those intervals of overlap between two different rows for the same room and same times. And then to format it such that the start time and end time are related to the earlier and later times of the transpiring of a shared room between two patients. Below is the desired output.
Where r_id is the id of the other patient sharing the same room and length is the number of hours that room was shared.
As suggested, you can use groupby. One more thing you need to take care of is finding the overlapping time. Ideally you'd use datetime which are easy to work with. However you used a different format so we need to convert it first to make the solution easier. Since you did not provide a workable example, I will just write the gist here:
# convert current format to datetime
df['start_datetime'] = pd.to_datetime(df.start_date) + df.start_time.astype('timedelta64[h]')
df['end_datetime'] = pd.to_datetime(df.end_date) + df.end_time.astype('timedelta64[h]')
df = df.sort_values(['start_datetime', 'end_datetime'], ascending=[True, False])
gb = df.groupby('r_id')
for g, g_df in gb:
g_df['overlap_group'] = (g_df['end_datetime'].cummax().shift() <= g_df['start_datetime']).cumsum()
print(g_df)
This is a tentative example, and you might need to tweak the datetime conversion and some other minor things, but this is the gist.
The cummax() detects where there is an overlap between the intervals, and cumsum() counts the number of overlapping groups, since it's a counter we can use it as a unique identifier.
I used the following threads:
Group rows by overlapping ranges
python/pandas - converting date and hour integers to datetime
Edit
After discussing it with OP the idea is to take each patient's df and sort it by the date of the event. The first one will be the start_time and the last one would be the end_time.
The unification of the time and date are not necessary for detecting the start and end time as they can sort by date and then by the time to get the same order they would have gotten if they did unify the columns. However for the overlap detection it does make life easier when it's in one column.
gb_patient = df.groupby('id')
patients_data_list = []
for patient_id, patient_df in gb_patient:
patient_df = patient_df.sort_values(by=['Date', 'Time'])
patient_data = {
"patient_id": patient_id,
"start_time": patient_df.Date.values[0] + patient_df.Time.values[0],
"end_time": patient_df.Date.values[-1] + patient_df.Time.values[-1]
}
patients_data_list.append(patient_data)
new_df = pd.DataFrame(patients_data_list)
After that they can use the above code for the overlaps.
I have an issue where I need to track the progression of patients insurance claim statuses based on the dates of those statuses. I also need to create a count of status based on certain conditions.
DF:
ClaimID
New
Accepted
Denied
Pending
Expired
Group
001
2021-01-01T09:58:35:335Z
2021-01-01T10:05:43:000Z
A
002
2021-01-01T06:30:30:000Z
2021-03-01T04:11:45:000Z
2021-03-01T04:11:53:000Z
A
003
2021-02-14T14:23:54:154Z
2021-02-15T11:11:56:000Z
2021-02-15T11:15:00:000Z
A
004
2021-02-14T15:36:05:335Z
2021-02-14T17:15:30:000Z
A
005
2021-02-14T15:56:59:009Z
2021-03-01T10:05:43:000Z
A
In the above dataset, we have 6 columns. ClaimID is simple and just indicates the ID of the claim. New, Accepted, Denied, Pending, and Expired indicate the status of the claim and the day/time those statuses were set.
What I need to do is get a count of how many claims are New on each day and how many move out of new into a new status. For example, There are 2 new claims on 2021-01-01. On that same day 1 moved to Accepted about 7 minutes later. Thus on 2021-01-01 the table of counts would read:
DF_Count:
Date
New
Accepted
Denied
Pending
Expired
2021-01-01
2
1
0
0
0
2021-01-02
1
0
0
0
0
2021-01-03
1
0
0
0
0
2021-01-04
1
0
0
0
0
2021-01-05
1
0
0
0
0
....
....
....
....
....
....
2021-02-14
4
2
0
0
0
2021-02-15
2
3
0
0
1
2021-02-16
2
2
0
0
0
Few Conditions:
If a claim moves from one status to the other on the same day (even if they are a minutes/hours apart) it would not be subtracted from the original status until the next day. This can be seen on 2021-01-01 where claim 001 moves from new to accepted on the same day but the claim is not subtracted from new until 2021-01-02.
Until something happens to a claim, it should remain in its original status. Claim 002 will remain in new until 2021-03-01 when it is approved.
If a claim changes status on a later date than its original status, it will be subtracted on that later date. For this, see status 003. It is new on 2/14 but accepted on 2/15. This is why New goes down by 2 on 2/15 (the other claim is the is 004 which is new and accepted on the same day)
For certain statuses, I do not need to look at all columns. For example, For new I only look at the dates inside Accepted and Denied. Not Pending and Expired. When I do these same steps for approved, I no longer need to look at new, just the other columns. How would I do that?
In the final DF_count table, the dates should start from the earliest date in 'New' and end on todays date.
The code needs to be grouped by the Group Column as well. For example, patients in group B (not pictured) will have to have the same start and end date but for their own claims.
I need to do this separately for all of the statuses. Not just new.
Current Solution:
My current solution has been to create an dataset with just dates from the min New Date to todays date. Then for each column, what I do is use the .loc method to find dates that are greater than New in each of the other columns. For example, in the code below I look for all cases where new is equal to approved.
df1 = df.loc[(df['New'] == df['Approved']) &
((df['Expired'].isnull()) | (df['Expired'] >= df['Accepted'])) &
((df['Pending'].isnull()) | (df['Pending'] >= df['Accepted'])) &
((df['Denied'].isnull()) | (df['Denied'] >= df['Accepted']))]
newtoaccsday = df1.loc[:, ('Group', 'Accepted')]
newtoappsday['Date'] = newtoappsday['Accepted']
newtoappsday = newtoappsday.reset_index(drop = True)
newtoappsday= newtoappsday.groupby(['Date', 'Group'], as_index = False)['Approved'].value_counts()
newtoappsday.drop(columns = {'Accepted'}, inplace = True)
newtoappsday.rename(columns = {'count': 'NewAppSDay'}, inplace = True)
newtoappsday['Date'] = newtoappsday['Date'] + timedelta(1)
df_count= df_count.merge(newtoappsday, how = 'left', on = ['Date', 'Group']).fillna(0)
--After doing the above steps for all conditions (where new goes to accepted on a later date etc.) I will do the final calculation for new:
df_count['New'] = df_count.eval('New = New - (NewAccSDay + NewAccLater + NewDenSDay + NewDenLater + NewExpLater + NewPendSDay + NewPendLater)').groupby(['Tier2_ID', 'ClaimType'])['New'].cumsum()
Any and all help would be greatly appreciated. My method above is extremely inefficient and leading to some errors. Do I need to write a for loop for this? What is the best way to go about this.
First convert the date columns with something like
for i in ['New', 'Accepted', 'Denied', 'Pending', 'Expired']:
df[i] = pd.to_datetime(df[i], format="%Y-%m-%dT%H:%M:%S:%f%z")
Then develop the date range applicable based on your column conditions. In this logic if Denied is there the range is new --> denied, or if accepted new --> accepted or if no acceptance new --> today with code like (alter as per rules):
df['new_range'] = df[['New','Accepted','Denied']].apply (lambda x: pd.date_range(x['New'],x['Denied']).date.tolist() if
pd.notnull(x['Denied']) else
pd.date_range(x['New'],x['Accepted']).date.tolist() if
pd.notnull(x['Accepted']) else
pd.date_range(x['New'],datetime.today()).date.tolist()
,axis=1)
You should be able filter on a group and see date ranges in your df like:
df[df['Group']=='A']['new_range']
0 [2021-01-01]
1 [2021-01-01, 2021-01-02, 2021-01-03, 2021-01-0...
2 [2021-02-14]
3 [2021-02-14]
4 [2021-02-14, 2021-02-15, 2021-02-16, 2021-02-1..
Then you can explode the date ranges and group on counts to get the new counts for each day with code like:
new = pd.to_datetime(df[df['Group']=='A']['new_range'].explode('Date')).reset_index()
newc = new.groupby('new_range').count()
newc
new_range
2021-01-01 2
2021-01-02 1
2021-01-03 1
2021-01-04 1
2021-01-05 1
2021-01-06 1...
Similarly get counts for accepted, denied and then left joined on date to arrive at final table, fill na to 0.
By creating your rules to expand your date range, then explode over date range and groupby to get your counts you should be able to avoid much of the expensive operation.
I think this is what you want or can be easily modified to your necesity:
import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import date
def dateRange(d1,d2):
return [d1 + timedelta(days=x) for x in range((d2-d1).days)]
def addCount(dic,group,dat,cat):
if group not in dic:
dic[group]={}
if dat not in dic[group]:
dic[group][dat]={}
if cat not in dic[group][dat]:
dic[group][dat][cat]=0
dic[group][dat][cat]+=1
df =pd.read_csv("testdf.csv",
parse_dates=["New","Accepted","Denied","Pending", "Expired"])#,
cdic={}
for i,row in df.iterrows():
cid=row["ClaimID"]
dnew=row["New"].date()
dacc=row["Accepted"].date()
dden=row["Denied"].date()
dpen=row["Pending"].date()
dexp=row["Expired"].date()
group=row["Group"]
if not pd.isna(dacc): #Claim has been accepted
if(dnew == dacc):
dacc+=timedelta(days=1)
nend=dacc
addCount(cdic,group,dacc,"acc")
if not pd.isna(dden): # Claim has been denied
if(dnew == dden):
dden+=timedelta(days=1)
if pd.isna(dacc):
nend=dden
addCount(cdic,group,dden,"den")
if not pd.isna(dpen):
addCount(cdic,group,dpen,"pen") # Claim is pending
if not pd.isna(dexp):
addCount(cdic,group,dexp,"exp") # Claim is expired
if pd.isna(dacc) and pd.isna(dden):
nend=date.today()+timedelta(days+1)
for d in dateRange(dnew,nend): # Fill new status until first change
addCount(cdic,group,d,"new")
ndfl=[]
for group in cdic:
for dat in sorted(cdic[group].keys()):
r=cdic[group][dat]
ndfl.append([group,dat,r.get("new",0),r.get("acc",0),
r.get("den",0),r.get("pen",0),r.get("exp",0)])
ndf=pd.DataFrame(ndfl,columns=["Group", "Date","New","Accepted","Denied","Pending","Expired"])
I have a dataframe (df1) that looks like this;
title
score
id
timestamp
Stock_name
Biocryst ($BCRX) continues to remain undervalued
120
mfuz84
2021-01-28 21:32:10
...and then continues with 44000 something more rows. I have another dataframe (df2) that looks like this;
Company name
Symbol
BioCryst Pharmaceuticals, Inc.
BCRX
GameStop
GME
Apple Inc.
AAPL
...containing all nasdaq and NYSE listed stocks. What I want to do now however, is to add the symbol of the stock to the column "Stock_name" in df1. In order to do this, I want to match the df1[title] with the df2[Symbol] and then based on what symbol has a match in the title, add the corresponding stock name (df2[Company name]) to the df1[Stock_name] column. If there is more than one stock name in the title, I want to use the first one mentioned.
Is there any easy way to do this?
I tried with this little dataset and it's working, let me know if you have some problems
df1 = pd.DataFrame({"title" : ["Biocryst ($BCRX) continues to remain undervalued", "AAPL is good, buy it"], 'score' : [120,420] , 'Stock_name' : ["",""] })
df2 = pd.DataFrame({'Company name' : ['BioCryst Pharmaceuticals, Inc.','GameStop','Apple Inc.'], 'Symbol' : ["BCRX","GME","AAPL"]})
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120
1 AAPL is good, buy it 420
df2
Company name Symbol
0 BioCryst Pharmaceuticals, Inc. BCRX
1 GameStop GME
2 Apple Inc. AAPL
for j in range(0,len(df2)):
for i in range(0,len(df1)):
if df2['Symbol'][j] in df1['title'][i]:
df1['Stock_name'][i] = df2['Symbol'][j]
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120 BCRX
1 AAPL is good, buy it 420 AAPL
First, I think you should create a dictionary based on df2.
symbol_lookup = dict(zip(df2['Symbol'],df2['Company name']))
Then you need a function that will parse the title column. If you can rely on stock symbols being preceded by a dollar sign, you can use the following:
def find_name(input_string):
for symbol in input_string.split('$'):
#if the first four characters form
#a stock symbol, return the name
if symbol_lookup.get(symbol[:4]):
return symbol_lookup.get(symbol[:4])
#otherwise check the first three characters
if symbol_lookup.get(symbol[:3]):
return symbol_lookup.get(symbol[:3])
You could also write a function based on expecting the symbols to be in parentheses. If you can't rely on either, it would be more complicated.
Finally, you can apply your function to the title column:
df1['Stock_name'] = df1['title'].apply(find_name)
Is there a way to specify a DataFrame index (row) based on matching text inside the dataframe?
I am importing a text file from the internet located here every day into a python pandas DataFrame. I am parsing out just some of the data and doing calculations to give me the peak value for each day. The specific group of data I am needing to gather starts with the section headed "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW".
I need to specifically only use part of the data to do the calculations I need and I am able to manually specify which index line to start with, but daily this number could change due to text added to the top of the file by the authors.
Updated as of: 05-05-2016 1700 Constrained operations ARE expected in
the AEP, APS, BC, COMED, DOM,and PS zones on 05-06-2016. Constrained
operations ARE expected in the AEP, APS, BC, COMED, DOM,and PS zones
on 05-07-2016. The PS/ConEd 600/400 MW contract will be limited to
700MW on 05-06-16.
Is there a way to match text in the pandas DataFrame and specify the index of that match? Currently I am manually specifying the index I want to start with using the variable 'day' below on the 6th line. I would like this variable to hold the index (row) of the dataframe that includes the text I want to match.
The code below works but may stop working if the line number (index) changes:
def forecastload():
wb = load_workbook(filename = 'pjmactualload.xlsx')
ws = wb['PJM Load']
printRow = 13
#put this in iteration to pull 2 rows of data at a time (one for each day) for 7 days max
day = 239
while day < 251:
#pulls in first day only
data = pd.read_csv("http://oasis.pjm.com/doc/projload.txt", skiprows=day, delim_whitespace=True, header=None, nrows=2)
#sets data at HE 24 = to data that is in HE 13- so I can delete column 0 data to allow checking 'max'
data.at[1,13]= data.at[1,1]
#get date for printing it with max load later on
newDate = str(data.at[0,0])
#now delete first column to get rid of date data. date already saved as newDate
data = data.drop(0,1)
data = data.drop(1,1)
#pull out max value of day
#add index to this for iteration ie dayMax[x] = data.values.max()
dayMax = data.max().max()
dayMin = data.min().min()
#print date and max load for that date
actualMax = "Forecast Max"
actualMin = "Forecast Min"
dayMax = int(dayMax)
maxResults = [str(newDate),int(dayMax),actualMax,dayMin,actualMin]
d = 1
for items in maxResults:
ws.cell(row=printRow, column=d).value = items
d += 1
printRow += 1
#print maxResults
#l.writerows(maxResults)
day = day + 2
wb.save('pjmactualload.xlsx')
In this case i recommend you to use the command line in order to obtain a dataset that you could read later with pandas and do whatever you want.
To retrieve the data you can use curl and grep:
$ curl -s http://oasis.pjm.com/doc/projload.txt | grep -A 17 "RTO COMBINED HOUR ENDING INTEGRATED FORECAST" | tail -n +5
05/06/16 am 68640 66576 65295 65170 66106 70770 77926 83048 84949 85756 86131 86089
pm 85418 85285 84579 83762 83562 83289 82451 82460 84009 82771 78420 73258
05/07/16 am 66809 63994 62420 61640 61848 63403 65736 68489 71850 74183 75403 75529
pm 75186 74613 74072 73950 74386 74978 75135 75585 77414 76451 72529 67957
05/08/16 am 63583 60903 59317 58492 58421 59378 60780 62971 66289 68997 70436 71212
pm 71774 71841 71635 71831 72605 73876 74619 75848 78338 77121 72665 67763
05/09/16 am 63865 61729 60669 60651 62175 66796 74620 79930 81978 83140 84307 84778
pm 85112 85562 85568 85484 85766 85924 85487 85737 87366 84987 78666 72166
05/10/16 am 67581 64686 62968 62364 63400 67603 75311 80515 82655 84252 86078 87120
pm 88021 88990 89311 89477 89752 89860 89256 89327 90469 87730 81220 74449
05/11/16 am 70367 67044 65125 64265 65054 69060 76424 81785 84646 87097 89541 91276
pm 92646 93906 94593 94970 95321 95073 93897 93162 93615 90974 84335 77172
05/12/16 am 71345 67840 65837 64892 65600 69547 76853 82077 84796 87053 89135 90527
pm 91495 92351 92583 92473 92541 92053 90818 90241 90750 88135 81816 75042
Let's use the previous output (in the rto.txt file) to obtain a more readable data using awk and sed:
$ awk '/^ [0-9]/{d=$1;print $0;next}{print d,$0}' rto.txt | sed 's/^ //;s/\s\+/,/g'
05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
05/07/16,am,66809,63994,62420,61640,61848,63403,65736,68489,71850,74183,75403,75529
05/07/16,pm,75186,74613,74072,73950,74386,74978,75135,75585,77414,76451,72529,67957
05/08/16,am,63583,60903,59317,58492,58421,59378,60780,62971,66289,68997,70436,71212
05/08/16,pm,71774,71841,71635,71831,72605,73876,74619,75848,78338,77121,72665,67763
05/09/16,am,63865,61729,60669,60651,62175,66796,74620,79930,81978,83140,84307,84778
05/09/16,pm,85112,85562,85568,85484,85766,85924,85487,85737,87366,84987,78666,72166
05/10/16,am,67581,64686,62968,62364,63400,67603,75311,80515,82655,84252,86078,87120
05/10/16,pm,88021,88990,89311,89477,89752,89860,89256,89327,90469,87730,81220,74449
05/11/16,am,70367,67044,65125,64265,65054,69060,76424,81785,84646,87097,89541,91276
05/11/16,pm,92646,93906,94593,94970,95321,95073,93897,93162,93615,90974,84335,77172
05/12/16,am,71345,67840,65837,64892,65600,69547,76853,82077,84796,87053,89135,90527
05/12/16,pm,91495,92351,92583,92473,92541,92053,90818,90241,90750,88135,81816,75042
now, read and reshape the above result with pandas:
df = pd.read_csv("rto2.txt",names=["date","period"]+list(range(1,13)),index_col=[0,1])
df = df.stack().reset_index().rename(columns={"level_2":"hour",0:"value"})
df.index = pd.to_datetime(df.apply(lambda x: "{date} {hour}:00 {period}".format(**x),axis=1))
df.drop(["date", "hour", "period"], axis=1, inplace=True)
At this point you have a beautiful time series :)
In [10]: df.head()
Out[10]:
value
2016-05-06 01:00:00 68640
2016-05-06 02:00:00 66576
2016-05-06 03:00:00 65295
2016-05-06 04:00:00 65170
2016-05-06 05:00:00 66106
to obtain the statistics:
In[11]: df.groupby(df.index.date).agg([min,max])
Out[11]:
value
min max
2016-05-06 65170 86131
2016-05-07 61640 77414
2016-05-08 58421 78338
2016-05-09 60651 87366
2016-05-10 62364 90469
2016-05-11 64265 95321
2016-05-12 64892 92583
I hope this can help you.
Regards.
Here is how you can do what you are looking for:
And the sample code:
import numpy as np
import pandas a pd
df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
df.loc[df['a'] < 0.5, 'a'] = 1
You can refer to this documentation
Added image showing how to access index: