Pandas VLOOKUP with an ID and a date range? - python

I am working on a project where I am pulling blood pressure readings from a third party device API for our patients using Python, and inserting them into a SQL Server database.
The data that is pulled basically gives the device ID, the date of reading, and the measurements. No patient identifier.
The team that uses this data wants me to put this data into our sql database and additionally attach a patient ID to the reading so they themselves can pull the data and know which patient the reading corresponds to.
They have an excel spreadsheet they manually fill out that has a patient ID and their device ID. When a patient is done with this health program, they return the device to their provider, and that device is then loaned to another patient starting the program. So one device may be used by multiple patients. Or sometimes a patient's device malfunctions and they get a new one, so one patient may get multiple devices.
The spreadsheet has first/last reading date columns, but they don't seem to be filled out consistently.
Here is a barebones example of the readings dataframe:
reading_id device_id date_recorded systolic diastolic
123 42107 2022-10-31 194 104
126 42107 2022-11-01 195 103
122 42102 2022-11-03 180 90
107 36781 2022-11-04 110 70
111 36781 2022-11-05 140 85
321 42107 2022-11-06 180 95
432 42107 2022-11-07 130 60
234 50192 2022-11-08 120 75
101 61093 2022-11-11 140 90
333 42107 2022-11-15 130 60
561 12908 2022-11-18 120 90
And example of devices spreadsheet that I imported as a dataframe.
patient_id patient_num device_id pat_name first_reading last_reading
32149 1 42107 bob 2022-10-31 2022-11-01
41105 2 42102 jess
21850 3 42107 james 2022-11-07
32109 4 36781 patrick 2022-11-05
32109 4 50192 patrick
10824 5 61093 john 2022-11-11 2022-11-11
10233 6 42107 ashley
patient_num is just which number patient in the program they are. patient_id is their ID in our EHR that we would use to look them up. As far as I can tell, if last_reading is filled in, that means the patient is done with that device. And if there is nothing in first_reading or last_reading, that patient is still using that device.
So as we can see, device 42107 was first used by bob, but he quit the program on 2022-11-01. james then started using device 42107 until he too quit the program on 2022-11-07. finally, it seems ashley is using that device still to this day.
patrick was using device 36781 until it malfunctioned 2022-11-05. then he got a new device 50192 and has been using it since.
Finally, I noticed that there are device_ids in the readings that are not in the devices spreadsheet. I'm not sure how to handle those.
this is the output I want:
reading_id device_id date_recorded systolic diastolic patient_id
123 42107 2022-10-31 194 104 32149
126 42107 2022-11-01 195 103 32149
122 42102 2022-11-03 180 90 41105
107 36781 2022-11-04 110 70 32109
111 36781 2022-11-05 140 85 32109
321 42107 2022-11-06 180 95 21850
432 42107 2022-11-07 130 60 21850
234 50192 2022-11-08 120 75 32109
101 61093 2022-11-11 140 90 10824
333 42107 2022-11-15 130 60 10233
561 12908 2022-11-18 120 90 no id(?)
Is there enough data in the devices spreadsheet to achieve this? Or is it not possible with the missing dates in first_reading/last_reading, plus the device_ids that have no patient_id mapping in the sheet? I wanted to ask that team to put in a "start date" and "end date" for each patient's loan duration, though I would have to take into account patients with no "end date" yet because they are still using the device.
I've tried making a function that filters the devices dataframe based on the device ID and date recorded and applying it with df.apply, but I think I kept getting errors due to missing data. Also I just suck, but I am still learning.
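For context, this is roughly the kind of apply-based lookup I was attempting (a simplified sketch, not working code for the real data: it assumes the two dataframes above are called readings and devices with the date columns already parsed as datetimes, treats a blank first_reading as "from the beginning" and a blank last_reading as "still in use", and it does not resolve overlapping loans like device 42107, which is part of what I'm asking about):

import numpy as np
import pandas as pd

def find_patient(row, devices):
    # all loan rows in the spreadsheet for this device
    cand = devices[devices['device_id'] == row['device_id']]
    if cand.empty:
        return np.nan  # device_id not in the devices spreadsheet at all
    # assumption: a blank first_reading means the loan started before any reading,
    # and a blank last_reading means the loan is still active
    start = cand['first_reading'].fillna(pd.Timestamp.min)
    end = cand['last_reading'].fillna(pd.Timestamp.max)
    match = cand[(row['date_recorded'] >= start) & (row['date_recorded'] <= end)]
    return match['patient_id'].iloc[0] if not match.empty else np.nan

readings['patient_id'] = readings.apply(find_patient, axis=1, devices=devices)

Even with something like this, it seems the devices sheet would need the explicit loan start/end dates I mentioned for the overlapping ranges to resolve unambiguously.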
Thanks for any guidance!

Related

Is there a way to merge two data frames on their index, remove duplicate values in index, and then append columns which will be assigned properly?

I have a list of the top people who buy from our company, as well as a list of the top sellers from whom we purchase. They look like this:
money_received muhney_adj lbs lbs_adj trips trader
account
XYZ 2556972.42 0.00 12860910.0 0 461 NaN
ABC 2541912.58 -80439.42 4918390.0 -29283 73 DB
ZXCV 1806438.66 -300.00 8726460.0 0 186 AS
KLJU 1602633.62 -27600.00 9174020.0 0 287 NaN
ASDF 1320914.19 0.00 364761.0 0 9 JK
and
account money_paid muhney_adj lbs lbs_adj trips trader
XYZ 1566198.29 -553.57 2302200 -4500.0 58 DB
ABC 584055.93 -21998.49 2854607 -17401.0 272 JB
PLM 434250.78 -17528.26 2389424 -6700.0 94 JK
JNI 384354.07 -314.50 2188932 -1000.0 38 JB
The account column is the index in both dataframes. There is some overlap in the accounts, while there are some which only appear in one of the two dfs.
How can I join the two dfs into a new df that would have a layout like this:
index account money_received lbs_sold sale_trips money_paid lbs_bought purchase_trips
0 XYZ 2556972.42 12860910 461 1566198.29 2302200 58
1 ABC 2541912.58 4918390 73 584055.93 2854607 272
2 PLM 1806438.66 8726460 186 434250.78 2389424 94
I do not mind if there are many null values; however, I need to remove all duplicate account names.
Any thoughts?
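One rough sketch of a possible approach (untested; the buyers/sellers variable names and the renamed columns are assumptions for illustration, not from the question) is an outer join on the shared account index:

import pandas as pd

# rename the overlapping columns so the two sides stay distinguishable after the join
buyers = buyers.rename(columns={'lbs': 'lbs_sold', 'trips': 'sale_trips'})
sellers = sellers.rename(columns={'lbs': 'lbs_bought', 'trips': 'purchase_trips'})

# an outer join keeps accounts that appear in only one of the two frames (with NaNs)
combined = (
    buyers[['money_received', 'lbs_sold', 'sale_trips']]
    .join(sellers[['money_paid', 'lbs_bought', 'purchase_trips']], how='outer')
    .reset_index()  # turn the account index back into a regular column
)

Because the join is on the index, each account appears once in the result as long as it is unique within each input frame.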

Updating values for a subset of a subset of a pandas dataframe too slow for large data set

Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient.
Issue I'm Having: My code works on a small dataset, but the actual data set is about 5GB and 13M rows. The code has been running for several days now and still hasn't finished. For background, my code is in a Jupyter notebook running on a standard work PC.
Sample Data
import pandas as pd
df = pd.DataFrame({"PatientAccountNumber": [113, 113, 113, 113, 225, 225, 225, 225, 225, 225, 225],
                   "TransactionCode": ['50', '50', '77', '60', '22', '77', '25', '77', '25', '77', '77'],
                   "Bucket": ['Charity', 'Charity', 'Bad Debt', '3rd Party', 'Self Pay', 'Bad Debt',
                              'Charity', 'Bad Debt', 'Charity', 'Bad Debt', 'Bad Debt']})
print(df)
Sample Dataframe
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
7 225 77 Bad Debt
8 225 25 Charity
9 225 77 Bad Debt
10 225 77 Bad Debt
Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    df.drop(df[mask].index[1:], inplace=True)
print(df)
Desired Result (Each patient should have a maximum of one Bad Debt transaction)
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Alternate Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    mask = mask & (mask.cumsum() > 1)
    df.loc[mask, 'Bucket'] = 'DELETE'

df = df[df['Bucket'] != 'DELETE']
Attempted using Dask
I thought maybe Dask would be able to help me out, but I got the following errors:
Using Dask on first solution - "NotImplementedError: Series getitem in only supported for other series objects with matching partition structure"
Using Dask on second solution - "TypeError: '_LocIndexer' object does not support item assignment"
You can solve this using df.duplicated on both PatientAccountNumber and Bucket and checking whether Bucket is 'Bad Debt':
df[~(df.duplicated(['PatientAccountNumber','Bucket']) & df['Bucket'].eq("Bad Debt"))]
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Create a boolean mask without a loop:
mask = df[df['Bucket'].eq('Bad Debt')].duplicated('PatientAccountNumber')
df.drop(mask[mask].index, inplace=True)
>>> df
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity

How to grab a complete table hidden beyond 'Show all' by web scraping in Python

According to the reply I found in my previous question, I am able to grab the table by web scraping in Python from the URL: https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html. But it only grabs the table partially, up to the row where "Show all" appears.
How can I grab the complete table in Python, including the rows hidden beyond "Show all"?
Here is the code I am using:
import pandas as pd
import requests
from bs4 import BeautifulSoup
#
vaccineDF = pd.read_html('https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html')[0]
vaccineDF = vaccineDF.reset_index(drop=True)
print(vaccineDF.head(100))
The output only grabs 15 rows (until Show All):
Unnamed: 0_level_0 Doses administered ... Unnamed: 8_level_0 Unnamed: 9_level_0
Unnamed: 0_level_1 Per 100 people ... Unnamed: 8_level_1 Unnamed: 9_level_1
0 World 11 ... NaN NaN
1 Israel 116 ... NaN NaN
2 Seychelles 116 ... NaN NaN
3 U.A.E. 99 ... NaN NaN
4 Chile 69 ... NaN NaN
5 Bahrain 66 ... NaN NaN
6 Bhutan 63 ... NaN NaN
7 U.K. 62 ... NaN NaN
8 United States 61 ... NaN NaN
9 San Marino 60 ... NaN NaN
10 Maldives 59 ... NaN NaN
11 Malta 55 ... NaN NaN
12 Monaco 53 ... NaN NaN
13 Hungary 45 ... NaN NaN
14 Serbia 44 ... NaN NaN
15 Show all Show all ... Show all Show all
Below is a screenshot of the partial table up to "Show all" on the web page (left) and the corresponding inspected elements (right):
You can't get the whole table directly because the full data only appears after clicking the "Show all" button. So we first have to trigger a click on that button, and then we can fetch the whole table.
I used the Selenium library to perform the click, with Selenium's Firefox() webdriver to fetch the page. The code below fetches the whole table from the given COVID dataset URL:
# Import the required libraries
from selenium import webdriver         # drives the browser and performs the click
from pandas.io.html import read_html   # parses HTML tables so we can scrape them
import pandas as pd                    # converts the data into a DataFrame

# Create a Firefox webdriver object
driver = webdriver.Firefox()
# Open the page
driver.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
# Find the 'Show all' button using XPath
show_all_button = driver.find_element_by_xpath("/html/body/div[1]/main/article/section/div/div/div[4]/div[1]/div/table/tbody/tr[16]")
# Click the 'Show all' button
show_all_button.click()
# Get the HTML content of the page
html_data = driver.page_source
After fetching the whole page, let's see how many matching tables there are at the COVID dataset URL:
covid_data_tables = read_html(html_data, attrs = {"class":"g-summary-table svelte-2wimac"}, header = None)
# Print Number of Tables Extracted
print ("\nExtracted {num} COVID Data Table".format(num = len(covid_data_tables)), "\n")
# Output of Above Cell:-
Extracted 1 COVID Data Table
Now, let's fetch the Data Table:-
# Print Table Data
covid_data_tables[0].head(20)
# Output of above cell:-
Unnamed: 0_level_0 Doses administered Pct. of population
Unnamed: 0_level_1 Per 100 people Total Vaccinated Fully vaccinated
0 World 11 877933955 – –
1 Israel 116 10307583 60% 56%
2 Seychelles 116 112194 68% 47%
3 U.A.E. 99 9489684 – –
4 Chile 69 12934282 41% 28%
5 Bahrain 66 1042463 37% 29%
6 Bhutan 63 478219 63% –
7 U.K. 62 41505768 49% 13%
8 United States 61 202282923 38% 24%
9 San Marino 60 20424 35% 25%
10 Maldives 59 303752 53% 5.6%
11 Malta 55 264658 38% 17%
12 Monaco 53 20510 30% 23%
13 Hungary 45 4416581 32% 14%
14 Serbia 44 3041740 26% 17%
15 Qatar 43 1209648 – –
16 Uruguay 38 1310591 30% 8.3%
17 Singapore 30 1667522 20% 9.5%
18 Antigua and Barbuda 28 27032 28% –
19 Iceland 28 98672 20% 8.1%
As you can see, the "Show all" row no longer appears in the data. Now we can save this data table to CSV and read it back into a DataFrame for further analysis. The code for that step is below:
# HTML Table to CSV Format Conversion For COVID Dataset
covid_data_file = 'covid_data.csv'
covid_data_tables[0].to_csv(covid_data_file, sep = ',')
# Read CSV Data From Data Table for Further Analysis
covid_data = pd.read_csv("covid_data.csv")
After storing the data in CSV format, let's load it into a DataFrame and print the whole dataset:
# Store 'CSV' Data into 'DataFrame' Format
vaccineDF = pd.DataFrame(covid_data)
vaccineDF = vaccineDF.drop(columns=["Unnamed: 0"], axis = 1) # drop the unnecessary index column from the dataset
# Print Whole Dataset
vaccineDF
# Output of above cell:-
Unnamed: 0_level_0 Doses administered Doses administered.1 Pct. of population Pct. of population.1
0 Unnamed: 0_level_1 Per 100 people Total Vaccinated Fully vaccinated
1 World 11 877933955 – –
2 Israel 116 10307583 60% 56%
3 Seychelles 116 112194 68% 47%
4 U.A.E. 99 9489684 – –
... ... ... ... ... ...
154 Syria <0.1 2500 <0.1% –
155 Papua New Guinea <0.1 1081 <0.1% –
156 South Sudan <0.1 947 <0.1% –
157 Cameroon <0.1 400 <0.1% –
158 Zambia <0.1 106 <0.1% –
159 rows × 5 columns
From the above output we can see that the whole data table was fetched successfully. Hope this solution helps.
OWID provides this data, which effectively comes from JHU.
If you want the latest vaccination data by country, it's simple to use the CSV interface:
import io
import requests
import pandas as pd
dfraw = pd.read_csv(io.StringIO(requests.get("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv").text))
dfraw["date"] = pd.to_datetime(dfraw["date"])
dfraw.sort_values(["iso_code","date"]).groupby("iso_code", as_index=False).last()

Conditional count column in Pandas where separate strings match in multiple columns

I am attempting to recreate this report I have in Excel:
Dealer Net NetSold NetRatio Phone PhSold PhRatio WalkIn WInSold WInRatio
Ford 671 31 4.62% 127 21 16.54% 93 24 25.81%
Toyota 863 37 4.29% 125 39 31.20% 97 32 32.99%
Chevy 826 67 8.11% 160 41 25.63% 224 126 56.25%
Dodge 1006 55 5.47% 121 28 23.14% 242 87 35.95%
Kia 910 57 6.26% 123 36 29.27% 202 92 45.54%
VW 1029 84 8.16% 316 65 20.57% 329 148 44.98%
Lexus 1250 73 5.84% 137 36 26.28% 138 69 50.00%
Total 6555 404 6.16% 1109 266 23.99% 1325 578 43.62%
Out of a csv that looks like this:
Dealer LeadType LeadStatusType
Chevy Internet Active
Ford Internet Active
Ford Internet Sold
Toyota Internet Active
VW Walk-in Sold
Kia Internet Active
Dodge Internet Active
There's more data in the csv than that, which will be used in other pages in this report, but I'm really only looking to solve the part that I'm stuck on now, as I want to learn as much as possible and to make sure that I'm on an okay track to keep progressing.
I was able to get close to where I think I need to be with the following:
lead_counts = df.groupby('Dealer')['Lead Type'].value_counts().unstack()
which of course gives a nice summary counting the leads by type. The issue is that I now need to insert calculated columns based on other fields. For example: for each dealer, count the number of leads that are both LeadType='Internet' AND LeadStatusType='Sold'.
I've honestly tried so many things that I'm not going to be able to remember them all.
def leads_by_type(x):
    for dealer in dealers:
        return len(df[(df['Dealer'] == dealer) & (df['Lead Type'] == 'Internet') & (df['Lead Status Type'] == 'Sold')])
Tried something like this where I can reliably get the data I'm looking for, but I can't really figure out how to apply it to a column.
I've tried simply:
lead_counts['NetSold'] = len(df[(df['Dealer'] == dealer) &(df['Lead Type'] == 'Internet') & (df['Lead Status Type'] == 'Sold')])
Any advice for how to proceed, or am I going about this the wrong way already? This is all very doable in Excel, and I often get asked why I'm trying to replicate it in Python. The answer is just automation and learning.
I know some of the columns don't exactly match up between the table and the code; this is just because I shortened some of them in the table to clean it up for posting.
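As a rough sketch of one possible direction (not a tested answer; the column names follow the code in the question rather than the csv header), the conditional counts can be built with boolean helper columns and a single groupby:

import pandas as pd

# flag each lead once, then count the flags per dealer
flags = df.assign(
    net=df['Lead Type'].eq('Internet'),
    net_sold=df['Lead Type'].eq('Internet') & df['Lead Status Type'].eq('Sold'),
)
summary = flags.groupby('Dealer')[['net', 'net_sold']].sum()
summary['net_ratio'] = summary['net_sold'] / summary['net']

The same pattern would extend to the Phone and Walk-in columns by adding more boolean flags before the groupby.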

Pandas data pull - messy strings to float

I am new to Pandas and I am just starting to take in the versatility of the package. While working with a small practice csv file, I pulled the following data in:
Rank Corporation Sector Headquarters Revenue (thousand PLN) Profit (thousand PLN) Employees
1.ÿ PKN Orlen SA oil and gas P?ock 79 037 121 2 396 447 4,445
2.ÿ Lotos Group SA oil and gas Gda?sk 29 258 539 584 878 5,168
3.ÿ PGE SA energy Warsaw 28 111 354 6 165 394 44,317
4.ÿ Jer¢nimo Martins retail Kostrzyn 25 285 407 N/A 36,419
5.ÿ PGNiG SA oil and gas Warsaw 23 003 534 1 711 787 33,071
6.ÿ Tauron Group SA energy Katowice 20 755 222 1 565 936 26,710
7.ÿ KGHM Polska Mied? SA mining Lubin 20 097 392 13 653 597 18,578
8.ÿ Metro Group Poland retail Warsaw 17 200 000 N/A 22,556
9.ÿ Fiat Auto Poland SA automotive Bielsko-Bia?a 16 513 651 83 919 5,303
10.ÿ Orange Polska telecommunications Warsaw 14 922 000 1 785 000 23,805
I have two serious problems with it that I cannot seem to find solution for:
1) Data in the "Revenue" and "Profit" columns is pulled in as strings because of the odd formatting with spaces between thousands, and I cannot seem to figure out how to make Pandas translate them into floating point values.
2) Data under the "Rank" column is pulled in as "1.?", "2.?" etc. What's happening there? Again, when I try to rewrite this data with something more appropriate like "1.", "2." etc., the DataFrame just does not budge.
Ideas? Suggestions? I am also open to outright bashing because my problem might be quite obvious and silly - excuse my lack of experience then :)
I would use the converters parameter.
Pass this to your pd.read_csv call:
def space_float(x):
    return float(x.replace(' ', ''))

converters = {
    'Revenue (thousand PLN)': space_float,
    'Profit (thousand PLN)': space_float,
    'Rank': str.strip
}

pd.read_csv(... converters=converters ...)
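If the data has already been read in, a similar cleanup can be applied to the existing columns afterwards; a small sketch, assuming the DataFrame is called df and the column names match the question (pd.to_numeric with errors='coerce' also turns the "N/A" entries into NaN):

for col in ['Revenue (thousand PLN)', 'Profit (thousand PLN)']:
    df[col] = pd.to_numeric(df[col].str.replace(' ', '', regex=False), errors='coerce')

# keep only the leading number from the garbled Rank values, e.g. "1.ÿ" -> "1"
df['Rank'] = df['Rank'].str.extract(r'(\d+)', expand=False)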
