Conditional count column in Pandas where separate strings match in multiple columns - python

I am attempting to recreate this report I have in Excel:
Dealer Net NetSold NetRatio Phone PhSold PhRatio WalkIn WInSold WInRatio
Ford 671 31 4.62% 127 21 16.54% 93 24 25.81%
Toyota 863 37 4.29% 125 39 31.20% 97 32 32.99%
Chevy 826 67 8.11% 160 41 25.63% 224 126 56.25%
Dodge 1006 55 5.47% 121 28 23.14% 242 87 35.95%
Kia 910 57 6.26% 123 36 29.27% 202 92 45.54%
VW 1029 84 8.16% 316 65 20.57% 329 148 44.98%
Lexus 1250 73 5.84% 137 36 26.28% 138 69 50.00%
Total 6555 404 6.16% 1109 266 23.99% 1325 578 43.62%
Out of a csv that looks like this:
Dealer LeadType LeadStatusType
Chevy Internet Active
Ford Internet Active
Ford Internet Sold
Toyota Internet Active
VW Walk-in Sold
Kia Internet Active
Dodge Internet Active
There's more data in the csv than that, which will be used in other pages of this report, but for now I'm only looking to solve the part I'm stuck on. I want to learn as much as possible and make sure I'm on a reasonable track to keep progressing.
I was able to get close to where I think I need to be with the following:
lead_counts = df.groupby('Dealer')['Lead Type'].value_counts().unstack()
which of course gives a nice summary of lead counts by type. The issue is that I now need to insert calculated columns based on other fields. For example: for each dealer, count the number of leads where LeadType='Internet' AND LeadStatusType='Sold'.
I've honestly tried so many things that I'm not going to be able to remember them all.
def leads_by_type(x):
    for dealer in dealers:
        return len(df[(df['Dealer'] == dealer) &
                      (df['Lead Type'] == 'Internet') &
                      (df['Lead Status Type'] == 'Sold')])
I tried something like this, where I can reliably get the data I'm looking for, but I can't really figure out how to apply it to a column.
I've tried simply:
lead_counts['NetSold'] = len(df[(df['Dealer'] == dealer) & (df['Lead Type'] == 'Internet') & (df['Lead Status Type'] == 'Sold')])
Any advice for how to proceed, or am I going about this the wrong way already? This is all very doable in Excel, and I often get asked why I'm trying to replicate it in Python. The answer is just automation and learning.
I know some of the column names don't exactly match between the table and the code; that's just because I shortened some of them in the table to clean it up for posting.
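One direction that avoids the per-dealer loop (a minimal sketch, assuming the 'LeadType' and 'LeadStatusType' column names from the csv sample above and a placeholder file name) is to build boolean flag columns and sum them per dealer:

import pandas as pd

df = pd.read_csv('leads.csv')  # 'leads.csv' is a placeholder file name

# Flag internet leads and sold internet leads, then count both per dealer in one pass
summary = df.assign(
    Net=df['LeadType'].eq('Internet'),
    NetSold=df['LeadType'].eq('Internet') & df['LeadStatusType'].eq('Sold'),
).groupby('Dealer')[['Net', 'NetSold']].sum()

summary['NetRatio'] = summary['NetSold'] / summary['Net']
print(summary)

Summing booleans counts the True values, so each calculated column comes straight out of the groupby instead of being filled in one dealer at a time.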

Related

Pandas VLOOKUP with an ID and a date range?

I am working on a project where I am pulling blood pressure readings for our patients from a third-party device API using Python and inserting them into a SQL Server database.
The data that is pulled basically gives the device ID, the date of the reading, and the measurements. No patient identifier.
The team that uses this data wants me to put it into our SQL database and additionally attach a patient ID to each reading, so they can pull the data themselves and know which patient the reading corresponds to.
They have an Excel spreadsheet they manually fill out that has a patient ID and their device ID. When a patient is done with this health program, they return the device to their provider, and that device is then loaned to another patient starting the program. So one device may be used by multiple patients. Or sometimes a patient's device malfunctions and they get a new one, so one patient may get multiple devices.
The spreadsheet has first/last reading date columns, but they don't seem to be filled out consistently.
Here is a barebones example of the readings dataframe:
reading_id device_id date_recorded systolic diastolic
123 42107 2022-10-31 194 104
126 42107 2022-11-01 195 103
122 42102 2022-11-03 180 90
107 36781 2022-11-04 110 70
111 36781 2022-11-05 140 85
321 42107 2022-11-06 180 95
432 42107 2022-11-07 130 60
234 50192 2022-11-08 120 75
101 61093 2022-11-11 140 90
333 42107 2022-11-15 130 60
561 12908 2022-11-18 120 90
And here is an example of the devices spreadsheet that I imported as a dataframe:
patient_id patient_num device_id pat_name first_reading last_reading
32149 1 42107 bob 2022-10-31 2022-11-01
41105 2 42102 jess
21850 3 42107 james 2022-11-07
32109 4 36781 patrick 2022-11-05
32109 4 50192 patrick
10824 5 61093 john 2022-11-11 2022-11-11
10233 6 42107 ashley
patient_num is just which # patient in the program they are. patient_id is their ID in our EHR that we would use to look them up. As far as I can tell, if last_reading is filled in, that means the patient is done with that device. And if there is nothing in first_reading or last_reading, that patient is still using that device.
So as we can see, device 42107 was first used by bob, but he quit the program on 2022-11-01. james then started using device 42107 until he too quit the program on 2022-11-07. Finally, it seems ashley is still using that device to this day.
patrick was using device 36781 until it malfunctioned on 2022-11-05. Then he got a new device, 50192, and has been using it since.
Finally, I noticed that there are device_ids in the readings that are not in the devices spreadsheet. I'm not sure how to handle those.
this is the output I want:
reading_id device_id date_recorded systolic diastolic patient_id
123 42107 2022-10-31 194 104 32149
126 42107 2022-11-01 195 103 32149
122 42102 2022-11-03 180 90 41105
107 36781 2022-11-04 110 70 32109
111 36781 2022-11-05 140 85 32109
321 42107 2022-11-06 180 95 21850
432 42107 2022-11-07 130 60 21850
234 50192 2022-11-08 120 75 32109
101 61093 2022-11-11 140 90 10824
333 42107 2022-11-15 130 60 10233
561 12908 2022-11-18 120 90 no id(?)
Is there enough data in the devices spreadsheet to achieve this? Or is it not possible with the missing dates in first/last date, plus the missing device_id to patient_id in the sheet? I wanted to ask that team to put in a "start date" and "end date" for each patient's loan duration. Though I would have to take into account patients with no "end date" yet if they are still using the device.
I've tried making a function that filters the devices dataframe based on the device ID and date recorded and applying it with df.apply, but I think I kept getting errors due to the missing data. I'm also still learning, so I may well be missing something obvious.
Thanks for any guidance!
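One hedged sketch, assuming the two frames are named readings and devices and assuming you can get (or derive) an explicit loan start date per patient/device row, is to match each reading to the most recent loan start for its device with pd.merge_asof. The start_date column here is an assumption, not something already in the spreadsheet:

import pandas as pd

# Hypothetical loans table: one row per (patient, device) loan with an explicit start date.
# Here it is derived from first_reading, which only works if that column gets filled in.
loans = devices[['patient_id', 'device_id', 'first_reading']].rename(
    columns={'first_reading': 'start_date'})
loans['start_date'] = pd.to_datetime(loans['start_date'])
readings['date_recorded'] = pd.to_datetime(readings['date_recorded'])

# For each reading, take the latest loan of that device that started on or before the reading date
matched = pd.merge_asof(
    readings.sort_values('date_recorded'),
    loans.dropna(subset=['start_date']).sort_values('start_date'),
    left_on='date_recorded', right_on='start_date',
    by='device_id', direction='backward')
# Devices that never appear in the loans table end up with a NaN patient_id

Readings for a device like 12908 would then keep a missing patient_id, which matches the "no id(?)" row in the desired output.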

How to efficiently mix groups of columns in a pandas dataframe?

I have a dataframe in Python:
Winner_height  Winner_rank  Loser_height  Loser_rank
183            15           185           32
195            42           178           12
And I would like to get a mixed dataframe keeping the information about both players, plus a field that identifies the winner (0 if it is player 1, 1 if it is player 2), as below:
Player_1_height  Player_1_rank  Player_2_height  Player_2_rank  Winner
183              15             185              32             0
178              12             195              42             1
Is there an efficient way to mix groups of columns with pandas, i.e. without drawing a random number for each row and creating a duplicate database?
Thanks in advance
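A hedged sketch of one vectorized way to do the mixing (column names follow the example above): draw all the swap decisions up front as a single boolean array, then use np.where to pick from the winner or loser column blocks in bulk, so there is no per-row Python loop and no duplicated frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Winner_height': [183, 195], 'Winner_rank': [15, 42],
    'Loser_height': [185, 178], 'Loser_rank': [32, 12],
})

rng = np.random.default_rng(0)
swap = rng.integers(0, 2, size=len(df)).astype(bool)  # True -> the winner is listed as player 2

out = pd.DataFrame({
    'Player_1_height': np.where(swap, df['Loser_height'], df['Winner_height']),
    'Player_1_rank':   np.where(swap, df['Loser_rank'],   df['Winner_rank']),
    'Player_2_height': np.where(swap, df['Winner_height'], df['Loser_height']),
    'Player_2_rank':   np.where(swap, df['Winner_rank'],   df['Loser_rank']),
    'Winner':          swap.astype(int),  # 0 if player 1 won, 1 if player 2 won
})
print(out)

The randomness is still per row, but it is generated in one call and applied column-block by column-block, which is the part that keeps it fast.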

Updating values for a subset of a subset of a pandas dataframe too slow for large data set

Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient.
Issue I'm Having: My code works on a small dataset, but the actual data set is about 5GB and 13M rows. The code has been running for several days now and still hasn't finished. For background, my code is in a Jupyter notebook running on a standard work PC.
Sample Data
import pandas as pd
df = pd.DataFrame({"PatientAccountNumber": [113, 113, 113, 113, 225, 225, 225, 225, 225, 225, 225],
                   "TransactionCode": ['50', '50', '77', '60', '22', '77', '25', '77', '25', '77', '77'],
                   "Bucket": ['Charity', 'Charity', 'Bad Debt', '3rd Party', 'Self Pay', 'Bad Debt',
                              'Charity', 'Bad Debt', 'Charity', 'Bad Debt', 'Bad Debt']})
print(df)
Sample Dataframe
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
7 225 77 Bad Debt
8 225 25 Charity
9 225 77 Bad Debt
10 225 77 Bad Debt
Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    df.drop(df[mask].index[1:], inplace=True)
print(df)
Desired Result (Each patient should have a maximum of one Bad Debt transaction)
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Alternate Solution
for account in df['PatientAccountNumber'].unique():
    mask = (df['PatientAccountNumber'] == account) & (df['Bucket'] == 'Bad Debt')
    mask = mask & (mask.cumsum() > 1)
    df.loc[mask, 'Bucket'] = 'DELETE'
df = df[df['Bucket'] != 'DELETE']
Attempted using Dask
I thought maybe Dask would be able to help me out, but I got the following error codes:
Using Dask on first solution - "NotImplementedError: Series getitem in only supported for other series objects with matching partition structure"
Using Dask on second solution - "TypeError: '_LocIndexer' object does not support item assignment"
You can solve this using df.duplicated on both PatientAccountNumber and Bucket and checking whether the Bucket is Bad Debt:
df[~(df.duplicated(['PatientAccountNumber','Bucket']) & df['Bucket'].eq("Bad Debt"))]
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
Create a boolean mask without a loop:
mask = df[df['Bucket'].eq('Bad Debt')].duplicated('PatientAccountNumber')
df.drop(mask[mask].index, inplace=True)
>>> df
PatientAccountNumber TransactionCode Bucket
0 113 50 Charity
1 113 50 Charity
2 113 77 Bad Debt
3 113 60 3rd Party
4 225 22 Self Pay
5 225 77 Bad Debt
6 225 25 Charity
8 225 25 Charity
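For the 13M-row case, a hedged sketch of a fully vectorized version of the alternate solution above: compute the Bad Debt mask once and keep only the first Bad Debt row per account using a grouped cumulative sum, with no Python-level loop over accounts:

bad_debt = df['Bucket'].eq('Bad Debt')
# Per-account running count of Bad Debt rows (cast to int for portability across pandas versions)
running = bad_debt.astype(int).groupby(df['PatientAccountNumber']).cumsum()
# Drop any Bad Debt row that is not the first one for its account
df = df[~(bad_debt & (running > 1))]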

Pandas Custom Groupby Percentages of Total

For a school project I need to implement the following function.
Make a function select(df, col1, col2) that takes a data frame and two column labels and outputs a multi-indexed Series with the fraction of occurrences of the possible values of col2 given the values of col1.
For example select(df_test, 'Do you ever gamble?', 'Lottery Type') would yield
No risk yes 0.433099
risk no 0.566901
Yes risk yes 0.548872
risk no 0.451128
Note that the sum of Lottery Type:risk yes + Lottery Type:risk no is 1.0.
It was a much larger dataframe, but I managed to group and aggregate it to a point using gr = df.groupby([col1, col2], as_index=True).count(), which resulted in the smallish dataframe below:
Do you ever smoke cigarettes? Do you ever drink alcohol? Have you ever been skydiving? Do you ever drive above the speed limit? Have you ever cheated on your significant other? Do you eat steak? How do you like your steak prepared? Gender Age Household Income Education Location (Census Region)
Do you ever gamble? Lottery Type
No risk no 155 157 156 157 155 157 121 147 147 121 147 145
risk yes 120 120 120 119 120 120 89 117 117 94 116 117
Yes risk no 114 114 113 113 114 114 99 110 110 96 109 110
risk yes 141 142 141 142 142 141 116 133 133 113 133 133
The code output looks messy, so the table above shows the resulting dataframe. My question is: how can I aggregate to get, say, the percentage of people who don't smoke and the percentage of people who do? I tried using custom aggregation functions but couldn't figure it out. Using the function below just throws a type error.
.agg(lambda x: sum(x)/len(x))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Have a look at pivot_table. https://stackoverflow.com/a/40302194/5478373 has a good example of how to use pivot_table to sum the totals and then divide the results by that total and multiply by 100.
group = (pd.pivot_table(df,
                        ...,
                        aggfunc=np.sum)
           .div(len(df.index))
           .mul(100))
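For the fractions the assignment asks for, a more direct route (a hedged sketch, not from the linked answer) is value_counts with normalize=True, which already returns a multi-indexed Series of within-group fractions:

def select(df, col1, col2):
    # Fraction of each col2 value within each col1 group; the fractions sum to 1.0 per group
    return df.groupby(col1)[col2].value_counts(normalize=True)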

Pandas data pull - messy strings to float

I am new to Pandas and I am just starting to take in the versatility of the package. While working with a small practice csv file, I pulled the following data in:
Rank Corporation Sector Headquarters Revenue (thousand PLN) Profit (thousand PLN) Employees
1.ÿ PKN Orlen SA oil and gas P?ock 79 037 121 2 396 447 4,445
2.ÿ Lotos Group SA oil and gas Gda?sk 29 258 539 584 878 5,168
3.ÿ PGE SA energy Warsaw 28 111 354 6 165 394 44,317
4.ÿ Jer¢nimo Martins retail Kostrzyn 25 285 407 N/A 36,419
5.ÿ PGNiG SA oil and gas Warsaw 23 003 534 1 711 787 33,071
6.ÿ Tauron Group SA energy Katowice 20 755 222 1 565 936 26,710
7.ÿ KGHM Polska Mied? SA mining Lubin 20 097 392 13 653 597 18,578
8.ÿ Metro Group Poland retail Warsaw 17 200 000 N/A 22,556
9.ÿ Fiat Auto Poland SA automotive Bielsko-Bia?a 16 513 651 83 919 5,303
10.ÿ Orange Polska telecommunications Warsaw 14 922 000 1 785 000 23,805
I have two serious problems with it that I cannot seem to find solution for:
1) Data in the "Revenue" and "Profit" columns is pulled in as strings because of the funny formatting with spaces between thousands, and I cannot figure out how to make Pandas translate them into floating point values.
2) Data under the "Rank" column is pulled in as "1.?", "2.?", etc. What's happening there? Again, when I try to re-write this data with something more appropriate like "1.", "2.", etc., the DataFrame just does not budge.
Ideas? Suggestions? I am also open for outright bashing because my problem might be quite obvious and silly - excuse my lack of experience then :)
I would use the converters parameter.
Pass this to your pd.read_csv call:
def space_float(x):
    return float(x.replace(' ', ''))

converters = {
    'Revenue (thousand PLN)': space_float,
    'Profit (thousand PLN)': space_float,
    'Rank': str.strip,
}

pd.read_csv(..., converters=converters, ...)
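As a hedged aside (not part of the original answer), read_csv can also parse the space-grouped numbers directly via its thousands parameter, and specifying the file's encoding may explain the stray characters in Rank; the file name and encoding below are placeholders:

import pandas as pd

# 'companies.csv' and 'cp1250' are assumptions; substitute your actual file and encoding
df = pd.read_csv('companies.csv', thousands=' ', encoding='cp1250')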
