Pandas data pull - messy strings to float - python

I am new to Pandas and I am just starting to take in the versatility of the package. While working with a small practice csv file, I pulled the following data in:
Rank Corporation Sector Headquarters Revenue (thousand PLN) Profit (thousand PLN) Employees
1.ÿ PKN Orlen SA oil and gas P?ock 79 037 121 2 396 447 4,445
2.ÿ Lotos Group SA oil and gas Gda?sk 29 258 539 584 878 5,168
3.ÿ PGE SA energy Warsaw 28 111 354 6 165 394 44,317
4.ÿ Jer¢nimo Martins retail Kostrzyn 25 285 407 N/A 36,419
5.ÿ PGNiG SA oil and gas Warsaw 23 003 534 1 711 787 33,071
6.ÿ Tauron Group SA energy Katowice 20 755 222 1 565 936 26,710
7.ÿ KGHM Polska Mied? SA mining Lubin 20 097 392 13 653 597 18,578
8.ÿ Metro Group Poland retail Warsaw 17 200 000 N/A 22,556
9.ÿ Fiat Auto Poland SA automotive Bielsko-Bia?a 16 513 651 83 919 5,303
10.ÿ Orange Polska telecommunications Warsaw 14 922 000 1 785 000 23,805
I have two serious problems with it that I cannot seem to find solution for:
1) data in "Ravenue" and "Profit" columns is pulled in as strings because of funny formatting with spaces between thousands, and I cannot seem to figure out how to make Pandas translate into floating point values.
2) Data under "Rank" column is pulled in as "1.?", "2.?" etc. What's happening there? Again, when I am trying to re-write this data with something more appropriate like "1.", "2." etc. the DataFrame just does not budge.
Ideas? Suggestions? I am also open for outright bashing because my problem might be quite obvious and silly - excuse my lack of experience then :)

I would use the converters parameter.
Pass this to your pd.read_csv call:
def space_float(x):
    return float(x.replace(' ', ''))

converters = {
    'Revenue (thousand PLN)': space_float,
    'Profit (thousand PLN)': space_float,
    'Rank': str.strip,
}

pd.read_csv(..., converters=converters, ...)
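Putting it together, here is a minimal sketch of a full call. The filename is a placeholder, the exact stray characters in "Rank" depend on your file's encoding, and the N/A handling is an assumption based on the sample data shown above:
import pandas as pd

def space_float(x):
    # remove thousands separators (regular or non-breaking spaces); treat N/A as missing
    x = x.replace('\xa0', '').replace(' ', '').strip()
    return float('nan') if x in ('', 'N/A') else float(x)

converters = {
    'Revenue (thousand PLN)': space_float,
    'Profit (thousand PLN)': space_float,
    'Rank': lambda r: r.rstrip('\xa0ÿ? '),  # drop the trailing junk, keep "1.", "2.", ...
}

# 'companies.csv' is a placeholder; if the stray characters come from a codepage
# mismatch, passing an explicit encoding= to read_csv may fix them at the source.
df = pd.read_csv('companies.csv', converters=converters)
print(df.dtypes)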

Related

Pandas VLOOKUP with an ID and a date range?

I am working on a project where I am pulling blood pressure readings for our patients from a third-party device API using Python, and inserting them into a SQL Server database.
The data that is pulled basically gives the device ID, the date of reading, and the measurements. No patient identifier.
The team that uses this data wants me to put it into our SQL database and additionally attach a patient ID to each reading so they can pull the data themselves and know which patient the reading corresponds to.
They have an Excel spreadsheet they manually fill out that has a patient ID and their device ID. When a patient is done with this health program, they return the device to their provider, and that device is then loaned to another patient starting the program. So one device may be used by multiple patients. Or sometimes a patient's device malfunctions and they get a new one, so one patient may get multiple devices.
The spreadsheet has first/last reading date columns, but they don't seem to be filled out consistently.
Here is a barebones example of the readings dataframe:
reading_id device_id date_recorded systolic diastolic
123 42107 2022-10-31 194 104
126 42107 2022-11-01 195 103
122 42102 2022-11-03 180 90
107 36781 2022-11-04 110 70
111 36781 2022-11-05 140 85
321 42107 2022-11-06 180 95
432 42107 2022-11-07 130 60
234 50192 2022-11-08 120 75
101 61093 2022-11-11 140 90
333 42107 2022-11-15 130 60
561 12908 2022-11-18 120 90
And an example of the devices spreadsheet that I imported as a dataframe:
patient_id patient_num device_id pat_name first_reading last_reading
32149 1 42107 bob 2022-10-31 2022-11-01
41105 2 42102 jess
21850 3 42107 james 2022-11-07
32109 4 36781 patrick 2022-11-05
32109 4 50192 patrick
10824 5 61093 john 2022-11-11 2022-11-11
10233 6 42107 ashley
patient_num is just which # patient in the program they are. patient_id is their ID in our EHR that we would use to look them up. As far as I can tell, if last_reading is filled in, then the patient is done with that device. And if there is nothing in first_reading or last_reading, that patient is still using that device.
So as we can see, device 42107 was first used by bob, but he quit the program on 2022-11-01. Then james started using device 42107 until he too quit the program on 2022-11-07. Finally, it seems ashley is still using that device to this day.
patrick was using device 36781 until it malfunctioned on 2022-11-05. Then he got a new device, 50192, and has been using it since.
Finally, I noticed that there are device_ids in the readings that are not in the devices spreadsheet. I'm not sure how to handle those.
This is the output I want:
reading_id device_id date_recorded systolic diastolic patient_id
123 42107 2022-10-31 194 104 32149
126 42107 2022-11-01 195 103 32149
122 42102 2022-11-03 180 90 41105
107 36781 2022-11-04 110 70 32109
111 36781 2022-11-05 140 85 32109
321 42107 2022-11-06 180 95 21850
432 42107 2022-11-07 130 60 21850
234 50192 2022-11-08 120 75 32109
101 61093 2022-11-11 140 90 10824
333 42107 2022-11-15 130 60 10233
561 12908 2022-11-18 120 90 no id(?)
Is there enough data in the devices spreadsheet to achieve this? Or is it not possible with the missing dates in first/last reading, plus the missing device_id-to-patient_id mappings in the sheet? I wanted to ask that team to put in a "start date" and "end date" for each patient's loan duration, though I would have to take into account patients with no "end date" yet if they are still using the device.
I've tried making a function filtering the devices dataframe based on the device ID and date recorded and using df.apply but I think I kept getting errors due to missing data. Also I just suck, but I am still learning.
Thanks for any guidance!
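For what it's worth, here is a minimal sketch of the date-window lookup described above (readings and devices are the two dataframes shown; column names follow the question). It only gives an unambiguous answer when a device's loan windows don't overlap; with the open-ended rows (no first/last reading) a tie-break rule is still needed, which is why asking the team for explicit start/end dates is a good idea:
import pandas as pd

readings['date_recorded'] = pd.to_datetime(readings['date_recorded'])
devices['start'] = pd.to_datetime(devices['first_reading']).fillna(pd.Timestamp.min)
devices['end'] = pd.to_datetime(devices['last_reading']).fillna(pd.Timestamp.max)

# Pair each reading with every loan record for the same device,
# then keep only the pairs whose date falls inside the loan window.
pairs = readings.merge(devices[['device_id', 'patient_id', 'start', 'end']],
                       on='device_id', how='left')
in_window = pairs['date_recorded'].between(pairs['start'], pairs['end'])
pairs = pairs[in_window | pairs['patient_id'].isna()]  # devices not in the sheet keep NaN

# A reading can still match several patients if windows overlap; resolve that
# before (or instead of) this final merge.
result = readings.merge(pairs[['reading_id', 'patient_id']],
                        on='reading_id', how='left')
print(result)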

How to scrape tbody from a collapsible table using BeautifulSoup library?

Recently I did a project based on a COVID-19 dashboard, where I scrape data from this website, which has a collapsible table. Everything was OK until now; recently the Heroku app started showing some errors. So I reran my code on my local machine, and the error occurred while scraping tbody. I then figured out that the site I scrape data from has changed or updated the way it looks (the table), and now my code is not able to grab it. I tried viewing the page source and I cannot find the table (tbody) that is on the page, but I can find tbody and all the data if I inspect a row of the table. How can I scrape the table now?
My code:
The table I have to grab:
The data you see on the page is loaded from an external URL via Ajax. You can use the requests/json modules to load it:
import json
import requests

url = 'https://www.mohfw.gov.in/data/datanew.json'
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

# print some data on screen:
for d in data:
    print('{:<30} {:<10} {:<10} {:<10} {:<10}'.format(d['state_name'], d['active'], d['positive'], d['cured'], d['death']))
Prints:
Andaman and Nicobar Islands 329 548 214 5
Andhra Pradesh 75720 140933 63864 1349
Arunachal Pradesh 670 1591 918 3
Assam 9814 40269 30357 98
Bihar 17579 51233 33358 296
Chandigarh 369 1051 667 15
Chhattisgarh 2803 9086 6230 53
... and so on.
Try:
import json
import requests
import pandas as pd

data = []
row = []

r = requests.get('https://www.mohfw.gov.in/data/datanew.json')
j = json.loads(r.text)

for i in j:
    for k in i:
        row.append(i[k])
    data.append(row)
    row = []

columns = [i for i in j[0]]
df = pd.DataFrame(data, columns=columns)
df['sno'] = pd.to_numeric(df['sno'], errors='coerce')  # 'sno' arrives as strings
df = df.sort_values('sno').reset_index(drop=True)
print(df.to_string())
prints:
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 0 Andaman and Nicobar Islands 329 548 214 5 403 636 226 7 35
1 1 Andhra Pradesh 75720 140933 63864 1349 72188 150209 76614 1407 28
2 2 Arunachal Pradesh 670 1591 918 3 701 1673 969 3 12
3 3 Assam 9814 40269 30357 98 10183 41726 31442 101 18
4 4 Bihar 17579 51233 33358 296 18937 54240 34994 309 10
5 5 Chandigarh 369 1051 667 15 378 1079 683 18 04
6 6 Chhattisgarh 2803 9086 6230 53 2720 9385 6610 55 22
7 7 Dadra and Nagar Haveli and Daman and Diu 412 1100 686 2 418 1145 725 2 26
8 8 Delhi 10705 135598 120930 3963 10596 136716 122131 3989 07
9 9 Goa 1657 5913 4211 45 1707 6193 4438 48 30
10 10 Gujarat 14090 61438 44907 2441 14300 62463 45699 2464 24
and so on...
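Since the JSON is already a list of flat dictionaries, the row-building loop can be skipped entirely; a shorter sketch of the same idea (assuming the endpoint still returns that structure):
import requests
import pandas as pd

data = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(data)  # one column per JSON key
df['sno'] = pd.to_numeric(df['sno'], errors='coerce')
df = df.sort_values('sno').reset_index(drop=True)
print(df.to_string())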

Conditional count column in Pandas where separate strings match in multiple columns

I am attempting to recreate this report I have in Excel:
Dealer Net NetSold NetRatio Phone PhSold PhRatio WalkIn WInSold WInRatio
Ford 671 31 4.62% 127 21 16.54% 93 24 25.81%
Toyota 863 37 4.29% 125 39 31.20% 97 32 32.99%
Chevy 826 67 8.11% 160 41 25.63% 224 126 56.25%
Dodge 1006 55 5.47% 121 28 23.14% 242 87 35.95%
Kia 910 57 6.26% 123 36 29.27% 202 92 45.54%
VW 1029 84 8.16% 316 65 20.57% 329 148 44.98%
Lexus 1250 73 5.84% 137 36 26.28% 138 69 50.00%
Total 6555 404 6.16% 1109 266 23.99% 1325 578 43.62%
Out of a csv that looks like this:
Dealer LeadType LeadStatusType
Chevy Internet Active
Ford Internet Active
Ford Internet Sold
Toyota Internet Active
VW Walk-in Sold
Kia Internet Active
Dodge Internet Active
There's more data in the csv than that, which will be used in other pages of this report, but I'm really only looking to solve the part that I'm stuck on now, as I want to learn as much as possible and make sure that I'm on an okay track to keep progressing.
I was able to get close to where I think I need to be with the following:
lead_counts = df.groupby('Dealer')['Lead Type'].value_counts().unstack()
which of course gives pretty data summing up the leads by type. The issue is that I now need to insert calculated columns based on other fields. For example: For each dealer, count the number of leads that are both LeadType='Internet' AND LeadStatusType='Sold'.
I've honestly tried so many things that I'm not going to be able to remember them all.
def leads_by_type(x):
    for dealer in dealers:
        return len(df[(df['Dealer'] == dealer) & (df['Lead Type'] == 'Internet') & (df['Lead Status Type'] == 'Sold')])
I tried something like this, where I can reliably get the data I'm looking for, but I can't really figure out how to apply it to a column.
I've tried simply:
lead_counts['NetSold'] = len(df[(df['Dealer'] == dealer) &(df['Lead Type'] == 'Internet') & (df['Lead Status Type'] == 'Sold')])
Any advice for how to proceed, or am I going about this the wrong way already? This is all very doable in Excel, and I often get asked why I'm trying to replicate it in Python. The answer is just automation and learning.
I know some of the columns don't exactly match up between the table and the code; this is just because I shortened some of them in the table to clean it up for posting.
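One possible way to add those conditional columns without apply is to build a boolean mask for the combined condition and count it per dealer with groupby. This is only a sketch; the column names ('Lead Type', 'Lead Status Type') follow the code in the question rather than the shortened table headers:
# Leads that are both Internet and Sold, counted per dealer
sold_mask = (df['Lead Type'] == 'Internet') & (df['Lead Status Type'] == 'Sold')
net_sold = sold_mask.groupby(df['Dealer']).sum()

lead_counts = df.groupby('Dealer')['Lead Type'].value_counts().unstack(fill_value=0)
lead_counts['NetSold'] = net_sold  # aligns on the Dealer index
lead_counts['NetRatio'] = lead_counts['NetSold'] / lead_counts['Internet']
print(lead_counts)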

Pandas column reformatting

Any quick way to achieve the below output, please?
Input:
Code Items
123 eq-hk
456 ca-eu; tp-lbe
789 ca-us
321 go-ch
654 ca-au; go-au
987 go-jp
147 co-ml; go-ml
258 ca-us
369 ca-us; ca-my
741 ca-us
852 ca-eu
963 ca-ml; co-ml; go-ml
Output:
Code eq ca go co tp
123 hk
456 eu lbe
789 us
321 ch
654 au au
987 jp
147 ml ml
258 us
369 us,my
741 us
852 eu
963 ml ml ml
I am again running into loops and very ugly code to make it work. Is there an elegant way to achieve this, please?
Thank you!
This is a little bit complicated:
(df.set_index('Code')
   .Items.str.split(';', expand=True)
   .stack()
   .str.strip()                      # drop the space left after each ';'
   .str.split('-', expand=True)
   .set_index(0, append=True)[1]
   .unstack()
   .fillna('')
   .sum(level=0))
0 ca co eq go tp
Code
123 hk
147 ml ml
258 us
321 ch
369 usmy
456 eu lbe
654 au au
741 us
789 us
852 eu
963 ml ml ml
987 jp
# use str.split on ';' to unnest the column,
# then stack, str.split again on '-', and set the key column as an extra index level;
# after unstack we get the result
List comprehensions work better (read: much faster) for string problems like this which require multiple levels of splitting.
df2 = pd.DataFrame([
    dict(y.split('-') for y in x.split('; '))
    for x in df.Items]).fillna('')
df2.insert(0, 'Code', df.Code)
print(df2)
Code ca co eq go tp
0 123 hk
1 456 eu lbe
2 789 us
3 321 ch
4 654 au au
5 987 jp
6 147 ml ml
7 258 us
8 369 my # Should be "us,my"... see below.
9 741 us
10 852 eu
11 963 ml ml ml
This does not handle the situation where multiple items with the same key can be present in a row. For that, a slightly more involved solution is needed.
from itertools import chain

v = [x.split('; ') for x in df.Items]
X = pd.Series(df.Code.values.repeat([len(x) for x in v]))
Y = pd.DataFrame([x.split('-') for x in chain.from_iterable(v)])

df2 = pd.concat([X, Y], axis=1, ignore_index=True)
df2[3] = df2.groupby([0, 1]).cumcount()  # per-(Code, key) counter so repeated keys (e.g. 369's two "ca" items) survive the unstack

(df2.set_index([0, 1, 3])[2]
    .unstack(1)
    .fillna('')
    .groupby(level=0)
    .agg(lambda x: ','.join(x).strip(',')))
1 ca co eq go tp
0
123 hk
147 ml ml
258 us
321 ch
369 us,my
456 eu lbe
654 au au
741 us
789 us
852 eu
963 ml ml ml
987 jp
import pandas as pd
df = pd.DataFrame([
        ('123', 'eq-hk'),
        ('456', 'ca-eu; tp-lbe'),
        ('789', 'ca-us'),
        ('321', 'go-ch'),
        ('654', 'ca-au; go-au'),
        ('987', 'go-jp'),
        ('147', 'co-ml; go-ml'),
        ('258', 'ca-us'),
        ('369', 'ca-us; ca-my'),
        ('741', 'ca-us'),
        ('852', 'ca-eu'),
        ('963', 'ca-ml; co-ml; go-ml')],
    columns=['Code', 'Items'])
# Get item type list from each row, sum (concatenate) the lists and convert
# to a set to remove duplicates
item_types = set(df['Items'].str.findall(r'(\w+)-').sum())
print(item_types)
# {'ca', 'co', 'eq', 'go', 'tp'}
# Generate a column for each item type
df1 = pd.DataFrame(df['Code'])
for t in item_types:
    df1[t] = df['Items'].str.findall(r'%s-(\w+)' % t).apply(lambda x: ''.join(x))
print(df1)
# Code ca tp eq co go
#0 123 hk
#1 456 eu lbe
#2 789 us
#3 321 ch
#4 654 au au
#5 987 jp
#6 147 ml ml
#7 258 us
#8 369 usmy
#9 741 us
#10 852 eu
#11 963 ml ml ml
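For completeness, the same reshaping can also be written with explode and pivot_table on pandas 0.25+; this is only a sketch along the lines of the answers above, with the comma-joining handled by the aggregation function:
import pandas as pd

# Split each row into one (Code, key, value) record, then pivot.
tmp = df.assign(Items=df['Items'].str.split('; ')).explode('Items')
tmp[['key', 'value']] = tmp['Items'].str.split('-', expand=True)

out = (tmp.pivot_table(index='Code', columns='key', values='value',
                       aggfunc=','.join, fill_value='')
          .reset_index())
print(out)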

Pivot tables using pandas

I have the following dataframe:
df1= df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]
This will produce 1.4 million records; I've taken the first 12.
Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011
I then filter on ['nat_actn_2_3'] for certain hiring codes.
h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
and can look at value_counts() to see total hires by region.
total_newhires = h3['regions'].value_counts()
total_newhires
produces:
Out[38]:
Pacific Southwest Region (R5) 42255
Pacific Northwest Region (R6) 32081
Intermountain Region (R4) 24045
Northern Region (R1) 22822
Rocky Mountain Region (R2) 17481
Southwest Region (R3) 17305
Eastern Region (R9) 11034
Research & Development(RES) 7337
Southern Region (R8) 7288
Albuquerque Service Center(ASC) 7032
Washington Office(WO) 4837
Alaska Region (R10) 4210
Job Corps(JC) 4010
nda 438
I'd like to do something like in Excel, where I can have ['regions'] as my rows and ['fy'] as the columns to give me a total count based on ['ssno'] for each ['fy']. It would also be nice to eventually do calculations based on the numbers too, like averages and sums.
Along with looking at the examples at http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:
hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])
I'm wondering if groupby may be what I'm looking for?
Any help is appreciated. I've spent 3 days on this and can't seem to put it together.
So based on the answer below, I did a pivot using the following code:
h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len).
Which produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as a value, my results came out really strange and didn't match my value_counts() numbers. Not sure if the pivot eliminates duplicates or what, but it did not come out correctly.
ssno
fy 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
nat_actn_2_3
100 34 20 25 18 38 43 45 14 19 25 10
101 510 453 725 795 1029 1293 957 383 470 605 145
108 170 132 112 85 123 127 84 43 40 29 10
115 9203 8972 7946 9038 10139 10480 9211 8735 10482 11258 339
130 299 313 431 324 291 325 336 202 230 436 112
140 62 74 71 75 132 125 82 42 45 74 18
141 20 16 23 17 20 14 10 9 13 17 7
170 202 433 226 278 336 386 284 265 121 118 49
171 4771 4627 4234 4196 4470 4472 3270 3145 354 341 34
190 1 1 NaN NaN NaN 1 NaN NaN NaN NaN NaN
702 3141 3099 3429 3030 3758 3952 3813 2902 2329 2375 650
703 2280 2354 2225 2050 2260 2328 2172 2503 2649 2856 726
Try it like this:
h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)
To get counts, use aggfunc=len.
Also, your isin references a list of strings, but the data you provide for column 'nat_actn_2_3' is int.
If you have an older version of pandas, try:
h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
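Since the question also asks whether groupby is the right tool: a minimal sketch of the groupby route, assuming h3 is the filtered, de-duplicated frame built above. groupby(...).size() counts every remaining row, so it sidesteps questions about missing values in whichever column is used as values:
# Count hires per region and fiscal year
hires_by_fy = (h3.groupby(['regions', 'fy'])
                 .size()
                 .unstack('fy', fill_value=0))
print(hires_by_fy)

# Follow-up calculations are then ordinary column/row math, e.g.:
hires_by_fy['total'] = hires_by_fy.sum(axis=1)
hires_by_fy.loc['All regions'] = hires_by_fy.sum()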
