Print column with a specific value in Python

I am using Colab. I am trying to print data for only the NY, NC, and SC states:
confirmed_cases_USA, deaths_USA = get_confirmed_deaths_tuple_df(USA_covid_data)

# selecting rows based on condition PA, IL, OH, GA, NC
options = ['NC', 'PA']
# options = ['NC', 'PA', 'IL', 'OH', 'GA']
confirmed_cases_Selected = confirmed_cases_USA[confirmed_cases_USA['State'].isin(options)]
deaths_Selected = deaths_USA[deaths_USA['State'].isin(options)]
print(confirmed_cases_Selected.head())
print(deaths_Selected.head())
The output is:
      countyFIPS            County Name State  ... 9/18/20 9/19/20 9/20/20
1921           0  Statewide Unallocated    NC  ...    1166    1166    1166
1922       37001        Alamance County    NC  ...    3695    3728    3749
1923       37003       Alexander County    NC  ...     483     485     488
1924       37005       Alleghany County    NC  ...     219     220     220
1925       37007           Anson County    NC  ...     549     552     553
      countyFIPS            County Name State  ... 9/18/20 9/19/20 9/20/20
1921           0  Statewide Unallocated    NC  ...       0       0       0
1922       37001        Alamance County    NC  ...      48      54      54
1923       37003       Alexander County    NC  ...       5       5       5
1924       37005       Alleghany County    NC  ...       0       0       0
1925       37007           Anson County    NC  ...       4       4       4
I am also trying to group the data by state and then get the total confirmed cases for each state.

I'm not sure what get_confirmed_deaths_tuple_df does, but it doesn't look like a DataFrame.
USA_covid_data['State'].isin(options) should return a boolean mask containing True and False. Keep the rows that satisfy the True condition with USA_covid_data[USA_covid_data['State'].isin(options)].
It should look something like this:
confirmed_cases_USA, deaths_USA = get_confirmed_deaths_tuple_df(USA_covid_data[USA_covid_data['State'].isin(options)])
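For the grouping part of the question, a minimal sketch (assuming, as in the output above, that every column other than countyFIPS, 'County Name', and 'State' holds the daily cumulative counts):

# total confirmed cases per state: drop the non-numeric key columns,
# then group the county rows by state and sum each date column
totals_by_state = (
    confirmed_cases_Selected
    .drop(columns=['countyFIPS', 'County Name'])
    .groupby('State')
    .sum()
)
print(totals_by_state)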

For loop with non-consecutive indices

I'm quite new to Python and working with data frames, so this might be a very simple problem.
I successfully imported some measurement data (1-minute resolution) and did some calculations on it. I want to redo some of the data processing on a 15-minute basis (not averaging), for which I extracted every row at :00, :15, :30 and :45 from the original data frame:
df_interval = df[(df['DateTime'].dt.minute == 0) | (df['DateTime'].dt.minute == 15) | (df['DateTime'].dt.minute == 30) | (df['DateTime'].dt.minute == 45)]
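As a side note, an equivalent and more compact filter (assuming DateTime is a datetime64 column; minutes 0, 15, 30 and 45 are exactly those divisible by 15) would be:

df_interval = df[df['DateTime'].dt.minute % 15 == 0]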
The extraction seems to work fine. Now I want to recalculate the concentration every 15 minutes based on what the instrument is internally doing, which is a simple formula.
So what I tried is:
for i in df_interval.index:
    if np.isnan(df_interval.ATN[i]) == False and np.isnan(df_interval.ATN[i+1]) == False:
        df_15min = (0.785 * ((df_interval.ATN[i+1] - df_interval.ATN[i]) / 100)) / (df_interval.Flow[i] * (1-0.07) * 10.8 * (1 - df_interval.K[i] * df_interval.ATN[i]) * 15)
However, I end up with KeyError: 226, and I don't understand why...
Update:
Here is the data; the last column (df_15min) also shows the result that I want to get:
index    ATN          Flow  K        df_15min
150                   3647  0.00994
165                   3634  0.00996
180                   3634  0.00995
195                   3621  0.00995
210                   3615  0.00994
225      1.703678939  3754  0.00994  3.75E-08
240      4.356519267  3741  0.00994  3.84E-08
255      6.997422571  3741  0.00994  3.94E-08
270      9.627710046  3736  0.00995  4.02E-08
285      12.23379251  3728  0.01007  3.89E-08
300      14.67175418  3727  0.01026  3.76E-08
315      16.9583747   3714  0.01043  3.73E-08
330      19.1497249   3714  0.01061  3.96E-08
345      21.39628083  3709  0.01079  3.87E-08
360      23.51512717  3701  0.01086  4.02E-08
375      25.63995721  3700  0.01083  3.90E-08
390      27.63886191  3688  0.0108   3.47E-08
405      29.36343728  3688  0.01076  3.68E-08
420      31.14291069  3677  0.01072
I do a lot of things in Igor, so this is how I would do it there (unfortunately for me, it has to be in Python this time):
variable i
For (i=0; i<numpnts(ATN)-1; i+=1)
    df_15min[i] = (0.785 *((ATN[i+1]-ATN[i])/100))/(Flow[i]*(1-0.07)*10.8*(1-K[i]*ATN[i])*15)
endfor
Any help would be appreciated, thanks!
You can write literally the same operation as vectorized code. Just use the whole columns and shift(-1) to get the "next" row.
df['df_15min'] = (0.785 *((df['ATN'].shift(-1)-df['ATN'])/100))/(df['Flow']*(1-0.07)*10.8*(1-df['K']*df['ATN'])*15)
Or using diff:
df['df_15min'] = (0.785 *((-df['ATN'].diff(-1))/100))/(df['Flow']*(1-0.07)*10.8*(1-df['K']*df['ATN'])*15)
output:
ATN Flow K df_15min
index
150 NaN 3647 0.00994 NaN
165 NaN 3634 0.00996 NaN
180 NaN 3634 0.00995 NaN
195 NaN 3621 0.00995 NaN
210 NaN 3615 0.00994 NaN
225 1.703679 3754 0.00994 3.745468e-08
240 4.356519 3741 0.00994 3.844700e-08
255 6.997423 3741 0.00994 3.937279e-08
270 9.627710 3736 0.00995 4.019633e-08
285 12.233793 3728 0.01007 3.886148e-08
300 14.671754 3727 0.01026 3.763219e-08
315 16.958375 3714 0.01043 3.734876e-08
330 19.149725 3714 0.01061 3.955360e-08
345 21.396281 3709 0.01079 3.870011e-08
360 23.515127 3701 0.01086 4.017342e-08
375 25.639957 3700 0.01083 3.897022e-08
390 27.638862 3688 0.01080 3.473242e-08
405 29.363437 3688 0.01076 3.675232e-08
420 31.142911 3677 0.01072 NaN
Also, about the KeyError: your if condition checks df_interval.ATN[i+1], but i+1 here is an index label, not a position. After filtering, the index is non-consecutive (..., 210, 225, 240, ...), so the label 226 does not exist and the lookup raises KeyError: 226. If you keep the loop, access rows by position (.iloc) instead of by label.
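A minimal loop-based sketch along those lines (it keeps your formula but indexes by position; the vectorized shift(-1) version above is still preferable):

import numpy as np

df_interval = df_interval.copy()        # avoid writing into a filtered view
df_interval['df_15min'] = np.nan
col = df_interval.columns.get_loc('df_15min')
for pos in range(len(df_interval) - 1):
    atn = df_interval['ATN'].iloc[pos]
    atn_next = df_interval['ATN'].iloc[pos + 1]
    if np.isnan(atn) or np.isnan(atn_next):
        continue
    df_interval.iloc[pos, col] = (0.785 * (atn_next - atn) / 100) / (
        df_interval['Flow'].iloc[pos] * (1 - 0.07) * 10.8
        * (1 - df_interval['K'].iloc[pos] * atn) * 15
    )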

How to scrape tbody from a collapsible table using BeautifulSoup library?

Recently I did a project based on a COVID-19 dashboard, for which I scrape data from this website, which has a collapsible table. Everything was OK until recently, when the Heroku app started showing some errors. So I reran my code on my local machine, and the error occurred while scraping tbody. I then figured out that the site I scrape has changed or updated the way the table looks, and my code can no longer grab it. When I view the page source I cannot find the table (tbody) that is on the page, but I can find tbody and all the data if I inspect a row of the table. How can I scrape the table now?
The data you see on the page is loaded from an external URL via Ajax. You can use the requests/json modules to load it:
import json
import requests

url = 'https://www.mohfw.gov.in/data/datanew.json'
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

# print some data on screen:
for d in data:
    print('{:<30} {:<10} {:<10} {:<10} {:<10}'.format(d['state_name'], d['active'], d['positive'], d['cured'], d['death']))
Prints:
Andaman and Nicobar Islands 329 548 214 5
Andhra Pradesh 75720 140933 63864 1349
Arunachal Pradesh 670 1591 918 3
Assam 9814 40269 30357 98
Bihar 17579 51233 33358 296
Chandigarh 369 1051 667 15
Chhattisgarh 2803 9086 6230 53
... and so on.
Try:
import json
import requests
import pandas as pd

data = []
row = []
r = requests.get('https://www.mohfw.gov.in/data/datanew.json')
j = json.loads(r.text)
for i in j:
    for k in i:
        row.append(i[k])
    data.append(row)
    row = []
columns = list(j[0])
df = pd.DataFrame(data, columns=columns)
df.sno = pd.to_numeric(df.sno, errors='coerce')
df = df.sort_values('sno').reset_index(drop=True)
print(df.to_string())
prints:
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 0 Andaman and Nicobar Islands 329 548 214 5 403 636 226 7 35
1 1 Andhra Pradesh 75720 140933 63864 1349 72188 150209 76614 1407 28
2 2 Arunachal Pradesh 670 1591 918 3 701 1673 969 3 12
3 3 Assam 9814 40269 30357 98 10183 41726 31442 101 18
4 4 Bihar 17579 51233 33358 296 18937 54240 34994 309 10
5 5 Chandigarh 369 1051 667 15 378 1079 683 18 04
6 6 Chhattisgarh 2803 9086 6230 53 2720 9385 6610 55 22
7 7 Dadra and Nagar Haveli and Daman and Diu 412 1100 686 2 418 1145 725 2 26
8 8 Delhi 10705 135598 120930 3963 10596 136716 122131 3989 07
9 9 Goa 1657 5913 4211 45 1707 6193 4438 48 30
10 10 Gujarat 14090 61438 44907 2441 14300 62463 45699 2464 24
and so on...
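As a side note, since the endpoint already returns a list of flat JSON records, the row-building loop can be replaced by passing the parsed list straight to the DataFrame constructor (a sketch):

import requests
import pandas as pd

# pandas builds one column per JSON key automatically
df = pd.DataFrame(requests.get('https://www.mohfw.gov.in/data/datanew.json').json())
df['sno'] = pd.to_numeric(df['sno'], errors='coerce')
df = df.sort_values('sno').reset_index(drop=True)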

Trying to lookup a value from a pandas dataframe within a range of two rows in the index dataframe

I have two dataframes - "grower_moo" and "pricing" in a Python notebook to analyze harvested crops and price payments to the growers.
pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load.
I need to pull the price per ton from the pricing index into a new column in the load data if the Fat of that load is not greater than the next Wet_Fat.
Below is a .head() sample of each dataframe and the code I tried. I received a "ValueError: Can only compare identically-labeled Series objects" error.
pricing
Price_Per_Ton Wet_Fat
0 306 10
1 339 11
2 382 12
3 430 13
4 481 14
5 532 15
6 580 16
7 625 17
8 665 18
9 700 19
10 728 20
11 750 21
12 766 22
13 778 23
14 788 24
15 797 25
grower_moo
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat
0 L2019000011817 56660 833 1.448872 21.92
1 L2019000011816 53680 1409 2.557679 21.12
2 L2019000011815 53560 1001 1.834644 21.36
3 L2019000011161 62320 2737 4.207080 21.41
4 L2019000011160 57940 1129 1.911324 20.06
grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])
Example output: a grower_moo['Fat'] of 13.60 is less than the next Wet_Fat of 14, and therefore gets the price per ton of $430.
grower_moo_with_price
Load Ticket Net Fruit Weight Net MOO Percent_MOO Fat price_per_ton
0 L2019000011817 56660 833 1.448872 21.92 750
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011160 57940 1129 1.911324 20.06 728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "backward" search [the default]
selects the last row in the right DataFrame whose ‘on’ key is less
than or equal to the left’s key.
In the following code, I use your example inputs, but with column names using underscores _ instead of spaces.
# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')
# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')
# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')
# Print result
res
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton Wet_Fat
0 L2019000011160 57940 1129 1.911324 20.06 728 20.0
1 L2019000011816 53680 1409 2.557679 21.12 750 21.0
2 L2019000011815 53560 1001 1.834644 21.36 750 21.0
3 L2019000011161 62320 2737 4.207080 21.41 750 21.0
4 L2019000011817 56660 833 1.448872 21.92 750 21.0
# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')
Load_Ticket Net_Fruit_Weight Net_MOO Percent_MOO Fat Price_Per_Ton
0 L2019000011160 57940 1129 1.911324 20.06 728
1 L2019000011816 53680 1409 2.557679 21.12 750
2 L2019000011815 53560 1001 1.834644 21.36 750
3 L2019000011161 62320 2737 4.207080 21.41 750
4 L2019000011817 56660 833 1.448872 21.92 750
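One caveat: merge_asof required sorting the left frame by Fat, so the result above is in Fat order rather than the original ticket order. If you need the original order back, you can carry the original index through the merge (a sketch, assuming pricing['Wet_Fat'] has already been cast to float as above):

res = (
    pd.merge_asof(
        grower_moo.sort_values('Fat').reset_index(),  # keep original index as a column
        pricing,
        left_on='Fat',
        right_on='Wet_Fat',
    )
    .set_index('index')   # restore the original row labels
    .sort_index()         # and the original row order
    .drop(columns='Wet_Fat')
)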
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']

Reshape data frame (with R or Python)

I want to know if it's possible to have this result:
example:
With this data frame
df
y Faisceaux destination Trajet RED_Groupe Nbr observation RED Pond Nbr observation total RED pct
1 2015 France DOM-TOM Aller 78248.47 87 85586.75 307 0.9142591 0.04187815
2 2015 Hors Schengen Aller 256817.64 234 195561.26 1194 1.3132337 0.06015340
3 2015 INTERNATIONAL Aller 258534.78 473 288856.53 2065 0.8950283 0.04099727
4 2015 Maghreb Aller 605514.45 270 171718.14 1130 3.5262113 0.16152007
5 2015 NATIONAL Aller 361185.82 923 1082529.19 5541 0.3336500 0.01528302
6 2015 Schengen Aller 312271.06 940 505181.07 4190 0.6181369 0.02831411
7 2015 France DOM-TOM Retour 30408.70 23 29024.60 108 1.0476871 0.04798989
8 2015 Hors Schengen Retour 349805.15 225 168429.96 953 2.0768583 0.09513165
9 2015 INTERNATIONAL Retour 193536.63 138 99160.52 678 1.9517509 0.08940104
10 2015 Maghreb Retour 302863.83 110 41677.90 294 7.2667735 0.33285861
11 2015 NATIONAL Retour 471520.80 647 757258.33 3956 0.6226684 0.02852167
12 2015 Schengen Retour 307691.66 422 243204.76 2104 1.2651548 0.05795112
without using Excel.
With R or Python? I don't know if splitting columns like that is possible.
Thanks to all the comments; here is my solution:
I split my data frame into two data frames, df15 (with the 2015 data) and df16 (with the 2016 data), then:
library(tables)  # tabular() comes from the 'tables' package
mytable15 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity), data=df15)
mytable16 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity), data=df16)
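For the Python side of the question, a rough pandas equivalent of those tabular() calls would be a pivot with Trajet spread across the columns (a sketch; the column names are taken from the data frame printed above and may need adjusting, and df15 here stands for the pandas analogue of the 2015 split):

import pandas as pd

# the default aggfunc='mean' is harmless here because each
# (Faisceaux destination, Trajet) pair occurs exactly once
mytable15_py = df15.pivot_table(
    index='Faisceaux destination',
    columns='Trajet',
    values=['RED_Groupe', 'Nbr observation RED', 'Pond',
            'Nbr observation total', 'RED', 'pct'],
)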

Pivot tables using pandas

I have the following dataframe:
df1= df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]
This will produce 1.4 million records; I've taken the first 12:
Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011
I then filter on ['nat_actn_2_3'] for the certain hiring codes.
h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort_values('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
and can look at value_counts() to see total hires by region.
total_newhires = h3['regions'].value_counts()
total_newhires
produces:
Out[38]:
Pacific Southwest Region (R5) 42255
Pacific Northwest Region (R6) 32081
Intermountain Region (R4) 24045
Northern Region (R1) 22822
Rocky Mountain Region (R2) 17481
Southwest Region (R3) 17305
Eastern Region (R9) 11034
Research & Development(RES) 7337
Southern Region (R8) 7288
Albuquerque Service Center(ASC) 7032
Washington Office(WO) 4837
Alaska Region (R10) 4210
Job Corps(JC) 4010
nda 438
I'd like to do something like in Excel, where I can have ['regions'] as my rows and ['fy'] as the columns, giving me a total count based on ['ssno'] for each ['fy']. It would also be nice to eventually do calculations on the numbers too, like averages and sums.
Along with looking at the examples at http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:
hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])
I'm wondering if groupby may be what I'm looking for?
Any help is appreciated. I've spent 3 days on this and can't seem to put it together.
So, based on the answer below, I did a pivot using the following code:
h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)
which produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as a value, my results came out really strange and didn't match my value_counts() numbers. Not sure if the pivot eliminates duplicates or what, but it did not come out correctly.
ssno
fy 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
nat_actn_2_3
100 34 20 25 18 38 43 45 14 19 25 10
101 510 453 725 795 1029 1293 957 383 470 605 145
108 170 132 112 85 123 127 84 43 40 29 10
115 9203 8972 7946 9038 10139 10480 9211 8735 10482 11258 339
130 299 313 431 324 291 325 336 202 230 436 112
140 62 74 71 75 132 125 82 42 45 74 18
141 20 16 23 17 20 14 10 9 13 17 7
170 202 433 226 278 336 386 284 265 121 118 49
171 4771 4627 4234 4196 4470 4472 3270 3145 354 341 34
190 1 1 NaN NaN NaN 1 NaN NaN NaN NaN NaN
702 3141 3099 3429 3030 3758 3952 3813 2902 2329 2375 650
703 2280 2354 2225 2050 2260 2328 2172 2503 2649 2856 726
Try it like this:
h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)
To get counts, use aggfunc=len.
Also, your isin references a list of strings, but the data you provide for column 'nat_actn_2_3' is int.
Try:
h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
if you have an older version of pandas.
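On the groupby question: yes, groupby can build the same table, and counting group sizes sidesteps the mismatch you saw when switching the value column from 'ssno' to 'ethnicity' or 'veteran', because .size() counts rows regardless of missing values in any particular column. A sketch:

# one row per region, one column per fiscal year, counting records
hires_by_region_fy = h3.groupby(['regions', 'fy']).size().unstack('fy', fill_value=0)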
