Print column with a specific value in Python
I am using Colab. I am trying to print data from only the NY, NC, and SC states.
confirmed_cases_USA, deaths_USA = get_confirmed_deaths_tuple_df(USA_covid_data)

# selecting rows based on condition PA, IL, OH, GA, NC
options = ['NC', 'PA']
# options = ['NC', 'PA', 'IL', 'OH', 'GA']

confirmed_cases_Selected = confirmed_cases_USA[confirmed_cases_USA['State'].isin(options)]
deaths_Selected = deaths_USA[deaths_USA['State'].isin(options)]

print(confirmed_cases_Selected.head())
print(deaths_Selected.head())
The output is:
      countyFIPS            County Name State  ...  9/18/20  9/19/20  9/20/20
1921           0  Statewide Unallocated    NC  ...     1166     1166     1166
1922       37001        Alamance County    NC  ...     3695     3728     3749
1923       37003       Alexander County    NC  ...      483      485      488
1924       37005       Alleghany County    NC  ...      219      220      220
1925       37007           Anson County    NC  ...      549      552      553

      countyFIPS            County Name State  ...  9/18/20  9/19/20  9/20/20
1921           0  Statewide Unallocated    NC  ...        0        0        0
1922       37001        Alamance County    NC  ...       48       54       54
1923       37003       Alexander County    NC  ...        5        5        5
1924       37005       Alleghany County    NC  ...        0        0        0
1925       37007           Anson County    NC  ...        4        4        4
I am trying to group the data by state first and then get the total confirmed cases for each state.
I'm not sure what get_confirmed_deaths_tuple_df does, but it doesn't look like it returns a single DataFrame.
USA_covid_data['State'].isin(options) returns a boolean mask of True and False values. Select the rows where the mask is True with USA_covid_data[USA_covid_data['State'].isin(options)].
It should look something like this.
confirmed_cases_USA, deaths_USA = get_confirmed_deaths_tuple_df(USA_covid_data[USA_covid_data['State'].isin(options)])
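For the grouping step the question asks about, here is a minimal sketch, assuming the daily date columns hold cumulative confirmed counts as in the sample output above (so summing the latest date column per state gives the state totals):

# a minimal sketch, assuming '9/20/20' is the latest cumulative-count column
totals_by_state = confirmed_cases_Selected.groupby('State')['9/20/20'].sum()
print(totals_by_state)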
Related
For loop with non-consecutive indices
I'm quite new to Python and working with data frames, so this might be a very simple problem. I successfully imported some measurement data (1 minute resolution) and did some calculations on them. I want to recalculate some data processing on a 15 minute basis (not average), for which I extracted every row at :00, :15, :30 and :45 from the original data frame:

df_interval = df[(df['DateTime'].dt.minute == 0) | (df['DateTime'].dt.minute == 15) | (df['DateTime'].dt.minute == 30) | (df['DateTime'].dt.minute == 45)]

This seems to work fine. Now I want to recalculate the concentration every 15 minutes based on what the instrument is internally doing, which is a simple formula. So what I tried is:

for i in df_interval.index:
    if np.isnan(df_interval.ATN[i]) == False and np.isnan(df_interval.ATN[i+1]) == False:
        df_15min = (0.785 * ((df_interval.ATN[i+1] - df_interval.ATN[i]) / 100)) / (df_interval.Flow[i] * (1 - 0.07) * 10.8 * (1 - df_interval.K[i] * df_interval.ATN[i]) * 15)

However, I end up with a KeyError: 226, and I don't understand why.

Update: here is the data, with the result I want to get in the last column (df_15min):

     ATN          Flow  K        df_15min
150               3647  0.00994
165               3634  0.00996
180               3634  0.00995
195               3621  0.00995
210               3615  0.00994
225  1.703678939  3754  0.00994  3.75E-08
240  4.356519267  3741  0.00994  3.84E-08
255  6.997422571  3741  0.00994  3.94E-08
270  9.627710046  3736  0.00995  4.02E-08
285  12.23379251  3728  0.01007  3.89E-08
300  14.67175418  3727  0.01026  3.76E-08
315  16.9583747   3714  0.01043  3.73E-08
330  19.1497249   3714  0.01061  3.96E-08
345  21.39628083  3709  0.01079  3.87E-08
360  23.51512717  3701  0.01086  4.02E-08
375  25.63995721  3700  0.01083  3.90E-08
390  27.63886191  3688  0.0108   3.47E-08
405  29.36343728  3688  0.01076  3.68E-08
420  31.14291069  3677  0.01072

I do a lot of things in Igor, so that is how I would do it there (unfortunately for me, it has to be in Python this time):

variable i
For (i=0; i<numpnts(ATN)-1; i+=1)
    df_15min[i] = (0.785 * ((ATN[i+1] - ATN[i]) / 100)) / (Flow[i] * (1 - 0.07) * 10.8 * (1 - K[i] * ATN[i]) * 15)
endfor

Any help would be appreciated, thanks!
You can write literally the same operation as vectorized code. Just use the whole columns and shift(-1) to get the "next" row:

df['df_15min'] = (0.785 * ((df['ATN'].shift(-1) - df['ATN']) / 100)) / (df['Flow'] * (1 - 0.07) * 10.8 * (1 - df['K'] * df['ATN']) * 15)

Or using diff:

df['df_15min'] = (0.785 * ((-df['ATN'].diff(-1)) / 100)) / (df['Flow'] * (1 - 0.07) * 10.8 * (1 - df['K'] * df['ATN']) * 15)

Output:

             ATN  Flow        K      df_15min
index
150          NaN  3647  0.00994           NaN
165          NaN  3634  0.00996           NaN
180          NaN  3634  0.00995           NaN
195          NaN  3621  0.00995           NaN
210          NaN  3615  0.00994           NaN
225     1.703679  3754  0.00994  3.745468e-08
240     4.356519  3741  0.00994  3.844700e-08
255     6.997423  3741  0.00994  3.937279e-08
270     9.627710  3736  0.00995  4.019633e-08
285    12.233793  3728  0.01007  3.886148e-08
300    14.671754  3727  0.01026  3.763219e-08
315    16.958375  3714  0.01043  3.734876e-08
330    19.149725  3714  0.01061  3.955360e-08
345    21.396281  3709  0.01079  3.870011e-08
360    23.515127  3701  0.01086  4.017342e-08
375    25.639957  3700  0.01083  3.897022e-08
390    27.638862  3688  0.01080  3.473242e-08
405    29.363437  3688  0.01076  3.675232e-08
420    31.142911  3677  0.01072           NaN
Your if condition checks df_interval.ATN[i+1] for NaN before you access it, but because df_interval keeps the original (non-consecutive) index labels after filtering, i+1 is usually not an existing label, and the lookup itself raises the KeyError.
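A sketch of a loop-based fix along those lines, iterating by position with .iloc instead of by label (a minimal sketch under the assumptions above, not the answerer's original code):

import numpy as np

# iterate by position, since the index labels are non-consecutive
vals = []
for i in range(len(df_interval) - 1):
    atn = df_interval['ATN'].iloc[i]
    atn_next = df_interval['ATN'].iloc[i + 1]
    flow = df_interval['Flow'].iloc[i]
    k = df_interval['K'].iloc[i]
    if not np.isnan(atn) and not np.isnan(atn_next):
        vals.append((0.785 * ((atn_next - atn) / 100))
                    / (flow * (1 - 0.07) * 10.8 * (1 - k * atn) * 15))
    else:
        vals.append(np.nan)
vals.append(np.nan)  # the last row has no "next" row
df_interval['df_15min'] = vals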
How to scrape tbody from a collapsible table using the BeautifulSoup library?
Recently I did a project based on a COVID-19 dashboard, where I scrape data from this website, which has a collapsible table. Everything was OK until now; recently the Heroku app started showing some errors. So I reran my code on my local machine, and the error occurred when scraping tbody. I then figured out that the site I scrape data from has changed or updated the way the table looks, and my code is no longer able to grab it. I tried viewing the page source and I am not able to find the table (tbody) that is on this page. I am able to find tbody and all the data if I inspect a row of the table, but I can't find it in the page source. How can I scrape the table now?

My code:

The table I have to grab:
The data you see on the page is loaded from an external URL via Ajax. You can use the requests/json modules to load it:

import json
import requests

url = 'https://www.mohfw.gov.in/data/datanew.json'
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

# print some data on screen:
for d in data:
    print('{:<30} {:<10} {:<10} {:<10} {:<10}'.format(d['state_name'], d['active'], d['positive'], d['cured'], d['death']))

Prints:

Andaman and Nicobar Islands    329        548        214        5
Andhra Pradesh                 75720      140933     63864      1349
Arunachal Pradesh              670        1591       918        3
Assam                          9814       40269      30357      98
Bihar                          17579      51233      33358      296
Chandigarh                     369        1051       667        15
Chhattisgarh                   2803       9086       6230       53

...and so on.
Try:

import json
import requests
import pandas as pd

data = []
row = []

r = requests.get('https://www.mohfw.gov.in/data/datanew.json')
j = json.loads(r.text)

for i in j:
    for k in i:
        row.append(i[k])
    data.append(row)
    row = []

columns = [i for i in j[0]]
df = pd.DataFrame(data, columns=columns)
df['sno'] = pd.to_numeric(df['sno'], errors='coerce')
df = df.sort_values('sno').reset_index(drop=True)
print(df.to_string())

Prints:

    sno  state_name                                active  positive  cured   death  new_active  new_positive  new_cured  new_death  state_code
0     0  Andaman and Nicobar Islands                  329       548    214       5         403           636        226          7          35
1     1  Andhra Pradesh                             75720    140933  63864    1349       72188        150209      76614       1407          28
2     2  Arunachal Pradesh                            670      1591    918       3         701          1673        969          3          12
3     3  Assam                                       9814     40269  30357      98       10183         41726      31442        101          18
4     4  Bihar                                      17579     51233  33358     296       18937         54240      34994        309          10
5     5  Chandigarh                                   369      1051    667      15         378          1079        683         18          04
6     6  Chhattisgarh                                2803      9086   6230      53        2720          9385       6610         55          22
7     7  Dadra and Nagar Haveli and Daman and Diu     412      1100    686       2         418          1145        725          2          26
8     8  Delhi                                      10705    135598 120930    3963       10596        136716     122131       3989          07
9     9  Goa                                         1657      5913   4211      45        1707          6193       4438         48          30
10   10  Gujarat                                    14090     61438  44907    2441       14300         62463      45699       2464          24

and so on...
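Since the JSON is a flat list of records, pandas can also build the frame directly; a shorter sketch of the same idea (not from either answer above):

import pandas as pd
import requests

# pd.DataFrame accepts a list of dicts directly, one row per record
df = pd.DataFrame(requests.get('https://www.mohfw.gov.in/data/datanew.json').json())
df['sno'] = pd.to_numeric(df['sno'], errors='coerce')
df = df.sort_values('sno')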
Trying to look up a value from a pandas dataframe within a range of two rows in the index dataframe
I have two dataframes in a Python notebook, "grower_moo" and "pricing", used to analyze harvested crops and price payments to the growers. pricing is the index dataframe, and grower_moo has various unique load tickets with information about each load. I need to pull the price per ton from the pricing index into a new column in the load data if the Fat of that load is not greater than the next Wet_Fat. Below is a .head() sample of each dataframe and the code I tried. I received a "ValueError: Can only compare identically-labeled Series objects" error.

pricing

    Price_Per_Ton  Wet_Fat
0             306       10
1             339       11
2             382       12
3             430       13
4             481       14
5             532       15
6             580       16
7             625       17
8             665       18
9             700       19
10            728       20
11            750       21
12            766       22
13            778       23
14            788       24
15            797       25

grower_moo

      Load Ticket  Net Fruit Weight  Net MOO  Percent_MOO    Fat
0  L2019000011817             56660      833     1.448872  21.92
1  L2019000011816             53680     1409     2.557679  21.12
2  L2019000011815             53560     1001     1.834644  21.36
3  L2019000011161             62320     2737     4.207080  21.41
4  L2019000011160             57940     1129     1.911324  20.06

grower_moo['price_per_ton'] = max(pricing[pricing['Wet_Fat'] < grower_moo['Fat']]['Price_Per_Ton'])

Example output: a grower_moo['Fat'] of 13.60 is less than 14 Wet_Fat, and therefore gets a price per ton of $430.

grower_moo_with_price

      Load Ticket  Net Fruit Weight  Net MOO  Percent_MOO    Fat  price_per_ton
0  L2019000011817             56660      833     1.448872  21.92            750
1  L2019000011816             53680     1409     2.557679  21.12            750
2  L2019000011815             53560     1001     1.834644  21.36            750
3  L2019000011161             62320     2737     4.207080  21.41            750
4  L2019000011160             57940     1129     1.911324  20.06            728
This looks like a job for an "as of" merge, pd.merge_asof (documentation):

    This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.

    For each row in the left DataFrame: A "backward" search [the default] selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.

In the following code, I use your example inputs, but with column names using underscores _ instead of spaces.

# Required by merge_asof: sort keys in left DataFrame
grower_moo = grower_moo.sort_values('Fat')

# Required by merge_asof: key column data types must match
pricing['Wet_Fat'] = pricing['Wet_Fat'].astype('float')

# Perform the asof merge
res = pd.merge_asof(grower_moo, pricing, left_on='Fat', right_on='Wet_Fat')

# Print result
res

      Load_Ticket  Net_Fruit_Weight  Net_MOO  Percent_MOO    Fat  Price_Per_Ton  Wet_Fat
0  L2019000011160             57940     1129     1.911324  20.06            728     20.0
1  L2019000011816             53680     1409     2.557679  21.12            750     21.0
2  L2019000011815             53560     1001     1.834644  21.36            750     21.0
3  L2019000011161             62320     2737     4.207080  21.41            750     21.0
4  L2019000011817             56660      833     1.448872  21.92            750     21.0

# Optional: drop the key column from the right DataFrame
res.drop(columns='Wet_Fat')

      Load_Ticket  Net_Fruit_Weight  Net_MOO  Percent_MOO    Fat  Price_Per_Ton
0  L2019000011160             57940     1129     1.911324  20.06            728
1  L2019000011816             53680     1409     2.557679  21.12            750
2  L2019000011815             53560     1001     1.834644  21.36            750
3  L2019000011161             62320     2737     4.207080  21.41            750
4  L2019000011817             56660      833     1.448872  21.92            750
concat_df = pd.concat([grower_moo, pricing], axis=1)
concat_df = concat_df[concat_df['Wet_Fat'] < concat_df['Fat']]
del concat_df['Wet_Fat']
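Note that concat with axis=1 only lines rows up positionally; a cross join expresses the same filter-then-pick idea more reliably. A hedged sketch (assumes pandas >= 1.2 for how='cross' and that Load Ticket uniquely identifies each load):

# pair every load with every price row, keep prices strictly below the load's Fat,
# then keep the highest qualifying price per load
cross = grower_moo.merge(pricing, how='cross')
cross = cross[cross['Wet_Fat'] < cross['Fat']]
best = cross.loc[cross.groupby('Load Ticket')['Price_Per_Ton'].idxmax()]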
Reshape data frame (with R or Python)
I want to know if it's possible to have this result (example), with this data frame df:

     y  Faisceaux destination  Trajet  RED_Groupe  Nbr observation RED        Pond  Nbr observation total        RED         pct
1   2015        France DOM-TOM   Aller    78248.47                   87    85586.75                    307  0.9142591  0.04187815
2   2015         Hors Schengen   Aller   256817.64                  234   195561.26                   1194  1.3132337  0.06015340
3   2015         INTERNATIONAL   Aller   258534.78                  473   288856.53                   2065  0.8950283  0.04099727
4   2015               Maghreb   Aller   605514.45                  270   171718.14                   1130  3.5262113  0.16152007
5   2015              NATIONAL   Aller   361185.82                  923  1082529.19                   5541  0.3336500  0.01528302
6   2015              Schengen   Aller   312271.06                  940   505181.07                   4190  0.6181369  0.02831411
7   2015        France DOM-TOM  Retour    30408.70                   23    29024.60                    108  1.0476871  0.04798989
8   2015         Hors Schengen  Retour   349805.15                  225   168429.96                    953  2.0768583  0.09513165
9   2015         INTERNATIONAL  Retour   193536.63                  138    99160.52                    678  1.9517509  0.08940104
10  2015               Maghreb  Retour   302863.83                  110    41677.90                    294  7.2667735  0.33285861
11  2015              NATIONAL  Retour   471520.80                  647   757258.33                   3956  0.6226684  0.02852167
12  2015              Schengen  Retour   307691.66                  422   243204.76                   2104  1.2651548  0.05795112

without using Excel, with R or Python? I don't know if splitting columns like that is possible.
Thanks to all the comments; here is my solution. I split my data frame into two data frames, df15 (with 2015 data) and df16 (2016 data), then:

mytable15 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity), data = df15)

mytable16 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity), data = df16)
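Since the question also asks about Python: a hedged pandas sketch of the same reshape, assuming df holds the frame shown in the question with the column names as printed there (pivot_table's default mean simply passes values through here, because each destination/Trajet pair occurs once per year):

import pandas as pd

# hypothetical: df is the data frame from the question
value_cols = ['RED_Groupe', 'Nbr observation RED', 'Pond',
              'Nbr observation total', 'RED', 'pct']
wide = df.pivot_table(index=['y', 'Faisceaux destination'],
                      columns='Trajet',
                      values=value_cols)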
Pivot tables using pandas
I have the following dataframe:

df1 = df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]

This produces 1.4 million records. I've taken the first 12:

Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011

I then filter on ['nat_actn_2_3'] for certain hiring codes:

h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])

and can look at value_counts() to see total hires by region:

total_newhires = h3['regions'].value_counts()
total_newhires

Out[38]:
Pacific Southwest Region (R5)      42255
Pacific Northwest Region (R6)      32081
Intermountain Region (R4)          24045
Northern Region (R1)               22822
Rocky Mountain Region (R2)         17481
Southwest Region (R3)              17305
Eastern Region (R9)                11034
Research & Development(RES)         7337
Southern Region (R8)                7288
Albuquerque Service Center(ASC)     7032
Washington Office(WO)               4837
Alaska Region (R10)                 4210
Job Corps(JC)                       4010
nda                                  438

I'd like to do something like in Excel, where I can have ['regions'] as my rows and ['fy'] as the columns, giving me a total count based on ['ssno'] for each ['fy']. It would also be nice to eventually do calculations based on the numbers too, like averages and sums.
Along with looking at the examples at http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:

hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])

I'm wondering if groupby may be what I'm looking for? Any help is appreciated; I've spent 3 days on this and can't seem to put it together.

So based on the answer below I did a pivot using the following code:

h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)

which produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as a value, my results came out really strange and didn't match my value_counts numbers. Not sure if the pivot eliminates duplicates or what, but it did not come out correctly:

              ssno
fy            2005  2006  2007  2008   2009   2010  2011  2012   2013   2014  2015
nat_actn_2_3
100             34    20    25    18     38     43    45    14     19     25    10
101            510   453   725   795   1029   1293   957   383    470    605   145
108            170   132   112    85    123    127    84    43     40     29    10
115           9203  8972  7946  9038  10139  10480  9211  8735  10482  11258   339
130            299   313   431   324    291    325   336   202    230    436   112
140             62    74    71    75    132    125    82    42     45     74    18
141             20    16    23    17     20     14    10     9     13     17     7
170            202   433   226   278    336    386   284   265    121    118    49
171           4771  4627  4234  4196   4470   4472  3270  3145    354    341    34
190              1     1   NaN   NaN    NaN      1   NaN   NaN    NaN    NaN   NaN
702           3141  3099  3429  3030   3758   3952  3813  2902   2329   2375   650
703           2280  2354  2225  2050   2260   2328  2172  2503   2649   2856   726
Try it like this:

h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)

To get counts, use aggfunc=len. Also, your isin references a list of strings, but the data you provide for the 'nat_actn_2_3' column is int.

If you have an older version of pandas, try:

h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
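Since the question also wondered about groupby: a hedged sketch of an equivalent groupby approach, counting distinct ssno values per region and fiscal year, which also sidesteps the duplicate concern raised above (an illustration, not part of the original answer):

# count unique ssno per (region, fy), then spread fy across columns
counts = (h3.groupby(['regions', 'fy'])['ssno']
            .nunique()
            .unstack('fy', fill_value=0))
print(counts)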