Pivot tables using pandas - python

I have the following dataframe:
df1= df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]
This produces about 1.4 million records. Here are the first 12:
Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011
I then filter on ['nat_actn_2_3'] for certain hiring codes.
h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
and can look at value_counts() to see total hires by region.
total_newhires = h3['regions'].value_counts()
total_newhires
produces:
Out[38]:
Pacific Southwest Region (R5) 42255
Pacific Northwest Region (R6) 32081
Intermountain Region (R4) 24045
Northern Region (R1) 22822
Rocky Mountain Region (R2) 17481
Southwest Region (R3) 17305
Eastern Region (R9) 11034
Research & Development(RES) 7337
Southern Region (R8) 7288
Albuquerque Service Center(ASC) 7032
Washington Office(WO) 4837
Alaska Region (R10) 4210
Job Corps(JC) 4010
nda 438
I'd like to do something like an Excel pivot table, with ['regions'] as the rows and ['fy'] as the columns, giving a count of ['ssno'] for each region and fiscal year. Eventually it would also be nice to do calculations on those numbers, such as averages and sums.
Along with looking at examples in the url: http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:
hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])
I'm wondering if groupby may be what I'm looking for?
Any help is appreciated. I've spent 3 days on this and can't seem to put it together.
So, based on the answer below, I did a pivot using the following code:
h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)
This produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as the value, my results came out really strange and didn't match my value_counts numbers. I'm not sure whether the pivot eliminates duplicates or what, but it did not come out correctly.
ssno
fy 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
nat_actn_2_3
100 34 20 25 18 38 43 45 14 19 25 10
101 510 453 725 795 1029 1293 957 383 470 605 145
108 170 132 112 85 123 127 84 43 40 29 10
115 9203 8972 7946 9038 10139 10480 9211 8735 10482 11258 339
130 299 313 431 324 291 325 336 202 230 436 112
140 62 74 71 75 132 125 82 42 45 74 18
141 20 16 23 17 20 14 10 9 13 17 7
170 202 433 226 278 336 386 284 265 121 118 49
171 4771 4627 4234 4196 4470 4472 3270 3145 354 341 34
190 1 1 NaN NaN NaN 1 NaN NaN NaN NaN NaN
702 3141 3099 3429 3030 3758 3952 3813 2902 2329 2375 650
703 2280 2354 2225 2050 2260 2328 2172 2503 2649 2856 726

Try it like this:
h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)
To get counts, use aggfunc=len.
Also, your isin call is given a list of strings, but the values you show for the 'nat_actn_2_3' column look like ints. If the column is actually stored as integers, pass ints to isin instead.
If you have an older version of pandas, which uses rows and cols instead of index and columns, try:
h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
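A minimal sketch of the regions-by-fiscal-year count, assuming a recent pandas. The tiny frame below is a made-up stand-in for h3 (only the column names follow the question), and the groupby variant is included because the question asks whether groupby would also work.

```python
import pandas as pd

# Toy stand-in for h3: values are made up, column names follow the question.
h3 = pd.DataFrame({
    "regions": ["Northern Region (R1)", "Northern Region (R1)",
                "Southwest Region (R3)", "Southwest Region (R3)",
                "Southwest Region (R3)"],
    "fy": [2006, 2007, 2011, 2011, 2013],
    "ssno": [345678912, 234567891, 765432198, 111111111, 876543219],
})

# One row per region, one column per fiscal year, cell = number of ssno values.
hires = h3.pivot_table(values="ssno", index="regions", columns="fy",
                       aggfunc="count", fill_value=0)

# The same table via groupby: count per (region, fy), then move fy into columns.
hires_gb = h3.groupby(["regions", "fy"])["ssno"].count().unstack(fill_value=0)

# Follow-up calculations on the counts.
total_per_region = hires.sum(axis=1)   # total hires per region
mean_per_year = hires.mean(axis=0)     # average hires per region, by year

print(hires)
```

Counting a column that has no missing values (here ssno) keeps the result in line with value_counts(); counting a column that contains NaN with aggfunc="count" would skip those rows.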


Is there a way to have insertion of tags or labels with respect to a certain value range?

I have this data and would like to insert a new column titled 'Level'. I understand that 'insert' is one way of adding a new column. I tried an 'if' in the 'value' argument, but this is not yielding anything.
Data:
Active Discharged Deaths
State/UTs
Andaman and Nicobar 6 7437 129
Andhra Pradesh 14550 1993589 13925
Arunachal Pradesh 634 52507 267
Assam 6415 580491 5710
Bihar 55 716048 9656
Chandigarh 35 64273 814
Chhattisgarh 354 990757 13557
Dadra and Nagar Haveli and Daman and Diu 2 10659 4
Delhi 367 1412542 25082
Goa 885 170391 3210
Gujarat 152 815275 10082
Haryana 617 760271 9685
Himachal Pradesh 1699 209420 3613
Jammu and Kashmir 1286 320337 4410
Jharkhand 126 342716 5133
Karnataka 17412 2901299 37426
Kerala 239338 3966557 21631
Ladakh 54 20327 207
Lakshadweep 9 10288 51
Madhya Pradesh 125 781629 10516
Maharashtra 51234 6300755 137811
Manipur 3180 110602 1802
Meghalaya 2104 73711 1329
Mizoram 11414 54056 226
Nagaland 712 29045 631
Odisha 6322 997790 8055
Puducherry 914 121452 1818
Punjab 326 584079 16444
Rajasthan 86 945097 8954
Sikkim 913 28968 375
Tamil Nadu 16256 2572942 35036
Telengana 5505 650453 3886
Tripura 691 81866 803
Uttar Pradesh 227 1686369 22861
Uttarakhand 379 335358 7388
West Bengal 8480 1525581 18515
code:
data = Table.read_table('IndiaStatus.csv')#.drop('Discharged', 'Discharge Ratio (%)','Total Cases','Active','Deaths')
data2.info()
data3 = data2.set_index("State/UTs")
data3 = data3[["Active","Discharged","Deaths"]]
print(data3)
data3.insert(1, column = "Level", value = "Severe" if data3["Active"] > 91874)
output:
line 49
data3.insert(1, column = "Level", value = "Severe" if data3["Active"] > 91874)
^
SyntaxError: invalid syntax
The SyntaxError occurs because a conditional expression needs an else branch, so something like value = "Severe" if data3["Active"] > 91874 else 'OTHER' would remove the syntax error. That said, it still won't work here: it raises a different error, because a Series (in this case data3["Active"] > 91874) cannot be used as the condition of an if.
I believe you can use np.where here:
import numpy as np
data3.insert(1, column = "Level",
             value = np.where(data3["Active"] > 91874, "Severe", 'OTHER'))
Replace OTHER in the above code with whatever value you want to assign in the column when the condition data3["Active"] > 91874 is not met.
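A small runnable sketch of the same idea, using three rows copied from the question's table. The second part uses pd.cut, which is a different technique from the answer's np.where, for the case where more than two labels are wanted over a value range; the 10000 cut-off and the label names are illustrative assumptions, not values from the question.

```python
import numpy as np
import pandas as pd

# Three rows copied from the question's table.
data3 = pd.DataFrame(
    {"Active": [14550, 239338, 51234],
     "Discharged": [1993589, 3966557, 6300755],
     "Deaths": [13925, 21631, 137811]},
    index=pd.Index(["Andhra Pradesh", "Kerala", "Maharashtra"], name="State/UTs"),
)

# Element-wise choice: "Severe" where the condition holds, "Other" elsewhere.
level = np.where(data3["Active"] > 91874, "Severe", "Other")
data3.insert(1, column="Level", value=level)

# For more than two labels, pd.cut maps value ranges to categories.
# Bin edges below (other than 91874) and the labels are made up for illustration.
data3["Level3"] = pd.cut(data3["Active"],
                         bins=[-np.inf, 10000, 91874, np.inf],
                         labels=["Low", "Moderate", "Severe"])

print(data3)
```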

Print column with a specific value in python

I am using Colab. I am trying to print data from only the NY, NC, and SC states.
confirmed_cases_USA, deaths_USA = get_confirmed_deaths_tuple_df (USA_covid_data)
# selecting rows based on condition PA, IL,OH,GA ,NC
options = ['NC',"PA"]
#options = ['NC',"PA","IL","OH","GA"]
confirmed_cases_Selected = confirmed_cases_USA[confirmed_cases_USA ['State'].isin(options)]
deaths_Selected= deaths_USA [deaths_USA ['State'].isin(options)]
print(confirmed_cases_Selected.head())
print(deaths_Selected.head())
output is :
countyFIPS County Name State ... 9/19/20 9/20/20
1921 0 Statewide Unallocated NC ... 1166 1166 1166
1922 37001 Alamance County NC ... 3695 3728 3749
1923 37003 Alexander County NC ... 483 485 488
1924 37005 Alleghany County NC ... 219 220 220
1925 37007 Anson County NC ... 549 552 553
countyFIPS County Name State ... 9/19/20 9/20/20
1921 0 Statewide Unallocated NC ... 0 0
1922 37001 Alamance County NC ... 48 54 54
1923 37003 Alexander County NC ... 5 5 5
1924 37005 Alleghany County NC ... 0 0 0
1925 37007 Anson County NC ... 4 4 4
I am trying to group the data by state first and then get the total confirmed cases for each state.
I'm not sure what get_confirmed_deaths_tuple_df does, but it doesn't look like it returns a DataFrame.
USA_covid_data['State'].isin(options) should return a boolean mask of True and False values. Index the DataFrame with that mask to keep only the rows where it is True: USA_covid_data[USA_covid_data['State'].isin(options)]
It should look something like this:
confirmed_cases_USA, deaths_USA = get_confirmed_deaths_tuple_df(USA_covid_data[USA_covid_data['State'].isin(options)])
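For the grouping part of the question, here is a hedged sketch of filtering with isin and then totalling per state. The frame is a hypothetical miniature of confirmed_cases_USA: the column names follow the printed output, while the rows and figures are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of confirmed_cases_USA (illustrative figures).
confirmed = pd.DataFrame({
    "County Name": ["Alamance County", "Alexander County",
                    "Adams County", "Autauga County"],
    "State": ["NC", "NC", "PA", "AL"],
    "9/19/20": [3728, 485, 700, 1500],
    "9/20/20": [3749, 488, 710, 1510],
})

options = ["NC", "PA"]

# Boolean mask keeps only rows whose State is in options.
selected = confirmed[confirmed["State"].isin(options)]

# Sum only the date columns within each state.
date_cols = ["9/19/20", "9/20/20"]
state_totals = selected.groupby("State")[date_cols].sum()

print(state_totals)
```

Restricting the sum to the date columns avoids adding up identifier columns such as countyFIPS.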

How to scrape tbody from a collapsible table using BeautifulSoup library?

Recently I did a project based on a COVID-19 dashboard, for which I scrape data from this website, which has a collapsible table. Everything was fine until recently, when the Heroku app started showing errors. I reran my code on my local machine and the error occurred while scraping the tbody. I then figured out that the site has changed the way the table looks, and my code can no longer grab it. When I view the page source I cannot find the table (tbody), but if I inspect a row of the table in the browser I can see the tbody and all the data. How can I scrape the table now?
My code:
The table i have to grab:
The data you see on the page is loaded from an external URL via Ajax. You can use the requests and json modules to load it:
import json
import requests
url = 'https://www.mohfw.gov.in/data/datanew.json'
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
# print some data on screen:
for d in data:
    print('{:<30} {:<10} {:<10} {:<10} {:<10}'.format(d['state_name'], d['active'], d['positive'], d['cured'], d['death']))
Prints:
Andaman and Nicobar Islands 329 548 214 5
Andhra Pradesh 75720 140933 63864 1349
Arunachal Pradesh 670 1591 918 3
Assam 9814 40269 30357 98
Bihar 17579 51233 33358 296
Chandigarh 369 1051 667 15
Chhattisgarh 2803 9086 6230 53
... and so on.
Try:
import json
import requests
import pandas as pd
data = []
row = []
r = requests.get('https://www.mohfw.gov.in/data/datanew.json')
j = json.loads(r.text)
for i in j:
    for k in i:
        row.append(i[k])
    data.append(row)
    row = []
columns = [i for i in j[0]]
df = pd.DataFrame(data, columns=columns)
df.sno = pd.to_numeric(df.sno, errors='coerce').reset_index()
df = df.sort_values('sno',)
print(df.to_string())
prints:
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 0 Andaman and Nicobar Islands 329 548 214 5 403 636 226 7 35
1 1 Andhra Pradesh 75720 140933 63864 1349 72188 150209 76614 1407 28
2 2 Arunachal Pradesh 670 1591 918 3 701 1673 969 3 12
3 3 Assam 9814 40269 30357 98 10183 41726 31442 101 18
4 4 Bihar 17579 51233 33358 296 18937 54240 34994 309 10
5 5 Chandigarh 369 1051 667 15 378 1079 683 18 04
6 6 Chhattisgarh 2803 9086 6230 53 2720 9385 6610 55 22
7 7 Dadra and Nagar Haveli and Daman and Diu 412 1100 686 2 418 1145 725 2 26
8 8 Delhi 10705 135598 120930 3963 10596 136716 122131 3989 07
9 9 Goa 1657 5913 4211 45 1707 6193 4438 48 30
10 10 Gujarat 14090 61438 44907 2441 14300 62463 45699 2464 24
and so on...
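A more compact sketch of the DataFrame step, assuming the endpoint returns a list of flat dicts with string values, as the printed output suggests. The sample list below stands in for requests.get(url).json() so the snippet runs offline; the figures are taken from the output above, and the sno values are assumed.

```python
import pandas as pd

# Stand-in for requests.get(url).json(); sno values are assumptions.
sample = [
    {"sno": "2", "state_name": "Andhra Pradesh", "active": "75720",
     "positive": "140933", "cured": "63864", "death": "1349"},
    {"sno": "1", "state_name": "Andaman and Nicobar Islands", "active": "329",
     "positive": "548", "cured": "214", "death": "5"},
]

# A list of dicts converts directly: keys become the column names.
df = pd.DataFrame(sample)

# Convert the numeric-looking string columns; unparseable values become NaN.
num_cols = ["sno", "active", "positive", "cured", "death"]
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")

df = df.sort_values("sno").reset_index(drop=True)
print(df.to_string())
```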

Reshape data frame (with R or

I want to know if it's possible to have this result:
example:
With this data frame
df
y Faisceaux destination Trajet RED_Groupe Nbr observation RED Pond Nbr observation total RED pct
1 2015 France DOM-TOM Aller 78248.47 87 85586.75 307 0.9142591 0.04187815
2 2015 Hors Schengen Aller 256817.64 234 195561.26 1194 1.3132337 0.06015340
3 2015 INTERNATIONAL Aller 258534.78 473 288856.53 2065 0.8950283 0.04099727
4 2015 Maghreb Aller 605514.45 270 171718.14 1130 3.5262113 0.16152007
5 2015 NATIONAL Aller 361185.82 923 1082529.19 5541 0.3336500 0.01528302
6 2015 Schengen Aller 312271.06 940 505181.07 4190 0.6181369 0.02831411
7 2015 France DOM-TOM Retour 30408.70 23 29024.60 108 1.0476871 0.04798989
8 2015 Hors Schengen Retour 349805.15 225 168429.96 953 2.0768583 0.09513165
9 2015 INTERNATIONAL Retour 193536.63 138 99160.52 678 1.9517509 0.08940104
10 2015 Maghreb Retour 302863.83 110 41677.90 294 7.2667735 0.33285861
11 2015 NATIONAL Retour 471520.80 647 757258.33 3956 0.6226684 0.02852167
12 2015 Schengen Retour 307691.66 422 243204.76 2104 1.2651548 0.05795112
without using Excel.
With R or Python? I don't know if splitting columns like that is possible.
Thanks to all the comments. Here is my solution:
I split my data frame into two data frames, df15 (with the 2015 data) and df16 (with the 2016 data), then:
mytable15 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity),data=df15)
mytable16 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity),data=df16)
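For the Python side of the question, a rough pandas sketch. The target layout is not shown, so this assumes, judging from the tabular() formula, that the goal is one row per Faisceaux destination with the measures repeated under each Trajet value. It uses four rows and two measure columns from the table in the question.

```python
import pandas as pd

# Four rows and two measure columns taken from the question's table.
df = pd.DataFrame({
    "Faisceaux destination": ["France DOM-TOM", "Maghreb",
                              "France DOM-TOM", "Maghreb"],
    "Trajet": ["Aller", "Aller", "Retour", "Retour"],
    "RED_Groupe": [78248.47, 605514.45, 30408.70, 302863.83],
    "pct": [0.04187815, 0.16152007, 0.04798989, 0.33285861],
})

# Spread Trajet across the columns; the result has a two-level column header.
wide = df.pivot(index="Faisceaux destination", columns="Trajet",
                values=["RED_Groupe", "pct"])

# Put Trajet on the top level, with the measures underneath each value.
wide = wide.swaplevel(axis=1).sort_index(axis=1)

print(wide)
```

With both years in one frame, filtering on the year column first (or adding it to the index) would play the role of the df15/df16 split.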

Pandas data pull - messy strings to float

I am new to Pandas and I am just starting to take in the versatility of the package. While working with a small practice csv file, I pulled the following data in:
Rank Corporation Sector Headquarters Revenue (thousand PLN) Profit (thousand PLN) Employees
1.ÿ PKN Orlen SA oil and gas P?ock 79 037 121 2 396 447 4,445
2.ÿ Lotos Group SA oil and gas Gda?sk 29 258 539 584 878 5,168
3.ÿ PGE SA energy Warsaw 28 111 354 6 165 394 44,317
4.ÿ Jer¢nimo Martins retail Kostrzyn 25 285 407 N/A 36,419
5.ÿ PGNiG SA oil and gas Warsaw 23 003 534 1 711 787 33,071
6.ÿ Tauron Group SA energy Katowice 20 755 222 1 565 936 26,710
7.ÿ KGHM Polska Mied? SA mining Lubin 20 097 392 13 653 597 18,578
8.ÿ Metro Group Poland retail Warsaw 17 200 000 N/A 22,556
9.ÿ Fiat Auto Poland SA automotive Bielsko-Bia?a 16 513 651 83 919 5,303
10.ÿ Orange Polska telecommunications Warsaw 14 922 000 1 785 000 23,805
I have two serious problems with it that I cannot seem to find solution for:
1) Data in the "Revenue" and "Profit" columns is pulled in as strings because of the odd formatting with spaces between thousands, and I cannot figure out how to make Pandas convert it into floating point values.
2) Data under the "Rank" column is pulled in as "1.?", "2.?" etc. What's happening there? Again, when I try to rewrite this data with something more appropriate like "1.", "2." etc., the DataFrame just does not budge.
Ideas? Suggestions? I am also open to outright bashing because my problem might be quite obvious and silly. Excuse my lack of experience if so :)
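I would use the converters parameter of pd.read_csv. Define the converters and pass them to your pd.read_csv call:
def space_float(x):
    # Drop the spaces used as thousands separators; map 'N/A' to NaN.
    x = x.replace(' ', '')
    return float(x) if x not in ('', 'N/A') else float('nan')
converters = {
    'Revenue (thousand PLN)': space_float,
    'Profit (thousand PLN)': space_float,
    'Rank': str.strip
}
pd.read_csv(... converters=converters ...)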
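An alternative sketch that cleans the columns after loading instead of using converters. The inline text stands in for the CSV file; the semicolon delimiter and the plain trailing space after the rank are assumptions, since the real file's delimiter and encoding are not shown.

```python
import io
import pandas as pd

# Two rows shaped like the question's data (delimiter is an assumption).
raw = (
    "Rank;Corporation;Revenue (thousand PLN);Profit (thousand PLN)\n"
    "1. ;PKN Orlen SA;79 037 121;2 396 447\n"
    "4. ;Jeronimo Martins;25 285 407;N/A\n"
)

# Read everything as text so nothing is guessed; 'N/A' still becomes NaN.
df = pd.read_csv(io.StringIO(raw), sep=";", dtype=str)

for col in ["Revenue (thousand PLN)", "Profit (thousand PLN)"]:
    # Remove the thousands-separator spaces, then convert to numbers.
    df[col] = pd.to_numeric(df[col].str.replace(" ", "", regex=False),
                            errors="coerce")

# Trim stray whitespace around the rank labels.
df["Rank"] = df["Rank"].str.strip()

print(df.dtypes)
```

If the accented characters in the real file come out garbled, passing an explicit encoding= argument to read_csv is worth trying; which encoding applies depends on how the file was saved.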
