How to scrape the table of states? - python

I am trying to scrape the table from:
https://worldpopulationreview.com/states
My code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://worldpopulationreview.com/states'
page = requests.get(url)
soup = BeautifulSoup(page.text,'lxml')
table = soup.find('table', {'class': 'jsx-a3119e4553b2cac7 table is-striped is-hoverable is-fullwidth tp-table-body is-narrow'})
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)
df = pd.DataFrame(columns=headers)
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data
df
This currently returns:
'NoneType' object has no attribute 'find_all'
Clearly the error occurs because the table variable is None, but I believe I have the table tag correct.

The table data is loaded dynamically by JavaScript, and bs4 can't render JS. You can still do the job with bs4 by driving the page with an automation tool such as Selenium and then grabbing the table into a pandas DataFrame.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
driver.get('https://worldpopulationreview.com/states')
driver.maximize_window()
time.sleep(8)
soup = BeautifulSoup(driver.page_source,"lxml")
#You can pull the table directly from the web page
df = pd.read_html(str(soup))[0]
print(df)
#OR
#table= soup.select_one('table[class="jsx-a3119e4553b2cac7 table is-striped is-hoverable is-fullwidth tp-table-body is-narrow"]')
# df = pd.read_html(str(table))[0]
# print(df)
Output:
Rank State 2022 Population Growth Rate ... 2010 Population Growth Since 2010 % of US Density (/mi²)
0 1 California 39995077 0.57% ... 37253956 7.36% 11.93% 257
1 2 Texas 29945493 1.35% ... 25145561 19.09% 8.93% 115
2 3 Florida 22085563 1.25% ... 18801310 17.47% 6.59% 412
3 4 New York 20365879 0.41% ... 19378102 5.10% 6.07% 432
4 5 Pennsylvania 13062764 0.23% ... 12702379 2.84% 3.90% 292
5 6 Illinois 12808884 -0.01% ... 12830632 -0.17% 3.82% 231
6 7 Ohio 11852036 0.22% ... 11536504 2.74% 3.53% 290
7 8 Georgia 10916760 0.95% ... 9687653 12.69% 3.26% 190
8 9 North Carolina 10620168 0.86% ... 9535483 11.38% 3.17% 218
9 10 Michigan 10116069 0.19% ... 9883640 2.35% 3.02% 179
10 11 New Jersey 9388414 0.53% ... 8791894 6.78% 2.80% 1277
11 12 Virginia 8757467 0.73% ... 8001024 9.45% 2.61% 222
12 13 Washington 7901429 1.26% ... 6724540 17.50% 2.36% 119
13 14 Arizona 7303398 1.05% ... 6392017 14.26% 2.18% 64
14 15 Massachusetts 7126375 0.68% ... 6547629 8.84% 2.13% 914
15 16 Tennessee 7023788 0.81% ... 6346105 10.68% 2.09% 170
16 17 Indiana 6845874 0.44% ... 6483802 5.58% 2.04% 191
17 18 Maryland 6257958 0.65% ... 5773552 8.39% 1.87% 645
18 19 Missouri 6188111 0.27% ... 5988927 3.33% 1.85% 90
19 20 Wisconsin 5935064 0.35% ... 5686986 4.36% 1.77% 110
20 21 Colorado 5922618 1.27% ... 5029196 17.76% 1.77% 57
21 22 Minnesota 5787008 0.70% ... 5303925 9.11% 1.73% 73
22 23 South Carolina 5217037 0.95% ... 4625364 12.79% 1.56% 174
23 24 Alabama 5073187 0.48% ... 4779736 6.14% 1.51% 100
24 25 Louisiana 4682633 0.27% ... 4533372 3.29% 1.40% 108
25 26 Kentucky 4539130 0.37% ... 4339367 4.60% 1.35% 115
26 27 Oregon 4318492 0.95% ... 3831074 12.72% 1.29% 45
27 28 Oklahoma 4000953 0.52% ... 3751351 6.65% 1.19% 58
28 29 Connecticut 3612314 0.09% ... 3574097 1.07% 1.08% 746
29 30 Utah 3373162 1.53% ... 2763885 22.04% 1.01% 41
30 31 Iowa 3219171 0.45% ... 3046355 5.67% 0.96% 58
31 32 Nevada 3185426 1.28% ... 2700551 17.95% 0.95% 29
32 33 Arkansas 3030646 0.32% ... 2915918 3.93% 0.90% 58
33 34 Mississippi 2960075 -0.02% ... 2967297 -0.24% 0.88% 63
34 35 Kansas 2954832 0.29% ... 2853118 3.57% 0.88% 36
35 36 New Mexico 2129190 0.27% ... 2059179 3.40% 0.64% 18
36 37 Nebraska 1988536 0.68% ... 1826341 8.88% 0.59% 26
37 38 Idaho 1893410 1.45% ... 1567582 20.79% 0.56% 23
38 39 West Virginia 1781860 -0.33% ... 1852994 -3.84% 0.53% 74
39 40 Hawaii 1474265 0.65% ... 1360301 8.38% 0.44% 230
40 41 New Hampshire 1389741 0.44% ... 1316470 5.57% 0.41% 155
41 42 Maine 1369159 0.25% ... 1328361 3.07% 0.41% 44
42 43 Rhode Island 1106341 0.41% ... 1052567 5.11% 0.33% 1070
43 44 Montana 1103187 0.87% ... 989415 11.50% 0.33% 8
44 45 Delaware 1008350 0.92% ... 897934 12.30% 0.30% 517
45 46 South Dakota 901165 0.81% ... 814180 10.68% 0.27% 12
46 47 North Dakota 800394 1.35% ... 672591 19.00% 0.24% 12
47 48 Alaska 738023 0.31% ... 710231 3.91% 0.22% 1
48 49 Vermont 646545 0.27% ... 625741 3.32% 0.19% 70
49 50 Wyoming 579495 0.23% ... 563626 2.82% 0.17% 6
[50 rows x 9 columns]
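As an aside, the fixed time.sleep(8) is fragile. A sturdier option is Selenium's explicit wait; a minimal sketch (waiting on the first <table> tag is an assumption; adjust the locator to the element you actually need):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for a <table> element to be present,
# rather than sleeping a fixed 8 seconds.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)
soup = BeautifulSoup(driver.page_source, "lxml")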

The table is rendered dynamically from JSON that is placed at the end of the source code, so you do not need Selenium; simply extract the tag and load the JSON. This also includes all the additional information from the page:
soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
Example
import requests, json
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
pd.DataFrame(
json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
)
Because the JSON also contains additional information that is used for the map, simply choose the columns you need by header. Output:
    fips         state  densityMi   pop2022   pop2021   pop2020   pop2019   pop2010  growthRate  growth  growthSince2010    area     fill          Name  rank
0      6    California    256.742  39995077  39766650  39538223  39309799  37253956  0.00574419  228427        0.0735793  155779  #084594    California     1
1     48         Texas    114.632  29945493  29545499  29145505  28745507  25145561   0.0135382  399994         0.190886  261232  #084594         Texas     2
2     12       Florida    411.852  22085563  21811875  21538187  21264502  18801310   0.0125477  273688         0.174682   53625  #084594       Florida     3
3     36      New York    432.158  20365879  20283564  20201249  20118937  19378102  0.00405821   82315        0.0509739   47126  #084594      New York     4
4     42  Pennsylvania    291.951  13062764  13032732  13002700  12972667  12702379  0.00230435   30032        0.0283715   44743  #2171b5  Pennsylvania     5
..   ...           ...        ...       ...       ...       ...       ...       ...         ...     ...              ...     ...      ...           ...   ...
45    46  South Dakota     11.887    901165    893916    886667    879421    814180  0.00810926    7249         0.106838   75811  #c6dbef  South Dakota    46
46    38  North Dakota    11.5997    800394    789744    779094    768441    672591   0.0134854   10650         0.190016   69001  #c6dbef  North Dakota    47
47     2        Alaska    1.29332    738023    735707    733391    731075    710231  0.00314799    2316        0.0391309  570641  #c6dbef        Alaska    48
48    50       Vermont     70.147    646545    644811    643077    641347    625741  0.00268916    1734         0.033247    9217  #c6dbef       Vermont    49
49    56       Wyoming    5.96845    579495    578173    576851    575524    563626  0.00228651    1322        0.0281552   97093  #c6dbef       Wyoming    50
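Since the JSON contains more columns than the page's table shows, a minimal sketch of selecting just the displayed columns by header (df is the DataFrame built in the example above; the column list follows the JSON keys shown):
cols = ['rank', 'state', 'pop2022', 'growthRate', 'pop2010', 'growthSince2010', 'densityMi']
print(df[cols])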

Related

Web scraping with an API in Python from a URL

I am currently working on a project and want to get the table from a website with an API or web scraping.
I have the following code:
import requests
import pandas as pd
import numpy as np
url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'
resp = requests.get(url)
tables = pd.read_html(resp.text)
all_df = pd.concat(tables)
data= pd.DataFrame(all_df)
But I got the error message "No tables found", even though I want the table, which can also be downloaded as CSV.
Does anyone know what the problem is?
With some help from selenium before calling read_html:
# https://selenium-python.readthedocs.io/installation.html
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
import pandas as pd

s = Service("./chromedriver.exe")

url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'

with webdriver.Chrome(service=s) as driver:
    driver.get(url)
    df = pd.concat(pd.read_html(driver.page_source))
Output :
print(df)
State Circumcision Rate
0 West Virginia 87%
1 Michigan 86%
2 Kentucky 85%
3 Nebraska 84%
4 Ohio 84%
.. ... ...
45 Alaska 0%
46 Arizona 0%
47 Delaware 0%
48 Idaho 0%
49 Mississippi 0%
[50 rows x 2 columns]
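If this has to run on a server without a display, Chrome can be started headless; a minimal sketch reusing the Service object from above (the --headless=new flag assumes a recent Chrome):
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # run Chrome without opening a window
with webdriver.Chrome(service=s, options=options) as driver:
    driver.get(url)
    df = pd.concat(pd.read_html(driver.page_source))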
Here is one way of getting that data as a dataframe:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import json
url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
soup = bs(requests.get(url, headers=headers).text, 'html.parser')
script_w_data = soup.select_one('script[id="__NEXT_DATA__"]').text
df = pd.json_normalize(json.loads(script_w_data)['props']['pageProps']['listing'])
print(df)
Result in terminal:
fips state densityMi pop2023 pop2022 pop2020 pop2019 pop2010 growthRate growth growthSince2010 circumcisionRate
0 54 West Virginia 73.88019 1775932 1781860 1793716 1799642 1852994 -0.00333 -5928 -0.04159 0.87
1 26 Michigan 179.26454 10135438 10116069 10077331 10057961 9883640 0.00191 19369 0.02548 0.86
2 21 Kentucky 115.37702 4555777 4539130 4505836 4489190 4339367 0.00367 16647 0.04987 0.85
3 31 Nebraska 26.06024 2002052 1988536 1961504 1947985 1826341 0.00680 13516 0.09621 0.84
4 39 Ohio 290.70091 11878330 11852036 11799448 11773150 11536504 0.00222 26294 0.02963 0.84
5 18 Indiana 191.92896 6876047 6845874 6785528 6755359 6483802 0.00441 30173 0.06050 0.83
6 19 Iowa 57.89018 3233572 3219171 3190369 3175964 3046355 0.00447 14401 0.06146 0.82
7 55 Wisconsin 109.96966 5955737 5935064 5893718 5873043 5686986 0.00348 20673 0.04726 0.82
8 45 South Carolina 175.18855 5266343 5217037 5118425 5069118 4625364 0.00945 49306 0.13858 0.81
9 42 Pennsylvania 292.62222 13092796 13062764 13002700 12972667 12702379 0.00230 30032 0.03074 0.79
10 56 Wyoming 5.98207 580817 579495 576851 575524 563626 0.00228 1322 0.03050 0.79
11 15 Hawaii 231.00763 1483762 1474265 1455271 1445774 1360301 0.00644 9497 0.09076 0.78
12 20 Kansas 36.24443 2963308 2954832 2937880 2929402 2853118 0.00287 8476 0.03862 0.77
13 38 North Dakota 11.75409 811044 800394 779094 768441 672591 0.01331 10650 0.20585 0.77
14 40 Oklahoma 58.63041 4021753 4000953 3959353 3938551 3751351 0.00520 20800 0.07208 0.77
15 46 South Dakota 11.98261 908414 901165 886667 879421 814180 0.00804 7249 0.11574 0.77
16 29 Missouri 90.26083 6204710 6188111 6154913 6138318 5988927 0.00268 16599 0.03603 0.76
17 33 New Hampshire 155.90830 1395847 1389741 1377529 1371424 1316470 0.00439 6106 0.06030 0.76
18 44 Rhode Island 1074.29594 1110822 1106341 1097379 1092896 1052567 0.00405 4481 0.05535 0.76
19 47 Tennessee 171.70515 7080262 7023788 6910840 6854371 6346105 0.00804 56474 0.11569 0.76
20 51 Virginia 223.36045 8820504 8757467 8631393 8568357 8001024 0.00720 63037 0.10242 0.74
21 13 Georgia 191.59470 11019186 10916760 10711908 10609487 9687653 0.00938 102426 0.13745 0.72
22 24 Maryland 648.84362 6298325 6257958 6177224 6136855 5773552 0.00645 40367 0.09089 0.72
23 9 Connecticut 746.69537 3615499 3612314 3605944 3602762 3574097 0.00088 3185 0.01158 0.71
24 23 Maine 44.50148 1372559 1369159 1362359 1358961 1328361 0.00248 3400 0.03327 0.67
25 5 Arkansas 58.42619 3040207 3030646 3011524 3001967 2915918 0.00315 9561 0.04262 0.66
26 8 Colorado 57.86332 5997070 5922618 5773714 5699264 5029196 0.01257 74452 0.19245 0.66
27 25 Massachusetts 919.82103 7174604 7126375 7029917 6981690 6547629 0.00677 48229 0.09576 0.66
28 34 New Jersey 1283.40005 9438124 9388414 9288994 9239284 8791894 0.00529 49710 0.07350 0.66
29 50 Vermont 70.33514 648279 646545 643077 641347 625741 0.00268 1734 0.03602 0.64
30 17 Illinois 230.67908 12807072 12808884 12812508 12814324 12830632 -0.00014 -1812 -0.00184 0.63
31 27 Minnesota 73.18202 5827265 5787008 5706494 5666238 5303925 0.00696 40257 0.09867 0.63
32 36 New York 433.90472 20448194 20365879 20201249 20118937 19378102 0.00404 82315 0.05522 0.59
33 37 North Carolina 220.30026 10710558 10620168 10439388 10348993 9535483 0.00851 90390 0.12323 0.52
34 30 Montana 7.64479 1112668 1103187 1084225 1074744 989415 0.00859 9481 0.12457 0.50
35 48 Texas 116.16298 30345487 29945493 29145505 28745507 25145561 0.01336 399994 0.20679 0.50
36 35 New Mexico 17.60148 2135024 2129190 2117522 2111685 2059179 0.00274 5834 0.03683 0.49
37 22 Louisiana 108.67214 4695071 4682633 4657757 4645314 4533372 0.00266 12438 0.03567 0.45
38 49 Utah 41.66892 3423935 3373162 3271616 3220842 2763885 0.01505 50773 0.23881 0.42
39 12 Florida 416.95573 22359251 22085563 21538187 21264502 18801310 0.01239 273688 0.18924 0.35
40 41 Oregon 45.41307 4359110 4318492 4237256 4196636 3831074 0.00941 40618 0.13783 0.24
41 6 California 258.20877 40223504 39995077 39538223 39309799 37253956 0.00571 228427 0.07971 0.22
42 1 Alabama 100.65438 5097641 5073187 5024279 4999822 4779736 0.00482 24454 0.06651 0.20
43 53 Washington 120.37292 7999503 7901429 7705281 7607206 6724540 0.01241 98074 0.18960 0.15
44 32 Nevada 29.38425 3225832 3185426 3104614 3064205 2700551 0.01268 40406 0.19451 0.12
45 2 Alaska 1.29738 740339 738023 733391 731075 710231 0.00314 2316 0.04239 NaN
46 4 Arizona 64.96246 7379346 7303398 7151502 7075549 6392017 0.01040 75948 0.15446 NaN
47 10 Delaware 522.08876 1017551 1008350 989948 980743 897934 0.00912 9201 0.13321 NaN
48 16 Idaho 23.23926 1920562 1893410 1839106 1811950 1567582 0.01434 27152 0.22517 NaN
49 28 Mississippi 63.07084 2959473 2960075 2961279 2961879 2967297 -0.00020 -602 -0.00264 NaN
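Since the question also asks for the table "which also can download csv", the dataframe can be written out directly; a one-line sketch (the filename is arbitrary):
df.to_csv('circumcision_rates_by_state.csv', index=False)  # index=False drops the row numbers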

How do I create code for a VLOOKUP in Python?

df

Season      Date         Team Team_Season_Code  TS     L      Opponent Opponent_Season_Code  OS
  2019  20181109  Abilene_Chr           1_2019  94  Home   Arkansas_St              15_2019  73
  2019  20181115  Abilene_Chr           1_2019  67  Away        Denver              82_2019  61
  2019  20181122  Abilene_Chr           1_2019  72     N          Elon              70_2019  56
  2019  20181123  Abilene_Chr           1_2019  73  Away       Pacific             224_2019  71
  2019  20181124  Abilene_Chr           1_2019  60     N  UC_Riverside             306_2019  48

Overall_Season_Avg

Team_Season_Code          Team         TS         OS       MOV
         15_2019   Arkansas_St  70.909091  65.242424  5.666667
         70_2019          Elon  73.636364  71.818182  1.818182
         82_2019        Denver   74.03125   72.15625     1.875
        224_2019       Pacific  78.333333  76.466667  1.866667
        306_2019  UC_Riverside  79.545455  78.060606  1.484848
I have these two dataframes, and I want to look up the Opponent_Season_Code from df in Overall_Season_Avg's "Team_Season_Code" and bring back "TS" and "OS" to create two new columns in df called "OOS" and "OTS".
So row 1 in df should get a new column named OOS with the value 65.24... and a new column named OTS with the value 70.90...
In Excel it's a simple VLOOKUP, but I haven't been able to apply the solutions I found to the VLOOKUP questions on Stack Overflow, so I decided to post my own question. I will also say that the Overall_Season_Avg dataframe was created by Overall_Season_Avg = df.groupby(['Team_Season_Code', 'Team']).agg({'TS': np.mean, 'OS': np.mean, 'MOV': np.mean})
You can use a merge, after reworking Overall_Season_Avg a bit:
df.merge(Overall_Season_Avg
           .set_index(['Team_Season_Code', 'Team'])
           [['OS', 'TS']].add_prefix('O'),
         left_on=['Opponent_Season_Code', 'Opponent'],
         right_index=True, how='left'
         )
Output:
Season Date Team Team_Season_Code TS L Opponent Opponent_Season_Code OS OOS OTS
0 2019 20181109 Abilene_Chr 1_2019 94 Home Arkansas_St 15_2019 73 65.242424 70.909091
1 2019 20181115 Abilene_Chr 1_2019 67 Away Denver 82_2019 61 72.156250 74.031250
2 2019 20181122 Abilene_Chr 1_2019 72 N Elon 70_2019 56 71.818182 73.636364
3 2019 20181123 Abilene_Chr 1_2019 73 Away Pacific 224_2019 71 76.466667 78.333333
4 2019 20181124 Abilene_Chr 1_2019 60 N UC_Riverside 306_2019 48 78.060606 79.545455
merging only on Opponent_Season_Code/Team_Season_Code:
df.merge(Overall_Season_Avg
           .set_index('Team_Season_Code')
           [['OS', 'TS']].add_prefix('O'),
         left_on=['Opponent_Season_Code'],
         right_index=True, how='left'
         )
Output:
Season Date Team Team_Season_Code TS L Opponent Opponent_Season_Code OS OOS OTS
0 2019 20181109 Abilene_Chr 1_2019 94 Home Arkansas_St 15_2019 73 65.242424 70.909091
1 2019 20181115 Abilene_Chr 1_2019 67 Away Denver 82_2019 61 72.156250 74.031250
2 2019 20181122 Abilene_Chr 1_2019 72 N Elon 70_2019 56 71.818182 73.636364
3 2019 20181123 Abilene_Chr 1_2019 73 Away Pacific 224_2019 71 76.466667 78.333333
4 2019 20181124 Abilene_Chr 1_2019 60 N UC_Riverside 306_2019 48 78.060606 79.545455
df.merge(Overall_Season_Avg, on=['Team_Season_Code', 'Team'], how='left')
and rename the columns, or use transform instead of agg when making Overall_Season_Avg.
I haven't included the transform code because you didn't provide a reproducible example. Please provide a minimal, reproducible example:
https://stackoverflow.com/help/minimal-reproducible-example
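For completeness, the closest one-to-one analogue of Excel's VLOOKUP is Series.map with a lookup Series; a minimal sketch, assuming Team_Season_Code uniquely identifies each row of Overall_Season_Avg:
# Key the averages by Team_Season_Code alone (drop the Team index level).
lookup = Overall_Season_Avg.reset_index().set_index('Team_Season_Code')

# Each map() call is one VLOOKUP: fetch the opponent's season averages.
df['OOS'] = df['Opponent_Season_Code'].map(lookup['OS'])
df['OTS'] = df['Opponent_Season_Code'].map(lookup['TS'])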

How to scrape tbody from a collapsible table using the BeautifulSoup library?

Recently I did a project based on a Covid-19 dashboard, where I scrape data from this website, which has a collapsible table. Everything was OK until recently, when the Heroku app started showing some errors. So I reran my code on my local machine, and the error occurred while scraping tbody. I then figured out that the site I scrape has changed or updated the way the table looks, and my code can no longer grab it. I tried viewing the page source, and I am not able to find the table (tbody) that is on the page, but I am able to find tbody and all the data if I inspect a row of the table. How can I scrape the table now?
My code:
The table i have to grab:
The data you see on the page is loaded from an external URL via Ajax. You can use the requests/json modules to load it:
import json
import requests
url = 'https://www.mohfw.gov.in/data/datanew.json'
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
# print some data on screen:
for d in data:
    print('{:<30} {:<10} {:<10} {:<10} {:<10}'.format(d['state_name'], d['active'], d['positive'], d['cured'], d['death']))
Prints:
Andaman and Nicobar Islands 329 548 214 5
Andhra Pradesh 75720 140933 63864 1349
Arunachal Pradesh 670 1591 918 3
Assam 9814 40269 30357 98
Bihar 17579 51233 33358 296
Chandigarh 369 1051 667 15
Chhattisgarh 2803 9086 6230 53
... and so on.
Try:
import json
import requests
import pandas as pd

data = []
row = []

r = requests.get('https://www.mohfw.gov.in/data/datanew.json')
j = json.loads(r.text)

for i in j:
    for k in i:
        row.append(i[k])
    data.append(row)
    row = []

columns = [i for i in j[0]]
df = pd.DataFrame(data, columns=columns)
df.sno = pd.to_numeric(df.sno, errors='coerce')
df = df.sort_values('sno')
print(df.to_string())
prints:
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 0 Andaman and Nicobar Islands 329 548 214 5 403 636 226 7 35
1 1 Andhra Pradesh 75720 140933 63864 1349 72188 150209 76614 1407 28
2 2 Arunachal Pradesh 670 1591 918 3 701 1673 969 3 12
3 3 Assam 9814 40269 30357 98 10183 41726 31442 101 18
4 4 Bihar 17579 51233 33358 296 18937 54240 34994 309 10
5 5 Chandigarh 369 1051 667 15 378 1079 683 18 04
6 6 Chhattisgarh 2803 9086 6230 53 2720 9385 6610 55 22
7 7 Dadra and Nagar Haveli and Daman and Diu 412 1100 686 2 418 1145 725 2 26
8 8 Delhi 10705 135598 120930 3963 10596 136716 122131 3989 07
9 9 Goa 1657 5913 4211 45 1707 6193 4438 48 30
10 10 Gujarat 14090 61438 44907 2441 14300 62463 45699 2464 24
and so on...
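Since the JSON is a flat list of objects, the row-building loop above can also be collapsed; a minimal sketch of the same idea:
import pandas as pd
import requests

# pandas builds the columns straight from the list of dicts.
data = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(data)
df['sno'] = pd.to_numeric(df['sno'], errors='coerce')
print(df.sort_values('sno').to_string())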

ValueError errors while reading JSON file with pd.read_json

I am trying to read JSON file using pandas:
import pandas as pd
df = pd.read_json('https://data.gov.in/node/305681/datastore/export/json')
I get ValueError: arrays must all be same length
Some other JSON pages show this error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
How do I somehow read the values? I am not particular about data validity.
Looking at the json it is valid, but it's nested with data and fields:
import json
import requests
In [11]: d = json.loads(requests.get('https://data.gov.in/node/305681/datastore/export/json').text)
In [12]: list(d.keys())
Out[12]: ['data', 'fields']
You want the data as the content, and fields as the column names:
In [13]: pd.DataFrame(d["data"], columns=[x["label"] for x in d["fields"]])
Out[13]:
S. No. States/UTs 2008-09 2009-10 2010-11 2011-12 2012-13
0 1 Andhra Pradesh 183446.36 193958.45 201277.09 212103.27 222973.83
1 2 Arunachal Pradesh 360.5 380.15 407.42 419 438.69
2 3 Assam 4658.93 4671.22 4707.31 4705 4709.58
3 4 Bihar 10740.43 11001.77 7446.08 7552 8371.86
4 5 Chhattisgarh 9737.92 10520.01 12454.34 12984.44 13704.06
5 6 Goa 148.61 148 149 149.45 457.87
6 7 Gujarat 12675.35 12761.98 13269.23 14269.19 14558.39
7 8 Haryana 38149.81 38453.06 39644.17 41141.91 42342.66
8 9 Himachal Pradesh 977.3 1000.26 1020.62 1049.66 1069.39
9 10 Jammu and Kashmir 7208.26 7242.01 7725.19 6519.8 6715.41
10 11 Jharkhand 3994.77 3924.73 4153.16 4313.22 4238.95
11 12 Karnataka 23687.61 29094.3 30674.18 34698.77 36773.33
12 13 Kerala 15094.54 16329.52 16856.02 17048.89 22375.28
13 14 Madhya Pradesh 6712.6 7075.48 7577.23 7971.53 8710.78
14 15 Maharashtra 35502.28 38640.12 42245.1 43860.99 45661.07
15 16 Manipur 1105.25 1119 1137.05 1149.17 1162.19
16 17 Meghalaya 994.52 999.47 1010.77 1021.14 1028.18
17 18 Mizoram 411.14 370.92 387.32 349.33 352.02
18 19 Nagaland 831.92 833.5 802.03 703.65 617.98
19 20 Odisha 19940.15 23193.01 23570.78 23006.87 23229.84
20 21 Punjab 36789.7 32828.13 35449.01 36030 37911.01
21 22 Rajasthan 6449.17 6713.38 6696.92 9605.43 10334.9
22 23 Sikkim 136.51 136.07 139.83 146.24 146
23 24 Tamil Nadu 88097.59 108475.73 115137.14 118518.45 119333.55
24 25 Tripura 1388.41 1442.39 1569.45 1650 1565.17
25 26 Uttar Pradesh 10139.8 10596.17 10990.72 16075.42 17073.67
26 27 Uttarakhand 1961.81 2535.77 2613.81 2711.96 3079.14
27 28 West Bengal 33055.7 36977.96 39939.32 43432.71 47114.91
28 29 Andaman and Nicobar Islands 617.58 657.44 671.78 780 741.32
29 30 Chandigarh 272.88 248.53 180.06 180.56 170.27
30 31 Dadra and Nagar Haveli 70.66 70.71 70.28 73 73
31 32 Daman and Diu 18.83 18.9 18.81 19.67 20
32 33 Delhi 1.17 1.17 1.17 1.23 NA
33 34 Lakshadweep 134.64 138.22 137.98 139.86 139.99
34 35 Puducherry 111.69 112.84 113.53 116 112.89
See also json_normalize for more complex json DataFrame extraction.
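For instance, a minimal json_normalize sketch on a made-up nested payload (the records here are purely illustrative, not from the URL above):
import pandas as pd

# Hypothetical nested records, purely for illustration.
records = [
    {'name': 'Kerala', 'totals': {'y2008': 15094.54, 'y2012': 22375.28}},
    {'name': 'Goa', 'totals': {'y2008': 148.61, 'y2012': 457.87}},
]

# The nested 'totals' dict is flattened into totals.y2008, totals.y2012 columns.
print(pd.json_normalize(records))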
The following listed both the keys and the values for me:
import json
import pandas as pd
import requests

d = json.loads(requests.get('https://api.github.com/repos/akkhil2012/MachineLearning').text)
data = pd.DataFrame.from_dict(d, orient='index')
print(data)

Pivot tables using pandas

I have the following dataframe:
df1= df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]
This will produce 1.4 million records. I've taken the first 12.
Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011
I then filter on ['nat_actn_2_3'] for certain hiring codes.
h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
and can look at value_counts() to see total hires by region.
total_newhires = h3['regions'].value_counts()
total_newhires
produces:
Out[38]:
Pacific Southwest Region (R5) 42255
Pacific Northwest Region (R6) 32081
Intermountain Region (R4) 24045
Northern Region (R1) 22822
Rocky Mountain Region (R2) 17481
Southwest Region (R3) 17305
Eastern Region (R9) 11034
Research & Development(RES) 7337
Southern Region (R8) 7288
Albuquerque Service Center(ASC) 7032
Washington Office(WO) 4837
Alaska Region (R10) 4210
Job Corps(JC) 4010
nda 438
I'd like to do something like in Excel, where I can have ['regions'] as my rows and ['fy'] as my columns, giving me a total count based on ['ssno'] for each ['fy']. It would also be nice to eventually do calculations based on the numbers too, like averages and sums.
Along with looking at examples in the url: http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:
hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])
I'm wondering if groupby may be what I'm looking for?
Any help is appreciated. I've spent 3 days on this and can't seem to put it together.
So based off the answer below I did a pivot using the following code:
h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)
which produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as a value, my results came out really strange and didn't match my value_counts numbers. I'm not sure if the pivot eliminates duplicates or what, but it did not come out correctly.
ssno
fy 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
nat_actn_2_3
100 34 20 25 18 38 43 45 14 19 25 10
101 510 453 725 795 1029 1293 957 383 470 605 145
108 170 132 112 85 123 127 84 43 40 29 10
115 9203 8972 7946 9038 10139 10480 9211 8735 10482 11258 339
130 299 313 431 324 291 325 336 202 230 436 112
140 62 74 71 75 132 125 82 42 45 74 18
141 20 16 23 17 20 14 10 9 13 17 7
170 202 433 226 278 336 386 284 265 121 118 49
171 4771 4627 4234 4196 4470 4472 3270 3145 354 341 34
190 1 1 NaN NaN NaN 1 NaN NaN NaN NaN NaN
702 3141 3099 3429 3030 3758 3952 3813 2902 2329 2375 650
703 2280 2354 2225 2050 2260 2328 2172 2503 2649 2856 726
Try it like this:
h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)
To get counts, use aggfunc=len.
Also, your isin references a list of strings, but the data you provide for the 'nat_actn_2_3' column is int.
Try:
h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
if you have an older version of pandas
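On the groupby question: yes, the same crosstab can be produced with groupby plus unstack; a minimal sketch, assuming each remaining row of h3 represents one hire:
# Count hires per region per fiscal year, then spread fy across the columns.
counts = (h3.groupby(['regions', 'fy'])['ssno']
            .count()
            .unstack('fy', fill_value=0))
print(counts)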
