Web scraping with an API in Python from a URL - python

I am currently working on a project and want to get the table from a website, either through an API or by web scraping.
I have the following code:
import requests
import pandas as pd
import numpy as np
url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'
resp = requests.get(url)
tables = pd.read_html(resp.text)
all_df = pd.concat(tables)
data= pd.DataFrame(all_df)
But I got the error message "No tables found". I want the table that the site also offers as a CSV download.
Does anyone know what the problem is?

With some help from Selenium before calling read_html:
# https://selenium-python.readthedocs.io/installation.html
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
import pandas as pd
s = Service("./chromedriver.exe")
url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'
with webdriver.Chrome(service=s) as driver:
    driver.get(url)
    df = pd.concat(pd.read_html(driver.page_source))
Output :
print(df)
State Circumcision Rate
0 West Virginia 87%
1 Michigan 86%
2 Kentucky 85%
3 Nebraska 84%
4 Ohio 84%
.. ... ...
45 Alaska 0%
46 Arizona 0%
47 Delaware 0%
48 Idaho 0%
49 Mississippi 0%
[50 rows x 2 columns]
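The rates in this output are strings such as 87%, so they need converting before any arithmetic or sorting. The following is a minimal sketch in plain Python; the helper name percent_to_fraction is made up for illustration and is not part of pandas or Selenium.

```python
def percent_to_fraction(text):
    """Convert a string such as '87%' to 0.87; return None for blank cells."""
    cleaned = text.strip().rstrip("%")
    if not cleaned:
        return None
    return float(cleaned) / 100


# A few of the values shown in the output above.
rates = ["87%", "86%", "0%"]
print([percent_to_fraction(r) for r in rates])
```

On the DataFrame itself, something like df["Circumcision Rate"].str.rstrip("%").astype(float) / 100 should give the same result, assuming the column holds only strings of that form.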

Here is one way of getting that data as a dataframe:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import json
url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
soup = bs(requests.get(url, headers=headers).text, 'html.parser')
script_w_data = soup.select_one('script[id="__NEXT_DATA__"]').text
df = pd.json_normalize(json.loads(script_w_data)['props']['pageProps']['listing'])
print(df)
Result in terminal:
fips state densityMi pop2023 pop2022 pop2020 pop2019 pop2010 growthRate growth growthSince2010 circumcisionRate
0 54 West Virginia 73.88019 1775932 1781860 1793716 1799642 1852994 -0.00333 -5928 -0.04159 0.87
1 26 Michigan 179.26454 10135438 10116069 10077331 10057961 9883640 0.00191 19369 0.02548 0.86
2 21 Kentucky 115.37702 4555777 4539130 4505836 4489190 4339367 0.00367 16647 0.04987 0.85
3 31 Nebraska 26.06024 2002052 1988536 1961504 1947985 1826341 0.00680 13516 0.09621 0.84
4 39 Ohio 290.70091 11878330 11852036 11799448 11773150 11536504 0.00222 26294 0.02963 0.84
5 18 Indiana 191.92896 6876047 6845874 6785528 6755359 6483802 0.00441 30173 0.06050 0.83
6 19 Iowa 57.89018 3233572 3219171 3190369 3175964 3046355 0.00447 14401 0.06146 0.82
7 55 Wisconsin 109.96966 5955737 5935064 5893718 5873043 5686986 0.00348 20673 0.04726 0.82
8 45 South Carolina 175.18855 5266343 5217037 5118425 5069118 4625364 0.00945 49306 0.13858 0.81
9 42 Pennsylvania 292.62222 13092796 13062764 13002700 12972667 12702379 0.00230 30032 0.03074 0.79
10 56 Wyoming 5.98207 580817 579495 576851 575524 563626 0.00228 1322 0.03050 0.79
11 15 Hawaii 231.00763 1483762 1474265 1455271 1445774 1360301 0.00644 9497 0.09076 0.78
12 20 Kansas 36.24443 2963308 2954832 2937880 2929402 2853118 0.00287 8476 0.03862 0.77
13 38 North Dakota 11.75409 811044 800394 779094 768441 672591 0.01331 10650 0.20585 0.77
14 40 Oklahoma 58.63041 4021753 4000953 3959353 3938551 3751351 0.00520 20800 0.07208 0.77
15 46 South Dakota 11.98261 908414 901165 886667 879421 814180 0.00804 7249 0.11574 0.77
16 29 Missouri 90.26083 6204710 6188111 6154913 6138318 5988927 0.00268 16599 0.03603 0.76
17 33 New Hampshire 155.90830 1395847 1389741 1377529 1371424 1316470 0.00439 6106 0.06030 0.76
18 44 Rhode Island 1074.29594 1110822 1106341 1097379 1092896 1052567 0.00405 4481 0.05535 0.76
19 47 Tennessee 171.70515 7080262 7023788 6910840 6854371 6346105 0.00804 56474 0.11569 0.76
20 51 Virginia 223.36045 8820504 8757467 8631393 8568357 8001024 0.00720 63037 0.10242 0.74
21 13 Georgia 191.59470 11019186 10916760 10711908 10609487 9687653 0.00938 102426 0.13745 0.72
22 24 Maryland 648.84362 6298325 6257958 6177224 6136855 5773552 0.00645 40367 0.09089 0.72
23 9 Connecticut 746.69537 3615499 3612314 3605944 3602762 3574097 0.00088 3185 0.01158 0.71
24 23 Maine 44.50148 1372559 1369159 1362359 1358961 1328361 0.00248 3400 0.03327 0.67
25 5 Arkansas 58.42619 3040207 3030646 3011524 3001967 2915918 0.00315 9561 0.04262 0.66
26 8 Colorado 57.86332 5997070 5922618 5773714 5699264 5029196 0.01257 74452 0.19245 0.66
27 25 Massachusetts 919.82103 7174604 7126375 7029917 6981690 6547629 0.00677 48229 0.09576 0.66
28 34 New Jersey 1283.40005 9438124 9388414 9288994 9239284 8791894 0.00529 49710 0.07350 0.66
29 50 Vermont 70.33514 648279 646545 643077 641347 625741 0.00268 1734 0.03602 0.64
30 17 Illinois 230.67908 12807072 12808884 12812508 12814324 12830632 -0.00014 -1812 -0.00184 0.63
31 27 Minnesota 73.18202 5827265 5787008 5706494 5666238 5303925 0.00696 40257 0.09867 0.63
32 36 New York 433.90472 20448194 20365879 20201249 20118937 19378102 0.00404 82315 0.05522 0.59
33 37 North Carolina 220.30026 10710558 10620168 10439388 10348993 9535483 0.00851 90390 0.12323 0.52
34 30 Montana 7.64479 1112668 1103187 1084225 1074744 989415 0.00859 9481 0.12457 0.50
35 48 Texas 116.16298 30345487 29945493 29145505 28745507 25145561 0.01336 399994 0.20679 0.50
36 35 New Mexico 17.60148 2135024 2129190 2117522 2111685 2059179 0.00274 5834 0.03683 0.49
37 22 Louisiana 108.67214 4695071 4682633 4657757 4645314 4533372 0.00266 12438 0.03567 0.45
38 49 Utah 41.66892 3423935 3373162 3271616 3220842 2763885 0.01505 50773 0.23881 0.42
39 12 Florida 416.95573 22359251 22085563 21538187 21264502 18801310 0.01239 273688 0.18924 0.35
40 41 Oregon 45.41307 4359110 4318492 4237256 4196636 3831074 0.00941 40618 0.13783 0.24
41 6 California 258.20877 40223504 39995077 39538223 39309799 37253956 0.00571 228427 0.07971 0.22
42 1 Alabama 100.65438 5097641 5073187 5024279 4999822 4779736 0.00482 24454 0.06651 0.20
43 53 Washington 120.37292 7999503 7901429 7705281 7607206 6724540 0.01241 98074 0.18960 0.15
44 32 Nevada 29.38425 3225832 3185426 3104614 3064205 2700551 0.01268 40406 0.19451 0.12
45 2 Alaska 1.29738 740339 738023 733391 731075 710231 0.00314 2316 0.04239 NaN
46 4 Arizona 64.96246 7379346 7303398 7151502 7075549 6392017 0.01040 75948 0.15446 NaN
47 10 Delaware 522.08876 1017551 1008350 989948 980743 897934 0.00912 9201 0.13321 NaN
48 16 Idaho 23.23926 1920562 1893410 1839106 1811950 1567582 0.01434 27152 0.22517 NaN
49 28 Mississippi 63.07084 2959473 2960075 2961279 2961879 2967297 -0.00020 -602 -0.00264 NaN
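To see what the __NEXT_DATA__ approach does without hitting the site, here is a small sketch that uses only the standard library on an inline page. The sample HTML is hypothetical and heavily trimmed, and the names NextDataParser and extract_next_data are invented for this example. The key path props/pageProps/listing is taken from the answer above and may change if the site is restructured.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed JSON payload, or None if the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    return json.loads(raw) if raw.strip() else None


# Hypothetical page, trimmed to the parts that matter.
sample_html = """
<html><body><div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"listing": [
  {"state": "West Virginia", "circumcisionRate": 0.87},
  {"state": "Michigan", "circumcisionRate": 0.86}
]}}}
</script></body></html>
"""

payload = extract_next_data(sample_html)
rows = payload["props"]["pageProps"]["listing"]
print(rows[0]["state"], rows[0]["circumcisionRate"])
```

The resulting list of dicts is the same shape that pd.json_normalize receives in the answer above. Returning None when the tag is missing makes it easy to fall back to another method.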

Related

How to scrape the table of states?

I am trying to scrape the table from:
https://worldpopulationreview.com/states
My code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://worldpopulationreview.com/states'
page = requests.get(url)
soup = BeautifulSoup(page.text,'lxml')
table = soup.find('table', {'class': 'jsx-a3119e4553b2cac7 table is-striped is-hoverable is-fullwidth tp-table-body is-narrow'})
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)
df = pd.DataFrame(columns=headers)
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data
df
Currently returns
'NoneType' object has no attribute 'find_all'
Clearly the error occurs because the table variable is None, but I believe I have the table tag correct.
The table data is loaded dynamically by JavaScript, and bs4 cannot render JavaScript. You can still do the job by pairing bs4 with an automation tool such as Selenium and then reading the table into a pandas DataFrame.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
driver.get('https://worldpopulationreview.com/states')
driver.maximize_window()
time.sleep(8)
soup = BeautifulSoup(driver.page_source,"lxml")
#You can pull the table directly from the web page
df = pd.read_html(str(soup))[0]
print(df)
#OR
#table= soup.select_one('table[class="jsx-a3119e4553b2cac7 table is-striped is-hoverable is-fullwidth tp-table-body is-narrow"]')
# df = pd.read_html(str(table))[0]
# print(df)
Output:
Rank State 2022 Population Growth Rate ... 2010 Population Growth Since 2010 % of US Density (/mi²)
0 1 California 39995077 0.57% ... 37253956 7.36% 11.93% 257
1 2 Texas 29945493 1.35% ... 25145561 19.09% 8.93% 115
2 3 Florida 22085563 1.25% ... 18801310 17.47% 6.59% 412
3 4 New York 20365879 0.41% ... 19378102 5.10% 6.07% 432
4 5 Pennsylvania 13062764 0.23% ... 12702379 2.84% 3.90% 292
5 6 Illinois 12808884 -0.01% ... 12830632 -0.17% 3.82% 231
6 7 Ohio 11852036 0.22% ... 11536504 2.74% 3.53% 290
7 8 Georgia 10916760 0.95% ... 9687653 12.69% 3.26% 190
8 9 North Carolina 10620168 0.86% ... 9535483 11.38% 3.17% 218
9 10 Michigan 10116069 0.19% ... 9883640 2.35% 3.02% 179
10 11 New Jersey 9388414 0.53% ... 8791894 6.78% 2.80% 1277
11 12 Virginia 8757467 0.73% ... 8001024 9.45% 2.61% 222
12 13 Washington 7901429 1.26% ... 6724540 17.50% 2.36% 119
13 14 Arizona 7303398 1.05% ... 6392017 14.26% 2.18% 64
14 15 Massachusetts 7126375 0.68% ... 6547629 8.84% 2.13% 914
15 16 Tennessee 7023788 0.81% ... 6346105 10.68% 2.09% 170
16 17 Indiana 6845874 0.44% ... 6483802 5.58% 2.04% 191
17 18 Maryland 6257958 0.65% ... 5773552 8.39% 1.87% 645
18 19 Missouri 6188111 0.27% ... 5988927 3.33% 1.85% 90
19 20 Wisconsin 5935064 0.35% ... 5686986 4.36% 1.77% 110
20 21 Colorado 5922618 1.27% ... 5029196 17.76% 1.77% 57
21 22 Minnesota 5787008 0.70% ... 5303925 9.11% 1.73% 73
22 23 South Carolina 5217037 0.95% ... 4625364 12.79% 1.56% 174
23 24 Alabama 5073187 0.48% ... 4779736 6.14% 1.51% 100
24 25 Louisiana 4682633 0.27% ... 4533372 3.29% 1.40% 108
25 26 Kentucky 4539130 0.37% ... 4339367 4.60% 1.35% 115
26 27 Oregon 4318492 0.95% ... 3831074 12.72% 1.29% 45
27 28 Oklahoma 4000953 0.52% ... 3751351 6.65% 1.19% 58
28 29 Connecticut 3612314 0.09% ... 3574097 1.07% 1.08% 746
29 30 Utah 3373162 1.53% ... 2763885 22.04% 1.01% 41
30 31 Iowa 3219171 0.45% ... 3046355 5.67% 0.96% 58
31 32 Nevada 3185426 1.28% ... 2700551 17.95% 0.95% 29
32 33 Arkansas 3030646 0.32% ... 2915918 3.93% 0.90% 58
33 34 Mississippi 2960075 -0.02% ... 2967297 -0.24% 0.88% 63
34 35 Kansas 2954832 0.29% ... 2853118 3.57% 0.88% 36
35 36 New Mexico 2129190 0.27% ... 2059179 3.40% 0.64% 18
36 37 Nebraska 1988536 0.68% ... 1826341 8.88% 0.59% 26
37 38 Idaho 1893410 1.45% ... 1567582 20.79% 0.56% 23
38 39 West Virginia 1781860 -0.33% ... 1852994 -3.84% 0.53% 74
39 40 Hawaii 1474265 0.65% ... 1360301 8.38% 0.44% 230
40 41 New Hampshire 1389741 0.44% ... 1316470 5.57% 0.41% 155
41 42 Maine 1369159 0.25% ... 1328361 3.07% 0.41% 44
42 43 Rhode Island 1106341 0.41% ... 1052567 5.11% 0.33% 1070
43 44 Montana 1103187 0.87% ... 989415 11.50% 0.33% 8
44 45 Delaware 1008350 0.92% ... 897934 12.30% 0.30% 517
45 46 South Dakota 901165 0.81% ... 814180 10.68% 0.27% 12
46 47 North Dakota 800394 1.35% ... 672591 19.00% 0.24% 12
47 48 Alaska 738023 0.31% ... 710231 3.91% 0.22% 1
48 49 Vermont 646545 0.27% ... 625741 3.32% 0.19% 70
49 50 Wyoming 579495 0.23% ... 563626 2.82% 0.17% 6
[50 rows x 9 columns]
The table is rendered dynamically from JSON placed at the end of the page source, so Selenium is not needed. Simply extract the script tag and load the JSON, which also includes all the additional information from the page:
soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
Example
import requests, json
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
pd.DataFrame(
json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
)
Example
Because the data also contains additional information that is used for the map, simply select the columns you need by header.
fips state densityMi pop2022 pop2021 pop2020 pop2019 pop2010 growthRate growth growthSince2010 area fill Name rank
0 6 California 256.742 39995077 39766650 39538223 39309799 37253956 0.00574419 228427 0.0735793 155779 #084594 California 1
1 48 Texas 114.632 29945493 29545499 29145505 28745507 25145561 0.0135382 399994 0.190886 261232 #084594 Texas 2
2 12 Florida 411.852 22085563 21811875 21538187 21264502 18801310 0.0125477 273688 0.174682 53625 #084594 Florida 3
3 36 New York 432.158 20365879 20283564 20201249 20118937 19378102 0.00405821 82315 0.0509739 47126 #084594 New York 4
4 42 Pennsylvania 291.951 13062764 13032732 13002700 12972667 12702379 0.00230435 30032 0.0283715 44743 #2171b5 Pennsylvania 5
...
45 46 South Dakota 11.887 901165 893916 886667 879421 814180 0.00810926 7249 0.106838 75811 #c6dbef South Dakota 46
46 38 North Dakota 11.5997 800394 789744 779094 768441 672591 0.0134854 10650 0.190016 69001 #c6dbef North Dakota 47
47 2 Alaska 1.29332 738023 735707 733391 731075 710231 0.00314799 2316 0.0391309 570641 #c6dbef Alaska 48
48 50 Vermont 70.147 646545 644811 643077 641347 625741 0.00268916 1734 0.033247 9217 #c6dbef Vermont 49
49 56 Wyoming 5.96845 579495 578173 576851 575524 563626 0.00228651 1322 0.0281552 97093 #c6dbef Wyoming 50

Webscraping A Table Using Python and BeautifulSoup

I'm a novice learning how to web scrape using Python. I attempted to scrape Euro 2020 stats from this website: https://theanalyst.com/na/2021/06/euro-2020-player-stats. After running my initial code (see below) to gather the HTML from the webpage, I cannot locate the table tag and its data-table class. I can see the table and its data-table class when I inspect the website, but it does not appear when I print page_soup.
from urllib.request import urlopen as uReq # Web client
from bs4 import BeautifulSoup as soup # HTML data structure
url_page = 'https://theanalyst.com/na/2021/06/euro-2020-player-stats'
# Open connection & download the html from the url
uClient = uReq(url_page)
# Parses html into a soup data structure
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
The table is loaded dynamically in JSON format via sending a GET request to:
https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json
Since we're dealing with JSON data, it's easier to use the requests library to get the data.
Here is an example using the pandas library to print the table into a DataFrame (you don't have to use the pandas library).
import pandas as pd
import requests
url = "https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json"
response = requests.get(url).json()
print(pd.json_normalize(response["data"]).to_string())
Output (truncated):
player_id team_id team_name player_first_name player_last_name player age position detailed_position mins_played np_shots np_sot np_goals np_xG op_chances_created op_assists op_xA op_passes op_pass_completion_rate tackles_won interceptions recoveries avg_carry_distance avg_carry_progress carry_w_shot carry_w_goal carry_w_chance_created carry_w_assist take_ons take_ons_success_rate goal_ending total_xG shot_ending team_badge
0 103955 114 England Raheem Sterling Raheem Sterling 26 Forward Second Striker 641 14 8 3 3.82 2 1 1.18 193 0.85 5 4 23 12.98 6.73 3 0 3 1 38 52.63 6 7.08 24 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
1 56979 114 England Jordan Henderson Jordan Henderson 31 Midfielder Central Midfielder 150 1 1 1 0.32 0 0 0.06 111 0.88 0 1 11 7.83 0.49 0 0 0 0 3 66.67 0 0.00 0 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
2 78830 114 England Harry Kane Harry Kane 27 Forward Striker 649 15 7 4 3.57 5 0 0.39 159 0.70 0 3 8 10.52 3.06 2 0 2 0 15 53.33 7 6.38 21 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
3 58621 114 England Kyle Walker Kyle Walker 31 Defender Full Back 599 0 0 0 0.00 2 0 0.18 352 0.87 0 8 37 11.66 5.09 0 0 0 0 1 100.00 3 2.54 10 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
The variable response is now a dictionary (dict) whose keys and values you can access. To view and prettify the data:
from pprint import pprint
print(type(response))
pprint(response)
Output (truncated):
<class 'dict'>
{'data': [{'age': 26,
'avg_carry_distance': 12.98,
'avg_carry_progress': 6.73,
'carry_w_assist': 1,
'carry_w_chance_created': 3,
'carry_w_goal': 0,
'carry_w_shot': 3,
'detailed_position': 'Second Striker',
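The records in this payload are flat, but JSON endpoints often nest objects inside each record, which is the case pd.json_normalize handles by producing dotted column names. As a rough sketch of that idea in plain Python (the flatten helper and the nested sample record are hypothetical, not taken from the real response):

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested record for illustration only.
record = {"player": "Harry Kane", "team": {"id": 114, "name": "England"}}
print(flatten(record))
```

Lists are left untouched here, so this is only a simplified picture of what a full normalizer does.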

How to extract data from a table from a web page using Beautiful Soup

I want to extract the data from the table given in 'https://statisticstimes.com/demographics/india/indian-states-population.php' and put it in a list or a dictionary.
I am a beginner in Python. From what I have learned so far all I could do is:
import urllib.request , urllib.error , urllib.parse
from bs4 import BeautifulSoup
url = input("Enter url: ")
html = urllib.request.urlopen(url).read()
x = BeautifulSoup(html , 'html.parser')
tags = x('tr')
lst = list()
for tag in tags:
    lst.append(tag.findAll('td'))
print(lst)
You can use requests and pandas.
Here's how:
import pandas as pd
import requests
from tabulate import tabulate
url = "https://statisticstimes.com/demographics/india/indian-states-population.php"
df = pd.read_html(requests.get(url).text, flavor="bs4")[-1]
print(tabulate(df.head(10), showindex=False))
Output:
--- ---------------- -------- -------- ------- ----- ---- -------------------- ---
NCT Delhi 18710922 16787941 1922981 11.45 1.36 Malawi 63
18 Haryana 28204692 25351462 2853230 11.25 2.06 Venezuela 51
14 Kerala 35699443 33406061 2293382 6.87 2.6 Morocco 41
20 Himachal Pradesh 7451955 6864602 587353 8.56 0.54 China, Hong Kong SAR 104
16 Punjab 30141373 27743338 2398035 8.64 2.2 Mozambique 48
12 Telangana 39362732 35004000 4358732 12.45 2.87 Iraq 36
25 Goa 1586250 1458545 127705 8.76 0.12 Bahrain 153
19 Uttarakhand 11250858 10086292 1164566 11.55 0.82 Haiti 84
UT3 Chandigarh 1158473 1055450 103023 9.76 0.08 Eswatini 159
9 Gujarat 63872399 60439692 3432707 5.68 4.66 France 23
--- ---------------- -------- -------- ------- ----- ---- -------------------- ---
With:
df.to_csv("your_table.csv", index=False)
you can dump the table to a .csv file.
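Since the question asks for a list or a dictionary rather than a DataFrame, here is a sketch that collects table rows with the standard library's HTMLParser and turns them into a list of dicts. The markup is a hypothetical stand-in for the real page, the column names State and Population are assumptions, and the class name TableRows is invented for this example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup; the header names are assumptions.
sample = """
<table>
  <tr><th>State</th><th>Population</th></tr>
  <tr><td>Kerala</td><td>35699443</td></tr>
  <tr><td>Goa</td><td>1586250</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Cell values stay as strings, so convert them with int() or float() before doing arithmetic. If you already have a DataFrame, df.to_dict("records") produces the same kind of list.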

DataFrame max() does not return the maximum

Real beginner question here, but it is so simple, I'm genuinely stumped. Python/DataFrame newbie.
I've loaded a DataFrame from a Google Sheet, however any graphing or attempts at calculations are generating bogus results. Loading code:
# Setup
!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('Linear Regression - Brain vs. Body Predictor').worksheet("Raw Data")
rows = worksheet.get_all_values()
# Convert to a DataFrame and render.
import pandas as pd
df = pd.DataFrame.from_records(rows)
This seems to work fine and the data looks to be correctly loaded when I print out the DataFrame but running max() returns obviously false results. For example:
print(df[0])
print(df[0].max())
Will output:
0 3.385
1 0.48
2 1.35
3 465
4 36.33
5 27.66
6 14.83
7 1.04
8 4.19
9 0.425
10 0.101
11 0.92
12 1
13 0.005
14 0.06
15 3.5
16 2
17 1.7
18 2547
19 0.023
20 187.1
21 521
22 0.785
23 10
24 3.3
25 0.2
26 1.41
27 529
28 207
29 85
...
32 6654
33 3.5
34 6.8
35 35
36 4.05
37 0.12
38 0.023
39 0.01
40 1.4
41 250
42 2.5
43 55.5
44 100
45 52.16
46 10.55
47 0.55
48 60
49 3.6
50 4.288
51 0.28
52 0.075
53 0.122
54 0.048
55 192
56 3
57 160
58 0.9
59 1.62
60 0.104
61 4.235
Name: 0, Length: 62, dtype: object
Max: 85
Obviously, the maximum value is way out -- it should be 6654, not 85.
What on earth am I doing wrong?
First StackOverflow post, so thanks in advance.
If you check the end of your print() output, you'll see dtype: object. You'll also notice that your pandas Series appears to mix "int" values with "float" values (e.g. you have 6654 and 3.5 in the same Series).
These are good hints that you have a Series of strings, and that max() is comparing them as strings. What you want is a Series of numbers (specifically floats), compared numerically.
Check the following reproducible example:
>>> df = pd.DataFrame({'col': ['0.02', '9', '85']}, dtype=object)
>>> df.col.max()
'9'
You can check that because
>>> '9' > '85'
True
You want these values to be treated as floats instead. Use pd.to_numeric:
>>> df['col'] = pd.to_numeric(df.col)
>>> df.col.max()
85.0
For more on str and int comparison, check this question
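The same effect can be reproduced without pandas, using a handful of the values from the question as strings. This is only an illustration of the comparison rule, not of how gspread returns data in general.

```python
# A few of the values from the question, as the strings the sheet returned.
values = ["3.385", "465", "85", "6654", "2547"]

# Strings are compared character by character, so "85" wins on its leading "8".
print(max(values))   # prints 85

# Converting to float first gives the numeric maximum.
numbers = [float(v) for v in values]
print(max(numbers))  # prints 6654.0
```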

ValueError errors while reading JSON file with pd.read_json

I am trying to read JSON file using pandas:
import pandas as pd
df = pd.read_json('https://data.gov.in/node/305681/datastore/export/json')
I get ValueError: arrays must all be same length
Some other JSON pages show this error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
How do I somehow read the values? I am not particular about data validity.
The JSON itself is valid, but it is nested under two keys, data and fields:
import json
import requests
In [11]: d = json.loads(requests.get('https://data.gov.in/node/305681/datastore/export/json').text)
In [12]: list(d.keys())
Out[12]: ['data', 'fields']
You want the data as the content, and fields as the column names:
In [13]: pd.DataFrame(d["data"], columns=[x["label"] for x in d["fields"]])
Out[13]:
S. No. States/UTs 2008-09 2009-10 2010-11 2011-12 2012-13
0 1 Andhra Pradesh 183446.36 193958.45 201277.09 212103.27 222973.83
1 2 Arunachal Pradesh 360.5 380.15 407.42 419 438.69
2 3 Assam 4658.93 4671.22 4707.31 4705 4709.58
3 4 Bihar 10740.43 11001.77 7446.08 7552 8371.86
4 5 Chhattisgarh 9737.92 10520.01 12454.34 12984.44 13704.06
5 6 Goa 148.61 148 149 149.45 457.87
6 7 Gujarat 12675.35 12761.98 13269.23 14269.19 14558.39
7 8 Haryana 38149.81 38453.06 39644.17 41141.91 42342.66
8 9 Himachal Pradesh 977.3 1000.26 1020.62 1049.66 1069.39
9 10 Jammu and Kashmir 7208.26 7242.01 7725.19 6519.8 6715.41
10 11 Jharkhand 3994.77 3924.73 4153.16 4313.22 4238.95
11 12 Karnataka 23687.61 29094.3 30674.18 34698.77 36773.33
12 13 Kerala 15094.54 16329.52 16856.02 17048.89 22375.28
13 14 Madhya Pradesh 6712.6 7075.48 7577.23 7971.53 8710.78
14 15 Maharashtra 35502.28 38640.12 42245.1 43860.99 45661.07
15 16 Manipur 1105.25 1119 1137.05 1149.17 1162.19
16 17 Meghalaya 994.52 999.47 1010.77 1021.14 1028.18
17 18 Mizoram 411.14 370.92 387.32 349.33 352.02
18 19 Nagaland 831.92 833.5 802.03 703.65 617.98
19 20 Odisha 19940.15 23193.01 23570.78 23006.87 23229.84
20 21 Punjab 36789.7 32828.13 35449.01 36030 37911.01
21 22 Rajasthan 6449.17 6713.38 6696.92 9605.43 10334.9
22 23 Sikkim 136.51 136.07 139.83 146.24 146
23 24 Tamil Nadu 88097.59 108475.73 115137.14 118518.45 119333.55
24 25 Tripura 1388.41 1442.39 1569.45 1650 1565.17
25 26 Uttar Pradesh 10139.8 10596.17 10990.72 16075.42 17073.67
26 27 Uttarakhand 1961.81 2535.77 2613.81 2711.96 3079.14
27 28 West Bengal 33055.7 36977.96 39939.32 43432.71 47114.91
28 29 Andaman and Nicobar Islands 617.58 657.44 671.78 780 741.32
29 30 Chandigarh 272.88 248.53 180.06 180.56 170.27
30 31 Dadra and Nagar Haveli 70.66 70.71 70.28 73 73
31 32 Daman and Diu 18.83 18.9 18.81 19.67 20
32 33 Delhi 1.17 1.17 1.17 1.23 NA
33 34 Lakshadweep 134.64 138.22 137.98 139.86 139.99
34 35 Puducherry 111.69 112.84 113.53 116 112.89
See also json_normalize for more complex json DataFrame extraction.
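If a DataFrame is not needed, the same pairing of fields and data can be done in plain Python. The payload below is a hypothetical cut-down version with the same two keys. The assumption that values arrive as strings, and the to_float helper, are mine rather than something confirmed by the source.

```python
# Hypothetical payload with the same two top-level keys as above.
d = {
    "fields": [{"label": "S. No."}, {"label": "States/UTs"}, {"label": "2012-13"}],
    "data": [["1", "Andhra Pradesh", "222973.83"], ["33", "Delhi", "NA"]],
}

# Column names come from the labels, rows from the data list.
columns = [field["label"] for field in d["fields"]]
records = [dict(zip(columns, row)) for row in d["data"]]


def to_float(text):
    """Return a float, or None for placeholders such as 'NA'."""
    try:
        return float(text)
    except ValueError:
        return None


for record in records:
    record["2012-13"] = to_float(record["2012-13"])

print(records)
```

Handling the NA placeholder explicitly matters because it otherwise keeps the whole column as text, which is the same trap as the string-based max() discussed in the previous thread.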
The following listed both the keys and the values for me:
import json
import pandas as pd
import requests
# Parse the response body into a dict, then build a one-column DataFrame indexed by key
repo = json.loads(requests.get('https://api.github.com/repos/akkhil2012/MachineLearning').text)
data = pd.DataFrame.from_dict(repo, orient='index')
print(data)
