Python Beautiful Soup can't find specific table - python

I'm having issues scraping basketball-reference.com. I'm trying to access the "Team Per Game Stats" table but can't seem to target the correct div/table. My goal is to capture the table and bring it into a dataframe using pandas.
I've tried using soup.find and soup.find_all to find all the tables, but when I search the results I do not see the ID of the table I am looking for. See below.
import csv, time, sys, math
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
#NBA season
year = 2019
# URL of the Basketball-Reference page we will scrape
url = "https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base".format(year)
html = urlopen(url)
soup = BeautifulSoup(html,'lxml')
x = soup.find("table", id="team-stats-per_game")
print(x)
Result:
None
I expect the output to list the table elements, specifically the tr and th tags, so I can target them and bring them into a pandas df.

As Jarett mentioned, BeautifulSoup can't find your tag because the table is commented out in the page source.
While this is admittedly an amateurish approach, it works for your data.
html = requests.get(url)  # .text requires a requests response (urlopen has no .text)
table_src = html.text.split('<div class="overthrow table_container" id="div_team-stats-per_game">')[1].split('</table>')[0] + '</table>'
table = BeautifulSoup(table_src, 'lxml')
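The split trick can be checked offline on a miniature stand-in page (the div id and structure below mirror the real page, but the content is invented):

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the real page: the target table sits inside an
# HTML comment, which is why soup.find() on the full document returns None.
page = '''<html><body><div id="all_team-stats-per_game"><!--
<div class="overthrow table_container" id="div_team-stats-per_game">
<table id="team-stats-per_game"><tr><th>Team</th></tr><tr><td>Milwaukee Bucks</td></tr></table>
</div>
--></div></body></html>'''

# Slice out everything between the inner div opener and the closing </table>
marker = '<div class="overthrow table_container" id="div_team-stats-per_game">'
table_src = page.split(marker)[1].split('</table>')[0] + '</table>'
table = BeautifulSoup(table_src, 'html.parser')
print(len(table.find_all('tr')))  # 2
```

Because the split is purely string-based, it does not care that the markup is wrapped in a comment.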

The tables are rendered by JavaScript after the initial page load, so you'd either need Selenium to let them render, or the approach mentioned above. But that isn't necessary here: most of the tables are present in the raw HTML inside comments. You can use BeautifulSoup to pull out the comments, then search through those for the table tags.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
#NBA season
year = 2019
url = 'https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base'.format(year)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            tables.append(pd.read_html(each)[0])
        except Exception:
            continue
This returns a list of dataframes, so just pull out the table you want by its index position:
print(tables[3])
Output:
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 82 19780 3555 ... 615 486 1137 1608 9686
1 2.0 Golden State Warriors* 82 19805 3612 ... 625 525 1169 1757 9650
2 3.0 New Orleans Pelicans 82 19755 3581 ... 610 441 1215 1732 9466
3 4.0 Philadelphia 76ers* 82 19805 3407 ... 606 432 1223 1745 9445
4 5.0 Los Angeles Clippers* 82 19830 3384 ... 561 385 1193 1913 9442
5 6.0 Portland Trail Blazers* 82 19855 3470 ... 546 413 1135 1669 9402
6 7.0 Oklahoma City Thunder* 82 19855 3497 ... 766 425 1145 1839 9387
7 8.0 Toronto Raptors* 82 19880 3460 ... 680 437 1150 1724 9384
8 9.0 Sacramento Kings 82 19730 3541 ... 679 363 1095 1751 9363
9 10.0 Washington Wizards 82 19930 3456 ... 683 379 1154 1701 9350
10 11.0 Houston Rockets* 82 19830 3218 ... 700 405 1094 1803 9341
11 12.0 Atlanta Hawks 82 19855 3392 ... 675 419 1397 1932 9294
12 13.0 Minnesota Timberwolves 82 19830 3413 ... 683 411 1074 1664 9223
13 14.0 Boston Celtics* 82 19780 3451 ... 706 435 1052 1670 9216
14 15.0 Brooklyn Nets* 82 19980 3301 ... 539 339 1236 1763 9204
15 16.0 Los Angeles Lakers 82 19780 3491 ... 618 440 1284 1701 9165
16 17.0 Utah Jazz* 82 19755 3314 ... 663 483 1240 1728 9161
17 18.0 San Antonio Spurs* 82 19805 3468 ... 501 386 992 1487 9156
18 19.0 Charlotte Hornets 82 19830 3297 ... 591 405 1001 1550 9081
19 20.0 Denver Nuggets* 82 19730 3439 ... 634 363 1102 1644 9075
20 21.0 Dallas Mavericks 82 19780 3182 ... 533 351 1167 1650 8927
21 22.0 Indiana Pacers* 82 19705 3390 ... 713 404 1122 1594 8857
22 23.0 Phoenix Suns 82 19880 3289 ... 735 418 1279 1932 8815
23 24.0 Orlando Magic* 82 19780 3316 ... 543 445 1082 1526 8800
24 25.0 Detroit Pistons* 82 19855 3185 ... 569 331 1135 1811 8778
25 26.0 Miami Heat 82 19730 3251 ... 627 448 1208 1712 8668
26 27.0 Chicago Bulls 82 19905 3266 ... 603 351 1159 1663 8605
27 28.0 New York Knicks 82 19780 3134 ... 557 422 1151 1713 8575
28 29.0 Cleveland Cavaliers 82 19755 3189 ... 534 195 1106 1642 8567
29 30.0 Memphis Grizzlies 82 19880 3113 ... 684 448 1147 1801 8490
30 NaN League Average 82 19815 3369 ... 626 406 1155 1714 9119
[31 rows x 25 columns]
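If picking tables[3] by position feels brittle, a variation of the same comment-parsing idea locates the table by its id instead. A sketch on a miniature inline page rather than the live site (the table content is a stand-in; it assumes the real table keeps its id):

```python
import io
import pandas as pd
from bs4 import BeautifulSoup, Comment

# Stand-in page: one table hidden in a comment, as on basketball-reference
page = '''<html><body><!--
<table id="team-stats-per_game"><tr><th>Team</th><th>PTS</th></tr>
<tr><td>Milwaukee Bucks</td><td>9686</td></tr></table>
--></body></html>'''

soup = BeautifulSoup(page, 'html.parser')
df = None
for c in soup.find_all(string=lambda text: isinstance(text, Comment)):
    # Re-parse each comment's text and check for the table we want
    inner = BeautifulSoup(c, 'html.parser')
    if inner.find('table', id='team-stats-per_game') is not None:
        df = pd.read_html(io.StringIO(str(inner)))[0]
print(df.shape)  # (1, 2)
```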

As other answers mentioned, this happens because the page content is loaded by JavaScript, so fetching the source with urlopen or requests will not include the dynamic part.
As a workaround, you can use Selenium to let the dynamic content load, then take the page source from the driver and search it for the table.
Here is code that gives the result you expected; note that you will need to set up a Selenium web driver first.
from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver

def parse(url):
    driver = webdriver.Firefox()
    driver.get(url)
    sleep(3)
    source_code = driver.page_source
    driver.quit()
    return source_code

year = 2019
soup = BeautifulSoup(parse("https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base".format(year)), 'lxml')
x = soup.find("table", id="team-stats-per_game")
print(x)
Hope this helps; feel free to ask if anything is unclear.
Happy coding :)

Related

Time interval calculation for consecutive days in rows

I have a dataframe that looks like this:
Path_Version commitdates Year-Month API Age api_spec_id
168 NaN 2018-10-19 2018-10 39 521
169 NaN 2018-10-19 2018-10 39 521
170 NaN 2018-10-12 2018-10 39 521
171 NaN 2018-10-12 2018-10 39 521
172 NaN 2018-10-12 2018-10 39 521
173 NaN 2018-10-11 2018-10 39 521
174 NaN 2018-10-11 2018-10 39 521
175 NaN 2018-10-11 2018-10 39 521
176 NaN 2018-10-11 2018-10 39 521
177 NaN 2018-10-11 2018-10 39 521
178 NaN 2018-09-26 2018-09 39 521
179 NaN 2018-09-25 2018-09 39 521
I want to calculate the days elapsed from the first commitdate till the last, after sorting the commit dates first, so something like this:
Path_Version commitdates Year-Month API Age api_spec_id Days_difference
168 NaN 2018-10-19 2018-10 39 521 25
169 NaN 2018-10-19 2018-10 39 521 25
170 NaN 2018-10-12 2018-10 39 521 18
171 NaN 2018-10-12 2018-10 39 521 18
172 NaN 2018-10-12 2018-10 39 521 18
173 NaN 2018-10-11 2018-10 39 521 16
174 NaN 2018-10-11 2018-10 39 521 16
175 NaN 2018-10-11 2018-10 39 521 16
176 NaN 2018-10-11 2018-10 39 521 16
177 NaN 2018-10-11 2018-10 39 521 16
178 NaN 2018-09-26 2018-09 39 521 1
179 NaN 2018-09-25 2018-09 39 521 0
I tried first sorting the commitdates by api_spec_id since it is unique for every API, and then calculating the diff
final_api['commitdates'] = final_api.groupby('api_spec_id')['commitdates'].apply(lambda x: x.sort_values())
final_api['diff'] = final_api.groupby('api_spec_id')['commitdates'].diff() / np.timedelta64(1, 'D')
final_api['diff'] = final_api['diff'].fillna(0)
It just returns zero for the entire column. I don't want to group them; I only want to calculate the difference based on the sorted commitdates, from the first commitdate to the last in the entire dataset, in days.
Any idea how I can achieve this?
Use pandas.to_datetime, sub, min and dt.days:
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.min()).dt.days
If you need to group per API:
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.groupby(df['api_spec_id']).transform('min')).dt.days
Output:
Path_Version commitdates Year-Month API Age api_spec_id Days_difference
168 NaN 2018-10-19 2018-10 39 521 24
169 NaN 2018-10-19 2018-10 39 521 24
170 NaN 2018-10-12 2018-10 39 521 17
171 NaN 2018-10-12 2018-10 39 521 17
172 NaN 2018-10-12 2018-10 39 521 17
173 NaN 2018-10-11 2018-10 39 521 16
174 NaN 2018-10-11 2018-10 39 521 16
175 NaN 2018-10-11 2018-10 39 521 16
176 NaN 2018-10-11 2018-10 39 521 16
177 NaN 2018-10-11 2018-10 39 521 16
178 NaN 2018-09-26 2018-09 39 521 1
179 NaN 2018-09-25 2018-09 39 521 0
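The approach can be sanity-checked on a tiny frame (dates taken from the question, one api_spec_id throughout):

```python
import pandas as pd

df = pd.DataFrame({
    'commitdates': ['2018-10-19', '2018-10-12', '2018-09-25'],
    'api_spec_id': [521, 521, 521],
})
# Subtract the earliest date from every date, expressed in whole days
t = pd.to_datetime(df['commitdates'])
df['Days_difference'] = t.sub(t.min()).dt.days
print(df['Days_difference'].tolist())  # [24, 17, 0]
```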

How to get the body of the table using Python?

I am self-learning web scraping and I am trying to get the tbody of a table with BeautifulSoup.
My attempt:
import requests
from bs4 import BeautifulSoup

url = 'https://www.agrolok.pl/notowania/notowania-cen-pszenicy.htm'
page = requests.get(url).content
soup = BeautifulSoup(page, 'lxml')
table = soup.findAll('table', class_='hover')
print(table)
That's what I get:
<table class="hover"></table>
Any hints highly appreciated
The contents of the 'table.hover' element (the table data: tbody, tr, td and so on) are loaded dynamically, which is why you are not getting a tbody. You can mimic the browser with Selenium and then parse the rendered source with pandas/bs4. I use Selenium with pandas.
Script:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.agrolok.pl/notowania/notowania-cen-pszenicy.htm')
driver.maximize_window()
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'lxml')
df = pd.read_html(str(soup))[0]
d = df.rename(columns=df.iloc[0]).drop(df.index[0])
print(d)
Output:
7/4/2022 1410 1380 343.25 4.7002 1613 1640
1 7/1/2022 1410 1300 334.50 4.7176 1578 1630
2 6/30/2022 1410 1320 350.25 4.6806 1639 1650
3 6/29/2022 1500 1380 358.50 4.6809 1678 1710
4 6/28/2022 1450 1360 356.75 4.7004 1677 1690
5 6/27/2022 1450 1360 350.00 4.6965 1644 1690
6 6/24/2022 1450 1360 357.25 4.7094 1682 1700
7 6/23/2022 1450 1360 359.00 4.7096 1691 1690
8 6/22/2022 1470 1410 370.50 4.6590 1726 1750
9 6/21/2022 1500 1370 372.50 4.6460 1731 1730
10 6/20/2022 1540 1460 388.25 4.6731 1814 1780
11 6/15/2022 1560 1460 392.75 4.6642 1832 1780
12 6/14/2022 1560 1460 392.25 4.6548 1826 1780
13 6/13/2022 1540 1460 394.50 4.6313 1827 1800
14 6/10/2022 1530 1450 391.75 4.6030 1803 1760
15 6/9/2022 1540 1500 386.25 4.5826 1770 1730
16 6/8/2022 1550 1520 381.75 4.5817 1749 1730
17 6/7/2022 1500 1540 385.50 4.5855 1768 1700
18 6/6/2022 1600 1510 397.50 4.5880 1824 1760
19 6/3/2022 1560 1490 378.25 4.5908 1736 1700
20 6/2/2022 1590 1490 382.50 4.5876 1755 1710
21 6/1/2022 1590 1490 380.50 4.5891 1746 1700
22 5/31/2022 1650 1560 392.25 4.5756 1795 1750
23 5/30/2022 1670 1590 406.75 4.5869 1866 1800
24 5/27/2022 1670 1580 414.75 4.6102 1912 1700
25 5/26/2022 1650 1580 409.50 4.6135 1889 1700
26 5/25/2022 1670 1600 404.50 4.5955 1859 1700
27 5/24/2022 1690 1630 410.50 4.6107 1893 1800
28 5/23/2022 1700 1600 426.00 4.6171 1966 1860
29 5/20/2022 1700 1630 420.75 4.6366 1951 1840
30 5/19/2022 1700 1640 422.25 4.6429 1960 1850
31 5/18/2022 1700 1640 430.50 4.6528 2003 1850
32 5/17/2022 1690 1640 438.25 4.6558 2040 1850
33 5/16/2022 1690 1640 438.25 4.6724 2048 1880
34 5/13/2022 1670 1560 416.50 4.6679 1944 1800
35 5/12/2022 1670 1540 414.25 4.6841 1940 1790
36 5/11/2022 1670 1540 403.25 4.6700 1883 1790
37 5/10/2022 1680 1560 396.50 4.6761 1854 1780
38 5/9/2022 1670 1560 394.50 4.7059 1856 1780
39 5/6/2022 1600 1580 406.25 4.6979 1909 1760
40 5/5/2022 1660 1610 401.00 4.6658 1871 1780
41 5/4/2022 1660 1630 390.50 4.6777 1827 1735
42 4/29/2022 1660 1630 400.75 4.6582 1867 1720
43 4/28/2022 1670 1640 416.50 4.6915 1954 1740
44 4/27/2022 1670 1630 418.25 4.7076 1969 1720
45 4/26/2022 1660 1640 415.25 4.6429 1928 1685
46 4/25/2022 1665 1630 408.25 4.6405 1894 1670
47 4/22/2022 1665 1650 407.00 4.6361 1887 1690
48 4/21/2022 1660 1650 405.75 4.6523 1888 1690
49 4/20/2022 1660 1660 398.50 4.6295 1845 1700
50 4/19/2022 1680 1660 399.50 4.6361 1852 1740
51 4/15/2022 1690 1660 401.00 4.6378 1860 1770
52 4/14/2022 1690 1660 401.00 4.6447 1863 1770
53 4/13/2022 1680 1630 403.00 4.6460 1872 1780
54 4/12/2022 1650 1620 399.25 4.6626 1862 1700
55 4/11/2022 1630 1590 379.50 4.6451 1763 1670
56 4/8/2022 1650 1610 372.75 4.6405 1730 1660
57 4/7/2022 1650 1610 363.75 4.6478 1691 1670
58 4/6/2022 1650 1600 364.00 4.6539 1694 1670
59 4/5/2022 1650 1620 364.50 4.6317 1688 1680
60 4/4/2022 1640 1610 363.75 4.6373 1687 1680
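The header-promotion step at the end (`df.rename(columns=df.iloc[0]).drop(df.index[0])`) is worth seeing in isolation, since pd.read_html often returns the header as data row 0; the values below are invented:

```python
import pandas as pd

# read_html returned the header as the first data row; promote it to columns
df = pd.DataFrame([['Date', 'Price'], ['7/4/2022', '1410'], ['7/1/2022', '1410']])
d = df.rename(columns=df.iloc[0]).drop(df.index[0])
print(list(d.columns))  # ['Date', 'Price']
print(len(d))           # 2
```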
soup = BeautifulSoup(HTML, 'html.parser')
# the first argument to find tells it what tag to search for;
# the second is a dict of attr->value pairs to filter the results
table = soup.find("table", {"title": "TheTitle"})
rows = []
for row in table.findAll("tr"):
    rows.append(row)
# now rows contains each tr in the table (as a BeautifulSoup object)
# and you can search them to pull out the times
tbody = table.find('tbody')

Python scraping an expandable table(BeautifulSoup)?

I have an issue that I can't seem to find addressed online.
I want to scrape a table from this website: https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html
This is the table I wanted to scrape:
I was able to scrape it, but it stops at the "Show all" button.
Is there a way for me to expand this table and then scrape it?
Here is my code (it's a mess, as I just wrote it, but enough to get the idea):
import requests
import pandas as pd
from bs4 import BeautifulSoup

def connect_add():
    # giving URL a var
    url = 'https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html'
    # sending request to URL
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    tble = soup.find("table", class_="svelte-2wimac")
    table_rows = tble.find_all('tr')
    data = []
    for rows in table_rows:
        prepare = []
        for td in rows.find_all('td'):
            prepare.append(td.text)
        data.append(prepare)
    df_side = pd.DataFrame(data)
    display(df_side.head(50))

connect_add()
The data is loaded from an external source. You can use this example to load it:
import pandas as pd
df = pd.read_json(
"https://static01.nyt.com/newsgraphics/2021/01/19/world-vaccinations-tracker/3bf66651fd690992142ef2a7e233e8fdedcdd6c5/latest.json"
)
print(df)
Prints:
geoid location last_updated total_vaccinations population people_vaccinated people_fully_vaccinated
0 DZA Algeria 2021-02-19 75000 43053054 NaN NaN
1 MOZ Mozambique 2021-03-23 57305 30366036 57305.0 NaN
2 CPV Cape Verde 2021-03-24 2184 549935 2184.0 NaN
3 MUS Mauritius 2021-03-24 117323 1265711 117323.0 NaN
4 STP Sao Tome and Principe 2021-03-29 9724 215056 9724.0 NaN
5 ARM Armenia 2021-03-31 565 2957731 565.0 NaN
6 MMR Myanmar 2021-03-31 1040000 54045420 1000000.0 40000.0
7 SYR Syria 2021-04-08 2500 17070135 2500.0 NaN
8 HND Honduras 2021-04-09 57639 9746117 55000.0 2639.0
9 TCA Turks and Caicos Islands 2021-04-11 25039 38191 15039.0 10000.0
10 VEN Venezuela 2021-04-12 250000 28515829 250000.0 NaN
11 JAM Jamaica 2021-04-13 135473 2948279 135473.0 NaN
12 COG Congo 2021-04-14 14297 5380508 14297.0 NaN
13 FLK Falkland Islands 2021-04-14 4407 3398 2632.0 1775.0
14 TLS Timor 2021-04-14 2629 1293119 2629.0 NaN
15 NRU Nauru 2021-04-15 700 12581 700.0 NaN
16 SSD South Sudan 2021-04-15 947 11062113 947.0 NaN
17 FJI Fiji 2021-04-16 56000 889953 56000.0 NaN
18 DJI Djibouti 2021-04-17 10246 973560 10246.0 NaN
19 LSO Lesotho 2021-04-17 16000 2125268 16000.0 NaN
20 LBY Libya 2021-04-17 750 6777452 750.0 NaN
21 NER Niger 2021-04-17 1366 23310715 1366.0 NaN
22 SOM Somalia 2021-04-17 117567 15442905 117567.0 NaN
23 TGO Togo 2021-04-17 160000 8082366 160000.0 NaN
24 EGY Egypt 2021-04-18 660000 100388073 660000.0 NaN
25 MRT Mauritania 2021-04-18 7038 4525696 7038.0 NaN
26 SGP Singapore 2021-04-18 2213888 5703569 1364124.0 849764.0
27 COM Comoros 2021-04-21 13440 850886 13440.0 NaN
28 MSR Montserrat 2021-04-21 1909 5900 1293.0 616.0
29 AFG Afghanistan 2021-04-22 240000 38041754 240000.0 NaN
30 AIA Anguilla 2021-04-22 6898 14731 6115.0 783.0
31 ATG Antigua and Barbuda 2021-04-22 29754 97118 29754.0 NaN
32 MCO Monaco 2021-04-22 24390 38964 12758.0 11632.0
33 AGO Angola 2021-04-23 456349 31825295 456349.0 NaN
34 BLR Belarus 2021-04-23 328500 9466856 244000.0 84500.0
35 BRN Brunei 2021-04-23 10715 433285 10715.0 NaN
36 GAB Gabon 2021-04-23 8897 2172579 6895.0 2002.0
37 IRQ Iraq 2021-04-23 298377 39309783 298377.0 NaN
38 SDN Sudan 2021-04-23 140227 42813238 140227.0 NaN
39 GMB Gambia 2021-04-24 20922 2347706 20922.0 NaN
40 NIC Nicaragua 2021-04-24 135130 6545502 135130.0 NaN
41 COD Democratic Republic of Congo 2021-04-25 1700 86790567 1700.0 NaN
42 SWZ Eswatini 2021-04-25 34897 1148130 34897.0 NaN
43 MLI Mali 2021-04-25 49903 19658031 49903.0 NaN
44 PSE Palestine 2021-04-25 213989 4685306 170109.0 43880.0
45 PNG Papua New Guinea 2021-04-25 2900 8776109 2900.0 NaN
46 GUY Guyana 2021-04-26 126800 782766 124000.0 2800.0
47 LAO Laos 2021-04-26 184387 7169455 126072.0 58315.0
48 TON Tonga 2021-04-26 5367 104494 5367.0 NaN
49 BHS Bahamas 2021-04-27 25692 389482 25692.0 NaN
50 BIH Bosnia and Herzegovina 2021-04-27 106464 3301000 83260.0 23204.0
51 SLB Solomon Islands 2021-04-27 4890 669823 4890.0 NaN
52 UZB Uzbekistan 2021-04-27 600369 33580650 600369.0 NaN
53 GNQ Equatorial Guinea 2021-04-28 75518 1355986 64646.0 10872.0
54 KEN Kenya 2021-04-28 853081 52573973 853081.0 NaN
55 KGZ Kyrgyzstan 2021-04-28 27858 6456900 27000.0 858.0
56 CMR Cameroon 2021-04-29 11000 25876380 11000.0 NaN
57 BWA Botswana 2021-04-30 49882 2303697 49882.0 NaN
58 GHA Ghana 2021-04-30 849527 30417856 849527.0 NaN
59 VNM Vietnam 2021-04-30 509855 96462106 509855.0 NaN
60 VCT Saint Vincent and the Grenadines 2021-05-01 14526 110589 NaN NaN
61 BMU Bermuda 2021-05-02 58193 63918 32877.0 25216.0
62 NLD Netherlands 2021-05-02 5651843 17332850 4448730.0 NaN
63 PRY Paraguay 2021-05-02 143441 7044636 131013.0 12428.0
64 AND Andorra 2021-05-03 28881 77142 24182.0 4699.0
65 BOL Bolivia 2021-05-03 878563 11513100 637694.0 240869.0
66 CRI Costa Rica 2021-05-03 950252 5047561 605099.0 345153.0
67 WSM Samoa 2021-05-03 7435 197097 NaN NaN
68 SYC Seychelles 2021-05-03 127721 97625 68045.0 59676.0
69 JOR Jordan 2021-05-04 1091048 10101694 805020.0 286028.0
70 NZL New Zealand 2021-05-04 304900 4917000 217603.0 87297.0
71 KNA Saint Kitts and Nevis 2021-05-04 13070 52834 12943.0 127.0
72 ETH Ethiopia 2021-05-05 1215934 112078730 NaN NaN
73 LIE Liechtenstein 2021-05-05 13829 38019 9645.0 4184.0
74 MLT Malta 2021-05-05 359429 502653 246698.0 112731.0
75 OMN Oman 2021-05-05 326269 4974986 253000.0 73269.0
76 CHE Switzerland 2021-05-05 3001029 8574832 1997717.0 1003312.0
77 CYP Cyprus 2021-05-06 332423 1198575 252792.0 79631.0
78 SLV El Salvador 2021-05-06 1114544 6453553 958828.0 155716.0
79 GRD Grenada 2021-05-06 17000 112003 13000.0 4000.0
80 KWT Kuwait 2021-05-06 1440000 4207083 NaN NaN
81 LBN Lebanon 2021-05-06 509705 6855713 325383.0 184322.0
82 LUX Luxembourg 2021-05-06 227314 619896 165376.0 61938.0
83 NOR Norway 2021-05-06 1919369 5347896 1465851.0 453518.0
84 PAK Pakistan 2021-05-06 3320304 216565318 NaN NaN
85 PER Peru 2021-05-06 1939155 32510453 1284692.0 654463.0
86 ESP Spain 2021-05-06 19048132 47076781 13271511.0 5956451.0
87 BLZ Belize 2021-05-07 47675 390353 47675.0 NaN
88 BRA Brazil 2021-05-07 46875460 211049527 31722544.0 15152916.0
89 CYM Cayman Islands 2021-05-07 69772 64948 37470.0 32302.0
90 COL Colombia 2021-05-07 6096661 50339443 3861416.0 2235245.0
91 DMA Dominica 2021-05-07 32008 71808 18864.0 13144.0
92 ECU Ecuador 2021-05-07 1245822 17373662 981620.0 264202.0
93 DEU Germany 2021-05-07 34408840 83132799 26872478.0 7572228.0
94 GRL Greenland 2021-05-07 14278 56225 8994.0 5284.0
95 GIN Guinea 2021-05-07 173623 12771246 116436.0 57187.0
96 ISL Iceland 2021-05-07 184304 361313 138577.0 53658.0
97 IRN Iran 2021-05-07 1485287 82913906 1231652.0 253635.0
98 IRL Ireland 2021-05-07 1799190 4941444 1305178.0 494012.0
99 KAZ Kazakhstan 2021-05-07 2158924 18513930 1634939.0 523985.0
100 NAM Namibia 2021-05-07 36417 2494530 34346.0 2071.0
101 NPL Nepal 2021-05-07 2453512 28608710 2091511.0 362001.0
102 RWA Rwanda 2021-05-07 350400 12626950 350400.0 NaN
103 SMR San Marino 2021-05-07 34011 33860 21389.0 12622.0
104 SWE Sweden 2021-05-07 3679451 10285453 2852689.0 826762.0
105 UGA Uganda 2021-05-07 395805 44269594 395805.0 NaN
106 ALB Albania 2021-05-08 596766 2854191 NaN NaN
107 ABW Aruba 2021-05-08 80699 106314 55744.0 24955.0
108 BRB Barbados 2021-05-08 75476 287025 75476.0 NaN
109 BEL Belgium 2021-05-08 4591359 11484055 3527895.0 1084263.0
110 BTN Bhutan 2021-05-08 481491 763092 481491.0 NaN
111 CHL Chile 2021-05-08 15703842 18952038 8559854.0 7143988.0
112 DNK Denmark 2021-05-08 2339464 5818553 1489198.0 850266.0
113 DOM Dominican Republic 2021-05-08 2345528 10738958 1535083.0 810445.0
114 FIN Finland 2021-05-08 2154469 5520314 1943842.0 210627.0
115 FRA France 2021-05-08 25414386 67059887 17692900.0 7832913.0
116 GEO Georgia 2021-05-08 58533 3720382 58533.0 NaN
117 GIB Gibraltar 2021-05-08 74256 33701 38727.0 35529.0
118 GRC Greece 2021-05-08 3647689 10716322 2450349.0 1197340.0
119 GTM Guatemala 2021-05-08 206951 16604026 204459.0 2492.0
120 MDV Maldives 2021-05-08 431792 530953 300906.0 130886.0
121 MEX Mexico 2021-05-08 21228359 127575529 14148207.0 9440251.0
122 MDA Moldova 2021-05-08 184660 2657637 161266.0 23394.0
123 MAR Morocco 2021-05-08 9864561 36471769 5473809.0 4390752.0
124 POL Poland 2021-05-08 13670541 37970874 10185393.0 3650119.0
125 ROU Romania 2021-05-08 5891855 19356544 3580368.0 2314812.0
126 LCA Saint Lucia 2021-05-08 25200 182790 NaN NaN
127 SEN Senegal 2021-05-08 427377 16296364 427377.0 NaN
128 SLE Sierra Leone 2021-05-08 64966 7813215 58250.0 6716.0
129 SVK Slovakia 2021-05-08 1792674 5454073 1209044.0 583630.0
130 ZAF South Africa 2021-05-08 382480 58558270 382480.0 382480.0
131 SUR Suriname 2021-05-08 90338 581363 45420.0 44918.0
132 TUN Tunisia 2021-05-08 499369 11694719 350426.0 148943.0
133 UKR Ukraine 2021-05-08 863085 44385155 862639.0 446.0
134 GBR United Kingdom 2021-05-08 53041048 66834405 35371669.0 17669379.0
135 ZMB Zambia 2021-05-08 77348 17861030 77348.0 NaN
136 ARG Argentina 2021-05-09 9082597 44938712 7688877.0 1393720.0
137 AUS Australia 2021-05-09 2654338 25364307 NaN NaN
138 AUT Austria 2021-05-09 3632879 8877067 2665516.0 972493.0
139 AZE Azerbaijan 2021-05-09 1687397 10023318 1005678.0 681719.0
140 BHR Bahrain 2021-05-09 1375967 1641172 797181.0 578786.0
141 BGD Bangladesh 2021-05-09 9316086 163046161 5819900.0 3496186.0
142 BGR Bulgaria 2021-05-09 938064 6975761 646068.0 291996.0
143 KHM Cambodia 2021-05-09 2884922 16486542 1773994.0 1110928.0
144 CAN Canada 2021-05-09 15917555 37589262 14668624.0 1248931.0
145 CHN China 2021-05-09 324307000 1397715000 NaN NaN
146 CIV Cote d'Ivoire 2021-05-09 262639 25716544 262639.0 NaN
147 HRV Croatia 2021-05-09 1131607 4067500 879312.0 252295.0
148 CUW Curacao 2021-05-09 109444 157538 77141.0 32303.0
149 CZE Czechia 2021-05-09 3654376 10669709 2610990.0 1058179.0
150 EST Estonia 2021-05-09 532605 1326590 373391.0 159214.0
151 FRO Faeroe Islands 2021-05-09 23519 48678 16896.0 6623.0
152 HKG Hong Kong 2021-05-09 1741682 7451000 1071488.0 670194.0
153 HUN Hungary 2021-05-09 6809350 9769949 4305775.0 2503575.0
154 IND India 2021-05-09 168304868 1366417754 133854676.0 34450192.0
155 IDN Indonesia 2021-05-09 21993299 270625568 13349469.0 8643830.0
156 IMN Isle of Man 2021-05-09 75783 84584 59932.0 15851.0
157 ISR Israel 2021-05-09 10501225 9053300 5422082.0 5079143.0
158 ITA Italy 2021-05-09 24054000 60297396 16823066.0 7401862.0
159 JPN Japan 2021-05-09 4436325 126264931 3277886.0 1158439.0
160 LVA Latvia 2021-05-09 395512 1912789 316665.0 79647.0
161 LTU Lithuania 2021-05-09 1162170 2786844 777019.0 385151.0
162 MAC Macao 2021-05-09 118687 631636 77597.0 41241.0
163 MWI Malawi 2021-05-09 319323 18628747 319323.0 NaN
164 MYS Malaysia 2021-05-09 1766651 31949777 1089637.0 677014.0
165 MNG Mongolia 2021-05-09 2213376 3225167 1590636.0 622740.0
166 MNE Montenegro 2021-05-09 109507 622137 78760.0 30747.0
167 NGA Nigeria 2021-05-09 1665698 200963599 1665698.0 NaN
168 MKD North Macedonia 2021-05-09 107978 2083459 107978.0 NaN
169 PAN Panama 2021-05-09 780569 4246439 524958.0 255610.0
170 PHL Philippines 2021-05-09 2408781 108116615 1957511.0 451270.0
171 PRT Portugal 2021-05-09 3963372 10269417 2858389.0 1104961.0
172 QAT Qatar 2021-05-09 1813240 2832067 1115842.0 697398.0
173 RUS Russia 2021-05-09 21754829 144373535 13129704.0 8625125.0
174 SAU Saudi Arabia 2021-05-09 10584301 34268528 NaN NaN
175 SRB Serbia 2021-05-09 3798942 6944975 2149705.0 1649237.0
176 SVN Slovenia 2021-05-09 737817 2087946 484949.0 252868.0
177 KOR South Korea 2021-05-09 4181003 51709098 3674729.0 506274.0
178 LKA Sri Lanka 2021-05-09 1125740 21803000 928400.0 197340.0
179 TWN Taiwan 2021-05-09 92049 23780452 NaN NaN
180 THA Thailand 2021-05-09 1809894 69625582 1296440.0 513454.0
181 TTO Trinidad and Tobago 2021-05-09 61120 1394973 60174.0 946.0
182 TUR Turkey 2021-05-09 24918773 83429615 14585980.0 10332793.0
183 ARE United Arab Emirates 2021-05-09 11145934 9770529 NaN NaN
184 URY Uruguay 2021-05-09 2005442 3461734 1228151.0 777291.0
185 ZWE Zimbabwe 2021-05-09 684243 14645468 526066.0 158177.0
186 USA United States 2021-05-09 259716989 331811257 152116936.0 114258244.0
187 OWID_WRL World NaN 1297259952 7673533970 641081197.0 309613453.0
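The same pattern works for any JSON endpoint you spot in the browser's DevTools Network tab. A minimal offline sketch, with a one-record payload invented in the same shape as the tracker JSON:

```python
import io
import pandas as pd

# Invented payload mirroring the tracker's record shape
payload = '[{"geoid": "DZA", "location": "Algeria", "total_vaccinations": 75000}]'
df = pd.read_json(io.StringIO(payload))
print(df.loc[0, 'location'])  # Algeria
```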

NonType Object when transforming a scraped table to DataFrame

I am trying to scrape a list of stocks tickers that are displayed in a table in the following link: http://www.advfn.com/nyse/newyorkstockexchange.asp?companies=A
I scraped the table using Beautiful Soup, but when I transform it to a pandas DataFrame I get an error:
TypeError: 'NoneType' object is not callable
I tried the following code:
url = 'http://www.advfn.com/nyse/newyorkstockexchange.asp?companies=A'
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
table = soup.find("table",{"class":"market tab1"})
df = pd.read_html(table)
but it does not work. How do I solve it, and why do I get the error?
full error log:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
796 try:
--> 797 tables = p.parse_tables()
798 except Exception as caught:
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in parse_tables(self)
212 def parse_tables(self):
--> 213 tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
214 return (self._build_table(table) for table in tables)
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in _build_doc(self)
618 # try to parse the input in the simplest way
--> 619 r = parse(self.io, parser=parser)
620 try:
~/anaconda3/lib/python3.7/site-packages/lxml/html/__init__.py in parse(filename_or_url, parser, base_url, **kw)
939 parser = html_parser
--> 940 return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
941
src/lxml/etree.pyx in lxml.etree.parse()
src/lxml/parser.pxi in lxml.etree._parseDocument()
TypeError: 'NoneType' object is not callable
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-23-c3e05c494f63> in <module>
5 table = soup.find("table",{"class":"market tab1"})
6 #print(table)
----> 7 df = pd.read_html(table)
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
985 decimal=decimal, converters=converters, na_values=na_values,
986 keep_default_na=keep_default_na,
--> 987 displayed_only=displayed_only)
~/anaconda3/lib/python3.7/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
799 # if `io` is an io-like object, check if it's seekable
800 # and try to rewind it before trying the next parser
--> 801 if hasattr(io, 'seekable') and io.seekable():
802 io.seek(0)
803 elif hasattr(io, 'seekable') and not io.seekable():
TypeError: 'NoneType' object is not callable
Beginning of the table:
<table cellpadding="0" cellspacing="1" class="market tab1" width="610">
<colgroup><col/><col/><col class="c"/></colgroup>
<tr><td class="tabh" colspan="3"><b>Companies listed on the NYSE</b></td></tr>
<tr><th>Equity</th><th>Symbol</th><th>Info</th></tr>
<tr class="ts0"><td align="left">A K Steel</td><td>AKS</td><td><img src="/s/stock-chart.gif"/><img src="/s/stock-news.gif"/><img src="/s/fundamentals.gif"/><img src="/s/stock-trades.gif"/></td></tr>
You are passing a <class 'bs4.element.Tag'> element into pandas read_html. You need to convert it to a string.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www.advfn.com/nyse/newyorkstockexchange.asp?companies=A'
res = requests.get(url)
soup = BeautifulSoup(res.content,'lxml')
table = soup.find("table",{"class":"market tab1"})
df = pd.read_html(str(table))
print(df)
Outputs:
[ 0 1 2
0 Companies listed on the NYSE NaN NaN
1 Equity Symbol Info
2 A K Steel AKS NaN
3 A M R AMR NaN
4 A M R Cp 7.875 AAR NaN
5 A V X AVX NaN
6 A a R AIR NaN
7 A.h. Belo Corporation AHC NaN
8 Aaron Rents RNT.A NaN
9 Aaron Rents RNT NaN
10 Aarons Cl A AAN.A NaN
11 Aarons Inc. AAN NaN
12 Ab Svensk Cdss Arbmn CBJ NaN
13 Ab Svensk Ekport AXF NaN
14 Ab Svensk Ekportkrdt SQT NaN
15 Ab Svensk Ekportkred DVK NaN
16 Ab Svensk Ekportkred IWK NaN
17 Ab Svensk Ekportkred RCW NaN
18 Ab Svensk Ekportkred EOA NaN
19 Ab Svensk Msci Arn MIS NaN
20 Ab Svensk Russell REU NaN
21 Ab Svensk Sp Arns SAD NaN
22 Ab Svensk Sp Arns MHG NaN
23 Abb ABB NaN
24 Abbott Labs ABT NaN
25 Abercrombie & Fitch ANF NaN
26 Abitibi ABY NaN
27 Abm ABM NaN
28 Acadia AKR NaN
29 Acc Bear Amex Egy IMW NaN
.. ... ... ...
194 Ashland ASH NaN
195 Aspen Insurance AHL NaN
196 Assisted Living Concepts (nevada ALC NaN
197 Associated Estates AEC NaN
198 Assurant AIZ NaN
199 Assured Guaranty AGO NaN
200 Astoria AF NaN
201 Astrazeneca AZN NaN
202 Atlanta Gas Light ATG NaN
203 Atlas Pipeline APL NaN
204 Atlas Pipeline Holdings Lp AHD NaN
205 Atmos ATO NaN
206 Att T NaN
207 Att ATT NaN
208 Atwood Oceanics ATW NaN
209 Au Optronics AUO NaN
210 Autoliv ALV NaN
211 Autonation AN NaN
212 Autozone AZO NaN
213 Av Svensk Ekportkred NEH NaN
214 Avalonbay AVB NaN
215 Aventine Renew Enrgy AVR NaN
216 Avery Dennison AVY NaN
217 Avis Budget Grp. CAR NaN
218 Avista AVA NaN
219 Avnet AVT NaN
220 Avon Products AVP NaN
221 Axa AXA NaN
222 Axis AXS NaN
223 Azz AZZ NaN
[224 rows x 3 columns]]
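The Tag-vs-string distinction can be reproduced offline; the two-row table below is a stand-in for the real listing:

```python
import io
import pandas as pd
from bs4 import BeautifulSoup

# Two-row stand-in for the real listing table
html_doc = ('<table class="market tab1">'
            '<tr><td>A K Steel</td><td>AKS</td></tr>'
            '<tr><td>Abb</td><td>ABB</td></tr></table>')
soup = BeautifulSoup(html_doc, 'html.parser')
table = soup.find('table', {'class': 'market tab1'})
# `table` is a bs4 Tag; read_html wants markup, so stringify it first
dfs = pd.read_html(io.StringIO(str(table)))
print(dfs[0].shape)  # (2, 2)
```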

Python Pandas Dataframe assignment

I am following a Lynda tutorial where they use the following code:
import pandas as pd
import seaborn
flights = seaborn.load_dataset('flights')
flights_indexed = flights.set_index(['year','month'])
flights_unstacked = flights_indexed.unstack()
flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
and it works perfectly in the video. However, in my case the last line keeps raising an error:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
I know in the video they are using Python 2; however, I have Python 3, since I am learning for work (which uses Python 3). Most of the differences I have been able to figure out, but I cannot figure out how to create this new column called 'total' with the sums of the passengers.
The root cause of this error message is the categorical nature of the month column:
In [42]: flights.dtypes
Out[42]:
year int64
month category
passengers int64
dtype: object
In [43]: flights.month.cat.categories
Out[43]: Index(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], dtype='object')
and you are trying to add a column labelled total, which is not an existing category - pandas doesn't like that.
Workaround:
In [45]: flights['month'] = flights.month.cat.add_categories('total')  # inplace= on .cat methods was removed in pandas 2.0
In [46]: x = flights.pivot(index='year', columns='month', values='passengers')
In [47]: x['total'] = x.sum(1)
In [48]: x
Out[48]:
month January February March April May June July August September October November December total
year
1949 112.0 118.0 132.0 129.0 121.0 135.0 148.0 148.0 136.0 119.0 104.0 118.0 1520.0
1950 115.0 126.0 141.0 135.0 125.0 149.0 170.0 170.0 158.0 133.0 114.0 140.0 1676.0
1951 145.0 150.0 178.0 163.0 172.0 178.0 199.0 199.0 184.0 162.0 146.0 166.0 2042.0
1952 171.0 180.0 193.0 181.0 183.0 218.0 230.0 242.0 209.0 191.0 172.0 194.0 2364.0
1953 196.0 196.0 236.0 235.0 229.0 243.0 264.0 272.0 237.0 211.0 180.0 201.0 2700.0
1954 204.0 188.0 235.0 227.0 234.0 264.0 302.0 293.0 259.0 229.0 203.0 229.0 2867.0
1955 242.0 233.0 267.0 269.0 270.0 315.0 364.0 347.0 312.0 274.0 237.0 278.0 3408.0
1956 284.0 277.0 317.0 313.0 318.0 374.0 413.0 405.0 355.0 306.0 271.0 306.0 3939.0
1957 315.0 301.0 356.0 348.0 355.0 422.0 465.0 467.0 404.0 347.0 305.0 336.0 4421.0
1958 340.0 318.0 362.0 348.0 363.0 435.0 491.0 505.0 404.0 359.0 310.0 337.0 4572.0
1959 360.0 342.0 406.0 396.0 420.0 472.0 548.0 559.0 463.0 407.0 362.0 405.0 5140.0
1960 417.0 391.0 419.0 461.0 472.0 535.0 622.0 606.0 508.0 461.0 390.0 432.0 5714.0
UPDATE: alternatively if you don't want to touch the original DF you can get rid of categorical columns in the flights_unstacked DF:
In [76]: flights_unstacked.columns = \
...: flights_unstacked.columns \
...: .set_levels(flights_unstacked.columns.get_level_values(1).categories,
...: level=1)
...:
In [77]: flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
In [78]: flights_unstacked
Out[78]:
passengers
month January February March April May June July August September October November December total
year
1949 112 118 132 129 121 135 148 148 136 119 104 118 1520
1950 115 126 141 135 125 149 170 170 158 133 114 140 1676
1951 145 150 178 163 172 178 199 199 184 162 146 166 2042
1952 171 180 193 181 183 218 230 242 209 191 172 194 2364
1953 196 196 236 235 229 243 264 272 237 211 180 201 2700
1954 204 188 235 227 234 264 302 293 259 229 203 229 2867
1955 242 233 267 269 270 315 364 347 312 274 237 278 3408
1956 284 277 317 313 318 374 413 405 355 306 271 306 3939
1957 315 301 356 348 355 422 465 467 404 347 305 336 4421
1958 340 318 362 348 363 435 491 505 404 359 310 337 4572
1959 360 342 406 396 420 472 548 559 463 407 362 405 5140
1960 417 391 419 461 472 535 622 606 508 461 390 432 5714
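The pitfall and the workaround can be reproduced without seaborn; the months and passenger counts below are invented stand-ins for the flights data:

```python
import pandas as pd

# Mini version of the flights data with a categorical month column
df = pd.DataFrame({
    'year': [1949, 1949, 1950, 1950],
    'month': pd.Categorical(['Jan', 'Feb', 'Jan', 'Feb']),
    'passengers': [112, 118, 115, 126],
})
# Register 'total' as a category before using it as a column label
df['month'] = df['month'].cat.add_categories('total')
x = df.pivot(index='year', columns='month', values='passengers')
x['total'] = x.sum(axis=1)
print(x['total'].tolist())
```

Without the add_categories call, assigning `x['total']` trips over the CategoricalIndex that pivoting a categorical column produces.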
