Pandas read_html error when reading in a Wikipedia table - python

I'm trying to read in a table using read_html
import requests
import pandas as pd
import numpy as np
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
resp = requests.get(url)
tables = pd.read_html(resp.text)
But I get this error
IndexError: list index out of range
Other Wiki pages work fine. What's up with this page and how do I solve the above error?

Seems like the table can't be read because of the jQuery table sorter.
It's easy to read tables into a DataFrame with the selenium library when you're dealing with jQuery-rendered content instead of plain HTML. You'll still need to do some cleanup, but this will get the table into a DataFrame.
You'll need to install the selenium library and download a web browser driver too.
from selenium import webdriver
import pandas as pd

driver_path = r'C:\chromedriver_win32\chromedriver.exe'
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
driver = webdriver.Chrome(driver_path)
driver.get(url)
the_table = driver.find_element_by_xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr/td[2]/table')
data = the_table.text
df = pd.DataFrame([x.split() for x in data.split('\n')])
driver.close()
print(df)
0 1 2 3 4 5 \
0 Country (or dependent territory, None None
1 subnational area, etc.) Region Subregion Rate
2 listed Source None None None None
3 None None None None None None
4 Burundi Africa Eastern Africa 6.02 635
5 Comoros Africa Eastern Africa 7.70 60
6 Djibouti Africa Eastern Africa 6.48 60
7 Eritrea Africa Eastern Africa 8.04 390
8 Ethiopia Africa Eastern Africa 7.56 7,552
9 Kenya Africa Eastern Africa 5.00 2,466
10 Madagascar Africa Eastern Africa 7.69 1,863
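If you'd rather not drive a browser, another route that often works is to locate the table yourself with BeautifulSoup and hand just that fragment to read_html. A sketch, assuming the page serves the table as static HTML with Wikipedia's usual wikitable class:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# assumption: the data table carries Wikipedia's standard 'wikitable' class
table = soup.find('table', class_='wikitable')
df = pd.read_html(StringIO(str(table)))[0]
print(df.head())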

Related

How to scrape a web page with button/menu-item option values?

In particular, I'm trying to scrape this web site.
I would like to set the button menu item to "50" rows per page.
My current code is the following:
Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//*[@class='btn btn-default dropdown-toggle']")))).select_by_visible_text('50')
Where is my mistake? Can you help me?
Thank you in advance for your time!
This should work for your case. (Select only works on native <select> elements; this menu is a styled dropdown, so you have to click it open and then click the item.)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

s = Service()  # assumes geckodriver is on your PATH
driver = webdriver.Firefox(service=s)
driver.get('https://whalewisdom.com/filer/fisher-asset-management-llc#tabholdings_tab_link')
button = driver.find_element(By.CSS_SELECTOR, '.btn-group.dropdown')
button.click()
element = driver.find_element(By.XPATH, '//li[@role="menuitem"]/a[contains(text(), "50")]')
element.click()
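If the menu renders slowly, the same two lookups can be wrapped in explicit waits using the locators above. A sketch, assuming Firefox with geckodriver available:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://whalewisdom.com/filer/fisher-asset-management-llc#tabholdings_tab_link')
wait = WebDriverWait(driver, 20)

# wait for the dropdown to become clickable, then open it
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-group.dropdown'))).click()
# wait for the "50" menu item, then select it
wait.until(EC.element_to_be_clickable((By.XPATH, '//li[@role="menuitem"]/a[contains(text(), "50")]'))).click()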
You can try this easier code, which doesn't need Selenium but instead calls the site's data API directly with requests.
Note the limit argument at the end of the query string, which caps the response at 50 rows, as you want. If you want to scrape the next 50 items, just increase the offset to 50, then 100, 150, etc. This will get you all the available data.
import requests
import pandas as pd
import json
url = "https://whalewisdom.com/filer/holdings?id=berkshire-hathaway-inc&q1=-1&type_filter=1,2,3,4&symbol=&change_filter=&minimum_ranking=&minimum_shares=&is_etf=0&sc=true&sort=current_mv&order=desc&offset=0&limit=50"
raw = requests.get(url)
data = json.loads(raw.content)
df = pd.DataFrame(data["rows"])
df.head()
Print out:
symbol permalink security_type name sector industry current_shares previous_shares shares_change position_change_type ... percent_ownership quarter_first_owned quarter_id_owned source_type source_date filing_date avg_price recent_price quarter_end_price id
0 AAPL aapl SH Apple Inc INFORMATION TECHNOLOGY COMPUTERS & PERIPHERALS 8.909234e+08 8.871356e+08 3787856.0 addition ... 5.5045625 Q1 2016 61 13F 2022-03-31 2022-05-16 36.6604 160.01 174.61 None
1 BAC bac SH Bank of America Corp. (North Carolina National... FINANCE BANKS 1.010101e+09 1.010101e+09 0.0 None ... 12.5371165 Q3 2017 67 13F 2022-03-31 2022-05-16 25.5185 33.04 41.22 None
2 AXP axp SH American Express Co FINANCE CONSUMER FINANCE 1.516107e+08 1.516107e+08 0.0 None ... 20.1326115 Q1 2001 1 13F 2022-03-31 2022-05-16 39.3110 151.60 187.00 None
3 CVX cvx SH Chevron Corp. (Standard Oil of California) ENERGY INTEGRATED OIL & GAS 1.591781e+08 3.824504e+07 120933081.0 addition ... 8.1014366 Q4 2020 80 13F 2022-03-31 2022-05-16 125.3424 159.14 162.83 None
4 KO ko SH Coca Cola Co. CONSUMER STAPLES
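To collect all pages rather than just the first 50 rows, loop over offset until the API runs dry. A sketch, assuming the endpoint returns an empty "rows" list once you page past the end of the data:
import requests
import pandas as pd

base = ("https://whalewisdom.com/filer/holdings?id=berkshire-hathaway-inc"
        "&q1=-1&type_filter=1,2,3,4&symbol=&change_filter=&minimum_ranking="
        "&minimum_shares=&is_etf=0&sc=true&sort=current_mv&order=desc"
        "&offset={offset}&limit=50")

frames = []
offset = 0
while True:
    rows = requests.get(base.format(offset=offset)).json()["rows"]
    if not rows:  # no more data past the last page
        break
    frames.append(pd.DataFrame(rows))
    offset += 50

df = pd.concat(frames, ignore_index=True)
print(len(df))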

Web scraping a table through multiple pages with a single link

I am trying to web scrape a table on a webpage as part of an assignment using Python. I want to scrape all 618 records of the table which are scattered across 13 pages in the same URL. However, my program only scrapes the first page of the table and its records. The URL is in my code, which can be found below:
from bs4 import BeautifulSoup as bs
import requests as r

base_URL = 'https://www.nba.com/players'

def scrape_webpage(URL):
    player_names = []
    page = r.get(URL)
    print(f'{page.status_code}')
    soup = bs(page.content, 'html.parser')
    raw_player_names = soup.find_all('div', class_='flex flex-col lg:flex-row')
    for name in raw_player_names:
        player_names.append(name.get_text().strip())
    print(player_names)

scrape_webpage(base_URL)
The player data is embedded inside a <script> element in the page. You can decode it with this example:
import re
import json
import requests
import pandas as pd
url = "https://www.nba.com/players"
data = re.search(r'({"props":.*})', requests.get(url).text).group(0)
data = json.loads(data)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(df.head().to_markdown())
Prints:
|    | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | JERSEY_NUMBER | POSITION | HEIGHT | WEIGHT | COLLEGE | COUNTRY | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | FROM_YEAR | TO_YEAR | PTS | REB | AST | STATS_TIMEFRAME | PLAYER_LAST_INITIAL | HISTORIC |
|---:|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  0 | 1630173 | Achiuwa | Precious | precious-achiuwa | 1610612761 | raptors | 0 | Toronto | Raptors | TOR | 5 | F | 6-8 | 225 | Memphis | Nigeria | 2020 | 1 | 20 | 1 | 2020 | 2021 | 9.1 | 6.5 | 1.1 | Season | A | False |
|  1 | 203500 | Adams | Steven | steven-adams | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 4 | C | 6-11 | 265 | Pittsburgh | New Zealand | 2013 | 1 | 12 | 1 | 2013 | 2021 | 6.9 | 10 | 3.4 | Season | A | False |
|  2 | 1628389 | Adebayo | Bam | bam-adebayo | 1610612748 | heat | 0 | Miami | Heat | MIA | 13 | C-F | 6-9 | 255 | Kentucky | USA | 2017 | 1 | 14 | 1 | 2017 | 2021 | 19.1 | 10.1 | 3.4 | Season | A | False |
|  3 | 1630583 | Aldama | Santi | santi-aldama | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 7 | F-C | 6-11 | 215 | Loyola-Maryland | Spain | 2021 | 1 | 30 | 1 | 2021 | 2021 | 4.1 | 2.7 | 0.7 | Season | A | False |
|  4 | 200746 | Aldridge | LaMarcus | lamarcus-aldridge | 1610612751 | nets | 0 | Brooklyn | Nets | BKN | 21 | C-F | 6-11 | 250 | Texas-Austin | USA | 2006 | 1 | 2 | 1 | 2006 | 2021 | 12.9 | 5.5 | 0.9 | Season | A | False |
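The regex above grabs the page's embedded JSON payload. If you prefer something more targeted than a regex, the same blob can be pulled from the <script id="__NEXT_DATA__"> tag. This is a sketch, assuming nba.com keeps rendering the page with Next.js (which is where the {"props": ...} structure comes from):
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.nba.com/players").text, "html.parser")
blob = soup.find("script", id="__NEXT_DATA__")  # Next.js stores its page props here
data = json.loads(blob.string)
df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(df.shape)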

Web Scraping & BeautifulSoup <li> parsing

I'm just learning web scraping and want to output the result of this website to a CSV file:
https://www.avbuyer.com/aircraft/private-jets
but am struggling with the year, sn and time fields in the code below.
When I put "soup" in place of "post" it works, but not when I put them together.
Any help would be much appreciated.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.avbuyer.com/aircraft/private-jets'
page = requests.get(url)
page

soup = BeautifulSoup(page.text, 'lxml')
soup

df = pd.DataFrame({'Plane':[''], 'Year':[''], 'S/N':[''], 'Total Time':[''], 'Price':[''], 'Location':[''], 'Description':[''], 'Tag':[''], 'Last updated':[''], 'Link':['']})

while True:
    postings = soup.find_all('div', class_ = 'listing-item premium')
    for post in postings:
        try:
            link = post.find('a', class_ = 'more-info').get('href')
            link_full = 'https://www.avbuyer.com'+ link
            plane = post.find('h2', class_ = 'item-title').text
            price = post.find('div', class_ = 'price').text
            location = post.find('div', class_ = 'list-item-location').text

            year = post.find_all('ul', class_ = 'fa-no-bullet clearfix')[2]
            year.find_all('li')[0].text
            sn = post.find('ul', class_ = 'fa-no-bullet clearfix')[2]
            sn.find('li')[1].text
            time = post.find('ul', class_ = 'fa-no-bullet clearfix')[2]
            time.find('li')[2].text

            desc = post.find('div', classs_ = 'list-item-para').text
            tag = post.find('div', class_ = 'list-viewing-date').text
            updated = post.find('div', class_ = 'list-update').text
            df = df.append({'Plane':plane, 'Year':year, 'S/N':sn, 'Total Time':time, 'Price':price, 'Location':location,
                            'Description':desc, 'Tag':tag, 'Last updated':updated, 'Link':link_full}, ignore_index = True)
        except:
            pass

    next_page = soup.find('a', {'rel':'next'}).get('href')
    next_page_full = 'https://www.avbuyer.com'+next_page
    next_page_full

    url = next_page_full
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')

df.to_csv('/Users/xxx/avbuyer.csv')
Try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.avbuyer.com/aircraft/private-jets', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
postings = soup.find_all('div', class_ = 'listing-item premium')
temp = []
for post in postings:
    link = post.find('a', class_ = 'more-info').get('href')
    link_full = 'https://www.avbuyer.com' + link
    plane = post.find('h2', class_ = 'item-title').text
    price = post.find('div', class_ = 'price').text
    location = post.find('div', class_ = 'list-item-location').text
    t = post.find_all('div', class_='list-other-dtl')
    for i in t:
        data = [tup.text for tup in i.find_all('li')]
        years = data[0]
        s = data[1]
        total_time = data[2]
        temp.append([plane, price, location, link_full, years, s, total_time])
df = pd.DataFrame(temp, columns=["plane", "price", "location", "link", "Years", "S/N", "Totaltime"])
print(df)
output:
plane price location link Years S/N Totaltime
0 Dassault Falcon 2000LXS Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2021 S/N 377 Total Time 33
1 Cirrus Vision SF50 G1 Please call North America + Canada, United States - WI, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2018 S/N 0080 Total Time 615
2 Gulfstream IV Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1990 S/N 1148 Total Time 6425
4 Boeing 787-8 Make offer Europe, Monaco, For Sale by Global Jet Monaco https://www.avbuyer.com/aircraft/private-jets/... Year 2010 S/N - Total Time 1
5 Hawker 4000 Make offer South America, Puerto Rico, For Sale by JetHQ https://www.avbuyer.com/aircraft/private-jets/... Year 2009 S/N RC-24 Total Time 2120
6 Embraer Legacy 500 Make offer North America + Canada, United States - NE, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2015 S/N 55000016 Total Time 2607
7 Dassault Falcon 2000LXS Make offer North America + Canada, United States - DE, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2015 S/N 300 Total Time 1909
8 Dassault Falcon 50EX Please call North America + Canada, United States - TX, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2002 S/N 320 Total Time 7091.9
9 Dassault Falcon 2000 Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2001 S/N 146 Total Time 6760
10 Bombardier Learjet 75 Make offer Europe, Switzerland, For Sale by Jetcraft https://www.avbuyer.com/aircraft/private-jets/... Year 2014 S/N 45-491 Total Time 1611
11 Hawker 800B Please call Europe, United Kingdom - England, For Sale by ... https://www.avbuyer.com/aircraft/private-jets/... Year 1985 S/N 258037 Total Time 9621
13 BAe Avro RJ100 Please call North America + Canada, United States - MT, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1996 S/N E3282 Total Time 45996
14 Embraer Legacy 600 Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2007 S/N 14501014 Total Time 4328
15 Bombardier Challenger 850 Make offer North America + Canada, United States - AZ, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2003 S/N 7755 Total Time 12114.1
16 Gulfstream G650 Please call Europe, Switzerland, For Sale by Jetcraft https://www.avbuyer.com/aircraft/private-jets/... Year 2013 S/N 6047 Total Time 2178
17 Bombardier Learjet 55 Price: USD $995,000 North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1982 S/N 020 Total Time 13448
18 Dassault Falcon 8X Please call North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2016 S/N 406 Total Time 1627
19 Hawker 800XP Price: USD $1,595,000 North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2002 S/N 258578 Total Time 10169
Right now, your try-except clauses are not allowing you to see and debug the errors in your script. If you remove them, you will see:
IndexError: list index out of range in line 24. There are only two elements in that list, and you are asking for the third one ([2]). Therefore, your line should be:
year = post.find_all('ul', class_ = 'fa-no-bullet clearfix')[1]
KeyError: 2 in line 26. You are using find(), which returns a <class 'bs4.element.Tag'> object, not a list. Here you want to use find_all() as you did in line 24. Same happens for line 28.
However, instead of using this expression three times, you should rather store the result in a variable and use it later.
AttributeError: 'NoneType' object has no attribute 'text' in line 31. There is a typo: you wrote classs_ instead of class_.
AttributeError: 'NoneType' object has no attribute 'text' in line 32. There is nothing wrong with your code here; rather, some entries on the webpage don't have this element. You should check whether the find method returned anything:
tag = post.find('div', class_ = 'list-viewing-date')
if tag:
    tag = tag.text
else:
    tag = None
You don't have a way out of your while loop. You should detect when the script cannot find a new next_page and add a break, as in the sketch below.
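A sketch of that exit condition, assuming the last page simply has no <a rel="next"> link:
import requests
from bs4 import BeautifulSoup

url = 'https://www.avbuyer.com/aircraft/private-jets'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

while True:
    # ... scrape the current page here ...
    next_page = soup.find('a', {'rel': 'next'})
    if next_page is None:
        break  # no rel="next" link on the last page, so leave the loop
    page = requests.get('https://www.avbuyer.com' + next_page.get('href'))
    soup = BeautifulSoup(page.text, 'lxml')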
After changing all this, I was able to scrape the first page. I used:
Python 3.9.7
bs4 4.10.0
It is very important that you state what versions of Python and the libraries you are using.
Cheers!

While web scraping for a table in Python, an empty table is returned

I need to grab a table from a website by web scraping with the BeautifulSoup library in Python, from the URL https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html
When I run this code, I get an empty table:
import requests
from bs4 import BeautifulSoup
#
vaacineProgressResponse = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
vaacineProgressContent = BeautifulSoup(vaacineProgressResponse.content, 'html.parser')
vaacineProgressContentTable = vaacineProgressContent.find_all('table', class_="g-summary-table svelte-2wimac")
if vaacineProgressContentTable is not None and len(vaacineProgressContentTable) > 0:
    vaacineProgressContentTable = vaacineProgressContentTable[0]
#
print('the table =', vaacineProgressContentTable)
The output:
the table = []
Process finished with exit code 0
The screenshot showed the table in the web page (at left) and the related Inspect element section (at right).
Very simple: it's because there's an extra space in the class you're searching for.
If you change the class to g-summary-table svelte-2wimac, the tags should be correctly returned.
The following code should work:
import requests
from bs4 import BeautifulSoup
#
url = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
soup = BeautifulSoup(url.content, 'html.parser')
table = soup.find_all('table', class_="g-summary-table svelte-2wimac")
print(table)
I've also done similar scraping on the NYTimes interactive website, and spaces can be very tricky. If you add an extra space or miss one, an empty result is returned.
If you cannot find the tags, I would recommend printing the entire document first using print(soup.prettify()) and find the desired tags you plan to scrape. Make sure you copy the exact text of the class name from the contents printed by BeautifulSoup.
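If the class string keeps biting you, note that select() matches classes individually, so you can target a single stable class instead of the exact multi-class string. A sketch; the svelte-* suffix may change between builds of the page:
import requests
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# matches any <table> whose class list contains g-summary-table, ignoring the rest
tables = soup.select("table.g-summary-table")
print(len(tables))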
As an alternative, if you want to download the data in JSON format and then read it into pandas, you can do the following (same starting code as above, working off the soup object).
Several data APIs are available (below are three); their URLs can be pulled out of the HTML like this:
import re
import requests
import pandas as pd

# soup is the BeautifulSoup object from above; each lookup finds the script
# text that mentions a given API and slices its URL out of the string
latest_dataset = soup.find(string=re.compile('latest')).splitlines()[2].split('"')[1]
requests.get(latest_dataset).json()

latest_timeseries = soup.find(string=re.compile('timeseries')).splitlines()[2].split('"')[3]
requests.get(latest_timeseries).json()

allwithrate = soup.find(string=re.compile('all_with_rate')).splitlines()[2].split('"')[1]
requests.get(allwithrate).json()

pd.DataFrame(requests.get(allwithrate).json())
Output of the last one:
geoid location last_updated total_vaccinations people_vaccinated display_name ... Region IncomeGroup country gdp_per_cap vaccinations_rate people_fully_vaccinated
0 MUS Mauritius 2021-02-17 3843.0 3843.0 Mauritius ... Sub-Saharan Africa High income Mauritius 11099.24028 0.3037 NaN
1 DZA Algeria 2021-02-19 75000.0 NaN Algeria ... Middle East & North Africa Lower middle income Algeria 3973.964072 0.1776 NaN
2 LAO Laos 2021-03-17 40732.0 40732.0 Laos ... East Asia & Pacific Lower middle income Lao PDR 2534.89828 0.5768 NaN
3 MOZ Mozambique 2021-03-23 57305.0 57305.0 Mozambique ... Sub-Saharan Africa Low income Mozambique 503.5707727 0.1943 NaN
4 CPV Cape Verde 2021-03-24 2184.0 2184.0 Cape Verde ... Sub-Saharan Africa Lower middle income Cabo Verde 3603.781793 0.4016 NaN
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
243 GUF NaN NaN NaN NaN French Guiana ... NaN NaN NaN NaN NaN NaN
244 KOS NaN NaN NaN NaN Kosovo ... NaN NaN NaN NaN NaN NaN
245 CUW NaN NaN NaN NaN Curaçao ... Latin America & Caribbean High income Curacao 19689.13982 NaN NaN
246 CHI NaN NaN NaN NaN Channel Islands ... Europe & Central Asia High income Channel Islands 74462.64675 NaN NaN
247 SXM NaN NaN NaN NaN Sint Maarten ... Latin America & Caribbean High income Sint Maarten (Dutch part) 29160.10381 NaN NaN
[248 rows x 17 columns]

How to read CSV file from GitHub using pandas

I'm trying to read a CSV file that's on GitHub with Python using pandas. I have looked all over the web, and I tried some solutions that I found on this website, but they do not work. What am I doing wrong?
I have tried this:
import pandas as pd
url = 'https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv'
df = pd.read_csv(url,index_col=0)
#df = pd.read_csv(url)
print(df.head(5))
You should provide the URL to the raw content. Try using this:
import pandas as pd
url = 'https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv'
df = pd.read_csv(url, index_col=0)
print(df.head(5))
Output:
alpha-2 ... intermediate-region-code
name ...
Afghanistan AF ... NaN
Åland Islands AX ... NaN
Albania AL ... NaN
Algeria DZ ... NaN
American Samoa AS ... NaN
Add ?raw=true at the end of the GitHub URL to get the raw file link.
In your case,
import pandas as pd
url = 'https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv?raw=true'
df = pd.read_csv(url,index_col=0)
print(df.head(5))
Output:
alpha-2 alpha-3 country-code iso_3166-2 region \
name
Afghanistan AF AFG 4 ISO 3166-2:AF Asia
Åland Islands AX ALA 248 ISO 3166-2:AX Europe
Albania AL ALB 8 ISO 3166-2:AL Europe
Algeria DZ DZA 12 ISO 3166-2:DZ Africa
American Samoa AS ASM 16 ISO 3166-2:AS Oceania
sub-region intermediate-region region-code \
name
Afghanistan Southern Asia NaN 142.0
Åland Islands Northern Europe NaN 150.0
Albania Southern Europe NaN 150.0
Algeria Northern Africa NaN 2.0
American Samoa Polynesia NaN 9.0
sub-region-code intermediate-region-code
name
Afghanistan 34.0 NaN
Åland Islands 154.0 NaN
Albania 39.0 NaN
Algeria 15.0 NaN
American Samoa 61.0 NaN
Note: This works only with GitHub links and not with GitLab or Bitbucket links.
You can copy/paste the URL and change two things:
Remove "blob"
Replace github.com with raw.githubusercontent.com
For instance this link:
https://github.com/mwaskom/seaborn-data/blob/master/iris.csv
Works this way:
import pandas as pd
pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
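If you do this often, the two rewrites are easy to wrap in a small helper (a hypothetical convenience function, just string surgery):
import pandas as pd

def github_blob_to_raw(url: str) -> str:
    # swap the host and drop the /blob/ path segment
    return url.replace('github.com', 'raw.githubusercontent.com').replace('/blob/', '/')

df = pd.read_csv(github_blob_to_raw('https://github.com/mwaskom/seaborn-data/blob/master/iris.csv'))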
I recommend either using pandas, as you tried to and as others here have explained, or, depending on the application, the Python CSV handler CommaSeperatedPython, which is a minimalistic wrapper for the native csv library.
That library returns the contents of a file as a two-dimensional string array. It is in a very early stage though, so if you want to do large-scale data analysis, I would suggest pandas.
First convert the GitHub CSV file to raw in order to access the data; follow the link in the comments on how to convert a CSV file to raw.
import pandas as pd
url_data = (r'https://raw.githubusercontent.com/oderofrancis/rona/main/Countries-Continents.csv')
data_csv = pd.read_csv(url_data)
data_csv.head()
