So basically, I want to scrape a table from this article and find the difference between the 1980 column and the 2018 column. To do that, I'm trying to convert the scraped data from a tag to a string, and then to a float, but I get an error when I try to convert to a float.
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
html = urlopen("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions_per_capita")
soup = BeautifulSoup(html, "html.parser")
for tr in soup.select('table:nth-of-type(2) tr:has(td)'):
    nation = tr.td.a.text
    eighty = tr.find_all("td")[3]
    eighty_x = eighty.text
    eighty_y = float(eighty_x)
    eighteen = tr.find_all("td")[14]
    eighteen_x = eighteen.text
    eighteen_y = float(eighteen_x)
    selection = (nation, eighty_x, eighteen_x.strip())
    print(selection)
What I get is this:
('Afghanistan', '0.2', '0.3')
('Albania', '1.7', '1.6')
('Algeria', '3.0', '3.9')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [39], in <cell line: 9>()
11 eighty = tr.find_all("td")[3]
12 eighty_x = eighty.text
---> 13 eighty_y = float(eighty_x)
14 eighteen = tr.find_all("td")[14]
15 eighteen_x = eighteen.text
ValueError: could not convert string to float: '..'
The table that you're scraping, located here: https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions_per_capita
has ".." in it. So that's the issue. You then need to decide how to handle that case. Maybe store it as a -1 in your data to signal the lack of data in the table? Up to you.
Also, I think in your selection you want eighty_y and eighteen_y instead of the x's, right? Or else why are you converting x to y (as in eighty_x becomes eighty_y)?
Assuming this is fine, then this will do it:
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
html = urlopen("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions_per_capita")
soup = BeautifulSoup(html, "html.parser")
for tr in soup.select('table:nth-of-type(2) tr:has(td)'):
    if tr is None or tr.td is None or tr.td.a is None:
        continue
    nation = tr.td.a.text
    eighty = tr.find_all("td")[3]
    eighty_x = eighty.text
    eighty_y = -1 if ".." in eighty_x else float(eighty_x)
    eighteen = tr.find_all("td")[14]
    eighteen_x = eighteen.text
    eighteen_y = -1 if ".." in eighteen_x else float(eighteen_x)
    selection = (nation, eighty_y, eighteen_y)
    print(selection)
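As a side note, since you already import pandas, pandas.read_html can pull the whole table in one call, and the 1980-to-2018 difference you're after becomes a single column subtraction. A minimal sketch, assuming the relevant table is still the second one on the page and the columns sit at the same positions as your td indexing:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions_per_capita"
# read_html returns every table on the page; [1] assumes the second table
# is still the one you were selecting with table:nth-of-type(2)
df = pd.read_html(url)[1]
# Column positions mirror your td indexing (3 -> 1980, 14 -> 2018);
# adjust them if Wikipedia reshuffles the table
eighty = pd.to_numeric(df.iloc[:, 3], errors="coerce")    # ".." becomes NaN
eighteen = pd.to_numeric(df.iloc[:, 14], errors="coerce")
print((eighteen - eighty).head())
With errors="coerce", the ".." placeholders turn into NaN instead of raising, which is usually nicer to work with than a -1 sentinel.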
Related
I have been running this Python code and it gives me an error saying:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-9-6ff1d459c8bd> in <module>
6 soup = BeautifulSoup(data, 'html5lib')
7 df = pd.DataFrame(columns=["Name", "Sector", "Price", "Price/Earnings", "Dividend_Yield", "Earnings/Share", "52_Week_Low", "52_Week_High", "Market_Cap", "EBITDA"])
----> 8 for row in soup.find_all('tbody')[1].find_all('tr'):
9 col = row.find_all("td")
10 Name = col[0].text
IndexError: list index out of range
The code I have used to do the Python web scraping is:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.kaggle.com/priteshraj10/sp-500-companies"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
df = pd.DataFrame(columns=["Name", "Sector", "Price", "Price/Earnings", "Dividend_Yield", "Earnings/Share", "52_Week_Low", "52_Week_High", "Market_Cap", "EBITDA"])
for row in soup.find_all('tbody')[1].find_all('tr'):
    col = row.find_all("td")
    Name = col[0].text
    Sector = col[1].text
    Price = col[2].text
    Price_Earnings = col[3].text
    Dividend_Yield = col[4].text
    Earnings_Share = col[5].text
    Week_Low = col[6].text
    Week_High = col[7].text
    Market_Cap = col[8].text
    EBITDA = col[9].text
    df = df.append({"Name":Name,"Sector":Sector,"Price":Price,"Price_Earnings":Price_Earnings,"Dividend_Yield":Dividend_Yield,"Earnings_Share":Earnings_Share,"Week_Low":Week_Low,"Week_High":Week_High,"Market_Cap":Market_Cap,"EBITDA":EBITDA}, ignore_index=True)
Can you help me on this?
If you print the variable soup, you will see that the HTML returned does not contain the information you want, probably because the site has a block to avoid web scraping.
Apparently this line of code is indexing into a list:
for row in soup.find_all('tbody')[1].find_all('tr'):
It expects that list to have at least two items (Python list indexing starts from 0), and it doesn't.
What you could do is print that result:
print(soup.find_all('tbody'))
to see what you are actually trying to access at index 1, and why it is not there.
Additionally, if you want to check the length:
print(len(soup.find_all('tbody')))
If it is less than 2, index 1 is out of range, hence the error.
I'd recommend using a debugger, rather than print statements, to find out what's up with your code.
The issue seems to be that the website you are trying to scrape has probably changed its HTML at some point.
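To make the failure explicit instead of an IndexError, here is a small guard sketch (same URL and parser as your code; whether the page actually contains a usable table is exactly what it checks):
import requests
from bs4 import BeautifulSoup

url = "https://www.kaggle.com/priteshraj10/sp-500-companies"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
bodies = soup.find_all('tbody')
print(len(bodies))  # if this prints 0 or 1, indexing [1] cannot work
if len(bodies) > 1:
    for row in bodies[1].find_all('tr'):
        print([td.text for td in row.find_all('td')])
else:
    print("No second tbody; the table is probably rendered by JavaScript or blocked.")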
I am trying to scrape the info from the election results in 18 NI constituencies here:
http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results
Each of the unique URLs starts like this:
http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/
The selector for the 18 URLs is as follows:
#container > div.two-column-content.clearfix > div > div.right-column.cms > div > ul > li
What I want to start with is a list of the 18 URLs. This list should be clean (i.e. just the actual addresses, no tags, etc.).
My Code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver
url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'
response = requests.get(url)
response.status_code
text = requests.get(url).text
soup = BeautifulSoup(text, parser="html5lib")
link_list = []
for a in soup('a'):
    if a.has_attr('href'):
        link_list.append(a)
re_pattern = r"^/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/"
This is where I get lost, as I need to search for all 18 URLs that start with that pattern. (The pattern is wrong, I am pretty sure. Please help!)
The rest of the code:
import re
good_urls = [url for url in link_list if re.match(re_pattern, url)]
Here I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-f3fbbd3199b1> in <module>
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]
<ipython-input-36-f3fbbd3199b1> in <listcomp>(.0)
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]
~/opt/anaconda3/lib/python3.7/re.py in match(pattern, string, flags)
173 """Try to apply the pattern at the start of the string, returning
174 a Match object, or None if no match was found."""
--> 175 return _compile(pattern, flags).match(string)
176
177 def fullmatch(pattern, string, flags=0):
TypeError: expected string or bytes-like object
What should I type differently to get those 18 URLs? Thank you!
This seems to do the job.
I've removed some unnecessary imports and stuff that's not needed here; just re-add them if you need them elsewhere, of course.
The error message was due to trying to do a regex comparison on a soup object; it needs to be cast to a string (the same problem as discussed in the link @Huzefa posted, so that was definitely relevant).
Fixing that still left the issue of isolating the correct strings. I've simplified the regex for matching, then used a simple string split on " and selected the second item resulting from the split (which is our URL).
import requests
from bs4 import BeautifulSoup
import re
url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'
response = requests.get(url)
text = requests.get(url).text
soup = BeautifulSoup(text, "html.parser")
re_pattern = "<a href=\".*/Elections-2019/.*"
link_list = []
for a in soup('a'):
    if a.has_attr('href') and re.match(re_pattern, str(a)):
        link_list.append(str(a).split('"')[1])
Hope it fits your purpose, ask if anything is unclear.
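An alternative that skips the string cast entirely: tag attributes are already plain strings, so you can match on the href directly. A sketch, assuming the 18 links are exactly the ones whose href contains the /Elections-2019/ path segment:
import requests
from bs4 import BeautifulSoup

url = ('http://www.eoni.org.uk/Elections/Election-results-and-statistics/'
       'Election-results-and-statistics-2003-onwards/Elections-2019/'
       'UK-Parliamentary-Election-2019-Results')
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# a['href'] is a string, so no regex on str(a) or casting is needed
link_list = [a['href'] for a in soup.find_all('a', href=True)
             if '/Elections-2019/' in a['href']]
print(len(link_list), link_list[:3])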
I am struggling to convert an HTML table to a dataframe. I would like to write the table to a CSV file.
from requests import session
import sys
import csv
from bs4 import BeautifulSoup
c = session()
outfile = open("Weather2017.csv", 'wb')
response = c.get('http://www.wunderground.com/history/airport/EGLL/2017/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2017&req_city=NA&req_state=NA&req_statename=NA')
soup = BeautifulSoup(response.text, "html.parser")
soup = soup.find(id="obsTable").text.replace('\n','',1)
outfile.write(soup.replace('\n',',London\n'))
The type error is the following:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-1e149d844e15> in <module>()
11 soup = BeautifulSoup(response.text, "html.parser")
12 soup = soup.find(id="obsTable").text.replace('\n','',1)
---> 13 outfile.write(soup.replace('\n',',London\n'))
14
15
TypeError: a bytes-like object is required, not 'str'
This is the table I want to convert into a CSV file.
Can anyone help me?
Thanks in advance!
How about this,
from requests import session
import sys
import csv
from bs4 import BeautifulSoup
c = session()
response = c.get('http://www.wunderground.com/history/airport/EGLL/2017/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2017&req_city=NA&req_state=NA&req_statename=NA')
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find(id="obsTable")
headers = [header.text.strip() for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.strip() for val in row.find_all('td')])
del rows[0]  # remove the header row, which comes through empty
with open('Weather2017.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
(Note that your original error came from opening the file in binary mode 'wb' and then writing a str; in Python 3 the csv module wants a text-mode file, which the snippet above uses.)
The things causing a problem in your code when applying BeautifulSoup() are these tags: tbody, /tbody, thead, /thead. If you get rid of them, everything will work well!
Here is a solution using pandas, regex and some other libs :)
#needed imports
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re
# get page html code
url = 'https://www.wunderground.com/history/airport/EGLL/2017/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2017&req_city=NA&req_state=NA&req_statename=NA'
req = requests.get(url)
html = req.text
soup = (BeautifulSoup(html, 'html.parser'))
#removing tags that cause problems using re library
patterns = ['<tbody>','</tbody>','<thead>','</thead>']
cleaned_html= soup.prettify()
for pat in patterns:
    cleaned_html = re.sub(pat, '', cleaned_html)
df = pd.read_html(cleaned_html, attrs={'id':'obsTable'})[0]
df.head()
# build a hierarchical columns
df.columns = [['2017',
'Temp. (°C)','Temp. (°C)','Temp. (°C)',
'Dew Point (°C)','Dew Point (°C)','Dew Point (°C)',
'Humidity (%)','Humidity (%)','Humidity (%)',
'Sea Level Press. (hPa)','Sea Level Press. (hPa)','Sea Level Press. (hPa)',
'Visibility (km)','Visibility (km)','Visibility (km)',
'Wind (km/h)', 'Wind (km/h)','Wind (km/h)',
'Precip. (mm)', 'Events'],
['Jan',
'high','avg','low',
'high','avg','low',
'high','avg','low',
'high','avg','low',
'high','avg','low',
'high','avg','high',
'sum',
'nan']]
df.head()
#removing the first un-needed rows
df = df.drop([0,1], axis=0)
df.reset_index(inplace=True, drop=True)
df.head()
#save the result to CSV file
df.to_csv('weather.csv')
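Since the columns are a two-level hierarchy, reading the CSV back needs both header rows. A quick usage sketch for the file written above:
import pandas as pd

# header=[0, 1] rebuilds the two-level columns; index_col=0 skips the
# unnamed index column that to_csv wrote out
df = pd.read_csv('weather.csv', header=[0, 1], index_col=0)
print(df['Temp. (°C)']['high'].head())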
In the code sample below, 3 of the 5 elements I am attempting to scrape return values as expected; 2 (goals_scored and assists) return no values. I have verified that the data does exist on the web page and that I am using the correct attribute, but I'm not sure why results are not returning. Is there something obvious I am overlooking?
import sys
from bs4 import BeautifulSoup as bs
import urllib2
import datetime as dt
import time
import pandas as pd
proxy_support = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy_support)
player_name=[]
club =[]
position = []
goals_scored = []
assists = []
for p in range(25):
    player_url = 'http://www.mlssoccer.com/stats/season?page={p}&franchise=select&year=2017&season_type=REG&group=goals'.format(
        p=p)
    page = opener.open(player_url).read()
    player_soup = bs(page, "lxml")
    print >>sys.stderr, '[{time}] Running page {n}...'.format(
        time=dt.datetime.now(), n=p)
    length = len(player_soup.find('tbody').findAll('tr'))
    for row in range(0, length):
        try:
            name = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Player'})[row].find('a').contents[0]
            player_name.append(name)
            team = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Club'})[row].contents[0]
            club.append(team)
            pos = player_soup.find('tbody').findAll('td', attrs={'data-title': 'POS'})[row].contents[0]
            position.append(pos)
            goals = player_soup.find('tbody').findAll('td', attrs={'data-title': 'G', 'class': 'responsive'})[row].contents[0]
            goals_scored.apppend(goals)
            a = player_soup.find('tbody').findAll('td', attrs={'data-title': 'A'})[row].contents[0]
            assists.append(a)
        except:
            pass
player_data = {'player_name':player_name,
'club':club,
'position' : position,
'goals_scored' : goals_scored,
'assists' : assists,
}
df = pd.DataFrame.from_dict(player_data,orient='index')
df
The only thing I can figure out is that there is a slight difference in the HTML for the variables not returning data. Do I need to include the class="responsive" in my code? If so, any examples of how that might look?
Position HTML: <td data-title="POS">F</td>
Goals HTML: <td data-title="G" class="responsive">11</td>
Any insight is appreciated
You can try it like this to get your desired data. I've only parsed the portion you needed; the rest you can do for the dataframe. FYI, there are two classes, odd and even, attached to the alternating tr tags. Don't forget to take that into account as well.
from bs4 import BeautifulSoup
import requests
page_url = "https://www.mlssoccer.com/stats/season?page={0}&franchise=select&year=2017&season_type=REG&group=goals"
for url in [page_url.format(p) for p in range(5)]:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.select("table")[0]
    for items in table.select(".odd,.even"):
        player = items.select("td[data-title='Player']")[0].text
        club = items.select("td[data-title='Club']")[0].text
        position = items.select("td[data-title='POS']")[0].text
        goals = items.select("td[data-title='G']")[0].text
        assist = items.select("td[data-title='A']")[0].text
        print(player, club, position, goals, assist)
Partial result looks like:
Nemanja Nikolic CHI F 24 4
Diego Valeri POR M 21 11
Ola Kamara CLB F 18 3
As I've included both classes in my script, you will get all the data from that site.
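To take it the rest of the way to a dataframe, a minimal sketch (same URL pattern and selectors as above) that collects the rows into pandas:
import pandas as pd
import requests
from bs4 import BeautifulSoup

page_url = "https://www.mlssoccer.com/stats/season?page={0}&franchise=select&year=2017&season_type=REG&group=goals"
records = []
for url in [page_url.format(p) for p in range(5)]:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for items in soup.select("table")[0].select(".odd,.even"):
        # one dict per table row, keyed the way you named your lists
        records.append({
            'player_name': items.select("td[data-title='Player']")[0].text,
            'club': items.select("td[data-title='Club']")[0].text,
            'position': items.select("td[data-title='POS']")[0].text,
            'goals_scored': items.select("td[data-title='G']")[0].text,
            'assists': items.select("td[data-title='A']")[0].text,
        })
df = pd.DataFrame(records)  # one row per player, unlike orient='index'
print(df.head())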
I would like the following commands to grab the date from each address in this range, but I can't seem to get it to run more than once. I am using Python 3. As you can see below, the URL for the site is appended with i, so it reads http://zinc.docking.org/substance/10, http://zinc.docking.org/substance/11, and so on. Here is the code:
import bs4 as bs
import urllib.request
site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
    site1 = str("%s%i" % (site, i))
sauce = urllib.request.urlopen(site1).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
table1 = soup.find("table", attrs={"class": "substance-properties"})
for row in table1.findAll('tr'):
    row1 = row.findAll('td')
    ate = row1[0].getText()
    print(ate)
This is my output:
$python3 Date.py
November 11th, 2005
The script should, however, give me 3 dates. This code works, so I know that row1[0] does in fact contain a value. I feel like there is some sort of simple formatting error, but I am not sure where to begin troubleshooting. When I format it "correctly", this is the code:
import bs4 as bs
import urllib.request
import pandas as pd
import csv
site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
    site1 = str("%s%i" % (site, i))
    sauce = urllib.request.urlopen(site1).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table1 = soup.find("table", attrs={"class": "substance-properties"})
    table2 = soup.find("table", attrs={"class": "protomers"})
    for row in table1.findAll('tr'):
        row1 = row.findAll('td')
        ate = row1[0].getText()
        print(ate)
The error I get is as follows:
Traceback (most recent call last):
File "Stack.py", line 11, in <module>
ate = row1[1].getText()
IndexError: list index out of range
The first code works, so I know that row1[0] does in fact contain a value. Any ideas?
You might want to fix your indentation:
import bs4 as bs
import urllib.request
site = "http://zinc.docking.org/substance/"
for i in range(10, 16):
    site1 = str("%s%i" % (site, i))
    sauce = urllib.request.urlopen(site1).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table1 = soup.find("table", attrs={"class": "substance-properties"})
    for row in table1.findAll('tr'):
        row1 = row.findAll('td')
        Date = row1[0].getText()
        print(Date)
Edit: You might also want to rename your Date variable; it isn't actually a reserved name, but by convention Python variables are lower case.
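As a follow-up, if the IndexError from the second snippet comes back, the usual culprit is a row with no <td> cells (e.g. a header row). A guarded sketch, assuming the same page structure:
import bs4 as bs
import urllib.request

site = "http://zinc.docking.org/substance/"
dates = []
for i in range(10, 16):
    sauce = urllib.request.urlopen("%s%i" % (site, i)).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table1 = soup.find("table", attrs={"class": "substance-properties"})
    if table1 is None:  # page without the properties table
        continue
    for row in table1.findAll('tr'):
        row1 = row.findAll('td')
        if row1:  # skip rows that have no <td> cells
            dates.append(row1[0].getText())
print(dates)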