Error handling with BeautifulSoup when a scraped URL doesn't respond - Python

I'm a total noob at Python, so please forgive my mistakes and lack of vocabulary. I'm trying to scrape some URLs with BeautifulSoup. My URLs come from a GA API call and some of them don't respond.
How do I build my script so that BeautifulSoup ignores the URLs that don't return anything?
Here is my code:
if results:
    for row in results.get('rows'):
        rawdata.append(row[0])
else:
    print 'No results found'

urllist = [mystring + x for x in rawdata]

for row in urllist[4:8]:
    page = urllib2.urlopen(row)
    soup = BeautifulSoup(page, 'html.parser')
    name_box = soup.find(attrs={'class': 'nb-shares'})
    share = name_box.text.strip()
    # save the data in tuple
    sharelist.append((row, share))

print(sharelist)
I tried to use this:
except Exception:
    pass
but I don't know where to put it, and I got a syntax error. I've looked at other questions but cannot find any answers that work for me.

You may check the value of the name_box variable - it will be None if nothing is found:
for row in urllist[4:8]:
    page = urllib2.urlopen(row)
    soup = BeautifulSoup(page, 'html.parser')
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    # ...
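The None check above only skips pages that load but have no nb-shares element. Since the question is also about URLs that don't respond at all, you would additionally wrap the urlopen call in a try/except, as the answers further down do. A minimal sketch combining both checks, assuming urllist has already been built from the GA results as in the question:

import urllib2
from bs4 import BeautifulSoup

sharelist = []
for row in urllist[4:8]:
    try:
        page = urllib2.urlopen(row)
    except (urllib2.HTTPError, urllib2.URLError):
        # skip URLs that return an error or don't respond at all
        continue
    soup = BeautifulSoup(page, 'html.parser')
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    sharelist.append((row, name_box.text.strip()))
print(sharelist)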

Related

Web scraping the contents of tables

Hi, I am trying to use Python and Beautiful Soup to scrape a webpage. There are various tables in the webpage with results that I want to get out of them, but I am struggling to:
1) find the right table
2) find the right two cells
3) write the cells 1 and 2 into a dictionary key and value, respectively.
So far, after making a request and parsing the HTML, I use:
URL = 'someurl.com'

def datascrape(url):
    page = requests.get(url)
    print("requesting page")
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

soup = datascrape(URL)
results = {}
for row in soup.findAll('tr'):
    aux = row.findAll('td')
    try:
        if "Status" in (aux.stripped_strings):
            key = (aux[0].strings)
            value = (aux[1].string)
            results[key] = value
    except:
        pass
print(results)
Unfortunately "results" is always empty. I am really not sure where I am going wrong. Could anyone enlighten me please?
I'm not sure why you're using findAll() instead of find_all(), as I'm fairly new to web scraping, but nevertheless I think this gives you the output you're looking for.
import requests
from bs4 import BeautifulSoup

URL = 'http://sitem.herts.ac.uk/aeru/bpdb/Reports/2070.html'

def datascrape(url):
    page = requests.get(url)
    print("requesting page")
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

soup = datascrape(URL)
results = {}
table_rows = soup.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    try:
        for i in row:
            if "Status" in i:
                key = row[0].strip()
                value = row[1].strip()
                results[key] = value
            else:
                pass
    except IndexError:
        # skip rows that don't have a second cell
        pass
print(results)
Hope this helps!
If you are just after the Status and Not Applicable values, you can use positional nth-of-type CSS selectors. This does depend on the position being the same across pages.
import requests
from bs4 import BeautifulSoup
url ='https://sitem.herts.ac.uk/aeru/bpdb/Reports/2070.htm'
page=requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
tdCells = [item.text.strip() for item in soup.select('table:nth-of-type(2) tr:nth-of-type(1) td')]
results = {tdCells[0] : tdCells[1]}
print(results)
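If the position does change between pages, a rough alternative sketch is to key off the 'Status' label text itself rather than the cell position, similar to what the previous answer does (assuming the label and its value sit in the first two cells of the same row):

import requests
from bs4 import BeautifulSoup

url = 'https://sitem.herts.ac.uk/aeru/bpdb/Reports/2070.htm'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

results = {}
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # keep only rows whose first cell contains the 'Status' label
    if len(cells) >= 2 and 'Status' in cells[0]:
        results[cells[0]] = cells[1]
print(results)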

BeautifulSoup generating inconsistent results

I'm using BeautifulSoup to pull data out of Reddit sidebars on a selection of subreddits, but my results are changing pretty much every time I run my script.
Specifically, the results in sidebar_urls changes from iteration to iteration; sometimes it will result in [XYZ.com/abc, XYZ.com/def], other times it will return just [XYZ.com/def], and finally, it will sometimes return [].
Any ideas why this might be happening using the code below?
sidebar_urls = []
for i in range(0, len(reddit_urls)):
    req = urllib.request.Request(reddit_urls[i], headers=headers)
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, 'html.parser')
    links = soup.find_all(href=True)
    for link in links:
        if "XYZ.com" in str(link['href']):
            sidebar_urls.append(link['href'])
It seems you sometimes get a page that does not have a sidebar. It could be because Reddit recognizes you as a robot and returns a default page instead of the one you expect. Consider identifying yourself when requesting the pages, using the User-Agent field:
import requests
from bs4 import BeautifulSoup

reddit_urls = [
    "https://www.reddit.com/r/leagueoflegends/",
    "https://www.reddit.com/r/pokemon/"
]

# Update this to identify yourself
user_agent = "me#example.com"

sidebar_urls = []
for reddit_url in reddit_urls:
    response = requests.get(reddit_url, headers={"User-Agent": user_agent})
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the sidebar tag
    side_tag = soup.find("div", {"class": "side"})
    if side_tag is None:
        print("Could not find a sidebar in page: {}".format(reddit_url))
        continue

    # Find all links in the sidebar tag
    link_tags = side_tag.find_all("a")
    for link in link_tags:
        link_text = str(link["href"])
        sidebar_urls.append(link_text)

print(sidebar_urls)

Web scraping nested comments on Reddit using beautifulsoup

This code gets the page. My problem is that I need to scrape the content of the users' comments, not the number of comments. The content is nested inside the number-of-comments section, but I am not sure how I can access the link, parse it, and scrape the user comments.
request_list = []
id_list = [0]
for i in range(0, 200, 25):
    response = requests.get("https://www.reddit.com/r/CryptoCurrency/?count=" + str(i) + "&after=" + str(id_list[-1]), headers={'User-agent': 'No Bot'})
    soup = BeautifulSoup(response.content, 'lxml')
    request_list.append(soup)
    id_list.append(soup.find_all('div', attrs={'data-type': 'link'})[-1]['data-fullname'])
    print(i, id_list)
    if i % 100 == 0:
        time.sleep(1)
In the code below I tried writing a function that is supposed to access the nested comments, but I have no clue how.
def extract_comment_contents(request_list):
    comment_contents_list = []
    for i in request_list:
        if response.status_code == 200:
            for each in i.find_all('a', attrs={'data-inbound-url': '/r/CryptoCurrency/comments/'}):
                comment_contents_list.append(each.text)
        else:
            print("Call failed at request ", i)
    return comment_contents_list

fetch_comment_contents_list = extract_comment_contents(request_list)
print(fetch_comment_contents_list)
For each thread, you need to send another request to get the comments page. The URL for the comments page can be found using soup.find_all('a', class_='bylink comments may-blank'). This will give all the a tags that have the URL for the comments page. I'll show you one example of getting to the comments page.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.reddit.com/r/CryptoCurrency/?count=0&after=0')
soup = BeautifulSoup(r.text, 'lxml')
for comments_tag in soup.find_all('a', class_='bylink comments may-blank', href=True):
    url = comments_tag['href']
    r2 = requests.get(url)
    soup = BeautifulSoup(r2.text, 'lxml')
    # Your job is to parse this soup object and get all the comments.
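To actually pull the comment text out of that second soup object, you would need to target Reddit's comment markup. A rough, hedged sketch, assuming the old-Reddit HTML where each comment body is rendered in a div with class 'md' inside a div with class 'comment' (these class names are an assumption, not something confirmed by the answer above):

import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'No Bot'}
r = requests.get('https://www.reddit.com/r/CryptoCurrency/?count=0&after=0', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

all_comments = []
for comments_tag in soup.find_all('a', class_='bylink comments may-blank', href=True):
    r2 = requests.get(comments_tag['href'], headers=headers)
    comments_soup = BeautifulSoup(r2.text, 'lxml')
    # 'div.comment div.md' is an assumed selector for old-Reddit comment bodies
    for body in comments_soup.select('div.comment div.md'):
        all_comments.append(body.get_text(strip=True))
print(all_comments)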

skipping Error 404 with BeautifulSoup

I'm trying to scrape some URLs with BeautifulSoup. The URLs I'm scraping come from a Google Analytics API call; some of them aren't working properly, so I need to find a way to skip them.
I tried to add this:
except urllib2.HTTPError:
    continue
But I got the following syntax error:
except urllib2.HTTPError:
^
SyntaxError: invalid syntax
Here is my full code:
rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

urllist = [mystring + x for x in rawdata]

for row in urllist:
    # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue
    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() is used to remove starting and trailing
    # save the data in tuple
    sharelist.append((row, share))

print(sharelist)
Your except statement is not preceded by a try statement. You should use the following pattern:
try:
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue
Also note the indentation levels. Code executed under the try clause must be indented, as well as the except clause.
Two errors:
1. No try statement
2. No indentation
Use this:
for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue
If you just want to catch a 404, you need to check the code returned or re-raise the error, or else you will catch and ignore more than just the 404:
import urllib2
from bs4 import BeautifulSoup
from urlparse import urljoin

def print_results(results):
    base = 'http://www.konbini.com'
    rawdata = []
    sharelist = []
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'
    # use urljoin to join to the base url
    urllist = [urljoin(base, h) for h in rawdata]
    for url in urllist:
        # query the website and return the html to the variable 'page'
        try:  # need to open with try
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404:  # check the return code
                continue
            raise  # if other than 404, raise the error
        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() is used to remove starting and trailing
        # save the data in tuple
        sharelist.append((url, share))
    print(sharelist)
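Note that urllib2.HTTPError only covers responses that come back with an error status code. Since the original problem is URLs that don't respond at all, you may also want to catch urllib2.URLError (which HTTPError subclasses) for connection failures and timeouts. A small, hedged variation on the fetch step above:

try:
    page = urllib2.urlopen(url, timeout=10)  # the explicit timeout is optional
except urllib2.HTTPError as e:
    if e.getcode() == 404:  # skip only the missing pages
        continue
    raise  # re-raise any other HTTP error
except urllib2.URLError:
    # no response at all (DNS failure, refused connection, timeout, ...)
    continue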
As already mentioned by others:
1. The try statement is missing.
2. Proper indentation is missing.
You should use an IDE or editor so that you won't face such problems. Some good IDEs and editors are:
IDE - Eclipse with the PyDev plugin
Editor - Visual Studio Code
Anyway, here is the code with the try added and the indentation fixed:
rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

urllist = [mystring + x for x in rawdata]

for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue
    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() is used to remove starting and trailing
    # save the data in tuple
    sharelist.append((row, share))

print(sharelist)
Your syntax error is due to the fact that you're missing a try with your except statement.
try:
    # code that might throw HTTPError
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue

Using beautifulsoup to scrape <h2> tag

I am scraping website data using Beautiful Soup. I want the text value (My name is nick) of the following tag, but I have searched a lot on Google and can't find a solution to my query.
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')
    print temp
output:
<h2 class="menuNewsHl2_MenuNews1">My name is nick</h2>
But I want the output like this: My name is nick
Just grab the text attribute:
>>> soup = BeautifulSoup('''<h2 class="menuNewsHl2_MenuNews1">My name is nick</h2>''')
>>> soup.text
u'My name is nick'
Your error is probably occurring because you don't have that specific tag in your input string.
Check if temp is not None
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')
    if temp:
        print temp.text
or put your print statement in a try ... except block
news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    try:
        print news.find('h2').text
    except AttributeError:
        continue
Try using this:
all_string = soup.find_all("h2")[0].get_text()
