What is the right way to handle errors? - python

My script below scrapes a website and returns the data from a table. It's not finished but it works. The problem is that it has no error checking. Where should I have error handling in my script?
There are no unit tests. Should I write some and schedule them to be run periodically, or should the error handling be done in my script?
Any advice on the proper way to do this would be great.
#!/usr/bin/env python
''' Gets the Canadian Monthly Residential Bill Calculations table
from URL and saves the results to a sqlite database.
'''
import urllib2
from BeautifulSoup import BeautifulSoup


class Bills():
    ''' Canadian Monthly Residential Bill Calculations '''
    URL = "http://www.hydro.mb.ca/regulatory_affairs/energy_rates/electricity/utility_rate_comp.shtml"

    def __init__(self):
        ''' Initialization '''
        self.url = self.URL
        self.data = []
        self.get_monthly_residential_bills(self.url)

    def get_monthly_residential_bills(self, url):
        ''' Gets the Monthly Residential Bill Calculations table from URL '''
        doc = urllib2.urlopen(url)
        soup = BeautifulSoup(doc)
        res_table = soup.table.th.findParents()[1]
        results = res_table.findNextSibling()
        header = self.get_column_names(res_table)
        self.get_data(results)
        self.save(header, self.data)

    def get_data(self, results):
        ''' Extracts data from search result. '''
        rows = results.childGenerator()
        data = []
        for row in rows:
            if row == "\n":
                continue
            for td in row.contents:
                if td == "\n":
                    continue
                data.append(td.text)
            self.data.append(tuple(data))
            data = []

    def get_column_names(self, table):
        ''' Gets table title, subtitle and column names '''
        results = table.findAll('tr')
        title = results[0].text
        subtitle = results[1].text
        cols = results[2].childGenerator()
        column_names = []
        for col in cols:
            if col == "\n":
                continue
            column_names.append(col.text)
        return title, subtitle, column_names

    def save(self, header, data):
        pass


if __name__ == '__main__':
    a = Bills()
    for td in a.data:
        print td

See the documentation of all the functions and check what exceptions they throw.
For example, urllib2.urlopen() is documented to raise URLError on errors, and URLError is a subclass of IOError.
So, for the urlopen() call, you could do something like:

import sys

try:
    doc = urllib2.urlopen(url)
except IOError:
    print >> sys.stderr, 'Error opening URL'

Similarly, do the same for the others.
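For instance, here is a minimal sketch of what get_monthly_residential_bills from the question could look like with the network call and the parsing step guarded. The choice to log to stderr and return on failure is an assumption; you might prefer to retry or re-raise instead.

import sys
import urllib2
from BeautifulSoup import BeautifulSoup

def get_monthly_residential_bills(self, url):
    ''' Gets the Monthly Residential Bill Calculations table from URL '''
    try:
        doc = urllib2.urlopen(url)
    except urllib2.URLError as e:
        # Covers DNS failures, refused connections, HTTP errors, etc.
        print >> sys.stderr, 'Error opening %s: %s' % (url, e)
        return
    soup = BeautifulSoup(doc)
    if soup.table is None or soup.table.th is None:
        # The page layout changed and the expected table is gone.
        print >> sys.stderr, 'Expected table not found at %s' % url
        return
    res_table = soup.table.th.findParents()[1]
    results = res_table.findNextSibling()
    header = self.get_column_names(res_table)
    self.get_data(results)
    self.save(header, self.data)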

You should write unit tests and you should use exception handling. But only catch the exceptions you can handle; you do no one any favors by catching everything and throwing any useful information out.
Unit tests aren't run periodically though; they're run before and after the code changes (although it is feasible for one change's "after" to become another change's "before" if they're close enough).
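A minimal sketch of such a test, assuming the script above is saved as bills.py (a hypothetical module name) and using a small hand-written HTML fragment so the test never touches the network:

import unittest
from BeautifulSoup import BeautifulSoup
from bills import Bills  # hypothetical module name for the script above

SAMPLE_HTML = '''<table>
<tr><th>Title</th></tr>
<tr><th>Subtitle</th></tr>
<tr><td>City</td><td>Bill</td></tr>
</table>'''

class BillsForTest(Bills):
    ''' Subclass that skips __init__ so no URL is fetched. '''
    def __init__(self):
        self.data = []

class GetColumnNamesTest(unittest.TestCase):
    def test_parses_title_subtitle_and_columns(self):
        table = BeautifulSoup(SAMPLE_HTML).table
        title, subtitle, columns = BillsForTest().get_column_names(table)
        self.assertEqual(title, 'Title')
        self.assertEqual(subtitle, 'Subtitle')
        self.assertEqual(columns, ['City', 'Bill'])

if __name__ == '__main__':
    unittest.main()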

A couple of places you need to have them is when importing things like tkinter:

try:
    import Tkinter as tk
except ImportError:
    import tkinter as tk

Also anywhere the user enters something with an intended type. A good way to figure this out is to run it and try really hard to make it crash, e.g. by typing in the wrong type.
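For example, a minimal sketch of guarding user input that is meant to be a number (the prompt text is made up):

def ask_for_year():
    ''' Keep asking until the user types something that parses as an int. '''
    while True:
        raw = raw_input('Enter a year: ')
        try:
            return int(raw)
        except ValueError:
            print 'Not a number, try again.'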

The answer to "where should I have error handling in my script?" is basically "any place where something could go wrong", which depends entirely on the logic of your program.
In general, any place where your program relies on an assumption that a particular operation worked as you intended, and there's a possibility that it may not have, you should add code to check whether or not it actually did work, and take appropriate remedial action if it didn't. In some cases, the underlying code might generate an exception on failure and you may be happy to just let the program terminate with an uncaught exception without adding any error-handling code of your own, but (1) this would be, or ought to be, rare if anyone other than you is ever going to use that program; and (2) I'd say this would fall into the "works as intended" category anyway.
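For example, a sketch of turning a low-level failure into an error message that names the violated assumption, while still letting the program terminate (the function and message are made up for illustration):

def first_cell(row):
    ''' Return the first cell of a parsed table row. '''
    try:
        return row.contents[0]
    except (AttributeError, IndexError):
        # Re-raise with a message that says which assumption failed.
        raise ValueError('expected a table row with at least one cell, got %r' % (row,))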

Related

python try and exception

I am not a python geek but have tried to solve this problem using information from several answers to similar questions but none seem to really work in my case. Here it is:
I am calling a function from a python script:
Here is the function:
def getsom(X):
    #some codes
    try:
        st = get data from site 1 using X
    except:
        print "not available from site 1, getting from site 2"
        st = get data from site 2 using X
    #some codes that depend on st
I am calling this from a python script as such:
#some codes
for yr in range(min_yr, max_yr+1):
    day = 1
    while day < max_day:
        st1 = getsom(X)
        #some code that depends on st1
        day += 1
This works fine when data is available on either site 1 or 2 for a particular day, but breaks down when it is unavailable on both sites for another day.
I want to be able to check for the next day if data is unavailable for a particular day for both sites. I have tried different configurations of try and except with no success and would appreciate any help on the most efficient way to do this.
Thanks!
***Edits
Final version that worked:
In the function part:
def getsom(X):
    #some codes
    try:
        st = get data from site 1 using X
    except:
        print "not available from site 1, getting from site 2"
        try:
            st = get data from site 2 using X
        except:
            print "data not available from sites 1 and 2"
            st = None
    if st is not None:
        #some codes that depend on st
In order to iterate to the next day on the script side, I had to handle the None case from the function with another try/except block:
#some codes
for yr in range(min_yr, max_yr+1):
    day = 1
    while day < max_day:
        try:
            st = getsom(X)
        except:
            st = None
        if st is not None:
            #some codes that depend on st
As mentioned in the comments, it seems you want to catch the exception in the first-level exception handler. You can do it like this:
def getsom(X):
    #some codes
    try:
        st = get data from site 1 using X
    except:
        print "not available from site 1, getting from site 2"
        try:
            st = get data from site 2 using X
        except:
            print "Not available from site 2 as well."
            # Here you can assign some arbitrary value to your variable (like None for example) or return from function.
    #some codes that depend on st
If data is not available on either of the sites, you can assign some arbitrary value (such as None) to your variable st or simply return from the function.
Is this what you are looking for?
Also, you shouldn't simply write except without specifying the type of exception you expect - look here for more info: Should I always specify an exception type in `except` statements?
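For example, if the data is fetched over HTTP with urllib2 (an assumption, since the question only shows pseudocode), the handler can name the network error explicitly instead of catching everything:

import urllib2

def fetch(url):
    ''' Return the body of url, or None if it could not be retrieved. '''
    try:
        return urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print "could not fetch %s: %s" % (url, e)
        return None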
Edit, to answer the problem in the comment:
If you have no data for a certain day, you can just return None and handle it like this:
#some codes
for yr in range(min_yr, max_yr+1):
    day = 1
    while day < max_day:
        st1 = getsom(X)
        if st1 is not None:
            #some code that depends on st1
        day += 1
Why don't you create a separate function for it?
def getdata(X):
    for site in [site1, site2]:  # possibly more
        try:
            return get_data_from_site_using_X()
        except:
            print "not available in %s" % site
    print "couldn't find data anywhere"
Then getsom becomes:
def getsom(X):
    #some codes
    st = getdata(X)
    #some codes that depend on st
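Note that getdata falls off the end and so implicitly returns None when neither site worked, so getsom still needs a guard before using the result; a minimal sketch (keeping the pseudocode parts from the question):

def getsom(X):
    #some codes
    st = getdata(X)
    if st is None:
        return None  # propagate "no data for this day" to the calling loop
    #some codes that depend on st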

Python: return empty value on exception

I have some experience in Python, but I have never used try/except blocks to catch errors, due to a lack of formal training.
I am working on extracting a few articles from wikipedia. For this I have an array of titles, a few of which do not have any article or search result at the end. I would like the page retrieval function just to skip those few names and continue running the script on the rest. Reproducible code follows.
import wikipedia
# This one works.
links = ["CPython"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
#The sequence breaks down if there is no wikipedia page.
links = ["CPython","no page"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
The library uses a function like the one below. Normally it would be really bad practice to modify it, but since this is just for a one-off data extraction, I am willing to change my local copy of the library to get it to work. Edit: I have included the complete function now.
def page(title=None, pageid=None, auto_suggest=True, redirect=True, preload=False):
    '''
    Get a WikipediaPage object for the page with title `title` or the pageid
    `pageid` (mutually exclusive).
    Keyword arguments:
    * title - the title of the page to load
    * pageid - the numeric pageid of the page to load
    * auto_suggest - let Wikipedia find a valid page title for the query
    * redirect - allow redirection without raising RedirectError
    * preload - load content, summary, images, references, and links during initialization
    '''
    if title is not None:
        if auto_suggest:
            results, suggestion = search(title, results=1, suggestion=True)
            try:
                title = suggestion or results[0]
            except IndexError:
                # if there is no suggestion or search results, the page doesn't exist
                raise PageError(title)
        return WikipediaPage(title, redirect=redirect, preload=preload)
    elif pageid is not None:
        return WikipediaPage(pageid=pageid, preload=preload)
    else:
        raise ValueError("Either a title or a pageid must be specified")
What should I do to retrieve only the pages that do not give the error? Maybe there is a way to filter out all items in the list that give this error or an error of some kind. Returning "NA" or similar would be fine for pages that don't exist. Skipping them without notice would be fine too. Thanks!
The function wikipedia.page will raise a wikipedia.exceptions.PageError if the page doesn't exist. That's the error you want to catch.
import wikipedia

links = ["CPython", "no page"]
test = []
for link in links:
    try:
        # try to load the wikipedia page
        page = wikipedia.page(link, auto_suggest=False)
        test.append(page)
    except wikipedia.exceptions.PageError:
        # if a "PageError" was raised, ignore it and continue to the next link
        continue
You have to surround the call to wikipedia.page with a try block, so I'm afraid you can't use a list comprehension.
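If you prefer to keep the comprehension style, one option (a sketch; the helper name is made up) is to wrap the call in a small function that swallows only PageError:

import wikipedia

def page_or_none(link):
    ''' Return the WikipediaPage for link, or None if the page doesn't exist. '''
    try:
        return wikipedia.page(link, auto_suggest=False)
    except wikipedia.exceptions.PageError:
        return None

links = ["CPython", "no page"]
pages = [page_or_none(link) for link in links]
test = [p.content for p in pages if p is not None]
print(test)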
Understand that this will be bad practice, but for a one-off quick and dirty script you can just:
Edit: Wait, sorry. I've just noticed the list comprehension. I'm actually not sure if this will work without breaking that down:
links = ["CPython", "no page"]
test = []
for link in links:
    try:
        page = wikipedia.page(link, auto_suggest=False)
        test.append(page)
    except wikipedia.exceptions.PageError:
        pass
test = [testitem.content for testitem in test]
print(test)
pass essentially tells Python to trust you and ignore the error so that it can continue on about its day.

Pandas: Repeat function for current keyword if except

I have built a web scraper. The program enters searchterm into a searchbox and grabs the results. Pandas goes through a spreadsheet line-by-line in a column to retrieve each searchterm.
Sometimes the page doesn't load properly, prompting a refresh.
I need a way for it to repeat the function and try the same searchterm if it fails. Right now, if I return, it would go on to the next line in the spreadsheet.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

df = pd.read_csv("searchterms.csv", delimiter=",")

def scrape(searchterm):
    #Loads url
    searchbox = driver.find_element_by_name("searchbox")
    searchbox.clear()
    searchbox.send_keys(searchterm)
    print "Searching for %s ..." % searchterm
    no_result = True
    while no_result is True:
        try:
            #Find results, grab them
            no_result = False
        except:
            #Refresh page and do the above again for the current searchterm - How?
            driver.refresh()
    return pd.Series([col1, col2])

df[["Column 1", "Column 2"]] = df["searchterm"].apply(scrape)
#Executes crawl for each line in csv
The try/except construct comes with an else clause. The else block is executed if everything goes OK:

def scrape(searchterm):
    #Loads url
    no_result = True
    while no_result:
        #Find results, grab them
        searchbox = driver.find_element_by_name("searchbox")
        searchbox.clear()
        try:  # assumes that an exception is thrown if there are no results
            searchbox.send_keys(searchterm)
            print "Searching for %s ..." % searchterm
        except:
            #Refresh page and do the above again for the current searchterm
            driver.refresh()
        else:  # executed if no exceptions were thrown
            no_result = False
            # .. some post-processing code here
    return pd.Series([col1, col2])
(There is also a finally block that is executed no matter what, which is useful for cleanup tasks that don't depend on the success or failure of the preceding code)
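For example, a minimal sketch of how finally could be used in the question's script to guarantee the browser is closed however the crawl ends (the KeyboardInterrupt handler is just an illustrative choice; df and scrape are the ones defined in the question):

from selenium import webdriver

driver = webdriver.Firefox()
try:
    df[["Column 1", "Column 2"]] = df["searchterm"].apply(scrape)
except KeyboardInterrupt:
    print "Crawl interrupted, keeping whatever was scraped so far"
finally:
    # Runs whether the crawl finished, raised, or was interrupted,
    # so the browser window is never left open.
    driver.quit()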
Also, note that a bare except catches any exception and is almost never a good idea. I'm not familiar with how selenium handles errors, but when catching exceptions, you should specify which exception you are expecting to handle. This way, if an unexpected exception occurs, your code will abort and you'll know that something bad happened.
That is also why you should try to keep as few lines as possible within the try block.

How do I handle partially initialized classes

My question concerns a class that I am writing that may or may not be fully initialized. The basic goal is to take a match_id and open the corresponding match_url (example: http://dota2lounge.com/match?m=1899) and then grab some properties out of the webpage. The problem is some match_ids will result in 404 pages (http://dota2lounge.com/404).
When this happens, there won't be a way to determine the winner of the match, so the rest of the Match can't be initialized. I have seen this causing problems with methods of the Match, so I added the lines to initialize everything to None if self._valid_url is False. This works in principle, but then I'm adding a line each time a new attribute is added, and it seems prone to errors down the pipeline (in methods, etc.). It also doesn't alert the user that this class wasn't properly initialized. They would need to call .is_valid_match() to determine that.
tl;dr: What is the best way to handle classes that may be only partially initiated? Since this is a hobby project and I'm looking to learn, I'm open to pretty much any solutions (trying new things), including other classes or whatever. Thanks.
This is an abbreviated version of the code containing the relevant portions (Python 3.3):
from urllib.request import urlopen
from bs4 import BeautifulSoup


class Match(object):
    def __init__(self, match_id):
        self.match_id = match_id
        self.match_url = self.__determine_match_url__()
        self._soup = self.__get_match_soup__()
        self._valid_match_url = self.__determine_match_404__()
        if self._valid_match_url:
            self.teams, self.winner = self.__get_teams_and_winner__()
        # These lines were added, but I'm not sure if this is correct.
        else:
            self.teams, self.winner = None, None

    def __determine_match_url__(self):
        return 'http://dota2lounge.com/match?m=' + str(self.match_id)

    def __get_match_soup__(self):
        return BeautifulSoup(urlopen(self.match_url))

    def __get_match_details__(self):
        return self._soup.find('section', {'class': 'box'})

    def __determine_match_404__(self):
        try:
            if self._soup.find('h1').text == '404':
                return False
        except AttributeError:
            return True

    def __get_teams_and_winner__(self):
        teams = [team.getText() for team in
                 self._soup.find('section', {'class': 'box'}).findAll('b')]
        winner = False
        for number, team in enumerate(teams):
            if ' (win)' in team:
                teams[number] = teams[number].replace(' (win)', '')
                winner = teams[number]
        return teams, winner

    def is_valid_match(self):
        return all([self._valid_match_url, self.winner])
I would raise an exception, handle that in your creation code (wherever you call some_match = Match(match_id)), and probably don't add it to whatever list you may or may not be using...
For a better answer, you might want to include in your question the code that instantiates all your Match objects.
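A minimal sketch of that idea, assuming a custom exception and a made-up range of match ids (the 404 check mirrors the one in the question; the rest of the initialization is elided):

from urllib.request import urlopen
from bs4 import BeautifulSoup

class MatchNotFound(Exception):
    ''' Raised when the match page turns out to be a 404. '''

class Match(object):
    def __init__(self, match_id):
        self.match_id = match_id
        self.match_url = 'http://dota2lounge.com/match?m=' + str(match_id)
        self._soup = BeautifulSoup(urlopen(self.match_url))
        h1 = self._soup.find('h1')
        if h1 is not None and h1.text == '404':
            # Refuse to build a half-initialized object.
            raise MatchNotFound(self.match_url)
        # ... the rest of the initialization (teams, winner) goes here ...

matches = []
for match_id in range(1895, 1900):  # made-up range of ids
    try:
        matches.append(Match(match_id))
    except MatchNotFound as err:
        print('Skipping missing match:', err)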

Dictionary / JSON issue using Python 2.7

I'm looking at scraping some data from Facebook using Python 2.7. My code basically increments the Facebook profile ID by 1 and then captures the details returned by the page.
An example of the page I'm looking to capture the data from is graph.facebook.com/4.
Here's my code below:
import scraperwiki
import urlparse
import simplejson

source_url = "http://graph.facebook.com/"
profile_id = 1
while True:
    try:
        profile_id += 1
        profile_url = urlparse.urljoin(source_url, str(profile_id))
        results_json = simplejson.loads(scraperwiki.scrape(profile_url))
        for result in results_json['results']:
            print result
            data = {}
            data['id'] = result['id']
            data['name'] = result['name']
            data['first_name'] = result['first_name']
            data['last_name'] = result['last_name']
            data['link'] = result['link']
            data['username'] = result['username']
            data['gender'] = result['gender']
            data['locale'] = result['locale']
            print data['id'], data['name']
            scraperwiki.sqlite.save(unique_keys=['id'], data=data)
            #time.sleep(3)
    except:
        continue
    profile_id += 1
I am using the scraperwiki site to carry out this check, but no data is printed back to the console, despite the line print data['id'], data['name'] being there just to check that the code is working.
Any suggestions on what is wrong with this code? As said, for each returned profile, the unique data should be captured and printed to screen as well as populated into the sqlite database.
Thanks
Any suggestions on what is wrong with this code?
Yes. You are swallowing all of your errors. There could be a huge number of things going wrong in the block under try. If anything goes wrong in that block, you move on without printing anything.
You should only ever use a try / except block when you are looking to handle a specific error.
Modify your code so that it looks like this:
while True:
    profile_id += 1
    profile_url = urlparse.urljoin(source_url, str(profile_id))
    results_json = simplejson.loads(scraperwiki.scrape(profile_url))
    for result in results_json['results']:
        print result
        data = {}
        # ... more ...
and then you will get detailed error messages when specific things go wrong.
As for your concern in the comments:
The reason I have the error handling is because, if you look for
example at graph.facebook.com/3, this page contains no user data and
so I don't want to collate this info and skip to the next user, ie. no
4 etc
If you want to handle the case where there is no data, then find a way to handle that case specifically. It is bad practice to swallow all errors.
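For example, a sketch of handling just that case explicitly (this assumes the missing data shows up as an absent 'results' key in the JSON; adjust the check to whatever the Graph API actually returns for an empty profile):

while True:
    profile_id += 1
    profile_url = urlparse.urljoin(source_url, str(profile_id))
    results_json = simplejson.loads(scraperwiki.scrape(profile_url))
    if 'results' not in results_json:
        # No user data for this profile (like graph.facebook.com/3); skip it.
        print "no data for profile %d, skipping" % profile_id
        continue
    for result in results_json['results']:
        print result
        # ... save the fields as before ...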
