Python: return empty value on exception

I have some experience in Python, but I have never used try/except blocks to catch errors, due to a lack of formal training.
I am working on extracting a few articles from Wikipedia. I have an array of titles, a few of which do not have any article or search result behind them. I would like the page-retrieval function to just skip those few names and continue running the script on the rest. Reproducible code follows.
import wikipedia
# This one works.
links = ["CPython"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
# The sequence breaks down if there is no Wikipedia page.
links = ["CPython","no page"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
The library uses a method like this internally. Normally changing it would be really bad practice, but since this is just a one-off data extraction, I am willing to modify my local copy of the library to get it to work. Edit: I have now included the complete function.
def page(title=None, pageid=None, auto_suggest=True, redirect=True, preload=False):
    '''
    Get a WikipediaPage object for the page with title `title` or the pageid
    `pageid` (mutually exclusive).

    Keyword arguments:

    * title - the title of the page to load
    * pageid - the numeric pageid of the page to load
    * auto_suggest - let Wikipedia find a valid page title for the query
    * redirect - allow redirection without raising RedirectError
    * preload - load content, summary, images, references, and links during initialization
    '''
    if title is not None:
        if auto_suggest:
            results, suggestion = search(title, results=1, suggestion=True)
            try:
                title = suggestion or results[0]
            except IndexError:
                # if there is no suggestion or search results, the page doesn't exist
                raise PageError(title)
        return WikipediaPage(title, redirect=redirect, preload=preload)
    elif pageid is not None:
        return WikipediaPage(pageid=pageid, preload=preload)
    else:
        raise ValueError("Either a title or a pageid must be specified")
What should I do to retrieve only the pages that do not give the error? Maybe there is a way to filter out all items in the list that raise this error, or an error of some kind. Returning "NA" or similar for pages that don't exist would be fine. Skipping them without notice would be fine too. Thanks!

The function wikipedia.page will raise a wikipedia.exceptions.PageError if the page doesn't exist. That's the error you want to catch.
import wikipedia

links = ["CPython", "no page"]
test = []
for link in links:
    try:
        # try to load the wikipedia page
        page = wikipedia.page(link, auto_suggest=False)
        test.append(page)
    except wikipedia.exceptions.PageError:
        # if a PageError was raised, ignore it and continue to the next link
        continue
You have to surround the call to wikipedia.page with a try block, so I'm afraid you can't use a list comprehension directly.
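You can still keep a comprehension by moving the try/except into a small helper function. A generic sketch of that idea (call_or_default and its parameters are my own names, not part of the wikipedia library):

```python
def call_or_default(func, *args, default=None, exceptions=(Exception,), **kwargs):
    """Call func; return `default` instead of raising one of `exceptions`."""
    try:
        return func(*args, **kwargs)
    except exceptions:
        return default

# With the wikipedia library the comprehension would become (hypothetical usage):
#   pages = [call_or_default(wikipedia.page, link, auto_suggest=False,
#                            exceptions=(wikipedia.exceptions.PageError,))
#            for link in links]
#   contents = [p.content for p in pages if p is not None]

# Stdlib demonstration: int() raises ValueError on non-numeric input.
values = [call_or_default(int, s, default="NA", exceptions=(ValueError,))
          for s in ["1", "two", "3"]]
```

This also gives you the "NA" behaviour asked for in the question: pass `default="NA"` and failed lookups show up as "NA" instead of disappearing silently.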

Understand that this is bad practice, but for a one-off quick and dirty script you can just catch the exception and move on.
Edit: I've just noticed the list comprehension. I don't think this will work without breaking it down into an explicit loop:
links = ["CPython", "no page"]
test = []
for link in links:
    try:
        page = wikipedia.page(link, auto_suggest=False)
        test.append(page)
    except wikipedia.exceptions.PageError:
        pass

test = [testitem.content for testitem in test]
print(test)
`pass` is a no-op: the except clause catches the error, and `pass` essentially tells Python to ignore it and continue on about its day.

Related

How to make my bot skip over urls that don't exist

Hey guys, I was wondering if there was a way to make my bot skip invalid URLs after one try and continue with the for loop, but continue doesn't seem to work.
def check_valid(stripped_results):
    global vstripped_results
    vstripped_results = []
    for tag in stripped_results:
        conn = requests.head("https://" + tag)
        conn2 = requests.head("http://" + tag)
        status_code = conn.status_code
        website_is_up = status_code == 200
        if website_is_up:
            vstripped_results.append(tag)
        else:
            continue
stripped_results is an array of an unknown number of domains and subdomains, which is why I have the 'https://' part, and to be honest I'm not even sure whether my if statement is effective or not.
Any help would be greatly appreciated. I don't want to get rate-limited by Discord anymore from sending so many invalid domains through. :(
This is easy. To check the validity of a URL there exists a Python library, namely Validators. This library can be used to check whether a URL is well formed. Let's take it step by step.
Firstly,
Here is the documentation link for validators:
https://validators.readthedocs.io/en/latest/
How do you validate a link using validators?
It is simple. Let's work on the command line for a moment.
The module gives a boolean-style result for whether a link is valid or not: for the link of this question it returned True, and for an invalid link it returns a falsy failure object instead.
You can validate it using this syntax:
validators.url('Add your URL variable here')
Remember that this gives a truthy/falsy value, so code for it that way.
I won't implement it in your code, as I want you to try it yourself once. I'll help you with this if you are unable to do it.
Thank you! :)
Try this? Wrapping the request in try/except is what actually lets the loop skip domains that can't be reached at all:
def check_valid(stripped_results):
    global vstripped_results
    vstripped_results = []
    for tag in stripped_results:
        try:
            conn = requests.head("https://" + tag)
        except requests.exceptions.RequestException:
            # the domain could not be resolved or reached - skip it
            continue
        status_code = conn.status_code
        website_is_up = status_code == 200
        if website_is_up:
            vstripped_results.append(tag)
        else:
            continue
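If you'd rather not add a dependency, a stdlib-only sketch of the same idea (is_reachable is my name, not a library function) catches the errors urllib raises for bad or unreachable URLs:

```python
import urllib.request
import urllib.error

def is_reachable(url, timeout=5):
    """Return True if the URL responds with 200, False for malformed or dead URLs."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (ValueError, urllib.error.URLError, OSError):
        # ValueError: malformed URL; URLError/OSError: DNS or connection failure
        return False

# Malformed strings fail fast with ValueError, before any network traffic.
bad = is_reachable("notaurl")
```

A nice property of this shape is that the caller never sees an exception: the function answers yes/no, which is exactly what a filtering loop wants.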

Finding if some tag exists in HTML response and printing if/else accordingly

I am trying to collect data from a website (using Python). On the webpage there are multiple listings of software. Within each listing, my data is in an h5 tag with a certain class ('price_software_details').
However, in some cases the tag, along with the data, is missing. I want to print an 'NA' message if the data and tag are missing; otherwise it should print the data.
I tried the code mentioned below, though it's not working.
Help please!
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    if link.find(class_='price_software_details') == True:
        print(link.getText())
    else:
        print('NA')
Have you tried error handling (try/except)?
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    try:
        item = link.find(class_='price_software_details')
        print(item.get_text())
    except AttributeError:
        # item is None when the tag is missing, so get_text() fails
        print('NA')
You need to know that soup.find() never returns True. It only ever returns the result (a Tag) or None.
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    if link.find(class_='price_software_details'):
        print(link.getText())
    else:
        print('NA')
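The distinction matters because `== True` and truthiness are not the same test in Python. A quick stdlib illustration (find_first is a stand-in for soup.find, not part of BeautifulSoup):

```python
def find_first(items, predicate):
    """Return the first matching item, like soup.find: a result or None."""
    for item in items:
        if predicate(item):
            return item
    return None

match = find_first(["spam", "price_tag"], lambda s: s.startswith("price"))

checks = {
    "match == True": match == True,   # False: a found object is not the value True
    "bool(match)": bool(match),       # True: a found object is truthy
    "no_match is None": find_first([], str.isdigit) is None,  # True
}
```

So `if link.find(...):` works because a found Tag is truthy and None is falsy, while `if link.find(...) == True:` is always False for a Tag.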

python retrieve text from multiple random wikipedia pages

I am using Python 2.7 with the wikipedia package to retrieve the text from multiple random Wikipedia pages, as explained in the docs.
I use the following code:
def get_random_pages_summary(pages=0):
    import wikipedia
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]

text = get_random_pages_summary(50)
and get the following error:
File "/home/user/.local/lib/python2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Priuralsky" may refer to:
Priuralsky District
Priuralsky (rural locality)
What I am trying to do is get the text from random pages in Wikipedia, and I need it to be just regular text, without any markup.
I assume that the problem is getting a random name that has more than one option when searching for a Wikipedia page.
When I use it to get one Wikipedia page, it works well.
Thanks
As you're doing it for random articles via a Wikipedia API (not pulling the HTML directly with other tools), my suggestion would be to catch the DisambiguationError and draw a new random article when this happens.
def random_page():
    random = wikipedia.random(1)
    try:
        result = wikipedia.page(random).summary
    except wikipedia.exceptions.DisambiguationError as e:
        result = random_page()
    return result
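The same idea can be written iteratively, which avoids unbounded recursion if many draws in a row are ambiguous. A generic sketch (retry_until_ok and fetch are my names; the stub stands in for wikipedia.page and DisambiguationError):

```python
def retry_until_ok(fetch, exceptions, max_tries=10):
    """Call fetch() until it succeeds or max_tries attempts have failed."""
    for _ in range(max_tries):
        try:
            return fetch()
        except exceptions:
            continue
    raise RuntimeError("no successful fetch in %d tries" % max_tries)

# Demonstration with a stub that fails twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise KeyError("ambiguous")
    return "summary text"

result = retry_until_ok(flaky, (KeyError,))
```

With the wikipedia library you would pass something like `lambda: wikipedia.page(wikipedia.random(1)).summary` as fetch and `(wikipedia.exceptions.DisambiguationError,)` as the exception tuple.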
According to the documentation (http://wikipedia.readthedocs.io/en/latest/quickstart.html), the error carries multiple page candidates, so you need to look those candidates up again.
try:
    wikipedia.summary("Priuralsky")
except wikipedia.exceptions.DisambiguationError as e:
    for page_name in e.options:
        print(page_name)
        print(wikipedia.page(page_name).summary)
You can improve your code like this.
import wikipedia

def get_page_summaries(page_name):
    try:
        return [[page_name, wikipedia.page(page_name).summary]]
    except wikipedia.exceptions.DisambiguationError as e:
        return [[p, wikipedia.page(p).summary] for p in e.options]

def get_random_pages_summary(pages=0):
    ret = []
    page_names = [wikipedia.random(1) for i in range(pages)]
    for p in page_names:
        for page_summary in get_page_summaries(p):
            ret.append(page_summary)
    return ret

text = get_random_pages_summary(50)

Scrapy - handle exception when one of item fields is not returned

I'm trying to parse Scrapy items, each of which has several fields. Some of the fields cannot be properly captured due to incomplete information on the site. If just one field cannot be extracted, the entire operation of extracting the item breaks with an exception (e.g. for the code below I get "Attribute:None cannot be split"). The parser then moves to the next request without capturing the other fields that were available.
item['prodcode'] = response.xpath('//head/title').re_first(r'.....').split(" ")[1]
#throws: Attribute:None cannot be split . Does not parse other fields.
What is the way of handling such exceptions in Scrapy? I would like to retrieve information from all fields that were available, while the unavailable ones return a blank or N/A. I could do try...except on each of the item fields, but this does not seem like the best solution. The docs mention exception handling, but somehow I cannot find a way for this case.
The most naive approach here would be to follow the EAFP approach and handle exceptions directly in the spider. For instance:
try:
    item['prodcode'] = response.xpath('//head/title').re_first(r'.....').split(" ")[1]
except AttributeError:
    item['prodcode'] = 'n/a'
A better option here could be to delegate the item field parsing logic to Item Loaders and different Input and Output Processors. So that your spider would be only responsible for parsing HTML and extracting the desired data but all of the post-processing and prettifying would be handled by an Item Loader. In other words, in your spider, you would only have:
loader = MyItemLoader(response=response)
# ...
loader.add_xpath("prodcode", "//head/title", re=r'.....')
# ...
loader.load_item()
And the Item Loader would have something like:
def parse_title(title):
    try:
        return title.split(" ")[1]
    except Exception:  # FIXME: handle more specific exceptions
        return 'n/a'

class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()
    prodcode_in = MapCompose(parse_title)
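If you'd rather stay inside the spider, the per-field try/except can be factored into one helper so each field is a single line. A generic sketch (safe_get is my name, not a Scrapy API):

```python
def safe_get(extract, default='n/a'):
    """Run a zero-argument extractor; return `default` if extraction fails."""
    try:
        return extract()
    except (AttributeError, IndexError):
        # AttributeError: re_first() returned None; IndexError: split too short
        return default

# In a spider this would look like (hypothetical usage):
#   item['prodcode'] = safe_get(
#       lambda: response.xpath('//head/title').re_first(r'.....').split(" ")[1])

# Stdlib demonstration of the same failure mode (None has no .split):
title = None
prodcode = safe_get(lambda: title.split(" ")[1])
ok_code = safe_get(lambda: "AB 123".split(" ")[1])
```

The lambda delays evaluation, so the exception is raised inside safe_get rather than at the call site, and every field gets the same fallback behaviour without repeated boilerplate.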

What is the right way to handle errors?

My script below scrapes a website and returns the data from a table. It's not finished, but it works. The problem is that it has no error checking. Where should I have error handling in my script?
There are no unit tests. Should I write some and schedule them to run periodically? Or should the error handling be done in my script?
Any advice on the proper way to do this would be great.
#!/usr/bin/env python
''' Gets the Canadian Monthly Residential Bill Calculations table
    from URL and saves the results to a sqlite database.
'''
import urllib2
from BeautifulSoup import BeautifulSoup

class Bills():
    ''' Canadian Monthly Residential Bill Calculations '''
    URL = "http://www.hydro.mb.ca/regulatory_affairs/energy_rates/electricity/utility_rate_comp.shtml"

    def __init__(self):
        ''' Initialization '''
        self.url = self.URL
        self.data = []
        self.get_monthly_residential_bills(self.url)

    def get_monthly_residential_bills(self, url):
        ''' Gets the Monthly Residential Bill Calculations table from URL '''
        doc = urllib2.urlopen(url)
        soup = BeautifulSoup(doc)
        res_table = soup.table.th.findParents()[1]
        results = res_table.findNextSibling()
        header = self.get_column_names(res_table)
        self.get_data(results)
        self.save(header, self.data)

    def get_data(self, results):
        ''' Extracts data from search result. '''
        rows = results.childGenerator()
        data = []
        for row in rows:
            if row == "\n":
                continue
            for td in row.contents:
                if td == "\n":
                    continue
                data.append(td.text)
            self.data.append(tuple(data))
            data = []

    def get_column_names(self, table):
        ''' Gets table title, subtitle and column names '''
        results = table.findAll('tr')
        title = results[0].text
        subtitle = results[1].text
        cols = results[2].childGenerator()
        column_names = []
        for col in cols:
            if col == "\n":
                continue
            column_names.append(col.text)
        return title, subtitle, column_names

    def save(self, header, data):
        pass

if __name__ == '__main__':
    a = Bills()
    for td in a.data:
        print td
See the documentation of each function to learn which exceptions it can raise.
For example, urllib2.urlopen() is documented to raise URLError on errors, which is a subclass of IOError.
So, for urlopen(), you could do something like:
try:
    doc = urllib2.urlopen(url)
except IOError:
    print >> sys.stderr, 'Error opening URL'
Similarly, do the same for the others.
You should write unit tests and you should use exception handling. But only catch the exceptions you can handle; you do no one any favors by catching everything and throwing any useful information out.
Unit tests aren't run periodically though; they're run before and after the code changes (although it is feasible for one change's "after" to become another change's "before" if they're close enough).
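As a minimal sketch of such a unit test (parse_row is illustrative, not taken from the script above): keep the parsing logic separate from the network I/O so it can be tested against canned input without hitting the website.

```python
import unittest

def parse_row(row):
    """Split a raw table row into stripped, non-empty cells."""
    return [cell.strip() for cell in row.split("|") if cell.strip()]

class ParseRowTest(unittest.TestCase):
    def test_skips_blank_cells(self):
        self.assertEqual(parse_row("Winnipeg | 85.04 | | 91.90"),
                         ["Winnipeg", "85.04", "91.90"])

# Run with: python -m unittest <module name>
```

In the Bills class above, the equivalent refactor would be passing the fetched document into get_data/get_column_names rather than having __init__ do the download, so the tests can supply a saved HTML fixture.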
A couple of places you need them is when importing modules that live under different names, like Tkinter:
try:
    import Tkinter as tk  # Python 2
except ImportError:
    import tkinter as tk  # Python 3
Also anywhere the user enters something with an intended type. A good way to find these spots is to run the script and try really hard to make it crash, e.g. by typing in the wrong type.
The answer to "where should I have error handling in my script?" is basically "any place where something could go wrong", which depends entirely on the logic of your program.
In general, any place where your program relies on an assumption that a particular operation worked as you intended, and there's a possibility that it may not have, you should add code to check whether or not it actually did work, and take appropriate remedial action if it didn't. In some cases, the underlying code might generate an exception on failure and you may be happy to just let the program terminate with an uncaught exception without adding any error-handling code of your own, but (1) this would be, or ought to be, rare if anyone other than you is ever going to use that program; and (2) I'd say this would fall into the "works as intended" category anyway.
