I'm currently working with Scrapy to pull company information from a website. However, the amount of data provided varies a lot across pages; say, one company lists three of its team members while another only lists two, or one company lists where it's located while another doesn't. As a result, some XPaths may not match anything, so indexing into their results raises errors:
try:
    item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract()[0]
except IndexError:
    item['industry'] = "None provided"
try:
    item['URL'] = hxs.xpath('//*[@id="ContentPlaceHolder_lnkWebsite"]/text()').extract()[0]
except IndexError:
    item['URL'] = "None provided"
try:
    item['desc'] = hxs.xpath('//*[@id="overview"]/div[2]/div[4]/p/text()[1]').extract()[0]
except IndexError:
    item['desc'] = "None provided"
try:
    item['founded'] = hxs.xpath('//*[@id="ContentPlaceHolder_updSummary"]/div/div[2]/table/tbody/tr/td[1]/text()').extract()[0]
except IndexError:
    item['founded'] = "None provided"
My code uses many try/except blocks. Since each exception is specific to the field I am trying to populate, is there a cleaner way of working around this?
Use the TakeFirst() output processor:
Returns the first non-null/non-empty value from the values received,
so it’s typically used as an output processor to single-valued fields.
from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import TakeFirst  # scrapy.loader.processors in newer Scrapy versions

class MyItem(Item):
    industry = Field(output_processor=TakeFirst())
    ...
Then, inside the spider, you would not need try/except at all:
item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract()
In the latest Scrapy versions, extract_first() is used for this. It returns None if the search doesn't return anything, so you will have no errors.
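For example, a minimal sketch using the selectors from the question (hxs stands in for whatever selector object you already have, e.g. response.xpath in newer spiders; extract_first() also accepts a default, which removes the need for a fallback assignment):

item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract_first(default="None provided")
item['URL'] = hxs.xpath('//*[@id="ContentPlaceHolder_lnkWebsite"]/text()').extract_first(default="None provided")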
I'm trying to extract values from JSON input using Python. There are many tags that I need to extract, and not all JSON files have the same structure because the sources are multiple. Sometimes a tag might be missing, so a KeyError is bound to happen. If a tag is missing, the respective variable will by default be None, and it will be returned as part of a list (members) to the main call.
I tried calling a function to pass each tag into an individual try/except, but I got hit by an error on the function call itself where the tag is being passed. So instead I tried the code below, but it skips any subsequent lines even if the tags are present. Is there a better way to do this?
def extract(self):
    try:
        self.data_version = self.data['meta']['data_version']
        self.created = self.data['meta']['created']
        self.revision = self.data['meta']['revision']
        self.gender = self.data['info']['gender']
        self.season = self.data['info']['season']
        self.team_type = self.data['info']['team_type']
        self.venue = self.data['info']['venue']
        status = True
    except KeyError:
        status = False
    members = [attr for attr in dir(self) if
               not callable(getattr(self, attr)) and not attr.startswith("__") and getattr(self, attr) is None]
    return status, members
UPDATED:
Thanks Barmar & John! .get() worked really well.
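For reference, a minimal sketch of that .get() approach (the attribute names come from the question; treating 'meta' and 'info' as nested dicts is an assumption about the JSON layout):

def extract(self):
    meta = self.data.get('meta', {})
    info = self.data.get('info', {})
    self.data_version = meta.get('data_version')
    self.created = meta.get('created')
    self.revision = meta.get('revision')
    self.gender = info.get('gender')
    self.season = info.get('season')
    self.team_type = info.get('team_type')
    self.venue = info.get('venue')
    # Missing tags simply stay None instead of raising KeyError,
    # so every field after a missing one is still processed.
    members = [attr for attr in dir(self) if
               not callable(getattr(self, attr)) and not attr.startswith("__") and getattr(self, attr) is None]
    return len(members) == 0, members

Here status simply becomes "no attribute ended up as None", which is roughly what the original flag was trying to express.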
I am working on a Flask app and using MongoDB with it. In one endpoint I take CSV files and insert their content into MongoDB with insert_many(). Before inserting, I create a unique index to prevent duplication in MongoDB. When there is no duplication I can reach inserted_ids for that operation, but when it raises a duplication error I get None and I can't get inserted_ids. I am also using ordered=False. Is there any way to get inserted_ids even with a duplicate key error?
def createBulk():  # in controller
    identity = get_jwt_identity()
    try:
        csv_file = request.files['csv']
        insertedResult = ProductService(identity).create_product_bulk(csv_file)
        print(insertedResult)  # this result is None when we get a DuplicateKeyError
        threading.Thread(target=ProductService(identity).sendInsertedItemsToEventCollector, args=(insertedResult,)).start()
        return json_response(True, status=200)
    except Exception as e:
        print("insertedResultErr -> ", str(e))
        return json_response({'error': str(e)}, 400)
def create_product_bulk(self, products):  # in service
    data_frame = read_csv(products)
    data_json = data_frame.to_json(orient="records", force_ascii=False)
    try:
        return self.repo_client.create_bulk(loads(data_json))
    except bulkErr as e:
        print(str(e))
        pass
    except DuplicateKeyError as e:
        print(str(e))
        pass
def create_bulk(self, products):  # in repo
    self.checkCollectionName()
    self.db.get_collection(name=self.collection_name).create_index('barcode', unique=True)
    return self.db.get_collection(name=self.collection_name).insert_many(products, ordered=False)
Unfortunately, not in the way you have done it with the current PyMongo drivers. As you have found, if you get errors in your insert_many() it will throw an exception, and the exception detail does not contain the inserted_ids.
It does contain details of the keys that fail (in e.details['writeErrors'][]['keyValue']), so you could try to work backwards from that to your original products list.
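A sketch of that work-backwards idea, assuming 'barcode' is the unique key (as in the repo code above) and that collection is the target collection:

from pymongo.errors import BulkWriteError

try:
    result = collection.insert_many(products, ordered=False)
    inserted_docs = products  # everything went in
except BulkWriteError as e:
    # keyValue holds the duplicate key for each failed write, e.g. {'barcode': '...'}
    failed_barcodes = {err['keyValue']['barcode'] for err in e.details['writeErrors']}
    # with ordered=False, the non-duplicate documents were still inserted
    inserted_docs = [p for p in products if p['barcode'] not in failed_barcodes]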
Your other workaround is to use insert_one() in a loop with a try ... except and check each insert. I know this is less efficient but it's a workaround ...
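And a sketch of that insert_one() loop, written against the repo method from the question (method and attribute names are taken from that code):

from pymongo.errors import DuplicateKeyError

def create_bulk(self, products):  # in repo, one insert at a time
    self.checkCollectionName()
    collection = self.db.get_collection(name=self.collection_name)
    collection.create_index('barcode', unique=True)
    inserted_ids = []
    for product in products:
        try:
            inserted_ids.append(collection.insert_one(product).inserted_id)
        except DuplicateKeyError:
            pass  # skip duplicates but keep the ids that did insert
    return inserted_ids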
I need to retrieve a list of objects from a cloud API. The list could be very short or very long. If there are more than 100 items in the returned list, a paging header is sent in the response as a reference point to send with the following request.
I've been trying to write a loop that would cover this, but the code is not reliable or very efficient:
paging = ''
objects = cloud.list_objects()
try:
    paging = objects.headers['next-page']
except KeyError:
    pass
while len(paging) > 0:
    objects = cloud.list_objects(page=paging)
    try:
        paging = objects.headers['next-page']
    except KeyError:
        paging = ''
    else:
        pass
You can drop the try/except entirely by using dict.get(), which returns None when the 'next-page' header is missing:
paging = ''
while True:
    objects = cloud.list_objects(page=paging)
    paging = objects.headers.get('next-page')
    if not paging:
        break
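If you also need to keep what each page returned, the same loop can accumulate the responses as it goes (a sketch; how you pull items out of each response depends on the cloud client, which the question doesn't show):

all_responses = []
paging = ''
while True:
    objects = cloud.list_objects(page=paging)
    all_responses.append(objects)  # process or unpack each page as needed
    paging = objects.headers.get('next-page')
    if not paging:
        break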
I have some experience in Python, but I have never used try/except to catch errors, due to a lack of formal training.
I am working on extracting a few articles from Wikipedia. For this I have an array of titles, a few of which do not have any article or search result. I would like the page retrieval function to just skip those few names and continue running the script on the rest. Reproducible code follows.
import wikipedia
# This one works.
links = ["CPython"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
# The sequence breaks down if there is no Wikipedia page.
links = ["CPython","no page"]
test = [wikipedia.page(link, auto_suggest=False) for link in links]
test = [testitem.content for testitem in test]
print(test)
The library uses a method like this to retrieve pages. Normally changing it would be really bad practice, but since this is just for a one-off data extraction, I am willing to change the local copy of the library to get it to work. Edit: I have now included the complete function.
def page(title=None, pageid=None, auto_suggest=True, redirect=True, preload=False):
    '''
    Get a WikipediaPage object for the page with title `title` or the pageid
    `pageid` (mutually exclusive).
    Keyword arguments:
    * title - the title of the page to load
    * pageid - the numeric pageid of the page to load
    * auto_suggest - let Wikipedia find a valid page title for the query
    * redirect - allow redirection without raising RedirectError
    * preload - load content, summary, images, references, and links during initialization
    '''
    if title is not None:
        if auto_suggest:
            results, suggestion = search(title, results=1, suggestion=True)
            try:
                title = suggestion or results[0]
            except IndexError:
                # if there is no suggestion or search results, the page doesn't exist
                raise PageError(title)
        return WikipediaPage(title, redirect=redirect, preload=preload)
    elif pageid is not None:
        return WikipediaPage(pageid=pageid, preload=preload)
    else:
        raise ValueError("Either a title or a pageid must be specified")
What should I do to retrieve only the pages that do not give the error? Maybe there is a way to filter out all items in the list that give this error, or an error of some kind. Returning "NA" or similar would be fine for pages that don't exist. Skipping them without notice would be fine too. Thanks!
The function wikipedia.page will raise a wikipedia.exceptions.PageError if the page doesn't exist. That's the error you want to catch.
import wikipedia

links = ["CPython", "no page"]
test = []
for link in links:
    try:
        # try to load the wikipedia page
        page = wikipedia.page(link, auto_suggest=False)
        test.append(page)
    except wikipedia.exceptions.PageError:
        # if a "PageError" was raised, ignore it and continue to the next link
        continue
You have to surround the call to wikipedia.page with a try block, so I'm afraid you can't use a list comprehension.
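If you would rather keep something close to a comprehension, one option is a small wrapper that swallows the error; safe_page below is a hypothetical helper, not part of the wikipedia library:

import wikipedia

def safe_page(link):
    """Return the WikipediaPage, or None if the page doesn't exist."""
    try:
        return wikipedia.page(link, auto_suggest=False)
    except wikipedia.exceptions.PageError:
        return None

links = ["CPython", "no page"]
pages = [p for p in (safe_page(link) for link in links) if p is not None]
test = [p.content for p in pages]
print(test)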
Understand that this is bad practice, but for a one-off quick and dirty script you can just:
edit: Wait, sorry. I've just noticed the list comprehension. I'm actually not sure if this will work without breaking that down:
links = ["CPython", "no page"]
test = []
for link in links:
try:
page = wikipedia.page(link, auto_suggest=False)
test.append(page)
except wikipedia.exceptions.PageError:
pass
test = [testitem.content for testitem in test]
print(test)
pass tells Python to essentially trust you and ignore the error so that it can continue on about its day.
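If you would rather keep a placeholder than skip silently (the question says returning "NA" would be fine), a small variation on the same loop:

test = []
for link in links:
    try:
        test.append(wikipedia.page(link, auto_suggest=False).content)
    except wikipedia.exceptions.PageError:
        test.append("NA")  # placeholder for titles with no page
print(test)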
I'm trying to parse Scrapy items, each of which has several fields. It happens that some of the fields cannot be properly captured due to incomplete information on the site. If just one of the fields cannot be returned, the entire operation of extracting an item breaks with an exception (e.g. for the code below I get "Attribute: None cannot be split"). The parser then moves to the next request, without capturing the other fields that were available.
item['prodcode'] = response.xpath('//head/title').re_first(r'.....').split(" ")[1]
# throws: Attribute: None cannot be split. Does not parse other fields.
What is the way of handling such exceptions in Scrapy? I would like to retrieve information from all fields that were available, while the unavailable ones return a blank or N/A. I could do try... except... on each of the item fields, but this does not seem like the best solution. The docs mention exception handling, but somehow I cannot find a way to apply it to this case.
The most naive approach here would be to follow the EAFP approach and handle exceptions directly in the spider. For instance:
try:
    item['prodcode'] = response.xpath('//head/title').re_first(r'.....').split(" ")[1]
except AttributeError:
    item['prodcode'] = 'n/a'
A better option here could be to delegate the item field parsing logic to Item Loaders and their Input and Output Processors. That way, your spider would only be responsible for parsing HTML and extracting the desired data, while all of the post-processing and prettifying would be handled by an Item Loader. In other words, in your spider, you would only have:
loader = MyItemLoader(response=response)
# ...
loader.add_xpath("prodcode", "//head/title", re=r'.....')
# ...
loader.load_item()
And the Item Loader would have something like:
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst  # itemloaders.processors in newer Scrapy versions


def parse_title(title):
    try:
        return title.split(" ")[1]
    except Exception:  # FIXME: handle more specific exceptions
        return 'n/a'


class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

    prodcode_in = MapCompose(parse_title)
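For completeness, a minimal sketch of the item definition this assumes (the question only shows the prodcode field, so the class below is an assumption):

from scrapy.item import Item, Field

class MyItem(Item):
    prodcode = Field()  # assumed field; declare your other fields the same way

With this in place, MapCompose runs parse_title on each extracted value and TakeFirst keeps the first non-empty result, so loader.load_item() fills prodcode with either the parsed code or 'n/a'.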