Trying to retrieve an unknown number of items in a list - Python

I need to retrieve a list of objects from a cloud API. The list could be very short or very long. If there are more than 100 items in the returned list, a paging header is sent in the response as a reference point to send on the following request.
I've been trying to write a loop that covers this, but the code is neither reliable nor very efficient:
paging = ''
objects = cloud.list_objects()
try:
    paging = objects.headers['next-page']
except KeyError:
    pass
while len(paging) > 0:
    objects = cloud.list_objects(page=paging)
    try:
        paging = objects.headers['next-page']
    except KeyError:
        paging = ''
    else:
        pass

You can collapse this into a single loop that treats a missing next-page header as the stopping condition (assuming the API accepts an empty page value on the first request):
paging = ''
while True:
    objects = cloud.list_objects(page=paging)
    paging = objects.headers.get('next-page')
    if not paging:
        break
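If several call sites need this, one further option (a sketch only, assuming the same cloud.list_objects API and next-page header shown above) is to hide the pagination inside a generator so callers never touch the paging token:
def iter_object_pages(cloud):
    """Yield each page of results, following next-page headers until none remain."""
    paging = None
    while True:
        # The first request sends no page token; later requests pass the header value.
        objects = cloud.list_objects() if paging is None else cloud.list_objects(page=paging)
        yield objects
        paging = objects.headers.get('next-page')
        if not paging:
            break
Callers can then simply write for page in iter_object_pages(cloud): ... and process every page as it arrives.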


Trying to put_item() to DynamoDB only if the complete instance with all attribute values doesn't exist already

def put_items_db(self, data_dict):
    """
    Put provided dictionary to lqtpid database
    """
    try:
        response = self.table.put_item(Item=data_dict,
                                       ConditionExpression='attribute_not_exists(firstName)'
                                                           ' AND attribute_not_exists(lastName)')
        http_code_response = response['ResponseMetadata']['HTTPStatusCode']
        logging.debug(f'http code response for db put {http_code_response}')
    except ClientError as e:
        # Ignore the ConditionalCheckFailedException
        if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise
When running the code it is still uploading entries that already exist...
What are your keys for that table? I'm assuming you are putting an item that has different keys than the item you're comparing it to. With ConditionExpression you only compare the item you're writing against one item in the table: the one with exactly the same keys.
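To illustrate (a sketch only; it assumes a hypothetical table named lqtpid whose partition key is firstName, which may not match the real schema), the condition only guards the item that shares the exact primary key of the item being written:
import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('lqtpid')  # hypothetical table name

def put_if_new(item):
    """Write item only if no item with the same primary key already exists."""
    try:
        table.put_item(
            Item=item,
            # attribute_not_exists on a key attribute is the usual
            # "insert only if this exact key is absent" idiom.
            ConditionExpression='attribute_not_exists(firstName)',
        )
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False  # an item with this exact key already exists
        raise
If firstName and lastName are not the table's key attributes, a condition expression alone cannot prevent duplicates by those fields; they would need to be part of the key, or you would have to check with a query first.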

Pymongo get inserted id's even with duplicate key error

I am working on a Flask app and using MongoDB with it. In one endpoint I take CSV files and insert their content into MongoDB with insert_many(). Before inserting, I create a unique index to prevent duplication in MongoDB. When there is no duplication I can reach inserted_ids for that process, but when a duplication error is raised I get None and can't get inserted_ids. I am using ordered=False as well. Is there any way that allows me to get inserted_ids even with a duplicate key error?
def createBulk():  # in controller
    identity = get_jwt_identity()
    try:
        csv_file = request.files['csv']
        insertedResult = ProductService(identity).create_product_bulk(csv_file)
        print(insertedResult)  # this result is None when we get a DuplicateKeyError
        threading.Thread(target=ProductService(identity).sendInsertedItemsToEventCollector,
                         args=(insertedResult,)).start()
        return json_response(True, status=200)
    except Exception as e:
        print("insertedResultErr -> ", str(e))
        return json_response({'error': str(e)}, 400)

def create_product_bulk(self, products):  # in service
    data_frame = read_csv(products)
    data_json = data_frame.to_json(orient="records", force_ascii=False)
    try:
        return self.repo_client.create_bulk(loads(data_json))
    except bulkErr as e:
        print(str(e))
        pass
    except DuplicateKeyError as e:
        print(str(e))
        pass

def create_bulk(self, products):  # in repo
    self.checkCollectionName()
    self.db.get_collection(name=self.collection_name).create_index('barcode', unique=True)
    return self.db.get_collection(name=self.collection_name).insert_many(products, ordered=False)
Unfortunately, not in the way you have done it with the current pymongo drivers. As you have found, if you get errors in your insert_many() it will throw an exception, and the exception detail does not contain the inserted_ids.
It does contain details of the keys that fail (in e.details['writeErrors'][]['keyValue']), so you could try to work backwards from that using your original products list.
Your other workaround is to use insert_one() in a loop with a try ... except and check each insert. I know this is less efficient, but it's a workaround ...
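A minimal sketch of that second workaround, reusing the unique barcode index from the question (the function and variable names here are illustrative):
from pymongo.errors import DuplicateKeyError

def insert_products_one_by_one(collection, products):
    """Insert each product individually, collecting the ids that were actually inserted."""
    inserted_ids = []
    for product in products:
        try:
            result = collection.insert_one(product)
            inserted_ids.append(result.inserted_id)
        except DuplicateKeyError:
            # Duplicate barcode: skip this row and keep going.
            continue
    return inserted_ids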

Python - How to handle an exception while iterating through a list?

I am using the Selenium library and trying to iterate through a list of items and look them up on the web. The loop works when an item is found, but I am having a hard time handling the case when an item is not found on the web page. In that case I know the page will show "No Results For" within a span, which I can access with:
browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]')[0].text
Now the problem is that this span only appears when the item the loop is searching for is not found. So I tried this logic: if the span doesn't exist, the item was found, so execute the rest of the loop; if the span does exist and is equal to "No Results For", go and search for the next item. Here is my code:
data = pd.DataFrame()
for i in lookup_list:
    start_url = f"https://www.amazon.com/s?k=" + i + "&ref=nb_sb_noss_1"
    browser.visit(start_url)
    if browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]') is not None:
        #browser.find_by_xpath("//a[@class='a-size-medium a-color-base']"):
        item = browser.find_by_xpath("//a[@class='a-link-normal']")
        item.click()
        html = browser.html
        soup = bs(html, "html.parser")
        collection_dict = {
            'PART_NUMBER': getmodel(soup),
            'DIMENSIONS': getdim(soup),
            'IMAGE_LINK': getImage(soup)
        }
    elif browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]')[0].text != 'No results for':
        continue
    data = data.append(collection_dict, ignore_index=True)
The error I am getting is:
AttributeError: 'ElementList' object has no attribute 'click'
I do understand that I am getting the error because I can't access the click attribute on a list with multiple items; I can't click on all of them. But what I'm trying to do is avoid even trying to access it if the page shows that the item is not found; I want the script to simply go on to the next item and search.
How do I modify this?
Thank you in advance.
Using a try-except with a pass is what you want in this situation, like @JammyDodger said. Although using this typically isn't a good sign, because you don't want to simply ignore errors most of the time. pass will simply ignore the error and continue the rest of the loop.
try:
    item.click()
except AttributeError:
    pass
In order to skip to the next iteration of the loop, you may want to use the continue keyword.
try:
    item.click()
except AttributeError:
    continue
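For the original goal of skipping missing items up front, one alternative (a sketch only, assuming the splinter-style API the question uses, where find_by_xpath returns a list-like ElementList) is to test for the "No results for" span before clicking anything:
for i in lookup_list:
    browser.visit(f"https://www.amazon.com/s?k={i}&ref=nb_sb_noss_1")

    # An empty ElementList is falsy, so this also covers the span being absent.
    no_results = browser.find_by_xpath('(.//span[@class = "a-size-medium a-color-base"])[1]')
    if no_results and 'No results for' in no_results[0].text:
        continue  # nothing found for this search term, move on

    links = browser.find_by_xpath("//a[@class='a-link-normal']")
    if not links:
        continue
    links[0].click()  # click a single element, not the whole ElementList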

Created my first webcrawler, how do I get a "URL stack-trace"/history of URLs for each endpoint?

I created a web crawler that, given a base_url, will spider out and find all possible endpoints. While I am able to get all the endpoints, I need a way to figure out how I got there in the first place -- a 'URL stack-trace', per se, or breadcrumbs of URLs leading to each endpoint.
I first start by finding all URLs given a base URL. Since the sublinks I'm looking for are within a JSON response, I thought the best way to do this would be to use a variation of a recursive dictionary example I found here: http://www.saltycrane.com/blog/2011/10/some-more-python-recursion-examples/:
import requests
import pytest
import time

BASE_URL = "https://www.my-website.com/"

def get_leaf_nodes_list(base_url):
    """
    :base_url: The starting point to crawl
    :return: List of all possible endpoints
    """
    class Namespace(object):
        # A wrapper class is used to create a Namespace instance to hold the ns.results variable
        pass

    ns = Namespace()
    ns.results = []

    r = requests.get(base_url)
    time.sleep(0.5)  # so we don't cause a DDOS?
    data = r.json()

    def dict_crawler(data):
        # Retrieve all nodes from the nested dict
        if isinstance(data, dict):
            for item in data.values():
                dict_crawler(item)
        elif isinstance(data, list) or isinstance(data, tuple):
            for item in data:
                dict_crawler(item)
        else:
            if type(data) is unicode:
                if "http" in data:  # If http in value, keep going
                    # Only follow URLs we have not seen before
                    if str(data) not in ns.results:
                        ns.results.append(data)
                        sub_r = requests.get(data)
                        time.sleep(0.5)  # so we don't cause a DDOS?
                        sub_r_data = sub_r.json()
                        dict_crawler(sub_r_data)

    dict_crawler(data)
    return ns.results
To reiterate, get_leaf_nodes_list does a GET request and looks for any URLs among the values of the JSON (checking whether the string "http" is in each value), then recursively does more GET requests until there are no URLs left.
So, to reiterate, here are the questions I have:
How do I get a linear history of all the URLs I hit to get to each endpoint?
Corollary to that, how would I store this history? As the leaf-node list grows, my process gets exponentially slower, and I am wondering if there's a better data type out there to store this information or a more efficient process than the code above.
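One way to answer the first question (a sketch only, reusing the names from the snippet above and assuming the same JSON-of-URLs structure) is to thread the current trail of URLs through the recursion and record it the first time each endpoint is seen:
import time
import requests

def get_endpoint_trails(base_url):
    """Map each discovered endpoint to the chain of URLs that led to it."""
    trails = {}   # endpoint -> list of URLs visited to reach it
    seen = set()  # endpoints already crawled, to avoid re-fetching

    def crawl(data, trail):
        if isinstance(data, dict):
            for value in data.values():
                crawl(value, trail)
        elif isinstance(data, (list, tuple)):
            for value in data:
                crawl(value, trail)
        elif isinstance(data, str) and "http" in data and data not in seen:
            seen.add(data)
            trails[data] = trail + [data]  # breadcrumbs for this endpoint
            response = requests.get(data)
            time.sleep(0.5)  # stay polite to the server
            crawl(response.json(), trail + [data])

    seen.add(base_url)
    trails[base_url] = [base_url]
    response = requests.get(base_url)
    time.sleep(0.5)
    crawl(response.json(), [base_url])
    return trails
Using a set for the seen check also partly addresses the second question: membership tests on a set are constant time, whereas str(data) not in ns.results scans the whole list on every value and slows down as the results grow.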

How do I reduce the number of try/catch statements here?

I'm currently working with Scrapy to pull company information from a website. However, the amount of data provided across the pages varies widely; say, one company lists three of its team members while another lists only two, or one company lists where it's located while another doesn't. Therefore, some XPaths may return null, so attempting to access them results in errors:
try:
    item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract()[0]
except IndexError:
    item['industry'] = "None provided"

try:
    item['URL'] = hxs.xpath('//*[@id="ContentPlaceHolder_lnkWebsite"]/text()').extract()[0]
except IndexError:
    item['URL'] = "None provided"

try:
    item['desc'] = hxs.xpath('//*[@id="overview"]/div[2]/div[4]/p/text()[1]').extract()[0]
except IndexError:
    item['desc'] = "None provided"

try:
    item['founded'] = hxs.xpath('//*[@id="ContentPlaceHolder_updSummary"]/div/div[2]/table/tbody/tr/td[1]/text()').extract()[0]
except IndexError:
    item['founded'] = "None provided"
My code uses many try/catch statements. Since each exception is specific to the field I am trying to populate, is there a cleaner way of working around this?
Use TakeFirst() output processor:
Returns the first non-null/non-empty value from the values received,
so it’s typically used as an output processor to single-valued fields.
from scrapy.contrib.loader.processor import TakeFirst

class MyItem(Item):
    industry = Field(output_processor=TakeFirst())
    ...
Then, inside the spider, you would not need try/catch at all:
item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract()
In more recent Scrapy versions, extract_first() is used for this. It returns None if the search doesn't match anything, so you will get no errors.
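For example (assuming a recent Scrapy version; extract_first() also takes a default, so the placeholder string from the question can be kept):
item['industry'] = hxs.xpath(
    '//*[@id="overview"]/div[2]/div[2]/p/text()[2]'
).extract_first(default="None provided")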
