Pymongo get inserted id's even with duplicate key error - python

I am working on a Flask app that uses MongoDB. One endpoint accepts CSV files and inserts their contents into MongoDB with insert_many(). Before inserting, I create a unique index to prevent duplicates. When there are no duplicates I can read inserted_ids for the operation, but when a duplicate key error is raised I get None and cannot read inserted_ids. I am also using ordered=False. Is there any way to get inserted_ids even when a duplicate key error occurs?
def createBulk():  # in controller
    identity = get_jwt_identity()
    try:
        csv_file = request.files['csv']
        insertedResult = ProductService(identity).create_product_bulk(csv_file)
        print(insertedResult)  # this result is None when a duplicate key error occurs
        threading.Thread(target=ProductService(identity).sendInsertedItemsToEventCollector, args=(insertedResult,)).start()
        return json_response(True, status=200)
    except Exception as e:
        print("insertedResultErr -> ", str(e))
        return json_response({'error': str(e)}, 400)
def create_product_bulk(self, products):  # in service
    data_frame = read_csv(products)
    data_json = data_frame.to_json(orient="records", force_ascii=False)
    try:
        return self.repo_client.create_bulk(loads(data_json))
    except bulkErr as e:
        print(str(e))
        pass
    except DuplicateKeyError as e:
        print(str(e))
        pass
def create_bulk(self, products):  # in repo
    self.checkCollectionName()
    self.db.get_collection(name=self.collection_name).create_index('barcode', unique=True)
    return self.db.get_collection(name=self.collection_name).insert_many(products, ordered=False)

Unfortunately, not in the way you have done it with the current pymongo drivers. As you have found, if insert_many() hits errors it raises an exception, and the exception details do not include the inserted_ids.
They do include the keys that failed (in e.details['writeErrors'][]['keyValue']), so you could work backwards from those against your original products list.
Your other workaround is to call insert_one() in a loop with try ... except and check each insert. This is less efficient, but it works.
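One way to work backwards is sketched below, under some assumptions: PyMongo adds an _id to each document in the list passed to insert_many() before sending it, and each entry in e.details['writeErrors'] carries the index of the failing document in that list. With ordered=False the remaining documents are still attempted, so the inserted ids would be the _id values of every document whose index is not listed in writeErrors. The helper name inserted_ids_from_bulk_error is made up for illustration, and it takes the details dict rather than the exception so it can be tried without a database.

```python
def inserted_ids_from_bulk_error(products, details):
    """Return the _id of every document that is not listed in details['writeErrors'].

    products: the same list that was passed to insert_many(..., ordered=False).
    details:  a dict shaped like BulkWriteError.details (assumed to hold
              'writeErrors' entries that each have an 'index' key).
    """
    failed_indexes = {err["index"] for err in details.get("writeErrors", [])}
    return [
        doc["_id"]
        for index, doc in enumerate(products)
        if index not in failed_indexes and "_id" in doc
    ]


# Stand-in data for illustration: the second document hit the unique index.
products = [
    {"_id": "id-0", "barcode": "111"},
    {"_id": "id-1", "barcode": "111"},
    {"_id": "id-2", "barcode": "222"},
]
details = {
    "writeErrors": [{"index": 1, "code": 11000, "keyValue": {"barcode": "111"}}]
}
print(inserted_ids_from_bulk_error(products, details))
```

In create_bulk you would catch pymongo.errors.BulkWriteError as e and return inserted_ids_from_bulk_error(products, e.details), instead of letting the service swallow the exception and return None. This treats every entry in writeErrors as "not inserted", whatever the error code was.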

Related

Trying to put_item() to DynamoDB only if the complete instance with all attribute values doesn't exist already

def put_items_db(self, data_dict):
    """
    Put provided dictionary to lqtpid database
    """
    try:
        response = self.table.put_item(
            Item=data_dict,
            ConditionExpression='attribute_not_exists(firstName)'
                                ' AND attribute_not_exists(lastName)')
        http_code_response = response['ResponseMetadata']['HTTPStatusCode']
        logging.debug(f'http code response for db put {http_code_response}')
    except ClientError as e:
        # Ignore the ConditionalCheckFailedException
        if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise
When running the code, it still uploads entries that already exist...
What are the keys for that table? I assume you are putting an item whose keys differ from those of the item you are comparing against. A ConditionExpression only compares the item you are writing against one item in the table: the one with exactly the same keys.
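If the goal is to reject a write only when every attribute value matches an existing item, one possible approach (an assumption on my part, not something from the question) is to make the key itself a digest of the whole item, so that identical items always collide on the same key. The sketch below only builds such a key; the attribute name itemHash and both helper names are hypothetical.

```python
import hashlib
import json


def item_digest(item):
    """Return a stable SHA-256 hex digest computed from all attributes of item."""
    # sort_keys makes the digest independent of dict insertion order;
    # default=str is a fallback for values json cannot serialize directly.
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_digest_key(item, key_name="itemHash"):
    """Return a copy of item with the digest stored under key_name (hypothetical key)."""
    keyed = dict(item)
    keyed[key_name] = item_digest(item)
    return keyed


first = with_digest_key({"firstName": "Ada", "lastName": "Lovelace"})
second = with_digest_key({"lastName": "Lovelace", "firstName": "Ada"})
print(first["itemHash"] == second["itemHash"])
```

Assuming the table's partition key were itemHash, the write would then look like self.table.put_item(Item=with_digest_key(data_dict), ConditionExpression='attribute_not_exists(itemHash)'), and the condition would be checked against the one item that has the same digest.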

Trying retrieve an unknown number of items in a list

I need to retrieve a list of objects from a cloud API. The list could be very short or very long. If more than 100 items are in the returned list, a paging header is sent in the response as a reference point to send with the following request.
I've been trying to write a loop that would cover this, but the code is not reliable or very efficient:
paging = ''
objects = cloud.list_objects()
try:
    paging = objects.headers['next-page']
except KeyError:
    pass
while len(paging) > 0:
    objects = cloud.list_objects(page=paging)
    try:
        paging = objects.headers['next-page']
    except KeyError:
        paging = ''
    else:
        pass
paging = ''
while True:
    objects = cloud.list_objects(page=paging)
    paging = objects.headers.get('next-page')
    if not paging:
        break
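The same loop can also be wrapped in a generator so that callers simply iterate over the responses. This is only a sketch: iter_pages is a made-up name, and it assumes, as in the question, that each response exposes a headers mapping and that list_objects accepts a page keyword argument. It calls list_objects() without arguments for the first request.

```python
def iter_pages(list_objects):
    """Yield every response from list_objects until no 'next-page' header is sent."""
    paging = None
    while True:
        # First request has no page marker; later ones pass the previous header value.
        response = list_objects(page=paging) if paging else list_objects()
        yield response
        paging = response.headers.get('next-page')
        if not paging:
            break
```

With the client from the question, usage would be along the lines of: for objects in iter_pages(cloud.list_objects): ... and each response is handled as it arrives.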

How to get the record causing IntegrityError in Django

I have the following in my Django model, which I am using with PostgreSQL:
class Business(models.Model):
    location = models.CharField(max_length=200, default="")
    name = models.CharField(max_length=200, default="", unique=True)
In my view I have:
for b in bs:
    try:
        p = Business(**b)
        p.save()
    except IntegrityError:
        pass
When the app is run and an IntegrityError is triggered I would like to grab the already inserted record and also the object (I assume 'p') that triggered the error and update the location field.
In pseudocode:
for b in bs:
    try:
        p = Business(**b)
        p.save()
    except IntegrityError:
        EXISTING_RECORD.location = EXISTING_RECORD.location + p.location
        EXISTING_RECORD.save()
How is this done in Django?
This is the way I got the existing record that you are asking for.
In this case, I had MyModel with
unique_together = (("owner", "hsh"),)
I used regex to get the owner and hsh of the existing record that was causing the issue.
import re
from django.db import IntegrityError
try:
    ...  # do something that might raise IntegrityError
except IntegrityError as e:
    # example error message (str(e)): 'duplicate key value violates unique constraint "thingi_userfile_owner_id_7031f4ac5e4595e3_uniq"\nDETAIL: Key (owner_id, hsh)=(66819, 4252d2eba0e567e471cb08a8da4611e2) already exists.\n'
    match = re.search(r'Key \(owner_id, hsh\)=\((?P<owner_id>\d+), (?P<hsh>\w+)\) already', str(e))
    existing_record = MyModel.objects.get(owner_id=match.group('owner_id'), hsh=match.group('hsh'))
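The regex above is tied to one pair of columns. A more general version might look like the sketch below, which pulls the column names and values out of the DETAIL line. It assumes the message has the same shape as the example in the comment above; parse_duplicate_key is a made-up helper name, and values that themselves contain commas or parentheses would not be handled.

```python
import re

# Matches the 'Key (col_a, col_b)=(val_a, val_b) already exists' part of the message.
_DETAIL_RE = re.compile(
    r'Key \((?P<columns>[^)]*)\)=\((?P<values>[^)]*)\) already exists'
)


def parse_duplicate_key(message):
    """Return {column: value} parsed from a duplicate-key message, or None if no match."""
    match = _DETAIL_RE.search(message)
    if match is None:
        return None
    columns = [c.strip() for c in match.group('columns').split(',')]
    values = [v.strip() for v in match.group('values').split(',')]
    if len(columns) != len(values):
        return None
    return dict(zip(columns, values))


example = (
    'duplicate key value violates unique constraint '
    '"thingi_userfile_owner_id_7031f4ac5e4595e3_uniq"\n'
    'DETAIL: Key (owner_id, hsh)=(66819, 4252d2eba0e567e471cb08a8da4611e2) '
    'already exists.\n'
)
print(parse_duplicate_key(example))
```

The resulting dict could then be passed on as MyModel.objects.get(**parse_duplicate_key(str(e))), provided the keys in the message match the model's field names. All values come back as strings.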
I tried get_or_create, but that doesn't quite work the way you want: if you call get_or_create with both the name and the location, you still get an integrity error; and if you do what Joran suggested, then unless you override update, it will overwrite location rather than append to it.
This should work the way you want:
for b in bs:
    bobj, new_flag = Business.objects.get_or_create(name=b['name'])
    if new_flag:
        bobj.location = b['location']
    else:
        bobj.location += b['location']  # or possibly something like += ',' + b['location'] if you wanted to separate them
    bobj.save()
It would be nice (and may be possible, but I haven't tried), in the case where you have multiple unique constraints, to be able to inspect the IntegrityError to determine which field(s) caused the violation, similar to the accepted answer in "IntegrityError: distinguish between unique constraint and not null violations" (which has the downside of appearing to be Postgres-only). Note that if you wanted to keep your original structure, you could do collidedObject = Business.objects.get(name=b['name']) in your except block, but that only works when you know for sure that it was a name collision.
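for b in bs:
    # get_or_create returns an (object, created) tuple
    p, created = Business.objects.get_or_create(name=b['name'])
    # model instances have no update(); set each field, then save
    for field, value in b.items():
        setattr(p, field, value)
    p.save()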
I think anyway

How do I reduce the number of try/catch statements here?

I'm currently working with Scrapy to pull company information from a website. However, the amount of data provided varies widely across pages; say, one company lists three of its team members while another lists only two, or one company lists where it's located while another doesn't. Therefore, some XPaths may return nothing, so attempting to access the first result raises an error:
try:
    item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract()[0]
except IndexError:
    item['industry'] = "None provided"
try:
    item['URL'] = hxs.xpath('//*[@id="ContentPlaceHolder_lnkWebsite"]/text()').extract()[0]
except IndexError:
    item['URL'] = "None provided"
try:
    item['desc'] = hxs.xpath('//*[@id="overview"]/div[2]/div[4]/p/text()[1]').extract()[0]
except IndexError:
    item['desc'] = "None provided"
try:
    item['founded'] = hxs.xpath('//*[@id="ContentPlaceHolder_updSummary"]/div/div[2]/table/tbody/tr/td[1]/text()').extract()[0]
except IndexError:
    item['founded'] = "None provided"
My code uses many try/except statements. Since each exception is specific to the field I am trying to populate, is there a cleaner way of working around this?
Use the TakeFirst() output processor:
Returns the first non-null/non-empty value from the values received,
so it’s typically used as an output processor to single-valued fields.
from scrapy.contrib.loader.processor import TakeFirst
class MyItem(Item):
    industry = Field(output_processor=TakeFirst())
    ...
Then, inside the spider, you would not need try/except at all:
item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract()
In newer versions, extract_first() is used for this. It returns None if the search doesn't return anything, so you will get no errors.
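Independent of Scrapy, the repeated try/except blocks from the question can also be folded into a small helper plus a field-to-XPath mapping. This is a plain-Python sketch; first_or_default and fill_item are made-up names, and extract stands for any callable that takes an XPath string and returns a list of results.

```python
def first_or_default(values, default="None provided"):
    """Return the first element of values, or default when values is empty."""
    return values[0] if values else default


def fill_item(item, extract, xpaths, default="None provided"):
    """Set item[field] for every field/XPath pair, falling back to default."""
    for field, xpath in xpaths.items():
        item[field] = first_or_default(extract(xpath), default)
    return item


# Field names and XPaths taken from the question.
XPATHS = {
    'industry': '//*[@id="overview"]/div[2]/div[2]/p/text()[2]',
    'URL': '//*[@id="ContentPlaceHolder_lnkWebsite"]/text()',
    'desc': '//*[@id="overview"]/div[2]/div[4]/p/text()[1]',
}
```

In the spider this would be called as fill_item(item, lambda xp: hxs.xpath(xp).extract(), XPATHS), which keeps the "None provided" fallback from the question in one place.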

IntegrityError: distinguish between unique constraint and not null violations

I have this code:
try:
    principal = cls.objects.create(
        user_id=user.id,
        email=user.email,
        path='something'
    )
except IntegrityError:
    principal = cls.objects.get(
        user_id=user.id,
        email=user.email
    )
It tries to create a user with the given id and email, and if there already exists one - tries to get the existing record.
I know this is a bad construction and it will be refactored anyway. But my question is this:
How do I determine which kind of IntegrityError has happened: the one related to a unique constraint violation (there is a unique key on (user_id, email)) or the one related to a not null constraint (path cannot be null)?
psycopg2 provides the SQLSTATE with the exception as the pgcode member, which gives you quite fine-grained error information to match on.
python3
>>> import psycopg2
>>> conn = psycopg2.connect("dbname=regress")
>>> curs = conn.cursor()
>>> try:
...     curs.execute("INVALID;")
... except Exception as ex:
...     xx = ex
...
>>> xx.pgcode
'42601'
See Appendix A: Error Codes in the PostgreSQL manual for code meanings. Note that you can match coarsely on the first two chars for broad categories. In this case I can see that SQLSTATE 42601 is syntax_error in the Syntax Error or Access Rule Violation category.
The codes you want are:
23505 unique_violation
23502 not_null_violation
so you could write:
try:
    principal = cls.objects.create(
        user_id=user.id,
        email=user.email,
        path='something'
    )
except IntegrityError as ex:
    if ex.pgcode == '23505':
        principal = cls.objects.get(
            user_id=user.id,
            email=user.email
        )
    else:
        raise
That said, this is a bad way to do an upsert or merge. @pr0gg3d is presumably right in suggesting the right way to do it with Django; I don't do Django so I can't comment on that bit. For general info on upsert/merge see depesz's article on the topic.
Update as of 9-6-2017:
A pretty elegant way to do this is to try/except IntegrityError as exc, and then use some useful attributes on exc.__cause__ and exc.__cause__.diag (a diagnostic class that gives you some other super relevant information on the error at hand - you can explore it yourself with dir(exc.__cause__.diag)).
The first one you can use was described above. To make your code more future proof you can reference the psycopg2 codes directly, and you can even check the constraint that was violated using the diagnostic class I mentioned above:
except IntegrityError as exc:
    from psycopg2 import errorcodes as pg_errorcodes
    assert exc.__cause__.pgcode == pg_errorcodes.UNIQUE_VIOLATION
    assert exc.__cause__.diag.constraint_name == 'tablename_colA_colB_unique_constraint'
edit for clarification: I have to use the __cause__ accessor because I'm using Django, so to get to the psycopg2 IntegrityError class I have to call exc.__cause__
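Since the code may sit on the exception itself (plain psycopg2) or on its __cause__ (when Django wraps it, as described above), a small helper can check both places. The sketch below uses only the two SQLSTATE values quoted earlier; sqlstate_of and classify_integrity_error are made-up names, and nothing here imports Django or psycopg2.

```python
UNIQUE_VIOLATION = "23505"    # unique_violation
NOT_NULL_VIOLATION = "23502"  # not_null_violation


def sqlstate_of(exc):
    """Return the pgcode found on exc or on exc.__cause__, or None if neither has one."""
    for candidate in (exc, getattr(exc, "__cause__", None)):
        code = getattr(candidate, "pgcode", None)
        if code:
            return code
    return None


def classify_integrity_error(exc):
    """Map an exception to 'unique_violation', 'not_null_violation' or 'other'."""
    code = sqlstate_of(exc)
    if code == UNIQUE_VIOLATION:
        return "unique_violation"
    if code == NOT_NULL_VIOLATION:
        return "not_null_violation"
    return "other"
```

In the except block from the question, this would read: if classify_integrity_error(ex) == "unique_violation": fetch the existing principal, else: raise.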
It could be better to use:
try:
    obj, created = cls.objects.get_or_create(user_id=user.id, email=user.email)
except IntegrityError:
    ....
as in https://docs.djangoproject.com/en/dev/ref/models/querysets/#get-or-create
The IntegrityError should then be raised only when there is a NOT NULL constraint violation.
Furthermore, you can use the created flag to know whether the object already existed.
