Bulk inserting unique codes into the database - Python

I need to create a lot of unique codes and insert them into the database.
Of course I can write something like this:
codes = set()
while len(codes) < codes_size:
    c = generate_code()
    if len(Codes.objects.filter(code=c)) == 0:
        codes.add(Codes(code=c))
Codes.objects.bulk_create(codes)
But when the database already contains a lot of codes, this works very slowly.
Inserting each code right after it is generated is very slow too.
My best idea so far is to not verify the codes until bulk_create, and if bulk_create raises an exception, to regenerate all the codes and try again. Exceptions are very rare now, but as the database grows they will become more frequent.
bulk_create doesn't say which code raised the exception.

My understanding is that bulk_create() performs its operation within a transaction which is not committed if an error occurs. This means that either all inserts succeed, or none do.
For example, if a code is generated that is a duplicate of one that is already in the table, or a duplicate of another within the batch, an IntegrityError exception would be raised and none of the codes would have been inserted into the database.
In terms of exceptions, you'll likely get a subclass of django.db.utils.DatabaseError, e.g. django.db.utils.IntegrityError. Because you don't know which database error will be raised you should catch DatabaseError.
I don't think that there is any way to know from the exception which of the codes caused the problem.
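So the simplest variant of the catch-and-regenerate idea from the question is to retry the whole batch. A minimal sketch (assuming generate_code() and the Codes model from the question, and a hypothetical max_attempts cap), not a drop-in implementation:

from django.db import IntegrityError, transaction

def create_codes(codes_size, max_attempts=10):
    # keep regenerating the whole batch until bulk_create succeeds
    for attempt in range(max_attempts):
        codes = set()
        while len(codes) < codes_size:
            codes.add(generate_code())
        try:
            with transaction.atomic():
                Codes.objects.bulk_create(Codes(code=c) for c in codes)
            return codes
        except IntegrityError:
            # at least one code collided with an existing row (or another code
            # in the batch); we cannot tell which one, so regenerate and retry
            continue
    raise RuntimeError('could not generate %d unique codes' % codes_size)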
One way to handle this is to generate all of the codes up front and then, within a transaction, test whether they already exist in the table in one go using filter() and insert them if there are no duplicates:
from django.db import transaction

codes = set()
while len(codes) < codes_size:
    codes.add(generate_code())

with transaction.atomic():
    # check for any duplicate codes in table...
    dups = Codes.objects.filter(code__in=codes)
    if len(dups):
        print 'Duplicate code(s) generated: {}'.format(
            ', '.join(dup.code for dup in dups))
        # remove dups from codes, and generate additional codes to make up the shortfall.
        # Note that this might also result in duplicates....
    else:
        Codes.objects.bulk_create(Codes(code=code) for code in codes)
You still need to handle database exceptions that are not due to duplicate values. For simplicity I've left that out.
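If useful, a rough sketch of that outer handling around the block above (assuming a module-level logger, which is my addition, not part of the question):

import logging

from django.db import DatabaseError, transaction

logger = logging.getLogger(__name__)

try:
    with transaction.atomic():
        dups = Codes.objects.filter(code__in=codes)
        if not dups.exists():
            Codes.objects.bulk_create(Codes(code=code) for code in codes)
except DatabaseError as exc:
    # anything that is not a duplicate problem ends up here
    # (connection loss, constraint violations on other columns, ...)
    logger.error('bulk insert of codes failed: %s', exc)
    raise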

Related

Exception handling when using SQLAlchemy session.execute for a raw query with multiple statements

Trying to update a non-existent column in a single statement throws an exception:
session.execute("UPDATE users SET wrong_column = test#test.com;")
If I do the same with multiple statements, the exception is not raised:
session.execute("UPDATE users SET email = test#test.com; UPDATE users SET wrong_column = test#test.com;")
One Stack Overflow solution proposed is to split the query on the ; delimiter.
But this does not work in my particular case, as my query also consists of a few functions that have a delimiter in the RETURN ... ; statement and then again in the END; statement, so the split would break the query.
(I know I could also try to split the query with sqlparse but installing an extra library for that also seems weird to me)
Is there really no better way to check if all statements in a raw query were executed without error?
I also tried to find something in the CursorResult object that is returned by the session.execute func, but I could not find anything.
(I'm working with a MySQL DB in case it matters)

How to undo last transactions in Django DB

Well, I have data in a CSV file which is to be inserted into the DB. I cannot guarantee that the data provided will match my needs, so if there is any exception, I want the previous transactions on the DB to be undone.
What currently happens is that the data is saved until an exception is found; I raise the exception and print it along with the line number. What I want is to be able to insert the whole file again without any duplicate rows. How can I do it?
Here is my code:
for row in rows:
    line += 1
    try:
        obj = XYZ(col1=row[0], col2=row[1])
        obj.save()
    except Exception:
        raise Exception("Exception found at line {}".format(line))
How should I undo all the previous transactions which were done before the exception arose?
Sounds like what you're looking for is atomic(), to quote the docs (emphasis mine):
Atomicity is the defining property of database transactions. atomic allows us to create a block of code within which the atomicity on the database is guaranteed. If the block of code is successfully completed, the changes are committed to the database. If there is an exception, the changes are rolled back.
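Applied to the loop in the question, a minimal sketch (reusing rows and the XYZ model from above) might look like this:

from django.db import transaction

try:
    with transaction.atomic():
        for line, row in enumerate(rows, start=1):
            obj = XYZ(col1=row[0], col2=row[1])
            obj.save()
except Exception as exc:
    # atomic() has already rolled back every save() above,
    # so the table is left exactly as it was before the import
    print('Import failed at line {}: {}'.format(line, exc))
    raise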

What does Django return if no objects are found? And why do I get a DoesNotExist

I have a MySQL table in which I have Django register that a certain user is 'connected' to a unit. For this I have to check if the unit is already connected to some other user. (model: AppUnitSession)
So in my function I get three objects (models) as input (user, usersession, vehicle).
The problem I have is that my query for the AppUnitSession fails with:
Exception Type: DoesNotExist
Exception Value: AppUnitSession matching query does not exist.
This error occurs on the first line of this code:
sessions = AppUnitSession.objects.get(gps_unit_id=vehicle.gps_unit_id)
sessions = sessions.exclude(validation_date__isnull=True)
sessions = sessions.exclude(user_session_id=usersession.user_session_id)
From the call stack I can see that the value for vehicle.gps_unit_id is set:
{'gps_unit_id': 775L}
There are NO records in the AppUnitSession table that match this! All records in this table have gps_unit_id = NULL (i.e., the unit is available and the user can continue; after this there will be a record with the gps_unit_id set). If there are sessions found, I need to do some more work.
For a start, I don't want the error. But I also want something I can iterate over, or whose length I can check (> 1), to do some more checks.
I'm a bit stuck on this one, so help and suggestions are welcome.
One way is to use filter instead of get: filter returns a queryset of matching objects (possibly empty), while get returns a single object and raises an exception if none is found.
Use sessions = AppUnitSession.objects.filter(gps_unit_id=vehicle.gps_unit_id)
instead of sessions = AppUnitSession.objects.get(gps_unit_id=vehicle.gps_unit_id).
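Chained with the original exclusions, that might look roughly like this (sessions here is a QuerySet, so it can be iterated or checked for emptiness):

sessions = (AppUnitSession.objects
            .filter(gps_unit_id=vehicle.gps_unit_id)
            .exclude(validation_date__isnull=True)
            .exclude(user_session_id=usersession.user_session_id))

if sessions.exists():
    for session in sessions:
        # the unit is already connected somewhere: do the extra checks here
        ...
else:
    # the unit is free: continue and create the new AppUnitSession record
    ...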
Another approach, if you only want to get rid of the error, is to perform exception handling:
try:
    sessions = AppUnitSession.objects.get(gps_unit_id=vehicle.gps_unit_id)
    # note: exclude() is a QuerySet method, so these two lines only make sense
    # with the filter() variant above; get() returns a single object
    sessions = sessions.exclude(validation_date__isnull=True)
    sessions = sessions.exclude(user_session_id=usersession.user_session_id)
except AppUnitSession.DoesNotExist:
    # some code for when the object does not exist
    pass
But it is better to use get rather than filter if you are trying to retrieve only one object; read about the performance of get vs filter for a single object here.

Adding REST resources to database. Struggling to loop through enumerated list and commit changes.

I'm trying to extract data from an API that gives me data back in JSON format. I'm using SQLAlchemy and simplejson in a Python script to achieve this. The database is PostgreSQL.
I have a class called Harvest; it specifies the details for the harvest table.
Here is the code I suspect is incorrect.
def process(self):
    req = urllib2.Request('https://api.fulcrumapp.com/api/v2/records/',
                          headers={"X-ApiToken": "****************************"})
    resp = urllib2.urlopen(req)
    data = simplejson.load(resp)
    for i, m in enumerate(data['harvest']):
        harvest = Harvest(m)
        self.session.add(harvest)
        self.session.commit()
Is there something wrong with this loop? Nothing is going through to the database.
I suspect that if there is anything wrong with the loop, it is that the loop is getting skipped. One thing you can do to verify this is:
ALTER USER application_user SET log_statement = 'all';
Then the statements will show up in your logs. When you are done:
ALTER USER application_user RESET log_statement;
That being said, one thing I see in your code that may cause trouble later is the fact that you are committing per row. This causes extra disk I/O. You probably want to commit once after the loop.
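A hedged rework of the method above along those lines (committing once, and rolling back if anything in the batch fails) might be:

def process(self):
    req = urllib2.Request('https://api.fulcrumapp.com/api/v2/records/',
                          headers={"X-ApiToken": "****************************"})
    resp = urllib2.urlopen(req)
    data = simplejson.load(resp)
    try:
        for m in data['harvest']:
            self.session.add(Harvest(m))
        self.session.commit()    # one commit for the whole batch
    except Exception:
        self.session.rollback()  # nothing is persisted if any row fails
        raise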

removing error query during commit

I have a PostgreSQL DB which I am updating with around 100,000 records. I use session.merge() to insert/update each record, and I do a commit after every 1000 records.
i = 0
for record in records:
    i += 1
    session.merge(record)
    if i % 1000 == 0:
        session.commit()
This code works fine. In my database I have a table with a UNIQUE field, and there are some duplicated records that I insert into it. An error is thrown when this happens, saying the field is not unique. Since I am inserting 1000 records at a time, a rollback will not help me skip those records. Is there any way I can skip the session.merge() for the duplicate records (other than parsing through all the records to find the duplicates, of course)?
I think you already know this, but let's start out with a piece of dogma: you specified that the field needs to be unique, so you have to let the database check for uniqueness or deal with the errors from not letting that happen.
Checking for uniqueness:
if value not in database:
    session.add(value)
    session.commit()
Not checking for uniqueness and catching the exception:
from sqlalchemy.exc import IntegrityError

try:
    session.add(value)
    session.commit()
except IntegrityError:
    session.rollback()
The first one has a race condition. I tend to use the second pattern.
Now, bringing this back to your actual issue, if you want to assure uniqueness on a column in the database then obviously you're going to have to either let the db assure itself of the loaded value's actual uniqueness, or let the database give you an error and you handle it.
That's obviously a lot slower than adding 100k objects to the session and just committing them all, but that's how databases work.
You might want to consider massaging the data which you are loading OUTSIDE the database and BEFORE attempting to load it, to ensure uniqueness. That way, when you load it you can drop the need to check for uniqueness. Pretty easy to do with command-line tools if, for example, you're loading from CSV or text files.
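As a sketch of that kind of pre-load massaging in Python (assuming the UNIQUE column is exposed on each record as an attribute named code, which is a guess at your schema):

seen = set()
unique_records = []
for record in records:
    if record.code not in seen:        # 'code' stands in for your UNIQUE column
        seen.add(record.code)
        unique_records.append(record)

# the merge/commit loop from the question, now fed only de-duplicated records;
# note this only removes duplicates within the load itself, not rows that are
# already present in the table
for i, record in enumerate(unique_records, start=1):
    session.merge(record)
    if i % 1000 == 0:
        session.commit()
session.commit()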
You can get a "partial rollback" using SAVEPOINT, which SQLAlchemy exposes via begin_nested(). You could do it just like this:
for i, record in enumerate(records):
    try:
        with session.begin_nested():
            session.merge(record)
    except:
        print "Skipped record %s" % record
    if not i % 1000:
        session.commit()
Notes on the above:
In Python, we never do the "i = i + 1" thing; use enumerate().
with session.begin_nested(): is the same as saying begin_nested(), then commit() if no exception, or rollback() if so.
You might want to consider writing a function along the lines of this example from the PostgreSQL documentation.
This is the option which works best for me because the number of records with duplicate unique keys is minimal.
def update_exception(records, i, failed_records):
    failed_records.append(records[i]['pk'])
    session.rollback()
    # re-run the current batch of 1000, skipping records that already failed
    start_range = int(round(i / 1000, 0) * 1000)
    for index in range(start_range, i + 1):
        if records[index]['pk'] not in failed_records:
            ins_obj = Model()
            try:
                session.merge(ins_obj)
            except:
                failed_records.append(records[index]['pk'])
Say I hit an error at record 2375: I store the primary key 'pk' for record 2375 in failed_records and then recommit from 2000 to 2375. It seems much faster than doing commits one by one.
