Several processes read the table at the same time, and each process takes on one task. Is it possible to avoid using LOCK TABLE in this case?
db.session.execute('LOCK TABLE "Task"')
query = db.session.query(models.Task).order_by(models.Task.ordr).limit(1)
for row in query:
    task = row
    db.session.delete(row)
db.session.commit()
By locking the table you are using a pessimistic approach to concurrency.
Alternatively, instead of locking the table, you can be optimistic about things going the right way. I would wrap the code that retrieves a task to work on in a retry loop, with error handling for the case where the commit fails because some other process already removed the very task this process tried to get.
Something like this, perhaps:
def get_next_task():
    session = ...
    task = None
    while not task:
        try:
            query = session.query(models.Task).order_by(models.Task.ordr).limit(1)
            for row in query:
                task = row
                session.delete(row)
            session.commit()
            if not task:
                return  # no more tasks found
        except TODO_FIND_PROPER_EXCEPTION_TO_HANDLE as _exc:
            session.rollback()  # or log the exception
            task = None         # another process beat us to this task; retry
            # maybe need to make_transient
    return task
Whether this solution is better will depend on the use case, though.
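For completeness, a worker loop built on top of get_next_task() could look like this (a minimal sketch; process_task is a hypothetical stand-in for whatever work each process actually does):

def work():
    while True:
        task = get_next_task()
        if task is None:
            break            # queue drained, stop this worker
        process_task(task)   # hypothetical handler for one task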
I am making a small game and using MySQL as the database. I am having a slight issue with saving because of a multithreaded pool. When I submit inserts/deletes, say for adding or deleting items, there's no guarantee they are completed in the order submitted. This ends up creating duplicates in some rare scenarios.
So for example, if I add and delete an item (insert, delete), it's normally fine. Doing that 3 times in a row would submit insert, delete, insert, delete, insert, delete; occasionally, though, it may execute as delete, insert, insert, delete, delete, insert.
What is the proper way to ensure the ordering of these individual queries? Do I try to combine the queries in code? Forget about pooling the connection and ensure it's ordered? Any other solutions?
I am currently using Twisted and MySQLdb:
pool = adbapi.ConnectionPool('MySQLdb', host='127.0.0.1', port=3306, user='..', passwd='..', db='testing')
d = pool.runOperation(query, args)
I figured out a way to do this: chain the queries using Twisted's built-in Deferred callbacks, so that the next query only runs once the previous one completes.
def _continueQuery(self, result, player, query):
    return dbPool.runOperation(*query)

def _savePlayer(self, player):
    if player.queries:
        initialQuery = player.queries.pop(0)
        d = dbPool.runOperation(*initialQuery)
        while player.queries:
            query = player.queries.pop(0)
            d.addCallback(self._continueQuery, player, query)
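One hedged caveat: if any query in the chain fails, the Failure propagates down the Deferred and the remaining addCallback steps are skipped. Adding an errback that logs and swallows the failure keeps the later queries flowing (a sketch; the logging below is just a placeholder):

def _logQueryError(self, failure, player):
    print("query failed for %r: %s" % (player, failure))
    # returning None converts the failure back into a success,
    # so the next _continueQuery callback in the chain still runs

def _savePlayer(self, player):
    if player.queries:
        initialQuery = player.queries.pop(0)
        d = dbPool.runOperation(*initialQuery)
        while player.queries:
            query = player.queries.pop(0)
            d.addErrback(self._logQueryError, player)
            d.addCallback(self._continueQuery, player, query)
        d.addErrback(self._logQueryError, player)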
I need to increase the speed of parsing a heap of XML files. I decided to try Python threads, but I do not know how to work with the DB from them correctly.
My DB stores only links to the files. I decided to add an isProcessing column to prevent multiple threads from acquiring the same rows.
So the resulting table looks like:
|xml_path|isProcessing|
Every thread sets this flag before starting processing, and other threads select rows for processing where this flag is not set.
But I am not sure this is the correct way, because I am not sure the acquisition is atomic, and two threads may process the same row twice.
def select_single_file_for_processing():
    # ...
    sql = """UPDATE processing_files SET "isProcessing" = 'TRUE' WHERE "xml_name"='{0}'""".format(xml_name)
    cursor.execute(sql)
    conn.commit()

def worker():
    result = select_single_file_for_processing()
    # ...
    # processing()

def main():
    # ....
    while unprocessed_xml_count != 0:  # now unprocessed_xml_count is global! I know that it's wrong, but how to fix it?
        checker_thread = threading.Thread(target=select_total_unpocessed_xml_count)
        checker_thread.start()  # if we have files for processing
        for i in range(10):  # run processed
            t = Process(target=worker)
            t.start()
The second question: what is the best practice for working with the DB from the multiprocessing module?
As written, your isProcessing flag could have problems with multiple threads. You should include a predicate for isProcessing = FALSE and check how many rows are updated. One thread will report 1 row and any other threads will report 0 rows.
As to best practices? This is a reasonable solution. The key is to be specific. A simple update will set the values as specified. The operation you are trying to perform though is to change the value from a to b, hence including the predicate for a in the statement.
UPDATE processing_files
SET isProcessing = 'TRUE'
WHERE xmlName = '...'
AND isProcessing = 'FALSE';
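A hedged sketch of what that check looks like in the question's own code (assuming a DB-API driver such as psycopg2 with %s placeholders; adjust the parameter style to your driver):

sql = """UPDATE processing_files
         SET "isProcessing" = 'TRUE'
         WHERE "xml_name" = %s
           AND "isProcessing" = 'FALSE'"""
cursor.execute(sql, (xml_name,))
conn.commit()
if cursor.rowcount == 1:
    process_file(xml_name)  # hypothetical: this thread claimed the row
else:
    pass                    # 0 rows updated: another thread claimed it first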
I have a Django app saving objects to the database and a celery task that periodically does some processing on some of those objects. The problem is that the user can delete an object after it has been selected by the celery task for processing, but before the celery task has actually finished processing and saving it. So when the celery task does call .save(), the object re-appears in the database even though the user deleted it. Which is really spooky for users, of course.
So here's some code showing the problem:
def my_delete_view(request, pk):
    thing = Thing.objects.get(pk=pk)
    thing.delete()
    return HttpResponseRedirect('yay')
@app.task
def my_periodic_task():
    things = get_things_for_processing()
    # if the delete happens anywhere between here and the .save(), we're hosed
    for thing in things:
        process_thing(thing)  # could take a LONG time
        thing.save()
I thought about trying to fix it by adding an atomic block and a transaction to test if the object actually exists before saving it:
@app.task
def my_periodic_task():
    things = Thing.objects.filter(...some criteria...)
    for thing in things:
        process_thing(thing)  # could take a LONG time
        try:
            with transaction.atomic():
                # just see if it still exists:
                unused = Thing.objects.select_for_update().get(pk=thing.pk)
                # no exception means it exists. go ahead and save the
                # processed version that has all of our updates.
                thing.save()
        except Thing.DoesNotExist:
            logger.warning("Processed thing vanished")
Is this the correct pattern to do this sort of thing? I mean, I'll find out if it works within a few days of running it in production, but it would be nice to know if there are any other well-accepted patterns for accomplishing this sort of thing.
What I really want is to be able to update an object if it still exists in the database. I'm ok with the race between user edits and edits from the process_thing, and I can always throw in a refresh_from_db just before the process_thing to minimize the time during which user edits would be lost. But I definitely can't have objects re-appearing after the user has deleted them.
If you open a transaction for the duration of the celery task's processing, you should avoid such problems:
@app.task
@transaction.atomic
def my_periodic_task():
    things = get_things_for_processing()
    # if the delete happens anywhere between here and the .save(), we're hosed
    for thing in things:
        process_thing(thing)  # could take a LONG time
        thing.save()
Sometimes you would like to report to the frontend that you are working on the data. In that case you can add select_for_update() to your queryset (most probably in get_things_for_processing); then, in the code responsible for deletion, you need to handle the error the DB reports when that specific record is locked.
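A rough sketch of that idea, assuming a database backend that supports SELECT ... FOR UPDATE NOWAIT (the view and function names follow the question; the 409 response is only an illustration):

from django.db import DatabaseError, transaction
from django.http import HttpResponse, HttpResponseRedirect

def get_things_for_processing():
    # rows stay locked until the surrounding transaction (the task) ends
    return Thing.objects.select_for_update().filter()  # add the real criteria here

def my_delete_view(request, pk):
    try:
        with transaction.atomic():
            # nowait=True makes the DB raise instead of waiting for the task's lock
            thing = Thing.objects.select_for_update(nowait=True).get(pk=pk)
            thing.delete()
    except DatabaseError:
        return HttpResponse("Thing is currently being processed, try again later",
                            status=409)
    return HttpResponseRedirect('yay')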
For now, it seems like the pattern of "select again atomically, then save" is sufficient:
@app.task
def my_periodic_task():
    things = Thing.objects.filter(...some criteria...)
    for thing in things:
        process_thing(thing)  # could take a LONG time
        try:
            with transaction.atomic():
                # just see if it still exists:
                unused = Thing.objects.select_for_update().get(pk=thing.pk)
                # no exception means it exists. go ahead and save the
                # processed version that has all of our updates.
                thing.save()
        except Thing.DoesNotExist:
            logger.warning("Processed thing vanished")
(this is the same code as in my original question).
I have a table in a database, mapped with the SQLAlchemy ORM (I have a scoped_session variable).
I want multiple instances of my program (not just threads, but also instances running on several servers) to be able to work on the same table and NOT work on the same data.
So I have coded a manual "row-lock" mechanism to make sure each row is handled only once. In this method I use a full lock on the table while I "row-lock" it:
def instance():
    s = scoped_session(sessionmaker(bind=engine))
    engine.execute("LOCK TABLES my_data WRITE")
    rows = s.query(Row_model).filter(Row_model.condition == 1).filter(Row_model.is_locked == 0).limit(10)
    for row in rows:
        row.is_locked = 1
        row.lock_time = datetime.now()
    s.commit()
    engine.execute("UNLOCK TABLES")
    for row in rows:
        manipulate_data(row)
        row.is_locked = 0
    s.commit()

for i in range(10):
    t = threading.Thread(target=instance)
    t.start()
The problem is that while running some instances, several threads crash and each produces this error:
sqlalchemy.exc.DatabaseError: (raised as a result of Query-invoked
autoflush; consider using a session.no_autoflush block if this flush
is occurring prematurely) (DatabaseError) 1205 (HY000): Lock wait
timeout exceeded; try restarting transaction 'UPDATE my_daya SET
row_var = 1}
Where is the catch? What makes my DB table fail to UNLOCK successfully?
Thanks.
Locks are evil. Avoid them. Things go very bad when errors occur. Especially when you mix sessions with raw SQL statements, like you do.
The beauty of the scoped session is that it wraps a database transaction. This transaction makes the modifications to the database atomic, and also takes care of cleaning up when things go wrong.
Use scoped sessions as follows:
with scoped_session(sessionmaker(bind=engine)) as s:
    <ORM actions using s>
It may be some work to rewrite your code so that it becomes properly transactional, but it will be worth it! SQLAlchemy has tricks to help you with that.
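As a hedged illustration of what "properly transactional" could look like here (this assumes SQLAlchemy 1.4+ and MySQL 8.0+ for SKIP LOCKED; model and function names follow the question, and the row locks replace the is_locked flag for the duration of the transaction):

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)

def instance():
    # one transaction per batch: commits on success, rolls back on any error
    with Session.begin() as s:
        rows = (s.query(Row_model)
                 .filter(Row_model.condition == 1, Row_model.is_locked == 0)
                 .with_for_update(skip_locked=True)  # row-level locks instead of LOCK TABLES
                 .limit(10)
                 .all())
        for row in rows:
            manipulate_data(row)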
I'm using SQLAlchemy with a Postgres backend to do a bulk insert-or-update. To try to improve performance, I'm attempting to commit only once every thousand rows or so:
trans = engine.begin()
for i, rec in enumerate(records):
    if i % 1000 == 0:
        trans.commit()
        trans = engine.begin()
    try:
        inserter.execute(...)
    except sa.exceptions.SQLError:
        my_table.update(...).execute()
trans.commit()
However, this isn't working. It seems that when the INSERT fails, it leaves things in a weird state that prevents the UPDATE from happening. Is it automatically rolling back the transaction? If so, can this be stopped? I don't want my entire transaction rolled back in the event of a problem, which is why I'm trying to catch the exception in the first place.
The error message I'm getting, BTW, is "sqlalchemy.exc.InternalError: (InternalError) current transaction is aborted, commands ignored until end of transaction block", and it happens on the update().execute() call.
You're hitting some weird Postgresql-specific behavior: if an error happens in a transaction, it forces the whole transaction to be rolled back. I consider this a Postgres design bug; it takes quite a bit of SQL contortionism to work around in some cases.
One workaround is to do the UPDATE first. Detect if it actually modified a row by looking at cursor.rowcount; if it didn't modify any rows, it didn't exist, so do the INSERT. (This will be faster if you update more frequently than you insert, of course.)
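A hedged sketch of that update-first variant, staying close to the question's objects (inserter and my_table come from the question; the "id" key column and the record dicts are placeholders):

with engine.begin() as conn:
    for rec in records:
        # try the UPDATE first and check whether it touched anything
        result = conn.execute(
            my_table.update()
                    .where(my_table.c.id == rec["id"])  # placeholder key column
                    .values(rec))
        if result.rowcount == 0:
            # nothing matched, so the row doesn't exist yet: INSERT it
            conn.execute(my_table.insert().values(rec))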
Another workaround is to use savepoints:
SAVEPOINT a;
INSERT INTO ....;
-- on error:
ROLLBACK TO SAVEPOINT a;
UPDATE ...;
-- on success:
RELEASE SAVEPOINT a;
This has a serious problem for production-quality code: you have to detect the error accurately. Presumably you're expecting to hit a unique constraint check, but you may hit something unexpected, and it may be next to impossible to reliably distinguish the expected error from the unexpected one. If this hits the error condition incorrectly, it'll lead to obscure problems where nothing will be updated or inserted and no error will be seen. Be very careful with this. You can narrow down the error case by looking at Postgresql's error code to make sure it's the error type you're expecting, but the potential problem is still there.
Finally, if you really want to do batch-insert-or-update, you actually want to do many of them in a few commands, not one item per command. This requires trickier SQL: SELECT nested inside an INSERT, filtering out the right items to insert and update.
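As a very rough, hedged sketch of that batched shape (it assumes the incoming rows were first loaded into a staging table named staging_records with matching columns; all table and column names are placeholders):

from sqlalchemy import text

with engine.begin() as conn:
    # update the rows that already exist
    conn.execute(text("""
        UPDATE my_table AS t
        SET    value = s.value
        FROM   staging_records AS s
        WHERE  t.id = s.id
    """))
    # insert only the rows that are not there yet
    conn.execute(text("""
        INSERT INTO my_table (id, value)
        SELECT s.id, s.value
        FROM   staging_records AS s
        WHERE  NOT EXISTS (SELECT 1 FROM my_table AS t WHERE t.id = s.id)
    """))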
This error is from PostgreSQL. PostgreSQL doesn't allow you to execute further commands in the same transaction if one command creates an error. To fix this you can use nested transactions (implemented using SQL savepoints) via conn.begin_nested(). Here's something that might work. I made the code use explicit connections, factored out the chunking part, and made the code use a context manager to manage transactions correctly.
from itertools import chain, islice

def chunked(seq, chunksize):
    """Yields items from an iterator in chunks."""
    it = iter(seq)
    while True:
        try:
            first = next(it)
        except StopIteration:
            return
        yield chain([first], islice(it, chunksize - 1))

conn = engine.connect()
for chunk in chunked(records, 1000):
    with conn.begin():
        for rec in chunk:
            try:
                with conn.begin_nested():
                    conn.execute(inserter, ...)
            except sa.exceptions.SQLError:
                conn.execute(my_table.update(...))
This still won't have stellar performance, though, due to nested transaction overhead. If you want better performance, try to detect which rows will create errors beforehand with a SELECT query and use executemany support (execute can take a list of dicts if all inserts use the same columns). If you need to handle concurrent updates, you'll still need error handling, either by retrying or by falling back to one-by-one inserts.
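A hedged sketch of that pre-select plus executemany approach (it assumes each record dict has a unique "id" key matching my_table.c.id, and that concurrent writers are handled separately):

ids = [rec["id"] for rec in records]
with engine.begin() as conn:
    # find which ids are already present
    existing = {row[0] for row in
                conn.execute(sa.select([my_table.c.id])
                               .where(my_table.c.id.in_(ids)))}
    to_insert = [rec for rec in records if rec["id"] not in existing]
    to_update = [rec for rec in records if rec["id"] in existing]
    if to_insert:
        conn.execute(my_table.insert(), to_insert)  # executemany-style bulk insert
    for rec in to_update:
        conn.execute(my_table.update()
                             .where(my_table.c.id == rec["id"])
                             .values(rec))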