Ensuring SQL query order in multithread pool - python

I am making a small game and using MySQL as the database. I am having a slight issue with saving because of a multithreaded pool. When I submit inserts/deletes, say for adding or deleting items, there's no guarantee they are completed in the order submitted. This ends up creating duplicates in some rare scenarios.
For example, if I add and then delete an item once (insert, delete), it's normally fine. If I do that 3 times in a row, the queries are submitted as insert, delete, insert, delete, insert, delete, but occasionally they execute as delete, insert, insert, delete, delete, insert.
What are the proper ways I would ensure the chain of individual queries in this situation? Do I try to combine the queries in code? Forget about pooling a connection and ensure it's ordered? Any other solutions?
I am currently using Twisted and MySQLdb:
pool = adbapi.ConnectionPool('MySQLdb', host='127.0.0.1', port=3306, user='..', passwd='..', db='testing')
d = pool.runOperation(query, args)

I figured out a way to do this: chain the queries with Twisted's Deferreds, so that each query runs only after the previous one completes.
def _continueQuery(self, result, player, query):
    # Returning a Deferred pauses the callback chain until that query finishes.
    return dbPool.runOperation(*query)

def _savePlayer(self, player):
    if player.queries:
        initialQuery = player.queries.pop(0)
        d = dbPool.runOperation(*initialQuery)
        # Queue every remaining query behind the previous one, in order.
        while player.queries:
            query = player.queries.pop(0)
            d.addCallback(self._continueQuery, player, query)
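If strict ordering matters more than parallelism, another option raised in the question is to give up the pool for writes and push every statement through one FIFO queue drained by a single worker. The following is only a minimal sketch of that idea, not Twisted code: it uses the standard library, with sqlite3 standing in for MySQL, and the `OrderedWriter` class and `items` table are hypothetical names chosen for illustration.

```python
import queue
import sqlite3
import threading


class OrderedWriter:
    """Runs submitted statements one at a time, in submission order."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, conn):
        self._conn = conn
        self._jobs = queue.Queue()  # FIFO, so submission order is preserved
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, sql, args=()):
        self._jobs.put((sql, args))

    def close(self):
        self._jobs.put(self._STOP)
        self._worker.join()

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is self._STOP:
                break
            sql, args = job
            self._conn.execute(sql, args)
            self._conn.commit()


# check_same_thread=False lets the worker thread use this connection.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE items (player_id INTEGER, item_id INTEGER)")
conn.commit()

writer = OrderedWriter(conn)
for _ in range(3):
    writer.submit("INSERT INTO items VALUES (?, ?)", (1, 42))
    writer.submit("DELETE FROM items WHERE player_id = ? AND item_id = ?", (1, 42))
writer.close()

remaining = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Because only one thread ever executes statements, each insert is followed by its delete and no duplicate rows are left behind. The trade-off is that writes no longer run in parallel.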

Related

How to unit test a function that executes SQL without affecting the db in Python?

I am struggling to unittest a function that doesn't return anything and executes delete operation. The function is as follows:
def removeReportParseData(self, report_id, conn=None):
    table_id = dbFind(
        "select table_id from table_table where report_id=%s", (report_id), conn
    )
    for t in table_id:
        self.removeTableParseData(int(t["table_id"]), conn)
        dbUpdate("delete from table_table where table_id=%s", t["table_id"], conn)
I want to make sure that the commands were executed but don't want to affect the actual db. My current code is:
def test_remove_report_parse_data(self):
    with patch("com.pdfgather.GlobalHelper.dbFind") as mocked_find:
        mocked_find.return_value = [123, 232, 431]
        mocked_find.assert_called_once()
Thank you in advance for any support.
As far as I know, you cannot run a SQL statement without actually executing it. If you are on MySQL, however, you can get fairly close to what you want by wrapping your queries in START TRANSACTION; and ROLLBACK;
i.e. you might replace your queries with:
START TRANSACTION;
YOUR QUERY HERE;
ROLLBACK;
This lets you confirm that the function's queries run without permanently changing the contents of the database, provided the tables use a transactional storage engine such as InnoDB (MyISAM tables ignore ROLLBACK).
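To illustrate the transaction-then-rollback idea, here is a minimal sketch. It assumes sqlite3 as a stand-in for MySQL so that it runs without a server; the table layout is guessed from the question and the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_table (table_id INTEGER, report_id INTEGER)")
conn.executemany("INSERT INTO table_table VALUES (?, ?)", [(1, 10), (2, 10)])
conn.commit()

# The DELETE implicitly opens a transaction; nothing is permanent until commit.
cur = conn.execute("DELETE FROM table_table WHERE report_id = ?", (10,))
deleted = cur.rowcount  # number of rows the statement removed inside the transaction

conn.rollback()  # undo the delete
rows_after = conn.execute("SELECT COUNT(*) FROM table_table").fetchone()[0]
```

The test can assert on `deleted` to check that the query matched the expected rows, while `rows_after` shows the table is unchanged.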
However, if it is sufficient to simply test that these functions would execute ANY query, you could alternatively opt to test them with empty queries, and simply assert that your dbFind and dbUpdate methods were called as many times as you would expect.
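A sketch of that call-counting approach with `unittest.mock` follows. The `GlobalHelper` namespace and `ReportStore` class are hypothetical stand-ins for the real module and class, which are not shown in the question. Note that the function under test has to be called before any assertion on the mocks, and that the fake rows are dicts because the function reads `t["table_id"]`.

```python
from types import SimpleNamespace
from unittest.mock import patch


def _no_database(*args, **kwargs):
    raise RuntimeError("tests must not touch the real database")


# Hypothetical stand-in for the module that holds dbFind/dbUpdate.
GlobalHelper = SimpleNamespace(dbFind=_no_database, dbUpdate=_no_database)


class ReportStore:
    def removeTableParseData(self, table_id, conn=None):
        pass  # assumed to be covered by its own test

    def removeReportParseData(self, report_id, conn=None):
        rows = GlobalHelper.dbFind(
            "select table_id from table_table where report_id=%s", report_id, conn
        )
        for t in rows:
            self.removeTableParseData(int(t["table_id"]), conn)
            GlobalHelper.dbUpdate(
                "delete from table_table where table_id=%s", t["table_id"], conn
            )


fake_rows = [{"table_id": 123}, {"table_id": 232}, {"table_id": 431}]

with patch.object(GlobalHelper, "dbFind", return_value=fake_rows) as mocked_find, \
        patch.object(GlobalHelper, "dbUpdate") as mocked_update:
    ReportStore().removeReportParseData(7)  # run the code under test first

    mocked_find.assert_called_once()
    update_calls = mocked_update.call_count
    mocked_update.assert_any_call(
        "delete from table_table where table_id=%s", 123, None
    )
```

In the real test suite the patch target would be wherever `dbFind` and `dbUpdate` are looked up by the code under test.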
Again, though, as I alluded to in my comment, I would strongly suggest NOT having your test suite interact with even your development database.
While there is certainly some configuration involved in setting up another database for your tests, you should be able to find some boilerplate code to do this quite easily as it is very common practice.

Python - SQLAlchemy - MySQL - multiple instances work on same data

I have a table in a database, mapped with the SQLAlchemy ORM (I have a "scoped_session" variable).
I want multiple instances of my program (not just threads, but also processes on several servers) to be able to work on the same table without working on the same rows.
So I have coded a manual "row-lock" mechanism to make sure each row is handled only once. In this method I take a full lock on the table while I mark the rows as locked:
def instance():
    s = scoped_session(sessionmaker(bind=engine))
    engine.execute("LOCK TABLES my_data WRITE")
    rows = s.query(Row_model).filter(Row_model.condition == 1).filter(Row_model.is_locked == 0).limit(10)
    for row in rows:
        row.is_locked = 1
        row.lock_time = datetime.now()
    s.commit()
    engine.execute("UNLOCK TABLES")
    for row in rows:
        manipulate_data(row)
        row.is_locked = 0
    s.commit()

for i in range(10):
    t = threading.Thread(target=instance)
    t.start()
The problem is that when several instances run, some threads crash, each with this error:
sqlalchemy.exc.DatabaseError: (raised as a result of Query-invoked
autoflush; consider using a session.no_autoflush block if this flush
is occurring prematurely) (DatabaseError) 1205 (HY000): Lock wait
timeout exceeded; try restarting transaction 'UPDATE my_daya SET
row_var = 1}
Where is the catch? What prevents my table from being unlocked successfully?
Thanks.
Locks are evil. Avoid them. Things go very bad when errors occur. Especially when you mix sessions with raw SQL statements, like you do.
The beauty of the scoped session is that it wraps a database transaction. This transaction makes the modifications to the database atomic, and also takes care of cleaning up when things go wrong.
Use scoped sessions as follows:
with scoped_session(sessionmaker(bind=engine)) as s:
    <ORM actions using s>
It may be some work to rewrite your code so that it becomes properly transactional, but it will be worth it! SQLAlchemy has tricks to help you with that.

How to get and delete record simultaneously in SqlAlchemy?

Several processes read the table at the same time, and each process takes one task. Is it possible to avoid LOCK TABLE in this case?
db.session.execute('LOCK TABLE "Task"')
query = db.session.query(models.Task).order_by(models.Task.ordr).limit(1)
for row in query:
    task = row
    db.session.delete(row)
db.session.commit()
By locking the table you take a pessimistic approach to concurrency.
Alternatively, instead of locking the table, you can be optimistic and assume things will usually go right. I would wrap the code that retrieves a task in a retry loop, with error handling for the case where the commit fails because another process has already removed the very task this process tried to get.
Something like this, perhaps:
def get_next_task():
    session = ...
    task = None
    while not(task):
        try:
            query = session.query(models.Task).order_by(models.Task.ordr).limit(1)
            for row in query:
                task = row
                session.delete(row)
            session.commit()
            if not(task):
                return  # no more tasks found
        except TODO_FIND_PROPER_EXCEPTION_TO_HANDLE as _exc:
            # another process took this task first: undo and try again
            session.rollback()
            task = None
    # maybe need to make_transient
    return task
Whether this solution is better will depend on the use case, though.
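As a concrete variant of the optimistic idea, the sketch below detects a lost race by checking the affected-row count of the DELETE rather than by catching an exception on commit. It is a minimal illustration that assumes sqlite3 as a stand-in database; the `task` table layout and the `claim_next_task` name are made up for the example.

```python
import sqlite3


def claim_next_task(conn):
    """Return (id, payload) of a task claimed by this worker, or None if none remain."""
    while True:
        row = conn.execute(
            "SELECT id, payload FROM task ORDER BY ordr LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # no tasks left
        cur = conn.execute("DELETE FROM task WHERE id = ?", (row[0],))
        conn.commit()
        if cur.rowcount == 1:
            return row  # this worker removed the row, so the task is ours
        # rowcount == 0: another worker deleted it first, so look again


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, ordr INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO task (ordr, payload) VALUES (?, ?)", [(2, "b"), (1, "a")]
)
conn.commit()

first = claim_next_task(conn)
second = claim_next_task(conn)
third = claim_next_task(conn)
```

Only the worker whose DELETE actually removed the row keeps the task, so no table lock is needed; workers that lose the race simply retry.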

Persistent memoization in Python

I have an expensive function that takes and returns a small amount of data (a few integers and floats). I have already memoized this function, but I would like to make the memo persistent. There are already a couple of threads relating to this, but I'm unsure about potential issues with some of the suggested approaches, and I have some fairly specific requirements:
- I will definitely use the function from multiple threads and processes simultaneously (both using multiprocessing and from separate python scripts)
- I will not need read or write access to the memo from outside this python function
- I am not that concerned about the memo being corrupted on rare occasions (like pulling the plug or accidentally writing to the file without locking it) as it isn't that expensive to rebuild (typically 10-20 minutes) but I would prefer if it would not be corrupted because of exceptions, or manually terminating a python process (I don't know how realistic that is)
- I would strongly prefer solutions that don't require large external libraries as I have a severely limited amount of hard disk space on one machine I will be running the code on
- I have a weak preference for cross-platform code, but I will likely only use this on Linux
This thread discusses the shelve module, which is apparently not process-safe. Two of the answers suggest using fcntl.flock to lock the shelve file. Some of the responses in this thread, however, seem to suggest that this is fraught with problems - but I'm not exactly sure what they are. It sounds as though this is limited to Unix (though apparently Windows has an equivalent called msvcrt.locking), and the lock is only 'advisory' - i.e., it won't stop me from accidentally writing to the file without checking it is locked. Are there any other potential problems? Would writing to a copy of the file, and replacing the master copy as a final step, reduce the risk of corruption?
It doesn't look as though the dbm module will do any better than shelve. I've had a quick look at sqlite3, but it seems a bit overkill for this purpose. This thread and this one mention several 3rd party libraries, including ZODB, but there are a lot of choices, and they all seem overly large and complicated for this task.
Does anyone have any advice?
UPDATE: kindall mentioned IncPy below, which does look very interesting. Unfortunately, I wouldn't want to move back to Python 2.6 (I'm actually using 3.2), and it looks like it is a bit awkward to use with C libraries (I make heavy use of numpy and scipy, among others).
kindall's other idea is instructive, but I think adapting this to multiple processes would be a little difficult - I suppose it would be easiest to replace the queue with file locking or a database.
Looking at ZODB again, it does look perfect for the task, but I really do want to avoid using any additional libraries. I'm still not entirely sure what all the issues with simply using flock are - I imagine one big problem is if a process is terminated while writing to the file, or before releasing the lock?
So, I've taken synthesizerpatel's advice and gone with sqlite3. If anyone's interested, I decided to make a drop-in replacement for dict that stores its entries as pickles in a database (I don't bother to keep any in memory as database access and pickling is fast enough compared to everything else I'm doing). I'm sure there are more efficient ways of doing this (and I've no idea whether I might still have concurrency issues), but here is the code:
from collections.abc import MutableMapping  # plain `collections` before Python 3.3
import sqlite3
import pickle

class PersistentDict(MutableMapping):
    def __init__(self, dbpath, iterable=None, **kwargs):
        self.dbpath = dbpath
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'create table if not exists memo '
                '(key blob primary key not null, value blob not null)'
            )
        if iterable is not None:
            self.update(iterable)
        self.update(kwargs)

    def encode(self, obj):
        return pickle.dumps(obj)

    def decode(self, blob):
        return pickle.loads(blob)

    def get_connection(self):
        return sqlite3.connect(self.dbpath)

    def __getitem__(self, key):
        key = self.encode(key)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select value from memo where key=?',
                (key,)
            )
            value = cursor.fetchone()
        if value is None:
            raise KeyError(key)
        return self.decode(value[0])

    def __setitem__(self, key, value):
        key = self.encode(key)
        value = self.encode(value)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'insert or replace into memo values (?, ?)',
                (key, value)
            )

    def __delitem__(self, key):
        key = self.encode(key)
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select count(*) from memo where key=?',
                (key,)
            )
            if cursor.fetchone()[0] == 0:
                raise KeyError(key)
            cursor.execute(
                'delete from memo where key=?',
                (key,)
            )

    def __iter__(self):
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select key from memo'
            )
            records = cursor.fetchall()
        for r in records:
            yield self.decode(r[0])

    def __len__(self):
        with self.get_connection() as connection:
            cursor = connection.cursor()
            cursor.execute(
                'select count(*) from memo'
            )
            return cursor.fetchone()[0]
sqlite3 out of the box provides ACID. File locking is prone to race-conditions and concurrency problems that you won't have using sqlite3.
Basically, yeah, sqlite3 is more than what you need, but it's not a huge burden. It can run on mobile phones, so it's not like you're committing to running some beastly software. It's going to save you time reinventing wheels and debugging locking issues.
I assume you want to continue to memoize the results of the function in RAM, probably in a dictionary, but use the persistence to reduce the "warmup" time of the application. In this case you're not going to be randomly accessing items directly in the backing store so a database might indeed be overkill (though as synthesizerpatel notes, maybe not as much as you think).
Still, if you want to roll your own, a viable strategy might be to simply load the dictionary from a file at the beginning of your run before starting any threads. When a result isn't in the dictionary, then you need to write it to the file after adding it to the dictionary. You can do this by adding it to a queue and using a single worker thread that flushes items from the queue to disk (just appending them to a single file would be fine). You might occasionally add the same result more than once, but this is not fatal since it'll be the same result each time, so reading it back in twice or more will do no real harm. Python's threading model will keep you out of most kinds of concurrency trouble (e.g., appending to a list is atomic).
Here is some (untested, generic, incomplete) code showing what I'm talking about:
import pickle  # on Python 2, use: import cPickle as pickle
import time, os.path

cache = {}
queue = []

# run at script start to warm up cache
def preload_cache(filename):
    if os.path.isfile(filename):
        with open(filename, "rb") as f:
            while True:
                try:
                    key, value = pickle.load(f), pickle.load(f)
                except EOFError:
                    break
                cache[key] = value

# your memoized function
def time_consuming_function(a, b, c, d):
    key = (a, b, c, d)
    if key in cache:
        return cache[key]
    else:
        # generate the result here
        # ...
        # add to cache, checking to see if it's already there again to avoid writing
        # it twice (in case another thread also added it) (this is not fatal, though)
        if key not in cache:
            cache[key] = result
            queue.append((key, result))
        return result

# run on worker thread to write new items out
def write_cache(filename):
    with open(filename, "ab") as f:
        while True:
            while queue:
                key, value = queue.pop()  # item order not important
                # but must write key and value in single call to ensure
                # both get written (otherwise, interrupting script might
                # leave only one written, corrupting the file)
                f.write(pickle.dumps(key, pickle.HIGHEST_PROTOCOL) +
                        pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
            f.flush()
            time.sleep(1)
If I had time, I'd turn this into a decorator... and put the persistence into a dict subclass... the use of global variables is also sub-optimal. :-) If you use this approach with multiprocessing you'd probably want to use a multiprocessing.Queue rather than a list; you can then use queue.get() as a blocking wait for a new result in the worker process that writes to the file. I've not used multiprocessing, though, so take this bit of advice with a grain of salt.
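For what the decorator version might look like, here is a rough sketch under simplifying assumptions: it writes each new result synchronously under a lock instead of using a queue and a worker thread, it only guards against threads within one process (not against several processes appending at once), and `persistent_memo` is a made-up name.

```python
import functools
import os
import pickle
import tempfile
import threading


def persistent_memo(filename):
    """Memoize a function in RAM and append each new result to `filename`."""

    def decorator(func):
        cache = {}
        lock = threading.Lock()

        # Warm up from disk, stopping at the first unreadable record.
        if os.path.isfile(filename):
            with open(filename, "rb") as f:
                while True:
                    try:
                        key, value = pickle.load(f)
                    except (EOFError, pickle.UnpicklingError):
                        break
                    cache[key] = value

        @functools.wraps(func)
        def wrapper(*args):
            if args in cache:
                return cache[args]
            result = func(*args)
            record = pickle.dumps((args, result), pickle.HIGHEST_PROTOCOL)
            with lock:  # one writer at a time within this process
                if args not in cache:
                    cache[args] = result
                    with open(filename, "ab") as f:
                        f.write(record)  # key and value go out in a single write
            return result

        return wrapper

    return decorator


# Usage: the second decoration reloads what the first one wrote.
path = os.path.join(tempfile.mkdtemp(), "memo.pkl")
calls = []

@persistent_memo(path)
def slow_add(a, b):
    calls.append((a, b))  # records real invocations only
    return a + b

first = slow_add(1, 2)
second = slow_add(1, 2)  # served from the in-memory cache

@persistent_memo(path)  # simulates a fresh start of the script
def slow_add_reloaded(a, b):
    calls.append((a, b))
    return a + b

third = slow_add_reloaded(1, 2)  # served from the file
```

The underlying function runs only once across all three calls; the other two results come from the in-memory cache and from the file respectively.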

How do I efficiently do a bulk insert-or-update with SQLAlchemy?

I'm using SQLAlchemy with a Postgres backend to do a bulk insert-or-update. To try to improve performance, I'm attempting to commit only once every thousand rows or so:
trans = engine.begin()
for i, rec in enumerate(records):
    if i % 1000 == 0:
        trans.commit()
        trans = engine.begin()
    try:
        inserter.execute(...)
    except sa.exceptions.SQLError:
        my_table.update(...).execute()
trans.commit()
However, this isn't working. It seems that when the INSERT fails, it leaves things in a weird state that prevents the UPDATE from happening. Is it automatically rolling back the transaction? If so, can this be stopped? I don't want my entire transaction rolled back in the event of a problem, which is why I'm trying to catch the exception in the first place.
The error message I'm getting, BTW, is "sqlalchemy.exc.InternalError: (InternalError) current transaction is aborted, commands ignored until end of transaction block", and it happens on the update().execute() call.
You're hitting some PostgreSQL-specific behavior: if any statement in a transaction raises an error, the whole transaction is marked as aborted and every later command is rejected until you roll back. I consider this a Postgres design bug; it takes quite a bit of SQL contortionism to work around in some cases.
One workaround is to do the UPDATE first. Detect if it actually modified a row by looking at cursor.rowcount; if it didn't modify any rows, it didn't exist, so do the INSERT. (This will be faster if you update more frequently than you insert, of course.)
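The update-first workaround can be sketched as follows. This is a minimal illustration that assumes sqlite3 in place of Postgres so it runs standalone; the `my_table` columns and the `upsert` helper are hypothetical.

```python
import sqlite3


def upsert(conn, row_id, val):
    # Try the UPDATE first; rowcount reports how many rows it changed.
    cur = conn.execute("UPDATE my_table SET val = ? WHERE id = ?", (val, row_id))
    if cur.rowcount == 0:
        # Nothing was updated, so the row does not exist yet.
        conn.execute("INSERT INTO my_table (id, val) VALUES (?, ?)", (row_id, val))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, val TEXT)")
for row_id, val in [(1, "x"), (2, "y"), (1, "z")]:
    upsert(conn, row_id, val)
conn.commit()

rows = conn.execute("SELECT id, val FROM my_table ORDER BY id").fetchall()
```

No statement fails in the normal case, so the transaction never enters the aborted state. With concurrent writers, another session could still insert the same key between the UPDATE and the INSERT, so some error handling remains necessary there.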
Another workaround is to use savepoints:
SAVEPOINT a;
INSERT INTO ....;
-- on error:
ROLLBACK TO SAVEPOINT a;
UPDATE ...;
-- on success:
RELEASE SAVEPOINT a;
This has a serious problem for production-quality code: you have to detect the error accurately. Presumably you're expecting to hit a unique constraint check, but you may hit something unexpected, and it may be next to impossible to reliably distinguish the expected error from the unexpected one. If this hits the error condition incorrectly, it'll lead to obscure problems where nothing will be updated or inserted and no error will be seen. Be very careful with this. You can narrow down the error case by looking at Postgresql's error code to make sure it's the error type you're expecting, but the potential problem is still there.
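Here is one way the savepoint pattern with a narrowed error check might look. The sketch assumes sqlite3 as a stand-in, where a constraint violation surfaces as `sqlite3.IntegrityError`; with PostgreSQL the analogous check would be against SQLSTATE 23505 (unique_violation). The table and the `insert_or_update` helper are hypothetical.

```python
import sqlite3

# isolation_level=None disables implicit transactions so BEGIN/SAVEPOINT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, val TEXT)")


def insert_or_update(conn, row_id, val):
    conn.execute("SAVEPOINT a")
    try:
        conn.execute("INSERT INTO my_table (id, val) VALUES (?, ?)", (row_id, val))
    except sqlite3.IntegrityError:
        # Only constraint violations are handled; any other error propagates.
        conn.execute("ROLLBACK TO SAVEPOINT a")
        conn.execute("UPDATE my_table SET val = ? WHERE id = ?", (val, row_id))
    conn.execute("RELEASE SAVEPOINT a")


conn.execute("BEGIN")
for row_id, val in [(1, "x"), (2, "y"), (1, "z")]:
    insert_or_update(conn, row_id, val)
conn.execute("COMMIT")

rows = conn.execute("SELECT id, val FROM my_table ORDER BY id").fetchall()
```

Catching the narrowest exception available keeps unexpected failures visible instead of silently turning them into updates.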
Finally, if you really want to do batch-insert-or-update, you actually want to do many of them in a few commands, not one item per command. This requires trickier SQL: SELECT nested inside an INSERT, filtering out the right items to insert and update.
This error is from PostgreSQL. PostgreSQL doesn't allow you to execute further commands in the same transaction once one command has raised an error. To fix this you can use nested transactions (implemented using SQL savepoints) via conn.begin_nested(). Here's something that might work. I made the code use explicit connections, factored out the chunking part, and used context managers to manage the transactions correctly.
from itertools import chain, islice

def chunked(seq, chunksize):
    """Yields items from an iterator in chunks."""
    it = iter(seq)
    # Looping with `for` ends cleanly when the iterator is exhausted.
    for first in it:
        yield chain([first], islice(it, chunksize - 1))

conn = engine.connect()
for chunk in chunked(records, 1000):
    with conn.begin():
        for rec in chunk:
            try:
                # Savepoint: a failed INSERT only rolls back this nested block.
                with conn.begin_nested():
                    conn.execute(inserter, ...)
            except sa.exceptions.SQLError:
                conn.execute(my_table.update(...))
This still won't have stellar performance though due to nested transaction overhead. If you want better performance try to detect which rows will create errors beforehand with a select query and use executemany support (execute can take a list of dicts if all inserts use the same columns). If you need to handle concurrent updates, you'll still need to do error handling either via retrying or falling back to one by one inserts.
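The select-first, executemany approach could look roughly like this. It is a sketch that assumes sqlite3 instead of Postgres and SQLAlchemy, a hypothetical `my_table (id, val)` layout, and records with unique ids. Chunks should stay small enough for the driver's limit on bound parameters.

```python
import sqlite3


def bulk_upsert(conn, records):
    """records: list of (id, val) tuples with unique ids. Returns (inserted, updated)."""
    ids = [rid for rid, _ in records]
    placeholders = ",".join("?" * len(ids))
    # One SELECT finds which ids already exist.
    existing = {
        row[0]
        for row in conn.execute(
            f"SELECT id FROM my_table WHERE id IN ({placeholders})", ids
        )
    }
    to_update = [(val, rid) for rid, val in records if rid in existing]
    to_insert = [(rid, val) for rid, val in records if rid not in existing]
    # Two batched statements replace one statement per row.
    conn.executemany("UPDATE my_table SET val = ? WHERE id = ?", to_update)
    conn.executemany("INSERT INTO my_table (id, val) VALUES (?, ?)", to_insert)
    conn.commit()
    return len(to_insert), len(to_update)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO my_table (id, val) VALUES (1, 'old')")
conn.commit()

counts = bulk_upsert(conn, [(1, "new"), (2, "b"), (3, "c")])
rows = conn.execute("SELECT id, val FROM my_table ORDER BY id").fetchall()
```

As the answer notes, a concurrent writer can still insert one of the "missing" ids between the SELECT and the INSERT, so a retry or a row-by-row fallback is still needed in that case.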
