Persistent memoization in Python

Persistent memoization in Python - python

I have an expensive function that takes and returns a small amount of data (a few integers and floats). I have already memoized this function, but I would like to make the memo persistent. There are already a couple of threads relating to this, but I'm unsure about potential issues with some of the suggested approaches, and I have some fairly specific requirements:
I will definitely use the function from multiple threads and processes simultaneously (both using multiprocessing and from separate python scripts)
I will not need read or write access to the memo from outside this python function
I am not that concerned about the memo being corrupted on rare occasions (like pulling the plug or accidentally writing to the file without locking it) as it isn't that expensive to rebuild (typically 10-20 minutes) but I would prefer if it would not be corrupted because of exceptions, or manually terminating a python process (I don't know how realistic that is)
I would strongly prefer solutions that don't require large external libraries as I have a severely limited amount of hard disk space on one machine I will be running the code on
I have a weak preference for cross-platform code, but I will likely only use this on Linux
This thread discusses the shelve module, which is apparently not process-safe. Two of the answers suggest using fcntl.flock to lock the shelve file. Some of the responses in this thread, however, seem to suggest that this is fraught with problems - but I'm not exactly sure what they are. It sounds as though this is limited to Unix (though apparently Windows has an equivalent called msvcrt.locking), and the lock is only 'advisory' - i.e., it won't stop me from accidentally writing to the file without checking it is locked. Are there any other potential problems? Would writing to a copy of the file, and replacing the master copy as a final step, reduce the risk of corruption?
It doesn't look as though the dbm module will do any better than shelve. I've had a quick look at sqlite3, but it seems a bit overkill for this purpose. This thread and this one mention several 3rd party libraries, including ZODB, but there are a lot of choices, and they all seem overly large and complicated for this task.
Does anyone have any advice?
UPDATE: kindall mentioned IncPy below, which does look very interesting. Unfortunately, I wouldn't want to move back to Python 2.6 (I'm actually using 3.2), and it looks like it is a bit awkward to use with C libraries (I make heavy use of numpy and scipy, among others).
kindall's other idea is instructive, but I think adapting this to multiple processes would be a little difficult - I suppose it would be easiest to replace the queue with file locking or a database.
Looking at ZODB again, it does look perfect for the task, but I really do want to avoid using any additional libraries. I'm still not entirely sure what all the issues with simply using flock are - I imagine one big problem is if a process is terminated while writing to the file, or before releasing the lock?
So, I've taken synthesizerpatel's advice and gone with sqlite3. If anyone's interested, I decided to make a drop-in replacement for dict that stores its entries as pickles in a database (I don't bother to keep any in memory as database access and pickling is fast enough compared to everything else I'm doing). I'm sure there are more efficient ways of doing this (and I've no idea whether I might still have concurrency issues), but here is the code:
from collections import MutableMapping
import sqlite3
import pickle
class PersistentDict(MutableMapping):
def __init__(self, dbpath, iterable=None, **kwargs):
self.dbpath = dbpath
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'create table if not exists memo '
'(key blob primary key not null, value blob not null)'
)
if iterable is not None:
self.update(iterable)
self.update(kwargs)
def encode(self, obj):
return pickle.dumps(obj)
def decode(self, blob):
return pickle.loads(blob)
def get_connection(self):
return sqlite3.connect(self.dbpath)
def __getitem__(self, key):
key = self.encode(key)
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'select value from memo where key=?',
(key,)
)
value = cursor.fetchone()
if value is None:
raise KeyError(key)
return self.decode(value[0])
def __setitem__(self, key, value):
key = self.encode(key)
value = self.encode(value)
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'insert or replace into memo values (?, ?)',
(key, value)
)
def __delitem__(self, key):
key = self.encode(key)
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'select count(*) from memo where key=?',
(key,)
)
if cursor.fetchone()[0] == 0:
raise KeyError(key)
cursor.execute(
'delete from memo where key=?',
(key,)
)
def __iter__(self):
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'select key from memo'
)
records = cursor.fetchall()
for r in records:
yield self.decode(r[0])
def __len__(self):
with self.get_connection() as connection:
cursor = connection.cursor()
cursor.execute(
'select count(*) from memo'
)
return cursor.fetchone()[0]

sqlite3 out of the box provides ACID. File locking is prone to race-conditions and concurrency problems that you won't have using sqlite3.
Basically, yeah, sqlite3 is more than what you need, but it's not a huge burden. It can run on mobile phones, so it's not like you're committing to running some beastly software. It's going to save you time reinventing wheels and debugging locking issues.

I assume you want to continue to memoize the results of the function in RAM, probably in a dictionary, but use the persistence to reduce the "warmup" time of the application. In this case you're not going to be randomly accessing items directly in the backing store so a database might indeed be overkill (though as synthesizerpatel notes, maybe not as much as you think).
Still, if you want to roll your own, a viable strategy might be to simply load the dictionary from a file at the beginning of your run before starting any threads. When a result isn't in the dictionary, then you need to write it to the file after adding it to the dictionary. You can do this by adding it to a queue and using a single worker thread that flushes items from the queue to disk (just appending them to a single file would be fine). You might occasionally add the same result more than once, but this is not fatal since it'll be the same result each time, so reading it back in twice or more will do no real harm. Python's threading model will keep you out of most kinds of concurrency trouble (e.g., appending to a list is atomic).
Here is some (untested, generic, incomplete) code showing what I'm talking about:
import cPickle as pickle
import time, os.path
cache = {}
queue = []
# run at script start to warm up cache
def preload_cache(filename):
if os.path.isfile(filename):
with open(filename, "rb") as f:
while True:
try:
key, value = pickle.load(f), pickle.load(f)
except EOFError:
break
cache[key] = value
# your memoized function
def time_consuming_function(a, b, c, d):
key = (a, b, c, d)
if key in cache:
return cache[key]
else:
# generate the result here
# ...
# add to cache, checking to see if it's already there again to avoid writing
# it twice (in case another thread also added it) (this is not fatal, though)
if key not in cache:
cache[key] = result
queue.append((key, result))
return result
# run on worker thread to write new items out
def write_cache(filename):
with open(filename, "ab") as f:
while True:
while queue:
key, value = queue.pop() # item order not important
# but must write key and value in single call to ensure
# both get written (otherwise, interrupting script might
# leave only one written, corrupting the file)
f.write(pickle.dumps(key, pickle.HIGHEST_PROTOCOL) +
pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
f.flush()
time.sleep(1)
If I had time, I'd turn this into a decorator... and put the persistence into a dict subclass... the use of global variables is also sub-optimal. :-) If you use this approach with multiprocessing you'd probably want to use a multiprocessing.Queue rather than a list; you can then use queue.get() as a blocking wait for a new result in the worker process that writes to the file. I've not used multiprocessing, though, so take this bit of advice with a grain of salt.

Related

simple example of working with neo4j python driver?

Is there a simple example of working with the neo4j python driver?
How do I just pass cypher query to the driver to run and return a cursor?
If I'm reading for example this it seems the demo has a class wrapper, with a private member func I pass to the session.write,
session.write_transaction(self._create_and_return_greeting, ...
That then gets called it with a transaction as a first parameter...
def _create_and_return_greeting(tx, message):
that in turn runs the cypher
result = tx.run("CREATE (a:Greeting) "
This seems 10X more complicated than it needs to be.
I did just try a simpler:
def raw_query(query, **kwargs):
neodriver = neo_connect() # cached dbconn
with neodriver.session() as session:
try:
result = session.run(query, **kwargs)
return result.data()
But this results in a socket error on the query, probably because the session goes out of scope?
[dfcx/__init__] ERROR | Underlying socket connection gone (_ssl.c:2396)
[dfcx/__init__] ERROR | Failed to write data to connection IPv4Address(('neo4j-core-8afc8558-3.production-orch-0042.neo4j.io', 7687)) (IPv4Address(('34.82.120.138', 7687)))
Also I can't return a cursor/iterator, just the data()
When the session goes out of scope, the query result seems to die with it.
If I manually open and close a session, then I'd have the same problems?
Python must be the most popular language this DB is used with, does everyone use a different driver?
Py2neo seems cute, but completely lacking in ORM wrapper function for most of the cypher language features, so you have to drop down to raw cypher anyway. And I'm not sure it supports **kwargs argument interpolation in the same way.
I guess that big raise should help iron out some kinks :D
Slightly longer version trying to get a working DB wrapper:
def neo_connect() -> Union[neo4j.BoltDriver, neo4j.Neo4jDriver]:
global raw_driver
if raw_driver:
# print('reuse driver')
return raw_driver
neoconfig = NEOCONFIG
raw_driver = neo4j.GraphDatabase.driver(
neoconfig['url'], auth=(
neoconfig['user'], neoconfig['pass']))
if raw_driver is None:
raise BaseException("cannot connect to neo4j")
else:
return raw_driver
def raw_query(query, **kwargs):
# just get data, no cursor
neodriver = neo_connect()
session = neodriver.session()
# logging.info('neoquery %s', query)
# with neodriver.session() as session:
try:
result = session.run(query, **kwargs)
data = result.data()
return data
except neo4j.exceptions.CypherSyntaxError as err:
logging.error('neo error %s', err)
logging.error('failed query: %s', query)
raise err
# finally:
# logging.info('close session')
# session.close()
update: someone pointed me to this example which is another way to use the tx wrapper.
https://github.com/neo4j-graph-examples/northwind/blob/main/code/python/example.py#L16-L21

def raw_query(query, **kwargs):
neodriver = neo_connect() # cached dbconn
with neodriver.session() as session:
try:
result = session.run(query, **kwargs)
return result.data()
This is perfectly fine and works as intended on my end.
The error you're seeing is stating that there is a connection problem. So there must be something going on between the server and the driver that's outside of its influence.
Also, please note, that there is a difference between all of these ways to run a query:
with driver.session():
result = session.run("<SOME CYPHER>")
def work(tx):
result = tx.run("<SOME CYPHER>")
with driver.session():
session.write_transaction(work)
The latter one might be 3 lines longer and the team working on the drivers collected some feedback regarding this. However, there are more things to consider here. Firstly, changing the API surface is something that needs careful planning and cannot be done in say a patch release. Secondly, there are technical hurdles to overcome. Here are the semantics, anyway:
Auto-commit transaction. Runs only that query as one unit of work.
If you run a new auto-commit transaction within the same session, the previous result will buffer all available records for you (depending on the query, this will consume a lot of memory). This can be avoided by calling result.consume(). However, if the session goes out of scope, the result will be consumed automatically. This means you cannot extract further records from it. Lastly, any error will be raised and needs handling in the application code.
Managed transaction. Runs whatever unit of work you want within that function. A transaction is implicitly started and committed (unless you rollback explicitly) around the function.
If the transaction ends (end of function or rollback), the result will be consumed and become invalid. You'll have to extract all records you need before that.
This is the recommended way of using the driver because it will not raise all errors but handle some internally (where appropriate) and retry the work function (e.g. if the server is only temporarily unavailable). Since the function might be executed multiple time, you must make sure it's idempotent.
Closing thoughts:
Please remember that stackoverlfow is monitored on a best-effort basis and what can be perceived as hasty comments may get in the way of getting helpful answers to your questions

Using decorators for database access with psycopg2

I am constructing a model that does large parts of its calculations in a Postgresql database (for performance reasons). It looks somewhat like this:
def sql_func1(conn):
# prepare some data, crunch some number, etc.
curs = conn.cursor()
curs.execute("SOME SQL COMMAND")
curs.commit()
curs.close()
if __name__ == "__main__":
connection = psycopg2.connect(dbname='name', user='user', password='pass', host='localhost', port=1234)
sql_func1(conn)
sql_func2(conn)
sql_func3(conn)
connection.close()
The script uses around 30 individual functions like sql_func1. Obviously it is a little awkward to manage the connection and cursor in each function all the time. Thus I started using a decorator as described here. Now I can simply wrap sql_func1 with a decorator #db_connect and pass the connection from there. However, that means I am opening and closing the connection all the time, which is not good practice either. The psycopg2 FAQ says:
Creating a connection can be slow (think of SSL over TCP) so the best
practice is to create a single connection and keep it open as long as
required. It is also good practice to rollback or commit frequently
(even after a single SELECT statement) to make sure the backend is
never left “idle in transaction”. See also psycopg2.pool for
lightweight connection pooling.
Could you please give me some insights which would be an ideal practice im my case. Should I rather use a decorator that passes the cursor object instead of the connection? If so, please provide a code sample for the decorator. As I am rather new to programming, please let me also know in case you think my overall approach is wrong.

What about storing the connection in a global variable without closing it in the finally block? Something like this (according to the example you linked):
cnn = None
def with_connection(f):
def with_connection_(*args, **kwargs):
global cnn
if not cnn:
cnn = psycopg.connect(DSN)
try:
rv = f(cnn, *args, **kwargs)
except Exception, e:
cnn.rollback()
raise
else:
cnn.commit() # or maybe not
return rv
return with_connection_

Python 'with' implementation [duplicate]

I came across the Python with statement for the first time today. I've been using Python lightly for several months and didn't even know of its existence! Given its somewhat obscure status, I thought it would be worth asking:
What is the Python with statement
designed to be used for?
What do
you use it for?
Are there any
gotchas I need to be aware of, or
common anti-patterns associated with
its use? Any cases where it is better use try..finally than with?
Why isn't it used more widely?
Which standard library classes are compatible with it?

I believe this has already been answered by other users before me, so I only add it for the sake of completeness: the with statement simplifies exception handling by encapsulating common preparation and cleanup tasks in so-called context managers. More details can be found in PEP 343. For instance, the open statement is a context manager in itself, which lets you open a file, keep it open as long as the execution is in the context of the with statement where you used it, and close it as soon as you leave the context, no matter whether you have left it because of an exception or during regular control flow. The with statement can thus be used in ways similar to the RAII pattern in C++: some resource is acquired by the with statement and released when you leave the with context.
Some examples are: opening files using with open(filename) as fp:, acquiring locks using with lock: (where lock is an instance of threading.Lock). You can also construct your own context managers using the contextmanager decorator from contextlib. For instance, I often use this when I have to change the current directory temporarily and then return to where I was:
from contextlib import contextmanager
import os
#contextmanager
def working_directory(path):
current_dir = os.getcwd()
os.chdir(path)
try:
yield
finally:
os.chdir(current_dir)
with working_directory("data/stuff"):
# do something within data/stuff
# here I am back again in the original working directory
Here's another example that temporarily redirects sys.stdin, sys.stdout and sys.stderr to some other file handle and restores them later:
from contextlib import contextmanager
import sys
#contextmanager
def redirected(**kwds):
stream_names = ["stdin", "stdout", "stderr"]
old_streams = {}
try:
for sname in stream_names:
stream = kwds.get(sname, None)
if stream is not None and stream != getattr(sys, sname):
old_streams[sname] = getattr(sys, sname)
setattr(sys, sname, stream)
yield
finally:
for sname, stream in old_streams.iteritems():
setattr(sys, sname, stream)
with redirected(stdout=open("/tmp/log.txt", "w")):
# these print statements will go to /tmp/log.txt
print "Test entry 1"
print "Test entry 2"
# back to the normal stdout
print "Back to normal stdout again"
And finally, another example that creates a temporary folder and cleans it up when leaving the context:
from tempfile import mkdtemp
from shutil import rmtree
#contextmanager
def temporary_dir(*args, **kwds):
name = mkdtemp(*args, **kwds)
try:
yield name
finally:
shutil.rmtree(name)
with temporary_dir() as dirname:
# do whatever you want

I would suggest two interesting lectures:
PEP 343 The "with" Statement
Effbot Understanding Python's
"with" statement
1.
The with statement is used to wrap the execution of a block with methods defined by a context manager. This allows common try...except...finally usage patterns to be encapsulated for convenient reuse.
2.
You could do something like:
with open("foo.txt") as foo_file:
data = foo_file.read()
OR
from contextlib import nested
with nested(A(), B(), C()) as (X, Y, Z):
do_something()
OR (Python 3.1)
with open('data') as input_file, open('result', 'w') as output_file:
for line in input_file:
output_file.write(parse(line))
OR
lock = threading.Lock()
with lock:
# Critical section of code
3.
I don't see any Antipattern here.
Quoting Dive into Python:
try..finally is good. with is better.
4.
I guess it's related to programmers's habit to use try..catch..finally statement from other languages.

The Python with statement is built-in language support of the Resource Acquisition Is Initialization idiom commonly used in C++. It is intended to allow safe acquisition and release of operating system resources.
The with statement creates resources within a scope/block. You write your code using the resources within the block. When the block exits the resources are cleanly released regardless of the outcome of the code in the block (that is whether the block exits normally or because of an exception).
Many resources in the Python library that obey the protocol required by the with statement and so can used with it out-of-the-box. However anyone can make resources that can be used in a with statement by implementing the well documented protocol: PEP 0343
Use it whenever you acquire resources in your application that must be explicitly relinquished such as files, network connections, locks and the like.

Again for completeness I'll add my most useful use-case for with statements.
I do a lot of scientific computing and for some activities I need the Decimal library for arbitrary precision calculations. Some part of my code I need high precision and for most other parts I need less precision.
I set my default precision to a low number and then use with to get a more precise answer for some sections:
from decimal import localcontext
with localcontext() as ctx:
ctx.prec = 42 # Perform a high precision calculation
s = calculate_something()
s = +s # Round the final result back to the default precision
I use this a lot with the Hypergeometric Test which requires the division of large numbers resulting form factorials. When you do genomic scale calculations you have to be careful of round-off and overflow errors.

An example of an antipattern might be to use the with inside a loop when it would be more efficient to have the with outside the loop
for example
for row in lines:
with open("outfile","a") as f:
f.write(row)
vs
with open("outfile","a") as f:
for row in lines:
f.write(row)
The first way is opening and closing the file for each row which may cause performance problems compared to the second way with opens and closes the file just once.

See PEP 343 - The 'with' statement, there is an example section at the end.
... new statement "with" to the Python
language to make
it possible to factor out standard uses of try/finally statements.

points 1, 2, and 3 being reasonably well covered:
4: it is relatively new, only available in python2.6+ (or python2.5 using from __future__ import with_statement)

The with statement works with so-called context managers:
http://docs.python.org/release/2.5.2/lib/typecontextmanager.html
The idea is to simplify exception handling by doing the necessary cleanup after leaving the 'with' block. Some of the python built-ins already work as context managers.

Another example for out-of-the-box support, and one that might be a bit baffling at first when you are used to the way built-in open() behaves, are connection objects of popular database modules such as:
sqlite3
psycopg2
cx_oracle
The connection objects are context managers and as such can be used out-of-the-box in a with-statement, however when using the above note that:
When the with-block is finished, either with an exception or without, the connection is not closed. In case the with-block finishes with an exception, the transaction is rolled back, otherwise the transaction is commited.
This means that the programmer has to take care to close the connection themselves, but allows to acquire a connection, and use it in multiple with-statements, as shown in the psycopg2 docs:
conn = psycopg2.connect(DSN)
with conn:
with conn.cursor() as curs:
curs.execute(SQL1)
with conn:
with conn.cursor() as curs:
curs.execute(SQL2)
conn.close()
In the example above, you'll note that the cursor objects of psycopg2 also are context managers. From the relevant documentation on the behavior:
When a cursor exits the with-block it is closed, releasing any resource eventually associated with it. The state of the transaction is not affected.

In python generally “with” statement is used to open a file, process the data present in the file, and also to close the file without calling a close() method. “with” statement makes the exception handling simpler by providing cleanup activities.
General form of with:
with open(“file name”, “mode”) as file_var:
processing statements
note: no need to close the file by calling close() upon file_var.close()

The answers here are great, but just to add a simple one that helped me:
with open("foo.txt") as file:
data = file.read()
open returns a file
Since 2.6 python added the methods __enter__ and __exit__ to file.
with is like a for loop that calls __enter__, runs the loop once and then calls __exit__
with works with any instance that has __enter__ and __exit__
a file is locked and not re-usable by other processes until it's closed, __exit__ closes it.
source: http://web.archive.org/web/20180310054708/http://effbot.org/zone/python-with-statement.htm

Cassandra low performance?

I have to choose Cassandra or MongoDB(or another nosql database, I accept suggestions) for a project with a lot of inserts(1M/day).
So I create a small test to measure the write performance. Here's the code to insert in Cassandra:
import time
import os
import random
import string
import pycassa
def get_random_string(string_length):
return ''.join(random.choice(string.letters) for i in xrange(string_length))
def connect():
"""Connect to a test database"""
connection = pycassa.connect('test_keyspace', ['localhost:9160'])
db = pycassa.ColumnFamily(connection,'foo')
return db
def random_insert(db):
"""Insert a record into the database. The record has the following format
ID timestamp
4 random strings
3 random integers"""
record = {}
record['id'] = str(time.time())
record['str1'] = get_random_string(64)
record['str2'] = get_random_string(64)
record['str3'] = get_random_string(64)
record['str4'] = get_random_string(64)
record['num1'] = str(random.randint(0, 100))
record['num2'] = str(random.randint(0, 1000))
record['num3'] = str(random.randint(0, 10000))
db.insert(str(time.time()), record)
if __name__ == "__main__":
db = connect()
start_time = time.time()
for i in range(1000000):
random_insert(db)
end_time = time.time()
print "Insert time: %lf " %(end_time - start_time)
And the code to insert in Mongo it's the same changing the connection function:
def connect():
"""Connect to a test database"""
connection = pymongo.Connection('localhost', 27017)
db = connection.test_insert
return db.foo2
The results are ~1046 seconds to insert in Cassandra, and ~437 to finish in Mongo.
It's supposed that Cassandra it's much faster than Mongo inserting data. So , What i'm doing wrong?

There is no equivalent to Mongo's unsafe mode in Cassandra. (We used to have one, but we took it out, because it's just a Bad Idea.)
The other main problem is that you're doing single-threaded inserts. Cassandra is designed for high concurrency; you need to use a multithreaded test. See the graph at the bottom of http://spyced.blogspot.com/2010/01/cassandra-05.html (actual numbers are over a year out of date but the principle is still true).
The Cassandra source distribution has such a test included in contrib/stress.

If I am not mistaken, Cassandra allows you to specify whether or not you are doing a MongoDB-equivalent "safe mode" insert. (I dont recall the name of that feature in Cassandra)
In other words, Cassandra may be configured to write to disk and then return as opposed to the default MongoDB configuration which immediately returns after performing an insert without knowing if the insert was successful or not. It just means that your application never waits for a pass\fail from the server.
You can change that behavior by using safe mode in MongoDB but this is known to have a large impact on performance. Enable safe mode and you may see different results.

You will harness true power of Cassandra once you have multiple nodes running. Any node will be able to take a write request. Multithreading a client is only flooding more requests to same instance which is not going to help after a point.
Check cassandra log for the events that happen during your tests. Cassandra will initiate a disk write once the Memtable is full (this is configurable, make it large enough and you will be dealing on in RAM + disk writes of commit log). If disk write for Memtable happen during your test then it will slow it down. I do not know when MongoDB writes to disk.

Might I suggest taking a look at Membase here? It's used in exactly the same way as memcached and is fully distributed so you can continuously scale your write input rate simply by adding more servers and/or more RAM.
For this case, you'll definitely want to go with a client-side Moxi to give you the best performance. Take a look at our wiki: wiki.membase.org for examples and let me know if you need any further instruction...I'm happy to walk you through it and I'm certain that Membase can handle this load easily.

Create batch mutator for doing
multiple insert, update, and remove
operations using as few roundtrips as
possible.
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.batch
Batch mutator helped me reduce insert time in at least half

How do I efficiently do a bulk insert-or-update with SQLAlchemy?

I'm using SQLAlchemy with a Postgres backend to do a bulk insert-or-update. To try to improve performance, I'm attempting to commit only once every thousand rows or so:
trans = engine.begin()
for i, rec in enumerate(records):
if i % 1000 == 0:
trans.commit()
trans = engine.begin()
try:
inserter.execute(...)
except sa.exceptions.SQLError:
my_table.update(...).execute()
trans.commit()
However, this isn't working. It seems that when the INSERT fails, it leaves things in a weird state that prevents the UPDATE from happening. Is it automatically rolling back the transaction? If so, can this be stopped? I don't want my entire transaction rolled back in the event of a problem, which is why I'm trying to catch the exception in the first place.
The error message I'm getting, BTW, is "sqlalchemy.exc.InternalError: (InternalError) current transaction is aborted, commands ignored until end of transaction block", and it happens on the update().execute() call.

You're hitting some weird Postgresql-specific behavior: if an error happens in a transaction, it forces the whole transaction to be rolled back. I consider this a Postgres design bug; it takes quite a bit of SQL contortionism to work around in some cases.
One workaround is to do the UPDATE first. Detect if it actually modified a row by looking at cursor.rowcount; if it didn't modify any rows, it didn't exist, so do the INSERT. (This will be faster if you update more frequently than you insert, of course.)
Another workaround is to use savepoints:
SAVEPOINT a;
INSERT INTO ....;
-- on error:
ROLLBACK TO SAVEPOINT a;
UPDATE ...;
-- on success:
RELEASE SAVEPOINT a;
This has a serious problem for production-quality code: you have to detect the error accurately. Presumably you're expecting to hit a unique constraint check, but you may hit something unexpected, and it may be next to impossible to reliably distinguish the expected error from the unexpected one. If this hits the error condition incorrectly, it'll lead to obscure problems where nothing will be updated or inserted and no error will be seen. Be very careful with this. You can narrow down the error case by looking at Postgresql's error code to make sure it's the error type you're expecting, but the potential problem is still there.
Finally, if you really want to do batch-insert-or-update, you actually want to do many of them in a few commands, not one item per command. This requires trickier SQL: SELECT nested inside an INSERT, filtering out the right items to insert and update.

This error is from PostgreSQL. PostgreSQL doesn't allow you to execute commands in the same transaction if one command creates an error. To fix this you can use nested transactions (implemented using SQL savepoints) via conn.begin_nested(). Heres something that might work. I made the code use explicit connections, factored out the chunking part and made the code use the context manager to manage transactions correctly.
from itertools import chain, islice
def chunked(seq, chunksize):
"""Yields items from an iterator in chunks."""
it = iter(seq)
while True:
yield chain([it.next()], islice(it, chunksize-1))
conn = engine.commit()
for chunk in chunked(records, 1000):
with conn.begin():
for rec in chunk:
try:
with conn.begin_nested():
conn.execute(inserter, ...)
except sa.exceptions.SQLError:
conn.execute(my_table.update(...))
This still won't have stellar performance though due to nested transaction overhead. If you want better performance try to detect which rows will create errors beforehand with a select query and use executemany support (execute can take a list of dicts if all inserts use the same columns). If you need to handle concurrent updates, you'll still need to do error handling either via retrying or falling back to one by one inserts.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Persistent memoization in Python - python

Related

simple example of working with neo4j python driver?

Using decorators for database access with psycopg2

Python 'with' implementation [duplicate]

Cassandra low performance?

How do I efficiently do a bulk insert-or-update with SQLAlchemy?

Categories

Resources