How would one go about cancelling execution of a query statement using psycopg2 (the Python PostgreSQL driver)?
As an example, let's say I have the following code:
import psycopg2
cnx_string = "something_appropriate"
conn = psycopg2.connect(cnx_string)
cur = conn.cursor()
cur.execute("long_running_query")
Then I want to cancel the execution of that long-running query from another thread. What method would I have to call on the connection/cursor objects to do this?
You can cancel a query by calling the pg_cancel_backend(pid) PostgreSQL function in a separate connection.
You can obtain the PID of the backend to cancel from the connection.get_backend_pid() method of psycopg2 (available since version 2.0.8).
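A minimal sketch of that approach, reusing cnx_string from the question (error handling omitted):
import psycopg2

conn = psycopg2.connect(cnx_string)
pid = conn.get_backend_pid()      # PID of the backend running the long query

# from another thread or process: open a second connection and cancel that backend
cancel_conn = psycopg2.connect(cnx_string)
cancel_conn.autocommit = True     # don't leave an idle transaction open
cancel_cur = cancel_conn.cursor()
cancel_cur.execute("SELECT pg_cancel_backend(%s)", (pid,))
cancel_conn.close()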
The connection object has a cancel() method. Using this together with threading you could do:
sqltimeout = threading.Timer(sql_timeout_seconds, conn.cancel)
sqltimeout.start()
When the timer expires, the cancel request is sent over the connection, the server aborts the query, and psycopg2 raises an exception in the thread running it.
Don't forget to cancel the timer when the query finishes normally:
sqltimeout.cancel()
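Put together, a sketch of the timer approach might look like this (cnx_string and sql_timeout_seconds are placeholders; psycopg2 reports the server-side cancellation as extensions.QueryCanceled):
import threading
import psycopg2
from psycopg2 import extensions

conn = psycopg2.connect(cnx_string)
cur = conn.cursor()

sqltimeout = threading.Timer(sql_timeout_seconds, conn.cancel)
sqltimeout.start()
try:
    cur.execute("long_running_query")
except extensions.QueryCanceled:
    pass                    # the timer fired and the statement was cancelled
finally:
    sqltimeout.cancel()     # a no-op if the timer already fired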
psycopg2's async execution support has been removed.
If you can use py-postgresql and its transactions (it's py3k), the internal implementation is asynchronous and supports being interrupted.
I just can't figure out how to use the aiosqlite module so that I can keep the connection around for later use.
The example based on the aiosqlite project page
async with aiosqlite.connect('file.db') as conn:
    cursor = await conn.execute("SELECT 42;")
    rows = await cursor.fetchall()
    print('rows %s' % rows)
works fine, but I want to keep the connection around so that I can use it throughout my program.
Typically, with sqlite, I open a connection, squirrel it away and then use it throughout the life of the program.
I also tried things like:
conn = aiosqlite.connect('file.db')
c = await conn.__enter__()
AttributeError: 'Connection' object has no attribute '__enter__'
Is there a way to use this module without a context manager?
The "best" way would be for the entry-point of your application to create the aiosqlite connection using the context manager method, store a reference to the connection object somewhere, and then run the application's "run loop" method from within that context. This would ensure that when your application exits, the sqlite connection is cleaned up appropriately. This could look something like this:
async def main():
    async with aiosqlite.connect(...) as conn:
        # save conn somewhere
        await run_loop()
Alternatively, you can await the appropriate enter/exit methods yourself:
try:
    conn = aiosqlite.connect(...)
    await conn.__aenter__()
    # do stuff
finally:
    await conn.__aexit__(None, None, None)
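For program-lifetime use, the same enter/exit calls can be wrapped in a pair of small helpers invoked at startup and shutdown. A sketch (open_db, close_db and the module-level _conn are names made up here, not part of aiosqlite):
import aiosqlite

_conn = None

async def open_db(path):
    """Open the shared connection once at application startup."""
    global _conn
    _conn = aiosqlite.connect(path)
    await _conn.__aenter__()
    return _conn

async def close_db():
    """Close the shared connection at application shutdown."""
    await _conn.__aexit__(None, None, None)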
Regardless, do beware that the asynchronous nature of aiosqlite means that a shared connection can end up with overlapping transactions. If you need the assurance that concurrent queries take place in separate transactions, then you will need a separate connection per transaction.
According to the Python sqlite docs on sharing connections:
When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption.
This applies equally to aiosqlite and asyncio. For example, the following code will potentially overlap both inserts into a single transaction:
async def one(db):
    await db.execute("insert ...")
    await db.commit()

async def two(db):
    await db.execute("insert ...")
    await db.commit()

async def main():
    async with aiosqlite.connect(...) as db:
        await asyncio.gather(one(db), two(db))
The correct solution here would be to either create a connection for each transaction, or use something like executescript to execute the entire transaction at once.
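For example, giving each coroutine its own connection keeps the transactions separate. A sketch based on the snippet above (the database path and the INSERT statements are placeholders):
import asyncio
import aiosqlite

async def insert_one():
    # a private connection per coroutine, so this transaction
    # cannot overlap with any other
    async with aiosqlite.connect('file.db') as db:
        await db.execute("insert ...")
        await db.commit()

async def main():
    await asyncio.gather(insert_one(), insert_one())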
Two points. Firstly, a modifying statement locks the whole DB, as per:
When a database is accessed by multiple connections, and one of the processes modifies the database, the SQLite database is locked until that transaction is committed. The timeout parameter specifies how long the connection should wait for the lock to go away until raising an exception. The default for the timeout parameter is 5.0 (five seconds).
(https://docs.python.org/3/library/sqlite3.html#connection-objects)
so read-only statements will not be able to execute anyway, as long as a "write" is in progress. Consequently, I'd open a new connection for each write operation.
Secondly, for a file-based DB such as SQLite the cost of opening a connection is obviously much lower than for a network-accessed DB. You may profile it; a simple test like:
x = 0
for i in range(1000):
    async with aiosqlite.connect('pic_db.db') as db:
        async with db.execute('SELECT 12', ()) as cursor:
            x += 1
shows that the two statements together take ~550 ms; if the for and the connect are interchanged, it takes ~150 ms, so each connect costs about 0.4 ms. If that's an issue, I'd try to reuse a connection for read-only statements (and open a new one for each modifying statement).
I'm running a Bokeh server, using the underlying Tornado framework.
I need the server to refresh some data at some point. This is done by fetching rows from an Oracle DB, using cx_Oracle.
Thanks to Tornado's PeriodicCallback, the program checks every 30 seconds if new data should be loaded:
server.start()
from tornado.ioloop import PeriodicCallback
pcallback = PeriodicCallback(db_obj.reload_data_async, 10 * 1e3)
pcallback.start()
server.io_loop.start()
Where db_obj is an instance of a class which takes care of the DB related functions (connect, fetch, ...).
Basically, this is what the reload_data_async function looks like:
executor = concurrent.futures.ThreadPoolExecutor(4)

# methods of the db_obj class ...
@gen.coroutine
def reload_data_async(self):
    # ... first, some code to check if the data should be reloaded ...
    # ...
    if data_should_be_reloaded:
        new_data = yield executor.submit(self.fetch_data)

def fetch_data(self):
    """ fetch new data in the DB """
    cursor = cx.Cursor(self.db_connection)
    cursor.execute("some SQL select request that takes time (select * from ...)")
    rows = cursor.fetchall()
    # some more processing thereafter
    # ...
Basically, this works. But when I try to read the data while it is being loaded in fetch_data (by clicking to display it in the GUI), the program crashes, due to a race condition I guess: it is accessing the data while it is being fetched at the same time.
I just discovered that tornado.concurrent.Future is not thread-safe:
tornado.concurrent.Future is similar to concurrent.futures.Future, but
not thread-safe (and therefore faster for use with single-threaded
event loops).
All in all, I think I should create a new thread to take care of the cx_Oracle operations. Can I do that using Tornado and keep using the PeriodicCallback function? How can I make my asynchronous operation thread-safe?
PS: I'm using Python 2.7
Thanks
Solved it!
@Sraw is right: it should not cause a crash.
Explanation: fetch_data() is using a cx_Oracle Connection object (self.db_connection), which is NOT thread-safe by default. Setting the threaded parameter to True wraps the shared connection with a mutex, as described in the cx_Oracle documentation:
The threaded parameter is expected to be a boolean expression which
indicates whether or not Oracle should wrap accesses to connections
with a mutex. Doing so in single threaded applications imposes a
performance penalty of about 10-15% which is why the default is False.
So in my code, I just modified the following, and it now works without crashing when the user tries to access the data while it is being refreshed:
# inside the connect method of the db_obj class
self.db_connection = cx.connect('connection string', threaded=True) # False by default
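As a side note, threaded=True only protects the connection object itself; if the fetched rows are also stored on the shared db_obj, swapping them in under a plain lock keeps a GUI read from ever seeing a half-updated structure. A rough sketch (the _lock and data attributes are hypothetical, not from the code above):
import threading

class DbObj:                                   # stand-in for the question's db_obj class
    def __init__(self, db_connection):
        self.db_connection = db_connection
        self._lock = threading.Lock()
        self.data = []

    def fetch_data(self):
        cursor = self.db_connection.cursor()   # equivalent to cx.Cursor(self.db_connection)
        cursor.execute("some SQL select request that takes time")
        rows = cursor.fetchall()
        with self._lock:                       # swap the new rows in atomically
            self.data = rows

    def get_data(self):                        # called from the GUI/reader side
        with self._lock:
            return list(self.data)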
I'm developing a Flask API, and I have the following code to set up a connection pool using psycopg2. I wonder whether I should close the connection pool when the program terminates, and if so, how should I do it?
@contextmanager
def get_cursor(cls):
    if not cls.connection_pool:
        cls.connection_pool = ThreadedConnectionPool(5, 25, dsn=PoolingWrap.generate_conn_string())
    con = cls.connection_pool.getconn()
    try:
        yield con.cursor(cursor_factory=RealDictCursor)
    finally:
        cls.connection_pool.putconn(con)
I believe that so long as you are closing the transactions in each connection correctly, then you should not have any problem just leaving the global pool.
I think the worst that can happen in that case is that some connections on the DB side take a short while to work out that they have been closed on the client side, but this should not be able to cause any data-consistency issues.
However, if you really want to close the connection pool before you exit, one way to do this is to register an atexit function.
import atexit

@atexit.register
def close_connection_pool():
    global connection_pool
    connection_pool.closeall()
I have a PySpark job that updates some objects in HBase (Spark v1.6.0; happybase v0.9).
It sort-of works if I open/close an HBase connection for each row:
def process_row(row):
    conn = happybase.Connection(host=[hbase_master])
    # update HBase record with data from row
    conn.close()
my_dataframe.foreach(process_row)
After a few thousand upserts, we start to see errors like this:
TTransportException: Could not connect to [hbase_master]:9090
Obviously, it's inefficient to open/close a connection for each upsert. This function is really just a placeholder for a proper solution.
I then tried to create a version of the process_row function that uses a connection pool:
pool = happybase.ConnectionPool(size=20, host=[hbase_master])

def process_row(row):
    with pool.connection() as conn:
        pass  # update HBase record with data from row
For some reason, the connection pool version of this function raises an error:
TypeError: can't pickle thread.lock objects
Can you see what I'm doing wrong?
Update
I saw this post and suspect I'm experiencing the same issue: Spark attempts to serialize the pool object and distribute it to each of the executors, but this connection pool object cannot be shared across multiple executors.
It sounds like I need to split the dataset into partitions, and use one connection per partition (see design patterns for using foreachRDD). I tried this, based on an example in the documentation:
def persist_to_hbase(dataframe_partition):
    hbase_connection = happybase.Connection(host=[hbase_master])
    for row in dataframe_partition:
        pass  # persist data
    hbase_connection.close()

my_dataframe.foreachPartition(lambda dataframe_partition: persist_to_hbase(dataframe_partition))
Unfortunately, it still returns a "can't pickle thread.lock objects" error.
Down the line, happybase connections are just TCP connections, so they cannot be shared between processes. A connection pool is primarily useful for multi-threaded applications, and it also proves useful for single-threaded applications that can use the pool as a global "connection factory" with connection reuse, which may simplify code because no "connection" objects need to be passed around. It also makes error recovery a bit easier.
In any case, a pool (which is just a group of connections) cannot be shared between processes, so trying to serialise it does not make sense. (Pools use locks, which cause serialisation to fail, but that is just a symptom.)
Perhaps you can use a helper that conditionally creates a pool (or connection) and stores it as a module-local variable, instead of instantiating it at import time, e.g.:
_pool = None

def get_pool():
    global _pool
    if _pool is None:
        _pool = happybase.ConnectionPool(size=1, host=[hbase_master])
    return _pool

def process(...):
    with get_pool().connection() as connection:
        connection.table(...).put(...)
This instantiates the pool/connection upon first use instead of at import time.
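Tying that back to the Spark job, each executor can then obtain its own lazily created pool inside foreachPartition. A sketch ('my_table', the row key and the column mapping are made-up placeholders):
def persist_partition(rows):
    # get_pool() builds the pool on first use inside this executor process
    with get_pool().connection() as connection:
        table = connection.table('my_table')
        for row in rows:
            table.put(row.key, {'cf:value': row.value})

my_dataframe.foreachPartition(persist_partition)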
I have been given the task of updating a database over the network with SQLAlchemy. I have decided to use Python's threading module. Currently I am using one thread, aka the producer thread, to direct other threads to consume work units via a queue.
The producer thread does something like this:
def produce(self, last_id):
    unit = session.query(Request).order_by(Request.id) \
        .filter(Request.item_id == None).yield_per(50)
    self.queue.put(unit, True, Master.THREAD_TIMEOUT)
while the consumer threads do something similar to this:
def consume(self):
    unit = self.queue.get()
    request = unit
    item = Item.get_item_by_url(request)
    request.item = item
    session.add(request)
    session.flush()
and I am using sqlalchemy's scoped session:
session = scoped_session(sessionmaker(autocommit=True, autoflush=True, bind=engine))
However, I am getting the exception,
"sqlalchemy.exc.InvalidRequestError: Object FOO is already attached to session '1234' (this is '5678')"
I understand that this exception comes from the fact that the request object is created in one session (the producer session) while the consumers are using another scoped session because they belong to another thread.
My workaround is to have my producer thread pass the request.id into the queue, while the consumer has to call the code below to retrieve the request object.
request = session.query(Request).filter(Request.id == request_id).first()
I do not like this solution because this involves another network call and is obviously not optimal.
Are there ways to avoid wasting the result of the producer's db call?
Is there a way to write "produce" so that more than one id is passed into the queue as a work unit?
Feedback welcomed!
You need to detach your Request instance from the main thread session before you put it into the queue, then attach it to the queue processing thread session when taken from the queue again.
To detach, call .expunge() on the session, passing in the request:
session.expunge(unit)
and then when processing it in a consumer thread, re-attach it by merging; set the load flag to False to prevent another round-trip to the database. Note that merge() returns the session-attached copy, so bind the result:
request = session.merge(request, load=False)
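Putting both halves together, a sketch of the adjusted producer and consumer from the question (batching and error handling left out):
def produce(self, last_id):
    query = session.query(Request).order_by(Request.id) \
        .filter(Request.item_id == None).yield_per(50)
    for request in query:
        session.expunge(request)   # detach from the producer's session
        self.queue.put(request, True, Master.THREAD_TIMEOUT)

def consume(self):
    request = self.queue.get()
    # re-attach to this thread's scoped session without another database round-trip
    request = session.merge(request, load=False)
    request.item = Item.get_item_by_url(request)
    session.add(request)
    session.flush()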