There are many questions about whether Django's DB connection is thread safe, but they all seem to be about the default request-handling threads.
What if I am writing a custom script that uses database connections in threads:
from django.db import connections
import threading

class Transform(object):
    def transform_data(self, listing):
        cursor = self.connection.cursor()
        cursor.execute('SELECT ... WHERE id = %s', [listing.id])
        data = cursor.fetchall()
        ...

    def run(self):
        self.connection = connections['legacy']
        for listing in listings:
            threading.Thread(target=self.transform_data, args=[listing]).start()
How safe is the data inside the transform_data thread? In particular, can the result from the cursor get mixed up with results from other threads?
Ideally, each thread should use its own connection. If you do that, then when you execute the SELECT query inside transform_data you essentially get a snapshot of the data at that point in time. You can retrieve the rows without having to worry about them being updated or deleted by other threads, provided those other threads use their own connections.
If all threads share the same connection, what exactly happens depends heavily on which database you are using and on the transaction isolation level.
Each item in the connections object returns a thread-local connection to that database. By default, these connections cannot be shared between threads; attempting to do so will result in a DatabaseError.
Always use connections[alias] within the thread that executes your queries. Never access connections[alias] in the parent thread and pass the object to the child thread. This will ensure that every connection object you use is local to the current thread, avoiding any threading issues.
To fix your code and make it thread-safe, you would change it like this:
from django.db import connections
import threading

class Transform(object):
    def transform_data(self, listing):
        # Access the database connection on the global `connections` object
        # from within the child thread.
        cursor = connections['legacy'].cursor()
        cursor.execute('SELECT ... WHERE id = %s', [listing.id])
        data = cursor.fetchall()
        ...

    def run(self):
        for listing in listings:
            threading.Thread(target=self.transform_data, args=[listing]).start()
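One follow-up detail, as an assumption about cleanup rather than something the question requires: Django only closes connections automatically at the end of a request, so in a standalone script it can be worth closing each thread's connection when the worker finishes, and joining the threads. A minimal sketch, reusing the question's 'legacy' alias and listings iterable:

from django.db import connections
import threading

def transform_data(listing):
    try:
        # Thread-local connection for the 'legacy' alias.
        cursor = connections['legacy'].cursor()
        cursor.execute('SELECT ... WHERE id = %s', [listing.id])
        data = cursor.fetchall()
        # ... process data ...
    finally:
        # Close this thread's connections explicitly; outside the
        # request/response cycle Django will not do it for us.
        connections.close_all()

threads = [threading.Thread(target=transform_data, args=[listing])
           for listing in listings]
for t in threads:
    t.start()
for t in threads:
    t.join()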
Preface
I want to process tasks listed in a database table in parallel. Not looking for working code.
The Setup
1 PostgreSQL database server D
1 processing server P
1 User terminal T
using Python 3.6, psycopg2.7.6, PostgreSQL 11
D holds tables with data to be processed and a tasks table. A user at T ssh's into P, where the following command can be issued:
python -m core.utils.task
This task.py script is essentially a while loop that gets a task t with status 'new' from the tasks table on D until there are no new tasks left. A task t is basically a set of arguments for another function called do_something(t). do_something(t) itself makes many connections to D to get the data that needs to be processed and sets the task's status to 'done' once it is finished; then the while loop starts over and gets a new task.
In order to run python -m core.utils.task multiple times I open multiple ssh connections. Not so good, I know; threading or multiprocessing would be better. But this is just for testing whether I can run the mentioned command twice.
There is a script called pgsql.py that manages all the database interactions; it is used to get a task and is then used by do_something(t). I adapted a singleton pattern from this SE post.
Pseudo-Code (mostly)
task.py
import mymodule
import pgsql

def main():
    while True:
        r, c = pgsql.SQL.select_task()  # rows and columns
        task = dotdict(dict(zip(c, r[0])))
        mymodule.do_something(task)

if __name__ == "__main__":
    main()
mymodule.py
import pgsql

def do_something(t):
    input = pgsql.SQL.get_images(t.table, t.schema, t.image_id, t.image_directory)
    some_other_function(input)
    pgsql.SQL.task_status(t.task_id, 'done')
pgsql.py
import psycopg2 as pg

class Postgres(object):
    """Adapted from https://softwareengineering.stackexchange.com/a/358061/348371"""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
            db_config = {'dbname': 'dev01', 'host': 'XXXXXXXX',
                         'password': 'YYYYY', 'port': 5432, 'user': 'admin'}
            try:
                print('connecting to PostgreSQL database...')
                connection = Postgres._instance.connection = pg.connect(**db_config)
                connection.set_session(isolation_level='READ COMMITTED', autocommit=True)
            except Exception as error:
                print('Error: connection not established {}'.format(error))
                Postgres._instance = None
            else:
                print('connection established')
        return cls._instance

    def __init__(self):
        self.connection = self._instance.connection

    def query(self, query):
        try:
            with self.connection.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()
                cols = [desc[0] for desc in cur.description]
        except Exception as error:
            print('error executing query "{}", error: {}'.format(query, error))
            return None
        else:
            return rows, cols

    def __del__(self):
        self.connection.close()


db = Postgres()
class SQL():
    def select_task():
        s = """
            UPDATE schema.tasks
            SET status = 'ready'
            WHERE task_id = ( SELECT task_id
                              FROM schema.tasks
                              WHERE tasks.status = 'new'
                              LIMIT 1)
            RETURNING *
            ;
            """
        return Postgres.query(db, s)

    def task_status(id, status):
        s = """
            UPDATE
                schema.tasks
            SET
                status = '{s}'
            WHERE
                tasks.task_id = '{id}'
            ;
            """.format(s=status, id=id)
        return Postgres.query(db, s)
Problem
This works with one ssh connection. Tasks are retrieved from the database and processed; once finished, the task is set to 'done'. But once I open a second ssh connection in a second terminal and run python -m core.utils.task (so to say, in parallel), the exact same rows of the tasks table are processed in both sessions, ignoring that they have already been updated.
Question
What are your suggestions to get this to work? There are millions of tasks and I need to run them in parallel. Before implementing threading or multiprocessing I wanted to test it with multiple ssh connections first; bad idea? I have fiddled around with the isolation levels and autocommit settings in psycopg2's set_session(), but without luck. I checked the sessions on the database server and can see that each python -m core.utils.task process has its own PID and connects only once, exactly as this singleton pattern should work. Any ideas or pointers on how to deal with this are much appreciated!
The main problem is that performing one task is not an atomic operation. Therefore, in different ssh sessions, the same task can be processed several times.
In this implementation, you can try to use an "INPROGRESS" status for a task so that tasks that are already being processed (those with "INPROGRESS" status) are not retrieved again. But be sure to use autocommit.
But I would implement this using threads and a database connection pool, and would fetch tasks in batches using OFFSET and LIMIT. The do_something, select_task and task_status functions would then operate on a batch of tasks.
Also, there is no need to implement the Postgres class as a singleton.
Amended (see the comments below)
You can add FOR UPDATE SKIP LOCKED to the SQL query in the current implementation (see url).
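For instance, the question's select_task could be amended like this (a sketch only; the 'INPROGRESS' status is the one suggested above, and the rest mirrors the question's pgsql.py):

def select_task():
    # Claim one unclaimed task atomically. SKIP LOCKED makes a second
    # session skip rows that another session has already locked, so the
    # same task is never handed out twice.
    s = """
        UPDATE schema.tasks
        SET status = 'INPROGRESS'
        WHERE task_id = ( SELECT task_id
                          FROM schema.tasks
                          WHERE status = 'new'
                          ORDER BY task_id
                          LIMIT 1
                          FOR UPDATE SKIP LOCKED )
        RETURNING *
        ;
        """
    return Postgres.query(db, s)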
If you want to work with batches, separate the data by some serial column (or just sort the data in the table).
My implementation uses batches. It can be implemented using ThreadPoolExecutor and PersistentConnectionPool.
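A rough sketch of that structure, assuming psycopg2's ThreadedConnectionPool (PersistentConnectionPool would also work) and concurrent.futures; claim_batch and process_task are hypothetical placeholders for the batch versions of select_task and do_something, and the connection parameters mirror the question's placeholders:

from concurrent.futures import ThreadPoolExecutor
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=1, maxconn=8,
                              dbname='dev01', host='XXXXXXXX',
                              user='admin', password='YYYYY', port=5432)

def claim_batch(batch_size=100):
    # Atomically claim up to batch_size tasks using SKIP LOCKED.
    conn = pool.getconn()
    try:
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute("""
                UPDATE schema.tasks
                SET status = 'INPROGRESS'
                WHERE task_id IN ( SELECT task_id
                                   FROM schema.tasks
                                   WHERE status = 'new'
                                   ORDER BY task_id
                                   LIMIT %s
                                   FOR UPDATE SKIP LOCKED )
                RETURNING *;
                """, (batch_size,))
            return cur.fetchall()
    finally:
        pool.putconn(conn)

def process_task(row):
    # Placeholder for do_something(t) plus setting the task to 'done'.
    ...

with ThreadPoolExecutor(max_workers=8) as executor:
    while True:
        batch = claim_batch()
        if not batch:
            break
        # Consume the results so worker exceptions surface here.
        list(executor.map(process_task, batch))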
I wonder whether it is fine to keep references to a db and a collection as class members?
Like this:
from pymongo import MongoClient

class ClientDataStore(object):
    BASE_MONGO_CONNECTION_URL = 'mongodb://localhost:27017/'
    MAIN_DB_NAME = "bank"
    CLIENT_COLLECTION_NAME = "client"

    def __init__(self):
        self.mongo = MongoClient(ClientDataStore.BASE_MONGO_CONNECTION_URL)
        self.db = self.mongo[ClientDataStore.MAIN_DB_NAME]
        self.client_collection = self.db[ClientDataStore.CLIENT_COLLECTION_NAME]

    def get_client_info(self, id):
        client = self.client_collection.find_one({"_id": id})
        return client
Will it keep the connection open, or will it open one as necessary?
Or should I open the db and get the collection only when I need them?
Thanks
This is a good idea. MongoClient has a connection pool that keeps open connections indefinitely. Keeping an open connection will reduce latency and increase throughput in your application. See the Connection Pool FAQ for PyMongo.
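For example, you would typically create one ClientDataStore (and therefore one MongoClient) per process and reuse it everywhere. A small sketch; handle_request and the maxPoolSize value are illustrative assumptions, not part of the question:

# One instance per process, created at startup and reused everywhere.
store = ClientDataStore()

def handle_request(client_id):
    # Reuses the pooled connections inside the single MongoClient.
    return store.get_client_info(client_id)

# If needed, the pool size can be tuned when the client is created, e.g.:
# MongoClient('mongodb://localhost:27017/', maxPoolSize=50)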
Is it possible to access an in-memory SQLite database from different threads?
In the following sample code I create a SQLite database in memory and create a table. When I then go to a different execution context, which I think I have to do when I go to a different thread, the created table isn't there anymore. If I open a file-based SQLite database instead, the table is there.
Can I achieve the same behavior for an in-memory database?
from peewee import *

db = SqliteDatabase(':memory:')

class BaseModel(Model):
    class Meta:
        database = db

class Names(BaseModel):
    name = CharField(unique=True)

print(Names.table_exists())  # this returns False
Names.create_table()
print(Names.table_exists())  # this returns True

print(id(db.get_conn()))  # Our main thread's connection.
with db.execution_context():
    # Going to another context: this returns False if we are in :memory:
    # and True if we work on a file *.db.
    print(Names.table_exists())
    print(id(db.get_conn()))  # A separate connection.
print(id(db.get_conn()))  # Back to the original connection.
Working!!
cacheDB = SqliteDatabase('file:cachedb?mode=memory&cache=shared')
Links
http://charlesleifer.com/blog/managing-database-connections-with-peewee/
https://groups.google.com/forum/#!topic/peewee-orm/78invrt3xyo
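A sketch of how the shared-cache URI can be used across threads; passing uri=True so that sqlite3 treats the string as a URI is my assumption (it may not be needed on every peewee/sqlite3 combination), and execution_context follows the peewee version used in the question:

import threading
from peewee import SqliteDatabase, Model, CharField

# Shared-cache in-memory database: every connection opened with this URI
# sees the same in-memory data, as long as at least one connection stays open.
db = SqliteDatabase('file:cachedb?mode=memory&cache=shared', uri=True)

class Names(Model):
    name = CharField(unique=True)
    class Meta:
        database = db

db.connect()          # keep one connection open so the data persists
Names.create_table()

def worker():
    # Each execution context / thread gets its own connection,
    # but it points at the same shared in-memory database.
    with db.execution_context():
        print(Names.table_exists())  # True, unlike with a plain ':memory:' db

t = threading.Thread(target=worker)
t.start()
t.join()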
Should I be re-initializing the connection on every insert?
class TwitterStream:
    def __init__(self, timeout=False):
        while True:
            self.dump_data()

    def dump_data(self):
        ## dump my data into mongodb
        ## should I be doing this every time??:
        client = MongoClient()
        mongo = MongoClient('localhost', 27017)
        db = mongo.test
        db.insert({'some stuff': 'other stuff'})
        ## dump data and close connection
        #########################
Do I need to open the connection every time I write a record? Or can I leave a connection open assuming I'll be writing to the database 5 times per second with about 10kb each time?
If just one connection is enough, where should I define the variables which hold the connection (client, mongo, db)?
Open one MongoClient that lives for the duration of your program:
from pymongo import MongoClient

client = MongoClient()

class TwitterStream:
    def dump_data(self):
        while True:
            db = client.test
            db.insert({'some stuff': 'other stuff'})
Opening a single MongoClient means you only pay its startup cost once, and its connection-pooling will minimize the cost of opening new connections.
If you're concerned about surviving occasional network issues, wrap your operations in an exception block:
try:
    db.insert(...)
except pymongo.errors.ConnectionFailure:
    # Handle error.
    ...
Opening connections is in general an expensive operation, so I recommend reusing them as much as possible.
In the case of MongoClient, you should be able to leave the connection open and keep reusing it. However, as the connection lives on for a longer time, you will eventually hit connectivity issues. The recommended solution is to rely on MongoClient's auto-reconnect behaviour and catch the AutoReconnect exception as part of your retry mechanism.
Here's an example of said approach, taken from http://python.dzone.com/articles/save-monkey-reliably-writing:
while True:
    time.sleep(1)
    data = {
        'time': datetime.datetime.utcnow(),
        'oxygen': random.random()
    }
    # Try for five minutes to recover from a failed primary
    for i in range(60):
        try:
            mabel_db.breaths.insert(data, safe=True)
            print 'wrote'
            break  # Exit the retry loop
        except pymongo.errors.AutoReconnect, e:
            print 'Warning', e
            time.sleep(5)
I'm creating a RESTful API which needs to access the database. I'm using Restish, Oracle, and SQLAlchemy. However, I'll try to frame my question as generically as possible, without taking Restish or other web APIs into account.
I would like to be able to set a timeout for a connection executing a query. This is to ensure that long running queries are abandoned, and the connection discarded (or recycled). This query timeout can be a global value, meaning, I don't need to change it per query or connection creation.
Given the following code:
import cx_Oracle
import sqlalchemy.pool as pool

conn_pool = pool.manage(cx_Oracle)
conn = conn_pool.connect("username/p4ss#dbname")
conn.ping()
try:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM really_slow_query")
    print cursor.fetchone()
finally:
    cursor.close()
How can I modify the above code to set a query timeout on it?
Will this timeout also apply to connection creation?
This is similar to what java.sql.Statement's setQueryTimeout(int seconds) method does in Java.
Thanks
For the query, you can use a timer together with a conn.cancel() call.
Something along these lines:
import threading

t = threading.Timer(timeout, conn.cancel)
t.start()
cursor = conn.cursor()
cursor.execute(query)
res = cursor.fetchall()
t.cancel()
On Linux, see /etc/oracle/sqlnet.ora:
sqlnet.outbound_connect_timeout = value
There are also the options tcp.connect_timeout and sqlnet.expire_time. Good luck!
You could look at setting up PROFILEs in Oracle to terminate the queries after a certain number of logical_reads_per_call and/or cpu_per_call
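A hedged sketch of what that might look like; the profile name, limit values, user name, and DBA connection string are all placeholders, and note that resource limits are only enforced when the RESOURCE_LIMIT initialization parameter is TRUE:

import cx_Oracle

# Placeholder DBA credentials; run once with sufficient privileges.
admin = cx_Oracle.connect("system/password@dbname")
cur = admin.cursor()

# LOGICAL_READS_PER_CALL is measured in database blocks,
# CPU_PER_CALL in hundredths of a second (example values).
cur.execute("""
    CREATE PROFILE limited_queries LIMIT
        LOGICAL_READS_PER_CALL 100000
        CPU_PER_CALL 3000""")
cur.execute("ALTER USER app_user PROFILE limited_queries")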
Timing Out with the System Alarm
Here's how to use the operating system timeout to do this. It's generic, and works for things other than Oracle.
import signal

class TimeoutExc(Exception):
    """This exception is raised when there's a timeout."""
    def __init__(self):
        Exception.__init__(self)

def alarmhandler(signame, frame):
    """SIGALRM handler: raises a TimeoutExc."""
    raise TimeoutExc()

nsecs = 5
signal.signal(signal.SIGALRM, alarmhandler)  # set the signal handler function
signal.alarm(nsecs)                          # in 5s, the process receives a SIGALRM
try:
    cx_Oracle.connect(...)  # do your thing: connect, query, etc.
    signal.alarm(0)         # if successful, turn off the alarm
except TimeoutExc:
    print "timed out!"      # timed out!!