I want to use PyMongo as a logger for a Django app.
I don't mind if some inserts in the log table are lost, so I want to send a log to mongodb in another server, and continue the execution without waiting for confirmation.
I am reading pymongo docs, but it's not clear to me if the inserts in a collection are blocking or not.
I'm thinking of doing this inside a Django model method:
from pymongo import MongoClient
conn = MongoClient('mongoserver', 27017)
db = conn.main
col = db.log
col.insert({"user": "Pedro", "action": "search", "Origin": "Katmandu"}, w=0)
conn.close()
I don't know whether the insert is asynchronous when done like that, or whether the connection should be closed.
Because you pass w=0 to insert, the write is unacknowledged: the driver sends the document to the server and returns without waiting for confirmation.
Leave the connection open for best performance.
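A minimal sketch of the same idea with the newer PyMongo API, where insert_one replaces the deprecated insert and the write concern is set on the collection instead of per call. The host, database and collection names come from the question; the helper names get_log_collection and log_action are made up for illustration, and the import is deferred only so that the sketch loads where pymongo is not installed.

```python
_client = None  # one long-lived client per process, reused by every call


def get_log_collection(host="mongoserver", port=27017):
    """Return the log collection configured for unacknowledged writes (w=0)."""
    global _client
    # Deferred import: an assumption made so this sketch loads without pymongo.
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    if _client is None:
        _client = MongoClient(host, port)  # connects lazily, on first operation
    return _client.main.get_collection("log", write_concern=WriteConcern(w=0))


def log_action(collection, user, action, origin):
    """Send one log document; with w=0 the driver does not wait for a reply."""
    doc = {"user": user, "action": action, "Origin": origin}
    collection.insert_one(doc)
    return doc
```

A model method would then call log_action(get_log_collection(), "Pedro", "search", "Katmandu") and never close the client between requests.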
Preface
I want to process tasks listed in a database table in parallel. Not looking for working code.
The Setup
1 PostgreSQL database server D
1 processing server P
1 User terminal T
using Python 3.6, psycopg2 2.7.6, PostgreSQL 11
D holds tables with data to be processed and a tasks table. A user at T ssh's into P, where the following command can be issued:
python -m core.utils.task
This task.py script is essentially a while loop that gets a task t with the status 'new' from the tasks table on D until there are no new tasks left. A task t is basically a set of arguments for another function called do_something(t). do_something(t) itself makes many connections to D to get the data that needs to be processed and sets the task's status to 'done' once it has finished; the while loop then starts over and gets a new task.
In order to run python -m core.utils.task multiple times, I open multiple ssh connections. Not so good, I know; threading or multiprocessing would be better. But this is just for testing whether I can run the mentioned command twice.
There is a script called pgsql.py that manages all the database interactions; it is needed to get a task and is then used by do_something(t). I adapted a singleton pattern from this SE post.
Pseudo-Code (mostly)
task.py
import mymodule
import pgsql

def main():
    while True:
        r, c = pgsql.SQL.select_task()  # rows and columns
        task = dotdict(dict(zip(c, r[0])))
        mymodule.do_something(task)

if __name__ == "__main__":
    main()
mymodule.py
import pgsql

def do_something(t):
    input = pgsql.SQL.get_images(t.table, t.schema, t.image_id, t.image_directory)
    some_other_function(input)
    pgsql.SQL.task_status(t.task_id, 'done')
pgsql.py
import psycopg2 as pg

class Postgres(object):
    """Adapted from https://softwareengineering.stackexchange.com/a/358061/348371"""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
            db_config = {'dbname': 'dev01', 'host': 'XXXXXXXX',
                         'password': 'YYYYY', 'port': 5432, 'user': 'admin'}
            try:
                print('connecting to PostgreSQL database...')
                connection = Postgres._instance.connection = pg.connect(**db_config)
                connection.set_session(isolation_level='READ COMMITTED', autocommit=True)
            except Exception as error:
                print('Error: connection not established {}'.format(error))
                Postgres._instance = None
            else:
                print('connection established')
        return cls._instance

    def __init__(self):
        self.connection = self._instance.connection

    def query(self, query):
        try:
            with self.connection.cursor() as cur:
                cur.execute(query)
                rows = cur.fetchall()
                cols = [desc[0] for desc in cur.description]
        except Exception as error:
            print('error executing query "{}", error: {}'.format(query, error))
            return None
        else:
            return rows, cols

    def __del__(self):
        self.connection.close()

db = Postgres()

class SQL():
    def select_task():
        s = """
            UPDATE schema.tasks
            SET status = 'ready'
            WHERE task_id = ( SELECT task_id
                              FROM schema.tasks
                              WHERE tasks.status = 'new'
                              LIMIT 1)
            RETURNING *
            ;
            """
        return Postgres.query(db, s)

    def task_status(id, status):
        s = """
            UPDATE
                schema.tasks
            SET
                status = '{s}'
            WHERE
                tasks.task_id = '{id}'
            ;
            """.format(s=status,
                       id=id)
        return Postgres.query(db, s)
Problem
This works with one ssh connection. Tasks are retrieved from the database and processed, once finished the task is set to 'done'. Once I open a second ssh connection in a second terminal to run python -m core.utils.task (so to say, in parallel) the exact same rows of the tasks table are processed in both - ignoring that they have been updated.
Question
What are your suggestions to get this to work? There are millions of tasks and I need to run them in parallel. Before implementing threading or multiprocessing I wanted to test it with multiple ssh connections first, bad idea? I have fiddled around with the isolation levels and autocommit settings in psycopg2's set_session() but without luck. I checked the sessions in the Database server and can see that each process of python -m core.utils.task has its own PID, only connecting once, exactly like this singleton pattern should work. Any ideas or pointers how to deal with this are much appreciated!
The main problem is that performing one task is not an atomic operation. Therefore, the same task can be processed several times in different ssh sessions.
In this implementation, you can try using an "INPROGRESS" status so that tasks that are already being processed are not retrieved again. Be sure to use autocommit.
That said, I would implement this with threads and a database connection pool, and extract tasks in batches using OFFSET and LIMIT. The do_something, select_task and task_status functions would then each operate on a batch of tasks.
Also, there is no need to implement the Postgres class as a singleton.
Amended (see the comments below)
You can add FOR UPDATE SKIP LOCKED to the SQL query in the current implementation (see url).
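A minimal sketch of how the question's select_task query could look with that clause. The table and column names come from the question; the function name claim_task and the ORDER BY are assumptions, and the function takes any DB-API connection that commits the statement (for example the question's psycopg2 connection with autocommit=True). SKIP LOCKED needs PostgreSQL 9.5 or later.

```python
CLAIM_TASK_SQL = """
UPDATE schema.tasks
SET status = 'ready'
WHERE task_id = (
    SELECT task_id
    FROM schema.tasks
    WHERE status = 'new'
    ORDER BY task_id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
"""


def claim_task(connection):
    """Claim one 'new' task, or return None when no unclaimed task is left."""
    with connection.cursor() as cur:
        cur.execute(CLAIM_TASK_SQL)
        row = cur.fetchone()
        if row is None:
            return None
        # Column names of the RETURNING * result, in row order.
        cols = [desc[0] for desc in cur.description]
    return dict(zip(cols, row))
```

The inner SELECT locks the row it picks and skips rows that another session has already locked, so two processes running the loop at the same time should not receive the same task.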
If you want to work with batches, then partition the data by some serial column (or simply sort the data in the table).
My implementation using batches.
This can be implemented using ThreadPoolExecutor and PersistentConnectionPool.
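The answer's own batch code is not shown here, so the following is only a rough sketch of the general shape, under stated assumptions: tasks have a serial task_id, and process_batch is a hypothetical callable that would take a connection from a pool (such as psycopg2's PersistentConnectionPool), process the tasks whose ids fall in the given range, and mark them 'done'. Only the splitting and the thread pool are shown.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(first_id, last_id, batch_size):
    """Yield inclusive (start, end) id ranges covering first_id..last_id."""
    for start in range(first_id, last_id + 1, batch_size):
        yield start, min(start + batch_size - 1, last_id)


def run_batches(process_batch, first_id, last_id, batch_size=1000, workers=4):
    """Run process_batch(start, end) for every range on a pool of threads."""
    batches = list(split_into_batches(first_id, last_id, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # map() keeps results in batch order and re-raises worker exceptions.
        results = list(executor.map(lambda b: process_batch(*b), batches))
    return results
```

Because the id ranges are disjoint, no two workers are handed the same task, which avoids the duplicate processing described in the question without relying on row locks.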
I have an old, large project based on Python 2.7 and the Tornado framework. To work with MySQL, it initially used Tornado-MySQL with raw SQL queries, and that worked well, but now it must use MySQL 8, and that library is obsolete and unmaintained.
So I have now set up the TorMySQL library. It connects fine to MySQL Server 8, but I don't fully understand how to use it, and this leads to bugs.
In one of the project's files I wrote this code to access the database:
from tornado import gen
from tornado.gen import Return
from tornado.ioloop import IOLoop
import tormysql
import settings

POOL = tormysql.ConnectionPool(
    max_connections=20,
    idle_seconds=7200,  # timeout time, 0 is not timeout
    wait_connection_timeout=3,
    host='127.0.0.1',
    port=3306,
    user=settings.MYSQL_USER,
    passwd=settings.MYSQL_PASSWORD,
    db='aivanf',
    use_unicode=True,
    charset='utf8mb4')
@gen.coroutine
def executePool(query, params):
    with (yield POOL.Connection()) as conn:
        with conn.cursor() as cursor:
            try:
                yield cursor.execute(query, params)
            except Exception as ex:
                print('Exception!\n{}'.format(ex))
                yield conn.rollback()
                raise Return(None)
            else:
                first = query[:10].lower()
                if 'update' in first or 'insert' in first:
                    yield conn.commit()
                if 'select' in first:
                    raise Return(cursor.fetchall())
                else:
                    raise Return(None)
I use the ifs because this single function is called with different types of queries. I know it's ugly, but it works fine. Similar but even simpler code for Tornado-MySQL worked perfectly, but only with MySQL 5.7.
However, some UPDATE / INSERT queries seem to be skipped, and I get these messages:
(1213, u'Deadlock found when trying to get lock; try restarting transaction')
WARNING:root:Connection maybe not release, used time 25.32s {'port': 3306, 'host': '127.0.0.1', 'user': '...', 'database': '...'} <3,2>.
Also, sometimes different clients of the server see different versions of the data, as if they had different connections, each with its own uncommitted data.
How can I solve this problem?
I suppose the problem is with the pool. Maybe I have to close or recreate it? The TorMySQL page also has this line: yield pool.close()
You probably have to call conn.commit() even after a SELECT query. Otherwise a run of SELECT queries is executed within the same transaction as the first one and, under MySQL's default REPEATABLE READ isolation, keeps seeing the same snapshot of the data.
I think most users are accustomed to "autocommit" by default, but that does not seem to be the default mode for TorMySQL.
(I was confused the same as you were, for the first couple days of using TorMySQL :)
I have a class that creates a MongoClient inside:
db = MongoDB('mydb', 'config')
I am successfully able to connect to the 'mydb' database and the 'config' collection, but after querying the collection I no longer need this connection. I then create a connection to another database and collection:
db = MongoDB('mapping', 'box_details')
In that case, how can I close the previous connection to the database? Or is it closed automatically when the app exits?
I'd recommend opening the connection with pymongo.MongoClient, which returns a mongo_client object. mongo_client has an instance method, close, that lets you close the connection manually.
Please see documentation about mongo_client
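As a rough sketch of that advice: the question does not show the MongoDB class, so the constructor signature below is an assumption. The client is passed in so that one MongoClient can serve both databases and be closed once, instead of opening a new client per database and collection.

```python
class MongoDB:
    """Hypothetical wrapper: one database/collection pair on a given client."""

    def __init__(self, client, db_name, collection_name):
        self.client = client
        self.collection = client[db_name][collection_name]

    def close(self):
        # With a real MongoClient, close() releases the client's connections.
        self.client.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False
```

With pymongo installed, client would be a pymongo.MongoClient instance; MongoDB(client, 'mydb', 'config') and MongoDB(client, 'mapping', 'box_details') can then share it, and close() is called when the last of them is no longer needed.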
There are many questions about whether a Django db connection is thread-safe, but they all seem to be asking about the default request threads.
What if I am writing a custom script that uses a database connection in threads:
from django.db import connections
import threading

class Transform(object):
    def transform_data(self, listing):
        cursor = self.connection.cursor()
        cursor.execute('SELECT ... WHERE id = %s', listing.id)
        data = cursor.fetchall()
        ...

    def run(self):
        connection = self.connections['legacy']
        for listing in listings:
            threading.Thread(target=self.transform_data, args=[listing])
How safe is the data inside the transform_data thread, in the sense that the result from the cursor is not mixed up with other threads?
Ideally each thread should be using its own connection. If you do that when you execute the select query inside transform_data you are essentially getting a snapshot of the data at that point in time. You can retrieve the rows without having to worry about their being updated or deleted by other threads provided that the other threads have their own connection.
If all threads share the same connection, what exactly happens depends heavily on which database you are using and on the transaction isolation level.
Each item in the connections object returns a thread-local connection to that database. By default, these connections cannot be shared between threads; attempting to do so will result in a DatabaseError.
Always use connections[alias] within the thread that executes your queries. Never access connections[alias] in the parent thread and pass the object to the child thread. This will ensure that every connection object you use is local to the current thread, avoiding any threading issues.
To fix your code and make it thread-safe, you would change it like this:
from django.db import connections
import threading

class Transform(object):
    def transform_data(self, listing):
        # Access the database connection on the global `connections` object
        # from within the child thread.
        cursor = connections['legacy'].cursor()
        # Query parameters are passed as a sequence.
        cursor.execute('SELECT ... WHERE id = %s', [listing.id])
        data = cursor.fetchall()
        ...

    def run(self):
        for listing in listings:
            # Start the thread; it obtains its own thread-local connection.
            threading.Thread(target=self.transform_data, args=[listing]).start()
So I am building a Mongo database class that will provide access for inserting documents to the insertion service and access for viewing documents to a querying service. Right now I have the following for my database.py class:
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
db_connection = client['my_database']

class DB_Object(object):
    """ A class providing structure and access to the Database """

    def add_document(self, json_obj):
        coll = db_connection["some collection"]
        document = {
            "name" : "imma name",
            "raw value" : 777,
            "converted value" : 333
        }
        coll.insert(document)

    def query_response(self, query):
        """query logic here"""
If I want concurrent queries and inserts, with this class being called by multiple services, is this the correct location for the lines:
client = pymongo.MongoClient('mongodb://localhost:27017/')
db_connection = client['my_database']
And is this a standard way to provide access?
Your code is correct. You should continue to use the same MongoClient instance for all operations in your application. This ensures that all operations share the same connection pool and use as few connections as possible, which maximizes efficiency. MongoClient is thread-safe, so this works even if you have concurrent operations on multiple threads.