I have a script which uses the multiprocessing module and the Django ORM.
The scenario is quite simple:
get data,
create n processes and assign one part of the data to each process,
do something,
save into the DB with the Django ORM.
Well, in step 4 I have the problem that Django doesn't save all the data in the DB. The functions which prepare the data are correct and checked, but I don't know where the problem with Django is. Before the processes are created, the old connection is closed, so that each process can have its own connection.
Is there some problem with Django and multiprocessing, or with the maximum number of connections (I use 4 processes at most)?
Example code:
connection.close()

# part where I call the function "fun" and send the data
p = Process(target=fun, args=(i, data))

def fun(i, data):
    result_1 = some_other_fun(data)   # this is a list
    result_2 = some_other_fun2(data)  # this is a model specified in models.py
    save_data(result_1, result_2)

def save_data(res1, res2):
    for row in res1:
        row.fk_to_another_table = res2
        row.some_info = func()
        row.save()
Long story short: if I call save_data from a Process, the save() method doesn't save the row in the table. If that method is called without using Process (just a normal run of the script), everything is OK.
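For reference, here is a minimal, self-contained sketch of the pattern described above (close the inherited connection before forking so each child opens its own). The settings module, MyModel and the sample data are placeholders, not the original project's code:

import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # hypothetical project

import django
django.setup()

from multiprocessing import Process
from django.db import connection
from myapp.models import MyModel  # hypothetical model

def fun(i, chunk):
    # each child process opens its own connection lazily on its first query
    for value in chunk:
        MyModel.objects.create(some_info=value)

if __name__ == '__main__':
    chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
    connection.close()  # drop the parent's connection before forking
    procs = [Process(target=fun, args=(i, chunk)) for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()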
Related
I am using Python 3.6.8 and have a function that needs to run 77 times. I am passing in data that is pulled out of PostgreSQL, doing a statistical analysis, and then putting the results back into the database. Running the calls one at a time takes way too long (about 10 minutes per call), and I can only have 3 DB connections open at one time, so I can only run 3 processes at the same time. I am trying to use the Pool class from the multiprocessing module, but it is trying to start all of them at once, which causes a "too many connections" error. Am I using Pool correctly, and if not, what should I use to limit it to ONLY 3 functions starting and finishing at the same time?
from datetime import datetime
from multiprocessing import Pool

import psycopg2

def AnalysisOf3Years(data):
    ...  # FUNCTION RAN HERE

###### START OF THE PROGRAM ######
print("StartTime ((FINAL 77)): ", datetime.now())
con = psycopg2.connect(host="XXX.XXX.XXX.XXX", port="XXXX", user="USERNAME",
                       password="PASSWORD", dbname="DATABASE")
cur = con.cursor()
date = datetime.today()
cur.execute("SELECT Value FROM table")
Array = cur.fetchall()
Data = []
con.close()
for value in Array:
    Data.append([value, date])
p = Pool(3)
p.map(AnalysisOf3Years, Data)
print("EndTime ((FINAL 77)): ", datetime.now())
It appears you only briefly need your database connection, with the bulk of the script's time spent processing the data. If this is the case, you may wish to pull the data out once and write it to disk. You can then load this data fresh from disk in each new instance of your program, without having to worry about your connection limit to the database.
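For example, a rough sketch of the pull-once-then-read-from-disk approach (the connection details and query are copied from your code; the pickle file name is illustrative):

import pickle
import psycopg2

# run once: pull the rows and store them locally
con = psycopg2.connect(host="XXX.XXX.XXX.XXX", port="XXXX", user="USERNAME",
                       password="PASSWORD", dbname="DATABASE")
cur = con.cursor()
cur.execute("SELECT Value FROM table")
rows = cur.fetchall()
con.close()

with open('data.pkl', 'wb') as f:
    pickle.dump(rows, f)

# each worker/instance can then load the data without touching the database
with open('data.pkl', 'rb') as f:
    rows = pickle.load(f)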
If you want to look into connection pooling, you may wish to use pgbouncer. This is a separate program that sits between your main program and the database, pooling the number of connections you give it. You are then free to write your script as a single-threaded program, and you can spawn as many instances as your machine can cope with.
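For context, a minimal sketch of what the client side might look like once pgbouncer sits in front of the database; it assumes pgbouncer listens on its usual default port 6432 and exposes a pool under the same database name (host and credentials are illustrative):

import psycopg2

# connect to pgbouncer instead of Postgres directly; pgbouncer forwards the
# query over one of the pooled server connections it manages
con = psycopg2.connect(host="127.0.0.1", port="6432",
                       user="USERNAME", password="PASSWORD", dbname="DATABASE")
cur = con.cursor()
cur.execute("SELECT Value FROM table")
rows = cur.fetchall()
con.close()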
It's hard to tell why your program is misbehaving, as the indentation in the post appears to be wrong. But at a guess it would seem like you do not create and use your pool inside a __main__ guard, which on certain OSes could lead to all sorts of weird issues.
You would expect well-behaved code to look something like:
from multiprocessing import Pool

def double(x):
    return x * 2

if __name__ == '__main__':
    # means the pool only gets created in the main parent process,
    # not in the child pool processes
    with Pool(3) as pool:
        result = pool.map(double, range(5))
    assert result == [0, 2, 4, 6, 8]
You can use the SQLAlchemy Python package, which has database connection pooling as standard functionality.
It does work with Postgres and many other database backends.
from sqlalchemy import create_engine

engine = create_engine('postgresql://me@localhost/mydb',
                       pool_size=3, max_overflow=0)
pool_size is the maximum number of connections kept open to the database; you can set it to 3.
This page has some examples of how to use it with Postgres:
https://docs.sqlalchemy.org/en/13/core/pooling.html
Based on your use case, you might also be interested in
SingletonThreadPool
https://docs.sqlalchemy.org/en/13/core/pooling.html#sqlalchemy.pool.SingletonThreadPool
A Pool that maintains one connection per thread.
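As a rough sketch (the connection URL is illustrative), you would select that pool class explicitly when creating the engine:

from sqlalchemy import create_engine
from sqlalchemy.pool import SingletonThreadPool

# poolclass selects the one-connection-per-thread pool implementation
engine = create_engine('postgresql://me@localhost/mydb',
                       poolclass=SingletonThreadPool)

with engine.connect() as conn:
    rows = conn.execute("SELECT 1").fetchall()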
I'm running a Bokeh server, using the underlying Tornado framework.
I need the server to refresh some data at some point. This is done by fetching rows from an Oracle DB, using cx_Oracle.
Thanks to Tornado's PeriodicCallback, the program periodically checks whether new data should be loaded:
server.start()

from tornado.ioloop import PeriodicCallback

# check every 10 seconds (10 * 1e3 ms)
pcallback = PeriodicCallback(db_obj.reload_data_async, 10 * 1e3)
pcallback.start()

server.io_loop.start()
where db_obj is an instance of a class which takes care of the DB-related functions (connect, fetch, ...).
Basically, this is what the reload_data_async function looks like:
import concurrent.futures
import cx_Oracle as cx
from tornado import gen

executor = concurrent.futures.ThreadPoolExecutor(4)

# methods of the db_obj class ...
@gen.coroutine
def reload_data_async(self):
    # ... first, some code to check if the data should be reloaded ...
    # ...
    if data_should_be_reloaded:
        new_data = yield executor.submit(self.fetch_data)

def fetch_data(self):
    """Fetch new data from the DB."""
    cursor = cx.Cursor(self.db_connection)
    cursor.execute("some SQL select request that takes time (select * from ...)")
    rows = cursor.fetchall()
    # some more processing thereafter
    # ...
Basically, this works. But when I try to read the data while it's being loaded in fetch_data (by clicking to display it in the GUI), the program crashes due to a race condition (I guess?): it's accessing the data while it's being fetched at the same time.
I just discovered that Tornado's Future objects are not thread-safe:
tornado.concurrent.Future is similar to concurrent.futures.Future, but not thread-safe (and therefore faster for use with single-threaded event loops).
All in all, I think I should create a new thread to take care of the cx_Oracle operations. Can I do that with Tornado and keep using the PeriodicCallback function? How can I make my asynchronous operation thread-safe? What's the way to do this?
PS: I'm using Python 2.7.
Thanks
Solved it!
@Sraw is right: it should not cause a crash.
Explanation: fetch_data() is using a cx_Oracle Connection object (self.db_connection), which is NOT thread-safe by default. Setting the threaded parameter to True wraps the shared connection with a mutex, as described in the cx_Oracle documentation:
The threaded parameter is expected to be a boolean expression which indicates whether or not Oracle should wrap accesses to connections with a mutex. Doing so in single threaded applications imposes a performance penalty of about 10-15% which is why the default is False.
So in my code, I just modified the following, and it now works without crashing when the user tries to access data while it's being refreshed:
# inside the connect method of the db_obj class
self.db_connection = cx.connect('connection string', threaded=True) # False by default
I am new to Celery & Python and have cursory knowledge of both.
I have multiple Ubuntu servers which all run multiple Celery workers (10-15).
Each of these workers needs to perform a certain task using a third-party library/DLL. For that we first need to instantiate its class object and store it (somehow, in memory).
Then the Celery workers read RabbitMQ queues to execute certain tasks which use that class object's methods.
The goal is to instantiate the third-party class object once (when the Celery worker starts) and then, on each task execution, reuse the class instance's methods, over and over.
I don't want to use Redis, as it seems like too much overhead to store such a tiny amount of data (a class object).
I need help figuring out how to store this instantiated class object per worker. If the worker fails or crashes we obviously instantiate the class again, which is not a problem. Any help, specifically a code sample, would help a lot.
As an analogy, my requirement is similar to having a unique database connection per worker and reusing the same connection on every request.
Updated with some poorly written code for that:
tasks.py
from celery import Celery, Task

# declares the config file and this worker file
mycelery = Celery('tasks')
mycelery.config_from_object('celeryconfig2')

class IQ(Task):
    _db = None

    @property
    def db(self):
        if self._db is None:
            print 'establish DB connection....'
            self._db = Database.Connect()
        return self._db

@mycelery.task(base=IQ)
def indexIQ():
    print 'calling indexing.....'
    if indexIQ.db is None:
        print "DB connection doesn't exist. Let's create one..."
        ....
        ....
    print 'continue indexing!'
main.py
from tasks import *
indexIQ.apply_async()
indexIQ.apply_async()
indexIQ.apply_async()
print 'end program'
Expected output
calling indexing.....
DB connection doesn't exist. Let's create one...
establish DB connection....
continue indexing!
calling indexing.....
continue indexing!
calling indexing.....
continue indexing!
Unfortunately, I am getting the first 4 lines of output every time, which means the DB connection is being established on each task execution. What am I doing wrong?
Thanks
I have a simple function that writes the output of some calculations into a SQLite table. I would like to run this function in parallel using multiprocessing in Python. My specific question is how to avoid conflicts when each process tries to write its result into the same table. Running the code gives me this error: sqlite3.OperationalError: database is locked.
import sqlite3
from multiprocessing import Pool

conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute("CREATE TABLE table_1 (id int, output int)")

def write_to_file(a_tuple):
    index = a_tuple[0]
    input = a_tuple[1]
    output = input + 1
    c.execute('INSERT INTO table_1 (id, output) VALUES (?,?)', (index, output))

if __name__ == "__main__":
    p = Pool()
    results = p.map(write_to_file, [(1, 10), (2, 11), (3, 13), (4, 14)])
    p.close()
    p.join()
Traceback (most recent call last):
sqlite3.OperationalError: database is locked
Using a Pool is a good idea.
I see three possible solutions to this problem.
First, instead of having the pool worker try to insert data into the database, let the worker return the data to the parent process.
In the parent process, use imap_unordered instead of map.
This is an iterable that starts providing values as soon as they become available. The parent can then insert the data into the database.
This will serialize access to the database, preventing the problem.
This solution would be preferred if the data to be inserted into the database is relatively small but updates happen very often, i.e. if it takes the same amount of time or more to update the database than to calculate the data.
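A rough sketch of this first approach, reusing the example data from the question (the table is created here if it doesn't already exist):

import sqlite3
from multiprocessing import Pool

def compute(a_tuple):
    # heavy work happens in the worker; no database access here
    index, value = a_tuple
    return index, value + 1

if __name__ == '__main__':
    conn = sqlite3.connect('test.db')
    conn.execute('CREATE TABLE IF NOT EXISTS table_1 (id int, output int)')
    with Pool() as pool:
        # results arrive as soon as any worker finishes; only the parent writes
        for index, output in pool.imap_unordered(compute, [(1, 10), (2, 11), (3, 13), (4, 14)]):
            conn.execute('INSERT INTO table_1 (id, output) VALUES (?, ?)', (index, output))
    conn.commit()
    conn.close()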
Second, you could use a Lock. A worker should then
acquire the lock,
open the database,
insert the values,
close the database,
release the lock.
This will avoid the overhead of sending the data to the parent process. But instead you may have workers stalling while waiting to write their data into the database.
This would be a preferred solution if the amount of data to be inserted is large but it takes much longer to calculate the data than to insert it into the database.
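A rough sketch of the lock-based approach; note that a multiprocessing.Lock has to be shared with the pool workers through an initializer rather than passed as a task argument:

import sqlite3
from multiprocessing import Pool, Lock

def init_worker(shared_lock):
    # make the lock available inside each pool worker process
    global lock
    lock = shared_lock

def write_to_db(a_tuple):
    index, value = a_tuple
    output = value + 1
    with lock:  # acquire the lock: only one worker touches the file at a time
        conn = sqlite3.connect('test.db')
        with conn:  # commits on success
            conn.execute('INSERT INTO table_1 (id, output) VALUES (?, ?)',
                         (index, output))
        conn.close()  # close the database, then the lock is released

if __name__ == '__main__':
    setup = sqlite3.connect('test.db')
    setup.execute('CREATE TABLE IF NOT EXISTS table_1 (id int, output int)')
    setup.commit()
    setup.close()

    lock = Lock()
    with Pool(4, initializer=init_worker, initargs=(lock,)) as pool:
        pool.map(write_to_db, [(1, 10), (2, 11), (3, 13), (4, 14)])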
Third, you could have each worker write to its own database and merge them afterwards. You can do this directly in SQLite or in Python, although with a large amount of data I'm not sure the latter has advantages.
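And a rough sketch of this third approach, merging per-worker database files in Python with SQLite's ATTACH (file names are illustrative):

import sqlite3
from multiprocessing import Pool, current_process

def write_own_db(a_tuple):
    index, value = a_tuple
    # one database file per worker process, named after the worker
    path = 'part_{}.db'.format(current_process().name)
    conn = sqlite3.connect(path)
    with conn:
        conn.execute('CREATE TABLE IF NOT EXISTS table_1 (id int, output int)')
        conn.execute('INSERT INTO table_1 (id, output) VALUES (?, ?)', (index, value + 1))
    conn.close()
    return path

if __name__ == '__main__':
    with Pool(3) as pool:
        paths = set(pool.map(write_own_db, [(1, 10), (2, 11), (3, 13), (4, 14)]))

    main = sqlite3.connect('test.db')
    main.execute('CREATE TABLE IF NOT EXISTS table_1 (id int, output int)')
    for path in paths:
        # merge each partial database into the main one
        main.execute('ATTACH DATABASE ? AS part', (path,))
        main.execute('INSERT INTO table_1 SELECT id, output FROM part.table_1')
        main.commit()
        main.execute('DETACH DATABASE part')
    main.close()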
The database is locked to protect your data from corruption.
I believe you cannot have many processes accessing the same database at the same time, at least NOT with
conn = sqlite3.connect('test.db')
c = conn.cursor()
If each process must access the database, you should consider closing at least the cursor object c (and, perhaps less strictly, the connection object conn) within each process and reopening it when the process needs it again. Somehow, the other processes need to wait for the current one to release the lock before another process can acquire it. (There are many ways to achieve the waiting.)
Setting the isolation_level to 'EXCLUSIVE' fixed it for me:
conn = sqlite3.connect('test.db', isolation_level='EXCLUSIVE')
I'm encountering a pretty frustrating error that pops up whenever one of my API endpoints is accessed. To give context, the application I am working on is a Flask app using SQLAlchemy that stores data in a PostgreSQL database set to hold 1000 connections.
One of the ways users can query said data is through the /timeseries endpoint. The data is returned as json, which is assembled from the ResultProxies returned from querying the database.
The hope was that by using multithreading, I could make the method invoked by the view controller for /timeseries run faster, as our original setup takes too long to respond to queries which would return large volumes of data.
I've read many other posts with the same problem due to not cleaning up sessions properly, but I feel as though I have that covered. Anything glaringly wrong with the code I've written?
The app is deployed with AWS Elastic Beanstalk.
@classmethod
def timeseries_all(cls, table_names, agg_unit, start, end, geom=None):
    """
    For each candidate dataset, query the matching timeseries and push datasets with nonempty
    timeseries into a list to convert to JSON and display.

    :param table_names: list of tables to generate timetables for
    :param agg_unit: a unit of time to divide up the data by (day, week, month, year)
    :param start: starting date to limit query
    :param end: ending date to limit query
    :param geom: geometric constraints of the query
    :returns: timeseries list to display
    """
    threads = []
    timeseries_dicts = []

    # set up engine for use with threading
    psql_db = create_engine(DATABASE_CONN, pool_size=10, max_overflow=-1, pool_timeout=100)
    scoped_sessionmaker = scoped_session(sessionmaker(bind=psql_db, autoflush=True, autocommit=True))

    def fetch_timeseries(t_name):
        _session = scoped_sessionmaker()
        # retrieve MetaTable object to call timeseries from
        table = MetaTable.get_by_dataset_name(t_name)
        # retrieve ResultProxy from executing timeseries selection
        rp = _session.execute(table.timeseries(agg_unit, start, end, geom))

        # empty results will just have a header
        if rp.rowcount > 0:
            timeseries = {
                'dataset_name': t_name,
                'items': [],
                'count': 0
            }
            for row in rp.fetchall():
                timeseries['items'].append({'count': row.count, 'datetime': row.time_bucket.date()})
                timeseries['count'] += row.count
            # load to outer storage
            timeseries_dicts.append(timeseries)

        # clean up session
        rp.close()
        scoped_sessionmaker.remove()

    # create a new thread for every table to query
    for name in table_names:
        thread = threading.Thread(target=fetch_timeseries, args=(name,))
        threads.append(thread)

    # start all threads
    for thread in threads:
        thread.start()

    # wait for all threads to finish
    for thread in threads:
        thread.join()

    # release all connections associated with this engine
    psql_db.dispose()

    return timeseries_dicts
I think that you are going about this in a bit of a roundabout way. Here are some suggestions on getting the most out of your postgres connections (I have used this configuration in production).
I would be using the Flask-SQLAlchemy extension to handle the connections to your Postgres instance. If you look at the SQLAlchemy docs you will see that the author highly recommends using this to handle the db connection lifecycle as opposed to rolling your own.
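For reference, a minimal sketch of wiring up Flask-SQLAlchemy with an explicit pool size; the connection string is illustrative, and the SQLALCHEMY_ENGINE_OPTIONS key assumes a recent Flask-SQLAlchemy release:

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
# hypothetical connection string
app.config['SQLALCHEMY_DATABASE_URI'] = 'postgresql://user:password@localhost/mydb'
# forwarded to SQLAlchemy's create_engine; keeps at most 10 pooled connections
app.config['SQLALCHEMY_ENGINE_OPTIONS'] = {'pool_size': 10, 'max_overflow': 0}
db = SQLAlchemy(app)

# queries made through db.session now draw connections from the shared pool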
A more performant way to handle lots of requests is to put your Flask application behind a WSGI server like gunicorn or uWSGI. These servers are able to spawn multiple instances of your application. Then, when someone hits your endpoint, the connections are load-balanced between those instances.
So, for example, if you had uWSGI set up to run 5 processes, you would be able to handle 50 DB connections simultaneously (5 app instances x a pool of 10 connections each).
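For illustration, a minimal gunicorn configuration sketch matching that example (gunicorn config files are plain Python; the file name, module name and values are illustrative):

# gunicorn.conf.py - run with: gunicorn -c gunicorn.conf.py myapp:app
# "myapp:app" is a placeholder for the real application module
workers = 5            # five independent application instances
bind = '0.0.0.0:8000'  # address the WSGI server listens on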