Why is multiprocessing.Pool closing/not sharing SQL connections across processes?

I have a function called connection which connects to the SQL databases. It can be simplified as:
import psycopg2 as pg
def connection():
    _conn_DB = pg.connect(dbname="h2", user="h2naya", host=h2_IP, password=DB_PW)
    _conn_DB.set_client_encoding('UTF8')
    _conn_ML = pg.connect(dbname="postgres", user="h2naya", host="localhost", password=ML_PW)
    _conn_ML.set_client_encoding('UTF8')
    return {"db": _conn_DB, "ml": _conn_ML}
Now I am trying to get data within a 7-day period starting from a specified date, so I create another function:
import pandas as pd
conns = connection()
def get_latest_7d(date_string):
    with conns["db"] as conn:
        # in the real code, date_string is formatted into my_sql
        fasting_bg_latest_7d = pd.read_sql(my_sql, conn)
    return fasting_bg_latest_7d
Now I can map get_latest_7d with a list of date strings, e.g. map(get_latest_7d, ["2018-07-01", "2018-07-02", "2018-07-03"]). It works.
As the list of date strings is really long, I'd like to use multiprocessing.Pool.map to speed this up. However, the code below gives me InterfaceError: connection already closed:
from multiprocessing import Pool
with Pool() as p:
    results = p.map(get_latest_7d, ["2018-07-01", "2018-07-02", "2018-07-03"])
I tried different ways and found the only working one is moving the line conns = connection() inside get_latest_7d, and not closing the connections before returning the data frame. By saying "closing the connections" I mean:
for conn_val in conns.values():
    conn_val.close()
It seems that I need to create connections for each process, and I am not allowed to close the connections before the process ends. I am curious about:
Why can't the connection be shared across the processes?
Why does closing a connection in one process affect other processes?
Is there any better practice for my purpose?
PS: It seems to be recommended to build connections in each process but I am not sure that I fully understand.
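A minimal sketch of the "build connections in each process" idea, assuming each worker can open its own connection in a Pool initializer. sqlite3 stands in for psycopg2 here so that the sketch runs without a server; DB_PATH, create_demo_db, init_worker, count_rows_for and the readings table are made-up names for illustration, not part of the code above.

```python
import os
import sqlite3
import tempfile
from multiprocessing import Pool

# Hypothetical demo database; sqlite3 stands in for psycopg2 here.
DB_PATH = os.path.join(tempfile.gettempdir(), "per_worker_demo.db")

_conn = None  # set separately inside every worker process


def create_demo_db(db_path=DB_PATH):
    """Build a tiny table so the sketch has something to query."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute("DROP TABLE IF EXISTS readings")
        conn.execute("CREATE TABLE readings (day TEXT, value REAL)")
        conn.executemany(
            "INSERT INTO readings VALUES (?, ?)",
            [("2018-07-01", 5.1), ("2018-07-01", 5.4), ("2018-07-02", 6.0)],
        )
    conn.close()


def init_worker(db_path=DB_PATH):
    """Runs once per worker: open a connection owned by that process."""
    global _conn
    _conn = sqlite3.connect(db_path)


def count_rows_for(date_string):
    """Query through the connection of whichever process runs this."""
    cur = _conn.execute(
        "SELECT COUNT(*) FROM readings WHERE day = ?", (date_string,)
    )
    return date_string, cur.fetchone()[0]


if __name__ == "__main__":
    create_demo_db()
    dates = ["2018-07-01", "2018-07-02", "2018-07-03"]
    # Each connection is created inside a worker, never inherited from the parent.
    with Pool(2, initializer=init_worker) as pool:
        print(pool.map(count_rows_for, dates))
```

With this layout no connection object ever crosses a process boundary, and each worker's connection simply lives as long as the worker does.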

Related

Python Multiprocessing restraint. limited to only 3 connections to database

I am using Python 3.6.8 and have a function that needs to run 77 times. Each call pulls data out of PostgreSQL, runs a statistical analysis, and writes the results back to the database. Running the calls one at a time takes way too long (about 10 minutes per call), and I can only have 3 DB connections open at once, so I can run at most 3 processes at the same time. I am trying to use Pool from multiprocessing, but it seems to start all of the calls at once, which causes a "too many connections" error. Am I using Pool correctly? If not, what should I use to limit it to ONLY 3 function calls running at the same time?
def AnalysisOf3Years(data):
    FUNCTION RAN HERE
######START OF THE PROGRAM######
print("StartTime ((FINAL 77)): ", datetime.now())
con = psycopg2.connect(host="XXX.XXX.XXX.XXX", port="XXXX", user="USERNAME", password="PASSWORD", dbname="DATABASE")
cur = con.cursor()
date = datetime.today()
cur.execute("SELECT Value FROM table")
Array = cur.fetchall()
Data = []
con.close()
for value in Array:
    Data.append([value,date])
p = Pool(3)
p.map(AnalysisOf3Years,Data)
print("EndTime ((FINAL 77)): ", datetime.now())

It appears you only briefly need your database connection, with the bulk of the script's time spent processing the data. If this is the case, you may wish to pull the data out once and then write it to disk. You can then load this data fresh from disk in each new instance of your program, without having to worry about your connection limit to the database.
If you want to look into connection pooling, you may wish to use pgbouncer. This is a separate program that sits between your main program and the database, pooling the connections you give it. You are then free to write your script as a single-threaded program, and you can spawn as many instances as your machine can cope with.
It's hard to tell why your program is misbehaving, as the indentation appears to be wrong. But at a guess, it would seem that you do not create and use your pool inside a __main__ guard, which on certain OSes could lead to all sorts of weird issues.
You would expect well behaving code to look something like:
from multiprocessing import Pool
def double(x):
    return x * 2
if __name__ == '__main__':
    # the pool only gets created in the main parent process, not in the child pool processes
    with Pool(3) as pool:
        result = pool.map(double, range(5))
    assert result == [0, 2, 4, 6, 8]

You can use the SQLAlchemy Python package, which has database connection pooling as standard functionality.
It works with Postgres and many other database backends.
engine = create_engine('postgresql://me@localhost/mydb',
                       pool_size=3, max_overflow=0)
pool_size is the number of connections kept open in the pool; with max_overflow=0 it is also the maximum number of connections to the database. You can set it to 3.
This page has some examples of how to use that with Postgres:
https://docs.sqlalchemy.org/en/13/core/pooling.html
Depending on your use case, you might also be interested in
SingletonThreadPool
https://docs.sqlalchemy.org/en/13/core/pooling.html#sqlalchemy.pool.SingletonThreadPool
A Pool that maintains one connection per thread.

Tornado background task using cx_Oracle

I'm running a Bokeh server, using the underlying Tornado framework.
I need the server to refresh some data at some point. This is done by fetching rows from an Oracle DB, using cx_Oracle.
Thanks to Tornado's PeriodicCallback, the program checks every 30 seconds if new data should be loaded:
server.start()
from tornado.ioloop import PeriodicCallback
pcallback = PeriodicCallback(db_obj.reload_data_async, 10 * 1e3)
pcallback.start()
server.io_loop.start()
Where db_obj is an instance of a class which takes care of the DB related functions (connect, fetch, ...).
Basically, this is what the reload_data_async function looks like:
executor = concurrent.futures.ThreadPoolExecutor(4)
# methods of the db_obj class ...
@gen.coroutine
def reload_data_async(self):
    # ... first, some code to check if the data should be reloaded ...
    # ...
    if data_should_be_reloaded:
        new_data = yield executor.submit(self.fetch_data)
def fetch_data(self):
    """ fetch new data in the DB """
    cursor = cx.Cursor(self.db_connection)
    cursor.execute("some SQL select request that takes time (select * from ...)")
    rows = cursor.fetchall()
    # some more processing thereafter
    # ...
Basically, this works. But when I try to read the data while it's being loaded in fetch_data (by clicking for display in the GUI), the program crashes, I guess due to a race condition: the data is accessed while it is still being fetched.
I just discovered that tornado.concurrent.Future is not thread-safe:
tornado.concurrent.Future is similar to concurrent.futures.Future, but
not thread-safe (and therefore faster for use with single-threaded
event loops).
All in all, I think I should create a new thread to take care of the cx_Oracle operations. Can I do that using Tornado and keep using PeriodicCallback? How can I make my asynchronous operation thread-safe? What's the way to do this?
PS: I'm using Python 2.7
Thanks

Solved it!
@Sraw is right: it should not cause a crash.
Explanation: fetch_data() is using a cx_Oracle Connection object (self.db_connection), which is NOT thread-safe by default. Setting the threaded parameter to True wraps the shared connection with a mutex, as described in the cx_Oracle documentation:
The threaded parameter is expected to be a boolean expression which
indicates whether or not Oracle should wrap accesses to connections
with a mutex. Doing so in single threaded applications imposes a
performance penalty of about 10-15% which is why the default is False.
So in my code, I just modified the following, and it now works without crashing when the user tries to access the data while it's being refreshed:
# inside the connect method of the db_obj class
self.db_connection = cx.connect('connection string', threaded=True) # False by default

psycopg2 close connection pool

I'm developing a Flask API, and I have the following code to keep a connection pool using Psycopg2. I wonder whether I should close the connection pool when the program terminates, and if so, how?
@contextmanager
def get_cursor(cls):
    global connection_pool
    if not cls.connection_pool:
        cls.connection_pool = ThreadedConnectionPool(5, 25, dsn=PoolingWrap.generate_conn_string())
    con = cls.connection_pool.getconn()
    try:
        yield con.cursor(cursor_factory=RealDictCursor)
    finally:
        cls.connection_pool.putconn(con)

I believe that so long as you are closing the transactions in each connection correctly, then you should not have any problem just leaving the global pool.
I think the worst that can happen in that case is some connections on the DB side take a short while to work out that they've been closed on the client side - but this should not be able to cause any data consistency type issues.
However, if you really want to close the connection pool before you exit, one way to do this is to register an atexit function.
import atexit
@atexit.register
def close_connection_pool():
    global connection_pool
    connection_pool.closeall()

Python multiprocessing- write the results in the same file

I have a simple function that writes the output of some calculations in a sqlite table. I would like to use this function in parallel using multi-processing in Python. My specific question is how to avoid conflict when each process tries to write its result into the same table? Running the code gives me this error: sqlite3.OperationalError: database is locked.
import sqlite3
from multiprocessing import Pool
conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute("CREATE TABLE table_1 (id int,output int)")
def write_to_file(a_tuple):
    index = a_tuple[0]
    input = a_tuple[1]
    output = input + 1
    c.execute('INSERT INTO table_1 (id, output)' 'VALUES (?,?)', (index,output))
if __name__ == "__main__":
    p = Pool()
    results = p.map(write_to_file, [(1,10),(2,11),(3,13),(4,14)])
    p.close()
    p.join()
Traceback (most recent call last):
sqlite3.OperationalError: database is locked

Using a Pool is a good idea.
I see three possible solutions to this problem.
First, instead of having the pool worker trying to insert data into the database, let the worker return the data to the parent process.
In the parent process, use imap_unordered instead of map.
This is an iterable that starts providing values as soon as they become available. The parent can then insert the data into the database.
This will serialize the access to the database, preventing the problem.
This solution would be preferred if the data to be inserted into the database is relatively small but updates happen very often, that is, if updating the database takes as long as or longer than calculating the data.
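A minimal sketch of this first approach, assuming the calculation can be split off from the INSERT. The names compute and run are made up for illustration; the table layout follows the question.

```python
import sqlite3
from multiprocessing import Pool


def compute(a_tuple):
    """Worker: do the calculation only and return the row to insert."""
    index, value = a_tuple
    return index, value + 1


def run(db_path, inputs, processes=2):
    """Parent: owns the only connection and performs every INSERT."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS table_1 (id int, output int)")
    with Pool(processes) as pool:
        # Results arrive as soon as any worker finishes, in arbitrary order.
        for index, output in pool.imap_unordered(compute, inputs):
            conn.execute(
                "INSERT INTO table_1 (id, output) VALUES (?, ?)",
                (index, output),
            )
    conn.commit()
    rows = conn.execute("SELECT id, output FROM table_1 ORDER BY id").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run(":memory:", [(1, 10), (2, 11), (3, 13), (4, 14)]))
```

Because only the parent ever opens the database, the workers cannot collide on the sqlite lock.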
Second, you could use a Lock. A worker should then
acquire the lock,
open the database,
insert the values,
close the database,
release the lock.
This will avoid the overhead of sending the data to the parent process. But instead you may have workers stalling waiting to write their data into a database.
This would be a preferred solution if the amount of data to be inserted is large but it takes much longer to calculate the data than to insert it into the database.
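The lock-based steps above could look roughly like the sketch below. It assumes the lock is handed to the workers through the Pool initializer; init_worker, compute_and_store, create_table and the temporary file name are made up for illustration.

```python
import os
import sqlite3
import tempfile
from multiprocessing import Lock, Pool

_lock = None     # shared lock, handed to each worker by init_worker()
_db_path = None  # database file, handed to each worker by init_worker()


def init_worker(lock, db_path):
    """Runs once per worker: remember the shared lock and the file path."""
    global _lock, _db_path
    _lock = lock
    _db_path = db_path


def compute_and_store(a_tuple):
    index, value = a_tuple
    output = value + 1  # the calculation runs without holding the lock
    with _lock:  # acquire the lock; released when the block exits
        conn = sqlite3.connect(_db_path)  # open the database
        try:
            with conn:  # insert the values and commit
                conn.execute(
                    "INSERT INTO table_1 (id, output) VALUES (?, ?)",
                    (index, output),
                )
        finally:
            conn.close()  # close the database
    return index


def create_table(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS table_1 (id int, output int)")
    conn.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "lock_demo.db")
    create_table(path)
    lock = Lock()
    with Pool(2, initializer=init_worker, initargs=(lock, path)) as pool:
        pool.map(compute_and_store, [(1, 10), (2, 11), (3, 13), (4, 14)])
    conn = sqlite3.connect(path)
    print(conn.execute("SELECT id, output FROM table_1 ORDER BY id").fetchall())
    conn.close()
```

Only the short insert is serialised here; the calculation itself still runs in parallel across the workers.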
Third, you could have each worker write to its own database, and merge them afterwards. You can do this directly in sqlite or even in Python. Although with a large amount of data I'm not sure if the latter has advantages.

The database is locked to protect your data from corruption.
I believe you cannot have many processes accessing the same database at the same time, at least NOT with
conn = sqlite3.connect('test.db')
c = conn.cursor()
If each process must access the database, you should consider closing at least the cursor object c (and, perhaps less strictly, the connection object conn) within each process and reopening it when the process needs it again. Somehow, the other processes need to wait for the current one to release the lock before another process can acquire it. (There are many ways to achieve the waiting.)

Setting the isolation_level to 'EXCLUSIVE' fixed it for me:
conn = sqlite3.connect('test.db', isolation_level='EXCLUSIVE')

PySpark dataframe.foreach() with HappyBase connection pool returns 'TypeError: can't pickle thread.lock objects'

I have a PySpark job that updates some objects in HBase (Spark v1.6.0; happybase v0.9).
It sort-of works if I open/close an HBase connection for each row:
def process_row(row):
    conn = happybase.Connection(host=[hbase_master])
    # update HBase record with data from row
    conn.close()
my_dataframe.foreach(process_row)
After a few thousand upserts, we start to see errors like this:
TTransportException: Could not connect to [hbase_master]:9090
Obviously, it's inefficient to open/close a connection for each upsert. This function is really just a placeholder for a proper solution.
I then tried to create a version of the process_row function that uses a connection pool:
pool = happybase.ConnectionPool(size=20, host=[hbase_master])
def process_row(row):
    with pool.connection() as conn:
        # update HBase record with data from row
For some reason, the connection pool version of this function returns an error (see complete error message):
TypeError: can't pickle thread.lock objects
Can you see what I'm doing wrong?
Update
I saw this post and suspect I'm experiencing the same issue: Spark attempts to serialize the pool object and distribute it to each of the executors, but this connection pool object cannot be shared across multiple executors.
It sounds like I need to split the dataset into partitions, and use one connection per partition (see design patterns for using foreachrdd). I tried this, based on an example in the documentation:
def persist_to_hbase(dataframe_partition):
    hbase_connection = happybase.Connection(host=[hbase_master])
    for row in dataframe_partition:
        # persist data
    hbase_connection.close()
my_dataframe.foreachPartition(lambda dataframe_partition: persist_to_hbase(dataframe_partition))
Unfortunately, it still returns a "can't pickle thread.lock objects" error.

down the line happybase connections are just tcp connections so they cannot be shared between processes. a connection pool is primarily useful for multi-threaded applications and also proves useful for single-threaded applications that can use the pool as a global "connection factory" with connection reuse, which may simplify code because no "connection" objects need to be passed around. it also makes error recovery a bit easier.
in any case a pool (which is just a group of connections) cannot be shared between processes. trying to serialise it does not make sense for that reason. (pools use locks which causes serialisation to fail but that is just a symptom.)
perhaps you can use a helper that conditionally creates a pool (or connection) and stores it as a module-local variable, instead of instantiating it upon import, e.g.
_pool = None
def get_pool():
    global _pool
    if _pool is None:
        _pool = happybase.ConnectionPool(size=1, host=[hbase_master])
    return _pool
def process(...):
    with get_pool().connection() as connection:
        connection.table(...).put(...)
this instantiates the pool/connection upon first use instead of on import time.
