I'm using SQLAlchemy and multiprocessing. I also use scoped_session, since it avoids sharing the same session, but I've found an error and its solution, and I don't understand why it happens.
You can see my code below:
db.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine(connection_string)
Session = sessionmaker(bind=engine)
DBSession = scoped_session(Session)
script.py
from multiprocessing import Pool, current_process
from db import DBSession
def process_feed(test):
    session = DBSession()
    print(current_process().name, session)

def run():
    session = DBSession()
    pool = Pool()
    print(current_process().name, session)
    pool.map_async(process_feed, [1, 2]).get()

if __name__ == "__main__":
    run()
When I run script.py, the output is:
MainProcess <sqlalchemy.orm.session.Session object at 0xb707b14c>
ForkPoolWorker-1 <sqlalchemy.orm.session.Session object at 0xb707b14c>
ForkPoolWorker-2 <sqlalchemy.orm.session.Session object at 0xb707b14c>
Note that the session object is the same (0xb707b14c) in the main process and its workers (the child processes).
BUT if I change the order of the first two lines of run():
def run():
    pool = Pool()             # <--- Now pool is instantiated on the first line
    session = DBSession()     # <--- Now session is instantiated on the second line
    print(current_process().name, session)
    pool.map_async(process_feed, [1, 2]).get()
And then when I run script.py again, the output is:
MainProcess <sqlalchemy.orm.session.Session object at 0xb66907cc>
ForkPoolWorker-1 <sqlalchemy.orm.session.Session object at 0xb669046c>
ForkPoolWorker-2 <sqlalchemy.orm.session.Session object at 0xb66905ec>
Now the session instances are different.
To understand why this happens, you need to understand what scoped_session and Pool actually do. scoped_session keeps a registry of sessions so that the following happens:
the first time you call DBSession, it creates a Session object for you in the registry
subsequently, if necessary conditions are met (i.e. same thread, session has not been closed), it does not create a new Session object and instead returns you the previously created Session object back
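You can see the registry behaviour in isolation with a minimal sketch (assuming the db.py above is importable):

from db import DBSession

a = DBSession()
b = DBSession()
print(a is b)        # True: same thread, so the registry hands back the same Session

DBSession.remove()   # clear this thread's entry from the registry
c = DBSession()
print(a is c)        # False: the next call creates a fresh Session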
When you create a Pool, it creates the workers in the __init__ method. (Note that there's nothing fundamental about starting the worker processes in __init__. An equally valid implementation could wait until workers are first needed before it starts them, which would exhibit different behavior in your example.) When this happens (on Unix), the parent process forks itself for every worker process, which involves the operating system copying the memory of the current running process into a new process, so you will literally get the exact same objects in the exact same places.
Putting these two together, in the first example you are creating a Session before forking, which gets copied over to all worker processes during the creation of the Pool, resulting in the same identity, while in the second example you delay the creation of the Session object until after the worker processes have started, resulting in different identities.
It's important to note that while the Session objects share the same id, they are not the same object, in the sense that if you change anything about the Session in the parent process, the changes will not be reflected in the child processes. They just happen to all share the same memory address due to the fork. However, OS-level resources like connections are shared, so if you had run a query on session before Pool(), a connection would have been created for you in the connection pool and subsequently forked into the child processes. If you then attempt to perform queries in the child processes, you will run into weird errors because your processes are clobbering each other over the exact same connection!
The above is moot for Windows because Windows does not have fork().
TCP connections are represented as file descriptors, which usually work across process boundaries, meaning this will cause concurrent access to the file descriptor on behalf of two or more entirely independent Python interpreter states.
https://docs.sqlalchemy.org/en/13/core/pooling.html#using-connection-pools-with-multiprocessing
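Following the pattern described in those docs, one way to avoid sharing pooled connections across the fork is to dispose of the engine's pool in each worker as soon as it starts, e.g. via a Pool initializer. A sketch, assuming the engine and DBSession from the db.py above:

from multiprocessing import Pool
from db import engine, DBSession

def init_worker():
    # Discard any connections inherited from the parent so this child
    # opens fresh ones on first use.
    engine.dispose()

def process_feed(test):
    session = DBSession()
    # ... queries here run on this process's own connections ...

if __name__ == "__main__":
    pool = Pool(initializer=init_worker)
    pool.map_async(process_feed, [1, 2]).get()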
Related
I am building an API with FastAPI, served via uvicorn.
The API has endpoints that make use of Python's multiprocessing library.
An endpoint spawns several Processes for CPU-bound tasks to perform them in parallel.
Here is a high-level overview of the code logic:
import multiprocessing as mp

class Compute:
    def single_compute(self, single_comp_data):
        # Computational Task CPU BOUND
        global queue
        queue.put(self.compute(single_comp_data))

    def multi_compute(self, task_ids):
        # Prepare for Computation
        output = {}
        processes = []
        global queue
        queue = mp.Queue()

        # Start Test Objs Computation
        for tid in task_ids:
            # Load task data here, to make use of object in memory cache
            single_comp_data = self.load_data_from_cache(tid)
            p = mp.Process(target=self.single_compute, args=single_comp_data)
            p.start()
            processes.append(p)

        # Collect Parallel Computation
        for p in processes:
            result = queue.get()
            output[result["tid"]] = result
            p.join()

        return output
Here is the simple API code:
from fastapi import FastAPI, Response
import json

app = FastAPI()

# comp holds an in-memory cache, that's why it's created in global scope
comp = Compute()

@app.get("/compute")
def compute(task_ids):
    result = comp.multi_compute(task_ids)
    return Response(content=json.dumps(result, default=str), media_type="application/json")
When run with multiple workers like this:
uvicorn compute_api:app --host 0.0.0.0 --port 7000 --workers 2
I am getting this python error
TypeError: can't pickle _thread.lock objects
With only 1 worker process it is fine. The program runs on UNIX/LINUX OS.
Could someone explain to me why forking a new process is not possible with multiple uvicorn processes here, and why I am running into this thread lock?
In the end, what should be achieved is simple: a uvicorn process that spawns multiple child processes (via fork) with a memory copy of that uvicorn process, in order to perform CPU-bound tasks.
TypeError: can't pickle _thread.lock objects
stems from whatever data you're passing into your subprocess in
p = mp.Process(target=self.single_compute, args=single_comp_data)
containing an unpickleable object.
All args/kwargs sent to a multiprocessing subprocess (be it via Process, or the higher-level methods in Pool) must be pickleable, and similarly the return value of the function run must be pickleable so it can be sent back to the parent process.
If you're on Unix and using the fork start method for multiprocessing (the default on Linux, but not on macOS), you can also take advantage of copy-on-write memory semantics to avoid copying the data "down" to the child processes: make the data available before spawning the subprocess (e.g. via instance state or a global variable) and have the child fetch it by reference, instead of passing the data itself down as an argument.
This example is using imap_unordered for performance (the assumption being there's no need to process the ids in order), and would return a dict mapping an input ID to the result it creates.
import multiprocessing

class Compute:
    _cache = {}  # could be an instance variable too but whatever

    def get_data(self, id):
        if id not in self._cache:
            self._cache[id] = get_data_from_somewhere(id)
        return self._cache[id]

    def compute_item(self, id):
        data = self.get_data(id)
        result = 42  # ... do heavy computation here ...
        return (id, result)

    def compute_result(self, ids) -> dict:
        for id in ids:
            self.get_data(id)  # populate in parent process
        with multiprocessing.Pool() as p:
            return dict(p.imap_unordered(self.compute_item, ids))
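For illustration, a hypothetical invocation could look like this (the IDs and result values are made up):

if __name__ == "__main__":
    comp = Compute()
    results = comp.compute_result([1, 2, 3])
    print(results)  # e.g. {2: 42, 1: 42, 3: 42}; completion order may vary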
Using Python's multiprocessing on Windows requires many arguments to be "picklable" when passing them to child processes.
import multiprocessing

class Foobar:
    def __getstate__(self):
        print("I'm being pickled!")

def worker(foobar):
    print(foobar)

if __name__ == "__main__":
    # Uncomment this on Linux
    # multiprocessing.set_start_method("spawn")
    foobar = Foobar()
    process = multiprocessing.Process(target=worker, args=(foobar, ))
    process.start()
    process.join()
The documentation mentions this explicitly several times:
Picklability
Ensure that the arguments to the methods of proxies are picklable.
[...]
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
[...]
More picklability
Ensure that all arguments to Process.__init__() are picklable. Also, if you subclass Process then make sure that instances will be picklable when the Process.start method is called.
However, I noticed two main differences between "multiprocessing pickle" and the standard pickle module, and I have trouble making sense of all of this.
A multiprocessing.Queue() is not "picklable", yet it can be passed to child processes
import pickle
from multiprocessing import Queue, Process

def worker(queue):
    pass

if __name__ == "__main__":
    queue = Queue()

    # RuntimeError: Queue objects should only be shared between processes through inheritance
    pickle.dumps(queue)

    # Works fine
    process = Process(target=worker, args=(queue, ))
    process.start()
    process.join()
Not picklable if defined in "main"
import pickle
from multiprocessing import Process

def worker(foo):
    pass

if __name__ == "__main__":
    class Foo:
        pass

    foo = Foo()

    # Works fine
    pickle.dumps(foo)

    # AttributeError: Can't get attribute 'Foo' on <module '__mp_main__' from 'C:\\Users\\Delgan\\test.py'>
    process = Process(target=worker, args=(foo, ))
    process.start()
    process.join()
If multiprocessing does not use pickle internally, then what are the inherent differences between these two ways of serializing objects?
Also, what does "inherit" mean in the context of multiprocessing? How am I supposed to prefer it over pickle?
When a multiprocessing.Queue is passed to a child process, what is actually sent is a file descriptor (or handle) obtained from pipe, which must have been created by the parent before creating the child. The error from pickle is to prevent attempts to send a Queue over another Queue (or similar channel), since it’s too late to use it then. (Unix systems do actually support sending a pipe over certain kinds of socket, but multiprocessing doesn’t use such features.) It’s expected to be “obvious” that certain multiprocessing types can be sent to child processes that would otherwise be useless, so no mention is made of the apparent contradiction.
Since the “spawn” start method can’t create the new process with any Python objects already created, it has to re-import the main script to obtain relevant function/class definitions. It doesn’t set __name__ like the original run for obvious reasons, so anything that is dependent on that setting will not be available. (Here, it is unpickling that failed, which is why your manual pickling works.)
The fork methods start the children with the parent’s objects (at the time of the fork only) still existing; this is what is meant by inheritance.
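As a rough illustration of that inheritance under the fork start method (the object and names below are made up for the example), a module-level object created before the fork is simply visible in the child without ever being pickled:

import multiprocessing

big_table = {i: i * i for i in range(1_000_000)}  # created before the fork

def worker():
    # No pickling happened: the child sees big_table because it inherited
    # the parent's memory at fork time.
    print(len(big_table))

if __name__ == "__main__":
    multiprocessing.set_start_method("fork")  # default on Linux; unavailable on Windows
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()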
I am new to Celery & Python and have only cursory knowledge of both.
I have multiple Ubuntu servers which all run multiple Celery workers (10-15).
Each of these workers needs to perform a certain task using a third-party library/DLL. For that we first
need to instantiate its class object and store it (somehow) in memory.
Then the Celery workers read RMQ queues to execute certain tasks which use the above class object's methods.
The goal is to instantiate the third-party class object once (when the Celery worker starts) and then, on task execution,
use the class instance's methods. Just keep doing this repeatedly.
I don't want to use Redis as it seems like too much overhead to store such a tiny amount of data (a class object).
I need help figuring out how to store this instantiated class object per worker. If the worker fails or crashes, we obviously instantiate the class again, which is not a problem. Any help, specifically a code sample, will help a lot.
To give an analogy, my requirement is similar to having a unique database connection per worker and using that same connection on every repeated request.
Updated with some poorly written code for that:
tasks.py
from celery import Celery, Task

# Declares the config file and this worker file
mycelery = Celery('tasks')
mycelery.config_from_object('celeryconfig2')

class IQ(Task):
    _db = None

    @property
    def db(self):
        if self._db is None:
            print 'establish DB connection....'
            self._db = Database.Connect()
        return self._db

@mycelery.task(base=IQ)
def indexIQ():
    print 'calling indexing.....'
    if indexIQ.db is None:
        print "DB connection doesn't exist. Let's create one..."
        ....
        ....
    print 'continue indexing!'
main.py
from tasks import *

indexIQ.apply_async()
indexIQ.apply_async()
indexIQ.apply_async()
print 'end program'
Expected output
calling indexing.....
DB connection doesn't exist. Let's create one...
establish DB connection....
continue indexing!
calling indexing.....
continue indexing!
calling indexing.....
continue indexing!
Unfortunately, I am getting the first 4 lines of output every time, which means the DB connection is being established on each task execution. What am I doing wrong?
Thanks
I have a PySpark job that updates some objects in HBase (Spark v1.6.0; happybase v0.9).
It sort-of works if I open/close an HBase connection for each row:
def process_row(row):
    conn = happybase.Connection(host=[hbase_master])
    # update HBase record with data from row
    conn.close()

my_dataframe.foreach(process_row)
After a few thousand upserts, we start to see errors like this:
TTransportException: Could not connect to [hbase_master]:9090
Obviously, it's inefficient to open/close a connection for each upsert. This function is really just a placeholder for a proper solution.
I then tried to create a version of the process_row function that uses a connection pool:
pool = happybase.ConnectionPool(size=20, host=[hbase_master])
def process_row(row):
with pool.connection() as conn:
# update HBase record with data from row
For some reason, the connection pool version of this function returns an error (see complete error message):
TypeError: can't pickle thread.lock objects
Can you see what I'm doing wrong?
Update
I saw this post and suspect I'm experiencing the same issue: Spark attempts to serialize the pool object and distribute it to each of the executors, but this connection pool object cannot be shared across multiple executors.
It sounds like I need to split the dataset into partitions, and use one connection per partition (see design patterns for using foreachrdd). I tried this, based on an example in the documentation:
def persist_to_hbase(dataframe_partition):
    hbase_connection = happybase.Connection(host=[hbase_master])
    for row in dataframe_partition:
        pass  # persist data
    hbase_connection.close()

my_dataframe.foreachPartition(lambda dataframe_partition: persist_to_hbase(dataframe_partition))
Unfortunately, it still returns a "can't pickle thread.lock objects" error.
Down the line, happybase connections are just TCP connections, so they cannot be shared between processes. A connection pool is primarily useful for multi-threaded applications, and also proves useful for single-threaded applications that can use the pool as a global "connection factory" with connection reuse, which may simplify code because no "connection" objects need to be passed around. It also makes error recovery a bit easier.
In any case, a pool (which is just a group of connections) cannot be shared between processes. Trying to serialise it does not make sense for that reason. (Pools use locks, which is what causes serialisation to fail, but that is just a symptom.)
Perhaps you can use a helper that conditionally creates a pool (or connection) and stores it in a module-local variable, instead of instantiating it at import time, e.g.
_pool = None

def get_pool():
    global _pool
    if _pool is None:
        _pool = happybase.ConnectionPool(size=1, host=[hbase_master])
    return _pool

def process(...):
    with get_pool().connection() as connection:
        connection.table(...).put(...)
This instantiates the pool/connection upon first use instead of at import time.
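Applied to the foreachPartition approach from the question, that helper might be used roughly like this (the table name and column mapping are placeholders, not from the original code):

def persist_partition(dataframe_partition):
    # get_pool() runs lazily inside each executor process, so Spark
    # never has to pickle the pool itself
    with get_pool().connection() as connection:
        table = connection.table('my_table')  # placeholder table name
        for row in dataframe_partition:
            table.put(row.key, {'cf:value': row.value})  # placeholder columns

my_dataframe.foreachPartition(persist_partition)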
I am given the task to update a database over the network with sqlalchemy. I have decided to use python's threading module. Currently I am using 1 thread, aka the producer thread, to direct other threads to consume work units via a queue.
The producer thread does something like this:
def produce(self, last_id):
    unit = session.query(Request).order_by(Request.id) \
        .filter(Request.item_id == None).yield_per(50)
    self.queue.put(unit, True, Master.THREAD_TIMEOUT)
while the consumer threads do something similar to this:
def consume(self):
    unit = self.queue.get()
    request = unit
    item = Item.get_item_by_url(request)
    request.item = item
    session.add(request)
    session.flush()
and I am using sqlalchemy's scoped session:
session = scoped_session(sessionmaker(autocommit=True, autoflush=True, bind=engine))
However, I am getting the exception,
"sqlalchemy.exc.InvalidRequestError: Object FOO is already attached to session '1234' (this is '5678')"
I understand that this exception comes from the fact that the request object is created in one session (the producer session) while the consumers are using another scoped session because they belong to another thread.
My work around is to have my producer thread pass in the request.id into the queue while the consumer has to call the code below to retrieve the request object.
request = session.query(Request).filter(Request.id == request_id).first()
I do not like this solution because this involves another network call and is obviously not optimal.
Are there ways to avoid wasting the result of the producer's db call?
Is there a way to write the "produce" so that more than 1 id is passed into the queue as a work unit?
Feedback welcomed!
You need to detach your Request instance from the main thread session before you put it into the queue, then attach it to the queue processing thread session when taken from the queue again.
To detach, call .expunge() on the session, passing in the request:
session.expunge(unit)
and then when processing it in a queue thread, re-attach it by merging; set the load flag to False to prevent a round-trip to the database again:
session.merge(request, load=False)
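A rough sketch of how the two functions from the question might look with the detach/re-attach applied (reusing the names from the question, and assuming the producer iterates over the query results):

def produce(self, last_id):
    for unit in session.query(Request).order_by(Request.id) \
            .filter(Request.item_id == None).yield_per(50):
        session.expunge(unit)              # detach from the producer thread's session
        self.queue.put(unit, True, Master.THREAD_TIMEOUT)

def consume(self):
    unit = self.queue.get()
    request = session.merge(unit, load=False)   # attach to this thread's session
    request.item = Item.get_item_by_url(request)
    session.add(request)
    session.flush()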