Using Celery to store class instantiated object - python

I am new to Celery and Python and have cursory knowledge of both.
I have multiple Ubuntu servers, each running multiple Celery workers (10-15).
Each of these workers needs to perform a certain task using a third-party library/DLL. For that we first
need to instantiate its class object and keep it in memory somehow.
The Celery workers then read RabbitMQ queues to execute certain tasks that use the methods of the above class object.
The goal is to instantiate the third-party class object once (when the Celery worker starts) and then, on each task execution,
use the class instance's methods. Just keep doing this repeatedly.
I don't want to use Redis, as it seems like too much overhead to store such a tiny amount of data (a class object).
I need help figuring out how to store this instantiated class object per worker. If the worker fails or crashes, we obviously instantiate the class again, which is not a problem. Any help, especially a code sample, would be appreciated.
To provide an analogy, my requirement is similar to having a unique database connection per worker and reusing the same connection for every request.
Updated with some poorly written code for that:
tasks.py
from celery import Celery, Task

# Declares the config file and this worker file
mycelery = Celery('tasks')
mycelery.config_from_object('celeryconfig2')

class IQ(Task):
    _db = None

    @property
    def db(self):
        if self._db is None:
            print 'establish DB connection....'
            self._db = Database.Connect()
        return self._db

@mycelery.task(base=IQ)
def indexIQ():
    print 'calling indexing.....'
    if indexIQ.db is None:
        print "DB connection doesn't exist. Let's create one..."
        ....
    ....
    print 'continue indexing!'
main.py
from tasks import *

indexIQ.apply_async()
indexIQ.apply_async()
indexIQ.apply_async()
print 'end program'
Expected output
calling indexing.....
DB connection doesn't exist. Let's create one...
establish DB connection....
continue indexing!
calling indexing.....
continue indexing!
calling indexing.....
continue indexing!
Unfortunately, I am getting the first 4 lines of output every time, which means the DB connection is being made on each task execution. What am I doing wrong?
Thanks
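For what it's worth, here is a minimal sketch of the per-worker caching pattern the question is aiming at. Database.Connect() and do_something() are hypothetical stand-ins for the third-party library, and the class/task names are illustrative. The key point is that Celery creates the task's base-class instance once per worker process, so anything cached on it is reused across task invocations within that worker.
# Sketch only: Database.Connect() and do_something() are hypothetical stand-ins.
from celery import Celery, Task

app = Celery('tasks')

class ThirdPartyTask(Task):
    _client = None  # cached per worker process, not per task call

    @property
    def client(self):
        if self._client is None:
            self._client = Database.Connect()  # hypothetical; runs once per worker
        return self._client

@app.task(base=ThirdPartyTask, bind=True)
def index_iq(self):
    # self is the ThirdPartyTask instance; the object is created lazily on first use
    self.client.do_something()  # hypothetical method on the third-party object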

Related

Tornado background task using cx_Oracle

I'm running a Bokeh server, using the underlying Tornado framework.
I need the server to refresh some data at some point. This is done by fetching rows from an Oracle DB, using cx_Oracle.
Thanks to Tornado's PeriodicCallback, the program checks every 30 seconds whether new data should be loaded:
server.start()
from tornado.ioloop import PeriodicCallback
pcallback = PeriodicCallback(db_obj.reload_data_async, 10 * 1e3)
pcallback.start()
server.io_loop.start()
Where db_obj is an instance of a class that takes care of the DB-related functions (connect, fetch, ...).
Basically, this is what the reload_data_async function looks like:
executor = concurrent.futures.ThreadPoolExecutor(4)

# methods of the db_obj class ...
@gen.coroutine
def reload_data_async(self):
    # ... first, some code to check if the data should be reloaded ...
    # ...
    if data_should_be_reloaded:
        new_data = yield executor.submit(self.fetch_data)

def fetch_data(self):
    """ fetch new data from the DB """
    cursor = cx.Cursor(self.db_connection)
    cursor.execute("some SQL select request that takes time (select * from ...)")
    rows = cursor.fetchall()
    # some more processing thereafter
    # ...
Basically, this works. But when I try to read the data while it's being loaded in fetch_data (by clicking to display it in the GUI), the program crashes, due to a race condition I guess: the data is being accessed while it's being fetched at the same time.
I just discovered that tornado.concurrent.futures are not thread-safe:
tornado.concurrent.Future is similar to concurrent.futures.Future, but
not thread-safe (and therefore faster for use with single-threaded
event loops).
All in all, I think I should create a new thread to take care of the cx_Oracle operations. Can I do that with Tornado and keep using the PeriodicCallback function? How can I make my asynchronous operation thread-safe? What's the way to do this?
PS: I'm using Python 2.7
Thanks
Solved it!
@Sraw is right: it should not cause a crash.
Explanation: fetch_data() is using a cx_Oracle Connection object (self.db_connection), which is NOT thread-safe by default. Setting the threaded parameter to True wraps accesses to the shared connection with a mutex, as described in the cx_Oracle documentation:
The threaded parameter is expected to be a boolean expression which
indicates whether or not Oracle should wrap accesses to connections
with a mutex. Doing so in single threaded applications imposes a
performance penalty of about 10-15% which is why the default is False.
So in my code, I just modified the following, and it now works without crashing when the user tries to access data while it's being refreshed:
# inside the connect method of the db_obj class
self.db_connection = cx.connect('connection string', threaded=True) # False by default
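For context, here is a minimal end-to-end sketch of the pattern discussed above (periodic callback, blocking fetch on a thread pool, threaded connection). The DSN and query are placeholders, not the poster's actual values.
# Sketch only: the two important details are threaded=True on the shared
# connection and running the blocking fetch on a ThreadPoolExecutor so the
# IOLoop stays responsive.
import concurrent.futures
import cx_Oracle as cx
from tornado import gen
from tornado.ioloop import IOLoop, PeriodicCallback

executor = concurrent.futures.ThreadPoolExecutor(4)

class DbObj(object):
    def connect(self):
        # threaded=True makes cx_Oracle wrap access to this connection with a mutex
        self.db_connection = cx.connect('user/password@dsn', threaded=True)

    def fetch_data(self):
        cursor = self.db_connection.cursor()
        cursor.execute("select * from some_table")  # hypothetical query
        self.data = cursor.fetchall()

    @gen.coroutine
    def reload_data_async(self):
        # the blocking DB call runs on a worker thread; the coroutine resumes when done
        yield executor.submit(self.fetch_data)

db_obj = DbObj()
db_obj.connect()
PeriodicCallback(db_obj.reload_data_async, 10 * 1e3).start()
IOLoop.current().start()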

across process boundary in scoped_session

I'm using SQLAlchemy and multiprocessing. I also use scoped_session, since it avoids sharing the same session. I ran into an error and found its solution, but I don't understand why it happens.
You can see my code below:
db.py
engine = create_engine(connection_string)
Session = sessionmaker(bind=engine)
DBSession = scoped_session(Session)
script.py
from multiprocessing import Pool, current_process
from db import DBSession

def process_feed(test):
    session = DBSession()
    print(current_process().name, session)

def run():
    session = DBSession()
    pool = Pool()
    print(current_process().name, session)
    pool.map_async(process_feed, [1, 2]).get()

if __name__ == "__main__":
    run()
When I run script.py, the output is:
MainProcess <sqlalchemy.orm.session.Session object at 0xb707b14c>
ForkPoolWorker-1 <sqlalchemy.orm.session.Session object at 0xb707b14c>
ForkPoolWorker-2 <sqlalchemy.orm.session.Session object at 0xb707b14c>
Note that the session object is the same (0xb707b14c) in the main process and its workers (the child processes).
BUT if I change the order of the first two lines of run():
def run():
    pool = Pool()          # <--- Now pool is instantiated on the first line
    session = DBSession()  # <--- Now session is instantiated on the second line
    print(current_process().name, session)
    pool.map_async(process_feed, [1, 2]).get()
And then when I run script.py again, the output is:
MainProcess <sqlalchemy.orm.session.Session object at 0xb66907cc>
ForkPoolWorker-1 <sqlalchemy.orm.session.Session object at 0xb669046c>
ForkPoolWorker-2 <sqlalchemy.orm.session.Session object at 0xb66905ec>
Now the session instances are different.
To understand why this happens, you need to understand what scoped_session and Pool actually does. scoped_session keeps a registry of sessions so that the following happens
the first time you call DBSession, it creates a Session object for you in the registry
subsequently, if necessary conditions are met (i.e. same thread, session has not been closed), it does not create a new Session object and instead returns you the previously created Session object back
When you create a Pool, it creates the workers in the __init__ method. (Note that there's nothing fundamental about starting the worker processes in __init__. An equally valid implementation could wait until workers are first needed before it starts them, which would exhibit different behavior in your example.) When this happens (on Unix), the parent process forks itself for every worker process, which involves the operating system copying the memory of the current running process into a new process, so you will literally get the exact same objects in the exact same places.
Putting these two together, in the first example you are creating a Session before forking, which gets copied over to all worker processes during the creation of the Pool, resulting in the same identity, while in the second example you delay the creation of the Session object until after the worker processes have started, resulting in different identities.
It's important to note that while the Session objects share the same id, they are not the same object, in the sense that if you change anything about the Session in the parent process, those changes will not be reflected in the child processes. They just happen to all share the same memory address due to the fork. However, OS-level resources like connections are shared, so if you had run a query on session before Pool(), a connection would have been created for you in the connection pool and subsequently forked into the child processes. If you then attempt to perform queries in the child processes you will run into weird errors because your processes are clobbering each other over the same exact connection!
The above is moot for Windows because Windows does not have fork().
TCP connections are represented as file descriptors, which usually work across process boundaries, meaning this will cause concurrent access to the file descriptor on behalf of two or more entirely independent Python interpreter states.
https://docs.sqlalchemy.org/en/13/core/pooling.html#using-connection-pools-with-multiprocessing
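A minimal sketch of the remedy that documentation section suggests, assuming the same db.py layout as above (connection_string is the same placeholder): make sure each child process starts with an empty connection pool, for example by disposing of the engine in a Pool initializer, and create sessions only after the fork.
# Sketch only, based on the linked "Using Connection Pools with Multiprocessing" recipe.
from multiprocessing import Pool, current_process
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine(connection_string)
DBSession = scoped_session(sessionmaker(bind=engine))

def init_worker():
    # discard any connections inherited from the parent via fork;
    # each child will open fresh ones on first use
    engine.dispose()

def process_feed(item):
    session = DBSession()  # created after the fork, so unique per process
    print(current_process().name, session)
    DBSession.remove()

if __name__ == "__main__":
    pool = Pool(initializer=init_worker)
    pool.map(process_feed, [1, 2])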

PySpark dataframe.foreach() with HappyBase connection pool returns 'TypeError: can't pickle thread.lock objects'

I have a PySpark job that updates some objects in HBase (Spark v1.6.0; happybase v0.9).
It sort-of works if I open/close an HBase connection for each row:
def process_row(row):
    conn = happybase.Connection(host=[hbase_master])
    # update HBase record with data from row
    conn.close()

my_dataframe.foreach(process_row)
After a few thousand upserts, we start to see errors like this:
TTransportException: Could not connect to [hbase_master]:9090
Obviously, it's inefficient to open/close a connection for each upsert. This function is really just a placeholder for a proper solution.
I then tried to create a version of the process_row function that uses a connection pool:
pool = happybase.ConnectionPool(size=20, host=[hbase_master])

def process_row(row):
    with pool.connection() as conn:
        # update HBase record with data from row
For some reason, the connection pool version of this function returns an error (see complete error message):
TypeError: can't pickle thread.lock objects
Can you see what I'm doing wrong?
Update
I saw this post and suspect I'm experiencing the same issue: Spark attempts to serialize the pool object and distribute it to each of the executors, but this connection pool object cannot be shared across multiple executors.
It sounds like I need to split the dataset into partitions and use one connection per partition (see Design Patterns for using foreachRDD). I tried this, based on an example in the documentation:
def persist_to_hbase(dataframe_partition):
    hbase_connection = happybase.Connection(host=[hbase_master])
    for row in dataframe_partition:
        # persist data
    hbase_connection.close()

my_dataframe.foreachPartition(lambda dataframe_partition: persist_to_hbase(dataframe_partition))
Unfortunately, it still returns a "can't pickle thread.lock objects" error.
Down the line, happybase connections are just TCP connections, so they cannot be shared between processes. A connection pool is primarily useful for multi-threaded applications, and it also proves useful for single-threaded applications that can use the pool as a global "connection factory" with connection reuse, which may simplify code because no "connection" objects need to be passed around. It also makes error recovery a bit easier.
In any case, a pool (which is just a group of connections) cannot be shared between processes. Trying to serialise it does not make sense for that reason. (Pools use locks, which causes serialisation to fail, but that is just a symptom.)
Perhaps you can use a helper that conditionally creates a pool (or connection) and stores it as a module-level variable, instead of instantiating it at import time, e.g.:
_pool = None

def get_pool():
    global _pool
    if _pool is None:
        _pool = happybase.ConnectionPool(size=1, host=[hbase_master])
    return _pool

def process(...):
    with get_pool().connection() as connection:
        connection.table(...).put(...)
This instantiates the pool/connection upon first use instead of at import time.
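Putting the pieces together, a hedged sketch of how the lazy helper might be called from foreachPartition; the table name, column family, and row fields are hypothetical, and hbase_master is the placeholder from the question.
# Sketch only: the pool is created inside each executor process on first use,
# so nothing connection-related has to be pickled and shipped from the driver.
import happybase

_pool = None

def get_pool():
    global _pool
    if _pool is None:
        _pool = happybase.ConnectionPool(size=1, host=hbase_master)
    return _pool

def persist_partition(rows):
    with get_pool().connection() as conn:
        table = conn.table('my_table')  # hypothetical table name
        for row in rows:
            # hypothetical row fields and column family
            table.put(row.key, {'cf:value': row.value})

my_dataframe.foreachPartition(persist_partition)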

How to use multiprocessing with class instances in Python?

I am trying to create a class that can run a separate process to do some work that takes a long time, launch a bunch of these from a main module, and then wait for them all to finish. I want to launch the processes once and then keep feeding them things to do rather than creating and destroying processes. For example, maybe I have 10 servers running the dd command, then I want them all to scp a file, etc.
My ultimate goal is to create a class for each system that keeps track of the information for the system it is tied to, like IP address, logs, runtime, etc. But that class must be able to launch a system command and then return execution back to the caller while that system command runs, to follow up with the result of the system command later.
My attempt is failing because I cannot send an instance method of a class over the pipe to the subprocess via pickle. Those are not pickleable. I therefore tried to fix it in various ways, but I can't figure it out. How can my code be patched to do this? What good is multiprocessing if you can't send over anything useful?
Is there any good documentation of multiprocessing being used with class instances? The only way I can get the multiprocessing module to work is on simple functions. Every attempt to use it within a class instance has failed. Maybe I should pass events instead? I don't understand how to do that yet.
import multiprocessing
import sys
import re

class ProcessWorker(multiprocessing.Process):
    """
    This class runs as a separate process to execute worker's commands in parallel
    Once launched, it remains running, monitoring the task queue, until "None" is sent
    """

    def __init__(self, task_q, result_q):
        multiprocessing.Process.__init__(self)
        self.task_q = task_q
        self.result_q = result_q
        return

    def run(self):
        """
        Overloaded function provided by multiprocessing.Process. Called upon start() signal
        """
        proc_name = self.name
        print '%s: Launched' % (proc_name)
        while True:
            next_task_list = self.task_q.get()
            if next_task_list is None:
                # Poison pill means shutdown
                print '%s: Exiting' % (proc_name)
                self.task_q.task_done()
                break
            next_task = next_task_list[0]
            print '%s: %s' % (proc_name, next_task)
            args = next_task_list[1]
            kwargs = next_task_list[2]
            answer = next_task(*args, **kwargs)
            self.task_q.task_done()
            self.result_q.put(answer)
        return
# End of ProcessWorker class

class Worker(object):
    """
    Launches a child process to run commands from derived classes in separate processes,
    which sit and listen for something to do
    This base class is called by each derived worker
    """

    def __init__(self, config, index=None):
        self.config = config
        self.index = index

        # Launch the ProcessWorker for anything that has an index value
        if self.index is not None:
            self.task_q = multiprocessing.JoinableQueue()
            self.result_q = multiprocessing.Queue()

            self.process_worker = ProcessWorker(self.task_q, self.result_q)
            self.process_worker.start()
            print "Got here"
            # Process should be running and listening for functions to execute
        return

    def enqueue_process(target):  # No self, since it is a decorator
        """
        Used to place a command target from this class object into the task_q
        NOTE: Any function decorated with this must use fetch_results() to get the
        target task's result value
        """
        def wrapper(self, *args, **kwargs):
            self.task_q.put([target, args, kwargs])  # FAIL: target is a class instance method and can't be pickled!
        return wrapper

    def fetch_results(self):
        """
        After all processes have been spawned by multiple modules, this command
        is called on each one to retrieve the results of the call.
        This blocks until the execution of the item in the queue is complete
        """
        self.task_q.join()           # Wait for it to finish
        return self.result_q.get()   # Return the result

    @enqueue_process
    def run_long_command(self, command):
        print "I am running command '%s' in a separate process" % command
        # In here, I will launch a subprocess to run a long-running system command
        # p = Popen(command), etc
        # p.wait(), etc
        return

    def close(self):
        self.task_q.put(None)
        self.task_q.join()

if __name__ == '__main__':
    config = ["some value", "something else"]
    index = 7
    workers = []
    for i in range(5):
        worker = Worker(config, index)
        worker.run_long_command("ls /")
        workers.append(worker)
    for worker in workers:
        worker.fetch_results()
    # Do more work... (this would actually be done in a distributor in another class)
    for worker in workers:
        worker.close()
Edit: I tried to move the ProcessWorker class and the creation of the multiprocessing queues outside of the Worker class and then tried to manually pickle the worker instance. Even that doesn't work, and I get an error:
RuntimeError: Queue objects should only be shared between processes through inheritance
But I am only passing references of those queues into the worker instance?? I am missing something fundamental. Here is the modified code from the main section:
if __name__ == '__main__':
    config = ["some value", "something else"]
    index = 7
    workers = []
    for i in range(1):
        task_q = multiprocessing.JoinableQueue()
        result_q = multiprocessing.Queue()
        process_worker = ProcessWorker(task_q, result_q)
        worker = Worker(config, index, process_worker, task_q, result_q)
        something_to_look_at = pickle.dumps(worker)  # FAIL: Doesn't like queues??
        process_worker.start()
        worker.run_long_command("ls /")
So, the problem was that I was assuming that Python was doing some sort of magic that is somehow different from the way that C++/fork() works. I somehow thought that Python only copied the class, not the whole program into a separate process. I seriously wasted days trying to get this to work because all of the talk about pickle serialization made me think that it actually sent everything over the pipe. I knew that certain things could not be sent over the pipe, but I thought my problem was that I was not packaging things up properly.
This all could have been avoided if the Python docs gave me a 10,000 ft view of what happens when this module is used. Sure, it tells me what the methods of multiprocess module does and gives me some basic examples, but what I want to know is what is the "Theory of Operation" behind the scenes! Here is the kind of information I could have used. Please chime in if my answer is off. It will help me learn.
When you start a process using this module, the whole program is copied into another process. But since it is not the "__main__" process and my code was checking for that, it doesn't fire off yet another process infinitely. It just stops and sits out there waiting for something to do, like a zombie. Everything that was initialized in the parent at the time of calling multiprocessing.Process() is all set up and ready to go. Once you put something in the multiprocessing.Queue or shared memory, or pipe, etc. (however you are communicating), the separate process receives it and gets to work. It can draw upon all imported modules and setup, just as if it were the parent. However, once some internal state variables change in the parent or the separate process, those changes are isolated. Once the process is spawned, it becomes your job to keep them in sync if necessary, either through a queue, pipe, shared memory, etc.
I threw out the code and started over, but now I am only putting one extra function out in the ProcessWorker, an "execute" method that runs a command line. Pretty simple. I don't have to worry about launching and then closing a bunch of processes this way, which has caused me all kinds of instability and performance issues in the past in C++. When I switched to launching processes at the beginning and then passing messages to those waiting processes, my performance improved and it was very stable.
BTW, I looked at this link to get help, which threw me off because the example made me think that methods were being transported across the queues: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
The second example of the first section used "next_task()" that appeared (to me) to be executing a task received via the queue.
Instead of attempting to send a method itself (which is impractical), try sending the name of a method to execute.
Provided that each worker runs the same code, it's a matter of a simple getattr(self, task_name).
I'd pass tuples (task_name, task_args), where task_args is a dict to be fed directly to the task method:
next_task_name, next_task_args = self.task_q.get()
if next_task_name:
    task = getattr(self, next_task_name)
    answer = task(**next_task_args)
    ...
else:
    # poison pill, shut down
    break
REF: https://stackoverflow.com/a/14179779
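A hedged sketch of what that name-based dispatch could look like end to end; the class and method names are illustrative, not from the original code. The parent only puts picklable strings and dicts on the queue, and the worker resolves the method on itself with getattr.
# Sketch only: only (method_name, kwargs) tuples cross the process boundary,
# so nothing unpicklable is ever sent over the pipe.
import multiprocessing

class CommandWorker(multiprocessing.Process):
    def __init__(self, task_q, result_q):
        multiprocessing.Process.__init__(self)
        self.task_q = task_q
        self.result_q = result_q

    def run_long_command(self, command):
        # placeholder for Popen(command) / p.wait() etc.
        return "ran: %s" % command

    def run(self):
        while True:
            next_task_name, next_task_args = self.task_q.get()
            if next_task_name is None:            # poison pill
                self.task_q.task_done()
                break
            task = getattr(self, next_task_name)  # look the method up by name
            self.result_q.put(task(**next_task_args))
            self.task_q.task_done()

if __name__ == '__main__':
    task_q = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    worker = CommandWorker(task_q, result_q)
    worker.start()
    task_q.put(('run_long_command', {'command': 'ls /'}))
    task_q.put((None, None))                      # shut the worker down
    task_q.join()
    print(result_q.get())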
The answer of Jan 6 at 6:03 by David Lynch is not factually correct when he says that he was misled by http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html.
The code and examples provided are correct and work as advertised. next_task() is executing a task received via the queue -- try and understand what the Task.__call__() method is doing.
In my case, what tripped me up was syntax errors in my implementation of run(). It seems that the sub-process will not report this and just fails silently, leaving things stuck in weird loops! Make sure you have some kind of syntax checker running, e.g. Flymake/Pyflakes in Emacs.
Debugging via multiprocessing.log_to_stderr() helped me narrow down the problem.

Multithreading and Sqlalchemy

I have been given the task of updating a database over the network with SQLAlchemy. I have decided to use Python's threading module. Currently I am using one thread, aka the producer thread, to direct other threads to consume work units via a queue.
The producer thread does something like this:
def produce(self, last_id):
    unit = session.query(Request).order_by(Request.id) \
        .filter(Request.item_id == None).yield_per(50)
    self.queue.put(unit, True, Master.THREAD_TIMEOUT)
while the consumer threads do something similar to this:
def consume(self):
    unit = self.queue.get()
    request = unit
    item = Item.get_item_by_url(request)
    request.item = item
    session.add(request)
    session.flush()
and I am using sqlalchemy's scoped session:
session = scoped_session(sessionmaker(autocommit=True, autoflush=True, bind=engine))
However, I am getting the exception,
"sqlalchemy.exc.InvalidRequestError: Object FOO is already attached to session '1234' (this is '5678')"
I understand that this exception comes from the fact that the request object is created in one session (the producer session) while the consumers are using another scoped session because they belong to another thread.
My workaround is to have my producer thread pass request.id into the queue, while the consumer has to call the code below to retrieve the request object.
request = session.query(Request).filter(Request.id == request_id).first()
I do not like this solution because this involves another network call and is obviously not optimal.
Are there ways to avoid wasting the result of the producer's db call?
Is there a way to write the "produce" so that more than 1 id is passed into the queue as a work unit?
Feedback welcomed!
You need to detach your Request instance from the main thread's session before you put it into the queue, then attach it to the queue-processing thread's session when it is taken from the queue again.
To detach, call .expunge() on the session, passing in the request:
session.expunge(unit)
and then when processing it in a queue thread, re-attach it by merging; set the load flag to False to prevent a round-trip to the database again:
session.merge(request, load=False)
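For illustration, a hedged sketch of the expunge/merge flow applied to the produce/consume methods from the question; Request, Item, Master, session, and the queue wiring are assumed to exist as in the original code.
# Sketch only: detach each instance in the producer, re-attach in the consumer.
def produce(self, last_id):
    for unit in session.query(Request).order_by(Request.id) \
            .filter(Request.item_id == None).yield_per(50):
        session.expunge(unit)  # detach from the producer thread's session
        self.queue.put(unit, True, Master.THREAD_TIMEOUT)

def consume(self):
    request = self.queue.get()
    request = session.merge(request, load=False)  # attach to this thread's session
    request.item = Item.get_item_by_url(request)
    session.add(request)
    session.flush()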
