Multithreading and SQLAlchemy - python

I have been given the task of updating a database over the network with SQLAlchemy. I have decided to use Python's threading module. Currently I am using one thread, the producer thread, to direct other threads to consume work units via a queue.
The producer thread does something like this:
def produce(self, last_id):
    unit = session.query(Request).order_by(Request.id) \
        .filter(Request.item_id == None).yield_per(50)
    self.queue.put(unit, True, Master.THREAD_TIMEOUT)
while the consumer threads do something similar to this:
def consume(self):
    unit = self.queue.get()
    request = unit
    item = Item.get_item_by_url(request)
    request.item = item
    session.add(request)
    session.flush()
and I am using sqlalchemy's scoped session:
session = scoped_session(sessionmaker(autocommit=True, autoflush=True, bind=engine))
However, I am getting the exception,
"sqlalchemy.exc.InvalidRequestError: Object FOO is already attached to session '1234' (this is '5678')"
I understand that this exception comes from the fact that the request object is created in one session (the producer session) while the consumers are using another scoped session because they belong to another thread.
My workaround is to have my producer thread pass the request.id into the queue, while the consumer has to call the code below to retrieve the request object.
request = session.query(Request).filter(Request.id == request_id).first()
I do not like this solution because this involves another network call and is obviously not optimal.
Are there ways to avoid wasting the result of the producer's db call?
Is there a way to write the "produce" so that more than 1 id is passed into the queue as a work unit?
Feedback welcomed!

You need to detach your Request instance from the main thread session before you put it into the queue, then attach it to the queue processing thread session when taken from the queue again.
To detach, call .expunge() on the session, passing in the request:
session.expunge(unit)
and then when processing it in a queue thread, re-attach it by merging; set the load flag to False to prevent a round-trip to the database again:
session.merge(request, load=False)
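Putting the two calls together, here is a minimal sketch of how the producer and consumer above might look (assuming the same scoped_session setup; iterating the yield_per query and putting one request per work unit is an assumption, since the original produce() puts the whole query result on the queue):

def produce(self, last_id):
    query = session.query(Request).order_by(Request.id) \
        .filter(Request.item_id == None).yield_per(50)
    for request in query:
        session.expunge(request)                  # detach from the producer's session
        self.queue.put(request, True, Master.THREAD_TIMEOUT)

def consume(self):
    request = self.queue.get()
    request = session.merge(request, load=False)  # attach to this thread's session, no extra query
    request.item = Item.get_item_by_url(request)
    session.add(request)
    session.flush()

Note that merge() returns the attached copy, so the consumer should keep working with its return value rather than with the object that came off the queue.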

Related

broken pipe error with python multiprocessing and socketserver

Essentially I'm using the Python socketserver library to try and handle communications from a central server to multiple Raspberry Pi 4 and ESP32 peripherals. Currently I have the socketserver running serve_forever, then the request handler calls a method from a process manager class which starts a process that should handle the actual communication with the client.
It works fine if I use .join() on the process so that the process manager method doesn't exit, but that's not how I would like it to run. Without .join() I get a broken pipe error as soon as the client communication process tries to send a message back to the client.
This is the process manager class; it gets defined in the main file, and buildprocess is called through the request handler of the socketserver class:
import multiprocessing as mp
mp.allow_connection_pickling()
import queuemanager as qm
import hostmain as hmain
import camproc
import keyproc
import controlproc

# method that gets called into a process so that class and socket share memory
def callprocess(periclass, peritype, clientsocket, inqueue, genqueue):
    periclass.startup(clientsocket)

class ProcessManager(qm.QueueManager):
    def wipeproc(self, target):
        # TODO make wipeproc integrate with the queue manager rather than directly to the class
        for macid in list(self.procdict.keys()):
            if target == macid:
                # calls proc kill for the class
                try:
                    self.procdict[macid]["class"].prockill()
                except Exception as e:
                    print("exception:", e, "in wipeproc")
                # waits for process to exit naturally (class threads to close)
                self.procdict[macid]["process"].join()
                # remove dict entry for this macid
                self.procdict.pop(macid)

    # called externally to create the new process and append to procdict
    def buildprocess(self, peritype, macid, clientsocket):
        # TODO put some logic here to handle the differences of the controller process
        # generates queue object
        inqueue = mp.Queue()
        # creates periclass instance based on type
        if peritype == hmain.cam:
            periclass = camproc.CamMain(self, inqueue, self.genqueue)
        elif peritype == hmain.keypad:
            print("to be added to")
        elif peritype == hmain.motion:
            print("to be added to")
        elif peritype == hmain.controller:
            print("to be added to")
        # init and start call for the new process
        self.procdict[macid] = {"type": peritype, "inqueue": inqueue, "class": periclass, "process": None}
        self.procdict[macid]["process"] = mp.Process(
            target=callprocess,
            args=(self.procdict[macid]["class"], self.procdict[macid]["type"], clientsocket,
                  self.procdict[macid]["inqueue"], self.genqueue))
        self.procdict[macid]["process"].start()
        # updating the process dictionary before class obj gets appended
        # if macid in list(self.procdict.keys()):
        #     self.wipeproc(macid)
        print(self.procdict)
        print("client added")
To my eye, all the pertinent objects should be stored in the procdict dictionary, but as I mentioned, it just gets a broken pipe error unless I join the process with self.procdict[macid]["process"].join() before the end of the buildprocess method.
I would like it to exit the method but leave the communication process running as is. I've tried a few different things with restructuring what gets defined within the process and without, but to no avail. Thus far I haven't been able to find any pertinent solutions online, but of course I may have missed something too.
Thank you for reading this far if you did! I've been stuck on this for a couple of days, so any help would be appreciated; this is my first project with multiprocessing and sockets on any sort of scale.
#################
Edit to include pastebin with all the code:
https://pastebin.com/u/kadytoast/1/PPWfyCFT
Without .join() i get a broken pipe error as soon as the client communication process tries to send a message back to the client.
That's because when the request handler's handle() returns, socketserver shuts down the connection. socketserver simplifies the task of writing network servers, which means it does certain things automatically that are usually done in the course of network request handling. Your code is not quite making the intended use of socketserver. In particular, the asynchronous mixins are intended for handling requests asynchronously: with ForkingMixIn the server will spawn a new process for each request, in contrast to your current code, which does this by itself with mp.Process. So I think you have basically two options:
code less of the request handling yourself and use the provided socketserver methods
stay with your own handling and don't use socketserver at all, so it won't get in the way.
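For illustration, here is a minimal sketch of the first option (the handler class, port, and echo body are placeholders, not taken from the original code): the per-client loop lives inside handle(), so the connection stays open until it returns, and ForkingMixIn gives one process per client without managing mp.Process by hand.

import socketserver

class PeripheralHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # self.request is the client socket; the connection is closed only
        # after handle() returns, so keep looping here while the client
        # should stay connected.
        while True:
            data = self.request.recv(4096)
            if not data:
                break
            self.request.sendall(data)  # placeholder for the real peripheral protocol

class ForkingTCPServer(socketserver.ForkingMixIn, socketserver.TCPServer):
    pass  # each incoming connection is handled in its own forked process

if __name__ == "__main__":
    with ForkingTCPServer(("0.0.0.0", 9000), PeripheralHandler) as server:
        server.serve_forever()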

across process boundary in scoped_session

I'm using SQLAlchemy and multiprocessing. I also use scoped_session since it avoids sharing the same session. I've found an error and its solution, but I don't understand why it happens.
You can see my code below:
db.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine(connection_string)
Session = sessionmaker(bind=engine)
DBSession = scoped_session(Session)
script.py
from multiprocessing import Pool, current_process
from db import DBSession

def process_feed(test):
    session = DBSession()
    print(current_process().name, session)

def run():
    session = DBSession()
    pool = Pool()
    print(current_process().name, session)
    pool.map_async(process_feed, [1, 2]).get()

if __name__ == "__main__":
    run()
When I run script.py, the output is:
MainProcess <sqlalchemy.orm.session.Session object at 0xb707b14c>
ForkPoolWorker-1 <sqlalchemy.orm.session.Session object at 0xb707b14c>
ForkPoolWorker-2 <sqlalchemy.orm.session.Session object at 0xb707b14c>
Note that the session object is the same (0xb707b14c) in the main process and its workers (child processes).
BUT if I change the order of the first two lines of run():
def run():
    pool = Pool()            # <--- Now pool is instanced in the first line
    session = DBSession()    # <--- Now session is instanced in the second line
    print(current_process().name, session)
    pool.map_async(process_feed, [1, 2]).get()
And when I run script.py again, the output is:
MainProcess <sqlalchemy.orm.session.Session object at 0xb66907cc>
ForkPoolWorker-1 <sqlalchemy.orm.session.Session object at 0xb669046c>
ForkPoolWorker-2 <sqlalchemy.orm.session.Session object at 0xb66905ec>
Now the session instances are different.
To understand why this happens, you need to understand what scoped_session and Pool actually do. scoped_session keeps a registry of sessions so that the following happens:
the first time you call DBSession, it creates a Session object for you in the registry
subsequently, if necessary conditions are met (i.e. same thread, session has not been closed), it does not create a new Session object and instead returns you the previously created Session object back
When you create a Pool, it creates the workers in the __init__ method. (Note that there's nothing fundamental about starting the worker processes in __init__. An equally valid implementation could wait until workers are first needed before it starts them, which would exhibit different behavior in your example.) When this happens (on Unix), the parent process forks itself for every worker process, which involves the operating system copying the memory of the current running process into a new process, so you will literally get the exact same objects in the exact same places.
Putting these two together, in the first example you are creating a Session before forking, which gets copied over to all worker processes during the creation of the Pool, resulting in the same identity, while in the second example you delay the creation of the Session object until after the worker processes have started, resulting in different identities.
It's important to note that while the Session objects share the same id, they are not the same object, in the sense that if you change anything about the Session in the parent process, they will not be reflected in the child processes. They just happen to all share the same memory address due to the fork. However, OS-level resources like connections are shared, so if you had run a query on session before Pool(), a connection would have been created for you in the connection pool and subsequently forked into the child processes. If you then attempt to perform queries in the child processes you will run into weird errors because your processes are clobbering over each other over the same exact connection!
The above is moot for Windows because Windows does not have fork().
TCP connections are represented as file descriptors, which usually work across process boundaries, meaning this will cause concurrent access to the file descriptor on behalf of two or more entirely independent Python interpreter states.
https://docs.sqlalchemy.org/en/13/core/pooling.html#using-connection-pools-with-multiprocessing
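Building on that, here is a minimal sketch of one way to keep child processes from reusing the parent's pooled connections, based on the pooling documentation linked above (the initializer name and the commit/remove calls are assumptions, not part of the original question):

from multiprocessing import Pool
from db import engine, DBSession

def _worker_init():
    # Discard any connections inherited via fork so each worker opens its own.
    engine.dispose()

def process_feed(test):
    session = DBSession()       # first use in this process creates a fresh Session
    try:
        # ... do the actual work with session ...
        session.commit()
    finally:
        DBSession.remove()      # release the connection and clear this process's registry entry

def run():
    with Pool(initializer=_worker_init) as pool:
        pool.map(process_feed, [1, 2])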

Using Celery to store class instantiated object

I am new to Celery & Python and have cursory knowledge of both.
I have multiple Ubuntu servers which all run multiple Celery workers (10-15).
Each of these workers needs to perform certain tasks using a third-party library/DLL. For that we first need to instantiate its class object and store it (somehow, in memory).
Then the Celery workers read RMQ queues to execute certain tasks which use the above class object's methods.
The goal is to instantiate the third-party class object once (when the Celery worker starts) and then, on each task execution, use the class instance's methods, over and over again.
I don't want to use Redis, as it seems like too much overhead to store such a tiny amount of data (a class object).
I need help in figuring out how to store this instantiated class object per worker. If the worker fails or crashes, obviously, we instantiate the class again, which is not a problem. Any help, specifically a code sample, will help a lot.
As an analogy, my requirement is similar to having a unique database connection per worker and reusing the same connection for every request.
Updated with some poorly written code for that:
tasks.py
from celery import Celery, Task

# Declares the config file and this worker file
mycelery = Celery('tasks')
mycelery.config_from_object('celeryconfig2')

class IQ(Task):
    _db = None

    @property
    def db(self):
        if self._db is None:
            print('establish DB connection....')
            self._db = Database.Connect()
        return self._db

@mycelery.task(base=IQ)
def indexIQ():
    print('calling indexing.....')
    if indexIQ.db is None:
        print("DB connection doesn't exist. Let's create one...")
        ....
        ....
    print('continue indexing!')
main.py
from tasks import *

indexIQ.apply_async()
indexIQ.apply_async()
indexIQ.apply_async()
print('end program')
Expected output
calling indexing.....
DB connection doesn't exist. Let's create one...
establish DB connection....
continue indexing!
calling indexing.....
continue indexing!
calling indexing.....
continue indexing!
Unfortunately, I am getting the first 4 lines of output every time, which means the DB connection is being made at each task execution. What am I doing wrong?
Thanks
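For reference, a minimal sketch of another common way to get one heavy object per worker process, using Celery's worker_process_init signal instead of a Task base class (Database.Connect() is the question's own placeholder; everything else follows the names used above):

from celery import Celery
from celery.signals import worker_process_init

mycelery = Celery('tasks')
mycelery.config_from_object('celeryconfig2')

_db = None  # module-level handle, one copy per worker process

@worker_process_init.connect
def init_worker(**kwargs):
    # Runs once in each worker process right after it starts.
    global _db
    print('establish DB connection....')
    _db = Database.Connect()

@mycelery.task
def indexIQ():
    print('calling indexing.....')
    # _db was created when this worker process started, so it is simply reused here.
    print('continue indexing!')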

Django - dictionary gets empty on next endpoint

I've a program which starts multiple threads with data from the database, and I store the thread objects in a dictionary.
thread_manager.py
threads = {}
............

# controller for /start-threads
def start_threads(request):
    datas = Data.objects.all()
    for data in datas:
        thread = MyThread(data)
        threads[data.id] = thread
        thread.start()
    return HttpResponse("all threads are running")

def get_thread(id, request):
    return threads[id]
At this point the threads dictionary has all the threads in it and I can access a thread object with threads[id]. Now if I try to get the thread from another endpoint (I'm using Django):
views.py
from django.http import HttpResponse

import thread_manager

def get_thread(request, id):
    thread = thread_manager.get_thread(id, request)
    return HttpResponse("got thread with id {0}".format(id))
The threads dictionary is empty at this point (of course I get a KeyError). If I run this on the local server everything works fine; if I run it on the live server, which has uWSGI running Django, it doesn't work. Is this a problem with uWSGI or am I doing something wrong? Thanks.
Your server is almost certainly running with more than one process. But threads belong to a single process; you can't access them from a different one.
You don't say what you're doing with these threads, but offline work is almost always better done with a specific system such as Celery.
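For illustration, a minimal sketch of handing the per-Data work to Celery instead of in-process threads (the broker URL, app name, and task body are assumptions, not from the original code); every task gets an id that any web process can later use to look it up:

# tasks.py
from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")  # hypothetical broker

@app.task
def process_data(data_id):
    # ... whatever MyThread.run() currently does for this Data row ...
    return data_id

# views.py
from celery.result import AsyncResult
from django.http import HttpResponse
from .models import Data  # assumed location of the question's model
from tasks import process_data

def start_tasks(request):
    task_ids = [process_data.delay(d.id).id for d in Data.objects.all()]
    # persist task_ids (database, cache, ...) so any other process can find them later
    return HttpResponse("all tasks are queued")

def get_task(request, task_id):
    state = AsyncResult(task_id).state  # readable from any web process
    return HttpResponse("task {0} is {1}".format(task_id, state))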

Show a progress bar for my multithreaded process

I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server side of the code looks like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()  # Close pool
pool.join()   # Wait for all tasks to finish

errors = not all(obj.successful() for obj in result_objs)
# extract result only from successful tasks
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see, I'm using apply_async because I want to later inspect each task and extract the result from it only if the task didn't raise any exception.
I understand that in order to show a progress bar on the client side, I need to publish the number of completed tasks somewhere, so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I have seen similar questions on SO, like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this enables you to raise the resolution of your progress report and report more to the user.
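As a minimal sketch of that idea under the question's multiprocessing.dummy setup (ProgressMsg, the completed counter, and drain_progress are assumptions, not from the answer):

import collections
import queue
from multiprocessing import Queue

ProgressMsg = collections.namedtuple("ProgressMsg", ["query", "status"])
progress_queue = Queue()

def do_work(q):
    # ... call the external service for query q ...
    try:
        progress_queue.put_nowait(ProgressMsg(query=q, status="done"))
    except Exception:
        pass  # never let progress reporting break the actual work
    return q

completed = 0  # whatever /api/v1.0/progress reads

def drain_progress():
    # Called periodically by the main process to update the counter.
    global completed
    while True:
        try:
            progress_queue.get_nowait()
        except queue.Empty:
            break
        completed += 1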
In general the approach you are taking is okay; I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time, then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.dowork, pass a lock object and the value to it:
worker.dowork(lock, processes)
In your worker.dowork code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards
