Load data in background thread with Python 3 - python

I am a bit frustrated about not being able to solve this seemingly simple problem:
I have a function that takes some time to load data:
def import_data(id):
time.sleep(5)
return 'data' + str(id)
A DataModel class calls this function and manages two datasets.
class DataModel():
def __init__(self):
self._data_1 = import_data(1)
self._data_2 = import_data(2)
def retrieve_data_1(self):
return self._data_1
def retrieve_data_2(self):
return self._data_2
Now, the main UI creates the DataModel, calling both import_data functions, which blocks it.
def main_ui():
# This takes 5 seconds for each dataset and blocks the main UI thread
dm = DataModel()
# Other stuff is happening. This time could be used to load data in the background
time.sleep(2)
# Retrieve the first dataset
data_1 = dm.retrieve_data_1()
# User interaction. This time could be used to load even larger datasets
time.sleep(10)
# Retrieve the second dataset
data_2 = dm.retrieve_data_2()
I want the datasets to be loaded in the background to reduce the time the UI is blocked.
My idea would be to implement it like this pseudocode:
class DataModel():
def __init__(self):
self._data_1 = Thread(import_data(1)).start()
self._data_2 = Thread(import_data(2)).start()
def retrieve_data_1(self):
return self._data_1.wait_for_result()
def retrieve_data_2(self):
return self._data_2.wait_for_result()
The import_data functions are called in separate threads and return Future objects.
The retrieve_data functions either block the main thread waiting for the Future to evaluate or return its result instantly.
Is there an easy way to implement this in Python 3.x with threading and/or asyncio? Thanks in advance!
(Edit: syntax correction)

Use the concurrent.futures module which is designed exactly for that kind of usage:
_pool = concurrent.futures.ThreadPoolExecutor()
class DataModel():
def __init__(self):
self._data_1 = _pool.submit(import_data, 1)
self._data_2 = _pool.submit(import_data, 2)
def retrieve_data_1(self):
return self._data_1.result()
def retrieve_data_2(self):
return self._data_2.result()
If your functions are global, and your data serializable, you can even seamlessly switch from ThreadPoolExecutor to ProcessPoolExecutor and benefit from true (process-based) parallelism.

Related

Python multiprocessing.Pool.apply_async() not executing class function

In a custom class I have the following code:
class CustomClass():
triggerQueue: multiprocessing.Queue
def __init__(self):
self.triggerQueue = multiprocessing.Queue()
def poolFunc(queueString):
print(queueString)
def listenerFunc(self):
pool = multiprocessing.Pool(5)
while True:
try:
queueString = self.triggerQueue.get_nowait()
pool.apply_async(func=self.poolFunc, args=(queueString,))
except queue.Empty:
break
What I intend to do is:
add a trigger to the queue (not implemented in this snippet) -> works as intended
run an endless loop within the listenerFunc that reads all triggers from the queue (if any are found) -> works as intended
pass trigger to poolFunc which is to be executed asynchronosly -> not working
It works as soon as I source my poolFun() outside of the class like
def poolFunc(queueString):
print(queueString)
class CustomClass():
[...]
But why is that so? Do I have to pass the self argument somehow? Is it impossible to perform it this way in general?
Thank you for any hint!
There are several problems going on here.
Your instance method, poolFunc, is missing a self parameter.
You are never properly terminating the Pool. You should take advantage of the fact that a multiprocessing.Pool object is a context manager.
You're calling apply_async, but you're never waiting for the results. Read the documentation: you need to call the get method on the AsyncResult object to receive the result; if you don't do this before your program exits your poolFunc function may never run.
By making the Queue object part of your class, you won't be able to pass instance methods to workers.
We can fix all of the above like this:
import multiprocessing
import queue
triggerQueue = multiprocessing.Queue()
class CustomClass:
def poolFunc(self, queueString):
print(queueString)
def listenerFunc(self):
results = []
with multiprocessing.Pool(5) as pool:
while True:
try:
queueString = triggerQueue.get_nowait()
results.append(pool.apply_async(self.poolFunc, (queueString,)))
except queue.Empty:
break
for res in results:
print(res.get())
c = CustomClass()
for i in range(10):
triggerQueue.put(f"testval{i}")
c.listenerFunc()
You can, as you mention, also replace your instance method with a static method, in which case we can keep triggerQueue as part of the class:
import multiprocessing
import queue
class CustomClass:
def __init__(self):
self.triggerQueue = multiprocessing.Queue()
#staticmethod
def poolFunc(queueString):
print(queueString)
def listenerFunc(self):
results = []
with multiprocessing.Pool(5) as pool:
while True:
try:
queueString = self.triggerQueue.get_nowait()
results.append(pool.apply_async(self.poolFunc, (queueString,)))
except queue.Empty:
break
for r in results:
print(r.get())
c = CustomClass()
for i in range(10):
c.triggerQueue.put(f"testval{i}")
c.listenerFunc()
But we still need to reap the pool_async results.
Okay, I found an answer and a workaround:
the answer is based the anser of noxdafox to this question.
Instance methods cannot be serialized that easily. What the Pickle protocol does when serialising a function is simply turning it into a string.
For a child process would be quite hard to find the right object your instance method is referring to due to separate process address spaces.
A functioning workaround is to declare the poolFunc() as static function like
#staticmethod
def poolFunc(queueString):
print(queueString)

How to make an unified interface for synchronous and asynchronous code in Python? [duplicate]

When implementing classes that have uses in both synchronous and asynchronous applications, I find myself maintaining virtually identical code for both use cases.
Just as an example, consider:
from time import sleep
import asyncio
class UselessExample:
def __init__(self, delay):
self.delay = delay
async def a_ticker(self, to):
for i in range(to):
yield i
await asyncio.sleep(self.delay)
def ticker(self, to):
for i in range(to):
yield i
sleep(self.delay)
def func(ue):
for value in ue.ticker(5):
print(value)
async def a_func(ue):
async for value in ue.a_ticker(5):
print(value)
def main():
ue = UselessExample(1)
func(ue)
loop = asyncio.get_event_loop()
loop.run_until_complete(a_func(ue))
if __name__ == '__main__':
main()
In this example, it's not too bad, the ticker methods of UselessExample are easy to maintain in tandem, but you can imagine that exception handling and more complicated functionality can quickly grow a method and make it more of an issue, even though both methods can remain virtually identical (only replacing certain elements with their asynchronous counterparts).
Assuming there's no substantial difference that makes it worth having both fully implemented, what is the best (and most Pythonic) way of maintaining a class like this and avoiding needless duplication?
There is no one-size-fits-all road to making an asyncio coroutine-based codebase useable from traditional synchronous codebases. You have to make choices per codepath.
Pick and choose from a series of tools:
Synchronous versions using asyncio.run()
Provide synchronous wrappers around coroutines, which block until the coroutine completes.
Even an async generator function such as ticker() can be handled this way, in a loop:
class UselessExample:
def __init__(self, delay):
self.delay = delay
async def a_ticker(self, to):
for i in range(to):
yield i
await asyncio.sleep(self.delay)
def ticker(self, to):
agen = self.a_ticker(to)
try:
while True:
yield asyncio.run(agen.__anext__())
except StopAsyncIteration:
return
These synchronous wrappers can be generated with helper functions:
from functools import wraps
def sync_agen_method(agen_method):
#wraps(agen_method)
def wrapper(self, *args, **kwargs):
agen = agen_method(self, *args, **kwargs)
try:
while True:
yield asyncio.run(agen.__anext__())
except StopAsyncIteration:
return
if wrapper.__name__[:2] == 'a_':
wrapper.__name__ = wrapper.__name__[2:]
return wrapper
then just use ticker = sync_agen_method(a_ticker) in the class definition.
Straight-up coroutine methods (not generator coroutines) could be wrapped with:
def sync_method(async_method):
#wraps(async_method)
def wrapper(self, *args, **kwargs):
return async.run(async_method(self, *args, **kwargs))
if wrapper.__name__[:2] == 'a_':
wrapper.__name__ = wrapper.__name__[2:]
return wrapper
Factor out common components
Refactor out the synchronous parts, into generators, context managers, utility functions, etc.
For your specific example, pulling out the for loop into a separate generator would minimise the duplicated code to the way the two versions sleep:
class UselessExample:
def __init__(self, delay):
self.delay = delay
def _ticker_gen(self, to):
yield from range(to)
async def a_ticker(self, to):
for i in self._ticker_gen(to):
yield i
await asyncio.sleep(self.delay)
def ticker(self, to):
for i in self._ticker_gen(to):
yield i
sleep(self.delay)
While this doesn't make much of any difference here it can work in other contexts.
Abstract Syntax Tree tranformation
Use AST rewriting and a map to transform coroutines into synchronous code. This can be quite fragile if you are not careful on how you recognise utility functions such as asyncio.sleep() vs time.sleep():
import inspect
import ast
import copy
import textwrap
import time
asynciomap = {
# asyncio function to (additional globals, replacement source) tuples
"sleep": ({"time": time}, "time.sleep")
}
class AsyncToSync(ast.NodeTransformer):
def __init__(self):
self.globals = {}
def visit_AsyncFunctionDef(self, node):
return ast.copy_location(
ast.FunctionDef(
node.name,
self.visit(node.args),
[self.visit(stmt) for stmt in node.body],
[self.visit(stmt) for stmt in node.decorator_list],
node.returns and ast.visit(node.returns),
),
node,
)
def visit_Await(self, node):
return self.visit(node.value)
def visit_Attribute(self, node):
if (
isinstance(node.value, ast.Name)
and isinstance(node.value.ctx, ast.Load)
and node.value.id == "asyncio"
and node.attr in asynciomap
):
g, replacement = asynciomap[node.attr]
self.globals.update(g)
return ast.copy_location(
ast.parse(replacement, mode="eval").body,
node
)
return node
def transform_sync(f):
filename = inspect.getfile(f)
lines, lineno = inspect.getsourcelines(f)
ast_tree = ast.parse(textwrap.dedent(''.join(lines)), filename)
ast.increment_lineno(ast_tree, lineno - 1)
transformer = AsyncToSync()
transformer.visit(ast_tree)
tranformed_globals = {**f.__globals__, **transformer.globals}
exec(compile(ast_tree, filename, 'exec'), tranformed_globals)
return tranformed_globals[f.__name__]
While the above is probably far from complete enough to fit all needs, and transforming AST trees can be daunting, the above would let you maintain just the async version and map that version to synchronous versions directly:
>>> import example
>>> del example.UselessExample.ticker
>>> example.main()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../example.py", line 32, in main
func(ue)
File "/.../example.py", line 21, in func
for value in ue.ticker(5):
AttributeError: 'UselessExample' object has no attribute 'ticker'
>>> example.UselessExample.ticker = transform_sync(example.UselessExample.a_ticker)
>>> example.main()
0
1
2
3
4
0
1
2
3
4
async/await is infectious by design.
Accept that your code will have different users — synchronous and asynchronous, and that these users will have different requirements, that over time the implementations will diverge.
Publish separate libraries
For example, compare aiohttp vs. aiohttp-requests vs. requests.
Likewise, compare asyncpg vs. psycopg2.
How to get there
Opt1. (easy) clone implementation, allow them to diverge.
Opt2. (sensible) partial refactor, let e.g. async library depend on and import sync library.
Opt3. (radical) create a "pure" library that can be used both in sync and async program. For example, see https://github.com/python-hyper/hyper-h2 .
On the upside, testing is easier and thorough. Consider how hard (or impossible) it is force the test framework to evaluate all possible concurrent execution orders in an async program. Pure library doesn't need that :)
On the down-side this style of programming requires different thinking, is not always straightforward, and may be suboptimal. For example, instead of await socket.read(2**20) you'd write for event in fsm.push(data): ... and rely on your library user to provide you with data in good-sized chunks.
For context, see the backpressure argument in https://vorpus.org/blog/some-thoughts-on-asynchronous-api-design-in-a-post-asyncawait-world/

How to access instance variables from a threaded instance method?

I have been struggling to figure out how to access the the instance variable from an instance method running in a separate thread like in the example below. I thought the loop would be skipped after the test.set_finished method is called from the main thread after 5 seconds.
However my loop keeps counting and I fail to understand why. Would be nice if someone could help me out a little and tell me what I need to do differently or what I'm trying to do is not possible at all.
import time
import concurrent.futures
class TestClass(object):
def __init__(self):
super(TestClass, self).__init__()
self.finished = False
def do_something(self):
i = 0
while not self.finished:
i+=1
print(i)
time.sleep(1)
def set_finished(self, finished_arg):
self.finished = finished_arg
def startThreadedMethod(self):
with concurrent.futures.ThreadPoolExecutor() as executor:
t1 = executor.submit(self.do_something)
test = TestClass()
test.startThreadedMethod()
time.sleep(5)
test.set_finished(True)

How to apply the multiprocessing module on my PyQt-GUI

I know this is an old question and have already found answers in other questions like this thread here. However, I have some problems applying it in my case.
The way I have things right now are the following: I have my MainWindow class where I can input some data. Then I have a Worker class which is a PySide2.QtCore.QThread object. To this class I pass some input data from the MainWindow. Inside this Worker class I have a method which sets up some ODEs, which in another method of the Worker class are being solved by scipy.integrate.solve_ivp. When the integration is done, I send the results via a signal back to the MainWindow. So the code roughly looks like this:
import PySide2
from scipy.integrate import solve_ivp
class Worker(QtCore.QThread):
def __init__(self,*args,**kwargs):
super(Worker,self).__init__()
"Here I collect input parameters"
def run(self):
"Here I call solve_ivp for the integration and send a signal with the
solution when it is done"
def ode_fun(self,t,c):
"Function where the ode equations are set up"
class Ui_MainWindow(QtWidgets.QMainWindow):
def __init__(self):
"set up the GUI"
self.btnStartSimulation.clicked.connect(self.start_simulation) #button to start the integration
def start_simulation(self):
self.watchthread(Worker)
self.thread.start()
def watchthread(self,worker):
self.thread = worker("input values")
"connect to signals from the thread"
Now I understand, that using the multiprocessing module I should be able to run the thread with the integration on another processor core to make it faster and make the GUI less laggy. However, from the link above I am not sure how I should apply this module or even how to restructure my code. Do I have to put the code that I now have in my Worker class into another class or am I somehow able to apply the multiprocessing module on my existing thread?
Any help is greatly appreciated!
Edit:
The new code looks like this:
class Worker(QtCore.QThread):
def __init__(self,*args,**kwargs):
super(Worker,self).__init__()
self.operation_parameters = args[0]
self.growth_parameters = args[1]
self.osmolality_parameters = args[2]
self.controller_parameters = args[3]
self.c_zero = args[4]
def run(self):
data = multiprocessing.Queue()
input_dict = {"function": self.ode_fun_vrabel_rushton_scaba_cont_co2_oxygen_biomass_metabol,
"time": [0, self.t_final],
"initial values": self.c_zero}
data.put(input_dict)
self.ode_process = multiprocessing.Process(target=self.multi_process_function, args=(data,))
self.ode_process.start()
self.solution = data.get()
def multi_process_function(self,data):
self.message_signal = True
input_dict = data.get()
solution = solve_ivp(input_dict["function"], input_dict["time"],
input_dict["initial values"], method="BDF")
data.put(solution)
def ode_fun(self,t,c):
"Function where the ode equations are set up"
(...) = self.operation_parameters
(...) = self.growth_parameters
(...) = self.osmolality_parameters
(...) = self.controller_parameters
Is it okay if I access the parameters in the ode_fun function via self."parameter_name"? Or do I also have to pass them with the data-parameter?
With the current code I receive the following error: TypeError: can't pickle Worker objects
You could call it from your worker like this:
import PySide2
from scipy.integrate import solve_ivp
import multiprocessing
class Worker(QtCore.QThread):
def __init__(self,*args,**kwargs):
super(Worker, self).__init__()
self.ode_process = None
"Here I collect input parameters"
def run(self):
"Here I call solve_ivp for the integration and send a signal with the solution when it is done"
data = multiprocessing.Queue()
data.put("all objects needed in the process, i would suggest a single dict from which you extract all data")
self.ode_process = multiprocessing.Process(target="your heavy duty function", args=(data,))
self.ode_process.start() # this is non blocking
# if you want it to block:
self.ode_process.join()
# make sure you remove all input data from the queue and fill it with the result, then to get it back:
results = data.get()
print(results) # or do with it what you want to do...
def ode_fun(self, t, c):
"Function where the ode equations are set up"
class Ui_MainWindow(QtWidgets.QMainWindow):
def __init__(self):
"set up the GUI"
self.btnStartSimulation.clicked.connect(self.start_simulation) #button to start the integration
def start_simulation(self):
self.watchthread(Worker)
self.thread.start()
def watchthread(self,worker):
self.thread = worker("input values")
"connect to signals from the thread"
Also beware that you would overwrite the running process now every time you press to start the simulation. You may want to use some sort of lock for that.

python multiprocessing manager & composite pattern sharing

I'm trying to share a composite structure through a multiprocessing manager but I felt in trouble with a "RuntimeError: maximum recursion depth exceeded" when trying to use just one of the Composite class methods.
The class is token from code.activestate and tested by me before inclusion into the manager.
When retrieving the class into a process and invoking its addChild() method I kept the RunTimeError, while outside the process it works.
The composite class inheritates from a SpecialDict class, that implements a ** ____getattr()____ **
method.
Could be possible that while calling addChild() the interpreter of python looks for a different ** ____getattr()____ ** because the right one is not proxied by the manager?
If so It's not clear to me the right way to make a proxy to that class/method
The following code reproduce exactly this condition:
1) this is the manager.py:
from multiprocessing.managers import BaseManager
from CompositeDict import *
class PlantPurchaser():
def __init__(self):
self.comp = CompositeDict('Comp')
def get_cp(self):
return self.comp
class Manager():
def __init__(self):
self.comp = QueuePurchaser().get_cp()
BaseManager.register('get_comp', callable=lambda:self.comp)
self.m = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
self.s = self.m.get_server()
self.s.serve_forever()
2) I want to use the composite into this consumer.py:
from multiprocessing.managers import BaseManager
class Consumer():
def __init__(self):
BaseManager.register('get_comp')
self.m = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
self.m.connect()
self.comp = self.m.get_comp()
ret = self.comp.addChild('consumer')
3) run all launching by a controller.py:
from multiprocessing import Process
class Controller():
def __init__(self):
for child in _run_children():
child.join()
def _run_children():
from manager import Manager
from consumer import Consumer as Consumer
procs = (
Process(target=Manager, name='Manager' ),
Process(target=Consumer, name='Consumer'),
)
for proc in procs:
proc.daemon = 1
proc.start()
return procs
c = Controller()
Take a look this related questions on how to do a proxy for CompositeDict() class
as suggested by AlberT.
The solution given by tgray works but cannot avoid race conditions
Is it possible there is a circular reference between the classes? For example, the outer class has a reference to the composite class, and the composite class has a reference back to the outer class.
The multiprocessing manager works well, but when you have large, complicated class structures, then you are likely to run into an error where a type/reference can not be serialized correctly. The other problem is that errors from multiprocessing manager are very cryptic. This makes debugging failure conditions even more difficult.
I think the problem is that you have to instruct the Manager on how to manage you object, which is not a standard python type.
In other worlds you have to create a proxy for you CompositeDict
You could look at this doc for an example: http://ruffus.googlecode.com/svn/trunk/doc/html/sharing_data_across_jobs_example.html
Python has a default maximum recursion depth of 1000 (or 999, I forget...). But you can change the default behavior thusly:
import sys
sys.setrecursionlimit(n)
Where n is the number of recursions you wish to allow.
Edit:
The above answer does nothing to solve the root cause of this problem (as pointed out in the comments). It only needs to be used if you are intentionally recursing more than 1000 times. If you are in an infinite loop (like in this problem), you will eventually hit whatever limit you set.
To address your actual problem, I re-wrote your code from scratch starting as simply as I could make it and built it up to what I believe is what you want:
import sys
from multiprocessing import Process
from multiprocessing.managers import BaseManager
from CompositDict import *
class Shared():
def __init__(self):
self.comp = CompositeDict('Comp')
def get_comp(self):
return self.comp
def set_comp(self, c):
self.comp = c
class Manager():
def __init__(self):
shared = Shared()
BaseManager.register('get_shared', callable=lambda:shared)
mgr = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
srv = mgr.get_server()
srv.serve_forever()
class Consumer():
def __init__(self, child_name):
BaseManager.register('get_shared')
mgr = BaseManager(address=('127.0.0.1', 50000), authkey='abracadabra')
mgr.connect()
shared = mgr.get_shared()
comp = shared.get_comp()
child = comp.addChild(child_name)
shared.set_comp(comp)
print comp
class Controller():
def __init__(self):
pass
def main(self):
m = Process(target=Manager, name='Manager')
m.daemon = True
m.start()
consumers = []
for i in xrange(3):
p = Process(target=Consumer, name='Consumer', args=('Consumer_' + str(i),))
p.daemon = True
consumers.append(p)
for c in consumers:
c.start()
for c in consumers:
c.join()
return 0
if __name__ == '__main__':
con = Controller()
sys.exit(con.main())
I did this all in one file, but you shouldn't have any trouble breaking it up.
I added a child_name argument to your consumer so that I could check that the CompositDict was getting updated.
Note that there is both a getter and a setter for your CompositDict object. When I only had a getter, each Consumer was overwriting the CompositDict when it added a child.
This is why I also changed your registered method to get_shared instead of get_comp, as you will want access to the setter as well as the getter within your Consumer class.
Also, I don't think you want to try joining your manager process, as it will "serve forever". If you look at the source for the BaseManager (./Lib/multiprocessing/managers.py:Line 144) you'll notice that the serve_forever() function puts you into an infinite loop that is only broken by KeyboardInterrupt or SystemExit.
Bottom line is that this code works without any recursive looping (as far as I can tell), but let me know if you still experience your error.

Categories