Libraries wrapping Python threads

What libraries exist that provide a higher level interface to concurrency in Python? I'm not looking to necessarily make use of multiple cores - if the GIL serializes everything, that's fine with me. I just want a smoother/higher level interface to concurrency.
My application is writing tests of our software, and sometimes I'd like to, say, run several simulated users, as well as code that makes changes to the backend, all in parallel. It would be great if, say, I could do something like this:
setup_front_end()
setup_back_end()
parallel:
    while some_condition():
        os.system('wget http://...')
        sleep(1)
    for x in xrange(10):
        os.system('top >> top.log')
        sleep(1)
    for x in xrange(100):
        mess_with_backend()
The idea in the above code is that we start 3 threads, the first one runs:
while some_condition():
    os.system('wget http://...')
    sleep(1)
The second one runs the second loop, and the third one runs the third.
I know it won't be that simple, but is there something that can take the drudgery out of writing functions, spinning up threads, joining on them, etc?

Besides the threading module, you should also have a look at the concurrent.futures module (for Python 2 there is a backport).
It provides ready-to-use thread/process pool implementations and lets you write efficient concurrent code. Since it provides a uniform API for threading and multiprocessing, you can switch easily between the two should you ever need to.
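A minimal sketch of how that could cover the scenario in the question (the three helper functions below are hypothetical stand-ins for the loops above):

from concurrent.futures import ThreadPoolExecutor  # or ProcessPoolExecutor

def poll_frontend():
    ...  # hypothetical: the wget loop from the question

def log_top():
    ...  # hypothetical: the 'top >> top.log' loop

def hammer_backend():
    ...  # hypothetical: the mess_with_backend() loop

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(f) for f in (poll_frontend, log_top, hammer_backend)]
    # leaving the with-block waits for all three tasks to finish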

If you just want to hide the complexity of creating a Thread and setting the parameters, create a simple decorator and use it for your functions:
async.py:
from threading import Thread

class async(object):
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        Thread(target=self.func, args=args, kwargs=kwargs).start()
This is just a very simple wrapper around your functions which creates a new Thread and executes the function in the Thread. If you want, you can easily adapt this to use processes instead of threads.
main.py:
import time
from async import async

@async
def long_function(name, seconds_to_wait):
    print "%s started..." % name
    time.sleep(seconds_to_wait)
    print "%s ended after %d seconds" % (name, seconds_to_wait)

if __name__ == '__main__':
    long_function("first", 3)
    long_function("second", 4)
    long_function("third", 5)
Just decorate your functions with @async and you can call them asynchronously.
In your case: just wrap your loops in functions and decorate them with @async.
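For example, a rough sketch reusing the decorator above together with the question's own placeholder functions (some_condition, mess_with_backend, etc. are assumed to exist):

import os
from time import sleep
from async import async   # the decorator defined above

@async
def front_end_loop():
    while some_condition():
        os.system('wget http://...')
        sleep(1)

@async
def backend_loop():
    for x in xrange(100):
        mess_with_backend()

setup_front_end()
setup_back_end()
front_end_loop()   # returns immediately; the loop runs in its own thread
backend_loop()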
For a full-featured asynchronous decorator, look at: http://wiki.python.org/moin/PythonDecoratorLibrary#Asynchronous_Call

Python asyncio: Queue concurrent to normal code

Edit: I am closing this question.
As it turns out, my goal of having parallel HTTP posts is pointless. After implementing it successfully with aiohttp, I ran into deadlocks elsewhere in the pipeline.
I will reformulate this and post a single question in a few days.
Problem
I want to have a class that, during some other computation, holds generated data and can write it to a DB via HTTP (details below) when convenient. It's gotta be a class as it is also used to load/represent/manipulate data.
I have written a naive, nonconcurrent implementation that works:
The class is initialized and then used in a "main loop". In this main loop, data is added to a naive "queue" (a list of HTTP requests). At certain intervals in the main loop, the class calls a function to write those requests and clear the "queue".
As you would expect, this is I/O bound. Whenever I need to write the "queue", the main loop halts. Furthermore, since the main computation runs on a GPU, the loop is not really CPU bound either.
Essentially, I want to have a queue, and, say, ten workers running in the background and pushing items to the http connector, waiting for the push to finish and then taking on the next (or just waiting for the next write call, not a big deal). In the meantime, my main loop runs and adds to the queue.
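Roughly, what I picture is something like this (a plain-threads sketch; worker and post are made-up names, and post stands in for self.connector.run):

import queue
import threading

def worker(q, post):
    # each worker blocks on the queue, posts one batch, then waits for the next
    while True:
        items = q.get()
        try:
            post(items)            # hypothetical: stands in for self.connector.run
        finally:
            q.task_done()

q = queue.Queue()
for _ in range(10):
    threading.Thread(target=worker, args=(q, print), daemon=True).start()

# the main loop keeps running and only calls q.put(batch);
# at shutdown, q.join() blocks until every queued batch has been posted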
Program example
My naive program looks something like this
class data_storage(...):
    def add(...):
        ...

    def write_queue(self):
        if len(self.queue) > 0:
            res = self.connector.run(self.queue)
            self.queue = []

def main_loop(storage):
    # do many things
    for batch in dataset:  # simplified example
        # Do stuff
        for some_other_loop:
            (...)
            storage.add(results)
        # For example, call each iteration
        storage.write_queue()

if __name__ == "__main__":
    storage = data_storage()
    main_loop(storage)
    ...
In detail: the connector class comes from the 'neo4j-connector' package and is used to post to my Neo4j database. It essentially does JSON formatting and uses Python's 'requests' library.
This works, even without a real queue, since nothing is concurrent.
Now I have to make it work concurrently.
From my research, I have seen that ideally I would want a "producer-consumer" pattern, where both are initialized via asyncio. I have only seen this implemented via functions, not classes, so I don't know how to approach this. With functions, my main loop should be a producer coroutine and my write function becomes the consumer. Both are initiated as tasks on the queue and then gathered, where I'd initialize only one producer but many consumers.
My issue is that the main loop includes parts that are already parallel (e.g. PyTorch). Asyncio is not thread safe, so I don't think I can just wrap everything in async decorators and make a co-routine. This is also precisely why I want the DB logic in a separate class.
I also don't actually want or need the main loop to run "concurrently" on the same thread with the workers. But it's fine if that's the outcome as the workers don't do much on the CPU. But technically speaking, I want multi-threading? I have no idea.
My only other option would be to write into the queue until it is "full", halt the loop and then use multiple threads to dump it to the DB. Still, this would be much slower than doing it while the main loop is running. My gain would be minimal, just concurrency while working through the queue. I'd settle for it if need be.
However, based on a Stack Overflow post, I came up with this small change:
class data_storage(...):
    def add(...):
        ...

    def background(f):
        def wrapped(*args, **kwargs):
            return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
        return wrapped

    @background
    def write_queue(self):
        if len(self.queue) > 0:
            res = self.connector.run(self.queue)
            self.queue = []
Shockingly this sort of "works" and is blazingly fast. Of course since it's not a real queue, things get overwritten. Furthermore, this overwhelms or deadlocks the HTTP API and in general produces a load of errors.
But since this works in principle, I wonder if I could do the following:
class data_storage(...):
    def add(...):
        ...

    def background(f):
        def wrapped(*args, **kwargs):
            return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
        return wrapped

    @background
    def post(self, items):
        if len(items) > 0:
            self.nr_workers.increase()
            res = self.connector.run(items)
            self.nr_workers.decrease()

    def write_queue(self):
        if self.nr_workers < 10:
            items = self.queue.get(200)  # Extract and delete from queue, non-concurrent
            self.post(items)             # Add "Worker"
for some hypothetical queue and nr_workers objects. Then, at the end of the main loop, a function would block progress until the number of workers is zero and clear the rest of the queue non-concurrently.
This seems like a monumentally bad idea, but I don't know how else to implement it. If this is terrible, I'd like to know before I do more work on it. Do you think it would work?
Otherwise, could you give me any pointers as how to approach this situation correctly?
Some key words, tools or things to research would of course be enough.
Thank you!

How to run a code after 10 seconds for a single time in python?

I've seen a few multithreading posts about running code every 10 seconds, but how do you run code only once after x seconds?
Specifically, I am trying to create a class method that executes some code and calls another method after 10 seconds, once, but still allows other methods to be called in the meantime.
I suggest using Timer
E.g.:
from threading import Timer

class Test:
    def start(self):
        Timer(10, self.some_method, ()).start()

    def some_method(self):
        print "called some_method after 10 seconds"

t = Test()
t.start()
In C++ I use Boost to do similar tasks. Look at deadline_timer http://www.boost.org/doc/libs/1_66_0/doc/html/boost_asio/reference/deadline_timer.html for example. The page has some examples as well to get you started. Unfortunately you need to use io_service to make use of deadline_timer.
Python has the Twisted framework to do the same: http://twistedmatrix.com/documents/13.1.0/api/twisted.internet.interfaces.IReactorTime.html#callLater
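For example, a minimal callLater sketch (assuming the twisted package is installed) might look like:

from twisted.internet import reactor

def fire_once():
    print("called once after 10 seconds")
    reactor.stop()

reactor.callLater(10, fire_once)   # schedules a single call
reactor.run()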
In the systems programming world you could use timer_create (http://man7.org/linux/man-pages/man2/timer_create.2.html), for example, depending on the OS you are using, but I would not go that route unless you have a good reason to do so.
Implement a @timeout decorator and put it before the function you want to set up.
Use signal on a Unix-like OS; refer to timeout_decorator. Since Windows doesn't implement signals at the system level, you have to use InterruptableThread.
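Roughly, a signal-based @timeout decorator might look like this on a Unix-like OS (a sketch of the idea, not the timeout_decorator package itself):

import signal

def timeout(seconds):
    # Unix-only sketch: SIGALRM interrupts the wrapped call after `seconds`
    def decorator(func):
        def handler(signum, frame):
            raise TimeoutError("timed out")
        def wrapper(*args, **kwargs):
            old_handler = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)                      # cancel the pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator

@timeout(10)
def slow_call():
    ...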

Running multiple independent python scripts concurrently

My goal is to create one main Python script that executes multiple independent Python scripts on Windows Server 2012 at the same time. One of the benefits in my mind is that I can point Task Scheduler to one main.py script instead of multiple .py scripts. My server has 1 CPU. I have read about multiprocessing, threading & subprocess, which only added to my confusion a bit. I am basically running multiple trading scripts for different stock symbols, all at the same time after market open at 9:30 EST. The following is my attempt, but I have no idea whether this is right. Any direction/feedback is highly appreciated!
import subprocess
subprocess.Popen(["python", '1.py'])
subprocess.Popen(["python", '2.py'])
subprocess.Popen(["python", '3.py'])
subprocess.Popen(["python", '4.py'])
I think I'd try to do it like this:
from multiprocessing import Pool

def do_stuff_with_stock_symbol(symbol):
    return _call_api()

if __name__ == '__main__':
    symbols = ["GOOG", "APPL", "TSLA"]
    p = Pool(len(symbols))
    results = p.map(do_stuff_with_stock_symbol, symbols)
    print(results)
(Modified example from multiprocessing introduction: https://docs.python.org/3/library/multiprocessing.html#introduction)
Consider using a constant pool size if you deal with a lot of stock symbols, because every python process will use some amount of memory.
Also, please note that using threads might be a lot better if you are dealing with an I/O bound workload (calling an API, writing and reading from disk). Processes really become necessary with python when dealing with compute bound workloads (because of the global interpreter lock).
An example using threads and the concurrent futures library would be:
import concurrent.futures

TIMEOUT = 60

def do_stuff_with_stock_symbol(symbol, timeout):
    return _call_api()

if __name__ == '__main__':
    symbols = ["GOOG", "APPL", "TSLA"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(symbols)) as executor:
        results = {executor.submit(do_stuff_with_stock_symbol, symbol, TIMEOUT): symbol
                   for symbol in symbols}
        for future in concurrent.futures.as_completed(results):
            symbol = results[future]
            try:
                data = future.result()
            except Exception as exc:
                print('{} generated an exception: {}'.format(symbol, exc))
            else:
                print('stock symbol: {}, result: {}'.format(symbol, data))
(Modified example from: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example)
Note that threads will still use some memory, but less than processes.
You could use asyncio or green threads if you want to reduce memory consumption per stock symbol to a minimum, but at some point you will run into network bandwidth problems because of all the concurrent API calls :)
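For reference, an asyncio version might be sketched roughly like this (fetch_symbol is a hypothetical coroutine; a real one would need a non-blocking HTTP client):

import asyncio

async def fetch_symbol(symbol):
    # hypothetical coroutine standing in for a non-blocking version of _call_api
    await asyncio.sleep(0)
    return symbol

async def main():
    symbols = ["GOOG", "APPL", "TSLA"]
    results = await asyncio.gather(*(fetch_symbol(s) for s in symbols))
    print(results)

asyncio.run(main())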
While what you're asking might not be the best way to handle what you're doing, I've wanted to do similar things in the past and it took a while to find what I needed so to answer your question:
I'm not promising this to be the "best" way to do it, but it worked in my use case.
I created a class I wanted to use to extend threading.
thread.py
"""
Extends threading.Thread giving access to a Thread object which will accept
A thread_id, thread name, and a function at the time of instantiation. The
function will be called when the threads start() method is called.
"""
import threading
class Thread(threading.Thread):
def __init__(self, thread_id, name, func):
threading.Thread.__init__(self)
self.threadID = thread_id
self.name = name
# the function that should be run in the thread.
self.func = func
def run(self):
return self.func()
I needed some work done that was part of another package
work_module.py
import ...

def func_that_does_work():
    # do some work
    pass

def more_work():
    # do some work
    pass
Then the main script I wanted to run
main.py
from thread import Thread
import work_module as wm
mythreads = []
mythreads.append(Thread(1, "a_name", wm.func_that_does_work))
mythreads.append(Thread(2, "another_name", wm.more_work))
for t in mythreads:
    t.start()
The threads die when run() returns. Since this extends threading.Thread, there are several options available in the docs: https://docs.python.org/3/library/threading.html
If all you're looking to do is automate the startup, creating a .bat file is a great and simple alternative to trying to do it with another Python script.
The example linked in the comments shows how to do it with bash on Unix-based machines, but batch files can do a very similar thing with the START command:
start_py.bat:
START "" /B "path\to\python.exe" "path\to\script_1.py"
START "" /B "path\to\python.exe" "path\to\script_2.py"
START "" /B "path\to\python.exe" "path\to\script_3.py"
The full syntax for START can be found here.

How do I run two python loops concurrently?

Suppose I have the following in Python
# A loop
for i in range(10000):
    Do Task A

# B loop
for i in range(10000):
    Do Task B
How do I run these loops simultaneously in Python?
If you want concurrency, here's a very simple example:
from multiprocessing import Process

def loop_a():
    while 1:
        print("a")

def loop_b():
    while 1:
        print("b")

if __name__ == '__main__':
    Process(target=loop_a).start()
    Process(target=loop_b).start()
This is just the most basic example I could think of. Be sure to read http://docs.python.org/library/multiprocessing.html to understand what's happening.
If you want to send data back to the program, I'd recommend using a Queue (which in my experience is easiest to use).
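For instance, a small sketch of handing results back through a multiprocessing.Queue (reusing a trimmed loop_a from above):

from multiprocessing import Process, Queue

def loop_a(q):
    for i in range(3):
        q.put(("a", i))        # hand results back to the parent process

if __name__ == '__main__':
    q = Queue()
    p = Process(target=loop_a, args=(q,))
    p.start()
    for _ in range(3):
        print(q.get())         # receive whatever the child produced
    p.join()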
You can use a thread instead if you don't mind the global interpreter lock. Processes are more expensive to instantiate but they offer true concurrency.
There are many possible options for what you want:
use loop
As many people have pointed out, this is the simplest way.
for i in xrange(10000):
    # use xrange instead of range
    taskA()
    taskB()
Merits: easy to understand and use, no extra library needed.
Drawbacks: taskB must run after taskA (or the other way around); they can't run simultaneously.
multiprocess
Another thought would be: run two processes at the same time. Python provides the multiprocessing library; the following is a simple example:
from multiprocessing import Process

p1 = Process(target=taskA, args=args, kwargs=kwargs)
p2 = Process(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
Merits: tasks can run simultaneously in the background; you can control tasks (end or stop them, etc.); tasks can exchange data and can be synchronized if they compete for the same resources; etc.
Drawbacks: too heavy! The OS will frequently switch between them, and they each have their own data space even if the data is redundant. If you have a lot of tasks (say 100 or more), it's not what you want.
threading
threading is like multiprocessing, just lightweight. Check out this post. Their usage is quite similar:
import threading

p1 = threading.Thread(target=taskA, args=args, kwargs=kwargs)
p2 = threading.Thread(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
coroutines
Libraries like greenlet and gevent provide something called coroutines, which are supposed to be faster than threading. See their docs for how to use them; a tiny gevent sketch follows below.
Merits: more flexible and lightweight.
Drawbacks: extra library needed, learning curve.
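A tiny gevent sketch (taskA and taskB stand in for your own functions; the gevent package must be installed):

import gevent

def taskA():
    gevent.sleep(0)    # yield so other greenlets can run

def taskB():
    gevent.sleep(0)

gevent.joinall([gevent.spawn(taskA), gevent.spawn(taskB)])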
Why do you want to run the two processes at the same time? Is it because you think they will go faster? There is a good chance that they won't. Why not run the tasks in the same loop, e.g.
for i in range(10000):
    doTaskA()
    doTaskB()
The obvious answer to your question is to use threads; see the Python threading module. However, threading is a big subject and has many pitfalls, so read up on it before you go down that route.
Alternatively you could run the tasks in separate processes, using the Python multiprocessing module. If both tasks are CPU intensive this will make better use of multiple cores on your computer.
There are other options such as coroutines, stackless tasklets, greenlets, CSP, etc., but without knowing more about Task A and Task B and why they need to run at the same time it is impossible to give a more specific answer.
from threading import Thread

def loopA():
    for i in range(10000):
        # Do task A
        pass

def loopB():
    for i in range(10000):
        # Do task B
        pass

threadA = Thread(target=loopA)
threadB = Thread(target=loopB)
threadA.start()  # start(), not run(), so each loop gets its own thread
threadB.start()

# Do work independent of loopA and loopB
threadA.join()
threadB.join()
You could use threading or multiprocessing.
How about a single loop: for i in range(10000): do Task A, then do Task B? Without more information I don't have a better answer.
I find that using the "pool" submodule within "multiprocessing" works amazingly for executing multiple processes at once within a Python Script.
See Section: Using a pool of workers
Look carefully at "# launching multiple evaluations asynchronously may use more processes" in the example. Once you understand what those lines are doing, the following example I constructed will make a lot of sense.
import numpy as np
from multiprocessing import Pool

def desired_function(option, processes, data, etc...):
    # your code will go here. option allows you to make choices within your script
    # to execute desired sections of code for each pool or subprocess.
    return result_array  # "for example"

result_array = np.zeros("some shape")  # This is normally populated by 1 loop, let's try 4.
processes = 4
pool = Pool(processes=processes)
args = (processes, data, etc...)  # Arguments to be passed into the desired function.
multiple_results = []
for i in range(processes):  # Executes each pool w/ option (1-4 in this case).
    multiple_results.append(pool.apply_async(desired_function, (i + 1,) + args))  # Starts each asynchronously.
results = np.array([res.get() for res in multiple_results])  # Retrieves results after every pool is finished!
for i in range(processes):
    result_array = result_array + results[i]  # Combines all datasets!
The code will basically run the desired function for a set number of processes. You will have to carefully make sure your function can distinguish between each process (hence why I added the variable "option".) Additionally, it doesn't have to be an array that is being populated in the end, but for my example, that's how I used it. Hope this simplifies or helps you better understand the power of multiprocessing in Python!

How to do a non-blocking URL fetch in Python

I am writing a GUI app in Pyglet that has to display tens to hundreds of thumbnails from the Internet. Right now, I am using urllib.urlretrieve to grab them, but this blocks each time until they are finished, and only grabs one at a time.
I would prefer to download them in parallel and have each one display as soon as it's finished, without blocking the GUI at any point. What is the best way to do this?
I don't know much about threads, but it looks like the threading module might help? Or perhaps there is some easy way I've overlooked.
You'll probably benefit from threading or multiprocessing modules. You don't actually need to create all those Thread-based classes by yourself, there is a simpler method using Pool.map:
from multiprocessing import Pool

def fetch_url(url):
    # Fetch the URL contents and save it anywhere you need and
    # return something meaningful (like filename or error code),
    # if you wish.
    ...

pool = Pool(processes=4)
result = pool.map(fetch_url, image_url_list)
As you suspected, this is a perfect situation for threading. Here is a short guide I found immensely helpful when doing my own first bit of threading in python.
As you rightly indicated, you could create a number of threads, each of which is responsible for performing urlretrieve operations. This allows the main thread to continue uninterrupted.
Here is a tutorial on threading in python:
http://heather.cs.ucdavis.edu/~matloff/Python/PyThreads.pdf
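A rough sketch of that approach (thumbnail_urls is a made-up placeholder; urlretrieve as in the question, Python 2 style):

import threading
import urllib   # Python 2, as in the question; use urllib.request in Python 3

thumbnail_urls = ["http://example.com/a.png", "http://example.com/b.png"]   # placeholder list

def fetch(url, filename):
    urllib.urlretrieve(url, filename)
    # signal the GUI here that `filename` is ready to display

threads = [threading.Thread(target=fetch, args=(url, "thumb%d.png" % i))
           for i, url in enumerate(thumbnail_urls)]
for t in threads:
    t.start()
# the main (GUI) thread keeps running; each thumbnail appears as its thread finishes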
Here's an example of how to use threading.Thread. Just replace the class name with your own and the run function with your own. Note that threading is great for I/O-bound applications like yours and can really speed things up. Using Python threading strictly for computation in standard Python doesn't help, because only one thread can compute at a time.
import threading, time

class Ping(threading.Thread):
    def __init__(self, multiple):
        threading.Thread.__init__(self)
        self.multiple = multiple

    def run(self):
        # sleeps 3 seconds then prints 'pong' x times
        time.sleep(3)
        printString = 'pong' * self.multiple
        print printString

pingInstance = Ping(3)
pingInstance.start()  # your run function will be called by the start function
print "pingInstance is alive? : %d" % pingInstance.isAlive()  # will return True, or 1
print "Number of threads alive: %d" % threading.activeCount()
# main thread + class instance
time.sleep(3.5)
print "Number of threads alive: %d" % threading.activeCount()
print "pingInstance is alive?: %d" % pingInstance.isAlive()
# isAlive returns False when your thread reaches the end of its run function.
# only main thread now
You have these choices:
Threads: easiest but doesn't scale well
Twisted: medium difficulty, scales well but shares CPU due to GIL and being single threaded.
Multiprocessing: hardest. Scales well if you know how to write your own event loop.
I recommend just using threads unless you need an industrial scale fetcher.
You either need to use threads, or an asynchronous networking library such as Twisted. I suspect that using threads might be simpler in your particular use case.
