Python Requests: Don't wait for request to finish

Python Requests: Don't wait for request to finish - python

In Bash, it is possible to execute a command in the background by appending &. How can I do it in Python?
while True:
data = raw_input('Enter something: ')
requests.post(url, data=data) # Don't wait for it to finish.
print('Sending POST request...') # This should appear immediately.

Here's a hacky way to do it:
try:
requests.get("http://127.0.0.1:8000/test/",timeout=0.0000000001)
except requests.exceptions.ReadTimeout:
pass
Edit: for those of you that observed that this will not await a response - that is my understanding of the question "fire and forget... do not wait for it to finish". There are much more thorough and complete ways to do it with threads or async if you need response context, error handling, etc.

I use multiprocessing.dummy.Pool. I create a singleton thread pool at the module level, and then use pool.apply_async(requests.get, [params]) to launch the task.
This command gives me a future, which I can add to a list with other futures indefinitely until I'd like to collect all or some of the results.
multiprocessing.dummy.Pool is, against all logic and reason, a THREAD pool and not a process pool.
Example (works in both Python 2 and 3, as long as requests is installed):
from multiprocessing.dummy import Pool
import requests
pool = Pool(10) # Creates a pool with ten threads; more threads = more concurrency.
# "pool" is a module attribute; you can be sure there will only
# be one of them in your application
# as modules are cached after initialization.
if __name__ == '__main__':
futures = []
for x in range(10):
futures.append(pool.apply_async(requests.get, ['http://example.com/']))
# futures is now a list of 10 futures.
for future in futures:
print(future.get()) # For each future, wait until the request is
# finished and then print the response object.
The requests will be executed concurrently, so running all ten of these requests should take no longer than the longest one. This strategy will only use one CPU core, but that shouldn't be an issue because almost all of the time will be spent waiting for I/O.

Elegant solution from Andrew Gorcester. In addition, without using futures, it is possible to use the callback and error_callback attributes (see
doc) in order to perform asynchronous processing:
def on_success(r: Response):
if r.status_code == 200:
print(f'Post succeed: {r}')
else:
print(f'Post failed: {r}')
def on_error(ex: Exception):
print(f'Post requests failed: {ex}')
pool.apply_async(requests.post, args=['http://server.host'], kwargs={'json': {'key':'value'},
callback=on_success, error_callback=on_error))

According to the doc, you should move to another library :
Blocking Or Non-Blocking?
With the default Transport Adapter in place, Requests does not provide
any kind of non-blocking IO. The Response.content property will block
until the entire response has been downloaded. If you require more
granularity, the streaming features of the library (see Streaming
Requests) allow you to retrieve smaller quantities of the response at
a time. However, these calls will still block.
If you are concerned about the use of blocking IO, there are lots of
projects out there that combine Requests with one of Python’s
asynchronicity frameworks.
Two excellent examples are
grequests and
requests-futures.

Simplest and Most Pythonic Solution using threading
A Simple way to go ahead and send POST/GET or to execute any other function without waiting for it to finish is using the built-in Python Module threading.
import threading
import requests
def send_req():
requests.get("http://127.0.0.1:8000/test/")
for x in range(100):
threading.Thread(target=send_req).start() # start's a new thread and continues.
Other Important Features of threading
You can turn these threads into daemons using thread_obj.daemon = True
You can go ahead and wait for one to complete executing and then continue using thread_obj.join()
You can check if a thread is alive using thread_obj.is_alive() bool: True/False
You can even check the active thread count as well by threading.active_count()
Official Documentation

If you can write the code to be executed separately in a separate python program, here is a possible solution based on subprocessing.
Otherwise you may find useful this question and related answer: the trick is to use the threading library to start a separate thread that will execute the separated task.
A caveat with both approach could be the number of items (that's to say the number of threads) you have to manage. If the items in parent are too many, you may consider halting every batch of items till at least some threads have finished, but I think this kind of management is non-trivial.
For more sophisticated approach you can use an actor based approach, I have not used this library myself but I think it could help in that case.

from multiprocessing.dummy import Pool
import requests
pool = Pool()
def on_success(r):
print('Post succeed')
def on_error(ex):
print('Post requests failed')
def call_api(url, data, headers):
requests.post(url=url, data=data, headers=headers)
def pool_processing_create(url, data, headers):
pool.apply_async(call_api, args=[url, data, headers],
callback=on_success, error_callback=on_error)

Related

Multiprocessing hanging with requests.get

I have been working with a very simple bit of code, but the behavior is very strange. I am trying to send a request to a webpage using requests.get, but if the request takes longer than a few seconds, I would like to kill the process. I am following the response from the accepted answer here, but changing the function body to include the request. My code is below:
import multiprocessing as mp, requests
def get_page(_r):
_rs = requests.get('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas').text
_r.put(_rs)
q = mp.Queue()
p = mp.Process(target=get_page, args=(q,))
p.start()
time.sleep(3)
p.terminate()
p.join()
try:
result = q.get(False)
print(result)
except:
print('failed')
The code above simply hanges when running it. However, when I run
requests.get('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas').text
independently, a response is returned in under two seconds. Therefore, main code should print the page's HTML, however, it just stalls. Oddly, when I put an infinite loop in get_page:
def get_page(_r):
while True:
pass
_r.put('You will not see this')
the process is indeed terminated after three seconds. Therefore, I am certain the behavior has to do with requests. How could this be? I discovered a similar question here, but I am not using async. Could the issue have to do with monkey patching, as I am using requests along with time and multiprocessing? Any suggestions or comments would be appreciated. Thank you!
I am using:
Python 3.7.0
requests 2.21.0
Edit: #Hitobat pointed out that a param timeout can be used instead with requests. This does indeed work, however, I would appreciate any other ideas pertaining to why the requests is failing with multiprocessing.

I have reproduced your scenario and I have to refute the mentioned supposition "I am certain the behavior has to do with requests".
requests.get(...) returns the response as expected.
Let see how the process goes with some debug points:
import multiprocessing as mp, requests
import time
def get_page(_r):
_rs = requests.get('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas').text
print('--- response header', _rs[:17])
_r.put(_rs)
q = mp.Queue()
p = mp.Process(target=get_page, args=(q,))
p.start()
time.sleep(3)
p.terminate()
p.join()
try:
print('--- get data from queue of size', q.qsize())
result = q.get(False)
print(result)
except Exception as ex:
print('failed', str(ex))
The output:
--- response header
<!DOCTYPE html>
--- get data from queue of size 1
As we see the response is there and the process even advanced to try block statements but it hangs/stops at the statement q.get() when trying to extract data from the queue.
Therefore we may conclude that the queue is likely to be corrupted.
And we have a respective warning in multiprocessing library documentation (Pipes and Queues section):
Warning
If a process is killed using Process.terminate() or os.kill() while
it is trying to use a Queue, then the data in the queue is likely to
become corrupted. This may cause any other process to get an exception
when it tries to use the queue later on.
Looks like this is that kind of case.
How can we handle this issue?
A known workaround is using mp.Manager().Queue() (with intermediate proxying level) instead of mp.Queue:
...
q = mp.Manager().Queue()
p = mp.Process(target=get_page, args=(q,))

Calling another function asynchronously and never wait it to finish in Python

I am working on a chatbot, where before I reply to the user I make a DB call to save the chat in a table. This will be done each time user types something, and it increases the response time.
So to decrease the response time, we need to call this asynchronously.
How to do this in Python 3?
I have read tutorials of asyncio library, but did not understand it completely and could not understand how to make it work.
Another workaround is to use queuing system, but that sounds like an overkill.
Example:
request = get_request_from_chat
res = call_some_function_to_prepare_response()
save_data() # this will be call asynchronously
reply() # this should not wait save_data() to finish
Any suggestions are welcome.

Use loop.create_task(some_async_function()) to run an async function "in the background". For example, this answer shows how to do that in case of a trivial client-server communication.
In your case the pseudo-code would look like this:
request = await get_request_from_chat()
res = call_some_function_to_prepare_response()
loop = asyncio.get_event_loop()
loop.create_task(save_data()) # runs in the "background"
reply() # doesn't wait for save_data() to finish
For this to work, of course, the program must be written for asyncio and save_data must be a coroutine. For a chat server it's a good approach to follow anyway, so I would recommend to give asyncio a chance.

Because you mentioned
Another workaround is to use queuing system, but that sounds like an
overkill.
I assume you are open to other solutions so I will propose multi-threading approach:
from concurrent.futures import ThreadPoolExecutor
from time import sleep
def long_runnig_funciton(param1):
print(param1)
sleep(10)
return "Complete"
with ThreadPoolExecutor(max_workers=10) as executor:
future = executor.submit(long_runnig_funciton,["Param1"])
print(future.result(timeout=12))
Steps:
1) You create a ThreadPoolExecutor and define maximum number of concurrent tasks.
2) You submit a function with arguments it needs
3) You call result() on the return value from submit() when you need the results
Note that the result() can throw exception if exception was thrown in the submitted function
You can also check if the result of your call is ready with future.done() which returns True or False

How to apply parallel or asynchronous I/O file writing on a python piece of code

To begin with, we're given the following piece of code:
from validate_email import validate_email
import time
import os
def verify_emails(email_path, good_filepath, bad_filepath):
good_emails = open(good_filepath, 'w+')
bad_emails = open(bad_filepath, 'w+')
emails = set()
with open(email_path) as f:
for email in f:
email = email.strip()
if email in emails:
continue
emails.add(email)
if validate_email(email, verify=True):
good_emails.write(email + '\n')
else:
bad_emails.write(email + '\n')
if __name__ == "__main__":
os.system('cls')
verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
I expect contacting SMTP servers to be the most expensive part by far from my program when emails.txt contains large amount of lines (>1k). Using some form of parallel or asynchronous I/O should speed this up a lot, since I can wait for multiple servers to respond instead of waiting sequentially.
As far as I have read:
Asynchronous I/O operates by queuing a request for I/O to the file
descriptor, tracked independently of the calling process. For a file
descriptor that supports asynchronous I/O (raw disk devcies
typically), a process can call aio_read() (for instance) to request a
number of bytes be read from the file descriptor. The system call
returns immediately, whether or not the I/O has completed. Some time
later, the process then polls the operating system for the completion
of the I/O (that is, buffer is filled with data).
To be sincere, I didn't quite understand how to implement async I/O on my program. Can anybody take a little time and explain me the whole process ?
EDIT as per PArakleta suggested:
from validate_email import validate_email
import time
import os
from multiprocessing import Pool
import itertools
def validate_map(e):
return (validate_email(e.strip(), verify=True), e)
seen_emails = set()
def unique(e):
if e in seen_emails:
return False
seen_emails.add(e)
return True
def verify_emails(email_path, good_filepath, bad_filepath):
good_emails = open(good_filepath, 'w+')
bad_emails = open(bad_filepath, 'w+')
with open(email_path, "r") as f:
for result in Pool().imap_unordered(validate_map,
itertools.ifilter(unique, f):
(good, email) = result
if good:
good_emails.write(email)
else:
bad_emails.write(email)
good_emails.close()
bad_emails.close()
if __name__ == "__main__":
os.system('cls')
verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")

You're asking the wrong question
Having looked at the validate_email package your real problem is that you're not efficiently batching your results. You should be only doing the MX lookup once per domain and then only connect to each MX server once, go through the handshake, and then check all of the addresses for that server in a single batch. Thankfully the validate_email package does the MX result caching for you, but you still need to be group the email addresses by server to batch the query to the server itself.
You need to edit the validate_email package to implement batching, and then probably give a thread to each domain using the actual threading library rather than multiprocessing.
It's always important to profile your program if it's slow and figure out where it is actually spending the time rather than trying to apply optimisation tricks blindly.
The requested solution
IO is already asynchronous if you are using buffered IO and your use case fits with the OS buffering. The only place you could potentially get some advantage is in read-ahead but Python already does this if you use the iterator access to a file (which you are doing). AsyncIO is an advantage to programs that are moving large amounts of data and have disabled the OS buffers to prevent copying the data twice.
You need to actually profile/benchmark your program to see if it has any room for improvement. If your disks aren't already throughput bound then there is a chance to improve the performance by parallel execution of the processing of each email (address?). The easiest way to check this is probably to check to see if the core running your program is maxed out (i.e. you are CPU bound and not IO bound).
If you are CPU bound then you need to look at threading. Unfortunately Python threading doesn't work in parallel unless you have non-Python work to be done so instead you'll have to use multiprocessing (I'm assuming validate_email is a Python function).
How exactly you proceed depends on where the bottleneck's in your program are and how much of a speed up you need to get to the point where you are IO bound (since you cannot actually go any faster than that you can stop optimising when you hit that point).
The emails set object is hard to share because you'll need to lock around it so it's probably best that you keep that in one thread. Looking at the multiprocessing library the easiest mechanism to use is probably Process Pools.
Using this you would need to wrap your file iterable in an itertools.ifilter which discards duplicates, and then feed this into a Pool.imap_unordered and then iterate that result and write into your two output files.
Something like:
with open(email_path) as f:
for result in Pool().imap_unordered(validate_map,
itertools.ifilter(unique, f):
(good, email) = result
if good:
good_emails.write(email)
else:
bad_emails.write(email)
The validate_map function should be something simple like:
def validate_map(e):
return (validate_email(e.strip(), verify=True), e)
The unique function should be something like:
seen_emails = set()
def unique(e):
if e in seen_emails:
return False
seen_emails.add(e)
return True
ETA: I just realised that validate_email is a library which actually contacts SMTP servers. Given that it's not busy in Python code you can use threading. The threading API though is not as convenient as the multiprocessing library but you can use multiprocessing.dummy to have a thread based Pool.
If you are CPU bound then it's not really worth having more threads/processes than cores but since your bottleneck is network IO you can benefit from many more threads/processes. Since processes are expensive you want to swap to threads and then crank up the number running in parallel (although you should be polite not to DOS-attack the servers you are connecting to).
Consider from multiprocessing.dummy import Pool as ThreadPool and then call ThreadPool(processes=32).imap_unordered().

What kind of problems (if any) would there be combining asyncio with multiprocessing?

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently setup a 2MB cap on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?

You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor
def blocking_func(x):
time.sleep(x) # Pretend this is expensive calculations
return x * 5
#asyncio.coroutine
def main():
#pool = multiprocessing.Pool()
#out = pool.apply(blocking_func, args=(10,)) # This blocks the event loop.
executor = ProcessPoolExecutor()
out = yield from loop.run_in_executor(executor, blocking_func, 10) # This does not
print(out)
if __name__ == "__main__":
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
For the majority of cases, this is function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it), that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing
def func(queue, event, lock, items):
with lock:
event.set()
for item in items:
time.sleep(3)
queue.put(item+5)
queue.close()
#asyncio.coroutine
def example(queue, event, lock):
l = [1,2,3,4,5]
p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
p.start()
while True:
result = yield from queue.coro_get()
if result is None:
break
print("Got result {}".format(result))
yield from p.coro_join()
#asyncio.coroutine
def example2(queue, event, lock):
yield from event.coro_wait()
with (yield from lock):
yield from queue.coro_put(78)
yield from queue.coro_put(None) # Shut down the worker
if __name__ == "__main__":
loop = asyncio.get_event_loop()
queue = aioprocessing.AioQueue()
lock = aioprocessing.AioLock()
event = aioprocessing.AioEvent()
tasks = [
asyncio.async(example(queue, event, lock)),
asyncio.async(example2(queue, event, lock)),
]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()

Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest to examine different approaches. For instance you can put a socket into listen mode and then simultaneously accept connections from multiple worker processes in parallel. Once a worker is finished processing a request, it can go accept the next connection, so you still use less resources than forking a process for each connection. Spamassassin and Apache (mpm prefork) can use this worker model for instance. It might end up easier and more robust depending on your use case. Specifically you can make your workers die after serving a configured number of requests and be respawned by a master process thereby eliminating much of the negative effects of memory leaks.

Based on #dano's answer above I wrote this function to replace places where I used to use multiprocess pool + map.
def asyncio_friendly_multiproc_map(fn: Callable, l: list):
"""
This is designed to replace the use of this pattern:
with multiprocessing.Pool(5) as p:
results = p.map(analyze_day, list_of_days)
By letting caller drop in replace:
asyncio_friendly_multiproc_map(analyze_day, list_of_days)
"""
tasks = []
with ProcessPoolExecutor(5) as executor:
for e in l:
tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
return res

See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This documents clearly the new asyncio methods you might use, including run_in_executor(). Note that the Executor is defined in concurrent.futures, I suggest you also have a look there.

Assistance with Python multithreading

Currently, i have a list of url to grab contents from and is doing it serially. I would like to change it to grabbing them in parallel. This is a psuedocode. I will like to ask is the design sound? I understand that .start() starts the thread, however, my database is not updated. Do i need to use q.get() ? thanks
import threading
import Queue
q = Queue.Queue()
def do_database(url):
""" grab url then input to database """
webdata = grab_url(url)
try:
insert_data_into_database(webdata)
except:
....
else:
< do I need to do anything with the queue after each db operation is done?>
def put_queue(q, url ):
q.put( do_database(url) )
for myfiles in currentdir:
url = myfiles + some_other_string
t=threading.Thread(target=put_queue,args=(q,url))
t.daemon=True
t.start()

It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue
NUM_THREADS = 5 # whatever
q = Queue.Queue()
END_OF_DATA = object() # a unique object
class Worker(threading.Thread):
def run(self):
while True:
url = q.get()
if url is END_OF_DATA:
break
webdata = grab_url(url)
try:
# Does your database support concurrent updates
# from multiple threads? If not, need to put
# this in a "with some_global_mutex:" block.
insert_data_into_database(webdata)
except:
#....
threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
t.start()
for myfiles in currentdir:
url = myfiles + some_other_string
q.put(url)
# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
q.put(END_OF_DATA)
# Shut down cleanly. `daemon` is way overused.
for t in threads:
t.join()

You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).

For DB, You have to commit before your changes become effective. But, commit for every insert is not optimal. Commit after bulk changes gives much better performance.
For parallel, Python isn't born for this. For your use-case, i suppose using python with gevent would be a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all() # to use with urllib, etc
from gevent.queue import Queue
def web_worker(q, url):
grab_something
q.push(result)
def db_worker(q):
buf = []
while True:
buf.append(q.get())
if len(buf) > 20:
insert_stuff_in_buf_to_db
db_commit
buf = []
def run(urls):
q = Queue()
gevent.spawn(db_worker, q)
for url in urls:
gevent.spawn(web_worker, q, url)
run(urls)
plus, since this implementation is totally single threaded, you can safely manipulate shared data between workers like queue, db connection, global variables etc.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.