How can I minimize thread blocking with Tornado? I already have working code, but I suspect it is not fully asynchronous.
I have a really long task.
It consists of making several requests to CouchDB to get metadata and construct a final link, then making one last request to CouchDB and streaming a file (from 10 MB up to 100 MB). So the result is streaming a large file to a client.
The problem is that the server can receive 100 simultaneous requests to download large files, and I need to keep receiving new requests without blocking the thread (I have to minimize thread blocking).
So, I am currently making several synchronous requests (with the requests library) and then streaming the large file in chunks with AsyncHTTPClient.
The questions are as follows:
1) Should I use AsyncHTTPClient EVERYWHERE? Since I have some interface it will take quite a lot of time to replace all synchronous requests with asynchronous ones. Is it worth doing it?
2) Should I use tornado.curl_httpclient.CurlAsyncHTTPClient? Will the code run faster (file download, making requests)?
3) I see that Python 3.5 introduced async/await, and theoretically it can be faster. Should I use async/await or keep using the @gen.coroutine decorator?
Use AsyncHTTPClient or CurlAsyncHTTPClient. Since the requests library is synchronous, it blocks the Tornado event loop while it runs, so only one request can be in progress at a time. Doing asynchronous networking with Tornado requires purpose-built asynchronous network code, like AsyncHTTPClient or CurlAsyncHTTPClient.
Yes, CurlAsyncHTTPClient is a bit faster than AsyncHTTPClient, you may notice a speedup if you stream large amounts of data with it.
async and await are faster than gen.coroutine and yield, so if you have yield statements that are executed very frequently in a tight loop, or if you have deeply nested coroutines that call coroutines, it will be worthwhile to port your code.
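To make the first point concrete, here is a small single-file sketch in plain asyncio (which modern Tornado, 5.0+, uses as its event loop). `blocking_call` is a stand-in for a `requests` call and `non_blocking_call` a stand-in for `AsyncHTTPClient.fetch`; both names are illustrative, not real APIs. Five concurrent handlers with a blocking call serialize, while the non-blocking versions overlap:

```python
import asyncio
import time

def blocking_call(delay):
    # stand-in for requests.get(): holds the whole event loop hostage
    time.sleep(delay)

async def non_blocking_call(delay):
    # stand-in for AsyncHTTPClient.fetch(): suspends this coroutine and
    # lets the loop run other handlers while "the network" does its work
    await asyncio.sleep(delay)

async def blocking_handler(delay):
    blocking_call(delay)

async def compare(n=5, delay=0.05):
    start = time.monotonic()
    await asyncio.gather(*(blocking_handler(delay) for _ in range(n)))
    blocking_elapsed = time.monotonic() - start   # roughly n * delay

    start = time.monotonic()
    await asyncio.gather(*(non_blocking_call(delay) for _ in range(n)))
    async_elapsed = time.monotonic() - start      # roughly delay
    return blocking_elapsed, async_elapsed
```

Running `asyncio.run(compare())` shows the blocking variant taking about five times as long: exactly the behavior you get when `requests` runs inside a Tornado handler.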
Related
In order to test our server we designed a test that sends a lot of requests with JSON payload and compares the response it gets back.
I'm currently trying to find a way to optimize the process by using multiple threads, but I haven't found a solution to the problem I'm facing.
I have a url address and a bunch of JSON files (these files hold the requests, and for each request file there is an 'expected response' JSON to compare the response to).
I would like to use multithreading to send all these requests and still be able to match each response I get back to the request I sent.
Any ideas?
Well, you have a couple of options:
Use multiprocessing.pool.ThreadPool (Python 2.7), where you create a pool of threads and then use them to dispatch requests. map_async may be of interest here if you want to make asynchronous requests,
Use concurrent.futures.ThreadPoolExecutor (Python 3), which works similarly to ThreadPool and is likewise used for asynchronously executing callables,
You even have the option of using multiprocessing.Pool, but I'm not sure that will give you any benefit, since everything you will be doing is I/O-bound, so threads should do just fine,
You can make asynchronous requests with Twisted or asyncio, but that may require a bit more learning if you are not accustomed to asynchronous programming.
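A minimal sketch of the ThreadPoolExecutor option, which also solves the request/response matching problem: keep a future-to-test-case mapping so each response is tied back to the request that produced it. The `send` callable is injected (in real use it would wrap e.g. `requests.post(url, json=payload).json()`), so this sketch has no network dependency; `run_tests` and the case layout are illustrative names, not an existing API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_tests(url, cases, send):
    # `cases` maps a test name to (request_payload, expected_response);
    # `send(url, payload)` performs the actual HTTP call and returns the
    # parsed JSON response.
    failures = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        # the future-to-case dict is what keeps each response matched
        # to the request that produced it, even out of order
        futures = {
            pool.submit(send, url, payload): (name, expected)
            for name, (payload, expected) in cases.items()
        }
        for fut in as_completed(futures):
            name, expected = futures[fut]
            actual = fut.result()
            if actual != expected:
                failures[name] = (expected, actual)
    return failures
```

An empty return value means every response matched its expected JSON; otherwise you get, per failing test, the expected and actual payloads.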
Use a Python multiprocessing thread pool, from which you can collect return values to compare.
https://docs.python.org/2/library/multiprocessing.html
https://gist.github.com/wrunk/b689be4b59270c32441c
I am going to create a web server that could receive a lot of connections. These 10,000 connected users will send numbers to the server, and the server will return the squared numbers back to them.
10,000 connections are a lot, and an asynchronous approach is appropriate here.
I found two libraries for Python 3.4 which can help: socketserver and asyncio.
With the socketserver library we can use the ThreadingMixIn and ForkingMixIn classes as asynchronous handlers, but this is restricted by the number of cores.
On the other hand, we have the asyncio library, and I don't understand exactly how it works.
Which one should I use? And could these two libraries work together?
There are different approaches to asynchronous programming.
The first approach is to run each blocking IO operation in its own thread or process, so the rest of the program is not held up. This is what socketserver's threading and forking mixins do.
The second approach is to monitor IO operations in the main thread using an event loop and a selector. This is usually what people mean when they talk about asynchronous programming, and that's what asyncio, twisted and gevent do.
The single-threaded approach has two advantages:
it limits the risk of race conditions, since the callbacks all run in the same thread
it gets rid of the overhead of creating one thread per client (see the C10K problem)
Here is an example of an asyncio TCP server. In your case, simply replace the handle_client coroutine with your own implementation:
async def handle_client(reader, writer):
    data = await reader.readline()
    result = int(data.decode().strip()) ** 2
    writer.write((str(result) + "\n").encode())
    await writer.drain()
    writer.close()
It should easily be able to handle thousands of clients.
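For completeness, here is one way to wire such a handler into a running server with asyncio.start_server (this sketch uses modern asyncio, 3.7+, for `asyncio.run` and `serve_forever`; the host and port are placeholders):

```python
import asyncio

async def handle_client(reader, writer):
    data = await reader.readline()
    result = int(data.decode().strip()) ** 2
    writer.write((str(result) + "\n").encode())
    await writer.drain()
    writer.close()

async def main():
    # bind the handler to a listening socket; asyncio spawns one
    # handle_client task per connection, all in a single thread
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8888)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```

A client can then connect, send a number terminated by a newline, and read back its square.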
So I have this problem I am trying to solve in a particular way, but I am not sure how hard it is to achieve.
I would like to use the asyncio/coroutines features of Python 3.4 to trigger many concurrent http requests using a blocking http library, like requests, or any Python api that does http requests like boto for aws.
I know about run_in_executor() method to run tasks in threads/processes, but I would like to avoid that.
I would like to do it in a single-thread, using those select features in Linux/Unix kernel.
Actually I was following David Beazley's presentation on this, and I was trying to use this code: https://github.com/dabeaz/concurrencylive/blob/master/aserver.py
but without the future/pool stuff, and use my blocking-api call instead of computing the Fibonacci number.
But it seems that the http requests still run in sequence.
Any ideas if this is possible? And how?
Thanks
Not possible. All the calls that the requests library makes to the underlying socket are blocking (i.e. socket.read) because the socket is in blocking mode. You could put the socket into non-blocking mode, but then socket.read would fail. You basically need an event-loop to tell you when it's possible to do a socket.read, but blocking libraries aren't written with one in mind. This is the whole reason why asyncio exists; providing a default event-loop that different libraries can share and make use of non-blocking file descriptors (e.g. sockets).
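To see what the event loop buys you, here is a minimal sketch of HTTP done directly over asyncio's non-blocking streams (a bare HTTP/1.0 exchange with no redirects, TLS, or chunked encoding; `fetch` and `fetch_many` are illustrative names, not a real library). The key point: `await reader.read()` suspends the coroutine instead of blocking the thread, so many fetches run concurrently in a single thread:

```python
import asyncio

async def fetch(host, port, path="/"):
    # open_connection gives non-blocking stream wrappers around a socket;
    # every await here yields control back to the event loop
    reader, writer = await asyncio.open_connection(host, port)
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    writer.write(request.encode())
    await writer.drain()
    response = await reader.read()  # read until the server closes the connection
    writer.close()
    return response

async def fetch_many(targets):
    # run all requests concurrently in one thread
    return await asyncio.gather(*(fetch(h, p, path) for h, p, path in targets))
```

This is essentially what aiohttp does for you, with a real HTTP parser, connection pooling, and TLS on top.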
Use aiohttp, it's just as easy as requests and in the process you get to learn more about asyncio. asyncio and the new Python 3.5 async/await syntax are the Future of networking IO; yield to it (pun intended).
I've read the docs. I've played with examples. But I'm still unable to grasp what exactly asynchronous means, when it is useful, and where the magic is that lots of people seem so crazy about.
If it is only for avoiding waiting for I/O, why not simply run in threads? Why is Deferred needed?
I think I'm missing some fundamental knowledge about computing, hence these questions. If so, what is it?
like you're five... ok: threads bad, async good!
now, seriously: threads incur overhead - both in locking and context switching in the interpreter, and in memory consumption and code complexity. when your program is IO-bound and spends a lot of time waiting for other services (APIs, databases) to respond, you're basically idling and wasting resources.
the point of async IO is to mitigate the overhead of threads while keeping the concurrency, and keeping your program simple, avoiding deadlocks and reducing complexity.
think for example about a chat server. you have thousands of connections on the server, and you want some people to receive some messages based on which room they are in. doing this with threads will be much more complicated than doing it the async way.
re deferred - it's just a way of simplifying your code: instead of giving every function a callback to return to when the operation it's waiting for is ready, the operation returns a Deferred you attach callbacks to.
another note - if you want a much simpler and elegant async IO framework, try tornado, which is basically an async web server, with async http client and a replacement for deferred. it's very nicely written and can be used as a general purpose async IO framework.
see http://tornadoweb.org
I was looking to write a little web crawler in python. I was starting to investigate writing it as a multithreaded script, with one pool of threads downloading and one pool processing results. Due to the GIL, would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off the socket, then move on to the next thread, let it pick some data off the socket, and so on?
Basically I'm asking is doing a multi-threaded crawler in python really going to buy me much performance vs single threaded?
thanks!
The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.
On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.
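The two-pool layout from the question can be sketched with a downloader pool feeding a bounded number of processing threads through a queue. The `fetch` and `process` callables are supplied by the caller (in real use `fetch` would wrap `urllib.request.urlopen` or similar), so the sketch stays network-free; `crawl` and its parameters are illustrative names, not an existing API:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def crawl(urls, fetch, process, n_downloaders=4, n_processors=2):
    # fetch(url) -> bytes downloads a page; process(url, body) -> result
    # does the CPU work on it.
    to_process = queue.Queue()
    results, lock = [], threading.Lock()

    def processor():
        while True:
            item = to_process.get()
            if item is None:          # poison pill: shut this worker down
                return
            url, body = item
            with lock:
                results.append(process(url, body))

    workers = [threading.Thread(target=processor) for _ in range(n_processors)]
    for w in workers:
        w.start()

    # downloads release the GIL while waiting on the network,
    # so they genuinely overlap
    with ThreadPoolExecutor(max_workers=n_downloaders) as pool:
        for url, body in zip(urls, pool.map(fetch, urls)):
            to_process.put((url, body))

    for _ in workers:
        to_process.put(None)
    for w in workers:
        w.join()
    return results
```

Keeping the processing pool small (as the answer above suggests) limits how much the GIL-bound CPU work interferes with the downloads.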
Look at how scrapy works. It can help you a lot. It doesn't use threads, but can do multiple "simultaneous" downloads, all in the same thread.
If you think about it, you have only a single network card, so parallel processing can't really help by definition.
What scrapy does is just not wait around for the response of one request before sending another. All in a single thread.
When it comes to crawling you might be better off using something event-based such as Twisted that uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each one.
Asynchronous network operations can easily be and usually are single-threaded. Network I/O almost always has higher latency than that of CPU because you really have no idea how long a page is going to take to return, and this is where async shines because an async operation is much lighter weight than a thread.
Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.
Another consideration: if you're scraping a single website and the server places limits on the frequency of requests you can send from your IP address, adding multiple threads may make no difference.
Yes, multithreaded scraping increases the process speed significantly. This is not a case where the GIL is an issue. You are losing a lot of idle CPU and unused bandwidth waiting for each request to finish. If the web pages you are scraping are on your local network (a rare scraping case), then the difference between multithreaded and single-threaded scraping can be smaller.
You can try the benchmark yourself, playing with one to "n" threads. I have written a simple multithreaded crawler as part of Discovering Web Resources, and I wrote a related article on Automated Discovery of Blog Feeds and Twitter, Facebook, and LinkedIn Accounts Connected to Business Websites. You can select how many threads to use by changing the NWORKERS class variable in FocusedWebCrawler.