I'm trying to learn some network/backend stuff.
I now want to build an API that makes an HTTP request, does some processing, sends back a response. Not very useful, but it's for learning.
I noticed that the GET request is a huge bottleneck. I think it's an I/O problem, because the responses are very small.
Now I thought I could maybe do the downloading on multiple threads. If a fictional client of mine makes a request, a URL would need to be added to a queue, fetched by some worker thread, handed back, processed, and the result sent back to the client. Or something like that...
I'm really not an expert, and maybe nothing I just said made any sense... but I would really appreciate a little help :)
Multiple solutions exist.
You can use threading (thread pools) or multiprocessing (multiprocessing pools) to perform multiple requests in parallel.
Or you could use libraries like asyncio (or Twisted) to perform multiple requests within one thread, so that waiting for I/O is no longer the blocking point.
I suggest you look at:
https://docs.python.org/3/library/threading.html for threading
or https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing#module-multiprocessing for multiprocessing.
Asynchronous programming is, in my opinion, much more difficult, but if you're curious, look at
https://docs.python.org/3/library/asyncio.html?highlight=asyncio#module-asyncio for asyncio basics and at https://docs.aiohttp.org/en/stable/ for performing multiple HTTP requests in 'parallel' with asyncio.
After playing around a little, you will probably have much more precise questions.
Just post your code then, explain the issues, and you will get more help.
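As a starting point, here is a minimal sketch of the thread-pool approach with concurrent.futures. The fetch function below just simulates a slow network call with time.sleep; in a real app you would replace it with an actual HTTP GET (e.g. urllib.request.urlopen):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for a real HTTP GET; sleeping stands in for network latency.
    time.sleep(0.1)
    return f"response for {url}"

urls = [f"http://example.com/{i}" for i in range(10)]

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() returns results in the same order as the input URLs.
    results = list(pool.map(fetch, urls))
elapsed = time.time() - start

# With 10 workers the ten 0.1 s "downloads" overlap, so the total is
# close to 0.1 s instead of the ~1 s a sequential loop would take.
```

Because the work is I/O bound, the threads spend their time waiting rather than computing, which is exactly the case where a thread pool helps.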
Related
Is it possible to create enough threads to use 100% of the CPU, and is it really efficient? I'm planning to create a crawler in Python and, in order to make the program efficient, I want to create as many threads as possible, where each thread will be downloading one website. I tried looking for some information online; unfortunately I didn't find much.
You are confusing your terminology, but that's OK. A very high-level overview should help.
Concurrency covers both I/O-bound work (reading and writing to disk, HTTP requests, etc.) and CPU-bound work (say, running a machine-learning optimization function on a big set of data).
With I/O-bound work, which I'm assuming is what you are referring to, your CPU is not actually working very hard; it is mostly waiting around for data to come back.
Contrast that with multiprocessing, where you can use multiple cores of your machine to do more intense CPU-bound work.
That said, multithreading could help you. I would advise using the asyncio and aiohttp modules for Python. These will make sure that while you are waiting for one response to be returned, the software can continue with other requests.
I use asyncio, aiohttp and bs4 when I need to do some web-scraping.
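To sketch the asyncio approach without pulling in aiohttp, the fetch below simulates a request with asyncio.sleep; with aiohttp the body would instead be something like `async with session.get(url) as resp: return await resp.text()`:

```python
import asyncio

async def fetch(url):
    # Stand-in for an aiohttp request; the await point is where the
    # event loop switches to other pending "requests".
    await asyncio.sleep(0.1)
    return f"body of {url}"

async def main():
    urls = [f"http://example.com/{i}" for i in range(5)]
    # gather() runs all the coroutines concurrently on a single thread;
    # while one request waits on I/O, the loop services the others.
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
```

All five "requests" complete in roughly the time of one, with no threads involved.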
In order to test our server, we designed a test that sends a lot of requests with JSON payloads and compares the responses it gets back.
I'm currently trying to find a way to optimize the process by using multiple threads, but I haven't found a solution to the problem I'm facing.
I have a url address and a bunch of JSON files (these files hold the requests, and for each request file there is an 'expected response' JSON to compare the response to).
I would like to use multi threading to send all these requests and still be able to match the response that I get back to the request I sent.
Any ideas?
Well, you have a couple of options:
Use multiprocessing.pool.ThreadPool (Python 2.7), where you create a pool of threads and then use them for dispatching requests. map_async may be of interest here if you want to make asynchronous requests.
Use concurrent.futures.ThreadPoolExecutor (Python 3), which works in a similar way to ThreadPool and is used for asynchronously executing callables.
You even have the option of using multiprocessing.Pool, but I'm not sure that will give you any benefit, since everything you will be doing is I/O bound, so threads should do just fine.
You can make asynchronous requests with Twisted or asyncio, but that may require a bit more learning if you are not accustomed to asynchronous programming.
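For the matching problem specifically, the ThreadPoolExecutor option might look like this. post_json is a made-up placeholder for whatever actually sends the JSON payload; the key trick is keeping a future-to-request mapping so each response can be paired with its request regardless of completion order:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def post_json(payload):
    # Stand-in for an HTTP POST that returns the server's JSON response.
    return {"echo": payload["id"]}

requests_data = [{"id": i} for i in range(5)]
expected = {i: {"echo": i} for i in range(5)}

results = {}
with ThreadPoolExecutor(max_workers=5) as pool:
    # Map each future back to the request that produced it.
    future_to_req = {pool.submit(post_json, r): r for r in requests_data}
    for fut in as_completed(future_to_req):
        req = future_to_req[fut]
        results[req["id"]] = fut.result()

# Compare each response against its expected counterpart.
mismatches = {i for i, resp in results.items() if resp != expected[i]}
```

as_completed yields futures as they finish, so slow responses never hold up the comparison of fast ones.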
Use a Python multiprocessing thread pool, from which you can get return values back to compare.
https://docs.python.org/2/library/multiprocessing.html
https://gist.github.com/wrunk/b689be4b59270c32441c
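A sketch of that suggestion: multiprocessing.pool.ThreadPool's map() blocks until all results are back and preserves input order, so pairing responses with requests is automatic (send_request is an illustrative placeholder, not a real API):

```python
from multiprocessing.pool import ThreadPool

def send_request(payload):
    # Placeholder for posting the JSON file and returning the response.
    return {"status": "ok", "id": payload["id"]}

payloads = [{"id": i} for i in range(4)]

pool = ThreadPool(processes=4)
try:
    # Results come back in the same order as `payloads`, even though
    # the requests run concurrently.
    responses = pool.map(send_request, payloads)
finally:
    pool.close()
    pool.join()
```

responses[i] then corresponds to payloads[i], so comparing against the matching 'expected response' file is a simple zip.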
So I have this problem I am trying to solve in a particular way, but I am not sure how hard it is to achieve.
I would like to use the asyncio/coroutines features of Python 3.4 to trigger many concurrent http requests using a blocking http library, like requests, or any Python api that does http requests like boto for aws.
I know about run_in_executor() method to run tasks in threads/processes, but I would like to avoid that.
I would like to do it in a single thread, using the select features of the Linux/Unix kernel.
Actually I was following David Beazley's presentation on this, and I was trying to use this code: https://github.com/dabeaz/concurrencylive/blob/master/aserver.py
but without the future/pool stuff, and use my blocking-api call instead of computing the Fibonacci number.
But it seems that the HTTP requests are still running in sequence.
Any ideas if this is possible? And how?
Thanks
Not possible. All the calls that the requests library makes to the underlying socket are blocking (i.e. socket.read) because the socket is in blocking mode. You could put the socket into non-blocking mode, but then socket.read would fail. You basically need an event-loop to tell you when it's possible to do a socket.read, but blocking libraries aren't written with one in mind. This is the whole reason why asyncio exists; providing a default event-loop that different libraries can share and make use of non-blocking file descriptors (e.g. sockets).
Use aiohttp, it's just as easy as requests and in the process you get to learn more about asyncio. asyncio and the new Python 3.5 async/await syntax are the Future of networking IO; yield to it (pun intended).
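The blocking-mode point above can be seen directly with a socket pair: once a socket is non-blocking, a plain recv() with no data ready raises immediately instead of waiting, which is exactly the situation a blocking library like requests is not written to handle:

```python
import socket

# A connected pair of sockets; nothing is sent on either end.
a, b = socket.socketpair()
a.setblocking(False)  # put one end into non-blocking mode

try:
    a.recv(1024)  # no data is available yet
    outcome = "read data"
except BlockingIOError:
    # A blocking socket would simply have waited here. In non-blocking
    # mode the caller is expected to retry when an event loop reports
    # the socket as readable -- which is the service asyncio provides.
    outcome = "would block"

a.close()
b.close()
```

This is why you cannot just flip requests' sockets to non-blocking: every internal read would raise instead of waiting, and the library has no event loop to fall back on.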
I've read the docs. I've played with examples. But I'm still unable to grasp what exactly asynchronous means, when it is useful, and where the magic is that lots of people seem so crazy about.
If it is only for avoiding waiting on I/O, why not simply run in threads? Why is Deferred needed?
I think I'm missing some fundamental knowledge about computing, hence these questions. If so, what is it?
Like you're five... OK: threads bad, async good!
Now, seriously: threads incur overhead, both in locking and switching of the interpreter, and in memory consumption and code complexity. When your program is I/O bound and does a lot of waiting for other services (APIs, databases) to return a response, you're basically waiting around idle, wasting resources.
The point of async I/O is to mitigate the overhead of threads while keeping the concurrency, keeping your program simple, avoiding deadlocks, and reducing complexity.
Think, for example, about a chat server: you have thousands of connections on the server, and you want certain people to receive certain messages based on which room they are in. Doing this with threads will be much more complicated than doing it the async way.
Re: Deferred, it's just a way of simplifying your code, instead of giving every function a callback to return to when the operation it's waiting for is ready.
Another note: if you want a much simpler and more elegant async I/O framework, try Tornado, which is basically an async web server with an async HTTP client and a replacement for Deferred. It's very nicely written and can be used as a general-purpose async I/O framework.
see http://tornadoweb.org
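The core idea behind Deferred (a callback attached to a result that doesn't exist yet) can be illustrated with the standard library's concurrent.futures.Future rather than Twisted itself; the mechanics are analogous:

```python
from concurrent.futures import Future

received = []

def on_ready(fut):
    # Runs once the result is set; in Twisted this role is played by a
    # callback registered with Deferred.addCallback().
    received.append(fut.result())

f = Future()
f.add_done_callback(on_ready)  # register interest before the result exists
f.set_result(42)               # later, the async operation completes
```

The caller never blocks waiting for the value; it just says what should happen once the value arrives, which is the pattern Deferred packages up and lets you chain.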
I want to use Python's multiprocessing to do concurrent processing without using locks (locks, to me, are the opposite of multiprocessing), because I want to build multiple reports from different resources at the same time during a web request (this normally takes about 3 seconds, but with multiprocessing I can do it in 0.5 seconds).
My problem is that, if I expose such a feature to the web and get 10 users pulling the same report at the same time, I suddenly have 60 interpreters open at the same time (which would crash the system). Is this just the common sense result of using multiprocessing, or is there a trick to get around this potential nightmare?
Thanks
If you're really worried about having too many instances, you could think about protecting the call with a Semaphore object. If I understand what you're doing, then you can use the threading Semaphore:

from threading import Semaphore

sem = Semaphore(10)

with sem:
    make_multiprocessing_call()
I'm assuming that make_multiprocessing_call() will cleanup after itself.
This way, only 10 "extra" instances of Python will ever be open; if another request comes along, it will just have to wait until the previous ones have completed. Unfortunately this won't be in "queue" order... or any order in particular.
Hope that helps
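Another way to cap the interpreter count (my own suggestion, not part of the answer above) is to create one fixed-size pool at module level and share it across all web requests, instead of spawning a fresh set of workers per request. The sketch uses ThreadPool so it runs anywhere; swapping in multiprocessing.Pool gives the process-based version, capping you at 6 interpreters instead of 60:

```python
from multiprocessing.pool import ThreadPool

# Created once at import time and reused by every request, so the extra
# concurrency is capped at 6 no matter how many users hit the endpoint.
# Swap ThreadPool for multiprocessing.Pool to cap processes instead.
REPORT_POOL = ThreadPool(processes=6)

def build_report(source):
    # Placeholder for fetching one report from one resource.
    return f"report from {source}"

def handle_request(sources):
    # Each web request borrows workers from the shared pool; concurrent
    # requests queue up behind one another instead of forking new pools.
    return REPORT_POOL.map(build_report, sources)

reports = handle_request(["db", "cache", "api"])
```

Note that map() here will make concurrent requests wait their turn for workers, which is usually preferable to crashing under load.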
You are barking up the wrong tree if you are trying to use multiprocessing to add concurrency to a network app. You are barking up a completely wrong tree if you're creating a new set of processes for each request. multiprocessing is not what you want here (at least not as a concurrency model).
There's a good chance you want an asynchronous networking framework like Twisted.
Locks are only ever necessary if you have multiple agents writing to a resource. If they are only reading, locks are not needed (and, as you said, they defeat the purpose of multiprocessing).
Are you sure that would crash the system? On a web server using CGI, each request spawns a new process, so it's not unusual to see thousands of simultaneous processes (granted, in Python one should use WSGI and avoid this), which do not crash the system.
I suggest you test your theory -- it shouldn't be difficult to manufacture 10 simultaneous accesses -- and see if your server really does crash.