I can't find a straight answer on this. Is aiohttp asynchronous in the way that JavaScript is? I'm building a project where I'll need to send a large number of requests to an endpoint. If I use requests, I'll have to wait for each response before I can send the next request. I researched a few async request libraries in Python, but they all seemed to start new threads to send requests. If I understand asynchronicity correctly, starting a new thread pretty much defeats the purpose of asynchronous code (tell me if I'm wrong here). What I'm really looking for is a single-threaded asynchronous requests library in Python that behaves much like JavaScript would: one that sends another request while waiting for the response to the first, rather than starting multiple threads. Is aiohttp what I'm looking for?
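(For reference, aiohttp does work this way: a single thread runs an asyncio event loop, and while one request is waiting on the network the loop moves on to the next. A minimal sketch, using example.com as a placeholder endpoint:)

import asyncio
import aiohttp

async def fetch(session, url):
    # The await yields to the event loop while the response is in flight.
    async with session.get(url) as resp:
        return resp.status

async def main():
    urls = ['https://example.com'] * 10  # placeholder endpoint
    async with aiohttp.ClientSession() as session:
        # gather() runs all the fetches concurrently on one thread.
        statuses = await asyncio.gather(*(fetch(session, u) for u in urls))
        print(statuses)

asyncio.run(main())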
Hi Stack Overflow community,
I would like to create a script that uses multithreading to make a high number of parallel HTTP status-code requests against a large list of URLs (more than 30k vhosts).
The requests can be executed from the same server where the websites are hosted.
I was using multithreaded curl requests, but I'm not really satisfied with the results: a complete check of 30k hosts takes more than an hour.
I am wondering if anyone has any tips, or is there a more performant way to do it?
After testing some of the available solutions, the simplest and fastest was webchk.
webchk is a command-line tool developed in Python 3 for checking the HTTP status codes and response headers of URLs.
The speed was impressive and the output was clean; it checked 30k vhosts in about 2 minutes.
https://webchk.readthedocs.io/en/latest/index.html
https://pypi.org/project/webchk/
If you're looking for parallel or multi-threaded approaches to making HTTP requests with Python, you might start with the aiohttp library, or use the popular requests package. Thread-based parallelism is available in the standard library via multiprocessing.dummy (a thread-backed Pool) or concurrent.futures.
Here's a discussion of rate limiting with aiohttp client: aiohttp: rate limiting parallel requests
Here's a discussion about making multiprocessing work with requests: https://stackoverflow.com/a/27547938/10553976
How performant it ends up being is a matter of implementation; be sure to profile your attempts and compare them against your current approach.
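For a concrete starting point, here's a minimal sketch of the thread-pool route using requests with the standard library's concurrent.futures (the file name urls.txt and the worker count are assumptions to tune for your setup):

import concurrent.futures
import requests

def check(url):
    try:
        # HEAD is usually enough for a status check and is cheaper than GET.
        resp = requests.head(url, timeout=5, allow_redirects=True)
        return url, resp.status_code
    except requests.RequestException as exc:
        return url, str(exc)

def main():
    with open('urls.txt') as fh:  # assumed input file, one URL per line
        urls = [line.strip() for line in fh if line.strip()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
        for url, status in pool.map(check, urls):
            print(url, status)

if __name__ == '__main__':
    main()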
I am basically trying to start an HTTP server that will respond with content from a website that I can crawl using Scrapy. In order to start crawling the website I need to log in to it, and to do so I need to access a DB with credentials and such. The main issue is that I need everything to be fully asynchronous, and so far I am struggling to find a combination that makes everything work properly without a lot of sloppy workarounds.
I already have Klein + Scrapy working, but when I get to implementing DB access I get all messed up in my head. Is there any way to make PyMongo asynchronous with Twisted or something similar? (Yes, I have seen TxMongo, but the documentation is quite bad and I would like to avoid it. I have also found an implementation with adbapi, but I would like something more similar to PyMongo.)
Trying to think things through the other way around, I'm sure aiohttp has many more options for implementing async DB access and such, but then I find myself at an impasse with Scrapy integration.
I have seen things like scrapa, scrapyd and ScrapyRT but those don't really work for me. Are there any other options?
Finally, if nothing works, I'll just use aiohttp, and instead of Scrapy I'll make the requests to the website manually and use BeautifulSoup or something like that to get the info I need from the response. Any advice on how to proceed down that road?
Thanks for your attention, I'm quite a noob in this area so I don't know if I'm making complete sense. Regardless, any help will be appreciated :)
Is there any way to make pymongo asynchronous with twisted
No. pymongo is designed as a synchronous library, and there is no way you can make it asynchronous without basically rewriting it (you could use threads or processes, but that is not what you asked, and you can also run into thread-safety issues).
Trying to think things through the other way around, I'm sure aiohttp has many more options for implementing async DB access and such
It doesn't. aiohttp is an HTTP library: it can do HTTP asynchronously and that is all; it has nothing to help you access databases. You'd basically have to rewrite pymongo on top of it.
Finally, if nothing works, I'll just use aiohttp, and instead of Scrapy I'll make the requests to the website manually and use BeautifulSoup or something like that to get the info I need from the response.
That means a lot of work just to avoid using Scrapy, and it won't help you with the pymongo issue - you would still have to rewrite pymongo!
My suggestion is: learn txmongo! If you can't, and you want to rewrite something, build on twisted.web instead of aiohttp, since then you can keep using Scrapy!
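If it helps, here is a rough sketch of what txmongo usage looks like from Twisted (which Klein runs on). The method names are assumed to mirror pymongo's, so check the txmongo docs for the exact API; the host, database and collection names are placeholders:

from twisted.internet import defer, reactor
import txmongo

@defer.inlineCallbacks
def fetch_credentials():
    # MongoConnection returns a Deferred that fires with a connection object.
    conn = yield txmongo.MongoConnection('127.0.0.1', 27017)
    users = conn.mydb.users  # database/collection access, pymongo-style (placeholder names)
    doc = yield users.find_one({'name': 'crawler'})  # queries return Deferreds
    defer.returnValue(doc)

def main():
    d = fetch_credentials()
    d.addCallback(print)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

if __name__ == '__main__':
    main()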
While I'm familiar with both HTTP servers and event loops, I'm having some trouble grasping the inner workings of Python's asyncio.
As a learning exercise, I've been trying to write a minimal HTTP server (just echoing back the request method, URI, headers and body) without additional dependencies. I've looked into aiohttp and aiowsgi for reference, but I'm having trouble understanding what's going on there, in part because the perceived complexity of protocols, transports, etc. is a bit overwhelming. So I'm currently stuck because I don't quite know where to begin.
Is it naive to expect this to be just a few lines of code to establish the connection, consume the incoming text stream and send back another text stream?
You can take a look at picoweb as an example of a very simple (and very limited) HTTP server.
But, sure, once you try to implement a full-featured web server you will end up with something like aiohttp; HTTP is a complex (maybe even complicated) standard.
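For the toy case, though, it really can be just a few lines. A minimal sketch using only asyncio.start_server, echoing just the request line (header parsing is deliberately naive, and the port is an arbitrary choice):

import asyncio

async def handle(reader, writer):
    # Read the request head up to the blank line; ignore any body for simplicity.
    head = await reader.readuntil(b'\r\n\r\n')
    request_line = head.split(b'\r\n', 1)[0].decode()
    body = f'You sent: {request_line}\n'.encode()
    headers = (
        f'HTTP/1.1 200 OK\r\n'
        f'Content-Type: text/plain\r\n'
        f'Content-Length: {len(body)}\r\n'
        f'Connection: close\r\n\r\n'
    )
    writer.write(headers.encode() + body)
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle, '127.0.0.1', 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())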
Possible Duplicate:
Multiple (asynchronous) connections with urllib2 or other http library?
I am working on a Linux web server that runs Python code to grab realtime data over HTTP from a 3rd party API. The data is put into a MySQL database.
I need to make a lot of queries to a lot of URLs, and I need to do it fast (faster = better). Currently I'm using urllib3 as my HTTP library.
What is the best way to go about this? Should I spawn multiple threads (if so, how many?) and have each query a different URL?
I would love to hear your thoughts about this - thanks!
If a lot is really a lot, then you probably want to use asynchronous I/O, not threads.
requests + gevent = grequests
GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.
import grequests
urls = [
'http://www.heroku.com',
'http://tablib.org',
'http://httpbin.org',
'http://python-requests.org',
'http://kennethreitz.com'
]
# Build the unsent requests, then send them all at once; map() returns
# the responses in the same order (None for any request that failed).
rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)
You should use multithreading as well as pipelining the requests, for example search -> details -> save.
The number of threads you can use doesn't depend only on your hardware. How many requests can the service serve? How many concurrent requests does it allow? Even your bandwidth can be a bottleneck.
If you're doing a kind of scraping, the service could block you after a certain number of requests, so you may need to use proxies or multiple IP bindings.
As for me, in most cases I can run 50-300 concurrent requests from Python scripts on my laptop.
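Here is a rough sketch of that search -> details -> save pipeline with standard-library threads and queues (the worker count and the fetch/save functions are placeholders):

import queue
import threading

def fetch_details(item):
    # Placeholder for the 'details' HTTP request.
    return {'id': item}

def save_to_db(details):
    # Placeholder for the 'save' step (e.g. a DB insert).
    print('saved', details)

def detail_worker(search_q, save_q):
    while True:
        item = search_q.get()
        if item is None:  # sentinel: shut down
            break
        save_q.put(fetch_details(item))

def save_worker(save_q):
    while True:
        details = save_q.get()
        if details is None:
            break
        save_to_db(details)

def main():
    search_q, save_q = queue.Queue(), queue.Queue()
    fetchers = [threading.Thread(target=detail_worker, args=(search_q, save_q))
                for _ in range(10)]
    saver = threading.Thread(target=save_worker, args=(save_q,))
    for t in fetchers + [saver]:
        t.start()
    for item in range(100):  # stand-in for the 'search' results
        search_q.put(item)
    for _ in fetchers:
        search_q.put(None)   # stop the fetchers
    for t in fetchers:
        t.join()
    save_q.put(None)         # then stop the saver
    saver.join()

if __name__ == '__main__':
    main()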
Sounds like an excellent application for Twisted. Here are some web-related examples, including how to download a web page. Here is a related question on database connections with Twisted.
Note that Twisted does not rely on threads for doing multiple things at once. Rather, it takes a cooperative multitasking approach: your main script starts the reactor, and the reactor calls functions that you set up. Your functions must return control to the reactor before the reactor can continue working.
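A minimal sketch of fetching several pages concurrently with Twisted's Agent (the URLs are placeholders; Twisted also offers the higher-level treq package):

from twisted.internet import defer, task
from twisted.web.client import Agent, readBody

URLS = [b'http://example.com', b'http://example.org']  # placeholders

@defer.inlineCallbacks
def fetch(agent, url):
    response = yield agent.request(b'GET', url)
    body = yield readBody(response)
    defer.returnValue((url, response.code, len(body)))

def main(reactor):
    agent = Agent(reactor)
    # All fetches run cooperatively on the reactor; no threads involved.
    d = defer.gatherResults([fetch(agent, u) for u in URLS])
    d.addCallback(lambda results: [print(r) for r in results])
    return d

task.react(main)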
I am trying to move away from CherryPy for a web service that I am working on, and one alternative I am considering is Tornado. Now, on the backend most of my requests look something like:
get POST data
see if I have it in the cache (database access)
if not, make multiple HTTP requests to some other web service, which can take a good few seconds depending on the number of requests
I keep hearing that one should not block the Tornado main loop. If all of the above code is executed in the post() method of a RequestHandler, does this mean that I am blocking the loop? And if so, what's the appropriate way to use Tornado with these requirements?
Tornado ships with an asynchronous HTTP client (AsyncHTTPClient; actually two of them, IIRC). Use that one if you need to make additional HTTP requests.
The database lookup should also be done with an asynchronous client in order not to block the Tornado IOLoop/main loop. I know there are a couple of database clients tailor-made for Tornado (e.g. for Redis and MongoDB) out there. The MySQL lib is included in the Tornado distribution.
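A rough sketch of a non-blocking handler along those lines, assuming a recent Tornado with coroutine-style handlers (the URL, the cache lookup and the port are placeholders):

import tornado.gen
import tornado.ioloop
import tornado.web
from tornado.httpclient import AsyncHTTPClient

class MainHandler(tornado.web.RequestHandler):
    async def post(self):
        key = self.get_body_argument('key', default='')
        cached = None  # placeholder cache/DB lookup; use an async driver here too
        if cached is None:
            client = AsyncHTTPClient()
            # The await yields to the IOLoop, so other requests keep being served.
            futures = [client.fetch('https://example.com/api?q=%s&n=%d' % (key, i))
                       for i in range(3)]
            responses = await tornado.gen.multi(futures)
            cached = b''.join(r.body for r in responses)
        self.write(cached)

def make_app():
    return tornado.web.Application([(r'/', MainHandler)])

if __name__ == '__main__':
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()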