Hi Stack Overflow community,
I would like to create a script that uses multithreading to make a high number of parallel requests for HTTP status codes against a large list of URLs (more than 30k vhosts).
The requests can be executed from the same server where the websites are hosted.
I was using multithreaded curl requests, but I'm not really satisfied with the results I got: a complete check of 30k hosts takes more than an hour.
I am wondering if anyone has any tips, or knows a more performant way to do it?
After testing some of the available solutions, the simplest and fastest was webchk.
webchk is a command-line tool developed in Python 3 for checking the HTTP status codes and response headers of URLs.
The speed was impressive and the output was clean: it parsed 30k vhosts in about 2 minutes.
https://webchk.readthedocs.io/en/latest/index.html
https://pypi.org/project/webchk/
If you're looking for parallelism and multi-threaded approaches to making HTTP requests with Python, you might start with the aiohttp library, or use the popular requests package. Multithreading can be done with the standard library, e.g. concurrent.futures or multiprocessing.dummy (the thread-based counterpart of multiprocessing).
Here's a discussion of rate limiting with aiohttp client: aiohttp: rate limiting parallel requests
Here's a discussion about making multiprocessing work with requests https://stackoverflow.com/a/27547938/10553976
Making it performant is a matter of your implementation. Be sure to profile your attempts and compare to your current implementation.
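For illustration, here is a minimal sketch of bounded-concurrency status checks with aiohttp; the URL list, concurrency limit, and timeout are placeholder assumptions, not something from the original answer:

# A minimal sketch, assuming aiohttp is installed; limits and URLs are placeholders.
import asyncio
import aiohttp

CONCURRENCY = 100  # tune to what the target server tolerates

async def check(session, sem, url):
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                return url, resp.status
        except Exception as exc:
            return url, repr(exc)

async def main(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check(session, sem, u) for u in urls))
    for url, status in results:
        print(url, status)

if __name__ == '__main__':
    urls = ['http://httpbin.org/status/200', 'http://httpbin.org/status/404']  # placeholder list
    asyncio.run(main(urls))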
Related
I can't find a straight answer on this. Is aiohttp asynchronous in the way that JavaScript is? I'm building a project where I'll need to send a large number of requests to an endpoint. If I use requests, I'll need to wait for the response before I can send the next request. I researched a few async request libraries in Python, but those all seemed to start new threads to send requests. If I understand asynchrony correctly, starting a new thread pretty much defeats the purpose of asynchronous code (tell me if I'm wrong here). I'm really looking for a single-threaded asynchronous requests library in Python that will send requests much like JavaScript would (that is, it will send another request while waiting for a response from the first and not start multiple threads). Is aiohttp what I'm looking for?
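For what it's worth, a minimal sketch showing that aiohttp overlaps requests on a single thread; the URLs are placeholders, and printing the thread name is just to illustrate the point:

# A minimal sketch, assuming aiohttp is installed; URLs are placeholders.
import asyncio
import threading
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        # Everything runs on the same thread; the event loop switches between
        # requests while each one is waiting on the network.
        print(threading.current_thread().name, url, resp.status)

async def main():
    urls = ['http://httpbin.org/delay/1'] * 5  # placeholder list
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, u) for u in urls))

asyncio.run(main())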
I know we can use Selenium with ChromeDriver as a high-level website interaction tool, and we can speed it up with PhantomJS. Then we can use the requests module to be even faster. But that's where my knowledge stops.
What's the fastest possible way in Python to do POST and GET requests? I assume there is a lower-level library than requests? Do we use sockets and packets?
'To execute the request as fast as possible'
If Python's requests lib is the fastest, are there libs in other programming languages, such as C++, that are worth a look?
It really depends on the task: for web scraping 1000 pages it's fine, but when you need to requests.post 1,000,000+ times it adds up. I've also looked into the multiprocessing lib. It helps a lot with using all the computational resources I have, but traversing the network and waiting for the response is what takes the longest. I would have thought the best way to increase speed is by sending and receiving less data: say, receive only 5 input parameters and send only those 5 back as a POST, then wait for a 200 response. Any ideas how I can do this without receiving all the source code?
Thanks!
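For illustration, one hedged way to keep the transfer small with requests is to send only a minimal payload and avoid downloading the response body: stream the response and read just the status code. The URL and payload below are placeholders:

# A minimal sketch, assuming the requests package; URL and payload are placeholders.
import requests

session = requests.Session()  # reuse the TCP connection across calls

payload = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # only the 5 parameters you care about

# stream=True defers downloading the body; we only inspect the status code.
resp = session.post('http://httpbin.org/post', data=payload, stream=True, timeout=10)
print(resp.status_code)
resp.close()  # drop the connection without reading the body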
When doing a scrape of a site, which would be preferable: using curl, or using Python's requests library?
I originally planned to use requests and explicitly specify a user agent. However, when I use this I often get an "HTTP 429 too many requests" error, whereas with curl, it seems to avoid that.
I need to update metadata information on 10,000 titles, and I need a way to pull down the information for each of the titles in a parallelized fashion.
What are the pros and cons of using each for pulling down information?
Since you want to parallelize the requests, you should use requests together with grequests (if you're using gevent) or erequests (if you're using eventlet). You may have to throttle how quickly you hit the website, though, since they may do some rate limiting and refuse you for requesting too much in too short a period of time.
Using requests would allow you to do it programmatically, which should result in a cleaner product.
If you use curl, you're doing os.system calls, which are slower.
I'd go for the in-language version over an external program any day, because it's less hassle.
Only if it turns out unworkable would I fall back to this. Always consider that people's time is infinitely more valuable than machine time. Any "performance gains" in such an application will probably be swamped by network delays anyway.
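For illustration, a sketch of throttled grequests usage via the size parameter of imap; the URLs and the concurrency value are placeholders:

# A minimal sketch, assuming grequests (and gevent) are installed; URLs are placeholders.
import grequests

urls = ['http://httpbin.org/get'] * 100  # placeholder list of title URLs

reqs = (grequests.get(u, timeout=10) for u in urls)

# size=10 limits how many requests are in flight at once, which helps
# avoid tripping the site's rate limiting (HTTP 429).
for resp in grequests.imap(reqs, size=10):
    print(resp.url, resp.status_code)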
I am working on a Linux web server that runs Python code to grab real-time data over HTTP from a third-party API. The data is put into a MySQL database.
I need to make a lot of queries to a lot of URLs, and I need to do it fast (faster = better). Currently I'm using urllib3 as my HTTP library.
What is the best way to go about this? Should I spawn multiple threads (if so, how many?) and have each query for a different URL?
I would love to hear your thoughts about this - thanks!
If "a lot" really is a lot, then you probably want to use asynchronous I/O rather than threads.
requests + gevent = grequests
GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.
import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

# Build the (unsent) requests lazily, then send them concurrently.
rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)

# map() returns None for requests that failed.
for r in responses:
    print(r.status_code if r is not None else 'request failed')
You should use multithreading as well as pipelining the requests, for example search -> details -> save.
The number of threads you can use doesn't depend only on your hardware. How many requests can the service serve? How many concurrent requests does it allow? Even your bandwidth can be a bottleneck.
If you're talking about some kind of scraping, the service could block you after a certain number of requests, so you may need to use proxies or multiple IP bindings.
In my experience, in most cases I can run 50-300 concurrent requests on my laptop from Python scripts.
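For illustration, a minimal sketch of a bounded thread pool built on concurrent.futures and requests; the worker count and URLs are placeholders to tune against the service's limits:

# A minimal sketch, assuming the requests package; worker count and URLs are placeholders.
import concurrent.futures
import requests

urls = ['http://httpbin.org/get'] * 20  # placeholder list

def fetch(url):
    resp = requests.get(url, timeout=10)
    return url, resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)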
Sounds like an excellent application for Twisted. Here are some web-related examples, including how to download a web page. Here is a related question on database connections with Twisted.
Note that Twisted does not rely on threads for doing multiple things at once. Rather, it takes a cooperative multitasking approach: your main script starts the reactor, and the reactor calls functions that you set up. Your functions must return control to the reactor before the reactor can continue working.
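For illustration, a minimal sketch of fetching a page with Twisted's Agent; the URL is a placeholder and error handling is omitted:

# A minimal sketch, assuming Twisted is installed; the URL is a placeholder.
from twisted.internet import defer, reactor
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

@defer.inlineCallbacks
def fetch(url):
    # Each 'yield' hands control back to the reactor until the Deferred fires.
    response = yield agent.request(b'GET', url)
    body = yield readBody(response)
    print(url.decode(), response.code, len(body), 'bytes')

d = defer.gatherResults([fetch(b'http://httpbin.org/get')])
d.addBoth(lambda _: reactor.stop())
reactor.run()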
I am trying to move away from CherryPy for a web service that I am working on and one alternative that I am considering is Tornado. Now, most of my requests look on the backend something like:
get POST data
see if I have it in cache (database access)
if not, make multiple HTTP requests to some other web service, which can take a good few seconds depending on the number of requests
I keep hearing that one should not block the Tornado main loop. If all of the above code is executed in the post() method of a RequestHandler, does this mean that I am blocking the loop? And if so, what's the appropriate approach to using Tornado with the above requirements?
Tornado ships with an asynchronous HTTP client, AsyncHTTPClient (actually two implementations, if I recall correctly). Use that one if you need to make additional HTTP requests.
The database lookup should also be done using an asynchronous client in order not to block the Tornado ioloop/main loop. I know there are a couple of Tornado tailor-made database clients out there (e.g. for Redis and MongoDB). A MySQL lib is included in the Tornado distribution.
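For illustration, a minimal sketch of a non-blocking handler using AsyncHTTPClient in modern async/await style; the endpoint URL and route are placeholders:

# A minimal sketch, assuming a recent Tornado; endpoint URL and route are placeholders.
import tornado.ioloop
import tornado.web
from tornado.httpclient import AsyncHTTPClient

class ProxyHandler(tornado.web.RequestHandler):
    async def post(self):
        client = AsyncHTTPClient()
        # Awaiting the fetch lets the IOLoop serve other clients in the meantime.
        response = await client.fetch('https://api.example.com/data')
        self.write(response.body)

def make_app():
    return tornado.web.Application([(r'/fetch', ProxyHandler)])

if __name__ == '__main__':
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()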