Ideal method for sending multiple HTTP requests over Python? [duplicate] - python

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Multiple (asynchronous) connections with urllib2 or other http library?
I am working on a Linux web server that runs Python code to grab realtime data over HTTP from a 3rd party API. The data is put into a MySQL database.
I need to make a lot of queries to a lot of URLs, and I need to do it fast (faster = better). Currently I'm using urllib3 as my HTTP library.
What is the best way to go about this? Should I spawn multiple threads (if so, how many?) and have each query for a different URL?
I would love to hear your thoughts about this - thanks!

If "a lot" really is a lot, then you probably want to use asynchronous I/O rather than threads.
requests + gevent = grequests
GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.
import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

# Build the (unsent) requests lazily, then send them all concurrently.
rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)  # returns a list of Response objects

You should use multithreading as well as pipelining your requests, e.g. search -> details -> save.
The number of threads you can use doesn't depend only on your hardware. How many requests can the service serve? How many concurrent requests does it allow? Even your bandwidth can be a bottleneck.
If you're doing a kind of scraping, the service could block you after a certain number of requests, so you may need to use proxies or multiple IP bindings.
In my experience, in most cases I can run 50-300 concurrent requests from Python scripts on my laptop.
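To make the threading suggestion concrete, here is a minimal sketch using the standard library's `concurrent.futures`. The `fetch` parameter is a placeholder for whatever single-URL function you already have (e.g. one built on urllib3), and `max_workers` is where you apply the concurrency limits discussed above:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=50):
    """Fetch every URL concurrently with a bounded pool of worker threads.

    `fetch` is any single-URL callable; `max_workers` caps concurrency so
    you stay within the service's (and your bandwidth's) limits.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and blocks until all results arrive.
        return list(pool.map(fetch, urls))
```

Tuning `max_workers` is exactly the empirical question raised above: start around 50 and raise it until the remote service or your bandwidth becomes the bottleneck.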

Sounds like an excellent application for Twisted. Here are some web-related examples, including how to download a web page. Here is a related question on database connections with Twisted.
Note that Twisted does not rely on threads for doing multiple things at once. Rather, it takes a cooperative multitasking approach: your main script starts the reactor, and the reactor calls functions that you set up. Your functions must return control to the reactor before the reactor can continue working.

Related

Is aiohttp truly asynchronous?

I can't find a straight answer on this. Is aiohttp asynchronous in the way that Javascript is? I'm building a project where I'll need to send a large number of requests to an endpoint. If I use requests, I'll need to wait for the response before I can send the next request. I researched a few async requests libraries in Python, but those all seemed to start new threads to send requests. If I understand asynchronous correctly, starting a new thread pretty much defeats the purpose of asynchronous code (tell me if I'm wrong here). I'm really looking for a single-threaded asynchronous requests library in Python that will send requests much like Javascript would (that is, will send another request while waiting for a response from the first and not start multiple threads). Is aiohttp what I'm looking for?
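Yes: aiohttp runs on asyncio's single-threaded event loop, which behaves like the JavaScript model described above. A minimal stdlib-only sketch of that model (with `asyncio.sleep` standing in for the network wait of an aiohttp call) shows requests overlapping on one thread:

```python
import asyncio
import time

async def fake_request(name, delay):
    # Stand-in for an aiohttp call; while this coroutine awaits, the
    # single event-loop thread is free to run the other coroutines.
    await asyncio.sleep(delay)
    return name

async def main():
    # Three "requests" in flight at once on one thread.
    return await asyncio.gather(
        fake_request("a", 0.05),
        fake_request("b", 0.05),
        fake_request("c", 0.05),
    )

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start  # ~0.05s, not ~0.15s: the waits overlap
```

With aiohttp you would replace `fake_request` with a coroutine that does `session.get(url)` inside an `aiohttp.ClientSession`; the concurrency mechanism is the same.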

How to manage multiple requests in microservices?

My goal is to build synchronous management for multiple requests.
I need to pass the requests on to microservices that have a limit on concurrent requests.
When I get more requests than the services' limit allows, I need to manage the overflow (maybe with a queue), but I don't know where the waiting requests should be held, or how the answers will be returned from the service.
I need to stay synchronous, and to build this in Python. I'm trying to work with the nameko framework.
Thanks for your help.
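One way to sketch this with only the standard library (the names and the `SERVICE_LIMIT` value below are hypothetical, and `call_service` stands in for the real nameko RPC call): a thread pool sized to the service's limit acts as the queue. Excess requests wait inside the executor's internal queue, and each answer comes back on the `Future` returned at submit time, so the caller can stay synchronous by blocking on `result()`:

```python
from concurrent.futures import ThreadPoolExecutor

SERVICE_LIMIT = 3  # hypothetical: max concurrent calls the microservice allows

def call_service(payload):
    # Stand-in for the real microservice (e.g. nameko) call.
    return {"echo": payload}

pool = ThreadPoolExecutor(max_workers=SERVICE_LIMIT)

def submit(payload):
    # Requests beyond SERVICE_LIMIT wait in the executor's internal queue;
    # the answer is delivered on the returned Future.
    return pool.submit(call_service, payload)

futures = [submit(i) for i in range(10)]
answers = [f.result() for f in futures]  # synchronous: block until each is done
```

This answers the two open questions directly: waiting requests live in the executor's queue, and answers travel back through the futures.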

Multi threading script for HTTP status codes

Hi Stackoverflow community,
I would like to create a script that uses multithreading to issue a high number of parallel requests for HTTP status codes against a large list of URLs (more than 30k vhosts).
The requests can be executed from the same server where the websites are hosted.
I was using multithreaded curl requests, but I'm not really satisfied with the results: a complete check of 30k hosts takes more than an hour.
I am wondering if anyone has any tips or is there a more performant way to do it?
After testing some of the available solutions, the simplest and fastest way was to use webchk.
webchk is a command-line tool, developed in Python 3, for checking the HTTP status codes and response headers of URLs.
The speed was impressive and the output was clean; it parsed 30k vhosts in about 2 minutes.
https://webchk.readthedocs.io/en/latest/index.html
https://pypi.org/project/webchk/
If you're looking for parallel or multi-threaded approaches to making HTTP requests with Python, you might start with the aiohttp library, or use the popular requests package with a thread pool. Process-based parallelism is also available via multiprocessing in the standard library.
Here's a discussion of rate limiting with aiohttp client: aiohttp: rate limiting parallel requests
Here's a discussion about making multiprocessing work with requests https://stackoverflow.com/a/27547938/10553976
Making it performant is a matter of your implementation. Be sure to profile your attempts and compare to your current implementation.
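To illustrate the rate-limiting idea from the aiohttp discussion linked above, here is a stdlib-only sketch: a semaphore caps how many checks run at once. `check_status` is a stub standing in for a real aiohttp HEAD request, and `MAX_IN_FLIGHT` is a hypothetical cap:

```python
import asyncio

MAX_IN_FLIGHT = 10  # hypothetical cap on concurrent checks

async def check_status(url):
    # Stand-in for an aiohttp HEAD request; returns a fake status code.
    await asyncio.sleep(0)
    return url, 200

async def check_all(urls):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(url):
        async with sem:  # at most MAX_IN_FLIGHT checks run at once
            return await check_status(url)

    # gather() preserves input order in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(check_all([f"http://host{i}" for i in range(100)]))
```

With a real client you would open one `aiohttp.ClientSession` in `check_all` and issue the HEAD requests through it; the semaphore pattern stays the same.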

Asynchronous Socket connections in Python or Node

I am creating an application that basically has multiple connections to a third party Chat Streaming API(Socket based).
The way it works is - Every user has an account on my app and another account on the third party app. He gives me an access token for the third party chat app and I connect to the third party API to stream his chats. This happens for hundreds of users.
I need to create a socket connection pool for every user and run parallel threads. I am using a python library(for that API) and am able to achieve real time feeds for single users. How do I implement an asynchronous socket connection pool in Python or NodeJS? I have a Linux micro instance on EC2 and I need to run this application for 1000 users.
I am exploring Redis+Tornado to implement this. Are there any better alternatives?
This will be messy, and there are a couple of things to consider.
If you are going to use multiple threads, remember that the OS only permits so many per CPU; beyond that, go with multiprocessing instead.
If you go async but your handlers block on long-polling calls, they will prevent other clients' requests from being processed.
Solution
When your application absolutely needs to be real-time, I would suggest WebSockets for server-client interaction.
Then, from each client's request, start a single process that listens/polls on your streaming API, using multiprocessing in Python. You will essentially create a separate process for each client.
Now, to make your WebSocketHandler and background API streamer interact with each other, you can use the Observer pattern (https://en.wikipedia.org/wiki/Observer_pattern) to notify the WebSocket handler that you have received data from the API.
Make sure that you assign a unique ID to every client, and that you only post data to the intended client when using WebSockets.
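A minimal sketch of that Observer pattern, with hypothetical names: the background streamer side keeps per-client subscriber lists, and each WebSocket handler registers a callback under its client's unique ID, so data is only delivered to the intended client:

```python
class StreamPublisher:
    """Background API-streamer side: notifies subscribers per client ID."""

    def __init__(self):
        self._subscribers = {}  # client_id -> list of callbacks

    def subscribe(self, client_id, callback):
        # A WebSocket handler registers itself as an observer here.
        self._subscribers.setdefault(client_id, []).append(callback)

    def publish(self, client_id, message):
        # Only the intended client's handlers are notified.
        for callback in self._subscribers.get(client_id, []):
            callback(message)

# Hypothetical WebSocket side: collect messages for one client.
received = []
publisher = StreamPublisher()
publisher.subscribe("user-42", received.append)
publisher.publish("user-42", "hello")
publisher.publish("user-99", "not for user-42")
```

In the real application, the callback would write to that client's open WebSocket instead of appending to a list.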
EDIT:
Web:
Also, on your question regarding Tornado: it is a good lightweight framework for running up to maybe 1,000 users. For anything more than that I would suggest looking at Django, as it will allow you to be more productive in producing code, and there are lots of tools out there that the community has developed over time.
Database:
Redis is a good choice if you need a very fast NoSQL DB; also have a look at MongoDB. If you require a multi-region DB, I would suggest going with Cassandra or CouchDB due to their partitioned nodes.

of tornado and blocking code

I am trying to move away from CherryPy for a web service that I am working on, and one alternative I am considering is Tornado. Most of my requests look something like this on the backend:
get POST data
see if I have it in cache (database access)
if not, make multiple HTTP requests to some other web service, which can take a good few seconds depending on the number of requests
I keep hearing that one should not block the Tornado main loop; I am wondering, if all of the above code is executed in the post() method of a RequestHandler, does this mean that I am blocking it? And if so, what's the appropriate way to use Tornado with the above requirements?
Tornado ships with an asynchronous HTTP client, AsyncHTTPClient (actually two implementations, IIRC). Use that one if you need to make additional HTTP requests.
The database lookup should also be done using an asynchronous client in order not to block the Tornado IOLoop/main loop. There are a couple of tailor-made Tornado database clients out there (e.g. for Redis and MongoDB), and a MySQL lib is included in the Tornado distribution.
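When a blocking call (such as a synchronous database lookup) has no async client available, a common fallback is to hand it off to a thread pool so the event loop stays free. A stdlib sketch of that idea, with stub names (modern Tornado runs on the asyncio event loop, so the same mechanism applies there):

```python
import asyncio

def blocking_db_lookup(key):
    # Stand-in for a synchronous database/cache query.
    return {"key": key, "cached": True}

async def handle_request(key):
    loop = asyncio.get_running_loop()
    # The blocking call runs in the default thread pool; the event loop
    # keeps serving other requests in the meantime.
    return await loop.run_in_executor(None, blocking_db_lookup, key)

result = asyncio.run(handle_request("abc"))
```

This is only a fallback: a genuinely asynchronous client, as recommended above, avoids tying up pool threads on I/O waits.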
