Why doesn't AsyncHTTPClient in Tornado send requests immediately? - python

In my current application I use Tornado AsyncHTTPClient to make requests to a web site.
The flow is complex: processing responses from a previous request results in another request.
Actually, I download an article, then analyze it and download the images mentioned in it.
What bothers me is that while my log clearly shows the message indicating that .fetch() on a photo URL has been issued, no actual HTTP request is made, as sniffed in Wireshark.
I tried tinkering with max_clients and the Curl/Simple HTTP clients, but the behavior is always the same: until all articles are downloaded, no photo requests are actually issued. How can I change this?
Update: some pseudocode.
@VictorSergienko: I am on Linux, so by default, I guess, the epoll version is used. The whole system is too complicated, but it boils down to:
@gen.coroutine
def fetch_and_process(self, url, callback):
    body = yield self.async_client.fetch(url)
    res = yield callback(body)
    return res

@gen.coroutine
def process_articles(self, urls):
    futures = []
    for url in urls:
        # Enqueue but don't wait for one
        futures.append(self.fetch_and_process(url, self.process_article))
    # wait for all tasks to finish
    yield futures
@gen.coroutine
def process_article(self, body):
    photo_url = self.extract_photo_url_from_page(body)
    do_some_stuff()
    print('I gonna download that photo ' + photo_url)
    yield self.download_photo(photo_url)

@gen.coroutine
def download_photo(self, photo_url):
    response = yield self.async_client.fetch(photo_url)
    # open in binary write mode; fetch() returns a response object
    # whose .body holds the raw bytes
    with open(self.construct_filename(photo_url), 'wb') as f:
        f.write(response.body)
And when it prints "I gonna download that photo", no actual request is made!
Instead, it keeps downloading more articles and enqueueing more photos until all articles are downloaded; only THEN are all the photos requested in bulk.

AsyncHTTPClient has a queue, which you are filling up immediately in process_articles ("Enqueue but don't wait for one"). By the time the first article is processed, its photos will go to the end of the queue, after all the other articles.
If you yielded each self.fetch_and_process call in process_articles instead of starting them all at once, you would alternate between articles and their photos, but you could only be downloading one thing at a time. To maintain a balance between articles and photos while still downloading more than one thing at a time, consider using the toro package for synchronization primitives. The example in http://toro.readthedocs.org/en/stable/examples/web_spider_example.html is similar to your use case.
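toro's primitives were later merged into Tornado itself as tornado.queues and tornado.locks (Tornado 4.2+). Here's a condensed sketch of the linked spider pattern under that assumption; CONCURRENCY, worker, and crawl are illustrative names, not from the question:

from tornado import gen, queues
from tornado.httpclient import AsyncHTTPClient

q = queues.Queue()
CONCURRENCY = 10  # illustrative cap on simultaneous downloads

@gen.coroutine
def worker():
    client = AsyncHTTPClient()
    while True:
        url = yield q.get()
        try:
            response = yield client.fetch(url)
            # process the page here; calling q.put(photo_url) for any
            # photos found makes them share the same worker pool
        finally:
            q.task_done()

@gen.coroutine
def crawl(urls):
    for url in urls:
        q.put(url)
    for _ in range(CONCURRENCY):
        worker()  # start a fixed number of workers
    yield q.join()  # done when the queue is drained and workers are idle

Because only CONCURRENCY fetches are ever in flight, the HTTP client's internal queue never piles up the way it does when everything is enqueued up front.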

Related

concurrent connections in urllib3

Using a loop to make multiple requests to various websites, how is it possible to do this with a proxy in urllib3?
The code will read in a tuple of URLs and use a for loop to connect to each site; however, currently it does not connect past the first URL in the tuple. There is a proxy in place as well.
url_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
for i in url_list:
    http = ProxyManager("PROXY-PROXY")
    http_get = http.request('GET', i, preload_content=False).read().decode()
I have removed the URLs and proxy information from the above code. The first URL in the tuple runs fine, but after this, nothing else occurs, just waiting. I have tried the clear() method to reset the connection each time through the loop.
Unfortunately, urllib3 is synchronous and blocks. You could use it with threads, but that is a hassle and usually leads to more problems. The main approach these days is to use an asynchronous networking library. Twisted and asyncio (with aiohttp, perhaps) are the popular packages.
I'll provide an example using the trio framework and asks:

import asks
import trio

asks.init('trio')

path_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
results = []
s = asks.Session()

async def grabber(path):
    r = await s.get(path)
    results.append(r)

async def main(path_list):
    async with trio.open_nursery() as n:
        for path in path_list:
            # start_soon takes the async function and its args separately
            n.start_soon(grabber, path)

trio.run(main, path_list)
Using threads is not really that much of a hassle since Python 3.2, when concurrent.futures was added:

from urllib3 import ProxyManager
from concurrent.futures import ThreadPoolExecutor, wait

url_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
thread_pool = ThreadPoolExecutor(max_workers=min(len(url_list), 20))
tasks = []

for url in url_list:
    # Bind url as a default argument so each closure keeps its own copy;
    # a bare reference to url would see the loop variable's final value
    # if the thread runs after the loop has moved on.
    def send_request(this_url=url):
        # This assignment could probably be moved out of the loop and the
        # ProxyManager shared; I'd have to read the docs to be sure.
        http = ProxyManager("PROXY-PROXY")
        return http.request('GET', this_url, preload_content=False).read().decode()
    tasks.append(thread_pool.submit(send_request))

wait(tasks)
all_responses = [task.result() for task in tasks]
Later versions offer an event loop via asyncio. Issues I've had with asyncio are usually related to portability of libraries (e.g. aiohttp via pydantic), most of which are not pure Python and have external libc dependencies. This can be an issue if you have to support a lot of Docker apps, which might run musl libc (Alpine) or glibc (everyone else).
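For completeness, a minimal asyncio sketch of the same fan-out, assuming aiohttp is installed and that the proxy is passed per request (the URL and proxy strings are the same placeholders as above):

import asyncio
import aiohttp

url_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']

async def fetch(session, url):
    # aiohttp takes the proxy per request rather than per pool
    async with session.get(url, proxy='http://PROXY-PROXY') as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in url_list))

all_responses = asyncio.get_event_loop().run_until_complete(main())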

python tornado async client

I created a batch delayed HTTP (async) client which triggers multiple async HTTP requests and, most importantly, delays the start of the requests, so that, for example, 100 requests are not all fired at once.
But it has an issue. The HTTP .fetch() method takes a handleMethod parameter which handles the response, but I found out that if the delay (sleep) after the fetch isn't long enough, the handle method is not even triggered (maybe the request is killed in the meantime?).
It is probably related to the .run_sync method. How can I fix that? I want to keep the delays but avoid this issue.
I need to parse the response regardless of how long the request takes, and regardless of the following sleep call (that call has another purpose, as I said, and should not be related to response handling at all).
class BatchDelayedHttpClient:
    def __init__(self, requestList):
        # class members
        self.httpClient = httpclient.AsyncHTTPClient()
        self.requestList = requestList
        ioloop.IOLoop.current().run_sync(self.execute)

    @gen.coroutine
    def execute(self):
        print("exec start")
        for request in self.requestList:
            print("requesting " + request["url"])
            self.httpClient.fetch(request["url"], request["handleMethod"],
                                  method=request["method"],
                                  headers=request["headers"],
                                  body=request["body"])
            yield gen.sleep(request["sleep"])
        print("exec end")

Can I asynchronously duplicate a webapp2.RequestHandler Request to a different url?

For a percentage of production traffic, I want to duplicate the received request to a different version of my application. This needs to happen asynchronously so I don't double service time to the client.
The reason for doing this is so I can compare the responses generated by the prod version and a production candidate version. If their results are appropriately similar, I can be confident that the new version hasn't broken anything. (If I've made a functional change to the application, I'd filter out the necessary part of the response from this comparison.)
So I'm looking for an equivalent to:
class Foo(webapp2.RequestHandler):
    def post(self):
        handle = make_async_call_to('http://other_service_endpoint.com/', self.request)
        # process the user's request in the usual way
        test_response = handle.get_response()
        # compare the locally-prepared response and the remote one, and log
        # the diffs
        # return the locally-prepared response to the caller
UPDATE
google.appengine.api.urlfetch was suggested as a potential solution to my problem, but while it behaves the way I wanted in production, it's synchronous in the dev_appserver: the request doesn't go out until get_result() is called, and that call blocks:
start_time = time.time()
rpcs = []

print 'creating rpcs:'
for _ in xrange(3):
    rpcs.append(urlfetch.create_rpc())
    print time.time() - start_time

print 'making fetch calls:'
for rpc in rpcs:
    urlfetch.make_fetch_call(rpc, 'http://httpbin.org/delay/3')
    print time.time() - start_time

print 'getting results:'
for rpc in rpcs:
    rpc.get_result()
    print time.time() - start_time
creating rpcs:
9.51290130615e-05
0.000154972076416
0.000189065933228
making fetch calls:
0.00029993057251
0.000356912612915
0.000473976135254
getting results:
3.15417003632
6.31326603889
9.46627306938
UPDATE2
So, after playing with some other options, I found a way to make completely non-blocking requests:
from google.appengine.api import apiproxy_stub_map, urlfetch

start_time = time.time()
rpcs = []

logging.info('creating rpcs:')
for i in xrange(10):
    rpc = urlfetch.create_rpc(deadline=30.0)
    url = 'http://httpbin.org/delay/{}'.format(i)
    urlfetch.make_fetch_call(rpc, url)
    rpc.callback = create_callback(rpc, url)
    rpcs.append(rpc)
logging.info(time.time() - start_time)

logging.info('getting results:')
while rpcs:
    rpc = apiproxy_stub_map.UserRPC.wait_any(rpcs)
    rpcs.remove(rpc)
    logging.info(time.time() - start_time)
...but the important point to note is that none of the async fetch options in urlfetch work in the dev_appserver. Having discovered this, I went back to try @DanCornilescu's solution and found that it only works properly in production, not in the dev_appserver.
The URL Fetch service supports asynchronous requests. From Issuing an asynchronous request:

HTTP(S) requests are synchronous by default. To issue an asynchronous request, your application must:

1. Create a new RPC object using urlfetch.create_rpc(). This object represents your asynchronous call in subsequent method calls.
2. Call urlfetch.make_fetch_call() to make the request. This method takes your RPC object and the request target's URL as parameters.
3. Call the RPC object's get_result() method. This method returns the result object if the request is successful, and raises an exception if an error occurred during the request.

The following snippets demonstrate how to make a basic asynchronous request from a Python application. First, import the urlfetch library from the App Engine SDK:
from google.appengine.api import urlfetch
Next, use urlfetch to make the asynchronous request:
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, "http://www.google.com/")

# ... do other things ...

try:
    result = rpc.get_result()
    if result.status_code == 200:
        text = result.content
        self.response.write(text)
    else:
        self.response.status_code = result.status_code
        logging.error("Error making RPC request")
except urlfetch.DownloadError:
    logging.error("Error fetching URL")
Note: as per Sniggerfardimungus's experiment mentioned in the question's update, the async calls might not work as expected on the development server (being serialized instead of concurrent), but they do work when deployed on GAE. Personally I haven't used the async calls yet, so I can't really say.
If the intent is not to block at all waiting for the response from the production candidate app, you could push a copy of the original request and the production-prepared response onto a task queue, then answer the original request, with negligible delay (that of enqueueing the task).
The handler for the respective task queue would, outside of the original request's critical path, make the request to the staging app using the copy of the original request (async or not, it doesn't really matter from the point of view of impacting the production app's response time), get its response, compare it with the production-prepared response, log the deltas, etc. This can be nicely wrapped in a separate module for minimal changes to the production app and deployed/deleted as needed.
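A minimal sketch of that task-queue approach; the handler names, the /compare task URL, and the candidate app URL are made up for illustration, and compute_response stands in for whatever the production handler really does:

import json
import logging
import webapp2
from google.appengine.api import taskqueue, urlfetch

class Foo(webapp2.RequestHandler):
    def post(self):
        prod_response = compute_response(self.request)  # the real work
        # Enqueue the comparison outside the critical path; the only cost
        # added to this request is the enqueue itself
        taskqueue.add(url='/compare', payload=json.dumps({
            'request_body': self.request.body,
            'prod_response': prod_response,
        }))
        self.response.write(prod_response)

class CompareHandler(webapp2.RequestHandler):
    def post(self):
        data = json.loads(self.request.body)
        candidate = urlfetch.fetch(
            'http://candidate-dot-myapp.appspot.com/',  # hypothetical staging URL
            payload=data['request_body'], method=urlfetch.POST)
        if candidate.content != data['prod_response']:
            logging.info('responses differ: %r', candidate.content)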

RabbitMQ remote call with Pika

I am new to RabbitMQ and trying to figure out how I can make a client request a server for information about memory and CPU utilization, following this tutorial (https://www.rabbitmq.com/tutorials/tutorial-six-python.html).
So the client requests CPU and memory values (I believe I will need two queues) and the server responds with the values.
Is there any way to simply create a client.py and server.py for this case using the Pika library in Python?
I would recommend following the first RabbitMQ tutorials if you haven't already. The RPC example builds on concepts covered in the previous examples (direct queues, exclusive queues, acknowledgements, etc.).
The RPC solution proposed in the tutorial requires at least two queues, depending on how many clients you want to use:
One direct queue (rpc_queue), used to send requests from the client to the server.
One exclusive queue per client, used to receive responses.
The request/response cycle:
The client sends a message to the rpc_queue. Each message includes a reply_to property with the name of the client's exclusive queue the server should reply to, and a correlation_id property, which is just a unique id used to track the request.
The server waits for messages on the rpc_queue. When a message arrives, it prepares the response, adds the correlation_id to the new message, and sends it to the queue defined in the reply_to message property.
The client waits on its exclusive queue until it finds a message with the correlation_id that it originally generated.
Jumping straight to your problem, the first thing to do is to define the message format you'll want to use on your responses. You can use JSON, msgpack or any other serialization library. For example, if using JSON, one message could look something like this:
{
    "cpu": 1.2,
    "memory": 0.3
}
Then, on your server.py:

import json
import pika

# current_cpu_usage() and current_memory_usage() are placeholders for
# however you gather the metrics (psutil, /proc, etc.)
def on_request(channel, method, props, body):
    response = {'cpu': current_cpu_usage(),
                'memory': current_memory_usage()}
    properties = pika.BasicProperties(correlation_id=props.correlation_id)
    channel.basic_publish(exchange='',
                          routing_key=props.reply_to,
                          properties=properties,
                          body=json.dumps(response))
    channel.basic_ack(delivery_tag=method.delivery_tag)

# ...
And on your client.py:

import json
import time
import uuid
import pika

class ResponseTimeout(Exception):
    pass

class Client:
    # similar constructor as `FibonacciRpcClient` from the tutorial...

    def on_response(self, channel, method, props, body):
        if self.correlation_id == props.correlation_id:
            self.response = json.loads(body.decode())

    def call(self, timeout=2):
        self.response = None
        self.correlation_id = str(uuid.uuid4())
        self.channel.basic_publish(exchange='',
                                   routing_key='rpc_queue',
                                   properties=pika.BasicProperties(
                                       reply_to=self.callback_queue,
                                       correlation_id=self.correlation_id),
                                   body='')
        start_time = time.time()
        while self.response is None:
            if (start_time + timeout) < time.time():
                raise ResponseTimeout()
            self.connection.process_data_events()
        return self.response
As you see, the code is pretty much the same as the original FibonacciRpcClient. The main differences are:
We use JSON as the data format for our messages.
Our client call() method doesn't require a body argument (there's nothing to send to the server).
We take care of response timeouts (if the server is down, or if it doesn't reply to our messages).
Still, there are a lot of things to improve here:
No error handling: for example, if the client "forgets" to send a reply_to queue, our server is going to crash, and will crash again on restart (the broken message will be requeued infinitely as long as it isn't acknowledged by our server); see the sketch below for one way to guard against that.
We don't handle broken connections (no reconnection mechanism).
...
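For the first point, a minimal sketch of guarding the server against requests missing a reply_to, rejecting them without requeueing so a malformed message can't crash the server in a loop (this is an addition, not part of the tutorial):

def on_request(channel, method, props, body):
    if not props.reply_to:
        # Drop the malformed message instead of crashing; requeue=False
        # keeps it from coming straight back after a restart
        channel.basic_reject(delivery_tag=method.delivery_tag, requeue=False)
        return
    # ... build and publish the response as above ...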
You may also consider replacing the RPC approach with a publish/subscribe pattern; that way, the server simply broadcasts its CPU/memory state every X seconds, and one or more clients receive the updates.
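A sketch of that broadcast variant using a fanout exchange, assuming pika 1.x and reusing the hypothetical current_cpu_usage()/current_memory_usage() helpers from above:

import json
import time
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.exchange_declare(exchange='stats', exchange_type='fanout')

while True:
    payload = json.dumps({'cpu': current_cpu_usage(),
                          'memory': current_memory_usage()})
    # Every consumer bound to the 'stats' exchange gets a copy
    channel.basic_publish(exchange='stats', routing_key='', body=payload)
    time.sleep(5)  # broadcast interval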

How can I get the Python Task Queue and Channel API to send messages and respond to requests during a long-running process?

This is probably a basic question, but I have not been able to find the answer.
I have a long-running process that produces data every few minutes, and I would like the client to receive that data as soon as it is ready. Currently I have the long-running process in a Task Queue, and it adds channel messages to another Task Queue from within a for loop. The client successfully receives the channel messages and downloads the data using a get request; however, the messages are only sent from the task queue after the long-running process finishes (after about 10 minutes), instead of when the messages are added to the task queue.
How can I have the messages in the task queue sent immediately? Do I need to break the for loop into several tasks? The for loop creates a number of dictionaries that I think I would need to post to the datastore and then retrieve for the next iteration (which does not seem like an ideal solution), unless there is an easier way to return data from a task.
When I do not add the messages to a Task Queue and instead send them directly in the for loop, the server does not seem to respond to the client's get request for the data (possibly because the long-running process's for loop blocks the response?).
Here is a simplified version of my server code:
import webapp2
from google.appengine.ext import db
from google.appengine.api import channel
from google.appengine.api import taskqueue
from google.appengine.api import rdbms

class MainPage(webapp2.RequestHandler):
    def get(self):
        ## This opens the GWT app
        pass

class Service_handler(webapp2.RequestHandler):
    def get(self, parameters):
        ## This is called by the GWT app and generates the data to be
        ## sent to the client.
        # This adds the long-running process to a task queue
        taskqueue.Task(url='/longprocess/',
                       params={'json_request': json_request}).add(queue_name='longprocess-queue')

class longprocess_handler(webapp2.RequestHandler):
    def post(self):
        # This has a for loop that recursively uses data in dictionaries to
        # produce kml files every few minutes
        for j in range(0, Time):
            # Process data
            # Send message to client using a task queue to send the message.
            taskqueue.Task(url='/send/', params=params).add(queue_name=send_queue_name)

class send_handler(webapp2.RequestHandler):
    def post(self):
        # This sends the message to the client
        # This is currently not happening until the long-running process
        # finishes, but I would like it to occur immediately.
        pass

class kml_handler(webapp2.RequestHandler):
    def get(self, client_id):
        ## When the client receives the message, it picks up the data here.
        pass

app = webapp2.WSGIApplication([
    webapp2.Route(r'/', handler=MainPage),
    webapp2.Route(r'/Service/', handler=Service_handler),
    webapp2.Route(r'/_ah/channel/<connected>/', handler=connection_handler),
    webapp2.Route(r'/longprocess/', handler=longprocess_handler),
    webapp2.Route(r'/kml/<client_id>', handler=kml_handler),
    webapp2.Route(r'/send/', handler=send_handler),
], debug=True)
Do I need to break up the long-running process into tasks that send and retrieve results from the datastore in order to have the send_handler execute immediately, or am I missing something? Thanks
The App Engine development server only processes one request at a time. In production, these things will occur simultaneously. Try in production, and check that things behave as expected there.
There's also not much reason to use a separate task to send the channel messages in production - just send them directly from the main task.
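A minimal sketch of that suggestion, calling channel.send_message directly from the long-running task; it assumes the client_id is passed along in the task params, and produce_kml_url is a hypothetical stand-in for one iteration's work:

from google.appengine.api import channel

class longprocess_handler(webapp2.RequestHandler):
    def post(self):
        client_id = self.request.get('client_id')  # assumed task param
        for j in range(0, Time):
            data_url = produce_kml_url(j)  # hypothetical per-iteration work
            # Push the notification immediately; no second task queue needed
            channel.send_message(client_id, data_url)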
