Hitting multiple APIs at once with Tornado and Python

I'm trying to make an API that collects responses from several other APIs and combines the results into one response. I want to send the GET requests asynchronously so that it runs faster, but even though I'm using coroutines and yielding, my code still seems to be making each request one at a time. I'm wondering if it's because I'm using the requests library instead of Tornado's AsyncHTTPClient, because I'm calling self.path_get inside a loop, or because I'm storing results in an instance variable?
The APIs I'm hitting return arrays of JSON objects, and I want to combine them all into one array and write that to the response.
from tornado import gen, ioloop, web
from tornado.gen import Return
import requests

PATHS = [
    "http://firsturl",
    "http://secondurl",
    "http://thirdurl",
]

class MyApi(web.RequestHandler):
    @gen.coroutine
    def get(self):
        self.results = []
        for path in PATHS:
            x = yield self.path_get(path)

        self.write({
            "results": self.results,
        })

    @gen.coroutine
    def path_get(self, path):
        resp = yield requests.get(path)
        self.results += resp.json()["results"]
        raise Return(resp)

ROUTES = [
    (r"/search", MyApi),
]

def run():
    app = web.Application(
        ROUTES,
        debug=True,
    )
    app.listen(8000)
    ioloop.IOLoop.current().start()

if __name__ == "__main__":
    run()

There are several reasons why your code doesn't work. To begin with, requests is blocking: it ties up the event loop and doesn't let anything else execute, so replace it with AsyncHTTPClient.fetch. Also, yielding each request inside the loop makes the requests run sequentially rather than concurrently as you intended. Here's an example of how your code could be restructured:
import json
from tornado import gen, httpclient, ioloop, web

# ...

class MyApi(web.RequestHandler):
    @gen.coroutine
    def get(self):
        futures_list = []
        for path in PATHS:
            futures_list.append(self.path_get(path))
        # yielding the list of futures waits for all of them to resolve concurrently
        yield futures_list
        result = json.dumps({'results': [x.result() for x in futures_list]})
        self.write(result)

    @gen.coroutine
    def path_get(self, path):
        request = httpclient.AsyncHTTPClient()
        resp = yield request.fetch(path)
        result = json.loads(resp.body.decode('utf-8'))
        raise gen.Return(result)
What's happening is that we build a list of the Futures returned by the gen.coroutine functions and yield the entire list, which waits until all of the requests are complete. Once they are, futures_list is iterated over and the results are used to build a new list, which is then serialized into a JSON object.
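If you also want the single flat array from your original code (each API returns an object with a "results" key), you can merge the responses after the yield instead of storing them on the handler. A minimal variation on the get() method above, assuming each API's payload really does contain a "results" list:
    @gen.coroutine
    def get(self):
        futures_list = [self.path_get(path) for path in PATHS]
        responses = yield futures_list  # resolves to the parsed JSON dicts, in order
        combined = []
        for resp in responses:
            # assumes each response looks like {"results": [...]} as in the question
            combined.extend(resp.get("results", []))
        self.write({"results": combined})  # self.write() serializes a dict to JSON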

Related

Python multiprocess Pool vs Process

I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?
I have thousands of HTTP GET requests to send. After sending each one and getting the response, I want to store the response (a simple int) in a (shared) dict. My final goal is to write all the data in the dict to a file.
This is not CPU intensive at all. My only goal is to speed up sending the HTTP GET requests, because there are too many of them. The requests are all isolated and do not depend on each other.
Shall I use Pool or Process in this case?
Thanks!
----The code below is added on 8/28---
I programmed this with multiprocessing. The key challenges I'm facing are:
1) A GET request can fail sometimes. I have to set 3 retries to minimize the need to rerun my code/all requests. I only want to retry the failed ones. Can I achieve this with async HTTP requests without using Pool?
2) I want to check the response value of every request, and have exception handling
The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!
import time
import requests
from multiprocessing import Pool

def get_data(endpoint, get_params):
    response = requests.get(endpoint, params=get_params)
    if response.status_code != 200:
        raise Exception("bad response for " + str(get_params))
    return response.json()

def get_currency_data(endpoint, currency, date):
    get_params = {'currency': currency,
                  'date': date
                  }
    for attempt in range(3):
        try:
            output = get_data(endpoint, get_params)
            # additional return value check
            # ......
            return output['value']
        except:
            time.sleep(1)  # I found that sleeping for 1s almost always makes the retry succeed
    return 'error'

def get_all_data(currencies, dates):
    # I have many dates, but not too many currencies
    for currency in currencies:
        results = []
        pool = Pool(processes=20)
        for date in dates:
            results.append(pool.apply_async(get_currency_data, args=(endpoint, currency, date)))
        output = [p.get() for p in results]
        pool.close()
        pool.join()
        time.sleep(10)  # Unfortunately I have to give the server some time to rest. I found it helps to reduce failures. I didn't write the server; this is not something that I can control.
Neither. Use asynchronous programming instead. Consider the code below, pulled directly from an article on aiohttp (credit goes to Paweł Miech):
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://localhost:8080/{}"
    tasks = []

    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)
To adapt it, build an array of your URLs and, instead of the hard-coded localhost URL above, loop over that array and pass each entry to fetch, as sketched below.
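For instance, a minimal adaptation might look like this, reusing the fetch coroutine and imports from the snippet above (my_urls is a placeholder for your own list):
async def run_all(urls):
    async with ClientSession() as session:
        # schedule one fetch per URL and wait for all of them together
        tasks = [asyncio.ensure_future(fetch(u, session)) for u in urls]
        return await asyncio.gather(*tasks)

my_urls = ['http://google.com', 'https://stackoverflow.com']
responses = asyncio.get_event_loop().run_until_complete(run_all(my_urls))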
EDIT: Use requests_futures
As per @roganjosh's comment below, requests_futures is a super-easy way to accomplish this.
from requests_futures.sessions import FuturesSession

sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}
contents = {url: future.result().content
            for url, future in responses.items()
            if future.result().status_code == 200}
EDIT: Use grequests to support Python 2.7
You can also use grequests, which supports Python 2.7 for performing asynchronous URL calls.
import grequests

urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in responses])
# [10475, 250785]
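grequests also lets you pass an exception_handler to map(), which is handy if some calls can fail (as in your retry question). A small sketch; the handler name is just an example:
import grequests

def on_error(request, exception):
    # failed requests show up as None in the results list
    print('failed: {0} ({1})'.format(request.url, exception))

urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map((grequests.get(u) for u in urls),
                          exception_handler=on_error)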
EDIT: Using multiprocessing
If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.
It's actually pretty straightforward: you map the URLs through the HTTP GET function:
import requests
from multiprocessing import Pool

urls = ['http://google.com', 'http://stackoverflow.com']

pool = Pool(8)
responses = pool.map(requests.get, urls)
The size of the pool will be the number of simultaneously issued GET requests. Sizing it up should increase your network throughput, but it adds overhead on the local machine for communication and forking.
Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.
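On the retry point specifically: with the async approach you don't need a pool to retry only the failed calls; you can wrap the fetch in a small retry loop. A rough sketch mirroring the 3-attempt / 1-second-sleep logic from your code (the function names are illustrative):
import asyncio
from aiohttp import ClientSession

async def fetch_with_retries(url, session, retries=3):
    for attempt in range(retries):
        try:
            async with session.get(url) as response:
                if response.status != 200:
                    raise Exception('bad response for ' + url)
                return await response.json()
        except Exception:
            await asyncio.sleep(1)  # brief pause before retrying, as in the original code
    return 'error'

async def run_all(urls):
    async with ClientSession() as session:
        return await asyncio.gather(*(fetch_with_retries(u, session) for u in urls))

# results = asyncio.get_event_loop().run_until_complete(run_all(my_urls))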

Async query database for keys to use in multiple requests

I want to asynchronously query a database for keys, then make requests to several urls for each key.
I have a function that returns a Deferred from the database whose value is the key for several requests. Ideally, I would call this function and return a generator of Deferreds from start_requests.
@inlineCallbacks
def get_request_deferred(self):
    d = yield engine.execute(select([table]))  # async
    d.addCallback(make_url)
    d.addCallback(Request)
    return d

def start_requests(self):
    ????
But attempting this in several ways raises
builtins.AttributeError: 'Deferred' object has no attribute 'dont_filter'
which I take to mean that start_requests must return Request objects, not Deferreds whose values are Request objects. The same seems to be true of spider middleware's process_start_requests().
Alternatively, I can make initial requests to, say, http://localhost/ and change them to the real url once the key is available from the database through downloader middleware's process_request(). However, process_request only returns a Request object; it cannot yield Requests to multiple pages using the key: attempting yield Request(url) raises
AssertionError: Middleware myDownloaderMiddleware.process_request
must return None, Response or Request, got generator
What is the cleanest solution to:
- get the keys asynchronously from the database
- for each key, generate several requests?
You've provided no use case for async database queries to be a necessity. I'm assuming you cannot begin to scrape your URLs unless you query the database first? If that's the case, then you're better off just doing the query synchronously, iterating over the query results, extracting what you need, and then yielding Request objects. It makes little sense to query a db asynchronously and then just sit around waiting for the query to finish.
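A minimal sketch of that synchronous approach, assuming engine, table and make_url are the same objects from your snippet, but with engine being a plain blocking SQLAlchemy engine rather than the async one:
import scrapy
from sqlalchemy import select

class KeysSpider(scrapy.Spider):
    name = 'keys_spider'  # illustrative name

    def start_requests(self):
        # blocking query: runs once before any scraping starts
        result = engine.execute(select([table]))
        for row in result:
            yield scrapy.Request(url=make_url(row), callback=self.parse)

    def parse(self, response):
        pass  # your parsing logic here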
Alternatively, you can let the callback for the Deferred object pass the urls to a generator of some sort. The generator will then convert any received urls into scrapy Request objects and yield them. Below is an example using the code you linked (not tested):
import scrapy
from Queue import Queue
from pdb import set_trace as st
from twisted.internet.defer import Deferred, inlineCallbacks


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self):
        self.urls = Queue()
        self.stop = False
        self.requests = self.request_generator()
        self.deferred = self.deferred_generator()

    def deferred_generator(self):
        d = Deferred()
        d.addCallback(self.deferred_callback)
        yield d

    def request_generator(self):
        while not self.stop:
            url = self.urls.get()
            yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        return self.requests.next()

    def parse(self, response):
        st()
        # when you need to parse the next url from the callback
        yield self.requests.next()

    def deferred_callback(self, url):
        self.urls.put(url)
        if no_more_urls():  # placeholder for your own termination check
            self.stop = True
Don't forget to stop the request generator when you're done.

write an asynchronous http client using twisted framework

I want to write an asynchronous HTTP client using the Twisted framework that fires 5 requests asynchronously/simultaneously to 5 different servers, then compares those responses and displays a result. Could someone please help with this?
For this situation I'd suggest using treq and DeferredList to aggregate the responses then fire a callback when all the URLs have been returned. Here is a quick example:
import treq
from twisted.internet import reactor, defer, task

def fetchURL(*urls):
    dList = []
    for url in urls:
        d = treq.get(url)
        d.addCallback(treq.content)
        dList.append(d)
    return defer.DeferredList(dList)

def compare(responses):
    # the responses are returned in a list of tuples
    # Ex: [(True, b'')]
    for status, content in responses:
        print(content)

def main(reactor):
    urls = [
        'http://swapi.co/api/films/schema',
        'http://swapi.co/api/people/schema',
        'http://swapi.co/api/planets/schema',
        'http://swapi.co/api/species/schema',
        'http://swapi.co/api/starships/schema',
    ]
    d = fetchURL(*urls)     # returns Deferred
    d.addCallback(compare)  # fire compare() once the URLs return w/ a response
    return d                # wait for the DeferredList to finish

task.react(main)
# usually you would run reactor.run() but react() takes care of that
In the main function, a list of URLs is passed into fetchURL(). There, an async request is made to each site and a Deferred is returned and appended to a list. The final list is then used to create and return a DeferredList. Finally, we add a callback (compare() in this case) to the DeferredList that will have access to each response. You would put your comparison logic in the compare() function.
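If some of the requests might fail, a small variation is to build the DeferredList with consumeErrors=True and check the status flag in each (status, value) tuple, so one bad URL doesn't take down the whole comparison (a sketch, not tested):
import treq
from twisted.internet import defer

def fetchURL(*urls):
    dList = [treq.get(url).addCallback(treq.content) for url in urls]
    # consumeErrors=True: a failed request yields (False, Failure) instead of raising
    return defer.DeferredList(dList, consumeErrors=True)

def compare(responses):
    for ok, content_or_failure in responses:
        if ok:
            print(content_or_failure)
        else:
            print('request failed:', content_or_failure)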
You don't necessarily need Twisted to make asynchronous HTTP requests. You can use Python threads and the wonderful requests package.
from threading import Thread

import requests

def make_request(url, results):
    response = requests.get(url)
    results[url] = response

def main():
    results = {}
    threads = []
    for i in range(5):
        url = 'http://webpage/{}'.format(i)
        t = Thread(target=make_request, kwargs={'url': url, 'results': results})
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

    print(results)
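If you'd rather not manage the threads and the shared dict by hand, the standard library's concurrent.futures gives the same effect in a few lines. This is a sketch of an alternative, not part of the original answer; the URLs are placeholders:
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ['http://webpage/{0}'.format(i) for i in range(5)]

with ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map keeps results in the same order as the input urls
    results = dict(zip(urls, executor.map(requests.get, urls)))

print(results)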

Faster Scraping of JSON from API: Asynchronous or?

I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
- I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request).
- The API places a limit of 1 request per second.
- I would like to accomplish this on a laptop and a 100MB/sec internet connection.
Currently, I am accomplishing this (synchronously & too slowly) by:
-pre-computing all of the (encoded) URLs I want to scrape
-using Python 3's requests library to request each URL and save the resulting JSON one-by-one in separate .txt files.
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
# for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'w') as outfile:
        json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straightforward way to accomplish this asynchronously while respecting the 1-request-per-second threshold? I have read about grequests (no longer maintained?), twisted, asyncio, etc., but do not have enough experience to know whether one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hang-up is: how can I enforce an overall rate limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers 'wake up' and request at the same time. When I tested it, all workers grab a page and break the rate limit, then go to sleep simultaneously.)
from datetime import datetime
from datetime import timedelta

from tornado import httpclient, gen, ioloop, queues

URLS = ["https://baconipsum.com/api/?type=meat",
        "https://baconipsum.com/api/?type=filler",
        "https://baconipsum.com/api/?type=meat-and-filler",
        "https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]

concurrency = 2

def handle_request(response):
    if response.code == 200:
        with open("FOO" + '.txt', "wb") as thisfile:  # fix filenames to avoid overwrite
            thisfile.write(response.body)

@gen.coroutine
def request_and_save_url(url):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
        print('fetched {0}'.format(url))
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])

@gen.coroutine
def main():
    q = queues.Queue()
    tstart = datetime.now()
    fetching, fetched = set(), set()

    @gen.coroutine
    def fetch_url(worker_id):
        current_url = yield q.get()
        try:
            if current_url in fetching:
                return
            # print('fetching {0}'.format(current_url))
            print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now() - tstart).seconds))
            fetching.add(current_url)
            yield request_and_save_url(current_url)
            fetched.add(current_url)
        finally:
            q.task_done()

    @gen.coroutine
    def worker(worker_id):
        while True:
            yield fetch_url(worker_id)

    # Fill a queue of URLs to scrape
    list = [q.put(url) for url in URLS]  # this does not make a list...it just puts all the URLS into the Queue

    # Start workers, then wait for the work Queue to be empty.
    for ii in range(concurrency):
        worker(ii)
    yield q.join(timeout=timedelta(seconds=300))
    assert fetching == fetched
    print('Done in {0} seconds, fetched {1} URLs.'.format(
        datetime.now() - tstart, len(fetched)))

if __name__ == '__main__':
    import logging
    logging.basicConfig()
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'wb') as outfile:  # 'wb' because .content is bytes
        outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
Tornado has a very powerful asynchronous client. Here's some basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado.gen
import tornado.ioloop

URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()

def handle_request(response):
    if response.code == 200:
        with open('json_output.txt', 'ab') as outfile:
            outfile.write(response.body)

@tornado.gen.coroutine
def queue_requests():
    results = []
    for url in URLS:
        nxt = tornado.gen.sleep(1)  # 1 request per second
        res = http_client.fetch(url, handle_request)
        results.append(res)
        yield nxt
    yield results  # wait for all requests to finish
    loop.add_callback(loop.stop)

loop.add_callback(queue_requests)
loop.start()
This is a straightforward approach that may lead to too many open connections with the remote server; you may have to resolve that with a sliding window while queuing the requests.
In case of request timeouts or if specific headers are required, feel free to read the docs.
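On the follow-up about a global one-request-per-second limit across several workers: one option is to guard every fetch with a shared lock plus a timestamp, so that no matter how many workers there are, only one request can be started per second. A rough sketch using tornado.locks (the structure and names are assumptions, not tested against your code):
from tornado import gen, locks, ioloop, httpclient

rate_lock = locks.Lock()
last_request = [None]  # shared timestamp (a list so it can be mutated in place)

@gen.coroutine
def rate_limited_fetch(url):
    with (yield rate_lock.acquire()):
        now = ioloop.IOLoop.current().time()
        if last_request[0] is not None:
            wait = 1.0 - (now - last_request[0])
            if wait > 0:
                yield gen.sleep(wait)  # pad out to a full second since the last request
        last_request[0] = ioloop.IOLoop.current().time()
    # the lock is released here, so fetches can overlap but never start
    # less than one second apart
    response = yield httpclient.AsyncHTTPClient().fetch(url)
    raise gen.Return(response)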

Fetching data with Python's asyncio in a sequential order

I have a Python 2.7 program which pulls data from websites and dumps the results to a database. It follows the consumer producer model and is written using the threading module.
Just for fun I would like to rewrite this program using the new asyncio module (from 3.4) but I cannot figure out how to do this properly.
The most crucial requirement is that the program must fetch data from the same website in a sequential order. For example, for a URL 'http://a-restaurant.com' it should first get 'http://a-restaurant.com/menu/0', then 'http://a-restaurant.com/menu/1', then 'http://a-restaurant.com/menu/2', ...
If they are not fetched in order, the website stops delivering pages altogether and you have to start from 0.
However, another fetch for another website ('http://another-restaurant.com') can (and should) run at the same time (the other sites also have the sequential restriction).
The threading module suits this well, as I can create a separate thread for each website, and in each thread it can wait until one page has finished loading before fetching the next one.
Here's a grossly simplified code snippet from the threading version (Python 2.7):
class FetchThread(threading.Thread):
    def __init__(self, queue, url):
        self.queue = queue
        self.baseurl = url
        ...

    def run(self):
        # Get 10 menu pages in a sequential order
        for food in range(10):
            url = self.baseurl + '/' + str(food)
            text = urllib2.urlopen(url).read()
            self.queue.put(text)
        ...

def main():
    queue = Queue.Queue()
    urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
    for url in urls:
        fetcher = FetchThread(queue, url)
        fetcher.start()
    ...
And here's how I tried to do it with asyncio (in 3.4.1):
import asyncio
import aiohttp

@asyncio.coroutine
def fetch(url):
    response = yield from aiohttp.request('GET', url)
    response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def print_page(url):
    page = yield from fetch(url)
    print(page)

loop = asyncio.get_event_loop()

l = []
urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
for url in urls:
    for food in range(10):
        menu_url = url + '/' + str(food)
        l.append(print_page(menu_url))

loop.run_until_complete(asyncio.wait(l))
And it fetches and prints everything in a non-sequential order. Well, I guess that's the whole idea of those coroutines. Should I not use aiohttp and just fetch with urllib? But then wouldn't the fetches for the first restaurant block the fetches for the other restaurants? Am I just thinking about this completely wrong?
(This is just a test to try to fetch things in a sequential order; I haven't got to the queue part yet.)
Your current code will work fine for the restaurant that doesn't care about sequential ordering of requests. All ten requests for the menu will run concurrently, and will print to stdout as soon as they're complete.
Obviously, this won't work for the restaurant that requires sequential requests. You need to refactor a bit for that to work:
import asyncio
import aiohttp

@asyncio.coroutine
def fetch(url):
    response = yield from aiohttp.request('GET', url)
    response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def print_page(url):
    page = yield from fetch(url)
    print(page)

@asyncio.coroutine
def print_pages_sequential(url, num_pages):
    for food in range(num_pages):
        menu_url = url + '/' + str(food)
        yield from print_page(menu_url)

loop = asyncio.get_event_loop()

l = [print_pages_sequential('http://a-restaurant.com/menu', 10)]

conc_url = 'http://another-restaurant.com/menu'
for food in range(10):
    menu_url = conc_url + '/' + str(food)
    l.append(print_page(menu_url))

loop.run_until_complete(asyncio.wait(l))
Instead of adding all ten requests for the sequential restaurant to the list, we add one coroutine to the list which will iterate over all ten pages sequentially. The way this works is that yield from print_page will stop the execution of print_pages_sequential until the print_page request is complete, but it will do so without blocking any other coroutines that are running concurrently (like all the print_page calls you append to l).
By doing it this way, all of your "another-restaurant" requests can run completely concurrently, just like you want, and your "a-restaurant" requests will run sequentially, but without blocking any of the "another-restaurant" requests.
Edit:
If all the sites have the same sequential fetching requirement, the logic can be simplified more:
l = []
urls = ["http://a-restaurant.com/menu", "http://another-restaurant.com/menu"]
for url in urls:
    l.append(print_pages_sequential(url, 10))

loop.run_until_complete(asyncio.wait(l))
asyncio.Task is the replacement for threading.Thread in the asyncio world.
asyncio.async also creates new tasks.
asyncio.gather is a very convenient way to wait for several coroutines; I prefer it to asyncio.wait.
import asyncio
import aiohttp

@asyncio.coroutine
def fetch(url):
    response = yield from aiohttp.request('GET', url)
    response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def print_page(url):
    page = yield from fetch(url)
    print(page)

@asyncio.coroutine
def process_restaurant(url):
    for food in range(10):
        menu_url = url + '/' + str(food)
        yield from print_page(menu_url)

loop = asyncio.get_event_loop()

urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
coros = []
for url in urls:
    coros.append(asyncio.Task(process_restaurant(url)))

loop.run_until_complete(asyncio.gather(*coros))
