Fetching data with Python's asyncio in a sequential order - python

I have a Python 2.7 program which pulls data from websites and dumps the results to a database. It follows the producer-consumer model and is written using the threading module.
Just for fun I would like to rewrite this program using the new asyncio module (from 3.4) but I cannot figure out how to do this properly.
The most crucial requirement is that the program must fetch data from the same website in sequential order. For example, for the URL 'http://a-restaurant.com' it should first get 'http://a-restaurant.com/menu/0', then 'http://a-restaurant.com/menu/1', then 'http://a-restaurant.com/menu/2', ...
If they are not fetched in order the website stops delivering pages altogether and you have to start from 0.
However, a fetch for another website ('http://another-restaurant.com') can (and should) run at the same time (the other sites also have the sequential restriction).
The threading module works well for this, as I can create a separate thread for each website and in each thread wait until one page has finished loading before fetching another one.
Here's a grossly simplified code snippet from the threading version (Python 2.7):
class FetchThread(threading.Thread):
    def __init__(self, queue, url):
        self.queue = queue
        self.baseurl = url
        ...

    def run(self):
        # Get 10 menu pages in a sequential order
        for food in range(10):
            url = self.baseurl + '/' + str(food)
            text = urllib2.urlopen(url).read()
            self.queue.put(text)
        ...

def main():
    queue = Queue.Queue()
    urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
    for url in urls:
        fetcher = FetchThread(queue, url)
        fetcher.start()
    ...
And here's how I tried to do it with asyncio (in 3.4.1):
import asyncio
import aiohttp

@asyncio.coroutine
def fetch(url):
    response = yield from aiohttp.request('GET', url)
    response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def print_page(url):
    page = yield from fetch(url)
    print(page)

loop = asyncio.get_event_loop()
l = []
urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
for url in urls:
    for food in range(10):
        menu_url = url + '/' + str(food)
        l.append(print_page(menu_url))
loop.run_until_complete(asyncio.wait(l))
And it fetches and prints everything in a non-sequential order. Well, I guess that's the whole idea of those coroutines. Should I not use aiohttp and just fetch with urllib? But do the fetches for the first restaurant then block the fetches for the other restaurants? Am I just thinking this completely wrong?
(This is just a test to try fetch things in a sequential order. Haven't got to the queue part yet.)

Your current code will work fine for the restaurant that doesn't care about sequential ordering of requests. All ten requests for the menu will run concurrently, and will print to stdout as soon as they're complete.
Obviously, this won't work for the restaurant that requires sequential requests. You need to refactor a bit for that to work:
@asyncio.coroutine
def fetch(url):
    response = yield from aiohttp.request('GET', url)
    response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def print_page(url):
    page = yield from fetch(url)
    print(page)

@asyncio.coroutine
def print_pages_sequential(url, num_pages):
    for food in range(num_pages):
        menu_url = url + '/' + str(food)
        yield from print_page(menu_url)

l = [print_pages_sequential('http://a-restaurant.com/menu', 10)]

conc_url = 'http://another-restaurant.com/menu'
for food in range(10):
    menu_url = conc_url + '/' + str(food)
    l.append(print_page(menu_url))

loop.run_until_complete(asyncio.wait(l))
Instead of adding all ten requests for the sequential restaurant to the list, we add one coroutine to the list which will iterate over all ten pages sequentially. The way this works is that yield from print_page will stop the execution of print_pages_sequential until the print_page request is complete, but it will do so without blocking any other coroutines that are running concurrently (like all the print_page calls you append to l).
By doing it this way, all of your "another-restaurant" requests can run completely concurrently, just like you want, and your "a-restaurant" requests will run sequentially, but without blocking any of the "another-restaurant" requests.
Edit:
If all the sites have the same sequential fetching requirement, the logic can be simplified more:
l = []
urls = ["http://a-restaurant.com/menu", "http://another-restaurant.com/menu"]
for url in urls:
    l.append(print_pages_sequential(url, 10))
loop.run_until_complete(asyncio.wait(l))

asyncio.Task is the replacement for threading.Thread in the asyncio world.
asyncio.async() also creates a new task.
asyncio.gather() is a very convenient way to wait for several coroutines; I prefer it to asyncio.wait().
@asyncio.coroutine
def fetch(url):
    response = yield from aiohttp.request('GET', url)
    response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def print_page(url):
    page = yield from fetch(url)
    print(page)

@asyncio.coroutine
def process_restaurant(url):
    for food in range(10):
        menu_url = url + '/' + str(food)
        yield from print_page(menu_url)

urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
coros = []
for url in urls:
    coros.append(asyncio.Task(process_restaurant(url)))

loop.run_until_complete(asyncio.gather(*coros))
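As a side note (not part of the original answer): on newer Python and aiohttp versions the @asyncio.coroutine / yield from style and the aiohttp.request() / read_and_close() calls used above no longer exist. A rough equivalent sketch of the same structure in modern syntax, assuming aiohttp 3.x and Python 3.7+:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def process_restaurant(session, base_url):
    # Pages of one restaurant are awaited one after another (sequential)
    for food in range(10):
        print(await fetch(session, base_url + '/' + str(food)))

async def main():
    urls = ('http://a-restaurant.com/menu', 'http://another-restaurant.com/menu')
    async with aiohttp.ClientSession() as session:
        # One task per restaurant; the per-restaurant tasks run concurrently
        await asyncio.gather(*(process_restaurant(session, url) for url in urls))

asyncio.run(main())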

Related

python looping fast through links

import requests
import json
from tqdm import tqdm

# list of links to loop through
links = ['https://www.google.com/', 'https://www.google.com/', 'https://www.google.com/']

# for loop over the links using requests
data = []
for link in tqdm(range(len(links))):
    response = requests.get(links[link])
    response = response.json()
    data.append(response)
The above for loop goes through the whole list of links, but it's too time consuming when I try it on around a thousand links. Any help?
The simplest way is to make it multithreaded. The best way is probably asynchronous.
Multithreaded solution:
import requests
from tqdm.contrib.concurrent import thread_map

links = ['https://www.google.com/', 'https://www.google.com/', 'https://www.google.com/']

def get_data(url):
    response = requests.get(url)
    response = response.json()  # Do note this might fail at times
    return response

data = thread_map(get_data, links)
Or without using tqdm.contrib.concurrent.thread_map:
import requests
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

links = ['https://www.google.com/', 'https://www.google.com/', 'https://www.google.com/']

def get_data(url):
    response = requests.get(url)
    response = response.json()  # Do note this might fail at times
    return response

executor = ThreadPoolExecutor()
data = list(tqdm(executor.map(get_data, links), total=len(links)))
As suggested in the comments, you can use asyncio and aiohttp.
import asyncio
import aiohttp

urls = ["your", "links", "here"]

# create aio connector
conn = aiohttp.TCPConnector(limit_per_host=100, limit=0, ttl_dns_cache=300)

# set number of parallel requests - if you are requesting different domains
# you are likely to be able to set this higher, otherwise you may be rate limited
PARALLEL_REQUESTS = 10

# Create results array to collect results
results = []

async def gather_with_concurrency(n):
    # Create semaphore for async i/o
    semaphore = asyncio.Semaphore(n)
    # create an aiohttp session using the previous connector
    session = aiohttp.ClientSession(connector=conn)

    # await logic for get request
    async def get(url):
        async with semaphore:
            async with session.get(url, ssl=False) as response:
                obj = await response.read()
                # once object is acquired we append to list
                results.append(obj)

    # wait for all requests to be gathered and then close session
    await asyncio.gather(*(get(url) for url in urls))
    await session.close()

# get async event loop
loop = asyncio.get_event_loop()
# run using number of parallel requests
loop.run_until_complete(gather_with_concurrency(PARALLEL_REQUESTS))
# Close connection
conn.close()

# loop through results and do something to them
for res in results:
    do_something(res)
I have tried to comment on the code as well as possible.
I have used BS4 to parse requests in this manner (in the do_something logic), but it will really depend on your use case.
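Purely for illustration (not part of the answer above), a do_something() along those lines might look like this; the title extraction is a hypothetical example of a BS4 parsing step:

from bs4 import BeautifulSoup

def do_something(raw_bytes):
    # Hypothetical parsing step: pull the <title> out of the raw HTML bytes
    soup = BeautifulSoup(raw_bytes, "html.parser")
    print(soup.title.string if soup.title else "no title found")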

Way to speed up fetching images using API and requests in python using Session()?

I am trying to retrieve 10 images through an API which returns JSON data by first making a request to the API and then storing 10 image URLs from the returned JSON data in a list. In my original iteration I made individual requests to those urls and saved the response content to file. My code is given below with my API key removed for obvious reasons:
def get_image(search_term):
    number_images = 10
    images = requests.get("https://pixabay.com/api/?key=insertkey&q={}&per_page={}".format(search_term, number_images))
    images_json_dict = images.json()
    hits = images_json_dict["hits"]
    urls = []
    for i in range(len(hits)):
        urls.append(hits[i]["webformatURL"])
    count = 0
    for url in urls:
        picture_request = requests.get(url)
        if picture_request.status_code == 200:
            try:
                with open(dir_path + r'\\images\\{}.jpg'.format(count), 'wb') as f:
                    f.write(picture_request.content)
            except:
                os.mkdir(dir_path + r'\\images\\')
                with open(dir_path + r'\\images\\{}.jpg'.format(count), 'wb') as f:
                    f.write(picture_request.content)
        count += 1
This was working fine apart from the fact that it was very slow. It took maybe 7 seconds to pull in those 10 images and save in a folder. I read here that it's possible to use Sessions() in the requests library to improve performance - I'd like to have those images as quickly as possible. I've modified the code as shown below however the problem I'm having is that the get request on the sessions object returns a requests.sessions.Session object rather than a response code and there is also no .content method to retrieve the content (I've added comments to the relevant lines of code below). I'm relatively new to programming so I'm uncertain if this is even the best way to do this. My question is how can I use sessions to retrieve back the image content now that I am using Session() or is there some smarter way to do this?
def get_image(search_term):
    number_images = 10
    images = requests.get("https://pixabay.com/api/?key=insertkey&q={}&per_page={}".format(search_term, number_images))
    images_json_dict = images.json()
    hits = images_json_dict["hits"]
    urls = []
    for i in range(len(hits)):
        urls.append(hits[i]["webformatURL"])
    count = 0
    # Now using Session()
    picture_request = requests.Session()
    for url in urls:
        picture_request.get(url)
        # This will no longer work as picture_request is an object
        if picture_request == 200:
            try:
                with open(dir_path + r'\\images\\{}.jpg'.format(count), 'wb') as f:
                    # This will no longer work as there is no .content method
                    f.write(picture_request.content)
            except:
                os.mkdir(dir_path + r'\\images\\')
                with open(dir_path + r'\\images\\{}.jpg'.format(count), 'wb') as f:
                    # This will no longer work as there is no .content method
                    f.write(picture_request.content)
        count += 1
Assuming you want to stick with the requests library, you will need to use threading to run multiple requests in parallel.
The concurrent.futures library has a convenient constructor for creating multiple threads: concurrent.futures.ThreadPoolExecutor.
fetch() is used for downloading images.
fetch_all() is used for creating the thread pool; you can select how many threads you want to run by passing the threads argument.
get_urls() is your function for retrieving the list of URLs. You should pass your token (key) and search_term.
Note: If your Python version is older than 3.6, you should replace the f-strings (f"{args}") with regular formatting ("{}".format(args)).
import os
import requests
from concurrent import futures

def fetch(url, session=None):
    if session:
        r = session.get(url, timeout=60.)
    else:
        r = requests.get(url, timeout=60.)
    r.raise_for_status()
    return r.content

def fetch_all(urls, session=requests.session(), threads=8):
    with futures.ThreadPoolExecutor(max_workers=threads) as executor:
        future_to_url = {executor.submit(fetch, url, session=session): url for url in urls}
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            if future.exception() is None:
                yield url, future.result()
            else:
                print(f"{url} generated an exception: {future.exception()}")
                yield url, None

def get_urls(search_term, number_images=10, token="", session=requests.session()):
    r = requests.get(f"https://pixabay.com/api/?key={token}&q={search_term}&per_page={number_images}")
    r.raise_for_status()
    urls = [hit["webformatURL"] for hit in r.json().get("hits", [])]
    return urls

if __name__ == "__main__":
    root_dir = os.getcwd()
    session = requests.session()
    urls = get_urls("term", token="token", session=session)
    for url, content in fetch_all(urls, session=session):
        if content is not None:
            f_dir = os.path.join(root_dir, "images")
            if not os.path.isdir(f_dir):
                os.makedirs(f_dir)
            with open(os.path.join(f_dir, os.path.basename(url)), "wb") as f:
                f.write(content)
I'd also recommend you look at aiohttp. I will not provide an example here, but will give you a link to an article on a similar task, where you can read more about it.
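As a side note on the original confusion: requests.Session().get() returns a requests.Response just like requests.get() does, so the direct (non-threaded) fix is simply to capture that return value. A minimal sketch, assuming the urls list and dir_path from the question and that the images directory already exists:

session = requests.Session()
count = 0
for url in urls:
    picture_response = session.get(url)  # a Response object, not the Session itself
    if picture_response.status_code == 200:
        with open(dir_path + r'\\images\\{}.jpg'.format(count), 'wb') as f:
            f.write(picture_response.content)  # .content works exactly as before
        count += 1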

Concurrent POST requests w/ attachment Python

I have a server which waits for a request containing a picture:
@app.route("/uploader_ios", methods=['POST'])
def upload_file_ios():
    imagefile = request.files['imagefile']
I can submit a post request quite easily using requests in python like so:
url = "<myserver>/uploader_ios"
files = {'imagefile': open(fname, 'rb')}
%time requests.post(url, files=files).json() # 2.77s
However, what I would like to do is submit 1,000 or perhaps 100,000 requests at the same time. I wanted to try to do this using asyncio because I have been able to use it for GET requests without a problem. However, I can't seem to create a valid POST request that the server accepts.
My attempt is below:
import aiohttp
import asyncio
import json

# Testing with small amount
concurrent = 2
url_list = ['<myserver>/uploader_ios'] * 10

def handle_req(data):
    return json.loads(data)['English']

def chunked_http_client(num_chunks, s):
    # Use semaphore to limit number of requests
    semaphore = asyncio.Semaphore(num_chunks)

    @asyncio.coroutine
    # Return co-routine that will work asynchronously and respect
    # locking of semaphore
    def http_get(url):
        nonlocal semaphore
        with (yield from semaphore):
            # Attach files
            files = aiohttp.FormData()
            files.add_field('imagefile', open(fname, 'rb'))
            response = yield from s.request('post', url, data=files)
            print(response)
            body = yield from response.content.read()
            yield from response.wait_for_close()
        return body
    return http_get

def run_experiment(urls, _session):
    http_client = chunked_http_client(num_chunks=concurrent, s=_session)
    # http_client returns futures, save all the futures to a list
    tasks = [http_client(url) for url in urls]
    dfs_route = []
    # wait for futures to be ready then iterate over them
    for future in asyncio.as_completed(tasks):
        data = yield from future
        try:
            out = handle_req(data)
            dfs_route.append(out)
        except Exception as err:
            print("Error {0}".format(err))
    return dfs_route

with aiohttp.ClientSession() as session:  # We create a persistent connection
    loop = asyncio.get_event_loop()
    calc_routes = loop.run_until_complete(run_experiment(url_list, session))
The issue is that the response I get is:
.../uploader_ios) [400 BAD REQUEST]>
I am assuming this is because I am not correctly attaching the image-file

write an asynchronous http client using twisted framework

I want to write an asynchronous HTTP client using the Twisted framework which fires 5 requests asynchronously/simultaneously to 5 different servers, then compares those responses and displays a result. Could someone please help with this?
For this situation I'd suggest using treq and DeferredList to aggregate the responses then fire a callback when all the URLs have been returned. Here is a quick example:
import treq
from twisted.internet import reactor, defer, task

def fetchURL(*urls):
    dList = []
    for url in urls:
        d = treq.get(url)
        d.addCallback(treq.content)
        dList.append(d)
    return defer.DeferredList(dList)

def compare(responses):
    # the responses are returned in a list of tuples
    # Ex: [(True, b'')]
    for status, content in responses:
        print(content)

def main(reactor):
    urls = [
        'http://swapi.co/api/films/schema',
        'http://swapi.co/api/people/schema',
        'http://swapi.co/api/planets/schema',
        'http://swapi.co/api/species/schema',
        'http://swapi.co/api/starships/schema',
    ]
    d = fetchURL(*urls)     # returns Deferred
    d.addCallback(compare)  # fire compare() once the URLs return w/ a response
    return d                # wait for the DeferredList to finish

task.react(main)
# usually you would run reactor.run() but react() takes care of that
In the main function, a list of URLs is passed into fetchURL(). There, each site will make an async request and return a Deferred that will be appended to a list. Then the final list will be used to create and return a DeferredList obj. Finally we add a callback (compare() in this case) to the DeferredList that will access each response. You would put your comparison logic in the compare() function.
You don't necessarily need Twisted to make asynchronous HTTP requests. You can use Python threads and the wonderful requests package.
from threading import Thread
import requests

def make_request(url, results):
    response = requests.get(url)
    results[url] = response

def main():
    results = {}
    threads = []
    for i in range(5):
        url = 'http://webpage/{}'.format(i)
        t = Thread(target=make_request, kwargs={'url': url, 'results': results})
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    print(results)

Faster Scraping of JSON from API: Asynchronous or?

I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
- I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request)
- the API places a limit of 1 request per second.
- I would like to accomplish this on a laptop with a 100MB/sec internet connection
Currently, I am accomplishing this (synchronously & too slowly) by:
- pre-computing all of the (encoded) URLs I want to scrape
- using Python 3's requests library to request each URL and save the resulting JSON one-by-one in separate .txt files.
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
# for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'w') as outfile:
        json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straight-forward way to accomplish this asynchronously but respecting the 1-request-per-second threshold? I have read about grequests (no longer maintained?), twisted, asyncio, etc but do not have enough experience to know whether/if one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hangup is, how can I do an overall rate-limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers 'wake up' and request at the same time. When I tested it, all workers grab a page and break the rate limit, then go to sleep simultaneously).
from datetime import datetime
from datetime import timedelta
from tornado import httpclient, gen, ioloop, queues

URLS = ["https://baconipsum.com/api/?type=meat",
        "https://baconipsum.com/api/?type=filler",
        "https://baconipsum.com/api/?type=meat-and-filler",
        "https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]
concurrency = 2

def handle_request(response):
    if response.code == 200:
        with open("FOO" + '.txt', "wb") as thisfile:  # fix filenames to avoid overwrite
            thisfile.write(response.body)

@gen.coroutine
def request_and_save_url(url):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
        print('fetched {0}'.format(url))
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])

@gen.coroutine
def main():
    q = queues.Queue()
    tstart = datetime.now()
    fetching, fetched = set(), set()

    @gen.coroutine
    def fetch_url(worker_id):
        current_url = yield q.get()
        try:
            if current_url in fetching:
                return
            # print('fetching {0}'.format(current_url))
            print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now() - tstart).seconds))
            fetching.add(current_url)
            yield request_and_save_url(current_url)
            fetched.add(current_url)
        finally:
            q.task_done()

    @gen.coroutine
    def worker(worker_id):
        while True:
            yield fetch_url(worker_id)

    # Fill a queue of URL's to scrape
    list = [q.put(url) for url in URLS]  # this does not make a list...it just puts all the URLS into the Queue
    # Start workers, then wait for the work Queue to be empty.
    for ii in range(concurrency):
        worker(ii)
    yield q.join(timeout=timedelta(seconds=300))
    assert fetching == fetched
    print('Done in {0} seconds, fetched {1} URLs.'.format(
        datetime.now() - tstart, len(fetched)))

if __name__ == '__main__':
    import logging
    logging.basicConfig()
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'wb') as outfile:
        outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
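For the larger 50MB chunks it may also be worth streaming the body straight to disk instead of holding it in memory; a small sketch of the same loop using requests' streaming mode (an addition, not part of the original answer):

curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs, stream=True)
if curr_url_request.ok:
    with open('json_output.txt', 'wb') as outfile:
        for chunk in curr_url_request.iter_content(chunk_size=1 << 20):  # 1 MiB at a time
            outfile.write(chunk)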
Tornado has a very powerful asynchronous client. Here's some basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado.gen
import tornado.ioloop

URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()

def handle_request(response):
    if response.code == 200:
        with open('json_output.txt', 'ab') as outfile:
            outfile.write(response.body)

@tornado.gen.coroutine
def queue_requests():
    results = []
    for url in URLS:
        nxt = tornado.gen.sleep(1)  # 1 request per second
        res = http_client.fetch(url, handle_request)
        results.append(res)
        yield nxt
    yield results  # wait for all requests to finish
    loop.add_callback(loop.stop)

loop.add_callback(queue_requests)
loop.start()
This is a straightforward approach that may lead to too many connections with the remote server. You may have to resolve that by using a sliding window while queuing the requests.
In case of request timeouts or specific headers required, feel free to read the doc
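Building on the sliding-window remark above, here is a rough sketch (an illustration, not part of the original answers) of one way to keep a single global 1-request-per-second budget across several workers, using tornado.locks (Tornado >= 4.2 assumed); the URLs, concurrency and delay values are placeholders:

from tornado import gen, httpclient, ioloop, locks, queues

URLS = ["https://baconipsum.com/api/?type=meat",   # placeholder URLs
        "https://baconipsum.com/api/?type=filler"]
CONCURRENCY = 2       # sliding window: max requests in flight at once
RATE_DELAY = 1.0      # minimum seconds between request starts, shared globally

@gen.coroutine
def main():
    q = queues.Queue()
    window = locks.Semaphore(CONCURRENCY)   # limits in-flight requests
    rate_lock = locks.Lock()                # serialises the once-per-second pacing

    for url in URLS:
        q.put(url)

    @gen.coroutine
    def worker():
        while True:
            url = yield q.get()
            try:
                with (yield window.acquire()):
                    # Every request start first holds rate_lock for RATE_DELAY
                    # seconds, so starts are spaced out globally no matter how
                    # many workers are running.
                    with (yield rate_lock.acquire()):
                        yield gen.sleep(RATE_DELAY)
                    try:
                        response = yield httpclient.AsyncHTTPClient().fetch(url)
                        print(url, response.code)
                    except Exception as e:
                        print(url, 'failed:', e)
            finally:
                q.task_done()

    for _ in range(CONCURRENCY):
        worker()   # workers are simply abandoned once the queue drains
    yield q.join()

ioloop.IOLoop.current().run_sync(main)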
