Multithreading / multiprocessing with a Python loop

I have a script that loops through a range of URLs to pull item locations based on returned JSON data. However, the script takes about 60 minutes to run, and 55 minutes of that (per cProfile) is spent waiting for JSON data to load.
I would like to multithread to run multiple POST requests at a time to speed this up, and have initially split the URL range into two halves to do this. Where I am getting stuck is how to implement multithreading or asyncio.
Slimmed down code:
import asyncio
import aiohttp

# Using globals is not recommended; kept here to match the original snippet.
results = dict()

url = "https://www.website.com/store/ajax/search"
query = "store={}&size=18&query=17360031"

# Default request coroutine adapted from the aiohttp documentation.
async def open_url(store, loop=None):
    async with aiohttp.ClientSession(loop=loop) as session:
        async with session.post(url, data={'searchQuery': query.format(store)}) as resp:
            return await resp.json(), store

async def processing(loop=None):
    # The 'global' keyword is needed to write to the module-level variable.
    global results
    # One of the simplest ways to parallelize requests: schedule the coroutines
    # and store each result in the global dict as it completes.
    tasks = [open_url(store, loop=loop) for store in range(0, 5)]
    for coro in asyncio.as_completed(tasks, loop=loop):
        try:
            data, store = await coro
            results[store] = data['searchResults']['results'][0]['location']['aisle']
        except (IndexError, KeyError):
            continue

if __name__ == '__main__':
    event_loop = asyncio.new_event_loop()
    event_loop.run_until_complete(processing(loop=event_loop))

    # Print results
    for store, data in results.items():
        print(store, data)
Returned JSON:
{u'count': 1,
 u'results': [{u'department': {u'name': u'Home', u'storeDeptId': -1},
               u'location': {u'aisle': [A], u'detailed': [A.536]},
               u'score': u'0.507073'}],
 u'totalCount': 1}

Even if you use multithreading or multiprocessing, each thread/process will still block until the JSON data is retrieved. This could speed things up a little, but it's still not your best choice.
Since you're using requests, try grequests, which combines requests with gevent. It lets you define a series of HTTP requests to run asynchronously, so you'll get a huge speed boost. The usage is very simple: just create a list of requests (using grequests.get or grequests.post) and pass it to grequests.map. A minimal sketch is below.
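For example, here is a minimal sketch of that idea, reusing the endpoint and query format from the question and using grequests.post because the original code sends POST requests (grequests must be installed separately):
import grequests

url = "https://www.website.com/store/ajax/search"
query = "store={}&size=18&query=17360031"

# Build the requests up front; grequests/gevent sends them concurrently.
reqs = [grequests.post(url, data={'searchQuery': query.format(store)})
        for store in range(0, 5)]

# map() returns responses in the same order; failed requests come back as None.
for store, response in zip(range(0, 5), grequests.map(reqs)):
    if response is not None:
        print(store, response.json())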
Hope this helps!

If you want to parallelize the requests (I assume that is what you're asking), this code snippet should help.
It defines a request opener and sends 2000 POST requests via aiohttp and asyncio.
Python 3.5 was used.
import asyncio
import aiohttp

# Using globals is not recommended; kept to mirror the question's structure.
results = dict()

MAX_RETRIES = 5
MATCH_SLEEP_TIME = 3  # consider moving these constants to a separate module, e.g. constants.py

url = "https://www.website.com/store/ajax/search"
query = "store={}&size=18&query=44159"

# Default request coroutine adapted from the aiohttp documentation.
async def open_url(store, semaphore, loop=None):
    for _ in range(MAX_RETRIES):
        async with semaphore:
            try:
                async with aiohttp.ClientSession(loop=loop) as session:
                    async with session.post(url, data={'searchQuery': query.format(store)}) as resp:
                        return await resp.json(), store
            except ConnectionResetError:
                # More exceptions can be handled here; sleep before retrying.
                await asyncio.sleep(MATCH_SLEEP_TIME, loop=loop)
                continue
    return None

async def processing(semaphore, loop=None):
    # The 'global' keyword is needed to write to the module-level variable.
    global results
    # Schedule all coroutines and store each result as it completes.
    tasks = [open_url(store, semaphore, loop=loop) for store in range(0, 2000)]
    for coro in asyncio.as_completed(tasks, loop=loop):
        try:
            response = await coro
            if response is None:
                continue
            data, store = response
            results[store] = data['searchResults']['results'][0]['location']['aisle']
        except (IndexError, KeyError):
            continue

if __name__ == '__main__':
    event_loop = asyncio.new_event_loop()
    semaphore = asyncio.Semaphore(50, loop=event_loop)  # cap on concurrent requests
    event_loop.run_until_complete(processing(semaphore, loop=event_loop))

Related

Combine several POST requests into one, transform the batch and return answers for these POST requests

I'm not experienced working with requests, but my current project requires it. Right now my server works like this:
from aiohttp import web

routes = web.RouteTableDef()

@routes.post('/')
async def my_func(request):
    post = await request.json()
    answer = '... do something on GPU ...'
    return web.json_response(answer)
But I want to combine several requests into one, run my function on the GPU only once, and after that return responses for all the requests (possibly in a loop). I can change aiohttp to a different package if that is necessary to solve this.
For example, a POST request contains the fields {'id':1, 'data':'some data 1'}.
(1) I want to wait for 5 requests and combine their data into a list ['some data 1', ..,'some data 5'],
(2) then apply my function to this list (it returns a list of answers ['answer 1', ..,'answer 5']),
(3) and after that build a response for each request, like {'id':1, 'answers':'answer_1'}.
I don't know how to implement steps (1) and (3).
You can keep a cache of requests (questions) and responses (answers) and a background task that checks said cache; when the questions cache length reaches 5 you run the GPU function and populate the answers cache.
Each request waits until the answers cache has the data it needs.
server.py
import asyncio
from aiohttp import web

def gpu_func(items):
    """Send a batch of 5 questions to GPU and return the answers"""
    answers = {}
    for item in items:
        answers[item["id"]] = "answer for data: " + item["data"]
    return answers

async def gpu_loop(app):
    """Check the questions cache continuously; when we have 5 questions, process them and populate the answers cache"""
    while True:
        if len(app.cache["questions"]) >= 5:
            print("running GPU function")
            answers = gpu_func(app.cache["questions"][:5])
            print("got %d answers from GPU" % len(answers))
            app.cache["answers"].update(answers)
            app.cache["questions"] = app.cache["questions"][5:]
        await asyncio.sleep(0.05)  # sleep for 50ms

async def handle(request):
    """Main request handler: populate the questions cache and wait for the answer to appear in the answers cache"""
    data = await request.post()
    print("got request with data ", data)
    request.app.cache["questions"].append(data)
    # a time limit could be implemented here using a counter (sleep_delay * counter = max time per request)
    while True:
        if data["id"] in request.app.cache["answers"]:
            break
        await asyncio.sleep(0.05)
    answer = request.app.cache["answers"].pop(data["id"], "unknown")
    return web.Response(text=answer)

# create background task (gpu_loop)
async def start_background_tasks(app):
    app.gpu_loop = asyncio.create_task(gpu_loop(app))

# stop background task on shutdown
async def cleanup_background_tasks(app):
    app.gpu_loop.cancel()
    await app.gpu_loop

def main():
    app = web.Application()
    app.cache = {"questions": [], "answers": {}}
    app.add_routes([web.post("/", handle)])
    app.on_startup.append(start_background_tasks)
    app.on_cleanup.append(cleanup_background_tasks)
    web.run_app(app)

if __name__ == "__main__":
    main()
client.py
import aiohttp
import asyncio

async def make_request(session, num):
    """Make a single request using the existing session object and a custom number"""
    url = "http://127.0.0.1:8080"
    data = {"id": num, "data": "question %d" % num}
    response = await session.post(url, data=data)
    text = await response.text()
    return text

async def main():
    """Make 20 consecutive requests with a delay of 20 ms between them"""
    tasks = []
    session = aiohttp.ClientSession()
    for i in range(20):
        print("making request %d" % i)
        task = asyncio.ensure_future(make_request(session, i))
        tasks.append(task)
        await asyncio.sleep(0.02)
    responses = await asyncio.gather(*tasks)
    for response in responses:
        print(response)
    await session.close()

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Depending on concurrency (how often the requests are being made from the client side) and processing time (how long it takes to process a group of requests) you might need to tweak the timing (sleep) and cache limit (how many requests to keep in the cache).
Hope this gets you started.

asyncio run_until_complete does not wait until all coroutines finish

I am taking my first steps in Python and I am struggling to understand why I do not get the expected result with this one. Here is what I am trying to achieve:
I have a function that consumes an API. While waiting for the API to answer, and given that I am going through a proxy that adds extra lag, I thought that sending concurrent requests would speed up the process (I run 100 concurrent requests). It does. But asyncio's run_until_complete always returns some unfinished coroutines.
Here the code (simplified):
import aiohttp
import asyncio

async def consume_api(parameter):
    url = "someurl"  # it is actually built from the parameter
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, proxy="someproxy") as asyncresponse:
                r = await asyncresponse.read()
    except:
        global error_count
        error_count += 1
        if error_count > 50:
            return "Exceeded 50 tries on the same request"
        else:
            return consume_api(parameter)
    return r.decode("utf-8")

def loop_on_api(list_of_parameter):
    loop = asyncio.get_event_loop()
    coroutines = [consume_api(list_of_parameter[i]) for i in range(len(list_of_parameter))]
    results = loop.run_until_complete(asyncio.gather(*coroutines))
    return results
def loop_on_api(list_of_parameter):
loop = asyncio.get_event_loop()
coroutines = [consume_api(list_of_parameter[i]) for i in range(len(list_of_parameter))]
results = loop.run_until_complete(asyncio.gather(*coroutines))
return results
When I run the debugger, the results returned by the loop_on_api function include a list of strings corresponding to the results of consume_api and some occurrences of <coroutine object consume_api at 0x00...>. Those objects have cr_running set to False and a cr_frame.
Though if I check the coroutines variable, I can find all 100 coroutines, but none seems to have a cr_frame.
Any idea what I am doing wrong?
I'm also thinking my way of counting the 50 errors will be shared by all coroutines.
Any idea how I can make it specific to each request?
This should work; you can add/change/refactor whatever you want:
import aiohttp
import asyncio

async def consume_api(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

def loop_on_api(list_of_urls):
    loop = asyncio.get_event_loop()
    coroutines = [consume_api(url) for url in list_of_urls]
    results = loop.run_until_complete(asyncio.gather(*coroutines))
    return results

if __name__ == '__main__':
    print(loop_on_api(['https://google.com', 'https://twitter.com']))
It seems the issue is coming from the proxy I am using, which sometimes does not carry the request or response, so forcing a rerun seems to be the best answer. I now check whether the returned results still contain coroutines and re-run loop_on_api() on them:
def loop_on_api(list_of_parameter):
    loop = asyncio.get_event_loop()
    coroutines = [consume_api(list_of_parameter[i]) for i in range(len(list_of_parameter))]
    results = loop.run_until_complete(asyncio.gather(*coroutines))
    undone = []
    rerun_list_of_parameter = []
    for i in range(len(results)):
        if str(type(results[i])) == "<class 'coroutine'>":  # not very elegant; asyncio.iscoroutine(results[i]) would be a cleaner check
            undone.append(i)
            rerun_list_of_parameter.append(list_of_parameter[i])
    if len(undone) > 0:
        undone_results = loop_on_api(rerun_list_of_parameter)
        for i in range(len(undone_results)):
            results[undone[i]] = undone_results[i]
    return results

What is the disadvantage of running asyncio.run multiple times in Python code?

I'd like to embed some async code in my Python project to make the HTTP request part asynchronous. For example, I read params from Kafka, use these params to generate some URLs and put the URLs into a list. If the length of the list is greater than 1000, I send the list to aiohttp to fetch the responses in a batch.
I cannot change the whole project from sync to async, so I can only change the HTTP request part.
The code example is:
import asyncio
import aiohttp

async def async_request(url):
    async with aiohttp.ClientSession() as client:
        resp = await client.get(url)
        result = await resp.json()
        return result

async def do_batch_request(url_list, result):
    task_list = []
    for url in url_list:
        task = asyncio.create_task(async_request(url))
        task_list.append(task)
    batch_response = await asyncio.gather(*task_list)
    result.extend(batch_response)

def batch_request(url_list):
    batch_response = []
    asyncio.run(do_batch_request(url_list, batch_response))
    return batch_response

url_list = []
for msg in kafka_consumer:
    url = msg['url']
    url_list.append(url)
    if len(url_list) >= 1000:
        batch_response = batch_request(url_list)
        parse(batch_response)
        ....
As we know, asyncio.run creates an event loop to run the async function and then closes the event loop. My question is: will my approach hurt the performance of the async code? And is there a better way to handle my situation?
There's no serious problem with your approach, and you'll get a speed benefit from asyncio. The only possible problem is that if you later want to do something async elsewhere in the code, you won't be able to do it concurrently with batch_request.
There's not much to do if you don't want to change the whole project from sync to async, but if in the future you want to run batch_request in parallel with something else, keep in mind that you can run it in a thread and wait for its result asynchronously, as sketched below.
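For instance, a minimal sketch of that idea (batch_request comes from your code above; the single-worker executor and the helper name are assumptions):
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

async def batch_request_in_thread(url_list):
    # Run the blocking batch_request (which calls asyncio.run internally) in a
    # worker thread, so other async code can await it without blocking the main
    # event loop. The worker thread has no running loop of its own, so calling
    # asyncio.run inside batch_request is still legal there.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, batch_request, url_list)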

How to incorporate more complex logic into asyncio futures list comprehension

I have written a function for downloading activity IDs from Strava's public API.
The function iterates the API pages, collects the IDs and stops after it has collected IDs from the page it has identified as the last one:
import requests

def get_activity_ids():
    """Returns a list of activity ids for the token owner"""
    ids = []
    params = {
        'page': 1,
        'per_page': 200,
        'access_token': '1111111',
    }
    while True:
        r = requests.get('https://www.strava.com/api/v3/athlete/activities', params).json()
        if len(r) == 0:
            break
        else:
            ids += [activity['id'] for activity in r]
            if len(r) < 200:  # if last page
                break
        print('PAGE: {}, response length: {}'.format(params['page'], len(r)))
        params['page'] += 1
    return ids
I now want to turn this function into an asynchronous one.
So far I got this:
import asyncio
import concurrent.futures
import requests

def get_ids():
    ids = []

    async def main():
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            loop = asyncio.get_event_loop()
            futures = [
                loop.run_in_executor(
                    executor,
                    requests.get,
                    'https://www.strava.com/api/v3/athlete/activities?page={page}&per_page=200&access_token=111111111'.format(page=page)
                )
                for page in range(1, 4)
            ]
            for response in await asyncio.gather(*futures):
                for activity in response.json():
                    ids.append(activity['id'])

    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    return ids
But I don't know how to incorporate the last-page identification logic from the previous function (the while True block) into this one.
So I would need to somehow replace the for page in range(1, 4) with such logic.
Does anyone know how to do this?
You're trying to parallelize operations that cannot be run in parallel. Each request needs to wait for the previous request to finish before you know whether the next request should even happen. This is inherently sequential.
If you're okay with requesting nonexistent pages, you could probably submit a limited number of requests in parallel, making more requests as previous requests finish, and stopping once you find you've hit the end. This would not be as simple as a list comprehension and a gather, though.
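Here is a hedged sketch of that idea built on the question's thread-pool approach; the batch size of 4 is an assumption, and it relies on pages past the end returning an empty list, as in your original stopping condition:
import asyncio
import concurrent.futures
import requests

URL = ('https://www.strava.com/api/v3/athlete/activities'
       '?page={page}&per_page=200&access_token=111111111')
BATCH = 4  # pages requested in parallel; an assumption, tune as needed

async def get_ids_batched():
    ids = []
    page = 1
    with concurrent.futures.ThreadPoolExecutor(max_workers=BATCH) as executor:
        loop = asyncio.get_running_loop()
        while True:
            # Request a small batch of pages concurrently.
            futures = [loop.run_in_executor(executor, requests.get, URL.format(page=p))
                       for p in range(page, page + BATCH)]
            responses = await asyncio.gather(*futures)
            done = False
            for response in responses:
                activities = response.json()
                ids += [activity['id'] for activity in activities]
                if len(activities) < 200:  # short or empty page: past the last one
                    done = True
            if done:
                return ids
            page += BATCH

ids = asyncio.run(get_ids_batched())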

Multiple Await in Python Async Function

I am using an aiohttp session along with a semaphore within a custom class:
async def get_url(self, url):
    async with self.semaphore:
        async with self.session.get(url) as response:
            try:
                text_response = await response.text()
                read_response = await response.read()
                json_response = await response.json()
                await asyncio.sleep(random.uniform(0.1, 0.5))
            except aiohttp.client_exceptions.ContentTypeError:
                json_response = {}
            return {
                'json': json_response,
                'text': text_response,
                'read': read_response,
                'status': response.status,
                'url': response.url,
            }
I have two questions:
Is it correct/incorrect to have multiple await statements within a single async function? I need to return both the response.text() and response.read(). However, depending on the URL, the response.json() may or may not be available, so I've thrown everything into a try/except block to catch this exception.
Since I am using this function to loop through a list of different RESTful API endpoints, I am controlling the number of simultaneous requests through the semaphore (set to a max of 100), but I also need to stagger the requests so they aren't jamming up the host machine. So I thought I could accomplish this by adding an asyncio.sleep that is randomly chosen between 0.1 and 0.5 seconds. Is this the best way to enforce a small wait between requests? Should I move it to the beginning of the function instead of near the end?
It is absolutely fine to have multiple awaits in one async function, as long as you know what you are awaiting for; each of them is awaited one by one, just like normal sequential execution. One thing to mention about aiohttp: you'd better call read() first and catch UnicodeDecodeError too, because internally text() and json() call read() first and process its result, and you don't want that processing to prevent returning at least read_response. You don't have to worry about read() being called multiple times; it is simply cached in the response instance on the first call. A sketch of that ordering follows.
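For instance, a small sketch of that ordering, reusing the question's get_url() shape and its session/semaphore attributes (the fallback values are illustrative, not part of the original code):
async def get_url(self, url):
    async with self.semaphore:
        async with self.session.get(url) as response:
            # Call read() first so the raw body is always captured; it is cached
            # after the first call, so text()/json() won't re-download anything.
            read_response = await response.read()
            try:
                text_response = await response.text()
            except UnicodeDecodeError:
                text_response = ''   # illustrative fallback
            try:
                json_response = await response.json()
            except aiohttp.ContentTypeError:
                json_response = {}   # illustrative fallback
            await asyncio.sleep(random.uniform(0.1, 0.5))
            return {
                'json': json_response,
                'text': text_response,
                'read': read_response,
                'status': response.status,
                'url': response.url,
            }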
Random staggering is an easy and effective solution for sudden traffic. However, if you want to control exactly the minimum time interval between any two requests - for academic reasons - you could set up two semaphores:
def __init__(self):
    # something else
    self.starter = asyncio.Semaphore(0)
    self.ender = asyncio.Semaphore(30)
Then change get_url() to use them:
async def get_url(self, url):
    await self.starter.acquire()
    try:
        async with self.session.get(url) as response:
            # your code
    finally:
        self.ender.release()
Because starter was initialized with zero, all get_url() coroutines will block on starter. We'll use a separate coroutine to control it:
async def controller(self):
    last = 0
    while self.running:
        await self.ender.acquire()
        sleep = 0.5 - (self.loop.time() - last)  # at most 2 requests per second
        if sleep > 0:
            await asyncio.sleep(sleep)
        last = self.loop.time()
        self.starter.release()
And your main program should look something like this:
def run(self):
    for url in [...]:
        self.loop.create_task(self.get_url(url))
    self.loop.create_task(self.controller())
So at first, the controller will release starter 30 times evenly over 15 seconds, because that is the initial value of ender. After that, the controller will release starter as soon as any get_url() ends, provided 0.5 seconds have passed since the last release of starter; otherwise it will wait up to that time.
One issue here: if the URLs to fetch are not a constant list in memory (e.g. they arrive from the network with unpredictable delays between URLs), the RPS limiter will fail (starter is released too early, before there is actually a URL to fetch). You'll need further tweaks for this case, even though the chance of a traffic burst is already very low.
