Do I need to use threads too? I'm new to async; I was doing this by creating one thread for every request I wanted to make, but it was very inefficient.
I want something like:
def async_request(i):
    response = requests.post(url, headers=headers, json=body(i), timeout=5)
    r = response.json()
    if r['result'] > 50:
        pass  # do something

i = 0
while True:
    async_request(i)
    i += 1
    time.sleep(0.01)  # 10 ms delay
I want to send a request every 10 ms without waiting for one request's response before sending the next, and once the response arrives inside async_request I want to act on the result immediately if it meets the requirements.
I think async is a good fit for your problem, but you have to change a few things. First, use asyncio, which lets you schedule tasks on the event loop. Second, use aiohttp for non-blocking HTTP requests (requests itself is blocking). The following code uses both to send multiple requests at short intervals:
import asyncio
import json

import aiohttp

async def send_post_request(session, url, headers, body):
    async with session.post(url, headers=headers, data=body) as resp:
        return await resp.json()

async def async_request(session, i):
    # TODO change to your params
    test_url = 'https://jsonplaceholder.typicode.com/posts'
    body = json.dumps({'title': 'foo', 'body': 'bar', 'userId': i})
    headers = {'content-type': 'application/json'}
    result = await send_post_request(session, test_url, headers, body)
    # TODO process the JSON result
    print(result)

async def main():
    async with aiohttp.ClientSession() as session:
        loop = asyncio.get_event_loop()
        i = 0
        while True:
            # Add a new request task to the event loop without awaiting it
            loop.create_task(async_request(session, i))
            i += 1
            # TODO change the sleeping period to what you want (e.g. 0.01 for 10 ms)
            await asyncio.sleep(0.1)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
P.S.: I used similar code in a project and noticed that you can send noticeably more requests in a short amount of time with asyncio than with traditional threads.
I am using asyncio in Python to get data from a large number of URLs.
Here I am trying to fetch yahoo.com 1000 times, and ~90% of the requests fail. Reducing the number of parallel requests reduces the percentage of failures. I'm trying to understand why this happens.
import asyncio
import aiohttp

could_not_fetch = []

Fetching data from the URL; almost 90% of them fail here:

async def fetch_page(url, session, id):
    try:
        async with session.get(url, timeout=3) as res:
            html = await res.text()
            return html
    except Exception:
        could_not_fetch.append(id)

async def process(id, url, session):
    html = await fetch_page(url, session, id)

Using aiohttp.ClientSession() to request data from the URLs, also passing their index:

async def dispatch(urls):
    async with aiohttp.ClientSession() as session:
        coros = (process(id, url, session) for id, url in enumerate(urls))
        return await asyncio.gather(*coros)

Using asyncio to get data from yahoo.com 1000 times; if I reduce the number to 100, a much smaller ~10% of the requests fail:

def main():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(dispatch(1000 * ['https://yahoo.com/']))
    print('could_not_fetch', len(could_not_fetch))

if __name__ == '__main__':
    main()
I'm trying to understand why these requests fail and how to fix this while still doing 1k requests at a time.
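One thing worth checking, given that the failure rate drops when concurrency drops, is whether 1000 simultaneous connections are simply overwhelming the client or tripping server-side limits. A minimal sketch of capping concurrency with an asyncio.Semaphore (the limit of 100 is only an assumed starting point, and the helper names mirror the code above):

import asyncio
import aiohttp

MAX_CONCURRENCY = 100  # assumed value; tune it for your network and the server

async def fetch_page(url, session, id, sem, could_not_fetch):
    try:
        async with sem:  # at most MAX_CONCURRENCY requests are in flight at once
            async with session.get(url, timeout=3) as res:
                return await res.text()
    except Exception:
        could_not_fetch.append(id)

async def dispatch(urls):
    could_not_fetch = []
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        coros = (fetch_page(url, session, id, sem, could_not_fetch)
                 for id, url in enumerate(urls))
        results = await asyncio.gather(*coros)
    return results, could_not_fetch

if __name__ == '__main__':
    results, failed = asyncio.get_event_loop().run_until_complete(
        dispatch(1000 * ['https://yahoo.com/']))
    print('could_not_fetch', len(failed))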
Problem I'm trying to solve:
I'm making many API requests to a server. I'm trying to create delays between async API calls to comply with the server's rate limit policy.
What I want it to do
I want it to behave like this:
Make api request #1
wait 0.1 seconds
Make api request #2
wait 0.1 seconds
... and so on ...
repeat until all requests are made
gather the responses and return the results in one object (results)
Issue:
When I introduced asyncio.sleep() or time.sleep() in the code, it still made the API requests almost instantaneously. It seemed to delay the execution of print(), but not the API requests. I suspect that I have to create the delays within the loop, not in fetch_one() or fetch_all(), but I couldn't figure out how to do so.
Code block:
async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay):
    #time.sleep(delay)
    #asyncio.sleep(delay)
    async with aiohttp.ClientSession(loop=loop) as session:
        async with session.get(url, ssl=SSLContext()) as resp:
            # print("An api call to ", url, " is made at ", time.time())
            # print(resp)
            return await resp

delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
Versions I'm using:
python 3.8.5
aiohttp 3.7.4
asyncio 3.4.3
I would appreciate any tips on guiding me to the right direction!
The call to asyncio.gather launches all requests "simultaneously"; on the other hand, if you simply used a lock or awaited each task in turn, you would not gain anything from the parallelism at all.
The simplest thing to do, if you know the rate at which you can issue the requests, is to increase the asynchronous pause before each successive request; a simple global variable can do that:
next_delay = 0.1

async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay):
    global next_delay
    next_delay += delay
    await asyncio.sleep(next_delay)
    async with aiohttp.ClientSession(loop=loop) as session:
        async with session.get(url, ssl=SSLContext()) as resp:
            # print("An api call to ", url, " is made at ", time.time())
            # print(resp)
            return await resp

delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
Now, if you want to, say, issue 5 requests and only then issue the next 5, you can use a synchronization primitive such as asyncio.Event, blocking new calls while a counter of active API calls is at the limit:
active_calls = 0
MAX_CALLS = 5

async def fetch_all(loop, urls, delay):
    event = asyncio.Event()
    event.set()
    results = await asyncio.gather(*[fetch_one(loop, url, delay, event) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay, event):
    global active_calls
    while active_calls >= MAX_CALLS:
        # too many calls in flight: block here until the current batch drains
        event.clear()
        await event.wait()
    active_calls += 1
    try:
        async with aiohttp.ClientSession(loop=loop) as session:
            async with session.get(url, ssl=SSLContext()) as resp:
                # print("An api call to ", url, " is made at ", time.time())
                # print(resp)
                return await resp
    finally:
        active_calls -= 1
        if active_calls == 0:
            event.set()

urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
For both examples, should your design need to avoid global variables (actually, these are "module" variables), you could either move all the functions to a class and work on an instance, promoting the globals to instance attributes, or use a mutable container, such as a list holding the active_calls value in its first item, and pass that around as a parameter. A minimal sketch of the class-based variant is shown below.
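For illustration only, here is a rough sketch of the class-based variant of the first (incremental delay) approach; the class name and the usage line are made up for the example:

import asyncio

import aiohttp
from ssl import SSLContext

class RateLimitedFetcher:
    def __init__(self, delay):
        self.delay = delay
        self.next_delay = 0.0  # former module-level global, now an instance attribute

    async def fetch_one(self, url):
        self.next_delay += self.delay
        await asyncio.sleep(self.next_delay)
        async with aiohttp.ClientSession() as session:
            async with session.get(url, ssl=SSLContext()) as resp:
                return await resp.text()

    async def fetch_all(self, urls):
        return await asyncio.gather(*[self.fetch_one(url) for url in urls],
                                    return_exceptions=True)

# usage, with placeholder URLs:
# results = asyncio.get_event_loop().run_until_complete(
#     RateLimitedFetcher(delay=0.1).fetch_all(['https://example.com/']))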
When you use asyncio.gather you run all the fetch_one coroutines concurrently. All of them wait for delay together, then make their API calls practically simultaneously.
To solve the issue, you should either await the fetch_one calls one by one in fetch_all, or use a Semaphore to signal that the next call shouldn't start before the previous one is done.
Here's the idea:
import asyncio

_sem = asyncio.Semaphore(1)

async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay):
    async with _sem:  # the next coroutine(s) will be stuck here until the previous one is done
        await asyncio.sleep(delay)
        async with aiohttp.ClientSession(loop=loop) as session:
            async with session.get(url, ssl=SSLContext()) as resp:
                # print("An api call to ", url, " is made at ", time.time())
                # print(resp)
                return await resp

delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
I need to make an API request for several pieces of data, and then process each result. The request is paginated, so I'm currently doing
def get_results():
    while True:
        response = api(num_results=5)
        if response is None:  # No more results
            break
        yield response

def process_data():
    for page in get_results():
        for result in page:
            do_stuff(result)

process_data()
I'm hoping to use asyncio to retrieve the next page of results from the API while I'm processing the current one, instead of waiting for results, processing them, then waiting again. I've modified the code to
import asyncio

async def get_results():
    while True:
        response = api(num_results=5)
        if response is None:  # No more results
            break
        yield response

async def process_data():
    async for page in get_results():
        for result in page:
            do_stuff(result)

asyncio.run(process_data())
I'm not sure if this is doing what I intend it to. Is this the right way to make processing the current page of API results and getting the next page of results asynchronous?
Maybe you can use asyncio.Queue to refactor your code into the producer/consumer pattern:
import asyncio
import random

q = asyncio.Queue()

async def api(num_results):
    # you could use aiohttp to fetch the api
    # fake content
    await asyncio.sleep(1)
    fake_response = random.random()
    if fake_response < 0.1:
        return None
    return fake_response

async def get_results(q):
    while True:
        response = await api(num_results=5)
        if response is None:
            # indicate the producer is done
            print('Producer Done')
            await q.put(None)
            break
        print('Producer: ', response)
        await q.put(response)

async def process_data():
    while True:
        data = await q.get()
        if not data:
            print('Consumer Done')
            break
        # process the data however you want; if it's CPU intensive, you can call loop.run_in_executor
        # fake a processing step that needs a little time
        await asyncio.sleep(3)
        print('Consume', data)

loop = asyncio.get_event_loop()
loop.create_task(get_results(q))
loop.run_until_complete(process_data())
Coming back to the question:
Is this the right way to make processing the current page of API results and getting the next page of results asynchronous?
It's not the right way: get_results() is only advanced after each do_stuff(result) has finished, so fetching and processing still happen one after the other.
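If you want to stay closer to the original structure instead of a queue, one option is to start fetching the next page in a background task while the current page is being processed. This is just a sketch; api() is assumed to be the blocking call from the question and do_stuff() the existing processing function:

import asyncio

async def get_page():
    # run the blocking api() call in a worker thread so the event loop stays free
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: api(num_results=5))

async def process_data():
    next_page = asyncio.ensure_future(get_page())  # start fetching the first page
    while True:
        page = await next_page
        if page is None:  # no more results
            break
        next_page = asyncio.ensure_future(get_page())  # prefetch while we process
        for result in page:
            do_stuff(result)

asyncio.run(process_data())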
I have the code below that does GET requests against an HTTP endpoint. Doing them one at a time is super slow, so the code below does them 50 at a time, but I need to add the results to a set (I figured a set would be fastest, because this script will return duplicate objects). Right now it just returns the objects in one string, 50 at a time, when I need them separated so I can sort them once they are all in a set. I'm new to Python, so I'm not sure what else to try.
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://httpbin.org/get"
    tasks = []
    # Fetch all responses within one client session,
    # keep the connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(20))
loop.run_until_complete(future)
Right now, it just dumps all of the request responses into one result; I need it to add each response to a set so I can work with the data later.
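As a sketch of that last step, reusing fetch and ClientSession from the snippet above: gather the bodies and build a set from them. response.read() returns bytes, which are hashable, so they can go straight into a set, or be decoded first if str is easier to work with:

async def run(r):
    url = "http://httpbin.org/get"
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(url, session)) for _ in range(r)]
        responses = await asyncio.gather(*tasks)
    # decode each body and deduplicate by putting them into a set
    unique_responses = {body.decode('utf-8') for body in responses}
    return unique_responses

# later, e.g.:
# results = loop.run_until_complete(run(50))
# for item in sorted(results):
#     ...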
I am making a script that gets the HTML of almost 20 000 pages and parses it to get just a portion of it.
I managed to get the 20 000 pages' content into a dataframe with asynchronous requests using asyncio and aiohttp, but this script still waits for all the pages to be fetched before parsing them.
async def get_request(session, url, params=None):
    async with session.get(url, headers=HEADERS, params=params) as response:
        return await response.text()

async def get_html_from_url(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_request(session, url))
        html_page_response = await asyncio.gather(*tasks)
    return html_page_response

html_pages_list = asyncio_loop.run_until_complete(get_html_from_url(urls))
Once I have the content of each page, I use multiprocessing's Pool to parallelize the parsing:
def get_whatiwant_from_html(html_content):
    parsed_html = BeautifulSoup(html_content, "html.parser")
    clean = parsed_html.find("div", class_="class").get_text()
    # Some re.subs
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    return clean

pool = Pool(4)
what_i_want = pool.map(get_whatiwant_from_html, html_pages_list)
The code below mixes the fetching and the parsing asynchronously, but I would like to integrate multiprocessing into it:
async def process(url, session):
    html = await get_request(session, url)
    return get_whatiwant_from_html(html)

async def dispatch(urls):
    async with aiohttp.ClientSession() as session:
        coros = (process(url, session) for url in urls)
        return await asyncio.gather(*coros)

result = asyncio.get_event_loop().run_until_complete(dispatch(urls))
Is there any obvious way to do this? I thought about creating 4 processes that each run the asynchronous calls but the implementation looks a bit complex and I'm wondering if there is another way.
I am very new to asyncio and aiohttp, so if you have anything to recommend that I read to get a better understanding, I will be very happy.
You can use a ProcessPoolExecutor. With run_in_executor you can do the IO in your main asyncio process, but run the heavy CPU calculations in separate processes:
import asyncio
import concurrent.futures
from functools import partial

import aiohttp

async def get_data(session, url, params=None):
    loop = asyncio.get_event_loop()
    async with session.get(url, headers=HEADERS, params=params) as response:
        html = await response.text()
        # hand the CPU-bound parsing off to the process pool
        data = await loop.run_in_executor(None, partial(get_whatiwant_from_html, html))
        return data

async def get_data_from_urls(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_data(session, url))
        result_data = await asyncio.gather(*tasks)
    return result_data

executor = concurrent.futures.ProcessPoolExecutor(max_workers=10)
asyncio_loop.set_default_executor(executor)
results = asyncio_loop.run_until_complete(get_data_from_urls(urls))
You can also increase your parsing speed by changing your BeautifulSoup parser from html.parser to lxml, which is by far the fastest, followed by html.parser; html5lib is the slowest of them all.
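For instance, assuming lxml is installed, the only change needed in get_whatiwant_from_html would be:

# requires lxml to be installed: pip install lxml
parsed_html = BeautifulSoup(html_content, "lxml")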
Your bottleneck is not processing but IO. You might want multiple threads and not processes:
E.g. here is a template program that does a slow scraping task but runs it in multiple threads and thus completes it faster.
from concurrent.futures import ThreadPoolExecutor
import random, time

from bs4 import BeautifulSoup as bs
import requests

URL = 'http://quotesondesign.com/wp-json/posts'

def quote_stream():
    '''
    Quote streamer
    '''
    param = dict(page=random.randint(1, 1000))
    quo = requests.get(URL, params=param)
    if quo.ok:
        data = quo.json()
        author = data[0]['title'].strip()
        content = bs(data[0]['content'], 'html5lib').text.strip()
        print(f'{content}\n-{author}\n')
    else:
        print('Connection Issues :(')

def multi_quoter(workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        _ = [executor.submit(quote_stream) for i in range(workers)]

if __name__ == '__main__':
    now = time.time()
    multi_quoter(workers=4)
    print(f'Time taken {time.time()-now:.2f} seconds')
In your case, create a function that performs the task you want from start to finish. This function would accept a url and the necessary parameters as arguments. After that, create another function that calls the previous one in different threads, each thread handling its own url; so instead of `for i in range(..)`, use `for url in urls`. You could run 2000 threads at once, but I would prefer chunks of, say, 200 running in parallel, as sketched below.
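A rough sketch of that chunked approach, assuming fetch_and_parse(url) is your hypothetical start-to-finish function (fetch the HTML, parse it, return the cleaned text):

from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 200  # how many URLs to process in parallel per batch

def process_urls_in_chunks(urls, workers=CHUNK_SIZE):
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # submit one chunk at a time so at most CHUNK_SIZE requests are in flight
        for start in range(0, len(urls), CHUNK_SIZE):
            chunk = urls[start:start + CHUNK_SIZE]
            results.extend(executor.map(fetch_and_parse, chunk))
    return results

# usage (urls comes from the original script):
# what_i_want = process_urls_in_chunks(urls)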