I am running a web scraper class whose method is named self.get_with_random_proxy_using_chain.
I am trying to send multithreaded calls to the same url, and would like that once there is a result from any thread, the method returns a response and closes other still active threads.
So far my code looks like this (probably naive):
from concurrent.futures import ThreadPoolExecutor, as_completed
# class initiation etc
max_workers = cpu_count() * 5
urls = [url_to_open] * 50
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_url = []
    for url in urls:  # I had to do a loop to include sleep not to overload the proxy server
        future_to_url.append(executor.submit(self.get_with_random_proxy_using_chain,
                                             url,
                                             timeout,
                                             update_proxy_score,
                                             unwanted_keywords,
                                             unwanted_status_codes,
                                             random_universe_size,
                                             file_path_to_save_streamed_content))
        sleep(0.5)
    for future in as_completed(future_to_url):
        if future.result() is not None:
            return future.result()
But it runs all the threads.
Is there a way to close all threads once the first future has completed?
I am using Windows and Python 3.7.x.
So far I found this link, but I can't manage to make it work (the program still runs for a long time).
As far as I know, running futures cannot be cancelled: Future.cancel() only succeeds for calls that have not started yet. Quite a lot has been written about this, and there are even some workarounds.
But I would suggest taking a closer look at the asyncio module, which is quite well suited for such tasks.
Below is a simple example in which several concurrent requests are made and, upon receiving the first result, the rest are cancelled.
import asyncio
from typing import Set
from aiohttp import ClientSession
async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()
async def wait_for_first_response(tasks):
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for p in pending:
        p.cancel()
    return done.pop().result()
async def request_one_of(*urls):
    tasks = set()
    async with ClientSession() as session:
        for url in urls:
            task = asyncio.create_task(fetch(url, session))
            tasks.add(task)
        # Wait inside the session block so the session stays open while requests run.
        return await wait_for_first_response(tasks)
async def main():
    response = await request_one_of("https://wikipedia.org", "https://apple.com")
    print(response)
asyncio.run(main())
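If switching to asyncio is not an option, one of the thread-based workarounds mentioned above can be sketched as follows. This is only an illustration under stated assumptions: slow_worker is a hypothetical stand-in for the real request method, and the stop_event argument is something the worker has to check itself, because a thread that is already running cannot be stopped from outside.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_result(worker, args_list, max_workers=8):
    """Return the first non-None result of worker(arg, stop_event), or None."""
    stop_event = threading.Event()
    executor = ThreadPoolExecutor(max_workers=max_workers)
    futures = [executor.submit(worker, arg, stop_event) for arg in args_list]
    try:
        for future in as_completed(futures):
            if future.exception() is None and future.result() is not None:
                return future.result()
        return None
    finally:
        stop_event.set()               # ask workers that are still running to give up
        for future in futures:
            future.cancel()            # only succeeds for calls that have not started
        executor.shutdown(wait=False)  # do not block on the remaining threads


def slow_worker(delay, stop_event):
    """Hypothetical stand-in for a request: waits in small steps, checking the stop flag."""
    deadline = time.monotonic() + delay
    while time.monotonic() < deadline:
        if stop_event.is_set():
            return None
        time.sleep(0.01)
    return delay
```

The executor is created without a with block on purpose: leaving a with block calls shutdown(wait=True), which would wait for every thread and is one reason the original code appears to "run all the threads". From Python 3.9 on, shutdown also accepts cancel_futures=True, which replaces the manual cancel loop.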
I'm trying to create an interface to an API, and I want to have the option to easily run the requests sync or asynchronously, and I came up with the following code.
import asyncio
import requests
def async_run(coro_list):
    loop = asyncio.get_event_loop()
    futures = [loop.run_in_executor(None, asyncio.run, coro) for coro in coro_list]
    result = loop.run_until_complete(asyncio.gather(*futures))
    return result
def sync_get(url):
    return requests.get(url)
async def async_get(url):
    return sync_get(url)
coro_list = [async_get("https://google.com"), async_get("https://google.com")]
responses = async_run(coro_list)
print(responses)
For me it's very intuitive to either call sync_get or create a list of async_get and call async_run, and requires no knowledge of async Python to understand how it works.
The only problem is that loop.run_in_executor(None, asyncio.run, coro) doesn't look optimal, and I couldn't find anyone else running this code on GitHub. So I'm wondering: is there a simpler way to abstract these threading and asyncio concepts in a similar way, or is this code already optimal?
asyncio.run() is usually used as the main entry point for running async code from sync code.
loop.run_in_executor(None, asyncio.run, coro) causes a new event loop to be created in each executor thread just to run one coroutine from coro_list. Why not run sync_get directly in the executor threads?
import asyncio
import requests
async def async_run(url_list):
    # Run the blocking sync_get calls in the default thread pool executor.
    loop = asyncio.get_event_loop()
    futures = [loop.run_in_executor(None, sync_get, url) for url in url_list]
    result = await asyncio.gather(*futures)
    return result
def sync_get(url):
    return requests.get(url)
#
# async def async_get(url):
#     return sync_get(url)
url_list = ["https://google.com", "https://google.com"]
responses = asyncio.run(async_run(url_list))
print(responses)
There are also async libraries, e.g. aiohttp and httpx, that accomplish similar work.
In the end I chose not to hide asyncio completely behind my interface.
Still with the goal of not having to manage two "requests" functions, I made the API async first and run the synchronous variant through asyncio. I ended up with something like this:
def sync_request():
    return asyncio.run(async_request(...))
async def async_request():
    return await aiohttp.request(...)  # pseudo code
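A runnable version of that pattern might look like the sketch below. The network call is replaced by asyncio.sleep so the example needs no third-party library; the payload shape and the delay parameter are assumptions made only for illustration.

```python
import asyncio


async def async_request(value, delay=0.01):
    """Async-first implementation; the sleep stands in for a real awaitable HTTP call."""
    await asyncio.sleep(delay)
    return {"echo": value}


def sync_request(value, delay=0.01):
    """Synchronous facade: starts an event loop just for this one call."""
    return asyncio.run(async_request(value, delay))


async def async_request_many(values):
    """Callers that are already async can fan out with gather instead."""
    return await asyncio.gather(*(async_request(v) for v in values))
```

One limitation of this design: asyncio.run raises RuntimeError when called from a thread that already has a running event loop, so sync_request is only usable from plain synchronous code.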
I'm using python-twitter which isn't an asynchronous library and writing these to Django models. What I need to do for the sake of speed is read n batches of 100 user_ids at once. So:
[[1234, 4352, 12542, ...], [2342, 124124, 235235, 1249, ...], ...]
Each of these has to hit something like api.twitter.com/users/lookup.json.
I've tried to use something like this, but it seems to run synchronously:
await asyncio.gather(*[sync_users(user, api, batch) for batch in batches], return_exceptions=False)
I've also tried wrapping the synchronous library calls, but that also seems to run synchronously. How can I send out all of the username lookup requests at once?
loop = asyncio.get_event_loop()
executor = ThreadPoolExecutor(max_workers=5)
results = await loop.run_in_executor(executor, api.UsersLookup(user_id=batch, include_entities=True))
Instead of calling in a batch, try something like this:
import asyncio
from aiohttp import ClientSession
async def get_user(user_id):
    async with ClientSession() as session:
        print('calling')
        async with session.get("http://httpbin.org/headers") as response:
            print('getting response')
            response = await response.read()
            print(response)
loop = asyncio.get_event_loop()
tasks = []
users = [1,2,3,4] # a list of user ids
for user_id in users:
    tasks.append(asyncio.ensure_future(get_user(user_id)))
loop.run_until_complete(asyncio.wait(tasks))
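If the blocking python-twitter calls have to stay, another route is to hand them to a thread pool. run_in_executor expects a callable plus its arguments, not the result of calling it, so keyword arguments are usually bound with functools.partial. A minimal sketch, where lookup_users is a hypothetical stand-in for the blocking call (for example api.UsersLookup) and the sleep only simulates network latency:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def lookup_users(user_ids, include_entities=True):
    """Hypothetical blocking call; replace with the real synchronous library call."""
    time.sleep(0.2)  # simulated network latency
    return [{"id": uid, "entities": include_entities} for uid in user_ids]


async def lookup_all(batches, max_workers=5):
    """Submit every batch to the pool first, then await all results together."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            # Pass the function itself; partial binds the arguments without calling it.
            loop.run_in_executor(executor, partial(lookup_users, batch, include_entities=True))
            for batch in batches
        ]
        return await asyncio.gather(*futures)
```

Calling asyncio.run(lookup_all(batches)) returns one result list per batch, in the same order as batches. At most max_workers batches are in flight at a time.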
I am making a script that gets the HTML of almost 20 000 pages and parses it to get just a portion of it.
I managed to get the 20 000 pages' content into a dataframe with asynchronous requests using asyncio and aiohttp, but the script still waits for all the pages to be fetched before parsing them.
async def get_request(session, url, params=None):
    async with session.get(url, headers=HEADERS, params=params) as response:
        return await response.text()
async def get_html_from_url(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_request(session, url))
        html_page_response = await asyncio.gather(*tasks)
        return html_page_response
html_pages_list = asyncio_loop.run_until_complete(get_html_from_url(urls))
Once I have the content of each page I managed to use multiprocessing's Pool to parallelize the parsing.
def get_whatiwant_from_html(html_content):
    parsed_html = BeautifulSoup(html_content, "html.parser")
    clean = parsed_html.find("div", class_="class").get_text()
    # Some re.subs
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    clean = re.sub("", "", clean)
    return clean
pool = Pool(4)
what_i_want = pool.map(get_whatiwant_from_html, html_pages_list)
This code interleaves the fetching and the parsing asynchronously, but I would like to integrate multiprocessing into it:
async def process(url, session):
    html = await get_request(session, url)
    # get_whatiwant_from_html is a plain function, so it is called without await
    return get_whatiwant_from_html(html)
async def dispatch(urls):
    async with aiohttp.ClientSession() as session:
        coros = (process(url, session) for url in urls)
        return await asyncio.gather(*coros)
result = asyncio.get_event_loop().run_until_complete(dispatch(urls))
Is there any obvious way to do this? I thought about creating 4 processes that each run the asynchronous calls but the implementation looks a bit complex and I'm wondering if there is another way.
I am very new to asyncio and aiohttp so if you have anything to advise me to read to get a better understanding, I will be very happy.
You can use ProcessPoolExecutor.
With run_in_executor you can keep the I/O in your main asyncio process and run the heavy CPU calculations in separate processes.
import asyncio
import concurrent.futures
from functools import partial
executor = concurrent.futures.ProcessPoolExecutor(max_workers=10)
async def get_data(session, url, params=None):
    loop = asyncio.get_event_loop()
    async with session.get(url, headers=HEADERS, params=params) as response:
        html = await response.text()
    # Pass the process pool explicitly: since Python 3.11 the loop's default
    # executor must be a ThreadPoolExecutor, so set_default_executor would fail.
    data = await loop.run_in_executor(executor, partial(get_whatiwant_from_html, html))
    return data
async def get_data_from_urls(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_data(session, url))
        result_data = await asyncio.gather(*tasks)
        return result_data
results = asyncio_loop.run_until_complete(get_data_from_urls(urls))
You can increase your parsing speed by changing your BeautifulSoup parser from html.parser to lxml, which is by far the fastest. html.parser comes next, and html5lib is the slowest of them all.
Your bottleneck is not processing but I/O. You might want multiple threads rather than multiple processes.
For example, here is a template program that does slow, I/O-bound scraping but runs it in multiple threads and thus completes the task faster.
from concurrent.futures import ThreadPoolExecutor
import random,time
from bs4 import BeautifulSoup as bs
import requests
URL = 'http://quotesondesign.com/wp-json/posts'
def quote_stream():
    '''
    Quoter streamer
    '''
    param = dict(page=random.randint(1, 1000))
    quo = requests.get(URL, params=param)
    if quo.ok:
        data = quo.json()
        author = data[0]['title'].strip()
        content = bs(data[0]['content'], 'html5lib').text.strip()
        print(f'{content}\n-{author}\n')
    else:
        print('Connection Issues :(')
def multi_qouter(workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        _ = [executor.submit(quote_stream) for i in range(workers)]
if __name__ == '__main__':
    now = time.time()
    multi_qouter(workers=4)
    print(f'Time taken {time.time()-now:.2f} seconds')
In your case, create a function that performs the task you want from start to finish. This function would accept the url and any necessary parameters as arguments. After that, create another function that calls the first one in different threads, each thread having its own url. So instead of "for i in range(..)", use "for url in urls". You could run 2000 threads at once, but I would prefer chunks of, say, 200 running in parallel.
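The chunking idea described above could be sketched like this. process_url is a hypothetical placeholder for the start-to-finish task (fetch plus parse), and the chunk size and worker count are illustrative defaults rather than recommendations.

```python
from concurrent.futures import ThreadPoolExecutor


def process_url(url):
    """Hypothetical start-to-finish task; replace the body with fetch + parse."""
    return url.upper()


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_chunks(urls, chunk_size=200, workers=20):
    """Process urls chunk by chunk, running each chunk across a thread pool."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        for chunk in chunks(urls, chunk_size):
            # executor.map yields results in input order; extend() drains the
            # whole chunk before the next one is submitted.
            results.extend(executor.map(process_url, chunk))
    return results
```

Because executor.map preserves input order, results[i] corresponds to urls[i]. Note that max_workers already caps how many threads run at once, so the chunking mainly limits how many tasks are queued at a time.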
My problem is that my code for API requests behaves non-deterministically. I use asyncio to make asynchronous requests because I want to send multiple requests and the data changes very frequently (that's why I am sending the same request 30 times). Sometimes my code executes really quickly, in about 0.5 s, but sometimes it gets stuck after sending, for example, half of the requests. Can anyone see a bug in the code that could cause this? Or is it caused by delays in the server's responses?
import asyncio
import time
from aiohttp import ClientSession
async def fetch(url, session):
    async with session.get(url) as response:
        data = await response.json()
        print(data)
        return await response.read()
async def run(r):
    url = "https://www.bitstamp.net/api/ticker/"
    tasks = []
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses
t1 = time.time()
number = 30
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(number))
loop.run_until_complete(future)
t2 = time.time()
print(t2-t1)
Simple example: I need to make two unrelated HTTP requests in parallel. What's the simplest way to do that? I expect it to be like that:
async def do_the_job():
    async with aiohttp.ClientSession() as session:
        coro_1 = session.get('http://httpbin.org/get')
        coro_2 = session.get('http://httpbin.org/ip')
        return combine_responses(await coro_1, await coro_2)
In other words, I want to initiate IO operations and wait for their results so they effectively run in parallel. This can be achieved with asyncio.gather:
async def do_the_job():
    async with aiohttp.ClientSession() as session:
        coro_1 = session.get('http://example.com/get')
        coro_2 = session.get('http://example.org/tp')
        return combine_responses(*(await asyncio.gather(coro_1, coro_2)))
Next, I want to have some complex dependency structure. I want to start operations as soon as I have all their prerequisites and get results when I need them. Here asyncio.ensure_future helps: it turns a coroutine into a separate task that the event loop manages independently:
async def do_the_job():
    async with aiohttp.ClientSession() as session:
        fut_1 = asyncio.ensure_future(session.get('http://httpbin.org/ip'))
        coro_2 = session.get('http://httpbin.org/get')
        coro_3 = session.post('http://httpbin.org/post', data=(await coro_2))
        coro_3_result = await coro_3
        return combine_responses(await fut_1, coro_3_result)
Is it true that, to achieve parallel non-blocking IO with coroutines in my logic flow, I have to use either asyncio.ensure_future or asyncio.gather (which actually uses asyncio.ensure_future)? Is there a less "verbose" way?
Is it true that normally developers have to think what coroutines should become separate tasks and use aforementioned functions to gain optimal performance?
Is there a point in using coroutines without multiple tasks in event loop?
How "heavy" are event loop tasks in real life? Surely, they're "lighter" than OS threads or processes. To what extent should I strive for minimal possible number of such tasks?
I need to make two unrelated HTTP requests in parallel. What's the simplest way to do that?
import asyncio
import aiohttp
async def request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()
async def main():
    results = await asyncio.gather(
        request('http://httpbin.org/delay/1'),
        request('http://httpbin.org/delay/1'),
    )
    print(len(results))
loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
    loop.run_until_complete(loop.shutdown_asyncgens())
finally:
    loop.close()
Yes, you can achieve concurrency with asyncio.gather or by creating a task with asyncio.ensure_future.
Next, I want to have some complex dependency structure. I want to start operations when I have all prerequisites for them and get results when I need the results.
While the code you provided will do the job, it would be nicer to split the concurrent flows into separate coroutines and again use asyncio.gather:
import asyncio
import aiohttp
async def request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()
async def get_ip():
    return await request('http://httpbin.org/ip')
async def post_from_get():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/get') as resp:
            get_res = await resp.text()
        async with session.post('http://httpbin.org/post', data=get_res) as resp:
            return await resp.text()
async def main():
    results = await asyncio.gather(
        get_ip(),
        post_from_get(),
    )
    print(len(results))
loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
    loop.run_until_complete(loop.shutdown_asyncgens())
finally:
    loop.close()
Is it true that normally developers have to think what coroutines should become separate tasks and use aforementioned functions to gain optimal performance?
Since you use asyncio, you probably want to run some jobs concurrently to gain performance, right? asyncio.gather is a way to say "run these jobs concurrently to get their results faster".
If you don't want to think about which jobs should run concurrently to gain performance, you may be fine with plain sync code.
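To make the difference concrete, here is a small sketch that compares awaiting two jobs one after the other with handing both to asyncio.gather. asyncio.sleep stands in for network I/O, and the job names and delays are arbitrary.

```python
import asyncio
import time


async def job(name, delay):
    """Stand-in for an I/O-bound operation such as an HTTP request."""
    await asyncio.sleep(delay)
    return name


async def run_sequentially():
    # The second job does not start until the first one has finished.
    return [await job("a", 0.2), await job("b", 0.2)]


async def run_concurrently():
    # gather wraps both coroutines in tasks, so their waits overlap.
    return await asyncio.gather(job("a", 0.2), job("b", 0.2))


def timed(coro_fn):
    """Run a coroutine function to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

Both variants return the same results in the same order. The sequential one takes roughly the sum of the delays, the concurrent one roughly the longest single delay.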
Is there a point in using coroutines without multiple tasks in event loop?
In your code you don't have to create tasks manually if you don't want to: neither snippet in this answer uses asyncio.ensure_future. But internally asyncio uses tasks constantly (for example, as you noted, asyncio.gather uses tasks itself).
How "heavy" are event loop tasks in real life? Surely, they're "lighter" than OS threads or processes. To what extent should I strive for minimal possible number of such tasks?
The main bottleneck in an async program is (almost always) the network: you shouldn't worry about the number of asyncio coroutines/tasks at all.
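As a rough illustration of how cheap tasks are, the sketch below starts many tasks that each just wait, the way a request waits on the network. The task count and delay are arbitrary, and the actual timing depends on the machine.

```python
import asyncio
import time


async def tiny_job(i):
    """Each task only waits, the way a request would wait on the network."""
    await asyncio.sleep(0.1)
    return i


async def run_many(n):
    # gather creates one task per coroutine; all of them wait at the same time.
    return await asyncio.gather(*(tiny_job(i) for i in range(n)))


def timed_run(n):
    """Return (number of results, elapsed seconds) for n concurrent tasks."""
    start = time.perf_counter()
    results = asyncio.run(run_many(n))
    return len(results), time.perf_counter() - start
```

Calling timed_run(10_000) finishes in a small fraction of the 1000 seconds that the same waits would take one after another, because the waiting overlaps and the per-task overhead is small.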