How to use parallelization in set/list comprehension using asyncio? - python

I want to create a multiprocess comprehension in Python 3.7.
Here's the code I have:
import asyncio
import requests

async def _url_exists(url):
    """Check whether a url is reachable"""
    request = requests.get(url)
    return request.status_code == 200

async def _remove_unexisting_urls(rows):
    return {row for row in rows if await _url_exists(row)}

rows = [
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]

rows = asyncio.run(_remove_unexisting_urls(rows))
In this code example, I want to remove non-existent URLs from a list. (Note that I'm using a set instead of a list because I also want to remove duplicates.)
My issue is that the execution is still sequential: the HTTP requests make the execution wait.
Compared to a serial execution, the execution time is the same.
Am I doing something wrong?
How should these async/await keywords be used with Python comprehensions?

asyncio on its own won't make this concurrent: requests.get is blocking, and the comprehension awaits each URL one after another. With the multiprocessing module's Pool.map, however, you can run the requests in separate processes:
from multiprocessing.pool import Pool
import requests

pool = Pool()

def fetch(url):
    request = requests.get(url)
    return request.status_code == 200

rows = [
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]

# keep only the URLs whose fetch returned True
rows = [url for url, exists in zip(rows, pool.map(fetch, rows)) if exists]
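As a side note (not part of the original answer), here is a minimal self-contained sketch of the same Pool-based approach. On platforms that spawn worker processes (Windows, and macOS on recent Python versions) the pool should be created under an if __name__ == "__main__": guard, or each child process will re-import the module and try to create its own pool:
from multiprocessing.pool import Pool
import requests

def fetch(url):
    """Return (url, True/False) so the caller can keep the reachable URLs."""
    try:
        return url, requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        return url, False

if __name__ == "__main__":
    urls = {"http://example.com/", "http://example.org/", "http://foo.org/"}
    with Pool() as pool:
        reachable = {url for url, ok in pool.map(fetch, urls) if ok}
    print(reachable)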

requests does not support asyncio. If you want true asynchronous execution, you will have to look at libraries like aiohttp or asks.
Build the set from the list before handing the work off to tasks, so duplicates are never requested in the first place, rather than deduplicating the results afterwards.
With requests itself you can fall back to run_in_executor, which runs the requests inside a ThreadPoolExecutor, so it is not really asynchronous I/O:
import asyncio
import time

from requests import exceptions, get

def _url_exists(url):
    try:
        r = get(url, timeout=10)
    except (exceptions.ConnectionError, exceptions.ConnectTimeout):
        return False
    else:
        return r.status_code == 200

async def _remove_unexisting_urls(l, r):
    # making a set from the list before passing it to the futures
    # so we just have three tasks instead of nine
    futures = [l.run_in_executor(None, _url_exists, url) for url in set(r)]
    return [await f for f in futures]

rows = [  # added some dupes
    'http://example.com/',
    'http://example.com/',
    'http://example.com/',
    'http://example.org/',
    'http://example.org/',
    'http://example.org/',
    'http://foo.org/',
    'http://foo.org/',
    'http://foo.org/',
]

loop = asyncio.get_event_loop()
print(time.time())
result = loop.run_until_complete(_remove_unexisting_urls(loop, rows))
print(time.time())
print(result)
Output
1537266974.403686
1537266986.6789136
[False, False, False]
As you can see, there is a penalty for initializing the thread pool, ~2.3 seconds in this case. However, given that each of the three tasks runs for ten seconds until it times out on my box (my IDE is not allowed through the proxy), an overall execution time of roughly twelve seconds looks quite concurrent.
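For comparison, here is a rough sketch of the truly asynchronous route mentioned above, using aiohttp; the helper names are my own, not from the original answer:
import asyncio
import aiohttp

async def _url_exists(session, url):
    """Return True if the URL answers with HTTP 200."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return resp.status == 200
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return False

async def _remove_unexisting_urls(urls):
    unique = list(set(urls))  # deduplicate before firing any requests
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(_url_exists(session, u) for u in unique))
    return {u for u, ok in zip(unique, results) if ok}

rows = asyncio.run(_remove_unexisting_urls([
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]))
print(rows)
Here all requests share one ClientSession and run concurrently under asyncio.gather, so the total time is roughly that of the slowest request rather than the sum.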

Related

REQUESTS Maximum number of attempts with a waiting time and in case of failure, give a message in Python

The situation is that sometimes a request does not load or gets stuck in Python. When that happens, or when any error occurs, I would like to retry it "n" times, waiting up to a maximum of 3 seconds for each attempt, and once the attempts are exhausted print the message f"Could not process {type_1} and {type_2}". Everything runs in parallel with concurrent.futures. Could you help me with that?
import requests
import concurrent.futures
import json

data = [['PEN', 'USD'], ['USD', 'EUR']]

def currency(element):
    type_1 = element[0]
    type_2 = element[1]
    s = requests.Session()
    url = f'https://usa.visa.com/cmsapi/fx/rates?amount=1&fee=0&utcConvertedDate=07%2F26%2F2022&exchangedate=07%2F26%2F2022&fromCurr={type_1}&toCurr={type_2}'
    a = s.get(url)
    response = json.loads(a.text)
    value = response["convertedAmount"]
    return value

with concurrent.futures.ProcessPoolExecutor() as executor:
    results = executor.map(currency, data)
    for value in results:
        print(value)
Your code is almost there. Here, I modified a few things:
from concurrent.futures import ThreadPoolExecutor
import time

import requests

def convert_currency(tup):
    from_currency, to_currency = tup
    url = (
        "https://usa.visa.com/cmsapi/fx/rates?amount=1&fee=0"
        "&utcConvertedDate=07%2F26%2F2022&exchangedate=07%2F26%2F2022&"
        f"fromCurr={from_currency}&toCurr={to_currency}"
    )
    session = requests.Session()
    for _ in range(3):
        try:
            response = session.get(url, timeout=3)
            if response.ok:
                return response.json()["convertedAmount"]
        except requests.exceptions.ConnectTimeout:
            time.sleep(3)
    return f"Could not process {from_currency} and {to_currency}"

data = [["VND", "XYZ"], ['PEN', 'USD'], ["ABC", "XYZ"], ['USD', 'EUR'], ["USD", "XXX"]]

with ThreadPoolExecutor() as executor:
    results = executor.map(convert_currency, data)
    for value in results:
        print(value)
Notes
I retry 3 times (see the for loop).
Use timeout= to specify the timeout (in seconds).
The .ok attribute tells you whether the call was successful.
There is no need to import json, as the response object can decode JSON with its .json() method.
You might experiment with ThreadPoolExecutor versus ProcessPoolExecutor to see which one performs better; a rough timing harness is sketched below.
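If you want to compare the two executors empirically, something like the following is usually enough (this harness is my own sketch, not part of the answer); for I/O-bound work such as HTTP calls the thread pool typically wins, because it avoids process start-up and pickling overhead:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

import requests

def fetch(pair):
    """Stand-in for convert_currency: one blocking HTTP call per work item."""
    from_currency, to_currency = pair
    url = (
        "https://usa.visa.com/cmsapi/fx/rates?amount=1&fee=0"
        "&utcConvertedDate=07%2F26%2F2022&exchangedate=07%2F26%2F2022&"
        f"fromCurr={from_currency}&toCurr={to_currency}"
    )
    try:
        return requests.get(url, timeout=3).ok
    except requests.RequestException:
        return False

def timed(executor_cls, items):
    """Return how long the given executor class takes to map fetch over items."""
    start = time.perf_counter()
    with executor_cls() as executor:
        list(executor.map(fetch, items))
    return time.perf_counter() - start

if __name__ == "__main__":
    data = [["PEN", "USD"], ["USD", "EUR"], ["USD", "JPY"], ["EUR", "GBP"]]
    print("threads:  ", timed(ThreadPoolExecutor, data))
    print("processes:", timed(ProcessPoolExecutor, data))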

Asynchronous requests inside the for loop in python

I have this snippet
import requests

config = {10: 'https://www.youtube.com/', 5: 'https://www.youtube.com/', 7: 'https://www.youtube.com/',
          3: 'https://sportal.com/', 11: 'https://sportal.com/'}

def test(arg):
    for key in arg.keys():
        requests.get(arg[key], timeout=key)

test(config)
That way things happen synchronously. I want to do it asynchronously: iterate through the loop without waiting for the response to each address and move on to the next one, until I have gone through all the addresses in the dictionary. Then I want to wait until I get all the responses, and only after that leave the test function. I know I can do it with threading, but I read that with the asyncio library it can be done better, although I couldn't implement it. If anyone has even better suggestions, I am open to them. Here is my try:
async def test(arg):
    loop = asyncio.get_event_loop()
    tasks = [loop.run_in_executor(requests.get(arg[key], timeout=key) for key in arg.keys())]
    await asyncio.gather(*tasks)

asyncio.run(test(config))
Here is the solution:
def addresses(adr, to):
    requests.get(adr, timeout=to)

async def test(arg):
    loop = asyncio.get_event_loop()
    tasks = [loop.run_in_executor(None, addresses, arg[key], key) for key in arg.keys()]
    await asyncio.gather(*tasks)

asyncio.run(test(config))
Now the requests run concurrently via the asyncio library rather than by managing threads yourself (under the hood, run_in_executor with None still uses asyncio's default ThreadPoolExecutor).
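As a side note (not part of the original answer): on Python 3.9+ the same pattern can be written more compactly with asyncio.to_thread, which submits the blocking call to the default thread pool for you:
import asyncio
import requests

config = {10: 'https://www.youtube.com/', 5: 'https://www.youtube.com/', 7: 'https://www.youtube.com/',
          3: 'https://sportal.com/', 11: 'https://sportal.com/'}

async def test(arg):
    # Each blocking requests.get call runs in the default thread pool.
    tasks = [asyncio.to_thread(requests.get, url, timeout=key) for key, url in arg.items()]
    await asyncio.gather(*tasks)

asyncio.run(test(config))
Functionally this is equivalent to the run_in_executor version above; it just hides the loop and executor plumbing.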
Some good answers here. I had trouble with this myself (I do a lot of web scraping), so I created a package to help me: async-scrape (https://pypi.org/project/async-scrape/).
It supports GET and POST. I tried to make it as easy to use as possible: you just specify a handler function for the response when you instantiate, and then use the scrape_all method to do the work.
It uses the term scrape because I've built in some handlers for common errors when scraping websites.
You can also do things like limit the call rate if you find you're getting blocked.
An example of its use is:
# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

resps = async_Scrape.scrape_all(urls)
To do this inside a loop, I collect the results, add the newly found URLs to a set, and pop off the ones already processed.
For example:
from async_scrape import AsyncScrape
from bs4 import BeautifulSoup as bs

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    soup = bs(html, "html.parser")
    new_urls = [a.get("href") for a in soup.find_all("a", {"class": "new_link_on_website"})]
    return [new_urls, resp]

async_scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={}
)

# Run the loop
urls = set(["https://initial_webpage.com/"])
processed = set()
all_resps = []

while len(urls):
    resps = async_scrape.scrape_all(urls)
    # Split the responses into successful and errored requests
    success_resps = [r for r in resps if not r["error"]]
    errored_reqs = set(r["req"] for r in resps if r["error"])
    # Get what you want from the responses
    for r in success_resps:
        # Add found urls to urls
        urls |= set(r["func_resp"][0])  # "func_resp" is the key to the return from your handler function
        # Collect the response
        all_resps.append(r["func_resp"][1])
        # Add to processed urls
        processed.add(r["url"])  # "url" is the key to the url from the response
    # Remove processed urls
    urls = urls - processed

Asyncio get with unique URLs slower than non-unique

I have an asyncio function which uses aiohttp.ClientSession to get JSON data from many (10,000) URLs (code below).
During testing, I have simply been calling the async function with a list of urls:
urls = [url1, url2, url3, url4, url5] * 2000
With the above urls variable I get a speed of about 300 it/s.
However, it is about 10x slower when I run it with a complete list of 10,000 unique urls:
urls = [url1, url2, ..., url10000]
I am not sure why the unique url list is so much slower. Maybe asyncio compares sessions to see if they are fetching the same thing and returns the same value instead of fetching a new one?
It seems as though THIS IMPLEMENTATION is robust and fast. I have even tried copying it, but I get the same results: 300 it/s with the non-unique list and 30 it/s with the unique one.
Perhaps there is something simple I am missing in the implementation of the function below.
import aiohttp
import asyncio
import tqdm.asyncio
from url_list import unique_urls

async def fetch(session, url, sem):
    async with sem, session.get(url) as response:
        return await response.json()

async def fetch_all(session, urls, loop):
    sem = asyncio.Semaphore(100)
    tasks = [asyncio.create_task(fetch(session, url, sem)) for url in urls]
    return await asyncio.gather(*tasks, tq(tasks))

async def run(urls):
    async with aiohttp.ClientSession(loop=loop) as session:
        return await fetch_all(session, urls, loop)

async def tq(tasks):
    for f in tqdm.asyncio.tqdm.as_completed(tasks):
        await f

if __name__ == '__main__':
    len(unique_urls)
    loop = asyncio.get_event_loop()
    calc_routes = loop.run_until_complete(run(unique_urls))
My unique and non-unique lists are below:
nonunique_urls = [
    "https://arweave.net/ufeA5gLvBtf9vFt_HaLtZsOXIIN5cOfHB2c_DKTbOT8",
    "https://arweave.net/UHGKgIp9OAxBsAbR1px3Sqaq1kWqFvNfeEeT9VDeArY",
    "https://arweave.net/J8dxe4MHj1TRRaEI5wxiXP-ulqTn_z5NADlX19kVxew",
    "https://arweave.net/TuJ0uo6Gofe1cxaNIWwyx0RxjF5khkHTcwtdEu3m1BA",
    "https://arweave.net/nv0yKK7U_2T2a4KdC41BBPgOYAldJxyFmnjnavjMI5I"
] * 2000

unique_urls = [
    'https://arweave.net/ufeA5gLvBtf9vFt_HaLtZsOXIIN5cOfHB2c_DKTbOT8',
    'https://arweave.net/UHGKgIp9OAxBsAbR1px3Sqaq1kWqFvNfeEeT9VDeArY',
    'https://arweave.net/J8dxe4MHj1TRRaEI5wxiXP-ulqTn_z5NADlX19kVxew',
    'https://arweave.net/TuJ0uo6Gofe1cxaNIWwyx0RxjF5khkHTcwtdEu3m1BA',
    'https://arweave.net/nv0yKK7U_2T2a4KdC41BBPgOYAldJxyFmnjnavjMI5I',
    'https://arweave.net/a8cnY1xvM82d444JMSvzKO2qy7DCW73tFaFuNAkPzeg',
    'https://arweave.net/pfv8utICUpTHmvdM9bWxMyI8hHmJCf33bMw1F0nvXik',
    'https://arweave.net/oRAf9M7gG3mtraqzXh9XsCxbM675u3wxzYM5scL1keY',
    'https://arweave.net/2AiZNhPluj9jwdOOlIk_vqKYzhLwtJyLAT_7tdNvgVg',
    ........]

Fetching requests in queue concurrently

I have written code that allows me to start fetching the next chunk of data from an API while the previous chunk of data is being processed.
I'd like it to always be fetching up to 5 chunks concurrently at any given moment, but the returned data should always be processed in the correct order, even if a request that is last in the queue completes before any other.
How can my code be changed to make this happen?
class MyClient:
    async def fetch_entities(
        self,
        entity_ids: List[int],
        objects: Optional[List[str]],
        select_inbound: Optional[List[str]] = None,
        select_outbound: Optional[List[str]] = None,
        queue_size: int = 5,
        chunk_size: int = 500,
    ):
        """
        Fetch entities in chunks

        While one chunk of data is being processed the next one can
        already be fetched. In other words: Data processing does not
        block data fetching.
        """
        objects = ",".join(objects)
        if select_inbound:
            select_inbound = ",".join(select_inbound)
        if select_outbound:
            select_outbound = ",".join(select_outbound)

        queue = asyncio.Queue(maxsize=queue_size)

        # TODO: I want to be able to fill the queue with requests that are already executing
        async def queued_chunks():
            for ids in chunks(entity_ids, chunk_size):
                res = await self.client.post(urllib.parse.quote("entities:fetchdata"), json={
                    "entityIds": ids,
                    "objects": objects,
                    "inbound": {
                        "linkTypeIds": select_outbound,
                        "objects": objects,
                    } if select_inbound else {},
                    "outbound": {
                        "linkTypeIds": select_inbound,
                        "objects": objects,
                    } if select_outbound else {},
                })
                await queue.put(res)
            await queue.put(None)

        asyncio.create_task(queued_chunks())

        while True:
            res = await queue.get()
            if res is None:
                break
            res.raise_for_status()
            queue.task_done()
            for entity in res.json():
                yield entity
I’d use two queues here: one with the chunks to process, and one for the chunks that are complete. You can have any number of worker tasks to process chunks, and you can put a size limit on the first queue to limit how many chunks you prefetch. Use just a single loop to receive the processed chunks, to ensure they are kept ordered (your code already does this).
The trick is to put futures into both queues, one for every chunk to be processed. The worker tasks that do the processing fetch a (chunk, future) pair, and then resolve the associated future by setting the POST response as its result. The loop that handles the processed chunks awaits each future in turn, and so will only proceed to the next chunk when the current chunk has been fully processed. For this to work you need to put both the chunk and the corresponding future into the first queue for the workers to process, and put the same future into the second queue; this enforces that the chunk results are processed in order.
So, in summary:
Have two queues:
chunks holds (chunk, future) pairs.
completed holds futures, the same futures paired with chunks in the other queue.
Create "worker" tasks that consume from the chunks queue. If you create 5, then 5 chunks will be processed in parallel. Every time a worker has completed processing a chunk, it sets the result on the corresponding future.
Use a "processed chunks" loop; it takes the next future from the completed queue and awaits it. Only when the specific chunk associated with that future has been completed will it produce the result (set by a worker task).
As a rough sketch, it’d look something like this:
chunk_queue = asyncio.Queue()
completed_queue = asyncio.Queue()
WORKER_COUNT = queue_size

async def queued_chunks():
    for ids in chunks(entity_ids, chunk_size):
        future = asyncio.Future()
        await chunk_queue.put((ids, future))
        await completed_queue.put(future)
    await completed_queue.put(None)

async def worker():
    while True:
        ids, future = await chunk_queue.get()
        try:
            res = await self.client.post(urllib.parse.quote("entities:fetchdata"), json={
                "entityIds": ids,
                "objects": objects,
                "inbound": {
                    "linkTypeIds": select_outbound,
                    "objects": objects,
                } if select_inbound else {},
                "outbound": {
                    "linkTypeIds": select_inbound,
                    "objects": objects,
                } if select_outbound else {},
            })
            res.raise_for_status()
            future.set_result(res)
        except Exception as e:
            future.set_exception(e)
            return

workers = [asyncio.create_task(worker()) for _ in range(WORKER_COUNT)]
chunk_producer = asyncio.create_task(queued_chunks())

try:
    while True:
        future = await completed_queue.get()
        if future is None:
            # all chunks have been processed!
            break
        res = await future
        for entity in res.json():
            yield entity
finally:
    for w in workers:
        w.cancel()
    await asyncio.wait(workers)
If you must limit how many chunks are queued (and not just how many are being processed concurrently), set maxsize on the chunk_queue queue (to a value greater than WORKER_COUNT). Use this to limit memory requirements, for example.
However, if you were to set maxsize to a value equal to WORKER_COUNT, you may as well get rid of the worker tasks altogether and instead put the body of the worker loop, as coroutines wrapped in tasks, into the completed results queue. The asyncio Task class is a subclass of Future, which automatically sets the future result when the coroutine it wraps completes. If you are not going to put more chunks into the chunk_queue than you have worker tasks, you may as well cut out the middleman and drop the chunk_queue altogether. The tasks then go into the completed queue instead of plain futures:
completed_queue = asyncio.Queue(maxsize=queue_size)

async def queued_chunks():
    for ids in chunks(entity_ids, chunk_size):
        task = asyncio.create_task(fetch_task(ids))
        await completed_queue.put(task)
    await completed_queue.put(None)

async def fetch_task(ids):
    res = await self.client.post(urllib.parse.quote("entities:fetchdata"),
        json={
            "entityIds": ids,
            "objects": objects,
            "inbound": {
                "linkTypeIds": select_outbound,
                "objects": objects,
            } if select_inbound else {},
            "outbound": {
                "linkTypeIds": select_inbound,
                "objects": objects,
            } if select_outbound else {},
        }
    )
    res.raise_for_status()
    return res

chunk_producer = asyncio.create_task(queued_chunks())

while True:
    task = await completed_queue.get()
    if task is None:
        # all chunks have been processed!
        break
    res = await task
    for entity in res.json():
        yield entity
This version is really close to what you had already; the only difference is that the await for the client POST coroutine and the check of the response status code are moved into a separate coroutine run as a task. You could also make the self.client.post() coroutine itself into the task (so not await on it) and leave checking the response status to the final queue-processing loop. That's what Pablo's answer proposes, so I won't repeat it here.
Note that this version starts the task before putting it in the queue. The queue is also not the only limit on the number of active tasks: there is an already-started task waiting for space at one end (the line await completed_queue.put(task) blocks if the queue is full), and another task already taken out by the queue consumer at the other (fetched by task = await completed_queue.get()). If you need to limit the number of active tasks, subtract 2 from the queue maxsize to set an upper limit.
Also, because tasks can complete in the meantime, there may be fewer active tasks than that limit at any given moment, yet you cannot start more until space has been freed in the queue. Because the first approach queues the inputs for tasks rather than started tasks, it doesn't have these issues. You could mitigate this problem by using a semaphore rather than a bounded queue size to limit tasks: acquire a slot before starting a task, and release it just before the task returns (a rough sketch follows).
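A rough sketch of that semaphore variant (my own illustration, reusing the fetch_task coroutine and names from the sketch above):
sem = asyncio.Semaphore(queue_size)
completed_queue = asyncio.Queue()

async def limited_fetch(ids):
    try:
        return await fetch_task(ids)
    finally:
        # release the slot just before the task finishes
        sem.release()

async def queued_chunks():
    for ids in chunks(entity_ids, chunk_size):
        await sem.acquire()  # wait for a free slot before starting another task
        await completed_queue.put(asyncio.create_task(limited_fetch(ids)))
    await completed_queue.put(None)
The consumer loop stays the same as above; the semaphore, not the queue size, now bounds how many requests are in flight.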
Personally I’d pick my first proposal as it gives you separate control over concurrency and chunk prefetching, without the issues the second approach has.
Instead of awaiting the coroutine before you enqueue it, wrap it in a task (so it starts running right away), enqueue the task, and await it later:
class MyClient:
    async def fetch_entities(
        self,
        entity_ids: List[int],
        objects: Optional[List[str]],
        select_inbound: Optional[List[str]] = None,
        select_outbound: Optional[List[str]] = None,
        queue_size: int = 5,
        chunk_size: int = 500,
    ):
        """
        Fetch entities in chunks

        While one chunk of data is being processed the next one can
        already be fetched. In other words: Data processing does not
        block data fetching.
        """
        objects = ",".join(objects)
        if select_inbound:
            select_inbound = ",".join(select_inbound)
        if select_outbound:
            select_outbound = ",".join(select_outbound)

        queue = asyncio.Queue(maxsize=queue_size)

        async def queued_chunks():
            for ids in chunks(entity_ids, chunk_size):
                cor = self.client.post(urllib.parse.quote("entities:fetchdata"), json={
                    "entityIds": ids,
                    "objects": objects,
                    "inbound": {
                        "linkTypeIds": select_outbound,
                        "objects": objects,
                    } if select_inbound else {},
                    "outbound": {
                        "linkTypeIds": select_inbound,
                        "objects": objects,
                    } if select_outbound else {},
                })
                task = asyncio.create_task(cor)
                await queue.put(task)
            await queue.put(None)

        asyncio.create_task(queued_chunks())

        while True:
            task = await queue.get()
            if task is None:
                break
            res = await task
            res.raise_for_status()
            queue.task_done()
            for entity in res.json():
                yield entity
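In either case, since fetch_entities is an async generator, the caller consumes it with async for. A usage sketch (the MyClient construction and argument values here are hypothetical):
import asyncio

async def main():
    client = MyClient()  # hypothetical: construct/configure your client however you normally do
    async for entity in client.fetch_entities(entity_ids=[1, 2, 3], objects=["name"]):
        print(entity)

asyncio.run(main())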

How do I download a large list of URLs in parallel in pyspark?

I have an RDD containing 10000 urls to be fetched.
url_list = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
urls = sc.parallelize(url_list)
I need to check which urls are broken and preferably fetch the results to a corresponding RDD in Python. I tried this:
import asyncio
import concurrent.futures
import requests

async def get(url):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
                executor,
                requests.get,
                i
            )
            for i in url
        ]
        return futures

async def get_response(futures):
    response = await asyncio.gather(futures, return_exceptions=True)
    return response

tasks = urls.map(lambda query: get(query))  # Method returns http call response as a Future[String]
results = tasks.map(lambda task: get_response(task))
results = results.map(lambda response: 'ERR' if isinstance(response, Exception) else 'OK')
results.collect()
I get the following output which obviously is not right:
['OK', 'OK', 'OK']
I also tried this:
import asyncio
import concurrent.futures
import requests

async def get():
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
                executor,
                requests.get,
                i
            )
            for i in urls.toLocalIterator()
        ]
        for response in await asyncio.gather(*futures, return_exceptions=True):
            print('{}: {}'.format(response, 'ERR' if isinstance(response, Exception) else 'OK'))
            pass

loop = asyncio.get_event_loop()
loop.run_until_complete(get())
I get the following output:
HTTPConnectionPool(host='SDFKHSKHGKLHSKLJHGSDFKSJH.com', port=80): Max retries exceeded with url: / (Caused by
NewConnectionError('<urllib3.connection.HTTPConnection object at 0x12c834210>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known')): ERR
<Response [200]>: OK
<Response [200]>: OK
Desired output would be something like this:
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com : ERR
http://google.com : OK
http://twitter.com : OK
But the problem with the second approach is that it uses a list to store the future objects. I believe that using an RDD is better, since the number of urls can be in the millions or billions and no single machine can handle it. Also, it is not clear to me how to retrieve the urls from the responses.
If you're using concurrent.futures, you don't need asyncio at all (it will bring you no benefits since you are running in multiple threads anyway). You can use concurrent.futures.wait() to wait for multiple futures in parallel.
I can't test your data, but it should work with code like this:
import concurrent.futures, requests

def get_one(url):
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text

def get_all():
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        futures = [executor.submit(get_one, url)
                   for url in urls.toLocalIterator()]
        # the end of the "with" block will automatically wait
        # for all of the executor's tasks to complete
        for fut in futures:
            if fut.exception() is not None:
                print('{}: {}'.format(fut.exception(), 'ERR'))
            else:
                print('{}: {}'.format(fut.result(), 'OK'))
To do the same thing with asyncio, you should use aiohttp instead.
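A rough sketch of that aiohttp route, run on the driver over the plain Python list from the question (my own illustration, not tested against Spark):
import asyncio
import aiohttp

async def check(session, url):
    """Return (url, 'OK') on success or (url, 'ERR') on any error."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            resp.raise_for_status()
            return url, 'OK'
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, 'ERR'

async def check_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(check(session, u) for u in urls))

url_list = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
for url, status in asyncio.run(check_all(url_list)):
    print(url, ':', status)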
You can try pyspark-asyncactions
The naming convention for the patched methods is methodNameAsync, for example:
RDD.count ⇒ RDD.countAsync
DataFrame.take ⇒ DataFrame.takeAsync
DataFrameWriter.save ⇒ DataFrameWriter.saveAsync
Usage
To patch existing classes just import the package:
>>> import asyncactions
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()
All *Async methods return concurrent.futures.Future:
>>> rdd = spark.sparkContext.range(100)
>>> f = rdd.countAsync()
>>> f
<Future at ... state=running>
>>> type(f)
concurrent.futures._base.Future
>>> f.add_done_callback(lambda f: print(f.result()))
100
