Objective:
I am trying to scrape multiple URLs simultaneously. I don't want to make too many requests at the same time so I am using this solution to limit it.
Problem:
Requests are being made for ALL tasks instead of for a limited number at a time.
Stripped-down Code:
async def download_all_product_information():
    # TO LIMIT THE NUMBER OF CONCURRENT REQUESTS
    async def gather_with_concurrency(n, *tasks):
        semaphore = asyncio.Semaphore(n)

        async def sem_task(task):
            async with semaphore:
                return await task

        return await asyncio.gather(*(sem_task(task) for task in tasks))

    # FUNCTION TO ACTUALLY DOWNLOAD INFO
    async def get_product_information(url_to_append):
        url = 'https://www.amazon.com.br' + url_to_append
        print('Product Information - Page ' + str(current_page_number) + ' for category ' + str(
            category_index) + '/' + str(len(all_categories)) + ' in ' + gender)
        source = await get_source_code_or_content(url, should_render_javascript=True)
        time.sleep(random.uniform(2, 5))
        return source

    # LOOP WHERE STUFF GETS DONE
    for current_page_number in range(1, 401):
        for gender in os.listdir(base_folder):
            all_tasks = []
            # check all products in the current page
            all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
            for product_specific_url in all_products_in_current_page:
                current_task = asyncio.create_task(get_product_information(product_specific_url))
                all_tasks.append(current_task)
            await gather_with_concurrency(random.randrange(8, 15), *all_tasks)


async def main():
    await download_all_product_information()


# just to make sure there are not any problems caused by two event loops
if asyncio.get_event_loop().is_running():  # only patch if needed (i.e. running in Notebook, Spyder, etc.)
    import nest_asyncio
    nest_asyncio.apply()

# for asynchronous functionality
if __name__ == '__main__':
    asyncio.run(main())
What am I doing wrong? Thanks!
What is wrong is this line:
current_task = asyncio.create_task(get_product_information(product_specific_url))
When you create a "task" it is immediately scheduled for execution. As soon as your code yields execution to the asyncio loop (at any "await" expression), asyncio will loop through and execute all of your tasks.
The semaphore in the original snippet you pointed to guarded the creation of the tasks itself, ensuring only "n" tasks would be active at a time. What is passed in to gather_with_concurrency in that snippet are coroutines.
Coroutines, unlike tasks, are objects that are ready to be awaited but are not yet scheduled. They can be passed around freely, just like any other object; they will only be executed when they are either awaited or wrapped by a task (and then only when the code passes control to the asyncio loop).
In your code, you are creating the coroutine with the get_product_information call and immediately wrapping it in a task. By the time the await on the line that calls gather_with_concurrency runs, they have all already been started.
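To make the difference concrete, here is a minimal, self-contained sketch (not part of the question's code) showing that a task starts running as soon as the loop gets control, while a bare coroutine object does nothing until it is awaited:
import asyncio

async def job(name):
    print(name, 'started')
    await asyncio.sleep(0.1)

async def main():
    t = asyncio.create_task(job('task'))  # scheduled right away
    c = job('coroutine')                  # just an object, not scheduled yet
    await asyncio.sleep(0)                # yield to the loop: 'task started' prints here
    print('only now do we await the coroutine')
    await c                               # 'coroutine started' prints only here
    await t

asyncio.run(main())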
The fix is simple: do not create a task at this point, only inside the code guarded by your semaphore. Add just the raw coroutines to your list:
...
all_coroutines = []
# check all products in the current page
all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
for product_specific_url in all_products_in_current_page:
    current_coroutine = get_product_information(product_specific_url)
    all_coroutines.append(current_coroutine)
await gather_with_concurrency(random.randrange(8, 15), *all_coroutines)
There is still an unrelated problem in this code that will defeat the concurrency: you are making a synchronous call to time.sleep inside get_product_information. This stalls the whole asyncio loop at that point until the sleep is over. The correct thing to do is to use await asyncio.sleep(...).
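Applied to the download helper from the question (a sketch reusing the question's own imports and its get_source_code_or_content helper), that change looks like:
async def get_product_information(url_to_append):
    url = 'https://www.amazon.com.br' + url_to_append
    source = await get_source_code_or_content(url, should_render_javascript=True)
    await asyncio.sleep(random.uniform(2, 5))  # non-blocking pause; other tasks keep running
    return source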
I created an API in Python and I want to start a long-running function from an endpoint, but I also want to tell the user right away that the endpoint worked successfully and that the task has started executing.
I want this so that the user does not have to wait for the function to finish executing.
If it were represented in pseudocode, it would probably look like this:
async my_endpoint(context):
    func_name = context.func_name
    <something_validation_block>
    return 204 if all right
So, how can this be done in one function?
I tried something like:
async def handle(context):
    <validate_block>
    threading.Thread(
        target=logn_func, args=(context,),
    ).start()
    return 204
But unfortunately it does not work :(
First, asyncio has a function named asyncio.to_thread (see the docs).
It provides a friendly way to combine async code and threads.
(Or you can run the task in a thread pool; see the executor docs.)
Then you can use asyncio.create_task(coro) to run an async function in the background.
It will return a Task object, which is awaitable; or use task.add_done_callback to handle the result.
import asyncio
import time

def block() -> str:
    print("block function start")
    time.sleep(1)
    print("block function done")
    return "result"

async def main() -> int:
    task = asyncio.get_running_loop().run_in_executor(None, block)
    task.add_done_callback(lambda task: print("task with result:", task.result()))
    print("return 204")
    return 204

asyncio.run(main())
block function start
return 204
block function done
task with result: result
NOTE: Save a reference to tasks, to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks. A task that isn’t referenced elsewhere may get garbage collected at any time, even before it’s done.
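As a rough sketch of the asyncio.to_thread variant mentioned above (Python 3.9+; long_func, handle, and context mirror the question's pseudocode and are only illustrative), while keeping a reference to the task as the note recommends:
import asyncio
import time

def long_func(context) -> str:
    time.sleep(1)            # stands in for the slow, blocking work
    return "done"

background_tasks = set()     # strong references so tasks aren't garbage collected mid-run

async def handle(context) -> int:
    # <validate_block> would go here
    task = asyncio.create_task(asyncio.to_thread(long_func, context))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return 204               # respond immediately; long_func keeps running in a worker thread

async def demo():
    status = await handle({"func_name": "long_func"})
    print("endpoint returned", status)
    await asyncio.gather(*background_tasks)  # a real server's loop would just keep running

asyncio.run(demo())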
Question on asyncio. I have this working, just not sure if it's the correct way or if there is an easier way.
The short version of what I am trying to do is to continuously execute run() 10x concurrently.
To do this I had to create a function work_it() with a while True loop.
The run() function takes about 5 minutes to complete: database calls, processing, aiohttp requests, etc.
Is this the best way to do this, or is there another way to have asyncio continuously run a function over and over again with 10 concurrent workers?
Also, is asyncio.gather the correct function to use? Am I better off using an executor?
Thanks in advance.
Erik
db = Database()
conn = db.connect()

async def run(worker_id=None):
    """
    Using shared database connection.
    Create an object, query the database, process the data, and do an HTTP post with aiohttp.
    Returns: True/False based on the HTTP post
    """
    # my_object = Object_Model(db)
    # await do_sql_queries
    # await process_data
    # Lots of processing
    # result = await aiohttp_requests
    nap_time = random.randint(1, 5)
    print(f'Worker-{worker_id} sleeping for {nap_time}')
    await asyncio.sleep(nap_time)
    return True

async def work_it(worker_id=None):
    """
    This worker should run forever
    """
    while True:
        start = time.monotonic()
        result = await run(worker_id)
        duration = time.monotonic() - start
        print(f'Worker-{worker_id} ran for {duration:.6f} seconds')

async def main():
    """
    Start 10 "workers"
    """
    workers = 10
    tasks = []
    for worker_id in range(1, workers + 1):
        print(f'Building Task {worker_id}')
        tasks.append(work_it(worker_id))
    print(f'Await Gather')
    await asyncio.gather(*tasks)

asyncio.run(main())
This may be a dumb question, but I cannot seem to be able to run the Python google-cloud-bigquery client asynchronously.
My goal is to run multiple queries concurrently and wait for all of them to finish in an asyncio.wait() query gatherer. I'm using asyncio.create_task() to launch the queries.
The problem is that each query waits for the precedent one to complete before starting.
Here is my query function (quite simple):
async def exec_query(self, query, **kwargs) -> bigquery.table.RowIterator:
    job = self.api.query(query, **kwargs)
    return job.result()
Since I cannot await job.result(), should I await something else?
If you are working inside of a coroutine and want to run different queries without blocking the event loop, you can use run_in_executor, which basically runs your queries in background threads without blocking the loop. Here's a good example of how to use that.
Make sure, though, that that's exactly what you need: jobs created to run queries in the Python API are already asynchronous, and they only block when you call job.result(). This means that you don't need asyncio unless you are already inside of a coroutine.
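If you do need the coroutine route, a minimal sketch of the run_in_executor approach could look like this (the key path and queries are placeholders):
import asyncio
import google.cloud.bigquery as bq

client = bq.Client.from_service_account_json('path/to/key.json')

async def exec_query(query):
    loop = asyncio.get_running_loop()
    job = client.query(query)                            # starts the job without blocking
    return await loop.run_in_executor(None, job.result)  # wait for rows in a worker thread

async def main():
    iterators = await asyncio.gather(exec_query('SELECT 1'), exec_query('SELECT 2'))
    print([list(rows) for rows in iterators])

asyncio.run(main())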
Here's a quick possible example of retrieving results as soon as the jobs are finished:
from concurrent.futures import ThreadPoolExecutor, as_completed
import google.cloud.bigquery as bq

client = bq.Client.from_service_account_json('path/to/key.json')
query1 = 'SELECT 1'
query2 = 'SELECT 2'

threads = []
results = []
executor = ThreadPoolExecutor(5)

for job in [client.query(query1), client.query(query2)]:
    threads.append(executor.submit(job.result))

# Here you can run any code you like. The interpreter is free

for future in as_completed(threads):
    results.append(list(future.result()))
results will be:
[[Row((2,), {'f0_': 0})], [Row((1,), {'f0_': 0})]]
Just to share a different solution:
import numpy as np
from time import sleep
from google.cloud import bigquery

bq = bigquery.Client()  # or Client.from_service_account_json(...), as in the snippet above

query1 = """
SELECT
  language.name,
  AVG(language.bytes)
FROM `bigquery-public-data.github_repos.languages`
, UNNEST(language) AS language
GROUP BY language.name"""
query2 = 'SELECT 2'

def dummy_callback(future):
    global jobs_done
    jobs_done[future.job_id] = True

jobs = [bq.query(query1), bq.query(query2)]
jobs_done = {job.job_id: False for job in jobs}
[job.add_done_callback(dummy_callback) for job in jobs]

# blocking loop to wait for jobs to finish
while not (np.all(list(jobs_done.values()))):
    print('waiting for jobs to finish ... sleeping for 1s')
    sleep(1)

print('all jobs done, do your stuff')
Rather than using as_completed I prefer to use the built-in async functionality from the bigquery jobs themselves. This also makes it possible for me to decompose the datapipeline into separate Cloud Functions, without having to keep the main ThreadPoolExecutor live for the duration of the whole pipeline. Incidentally, this was the reason why I was looking into this: my pipelines are longer than the max timeout of 9 minutes for Cloud Functions (or even 15 minutes for Cloud Run).
Downside is I need to keep track of all the job_ids across the various functions, but that is relatively easy to solve when configuring the pipeline by specifying inputs and outputs such that they form a directed acyclic graph.
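As a rough sketch of that hand-off (check_job is a hypothetical helper; it assumes the job_id and the job's location are passed along from the step that started the query):
from google.cloud import bigquery

client = bigquery.Client()

def check_job(job_id, location='US'):
    """Re-attach to a previously started query job by its id and report on it."""
    job = client.get_job(job_id, location=location)  # look the job up again
    if job.done():
        return list(job.result())                    # finished: fetch the rows
    return None                                      # still running: check again later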
In fact, I found a way to wrap my query in an async call quite easily thanks to the asyncio.create_task() function.
I just needed to wrap the job.result() in a coroutine; here is the implementation. It does run asynchronously now.
class BQApi(object):

    def __init__(self):
        self.api = bigquery.Client.from_service_account_json(BQ_CONFIG["credentials"])

    async def exec_query(self, query, **kwargs) -> bigquery.table.RowIterator:
        job = self.api.query(query, **kwargs)
        task = asyncio.create_task(self.coroutine_job(job))
        return await task

    @staticmethod
    async def coroutine_job(job):
        return job.result()
I used @dkapitan's answer to provide an async wrapper:
async def async_bigquery(client, query):
    done = False

    def callback(future):
        nonlocal done
        done = True

    job = client.query(query)
    job.add_done_callback(callback)
    while not done:
        await asyncio.sleep(.1)
    return job
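A possible way to use that wrapper (the client setup mirrors the earlier snippets; the key path is a placeholder):
import asyncio
from google.cloud import bigquery

async def main():
    client = bigquery.Client.from_service_account_json('path/to/key.json')
    jobs = await asyncio.gather(
        async_bigquery(client, 'SELECT 1'),
        async_bigquery(client, 'SELECT 2'),
    )
    for job in jobs:
        print(list(job.result()))  # the jobs are already done, so this no longer blocks for long

asyncio.run(main())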
I am working on a big data problem and am stuck with some concurrency and async io issues. The problem is as follows:
1) I have multiple huge files (~4 GB each, up to 15 of them) which I am processing using ProcessPoolExecutor from the concurrent.futures module this way:
def process(source):
    files = os.listdir(source)
    with ProcessPoolExecutor() as executor:
        future_to_url = {executor.submit(process_individual_file, source, input_file): input_file for input_file in files}
        for future in as_completed(future_to_url):
            data = future.result()
2) Now in each file, I want to go line by line, process each line to create a particular JSON, group 2K such JSONs together, and hit an API with that request to get a response. Here is the code:
def process_individual_file(source, input_file):
    limit = 2000
    json_array = []
    with open(source + input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                response = requests.post(API_URL, json=json_array)
                # check response status here
                json_array = []  # start a fresh batch
                limit = 2000
3) Now the problem: the number of lines in each file is really large, and with that API call blocking and a bit slow to respond, the program takes a huge amount of time to complete.
4) What I want to achieve is to make that API call asynchronous so that I can keep processing the next batch of 2000 while the API call is happening.
5) Things I have tried so far: I was trying to implement this using asyncio, but there we need to collect the set of future tasks and wait for completion using the event loop. Something like this:
async def process_individual_file(source, input_file):
    tasks = []
    json_array = []
    limit = 2000
    with open(source + input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                tasks.append(asyncio.ensure_future(call_api(json_array)))
                json_array = []  # start a fresh batch
                limit = 2000
    await asyncio.wait(tasks)

ioloop = asyncio.get_event_loop()
ioloop.run_until_complete(process_individual_file(source, input_file))
ioloop.close()
6) I really don't understand this, because it is indirectly the same as before: it waits to collect all tasks before launching them. Can someone help me with the correct architecture for this problem? How can I call the API asynchronously, without collecting all the tasks, and with the ability to process the next batch in parallel?
I really don't understand this, because it is indirectly the same as before: it waits to collect all tasks before launching them.
No, you're wrong here. When you create an asyncio.Task with asyncio.ensure_future, it starts executing the call_api coroutine immediately. This is how tasks in asyncio work:
import asyncio

async def test(i):
    print(f'{i} started')
    await asyncio.sleep(i)

async def main():
    tasks = [
        asyncio.ensure_future(test(i))
        for i
        in range(3)
    ]

    await asyncio.sleep(0)
    print('At this moment tasks are already started')

    await asyncio.wait(tasks)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
Output:
0 started
1 started
2 started
At this moment tasks are already started
The problem with your approach is that process_individual_file is not actually asynchronous: it does a large amount of CPU-bound work without returning control to your asyncio event loop. That is a problem: the function blocks the event loop, making it impossible for the tasks to be executed.
A very simple but effective solution is to return control to the event loop manually, using await asyncio.sleep(0), every so often while process_individual_file executes, for example on reading each line:
async def process_individual_file(source, input_file):
    tasks = []
    json_array = []
    limit = 2000
    with open(source + input_file) as sf:
        for line in sf:
            await asyncio.sleep(0)  # Return control to the event loop so it can execute tasks
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                tasks.append(asyncio.ensure_future(call_api(json_array)))
                json_array = []  # start a fresh batch
                limit = 2000
    await asyncio.wait(tasks)
Upd:
there will be more than a million requests to be done, and hence I am feeling uncomfortable storing future objects for all of them in a list
That makes sense. Nothing good will happen if you run a million parallel network requests. The usual way to set a limit in this case is to use a synchronization primitive like asyncio.Semaphore.
I advise you to write a generator that yields json_array batches from the file, and to acquire the Semaphore before adding a new task and release it when the task is done. You will get clean code protected from too many tasks running in parallel.
It will look something like this:
def get_json_array(input_file):
    json_array = []
    limit = 2000
    with open(input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                yield json_array  # the generator splits the file-reading logic from adding tasks
                json_array = []
                limit = 2000


sem = asyncio.Semaphore(50)  # don't allow more than 50 parallel requests

async def process_individual_file(input_file):
    for json_array in get_json_array(input_file):
        await sem.acquire()  # file reading won't resume until there's room for newer tasks
        task = asyncio.ensure_future(call_api(json_array))
        task.add_done_callback(lambda t: sem.release())  # on task done, free a place for the next tasks
        task.add_done_callback(lambda t: print(t.result()))  # print the result when a call_api finishes
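One thing worth noting about the sketch above: process_individual_file returns as soon as the last batch has been scheduled, without awaiting the still-running tasks. A possible variant that keeps references and waits for them at the end (same hypothetical names as above) is:
async def process_individual_file(input_file):
    tasks = []
    for json_array in get_json_array(input_file):
        await sem.acquire()                              # wait for a free slot
        task = asyncio.ensure_future(call_api(json_array))
        task.add_done_callback(lambda t: sem.release())  # free the slot when the request is done
        tasks.append(task)
    if tasks:
        await asyncio.wait(tasks)                        # don't return until the last batch finishes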
I've been trying to make a bot in Slack that remains responsive even if it hasn't finished processing earlier commands, so it could go and do something that takes some time without locking up. It should return whatever is finished first.
I think I'm getting part of the way there: it now doesn't ignore stuff that's typed in before an earlier command is finished running. But it still doesn't allow threads to "overtake" each other - a command called first will return first, even if it takes much longer to complete.
import asyncio
from slackclient import SlackClient
import time, datetime as dt

token = "my token"
sc = SlackClient(token)

@asyncio.coroutine
def sayHello(waitPeriod=5):
    yield from asyncio.sleep(waitPeriod)
    msg = 'Hello! I waited {} seconds.'.format(waitPeriod)
    return msg

@asyncio.coroutine
def listen():
    yield from asyncio.sleep(1)
    x = sc.rtm_connect()
    info = sc.rtm_read()
    if len(info) == 1:
        if r'/hello' in info[0]['text']:
            print(info)
            try:
                waitPeriod = int(info[0]['text'][6:])
            except:
                print('Can not read a time period. Using 5 seconds.')
                waitPeriod = 5
            msg = yield from sayHello(waitPeriod=waitPeriod)
            print(msg)
            chan = info[0]['channel']
            sc.rtm_send_message(chan, msg)
    asyncio.async(listen())

def main():
    print('here we go')
    loop = asyncio.get_event_loop()
    asyncio.async(listen())
    loop.run_forever()

if __name__ == '__main__':
    main()
When I type /hello 12 and /hello 2 into the Slack chat window, the bot does respond to both commands now. However it doesn't process the /hello 2 command until it's finished doing the /hello 12 command. My understanding of asyncio is a work in progress, so it's quite possible I'm making a very basic error. I was told in a previous question that things like sc.rtm_read() are blocking functions. Is that the root of my problem?
Thanks a lot,
Alex
What is happening is your listen() coroutine is blocking at the yield from sayHello() statement. Only once sayHello() completes will listen() be able to continue on its merry way. The crux is that the yield from statement (or await from Python 3.5+) is blocking. It chains the two coroutines together and the 'parent' coroutine can't complete until the linked 'child' coroutine completes. (However, 'neighbouring' coroutines that aren't part of the same linked chain are free to proceed in the meantime).
The simple way to release sayHello() without holding up listen() in this case is to use listen() as a dedicated listening coroutine and to offload all subsequent actions into their own Task wrappers instead, thus not hindering listen() from responding to subsequent incoming messages. Something along these lines.
@asyncio.coroutine
def sayHello(waitPeriod, sc, chan):
    yield from asyncio.sleep(waitPeriod)
    msg = 'Hello! I waited {} seconds.'.format(waitPeriod)
    print(msg)
    sc.rtm_send_message(chan, msg)

@asyncio.coroutine
def listen():
    # connect once only if possible:
    x = sc.rtm_connect()
    # use a while True block instead of repeatedly calling a new Task at the end
    while True:
        yield from asyncio.sleep(0)  # use 0 unless you need to wait a full second?
        # x = sc.rtm_connect()  # probably not necessary to reconnect each loop?
        info = sc.rtm_read()
        if len(info) == 1:
            if r'/hello' in info[0]['text']:
                print(info)
                try:
                    waitPeriod = int(info[0]['text'][6:])
                except:
                    print('Can not read a time period. Using 5 seconds.')
                    waitPeriod = 5
                chan = info[0]['channel']
                asyncio.async(sayHello(waitPeriod, sc, chan))