This may be a silly question, but I can't seem to run the Python google-cloud-bigquery client asynchronously.
My goal is to run multiple queries concurrently and wait for all of them to finish with asyncio.wait(). I'm using asyncio.create_task() to launch the queries.
The problem is that each query waits for the previous one to complete before starting.
Here is my query function (quite simple):
async def exec_query(self, query, **kwargs) -> bigquery.table.RowIterator:
    job = self.api.query(query, **kwargs)
    return job.result()
Since I cannot await job.result() should I await something else?
If you are working inside a coroutine and want to run several queries without blocking the event loop, you can use the run_in_executor function, which runs your queries in background threads without blocking the loop. Here's a good example of how to use that.
Make sure though that that's exactly what you need; jobs created to run queries in the Python API are already asynchronous and they only block when you call job.result(). This means that you don't need to use asyncio unless you are inside of a coroutine.
Here's a quick possible example of retrieving results as soon as the jobs are finished:
from concurrent.futures import ThreadPoolExecutor, as_completed
import google.cloud.bigquery as bq

client = bq.Client.from_service_account_json('path/to/key.json')
query1 = 'SELECT 1'
query2 = 'SELECT 2'
threads = []
results = []
executor = ThreadPoolExecutor(5)
for job in [client.query(query1), client.query(query2)]:
    threads.append(executor.submit(job.result))
# Here you can run any code you like. The interpreter is free
for future in as_completed(threads):
    results.append(list(future.result()))
results will be:
[[Row((2,), {'f0_': 0})], [Row((1,), {'f0_': 0})]]
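For the coroutine case mentioned at the start of this answer, here is a minimal sketch of run_in_executor combined with asyncio.gather. The blocking_query function is a hypothetical stand-in for a blocking call such as job.result(), so the snippet runs without BigQuery credentials; the names and delays are assumptions for illustration only.

```python
import asyncio
import time


def blocking_query(name: str, seconds: float) -> str:
    # Hypothetical stand-in for a blocking call such as job.result().
    time.sleep(seconds)
    return f"{name} done"


async def run_queries() -> list[str]:
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    futures = [
        loop.run_in_executor(None, blocking_query, "query1", 0.2),
        loop.run_in_executor(None, blocking_query, "query2", 0.2),
    ]
    # gather returns results in submission order, not completion order.
    return await asyncio.gather(*futures)


results = asyncio.run(run_queries())
print(results)
```

Both blocking calls run in worker threads at the same time, so the event loop stays free for other coroutines while they wait.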
Just to share a different solution:
import numpy as np
from time import sleep

query1 = """
SELECT
  language.name,
  AVG(language.bytes)
FROM `bigquery-public-data.github_repos.languages`
, UNNEST(language) AS language
GROUP BY language.name"""
query2 = 'SELECT 2'

def dummy_callback(future):
    global jobs_done
    jobs_done[future.job_id] = True

# bq is assumed here to be a bigquery.Client instance
jobs = [bq.query(query1), bq.query(query2)]
jobs_done = {job.job_id: False for job in jobs}
[job.add_done_callback(dummy_callback) for job in jobs]

# blocking loop to wait for jobs to finish
while not (np.all(list(jobs_done.values()))):
    print('waiting for jobs to finish ... sleeping for 1s')
    sleep(1)
print('all jobs done, do your stuff')
Rather than using as_completed, I prefer to use the built-in async functionality of the BigQuery jobs themselves. This also makes it possible for me to decompose the data pipeline into separate Cloud Functions, without having to keep the main ThreadPoolExecutor alive for the duration of the whole pipeline. Incidentally, this was the reason I was looking into this: my pipelines run longer than the maximum timeout of 9 minutes for Cloud Functions (or even 15 minutes for Cloud Run).
The downside is that I need to keep track of all the job_ids across the various functions, but that is relatively easy to solve when configuring the pipeline by specifying inputs and outputs such that they form a directed acyclic graph.
In fact, I found a way to wrap my query in an async call quite easily thanks to the asyncio.create_task() function.
I just needed to wrap job.result() in a coroutine; here is the implementation. It does run asynchronously now.
class BQApi(object):
    def __init__(self):
        self.api = bigquery.Client.from_service_account_json(BQ_CONFIG["credentials"])

    async def exec_query(self, query, **kwargs) -> bigquery.table.RowIterator:
        job = self.api.query(query, **kwargs)
        task = asyncio.create_task(self.coroutine_job(job))
        return await task

    @staticmethod
    async def coroutine_job(job):
        return job.result()
I used @dkapitan's answer to provide an async wrapper:
async def async_bigquery(client, query):
    done = False

    def callback(future):
        nonlocal done
        done = True

    job = client.query(query)
    job.add_done_callback(callback)
    while not done:
        await asyncio.sleep(.1)
    return job
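A possible variation on the wrapper above avoids the polling loop by letting the callback wake the coroutine through an asyncio.Event. This is only a sketch: FakeJob is a hypothetical stand-in for any object exposing add_done_callback(), and it assumes the callback fires from a background thread, which is why loop.call_soon_threadsafe is used rather than setting the event directly.

```python
import asyncio
import threading


class FakeJob:
    """Hypothetical stand-in for a job object exposing add_done_callback()."""

    def __init__(self, delay: float) -> None:
        self._delay = delay
        self.finished = False

    def add_done_callback(self, callback) -> None:
        def finish() -> None:
            self.finished = True
            callback(self)

        # Fire the callback from a background thread after `delay` seconds.
        threading.Timer(self._delay, finish).start()


async def wait_for_job(job: FakeJob) -> FakeJob:
    loop = asyncio.get_running_loop()
    done = asyncio.Event()
    # The callback runs in another thread, so hand the done.set() call
    # back to the event loop instead of touching the Event directly.
    job.add_done_callback(lambda _job: loop.call_soon_threadsafe(done.set))
    await done.wait()
    return job


finished_job = asyncio.run(wait_for_job(FakeJob(0.1)))
print(finished_job.finished)
```

The coroutine is suspended until the callback fires, so there is no fixed polling interval to tune.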
I created an API in Python and I want to start a long-running function, but I want to tell the user that my endpoint worked successfully and that the task has started executing.
I want to do this so that the user does not have to wait for the function to finish.
If it were represented in pseudocode, it would probably look like this:
async my_endpoint(context):
    func_name = context.func_name
    <something_validation_block>
    return 204 if all right
So, how can I do this within one function?
I tried something like:
async def handle(context):
    <validate_block>
    threading.Thread(
        target=long_func, args=(context,),
    ).start()
    return 204
Unfortunately, it does not work.
First, asyncio has a function named asyncio.to_thread (docs).
It provides a friendly way to combine async code with threads.
(Or you can run the task in a thread pool (docs).)
Then you can use asyncio.create_task(coro) to run an async function in the background.
It returns a Task object, which is awaitable; alternatively, use task.add_done_callback to handle the result.
import asyncio
import time

def block() -> str:
    print("block function start")
    time.sleep(1)
    print("block function done")
    return "result"

async def main() -> int:
    task = asyncio.get_running_loop().run_in_executor(None, block)
    task.add_done_callback(lambda task: print("task with result:", task.result()))
    print("return 204")
    return 204

asyncio.run(main())
block function start
return 204
block function done
task with result: result
NOTE: Save a reference to tasks, to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks. A task that isn’t referenced elsewhere may get garbage collected at any time, even before it’s done.
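One common way to follow the note above is to keep running tasks in a module-level set and drop each one when it finishes. The sketch below assumes hypothetical names (long_job, handler, background_tasks) and waits at the end only so that the demo can exit cleanly; a real server would keep its loop running instead.

```python
import asyncio

background_tasks: set[asyncio.Task] = set()
finished: list[str] = []


async def long_job(name: str) -> None:
    # Hypothetical stand-in for the long-running work.
    await asyncio.sleep(0.05)
    finished.append(name)


async def handler(name: str) -> int:
    task = asyncio.create_task(long_job(name))
    # Keep a strong reference until the task completes, then drop it.
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return 204  # respond immediately; the job keeps running


async def main() -> int:
    status = await handler("job-1")
    # Demo only: wait for outstanding jobs before the loop shuts down.
    await asyncio.gather(*background_tasks)
    return status


status = asyncio.run(main())
print(status, finished)
```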
Objective:
I am trying to scrape multiple URLs simultaneously. I don't want to make too many requests at the same time so I am using this solution to limit it.
Problem:
Requests are being made for ALL tasks instead of for a limited number at a time.
Stripped-down Code:
async def download_all_product_information():
    # TO LIMIT THE NUMBER OF CONCURRENT REQUESTS
    async def gather_with_concurrency(n, *tasks):
        semaphore = asyncio.Semaphore(n)

        async def sem_task(task):
            async with semaphore:
                return await task
        return await asyncio.gather(*(sem_task(task) for task in tasks))

    # FUNCTION TO ACTUALLY DOWNLOAD INFO
    async def get_product_information(url_to_append):
        url = 'https://www.amazon.com.br' + url_to_append
        print('Product Information - Page ' + str(current_page_number) + ' for category ' + str(
            category_index) + '/' + str(len(all_categories)) + ' in ' + gender)
        source = await get_source_code_or_content(url, should_render_javascript=True)
        time.sleep(random.uniform(2, 5))
        return source

    # LOOP WHERE STUFF GETS DONE
    for current_page_number in range(1, 401):
        for gender in os.listdir(base_folder):
            all_tasks = []
            # check all products in the current page
            all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
            for product_specific_url in all_products_in_current_page:
                current_task = asyncio.create_task(get_product_information(product_specific_url))
                all_tasks.append(current_task)
            await gather_with_concurrency(random.randrange(8, 15), *all_tasks)

async def main():
    await download_all_product_information()

# just to make sure there are not any problems caused by two event loops
if asyncio.get_event_loop().is_running():  # only patch if needed (i.e. running in Notebook, Spyder, etc)
    import nest_asyncio
    nest_asyncio.apply()

# for asynchronous functionality
if __name__ == '__main__':
    asyncio.run(main())
What am I doing wrong? Thanks!
What is wrong is this line:
current_task = asyncio.create_task(get_product_information(product_specific_url))
When you create a task, it is immediately scheduled for execution. As soon as your code yields execution to the asyncio loop (at any "await" expression), asyncio will run all of your tasks.
In the original snippet you pointed to, the semaphore guarded the creation of the tasks themselves, ensuring that only "n" tasks would be active at a time. What is passed in to gather_with_concurrency in that snippet are coroutines.
Coroutines, unlike tasks, are objects that are ready to be awaited but are not yet scheduled. They can be passed around freely, just like any other object; they will only be executed when they are either awaited or wrapped in a task (and then only once the code passes control to the asyncio loop).
In your code, you are creating the coroutine with the get_product_information call and immediately wrapping it in a task. At the await in the line that calls gather_with_concurrency, they are all run at once.
The fix is simple: do not create a task at this point, only inside the code guarded by your semaphore. Add just the raw coroutines to your list:
...
all_coroutines = []
# check all products in the current page
all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
for product_specific_url in all_products_in_current_page:
    current_coroutine = get_product_information(product_specific_url)
    all_coroutines.append(current_coroutine)
await gather_with_concurrency(random.randrange(8, 15), *all_coroutines)
There is still an unrelated bug in this code that will defeat concurrency: you are making a synchronous call to time.sleep inside get_product_information. This stalls the asyncio loop at that point until the sleep is over. The correct thing to do is to use await asyncio.sleep(...).
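To make both points concrete, here is a small self-contained sketch: raw coroutines are passed to the semaphore-guarded gather, and the pause uses asyncio.sleep. The fetch function and the active/peak counters are hypothetical, added only to observe how many coroutines run at once.

```python
import asyncio

active = 0
peak = 0


async def fetch(i: int) -> int:
    # Hypothetical stand-in for a download; tracks how many run at once.
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)  # non-blocking pause, unlike time.sleep
    active -= 1
    return i


async def gather_with_concurrency(n: int, *coros):
    semaphore = asyncio.Semaphore(n)

    async def sem_task(coro):
        # The wrapped coroutine only starts once a semaphore slot is free.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(sem_task(c) for c in coros))


async def main() -> list[int]:
    coros = [fetch(i) for i in range(10)]  # coroutines, not tasks
    return await gather_with_concurrency(3, *coros)


results = asyncio.run(main())
print(results, peak)
```

With a limit of 3, the peak counter never exceeds 3. Wrapping each fetch(i) in asyncio.create_task before passing it in would start all ten right away, which is the behaviour described in the question.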
A question on asyncio. I have this working, but I'm not sure whether it's the correct way or whether there is an easier way.
The short version of what I am trying to do is to continuously execute run() 10 times concurrently.
To do this I had to create a function work_it() with a while True loop.
The run() function takes about 5 minutes to complete: database calls, processing, aiohttp requests, etc.
Is this the best way to do this, or is there another way to have asyncio continuously run a function over and over again with 10 concurrent workers?
Also, is asyncio.gather the correct function to use? Am I better off using an executor?
Thanks in advance.
Erik
db = Database()
conn = db.connect()

async def run(worker_id=None):
    """
    Using Shared Database Connection
    Create an object. Query the database, process the data, and do a http post with aiohttp
    Returns: True/False based on the http post
    """
    # my_object = Object_Model(db)
    # await do_sql_queries
    # await process_data
    # Lots of processing
    # result = await aiohttp_requests
    nap_time = random.randint(1,5)
    print(f'Worker-{worker_id} sleeping for {nap_time}')
    await asyncio.sleep(nap_time)
    return True

async def work_it(worker_id=None):
    """
    This worker should run forever
    """
    while True:
        start = time.monotonic()
        result = await run(worker_id)
        duration = time.monotonic() - start
        print(f'Worker-{worker_id} ran for {duration:.6f} seconds')

async def main():
    """
    Start 10 "workers"
    """
    workers = 10
    tasks = []
    for worker_id in range(1, workers+1):
        print(f'Building Task {worker_id}')
        tasks.append(work_it(worker_id))
    print(f'Await Gather')
    await asyncio.gather(*tasks)

asyncio.run(main())
What do I mean by "deterministic time"? For example, AWS offers a service called AWS Lambda. A process started as a Lambda function has a time limit; once it is exceeded, the function stops executing and the task is treated as having finished with an error. An example task: sending data to an HTTP endpoint. Depending on the network connection to the endpoint and other factors, sending the data can take a long time. If I need to send the same data to many endpoints, the total time will be the time for one request multiplied by the number of endpoints, which increases the chance that the Lambda function will be stopped before all the data has been sent to all endpoints.
To solve this, I need to send the data to the different endpoints in parallel using threads.
The problem with threads is that a started thread can't be stopped. If an HTTP request takes longer than the Lambda time limit allows, the Lambda function will be aborted and return an error. So I need to use a timeout on the HTTP request, to abort it if it takes longer than expected.
If an HTTP request is cancelled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so that it is not lost. The time needed to save unprocessed data can be predicted, because I control the storage where the data will be saved.
The last part that consumes time is the loop where threads are scheduled with executor.submit(). If there is only one endpoint, or a small number of them, the time consumed will be small and there is no need to control it. But if I am dealing with many endpoints, I have to take this into account.
So basically the total time consists of:
scheduling threads
http request execution
saving unprocessed data
Here is an example of how I can manage time using threads:
import concurrent.futures
from functools import partial
import requests
import time

start = time.time()

def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}

def get_data(n):
    return {"wait": n}

def done_cb(a, b, future):
    pass  # save unprocessed data

def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, *[{"wait": 10}])
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        # measure against the module-level start time
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass
I was thinking about how I could use the asyncio library instead of threads to do the same thing.
import asyncio
import time
from functools import partial
import requests

start = time.time()

def send_data(data):
    ...

def get_data(n):
    return {"wait": n}

def done_callback(a,b, future):
    pass  # save unprocessed data

def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = (loop.run_in_executor(None, send_data, event_data))
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        # measure against the time scheduling began
        if time.time() - start_appending > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )

_loop = asyncio.get_event_loop()
result = main(_loop)
The function send_data() is the same as in the previous code snippet.
Because the requests library is not async code, I use run_in_executor() to create a future object. The main problem I have is that done_callback() is not executed when the thread started by the executor finishes its job, but only when the futures are processed by the asyncio.wait() expression.
Basically, I am looking for a way to start executing an asyncio future the way ThreadPoolExecutor starts executing threads, without waiting for the asyncio.wait() expression to call done_callback(). If you have other ideas on how to write Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
Another question: if a thread or future finishes its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I don't have a result, and I have to use functools.partial() to pass additional data into done_callback that helps me understand which data the callback was called for. If the passed data is small, this is not a problem. If the data is big, I need to put it in an array/list/dictionary and pass only the index to the callback, or put the full data in the callback.
Can I somehow get access to the variable that was passed to the future/thread from a done_callback() that was triggered on a cancelled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
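As a rough illustration of the wait_for plus gather approach, the sketch below uses asyncio.sleep as a stand-in for a real async HTTP call (with aiohttp the shape would be similar). The send coroutine, the endpoint names and the cancelled list are hypothetical; the list shows one way a cancelled coroutine can record which data still needs saving.

```python
import asyncio


async def send(endpoint: str, delay: float, cancelled: list[str]) -> str:
    try:
        # Stand-in for an async HTTP request taking `delay` seconds.
        await asyncio.sleep(delay)
        return f"{endpoint}: ok"
    except asyncio.CancelledError:
        # Record which request was cut short so its data can be saved.
        cancelled.append(endpoint)
        raise


async def main() -> tuple[bool, list[str]]:
    cancelled: list[str] = []
    batch = asyncio.gather(
        send("fast", 0.01, cancelled),
        send("slow", 5.0, cancelled),
    )
    try:
        # On timeout, wait_for cancels the gather, which in turn
        # cancels every request that has not finished yet.
        await asyncio.wait_for(batch, timeout=0.2)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return timed_out, cancelled


timed_out, cancelled = asyncio.run(main())
print(timed_out, cancelled)
```

Because each coroutine still has its own arguments in scope when CancelledError arrives, there is no need to pass extra data through functools.partial to find out what was left unprocessed.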
I have a python app written in the Tornado Asynchronous framework. When an HTTP request comes in, this method gets called:
@classmethod
def my_method(cls, my_arg1):
    # Do some Database Transaction #1
    x = get_val_from_db_table1(id=1, 'x')
    y = get_val_from_db_table2(id=7, 'y')
    x += x + (2 * y)
    # Do some Database Transaction #2
    set_val_in_db_table1(id=1, 'x', x)
    return True
The three database operations are interrelated. This is a concurrent application, so multiple such HTTP calls can happen concurrently and hit the same DB.
For data-integrity purposes, it's important that the three database operations in this method all run without another process reading or writing those database rows in between.
How can I make sure this method has database atomicity? Does Tornado have a decorator for this?
Synchronous database access
You haven't stated how you access your database. If, as is likely, you have synchronous DB access in get_val_from_db_table1 and friends (e.g. with pymysql) and my_method is blocking (doesn't return control to the IO loop), then you block your server (which hurts its performance and responsiveness) but effectively serialise your clients, so only one can execute my_method at a time. So in terms of data consistency you don't need to do anything, but generally it's bad design. You can solve both in the short term with @xyres's solution (at the cost of keeping thread-safety concerns in mind, because most of Tornado's functionality isn't thread-safe).
Asynchronous database access
If you have asynchronous DB access in get_val_from_db_table1 and friends (e.g. with tornado-mysql) then you can use tornado.locks.Lock. Here's an example:
from tornado import web, gen, locks, ioloop

_lock = locks.Lock()

def synchronised(coro):
    async def wrapper(*args, **kwargs):
        async with _lock:
            return await coro(*args, **kwargs)
    return wrapper

class MainHandler(web.RequestHandler):
    async def get(self):
        result = await self.my_method('foo')
        self.write(result)

    @classmethod
    @synchronised
    async def my_method(cls, arg):
        # db access
        await gen.sleep(0.5)
        return 'data set for {}'.format(arg)

if __name__ == '__main__':
    app = web.Application([('/', MainHandler)])
    app.listen(8080)
    ioloop.IOLoop.current().start()
Note that the above applies to a normal single-process Tornado application. If you use tornado.process.fork_processes, then you can only go with multiprocessing.Lock.
Since you want to run those three db operations one right after the other, the function my_method must be non-asynchronous.
But this would also mean that my_method will block the server. You definitely don't want that. One way I can think of is to run this function in another thread. This won't block the server, which will keep accepting new requests while the operations are running. And since it's going to be non-async, db atomicity is guaranteed.
Here's the relevant code to get you started:
import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
# Don't set `max_workers` more than 1, because then multiple
# threads will be able to perform db operations

class MyHandler(...):
    @gen.coroutine
    def get(self):
        yield executor.submit(MyHandler.my_method, my_arg1)
        # above, `yield` is used to wait for
        # db operations to finish
        # if you don't want to wait and return
        # a response immediately remove the
        # `yield` keyword
        self.write('Done')

    @classmethod
    def my_method(cls, my_arg1):
        # do db stuff ...
        return True