Fetching requests in queue concurrently - python

I have written code that allows me to start fetching the next chunk of data from an API while the previous chunk of data is being processed.
I'd like it to always be fetching up to 5 chunks concurrently at any given moment, but the returned data should still be processed in the correct order, even if a request later in the queue completes before an earlier one.
How can my code be changed to make this happen?
class MyClient:
    async def fetch_entities(
            self,
            entity_ids: List[int],
            objects: Optional[List[str]],
            select_inbound: Optional[List[str]] = None,
            select_outbound: Optional[List[str]] = None,
            queue_size: int = 5,
            chunk_size: int = 500,
    ):
        """
        Fetch entities in chunks

        While one chunk of data is being processed the next one can
        already be fetched. In other words: Data processing does not
        block data fetching.
        """
        objects = ",".join(objects)
        if select_inbound:
            select_inbound = ",".join(select_inbound)
        if select_outbound:
            select_outbound = ",".join(select_outbound)
        queue = asyncio.Queue(maxsize=queue_size)

        # TODO: I want to be able to fill the queue with requests that are already executing
        async def queued_chunks():
            for ids in chunks(entity_ids, chunk_size):
                res = await self.client.post(urllib.parse.quote("entities:fetchdata"), json={
                    "entityIds": ids,
                    "objects": objects,
                    "inbound": {
                        "linkTypeIds": select_outbound,
                        "objects": objects,
                    } if select_inbound else {},
                    "outbound": {
                        "linkTypeIds": select_inbound,
                        "objects": objects,
                    } if select_outbound else {},
                })
                await queue.put(res)
            await queue.put(None)

        asyncio.create_task(queued_chunks())

        while True:
            res = await queue.get()
            if res is None:
                break
            res.raise_for_status()
            queue.task_done()
            for entity in res.json():
                yield entity

I’d use two queues here: one with the chunks to process, and one for the chunks that are complete. You can have any number of worker tasks to process chunks, and you can put a size limit on the first queue to limit how many chunks you prefetch. Use just a single loop to receive the processed chunks, to ensure they are kept ordered (your code already does this).
The trick is to put futures into both queues, one for every chunk to be processed. The worker tasks that do the processing fetch a (chunk, future) pair from the first queue and resolve the associated future by setting the POST response as its result. The loop that handles the processed chunks awaits on each future from the second queue, so it only proceeds to the next chunk when the current one has been fully processed. For this to work you put both the chunk and the corresponding future into the first queue for the workers to consume, and the same future into the second queue; this enforces that chunk results are handled in order.
So, in summary:
Have two queues:
chunks holds (chunk, future) objects.
completed holds futures, the same futures that were paired with the chunks in the other queue.
Create “worker” tasks that consume from the chunks queue. If you create 5, then 5 chunks will be processed in parallel. Every time a worker has completed processing a chunk, it sets the result on the corresponding future.
Use a “processed chunks” loop; it takes the next future from the completed queue and awaits on it. Only when the specific chunk associated with that future has been completed will it produce the result (set by a worker task).
As a rough sketch, it’d look something like this:
chunk_queue = asyncio.Queue()
completed_queue = asyncio.Queue()
WORKER_COUNT = queue_size

async def queued_chunks():
    for ids in chunks(entity_ids, chunk_size):
        future = asyncio.Future()
        await chunk_queue.put((ids, future))
        await completed_queue.put(future)
    await completed_queue.put(None)

async def worker():
    while True:
        ids, future = await chunk_queue.get()
        try:
            res = await self.client.post(urllib.parse.quote("entities:fetchdata"), json={
                "entityIds": ids,
                "objects": objects,
                "inbound": {
                    "linkTypeIds": select_outbound,
                    "objects": objects,
                } if select_inbound else {},
                "outbound": {
                    "linkTypeIds": select_inbound,
                    "objects": objects,
                } if select_outbound else {},
            })
            res.raise_for_status()
            future.set_result(res)
        except Exception as e:
            future.set_exception(e)
            return

workers = [asyncio.create_task(worker()) for _ in range(WORKER_COUNT)]
chunk_producer = asyncio.create_task(queued_chunks())

try:
    while True:
        future = await completed_queue.get()
        if future is None:
            # all chunks have been processed!
            break
        res = await future
        for entity in res.json():
            yield entity
finally:
    for w in workers:
        w.cancel()
    await asyncio.wait(workers)
If you must limit how many chunks are queued (and not just how many are being processed concurrently), set maxsize on the chunk_queue queue (to a value greater than WORKER_COUNT). Use this to limit memory requirements, for example.
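For example (the factor of two here is just an illustrative choice, not from the original answer):

# prefetch at most twice as many chunks as there are workers (illustrative value)
chunk_queue = asyncio.Queue(maxsize=WORKER_COUNT * 2)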
However, if you were to set maxsize to a value equal to WORKER_COUNT, you may as well get rid of the worker tasks altogether and instead put the body of the worker loop, as coroutines wrapped in tasks, into the completed-results queue. The asyncio Task class is a subclass of Future and automatically sets the future result when the coroutine it wraps completes. If you are not going to put more chunks into chunk_queue than you have worker tasks, you may as well cut out the middleman and drop chunk_queue altogether. The tasks then go into the completed queue instead of plain futures:
completed_queue = asyncio.Queue(maxsize=queue_size)

async def queued_chunks():
    for ids in chunks(entity_ids, chunk_size):
        task = asyncio.create_task(fetch_task(ids))
        await completed_queue.put(task)
    await completed_queue.put(None)

async def fetch_task(ids):
    res = await self.client.post(urllib.parse.quote("entities:fetchdata"),
        json={
            "entityIds": ids,
            "objects": objects,
            "inbound": {
                "linkTypeIds": select_outbound,
                "objects": objects,
            } if select_inbound else {},
            "outbound": {
                "linkTypeIds": select_inbound,
                "objects": objects,
            } if select_outbound else {},
        }
    )
    res.raise_for_status()
    return res

chunk_producer = asyncio.create_task(queued_chunks())

while True:
    task = await completed_queue.get()
    if task is None:
        # all chunks have been processed!
        break
    res = await task
    for entity in res.json():
        yield entity
This version is really close to what you had already, the only difference being that the await for the client POST coroutine and the check for the response status code are moved into a separate coroutine that is run as a task. You could also make the self.client.post() coroutine itself the task (so not await on it) and leave checking the response status to the final queue-processing loop. That’s what Pablo’s answer proposes, so I won’t repeat it here.
Note that this version starts the task before putting it in the queue, so the queue is not the only limit on the number of active tasks: there is also an already-started task waiting for space on one end (the line await completed_queue.put(task) blocks if the queue is full), and another task already taken out by the queue consumer on the other (fetched by task = await completed_queue.get()). If you need to limit the number of active tasks, subtract 2 from the queue maxsize to set an upper limit.
Also, because tasks can complete while they wait in the queue, there may be fewer tasks actually running than that upper limit, yet you still can’t start new ones until space has been freed in the queue. Because the first approach queues inputs for tasks rather than the tasks themselves, it doesn’t have these issues. You could mitigate the problem by using a semaphore rather than a bounded queue size to limit tasks: acquire a slot before starting a task, and release it just before the task returns.
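As a rough illustration of that semaphore variant (not part of the original answer; it reuses fetch_task from the sketch above, and the consuming loop stays the same):

semaphore = asyncio.Semaphore(queue_size)

async def limited_fetch_task(ids):
    # release the slot once the request has completed, successfully or not
    try:
        return await fetch_task(ids)
    finally:
        semaphore.release()

async def queued_chunks():
    for ids in chunks(entity_ids, chunk_size):
        await semaphore.acquire()  # blocks here once queue_size requests are in flight
        await completed_queue.put(asyncio.create_task(limited_fetch_task(ids)))
    await completed_queue.put(None)

With the semaphore limiting concurrency, completed_queue no longer needs a maxsize for that purpose.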
Personally I’d pick my first proposal as it gives you separate control over concurrency and chunk prefetching, without the issues the second approach has.

Instead of awaiting the coroutine before you enqueue its result, wrap it in a task, enqueue the task, and await it later:
class MyClient:
    async def fetch_entities(
            self,
            entity_ids: List[int],
            objects: Optional[List[str]],
            select_inbound: Optional[List[str]] = None,
            select_outbound: Optional[List[str]] = None,
            queue_size: int = 5,
            chunk_size: int = 500,
    ):
        """
        Fetch entities in chunks

        While one chunk of data is being processed the next one can
        already be fetched. In other words: Data processing does not
        block data fetching.
        """
        objects = ",".join(objects)
        if select_inbound:
            select_inbound = ",".join(select_inbound)
        if select_outbound:
            select_outbound = ",".join(select_outbound)
        queue = asyncio.Queue(maxsize=queue_size)

        async def queued_chunks():
            for ids in chunks(entity_ids, chunk_size):
                cor = self.client.post(urllib.parse.quote("entities:fetchdata"), json={
                    "entityIds": ids,
                    "objects": objects,
                    "inbound": {
                        "linkTypeIds": select_outbound,
                        "objects": objects,
                    } if select_inbound else {},
                    "outbound": {
                        "linkTypeIds": select_inbound,
                        "objects": objects,
                    } if select_outbound else {},
                })
                # wrap the coroutine in a task so the request starts right away,
                # then enqueue the task instead of an awaited response
                task = asyncio.create_task(cor)
                await queue.put(task)
            await queue.put(None)

        asyncio.create_task(queued_chunks())

        while True:
            task = await queue.get()
            if task is None:
                break
            res = await task
            res.raise_for_status()
            queue.task_done()
            for entity in res.json():
                yield entity

Related

How to count number of records (message) in the topic using kafka-python

As said in the title, I want to get the number of records in my topic and I can't find a solution using the kafka-python library.
Does anyone have any idea?
The main idea is to count how many messages there are in each partition of the topic and sum all these numbers. The result is the total number of messages on that topic.
I am using confluent_kafka as the main library.
from confluent_kafka import Consumer, TopicPartition
from concurrent.futures import ThreadPoolExecutor

consumer = Consumer({"bootstrap.servers": "localhost:6667", "group.id": "test"})

def get_partition_size(topic_name: str, partition_key: int):
    topic_partition = TopicPartition(topic_name, partition_key)
    low_offset, high_offset = consumer.get_watermark_offsets(topic_partition)
    partition_size = high_offset - low_offset
    return partition_size

def get_topic_size(topic_name: str):
    topic = consumer.list_topics(topic=topic_name)
    partitions = topic.topics[topic_name].partitions
    workers, max_workers = [], len(partitions) or 1
    with ThreadPoolExecutor(max_workers=max_workers) as e:
        for partition_key in list(topic.topics[topic_name].partitions.keys()):
            job = e.submit(get_partition_size, topic_name, partition_key)
            workers.append(job)
        topic_size = sum([w.result() for w in workers])
    return topic_size

print(get_topic_size('my.kafka.topic'))
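For reference, the same watermark-difference idea can also be written with kafka-python itself; a rough sketch, assuming a kafka-python version that provides beginning_offsets/end_offsets (1.3+):

from kafka import KafkaConsumer, TopicPartition

def get_topic_size(topic_name: str, bootstrap_servers: str = "localhost:9092") -> int:
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    partitions = consumer.partitions_for_topic(topic_name) or set()
    tps = [TopicPartition(topic_name, p) for p in partitions]
    start = consumer.beginning_offsets(tps)  # dict: TopicPartition -> first available offset
    end = consumer.end_offsets(tps)          # dict: TopicPartition -> next offset to be written
    consumer.close()
    return sum(end[tp] - start[tp] for tp in tps)

print(get_topic_size('my.kafka.topic'))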
There is no specific API to count the number of records in a topic. You need to consume the records and count how many you receive from the Kafka consumer.
One solution is to add one message to each partition and then read the last offsets. From the offsets you can calculate the total number of messages sent to the topic so far.
But this is not the right approach: you don't know how many messages consumers have already consumed or how many messages Kafka has already deleted. The only way is to consume the messages and count them.
I wasn't able to get this working with kafka-python, but I was able to do it fairly easily with confluent-kafka libraries:
from confluent_kafka import Consumer

topic = "test_topic"
broker = "localhost:9092"

def get_count():
    consumer = Consumer({
        'bootstrap.servers': broker,
        'group.id': 'my-group',
        'auto.offset.reset': 'earliest',
    })
    consumer.subscribe([topic])

    total_message_count = 0
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            print("No more messages")
            break
        if msg.error():
            print("Consumer error: {}".format(msg.error()))
            continue
        total_message_count = total_message_count + 1
        print('Received message {}: {}'.format(total_message_count,
                                               msg.value().decode('utf-8')))

    consumer.close()
    print(total_message_count)

A blocked Python async function invocation also blocks another async function

I use FastAPI to develop data-layer APIs accessing SQL Server.
No matter whether I use pytds or pyodbc,
if a database transaction causes any request to hang,
all the other requests are blocked as well (even those without any database operation).
Reproduce:
1. Intentionally open a serializable SQL Server session, begin a transaction and do not roll back or commit:
INSERT INTO [dbo].[KVStore] VALUES ('1', '1', 0)
begin tran
SET TRANSACTION ISOLATION LEVEL Serializable
SELECT * FROM [dbo].[KVStore]
2. Send a request to the API, whose async handler function looks like this:
def kv_delete_by_key_2_sql():
    conn = pytds.connect(dsn='192.168.0.1', database=cfg.kvStore_db, user=cfg.kvStore_uid,
                         password=cfg.kvStore_upwd, port=1435, autocommit=True)
    engine = conn.cursor()
    try:
        sql = "delete KVStore; commit"
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(engine.execute, sql)
            rs = future.result()
        j = {
            'success': True,
            'rowcount': rs.rowcount
        }
        return jsonable_encoder(j)
    except Exception as exn:
        j = {
            'success': False,
            'reason': exn_handle(exn)
        }
        return jsonable_encoder(j)

@app.post("/kvStore/delete")
async def kv_delete(request: Request, type_: Optional[str] = Query(None, max_length=50)):
    request_data = await request.json()
    return kv_delete_by_key_2_sql()
3. And send a request to another API of the same app, whose async handler function looks like this:
async def hangit0(request: Request, t: int = Query(0)):
    print(t, datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3])
    await asyncio.sleep(t)
    print(t, datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3])
    j = {
        'success': True
    }
    return jsonable_encoder(j)

@app.get("/kvStore/hangit/")
async def hangit(request: Request, t: int = Query(0)):
    return await hangit0(request, t)
I expected step 2 to hang and step 3 to return directly after 2 seconds.
However, step 3 never returns if the transaction is not committed or rolled back...
How do I make these handler functions work concurrently?
The reason is that rs = future.result() is actually a blocking call - see the Python docs. Unfortunately, executor.submit() doesn't return an awaitable object (concurrent.futures.Future is different from asyncio.Future).
You can use asyncio.wrap_future, which takes a concurrent.futures.Future and returns an asyncio.Future (see the Python docs). The new Future object is awaitable, so you can convert your blocking function into an async function.
An Example:
import asyncio
import concurrent.futures

async def my_async():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(lambda x: x + 1, 1)
        return await asyncio.wrap_future(future)

print(asyncio.run(my_async()))
In your code, simply change the rs = future.result() to rs = await asyncio.wrap_future(future) and make the whole function async. That should do the magic, good luck! :)
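As a rough sketch of what that looks like in the question's own handler (pytds, cfg and exn_handle are taken from the question and assumed to be available):

async def kv_delete_by_key_2_sql():
    conn = pytds.connect(dsn='192.168.0.1', database=cfg.kvStore_db, user=cfg.kvStore_uid,
                         password=cfg.kvStore_upwd, port=1435, autocommit=True)
    engine = conn.cursor()
    try:
        sql = "delete KVStore; commit"
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(engine.execute, sql)
            # await the asyncio wrapper instead of blocking the event loop with future.result()
            rs = await asyncio.wrap_future(future)
        return jsonable_encoder({'success': True, 'rowcount': rs.rowcount})
    except Exception as exn:
        return jsonable_encoder({'success': False, 'reason': exn_handle(exn)})

kv_delete would then return await kv_delete_by_key_2_sql().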

python asyncio asynchronously fetch data by key from a dict when the key becomes available

As the title says, my use case is like this:
I have one aiohttp server which accepts requests from clients. When I receive a request I generate a unique request id for it, and then I send a {req_id: req_payload} dict to some workers (the workers are not in Python and run in another process). When the workers complete the work, I get back the response and put it in a result dict like this: {req_id_1: res_1, req_id_2: res_2}.
Then I want my aiohttp server handler to await on the result dict above, so when the specific response becomes available (by req_id) it can be sent back.
I built the example code below to try to simulate the process, but got stuck implementing the coroutine async def fetch_correct_res(req_id), which should asynchronously (without blocking) fetch the correct response by req_id.
import random
import asyncio
import shortuuid

n_tests = 1000
idxs = list(range(n_tests))

req_ids = []
for _ in range(n_tests):
    req_ids.append(shortuuid.uuid())

res_dict = {}

async def fetch_correct_res(req_id):
    pass

async def handler(req):
    res = await fetch_correct_res(req)
    assert req == res, "the correct res for the req should exactly be the req itself."
    print("got correct res for req: {}".format(req))

async def randomly_put_res_to_res_dict():
    for _ in range(n_tests):
        random_idx = random.choice(idxs)
        await asyncio.sleep(random_idx / 1000)
        res_dict[req_ids[random_idx]] = req_ids[random_idx]
        print("req: {} is back".format(req_ids[random_idx]))
So:
Is it possible to make this solution work? How?
If the above solution is not possible, what would be the correct solution for this use case with asyncio?
Many thanks.
The only approach I can think of for now to make this work is: pre-create some asyncio.Queues with pre-assigned ids, then for each incoming request assign one queue to it, so the handler just awaits on that queue; when the response comes back I put it into this pre-assigned queue only, and after the request is fulfilled I collect the queue back to use for the next incoming request. Not very elegant, but it would solve the problem.
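A minimal sketch of that idea (simplified: the queue is created on demand rather than taken from a pre-created pool):

import asyncio

pending = {}  # req_id -> asyncio.Queue that will receive exactly one response

async def fetch_correct_res(req_id):
    queue = asyncio.Queue(maxsize=1)
    pending[req_id] = queue
    try:
        return await queue.get()  # suspends the handler until the response arrives
    finally:
        del pending[req_id]

def deliver_response(req_id, res):
    # call this when the worker's response for req_id comes back
    pending[req_id].put_nowait(res)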
See if the sample implementation below fulfils your need.
Basically you want to respond to the request (id) with your response (whose arrival order you can't predict) in an asynchronous way.
So at request-handling time, populate the dict with {request_id: {'event': <asyncio.Event>, 'result': <result>}} and await on asyncio.Event.wait(). Once the response is received, signal the event with asyncio.Event.set(), which releases the await; then fetch the response from the dict based on the request id.
I modified your code slightly to pre-populate the dict with the request id and to await on asyncio.Event.wait() until the signal comes from the response.
import random
import asyncio
import shortuuid

n_tests = 10
idxs = list(range(n_tests))

req_ids = []
for _ in range(n_tests):
    req_ids.append(shortuuid.uuid())

res_dict = {}

async def fetch_correct_res(req_id, event):
    await event.wait()
    res = res_dict[req_id]['result']
    return res

async def handler(req, loop):
    print("incoming request id: {}".format(req))
    event = asyncio.Event()
    data = {req: {}}
    res_dict.update(data)
    res_dict[req]['event'] = event
    res_dict[req]['result'] = 'pending'
    res = await fetch_correct_res(req, event)
    assert req == res, "the correct res for the req should exactly be the req itself."
    print("got correct res for req: {}".format(req))

async def randomly_put_res_to_res_dict():
    random.shuffle(req_ids)
    for i in req_ids:
        await asyncio.sleep(random.randrange(2, 4))
        print("req: {} is back".format(i))
        if res_dict.get(i) is not None:
            event = res_dict[i]['event']
            res_dict[i]['result'] = i
            event.set()

loop = asyncio.get_event_loop()
tasks = asyncio.gather(handler(req_ids[0], loop),
                       handler(req_ids[1], loop),
                       handler(req_ids[2], loop),
                       handler(req_ids[3], loop),
                       randomly_put_res_to_res_dict())
loop.run_until_complete(tasks)
loop.close()
sample response from the above code
incoming request id: NDhvBPqMiRbteFD5WqiLFE
incoming request id: fpmk8yC3iQcgHAJBKqe2zh
incoming request id: M7eX7qeVQfWCCBnP4FbRtK
incoming request id: v2hAfcCEhRPUDUjCabk45N
req: VeyvAEX7YGgRZDHqa2UGYc is back
req: M7eX7qeVQfWCCBnP4FbRtK is back
got correct res for req: M7eX7qeVQfWCCBnP4FbRtK
req: pVvYoyAzvK8VYaHfrFA9SB is back
req: soP8NDxeQKYjgeT7pa3wtG is back
req: j3rcg5Lp59pQXuvdjCAyZe is back
req: NDhvBPqMiRbteFD5WqiLFE is back
got correct res for req: NDhvBPqMiRbteFD5WqiLFE
req: v2hAfcCEhRPUDUjCabk45N is back
got correct res for req: v2hAfcCEhRPUDUjCabk45N
req: porzHqMqV8SAuttteHRwNL is back
req: trVVqZrUpsW3tfjQajJfb7 is back
req: fpmk8yC3iQcgHAJBKqe2zh is back
got correct res for req: fpmk8yC3iQcgHAJBKqe2zh
This may work (note: I removed UUID in order to know req id in advance)
import random
import asyncio

n_tests = 1000
idxs = list(range(n_tests))

req_ids = []
for i in range(n_tests):
    req_ids.append(i)

res_dict = {}

async def fetch_correct_res(req_id):
    while not res_dict.get(req_id):
        await asyncio.sleep(0.1)
    return req_ids[req_id]

async def handler(req):
    print("fetching req: ", req)
    res = await fetch_correct_res(req)
    assert req == res, "the correct res for the req should exactly be the req itself."
    print("got correct res for req: {}".format(req))

async def randomly_put_res_to_res_dict(future):
    for i in range(n_tests):
        res_dict[req_ids[i]] = req_ids[i]
        await asyncio.sleep(0.5)
        print("req: {} is back".format(req_ids[i]))
    future.set_result("done")

loop = asyncio.get_event_loop()
future = asyncio.Future()
asyncio.ensure_future(randomly_put_res_to_res_dict(future))
loop.run_until_complete(handler(10))
loop.close()
Is it the best solution? In my opinion, no. This is basically the long-running-job pattern, and you should have a (REST) API for submitting the job and for querying the job status, like:
http POST server:port/job
{some job json paylod}
Response: 200 OK {"req_id": 1}
http GET server:port/job/1
Response: 200 OK {"req_id": 1, "status": "in process"}
http GET server:port/job/1
Response: 200 OK {"req_id": 1, "status": "done", "result":{}}

How to use parallelization in set/list comprehension using asyncio?

I want to create a multiprocess comprehension in Python 3.7.
Here's the code I have:
async def _url_exists(url):
    """Check whether a url is reachable"""
    request = requests.get(url)
    return request.status_code == 200

async def _remove_unexisting_urls(rows):
    return {row for row in rows if await _url_exists(row[0])}

rows = [
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]
rows = asyncio.run(_remove_unexisting_urls(rows))
In this code example, I want to remove non-existing URLs from a list. (Note that I'm using a set instead of a list because I also want to remove duplicates).
My issue is that I still see that the execution is sequential. HTTP Requests make the execution wait.
When compared to a serial execution, the execution time is the same.
Am I doing something wrong?
How should these await/async keywords be used with python comprehension?
asyncio can't run these functions concurrently on its own, because requests.get() is a blocking call that never yields control to the event loop. However, with the multiprocessing module's Pool.map, you can schedule the functions to run in other processes:
from multiprocessing.pool import Pool

import requests

pool = Pool()

def fetch(url):
    request = requests.get(url)
    return request.status_code == 200

rows = [
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]
# keep only the URLs whose check succeeded
rows = [url for url, exists in zip(rows, pool.map(fetch, rows)) if exists]
requests does not support asyncio. If you want to go for true asynchronous execution, you will have to look at libs like aiohttp or asks
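For example, a rough aiohttp version of the check could look like this (a sketch, assuming a plain GET per URL is acceptable):

import asyncio
import aiohttp

async def _url_exists(session, url):
    try:
        async with session.get(url) as resp:
            return resp.status == 200
    except aiohttp.ClientError:
        return False

async def _remove_unexisting_urls(rows):
    urls = list(set(rows))  # deduplicate before issuing any requests
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(_url_exists(session, url) for url in urls))
    return {url for url, exists in zip(urls, results) if exists}

rows = asyncio.run(_remove_unexisting_urls([
    'http://example.com/',
    'http://example.org/',
    'http://foo.org/',
]))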
Build your set before offloading to the tasks, so you don't even issue requests for duplicates, rather than deduplicating the result afterwards.
With requests itself, you can fall back to run_in_executor which will execute your requests inside a ThreadPoolExecutor, so not really asynchronous I/O:
import asyncio
import time
from requests import exceptions, get

def _url_exists(url):
    try:
        r = get(url, timeout=10)
    except (exceptions.ConnectionError, exceptions.ConnectTimeout):
        return False
    else:
        return r.status_code == 200

async def _remove_unexisting_urls(l, r):
    # making a set from the list before passing it to the futures
    # so we just have three tasks instead of nine
    futures = [l.run_in_executor(None, _url_exists, url) for url in set(r)]
    return [await f for f in futures]

rows = [  # added some dupes
    'http://example.com/',
    'http://example.com/',
    'http://example.com/',
    'http://example.org/',
    'http://example.org/',
    'http://example.org/',
    'http://foo.org/',
    'http://foo.org/',
    'http://foo.org/',
]

loop = asyncio.get_event_loop()
print(time.time())
result = loop.run_until_complete(_remove_unexisting_urls(loop, rows))
print(time.time())
print(result)
Output
1537266974.403686
1537266986.6789136
[False, False, False]
As you can see, there is a penalty from initializing the thread pool, ~2.3 seconds in this case. However, given that each of the three tasks runs for ten seconds until it times out on my box (my IDE is not allowed through the proxy), an overall execution time of twelve seconds looks quite concurrent.

Python Hanging Threads

I have the following code:
final = []
with futures.ThreadPoolExecutor(max_workers=self.number_threads) as executor:
    _futures = [executor.submit(self.get_attribute, listing,
                                self.proxies[listings.index(listing) % len(self.proxies)])
                for listing in listings]
    for result in futures.as_completed(_futures):
        try:
            listing = result.result()
            final.append(listing)
        except Exception as e:
            print traceback.format_exc()
return final
The self.get_attribute function that's submitted to the executor takes a dictionary and a proxy as input, makes one or two HTTP requests to get some data, and returns an edited dictionary. The problem is that the workers/threads hang towards the end of completing all the submitted tasks. If I submit 400 dictionaries, it will complete ~380 tasks and then hang. If I submit 600, it will complete ~570-580. However, if I submit 25, it will complete all of them. I'm not sure what the threshold is at which it goes from finishing to not finishing.
I have also tried using a queue and threading system like this:
def _get_attribute_thread(self):
    while self.q.not_empty:
        job = self.q.get()
        listing = job['listing']
        proxy = job['proxy']
        self.threaded_results.put(self.get_attribute(listing, proxy))
        self.q.task_done()

def _get_attributes_threaded_with_proxies(self, listings):
    for listing in listings:
        self.q.put({'listing': listing, 'proxy': self.proxies[listings.index(listing) % len(self.proxies)]})
    for _ in xrange(self.number_threads):
        thread = threading.Thread(target=self._get_attribute_thread)
        thread.daemon = True
        thread.start()
    self.q.join()
    final = []
    while self.threaded_results.not_empty:
        final.append(self.threaded_results.get())
    return final
However the result is the same. What can I do to fix/debug the problem? Thanks in advance.
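Not an answer from the thread, but one generic way to narrow this down: give futures.wait a timeout so you can see how many futures are still pending instead of blocking forever in as_completed, and dump the stack of every thread to see where they are stuck (often the culprit is an HTTP request issued without a timeout=, which can block indefinitely). An illustrative sketch only:

# Illustrative debugging sketch (not from the original post)
import sys
import traceback
from concurrent import futures

done, not_done = futures.wait(_futures, timeout=120)
print('done: {}, still pending: {}'.format(len(done), len(not_done)))

for thread_id, frame in sys._current_frames().items():
    print('Thread {}:'.format(thread_id))
    traceback.print_stack(frame)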
