I want to make repeated requests to a server that will return with some tasks. The response from the server will be a dictionary with a list of functions that need to be called. For example:
{
    tasks: [
        {
            function: "HelloWorld",
            id: 1212
        },
        {
            function: "GoodbyeWorld",
            id: 1222
        }
    ]
}
NOTE: I'm dumbing it down.
For each of these tasks, I will run the specified function using multiprocessing. Here is an example of my code:
r = requests.get('https://localhost:5000', auth=('user', 'pass'))
data = r.json()
if len(data["tasks"]) > 0:
    manager = multiprocessing.Manager()
    for task in data["tasks"]:
        if task["function"] == "HelloWorld":
            helloObj = HelloWorldClass()
            hello = multiprocessing.Process(target=helloObj.helloWorld)
            hello.start()
            hello.join()
        elif task["function"] == "GoodbyeWorld":
            byeObj = GoodbyeWorldClass()
            bye = multiprocessing.Process(target=byeObj.byeWorld)
            bye.start()
            bye.join()
The problem is that I want to make repeated requests and keep filling the data["tasks"] array while the other processes are running. If I wrap everything in a while loop, it will only make a new request after all the processes from the previous response are done (that is, once join() has returned for every process).
Can anyone help me make repeated requests and fill the array continuously? Please let me know if I need to clarify anything.
If I understood you correctly, you need something like this:
import time
from multiprocessing import Process
import requests
from task import FunctionFactory
def get_tasks():
    resp = requests.get('https://localhost:5000', auth=('user', 'pass'))
    data = resp.json()
    return data['tasks']
if __name__ == '__main__':
    procs = {}
    for _ in range(10):
        tasks = get_tasks()
        if not tasks:
            time.sleep(5)
            continue
        for task in tasks:
            if task['id'] in procs:
                # This task has been already submitted for execution.
                continue
            func = FunctionFactory.build(task['function'])
            proc = Process(target=func)
            proc.start()
            procs[task['id']] = proc
    # Waiting for all the submitted tasks to finish.
    for proc in procs.values():
        proc.join()
Here, the function get_tasks is used to request a list of dictionaries with id and function keys from the server. In the main section, there is a procs dictionary that maps id to running process instances which execute functions built by a FunctionFactory using received tasks' function names. In the case there is already a running task with the same id, it gets ignored.
With this approach, you can request tasks as often as needed (here, 10 requests are used in a for loop) and start processes to execute them in parallel. In the end, you just wait for all the submitted tasks to finish.
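The answer imports FunctionFactory from a task module without showing it. A minimal sketch of what such a factory could look like, assuming a plain name-to-callable registry (the register decorator, the _registry attribute and the two example functions are hypothetical and not part of the original answer):

```python
class FunctionFactory:
    """Hypothetical registry mapping task names sent by the server to callables."""

    _registry = {}

    @classmethod
    def register(cls, name):
        """Decorator that stores a callable under the given task name."""
        def decorator(func):
            cls._registry[name] = func
            return func
        return decorator

    @classmethod
    def build(cls, name):
        """Return the callable for a task name, or raise for unknown names."""
        try:
            return cls._registry[name]
        except KeyError:
            raise ValueError(f"unknown task function: {name!r}") from None


@FunctionFactory.register("HelloWorld")
def hello_world():
    return "Hello, world"


@FunctionFactory.register("GoodbyeWorld")
def goodbye_world():
    return "Goodbye, world"
```

Keeping the registered callables as module-level functions means they stay picklable, which matters when Process targets are sent to child processes under the spawn start method.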
You have a bug in your program: you should call join() only after you have started all the processes. join() blocks until its process has finished, which in your case happens before the next one is started. That effectively makes your whole program run sequentially.
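The start-everything-first, join-afterwards pattern can be sketched roughly as follows. The helper names start_all and join_all and the worker_cls parameter are assumptions made for illustration; worker_cls only exists so the same sketch can be tried with threading.Thread, which has the same start()/join() interface as multiprocessing.Process.

```python
import multiprocessing


def start_all(targets, worker_cls=multiprocessing.Process):
    """Start one worker per callable without waiting for any of them."""
    workers = []
    for target in targets:
        worker = worker_cls(target=target)
        worker.start()  # returns immediately; the work runs concurrently
        workers.append(worker)
    return workers


def join_all(workers):
    """Block until every previously started worker has finished."""
    for worker in workers:
        worker.join()
```

With real processes, call start_all() from under an if __name__ == '__main__': guard and pass module-level functions as targets, then call join_all() once all tasks have been started.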
I have a Python function that requests data via API and involves a rotating expiring key. The volume of requests necessitates some parallelization of the function. I am doing this with the multiprocessing.pool module ThreadPool. Example code:
import json
import requests
from multiprocessing.pool import ThreadPool
from tqdm import tqdm
# Input is a list-of-dicts results of a previous process.
results = [...]
# Process starts by retrieving an authorization key.
headers = {"authorization": get_new_authorization()}
# api_call() is called on each existing result with the retrieved key.
results = thread(api_call, [(headers, result) for result in results])
# Function calls API with passed headers for given URL and returns dict.
def api_call(headers_plus_result):
    headers, result = headers_plus_result
    r = requests.get(result["url"], headers=headers)
    return json.loads(r.text)
# Threading function with default num_threads.
def thread(worker, jobs, num_threads=5):
    pool = ThreadPool(num_threads)
    results = list()
    for result in tqdm(pool.imap_unordered(worker, jobs), total=len(jobs)):
        if result:
            results.append(result)
    pool.close()
    pool.join()
    if results:
        return results
# Function to get new authorization key.
def get_new_authorization():
    ...
    return auth_key
I am trying to modify my mapping process so that, when the first worker fails (i.e. the authorization key expires), all other processes are paused until a new authorization key is retrieved. Then, the processes proceed with the new key.
Should this be inserted into the actual thread() function? If I put an exception in the api_call function itself, I don't see how I can stop the pool manager or update the header being passed to other workers.
Additionally: is using ThreadPool even the best method if I want this kind of flexibility?
A simpler possibility might be to use a multiprocessing.Event and a shared variable. The Event would indicate whether the authentication was legit or not, and the shared variable would contain the authentication.
import multiprocessing as mp
event = mp.Event()
sharedAuthentication = mp.Array('c', 100) # byte string, 100 = max length; only 'c' arrays expose .value
So a worker would run:
event.wait()
authentication = sharedAuthentication.value
Your main thread would initially set the authentication with
sharedAuthentication.value = ....
event.set()
and later modify the authentication with
event.clear()
... calculate new authentication
sharedAuthentication.value = .....
event.set()
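Since the question uses ThreadPool (threads rather than processes), the same idea can also be sketched with threading.Event and an ordinary attribute. The AuthGate class, its method names and the generation counter are hypothetical; the counter is an added assumption so that several workers noticing the same expired key trigger only one refresh.

```python
import threading


class AuthGate:
    """Hypothetical helper: hands out the current key and blocks readers during a refresh."""

    def __init__(self, fetch_key):
        self._fetch_key = fetch_key      # callable returning a fresh key
        self._lock = threading.Lock()    # guards the key/generation swap
        self._valid = threading.Event()  # set while the stored key is usable
        self._key = None
        self._generation = 0
        self.refresh(0)

    def get(self):
        """Wait until a key is available, then return (key, generation)."""
        self._valid.wait()
        with self._lock:
            return self._key, self._generation

    def refresh(self, seen_generation):
        """Fetch a new key unless another worker already replaced the one we saw."""
        with self._lock:
            if seen_generation != self._generation:
                return  # stale report: the key was refreshed in the meantime
            self._valid.clear()
            try:
                self._key = self._fetch_key()
                self._generation += 1
            finally:
                self._valid.set()
```

A worker would call key, gen = gate.get() before each request, and on an authorization failure call gate.refresh(gen) and retry with the result of a new gate.get().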
Problem Outline
I have a python flask server where one of the endpoints has a moderate amount of work to do (the real code reads, resizes and returns an image). I want to optimise the endpoint so that it can be called multiple times in parallel.
The code I currently have (shown below) does not work because it relies on passing a multiprocessing.Event object through a multiprocessing.JoinableQueue which is not allowed and results in the following error:
RuntimeError: Condition objects should only be shared between processes
through inheritance
How can I use a separate process to compute some jobs and notify the main thread when a specific job is complete?
Proof of Concept
Flask can be multithreaded so if one request is waiting on a result other threads can continue to process other requests. I have a basic proof of concept here that shows that parallel requests can be optimised using multiprocessing: https://github.com/alanbacon/flask_multiprocessing
The example code on GitHub spawns a new process for every request, which I understand has considerable overhead. I have also noticed that my proof-of-concept server crashes if there are more than 10 or 20 concurrent requests; I suspect this is because too many processes are being spawned.
Current Attempt
I have tried to create a set of workers that pick jobs off a queue. When a job is complete the result is written to a shared memory area. Each job contains the work to be done and an Event object that can be set when the job is complete to signal the main thread.
Each request thread passes in a job with a newly created Event object, it then immediately waits on that event before returning the result. While one server request thread is waiting the server is able to use other threads to continue to serve other requests.
The problem as mentioned above is that Event objects can not be passed around in this way.
What approach should I take to circumvent this problem?
from flask import Flask, request, Response
import multiprocessing
import uuid
app = Flask(__name__)
# flask config
app.config['PROPAGATE_EXCEPTIONS'] = True
app.config['DEBUG'] = False
def simpleWorker(complexity):
    temp = 0
    for i in range(0, complexity):
        temp += 1
mgr = multiprocessing.Manager()
results = mgr.dict()
joinableQueue = multiprocessing.JoinableQueue()
lock = multiprocessing.Lock()
def mpWorker(joinableQueue, lock, results):
    while True:
        next_task = joinableQueue.get() # blocking call
        if next_task is None: # poison pill to kill worker
            break
        simpleWorker(next_task['complexity']) # pretend to do heavy work
        result = next_task['val'] * 2 # compute result
        ID = next_task['ID']
        with lock:
            results[ID] = result # output result to shared memory
        next_task['event'].set() # tell main process result is calculated
        joinableQueue.task_done() # remove task from queue
@app.route("/work/<ID>", methods=['GET'])
def work(ID=None):
    if request.method == 'GET':
        # send a task to the consumer and wait for it to finish
        uid = str(uuid.uuid4())
        event = multiprocessing.Event()
        # pass event to job so that job can tell this thread when processing is
        # complete
        joinableQueue.put({
            'val': ID,
            'ID': uid,
            'event': event,
            'complexity': 100000000
        })
        event.wait() # wait for result to be calculated
        # get result from shared memory area, and clean up
        # (the worker stores results under the job's uid, not the URL parameter)
        with lock:
            result = results[uid]
            del results[uid]
        return Response(str(result), 200)
if __name__ == "__main__":
    num_consumers = multiprocessing.cpu_count() * 2
    consumers = [
        multiprocessing.Process(
            target=mpWorker,
            args=(joinableQueue, lock, results))
        for i in range(num_consumers)
    ]
    for c in consumers:
        c.start()
    host = '127.0.0.1'
    port = 8080
    app.run(host=host, port=port, threaded=True)
What do I mean by "deterministic time"? For example, AWS offers a service called AWS Lambda. A process started as a Lambda function has a time limit; once it is reached, the function is stopped and the task is considered to have finished with an error. An example task is sending data to an HTTP endpoint. Depending on the network connection to the endpoint and other factors, sending the data can take a long time. If I need to send the same data to many endpoints one after another, the total time is roughly the time for one request multiplied by the number of endpoints, which increases the chance that the Lambda function is stopped before all the data has been sent to all endpoints.
To solve this, I need to send the data to the different endpoints in parallel using threads.
The problem with threads is that a started thread can't be stopped. If an HTTP request takes longer than the Lambda time limit allows, the Lambda function is aborted and returns an error. So I need to use a timeout on the HTTP request to abort it if it takes longer than expected.
If an HTTP request is cancelled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so that it is not lost. The time needed to save unprocessed data can be predicted, because I control the storage where the data is saved.
The last part that consumes time is the loop in which the threads are scheduled with executor.submit(). If there is only one endpoint, or a small number of them, the time consumed is small and there is no need to control it. But if I have to deal with many endpoints, I have to take this into account.
So the total time consists of:
scheduling threads
HTTP request execution
saving unprocessed data
Here is an example of how I can manage time using threads:
import concurrent.futures
from functools import partial
import requests
import time
start = time.time()
def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}
def get_data(n):
    return {"wait": n}
def done_cb(a, b, future):
    pass # save unprocessed data
def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, *[{"wait": 10}])
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        # `start` is the module-level timestamp taken above
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass
I was wondering how I could use the asyncio library instead of threads to do the same thing.
import asyncio
import time
from functools import partial
import requests
start = time.time()
def send_data(data):
    ...
def get_data(n):
    return {"wait": n}
def done_callback(a,b, future):
    pass # save unprocessed data
def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = (loop.run_in_executor(None, send_data, event_data))
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        if time.time() - start_appending > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )
_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async code, I use run_in_executor() to create the future objects. The main problem I have is that done_callback() is not executed when the executor thread finishes its job, but only once the futures are "processed" by the asyncio.wait() expression.
Basically, I am looking for a way to start executing an asyncio future the way ThreadPoolExecutor starts executing threads, without waiting for the asyncio.wait() expression to call done_callback(). If you have other ideas for writing Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
Another question: if a thread or future has done its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I don't have a result, and I have to use functools.partial() to pass additional data into done_callback() so that I can tell which data the callback was called for. If the passed data is small, this is not a problem. If the data is big, I need to put it in an array/list/dictionary and pass only its index to the callback, or pass the full data to the callback.
Can I somehow access, from done_callback(), the variable that was passed to a future/thread when the callback was triggered for a cancelled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
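A minimal sketch of that approach, assuming a stand-in send_data() coroutine that only sleeps (in real code it would be an aiohttp request); the function names, the delay parameter and the result dictionaries are made up for illustration. Wrapping each call in its own coroutine keeps the payload in scope, so a timed-out request can still report which data it was handling.

```python
import asyncio


async def send_data(payload, delay):
    """Stand-in for a real async HTTP call; `delay` simulates network latency."""
    await asyncio.sleep(delay)
    return {"status": "ok", "id": payload["id"]}


async def send_with_deadline(payload, delay, timeout):
    """Send one payload, cancelling the request if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(send_data(payload, delay), timeout)
    except asyncio.TimeoutError:
        # The payload is still available here, so it could be saved for a retry.
        return {"status": "timeout", "id": payload["id"]}


async def send_all(jobs, timeout):
    """Run all sends concurrently; gather() returns results in input order."""
    return await asyncio.gather(
        *(send_with_deadline(payload, delay, timeout) for payload, delay in jobs)
    )


def run(jobs, timeout=0.2):
    return asyncio.run(send_all(jobs, timeout))
```

Because every request has the same deadline and they all run concurrently, the whole batch finishes in roughly one timeout period regardless of how many endpoints there are.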
I am creating a simple TCP server-client script in Python. The server is threaded and forks a new worker/thread for every client connection. So far I have coded pretty much the entire server module, but my function handle_client(), which is run for every incoming client connection, is getting very long. To improve the readability of the code, I want to split handle_client() into multiple small functions. I do understand that when I split handle_client() into smaller functions, the split functions should be wrapped around mutex locks to synchronize shared usage between multiple handle_client() threads. Doing this will reduce the efficiency of the program, because each handle_client() thread will have to wait for other threads to unlock a shared function before using it. My other thought was to run these smaller functions as threads within the handle_client() thread and wait for them to finish using Thread.join() before continuing. Is there a better way to do this?
My code:
#!/usr/bin/python
import socket
import threading
import pandas as pd
class TCPServer(object):
    NUMBER_OF_THREADS = 0
    BUFFER = 4096
    threads_list = []
    def __init__(self, port, hostname):
        self.socket = socket.socket(
            family=socket.AF_INET, type=socket.SOCK_STREAM)
        self.socket.bind((hostname, port))
    def listen_for_clients(self):
        self.socket.listen(5)
        while True:
            client, address = self.socket.accept()
            client_ID = client.recv(TCPServer.BUFFER)
            print(f'Connected to client: {client_ID}')
            if client_ID:
                TCPServer.NUMBER_OF_THREADS = TCPServer.NUMBER_OF_THREADS + 1
                thread = threading.Thread(
                    target=TCPServer.create_worker, args=(self, client, address, client_ID))
                TCPServer.threads_list.append(thread)
                thread.start()
            if TCPServer.NUMBER_OF_THREADS > 2:
                break
        TCPServer.wait_for_workers()
    def wait_for_workers():
        for thread in TCPServer.threads_list:
            thread.join()
    def create_worker(self, client, address, client_ID):
        print(f'Spawned a new worker for {client_ID}. Worker #: {TCPServer.NUMBER_OF_THREADS}')
        data_list = []
        data_frame = pd.DataFrame()
        client.send("SEND_REQUEST_TYPE".encode())
        request_type = client.recv(TCPServer.BUFFER).decode('utf-8')
        if request_type == 'KMEANS':
            print(f'Client: REQUEST_TYPE {request_type}')
            client.send("SEND_DATA".encode())
            while True:
                data = client.recv(TCPServer.BUFFER).decode('utf-8')
                if data == 'ROW':
                    client.send("OK".encode())
                    while True:
                        data = client.recv(TCPServer.BUFFER).decode('utf-8')
                        print(f'Client: {data}')
                        if data == 'ROW_END':
                            print('Data received: ', data_list)
                            series = pd.Series(data_list)
                            # append() returns a new frame (and was removed in pandas 2.0)
                            data_frame = data_frame.append(series, ignore_index=True)
                            data_list = []
                            client.send("OK".encode())
                            break
                        else:
                            data_list.append(int(data))
                            client.send("OK".encode())
                elif data == 'DATA_END':
                    client.send("WAIT".encode())
                    # (Vino) pass data to algorithm
                    print(f'Data received from client {client_ID}: ', data_frame)
        elif request_type == 'NEURALNET':
            pass
        elif request_type == 'LINRIGRESSION':
            pass
        elif request_type == 'LOGRIGRESSION':
            pass
def main():
    port = input("Port: ")
    server = TCPServer(port=int(port), hostname='localhost')
    server.listen_for_clients()
if __name__ == '__main__':
    main()
Note: The following block of code is repetitive and will be used multiple times within the handle_client() function.
while True:
    data = client.recv(TCPServer.BUFFER).decode('utf-8')
    if data == 'ROW':
        client.send("OK".encode())
        while True:
            data = client.recv(TCPServer.BUFFER).decode('utf-8')
            print(f'Client: {data}')
            if data == 'ROW_END':
                print('Data received: ', data_list)
                series = pd.Series(data_list)
                # append() returns a new frame (and was removed in pandas 2.0)
                data_frame = data_frame.append(series, ignore_index=True)
                data_list = []
                client.send("OK".encode())
                break
            else:
                data_list.append(int(data))
                client.send("OK".encode())
    elif data == 'DATA_END':
        client.send("WAIT".encode())
        # (Vino) pass data to algorithm
        print(f'Data received from client {client_ID}: ', data_frame)
This is the block I want to place in a separate function and call from within the handle_client() thread.
Your code is already long, so I won't dive into it; I'll try to keep things general.
I do understand that when I split handle_client() into smaller functions, the split functions should be wrapped around mutex locks.
That's not quite true: between threads you already have to use locks to guard shared memory against concurrent writes, regardless of how you split your code into functions.
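To illustrate, here is a rough sketch of the repeated receive loop pulled out into a helper that touches only its arguments and local variables, so each thread calling it works on its own data and no lock is involved. The name receive_rows and its return format are hypothetical, and it assumes, as the question's code does, that each recv() call returns exactly one protocol message.

```python
BUFFER = 4096


def receive_rows(client, buffer_size=BUFFER):
    """Read ROW ... ROW_END blocks until DATA_END and return them as lists of ints."""
    rows = []      # local to this call, so nothing is shared between threads
    current = []
    while True:
        message = client.recv(buffer_size).decode("utf-8")
        if not message:            # connection closed early
            return rows
        if message == "ROW":
            current = []
            client.send(b"OK")
        elif message == "ROW_END":
            rows.append(current)
            current = []
            client.send(b"OK")
        elif message == "DATA_END":
            client.send(b"WAIT")
            return rows
        else:
            current.append(int(message))
            client.send(b"OK")
```

The calling thread can then build its DataFrame from the returned rows; a lock would only be needed if several threads wrote to the same object.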
The server is threaded
It looks like you're doing CPU-intensive work (I see LINALG, NEURALNET, ...). In Python it makes little sense to use threads to dispatch CPU-intensive loads, as the GIL will serialize CPU usage between your threads.
The way to parallelize CPU-intensive work in Python is to use processes.
Processes do not share memory, so you'll be able to manipulate variables freely without mutexes, but those variables won't be shared at all. I hope your jobs are independent, as they can't share any state.
If you need to share state, avoid locks: they are complicated to handle, they lead to deadlocks, and they hurt readability. Try to implement your "state sharing" with queues, as a pipeline of jobs, each worker pulling from a queue, doing work, and pushing to another queue; this keeps things clear and easy to understand. Plus, there are queue implementations for both threads and processes, so you'll be able to switch between the two almost seamlessly.
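A small sketch of such a pipeline, written here with threads and queue.Queue; the stage and run_pipeline names are made up for illustration, and None is used as a stop marker (so None cannot be a regular work item). The same get()/put() pattern applies to multiprocessing.Queue with Process workers.

```python
import queue
import threading


def stage(func, inbox, outbox):
    """Pull items from inbox, apply func, push results to outbox until None arrives."""
    while True:
        item = inbox.get()
        if item is None:       # stop marker: pass it downstream and exit
            outbox.put(None)
            break
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Chain one worker per function with queues in between and collect the output."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(None)
    results = []
    while True:
        item = queues[-1].get()
        if item is None:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results
```

No stage touches another stage's data directly; everything travels through the queues, so no explicit lock is needed.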
if TCPServer.NUMBER_OF_THREADS > 2:
    break
Hey, you're breaking out of your main loop when you have more than two threads, exiting your main process and killing your server; I bet that's not what you want. Oh, and if you use processes instead of threads, you should prefork a pool of them, as creating a process costs more than creating a thread. And reuse them: a process can take another job after finishing one, it does not have to die (typically, use queues to send jobs to your processes).
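One way to get a preforked, reusable pool without writing the queue handling yourself is concurrent.futures.ProcessPoolExecutor. A minimal sketch follows; run_jobs, cpu_heavy and the executor_cls parameter are hypothetical names, and executor_cls is only there so the same code can be tried with ThreadPoolExecutor.

```python
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy(n):
    """Placeholder for real CPU-bound work."""
    return sum(i * i for i in range(n))


def run_jobs(func, jobs, executor_cls=ProcessPoolExecutor, workers=2):
    """Run func over jobs on a fixed pool of reusable workers; results keep input order."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(func, jobs))
```

With the default process pool, call run_jobs() from under an if __name__ == '__main__': guard and pass a module-level function such as cpu_heavy, since the function has to be picklable.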
Side note: I'd implement this using HTTP instead of raw TCP to benefit from the notions of request, response, error reporting, existing frameworks, and the ability to use existing clients (curl/wget in command line, your browser, requests in Python). I'd implement this fully asynchronously (no blocking HTTP request), like one request to create a job, and following requests to get the status and the result, like:
$ curl -X POST http://localhost/linalg/jobs/ -d '{your data}'
201 Created
Location: http://localhost/linalg/jobs/1
$ curl -XGET http://localhost/linalg/jobs/1
200 OK
{"status": "queued"}
Some time later…
$ curl -XGET http://localhost/linalg/jobs/1
200 OK
{"status": "in progress"}
Some time later…
$ curl -XGET http://localhost/linalg/jobs/1
200 OK
{"status": "done", "result": "..."}
To implement this there's a lot of nice work already done, typically aiohttp, apistar, and so on.
In Celery I want to get the task status of all tasks with a specific task name. For that I tried the code below.
import celery.events.state
# Celery status instance.
stat = celery.events.state.State()
# task_by_type will return list of tasks.
query = stat.tasks_by_type("my_task_name")
# Print tasks.
print(query)
Now I'm getting an empty list from this code.
celery.events.state.State() is a data-structure used to keep track of the state of celery workers and tasks. When calling State(), you get an empty state object with no data.
You should use app.events.Receiver (stream processing) or celery.events.snapshot (batch processing) to capture state that contains tasks.
Sample Code:
from celery import Celery
def my_monitor(app):
    state = app.events.State()
    def announce_failed_tasks(event):
        state.event(event)
        # task name is sent only with -received event, and state
        # will keep track of this for us.
        task = state.tasks.get(event['uuid'])
        print('TASK FAILED: %s[%s] %s' % (
            task.name, task.uuid, task.info(),))
    with app.connection() as connection:
        recv = app.events.Receiver(connection, handlers={
            'task-failed': announce_failed_tasks,
            '*': state.event,
        })
        recv.capture(limit=None, timeout=None, wakeup=True)
if __name__ == '__main__':
    app = Celery(broker='amqp://guest@localhost//')
    my_monitor(app)
This isn't natively supported. Depending on the backend (Mongo, Redis, etc), you may or may not be able to introspect the contents of a queue and find out what's in it. Even if you do, you'll miss items currently in progress.
That said, you could manage this yourself:
from celery.result import AsyncResult
result = mytask.delay(...)
my_datastore.save("mytask", result.id)
...
for id in my_datastore.find(task="mytask"):
    res = AsyncResult(id)
    print(res.state)
In Celery you can easily find the status of a task by accessing it through its task ID if you want to access it from another function.
Sample Code:
@task(name='Sum_of_digits')
def ABC(x,y):
    return x+y
Add this task for processing:
res = ABC.delay(1, 2)
Now use the task result res to fetch the state, status and result (res.get()):
print(f"id={res.id}, state={res.state}, status={res.status}")