How can I dynamically create a new process in Python?

This is my main function. If I receive a new offer, I need to check the payment; I have a HandleNewOffer() function for that. But the problem with this code happens if there are two (or more) offers at the same time: one of the buyers has to wait until the other transaction closes. So is it possible to spawn a new process running HandleNewOffer() and kill it when it's done, so that several transactions can be handled at the same time? Thank you in advance.
def handler():
    try:
        conn = k.call('GET', '/api/').json()  # connect
        response = conn.call('GET', '/api/notifications/').json()
        notifications = response['data']
        for notification in notifications:
            if notification['contact']:
                HandleNewOffer(notification)  # need to dynamically start a new process per notification
    except Exception as err:
        error = 'Error'
        Send(error)

I'd recommend using the pool-of-workers pattern here to limit the number of concurrent calls to HandleNewOffer.
The concurrent.futures module offers a ready-made implementation of this pattern.
from concurrent.futures import ProcessPoolExecutor

def handler():
    with ProcessPoolExecutor() as pool:
        try:
            conn = k.call('GET', '/api/').json()  # connect
            response = conn.call('GET', '/api/notifications/').json()
            # collect notifications to process into a list
            notifications = [n for n in response['data'] if n['contact']]
            # send the list of notifications to the concurrent workers
            results = pool.map(HandleNewOffer, notifications)
            # iterate over the list of results from every HandleNewOffer call
            for result in results:
                print(result)
        except Exception as err:
            error = 'Error'
            Send(error)
This logic will handle as many offers in parallel as your computer has CPU cores.
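By default the pool sizes itself to the machine's CPU count. If you want fewer concurrent HandleNewOffer calls, for instance because the payment API rate-limits you, ProcessPoolExecutor takes a max_workers argument; a minimal variation (the value 4 here is an arbitrary choice, not something from the question):
with ProcessPoolExecutor(max_workers=4) as pool:
    results = pool.map(HandleNewOffer, notifications)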

Related

How to store concurrent.futures ProcessPoolExecutor HTTP responses and process in real time?

I have a project I am working on, and I'm looking to use concurrent.futures' ProcessPoolExecutor to send a high number of HTTP requests. While the code below works great for making the requests, I'm struggling with ideas for processing the information as I get it. I tried inserting it into a sqlite3 database as responses arrived, but it became tricky trying to manage locks and avoid the use of global variables.
Ideally, I'd like to start the pool and, while it is executing, be able to read/store the data. Is this possible or should I take a different route with this...
pool = ProcessPoolExecutor(max_workers=60)
results = list(pool.map(http2_get, urls))

def http2_get(url):
    while True:
        try:
            start_time = millis()
            result = s.get(url, verify=False)
            print(url + " Total took " + str(millis() - start_time) + " ms")
            return result
        except Exception as e:
            print(e, e.__traceback__.tb_lineno)
As you noticed, map will not return until all the processes have finished. I assume that you want to process the data in the main process.
Instead of using map, submit all the tasks and process them as they finish:
from concurrent.futures import ProcessPoolExecutor, as_completed

pool = ProcessPoolExecutor(max_workers=60)
futures_list = [pool.submit(http2_get, url) for url in urls]

for future in as_completed(futures_list):
    exception = future.exception()
    if exception is not None:
        # Handle exception in http2_get
        pass
    else:
        result = future.result()
        # process result...
Note that it is cleaner to use the ProcessPoolExecutor as a context manager:
with ProcessPoolExecutor(max_workers=60) as pool:
    futures_list = [pool.submit(http2_get, url) for url in urls]
    for future in as_completed(futures_list):
        exception = future.exception()
        if exception is not None:
            # Handle exception in http2_get
            pass
        else:
            result = future.result()
            # process result...
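To connect this back to the original goal of storing responses as they arrive: since as_completed yields futures in the main process, you can write each result straight into sqlite3 there, with no cross-process locking. A minimal sketch, assuming a table responses(url TEXT, body TEXT) already exists and reusing http2_get from the question:
import sqlite3
from concurrent.futures import ProcessPoolExecutor, as_completed

conn = sqlite3.connect("responses.db")
with ProcessPoolExecutor(max_workers=60) as pool:
    # map each future back to its url so we can store both
    futures = {pool.submit(http2_get, url): url for url in urls}
    for future in as_completed(futures):
        if future.exception() is None:
            result = future.result()
            conn.execute("INSERT INTO responses VALUES (?, ?)",
                         (futures[future], result.text))
            conn.commit()
conn.close()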

How to assign values that are available to threads

I'm currently working on a scraper where I am trying to figure out how I can assign proxies that are available to use, meaning that if I use 5 threads and thread-1 uses proxy A, no other thread should be able to access proxy A; each thread should instead draw randomly from the remaining available proxy pool.
import random
import time
from threading import Thread

import requests

list_op_proxy = [
    "http://test.io:12345",
    "http://test.io:123456",
    "http://test.io:1234567",
    "http://test.io:12345678"
]

session = requests.Session()

def handler(name):
    while True:
        try:
            session.proxies = {
                'https': random.choice(list_op_proxy)
            }
            with session.get("https://stackoverflow.com"):
                print(f"{name} - Yay request made!")
            time.sleep(random.randint(5, 10))
        except requests.exceptions.RequestException as err:
            print(f"Error! Lets try again! {err}")
            continue
        except Exception as err:
            print(f"Error! Lets debug! {err}")
            raise

for i in range(5):
    Thread(target=handler, args=(f'Thread {i}',)).start()
How can I use only proxies that are available and not in use by any thread, "blocking" a proxy so other threads cannot use it, and releasing it once the request is finished?
One way to go about this would be to use a global shared list that holds the currently active proxies, or to remove a proxy from the list and re-add it after the request is finished. You do not have to worry about concurrent access corrupting the list, since CPython's GIL makes the individual list operations atomic.
proxy = random.choice(list_op_proxy)
list_op_proxy.remove(proxy)
session.proxies = {
    'https': proxy
}
# ... do request
list_op_proxy.append(proxy)
You could also do this with a queue, popping and re-adding proxies, to make it more efficient.
Using a Proxy Queue
Another option is to put the proxies into a queue and get() a proxy before each query, removing it from the available proxies, and then put() it back after the request has finished. This is a more efficient version of the list approach above.
First we need to initialize the proxy queue.
import queue

proxy_q = queue.Queue()
for proxy in proxies:
    proxy_q.put(proxy)
Within the handler we then get a proxy from the queue before performing the request, perform the request, and put the proxy back into the queue.
We use block=True so that the queue blocks the thread if no proxy is currently available. Otherwise the thread would terminate with a queue.Empty exception once all proxies were in use and a new one had to be acquired.
def handler(name):
    global proxy_q
    while True:
        proxy = proxy_q.get(block=True)  # we want blocking behaviour
        # ... do request
        proxy_q.put(proxy)
        # ... response handling can be done after the proxy is put back,
        # so the proxy is not blocked longer than required
        # do not forget to define a break condition
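Putting the pieces together, here is a minimal sketch of the threaded version under these assumptions (it reuses list_op_proxy and the placeholder URL from the question; a real handler would also catch request errors and define a break condition):
import queue
import random
import time
from threading import Thread

import requests

proxy_q = queue.Queue()
for proxy in list_op_proxy:
    proxy_q.put(proxy)

def handler(name):
    session = requests.Session()
    while True:
        proxy = proxy_q.get(block=True)  # blocks until a proxy is free
        try:
            session.proxies = {'https': proxy}
            with session.get("https://stackoverflow.com"):
                print(f"{name} - Yay request made!")
        finally:
            proxy_q.put(proxy)  # release the proxy even if the request raised
        time.sleep(random.randint(5, 10))

for i in range(5):
    Thread(target=handler, args=(f'Thread {i}',)).start()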
Using Queue and Multiprocessing
First you would initialize the manager and put all your data into the queue, then initialize another structure for collecting your results (here we initialize a shared list).
import multiprocessing

manager = multiprocessing.Manager()
q = manager.Queue()
for e in entities:
    q.put(e)
print(q.qsize())
results = manager.list()
Then you initialize the scraping processes:
processes = []
for proxy in proxies:
    processes.append(multiprocessing.Process(
        target=scrape_function,
        args=(q, results, proxy),
        daemon=True))
And then start each of them:
for w in processes:
    w.start()
Lastly, you join every process to ensure that the main process does not terminate before the subprocesses finish:
for w in processes:
    w.join()
Inside scrape_function you then simply get one item at a time and perform the request. The queue object in its default configuration raises a queue.Empty error when it is empty, so we use an infinite while loop with a break condition that catches the exception.
def scrape_function(q, results, proxy):
    session = requests.Session()
    session.proxies = {
        'https': proxy
    }
    while True:
        try:
            request_uri = q.get(block=False)
            with session.get(request_uri) as response:
                print(f"{proxy} - Yay request made!")
                results.append(response.text)
            time.sleep(random.randint(5, 10))
        except queue.Empty:
            break
The results of each query are appended to the results list, which is also shared among the different processes.

How to write Python code that will work with threads or coroutines and will complete in deterministic time?

What do I mean by "deterministic time"? For example, AWS offers a service called "AWS Lambda". A process started as a Lambda function has a time limit; after that, the Lambda function stops execution and the task is assumed to have finished with an error. An example task: send data to an HTTP endpoint. Depending on the network connection to the endpoint, or other factors, sending the data can take a long time. If I need to send the same data to many endpoints, the full processing time is one request's time multiplied by the number of endpoints, which increases the chance that the Lambda function will be stopped before all the data has been sent to all the endpoints.
To solve this I need to send the data to the different endpoints in parallel, using threads.
The problem with threads: a started thread can't be stopped. If an HTTP request takes more time than the Lambda time limit allows, the Lambda function is aborted and returns an error. So I need to use a timeout on the HTTP request to abort it if it takes longer than expected.
If the HTTP request is cancelled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so it isn't lost. The time needed to save unprocessed data can be predicted, because I control the storage where the data is saved.
The last part that consumes time is the procedure or loop where threads are scheduled via executor.submit(). If there is only one endpoint, or a small number of them, the time consumed is small and there is no need to control it. But if I am dealing with many endpoints, I have to take this into account.
So basically the full time consists of:
scheduling threads
http request execution
saving unprocessed data
Here is an example of how I manage time using threads:
import concurrent.futures
from functools import partial
import requests
import time

start = time.time()

def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}

def get_data(n):
    return {"wait": n}

def done_cb(a, b, future):
    pass  # save unprocessed data

def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, {"wait": 10})
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass
I was thinking about how I could use the asyncio library instead of threads to do the same thing.
import asyncio
import time
from functools import partial

import requests

start = time.time()

def send_data(data):
    ...

def get_data(n):
    return {"wait": n}

def done_callback(a, b, future):
    pass  # save unprocessed data

def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = loop.run_in_executor(None, send_data, event_data)
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )

_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async code, I use run_in_executor() to create the future object. The main problem I have is that done_callback() is not executed when the thread started by the executor finishes its job, but only once the futures have been "processed" by the asyncio.wait() expression.
Basically I am seeking a way to start executing an asyncio future the way ThreadPoolExecutor starts executing threads, without waiting for the asyncio.wait() expression to call done_callback(). If you have other ideas for writing Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
And another question. If a thread or future finishes its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I have no result, and I have to use functools.partial() to pass extra data into done_callback() that helps me understand which data the callback was called for. If the data passed is small, this is not a problem; if it is big, I need to put the data in a list/dictionary and pass only an index into the callback, or pass the full data itself.
Can I somehow get access to the variable that was passed to the future/thread from a done_callback() that was triggered on a cancelled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
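A minimal sketch of that approach, assuming aiohttp is installed and reusing the local endpoint from the question; the time budget and payload are arbitrary:
import asyncio

import aiohttp

async def send_data(session, url, data):
    async with session.post(url, json=data) as resp:
        return {'status': 'ok' if resp.status == 200 else 'error'}

async def main(urls, data, time_budget):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(send_data(session, url, data))
                 for url in urls]
        try:
            await asyncio.wait_for(asyncio.gather(*tasks), timeout=time_budget)
        except asyncio.TimeoutError:
            # wait_for cancels the gathered tasks on timeout; the endpoints
            # whose tasks were cancelled are where unprocessed data is saved
            unsent = [u for u, t in zip(urls, tasks) if t.cancelled()]
            print('unsent endpoints:', unsent)

loop = asyncio.get_event_loop()
loop.run_until_complete(main(['http://127.0.0.1:5000/endpoint'], {'wait': 1}, 0.5))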

Calling a function within a thread

I am creating a simple TCP server-client script in Python. The server is threaded and forks a new worker thread for every client connection. So far I have pretty much coded the entire server module, but my function handle_clients(), which is forked for every incoming client connection, is getting very long. In order to improve the readability of the code I want to split handle_clients() into multiple small functions. I understand that when I split handle_clients() into smaller functions, the split functions should be wrapped in mutex locks to synchronize shared usage between multiple handle_clients() threads. Doing this would actually reduce the efficiency of the program, because handle_clients() would have to wait for other threads to unlock the shared functions before using them. My other thought was to create these smaller functions as threads within the handle_clients() thread and wait for them to finish using Thread.join() before continuing. Is there a better way to do this?
My code:
#!/usr/bin/python
import socket
import threading
import pandas as pd

class TCPServer(object):
    NUMBER_OF_THREADS = 0
    BUFFER = 4096
    threads_list = []

    def __init__(self, port, hostname):
        self.socket = socket.socket(
            family=socket.AF_INET, type=socket.SOCK_STREAM)
        self.socket.bind((hostname, port))

    def listen_for_clients(self):
        self.socket.listen(5)
        while True:
            client, address = self.socket.accept()
            client_ID = client.recv(TCPServer.BUFFER)
            print(f'Connected to client: {client_ID}')
            if client_ID:
                TCPServer.NUMBER_OF_THREADS = TCPServer.NUMBER_OF_THREADS + 1
                thread = threading.Thread(
                    target=TCPServer.create_worker, args=(self, client, address, client_ID))
                TCPServer.threads_list.append(thread)
                thread.start()
            if TCPServer.NUMBER_OF_THREADS > 2:
                break
        TCPServer.wait_for_workers()

    def wait_for_workers():
        for thread in TCPServer.threads_list:
            thread.join()

    def create_worker(self, client, address, client_ID):
        print(f'Spawned a new worker for {client_ID}. Worker #: {TCPServer.NUMBER_OF_THREADS}')
        data_list = []
        data_frame = pd.DataFrame()
        client.send("SEND_REQUEST_TYPE".encode())
        request_type = client.recv(TCPServer.BUFFER).decode('utf-8')
        if request_type == 'KMEANS':
            print(f'Client: REQUEST_TYPE {request_type}')
            client.send("SEND_DATA".encode())
            while True:
                data = client.recv(TCPServer.BUFFER).decode('utf-8')
                if data == 'ROW':
                    client.send("OK".encode())
                    while True:
                        data = client.recv(TCPServer.BUFFER).decode('utf-8')
                        print(f'Client: {data}')
                        if data == 'ROW_END':
                            print('Data received: ', data_list)
                            series = pd.Series(data_list)
                            data_frame.append(series, ignore_index=True)
                            data_list = []
                            client.send("OK".encode())
                            break
                        else:
                            data_list.append(int(data))
                            client.send("OK".encode())
                elif data == 'DATA_END':
                    client.send("WAIT".encode())
                    # (Vino) pass data to algorithm
                    print('Data received from client {client_ID}: ', data_frame)
        elif request_type == 'NEURALNET':
            pass
        elif request_type == 'LINRIGRESSION':
            pass
        elif request_type == 'LOGRIGRESSION':
            pass

def main():
    port = input("Port: ")
    server = TCPServer(port=int(port), hostname='localhost')
    server.listen_for_clients()

if __name__ == '__main__':
    main()
Note: the following block of code is repetitive and will be used multiple times within the handle_client() function.
while True:
    data = client.recv(TCPServer.BUFFER).decode('utf-8')
    if data == 'ROW':
        client.send("OK".encode())
        while True:
            data = client.recv(TCPServer.BUFFER).decode('utf-8')
            print(f'Client: {data}')
            if data == 'ROW_END':
                print('Data received: ', data_list)
                series = pd.Series(data_list)
                data_frame.append(series, ignore_index=True)
                data_list = []
                client.send("OK".encode())
                break
            else:
                data_list.append(int(data))
                client.send("OK".encode())
    elif data == 'DATA_END':
        client.send("WAIT".encode())
        # (Vino) pass data to algorithm
        print('Data received from client {client_ID}: ', data_frame)
This is the block I want to place in a separate function and call from within the handle_client() thread.
Your code is already long, so I won't dive into it, but I'll try to keep things general.
I do understand that when I split handle_client() into smaller functions, the split functions should be wrapped around mutex locks.
That's not directly true: between threads you already have to use locks to guard against concurrent memory writes, regardless of how you split your functions.
The server is threaded
Looks like you're doing CPU-intensive work (I see LINALG, NEURALNET, ...); it is not logical to use threads in Python to dispatch CPU-intensive loads, as the GIL will serialize CPU usage between your threads.
The way to parallelize CPU intensive work in Python is to use processes.
Processes do not share memory, so you'll be able to manipulate variables freely without mutexes; but they won't be shared at all, so I hope your jobs are independent, as they can't share any state.
If you need to share state, avoid locks: they are complicated to handle, they are the road to deadlocks, and they hurt readability. Try to implement your state sharing with queues instead, as a pipeline of jobs: each worker pulls from a queue, does its work, and pushes the result to another queue (see the sketch below). This keeps things clear and easy to understand. Plus, there are queue implementations for both threads and processes, so you'll be able to switch between the two almost seamlessly.
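A minimal sketch of that pipeline, assuming jobs are picklable; do_cpu_work() is a hypothetical stand-in for the actual per-job computation:
import multiprocessing

def do_cpu_work(job):
    return job  # placeholder for the real computation

def worker(in_q, out_q):
    for job in iter(in_q.get, None):  # None acts as the shutdown sentinel
        out_q.put(do_cpu_work(job))

if __name__ == '__main__':
    in_q, out_q = multiprocessing.Queue(), multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(in_q, out_q))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for job in range(10):
        in_q.put(job)
    for _ in range(len(workers)):
        in_q.put(None)  # one sentinel per worker
    results = [out_q.get() for _ in range(10)]
    for w in workers:
        w.join()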
if TCPServer.NUMBER_OF_THREADS > 2:
    break
Hey, you're breaking out of your main loop when you have more than two threads, exiting your main process and killing your server; I bet that's not what you want. Oh, and if you use processes instead of threads, you should prefork a pool of them, as their creation costs more than a thread's, and reuse them: a process can take another job after finishing one, it does not have to die (typically, use queues to send jobs to your processes).
Side note: I'd implement this using HTTP instead of raw TCP, to benefit from the notions of request, response, and error reporting, from existing frameworks, and from the ability to use existing clients (curl/wget on the command line, your browser, requests in Python). I'd implement it fully asynchronously (no blocking HTTP request): one request to create a job, and subsequent requests to get the status and the result, like:
$ curl -X POST http://localhost/linalg/jobs/ -d '{your data}'
201 Created
Location: http://localhost/linalg/jobs/1
$ curl -XGET http://localhost/linalg/jobs/1
200 OK
{"status": "queued"}
Some time later…
$ curl -XGET http://localhost/linalg/jobs/1
200 OK
{"status": "in progress"}
Some time later…
$ curl -XGET http://localhost/linalg/jobs/1
200 OK
{"status": "done", "result": "..."}
To implement this, a lot of nice groundwork already exists, typically aiohttp, apistar, and so on.
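For illustration, a rough sketch of the job-style API with aiohttp; the in-memory jobs dict, the run_kmeans() placeholder, and the URL layout are all assumptions, not an existing API:
import asyncio
from concurrent.futures import ProcessPoolExecutor

from aiohttp import web

jobs = {}
pool = ProcessPoolExecutor()

def run_kmeans(data):
    return {'centroids': []}  # hypothetical placeholder for the real algorithm

async def create_job(request):
    data = await request.json()
    job_id = str(len(jobs) + 1)
    # dispatch the CPU-intensive work to a process and return immediately
    jobs[job_id] = asyncio.get_event_loop().run_in_executor(pool, run_kmeans, data)
    return web.json_response({'status': 'queued'}, status=201,
                             headers={'Location': f'/linalg/jobs/{job_id}'})

async def get_job(request):
    future = jobs.get(request.match_info['id'])
    if future is None:
        return web.json_response({'error': 'not found'}, status=404)
    if not future.done():
        return web.json_response({'status': 'in progress'})
    return web.json_response({'status': 'done', 'result': future.result()})

app = web.Application()
app.add_routes([web.post('/linalg/jobs/', create_job),
                web.get('/linalg/jobs/{id}', get_job)])
web.run_app(app)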

How to receive multiple request in a Tornado application

I have a Tornado web application; this app can receive GET and POST requests from the client.
The POST requests put the received information in a Tornado Queue; then I pop this information from the queue and use it to perform an operation on the database. This operation can be very slow: it can take several seconds to complete!
In the meantime, while this database operation goes on, I want to be able to receive other POSTs (which put other information in the queue) and GETs. The GETs are very fast and must return their result to the client immediately.
The problem is that when I pop from the queue and the slow operation begins, the server doesn't accept other requests from the client. How can I resolve this?
This is the simplified code I have written so far (imports are omitted to avoid a wall of text):
# URLs are defined in a config file
application = tornado.web.Application([
    (BASE_URL, Variazioni),
    (ARTICLE_URL, Variazioni),
    (PROMO_URL, Variazioni),
    (GET_FEEDBACK_URL, Feedback)
])

class Server:
    def __init__(self):
        http_server = tornado.httpserver.HTTPServer(application, decompress_request=True)
        http_server.bind(8889)
        http_server.start(0)
        transactions = TransactionsQueue()  # contains the queue and the functions which interact with it
        IOLoop.instance().add_callback(transactions.process)

    def start(self):
        try:
            IOLoop.instance().start()
        except KeyboardInterrupt:
            IOLoop.instance().stop()

if __name__ == "__main__":
    server = Server()
    server.start()

class Variazioni(tornado.web.RequestHandler):
    ''' Handle the POST request. Put the data received in the queue. '''
    @gen.coroutine
    def post(self):
        TransactionsQueue.put(self.request.body)
        self.set_header("Location", FEEDBACK_URL)

class TransactionsQueue:
    ''' Handle the queue that contains the data.
        When a new request arrives, the generated uuid is put in the queue.
        When the data is popped out, the operation on the database begins.
    '''
    queue = Queue(maxsize=3)

    @staticmethod
    def put(request_uuid):
        ''' Insert the uuid in the queue in postgres format '''
        TransactionsQueue.queue.put(request_uuid)

    @gen.coroutine
    def process(self):
        ''' Loop over the queue and load the data in the database '''
        while True:
            # request_uuid is in postgres format
            transaction = yield TransactionsQueue.queue.get()
            try:
                # this is the slow operation on the database
                yield self._load_json_in_db(transaction)
            finally:
                TransactionsQueue.queue.task_done()
Moreover, I don't understand why, if I do 5 POSTs in a row, all five pieces of data are put in the queue even though the maximum size is 3.
I'm going to guess that you use a synchronous database driver, so _load_json_in_db, although it is a coroutine, is not actually async. Therefore it blocks the entire event loop until the long operation completes. That's why the server doesn't accept more requests until the operation is finished.
Since _load_json_in_db blocks the event loop, Tornado can't accept more requests while it's running, so your queue never grows to its max size.
You need two fixes.
First, use an async database driver written specifically for Tornado, or run database operations on threads using Tornado's ThreadPoolExecutor.
Once that's done your application will be able to fill the queue, so second, TransactionsQueue.put must do:
TransactionsQueue.queue.put_nowait(request_uuid)
This throws an exception if there are already 3 items in the queue, which I think is what you intend.
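For the first fix, a rough sketch of the thread-pool route using Tornado's IOLoop.run_in_executor (available in Tornado 5+); _load_json_in_db_blocking is a hypothetical synchronous version of the slow database call:
from concurrent.futures import ThreadPoolExecutor

from tornado import gen
from tornado.ioloop import IOLoop

executor = ThreadPoolExecutor(max_workers=4)

class TransactionsQueue:
    queue = Queue(maxsize=3)

    @staticmethod
    def put(request_uuid):
        TransactionsQueue.queue.put_nowait(request_uuid)

    @gen.coroutine
    def process(self):
        while True:
            transaction = yield TransactionsQueue.queue.get()
            try:
                # run the blocking call on a worker thread so the IOLoop
                # stays free to accept new requests
                yield IOLoop.current().run_in_executor(
                    executor, self._load_json_in_db_blocking, transaction)
            finally:
                TransactionsQueue.queue.task_done()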
