I'm using multiprocessing.dummy.Pool to issue RESTful API calls in parallel.
For now the code looks like:
from multiprocessing.dummy import Pool

def onecall(args):
    env = args[0]
    option = args[1]
    return env.call(option)  # call() returns a list

def call_all():
    threadpool = Pool(processes=4)
    all_item = []
    for item in threadpool.imap_unordered(onecall, ((create_env(), x) for x in range(100))):
        all_item.extend(item)
    return all_item
In the code above, the env object wraps a requests.Session() object and is therefore in charge of maintaining the connection session. The 100 tasks use 100 different env objects, so each task just creates one connection, makes one API call, and disconnects.
However, to enjoy the benefit of HTTP keep-alive, I want the 100 tasks to share 4 env objects (one object per thread) so that each connection serves multiple API calls one by one. How should I achieve that?
Using threading.local seems to work.
from multiprocessing.dummy import Pool
import threading

tlocal = threading.local()

def getEnv():
    try:
        return tlocal.env
    except AttributeError:
        tlocal.env = create_env()
        return tlocal.env

def onecall(args):
    option = args[0]
    return getEnv().call(option)  # call() returns a list

def call_all():
    threadpool = Pool(processes=4)
    all_item = []
    for item in threadpool.imap_unordered(onecall, ((x,) for x in range(100))):
        all_item.extend(item)
    return all_item
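An alternative sketch, assuming the same create_env() helper from the question: multiprocessing.dummy.Pool (which is multiprocessing.pool.ThreadPool) accepts an initializer, so each of the 4 worker threads can populate its thread-local env once when the pool starts, and onecall no longer needs to check for it.

from multiprocessing.dummy import Pool
import threading

tlocal = threading.local()

def init_env():
    # Runs once in every worker thread when the pool starts,
    # so each thread keeps its own env (and its own keep-alive session).
    tlocal.env = create_env()

def onecall(option):
    return tlocal.env.call(option)  # call() returns a list

def call_all():
    threadpool = Pool(processes=4, initializer=init_env)
    all_item = []
    for item in threadpool.imap_unordered(onecall, range(100)):
        all_item.extend(item)
    return all_item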
I have been struggling with a pickle error that is driving me crazy. I have the following master Engine class:
import eventlet
import socketio
import multiprocessing
from multiprocessing import Queue
from multi import SIOSerever

class masterEngine:
    if __name__ == '__main__':
        serverObj = SIOSerever()
        try:
            receiveData = multiprocessing.Process(target=serverObj.run)
            receiveData.start()
            receiveProcess = multiprocessing.Process(target=serverObj.fetchFromQueue)
            receiveProcess.start()
            receiveData.join()
            receiveProcess.join()
        except Exception as error:
            print(error)
and another file called multi that looks like the following:
import multiprocessing
from multiprocessing import Queue
import eventlet
import socketio

class SIOSerever:
    def __init__(self):
        self.cycletimeQueue = Queue()
        self.sio = socketio.Server(cors_allowed_origins='*', logger=False)
        self.app = socketio.WSGIApp(self.sio, static_files={'/': 'index.html'})
        self.ws_server = eventlet.listen(('0.0.0.0', 5000))

        @self.sio.on('production')
        def p_message(sid, message):
            self.cycletimeQueue.put(message)
            print("I logged : " + str(message))

    def run(self):
        eventlet.wsgi.server(self.ws_server, self.app)

    def fetchFromQueue(self):
        while True:
            cycle = self.cycletimeQueue.get()
            print(cycle)
As you can see, I am trying to create two processes for run and fetchFromQueue, which I want to run independently.
My run function starts the python-socketio server to which I'm sending some data from an HTML web page (this runs perfectly without multiprocessing). I am then trying to push the received data into a Queue so that my other function can retrieve it and play with it.
I have a set of time-consuming operations that I need to carry out on the data received from the socket, which is why I'm pushing it all into a Queue.
On running the master Engine class I receive the following:
Can't pickle <class 'threading.Thread'>: it's not the same object as threading.Thread
I ended!
[Finished in 0.5s]
Can you please help with what I am doing wrong?
From the multiprocessing programming guidelines:
Explicitly pass resources to child processes
On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
Apart from making the code (potentially) compatible with Windows and the other start methods this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.
Therefore, I slightly modified your example by removing everything unnecessary, but showing an approach where the shared queue is explicitly passed to all processes that use it:
import multiprocessing

MAX = 5

class SIOSerever:
    def __init__(self, queue):
        self.cycletimeQueue = queue

    def run(self):
        for i in range(MAX):
            self.cycletimeQueue.put(i)

    @staticmethod
    def fetchFromQueue(cycletimeQueue):
        while True:
            cycle = cycletimeQueue.get()
            print(cycle)
            if cycle >= MAX - 1:
                break

def start_server(queue):
    server = SIOSerever(queue)
    server.run()

if __name__ == '__main__':
    try:
        queue = multiprocessing.Queue()
        receiveData = multiprocessing.Process(target=start_server, args=(queue,))
        receiveData.start()
        receiveProcess = multiprocessing.Process(target=SIOSerever.fetchFromQueue, args=(queue,))
        receiveProcess.start()
        receiveData.join()
        receiveProcess.join()
    except Exception as error:
        print(error)
0
1
...
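Transferred back to the original multi module, a rough sketch of the same idea (untested, and only using the socketio/eventlet calls already shown in the question) is to build the server entirely inside the child process and share nothing but the queue:

import multiprocessing

def start_server(queue):
    # Everything unpicklable (socketio server, eventlet listener) is created
    # here, inside the child process, so it never crosses the process boundary.
    import eventlet
    import socketio

    sio = socketio.Server(cors_allowed_origins='*', logger=False)
    app = socketio.WSGIApp(sio, static_files={'/': 'index.html'})

    @sio.on('production')
    def p_message(sid, message):
        queue.put(message)

    eventlet.wsgi.server(eventlet.listen(('0.0.0.0', 5000)), app)

def fetch_from_queue(queue):
    while True:
        print(queue.get())

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    receiveData = multiprocessing.Process(target=start_server, args=(queue,))
    receiveProcess = multiprocessing.Process(target=fetch_from_queue, args=(queue,))
    receiveData.start()
    receiveProcess.start()
    receiveData.join()
    receiveProcess.join()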
I am porting a simple Python 3 script to AWS Lambda.
The script is simple: it gathers information from a dozen S3 objects and returns the results.
The script used multiprocessing.Pool to gather all the files in parallel, but multiprocessing cannot be used in an AWS Lambda environment since /dev/shm is missing.
So I thought instead of writing a dirty multiprocessing.Process / multiprocessing.Queue replacement, I would try asyncio instead.
I am using the latest version of aioboto3 (8.0.5) on Python 3.8.
My problem is that I cannot seem to gain any improvement between a naive sequential download of the files, and an asyncio event loop multiplexing the downloads.
Here are the two versions of my code.
import sys
import asyncio
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import boto3
import aioboto3

BUCKET = 'some-bucket'
KEYS = [
    'some/key/1',
    [...]
    'some/key/10',
]

async def download_aio():
    """Concurrent download of all objects from S3"""
    async with aioboto3.client('s3') as s3:
        objects = [s3.get_object(Bucket=BUCKET, Key=k) for k in KEYS]
        objects = await asyncio.gather(*objects)
        buffers = await asyncio.gather(*[o['Body'].read() for o in objects])

def download():
    """Sequentially download all objects from S3"""
    s3 = boto3.client('s3')
    for key in KEYS:
        object = s3.get_object(Bucket=BUCKET, Key=key)
        object['Body'].read()

def run_sequential():
    download()

def run_concurrent():
    loop = asyncio.get_event_loop()
    # loop.set_default_executor(ProcessPoolExecutor(10))
    # loop.set_default_executor(ThreadPoolExecutor(10))
    loop.run_until_complete(download_aio())
The timing for run_sequential() and run_concurrent() is quite similar (~3 seconds for a dozen 10 MB files).
I am convinced the concurrent version is not actually concurrent, for multiple reasons:
I tried switching to a ProcessPoolExecutor / ThreadPoolExecutor, and the processes/threads are spawned for the duration of the function, though they are doing nothing
The timing between sequential and concurrent is very close, though my network interface is definitely not saturated and the CPU is not bound either
The time taken by the concurrent version increases linearly with the number of files.
I am sure something is missing, but I just can't wrap my head around what.
Any ideas?
After losing some hours trying to understand how to use aioboto3 correctly, I decided to just switch to my backup solution.
I ended up rolling my own naive version of multiprocessing.Pool for use within an AWS Lambda environment.
If someone stumbles across this thread in the future, here it is. It is far from perfect, but easy enough to drop in as a replacement for multiprocessing.Pool in my simple cases.
from multiprocessing import Process, Pipe
from multiprocessing.connection import wait

class Pool:
    """Naive implementation of a process pool with mp.Pool API.

    This is useful since multiprocessing.Pool uses a Queue in /dev/shm, which
    is not mounted in an AWS Lambda environment.
    """

    def __init__(self, process_count=1):
        assert process_count >= 1
        self.process_count = process_count

    @staticmethod
    def wrap_pipe(pipe, index, func):
        def wrapper(args):
            try:
                result = func(args)
            except Exception as exc:  # pylint: disable=broad-except
                result = exc
            pipe.send((index, result))
        return wrapper

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        pass

    def map(self, function, arguments):
        pending = list(enumerate(arguments))
        running = []
        finished = [None] * len(pending)

        while pending or running:
            # Fill the running queue with new jobs
            while len(running) < self.process_count:
                if not pending:
                    break
                index, args = pending.pop(0)

                pipe_parent, pipe_child = Pipe(False)
                process = Process(
                    target=Pool.wrap_pipe(pipe_child, index, function),
                    args=(args,))
                process.start()

                running.append((index, process, pipe_parent))

            # Wait for jobs to finish
            for pipe in wait(list(map(lambda t: t[2], running))):
                index, result = pipe.recv()

                # Remove the finished job from the running list
                running = list(filter(lambda x: x[0] != index, running))

                # Add the result to the finished list
                finished[index] = result

        return finished
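For completeness, a usage sketch; process_file is a hypothetical placeholder for your own top-level function. Note that the wrapped closure passed to Process is not picklable, so this relies on the fork start method (the default on Linux, which is what Lambda uses):

def process_file(key):
    # hypothetical per-item work; replace with your own function
    return len(key)

if __name__ == '__main__':
    with Pool(process_count=4) as pool:
        results = pool.map(process_file, ['some/key/1', 'some/key/2', 'some/key/3'])
    print(results)  # results are returned in the same order as the arguments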
It's 1.5 years later and aioboto3 is still not well documented or supported.
The multithreading option is good, but AIO is an easier and cleaner implementation.
I don't actually know what's wrong with your AIO code; it doesn't even run now, I guess because of library updates. But using aiobotocore, this code worked. My test was with 100 images: the sequential code takes about 8 seconds on average, while with async IO it was less than 2.
With 1000 images it was 17 seconds.
import asyncio
from aiobotocore.session import get_session

async def download_aio(s3, bucket, file_name):
    o = await s3.get_object(Bucket=bucket, Key=file_name)
    x = await o['Body'].read()

async def run_concurrent():
    tasks = []
    session = get_session()
    async with session.create_client('s3') as s3:
        for k in KEYS[:100]:
            tasks.append(asyncio.ensure_future(download_aio(s3, BUCKET, k)))
        await asyncio.gather(*tasks)
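To run and time it (assuming BUCKET and KEYS are defined as in the question), something along these lines should do:

import time

start = time.perf_counter()
asyncio.run(run_concurrent())
print(f"downloaded {len(KEYS[:100])} objects in {time.perf_counter() - start:.1f}s")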
I am trying to simulate an environment with VMs and trying to run an object method in a background thread. My code looks like the following.
hyper_v.py file:
import random
from threading import Thread
from virtual_machine import VirtualMachine

class HyperV(object):
    def __init__(self, hyperv_name):
        self.hyperv_name = hyperv_name
        self.vms_created = {}

    def create_vm(self, vm_name):
        if vm_name not in self.vms_created:
            vm1 = VirtualMachine({'vm_name': vm_name})
            self.vms_created[vm_name] = vm1
            vm1.boot()
        else:
            print('VM:', vm_name, 'already exists')

    def get_vm_stats(self, vm_name):
        print('vm stats of ', vm_name)
        print(self.vms_created[vm_name].get_values())

if __name__ == '__main__':
    hv = HyperV('temp')
    vm_name = 'test-vm'

    hv.create_vm(vm_name)

    print('getting vm stats')
    th2 = Thread(name='vm1_stats', target=hv.get_vm_stats(vm_name))
    th2.start()
virtual_machine.py file in the same directory:
import random, time, uuid, json
from threading import Thread

class VirtualMachine(object):
    def __init__(self, interval=2, *args, **kwargs):
        self.vm_id = str(uuid.uuid4())
        #self.vm_name = kwargs['vm_name']
        self.cpu_percentage = 0
        self.ram_percentage = 0
        self.disk_percentage = 0
        self.interval = interval

    def boot(self):
        print('Bootingup', self.vm_id)

        th = Thread(name='vm1', target=self.update())
        th.daemon = True  # Setting the thread as daemon thread to run in background
        print(th.isDaemon())  # This prints true
        th.start()

    def update(self):
        # This method needs to run in the background simulating an actual vm with changing values.
        i = 0
        while(i < 5):  # Added counter for debugging, ideally this would be while(True)
            i += 1
            time.sleep(self.interval)
            print('updating', self.vm_id)
            self.cpu_percentage = round(random.uniform(0, 100), 2)
            self.ram_percentage = round(random.uniform(0, 100), 2)
            self.disk_percentage = round(random.uniform(0, 100), 2)

    def get_values(self):
        return_json = {'cpu_percentage': self.cpu_percentage,
                       'ram_percentage': self.ram_percentage,
                       'disk_percentage': self.disk_percentage}
        return json.dumps(return_json)
The idea is to create a thread that keeps updating the values; on request, we read the values of the vm object by calling vm_obj.get_values(). We would be creating multiple vm objects to simulate multiple VMs running in parallel, and we need to get the information from a particular VM on request.
The problem I am facing is that the update() function of the VM does not run in the background (even though the thread is set as a daemon thread).
The method call hv.get_vm_stats(vm_name) waits for the completion of vm_object.update() (which is called by vm_object.boot()) and only then prints the stats. I would like to get the stats of the VM on request while keeping vm_object.update() running in the background forever.
Please share your thoughts if I am overlooking anything related to the basics. I tried looking into issues related to the Python threading library but could not come to any conclusion. Any help is greatly appreciated. The next step would be a REST API to call these functions to get the data of any VM, but I am stuck with this problem.
Thanks in advance,
As pointed out by @Klaus D in the comments, my mistake was using the parentheses when specifying the target function in the thread definition, which resulted in the function being called right away.
target=self.update() will call the method right away. Remove the () to
hand the method over to the thread without calling it.
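A tiny self-contained illustration of the difference (with a hypothetical work function, not from the question):

import threading
import time

def work():
    time.sleep(1)
    print('done in background')

# Wrong: work() is executed immediately in the current thread,
# and its return value (None) becomes the thread's target.
t_wrong = threading.Thread(target=work())

# Right: pass the callable itself; it only runs once the thread is started.
t_right = threading.Thread(target=work)
t_right.start()
t_right.join()

Applied to the question's code, that means target=hv.get_vm_stats with args=(vm_name,) in hyper_v.py, and target=self.update in virtual_machine.py.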
from multiprocessing.dummy import Pool as ThreadPool
# The original snippet omits its imports and the Flask app object; they are
# assumed here (threading, logging, json, os, deque, redis, Flask) so that the
# code below is complete.
import threading
import logging
import json
import os
from collections import deque

import redis
from flask import Flask, request

app = Flask(__name__)

class TSNew:
    def __init__(self):
        self.redis_client = redis.StrictRedis(host="172.17.31.147", port=4401, db=0)
        self.global_switch = 0
        self.pool = ThreadPool(40)  # init pool
        self.dnn_model = None
        self.nnf = None
        self.md5sum_nnf = "initialize"
        self.thread = threading.Thread(target=self.load_model_item)
        self.ts_picked_ids = None
        self.thread.start()
        self.memory = deque(maxlen=3000)
        self.process = threading.Thread(target=self.process_user_dict)
        self.process.start()

    def load_model_item(self):
        '''
        code
        '''

    def predict_memcache(self, user_dict):
        '''
        code
        '''

    def process_user_dict(self):
        while True:
            '''
            code to generate user_dicts which is a list
            '''
            results = self.pool.map(self.predict_memcache, user_dicts)
            '''
            code
            '''

TSNew_ = TSNew()

def get_user_result():
    logging.info("----------------come in ------------------")
    if request.method == 'POST':
        user_dict_json = request.get_data()  # userid
        if user_dict_json == '' or user_dict_json is None:
            logging.info("----------------user_dict_json is ''------------------")
            return ''
        try:
            user_dict = json.loads(user_dict_json)
        except:
            logging.info("json load error, pass")
            return ''
        TSNew_.memory.append(user_dict)
        logging.info('add to deque TSNew_.memory size: %d PID: %d', len(TSNew_.memory), os.getpid())
        logging.info("add to deque userid: %s, nation: %s \n", user_dict['user_id'], user_dict['user_country'])
        return 'SUCCESS\n'

@app.route('/', methods=['POST'])
def get_ts_gbdt_id():
    return get_user_result()

from werkzeug.contrib.fixers import ProxyFix
app.wsgi_app = ProxyFix(app.wsgi_app)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=4444)
I create a multithread pool in the class __init__ and use self.pool to map the function predict_memcache.
I have two doubts:
(a) Should I initialize the pool in __init__, or just init it right before
results = self.pool.map(self.predict_memcache, user_dicts)
(b) Since the pool is a multithread operation and it is executed in the process_user_dict thread, is there any hidden error?
Thanks.
Question (a):
It depends. If you need to run process_user_dict more than once, then it makes sense to start the pool in the constructor and keep it running. Creating a thread pool always comes with some overhead and by keeping the pool alive between calls to process_user_dict you would avoid that additional overhead.
If you just want to process one set of input, you can as well create your pool right inside process_user_dict. But probably not right before results = self.pool.map(self.predict_memcache, user_dicts) because that would create a pool for every iteration of your surrounding while loop.
In your specific case, it does not make any difference. You create your TSNew_ object on module-level, so that it remains alive (and with it the thread pool) while your app is running; the same thread pool from the same TSNew instance is used to process all the requests during the lifetime of app.run().
Since you seem to be using that construct with self.process = threading.Thread(target=self.process_user_dict) as some sort of listener on self.memory, creating the pool in the constructor is functionally equivalent to creating the pool inside of process_user_dict (but outside the loop).
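For illustration, the functionally equivalent placement mentioned above would look roughly like this (only a sketch of the method; the ThreadPool import and the rest of the class stay as in the question):

    def process_user_dict(self):
        # Created once, before the loop: equivalent to creating it in __init__,
        # since this listener thread runs for the lifetime of the object anyway.
        pool = ThreadPool(40)
        while True:
            # ... code to generate user_dicts (a list) ...
            results = pool.map(self.predict_memcache, user_dicts)
            # ... code ...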
Question (b):
Technically, there is no hidden error by default when creating a thread inside a thread. In the end, any additional thread's ultimate parent is always the MainThread, that is implicitly created for every instance of a Python interpreter. Basically, every time you create a thread inside a Python program, you create a thread in a thread.
Actually, your code does not even create a thread inside a thread. Your self.pool is created inside the MainThread. When the pool is instantiated via self.pool = ThreadPool(40) it creates the desired number (40) of worker threads, plus one worker handler thread, one task handler thread and one result handler thread. All of these are child threads of the MainThread. All you do with regards to your pool inside your thread under self.process is calling its map method to assign tasks to it.
However, I do not really see the point of what you are doing with that self.process here.
Making a guess, I would say that you want the loop in process_user_dict to act as a kind of listener on self.memory, so that the pool starts processing user_dicts as soon as they start showing up in the deque in self.memory. From what I see you doing in get_user_result, you seem to get one user_dict per request. I understand that you might have concurrent user sessions passing in these dicts, but do you really see a benefit from process_user_dict running in an infinite loop over simply calling TSNew_.process_user_dict() after TSNew_.memory.append(user_dict)? You could even omit self.memory completely and pass the dict directly to process_user_dict, unless I am missing something you did not show us.
I am using threading to speed up collecting data from a website via a RESTful API. I am storing the results in a list that I will walk through later. I have made the list global; however, when I try to print(list) outside of a thread, I see no results. Within a thread, I can print(list) and see that it is appending correctly with all the data. What's the best way to collect this data?
global global_list
global_list = []

pool = ActivePool()
s = threading.Semaphore(3)

def get_data(s, pool, i, list):
    with s:
        name = threading.currentThread().getName()
        pool.makeActive(name)
        data = []
        session = get_session(login info and url)  # function for establishing connection to remote site
        request = {API Call 'i'}
        response = session.post(someurl, json=request)
        session.close()
        data = response.json()
        global_list.append(data)
        pool.makeInactive(name)

def main():
    for i in api_list:
        t = threading.Thread(target=get_data, name=i['uniqueID'], args=(s, pool, i, global_list))
        t.start()
    print(global_list)

if __name__ == '__main__':
    main()
I was able to use a Queue! Per the question linked below, I stored the data I needed with put() within the thread and then collected it into a list with get() outside of the threads.
Return Value From Thread
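A minimal sketch of that pattern (the fetch function is a hypothetical stand-in for the real API call):

import threading
import queue

results_q = queue.Queue()

def fetch(i):
    # hypothetical stand-in for the real API call
    return {'id': i, 'value': i * i}

def worker(i):
    results_q.put(fetch(i))  # Queue.put() is thread-safe

def main():
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Drain the queue into a list once all threads have finished
    all_data = []
    while not results_q.empty():
        all_data.append(results_q.get())
    print(all_data)

if __name__ == '__main__':
    main()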