I am having a hard time working with the Python multiprocessing module.
In a nutshell, I have a dictionary that counts occurrences of certain strings across lots of S3 files. Each key is one of the strings I am looking for, and its value is incremented by 1 every time that string is found.
Sample code:
import boto3
from multiprocessing import Process, Manager
import simplejson
client = boto3.client('s3')
occurences_to_find = ["occ1", "occ2", "occ3"]
list_contents = []
def getS3Key(prefix_name, occurence_dict):
    kwargs = {'Bucket': "bucket_name", 'Prefix': prefix_name}
while True:
value = client.list_objects_v2(**kwargs)
try:
contents = value['Contents']
for obj in contents:
key=obj['Key']
yield key
try:
kwargs['ContinuationToken'] = value['NextContinuationToken']
except KeyError:
break
except KeyError:
break
def getS3Object(s3_key, occurence_dict):
    s3_object = client.get_object(Bucket="bucket_name", Key=s3_key)
    body = s3_object['Body'].read().decode('utf-8')
    for line in body.splitlines():
        line_json = simplejson.loads(line)
        msg = line_json["msg"]
        for occurence in occurence_dict:
            if occurence in msg:
                occurence_dict[occurence] += 1
                break
'''each process will hit this function'''
def doWork(prefix_name_list, occurence_dict):
for prefix_name in prefix_name_list:
for s3_key in getS3Key(prefix_name, occurence_dict):
getS3Object(s3_key, occurence_dict)
def main():
manager = Manager()
'''shared dictionary between processes'''
occurence_dict = manager.dict()
procs = []
s3_prefixes = [["prefix1"], ["prefix2"], ["prefix3"], ["prefix4"]]
for occurrence in occurences_to_find:
occurence_dict[occurrence] = 0
for index,prefix_name_list in enumerate(s3_prefixes):
proc = Process(target=doWork, args=(prefix_name_list, occurence_dict))
procs.append(proc)
for proc in procs:
proc.start()
for proc in procs:
proc.join()
print(occurence_dict)
main()
I am having issues with the speed of the code: it takes hours to run over more than 10000 S3 prefixes and keys. I think the manager dictionary is shared and locked by each process, so it is not updated concurrently; rather, one process waits for it to be "released".
How can I update the dictionary in parallel? Or, how can I maintain a separate dict per process and then combine the results at the end?
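For the second idea (one dict per process, merged at the end), here is a minimal sketch, assuming a hypothetical worker count_occurrences that reuses the getS3Key/getS3Object logic but writes into a private collections.Counter instead of the shared manager dict; the parent then sums the counters once the pool has finished:
from collections import Counter
from multiprocessing import Pool

def count_occurrences(prefix_name_list):
    # Each worker keeps its own private counter; no shared state, no locking.
    local_counts = Counter({occ: 0 for occ in occurences_to_find})
    for prefix_name in prefix_name_list:
        for s3_key in getS3Key(prefix_name, local_counts):
            getS3Object(s3_key, local_counts)   # increments local_counts only
    return local_counts

if __name__ == "__main__":
    s3_prefixes = [["prefix1"], ["prefix2"], ["prefix3"], ["prefix4"]]
    with Pool(processes=len(s3_prefixes)) as pool:
        partial_counts = pool.map(count_occurrences, s3_prefixes)
    # Combine the per-process counters once all workers have finished.
    print(sum(partial_counts, Counter()))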
Goal:
Accelerate the random walk generation by using multiple processes.
Put the list of vertex ids from which I want random walks to be generated into an input queue
Start as many processes as possible with the correct parameters
Make them put the random walks into an output queue
Wait for completion
Read the output queue
What I am doing:
# Libraries imports
from multiprocessing import cpu_count, Process, Queue
import queue
import configparser
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import AnonymousTraversalSource, traversal
from gremlin_python.process.graph_traversal import __
# Function the processes are supposed to execute
def job(proc_id:int, siq:Queue, rwq:Queue, g:AnonymousTraversalSource, length:int):
while True:
try:
# Get next element in ids queue
start_id = siq.get_nowait()
except queue.Empty:
# If the ids queue is empty, then terminate
break
else:
# Do a random walk of length <length> from the vertex with id <start_id>
random_walk = g.V(start_id).repeat(
__.local(__.both().sample(1))
).times(length).path().next()
print(f"{proc_id}: rw obtained")
# Transform the list of vertices into a comma-separated string of ids
rwq.put(",".join(
[str(v.id) for v in random_walk]
))
print(f"{proc_id}: rw handled")
if __name__ == "__main__":
# Get the parameters from the <config.ini> configuration file
config = configparser.RawConfigParser()
config.read("config.ini")
jg_uri = config["JANUSGRAPH"]["URI"]
file_random_walks = config["FILES"]["RANDOM_WALKS"]
walks_nb_per_node = int(config["WALKS"]["NB_PER_NODE"])
walks_length = int(config["WALKS"]["LENGTH"])
# Connect to Janus Graph
connection = DriverRemoteConnection(jg_uri, "g")
g_main = traversal().withRemote(connection)
# Instantiate the queues and populate the ids one
start_ids_queue = Queue()
random_walks_queue = Queue()
for vertex in g_main.V().has("vertex_label", "<label>").fold().next():
start_ids_queue.put(vertex.id)
# Create and start the processes
nb_processes = cpu_count()
processes = []
for i in range(nb_processes):
p = Process(target=job, args=(
i,
start_ids_queue,
random_walks_queue,
g_main,
walks_length
))
processes.append(p)
p.start()
for p in processes:
p.join()
# Once the processes are terminated, read the random walks queue
random_walks = []
while not random_walks_queue.empty():
random_walks.append(random_walks_queue.get())
# Do something with the random walks
...
Issue:
Once the processes are started, nothing seems to happen. I never get the X: rw obtained / X: rw handled messages. With a bit more logging, I can see that the queries have been sent, yet they never finish.
In the logs, when performing the first g_main.V().has("vertex_label", "<label>").fold().next() in the main process (when I populate the ids queue), I have the following messages:
DEBUG:gremlinpython:submit with bytecode '[['V'], ['has', 'vertex_label', 'movie'], ['fold']]'
DEBUG:gremlinpython:message '[['V'], ['has', 'vertex_label', '<label>'], ['fold']]'
DEBUG:gremlinpython:processor='traversal', op='bytecode', args='{'gremlin': [['V'], ['has', 'vertex_label', '<label>'], ['fold']], 'aliases': {'g': 'g'}}'
DEBUG:asyncio:Using selector: EpollSelector
When the other processes send their queries, I have similar logs:
DEBUG:gremlinpython:submit with bytecode '[['V', 16456], ['repeat', [['local', [['both'], ['sample', 1]]]]], ['times', 10], ['path']]'
DEBUG:gremlinpython:message '[['V', 16456], ['repeat', [['local', [['both'], ['sample', 1]]]]], ['times', 10], ['path']]'
DEBUG:gremlinpython:processor='traversal', op='bytecode', args='{'gremlin': [['V', 16456], ['repeat', [['local', [['both'], ['sample', 1]]]]], ['times', 10], ['path']], 'aliases': {'g': 'g'}}'
DEBUG:asyncio:Using selector: EpollSelector
The issue seems not to reside in the query sent, but instead in the indefinite wait that ensues.
If you know of an issue with gremlinpython and multiprocessing, if there is a problem in my multi-processing code, or if you have any explanation that I may have overlooked, please explain to me! Thanks a lot to everyone reading this!
Solutions:
The first partial solution that I found is to use multi-threading instead of multiprocessing:
import configparser
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import AnonymousTraversalSource, traversal
from gremlin_python.process.graph_traversal import __
import threading
import logging
class myThread(threading.Thread):
def __init__(self, thread_id, g, length):
threading.Thread.__init__(self)
self.thread_id = thread_id
self.thread_count = 0
self.gtraversal = g
self.walk_length = length
def run(self):
while True:
start_ids_list_lock.acquire()
try:
start_id = start_ids_list.pop(0)
start_ids_list_lock.release()
except IndexError:
start_ids_list_lock.release()
break
else:
self.thread_count += 1
random_walk = job(
vertex_id=start_id,
g=self.gtraversal,
length=self.walk_length,
)
random_walks_list_lock.acquire()
random_walks_list.append(random_walk)
random_walks_list_lock.release()
logging.info(f"Thread {self.thread_id}: {self.thread_count} done")
def job(vertex_id:int, g:AnonymousTraversalSource, length:int) -> str:
random_walk = g.V(vertex_id).repeat(
__.local(__.both().sample(1))
).times(length).path().next()
return ",".join(random_walk)
config = configparser.RawConfigParser()
config.read("config.ini")
jg_uri = config["JANUSGRAPH"]["URI"]
file_random_walks = config["FILES"]["RANDOM_WALKS"]
walks_length = int(config["WALKS"]["LENGTH"])
connection = DriverRemoteConnection(jg_uri, "g")
g_main = traversal().withRemote(connection)
threads = []
start_ids_list = []
random_walks_list = []
random_walks_list_lock = threading.Lock()
start_ids_list_lock = threading.Lock()
start_ids_list = [vertex.id for vertex in g_main.V().has("vertex_label", "<label>").fold().next()]
nb_vertices = len(start_ids_list)
nb_threads = 6
for i in range(nb_threads):
thread = myThread(
thread_id=i,
g=g_main,
length=walks_length
)
thread.start()
threads.append(thread)
for t in threads:
t.join()
# Do something with the random walks
...
This solution does work and improves the execution time of the program. It isn't a full answer though, as it doesn't explain why multiprocessing is not behaving as I expected.
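One possible explanation is that the DriverRemoteConnection (and the asyncio event loop behind it) is created in the parent process and then inherited by the forked workers, leaving the children waiting on a connection they cannot actually use. A sketch of that idea, untested: open a separate connection inside each worker and pass the URI instead of g_main through args; the queue handling and the main block stay as above.
from multiprocessing import Process, Queue
import queue
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

def job(proc_id: int, siq: Queue, rwq: Queue, jg_uri: str, length: int):
    # Each process builds its own connection and traversal source,
    # so nothing asyncio-related crosses the fork boundary.
    connection = DriverRemoteConnection(jg_uri, "g")
    g = traversal().withRemote(connection)
    try:
        while True:
            try:
                start_id = siq.get_nowait()
            except queue.Empty:
                break
            random_walk = g.V(start_id).repeat(
                __.local(__.both().sample(1))
            ).times(length).path().next()
            rwq.put(",".join(str(v.id) for v in random_walk))
    finally:
        connection.close()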
I'm using the ProcessPoolExecutor context manager to run several Kafka consumers in parallel. I need to store the process IDs of the child processes so that, later, I can cleanly terminate those processes. I have code like this:
class MultiProcessConsumer:
...
def run_in_parallel(self):
parallelism_factor = 5
with ProcessPoolExecutor() as executor:
processes = [executor.submit(self.consume) for _ in range(parallelism_factor)]
# It would be nice If I could write [process.pid for process in processes] to a file here.
def consume(self):
while True:
for message in self.kafka_consumer:
do_stuff(message)
I know I can use os.getpid() in the consume method to get PIDs. But handling them properly (in the case of constantly shutting down or starting up consumers) requires some extra work.
How would you propose that I get and store PIDs of the child processes in such a context?
os.getpid() seems to be the way to go. Just pass the PIDs back through a Queue or Pipe, perhaps together with a random UUID that you pass to the task beforehand so you can match each PID to its submission.
from concurrent.futures import ProcessPoolExecutor
import os
import time
import uuid
#from multiprocessing import Process, Queue
import multiprocessing
import queue
# The Empty exception lives in the queue module;
# multiprocessing borrows it from there
# https://stackoverflow.com/questions/9908781/sharing-a-result-queue-among-several-processes
m = multiprocessing.Manager()
q = m.Queue()
def task(n, queue, uuid):
my_pid = os.getpid()
print("Executing our Task on Process {}".format(my_pid))
queue.put((uuid, my_pid))
time.sleep(n)
return n * n
def main():
with ProcessPoolExecutor(max_workers = 3) as executor:
some_dict = {}
for i in range(10):
print(i)
u = uuid.uuid4()
f = executor.submit(task, i, q, u)
some_dict[u] = [f, None] # PID not known here
try:
rcv_uuid, rcv_pid = q.get(block=True, timeout=1)
some_dict[rcv_uuid][1] = rcv_pid # store PID
except queue.Empty as e:
print('handle me', e)
print('I am', rcv_uuid, 'and my PID is', rcv_pid)
if __name__ == '__main__':
main()
Although this field is private, you could use the ProcessPoolExecutor field self._processes. The code snippet below shows how to use this variable.
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import wait
nb_processes = 100
executor = ProcessPoolExecutor(nb_processes)
futures = [executor.submit(os.getpid) for _ in range(nb_processes)]
wait(futures)
backends = list(map(lambda x: x.result(), futures))
assert len(set(backends)) == nb_processes
In the case above, an assertion error is raised. This is because a new task can reuse a forked process already in the pool, so you cannot learn all the worker process IDs through the method you mentioned. Hence, you can do this instead:
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import wait
nb_processes = 100
executor = ProcessPoolExecutor(nb_processes)
futures = [executor.submit(os.getpid) for _ in range(nb_processes)]
wait(futures)
backends = list(map(lambda x: x.result(), futures))
assert len(set(executor._processes.keys())) == nb_processes
print('all of PID are: %s.' % list(executor._processes.keys()))
If you don't want to break the encapsulation, you could inherit from ProcessPoolExecutor and add a new property that exposes it.
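For example, a minimal sketch of that subclass; the property name worker_pids is made up here, and _processes is still a private CPython detail that may change between versions:
import os
from concurrent.futures import ProcessPoolExecutor, wait

class PidAwareProcessPoolExecutor(ProcessPoolExecutor):
    """Expose the PIDs of the pool's worker processes via a property."""
    @property
    def worker_pids(self):
        # _processes maps pid -> Process object; it is only populated
        # once worker processes have actually been spawned.
        return list(self._processes.keys())

if __name__ == '__main__':
    with PidAwareProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(os.getpid) for _ in range(4)]
        wait(futures)
        print(executor.worker_pids)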
I have a list containing ID numbers. I want each process to make an API call with its own unique ID number while running the same corresponding functions, applying the same conditional statements, and so on. I have tried to make sense of it, but there is not a lot online about this procedure.
I thought of using a for loop, but I don't want every process running that loop and picking up every item in the list. I just need each item to be associated with one process.
I was thinking something like this:
import multiprocessing
import requests, json
ID_NUMBERS = ["ID 1", "ID 2", "ID 3".... ETC]
BASE_URL = "www.api.com"
KEY = {"KEY": "12345"}
a = 0
for x in ID_NUMBERS:
def[a]():
while Active_live_data == True:
# continuously loops over, requesting data from the website
unique_api_call = "{}/livedata[{}]".format(BASE_URL, x)
request_it = requests.get(unique_api_call, headers=KEY)
show_it = (json.loads(request_it.content))
#some extra conditional code...
a += 1
processes = []
b = 0
for _ in range(len(ID_NUMBERS)):
p = multiprocessing.Process(target = b)
p.start()
processes.append(p)
b += 1
Any help would be greatly appreciated!
Kindest regards,
Andrew
You can use the map function:
import multiprocessing as mp
num_cores = mp.cpu_count()
pool = mp.Pool(processes=num_cores)
results = pool.map(your_function, list_of_IDs)
This will execute the function your_function, each time with a different item from the list list_of_IDs, and the values returned by your_function will be stored in a list of values (results).
Same approach as @AlessiaM, but using the high-level API in the concurrent.futures module.
import concurrent.futures as mp
import requests, json
BASE_URL = ''
KEY = {"KEY": "12345"}
ID_NUMBERS = ["ID 1", "ID 2", "ID 3"]
def job(id):
unique_api_call = "{}/livedata[{}]".format(BASE_URL, id)
request_it = requests.get(unique_api_call, headers=KEY)
show_it = (json.loads(request_it.content))
return show_it
# Defaults to as many workers as there are processors,
# but since your job is I/O bound (rather than CPU bound)
# you could use an even bigger figure via the `max_workers` parameter.
with mp.ProcessPoolExecutor() as pool:
results = pool.map(job,ID_NUMBERS)
# Process results here
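Since the job is I/O bound, the same map call also works with a thread pool, which avoids process start-up and argument pickling; a rough sketch with the same job function (the max_workers value is just an illustration):
from concurrent.futures import ThreadPoolExecutor

# Threads share memory, so arguments and results are not pickled,
# and a higher worker count is cheap for I/O-bound requests.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(job, ID_NUMBERS))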
The program is supposed to use the translate-python API to translate ENGLISH_DICT into 60+ languages (ENGLISH_DICT has been shortened a lot here, and so has LANG_CODES). Translating a huge dictionary into 60+ languages takes close to 2 hours with synchronous code, which is why I wanted to use threads.
My thread pool is supposed to be size 4, but I sometimes get 10 threads running without the previous threads completing (I found this out by putting a print statement on the first and last line of the thread handler). Also, the pool will run multiple threads, but as soon as a few threads complete, the entire program terminates and I get a 0 exit code. Lastly, if my max pool size is 10 and fewer than 10 threads join, the program terminates immediately.
More than 4 threads running without previous threads completing
Only 8 threads finished running out of 65 that were scheduled to run
9 threads were created but the max thread pool size is 10. The threads started to run, but the main program exited with a 0 exit code
import copy
import os
import json
import threading
from multiprocessing.dummy import Pool
from queue import Queue
from translate import Translator
LANG_CODES = {"af", "ar", "bn", "bs", "bg", "yue", "ca", "fi", "fr"}
VERIFIED_LANGUAGES = {'en', 'es', 'zh'}
TOTAL_TRANSLATIONS = len(LANG_CODES) - len(VERIFIED_LANGUAGES)
NUM_OF_THREADS = 100
DIR_NAME = 'translations'
#Iterate through nested dictionaries and translate string values
#Then prints the final dictionary as JSON
def translate(english_words: dict, dest_lang: str) -> str:
stack = []
cache = {}
T = Translator(provider='microsoft', from_lang='en', to_lang=dest_lang, secret_access_key=API_SECRET1)
translated_words = copy.deepcopy(english_words)
##Populate dictionary with top-level keys or translate top-level words
for key in translated_words.keys():
value = translated_words[key]
if type(value) == dict:
stack.append(value)
        else:
            if value in cache:
                translated_words[key] = cache[value]
            else:
                translation = T.translate(value)
                translated_words[key] = translation
                cache[value] = translation
while len(stack):
dic = stack.pop()
for key in dic.keys():
value = dic[key]
if type(value) == dict:
stack.append(value)
else:
if value in cache:
dic[key] = cache[value]
else:
# print('Translating "' + value +'" for', dest_lang)
translation = T.translate(value)
# print('Done translating "' + value +'" for', dest_lang)
# print('Translated', value, '->', translation)
                    cache[value] = translation
dic[key] = translation
return json.dumps(translated_words, indent=4)
##GENERATES A FOLDER CALLED 'translations' WITH LOCALE JSON FILES IN THE WORKING DIRECTORY THE SCRIPT IS LAUNCHED IN WITH MULTIPLE THREADS WORKING ON DIFFERENT LANGUAGES
def generate_translations(english_dict: dict):
if not os.path.exists(DIR_NAME):
os.mkdir(DIR_NAME)
finished_langs = set(map(lambda file_name: file_name.split('.json')[0], os.listdir(DIR_NAME)))
LANG_CODES.difference_update(finished_langs)
pool = Pool(NUM_OF_THREADS)
thread_params = [(english_dict, lang_code) for lang_code in sorted(LANG_CODES) if not lang_code.split('-')[0] in VERIFIED_LANGUAGES]
pool.map_async(thread_handler, thread_params)
pool.close()
pool.join()
print('DONE GENERATING')
##TRANSLATES AN ENTIRE DICTIONARY AND THEN WRITES IT TO A FILE IN THE TRANSLATION FOLDER
def thread_handler(params: tuple):
english_dict, lang_code = params
print('Translating for lang_code: ', lang_code)
translated_string_json = translate(english_dict, lang_code)
print('done getting string for', lang_code)
file = open(DIR_NAME + '/' + lang_code + '.json', 'w')
file.write(translated_string_json)
file.close()
num_of_langs_remaining = TOTAL_TRANSLATIONS - len(os.listdir(DIR_NAME))
print('Done translating for lang_code: ' + lang_code +'.', num_of_langs_remaining, 'remaining.\n\n')
ENGLISH_DICT = {
"changePassword": {
"yourCurrentPassword": "Your current password",
"newPassword": "New password",
"reenterNewPassword": "Re-enter new password",
"changePassword": "Change Password",
"yourProfile": "Your Profile",
"emptyFieldAlert": {
"header": "All fields must not be empty",
"body": "Please fill in all the fields"
}
}
}
if __name__ == '__main__':
generate_translations(ENGLISH_DICT)
Threads in Python are not equivalent to threads in, e.g., Java.
Because of the GIL, they don't actually use multiple CPU cores to execute your code in parallel.
Multiprocessing is used for that instead.
multiprocessing.dummy just uses the API from multiprocessing but is actually a wrapper for threading.
You should use
from multiprocessing import Pool
instead for actual parallelization and better performance.
You should count the number of running workers with
print(len(active_children()))
If you don't use the AsyncResult or a callback then you should just use
map(thread_handler, thread_params)
instead of
map_async(thread_handler, thread_params)
because it runs the tasks in parallel anyway and blocks until they have all completed.
The context manager protocol also works with Pool:
with Pool(NUM_OF_THREADS) as pool:
thread_params = [(english_dict, lang_code) for lang_code in sorted(LANG_CODES) if not lang_code.split('-')[0] in VERIFIED_LANGUAGES]
pool.map(thread_handler, thread_params)
print('DONE GENERATING')
I have a piece of code that queries a DB and returns a set of IDs. For each ID, I need to run a related query to get a dataset. I would like to run the queries in parallel to speed up the processing. Once all the processes have run, I build a block of text, write it to a file, and move on to the next ID.
How do I ensure that all the processes start at the same time, then wait for all of them to complete before moving to the page =... and writefile operations?
If run as is, I get the following error: Process object is not iterable (on line 9).
Here is what I have so far:
from helpers import *
import multiprocessing
idSet = getIDset(10)
for id in idSet:
ds1 = multiprocessing.Process(target = getDS1(id))
ds1list1, ds1Item1, ds1Item2 = (ds1)
ds2 = multiprocessing.Process(target = getDS2(id))
ds3 = multiprocessing.Process(target = getDS3(id))
ds4 = multiprocessing.Process(target = getDS4(id))
ds5 = multiprocessing.Process(target = getDS5(id))
movefiles = multiprocessing.Process(moveFiles(srcPath = r'Z://', src = ds1Item2 , dstPath=r'E:/new_data_dump//'))
## is there a better way to get them to start in unison than this?
ds1.start()
ds2.start()
ds3.start()
ds4.start()
ds5.start()
## how do I know all processes are finished before moving on?
page = +ds1+'\n' \
+ds2+'\n' \
+ds3+'\n' \
+ds4+'\n' \
+ds5+'\n'
writeFile(r'E:/new_data_dump/',filename+'.txt',page)
I usually keep my "processes" in a list.
plist = []
for i in range(0, 5):
    p = multiprocessing.Process(target=getDS2, args=(id,))
plist.append(p)
for p in plist :
p.start()
... do stuff ...
for p in plist :
p.join() # <---- this will wait for each process to finish before continuing
Also, I think you have an issue with how you create your Process objects: "target" is supposed to be a function, not the result of calling a function, as you seem to have it (unless your function returns functions).
It should look like this:
p = Process(target=f, args=('bob',))
Where target is the function, and args is a tuple of arguments, passed like so:
def f(name):
    print(name)
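Since you also need the results back in the parent to build page, note that a plain Process does not return its target's result. One way to collect all five datasets, sketched under the assumption that each getDS* function takes an id and returns a printable value (names taken from your code), is a small Pool:
import multiprocessing

def build_page(id):
    # Submit the five queries to a pool, then block on .get() so that
    # everything has finished before the page text is assembled.
    with multiprocessing.Pool(processes=5) as pool:
        pending = [pool.apply_async(f, (id,)) for f in (getDS1, getDS2, getDS3, getDS4, getDS5)]
        ds1, ds2, ds3, ds4, ds5 = [r.get() for r in pending]
    return '\n'.join(str(ds) for ds in (ds1, ds2, ds3, ds4, ds5)) + '\n'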