I'm trying to speed up some data processing using the multiprocessing module; the idea is to send a chunk of data to each process I start up, so that I use all the cores on my machine instead of just one at a time.
So I built an iterator over the data using the pandas read_fwf() function, with chunksize=50000 lines at a time. Eventually the iterator should raise StopIteration; I try to catch that in an except block in the child process and pass it along to the parent via a Queue, so the parent knows it can stop spawning child processes. I can't figure out what's wrong, but what happens is the code reaches the end of the data and then keeps spawning processes which essentially do nothing.
def MyFunction(data_iterator, results_queue, Placeholder, message_queue):
try:
current_data = data_iterator.next()
#does other stuff here
#that isn't important
placeholder_result = "Eggs and Spam"
results_queue.put(placeholder_result)
return None
except StopIteration:
message_queue.put("Out Of Data")
return None
results_queue = Queue() #for passing results from each child process
message_queue = Queue() #for passing the stop iteration message
cpu_count = cpu_count() #num of cores on the machine
Data_Remaining = True #loop control
output_values = [] #list to put results in
print_num_records = 0 #used to print how many lines have been processed
my_data_file = "some_data.dat"
data_iterator = BuildDataIterator(my_data_file)
while Data_Remaining:
processes = []
for process_num in range(cpu_count):
if __name__ == "__main__":
p = Process(target=MyFunction, args=(data_iterator,results_queue,Placeholder, message_queue))
processes.append(p)
p.start()
print "Process " + str(process_num) + " Started" #print some stuff to
print_num_records = print_num_records + 50000 #show how far along
print "Processing records through: ", print_num_records #my data file I am
for i,p in enumerate(processes):
print "Joining Process " + str(i)
output_values.append(results_queue.get())
p.join(None)
if not message_queue.empty():
message = message_queue.get()
else:
message = ""
if message == "Out Of Data":
Data_Remaining = False
print "STOP ITERATION NOW PLEASE"
Update:
I discovered a problem with the data iterator. There are approximately 8 million rows in my data set, and after it processes the 8 million rows it never actually raises StopIteration; it keeps returning the same 14 rows of data over and over. Here is the code that builds my data iterator:
def BuildDataIterator(my_data_file):
#data_columns is a list of 2-tuples
#headers is a list of strings
#num_lines is 50000
data_reader = read_fwf(my_data_file, colspecs=data_columns, header=None, names=headers, chunksize=num_lines)
data_iterator = data_reader.__iter__()
return data_iterator
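For what it's worth, here is a minimal sketch of an alternative layout that keeps the chunked reader in the parent process, where exhaustion happens exactly once, and hands each chunk to a worker pool; process_chunk stands in for the real per-chunk work, and data_columns/headers are assumed to be the same as above:
from multiprocessing import Pool
from pandas import read_fwf

def process_chunk(chunk):
    # placeholder for the real per-chunk computation
    return len(chunk)

if __name__ == "__main__":
    reader = read_fwf("some_data.dat", colspecs=data_columns, header=None,
                      names=headers, chunksize=50000)
    with Pool() as pool:
        # the parent iterates the reader; the loop simply ends when it is exhausted
        for result in pool.imap(process_chunk, reader):
            print(result)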
I have a script that reads data via the serial port from a development board. I want to have this script upload the data to a MongoDB collection at the end of each loop, but I don't want the loop to block because of the upload. When I try to use the multiprocessing library to do so, the loop only uploads a blank document.
client = MongoClient()
db = client['CompostMonitor-1']
def upload_to_database(data):
# Connect to the collection where the data will be stored
collection = db.RedBoard
# Insert the data into the collection
collection.insert_one(data)
port = '/dev/ttyUSB0'
filename = '~/TestData'
containernumber = 1
baud_rate = 9600
RBSerial = serial.Serial(port, baud_rate, timeout=1)
directoryBase = "{}/{}/Bucket {}/RB".format(filename, time.strftime("%m-%d-%Y"), containernumber)
pathlib.Path(directoryBase).mkdir(parents=True, exist_ok=True)
logFileRB = '{}/RB_Bucket_{}_{}_{}_log.bin'.format(directoryBase, containernumber, time.strftime("%m-%d-%Y"),
time.strftime("%H;%M;%S"))
csvRB = '{}/RB_Bucket_{}_{}_{}.csv'.format(directoryBase, containernumber, time.strftime("%m-%d-%Y"),
time.strftime("%H;%M;%S"))
startup = True
count = 0
bytearray = []
RB_DataList = []
RB_DataDict = {}
header = ['Date/Time',
'SGP TVOC (ppb)',
'BME Humidity (%)',
'BME Pressure (Pa)',
'BME Temp (Deg C)']
startTime = time.time()
p = multiprocessing.Process(target=upload_to_database, args=(RB_DataDict,))
while 1:
RB_DataDict = {'_id': ''}
RB_inbyte = RBSerial.read(size=1)
with open(logFileRB, 'ab') as l:
l.write(RB_inbyte)
bytearray.append(RB_inbyte)
if RB_inbyte == b'\n':
bytearray.pop()
with open(csvRB, 'a', newline = '') as table:
writer = csv.writer(table)
if count == 0:
writer.writerow(header)
RB_DataSplit = ''.join(str(bytearray)).replace(" ", "").replace('b', '').replace("'", '').replace(",", '').\
replace('[', '').replace(']', '').split(';')
RB_DataList.append(time.strftime("%m-%d-%Y %H:%M:%S"))
for i in range(len(RB_DataSplit)):
RB_DataList.append(RB_DataSplit[i])
print(RB_DataList)
writer.writerow(RB_DataList)
RB_DataDict = {'Date_Time': RB_DataList[0], 'TVOC Con': RB_DataList[1], 'BME Humidity': RB_DataList[2],
'BME Pressure': RB_DataList[3], 'BME Temp': RB_DataList[4]}
print(RB_DataDict)
RB_DataList = []
# upload_to_database(RB_DataDict)
if startup:
p.start()
startup = False
bytearray = []
However, if I just call upload_to_database(RB_DataDict) as in the commented line, it works as intended. I thought that starting the process would have it continually upload RB_DataDict to my Mongo database, but it appears that it just runs one time and then stops.
I haven't found any examples of code attempting to use multiprocessing in an infinite loop, so it's hard to compare my code to something that works. How can I change this code so that it uploads RB_DataDict with the multiprocessing object each time the dictionary is populated?
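For comparison, a common pattern for this kind of background-upload requirement is a single long-lived worker process fed through a multiprocessing.Queue, rather than one process per loop iteration. A minimal sketch, where upload_worker, the queue, and the None sentinel are illustrative names rather than anything from the original script:
import multiprocessing
from pymongo import MongoClient

def upload_worker(q):
    # runs in its own process and uploads every document it receives
    collection = MongoClient()['CompostMonitor-1'].RedBoard
    while True:
        doc = q.get()      # blocks until the main loop sends a document
        if doc is None:    # sentinel: shut down
            break
        collection.insert_one(doc)

if __name__ == '__main__':
    upload_queue = multiprocessing.Queue()
    uploader = multiprocessing.Process(target=upload_worker, args=(upload_queue,))
    uploader.start()
    # ... inside the serial-reading loop, instead of starting a new process:
    # upload_queue.put(RB_DataDict)
    # ... and when shutting down:
    # upload_queue.put(None)
    # uploader.join()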
I found a solution to my problem. I don't really understand why this works so well, but it does:
if __name__ == '__main__':
if startup:
p.start()
startup = False
print('Startup == False')
else:
# Close the process instance and start a new one!
p.close()
p = multiprocessing.Process(target= upload_to_database, args = (RB_DataDict,))
p.start()
print('should have uploaded something here')
Just closing the original process on the second loop and starting a new one fixes the issue. I'm not sure if, in my particular case, the if __name__ == '__main__' is necessary, as this script isn't intended to be imported for anything else, but I just followed the lead of the multiprocessing documentation.
I'm working with Orcaflex (FEM software for offshore analysis, though that should not be relevant here). I created a script to check whether the simulations I've run completed successfully (a simulation can fail by not reaching convergence). Since I'm talking about thousands of files, I tried to parallelize the process with multiprocessing. My code follows; sorry, I can't produce a working example for you, but I'll try to explain in detail. I created a class derived from multiprocessing.Process and overrode run() to perform the checks on the simulation files.
Then, in __main__ I set a number of processors, split the files accordingly, and start the execution.
The problem is that the processes do not all spawn together; instead there appears to be a random delay between one and the next. Is this how it is supposed to work, or am I missing something?
What I mean by not spawning together is that I see:
[Info/Worker-1] child process calling self.run()
and for example:
[Info/Worker-4] child process calling self.run()
after about 10 min of the program running.
Thanks in advance for any help/suggestion.
import os
import subprocess
import glob
import multiprocessing
import logging
import sys
import OrcFxAPI as of
class Worker(multiprocessing.Process):
myJobs = []
def setJobs(self, jobList):
self.myJobs = jobList
    @staticmethod
def changedExtensionFileName(oldFileName, newExtension):
return '.'.join((os.path.splitext(oldFileName)[0], newExtension))
def run(self):
failed = []
model = of.Model(threadCount=1)
for job in self.myJobs:
try:
print('%s starting' % job)
sys.stdout.flush()
model.LoadSimulation(job)
if model.state == of.ModelState.SimulationStoppedUnstable:
newJob = job.replace('.sim', '.dat')
failed.append(newJob)
with open('Failed_Sim.txt', 'a') as f:
f.write(f'{newJob}\n')
f.close()
model.LoadData(newJob)
model.general.ImplicitConstantTimeStep /= 2
model.SaveData(newJob)
print(f'{job} has failed, reducing time step')
except of.DLLError as err:
print('%s ERROR: %s' % (job, err))
sys.stdout.flush()
with open(self.changedExtensionFileName(job, 'FAIL'), 'w') as f:
f.write('%s error: %s' % (job, err))
f.close()
return
if __name__ == '__main__':
import re
sim_file = [f for f in os.listdir() if re.search(r'\d\d\d\d.*.sim', f)]
# begin multprocessing
multiprocessing.log_to_stderr()
logger = multiprocessing.get_logger()
logger.setLevel(logging.INFO)
corecount = 14
workers = []
chunkSize = int(len(sim_file) / corecount)
chunkRemainder = int(len(sim_file) % corecount)
print('%s jobs found, dividing across %s workers - %s each remainder %s' % (str(len(sim_file)), str(corecount), chunkSize, chunkRemainder))
start = 0
for coreNum in range(0, corecount):
worker = Worker()
workers.append(worker)
end = start + chunkSize
if chunkRemainder>0:
chunkRemainder -= 1
end += 1
if end>len(sim_file):
end = len(sim_file)
worker.setJobs(sim_file[start:end])
worker.start()
start = end
if start>=len(sim_file):
break
for worker in workers:
worker.join()
print('Done...')
OK, so no one put their hand up to answer this with a minor tweak (which I don't know how to do!), so here comes the larger rejig proposal...
def worker(inpData):
#The worker process
    failed1 = []
    failed2 = []
    model = of.Model(threadCount=1)  # each worker needs its own Model instance, as in the original run()
    for job in inpData: #I'm not sure of the data shape of the chunks, has your original method split them into coherent chunks capable of being processed independently? My step here could be wrong.
try:
#print('%s starting' % job) #Prints won't appear on console from worker processes from windows, so commented them all out
model.LoadSimulation(job)
if model.state == of.ModelState.SimulationStoppedUnstable:
newJob = job.replace('.sim', '.dat')
failed1.append(newJob)
#I'd recommend we pass the list "failed" back to master and write to text from there, otherwise you could have several processes updating the text file at once, leading to possible loss of data
#with open('Failed_Sim.txt', 'a') as f:
# f.write(f'{newJob}\n')
# f.close()
model.LoadData(newJob)
model.general.ImplicitConstantTimeStep /= 2
model.SaveData(newJob)
#print(f'{job} has failed, reducing time step')
except of.DLLError as err:
#print('%s ERROR: %s' % (job, err))
#sys.stdout.flush()
#with open(self.changedExtensionFileName(job, 'FAIL'), 'w') as f:
# f.write('%s error: %s' % (job, err))
# f.close()
failed2.append(job)
#Note I've made two failed lists to pass back, for both failure types
return failed1, failed2
if __name__ == "__main__":
import re
import multiprocessing as mp
nCPUs = mp.cpu_count()
sim_file = [f for f in os.listdir() if re.search(r'\d\d\d\d.*.sim', f)]
#Make the chunks
    chunkSize = int(len(sim_file) / nCPUs)
    chunkRemainder = int(len(sim_file) % nCPUs)
    print('%s jobs found, dividing across %s workers - %s each remainder %s' % (str(len(sim_file)), str(nCPUs), chunkSize, chunkRemainder))
chunks = []
start = 0
    for iChunk in range(0, nCPUs):
        end = start + chunkSize
        if chunkRemainder > 0:
            chunkRemainder -= 1
            end += 1
        if end > len(sim_file):
            end = len(sim_file)
        chunks.append(sim_file[start:end])
        start = end
#Send to workers
pool = mp.Pool(processes=nCPUs)
futA = []
for iChunk in range(0, nCPUs):
        futA.append(pool.apply_async(worker, args=(chunks[iChunk],)))
#Gather results
if futA:
failedDat = []
failedSim = []
for iChunk in range(0, len(futA)):
resA, resB = futA[iChunk].get()
failedDat.extend(resA)
failedSim.extend(resB)
pool.close()
if failedDat:
print("Following jobs failed, reducing timesteps:")
print(failedDat)
if failedSim:
print("Following sims failed due to errors")
print(failedSim)
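As an aside (not part of the proposal above), if worker were rewritten to take a single job instead of a list of jobs, Pool.map could do the chunking itself through its chunksize argument; a hypothetical worker_single_job(job) would make the whole splitting loop unnecessary:
results = pool.map(worker_single_job, sim_file, chunksize=max(1, len(sim_file) // nCPUs))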
I have data, which is in a text file. Each line is a computation to do. This file has around 100 000 000 lines.
First I load everything into RAM, then I have a method that performs the computation and returns the result:
def process(data_line):
#do computation
return result
Then I call it like this with packets of 2000 lines, and then save the results to disk:
POOL_SIZE = 15 #nbcore - 1
PACKET_SIZE = 2000
pool = Pool(processes=POOL_SIZE)
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets = int(number_of_lines/ PACKET_SIZE)
for i in range(number_of_packets):
lines_packet = data_lines[:PACKET_SIZE]
data_lines = data_lines[PACKET_SIZE:]
results = pool.map(process, lines_packet)
save_computed_data_to_disk(to_be_computed_filename, results)
# process the last packet, which is smaller
results.extend(pool.map(process, data_lines))
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
The problem is that while I am writing to disk, my CPU is computing nothing, and it has 8 cores. Looking at the task manager, it seems that quite a lot of CPU time is lost.
I have to write to disk after having completed my computation because the results are 1000 times larger than the input.
In any case, I would have to write to disk at some point. If time is not lost here, it will be lost later.
What could I do to allow one core to write to disk, while still computing with the others? Switch to C?
At this rate I can process 100 million lines in 75 h, but I have 12 billion lines to process, so any improvement is welcome.
example of timings:
Processing packet 2/15 953 of C:/processing/drop_zone\to_be_processed_txt_files\t_to_compute_303620.txt
Launching task and waiting for it to finish...
Task completed, Continuing
Packet was processed in 11.534576654434204 seconds
We are currently going at a rate of 0.002306915330886841 sec/words
Which is 433.47928145051293 words per seconds
Saving in temporary file
Printing writing 5000 computed line to disk took 0.04400920867919922 seconds
saving word to resume from : 06 20 25 00 00
Estimated time for processing the remaining packets is : 51:19:25
Note: SharedMemory requires Python >= 3.8, since that is where it first appeared.
Start 3 kinds of processes: Reader, Processor(s), Writer.
Have Reader process read the file incrementally, sharing the result via shared_memory and Queue.
Have the Processor(s) consume the Queue, consume the shared_memory, and return the result(s) via another Queue. Again, as shared_memory.
Have the Writer process consume the second Queue, writing to the destination file.
Have them all communicate through, say, some Events or a DictProxy, with the MainProcess, which will act as the orchestrator.
Example:
import time
import random
import hashlib
import multiprocessing as MP
from queue import Queue, Empty
# noinspection PyCompatibility
from multiprocessing.shared_memory import SharedMemory
from typing import Dict, List
def readerfunc(
shm_arr: List[SharedMemory], q_out: Queue, procr_ready: Dict[str, bool]
):
numshm = len(shm_arr)
for batch in range(1, 6):
print(f"Reading batch #{batch}")
for shm in shm_arr:
#### Simulated Reading ####
for j in range(0, shm.size):
shm.buf[j] = random.randint(0, 255)
#### ####
q_out.put((batch, shm))
# Need to sync here because we're reusing the same SharedMemory,
# so gotta wait until all processors are done before sending the
# next batch
while not q_out.empty() or not all(procr_ready.values()):
time.sleep(1.0)
def processorfunc(
q_in: Queue, q_out: Queue, suicide: type(MP.Event()), procr_ready: Dict[str, bool]
):
pname = MP.current_process().name
procr_ready[pname] = False
while True:
time.sleep(1.0)
procr_ready[pname] = True
if q_in.empty() and suicide.is_set():
break
try:
batch, shm = q_in.get_nowait()
except Empty:
continue
print(pname, "got batch", batch)
procr_ready[pname] = False
#### Simulated Processing ####
h = hashlib.blake2b(shm.buf, digest_size=4, person=b"processor")
time.sleep(random.uniform(5.0, 7.0))
#### ####
q_out.put((pname, h.hexdigest()))
def writerfunc(q_in: Queue, suicide: type(MP.Event())):
while True:
time.sleep(1.0)
if q_in.empty() and suicide.is_set():
break
try:
pname, digest = q_in.get_nowait()
except Empty:
continue
print("Writing", pname, digest)
#### Simulated Writing ####
time.sleep(random.uniform(3.0, 6.0))
#### ####
print("Writing", pname, digest, "done")
def main():
shm_arr = [
SharedMemory(create=True, size=1024)
for _ in range(0, 5)
]
q_read = MP.Queue()
q_write = MP.Queue()
procr_ready = MP.Manager().dict()
poison = MP.Event()
poison.clear()
reader = MP.Process(target=readerfunc, args=(shm_arr, q_read, procr_ready))
procrs = []
for n in range(0, 3):
p = MP.Process(
target=processorfunc, name=f"Proc{n}", args=(q_read, q_write, poison, procr_ready)
)
procrs.append(p)
writer = MP.Process(target=writerfunc, args=(q_write, poison))
reader.start()
[p.start() for p in procrs]
writer.start()
reader.join()
print("Reader has ended")
while not all(procr_ready.values()):
time.sleep(5.0)
poison.set()
[p.join() for p in procrs]
print("Processors have ended")
writer.join()
print("Writer has ended")
[shm.close() for shm in shm_arr]
[shm.unlink() for shm in shm_arr]
if __name__ == '__main__':
main()
You say you have 8 cores, yet you have:
POOL_SIZE = 15 #nbcore - 1
Assuming you want to leave one processor free (presumably for the main process?), why wouldn't this number be 7? But why do you even want to leave a processor free? You are making successive calls to map. While the main process is waiting for these calls to return, it requires no CPU. This is why, if you do not specify a pool size when you instantiate your pool, it defaults to the number of CPUs you have and not that number minus one. I will have more to say about this below.
Since you have a very large, in-memory list, is it possible that you are wasting cycles in your loop by rewriting this list on each iteration? Instead, you can just take a slice of the list and pass that as the iterable argument to map:
POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)
with Pool(processes=POOL_SIZE) as pool:
offset = 0
for i in range(number_of_packets):
results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
offset += PACKET_SIZE
save_computed_data_to_disk(to_be_computed_filename, results)
if remainder:
results = pool.map(process, data_lines[offset:offset+remainder])
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
Between each call to map the main process is writing out the results to to_be_computed_filename. In the meantime, every process in your pool is sitting idle. This work should be given to another process (actually a thread running under the main process):
from multiprocessing import Pool
import queue
import threading
POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)
def save_data(q):
while True:
results = q.get()
if results is None:
return # signal to terminate
save_computed_data_to_disk(to_be_computed_filename, results)
q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()
with Pool(processes=POOL_SIZE) as pool:
offset = 0
for i in range(number_of_packets):
results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
offset += PACKET_SIZE
q.put(results)
if remainder:
results = pool.map(process, data_lines[offset:offset+remainder])
q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
I've chosen to run save_data in a thread of the main process. This could also be another process in which case you would need to use a multiprocessing.Queue instance. But I figured the main process thread is mostly waiting for the map to complete and there would not be competition for the GIL. Now if you do not leave a processor free for the threading job, save_data, it may end up doing most of the saving only after all of the results have been created. You would need to experiment a bit with this.
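For reference, here is a rough sketch of the process-based variant mentioned above, using a multiprocessing.Queue instead of queue.Queue; save_computed_data_to_disk and to_be_computed_filename are as in the question, everything else is illustrative:
import multiprocessing

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return  # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

if __name__ == '__main__':
    q = multiprocessing.Queue()
    saver = multiprocessing.Process(target=save_data, args=(q,))
    saver.start()
    # ... same pool.map() loop as above, calling q.put(results) after each packet ...
    q.put(None)   # sentinel: no more results
    saver.join()
Note that with a separate process each results list has to be pickled through the queue, which is extra overhead that the thread version avoids.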
Ideally, I would also modify the reading of the input file so as to not have to first read it all into memory but rather read it line by line yielding 2000 line chunks and submitting those as jobs for map to process:
from multiprocessing import Pool
import queue
import threading
POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
def save_data(q):
while True:
results = q.get()
if results is None:
return # signal to terminate
save_computed_data_to_disk(to_be_computed_filename, results)
def read_data():
"""
yield lists of PACKET_SIZE
"""
lines = []
with open(some_file, 'r') as f:
        for line in iter(f.readline, ''):
lines.append(line)
if len(lines) == PACKET_SIZE:
yield lines
lines = []
if lines:
yield lines
q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()
with Pool(processes=POOL_SIZE) as pool:
for l in read_data():
results = pool.map(process, l)
q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
I made two assumptions: the writing is I/O-bound, not CPU-bound, meaning that throwing more cores at the writing would not improve performance; and the process function contains some heavy computation.
I would approach it differently:
Split up the large list into a list of lists
Then feed those into the processes
Store the total result
Here is the example code:
import multiprocessing as mp
data_lines = [0]*10000 # read it from file
size = 2000
# Split the list into a list of list (with chunksize `size`)
work = [data_lines[i:i + size] for i in range(0, len(data_lines), size)]
def process(data):
    result = len(data) # do something fancy
return result
with mp.Pool() as p:
result = p.map(process, work)
save_computed_data_to_disk(file_name, result)
As a side note: you may also want to have a look at numpy or pandas (depending on the data), because it sounds like you want to go in that direction.
The first thing that comes to mind for the code is to run the saving function in a thread, so that waiting on disk writes is no longer the bottleneck. Like so:
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ALL_COMPLETED

saving_futures = []
executor = ThreadPoolExecutor(max_workers=2)
future = executor.submit(save_computed_data_to_disk, to_be_computed_filename, results)
saving_futures.append(future)
...
concurrent.futures.wait(saving_futures, return_when=ALL_COMPLETED) # wait all saved to disk after processing
print("Done")
So I have been playing around with multiprocessing, and I was thinking of upgrading my knowledge: I want to read the first sentence from the text file with process 1, then the second sentence with process 2, etc...
txt file:
helloworld#world.com
helloworld2#world.com
helloworld3#world.com
helloworld4#world.com
helloworld5#world.com
and this is how the code is looking:
def info(thread):
global prod
prod = int(thread) + 1
runit()
def runit():
log("Profile-" + str(prod) + Fore.GREEN + ' - ' + email)
    #From here I can then use the email for each worker, basically. Or that's the plan at least. The plan is that every worker will have its own email that can be used in here.
sys.exit()
def main():
user_input = 0
while True:
try:
user_input = int(input(Fore.WHITE + 'How many tasks do you wanna run? [NUMBERS] \n' + Fore.RESET))
except ValueError:
print(Fore.RED + "Stop being stupid" + Fore.RESET)
continue
else:
with open('email.txt') as f:
content = f.readlines()
content = [x.strip('\n') for x in content]
try:
for i, email in enumerate(content):
print(email)
except ValueError as e:
print(e)
HowManyThread = user_input
i = 0
jobs = []
for i in range(HowManyThread):
p = multiprocessing.Process(target=info, args=(str(i),))
jobs.append(p)
time.sleep(.5)
p.start()
for p in jobs:
p.join()
sys.exit()
log just writes a log message, nothing special.
Fore.COLOR <-- Colorama
However, I have absolutely no idea what I should do to actually make each process take its own email row. So basically...
Process-1 to take helloworld#world.com
Process-2 to take helloworld2#world.com
Process-3 to take helloworld3#world.com
Process-4 to take helloworld4#world.com
Process-5 to take helloworld5#world.com
Any suggestions on how I can do this? I'm completely lost and have no idea how to move forward.
Update
from multiprocessing import Pool, Process, Queue
from tqdm import tqdm
with open('email.txt') as f:
content = f.readlines()
global email_list
email_list = [x.strip('\n') for x in content]
def info(thread):
global prod
prod = int(thread) + 1
runit()
def runit(email_index):
email = email_list[email_index]
log("Profile-" + str(prod) + Fore.GREEN + ' - ' + email)
sys.exit()
def main():
wipe()
text()
Infotext = "First name : Last name : Email: : Random char + Street"
with open('data.json', 'w') as f:
json.dump(Infotext, f)
f.write("\n")
with Pool(8) as pool:
        result_list = list(tqdm(pool.imap_unordered(runit, range(len(email_list)), chunksize=5), total=len(email_list)))
if __name__ == '__main__':
try:
main()
except Exception as e:
print(e)
print(traceback.print_exc())
print(traceback)
The following approach delegates the multiprocessing to a pool of workers, each of which receives a chunk of indices and processes these indices a single line at a time (the choice of poolsize=8 and chunksize=5 here is arbitrary and can be tuned according to your requirements).
The result of all workers is then collected into a final list. Note that imap_unordered is only appropriate if you don't care about the order in which the lines are processed (i.e. result_list does not maintain the original order of content).
from multiprocessing import Pool
# progress bar to track your multiproc
from tqdm import tqdm
with open('email.txt') as f:
content = f.readlines()
# this list will be accessed by each worker
global email_list
email_list = [x.strip('\n') for x in content]
# define function that worker will apply to each email
# it gets sent an index for the list of emails
# it accesses the email at that index, performs its function and returns
def runit(email_index):
email = email_list[email_index]
# do the stuff you're interested in for a single email
# run the multiprocessing to get your results
# this sends the indexes for the emails out to the workers
# and collects the results of runit into result list
with Pool(8) as pool:
result_list = list(tqdm(pool.imap_unordered(runit,
range(len(email_list)), chunksize=5),
total=len(email_list)))
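If the order of the results does matter, pool.imap (or pool.map) can be used instead; it preserves the input order, at the cost of possibly waiting on slower items:
with Pool(8) as pool:
    result_list = list(tqdm(pool.imap(runit, range(len(email_list)), chunksize=5),
                            total=len(email_list)))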
What you need is a pool of worker processes - even though, for your use case, I really wonder whether threads (or multiprocessing.dummy) would not be enough.
A pool starts the requested number of worker processes, and you can submit asynchronous tasks to the pool, which will be handled by the first free worker.
A stripped-down version of your example (no fancy printing, no unnecessary reading of a sequential file into a list) could be:
import multiprocessing
import time
def runit(prod, email):
print("Profile-" + str(prod) + ' - ' + email)
    #From here I can then use the email for each worker, basically. Or that's the plan at least. The plan is that every worker will have its own email that can be used in here.
    # sys.exit() # NEVER explicitly call sys.exit() in a worker process
time.sleep(1) # to add a delay inside each task
def main():
while True:
try:
HowManyThread = int(input(
'How many tasks do you wanna run? [NUMBERS] \n'))
except ValueError:
print("Stop being stupid")
continue
if HowManyThread == 0: break
pool = multiprocessing.Pool(HowManyThread)
with open('email.txt') as f:
for i, email in enumerate(f):
email = email.strip()
# runit will be runned by a worker process
pool.apply_async(runit, (i, email))
pool.close() # no more task to add
pool.join() # wait for last worker to end
if __name__ == "__main__":
main()
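As a side note, apply_async returns an AsyncResult object; if you ever need the values runit returns, or want worker exceptions re-raised in the parent, collect those objects and call get() on them. A small, illustrative variation of the loop above:
async_results = []
with open('email.txt') as f:
    for i, email in enumerate(f):
        async_results.append(pool.apply_async(runit, (i, email.strip())))
pool.close()
pool.join()
values = [r.get() for r in async_results]  # get() re-raises any exception raised in a worker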
I have a very weird problem with the code below. When numrows = 10, the process loop completes and the program proceeds to finish. If the growing list becomes larger, it goes into a deadlock. Why is this, and how can I solve it?
import multiprocessing, time, sys
# ----------------- Calculation Engine -------------------
def feed(queue, parlist):
for par in parlist:
queue.put(par)
def calc(queueIn, queueOut):
while True:
try:
par = queueIn.get(block = False)
print "Project ID: %s started. " % par
res = doCalculation(par)
queueOut.put(res)
except:
break
def write(queue, fname):
print 'Started to write to file'
fhandle = open(fname, "w")
while True:
try:
res = queue.get(block = False)
for m in res:
print >>fhandle, m
except:
break
fhandle.close()
print 'Complete writing to the file'
def doCalculation(project_ID):
numrows = 100
toFileRowList = []
for i in range(numrows):
toFileRowList.append([project_ID]*100)
print "%s %s" % (multiprocessing.current_process().name, i)
return toFileRowList
def main():
parlist = [276, 266]
nthreads = multiprocessing.cpu_count()
workerQueue = multiprocessing.Queue()
writerQueue = multiprocessing.Queue()
feedProc = multiprocessing.Process(target = feed , args = (workerQueue, parlist))
calcProc = [multiprocessing.Process(target = calc , args = (workerQueue, writerQueue)) for i in range(nthreads)]
writProc = multiprocessing.Process(target = write, args = (writerQueue, 'somefile.csv'))
feedProc.start()
feedProc.join ()
for p in calcProc:
p.start()
for p in calcProc:
p.join()
writProc.start()
writProc.join()
if __name__=='__main__':
sys.exit(main())
I think the problem is the Queue buffer getting filled, so you need to read from the queue before you can put additional stuff in it.
For example, in your feed thread you have:
queue.put(par)
If you keep putting stuff in without reading, this will cause it to block until the buffer is freed, but the problem is that you only free the buffer in your calc processes, which in turn don't get started before you join your blocking feed process.
So, in order for your feed process to finish, the buffer should be freed, but the buffer won't be freed before the process finishes :)
Try organizing your queue accesses better.
The feedProc and the writeProc are not actually running in parallel with the rest of your program. When you have
proc.start()
proc.join ()
you start the process and then, on the join(), you immediately wait for it to finish. In this case there's no gain from multiprocessing, only overhead. Try to start ALL processes at once before you join them. This will also have the effect that your queues get emptied regularly and you won't deadlock.
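To make that concrete, here is a sketch of main() restructured along those lines. Note that calc() and write() would also need to wait for work instead of breaking as soon as their queue happens to be empty, so this sketch assumes they use blocking get() calls and exit when they receive a None sentinel (an assumption, not something in the original code):
def main():
    parlist = [276, 266]
    nthreads = multiprocessing.cpu_count()
    workerQueue = multiprocessing.Queue()
    writerQueue = multiprocessing.Queue()
    feedProc = multiprocessing.Process(target=feed, args=(workerQueue, parlist))
    calcProc = [multiprocessing.Process(target=calc, args=(workerQueue, writerQueue))
                for i in range(nthreads)]
    writProc = multiprocessing.Process(target=write, args=(writerQueue, 'somefile.csv'))

    # start everything first, so the queues are drained while they are being filled
    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()

    feedProc.join()
    for i in range(nthreads):       # one sentinel per calc worker
        workerQueue.put(None)
    for p in calcProc:
        p.join()
    writerQueue.put(None)           # tell the writer it can stop
    writProc.join()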