Producer-consumer problem - trying to save into a csv file - python

so this seemingly simple problem is doing my head in.
I have a dataset (datas) and I do some processing on it (this isn't the issue, though this takes time owing to the size of the dataset) to produce multiple rows to be stored into a CSV file. However, it is very taxing to produce a row, then save it to csv, then produce a row and then save it etc.
So I'm trying to implement producer and consumer threads - producers will produce each row of data (to speed up the process), store in a queue and a single consumer will then append to my csv file.
My attempts below result in success sometimes (the data is correctly saved) or other times the data is "cut off" (either an entire row or part of it).
What am I doing wrong?
from threading import Thread
from queue import Queue
import csv

q = Queue()

def producer():
    datas = [["hello","world"],["test","hey"],["my","away"],["your","gone"],["bye","hat"]]
    for data in datas:
        q.put(data)

def consumer():
    while True:
        local = q.get()
        file = open('dataset.csv','a')
        with file as fd:
            writer = csv.writer(fd)
            writer.writerow(local)
        file.close()
        q.task_done()

for i in range(10):
    t = Thread(target=consumer)
    t.daemon = True
    t.start()
producer()
q.join()

I think this does something similar to what you're trying to do. For testing purposes, it prefixes each row of data in the CSV file produced with a "producer id" so the source of the data can be seen in the results.
As you will be able to see from the csv file produced, all the data produced gets put into it.
import csv
import random
from queue import Queue
from threading import Thread
import time

SENTINEL = object()

def producer(q, id):
    data = (("hello", "world"), ("test", "hey"), ("my", "away"), ("your", "gone"),
            ("bye", "hat"))
    for datum in data:
        q.put((id,) + datum)  # Prefix producer ID to datum for testing.
        time.sleep(random.random())  # Vary thread speed for testing.

class Consumer(Thread):
    def __init__(self, q):
        super().__init__()
        self.q = q

    def run(self):
        with open('dataset.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=',')
            while True:
                datum = self.q.get()
                if datum is SENTINEL:
                    break
                writer.writerow(datum)

def main():
    NUM_PRODUCERS = 10
    queue = Queue()
    # Create producer threads.
    threads = []
    for id in range(NUM_PRODUCERS):
        t = Thread(target=producer, args=(queue, id+1,))
        t.start()
        threads.append(t)
    # Create Consumer thread.
    consumer = Consumer(queue)
    consumer.start()
    # Wait for all producer threads to finish.
    while threads:
        threads = [thread for thread in threads if thread.is_alive()]
    queue.put(SENTINEL)  # Indicate to consumer thread no more data.
    consumer.join()
    print('Done')

if __name__ == '__main__':
    main()
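As a variation on the code above, the polling loop over is_alive() can be replaced by join() calls, which block without spinning the CPU. This is only a minimal sketch: run_pipeline is a hypothetical helper name, the rows are made-up test data, and the output path is passed in as a parameter so it can point at a temporary file.

```python
import csv
from queue import Queue
from threading import Thread

SENTINEL = object()  # unique marker meaning "no more rows"

def producer(q, producer_id, rows):
    # Each producer tags its rows with its own id (for inspection only).
    for row in rows:
        q.put((producer_id,) + tuple(row))

def consumer(q, path):
    # A single thread owns the file, so rows are never interleaved.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            writer.writerow(item)

def run_pipeline(path, rows, num_producers=4):
    q = Queue()
    writer_thread = Thread(target=consumer, args=(q, path))
    writer_thread.start()
    producers = [Thread(target=producer, args=(q, i + 1, rows))
                 for i in range(num_producers)]
    for t in producers:
        t.start()
    for t in producers:
        t.join()          # blocks until this producer has finished
    q.put(SENTINEL)       # only now is it safe to tell the consumer to stop
    writer_thread.join()  # the file is closed once this returns
```

Because the sentinel is queued only after every producer has been joined, the consumer cannot stop while rows are still on their way.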

Is this the most I can get from Python multiprocess?

I have data, which is in a text file. Each line is a computation to do. This file has around 100 000 000 lines.
First I load everything into RAM, then I have a method that performs the computation and gives the following results:
def process(data_line):
    #do computation
    return result
Then I call it like this with packets of 2000 lines and then save the result to disk :
POOL_SIZE = 15 #nbcore - 1
PACKET_SIZE = 2000
pool = Pool(processes=POOL_SIZE)
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets = int(number_of_lines/ PACKET_SIZE)
for i in range(number_of_packets):
    lines_packet = data_lines[:PACKET_SIZE]
    data_lines = data_lines[PACKET_SIZE:]
    results = pool.map(process, lines_packet)
    save_computed_data_to_disk(to_be_computed_filename, results)
# process the last packet, which is smaller
results.extend(pool.map(process, data_lines))
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
The problem is that while I am writing to disk, my CPU (which has 8 cores) is computing nothing. Looking at the task manager, it seems that quite a lot of CPU time is lost.
I have to write to disk after having completed my computation because the results are 1000 times larger than the input.
Anyways, I would have to write to the disk at some point. If time is not lost here, it will be lost later.
What could I do to allow one core to write to disk, while still computing with the others? Switch to C?
At this rate I can process 100 million lines in 75h, but I have 12 billion lines to process, so any improvement is welcome.
example of timings:
Processing packet 2/15 953 of C:/processing/drop_zone\to_be_processed_txt_files\t_to_compute_303620.txt
Launching task and waiting for it to finish...
Task completed, Continuing
Packet was processed in 11.534576654434204 seconds
We are currently going at a rate of 0.002306915330886841 sec/words
Which is 433.47928145051293 words per seconds
Saving in temporary file
Printing writing 5000 computed line to disk took 0.04400920867919922 seconds
saving word to resume from : 06 20 25 00 00
Estimated time for processing the remaining packets is : 51:19:25
Note: This SharedMemory works only for Python >= 3.8 since it first appeared there
Start 3 kinds of processes: Reader, Processor(s), Writer.
Have Reader process read the file incrementally, sharing the result via shared_memory and Queue.
Have the Processor(s) consume the Queue, consume the shared_memory, and return the result(s) via another Queue. Again, as shared_memory.
Have the Writer process consume the second Queue, writing to the destination file.
Have them all communicate through, say, some Events or DictProxy, with the MainProcess who will act as the orchestrator.
Example:
import time
import random
import hashlib
import multiprocessing as MP
from queue import Queue, Empty

# noinspection PyCompatibility
from multiprocessing.shared_memory import SharedMemory
from typing import Dict, List


def readerfunc(
    shm_arr: List[SharedMemory], q_out: Queue, procr_ready: Dict[str, bool]
):
    numshm = len(shm_arr)
    for batch in range(1, 6):
        print(f"Reading batch #{batch}")
        for shm in shm_arr:
            #### Simulated Reading ####
            for j in range(0, shm.size):
                shm.buf[j] = random.randint(0, 255)
            #### ####
            q_out.put((batch, shm))
        # Need to sync here because we're reusing the same SharedMemory,
        # so gotta wait until all processors are done before sending the
        # next batch
        while not q_out.empty() or not all(procr_ready.values()):
            time.sleep(1.0)


def processorfunc(
    q_in: Queue, q_out: Queue, suicide: type(MP.Event()), procr_ready: Dict[str, bool]
):
    pname = MP.current_process().name
    procr_ready[pname] = False
    while True:
        time.sleep(1.0)
        procr_ready[pname] = True
        if q_in.empty() and suicide.is_set():
            break
        try:
            batch, shm = q_in.get_nowait()
        except Empty:
            continue
        print(pname, "got batch", batch)
        procr_ready[pname] = False
        #### Simulated Processing ####
        h = hashlib.blake2b(shm.buf, digest_size=4, person=b"processor")
        time.sleep(random.uniform(5.0, 7.0))
        #### ####
        q_out.put((pname, h.hexdigest()))


def writerfunc(q_in: Queue, suicide: type(MP.Event())):
    while True:
        time.sleep(1.0)
        if q_in.empty() and suicide.is_set():
            break
        try:
            pname, digest = q_in.get_nowait()
        except Empty:
            continue
        print("Writing", pname, digest)
        #### Simulated Writing ####
        time.sleep(random.uniform(3.0, 6.0))
        #### ####
        print("Writing", pname, digest, "done")


def main():
    shm_arr = [
        SharedMemory(create=True, size=1024)
        for _ in range(0, 5)
    ]
    q_read = MP.Queue()
    q_write = MP.Queue()
    procr_ready = MP.Manager().dict()
    poison = MP.Event()
    poison.clear()

    reader = MP.Process(target=readerfunc, args=(shm_arr, q_read, procr_ready))
    procrs = []
    for n in range(0, 3):
        p = MP.Process(
            target=processorfunc, name=f"Proc{n}", args=(q_read, q_write, poison, procr_ready)
        )
        procrs.append(p)
    writer = MP.Process(target=writerfunc, args=(q_write, poison))

    reader.start()
    [p.start() for p in procrs]
    writer.start()

    reader.join()
    print("Reader has ended")

    while not all(procr_ready.values()):
        time.sleep(5.0)
    poison.set()

    [p.join() for p in procrs]
    print("Processors have ended")

    writer.join()
    print("Writer has ended")

    [shm.close() for shm in shm_arr]
    [shm.unlink() for shm in shm_arr]


if __name__ == '__main__':
    main()
You say you have 8 cores, yet you have:
POOL_SIZE = 15 #nbcore - 1
Assuming you want to leave one processor free (presumably for the main process), why wouldn't this number be 7? But why do you even want to leave a processor free? You are making successive calls to map. While the main process is waiting for these calls to return, it requires no CPU. This is why, if you do not specify a pool size when you instantiate your pool, it defaults to the number of CPUs you have and not that number minus one. I will have more to say about this below.
Since you have a very large in-memory list, it is possible that you are wasting cycles in your loop by rewriting this list on each iteration. Instead, you can just take a slice of the list and pass that as the iterable argument to map:
from multiprocessing import Pool

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)
with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        save_computed_data_to_disk(to_be_computed_filename, results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
Between each call to map, the main process is writing out the results to to_be_computed_filename. In the meantime, every process in your pool is sitting idle. The writing should be handed off to another process (actually a thread running under the main process):
from multiprocessing import Pool
import queue
import threading

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()
with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        q.put(results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
I've chosen to run save_data in a thread of the main process. This could also be another process in which case you would need to use a multiprocessing.Queue instance. But I figured the main process thread is mostly waiting for the map to complete and there would not be competition for the GIL. Now if you do not leave a processor free for the threading job, save_data, it may end up doing most of the saving only after all of the results have been created. You would need to experiment a bit with this.
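One thing to watch with a writer thread is memory: if results are produced faster than they can be written, an unbounded queue.Queue keeps growing. A minimal sketch of one way to cap this follows. It assumes a bounded queue (the maxsize value is a made-up tuning number), and compute and save are hypothetical callbacks standing in for pool.map(process, ...) and save_computed_data_to_disk. It also assumes save does not raise; error handling is left out.

```python
import queue
import threading

def writer_loop(q, save):
    # Runs in a background thread; None is the shutdown signal.
    while True:
        results = q.get()
        if results is None:
            return
        save(results)

def process_packets(packets, compute, save, maxsize=4):
    # maxsize bounds how many finished packets may wait for the disk.
    q = queue.Queue(maxsize=maxsize)
    t = threading.Thread(target=writer_loop, args=(q, save))
    t.start()
    for packet in packets:
        results = compute(packet)  # e.g. pool.map(process, packet)
        q.put(results)             # blocks while the queue is full
    q.put(None)                    # tell the writer there is no more data
    t.join()                       # wait for the last write to finish
```

With a full queue the producer simply pauses in put(), so at most maxsize result lists are held in memory while the disk catches up.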
Ideally, I would also modify the reading of the input file so as to not have to first read it all into memory but rather read it line by line yielding 2000 line chunks and submitting those as jobs for map to process:
from multiprocessing import Pool
import queue
import threading

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

def read_data():
    """
    yield lists of PACKET_SIZE
    """
    lines = []
    with open(some_file, 'r') as f:
        # pass the bound method itself; iter() calls it until it returns ''
        for line in iter(f.readline, ''):
            lines.append(line)
            if len(lines) == PACKET_SIZE:
                yield lines
                lines = []
    if lines:
        yield lines

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()
with Pool(processes=POOL_SIZE) as pool:
    for l in read_data():
        results = pool.map(process, l)
        q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
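The chunking generator above can also be written with itertools.islice, which works on any iterable of lines and not only on an open file. A small sketch; read_chunks is a hypothetical name:

```python
from itertools import islice

def read_chunks(lines, size):
    """Yield successive lists of at most `size` items from `lines`."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))  # takes up to `size` items
        if not chunk:                   # iterator exhausted
            return
        yield chunk
```

With an open file f, the loop would then read for chunk in read_chunks(f, PACKET_SIZE), and only one chunk is held in memory at a time.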
I made two assumptions: The writing is hitting the I/O bound, not the CPU bound - meaning that throwing more cores onto writing would not improve the performance. And the process function contains some heavy computations.
I would approach it differently:
Split up the large list into a list of lists
Then feed it into the processes
Store the total result
Here is the example code:
import multiprocessing as mp

data_lines = [0]*10000 # read it from file
size = 2000
# Split the list into a list of list (with chunksize `size`)
work = [data_lines[i:i + size] for i in range(0, len(data_lines), size)]

def process(data):
    result = len(data) # do something fancy
    return result

with mp.Pool() as p:
    result = p.map(process, work)
save_computed_data_to_disk(file_name, result)
On a side note: you may also want to look into numpy or pandas (depending on the data), because it sounds like you are doing something in that direction.
The first thing that comes to mind is to run the saving function in a thread. This removes the bottleneck of waiting for the disk write. Like so:
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
saving_futures = []
future = executor.submit(save_computed_data_to_disk, to_be_computed_filename, results)
saving_futures.append(future)
...
concurrent.futures.wait(saving_futures, return_when=concurrent.futures.ALL_COMPLETED) # wait all saved to disk after processing
print("Done")
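Fleshed out, the idea might look like the following. This is only a sketch under a few assumptions: compute and save are hypothetical stand-ins for pool.map(process, ...) and save_computed_data_to_disk, and max_workers=1 is chosen here (unlike the snippet above) so that writes happen one at a time and in submission order.

```python
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

def run(packets, compute, save):
    saving_futures = []
    # A single worker thread serializes the writes in submission order.
    with ThreadPoolExecutor(max_workers=1) as executor:
        for packet in packets:
            results = compute(packet)
            # submit() returns immediately; the save runs in the background.
            saving_futures.append(executor.submit(save, results))
        wait(saving_futures, return_when=ALL_COMPLETED)
    # result() re-raises any exception that happened inside save().
    for future in saving_futures:
        future.result()
```

Calling result() on each future at the end matters: without it, an exception raised while saving would be silently stored in the future and never seen.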

Process pool results without waiting for all tasks to finish

from multiprocessing import Pool
from functools import partial
from time import sleep
import random
import string
import uuid
import os
import glob
def task_a(param1, param2, mydata):
thread_id = str(uuid.uuid4().hex) # this may not be robust enough to guarantee no collisions, address
output_filename = ''.join([str(thread_id),'.txt'])
# part 1 - create output file for task_b to use
with open(output_filename, 'w') as outfile:
for line in mydata:
outfile.write(line)
# part 2 - do some extra stuff (whilst task_b is running)
sleep(5)
print('Task A finished')
return output_filename # not interested in return val
def task_b(expected_num_files):
processed_files = 0
while processed_files<expected_num_files:
print('I am task_b, waiting for {} files ({} so far)'.format(expected_num_files, processed_files))
path_to_search = ''
for filename in glob.iglob(path_to_search + '*.txt', recursive=True):
print('Got file : {}'.format(filename))
# would do something complicated here
os.rename(filename, filename+'.done')
processed_files+=1
sleep(10)
if __name__ == '__main__':
param1 = '' # dummy variable, need to support in solution
param2 = '' # dummy variable, need to support in solution
num_workers = 2
full_data = [[random.choice(string.ascii_lowercase) for _ in range(5)] for _ in range(100)]
print(full_data)
for i in range(0, len(full_data), num_workers):
print('Going to process {}'.format(full_data[i:i+num_workers]))
p = Pool(num_workers)
task_a_func = partial(task_a, param1, param2)
results = p.map(task_a_func, full_data[i:i+num_workers])
p.close()
p.join()
task_b(expected_num_files=num_workers) # want this running sooner
print('Iteration {} complete'.format(i))
#want to wait for task_a's and task_b to finish
I'm having trouble scheduling these tasks to run concurrently.
task_a runs in a multiprocessing pool and produces an output file part way through its execution.
task_b MUST process the output files sequentially, in any order (as soon as they are available), WHILST task_a continues to run (it will no longer change the output file).
The next iteration must only start when both all task_a's have completed AND task_b has completed.
The toy code I have posted obviously waits for task_a's to fully complete before task_b is started (which is not what I want)
I have looked at multiprocessing / subprocess etc. but cannot find a way to launch both the pool and the single task_b process concurrently AND wait for BOTH to finish.
task_b is written as if it could be changed to an external script, but I am still stuck on how to manage the execution.
Should I effectively merge code from task_b into task_a and somehow pass a flag to ensure one worker per pool 'runs the task_b code' via a if/else - at least then I would just be waiting on the pool to complete?
You can use an interprocess queue to communicate the filenames between task a and task b.
Also, initializing the pool repeatedly inside the loop is harmful and unnecessarily slow.
It's better to initialize the pool once at the beginning.
from multiprocessing import Pool, Manager, Event
from functools import partial
from time import sleep
import random
import string
import uuid
import os
import glob

def task_a(param1, param2, queue, mydata):
    thread_id = str(uuid.uuid4().hex)
    output_filename = ''.join([str(thread_id),'.txt'])
    output_filename = 'data/' + output_filename
    with open(output_filename, 'w') as outfile:
        for line in mydata:
            outfile.write(line)
    print(f'{thread_id}: Task A file write complete for data {mydata}')
    queue.put(output_filename)
    print('Task A finished')

def task_b(queue, num_workers, data_size, event_task_b_done):
    print('Task b started!')
    processed_files = 0
    while True:
        filename = queue.get()
        if filename == 'QUIT':
            # Whenever you want task_b to quit, just push 'QUIT' to the queue
            print('Task b quitting')
            break
        print('Got file : {}'.format(filename))
        os.rename(filename, filename+'.done')
        processed_files+=1
        print(f'Have processed {processed_files} so far!')
        if (processed_files % num_workers == 0) or (processed_files == data_size):
            event_task_b_done.set()

if __name__ == '__main__':
    param1 = '' # dummy variable, need to support in solution
    param2 = '' # dummy variable, need to support in solution
    num_workers = 2
    data_size = 100
    full_data = [[random.choice(string.ascii_lowercase) for _ in range(5)] for _ in range(data_size)]
    mgr = Manager()
    queue = mgr.Queue()
    event_task_b_done = mgr.Event()
    # One extra worker for task b
    p = Pool(num_workers + 1)
    p.apply_async(task_b, args=(queue, num_workers, data_size, event_task_b_done))
    task_a_func = partial(task_a, param1, param2, queue)
    for i in range(0, len(full_data), num_workers):
        data = full_data[i:i+num_workers]
        print('Going to process {}'.format(data))
        p.map_async(task_a_func, full_data[i:i+num_workers])
        print(f'Waiting for task b to process all {num_workers} files...')
        event_task_b_done.wait()
        event_task_b_done.clear()
        print('Iteration {} complete'.format(i))
    queue.put('QUIT')
    p.close()
    p.join()
    exit(0)
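The hand-off itself can be tried out in a few lines with threads. The sketch below is not the code above, just the same queue idea reduced to its core: task_a publishes a file name part way through and carries on, task_b consumes exactly one name per item in the batch, and the iteration ends only when both sides are done. The bodies of task_a and task_b are placeholders, run_iteration is a hypothetical name, and with a process pool a Manager().Queue() (as in the answer) would take the place of queue.Queue.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def task_a(out_q, item):
    name = f"{item}.txt"  # part 1: pretend the output file was written
    out_q.put(name)       # hand it to task_b right away
    return item           # part 2: the remaining work would go here

def task_b(in_q, expected, handled):
    # Sequentially consume exactly `expected` file names.
    for _ in range(expected):
        handled.append(in_q.get())

def run_iteration(executor, batch):
    q = queue.Queue()
    handled = []
    b = threading.Thread(target=task_b, args=(q, len(batch), handled))
    b.start()                                     # task_b runs alongside the pool
    list(executor.map(partial(task_a, q), batch)) # wait for every task_a
    b.join()                                      # and for task_b as well
    return handled
```

Counting the expected files per iteration avoids the need for a separate event: task_b returns on its own once the batch is complete.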

Problems with serial communication and queues

I've some problems creating a multi-process serial logger.
The plan: Have a separate process read from the serial port and put the data into a queue. The main process reads the entire queue after some time and processes the data.
But I'm not sure if this is the right way to do it, because sometimes the data is not in the right order. It works well for slow communication.
Do I have to lock something?! Is there a smarter way to do this?
import time
import serial
from multiprocessing import Process, Queue
def myProcess(q):
with serial.Serial("COM2", 115200, 8, "E", 1, timeout=None) as ser:
while True:
q.put("%02X" % ser.read(1)[0])
if __name__=='__main__':
try:
q = Queue()
p = Process(target=myProcess, args=(q,))
p.daemon = True
p.start()
data = []
while True:
print(q.qsize()) #!debug
while not q.empty(): #get all data from queue
data.append(q.get())
#proc_data(data) #data processing
time.sleep(1) #emulate data processing
del data[:] #clear buffer
except KeyboardInterrupt:
print("clean-up") #!debug
p.join()
Update:
I tried another version based on threads (see code below), but with the same effect/problem. The carry-over works fine, but one byte 'between' the carry-over and the new data is always gone -> The script will miss the byte when main reads the queue?!
import time, serial, threading, queue
def read_port(q):
with serial.Serial("COM2", 19200, 8, "E", 1, timeout=None) as ser:
while t.is_alive():
q.put("%02X" % ser.read(1)[0])
def proc_data(data, crc):
#processing data here
carry = data[len(data)/2:] #DEBUG: emulate result (return last half of data)
return carry
if __name__=='__main__':
try:
q = queue.Queue()
t = threading.Thread(target=read_port, args=(q,))
t.daemon = True
t.start()
data = []
while True:
try:
while True:
data.append(q.get_nowait()) #get all data from queue
except queue.Empty:
pass
print(data) #DEBUG: show carry-over + new data
data = proc_data(data) #process data and store carry-over
print(data) #DEBUG: show new carry-over
time.sleep(1) #DEBUG: emulate processing time
except KeyboardInterrupt:
print("clean-up")
t.join(0)
Consider the following code.
1) the two processes are siblings; the parent just sets them up then waits for control-C to interrupt everything
2) one proc puts raw bytes on the shared queue
3) other proc blocks for the first byte of data. When it gets the first byte, it then grabs the rest of the data, outputs it in hex, then continues.
4) parent proc just sets up others then waits for interrupt using signal.pause()
Note that with multiprocessing, the qsize() (and probably empty()) functions are unreliable, so the code below avoids them: it blocks on get() for the first byte and then drains the queue with get_nowait(), which will reliably grab your data.
source
import signal
import queue
import serial
from multiprocessing import Process, Queue

def read_port(q):
    with serial.Serial("COM2", 115200, 8, "E", 1, timeout=None) as ser:
        while True:
            q.put(ser.read(1)[0])  # in Python 3 this is an int in 0..255

def show_data(q):
    while True:
        # block for first byte of data
        data = [q.get()]
        # consume more data if available
        try:
            while True:
                data.append(q.get_nowait())
        except queue.Empty:  # multiprocessing queues raise queue.Empty
            pass
        print('got:', ":".join("{:02x}".format(b) for b in data))

if __name__ == '__main__':
    try:
        q = Queue()
        Process(target=read_port, args=(q,)).start()
        Process(target=show_data, args=(q,)).start()
        signal.pause()  # wait for interrupt (available on Unix only)
    except KeyboardInterrupt:
        print("clean-up")  #!debug
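The block-then-drain pattern used in show_data can be isolated into a small helper, which also makes it easy to try without a serial port. A sketch, with get_batch as a hypothetical name; it is shown here with queue.Queue, and multiprocessing.Queue raises the same queue.Empty exception.

```python
import queue

def get_batch(q, timeout=None):
    """Block for one item, then take whatever else is already queued."""
    batch = [q.get(timeout=timeout)]  # waits until data arrives
    try:
        while True:
            batch.append(q.get_nowait())
    except queue.Empty:
        pass                          # queue drained for now
    return batch
```

Since there is a single consumer and items are only ever removed through get(), nothing can be skipped between two calls, and the order in which the producer queued the bytes is preserved.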

File Processor using multiprocessing

I am writing a file processor that can (hopefully) parse arbitrary files and perform arbitrary actions on the parsed contents. The file processor needs to run continuously. The basic idea that I am following is
Each file will have two associated processes (One for reading, other for Parsing and writing somewhere else)
The reader will read a line into a common buffer(say a Queue) till EOF or buffer full. Then wait(sleep)
Writer will read from buffer, parse the stuff, write it to (say) DB till buffer not empty. Then wait(sleep)
Interrupting the main program will cause the reader/writer to exit safely (buffer can be washed away without writing)
The program runs fine. But, sometimes Writer will initialize first and find the buffer empty. So it will go to sleep. The Reader will fill the buffer and sleep too. So for sleep_interval my code does nothing. To get around that thing, I tried using a multiprocessing.Event() to signal to the writer that the buffer has some entries which it may process.
My code is
import multiprocessing
import time
import sys
import signal
import Queue
class FReader(multiprocessing.Process):
"""
A basic file reader class
It spawns a new process that shares a queue with the writer process
"""
def __init__(self,queue,fp,sleep_interval,read_offset,event):
self.queue = queue
self.fp = fp
self.sleep_interval = sleep_interval
self.offset = read_offset
self.fp.seek(self.offset)
self.event = event
self.event.clear()
super(FReader,self).__init__()
def myhandler(self,signum,frame):
self.fp.close()
print "Stopping Reader"
sys.exit(0)
def run(self):
signal.signal(signal.SIGINT,self.myhandler)
signal.signal(signal.SIGCLD,signal.SIG_DFL)
signal.signal(signal.SIGILL,self.myhandler)
while True:
sleep_now = False
if not self.queue.full():
print "READER:Reading"
m = self.fp.readline()
if not self.event.is_set():
self.event.set()
if m:
self.queue.put((m,self.fp.tell()),block=False)
else:
sleep_now = True
else:
print "Queue Full"
sleep_now = True
if sleep_now:
print "Reader sleeping for %d seconds"%self.sleep_interval
time.sleep(self.sleep_interval)
class FWriter(multiprocessing.Process):
"""
A basic file writer class
It spawns a new process that shares a queue with the reader process
"""
def __init__(self,queue,session,sleep_interval,fp,event):
self.queue = queue
self.session = session
self.sleep_interval = sleep_interval
self.offset = 0
self.queue_offset = 0
self.fp = fp
self.dbqueue = Queue.Queue(50)
self.event = event
self.event.clear()
super(FWriter,self).__init__()
def myhandler(self,signum,frame):
#self.session.commit()
self.session.close()
self.fp.truncate()
self.fp.write(str(self.offset))
self.fp.close()
print "Stopping Writer"
sys.exit(0)
def process_line(self,line):
#Do not process comments
if line[0] == '#':
return None
my_list = []
split_line = line.split(',')
my_list = split_line
return my_list
def run(self):
signal.signal(signal.SIGINT,self.myhandler)
signal.signal(signal.SIGCLD,signal.SIG_DFL)
signal.signal(signal.SIGILL,self.myhandler)
while True:
sleep_now = False
if not self.queue.empty():
print "WRITER:Getting"
line,offset = self.queue.get(False)
#Process the line just read
proc_line = self.process_line(line)
if proc_line:
#Must write it to DB. Put it into DB Queue
if self.dbqueue.full():
#DB Queue is full, put data into DB before putting more data
self.empty_dbqueue()
self.dbqueue.put(proc_line)
#Keep a track of the maximum offset in the queue
self.queue_offset = offset if offset > self.queue_offset else self.queue_offset
else:
#Looks like writing queue is empty. Just check if DB Queue is empty too
print "WRITER: Empty Read Queue"
self.empty_dbqueue()
sleep_now = True
if sleep_now:
self.event.clear()
print "WRITER: Sleeping for %d seconds"%self.sleep_interval
#time.sleep(self.sleep_interval)
self.event.wait(5)
def empty_dbqueue(self):
#The DB Queue has many objects waiting to be written to the DB. Lets write them
print "WRITER:Emptying DB QUEUE"
while True:
try:
new_line = self.dbqueue.get(False)
except Queue.Empty:
#Write the new offset to file
self.offset = self.queue_offset
break
print new_line[0]
def main():
write_file = '/home/xyz/stats.offset'
wp = open(write_file,'r')
read_offset = wp.read()
try:
read_offset = int(read_offset)
except ValueError:
read_offset = 0
wp.close()
print read_offset
read_file = '/var/log/somefile'
file_q = multiprocessing.Queue(100)
ev = multiprocessing.Event()
new_reader = FReader(file_q,open(read_file,'r'),30,read_offset,ev)
new_writer = FWriter(file_q,open('/dev/null'),30,open(write_file,'w'),ev)
new_reader.start()
new_writer.start()
try:
new_reader.join()
new_writer.join()
except KeyboardInterrupt:
print "Closing Master"
new_reader.join()
new_writer.join()
if __name__=='__main__':
main()
The dbqueue in Writer is for batching together Database writes and for each line I keep the offset of that line. The maximum offset written into DB is stored into offset file on exit so that I can pick up where I left on next run. The DB object (session) is just '/dev/null' for demo.
Previously rather than do
self.event.wait(5)
I was doing
time.sleep(self.sleep_interval)
Which (as I have said) worked well but introduced a little delay. But then the processes exited perfectly.
Now on doing a Ctrl-C on the main process, the reader exits but the writer throws an OSError
^CStopping Reader
Closing Master
Stopping Writer
Process FWriter-2:
Traceback (most recent call last):
File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "FileParse.py", line 113, in run
self.event.wait(5)
File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 303, in wait
self._cond.wait(timeout)
File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 212, in wait
self._wait_semaphore.acquire(True, timeout)
OSError: [Errno 0] Error
I know event.wait() somehow blocks the code but I can't get how to overcome this. I tried wrapping self.event.wait(5) and sys.exit() in a try: except OSError: block but that only makes the program hang forever.
I am using Python-2.6
I think it would be better to use the Queue's blocking get with a timeout in the Writer class, i.e. Queue.get(True, 5). Then, if something is put into the queue during the timeout interval, the Writer wakes up immediately. The Writer loop would then be something like:
while True:
    sleep_now = False
    try:
        print "WRITER:Getting"
        line,offset = self.queue.get(True, 5)
        #Process the line just read
        proc_line = self.process_line(line)
        if proc_line:
            #Must write it to DB. Put it into DB Queue
            if self.dbqueue.full():
                #DB Queue is full, put data into DB before putting more data
                self.empty_dbqueue()
            self.dbqueue.put(proc_line)
            #Keep a track of the maximum offset in the queue
            self.queue_offset = offset if offset > self.queue_offset else self.queue_offset
    except Queue.Empty:
        #Looks like writing queue is empty. Just check if DB Queue is empty too
        print "WRITER: Empty Read Queue"
        self.empty_dbqueue()
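For readers on Python 3, the same idea can be sketched as a standalone function. Everything here is illustrative rather than a port of the class above: writer_loop, flush, batch_size and the optional stop event are hypothetical names, and a plain list plays the role of the dbqueue.

```python
import queue
import threading

def writer_loop(q, flush, batch_size=50, timeout=5.0, stop=None):
    """Batch items from q; flush when the batch is full or q stays empty."""
    batch = []
    while stop is None or not stop.is_set():
        try:
            item = q.get(timeout=timeout)  # wakes as soon as an item arrives
        except queue.Empty:
            if batch:
                flush(batch)               # idle: write what we have
                batch = []
            continue
        batch.append(item)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)                       # final flush on shutdown
```

The timeout doubles as the idle-flush interval and as the longest time the loop takes to notice that stop has been set.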

Multiprocessing, writing to file, and deadlock on large loops

I have a very weird problem with the code below. When numrows = 10, the process loop completes and the program finishes. If the growing list becomes larger, it goes into a deadlock. Why is this, and how can I solve it?
import multiprocessing, time, sys
# ----------------- Calculation Engine -------------------
def feed(queue, parlist):
for par in parlist:
queue.put(par)
def calc(queueIn, queueOut):
while True:
try:
par = queueIn.get(block = False)
print "Project ID: %s started. " % par
res = doCalculation(par)
queueOut.put(res)
except:
break
def write(queue, fname):
print 'Started to write to file'
fhandle = open(fname, "w")
while True:
try:
res = queue.get(block = False)
for m in res:
print >>fhandle, m
except:
break
fhandle.close()
print 'Complete writing to the file'
def doCalculation(project_ID):
numrows = 100
toFileRowList = []
for i in range(numrows):
toFileRowList.append([project_ID]*100)
print "%s %s" % (multiprocessing.current_process().name, i)
return toFileRowList
def main():
parlist = [276, 266]
nthreads = multiprocessing.cpu_count()
workerQueue = multiprocessing.Queue()
writerQueue = multiprocessing.Queue()
feedProc = multiprocessing.Process(target = feed , args = (workerQueue, parlist))
calcProc = [multiprocessing.Process(target = calc , args = (workerQueue, writerQueue)) for i in range(nthreads)]
writProc = multiprocessing.Process(target = write, args = (writerQueue, 'somefile.csv'))
feedProc.start()
feedProc.join ()
for p in calcProc:
p.start()
for p in calcProc:
p.join()
writProc.start()
writProc.join()
if __name__=='__main__':
sys.exit(main())
I think the problem is the Queue buffer getting filled, so you need to read from the queue before you can put additional stuff in it.
For example, in your feed process you have:
queue.put(par)
If you keep putting a lot of data without reading, this will block until the buffer is freed. The problem is that the buffer is only freed in your calc processes, which in turn don't get started before you join your blocking feed process.
So, in order for your feed process to finish, the buffer must be freed, but the buffer won't be freed before the process finishes :)
Try to organize your queue access better.
The feedProc and the writProc are not actually running in parallel with the rest of your program. When you have
proc.start()
proc.join ()
you start the process and then, on the join(), you immediately wait for it to finish. In this case there's no gain from multiprocessing, only overhead. Try to start ALL processes at once before you join them. This will also have the effect that your queues get emptied regularly and you won't deadlock.
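To make the ordering concrete, here is a small sketch of "start everything, then join". It deliberately uses threads and a tiny bounded queue.Queue so that it stays short and the blocking put() is easy to provoke; the same start-all-then-join order applies to multiprocessing.Process with multiprocessing.Queue. The function names mirror the question, the squaring in calc is a placeholder for doCalculation, and the None end markers are an addition that is not in the original code.

```python
import queue
import threading

def feed(q_in, items, n_workers):
    for item in items:
        q_in.put(item)          # may block until a worker takes an item
    for _ in range(n_workers):
        q_in.put(None)          # one end marker per worker

def calc(q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:
            break
        q_out.put(item * item)  # placeholder for doCalculation
    q_out.put(None)             # tell the writer this worker is done

def write(q_out, n_workers, sink):
    finished = 0
    while finished < n_workers:
        res = q_out.get()
        if res is None:
            finished += 1
        else:
            sink.append(res)

def run(items, n_workers=3):
    q_in, q_out = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    sink = []
    threads = [threading.Thread(target=feed, args=(q_in, items, n_workers))]
    threads += [threading.Thread(target=calc, args=(q_in, q_out))
                for _ in range(n_workers)]
    threads.append(threading.Thread(target=write, args=(q_out, n_workers, sink)))
    for t in threads:
        t.start()               # start ALL stages first ...
    for t in threads:
        t.join()                # ... and only then wait for them
    return sink
```

If feed were joined before the calc threads were started, it would hang on the third put() because nothing would be draining the two-slot queue, which is the same shape of deadlock as in the question.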
