I am writing a file processor that can (hopefully) parse arbitrary files and perform arbitrary actions on the parsed contents. The file processor needs to run continuously. The basic idea I am following is:
Each file will have two associated processes (one for reading, the other for parsing and writing somewhere else).
The reader will read lines into a common buffer (say, a Queue) until EOF or until the buffer is full, then wait (sleep).
The writer will read from the buffer, parse the contents, and write them to (say) a DB until the buffer is empty, then wait (sleep).
Interrupting the main program will cause the reader/writer to exit safely (the buffer can be discarded without being written).
The program runs fine, but sometimes the writer initializes first and finds the buffer empty, so it goes to sleep. The reader then fills the buffer and sleeps too, so for one sleep_interval my code does nothing. To get around this, I tried using a multiprocessing.Event() to signal to the writer that the buffer has entries it can process.
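In outline, the handshake I am aiming for looks like this (a condensed sketch only, with illustrative names; the full code follows below):
import multiprocessing

queue = multiprocessing.Queue(100)
event = multiprocessing.Event()

def reader_put(line, offset):
    # Reader side: enqueue a line and wake the writer if it is waiting.
    queue.put((line, offset), block=False)
    if not event.is_set():
        event.set()

def writer_idle(timeout):
    # Writer side: nothing left to process; clear the event and wait until
    # the reader signals that new entries have arrived (or the timeout expires).
    event.clear()
    event.wait(timeout)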
My code is
import multiprocessing
import time
import sys
import signal
import Queue
class FReader(multiprocessing.Process):
"""
A basic file reader class
It spawns a new process that shares a queue with the writer process
"""
def __init__(self,queue,fp,sleep_interval,read_offset,event):
self.queue = queue
self.fp = fp
self.sleep_interval = sleep_interval
self.offset = read_offset
self.fp.seek(self.offset)
self.event = event
self.event.clear()
super(FReader,self).__init__()
def myhandler(self,signum,frame):
self.fp.close()
print "Stopping Reader"
sys.exit(0)
def run(self):
signal.signal(signal.SIGINT,self.myhandler)
signal.signal(signal.SIGCLD,signal.SIG_DFL)
signal.signal(signal.SIGILL,self.myhandler)
while True:
sleep_now = False
if not self.queue.full():
print "READER:Reading"
m = self.fp.readline()
if not self.event.is_set():
self.event.set()
if m:
self.queue.put((m,self.fp.tell()),block=False)
else:
sleep_now = True
else:
print "Queue Full"
sleep_now = True
if sleep_now:
print "Reader sleeping for %d seconds"%self.sleep_interval
time.sleep(self.sleep_interval)
class FWriter(multiprocessing.Process):
"""
A basic file writer class
It spawns a new process that shares a queue with the reader process
"""
def __init__(self,queue,session,sleep_interval,fp,event):
self.queue = queue
self.session = session
self.sleep_interval = sleep_interval
self.offset = 0
self.queue_offset = 0
self.fp = fp
self.dbqueue = Queue.Queue(50)
self.event = event
self.event.clear()
super(FWriter,self).__init__()
def myhandler(self,signum,frame):
#self.session.commit()
self.session.close()
self.fp.truncate()
self.fp.write(str(self.offset))
self.fp.close()
print "Stopping Writer"
sys.exit(0)
def process_line(self,line):
#Do not process comments
if line[0] == '#':
return None
my_list = []
split_line = line.split(',')
my_list = split_line
return my_list
def run(self):
signal.signal(signal.SIGINT,self.myhandler)
signal.signal(signal.SIGCLD,signal.SIG_DFL)
signal.signal(signal.SIGILL,self.myhandler)
while True:
sleep_now = False
if not self.queue.empty():
print "WRITER:Getting"
line,offset = self.queue.get(False)
#Process the line just read
proc_line = self.process_line(line)
if proc_line:
#Must write it to DB. Put it into DB Queue
if self.dbqueue.full():
#DB Queue is full, put data into DB before putting more data
self.empty_dbqueue()
self.dbqueue.put(proc_line)
#Keep a track of the maximum offset in the queue
self.queue_offset = offset if offset > self.queue_offset else self.queue_offset
else:
#Looks like writing queue is empty. Just check if DB Queue is empty too
print "WRITER: Empty Read Queue"
self.empty_dbqueue()
sleep_now = True
if sleep_now:
self.event.clear()
print "WRITER: Sleeping for %d seconds"%self.sleep_interval
#time.sleep(self.sleep_interval)
self.event.wait(5)
def empty_dbqueue(self):
#The DB Queue has many objects waiting to be written to the DB. Lets write them
print "WRITER:Emptying DB QUEUE"
while True:
try:
new_line = self.dbqueue.get(False)
except Queue.Empty:
#Write the new offset to file
self.offset = self.queue_offset
break
print new_line[0]
def main():
write_file = '/home/xyz/stats.offset'
wp = open(write_file,'r')
read_offset = wp.read()
try:
read_offset = int(read_offset)
except ValueError:
read_offset = 0
wp.close()
print read_offset
read_file = '/var/log/somefile'
file_q = multiprocessing.Queue(100)
ev = multiprocessing.Event()
new_reader = FReader(file_q,open(read_file,'r'),30,read_offset,ev)
new_writer = FWriter(file_q,open('/dev/null'),30,open(write_file,'w'),ev)
new_reader.start()
new_writer.start()
try:
new_reader.join()
new_writer.join()
except KeyboardInterrupt:
print "Closing Master"
new_reader.join()
new_writer.join()
if __name__=='__main__':
main()
The dbqueue in the writer is for batching database writes, and for each line I keep the offset of that line. The maximum offset written to the DB is stored in the offset file on exit so that I can pick up where I left off on the next run. The DB object (session) is just '/dev/null' for the demo.
Previously, rather than doing
self.event.wait(5)
I was doing
time.sleep(self.sleep_interval)
which (as I have said) worked well apart from the small delay, and the processes exited cleanly.
Now, on a Ctrl-C in the main process, the reader exits but the writer throws an OSError:
^CStopping Reader
Closing Master
Stopping Writer
Process FWriter-2:
Traceback (most recent call last):
File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
self.run()
File "FileParse.py", line 113, in run
self.event.wait(5)
File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 303, in wait
self._cond.wait(timeout)
File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 212, in wait
self._wait_semaphore.acquire(True, timeout)
OSError: [Errno 0] Error
I know event.wait() somehow blocks the code, but I can't figure out how to get around this. I tried wrapping self.event.wait(5) and sys.exit() in a try/except OSError block, but that only makes the program hang forever.
I am using Python 2.6.
I think it would be better to use the queue's blocking timeout for the Writer class: with Queue.get(True, 5), if something is put into the queue during that interval, the writer wakes up immediately. The writer loop would then be something like:
while True:
try:
print "WRITER:Getting"
line,offset = self.queue.get(True, 5)
#Process the line just read
proc_line = self.process_line(line)
if proc_line:
#Must write it to DB. Put it into DB Queue
if self.dbqueue.full():
#DB Queue is full, put data into DB before putting more data
self.empty_dbqueue()
self.dbqueue.put(proc_line)
#Keep a track of the maximum offset in the queue
self.queue_offset = offset if offset > self.queue_offset else self.queue_offset
except Queue.Empty:
#Looks like writing queue is empty. Just check if DB Queue is empty too
print "WRITER: Empty Read Queue"
self.empty_dbqueue()
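For reference, the timeout semantics the loop above relies on: get(True, 5) returns as soon as an item is available and raises Queue.Empty if nothing arrives within five seconds. A standalone illustration (Python 2, the five-second value is arbitrary):
import multiprocessing
import Queue  # only needed for the Queue.Empty exception

q = multiprocessing.Queue()

try:
    # Block for up to 5 seconds; returns immediately if an item arrives sooner.
    item = q.get(True, 5)
    print "got:", item
except Queue.Empty:
    print "nothing arrived within 5 seconds"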
Related
So this seemingly simple problem is doing my head in.
I have a dataset (datas) and I do some processing on it (this isn't the issue, though it takes time owing to the size of the dataset) to produce multiple rows to be stored in a CSV file. However, it is very taxing to produce a row, save it to the CSV, produce another row, save it, and so on.
So I'm trying to implement producer and consumer threads: producers will produce each row of data (to speed up the process) and store it in a queue, and a single consumer will then append it to my CSV file.
My attempts below sometimes succeed (the data is saved correctly), but other times the data is "cut off" (either an entire row or part of it).
What am I doing wrong?
from threading import Thread
from queue import Queue
import csv
q = Queue()
def producer():
datas = [["hello","world"],["test","hey"],["my","away"],["your","gone"],["bye","hat"]]
for data in datas:
q.put(data)
def consumer():
while True:
local = q.get()
file = open('dataset.csv','a')
with file as fd:
writer = csv.writer(fd)
writer.writerow(local)
file.close()
q.task_done()
for i in range(10):
t = Thread(target=consumer)
t.daemon = True
t.start()
producer()
q.join()
I think this does something similar to what you're trying to do. For testing purposes, it prefixes each row of data in the CSV file produced with a "producer id" so the source of the data can be seen in the results.
As you will be able to see from the CSV file produced, all the data gets written to it.
import csv
import random
from queue import Queue
from threading import Thread
import time
SENTINEL = object()
def producer(q, id):
data = (("hello", "world"), ("test", "hey"), ("my", "away"), ("your", "gone"),
("bye", "hat"))
for datum in data:
q.put((id,) + datum) # Prefix producer ID to datum for testing.
time.sleep(random.random()) # Vary thread speed for testing.
class Consumer(Thread):
def __init__(self, q):
super().__init__()
self.q = q
def run(self):
with open('dataset.csv', 'w', newline='') as file:
writer = csv.writer(file, delimiter=',')
while True:
datum = self.q.get()
if datum is SENTINEL:
break
writer.writerow(datum)
def main():
NUM_PRODUCERS = 10
queue = Queue()
# Create producer threads.
threads = []
for id in range(NUM_PRODUCERS):
t = Thread(target=producer, args=(queue, id+1,))
t.start()
threads.append(t)
# Create Consumer thread.
consumer = Consumer(queue)
consumer.start()
# Wait for all producer threads to finish.
while threads:
threads = [thread for thread in threads if thread.is_alive()]
queue.put(SENTINEL) # Indicate to consumer thread no more data.
consumer.join()
print('Done')
if __name__ == '__main__':
main()
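One small variation on the shutdown (my own sketch, reusing the producer, Consumer and SENTINEL definitions above, not part of the answer itself): instead of busy-polling is_alive(), main() can join each producer before putting the sentinel, which blocks without spinning:
def main():
    NUM_PRODUCERS = 10
    queue = Queue()
    # Create and start the producer threads.
    threads = [Thread(target=producer, args=(queue, id + 1)) for id in range(NUM_PRODUCERS)]
    for t in threads:
        t.start()
    # Create and start the consumer thread.
    consumer = Consumer(queue)
    consumer.start()
    # Joining each producer blocks until it finishes, with no polling loop.
    for t in threads:
        t.join()
    queue.put(SENTINEL)  # Indicate to consumer thread no more data.
    consumer.join()
    print('Done')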
I have some problems creating a multi-process serial logger.
The plan: have a separate process read from the serial port and put the data into a queue. The main process reads the entire queue after some time and processes the data.
But I'm not sure this is the right way to do it, because sometimes the data is not in the right order. It works well for slow communication.
Do I have to lock something?! Is there a smarter way to do this?
import time
import serial
from multiprocessing import Process, Queue
def myProcess(q):
with serial.Serial("COM2", 115200, 8, "E", 1, timeout=None) as ser:
while True:
q.put("%02X" % ser.read(1)[0])
if __name__=='__main__':
try:
q = Queue()
p = Process(target=myProcess, args=(q,))
p.daemon = True
p.start()
data = []
while True:
print(q.qsize()) #!debug
while not q.empty(): #get all data from queue
data.append(q.get())
#proc_data(data) #data processing
time.sleep(1) #emulate data processing
del data[:] #clear buffer
except KeyboardInterrupt:
print("clean-up") #!debug
p.join()
Update:
I tried another version based on threads (see code below), but with the same effect/problem. The carry-over works fine, but one byte 'between' the carry-over and the new data is always gone; it seems the script misses that byte while the main thread is reading the queue?!
import time, serial, threading, queue
def read_port(q):
with serial.Serial("COM2", 19200, 8, "E", 1, timeout=None) as ser:
while t.is_alive():
q.put("%02X" % ser.read(1)[0])
def proc_data(data):
    #processing data here
    carry = data[len(data)//2:] #DEBUG: emulate result (return last half of data)
return carry
if __name__=='__main__':
try:
q = queue.Queue()
t = threading.Thread(target=read_port, args=(q,))
t.daemon = True
t.start()
data = []
while True:
try:
while True:
data.append(q.get_nowait()) #get all data from queue
except queue.Empty:
pass
print(data) #DEBUG: show carry-over + new data
data = proc_data(data) #process data and store carry-over
print(data) #DEBUG: show new carry-over
time.sleep(1) #DEBUG: emulate processing time
except KeyboardInterrupt:
print("clean-up")
t.join(0)
Consider the following code.
1) The two processes are siblings; the parent just sets them up and then waits for Ctrl-C to interrupt everything.
2) One proc puts raw bytes on the shared queue.
3) The other proc blocks for the first byte of data. When it gets the first byte, it grabs whatever else is available, outputs it in hex, then continues.
4) The parent proc just sets up the others, then waits for an interrupt using signal.pause().
Note that with multiprocessing, the qsize() (and probably empty()) functions are unreliable, which is why the code below avoids them and will reliably grab your data.
source
import signal
import serial
from Queue import Empty  # raised by Queue.get_nowait() when the queue is empty
from multiprocessing import Process, Queue
def read_port(q):
with serial.Serial("COM2", 115200, 8, "E", 1, timeout=None) as ser:
while True:
q.put( ser.read(1)[0] )
def show_data(q):
while True:
# block for first byte of data
data = [ q.get() ]
# consume more data if available
try:
while True:
data.append( q.get_nowait() )
        except Empty:
pass
print 'got:', ":".join("{:02x}".format(ord(c)) for c in data)
if __name__=='__main__':
try:
q = Queue()
Process(target=read_port, args=(q,)).start()
Process(target=show_data, args=(q,)).start()
signal.pause() # wait for interrupt
except KeyboardInterrupt:
print("clean-up") #!debug
I am stuck on a difficult problem and not able to figure out which way to go. Despite the attempts I have made all day and my many posts, this is not a duplicate question: I need clarity on how I can use multiple threads to grab multiple chunks simultaneously from a server but write them to disk atomically, by locking the file write operation to single-thread access while the next thread waits for the lock to be released.
import argparse,logging, Queue, os, requests, signal, sys, time, threading
import utils as _fdUtils
DESKTOP_PATH = os.path.expanduser("~/Desktop")
appName = 'FileDownloader'
logFile = os.path.join(DESKTOP_PATH, '%s.log' % appName)
_log = _fdUtils.fdLogger(appName, logFile, logging.DEBUG, logging.DEBUG, console_level=logging.DEBUG)
queue = Queue.Queue()
STOP_REQUEST = threading.Event()
maxSplits = threading.BoundedSemaphore(3)
threadLimiter = threading.BoundedSemaphore(5)
lock = threading.Lock()
Update 1:
def _grabAndWriteToDisk(url, saveTo, first=None, queue=None, mode='wb', irange=None):
""" Function to download file..
Args:
url(str): url of file to download
saveTo(str): path where to save file
first(int): starting byte of the range
queue(Queue.Queue): queue object to set status for file download
mode(str): mode of file to be downloaded
irange(str): range of byte to download
"""
fileName = url.split('/')[-1]
filePath = os.path.join(saveTo, fileName)
fileSize = int(_fdUtils.getUrlSizeInBytes(url))
downloadedFileSize = 0 if not first else first
block_sz = 8192
irange = irange if irange else '0-%s' % fileSize
# print mode
resp = requests.get(url, headers={'Range': 'bytes=%s' % irange}, stream=True)
fileBuffer = resp.raw.read()
with open(filePath, mode) as fd:
downloadedFileSize += len(fileBuffer)
fd.write(fileBuffer)
status = r"%10d [%3.2f%%]" % (downloadedFileSize, downloadedFileSize * 100. / fileSize)
status = status + chr(8)*(len(status)+1)
sys.stdout.write('%s\r' % status)
time.sleep(.05)
sys.stdout.flush()
if downloadedFileSize == fileSize:
STOP_REQUEST.set()
if queue:
queue.task_done()
_log.info("Download Completed %s%% for file %s, saved to %s",
downloadedFileSize * 100. / fileSize, fileName, saveTo)
class ThreadedFetch(threading.Thread):
""" docstring for ThreadedFetch
"""
def __init__(self, queue):
super(ThreadedFetch, self).__init__()
self.queue = queue
self.lock = threading.Lock()
def run(self):
threadLimiter.acquire()
try:
items = self.queue.get()
url = items[0]
saveTo = DESKTOP_PATH if not items[1] else items[1]
split = items[-1]
# grab split chunks in separate thread.
if split > 1:
maxSplits.acquire()
try:
sizeInBytes = int(_fdUtils.getUrlSizeInBytes(url))
byteRanges = _fdUtils.getRange(sizeInBytes, split)
mode = 'wb'
th = threading.Thread(target=_grabAndWriteToDisk, args=(url, saveTo, first, self.queue, mode, _range))
_log.info("Pulling for range %s using %s" , _range, th.getName())
th.start()
# _grabAndWriteToDisk(url, saveTo, first, self.queue, mode, _range)
mode = 'a'
finally:
maxSplits.release()
else:
while not STOP_REQUEST.isSet():
self.setName("primary_%s" % url.split('/')[-1])
                    # if downloading the whole file in a single chunk there is no need
# to start a new thread, so directly download here.
_grabAndWriteToDisk(url, saveTo, 0, self.queue)
finally:
threadLimiter.release()
def main(appName, flag='with'):
args = _fdUtils.getParser()
urls_saveTo = {}
if flag == 'with':
_fdUtils.Watcher()
elif flag != 'without':
_log.info('unrecognized flag: %s', flag)
sys.exit()
# spawn a pool of threads, and pass them queue instance
# each url will be downloaded concurrently
for i in xrange(len(args.urls)):
t = ThreadedFetch(queue)
t.daemon = True
t.start()
split = 4
try:
for url in args.urls:
# TODO: put split as value of url as tuple with saveTo
urls_saveTo[url] = args.saveTo
# populate queue with data
for url, saveTo in urls_saveTo.iteritems():
queue.put((url, saveTo, split))
# wait on the queue until everything has been processed
queue.join()
        _log.info('Finished all downloads.')
except (KeyboardInterrupt, SystemExit):
_log.critical('! Received keyboard interrupt, quitting threads.')
I expect multiple threads to grab the chunks, but in the terminal I see the same thread grabbing each range:
INFO - Pulling for range 0-25583 using Thread-1
INFO - Pulling for range 25584-51166 using Thread-1
INFO - Pulling for range 51167-76748 using Thread-1
INFO - Pulling for range 76749-102331 using Thread-1
INFO - Download Completed 100.0% for file 607800main_kepler1200_1600-1200.jpg
but what I am expecting is
INFO - Pulling for range 0-25583 using Thread-1
INFO - Pulling for range 25584-51166 using Thread-2
INFO - Pulling for range 51167-76748 using Thread-3
INFO - Pulling for range 76749-102331 using Thread-4
Please do not mark this as a duplicate without understanding it...
Please recommend a better approach if there is one for doing what I am trying to do.
Cheers:/
If you want the downloads to happen simultaneously and only the writing to the file to be atomic, then don't put the lock around both downloading and writing to the file, but just around the write part.
As every thread uses its own file object to write, I don't see the need to lock that access anyway. What you have to make sure of is that every thread writes to the correct offset, so you need a seek() call on the file before writing the data chunk. Otherwise you would have to write the chunks in file order, which would make things more complicated.
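A minimal sketch of that idea (my illustration, not code from the question), assuming the target file is created at its full size up front and that each worker knows the absolute starting byte of its chunk:
def preallocate(path, total_size):
    # Create the file once at its final size so any thread can seek
    # anywhere inside it.
    with open(path, 'wb') as fd:
        fd.truncate(total_size)

def write_chunk(path, offset, chunk):
    # 'r+b' opens the existing file for in-place updates without truncating it.
    with open(path, 'r+b') as fd:
        fd.seek(offset)
        fd.write(chunk)
Each call opens its own file object, so, as noted above, no lock is needed around the write itself.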
I'm working with log files right now. I want to read a file line by line for a specified period of time, say 10 seconds. Can anybody tell me if there is a way to accomplish this in Python?
Run tail or tac using Popen and iterate over the output until you reach the line (or the time limit) at which you want to stop. Here is an example snippet.
import csv
import sys
import datetime as dt
from subprocess import Popen, PIPE

filename = '/var/log/nginx/access.log'
# Command to read file from the end
cmd = sys.platform == 'darwin' and ['tail', '-r', filename] or ['tac', filename]
# But if you want read it from beginning, use the following
#cmd = ['cat', filename]
proc = Popen(cmd, close_fds=True, stdout=PIPE, stderr=PIPE)
output = proc.stdout
FORMAT = [
# 'foo',
# 'bar',
]
def extract_log_data(line):
    '''Extract data in your log format, normalize it.
'''
return dict(zip(FORMAT, line))
csv.register_dialect('nginx', delimiter=' ', quoting=csv.QUOTE_MINIMAL)
lines = csv.reader(output, dialect='nginx')
started_at = dt.datetime.utcnow()
for line in lines:
data = extract_log_data(line)
print data
if (dt.datetime.utcnow() - started_at) >= dt.timedelta(seconds=10):
break
output.close()
proc.terminate()
Code
from multiprocessing import Process
import time
def read_file(path):
try:
        # open the file for reading
f = open(path, "r")
try:
for line in f:
# do something
pass
# always close the file when leaving the try block
finally:
f.close()
except IOError:
print "Failed to open/read from file '%s'" % (path)
def read_file_limited_time(path, max_seconds):
# init Process
p = Process(target=read_file, args=(path,))
# start process
p.start()
# for max seconds
for i in range(max_seconds):
        # sleep for 1 second (you may change the sleep time to suit your needs)
time.sleep(1)
# if process is not alive, we can break the loop
if not p.is_alive():
break
    # if the process is still alive after max_seconds, kill it!
if p.is_alive():
p.terminate()
def main():
path = "f1.txt"
read_file_limited_time(path,10)
if __name__ == "__main__":
main()
Notes
The reason we "wake up" every second and check whether the process we started is still alive is just to prevent us from sleeping needlessly after the process has finished; it would be a waste of time to keep sleeping for 9 more seconds if the process ended after 1 second.
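An alternative (my own simplification, reusing read_file from the code above, not part of the original answer) is to let join() take the timeout itself, so no polling loop is needed:
from multiprocessing import Process

def read_file_limited_time(path, max_seconds):
    p = Process(target=read_file, args=(path,))
    p.start()
    # wait at most max_seconds for the reader to finish on its own
    p.join(max_seconds)
    # if it is still alive, the time limit was hit, so kill it
    if p.is_alive():
        p.terminate()
        p.join()  # reap the terminated process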
I have a very weird problem with the code below. When numrows = 10 the process loop completes and the program finishes. If the growing list becomes larger, it goes into a deadlock. Why is this, and how can I solve it?
import multiprocessing, time, sys
# ----------------- Calculation Engine -------------------
def feed(queue, parlist):
for par in parlist:
queue.put(par)
def calc(queueIn, queueOut):
while True:
try:
par = queueIn.get(block = False)
print "Project ID: %s started. " % par
res = doCalculation(par)
queueOut.put(res)
except:
break
def write(queue, fname):
print 'Started to write to file'
fhandle = open(fname, "w")
while True:
try:
res = queue.get(block = False)
for m in res:
print >>fhandle, m
except:
break
fhandle.close()
print 'Complete writing to the file'
def doCalculation(project_ID):
numrows = 100
toFileRowList = []
for i in range(numrows):
toFileRowList.append([project_ID]*100)
print "%s %s" % (multiprocessing.current_process().name, i)
return toFileRowList
def main():
parlist = [276, 266]
nthreads = multiprocessing.cpu_count()
workerQueue = multiprocessing.Queue()
writerQueue = multiprocessing.Queue()
feedProc = multiprocessing.Process(target = feed , args = (workerQueue, parlist))
calcProc = [multiprocessing.Process(target = calc , args = (workerQueue, writerQueue)) for i in range(nthreads)]
writProc = multiprocessing.Process(target = write, args = (writerQueue, 'somefile.csv'))
feedProc.start()
feedProc.join ()
for p in calcProc:
p.start()
for p in calcProc:
p.join()
writProc.start()
writProc.join()
if __name__=='__main__':
sys.exit(main())
I think the problem is the Queue buffer getting filled, so you need to read from the queue before you can put additional stuff in it.
For example, in your feed process you have:
queue.put(par)
If you keep putting stuff in without reading, this will cause it to block until the buffer is freed, but the problem is that you only free the buffer in your calc processes, which in turn don't get started until after you join the blocking feed process.
So, in order for your feed process to finish, the buffer would have to be freed, but the buffer won't be freed before that process finishes :)
Try organizing your queue access better.
The feedProc and the writProc are not actually running in parallel with the rest of your program. When you have
proc.start()
proc.join ()
you start the process and then, on the join(), you immediately wait for it to finish. In this case there's no gain from multiprocessing, only overhead. Try to start ALL processes before you join any of them. This will also have the effect that your queues get emptied regularly and you won't deadlock.
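A sketch of what that reordering could look like, reusing feed, calc, write and doCalculation from the question unchanged (the non-blocking get loops with their bare except: break would still deserve a sounder termination condition, such as sentinel values, but the start/join order is what removes the deadlock):
def main():
    parlist = [276, 266]
    nthreads = multiprocessing.cpu_count()
    workerQueue = multiprocessing.Queue()
    writerQueue = multiprocessing.Queue()
    feedProc = multiprocessing.Process(target=feed, args=(workerQueue, parlist))
    calcProc = [multiprocessing.Process(target=calc, args=(workerQueue, writerQueue))
                for i in range(nthreads)]
    writProc = multiprocessing.Process(target=write, args=(writerQueue, 'somefile.csv'))
    # Start everything first ...
    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()
    # ... and only then wait for completion, so the consumers drain the
    # queues while the producers are still filling them.
    feedProc.join()
    for p in calcProc:
        p.join()
    writProc.join()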