Parallel processing of stdin lines with multiprocessing - Python

I have a basic script that reads a CSV file line by line, does some basic processing of each line, and uses this to decide whether to keep the line or not. If it keeps it, the line goes into a buffer list which is written out later (or dealt with appropriately via chunking if the file will not fit in RAM, but more on that later). I was advised that, because of how HDDs work, I would benefit greatly from a write buffer rather than just looping each line straight through to disk.
I am attempting to see if I can speed this up by doing the processing of each line in a different process from the one reading stdin. It is clear that the script is I/O bound, so I want the reading to be running as much as possible, not waiting on CPU work.
To try this I set up two processes using a pipe to move the lines from one to the other. This seems to work, but it is an order of magnitude slower than just doing everything in a single loop over the incoming lines.
Basic implementation snippet
import sys

outStream = open("updated.csv", 'w')
outputBuffer = []
for line in sys.stdin:
    row = line.strip()
    index = parseRow(row)  # CPU-bound process
    if index not in mainList:
        outputBuffer.append(row)
    # some test to see if memory is filling up; if so, dump it and reset the buffer
    # (just here to show the logic atm)
    if getMem() > 60:
        writeBuffer(outputBuffer, outStream)
        outputBuffer = []
# make sure the last batch is written out
if outputBuffer != []:
    writeBuffer(outputBuffer, outStream)
outStream.close()
Attempt at a parallel solution
import os, sys
from multiprocessing import Process, Pipe

def reader(lineConn, fileno):
    for line in os.fdopen(fileno):
        lineConn.send(line)
    lineConn.send("end")

def parser(lineConn):
    outputBuffer = []
    outStream = open("updated.csv", 'w')
    while 1:
        line = lineConn.recv()
        if line == "end":
            break
        row = line.strip()
        index = parseRow(row)
        if index not in mainList:
            outputBuffer.append(row)
    writeBuffer(outputBuffer, outStream)

############ main
outputBuffer = []
lineRead, lineParse = Pipe()
io = sys.stdin.fileno()
p1 = Process(target=reader, args=(lineRead, io))
p2 = Process(target=parser, args=(lineParse,))
p1.start()
p2.start()
p1.join()
p2.join()
I suspect it has something to do with the send being slower than the receive, so I also put in a wait on the parsing end (i.e. if poll() was false, just hang tight for a bit), but even so, the moment I add in the send(line) stuff it grinds to a halt, presumably because I don't understand enough about how it all works.
I had grand plans for making sure that reading was paused when writing needed to be done (i.e. when the file didn't fit in RAM), but I got stuck way before that.
As an aside, part of the reason the parsing is slow is that I need to hunt down the comma in the CSV (just 2 columns) to dig out the index, which means reading the line to get the line and then reading it again to find the comma and grab the integer before it. Is there a way to somehow get stdin to deal with some of this, i.e. read the line and also return the position of the first comma it found?
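For concreteness, the per-line parse in question is essentially this (just a sketch: split at the first comma and convert the leading integer):

import sys

for line in sys.stdin:
    row = line.strip()
    indexStr, _, _ = row.partition(',')   # a second pass over (the start of) the line
    index = int(indexStr)                 # the integer before the first comma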
Important edit
Actually, I just realised the task requires the output to be sent to stdout rather than written directly to a file, so does this mean I don't need to buffer, because the HDD I/O is only on one end?

Related

Read in large text file (~20m rows), apply function to rows, write to new text file

I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.
My code looks like this:
with open('f.txt', 'r') as f:
    function(f, w)
Where the function takes in the large text file and an empty text file, applies the function, and writes to the empty file.
I have tried:
def multiprocess(f, w):
    cores = multiprocessing.cpu_count()
    with Pool(cores) as p:
        pieces = p.map(function, f, w)
    f.close()
    w.close()

multiprocess(f, w)
But when I do this, I get a TypeError: unsupported operand with types 'io.TextIOWrapper' and 'int'. This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.
Even if you could successfully pass open file objects to child OS processes in your Pool as the arguments f and w (which I don't think you can on any OS), trying to read from and write to files concurrently is a bad idea, to say the least.
In general, I recommend using the Process class rather than Pool, assuming that the end result needs to maintain the same order as the input 20m-line file.
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
The slowest solution, but the most efficient in RAM usage:
Your initial solution: execute and process the file line by line.
For maximum speed, but the most RAM consumption:
Read the entire file into RAM as a list via f.readlines(), if your entire dataset fits in memory comfortably.
Figure out the number of cores (say 8, for example).
Split the list evenly into 8 lists.
Pass each list to the function to be executed by a Process instance (at this point your RAM usage will be further doubled, which is the trade-off for max speed), but del the original big list right after to free some RAM.
Each Process handles its entire chunk in order, line by line, and writes it into its own output file (out_file1.txt, out_file2.txt, etc.).
Have your OS concatenate your output files in order into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) if you are running a UNIX system, or the equivalent Windows command on Windows.
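A minimal sketch of that maximum-speed approach; process_line(), 'input.txt' and the out_fileN.txt names are hypothetical placeholders, not taken from the question:

import multiprocessing as mp

def work(lines, out_path):
    with open(out_path, 'w') as out:
        for line in lines:
            out.write(process_line(line))       # hypothetical per-line processing

if __name__ == '__main__':
    cores = mp.cpu_count()
    with open('input.txt') as f:
        lines = f.readlines()                   # whole file in RAM
    step = len(lines) // cores + 1
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    del lines                                   # free the big list, as suggested above
    procs = [mp.Process(target=work, args=(chunk, 'out_file%d.txt' % i))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # then concatenate out_file0.txt, out_file1.txt, ... in order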
For an intermediate trade-off between speed and RAM, but the most complex, we will have to use the Queue class:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
Figure out the number of cores in a variable cores (say 8).
Initialize 8 Queues and 8 Processes, and pass one Queue to each Process. At this point each Process should open its own output file (outfile1.txt, outfile2.txt, etc.).
Each Process shall poll (and block) for a chunk of 10_000 rows, process them, and write them to its output file sequentially.
In a loop in the parent process, read 10_000 * 8 lines from your input 20m-row file.
Split that into several lists (10K chunks) to push to the respective Processes' Queues.
When you're done with the 20m rows, exit the loop and pass a special value into each Process's Queue that signals the end of input data.
When each Process detects that special end-of-data value in its own Queue, it shall close its output file and exit.
Have your OS concatenate your output files in order into one big output file, as above (cat out_file* > big_output.txt on a UNIX system, or the equivalent Windows command on Windows).
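A rough sketch of this Queue-based middle ground; process_line(), the file names, the chunk size and the None sentinel are assumptions for illustration:

import multiprocessing as mp

CHUNK = 10000
SENTINEL = None                            # special end-of-data value

def worker(q, out_path):
    with open(out_path, 'w') as out:
        while True:
            chunk = q.get()                # blocks until the parent sends a chunk
            if chunk is SENTINEL:
                break
            for line in chunk:
                out.write(process_line(line))   # hypothetical per-line processing

if __name__ == '__main__':
    cores = mp.cpu_count()
    queues = [mp.Queue(maxsize=2) for _ in range(cores)]    # small maxsize caps RAM use
    procs = [mp.Process(target=worker, args=(q, 'out_file%d.txt' % i))
             for i, q in enumerate(queues)]
    for p in procs:
        p.start()

    with open('input.txt') as f:
        i = 0
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == CHUNK:
                queues[i % cores].put(chunk)    # round-robin chunks; put() blocks when a queue is full
                i += 1
                chunk = []
        if chunk:
            queues[i % cores].put(chunk)

    for q in queues:
        q.put(SENTINEL)
    for p in procs:
        p.join()
    # afterwards: concatenate out_file0.txt .. out_fileN.txt in order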
Convoluted? Well, it is usually a trade-off between speed, RAM and complexity. Also, for a 20m-row task, one needs to make sure that the data processing is as optimal as possible: inline as many functions as you can, avoid a lot of math, use Pandas / numpy in child processes if possible, etc.
Using in to iterate line by line is not the only way; you can read more than one line at a time, and reading several lines at once will make the program read faster.
Look at this snippet:
# Python code to demonstrate readlines()
L = ["Geeks\n", "for\n", "Geeks\n"]

# writing to file
file1 = open('myfile.txt', 'w')
file1.writelines(L)
file1.close()

# Using readlines()
file1 = open('myfile.txt', 'r')
Lines = file1.readlines()

count = 0
# Strips the newline character
for line in Lines:
    count += 1
    print("Line{}: {}".format(count, line.strip()))
I got it from: https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/.

Read specific lines of csv file

Hello guys,
so I have a huge CSV file (500K lines), and I want to process the file simultaneously with 4 processes (so each one will read approx. 100K lines).
What is the best way to do it using multiprocessing?
What I have up till now:
def csv_handler(path, procceses = 5):
    test_arr = []
    with open(path) as fd:
        reader = DictReader(fd)
        for row in reader:
            test_arr.append(row)
    current_line = 0
    equal_length = len(test_arr) / 5
    for i in range(5):
        process1 = multiprocessing.Process(target=get_data, args=(test_arr[current_line: current_line + equal_length],))
        current_line = current_line + equal_length
I know it's a bad idea to do that with a single reading pass, but I can't find another option..
I would be happy to get some ideas on how to do it in a better way!
CSV is a pretty tricky format to split reads across, and other file formats may be more suitable.
The basic problem is that, as lines may be different lengths, you can't easily know where a particular line starts in order to fseek to it. You would have to scan through the file counting newlines, which is basically just reading it.
But you can get pretty close, which sounds like it is enough for your needs. Say, for two parts: take the file size and divide it by 2.
For the first part, you start at zero and stop after completing the record that straddles file_size / 2.
For the second part, you seek to file_size / 2, look for the next newline, and start there.
This way, while the Python processes won't all get exactly the same amount of data, it will be pretty close, and it avoids inter-process message passing or multi-threading (and, with CPython, probably the global interpreter lock).
Of course, all the normal advice for optimising either file IO or Python code still applies (depending on where your bottleneck lies; you need to measure this).
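A rough sketch of that byte-offset splitting, assuming a plain newline-delimited file; 'big.csv', the part count and the .upper() stand-in for real per-line processing are all illustrative:

import os
from multiprocessing import Process

def process_chunk(path, start, end, out_path):
    with open(path, 'rb') as f, open(out_path, 'w') as out:
        if start > 0:
            f.seek(start - 1)
            if f.read(1) != b'\n':
                f.readline()                  # landed mid-line: skip to the next full line
        while f.tell() < end:                 # own every line that starts before `end`
            line = f.readline()
            if not line:
                break
            out.write(line.decode().upper())  # placeholder per-line "processing"

if __name__ == '__main__':
    path = 'big.csv'
    size = os.path.getsize(path)
    n = 4
    bounds = [size * i // n for i in range(n + 1)]
    procs = [Process(target=process_chunk,
                     args=(path, bounds[i], bounds[i + 1], 'part_%d.out' % i))
             for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # concatenate part_0.out .. part_3.out in order to get the full result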

Removing the line from file once processed

I am reading content from a file line by line. Once a line is processed, I clear it out. Here is the code:
import os

lines = open('q0.txt').readlines()
for i, line in enumerate(lines[:]):
    print line
    flag = raw_input()
    print lines[i]
    del lines[i]
open('q0.txt', 'w').writelines(lines)
I am going through a large q0.txt. My intention is that, if there is any interruption in between, I should not reprocess previously processed lines again.
In the above code, though I delete lines[i], it still remains in the file. What is wrong?
I expect the above code to throw an IndexError somewhere.
Why? Let us say your script reads a 100-line file. lines[:] will have 100 lines in it.
Meanwhile, del lines[i] keeps shrinking the original lines list.
Eventually the loop index outruns the shrinking list: with even a single del per iteration, lines[i] (or del lines[i]) goes out of range well before the loop finishes and throws an IndexError.
Therefore, the line open('q0.txt', 'w').writelines(lines) never gets executed once anything has been deleted, and hence the file remains the same.
This is my understanding.
Since raw_input is blocking your code, you might want to separate the work into two threads: the main one and one that you create in your code. Since threads run concurrently and in a somewhat unpredictable order, you're not going to be able to control exactly at which line the interruption reaches your main loop. Threads are a very tricky thing to get right, and they require a lot of reading, testing and checking why things happen the way they happen...
Also, since you don't mind consuming your lines, you can do what's called a destructive read: load the contents of the file into a lines variable and keep taking the last one with pop() until you run out of lines to consume (or the flag has been activated). Check what the pop() method does on a list. Be aware that pop() with no argument always returns the last item of a list; if you want the items printed in the original order, use pop(0) or pop() from a previously reversed list.
import threading

interrupt = None

def flag_activator():
    global interrupt
    interrupt = raw_input("(!!) Type yes when you wanna stop\n\n")
    print "Oh gosh! The user input %s" % interrupt

th = threading.Thread(target=flag_activator)
th.start()

fr = open('q0.txt', 'r')
lines = fr.readlines()
fr.close()

while lines and interrupt != 'yes':
    print "I read this line: %s" % lines.pop()

if len(lines) > 0:
    print "Crap! There are still lines"
    fw = open('q0.txt', 'w')
    fw.writelines(lines)
    fw.close()
Now, that code is gonna block your terminal until you type yes on the terminal.
PS: Don't forget to close your opened files (if you don't want to call close() explicitly, see the with statement here and here)
EDIT (as per OP's comments to my misunderstanding):
If what you want is to ensure that the file will not contain already-processed lines if your script suddenly stops, an inefficient (but straightforward) way to accomplish that is:
Open the file for read and write (you're gonna need a different file descriptor for each operation)
Load all the file's lines into a variable
Process the first line
Remove that line from the list variable
Write the remaining list to the file
Repeat until no more lines are loaded.
All this opening and closing of files is really, really inefficient, but here it goes:
done = False
while done == False:
    # read first, then reopen for writing; opening with 'w' straight away would
    # truncate the file before its lines could be read back
    with open("q0.txt", 'r') as fr:
        lines = fr.readlines()
    if len(lines) > 0:
        print lines[0]  # This would be your processing
        del lines[0]
        with open("q0.txt", 'w') as fw:
            fw.writelines(lines)
    else:
        done = True

Read a file continuously and sending the output to another file

I have a file 'out.txt' that is updated continuously. I need to send the contents of this file periodically to another file 'received.txt', every N minutes. I do not want the previously sent lines to be sent again, so the script needs to send only the new data and update 'received.txt' with the new lines of text without repeating lines.
I'm having a hard time putting this script together. I'm guessing I need some sort of loop to do this continuously. Here is what I have so far (not in order).
EDIT: I am using Debian(Raspbian) on a Raspberry Pi
import sys

num_lines = sum(1 for line in open('out.txt'))  # read the last line of the updated file
sys.stdout = open('received.txt', 'w')          # write to the received.txt file
print 'test'
f = open('out.txt', 'r')                        # read the data from the last line
f.readline(num_lines)
for line in f:
    print line
Any advice would be extremely helpful.
Thank you
There are a few ways to do this.
The simplest is to keep looping over the file even after EOF. You could do this by just wrapping a while True: around the for line in f:, or by just looping forever around f.readline().
But this will waste a lot of CPU power, and possibly even disk access, checking over and over as fast as possible whether the file is still at EOF. You can fix that by sleeping whenever you get to the end of the file, like this:
while True:
    for line in f:
        print line
    time.sleep(0.5)
But if the file is not written to for a long time, you're still wasting CPU power (which may not seem like a problem, but imagine what happens when the computer wants to go to sleep, and it can't because you're making it work every half a second). And meanwhile, if the file is being written to a lot faster than twice/second, you're going to lag.
So, a better solution is to block until there's something to read.
Unfortunately, there's no easy cross-platform way to do this. Fortunately, there are relatively easy platform-specific ways to do it on most platforms, but I'd need to know your platform to help.
For example, on OS X or other *BSD systems, you can use kqueue to wait until a file has something to read:
from select import *

# the rest of your code until the reading loop

while True:
    for line in f:
        print line
    kq = kqueue()
    kq.control([kevent(f.fileno(), filter=KQ_FILTER_READ, flags=KQ_EV_ADD)], 0, 0)
    kq.control(None, 1)
    kq.close()
But that won't work on Windows, or Linux, or any other platform. (Also, that's a pretty bad way to do it on BSD; it's just shorter to show it this way than the right way. If you want to do this on OS X, find a good tutorial on using kqueue in Python; don't copy this code.)

Parallel processing of a large .csv file in Python

I'm processing large CSV files (on the order of several GBs with 10M lines) using a Python script.
The files have different row lengths, and cannot be loaded fully into memory for analysis.
Each line is handled separately by a function in my script. It takes about 20 minutes to analyze one file, and it appears disk access speed is not an issue, but rather processing/function calls.
The code looks something like this (very straightforward). The actual code uses a Class structure, but this is similar:
csvReader = csv.reader(open("file", "r"))
for row in csvReader:
    handleRow(row, dataStructure)
Given the calculation requires a shared data structure, what would be the best way to run the analysis in parallel in Python utilizing multiple cores?
In general, how do I read multiple lines at once from a .csv in Python to transfer to a thread/process? Looping with for over the rows doesn't sound very efficient.
Thanks!
This might be too late, but just for future users I'll post anyway. Another poster mentioned using multiprocessing. I can vouch for it and can go into more detail: we deal with files in the hundreds of MB to several GB every day using Python, so it's definitely up to the task. Some of the files we deal with aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. However, the methodology is the same no matter the file type.
You can process pieces of the large files concurrently. Here's pseudo code of how we do it:
import os, multiprocessing as mp

# process file function
def processfile(filename, start=0, stop=0):
    if start == 0 and stop == 0:
        ... process entire file...
    else:
        with open(filename, 'r') as fh:
            fh.seek(start)
            lines = fh.readlines(stop - start)
            ... process these lines ...
    return results

if __name__ == "__main__":
    # get file size and set chunk size
    filesize = os.path.getsize(filename)
    split_size = 100*1024*1024

    # determine if it needs to be split
    if filesize > split_size:
        # create pool, initialize chunk start location (cursor)
        pool = mp.Pool(mp.cpu_count())
        cursor = 0
        results = []
        with open(filename, 'r') as fh:
            # for every chunk in the file...
            for chunk in xrange(filesize // split_size):
                # determine where the chunk ends, is it the last one?
                if cursor + split_size > filesize:
                    end = filesize
                else:
                    end = cursor + split_size
                # seek to end of chunk and read next line to ensure you
                # pass entire lines to the processfile function
                fh.seek(end)
                fh.readline()
                # get current file location
                end = fh.tell()
                # add chunk to process pool, save reference to get results
                proc = pool.apply_async(processfile, args=[filename, cursor, end])
                results.append(proc)
                # setup next chunk
                cursor = end
        # close and wait for pool to finish
        pool.close()
        pool.join()
        # iterate through results
        for proc in results:
            processfile_result = proc.get()
    else:
        ...process normally...
Like I said, that's only pseudo code. It should get anyone started who needs to do something similar. I don't have the code in front of me, just doing it from memory.
But we got more than a 2x speed-up from this on the first run, without fine-tuning it. You can fine-tune the number of processes in the pool and how large the chunks are to get an even higher speed-up, depending on your setup. If you have multiple files, as we do, create a pool to read several files in parallel. Just be careful not to overload the box with too many processes.
Note: You need to put it inside an "if main" block to ensure infinite processes aren't created.
Try benchmarking reading your file and parsing each CSV row but doing nothing with it. You ruled out disk access, but you still need to see if the CSV parsing is what's slow or if your own code is what's slow.
If it's the CSV parsing that's slow, you might be stuck, because I don't think there's a way to jump into the middle of a CSV file without scanning up to that point.
If it's your own code, then you can have one thread reading the CSV file and dropping rows into a queue, and then have multiple threads processing rows from that queue. But don't bother with this solution if the CSV parsing itself is what's making it slow.
Because of the GIL, Python's threading won't speed up computations that are processor bound the way it can for IO-bound work.
Instead, take a look at the multiprocessing module which can run your code on multiple processors in parallel.
If the rows are completely independent, just split the input file into as many files as you have CPUs. After that, you can run as many instances of the script as you now have input files. These instances, since they are completely separate processes, will not be bound by GIL problems.
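A minimal sketch of that idea, assuming a hypothetical worker script worker.py that takes one part file as its argument; the round-robin split and the file names are illustrative only:

import multiprocessing
import subprocess

def split_file(path, n_parts):
    # round-robin the lines across n_parts part files
    outs = [open('%s.part%d' % (path, i), 'w') for i in range(n_parts)]
    with open(path) as f:
        for i, line in enumerate(f):
            outs[i % n_parts].write(line)
    for o in outs:
        o.close()
    return ['%s.part%d' % (path, i) for i in range(n_parts)]

if __name__ == '__main__':
    parts = split_file('input.csv', multiprocessing.cpu_count())
    # one completely independent interpreter per part file, so no GIL contention
    workers = [subprocess.Popen(['python', 'worker.py', p]) for p in parts]
    for w in workers:
        w.wait()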
Just found a solution to this old problem. I tried Pool.imap, and it seems to simplify processing large files significantly. imap has one significant benefit when it comes to processing large files: it returns results as soon as they are ready rather than waiting for all the results to be available, which saves a lot of memory.
(Here is an untested snippet of code which reads a csv file row by row, processes each row and writes it back to a different csv file. Everything is done in parallel.)
import multiprocessing as mp
import csv

CHUNKSIZE = 10000  # Set this to whatever you feel is reasonable

def _run_parallel(csvfname, csvoutfname):
    with open(csvfname) as csvf, \
         open(csvoutfname, 'w') as csvout, \
         mp.Pool() as p:
        reader = csv.reader(csvf)
        # 'process' is your row-processing function; it should return a list of output fields
        csv.writer(csvout).writerows(p.imap(process, reader, chunksize=CHUNKSIZE))
If you use zmq and a DEALER middleman, you'd be able to spread the row processing not just across the CPUs on your machine but across a network to as many processes as necessary. This would essentially guarantee that you hit an IO limit rather than a CPU limit :)
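For what it's worth, a rough sketch of fanning rows out over ZeroMQ, using the simpler PUSH/PULL pipeline pattern rather than the DEALER setup mentioned above; the port number and the do_work() function are made up for illustration:

import sys
import zmq

def producer(csv_path):
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://*:5557")             # workers connect here, locally or over the network
    with open(csv_path) as f:
        for line in f:
            push.send_string(line)        # each row goes to whichever worker is free

def worker():
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.connect("tcp://localhost:5557")  # point this at the producer's host
    while True:
        line = pull.recv_string()
        do_work(line)                     # hypothetical per-row processing

if __name__ == '__main__':
    # run as "python script.py producer rows.csv" on one machine
    # and "python script.py worker" on as many machines/cores as you like
    if sys.argv[1] == 'producer':
        producer(sys.argv[2])
    else:
        worker()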
