multiprocess to split one file - Is it always IO bound?

multiprocess to split one file - Is it always IO bound? - python

I was reading a similar thread where the OP wanted to process each line in a function using multiprocessing (found here). The answer to this question that was intriguing was the following:
from multiprocessing import Pool
def process_line(line):
return "FOO: %s" % line
if __name__ == "__main__":
pool = Pool(4)
with open('file.txt') as source_file:
# chunk the work into batches of 4 lines at a time
results = pool.map(process_line, source_file, 4)
I'm wondering if you can do the same, but instead of returning each line processed, write it into another file.
Basically I want to see if there is a way to MP reading and writing a file in order to split it up by lines. Say I want 100,000 lines per file.
from multiprocessing import Pool
def write_lines(line):
#need method to write lines to multiple files, perhaps a Queue?
if __name__ == "__main__":
#all my procs
pool = Pool()
with open('file.txt') as source_file:
# chunk the work into batches of 4 lines at a time
results = pool.map(process_line, source_file, 100000)
I could use a MP Queue to split up the file into separate Queue objects, then fill each processor with a job of writing out all the lines, but I still have to read through the file first. So will it always be completely IO bound and never be able to be MP in an efficient way?

As you suspected, this is workload really won't benefit much (if at all) from multiprocessing. All you're doing here is reading one file, then writing the contents of that file to other files. This is completely I/O bound; the bottleneck is going to be the speed of reading and writing to disk. Using multiprocessing to try to write multiple files to the same disk concurrently isn't going to make the writes any faster, because the disk can only write one thing at a time.
Where multiprocessing can help is if you've got some CPU-bound work that can be parallelized, but that really isn't the case with what you're trying to do. If you wanted to read lines from a file, do some fairly heavy processing of each line, and then write them to some other file, multiprocessing would help, but it doesn't sound like you need to do any processing prior to writing each line.

Related

Read in large text file (~20m rows), apply function to rows, write to new text file

I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.
My code looks like this:
with open('f.txt', 'r') as f:
function(f,w)
Where the function takes in the large text file and an empty text file and applies the function and writes to the empty file.
I have tried:
def multiprocess(f,w):
cores = multiprocessing.cpu_count()
with Pool(cores) as p:
pieces = p.map(function,f,w)
f.close()
w.close()
multiprocess(f,w)
But when I do this, I get a TypeError <= unsupported operand with type 'io.TextWrapper' and 'int'. This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.

even if you can successfully pass open file objects to child OS processes in your Pool as arguments f and w (which I don't think you can on any OS) trying to read from and write to files concurrently is a bad idea, to say the least.
In general, I recommend using the Process class rather than Pool, assuming that the output end result needs to maintain the same order as the input 20m lines file.
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
The slowest solution, but most efficient RAM usage
Your initial solution to execute and process the file line by line
For maximum speed, but most RAM consumption
Read the entire File into RAM as a list via f.readlines(), if your entire dataset can fit in memory, comfortably
Figure out the number of cores (say 8 cores for example)
Split the list evenly into 8 lists
pass each list to the function to be executed by a Process instance (at this point your RAM usage will be further doubled, which is the trade off for max speed), but you should del the original big list right after to free some RAM
Each Process handles its entire chunk in order line by line, and write it into its own output file (out_file1.txt, out_file2.txt, etc.)
Have your OS concatenate your output files in order into one big output file. you can use subprocess.run('cat out_file* > big_output.txt') if you are running a UNIX system, or the equivalent Windows command for windows.
for an intermediate trade-off between speed and RAM, but the most complex, we will have to use the Queue class
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
Figure out the number of cores in a variable cores (say 8)
Initialize 8 queue, 8 processes, and pass each Queue to each process. At this point each Process should open its own output file (outfile1.txt, outfile2.txt, etc.)
Each process shall poll (and block) for a chunk of 10_000 rows, process them, and write them to their respective output files sequentially
In a loop in the Parent Process, Read 10_000 * 8 lines from your input 20m-rows file
split that into several lists (10K chunks) to push to your respective Processes Queues
When your done with 20m rows exit the loop, pass a special value into each process Queue that signals the end of input data
When each process detects that special End of Data value in its own Queue, each shall close their output file, and exit
Have your OS concatenate your output files in order into one big output file. you can use subprocess.run('cat out_file* > big_output.txt') if you are running a UNIX system, or the equivalent Windows command for windows.
Convoluted? well, it is usually a trade-off between Speed, RAM, Complexity. Also for a 20m row task, one needs to make sure that data processing is as optimal as possible - inline as much functions as you can, avoid alot of math, use Pandas / numpy in child processes if possible, etc.

Using in to iterate is not the way but you can call more than one line by time, you just need to sum one or more to read more than one line, doing this the program will read faster.
Look this snippet.
# Python code to
# demonstrate readlines()
L = ["Geeks\n", "for\n", "Geeks\n"]
# writing to file
file1 = open('myfile.txt', 'w')
file1.writelines(L)
file1.close()
# Using readlines()
file1 = open('myfile.txt', 'r')
Lines = file1.readlines()
count = 0
# Strips the newline character
for line in Lines:
count += 1
print("Line{}: {}".format(count, line.strip()))
I got it from: https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/.

Parallel Processing while writing to file

I have a csv file with around 5million lines, I want to read every line, scrape some data then append it to the end of the line. As this takes time, I want to do this process in parallel (the computation isn't intensive on it's own so I'm not looking at multi-threading, just running in parallel).
Then I want to write all the lines into a new file.
I'm not sure what's the best way to do this, so far I have the loop, function but i need to write the write_to_file function and the parallelisation function. Any tips on the best way to do this and which libraries to use?
Thanks!

You can use joblib package to parallelize this operation.
from joblib import Parallel,delayed
import time
now = time.time()
def process_one_line(line):
# Write your logic to process one line
# e.g., I am appending "_modified" on each line
return line + "_modified"
# Read your file
with open("bigfile.txt","r") as fp:
modified_lines = Parallel(n_jobs=1000, prefer='threads')(delayed(process_one_line) (line) for line in fp)
# Write to new file
with open("modified_bigfile.txt","w") as fp:
Parallel(n_jobs=1000, prefer='threads')(delayed(fp.write) (line) for line in modified_lines)
print(f"Total Time taken - {(time.time() - now)/60}")
I tested this code. It took approx 1.54 mins to modify the 49M file which was pretty nice. Also, you can increase the value for n_jobs if you want it to make it faster.
prefer argument can take threads or processes. My computer crashed when I tried processes as my n_jobs was 1000
You can read more about joblib here

how do I safely write data from a single hdf5 file to multiple files in parallel in python?

I am trying to write my data (from a single file in hdf5 format) to multiple files, and it works fine when the task is executed in serial. Now I want to improve the efficiency and modify the code using the multiprocessing module, but the output sometimes go wrong. Here's a simplified version of my code.
import multiprocessing as mp
import numpy as np
import math, h5py, time
N = 4 # number of processes to use
block_size = 300
data_sz = 678
dataFile = 'mydata.h5'
# fake some data
mydata = np.zeros((data_sz, 1))
for i in range(data_sz):
mydata[i, 0] = i+1
h5file = h5py.File(dataFile, 'w')
h5file.create_dataset('train', data=mydata)
# fire multiple workers
pool = mp.Pool(processes=N)
total_part = int(math.ceil(1. * data_sz / block_size))
for i in range(total_part):
pool.apply_async(data_write_func, args=(dataFile, i, ))
pool.close()
pool.join()
and the data_write_func()'s structure is:
def data_write_func(h5file_dir, i, block_size=block_size):
hf = h5py.File(h5file_dir)
fout = open('data_part_' + str(i), 'w')
data_part = hf['train'][block_size*i : min(block_size*(i+1), data_sz)] # np.ndarray
for line in data_part:
# do some processing, that takes a while...
time.sleep(0.01)
# then write out..
fout.write(str(line[0]) + '\n')
fout.close()
when I set N=1, it works well. but when I set N=2 or N=4, the result get messed sometimes(not every time!). e.g. in data_part_1 I expect the output to be:
301,
302,
303,
...
But sometimes what I get is
0,
0,
0,
...
sometimes I get
379,
380,
381,
...
I'm new to the multiprocessing module, and find it tricky. Appreciate it if any suggestions!

After fixing the fout.write and mydata=... as Andriy suggested your program works as intended, because every process writes to his own file. There's no way the processes intermingle with each other.
What you probaby wanted to do is using multiprocessing.map() which cuts your iterable for you (so you don't need to do the block_size thingies), plus it guarantees that the results are done in order. I've reworked your code to use multiprocessing map:
import multiprocessing
from functools import partial
import pprint
def data_write_func(line):
i = multiprocessing.current_process()._identity[0]
line = [i*2 for i in line]
files[i-1].write(",".join((str(s) for s in line)) + "\n")
N = 4
mydata=[[x+1,x+2,x+3,x+4] for x in range(0,4000*N,4)] # fake some data
files = [open('data_part_'+str(i), 'w') for i in range(N)]
pool = multiprocessing.Pool(processes=N)
pool.map(data_write_func, mydata)
pool.close()
pool.join()
Please note:
i is taken from the process itself, it's either 1 or 2
as now data_write_func is called for every row, the file opening needs to be done in the parent process. Also: you don't need to do the close() the file manually, the OS will do that for you on exit of your python program.
Now, I guess in the end you'd want to have all the output in one file, not in separate files. If your output line is below 4096 bytes on linux (or below 512 bytes on OSX, for other OSes see here) you're actually safe to just open one file (in append mode) and let every process just write into that one file, as writes below these sizes are guaranteed to be atomic by Unix.
Update:
"What if the data is stored in hdf5 file as dataset?"
According to hdf5 doc this works out of the box since version 2.2.0:
Parallel HDF5 is a configuration of the HDF5 library which lets you share open files across multiple parallel processes. It uses the MPI (Message Passing Interface) standard for interprocess communication
So if you do this in your code:
h5file = h5py.File(dataFile, 'w')
dset = h5file.create_dataset('train', data=mydata)
Then you can just access dset from within your process and read/write to it without taking any extra measures. See also this example from h5py using multiprocessing

The issue could not be replicated. Here is my full code:
#!/usr/bin/env python
import multiprocessing
N = 4
mydata=[[x+1,x+2,x+3,x+4] for x in range(0,4000*N,4)] # fake some data
def data_write_func(mydata, i, block_size=1000):
fout = open('data_part_'+str(i), 'w')
data_part = mydata[block_size*i: block_size*i+block_size]
for line in data_part:
# do some processing, say *2 for each element...
line = [x*2 for x in line]
# then write out..
fout.write(','.join(map(str,line))+'\n')
fout.close()
pool = multiprocessing.Pool(processes=N)
for i in range(2):
pool.apply_async(data_write_func, (mydata, i, ))
pool.close()
pool.join()
Sample output from data_part_0:
2,4,6,8
10,12,14,16
18,20,22,24
26,28,30,32
34,36,38,40
42,44,46,48
50,52,54,56
58,60,62,64

multiprocessing cannot guarantee the order of code execution between different threads, it is perfectly reasonable for 2 processes to execute in reverse order of their creation order (at least on windows and mainstream linux)
usually when you use parallelization you need worker threads to generate the data then aggregate the data into a thread safe data structure and save that to file, but you are writing to one file here, presumably on to one hard disk, do you have any reason to believe you will get any additional performance by using multiple threads?

Performance: fastest way of reading in files with Python

So I have about 400 files ranging from 10kb to 56mb in size, file type being .txt/.doc(x)/.pdf/.xml and I have to read them all. My read in files are basically:
#for txt files
with open("TXT\\" + path, 'r') as content_file:
content = content_file.read().split(' ')
#for doc files using pydoc
contents = '\n'.join([para.text for para in doc.paragraphs]).encode("ascii","ignore").decode("utf-8").split(' ')
#for pdf files using pypdf2
for i in range(0, pdf.getNumPages()):
content += pdf.getPage(i).extractText() + "\n"
content = " ".join(content.replace(u"\xa0", " ").strip().split())
contents = content.encode("ascii","ignore").decode("utf-8").split(' ')
#for xml files using lxml
tree = etree.parse(path)
contents = etree.tostring(tree, encoding='utf8', method='text')
contents = contents.decode("utf-8").split(' ')
But I notice even reading 30 text files with under 50kb size each and doing operations on it will take 41 seconds. But If I read a single text file with 56mb takes me 9 seconds. So I'm guessing that it's the file I/O that's slowing me down instead of my program.
Any idea on how to speed up this process? Maybe break down each file type into 4 different threads? But how would you go about doing that since they are sharing the same list and that single list will be written to a file when they are done.

If you're blocked on file I/O, as you suspect, there's probably not much you can do.
But parallelizing to different threads might help if you have great bandwidth but terrible latency. Especially if you're dealing with, say, a networked filesystem or a multi-platter logical drive. So, it can't hurt to try.
But there's no reason to do it per file type; just use a single pool to handle all your files. For example, using the futures module:*
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
results = executor.map(process_file, list_of_filenames)
A ThreadPoolExecutor is slightly smarter than a basic thread pool, because it lets you build composable futures, but here you don't need any of that, so I'm just using it as a basic thread pool because Python doesn't have one of those.**
The constructor creates 4 threads, and all the queues and anything else needed to manage putting tasks on those threads and getting results back.
Then, the map method just goes through each filename in list_of_filenames, creates a task out of calling process_file on that filename, submits it to the pool, and then waits for all of the tasks to finish.
In other words, this is the same as writing:
results = [process_file(filename) for filename in list_of_filenames]
… except that it uses four threads to process the files in parallel.
There are some nice examples in the docs if this isn't clear enough.
* If you're using Python 2.x, you'll need to install a backport before you can use this. Or you can use multiprocessing.dummy.Pool instead, as noted below.
** Actually, it does, in multiprocessing.dummy.Pool, but that's not very clearly documented.

Parallel processing of a large .csv file in Python

I'm processing large CSV files (on the order of several GBs with 10M lines) using a Python script.
The files have different row lengths, and cannot be loaded fully into memory for analysis.
Each line is handled separately by a function in my script. It takes about 20 minutes to analyze one file, and it appears disk access speed is not an issue, but rather processing/function calls.
The code looks something like this (very straightforward). The actual code uses a Class structure, but this is similar:
csvReader = csv.reader(open("file","r")
for row in csvReader:
handleRow(row, dataStructure)
Given the calculation requires a shared data structure, what would be the best way to run the analysis in parallel in Python utilizing multiple cores?
In general, how do I read multiple lines at once from a .csv in Python to transfer to a thread/process? Looping with for over the rows doesn't sound very efficient.
Thanks!

This might be too late, but just for future users I'll post anyway. Another poster mentioned using multiprocessing. I can vouch for it and can go into more detail. We deal with files in the hundreds of MB/several GB every day using Python. So it's definitely up to the task. Some of files we deal with aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. However, the methodology is the same no matter what file type.
You can process pieces of the large files concurrently. Here's pseudo code of how we do it:
import os, multiprocessing as mp
# process file function
def processfile(filename, start=0, stop=0):
if start == 0 and stop == 0:
... process entire file...
else:
with open(file, 'r') as fh:
fh.seek(start)
lines = fh.readlines(stop - start)
... process these lines ...
return results
if __name__ == "__main__":
# get file size and set chuck size
filesize = os.path.getsize(filename)
split_size = 100*1024*1024
# determine if it needs to be split
if filesize > split_size:
# create pool, initialize chunk start location (cursor)
pool = mp.Pool(cpu_count)
cursor = 0
results = []
with open(file, 'r') as fh:
# for every chunk in the file...
for chunk in xrange(filesize // split_size):
# determine where the chunk ends, is it the last one?
if cursor + split_size > filesize:
end = filesize
else:
end = cursor + split_size
# seek to end of chunk and read next line to ensure you
# pass entire lines to the processfile function
fh.seek(end)
fh.readline()
# get current file location
end = fh.tell()
# add chunk to process pool, save reference to get results
proc = pool.apply_async(processfile, args=[filename, cursor, end])
results.append(proc)
# setup next chunk
cursor = end
# close and wait for pool to finish
pool.close()
pool.join()
# iterate through results
for proc in results:
processfile_result = proc.get()
else:
...process normally...
Like I said, that's only pseudo code. It should get anyone started who needs to do something similar. I don't have the code in front of me, just doing it from memory.
But we got more than a 2x speed up from this on the first run without fine tuning it. You can fine tune the number of processes in the pool and how large the chunks are to get an even higher speed up depending on your setup. If you have multiple files as we do, create a pool to read several files in parallel. Just be careful no to overload the box with too many processes.
Note: You need to put it inside an "if main" block to ensure infinite processes aren't created.

Try benchmarking reading your file and parsing each CSV row but doing nothing with it. You ruled out disk access, but you still need to see if the CSV parsing is what's slow or if your own code is what's slow.
If it's the CSV parsing that's slow, you might be stuck, because I don't think there's a way to jump into the middle of a CSV file without scanning up to that point.
If it's your own code, then you can have one thread reading the CSV file and dropping rows into a queue, and then have multiple threads processing rows from that queue. But don't bother with this solution if the CSV parsing itself is what's making it slow.

Because of the GIL, Python's threading won't speed-up computations that are processor bound like it can with IO bound.
Instead, take a look at the multiprocessing module which can run your code on multiple processors in parallel.

If the rows are completely independent just split the input file in as many files as CPUs you have. After that, you can run as many instances of the process as input files you have now. This instances, since they are completely different processes, will not be bound by GIL problems.

Just found a solution to this old problem. I tried Pool.imap, and it seems to simplify processing large file significantly. imap has one significant benefit when comes to processing large files: It returns results as soon as they are ready, and not wait for all the results to be available. This saves lot of memory.
(Here is an untested snippet of code which reads a csv file row by row, process each row and write it back to a different csv file. Everything is done in parallel.)
import multiprocessing as mp
import csv
CHUNKSIZE = 10000 # Set this to whatever you feel reasonable
def _run_parallel(csvfname, csvoutfname):
with open(csvfname) as csvf, \
open(csvoutfname, 'w') as csvout\
mp.Pool() as p:
reader = csv.reader(csvf)
csvout.writerows(p.imap(process, reader, chunksize=CHUNKSIZE))

If you use zmq and a DEALER middle man, you'd be able spread the row processing not just to the CPUs on your computer but across a network to as many processes as necessary. This would essentially guarentee that you hit an IO limit vs a CPU limit :)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.