For one of our clients, I have to develop an application that can process huge CSV files, ranging from 10 MB to 2 GB in size.
Depending on the file size, the module decides whether to read the file using a multiprocessing pool or the normal CSV reader.
However, when I tested both modes on a 100 MB file, multiprocessing took longer than the normal CSV reader.
Is this correct behaviour, or am I doing something wrong?
Here is my code:
def set_file_processing_mode(self, fpath):
    """ """
    fsize = self.get_file_size(fpath)
    if fsize > FILE_SIZE_200MB:
        self.read_in_async_mode = True
    else:
        self.read_in_async_mode = False

def read_line_by_line(self, filepath):
    """Reads CSV line by line"""
    with open(filepath, 'rb') as csvin:
        csvin = csv.reader(csvin, delimiter=',')
        for row in iter(csvin):
            yield row

def read_huge_file(self, filepath):
    """Read file in chunks"""
    pool = mp.Pool(1)
    for chunk_number in range(self.chunks): #self.chunks = 20
        proc = pool.apply_async(read_chunk_by_chunk,
                                args=[filepath, self.chunks, chunk_number])
        reader = proc.get()
        yield reader
    pool.close()
    pool.join()

def iterate_chunks(self, filepath):
    """Read huge file rows"""
    for chunklist in self.read_huge_file(filepath):
        for row in chunklist:
            yield row

#timeit #-- custom decorator
def read_csv_rows(self, filepath):
    """Read CSV rows and pass it to processing"""
    if self.read_in_async_mode:
        print("Reading in async mode")
        for row in self.iterate_chunks(filepath):
            self.process(row)
    else:
        print("Reading in sync mode")
        for row in self.read_line_by_line(filepath):
            self.process(row)

def process(self, formatted_row):
    """Just prints the line"""
    self.log(formatted_row)

def read_chunk_by_chunk(filename, number_of_blocks, block):
    '''
    A generator that splits a file into blocks and iterates
    over the lines of one of the blocks.
    '''
    results = []
    assert 0 <= block and block < number_of_blocks
    assert 0 < number_of_blocks
    with open(filename) as fp :
        fp.seek(0,2)
        file_size = fp.tell()
        ini = file_size * block / number_of_blocks
        end = file_size * (1 + block) / number_of_blocks
        if ini <= 0:
            fp.seek(0)
        else:
            fp.seek(ini-1)
            fp.readline()
        while fp.tell() < end:
            results.append(fp.readline())
    return results

if __name__ == '__main__':
    classobj.read_csv_rows(sys.argv[1])
Here is a test:
$ python csv_utils.py "input.csv"
Reading in async mode
FINISHED IN 3.75 sec
$ python csv_utils.py "input.csv"
Reading in sync mode
FINISHED IN 0.96 sec
The question is:
Why is async mode taking longer?
NOTE: Removed unnecessary functions/lines to avoid complexity in the code
Is this correct behaviour?
Yes - it may not be what you expect, but it is consistent with the way you implemented it and how multiprocessing works.
Why is async mode taking longer?
The way your example works is perhaps best illustrated by a parable - bear with me please:
Let's say you ask your friend to engage in an experiment. You want him to go through a book and mark each page with a pen, as fast as he can. There are two rounds with a distinct setup, and you are going to time each round and then compare which one was faster:
1. Open the book on the first page, mark it, then flip the page and mark the following pages as they come up. Pure sequential processing.
2. Process the book in chunks. For this he should run through the book's pages chunk by chunk. That is, he should first make a list of page numbers as starting points, say 1, 10, 20, 30, 40, etc. Then for each chunk, he should close the book, open it on the page for the starting point, process all pages before the next starting point comes up, close the book, then start all over again for the next chunk.
Which of these approaches will be faster?
Am I doing something wrong?
You decide both approaches take too long. What you really want to do is ask multiple people (processes) to do the marking in parallel. Now with a book (as with a file) that's difficult because, well, only one person (process) can access the book (file) at any one point. Still it can be done if the order of processing doesn't matter and it is the marking itself - not the accessing - that should run in parallel. So the new approach is like this:
1. Cut the pages out of the book and sort them into, say, 10 stacks.
2. Ask ten people to mark one stack each.
This approach will most certainly speed up the whole process. Perhaps surprisingly though the speed up will be less than a factor of 10 because step 1 takes some time, and only one person can do it. That's called Amdahl's law [wikipedia]:
Essentially, it says that the (theoretical) speed-up of a whole task is limited by the part that cannot be parallelised: if a fraction p of the run time can be spread over n workers, that part shrinks to p/n, while the remaining 1 - p takes as long as before.
Intuitively, the speed-up can only come from the part of the task that is processed in parallel, all the sequential parts are not affected and take the same amount of time, whether p is processed in parallel or not.
That said, in our example, obviously the speed-up can only come from step 2 (marking pages in parallel by multiple people), as step 1 (tearing up the book) is clearly sequential.
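For reference, Amdahl's law can be written as a small function. This is only a sketch: the names amdahl_speedup, p and n are chosen here for illustration, with p the fraction of the run time that can be parallelised and n the number of workers.

```python
def amdahl_speedup(p, n):
    """Theoretical speed-up when a fraction p of the work is spread over n workers."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The sequential share (1 - p) is unaffected; only p shrinks by a factor n.
    return 1.0 / ((1.0 - p) + p / n)


# Ten people, 90% of the work parallelisable: the ceiling is well below 10x.
print(round(amdahl_speedup(0.9, 10), 2))
```

In the book example, if tearing out the pages took 10% of the total time, ten markers could never make the whole job more than about 5.3 times faster.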
develop an application which should be able to process huge CSV files
Here's how to approach this:
1. Determine what part of the processing can be done in parallel, i.e. which work can be applied to each chunk separately and out of sequence.
2. Read the file sequentially, splitting it up into chunks as you go.
3. Use multiprocessing to run multiple processing steps in parallel.
Something like this:
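def process(rows):
    # do all the processing
    ...
    return result

if __name__ == '__main__':
    pool = mp.Pool(N)  # N > 1
    chunks = get_chunks(...)
    # submit every chunk first, then collect, so the workers run in parallel
    pending = [pool.apply_async(process, (rows,)) for rows in chunks]
    pool.close()
    pool.join()
    results = [p.get() for p in pending]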
I'm not defining get_chunks here because there are several documented approaches to doing this e.g. here or here.
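One possible get_chunks, as a minimal sketch rather than the approach from the linked posts: it reads the CSV sequentially and hands out lists of rows. The signature and the chunk_size default are assumptions made for this example.

```python
import csv
from itertools import islice


def get_chunks(filepath, chunk_size=10000):
    """Yield lists of at most chunk_size parsed CSV rows, read sequentially."""
    with open(filepath, newline="") as csvin:
        reader = csv.reader(csvin)
        while True:
            # islice pulls the next chunk_size rows without loading the whole file
            rows = list(islice(reader, chunk_size))
            if not rows:
                break
            yield rows
```

Each yielded list can be passed to pool.apply_async as one unit of work, so the cost of sending data to a worker is paid once per chunk rather than once per row.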
Conclusion
Depending on the kind of processing required for each file, it may well be that the sequential approach to processing any one file is the fastest possible approach, simply because the processing parts don't gain much from being done in parallel. You may still end up processing it chunk by chunk due to e.g. memory constraints. If that is the case, you probably don't need multiprocessing.
If you have multiple files that can be processed in parallel, multiprocessing is a very good approach. It works the same way as shown above, where the chunks are not rows but filenames.
Related
I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.
My code looks like this:
with open('f.txt', 'r') as f:
    function(f,w)
Where the function takes in the large text file and an empty text file and applies the function and writes to the empty file.
I have tried:
def multiprocess(f,w):
    cores = multiprocessing.cpu_count()
    with Pool(cores) as p:
        pieces = p.map(function,f,w)
    f.close()
    w.close()
multiprocess(f,w)
But when I do this, I get a TypeError <= unsupported operand with type 'io.TextWrapper' and 'int'. This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.
Even if you could successfully pass open file objects to the child OS processes in your Pool as arguments f and w (which I don't think you can on any OS), trying to read from and write to files concurrently is a bad idea, to say the least.
In general, I recommend using the Process class rather than Pool, assuming that the end result needs to maintain the same order as the 20m-line input file.
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
The slowest solution, but with the most efficient RAM usage:
Your initial solution, processing the file line by line.
For maximum speed, but the highest RAM consumption:
1. Read the entire file into RAM as a list via f.readlines(), if your entire dataset fits comfortably in memory.
2. Figure out the number of cores (say 8, for example).
3. Split the list evenly into 8 lists.
4. Pass each list to the function to be executed by a Process instance. At this point your RAM usage will roughly double, which is the trade-off for maximum speed, but you should del the original big list right afterwards to free some RAM.
5. Each Process handles its entire chunk in order, line by line, and writes it to its own output file (out_file1.txt, out_file2.txt, etc.).
6. Have your OS concatenate the output files in order into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) on a UNIX system, or the equivalent command on Windows. The shell expands out_file* in lexical order, so with ten or more files zero-pad the numbers (out_file01.txt, ...) to keep them in sequence.
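The maximum-speed recipe above can be sketched as follows. This is an illustration under assumptions, not a tested production version: transform is a hypothetical stand-in for the real per-line function, and the helper names split_evenly, worker and run are made up for this example. The parts are concatenated in Python instead of with cat so that the sketch is portable.

```python
import multiprocessing as mp


def transform(line):
    # Hypothetical stand-in for the real per-line function.
    return line.upper()


def split_evenly(items, n):
    """Split items into n contiguous slices whose lengths differ by at most one."""
    size, extra = divmod(len(items), n)
    slices, start = [], 0
    for i in range(n):
        stop = start + size + (1 if i < extra else 0)
        slices.append(items[start:stop])
        start = stop
    return slices


def worker(lines, out_path):
    """Process one slice in order and write it to its own output file."""
    with open(out_path, "w") as out:
        for line in lines:
            out.write(transform(line))


def run(in_path, out_path, cores=None):
    cores = cores or mp.cpu_count()
    with open(in_path) as f:
        lines = f.readlines()
    slices = split_evenly(lines, cores)
    del lines  # the slices hold the data now, so drop the big list
    # Zero-padded names keep lexical order equal to numeric order.
    part_paths = ["out_file{:03d}.txt".format(i) for i in range(cores)]
    procs = [mp.Process(target=worker, args=(s, p))
             for s, p in zip(slices, part_paths)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    # Concatenate the parts in order; slices are contiguous, so order is kept.
    with open(out_path, "w") as out:
        for part in part_paths:
            with open(part) as f:
                out.write(f.read())
```

Call run("f.txt", "big_output.txt") from under an if __name__ == "__main__": guard, so that child processes which import the module do not start the whole job again.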
For an intermediate trade-off between speed and RAM, at the cost of the most complexity, we have to use the Queue class:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
1. Store the number of cores in a variable cores (say 8).
2. Initialize 8 Queues and 8 Processes, and pass one Queue to each Process. At this point each Process should open its own output file (out_file1.txt, out_file2.txt, etc.).
3. Each Process polls (and blocks) for a chunk of 10_000 rows, processes them, and writes them to its output file sequentially.
4. In a loop in the parent process, read 10_000 * 8 lines from your 20m-row input file.
5. Split that into several lists (10K chunks) and push them to the respective Process Queues.
6. When you are done with the 20m rows, exit the loop and pass a special value into each Process Queue that signals the end of the input data.
7. When a Process detects that special end-of-data value in its own Queue, it closes its output file and exits.
8. Have your OS concatenate the output files in order into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) on a UNIX system, or the equivalent command on Windows. Note that each output file then holds every eighth chunk, so the concatenated result is grouped by process rather than being in the original line order.
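A minimal sketch of the Queue-based variant, under the same caveats: transform is a hypothetical stand-in for the real per-line function, and CHUNK_ROWS, END_OF_DATA, worker and run are names invented for this example. The queues are bounded (an assumption added here) so that the parent cannot read far ahead of the workers and fill up RAM.

```python
import multiprocessing as mp
from itertools import islice

CHUNK_ROWS = 10000
END_OF_DATA = None  # sentinel that tells a worker to stop


def transform(line):
    # Hypothetical stand-in for the real per-line function.
    return line.upper()


def worker(queue, out_path):
    """Block on the queue, write each processed chunk, stop at the sentinel."""
    with open(out_path, "w") as out:
        while True:
            chunk = queue.get()
            if chunk is END_OF_DATA:
                break
            out.writelines(transform(line) for line in chunk)


def run(in_path, cores=None):
    cores = cores or mp.cpu_count()
    # Bounded queues: put() blocks once a worker is four chunks behind.
    queues = [mp.Queue(maxsize=4) for _ in range(cores)]
    procs = [mp.Process(target=worker,
                        args=(q, "out_file{:03d}.txt".format(i)))
             for i, q in enumerate(queues)]
    for proc in procs:
        proc.start()
    with open(in_path) as f:
        turn = 0
        while True:
            chunk = list(islice(f, CHUNK_ROWS))
            if not chunk:
                break
            # Deal the chunks out round-robin, one queue per worker.
            queues[turn % cores].put(chunk)
            turn += 1
    for q in queues:
        q.put(END_OF_DATA)
    for proc in procs:
        proc.join()
```

As with any multiprocessing script, run(...) should be called from under an if __name__ == "__main__": guard. If the original line order matters, one option is to tag each chunk with its index and write one file per chunk instead of one per worker.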
Convoluted? Well, it is usually a trade-off between speed, RAM and complexity. Also, for a 20m-row task, one needs to make sure that the data processing is as efficient as possible: inline as many functions as you can, avoid a lot of math, use Pandas / numpy in the child processes if possible, etc.
Iterating with in, one line at a time, is not the only way: you can also read more than one line per call, for example with readlines(), which may make the program read faster.
Look at this snippet.
# Python code to
# demonstrate readlines()
L = ["Geeks\n", "for\n", "Geeks\n"]
# writing to file
file1 = open('myfile.txt', 'w')
file1.writelines(L)
file1.close()
# Using readlines()
file1 = open('myfile.txt', 'r')
Lines = file1.readlines()
count = 0
# Strips the newline character
for line in Lines:
    count += 1
    print("Line{}: {}".format(count, line.strip()))
I got it from: https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/.
I am trying to process a 51GB text file in Python. Here is my attempt at processing this file in parallel:
def parallel_read(pid, filename, num_processes):
    with codecs.open(filename) as infile:
        for num, line in enumerate(infile):
            index = num % num_processes
            if index != pid:  # Only process lines corresponding to your ProcessId. Discard others.
                continue
            process_line(line)

def main(file_to_read):
    num_processes = 8
    arguments = []
    for x in range(num_processes):
        arguments.append((x, file_to_read, num_processes))
    pool = mp.Pool(num_processes)
    results = pool.starmap(parallel_read, arguments)
Questions:
1. Is this really going to speed things up? I mean the file is still being read sequentially by each of the processes, just that the processing is restricted only to specific lines. IMO this might speed things up assuming processing of a line takes significantly more time than reading a line.
2. Are there any limits on the number of processes I can start in parallel, especially with regard to how many file connections I can open in parallel? (In other words, if I have access to an 80 core machine, can I run 160 processes in parallel?)
3. Is there a better way to read large files in parallel in Python?
Is there a way to search text files in Python for a phrase without having to use for loops and if statements such as:
for line in file:
    if myphrase in line:
        do something
This seems like a very inefficient way to go through the file as it does not run in parallel if I understand correctly, but rather iteratively. Is re.search a more efficient system by which to do it?
Reading a sequential file (e.g. a text file) is always going to be a sequential process. Unless you can store it in separate chunks or skip ahead somehow it will be hard to do any parallel processing.
What you could do is separate the inherently sequential reading process from the searching process. This requires that the file content be naturally separated into chunks (e.g. lines) across which the search is not intended to find a result.
The general structure would look like this:
1. Initiate a list of processing threads with input queues.
2. Read the file line by line and accumulate chunks of lines up to a given threshold.
3. When the threshold or the end of file is reached, add the chunk of lines to the next processing thread's input queue.
4. Wait for all processing threads to be done.
5. Merge results from all the search threads.
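The structure above could look roughly like this. It is a sketch for illustration only: search_file and its parameters are names made up here, and the search is a plain substring test. Note that with CPython's GIL these threads will not run a CPU-bound search on several cores at once, so the sketch shows the shape of the design rather than a guaranteed speed-up.

```python
import queue
import threading
from itertools import islice


def search_file(path, phrase, workers=4, chunk_lines=1000):
    """Return the sorted 1-based numbers of the lines that contain phrase."""
    inboxes = [queue.Queue(maxsize=8) for _ in range(workers)]
    hits = [[] for _ in range(workers)]

    def search(inbox, found):
        while True:
            item = inbox.get()
            if item is None:  # sentinel: no more chunks
                break
            first_lineno, lines = item
            for offset, line in enumerate(lines):
                if phrase in line:
                    found.append(first_lineno + offset)

    threads = [threading.Thread(target=search, args=(inboxes[i], hits[i]))
               for i in range(workers)]
    for t in threads:
        t.start()

    # Sequential read: accumulate chunks and hand them out round-robin.
    with open(path) as f:
        lineno, turn = 1, 0
        while True:
            lines = list(islice(f, chunk_lines))
            if not lines:
                break
            inboxes[turn % workers].put((lineno, lines))
            lineno += len(lines)
            turn += 1

    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    # Merge the per-thread results.
    return sorted(n for found in hits for n in found)
```

Each chunk carries the number of its first line, so the merged result can be put back in file order even though the threads finish in any order.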
In this era of solid state drives and fast memory busses, you would need some pretty compelling constraining factors to justify going to that much trouble.
You can figure out your minimum processing time by measuring how long it takes to read (without processing) all the lines in your largest file.
It is unlikely that the search process for each line will add much to that time given that I/O to read the data (even on an SSD) will take much longer than the search operation's CPU time.
Let's say you have the file:
Hello World!
I am a file.
Then:
file = open("file.txt", "r")
x = file.read()
# x is now: "Hello World!\nI am a file."
# Having just one string means that you can search it faster.
# Remember:
file.close()
Edit:
To actually test how long it takes:
import time
start_time = time.time()
# Read File here
end_time = time.time()
print("This method took " + str(end_time - start_time) + " seconds to run!")
Another Edit:
I read some other articles and ran the test, and the fastest checking method, if you're just trying to find out True or False, is:
x = file.read() # "Hello World!\nI am a file."
tofind = "Hello"
tofind_in_x = tofind in x
# True
This method was faster than regex in my tests by quite a bit.
The tool you need is called regular expressions (regex).
You can use it as follows:
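import re
# re.search scans the whole text (re.match only matches at the very start);
# re.escape treats the phrase literally rather than as a pattern.
if re.search(re.escape(myphrase), myfile.read()):
    do_something()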
Divide and Conquer algorithm
-- takes a function and a list as its inputs.
returns function(list)
This bit is simple. It gets cooler in that it uses the multiprocessing module to split the list up, process the pieces separately, and return one single list. (This is the entirety of the .py file; just copy all the code blocks below into a .py file for python3 and you should see the problem live.)
Got my imports
import multiprocessing as multi
import numpy as np
import pickle
import os
A way to log things ( this doesn't seem to want to work in the process)
def log(text):
    text = str(text)
    with open(text, 'w') as file:
        file.write('Nothing')
Function Wrapper
The goal of this function is to take a function and provide it with data by pulling the data from disk. This is mostly because Pipes just end up with an error that I cannot find a solution to.
def __wrap(function):
    filename = multi.current_process().name + '.tmp'
    with open(filename, 'rb') as file:
        item_list = pickle.load(file)
    result = function(item_list)
    with open(filename, 'wb') as file:
        pickle.dump(result, file)
The meat and potatoes
This divides the list into smaller lists for each CPU to gobble down and then starts little processes for each chunk. It saves the input data onto the disk for the __wrap() function to pull up. Finally it pulls up the results that have been written to disk by the __wrap() function, concatenates them into a single list and returns the value.
def divide_and_conquer(f, things):
    cpu_count = multi.cpu_count()
    chunks = np.array_split(things ,cpu_count )
    cpus = []
    for cpu in range(cpu_count):
        filename = '{}.tmp'.format(cpu)
        with open(filename, 'wb') as file:
            pickle.dump(chunks[cpu], file)
        p = multi.Process(name = str(cpu), target = __wrap, args = (f,))
        p.start()
        cpus.append(p)
    for cpu in cpus:
        cpu.join()
    done = []
    for cpu in cpus:
        filename = '{}.tmp'.format(cpu.name)
        with open(filename, 'rb') as file:
            data = pickle.load(file)
        os.remove(filename)
        done.append(data)
    try:
        done = np.concatenate(done)
    except ValueError:
        pass
    return done
Test Sample
to_do = list(range(10))
def func(thins):
    for thin in thins:
        thin
    return [0, 1, 2,3]
divide_and_conquer(func, to_do)
This just does not have the expected output, it just outputs the input for some reason.
Ultimately my goal with this is to speed up long running computations. I often find myself dealing with lists where each item takes a couple seconds to parse. (web scraping etc) I pretty much just want to add this tool to my little 'often used code snippets library' so I can just import and go
"rt.divide_and_conquer(tough_function, really_long_list)"
and see an easy 8 fold improvement.
I'm currently seeing issues with this working on windows (haven't gotten around to testing it on my linux box yet) and my reading around has shown me that apparently Linux and Windows handle multiprocessing differently.
You don't need to reinvent the wheel. If I understand correctly what you are trying to achieve, then the concurrent.futures module is what you need.
ProcessPoolExecutor does the job of launching multiple processes (by default as many workers as the machine has processors) and, through its map() method, applying a function to each element of a list; the chunksize argument of map() controls how the list is split between the processes.
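A minimal sketch of what that could look like. The executor_cls parameter is an addition made for this example, not part of the question's code, and note one difference from the original divide_and_conquer: here func receives a single item, not a sub-list.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def divide_and_conquer(func, items, executor_cls=ProcessPoolExecutor, chunksize=1):
    """Apply func to every item using a pool; results come back in input order."""
    with executor_cls() as executor:
        # chunksize batches the items sent to each worker process;
        # ThreadPoolExecutor accepts the argument but ignores it.
        return list(executor.map(func, items, chunksize=chunksize))
```

With ProcessPoolExecutor, func has to be defined at module level so that it can be pickled, and the call belongs under an if __name__ == "__main__": guard, which matters in particular on Windows. For I/O-bound work such as web scraping, passing executor_cls=ThreadPoolExecutor is often enough and avoids the pickling overhead.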
I'm processing large CSV files (on the order of several GBs with 10M lines) using a Python script.
The files have different row lengths, and cannot be loaded fully into memory for analysis.
Each line is handled separately by a function in my script. It takes about 20 minutes to analyze one file, and it appears disk access speed is not an issue, but rather processing/function calls.
The code looks something like this (very straightforward). The actual code uses a Class structure, but this is similar:
csvReader = csv.reader(open("file","r"))
for row in csvReader:
    handleRow(row, dataStructure)
Given the calculation requires a shared data structure, what would be the best way to run the analysis in parallel in Python utilizing multiple cores?
In general, how do I read multiple lines at once from a .csv in Python to transfer to a thread/process? Looping with for over the rows doesn't sound very efficient.
Thanks!
This might be too late, but just for future users I'll post anyway. Another poster mentioned using multiprocessing. I can vouch for it and can go into more detail. We deal with files in the hundreds of MB to several GB every day using Python, so it's definitely up to the task. Some of the files we deal with aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. However, the methodology is the same no matter what the file type.
You can process pieces of the large files concurrently. Here's pseudo code of how we do it:
import os, multiprocessing as mp

# process file function
def processfile(filename, start=0, stop=0):
    if start == 0 and stop == 0:
        ... process entire file...
    else:
        # binary mode, so that byte offsets from seek()/tell() are exact
        with open(filename, 'rb') as fh:
            fh.seek(start)
            lines = fh.read(stop - start).splitlines()
            ... process these lines ...
    return results

if __name__ == "__main__":
    # get file size and set chunk size
    filesize = os.path.getsize(filename)
    split_size = 100*1024*1024
    # determine if it needs to be split
    if filesize > split_size:
        # create pool, initialize chunk start location (cursor)
        pool = mp.Pool(mp.cpu_count())
        cursor = 0
        results = []
        with open(filename, 'rb') as fh:
            # for every chunk in the file (including a shorter last one)...
            while cursor < filesize:
                # determine where the chunk ends, is it the last one?
                end = min(cursor + split_size, filesize)
                # seek to end of chunk and read next line to ensure you
                # pass entire lines to the processfile function
                fh.seek(end)
                fh.readline()
                # get current file location
                end = fh.tell()
                # add chunk to process pool, save reference to get results
                proc = pool.apply_async(processfile, args=[filename, cursor, end])
                results.append(proc)
                # setup next chunk
                cursor = end
        # close and wait for pool to finish
        pool.close()
        pool.join()
        # iterate through results
        for proc in results:
            processfile_result = proc.get()
    else:
        ...process normally...
Like I said, that's only pseudo code. It should get anyone started who needs to do something similar. I don't have the code in front of me, just doing it from memory.
But we got more than a 2x speed up from this on the first run without fine tuning it. You can fine tune the number of processes in the pool and how large the chunks are to get an even higher speed up depending on your setup. If you have multiple files as we do, create a pool to read several files in parallel. Just be careful not to overload the box with too many processes.
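Note: You need to put it inside an "if main" block. Otherwise, on platforms where child processes import the main module (such as Windows), every child would run the pool creation again and processes would be created endlessly.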
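The boundary calculation is the part of the pseudo code that is easiest to get wrong, so here it is as a small standalone function. This is a sketch, not the poster's actual code: chunk_boundaries is a name chosen for this example, and it assumes that no record contains a line break inside a quoted field.

```python
import os


def chunk_boundaries(filename, split_size):
    """Return (start, end) byte offsets of chunks that end on line breaks."""
    filesize = os.path.getsize(filename)
    boundaries = []
    cursor = 0
    # Binary mode keeps seek()/tell() offsets in plain bytes.
    with open(filename, "rb") as fh:
        while cursor < filesize:
            # Jump roughly one chunk ahead, then finish the line we landed in.
            fh.seek(min(cursor + split_size, filesize))
            fh.readline()
            end = fh.tell()
            boundaries.append((cursor, end))
            cursor = end
    return boundaries
```

Each (start, end) pair can then be handed to a worker, which opens the file itself, seeks to start and reads end - start bytes.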
Try benchmarking reading your file and parsing each CSV row but doing nothing with it. You ruled out disk access, but you still need to see if the CSV parsing is what's slow or if your own code is what's slow.
If it's the CSV parsing that's slow, you might be stuck, because I don't think there's a way to jump into the middle of a CSV file without scanning up to that point.
If it's your own code, then you can have one thread reading the CSV file and dropping rows into a queue, and then have multiple threads processing rows from that queue. But don't bother with this solution if the CSV parsing itself is what's making it slow.
Because of the GIL, Python's threading won't speed-up computations that are processor bound like it can with IO bound.
Instead, take a look at the multiprocessing module which can run your code on multiple processors in parallel.
If the rows are completely independent, just split the input file into as many files as you have CPUs. After that, you can run as many instances of the process as you now have input files. These instances, since they are completely separate processes, will not be bound by GIL problems.
Just found a solution to this old problem. I tried Pool.imap, and it seems to simplify processing large files significantly. imap has one significant benefit when it comes to processing large files: it returns results as soon as they are ready, rather than waiting for all the results to be available. This saves a lot of memory.
(Here is an untested snippet of code which reads a CSV file row by row, processes each row and writes it back to a different CSV file. The row processing is done in parallel.)
import multiprocessing as mp
import csv

CHUNKSIZE = 10000  # Set this to whatever you feel reasonable

def _run_parallel(csvfname, csvoutfname):
    with open(csvfname, newline='') as csvf, \
         open(csvoutfname, 'w', newline='') as csvout, \
         mp.Pool() as p:
        reader = csv.reader(csvf)
        writer = csv.writer(csvout)
        # process() is your per-row function; define it at module level
        writer.writerows(p.imap(process, reader, chunksize=CHUNKSIZE))
If you use zmq and a DEALER middle man, you'd be able to spread the row processing not just to the CPUs on your computer but across a network to as many processes as necessary. This would essentially guarantee that you hit an IO limit rather than a CPU limit :)