Performance: fastest way of reading in files with Python

So I have about 400 files ranging from 10 KB to 56 MB in size, with file types .txt/.doc(x)/.pdf/.xml, and I have to read them all. The way I read the files in is basically:
#for txt files
with open("TXT\\" + path, 'r') as content_file:
    content = content_file.read().split(' ')

#for doc files using python-docx
contents = '\n'.join([para.text for para in doc.paragraphs]).encode("ascii", "ignore").decode("utf-8").split(' ')

#for pdf files using PyPDF2
content = ""
for i in range(0, pdf.getNumPages()):
    content += pdf.getPage(i).extractText() + "\n"
content = " ".join(content.replace(u"\xa0", " ").strip().split())
contents = content.encode("ascii", "ignore").decode("utf-8").split(' ')

#for xml files using lxml
tree = etree.parse(path)
contents = etree.tostring(tree, encoding='utf8', method='text')
contents = contents.decode("utf-8").split(' ')
But I notice that even reading 30 text files of under 50 KB each and doing operations on them takes 41 seconds, while reading a single 56 MB text file takes me only 9 seconds. So I'm guessing that it's the file I/O that's slowing me down rather than my program.
Any idea how to speed up this process? Maybe break each file type out into 4 different threads? But how would you go about doing that, since they share the same list and that single list will be written to a file when they are done?

If you're blocked on file I/O, as you suspect, there's probably not much you can do.
But parallelizing to different threads might help if you have great bandwidth but terrible latency. Especially if you're dealing with, say, a networked filesystem or a multi-platter logical drive. So, it can't hurt to try.
But there's no reason to do it per file type; just use a single pool to handle all your files. For example, using the futures module:*
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(process_file, list_of_filenames)
A ThreadPoolExecutor is slightly smarter than a basic thread pool, because it lets you build composable futures, but here you don't need any of that, so I'm just using it as a basic thread pool because Python doesn't have one of those.**
The constructor creates 4 threads, and all the queues and anything else needed to manage putting tasks on those threads and getting results back.
Then, the map method just goes through each filename in list_of_filenames, creates a task out of calling process_file on that filename, submits it to the pool, and then waits for all of the tasks to finish.
In other words, this is the same as writing:
results = [process_file(filename) for filename in list_of_filenames]
… except that it uses four threads to process the files in parallel.
There are some nice examples in the docs if this isn't clear enough.
* If you're using Python 2.x, you'll need to install a backport before you can use this. Or you can use multiprocessing.dummy.Pool instead, as noted below.
** Actually, it does, in multiprocessing.dummy.Pool, but that's not very clearly documented.
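For example, process_file could be a small dispatcher that picks the right reader based on the file extension. This is only a sketch built around the snippets in the question; the read_* helpers are hypothetical and would wrap your existing txt/doc/pdf/xml code:
import os

def process_file(path):
    # dispatch on extension; each read_* helper is assumed to contain the
    # corresponding snippet from the question and return a list of words
    ext = os.path.splitext(path)[1].lower()
    if ext == '.txt':
        return read_txt(path)
    elif ext in ('.doc', '.docx'):
        return read_doc(path)
    elif ext == '.pdf':
        return read_pdf(path)
    elif ext == '.xml':
        return read_xml(path)
    return []
Because executor.map yields results in the order the filenames were submitted, you can still concatenate everything into the single list you planned to write to a file at the end.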

Related

Searching a Python text file without for loops and if statements

Is there a way to search text files in Python for a phrase without having to use for loops and if statements, such as:
for line in file:
    if line in myphrase:
        do something
This seems like a very inefficient way to go through the file as it does not run in parallel if I understand correctly, but rather iteratively. Is re.search a more efficient system by which to do it?
Reading a sequential file (e.g. a text file) is always going to be a sequential process. Unless you can store it in separate chunks or skip ahead somehow it will be hard to do any parallel processing.
What you could do is separate the inherently sequential reading process from the searching process. This requires that the file content be naturally separated into chunks (e.g. lines) across which the search is not intended to find a result.
The general structure would look like this:
initiate a list of processing threads with input queues
read the file line by line and accumulate chunks of lines up to a given threshold
when the threshold or the end of file is reached, add the chunk of lines to the next processing thread's input queue
wait for all processing threads to be done
merge results from all the search threads.
In this era of solid state drives and fast memory busses, you would need some pretty compelling constraining factors to justify going to that much trouble.
You can figure out your minimum processing time by measuring how long it takes to read (without processing) all the lines in your largest file.
It is unlikely that the search process for each line will add much to that time given that I/O to read the data (even on an SSD) will take much longer than the search operation's CPU time.
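A minimal sketch of that structure, using a shared queue and worker threads (the chunk size, worker count and per-line check are assumptions; with pure-Python matching the GIL limits how much actual parallelism you get, so this mainly illustrates the shape of the solution):
import queue
import threading

NUM_WORKERS = 4          # assumed worker count
CHUNK_LINES = 10000      # assumed chunk threshold
PHRASE = "myphrase"      # assumed search phrase

def worker(in_q, hits):
    # pull chunks of lines until the sentinel None arrives
    while True:
        chunk = in_q.get()
        if chunk is None:
            break
        hits.extend(line for line in chunk if PHRASE in line)

def parallel_search(path):
    in_q = queue.Queue(maxsize=NUM_WORKERS * 2)
    hits = []
    threads = [threading.Thread(target=worker, args=(in_q, hits))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    chunk = []
    with open(path) as fh:           # the read itself stays sequential
        for line in fh:
            chunk.append(line)
            if len(chunk) >= CHUNK_LINES:
                in_q.put(chunk)
                chunk = []
    if chunk:
        in_q.put(chunk)
    for _ in threads:                # one sentinel per worker
        in_q.put(None)
    for t in threads:
        t.join()
    return hits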
Let's say you have the file:
Hello World!
I am a file.
Then:
file = open("file.txt", "r")
x = file.read()
# x is now: "Hello World!\nI am a file."
# Just one string means that you can search it faster.
# Remember to close the file:
file.close()
Edit:
To actually test how long it takes:
import time
start_time = time.time()
# Read File here
end_time = time.time()
print("This meathod took " + str( end_time - start_time ) + " seconds to run!")
Another Edit:
I read some other articles and did the test, and the fastest checking method if you're just trying to get True or False is:
x = file.read() # "Hello World!\nI am a file."
tofind = "Hello"
tofind_in_x = tofind in x
# True
This method was faster than regex in my tests by quite a bit.
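If you want to reproduce that comparison yourself, here is a quick sketch with timeit (the sample text and iteration counts are arbitrary):
import re
import timeit

text = "Hello World!\nI am a file.\n" * 10000
pattern = re.compile("Hello")

# plain substring membership test
t_in = timeit.timeit(lambda: "Hello" in text, number=10000)
# equivalent pre-compiled regex search over the same text
t_re = timeit.timeit(lambda: pattern.search(text) is not None, number=10000)

print("in operator: %.4fs, re.search: %.4fs" % (t_in, t_re))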
The tool you need is called regular expressions (regex).
You can use it as follows:
import re
if re.search(myphrase, myfile.read()):
    do_something()

Function through multiprocessing returns the input in Python

Divide and Conquer algorithm
-- takes a function and a list as its inputs.
returns function(list)
This bit is simple; it gets cooler in that it uses the multiprocessing module to split the list up, process the pieces in different processes, and return one single list. (This is the entirety of the .py file below; just copy all the code blocks into a .py file for Python 3 and you should see the problem live.)
Got my imports
import multiprocessing as multi
import numpy as np
import pickle
import os
A way to log things (this doesn't seem to want to work in the process)
def log(text):
    text = str(text)
    with open(text, 'w') as file:
        file.write('Nothing')
Function Wrapper
The goal of this function is to take a function and deal with providing it data by pulling that data from disk, mostly because Pipes just end up with an error that I cannot find a solution to.
def __wrap(function):
    filename = multi.current_process().name + '.tmp'
    with open(filename, 'rb') as file:
        item_list = pickle.load(file)
    result = function(item_list)
    with open(filename, 'wb') as file:
        pickle.dump(result, file)
The meat and potatoes
This divides the list into smaller lists for each CPU to gobble down and then starts little processes for each chunk. It saves the input data onto the disk for the __wrap() function to pull up. Finally, it pulls up the results that have been written to disk by the __wrap() function, concatenates them into a single list and returns the value.
def divide_and_conquer(f, things):
    cpu_count = multi.cpu_count()
    chunks = np.array_split(things, cpu_count)
    cpus = []
    for cpu in range(cpu_count):
        filename = '{}.tmp'.format(cpu)
        with open(filename, 'wb') as file:
            pickle.dump(chunks[cpu], file)
        p = multi.Process(name=str(cpu), target=__wrap, args=(f,))
        p.start()
        cpus.append(p)
    for cpu in cpus:
        cpu.join()
    done = []
    for cpu in cpus:
        filename = '{}.tmp'.format(cpu.name)
        with open(filename, 'rb') as file:
            data = pickle.load(file)
        os.remove(filename)
        done.append(data)
    try:
        done = np.concatenate(done)
    except ValueError:
        pass
    return done
Test Sample
to_do = list(range(10))

def func(thins):
    for thin in thins:
        thin
    return [0, 1, 2, 3]

divide_and_conquer(func, to_do)
This just does not have the expected output; it just returns the input for some reason.
Ultimately my goal with this is to speed up long-running computations. I often find myself dealing with lists where each item takes a couple of seconds to parse (web scraping, etc.). I pretty much just want to add this tool to my little 'often used code snippets' library so I can just import it and go
"rt.divide_and_conquer(tough_function, really_long_list)"
and see an easy 8-fold improvement.
I'm currently seeing issues with this working on Windows (haven't gotten around to testing it on my Linux box yet), and my reading around has shown me that apparently Linux and Windows handle multiprocessing differently.
You don't need to reinvent the wheel. If I understand what you are trying to achieve correctly, then the concurrent.futures module is what you need.
ProcessPoolExecutor does the job of splitting up a list, launching multiple processes (by default, one per available processor) and applying a function to each element of those lists.
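A minimal sketch of what that replacement could look like (note that executor.map applies the function to each item rather than to whole sub-lists, so the per-item function here differs slightly from the chunk-taking func in the question; chunksize only controls how items are batched between processes):
from concurrent.futures import ProcessPoolExecutor

def tough_function(item):
    # stand-in for per-item work such as parsing or scraping
    return item * 2

def divide_and_conquer(f, things):
    # the executor handles splitting, process startup and result collection
    with ProcessPoolExecutor() as executor:
        return list(executor.map(f, things, chunksize=100))

if __name__ == '__main__':
    print(divide_and_conquer(tough_function, list(range(10))))
On Windows the if __name__ == '__main__' guard is required, which also covers the platform difference mentioned in the question.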

Multithreading in Python for reading files

I am new to Python, and have never tried multithreading.
My objective is to read a set of files and get some specific data from each file.
I have already created code which does my task perfectly, but it is taking a lot of time, as a few of the files are very large.
final_output = []
for file in os.listdir(file_path):
    final_string = error_collector(file_path, file)
    final_output = final_output + final_string
The error_collector function reads each line of a file, fetches the useful information and returns a list for each file, which I am concatenating onto the final list so that I can get all the information in a single list.
What I want to achieve is some way to process the files in parallel instead of reading one file at a time.
Can someone please help?
Using mmap can improve the speed of reading files.
If the data that is to be read is relatively small compared to the total size of the file, doing this in combination with Pool.map is a good strategy.
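A rough sketch of what that could look like for the error_collector in the question (the b'ERROR' marker is an assumption; substitute whatever you actually search for):
import mmap
import os

def error_collector(file_path, fname):
    # memory-map the file so the OS only pages in the parts we touch,
    # instead of reading the whole file up front
    found = []
    with open(os.path.join(file_path, fname), 'rb') as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(b'ERROR')
            while pos != -1:
                end = mm.find(b'\n', pos)
                if end == -1:
                    end = len(mm)
                found.append(mm[pos:end].decode('utf-8', 'ignore'))
                pos = mm.find(b'ERROR', end)
    return found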
What you want to do is called multiprocessing in Python. Multithreading only uses one CPU core.
One way to do it is:
import os
from multiprocessing import Pool

fl = os.listdir(file_path)

def fun(i):
    # each worker returns its own list; Pool.map collects them in order
    return error_collector(file_path, fl[i])

if __name__ == '__main__':
    with Pool(4) as p:
        results = p.map(fun, range(len(fl)))
    # flatten the per-file lists into one final list
    final_output = [item for lst in results for item in lst]
EDIT: If the bottleneck really is disk I/O, you can store your files in a better format (e.g. use the pickle module and store them as binary).
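A tiny sketch of that idea, assuming you parse each text file once and reuse the parsed result on later runs:
import pickle

# hypothetical: the parsed result of one file (e.g. from error_collector)
parsed = ['error line 1', 'error line 2']

# store the parsed result as binary so later runs skip the text parsing
with open('parsed.pkl', 'wb') as fh:
    pickle.dump(parsed, fh, protocol=pickle.HIGHEST_PROTOCOL)

# on later runs, load it back instead of re-reading and re-parsing the text
with open('parsed.pkl', 'rb') as fh:
    parsed = pickle.load(fh)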

multiprocess to split one file - Is it always IO bound?

I was reading a similar thread where the OP wanted to process each line in a function using multiprocessing (found here). An intriguing answer to that question was the following:
from multiprocessing import Pool
def process_line(line):
    return "FOO: %s" % line

if __name__ == "__main__":
    pool = Pool(4)
    with open('file.txt') as source_file:
        # chunk the work into batches of 4 lines at a time
        results = pool.map(process_line, source_file, 4)
I'm wondering if you can do the same, but instead of returning each line processed, write it into another file.
Basically I want to see if there is a way to MP reading and writing a file in order to split it up by lines. Say I want 100,000 lines per file.
from multiprocessing import Pool
def write_lines(line):
    # need a method to write lines to multiple files, perhaps a Queue?
    pass

if __name__ == "__main__":
    # all my procs
    pool = Pool()
    with open('file.txt') as source_file:
        # chunk the work into batches of 100,000 lines at a time
        results = pool.map(write_lines, source_file, 100000)
I could use a MP Queue to split up the file into separate Queue objects, then fill each processor with a job of writing out all the lines, but I still have to read through the file first. So will it always be completely IO bound and never be able to be MP in an efficient way?
As you suspected, this workload really won't benefit much (if at all) from multiprocessing. All you're doing here is reading one file, then writing the contents of that file to other files. This is completely I/O bound; the bottleneck is going to be the speed of reading from and writing to disk. Using multiprocessing to try to write multiple files to the same disk concurrently isn't going to make the writes any faster, because the disk can only write one thing at a time.
Where multiprocessing can help is if you've got some CPU-bound work that can be parallelized, but that really isn't the case with what you're trying to do. If you wanted to read lines from a file, do some fairly heavy processing of each line, and then write them to some other file, multiprocessing would help, but it doesn't sound like you need to do any processing prior to writing each line.
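Since the split itself is I/O bound, a plain sequential version is about as fast as it gets. A minimal sketch (the 100,000-line size and file.txt follow the question; the part_ output naming is an assumption):
from itertools import islice

def split_file(path, lines_per_file=100000):
    # read once, write each batch of lines to its own output file
    with open(path) as src:
        part = 0
        while True:
            batch = list(islice(src, lines_per_file))
            if not batch:
                break
            with open('part_%04d.txt' % part, 'w') as out:
                out.writelines(batch)
            part += 1

split_file('file.txt')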

Parallel processing of a large .csv file in Python

I'm processing large CSV files (on the order of several GBs with 10M lines) using a Python script.
The files have different row lengths, and cannot be loaded fully into memory for analysis.
Each line is handled separately by a function in my script. It takes about 20 minutes to analyze one file, and it appears disk access speed is not an issue, but rather processing/function calls.
The code looks something like this (very straightforward). The actual code uses a Class structure, but this is similar:
csvReader = csv.reader(open("file", "r"))
for row in csvReader:
    handleRow(row, dataStructure)
Given the calculation requires a shared data structure, what would be the best way to run the analysis in parallel in Python utilizing multiple cores?
In general, how do I read multiple lines at once from a .csv in Python to transfer to a thread/process? Looping with for over the rows doesn't sound very efficient.
Thanks!
This might be too late, but just for future users I'll post anyway. Another poster mentioned using multiprocessing. I can vouch for it and can go into more detail. We deal with files in the hundreds of MB / several GB every day using Python, so it's definitely up to the task. Some of the files we deal with aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. However, the methodology is the same no matter what the file type.
You can process pieces of the large files concurrently. Here's pseudo code of how we do it:
import os, multiprocessing as mp

# process file function
def processfile(filename, start=0, stop=0):
    if start == 0 and stop == 0:
        ... process entire file...
    else:
        with open(filename, 'r') as fh:
            fh.seek(start)
            lines = fh.readlines(stop - start)
            ... process these lines ...
    return results

if __name__ == "__main__":
    # get file size and set chunk size
    filesize = os.path.getsize(filename)
    split_size = 100*1024*1024

    # determine if it needs to be split
    if filesize > split_size:
        # create pool, initialize chunk start location (cursor)
        pool = mp.Pool(mp.cpu_count())
        cursor = 0
        results = []
        with open(filename, 'r') as fh:
            # for every chunk in the file...
            for chunk in range(filesize // split_size):
                # determine where the chunk ends, is it the last one?
                if cursor + split_size > filesize:
                    end = filesize
                else:
                    end = cursor + split_size
                # seek to end of chunk and read next line to ensure you
                # pass entire lines to the processfile function
                fh.seek(end)
                fh.readline()
                # get current file location
                end = fh.tell()
                # add chunk to process pool, save reference to get results
                proc = pool.apply_async(processfile, args=[filename, cursor, end])
                results.append(proc)
                # setup next chunk
                cursor = end
        # close and wait for pool to finish
        pool.close()
        pool.join()
        # iterate through results
        for proc in results:
            processfile_result = proc.get()
    else:
        ...process normally...
Like I said, that's only pseudo code. It should get anyone started who needs to do something similar. I don't have the code in front of me, just doing it from memory.
But we got more than a 2x speed-up from this on the first run without fine-tuning it. You can fine-tune the number of processes in the pool and how large the chunks are to get an even higher speed-up depending on your setup. If you have multiple files as we do, create a pool to read several files in parallel. Just be careful not to overload the box with too many processes.
Note: You need to put it inside an "if main" block to ensure an infinite number of processes aren't created.
Try benchmarking reading your file and parsing each CSV row but doing nothing with it. You ruled out disk access, but you still need to see if the CSV parsing is what's slow or if your own code is what's slow.
If it's the CSV parsing that's slow, you might be stuck, because I don't think there's a way to jump into the middle of a CSV file without scanning up to that point.
If it's your own code, then you can have one thread reading the CSV file and dropping rows into a queue, and then have multiple threads processing rows from that queue. But don't bother with this solution if the CSV parsing itself is what's making it slow.
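A rough sketch of that reader-plus-workers layout (worker count and queue size are arbitrary; handleRow and dataStructure stand in for the question's per-row function and shared structure, and as the next answer points out, CPU-heavy pure-Python work would want processes rather than threads):
import csv
import queue
import threading

NUM_WORKERS = 4
dataStructure = {}                       # stand-in for the shared structure

def handleRow(row, data):
    # placeholder for the question's per-row processing
    return len(row)

def worker(row_q, out, lock):
    while True:
        row = row_q.get()
        if row is None:                  # sentinel: no more rows
            break
        result = handleRow(row, dataStructure)
        with lock:
            out.append(result)

def run(path):
    row_q = queue.Queue(maxsize=10000)
    out, lock = [], threading.Lock()
    workers = [threading.Thread(target=worker, args=(row_q, out, lock))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    with open(path, newline='') as fh:   # one thread reads and parses the CSV
        for row in csv.reader(fh):
            row_q.put(row)
    for _ in workers:
        row_q.put(None)
    for w in workers:
        w.join()
    return out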
Because of the GIL, Python's threading won't speed-up computations that are processor bound like it can with IO bound.
Instead, take a look at the multiprocessing module which can run your code on multiple processors in parallel.
If the rows are completely independent, just split the input file into as many files as you have CPUs. After that, you can run as many instances of the process as you now have input files. These instances, since they are completely separate processes, will not be bound by GIL problems.
Just found a solution to this old problem. I tried Pool.imap, and it seems to simplify processing large files significantly. imap has one significant benefit when it comes to processing large files: it returns results as soon as they are ready, rather than waiting for all the results to be available. This saves a lot of memory.
(Here is an untested snippet of code which reads a csv file row by row, processes each row and writes it back to a different csv file. Everything is done in parallel.)
import multiprocessing as mp
import csv
CHUNKSIZE = 10000  # Set this to whatever you feel reasonable

def _run_parallel(csvfname, csvoutfname):
    # process is the (user-supplied) function applied to each row
    with open(csvfname, newline='') as csvf, \
            open(csvoutfname, 'w', newline='') as csvout, \
            mp.Pool() as p:
        reader = csv.reader(csvf)
        writer = csv.writer(csvout)
        writer.writerows(p.imap(process, reader, chunksize=CHUNKSIZE))
If you use zmq and a DEALER middleman, you'd be able to spread the row processing not just to the CPUs on your computer but across a network to as many processes as necessary. This would essentially guarantee that you hit an I/O limit rather than a CPU limit :)
