Inconsistent loop execution times with large memory processing - python

I have numerous large CSV files (~400 MB each; I need to process thousands of them, at least a hundred per program execution) containing long strings in the first cell of each row (about 100-300 characters per row, for roughly 1 million rows per file). My Python program checks whether a substring is in a given string; if so, I append the row containing the string to a list, to be stored in another series of CSV files after all the input files have been processed. For the first dozen input files, the program runs at about 20 seconds per file, which I am satisfied with.
The relevant portion of the code (the string-processing loop) looks like this:
import csv
import glob

check = set(['a','b','c'])
storage = []
data = glob.glob('data_address/*.csv')
for raw_file in data:
    read_file = open(raw_file, 'r', newline='', encoding='utf-8')
    list_file = list(csv.reader((line.replace('\0','') for line in read_file), delimiter=","))
    row_count = sum(1 for row in list_file)
    for i in range(1, row_count):
        text = set(list_file[i][0].split())
        if len(check.intersection(text)) > 0:
            storage.append(list_file[i])
The problem is that as the number of processed input files grows, I begin to have certain files that take much longer than 20 seconds. Furthermore, these anomalies take longer and longer to process - the first anomaly takes about 50 seconds to process, and towards the end of the loop, anomalies can take thousands of seconds to process, suggesting that the problem is with the loop itself rather than any individual file. These anomalies are not obviously different from the other files in terms of number of string matches.
What I don't understand is that the increase in processing time is not consistent. I still have plenty of 20-second files in between each anomaly, so it cannot be that the program is simply slowing down as memory storage increases. Does anyone have any idea what's going on? cProfile fails to show any component that might be causing the issue.
I use 64-bit Python 3.8 on Windows 10, with a 1 TB hard drive and about 10,000 MB of active memory.

We don't yet have enough information to properly diagnose the issue, so in the meantime I thought I would try to improve the code you shared:
import csv
import pathlib

check = {'a', 'b', 'c'}
data_path = pathlib.Path("data_address")
saved_rows = []

for curr_path in data_path.glob("*.csv"):
    with open(curr_path, newline='', encoding='utf-8') as curr_file:
        reader = csv.reader((line.replace('\0', '') for line in curr_file), delimiter=",")
        for row in reader:
            row_text = row[0].split()
            if any(elem in check for elem in row_text):
                saved_rows.append(row)
Although I'm not able to test it, it should work just fine. The main differences are that each file is streamed row by row instead of being materialized into a list (so there is no separate row-counting pass), and any() stops checking a row as soon as the first match is found.

Related

How can I efficiently open 30gb of file and process pieces of it without slowing down?

I have some large files (more than 30 GB each) containing pieces of information on which I need to do some calculations, like averaging. The pieces I mention are slices of the file, and I know the beginning line number and the count of following lines for each slice.
So I have a dictionary with beginning line numbers as keys and counts of following rows as values, and I use this dictionary to loop through the file and get slices of it. For each slice, I create a table, do some conversions and averaging, create a new table and convert it into a dictionary. I use islice for slicing and a pandas DataFrame to create tables from each slice.
However, over time the process gets slower and slower, even though the sizes of the slices are more or less the same:
First 1k slices - processed in 1h
Second 1k slices - processed in 4h
Third 1k slices - processed in 8h
Fourth 1k slices - processed in 17h
And I am waiting for days to complete the processes.
Right now I am doing this on a Windows 10 machine with a 1 TB SSD and 32 GB of RAM. Previously I also tried it on a Linux machine (Ubuntu 18.04) with a 250 GB SSD and 8 GB of RAM + 8 GB of virtual RAM. Both produced more or less the same results.
What I noticed on Windows is that 17% of the CPU and 11% of the memory are being used, but disk usage is at 100%. I do not fully understand what disk usage means here or how I can improve it.
As part of the code I was also importing data into MongoDB while working on Linux, and I thought maybe the slowdown was due to indexing in MongoDB. But when I printed the processing time and the import time, I noticed that almost all of the time is spent on processing; the import takes a few seconds.
Also, to save time, I am now doing the processing part on a stronger Windows machine and writing the documents as txt files. I expect that writing to disk slows down the process a bit, but the txt file sizes are not more than 600 KB.
Below is the piece of code, how I read the file:
with open(infile) as inp:
    for i in range(0, len(seg_ids)):
        inp.seek(0)
        segment_slice = islice(inp, list(seg_ids.keys())[i],
                               (list(seg_ids.keys())[i] + list(seg_ids.values())[i] + 1))
        segment = list(segment_slice)
        for _, line in enumerate(segment[1:]):
            # create dataframe and perform calculations
So I want to know whether there is a way to improve the processing time. I suspect my code reads the whole file from the beginning for every slice, so as the slices move toward the end of the file, the reading time gets longer and longer.
As a note, because of time constraints I started with the most important slices, so the remaining slices will be scattered more randomly across the files. Any solution should therefore work for random slices, if there is one (I hope).
I am not experienced in scripting so please forgive me if I am asking a silly question, but I really could not find any answer.
A couple of things come to mind.
First, if you bring the data into a pandas DataFrame there is a 'chunksize' argument for importing large data. It allows you to process / dump what you need / don't, while providing information such as df.describe(), which will give you summary stats.
Also, I hear great things about dask. It is a scalable platform for parallel, multi-core, multi-machine processing, and it is almost as simple to use as pandas and numpy, with very little management of resources required.
Use pandas or dask and pay attention to the options for read_csv(), mainly: chunksize, nrows, skiprows, usecols, engine (use C), low_memory, memory_map.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
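For instance, a minimal sketch of chunked reading with pandas (the file name, column name, and chunk size here are made-up placeholders, since I don't know the layout of your data):
import pandas as pd

total, count = 0.0, 0
# read_csv with chunksize returns an iterator of DataFrames,
# so only one chunk is ever held in memory at a time
for chunk in pd.read_csv("huge_file.csv", usecols=["value"], chunksize=1_000_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean:", total / count)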
The problem here is you are rereading a HUGE unindexed file multiple times line by line from the start of the file. No wonder this takes HOURS.
Each islice in your code is starting from the very beginning of the file -- every time -- and reads and discards all the lines from the file before it even reaches the start of the data of interest. That is very slow and inefficient.
The solution is to create a poor man's index for that file and then read smaller chunks for each slice.
Let's create a test file:
from pathlib import Path

p = Path('/tmp/file')
with open(p, 'w') as f:
    for i in range(1024*1024*500):
        f.write(f'{i+1}\n')

print(f'{p} is {p.stat().st_size/(1024**3):.2f} GB')
That creates a file of roughly 4.78 GB. Not as large as 30 GB but big enough to be SLOW if you are not thoughtful.
Now try reading that entire file line-by-line with the Unix utility wc to count the total lines (wc being the fastest way to count lines, in general):
$ time wc -l /tmp/file
524288000 file
real 0m3.088s
user 0m2.452s
sys 0m0.636s
Compare that with the speed for Python3 to read the file line by line and print the total:
$ time python3 -c 'with open("file","r") as f: print(sum(1 for l in f))'
524288000
real 0m53.880s
user 0m52.849s
sys 0m0.940s
Python is almost 18x slower to read the file line by line than wc.
Now do a further comparison. Check out the speed of the Unix utility tail to print the last n lines of a file:
$ time tail -n 3 file
524287998
524287999
524288000
real 0m0.007s
user 0m0.003s
sys 0m0.004s
The tail utility is 445x faster than wc (and roughly 8,000x faster than Python) at getting to the last three lines of the file because it uses a windowed read of the end of the file: i.e., tail reads some number of bytes at the end of the file and then extracts the last n lines from that buffer.
It is possible to use the same tail approach in your application.
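For example, a small sketch of the same trick in Python (the window size here is an arbitrary assumption):
import os

def tail_lines(path, n=3, window=64 * 1024):
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - window))       # jump to a window near the end
        lines = f.read().splitlines()       # only `window` bytes are read
    return [line.decode() for line in lines[-n:]]

print(tail_lines('/tmp/file'))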
Consider this photo (a rack of 1950s IBM magnetic tapes; the image is not reproduced here):
The approach you are using is the equivalent of reading every tape on that rack to find data that lives on only two of the middle tapes -- and doing it over and over and over...
In the 1950's (the era of the photo) each tape was roughly indexed for what it held. Computers would call for a specific tape in the rack -- not ALL the tapes in the rack.
The solution to your issue (in outline) is to build a tape-like indexing scheme:
Run through the 30 GB file ONCE and create an index of sub-blocks by starting line number for the block. Think of each sub-block as roughly one tape (except it runs easily all the way to the end...)
Instead of using inp.seek(0) before every read, you would seek to the block that contains the line number of interest (just like tail does) and then use islice with the offset adjusted to where that block's starting line number sits relative to the start of the file.
You have a MASSIVE advantage compared with what they had to do in the 50's and 60's: You only need to calculate the starting block since you have access to the entire remaining file. 1950's tape index would call for tapes x,y,z,... to read data larger than one tape could hold. You only need to find x which contains the starting line number of interest.
BTW, since each IBM tape of this type held roughly 3 MB, your 30 GB file would fill more than 10,000 of these tapes...
Correctly implemented (and this is not terribly hard to do) it would speed up the read performance by 100x or more.
Constructing a useful index to a text file by line offset might be as easy as something like this:
def index_file(p, delimiter=b'\n', block_size=1024*1024):
    # one entry per block: line number -> byte offset where that line starts
    index = {0: 0}
    total_lines, cnt = (0, 0)
    with open(p, 'rb') as f:
        while buf := f.raw.read(block_size):
            cnt = buf.count(delimiter)
            idx = buf.rfind(delimiter)
            key = cnt + total_lines
            index[key] = f.tell() - (len(buf) - idx) + len(delimiter)
            total_lines += cnt
    return index

# this index is created in about 4.9 seconds on my computer...
# with a 4.8 GB file, there are ~4,800 index entries
That constructs an index correlating starting line number (in that block) with byte offset from the beginning of the file:
>>> idx=index_file(p)
>>> idx
{0: 0, 165668: 1048571, 315465: 2097150, 465261: 3145722,
...
524179347: 5130682368, 524284204: 5131730938, 524288000: 5131768898}
Then if you want access to lines[524179347:524179500] you don't need to read 4.5 GB to get there; you can just do f.seek(5130682368) and start reading instantly.
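As a sketch of how that index might be used (assuming idx is the dict returned by index_file() above), you can bisect to the nearest indexed line at or before the slice start, seek there, and islice only the short remainder:
from bisect import bisect_right
from itertools import islice

def read_lines(path, idx, start, stop):
    keys = sorted(idx)                                # indexed line numbers
    block_line = keys[bisect_right(keys, start) - 1]  # nearest indexed line <= start
    with open(path, 'rb') as f:
        f.seek(idx[block_line])                       # jump straight to that line
        return list(islice(f, start - block_line, stop - block_line))

rows = read_lines('/tmp/file', idx, 524179347, 524179500)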

Summarizing huge amounts of data

I have a problem that I have not been able to solve. I have 4 .txt files, each between 30 and 70 GB. Each file contains n-gram entries as follows:
blabla1/blabla2/blabla3
word1/word2/word3
...
What I'm trying to do is count how many times each item appears, and save this data to a new file, e.g.:
blabla1/blabla2/blabla3 : 1
word1/word2/word3 : 3
...
My attempt so far has been simply to save all entries in a dictionary and count them, i.e.:
from collections import defaultdict

entry_count_dict = defaultdict(int)
with open(file) as f:
    for line in f:
        entry_count_dict[line] += 1
However, using this method I run into memory errors (I have 8 GB of RAM available). The data follows a Zipfian distribution, e.g. the majority of the items occur only once or twice.
The total number of entries is unclear, but a (very) rough estimate is that there are somewhere around 15,000,000 entries in total.
In addition to this, I've tried h5py, where all the entries are saved as an h5py dataset containing the array [1], which is then updated, e.g.:
import h5py
import numpy as np

entry_count_file = h5py.File(filename)
with open(file) as f:
    for line in f:
        if line in entry_count_file:
            entry_count_file[line][0] += 1
        else:
            entry_count_file.create_dataset(line,
                                            data=np.array([1]),
                                            compression="lzf")
However, this method is way too slow, and the writing speed gets slower and slower. As such, unless the writing speed can be increased, this approach is not feasible. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not make any significant difference to processing speed.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)).
However, to do this the files would have to be iterated over once for every letter, which is impractical given the file sizes (max = 69 GB).
Perhaps when iterating over the file, one could open a pickle and save the entry in a dict, and then close the pickle. But doing this for each item slows down the process quite a lot due to the time it takes to open, load and close the pickle file.
One way of solving this would be to sort all the entries during one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painstakingly slow using the Linux command:
sort file.txt > sorted_file.txt
And I don't really know how to solve this using Python, given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, but they all seem to require that the whole object being sorted be loaded into memory.
Any tips on how to approach this would be much appreciated.
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
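For a sense of what that looks like in Python, here is a rough sketch of an external sort-then-count pass that only ever keeps one chunk in memory (the chunk size and file names are arbitrary assumptions):
import heapq
import os
import tempfile
from itertools import groupby

def external_count(infile, outfile, chunk_lines=5_000_000):
    # phase 1: sort fixed-size chunks and spill them to temporary "run" files
    run_paths = []
    with open(infile) as f:
        while True:
            chunk = [line for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()
            with tempfile.NamedTemporaryFile('w', delete=False, suffix='.run') as tmp:
                tmp.writelines(chunk)
                run_paths.append(tmp.name)
    # phase 2: lazily merge the sorted runs and count identical adjacent lines
    runs = [open(p) for p in run_paths]
    with open(outfile, 'w') as out:
        for entry, group in groupby(heapq.merge(*runs)):
            out.write('%s : %d\n' % (entry.rstrip('\n'), sum(1 for _ in group)))
    for r in runs:
        r.close()
    for p in run_paths:
        os.remove(p)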
What you did there with "saving entries which start with certain letters in separate files" is actually called bucket sort, which should, in theory, be faster. Try it with sliced data sets.
Or, try Dask, a DARPA- and Anaconda-backed distributed computing library with interfaces familiar from numpy and pandas, which works like Apache Spark (it runs on a single machine too). And, by the way, it scales.
I suggest trying dask.array, which cuts a large array into many small ones and implements the numpy ndarray interface with blocked algorithms, so it can use all of your cores when computing on larger-than-memory data.
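The answer above points at dask.array, but since the entries here are lines of text rather than a numeric array, dask.bag is probably the closer fit; a hedged sketch (the file name patterns are placeholders):
import dask.bag as db

lines = db.read_text('ngrams-*.txt')            # lazily reads all matching files
counts = lines.map(str.strip).frequencies()     # (entry, count) pairs, computed in parallel
counts.map(lambda kv: '%s : %d' % kv).to_textfiles('counts-*.txt')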
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)). However, to do this the files would have to be iterated over once for every letter, which is impractical given the file sizes (max = 69 GB).
You are almost there with this line of thinking. What you want to do is to split the file based on a prefix - you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:
mkdir output
awk '/./ {print $0 > ("output/" substr($0, 1, 1))}' input/*
This will append each line to a file named with the first character of that line (note this will be weird if your lines can start with a space; since these are ngrams I assume that's not relevant). You could also do this in Python but managing the opening and closing of files is somewhat more tedious.
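If you would rather stay in Python, a rough equivalent might look like the following (the directory names are assumptions, and a handle is kept open per bucket so files aren't reopened for every line):
import glob
import os

os.makedirs('output', exist_ok=True)
handles = {}
try:
    for path in glob.glob('input/*'):
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                key = line[0]                 # bucket on the first character
                if key not in handles:
                    handles[key] = open(os.path.join('output', key), 'a')
                handles[key].write(line)
finally:
    for h in handles.values():
        h.close()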
Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:
from collections import Counter

ngrams = Counter()
for line in open(filename):
    ngrams[line.strip()] += 1

for key, val in ngrams.items():
    print(key, val, sep='\t')
If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.

Python Running Increasingly Slower, Garbage Collection Issue?

So I have code that grabs a list of files from a directory that initially had over 14 million files. This is a six-core machine with 20 GB of RAM running Ubuntu 14.04 desktop, and just grabbing the list of files takes hours - I haven't actually timed it.
Over the past week or so I've run code that does nothing more than gather this list of files, open each file to determine when it was created, and move it to a directory based on the month and year it was created. (The files have been both scp'd and rsync'd, so the timestamp the OS provides is meaningless at this point, hence opening the file.)
When I first started running this loop it was moving 1000 files in about 90 seconds. Then after several hours like this that 90 seconds became 2.5 min, then 4, then 5, then 9, and eventually 15 min. So I shut it down and started over.
I noticed today that once it was done gathering a list of over 9 million files, moving 1000 files took 15 min right off the bat. I just shut the process down again and rebooted the machine because the time to move 1000 files had climbed to over 90 min.
I had hoped to find some means of doing a while + list.pop() style strategy to free memory as the loop progressed. Then I found a couple of SO posts that said it could be done with for i in list: ... list.remove(...), but that this was a terrible idea.
Here's the code:
from basicconfig.startup_config import *

arc_dir = '/var/www/data/visits/'


def step1_move_files_to_archive_dirs(files):
    """
    :return:
    """
    cntr = 0
    for f in files:
        cntr += 1
        if php_basic_files.file_exists(f) is False:
            continue
        try:
            visit = json.loads(php_basic_files.file_get_contents(f))
        except:
            continue
        fname = php_basic_files.basename(f)
        try:
            dt = datetime.fromtimestamp(visit['Entrance Time'])
        except KeyError:
            continue
        mYr = dt.strftime("%B_%Y")
        # Move the lead to Monthly archive
        arc_path = arc_dir + mYr + '//'
        if not os.path.exists(arc_path):
            os.makedirs(arc_path, 0777)
        if not os.path.exists(arc_path):
            print "Directory: {} was not created".format(arc_path)
        else:
            # Move the file to the archive
            newFile = arc_path + fname
            #print "File moved to {}".format(newFile)
            os.rename(f, newFile)
        if cntr % 1000 is 0:
            print "{} files moved ({})".format(cntr, datetime.fromtimestamp(time.time()).isoformat())


def step2_combine_visits_into_1_file():
    """
    :return:
    """
    file_dirs = php_basic_files.glob(arc_dir + '*')
    for fd in file_dirs:
        arc_files = php_basic_files.glob(fd + '*.raw')
        arc_fname = arc_dir + php_basic_str.str_replace('/', '', php_basic_str.str_replace(arc_dir, '', fd)) + '.arc'
        try:
            arc_file_data = php_basic_files.file_get_contents(arc_fname)
        except:
            arc_file_data = {}
        for f in arc_files:
            uniqID = moduleName = php_adv_str.fetchBefore('.', php_basic_files.basename(f))
            if uniqID not in arc_file_data:
                visit = json.loads(php_basic_files.file_get_contents(f))
                arc_file_data[uniqID] = visit
        php_basic_files.file_put_contents(arc_fname, json.dumps(arc_file_data))


def main():
    """
    :return:
    """
    files = php_basic_files.glob('/var/www/html/ver1/php/VisitorTracking/data/raw/*')
    print "Num of Files: {}".format(len(files))
    step1_move_files_to_archive_dirs(files)
    step2_combine_visits_into_1_file()
Notes:
basicconfig is essentially a bunch of constants I have for the environment and a few commonly used libraries like all the php_basic_* libraries. (I used PHP for years before picking up Python so I built a library to mimic the more common functions I used in order to be up and running with Python faster.)
The step1 def is as far as the program gets so far. The step2 def could, and likely should, be run in parallel. However, I figured I/O was the bottleneck and doing even more of it in parallel would likely slow all functions down a lot more. (I have been tempted to rsync the archive directories to another machine for aggregation thus getting parallel speed without the I/O bottleneck but figured the rsync would also be quite slow.)
The files themselves are all about 3 KB each, so not very large.
----- Final Thoughts -------
Like I said, it doesn't appear, to me at least, that any data is being stored from each file opening, so memory should not be an issue. However, I do notice that only 1.2 GB of RAM is being used right now, whereas over 12 GB was being used before. A big chunk of that 12 GB could be storing 14 million file names and paths. I've only just started the processing again, so for the next several hours Python will be gathering a list of files, and that list isn't in memory yet.
So I was wondering if there was a garbage collection issue or something else I was missing. Why is it slowing down as it progresses through the loop?
step1_move_files_to_archive_dirs:
Here are some reasons Step 1 might be taking longer than you expected...
The response to any exception during Step 1 is to continue to the next file. If you have any corrupted data files, they will stay in the filesystem forever, increasing the amount of work this function has to do next time (and the next, and the next...).
You are reading in every file and converting it from JSON to a dict, just to extract one date. So everything is read and converted at least once. If you control the creation of these files, it might be worth storing this value in the filename or in a separate index / log, so you don't have to go searching for that value again later.
If the input directories and output / archive directories are on separate filesystems, os.rename(f, newFile) can't just rename the file, but has to copy every byte from the source filesystem to the target filesystem. So either every file is near-instantaneously renamed, or every input file is slowly copied.
PS: It's weird that this function double-checks things like whether the input file still exists, or if os.makedirs worked, but then allows any exception from os.rename to crash you mid-loop.
step2_combine_visits_into_1_file:
All your file I/O is hidden inside that PHP library, but it looks to this PHP outsider like you're trying to store in RAM the contents of all the files in each subdirectory. Then, you accumulate all those contents inside some smaller number of archive files, while preserving (most of?) the data that was already there. Not only is that probably slow to begin with, it will get slower as time goes on.
Function code mostly replaced by comments:
file_dirs = # arch_dir/* --- Maybe lots, maybe only a few.
for fd in file_dirs:
    arc_files = # arch_dir/subdir*.raw or maybe arch_dir/subdir/*.raw.
    arc_fname = # subdir.arc
    arc_file_data = # Contents of JSON file subdir.arc, as a dict.
    for f in arc_files:  # The *.raw files.
        uniqID = # String based on f's filename.
        if uniqID not in arc_file_data:
            # Add to arc_file_data the uniqID key, and the
            # _entire contents_ of the .raw file as its value.
    php_basic_files.file_put_contents  # (...)
    # Convert the arc_file_data dict into one _massive_ string,
    # and replace the contents of the subdir.arc file.
Unless you have some maintenance job that periodically trims the *.arc files, you will eventually have the entire contents of all 14 million files (plus any older files) inside the *.arc files. Each of those .arc files gets read into a dict, converted to a mega-string, grown (probably), and then written back to the filesystem. That's a ton of I/O, even if the average .arc file isn't very big (which can only happen if there are lots of them).
Why do all this anyway? By the start of Step 2, you've already got a unique ID for each .raw input file, and it's already in the filename --- so why not use the filesystem itself to store /arch_dir/subdir/unique_id.json?
If you really do need all this data in a few huge archives, that shouldn't require so much work. The .arc files are little more than the unaltered contents of the .raw files, with bits of a JSON dictionary between them. A simple shell script could slap that together without ever interpreting the JSON itself.
(If the values are not just JSON but quoted JSON, you would have to change whatever reads the .arc files to not un-quote those values. But now I'm purely speculating, since I can only see some of what's happening.)
PS: Am I missing something, or is arc_files a list of *.raw filenames? Shouldn't it be called raw_files?
Other Comments:
As others have noted, if your file-globbing function returns a mega-list of 14 million filenames, it would be vastly more memory-efficient as a generator that can yield one filename at a time.
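For example, a minimal sketch of the generator idea using Python 3's os.scandir (the question's code is Python 2, where glob.iglob or the scandir backport plays a similar role); the directory path is taken from the question and the .raw filter is an assumption:
import os

def iter_raw_files(directory):
    # yields one path at a time instead of building a 14-million-element list
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith('.raw'):
                yield entry.path

for path in iter_raw_files('/var/www/html/ver1/php/VisitorTracking/data/raw'):
    pass  # process a single file here, then let it go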
Finally, you mentioned popping filenames off a list (although I don't see that in your code)... There is a huge time penalty for inserting or removing the first element of a large list --- del my_list[0] or my_list.pop(0) or my_list.insert(0, something) --- because items 1 through n-1 all have to be copied one index toward 0. Done in a loop over the whole list, that turns an O(n) pass into O(n**2)... again, if that's in your code anywhere.
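If a pop-from-the-front pattern is ever needed, collections.deque avoids that penalty; a quick timing sketch (the list size is arbitrary):
import timeit

# list.pop(0) shifts every remaining element; deque.popleft() does not
print(timeit.timeit('while d: d.pop(0)',
                    setup='d = list(range(200000))', number=1))
print(timeit.timeit('while d: d.popleft()',
                    setup='from collections import deque; d = deque(range(200000))',
                    number=1))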

python write text to file slower than printing text to terminal with python?

I'm writing a program that takes a string and computes all possible repeated permutations of this string. I'll show some fragments of my code; I would be grateful if someone could point out how to improve the speed when sending the data to a file.
Scenario 1
Sending the output to stdout took about 12 seconds to write 531,441 lines (3 MB):
import itertools

for word in itertools.product('abcdefghi', repeat=6):
    print(word)
Scenario 2
Then I tried sending the output to a file instead of stdout, and this took roughly 5 minutes.
import itertools

word_counter = 0
for word in itertools.product('abcdefghi', repeat=6):
    word_counter = word_counter + 1
    if word_counter == 1:
        open('myfile', 'w').write(''.join(word))
    else:
        open('myfile', 'a').write(''.join(word))
word_counter keeps track of the number of repeated permutations as the loop runs. When word_counter is 1 the program creates the file, and afterwards it appends the data to the file when word_counter is greater than 1.
I used a program on the web to do this, and I found it took about the same time to print the data to a terminal, but that same web program took about 3 seconds to output these combinations to a file, while my program took 5 minutes to output the data to a file!
I also tried running my program and redirecting the output to a file in a bash terminal, and this took the same time (3 sec)!
'myprog' > 'output file'
You are reopening the file for every write; try not doing that:
import itertools

output = open('myfile', 'w')
for word in itertools.product('abcdefghi', repeat=6):
    output.write(''.join(word) + '\n')
[Edit with explanation]
When you're working with 530,000 words, even making something a tiny bit slower for each word, adds up to a LOT slower for the whole program.
My way, you do one piece of setup work (open the file) and put it in memory, then go through 500,000 words and save them, then do one piece of tidy up work (close the file). That's why the file is saved in a variable - so you can set it up once, and use it again and again.
Your way, you do almost no setup work first, then you add one to the counter 500,000 times, check the value of the counter 500,000 times, branch this way or that 500,000 times, open the file and force Windows (or Linux) to check your permissions every time, put it in memory 500,000 times, write to it 500,000 times, stop using the file you opened (because you didn't save it) so it falls into the 'garbage' and gets tidied up - 500,000 times, and then finish.
The amount of work is small each time, but when you do them all so many times, it adds up.
The same as the previous answer, but with a context manager!
import itertools

with open('myfile', 'w') as output:
    for word in itertools.product('abcdefghi', repeat=6):
        output.write(''.join(word) + '\n')
Context managers have the benefit of cleaning up after themselves and handling errors.

Parallel processing of a large .csv file in Python

I'm processing large CSV files (on the order of several GBs with 10M lines) using a Python script.
The files have different row lengths, and cannot be loaded fully into memory for analysis.
Each line is handled separately by a function in my script. It takes about 20 minutes to analyze one file, and it appears disk access speed is not an issue, but rather processing/function calls.
The code looks something like this (very straightforward). The actual code uses a Class structure, but this is similar:
import csv

csvReader = csv.reader(open("file", "r"))
for row in csvReader:
    handleRow(row, dataStructure)
Given the calculation requires a shared data structure, what would be the best way to run the analysis in parallel in Python utilizing multiple cores?
In general, how do I read multiple lines at once from a .csv in Python to transfer to a thread/process? Looping with for over the rows doesn't sound very efficient.
Thanks!
This might be too late, but just for future users I'll post it anyway. Another poster mentioned using multiprocessing. I can vouch for it and can go into more detail. We deal with files in the hundreds of MB / several GB every day using Python, so it's definitely up to the task. Some of the files we deal with aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. However, the methodology is the same no matter what the file type.
You can process pieces of the large files concurrently. Here's pseudo code of how we do it:
import os, multiprocessing as mp

# process file function
def processfile(filename, start=0, stop=0):
    if start == 0 and stop == 0:
        ... process entire file ...
    else:
        with open(filename, 'r') as fh:
            fh.seek(start)
            lines = fh.readlines(stop - start)
            ... process these lines ...
    return results

if __name__ == "__main__":
    # get file size and set chunk size
    filesize = os.path.getsize(filename)
    split_size = 100*1024*1024

    # determine if it needs to be split
    if filesize > split_size:
        # create pool, initialize chunk start location (cursor)
        pool = mp.Pool(mp.cpu_count())
        cursor = 0
        results = []
        with open(filename, 'r') as fh:
            # for every chunk in the file...
            for chunk in xrange(filesize // split_size):
                # determine where the chunk ends, is it the last one?
                if cursor + split_size > filesize:
                    end = filesize
                else:
                    end = cursor + split_size
                # seek to end of chunk and read next line to ensure you
                # pass entire lines to the processfile function
                fh.seek(end)
                fh.readline()
                # get current file location
                end = fh.tell()
                # add chunk to process pool, save reference to get results
                proc = pool.apply_async(processfile, args=[filename, cursor, end])
                results.append(proc)
                # setup next chunk
                cursor = end
        # close and wait for pool to finish
        pool.close()
        pool.join()
        # iterate through results
        for proc in results:
            processfile_result = proc.get()
    else:
        ... process normally ...
Like I said, that's only pseudo code. It should get anyone started who needs to do something similar. I don't have the code in front of me, I'm just doing it from memory.
But we got more than a 2x speed-up from this on the first run without fine-tuning it. You can fine-tune the number of processes in the pool and how large the chunks are to get an even higher speed-up, depending on your setup. If you have multiple files, as we do, create a pool to read several files in parallel. Just be careful not to overload the box with too many processes.
Note: You need to put it inside an "if main" block to ensure infinite processes aren't created.
Try benchmarking reading your file and parsing each CSV row but doing nothing with it. You ruled out disk access, but you still need to see if the CSV parsing is what's slow or if your own code is what's slow.
If it's the CSV parsing that's slow, you might be stuck, because I don't think there's a way to jump into the middle of a CSV file without scanning up to that point.
If it's your own code, then you can have one thread reading the CSV file and dropping rows into a queue, and then have multiple threads processing rows from that queue. But don't bother with this solution if the CSV parsing itself is what's making it slow.
Because of the GIL, Python's threading won't speed-up computations that are processor bound like it can with IO bound.
Instead, take a look at the multiprocessing module which can run your code on multiple processors in parallel.
If the rows are completely independent, just split the input file into as many files as you have CPUs. After that, you can run as many instances of the process as you now have input files. These instances, since they are completely separate processes, will not be bound by GIL problems.
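A minimal sketch of that idea, assuming the big CSV has already been split into part files (the part_*.csv names and the handle_row() placeholder are made up here):
import csv
import glob
import multiprocessing as mp

def handle_row(row):
    ...  # the real per-row calculation goes here

def process_part(path):
    with open(path, newline='') as f:
        for row in csv.reader(f):
            handle_row(row)

if __name__ == '__main__':
    parts = glob.glob('part_*.csv')
    with mp.Pool(len(parts)) as pool:   # one worker per part file
        pool.map(process_part, parts)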
Just found a solution to this old problem. I tried Pool.imap, and it seems to simplify processing a large file significantly. imap has one significant benefit when it comes to processing large files: it returns results as soon as they are ready, rather than waiting for all the results to be available. This saves a lot of memory.
(Here is an untested snippet of code which reads a csv file row by row, processes each row, and writes it back to a different csv file. Everything is done in parallel.)
import multiprocessing as mp
import csv

CHUNKSIZE = 10000   # Set this to whatever you feel reasonable

def _run_parallel(csvfname, csvoutfname):
    with open(csvfname, newline='') as csvf, \
         open(csvoutfname, 'w', newline='') as csvout, \
         mp.Pool() as p:
        reader = csv.reader(csvf)
        writer = csv.writer(csvout)
        writer.writerows(p.imap(process, reader, chunksize=CHUNKSIZE))
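The snippet assumes a top-level process() function that takes one row and returns the transformed row; a hypothetical example, just to make the sketch self-contained:
def process(row):
    # placeholder transformation: upper-case every field of the row
    return [field.upper() for field in row]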
If you use zmq and a DEALER middleman, you'd be able to spread the row processing not just across the CPUs on your computer but across a network to as many processes as necessary. This would essentially guarantee that you hit an I/O limit rather than a CPU limit :)
