Python: efficient file IO

What is the most efficient (fastest) way to simultaneously read in two large files and do some processing?
I have two files, a.txt and b.txt, each containing about a hundred thousand corresponding lines. My goal is to read in the two files and then do some processing on each line pair, for example:
def kernel():
    a_file = open('a.txt', 'r')
    b_file = open('b.txt', 'r')
    a_line = a_file.readline()
    b_line = b_file.readline()
    while a_line:
        process(a_line, b_line)  # processing requiring both corresponding file lines
        a_line = a_file.readline()
        b_line = b_file.readline()
I looked into xreadlines and readlines but I'm wondering if I can do better. Speed is of paramount importance for this task.
Thank you.

The below code does not accumulate data from the input files in memory, unless the process function does that by itself.
from itertools import izip

def process(line1, line2):
    # process a line from each input
    pass

with open(file1, 'r') as f1:
    with open(file2, 'r') as f2:
        for a, b in izip(f1, f2):
            process(a, b)
If the process function is efficient, this code should run quickly enough for most purposes. The for loop will terminate when the end of one of the files is reached. If either file contains an extraordinarily long line (e.g. a single-line XML or JSON document), or if the files are not text, this code may not work well.
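On Python 3, itertools.izip is gone, but the built-in zip is already lazy, so the same pattern can be written (as a sketch, reusing the a.txt/b.txt names from the question and a placeholder process function) as:
def process(line1, line2):
    pass  # work on one pair of corresponding lines

with open('a.txt') as f1, open('b.txt') as f2:
    for a, b in zip(f1, f2):
        process(a, b)
If the files may differ in length and you also need the trailing lines of the longer one, itertools.zip_longest pads the shorter side with None.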

You can use the with statement to make sure your files are closed after the execution. From this blog entry:
to open a file, process its contents, and make sure to close it, you can simply do:
with open("x.txt") as f:
data = f.read()
do something with data

String IO can be pretty fast -- probably your processing will be what slows things down. Consider a simple input loop to feed a queue like:
import itertools
import multiprocessing

queue = multiprocessing.Queue(100)
a_file = open('a.txt')
b_file = open('b.txt')
for pair in itertools.izip(a_file, b_file):
    queue.put(pair)  # blocks here on full queue
You can set up a pool of processes pulling items from the queue and taking action on each, assuming your problem can be parallelised this way.
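For reference, a minimal sketch of that consumer side, written for Python 3 (where plain zip is lazy), using explicit worker Processes and a None sentinel; the process function and worker count are placeholders, not anything from the question:
import multiprocessing

def process(a_line, b_line):
    pass  # stand-in for the real per-pair work

def worker(queue):
    while True:
        pair = queue.get()
        if pair is None:      # sentinel: no more input
            break
        process(*pair)

if __name__ == '__main__':
    queue = multiprocessing.Queue(100)
    workers = [multiprocessing.Process(target=worker, args=(queue,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    with open('a.txt') as a_file, open('b.txt') as b_file:
        for pair in zip(a_file, b_file):
            queue.put(pair)   # blocks here on a full queue
    for _ in workers:
        queue.put(None)       # one sentinel per worker
    for w in workers:
        w.join()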

I'd change your while condition to the following so that it doesn't fail when a has more lines than b.
while a_line and b_line:
Otherwise, that looks good. You are reading in the two lines that you need, then processing. You could even multithread this by reading in N pairs of lines and sending each pair off to a new thread or similar.
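A hedged sketch of that idea with a thread pool (bear in mind that threads only help if process releases the GIL, e.g. it is I/O-bound or calls into C; for pure-Python CPU work, multiprocessing is the better fit):
from concurrent.futures import ThreadPoolExecutor

def process(a_line, b_line):
    pass  # stand-in for the real per-pair work

with open('a.txt') as fa, open('b.txt') as fb:
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() accepts several iterables and pairs them up like zip()
        for _ in pool.map(process, fa, fb):
            pass  # consuming the results here also surfaces any exceptions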

Related

Issues reading in large .gz files

I am reading in a large gzipped JSON file, ~4 GB. I want to read in the first n lines.
import ast
import gzip

with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    line_n = f.readlines(1)
    print(ast.literal_eval(line_n[0])['events'])  # a dictionary object
This works fine when I want to read a single line. If I now try to read in a loop, e.g.
no_of_lines = 1
with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in range(no_of_lines):
        line_n = f.readlines(line)
        print(ast.literal_eval(line_n[0])['events'])
My code takes forever to execute, even if that loop is of length 1. I'm assuming this behaviour has something to do with how gzip reads files; perhaps when I loop it tries to obtain information about the file length, which causes the long execution time? Can anyone shed some light on this and potentially provide an alternative way of doing this?
An edited first line of my data:
['{"events": {"category": "EVENT", "mac_address": "123456", "co_site": "HSTH"}}\n']
You are using the readlines() method, which reads all lines from the file at once (on your first loop iteration the size hint is line = 0, which means no limit). This can cause performance issues when reading huge files, as Python needs to load all the lines into memory at once.
An alternative is to iterate over the file object itself, which yields one line at a time without having to load all the lines into memory at once:
import ast
import gzip

with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in f:
        print(ast.literal_eval(line)['events'])
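If you only want the first n lines, a sketch using itertools.islice stops after n lines without touching the rest of the file (n = 5 is an arbitrary choice):
import ast
import gzip
from itertools import islice

n = 5
with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in islice(f, n):
        print(ast.literal_eval(line)['events'])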

Read in large text file (~20m rows), apply function to rows, write to new text file

I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.
My code looks like this:
with open('f.txt', 'r') as f:
    function(f, w)
Where the function takes in the large text file and an empty text file and applies the function and writes to the empty file.
I have tried:
def multiprocess(f, w):
    cores = multiprocessing.cpu_count()
    with Pool(cores) as p:
        pieces = p.map(function, f, w)
    f.close()
    w.close()

multiprocess(f, w)
But when I do this, I get a TypeError (unsupported operand: '<=' between 'io.TextIOWrapper' and 'int'). This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.
Even if you could successfully pass open file objects to child OS processes in your Pool as the arguments f and w (which I don't think you can on any OS), trying to read from and write to files concurrently is a bad idea, to say the least.
In general, I recommend using the Process class rather than Pool, assuming that the output end result needs to maintain the same order as the input 20m lines file.
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
The slowest solution, but most efficient RAM usage
Your initial solution to execute and process the file line by line
For maximum speed, but most RAM consumption
Read the entire file into RAM as a list via f.readlines(), if your entire dataset can fit in memory comfortably
Figure out the number of cores (say 8 cores for example)
Split the list evenly into 8 lists
Pass each list to the function to be executed by a Process instance (at this point your RAM usage will be further doubled, which is the trade-off for max speed), but you should del the original big list right after to free some RAM
Each Process handles its entire chunk in order, line by line, and writes it into its own output file (out_file1.txt, out_file2.txt, etc.)
Have your OS concatenate your output files in order into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) if you are running a UNIX system, or the equivalent command on Windows.
For an intermediate trade-off between speed and RAM, but the most complex implementation, we will have to use the Queue class
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
Figure out the number of cores in a variable cores (say 8)
Initialize 8 Queues and 8 Processes, and pass one Queue to each Process. At this point each Process should open its own output file (outfile1.txt, outfile2.txt, etc.)
Each Process shall poll (and block) for a chunk of 10_000 rows, process them, and write them to its respective output file sequentially
In a loop in the parent Process, read 10_000 * 8 lines from your input 20m-row file
Split that into 8 lists of 10K lines each and push one to each Process's Queue
When you're done with the 20m rows, exit the loop and pass a special sentinel value into each Process's Queue that signals the end of the input data
When each Process detects that special end-of-data value in its own Queue, it shall close its output file and exit
Have your OS concatenate your output files in order into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) if you are running a UNIX system, or the equivalent command on Windows.
Convoluted? Well, it is usually a trade-off between speed, RAM, and complexity. Also, for a 20m-row task, one needs to make sure that the data processing is as optimal as possible: inline as many functions as you can, avoid a lot of math, use Pandas/numpy in child processes if possible, etc.
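As a rough illustration, here is a minimal sketch of the "maximum speed" recipe above, reusing the f.txt and function names from the question and leaving the final concatenation to the OS; the chunk sizing and error handling are deliberately naive:
import multiprocessing

def function(line):
    return line  # stand-in for the real per-line transformation

def work(chunk, out_name):
    with open(out_name, 'w') as out:
        for line in chunk:
            out.write(function(line))

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()
    with open('f.txt') as f:
        lines = f.readlines()          # the whole file must fit in RAM
    size = len(lines) // cores + 1
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    del lines                          # free the original big list, per the recipe
    procs = [multiprocessing.Process(target=work,
                                     args=(chunk, 'out_file{}.txt'.format(i + 1)))
             for i, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # afterwards, concatenate out_file1.txt, out_file2.txt, ... in order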
Iterating line by line with for ... in is not the only way; you can read more than one line at a time, and reading in bigger batches can make the program read faster.
Look at this snippet.
# Python code to
# demonstrate readlines()
L = ["Geeks\n", "for\n", "Geeks\n"]

# writing to file
file1 = open('myfile.txt', 'w')
file1.writelines(L)
file1.close()

# Using readlines()
file1 = open('myfile.txt', 'r')
Lines = file1.readlines()

count = 0
# Strips the newline character
for line in Lines:
    count += 1
    print("Line{}: {}".format(count, line.strip()))
I got it from: https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/.
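If the point is to read more than one line at a time without loading the whole file, a sketch with itertools.islice reads in fixed-size batches (the file name matches the snippet above; the batch size is arbitrary):
from itertools import islice

BATCH = 1000  # arbitrary batch size

with open('myfile.txt') as fh:
    while True:
        lines = list(islice(fh, BATCH))
        if not lines:
            break
        for line in lines:
            pass  # process each line in the batch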

What's the most efficient use of computing resources when reading from one file and writing to another? line by line or in a batch to a list?

I'm reading line by line from a text file and manipulating the string to then be written to a csv file.
I can think of two best ways to do this (and I welcome other ideas or modifications):
Read and process a single line into a list, and go straight to writing the line.
import csv

linelist = []
with open('dirty.txt', 'r') as dirty_text:
    with open('clean.csv', 'w') as clean_csv:
        cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in dirty_text:
            # Parse fields into the list, replacing the previous list item with a new string that is a comma-separated row.
            # Write the list item into clean.csv.
            pass
Read and process the lines into a list (until reaching the size limit of a list), then write the list to the CSV in one big batch. Repeat until the end of the file (but I'm leaving out the loop for this example).
linelist = []
seekpos = 0
with open('dirty.txt', 'r') as dirty_text:
    for line in dirty_text:
        # Parse fields into the list until the end of the file or the end of the list's memory space, such that each list item is a string that is a comma-separated row.
        # Update the seek position to come back to after this batch, if looping through multiple batches.
        pass
with open('clean.csv', 'a') as clean_csv:
    cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Write the list into clean.csv, each list item becoming a comma-separated row.
    # This would likely be a loop for bigger files, but for my project and for simplicity, it's not necessary.
Which process is the most efficient use of resources?
In this case, I'm assuming nobody (human or otherwise) needs access to either file during this process (though I would gladly hear discussion about efficiency in that case).
I'm also assuming a list demands fewer resources than a dictionary.
Memory use is my primary concern. My hunch is that the first process uses the least memory because the list never gets longer than one item, so the maximum memory it uses at any given moment is less than that of the second process which maxes out the list memory. But, I'm not sure how dynamic memory allocation works in Python, and you have two file objects open at the same time in the first process.
As for power usage and total time it takes, I'm not sure which process is more efficient. My hunch is that with multiple batches, the second option would use more power and take more time because it opens and closes the files at each batch.
As for code complexity and length, the first option seems like it will turn out simpler and shorter.
Other considerations?
Which process is best?
Is there a better way? Ten better ways?
Thanks in advance!
Reading all the data into memory is inefficient because it uses more memory than necessary.
You can trade some CPU for memory; the program to read everything into memory will have a single, very simple main loop; but the main bottleneck will be the I/O channel, so it really won't be faster. Regardless of how fast the code runs, any reasonable implementation will spend most of its running time waiting for the disk.
If you have enough memory, reading the entire file into memory will work fine. Once the data is bigger than your available memory, performance will degrade ungracefully (i.e. the OS will start swapping regions of memory out to disk and then swap them back in when they are needed again; in the worst case, this will basically grind the system to a halt, a situation called thrashing). The main reason to prefer reading and writing a line at a time is that the program will perform without degradation even when you scale up to larger amounts of data.
I/O is already buffered; just write what looks natural, and let the file-like objects and the operating system take care of the actual disk reads and writes.
import csv

with open('dirty.txt', 'r') as dirty_text:
    with open('clean.csv', 'w') as clean_csv:
        cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in dirty_text:
            row = some_function(line)
            cleancsv_writer.writerow(row)
If all the work of cleaning up a line is abstracted away by some_function, you don't even need the for loop.
with open('dirty.txt', 'r') as dirty_text, \
        open('clean.csv', 'w') as clean_csv:
    cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    cleancsv_writer.writerows(some_function(line) for line in dirty_text)

Bufferization in GzipFile

Imagine the following simple script:
import gzip

def reader():
    for line in open('logfile.log'):
        # do some stuff here like splitting the line or filtering etc.
        yield some_new_line

def writer(stream):
    with gzip.GzipFile('some_output_file.gz', 'w') as fh:
        for _s in stream:
            fh.write(_s + '\n')

stream = reader()
writer(stream)
So pretty simple - read lines using generators and write some result into a gzip file.
But how do I speed it up? The HDD seems to be the bottleneck. I saw I can use a buffer size for reads, using the open(file, mode, buffering) syntax. But I'm not quite sure it will work in my case (with generators).
Also, I didn't find any buffering parameter for the gzip.GzipFile call. From the source, it's based on some buffered class, but I don't see any further docs on that.
I have a (crazy?) idea to create an explicit cache and replace the open methods with it, so it will read the file in bigger chunks, say, 8 MB at a time, and then split it into lines. As for writes, I thought to create a list of lines to write, collect them (say, 5000 lines), and then dump them into the file.
Am I trying to re-invent the wheel? I'm not satisfied with the performance the script currently has, so I'm trying to speed it up as much as possible.
UPD. I have around 4-5 different parallel workers running. They all perform reads and writes. So I guess the HDD is jumping from one sector to another, and this is the reason why I want to implement some buffering to dump the data periodically in big chunks.
Thanks!
I can just propose more compact code:
import gzip

def reader():
    for line in open('logfile.log'):
        # do some stuff here like splitting the line or filtering etc.
        yield some_new_line

def writer(stream):
    with gzip.GzipFile('some_output_file.gz', 'w') as fh:
        fh.writelines(stream)

writer(reader())
However, there is no actual speed-up. Python will manage the streams, but if you cannot spare memory for full file write, the speed-up will not be great.
The compression through gzip is the slowest step. The following function will give you only ~3% speed-up (disregarding the generator's part).
def writer():
    f = open('logfile.log').read()
    gzip.GzipFile('some_output_file.gz', 'w').write(f)

writer()
So, if you need gzip, then you cannot do much.
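For what it's worth, the "explicit cache" idea from the question can be sketched with nothing more than a larger read buffer and batched writes; the buffer and batch sizes below are arbitrary guesses, and whether it helps at all depends on how much of the time really goes into compression:
import gzip

READ_BUFFER = 8 * 1024 * 1024   # assumed 8 MB input buffer
WRITE_BATCH = 5000              # assumed number of lines per write

def reader():
    with open('logfile.log', buffering=READ_BUFFER) as fh:
        for line in fh:
            yield line          # replace with the real per-line processing

def writer(stream):
    batch = []
    with gzip.open('some_output_file.gz', 'wt') as fh:
        for line in stream:
            batch.append(line)
            if len(batch) >= WRITE_BATCH:
                fh.write(''.join(batch))
                batch = []
        if batch:
            fh.write(''.join(batch))

writer(reader())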

Parallel processing of a large .csv file in Python

I'm processing large CSV files (on the order of several GBs with 10M lines) using a Python script.
The files have different row lengths, and cannot be loaded fully into memory for analysis.
Each line is handled separately by a function in my script. It takes about 20 minutes to analyze one file, and it appears disk access speed is not an issue, but rather processing/function calls.
The code looks something like this (very straightforward). The actual code uses a Class structure, but this is similar:
import csv

csvReader = csv.reader(open("file", "r"))
for row in csvReader:
    handleRow(row, dataStructure)
Given the calculation requires a shared data structure, what would be the best way to run the analysis in parallel in Python utilizing multiple cores?
In general, how do I read multiple lines at once from a .csv in Python to transfer to a thread/process? Looping with for over the rows doesn't sound very efficient.
Thanks!
This might be too late, but just for future users I'll post anyway. Another poster mentioned using multiprocessing. I can vouch for it and can go into more detail. We deal with files in the hundreds of MB/several GB every day using Python, so it's definitely up to the task. Some of the files we deal with aren't CSVs, so the parsing can be fairly complex and take longer than the disk access. However, the methodology is the same no matter what the file type.
You can process pieces of the large files concurrently. Here's pseudo code of how we do it:
import os, multiprocessing as mp

# process file function
def processfile(filename, start=0, stop=0):
    if start == 0 and stop == 0:
        ... process entire file...
    else:
        with open(filename, 'r') as fh:
            fh.seek(start)
            lines = fh.readlines(stop - start)
            ... process these lines ...
    return results

if __name__ == "__main__":
    # get file size and set chunk size
    filesize = os.path.getsize(filename)
    split_size = 100*1024*1024

    # determine if it needs to be split
    if filesize > split_size:
        # create pool, initialize chunk start location (cursor)
        pool = mp.Pool(mp.cpu_count())
        cursor = 0
        results = []
        with open(filename, 'r') as fh:
            # for every chunk in the file...
            for chunk in xrange(filesize // split_size):
                # determine where the chunk ends, is it the last one?
                if cursor + split_size > filesize:
                    end = filesize
                else:
                    end = cursor + split_size

                # seek to end of chunk and read next line to ensure you
                # pass entire lines to the processfile function
                fh.seek(end)
                fh.readline()

                # get current file location
                end = fh.tell()

                # add chunk to process pool, save reference to get results
                proc = pool.apply_async(processfile, args=[filename, cursor, end])
                results.append(proc)

                # setup next chunk
                cursor = end

        # close and wait for pool to finish
        pool.close()
        pool.join()

        # iterate through results
        for proc in results:
            processfile_result = proc.get()
    else:
        ...process normally...
Like I said, that's only pseudo code. It should get anyone started who needs to do something similar. I don't have the code in front of me, just doing it from memory.
But we got more than a 2x speed up from this on the first run without fine tuning it. You can fine tune the number of processes in the pool and how large the chunks are to get an even higher speed up depending on your setup. If you have multiple files as we do, create a pool to read several files in parallel. Just be careful not to overload the box with too many processes.
Note: You need to put it inside an "if main" block to ensure infinite processes aren't created.
Try benchmarking reading your file and parsing each CSV row but doing nothing with it. You ruled out disk access, but you still need to see if the CSV parsing is what's slow or if your own code is what's slow.
If it's the CSV parsing that's slow, you might be stuck, because I don't think there's a way to jump into the middle of a CSV file without scanning up to that point.
If it's your own code, then you can have one thread reading the CSV file and dropping rows into a queue, and then have multiple threads processing rows from that queue. But don't bother with this solution if the CSV parsing itself is what's making it slow.
Because of the GIL, Python's threading won't speed up computations that are processor bound like it can with IO-bound work.
Instead, take a look at the multiprocessing module which can run your code on multiple processors in parallel.
If the rows are completely independent, just split the input file into as many files as you have CPUs. After that, you can run as many instances of the process as you now have input files. These instances, since they are completely separate processes, will not be bound by GIL problems.
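A minimal sketch of that splitting approach, assuming the rows really are independent; big.csv and worker.py are hypothetical names, and the split is a simple round-robin over lines:
import os
import subprocess
import sys

N = os.cpu_count() or 4

# Split the input on line boundaries, round-robin, into one piece per CPU.
with open('big.csv') as src:
    parts = [open('part_{}.csv'.format(i), 'w') for i in range(N)]
    for lineno, line in enumerate(src):
        parts[lineno % N].write(line)
    for part in parts:
        part.close()

# One independent interpreter per piece: no shared state, no GIL contention.
workers = [subprocess.Popen([sys.executable, 'worker.py', 'part_{}.csv'.format(i)])
           for i in range(N)]
for w in workers:
    w.wait()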
Just found a solution to this old problem. I tried Pool.imap, and it seems to simplify processing large files significantly. imap has one significant benefit when it comes to processing large files: it returns results as soon as they are ready, instead of waiting for all the results to be available. This saves a lot of memory.
(Here is an untested snippet of code which reads a csv file row by row, processes each row and writes it back to a different csv file. Everything is done in parallel.)
import multiprocessing as mp
import csv

CHUNKSIZE = 10000  # Set this to whatever you feel reasonable

def process(row):
    return row  # stand-in: replace with whatever you do to a single row

def _run_parallel(csvfname, csvoutfname):
    with open(csvfname) as csvf, \
            open(csvoutfname, 'w') as csvout, \
            mp.Pool() as p:
        reader = csv.reader(csvf)
        csv.writer(csvout).writerows(p.imap(process, reader, chunksize=CHUNKSIZE))
If you use zmq and a DEALER middle man, you'd be able to spread the row processing not just across the CPUs on your computer but across a network to as many processes as necessary. This would essentially guarantee that you hit an IO limit rather than a CPU limit :)
