How to read, process and write a large file in Python?

I have a txt file of about 100 GB. I have to read it, do some processing, and write the output in the same order as the original file, as fast as possible. I can't use multiprocessing for the reading and writing; for the processing I tried map but ran out of memory, and I also tried imap but it doesn't seem to speed things up.

Don't load the file into memory all at once, but read it line by line. Don't store the output into memory all at once, but write it line by line. In the most basic form:
with open('input.txt', 'rt') as input_file:
    with open('output.txt', 'wt') as output_file:
        for input_line in input_file:
            output_line = process(input_line)
            # Assuming that output_line will have a trailing newline.
            output_file.write(output_line)
If the processing is very CPU-intensive, you could gain performance with parallelization. But the I/O will have to remain sequential, so it's only worth it if you see that the program is taking 100% of a CPU core.
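If you do parallelize, a minimal sketch of that idea could look like the following, assuming a placeholder process function and using multiprocessing.Pool.imap, which yields results in the original input order while the reads and writes stay sequential in the main process (the chunksize of 1000 is arbitrary and only reduces inter-process overhead):
import multiprocessing

def process(line):
    # Placeholder for the real CPU-intensive work on one line.
    return line.upper()

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        with open('input.txt', 'rt') as input_file:
            with open('output.txt', 'wt') as output_file:
                # imap hands lines to worker processes but yields the results
                # in input order, so the output file stays aligned with the input.
                for output_line in pool.imap(process, input_file, chunksize=1000):
                    output_file.write(output_line)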

Forget about "speeding it up" for now: if you have to process a 100 GB file, your main concern should be memory consumption. Your first issue is getting it to work at all.
You cannot read the whole file into memory and then process it (unless you happen to have 100 GB of RAM). You have to read it either line by line or in batches, and process only part of it at a time.
So, instead of using file.read(), use file.readline() or file.read(some_size), depending on the structure of the file you are reading.
In the same way, if you need to write a processed 100 GB file, don't collect all the results in a list or something; write each line to the result file as soon as you are done processing the corresponding line of the original file.
Incidentally, I would expect this to be the fastest method given your requirements (same order), because you can forget about doing anything out of order and then sorting. You can't sort the data in place if it's too big for your memory, and while sorting it on the file system is possible, it's more complicated and I am somewhat doubtful it would be faster.
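For completeness, a minimal sketch of the batch variant described above, reading fixed-size chunks instead of lines (the 1 MB chunk size and the process_chunk function are placeholders):
CHUNK_SIZE = 1024 * 1024  # 1 MB per read; tune to taste.

def process_chunk(chunk):
    return chunk  # Placeholder for the real per-chunk processing.

with open('input.txt', 'rb') as input_file, open('output.txt', 'wb') as output_file:
    while True:
        chunk = input_file.read(CHUNK_SIZE)
        if not chunk:  # An empty bytes object means end of file.
            break
        output_file.write(process_chunk(chunk))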

Related

Locking files (in Python)

So I have a couple of questions:
Do reads and writes in Python using file.read() and file.write() happen using locks?
Or can a read and a write happen at the same time?
I have pickled an object to a file on disk. Multiple threads can read the pickled model, but I also want to update the pickled file. One way is to lock the file on every read and write, but that would be inefficient, since locking the file when it is only being read by different processes is unnecessary.
How should I solve this?
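One common approach is an advisory lock that is shared for readers and exclusive for the writer. This is only a sketch, assuming a POSIX system, fcntl.flock, and that every reader and writer cooperates by taking the lock; the load_model/save_model helpers and the path are hypothetical:
import fcntl
import pickle

def load_model(path):
    with open(path, 'rb') as f:
        fcntl.flock(f, fcntl.LOCK_SH)      # Shared lock: many readers may hold it at once.
        try:
            return pickle.load(f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def save_model(path, model):
    with open(path, 'r+b') as f:           # Open without truncating; assumes the file already exists.
        fcntl.flock(f, fcntl.LOCK_EX)      # Exclusive lock: waits until all readers release.
        try:
            f.truncate(0)
            pickle.dump(model, f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
Shared locks let readers proceed concurrently, so the writer only blocks readers while it actually rewrites the file; note that flock locks are advisory, so they only help if every process uses them.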

Read a file multi-threaded in Python in chunks of 2 KB.

I have to read a file in chunks of 2 KB and do some operation on those chunks. Where I'm actually stuck is when the data needs to be thread-safe. From what I've seen in online tutorials and StackOverflow answers, we define a worker thread and override its run method. The run method uses data from a queue, which we pass as an argument and which contains the actual data. But to load that queue with data, I'll have to go through the file serially, which eliminates parallelism. I want multiple threads to read the file in parallel. So I'll have to cover the read part in the run function only, but I'm not sure how to go about that. Help needed.
Reading the file serially is your best option, since (hardware-wise) it gives you the best read throughput.
Usually the slow part is not in the data reading but in its processing...
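A minimal sketch of that pattern, with a single serial reader feeding worker threads through a queue (the file name, the handle_chunk function, and the worker count are placeholders; the 2 KB chunk size comes from the question):
import queue
import threading

CHUNK_SIZE = 2048          # 2 KB, as in the question.
NUM_WORKERS = 4            # Arbitrary; tune for your workload.
chunk_queue = queue.Queue(maxsize=100)

def handle_chunk(chunk):
    pass                   # Placeholder for the real per-chunk work.

def worker():
    while True:
        chunk = chunk_queue.get()
        if chunk is None:  # Sentinel: no more data.
            break
        handle_chunk(chunk)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

with open('data.bin', 'rb') as f:      # Single serial reader.
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        chunk_queue.put(chunk)

for _ in threads:                      # One sentinel per worker.
    chunk_queue.put(None)
for t in threads:
    t.join()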

Is there any advantage to reading the entire file

Are there any advantages/disadvantages to reading an entire file in one go rather than reading the bytes as required? So is there any advantage to:
file_handle = open("somefile", "rb")
file_contents = file_handle.read()
# do all the things using file_contents
compared to:
file_handle = open("somefile", "rb")
part1 = file_handle.read(10)
# do some stuff
part2 = file_handle.read(8)
# do some more stuff etc
Background: I am writing a p-code (bytecode) interpreter in Python and have initially just written a naive implementation that reads bytes from the file as required and performs the necessary actions. A friend I was showing the program to suggested that I should instead read the entire file into memory (a Python list?) and then process it from memory to avoid lots of slow disk reads. The test files are currently less than 1 KB and will probably be at most a few hundred KB, so I would have expected the operating system and disk controller to cache the file, obviating any performance issues caused by repeatedly reading small chunks of it.
Cache aside, you still have system calls. Each read() results in a mode switch to trigger the kernel. You can see this with strace or another tool to look at system calls.
This might be premature for a 100 KB file though. As always, test your code to know for sure.
If you want to do any kind of random access then putting it in a list is going to be much faster than seeking from disk. Even if the OS does cache disk access, you are hitting another layer of cache. In any case, you can't be sure how the OS will behave.
Here are 3 cases I can think of that would motivate doing it in-memory:
You might have a jump instruction which you can execute by adding a number to your program counter. Doing that to the index of an array vs seeking a file is a good use case.
You may want to optimise your VM's behaviour, and that may involve reading the file more than once. Scanning a list twice vs reading a file twice will be much quicker.
Depending on opcodes and the grammar of your language you may want to look ahead in a 'cycle' to speed up execution. If that ends up doing two seeks then this could end up degrading performance.
If your file will always be small enough to fit in RAM then it's probably worth reading it all into memory. Profile it with a real program and see if it's noticeably faster.
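As a rough illustration of the in-memory approach for a bytecode interpreter, here is a sketch; the opcode values, the jump encoding, and the file name are entirely made up for the example:
with open("program.pcode", "rb") as f:
    code = f.read()                # The whole program in memory as a bytes object.

pc = 0                             # Program counter: just an index into code.
while pc < len(code):
    opcode = code[pc]
    if opcode == 0x01:             # Hypothetical JUMP: the next byte is the target address.
        pc = code[pc + 1]          # A jump is simply reassigning the index, no seek needed.
    elif opcode == 0x00:           # Hypothetical HALT.
        break
    else:
        pc += 1                    # Other opcodes: fall through to the next byte.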
A single call to read() will be faster than multiple calls to read(). The tradeoff is that with a single call you must be able to fit all data in memory at once, whereas with multiple reads you only have to retain a fraction of the total amount of data. For files that are just a few kilobytes or megabytes, the difference won't be noticeable. For files that are several gigs in size, memory becomes more important.
Also, to do a single read means all of the data must be present, whereas multiple reads can be used to process data as it is streaming in from an external source.
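For the streaming case, a small sketch that processes data in fixed-size pieces as it arrives, without ever holding the whole file (process_block and the 4096-byte size are placeholders):
from functools import partial

def process_block(block):
    pass  # Placeholder for whatever is done with each block.

with open("somefile", "rb") as f:
    # iter() with a sentinel keeps calling f.read(4096) until it returns b"".
    for block in iter(partial(f.read, 4096), b""):
        process_block(block)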
If you are looking for performance, I would recommend going through generators. Since your file is small, memory would not be a big concern, but it's still a good practice. Still, reading a file from disk multiple times is a definite bottleneck for a scalable solution.
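A minimal sketch of the generator idea (read_in_chunks is a hypothetical helper name, and the chunk size is arbitrary):
def read_in_chunks(file_handle, chunk_size=1024):
    """Lazily yield fixed-size chunks from an open file."""
    while True:
        data = file_handle.read(chunk_size)
        if not data:
            return
        yield data

with open("somefile", "rb") as f:
    for chunk in read_in_chunks(f):
        pass  # Process each chunk as it is produced.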

Python shared file access between threads

I have a thread writing to a file (writeThread) periodically and another (readThread) that reads from the file asynchronously. Can readThread access the file using a different handle and not mess anything up?
If not, does Python have a shared lock that can be used by writeThread but does not block readThread? I wouldn't prefer a simple non-shared lock, because file access takes on the order of a millisecond and the writeThread write period is of the same order (the period depends on some external parameters). Thus a situation may arise where, even though writeThread releases the lock, it re-acquires it immediately and causes starvation.
A solution I can think of is to maintain multiple copies of the file, one for reading and another for writing, and avoid the whole situation altogether. However, the file sizes involved may become huge, which makes this method unappealing.
Are there any other alternatives, or is this a bad design?
Thanks
Yes, you can open the file multiple times and get independent access to it. Each file object will have its own buffers and position so for instance a seek on one will not mess up the other. It works pretty much like multiple program access and you have to be careful when reading / writing the same area of the file. For instance, a write that appends to the end of the file won't be seen by the reader until the write object flushes. Rewrites of existing data won't be seen by the reader until both the reader and writer flush. Writes won't be atomic, so if you are writing records the reader may see partial records. async Select or poll events on the reader may be funky... not sure about that one.
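A small sketch of the two-handle approach described above, assuming a writer that appends and flushes and a reader that polls the same file through its own handle (the file name and timings are arbitrary):
import threading
import time

open("shared.log", "w").close()            # Make sure the file exists before both threads start.

def writer():
    with open("shared.log", "a") as out:
        for i in range(5):
            out.write("record %d\n" % i)
            out.flush()                    # Flush so other handles can see the appended line.
            time.sleep(0.1)

def reader():
    with open("shared.log", "r") as src:
        deadline = time.time() + 1.0
        while time.time() < deadline:
            line = src.readline()          # Returns '' when there is no new data yet.
            if line:
                print("read:", line.rstrip())
            else:
                time.sleep(0.05)

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()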
An alternative is mmap but I haven't used it enough to know the gotchas.

for line in open(filename)

I frequently see Python code similar to
for line in open(filename):
    do_something(line)
When does filename get closed with this code?
Would it be better to write
with open(filename) as f:
    for line in f.readlines():
        do_something(line)
The file would be closed when it falls out of scope. That would normally be the end of the method.
Yes, it's better to use with.
Once you have a file object, you perform all file I/O by calling methods of this object. [...] When you are done with the file, you should finish by calling the close method on the object, to close the connection to the file:
input.close()
In short scripts, people often omit this step, as Python automatically closes the file when a file object is reclaimed during garbage collection (which in mainstream Python means the file is closed just about at once, although other important Python implementations, such as Jython and IronPython, have other, more relaxed garbage collection strategies). Nevertheless, it is good programming practice to close your files as soon as possible, and it is especially a good idea in larger programs, which otherwise may be at more risk of having excessive numbers of uselessly open files lying about. Note that try/finally is particularly well suited to ensuring that a file gets closed, even when a function terminates due to an uncaught exception.
Python Cookbook, Page 59.
Drop .readlines(). It is redundant and undesirable for large files (due to memory consumption). The variant with the 'with' block always closes the file.
with open(filename) as file_:
    for line in file_:
        do_something(line)
When the file will be closed in the bare 'for'-loop variant depends on the Python implementation.
The with version is better because it closes the file afterwards.
You don't even have to use readlines(). for line in file is enough.
I don't think the first one closes it.
Python is garbage-collected: CPython has reference counting plus a backup cycle-detecting garbage collector.
File objects close their file handle when they are deleted/finalized.
Thus the file will eventually be closed, and in CPython it will be closed as soon as the for loop finishes.
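A tiny sketch illustrating the difference (the file name is arbitrary):
with open("example.txt", "w") as f:
    f.write("hello\n")
print(f.closed)   # True: the with block closed the file on exit.

g = open("example.txt")
for line in g:
    pass
print(g.closed)   # False: the bare loop leaves the file open until g is reclaimed.
g.close()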
