Handling big text files in Python

The basics are that I need to process 4 GB text files on a per-line basis.
Using .readline() or for line in f is great for memory, but the I/O takes ages. I would like to use something like yield, but that (I think) will chop lines.
POSSIBLE ANSWER:
file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
Didn't realize you could do this!
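A minimal sketch of reading in blocks of whole lines this way (the ~100 KB sizehint and process_line are just placeholders):

def process_line(line):
    pass  # placeholder for the real per-line work

with open("bigfile.txt") as f:
    while True:
        lines = f.readlines(100000)  # whole lines totalling roughly 100 KB
        if not lines:
            break
        for line in lines:
            process_line(line)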

You can just iterate over the file object:
with open("filename") as f:
    for line in f:
        whatever
This will do some internal buffering to improve the performance. (Note that file.readline() will perform considerably worse because it does not buffer -- that's why you can't mix iteration over a file object with file.readline().)

If you want to do something on a per-line basis you can just loop over the file object:
f = open("w00t.txt")
for line in f:
    # do stuff
However, doing stuff on a per-line basis can be an actual performance bottleneck, so perhaps you should use a larger chunk size? What you can do is, for example, read 4096 bytes, find the last line ending \n, process that part, and prepend the part that is left over to the next chunk.
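A minimal sketch of that chunk-and-split idea, reading in binary mode (the 4096-byte chunk size and process_line are placeholders):

def process_line(line):
    pass  # placeholder for the real per-line work

leftover = b""
with open("w00t.txt", "rb") as f:
    while True:
        chunk = f.read(4096)
        if not chunk:
            if leftover:
                process_line(leftover)  # final line without a trailing \n
            break
        buf = leftover + chunk
        lines = buf.split(b"\n")
        leftover = lines[-1]            # incomplete tail, prepended to the next chunk
        for line in lines[:-1]:
            process_line(line)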

You could always chunk the lines up? I mean, why open one file and iterate all the way through when you can open the same file 6 times and iterate through it in parallel?
e.g.
a #is the first 1024 bytes
b #is the next 1024
#etcetc
f #is the last 1024 bytes
With each file handle running in a separate process, we start to cook on gas. Just remember to deal with line endings properly.
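A rough sketch of that idea, splitting the file into byte ranges and letting each worker realign itself to a line boundary (the worker count and process_line are placeholders):

import os
from multiprocessing import Process

def process_line(line):
    pass  # placeholder for the real per-line work

def worker(path, start, end):
    # Handle every line whose first byte falls in [start, end).
    with open(path, "rb") as f:
        f.seek(max(start - 1, 0))
        if start > 0 and f.read(1) != b"\n":
            f.readline()  # we landed mid-line; the previous worker owns this line
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            process_line(line)

if __name__ == "__main__":
    path = "w00t.txt"
    nworkers = 6
    size = os.path.getsize(path)
    bounds = [size * i // nworkers for i in range(nworkers + 1)]
    procs = [Process(target=worker, args=(path, bounds[i], bounds[i + 1]))
             for i in range(nworkers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()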

Related

Read in large text file (~20m rows), apply function to rows, write to new text file

I have a very large text file, and a function that does what I want it to do to each line. However, when reading line by line and applying the function, it takes roughly three hours. I'm wondering if there isn't a way to speed this up with chunking or multiprocessing.
My code looks like this:
with open('f.txt', 'r') as f:
    function(f, w)
Where the function takes in the large text file and an empty text file, applies the function, and writes to the empty file.
I have tried:
def multiprocess(f, w):
    cores = multiprocessing.cpu_count()
    with Pool(cores) as p:
        pieces = p.map(function, f, w)
    f.close()
    w.close()

multiprocess(f, w)
But when I do this, I get TypeError: '<=' not supported between instances of 'io.TextIOWrapper' and 'int'. This could also be the wrong approach, or I may be doing this wrong entirely. Any advice would be much appreciated.
Even if you could successfully pass open file objects to child OS processes in your Pool as arguments f and w (which I don't think you can on any OS), trying to read from and write to files concurrently is a bad idea, to say the least.
In general, I recommend using the Process class rather than Pool, assuming that the end result needs to maintain the same order as the input 20m-line file.
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process
The slowest solution, but most efficient RAM usage
Your initial solution to execute and process the file line by line
For maximum speed, but most RAM consumption (a rough sketch follows these steps)
Read the entire file into RAM as a list via f.readlines(), if your entire dataset can fit in memory comfortably
Figure out the number of cores (say 8 cores for example)
Split the list evenly into 8 lists
Pass each list to the function to be executed by a Process instance (at this point your RAM usage will be further doubled, which is the trade-off for max speed), but you should del the original big list right after to free some RAM
Each Process handles its entire chunk in order, line by line, and writes it into its own output file (out_file1.txt, out_file2.txt, etc.)
Have your OS concatenate your output files in order into one big output file. You can use subprocess.run('cat out_file* > big_output.txt', shell=True) if you are running a UNIX system, or the equivalent Windows command on Windows.
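A rough sketch of that maximum-speed variant under those assumptions (process_line, the input file name, and the per-line transform are placeholders):

import multiprocessing
import subprocess

def process_line(line):
    return line  # placeholder for the real per-line transform

def work(chunk, out_name):
    with open(out_name, "w") as out:
        for line in chunk:
            out.write(process_line(line))

if __name__ == "__main__":
    cores = multiprocessing.cpu_count()
    with open("f.txt") as f:
        lines = f.readlines()                 # the whole file in RAM
    step = len(lines) // cores + 1
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    del lines                                 # free the original big list
    procs = []
    for i, chunk in enumerate(chunks, start=1):
        out_name = "out_file{:02d}.txt".format(i)  # zero-padded so the glob sorts in order
        p = multiprocessing.Process(target=work, args=(chunk, out_name))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    # Concatenate in order (UNIX); the shell is needed for the glob and redirection.
    subprocess.run("cat out_file* > big_output.txt", shell=True)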
For an intermediate trade-off between speed and RAM, but the most complex, we will have to use the Queue class (a skeleton of the Queue mechanics follows these steps)
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Queue
Figure out the number of cores in a variable cores (say 8)
Initialize 8 queues and 8 processes, and pass one Queue to each process. At this point each Process should open its own output file (outfile1.txt, outfile2.txt, etc.)
Each process shall poll (and block) for a chunk of 10_000 rows, process them, and write them to its respective output file sequentially
In a loop in the parent process, read 10_000 * 8 lines from your input 20m-row file
Split that into several lists (10K chunks) to push to the respective processes' Queues
When you're done with the 20m rows, exit the loop and pass a special value into each process's Queue that signals the end of the input data
When each process detects that special end-of-data value in its own Queue, it shall close its output file and exit
Have your OS concatenate your output files in order into one big output file. You can use subprocess.run('cat outfile* > big_output.txt', shell=True) if you are running a UNIX system, or the equivalent Windows command on Windows.
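A skeleton of just the Queue mechanics for that variant (process_line, the file names, and the 10_000-row chunk size are placeholders; the final concatenation step is the same as above):

import multiprocessing

SENTINEL = None                                   # special end-of-data value

def process_line(line):
    return line                                   # placeholder for the real per-line transform

def work(q, out_name):
    with open(out_name, "w") as out:
        while True:
            chunk = q.get()                       # blocks until a chunk (or the sentinel) arrives
            if chunk is SENTINEL:
                break
            for line in chunk:
                out.write(process_line(line))

if __name__ == "__main__":
    cores = multiprocessing.cpu_count()
    queues = [multiprocessing.Queue(maxsize=2) for _ in range(cores)]
    procs = [multiprocessing.Process(target=work, args=(q, "outfile%d.txt" % i))
             for i, q in enumerate(queues, start=1)]
    for p in procs:
        p.start()
    with open("f.txt") as f:
        done = False
        while not done:
            for q in queues:
                chunk = [line for line in (f.readline() for _ in range(10_000)) if line]
                if chunk:
                    q.put(chunk)
                if len(chunk) < 10_000:           # hit EOF
                    done = True
                    break
    for q in queues:
        q.put(SENTINEL)                           # tell every worker there is no more input
    for p in procs:
        p.join()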
Convoluted? Well, it is usually a trade-off between speed, RAM, and complexity. Also, for a 20m-row task, one needs to make sure that data processing is as optimal as possible: inline as many functions as you can, avoid a lot of math, use pandas / numpy in child processes if possible, etc.
Using in to iterate is not the only way to go; you can read more than one line at a time, and reading more lines per call makes the program read faster.
Look at this snippet:
# Python code to
# demonstrate readlines()

L = ["Geeks\n", "for\n", "Geeks\n"]

# writing to file
file1 = open('myfile.txt', 'w')
file1.writelines(L)
file1.close()

# Using readlines()
file1 = open('myfile.txt', 'r')
Lines = file1.readlines()

count = 0
# Strips the newline character
for line in Lines:
    count += 1
    print("Line{}: {}".format(count, line.strip()))
I got it from: https://www.geeksforgeeks.org/read-a-file-line-by-line-in-python/.

Modify only a specific part of a file

I have a python script that is supposed to read a file. The issue is that the file is very large, so for efficiency I decided that my script should only read from line 650000 onward, since the previous lines do not contain relevant information.
Is there any way to only modify lines 650000 till EOF, so that, for example, if I read() this file only those specific lines would appear?
Files are not line-oriented, they are blocks of bytes.
There's no way, short of reading the data in, to figure out how many bytes make up those first 650,000 lines, so you'd have to do that just in order to skip them.
Starting to modify a file at a certain offset is possible, but that offset will be in bytes, which is the addressing unit used by files.
Skipping lines can be done easily enough:
with open("myfile.txt", "w+t") as f:
for i in xrange(650000):
f.readline() # Read a line and throw it away
f.write("hello")
This will truncate the file so that there will be no data after the hello (but 650,000 lines before it, of course).

python jump to a line in a txt file (a gzipped one)

I'm reading through a large file, and processing it.
I want to be able to jump to the middle of the file without it taking a long time.
right now I am doing:
f = gzip.open(input_name)
for i in range(1000000):
    f.read()  # just skipping the first 1M rows
for line in f:
    do_something(line)
is there a faster way to skip the lines in the zipped file?
If I have to unzip it first, I'll do that, but there has to be a way.
It's of course a text file, with \n separating lines.
The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.
To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.
with gzip.open(filename) as f:
    # jumps to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        pass  # have a party
Alternatively, if this is a CSV that you're going to be working with, you could also try clocking pandas parsing, as it can handle decompressing gzip. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
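If the goal is to skip the first rows, as in the question, pandas' read_csv also accepts a skiprows argument; a sketch, assuming the same filename:

import pandas as pd

# Decompress on the fly and skip the first 1,000,000 rows while parsing.
parsed_csv = pd.read_csv(filename, compression='gzip', skiprows=1000000)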
Also, to be extra clear, when you iterate over file objects in Python -- i.e. like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.
You can use itertools.islice, passing a file object f and a starting point; it will still advance the iterator, but more efficiently than calling next 1000000 times:
from itertools import islice

for line in islice(f, 1000000, None):
    print(line)
I'm not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999999 calls do nothing. If you want to manually advance the iterator, you would call next on the file object, i.e. next(f).
Calling next(f) won't mean all the lines are read into memory at once either; it advances the iterator one line at a time, so if you want to skip a line or two in a file, or a header, it can be useful.
The consume recipe that @wwii suggested is also worth checking out.
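For reference, the consume recipe from the itertools documentation looks roughly like this:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead. If n is None, consume entirely."
    if n is None:
        collections.deque(iterator, maxlen=0)  # feed the whole iterator into a zero-length deque
    else:
        next(islice(iterator, n, n), None)     # advance to the islice() start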
Not really.
If you know the number of bytes you want to skip, you can use .seek(amount) on the file object, but in order to skip a number of lines, Python has to go through the file byte by byte to count the newline characters.
The only alternative that comes to my mind is if you are handling a static file that won't change. In that case, you can index it once, i.e. find out and remember the position of each line. If you have that in e.g. a dictionary that you save and load with pickle, you can skip to any line in quasi-constant time with seek.
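A minimal sketch of that indexing idea for a plain (uncompressed) file, with hypothetical file names:

import pickle

# Build the index once: the byte offset of the start of every line.
offsets = []
with open("data.txt", "rb") as f:
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)
with open("data.index", "wb") as idx:
    pickle.dump(offsets, idx)

# Later: jump straight to line 1,000,000 (0-based) in quasi-constant time.
with open("data.index", "rb") as idx:
    offsets = pickle.load(idx)
with open("data.txt", "rb") as f:
    f.seek(offsets[1000000])
    line = f.readline()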
It is not possible to randomly seek within a gzip file. Gzip is a stream algorithm and so it must always be uncompressed from the start until where your data of interest lies.
It is not possible to jump to a specific line without an index. Lines can be scanned forward or scanned backwards from the end of the file in continuing chunks.
You should consider a different storage format for your needs. What are your needs?

Process data, much larger than physical memory, in chunks

I need to process some data that is a few hundred times bigger than RAM. I would like to read in a large chunk, process it, save the result, free the memory and repeat. Is there a way to make this efficient in python?
The general key is that you want to process the file iteratively.
If you're just dealing with a text file, this is trivial: for line in f: only reads in one line at a time. (Actually it buffers things up, but the buffers are small enough that you don't have to worry about it.)
If you're dealing with some other specific file type, like a numpy binary file, a CSV file, an XML document, etc., there are generally similar special-purpose solutions, but nobody can describe them to you unless you tell us what kind of data you have.
But what if you have a general binary file?
First, the read method takes an optional max bytes to read. So, instead of this:
data = f.read()
process(data)
You can do this:
while True:
    data = f.read(8192)
    if not data:
        break
    process(data)
You may want to instead write a function like this:
def chunks(f):
    while True:
        data = f.read(8192)
        if not data:
            break
        yield data
Then you can just do this:
for chunk in chunks(f):
    process(chunk)
You could also do this with the two-argument iter, but many people find that a bit obscure:
from functools import partial

for chunk in iter(partial(f.read, 8192), b''):
    process(chunk)
Either way, this option applies to all of the other variants below (except for a single mmap, which is trivial enough that there's no point).
There's nothing magic about the number 8192 there. You generally do want a power of 2, and ideally a multiple of your system's page size. Beyond that, your performance won't vary that much whether you're using 4KB or 4MB—and if it does, you'll have to test what works best for your use case.
Anyway, this assumes you can just process each 8K at a time without keeping around any context. If you're, e.g., feeding data into a progressive decoder or hasher or something, that's perfect.
But if you need to process one "chunk" at a time, your chunks could end up straddling an 8K boundary. How do you deal with that?
It depends on how your chunks are delimited in the file, but the basic idea is pretty simple. For example, let's say you use NUL bytes as a separator (not very likely, but easy to show as a toy example).
data = b''
while True:
    buf = f.read(8192)
    if not buf:
        process(data)
        break
    data += buf
    chunks = data.split(b'\0')
    for chunk in chunks[:-1]:
        process(chunk)
    data = chunks[-1]
This kind of code is very common in networking (because sockets can't just "read all", so you always have to read into a buffer and chunk into messages), so you may find some useful examples in networking code that uses a protocol similar to your file format.
Alternatively, you can use mmap.
If your virtual memory size is larger than the file, this is trivial:
with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    process(m)
Now m acts like a giant bytes object, just as if you'd called read() to read the whole thing into memory—but the OS will automatically page bits in and out of memory as necessary.
If you're trying to read a file too big to fit into your virtual memory size (e.g., a 4GB file with 32-bit Python, or a 20EB file with 64-bit Python—which is only likely to happen in 2013 if you're reading a sparse or virtual file like, say, the VM file for another process on linux), you have to implement windowing—mmap in a piece of the file at a time. For example:
windowsize = 8*1024*1024
size = os.fstat(f.fileno()).st_size
for start in range(0, size, windowsize):
    with mmap.mmap(f.fileno(), access=mmap.ACCESS_READ,
                   length=min(windowsize, size - start),  # don't map past EOF
                   offset=start) as m:
        process(m)
Of course mapping windows has the same issue as reading chunks if you need to delimit things, and you can solve it the same way.
But, as an optimization, instead of buffering, you can just slide the window forward to the page containing the end of the last complete message, instead of 8MB at a time, and then you can avoid any copying. This is a bit more complicated, so if you want to do it, search for something like "sliding mmap window", and write a new question if you get stuck.

Is there any way to find the buffer size of a file object

I'm trying to "map" a very large ascii file. Basically I read lines until I find a certain tag and then I want to know the position of that tag so that I can seek to it again later to pull out the associated data.
from itertools import dropwhile

with open(datafile) as fin:
    ifin = dropwhile(lambda x: not x.startswith('Foo'), fin)
    header = next(ifin)
    position = fin.tell()
Now this tell doesn't give me the right position. This question has been asked in various forms before. The reason is presumably that Python is buffering the file object. So, Python is telling me where its file pointer is, not where my file pointer is. I don't want to turn off this buffering ... the performance here is important. However, it would be nice to know if there is a way to determine how many bytes Python chooses to buffer. In my actual application, as long as I'm close to the lines which start with Foo, it doesn't matter. I can drop a few lines here and there. So, what I'm actually planning on doing is something like:
position = fin.tell() - buffer_size(fin)
Is there any way to go about finding the buffer size?
To me, it looks like the buffer size is hard-coded in CPython to be 8192. As far as I can tell, there is no way to get this number from the Python interface other than to read a single line when you open the file, do an f.tell() to figure out how much data Python actually read, and then seek back to the start of the file before continuing.
with open(datafile) as fin:
    next(fin)
    bufsize = fin.tell()
    fin.seek(0)
    ifin = dropwhile(lambda x: not x.startswith('Foo'), fin)
    header = next(ifin)
    position = fin.tell()
Of course, this fails in the event that the first line is longer than 8192 bytes, but that's not of any real consequence for my application.
