Using mmap to apply regex to whole file - python

I'm trying to apply a regular expression to a whole file (not just each line independently) using the following code:
import mmap, re
ifile = open(ifilename)
data = mmap.mmap(ifile.fileno(), 0)
print data
mo = re.search('error: (.*)', data)
if mo:
print "found error"
This is based on the answer to the question How do I re.search or re.match on a whole file without reading it all into memory?
But I'm getting the following error:
Traceback (most recent call last):
File "./myscript.py", line 29, in ?
mo = re.search('error: (.*)', data)
File "/usr/lib/python2.3/sre.py", line 137, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
How can I fix this problem?
In the question Match multiline regex in file object, I found that another possibility for reading the whole file is the following, instead of the mmap object:
data = open("data.txt").read()
Any reason to prefer the mmap rather than the simple buffer/string?

You really have two questions buried in here.
Your Technical Issue
The problem you're facing will most likely be resolved if you upgrade to a newer version of Python, or you should at least get a better traceback. The mmap docs specify that you need to open a file for update to mmap it, and you're not currently doing that.
ifile = open(ifilename) # default is to open as read
Should be this:
ifile = open(ifilename, 'r+')
Or, if you can update to Python 2.6 as you mentioned in your comments,
with open(ifilename, 'r+') as fi:
    # do stuff with open file
If you don't open a file with write permissions on 2.7 and try to mmap it, a "Permission denied" exception is raised. I suspect that error was not implemented in 2.3, so now you're being allowed to continue with an invalid mmap object that fails when you try to search it with the regex.
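If all you need is to search the file, a third option (a minimal sketch, assuming a Python recent enough to support the access keyword) is to map it read-only with mmap.ACCESS_READ, so no write permission is needed at all:
import mmap, re

ifile = open(ifilename, 'rb')
data = mmap.mmap(ifile.fileno(), 0, access=mmap.ACCESS_READ)  # read-only mapping of the whole file
mo = re.search(b'error: (.*)', data)  # bytes pattern, so this also works on Python 3
if mo:
    print("found error")
data.close()
ifile.close()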
mmap vs. open().read()
In the end, you will be able to do (almost) the same thing with both methods. re.search(pattern, mmap_or_long_string) will search either your memory mapped file or the long string that results from the read() call.
The main difference between the two methods is in Virtual vs Real Memory consumption.
In a memory-mapped file, the file remains on disk (or wherever it is) and you directly access it through virtual memory addresses. When you read a file in using read(), you are bringing the whole file into (real) memory all at once.
Why One or the Other:
File Size
The most significant limit on the size of the file you can map is the size of your process's virtual address space, which is dictated by your CPU and OS (32 or 64 bit). The mapping must also occupy a contiguous range of addresses, so you may get allocation errors if the OS can't find a large enough free block in your address space. When using read(), on the other hand, your limit is the memory available to hold the file's contents. If you are accessing files larger than available memory and reading individual lines isn't an option, consider mmap.
File Sharing Among Processes
If you are parallelizing read-only operations on a large file, you can map it into memory to share it among processes instead of each process reading in a copy of the whole file.
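As a rough illustration only (none of this is from the original answer; the file name big.log and the four-way split are made up), each worker can open and map the same file itself, and the OS shares the underlying pages between them:
import mmap, os, re
from multiprocessing import Pool

PATTERN = re.compile(b'error: (.*)')

def count_errors(args):
    path, start, end = args
    f = open(path, 'rb')
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # pages are shared via the OS cache
    try:
        return sum(1 for _ in PATTERN.finditer(m, start, end))
    finally:
        m.close()
        f.close()

if __name__ == '__main__':
    path = 'big.log'                            # hypothetical input file
    size = os.path.getsize(path)
    step = size // 4 + 1                        # split into 4 roughly equal chunks
    chunks = [(path, i, min(i + step, size)) for i in range(0, size, step)]
    pool = Pool(4)
    print(sum(pool.map(count_errors, chunks)))  # matches straddling a chunk boundary are missed in this sketch
    pool.close()
    pool.join()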
Readability/Familiarity
Many more people are familiar with the simple open() and read() functions than memory mapping. Unless you have a compelling reason to use mmap, sticking with the basic IO functions is probably better in the long run for maintainability.
Speed
This one is a wash. A lot of forums and posts like to talk about mmap speed (because it bypasses some system calls once the file is mapped), but the underlying mechanism is still accessing a disk, while reading a whole file in brings everything into memory and only performs disk accesses at the beginning and end of working with the file. There is endless complexity if you try to account for caching (both hard disk and CPU), memory paging, and file access patterns. It is much easier to stick with the tried and true method of profiling. You will see different results based on your individual use case and access patterns for your files, so profile both and see which one is faster for you.
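For example, a throwaway benchmark along these lines (a sketch; big.log and the pattern stand in for your own file and regex) is usually all you need:
import mmap, re, timeit

def search_read(path, pattern=b'error: (.*)'):
    f = open(path, 'rb')
    data = f.read()                    # whole file pulled into real memory
    f.close()
    return re.search(pattern, data)

def search_mmap(path, pattern=b'error: (.*)'):
    f = open(path, 'rb')
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # file stays on disk, paged in on demand
    try:
        return re.search(pattern, m)
    finally:
        m.close()
        f.close()

print(timeit.timeit(lambda: search_read('big.log'), number=3))
print(timeit.timeit(lambda: search_mmap('big.log'), number=3))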
Other Resources
A good summary of the differences
PyMOTW
A good SO question
Wikipedia Virtual Memory article

Related

Among the many Python file copy functions, which ones are safe if the copy is interrupted?

As seen in How do I copy a file in Python?, there are many file copy functions:
shutil.copy
shutil.copy2
shutil.copyfile (and also shutil.copyfileobj)
or even a naive method:
with open('sourcefile', 'rb') as f, open('destfile', 'wb') as g:
    while True:
        block = f.read(16*1024*1024) # work by blocks of 16 MB
        if not block: # EOF
            break
        g.write(block)
Among all these methods, which ones are safe in the case of a copy interruption (example: kill the Python process)? The last one in the list looks ok.
By safe I mean: if a 1 GB file copy is not 100% finished (let's say it's interrupted in the middle of the copy, after 400MB), the file size should not be reported as 1GB in the filesystem, it should:
either report the size the file had when the last bytes were written (e.g. 400MB)
or be deleted
The worst case would be if the final file size were written first (internally with fallocate or ftruncate?). This would be a problem if the copy is interrupted: by looking at the file size, we would think the file had been written correctly.
Many incremental backup programs (I'm coding one) use "filename+mtime+fsize" to check whether a file has to be copied or whether it's already there (of course a better solution is to SHA256 the source and destination files, but this is not done for every sync because it is too time-consuming; off-topic here).
So I want to make sure that the "copy file" function does not store the final file size immediately (then it could fool the fsize comparison), before copying the actual file content.
Note: I'm asking the question because, while shutil.copyfile was rather straightforward on Python 3.7 and below, see source (which is more or less the naive method above), it seems much more complicated on Python 3.9, see source, with many different cases for Windows, Linux, macOS, "fastcopy" tricks, etc.
Assuming that destfile does not exist prior to the copy, the naive method is safe, per your definition of safe.
shutil.copyfileobj() and shutil.copyfile() are a close second in the ranking.
shutil.copy() is next, and shutil.copy2() would be last.
Explanation:
It is a filesystem's job to guarantee consistency based on application requests. If you are only writing X bytes to a file, the file size will only account for these X bytes.
Therefore, doing direct FS operations like the naive method will work.
It is now a matter of what these higher-level functions do with the filesystem.
The API doesn't state what happens if python crashes mid-copy, but it is a de-facto expectation from everyone that these functions behave like Unix cp, i.e. don't mess with the file size.
Assuming that the maintainers of CPython don't want to break people's expectations, all these functions should be safe per your definition.
That said, it isn't guaranteed anywhere, AFAICT.
However, shutil.copyfileobj() and shutil.copyfile() expressly promise in their API not to copy metadata, so they're not likely to try and set the size.
shutil.copy() wouldn't try to set the file size, only the mode, and in most filesystems setting the size and the mode require two different filesystem operations, so it should still be safe.
shutil.copy2() says it will copy metadata, and if you look at its source code, you'll see that it only copies the metadata after copying the data, so even that should be safe. Moreover, copying the metadata doesn't copy the size.
So this would only be a problem if some of the internal functions python uses try to optimize using ftruncate(), fallocate(), or some such, which is unlikely given that people who write system APIs (like the python maintainers) are very aware of people's expectations.
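If you want a guarantee rather than a reasonable expectation, a common pattern (not one of the shutil functions discussed above, just a sketch) is to copy into a temporary name in the destination directory and rename it into place only after the data has been flushed to disk; on POSIX filesystems the rename is atomic, so the final name never points at a partially written file:
import os, shutil

def safe_copy(src, dst, blocksize=16*1024*1024):
    tmp = dst + '.part'                # partial data only ever lives under this name
    with open(src, 'rb') as fsrc, open(tmp, 'wb') as fdst:
        shutil.copyfileobj(fsrc, fdst, blocksize)
        fdst.flush()
        os.fsync(fdst.fileno())        # make sure the bytes are on disk before the rename
    os.rename(tmp, dst)                # atomic on POSIX; use os.replace() on Python 3.3+
If the copy is interrupted, the worst you are left with is a stale .part file, which a filename+mtime+fsize comparison will never mistake for the finished copy.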

Downsides of keeping file handles open?

I am parsing some XML and writing data to different files depending on the XML element that is currently being processed. Processing an element is really fast, and writing the data is, too. Therefore, files would need to be opened and closed very often. For example, given a huge file:
for _, node in lxml.etree.iterparse(file):
    with open(f"{node.tag}.txt", 'a') as fout:
        fout.write(node.attrib['someattr'] + '\n')
This would work, but relatively speaking it would take a lot of time opening and closing the files. (Note: this is a toy program. In reality the actual contents that I write to the files as well as the filenames are different. See the last paragraph for data details.)
An alternative could be:
fhs = {}
for _, node in lxml.etree.iterparse(file):
    if node.tag not in fhs:
        fhs[node.tag] = open(f"{node.tag}.txt", 'w')
    fhs[node.tag].write(node.attrib['someattr'] + '\n')
for _, fh in fhs.items(): fh.close()
This will keep the files open until the parsing of XML is completed. There is a bit of lookup overhead, but that should be minimal compared to iteratively opening and closing the file.
My question is, what is the downside of this approach, performance wise? I know that this will make the open files inaccessible by other processes, and that you may run into a limit of open files. However, I am more interested in performance issues. Does keeping all file handles open create some sort of memory issues or processing issues? Perhaps too much file buffering is going on in such scenarios? I am not certain, hence this question.
The input XML files can be up to around 70GB. The number of files generated is limited to around 35, which is far from the limits I read about in the aforementioned post.
The obvious downside, which you have already mentioned, is that a lot of memory will be required to keep all the file handles open, depending of course on how many files there are. This is a calculation you have to do on your own. And don't forget the write locks.
Otherwise there isn't very much wrong with it per se, but it would be good to take some precautions:
fhs = {}
try:
    for _, node in lxml.etree.iterparse(file):
        if node.tag not in fhs:
            fhs[node.tag] = open(f"{node.tag}.txt", 'w')
        fhs[node.tag].write(node.attrib['someattr'] + '\n')
finally:
    for fh in fhs.values():
        fh.close()
Note:
When looping over a dict in python, the items you get are really only the keys. I'd recommend doing for key, item in d.items(): or for item in d.values():
You didn't say just how many files the process would end up holding open. If it's not so many that it creates a problem, then this could be a good approach. I doubt you can really know without trying it out with your data and in your execution environment.
In my experience, open() is relatively slow, so avoiding unnecessary calls is definitely worth thinking about; you also avoid setting up all the associated buffers, populating them, flushing them every time you close the file, and garbage-collecting. Since you ask, file pointers do come with large buffers. On OS X, the default buffer size is 8192 bytes (8 KB), and there is additional overhead for the object, as with all Python objects. So if you have hundreds or thousands of files and little RAM, it can add up. You can specify less buffering or no buffering at all, but that could defeat any efficiency gained from avoiding repeated opens.
Edit: For just 35 distinct files (or any two-digit number), you have nothing to worry about: the space that 35 output buffers need (at 8 KB per buffer for the actual buffering) will not even be the biggest part of your memory footprint. So just go ahead and do it the way you proposed. You'll see a dramatic speed improvement over opening and closing the file for each XML node.
PS. The default buffer size is given by io.DEFAULT_BUFFER_SIZE.
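For example (the file name and the 1 MB figure here are arbitrary):
import io

print(io.DEFAULT_BUFFER_SIZE)                           # typically 8192 bytes
fout = open('some_tag.txt', 'w', buffering=1024*1024)   # ask for a 1 MB buffer instead of the default
fout.write('hello\n')
fout.close()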
As a good rule, try to close a file as soon as possible.
Note also that your operating system has limits: you can only open a certain number of files. So you might soon hit this limit and start getting "Failed to open file" exceptions.
Leaking memory and file handles is an obvious problem (if you fail to close the files for some reason).
If you are generating thousands of files, you might consider writing them into a directory structure so that they are stored separately in different directories and are easier to access afterwards. For example: a/a/aanode.txt, a/c/acnode.txt, etc.
If the XML contains runs of consecutive nodes destined for the same file, you can keep writing while that condition holds and only close the file the moment a node for another file appears (a sketch of this follows below).
What you gain from it largely depends on the structure of your XML file.
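A sketch of that idea, reusing the names from the question and keeping exactly one output file open at a time, could look like this:
import lxml.etree

current_tag, fout = None, None
for _, node in lxml.etree.iterparse(file):
    if node.tag != current_tag:
        if fout is not None:
            fout.close()
        current_tag = node.tag
        fout = open(f"{current_tag}.txt", 'a')   # append, because the same tag may recur later in the run
    fout.write(node.attrib['someattr'] + '\n')
if fout is not None:
    fout.close()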

Low level file processing in ruby/python

So I hope this question hasn't already been answered, but I can't seem to figure out the right search term.
First some background:
I have text data files that are tabular and can easily climb into the 10s of GBs. The computer processing them is already heavily loaded from the hours-long data collection (at up to 30-50 MB/s), as it is doing device processing and control. Therefore, disk space and access are at a premium. We haven't moved from spinning disks to SSDs due to space constraints.
However, we are looking to do something with the just collected data that doesn't need every data point. We were hoping to decimate the data and collect every 1000th point. However, loading these files (Gigabytes each) puts a huge load on the disk which is unacceptable as it could interrupt the live collection system.
I was wondering if it was possible to use a low level method to access every nth byte (or some other method) in the file (like a database does) because the file is very well defined (Two 64 bit doubles in each row). I understand too low level access might not work because the hard drive might be fragmented, but what would the best approach/method be? I'd prefer a solution in python or ruby because that's what the processing will be done in, but in theory R, C, or Fortran could also work.
Finally, upgrading the computer or hardware isn't an option; setting up the system took hundreds of man-hours, so only software changes can be performed. It would be a longer-term project, but if a text file isn't the best way to handle these files, I'm open to other solutions too.
EDIT: We generate (depending on usage) anywhere from 50,000 lines (records)/sec to 5 million lines/sec; databases aren't feasible at this rate regardless.
This should be doable using seek and read methods on a file object. Doing this will prevent the entire file from being loaded into memory, as you would only be working with file streams.
Also, since the files are well defined and predictable, you won't have any trouble seeking ahead N bytes to the next record in the file.
Below is an example. Demo the code below at http://dbgr.cc/o
with open("pretend_im_large.bin", "rb") as f:
start_pos = 0
read_bytes = []
# seek to the end of the file
f.seek(0,2)
file_size = f.tell()
# seek back to the beginning of the stream
f.seek(0,0)
while f.tell() < file_size:
read_bytes.append(f.read(1))
f.seek(9,1)
print read_bytes
The code above assumes pretend_im_large.bin is a file with the contents:
A00000000
B00000000
C00000000
D00000000
E00000000
F00000000
The output of the code above is:
['A', 'B', 'C', 'D', 'E', 'F']
I don't think that Python is going to give you a strong guarantee that it won't actually read the entire file when you use f.seek. I think that this is too platform- and implementation-specific to rely on Python. You should use Windows-specific tools that give you a guarantee of random access rather than sequential.
Here's a snippet of Visual Basic that you can modify to suit your needs. You can define your own record type that's two 64-bit integers long. Alternatively, you can use a C# FileStream object and use its seek method to get what you want.
If this is performance-critical software, I think you need to make sure you're getting access to the OS primitives that do what you want. I can't find any references that indicate that Python's seek is going to do what you want. If you go that route, you need to test it to make sure it does what it seems like it should.
Is the file human-readable text or in the native format of the computer (sometimes called binary)? If the files are text, you could reduce the processing load and file size by switching to native format. Converting from the internal representation of floating-point numbers to human-readable numbers is CPU intensive.
If the files are in native format then it should be easy to skip through the file, since each record will be 16 bytes. In Fortran, open the file with an open statement that includes form="unformatted", access="direct", recl=16. Then you can read an arbitrary record X without reading the intervening records via rec=X in the read statement. If the file is text, you can also read it with direct IO, but it might not be the case that each pair of numbers always uses the same number of characters (bytes). You can examine your files and answer that question. If the records are always the same length, then you can use the same technique, just with form="formatted". If the records vary in length, then you could read a large chunk and locate your numbers within the chunk.
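Since the processing will be done in Python or Ruby anyway, here is roughly what the same direct-access idea looks like in Python, assuming native (binary) records of two 64-bit doubles and a made-up file name data.bin; seek() skips over the records you don't want, so only every 1000th record is actually read:
import struct

RECORD = struct.Struct('<dd')      # two little-endian 64-bit doubles = 16 bytes per record
STEP = 1000                        # keep every 1000th record

decimated = []
with open('data.bin', 'rb') as f:  # hypothetical native-format data file
    while True:
        chunk = f.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break                  # end of file (or a trailing partial record)
        decimated.append(RECORD.unpack(chunk))
        f.seek((STEP - 1) * RECORD.size, 1)   # skip the next 999 records without reading them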

Efficient reading of 800 GB XML file in Python 2.7

I am reading an 800 GB xml file in python 2.7 and parsing it with an etree iterative parser.
Currently, I am just using open('foo.txt') with no buffering argument. I am a little confused whether this is the approach I should take or I should use a buffering argument or use something from io like io.BufferedReader or io.open or io.TextIOBase.
A point in the right direction would be much appreciated.
The standard open() function already, by default, returns a buffered file (if available on your platform). For file objects that is usually fully buffered.
Usually here means that Python leaves this to the C stdlib implementation; it uses a fopen() call (wfopen() on Windows to support UTF-16 filenames), which means that the default buffering for a file is chosen; on Linux I believe that would be 8kb. For a pure-read operation like XML parsing this type of buffering is exactly what you want.
The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16kb).
If you want to control the buffersize, use the buffering keyword argument:
open('foo.xml', buffering=(2<<16) + 8) # buffer enough for 8 full parser reads
which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article increasing the read buffer should help, and using a size at least 4 times the expected read block size plus 8 bytes is going to improve read performance. In the above example I've set it to 8 times the ElementTree read size.
The io.open() function represents the new Python 3 I/O structure of objects, where I/O has been split up into a new hierarchy of class types to give you more flexibility. The price is more indirection, more layers for the data to have to travel through, and the Python C code does more work itself instead of leaving that to the OS.
You could try and see if io.open('foo.xml', 'rb', buffering=2<<16) is going to perform any better. Opening in rb mode will give you a io.BufferedReader instance.
You do not want to use io.TextIOWrapper; the underlying expat parser wants raw data as it'll decode your XML file encoding itself. It would only add extra overhead; you get this type if you open in r (textmode) instead.
Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.
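Putting that together, a quick way to try it (keeping the foo.xml name from the answer, and assuming xml.etree's cElementTree as the iterative parser) might look like:
import io
from xml.etree import cElementTree as etree

with io.open('foo.xml', 'rb', buffering=2 << 16) as f:
    for event, elem in etree.iterparse(f):
        # ... process elem here ...
        elem.clear()   # discard the element so the in-memory tree doesn't keep growing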
Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading a 800GB file.
Have you tried a lazy function? See Lazy Method for Reading Big File in Python?
This seems to already answer your question. However, I would consider using this method to write your data to a database; MySQL is free: http://dev.mysql.com/downloads/ , and NoSQL is also free and might be a little more tailored to operations involving writing 800 GB of data, or similar amounts: http://www.oracle.com/technetwork/database/nosqldb/downloads/default-495311.html
I haven't tried it with such epic xml files, but last time I had to deal with large (and relatively simple) xml files, I used a sax parser.
It basically gives you callbacks for each "event" and leaves it to you to store the data you need. You can give it an open file, so you don't have to read it all in at once.

How to use mmap in python when the whole file is too big

I have a Python script which reads a file line by line and checks whether each line matches a regular expression.
I would like to improve the performance of that script by memory-mapping the file before I search. I have looked at the mmap example: http://docs.python.org/2/library/mmap.html
My question is: how can I mmap a file when it is too big (15 GB) for the memory of my machine (4 GB)?
I read the file like this:
fi = open(log_file, 'r', buffering=10*1024*1024)
for line in fi:
    pass  # do something with the line
fi.close()
Since I set the buffer to 10MB, in terms of performance, is it the same as I mmap 10MB of file?
Thank you.
First, the memory of your machine is irrelevant. It's the size of your process's address space that's relevant. With a 32-bit Python, this will be somewhere under 4GB. With a 64-bit Python, it will be more than enough.
The reason for this is that mmap isn't about mapping a file into physical memory, but into virtual memory. An mmapped file becomes just like a special swap file for your program. Thinking about this can get a bit complicated, but the Wikipedia article on virtual memory should help.
So, the first answer is "use a 64-bit Python". But obviously that may not be applicable in your case.
The obvious alternative is to map in the first 1GB, search that, unmap it, map in the next 1GB, etc. The way you do this is by specifying the length and offset parameters to the mmap method. For example:
m = mmap.mmap(f.fileno(), length=1024*1024*1024, offset=1536*1024*1024)
However, the regex you're searching for could be found half-way in the first 1GB, and half in the second. So, you need to use windowing—map in the first 1GB, search, unmap, then map in a partially-overlapping 1GB, etc.
The question is, how much overlap do you need? If you know the maximum possible size of a match, you don't need anything more than that. And if you don't know… well, then there is no way to actually solve the problem without breaking up your regex—if that isn't obvious, imagine how you could possibly find a 2GB match in a single 1GB window.
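A sketch of that windowing idea (huge.log, the 1 GB window, and the 1 MB overlap are all made-up values; the overlap must be at least as long as the longest possible match, and window and overlap should be multiples of mmap.ALLOCATIONGRANULARITY):
import mmap, os, re

def search_windowed(path, pattern, window=1024*1024*1024, overlap=1024*1024):
    # Assumes no single match is longer than `overlap`, and that `window` and
    # `overlap` are multiples of mmap.ALLOCATIONGRANULARITY.
    pat = re.compile(pattern)
    size = os.path.getsize(path)
    step = window - overlap
    f = open(path, 'rb')
    try:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            m = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=offset)
            try:
                mo = pat.search(m)
                # only trust matches that start in the non-overlapping part of the
                # window; anything later will be seen again, whole, in the next window
                if mo and (mo.start() < step or offset + length >= size):
                    return offset + mo.start(), mo.group()
            finally:
                m.close()
            offset += step
        return None
    finally:
        f.close()

print(search_windowed('huge.log', b'error: (.*)'))   # 'huge.log' is a made-up file name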
Answering your followup question:
Since I set the buffer to 10MB, in terms of performance, is it the same as I mmap 10MB of file?
As with any performance question, if it really matters, you need to test it, and if it doesn't, don't worry about it.
If you want me to guess: I think mmap may be faster here, but only because (as J.F. Sebastian implied) looping and calling re.match 128K times as often may cause your code to be CPU-bound instead of IO-bound. But you could optimize that away without mmap, just by using read. So, would mmap be faster than read? Given the sizes involved, I'd expect the performance of mmap to be much faster on old Unix platforms, about the same on modern Unix platforms, and a bit slower on Windows. (You can still get large performance benefits out of mmap over read or read+lseek if you're using madvise, but that's not relevant here.) But really, that's just a guess.
The most compelling reason to use mmap is usually that it's simpler than read-based code, not that it's faster. When you have to use windowing even with mmap, and when you don't need to do any seeking with read, this is less compelling, but still, if you try writing the code both ways, I'd expect your mmap code would end up a bit more readable. (Especially if you tried to optimize out the buffer copies from the obvious read solution.)
I came to try using mmap because I was using fileh.readline() on a file dozens of GB in size and wanted to make it faster. The Unix strace utility seems to reveal that the file is now read in 4 KB chunks, and at least the output from strace seems to be printed slowly; I know parsing the file takes many hours.
$ strace -v -f -p 32495
Process 32495 attached
read(5, "blah blah blah foo bar xxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
^CProcess 32495 detached
$
This thread is so far the only one explaining to me that I should not try to mmap a file that is too large. I do not understand why there isn't already a helper function like mmap_for_dummies(filename) which would internally do os.path.getsize(filename) and then either do a normal open(filename, 'r', buffering=10*1024*1024) or do mmap.mmap(open(filename).fileno()). I certainly want to avoid fiddling with a sliding-window approach myself, but a function that makes a simple decision on whether or not to mmap would be enough for me.
Finally, it is still not clear to me why some examples on the internet mention open(filename, 'rb') without explanation (e.g. https://docs.python.org/2/library/mmap.html). Given that one often wants to use the file in a for loop with a .readline() call, I do not know whether I should open it in 'rb' or just 'r' mode (I guess 'rb' is necessary to preserve the '\n').
Thanks for mentioning the buffering=10*1024*1024 argument; it is probably a more helpful way to gain some speed than changing my code.
