Efficient reading of 800 GB XML file in Python 2.7

I am reading an 800 GB XML file in Python 2.7 and parsing it with an etree iterative parser.
Currently, I am just using open('foo.txt') with no buffering argument. I am a little confused whether this is the approach I should take, or whether I should use a buffering argument or use something from io like io.BufferedReader, io.open or io.TextIOBase.
A pointer in the right direction would be much appreciated.

The standard open() function already, by default, returns a buffered file (if available on your platform). For file objects that is usually fully buffered.
Usually here means that Python leaves this to the C stdlib implementation: it uses an fopen() call (_wfopen() on Windows to support UTF-16 filenames), which means the default buffering for a file is chosen; on Linux I believe that would be 8 KB. For a pure-read operation like XML parsing, this type of buffering is exactly what you want.
The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16 KB).
If you want to control the buffersize, use the buffering keyword argument:
open('foo.xml', buffering=(2<<16) + 8) # buffer enough for 8 full parser reads
which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article increasing the read buffer should help, and using a size at least 4 times the expected read block size plus 8 bytes is going to improve read performance. In the above example I've set it to 8 times the ElementTree read size.
The io.open() function represents the new Python 3 I/O structure, where I/O has been split up into a hierarchy of class types to give you more flexibility. The price is more indirection, more layers for the data to travel through, and more work done by the Python C code itself instead of being left to the OS.
You could try and see whether io.open('foo.xml', 'rb', buffering=2<<16) performs any better. Opening in rb mode will give you an io.BufferedReader instance.
You do not want to use io.TextIOWrapper; the underlying expat parser wants raw data, as it will decode your XML file's encoding itself. The wrapper would only add extra overhead; you get this type if you open in r (text mode) instead.
Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.
Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading an 800 GB file.
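To make the pieces above concrete, here is a minimal sketch (not from the original post): a buffered binary file handed to iterparse, with elements cleared as they complete so memory use stays flat. The tag name 'record' is an assumption for illustration.
from xml.etree import cElementTree as etree

# ~128 KB buffer: 8 full 16 KB parser reads, plus the 8 extra bytes suggested above
source = open('foo.xml', 'rb', (2 << 16) + 8)
for event, elem in etree.iterparse(source, events=('end',)):
    if elem.tag == 'record':       # hypothetical element of interest
        pass                       # process the element here
        elem.clear()               # free the element so the tree does not keep growing
source.close()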

Have you tried a lazy function? See: Lazy Method for Reading Big File in Python?
That seems to already answer your question. However, I would consider using this method to write your data to a database; MySQL is free: http://dev.mysql.com/downloads/ , and NoSQL databases are also free and might be a little better tailored to operations involving writing 800 GB of data, or similar amounts: http://www.oracle.com/technetwork/database/nosqldb/downloads/default-495311.html

I haven't tried it with such epic XML files, but the last time I had to deal with large (and relatively simple) XML files, I used a SAX parser.
It basically gives you callbacks for each "event" and leaves it to you to store the data you need. You can give it an open file, so you don't have to read it all in at once.
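For illustration, a bare-bones sketch using Python's built-in xml.sax (the element name 'record' is an assumption, not from the original post):
import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.count = 0
    def startElement(self, name, attrs):
        if name == 'record':    # keep only what you need from each event
            self.count += 1

handler = RecordHandler()
with open('foo.xml', 'rb') as f:    # pass an open file; nothing is slurped into memory
    xml.sax.parse(f, handler)
print handler.count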

Related

Text output to a temporary file, compatible with Python 2.7?

I'd like to have a way to write Unicode text output to a temporary file created with tempfile API, which would support Python 3 style options for encoding and newline conversion, but would work also on Python 2.7 (for unicode values).
To open files with regular predictable names, a portable way is provided by io.open. But with temporary files, the secure way is to get an OS handle to the file, to ensure that the file name cannot be hijacked by a concurrent malicious process. There are no io workalikes to tempfile.NamedTemporaryFile or os.fdopen, and on Python 2.7 there are issues with the file objects obtained that way:
the built-in file objects cannot be wrapped in io.TextIOWrapper, which supports both encoding and newline conversion;
the codecs API can produce an encoding writer, but that does not perform newline conversion. The underlying file must be opened in binary mode, otherwise the same code breaks in Python 3 (and it's generally not sane to expect correct newline conversion on arbitrary character-encoded data).
I've come up with two ways to deal with the portability problem, each of which has certain disadvantages:
Close the file object (or the OS descriptor) without removing the file, and reopen the file by name with io.open. When using NamedTemporaryFile, this means the delete constructor parameter has to be set to False, and the user has the responsibility to delete the file when it's no longer needed. There is also an added security hazard in the rather unusual case where the directory in which the temporary file is created is writable by potential attackers and the sticky bit is not set in its permission mode bits.
Write the entire output to an io.StringIO buffer created with newline parameter as appropriate, then write the buffered string into the encoding writer obtained from codecs. This is bad for performance and memory usage on large files.
Are there other alternatives?
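For what it's worth, a minimal sketch of the first workaround (close, then reopen by name with io.open) on Python 2.7; the caller has to clean up because delete=False:
import io, os, tempfile

tmp = tempfile.NamedTemporaryFile(suffix='.txt', delete=False)
tmp.close()                                    # keep the name, give up the handle
try:
    with io.open(tmp.name, 'w', encoding='utf-8', newline='\n') as out:
        out.write(u'some unicode text\n')
finally:
    os.remove(tmp.name)                        # explicit cleanup since delete=False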

Using mmap to apply regex to whole file

I'm trying to apply a regular expression to a whole file (not just each line independently) using the following code:
import mmap, re
ifile = open(ifilename)
data = mmap.mmap(ifile.fileno(), 0)
print data
mo = re.search('error: (.*)', data)
if mo:
print "found error"
This is based on the answer to the question How do I re.search or re.match on a whole file without reading it all into memory?
But I'm getting the following error:
Traceback (most recent call last):
File "./myscript.py", line 29, in ?
mo = re.search('error: (.*)', data)
File "/usr/lib/python2.3/sre.py", line 137, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
How can I fix this problem?
In the question Match multiline regex in file object, I found that another possibility for reading the whole file is the following, instead of the mmap object:
data = open("data.txt").read()
Any reason to prefer the mmap rather than the simple buffer/string?
You really have two questions buried in here.
Your Technical Issue
The problem you're facing will most likely be resolved if you upgrade to a newer version of Python, or you should at least get a better traceback. The mmap docs specify that you need to open a file for update to mmap it, and you're not currently doing that.
ifile = open(ifilename) # default is to open as read
Should be this:
ifile = open(ifilename, 'r+')
Or, if you can update to Python 2.6 as you mentioned in your comments,
with open(ifilename, 'r+') as fi:
    pass  # do stuff with the open file
If you don't open a file with write permissions on 2.7 and try to mmap it, a "Permission denied" exception is raised. I suspect that error was not implemented in 2.3, so now you're being allowed to continue with an invalid mmap object that fails when you try to search it with the regex.
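Putting that fix together, a corrected version of the original snippet might look like this (a sketch; assumes Python 2.6+ and that ifilename is defined as in the question):
import mmap, re

with open(ifilename, 'r+') as ifile:          # open for update so mmap's default ACCESS_WRITE works
    data = mmap.mmap(ifile.fileno(), 0)       # map the whole file
    mo = re.search('error: (.*)', data)       # the regex runs over the mapped buffer
    if mo:
        print "found error:", mo.group(1)
    data.close()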
mmap vs. open().read()
In the end, you will be able to do (almost) the same thing with both methods. re.search(pattern, mmap_or_long_string) will search either your memory mapped file or the long string that results from the read() call.
The main difference between the two methods is in Virtual vs Real Memory consumption.
In a memory-mapped file, the file remains on disk (or wherever it is) and you directly access it through virtual memory addresses. When you read a file in using read(), you are bringing the whole file into (real) memory all at once.
Why One or the Other:
File Size
The most significant limit on the size of the file you can map is the size of your virtual memory address space, which is dictated by your CPU (32 or 64 bit). The memory allocated must be contiguous though, so you may have allocation errors if the OS can't find a large enough block to allocate the memory. When using read(), on the other hand, your limit is physical memory available instead. If you are accessing files larger than available memory and reading individual lines isn't an option, consider mmap.
File Sharing Among Processes
If you are parallelizing read-only operations on a large file, you can map it into memory to share it among processes instead of each process reading in a copy of the whole file.
Readability/Familiarity
Many more people are familiar with the simple open() and read() functions than memory mapping. Unless you have a compelling reason to use mmap, sticking with the basic IO functions is probably better in the long run for maintainability.
Speed
This one is a wash. A lot of forums and posts like to talk about mmap speed (because it bypasses some system calls once the file is mapped), but the underlying mechanism is still accessing a disk, while reading a whole file in brings everything into memory and only performs disk accesses at the beginning and end of working with the file. There is endless complexity if you try to account for caching (both hard disk and CPU), memory paging, and file access patterns. It is much easier to stick with the tried and true method of profiling. You will see different results based on your individual use case and access patterns for your files, so profile both and see which one is faster for you.
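If you do want to profile it, a minimal harness might look like this (a sketch; the file name and pattern are placeholders, not from the original question):
import mmap, re, timeit

def search_read(path, pattern):
    with open(path, 'rb') as f:
        return re.search(pattern, f.read())          # whole file pulled into memory

def search_mmap(path, pattern):
    with open(path, 'rb') as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return re.search(pattern, m)             # regex runs over the mapping
        finally:
            m.close()

print timeit.timeit(lambda: search_read('big.log', 'error: (.*)'), number=3)
print timeit.timeit(lambda: search_mmap('big.log', 'error: (.*)'), number=3)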
Other Resources
A good summary of the differences
PyMOTW
A good SO question
Wikipedia Virtual Memory article

Low level file processing in ruby/python

So I hope this question already hasn't been answered, but I can't seem to figure out the right search term.
First some background:
I have text data files that are tabular and can easily climb into the tens of GBs. The computer processing them is already heavily loaded by the hours-long data collection (at up to 30-50 MB/s), as it is doing device processing and control. Therefore, disk space and access are at a premium. We haven't moved from spinning disks to SSDs due to space constraints.
However, we are looking to do something with the just collected data that doesn't need every data point. We were hoping to decimate the data and collect every 1000th point. However, loading these files (Gigabytes each) puts a huge load on the disk which is unacceptable as it could interrupt the live collection system.
I was wondering if it is possible to use a low-level method to access every nth byte (or some other method) in the file (like a database does), because the file is very well defined (two 64-bit doubles in each row). I understand that very low-level access might not work because the hard drive might be fragmented, but what would the best approach/method be? I'd prefer a solution in Python or Ruby because that's what the processing will be done in, but in theory R, C, or Fortran could also work.
Finally, upgrading the computer or hardware isn't an option, setting up the system took hundreds of man-hours so only software changes can be performed. However, it would be a longer term project but if a text file isn't the best way to handle these files, I'm open to other solutions too.
EDIT: We generate (depending on usage) anywhere from 50,000 lines (records)/sec to 5 million lines/sec; databases aren't feasible at this rate regardless.
This should be doable using seek and read methods on a file object. Doing this will prevent the entire file from being loaded into memory, as you would only be working with file streams.
Also, since the files are well defined and predictable, you won't have any trouble seeking ahead N bytes to the next record in the file.
Below is an example. Demo the code below at http://dbgr.cc/o
with open("pretend_im_large.bin", "rb") as f:
start_pos = 0
read_bytes = []
# seek to the end of the file
f.seek(0,2)
file_size = f.tell()
# seek back to the beginning of the stream
f.seek(0,0)
while f.tell() < file_size:
read_bytes.append(f.read(1))
f.seek(9,1)
print read_bytes
The code above assumes pretend_im_large.bin is a file with the contents:
A00000000
B00000000
C00000000
D00000000
E00000000
F00000000
The output of the code above is:
['A', 'B', 'C', 'D', 'E', 'F']
I don't think that Python is going to give you a strong guarantee that it won't actually read the entire file when you use f.seek. I think that this is too platform- and implementation-specific to rely on Python. You should use Windows-specific tools that give you a guarantee of random access rather than sequential access.
Here's a snippet of Visual Basic that you can modify to suit your needs. You can define your own record type that's two 64-bit integers long. Alternatively, you can use a C# FileStream object and use its seek method to get what you want.
If this is performance-critical software, I think you need to make sure you're getting access to the OS primitives that do what you want. I can't find any references that indicate that Python's seek is going to do what you want. If you go that route, you need to test it to make sure it does what it seems like it should.
Is the file human-readable text or in the native format of the computer (sometimes called binary)? If the files are text, you could reduce the processing load and file size by switching to native format. Converting from the internal representation of floating-point numbers to human-readable numbers is CPU intensive.
If the files are in native format, then it should be easy to skip around in the file since each record will be 16 bytes. In Fortran, open the file with an open statement that includes form="unformatted", access="direct", recl=16. Then you can read an arbitrary record X without reading the intervening records via rec=X in the read statement. If the file is text, you can also read it with direct IO, but it might not be the case that each pair of numbers always uses the same number of characters (bytes). You can examine your files and answer that question. If the records are always the same length, then you can use the same technique, just with form="formatted". If the records vary in length, then you could read a large chunk and locate your numbers within the chunk.
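If you do end up with (or already have) the 16-byte binary records, the same direct-access idea in Python is a short seek/read loop. A sketch; the file name, endianness, and a stride of 1000 are assumptions:
import struct

RECORD_SIZE = 16              # two 8-byte doubles per record
STRIDE = 1000                 # keep every 1000th record

decimated = []
with open('data.bin', 'rb') as f:
    while True:
        record = f.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            break
        decimated.append(struct.unpack('<dd', record))    # little-endian doubles (assumption)
        f.seek(RECORD_SIZE * (STRIDE - 1), 1)              # skip the next 999 records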

Is InMemoryUploadedFile really "in memory"?

I understand that opening a file just creates a file handle that takes a fixed amount of memory irrespective of the size of the file.
Django has a type called InMemoryUploadedFile that represents files uploaded via forms.
I get the handle to my file object inside the django view like this:
file_object = request.FILES["uploadedfile"]
This file_object has type InMemoryUploadedFile.
Now we can see for ourselves that file_object has the method .read(), which is used to read files into memory.
bytes = file_object.read()
Wasn't file_object of type InMemoryUploadedFile already "in memory"?
The read() method on a file object is a way to access content from within a file object, irrespective of whether that file is in memory or stored on disk. It is similar to other utility file-access methods like readlines() or seek().
The behavior matches Python's built-in file.read(), which in turn is built over the C library's fread() function:
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
On the question of where exactly the InMemoryUploadedFile is stored, it is a bit more complicated. From the Django documentation on file uploads:
Before you save uploaded files, the data needs to be stored somewhere. By default, if an uploaded file is smaller than 2.5 megabytes, Django will hold the entire contents of the upload in memory. This means that saving the file involves only a read from memory and a write to disk and thus is very fast.
However, if an uploaded file is too large, Django will write the uploaded file to a temporary file stored in your system's temporary directory. On a Unix-like platform this means you can expect Django to generate a file called something like /tmp/tmpzfp6I6.upload. If an upload is large enough, you can watch this file grow in size as Django streams the data onto disk.
These specifics – 2.5 megabytes; /tmp; etc. – are simply "reasonable defaults". Read on for details on how you can customize or completely replace upload behavior.
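In practice, whichever backend Django picked, you rarely need read() on the whole thing at once; iterating chunks() keeps memory use bounded. A sketch (the view and destination path are assumptions):
def handle_upload(request):
    file_object = request.FILES["uploadedfile"]
    with open('/tmp/destination.bin', 'wb') as destination:
        for chunk in file_object.chunks():        # Django yields the upload in chunks (64 KB by default)
            destination.write(chunk)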
One thing to consider is that in Python, file-like objects have an API that is adhered to pretty strictly. This allows code to be very flexible: they are abstractions over I/O streams, so your code doesn't have to worry about where the data is coming from, i.e. memory, the filesystem, the network, etc.
File-like objects usually define a couple of methods, one of which is read().
I am not sure of the actual implementation of InMemoryUploadedFile, or how they are generated or where they are stored (I am assuming they are held entirely in memory), but you can rest assured that they are file-like objects and provide a read() method, because they adhere to the file API.
For the implementation you could start checking out the source:
https://github.com/django/django/blob/master/django/core/files/uploadedfile.py#L90
https://github.com/django/django/blob/master/django/core/files/base.py
https://github.com/django/django/blob/master/django/core/files/uploadhandler.py
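If you just want to check at runtime which concrete class Django handed you, something like this works (a sketch; the view and field name are assumptions):
from django.core.files.uploadedfile import InMemoryUploadedFile, TemporaryUploadedFile

def inspect_upload(request):
    upload = request.FILES["uploadedfile"]
    if isinstance(upload, InMemoryUploadedFile):
        print "held in memory (%d bytes)" % upload.size
    elif isinstance(upload, TemporaryUploadedFile):
        print "streamed to disk at", upload.temporary_file_path()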

How to use mmap in python when the whole file is too big

I have a Python script which reads a file line by line and checks whether each line matches a regular expression.
I would like to improve the performance of that script by memory-mapping the file before I search. I have looked at the mmap example: http://docs.python.org/2/library/mmap.html
My question is: how can I mmap a file when it is too big (15 GB) for the memory of my machine (4 GB)?
I read the file like this:
fi = open(log_file, 'r', buffering=10*1024*1024)
for line in fi:
    pass  # do something with the line
fi.close()
Since I set the buffer to 10 MB, in terms of performance, is it the same as if I mmap 10 MB of the file?
Thank you.
First, the memory of your machine is irrelevant. It's the size of your process's address space that's relevant. With a 32-bit Python, this will be somewhere under 4GB. With a 64-bit Python, it will be more than enough.
The reason for this is that mmap isn't about mapping a file into physical memory, but into virtual memory. An mmapped file becomes just like a special swap file for your program. Thinking about this can get a bit complicated, but the Wikipedia article on virtual memory should help.
So, the first answer is "use a 64-bit Python". But obviously that may not be applicable in your case.
The obvious alternative is to map in the first 1GB, search that, unmap it, map in the next 1GB, etc. The way you do this is by specifying the length and offset parameters to the mmap method. For example:
m = mmap.mmap(f.fileno(), length=1024*1024*1024, offset=1536*1024*1024)
However, the regex you're searching for could be found half-way in the first 1GB, and half in the second. So, you need to use windowing—map in the first 1GB, search, unmap, then map in a partially-overlapping 1GB, etc.
The question is, how much overlap do you need? If you know the maximum possible size of a match, you don't need anything more than that. And if you don't know… well, then there is no way to actually solve the problem without breaking up your regex—if that isn't obvious, imagine how you could possibly find a 2GB match in a single 1GB window.
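A rough sketch of that windowing loop (not the answerer's code; assumes the longest possible match fits in OVERLAP bytes, and note that mmap offsets must be multiples of mmap.ALLOCATIONGRANULARITY, which the sizes below satisfy on common platforms):
import mmap, re

WINDOW = 1024 * 1024 * 1024          # map 1 GB at a time
OVERLAP = 64 * 1024                  # upper bound on match length (assumption)
pattern = re.compile('error: (.*)')  # placeholder pattern

with open('huge.log', 'rb') as f:
    f.seek(0, 2)
    size = f.tell()
    offset = 0
    while offset < size:
        length = min(WINDOW, size - offset)
        m = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=offset)
        match = pattern.search(m)
        if match:
            print "found:", match.group(1)
            m.close()
            break
        m.close()
        offset += WINDOW - OVERLAP   # step back by OVERLAP so boundary matches aren't missed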
Answering your followup question:
Since I set the buffer to 10 MB, in terms of performance, is it the same as if I mmap 10 MB of the file?
As with any performance question, if it really matters, you need to test it, and if it doesn't, don't worry about it.
If you want me to guess: I think mmap may be faster here, but only because (as J.F. Sebastian implied) looping and calling re.match 128K times as often may cause your code to be CPU-bound instead of IO-bound. But you could optimize that away without mmap, just by using read. So, would mmap be faster than read? Given the sizes involved, I'd expect the performance of mmap to be much faster on old Unix platforms, about the same on modern Unix platforms, and a bit slower on Windows. (You can still get large performance benefits out of mmap over read or read+lseek if you're using madvise, but that's not relevant here.) But really, that's just a guess.
The most compelling reason to use mmap is usually that it's simpler than read-based code, not that it's faster. When you have to use windowing even with mmap, and when you don't need to do any seeking with read, this is less compelling, but still, if you try writing the code both ways, I'd expect your mmap code would end up a bit more readable. (Especially if you tried to optimize out the buffer copies from the obvious read solution.)
I came to try mmap because I was using fileh.readline() on a file dozens of GB in size and wanted to make it faster. The Unix strace utility seems to reveal that the file is currently read in 4 KB chunks, and at least the output from strace seems to be printed slowly, and I know parsing the file takes many hours.
$ strace -v -f -p 32495
Process 32495 attached
read(5, "blah blah blah foo bar xxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
^CProcess 32495 detached
$
This thread is so far the only one explaining to me that I should not try to mmap a file that is too large. I do not understand why there isn't already a helper function like mmap_for_dummies(filename) which would internally do os.path.getsize(filename) and then either do a normal open(filename, 'r', buffering=10*1024*1024) or a mmap.mmap(open(filename).fileno()). I certainly want to avoid fiddling with a sliding-window approach myself, but a function that makes a simple decision whether or not to mmap would be enough for me.
Finally, it is still not clear to me why some examples on the internet mention open(filename, 'rb') without explanation (e.g. https://docs.python.org/2/library/mmap.html). Given that one often wants to use the file in a for loop with a .readline() call, I do not know if I should open in 'rb' or just 'r' mode (I guess it is necessary to preserve the '\n').
Thanks for mentioning the buffering=10*1024*1024 argument; it is probably more helpful than changing my code to gain some speed.
