Streaming a 1GB File in Python

How long should it take to stream a 1GB file in Python on, say, a 2GHz Intel Core 2 Duo machine?
fp = open('publisher_feed_8663.xml')
for line in fp:
    a = line.split('<')
I suppose I wasn't specific enough. This process takes 20+ minutes, which seems abnormally long. Based on empirical data, what would be a reasonable time?

Your answer:
import time

start = time.time()
fp = open('publisher_feed_8663.xml')
for line in fp:
    a = line.split('<')
print time.time() - start
You will require a 1GB file named publisher_feed_8663.xml, Python, and a 2GHz Intel Core 2 Duo machine.
For parsing of XML, you probably want to use an event based stream parser, such as SAX or lxml. I recommend reading the lxml documentation about iterparse: http://lxml.de/parsing.html#iterparse-and-iterwalk
As for how long this should take, you can do trivial hard drive benchmarks on Linux using tools like hdparm -tT /dev/sda.
More RAM always helps with processing large files, as the OS can keep a bigger disk cache.
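To illustrate the iterparse idea from the answer above, here is a minimal sketch using the standard library's xml.etree.ElementTree (lxml's iterparse API is nearly identical). The tag name 'product' and the feed filename are assumptions for illustration, not taken from the question:

```python
import xml.etree.ElementTree as ET

def stream_products(path):
    """Yield each <product> element as it is parsed, keeping memory flat."""
    # 'end' events fire once an element has been fully parsed
    for event, elem in ET.iterparse(path, events=('end',)):
        if elem.tag == 'product':
            yield elem
            elem.clear()  # drop the element's children so memory doesn't grow

# Hypothetical usage against a feed like the one in the question:
# for product in stream_products('publisher_feed_8663.xml'):
#     handle(product)
```

The point of the pattern is the elem.clear() call: without it, the whole tree accumulates in memory and the streaming parser gains you nothing.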

Other people have talked about the time, I'll talk about the processing (XML aside).
If you're doing something this massive, you should certainly look at generators. This PDF will teach you basically all you will ever need to know about generators. Any time you are consuming or producing large amounts of data (especially serially), generators should be your very best friend.

That will entirely depend on what's in the file. You're reading it a line at a time, which means a lot of overhead from calling the iterator again and again in the common case of lots of short lines. Use fp.read(CHUNK) with some large number for CHUNK to improve performance.
However, I'm not sure what you're doing with split('<'). You can't usefully process XML with tools as basic as that, or with line-at-a-time parsing, since XML is not line-based. If you actually want to do something with the XML infoset in the file as you read it, you should consider a SAX parser. (Then again, 1GB of XML? That's already non-sensible really.)
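A minimal sketch of the chunked-reading suggestion above; the CHUNK size is an illustrative choice, and counting '<' characters stands in for whatever per-chunk work you actually need:

```python
CHUNK = 8 * 1024 * 1024  # 8MB per read; tune for your disk and RAM

def count_tags(path):
    """Count '<' characters by reading the file in large chunks
    instead of line by line."""
    total = 0
    with open(path, 'rb') as fp:
        while True:
            chunk = fp.read(CHUNK)
            if not chunk:
                break
            total += chunk.count(b'<')
    return total
```

Note that a chunk boundary can land in the middle of a line or tag, so any logic that spans boundaries has to carry state between chunks; simple counting like this does not.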

Related

python parse large logs to extract certain patterns

I have a large system log (up to 1GB) and I need to parse it to extract certain things.
Initially I wrote Python code that read the file line by line (using with open), and it took a very long time.
I learned about mmap, and using mmap it now takes around 5 minutes.
(I have precompiled the regular expression to save time.)
Is there any better approach so that it takes less time? (I am using Python 2.7.3 32-bit.)
You might want to use awk for that kind of thing; it's more or less what it was designed for.

Low level file processing in ruby/python

So I hope this question already hasn't been answered, but I can't seem to figure out the right search term.
First some background:
I have text data files that are tabular and can easily climb into the tens of GBs. The computer processing them is already heavily loaded from the hours-long data collection (at up to 30-50MB/s), as it is doing device processing and control. Therefore, disk space and access are at a premium. We haven't moved from spinning disks to SSDs due to space constraints.
However, we are looking to do something with the just collected data that doesn't need every data point. We were hoping to decimate the data and collect every 1000th point. However, loading these files (Gigabytes each) puts a huge load on the disk which is unacceptable as it could interrupt the live collection system.
I was wondering if it was possible to use a low level method to access every nth byte (or some other method) in the file (like a database does) because the file is very well defined (Two 64 bit doubles in each row). I understand too low level access might not work because the hard drive might be fragmented, but what would the best approach/method be? I'd prefer a solution in python or ruby because that's what the processing will be done in, but in theory R, C, or Fortran could also work.
Finally, upgrading the computer or hardware isn't an option; setting up the system took hundreds of man-hours, so only software changes can be performed. It could be a longer-term project, though, so if a text file isn't the best way to handle these files, I'm open to other solutions too.
EDIT: We generate (depending on usage) anywhere from 50,000 lines (records)/sec to 5 million lines/sec; databases aren't feasible at this rate regardless.
This should be doable using seek and read methods on a file object. Doing this will prevent the entire file from being loaded into memory, as you would only be working with file streams.
Also, since the files are well defined and predictable, you won't have any trouble seeking ahead N bytes to the next record in the file.
Below is an example. Demo the code below at http://dbgr.cc/o
with open("pretend_im_large.bin", "rb") as f:
    read_bytes = []

    # seek to the end of the file to learn its size
    f.seek(0, 2)
    file_size = f.tell()

    # seek back to the beginning of the stream
    f.seek(0, 0)

    while f.tell() < file_size:
        read_bytes.append(f.read(1))
        # skip the remaining 9 bytes of the 10-byte record
        f.seek(9, 1)

print read_bytes
The code above assumes pretend_im_large.bin is a file with the contents:
A00000000
B00000000
C00000000
D00000000
E00000000
F00000000
The output of the code above is:
['A', 'B', 'C', 'D', 'E', 'F']
I don't think that Python is going to give you a strong guarantee that it won't actually read the entire file when you use f.seek. I think that this is too platform- and implementation-specific to rely on Python. You should use Windows-specific tools that give you a guarantee of random access rather than sequential.
Here's a snippet of Visual Basic that you can modify to suit your needs. You can define your own record type that's two 64-bit integers long. Alternatively, you can use a C# FileStream object and use its seek method to get what you want.
If this is performance-critical software, I think you need to make sure you're getting access to the OS primitives that do what you want. I can't find any references that indicate that Python's seek is going to do what you want. If you go that route, you need to test it to make sure it does what it seems like it should.
Is the file human-readable text or in the native format of the computer (sometimes called binary)? If the files are text, you could reduce the processing load and file size by switching to native format. Converting from the internal representation of floating point numbers to human-reading numbers is CPU intensive.
If the files are in native format then it should be easy to skip through the file, since each record will be 16 bytes. In Fortran, open the file with an open statement that includes form="unformatted", access="direct", recl=16. Then you can read an arbitrary record X without reading intervening records via rec=X in the read statement. If the file is text, you can also read it with direct IO, but it might not be the case that each pair of numbers always uses the same number of characters (bytes). You can examine your files and answer that question. If the records are always the same length, then you can use the same technique, just with form="formatted". If the records vary in length, then you could read a large chunk and locate your numbers within the chunk.
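For comparison, here is a hedged Python sketch of the same direct-access idea, assuming 16-byte binary records of two little-endian 64-bit doubles (the function names and the decimation step are illustrative):

```python
import struct

RECORD_SIZE = 16  # two 64-bit doubles per row

def read_record(path, n):
    """Read record n (0-based) without touching intervening records."""
    with open(path, 'rb') as f:
        f.seek(n * RECORD_SIZE)           # jump straight to the record
        data = f.read(RECORD_SIZE)
        return struct.unpack('<dd', data)  # assumes little-endian doubles

def decimate(path, step=1000):
    """Yield every step-th record, e.g. every 1000th point."""
    with open(path, 'rb') as f:
        f.seek(0, 2)                       # seek to end to get file size
        count = f.tell() // RECORD_SIZE
    for n in range(0, count, step):
        yield read_record(path, n)
```

As the answer above notes, this only works because every record has a fixed, known byte length; with variable-length text records there is nothing to seek to.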

Concurrent SAX processing of large, simple XML files?

I have a couple of gigantic XML files (10GB-40GB) that have a very simple structure: just a single root node containing multiple row nodes. I'm trying to parse them using SAX in Python, but the extra processing I have to do for each row means that the 40GB file takes an entire day to complete. To speed things up, I'd like to use all my cores simultaneously. Unfortunately, it seems that the SAX parser can't deal with "malformed" chunks of XML, which is what you get when you seek to an arbitrary line in the file and try parsing from there. Since the SAX parser can accept a stream, I think I need to divide my XML file into eight different streams, each containing [number of rows]/8 rows and padded with fake opening and closing tags. How would I go about doing this? Or — is there a better solution that I might be missing? Thank you!
You can't easily split the SAX parsing into multiple threads, and you don't need to: if you just run the parse without any other processing, it should run in 20 minutes or so. Focus on the processing you do to the data in your ContentHandler.
My suggestion is to read the whole XML file into an intermediate format and do the extra processing afterwards. SAX should be fast enough to read 40GB of XML in no more than an hour.
Depending on the data you could use a SQLite database or HDF5 file for intermediate storage.
By the way, Python is not really multi-threaded (see GIL). You need the multiprocessing module to split the work into different processes.
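A minimal sketch of the multiprocessing suggestion, assuming a single parser feeds already-extracted rows to a pool of worker processes. Both process_row and the stand-in computation inside it are placeholders for the expensive per-row work described in the question:

```python
from multiprocessing import Pool

def process_row(row):
    """Placeholder for the expensive per-row processing."""
    return sum(row)  # stand-in computation

def process_all(rows, workers=4):
    # One parse pass produces rows; the pool distributes the heavy
    # per-row work across processes, sidestepping the GIL.
    with Pool(workers) as pool:
        return pool.map(process_row, rows, chunksize=256)
```

The design point is that parsing stays sequential (SAX is already fast) while only the CPU-bound per-row work is parallelized, so no process ever needs to see malformed XML fragments.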

How to use mmap in python when the whole file is too big

I have a Python script which reads a file line by line and checks whether each line matches a regular expression.
I would like to improve the performance of that script by memory-mapping the file before I search. I have looked at the mmap example: http://docs.python.org/2/library/mmap.html
My question is: how can I mmap a file when it is too big (15GB) for the memory of my machine (4GB)?
I read the file like this:
fi = open(log_file, 'r', buffering=10*1024*1024)
for line in fi:
    # do something with each line
    pass
fi.close()
Since I set the buffer to 10MB, in terms of performance, is it the same as mmapping 10MB of the file?
Thank you.
First, the memory of your machine is irrelevant. It's the size of your process's address space that's relevant. With a 32-bit Python, this will be somewhere under 4GB. With a 64-bit Python, it will be more than enough.
The reason for this is that mmap isn't about mapping a file into physical memory, but into virtual memory. An mmapped file becomes just like a special swap file for your program. Thinking about this can get a bit complicated, but the Wikipedia articles on virtual memory and memory-mapped files should help.
So, the first answer is "use a 64-bit Python". But obviously that may not be applicable in your case.
The obvious alternative is to map in the first 1GB, search that, unmap it, map in the next 1GB, etc. The way you do this is by specifying the length and offset parameters to the mmap method. For example:
m = mmap.mmap(f.fileno(), length=1024*1024*1024, offset=1536*1024*1024)
(Note that the offset must be a multiple of mmap.ALLOCATIONGRANULARITY, as these values are.)
However, the regex you're searching for could be found half-way in the first 1GB, and half in the second. So, you need to use windowing—map in the first 1GB, search, unmap, then map in a partially-overlapping 1GB, etc.
The question is, how much overlap do you need? If you know the maximum possible size of a match, you don't need anything more than that. And if you don't know… well, then there is no way to actually solve the problem without breaking up your regex—if that isn't obvious, imagine how you could possibly find a 2GB match in a single 1GB window.
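A hedged sketch of the windowing idea described above, written with plain read (the same logic applies with mmap). The WINDOW and OVERLAP sizes are illustrative, and OVERLAP must be at least the maximum possible match length:

```python
import re

WINDOW = 64 * 1024 * 1024   # bytes examined per step
OVERLAP = 4 * 1024          # must be >= the longest possible match

def find_all(path, pattern, window=WINDOW, overlap=OVERLAP):
    """Return the offsets of every match, including matches that
    straddle a window boundary (up to `overlap` bytes long)."""
    regex = re.compile(pattern)
    results = []
    offset = 0
    with open(path, 'rb') as f:
        while True:
            f.seek(offset)
            buf = f.read(window + overlap)
            if not buf:
                break
            for m in regex.finditer(buf):
                # only count matches that START inside this window;
                # the overlap region belongs to the next window
                if m.start() < window:
                    results.append(offset + m.start())
            if len(buf) <= window:
                break  # reached end of file
            offset += window
    return results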
Answering your followup question:
Since I set the buffer to 10MB, in terms of performance, is it the same as I mmap 10MB of file?
As with any performance question, if it really matters, you need to test it, and if it doesn't, don't worry about it.
If you want me to guess: I think mmap may be faster here, but only because (as J.F. Sebastian implied) looping and calling re.match 128K times as often may cause your code to be CPU-bound instead of IO-bound. But you could optimize that away without mmap, just by using read. So, would mmap be faster than read? Given the sizes involved, I'd expect the performance of mmap to be much faster on old Unix platforms, about the same on modern Unix platforms, and a bit slower on Windows. (You can still get large performance benefits out of mmap over read or read+lseek if you're using madvise, but that's not relevant here.) But really, that's just a guess.
The most compelling reason to use mmap is usually that it's simpler than read-based code, not that it's faster. When you have to use windowing even with mmap, and when you don't need to do any seeking with read, this is less compelling, but still, if you try writing the code both ways, I'd expect your mmap code would end up a bit more readable. (Especially if you tried to optimize out the buffer copies from the obvious read solution.)
I came to try mmap because I was using fileh.readline() on a file dozens of GB in size and wanted to make it faster. The Unix strace utility seems to reveal that the file is being read in 4kB chunks; the output from strace prints slowly, and I know parsing the file takes many hours.
$ strace -v -f -p 32495
Process 32495 attached
read(5, "blah blah blah foo bar xxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
^CProcess 32495 detached
$
This thread is so far the only one explaining to me that I should not try to mmap a file that is too large. I do not understand why there isn't already a helper function like mmap_for_dummies(filename) which would internally do os.path.getsize(filename) and then either do a normal open(filename, 'r', buffering=10*1024*1024) or mmap.mmap(open(filename).fileno()). I certainly want to avoid fiddling with a sliding-window approach myself, but a function that makes the simple decision of whether or not to mmap would be enough for me.
Finally, it is still not clear to me why some examples on the internet mention open(filename, 'rb') without explanation (e.g. https://docs.python.org/2/library/mmap.html). Given that one often wants to use the file in a for loop with a .readline() call, I do not know if I should open it in 'rb' or just 'r' mode (I guess 'rb' is necessary to preserve the '\n').
Thanks for mentioning the buffering=10*1024*1024 argument; it is probably more helpful than changing my code would be for gaining some speed.

Are there ways to modify/update xml files other than totally over writing the old file?

I'm working on a script which involves continuously analyzing data and outputting results in a multi-threaded way. So basically the result file(an xml file) is constantly being updated/modified (sometimes 2-3 times/per second).
I'm currently using lxml to parse/modify/update the xml file, which works fine right now. But from what I can tell, you have to rewrite the whole XML file even when you just add one entry/sub-entry like <weather content="sunny" /> somewhere in the file. The XML file is gradually growing bigger, and so is the overhead.
As far as efficiency/resources are concerned, is there any other way to update/modify the xml file? Or will you have to switch to a SQL database or similar some day, once the xml file is too big to parse/modify/update?
No, you generally cannot - and this applies not just to XML files, but to any file format.
You can only update "in place" if you overwrite bytes exactly (i.e. don't add or remove any characters, just replace some with something of the same byte length).
Using a form of database sounds like a good option.
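The same-byte-length constraint from the answer above can be sketched as follows; the 'r+b' open mode is the key point, and the function name and file contents are illustrative:

```python
def patch_in_place(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes at `offset` without
    rewriting or truncating the rest of the file."""
    with open(path, 'r+b') as f:  # read/write mode, no truncation
        f.seek(offset)
        f.write(new_bytes)
```

This only works when the replacement is exactly the same length as what it overwrites; inserting or deleting even one byte forces rewriting everything after that point, which is why appending entries to an XML document means rewriting the file.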
It certainly sounds like you need some sort of database; as Li-anung Yip states, this would take care of all kinds of nasty multi-threaded sync issues.
You stated that your data is gradually increasing? How is it being consumed? Are clients forced to download the entire result file each time?
I don't know your use case, but perhaps you could consider using an ATOM feed to distribute your data changes? Providing support for AtomPub would also effectively REST-enable your data. It's still XML, but in a standards-compliant format that is easy to consume and poll for changes.
