Regular expression parsing (streaming) a binary file? - python

I'm trying to implement a strings(1)-like function in Python.
import re

def strings(f, n=5):
    # TODO: support files larger than available RAM
    return re.finditer(br'[!-~\s]{%i,}' % n, f.read())

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'rb') as f:
        for m in strings(f):
            print(m[0].decode().replace('\x0A', '\u240A'))
Setting aside the case of actual matches* that are larger than the available RAM, the above code fails in the case of files that are merely, themselves, larger than the available RAM!
An attempt to naively "iterate over f" will be done linewise, even for binary files; this may be inappropriate because (a) it may return different results than just running the regex on the whole input, and (b) if the machine has 4 gigabytes of RAM and the file contains any match for rb'[^\n]{8589934592,}', then that unasked-for match will cause a memory problem anyway!
Does Python's regex library enable any simple way to stream re.finditer over a binary file?
*I am aware that it is possible to write regular expressions that may require an exponential amount of CPU or RAM relative to their input length. Handling these cases is, obviously, out-of-scope; I'm assuming for the purposes of the question that the machine at least has enough resource to handle the regex, its largest match on the input, the acquisition of this match, and the ignoring-of all nonmatches.
Not a duplicate of Regular expression parsing a binary file? -- that question is actually asking about bytes-like objects; I am asking about binary files per se.
Not a duplicate of Parse a binary file with Regular Expressions? -- for the same reason.
Not a duplicate of Regular expression for binary files -- that question only addressed the special case where offsets of all matches were known beforehand.
Not a duplicate of Regular expression for binary files -- combination of both of these reasons.

Does Python's regex library enable any simple way to stream re.finditer over a binary file?
Well, while typing up the question in such excruciating detail and gathering supporting documentation, I found the solution:
mmap — Memory-mapped file support
Memory-mapped file objects behave like both bytearray and like file objects. You can use mmap objects in most places where bytearray are expected; for example, you can use the re module to search through a memory-mapped file. …
Enacted:
import re, mmap

def strings(f, n=5):
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return re.finditer(br'[!-~\s]{%i,}' % n, view)
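Usage is unchanged from the question's __main__ block; for completeness:
if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'rb') as f:
        for m in strings(f):
            print(m[0].decode().replace('\x0A', '\u240A'))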
Caveat: on 32-bit systems, this might not work for files larger than 2GiB, if the underlying standard library is deficient.
However, it looks like it should be fine on both Windows and any well-maintained Linux distribution:
13.8 Memory-mapped I/O
Since mmapped pages can be stored back to their file when physical memory is low, it is possible to mmap files orders of magnitude larger than both the physical memory and swap space. The only limit is address space. The theoretical limit is 4GB on a 32-bit machine - however, the actual limit will be smaller since some areas will be reserved for other purposes. If the LFS interface is used the file size on 32-bit systems is not limited to 2GB … the full 64-bit [8 EiB] are available. …
Creating a File Mapping Using Large Pages
… you must specify the FILE_MAP_LARGE_PAGES flag with the MapViewOfFile function to map large pages. …

Related

read list of strings as file, python3

I have a list of strings, and would like to pass this to an api that accepts only a file-like object, without having to concatenate/flatten the list to use the likes of StringIO.
The strings are utf-8, don't necessarily end in newlines, and if naively concatenated could be used directly in StringIO.
A preferred solution would be within the standard library (python3.8); given that the shape of the data is naturally similar to a file (~identical to readlines(), obviously) and the memory access pattern would be efficient, I have a feeling I'm just failing to DuckDuckGo correctly. But if that doesn't exist, any "streaming" (no data concatenation) solution would suffice.
[Update, based on #JonSG's links]
Both RawIOBase and TextIOBase look like they provide an API that decouples arbitrarily sized "chunks"/fragments (in my case: strings in a list) from a file-like read which can specify its own read chunk size, while streaming the data itself (memory cost increases by only some window at any given time, dependent of course on the behavior of your source & sink).
RawIOBase.readinto looks especially promising because it provides the buffer returned to client reads directly, allowing much simpler code - but this appears to come at the cost of one full copy (into that buffer).
TextIOBase.read() has its own cost for its operation solving the same subproblem, which is concatenating k (k much smaller than N) chunks together.
I'll investigate both of these.
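For reference, here is one rough sketch of the RawIOBase route (the class name, buffering choices, and UTF-8 assumption are mine, not something the linked docs prescribe):
import io

class ListReader(io.RawIOBase):
    """Read-only raw stream over a list of str fragments (encoded lazily to UTF-8)."""
    def __init__(self, strings):
        self._chunks = (s.encode('utf-8') for s in strings)
        self._buf = b''

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the current fragment whenever it is exhausted; skip empty strings.
        while not self._buf:
            nxt = next(self._chunks, None)
            if nxt is None:
                return 0  # EOF
            self._buf = nxt
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n

fragments = ["some strings", " that don't necessarily", " end in newlines\n"]
stream = io.TextIOWrapper(io.BufferedReader(ListReader(fragments)), encoding='utf-8')
print(stream.read())
The BufferedReader layer is what lets the consumer pick its own read chunk size; only one fragment plus one small buffer is resident at any given time.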

Zlib library image compression not working properly

I'm trying to compress an image using the zlib library in Python (VS Code). I generate an output file, but it weighs the same as the original file.
This is the code:
import zlib
with open("garenap.jpg", "rb") as in_file:
compressed = zlib.compress(in_file.read(), -1)
with open("arroz", "wb") as out_file:
out_file.write(compressed)
I don't think the two files would weigh exactly the same. If you try the following:
import zlib

with open("garenap.jpg", "rb") as in_file:
    compressed = zlib.compress(in_file.read(), -1)
    print(in_file.tell())

with open("arroz", "wb") as out_file:
    out_file.write(compressed)
    print(out_file.tell())
you should see two slightly different numbers (which are basically the file size).
For some jpg of mine I got:
3563384
3448655
so the zlib.compress() is actually reducing the file size a tiny bit.
You should observe something similar yourself too.
Anything that is not the same number is fine.
As @jasonharper already pointed out, the JPEG format is already highly compressed, just not DEFLATE-compressed, which is what zlib (including the implementation available in Python) applies.
DEFLATE is a bit different from the lossy compression implemented in JPEG, which is based on an integral transform. The output of this transform is typically non-redundant, and therefore the Lempel-Ziv 77 algorithm implemented in DEFLATE (or any other implementation, for what it's worth) is of limited efficacy.
In conclusion, zlib is doing its job, but it is unlikely to be effective for jpeg data.
Note on larger compressed files
zlib-compressed files can be larger than their inputs.
This is true for any lossless compression algorithm, and can be easily proved: consider multiple consecutive applications of a lossless algorithm; if every application strictly reduced the file size, you would eventually reach a size of 0, i.e. an empty file, which obviously cannot be inverted back to the original. This demonstrates that lossless compression cannot always reduce file size.
Looking into LZ77 details from Wikipedia:
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream.
The following is not exactly how LZ77 works but should give you the idea.
Let's replace repeating characters with the character followed by the number of times it is repeated.
This algorithm works well with xxxxxxxx being reduced to x8 (x 8 times). If the sequence is non-redundant, e.g. abcdefgh, then this algorithm would produce a1b1c1d1e1f1g1h1 which does not reduce the input size, but would actually DOUBLE it.
What you are observing is something similar.
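To make that toy example concrete (this is illustrative only, not how LZ77/DEFLATE really encode):
from itertools import groupby

def rle(data):
    # Replace each run of identical characters with "<char><run length>".
    return ''.join(f'{char}{len(list(run))}' for char, run in groupby(data))

print(rle('xxxxxxxx'))  # 'x8'               -> 8 chars shrink to 2
print(rle('abcdefgh'))  # 'a1b1c1d1e1f1g1h1' -> 8 chars balloon to 16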

how to rapidly load data into memory with python?

I have a large csv file (5 GB) and I can read it with pandas.read_csv(). This operation takes a lot of time, 10-20 minutes.
How can I speed it up?
Would it be useful to transform the data into an SQLite format? If so, what should I do?
EDIT: More information:
The data contains 1852 columns and 350000 rows. Most of the columns are float64 and contain numbers. Some others contain strings or dates (which I suppose are treated as strings).
I am using a laptop with 16 GB of RAM and SSD hard drive. The data should fit fine in memory (but I know that python tends to increase the data size)
EDIT 2 :
During the loading I receive this message
/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1841,1842,1844) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
EDIT: SOLUTION
Read the csv file once and save it as
data.to_hdf('data.h5', 'table')
This format is incredibly efficient
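A minimal sketch of that write-once/read-many workflow (assuming PyTables is installed; 'data.csv' stands in for the real file):
import pandas as pd

# One-time conversion: pay the slow CSV parse once.
data = pd.read_csv('data.csv')
data.to_hdf('data.h5', 'table')

# Every later run: load straight from the binary HDF5 store.
data = pd.read_hdf('data.h5', 'table')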
This actually depends on which part of reading it is taking 10 minutes.
If it's actually reading from disk, then obviously any more compact form of the data will be better.
If it's processing the CSV format (you can tell this because your CPU is at near 100% on one core while reading; it'll be very low for the other two), then you want a form that's already preprocessed.
If it's swapping memory, e.g., because you only have 2GB of physical RAM, then nothing is going to help except splitting the data.
It's important to know which one you have. For example, stream-compressing the data (e.g., with gzip) will make the first problem a lot better, but the second one even worse.
It sounds like you probably have the second problem, which is good to know. (However, there are things you can do that will probably be better no matter what the problem.)
Your idea of storing it in a sqlite database is nice because it can at least potentially solve all three at once; you only read the data in from disk as-needed, and it's stored in a reasonably compact and easy-to-process form. But it's not the best possible solution for the first two, just a "pretty good" one.
In particular, if you actually do need to do array-wide work across all 350000 rows, and can't translate that work into SQL queries, you're not going to get much benefit out of sqlite. Ultimately, you're going to be doing a giant SELECT to pull in all the data and then process it all into one big frame.
Another option is writing out the shape and structure information, then writing the underlying arrays in NumPy binary form; for reading, you reverse that. NumPy's binary form just stores the raw data as compactly as possible, and it's a format that can be written blindingly quickly (it's basically just dumping the raw in-memory storage to disk). That will improve both the first and second problems.
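A rough sketch of that idea (not this answer's exact code; it assumes the numeric block is homogeneous float64 and uses hypothetical file names):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 10))  # stand-in for the real frame

# Write: column names as plain text, the numeric block as a raw .npy dump.
np.save('values.npy', df.to_numpy(dtype='float64'))
with open('columns.txt', 'w') as fh:
    fh.write('\n'.join(map(str, df.columns)))

# Read: np.load is close to a straight copy from disk back into memory.
with open('columns.txt') as fh:
    columns = fh.read().splitlines()
df2 = pd.DataFrame(np.load('values.npy'), columns=columns)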
Similarly, storing the data in HDF5 (either using Pandas IO or an external library like PyTables or h5py) will improve both the first and second problems. HDF5 is designed to be a reasonably compact and simple format for storing the same kind of data you usually store in Pandas. (And it includes optional compression as a built-in feature, so if you know which of the two you have, you can tune it.) It won't solve the second problem quite as well as the last option, but probably well enough, and it's much simpler (once you get past setting up your HDF5 libraries).
Finally, pickling the data may sometimes be faster. pickle is Python's native serialization format, and it's hookable by third-party modules—and NumPy and Pandas have both hooked it to do a reasonably good job of pickling their data.
(Although this doesn't apply to the question, it may help someone searching later: If you're using Python 2.x, make sure to explicitly use pickle format 2; IIRC, NumPy is very bad at the default pickle format 0. In Python 3.0+, this isn't relevant, because the default format is at least 3.)
Python has two built-in libraries called pickle and cPickle that can store any Python data structure.
cPickle is identical to pickle except that cPickle has trouble with Unicode stuff and is 1000x faster.
Both are really convenient for saving stuff that's going to be re-loaded into Python in some form, since you don't have to worry about some kind of error popping up in your file I/O.
Having worked with a number of XML files, I've found some performance gains from loading pickles instead of raw XML. I'm not entirely sure how the performance compares with CSVs, but it's worth a shot, especially if you don't have to worry about Unicode stuff and can use cPickle. It's also simple, so if it's not a good enough boost, you can move on to other methods with minimal time lost.
A simple example of usage:
>>> import pickle
>>> stuff = ["Here's", "a", "list", "of", "tokens"]
>>> fstream = open("test.pkl", "wb")
>>> pickle.dump(stuff,fstream)
>>> fstream.close()
>>>
>>> fstream2 = open("test.pkl", "rb")
>>> old_stuff = pickle.load(fstream2)
>>> fstream2.close()
>>> old_stuff
["Here's", 'a', 'list', 'of', 'tokens']
>>>
Notice the "b" in the file stream openers. This is important--it preserves cross-platform compatibility of the pickles. I've failed to do this before and had it come back to haunt me.
For your stuff, I recommend writing a first script that parses the CSV and saves it as a pickle; when you do your analysis, the script associated with that loads the pickle like in the second block of code up there.
I've tried this with XML; I'm curious how much of a boost you will get with CSVs.
If the problem is in the processing overhead, then you can divide the file into smaller files and handle them on different CPU cores or threads. Also, for some algorithms the Python time will increase non-linearly, and the dividing method will help in these cases.
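A small sketch of that splitting approach, assuming the CSV has already been divided into pieces (the file names are hypothetical):
import pandas as pd
from multiprocessing import Pool

def load_part(path):
    return pd.read_csv(path)

if __name__ == '__main__':
    parts = ['part1.csv', 'part2.csv', 'part3.csv', 'part4.csv']
    with Pool(4) as pool:
        frames = pool.map(load_part, parts)
    data = pd.concat(frames, ignore_index=True)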

python precautions to economize on size of text file of purely numerical characters

I am tabulating a lot of output from some network analysis, listing one edge per line, which results in dozens of gigabytes, stretching the limits of my resources (understatement). As I only deal with numerical values, it occurred to me that I might be smarter than using the Py3k defaults, i.e. some other character encoding might save me quite some space if I only have digits (and space and the occasional decimal dot). As constrained as I am, I might even save on the line endings (avoiding the Windows-standard CRLF pair). What is the best practice on this?
An example line would read like this:
62233 242344 0.42442423
(Where actually the last number is pointlessly precise, I will cut it back to three nonzero digits.)
As I will need to read the text file back in with other software (Stata, actually), I cannot keep the data in arbitrary binary, though I see no reason why Stata would read only UTF-8 text. Or would you simply say that avoiding UTF-8 barely saves me anything?
I think compression would not work for me, as I write the text line by line and it would be great to limit the output size even during this. I might easily be mistaken about how compression works, but I thought it could only save me space after the file is generated, whereas my issue is that my code already crashes while I am tabulating the text file (line by line).
Thanks for all the ideas and clarifying questions!
You can use zlib or gzip to compress the data as you generate it. You won't need to change your format at all, the compression will adjust to the characters and sequences that you use the most to create an optimal file size.
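For example, something along these lines (the edge iterable and file name are placeholders):
import gzip

edges = [(62233, 242344, 0.424)]  # stand-in for your real edge source

# 'wt' opens a text-mode stream on top of the compressor, so you can keep
# writing one line at a time while the output on disk stays compressed.
with gzip.open('edges.txt.gz', 'wt', encoding='ascii') as out:
    for a, b, w in edges:
        out.write(f'{a} {b} {w:.3g}\n')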
Avoid the character encodings entirely and save your data in a binary format. See Python's struct. ASCII-encoded, a value like 4 billion takes 10 bytes, but it fits in a 4-byte integer. There are a lot of downsides to a custom binary format (it's hard to debug manually or inspect with other tools, etc.).
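A minimal sketch of that fixed-width alternative; the '<IIf' layout (two unsigned 32-bit ints plus a 32-bit float) is an assumption about your value ranges:
import struct

record = struct.Struct('<IIf')  # node, node, weight

with open('edges.bin', 'wb') as out:
    out.write(record.pack(62233, 242344, 0.424))  # 12 bytes vs. ~19 ASCII chars per line

with open('edges.bin', 'rb') as inp:
    print(record.unpack(inp.read(record.size)))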
I have done some study on this. Clever encoding does not matter much once you apply compression. Even if you use some binary encoding, the files seem to contain the same entropy and end up at a similar size after compression.
The Power of Gzip
Yes, there are Python libraries that allow you to stream output and automatically compress it.
Lossy encoding does save space. Cutting down the precision helps.
I don't know the capabilities of data input in Stata, and a quick search reveals that said capabilities are described in the User's Guide, which seems to be available only on dead-tree copies. So I don't know if my suggestion is feasible.
An instant saving of half the size would come from using 4 bits per character. You have an alphabet of 0 to 9, period, (possibly) minus sign, space and newline, which is 14 characters, fitting comfortably in the 2**4 == 16 available slots.
If this can be used in Stata, I can help more with suggestions for quick conversions.
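If it does turn out to be readable on the Stata side, here is a rough sketch of the nibble packing (the alphabet ordering and padding code are arbitrary choices of mine):
ALPHABET = '0123456789. -\n'  # 14 symbols -> codes 0..13
ENC = {c: i for i, c in enumerate(ALPHABET)}
PAD = 15  # unused code marks padding in the final byte

def pack(text):
    codes = [ENC[c] for c in text]
    if len(codes) % 2:
        codes.append(PAD)
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

def unpack(data):
    chars = []
    for byte in data:
        for code in (byte >> 4, byte & 0x0F):
            if code != PAD:
                chars.append(ALPHABET[code])
    return ''.join(chars)

line = '62233 242344 0.424\n'
assert unpack(pack(line)) == line
print(len(line), len(pack(line)))  # 19 bytes of text vs 10 packed bytes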

Python: quickly loading 7 GB of text files into unicode strings

I have a large directory of text files--approximately 7 GB. I need to load them quickly into Python unicode strings in iPython. I have 15 GB of memory total. (I'm using EC2, so I can buy more memory if absolutely necessary.)
Simply reading the files will be too slow for my purposes. I have tried copying the files to a ramdisk and then loading them from there into iPython. That speeds things up but iPython crashes (not enough memory left over?) Here is the ramdisk setup:
mount -t tmpfs none /var/ramdisk -o size=7g
Anyone have any ideas? Basically, I'm looking for persistent in-memory Python objects. The iPython requirement precludes using IncPy: http://www.stanford.edu/~pgbovine/incpy.html .
Thanks!
There is much that is confusing here, which makes it more difficult to answer this question:
The ipython requirement. Why do you need to process such large data files from within ipython instead of a stand-alone script?
The tmpfs RAM disk. I read your question as implying that you read all of your input data into memory at once in Python. If that is the case, then python allocates its own buffers to hold all the data anyway, and the tmpfs filesystem only buys you a performance gain if you reload the data from the RAM disk many, many times.
Mentioning IncPy. If your performance issues are something you could solve with memoization, why can't you just manually implement memoization for the functions where it would help most?
So. If you actually need all the data in memory at once -- if your algorithm reprocesses the entire dataset multiple times, for example -- I would suggest looking at the mmap module. That will provide the data in raw bytes instead of unicode objects, which might entail a little more work in your algorithm (operating on the encoded data, for example), but will use a reasonable amount of memory. Reading the data into Python unicode objects all at once will require either 2x or 4x as much RAM as it occupies on disk (assuming the data is UTF-8).
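As a rough sketch of that mmap route (file name hypothetical; searching happens on raw bytes, and you decode only the slices you actually need):
import mmap

with open('big_corpus.txt', 'rb') as f:
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    idx = view.find(b'needle')  # byte offset, or -1 if absent
    if idx != -1:
        snippet = view[idx:idx + 200].decode('utf-8', errors='replace')
        print(snippet)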
If your algorithm simply does a single linear pass over the data (as does the Aho-Corasick algorithm you mention), then you'd be far better off just reading in a reasonably sized chunk at a time:
import codecs

with codecs.open(inpath, encoding='utf-8') as f:
    data = f.read(8192)
    while data:
        process(data)
        data = f.read(8192)
I hope this at least gets you closer.
I saw the mention of IncPy and IPython in your question, so let me plug a project of mine that goes a bit in the direction of IncPy, but works with IPython and is well-suited to large data: http://packages.python.org/joblib/
If you are storing your data in numpy arrays (strings can be stored in numpy arrays), joblib can use memmap for intermediate results and be efficient for IO.
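A small sketch of what that can look like in practice (the function, dtype, and cache directory are hypothetical, not joblib requirements):
import numpy as np
from joblib import Memory

memory = Memory('./joblib_cache', verbose=0)

@memory.cache
def load_corpus(path):
    # The expensive read runs once; later calls with the same path reuse the
    # result that joblib persisted on disk.
    with open(path, 'rb') as f:
        return np.frombuffer(f.read(), dtype=np.uint8)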
