Does f.seek(500000,0) go through all the first 499999 characters of the file before getting to the 500000th?
In other words, is f.seek(n,0) of order O(n) or O(1)?
You need to be a bit more specific on what type of object f is.
If f is a normal io module object for a file stored on disk, you have to determine if you are dealing with:
The raw binary file object
A buffer object, wrapping the raw binary file
A TextIO object, wrapping the buffer
An in-memory BytesIO or TextIO object
The first option just uses the lseek system call to reposition the file descriptor. Whether this call is O(1) depends on the OS and what kind of file system you have. For a Linux system with an ext4 filesystem, lseek is O(1).
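As a minimal sketch (example.bin is a hypothetical file used only for illustration), the raw layer looks like this:

import io, os

f = open('example.bin', 'rb', buffering=0)   # buffering=0 gives the raw io.FileIO object
assert isinstance(f, io.FileIO)
f.seek(500000, os.SEEK_SET)                  # a single lseek() call; O(1) on e.g. Linux/ext4
f.close()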
Buffered objects just discard the buffer if your seek target lies outside the currently buffered region and read in new data. That's O(1) too, but the fixed cost is higher.
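A sketch of the buffered layer, again with a hypothetical example.bin assumed to be larger than the default 8 KiB buffer:

f = open('example.bin', 'rb')     # default buffering gives an io.BufferedReader
f.read(16)                        # one raw read of up to io.DEFAULT_BUFFER_SIZE bytes fills the buffer
f.seek(100)                       # target still inside the buffer: no system call needed
f.seek(500000)                    # outside the buffer: discard it and issue a real lseek()
f.close()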
For text files, things are more complicated: variable-byte-length codecs and line-ending translation mean you can't always map a binary stream position to a text position without scanning from the start. The implementation doesn't allow non-zero current-position- or end-relative seeks, and it does its best to minimise how much data is read for absolute seeks. Internal state shared with the text decoder tracks a recent 'safe point' to seek back to, from which it reads forward to the desired position. Worst case this is O(n).
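In practice that means text-mode code should only seek to opaque cookies obtained from tell(). A minimal sketch, assuming a hypothetical UTF-8 file example.txt with a few lines:

with open('example.txt', 'r', encoding='utf-8') as f:
    f.readline()
    cookie = f.tell()    # an opaque cookie encoding byte offset plus decoder state, not a character count
    f.readline()
    f.seek(cookie)       # cheap: restores the saved decoder state instead of re-scanning
    f.seek(0)            # an absolute seek to 0 is always allowed
    # f.seek(10)         # arbitrary offsets that didn't come from tell() are not portable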
The in-memory file objects are just long, addressable arrays really. Seeking is O(1) because you can just alter the current position pointer value.
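For example, seeking in a BytesIO just moves the position index (the sample data here is arbitrary):

import io

buf = io.BytesIO(b'abcdef' * 100000)
buf.seek(500000)      # O(1): only the internal position index changes
buf.read(6)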
Other file-like objects that may or may not support seeking are legion. How they handle seeking is implementation-dependent.
The zipfile module supports seeking on zip files opened in read-only mode. Seeking to a point before the data covered by the current buffer requires a full re-read and decompression of the data up to the desired point; seeking past it requires reading from the current position until you reach the new one. The gzip, lzma and bz2 modules all use the same shared implementation, which also starts reading from the beginning if you seek to a point before the current read position (and there's no larger buffer to avoid this).
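A sketch of that rewind behaviour with gzip (example.gz is hypothetical and assumed to decompress to more than ~500 kB):

import gzip

with gzip.open('example.gz', 'rb') as g:
    g.read(500000)
    g.seek(100)        # backwards: rewinds to the start and decompresses ~100 bytes again
    g.seek(200000)     # forwards: decompresses and discards data until position 200000 is reached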
The chunk module allows seeking within the chunk boundaries and delegates to the underlying object. This is an O(1) operation if the underlying file seek operation is O(1).
Etc. So, it depends.
It would depend on the implementation of f. However, for normal file-system files, it is O(1).
If Python implements f on text files, it could be implemented as O(n), as each character may need to be inspected to handle CR/LF pairs correctly.
This depends on whether f.seek(n,0) must give the same result as reading n characters in a loop while (depending on the OS) CR/LF is shrunk to LF or LF is expanded to CR/LF.
If Python implements f on a compressed stream, then the order would be O(n), as reaching an arbitrary offset may require decompressing every block up to that point.
I'm trying to implement a strings(1)-like function in Python.
import re

def strings(f, n=5):
    # TODO: support files larger than available RAM
    return re.finditer(br'[!-~\s]{%i,}' % n, f.read())

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'rb') as f:
        for m in strings(f):
            print(m[0].decode().replace('\x0A', '\u240A'))
Setting aside the case of actual matches* that are larger than the available RAM, the above code fails in the case of files that are merely, themselves, larger than the available RAM!
Naively "iterating over f" would be done linewise, even for binary files; this may be inappropriate because (a) it may return different results than just running the regex on the whole input, and (b) if the machine has 4 gigabytes of RAM and the file contains any match for rb'[^\n]{8589934592,}', then that unasked-for 8 GiB match will cause a memory problem anyway!
Does Python's regex library enable any simple way to stream re.finditer over a binary file?
*I am aware that it is possible to write regular expressions that may require an exponential amount of CPU or RAM relative to their input length. Handling these cases is, obviously, out of scope; I'm assuming for the purposes of the question that the machine at least has enough resources to handle the regex, its largest match on the input, the acquisition of this match, and the ignoring of all nonmatches.
Not a duplicate of Regular expression parsing a binary file? -- that question is actually asking about bytes-like objects; I am asking about binary files per se.
Not a duplicate of Parse a binary file with Regular Expressions? -- for the same reason.
Not a duplicate of Regular expression for binary files -- that question only addressed the special case where offsets of all matches were known beforehand.
Not a duplicate of Regular expression for binary files -- combination of both of these reasons.
Well, while typing up the question in such excruciating detail and getting supporting documentation, I found the solution:
mmap — Memory-mapped file support
Memory-mapped file objects behave like both bytearray and like file objects. You can use mmap objects in most places where bytearray objects are expected; for example, you can use the re module to search through a memory-mapped file. …
Enacted:
import re, mmap

def strings(f, n=5):
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return re.finditer(br'[!-~\s]{%i,}' % n, view)
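For completeness, the same driver as in the question works unchanged with this version (sys.argv[1] is assumed to be the path of any binary file):

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'rb') as f:
        for m in strings(f):
            print(m[0].decode().replace('\x0A', '\u240A'))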
Caveat: on 32-bit systems, this might not work for files larger than 2GiB, if the underlying standard library is deficient.
However, it looks like it should be fine on both Windows and any well-maintained Linux distribution:
13.8 Memory-mapped I/O
Since mmapped pages can be stored back to their file when physical memory is low, it is possible to mmap files orders of magnitude larger than both the physical memory and swap space. The only limit is address space. The theoretical limit is 4GB on a 32-bit machine - however, the actual limit will be smaller since some areas will be reserved for other purposes. If the LFS interface is used the file size on 32-bit systems is not limited to 2GB … the full 64-bit [8 EiB] are available. …
Creating a File Mapping Using Large Pages
… you must specify the FILE_MAP_LARGE_PAGES flag with the MapViewOfFile function to map large pages. …
I have a list of strings, and would like to pass this to an api that accepts only a file-like object, without having to concatenate/flatten the list to use the likes of StringIO.
The strings are utf-8, don't necessarily end in newlines, and if naively concatenated could be used directly in StringIO.
The preferred solution would be within the standard library (Python 3.8). (Given that the shape of the data is naturally similar to a file, ~identical to readlines() obviously, and the memory access pattern would be efficient, I have a feeling I'm just failing to DuckDuckGo correctly.) But if that doesn't exist, any "streaming" (no data concatenation) solution would suffice.
[Update, based on #JonSG's links]
Both RawIOBase and TextIOBase look to provide an API that decouples arbitrarily sized "chunks"/fragments (in my case: strings in a list) from a file-like read that can specify its own read chunk size, while streaming the data itself (memory cost increases by only some window at any given time, dependent of course on the behaviour of your source & sink).
RawIOBase.readinto looks especially promising because it provides the buffer returned to client reads directly, allowing much simpler code - but this appears to come at the cost of one full copy (into that buffer).
TextIOBase.read() has its own cost for its operation solving the same subproblem, which is concatenating k (k much smaller than N) chunks together.
I'll investigate both of these.
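For reference, here is a minimal sketch of the RawIOBase.readinto route; the class name IterStream and the wrapping shown at the end are my own hypothetical choices, not a standard-library recipe:

import io

class IterStream(io.RawIOBase):
    """Expose an iterable of UTF-8 strings as a read-only binary stream,
    without ever concatenating the whole list."""
    def __init__(self, strings):
        self._iter = iter(strings)
        self._leftover = b''

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull the next string only when the previous one is fully consumed.
        while not self._leftover:
            try:
                self._leftover = next(self._iter).encode('utf-8')
            except StopIteration:
                return 0  # EOF
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]   # the one unavoidable copy noted above
        self._leftover = self._leftover[n:]
        return n

# For APIs that want text rather than bytes:
# reader = io.TextIOWrapper(io.BufferedReader(IterStream(['abc', 'def\n'])), encoding='utf-8')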
I wanted to know how Python's hashlib library treats sparse files. If the file has a lot of zero blocks, then instead of wasting CPU and memory on reading them, does it do any optimization such as scanning the inode block map and reading only the allocated blocks to compute the hash?
If it does not do that already, what would be the best way to do it myself?
PS: Not sure it would be appropriate to post this question in StackOverflow Meta.
Thanks.
The hashlib module doesn't even work with files. You have to read the data in and pass blocks to the hashing object, so I have no idea why you think it would handle sparse files at all.
The I/O layer doesn't do anything special for sparse files either, but that's the OS's job: if it knows a block is a hole, the read operation doesn't need to touch the disk; it just fills your buffer with zeroes.
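So the baseline, non-sparse-aware approach is simply chunked hashing, as in this sketch. Skipping holes yourself would additionally need os.SEEK_DATA/os.SEEK_HOLE where the OS supports them, and the digest would still have to account for the zeroes:

import hashlib

def hash_file(path, algorithm='sha256', chunk_size=1 << 20):
    # hashlib never touches the file itself; we read blocks and feed them to the hash object.
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()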
There's a convenient peek function in io.BufferedReader. But
peek([n])
Return 1 (or n if specified) bytes from a buffer without advancing the position. Only a single read on the raw stream is done to satisfy the call. The number of bytes returned may be less than requested since at most all the buffer’s bytes from the current position to the end are returned.
it is returning too few bytes.
Where shall I get reliable multi-byte peek (without using read and disrupting other code nibbling the stream byte by byte and interpreting data from it)?
It depends on what you mean by reliable. The buffered classes are specifically tailored to avoid I/O as much as possible (as that is the whole point of a buffer), so they only guarantee at most one read of the underlying raw stream. The amount of data returned depends on how much data the buffer already holds.
If you need an exact amount of data, you will need to alter the underlying structures. In particular, you will probably need to re-open the stream with a bigger buffer.
If that is not an option, you could provide a wrapper class so that you can intercept the reads that you need and provide the data transparently to other code that actually wants to consume the data.
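Re-opening with a larger buffer is just open(path, 'rb', buffering=1024 * 1024). The wrapper approach could look like this minimal sketch (PeekableReader is a hypothetical name; only peek and read are implemented):

class PeekableReader:
    """Wrap any binary file-like object to offer a reliable multi-byte peek,
    at the cost of a small internal stash of already-read bytes."""
    def __init__(self, raw):
        self._raw = raw
        self._stash = b''

    def peek(self, n):
        # Top up the stash until it holds n bytes or the stream ends.
        while len(self._stash) < n:
            chunk = self._raw.read(n - len(self._stash))
            if not chunk:
                break
            self._stash += chunk
        return self._stash[:n]

    def read(self, n=-1):
        if n is None or n < 0:
            data, self._stash = self._stash + self._raw.read(), b''
            return data
        data = self._stash[:n]
        self._stash = self._stash[n:]
        if len(data) < n:
            data += self._raw.read(n - len(data))
        return data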
I read that mmap is more advantageous than fileinput, because mmap reads a page into the kernel page cache and shares that page with user address space, whereas fileinput brings a page into the kernel and then copies each line to user address space. So there is extra space overhead with fileinput.
So I am planning to move to mmap, but I want to know from advanced Python hackers whether it actually improves performance.
If so, is there a similar implementation of fileinput that uses mmap?
Please point me to any open-source code, if you are aware of any.
thank you
mmap maps a file into memory so that you can index it like an array of bytes or treat it as a big data structure.
It's a lot faster if you are accessing your file in a "random-access" manner -- that is, doing a lot of fseek(), fread(), fwrite() combinations.
But if you are just reading the file in and processing each line once (say), then it is unlikely to be significantly faster. In fact, for any reasonable file size (remember that with mmap the pages you touch must fit in RAM, or paging occurs, which begins to reduce mmap's efficiency), it is probably indistinguishable.
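If you do want a fileinput-style line loop on top of mmap, a minimal sketch (single file, assumed non-empty) could look like the following; measure it against plain line iteration before committing to it:

import mmap

def mmap_lines(path):
    # Yield lines (as bytes, newline included) straight from the mapped file.
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            line = mm.readline()
            while line:
                yield line
                line = mm.readline()

# for line in mmap_lines('big.log'):
#     ...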