File pointer in Python

I have a bunch of questions about file handling in Python. Please help me sort them out.
Suppose I create a file something like this:
>>> f = open("text.txt", "w+")
>>> f.tell()
0
f is a file object.
Can I assume it to be a file pointer?
If so, what is f pointing to? The empty space reserved for the first byte in the file structure?
Can I assume the file structure to be zero-indexed?
In microprocessors, what I learnt is that the pointer always points to the next instruction. How is it in Python? If I write a character, say 'b', to the file, will my file pointer point to the character 'b' or to the location after 'b'?

You don't specify a version, and file objects behave a little bit differently between Python 2 and Python 3. The general idea is the same, but some of the specific details are different. The following assumes you're using Python 3, or that you're using the version of open from the io module in Python 2.6 or 2.7 rather than Python 2's builtin open.
It isn't a file pointer, although there's a good chance it is implemented in terms of one behind the scenes. Unlike C, Python does not expose the concept of pointers.
However, what you seem to be thinking of is the 'stream position', which is kind of similar to a pointer. This is the number reported by tell(), and which can be fed into seek(). For binary files, it is a byte offset from the start of the file. In text files, it is just 'an offset' which is meaningful to the file object - the docs call it an "opaque number" (i.e., it has no defined physical meaning in terms of how the file is stored on disk). But in both cases, it is an offset from the start, and therefore the start is zero. This is only true if the file supports random access - which it usually will - but be prepared to eventually run into a situation where it doesn't, in which case seek and tell raise errors.
Like the instruction pointer in processors, the stream position is where the next operation will start from, rather than where the current one finished. So, yes, after you've written a string to the file, the current position will be just past the end of what you wrote.
When you've just opened a file, the offset will usually be zero or the end of the file (one past the highest offset you could read from without hitting EOF). It will be zero if you've opened it in 'r' mode, the end if you've opened it in 'a' mode, and the two are equivalent for 'w' and 'w+' modes, since those truncate the file to zero bytes.
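For example, a minimal interactive sketch (using a throwaway file name) showing both behaviours:
>>> f = open("scratch.txt", "w+")   # 'w+' truncates, so the position starts at 0
>>> f.tell()
0
>>> f.write("b")
1
>>> f.tell()                        # one past the character just written
1
>>> f.seek(0)
0
>>> f.read(1)
'b'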

The file object is implemented using the C standard library's stdio. So it contains a "file descriptor", and since it's based on stdio, under the hood it will contain a pointer to a struct FILE, which is what is commonly called a file pointer. And you can use tell and seek. On the other hand, it is also an iterator and a context manager, so it has more functionality.
It is not a pointer, but rather a reference. Keep in mind that in Python f is a name, that references a file object.
If you are using file.seek(), it uses 0-based absolute positioning by default.
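For illustration, seek() also takes an optional whence argument; a sketch, assuming a hypothetical file opened in binary mode (text-mode files in Python 3 are stricter about which offsets they accept):
import io

f = open("data.bin", "rb")   # binary mode: offsets are plain byte counts
f.seek(128)                  # absolute: 128 bytes from the start (whence defaults to 0)
f.seek(16, io.SEEK_CUR)      # relative to the current position
f.seek(0, io.SEEK_END)       # jump to the end of the file
f.close()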
You are confusing a processor register with file handling. The question makes no sense.

There's nothing special about a file object; just think of it as an object.
The name f points to the file object on the heap, just like in l = [1, 2, 3] the name l points to the list object on the heap.
From the documentation, there is no __getitem__ member, so asking whether it is zero-indexed is not a meaningful question.

Related

How do I know when to shrink the size of a file after a write call?

When my filesystem receives a call to write() with a buffer length that reaches outside of the current filesize, the program knows to increase the file size.
However, what if the file gets smaller? For example, what if I change the contents of a file from "hello" to "hell"? A call to write() is issued with offset 0 buffer length 4. If the filesystem follows these instructions and only updates the first 4 bytes of stored data without changing the file size, the next read will still show the same string.
How does my FUSE implementation know when it is supposed to discard data beyond the end of the buffer it's writing and shrink the size of the file?
So, how to differentiate between:
example string -> write(offset=0, buf="hello") -> hello (write with truncation)
example string -> write(offset=0, buf="hello") -> hellole string (plain in-place overwrite, size unchanged)
I am using pyfuse3, but I suspect this logic is universal across bindings in other languages.
My issue was that I didn't check for the O_TRUNC flag when opening a file. It turns out that the truncate command uses setattr, while redirection in a shell uses open with O_TRUNC set.
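For anyone hitting the same thing, a minimal sketch of both paths, assuming pyfuse3's Operations API (resize_file is a hypothetical helper that adjusts the stored size):
import os
import pyfuse3

class MyFS(pyfuse3.Operations):
    async def open(self, inode, flags, ctx):
        # Shell redirection (e.g. `> file`) opens with O_TRUNC set, so the
        # stored contents must be shrunk to zero bytes before any write().
        if flags & os.O_TRUNC:
            self.resize_file(inode, 0)   # hypothetical helper
        return pyfuse3.FileInfo(fh=inode)

    async def setattr(self, inode, attr, fields, fh, ctx):
        # The truncate command (and ftruncate()) shows up here instead,
        # as a size change requested through setattr.
        if fields.update_size:
            self.resize_file(inode, attr.st_size)
        return await self.getattr(inode, ctx)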

File.read() is jumping to a weird address in Python

The code below
fd = open(r"C:\folder1\file.acc", 'r')
fd.seek(12672)
print str(fd.read(1))
print "after", fd.tell()
Is returning after 16257 instead of the expected after 12673
What is going on here? Is there a way the creator of the file can put some sort of protection on the file to mess with my reads? I am only having issues with a range of addresses. The rest of the file reads as expected.
It looks as though you are trying to deal with a file with a simple "stream of bytes at linearly increasing offsets" model, but you are opening it with 'r' rather than 'rb'. Given that the path name starts with C:\ we can also assume that you are running on a Windows system. Text streams on Windows—whether opened in Python, or in various other languages including the C base for CPython—do funny translations where '\n' in Python becomes the two-byte sequence '\r', '\n' within the bytes-as-stored-in-the-file. This makes file offsets behave in a non-linear fashion (though as someone who avoids Windows I would not care to guess at the precise behaviors).
It's therefore important to open the file with 'rb' mode for reading. This becomes even more critical in Python 3, which uses Unicode for its base strings: opening a stream with mode 'r' produces text, i.e., strings of type str, which are Unicode; opening it with mode 'rb' produces strings of type bytes.
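With 'rb', the offsets behave linearly again; a sketch of the original snippet in Python 3 syntax:
fd = open(r"C:\folder1\file.acc", 'rb')   # binary mode: no newline translation
fd.seek(12672)
print(fd.read(1))
print("after", fd.tell())   # now reliably prints: after 12673
fd.close()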
Notes on things you did not ask about
You may use r+b for writing if you do not want to truncate an existing file, or wb to create a new file or truncate any existing file. Remember that + means "add the other mode", while w means "truncate existing or create anew for writing", so r+ is read-and-write without truncation, while w+ is write-and-read with truncation. In all cases, including the b means "... and treat as a stream of bytes."
As you can see, there is a missing mode here: how do you open for writing (only) without truncation, yet creating the file if necessary? Python, like C, gives you a third letter option a (which you can also mix with + and b as usual). This opens for writing without truncation, creating a new file only if necessary—but it has the somewhat annoying side effect of forcing all writes to append, which is what the a stands for. This means you cannot open a file for writing without truncation, position into the middle of it, and overwrite just a bit of it. Instead, you must open for read-plus, position into the middle of it, and overwrite just the one bit. But the read-plus mode fails—raises an OSError exception—if the file does not currently exist.
You can open with r+ and if it fails, try again with w or w+, but the flaw here is that the operation is non-atomic: if two or more entities—let's call them Alice and Bob, though often they are just two competing programs—are trying to do this on a single file name, it's possible that Alice sees the file does not exist yet, then pauses a bit; then Bob sees that the file does not exist, creates-and-truncates it, writes contents, and closes it; then Alice resumes, and creates-and-truncates, losing Bob's data. (In practice, two competing entities like this need to cooperate anyway, but to do so reliably, they need some sort of atomic synchronization, and for that you must drop to OS-specific operations. Python 3.3 adds the x character for exclusive, which helps implement atomicity.)
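A sketch of that pattern with 'x' (Python 3.3+); the file name here is hypothetical:
try:
    f = open("shared.dat", "x+b")   # atomic create: exactly one competitor succeeds
except FileExistsError:
    f = open("shared.dat", "r+b")   # it already exists: open without truncating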
If you do open a stream for both reading and writing, there is another annoying caveat: any time you wish to "switch directions" you are required to introduce an apparently-pointless seek. ("Any time" is a bit too strong: e.g., after an attempt to read produces end-of-file, you may switch then as well. The set of conditions to remember, however, is somewhat difficult; it's easier to say "seek before changing directions.") This is inherited from the underlying C "standard I/O" implementation. Python could work around it—and I was just now searching to see if Python 3 does, and have not found an answer—but Python 2 did not. The underlying C implementation is also not required to have this flaw, and some, such as mine, do not, but it's safest to assume that it might, and do the apparently-pointless seek.
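The defensive pattern looks like this (a sketch with a hypothetical existing file); the extra seeks are harmless on implementations that do not need them:
with open("notes.dat", "r+b") as f:
    head = f.read(4)      # read first...
    f.seek(f.tell())      # apparently-pointless seek before switching to writing
    f.write(b"next")
    f.seek(f.tell())      # ...and again before switching back to reading
    tail = f.read()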

Prefer BytesIO or bytes for internal interface in Python?

I'm trying to decide on the best internal interface to use in my code, specifically around how to handle file contents. Really, the file contents are just binary data, so bytes is sufficient to represent them.
I'm storing files in different remote locations, so have a couple of different classes for reading and writing. I'm trying to figure out the best interface to use for my functions. Originally I was using file paths, but that was suboptimal because it meant that disk was always used (which meant lots of clumsy tempfiles).
There are several areas of the code that have the same requirement, and would directly use whatever was returned from this interface. As a result whatever abstraction I choose will touch a fair bit of code.
What are the various tradeoffs to using BytesIO vs bytes?
def put_file(location, contents_as_bytes):
def put_file(location, contents_as_fp):
def get_file_contents(location):
def get_file_contents(location, fp):
Playing around, I've found that using the file-like interfaces (BytesIO, etc.) requires a bit of administrative overhead in terms of seek(0) etc. That raises questions like:
is it better to seek before you start, or after you've finished?
do you seek to the start or just operate from the position the file is in?
should you tell() to maintain the position?
looking at something like shutil.copyfileobj, it doesn't do any seeking
One advantage I've found with using file-like interfaces instead is that they allow passing in the fp to write into when you're retrieving data, which seems to give a good deal of flexibility.
def get_file_contents(location, write_into=None):
    if not write_into:
        write_into = io.BytesIO()
    # get the contents and put it into write_into
    return write_into
get_file_contents('blah', file_on_disk)
get_file_contents('blah', gzip_file)
get_file_contents('blah', temp_file)
get_file_contents('blah', bytes_io)
new_bytes_io = get_file_contents('blah')
# etc
Is there a good reason to prefer BytesIO over just using fixed bytes when designing an interface in python?
The benefit of io.BytesIO objects is that they implement a common-ish interface (commonly known as a 'file-like' object). BytesIO objects have an internal pointer (whose position is returned by tell()), and for every call to read(n) the pointer advances n bytes. For example:
import io

buf = io.BytesIO(b'Hello world!')
buf.read(1)  # Returns b'H'
buf.tell()   # Returns 1
buf.read(1)  # Returns b'e'
buf.tell()   # Returns 2

# Reset the pointer to 0.
buf.seek(0)
buf.read(1)  # Returns b'H' again, like the first call.
In your use case, neither the bytes object nor the io.BytesIO object is necessarily the best solution: both hold the complete contents of your files in memory.
Instead, you could look at tempfile.TemporaryFile (https://docs.python.org/3/library/tempfile.html).
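A sketch of that approach; SpooledTemporaryFile keeps the data in memory until max_size bytes and then transparently rolls over to disk:
import tempfile

with tempfile.SpooledTemporaryFile(max_size=2**20) as tmp:
    tmp.write(b"...contents fetched from the remote location...")
    tmp.seek(0)           # rewind before handing the file-like object onward
    data = tmp.read()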

In Python, what is the difference between f.readlines() and list(f)

From both the Python 2 tutorial and the Python 3 tutorial, there is a line in the middle of section 7.2.1 saying:
If you want to read all the lines of a file in a list you can also use list(f) or f.readlines().
So my question is: what is the difference between these two ways of turning a file object into a list? I am curious about both the performance aspect and the underlying Python object implementation (and maybe the difference between Python 2 and Python 3).
Functionally, there is no difference; both methods result in the exact same list.
Implementation wise, one uses the file object as an iterator (calls next(f) repeatedly until StopIteration is raised), the other uses a dedicated method to read the whole file.
Python 2 and 3 differ in what that means, exactly, unless you use io.open() in Python 2. Python 2 file objects use a hidden buffer for file iteration, which can trip you up if you mix file object iteration and .readline() or .readlines() calls.
The io library (which handles all file I/O in Python 3) does not use such a hidden buffer; all buffering is instead handled by a BufferedIOBase() wrapper class. In fact, the io.IOBase.readlines() implementation uses the file object as an iterator under the hood anyway, and TextIOWrapper iteration delegates to TextIOWrapper.readline(), so list(f) and f.readlines() are essentially the same thing.
Performance wise, there isn't really a difference even in Python 2, as the bottleneck is file I/O: how quickly the data can be read from disk. At a micro level, performance can depend on other factors, such as whether the OS has already buffered the data and how long the lines are.
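A quick check that the two spellings produce identical lists (assuming some example.txt exists):
with open("example.txt") as f:
    lines_a = f.readlines()
with open("example.txt") as f:
    lines_b = list(f)
assert lines_a == lines_b   # identical lists, trailing newlines included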

Parsing large (20GB) text file with python - reading in 2 lines as 1

I'm parsing a 20 GB file and outputting lines that meet a certain condition to another file; however, occasionally Python will read in two lines at once and concatenate them.
inputFileHandle = open(inputFileName, 'r')
row = 0
for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
I've checked the line endings in the source file and they check out as line feeds (ASCII char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some Python limitation here? The position in the file of the first anomaly is around the 4GB mark.
A quick Google search for "python reading files larger than 4gb" yields many results, including examples of exactly this problem.
It's a bug in Python.
Now, the explanation of the bug: it's not easy to reproduce, because it depends both on the internal FILE buffer size and on the number of chars passed to fread().
In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF."
Oddly, there is an almost exact copy of this function in the Perl source code:
http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668
The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32bit DWORD. [The fix is easy; do you see it?]
At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.
And the work-around: open the file in binary mode ('rb'), which bypasses the CRT's newline translation and the buggy look-ahead entirely. Note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may use io.open().
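Concretely, for the snippet in the question, either of these sketches avoids the buggy CRT path:
# Bypass newline translation entirely:
inputFileHandle = open(inputFileName, 'rb')
# Or, on Python 2.7, let the io module handle the translation instead of the CRT:
import io
inputFileHandle = io.open(inputFileName, 'r')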
The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
The code you've posted looks fine by itself, so I would suspect a bug in your Python build.
FWIW, the snippet would be a little cleaner if it used enumerate:
inputFileHandle = open(inputFileName, 'r')
for row, line in enumerate(inputFileHandle, 1):  # start at 1 to keep the original 1-based row numbers
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
