File.read() is jumping to weird address in python - python

The code below
fd = open(r"C:\folder1\file.acc", 'r')
fd.seek(12672)
print str(fd.read(1))
print "after", fd.tell()
is printing "after 16257" instead of the expected "after 12673".
What is going on here? Is there a way the creator of the file can put some sort of protection on the file to mess with my reads? I am only having issues with a range of addresses. The rest of the file reads as expected.

It looks as though you are trying to deal with a file with a simple "stream of bytes at linearly increasing offsets" model, but you are opening it with 'r' rather than 'rb'. Given that the path name starts with C:\ we can also assume that you are running on a Windows system. Text streams on Windows—whether opened in Python, or in various other languages including the C base for CPython—do funny translations where '\n' in Python becomes the two-byte sequence '\r', '\n' within the bytes-as-stored-in-the-file. This makes file offsets behave in a non-linear fashion (though as someone who avoids Windows I would not care to guess at the precise behaviors).
It's therefore important to open the file with 'rb' mode for reading. This becomes even more critical when you use Python 3, which uses Unicode for base strings: opening a stream with mode 'r' produces text, that is, strings of type 'str', which are Unicode; opening it with mode 'rb' produces bytes, that is, objects of <class 'bytes'>.
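As a minimal sketch of the fix (Python 3 syntax; the temporary file and its contents are made up for illustration and stand in for the original file.acc), opening in 'rb' makes tell() advance by exactly the number of bytes read:

```python
import os
import tempfile

# Build a throwaway file whose bytes include CR, LF and Ctrl-Z (0x1A),
# the values that text mode on Windows treats specially.
payload = bytes(range(256)) * 64                       # 16384 bytes
path = os.path.join(tempfile.mkdtemp(), "sample.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.write(payload)

with open(path, "rb") as f:      # 'rb': a plain stream of bytes
    f.seek(12672)
    byte = f.read(1)
    position = f.tell()          # one past the seek offset

print(byte, "after", position)
```

In binary mode the offset passed to seek() and the value returned by tell() are both plain byte counts, so they stay consistent with each other.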
Notes on things you did not ask about
You may use r+b for writing if you do not want to truncate an existing file, or wb to create a new file or truncate any existing file. Remember that + means "add the other mode", while w means "truncate existing or create anew for writing", so r+ is read-and-write without truncation, while w+ is write-and-read with truncation. In all cases, including the b means "... and treat as stream of bytes."
As you can see, there is a missing mode here: how do you open for writing (only) without truncation, yet creating the file if necessary? Python, like C, gives you a third letter option a (which you can also mix with + and b as usual). This opens for writing without truncation, creating a new file only if necessary—but it has the somewhat annoying side effect of forcing all writes to append, which is what the a stands for. This means you cannot open a file for writing without truncation, position into the middle of it, and overwrite just a bit of it. Instead, you must open for read-plus, position into the middle of it, and overwrite just the one bit. But the read-plus mode fails—raises an OSError exception—if the file does not currently exist.
You can open with r+ and if it fails, try again with w or w+, but the flaw here is that the operation is non-atomic: if two or more entities—let's call them Alice and Bob, though often they are just two competing programs—are trying to do this on a single file name, it's possible that Alice sees the file does not exist yet, then pauses a bit; then Bob sees that the file does not exist, creates-and-truncates it, writes contents, and closes it; then Alice resumes, and creates-and-truncates, losing Bob's data. (In practice, two competing entities like this need to cooperate anyway, but to do so reliably, they need some sort of atomic synchronization, and for that you must drop to OS-specific operations. Python 3.3 adds the x character for exclusive, which helps implement atomicity.)
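One way to get "read/write, create if missing, never truncate" in a single step is to go through os.open, which is available on both POSIX and Windows. The sketch below is only an illustration: open_for_update is a hypothetical helper name, the file name and permission bits are assumptions, and it does not by itself make two competing writers safe.

```python
import os
import tempfile

def open_for_update(path):
    """Open path for binary read/write, creating it if missing, never truncating."""
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    return os.fdopen(fd, "r+b")

path = os.path.join(tempfile.mkdtemp(), "data.bin")  # hypothetical file name

with open_for_update(path) as f:      # file does not exist yet: created empty
    f.write(b"0123456789")

with open_for_update(path) as f:      # file exists: contents are kept
    f.seek(4)
    f.write(b"XY")                    # overwrite two bytes in the middle

with open(path, "rb") as f:
    result = f.read()

# Exclusive creation (Python 3.3+): fails instead of truncating an existing file.
try:
    open(path, "xb")
    created = True
except FileExistsError:
    created = False
```

Here the second open leaves the existing ten bytes in place and only the two bytes at offset 4 change, while the 'x' mode attempt is refused because the file already exists.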
If you do open a stream for both reading and writing, there is another annoying caveat: any time you wish to "switch directions" you are required to introduce an apparently-pointless seek. ("Any time" is a bit too strong: e.g., after an attempt to read produces end-of-file, you may switch then as well. The set of conditions to remember, however, is somewhat difficult; it's easier to say "seek before changing directions.") This is inherited from the underlying C "standard I/O" implementation. Python could work around it—and I was just now searching to see if Python 3 does, and have not found an answer—but Python 2 did not. The underlying C implementation is also not required to have this flaw, and some, such as mine, do not, but it's safest to assume that it might, and do the apparently-pointless seek.
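The "seek before changing directions" habit can be sketched as follows (Python 3 syntax, with a made-up temporary file). Whether a given Python version strictly needs these seeks is not settled above, so treat them as a harmless precaution rather than a requirement:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "update.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.write(b"abcdef")

with open(path, "r+b") as f:
    first = f.read(2)        # read the first two bytes
    f.seek(f.tell())         # no-op seek before switching from reading to writing
    f.write(b"CD")           # overwrite the next two bytes
    f.seek(0)                # seek again before switching back to reading
    contents = f.read()
```

Seeking to the current position does not move anything; it only marks the point where the stream changes direction.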

Related

Python file open function modes

I have noticed that, in addition to the documented mode characters, Python 2.7.5.1 in Windows XP and 8.1 also accepts modes U and D at least when reading files. Mode U is used in numpy's genfromtxt. Mode D has the effect that the file is deleted, as per the following code fragment:
f = open('text.txt','rD')
print(f.next())
f.close() # file text.txt is deleted when closed
Does anybody know more about these modes, especially whether they are a permanent feature of the language applicable also on Linux systems?
The D flag seems to be Windows specific. Windows seems to add several flags to the fopen function in its CRT, as described here.
While Python does filter the mode string to make sure no errors arise from it, it does allow some of the special flags, as can be seen in the Python sources here. Specifically, it seems that the N flag is filtered out, while the T and D flags are allowed:
while (*++mode) {
    if (*mode == ' ' || *mode == 'N') /* ignore spaces and N */
        continue;
    s = "+TD"; /* each of this can appear only once */
    ...
I would suggest sticking to the documented options to keep the code cross-platform.
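If the goal is a file that disappears when it is closed, the documented tempfile module gives that behavior on every platform. The sketch below uses Python 3 syntax; the suffix and the contents are assumptions made for illustration. It also shows that Python 3 rejects the undocumented D flag outright:

```python
import os
import tempfile

# Portable delete-on-close: the file is removed when the object is closed.
with tempfile.NamedTemporaryFile(mode="w+", suffix=".txt") as tmp:
    tmp.write("first line\n")
    tmp.seek(0)
    first_line = tmp.readline()
    tmp_path = tmp.name
    existed_while_open = os.path.exists(tmp_path)

exists_after_close = os.path.exists(tmp_path)

# Python 3 validates the mode string, so the Windows-only flag is refused.
try:
    open(tmp_path, "rD")
    accepted = True
except ValueError:
    accepted = False
```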
This is a bit misleading.
open() accepts any characters in the mode argument, as long as you also pass a valid one, i.e. one of "w", "r", "b", "+", "a".
Thus you can write: open("fname", "w+ANYTHINGYOUWANT").
It will open the file as open("fname", "w+").
And open("fname", "rANYTHINGYOUWANT")
will open the file as open("fname", "r").
Regarding "U" flag:
In addition to the standard fopen() values mode may be 'U' or 'rU'.
Python is usually built with universal newlines support; supplying 'U'
opens the file as a text file, but lines may be terminated by any of
the following: the Unix end-of-line convention '\n', the Macintosh
convention '\r', or the Windows convention '\r\n'. All of these
external representations are seen as '\n' by the Python program. If
Python is built without universal newlines support a mode with 'U' is
the same as normal text mode. Note that file objects so opened also
have an attribute called newlines which has a value of None (if no
newlines have yet been seen), '\n', '\r', '\r\n', or a tuple
containing all the newline types seen.
As you can read in Python documentation https://docs.python.org/2/library/functions.html#open
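In Python 3 the same universal-newline behavior is the default for text mode, and the newlines attribute is still there. A small sketch, assuming a made-up file that mixes all three conventions:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mixed.txt")  # hypothetical file name
with open(path, "wb") as f:
    f.write(b"unix\nmac\rwindows\r\n")                # one line per convention

# Default newline=None: every convention is translated to "\n" on reading.
with open(path, "r", encoding="ascii") as f:
    lines = f.readlines()
    seen = f.newlines        # tuple of the newline types encountered

print(lines)
print(seen)
```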
EDIT:
D: Specifies a file as temporary. It is deleted when the last file
pointer is closed.
as you can read in @tmr232's link.
The c, n, t, S, R, T, and D mode options are Microsoft extensions for
fopen and _fdopen and should not be used where ANSI portability is
desired
Further update:
I propose reporting this behavior as a bug. A file opened read-only, i.e. with flag "r", can be deleted on close just by adding a single character such as "D", even accidentally, which I think is a serious security issue.
But if this serves some unavoidable purpose, please let me know.

File pointer in python

I have a bunch of questions in file handling in Python. Please help me sort them out.
Suppose I create a file something like this.
>>> f = open("text.txt", "w+")
>>> f.tell()
0
f is a file object.
Can I assume it to be a file pointer?
If so what is f pointing to ? The empty space reserved for first byte in file structure?
Can I assume file structure to be zero indexed?
In microprocessors, what I learnt is that the pointer always points to the next instruction. How is it in Python? If I write a character, say 'b', to the file, will my file pointer point to the character 'b' or to the location next to 'b'?
You don't specify a version, and file objects behave a little bit differently between Python 2 and Python 3. The general idea is the same, but some of the specific details are different. The following assumes you're using Python 3, or that you're using the version of open from the io module in Python 2.6 or 2.7 rather than Python 2's builtin open.
It isn't a file pointer, although there's a good chance it is implemented in terms of one behind the scenes. Unlike C, Python does not expose the concept of pointers.
However, what you seem to be thinking of is the 'stream position', which is kind of similar to a pointer. This is the number reported by tell(), and which can be fed into seek(). For binary files, it is a byte offset from the start of the file. In text files, it is just 'an offset' which is meaningful to the file object - the docs call it an "opaque number" (i.e., it has no defined physical meaning in terms of how the file is stored on disk). But in both cases, it is an offset from the start, and therefore the start is zero. This is only true if the file supports random access - which it usually will, but be prepared to eventually run into a stream that does not - in which case seek and tell raise errors.
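A pipe is one example of a stream without random access. The sketch below (Python 3 syntax, using os.pipe purely as an illustration) checks seekable() first and shows tell() failing:

```python
import os

read_fd, write_fd = os.pipe()          # a pipe has no notion of position
reader = os.fdopen(read_fd, "rb")
writer = os.fdopen(write_fd, "wb")

can_seek = reader.seekable()           # ask before relying on seek()/tell()
try:
    reader.tell()
    tell_failed = False
except OSError:                        # io.UnsupportedOperation is a subclass
    tell_failed = True

reader.close()
writer.close()
```

Calling seekable() before using seek() or tell() is the simplest way to handle such streams gracefully.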
Like the instruction pointer in processors, the stream position is where the next operation will start from, rather than where the current one finished. So, yes, after you've written a string to the file, the current position will usually be one offset value past that.
When you've just opened a file, the offset will usually be zero or the end of the file (one higher than the maximum value you could read from without getting EOF). It will be zero if you've opened it in 'r' mode, the end if you've opened it in 'a' mode and the two are equivalent for 'w' and 'w+' modes since those truncate the file to zero bytes.
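The positions described above can be checked directly. This is a small sketch in Python 3 syntax; it uses binary mode so that the offsets are exact byte counts, and the file name is made up:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "text.txt")  # hypothetical file name

with open(path, "w+b") as f:
    at_open = f.tell()        # freshly truncated file: position is the start
    f.write(b"b")
    after_write = f.tell()    # the position just past the byte written

with open(path, "ab") as f:
    append_start = f.tell()   # append mode starts at the end of the file

with open(path, "rb") as f:
    read_start = f.tell()     # read mode starts at the beginning

print(at_open, after_write, append_start, read_start)
```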
The file object is implemented using the C standard library's stdio. So it contains a "file descriptor" (since it's based on stdio, "under the hood" it will contain a pointer to a struct FILE, which is what is commonly called a file pointer). And you can use tell and seek. On the other hand, it is also an iterator and a context manager. So it has more functionality.
It is not a pointer, but rather a reference. Keep in mind that in Python f is a name, that references a file object.
If you are using file.seek(), it uses 0-based absolute positioning by default.
You are confusing a processor register with file handling. The question makes no sense.
There's nothing special about a file object. Just think of it as an object
the name f points to the file object on the heap, just like in l = [1, 2, 3] the name l points to the list object on the heap
From the documentation, there is no __getitem__ member, so this is not a meaningful question

Parsing large (20GB) text file with python - reading in 2 lines as 1

I'm parsing a 20Gb file and outputting lines that meet a certain condition to another file, however occasionally python will read in 2 lines at once and concatenate them.
inputFileHandle = open(inputFileName, 'r')
row = 0
for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
I've checked the line endings in the source file and they check out as line feeds (ascii char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some python limitation here? The position in the file of the first anomaly is around the 4GB mark.
Quick google search for "python reading files larger than 4gb" yielded many many results. See here for such an example and another one which takes over from the first.
It's a bug in Python.
Now, the explanation of the bug; it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread().
In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF."
Oddly, there is an almost exact copy of this function in Perl source code:
http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668
The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32bit DWORD. [The fix is easy; do you see it?]
At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.
And the work-around:
But note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may use io.open().
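A sketch of that work-around, written so that it runs on both 2.7 and 3.x. The function name copy_matching_lines, the keep predicate and the UTF-8 encoding are assumptions made for illustration; they are not part of the original question:

```python
import io

def copy_matching_lines(input_name, output_name, keep):
    """Copy lines for which keep(line) is true; return 1-based numbers of the rest."""
    ignored_rows = []
    # io.open does newline translation in Python itself rather than in the C runtime.
    with io.open(input_name, "r", encoding="utf-8") as src, \
         io.open(output_name, "w", encoding="utf-8") as dst:
        for row, line in enumerate(src, 1):
            if keep(line):
                dst.write(line)
            else:
                ignored_rows.append(row)
    return ignored_rows
```

A call might look like copy_matching_lines("in.txt", "out.txt", lambda line: line.startswith("keep")), where the file names and the condition are placeholders.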
The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
The code you've posted looks fine by itself, so I would suspect a bug in your Python build.
FWIW, the snippet would be a little cleaner if it used enumerate:
inputFileHandle = open(inputFileName, 'r')
for row, line in enumerate(inputFileHandle, 1):  # start at 1 to match the original row numbering
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

Difference between binary and text I/O in python on Windows

I know that I should open a binary file using "rb" instead of "r" because Windows behaves differently for binary and non-binary files.
But I don't understand what exactly happens if I open a file the wrong way and why this distinction is even necessary. Other operating systems seem to do fine by treating both kinds of files the same.
Well, this is for historical (or, as I like to say, hysterical) reasons. The file open modes are inherited from the C stdio library, and hence we follow it.
For Windows, there is no difference between text and binary files, just as in any of the Unix clones. No, I mean it! There are (were) file systems/OSes in which a text file is a completely different beast from an object file, and so on. In some, you had to specify the maximum length of lines in advance, and fixed-size records were used... fossils from the times of 80-column paper punch cards and such. Luckily, not so in Unices, Windows and Mac.
However, all other things being equal, Unix, Windows and Mac historically differ in which characters they use in the output stream to mark the end of a line (or, equivalently, to separate lines). In Unix, \x0A (\n) is used. In Windows, the two-character sequence \x0D\x0A (\r\n) is used; on Mac, just \x0D (\r). Here are some clues on the origin of those two symbols: ASCII code 10 is called Line Feed (LF) and, when sent to a teletype, would cause it to move down one line (Y++) without changing its horizontal (X) position. Carriage Return (CR), ASCII 13, on the other hand, would cause the printing carriage to return to the beginning of the line (X=0) without scrolling one line down. So when sending output to the printer, both \r and \n had to be sent so that the carriage would move to the beginning of a new line. Now, when typing on a terminal keyboard, operators naturally expect to press one key, not two, for end of line. On the Apple ][ that was the 'Return' key (\r).
At any rate, this is how things settled. C's creators were concerned about portability: much of Unix was written in C, unlike earlier OSes, which were written in assembler. They did not want to deal with each platform's quirks about text representation, so they added this evil hack to their I/O library: depending on the platform, the input and output to a file will be "patched" on the fly so that the program sees newlines the righteous, Unix way, as '\n', no matter whether it was '\r\n' from Windows or '\r' from Mac. So the developer need not worry about which OS the program runs on; it can still read and write text files in the native format.
There was a problem, however: not all files are text. There are other formats, and they are very sensitive to replacing one character with another. So they thought, we will call those "binary files" and indicate that to fopen() by including 'b' in the mode, and this will tell the library not to do any behind-the-scenes conversion. And that's how it came to be the way it is :)
So to recap, if a file is opened with 'b' (binary mode), no conversions take place. If it is opened in text mode, some conversion of the newline character(s) may occur, depending on the platform, towards the Unix point of view. Naturally, on a Unix platform there is no difference between reading/writing a "text" or a "binary" file.
This mode is about conversion of line endings.
When reading in text mode, the platform's native line endings (\r\n on Windows) are converted to Python's Unix-style \n line endings. When writing in text mode, the reverse happens.
In binary mode, no such conversion is done.
Other platforms usually do fine without the conversion, because they store line endings natively as \n. (An exception is Mac OS, which used to use \r in the old days.) Code relying on this, however, is not portable.
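The translation can be observed on any platform in Python 3 by forcing the Windows convention with the newline argument. A minimal sketch, assuming a made-up temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")  # hypothetical file name

# newline="\r\n" makes text-mode writes behave as they do by default on Windows.
with open(path, "w", encoding="ascii", newline="\r\n") as f:
    f.write("one\ntwo\n")

with open(path, "rb") as f:
    raw = f.read()            # binary mode: the bytes exactly as stored

with open(path, "r", encoding="ascii") as f:
    text = f.read()           # text mode: line endings translated back to "\n"

print(len(raw), len(text))
```

The stored bytes are two longer than the text the program wrote, which is exactly the kind of silent change that corrupts binary data handled in text mode.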
In Windows, text mode will convert the newline \n to a carriage return followed by a newline \r\n.
If you read text in binary mode, there are no problems. If you read binary data in text mode, it will likely be corrupted.
For reading files there should be no difference. When writing to text-files Windows will automatically mess up your line-breaks (it will add \r's before the \n's). That's why you should use "wb".

Line reading chokes on 0x1A

I have the following file:
abcde
kwakwa
<0x1A>
line3
linllll
Where <0x1A> represents a byte with the hex value of 0x1A. When attempting to read this file in Python as:
for line in open('t.txt'):
    print line,
It only reads the first two lines, and exits the loop.
The solution seems to be to open the file in binary (or universal newline mode) - 'rb' or 'rU'. Can you explain this behavior ?
0x1A is Ctrl-Z, and DOS historically used that as an end-of-file marker. For example, try using a command prompt and "type"ing your file. It will only display the content up to the Ctrl-Z.
Python uses the Windows CRT function _wfopen, which implements the "Ctrl-Z is EOF" semantics.
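Opening the file in binary mode sidesteps the Ctrl-Z handling altogether. A short sketch (Python 3 syntax) that recreates the file from the question in a temporary directory and iterates over all of its lines:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "t.txt")
with open(path, "wb") as f:
    f.write(b"abcde\nkwakwa\n\x1a\nline3\nlinllll\n")

# 'rb': no end-of-file marker handling, so every line is returned.
with open(path, "rb") as f:
    lines = list(f)

for line in lines:
    print(line)
```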
Ned is of course correct.
If your curiosity runs a little deeper, the root cause is backwards compatibility taken to an extreme. Windows is compatible with DOS, which used Ctrl-Z as an optional end of file marker for text files. What you might not know is that DOS was compatible with CP/M, which was popular on small computers before the PC. CP/M's file system didn't keep track of file sizes down to the byte level, it only kept track by the number of floppy disk sectors. If your file wasn't an exact multiple of 128 bytes, you needed a way to mark the end of the text. This Wikipedia article implies that the selection of Ctrl-Z was based on an even older convention used by DEC.
