I'm reading lines from a group of files (log files) following them as they are written using pyinotify.
I'm opening and reading the files with python native methods:
file = open(self.file_path, 'r')
# ... later
line = file.readline()
This is generally stable and can handle the file being deleted and re-created. pyinotify will notify the unlink and subsequent link.
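For reference, the watch setup looks roughly like this (a sketch, assuming pyinotify's standard WatchManager/Notifier API; the watched path and handler bodies are illustrative):
import pyinotify

wm = pyinotify.WatchManager()

class Handler(pyinotify.ProcessEvent):
    # pyinotify dispatches each event to a process_<EVENT> method.
    def process_IN_MODIFY(self, event):
        print('written:', event.pathname)

    def process_IN_DELETE(self, event):
        print('unlinked:', event.pathname)

    def process_IN_CREATE(self, event):
        print('linked:', event.pathname)

wm.add_watch('/var/log', pyinotify.IN_MODIFY | pyinotify.IN_DELETE | pyinotify.IN_CREATE)
pyinotify.Notifier(wm, Handler()).loop()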
However some log files are not being deleted. Instead they are being truncated and new content written to the beginning of the same file.
I'm having trouble reliably detecting when this has occurred, since pyinotify will simply report only a write. The only evidence I currently get is that pyinotify reports a write and readline() returns an empty string. BUT, it is possible that two subsequent writes could trigger the same behavior.
I have thought of comparing a file's size to file.tell() but according to the documentation tell produces an opaque number and it appears this can't be trusted to be a number of bytes.
Is there a simple way to detect a file has been truncated while reading from it?
Edit:
Truncating a file can be simulated with simple shell commands:
echo hello > test.log
echo hello >> test.log
# Truncate test.log
echo goodbye > test.log
To complement this, a simple python script can be used to confirm that file.tell() does not decrease when the file is truncated:
foo = open('./test.log', 'r')
line = foo.readline()
while line != '':
    print(foo.tell())
    print(line)
    line = foo.readline()
# Put a breakpoint on the following line and
# truncate the file before it executes
print(foo.tell())
Use os.lseek(file.fileno(), 0, os.SEEK_CUR) to obtain a byte offset without moving the file pointer. You can’t really use the regular file interface to find out, not least because it may have buffered text (that no longer exists) that it hasn’t made visible to Python yet. If the file is not a byte stream (e.g., the default open in Python 3), it could even be in the middle of a multibyte character and be unable to proceed even if the file immediately grew back past your file offset.
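A minimal sketch of that check (assuming the file is read unbuffered in binary mode, so the OS-level offset matches what you have consumed; has_been_truncated is an illustrative helper, not a library function):
import os

def has_been_truncated(f):
    # OS-level byte offset, obtained without moving the file pointer.
    pos = os.lseek(f.fileno(), 0, os.SEEK_CUR)
    # Current on-disk size of the file.
    size = os.fstat(f.fileno()).st_size
    # If the file is now smaller than our offset, it was truncated.
    return size < pos
With a buffered or text-mode file, the OS offset can run ahead of what you have actually read, which is exactly the caveat above.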
I'm opening a plain text file, parsing it, and adding different lines to existing, empty string variables. I add these variables into a new variable that is a multi-line fstring. Trying to write the data to a new text file is not behaving as expected.
Reading the original file works fine. Text is properly parsed, variables populated.
The multi-line fstring variable seems fine. Prints normally. Even tried formatting it different ways which I show below.
When writing to a new file, that's where the strangeness starts. I've tried 2 ways:
Straight coding the open function with w or w+
Adding the above to a function and using that inside main()
The file is saved to disk with the correct name. Trying to double-click open in Finder produces nothing. Right-click to open produces nothing. Trying to move it to the trash with command+delete gives an error (error -43).
It sounds like the file goes to the trash, but as the file disappears from the folder, a new one is created with the same name in its place.
If I try to open in TextMate via File > Open, it opens as a blank file with no errors.
Since I can't get rid of the file, I have to delete the directory and create the directory again with the same name, or force delete in Terminal using rm. Restarting the system does not help. Relaunching Finder does nothing. Saving text files from other apps works fine. Directory is chmod 755.
If I copy an existing text file into the output directory, rename it to what the file is expected to be named, and let python overwrite the contents, it doesn't work either. The file modification date changes (and I see the file "blink" in Finder) but the contents remain the same. However, the file is not corrupted and opens normally.
If I do the same but delete the text inside of the copied file first, then run the script, python writes no data to the file, I can't open it by double-clicking on it, and I get error -43 again with the odd non-trashing behavior.
The strangest thing is this: if I add another with open() at the end of the script, and open the file that was just created and supposedly written to, and print its contents, the contents print. It's like the file contents are being removed or corrupted when the script ends. I tried closing the file inside the script even though it's not needed, but the same behavior persists.
Code:
Here's the code for writing:
import os

FORMAT = 'utf-8'
OUTPUT_DIR = '/Path/To/SaveFolder'

# as a function
def write_to_file(content, fpath, name):
    the_file = os.path.join(fpath, name)
    with open(the_file, 'w+', encoding=FORMAT) as t:
        t.write(content)

def main():
    print(f" Writing File...\n")
    filename = f"{pcode}_{author}_{title}_text.txt"
    write_to_file(multiline_var, OUTPUT_DIR, filename)

# or hard coded in main()
def main():
    print(f" Writing File...\n")
    filename = f"{pcode}_{author}_{title}_text.txt"
    the_file = os.path.join(OUTPUT_DIR, filename)
    with open(the_file, 'w+', encoding=FORMAT) as t:
        t.write(multiline_var)
I have tried using w, w+, wt, and wt+, both with and without encoding='utf-8'.
Here is an example of multi-line fstring variable:
# using triple quotes
multiline_var = f"""
[PROJ-{pcode}] {full_title} by {author}
{description}
{URL}
{DIVIDER_1}
{TEXT_BLURB}
Some text here and then {SOME_MORE_TEXT}
{DIVIDER_1}
{SOME_LINK}
"""
# or inside parens
multiline_var = (
    f"[PROJ-{pcode}] {full_title} by {author}\n"
    f"{description}\n\n"
    f"{URL}\n"
    f"{DIVIDER_1}\n"
    f"{TEXT_BLURB}\n\n"
    f"Some text here and then {SOME_MORE_TEXT}\n"
    f"{DIVIDER_1}\n\n"
    f"{SOME_LINK}"
)
Using exiftool on the text file shows the following, so it looks like the data is there but must be corrupted:
File Size : 1797 bytes
File Modification Date/Time : 2021:12:31 15:55:39-05:00
File Access Date/Time : 2021:12:31 15:58:13-05:00
File Inode Change Date/Time : 2021:12:31 15:55:39-05:00
File Permissions : -rw-r--r--
File Type : TXT
File Type Extension : txt
MIME Type : text/plain
MIME Encoding : utf-8
Byte Order Mark : No
Newlines : Unix LF
Line Count : 55
Word Count : 181
Not sure what I'm doing wrong. VScode shows no syntax errors in the script. There are no errors in Terminal when running the script. Have I made some simple mistake in the above code? Maybe the fstring variable is causing a problem?
Thanks to @bnaecker for leading me to the solution to this problem.
It appeared that when creating/writing to a text file with a long name, Python can corrupt it. Not sure why, as I save long names for images with Python image libraries all the time. Using a short name like "MyFile.txt" it worked just fine, but that was a red herring.
I have updated this post with my journey to the final solution for using the long names that are needed for my project, though I'm not sure why the problem exists.
First Attempts:
So far, attempts at creating the file with a short name and then renaming it to a long one have failed. I did notice that Python is locking the file it creates and never unlocks it. Not sure if this is the problem. Running os.system('chflags nouchg') does not work, not even with sudo, and not even in the Terminal doing it manually.
Using os.rename() in Python corrupts the file
Using os.system('mv oldFile.txt newFile.txt') corrupts the file
Manually using mv command in Terminal corrupts the file
Manually changing the filename in the Finder does not (wtf?)
I kept looking for workarounds but nothing did the job.
Round 2:
Progress!
After much tinkering, I discovered a hidden character inside the file. I ran cat /path/longfilename.txt in Terminal, selected and copied the output and pasted into VScode. Here is what I saw:
Somehow a hidden character is getting into the project code number.
Pasting it into a Unicode search engine, it came up as a ZERO WIDTH NO-BREAK SPACE, whose UTF-8 encoding is EF BB BF. However, when pasting this symbol into TextMate it shows up as <U+FEFF>, which is?...
The Byte Order Mark!
Opening the utf-8 text file in a hex editor also shows it starting with the bytes EF BB BF for the BOM.
Now, the text file being read and parsed at first has no blank lines to start the file, so I added a line break, and also tried adding some spaces. This time when writing the file I could open it, however, after sending it to the trash, the same behavior occurred and the file was broken again. It seems that because other corrupted versions were in the trash, it added the symbol back to the file name for some reason.
So what appears to be happening is this: when Python opens the text file I'm parsing, which has no line break at the top, it grabs the BOM from the file and adds it to the first variable, which holds the first line of the text file. Since that text is a number code that starts the file name, the BOM symbol ends up in the file name as well as in the code inside the text file.
Just... wow
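A small demonstration of the mechanism (the file name here is made up; any file saved as UTF-8 with a BOM behaves the same):
# Write a file the way the original editor apparently did: UTF-8 with a BOM.
with open('source.txt', 'w', encoding='utf-8-sig') as f:
    f.write('12345\nrest of the file\n')

# Read it back as plain utf-8: the BOM survives as an ordinary character.
with open('source.txt', 'r', encoding='utf-8') as f:
    first_line = f.readline()

print(repr(first_line))  # '\ufeff12345\n' -- U+FEFF glued to the code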
The Current Solution:
I have to leave a blank line at the start of the text file that I'm opening and parsing, and a simple line break won't do it. I have no idea why this is. I added some spaces for good measure, because randomly the BOM would be added to the variable and filename again. So far (knock on wood), as long as the first line of that initial file has some spaces and then a line break, and previous corrupted files have been deleted from the trash, a long file name can be used for all the files I'm creating and writing to without any problems.
This corruption even persists if I remove the encoding flag from both of the open functions I'm using (one to read and parse, the other to create and write).
If anyone knows why this is happening, please share. I've never seen it mentioned before. I'm not sure if it's a python 3.8 bug, a mac OS bug, the way TextMate wrote the original file, or a combination of these.
Correct Solution:
Thanks to @tripleee for the proper way to handle this, as I don't remember seeing it before, though I haven't been using Python for very long.
In order to ignore the BOM, reading in the text file to be parsed with encoding='utf-8-sig' does the job. Seems to be why it exists. :)
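In other words, something like this (a sketch; the file name and variable are illustrative):
# utf-8-sig strips a leading BOM if present and otherwise behaves like utf-8.
with open('source.txt', 'r', encoding='utf-8-sig') as f:
    pcode = f.readline().strip()  # no stray U+FEFF in the variable anymore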
Problem solved.
I'm trying to read a log file, written line by line, via readline.
I'm surprised to observe the following behaviour (code executed in the interpreter, but same happens when variations are executed from a file):
f = open('myfile.log')
line = f.readline()
while line:
    print(line)
    line = f.readline()
# --> This displays all lines the file contains so far, as expected
# At this point, I open the log file with a text editor (Vim),
# add a line, save and close the editor.
line = f.readline()
print(line)
# --> I'm expecting to see the new line, but this does not print anything!
Is this behaviour standard? Am I missing something?
Note: I know there are better ways to deal with an updated file, for instance with generators, as pointed out here: Reading from a frequently updated file. I'm just interested in understanding the issue with this precise use case.
For your specific use case, the explanation is that Vim uses a write-to-temp strategy: all writes go to a temporary file, which is then renamed over the original.
Your script, on the other hand, still has the old file open, so it never sees any change to it.
To test this further, instead of Vim, write to the file directly:
echo "Hello World" >> myfile.log
You should then see the new line from Python.
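You can confirm the rename behaviour with a quick sketch that compares inode numbers (if Vim replaced the file, the path now names a different inode):
import os

before = os.stat('myfile.log').st_ino
# ... edit and save myfile.log in Vim ...
after = os.stat('myfile.log').st_ino
print(before, after)  # different numbers mean the path names a new file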
For following your file, you can use this code:
import time

f = open('myfile.log')
while True:
    line = f.readline()
    if not line:
        time.sleep(0.1)
        continue
    print(line)
I'm using Python 2.7.15, Windows 7
Context
I wrote a script to read and tokenize each line of a FileZilla log file (specifications here) for the IP address of the host that initiated the connection to the FileZilla server. I'm having trouble parsing the log text field that follows the > character. The script I wrote uses the:
with open('fz.log','r') as rh:
    for lineno, line in enumerate(rh):
        pass
construct to read each line. That for-loop stopped prematurely when it encountered a log text field that contained the SOH and SUB characters. I can't show you the log file since it contains sensitive information but the crux of the problem can be reproduced by reading a textfile that contains those characters on a line.
My goal is to extract the IP addresses (which I can do using re.search()) but before that happens, I have to remove those control characters. I do this by creating a copy of the log file where the lines containing those control characters are removed. There's probably a better way, but I'm more curious why the for-loop just stops after encountering the control characters.
Reproducing the Issue
I reproduced the problem with this code:
if __name__ == '__main__':
    fn = 'writetest.txt'
    fn2 = 'writetest_NoControlChars.txt'
    # Create the problematic textfile
    with open(fn, 'w') as wh:
        wh.write("This line comes first!\n")
        wh.write("Blah\x01\x1A\n")  # Write Start-of-Header and Substitute characters to the line
        wh.write("This comes after!")
    # Try to read the file above, removing the SOH/SUB characters if encountered
    with open(fn, 'r') as rh:
        with open(fn2, 'w') as wh:
            for lineno, line in enumerate(rh):
                sline = line.translate(None, '\x01\x1A')
                wh.write(sline)
                print "Line #{}: {}".format(lineno, sline)
    print "Program executed."
Output
The code above creates 2 output files and produces the following in a console window:
Line #0: This line comes first!
Line #1: Blah
Program executed.
I step-debugged through the code in Eclipse and immediately after executing the
for lineno, line in enumerate(rh):
statement, rh, the handle for the opened file, was closed. I had expected it to move on to the third line, printing This comes after! to the console and writing it out to writetest_NoControlChars.txt, but neither happened. Instead, execution jumped to print "Program executed".
Picture of Local Variable values in Debug Console
You have to open this file in binary mode if you know it contains non-text data: open(fn, 'rb')
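Applied to the reproduction script above, only the modes need to change (a sketch):
# Binary mode bypasses the CRT text layer, so CTRL+Z (\x1A) is no longer
# treated as end-of-file on Windows.
with open(fn, 'rb') as rh:
    with open(fn2, 'wb') as wh:
        for lineno, line in enumerate(rh):
            wh.write(line.translate(None, '\x01\x1A'))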
I'm having problems with some code that loops through a bunch of .csvs and deletes the final line if there's nothing in it (i.e. files that end with the \n newline character)
My code works successfully on all files except one, which is the largest file in the directory at 11 GB. The second largest file is 4.5 GB.
The line it fails on is simply:
with open(path_str,"r+") as my_file:
and I get the following message:
IOError: [Errno 22] invalid mode ('r+') or filename: 'F:\\Shapefiles\\ab_premium\\processed_csvs\\a.csv'
The path_str I create using os.path.join to avoid errors, and I tried renaming the file to a.csv just to make sure there wasn't anything odd going on with the filename. This made no difference.
Even more strangely, the file is happy to open in r mode. I.e. the following code works fine:
with open(path_str,"r") as my_file:
I have tried navigating around the file in read mode, and it's happy to read characters at the start, end, and in the middle of the file.
Does anyone know of any limits on the size of file that Python can deal with, or why I might be getting this error? I'm on 64-bit Windows 7 with 16 GB of RAM.
The default I/O stack in Python 2 is layered over CRT FILE streams. On Windows these are built on top of a POSIX emulation API that uses file descriptors (which in turn is layered over the user-mode Windows API, which is layered over the kernel-mode I/O system, which itself is a deeply layered system based on I/O request packets; the hardware is down there somewhere...). In the POSIX layer, opening a file in _O_RDWR | _O_TEXT mode (as in "r+") requires seeking to the end of the file to remove CTRL+Z, if it's present. Here's a quote from the CRT's fopen documentation:
Open in text (translated) mode. In this mode, CTRL+Z is interpreted as
an end-of-file character on input. In files opened for reading/writing
with "a+", fopen checks for a CTRL+Z at the end of the file and
removes it, if possible. This is done because using fseek and ftell to
move within a file that ends with a CTRL+Z, may cause fseek to behave
improperly near the end of the file.
The problem here is that the above check calls the 32-bit _lseek (bear in mind that sizeof long is 4 bytes on 64-bit Windows, unlike most other 64-bit platforms), instead of _lseeki64. Obviously this fails for an 11 GB file. Specifically, SetFilePointer fails because it gets called with a NULL value for lpDistanceToMoveHigh. Here's the return value and LastErrorValue for the latter call:
0:000> kc 2
Call Site
KERNELBASE!SetFilePointer
MSVCR90!lseek_nolock
0:000> r rax
rax=00000000ffffffff
0:000> dt _TEB #$teb LastErrorValue
ntdll!_TEB
+0x068 LastErrorValue : 0x57
The error code 0x57 is ERROR_INVALID_PARAMETER. This is referring to lpDistanceToMoveHigh being NULL when trying to seek from the end of a large file.
To work around this problem with CRT FILE streams, I recommend opening the file using io.open instead. This is a backported implementation of Python 3's I/O stack. It always opens files in raw binary mode (_O_BINARY), and it implements its own buffering and text-mode layers on top of the raw layer.
>>> import io, os
>>> f = io.open('a.csv', 'r+')
>>> f
<_io.TextIOWrapper name='a.csv' encoding='cp1252'>
>>> f.buffer
<_io.BufferedRandom name='a.csv'>
>>> f.buffer.raw
<_io.FileIO name='a.csv' mode='rb+'>
>>> f.seek(0, os.SEEK_END)
11811160064L
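Applied to the original code, the fix would presumably be just (a sketch; path_str as in the question):
import io
import os

with io.open(path_str, 'r+') as my_file:
    # io.open seeks with 64-bit offsets, so the 11 GB file is no problem.
    my_file.seek(0, os.SEEK_END)
    print my_file.tell()
    # ... trim the trailing newline as before ...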
I am trying to set up a mail log parser that will pull out specific lines into another file, which will then get rsync'd to a remote server. The problem I am having is that when rsync reads the file being written, it seems to cause the parser to stop functioning. I believe this is because the parser is emulating a tail -f as maillog is being written consistently.
So: how do I allow rsync to touch the file I'm writing with this code (result_file), while still allowing it to follow the end of the maillog looking for new lines:
#! /usr/bin/python
import time, re, sys
result_file = open('/var/log/mrp_mail_parsed.log', 'a+')
def tail(logfile):
    logfile.seek(0, 2)
    while True:
        line = logfile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

if __name__ == '__main__':
    logfile = open('/var/log/maillog', 'r')
    logline = tail(logfile)
    for line in logline:
        match = re.search(r'.+postfix-mrp.+', line)
        if match:
            result_file.write(line)
            result_file.flush()
I don't know who's writing the file, or how, so I can't be sure, but I'd give better than even odds that your problem is this:
If the file isn't being appended to in-place, but is instead being rewritten, your code will stop tracking the file. To test this:
import sys
import time
def tail(logfile):
    logfile.seek(0, 2)
    while True:
        line = logfile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

with open(sys.argv[1]) as f:
    for line in tail(f):
        print(line.rstrip())
Now:
$ touch foo
$ python tailf.py foo &
$ echo "hey" >> foo
hey
$ echo "hey" > foo
To see what's happening better, try checking the inode and size via stat. As soon as the path refers to a different file than the one your script has open, your script is now watching a file that nobody else will ever touch again.
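A sketch of that check (file_was_replaced is a made-up helper name):
import os

def file_was_replaced(f, path):
    # Compare the inode the open handle refers to with the inode
    # the path currently names.
    try:
        return os.fstat(f.fileno()).st_ino != os.stat(path).st_ino
    except OSError:
        # The path may briefly not exist in the middle of a rename.
        return True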
It's also possible that someone is truncating and rewriting the file in-place. This won't change the inode, but it will still mean that you won't read anything, because you're trying to read from a position past the end of the file.
I have no idea whether the file being rsync'd is causing this, or whether that's just a coincidence. Without knowing what rsync command you're running, or seeing whether the file is being replaced or the file is being truncated and rewritten when that command runs, all we can do is guess.
I don't believe rsync is causing your problems: A separate process reading the file shouldn't affect the writer. You can easily test this by pausing rsync.
I'm guessing the problem is with Python's handling of file reads when you hit end of file. A crude way that's guaranteed to work is to remember the offset at the last EOF (using tell()). For each new read, reopen the file and seek to the remembered offset.
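A sketch of that approach (assuming a plain-text log; the path is illustrative):
import time

offset = 0
while True:
    # Reopen on every pass; this survives the file being replaced or truncated.
    with open('/var/log/maillog', 'r') as f:
        f.seek(0, 2)               # jump to the end to measure the file
        if f.tell() < offset:      # file shrank since last pass: it was truncated
            offset = 0
        f.seek(offset)
        line = f.readline()
        while line:
            print(line.rstrip())   # or: match and write to result_file
            line = f.readline()
        offset = f.tell()          # remember where EOF was this time
    time.sleep(0.1)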