Reading a dynamically updated log file via readline - python

I'm trying to read a log file, written line by line, via readline.
I'm surprised to observe the following behaviour (code executed in the interpreter, but the same happens when variations are executed from a file):
f = open('myfile.log')
line = f.readline()
while line:
    print(line)
    line = f.readline()
# --> This displays all lines the file contains so far, as expected
# At this point, I open the log file with a text editor (Vim),
# add a line, save and close the editor.
line = f.readline()
print(line)
# --> I'm expecting to see the new line, but this does not print anything!
Is this behaviour standard? Am I missing something?
Note: I know there are better ways to deal with an updated file, for instance with generators as pointed out here: Reading from a frequently updated file. I'm just interested in understanding the issue with this precise use case.

For your specific use case, the explanation is that Vim uses a write-to-temp strategy: all writing operations are performed on a temporary file, which then replaces the original when you save.
Your script, on the other hand, still reads from the original file, so it never sees the change.
To test this further, instead of Vim, try appending to the file directly:
echo "Hello World" >> myfile.log
You should then see the new line from Python.
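If you want to confirm the replace-on-write behaviour from Python itself, here is a minimal sketch (my own illustration; it assumes a Unix-like system where os.stat reports inode numbers, and note that Vim's 'backupcopy' setting can change whether the inode actually changes):
import os

# Record the inode of the log file before editing it in Vim.
before = os.stat('myfile.log').st_ino
input("Edit myfile.log in Vim, save and quit, then press Enter... ")
after = os.stat('myfile.log').st_ino

if before != after:
    print("Inode changed: Vim replaced the file; the old handle is stale.")
else:
    print("Same inode: the file was modified in place.")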

To follow your file as it grows, you can use this code:
import time

f = open('myfile.log')
while True:
    line = f.readline()
    if not line:
        time.sleep(0.1)  # no new line yet; wait and poll again
        continue
    print(line, end='')

Related

Python text command doesn't output the string data on the actual text file

I am just learning about the text file functions in Python 3, using the websites https://www.w3schools.com/python/python_file_write.asp and https://www.geeksforgeeks.org/reading-writing-text-files-python/. Although the program seems correct, the text written by the Python program doesn't show up in the actual text file.
Is there any mistake in the program below?
My Python version is 3.7.5.
File = open("NewTextFile.Txt", "a")
string = "ABC"
File.write(string)
File.close
You forgot to put () after File.close, so the method is never actually called and the file is not properly closed. Try putting ().
It is often recommended to use a with clause:
with open('NewTextFile.Txt', 'a') as file:
    string = 'ABC'
    file.write(string)
Note that you don't need to explicitly close the file here. The file is kept open within the with block; once your program exits the block, the file is closed automatically, which makes your program less prone to mistakes.
For more information, see a relevant python doc:
It is good practice to use the with keyword when dealing with file objects. The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point.
— https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

How to detect a file has been truncated while reading

I'm reading lines from a group of files (log files) following them as they are written using pyinotify.
I'm opening and reading the files with python native methods:
file = open(self.file_path, 'r')
# ... later
line = file.readline()
This is generally stable and can handle the file being deleted and re-created. pyinotify will notify the unlink and subsequent link.
However some log files are not being deleted. Instead they are being truncated and new content written to the beginning of the same file.
I'm having trouble reliably detecting when this has occurred, since pyinotify will simply report only a write. The only evidence I currently get is that pyinotify reports a write and readline() returns an empty string. BUT, it is possible that two subsequent writes could trigger the same behavior.
I have thought of comparing the file's size to file.tell(), but according to the documentation tell() produces an opaque number, so it apparently can't be trusted as a count of bytes.
Is there a simple way to detect a file has been truncated while reading from it?
Edit:
Truncating a file can be simulated with simple shell commands:
echo hello > test.log
echo hello >> test.log
# Truncate test.log
echo goodbye > test.log
To complement this, a simple Python script can be used to confirm that file.tell() does not decrease when the file is truncated:
foo = open('./test.log', 'r')
line = foo.readline()
while line != '':
    print(foo.tell())
    print(line)
    line = foo.readline()
# Put a breakpoint on the following line and
# truncate the file before it executes
print(foo.tell())
Use os.lseek(file.fileno(), 0, os.SEEK_CUR) to obtain the byte offset without moving the file pointer, and compare it against the file's current size. You can't really use the regular file interface to find out, not least because it may have buffered text (that no longer exists) that it hasn't yet made visible to Python. If the file is not a byte stream (e.g., the default open in Python 3), it could even be stuck in the middle of a multibyte character, unable to proceed even if the file immediately grew back past your file offset.
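A minimal sketch of that check (my own; it assumes a Unix-like OS and that the buffering caveat above is acceptable):
import os

def was_truncated(f):
    # Byte offset of the underlying descriptor, obtained without moving it.
    offset = os.lseek(f.fileno(), 0, os.SEEK_CUR)
    # Current size of the file the descriptor refers to.
    size = os.fstat(f.fileno()).st_size
    # If our offset is past the end, the file must have shrunk under us.
    return offset > size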

Why does '\x01\x1A' (Start-of-Header and Substitute control characters) in a textfile line stop a for-loop prematurely?

I'm using Python 2.7.15, Windows 7
Context
I wrote a script to read and tokenize each line of a FileZilla log file (specifications here) for the IP address of the host that initiated the connection to the FileZilla server. I'm having trouble parsing the log text field that follows the > character. The script I wrote uses the:
with open('fz.log', 'r') as rh:
    for lineno, line in enumerate(rh):
        pass
construct to read each line. That for-loop stopped prematurely when it encountered a log text field that contained the SOH and SUB characters. I can't show you the log file since it contains sensitive information, but the crux of the problem can be reproduced by reading a text file that contains those characters on a line.
My goal is to extract the IP addresses (which I can do using re.search()) but before that happens, I have to remove those control characters. I do this by creating a copy of the log file where the lines containing those control characters are removed. There's probably a better way, but I'm more curious why the for-loop just stops after encountering the control characters.
Reproducing the Issue
I reproduced the problem with this code:
if __name__ == '__main__':
    fn = 'writetest.txt'
    fn2 = 'writetest_NoControlChars.txt'
    # Create the problematic textfile
    with open(fn, 'w') as wh:
        wh.write("This line comes first!\n")
        wh.write("Blah\x01\x1A\n")  # Write Start-of-Header and Substitute control characters to the line
        wh.write("This comes after!")
    # Try to read the file above, removing the SOH/SUB characters if encountered
    with open(fn, 'r') as rh:
        with open(fn2, 'w') as wh:
            for lineno, line in enumerate(rh):
                sline = line.translate(None, '\x01\x1A')
                wh.write(sline)
                print "Line #{}: {}".format(lineno, sline)
    print "Program executed."
Output
The code above creates 2 output files and produces the following in a console window:
Line #0: This line comes first!
Line #1: Blah
Program executed.
I step-debugged through the code in Eclipse, and immediately after executing the
for lineno, line in enumerate(rh):
statement, rh, the handle for the opened file, was closed. I had expected it to move on to the third line, printing This comes after! to the console and writing it out to writetest_NoControlChars.txt, but neither happened. Instead, execution jumped to print "Program executed.".
You have to open the file in binary mode if you know it contains non-text data: open(fn, 'rb'). In text mode on Windows, \x1A (Ctrl-Z, the old DOS end-of-file marker) is treated as EOF, so reading stops there; binary mode disables that translation.
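A minimal sketch of that fix (Python 2, reusing the writetest.txt file created above):
# Binary mode stops Windows from treating \x1A (Ctrl-Z) as end-of-file.
with open('writetest.txt', 'rb') as rh:
    for lineno, line in enumerate(rh):
        # Strip the SOH/SUB control characters before further processing.
        sline = line.translate(None, '\x01\x1A')
        print "Line #{}: {}".format(lineno, sline.rstrip())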

Allow Rsync to read file open by python process without python process failing

I am trying to set up a mail log parser that pulls specific lines out into another file, which then gets rsync'd to a remote server. The problem I am having is that when rsync reads the file being written, the parser seems to stop functioning. I believe this is because the parser is emulating tail -f, as the maillog is written to continuously.
So: how do I allow rsync to touch the file my code is writing (result_file), while the script keeps following the end of the maillog looking for new lines:
#!/usr/bin/python
import time, re, sys

result_file = open('/var/log/mrp_mail_parsed.log', 'a+')

def tail(logfile):
    logfile.seek(0, 2)
    while True:
        line = logfile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

if __name__ == '__main__':
    logfile = open('/var/log/maillog', 'r')
    logline = tail(logfile)
    for line in logline:
        match = re.search(r'.+postfix-mrp.+', line)
        if match:
            result_file.write(line)
            result_file.flush()
I don't know who's writing the file, or how, so I can't be sure, but I'd give better than even odds that your problem is this:
If the file isn't being appended to in-place, but is instead being rewritten, your code will stop tracking the file. To test this:
import sys
import time

def tail(logfile):
    logfile.seek(0, 2)
    while True:
        line = logfile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

with open(sys.argv[1]) as f:
    for line in tail(f):
        print(line.rstrip())
Now:
$ touch foo
$ python tailf.py foo &
$ echo "hey" >> foo
hey
$ echo "hey" > foo
To see what's happening more clearly, try checking the inode and size via stat. As soon as the path refers to a different file than the one your script has open, your script is watching a file that nobody else will ever touch again.
It's also possible that someone is truncating and rewriting the file in place. This won't change the inode, but it will still mean you won't read anything, because you're trying to read from a position past the end of the file.
I have no idea whether the file being rsync'd is causing this, or whether that's just a coincidence. Without knowing what rsync command you're running, or seeing whether the file is replaced, or truncated and rewritten, when that command runs, all we can do is guess.
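As a rough sketch of that stat check (the function name is mine, and it assumes the file is opened in byte mode so tell() is a real byte offset):
import os

def replaced_or_truncated(path, f):
    path_stat = os.stat(path)           # what the path points at now
    handle_stat = os.fstat(f.fileno())  # what we actually have open
    if path_stat.st_ino != handle_stat.st_ino:
        return 'replaced'   # the path refers to a different file now
    if path_stat.st_size < f.tell():
        return 'truncated'  # same file, but shorter than our position
    return None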
I don't believe rsync is causing your problems: a separate process reading the file shouldn't affect the writer. You can easily test this by pausing rsync.
I'm guessing the problem is with Python's handling of file reads when you hit end-of-file. A crude way that's guaranteed to work is to remember the offset at the last EOF (using tell()) and, for each new read, reopen the file and seek to the remembered offset.
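A rough sketch of that reopen-and-seek idea (the function name and polling interval are my own choices):
import time

def tail_by_reopen(path, interval=0.5):
    offset = 0
    while True:
        with open(path, 'r') as f:
            f.seek(offset)        # resume where the last read left off
            for line in f:
                yield line
            offset = f.tell()     # remember where EOF was this round
        time.sleep(interval)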

Find a line of text with a pattern using Windows command line or Python

I need to run a command-line tool that verifies a file and displays a bunch of information about it. I can export this information to a txt file, but it includes a lot of extra data. I just need one line from the file:
"The signature is timestamped: Thu May 24 17:13:16 2012"
The time could be different; I just need to extract this data and put it into a file. Is there a good way to do this from the command line itself, or maybe with Python? I plan on using Python to locate and download the file to be verified, then run the command-line tool to verify it, get that data, and send it in an email.
This is on a Windows PC.
Thanks for your help.
You don't need to use Python to do this. If you're using a Unix environment, you can use fgrep right from the command-line and redirect the output to another file.
fgrep "The signature is timestamped: " input.txt > output.txt
On Windows you can use:
find "The signature is timestamped: " < input.txt > output.txt
You mention the command-line utility "displays" some information, so it may well be printing to stdout. One way, then, is to run the utility from within Python and capture its output.
import subprocess

# Try with some basic commands here maybe...
file_info = subprocess.check_output(['your_command_name', 'input_file'])
for line in file_info.splitlines():
    # print line here to see what you get
    if line.startswith('The signature is timestamped: '):
        print line  # do something here
This should fit in nicely with the "use Python to download and locate" part: urllib.urlretrieve can download the file (possibly under a temporary name), the command-line utility can then be run on that temp file to get the details, and smtplib can send the email.
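For illustration only, a hypothetical end-to-end sketch in Python 2; the URL, tool name, and e-mail addresses below are placeholders, not details from the question:
import subprocess
import smtplib
import urllib
from email.mime.text import MIMEText

# Download the file to verify (placeholder URL), keeping the local path.
path, _ = urllib.urlretrieve('http://example.com/file-to-verify')

# Run the (placeholder) verification tool and keep only the timestamp line.
output = subprocess.check_output(['your_command_name', path])
stamp_lines = [l for l in output.splitlines()
               if l.startswith('The signature is timestamped: ')]

# E-mail the result (placeholder addresses; a local SMTP server is assumed).
msg = MIMEText('\n'.join(stamp_lines))
msg['Subject'] = 'Signature timestamp'
msg['From'] = 'me@example.com'
msg['To'] = 'you@example.com'
s = smtplib.SMTP('localhost')
s.sendmail('me@example.com', ['you@example.com'], msg.as_string())
s.quit()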
In Python you can do something like this:
timestamp = []
with open('./filename', 'r') as f:
    timestamp = [line for line in f.readlines() if 'The signature is timestamped: ' in line]
I haven't tested this, but I think it'd work. Not sure if there's a better solution.
I'm not too sure about the exact syntax of the exported file you have, but Python's readlines() function might be helpful for this.
h = open(pathname, 'r')  # opens the file for reading
for line in h.readlines():
    print line  # this will print out the contents of each line of the text file
If the text file has the same format every time, the rest is easy; if it is not, you could do something like
for line in h.readlines():
    # the fourth whitespace-separated token of the target line is 'timestamped:'
    if len(line.split()) > 3 and line.split()[3] == 'timestamped:':
        print line
        output_string = line
As for writing to a file, you'll want to open the file for writing with h = open(name, "w"), then use h.write(output_string) to write the line out to a text file.
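Putting those pieces together, a minimal sketch (the filenames are placeholders):
output_string = ''
h = open('report.txt', 'r')   # the exported report from the tool
for line in h.readlines():
    parts = line.split()
    # the fourth token of the target line is 'timestamped:'
    if len(parts) > 3 and parts[3] == 'timestamped:':
        output_string = line
h.close()

out = open('timestamp.txt', 'w')  # file to receive the extracted line
out.write(output_string)
out.close()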
