Effectively reading a large, active Python log file - python

When my Python script is writing a large amount of logs to a text file line by line using the Python built-in logging library, in my Delphi-powered Windows program I want to effectively read all newly added logs (lines).
While the Python script is logging to the file, my Windows program will keep a read-only file handle to that log file. I'll use the Windows API to be notified when the log file changes; once it changes, the program will read the newly appended lines.
I'm new to Python, so do you see any possible problem with this approach? Does the Python logging library lock the entire log file? Thanks!

It depends on the logging handler you use, of course, but as you can see from the source code, logging.FileHandler does not currently create any file locks. By default, it opens files in 'a' (append) mode, so as long as your Windows calls can handle that, you should be fine.
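As a quick illustration (a minimal sketch; the path and logger name are arbitrary): because FileHandler opens the file in append mode and takes no lock, a separate read-only handle can read the file while the handler still has it open.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "app.log")

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)   # opened in 'a' (append) mode by default
logger.addHandler(handler)

logger.info("first line")
logger.info("second line")
handler.flush()

# A second, independent handle can read the file while the handler is still open.
with open(log_path, "r") as f:
    lines = f.read().splitlines()

handler.close()
```

On Windows this works because FileHandler does not request exclusive access; your Delphi program just needs to open its own handle with read sharing allowed.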

As ʇsәɹoɈ commented, the standard FileHandler logger does not lock the file, so it should work. However, if for some reason you cannot keep your handle on the file open, I'd recommend having your other app open the file periodically, record the position it has read to, and then seek back to that point later. I know the Linux DenyHosts program uses this approach when dealing with log files that it has to monitor for a long period of time. In those situations, simply holding a handle open isn't feasible, since directories may move, the file may get rotated out, etc. It does complicate things, though, in that you then have to store the filename plus the read position in persistent state somewhere.
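A minimal sketch of that record-and-seek approach (the helper name and the shrink check are my own additions, not taken from DenyHosts):

```python
import os
import tempfile

def read_new_lines(path, last_pos):
    """Return lines appended since last_pos, plus the new offset to save."""
    if os.path.getsize(path) < last_pos:   # file shrank: truncated or rotated
        last_pos = 0
    with open(path, "r") as f:
        f.seek(last_pos)
        return f.readlines(), f.tell()

# Demo: read, append, then read only the new part.
log = os.path.join(tempfile.mkdtemp(), "app.log")
with open(log, "w") as f:
    f.write("one\n")
lines, pos = read_new_lines(log, 0)      # lines == ['one\n']
with open(log, "a") as f:
    f.write("two\n")
lines, pos = read_new_lines(log, pos)    # lines == ['two\n']
```

In a real monitor you would persist `pos` (and the filename) between runs, exactly as the answer describes.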

Related

Python2: How to parse a logfile that is held open in another process reliably?

I'm trying to write a Python script that will parse a logfile produced by another daemon. This is being done on Linux. I want to be able to parse the log file reliably.
In other words, periodically, we run a script that reads the log file, line by line, and does something with each line. The logging script would need to see every line that may end up in the log file. It could run say once per minute via cron.
Here's the problem that I'm not sure exactly how to solve: since the other process has a write handle to the file, it could write to the file at the same time that I am reading from it.
Also, every so often we would want to clear this logfile so its size does not get out of control. But the process producing the log file has no way to clear the file other than regularly stopping, truncating or deleting the file, and then restarting. (I feel like logrotate has some method of doing this, but I don't know if logrotate depends on the daemon being aware, or if it's actually closing and restarting daemons, etc. Not to mention I don't want other logs rotated, just this one specific log; and I don't want this script to require other possible users to setup logrotate.)
Here's the problems:
Since the logger process could write to the file while I already have an open file handle, I feel like I could easily miss records in the log file.
If the logger process were to decide to stop, clear the log file, and restart, and the log analyzer didn't run at exactly the same time, log entries would be lost. Similarly, if the log analyzer causes the logger to stop logging while it analyzes, information could also be lost that is dropped because the logger daemon isn't listening.
If I were to use a method like "note the size of the file since last time and seek there if the file is larger", what would happen if, between runs, the logger reset the logfile but then logged even more than it contained last time? E.g. we execute a log-analysis run and get 50 log entries, so we mark that we have read 50 entries. Next time we run, we see 60 entries, but all 60 are brand new; the file had been cleared and restarted since the last run. Instead we end up seeking to entry 51 and missing 50 entries! Either way, it doesn't solve the problem of needing to periodically clear the log.
I have no control over the logger daemon. (Imagine we're talking about something like syslog here. It's not syslog but same idea - a process that is pretty critical holds a logfile open.) So I have no way to change its logging method. It starts at init time, opens a log file, and writes to it. We want to be able to clear that logfile AND analyze it, making sure we get every log entry through the Python script at some point.
The ideal scenario would be this:
The log daemon runs at system init.
Via cron, the Python log analyzer runs once per minute (or once per 5 minutes or whatever is deemed appropriate)
The log analyzer collects every single line from the current log file and immediately truncates it, causing the log file to be blanked out. Python maintains the original contents in a list.
The logger then continues to go about its business, with the now blank file. In the mean time, Python can continue to parse the entries at its leisure from the Python list in memory.
I've very, very vaguely studied FIFOs, but am not sure if that would be appropriate. In that scenario the log analyzer would run as a daemon itself, while the original logger writes to a FIFO. I have very little knowledge in this area, however, and don't know if it'd be a solution or not.
So I guess the question really is twofold:
How to reliably read EVERY entry written to the log from Python? Including if the log grows, is reset, etc.
How, if possible, to truncate a file that has an open write handle? (Ideally, this would be something I could do from Python, e.g. logfile.readlines(); logfile.truncate(), so that no entries would get lost. But unless the logger process were well aware of this, it seems like it'd cause more problems than it solves.)
Thanks!
I don't see any particular reason why you shouldn't be able to read a log file created by syslogd. You say you are using some process similar to syslog, and that this process keeps your log file open? Since you are asking for ideas, I'd recommend using syslog! http://pic.dhe.ibm.com/infocenter/tpfhelp/current/index.jsp?topic=%2Fcom.ibm.ztpf-ztpfdf.doc_put.cur%2Fgtpc1%2Fhsyslog.html
It is working anyway – use it. An easy way to write to the log is the logger command:
logger "MYAP: hello"
In a Python script you can do it like this:
import os
os.system('logger "MYAP: hello"')
Also remember you can actually configure syslogd: http://pic.dhe.ibm.com/infocenter/tpfhelp/current/index.jsp?topic=%2Fcom.ibm.ztpf-ztpfdf.doc_put.cur%2Fgtpc1%2Fconstmt.html
Also, about your problem with empty logs – syslog does not clear logs itself. There are other tools for that; on Debian, for example, logrotate is used. In this scenario, if your log is empty, you can check the backup file created by logrotate.
Since it looks like your problem is in the logging tool, my advice would be to use syslog for logging and another tool for rotating the logs. Then you can easily parse the logs. And if by any means (I don't know if it is even possible with syslog) you miss some data, remember you will get it in the next iteration anyway ;)
Another idea would be to copy your logfile and work with the copy...
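That copy-first idea is nearly a one-liner with the stdlib. A sketch, using a temporary stand-in for the daemon's log (the paths are illustrative):

```python
import os
import shutil
import tempfile

# Stand-in for the daemon's live log file.
workdir = tempfile.mkdtemp()
live_log = os.path.join(workdir, "daemon.log")
with open(live_log, "w") as f:
    f.write("entry 1\nentry 2\n")

# Snapshot the log, then parse the copy at leisure while the daemon
# keeps appending to the original.
snapshot = live_log + ".snapshot"
shutil.copy(live_log, snapshot)
with open(snapshot) as f:
    entries = f.read().splitlines()
```

Note this doesn't solve the truncation problem by itself; it only decouples parsing from the live file.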

When does Python write a file to disk?

I have a library that interacts with a configuration file. When the library is imported, the initialization code reads the configuration file, possibly updates it, and then writes the updated contents back to the file (even if nothing was changed).
Very occasionally, I encounter a problem where the contents of the configuration file simply disappear. Specifically, this happens when I run many invocations of a short script (using the library) back-to-back, thousands of times. It never happens in the same directories, which leads me to believe it's a somewhat random problem, specifically a race condition with I/O.
This is a pain to debug, since I can never reliably reproduce the problem and it only happens on some systems. I have a suspicion about what might happen, but I wanted to see if my picture of file I/O in Python is correct.
So the question is: when does a Python program actually write file contents to disk? I thought that the contents would make it to disk by the time the file was closed, but then I can't explain this error. When Python closes a file, does it flush the contents to the disk itself, or simply queue them up with the filesystem? Is it possible for file contents to be written to disk after Python terminates? And can I avoid this issue by using fp.flush(); os.fsync(fp.fileno()) (where fp is the file handle)?
If it matters, I'm programming on a Unix system (Mac OS X, specifically). Edit: Also, keep in mind that the processes are not running concurrently.
Appendix: Here is the specific race condition that I suspect:
Process #1 is invoked.
Process #1 opens the configuration file in read mode and closes it when finished.
Process #1 opens the configuration file in write mode, erasing all of its contents. The erasing of the contents is synced to the disk.
Process #1 writes the new contents to the file handle and closes it.
Process #1: Upon closing the file, Python tells the OS to queue writing these contents to disk.
Process #1 closes and exits
Process #2 is invoked
Process #2 opens the configuration file in read mode, but new contents aren't synced yet. Process #2 sees an empty file.
The OS finally finishes writing the contents to disk, after process 2 reads the file
Process #2, thinking the file is empty, sets defaults for the configuration file.
Process #2 writes its version of the configuration file to disk, overwriting the last version.
It is almost certainly not Python's fault. If Python closes the file, or exits cleanly (rather than being killed by a signal), then the OS will have the new contents for the file. Any subsequent open should return the new contents. There must be something more complicated going on. Here are some thoughts.
What you describe sounds more likely to be a filesystem bug than a Python bug, and a filesystem bug is pretty unlikely.
Filesystem bugs are far more likely if your files actually reside in a remote filesystem. Do they?
Do all the processes use the same file? Do "ls -li" on the file to see its inode number, and see if it ever changes. In your scenario, it should not. Is it possible that something is moving files, or moving directories, or deleting directories and recreating them? Are there symlinks involved?
Are you sure that there is no overlap in the running of your programs? Are any of them run from a shell with "&" at the end (i.e. in the background)? That could easily mean that a second one is started before the first one is finished.
Are there any other programs writing to the same file?
This isn't your question, but if you need atomic changes (so that any program running in parallel only sees either the old version or the new one, never the empty file), the way to achieve it is to write the new content to another file (e.g. "foo.tmp"), then do os.rename("foo.tmp", "foo"). Rename is atomic.
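Sketched in code (the function name is mine; the fsync is included per the question's suggestion, and on Windows you'd use os.replace rather than os.rename):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write `data` so that readers see either the old contents or the
    new contents, never an empty or partial file."""
    dirpath = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirpath)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())              # force the data to disk
        os.rename(tmp, path)                  # atomic on POSIX; use os.replace on Windows
    except BaseException:
        os.unlink(tmp)
        raise
```

If each invocation of the script wrote its updated configuration this way, no reader could ever observe the truncated intermediate state from step 3 of the suspected race.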

Pickup directories: How not to pickup files that are still being written?

I have a Python script that checks on a pickup directory and processes any files that it finds, and then deletes them.
How can I make sure not to pickup a file that is still being written by the process that drops files in that directory?
My test case is pretty simple. I copy-paste 300 MB of files into the pickup directory, and frequently the script will grab a file that's still being written. It operates on only the partial file, then deletes it. This fires off a file-operation error in the OS, since the file that was being written to has disappeared.
I've tried acquiring a lock on the file (using the FileLock module) before I open/process/delete it. But that hasn't helped.
I've considered checking the modification time on the file to avoid anything within X seconds of now. But that seems clunky.
My test is on OSX, but I'm trying to find a solution that will work across the major platforms.
I see a similar question here (How to check if a file is still being written?), but there was no clear solution.
Thank you
As a workaround, you could listen to file modified events (watchdog is cross-platform). The modified event (on OS X at least) isn't fired for each write, it's only fired on close. So when you detect a modified event you can assume all writes are complete.
Of course, if the file is being written in chunks, and being saved after each chunk this won't work.
One solution to this problem would be to change the program writing the files to write the files to a temporary file first, and then move that temporary file to the destination when it is done. On most operating systems, when the source and destination are on the same file system, move is atomic.
If you have no control over the writing portion, about all you can do is watch the file yourself, and when it stops growing for a certain amount of time, call it good. I have to use that method myself, and found 40 seconds is safe for my conditions.
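That wait-until-it-stops-growing idea might look like this (a sketch; the timing values below are placeholders for the demo, whereas the answer above settled on 40 seconds in practice):

```python
import os
import tempfile
import time

def wait_until_stable(path, quiet_seconds, poll):
    """Return the file's size once it hasn't grown for `quiet_seconds`."""
    last_size = -1
    stable_since = time.monotonic()
    while True:
        size = os.path.getsize(path)
        if size != last_size:          # still growing: restart the quiet timer
            last_size = size
            stable_since = time.monotonic()
        elif time.monotonic() - stable_since >= quiet_seconds:
            return size
        time.sleep(poll)

# Demo with a file that is already complete.
path = os.path.join(tempfile.mkdtemp(), "incoming.bin")
with open(path, "wb") as f:
    f.write(b"payload")
size = wait_until_stable(path, quiet_seconds=0.1, poll=0.02)
```

The right quiet period depends entirely on how bursty the writer is, so it has to be tuned per environment.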
Each OS will have a different solution, because file locking mechanisms are not portable.
On Windows, you can use OS locking.
On Linux you can have a peek at open files (similarly to how lsof does it), and if the file is open, leave it alone.
Have you tried opening the file before copying it? If the file is still in use, then open() should throw an exception.
try:
    with open(filename, "rb"):
        pass
    # Copy the file
except IOError:
    pass  # Don't copy

How do I watch a folder for changes and when changes are done using Python?

I need to watch a folder for incoming files. I did that with the help of the following:
How do I watch a file for changes?
The problem is that the files being moved are pretty big (10 GB), and I want to be notified when all files are done moving.
I tried comparing the size of the folder every 20 seconds, but each file shows its correct size even though Windows shows that it is still moving.
I am using Windows with Python.
I found a solution using open() and waiting for an IO exception: if the file is still being moved, I get errno 13.
You should take a look at this link:
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
There you can see a comparison of the method you mention (simple polling) with two other Windows-specific techniques which, in my opinion, offer a much better solution to your problem!
Otherwise, if you are using Linux, there's inotify and its Python wrapper:
Pyinotify is a pure Python module used for monitoring filesystem events on Linux platforms through inotify.
Here: http://trac.dbzteam.org/pyinotify
If you have control over the process importing the files, I would put a lock file in place when starting to copy files in, and remove it when done. By lock file I mean an empty temporary file which is just there to indicate that you are copying a file. Then your Python script can check for the existence of the lock file.
You may be able to use os.stat() to monitor the mtime of the file. However be aware that under various network conditions, the copy may stall momentarily and so the mtime is not updated for a few seconds, so you need to make allowance for this.
Another option is to try opening the file with exclusive read/write access, which should fail under Windows if the file is still open in the other process.
The most reliable method would be to write your own program to move the files.
try checking for the last-modified time change instead of the filesize during your poll.

Does python have hooks into EXT3

We have several cron jobs that ftp proxy logs to a centralized server. These files can be rather large and take some time to transfer. Part of the requirement of this project is to provide a logging mechanism in which we log the success or failure of these transfers. This is simple enough.
My question is, is there a way to check if a file is currently being written to? My first solution was to just check the file size twice within a given timeframe and check the file size. But a co-worker said that there may be able to hook into the EXT3 file system via python and check the attributes to see if the file is currently being appended to. My Google-Fu came up empty.
Is there a module for EXT3 or something else that would allow me to check the state of a file? The server is running Fedora Core 9 with EXT3 file system.
No need for ext3-specific hooks; just check lsof, or more exactly /proc/<pid>/fd/* and /proc/<pid>/fdinfo/* (that's where lsof gets its info, AFAICT). There you can check whether the file is open, whether it's writable, and the 'cursor' position.
That's not the whole picture, though; anything more happens in process space, in the writing process's stdlib buffers. Most writes are buffered, and the kernel only sees bigger chunks of data, so an 'ext3-aware' monitor wouldn't see that either.
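For reference, a Linux-only sketch of that /proc scan (the helper name is mine; this is roughly where lsof gets its information):

```python
import os
import tempfile

def pids_with_file_open(path):
    """Return the PIDs of processes that hold `path` open, by scanning
    /proc/<pid>/fd/* (Linux-only)."""
    target = os.path.realpath(path)
    holders = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = "/proc/%s/fd" % pid
        try:
            for fd in os.listdir(fd_dir):
                if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(pid))
                    break
        except OSError:   # process exited, or we lack permission to inspect it
            continue
    return holders

# Demo: this very process is holding the file below open.
held = tempfile.NamedTemporaryFile()
holders = pids_with_file_open(held.name)
```

Without root you will typically only see your own processes, which is fine if the FTP receiver runs as the same user.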
There's no ext3 hooks to check what you'd want directly.
I suppose you could dig through the source code of the Linux fuser command, replicate the part that finds which processes have a file open, and watch that resource. When no one has the file open any longer, it's done transferring.
Another approach:
Your cron jobs should signal that they're finished.
We have our cron jobs that transfer files write an empty filename.finished after filename has been transferred. Another approach is to transfer to a temporary name, e.g. filename.part, and then rename it to filename; renaming is atomic. In both cases, you check repeatedly for the presence of filename or filename.finished.
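The marker-file convention can be checked like this on the receiving side (a sketch; the helper name and file names are illustrative):

```python
import os
import tempfile

def ready_files(directory):
    """Return files whose companion '<name>.finished' marker exists,
    i.e. files the sending job has declared complete."""
    names = set(os.listdir(directory))
    return sorted(n for n in names
                  if not n.endswith(".finished") and n + ".finished" in names)

# Demo: only a.txt has been marked finished, so only it is picked up.
pickup = tempfile.mkdtemp()
for name in ("a.txt", "a.txt.finished", "b.txt"):
    open(os.path.join(pickup, name), "w").close()
ready = ready_files(pickup)
```

After processing a file, the receiver would delete both the file and its marker.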
