I have a process which chats a lot to stderr, and I want to log that stuff to a file.
foo 2> /tmp/foo.log
Actually I'm launching it with python subprocess.Popen, but it may as well be from the shell for the purposes of this question.
with open('/tmp/foo.log', 'w') as stderr:
foo_proc = subprocess.Popen(['foo'], stderr=stderr)
The problem is after a few days my log file can be very large, like >500 MB. I am interested in all that stderr chat, but only the recent stuff. How can I limit the size of the logfile to, say, 1 MB? The file should be a bit like a circular buffer in that the most recent stuff will be written but the older stuff should fall out of the file, so that it never goes above a given size.
I'm not sure if there's an elegant Unixey way to do this already which I'm simply not aware of, with some sort of special file.
An alternative solution with log rotation would be sufficient for my needs as well, as long as I don't have to interrupt the running process.
You should be able to use the stdlib logging package to do this. Instead of connecting the subprocess' output directly to a file, you can do something like this:
import logging
logger = logging.getLogger('foo')
def stream_reader(stream):
    while True:
        line = stream.readline()
        if not line:  # EOF: the subprocess closed its stderr
            break
        logger.debug('%s', line.strip())
This just logs every line received from the stream, and you can configure logging with a RotatingFileHandler, which provides log file rotation. You then arrange to read this data and log it in a background thread:
import subprocess
import threading

foo_proc = subprocess.Popen(['foo'], stderr=subprocess.PIPE)
thread = threading.Thread(target=stream_reader, args=(foo_proc.stderr,))
thread.daemon = True  # optional
thread.start()
# do other stuff
thread.join()  # await thread termination (optional for daemons)
Of course you can call stream_reader(foo_proc.stderr) too, but I'm assuming you might have other work to do while the foo subprocess does its stuff.
Here's one way you could configure logging (code that should only be executed once):
import logging, logging.handlers
handler = logging.handlers.RotatingFileHandler('/tmp/foo.log', 'a', 100000, 10)
logging.getLogger().addHandler(handler)
logging.getLogger('foo').setLevel(logging.DEBUG)
This will create up to 10 files of 100K named foo.log (and after rotation foo.log.1, foo.log.2 etc., where foo.log is the latest). You could also pass in 1000000, 1 to give you just foo.log and foo.log.1, where the rotation happens when the file would exceed 1000000 bytes in size.
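If it helps, here is a minimal end-to-end sketch tying the pieces above together (assuming Python 3 and a hypothetical foo executable on your PATH):
import logging, logging.handlers, subprocess, threading

logger = logging.getLogger('foo')

def stream_reader(stream):
    # log each line until the subprocess closes its stderr
    for line in iter(stream.readline, b''):
        logger.debug('%s', line.decode(errors='replace').rstrip())

handler = logging.handlers.RotatingFileHandler('/tmp/foo.log', 'a', 100000, 10)
logging.getLogger().addHandler(handler)
logger.setLevel(logging.DEBUG)

foo_proc = subprocess.Popen(['foo'], stderr=subprocess.PIPE)
reader = threading.Thread(target=stream_reader, args=(foo_proc.stderr,), daemon=True)
reader.start()
foo_proc.wait()
reader.join()
The reader thread exits when foo closes its stderr, and RotatingFileHandler keeps /tmp/foo.log and its backups under the configured size.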
A true circular buffer would be hard to implement, because you would constantly have to rewrite the whole file as soon as something falls out of it.
The approach with logrotate (or similar) is the way to go. In that case, you would do something similar to this:
import subprocess
import signal

def hupsignal(signum, frame):
    # logrotate sends SIGHUP after rotating; reopen the file so we write to the new one
    global logfile
    logfile.close()
    logfile = open('/tmp/foo.log', 'ab')

logfile = open('/tmp/foo.log', 'ab')
signal.signal(signal.SIGHUP, hupsignal)

foo_proc = subprocess.Popen(['foo'], stderr=subprocess.PIPE)
for chunk in iter(lambda: foo_proc.stderr.read(8192), b''):
    # iterate until EOF occurs
    logfile.write(chunk)
    # or do you want to rotate yourself?
    # Then omit the signal stuff and do it here:
    # if logfile.tell() > MAX_FILE_SIZE:
    #     logfile.close()
    #     logfile = open('/tmp/foo.log', 'ab')
This is not a complete solution; think of it as pseudocode, since it is untested and I am not sure about the syntax in one or two places. It probably needs some modification to work, but you should get the idea.
It is also an example of how to make it work with logrotate. Of course, you can rotate your logfile yourself if needed.
You may be able to use the properties of 'open file descriptions' (distinct from, but closely related to, 'open file descriptors'). In particular, the current write position is associated with the open file description, so two processes that share a single open file description can each adjust the write position.
So, in context, the original process could retain the file descriptor for standard error of the child process, and periodically, when the position reaches your 1 MiB size, reposition the pointer to the start of the file, thus achieving your required circular buffer effect.
The biggest problem is determining where the current messages are being written, so that you can read from the oldest material (just in front of the file position) to the newest material. It is unlikely that new lines overwriting the old will match exactly, so there'd be some debris. You might be able to follow each line from the child with a known character sequence (say 'XXXXXX'), and then have each write from the child reposition to overwrite the previous marker...but that definitely requires control over the program that's being run. If it is not under your control, or cannot be modified, that option vanishes.
An alternative would be to periodically truncate the file (maybe after copying it), and to have the child process write in append mode (because the file is opened in the parent in append mode). You could arrange to copy the material from the file to a spare file before truncating to preserve the previous 1 MiB of data. You might use up to 2 MiB that way, which is a lot better than 500 MiB and the sizes could be configured if you're actually short of space.
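For what it's worth, here is a rough sketch of that copy-then-truncate idea (untested; it assumes Python 3, a hypothetical foo executable, and that losing a few lines in the window between the copy and the truncate is acceptable):
import os
import shutil
import subprocess
import time

MAX_SIZE = 1 * 1024 * 1024  # 1 MiB

# open in append mode; the child inherits the O_APPEND flag, so its writes
# always go to the current end of file, even after we truncate it
logfile = open('/tmp/foo.log', 'ab')
foo_proc = subprocess.Popen(['foo'], stderr=logfile)

while foo_proc.poll() is None:
    time.sleep(5)
    if os.path.getsize('/tmp/foo.log') > MAX_SIZE:
        shutil.copyfile('/tmp/foo.log', '/tmp/foo.log.old')  # keep the previous ~1 MiB
        os.truncate('/tmp/foo.log', 0)  # child keeps appending from the new end (offset 0)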
Have fun!
Related
I'm implementing a simple logging class that writes out some messages to a log file. I have a doubt on how to manage the opening/closing of the file in a sensible and pythonic way.
I understood that the idiomatic way to do the writing in files is via the with statement. Therefore this is a simplified version of the code I have:
class Logger():
def __init__(self, filename, mode='w', name='root'):
self.filename = filename
self.name = name
# remove the previous content of the file if mode for logger is 'w'
if mode == 'w':
with open(self.filename, 'w') as f:
f.write('')
def info(self, msg):
with open(self.filename, 'a') as f:
f.write(f'INFO:{self.name}:{msg}\n')
logger = Logger('log.txt')
logger.info('Starting program')
The problem is that this implementation will open and close the file as many times as the logger is called, which will be hundreds of times. I'm concerned about this being an overhead for the program (the runtime of this program is important). It would perhaps be more sensible to open the file when the logger is created and close it when the program finishes. But this goes against the "use with" rule, and there is certainly a serious risk that I (or the user of the class) will forget to manually close the file at the end. Another problem with this approach is that if I want to create different loggers that dump to the same file, I'll have to add careful checks to know whether the file has already been opened by previous loggers...
So all in all, what's the most pythonic and sensible way to handle the opening/closing of files in this context?
While I agree with the other comments that the most pythonic way is to use the standard lib, I'll try to answer your question as it was asked.
I think the with construct is great, but it doesn't mean it works in every situation. Opening a file handle and keeping it around for continual use is not unpythonic if it makes sense in your situation (IMO). Opening a file, doing something, and closing it in the same function with try/except/finally blocks would be unpythonic. It may be preferable to open the file only when you first try to use it (instead of at creation time), but that can depend on the rest of the application.
If you start creating different loggers that write to the same file, if in the same process, I would think the goal would be to have a single open file handle that all the loggers write to instead of each logger having their own handle they write to. But multi-instance and multi-process logging synchronization is where the stdlib shines, so...you know...your mileage may vary.
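To illustrate the "open on first use" idea, here is a minimal sketch (not the stdlib approach) that keeps one handle per Logger and still supports with, so callers are less likely to forget to close it:
class Logger:
    def __init__(self, filename, mode='w', name='root'):
        self.filename = filename
        self.mode = mode
        self.name = name
        self._file = None

    def _ensure_open(self):
        # open lazily on first use; switch to append afterwards so a
        # close/reopen cycle does not wipe earlier output
        if self._file is None:
            self._file = open(self.filename, self.mode)
            self.mode = 'a'

    def info(self, msg):
        self._ensure_open()
        self._file.write(f'INFO:{self.name}:{msg}\n')

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __enter__(self):
        self._ensure_open()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()

with Logger('log.txt') as logger:
    logger.info('Starting program')
Sharing one file between several loggers then becomes a matter of passing the same Logger (or the same handle) around rather than re-opening the file in each instance.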
As in the thread How do you append to a file?, most answers are about opening a file and appending to it, for instance:
def FileSave(content):
with open(filename, "a") as myfile:
myfile.write(content)
FileSave("test1 \n")
FileSave("test2 \n")
Why don't we just pull myfile out and only write to it when FileSave is invoked?
global myfile
myfile = open(filename)
def FileSave(content):
myfile.write(content)
FileSave("test1 \n")
FileSave("test2 \n")
Is the latter code better because it opens the file only once and writes to it multiple times?
Or is there no difference, because Python internally guarantees the file is opened only once even though open is invoked multiple times?
There are a number of problems with your modified code that aren't really relevant to your question: you open the file in read-only mode, you never close the file, you have a global statement that does nothing…
Let's ignore all of those and just talk about the advantages and disadvantages of opening and closing a file over and over:
Wastes a bit of time. If you're really unlucky, the file could even just barely keep falling out of the disk cache and waste even more time.
Ensures that you're always appending to the end of the file, even if some other program is also appending to the same file. (This is pretty important for, e.g., syslog-type logs.)1
Ensures that you've flushed your writes to disk at some point, which reduces the chance of lost data if your program crashes or gets killed.
Ensures that you've flushed your writes to disk as soon as you write them. If you try to open and read the file elsewhere in the same program, or in a different program, or if the end user just opens it in Notepad, you won't be missing the last 1.73KB worth of lines because they're still in a buffer somewhere and won't be written until later.2
So, it's a tradeoff. Often, you want one of those guarantees, and the performance cost isn't a big deal. Sometimes, it is a big deal and the guarantees don't matter. Sometimes, you really need both, so you have to write something complicated where you manually buffer up bits and write-and-flush them all at once.
1. As the Python docs for open make clear, this will happen anyway on some Unix systems. But not on other Unix systems, and not on Windows.
2. Also, if you have multiple writers, they're all appending a line at a time, rather than appending whenever they happen to flush, which is again pretty important for logfiles.
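For completeness, here is a rough sketch of the middle ground: keep the file open for speed, but flush explicitly after each write so the flush-related guarantees above still mostly hold (whether you also need os.fsync is an assumption about your durability requirements):
import os

myfile = open('log.txt', 'a')

def FileSave(content):
    myfile.write(content)
    myfile.flush()                 # push Python's buffer to the OS
    # os.fsync(myfile.fileno())    # uncomment to force the data onto disk as well

FileSave("test1 \n")
FileSave("test2 \n")
myfile.close()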
In general global should be avoided if possible.
The reason people use the with statement when dealing with files is that it explicitly controls the scope: once the with block is done, the file is closed and the file variable should no longer be used.
You can avoid using the with statement, but then you must remember to call myfile.close(), particularly if you're dealing with a lot of files.
One way that avoids both the with block and global is:
def filesave(f_obj, string):
f_obj.write(string)
f = open(filename, 'a')
filesave(f, "test1\n")
filesave(f, "test2\n")
f.close()
However at this point you'd be better off getting rid of the function and just simply doing:
f = open(filename, 'a')
f.write("test1\n")
f.write("test2\n")
f.close()
At which point you could easily put it within a with block:
with open(filename, 'a') as f:
f.write("test1\n")
f.write("test2\n")
So yes. There's no hard reason to not do what you're doing. It's just not very Pythonic.
The latter code may be more efficient, but the former code is safer: it makes sure that the content each call to FileSave writes gets flushed to the filesystem, so other processes can read the updated content. By closing the file handle on each call (using open as a context manager), you also give other processes a chance to write to the file (specifically on Windows).
It really depends on the circumstances, but here are some thoughts:
A with block absolutely guarantees that the file will be closed once the block is exited. Python does not make any weird optimizations for appending to files.
In general, globals make your code less modular, and therefore harder to read and maintain. You would think that the original FileSave function is attempting to avoid globals, but it's using the global name filename, so you may as well use a global file altogether at that point, as it will save you some I/O overhead.
A better option would be to avoid globals at all, or to at least use them properly. You really don't need a separate function to wrap file.write, but if it represents something more complex, here is a design suggestion:
def save(file, content):
print(content, file=file)
def my_thing(filename):
with open(filename, 'a') as f:
# do some stuff
save(f, 'test1')
# do more stuff
save(f, 'test2')
if __name__ == '__main__':
my_thing('myfile.txt')
Notice that when you call the module as a script, the file name is supplied at the top level and passed in to the main routine. However, since the main routine does not reference global variables, you can A) read it more easily because it's self-contained, and B) test it without having to wonder how to feed it inputs without breaking everything else.
Also, by using print instead of file.write, you avoid having to append newlines manually.
I am writing a program that stores user names and passwords but I want to delete all the data stored on the text file if the user kills the program....is that possible to do in a simple way?
You can use tempfile.TemporaryFile on a Unix-like OS. It will create a file with no visible directory entry; even if the process is killed with kill -9, the file will be destroyed.
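A minimal sketch of that approach (the contents here are just placeholders):
import tempfile

# TemporaryFile has no visible directory entry (on Unix it is unlinked
# immediately), so the data vanishes when the handle is closed or the
# process dies -- even from kill -9.
with tempfile.TemporaryFile(mode='w+') as f:
    f.write('alice:secret\n')
    f.seek(0)
    print(f.read())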
From a net-result standpoint, you can treat your changes as a transaction.
1. Make a copy of the original text file.
2. Perform whatever changes to the copy that you wish to make.
3. Confirm that the changes are okay, and the user wishes to save them.
4. Replace the old text file with the new text file.
Step 4 can happen fairly quickly, although not atomically. Still, certainly way faster than most users can type Ctrl-C. To do this, you would:
a. 'move' the original file to a backup.
b. 'move' the new file to the original name.
c. 'unlink' the backup
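A rough sketch of steps a-c (the file names are made up; note that os.rename fails on Windows if the destination already exists, so you may prefer os.replace there):
import os
import shutil

def commit_changes(original, updated):
    backup = original + '.bak'
    os.rename(original, backup)    # a. move the original aside
    os.rename(updated, original)   # b. move the new file into place
    os.unlink(backup)              # c. drop the backup

shutil.copyfile('accounts.txt', 'accounts.tmp')   # step 1: work on a copy
# ... steps 2 and 3: edit 'accounts.tmp' and confirm with the user ...
commit_changes('accounts.txt', 'accounts.tmp')    # step 4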
UNIX/Linux-oriented answer:
You can write a handler for the signals you can catch, for example:
SIGTERM (15)
SIGINT (2)
SIGQUIT (3)
SIGABRT (6)
Just to mention some. Note that SIGSTOP and SIGKILL cannot be caught at all, so if kill -9 is used, SIGKILL is going to give you a hard time (see my comment). Therefore, it might be a better idea to store the data encrypted (if that helps your use case).
from signal import signal, SIGINT, SIGTERM

def handle_signal(signum, stackframe):
    '''signal handler'''
    # code to wipe or delete your file
    return True

if __name__ == '__main__':
    # register the handler for SIGINT (Ctrl-C) and SIGTERM
    signal(SIGINT, handle_signal)
    signal(SIGTERM, handle_signal)
    # the rest of your code
If by "kill" you mean a keyboard interrupt or another type of clean exit from the program you can use the following:
import os

try:
    with open('file.txt', 'w') as f:
        f.write('user names and passwords')  # write to the file here
except KeyboardInterrupt:
    os.remove('file.txt')
If you are asking about the program receiving a hard kill signal (SIGKILL, as from kill -9) from the operating system, then there is no way to handle that.
Just open the file for writing and close it; opening it with 'w' truncates the contents:
f = open('myfile.dat', 'w')
f.close()
I've added code to a Python package (brian2) that places an exclusive lock on a file to prevent a race condition. However, because this code includes calls to fcntl, it does not work on Windows. Is there a way for me to place exclusive locks on files in Windows without installing a new package, like pywin32? (I don't want to add a dependency to brian2.)
Since msvcrt is part of the standard library, I assume you have it. The msvcrt (Microsoft Visual C Run-Time) module only implements a small number of the routines available in the MS RTL; however, it does implement file locking.
Here is an example:
import msvcrt

REC_LIM = 20
pFilename = "rlock.dat"
fh = open(pFilename, "w")
for i in range(REC_LIM):
    line = "record %02d\n" % i      # construct the data for this record (example data)
    start_pos = fh.tell()           # get the current start position
    # Get the lock - possible blocking call
    msvcrt.locking(fh.fileno(), msvcrt.LK_RLCK, len(line) + 1)
    fh.write(line)                  # advance the current position
    end_pos = fh.tell()             # save the end position
    # Reset the current position before releasing the lock
    fh.seek(start_pos)
    msvcrt.locking(fh.fileno(), msvcrt.LK_UNLCK, len(line) + 1)
    fh.seek(end_pos)                # go back to the end of the written record
fh.close()
The example shown serves a similar purpose to fcntl.flock(); however, the code is very different. Only exclusive locks are supported.
Unlike fcntl.flock() there is no start argument (or whence). The lock or unlock call only operates on the current file position. This means that in order to unlock the correct region we have to move the current file position back to where it was before we did the read or write. Having unlocked, we now have to advance the file position again, back to where we were after the read or write, so we can proceed.
If we unlock a region for which we have no lock then we do not get an error or exception.
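If you need the same call to work on both platforms, one possible shape for a thin wrapper is sketched below (an assumption about how one might combine msvcrt and fcntl, not something taken from either module's documentation; note that fcntl.flock locks the whole file while msvcrt.locking locks a byte range starting at the current position):
import os

if os.name == 'nt':
    import msvcrt

    def lock_exclusive(f, nbytes=1):
        # lock the first byte of the file; blocks (with retries) until acquired
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, nbytes)

    def unlock(f, nbytes=1):
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, nbytes)
else:
    import fcntl

    def lock_exclusive(f, nbytes=1):
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)

    def unlock(f, nbytes=1):
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)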
I'm trying to use a unix named pipe to output statistics of a running service. I intend to provide a similar interface as /proc where one can see live stats by catting a file.
I'm using a code similar to this in my python code:
while True:
f = open('/tmp/readstatshere', 'w')
f.write('some interesting stats\n')
f.close()
/tmp/readstatshere is a named pipe created by mknod.
I then cat it to see the stats:
$ cat /tmp/readstatshere
some interesting stats
It works fine most of the time. However, if I cat the entry several times in quick succession, sometimes I get multiple lines of some interesting stats instead of one. Once or twice, it has even gone into an infinite loop printing that line forever until I killed it. The only fix that I've found so far is to put a delay of, say, 500 ms after f.close() to prevent this issue.
I'd like to know why exactly this happens and if there is a better way of dealing with it.
Thanks in advance
A pipe is simply the wrong solution here. If you want to present a consistent snapshot of the internal state of your process, write that to a temporary file and then rename it to the "public" name. This will prevent all issues that can arise from other processes reading the state while you're updating it. Also, do NOT do that in a busy loop, but ideally in a thread that sleeps for at least one second between updates.
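A sketch of that suggestion (the path and the one-second interval are illustrative; the rename replaces the FIFO with a regular file that readers can cat):
import os
import tempfile
import threading
import time

STATS_PATH = '/tmp/readstatshere'   # the "public" name that readers cat

def publish_stats():
    while True:
        stats = 'some interesting stats\n'   # gather the real numbers here
        # write to a temp file in the same directory, then atomically rename it
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(STATS_PATH))
        with os.fdopen(fd, 'w') as f:
            f.write(stats)
        os.replace(tmp, STATS_PATH)          # readers always see a complete file
        time.sleep(1)

threading.Thread(target=publish_stats, daemon=True).start()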
What about a UNIX socket instead of a pipe?
In this case, you can react to each connect by providing fresh data just in time.
The only downside is that you cannot cat the data; you'll have to create a new socket handle and connect() to the socket file.
import socket
import os

MYSOCKETFILE = '/tmp/mysocket'

try:
    os.unlink(MYSOCKETFILE)
except OSError:
    pass

s = socket.socket(socket.AF_UNIX)
s.bind(MYSOCKETFILE)
s.listen(10)
while True:
    s2, peeraddr = s.accept()
    s2.send(b'These are my actual data')
    s2.close()
Program querying this socket:
import socket

MYSOCKETFILE = '/tmp/mysocket'

s = socket.socket(socket.AF_UNIX)
s.connect(MYSOCKETFILE)
while True:
    d = s.recv(100)
    if not d:
        break
    print(d.decode())
s.close()
I think you should use FUSE.
It has Python bindings; see http://pypi.python.org/pypi/fuse-python/
This lets you compose answers to questions formulated as POSIX filesystem system calls.
Don't write to an actual file. That's not what /proc does. Procfs presents a virtual (non-disk-backed) filesystem which produces the information you want on demand. You can do the same thing, but it'll be easier if it's not tied to the filesystem. Instead, just run a web service inside your Python program, and keep your statistics in memory. When a request comes in for the stats, formulate them into a nice string and return them. Most of the time you won't need to waste cycles updating a file which may not even be read before the next update.
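A minimal sketch of that idea with the stdlib http.server (the port and the stats dict are made up):
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

stats = {'requests_handled': 0}   # kept in memory, updated by the main program

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = '\n'.join(f'{k}: {v}' for k, v in stats.items()).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

server = HTTPServer(('127.0.0.1', 8125), StatsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
# the rest of the service runs here; "curl http://127.0.0.1:8125/" shows live stats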
You need to unlink the pipe after you issue the close. I think this is because there is a race condition where the pipe can be opened for reading again before cat finishes and it thus sees more data and reads it out, leading to multiples of "some interesting stats."
Basically you want something like:
import os

the_pipe = '/tmp/readstatshere'
while True:
    os.mkfifo(the_pipe)
    f = open(the_pipe, 'w')
    f.write('some interesting stats')
    f.close()
    os.unlink(the_pipe)
Update 1: added the call to mkfifo.
Update 2: as noted in the comments, there is a race condition in this code as well with multiple consumers.