Large file not flushed to disk immediately after calling close()? - python

I'm creating large files with my Python script (more than 1 GB each, and actually there are 8 of them). Right after I create them I have to spawn a process that will use those files.
The script looks like:
# This is a more complex function, but it basically does this:
def use_file():
    subprocess.call(['C:\\use_file', 'C:\\foo.txt'])
f = open('C:\\foo.txt', 'wb')
for i in range(10000):
    f.write(one_MB_chunk)
f.flush()
os.fsync(f.fileno())
f.close()
time.sleep(5)  # With this line added it just works fine
t = threading.Thread(target=use_file)
t.start()
But the application use_file acts as if foo.txt were empty. There are some weird things going on:
if I execute C:\use_file C:\foo.txt in a console (after the script has finished) I get correct results
if I execute use_file() manually in another Python console I get correct results
C:\foo.txt is visible on disk right after open() is called, but its size remains 0 B until the end of the script
if I add time.sleep(5) it just starts working as expected (or rather as required)
I've already found:
os.fsync(), but it doesn't seem to work (the result from use_file is as if C:\foo.txt were empty)
using buffering=(1<<20) when opening the file doesn't seem to work either
I'm more and more curious about this behaviour.
Questions:
Does Python fork the close() operation into the background? Where is this documented?
How can I work around this?
Am I missing something?
After adding the sleep: is this a Windows/Python bug?
Notes (in case there's something wrong on the other side): the application use_file uses:
handle = CreateFile("foo.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                    OPEN_EXISTING, 0, NULL);
size = GetFileSize(handle, NULL);
And then processes size bytes from foo.txt.

f.close() calls f.flush(), which sends the data to the OS. That doesn't necessarily write the data to disk, because the OS buffers it. As you rightly worked out, if you want to force the OS to write it to disk, you need to call os.fsync().
Have you considered just piping the data directly into use_file?
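For example, a rough sketch of that idea (this assumes use_file can read its data from stdin instead of a file, which the question doesn't state):
import subprocess
# Hypothetical: the '-' argument telling use_file to read from stdin is an assumption.
p = subprocess.Popen(['C:\\use_file', '-'], stdin=subprocess.PIPE)
for i in range(10000):
    p.stdin.write(one_MB_chunk)  # one_MB_chunk as in the question
p.stdin.close()
p.wait()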
EDIT: you say that os.fsync() 'doesn't work'. To clarify, if you do
f = open(...)
# write data to f
f.flush()
os.fsync(f.fileno())
f.close()
import pdb; pdb.set_trace()
and then look at the file on disk, does it have data?

Edit: updated with information specific to Python 3.x
There is a super old bug report discussing a suspiciously similar problem at https://bugs.python.org/issue4944. I made a small test that shows the bug: https://gist.github.com/estyrke/c2f5d88156dcffadbf38
After getting a wonderful explanation from user eryksun at the bug link above, I now understand why this happens, and it is not a bug per se. When a child process is created on Windows, by default it inherits all open file handles from the parent process. So what you're seeing is probably actually a sharing violation, because the file you're trying to read in the child process is open for writing through an inherited handle in another child process.
A possible sequence of events that causes this (using the reproduction example in the Gist above):
Thread 1 opens file 1 for writing
Thread 2 opens file 2 for writing
Thread 2 closes file 2
Thread 2 launches child 2
-> Inherits the file handle from file 1, still open with write access
Thread 1 closes file 1
Thread 1 launches child 1
-> Now it can't open file 1, because the handle is still open in child 2
Child 2 exits
-> Last handle to file 1 closed
Child 1 exits
When I compile the simple C child program and run the script on my machine, it fails in at least one of the threads most of the time with Python 2.7.8. With Python 3.2 and 3.3, the test script without redirection does not fail, because the default value of the close_fds argument to subprocess.call is now True when redirection is not used. The other test script, which uses redirection, still fails in those versions. In Python 3.4 both tests succeed, because of PEP 446, which makes all file handles non-inheritable by default.
Conclusion
Spawning a child process from a thread in Python means the child inherits all open file handles, even handles opened by threads other than the one spawning the child. This is, at least to me, not particularly intuitive.
Possible solutions:
Upgrade to Python 3.4, where file handles are non-inheritable by default.
Pass close_fds=True to subprocess.call to disable inheriting altogether (this is the default in Python 3.x). Note though that this prevents redirection of the child process' standard input/output/error.
Make sure all files are closed before spawning new processes.
Use os.open to open files with the os.O_NOINHERIT flag on Windows (a sketch of this follows the win32file example below). tempfile.mkstemp also uses this flag.
Use the win32api instead. Passing a NULL pointer for the lpSecurityAttributes parameter also prevents inheriting the descriptor:
from contextlib import contextmanager
import win32file
@contextmanager
def winfile(filename):
    h = win32file.CreateFile(filename, win32file.GENERIC_WRITE, 0, None,
                             win32file.CREATE_ALWAYS, 0, 0)
    try:
        yield h
    finally:
        win32file.CloseHandle(h)
with winfile(tempfilename) as infile:
    win32file.WriteFile(infile, data)
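And a minimal sketch of the os.O_NOINHERIT option mentioned above (Windows-only flags; the path and data are reused from the question for illustration):
import os
# The handle behind this descriptor is not inherited by child processes.
fd = os.open('C:\\foo.txt', os.O_WRONLY | os.O_CREAT | os.O_BINARY | os.O_NOINHERIT)
with os.fdopen(fd, 'wb') as f:
    f.write(one_MB_chunk)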

Related

Read from pty without endless hanging

I have a script that prints colored output if it is on a tty. A bunch of them execute in parallel, so I can't connect their stdout to a tty. I don't have control over the script's code either (to force coloring), so I want to fake it via a pty. My code:
invocation = get_invocation()
master, slave = pty.openpty()
subprocess.call(invocation, stdout=slave)
print string_from_fd(master)
And I can't figure out, what should be in string_from_fd. For now, I have something like
def string_from_fd(fd):
    return os.read(fd, 1000)
It works, but that number 1000 looks strange. I think the output can be quite large, and any fixed number there could be insufficient. I tried a lot of solutions from Stack Overflow, but none of them works (they print nothing or hang forever).
I am not very familiar with file descriptors and all that, so any clarification if I'm doing something wrong would be much appreciated.
Thanks!
This won't work for long outputs: subprocess.call will block once the PTY's buffer is full. That's why subprocess.communicate exists, but that won't work with a PTY.
The standard/easiest solution is to use the external module pexpect, which uses PTYs internally: For example,
pexpect.spawn("/bin/ls --color=auto").read()
will give you the ls output with color codes.
If you'd like to stick to subprocess, then you must use subprocess.Popen for the reason stated above. You are right in your assumption that by passing 1000 you read at most 1000 bytes, so you'll have to use a loop. os.read blocks if there is nothing to read and waits for data to appear.
The catch is how to recognize when the process terminated: in that case you know that no more data will arrive, and the next call to os.read would block forever. Luckily, the operating system helps you detect this situation: if all file descriptors to the pseudo terminal that could be used for writing are closed, then os.read will either return an empty string or return an error, depending on the OS. You can check for this condition and exit the loop when it happens.
The final piece to understanding the following code is how open file descriptors and subprocess go together: subprocess.Popen internally calls fork(), which duplicates the current process including all open file descriptors, and then within one of the two execution paths calls exec(), which terminates the current process in favour of a new one. In the other execution path, control returns to your Python script. So after calling subprocess.Popen there are two valid file descriptors for the slave end of the PTY: one belongs to the spawned process, one to your Python script. If you close yours, then the only file descriptor that could be used to send data to the master end belongs to the spawned process. Upon its termination, it is closed, and the PTY enters the state where calls to read on the master end fail.
Here's the code:
import os
import pty
import subprocess
master, slave = pty.openpty()
process = subprocess.Popen("/bin/ls --color", shell=True, stdout=slave,
                           stdin=slave, stderr=slave, close_fds=True)
os.close(slave)
output = []
while True:
    try:
        data = os.read(master, 1024)
    except OSError:
        break
    if not data:
        break
    output.append(data)  # In Python 3, append ".decode()" to os.read()
output = "".join(output)

How are file objects cleaned up in Python when the process is killed?

What happens to a file object in Python when the process is terminated? Does it matter whether Python is terminated with SIGTERM, SIGKILL, SIGHUP (etc.) or by a KeyboardInterrupt exception?
I have some logging scripts that continually acquire data and write it to a file. I don't care about doing any extra cleanup, but I just want to make sure the log file is not corrupted when Python is abruptly terminated (e.g. I could leave it running in the background and just shut down the computer). I made the following test scripts to try to see what happens:
termtest.sh:
for i in $(seq 1 10); do
    python termtest.py $i & export pypid=$!
    sleep 0.3
    echo $pypid
    kill -SIGTERM $pypid
done
termtest.py:
import csv
import os
import signal
import sys
end_loop = False
def handle_interrupt(*args):
    global end_loop
    end_loop = True
signal.signal(signal.SIGINT, handle_interrupt)
with open('test' + str(sys.argv[-1]) + '.txt', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for idx in range(int(1e7)):
        writer.writerow((idx, 'a' * 60000))
        csvfile.flush()
        os.fsync(csvfile.fileno())
        if end_loop:
            break
I ran termtest.sh with different signals, changing SIGTERM to SIGINT, SIGHUP, and SIGKILL in termtest.sh (note: I put an explicit handler in termtest.py for SIGINT, since Python does not handle that one other than as Ctrl+C). In all cases, all of the output files had only complete rows (no partial writes) and did not appear corrupted. I added the flush() and fsync() calls to try to make sure the data was being written to disk as much as possible, so that the script had the greatest chance of being interrupted mid-write.
So can I conclude that Python always completes a write when it is terminated and does not leave a file in an intermediate state? Or does this depend on the operating system and file system (I was testing with Linux and an ext4 partition)?
It's not how files are "cleaned up" so much as how they are written to. It's possible that a program might perform multiple writes for a single "chunk" of data (row, or whatever) and you could interrupt in the middle of this process and end up with partial records written.
Looking at the C source for the csv module, it assembles each row into a string buffer, then writes that using a single write() call. That should generally be safe: either the row is passed to the OS or it isn't, and if it gets to the OS it's all going to get written or it's not (barring, of course, things like hardware issues where part of it could go into a bad sector).
The writer object is a Python object, and a custom writer could do something weird in its write() that could break this, but assuming it's a regular file object, it should be fine.
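To illustrate that point, here is a minimal sketch (my own, not from the answer) of a writer that builds each record in memory and hands it to the OS in a single write() call, so an abrupt kill can only lose whole records:
import os
def write_record(f, fields):
    # Build the entire row first...
    line = ','.join(str(x) for x in fields) + '\r\n'
    # ...then issue one write() per record and push it through to disk.
    f.write(line)
    f.flush()
    os.fsync(f.fileno())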

Can I open a named pipe on Linux for non-blocked writing in Python?

I created a fifo file using mkfifo. Is it possible to open/write to it without blocking? I'd like to be agnostic about whether there is a reader or not.
The following:
with open('fifo', 'wb', 0) as file:
    file.write(b'howdy')
just stalls at the open until I do a cat fifo from another shell. I want my program to make progress regardless of whether there's a data consumer watching or not.
Is there a different Linux mechanism I should be using, perhaps?
From man 7 fifo:
A process can open a FIFO in nonblocking mode. In this case, opening for read-only will succeed even if no-one has opened on the write side yet, opening for write-only will fail with ENXIO (no such device or address) unless the other end has already been opened.
So the first solution is opening FIFO with O_NONBLOCK. In this case you can check errno: if it is equal to ENXIO, then you can try opening FIFO later.
import errno
import posix
try:
    posix.open('fifo', posix.O_WRONLY | posix.O_NONBLOCK)
except OSError as ex:
    if ex.errno == errno.ENXIO:
        pass  # try later
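A small sketch of what "try later" could look like in practice (my own illustration, using os.open, which is equivalent to posix.open):
import errno
import os
import time
fd = None
while fd is None:
    try:
        fd = os.open('fifo', os.O_WRONLY | os.O_NONBLOCK)
    except OSError as ex:
        if ex.errno != errno.ENXIO:
            raise
        time.sleep(0.1)  # no reader yet; retry shortly
os.write(fd, b'howdy')
os.close(fd)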
The other possible way is to open the FIFO with the O_RDWR flag. It will not block in this case. Another process can open it with O_RDONLY without problems.
import posix
posix.open('fifo', posix.O_RDWR)

Opening a file in a Python module and closing it when the script ends

I have a Python module that performs some logging during some of the methods it contains:
module.py
import os
import time
LOG_FILE = "/var/log/module.log"
def log(message):
    with os.fdopen(os.open(LOG_FILE, os.O_RDWR | os.O_CREAT, 0o664), "a+") as f:
        f.write("[%s] %s\n" % (time.strftime("%c"), message))
def do_something():
    log("Doing something")
    # ...
In this implementation the log file will be opened and closed every time the log method is called.
I'm considering refactoring it so the file is opened once when the module is loaded, but I'm not sure how to ensure it is closed when a script importing the module ends. Is there a clean way to do this?
Edit: I'm not asking about closing the file when an exception is encountered, but when the script that imports my module exits.
The OS takes care of open file descriptors when a process dies. That may lead to data loss if file buffers inside the application are not flushed. You could add f.flush() in the log() function after each write (note: it does not guarantee that the data is physically written to disk, and therefore it may still be lost on a power failure; see Threadsafe and fault-tolerant file writes).
Python may also close (and flush) the file on exit during a garbage collection. But you shouldn't rely on it.
atexit works only during a normal exit (and exit on some signals). It won't help if the script is killed abruptly.
As @René Fleschenberg suggested, use the logging module, which calls .flush() and perhaps registers atexit handlers for you.
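For instance, a minimal sketch of the logging-based approach (the configuration details here are illustrative, with the file name reused from the question):
import logging
# logging keeps the file handle itself and flushes/closes it on normal exit
# via logging.shutdown(), which the module registers with atexit.
logging.basicConfig(filename="/var/log/module.log",
                    format="[%(asctime)s] %(message)s",
                    level=logging.INFO)
def do_something():
    logging.info("Doing something")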
Python is usually pretty good at cleaning up after itself. If you must do something when the script ends, you need to look at the atexit module - but even then, it offers no guarantees.
You may also want to consider logging to either stdout or stderr, depending on the purpose, which avoids keeping a file around altogether:
import sys
import time
def log(message):
    sys.stderr.write("[%s] %s\n" % (time.strftime("%c"), message))
Python will automatically close the opened files for you when the script that has imported your module exits.
But really, just use Python's logging module.

Python program using os.pipe and os.fork() issue

I've recently needed to write a script that performs an os.fork() to split into two processes. The child process becomes a server process and passes data back to the parent process using a pipe created with os.pipe(). The child closes the 'r' end of the pipe and the parent closes the 'w' end of the pipe, as usual. I convert the returns from pipe() into file objects with os.fdopen.
The problem I'm having is this: The process successfully forks, and the child becomes a server. Everything works great and the child dutifully writes data to the open 'w' end of the pipe. Unfortunately the parent end of the pipe does two strange things:
A) It blocks on the read() operation on the 'r' end of the pipe.
B) It fails to read any data that was put on the pipe unless the 'w' end is entirely closed.
I immediately thought that buffering was the problem and added pipe.flush() calls, but these didn't help.
Can anyone shed some light on why the data doesn't appear until the writing end is fully closed? And is there a strategy to make the read() call non-blocking?
This is my first Python program that forked or used pipes, so forgive me if I've made a simple mistake.
Are you using read() without specifying a size, or treating the pipe as an iterator (for line in f)? If so, that's probably the source of your problem: read() is defined to read until the end of the file before returning, rather than just reading what is available. That means it will block until the child calls close().
In the example code linked to, this is OK - the parent is acting in a blocking manner and just using the child for isolation purposes. If you want to continue, then either use non-blocking I/O as in the code you posted (but be prepared to deal with half-complete data), or read in chunks (e.g. r.read(size) or r.readline()), which will block only until a specific size / line has been read. (You'll still need to call flush on the child.)
It looks like treating the pipe as an iterator uses some further buffering as well, so "for line in r:" may not give you what you want if you need each line to be consumed immediately. It may be possible to disable this, but just specifying 0 for the buffer size in fdopen doesn't seem to be sufficient.
Here's some sample code that should work:
import os, sys, time
r, w = os.pipe()
r, w = os.fdopen(r, 'r', 0), os.fdopen(w, 'w', 0)
pid = os.fork()
if pid:  # Parent
    w.close()
    while 1:
        data = r.readline()
        if not data:
            break
        print "parent read: " + data.strip()
else:  # Child
    r.close()
    for i in range(10):
        print >>w, "line %s" % i
        w.flush()
        time.sleep(1)
Using fcntl.fcntl(readPipe, fcntl.F_SETFL, os.O_NONBLOCK) before invoking the read() solved both problems. The read() call is no longer blocking and the data appears after just a flush() on the writing end.
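A minimal sketch of that approach (readPipe is the raw read-end descriptor; the EAGAIN handling is my addition, since a non-blocking read raises it when no data is available yet):
import errno
import fcntl
import os
# Preserve the existing flags and add O_NONBLOCK to the read end.
flags = fcntl.fcntl(readPipe, fcntl.F_GETFL)
fcntl.fcntl(readPipe, fcntl.F_SETFL, flags | os.O_NONBLOCK)
try:
    data = os.read(readPipe, 4096)
except OSError as ex:
    if ex.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
        data = b''  # nothing to read right now
    else:
        raise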
I see you have solved the problem of blocking i/o and buffering.
A note if you decide to try a different approach: subprocess is the equivalent of / a replacement for the fork/exec idiom. It seems like that's not what you're doing: you have just a fork (not an exec) and are exchanging data between the two processes - in this case the multiprocessing module (in Python 2.6+) would be a better fit.
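A minimal sketch of that multiprocessing alternative (the names here are illustrative, not from the original post):
import multiprocessing
def server(conn):
    # Child: play the server role and send data back to the parent.
    for i in range(10):
        conn.send("line %s" % i)
    conn.close()
if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=server, args=(child_conn,))
    p.start()
    child_conn.close()  # the parent only uses its own end of the pipe
    while True:
        try:
            print(parent_conn.recv())
        except EOFError:  # raised once the child closes its end
            break
    p.join()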
The "parent" vs. "child" part of fork in a Python application is silly. It's a legacy from 16-bit unix days. It's an affectation from a day when fork/exec and exec were Important Things to make the most of a tiny little processor.
Break your Python code into two separate parts: parent and child.
The parent part should use subprocess to run the child part.
A fork and exec may happen somewhere in there -- but you don't need to care.
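For example, a minimal sketch of that structure (the child script name is hypothetical):
import subprocess
import sys
# Parent side: run the child part as its own program and read what it prints.
proc = subprocess.Popen([sys.executable, 'child.py'],
                        stdout=subprocess.PIPE, universal_newlines=True)
for line in proc.stdout:
    print("parent read: " + line.strip())
proc.wait()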
