I have a question about sharing a resource that holds a file handle between processes.
Here is my test code:
from multiprocessing import Process,Lock,freeze_support,Queue
import tempfile
#from cStringIO import StringIO

class File():
    def __init__(self):
        self.temp = tempfile.TemporaryFile()
        #print self.temp

    def read(self):
        print "reading!!!"
        s = "huanghao is a good boy !!"
        print >> self.temp,s
        self.temp.seek(0,0)
        f_content = self.temp.read()
        print f_content

class MyProcess(Process):
    def __init__(self,queue,*args,**kwargs):
        Process.__init__(self,*args,**kwargs)
        self.queue = queue

    def run(self):
        print "ready to get the file object"
        self.queue.get().read()
        print "file object got"
        file.read()

if __name__ == "__main__":
    freeze_support()
    queue = Queue()
    file = File()
    queue.put(file)
    print "file just put"
    p = MyProcess(queue)
    p.start()
Then I get a KeyError like below:
file just put
ready to get the file object
Process MyProcess-1:
Traceback (most recent call last):
File "D:\Python26\lib\multiprocessing\process.py", line 231, in _bootstrap
self.run()
File "E:\tmp\mpt.py", line 35, in run
self.queue.get().read()
File "D:\Python26\lib\multiprocessing\queues.py", line 91, in get
res = self._recv()
File "D:\Python26\lib\tempfile.py", line 375, in __getattr__
file = self.__dict__['file']
KeyError: 'file'
I think that when I put the File() object into the queue, the object gets serialized (pickled), and since the file handle cannot be serialized, I get the KeyError.
Does anyone have any idea about this? If I want to share objects that have a file handle attribute, what should I do?
I have to object (at length; it won't fit in a comment ;-) to #Mark's repeated assertion that file handles just can't be "passed around between running processes" -- this is simply not true in real, modern operating systems, such as, oh, say, Unix (free BSD variants, MacOSX, and Linux included -- hmmm, I wonder what OSes are left out of this list...?-) -- sendmsg of course can do it (on a "Unix socket", by using the SCM_RIGHTS flag).
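For concreteness, here is a small, hypothetical sketch of that descriptor passing (assuming Python 3.9+ on a Unix system, using socket.send_fds/recv_fds, which wrap sendmsg/recvmsg with SCM_RIGHTS; the file path is just an example):

import os, socket

# Parent and child share an AF_UNIX socket pair; the parent sends an open
# file descriptor, the child receives a descriptor for the *same* open file.
parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

pid = os.fork()
if pid == 0:                                  # child
    parent_sock.close()
    msg, fds, flags, addr = socket.recv_fds(child_sock, 1024, 1)
    with os.fdopen(fds[0]) as f:              # same underlying open file as in the parent
        print(f.read())
    os._exit(0)
else:                                         # parent
    child_sock.close()
    with open('/etc/hostname') as f:          # illustrative file
        socket.send_fds(parent_sock, [b'fd'], [f.fileno()])
    os.waitpid(pid, 0)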
Now the poor, valuable multiprocessing is fully right to not exploit this feature (even assuming there might be black magic to implement it on Windows too) -- most developers would no doubt misuse it anyway (having multiple processes access the same open file concurrently and running into race conditions). The only proper way to use it is for a process which has exclusive rights to open certain files to pass the opened file handles to another process which runs with reduced privileges -- and then never use that handle itself again. No way to enforce that in the multiprocessing module, anyway.
Back to #Andy's original question, unless he's going to work on Linux only (AND with local processes only, too) and willing to play dirty tricks with the /proc filesystem, he's going to have to define his application-level needs more sharply and serialize file objects accordingly. Most files have a path (or can be made to have one: path-less files are pretty rare, actually non-existent on Windows I believe) and thus can be serialized via it -- many others are small enough to serialize by sending their content over -- etc, etc.
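As a minimal sketch of the path-based serialization described above (illustrative only, not from the original post; it assumes the wrapped file actually has a path, which a tempfile.TemporaryFile on Windows does not):

class PicklableFile(object):
    """Wraps a file and pickles it as (path, mode, offset), reopening on unpickle."""
    def __init__(self, path, mode='r'):
        self.path = path
        self.mode = mode
        self._fh = open(path, mode)

    def __getstate__(self):
        # Send only picklable state: the path, mode and current position.
        return {'path': self.path, 'mode': self.mode, 'pos': self._fh.tell()}

    def __setstate__(self, state):
        self.path, self.mode = state['path'], state['mode']
        self._fh = open(self.path, self.mode)
        self._fh.seek(state['pos'])

    def read(self, *args):
        return self._fh.read(*args)

An instance of such a wrapper can be put on a multiprocessing.Queue, because pickling it no longer touches the underlying OS-level handle.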
Is it correctly understood that the following two functions do exactly the same thing, no matter how they are invoked?
def test():
    file = open("testfile.txt", "w")
    file.write("Hello World")

def test_2():
    with open("testfile.txt", "w") as f:
        f.write("Hello World")
My reasoning is that Python invokes the close method when an object is no longer referenced.
If not, then this quote confuses me:
Python automatically closes a file when the reference object of a file
is reassigned to another file. It is a good practice to use the
close() method to close a file.
from https://www.tutorialspoint.com/python/file_close.htm
No. In the first case the close method would be invoked by the Python garbage collector (finalizer) machinery at some later point; in the second case it is invoked immediately, when the with block is exited. If you loop calling your test or test_2 functions thousands of times, the observed behavior could be different.
File descriptors are (at least on Linux) a precious and scarce resource (when they are exhausted, the open(2) syscall fails). On Linux, use getrlimit(2) with RLIMIT_NOFILE to query the limit on the number of file descriptors for your process. You should prefer the close(2) syscall to be invoked promptly, once a file handle is no longer needed.
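For example (a small, Unix-only illustration, not part of the original answer), you can query that limit from Python with the standard resource module:

import resource  # Unix-only standard-library module

# RLIMIT_NOFILE is the maximum number of file descriptors this process may have open.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("file descriptor limits: soft=%d, hard=%d" % (soft, hard))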
Your question is implementation specific, operating system specific, and computer specific. You may want to understand more about operating systems by reading Operating Systems: Three Easy Pieces.
On Linux, try also the cat /proc/$$/limits or cat /proc/self/limits commands in a terminal. You will see a line starting with Max open files (on my Debian desktop computer, right now in December 2019, the soft limit is 1024). See proc(5).
No. The first one will not save the information correctly. You need to use file.close() to ensure that the file is closed properly and the data is saved.
The with statement, on the other hand, handles the file operations for you. It keeps the file open for as long as execution stays inside the with block, and automatically closes (and thereby saves) the file as soon as execution leaves that block.
More information here.
In the case of the test function, the close method is not called until the Python garbage collector deletes the file object; at that point it is invoked via the file's __del__ magic method, which runs when the object is destroyed.
In the case of the test_2 function, the close method is called as soon as code execution leaves the with statement. Read more about Python context managers, which are what the with statement uses.
with foo as f:
    do_something()

is roughly just syntactic sugar for:

f = foo.__enter__()
try:
    do_something()
finally:
    foo.__exit__(None, None, None)  # or the exception details, if one was raised

and in the case of a file object, __exit__ implicitly calls close.
No, it is not correctly understood. The close method is invoked via the __exit__ method, which is only invoked when exiting a with statement, not when exiting a function. See the code example below:
class Temp:
    def __exit__(self, exc_type, exc_value, tb):
        print('exited')

    def __enter__(self):
        pass

def make_temp():
    temp = Temp()

make_temp()
print('temp_make')

with Temp() as temp:
    pass
print('temp_with')
Which outputs:
temp_make
exited
temp_with
I'm getting a strange Python error. I'm executing a file that looks like this.
if __name__ == '__main__':
    MyClass().main()
    print('Done 1')
    print('Done 2')
The preceding runs successfully. But when I change it to this, I get the strange result.
if __name__ == '__main__':
    myObject = MyClass()
    myObject.main()
    print('Done 1')
    print('Done 2')
The output looks like this.
Done 1
Done 2
Exception ignored in: <function Viewer.__del__ at 0x0000021569EF72F0>
Traceback (most recent call last):
File "C:\...\lib\site-packages\gym\envs\classic_control\rendering.py", line 143, in __del__
File "C:\...\lib\site-packages\gym\envs\classic_control\rendering.py", line 62, in close
File "C:\...\lib\site-packages\pyglet\window\win32\__init__.py", line 305, in close
File "C:\...\lib\site-packages\pyglet\window\__init__.py", line 770, in close
ImportError: sys.meta_path is None, Python is likely shutting down
Process finished with exit code 0
There is a blank line after the final print line. The same thing happens when the final line does not have an end-of-line marker.
I get the same result whether I run it from within PyCharm using the run command or from the terminal.
As you can probably tell from the error lines, the program generates an animation. (It's the cart-pole problem from OpenAI gym.)
Since the program completes before the error, it's not a disaster. But I'd like to understand what's happening.
Thanks.
Python provides a __del__ dunder method for classes that will be called as the instances are garbage collected, if they're garbage collected.
When it's used, the __del__ method typically performs some sort of cleanup.
Because it is fairly easy to inadvertently prevent an object from being collected, relying on __del__ to perform cleanup (instead of, say, a context manager's __exit__ or an explicit .close() method) is generally advised against.
Your error highlights a different reason for avoiding reliance on __del__, however: during interpreter shutdown, __del__ may be called only after other things that it relies on (such as imported modules) have already been torn down.
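For illustration (a hypothetical class, not from the question's code or from gym), the usual alternative is to expose explicit cleanup through close() and a context manager instead of relying on __del__:

class Resource:
    """Hypothetical resource that prefers explicit, deterministic cleanup."""
    def __init__(self, path='/tmp/resource.txt'):   # placeholder resource
        self._handle = open(path, 'w')

    def close(self):
        if not self._handle.closed:
            self._handle.close()

    # Context-manager protocol, so `with Resource() as r: ...` always cleans up,
    # even if an exception is raised inside the block.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        self.close()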
The proposed workarounds on the GitHub issue linked in the comments should be instructive, as they all ensure that the cleanup is done at a time when the things that cleanup relies on (e.g. sys.meta_path) are still defined/not yet freed, e.g.:
try:
    del env
except ImportError:
    pass
and
env = gym.make('CartPole-v0')
...
env.env.close()
and (likely, but much less efficient or clear)
import gc; gc.collect()
So I want to write some files that might be locked/blocked for writing/deletion by other processes, and I would like to test for that upfront.
As I understand it, os.access(path, os.W_OK) only looks at the permissions and will return True even though the file cannot currently be written. So I have this little function:
def write_test(path):
    try:
        fobj = open(path, 'a')
        fobj.close()
        return True
    except IOError:
        return False
It actually works pretty well when I try it with a file that I have manually opened in a program. But as a wannabe-good-developer I want to put it in a test to automatically see whether it works as expected.
The thing is: if I just open(path, 'a') the file, I can still open() it again, no problem! Even from another Python instance. Yet Explorer will actually tell me that the file is currently open in Python!
I looked up other posts here and there about locking. Most suggest installing a package. You might understand that I don't want to do that just to test a handful of lines of code. So I dug into those packages to see the actual spot where the locking is eventually done...
fcntl? I don't have that. win32con? Don't have it either... Now in filelock there is this:
self.fd = os.open(self.lockfile, os.O_CREAT|os.O_EXCL|os.O_RDWR)
When I do that on an existing file, it moans that the file exists!! Ehm ... yeah! That's the idea! But even when I do it on a non-existing path, I can still open(path, 'a') it! Even from another Python instance...
I'm beginning to think that I fail to understand something very basic here. Am I looking for the wrong thing? Can someone point me into the right direction?
Thanks!
You are trying to solve the file-locking problem using just the system call open(). Unix-like systems use advisory file locking by default. This means that cooperating processes may use locks to coordinate access to a file among themselves, but uncooperative processes are free to ignore the locks and access the file in any way they choose. In other words, file locks lock out other file lockers only, not I/O. See Wikipedia.
As stated in the open(2) reference, the solution for performing atomic file locking using a lockfile is to create a unique file on the same file system (e.g., incorporating the hostname and PID) and use link(2) to make a link to the lockfile. If link() returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check whether its link count has increased to 2, in which case the lock is also successful.
That is why filelock also uses the function fcntl.flock() and puts all that stuff in a module, as it should be.
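A minimal sketch of such advisory locking (assuming two cooperating processes that both go through a helper like this; not taken from the original answer):

import fcntl

def try_lock(path):
    """Return an open file holding an exclusive advisory lock, or None if already locked."""
    f = open(path, 'a')
    try:
        # Non-blocking exclusive lock; only matters to processes that also call flock().
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f          # keep f open; closing it releases the lock
    except (IOError, OSError):
        f.close()
        return None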
Alright! Thanks to those guys I actually have something now! So this is my function:
import os
import msvcrt

def lock_test(path):
    """
    Checks if a file can, aside from its permissions, be changed right now (True)
    or is already locked by another process (False).

    :param str path: file to be checked
    :rtype: bool
    """
    try:
        fd = os.open(path, os.O_APPEND | os.O_EXCL | os.O_RDWR)
    except OSError:
        return False

    try:
        msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
        msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
        os.close(fd)
        return True
    except (OSError, IOError):
        os.close(fd)
        return False
And the unittest could look something like this:
import unittest   # os, msvcrt and lock_test are assumed to be the ones from above

class Test(unittest.TestCase):

    def test_lock_test(self):
        testfile = 'some_test_name4142351345.xyz'
        testcontent = 'some random blaaa'
        with open(testfile, 'w') as fob:
            fob.write(testcontent)

        # test successful locking and unlocking
        self.assertTrue(lock_test(testfile))
        os.remove(testfile)
        self.assertFalse(os.path.exists(testfile))

        # make file again, lock and test False locking
        with open(testfile, 'w') as fob:
            fob.write(testcontent)
        fd = os.open(testfile, os.O_APPEND | os.O_RDWR)
        msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
        self.assertFalse(lock_test(testfile))
        msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
        self.assertTrue(lock_test(testfile))
        os.close(fd)

        with open(testfile) as fob:
            content = fob.read()
        self.assertTrue(content == testcontent)
        os.remove(testfile)
It works. The downsides are that it's kind of testing itself with itself, so the initial OSError catch is not even exercised, only the second locking attempt with msvcrt. But I don't know how to make it better right now.
I have a file results.txt on a server which is accessed by multiple VMs through NFS. A process runs on each of these VMs which reads the results.txt file and modifies it. If two processes, A and B, read the file at the same time, then only the modification of A or of B ends up in results.txt, depending on the order in which the processes write to the file.
If process A held a write lock on the file, then process B would have to wait until the lock is released before reading results.txt.
I have tried implementing this using Python:
import fcntl
f = open("/path/result.txt")
fcntl.flock(f,fcntl.LOCK_EX)
#code
It works as expected for files on the local disk, but when I try to lock a file on the NFS-mounted path, I get the following error:
Traceback (most recent call last):
File "lock.py", line 12, in <module>
fcntl.flock(f,fcntl.LOCK_EX)
IOError: [Errno 45] Operation not supported
I tried fcntl.fcntl and fcntl.flock, but got the same error. Is this an issue with the way I am using fcntl? Is any configuration required on the server where the file is stored?
Edit:
This is how I am using fcntl.fcntl:
f= open("results.txt")
lockdata = struct.pack('hhllhh', fcntl.F_RDLCK,0,0,0,0,0)
rv = fcntl.fcntl(f, fcntl.F_SETLKW, lockdata)
The NFS server version is 3.
I found flufl.lock best suited to my requirements.
Quoting the author from the project page:
[...] O_EXCL is broken on NFS file systems, programs which rely on it for performing locking tasks will contain a race condition. The solution for performing atomic file locking using a lockfile is to create a unique file on the same fs (e.g., incorporating hostname and pid), use link(2) to make a link to the lockfile. If link() returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.
Since it is not part of the standard library, I couldn't use it. Also, my requirement covered only a subset of the features offered by this module.
The following functions were written based on that module. Please adapt them to your requirements.
import os
import errno
import time

lock_owner = False    # becomes True only in the process that acquired the lock

def lockfile(target, link, timeout=300):
    global lock_owner
    poll_time = 10
    while timeout > 0:
        try:
            os.link(target, link)
            print("Lock acquired")
            lock_owner = True
            break
        except OSError as err:
            if err.errno == errno.EEXIST:
                print("Lock unavailable. Waiting for 10 seconds...")
                time.sleep(poll_time)
                timeout -= poll_time
            else:
                raise
    else:
        print("Timed out waiting for the lock.")

def releaselock(link):
    try:
        if lock_owner:
            os.unlink(link)
            print("File unlocked")
    except OSError:
        print("Error: didn't possess lock.")
This is a crude implementation that works for me. I have been using it and haven't faced any issues. There are many things that can be improved though. Hope this helps.
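For context, usage of these helpers might look roughly like this (the paths and file names are illustrative, not from the original answer):

import os
import socket

# Both the unique file and the lock link must live on the same NFS mount
# as the file being protected.
unique = "/mnt/nfs/.lock.%s.%d" % (socket.gethostname(), os.getpid())
open(unique, "w").close()                      # per-host/per-process unique file
lockfile(unique, "/mnt/nfs/results.lock", timeout=60)
try:
    with open("/mnt/nfs/results.txt", "a") as f:
        f.write("result line from this VM\n")
finally:
    releaselock("/mnt/nfs/results.lock")
    os.unlink(unique)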
I have a process which chats a lot to stderr, and I want to log that stuff to a file.
foo 2> /tmp/foo.log
Actually I'm launching it with python subprocess.Popen, but it may as well be from the shell for the purposes of this question.
with open('/tmp/foo.log', 'w') as stderr:
    foo_proc = subprocess.Popen(['foo'], stderr=stderr)
The problem is after a few days my log file can be very large, like >500 MB. I am interested in all that stderr chat, but only the recent stuff. How can I limit the size of the logfile to, say, 1 MB? The file should be a bit like a circular buffer in that the most recent stuff will be written but the older stuff should fall out of the file, so that it never goes above a given size.
I'm not sure if there's an elegant Unixey way to do this already which I'm simply not aware of, with some sort of special file.
An alternative solution with log rotation would be sufficient for my needs as well, as long as I don't have to interrupt the running process.
You should be able to use the stdlib logging package to do this. Instead of connecting the subprocess' output directly to a file, you can do something like this:
import logging

logger = logging.getLogger('foo')

def stream_reader(stream):
    # Read lines until the pipe reaches EOF (readline() then returns an empty string).
    while True:
        line = stream.readline()
        if not line:
            break
        logger.debug('%s', line.strip())
This just logs every line received from the stream, and you can configure logging with a RotatingFileHandler which provides log file rotation. You then arrange to read this data and log it.
foo_proc = subprocess.Popen(['foo'], stderr=subprocess.PIPE)
thread = threading.Thread(target=stream_reader, args=(foo_proc.stderr,))
thread.setDaemon(True) # optional
thread.start()
# do other stuff
thread.join() # await thread termination (optional for daemons)
Of course you can call stream_reader(foo_proc.stderr) too, but I'm assuming you might have other work to do while the foo subprocess does its stuff.
Here's one way you could configure logging (code that should only be executed once):
import logging, logging.handlers
handler = logging.handlers.RotatingFileHandler('/tmp/foo.log', 'a', 100000, 10)
logging.getLogger().addHandler(handler)
logging.getLogger('foo').setLevel(logging.DEBUG)
This will create up to 10 files of 100K named foo.log (and after rotation foo.log.1, foo.log.2 etc., where foo.log is the latest). You could also pass in 1000000, 1 to give you just foo.log and foo.log.1, where the rotation happens when the file would exceed 1000000 bytes in size.
The circular-buffer approach would be hard to implement, as you would constantly have to rewrite the whole file as soon as something falls out.
The approach with logrotate or something similar would be the way to go. In that case, you would simply do something like this:
import subprocess
import signal

def hupsignal(signum, frame):
    # logrotate (or another external tool) sends SIGHUP after moving the file aside;
    # reopen the log so that writing continues into the fresh file.
    global logfile
    logfile.close()
    logfile = open('/tmp/foo.log', 'ab')

logfile = open('/tmp/foo.log', 'ab')   # binary mode, since the pipe yields bytes
signal.signal(signal.SIGHUP, hupsignal)

foo_proc = subprocess.Popen(['foo'], stderr=subprocess.PIPE)

for chunk in iter(lambda: foo_proc.stderr.read(8192), b''):
    # iterate until EOF occurs
    logfile.write(chunk)
    # or do you want to rotate yourself?
    # Then omit the signal stuff and do it here:
    # if logfile.tell() > MAX_FILE_SIZE:
    #     logfile.close()
    #     logfile = open('/tmp/foo.log', 'ab')
It is not a complete solution; think of it as pseudocode, as it is untested and I am not sure about the syntax in one place or another. It probably needs some modification to make it work, but you should get the idea.
It is also an example of how to make it work with logrotate. Of course, you can rotate your log file yourself if needed.
You may be able to use the properties of 'open file descriptions' (distinct from, but closely related to, 'open file descriptors'). In particular, the current write position is associated with the open file description, so two processes that share a single open file description can each adjust the write position.
So, in context, the original process could retain the file descriptor for standard error of the child process, and periodically, when the position reaches your 1 MiB size, reposition the pointer to the start of the file, thus achieving your required circular buffer effect.
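A rough sketch of that shared-offset idea (untested; it assumes 'foo' is the chatty command from the question and that the parent never writes through the handle itself):

import os
import subprocess
import time

MAX_SIZE = 1 << 20                    # 1 MiB cap, adjust to taste

log = open('/tmp/foo.log', 'wb')
# The child inherits a descriptor referring to the *same open file description*,
# so the write offset is shared between parent and child.
proc = subprocess.Popen(['foo'], stderr=log)

while proc.poll() is None:
    time.sleep(5)
    # Querying or moving the offset here also moves the child's write position.
    if os.lseek(log.fileno(), 0, os.SEEK_CUR) >= MAX_SIZE:
        os.lseek(log.fileno(), 0, os.SEEK_SET)   # wrap to the start: circular-buffer effect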
The biggest problem is determining where the current messages are being written, so that you can read from the oldest material (just in front of the file position) to the newest material. It is unlikely that new lines overwriting the old will match exactly, so there'd be some debris. You might be able to follow each line from the child with a known character sequence (say 'XXXXXX'), and then have each write from the child reposition to overwrite the previous marker...but that definitely requires control over the program that's being run. If it is not under your control, or cannot be modified, that option vanishes.
An alternative would be to periodically truncate the file (maybe after copying it), and to have the child process write in append mode (because the file is opened in the parent in append mode). You could arrange to copy the material from the file to a spare file before truncating to preserve the previous 1 MiB of data. You might use up to 2 MiB that way, which is a lot better than 500 MiB and the sizes could be configured if you're actually short of space.
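A rough sketch of that copy-and-truncate variant (illustrative only; it assumes the child writes to the file in append mode, so its next write lands at the new end of file rather than at a stale offset):

import shutil

MAX_SIZE = 1 << 20                    # 1 MiB

def rotate_if_needed(path, spare):
    """Copy the current contents to `spare`, then truncate `path` in place."""
    with open(path, 'r+') as f:
        f.seek(0, 2)                  # seek to the end to measure the size
        if f.tell() < MAX_SIZE:
            return
        shutil.copyfile(path, spare)  # preserve the previous ~1 MiB
        f.truncate(0)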
Have fun!