Design of a python pickleable object that describes a file - python

I would like to create a class that describes a file resource and then pickle it. This part is straightforward. To be concrete, let's say that I have a class "A" that has methods to operate on a file. I can pickle this object if it does not contain a file handle. I want to be able to create a file handle in order to access the resource described by "A". If I have an "open()" method in class "A" that opens and stores the file handle for later use, then "A" is no longer pickleable. (I add here that opening the file includes some non-trivial indexing which cannot be cached--third party code--so closing and reopening when needed is not without expense). I could code class "A" as a factory that can generate file handles to the described file, but that could result in multiple file handles accessing the file contents simultaneously. I could use another class "B" to handle the opening of the file in class "A", including locking, etc. I am probably overthinking this, but any hints would be appreciated.

The question isn't too clear; what it looks like is that:
you have a third-party module which has picklable classes
those classes may contain references to files, which makes the classes themselves not picklable because open files aren't picklable.
Essentially, you want to make open files picklable. You can do this fairly easily, with certain caveats. Here's an incomplete but functional sample:
import pickle

class PicklableFile(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def __getattr__(self, key):
        # Delegate everything else to the wrapped file object.
        return getattr(self.fileobj, key)

    def __getstate__(self):
        # Replace the unpicklable file object with what is needed to reopen it.
        ret = self.__dict__.copy()
        ret['_file_name'] = self.fileobj.name
        ret['_file_mode'] = self.fileobj.mode
        ret['_file_pos'] = self.fileobj.tell()
        del ret['fileobj']
        return ret

    def __setstate__(self, state):
        # Reopen the file and seek back to the saved position.
        self.fileobj = open(state['_file_name'], state['_file_mode'])
        self.fileobj.seek(state['_file_pos'])
        del state['_file_name']
        del state['_file_mode']
        del state['_file_pos']
        self.__dict__.update(state)

f = PicklableFile(open("/tmp/blah"))
print(f.readline())
data = pickle.dumps(f)
f2 = pickle.loads(data)
print(f2.read())
Caveats and notes, some obvious, some less so:
This class should operate directly on the file object you got from open. If you're using wrapper classes on files, like gzip.GzipFile, those should go above this, not below it. Logically, treat this as a decorator class on top of file.
If the file doesn't exist when you unpickle, it can't be unpickled and will throw an exception.
If it's a different file, the behavior may or may not make sense.
If the file mode includes file creation ('w+'), and the file doesn't exist, it'll be created; we don't know what file permissions to use, since that's not stored with the file. If this is important--it probably shouldn't be--then store the correct permissions in the class when you first create it.
If the file isn't seekable, trying to seek to the old position may raise IOError; if you're using a file like that you'll need to decide how to handle that.
The file classes in Python 2 and Python 3 are different; there's no file class in Python 3. Even if you're only using Python 2 right now, don't subclass file.
I'd steer away from doing this; having pickled data dependent on external files not changing and staying in the same place is brittle. This makes it difficult to even relocate files, since your pickled data won't make sense.

If you open a pointer to a file, pickle it, then attempt to reconstitute it later, there is no guarantee that the file will still be available for opening.
To elaborate, the file pointer really represents a connection to the file. Just like a database connection, you can't "pickle" the other end of the connection, so this won't work.
Is it possible to keep the file pointer around in memory in its own process instead?

It sounds like you know you can't pickle the handle, and you're ok with that, you just want to pickle the part that can be pickled. As your object stands now, it can't be pickled because it has the handle. Do I have that right? If so, read on.
The pickle module lets your class describe its own state for exactly these cases. Define your own __getstate__ method: the pickler invokes it to get the state to be pickled, and only if the method is missing does it fall back to the default of pickling all of the instance's attributes.
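A minimal sketch of that idea, assuming a hypothetical FileResource class (the name, its attributes, and the throwaway demo file are illustrative, not from the question). The handle is opened once on demand and simply left out of the pickled state, so the object stays picklable while the handle is open:

```python
import os
import pickle
import tempfile


class FileResource:
    """Describes a file; the open handle is deliberately left out of the pickle."""

    def __init__(self, path, mode="r"):
        self.path = path
        self.mode = mode
        self._handle = None  # created on demand by open(), never pickled

    def open(self):
        # Reuse the existing handle so the expensive open happens only once.
        if self._handle is None or self._handle.closed:
            self._handle = open(self.path, self.mode)
        return self._handle

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __getstate__(self):
        # Copy the attributes and blank out the one that cannot be pickled.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state


# Demonstration with a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("hello\n")

resource = FileResource(demo_path)
first_line = resource.open().readline()          # the handle is now open
restored = pickle.loads(pickle.dumps(resource))  # works despite the open handle
restored_line = restored.open().readline()       # the copy opens its own handle

resource.close()
restored.close()
os.remove(demo_path)
```

No __setstate__ is needed here because the default behavior (updating the instance's __dict__ with the saved state) is enough; the restored copy starts without a handle and reopens the file the first time open() is called.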

Related

Python get file path from a file descriptor int (as returned from os.open)

I am using fusepy and I need to convert a file descriptor back in to a file object so that I can obtain the original file path
From the fusepy examples, when a file is created, a file descriptor is returned - for example:
def open(self, path, flags):
    print("open:", path)
    return os.open(path, flags)
the returned result is an integer: <class 'int'> with the value of 4
in a separate function named write, I need to reverse the file descriptor back in to a file so that I can get the file path, so I tried this:
f = os.fdopen(fh)
When I check the type of f I get the following f is type: <class '_io.TextIOWrapper'>
Which is not quite what I was expecting but a quick dir(f) shows that it has a name property, I thought that's what I was looking for, except name is simply the number 4...
How can I get the original file path the descriptor points to?
Since this seems to satisfy the need, I'll post it as an answer for the time being. One could access the underlying objects through f.buffer and f.buffer.raw (note that the surprise mentioned in the question depends a bit on the Python version used; in v2 this looked different), but that still won't help with accessing the filename used to open the file. Note: the open could just as well have taken place in a calling process, with the descriptor inherited by the Python process.
Not sure if there is a nicer and more portable way, but one can query the OS, and on a Unix-like system that provides procfs (Linux, for example) a readily available way is to read the procfs entry for the descriptor. For the above example:
os.readlink(f"/proc/self/fd/{fh}")
Still not an entirely trivial question. Descriptor may still be open and used, while the underlying file(name; filesystem reference) has already been deleted.
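As a rough sketch of the procfs approach, the lookup can be wrapped in a small helper that gives up gracefully where /proc is not available. The helper name path_from_fd is made up for illustration, and the temporary file only stands in for a descriptor obtained elsewhere:

```python
import os
import tempfile


def path_from_fd(fd):
    """Best-effort lookup of the path behind an open descriptor (Linux procfs)."""
    link = f"/proc/self/fd/{fd}"
    try:
        return os.readlink(link)
    except OSError:
        # No procfs on this platform, or the descriptor is not open.
        return None


# Stand-in for a descriptor returned by os.open() somewhere else.
fd, expected_path = tempfile.mkstemp()
try:
    found_path = path_from_fd(fd)
finally:
    os.close(fd)
    os.remove(expected_path)

print(found_path)
```

On Linux, if the file has been unlinked while the descriptor is still open, the link target is reported with a " (deleted)" suffix, so the returned string should be treated as informational rather than as a path that is guaranteed to exist.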

Is it possible to uniformly save any object in a JSON file?

I'm working on a web-server type of application and as part of multi-language communication I need to serialize objects in a JSON file. The issue is that I'd like to create a function which can take any custom object and save it at run time rather than limiting the function to what type of objects it can store based on structure.
Apologies if this question is a duplicate, however from what I have searched the other questions and answers do not seem to tackle the dynamic structure aspect of the problem, thus leading me to open this question.
The function is going to be used to communicate between PHP server code and Python scripts, hence the need for such a solution
I have attempted to use the json.dump(data,outfile) function, however the issue is that I need to convert such objects to a legal data structure first
JSON is a rigidly structured format, and Python's json module, by design, won't try to coerce types it doesn't understand.
Check out this SO answer. While __dict__ might work in some cases, it's often not exactly what you want. One option is to write one or more classes that inherit from JSONEncoder and provide a method that turns your type or types into basic types that json.dump can understand.
Another option would be to write a parent class, e.g. JSONSerializable, and have these data types inherit from it the way you'd use an interface in some other languages. Making it an abstract base class would make sense, but I doubt that's important to your situation. Define a method on your base class, e.g. def dictify(self), and either implement it if it makes sense to have a default behavior or just have it raise NotImplementedError.
Note that I'm not calling the method serialize, because actual serialization will be handled by json.dump.
from abc import ABC

class JSONSerializable(ABC):
    def dictify(self):
        raise NotImplementedError("Missing dictify implementation!")

class YourDataType(JSONSerializable):
    def __init__(self):
        self.something = None
        # etc etc

    def dictify(self):
        return {"something": self.something}

class YourIncompleteDataType(JSONSerializable):
    # No dictify(self) implementation
    pass
Example usage:
>>> valid = YourDataType()
>>> valid.something = "really something"
>>> valid.dictify()
{'something': 'really something'}
>>>
>>> invalid = YourIncompleteDataType()
>>> invalid.dictify()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in dictify
NotImplementedError: Missing dictify implementation!
Basically, though: You do need to handle this yourself, possibly on a per-type basis, depending on how different your types are. It's just a matter of what method of formatting your types for serialization is the best for your use case.
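The JSONEncoder option mentioned above could look roughly like this. It is a sketch under assumptions: DictifyEncoder and Point are hypothetical names, and the encoder simply defers to a dictify() method when an object has one, which ties the two options together:

```python
import json


class DictifyEncoder(json.JSONEncoder):
    """Hypothetical encoder: falls back to an object's dictify() method."""

    def default(self, obj):
        dictify = getattr(obj, "dictify", None)
        if callable(dictify):
            return dictify()
        # Let the base class raise TypeError for anything else.
        return super().default(obj)


class Point:
    """Stand-in for a custom type; not from the original post."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def dictify(self):
        return {"x": self.x, "y": self.y}


encoded = json.dumps({"origin": Point(0, 0), "tags": ["a", "b"]}, cls=DictifyEncoder)
print(encoded)
```

json calls default() only for objects it cannot serialize on its own, so built-in types keep their normal handling and nested custom objects are converted wherever they appear in the structure.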

Which python module contains file object methods?

While it is simple to search by using help for most methods that have a clear help(module.method) arrangement, for example help(list.extend), I cannot work out how to look up the method .readline() in python's inbuilt help function.
Which module does .readline belong to? How would I search in help for .readline and related methods?
Furthermore is there any way I can use the interpreter to find out which module a method belongs to in future?
Don't try to find the module. Make an instance of the class you want, then call help on the method of that instance, and it will find the correct help info for you. Example:
>>> f = open('pathtosomefile')
>>> help(f.readline)
Help on built-in function readline:

readline(size=-1, /) method of _io.TextIOWrapper instance
    Read until newline or EOF.

    Returns an empty string if EOF is hit immediately.
In my case (Python 3.7.1), it's defined on the type _io.TextIOWrapper (exposed publicly as io.TextIOWrapper, but help doesn't know that), but memorizing that sort of thing isn't very helpful. Knowing how to figure it out by introspecting the specific thing you care about is much more broadly applicable. In this particular case, it's extra important not to try guessing, because the open function can return a few different classes, each with different methods, depending on the arguments provided, including io.BufferedReader, io.BufferedWriter, io.BufferedRandom, and io.FileIO, each with their own version of the readline method (though they all share a similar interface for consistency's sake).
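The claim that open can return several different classes is easy to check in the interpreter. The snippet below is only an illustration; the temporary file and the labels in the list are arbitrary choices:

```python
import io
import os
import tempfile

# An empty scratch file to open in different modes.
fd, mode_demo_path = tempfile.mkstemp()
os.close(fd)

opened_types = {}
for label, mode, kwargs in [
    ("text", "r", {}),
    ("binary read", "rb", {}),
    ("binary write", "wb", {}),
    ("binary read/write", "r+b", {}),
    ("unbuffered binary", "rb", {"buffering": 0}),
]:
    with open(mode_demo_path, mode, **kwargs) as f:
        opened_types[label] = type(f)

os.remove(mode_demo_path)

for label, cls in opened_types.items():
    print(f"{label}: {cls.__name__}")
```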
From the text of help(open):
open() returns a file object whose type depends on the mode, and
through which the standard file operations such as reading and writing
are performed. When open() is used to open a file in a text mode ('w',
'r', 'wt', 'rt', etc.), it returns a TextIOWrapper. When used to open
a file in a binary mode, the returned class varies: in read binary
mode, it returns a BufferedReader; in write binary and append binary
modes, it returns a BufferedWriter, and in read/write mode, it returns
a BufferedRandom.
See also the section of python's io module documentation on the class hierarchy.
So you're looking at TextIOWrapper, BufferedReader, BufferedWriter, or BufferedRandom. These all have their own sets of class hierarchies, but suffice it to say that they share the IOBase superclass at some point - that's where the functions readline() and readlines() are declared. Of course, each subclass implements these functions differently for its particular mode - if you do
import io
help(io.TextIOWrapper.readline)
you should get the documentation you're looking for.
In particular, you're having trouble accessing the documentation for whichever version of readline you need because you don't know which class it belongs to. You can actually call help on an object as well. If you're working with a particular file object, then you can spin up a terminal, instantiate it, and then just pass it to help() and it'll show you whatever interface is closest to the surface. Example:
x = open('some_file.txt', 'r')
help(x.readline)
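For the follow-up question of how to ask the interpreter where a method comes from, one possible approach is to inspect the object's type and walk its method resolution order. This is a sketch rather than the only way to do it; the scratch file and variable names are illustrative:

```python
import os
import tempfile

# A scratch text file so that open() has something to return.
fd, help_demo_path = tempfile.mkstemp()
os.close(fd)

with open(help_demo_path) as stream:
    cls = type(stream)
    # First class in the MRO that defines readline itself (not inherited).
    defining_class = next(c for c in cls.__mro__ if "readline" in vars(c))
    module_name = cls.__module__
    class_name = cls.__qualname__

os.remove(help_demo_path)

print(module_name, class_name)
print(defining_class.__module__, defining_class.__qualname__)
# help(defining_class.readline) would then show the matching documentation.
```

type(obj).__module__ and type(obj).__qualname__ answer the "which module and class" question for any object, not just files.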

How does open handle context management?

The python built-ins open, and file work with context managers in a way that I don't quite understand.
It is my understanding that open will create a file object. file implements the context-manager methods __enter__ and __exit__. I would initially expect __enter__ to implement the actual opening of the file descriptor.
However, using open outside of a with block will return a file which is already open. So, it appears either file.__init__ or open is actually opening the file descriptor, and as far as I can tell file.__enter__ isn't doing anything. Or maybe file.__init__/open calls file.__enter__ directly?
First question:
What is the execution-flow of the open built-in? What does open handle, what does file.__init__ handle, and what does file.__enter__ handle? How does this work when re-using one file object for multiple cycles of opening/closing the file? How is this different from re-using other contextmanager objects for multiple context-cycles?
Second question:
Objects such as file objects have setup steps and teardown steps. The setup occurs in __init__, and the teardown occurs in either close or __exit__.
Is this a good design pattern? Should this design pattern be implemented for custom functions/context managers?
If you look in _pyio.py (a pure-Python implementation of the io module) you find the following code in class IOBase:
    ### Context manager ###

    def __enter__(self):  # That's a forward reference
        """Context management protocol.  Returns self (an instance of IOBase)."""
        self._checkClosed()
        return self

    def __exit__(self, *args):
        """Context management protocol.  Calls close()"""
        self.close()
This contains the answers to most of your questions. The important thing to understand is that the context manager's function is to ensure that you close the file when you are done with it. It does this simply by calling the close method, which saves you the trouble of doing so.
What does file.__enter__ handle? Nothing. It simply returns to you the file object that was the result of the call to the built-in function open().
How does this work when using one file object for multiple cycles of opening and closing the file? The context manager is not very useful for that purpose, since you must explicitly open the file each time.
Is this a good design pattern? Yes: it reduces the amount of code you have to write, and it's easy to read and understand.
Should this pattern be implemented for custom functions/context managers? Any time you have an object that needs to be cleaned up, or has usage that involves some type of open/close concept, you should consider this pattern. The standard library has many other examples.
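A custom class following the same pattern as file objects might look like the sketch below: setup happens in __init__, teardown in close(), and the context-manager methods only check state and delegate. Connection and its methods are hypothetical, invented for this illustration:

```python
class Connection:
    """Hypothetical resource: acquired in __init__, released in close()."""

    def __init__(self, name):
        self.name = name
        self.closed = False  # the resource is "open" as soon as the object exists

    def _check_closed(self):
        if self.closed:
            raise ValueError("operation on closed connection")

    def send(self, message):
        self._check_closed()
        return f"{self.name} <- {message}"

    def close(self):
        self.closed = True

    def __enter__(self):
        # Like IOBase.__enter__: no setup, just a sanity check.
        self._check_closed()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised in the with block


with Connection("demo") as conn:
    reply = conn.send("ping")

# Once closed, the same object cannot be entered again.
try:
    with conn:
        pass
except ValueError as exc:
    reuse_error = str(exc)
```

As with files, a new object has to be created for each open/close cycle; the with block only guarantees that close() is called.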
For Question 1
In CPython (Python 2), open() does nothing but create a file object, whose underlying C type is PyFileObject; see the source code in bltinmodule.c and fileobject.c:
static PyObject *
builtin_open(PyObject *self, PyObject *args, PyObject *kwds)
{
    return PyObject_Call((PyObject*)&PyFile_Type, args, kwds);
}
file.__init__ opens the file.
file.__enter__ indeed does nothing except check that the file.fp field is not empty.
file.__exit__ invokes the close() method to close the file.
For Question 2
The file type is designed this way for historical reasons.
open and with were introduced in different versions of CPython: open is a built-in function that had already been in use for a long time when the with statement arrived in Python 2.5 (see PEP 343).
For our own types, we can follow the design of file or not, depending on the concrete application.
For example, threading.Lock is designed differently: its initialization and its __enter__ are separate steps, so creating the lock does not acquire it.
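The difference between the two designs can be observed directly. The sketch below uses Python 3 and a scratch file; the variable names are illustrative:

```python
import os
import tempfile
import threading

lock = threading.Lock()
lock_states = [lock.locked()]          # creating the lock does not acquire it
with lock:
    lock_states.append(lock.locked())  # __enter__ acquired it
lock_states.append(lock.locked())      # __exit__ released it
with lock:                             # the same object can be entered again
    lock_states.append(lock.locked())

fd, lock_demo_path = tempfile.mkstemp()
os.close(fd)

f = open(lock_demo_path)
file_states = [f.closed]               # already open before any with block
with f:
    file_states.append(f.closed)
file_states.append(f.closed)           # __exit__ closed it
try:
    with f:                            # a closed file cannot be re-entered
        pass
except ValueError:
    file_states.append("ValueError")

os.remove(lock_demo_path)

print(lock_states)
print(file_states)
```

A lock is reusable because acquisition lives in __enter__; a file object is single-use because acquisition already happened when it was created.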

How to save changes to a python object?

I have a Python dictionary of objects from a class that I have created in one file. It is of the form {string : object}, with several key, value pairs.
My goal is to do something in a method in a separate file that changes an attribute of certain objects in the dictionary and to save those changes to those objects while keeping them within the dictionary.
I've tried using pickle, but it doesn't seem to save the changes to the objects within the dictionary.
Basic Idea of what I'm doing right now and what is wrong with it:
File #1:
class A:
    def __init__(self):
        self.value = 0

a = A()
dict = {"Test" : a}
pickle.dump(dict, open("save.p", "wb"))
File #2:
dict = pickle.load(open("save.p", "rb"))
dict["Test"].value += 1
print(dict["Test"].value)
pickle.dump(dict, open("save.p", "wb"))
So when I run File #2 the first time, it should print 1, and it does
but when I run File #2 the second time, I want it to print 2, but it prints 1 again because the change to the value was not saved.
It could be that I am using pickle incorrectly...
Any help would be appreciated! Thanks!
From the pickle documentation:
Note that none of the class’s code or data is pickled
See pickling class instances for the right way to do it.
Also, class A does not exist in the unpickling environment, which can't be a good thing: classes are unpickled by name, if I read the docs right.
BTW I'd use json over pickle so you can open the file between two runs and inspect it yourself to understand what happens. There are a few advantages to using json over pickle, and a few to using pickle over json; here's a comparison between pickle and json.
Oh, and avoid naming your variables dict or after any other builtin; it shadows them and can lead to very strange behavior.
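For reference, here is a self-contained sketch of a load, modify, save cycle that does persist the change between runs. The names Counter, load_objects, save_objects and bump are illustrative, the class is defined where the unpickling code can find it, and the files are closed explicitly through with blocks:

```python
import os
import pickle
import tempfile


class Counter:
    def __init__(self):
        self.value = 0


def load_objects(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)


def save_objects(objects, path):
    # The with block closes the file, so the data is flushed to disk.
    with open(path, "wb") as fh:
        pickle.dump(objects, fh)


def bump(path, key):
    """Load the dictionary, change one object, and write it back."""
    objects = load_objects(path)
    objects[key].value += 1
    save_objects(objects, path)
    return objects[key].value


with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = os.path.join(tmp_dir, "save.p")
    save_objects({"Test": Counter()}, save_path)  # initial save, done once
    first_run = bump(save_path, "Test")
    second_run = bump(save_path, "Test")

print(first_run, second_run)
```

The initial save is done only once here. If the code that creates the dictionary and dumps it runs again before each update (for instance because the second file imports the first to get the class), the saved value is reset each time; that is one possible explanation for the behavior in the question, though the post does not confirm it.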
