How can I store documentation in a pickle file? - python

In Python, I generate quite often a pickle file to conserve the work I have done during programming.
Is there any possibility to store something like a docstring in the pickle that explains how this pickle was generated and what it's meaning is.

Because you can combine all kinds of items into dictionaries, tuples, and lists before pickling them, I would say the most straightforward solution would be to use a dictionary that has a docstring key.
pickle_dict = {'objs': [some, stuff, inhere],
'docstring': 'explanation of those objects'}
Of course, depending on what you are pickling, you may want key-value pairs for each object instead of a list of objects.
When you open the pickle back up, you can just read the docstring to remember how this pickle came to be.
As an alternative solution, I often just need to save one or two integer values about the pickle. In this case, I choose to save in the title of the pickle file. Depending on what you are doing, this could be preferred so you can read the "docstring" without having to unpickle it.

DataFrames and lists don't typically have docstrings because they are data. The docstring specification says:
A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object.
You can create any of these to make a docstring associated with the process that uses your data. The main class of your module for example.
class MyClass:
"""My docstring"""
def __init__(self, df):
self.df = df # Your dataframe
...
Something like this seems like it is closest to what you were asking within the conventions of the language.

Related

Python: can an object have an object as a "default" representation?

I am just getting started with OOP, so I apologise in advance if my question is as obvious as 2+2. :)
Basically I created a class that adds attributes and methods to a panda data frame. That's because I am sometimes looking to do complex but repetitive tasks like merging with a bunch of other tables, dropping duplicates, etc. So it's pleasant to be able to do that in just one go with a predefined method. For example, I can create an object like this:
mysupertable = MySupperTable(original_dataframe)
And then do:
mysupertable.complex_operation()
Where original_dataframe is the original panda data frame (or object) that is defined as an attribute to the class. Now, this is all good and well, but if I want to print (or just access) that original data frame I have to do something like
print(mysupertable.original_dataframe)
Is there a way to have that happening "by default" so if I just do:
print(mysupertable)
it will print the original data frame, rather than the memory location?
I know there are the str and rep methods that can be implemented in a class which return default string representations for an object. I was just wondering if there was a similar magic method (or else) to just default showing a particular attribute. I tried looking this up but I think I am somehow not finding the right words to describe what I want to do, because I can't seem to be able to find an answer.
Thank you!
Cheers
In your MySupperTable class, do:
class MySupperTable:
# ... other stuff in the class
def __str__(self) -> str:
return str(self.original_dataframe)
That will make it so that when a MySupperTable is converted to a str, it will convert its original_dataframe to a str and return that.
When you pass an object to print() it will print the object's string representation, which under the hood is retrieved by calling the object.__str__(). You can give a custom definition to this method the way that you would define any other method.

How do you name your json methods?

I work with a bit of json in my code. I work with it in both Objective-C and python (3 of course!). Someday, I'll probably be forced to work with it in java. All of these platforms have libraries for parsing json strings into native objects, usually dictionaries with stuff in them.
So every time I write a method/function that either produces or consumes json data, I'm always torn. Because sometimes, they consume or produce it in string form, and sometimes they do it in the higher level form.
For example, let's say I have a Script object that reifies some scheduling, and I can turn it into json for easy http transmission or mongo-ification, or whatever. And so I make two methods:
class Script(object):
def toJson(self):
...
def fromJson(self, json):
...
While these methods communicate that the Script object can populate itself or represent itself via json, it's totally unclear which form. Is the json variable there a dict or or a string?
So I'm wondering if others have evolved naming conventions that help clarify this?
I run into a similar issue (althought in my case it's xml). It may not be elegant, but I have a tendency to use the type name when returning that type, and append 'str' when returning a stringified version of the type. Eg:
class Script:
def tojson(self):
...
def tojsonstr(self):
...
def fromjson(self, json):
...
def fromjsonstr(self, jsonstr):
...
I've experimented w/ adding underscores, but in the long run, this just creates cumbersome identifiers (and my fingers get twisted typing longer strings).
I suppose you could get clever, and test the passed object for isinstance (at least in the case of fromX(), Eg:
def fromjson(self, json):
if isinstance(json, basestring):
# assume str
else:
# assume json object
...
Because I do a bit of pymongo as well, where it's common to refer to these structured dictionaries as docs, I began using that term (after discussing this question with a number of peers elsewhere). So I have methods/functions that look like:
def toDoc(self):
...
and
- (void) fromDoc: (NSDictionary*) doc {
...
}
Then, if I want to paper over those with json variants which do the to/from string encoding, I name this with JSON.
I realized in the end, that in JavaScript itself, this doesn't come up, because it's just an object already. But with languages like Python and ObjectiveC we tend to use intermediary literal objects such as dictionaries and lists as we step between our domain objects and JSON. What I really needed was some word that meant "JavaScriptishObjectStructure" or something like that.

Use python dict to lookup mutable objects

I have a bunch of File objects, and a bunch of Folder objects. Each folder has a list of files. Now, sometimes I'd like to lookup which folder a certain file is in. I don't want to traverse over all folders and files, so I create a lookup dict file -> folder.
folder = Folder()
myfile = File()
folder_lookup = {}
# This is pseudocode, I don't actually reach into the Folder
# object, but have an appropriate method
folder.files.append(myfile)
folder_lookup[myfile] = folder
Now, the problem is, the files are mutable objects. My application is built around the fact. I change properites on them, and the GUI is notified and updated accordingly. Of course you can't put mutable objects in dicts. So what I tried first is to generate a hash based on the current content, basically:
def __hash__(self):
return hash((self.title, ...))
This didn't work of course, because when the object's contents changed its hash (and thus its identity) changed, and everything got messed up. What I need is an object that keeps its identity, although its contents change. I tried various things, like making __hash__ return id(self), overriding __eq__, and so on, but never found a satisfying solution. One complication is that the whole construction should be pickelable, so that means I'd have to store id on creation, since it could change when pickling, I guess.
So I basically want to use the identity of an object (not its state) to quickly look up data related to the object. I've actually found a really nice pythonic workaround for my problem, which I might post shortly, but I'd like to see if someone else comes up with a solution.
I felt dirty writing this. Just put folder as an attribute on the file.
class dodgy(list):
def __init__(self, title):
self.title = title
super(list, self).__init__()
self.store = type("store", (object,), {"blanket" : self})
def __hash__(self):
return hash(self.store)
innocent_d = {}
dodge_1 = dodgy("dodge_1")
dodge_2 = dodgy("dodge_2")
innocent_d[dodge_1] = dodge_1.title
innocent_d[dodge_2] = dodge_2.title
print innocent_d[dodge_1]
dodge_1.extend(range(5))
dodge_1.title = "oh no"
print innocent_d[dodge_1]
OK, everybody noticed the extremely obvious workaround (that took my some days to come up with), just put an attribute on File that tells you which folder it is in. (Don't worry, that is also what I did.)
But, it turns out that I was working under wrong assumptions. You are not supposed to use mutable objects as keys, but that doesn't mean you can't (diabolic laughter)! The default implementation of __hash__ returns a unique value, probably derived from the object's address, that remains constant in time. And the default __eq__ follows the same notion of object identity.
So you can put mutable objects in a dict, and they work as expected (if you expect equality based on instance, not on value).
See also: I'm able to use a mutable object as a dictionary key in python. Is this not disallowed?
I was having problems because I was pickling/unpickling the objects, which of course changed the hashes. One could generate a unique ID in the constructor, and use that for equality and deriving a hash to overcome this.
(For the curious, as to why such a "lookup based on instance identity" dict might be neccessary: I've been experimenting with a kind of "object database". You have pure python objects, put them in lists/containers, and can define indexes on attributes for faster lookup, complex queries and so on. For foreign keys (1:n relationships) I can just use containers, but for the backlink I have to come up with something clever if I don't want to modify the objects on the n side.)

How to create a class from function

I am still struggling with understanding classes, I am not certain but I have an idea that this function I have created is probably a good candidate for a class. The function takes a list of dictionaries, identifies the keys and writes out a csv file.
First Q, is this function a good candidate for a class (I write out a lot of csv files
Second Q If the answer to 1 is yes, how do I do it
Third Q how do I use the instances of the class (did I say that right)
import csv
def writeCSV(dictList,outfile):
maxLine=dictList[0]
for item in dictList:
if len(item)>len(maxLine):
maxLine=item
dictList.insert(0,dict( (key,key) for key in maxLine.keys()))
csv_file=open(outfile,'ab')
writer = csv.DictWriter(csv_file,fieldnames=[key for key in maxLine.keys()],restval='notScanned',dialect='excel')
for dataLine in dictList:
writer.writerow(dataLine)
csv_file.close()
return
The main idea behind objects is that an object is data plus methods.
Whenever you are thinking about making something an object, you must ask yourself what will be the object's data, and what operations (methods) will you want to perform on that data.
Functions, more readily translate to methods than classes.
So, for instance, if your dictList is data upon which you often call writeCSV,
then perhaps make a dictList object with method writeCSV:
class DictList(object):
def __init__(self,data):
self.data=data
def writeCSV(self,outfile):
maxLine=self.data[0]
for item in self.data:
if len(item)>len(maxLine):
maxLine=item
self.data.insert(0,dict( (key,key) for key in maxLine.keys()))
csv_file=open(outfile,'ab')
writer = csv.DictWriter(
csv_file,fieldnames=[key for key in maxLine.keys()],
restval='notScanned',dialect='excel')
for dataLine in self.data:
writer.writerow(dataLine)
csv_file.close()
Then you could instantiate a DictList object:
dl=DictList([{},{},...])
dl.writeCSV(outfile)
Doing this might make sense if you have more methods that could operate on the same DictList.data. Otherwise, you'd probably be better off sticking with the original function.
For this you need to understand little bit concepts of classes first and then follow the next step.
I too faced a same problem and followed this LINK , I m sure u will also start working on classes from your structured programming.
If you want to write a lot of CSV files with the same dictList (is that what you're saying...?), turning the function into a class would let you perform initialization just once, and then write repeatedly from the same initialized instance. E.g., with other minor opts:
class CsvWriter(object):
def __init__(self, dictList):
self.maxline = max(dictList, key=len)
self.dictList = [dict((k,k) for k in self.maxline)]
self.dictList.extend(dictList)
def doWrite(self, outfile):
csv_file=open(outfile,'ab')
writer = csv.DictWriter(csv_file,
fieldnames=self.maxLine.keys(),
restval='notScanned',
dialect='excel')
for dataLine in self.dictList:
writer.writerow(dataLine)
csv_file.close()
This seems a dubious use case, but if it does match your desire, then you'd instantiate and use this class as follows...:
cw = CsvWriter(dataList)
for ou in many_outfiles:
cw.doWrite(ou)
When thinking about making objects, remember this:
Classes have attributes - things that describe different instances of the class differently
Classes have methods - things that the objects do (often involving using their attributes)
Objects and classes are wonderful, but the first thing to keep in mind is that they are not always necessary, or even desirable.
That said, in answer to your first question, this doesn't seem like a particularly good candidate for a class. The only thing different between the different CVS files you're writing are the data and the file you write to, and the only thing you do with them (ie, the only method you would have) is the function you've already written).
Even though the first answer is no, it's still instructive to see how a class is built.
class CSVWriter:
# this function is called when you create an instance of the class
# it sets up the initial attributes of the instance
def __init__(self, dictList, outFile):
self.dictList = dictList
self.outFile = outFile
def writeCSV(self):
# basically exactly what you have above, except you can use the instance's
# own variables (ie, self.dictList and self.outFile) instead of the local
# variables
For your final question - the first step to using an instance of a class (an individual object, if you will) is to create that instance:
myCSV = CSVWriter(dictList, outFile)
When the object is created, init is called with the arguments you gave it - that allows your object to have its own data. Now you can access any of the attributes or methods that your myCSV object has with the '.' operator:
myCSV.writeCSV()
print "Wrote a file to", myCSV.outFile
One way to think about objects versus functions is that objects are generally nouns (eg, I created a CSVWriter), while functions are verbs (eg, you wrote a the function that writes CSV files). If you're just doing something over and over again, without re-using any of the same data, a function by itself is fine. But, if you have lots of related data, and part of it gets changed in the course of the action, classes may be a good idea.
I don't think your writeCSV is in need of a class, typicaly class would be used when you have to update some state(data) and then act on it, may be with various options.
e.g. if you need to pass around your object, so that other function/method can add values to it or your final action/output function has many options or you think same data can be processed, acted upon in many ways.
Typically practical case would be if you have multiple functions which act on same data or a singe function whose optional parameter list is going to long, you may think of converting it into a class.
If in your case you had various options and need to insert data in increments, you should make it a class.
Usually class name would be noun, so function(verb) writeCSV -> class(noun) CSVWriter
class CSVWriter(object):
def __init__(self, init-params...):
self.data = {}
def addData(self, data):
self.data.update(data)
def dumpCSV(self, filePath):
...
def dumpJSON(self, filePath):
....
I think question 1 is pretty crucial as it goes to the heart of what a class is.
Yes, you can put this function in a class. A class is a set of functions (called methods) and data together in one logical unit. As other posters noted, probably overkill to have a class with one method.

Best way to store and use a large text-file in python

I'm creating a networked server for a boggle-clone I wrote in python, which accepts users, solves the boards, and scores the player input. The dictionary file I'm using is 1.8MB (the ENABLE2K dictionary), and I need it to be available to several game solver classes. Right now, I have it so that each class iterates through the file line-by-line and generates a hash table(associative array), but the more solver classes I instantiate, the more memory it takes up.
What I would like to do is import the dictionary file once and pass it to each solver instance as they need it. But what is the best way to do this? Should I import the dictionary in the global space, then access it in the solver class as globals()['dictionary']? Or should I import the dictionary then pass it as an argument to the class constructor? Is one of these better than the other? Is there a third option?
If you create a dictionary.py module, containing code which reads the file and builds a dictionary, this code will only be executed the first time it is imported. Further imports will return a reference to the existing module instance. As such, your classes can:
import dictionary
dictionary.words[whatever]
where dictionary.py has:
words = {}
# read file and add to 'words'
Even though it is essentially a singleton at this point, the usual arguments against globals apply. For a pythonic singleton-substitute, look up the "borg" object.
That's really the only difference. Once the dictionary object is created, you are only binding new references as you pass it along unless if you explicitly perform a deep copy. It makes sense that it is centrally constructed once and only once so long as each solver instance does not require a private copy for modification.
Adam, remember that in Python when you say:
a = read_dict_from_file()
b = a
... you are not actually copying a, and thus using more memory, you are merely making b another reference to the same object.
So basically any of the solutions you propose will be far better in terms of memory usage. Basically, read in the dictionary once and then hang on to a reference to that. Whether you do it with a global variable, or pass it to each instance, or something else, you'll be referencing the same object and not duplicating it.
Which one is most Pythonic? That's a whole 'nother can of worms, but here's what I would do personally:
def main(args):
run_initialization_stuff()
dictionary = read_dictionary_from_file()
solvers = [ Solver(class=x, dictionary=dictionary) for x in len(number_of_solvers) ]
HTH.
Depending on what your dict contains, you may be interested in the 'shelve' or 'anydbm' modules. They give you dict-like interfaces (just strings as keys and items for 'anydbm', and strings as keys and any python object as item for 'shelve') but the data is actually in a DBM file (gdbm, ndbm, dbhash, bsddb, depending on what's available on the platform.) You probably still want to share the actual database between classes as you are asking for, but it would avoid the parsing-the-textfile step as well as the keeping-it-all-in-memory bit.

Categories