Disclaimer: this is perhaps a quite subjective question with no 'right' answer but I'd appreciate any feedback on best-practices and program design. So here goes:
I am writing a library where text files are read into Text objects. Now these might be initialized with a list of file-names or directly with a list of Sentence objects. I am wondering what the best / most Pythonic way to do this might be because, if I understand correctly, Python doesn't directly support method overloading.
One example I found in Scikit-Learn's feature extraction module simply passes the type of the input as an argument while initializing the object. I assume that once this parameter is set it's just a matter of handling the different cases internally:
if input == 'filename':
    # glob and read files
elif input == 'content':
    # do something else
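For concreteness, here's that flag-based style fleshed out into a full sketch (the names, the 'content' flag, and the naive sentence splitting are just illustrations of the idea, not scikit-learn's actual API):

class Text(object):
    def __init__(self, data, input='filename'):
        if input == 'filename':
            # glob and read files, splitting (naively) into sentences
            self.sentences = []
            for fname in data:
                with open(fname) as f:
                    self.sentences.extend(f.read().split('. '))
        elif input == 'content':
            # data is already a list of sentences
            self.sentences = list(data)
        else:
            raise ValueError("input must be 'filename' or 'content'")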
While this is easy to implement, it doesn't look like a very elegant solution. So I am wondering if there is a better way to handle multiple types of inputs to initialize a class that I am overlooking.
One way is to just create classmethods with different names for the different ways of instantiating the object:
class Text(object):
    def __init__(self, data):
        # handle data in whatever "basic" form you need
        self.data = data

    @classmethod
    def fromFiles(cls, files):
        # process list of filenames into the form that `__init__` needs
        return cls(processed_data)

    @classmethod
    def fromSentences(cls, sentences):
        # process list of Sentence objects into the form that `__init__` needs
        return cls(processed_data)
This way you just create one "real" or "canonical" initialization method that accepts whatever "lowest common denominator" format you want. The specialized fromXXX methods can preprocess different types of input to convert them into the form they need to be in to pass to that canonical instantiation. The idea is that you call Text.fromFiles(...) to make a Text from filenames, or Text.fromSentences(...) to make a Text from sentence objects.
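Hypothetical usage (the Sentence class and the file names here are assumptions for illustration):

# Sentence and the file names are made up for the example:
text_a = Text.fromFiles(['corpus1.txt', 'corpus2.txt'])
text_b = Text.fromSentences([Sentence('Hello.'), Sentence('World.')])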
It can also be acceptable to do some simple type-checking if you just want to accept one of a few enumerable kinds of input. For instance, it's not uncommon for a class to accept either a filename (as a string) or a file object. In that case you'd do:
def __init__(self, file):
    if isinstance(file, basestring):
        # If a string filename was passed in, open the file before proceeding
        file = open(file)
    # Now you can handle `file` as a file object
This becomes unwieldy if you have many different types of input to handle, but if it's something relatively contained like this (e.g., an object or the string "name" that can be used to get that object), it can be simpler than the first method I showed.
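With that pattern, a hypothetical Text class built this way could be constructed either way:

# Text is a hypothetical class using the __init__ shown above:
text_from_name = Text('corpus.txt')        # a filename string gets opened
text_from_file = Text(open('corpus.txt'))  # an open file object passes through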
You can use duck typing: first proceed as if the arguments are of type X, and if they raise an exception, assume they are of type Y, and so on:
class Text(object):
    def __init__(self, *init_vals):
        try:
            fileobjs = [open(fname) for fname in init_vals]
        except TypeError:
            # Then we consider them as file objects.
            fileobjs = init_vals
        try:
            sentences = [parse_sentences(fobj) for fobj in fileobjs]
        except TypeError:
            # Then init_vals are Sentence objects.
            sentences = fileobjs
Note that the absence of type checking means that the method actually accepts any type that implements one of the interfaces you actually use (e.g. file-like objects, Sentence-like objects, etc.).
This method becomes quite heavy if you want to support a lot of different types, but I'd consider that bad code design anyway. Accepting more than two or three types as initializers will probably confuse any programmer who uses your class, since they will always have to think "wait, did X also accept Y, or was it Z that accepted Y...".
It's probably better to design the constructor to accept only two or three different interfaces, and to provide the user with functions/classes that let them convert other commonly used types to those interfaces.
Related
I have a function like the following:
def do_something(thing):
    pass
def foo(everything, which_things):
    """Do stuff to things.

    Args:
        everything: Dict of things indexed by thing identifiers.
        which_things: Iterable of thing identifiers. Something is only
            done to these things.
    """
    for thing_identifier in which_things:
        do_something(everything[thing_identifier])
But I want to extend it so that a caller can do_something to everything in everything without having to provide a list of identifiers. (One motivation: everything might be an opaque container whose keys are accessible to library internals but not to library users, so foo can access the keys but the caller can't. Another is error prevention: a constant with obvious semantics avoids a caller mistakenly passing in the wrong set of identifiers.) So one thought is to have a constant USE_EVERYTHING that can be passed in, like so:
def foo(everything, which_things):
    """Do stuff to things.

    Args:
        everything: Dict of things indexed by thing identifiers.
        which_things: Iterable of thing identifiers. Something is only
            done to these things. Alternatively pass USE_EVERYTHING to
            do something to everything.
    """
    if which_things == USE_EVERYTHING:
        which_things = everything.keys()
    for thing_identifier in which_things:
        do_something(everything[thing_identifier])
What are some advantages and limitations of this approach? How can I define a USE_EVERYTHING constant so that it is unique and specific to this function?
My first thought is to give it its own type, like so:
class UseEverythingType:
    pass

USE_EVERYTHING = UseEverythingType()
This would be in a package exporting only USE_EVERYTHING, discouraging the creation of any other UseEverythingType objects. But I worry that I'm not considering all aspects of Python's object model -- could two instances of USE_EVERYTHING somehow compare unequal?
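For comparison, a common idiom (a sketch, not necessarily what I want here) is a bare object() sentinel compared by identity rather than equality, which sidesteps the equality worry entirely:

# A plain object() is unique; comparing with `is` checks identity,
# so no other value can ever accidentally "equal" the sentinel.
USE_EVERYTHING = object()

def foo(everything, which_things):
    if which_things is USE_EVERYTHING:   # identity, not equality
        which_things = everything.keys()
    for thing_identifier in which_things:
        do_something(everything[thing_identifier])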
I have a function which needs to behave differently depending on the type of the parameter taken in. My first impulse was to include some calls to isinstance, but I keep seeing answers on stackoverflow saying that this is bad form and unpythonic, without much reason why it's bad form and unpythonic. For the latter, I suppose it has something to do with duck typing, but what's the big deal about checking whether your arguments are of a specific type? Isn't it better to play it safe?
Consult this great post
My opinion on the matter is this:
If you are restricting your code, don't do it.
If you are using it to direct your code, then limit it to very specific cases.
Good Example: (this is okay)
def write_to_file(var, content, close=True):
    if isinstance(var, str):
        var = open(var, 'w')
    var.write(content)
    if close:
        var.close()
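Both call styles then work with this version ('out.txt' is just an example name):

write_to_file('out.txt', 'hello')             # pass a filename...
write_to_file(open('out.txt', 'w'), 'hello')  # ...or an already-open file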
Bad Example: (this is bad)
def write_to_file(var, content, close=True):
    if not isinstance(var, file):
        raise Exception('expected a file')
    var.write(content)
    if close:
        var.close()
Using isinstance limits the objects which you can pass to your function. For example:
def add(a, b):
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    else:
        raise ValueError
Now you might try to call it:
add(1.0,2)
expecting to get 3 but instead you get an error because 1.0 isn't an integer. Clearly, using isinstance here prevented our function from being as useful as it could be. Ultimately, if our objects taste like a duck when we roast them, we don't care what type they were to begin with just as long as they work.
However, there are situations where the opposite is true:
def read(f):
    if isinstance(f, basestring):
        with open(f) as fin:
            return fin.read()
    else:
        return f.read()
The point is, you need to decide the API that you want your function to have. Cases where your function should behave differently based on the type exist, but are rare (checking for strings to open files is one of the more common uses that I know of).
Because doing so explicitly prevents duck-typing.
Here's an example. The csv module allows me to write data to a file in CSV format. For that reason, the function accepts a file as a parameter. But what if I didn't want to write to an actual file, but to something like a StringIO object? That's a perfectly good use of it, since StringIO implements the necessary read and write methods. But if csv was explicitly checking for an actual object of type file, that would be forbidden.
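For instance (a minimal sketch, Python 2 style to match the rest of this thread), csv.writer is perfectly happy with an in-memory StringIO:

import csv
from StringIO import StringIO  # io.StringIO in Python 3

buf = StringIO()                 # file-like, but purely in memory
writer = csv.writer(buf)         # csv only needs a write() method
writer.writerow(['a', 'b', 'c'])
print buf.getvalue()             # a,b,c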
Generally, Python takes the view that we should allow things as much as possible - it's the same reasoning behind the lack of real private variables in classes.
Sometimes usage of isinstance just reimplements polymorphic dispatch. Look at str(...): it calls object.__str__(...), which is implemented by each type individually. By implementing __str__ you can reuse any code that depends on str(), extending that one object instead of having to manipulate the built-in str(...).
Basically this is the culminating point of OOP: you want polymorphic behaviour, you do not want to spell out types.
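A minimal sketch of that dispatch (Python 2 print, like the rest of the thread); any code that calls str() picks up the new behaviour for free:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __str__(self):
        # str(point) dispatches here -- no isinstance needed anywhere
        return "(%d, %d)" % (self.x, self.y)

print str(Point(1, 2))   # prints: (1, 2)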
There are valid reasons to use it though.
There are many Python modules for parsing and coordinating command line options (argparse, getopt, blargs, etc). And Python is blessed with good built-in features/idioms for handling varied function arguments (e.g., default values, *varargs, **keyword_args). But when I read various projects' code for top-level functions, I see notably less discipline and standardization of function arguments than command line arguments.
For simple functions, this isn't an issue; the built-in argument features work great and are more than sufficient. But there are a lot of functionally rich modules whose top-level functions provide lots of different arguments and options (some complementary or exclusive), different modes of operation, defaults, overrides, etc.--that is, they have argument complexity approaching that of command line arguments. And they seem to largely handle their arguments in ad hoc ways.
Given the number of command line processing modules out there, and how refined they've become over time, I'd expect at least a few modules for simplifying the wrangling of complicated function arguments. But I've searched PyPi, stackoverflow, and Google without success. So...are there function (not command line!) argument handling modules you would recommend?
---update with example---
It's hard to give a truly simple concrete example because the use case doesn't appear until you're dealing with a sophisticated module. But here's a shot at explaining the problem in code: A formatter module with defaults that can be overridden in formatter instantiation, or when the function/method is called. For having only a few options, there's already an awful lot of option-handling verbiage, and the option names are repeated ad nauseam.
defaults = {
    'indent': 4,
    'prefix': None,
    'suffix': None,
    'name': 'aFormatter',
    'reverse': False,
    'show_name': False,
}

class Formatter(object):
    def __init__(self, **kwargs):
        self.name = kwargs.get('name', defaults['name'])
        self.indent = kwargs.get('indent', defaults['indent'])
        self.prefix = kwargs.get('prefix', defaults['prefix'])
        self.suffix = kwargs.get('suffix', defaults['suffix'])
        self.reverse = kwargs.get('reverse', defaults['reverse'])
        self.show_name = kwargs.get('show_name', defaults['show_name'])

    def show_lower(self, *args, **kwargs):
        indent = kwargs.get('indent', self.indent) or 0
        prefix = kwargs.get('prefix', self.prefix)
        suffix = kwargs.get('suffix', self.suffix)
        reverse = kwargs.get('reverse', self.reverse)
        show_name = kwargs.get('show_name', self.show_name)
        strings = []
        if show_name:
            strings.append(self.name + ": ")
        if indent:
            strings.append(" " * indent)
        if prefix:
            strings.append(prefix)
        for a in args:
            strings.append(a.upper() if reverse else a.lower())
        if suffix:
            strings.append(suffix)
        print ''.join(strings)
if __name__ == '__main__':
    fmt = Formatter()
    fmt.show_lower("THIS IS GOOD")
    fmt.show_lower("THIS", "IS", "GOOD")
    fmt.show_lower('this IS good', reverse=True)
    fmt.show_lower("something!", show_name=True)
    upper = Formatter(reverse=True)
    upper.show_lower("this is good!")
    upper.show_lower("and so is this!", reverse=False)
When I first read your question, I thought to myself that you're asking for a band-aid module, and that it doesn't exist because nobody wants to write a module that enables bad design to persist.
But I realized that the situation is more complex than that. The point of creating a module such as the one you describe is to create reusable, general-case code. Now, it may well be that there are some interfaces that are justifiably complex. But those interfaces are precisely the interfaces that probably can't be handled easily by general-case code. They are complex because they address a problem domain with a lot of special cases.
In other words, if an interface really can't be refactored, then it probably requires a lot of custom, special-case code that isn't predictable enough to be worth generalizing in a module. Conversely, if an interface can easily be patched up with a module of the kind you describe, then it probably can also be refactored -- in which case it should be.
I don't think command line parsing and function argument processing have much in common. The main issue with the command line is that the only available data structure is a flat list of strings, and you don't have an instrument like a function header available to define what each string means. In the header of a Python function, you can give names to each of the parameters, you can accept containers as parameters, you can define default argument values etc. What a command line parsing library does is actually providing for the command line some of the features Python offers for function calls: give names to parameters, assign default values, convert to the desired types etc. In Python, all these features are built-in, so you don't need a library to get to that level of convenience.
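To illustrate the point (a throwaway sketch; all the names are made up), the function header already gives you what a command line parsing library has to build from scratch:

def connect(host, port=5432, timeout=30.0, options=None):
    # Named parameters, default values, and container-valued arguments
    # all come for free in a Python function header.
    options = options or {}
    return (host, port, timeout, options)

connect('db.example.com')                        # defaults fill in the rest
connect('db.example.com', 5433, options={'ssl': True})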
Regarding your example, there are numerous ways this design could be improved using features the language offers. You could use default argument values instead of your defaults dictionary, or encapsulate all the flags in a FormatterConfig class and pass one argument instead of all those arguments again and again. But let's assume you want exactly the interface you gave in the example code. One way to achieve this would be the following:
class Config(dict):
    def __init__(self, config):
        dict.__init__(self, config)
        self.__dict__ = self

def get_config(kwargs, defaults):
    config = defaults.copy()
    config.update(kwargs)
    return Config(config)

class Formatter(object):
    def __init__(self, **kwargs):
        self.config = get_config(kwargs, defaults)

    def show_lower(self, *args, **kwargs):
        config = get_config(kwargs, self.config)
        strings = []
        if config.show_name:
            strings.append(config.name + ": ")
        strings.append(" " * config.indent)
        if config.prefix:
            strings.append(config.prefix)
        for a in args:
            strings.append(a.upper() if config.reverse else a.lower())
        if config.suffix:
            strings.append(config.suffix)
        print "".join(strings)
Python offers a lot of tools for this kind of argument handling, so even if we decide not to use some of them (like default arguments), we can still avoid repeating ourselves too much.
If your API is so complex that you think it would be easier to use some module to process the options passed to you, there's a good chance the actual solution is to simplify your API. The fact that some modules have very complex calling conventions is a shame, not a feature.
It's in the developer's hands, but if you're making a library which may be useful for other projects or will be published to other users, then I think you first need to identify your problem and analyse it.
Document your functions well; it's good to minimize the number of arguments.
Provide default values for arguments where users may have trouble specifying exactly what needs to be passed.
For complex requirements you can provide special classmethods that can be overridden by advanced users who want full control over what they are doing with the library; inheritance is always there.
You can also read PEP 8, which may be helpful, but the ultimate goal is to require the minimum number of arguments, restrict users to the required ones, and provide default values for the optional ones, in such a way that your library / code is easily understandable by ordinary developers too.
You could write more generic code for the defaulting if you think about it the other way around: go through the defaults and fill in any keywords that don't already exist.
defaults = {
    'indent': 4,
    'prefix': None,
    'suffix': None,
    'name': 'aFormatter',
    'reverse': False,
    'show_name': False,
}

class Formatter(object):
    def __init__(self, **kwargs):
        for d, dv in defaults.iteritems():
            kwargs[d] = kwargs.get(d, dv)
        # every option now has a value; store them all as attributes
        self.__dict__.update(kwargs)
Side note:
I'd recommend using keyword args with defaults in the __init__ method definition. That way the function definition really becomes the contract with other developers and users of your class (Formatter):
def __init__(self, indent=4, reverse=False, ...etc...):
Is it common in Python to keep testing for type values when working in an OOP fashion?
class Foo():
    def __init__(self, barObject):
        self.setBarObject(barObject)

    def setBarObject(self, barObject):
        if isinstance(barObject, Bar):
            self.bar = barObject
        else:
            # throw exception, log, etc.
            raise TypeError("expected a Bar instance")

class Bar():
    pass
Or I can use a more loose approach, like:
class Foo():
    def __init__(self, barObject):
        self.bar = barObject

class Bar():
    pass
Nope, in fact it's overwhelmingly common not to test for type values, as in your second approach. The idea is that a client of your code (i.e. some other programmer who uses your class) should be able to pass any kind of object that has all the appropriate methods or properties. If it doesn't happen to be an instance of some particular class, that's fine; your code never needs to know the difference. This is called duck typing, because of the adage "If it quacks like a duck and flies like a duck, it might as well be a duck" (well, that's not the actual adage, but I think I got the gist of it).
One place you'll see this a lot is in the standard library, with any functions that handle file input or output. Instead of requiring an actual file object, they'll take anything that implements the read() or readline() method (depending on the function), or write() for writing. In fact you'll often see this in the documentation, e.g. with tokenize.generate_tokens, which I just happened to be looking at earlier today:
The generate_tokens() generator requires one argument, readline, which must be a callable object which provides the same interface as the readline() method of built-in file objects (see section File Objects). Each call to the function should return one line of input as a string.
This allows you to use a StringIO object (like an in-memory file), or something wackier like a dialog box, in place of a real file.
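For instance (a minimal sketch, Python 2 style like the rest of the thread), feeding generate_tokens from an in-memory string:

import tokenize
from StringIO import StringIO  # io.StringIO in Python 3

# Any callable that behaves like a file's readline() will do.
for tok in tokenize.generate_tokens(StringIO("x = 1\n").readline):
    print tok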
In your own code, just access whatever properties of an object you need, and if it's the wrong kind of object, one of the properties you need won't be there and it'll throw an exception.
I think that it's good practice to check input for type. It's reasonable to assume that if you ask a user for one data type they might give you another, so you should code to defend against this.
However, it seems like a waste of time (both writing and running the program) to check the type of data the program generates itself, independent of input. Checking types there, the way a strongly-typed language would, only defends against programmer error, which isn't worth guarding at runtime.
So basically: check input but nothing else, so that code can run smoothly and users don't have to wonder why they got an exception rather than a result.
If your alternative to the type check is an else branch containing exception handling, then you should really consider duck typing one tier up: support any object that has the methods you require from the input, do the work inside a try block, and then except (as specifically as possible) whatever may go wrong.
The final result wouldn't be unlike what you have there, but a lot more versatile and Pythonic.
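A minimal sketch of that pattern, reworking the earlier read() example:

def read(f):
    try:
        return f.read()        # anything file-like works
    except AttributeError:
        # No read() method: fall back to treating f as a filename.
        with open(f) as fin:
            return fin.read()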
Everything else that needed to be said about the actual question, whether it's common/good practice or not, I think has been answered excellently by David's answer.
I agree with some of the above answers, in that I generally never check for type from one function to another.
However, as someone else mentioned, anything accepted from a user should be checked, and for things like this I use regular expressions. The nice thing about using regular expressions to validate user input is that not only can you verify that the data is in the correct format, you can also parse the input into a more convenient form, like turning a string into a dictionary.
I am still struggling with understanding classes. I am not certain, but I have an idea that this function I have created is probably a good candidate for a class. The function takes a list of dictionaries, identifies the keys and writes out a csv file.
First Q: is this function a good candidate for a class? (I write out a lot of csv files.)
Second Q: if the answer to 1 is yes, how do I do it?
Third Q: how do I use instances of the class (did I say that right?)?
import csv
def writeCSV(dictList, outfile):
    maxLine = dictList[0]
    for item in dictList:
        if len(item) > len(maxLine):
            maxLine = item
    dictList.insert(0, dict((key, key) for key in maxLine.keys()))
    csv_file = open(outfile, 'ab')
    writer = csv.DictWriter(csv_file, fieldnames=[key for key in maxLine.keys()],
                            restval='notScanned', dialect='excel')
    for dataLine in dictList:
        writer.writerow(dataLine)
    csv_file.close()
    return
The main idea behind objects is that an object is data plus methods.
Whenever you are thinking about making something an object, you must ask yourself what will be the object's data, and what operations (methods) will you want to perform on that data.
Functions more readily translate to methods than to classes.
So, for instance, if your dictList is data upon which you often call writeCSV, then perhaps make a DictList object with a writeCSV method:
class DictList(object):
    def __init__(self, data):
        self.data = data

    def writeCSV(self, outfile):
        maxLine = self.data[0]
        for item in self.data:
            if len(item) > len(maxLine):
                maxLine = item
        self.data.insert(0, dict((key, key) for key in maxLine.keys()))
        csv_file = open(outfile, 'ab')
        writer = csv.DictWriter(
            csv_file, fieldnames=[key for key in maxLine.keys()],
            restval='notScanned', dialect='excel')
        for dataLine in self.data:
            writer.writerow(dataLine)
        csv_file.close()
Then you could instantiate a DictList object:
dl = DictList([{}, {}, ...])
dl.writeCSV(outfile)
Doing this might make sense if you have more methods that could operate on the same DictList.data. Otherwise, you'd probably be better off sticking with the original function.
For this you first need to understand the basic concepts of classes, then take the next step.
I faced the same problem and followed this LINK; I'm sure you will also move from structured programming to working with classes.
If you want to write a lot of CSV files with the same dictList (is that what you're saying...?), turning the function into a class would let you perform initialization just once, and then write repeatedly from the same initialized instance. E.g., with other minor optimizations:
class CsvWriter(object):
    def __init__(self, dictList):
        self.maxLine = max(dictList, key=len)
        self.dictList = [dict((k, k) for k in self.maxLine)]
        self.dictList.extend(dictList)

    def doWrite(self, outfile):
        csv_file = open(outfile, 'ab')
        writer = csv.DictWriter(csv_file,
                                fieldnames=self.maxLine.keys(),
                                restval='notScanned',
                                dialect='excel')
        for dataLine in self.dictList:
            writer.writerow(dataLine)
        csv_file.close()
This seems a dubious use case, but if it does match your desire, then you'd instantiate and use this class as follows...:
cw = CsvWriter(dictList)
for ou in many_outfiles:
    cw.doWrite(ou)
When thinking about making objects, remember this:
Classes have attributes - things that describe different instances of the class differently
Classes have methods - things that the objects do (often involving using their attributes)
Objects and classes are wonderful, but the first thing to keep in mind is that they are not always necessary, or even desirable.
That said, in answer to your first question, this doesn't seem like a particularly good candidate for a class. The only things that differ between the different CSV files you're writing are the data and the file you write to, and the only thing you do with them (ie, the only method you would have) is the function you've already written.
Even though the first answer is no, it's still instructive to see how a class is built.
class CSVWriter:
    # this function is called when you create an instance of the class
    # it sets up the initial attributes of the instance
    def __init__(self, dictList, outFile):
        self.dictList = dictList
        self.outFile = outFile

    def writeCSV(self):
        # basically exactly what you have above, except you can use the
        # instance's own variables (ie, self.dictList and self.outFile)
        # instead of the local variables
        pass
For your final question - the first step to using an instance of a class (an individual object, if you will) is to create that instance:
myCSV = CSVWriter(dictList, outFile)
When the object is created, __init__ is called with the arguments you gave it - that allows your object to have its own data. Now you can access any of the attributes or methods that your myCSV object has with the '.' operator:
myCSV.writeCSV()
print "Wrote a file to", myCSV.outFile
One way to think about objects versus functions is that objects are generally nouns (eg, I created a CSVWriter), while functions are verbs (eg, you wrote the function that writes CSV files). If you're just doing something over and over again, without re-using any of the same data, a function by itself is fine. But, if you have lots of related data, and part of it gets changed in the course of the action, classes may be a good idea.
I don't think your writeCSV is in need of a class. Typically a class would be used when you have to update some state (data) and then act on it, possibly with various options.
E.g. if you need to pass your object around so that other functions/methods can add values to it, or if your final action/output function has many options, or if you think the same data can be processed and acted upon in many ways.
A typical practical case: if you have multiple functions which act on the same data, or a single function whose optional parameter list is getting too long, you may think of converting it into a class.
If in your case you had various options and needed to insert data incrementally, you should make it a class.
Usually a class name is a noun, so function (verb) writeCSV -> class (noun) CSVWriter:
class CSVWriter(object):
    def __init__(self, init-params...):
        self.data = {}

    def addData(self, data):
        self.data.update(data)

    def dumpCSV(self, filePath):
        ...

    def dumpJSON(self, filePath):
        ....
I think question 1 is pretty crucial as it goes to the heart of what a class is.
Yes, you can put this function in a class. A class is a set of functions (called methods) and data together in one logical unit. As other posters noted, it's probably overkill to have a class with one method.