Writing big Python classes the right way [closed]

When writing a Python class that has different functions for getting data and for parsing it, what is the most correct way?
You can write it so that you populate self.data... one piece at a time and then run parse functions that populate self.parsed_data.... Or is it more correct to write functions that accept the raw data as an argument and return the parsed data?
Examples below.
MyClass1 populates self.variables, and MyClass2 takes them as parameters.
I think MyClass2 is "most" correct.
So, what is correct? And why? I have been trying to decide between these two coding styles for a while, and I want to know which is considered best practice.
class MyClass1(object):
    def __init__(self):
        self.raw_data = None

    def _parse_data(self):
        # This is a fairly complex xml/json parser
        raw_data = self.raw_data
        data = raw_data  # Much work is done here to parse raw_data
        cache.set('cache_key', data, 600)  # Cache for 10 minutes
        return data

    def _populate_data(self):
        # This function grabs data from an external source
        self.raw_data = 'some raw data, xml, json or alike..'

    def get_parsed_data(self):
        cached_data = cache.get('cache_key')
        if cached_data:
            return cached_data
        else:
            self._populate_data()
            return self._parse_data()

mc1 = MyClass1()
print mc1.get_parsed_data()
class MyClass2(object):
    def _parse_data(self, raw_data):
        # This is a fairly complex xml/json parser
        data = raw_data  # After some complicated work of parsing raw_data
        cache.set('cache_key', data, 600)  # Cache for 10 minutes
        return data

    def _get_data(self):
        # This function grabs data from an external source
        return 'some raw data, xml, json or alike..'

    def get_parsed_data(self):
        cached_data = cache.get('cache_key')
        if cached_data:
            return cached_data
        else:
            return self._parse_data(self._get_data())

mc2 = MyClass2()
print mc2.get_parsed_data()

Ultimately it comes down to personal preference. But IMO, it's better to just have a module-level function called parse_data which takes the raw data, does all the work, and returns the parsed data. I assume your cache keys are somehow derived from the raw data, which means the parse_data function can also implement your caching logic.
The reason I prefer a function over a full-blown class is simplicity. If you want a class that exposes data fields pulled from your raw data, so that users of your objects can write obj.some_attr instead of digging into some lower-level data construct (e.g. JSON, XML, a Python dict), make a simple "value object" class which contains only data fields and no parsing logic, and have the aforementioned parse_data function return an instance of this class (essentially acting as a factory function for your data class). This leads to less state, simpler objects and no laziness, making your code easier to reason about.
This would also make it easier to unit test consumers of this class, because in those tests, you can simply instantiate the data object with fields, instead of having to provide a big blob of test raw data.
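A hedged sketch of that design, reusing the question's module-level cache object and assuming a JSON payload (the ParsedData class and its field names are invented for illustration):

import json

class ParsedData(object):
    """Plain value object: only data fields, no parsing logic."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

def parse_data(raw_data):
    """Factory: raw JSON text in, ParsedData out, caching handled inside."""
    cache_key = 'parsed:%d' % hash(raw_data)  # key derived from the raw data
    cached = cache.get(cache_key)             # cache object assumed, as in the question
    if cached:
        return cached
    fields = json.loads(raw_data)             # stand-in for the "complex" parsing step
    parsed = ParsedData(fields['name'], fields['value'])
    cache.set(cache_key, parsed, 600)         # cache for 10 minutes
    return parsed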

For me, the most correct class is the one the user understands and can use with as few errors as possible.
When I look at class 2 I ask myself how would I use it...
mc2 = MyClass2()
print mc2.get_parsed_data()
I would like only
print get_parsed_data()
Sometimes it is better to not write classes at all.
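A minimal sketch of that function-only style, reusing the question's cache object and placeholder data:

def get_parsed_data():
    cached_data = cache.get('cache_key')
    if cached_data:
        return cached_data
    raw_data = 'some raw data, xml, json or alike..'  # fetched from the external source
    data = raw_data                                   # parsing work happens here
    cache.set('cache_key', data, 600)                 # cache for 10 minutes
    return data

print get_parsed_data()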

The second way is preferable because (if I understand correctly) it's identical in efficiency and results, but avoids keeping an instance attribute for the raw data. In general you want to reduce the amount of data stored inside your objects, because each extra attribute means more worrying about consistency over time.
In other words, it's "more functional".

Think about the question this way: if, instead of having two methods, you combined this logic into one long method, would you keep track of the raw data after it is parsed? If the answer is yes, then it makes sense to store it as an attribute. But if you don't care about it any more after that point, prefer the second form. Breaking parts of your logic out into "helper" subroutines should preferably not introduce changes to your class that other methods need to care about.

Related

Better to save method result as class variable or return this variable and use as input in another method?

I don't have any formal training in programming, but I routinely come across this question when I am making classes and running individual methods of a class in sequence. Which is better: saving results as class variables, or returning them and using them as inputs to subsequent method calls? For example, here is a class where the variables are returned and used as inputs:
import pandas as pd

class ProcessData:
    def __init__(self):
        pass

    def get_data(self, path):
        data = pd.read_csv(f"{path}/data.csv")
        return data

    def clean_data(self, data):
        data.set_index("timestamp", inplace=True)
        data.drop_duplicates(inplace=True)
        return data

def main():
    processor = ProcessData()
    temp = processor.get_data("path/to/data")
    processed_data = processor.clean_data(temp)
And here is an example where the results are saved/used to update the class variable:
class ProcessData:
    def __init__(self):
        self.data = None

    def get_data(self, path):
        data = pd.read_csv(f"{path}/data.csv")
        self.data = data

    def clean_data(self):
        self.data.set_index("timestamp", inplace=True)
        self.data.drop_duplicates(inplace=True)

def main():
    processor = ProcessData()
    processor.get_data("path/to/data")
    processor.clean_data()
I have a suspicion that the latter method is better, but I could also see instances where the former might have its advantages. I am sure the answer to my question is "it depends", but I am curious in general, what are the best practices?
Sketch the class based on usage, then create it
Instead of inventing classes to make your high level coding easier, tap your heels together and write the high-level code as if the classes already existed. Then create the classes with the methods and behavior that exactly fits what you need.
PEP AS AN EXAMPLE
If you look at several PEPs, you'll notice that the rationale or motivation is given before the details. The rationale and motivation show how the new Python feature will solve a problem and how it will be used, sometimes with code examples.
Example from PEP 289 – Generator Expressions:
Generator expressions are especially useful with functions like sum(),
min(), and max() that reduce an iterable input to a single value:
max(len(line) for line in file if line.strip())
Generator expressions also address some examples of functionals coded
with lambda:
reduce(lambda s, a: s + a.myattr, data, 0)
reduce(lambda s, a: s + a[3], data, 0)
These simplify to:
sum(a.myattr for a in data)
sum(a[3] for a in data)
My methodology given above is the same as describing the motivation and rationale for a class in terms of its use, because you are writing the code that will actually use it first; see the sketch below.
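A hedged sketch of that workflow applied to the question's ProcessData code (the DataPipeline name and the usage draft are illustrative, not prescriptive):

import pandas as pd

# Step 1: write the high-level code as if the class already existed:
#
#     pipeline = DataPipeline("path/to/data")
#     cleaned = pipeline.run()

# Step 2: create exactly the class that usage demands, nothing more.
class DataPipeline:
    def __init__(self, path):
        self.path = path

    def run(self):
        data = pd.read_csv(f"{self.path}/data.csv")
        data.set_index("timestamp", inplace=True)
        data.drop_duplicates(inplace=True)
        return data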

When composing classes in python, which is preferred mixins or object attributes?

When composing functionality from multiple classes, I have seen both of the following conventions employed: Data1 uses mixins to add methods directly to the child class, while Data2 instantiates attribute objects that do the work.
I think they are equivalent and that the choice is just one of style. Is that right? Or is one preferred over the other?
class Data1(Reader, Processor, Writer):
    def __init__(self):
        Reader.__init__(self)
        Processor.__init__(self)
        Writer.__init__(self)

    def run(self):
        self.read()
        self.process()
        self.write()
or
class Data2:
    def __init__(self):
        self.reader = Reader()
        self.processor = Processor()
        self.writer = Writer()

    def run(self):
        self.reader.read()
        self.processor.process()
        self.writer.write()
To flesh out a specific example, I have a processing pipeline where various data products each need to be read (Reader.read()), processed (Processor.process()), and then written to a db (Writer.write()).
To be even more concrete, consider that I have multiple fitness data types:
- heart-rate data from a csv file that needs to be averaged over 1-second intervals and dumped to a heart-rate table in a db
- running-speed data from a json file that needs to be converted to mi/hr and then formatted as part of an html report
- weather data from a web api that needs to be aggregated over day-long periods and then posted to another web api
For each of these "data products", there is a logical read, process, write pipeline and I'd like to capture that in an abstract class that can then be used as a consistent template for handling future "data products".
In these examples, Reader would be an abstract class whose read() might read a csv file, a json file, or a web API; Processor.process() performs various aggregations; and Writer.write() sends the processed data to various places.
Given that, I am unsure of the best structure.
I would like to avoid a religious war, because you can find technical reasons to use either one, but a rule of thumb is to ask: is A a B, or does A have a B? In the former case you should use inheritance; in the latter, composition.
For example, a coloured square is a figure and has a colour, so it should inherit from figure and contain a colour. There are some hints: if one of the sub-objects has an independent lifecycle (it may exist before being used in the combining object), then it is without doubt a has-a relation. On the other side, if it cannot exist alone (abstract class), then it is without doubt an is-a relation.
But that means that without knowing what you call a Reader, a Writer, and a Processor, and what the data is, I cannot say which model I would use. If Reader and Writer were both subclasses of the same ancestor class with independent members, I would use composition. If they were specially tailored classes sharing members of a common ancestor, that would be more of an is-a relation and I would use inheritance.
The rule is that, when possible, you should respect the real object semantics. After all, deep in code execution it does not really matter whether you used inheritance or composition.
BTW, what is discussed above is the general inheritance-vs-composition question. Strictly speaking, a mixin is a special case, because a mixin should maintain no state but only add methods, and because of that it is often abstract. In Python, mixins are implemented through inheritance, but other languages may have other implementations. In Python, though, they are a typical example of something that is not necessarily an is-a relation but does use inheritance.
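A minimal sketch of that point: a mixin that keeps no state of its own and only contributes a method on top of whatever the host class stores (the names here are invented for the example):

import json

class JSONMixin(object):
    """Stateless mixin: adds behaviour, defines no __init__ and no attributes."""
    def to_json(self):
        return json.dumps(self.__dict__)

class Point(JSONMixin):
    def __init__(self, x, y):
        self.x = x
        self.y = y

print(Point(1, 2).to_json())  # {"x": 1, "y": 2}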
Using superclasses is shorter (if you initialize the bases correctly):
class Data1(Reader, Processor, Writer):
    def __init__(self):
        super().__init__()

    def run(self):
        self.read()
        self.process()
        self.write()
but many people find the composition version easier to work with, since you don't need to go hunting through the inheritance tree to find where a method is implemented/overridden.
Wikipedia has a longer article on the topic which is well worth the read: https://en.wikipedia.org/wiki/Composition_over_inheritance
Addendum: your .run() method might be better implemented as
self.write(self.process(self.read()))
and then it would be easier to just make it a function:
def run(reader, processor, writer):
    return writer.write(processor.process(reader.read()))
or with e.g. logging:
def run(reader, processor, writer):
    for data in reader.read():
        log.debug("Read data: %r", data)
        for output_chunk in processor.process(data):
            log.debug("processed %r and got %r", data, output_chunk)
            writer.write(output_chunk)
            log.debug("wrote %r", output_chunk)
and call it:
run(Reader(), Processor(), Writer())
Assuming the reader yields data, this can be much more efficient, and it is very much easier to write unit tests for.
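For instance, a unit test only needs fakes with the three methods the function actually calls; here is a sketch, using the run() above with the logging lines dropped and fake classes invented for the test:

def run(reader, processor, writer):
    for data in reader.read():
        for output_chunk in processor.process(data):
            writer.write(output_chunk)

class FakeReader(object):
    def read(self):
        yield "raw"

class UpperProcessor(object):
    def process(self, data):
        yield data.upper()

class ListWriter(object):
    def __init__(self):
        self.written = []
    def write(self, chunk):
        self.written.append(chunk)

writer = ListWriter()
run(FakeReader(), UpperProcessor(), writer)
assert writer.written == ["RAW"]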
Finally: you do not need Reader as an abstract base class for csv, json, or web-API reader classes. People coming from Java/C++ tend to conflate classes with types, and subclasses with subtypes. The Python type of the reader parameter in
def run(reader, processor, writer):
is ∀τ ≤ {read: NONE→DATA}, i.e. all subtypes τ of an object type with a .read() that takes NONE (the type of None) and returns a value of (here unspecified) type DATA. E.g. the standard file object has such a type and could be passed in directly, instead of writing a wrapper class FileReader with tons of boilerplate. This, incidentally, is why I consider adding under-powered type languages to Python a very bad thing, but at this point I realize that I'm digressing ;-)
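As an aside, that structural type can nowadays be written down with typing.Protocol (Python 3.8+, so not available when this answer was written); a hedged sketch with invented names:

from typing import Protocol

class Readable(Protocol):
    def read(self) -> str: ...

def slurp(reader: Readable) -> str:
    return reader.read()

# A plain file object already satisfies the type; no FileReader wrapper needed:
with open("data.txt") as f:
    print(slurp(f))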

The most 'Pythonic' way to handle overloading

Disclaimer: this is perhaps a quite subjective question with no 'right' answer but I'd appreciate any feedback on best-practices and program design. So here goes:
I am writing a library where text files are read into Text objects. Now these might be initialized with a list of file-names or directly with a list of Sentence objects. I am wondering what the best / most Pythonic way to do this might be because, if I understand correctly, Python doesn't directly support method overloading.
One example I found in Scikit-Learn's feature extraction module simply passes the type of the input as an argument while initializing the object. I assume that once this parameter is set it's just a matter of handling the different cases internally:
if input == 'filename':
    # glob and read files
    pass
elif input == 'content':
    # do something else
    pass
While this is easy to implement, it doesn't look like a very elegant solution. So I am wondering if there is a better way to handle multiple types of inputs to initialize a class that I am overlooking.
One way is to just create classmethods with different names for the different ways of instantiating the object:
class Text(object):
    def __init__(self, data):
        # handle data in whatever "basic" form you need
        self.data = data

    @classmethod
    def fromFiles(cls, files):
        # process the list of filenames into the form that `__init__` needs
        processed_data = files  # placeholder for the real preprocessing
        return cls(processed_data)

    @classmethod
    def fromSentences(cls, sentences):
        # process the list of Sentence objects into the form that `__init__` needs
        processed_data = sentences  # placeholder for the real preprocessing
        return cls(processed_data)
This way you just create one "real" or "canonical" initialization method that accepts whatever "lowest common denominator" format you want. The specialized fromXXX methods can preprocess different types of input to convert them into the form they need to be in to pass to that canonical instantiation. The idea is that you call Text.fromFiles(...) to make a Text from filenames, or Text.fromSentences(...) to make a Text from sentence objects.
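Usage of the class above might then look like this (the filenames and sentence texts are invented for illustration):

text_a = Text.fromFiles(["ch1.txt", "ch2.txt"])
text_b = Text.fromSentences([Sentence("Hello."), Sentence("World.")])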
It can also be acceptable to do some simple type-checking if you just want to accept one of a few enumerable kinds of input. For instance, it's not uncommon for a class to accept either a filename (as a string) or a file object. In that case you'd do:
def __init__(self, file):
    if isinstance(file, basestring):
        # If a string filename was passed in, open the file before proceeding
        file = open(file)
    # Now you can handle file as a file object
This becomes unwieldy if you have many different types of input to handle, but if it's something relatively contained like this (e.g., an object or the string "name" that can be used to get that object), it can be simpler than the first method I showed.
You can use duck typing: first treat the arguments as if they are of type X; if that raises an exception, assume they are of type Y, and so on:
class Text(object):
    def __init__(self, *init_vals):
        try:
            fileobjs = [open(fname) for fname in init_vals]
        except TypeError:
            # Then we consider them as file objects.
            fileobjs = init_vals
        try:
            sentences = [parse_sentences(fobj) for fobj in fileobjs]
        except TypeError:
            # Then init_vals are Sentence objects.
            sentences = fileobjs
        self.sentences = sentences
Note that the absence of type checking means the method actually accepts any type that implements one of the interfaces you actually use (e.g. file-like objects, Sentence-like objects, etc.).
This method becomes quite heavy if you want to support many different types, but I'd consider that bad code design anyway. Accepting more than two or three types as initializers will probably confuse any programmer who uses your class, since they will always have to think "wait, did X also accept Y, or was it Z that accepted Y...?".
It's probably better to design the constructor to accept only two or three different interfaces, and to provide the user with functions/classes that convert other commonly used types to those interfaces.
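A hedged sketch of that last suggestion, with a narrow constructor plus one converter helper (all names invented for the example):

class Text(object):
    def __init__(self, sentences):
        # One canonical input: an iterable of sentences.
        self.sentences = list(sentences)

def sentences_from_files(filenames):
    """Converter: turns a commonly used input type into the constructor's interface."""
    for fname in filenames:
        with open(fname) as f:
            for line in f:
                yield line.strip()

# text = Text(sentences_from_files(["a.txt", "b.txt"]))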

How to properly document hasattr() use

I see that it's not considered Pythonic to use isinstance(), and that people suggest using hasattr() instead.
I wonder what the best way is to document the proper use of a function that uses hasattr().
Example:
I get stock data from different websites (e.g. Yahoo Finance, Google Finance), and there are classes GoogleFinanceData and YahooFinanceData which both have a method get_stock(date).
There is also a function which compares the value of two stocks:
def compare_stocks(stock1, stock2, date):
    if hasattr(stock1, 'get_stock') and hasattr(stock2, 'get_stock'):
        if stock1.get_stock(date) < stock2.get_stock(date):
            print "stock1 < stock2"
        else:
            print "stock1 > stock2"
The function is used like this:
compare_stocks(GoogleFinanceData('Microsoft'),YahooFinanceData('Apple'),'2012-03-14')
It is NOT used like this:
compare_stocks('Tree',123,'bla')
The question is: How do I let people know which classes they can use for stock1 and stock2? Am I supposed to write a docstring like "stock1 and stock2 ought to have a method get_stock" and people have to look through the source themselves? Or do I put all right classes into one module and reference that file in the docstring?
If all you ever do is call the function with *FinanceData instances, I'd not even bother testing for the get_stock method; it's an error to pass in anything else, and the function should just break if someone passes in strings.
In other words, just document your function as expecting an object with a get_stock() method, and don't test at all. Duck typing is for code that needs to accept distinctly different types of input, not for code that only works with one specific type.
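A sketch of that docstring-only approach, applied to the question's own function:

def compare_stocks(stock1, stock2, date):
    """Compare two stock sources on the given date.

    stock1, stock2: any objects with a get_stock(date) method, e.g.
    GoogleFinanceData or YahooFinanceData instances.
    """
    if stock1.get_stock(date) < stock2.get_stock(date):
        print "stock1 < stock2"
    else:
        print "stock1 > stock2"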
I don't see what's unpythonic about the use of isinstance(); I would create a base class and refer to the base class's documentation.
def compare_stocks(stock1, stock2, date):
    """ Compares stock data of two FinanceData objects at a certain time. """
    if isinstance(stock1, FinanceData) and isinstance(stock2, FinanceData):
        return 'comparison'

class FinanceData(object):
    def get_stock(self, date):
        """ Returns stock data in format XX, expects parameter date in format YY """
        raise NotImplementedError

class GoogleFinanceData(FinanceData):
    def get_stock(self, date):
        """ Implements FinanceData.get_stock() """
        return 'important data'
As you can see, I don't use duck typing here, but since you asked this question in regard to documentation, I think this is the cleaner approach for readability. Whenever another developer sees the compare_stocks function or a get_stock method, he knows exactly where to look for further information regarding functionality, data structures, or implementation details.
Do what you suggest: put in the docstring that the passed arguments should have a get_stock method, since that is all your function requires. Listing classes is a bad idea, because the code may well be used with derived or other classes when it suits someone.

How to create a class from function

I am still struggling with understanding classes. I am not certain, but I have an idea that this function I have created is probably a good candidate for a class. The function takes a list of dictionaries, identifies the keys, and writes out a csv file.
First Q: is this function a good candidate for a class? (I write out a lot of csv files.)
Second Q: if the answer to 1 is yes, how do I do it?
Third Q: how do I use the instances of the class? (Did I say that right?)
import csv

def writeCSV(dictList, outfile):
    maxLine = dictList[0]
    for item in dictList:
        if len(item) > len(maxLine):
            maxLine = item
    dictList.insert(0, dict((key, key) for key in maxLine.keys()))
    csv_file = open(outfile, 'ab')
    writer = csv.DictWriter(csv_file, fieldnames=[key for key in maxLine.keys()],
                            restval='notScanned', dialect='excel')
    for dataLine in dictList:
        writer.writerow(dataLine)
    csv_file.close()
    return
The main idea behind objects is that an object is data plus methods.
Whenever you are thinking about making something an object, ask yourself what the object's data will be, and what operations (methods) you will want to perform on that data.
Functions translate more readily to methods than to classes.
So, for instance, if your dictList is data upon which you often call writeCSV, then perhaps make a DictList object with a writeCSV method:
class DictList(object):
    def __init__(self, data):
        self.data = data

    def writeCSV(self, outfile):
        maxLine = self.data[0]
        for item in self.data:
            if len(item) > len(maxLine):
                maxLine = item
        self.data.insert(0, dict((key, key) for key in maxLine.keys()))
        csv_file = open(outfile, 'ab')
        writer = csv.DictWriter(
            csv_file, fieldnames=[key for key in maxLine.keys()],
            restval='notScanned', dialect='excel')
        for dataLine in self.data:
            writer.writerow(dataLine)
        csv_file.close()
Then you could instantiate a DictList object:
dl = DictList([{}, {}, ...])
dl.writeCSV(outfile)
Doing this might make sense if you have more methods that could operate on the same DictList.data. Otherwise, you'd probably be better off sticking with the original function.
For this you first need to understand the basic concepts of classes, and then take the next step.
I faced the same problem and followed this LINK; I'm sure you will also move from structured programming to working with classes.
If you want to write a lot of CSV files with the same dictList (is that what you're saying?), turning the function into a class would let you perform the initialization just once, and then write repeatedly from the same initialized instance. E.g., with other minor optimizations:
class CsvWriter(object):
    def __init__(self, dictList):
        self.maxLine = max(dictList, key=len)
        self.dictList = [dict((k, k) for k in self.maxLine)]
        self.dictList.extend(dictList)

    def doWrite(self, outfile):
        csv_file = open(outfile, 'ab')
        writer = csv.DictWriter(csv_file,
                                fieldnames=self.maxLine.keys(),
                                restval='notScanned',
                                dialect='excel')
        for dataLine in self.dictList:
            writer.writerow(dataLine)
        csv_file.close()
This seems a dubious use case, but if it does match your desire, then you'd instantiate and use this class as follows:
cw = CsvWriter(dictList)
for ou in many_outfiles:
    cw.doWrite(ou)
When thinking about making objects, remember this:
Classes have attributes - things that describe different instances of the class differently
Classes have methods - things that the objects do (often involving using their attributes)
Objects and classes are wonderful, but the first thing to keep in mind is that they are not always necessary, or even desirable.
That said, in answer to your first question, this doesn't seem like a particularly good candidate for a class. The only things that differ between the different CSV files you're writing are the data and the file you write to, and the only thing you do with them (i.e., the only method you would have) is the function you've already written.
Even though the first answer is no, it's still instructive to see how a class is built.
class CSVWriter:
    # This method is called when you create an instance of the class;
    # it sets up the initial attributes of the instance.
    def __init__(self, dictList, outFile):
        self.dictList = dictList
        self.outFile = outFile

    def writeCSV(self):
        # Basically exactly what you have above, except you can use the
        # instance's own variables (i.e., self.dictList and self.outFile)
        # instead of the local variables.
        pass
For your final question: the first step to using an instance of a class (an individual object, if you will) is to create that instance:
myCSV = CSVWriter(dictList, outFile)
When the object is created, __init__ is called with the arguments you gave it - that allows your object to have its own data. Now you can access any of the attributes or methods of your myCSV object with the '.' operator:
myCSV.writeCSV()
print "Wrote a file to", myCSV.outFile
One way to think about objects versus functions is that objects are generally nouns (e.g., I created a CSVWriter), while functions are verbs (e.g., you wrote the function that writes CSV files). If you're just doing something over and over again, without reusing any of the same data, a function by itself is fine. But if you have lots of related data, and part of it changes in the course of the action, classes may be a good idea.
I don't think your writeCSV needs a class. Typically a class is used when you have to update some state (data) and then act on it, possibly with various options.
E.g., you might need to pass your object around so that other functions/methods can add values to it, your final action/output function might have many options, or the same data might be processed and acted upon in many ways.
A typical practical case: if you have multiple functions which act on the same data, or a single function whose optional parameter list is getting too long, you may think of converting it into a class.
If, in your case, you had various options and needed to insert data in increments, you should make it a class.
Usually a class name is a noun, so the function (verb) writeCSV becomes the class (noun) CSVWriter:
class CSVWriter(object):
    def __init__(self):  # plus whatever init params you need
        self.data = {}

    def addData(self, data):
        self.data.update(data)

    def dumpCSV(self, filePath):
        pass  # write self.data out as CSV

    def dumpJSON(self, filePath):
        pass  # write self.data out as JSON
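Usage of that incremental sketch might then look like this (the data and the output path are invented for illustration):

w = CSVWriter()
w.addData({'name': 'a'})
w.addData({'value': 1})
w.dumpCSV('out.csv')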
I think question 1 is pretty crucial, as it goes to the heart of what a class is.
Yes, you can put this function in a class. A class is a set of functions (called methods) and data together in one logical unit. As other posters noted, though, it's probably overkill to have a class with a single method.
