Best way to initialize variable in a module? - python

Let's say I need to write incoming data into a dataset on the cloud.
When, where and if I will need the dataset in my code, depends on the data coming in.
I only want to get a reference to the dataset once.
What is the best way to achieve this?
Initialize as global variable at start and access through global variable
if __name__ == "__main__":
    dataset = ...  # get dataset from internet
This seems like the simplest way, but initializes the variable even if it is never needed.
Get reference first time the dataset is needed, save in global variable, and access with get_dataset() method
dataset = None
def get_dataset():
    global dataset
    if dataset is None:
        dataset = ...  # get dataset from internet
    return dataset
Get reference first time the dataset is needed, save as function attribute, and access with get_dataset() method
def get_dataset():
    if not hasattr(get_dataset, 'dataset'):
        get_dataset.dataset = ...  # get dataset from internet
    return get_dataset.dataset
Any other way

The typical way to do what you want is to wrap your service calling for the data into a class:
class MyService():
    dataset = None
    def get_data(self):
        if self.dataset is None:
            self.dataset = get_my_data()
        return self.dataset
Then you instantiate it once in your main and use it wherever you need it.
if __name__ == "__main__":
    data_service = MyService()
    data = data_service.get_data()
    # or pass the service to whoever needs it
    my_function_that_uses_data(data_service)
The dataset variable is internal but accessible through a discoverable function. You could also use a property on the instance of the class.
Also, using objects and classes makes it much more clear in a large project, as the functionality should be self-explanatory from the classname and methods.
Note that you can easily make this a generic service too, passing it the way to fetch data in the initialization (like a url?), so it can be re-used with different endpoints.
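As a rough sketch of the generic, property-based variant described above: the class name LazyService, the fetch argument and the fake_fetch stand-in are all made up for illustration, and the real fetch would be whatever call reaches your cloud dataset.

```python
from typing import Any, Callable, Optional


class LazyService:
    """Fetches a value on first access and reuses it afterwards."""

    def __init__(self, fetch: Callable[[], Any]) -> None:
        self._fetch = fetch  # hypothetical: any zero-argument callable
        self._dataset: Optional[Any] = None
        self._loaded = False  # separate flag, so a fetched None is cached too

    @property
    def dataset(self) -> Any:
        if not self._loaded:
            self._dataset = self._fetch()
            self._loaded = True
        return self._dataset


calls = []


def fake_fetch():
    # Stand-in for the real network call; records each invocation.
    calls.append(1)
    return {"rows": [1, 2, 3]}


service = LazyService(fake_fetch)
first = service.dataset   # triggers the fetch
second = service.dataset  # served from the stored value
```

Creating the service costs nothing; the fetch only happens when some piece of code actually reads service.dataset.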
One caveat to avoid is to instantiate the same class multiple times, in your submodules, as opposed to the main. If you did, the data would be fetched and stored for each instance. On the other hand, you can pass the instance of the class to a sub-module and only fetch the data when it's needed (i.e., it may never be fetched if your submodule never needs it), while with all your options, the dataset needs to be fetched first to be passed somewhere else.
Note about your proposed options:
Initializing in the if __name__ == '__main__' section:
It is not initialized globally if the module is imported (it is only initialized when the module is run as a script, e.g. from the shell).
You need to fetch the data to pass it somewhere else, even if you don't need it in main.
Set a global within a function.
The use of global is generally discouraged, as it is in any programming language. Modifying variables out of scope is a recipe for encountering odd behaviors. It also tends to make the code harder to test if you rely on this global which is only set in a specific workflow.
Attribute on a function
This one is a bit of an eye-sore: it would certainly work, and the functionality is very similar to the Class pattern I propose, but you have to admit attributes on functions is not very pythonic. The advantage of the Class is that you can initialize it in many ways, can subclass it etc, and yet not fetch the data until you need it. Using a straight function is 'simpler' but much more limited.

You can also use the lru_cache decorator from the functools module for achieving the goal of running an expensive operation only once.
As long as the parameters are the same, calling the function again and again returns the same object.
https://docs.python.org/3/library/functools.html#functools.lru_cache
from functools import lru_cache

@lru_cache(maxsize=None)
def fun(input1, input2):
    ...  # expensive operation
    return result
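A small runnable illustration of that behaviour, assuming a made-up get_dataset function and an in-memory dictionary in place of the real download. Note that lru_cache needs the arguments to be hashable.

```python
from functools import lru_cache

fetch_log = []  # records every real (non-cached) call


@lru_cache(maxsize=None)
def get_dataset(name):
    # Placeholder for the expensive operation.
    fetch_log.append(name)
    return {"name": name}


a = get_dataset("sales")
b = get_dataset("sales")    # same argument: cached object is returned
c = get_dataset("returns")  # different argument: runs the function again
```

Here a and b are the very same object, and the function body ran only once per distinct argument.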

Similar to MrE's answer, it is best to encapsulate the data with a wrapper.
However, I would recommend using a Python closure instead of a class.
A class should be used to encapsulate data and relevant functions that are closely related to the data. A class should be something that you will instantiate objects of and objects will retain individuality. You can read more about this here
You can use closures in the following way
def get_dataset_wrapper():
    dataset = None
    def get_dataset():
        nonlocal dataset
        if dataset is None:
            dataset = ...  # get dataset from internet
        return dataset
    return get_dataset
You can use this in the following way
dataset = get_dataset_wrapper()()
If the ()() syntax bothers you, you can do this:
def wrapper():
    return get_dataset_wrapper()()
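One thing to watch with the closure: every call to get_dataset_wrapper() builds a new closure with its own empty dataset, so the caching only helps if you keep the inner function around and call that repeatedly. A minimal sketch, where the fetch parameter and fake_fetch are assumptions added so the example can run without a network:

```python
def get_dataset_wrapper(fetch):
    dataset = None

    def get_dataset():
        nonlocal dataset
        if dataset is None:
            dataset = fetch()  # runs only while nothing is cached
        return dataset

    return get_dataset


fetch_log = []


def fake_fetch():
    # Stand-in for "get dataset from internet".
    fetch_log.append("fetched")
    return [1, 2, 3]


get_dataset = get_dataset_wrapper(fake_fetch)  # build the closure once
first = get_dataset()
second = get_dataset()  # reuses the value captured by the closure
```

Calling get_dataset_wrapper(fake_fetch)() again at this point would create a second closure and fetch a second time.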

Related

use of attributes in python

This is kind of a high level question. I'm not sure what you'd do with code like this:
class Object(object):
    pass
obj = Object
obj.a = lambda: None
obj.d = lambda: dict
setattr(obj.d, 'dictionary', {4,3,5})
setattr(obj.a, 'somefield', 'somevalue')
If I'm going to call obj.a.somefield, why would I use print? It feels redundant.
I simply can't see what programming strictly with setting attributes would be good for?
I could write an entire program with all of my variables in object classes.
First about your print question. Print is used more for debugging or for attributes that are an output from an object that gives you information when you create it.
For example, there might be an object that you create by passing it data and it finds all of the basic statistics information of that data. You could have it return a dictionary via a method and access the values from there or you could simply access it via an attribute, making the data more readable.
For your second part of your question about why you would want to use attributes in general, they're more for internally passing information from function to function in an object or for configuring an object. Python has different scopes that determine which information each function can access. All methods of an object can access that object's attributes, which allows you to avoid using external or global variables. That makes your object nice and self contained. Global variables are generally avoided at all costs, because they can get messy, so they are considered bad practice.
Taking that a step further, using setattr is a more sophisticated way of setting these attributes to make your code more readable. You could use a function to modify aspects of an object or you could "hide" the complexity inside your setattr so the user can use a higher level interface rather than getting bogged down in the specifics.

Counting used attributes at runtime

I'm working on a python project that requires me to compile certain attributes of some objects into a dataset. The code I'm currently using is something like the following:
class VectorBuilder(object):
    SIZE = 5
    def __init__(self, player, frame_data):
        self.player = player
        self.fd = frame_data
    def build(self):
        self._vector = []
        self._add(self.player)
        self._add(self.fd.getSomeData())
        self._add(self.fd.getSomeOtherData())
        char = self.fd.getCharacter()
        self._add(char.getCharacterData())
        self._add(char.getMoreCharacterData())
        assert len(self._vector) == self.SIZE
        return self._vector
    def _add(self, element):
        self._vector.append(element)
However, this code is slightly unclean because adding/removing attributes to/from the dataset also requires correctly adjusting the SIZE variable. The reason I even have the SIZE variable is that the size of the dataset needs to be known at runtime before the dataset itself is created.
I've thought of instead keeping a list of all the functions used to construct the dataset as strings (as in attributes = ['getPlayer', 'fd.getSomeData', ...]) and then defining the build function as something like:
def build(self):
    self._vector = []
    for att in attributes:
        self._vector.append(getattr(self, att)())
    return self._vector
This would let me access the size as simply len(attributes) and I only ever need to edit attributes, but I don't know how to make this approach work with the chained function calls, such as self.fd.getCharacter().getCharacterData().
Is there a cleaner way to accomplish what I'm trying to do?
EDIT:
Some additional information and clarification is necessary.
I was using __ due to some bad advice I read online (essentially saying I should use _ for module-private members and __ for class-private members). I've edited them to _ attributes now.
The getters are a part of the framework I'm using.
The vector is stored as a private class member so I don't have to pass it around the construction methods, which are in actuality more numerous than the simple _add, doing some other stuff like normalisation and bool->int conversion on the elements before adding them to the vector.
SIZE as it currently stands, is a true constant. It is only ever given a value in the first line of VectorBuilder and never changed at runtime. I realise that I did not clarify this properly in the main post, but new attributes never get added at runtime. The adjustment I was talking about would take place at programming time. For example, if I wanted to add a new attribute, I would need to add it in the build function, e.g.:
self._add(self.fd.getCharacter().getAction().getActionData().getSpeed())
, as well as change the SIZE definition to SIZE = 6.
The attributes are compiled into what is currently a simple python list (but will probably be replaced with a numpy array), then passed into a neural network as an input vector. However, the neural network itself needs to be built first, and this happens before any data is made available (i.e. before any input vectors are created). In order to be built successfully, the neural network needs to know the size of the input vectors it will be receiving, though. This is why SIZE is necessary and also the reason for the assert statement - to ascertain that the vectors I'm passing to the network are in fact the size I claimed I would be passing to it.
I'm aware the code is unpythonic, that is why I'm here - the code works, it's just ugly.
Instead of providing the strings of the attributes as a list you would like to create the input arguments from, why don't you initialize the build function with a list containing all the values returned by your getter functions?
Your SIZE variable would then still be the length of the dynamic argument list provided in build(self,*args) for example.
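Another possibility, not taken from the answer above, is to keep a list of callables instead of strings, so chained getter calls are just ordinary code and the size is available before any vector is built. This is only a sketch: the EXTRACTORS name is invented, and FakeFrameData and FakeCharacter are stand-ins for the framework objects in the question.

```python
class VectorBuilder:
    # One callable per vector element; each receives the builder itself.
    EXTRACTORS = [
        lambda b: b.player,
        lambda b: b.fd.getSomeData(),
        lambda b: b.fd.getSomeOtherData(),
        lambda b: b.fd.getCharacter().getCharacterData(),
        lambda b: b.fd.getCharacter().getMoreCharacterData(),
    ]
    SIZE = len(EXTRACTORS)  # known at class creation, before any data exists

    def __init__(self, player, frame_data):
        self.player = player
        self.fd = frame_data

    def build(self):
        return [extract(self) for extract in self.EXTRACTORS]


class FakeCharacter:
    def getCharacterData(self):
        return 3

    def getMoreCharacterData(self):
        return 4


class FakeFrameData:
    def getSomeData(self):
        return 1

    def getSomeOtherData(self):
        return 2

    def getCharacter(self):
        return FakeCharacter()


vector = VectorBuilder(0, FakeFrameData()).build()
```

Adding or removing an attribute then means editing the list only, and SIZE follows automatically.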

Avoiding global variables but also too many function arguments (Python)

Let's say I have a python module that has a lot of functions that rely on each other, processing each others results. There's lots of cohesion.
That means I'll be passing back and forth a lot of arguments. Either that, or I'd be using global variables.
What are best practices to deal with such a situation? Things that come to mind would be replacing those parameters with dictionaries. But I don't necessarily like how that changes the function signature to something less expressive. Or I can wrap everything into a class. But that feels like I'm cheating and using "pseudo"-global variables?
I'm asking specifically for how to deal with this in Python but I understand that many of those things would apply to other languages as well.
I don't have a specific code example right now, it's just something that came to mind when I was thinking about this issue.
Examples could be: You have a function that calculates something. In the process, a lot of auxiliary stuff is calculated. Your processing routines need access to this auxiliary stuff, and you don't want to just re-compute it.
This is a very generic question so it is hard to be specific. What you seem to be describing is a bunch of inter-related functions that share data. That pattern is usually implemented as an Object.
Instead of a bunch of functions, create a class with a lot of methods. For the common data, use attributes. Set the attributes, then call the methods. The methods can refer to the attributes without them being explicitly passed as parameters.
As RobertB said, an object seems the clearest way. Could be as simple as:
import math

class myInfo:
    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y
        self.messWithDist()  # sets self.dist
    def messWithDist(self):
        self.dist = math.sqrt(self.x*self.x + self.y*self.y)
blob = myInfo(3,4)
blob.messWithDist()
print(blob.dist)
blob.x = 5
blob.y = 7
blob.messWithDist()
print(blob.dist)
If some of the functions shouldn't really be part of such an object, you can just define them as (non-member, independent) functions, and pass the blob as one parameter. For example, by un-indenting the def of messWithDist, then calling as messWithDist(blob) instead of blob.messWithDist().

Best practice when defining instance variables

I'm fairly new to Python and have a question regarding the following class:
class Configuration:
    def __init__(self):
        parser = SafeConfigParser()
        try:
            # read() returns the list of files it managed to parse, never None
            if not parser.read(CONFIG_FILE):
                raise IOError('Cannot open configuration file')
        except IOError as error:
            sys.exit(error)
        else:
            self.__parser = parser
            self.fileName = CONFIG_FILE
    def get_section(self):
        p = self.__parser
        result = []
        for s in p.sections():
            result.append('{0}'.format(s))
        return result
    def get_info(self, config_section):
        p = self.__parser
        self.section = config_section
        self.url = p.get(config_section, 'url')
        self.imgexpr = p.get(config_section, 'imgexpr')
        self.imgattr1 = p.get(config_section, 'imgattr1')
        self.imgattr2 = p.get(config_section, 'imgattr2')
        self.destination = p.get(config_section, 'destination')
        self.createzip = p.get(config_section, 'createzip')
        self.pagesnumber = p.get(config_section, 'pagesnumber')
Is it OK to add more instance variables in another function, get_info in this example, or is it best practice to define all instance variables in the constructor? Couldn't it lead to spaghetti code if I define new instance variables all over the place?
EDIT: I'm using this code with a simple image scraper. Via get_section I return all sections in the config file, and then iterate through them to visit each site that I'm scraping images from. For each iteration I make a call to get_info to get the configuration settings for that section in the config file.
If anyone can come up with another approach it'll be fine! Thanks!
I would definitely declare all instance variables in __init__. To not do so leads to increased complexity and potential unexpected side effects.
To provide an alternate point of view from David Hall in terms of access, this is from the Google Python style guide.
Access Control:
If an accessor function would be trivial you should use public
variables instead of accessor functions to avoid the extra cost of
function calls in Python. When more functionality is added you can use
property to keep the syntax consistent
On the other hand, if access is more complex, or the cost of accessing
the variable is significant, you should use function calls (following
the Naming guidelines) such as get_foo() and set_foo(). If the past
behavior allowed access through a property, do not bind the new
accessor functions to the property. Any code still attempting to
access the variable by the old method should break visibly so they are
made aware of the change in complexity.
From PEP8
For simple public data attributes, it is best to expose just the
attribute name, without complicated accessor/mutator methods. Keep in
mind that Python provides an easy path to future enhancement, should
you find that a simple data attribute needs to grow functional
behavior. In that case, use properties to hide functional
implementation behind simple data attribute access syntax.
Note 1: Properties only work on new-style classes.
Note 2: Try to keep the functional behavior side-effect free, although
side-effects such as caching are generally fine.
Note 3: Avoid using properties for computationally expensive
operations; the attribute notation makes the caller believe that
access is (relatively) cheap.
Python isn't java/C#, and it has very strong ideas about how code should look and be written. If you are coding in python, it makes sense to make it look and feel like python. Other people will be able to understand your code more easily and you'll be able to understand other python code better as well.
I would favour setting all the instance variables in the constructor over having functions like get_info() that are required to put the class in a valid state.
With public instance variables that are only instantiated by calls to methods such as your get_info() you create a class that is a bit of a minefield to use.
If you are worried about having certain configuration values which are not always needed and are expensive to calculate (which I guess is why you have get_info(), allowing for deferred execution), then I'd consider either refactoring that subset of config into a second class or introducing properties or functions that return values.
With properties or get-style functions you encourage consumers of the class to go through a defined interface and improve the encapsulation [1].
Once you have that encapsulation of the instance variables you give yourself the option to do something more than simply throw an AttributeError exception - you can perhaps call get_info() yourself, or throw a custom exception.
[1] You can't provide 100% encapsulation with Python, since private instance variables denoted by a leading double underscore are private only by convention (the name is mangled, but still reachable).
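To make the property idea concrete, here is a minimal sketch using functools.cached_property (Python 3.8+). The SectionConfig class, the settings dictionary and the lookups counter are invented for the example and are not part of the asker's code.

```python
from functools import cached_property


class SectionConfig:
    def __init__(self, settings):
        # Every plain instance attribute is defined here, in one place.
        self.settings = settings
        self.lookups = 0  # only here to show how often the work is done

    @cached_property
    def pages_number(self):
        # Computed on first access, then stored on the instance.
        self.lookups += 1
        return int(self.settings["pagesnumber"])


config = SectionConfig({"pagesnumber": "12"})
pages = config.pages_number
pages_again = config.pages_number  # no second computation
```

The object is valid as soon as it is constructed, and the deferred value is still only computed if somebody asks for it. For genuinely expensive work, the PEP 8 note quoted above suggests an explicit method instead, so the cost is visible to the caller.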

How to create a class from function

I am still struggling with understanding classes, I am not certain but I have an idea that this function I have created is probably a good candidate for a class. The function takes a list of dictionaries, identifies the keys and writes out a csv file.
First Q: is this function a good candidate for a class? (I write out a lot of csv files.)
Second Q: if the answer to 1 is yes, how do I do it?
Third Q: how do I use the instances of the class (did I say that right)?
import csv
def writeCSV(dictList,outfile):
    maxLine=dictList[0]
    for item in dictList:
        if len(item)>len(maxLine):
            maxLine=item
    dictList.insert(0,dict( (key,key) for key in maxLine.keys()))
    csv_file=open(outfile,'ab')
    writer = csv.DictWriter(csv_file,fieldnames=[key for key in maxLine.keys()],restval='notScanned',dialect='excel')
    for dataLine in dictList:
        writer.writerow(dataLine)
    csv_file.close()
    return
The main idea behind objects is that an object is data plus methods.
Whenever you are thinking about making something an object, you must ask yourself what will be the object's data, and what operations (methods) will you want to perform on that data.
Functions translate more readily to methods than to classes.
So, for instance, if your dictList is data upon which you often call writeCSV,
then perhaps make a dictList object with method writeCSV:
class DictList(object):
    def __init__(self,data):
        self.data=data
    def writeCSV(self,outfile):
        maxLine=self.data[0]
        for item in self.data:
            if len(item)>len(maxLine):
                maxLine=item
        self.data.insert(0,dict( (key,key) for key in maxLine.keys()))
        csv_file=open(outfile,'ab')
        writer = csv.DictWriter(
            csv_file,fieldnames=[key for key in maxLine.keys()],
            restval='notScanned',dialect='excel')
        for dataLine in self.data:
            writer.writerow(dataLine)
        csv_file.close()
Then you could instantiate a DictList object:
dl=DictList([{},{},...])
dl.writeCSV(outfile)
Doing this might make sense if you have more methods that could operate on the same DictList.data. Otherwise, you'd probably be better off sticking with the original function.
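If you do stay with a plain function, a Python 3 version might look roughly like the sketch below. It is an assumption-laden rewrite, not the original code: it takes an already-open text file instead of a path, uses writeheader() rather than inserting a header row into the data, and the sample rows are made up.

```python
import csv
import io


def write_csv(dict_list, csv_file):
    """Write a list of dicts to an open text file, one row per dict."""
    # Like the original, take the header from the dict with the most keys.
    fieldnames = list(max(dict_list, key=len).keys())
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames,
                            restval='notScanned', dialect='excel')
    writer.writeheader()  # the caller's list is left untouched
    writer.writerows(dict_list)


rows = [{"host": "a"}, {"host": "b", "port": 80}]
buffer = io.StringIO()  # in-memory file, so the example needs no disk access
write_csv(rows, buffer)
output = buffer.getvalue()
```

With a real file you would open it as open(outfile, 'a', newline='') so the csv module controls the line endings.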
For this you first need to understand a little about the concepts behind classes, and then follow the next step.
I faced the same problem and followed this LINK; I'm sure you will also be able to move from structured programming to working with classes.
If you want to write a lot of CSV files with the same dictList (is that what you're saying...?), turning the function into a class would let you perform initialization just once, and then write repeatedly from the same initialized instance. E.g., with other minor opts:
class CsvWriter(object):
    def __init__(self, dictList):
        self.maxLine = max(dictList, key=len)
        self.dictList = [dict((k,k) for k in self.maxLine)]
        self.dictList.extend(dictList)
    def doWrite(self, outfile):
        csv_file=open(outfile,'ab')
        writer = csv.DictWriter(csv_file,
                                fieldnames=self.maxLine.keys(),
                                restval='notScanned',
                                dialect='excel')
        for dataLine in self.dictList:
            writer.writerow(dataLine)
        csv_file.close()
This seems a dubious use case, but if it does match your desire, then you'd instantiate and use this class as follows...:
cw = CsvWriter(dataList)
for ou in many_outfiles:
    cw.doWrite(ou)
When thinking about making objects, remember this:
Classes have attributes - things that describe different instances of the class differently
Classes have methods - things that the objects do (often involving using their attributes)
Objects and classes are wonderful, but the first thing to keep in mind is that they are not always necessary, or even desirable.
That said, in answer to your first question, this doesn't seem like a particularly good candidate for a class. The only things that differ between the CSV files you're writing are the data and the file you write to, and the only thing you do with them (ie, the only method you would have) is the function you've already written.
Even though the first answer is no, it's still instructive to see how a class is built.
class CSVWriter:
    # this function is called when you create an instance of the class
    # it sets up the initial attributes of the instance
    def __init__(self, dictList, outFile):
        self.dictList = dictList
        self.outFile = outFile
    def writeCSV(self):
        # basically exactly what you have above, except you can use the instance's
        # own variables (ie, self.dictList and self.outFile) instead of the local
        # variables
        pass
For your final question - the first step to using an instance of a class (an individual object, if you will) is to create that instance:
myCSV = CSVWriter(dictList, outFile)
When the object is created, __init__ is called with the arguments you gave it - that allows your object to have its own data. Now you can access any of the attributes or methods that your myCSV object has with the '.' operator:
myCSV.writeCSV()
print("Wrote a file to", myCSV.outFile)
One way to think about objects versus functions is that objects are generally nouns (eg, I created a CSVWriter), while functions are verbs (eg, you wrote the function that writes CSV files). If you're just doing something over and over again, without re-using any of the same data, a function by itself is fine. But, if you have lots of related data, and part of it gets changed in the course of the action, classes may be a good idea.
I don't think your writeCSV is in need of a class. Typically a class would be used when you have to update some state (data) and then act on it, maybe with various options.
E.g. if you need to pass around your object so that other functions/methods can add values to it, or your final action/output function has many options, or you think the same data can be processed and acted upon in many ways.
A typical practical case would be multiple functions which act on the same data, or a single function whose optional parameter list is growing too long; then you may think of converting it into a class.
If in your case you had various options and needed to insert data in increments, you should make it a class.
Usually class name would be noun, so function(verb) writeCSV -> class(noun) CSVWriter
class CSVWriter(object):
    def __init__(self, *init_params):
        self.data = {}
    def addData(self, data):
        self.data.update(data)
    def dumpCSV(self, filePath):
        ...
    def dumpJSON(self, filePath):
        ...
I think question 1 is pretty crucial as it goes to the heart of what a class is.
Yes, you can put this function in a class. A class is a set of functions (called methods) and data together in one logical unit. As other posters noted, it is probably overkill to have a class with one method.
