python / pandas - MultiIndexing - eliminate the use of global variables

I am using pandas to import a dataframe from Excel in order to sort it, make changes and run some simple addition and division on the data.
My code is working, but it has global variables throughout. I think this is poor practice and I want to eliminate these global variables, but I am confused about how to go about doing this.
I'm not sure how I can further modify my dataframe with indexing and slicing without declaring global variables.
import pandas as pd

mydf = pd.read_excel('data.xlsx')
new_indexes = mydf.set_index(['apple', 'cherry', 'banana'])
new_indexes['apples and cherries'] = new_indexes['apple'] + new_indexes['cherries']
sliced = new_indexes.loc(axis=0)[pd.IndexSlice[:, 'fruits']]
total_fruits = sliced.loc[:, ['grapes', 'watermelon', 'orange']].sum(axis=1)
That's a snippet of my code. As you can see I am referring to the global variables in order to further modify my dataframe. I need to eliminate the global variables. I am trying to create functions to help clean up my code.
My main question is how can I refer to my data and changes without assigning global variables to my code?
If I wanted to go about defining a class and reassigning the variables to properties, would I be able to do something like this?
class MyDf:
    def __init__(self):
        # read the data once and keep it on the instance
        self._df = pd.read_excel('data.xlsx')

    def get_df(self):
        return self._df

    def set_index(self):
        self._multi_index = self._df.set_index(['apple', 'cherry', 'banana'])

    def add_totals(self):
        self._multi_index['apples and cherries'] = (
            self._multi_index['apple'] + self._multi_index['cherries'])
Thank you

There are several things you could do, depending on the overall structure of your code and your goal. Without knowing more about your case and, for example, seeing how the snippet you provided is embedded in the rest of your code, these are only possible solutions.
You could define a function, make it take a dataframe as an argument, perform operations on it and then return the modified dataframe. The function could also simply take a filename as argument, so that the respective df is created within the function to begin with. If you do not need to refer to intermediary variables such as new_indexes or sliced later in the code, using a function to perform the operations might be a good way to go.
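For example, here is a minimal sketch of the function-based approach, reusing the file name and column names from your snippet as placeholders (adjust them to your real data):

import pandas as pd

def load_and_summarise(filename):
    # every intermediate result is a local variable, not a global
    df = pd.read_excel(filename)
    df['apples and cherries'] = df['apple'] + df['cherries']   # placeholder column names
    indexed = df.set_index(['apple', 'cherry', 'banana'])
    return indexed[['grapes', 'watermelon', 'orange']].sum(axis=1)

total_fruits = load_and_summarise('data.xlsx')

The intermediate dataframes disappear when the function returns, so nothing leaks into the global namespace.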
You could also define a class, make the variables into properties of objects of that class and write methods to perform the respective operations you want to do. This would have the advantage that you could still access your variables, if necessary.
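A rough sketch of the class-based variant, again with the names from your snippet as placeholders:

import pandas as pd

class FruitData:
    def __init__(self, filename='data.xlsx'):
        self.df = pd.read_excel(filename)   # raw data kept as an attribute
        self.indexed = None                 # intermediate results stay reachable

    def set_index(self):
        self.indexed = self.df.set_index(['apple', 'cherry', 'banana'])
        return self.indexed

data = FruitData()
data.set_index()
print(data.indexed)   # still accessible later, but not a global variable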

Related

Do calculations take place when I assign an expression to a variable or when that variable is used?

Let's say I have a pandas dataframe of 1,000,000 rows which looks like this:
A          B
10         20
Text       Word
...
Text_1m    Word_1m
I have declared 100 variables like so:
a = df[(df['A'] > 50) & (df['B'] == 'Word_1231')]
I later use these variables in the Slack app as /commands which simply return the value of the variables.
My question is: do those 100 variables and their values get calculated at the moment of declaration, or only when I actually use them, for example when I write print(a)?
I ask because of the high number of variables and the large dataframe. If they get calculated at the point of declaration, what other methods could I use to only calculate the value of a variable when I actually use it, for example with print()? Put everything in functions?
UPDATE
Does the variable a only get calculated when I call the function, or when I declare it?
def calculate():
    a = df[(df['A'] > 50) & (df['B'] == 'Word_1231')]
    return a
Thank you for your suggestions.
Hard to say, because pandas can either use copies or views. The choice is an implementation detail that can depend on various parameters, and the only important thing is that the user should be prepared for both.
The problem is that if it is a view, only references to the underlying data are stored at the assignment point, while if it is a copy, the full extraction is done at the assignment point.
But the whole point is that using 100 distinct variables to do almost the same thing is an anti-pattern. The correct way is to write loops that (re-)assign a variable with the proper sub-dataframe and then process it. If you really need all the sub-dataframes at the same time, for any reason, then you should store them in a container like a list or dict.
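For example, a rough sketch of the dict approach, assuming df is the dataframe from the question and the filter words live in some list (the names here are made up for illustration):

# build the sub-dataframes keyed by the word they filter on
words = ['Word_1231', 'Word_4567']   # hypothetical list of filter values

subsets = {
    word: df[(df['A'] > 50) & (df['B'] == word)]
    for word in words
}

# later, e.g. when a Slack command comes in:
print(subsets['Word_1231'])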

How do I make a dataframe global and use it in a function?

I want to print a column of a dataframe in a function. It says name 'data' is not defined. How do I make it global?
My function is:
def min_function():
    print("Choose action to be performed on the data using the specified metric. Options are list, correlation")
    action = input()
    if action == "list":
        print("Ranked list of countries' happiness scores based on the min metric")
        print(data['country'], data.sort_values(['min_value'], ascending=False))
data is my dataframe.
Don't :-)
Import it and pass it as a parameter to your function. This way you will be able to use other dataframes as well. If there really is, ever will be, and ever should be only one dataframe of data, use the singleton pattern (https://en.wikipedia.org/wiki/Singleton_pattern) and still import it.
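A minimal sketch of the parameter version (the file name and the shortened prompt are hypothetical):

import pandas as pd

def min_function(data):
    # 'data' is an explicit argument instead of a global name
    action = input("Choose action (list, correlation): ")
    if action == "list":
        print(data['country'], data.sort_values(['min_value'], ascending=False))

data = pd.read_csv('happiness.csv')   # hypothetical file with 'country' and 'min_value' columns
min_function(data)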
While it is true that you should think hard before introducing global variables, you can do it in two ways:
Either just define the variable at module level in the file where you want to access it and then use it, or use the global keyword in a function before assigning the variable.
Note that with both approaches, you can only access the variable within your file (unless you import it from elsewhere).
If that is not enough, you would need to define the variable as a built-in. But seriously, don't do it.
Edit:
Note that, just as with local variables, you obviously need to make sure the variable is initialised before you use it!
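A small sketch of the global-keyword variant, using the dataframe name from the question (the file name is hypothetical):

import pandas as pd

data = None   # module-level name, shared within this file

def load_data():
    global data                           # rebind the module-level 'data'
    data = pd.read_csv('happiness.csv')   # hypothetical file name

def min_function():
    print(data['country'])                # reads the module-level 'data'

load_data()
min_function()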

Best way to initialize variable in a module?

Let's say I need to write incoming data into a dataset on the cloud.
When, where and if I will need the dataset in my code, depends on the data coming in.
I only want to get a reference to the dataset once.
What is the best way to achieve this?
Initialize as global variable at start and access through global variable
if __name__ == "__main__":
    dataset = ...  # get dataset from internet
This seems like the simplest way, but initializes the variable even if it is never needed.
Get reference first time the dataset is needed, save in global variable, and access with get_dataset() method
dataset = None

def get_dataset():
    global dataset
    if dataset is None:
        dataset = ...  # get dataset from internet
    return dataset
Get reference first time the dataset is needed, save as function attribute, and access with get_dataset() method
def get_dataset():
    if not hasattr(get_dataset, 'dataset'):
        get_dataset.dataset = ...  # get dataset from internet
    return get_dataset.dataset
Any other way?
The typical way to do what you want is to wrap the service call for the data in a class:
class MyService():
    dataset = None

    def get_data(self):
        if self.dataset is None:
            self.dataset = get_my_data()  # placeholder for the actual fetch
        return self.dataset
Then you instantiate it once in your main and use it wherever you need it.
if __name__ == "__main__":
    data_service = MyService()
    data = data_service.get_data()
    # or pass the service to whoever needs it
    my_function_that_uses_data(data_service)
The dataset variable is internal but accessible through a discoverable function. You could also use a property on the instance of the class.
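For instance, a sketch of the same lazy lookup exposed as a property (get_my_data is still the placeholder fetch used above):

def get_my_data():
    ...  # placeholder for the actual fetch

class MyService:
    def __init__(self):
        self._dataset = None

    @property
    def dataset(self):
        # fetched on first access, cached afterwards
        if self._dataset is None:
            self._dataset = get_my_data()
        return self._dataset

service = MyService()
first = service.dataset    # triggers the fetch
again = service.dataset    # returns the cached object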
Also, using objects and classes makes things much clearer in a large project, as the functionality should be self-explanatory from the class name and methods.
Note that you can easily make this a generic service too, passing it the way to fetch data in the initialization (like a URL?), so it can be re-used with different endpoints.
One caveat: avoid instantiating the same class multiple times in your submodules, as opposed to once in the main. If you did, the data would be fetched and stored for each instance. On the other hand, you can pass the instance of the class to a submodule and only fetch the data when it is needed (i.e., it may never be fetched if your submodule never needs it), whereas with all your proposed options, the dataset needs to be fetched first in order to be passed somewhere else.
Note about your proposed options:
Initializing in the if __name__ == '__main__' section:
It is not initialized at all if the module is imported rather than run directly (it is only initialized when the module is run as a script).
You need to fetch the data to pass it somewhere else, even if you don't need it in main.
Setting a global within a function:
The use of global is generally discouraged, as it is in any programming language. Modifying variables out of scope is a recipe for encountering odd behaviors. It also tends to make the code harder to test if you rely on this global which is only set in a specific workflow.
An attribute on a function:
This one is a bit of an eyesore: it would certainly work, and the functionality is very similar to the class pattern I propose, but you have to admit attributes on functions are not very pythonic. The advantage of the class is that you can initialize it in many ways, subclass it, etc., and yet not fetch the data until you need it. Using a plain function is 'simpler' but much more limited.
You can also use the lru_cache decorator from the functools module for achieving the goal of running an expensive operation only once.
As long as the parameters are the same, calling the function again and again returns the same object.
https://docs.python.org/3/library/functools.html#functools.lru_cache
from functools import lru_cache

@lru_cache
def fun(input1, input2):
    result = ...  # expensive operation
    return result
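A quick illustration of the caching behaviour (the arguments are hypothetical):

fun(1, 2)   # runs the expensive operation
fun(1, 2)   # same arguments: the cached result is returned, nothing is recomputed
fun(3, 4)   # different arguments: computed (and cached) again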
Similar to MrE's answer, it is best to encapsulate the data with a wrapper.
However, I would recommend using a Python closure instead of a class.
A class should be used to encapsulate data and the relevant functions that are closely related to that data. A class should be something that you instantiate objects of, and those objects retain their individuality. You can read more about this here.
You can use closures in the following way:
def get_dataset_wrapper():
    dataset = None

    def get_dataset():
        nonlocal dataset
        if dataset is None:
            dataset = ...  # get dataset from internet
        return dataset

    return get_dataset
You can use this in the following way
dataset = get_dataset_wrapper()()
If the ()() syntax bothers you, you can create the inner function once and call it through a wrapper:
get_dataset = get_dataset_wrapper()

def wrapper():
    return get_dataset()

Reference functions to array slices in Python

Is the following possible in Python?
(I am pretty new to python, not sure what the appropriate search term would be)
I have a class that stores and manipulates a large numpy array.
Now I would like to access parts of this array via an alias or 'reference function'.
Here is a dummy example for illustration
import numpy as np

class Trajectory(object):
    def __init__(self, M=np.random.random((4, 4))):
        self.M = M

    def get_second_row(self):
        return self.M[1, :]

    def set_second_row(self, newData):
        self.M[1, :] = newData

t = Trajectory()
print(t.M)
initialData = t.get_second_row()
t.set_second_row(np.random.random(4))
print(t.M)
What I don't like about this is that I have to write separate set and get functions. Is there a simpler way to use just one function to refer to the parts of the array M that would work for both getting and setting values?
So, speaking in dummy code, something that would allow me to do this:
values = t.nth_row
t.nth_row = values + 1
I would like to use t.nth_row as a reference for both getting and setting the value, if that makes sense.
Is there a simpler way to use just one function to refer to the parts of the array M that would work for both getting and setting values?
Yes, and you've written it. It is your get function:
initialData = t.get_second_row()
t.get_second_row()[:] = np.random.random(4)
t.get_second_row()[0] = 1997

Avoiding global variables but also too many function arguments (Python)

Let's say I have a python module that has a lot of functions that rely on each other, processing each others results. There's lots of cohesion.
That means I'll be passing back and forth a lot of arguments. Either that, or I'd be using global variables.
What are best practices to deal with such a situation? Things that come to mind would be replacing those parameters with dictionaries, but I don't necessarily like how that changes the function signature to something less expressive. Or I could wrap everything in a class, but that feels like I'm cheating and using "pseudo"-global variables.
I'm asking specifically for how to deal with this in Python but I understand that many of those things would apply to other languages as well.
I don't have a specific code example right now; it's just something that came to mind when I was thinking about this issue.
An example could be: you have a function that calculates something, and in the process a lot of auxiliary stuff is calculated. Your processing routines need access to this auxiliary stuff, and you don't want to just re-compute it.
This is a very generic question, so it is hard to be specific. What you seem to be describing is a bunch of inter-related functions that share data. That pattern is usually implemented as an object.
Instead of a bunch of functions, create a class with a lot of methods. For the common data, use attributes. Set the attributes, then call the methods. The methods can refer to the attributes without them being explicitly passed as parameters.
As RobertB said, an object seems the clearest way. Could be as simple as:
import math

class myInfo:
    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y
        self.messWithDist()   # sets self.dist from x and y

    def messWithDist(self):
        self.dist = math.sqrt(self.x * self.x + self.y * self.y)

blob = myInfo(3, 4)
print(blob.dist)
blob.x = 5
blob.y = 7
blob.messWithDist()
print(blob.dist)
If some of the functions shouldn't really be part of such an object, you can just define them as (non-member, independent) functions, and pass the blob as one parameter. For example, by un-indenting the def of messWithDist, then calling as messWithDist(blob) instead of blob.messWithDist().
-s
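A sketch of that standalone-function variant, reusing the myInfo class from above, for illustration:

import math

def messWithDist(info):
    # operates on the object passed in instead of being a method
    info.dist = math.sqrt(info.x * info.x + info.y * info.y)

blob = myInfo(3, 4)
messWithDist(blob)
print(blob.dist)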
