Storage of dataframes and variables defined inside a method in Python

If a method creates a data frame inside its body when an object of that class calls it, will the data frame persist after the method has finished executing?
Taking the code below as an example:
import pandas as pd

class some_class():
    def some_method(self):
        some_data = pd.DataFrame({"a": [1, 2, 3, 4],
                                  "b": [5, 6, 7, 8]})
        return some_data

a = some_class()
b = a.some_method()
After the call to a.some_method() executes, will the dataframe be stored in the object?
I want to be able to create multiple objects and use them to return data based on the methods defined in those objects, but I'm concerned that if the object stores the data as well, then in effect I'll be storing the same data twice (in the data frame b and in the object a in the example above).

If you want to store a value on an instance, a method must assign it to self. For example:
class some_class():
    def some_method(self):
        self.some_data = pd.DataFrame({"a": [1, 2, 3, 4],
                                       "b": [5, 6, 7, 8]})
        return self.some_data

a = some_class()
b = a.some_method()
This will store a "label" to the data, named some_data, within your instance of some_class (which, by the way, you should capitalize as SomeClass if you want to follow the popular convention). The variable b is also an alias for this data: both a.some_data and b refer to the exact same object. There is no copy.
This is useful and saves memory but you need to be aware that you're working with labels (references) to the same data. If you want a.some_data and b to be separate instances of data, you'll need to explicitly copy the data.
Python variables behave differently from those in many other popular languages. The name of a variable, e.g. b, is really just a label attached to some value. Therefore if you assign c = b, you haven't copied the data; you've simply attached a new label to the original value. For immutable types like the primitive numeric types, this isn't much different from copying the value, but for more complex types (lists, dicts, data frames, etc.) you need to be aware that you're dealing with labels.
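A minimal sketch of both behaviors, using a small pandas frame:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
alias = df               # a second label for the same object, no copy
alias.loc[0, "a"] = 99
print(df.loc[0, "a"])    # 99: the change shows through both labels

snapshot = df.copy()     # an explicit, independent copy
snapshot.loc[0, "a"] = 0
print(df.loc[0, "a"])    # still 99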

As written, your class won't store the data frame: some_data is only a local variable, so nothing is attached to the object. And since the method uses no instance state, you don't even need self or an __init__; you can drop them and call the method directly on the class:
class some_class():
    def some_method():
        some_data = pd.DataFrame({"a": [1, 2, 3, 4],
                                  "b": [5, 6, 7, 8]})
        return some_data

print(some_class.some_method())
Output:
   a  b
0  1  5
1  2  6
2  3  7
3  4  8


Why do function attributes (setattr ones) only become available after assigning it as a property to a class and instantiating it?

I apologize if I'm butchering the terminology. I'm trying to understand the code in this example of how to chain a custom function onto a PySpark dataframe. I'd really like to understand exactly what it's doing, and whether it's bad practice, before I implement anything.
From the way I'm understanding the code, it:
defines a function g with sub-functions inside of it, that returns a copy of itself
assigns the sub-functions to g as attributes
assigns g as a property of the DataFrame class
I don't think at any step in the process do any of them become a method (when I do getattr, it always says "function")
When I run a simplified version of the code (below, as best I could manage), it seems that only when I assign the function as a property to a class, and then instantiate at least one copy of the class, do the attributes on the function become available (even outside of the class). I want to understand what is happening there and why.
An answer [here](https://stackoverflow.com/a/17007966/19871699) indicates that this is expected behavior, but doesn't really explain what or why it is. I've read this too, but I'm having trouble seeing the connection to the code above.
I read here about the setattr part of the code. He doesn't mention exactly the use case above. This post has some use cases where people do it, but I'm not understanding how it directly applies to the above, unless I've missed something.
The confusing part is when the inner attributes become available.
class SampleClass():
    def __init__(self):
        pass

def my_custom_attribute(self):
    def inner_function_one():
        pass
    setattr(my_custom_attribute, "inner_function", inner_function_one)
    return my_custom_attribute

[x for x in dir(my_custom_attribute) if x[0] != "_"]
returns []
then when I do:
SampleClass.custom_attribute = property(my_custom_attribute)
[x for x in dir(my_custom_attribute) if x[0] != "_"]
it returns []
but when I do:
class_instance = SampleClass()
class_instance.custom_attribute
[x for x in dir(my_custom_attribute) if x[0] != "_"]
it returns ['inner_function']
In the code above though, if I do SampleClass.custom_attribute = my_custom_attribute instead of =property(...) the [x for x... code still returns [].
edit: I'm not intending to access the function itself outside of the class. I just don't understand the behavior, and don't like implementing something I don't understand.
So, setattr is not relevant here. This would all work exactly the same without it, say, by just doing my_custom_attribute.inner_function = inner_function_one. What is relevant is that the approach in the link you showed (and your example doesn't make its purpose entirely clear) relies on using a property, which is a descriptor. But the function won't get called unless you access the attribute corresponding to the property on an instance. This comes down to how property works. For any property on a class Foo:
Foo.attribute_name = property(some_function)
Then some_function won't get called until you do Foo().attribute_name. That is the whole point of property.
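A quick sketch of that behavior, using the placeholder names from above:

class Foo:
    pass

def some_function(self):
    print("some_function called")
    return 42

Foo.attribute_name = property(some_function)

foo = Foo()                 # nothing printed: the getter hasn't run
value = foo.attribute_name  # prints "some_function called"
print(value)                # 42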
But this whole solution is very confusingly engineered. It relies on the above behavior, and it sets attributes on the function object.
Note, if all you want to do is add some method to your DataFrame class, you don't need any of this. Consider the following example (using pandas for simplicity):
>>> import pandas as pd
>>> def foobar(self):
... print("in foobar with instance", self)
...
>>> pd.DataFrame.baz = foobar
>>> df = pd.DataFrame(dict(x=[1,2,3], y=['a','b','c']))
>>> df
   x  y
0  1  a
1  2  b
2  3  c
>>> df.baz()
in foobar with instance    x  y
0  1  a
1  2  b
2  3  c
That's it. You don't need all that rigamarole. Of course, if you wanted to add a nested accessor, df.custom.whatever, you would need something a bit more complicated. You could use the approach in the OP, but I would prefer something more explicit:
import pandas as pd

class AccessorDelegator:
    def __init__(self, accessor_type):
        self.accessor_type = accessor_type

    def __get__(self, instance, cls=None):
        return self.accessor_type(instance)

class CustomMethods:
    def __init__(self, instance):
        self.instance = instance

    def foo(self):
        # do something with self.instance as if this were your `self`
        # on the dataframe being augmented
        print(self.instance.value_counts())

pd.DataFrame.custom = AccessorDelegator(CustomMethods)

df = pd.DataFrame(dict(a=[1, 2, 3], b=['a', 'b', 'c']))
df.custom.foo()
The above will print:
a  b
1  a    1
2  b    1
3  c    1
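For what it's worth, pandas also ships a supported hook for exactly this nested-accessor pattern, pd.api.extensions.register_dataframe_accessor. A minimal sketch (the accessor name custom2 is an arbitrary choice here):

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("custom2")
class CustomAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj  # the dataframe the accessor was reached from

    def foo(self):
        print(self._obj.value_counts())

df = pd.DataFrame(dict(a=[1, 2, 3], b=['a', 'b', 'c']))
df.custom2.foo()  # same effect as df.custom.foo() above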
Because when you call a function, the attributes set within that function aren't returned; only the return value is passed back.
In other words, the additional attributes are only available on the returned function object, not on 'g' itself until it has been called.
Try moving the setattr() call outside of the function.
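A minimal sketch of that timing: the attribute assignment is a statement inside g's body, so it only executes once g itself is called.

def g():
    def inner():
        pass
    # This assignment runs only when g is called:
    g.inner_function = inner
    return g

print(hasattr(g, "inner_function"))  # False: g's body hasn't run yet
g()
print(hasattr(g, "inner_function"))  # True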

Usage of setattr method in python

I have a question on the usage of the setattr method in python.
I have a python class with around 20 attributes, which can be initialized in the below manner:
class SomeClass():
    def __init__(self, pd_df_row):  # pd_df_row is one row from a dataframe
        # initialize some attributes (attribute_A to attribute_Z) in a similar manner
        if 'column_A' in pd_df_row.columns:
            self.attribute_A = pd_df_row['column_A']
        else:
            self.attribute_A = np.nan
        ....
        if 'column_Z' in pd_df_row.columns:
            self.attribute_Z = pd_df_row['column_Z']
        else:
            self.attribute_Z = np.nan
        # initialize some other attributes based on some other columns in pd_df_row
        self.other_attribute = pre_process(pd_df_row['column_123'])

    # some other methods
    def compute_something(self):
        return self.attribute_A + self.attribute_B
Is it advisable to write the class in the way below instead, making use of the setattr method and a for loop in Python:
class SomeClass():
    # A static list storing the mapping between attribute names and column names
    # that can be initialized with similar logic. The mapping does not cover all
    # columns of the input pd_df_row or all attributes of the class, because not
    # all columns are read and stored the same way. (The mapping is hardcoded;
    # its initialization cannot be simplified with a loop, because the attribute
    # names and corresponding column names don't follow any particular pattern.)
    ATTR_LIST = [('attribute_A', 'column_A'), ('attribute_B', 'column_B'), ..., ('attribute_Z', 'column_Z')]

    def __init__(self, pd_df_row):  # pd_df_row is one row from a dataframe
        # initialize some attributes (attribute_A to attribute_Z) in a loop
        for attr_name, col_name in SomeClass.ATTR_LIST:
            if col_name in pd_df_row.columns:
                setattr(self, attr_name, pd_df_row[col_name])
            else:
                setattr(self, attr_name, np.nan)
        # initialize some other attributes based on some other columns in pd_df_row
        self.other_attribute = pre_process(pd_df_row['column_123'])

    # some other methods
    def compute_something(self):
        return self.attribute_A + self.attribute_B
The second way of writing this class seems to shorten the code. However, it also seems to make the structure of the class a bit confusing, by introducing the static list mapping attribute names to column names (which is used to initialize only some, but not all, of the attributes). I also noticed that code auto-completion does not work for the second version, since the editor cannot know which attributes are created until runtime. So my question is: is it advisable to use setattr() in this way? In which cases should I write my code like this, and in which cases should I avoid it?
In addition, does creating the static mapping in the class violate object-oriented programming principles? Should I create and store this mapping somewhere else instead?
Thank you.
You could, but I would consider having a dict of attributes rather than separate similarly named attributes.
class SomeClass():
    def __init__(self, pd_df_row):  # pd_df_row is one row from a dataframe
        self.attributes = {}
        for x in ['A', ..., 'Z']:
            column = f'column_{x}'
            if column in pd_df_row:
                self.attributes[x] = pd_df_row[column]
            else:
                self.attributes[x] = np.nan
        # initialize some other attributes
        self.other_attribute = some_other_values

    # some other methods
    def compute_something(self):
        return self.attributes['A'] + self.attributes['B']
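As a side note, both Series and DataFrame support .get() with a default, so the if/else in the loop can be collapsed. A sketch of that variant (COLUMNS is a hypothetical stand-in for the real list of suffixes):

import numpy as np
import pandas as pd

class SomeClass:
    COLUMNS = ['A', 'B']  # hypothetical subset; the real list is longer

    def __init__(self, pd_df_row):
        # .get returns the default when the key/column is missing
        self.attributes = {x: pd_df_row.get(f'column_{x}', np.nan)
                           for x in self.COLUMNS}

row = pd.Series({'column_A': 1.0})
obj = SomeClass(row)
print(obj.attributes)  # {'A': 1.0, 'B': nan}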

Understanding Mutability and Multiple Variable Assignment to Class Objects in Python

I'm looking for some clarification regarding mutability and class objects. From what I understand, variables in Python are about assigning a variable name to an object.
If that object is immutable then when we set two variables to the same object, it'll be two separate copies (e.g. a = b = 3 so a changing to 4 will not affect b because 3 is a number, an example of an immutable object).
However, if an object is mutable, then changing the value in one variable assignment will naturally change the value in the other (e.g. a = b = [] -> a.append(1) so now both a and b will refer to "[1]")
Working with classes, it seems even more fluid than I believed. I wrote a quick example below to show the differences. The first class is a typical Node class with a next pointer and a value. Setting two variables, "slow" and "fast", to the same instance of the Node object ("head"), and then changing the values of both "slow" and "fast" won't affect the other. That is, "slow", "fast", and "head" all refer to different objects (verified by checking their id() as well).
The second example class doesn't have a next pointer and only has a self.val attribute. This time, changing one of the two variables, "p1" and "p2", both of which are set to the same instance, "start", will affect the other. This is despite the fact that self.val in the "start" instance is an immutable number.
'''
The below will have two variable names (slow, fast) assigned to a head Node.
Changing one of them will NOT change the other reference as well.
'''
class Node:
    def __init__(self, x, next=None):
        self.x = x
        self.next = next
    def __str__(self):
        return str(self.x)

n3 = Node(3)
n2 = Node(2, n3)
n1 = Node(1, n2)
head = n1
slow = fast = head
print(f"Printing before moving...{head}, {slow}, {fast}")  # 1, 1, 1
while fast and fast.next:
    fast = fast.next.next
    slow = slow.next
print(f"Printing after moving...{head}, {slow}, {fast}")  # 1, 2, 3
print(f"Checking the ids of each variable {id(head)}, {id(slow)}, {id(fast)}")  # all different

'''
The below will have two variable names (p1, p2) assigned to a start Dummy.
Changing one of them will change the other reference as well.
'''
class Dummy:
    def __init__(self, val):
        self.val = val
    def __str__(self):
        return str(self.val)

start = Dummy(100)
p1 = p2 = start
print(f"Printing before changing {p1}, {p2}")  # 100, 100
p1.val = 42
print(f"Printing after changing {p1}, {p2}")  # 42, 42
It's a bit murky to me what is actually going on under the hood, and I'm seeking clarification so I can feel confident about when assigning multiple variables to the same object behaves like a true copy (without resorting to "import copy; copy.deepcopy(x);").
Thank you for your help
This isn't a matter of immutability vs mutability. This is a matter of mutating an object vs reassigning a reference.
If that object is immutable then when we set two variables to the same object, it'll be two separate copies
This isn't true. A copy won't be made. If you have:
a = 1
b = a
You have two references to the same object, not a copy of the object. This is fine though because integers are immutable. You can't mutate 1, so the fact that a and b are pointing to the same object won't hurt anything.
Python will never make implicit copies for you. If you want a copy, you need to copy it yourself explicitly (using copy.copy, or some other method like slicing on lists). If you write this:
a = b = some_obj
a and b will point to the same object, regardless of the type of some_obj and whether or not it's mutable.
So what's the difference between your examples?
In your first Node example, you never actually alter any Node objects. They may as well be immutable.
slow = fast = head
That initial assignment makes both slow and fast point to the same object: head. Right after that, though, you do:
fast = fast.next.next
This reassigns the fast reference, but never actually mutates the object fast is looking at. All you've done is change what object the fast reference is looking at.
In your second example however, you directly mutate the object:
p1.val = 42
While this looks like reassignment, it isn't. This is actually:
p1.__setattr__("val", 42)
And __setattr__ alters the internal state of the object.
So, reassignment changes what object is being looked at. It will always take the form:
a = b # Maybe chained as well.
Contrast with these that look like reassignment, but are actually calls to mutating methods of the object:
l = [0]
l[0] = 5 # Actually l.__setitem__(0, 5)
d = Dummy()
d.val = 42 # Actually d.__setattr__("val", 42)
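A compact way to see the difference is to check identity with `is` (or id()) around each kind of statement; a small sketch:

a = b = [0]
print(a is b)   # True: one object, two labels

b.append(1)     # mutation: visible through both labels
print(a)        # [0, 1]

b = [2]         # reassignment: b now labels a brand-new object
print(a, b)     # [0, 1] [2]
print(a is b)   # False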
You're overcomplicating things. The fundamental, simple rule is: each time you use = to assign an object to a variable, you make the variable name refer to that object, that's all. Whether the object is mutable makes no difference.
With a = b = 3, you make the names a and b refer to the object 3. If you then make a = 4, you make the name a refer to the object 4, and the name b still refers to 3.
With a = b = [], you've created two names a and b that refer to the same list object. When doing a.append(1), you append 1 to this list. You haven't assigned anything to a or b in the process (you didn't write any a = ... or b = ...). So, whether you access the list through the name a or b, it's still the same list that you manipulate. It can just be called by two different names.
The same happens in your example with classes: when you write fast = fast.next.next, you make the name fast refer to a new object.
When you do p1.val = 42, you don't make p1 refer to a new, different instance; you change the val attribute of this instance. p1 and p2 are still two names for this unique instance, so using either name lets you refer to the same instance.
Mutable and Immutable Objects
When a program runs, its data objects are stored in the computer's memory for processing. Some of these objects can be modified at that memory location, while others cannot be modified once they are stored. Whether or not a data object can be modified in the same memory location where it is stored is called mutability. We can check the mutability of an object by checking its memory location before and after it is modified: if the memory location remains the same after the data object is modified, the object is mutable. To check the memory location where a data object is stored, we use the function id(). Consider the following example:

a = [5, 10, 15]
id(a)
# 1906292064
a[1] = 20       # replace the second item in the list, 10, with a new item, 20
print(a)        # verify the new value of a
id(a)
# 1906292064

The memory location has not changed: the ID (1906292064) is the same before and after the list is modified. This indicates that the list is mutable, i.e., it can be modified at the same memory location where it is stored.
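For contrast, here is the same check on an immutable type; the ID changes because a new object is built and the name is rebound:

s = "abc"
print(id(s))
s = s + "d"   # strings are immutable: this creates a new string object
print(id(s))  # a different ID
print(s)      # abcd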

Variable supporting class like attribute assignments

I was wondering if there was some kind of Python variable (which isn't a custom-made class) that would support the following code:
a = some_creation_procedure
a.variable_a = 1
a.variable_b = 2
a.variable_c = 3
print("{}, {}, {}".format(a.variable_a, a.variable_b, a.variable_c))
output -
1, 2, 3
I could probably create a custom class and support this with a "get_attribute" function, but I was wondering if there was built-in support for this.
Motivation:
I want to debug a certain function within a class (which requires a lot of operations/variables for initializing), so I want to create a sub-class instance which has variables corresponding to that specific function (and send it as self for that specific function).
for example :
class some_class():
    def __init__(self, var1, var2, var3, var4, ...):
        do_a()
        do_b()
        # and so on...

    def minimal_func(self):
        print(self.var1)

my_variable.var1 = "a"
some_class.minimal_func(my_variable)
The simplest way is a class without anything inside it:
class Namespace: pass
a = Namespace()
a.variable_a = 1
a.variable_b = 2
a.variable_c = 3
print ("{}, {}, {}".format(a.variable_a, a.variable_b, a.variable_c))
prints:
1, 2, 3
as #James said in the comments:
Instances of class objects can have attributes assigned to them on the fly. There are ways to restrict attribute assignment as well, but by default, you can just assign anything
As #Aran-Frey pointed out, you can also use the types.SimpleNamespace class instead of the empty class above:
import types
a = types.SimpleNamespace()
This also allows you to add attributes in the constructor:
import types
a = types.SimpleNamespace(
    variable_a=1,
    variable_b=2,
    variable_c=3)
It also has a nice __repr__:
print(a)
prints:
namespace(variable_a=1, variable_b=2, variable_c=3)
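Applied to the motivating example, a SimpleNamespace works as the stand-in "self" for debugging a single method; a sketch using the names from the question:

import types

class some_class():
    def minimal_func(self):
        print(self.var1)

# A stand-in "self" carrying only the attribute minimal_func needs
my_variable = types.SimpleNamespace(var1="a")
some_class.minimal_func(my_variable)  # prints: a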

Understanding Python inheritance and initialisation [duplicate]

This question already has answers here and was closed 10 years ago.
Possible duplicate: "Least Astonishment" in Python: The Mutable Default Argument
In Python 2.7, consider the following code:
class Base(object):
    # Variant 1
    def __init__(self, records=[]):
        self._records = records

    # Variant 2
    # def __init__(self, records=[]):
    #     self._records = []
    #     if records:
    #         self._records = records

    def append(self, value):
        self._records.append(value)

class ChildA(Base):
    pass

class ChildB(Base):
    pass

a = ChildA()
b = ChildB()
a.append(100)
b.append(200)
print a._records
print b._records
If I use variant 1 to initialize my base class, self._records behaves like a class variable. Executing the code using variant 1, I get the output:
[100, 200]
[100, 200]
Using variant 2 to initialize my base class, self._records behaves like an instance variable (as expected). Executing the code using variant 2, I get the output:
[100]
[200]
What is the difference between these both variants? Why does variant 1 work different to variant 2? Thanks a lot for your help!
Your default argument is [], which is a common pitfall with Python. See more in the tutorial:
Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes.
It has nothing to do with inheritance or class vs. instance variables. Consider the following code:
>>> def f(a=[]):
... a.append(1)
... print a
...
>>> f.func_defaults
([],)
>>> f()
[1]
>>> f()
[1, 1]
>>> f.func_defaults
([1, 1],)
Default values for function parameters are evaluated only once and stored on the function object. Each time f is called, it operates on the same list. The same thing happens in your case.
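(In Python 3 the same inspection is spelled f.__defaults__; func_defaults was the Python 2 name.)

def f(a=[]):
    a.append(1)
    print(a)

print(f.__defaults__)  # ([],)
f()                    # [1]
f()                    # [1, 1]
print(f.__defaults__)  # ([1, 1],)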
As the others have said, this has to do with using an empty list as a default value, not with class inheritance behavior at all.
The correct way to do it is:
class Base(object):
    # Variant 1
    def __init__(self, records=None):
        if records is None:
            records = []
        self._records = records
This ensures a new list instance is created each time the Base class is instantiated. The way you put it in your code, a list object is instantiated when the class body is parsed by Python, and that one list instance is used as the default parameter each time the __init__ method runs. Since the objects hold a reference to, and change, that same list, the change is visible in all other objects which share the Base class.
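Re-running the question's test against this corrected Base then gives independent lists, as expected (a sketch in Python 2 syntax to match the question, assuming the ChildA/ChildB definitions above):

a = ChildA()
b = ChildB()
a.append(100)
b.append(200)
print a._records  # [100]
print b._records  # [200]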
