Checking whether data frame is copy or view in Pandas - python

Is there an easy way to check whether two data frames are different copies or views of the same underlying data that doesn't involve manipulations? I'm trying to get a grip on when each is generated, and given how idiosyncratic the rules seem to be, I'd like an easy way to test.
For example, I thought "id(df.values)" would be stable across views, but it doesn't seem to be:
# Make two data frames that are views of the same data.
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=['row1', 'row2'],
                  columns=['a', 'b', 'c', 'd'])
df2 = df.iloc[0:2, :]
# Demonstrate they are views:
df.iloc[0,0] = 99
df2.iloc[0,0]
Out[70]: 99
# Now try and compare the id on values attribute
# Different despite being views!
id(df.values)
Out[71]: 4753564496
id(df2.values)
Out[72]: 4753603728
# And we can of course compare df and df2
df is df2
Out[73]: False
Other answers I've looked up try to give rules, but the rules don't seem consistent, and they also don't answer this question of how to test:
What rules does Pandas use to generate a view vs a copy?
Pandas: Subindexing dataframes: Copies vs views
Understanding pandas dataframe indexing
Re-assignment in Pandas: Copy or view?
And of course:
- http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
UPDATE: The comments below answer the question -- comparing the df.values.base attribute rather than the df.values attribute does it, as does a reference to the df._is_copy attribute (though the latter is probably bad form, since it's an internal attribute).

Answers from HYRY and Marius in comments!
One can check either by:
- comparing the values.base attribute rather than the values attribute, i.e. df.values.base is df2.values.base instead of df.values is df2.values, or
- using the (admittedly internal) _is_view attribute (df2._is_view is True).
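The values.base trick works because a NumPy view remembers the array it was sliced from, while a copy does not. A minimal NumPy-only illustration (independent of any pandas version):

```python
import numpy as np

a = np.arange(8)
view = a[2:6]          # a basic slice is a view into a's buffer
copy = a[2:6].copy()   # an explicit copy owns its own buffer

print(view.base is a)  # True: the view records its parent array
print(copy.base)       # None: the copy has no parent
```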
Thanks everyone!

I've elaborated on this example with pandas 1.0.1. There's not only a boolean _is_view attribute, but also _is_copy which can be None or a reference to the original DataFrame:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=['row1', 'row2'],
                  columns=['a', 'b', 'c', 'd'])
df2 = df.iloc[0:2, :]
df3 = df.loc[df['a'] == 1, :]
# df is neither copy nor view
df._is_view, df._is_copy
Out[1]: (False, None)
# df2 is a view AND a copy
df2._is_view, df2._is_copy
Out[2]: (True, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
# df3 is not a view, but a copy
df3._is_view, df3._is_copy
Out[3]: (False, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
So checking these two attributes should tell you not only if you're dealing with a view or not, but also if you have a copy or an "original" DataFrame.
See also this thread for a discussion explaining why you can't always predict whether your code will return a view or not.
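If you'd rather not rely on internal attributes at all, numpy's np.shares_memory offers a supported way to ask whether two arrays are currently backed by the same buffer. A sketch; note that this probes the arrays you pass in (and methods like to_numpy may themselves return a copy for mixed-dtype frames), so it is a check on arrays, not directly on frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

arr = df.to_numpy()
sliced = arr[0:2]   # a numpy-level view of arr

print(np.shares_memory(arr, sliced))      # True: same underlying buffer
print(np.shares_memory(arr, arr.copy()))  # False: the copy owns new data
# Two independent frames never share a buffer:
print(np.shares_memory(df.to_numpy(), df.copy().to_numpy()))  # False
```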

You might trace the memory your pandas/python environment is consuming, and, on the assumption that a copy will utilise more memory than a view, be able to decide one way or another.
I believe there are libraries out there that will present the memory usage within the python environment itself - e.g. Heapy/Guppy.
There ought to be a metric you can apply: take a baseline picture of memory usage prior to creating the object under inspection, then another picture afterwards. Comparing the two memory maps (assuming nothing else has been created, so the change can be attributed to the new object) should give an idea of whether a view or a copy has been produced.
We'd need to get an idea of the different memory profiles of each type of implementation, but some experimentation should yield results.
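To make the snapshot idea concrete, here is a sketch using the standard-library tracemalloc module (NumPy has reported its allocations to tracemalloc since version 1.13). A view allocates almost nothing, while a copy allocates roughly the size of the data; the exact numbers will vary by platform:

```python
import tracemalloc
import numpy as np

tracemalloc.start()
base = np.arange(1_000_000, dtype=np.float64)   # ~8 MB of data

before, _ = tracemalloc.get_traced_memory()
view = base[::2]                                # view: no new data buffer
after_view, _ = tracemalloc.get_traced_memory()
copy = base.copy()                              # copy: ~8 MB of new data
after_copy, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

view_growth = after_view - before               # tiny (array header only)
copy_growth = after_copy - after_view           # roughly base.nbytes
print(view_growth, copy_growth)
```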

Related

Deep dive into when you should use the inplace argument in Pandas functions, and why you might want to avoid it

It's very common to use the inplace argument when manipulating dataframes in Pandas.
inplace is a parameter accepted by many different functions, such as set_index(), dropna(), fillna(), reset_index(), drop(), replace() and many more. Its default value is False, in which case the function returns a modified copy of the object.
I want to know in detail when it's good practice to use inplace in pandas functions, when you shouldn't, and the reasons for each.
Can you demonstrate with examples, to serve as a reference, since this issue comes up so often when using pandas functions?
As Example:
df.drop(columns=[your_columns], inplace=True)
In which cases is using inplace with drop recommended? Also, if other variables (a list, for example) depend on the dataframe, changing it in place will affect the results derived from it. Another issue is that using inplace prevents method chaining on pandas dataframes.
I will give examples, but the better way to look at it is from style.
First, for any production code or code which can be run in any arbitrarily parallelized way, you don't want to change anything in place.
There is one major philosophical difference between functional programming and object-oriented programming: functional programming has no side effects. What does that mean? It means that if I have a function or method, let's take df.drop() for a tangible example, then using drop in a purely functional fashion will only return a result, it will do nothing else.
Let's make a dataframe:
>>> df = pd.DataFrame({"name": ["Alice", "Bob", "Candi"],
"job": ["CFO", "Accountant", "Developer"],
"department": ["Executive", "Accounting", "Product"]})
>>> df
name job department
0 Alice CFO Executive
1 Bob Accountant Accounting
2 Candi Developer Product
No Side Effects (inplace=False)
Now, if I call drop in a functional way, all that happens is a new dataframe is returned with the missing column(s):
>>> df.drop(columns = "job", inplace=False)
name department
0 Alice Executive
1 Bob Accounting
2 Candi Product
Notice that I am returning the result, which is the dataframe. To be clear, I can do this:
>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
name department
0 Alice Executive
1 Bob Accounting
2 Candi Product
Notice that new_df has been assigned to the returned result of the drop method.
With Side Effects (inplace=True)
>>> df.drop(columns="job", inplace=True)
>>>
Notice that nothing is returned! The return of this method, in fact, is None.
But something did happen:
>>> df
name department
0 Alice Executive
1 Bob Accounting
2 Candi Product
If I ask for the dataframe, we can see that df was in fact changed so that the job column is missing. But this entirely happened as a side effect, not as a return.
To prove that nothing is being returned, let's try this again (with a different column, for reasons mentioned below) and assign the result of the inplace method to a new variable:
>>> not_much = df.drop(columns="name", inplace=True)
>>> type(not_much)
<class 'NoneType'>
As you can see, the variable not_much is of NoneType, which means that None was returned.
Philosophy - Or, "when to use or not use"
Software engineering has changed over the years, and parallel activity is a much more common thing now. If you run big data jobs on Spark, or even if you run pandas on your single laptop, you can configure tasks to run multi-threaded, on multiple processes, asynchronously, as map-reduce jobs, etc.
Because of these parallel actions, you often won't know what will happen first and what will happen second. You want as many actions as possible either not to change state, or to change state in an atomic fashion.
Now let's revisit the df.drop, repeating it multiple times. Imagine you have a big-data job that is network-limited -- often the case -- and you just ask 10 machines to do the same task, and you accept the answer from whichever machine returns the answer first. This is a common way to deal with network inconsistencies.
Inplace:
>>> df
name job department
0 Alice CFO Executive
1 Bob Accountant Accounting
2 Candi Developer Product
>>> df.drop(columns="job", inplace=True)
>>> df.drop(columns="job", inplace=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: "['job'] not found in axis"
>>>
I just ran the same job twice, and got different answers, one of which caused an error. That is not good for parallel jobs running.
Not Inplace:
>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
name department
0 Alice Executive
1 Bob Accounting
2 Candi Product
>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
name department
0 Alice Executive
1 Bob Accounting
2 Candi Product
No matter how often that code is run above, new_df will always equal the same thing.
When to use one versus the other
I would never use inplace=True unless it was in a one-off Jupyter notebook, a homework assignment, or something else so far from a production environment.
Final Note
I started off saying something about "functional vs object-oriented". That is how it is framed, but I don't like that comparison because it starts flame wars, and object oriented does not have to have side effects.
It's just that functional cannot have side effects, and object-oriented often does.
I prefer to say "side effects vs no side effects". That choice is easy: always prevent side effects whenever possible, recognizing it is not always possible (despite what Haskell suggests).
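One cost of inplace=True raised in the question deserves a concrete sketch: because the method returns None, it breaks method chaining, while the side-effect-free style composes naturally:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3], "b": [4, 5, 6]})

# Chaining works because each call returns a new dataframe:
result = (df.fillna(0)
            .rename(columns={"a": "alpha"})
            .reset_index(drop=True))

# With inplace=True the same chain fails: df.fillna(0, inplace=True)
# returns None, and None has no .rename method (AttributeError).
```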
The inplace argument is used in many Pandas functions to specify whether the function should modify the original dataframe or return a new dataframe with the modifications applied. By default, the inplace argument is set to False, which means that the function will return a new dataframe.
In some cases you may want to use the inplace argument when you intend to modify a dataframe in place rather than bind a new name to the result. Note, though, that inplace=True does not reliably save memory: many pandas operations still build a new object internally and then swap it in, so the performance benefit is often smaller than expected, especially on large datasets. For example, you can use the inplace argument with the drop() function to remove rows or columns from a dataframe in place, like this:
df.drop(columns=["column1", "column2"], inplace=True)
In some cases, however, it is better not to use the inplace argument. For example, if you want to keep a copy of the original dataframe before applying any modifications, you should not use the inplace argument. This way, you can use the original dataframe for reference or comparison purposes. In this case, you can simply omit the inplace argument, like this:
df2 = df.drop(columns=["column1", "column2"])
Another case where you should avoid using the inplace argument is when you are not sure whether the function will modify the dataframe in the way you expect. For example, the fillna() function can replace missing values with a specified value, but it may not always produce the desired result. In this case, it is better to first create a copy of the dataframe using the copy() method, and then apply the fillna() function to the copy, like this:
df2 = df.copy()
df2.fillna(value=0, inplace=True)
Overall, the inplace argument is a useful tool for modifying dataframes in place, but it should be used with care to avoid unintended consequences. It is always a good idea to create a copy of the dataframe before applying any modifications, and to carefully test the results to ensure that they are correct.

Issue using drop() in pandas

I'm new to python and I'm using pandas to perform some basic manipulation before running a model.
Here is my problem: I have got dataset A and I want to create another dataset (B) equal to A, except for one variable X.
In order to reach this outcome I'm doing this:
A = B
columns_drop= ['X']
B.drop(columns_drop, axis=1, inplace=True)
But the result came out to be that both A and B miss the variable X.
Use:
A = B.drop(columns_drop, axis=1)
This will make a copy of the data.
Your approach failed because both A and B point to the same object. The proper way to make an independent copy is A = B.copy(), although mutable objects (such as lists) stored inside the dataframe won't be deep-copied!
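A short sketch of the three behaviours involved here (plain assignment aliases, .copy() decouples, and drop() without inplace returns a new frame):

```python
import pandas as pd

B = pd.DataFrame({"X": [1, 2], "Y": [3, 4]})

A = B                       # alias: A and B are the same object
assert A is B

C = B.copy()                # independent copy
C.loc[0, "X"] = 99
assert B.loc[0, "X"] == 1   # B is untouched

D = B.drop(["X"], axis=1)   # new frame without X; B keeps its columns
assert list(D.columns) == ["Y"]
assert list(B.columns) == ["X", "Y"]
```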

Pythonic way to return pandas dataframe with helper function and copy

I was wondering what the most pythonic way to return a pandas dataframe would be in a class. I am curious if I need to include .copy() when returning, and generally would like to understand the pitfalls of not including it. I am using a helper function because the dataframe is called multiple times, and I don't want to return from the manipulate_dataframe method.
My questions are:
Do I need to place .copy() after df when assigning it to a new object?
Do I need to place .copy() after self.final_df when returning it with the get_df helper function?
class df:
    def __init__(self):
        pass

    def manipulate_dataframe(self, df):
        """ DATAFRAME MANIPULATION """
        self.final_df = df

    def get_df(self):
        return self.final_df
Question: do you want the thing referred to as self.final_df to directly alter the original data frame, or do you want it to live its own life? Recall that in Python, assignment does not copy:
y = x
creates two names bound to the same object, and if you alter the object through y, you also see the change through x. Example:
>>> x = [3, 4, 6]
>>> y = x
>>> y[0] = 77
>>> x
[77, 4, 6]
The pandas df is just another example of same Python fact.
Making a copy can be time consuming. It will also disconnect your self.final_df from the input data frame. People often do that because Pandas issues warnings about user efforts to assign values into views of data frames. The correct place to start reading on this is the Pandas documentation: https://pandas.pydata.org/docs/user_guide/indexing.html?highlight=copy%20versus%20view. There also seem to be a hundred blog posts about it; I prefer the RealPython site (https://realpython.com/pandas-settingwithcopywarning), but you will find others, e.g. https://www.dataquest.io/blog/settingwithcopywarning. A common step that careless users take to address the warning is to copy the data frame, which effectively "disconnects" the new copy from the old data frame.
I don't know why you might want to create a class that simply holds a pandas df. Probably you could just create a function that does all of those data manipulations. If those commands are supposed to alter the original df, don't make a copy.
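If the class is meant to hand out frames that live their own life, a defensive-copy sketch is one option (the class name DfHolder is made up here; the method name follows the question, and whether you want either copy depends on the answer above):

```python
import pandas as pd

class DfHolder:
    def __init__(self, df):
        self.final_df = df.copy()      # decouple from the caller's frame

    def get_df(self):
        return self.final_df.copy()    # callers can't mutate internal state

holder = DfHolder(pd.DataFrame({"a": [1, 2]}))
out = holder.get_df()
out.loc[0, "a"] = 99                   # only mutates the handed-out copy
print(holder.get_df().loc[0, "a"])     # internal state is unchanged
```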

dataframes selecting data by [] and by .(attribute)

I found out that I have a problem understanding when I should access data from a dataframe (df) using df["data"] or df.data.
I mostly use the [] method to create new columns, and I can access data using both df[] and df.data, but what's the difference, and how can I better grasp these two ways of selecting data? When should one be used over the other?
If I understand the Docs correctly, they are pretty much equivalent, except in these cases:
You can use the . access only if the index element is a valid python identifier, e.g. s.1 is not allowed.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
However, while
indexing operators [] and attribute operator . provide quick and easy
access to pandas data structures across a wide range of use cases [...]
in production you should really use the optimized pandas data access methods such as .loc and .iloc (the docs quoted here also mention .ix, which has since been removed), because
[...] since the type of the data to be accessed isn’t known
in advance, directly using standard operators has some optimization
limits. For production code, we recommended that you take advantage of
the optimized pandas data access methods.
Using [] will use the value of the variable:
a = "hello"
df[a]  # gives you the content of the column named "hello"
Using .:
df.a  # the content of the column named "a"
The difference is that with the first one you can use a variable.
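A compact demonstration of the cases above (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"hello": [1, 2], "min": [3, 4]})

a = "hello"
assert df[a].tolist() == [1, 2]        # [] accepts a variable
assert df.hello.equals(df["hello"])    # attribute access, same column

# "min" collides with the DataFrame method, so df.min is the method:
assert callable(df.min)
assert df["min"].tolist() == [3, 4]    # [] still reaches the column
```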

Copy Warning Pandas

I'm aware of the view vs. copy issue and the reason for the warning that Pandas displays. This has made me more careful to use .loc, .ix, etc. rather than chained indexing.
However I'm generating warnings when I'm convinced I shouldn't be. I have a dataframe "df" containing my data and I have a function:
def my_func(df):
    df['new_channel'] = df.channel.diff()
    return df
If I run this function I get no warnings. However if I then define a new dataframe from my original one like this:
df2 = df.ix[df.channel==val,:]
Calling the function:
my_func(df2)
Then generates copy warnings. But my understanding is that I'm not using any chained indexing in my function and I haven't used chained indexing to create the second dataframe.
Is this a false positive, i.e. can I just turn the warning off and carry on? Or have I missed something more fundamental that could bite me in the future?
Ben
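For what it's worth, the common pattern for making the intent explicit, if df2 is meant to be independent of df, is an explicit .copy() at creation time (sketched here with .loc, since .ix has been removed from modern pandas). The warning goes away because pandas no longer has to guess whether you meant to write through to the original:

```python
import pandas as pd

df = pd.DataFrame({"channel": [1, 1, 2, 2], "value": [10, 20, 30, 40]})
val = 1

df2 = df.loc[df["channel"] == val, :].copy()   # explicit copy: no warning
df2["new_channel"] = df2["channel"].diff()     # modifies only df2
```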
