Deep copy of Pandas dataframes and dictionaries - python

I'm creating a small Pandas dataframe:
df = pd.DataFrame(data={'colA': [["a", "b", "c"]]})
I take a deep copy of that df. Note I'm not using the Pandas method but general Python:
import copy
df_copy = copy.deepcopy(df)
A df_copy.head() gives the following:
        colA
0  [a, b, c]
Then I put these values into a dictionary:
mydict = df_copy.to_dict()
That dictionary looks like this:
{'colA': {0: ['a', 'b', 'c']}}
Finally, I remove one item of the list:
mydict['colA'][0].remove("b")
I'm surprised that the values in df_copy are updated. I'm very confused that the values in the original dataframe are updated too! Both dataframes look like this now:
     colA
0  [a, c]
I understand Pandas doesn't really do deepcopy, but this wasn't a Pandas method. My questions are:
1) how can I build a dictionary from a dataframe that doesn't update the dataframe?
2) how can I take a copy of a dataframe which would be completely independent?
thanks for your help!
Cheers,
Nicolas

Disclaimer
Notice that putting mutable objects inside a DataFrame can be an antipattern, so make sure that you really need it and that you understand what you are doing.
Why isn't your copy independent
When applied to an object, copy.deepcopy looks for a __deepcopy__ method on that object, which is called in turn. This hook exists to let objects avoid copying too much. In the case of a DataFrame instance (version 0.20.0 and above), __deepcopy__ does not work recursively.
Similarly, if you use DataFrame.copy(deep=True), the deep copy will copy the data, but will not do so recursively.
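A quick way to see this (rebuilding the question's df) is to check the identity of the nested list objects after each copy route:
import copy
import pandas as pd

df = pd.DataFrame(data={'colA': [["a", "b", "c"]]})
df_deep = copy.deepcopy(df)
df_pandas = df.copy(deep=True)

# Both copies still hold references to the very same list object:
print(df['colA'][0] is df_deep['colA'][0])    # True
print(df['colA'][0] is df_pandas['colA'][0])  # True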
How to solve the problem
To take a truly deep copy of a DataFrame containing lists (or other Python objects), so that it will be independent, you can use one of the methods below.
df_copy = pd.DataFrame(columns = df.columns, data = copy.deepcopy(df.values))
For a dictionary, you may use the same trick:
mydict = pd.DataFrame(columns = df.columns, data = copy.deepcopy(df_copy.values)).to_dict()
mydict['colA'][0].remove("b")
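Now the removal only affects the dictionary:
print(mydict['colA'][0])   # ['a', 'c']
print(df_copy['colA'][0])  # still ['a', 'b', 'c']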
There's also a standard hacky way of deep-copying Python objects:
import pickle
df_copy = pickle.loads(pickle.dumps(df))
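Note that this pickle round-trip rebuilds every nested object from its serialized form, so it only works when everything stored in the frame is picklable.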
Feel free to ask for any clarifications, if needed.

Related

Issue using drop() in pandas

I'm new to python and I'm using pandas to perform some basic manipulation before running a model.
Here is my problem: I have got dataset A and I want to create another dataset (B) equal to A, except for one variable X.
In order to reach this outcome I'm doing this:
A = B
columns_drop = ['X']
B.drop(columns_drop, axis=1)
But the result is that both A and B are missing the variable X.
Use:
A = B.drop(columns_drop, axis=1)
This will make a copy of the data.
Your approach failed because both A and B point to the same object. The proper way to make an independent copy is A = B.copy(), although mutable objects like lists inside the dataframe won't be deep copied!
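A runnable version of the fix with toy data (the question's real dataset isn't shown):
import pandas as pd

B = pd.DataFrame({'X': [1, 2], 'Y': [3, 4]})

columns_drop = ['X']
A = B.drop(columns_drop, axis=1)  # returns a new frame; B is untouched

print(list(B.columns))  # ['X', 'Y']
print(list(A.columns))  # ['Y']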

Pythonic way to return pandas dataframe with helper function and copy

I was wondering what the most pythonic way to return a pandas dataframe would be in a class. I am curious whether I need to include .copy() when returning, and generally would like to understand the pitfalls of not including it. I am using a helper function because the dataframe is called multiple times, and I don't want to return from the manipulate_dataframe method.
My questions are:
Do I need to place .copy() after df when assigning it to a new object?
Do I need to place .copy() after self.final_df when returning it with the get_df helper function?
class df:
    def __init__(self):
        pass

    def manipulate_dataframe(self, df):
        """ DATAFRAME MANIPULATION """
        self.final_df = df

    def get_df(self):
        return self.final_df
Question: do you want the thing referred to as self.final_df to directly alter the original data frame, or do you want it to live its own life? Recall that Python treats many variables as "views", in the sense that
y = x
creates two links to the same information, and if you alter y, it also alters x. Example:
>>> x = [3, 4, 6]
>>> y = x
>>> y[0] = 77
>>> x
[77, 4, 6]
The pandas df is just another example of the same Python fact.
Making a copy can be time consuming, but it will also disconnect your self.final_df from the input data frame. People often do that because Pandas issues warnings about efforts to assign values into a view of a data frame. The correct place to start reading on this is the Pandas user guide: https://pandas.pydata.org/docs/user_guide/indexing.html?highlight=copy%20versus%20view. There also seem to be 100 blog posts about it; I prefer the RealPython site https://realpython.com/pandas-settingwithcopywarning, but you will find others, e.g. https://www.dataquest.io/blog/settingwithcopywarning. A common way careless users address that warning is to copy the data frame, which effectively "disconnects" the new copy from the old one.
I don't know why you might want to create a class that simply holds a pandas df. Probably you could just create a function that does all of those data manipulations. If those commands are supposed to alter the original df, don't make a copy.
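If the answer is that self.final_df should live its own life, a minimal sketch of the question's class with explicit copies (the copy placement here is an assumption about intent, not the only correct choice):
import pandas as pd

class df:
    def manipulate_dataframe(self, df):
        # Copy here if self.final_df should be independent of the input;
        # drop the .copy() if it is meant to alter the original frame.
        self.final_df = df.copy()

    def get_df(self):
        # Copy again if callers must never mutate the stored frame.
        return self.final_df.copy()

holder = df()
holder.manipulate_dataframe(pd.DataFrame({'a': [1, 2]}))
print(holder.get_df())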

How am I supposed to use pandas functions?

Let data be a giant pandas dataframe. It has many methods. The methods do not modify in place but return a new dataframe. How then am I supposed to perform multiple operations while maximizing performance?
For example, say I want to do
data = data.method1().method2().method3()
where method1 could be set_index, and so on.
Is this the way you are supposed to do it? My worry is that pandas creates a copy every time I call a method, so that three copies of my data are made in the above, when in reality all I want is to modify the original one.
So is it faster to say
data = data.method1(inplace=True)
data = data.method2(inplace=True)
data = data.method3(inplace=True)
This is just way too verbose for me.
Yes, you can do that, i.e. apply methods one after the other. Since you overwrite data, you are not creating 3 copies, but only keeping one copy:
data = data.method1().method2().method3()
However, regarding your second example, the general rule is to either write over the existing dataframe or do it "inplace", but not both at the same time:
data = data.method1()
or
data.method1(inplace=True)
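The reason not to combine them: the inplace variants return None, so assigning the result back silently loses the frame. A small illustration using sort_values (standing in for the hypothetical method1):
import pandas as pd

data = pd.DataFrame({'a': [3, 1, 2]})

# Anti-pattern: inplace methods return None, so this assignment
# throws away the reference to the (in-place sorted) frame.
data = data.sort_values('a', inplace=True)
print(data)  # None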

Handling appending to abstraction of dataframe

If I have a "reference" to a dataframe, there appears to be no way to append to it in pandas because neither append nor concat support the inplace=True parameter.
An (overly) simple example:
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df = chosen_df.append(chosen_row)
Now because Python does something akin to copy reference by value, chosen_df will initially be a reference to whichever candidate dataframe passed some_test.
But the update semantics of pandas mean that the referenced dataframe is not updated by the result of the append function; a new object is bound to the name instead. I believe that if it were possible to use inplace=True this would work, but it looks like that isn't likely to happen, given the discussion here: https://github.com/pandas-dev/pandas/issues/14796
It's worth noting that a simpler example using lists rather than dataframes does work, because the contents of lists are directly mutated by append().
So my question is --- How could an updatable abstraction over N dataframes be achieved in Python?
The idiom is commonplace, useful and trivial in languages that allow references, so I'm guessing I'm missing a Pythonic trick, or thinking about the whole problem with the wrong hat on!
Obviously the purely illustrative example can be resolved by duplicating the append in the body of an if...else and concretely referencing each underlying dataframe in turn. But this isn't scalable to more complex examples, and it's a generic solution akin to references I'm looking for.
Any ideas?
There is a simple way to do this specifically for pandas dataframes - so I'll answer my own question.
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df.loc[max_idx+1] = chosen_row
The calculation of max_idx very much depends on the structure of chosen_df. In the simplest case, when it is a dataframe with a sequential index starting at 0, you can simply use the length of the index to calculate it.
If chosen_df has a non-sequential index, you'll need to call max() on the index rather than rely on its length.
If chosen_df is a slice or groupby object, you'll need to calculate the index from the parent dataframe to ensure it's truly the max across all rows.
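A small sketch of both max_idx cases (toy data, since the real chosen_df isn't shown):
import pandas as pd

chosen_df = pd.DataFrame({'a': [1, 2]})   # sequential index starting at 0
chosen_row = pd.Series({'a': 3})

max_idx = len(chosen_df.index) - 1        # simplest case: sequential from 0
# max_idx = chosen_df.index.max()         # non-sequential index: use max()

chosen_df.loc[max_idx + 1] = chosen_row   # mutates chosen_df in place
print(chosen_df)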

Checking whether data frame is copy or view in Pandas

Is there an easy way to check whether two data frames are different copies or views of the same underlying data that doesn't involve manipulations? I'm trying to get a grip on when each is generated, and given how idiosyncratic the rules seem to be, I'd like an easy way to test.
For example, I thought id(df.values) would be stable across views, but it doesn't seem to be:
import pandas as pd

# Make two data frames that are views of the same data.
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index=['row1','row2'],
                  columns=['a','b','c','d'])
df2 = df.iloc[0:2,:]
# Demonstrate they are views:
df.iloc[0,0] = 99
df2.iloc[0,0]
Out[70]: 99
# Now try and compare the id on values attribute
# Different despite being views!
id(df.values)
Out[71]: 4753564496
id(df2.values)
Out[72]: 4753603728
# And we can of course compare df and df2
df is df2
Out[73]: False
Other answers I've looked up try to give rules, but they don't seem consistent, and they also don't answer this question of how to test:
What rules does Pandas use to generate a view vs a copy?
Pandas: Subindexing dataframes: Copies vs views
Understanding pandas dataframe indexing
Re-assignment in Pandas: Copy or view?
And of course:
- http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
UPDATE: Comments below seem to answer the question -- looking at the df.values.base attribute rather than df.values attribute does it, as does a reference to the df._is_copy attribute (though the latter is probably very bad form since it's an internal).
Answers from HYRY and Marius in comments!
One can check either by:
testing equivalence of the values.base attribute rather than the values attribute, as in:
df.values.base is df2.values.base instead of df.values is df2.values.
or using the (admittedly internal) _is_view attribute (df2._is_view is True).
Thanks everyone!
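Putting the comment answers together into a runnable snippet (this relies on the pre-copy-on-write behaviour of pandas, as in the question; newer versions with copy-on-write may differ):
import pandas as pd

df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index=['row1','row2'],
                  columns=['a','b','c','d'])
df2 = df.iloc[0:2, :]

print(df.values.base is df2.values.base)  # True: views of the same data
print(df2._is_view)                       # True (internal attribute, use with care)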
I've elaborated on this example with pandas 1.0.1. There's not only a boolean _is_view attribute, but also _is_copy which can be None or a reference to the original DataFrame:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index=['row1','row2'],
                  columns=['a','b','c','d'])
df2 = df.iloc[0:2, :]
df3 = df.loc[df['a'] == 1, :]
# df is neither copy nor view
df._is_view, df._is_copy
Out[1]: (False, None)
# df2 is a view AND a copy
df2._is_view, df2._is_copy
Out[2]: (True, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
# df3 is not a view, but a copy
df3._is_view, df3._is_copy
Out[3]: (False, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
So checking these two attributes should tell you not only if you're dealing with a view or not, but also if you have a copy or an "original" DataFrame.
See also this thread for a discussion explaining why you can't always predict whether your code will return a view or not.
You might trace the memory your pandas/python environment is consuming, and, on the assumption that a copy will utilise more memory than a view, be able to decide one way or another.
I believe there are libraries out there that will present the memory usage within the python environment itself - e.g. Heapy/Guppy.
There ought to be a metric you can apply that takes a baseline picture of memory usage prior to creating the object under inspection, then another picture afterwards. Comparison of the two memory maps (assuming nothing else has been created and we can isolate the change due to the new object) should provide an idea of whether a view or a copy has been produced.
We'd need to get an idea of the different memory profiles of each type of implementation, but some experimentation should yield results.
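As a rough sketch of that idea using the standard library's tracemalloc (numpy registers its allocations with tracemalloc, but pandas block internals can blur the numbers, so treat this as an experiment rather than a reliable test):
import tracemalloc
import pandas as pd

df = pd.DataFrame({'a': range(1_000_000)})

tracemalloc.start()
before = tracemalloc.take_snapshot()

result = df.iloc[0:500_000]   # slicing: suspected view

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Sum the allocation deltas between the two snapshots.
growth = sum(stat.size_diff for stat in after.compare_to(before, 'lineno'))
print(f"~{growth} bytes allocated")  # small for a view, megabytes for a copy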
