Someone was reviewing my code and told me to use the .copy() function when copying a pandas dataframe into another variable. My code was like this:
data1 = pd.DataFrame()
data1[['Country/Region','Province/State','Lat','Long']]=confirmed[['Country/Region','Province/State','Lat','Long']]
I copied the dataframe into another dataframe like this:
data2 = data1, and it was working all right. By "all right" I mean that data1 is not changed when I work on data2, so I am guessing it is OK. I am using Jupyter Notebook.
Should I stick to using it in the future or use .copy()?
toking, normally when you write data2 = data1, no data is copied: data2 becomes a second name for the same object as data1, so in-place changes made through data2 are really made on data1. In your case, however, data1 was created as a new, empty DataFrame, so data1[['Country/Region','Province/State','Lat','Long']]=confirmed[['Country/Region','Province/State','Lat','Long']] copies the column contents into it, and data1 is independent of confirmed. data2 = data1 still gives you two names for one DataFrame; you only see data1 stay the same as long as the operations you run on data2 return new objects instead of modifying it in place. It is generally good practice to use .copy(deep=True) if you want to avoid confusion. Cheers
It's because the = is used to assign one variable to another. If you use just = you may have some unexpected values, like the simple example below:
a = [1,2,3]
b = a
b.append([5,6,7])
print(a)
Output
[1, 2, 3, [5, 6, 7]]
If you are appending values to b, why does a change as well? Because b = a does not create a new list: both names refer to the same list, so any in-place change made through one name is visible through the other.
That's why you should use .copy() instead of a plain = when you need an independent object. The same logic applies to DataFrames and to any other mutable object in Python.
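Here is a minimal sketch of the same effect with a DataFrame. The column values are made up for illustration, and alias and independent are just names I picked:

```python
import pandas as pd

# Made-up data standing in for the asker's Lat/Long columns.
data1 = pd.DataFrame({"Lat": [10.0, 20.0], "Long": [30.0, 40.0]})

alias = data1               # plain assignment: a second name for the same object
independent = data1.copy()  # a separate DataFrame with its own data

alias.loc[0, "Lat"] = -1.0         # in-place change through the alias
independent.loc[1, "Long"] = 99.0  # in-place change on the copy

print(alias is data1)        # the two names refer to one object
print(data1.loc[0, "Lat"])   # changed through the alias
print(data1.loc[1, "Long"])  # untouched by the change to the copy
```

Operations that return a new object (most pandas methods without inplace=True) will not show this effect, which is probably why the original code seemed to work.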
Related
I'm new to python and I'm using pandas to perform some basic manipulation before running a model.
Here is my problem: I have got dataset A and I want to create another dataset (B) equal to A, except for one variable X.
In order to reach this outcome I'm doing this:
A = B
columns_drop= ['X']
B.drop(columns_drop,axis=1)
But the result came out to be that both A and B misses the variable X.
Use:
A = B.drop(columns_drop, axis=1)
This returns a new DataFrame without X and leaves B untouched.
Your approach failed because both A and B point to the same object. The proper way to make an independent copy is A = B.copy(), although mutable objects such as lists that are stored in the dataframe won't be deep copied!
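A small sketch of both options, using a made-up two-column frame (C is an extra name I added to show the copy-then-drop variant):

```python
import pandas as pd

# Made-up dataset with the column X that should be removed.
B = pd.DataFrame({"X": [1, 2], "Y": [3, 4]})

# Option 1: drop returns a new DataFrame, B keeps all its columns.
A = B.drop(columns=["X"])

# Option 2: take an explicit copy first, then modify the copy in place.
C = B.copy()
C.drop(columns=["X"], inplace=True)

print(list(B.columns))
print(list(A.columns))
print(list(C.columns))
```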
I have dataframes that follow name syntax of 'df#' and I would like to be able to loop through these dataframes in a function. In the code below, if function "testing" is removed, the loop works as expected. When I add the function, it gets stuck on the "test" variable with keyerror = "iris1".
import statistics
import seaborn as sns

iris1 = sns.load_dataset('iris')
iris2 = sns.load_dataset('iris')

def testing():
    rows = []
    for i in range(2):
        test = vars()['iris' + str(i + 1)]
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])

testing()
The reason this will be valuable is because I am subsetting my dataframe df multiple times to create quick visualizations. So in Jupyter, I have one cell where I create visualizations off of df1,df2,df3. In the next cell, I overwrite df1,df2,df3 based on different subsetting rules. This is advantageous because I can quickly do this by calling a function each time, so the code stays quite uniform.
Store the datasets in a dictionary and pass that to the function.
import statistics
import seaborn as sns

datasets = {'iris1': sns.load_dataset('iris'), 'iris2': sns.load_dataset('iris')}

def testing(data):
    rows = []
    for i in range(1, 3):
        test = data[f'iris{i}']
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
    return rows  # hand the collected means back to the caller

testing(datasets)
No...
You should NEVER have to write a sentence like "I have dataframes that follow name syntax of 'df#'".
What you really have is a list of dataframes, or a dict of dataframes, depending on how you want to index them...
Here I would say a list.
Then you can forget about vars(), trust me, you don't need it... :)
EDIT :
And use list comprehensions; your code could fit in three lines (plus imports):
import statistics
import seaborn as sns

list_iris = [sns.load_dataset('iris'), sns.load_dataset('iris')]
rows = [
    (statistics.mean(test['sepal_length']), statistics.mean(test['sepal_width']))
    for test in list_iris
]
Storing as a list or dictionary allowed me to create the function. There is still the problem that the number of dataframes in the list varies. It would be nice to be able to just pass an argument n specifying how many objects are in the list (I guess I could just add a bunch of if statements to define the list based on such an argument). EDIT: I changed my code so that I don't use the df# syntax and instead put the dataframes directly into a list.
The problem I was experiencing is still perplexing. I can't for the life of me figure out why the "test" variable performs as expected outside of a function, but inside of a function it fails. I'm going to go the route of creating a list of dataframes, but am still curious to understand why it fails inside of the function.
I agree with @Icarwiz that it might not be the best way to go about it, but you can make it work with:
test=eval('iris'+str(i+1))
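As for why the lookup fails inside the function: vars() called with no argument behaves like locals(), so inside testing() it only sees that function's local names, and iris1 is not one of them. Outside a function, the local and global namespaces are the same, which is why the loop worked there. A rough sketch follows; the plain dicts stand in for the dataframes, and the function names are made up for the illustration:

```python
# Stand-ins for the iris dataframes; plain dicts keep the sketch self-contained.
iris1 = {"sepal_length": [5.1, 4.9]}
iris2 = {"sepal_length": [6.3, 5.8]}

def names_seen_by_vars():
    inside = "only a local"
    # With no argument, vars() is the same as locals(): just this function's names.
    return sorted(vars())

def lookup_in_globals(i):
    # globals() is the module namespace, which is where iris1 and iris2 live
    # when they are defined at the top level of a script or notebook cell.
    return globals().get(f"iris{i}")

print(names_seen_by_vars())
print(lookup_in_globals(1))
```

So globals()['iris' + str(i+1)] would also work in the original code, but a list or dict of dataframes is still the cleaner design.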
I was wondering what the most pythonic way to return a pandas dataframe from a class would be. I am curious whether I need to include .copy() when returning, and generally would like to understand the pitfalls of not including it. I am using a helper function because the dataframe is called multiple times, and I don't want to return it from the manipulate_dataframe method.
My questions are:
Do I need to place .copy() after df when assigning it to a new object?
Do I need to place .copy() after self.final_df when returning it with the get_df helper function?
class df:
    def __init__(self):
        pass

    def manipulate_dataframe(self, df):
        """ DATAFRAME MANIPULATION """
        self.final_df = df

    def get_df(self):
        return self.final_df
Question: do you want the thing referred to as self.final_df to directly alter the original data frame, or do you want it to live its own life? Recall that a Python variable is only a reference to an object, in the sense that
y = x
creates two names for the same object, and if you alter y in place, x is altered too. Example:
>>> x = [3, 4, 6]
>>> y = x
>>> y[0] = 77
>>> x
[77, 4, 6]
The pandas df is just another example of the same Python fact.
Making a copy can be time-consuming. It will also disconnect your self.final_df from the input data frame. People often do that because pandas issues warnings when users try to assign values into a view of a data frame. The correct place to start reading on this is the pandas documentation, https://pandas.pydata.org/docs/user_guide/indexing.html?highlight=copy%20versus%20view. There also seem to be 100 blog posts about it; I prefer the RealPython one, https://realpython.com/pandas-settingwithcopywarning, but you will find others (for example https://www.dataquest.io/blog/settingwithcopywarning). A common way careless users address that problem is to copy the data frame, which effectively "disconnects" the new copy from the old data frame.
I don't know why you might want to create a class that simply holds a pandas df. Probably you could just create a function that does all of those data manipulations. If those commands are supposed to alter original df, don't make a copy.
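If you do want the class to live its own life, one possible sketch is below. FrameHolder, the column names, and the "total" column are all made up for the illustration; whether you copy on the way in, on the way out, or both depends on how much isolation you need and how large the frame is.

```python
import pandas as pd

class FrameHolder:  # hypothetical name, standing in for the asker's class
    def manipulate_dataframe(self, df):
        # Copy on the way in so later edits never touch the caller's frame.
        self.final_df = df.copy()
        self.final_df["total"] = self.final_df["a"] + self.final_df["b"]

    def get_df(self):
        # Copy on the way out so callers cannot modify the stored frame.
        return self.final_df.copy()

raw = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
holder = FrameHolder()
holder.manipulate_dataframe(raw)

out = holder.get_df()
out.loc[0, "a"] = 100  # only changes the returned copy

print(list(raw.columns))
print(holder.final_df.loc[0, "a"])
```

Dropping either .copy() is fine when sharing is what you want; it just means in-place edits on one side become visible on the other.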
Let data be a giant pandas dataframe. It has many functions. The functions do not modify in place but return a new dataframe. How then am I supposed to perform multiple operations, to maximize performance?
For example, say I want to do
data = data.method1().method2().method3()
where method1 could be set_index, and so on.
Is this the way you are supposed to do it? My worry is that pandas creates a copy every time I call a method, so that 3 copies of my data are made in the above, when in reality all I want is to modify the original one.
So is it faster to say
data = data.method1(inplace=True)
data = data.method2(inplace=True)
data = data.method3(inplace=True)
This is just way too verbose for me?
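Yes, you can do that, i.e. apply the methods one after the other. Each call does return a new DataFrame, but since you overwrite data, the intermediate results are discarded once nothing refers to them, so you are not keeping 3 copies around, only one.
data = data.method1().method2().method3()
However, regarding your second example, the general rule is to either write over the existing dataframe or do it "inplace", but not both at the same time. A call with inplace=True returns None, so assigning its result back would replace data with None.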
data = data.method1()
or
data.method1(inplace=True)
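A small sketch of the chained style with real methods. The frame and the choice of set_index, sort_index and rename are only examples; original is an extra name I keep around to show that the starting frame is not modified:

```python
import pandas as pd

# Made-up frame; "id" and "value" are illustrative column names.
data = pd.DataFrame({"id": [3, 1, 2], "value": [30, 10, 20]})
original = data  # extra reference, only to inspect the starting frame afterwards

# Each method returns a new DataFrame; the name data is rebound once, at the end.
data = (
    data.set_index("id")
        .sort_index()
        .rename(columns={"value": "amount"})
)

print(list(data.index))
print(list(data["amount"]))
print(list(original.columns))
```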
I'm creating a small Pandas dataframe:
df = pd.DataFrame(data={'colA': [["a", "b", "c"]]})
I take a deepcopy of that df. I'm not using the Pandas method but general Python, right?
import copy
df_copy = copy.deepcopy(df)
A df_copy.head() gives the following:
Then I put these values into a dictionary:
mydict = df_copy.to_dict()
That dictionary looks like this:
Finally, I remove one item of the list:
mydict['colA'][0].remove("b")
I'm surprised that the values in df_copy are updated. I'm very confused that the values in the original dataframe are updated too! Both dataframes look like this now:
I understand Pandas doesn't really do deepcopy, but this wasn't a Pandas method. My questions are:
1) how can I build a dictionary from a dataframe that doesn't update the dataframe?
2) how can I take a copy of a dataframe which would be completely independent?
thanks for your help!
Cheers,
Nicolas
Disclaimer
Notice that putting mutable objects inside a DataFrame can be an antipattern so make sure that you really need it and you understand what you are doing.
Why your copy isn't independent
When copy.deepcopy is applied to an object, it looks up a __deepcopy__ method on that object and, if there is one, calls it. That hook exists so that objects can avoid copying too much. In the case of a DataFrame instance in version 0.20.0 and above, __deepcopy__ doesn't work recursively.
Similarly, if you use DataFrame.copy(deep=True), the data will be copied, but not recursively: Python objects stored in the frame are still shared between the copies.
How to solve the problem
To take a truly deep copy of a DataFrame containing lists (or other Python objects), so that it is fully independent, you can use one of the methods below.
df_copy = pd.DataFrame(columns = df.columns, data = copy.deepcopy(df.values))
For a dictionary, you may use same trick:
mydict = pd.DataFrame(columns = df.columns, data = copy.deepcopy(df_copy.values)).to_dict()
mydict['colA'][0].remove("b")
There's also a standard hacky way of deep-copying python objects:
import pickle
df_copy = pickle.loads(pickle.dumps(df))
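To check the two approaches above against the example from the question, here is a quick sketch. The names via_pickle and rebuilt are mine; the is-check on DataFrame.copy is only printed, since I would not rely on its result across pandas versions:

```python
import copy
import pickle

import pandas as pd

df = pd.DataFrame(data={"colA": [["a", "b", "c"]]})

# DataFrame.copy(deep=True) copies the data but not the Python objects inside it.
plain_copy = df.copy(deep=True)
print(plain_copy["colA"].iloc[0] is df["colA"].iloc[0])

# Pickle round trip: every nested object is rebuilt from scratch.
via_pickle = pickle.loads(pickle.dumps(df))
via_pickle["colA"].iloc[0].remove("b")

# Rebuild from a deep copy of the underlying values.
rebuilt = pd.DataFrame(columns=df.columns, data=copy.deepcopy(df.values))
rebuilt["colA"].iloc[0].remove("c")

print(df["colA"].iloc[0])
print(via_pickle["colA"].iloc[0])
print(rebuilt["colA"].iloc[0])
```

Note that the rebuild-from-values variant keeps the column labels but not a custom index, so pass index=df.index as well if that matters.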
Feel free to ask for any clarifications, if needed.