Issue using drop() in pandas - python

I'm new to python and I'm using pandas to perform some basic manipulation before running a model.
Here is my problem: I have a dataset A and I want to create another dataset B equal to A, except for one variable X.
In order to reach this outcome I'm doing this:
A = B
columns_drop = ['X']
B.drop(columns_drop, axis=1)
But the result was that both A and B are missing the variable X.

Use:
A = B.drop(columns_drop, axis=1)
This will make a copy of the data.
Your approach failed because both A and B point to the same object. The proper way to make an independent copy is A = B.copy(); note that mutable Python objects stored inside the DataFrame, such as lists, will not be deep-copied.
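A minimal sketch of the difference (column name X as in the question; the small frame is just for illustration):

import pandas as pd

B = pd.DataFrame({'X': [1, 2], 'Y': [3, 4]})

# Plain assignment only creates a second name for the same object.
A = B
print(A is B)              # True

# drop() without inplace=True returns a new DataFrame; B keeps column X.
A = B.drop(['X'], axis=1)
print(list(A.columns))     # ['Y']
print(list(B.columns))     # ['X', 'Y']

# An explicit, independent copy:
A = B.copy()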

Related

Pythonic way to return pandas dataframe with helper function and copy

I was wondering what the most Pythonic way to return a pandas dataframe from a class would be. I am curious whether I need to include .copy() when returning, and generally I would like to understand the pitfalls of not including it. I am using a helper function because the dataframe is called multiple times, and I don't want to return it from the manipulate_dataframe method.
My questions are:
Do I need to place .copy() after df when assigning it to a new object?
Do I need to place .copy() after self.final_df when returning it with the get_df helper function?
class df:
    def __init__(self):
        pass

    def manipulate_dataframe(self, df):
        """ DATAFRAME MANIPULATION """
        self.final_df = df

    def get_df(self):
        return self.final_df
Question: do you want the thing referred to as self.final_df to directly alter the original data frame, or do you want it to live its own life? Recall Python treats many variables as "views", in the sense that
y = x
creates two links to the same information, and if you alter y, it also alters x. Example:
>>> x = [3, 4, 6]
>>> y = x
>>> y[0] = 77
>>> x
[77, 4, 6]
A pandas DataFrame is just another example of the same Python behavior.
Making a copy can be time consuming. It will also disconnect your self.final_df from the input data frame. People often do that because pandas issues warnings about assigning values into a view of a data frame. The right place to start reading on this is the pandas documentation: https://pandas.pydata.org/docs/user_guide/indexing.html?highlight=copy%20versus%20view. There seem to be a hundred blog posts about it as well; I prefer the RealPython article https://realpython.com/pandas-settingwithcopywarning, but you will find others, e.g. https://www.dataquest.io/blog/settingwithcopywarning. A common move careless users make to address that warning is to copy the data frame, which effectively "disconnects" the new copy from the old one.
I don't know why you would want a class that simply holds a pandas DataFrame; you could probably just write a function that does all of those data manipulations. If those commands are supposed to alter the original df, don't make a copy.
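For illustration, a small hypothetical variant of the class from the question showing what copying on construction buys you (DfHolder and the copy flag are made-up names, not the asker's):

import pandas as pd

class DfHolder:
    def __init__(self, df, copy=False):
        # copy=True disconnects self.final_df from the caller's frame;
        # copy=False keeps a reference, so edits flow both ways.
        self.final_df = df.copy() if copy else df

    def get_df(self):
        return self.final_df

raw = pd.DataFrame({'a': [1, 2]})
holder = DfHolder(raw, copy=True)
holder.get_df().loc[0, 'a'] = 99
print(raw.loc[0, 'a'])   # still 1, because a copy was stored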

How am I supposed to use pandas functions?

Let data be a giant pandas dataframe. It has many functions. The functions do not modify in place but return a new dataframe. How then am I supposed to perform multiple operations, to maximize performance?
For example, say I want to do
data = data.method1().method2().method()
where method1 could be set_index, and so on.
Is this the way you are supposed to do it? My worry is that pandas creates a copy every time I call a method, so that three copies of my data are made in the above, when in reality all I want is to modify the original one.
So is it faster to say
data = data.method1(inplace=True)
data = data.method2(inplace=True)
data = data.method3(inplace=True)
But this is just way too verbose for me.
Yes, you can do that, i.e. apply the methods one after the other. Intermediate results are created along the way, but since you overwrite data each time, only the final result is kept and the intermediates are garbage-collected:
data = data.method1().method2().method()
However, regarding your second example, the general rule is to either assign the result back over the existing dataframe or do it "inplace", but not both at the same time (a method called with inplace=True returns None, so assigning its result would throw away your data):
data = data.method1()
or
data.method1(inplace=True)
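For example, with arbitrary methods chosen just for illustration (set_index, sort_index, rename), the two styles look like this:

import pandas as pd

df = pd.DataFrame({'key': ['b', 'a'], 'val': [2, 1]})

# Chained style: each call returns a new frame, but only the final
# result stays bound to a name; the intermediates are garbage-collected.
data = df.set_index('key').sort_index().rename(columns=str.upper)

# In-place style: these calls return None, so don't assign the result.
data2 = df.copy()
data2.set_index('key', inplace=True)
data2.sort_index(inplace=True)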

How to properly copy a pandas dataframe into another variable?

Someone was reviewing my code and told me to use the .copy() function when copying a pandas dataframe into another variable. My code was like this:
data1 = pd.DataFrame()
data1[['Country/Region','Province/State','Lat','Long']]=confirmed[['Country/Region','Province/State','Lat','Long']]
I copied the dataframe into another dataframe like this:
data2 = data1 and it was working alright. By alright I mean data1 is not being changed when I work on data2. So I am guessing it is OK. I am using a Jupyter Notebook.
Should I stick to using it in the future or use .copy()?
toking, normally if you write data2 = data1 you pass to data2 a reference (a pointer) to data1, so when you make changes to data2 you are really making them to data1. In your case, however, you had already allocated memory for your dataframe, so the line data1[['Country/Region','Province/State','Lat','Long']] = confirmed[['Country/Region','Province/State','Lat','Long']] copies the content into that allocated memory. But it is generally good practice to use .copy(deep=True) if you want to avoid confusion. Cheers
It's because the = sign assigns one name to another object. If you use just =, you may get unexpected behavior, as in the simple example below:
a = [1,2,3]
b = a
b.append([5,6,7])
print(a)
Output
[1, 2, 3, [5, 6, 7]]
If you are appending values to b, why is it doing the same to a? It's because you assigned them with the equals sign =, so both names refer to the same list; anything you do to a will also be done to b and vice versa.
That's why you should use .copy() instead of =. The same logic applies to DataFrames and to any other mutable object in Python.
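The same thing with a DataFrame (a minimal sketch):

import pandas as pd

df_a = pd.DataFrame({'x': [1, 2, 3]})

df_b = df_a              # just a second name for the same object
df_b.loc[0, 'x'] = 99
print(df_a.loc[0, 'x'])  # 99 -- df_a changed as well

df_c = df_a.copy()       # an independent copy
df_c.loc[1, 'x'] = -1
print(df_a.loc[1, 'x'])  # still 2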

Dataframe Generator Based On Conditions Pandas

I have manually created a bunch of dataframes to later concatenate back together, based on a list of bigrams I have (my reason for doing this is out of the scope of this question). The problem is, I want to set this code to run daily or weekly, and the manually created dataframes will no longer work once the data has changed on refresh. For instance, looking at the code below, what if "data_science" is no longer a bigram being pulled by my code next week, and I have another bigram like "hello_world" that is not listed below? I need one function that will do all of this for me. I am making about 50 dataframes from my real data, so even aside from automation, it would be a huge time saver to get a function going for this. One KEY point is that I am grabbing all of these bigrams from a list and naming a dataframe for each of them; that is what the list input in my attempt below is for.
data_science = df[df['column_name'].str.contains("data") &
                  df['column_name'].str.contains("science")]
data_science['bigram'] = "(data_science)"

p_value = df[df['column_name'].str.contains("p") &
             df['column_name'].str.contains("value")]
p_value['bigram'] = "(p_value)"

ab_testing = df[df['column_name'].str.contains("ab") &
                df['column_name'].str.contains("testing")]
ab_testing['bigram'] = "(ab_testing)"
I am trying something like this code below but have not figured out how to make it work yet.
def df_creator(a, b, my_list):
    for a, b in my_list:
        a_b = df[df['Message_stop'].str.contains(a) &
                 df['Message_stop'].str.contains(b)]
        a_b['bigram'] = "(a_b)"
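A possible sketch of one way to make that loop work, assuming the bigrams arrive as a list of (word1, word2) tuples and the column is 'Message_stop' as in the attempt; the names here are illustrative, not the asker's:

import pandas as pd

def df_creator(df, bigram_list):
    """Build one labelled sub-frame per (word1, word2) bigram and
    concatenate them; a dict keyed by bigram name would work too."""
    pieces = []
    for a, b in bigram_list:
        mask = (df['Message_stop'].str.contains(a) &
                df['Message_stop'].str.contains(b))
        sub = df[mask].copy()        # .copy() avoids SettingWithCopyWarning
        sub['bigram'] = f"({a}_{b})"
        pieces.append(sub)
    return pd.concat(pieces, ignore_index=True)

# e.g. result = df_creator(df, [('data', 'science'), ('p', 'value'), ('ab', 'testing')])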

Checking whether data frame is copy or view in Pandas

Is there an easy way to check whether two data frames are different copies or views of the same underlying data that doesn't involve manipulations? I'm trying to get a grip on when each is generated, and given how idiosyncratic the rules seem to be, I'd like an easy way to test.
For example, I thought "id(df.values)" would be stable across views, but it doesn't seem to be:
# Make two data frames that are views of same data.
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
                  columns = ['a','b','c','d'])
df2 = df.iloc[0:2,:]
# Demonstrate they are views:
df.iloc[0,0] = 99
df2.iloc[0,0]
Out[70]: 99
# Now try and compare the id on values attribute
# Different despite being views!
id(df.values)
Out[71]: 4753564496
id(df2.values)
Out[72]: 4753603728
# And we can of course compare df and df2
df is df2
Out[73]: False
Other answers I've looked up try to give rules, but they don't seem consistent, and they also don't answer this question of how to test:
What rules does Pandas use to generate a view vs a copy?
Pandas: Subindexing dataframes: Copies vs views
Understanding pandas dataframe indexing
Re-assignment in Pandas: Copy or view?
And of course:
- http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
UPDATE: The comments below seem to answer the question: looking at the df.values.base attribute rather than the df.values attribute does it, as does a reference to the df._is_copy attribute (though the latter is probably bad form since it's internal).
Answers from HYRY and Marius in comments!
One can check either by:
- testing equivalence of the values.base attribute rather than the values attribute, i.e. df.values.base is df2.values.base instead of df.values is df2.values, or
- using the (admittedly internal) _is_view attribute (df2._is_view is True).
Thanks everyone!
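Put together, the check might look like this (exact behaviour can vary across pandas versions, since .values itself may return a copy for mixed-dtype frames):

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=['row1', 'row2'],
                  columns=['a', 'b', 'c', 'd'])
df2 = df.iloc[0:2, :]    # a view in this example
df3 = df.copy()          # an independent copy

print(df.values.base is df2.values.base)   # shared underlying data expected
print(df.values.base is df3.values.base)   # separate data expected
print(df2._is_view)                        # internal, non-public attribute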
I've elaborated on this example with pandas 1.0.1. There's not only a boolean _is_view attribute, but also _is_copy which can be None or a reference to the original DataFrame:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
                  columns = ['a','b','c','d'])
df2 = df.iloc[0:2, :]
df3 = df.loc[df['a'] == 1, :]
# df is neither copy nor view
df._is_view, df._is_copy
Out[1]: (False, None)
# df2 is a view AND a copy
df2._is_view, df2._is_copy
Out[2]: (True, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
# df3 is not a view, but a copy
df3._is_view, df3._is_copy
Out[3]: (False, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
So checking these two attributes should tell you not only if you're dealing with a view or not, but also if you have a copy or an "original" DataFrame.
See also this thread for a discussion explaining why you can't always predict whether your code will return a view or not.
You might trace the memory your pandas/Python environment is consuming and, on the assumption that a copy will use more memory than a view, decide one way or the other.
I believe there are libraries that will report memory usage from within the Python environment itself, e.g. Heapy/Guppy.
There ought to be a metric you can apply that takes a baseline picture of memory usage before creating the object under inspection and another picture afterwards. Comparing the two memory maps (assuming nothing else has been created, so the change can be isolated to the new object) should give an idea of whether a view or a copy has been produced.
We'd need to get an idea of the different memory profiles of each kind of result, but some experimentation should yield answers.
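A rough sketch of that idea using the standard library's tracemalloc rather than Heapy/Guppy; it only gives an indirect hint, on the assumption that a full copy allocates roughly the size of the frame while a view allocates very little:

import tracemalloc
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 4), columns=list('abcd'))

def allocated_by(make_obj):
    """Return net bytes allocated while make_obj() runs."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    obj = make_obj()                     # keep a reference until measured
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

print(allocated_by(lambda: df.iloc[0:2, :]))   # slice: typically tiny
print(allocated_by(lambda: df.copy()))         # full copy: roughly df's size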
