Copy Warning Pandas - python

I'm aware of the view vs. copy issue and the reason for the warning that Pandas displays. This has made me more careful to use,loc, .ix, etc rather than chained indexing.
However I'm generating warnings when I'm convinced I shouldn't be. I have a dataframe "df" containing my data and I have a function:
def my_func(df):
df['new_channel'] = df.channel.diff()
return df
If I run this function I get no warnings. However if I then define a new dataframe from my original one like this:
df2 = df.ix[df.channel==val,:]
Calling the function:
my_func(df2)
Then generates copy warnings. But my understanding is that I'm not using any chained indexing in my function and I haven't used chained indexing to create the second dataframe.
Is this a false positive - i.e. I can just turn the warning off and carry on. Or have I missed something more fundamental that could bite me in the future?
Ben

Related

How to avoid Pandas throwing a SettingCopyWarning when transforming a subset of a series?

mask = ~df.bar.isna()
df.bar.loc[mask] = df.bar.loc[mask].map(f)
This sets off a SettingCopyWarning, though I am using loc.
I am aware of df.mask, but this will not work either as the column contains missing values that throw errors when the mapping function is applied to it .
You get SettingCopyWarning because Pandas cannot be sure if you want to manipulate dataframe via a reference, or if you want to manipulate a copy of the dataframe. So try to add .copy() at the end of the second line and see if the warning goes away. Sometimes the cause of the warning is actually in the code somewhere a few lines before where you get the it.

How do I use the Pandas .loc() function when converting a column to DateTime?

I'm working on a small project to parse and create graphs based on DNC Primary data. It's all functional so far but I'm working on getting rid of the following SettingWithCopyWarning:
"SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
candidate['end_date'] = pd.to_datetime(candidate_state['end_date'])"
I've tried changing the referenced line to:
candidate.loc['end_date'] = pd.to_datetime(candidate_state['end_date'])
But that throws another error about some kind of comparison. Can anyone help me figure this out?
Thank you!
It is hard to confirm without seeing the full code, but this warning is usually a result of the dataframe you are working on (candidate in this case) being a "copy" of a filtered selection of a larger dataframe. In other words, when you created candidate you did something like:
candidate = df_larger_dataset[df_larger_dataset['some_column'] == 'some_value']
The reason you get the warning is because when you do this, you don't actually create a new object, just a reference, which means when you start making changes to candidate, you are also modifying df_larger_dataset. This may or may not matter in your context, but to avoid the warning, when you create 'candidate' make it an explicit copy of df_larger_dataset:
candidate = df_larger_dataset[df_larger_dataset['some_column'] == some_value].copy()

Pandas -- String of boolean conditions -- SettingWithCopyWarning

I am wondering if anyone can assist me with this warning I get in my code. The code DOES score items correctly, but this warning is bugging me and I can't seem to find a good fix, given that I need to string a few boolean conditions together.
Background: Imagine that I have a magical fruit identifier and I have a csv file that lists what fruit was identified and in which area (1, 2, etc.). I read in the csv file with columns of "FruitID" and "Area." An identification of "APPLE" or "apple" in Zone 1 is scored as correct/true (other identified fruits are incorrect/false). I apply similar logic for other areas, but I won't get into that.
Any ideas for how to correct this? Should I use .loc, although I'm not sure that this will work with multiple booleans. Thanks!
My code snippet that initiates the CopyWarning:
Area1_ID_df['Area 1, Score']=(Area1_ID_df['FruitID']=='APPLE')|(Area1_ID_df['FruitID']=='apple')
Stacktrace:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Pandas finds it ambiguous what you are trying to do. Certain operations return a view of the dataset, whereas other operations make a copy of the dataset. The confusion is whether you want to modify a copy of the dataset or whether you want to modify the original dataset or are trying to create something new.
https://www.dataquest.io/blog/settingwithcopywarning/ is a great link to learn more about the problem you are having.
If the line that's causing this error is truly: s = t | u, where t and u are Boolean series indexed consistently, you should not worry about SettingWithCopyWarning.
This is a warning rather than an error. The latter indicates there is a problem, the former indicates there may be a problem. In this case, Pandas guesses you may be working with a copy rather than a view.
If the result is as you expect, you can safely ignore the warning.

Checking whether data frame is copy or view in Pandas

Is there an easy way to check whether two data frames are different copies or views of the same underlying data that doesn't involve manipulations? I'm trying to get a grip on when each is generated, and given how idiosyncratic the rules seem to be, I'd like an easy way to test.
For example, I thought "id(df.values)" would be stable across views, but they don't seem to be:
# Make two data frames that are views of same data.
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
columns = ['a','b','c','d'])
df2 = df.iloc[0:2,:]
# Demonstrate they are views:
df.iloc[0,0] = 99
df2.iloc[0,0]
Out[70]: 99
# Now try and compare the id on values attribute
# Different despite being views!
id(df.values)
Out[71]: 4753564496
id(df2.values)
Out[72]: 4753603728
# And we can of course compare df and df2
df is df2
Out[73]: False
Other answers I've looked up that try to give rules, but don't seem consistent, and also don't answer this question of how to test:
What rules does Pandas use to generate a view vs a copy?
Pandas: Subindexing dataframes: Copies vs views
Understanding pandas dataframe indexing
Re-assignment in Pandas: Copy or view?
And of course:
- http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
UPDATE: Comments below seem to answer the question -- looking at the df.values.base attribute rather than df.values attribute does it, as does a reference to the df._is_copy attribute (though the latter is probably very bad form since it's an internal).
Answers from HYRY and Marius in comments!
One can check either by:
testing equivalence of the values.base attribute rather than the values attribute, as in:
df.values.base is df2.values.base instead of df.values is df2.values.
or using the (admittedly internal) _is_view attribute (df2._is_view is True).
Thanks everyone!
I've elaborated on this example with pandas 1.0.1. There's not only a boolean _is_view attribute, but also _is_copy which can be None or a reference to the original DataFrame:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
columns = ['a','b','c','d'])
df2 = df.iloc[0:2, :]
df3 = df.loc[df['a'] == 1, :]
# df is neither copy nor view
df._is_view, df._is_copy
Out[1]: (False, None)
# df2 is a view AND a copy
df2._is_view, df2._is_copy
Out[2]: (True, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
# df3 is not a view, but a copy
df3._is_view, df3._is_copy
Out[3]: (False, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)
So checking these two attributes should tell you not only if you're dealing with a view or not, but also if you have a copy or an "original" DataFrame.
See also this thread for a discussion explaining why you can't always predict whether your code will return a view or not.
You might trace the memory your pandas/python environment is consuming, and, on the assumption that a copy will utilise more memory than a view, be able to decide one way or another.
I believe there are libraries out there that will present the memory usage within the python environment itself - e.g. Heapy/Guppy.
There ought to be a metric you can apply that takes a baseline picture of memory usage prior to creating the object under inspection, then another picture afterwards. Comparison of the two memory maps (assuming nothing else has been created and we can isolate the change is due to the new object) should provide an idea of whether a view or copy has been produced.
We'd need to get an idea of the different memory profiles of each type of implementation, but some experimentation should yield results.

Cannot change nan values in a dataframe [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
I have been reading this link on "Returning a view versus a copy". I do not really get how the chained assignment concept in Pandas works and how the usage of .ix(), .iloc(), or .loc() affects it.
I get the SettingWithCopyWarning warnings for the following lines of code, where data is a Panda dataframe and amount is a column (Series) name in that dataframe:
data['amount'] = data['amount'].astype(float)
data["amount"].fillna(data.groupby("num")["amount"].transform("mean"), inplace=True)
data["amount"].fillna(mean_avg, inplace=True)
Looking at this code, is it obvious that I am doing something suboptimal? If so, can you let me know the replacement code lines?
I am aware of the below warning and like to think that the warnings in my case are false positives:
The chained assignment warnings / exceptions are aiming to inform the
user of a possibly invalid assignment. There may be false positives;
situations where a chained assignment is inadvertantly reported.
EDIT : the code leading to the first copy warning error.
data['amount'] = data.apply(lambda row: function1(row,date,qty), axis=1)
data['amount'] = data['amount'].astype(float)
def function1(row,date,qty):
try:
if(row['currency'] == 'A'):
result = row[qty]
else:
rate = lookup[lookup['Date']==row[date]][row['currency'] ]
result = float(rate) * float(row[qty])
return result
except ValueError: # generic exception clause
print "The current row causes an exception:"
The point of the SettingWithCopy is to warn the user that you may be doing something that will not update the original data frame as one might expect.
Here, data is a dataframe, possibly of a single dtype (or not). You are then taking a reference to this data['amount'] which is a Series, and updating it. This probably works in your case because you are returning the same dtype of data as existed.
However it could create a copy which updates a copy of data['amount'] which you would not see; Then you would be wondering why it is not updating.
Pandas returns a copy of an object in almost all method calls. The inplace operations are a convience operation which work, but in general are not clear that data is being modified and could potentially work on copies.
Much more clear to do this:
data['amount'] = data["amount"].fillna(data.groupby("num")["amount"].transform("mean"))
data["amount"] = data['amount'].fillna(mean_avg)
One further plus to working on copies. You can chain operations, this is not possible with inplace ones.
e.g.
data['amount'] = data['amount'].fillna(mean_avg)*2
And just an FYI. inplace operations are neither faster nor more memory efficient. my2c they should be banned. But too late on that API.
You can of course turn this off:
pd.set_option('chained_assignment',None)
Pandas runs with the entire test suite with this set to raise (so we know if chaining is happening) on, FYI.

Categories