pandas dataframe copy of slice warning - python

I'm fairly new to pandas, and was getting the infamous SettingWithCopyWarning in a large piece of code. I boiled it down to the following:
import pandas as pd
df = pd.DataFrame([[0,3],[3,3],[3,1],[1,1]], columns=list('AB'))
df
df = df.loc[(df.A>1) & (df.B>1)]
df['B'] = 10
When I run this I get the warning:
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The strange thing is that if I leave off the "df" line it runs without a warning. Is this intended behavior?
In general, if I want to filter a DataFrame by the values across various columns, do I need to do a copy() to avoid the SettingWithCopyWarning?
thanks very much

Assuming your DataFrame is as below, taken from your question, this will avoid the SettingWithCopyWarning.
There is a GitHub discussion with a solution suggested by Jeff, one of the pandas developers :)
df
A B
1 3 3
It is best to do it this way:
df['B'] = df['B'].replace(3, 10)
df
A B
1 3 10
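On the more general question about filtering, here is a minimal sketch, assuming the goal is to change only the filtered rows and leave the original frame untouched. The name `filtered` is introduced here for illustration and does not appear in the question.

```python
import pandas as pd

df = pd.DataFrame([[0, 3], [3, 3], [3, 1], [1, 1]], columns=list("AB"))

# Take an explicit copy of the filtered rows, so the later assignment
# unambiguously targets a new, independent DataFrame.
filtered = df.loc[(df.A > 1) & (df.B > 1)].copy()
filtered["B"] = 10

print(filtered)  # only the row with A > 1 and B > 1 remains
print(df)        # the original frame keeps its values
```

If the intent is instead to change the original frame, the row condition and the column can go into one `.loc` call on `df` itself.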


Subtracting Pandas columns caveat

There are many similar questions to this one, but I couldn't find the one that answers my questions specifically.
Firstly, when I run something like this
df['new_col'] = df['col2'] - df['col1']
I get a warning saying "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead".
If I then try to run something like this
df.loc[:, 'new_col'] = df['col2'] - df['col1']
I get a "SettingWithCopyWarning" warning with the same message "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead".
Using the apply and lambda functions, as suggested by some answers in other posts, also raises a "SettingWithCopyWarning" and seems to be a slow operation.
df.loc[:, 'new_col'] = df.apply(lambda x: x['col2'] - x['col1'], axis=1)
I read the documentation pages, but I'm afraid I don't completely understand them, otherwise it would be clear to me what the correct format to make such calculation would be.
Right, so my question is: how do I subtract two columns of a Pandas dataframe to create a new column in the same dataframe in the correct way, so that Pandas is happy? Thank you!
Try adding df = df.copy():
df = df.copy()
df['new_col'] = df['col2'] - df['col1']
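A self-contained sketch of the same idea, assuming df was obtained by filtering a larger frame. The frame `raw` and the filter condition are hypothetical and only stand in for whatever produced df in the question.

```python
import pandas as pd

# Hypothetical parent frame; in the question, df came from an earlier slicing step.
raw = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40]})

# .copy() makes the subset independent of raw before a column is added.
df = raw[raw["col1"] > 1].copy()
df["new_col"] = df["col2"] - df["col1"]

print(df)
```

The subtraction itself is vectorized, so there should be no need for apply with a lambda here.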

Set value to an entire column of a pandas dataframe

I'm trying to set the entire column of a dataframe to a specific value.
In [1]: df
Out [1]:
issueid industry
0 001 xxx
1 002 xxx
2 003 xxx
3 004 xxx
4 005 xxx
From what I've seen, loc is the best practice when replacing values in a dataframe (or isn't it?):
In [2]: df.loc[:,'industry'] = 'yyy'
However, I still received this much talked-about warning message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If I do
In [3]: df['industry'] = 'yyy'
I got the same warning message.
Any ideas? Working with Python 3.5.2 and pandas 0.18.1.
EDIT Jan 2023:
Given the volume of visits on this question, it's worth stating that my original question was really more about dataframe copy-versus-slice than "setting value to an entire column".
On copy-versus-slice: My current understanding is that, in general, if you want to modify a subset of a dataframe after slicing, you should create the subset by .copy(). If you only want a view of the slice, no copy() needed.
On setting value to an entire column: simply do df[col_name] = col_value
You can use the assign function:
df = df.assign(industry='yyy')
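A small sketch of what assign does, using made-up rows shaped like the ones in the question. The name `updated` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"issueid": ["001", "002", "003"], "industry": ["xxx"] * 3})

# assign() builds and returns a new DataFrame; df itself is not modified.
updated = df.assign(industry="yyy")

print(updated)
print(df)
```

Because the result is a new object, rebinding it to the same name (df = df.assign(...)) sidesteps the question of whether df was a view or a copy.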
Python can do unexpected things when new objects are defined from existing ones. You stated in a comment above that your dataframe is defined along the lines of df = df_all.loc[df_all['issueid']==specific_id,:]. In this case, df is really just a stand-in for the rows stored in the df_all object: a new object is NOT created in memory.
To avoid these issues altogether, I often have to remind myself to use the copy module, which explicitly forces objects to be copied in memory so that methods called on the new objects are not applied to the source object. I had the same problem as you, and avoided it using the deepcopy function.
In your case, this should get rid of the warning message:
from copy import deepcopy
df = deepcopy(df_all.loc[df_all['issueid']==specific_id,:])
df['industry'] = 'yyy'
EDIT: Also see David M.'s excellent comment below!
df = df_all.loc[df_all['issueid']==specific_id,:].copy()
df['industry'] = 'yyy'
df.loc[:,'industry'] = 'yyy'
This does the trick: use '.loc' with ':' to select all rows. Hope it helps.
You can do:
df['industry'] = 'yyy'
Assuming your DataFrame is like 'data' below, you have to consider whether your data is a string or an integer, because the two are treated differently. So in this case you need to be specific about that.
import pandas as pd
data = [('001','xxx'), ('002','xxx'), ('003','xxx'), ('004','xxx'), ('005','xxx')]
df = pd.DataFrame(data,columns=['issueid', 'industry'])
print("Old DataFrame")
print(df)
df.loc[:,'industry'] = str('yyy')
print("New DataFrame")
print(df)
Now, if you want to put numbers instead of letters, you must create an array:
list_of_ones = [1,1,1,1,1]
df.loc[:,'industry'] = list_of_ones
print(df)
Or, if you are using NumPy:
import numpy as np
n = len(df)
df.loc[:,'industry'] = np.ones(n)
print(df)
This also lets you add conditions on the rows and then change all the cells of a specific column for those rows:
df.loc[(df['issueid'] == '001'), 'industry'] = str('yyy')
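As a runnable variant of the conditional update, here is a sketch with a condition that matches more than one row. The variable `mask` is a name chosen for this example.

```python
import pandas as pd

data = [("001", "xxx"), ("002", "xxx"), ("003", "xxx")]
df = pd.DataFrame(data, columns=["issueid", "industry"])

# Row condition and column label go into a single .loc call on df itself.
mask = df["issueid"].isin(["001", "003"])
df.loc[mask, "industry"] = "yyy"

print(df)
```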
Seems to me that:
df1 = df[df['col1']==some_value] does not give pandas a clear answer on whether df1 is a view or a copy, so it cannot tell whether changes in df1 are meant to reach the parent df. This leads to the warning.
Whereas df1 = df[df['col1']==some_value].copy() explicitly creates a new DataFrame, and changes in df1 will not be reflected in df. The copy method is recommended if you don't want to make changes to your original df.
I had a similar issue before even with this approach df.loc[:,'industry'] = 'yyy', but once I refreshed the notebook, it ran well.
You may want to try refreshing the cells after you have df.loc[:,'industry'] = 'yyy'.
Use this instead:
df.iloc[:]['industry'] = 'yyy'
Remember: this only works with columns that already exist in the DataFrame.
This is for people for whom .loc did not work.
For anyone else coming to this answer who doesn't want to use copy:
df['industry'] = df['industry'].apply(lambda x: '')
If you just create a new but empty DataFrame, you cannot directly assign a value to a whole column. This will show as NaN because pandas doesn't know how many rows the DataFrame will have! You need to either define the size or have some existing columns.
df = pd.DataFrame()
df["A"] = 1
df["B"] = 2
df["C"] = 3
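One way to "define the size" first, as a sketch: give the frame an index before assigning scalars. The names `empty` and `sized` and the choice of three rows are assumptions made for this example.

```python
import pandas as pd

# An empty frame has no rows, so a scalar has nothing to broadcast over.
empty = pd.DataFrame()
empty["A"] = 1
print(len(empty))

# With an index defined up front, the scalar fills one cell per row.
sized = pd.DataFrame(index=range(3))
sized["A"] = 1
sized["B"] = 2
print(sized)
```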

df.loc causes a SettingWithCopyWarning warning message

The following line of my code causes a warning:
import numpy as np
import pandas as pd
s = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
s.loc[-1] = [5,np.nan,np.nan,6]
grouped = s.groupby(['A'])
for key_m, group_m in grouped:
    group_m.loc[-1] = [10,np.nan,np.nan,10]
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
According to the documentation this is the recommended way of doing it, so what is happening?
Thanks for your help.
The documentation is slightly confusing.
Your dataframe is a copy of another dataframe. You can verify this by running bool(df.is_copy). You are getting the warning because you are trying to assign to this copy.
The warning/documentation is telling you how you should have constructed df in the first place, not how you should assign to it now that it is a copy.
df = some_other_df[cols]
will make df a copy of some_other_df. The warning suggests doing this instead
df = some_other_df.loc[:, [cols]]
Now that it is done, if you choose to ignore this warning, you could
df = df.copy()
or
df.is_copy = None
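Applied to the loop in the question, the copy approach might look like the sketch below. The small frame, the appended values, and the names `pieces` and `result` are made up for illustration.

```python
import pandas as pd

s = pd.DataFrame({"A": [1, 1, 2], "B": [10, 20, 30]})

pieces = []
for key, group in s.groupby("A"):
    group = group.copy()        # independent copy of this group's rows
    group.loc[-1] = [key, -1]   # add a row labelled -1 to the copy only
    pieces.append(group)

result = pd.concat(pieces)
print(result)
```

Each group gets its extra row, while s keeps its original three rows.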

Modifying a pandas dataframe that may be a view

I have a pandas DataFrame df that is returned from a function and I generally don't know whether it is an independent object or a view on another DataFrame. I want to add new columns to it but don't want to copy it unnecessarily.
df['new_column'] = 0
may give a nasty warning about modifying a copy
df = df.copy()
may be expensive if df is large.
What's the best way here?
You should use an indexer to create your s1, such as:
import pandas as pd
s = pd.DataFrame({'a':[1,2], 'b':[2,3]})
indexer = s[s.a > 1].index
s1 = s.loc[indexer, :]
s1['c'] = 0
This should remove the warning.

pandas DataFrame combine_first method converts boolean in floats

I'm running into a strange issue where the combine_first method causes values stored as bool to be upcast to float64.
Example:
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame({"a": [True]})
In [3]: df2 = pd.DataFrame({"b": ['test']})
In [4]: df2.combine_first(df1)
Out[4]:
a b
0 1.0 test
This problem was already reported in a post 3 years ago: pandas DataFrame combine_first and update methods have strange behavior. The issue was said to be solved, but I still see this behaviour under pandas 0.18.1.
thank you for your help
Somewhere along the chain of events that produces the combined dataframe, potential missing values have to be handled. I'm aware that nothing is missing in your example. None and np.nan are neither int nor bool, so to have a common dtype that can hold a bool and a None or np.nan, the column must be cast to either object or float. With float, a large number of operations become far more efficient, so it is a decent choice. It obviously isn't the best choice all of the time, but a choice has to be made nonetheless, and pandas tries to infer the best one.
A work around:
Setup
df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ['test']})
df3 = df2.combine_first(df1)
df3
Solution
dtypes = df1.dtypes.combine_first(df2.dtypes)
for k, v in dtypes.items():
    df3[k] = df3[k].astype(v)
df3
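A compact variant of this workaround, as a sketch: collect the original dtypes in a plain dict and cast once with astype. The name `original_dtypes` is introduced here, and the approach assumes the combined columns contain no missing values (casting NaN back to bool would silently give True).

```python
import pandas as pd

df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ["test"]})
df3 = df2.combine_first(df1)

# Collect the dtype each column had before combining (df1 wins on overlap).
original_dtypes = {**df2.dtypes.to_dict(), **df1.dtypes.to_dict()}

# Cast every column back in a single astype call.
df3 = df3.astype(original_dtypes)

print(df3.dtypes)
```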
I ran into the same issue. This specific case does not seem to be fixed in Pandas yet. I've filed a bug report:
https://github.com/pandas-dev/pandas/issues/20699
