Pandas SettingWithCopyWarning when using iloc - python

I'm trying to change values in my DataFrame after merging it with another DataFrame and coming across some issues (doesn't appear to be an issue prior to merging).
I am indexing and changing values in my DataFrame with:
df.iloc[0]['column'] = 1
Subsequently I've joined (left outer join) along both indexes using merge (I realize left.join(right) would work too). After this when I perform the same value assignment using iloc, I receive the following warning:
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Reviewing the linked document didn't clear things up for me, so: am I using an incorrect method of slicing with iloc? (Keep in mind that I require position-based slicing for the purpose of my code.)
I notice that df.ix[0,'column'] = 1 works, and similarly based on this page I can reference the column location with df.columns.get_loc('column') but on the surface this seems unnecessarily convoluted.
What's the difference between these methods under the hood, and what about merging causes the previous method (df.iloc[0]['column']) to break?

You are using chained indexing above: df.iloc[0]['column'] = 1 first selects a row and then sets a value on that intermediate result. This is to be avoided, and it generates the SettingWithCopyWarning you are getting. The pandas docs are a bit complicated, but see SettingWithCopy Warning with chained indexing for the under-the-hood explanation of why this does not work.
Instead you should use a single indexing call: df.loc[0, 'column'] = 1
.loc is for "Access a group of rows and columns by label(s) or a boolean array."
.iloc is for "Purely integer-location based indexing for selection by position."
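To make the difference concrete, here is a minimal sketch. The frame and its column names are made up for illustration, and the mixed dtypes are chosen on purpose so that selecting a row has to build a new object. The chained form writes into that temporary row, while the single .loc call writes into df itself.

```python
import warnings

import pandas as pd

# Hypothetical frame with mixed dtypes (int and str columns).
df = pd.DataFrame({"column": [0, 0, 0], "label": ["a", "b", "c"]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the chained-assignment warning for this demo
    # Step 1: df.iloc[0] builds a separate row Series. Step 2: the value is set on that Series.
    df.iloc[0]["column"] = 1

chained_result = df.loc[0, "column"]  # the original frame was not updated

# One indexing call on df itself: row label 0, column label 'column'.
df.loc[0, "column"] = 1
single_result = df.loc[0, "column"]
```

With a frame that holds a single dtype, older pandas versions may hand back a view in step 1, so the chained form can appear to work there. That inconsistency is what the warning is about.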

It sucks, but the best solution I've come up with so far for updating a DataFrame column by position is to find the integer location of the column, then use .iloc for everything:
column_i_loc = np.where(df.columns == 'column')[0][0]
df.iloc[0, column_i_loc] = 1
Note that you could also disable the warning, but really, do not!
Also, if you see this warning and were not trying to update the original DataFrame, then you forgot to make a copy and may end up with a nasty bug.
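A small sketch of the same idea using df.columns.get_loc, which the question already mentions. The frame, the column names and the subset variable are assumptions for illustration. It also shows the explicit .copy() for the case where you want to modify a subset rather than the original.

```python
import pandas as pd

# Hypothetical frame with a non-integer index, so positions and labels differ.
df = pd.DataFrame({"column": [0, 0, 0], "other": [10, 20, 30]}, index=["x", "y", "z"])

# Integer position of the column (an int when the column label is unique).
col_pos = df.columns.get_loc("column")

# Purely positional assignment in one indexing call: first row, that column.
df.iloc[0, col_pos] = 1

# If you want to change a subset without touching df, take an explicit copy first.
subset = df[df["other"] > 10].copy()
subset.iloc[0, col_pos] = 99  # changes only the copy
```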

Related

How to manipulate Pandas Series without changing the given Original?

Context
I have a method that takes a Pandas Series of categorical data and returns it as an indexed version. However, I think my implementation is also modifying the given Series, not just returning a modified new Series. I also get the following warnings:
A value is trying to be set on a copy of a slice from a DataFrame.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
series[series == value] = index
SettingWithCopyWarning: modifications to a property of a datetimelike object are not supported and are discarded. Change values on the original.
cacher_needs_updating = self._check_is_chained_assignment_possible()
Code
def categorials(series: pandas.Series) -> pandas.Series:
    unique = series.unique()
    for index, value in enumerate(unique):
        series[series == value] = index
    return series.astype(pandas.Int64Dtype())
Question
How can I achieve my goal: This method should return the modified series without manipulating the original given series?
You need to .copy() the incoming argument. Normally, that warning wouldn't have appeared; we're at liberty to write to Series/DataFrames after all. However, in the code you didn't share, it seems the argument you're passing here was obtained as a subset of another Series/Frame (or maybe even itself). FYI, if you're planning to do modifications on a subset, better chain .copy() at the end of initialization.
Anyway, back to the question, series = series.copy() as the first line in the function should resolve the issue. However, your method is actually doing factorization, so
pd.Series(pd.factorize(series)[0], index=series.index)
is equivalent to what your function does. Since pd.factorize returns a 2-tuple of (codes, uniques), we take the 0th element. It also gives a NumPy array back, so we turn it into a Series with the incoming index. Note that it does not attempt to modify the original Series, so no .copy() is needed for it.
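As a rough sketch, the factorize version could be wrapped like this. The function name and the sample data are made up for illustration, and the final astype only mirrors the Int64 return type of the original function. One difference to be aware of: pd.factorize marks missing values with the code -1.

```python
import pandas as pd


def categoricals(series: pd.Series) -> pd.Series:
    """Return integer codes for the values of `series` without modifying it."""
    # factorize returns (codes, uniques); codes follow order of first appearance.
    codes, _uniques = pd.factorize(series)
    # codes is a NumPy array, so re-attach the incoming index.
    return pd.Series(codes, index=series.index).astype(pd.Int64Dtype())


original = pd.Series(["red", "blue", "red", "green"], index=[10, 11, 12, 13])
indexed = categoricals(original)
```

Here original still holds the strings afterwards, and indexed carries the codes on the same index.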

there are different categories for a column in the dataframe and want to merge/rename these different columns into 3 categories [duplicate]

In the pandas library many times there is an option to change the object inplace such as with the following statement...
df.dropna(axis='index', how='all', inplace=True)
I am curious what is being returned as well as how the object is handled when inplace=True is passed vs. when inplace=False.
Are all operations modifying self when inplace=True? And when inplace=False is a new object created immediately such as new_df = self and then new_df is returned?
If you are trying to close a question where someone should use inplace=True and hasn't, consider replace() method not working on Pandas DataFrame instead.
When inplace=True is passed, the data is modified in place (the call returns nothing), so you'd use:
df.an_operation(inplace=True)
When inplace=False is passed (this is the default value, so it isn't necessary), the method performs the operation and returns a copy of the object, so you'd use:
df = df.an_operation(inplace=False)
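A minimal sketch of both call styles with dropna. The frame is made up for illustration, and the return value of None is what I see in the pandas versions I have used.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose middle row is entirely NaN.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, np.nan, 6.0]})

# Default (inplace=False): a new object comes back, df is left alone.
new_df = df.dropna(axis="index", how="all")
rows_before = len(df)

# inplace=True: df itself is changed and the call returns None.
returned = df.dropna(axis="index", how="all", inplace=True)
rows_after = len(df)
```

This is why writing df = df.dropna(..., inplace=True) is a bug: it rebinds df to None.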
In pandas, is inplace = True considered harmful, or not?
TLDR; Yes, yes it is.
inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
inplace does not work with method chaining
inplace can lead to SettingWithCopyWarning if used on a DataFrame column, and may prevent the operation from going through, leading to hard-to-debug errors in code
The pain points above are common pitfalls for beginners, so removing this option will simplify the API.
I don't advise setting this parameter as it serves little purpose. See this GitHub issue which proposes the inplace argument be deprecated api-wide.
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In reality, there are absolutely no performance benefits to using inplace=True. Both the in-place and out-of-place versions create a copy of the data anyway, with the in-place version automatically assigning the copy back.
inplace=True is a common pitfall for beginners. For example, it can trigger the SettingWithCopyWarning:
df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})
df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
Calling a function on a DataFrame column with inplace=True may or may not work. This is especially true when chained indexing is involved.
As if the problems described above aren't enough, inplace=True also hinders method chaining. Contrast the working of
result = df.some_function1().reset_index().some_function2()
As opposed to
temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
The former lends itself to better code organization and readability.
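Since some_function1 and some_function2 are placeholders, here is a runnable sketch of the same contrast with real methods. The data and the choice of groupby/rename are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"key": ["b", "a", "b"], "value": [1, 2, 3]})

# Chained: every step returns a new object that feeds the next step.
chained = df.groupby("key").sum().reset_index().rename(columns={"value": "total"})

# With inplace=True the chain has to be cut, because reset_index returns None.
temp = df.groupby("key").sum()
temp.reset_index(inplace=True)
stepwise = temp.rename(columns={"value": "total"})
```

Both variants end up with the same frame. The second one just needs a temporary name and an extra statement.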
Another supporting claim is that the API for set_axis was recently changed such that inplace default value was switched from True to False. See GH27600. Great job devs!
The way I use it is
# Have to assign back to dataframe (because it is a new copy)
df = df.some_operation(inplace=False)
Or
# No need to assign back to dataframe (because it is on the same copy)
df.some_operation(inplace=True)
CONCLUSION:
if inplace is False
Assign to a new variable;
else
No need to assign
The inplace parameter:
df.dropna(axis='index', how='all', inplace=True)
in Pandas and in general means:
1. Pandas creates a copy of the original data
2. ... does some computation on it
3. ... assigns the results to the original data.
4. ... deletes the copy.
As you can read further below in my answer, we can still have good reasons to use this parameter, i.e. in-place operations, but we should avoid it if we can, as it generates more issues, such as:
1. Your code will be harder to debug (in fact, SettingWithCopyWarning exists to warn you about this possible problem)
2. Conflict with method chaining
So is there even a case where we should still use it?
Definitely yes. If we use pandas or any tool for handling huge datasets, we can easily face a situation where some big data consumes our entire memory.
To avoid this unwanted effect we can use some techniques like method chaining:
(
wine.rename(columns={"color_intensity": "ci"})
.assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
.query("alcohol > 14 and color_filter == 1")
.sort_values("alcohol", ascending=False)
.reset_index(drop=True)
.loc[:, ["alcohol", "ci", "hue"]]
)
which makes our code more compact (though also harder to interpret and debug) and consumes less memory, because each chained method works on the previous method's return value, resulting in only one copy of the input data. We can see clearly that we will have 2 x the original data's memory consumption after these operations.
Or we can use the inplace parameter (though it is also harder to interpret and debug): our memory consumption during the operation will be 2 x the original data, but after the operation it goes back to 1 x the original data, which, as anybody who has worked with huge datasets knows, can be a big benefit.
Final conclusion:
Avoid using the inplace parameter unless you work with huge data, and be aware of its possible issues if you still use it.
Save it to the same variable
data["column01"].where(data["column01"]< 5, inplace=True)
Save it to a separate variable
data["column02"] = data["column01"].where(data["column01"] < 5)
But you can always overwrite the variable
data["column01"] = data["column01"].where(data["column01"] < 5)
FYI: the default is inplace=False
When trying to make changes to a pandas DataFrame using a method, we use inplace=True if we want to commit the changes to the DataFrame.
Therefore, the first line in the following code changes the name of the first column in df to 'Grades'. We need to evaluate df afterwards if we want to see the resulting DataFrame.
df.rename(columns={0: 'Grades'}, inplace=True)
df
We use inplace=False (this is also the default value) when we don't want to commit the changes but just print the resulting DataFrame. So, in effect, a copy of the original DataFrame with the changes applied is printed without altering the original DataFrame.
Just to be more clear, the following codes do the same thing:
#Code 1
df.rename(columns={0: 'Grades'}, inplace=True)
#Code 2
df = df.rename(columns={0: 'Grades'}, inplace=False)
Yes, in pandas many functions have the inplace parameter, but by default it is set to False.
So, when you do df.dropna(axis='index', how='all', inplace=False), pandas assumes that you do not want to change the original DataFrame, so it instead creates a new copy for you with the required changes.
But when you change the inplace parameter to True, it is equivalent to explicitly saying that you do not want a new copy of the DataFrame and that the changes should instead be made on the given DataFrame. pandas then updates the given DataFrame and returns None instead of a new DataFrame.
But you can also avoid using the inplace parameter by reassigning the result to the original DataFrame:
df = df.dropna(axis='index', how='all')
Whether to use inplace=True depends on whether you want to make changes to the original df or not.
df.drop_duplicates()
will only return a new DataFrame with the duplicates dropped, without making any changes to df
df.drop_duplicates(inplace = True)
will drop values and make changes to df.
Hope this helps.:)
inplace=True makes the function impure. It changes the original DataFrame and returns None. In that case, you break the DSL chain.
Because most DataFrame methods return a new DataFrame, you can use the DSL conveniently, like:
df.sort_values().rename().to_csv()
A method call with inplace=True returns None and the DSL chain is broken. For example,
df.sort_values(inplace=True).rename().to_csv()
will raise AttributeError: 'NoneType' object has no attribute 'rename'
This is similar to Python's built-in sort and sorted: lst.sort() returns None and sorted(lst) returns a new list.
Generally, do not use inplace=True unless you have a specific reason for doing so. When you have to write reassignment code like df = df.sort_values(), try attaching the method call to the DSL chain, e.g.
df = pd.read_csv().sort_values()...
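The calls above are schematic (sort_values needs a by argument, for instance), so here is a small runnable sketch of the same point. The frame and the column name are made up for illustration.

```python
import pandas as pd

# Hypothetical frame.
df = pd.DataFrame({"a": [3, 1, 2]})

# Plain Python analogy: list.sort() mutates and returns None, sorted() returns a new list.
lst = [3, 1, 2]
sort_return = lst.sort()
sorted_copy = sorted([3, 1, 2])

# Chaining after inplace=True fails because the first call returns None.
try:
    df.sort_values("a", inplace=True).rename(columns={"a": "A"})
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Without inplace, the chain works as expected.
chained = df.sort_values("a").rename(columns={"a": "A"})
```

Note that the failed chain still sorted df in place before the AttributeError was raised, which is one more reason such bugs are confusing.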
From my experience with pandas, I would like to answer.
The inplace=True argument means that the changes are made permanently on the DataFrame itself,
e.g.
df.dropna(axis='index', how='all', inplace=True)
changes the same DataFrame (here pandas finds the rows in which every value is NaN and drops them).
If we try
df.dropna(axis='index', how='all')
pandas shows the DataFrame with the changes we made but will not modify the original DataFrame df.
If you don't use inplace=True or you use inplace=False you basically get back a copy.
So for instance:
testdf.sort_values(inplace=True, by='volume', ascending=False)
will alter testdf itself, leaving its data sorted by volume in descending order.
Then:
testdf2 = testdf.sort_values( by='volume', ascending=True)
will make testdf2 a copy. The values will all be the same, but the sort order will be reversed and you will have an independent object.
Then, given another column, say LongMA, if you do:
testdf2.LongMA = testdf2.LongMA -1
the LongMA column in testdf will have the original values and testdf2 will have the decremented values.
It is important to keep track of the difference as the chain of calculations grows and the copies of dataframes have their own lifecycle.
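The steps above can be sketched end to end like this. The numbers are made up, and only the names testdf, testdf2, volume and LongMA come from the answer.

```python
import pandas as pd

# Hypothetical data.
testdf = pd.DataFrame({"volume": [10, 30, 20], "LongMA": [1.0, 2.0, 3.0]})

# In place: testdf itself is now sorted by volume, descending.
testdf.sort_values(by="volume", ascending=False, inplace=True)

# Not in place: testdf2 is a separate object sorted the other way round.
testdf2 = testdf.sort_values(by="volume", ascending=True)

# Changing the copy leaves the LongMA values in testdf as they were.
testdf2["LongMA"] = testdf2["LongMA"] - 1
```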

python pandas.loc not finding row name: KeyError

This is driving me crazy because it should be so simple and yet it's not working. It's a duplicate question and yet the answers from previous questions don't work.
My csv looks similar to this:
name,val1,val2,val3
ted,1,2,
bob,1,,
joe,,,4
I want to print the contents of row 'joe'. I use the line below and pycharm gives me a KeyError.
print(df.loc['joe'])
The problem with your logic is that you have not told pandas which column it should search for 'joe' in.
print(df.loc[df['name'] == 'joe'])
or
print(df[df['name'] == 'joe'])
Using .loc directly with a label works only on the index.
If you just used pd.read_csv without specifying an index, pandas will by default use integers as the index. You can set name to be the index if it is unique. Then .loc will work:
df = df.set_index("name")
print(df.loc['joe'])
Another option, and it's how .loc is usually used, is to name specifically the column you refer to:
print(df.loc[df["name"]=="joe"])
Note that the condition df["name"]=="joe" returns a Series with True/False for each row. df.loc[...] on that Series will return only rows where the value is True, and therefore it will return only rows where name is "joe". Keep that in mind when, in the future, you try to do more complex conditioning on your DataFrame using .loc.
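Putting both options together on the CSV from the question, as a minimal sketch. The CSV is read from a string here instead of a file, and the variable names are only for illustration.

```python
import io

import pandas as pd

csv_text = "name,val1,val2,val3\nted,1,2,\nbob,1,,\njoe,,,4\n"
df = pd.read_csv(io.StringIO(csv_text))  # default integer index 0, 1, 2

# Option 1: boolean mask on the 'name' column; the result is a one-row DataFrame.
by_mask = df.loc[df["name"] == "joe"]

# Option 2: make 'name' the index. set_index returns a new frame, so keep the result.
indexed = df.set_index("name")
by_label = indexed.loc["joe"]  # a Series holding that row
```

Calling df.set_index("name") without assigning the result leaves df unchanged, so df.loc['joe'] would still raise a KeyError.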

Changing value in data frame column in a loop python

I am new to Python pandas library and using data frames.
I am using Jupyter.
I am kind of lost with this syntax.
I want to loop through rows and set new value to column new_value.
I thought I would do it like this, but it raises an error.
df_merged['new_value'] = 0
for i, row in df_merged.iterrows():
    df_merged['new_value'][i] = i
I also tried to do a calculation like:
df_merged['new_value'][i] = df_merged['move_%'] * df_merged['value']
But it doesn't work.
I am getting this error:
/usr/lib/python3.4/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
after removing the cwd from sys.path.
What am I doing wrong here?
Thanks.
You can use just this:
df_merged['new_value'] = df_merged.index
You can also use the apply method.
df_merged['new_value'] = df_merged.apply(lambda row : row.name, axis=1)
I am getting this error : A value is trying to be set on a copy of a slice from a DataFrame
It's not an error, it's just a warning message.
From this answer:
The SettingWithCopyWarning was created to flag potentially confusing "chained" assignments, such as the following, which don't always work as expected, particularly when the first selection returns a copy.
You can avoid this warning message by using the DataFrame .loc indexer.
for i, row in df_merged.iterrows():
    df_merged.loc[i,'new_value'] = i
For a loop update in a pandas DataFrame:
for i, row in df_merged.iterrows():
    df_merged.set_value(i,'new_value',i)
should be able to update the values. However, set_value now gives:
FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead.
for i, row in df_merged.iterrows():
    df_merged.at[i,'new_value'] = i
Should be preferred.
This also works fine:
df_merged['price_new'] = 0
for i, row in df_merged.iterrows():
    df_merged.loc[i,'price_new'] = i
This is not an error. It is simply saying that the data frame df_merged was initialised as a slice of a parent data frame, so pandas cannot tell whether the assignment will reach a view or a copy. That's probably why, when you check the value of the merged data frame after this step, it remains the same as the original. You have two options: make your df_merged data frame itself a copy by using the .copy() method when you initialise it from its parent data frame, or, in the loop or the computation, set the values on the parent data frame using the same calculations or indexes used on the merged data frame. I'd recommend the first method because I don't think memory is a constraint for you and you want the values changed in the new data frame. Plus it is as straightforward as can be.
If you want to perform a multiplication on two columns, you don't have to do it row-wise, the following should work:
df_merged['new_value'] = df_merged['move_%'] * df_merged['value']
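For comparison, a sketch of the vectorized form next to an explicit loop that uses .at. The data and the extra looped column are made up for illustration.

```python
import pandas as pd

# Hypothetical data using the column names from the question.
df_merged = pd.DataFrame({"move_%": [0.5, 2.0, 0.25], "value": [10.0, 20.0, 40.0]})

# Vectorized: one assignment for the whole column, no loop and no chained indexing.
df_merged["new_value"] = df_merged["move_%"] * df_merged["value"]

# Row-wise equivalent: .at sets a single cell by row label and column label.
df_merged["looped"] = 0.0
for i, row in df_merged.iterrows():
    df_merged.at[i, "looped"] = row["move_%"] * row["value"]
```

Both columns hold the same numbers. The vectorized line is shorter and is usually much faster on large frames.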

What is the difference between these two Python pandas dataframe commands?

Let's say I have an empty pandas dataframe.
import pandas as pd
m = pd.DataFrame(index=range(1,100), columns=range(1,100))
m = m.fillna(0)
What is the difference between the following two commands?
m[2][1]
m[2].ix[1] # This code actually allows you to edit the dataframe
Feel free to provide further reading if it would be helpful for future reference.
The short answer is that you probably shouldn't do either of these (see @EdChum's link for the reason):
m[2][1]
m[2].ix[1]
You should generally use a single ix, iloc, or loc command on the entire dataframe anytime you want access by both row and column -- not a sequential column access, followed by row access as you did here. For example,
m.loc[1,2]  # label-based; since the labels here start at 1, the same cell by position is m.iloc[0,1]
Note that the 1 and 2 are reversed compared to your example because ix/iloc/loc all use standard syntax of row then column. Your syntax is reversed because you are chaining, and are first selecting a column from a dataframe (which is a series) and then selecting a row from that series.
In simple cases like yours, it often won't matter too much but the usefulness of ix/iloc/loc is that they are designed to let you "edit" the dataframe in complex ways (aka set or assign values).
There is a really good explanation of ix/iloc/loc here:
pandas iloc vs ix vs loc explanation?
and also in standard pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
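A short sketch of label-based versus position-based access on a frame shaped like the one in the question. The frame is built directly with zeros here, and .ix is left out because it was deprecated and then removed in pandas 1.0.

```python
import pandas as pd

# Same shape as in the question: row and column labels run from 1 to 99.
m = pd.DataFrame(0, index=range(1, 100), columns=range(1, 100))

# One indexing call, row first and then column, is the reliable way to assign.
m.loc[1, 2] = 7  # row label 1, column label 2

by_label = m.loc[1, 2]
by_position = m.iloc[0, 1]  # same cell: first row, second column
chained_read = m[2][1]  # column label 2 first, then row label 1; fine for reading only
other_cell = m.iloc[1, 2]  # a different cell: row label 2, column label 3
```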
