SettingWithCopyWarning: How to Fix This - python

I have a column of strings and I'm trying to find the number of tokens in each one, then create a new column in the same dataframe with those values.
data['tokens'] = data['query'].str.split().apply(len)
I get a SettingWithCopyWarning. I'm not sure how to fix it. I understand I need to use .loc[row_indexer,col_indexer] = value, but I don't see how that applies here.

A SettingWithCopyWarning happens when you have made a copy of a slice of a DataFrame, but pandas thinks you might be trying to modify the underlying object.
To fix it, you need to understand the difference between a copy and a view. A copy makes an entirely new object. When you index into a DataFrame, like:
data['query'].str.split().apply(len)
or
data['tokens']
you may be getting a new Series that is a copy derived from the original one. If you modify this new copy, it won't change the original data object. You can check whether a given object is a view of the original with the internal _is_view attribute, which returns a boolean value.
data['tokens']._is_view
On the other hand, when you assign through the .at, .loc, or .iloc indexers, you are working on the original DataFrame itself: you subset it according to some criteria and manipulate the original object directly.
Pandas raises the SettingWithCopyWarning when you modify a copy but probably mean to modify the original. To avoid this, you can explicitly call .copy() on the data that you are copying, or you can use .loc to specify the columns you want to modify in data (or both).
Since it depends a lot on what transformations you've done to your DataFrame already and how it is set up, it's hard to say exactly where and how you can fix it without seeing more of your code. There's unfortunately no one-size-fits-all answer. If you can post more of your code, I'm happy to help you debug it.
One thing you might try is creating an intermediate lengths object explicitly, in case that is the problem. So your code would look like:
lengths = data['query'].str.split().apply(len).copy()
data['tokens'] = lengths
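If the warning came from slicing the frame earlier, the .loc route mentioned above would look like the following. This is just a minimal sketch with toy data; the filtering condition is hypothetical, purely to simulate a slice:

import pandas as pd

df = pd.DataFrame({'query': ['find my phone', 'weather today', 'pandas docs']})

# If `data` was sliced out of a larger frame, take an explicit copy first
data = df[df['query'].str.len() > 0].copy()

# Assigning through .loc makes the intent to modify `data` itself explicit
data.loc[:, 'tokens'] = data['query'].str.split().apply(len)
print(data)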

Related

Python Pandas: Dataframe.loc[] vs dataframe[] behavior difference?

I'm completely new to Pandas, but I was given some code that uses it, and I came across a line that looked like the following:
df.loc[bool_array]['column_name']=0
Looking at that code, I would expect that if I ran df.loc[bool_array]['column_name'] to retrieve the values for the column after running said code, the resulting values would all be zero. In fact, however, this was not the case, the values remained unchanged. To actually change the values, I had to remove the .loc as so:
df[bool_array]['column_name']=0
whereupon displaying the values did, in fact, show 0 as expected, regardless of if I used .loc or not.
Everything I can find about .loc seems to indicate that it should behave essentially the same as simply doing [] when given a boolean array like this. What difference am I missing that explains this behavioral discrepancy?
EDIT: As pointed out by @QuangHoang in the comments, the following recommended code DOES work:
df.loc[bool_array, 'column_name']=0
which simplifies the question to "why doesn't the original code work?"
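To make the difference concrete, here is a small sketch with toy data (column names hypothetical) contrasting chained assignment with the single .loc call:

import pandas as pd

df = pd.DataFrame({'column_name': [1, 2, 3], 'other': [4, 5, 6]})
bool_array = df['other'] > 4

# Chained indexing: df.loc[bool_array] first builds an intermediate object,
# and the assignment may land on that temporary instead of on df.
df.loc[bool_array]['column_name'] = 0  # raises SettingWithCopyWarning
print(df['column_name'].tolist())      # typically still [1, 2, 3]

# A single .loc call selects rows and column in one operation,
# so pandas writes into df itself.
df.loc[bool_array, 'column_name'] = 0
print(df['column_name'].tolist())      # [1, 0, 0]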

What .object means for a GroupBy Object

I keep seeing
for index, row in group.object.iterrows():
in TensorFlow tutorials. I get what it's doing, and that group is a GroupBy object, but I wonder what the ".object" is there for. I googled "group.object.iterrows", and all I got was TensorFlow object detection code. I tried other variants, but nothing had a GroupBy.object example or a description of what it is.
EDIT: here's a tutorial:
https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10/blob/master/generate_tfrecord.py
See line 70.
Here's another, there are a bunch, actually:
https://www.skcript.com/svr/realtime-object-and-face-detection-in-android-using-tensorflow-object-detection-api/
Some more context:
They involve making a tensorflow.train.Example and loading features into it. These were originally taken from XML produced by some labeling tools, then converted to a CSV, then converted to a pandas DataFrame.
In fact, the code mostly looks like cut-and-paste from some original script with small edits.
Like a DataFrame, a pandas GroupBy object supports accessing columns with attribute notation, as long as the column name doesn't conflict with a "regular" attribute or method. object is merely one of the column names in the grouped data, and group.object accesses that column.
In short, object is just a column in the group DataFrame.
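A minimal sketch of that attribute access with toy data (the file and label names are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'filename': ['a.jpg', 'a.jpg', 'b.jpg'],
    'object': ['cat', 'dog', 'cat'],  # a column literally named 'object'
})

# Attribute access and bracket access return the same column
assert df.object.equals(df['object'])

# The same shorthand works on each group produced by groupby
for name, group in df.groupby('filename'):
    print(name, list(group.object))  # same as group['object']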

Using .loc in pandas slows down calculation

I have the following dataframe, where I want to assign the bottom 1% quantile value to a new column. When I do this assignment using the .loc notation, it takes around 10 seconds, whereas the alternative solution takes only 2 seconds.
import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(100000000, 1), columns=list('A'))
%time df_temp["q"] = df_temp["A"].quantile(0.01)
%time df_temp.loc[:, "q1_loc"] = df_temp["A"].quantile(0.01)
Why is the .loc solution slower? I understand using the .loc solution is safer, but if I want to assign data to all indices in the column, what can go wrong with the direct assignment?
.loc searches along the entire index and columns (in this case, only one column) of your df, which is time-consuming and, here, redundant, in addition to computing the quantile of df_temp['A'] (which is negligible in terms of calculation time). Your direct assignment method, on the other hand, just evaluates df_temp['A'].quantile(0.01) and assigns it to df_temp['q']; it doesn't need to exhaustively search the index and columns of your df.
See this answer for a similar description of the .loc method.
As far as safety is concerned, you are not using chained indexing, so you're probably safe (you're not trying to set anything on a copy of your data, it's being set directly on the data itself). It's good to be aware of the potential issues with not using .loc (see this post for a nice overview of SettingWithCopy warnings), but I think that you're OK as far as that goes.
If you want to be more explicit about your column creation, you could do something along the lines of df_temp = df_temp.assign(q=df_temp["A"].quantile(0.01)). It won't really change performance (I don't think), nor the result, but it lets you see that you're explicitly assigning a new column to your existing dataframe (and thus not setting anything on a copy of it).
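Sketched out on the question's frame (shrunk to a toy size so it runs quickly):

import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(100_000, 1), columns=list('A'))

# .assign returns a new DataFrame with the extra column; rebinding the
# name makes the "new column on the original frame" step explicit.
df_temp = df_temp.assign(q=df_temp['A'].quantile(0.01))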

Is it a good practice to preallocate an empty dataframe with types?

I'm trying to load around 3 GB of data into a pandas DataFrame, and I figured I would save some memory by first declaring an empty DataFrame while enforcing that its float columns be 32-bit instead of the default 64-bit. However, the pandas DataFrame constructor does not allow specifying the dtypes of multiple columns on an empty DataFrame.
I found a bunch of workarounds in the replies to this question, but they made me realize that pandas is not designed to be used this way.
This made me wonder whether it was a good strategy at all to declare the empty dataframe first, instead of reading the file and then downcasting the float columns (which seems inefficient memory-wise and processing-wise).
What would be the best strategy to design my program?
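For what it's worth, pandas exposes two documented hooks that sidestep the empty-frame constructor entirely. A hedged sketch, with the file name and column names hypothetical:

import numpy as np
import pandas as pd

# Option 1: declare dtypes at read time, so float32 columns are never
# materialized as float64 in the first place.
df = pd.read_csv('data.csv', dtype={'x': np.float32, 'y': np.float32})

# Option 2: read with defaults, then downcast the float columns afterwards.
# This briefly pays the 64-bit memory cost the question worries about.
df = pd.read_csv('data.csv')
float_cols = df.select_dtypes(include='float64').columns
df[float_cols] = df[float_cols].astype(np.float32)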

Pandas DataFrame - Test for change/modification

Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolation of values on a row-by-row basis, and the sheer number of rows makes this process take a couple of hours. Hence, I save the dataframe, which can also take a bit of time - at least 30 seconds.
My current code will always save the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am needlessly wasting a lot of time waiting for the save to finish every time the script terminates.
Is there a DataFrame attribute I can test to see whether the DataFrame has been modified? Essentially, if it is False I can skip saving at the end of the script, but if it is True then a save is necessary. This simple one-line check will save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about its functionality in the pandas documentation. It does exactly what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do this since it's implemented by pandas itself.
Note that you need to use .copy() when assigning old_df before making changes to your current df; otherwise old_df may just be another reference to the same object rather than an independent snapshot.
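A small sketch of the whole pattern, assuming hypothetical file names:

import pandas as pd

df = pd.read_csv('big_input.csv')  # hypothetical input file
old_df = df.copy()                 # snapshot by value, not by reference

# ... interpolation work that may or may not add columns to df ...

if not df.equals(old_df):
    df.to_csv('big_output.csv')    # only pay the save cost when df changed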
