SettingwithCopy when creating new column and when dropping NaN rows [duplicate] - python

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 4 years ago.
I've been searching around reading the pandas docs here and trying different lines of code from questions posted around here and here and I can't seem to get away from the setting with copy warning. I'd prefer to learn to code it the "right" way as opposed to just ignoring the warnings.
The following lines of code are inside a for loop and I don't want to generate this warning a lot of times because it could slow things down.
I'm trying to make a new column with name: 'E'+vs where vs is a string in a list in the for loop
But for each one of them, I still get the following warning, even with the last 3 lines:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here are the troublesome lines I've tried so far:
#based on research, the first two seem to be the "wrong" way
df_out['E'+vs] = df_out[kvs].rolling(v).mean().copy()
df_out['E'+vs] = df_out[kvs].rolling(v).mean()
df_out.loc[:,'E'+vs] = df_out[kvs].rolling(v).mean().copy()
df_out.loc[:,'E'+vs] = df_out[kvs].rolling(v).mean()
df_out.loc[:,'E'+vs] = df_out.loc[:,kvs].rolling(v).mean()
The other one that gives a SettingWithCopyWarning is this:
df_out.dropna(inplace=True,axis=0)
This one also gave a warning (but I figured this one would)
df_out = df_out.dropna(inplace=True,axis=0)
How do I do both of these operations correctly?
EDIT: Here is the code that produced the original df_out
df_out= pd.concat([vol.Date[1:-1], ret.Return_Time[:-2], vol.Freq_Time[:-2],
vol.Freq_Time[:-1].shift(-1), vol.Freq_Time[:].shift(-2)],
axis=1).dropna().set_index('Date')

This is a confusing topic. It's not the code you've posted that is the problem. It's the code you haven't posted. It's the code that generated the df_out
Consider this example and note the last line that generates the warning.
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other[:]
df_out['E'] = 5
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Now we'll try an equivalent thing that won't produce the warning
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other.loc[:]
df_out['E'] = 5
Then
print `df_out`
A B E
0 1 2 5
It boils down to pandas deciding to attach an is_copy attribute to a dataframe when it's constructed based on lots of criteria.
Notice the
df_other[:].is_copy
<weakref at 0x103323458; to 'DataFrame' at 0x116a684e0>
When
df_other.loc[:].is_copy
Returns None
So what types of construction trigger the copy? I still don't know everything, and not even the things I know all make sense to me.
Like why does this not trigger it?
df_other[['A', 'B', 'E']].is_copy

First off, I am not sure this is either efficient or the best approach. However, I had the same issue when I was adding a new column to the exist dataframe and I decided to use reset_index method.
Here I first drop Nan rows from EMPLOYEES column and assign this manipulated data frame to new data frame df1 then I add COMPANY_SIZE column to the df1 as follows:
df1 = all_merged_years.dropna(subset=['EMPLOYEES']).reset_index()
column = df1['EMPLOYEES']
Size =[]
df1['COMPANY_SIZE'] = ' '
for number in column:
if number <=999:
Size.append('Small')
elif 999<number<=9999:
Size.append('Medium')
elif 9999<number:
Size.append('Large')
else:
Size.append('UNKNOWN')
df1['COMPANY_SIZE'] = Size
This way I did NOT get a warning as such. Hope that helps.

Related

Setting category and type for multiple columns possible?

I have a dataset which contains 6 columns TIME1 to TIME6, amongst others. For each of these I need to apply the code below (which is shown for 2 columns). LISTED is a prepared list of the possible elements to be seen in these columns.
Is there a way to do this without writing the same 2 lines 6 times?
df['PART1'] = df['TIME1'].astype('category')
df['PART1'].cat.set_categories(LISTED, inplace=True)
df['PART2'] = df['TIME2'].astype('category')
df['PART2'].cat.set_categories(LISTED, inplace=True)
For astype(first line of code), I tried the following:
for col in ['TIME1', 'TIME2', 'TIME3', 'TIME4', 'TIME5', 'TIME6']:
df_col = df[col].astype('category')
I think this works (not sure how to check without the whole code working). But how could I do something similar for the second line of code with the set_categories etc?
In short, I'm looking for something short/more elegant that just copying and modifying the same 2 lines 6 times.
I am new to python, any help is greatly appreciated.
Using python 2.7 and pandas 0.24.2
Yes it is possible! We can change the dtype of multiple columns to categorical is one go by creating CategoricalDtype
i = pd.RangeIndex(1, 7).astype(str)
df['PART' + i] = df['TIME' + i].astype(pd.CategoricalDtype(LISTED))

How to modify multiple values in one column, but skip others in pandas python

Going on two months in python and I am focusing hard on Pandas right now. In my current position I use VBA on data frames, so learning this to slowly replace it and further my career.
As of now I believe my true problem is the lack of understanding a key concept(s). Any help would be greatly appreciated.
That said here is my problem:
Where could I go to learn more on how to do stuff like this for more precise filtering. I'm very close but there is one key aspect I need.
Goal(s)
Main goal I need to skip certain values in my ID column.
The below code takes out the Dashes "-" and only reads up to 9 digits. Yet, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
Main data frame IDs is formatted as 000-000-000-000
The other data frames that I will compare it to have it with no
dashes "-" as 000000000 and three less 000's totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely different ranging from 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my Dataframe variable name aka the excel I am using. "ID" is
my column heading. Will make this an index after the fact
My Data frame on this job is small with # of (rows, columns) as (2500, 125)
I do not get an error message so I am guessing maybe I need a loop of some kind. Starting to test for loops with this as well. no luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
if ID_val not in lst:
return ID_val.replace("-", "")[:9]
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
This is based on #xyzxyzjayne answers but I have two issues I can not figure out.
First issue
is I get this warning: (see Edit)
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below I tried to put in .loc but I can't seem to find out how to eliminate this warning by using .loc correctly. Still learning it. NO, I will not just ignore it even though it works. This is a learning opportunity I say.
Second issue
is that I do not under stand this part of the code. I know the left side is supposed to be rows, and the right side is columns. That said why does this work? ID is a column not a row when this code is rune. I make the ID :
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
Area I don't understand yet, is the left side of the comma(,) on this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said here is the final result, basically as I Said its XZY's help that got me here but I'm adding more .locs and playing with the documentation till I can eliminate the warning.
uniqueID = [ and whole list of IDs i had to manually enter 1000+ entries that
will go in the below code. These ids get skipped. example: "032-234-987_#4256"]
# gets the columns i need to make the DateFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
'Number of Vehicles Removed', 'County']]
#Place holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]
#the next code is the filter that goes through the list and skips them. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
#Makes the ID our index
df = df.set_index("ID ")
#just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")
#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
Will edit this once i get rid of the warning and figure out the left side so I can explain to for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained index problem by making a copy of the orginal data base before filter and making everthing .loc as XYZ has helped me with. Before we start to filter use DataFrame.copy() where DataFrame is the name of your own dataframe.

Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game scores for each NCAA mens basketball game. In order to do this, I need to use the python package sportsreference. I need to use two dataframes, one called df which has the game date and one called box_index (shown below) which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or the .contains(). I keep getting a Keyerror: 0 error. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule
def get_team_schedule(name):
combined =Schedule(name).dataframe
box_index = combined["boxscore_index"]
box = box_index.to_frame()
#print(box)
for i in range(len(df)):
for j in range(len(box)):
if box.loc[i,"boxscore_index"].contains(df.loc[i, "date"]):
df.loc[i,"date"] = box.loc[i,"boxscore_index"]
get_team_schedule("Virginia")
It seems like "box" and "df" are pandas data frame, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with ".loc")
for i, row_df in df.iterrows():
for j, row_box in box.iterrows():
if row_box["boxscore_index"].contains(row_df["date"]):
df.at[i, 'date'] = row_box["boxscore_index"]
the ".at" function will overwrite the value at a given cell
Just fyi, iterrows is more efficient than .loc., however itertuples is about 10x faster, and zip about 100xs.
The Keyerror: 0 error is saying you can't get that row at index 0, because there is no index value of 0 using box.loc[i,"boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc. though, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc to that.
Like the other post said though, I wouldn't go that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterarte through that. Then use pandas to filter your df dataframe. I'm sort of making some assumptions of what df looks like, so if this doesn't work, or not what you looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule
def get_team_schedule(name):
combined = Schedule(name).dataframe
box_index_list = list(combined["boxscore_index"])
for box_index in box_index_list:
temp_game_data = df[df["date"] == boxscore_index]
print(box_index)
print(temp_game_data,'\n')
get_team_schedule("Virginia")

Pandas, series str accessor and SettingWithCopyWarning

I have got data frame named "test_msg" with columns:
SMS - message text,
Label - if its spam or not spam (ham)
Whenever I do something like this:
test_msg['SMS'] = test_msg['SMS'].str.replace('\W', ' ') #get rid of non-word characters
I got SettingWithCopyWarning. Somehow I set values to copy, but Im not sure where is this problem. My original dataframe after this operation is modified.
Could someone help me to crack this problem?
Your problem is not in this instruction but somewhere earlier in your code.
I made such experiment:
I created a source DataFrame (df), containing 3 rows:
SMS Label
0 Acdf xxx rr 10_20 1
1 BBbb x x x aa 20_30 1
2 Ccccc##?& ax^ax*ax. aa$ 20_30 1
Then I "copied" some rows from it to test_msg:
test_msg = df[:2]
containing first 2 rows from df. But note that I actually did not make
any copy (no new DataFrame has been created).
test_msg is only a view of df, i.e. test_msg draws data from
the buffer used by df.
Now when you atempt to modify this data (referring to test_msg),
SettingWithCopyWarning warning occurs.
To cope with this problem, create test_msg e.g. using loc:
test_msg = df.loc[:2]
Then test_msg is a DataFrame with its own data buffer (this time
you just made a copy), so now you can run your the instruction in question
with no warning (try yourself).
Another option is to use copy() method:
test_msg = df[:2].copy()
with the same result.

How to subset a dataframe and resolve the SettingWithCopy warning in Python?

I've read an Excel sheet of survey responses into a dataframe in a Python 3 Jupyter notebook, and want to remove rows where the individuals are in one particular program. So I've subset from dataframe 'df' to a new dataframe 'dfgeneral' using .loc .
notnurse = df['Program Code'] != 'NSG'
dfgeneral = df.loc[notnurse,:]
I then want to map labels (I.e. Satisfied, Not Satisfied) to the codes that were used to represent them, and find the number of respondents who gave each response. Several questions use the same scale, so I looped through them:
q5list = ['Q5_1','Q5_2','Q5_3','Q5_4','Q5_5','Q5_6']
scale5_dict = {1:'Very satisfied',2:'Satisfied',3:'Neutral',
4:'Somewhat dissatisfied',5:'Not satisfied at all',
np.NaN:'No Response'}
for i in q5list:
dfgeneral[i] = df[i].map(scale5_dict)
print(dfgeneral[i].value_counts(dropna=False))
In the output, I get the SettingWithCopy warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I used .loc to create dfgeneral; is this a false positive, or what change should I make? Thank you for your help.
dfgeneral = df.loc[notnurse,:]
This line (second line) takes a slice of the DataFrame and assigns it to a variable. When you want to manipulate that variable, you see the warning (A value is trying to be set on a copy of a slice from a DataFrame).
Change that line to:
dfgeneral = df.loc[notnurse, :].copy()

Categories