How to subset a dataframe and resolve the SettingWithCopy warning in Python?

I've read an Excel sheet of survey responses into a dataframe in a Python 3 Jupyter notebook, and want to remove rows where the individuals are in one particular program. So I've subset from dataframe 'df' to a new dataframe 'dfgeneral' using .loc:
notnurse = df['Program Code'] != 'NSG'
dfgeneral = df.loc[notnurse,:]
I then want to map labels (i.e. Satisfied, Not Satisfied) to the codes that were used to represent them, and find the number of respondents who gave each response. Several questions use the same scale, so I looped through them:
q5list = ['Q5_1','Q5_2','Q5_3','Q5_4','Q5_5','Q5_6']
scale5_dict = {1:'Very satisfied', 2:'Satisfied', 3:'Neutral',
               4:'Somewhat dissatisfied', 5:'Not satisfied at all',
               np.NaN:'No Response'}
for i in q5list:
    dfgeneral[i] = df[i].map(scale5_dict)
    print(dfgeneral[i].value_counts(dropna=False))
In the output, I get the SettingWithCopy warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I used .loc to create dfgeneral; is this a false positive, or what change should I make? Thank you for your help.

dfgeneral = df.loc[notnurse,:]
This line (the second line of your code) takes a slice of the DataFrame and assigns it to a variable. When you later try to modify that variable, you see the warning (A value is trying to be set on a copy of a slice from a DataFrame).
Change that line to:
dfgeneral = df.loc[notnurse, :].copy()
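With the copy in place, the whole loop from the question runs without the warning. A minimal sketch, reusing df, notnurse, q5list, and scale5_dict from above (mapping from the subset itself, rather than from df, keeps everything on one frame):
notnurse = df['Program Code'] != 'NSG'
dfgeneral = df.loc[notnurse, :].copy()   # explicit copy: dfgeneral now owns its data
for q in q5list:
    # map the numeric codes to labels on the subset, not on the original df
    dfgeneral[q] = dfgeneral[q].map(scale5_dict)
    print(dfgeneral[q].value_counts(dropna=False))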

How to filter a dataframe and assign a value?

I am trying to assign an object (it could be a list, tuple, or string) to a specific cell in a dataframe, but it does not work. I am filtering first and then trying to assign the value. I am using:
df.loc[df['name']=='aagicampus'].reset_index(drop=True).at[0,'words']='test'
The expected result is the original dataframe with 'test' stored in the 'words' cell of the 'aagicampus' row.
It works if I create a copy of the dataframe, but I must keep the original dataframe to iterate later over a list and perform this procedure many times.
Thanks for your help.
You can do it by first getting the indices of the row(s) that you want to change, and then setting cells at one of those locations to the desired value.
This code gets the locations of rows that satisfy your condition of df['name'] == 'aagicampus':
locations = df.index[df['name'] == 'aagicampus']
then you just use .loc on locations[0] to change the first row that satisfies the condition. Here it is all together:
df = pd.DataFrame({'name':['something','aagicampus','something'], 'words':['unchanged', 'unchanged', 'unchanged'] })
locations = df.index[df['name'] == 'aagicampus']
df.words.loc[locations[0]] = 'CHANGED'
df.head()
This will return a table:
   name        words
0  something   unchanged
1  aagicampus  CHANGED
2  something   unchanged
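Note that df.words.loc[...] is itself chained indexing, and depending on the pandas version it can raise the same SettingWithCopyWarning; the equivalent single .loc call (a minor variant of the code above) avoids that:
df.loc[locations[0], 'words'] = 'CHANGED'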

Python loop to search multiple sets of keywords in all columns of dataframe

I've used the code below to search across all columns of my dataframe to see if each row has the word "pool" and the words "slide" or "waterslide".
AR11_regex = r"""
(?=.*(?:slide|waterslide)).*pool
"""
f = lambda x: x.str.findall(AR11_regex, flags=re.VERBOSE|re.IGNORECASE)
d['AR']['AR11'] = d['AR'].astype(str).apply(f).any(1).astype(int)
This has worked fine but when I want to write a for loop to do this for more than one regex pattern (e.g., AR11, AR12, AR21) using the code below, the new columns are all zeros (i.e., the search is not finding any hits)
for i in AR_list:
    print(i)
    pat = i+"_regex"
    print(pat)
    f = lambda x: x.str.findall(i+"_regex", flags=re.VERBOSE|re.IGNORECASE)
    d['AR'][str(i)] = d['AR'].astype(str).apply(f).any(1).astype(int)
Any advice on why this loop didn't work would be much appreciated!
A small sample data frame would help understand your question. In any case, your code sample appears to have a multitude of problems.
i+"_regex" is just the string "AR11_regex". It won't evaluate to the value of the variable with the identifier AR11_regex. Put your regex patterns in a dict.
d['AR'] is the values in the AR column. It seems like you expect it to be a row.
d['AR'][str(i)] is adding a new row. It seems like you want to add a new column.
Lastly, this approach to setting a cell generally (always for me) yields the following warning:
/var/folders/zj/pnrcbb6n01z2qv1gmsk70b_m0000gn/T/ipykernel_13985/876572204.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The suggested approach would be to use .at, as in d.at[str(i), 'AR'] or some such.
Add a sample data frame and refine your question for more suggestions.
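For example, here is a minimal sketch of the dict approach, assuming d['AR'] is a DataFrame of string columns; only the AR11 pattern comes from the question, the rest are placeholders:
import re

patterns = {
    'AR11': r"(?=.*(?:slide|waterslide)).*pool",
    # 'AR12': ..., 'AR21': ...  # add the remaining patterns here
}
for name, pat in patterns.items():
    # True wherever any column of the row matches the pattern
    hits = d['AR'].astype(str).apply(
        lambda col: col.str.contains(pat, flags=re.IGNORECASE))
    d['AR'][name] = hits.any(axis=1).astype(int)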

Append std,mean columns to a DataFrame with a for-loop

I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple Excel files.)
I use a for-loop and append(), but the result contains only the last row, not all of them.
here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly, reads individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
    final.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}, ignore_index=True)
I am not sure, but I believe you should assign the result of final.append(...) back to the variable:
final = final.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}, ignore_index=True)
You also need to move final = pd.DataFrame(columns=[...]) above the loop; as written, it re-creates an empty final on every iteration, which is why you only ever see the last row.
Update
If time efficiency is of interest to you, it is suggested to collect your row dicts ({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean}) in a list and build the dataframe from that list in one go; it is said to have better performance. (Thanks to #stefan_aus_hannover.)
This is what I am referring to in the comments on Amirhossein's answer:
hh = ['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    lister.append({'Month':j, 'Hour':j, 'standard deviation':s_td, 'average':meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
Conceptually you're just doing an aggregate by hour, with the two functions std and mean, then appending each result to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note that the .agg/.aggregate() function accepts a dict of {'result_col': aggregating_function}, allows you to pass multiple aggregating functions, and directly names their result columns, so there is no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), there is no need to read in columns 0..3.
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for hour in hh:
    # Read in columns-of-interest from the individual Excel sheet for this month and hour...
    data = get_data(1, hour)
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    # Compute the corresponding row of the aggregate (month is fixed at 1 in the question)...
    agg_row = data['Total Load (MWh)'].agg({'standard deviation':'std', 'average':'mean'})
    agg_row['Month'] = 1
    agg_row['Hour'] = hour
    final = final.append(agg_row, ignore_index=True)
Notes:
pd.read_excel's usecols=['Flowday','Interval',...] allows you to avoid reading in columns that you aren't interested in in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass it the list of columns-of-interest. You seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store the separate local variables s_td and meean; just use .aggregate() directly.
There's no need to have both lister and final. Just have one results dataframe final, and append to it, ignoring the index. (If you get issues with that, post updated code here, make sure it's reproducible)
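One version note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On newer versions, collect the rows in a plain list and construct the frame once; a sketch reusing get_data and hh from the question:
rows = []
for hour in hh:
    data = pd.DataFrame(get_data(1, hour),
                        columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    rows.append({'Month': 1, 'Hour': hour,
                 'standard deviation': data['Total Load (MWh)'].std(),
                 'average': data['Total Load (MWh)'].mean()})
final = pd.DataFrame(rows, columns=['Month','Hour','standard deviation','average'])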

Select specific rows on pandas based on condition

I have a dataframe with a column called bmi (Body Mass Index) containing int values.
I have to separate the values in the bmi column into Under weight, Normal, Over weight and Obese based on the values. I wrote a loop for this (posted as a screenshot), but I am getting an error. I am a beginner; I just started coding 2 weeks back.
Generally speaking, using a for loop in pandas is usually a bad idea. Pandas allows you to manipulate data easily. For example, if you want to filter by some condition:
print(df[df["bmi"] > 30])
will print all rows where bmi > 30. It works as follows: df[condition]. The condition in this case is that "bmi" is larger than 30, so our condition is df["bmi"] > 30. Notice that the line df[df["bmi"] > 30] returns all rows that satisfy the condition. I printed them, but you can manipulate them however you like.
Even though it's a bad technique (or one used only for specific needs), you can of course iterate through a dataframe. This is not done via for l in df, as df is a dataframe object. To iterate through it you can use iterrows:
for index, row in df.iterrows():
    if row["bmi"] > 30:
        print("Obese")
Also, for next time, please provide your code inline; don't paste an image of it.
If your goal is to separate into different labels, I suggest the following:
df.loc[df[df["bmi"] > 30, "NewColumn"] = "Obese"
df.loc[df[df["bmi"] < 18.5, "NewColumn"] = "Underweight"
.loc operator allows me to manipulate only part of the data. It's format is [rows, columns]. So the above code takes on rows where bmi>30, and it takes only "NewColumn" (change it whatever you like) which is a new column. It puts the value on the right to this column. That way, after that operation, you have a new column in your dataframe which has "Obese/Underweight" as you like.
As a side note, there are better ways to map values (e.g. pandas' map and others), but if you are a beginner, it's important to understand simple methods of manipulating data before diving into more complex ones. That's why I am not explaining the more complex methods here.
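For reference, one such vectorized mapping is np.select. A sketch with the thresholds from this answer; the default label for the unmatched middle range is a placeholder of my own:
import numpy as np

conditions = [df["bmi"] > 30, df["bmi"] < 18.5]
choices = ["Obese", "Underweight"]
df["NewColumn"] = np.select(conditions, choices, default="Other")  # "Other" marks rows matching neither condition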
First of all, as mentioned in the comments, you should post text/code instead of screenshots.
You could do binning in pandas:
bmi_labels = ['Normal', 'Overweight', 'Obese']
cut_bins = [18.5, 24.9, 29.9, df["bmi"].max()]
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)
Here, I have made a separate column (bmi_label) to store the labels, but you could do it in the same column (bmi) too.
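The question also asks for an Under weight label, so if your data can fall below 18.5 you can extend the bins to cover all four categories. A sketch; the lower edge of 0 is an assumption, and it requires df['bmi'].max() > 29.9 so the edges stay increasing:
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
cut_bins = [0, 18.5, 24.9, 29.9, df['bmi'].max()]
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)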

SettingWithCopy when creating a new column and when dropping NaN rows [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 4 years ago.
I've been searching around, reading the pandas docs here and trying different lines of code from questions posted around here and here, and I can't seem to get away from the SettingWithCopy warning. I'd prefer to learn to code it the "right" way as opposed to just ignoring the warnings.
The following lines of code are inside a for loop and I don't want to generate this warning a lot of times because it could slow things down.
I'm trying to make a new column with the name 'E'+vs, where vs is a string from a list iterated over in the for loop.
But for each one of them, I still get the following warning, even with the last 3 lines:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here are the troublesome lines I've tried so far:
#based on research, the first two seem to be the "wrong" way
df_out['E'+vs] = df_out[kvs].rolling(v).mean().copy()
df_out['E'+vs] = df_out[kvs].rolling(v).mean()
df_out.loc[:,'E'+vs] = df_out[kvs].rolling(v).mean().copy()
df_out.loc[:,'E'+vs] = df_out[kvs].rolling(v).mean()
df_out.loc[:,'E'+vs] = df_out.loc[:,kvs].rolling(v).mean()
The other one that gives a SettingWithCopyWarning is this:
df_out.dropna(inplace=True,axis=0)
This one also gave a warning (but I figured this one would)
df_out = df_out.dropna(inplace=True,axis=0)
How do I do both of these operations correctly?
EDIT: Here is the code that produced the original df_out
df_out = pd.concat([vol.Date[1:-1], ret.Return_Time[:-2], vol.Freq_Time[:-2],
                    vol.Freq_Time[:-1].shift(-1), vol.Freq_Time[:].shift(-2)],
                   axis=1).dropna().set_index('Date')
This is a confusing topic. It's not the code you've posted that is the problem; it's the code you haven't posted: the code that generated df_out.
Consider this example and note the last line that generates the warning.
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other[:]
df_out['E'] = 5
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Now we'll try an equivalent thing that won't produce the warning
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other.loc[:]
df_out['E'] = 5
Then print df_out:
   A  B  E
0  1  2  5
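An explicit copy also avoids the warning and makes the intent unambiguous; an equivalent sketch:
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other.copy()   # fully independent frame, is_copy stays None
df_out['E'] = 5            # no warning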
It boils down to pandas deciding to attach an is_copy attribute to a dataframe when it's constructed based on lots of criteria.
Notice that
df_other[:].is_copy
returns a weakref:
<weakref at 0x103323458; to 'DataFrame' at 0x116a684e0>
whereas
df_other.loc[:].is_copy
returns None.
So what types of construction trigger the copy? I still don't know everything, and not even the things I know all make sense to me.
Like why does this not trigger it?
df_other[['A', 'B', 'E']].is_copy
First off, I am not sure this is either efficient or the best approach. However, I had the same issue when I was adding a new column to an existing dataframe, and I decided to use the reset_index method.
Here I first drop NaN rows from the EMPLOYEES column and assign this manipulated data frame to a new data frame df1, then I add a COMPANY_SIZE column to df1 as follows:
df1 = all_merged_years.dropna(subset=['EMPLOYEES']).reset_index()
column = df1['EMPLOYEES']
Size = []
df1['COMPANY_SIZE'] = ' '
for number in column:
    if number <= 999:
        Size.append('Small')
    elif 999 < number <= 9999:
        Size.append('Medium')
    elif 9999 < number:
        Size.append('Large')
    else:
        Size.append('UNKNOWN')
df1['COMPANY_SIZE'] = Size
This way I did NOT get a warning as such. Hope that helps.
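For what it's worth, the same labeling can be done without the loop using pd.cut. A sketch; the bin edges mirror the thresholds above, and the earlier dropna has already removed the values that would have fallen into UNKNOWN:
df1['COMPANY_SIZE'] = pd.cut(df1['EMPLOYEES'],
                             bins=[0, 999, 9999, float('inf')],
                             labels=['Small', 'Medium', 'Large'])  # (0,999] Small, (999,9999] Medium, above Large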
