Pandas, series str accessor and SettingWithCopyWarning - python

I have a data frame named "test_msg" with columns:
SMS - the message text,
Label - whether it is spam or not spam (ham).
Whenever I do something like this:
test_msg['SMS'] = test_msg['SMS'].str.replace(r'\W', ' ', regex=True)  # get rid of non-word characters
I get a SettingWithCopyWarning. Apparently I am setting values on a copy, but I'm not sure where the problem is. My original dataframe is modified after this operation.
Could someone help me crack this problem?

Your problem is not in this instruction but somewhere earlier in your code.
I ran the following experiment:
I created a source DataFrame (df), containing 3 rows:
SMS Label
0 Acdf xxx rr 10_20 1
1 BBbb x x x aa 20_30 1
2 Ccccc##?& ax^ax*ax. aa$ 20_30 1
Then I "copied" some rows from it to test_msg:
test_msg = df[:2]
containing the first 2 rows of df. But note that I did not actually make
any copy (no new DataFrame has been created).
test_msg is only a view of df, i.e. test_msg draws its data from
the buffer used by df.
Now when you attempt to modify this data (referring to test_msg),
the SettingWithCopyWarning occurs.
To cope with this problem, create test_msg e.g. using loc:
test_msg = df.loc[:1]
(note that loc slices by label and is inclusive at both ends, so :1 covers
the first 2 rows here). Pandas does not flag the result of loc as a copy
of df, so now you can run the instruction in question with no warning
(try it yourself).
Another option is the copy() method:
test_msg = df[:2].copy()
which creates a new DataFrame with its own data buffer, with the same result.
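For reference, a minimal self-contained sketch of the whole experiment (data rows copied from the example above):
import pandas as pd

df = pd.DataFrame({
    'SMS': ['Acdf xxx rr 10_20',
            'BBbb x x x aa 20_30',
            'Ccccc##?& ax^ax*ax. aa$ 20_30'],
    'Label': [1, 1, 1],
})

test_msg = df[:2]  # plain slice: pandas flags it as a possible view of df
test_msg['SMS'] = test_msg['SMS'].str.replace(r'\W', ' ', regex=True)  # SettingWithCopyWarning

test_msg = df[:2].copy()  # explicit copy with its own data buffer
test_msg['SMS'] = test_msg['SMS'].str.replace(r'\W', ' ', regex=True)  # no warning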

Related

Real-time data appending - Pandas

I am trying to do something very basic in pandas and failing miserably.
From a high level, I am taking ask_size data from my broker, who passes the value to me on every tick update.
I can print out the last value easily enough.
All I am trying to do is append each new ask_size after the previous one, as a new row at the end of a df, so I can do some historical analysis.
def getTickSize():
    askSize_list = []  # empty list
    askSize_list.append(float(ask_size))  # getting askSize and putting it in a list
    datagrab = {'ask_size': askSize_list}  # creating the single column and putting askSize in
    df = pd.DataFrame(datagrab)  # using a pd df
    print(df.tail(10))
I am then calling the function from a different part of my script.
However, the output only ever shows the last askSize:
askSize
0 30.0
And never actually appends the real-time data
Clearly I am doing something wrong, but I am at a loss to what.
I have also tried using ignore_index=True in a second df, referencing the first, but no joy:
askSize
0 30.0
1 30.0
I have also tried using for loops, but as there doesn't seem to be anything to iterate over (the data is real-time), I came to a dead end.
(note I will also eventually add a timestamp to each new ask_size as it is appended to the list. So only 2 columns, in the end)
Any help is much appreciated
It seems you are creating a new dataframe on every call, not appending new data.
You could, for example, create a new dataframe with the row(s) in the same format and concatenate it onto the existing data frame.
Let's say you already have df created. You want to add 1 new entry that is read as a parameter (if you need more, specify more parameters). Here is a basic example:
'ask_size'
1.0
2.0
def append_row(newdata, dataframe):
    row = {'ask_size': [newdata]}
    temp_df = pd.DataFrame(row)
    # merge the original dataframe with temp_df
    merged_df = pd.concat([dataframe, temp_df], ignore_index=True)
    return merged_df

df = append_row(5.1, df)  # this will overwrite your original df
'ask_size'
1.0
2.0
5.1
You would need to call the function to add a new row (for instance calling it from inside a loop or any other part of the code).
You can also use df.append(), although note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the recommended route. Here are some links that could be useful for your use case:
Merge, join, concatenate and compare (pandas.pydata.org)
Example of using pd.append() (pandas.pydata.org)
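For completeness, a rough sketch of how this could be wired into a tick handler; on_tick is a hypothetical callback standing in for whatever your broker API fires on each update, and the timestamp column is the one mentioned in the question:
import pandas as pd

df = pd.DataFrame({'ask_size': [], 'timestamp': []})  # running history, starts empty

def append_row(newdata, dataframe):
    row = {'ask_size': [float(newdata)], 'timestamp': [pd.Timestamp.now()]}
    return pd.concat([dataframe, pd.DataFrame(row)], ignore_index=True)

def on_tick(ask_size):  # hypothetical: called by the broker API on every tick
    global df
    df = append_row(ask_size, df)  # keep the running history
    print(df.tail(10))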

How to modify multiple values in one column, but skip others in pandas python

Going on two months in Python, and I am focusing hard on pandas right now. In my current position I use VBA on this data, so I am learning pandas to slowly replace it and further my career.
As of now I believe my real problem is a lack of understanding of a key concept (or concepts). Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do this kind of precise filtering? I'm very close, but there is one key aspect I'm missing.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and keeps only the first 9 digits. Yet I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no
dashes, as 000000000, with three fewer 0s, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, e.g. 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like this (which does not work):
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs"]
if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313 2235 Narnia
3 002730032 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, aka the Excel file I am using. "ID" is
my column heading. I will make it the index after the fact.
My data frame for this job is small, with a (rows, columns) shape of (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. Starting to test for loops with this as well... no luck there yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
# Conditional indexing: overwrite only rows whose ID is not in lst
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"]
The second way is to write a function that conditionally converts the IDs; it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the unique IDs untouched

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
This is based on #xyzxyzjayne's answers, but I have two issues I cannot figure out.
First issue
I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore it even though it works; this is a learning opportunity, I say.
Second issue
I do not understand this part of the code. I know the left side is supposed to be rows and the right side columns. That said, why does this work? ID is a column, not a row, when this code is run:
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The part I don't understand yet is the left side of the comma (,) in this piece:
df.loc[~df["ID "].isin(uniqueID), "ID "]
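For what it's worth, the expression on the left of the comma is a boolean Series, and .loc uses it as a row mask; here is a tiny illustration with made-up data:
import pandas as pd

df = pd.DataFrame({'ID ': ['004-330-002-000', '000-000-000_a']})
uniqueID = ['000-000-000_a']

mask = ~df['ID '].isin(uniqueID)  # one True/False per row
print(mask)
# 0     True
# 1    False
# dtype: bool

# df.loc[mask, 'ID '] selects column 'ID ' in the True rows only,
# so the assignment touches exactly those cells.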
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
uniqueID = [
    # a whole list of IDs; I had to manually enter 1000+ entries that
    # go in this list. These IDs get skipped. Example: "032-234-987_#4256"
]

# get just the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]

# Place Holder will be our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]

# the next line is the filter that goes through the list and skips those IDs.
# Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]

# make the ID our index
df = df.set_index("ID ")

# just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")

# make it an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained-indexing problem by making a copy of the original dataframe before filtering, and making everything .loc, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.

How to read SPSS aka (.sav) in Python

It's my first time using Jupyter Notebook to analyze survey data (a .sav file), and I would like to read it in a way that shows the metadata so I can connect the answers with the questions. I'm a total newbie in this field, so any help is appreciated!
import pandas as pd
import pyreadstat
df, meta = pyreadstat.read_sav('./SimData/survey_1.sav')
type(df)
type(meta)
df.head()
Please lmk if there is an additional step needed for me to be able to see the metadata!
The meta object contains the metadata you are looking for. Probably the most useful attributes to look at are:
meta.column_names_to_labels: a dictionary mapping the column names you have in your pandas dataframe to labels, i.e. longer explanations of the meaning of each column
print(meta.column_names_to_labels)
meta.variable_value_labels: a dict where keys are column names and values are dicts where the keys are values you find in your dataframe and the values are value labels.
print(meta.variable_value_labels)
For instance, if you have a column "gender" with values 1 and 2, you could get:
{"gender": {1: "male", 2: "female"}}
which means value 1 is male and 2 is female.
You can get those labels applied from the beginning if you pass the argument apply_value_formats:
df, meta = pyreadstat.read_sav('survey.sav', apply_value_formats=True)
You can also apply those value formats to your dataframe at any time with pyreadstat.set_value_labels, which returns a copy of your dataframe with labels:
df_copy = pyreadstat.set_value_labels(df, meta)
meta.missing_ranges: this gives you labels for missing values. Let's say that in the survey a certain variable was encoded with 1 meaning yes, 2 meaning no, and then missing values: 5 meaning didn't answer, 6 meaning person not at home. When you read the dataframe, by default you will get the values 1, 2 and NaN (missing) instead of 5 and 6. You can pass the argument user_missing=True to get 5 and 6, and meta.missing_ranges will tell you that 5 and 6 are missing values. variable_value_labels will give you the "didn't answer" and "person not at home" labels.
df, meta = pyreadstat.read_sav("survey.sav", user_missing=True)
print(meta.missing_ranges)
print(meta.variable_value_labels)
These are the pieces of information potentially useful for your case; not necessarily all of them will be present in your dataset.
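As a small usage sketch tying these together (assuming the survey file from the question), the column labels can also be applied to the dataframe itself:
import pyreadstat

df, meta = pyreadstat.read_sav('./SimData/survey_1.sav', apply_value_formats=True)

# replace the short column names with the full question text from the metadata
df_labeled = df.rename(columns=meta.column_names_to_labels)
print(df_labeled.head())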
More information here: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html

SettingWithCopy when creating new column and when dropping NaN rows [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 4 years ago.
I've been searching around, reading the pandas docs here and trying different lines of code from questions posted around here and here, and I can't seem to get away from the SettingWithCopy warning. I'd prefer to learn to code it the "right" way as opposed to just ignoring the warnings.
The following lines of code are inside a for loop, and I don't want to generate this warning a lot of times because it could slow things down.
I'm trying to make a new column with the name 'E'+vs, where vs is a string in a list the for loop iterates over.
But for each of them I still get the following warning, even with the last 3 lines:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here are the troublesome lines I've tried so far:
#based on research, the first two seem to be the "wrong" way
df_out['E'+vs] = df_out[kvs].rolling(v).mean().copy()
df_out['E'+vs] = df_out[kvs].rolling(v).mean()
df_out.loc[:,'E'+vs] = df_out[kvs].rolling(v).mean().copy()
df_out.loc[:,'E'+vs] = df_out[kvs].rolling(v).mean()
df_out.loc[:,'E'+vs] = df_out.loc[:,kvs].rolling(v).mean()
The other one that gives a SettingWithCopyWarning is this:
df_out.dropna(inplace=True,axis=0)
This one also gave a warning (but I figured this one would)
df_out = df_out.dropna(inplace=True,axis=0)
How do I do both of these operations correctly?
EDIT: Here is the code that produced the original df_out
df_out = pd.concat([vol.Date[1:-1], ret.Return_Time[:-2], vol.Freq_Time[:-2],
                    vol.Freq_Time[:-1].shift(-1), vol.Freq_Time[:].shift(-2)],
                   axis=1).dropna().set_index('Date')
This is a confusing topic. It's not the code you've posted that is the problem; it's the code you haven't posted, the code that generated df_out.
Consider this example and note the last line that generates the warning.
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other[:]
df_out['E'] = 5
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Now we'll try an equivalent thing that won't produce the warning
df_other = pd.DataFrame(dict(A=[1], B=[2]))
df_out = df_other.loc[:]
df_out['E'] = 5
Then
print(df_out)
A B E
0 1 2 5
It boils down to pandas deciding to attach an is_copy attribute to a dataframe when it's constructed, based on lots of criteria.
Notice that
df_other[:].is_copy
<weakref at 0x103323458; to 'DataFrame' at 0x116a684e0>
whereas
df_other.loc[:].is_copy
returns None.
So what types of construction trigger the copy? I still don't know everything, and not all of the things I know make sense to me.
For example, why does this not trigger it?
df_other[['A', 'B', 'E']].is_copy
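For the two operations in the question, here is a sketch of what usually silences the warning, assuming df_out was built from slices of other frames as in the edit:
# break the link to the parent frame(s) right after constructing df_out
df_out = df_out.copy()

# inside the loop, plain column assignment is then fine
df_out['E' + vs] = df_out[kvs].rolling(v).mean()

# dropna: either mutate in place or reassign the result, not both
df_out.dropna(axis=0, inplace=True)
# or: df_out = df_out.dropna(axis=0)
# (df_out = df_out.dropna(inplace=True, axis=0) assigns None, since inplace returns nothing)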
First off, I am not sure this is either efficient or the best approach. However, I had the same issue when I was adding a new column to an existing dataframe, and I decided to use the reset_index method.
Here I first drop NaN rows from the EMPLOYEES column and assign this manipulated data frame to a new data frame df1, then I add a COMPANY_SIZE column to df1 as follows:
df1 = all_merged_years.dropna(subset=['EMPLOYEES']).reset_index()
column = df1['EMPLOYEES']
Size = []
df1['COMPANY_SIZE'] = ' '
for number in column:
    if number <= 999:
        Size.append('Small')
    elif 999 < number <= 9999:
        Size.append('Medium')
    elif 9999 < number:
        Size.append('Large')
    else:
        Size.append('UNKNOWN')
df1['COMPANY_SIZE'] = Size
This way I did NOT get any such warning. Hope that helps.
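As an aside, a more idiomatic sketch of the same binning with pd.cut, using the bin edges from the loop above (this assumes the dropna on EMPLOYEES has already run, so the UNKNOWN case does not arise):
import numpy as np
import pandas as pd

df1['COMPANY_SIZE'] = pd.cut(
    df1['EMPLOYEES'],
    bins=[0, 999, 9999, np.inf],      # (0, 999] Small, (999, 9999] Medium, rest Large
    labels=['Small', 'Medium', 'Large'],
    include_lowest=True,              # so a count of 0 still falls in Small
)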

How to subset a dataframe and resolve the SettingWithCopy warning in Python?

I've read an Excel sheet of survey responses into a dataframe in a Python 3 Jupyter notebook, and I want to remove rows where the individuals are in one particular program. So I've subset from dataframe 'df' to a new dataframe 'dfgeneral' using .loc:
notnurse = df['Program Code'] != 'NSG'
dfgeneral = df.loc[notnurse,:]
I then want to map labels (i.e. Satisfied, Not Satisfied) to the codes that were used to represent them, and find the number of respondents who gave each response. Several questions use the same scale, so I looped through them:
q5list = ['Q5_1', 'Q5_2', 'Q5_3', 'Q5_4', 'Q5_5', 'Q5_6']
scale5_dict = {1: 'Very satisfied', 2: 'Satisfied', 3: 'Neutral',
               4: 'Somewhat dissatisfied', 5: 'Not satisfied at all',
               np.NaN: 'No Response'}
for i in q5list:
    dfgeneral[i] = df[i].map(scale5_dict)
    print(dfgeneral[i].value_counts(dropna=False))
In the output, I get the SettingWithCopy warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I used .loc to create dfgeneral; is this a false positive, or what change should I make? Thank you for your help.
dfgeneral = df.loc[notnurse,:]
This line (the second line) takes a slice of the DataFrame and assigns it to a variable. When you then want to manipulate that variable, you see the warning (a value is trying to be set on a copy of a slice from a DataFrame).
Change that line to:
dfgeneral = df.loc[notnurse, :].copy()
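Putting it together, a sketch of the fixed flow; note the loop now maps from dfgeneral itself rather than from the unfiltered df (index alignment made the original assignment work, but this is clearer):
notnurse = df['Program Code'] != 'NSG'
dfgeneral = df.loc[notnurse, :].copy()  # explicit copy: no more warning

for i in q5list:
    dfgeneral[i] = dfgeneral[i].map(scale5_dict)
    print(dfgeneral[i].value_counts(dropna=False))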
