Pandas: SettingWithCopyWarning trigger location - python

With SettingWithCopyWarning, sometimes it refers you to the exact line of code in your module that triggered the warning (e.g. here) and other times it doesn't (e.g. here).
Short of going through each line of the code (doesn't sound too appealing if you're reviewing hundreds of lines of code), is there a way to pinpoint the line of code that triggered the warning assuming the warning does not return that information?
I wonder if it is a bug that the warning can be emitted without pinpointing the specific code that triggered it.
Warning (from warnings module):
File "C:\Python34\lib\site-packages\pandas\core\indexing.py", line 415
self.obj[item] = s
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Python 3.4, Pandas 0.15.0

You can have pandas raise SettingWithCopyError instead of SettingWithCopyWarning by setting the mode.chained_assignment option to raise instead of warn.
import pandas as pd
pd.set_option('mode.chained_assignment', 'raise')
https://pandas.pydata.org/pandas-docs/stable/options.html#available-options
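With the option set to 'raise', the resulting SettingWithCopyError comes with a full traceback that pinpoints the offending line for you. A minimal sketch (the DataFrame here is made up for illustration; chained-assignment detection applies to pandas versions before copy-on-write became the default):
import pandas as pd
pd.set_option('mode.chained_assignment', 'raise')
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# The chained assignment below now raises SettingWithCopyError,
# and the traceback names this exact file and line number.
df[df['A'] > 1]['B'] = 0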

Related

List all Pandas ParserError when using pd.read_csv()

I have a csv with multiple lines that produce the following error:
df1 = pd.read_csv('df1.csv', skiprows=[176, 2009, 2483, 3432, 7486, 7608, 7990, 11992, 12421])
ParserError: Error tokenizing data. C error: Expected 52 fields in line 12541, saw 501
As you can probably notice, I have multiple lines that produce a ParserError.
To work around this, I keep adding the offending line numbers to skiprows and re-running. I have over 30K lines and would rather handle them all at once than repeatedly hitting run in Jupyter Notebook, getting a new error, and updating the list. Ideally it would just skip the bad lines and parse the rest; I've tried googling a solution that way, but all the SO responses were too complicated for me to follow and reproduce for my data structures.
P.S. Why is it that when skipping just one line, like 177, I can simply enter skiprows = 177, but when passing skiprows a list, I have to use the errored line number minus 1? Why does the counting change?
pandas ≥ 1.3
You should use the on_bad_lines parameter of read_csv (pandas ≥ 1.3.0):
df1 = pd.read_csv('df1.csv', on_bad_lines='warn')
This will skip the invalid lines and give you a warning. If you use on_bad_lines='skip' you skip the lines without warning. The default value of on_bad_lines='error' raises an error for the first issue and aborts.
pandas < 1.3
Use the parameters error_bad_lines=False and warn_bad_lines=True instead (both were deprecated in 1.3).
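To actually list every bad line in one pass, as the title asks, one option is to pass a callable as on_bad_lines (pandas ≥ 1.4, only supported with engine='python'). A sketch, with the filename taken from the question:
import pandas as pd
bad_lines = []
def collect_bad_line(line):
    # line is the list of fields parsed from the offending row
    bad_lines.append(line)
    return None  # returning None tells pandas to drop the row
df1 = pd.read_csv('df1.csv', engine='python', on_bad_lines=collect_bad_line)
print(f"skipped {len(bad_lines)} bad lines")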

Pandas: ValueError: unknown type str32 for string comparison

The following code throws ValueError: unknown type str32 for string comparison:
import pandas as pd
# Loading in some bigger data from Kaggle https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
# data and code file included in zip to make it easy
df = pd.read_csv("AB_NYC_2019.csv")
print(df == "x") # throws ValueError
The last line of code seems legitimate. What is being done wrong?
This error is related to a bug that affected pandas version 1.1.0 and some versions prior to 1.0.5. It has been fixed in version 1.1.3.
Therefore, to make it go away, upgrade pandas to version 1.1.3 or later.
The bug does not manifest with smaller datasets (or ones not loaded from CSV).
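A quick way to confirm whether you are on an affected version, and that the comparison works after upgrading (e.g. pip install --upgrade "pandas>=1.1.3"); the CSV filename is the one from the question:
import pandas as pd
print(pd.__version__)  # affected: 1.1.0 and some releases before 1.0.5
df = pd.read_csv("AB_NYC_2019.csv")
mask = df == "x"  # on a fixed version this elementwise comparison succeeds
print(mask.any().any())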

Ignore UserWarning from openpyxl using pandas

I have tons of .xlsm files that I have to load. Each Excel file has 6 sheets. Because of that, I'm opening each Excel file like this, using pandas:
for excel_file in files_list:
    with pd.ExcelFile(excel_file, engine="openpyxl") as f:
        df1 = pd.read_excel(f, "Sheet1")
        df2 = pd.read_excel(f, "Sheet2")
        df3 = pd.read_excel(f, "Sheet3")
        ...
After each iteration I pass the DataFrames to another function and do some work with them. I'm using pd.ExcelFile so that each file is loaded into memory just once before being split into DataFrames.
However, when doing this, I am getting the following warning:
/opt/anaconda3/lib/python3.8/site-packages/openpyxl/worksheet/_reader.py:300: UserWarning: Data Validation extension is not supported and will be removed
warn(msg)
Despite the warning, the information is loaded correctly from the Excel file and no data is missing. Loading each Excel file and all of its sheets into DataFrames takes about 0.8 s. If I use pandas' default engine instead, the warning goes away, but the time per file goes up to 5 or even 6 seconds.
I saw this post, but there wasn't an answer on how to remove the warning, which is what I need, as everything's working correctly.
How can I disable said UserWarning?
You can do this using the standard library's warnings module:
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
To silence only the warnings coming from a particular module, add the module argument:
warnings.filterwarnings('ignore', category=UserWarning, module='openpyxl')
See What causes "UserWarning: Discarded range with reserved name" - openpyxl -- a different warning, but the same solution applies -- and put the warnings back to default after you open the workbook, since there may be other warnings that you do want to see:
import warnings
from openpyxl import load_workbook
warnings.simplefilter("ignore")
wb = load_workbook(path)
warnings.simplefilter("default")
If you want to ignore this warning specifically, and only in a given context, you can combine catch_warnings and filterwarnings with the message argument. E.g.:
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message="Data Validation extension is not supported and will be removed")
    data = pd.read_excel(f, sheet_name=None)
Note: sheet_name=None will read all the Excel sheets in one go too.
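Putting it together with the loop from the question (files_list and the sheet names are the asker's; this is just an illustrative sketch):
import warnings
import pandas as pd
for excel_file in files_list:
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message="Data Validation extension is not supported and will be removed")
        # One read per file; returns a dict mapping sheet name -> DataFrame.
        sheets = pd.read_excel(excel_file, sheet_name=None, engine="openpyxl")
    df1 = sheets["Sheet1"]
    df2 = sheets["Sheet2"]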

Strange warning using dask.dataframe to read csv

I am using dask dataframe module to read a csv.
In [3]: from dask import dataframe as dd
In [4]: dd.read_csv("/file.csv", sep=",", dtype=str, encoding="utf-8", error_bad_lines=False, collection=True, blocksize=64e6)
I used to this with no problem, but today a strange warning showed up:
FutureWarning: The default value of auto_mkdir=True has been deprecated and will be changed to auto_mkdir=False by default in a future release.
FutureWarning,
This didn't worry me until I realised it breaks my unit tests: run from the console it's just a warning, but my app's test suite is configured to fail on warnings, so the tests break because of it.
Does anyone know the cause of this warning or how to get rid of it?
Auto-answering for documentation:
This issue appears in fsspec==0.6.3 and dask==2.12.0 and will be removed in the future.
To prevent pytest failing because of the warning, add or edit a pytest.ini file in your project and set:
[pytest]
filterwarnings =
    error
    ignore::FutureWarning
If you want dask to silence the warning altogether, explicitly set the option in the call: storage_options={"auto_mkdir": True}.
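For instance, applied to the read_csv call from the question (a sketch, trimmed to the essentials; only storage_options is new):
from dask import dataframe as dd
df = dd.read_csv(
    "/file.csv",
    sep=",",
    dtype=str,
    encoding="utf-8",
    blocksize=64e6,
    # Passing auto_mkdir explicitly silences the FutureWarning.
    storage_options={"auto_mkdir": True},
)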
I got the same thing. Finding no answers as to what might have replaced the feature, I decided to see if the feature is even needed any more. Sure enough, as of Pandas 1.3.0 the warnings that previously motivated the feature no longer appear. So
pd.read_csv(import_path, error_bad_lines=False, warn_bad_lines=False, names=cols)
simply became
pd.read_csv(import_path, names=cols)
and works fine with no errors or warnings.

How to ignore SettingWithCopyWarning using warnings.simplefilter()?

The question:
Can I ignore or prevent the SettingWithCopyWarning to be printed to the console using warnings.simplefilter()?
The details:
I'm running a few data cleaning routines with pandas, executed in the simplest of ways via a batch file. One of the lines in my Python script triggers the SettingWithCopyWarning, which is printed to the console and also echoed in the command prompt.
Aside from sorting out the source of the warning, is there any way I can prevent the message from being printed to the prompt, the way I can with FutureWarnings via warnings.simplefilter(action="ignore", category=FutureWarning)?
Though I would strongly advise fixing the underlying issue, it is possible to suppress the warning by importing it from pandas.core.common (I found where it's defined on GitHub; in newer pandas versions it is available from pandas.errors instead).
Example:
import warnings
import pandas as pd
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
df = pd.DataFrame(dict(A=[1, 2, 3], B=[2, 3, 4]))
df[df['A'] > 2]['B'] = 5 # No warnings for the chained assignment!
You can use:
pd.set_option('mode.chained_assignment', None)
# This code will not complain!
pd.reset_option("mode.chained_assignment")
Or if you prefer to use it inside a context:
with pd.option_context('mode.chained_assignment', None):
    ...  # This code will not complain!
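For example, a minimal sketch with a made-up DataFrame:
import pandas as pd
df = pd.DataFrame(dict(A=[1, 2, 3], B=[2, 3, 4]))
with pd.option_context('mode.chained_assignment', None):
    df[df['A'] > 2]['B'] = 5  # no warning inside the context
# Outside the context the previous warning behaviour is restored.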
