I am using dask dataframe module to read a csv.
In [3]: from dask import dataframe as dd
In [4]: dd.read_csv("/file.csv", sep=",", dtype=str, encoding="utf-8", error_bad_lines=False, collection=True, blocksize=64e6)
I used to do this with no problem, but today a strange warning showed up:
FutureWarning: The default value of auto_mkdir=True has been deprecated and will be changed to auto_mkdir=False by default in a future release.
FutureWarning,
This didn't worry me until I realised it breaks my unit tests: when run from the console it's simply a warning, but the test suite for my app now fails because of it.
Does anyone know the cause of this warning or how to get rid of it?
Answering my own question for documentation:
This issue appears with fsspec==0.6.3 and dask==2.12.0, and the warning will be removed in a future release.
To prevent pytest from failing because of the warning, add or edit a pytest.ini file in your project and set:
[pytest]
filterwarnings =
    error
    ignore::FutureWarning
If you want dask to silence the warning entirely, explicitly set the option in the function call: storage_options=dict(auto_mkdir=True)
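For example, the original call from the question with the option set explicitly (same arguments as above) should no longer emit the warning:

from dask import dataframe as dd

df = dd.read_csv(
    "/file.csv",
    sep=",",
    dtype=str,
    encoding="utf-8",
    error_bad_lines=False,
    collection=True,
    blocksize=64e6,
    storage_options=dict(auto_mkdir=True),  # silences the auto_mkdir FutureWarning
)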
I got the same thing. Finding no answers as to what might have replaced the feature, I decided to see if the feature is even needed any more. Sure enough, as of Pandas 1.3.0 the warnings that previously motivated the feature no longer appear. So
pd.read_csv(import_path, error_bad_lines=False, warn_bad_lines=False, names=cols)
simply became
pd.read_csv(import_path, names=cols)
and works fine with no errors or warnings.
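If you still need to skip malformed lines on pandas 1.3.0 or later, the two deprecated arguments were folded into a single on_bad_lines argument; a roughly equivalent call to the old one would be:

import pandas as pd

# pandas >= 1.3: on_bad_lines="skip" replaces error_bad_lines=False + warn_bad_lines=False
pd.read_csv(import_path, names=cols, on_bad_lines="skip")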
Related
I use the following function to write a pandas DataFrame to Excel
def write_dataset(
    train: pd.DataFrame,
    forecast: pd.DataFrame,
    config,
    out_path: str,
) -> None:
    forecast = forecast.rename(
        columns={
            "col": "col_predicted",
        }
    )
    df = pd.concat([train, forecast])
    df.drop(["id"], axis=1, inplace=True)
    if config.join_meta:
        df.drop(
            ["some_col", "some_other_col"],
            axis=1,
            inplace=True,
        )
    df.sort_values(config.id_columns, inplace=True)
    df.rename(columns={"date": "month"}, inplace=True)
    df["a_col"] = df["a_col"].round(2)
    df.to_excel(out_path, index=False)
Just before the df.to_excel() call, the DataFrame looks completely normal, just containing some NaNs. But the file it writes is a 0-byte file, which I can't even open with Excel. I use this function for 6 different dfs and somehow it works for some and not for others. Also, on my colleague's computer it always works fine.
I'm using Python 3.10.4, pandas 1.4.2 and openpyxl 3.0.9.
Any ideas what is happening and how to fix that behavior?
I encountered this issue on my Mac, and was similarly stumped for a while. Then I realized that the file appears as 0 bytes once the code has begun to create the file but hasn't yet finished.
So in my case, I found that all I had to do was wait a long time, and eventually (more than 5-10 minutes) the file jumped from 0 bytes to its full size. My file was about 14 MB, so it shouldn't have required that much time. My guess is that this is an issue related to how the OS handles scheduling and permissions among various processes and memory locations, hence why some dfs work fine and others don't.
(So it might be worth double-checking that you don't have other processes trying to claim write access to the destination. I've seen programs like automatic backup services claim access to folders and cause conflicts along these lines.)
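If you want to check from code when the file has actually finished being written, here is a rough sketch of the "just wait" observation above (the output path is hypothetical):

import os
import time

out_path = "dataset.xlsx"  # hypothetical path written by write_dataset()

# Poll until the size is non-zero and stops changing; this loops forever if the write never completes
size = -1
while size == 0 or os.path.getsize(out_path) != size:
    size = os.path.getsize(out_path)
    time.sleep(5)

print(f"File settled at {size} bytes")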
Let us consider the following pySpark code:
my_df = (spark.read.format("csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .load(my_data_path))
This is a relatively small snippet, but sometimes we have code with many options, where passing options as strings frequently causes typos. We also don't get any suggestions from our code editors.
As a workaround, I am thinking of creating a named tuple (or a custom class) to hold all the options I need. For example:
from collections import namedtuple
allOptions = namedtuple("allOptions", "csvFormat header inferSchema")
sparkOptions = allOptions("csv", "header", "inferSchema")
my_df = (spark.read.format(sparkOptions.csvFormat)
         .option(sparkOptions.header, "true")
         .option(sparkOptions.inferSchema, "true")
         .load(my_data_path))
I am wondering if there are downsides to this approach, or if there is a better, standard approach used by other pySpark developers.
If you use the .csv function to read the file, the options are named arguments, so a typo throws a TypeError. Also, in VS Code with the Python plugin, the options autocomplete.
df = spark.read.csv(my_data_path,
                    header=True,
                    inferSchema=True)
If I run it with a typo, it throws the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/tv/32xjg80x6pb_9t4909z8_hh00000gn/T/ipykernel_3636/4060466279.py in <module>
----> 1 df = spark.read.csv('test.csv', inferSchemaa=True, header=True)
TypeError: csv() got an unexpected keyword argument 'inferSchemaa'
On VS Code, options are suggested in autocomplete.
I think the best approach is to write wrapper(s) with some default values and kwargs, like this:
def csv(path, inferSchema=True, header=True, options={}):
    return hdfs(path, 'csv', {'inferSchema': inferSchema, 'header': header, **options})

def parquet(path, options={}):
    return hdfs(path, 'parquet', {**options})

def hdfs(path, format, options={}):
    return (spark
            .read
            .format(format)
            .options(**options)
            .load(f'hdfs://.../{path}'))
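A hypothetical call then looks like this (the path and separator are made up for illustration; spark is assumed to be an existing SparkSession, as in the snippets above):

df = csv("landing/my_data.csv", header=True, options={"sep": ";"})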
For that and many other reasons, in production-level projects we used to write a project that wraps Spark, so that developers are not allowed to deal with Spark directly.
In such a project we can:
Abstract options using enumerations and inheritance, to avoid typos and incompatible options (see the sketch below).
Set default options for each data format, which developers can overwrite if needed, to reduce the amount of code they have to write.
Define any repetitive code, such as frequently used data sources, the default output data format, etc.
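As a hedged sketch of the first point (all names here are hypothetical, not from a real project), abstracting the options behind enums gives you editor autocompletion and catches typos at definition time:

from enum import Enum

class Format(Enum):
    CSV = "csv"
    PARQUET = "parquet"

class Option(Enum):
    HEADER = "header"
    INFER_SCHEMA = "inferSchema"

# Per-format defaults that developers can override when needed
DEFAULTS = {
    Format.CSV: {Option.HEADER: "true", Option.INFER_SCHEMA: "true"},
    Format.PARQUET: {},
}

def read(spark, path, fmt, overrides=None):
    options = {**DEFAULTS[fmt], **(overrides or {})}
    reader = spark.read.format(fmt.value)
    for key, value in options.items():
        reader = reader.option(key.value, value)
    return reader.load(path)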
I want to remove the red warnings.
None of the answers mentioned here have worked.
For your particular example (pd.read_table where the warnings are all regarding an incorrect number of columns in the data file) you can add warn_bad_lines=False to the call:
pd.read_table(filename, header=0, error_bad_lines=False, warn_bad_lines=False)
I have tons of .xlsm files that I have to load. Each Excel file has 6 sheets. Because of that, I'm opening each Excel file like this, using pandas:
for excel_file in files_list:
    with pd.ExcelFile(excel_file, engine="openpyxl") as f:
        df1 = pd.read_excel(f, "Sheet1")
        df2 = pd.read_excel(f, "Sheet2")
        df3 = pd.read_excel(f, "Sheet3")
        ...
After each iteration I pass the df to another function and do some stuff with it. I am using pd.ExcelFile to load the file into memory just once and then split it into DataFrames.
However, when doing this, I am getting the following warning:
/opt/anaconda3/lib/python3.8/site-packages/openpyxl/worksheet/_reader.py:300: UserWarning: Data Validation extension is not supported and will be removed
warn(msg)
Despite the warning, the information is loaded correctly from the Excel file and no data is missing. It takes about 0.8 s to load each Excel file and all of its sheets into DataFrames. If I use the default engine in pandas to load each Excel file, the warning goes away, but the time per file goes up to 5 or even 6 seconds.
I saw this post, but there wasn't an answer on how to remove the warning, which is what I need, as everything's working correctly.
How can I disable said UserWarning?
You can do this using the warnings core module:
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='openpyxl')
The module="openpyxl" argument restricts the filter to warnings raised from openpyxl, so warnings from other modules are still shown.
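For the loop in the question, the filter just needs to be installed once before the files are read (files_list, the engine and the sheet name are taken from the question):

import warnings
import pandas as pd

warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")

for excel_file in files_list:
    with pd.ExcelFile(excel_file, engine="openpyxl") as f:
        df1 = pd.read_excel(f, "Sheet1")  # the Data Validation warning is no longer printed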
See What causes "UserWarning: Discarded range with reserved name" - openpyxl (a different warning, same solution): put the warnings filter back to its default after you open the workbook, since there may be other warnings that you do want to see.
import warnings
from openpyxl import load_workbook

warnings.simplefilter("ignore")
wb = load_workbook(path)
warnings.simplefilter("default")
If you want to ignore this warning specifically, and do it in a given context only, you can combine catch_warnings and filterwarnings with the message argument. E.g.:
import warnings

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message="Data Validation extension is not supported and will be removed")
    data = pd.read_excel(f, sheet_name=None)
Note: sheet_name=None will read all the Excel sheets in one go too.
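Since sheet_name=None returns a dict keyed by sheet name, the individual frames (sheet names as in the question) can then be pulled out of data:

df1 = data["Sheet1"]
df2 = data["Sheet2"]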
With SettingWithCopyWarning, sometimes it refers you to the exact line of code in your module that triggered the warning (e.g. here) and other times it doesn't (e.g. here).
Short of going through each line of the code (doesn't sound too appealing if you're reviewing hundreds of lines of code), is there a way to pinpoint the line of code that triggered the warning assuming the warning does not return that information?
I wonder if this is a bug in the warning to return a warning without pinpointing the specific code that triggered it.
Warning (from warnings module):
File "C:\Python34\lib\site-packages\pandas\core\indexing.py", line 415
self.obj[item] = s
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Python 3.4, Pandas 0.15.0
You can have pandas raise SettingWithCopyError instead of SettingWithCopyWarning by setting the mode.chained_assignment option to raise instead of warn.
import pandas as pd
pd.set_option('mode.chained_assignment', 'raise')
https://pandas.pydata.org/pandas-docs/stable/options.html#available-options
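As a minimal sketch of what that buys you (the DataFrame here is made up for illustration), the chained assignment now fails loudly at the offending line instead of printing a warning after the fact:

import pandas as pd

pd.set_option('mode.chained_assignment', 'raise')

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
subset = df[df["a"] > 1]   # this slice may be a copy of df
subset["b"] = 0            # raises SettingWithCopyError here, with a traceback into your own code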