How to manage options in PySpark more efficiently

Let us consider the following PySpark code:
my_df = (spark.read.format("csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .load(my_data_path))
This is a relatively small piece of code, but we sometimes have code with many options, and passing them as strings frequently leads to typos. We also don't get any suggestions from our code editors.
As a workaround I am thinking of creating a named tuple (or a custom class) to hold all the options I need. For example,
from collections import namedtuple

allOptions = namedtuple("allOptions", "csvFormat header inferSchema")
sparkOptions = allOptions("csv", "header", "inferSchema")

my_df = (spark.read.format(sparkOptions.csvFormat)
         .option(sparkOptions.header, "true")
         .option(sparkOptions.inferSchema, "true")
         .load(my_data_path))
I am wondering if there are downsides to this approach, or if there is a better, standard approach used by other PySpark developers.

If you use the .csv function to read the file, the options are named arguments, so a typo raises a TypeError. Also, on VS Code with the Python plugin, the options autocomplete.
df = spark.read.csv(my_data_path,
                    header=True,
                    inferSchema=True)
If I run it with a typo, it throws this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/tv/32xjg80x6pb_9t4909z8_hh00000gn/T/ipykernel_3636/4060466279.py in <module>
----> 1 df = spark.read.csv('test.csv', inferSchemaa=True, header=True)
TypeError: csv() got an unexpected keyword argument 'inferSchemaa'
On VS Code, options are suggested in autocomplete.

I think the best approach is to write wrappers with some default values and kwargs, like this:
def csv(path, inferSchema=True, header=True, options=None):
    # CSV-specific defaults that the caller can override
    return hdfs(path, 'csv', {'inferSchema': inferSchema, 'header': header, **(options or {})})


def parquet(path, options=None):
    return hdfs(path, 'parquet', {**(options or {})})


def hdfs(path, format, options=None):
    # Shared reader: only the format and options vary, the base path stays the same
    return (spark
            .read
            .format(format)
            .options(**(options or {}))
            .load(f'hdfs://.../{path}')
            )
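Call sites then stay short; for example (the file paths below are just hypothetical):
# Hypothetical call sites for the wrappers above
df_events = csv("data/events.csv", header=False)  # override one default
df_users = parquet("data/users.parquet")          # rely on the defaults entirely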

For that and many other reasons, in production-level projects we used to write a small project that wraps Spark, so developers are not allowed to deal with Spark directly.
In such a project we can:
Abstract options using enumerations and inheritance to avoid typos and incompatible options (a rough sketch follows after this list).
Set default options for each data format, which developers can override if needed, to reduce the amount of code they have to write.
Define any repetitive code once, such as frequently used data sources, the default output data format, etc.
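A rough sketch of what the enumeration-based wrapper might look like; the class, option, and function names below are invented for illustration, and only the spark.read.format(...).options(...).load(...) chain is the actual PySpark API:
from enum import Enum

class Format(Enum):
    CSV = "csv"
    PARQUET = "parquet"

class CsvOption(Enum):
    HEADER = "header"
    INFER_SCHEMA = "inferSchema"

# Per-format defaults that a developer can override at the call site (hypothetical values)
DEFAULT_OPTIONS = {
    Format.CSV: {CsvOption.HEADER: "true", CsvOption.INFER_SCHEMA: "true"},
    Format.PARQUET: {},
}

def read(spark, path, fmt, **overrides):
    """Read `path` with the defaults registered for `fmt`; keyword overrides win."""
    options = {opt.value: value for opt, value in DEFAULT_OPTIONS[fmt].items()}
    options.update(overrides)
    return spark.read.format(fmt.value).options(**options).load(path)

# Usage: defaults are applied, and a single option is overridden explicitly
# df = read(spark, my_data_path, Format.CSV, inferSchema="false")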

Related

pandas to_excel function writes a file with 0 bytes

I use the following function to write a pandas DataFrame to Excel
def write_dataset(
    train: pd.DataFrame,
    forecast: pd.DataFrame,
    config,
    out_path: str,
) -> None:
    forecast = forecast.rename(
        columns={
            "col": "col_predicted",
        }
    )
    df = pd.concat([train, forecast])
    df.drop(["id"], axis=1, inplace=True)
    if config.join_meta:
        df.drop(
            ["some_col", "some_other_col"],
            axis=1,
            inplace=True,
        )
    df.sort_values(config.id_columns, inplace=True)
    df.rename(columns={"date": "month"}, inplace=True)
    df["a_col"] = df["a_col"].round(2)
    df.to_excel(out_path, index=False)
Just before the df.to_excel() call the DataFrame looks completely normal, just containing some NaNs. But the file it writes is a 0-byte file, which I can't even open with Excel. I use this function for 6 different DataFrames, and somehow it works for some and doesn't for others. Also, on my colleague's computer it always works fine.
I'm using Python 3.10.4, pandas 1.4.2 and openpyxl 3.0.9.
Any ideas what is happening and how to fix that behavior?
I encountered this issue on my Mac, and was similarly stumped for a while. Then I realized that the file appears as 0 bytes once the code has begun to create the file but hasn't yet finished.
So in my case, I found that all I had to do was wait a long time, and eventually (> 5-10m) the file jumped from 0 bytes to its full size. My file was about 14mb, so it shouldn't have required that much time. My guess is that this is an issue related to how the OS is handling scheduling and permissions among various processes and memory locations, hence why some dfs work fine and others don't.
(So it might be worth double checking that you don't have other processes that might be trying to claim write access of the write destination. I've seen programs like automatic backup services claim access to folders and cause conflicts along these lines.)

How can you load a CSV into PyFlink as a Streaming Table Source?

I am trying to set up a simple playground environment to use the Flink Python Table API. The jobs I am ultimately trying to write will feed off of a Kafka or Kinesis queue, but that makes playing around with ideas (and tests) very difficult.
I can happily load from a CSV and process it in batch mode, but I cannot get it to work in streaming mode. How would I do something similar in a StreamExecutionEnvironment (primarily so I can play around with windows)?
I understand that I need to get the system to use event time (because processing time would all come in at once), but I cannot find any way to set this up. In principle I should be able to set one of the columns of the CSV to be the event time, but it is not clear from the docs how to do this (or if it is possible).
To get the Batch execution tests running I used the below code, which reads from an input.csv and outputs to an output.csv.
from pyflink.dataset import ExecutionEnvironment
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import (
    TableConfig,
    DataTypes,
    BatchTableEnvironment,
    StreamTableEnvironment,
)
from pyflink.table.descriptors import Schema, Csv, OldCsv, FileSystem
from pathlib import Path

exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)
root = Path(__file__).parent.resolve()
out_path = root / "output.csv"

try:
    out_path.unlink()
except:
    pass

from pyflink.table.window import Tumble

(
    t_env.connect(FileSystem().path(str(root / "input.csv")))
    .with_format(Csv())
    .with_schema(
        Schema().field("time", DataTypes.TIMESTAMP(3)).field("word", DataTypes.STRING())
    )
    .create_temporary_table("mySource")
)

(
    t_env.connect(FileSystem().path(str(out_path)))
    .with_format(Csv())
    .with_schema(
        Schema().field("word", DataTypes.STRING()).field("count", DataTypes.BIGINT())
    )
    .create_temporary_table("mySink")
)

(
    t_env.from_path("mySource")
    .group_by("word")
    .select("word, count(1) as count")
    .filter("count > 1")
    .insert_into("mySink")
)

t_env.execute("tutorial_job")
and input.csv is
2000-01-01 00:00:00.000000000,james
2000-01-01 00:00:00.000000000,james
2002-01-01 00:00:00.000000000,steve
So my question is: how could I set it up so that it reads from the same CSV, but uses the first column as the event time and allows me to write code like:
(
    t_env.from_path("mySource")
    .window(Tumble.over("10.minutes").on("time").alias("w"))
    .group_by("w, word")
    .select("w, word, count(1) as count")
    .filter("count > 1")
    .insert_into("mySink")
)
Any help would be appreciated; I can't work this out from the docs. I am using Python 3.7 and Flink 1.11.1.
If you use the descriptor API, you can specify that a field is the event-time field through the schema:
.with_schema(  # declare the schema of the table
    Schema()
    .field("rowtime", DataTypes.TIMESTAMP())
    .rowtime(
        Rowtime()
        .timestamps_from_field("time")
        .watermarks_periodic_bounded(60000))
    .field("a", DataTypes.STRING())
    .field("b", DataTypes.STRING())
    .field("c", DataTypes.STRING())
)
But I would still recommend using DDL: on the one hand it is easier to use, and on the other hand there are some bugs in the existing descriptor API, and the community is discussing refactoring it.
Have you tried using watermark strategies? As mentioned here, you need to use watermark strategies to use event time. For the PyFlink case, personally I think it is easier to declare it in DDL, along the lines of the sketch below.
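A rough sketch of that DDL route, assuming Flink 1.11's filesystem connector and the blink planner are available; the one-minute watermark delay is an arbitrary choice for illustration:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
t_env = StreamTableEnvironment.create(
    env,
    environment_settings=EnvironmentSettings.new_instance()
    .in_streaming_mode()
    .use_blink_planner()
    .build(),
)

# Declare the CSV source with the first column as event time plus a watermark.
# TIMESTAMP(3) may need adjusting to match the timestamp precision in input.csv.
t_env.execute_sql("""
    CREATE TABLE mySource (
        `time` TIMESTAMP(3),
        word STRING,
        WATERMARK FOR `time` AS `time` - INTERVAL '1' MINUTE
    ) WITH (
        'connector' = 'filesystem',
        'path' = 'input.csv',
        'format' = 'csv'
    )
""")

# Tumbling event-time windows can then be defined on the `time` column,
# e.g. with the Table API's Tumble window exactly as in the question.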

Strange warning using dask.dataframe to read csv

I am using the dask dataframe module to read a CSV.
In [3]: from dask import dataframe as dd
In [4]: dd.read_csv("/file.csv", sep=",", dtype=str, encoding="utf-8", error_bad_lines=False, collection=True, blocksize=64e6)
I used to do this with no problem, but today a strange warning showed up:
FutureWarning: The default value of auto_mkdir=True has been deprecated and will be changed to auto_mkdir=False by default in a future release.
FutureWarning,
This didn't worry me until I realised it breaks my unit tests: when running from the console it's simply a warning, but the test suite for my app fails because of it.
Does anyone know the cause of this warning or how to get rid of it?
Auto-answering for documentation:
This issue appears in fsspec==0.6.3 and dask==2.12.0 and will be removed in the future.
To prevent pytest from failing because of the warning, add or edit a pytest.ini file in your project and set:
[pytest]
filterwarnings =
    error
    ignore::UserWarning
If you want dask to silence the warning entirely, explicitly set the option in the function call: storage_options=dict(auto_mkdir=True)
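Combined with the call from the question, that would look something like this (keeping only the arguments relevant here):
from dask import dataframe as dd

df = dd.read_csv(
    "/file.csv",
    sep=",",
    dtype=str,
    encoding="utf-8",
    blocksize=64e6,
    storage_options={"auto_mkdir": True},  # silences the fsspec FutureWarning
)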
I got the same thing. Finding no answers as to what might have replaced the feature, I decided to see if the feature is even needed any more. Sure enough, as of Pandas 1.3.0 the warnings that previously motivated the feature no longer appear. So
pd.read_csv(import_path, error_bad_lines=False, warn_bad_lines=False, names=cols)
simply became
pd.read_csv(import_path, names=cols)
and works fine with no errors or warnings.

Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

Today I began using the Dask and Paramiko packages, partly as a learning exercise, and partly because I'm beginning a project that will require dealing with large datasets (tens of GB) that must be accessed from a remote VM only (i.e. they cannot be stored locally).
The following piece of code belongs to a short, helper program that will make a dask dataframe of a large csv file hosted on the VM. I want to later pass its output (reference to the dask dataframe) to a second function that will perform some overview analysis on it.
import dask.dataframe as dd
import paramiko as pm
import pandas as pd
import sys

def remote_file_to_dask_dataframe(remote_path):
    if isinstance(remote_path, str):
        try:
            client = pm.SSHClient()
            client.load_system_host_keys()
            client.connect('#myserver', username='my_username', password='my_password')
            sftp_client = client.open_sftp()
            remote_file = sftp_client.open(remote_path)
            df = dd.read_csv(remote_file)
            remote_file.close()
            sftp_client.close()
            return df
        except:
            print("An error occurred.")
            sftp_client.close()
            remote_file.close()
    else:
        raise ValueError("Path to remote file as string required")
The code is neither nice nor complete, and I will replace the username and password with ssh keys in time, but this is not the issue. In a Jupyter notebook, I've previously opened the sftp connection with a path to a file on the server and read it into a dataframe with a regular Pandas read_csv call. However, here the equivalent line, using Dask, is the source of the problem: df = dd.read_csv(remote_file).
I've looked at the documentation online (here), but I can't tell whether what I'm trying above is possible. It seems that for networked options, Dask wants a URL. The parameter-passing options for, e.g., S3 appear to depend on that infrastructure's backend. I unfortunately cannot make any sense of the dask-ssh documentation (here).
I've poked around with print statements and the only line that fails to execute is the one stated. The error raised is:
raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood:
Can anybody point me in the right direction for achieving what I'm trying to do? I'd expected Dask's read_csv to function as Pandas' had, as it's based on the same.
I'd appreciate any help, thanks.
p.s. I'm aware of Pandas' read_csv chunksize option, but I would like to achieve this through Dask, if possible.
In the master version of Dask, file-system operations now use fsspec which, along with the previous implementations (s3, gcs, hdfs), supports some additional file systems; see the mapping of protocol identifiers in fsspec.registry.known_implementations.
In short, using a URL like "sftp://user:pw@host:port/path" should now work for you, if you install fsspec and Dask from master.
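For example (the credentials, host and path below are just placeholders; the sftp implementation in fsspec relies on Paramiko under the hood):
import dask.dataframe as dd

# Hypothetical user, password, host and path
df = dd.read_csv("sftp://my_username:my_password@myserver:22/path/to/file.csv")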
It seems that you would have to implement their "file system" interface.
I'm not sure what the minimal set of methods is that you need to implement to allow read_csv, but you definitely have to implement open.
class SftpFileSystem(object):
    def open(self, path, mode='rb', **kwargs):
        return sftp_client.open(path, mode)

dask.bytes.core._filesystems['sftp'] = SftpFileSystem
df = dd.read_csv('sftp://remote/path/file.csv')

What happens exactly in the i/o of json files?

I struggled with the following for a couple of hours yesterday. I figured out a workaround, but I'd like to understand a little more of what's going on in the background and, ideally, I'd like to remove the intermediate file from my code just for the sake of elegance. I'm using Python, by the way, and files_df starts off as a pandas dataframe.
Can you help me understand why the following code gives me an error?
files_json = files_df.to_json(orient='records')
for file_json in files_json:
    print(file_json)  # do stuff
But this code works?
files_json = files_df.to_json(orient='records')
with open('export_json.json', 'w') as f:
    f.write(files_json)
with open('export_json.json') as data:
    files_json = json.load(data)
for file_json in files_json:
    print(file_json)  # do stuff
Obviously, the export/import is converting the data somehow into a usable format. I would like to understand that a little better and know if there is some option within the pandas files_df.to_json command to perform the same conversion.
json.load is the opposite of json.dump: to_json gives you a single JSON string, so the loop in the first snippet walks over that string character by character. Writing it to a file and reading it back with json.load parses the string into a Python structure (here, a list of dicts), which is what the loop expects.
Try files_df.to_dict, as in the sketch below.
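A minimal sketch of both routes, using a toy DataFrame for illustration:
import json
import pandas as pd

files_df = pd.DataFrame({"name": ["a.csv", "b.csv"], "size": [10, 20]})  # toy data

# Route 1: parse the JSON string back into Python objects in memory,
# instead of round-tripping through a file
records = json.loads(files_df.to_json(orient="records"))

# Route 2: skip JSON entirely and get a list of dicts directly
records = files_df.to_dict(orient="records")

for file_json in records:
    print(file_json)  # each item is a dict representing one row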
