I have a cluster with Spark 2.1 and a process that, at the end, writes a PipelineModel, which contains a StringIndexerModel, to a file. I can locally (with Spark 2.3) load the pipeline and inspect the StringIndexerModel. What appears very strange is that the methods and fields differ between the two versions, even though they read the same files. In particular, with Spark 2.1 the field inputCol appears not to be there, even though it is obviously needed to make the StringIndexer work.
This is what I get.
Spark 2.1:
pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]
si
#StringIndexer_494eb1f86ababc8540e2
si.inputCol
#Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
#AttributeError: 'StringIndexerModel' object has no attribute 'inputCol'
Spark 2.3:
pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]
si
#StringIndexer_494eb1f86ababc8540e2
si.inputCol
#Param(parent='StringIndexer_494eb1f86ababc8540e2', name='inputCol', doc='input column name')
I understand that methods and fields might change from one version to another, but inputCol must be somewhere in the object, since it is essential to make fit or transform work. Is there a way to extract inputCol in Spark 2.1 with PySpark?
The heavy lifting in Spark ML is done by the internal Java objects (_java_obj); that's why the objects can work even though the internals are never fully exposed in the Python API. This of course limits what can be done without drilling into the Java API. Since Spark 2.3, Params are exposed in PySpark models (SPARK-10931).
In previous versions you can access the internal model and fetch the data from there. However, if you want the value of a Param, you should use the corresponding get* method, not the Param as such.
si._java_obj.getInputCol()
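For illustration, a minimal sketch (assuming the pipeline loaded above, on Spark 2.1) of a few other values you can read the same way through the internal Java object; wrapping the labels in list() is just one way to consume the py4j array:
si._java_obj.getInputCol()   # input column name
si._java_obj.getOutputCol()  # output column name
list(si._java_obj.labels())  # fitted labels, converted from the py4j array to a Python list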
Related:
pyspark: getting the best model's parameters after a gridsearch is blank {}
How to get best params after tuning by pyspark.ml.tuning.TrainValidationSplit?
Getting labels from StringIndexer stages within pipeline in Spark (pyspark)
Let us consider the following PySpark code:
my_df = (spark.read.format("csv")
.option("header","true")
.option("inferSchema", "true")
.load(my_data_path))
This is relatively little code, but sometimes we have code with many options, where passing the options as strings frequently causes typos. Also, we don't get any suggestions from our code editors.
As a workaround, I am thinking of creating a named tuple (or a custom class) to hold all the options I need. For example:
from collections import namedtuple
allOptions = namedtuple("allOptions", "csvFormat header inferSchema")
sparkOptions = allOptions("csv", "header", "inferSchema")
my_df = (spark.read.format(sparkOptions.csvFormat)
.option(sparkOptions.header,"true")
.option(sparkOptions.inferSchema, "true")
.load(my_data_path))
I am wondering if there are downsides to this approach, or if there is a better, more standard approach used by other PySpark developers.
If you use the .csv function to read the file, the options are named arguments, so a typo throws a TypeError. Also, on VS Code with the Python plugin, the options are autocompleted.
df = spark.read.csv(my_data_path,
header=True,
inferSchema=True)
If I run it with a typo, it throws an error.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/tv/32xjg80x6pb_9t4909z8_hh00000gn/T/ipykernel_3636/4060466279.py in <module>
----> 1 df = spark.read.csv('test.csv', inferSchemaa=True, header=True)
TypeError: csv() got an unexpected keyword argument 'inferSchemaa'
On VS Code, options are suggested in autocomplete.
I think the best approach is to make wrapper(s) with some default values and kwargs, like this:
def csv(path, inferSchema=True, header=True, options={}):
    return hdfs(path, 'csv', {'inferSchema': inferSchema, 'header': header, **options})

def parquet(path, options={}):
    return hdfs(path, 'parquet', {**options})

def hdfs(path, format, options={}):
    return (spark
            .read
            .format(format)
            .options(**options)
            .load(f'hdfs://.../{path}'))
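Hypothetical usage of the wrappers above (the paths and the mergeSchema option are placeholders, not from the question):
df_sales = csv('warehouse/sales.csv')                           # defaults: inferSchema and header on
df_raw = csv('warehouse/raw_events.csv', header=False)          # override a single default
df_users = parquet('warehouse/users', {'mergeSchema': 'true'})  # pass extra options through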
For that and many other reasons, in production-level projects, we used to write a project that wraps Spark, so developers are not allowed to deal with Spark directly.
In such a project we can:
Abstract options using enumerations and inheritance, to avoid typos and incompatible options (see the sketch after this list).
Set default options for each data format, which developers can override if needed, to reduce the amount of code they have to write.
Define any repetitive code, such as frequently used data sources, the default output data format, etc.
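For illustration only, a minimal sketch of the enumeration idea; the class and option names below are hypothetical, not taken from any real project:
from enum import Enum

class Format(Enum):
    CSV = "csv"
    PARQUET = "parquet"

class CsvOption(Enum):
    HEADER = "header"
    INFER_SCHEMA = "inferSchema"

# Defaults per format; developers can override them when needed.
CSV_DEFAULTS = {CsvOption.HEADER.value: "true", CsvOption.INFER_SCHEMA.value: "true"}

def read_csv(spark, path, **overrides):
    # Enum members autocomplete in the editor, and a misspelled member fails loudly,
    # unlike a misspelled option string, which Spark silently ignores.
    opts = {**CSV_DEFAULTS, **overrides}
    return spark.read.format(Format.CSV.value).options(**opts).load(path)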
I run PySpark 3.1 on a Windows computer, in local mode, in a Jupyter Notebook. I call "applyInPandas" on a Spark DataFrame.
The function below applies a few data transformations to the input Pandas DataFrame and trains an SGBT model. Then it serializes the trained model into binary and saves it to an S3 bucket as an object. Finally, it returns the DataFrame. I call this function on a Spark DataFrame grouped by two columns in the last line. I receive no error, and the returned DataFrame is the same length as the input. Data for each group is returned.
The problem is the saved model objects. There are objects saved in S3 for only 2 groups, when there were supposed to be models for each group. There is no missing/wrong data point that would cause model training to fail. (I'd receive an error or warning anyway.) What I have tried so far:
Replacing S3 and saving to the local file system: the same result.
Replacing "pickle" with "joblib" and "BytesIO": the same result.
Repartitioning before calling the function: now I had more objects saved for different groups, but still not all. [I did this by calling "val_large_df.coalesce(1).groupby('la..." in the last line.]
So I suspect this is about parallelism and distribution, but I could not figure it out. Thanks in advance.
def train_sgbt(pdf):
    ## Some data transformations here ##
    # Train the model
    sgbt_mdl = GradientBoostingRegressor(--Params.--).fit(--Params.--)
    sgbt_mdl_b = pickle.dumps(sgbt_mdl)  # Serialize
    # Initiate s3_client
    s3_client = boto3.client(--Params.--)
    # Put file in S3
    s3_client.put_object(Body=sgbt_mdl_b, Bucket='my-bucket-name',
                         Key="models/BT_"+str(pdf.latGroup_m[0])+"_"+str(pdf.lonGroup_m[0])+".mdl")
    return pdf

dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(train_sgbt,
                                                                          schema="fcast_error double")
dummy_df.show()
Spark evaluates dummy_df lazily, and therefore train_sgbt is only called for the groups that are required to complete the Spark action.
The Spark action here is show(). This action prints only the first 20 rows, so train_sgbt is only called for the groups that have at least one element in those first 20 rows. Spark may evaluate more groups, but there is no guarantee of it.
One way to solve the problem would be to call another action that materializes every group, for example writing the result out as csv.
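For example (a sketch; the output path is a placeholder), either action below has to materialize every group:
# Writing out forces train_sgbt to run for every group, because every output row must be produced
dummy_df.write.mode("overwrite").csv("output/dummy_out")
# Counting the rows also evaluates all groups
print(dummy_df.count())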
If I read a file in PySpark:
data = spark.read.csv("file.csv")
then for the life of the Spark session, 'data' is available in memory, correct? So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct? If so, why do I need:
data.cache()
If I read a file in PySpark:
data = spark.read.csv("file.csv")
then for the life of the Spark session, 'data' is available in memory, correct?
No. Nothing happens at the read, due to Spark's lazy evaluation; the file is only actually read upon the first action, which is the first call to show() in your case.
So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct?
No. The DataFrame will be re-evaluated for each call to show(). Caching the DataFrame prevents that re-evaluation, forcing the data to be read from the cache instead.
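A minimal sketch of the pattern (the file name is a placeholder):
data = spark.read.csv("file.csv", header=True)  # nothing is read yet (lazy)
data.cache()   # marks the DataFrame for caching; still lazy
data.count()   # first action: reads from disk and populates the cache
data.show()    # subsequent actions read from the cache, not from disk
data.show()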
I am relatively new to Dask and have a large (12 GB) file that I wish to process. This file was imported from a SQL BCP file, and I want to wrangle it with Dask prior to uploading it to SQL. As part of this, I need to remove some leading whitespace, e.g. ' SQL Tutorial' changed to 'SQL Tutorial'. I would do this using pandas as follows:
df_train['column1'] = pd.core.strings.str_strip(df_train['column1'])
Dask doesn't seem to have this feature, as I get the error
AttributeError: module 'dask.dataframe.core' has no attribute
'strings'
Is there a memory-efficient way to do this using dask?
After a long search, I found it in the Dask API:
str
Namespace for string methods
So you can use:
df_train['column1'] = df_train['column1'].str.strip()
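A minimal, self-contained sketch (the file and column names are placeholders):
import dask.dataframe as dd

# Read the large CSV in partitions instead of loading 12 GB into memory at once
df_train = dd.read_csv("train.csv", dtype=str)
# .str.strip() removes the leading/trailing whitespace lazily, partition by partition
df_train["column1"] = df_train["column1"].str.strip()
# Nothing is computed until the result is written out (or .compute() is called)
df_train.to_csv("train_clean-*.csv", index=False)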
I just realized that I can do the following in Scala:
val df = spark.read.csv("test.csv")
val df1=df.map(x=>x(0).asInstanceOf[String].toLowerCase)
However, in Python, if I try to call the map function on a DataFrame, it throws an error.
df = spark.read.csv("Downloads/test1.csv")
df.map(lambda x: x[1].lower())
Error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.4.3/libexec/python/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'
In Python I need to explicitly convert the DataFrame to an RDD.
My question is: why do I need to do this in Python?
Is this a difference in the Spark API implementations, or does Scala implicitly convert the DataFrame to an RDD and back to a DataFrame again?
The Python DataFrame API doesn't have a map function because of how the Python API works.
In Python, every time you convert to an RDD or use a UDF with the Python API, you create a Python call during execution.
What does that mean? It means that during the Spark execution, instead of all the data being processed inside the JVM with generated Scala code (the DataFrame API), the JVM needs to call out to Python code to apply the logic you created. By default that creates a HUGE overhead during the execution.
So the solution for Python is to build an API that blocks the use of arbitrary Python code and only uses the Scala-generated code of the DataFrame pipeline.
This will help you understand how UDFs with Python work, which is basically very close to how RDD maps work with Python: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9
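For illustration, a rough sketch of the two routes in PySpark, assuming the CSV has no header so its first column is read as "_c0": the explicit RDD route pays the Python round-trip per row, while the DataFrame route keeps the work inside the JVM.
from pyspark.sql import functions as F

df = spark.read.csv("Downloads/test1.csv")

# Route 1: explicit RDD conversion -- every row is shipped to a Python worker
lowered_rdd = df.rdd.map(lambda x: x[0].lower())

# Route 2: stay in the DataFrame API -- lower() runs as generated JVM code, no per-row Python call
lowered_df = df.select(F.lower(F.col("_c0")).alias("lowered"))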