Is pyproj Transformer stateless or stateful? - python

I'm currently working on a project where we will be using PySpark and pyproj for GPS-to-Cartesian transformations.
In this project I will be getting parquet files as input, will need to modify the content of one column (the column that contains GPS coordinates), and save the output to parquet.
We want to use a PySpark UDF, and while trying to pass the transformer to this UDF I received:
Cannot pass transformer to UDF: TypeError: Invalid argument, not a string or column
I was thinking about creating a class where this UDF function would be a class method and the pyproj.Transformer would be a class attribute. PySpark would create one instance of the class that would be used across all the workers, so to use pyproj.Transformer with PySpark I need to know whether it is stateless or stateful. I guess that it is not stateless, since it is doing some geometric transformations. I tried to find some info about this in the pyproj documentation, but unfortunately I didn't succeed.
Does somebody know whether pyproj.Transformer is stateless or stateful?
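For illustration, here is a minimal sketch of the class-based pattern described above, with the Transformer built lazily on each worker instead of being shipped from the driver; the EPSG codes, column names, and UDF wiring are assumptions and not part of the original question.

from pyproj import Transformer
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

class GpsToCartesian:
    # Class-level cache: each Python worker process builds its own Transformer
    # on first use, so no Transformer object ever needs to be pickled and shipped.
    _transformer = None

    @classmethod
    def transform(cls, lon, lat):
        if cls._transformer is None:
            # EPSG:4326 -> EPSG:4978 (geodetic to geocentric) is an assumed example pair.
            cls._transformer = Transformer.from_crs("EPSG:4326", "EPSG:4978", always_xy=True)
        x, y, z = cls._transformer.transform(lon, lat, 0.0)
        return (x, y, z)

cartesian_schema = StructType([StructField(c, DoubleType()) for c in ("x", "y", "z")])
to_cartesian = F.udf(GpsToCartesian.transform, cartesian_schema)

# df = df.withColumn("cartesian", to_cartesian("lon", "lat"))  # hypothetical column names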

Related

Infer the correct schema in case of partitioned parquet in pyarrow

I have a big partitioned parquet dataset that I need to access either by loading it all or by filtering it. I am not able to load the entire file because of this error: ArrowNotImplementedError: Unsupported cast from string to null using function cast_null. From this issue it turns out to be a problem linked to the first file used to infer the overall schema. If I generate a unified schema (unifiedSchema) and use it to load the entire dataframe I have no problem at all, but when I apply a filter while passing the schema, the following error pops up:
ArrowInvalid: No match for FieldRef.Name(MARKET_TAG) in PRODUCT_TAG: string
import pandas as pd

df = pd.read_parquet("data/reduced.parquet", filters=[("MARKET_TAG", "=", 3)], schema=unifiedSchema)
Is there a way to solve this?
I was thinking of solving the problem by directly storing the correct schema with pq.write_to_dataset, but I usually work on several datasets, each with a single "MARKET_TAG", and store the parquet files at the end using partition_cols in the same folder. In that case each table's schema could be "wrong" compared with the others, so this would not solve my issue.
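One possible workaround, sketched here under the assumption that MARKET_TAG is a hive-style partition column (which the FieldRef error suggests), is to go through pyarrow's dataset API, let it discover the partitioning, and apply the filter there:

import pyarrow.dataset as ds

# unifiedSchema is the unified schema mentioned above; for the filter to work it
# should also contain the MARKET_TAG partition field.
dataset = ds.dataset(
    "data/reduced.parquet",
    format="parquet",
    partitioning="hive",   # assumes partition_cols produced hive-style directories
    schema=unifiedSchema,
)
table = dataset.to_table(filter=ds.field("MARKET_TAG") == 3)
df = table.to_pandas()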

Databricks Feature Store - Can I use native Python (instead of PySpark) to create features?

I would like to create a feature table with some popular time series features using out-of-the-box feature transformations provided by popular Python packages such as ta-lib or pandas-ta; these packages rely on numpy/pandas and not on Spark DataFrames.
Can this be done with Databricks Feature Store?
In the documentation I could only find feature creation examples using Spark dataframes.
When it comes to creation - yes, you can do it using Pandas. You just need to convert the Pandas DataFrame into a Spark DataFrame before creating the feature table or writing new data into it. The simplest way to do that is the spark.createDataFrame function, passing the Pandas DataFrame to it as an argument.
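A minimal sketch of that flow, where the pandas feature computation, table name, and primary keys are all placeholders; the create_table call follows the Databricks FeatureStoreClient API:

from databricks.feature_store import FeatureStoreClient

# Placeholder: compute features with pandas-based libraries (e.g. pandas-ta / ta-lib).
pdf = compute_time_series_features()   # hypothetical helper returning a pandas DataFrame

# Convert the Pandas DataFrame into a Spark DataFrame before writing it.
sdf = spark.createDataFrame(pdf)

fs = FeatureStoreClient()
fs.create_table(
    name="my_db.ts_features",          # hypothetical database.table name
    primary_keys=["symbol", "ts"],     # hypothetical keys
    df=sdf,
)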

How Python data structure implemented in Spark when using PySpark?

I am currently self-learning Spark programming and trying to recode an existing Python application in PySpark. However, I am still confused about how we use regular Python objects in PySpark.
I understand the distributed data structures in Spark such as the RDD, DataFrame, Dataset, vector, etc. Spark has its own transformation and action operations, such as .map() and .reduceByKey(), to manipulate those objects. However, what if I create traditional Python data objects such as an array, list, tuple, or dictionary in PySpark? They will only be stored in the memory of my driver node, right? If I transform them into an RDD, can I still do operations on them with typical Python functions?
If I have a huge dataset, can I use regular Python libraries like pandas or numpy to process it in PySpark? Will Spark only use the driver node to run the computation if I directly execute a Python function on a Python object in PySpark? Or do I have to create an RDD and use Spark's operations?
You can create traditional Python data objects such as arrays, lists, tuples, or dictionaries in PySpark.
You can perform most operations using Python functions in PySpark.
You can import Python libraries in PySpark and use them to process data.
You can create an RDD and apply Spark operations on it (see the sketch below).
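A short sketch of those points (nothing here comes from the original question; it just shows a plain Python list being parallelized and processed with an ordinary Python function):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# A regular Python list lives only in the driver's memory...
numbers = [1, 2, 3, 4, 5]

# ...until it is parallelized into an RDD, which is distributed across the workers.
rdd = sc.parallelize(numbers)

# Plain Python functions (here a lambda) can be applied element-wise.
squared = rdd.map(lambda x: x * x).collect()   # [1, 4, 9, 16, 25]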

In PySpark ML, how can I interpret the SparseVector returned by a pyspark.ml.classification.RandomForestClassificationModel.featureImportances?

I have created and am debugging a PySpark ML RandomForestClassificationModel, which was of course created by calling pyspark.ml.classification.RandomForestClassifier.fit(). I want to interpret the feature importances returned by the RandomForestClassificationModel.featureImportances property. They come back as a SparseVector.
As you can see in the notebook below, I had to transform my features in several stages to get them into the final Features_vec that fed the algorithm. What I want is a list of features keyed by the feature type and column. How can I use the SparseVector of features to get to a list of feature importances along with feature names, or some other format that is interpretable?
The code is in a Jupyter Notebook here. Skip to the end.
This shouldn't be specific to PySpark, so if you know a Scala solution, please chime in.
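Not a full answer, but a common PySpark sketch for this kind of mapping: the per-slot metadata that VectorAssembler attaches to the features column can be joined with the importances. Names such as model, transformed_df, and Features_vec below are assumptions based on the question.

importances = model.featureImportances   # SparseVector, one entry per assembled feature slot

# The assembled DataFrame keeps per-slot metadata describing every vector position.
attrs = transformed_df.schema["Features_vec"].metadata["ml_attr"]["attrs"]

# Flatten the numeric/binary/nominal attribute groups into an index -> name mapping.
name_by_index = {
    attr["idx"]: attr["name"]
    for group in attrs.values()
    for attr in group
}

# Pair names with importances and sort, most important first.
ranked = sorted(
    ((name_by_index.get(i, f"feature_{i}"), importances[i]) for i in range(importances.size)),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")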

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files, and convert it to pandas using the built-in method toPandas().
Does it store the Pandas object in local memory?
Is the Pandas low-level computation all handled by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes)
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Using spark to read in a CSV file to pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
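A minimal sketch of that suggestion, reusing the file name and field list from the question (and assuming the file has no header row):

import pandas as pd

# Read the tab-separated file directly; pandas infers the type of each column.
df = pd.read_csv('tail5.csv', sep='\t', names=fnames)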
Now to answer your questions:
Does it store the Pandas object in local memory:
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the Pandas low-level computation all handled by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and pandas, there's simply some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many, many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using a Spark context or Hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not in memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run the rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. Thus you lose your distributed compute benefits (but can still run the rdd methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.
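A sketch of that workflow with hypothetical column names: reduce the data with distributed operations first, then bring only the small result to the driver.

# sdf is a large Spark DataFrame; the columns used below are hypothetical.
small_sdf = (
    sdf.filter(sdf.year == 2023)
       .groupBy("category")
       .agg({"value": "mean"})
)
pdf = small_sdf.toPandas()   # the reduced result now lives in driver (local) memory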
