I am currently teaching myself Spark programming and trying to recode an existing Python application in PySpark. However, I am still confused about how regular Python objects are used in PySpark.
I understand the distributed data structures in Spark, such as the RDD, DataFrame, Dataset, vector, etc. Spark has its own transformations and actions, such as .map() and .reduceByKey(), to manipulate those objects. However, what if I create traditional Python data objects such as an array, list, tuple, or dictionary in PySpark? They will only be stored in the memory of my driver node, right? If I transform them into an RDD, can I still operate on them with ordinary Python functions?
If I have a huge dataset, can I use regular Python libraries like pandas or numpy to process it in PySpark? Will Spark use only the driver node if I directly execute a Python function on a Python object in PySpark? Or do I have to put the data in an RDD and use Spark's operations?
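You can create traditional Python data objects such as an array, list, tuple, or dictionary in PySpark. They are ordinary Python objects and live only in the driver process.
You can perform most operations using Python functions in PySpark. A function passed to an RDD operation such as .map() runs on the worker nodes.
You can import Python libraries in PySpark and use them to process data. Called directly on a Python object, they run on the driver only.
You can create an RDD from such objects and apply Spark operations to it, which distributes the work across the cluster.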
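As a rough illustration of the points above, the sketch below applies an ordinary Python function first to a plain list on the driver and then to the same data as an RDD. The names square and run_on_spark are hypothetical, and the Spark part assumes pyspark is installed and that a local master is acceptable for trying it out.

```python
def square(x):
    """An ordinary Python function; nothing Spark-specific about it."""
    return x * x


# A plain Python list lives only in the driver process.
values = [1, 2, 3, 4]
local_result = [square(v) for v in values]  # computed on the driver only


def run_on_spark(data):
    """Distribute `data` as an RDD and apply the same function on the executors.

    Assumes pyspark is installed; it is imported lazily so the plain-Python
    part above works without it.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("sketch").getOrCreate()
    try:
        rdd = spark.sparkContext.parallelize(data)  # list -> distributed RDD
        # map() runs square on the executors; collect() copies the
        # results back to the driver as a regular Python list.
        return rdd.map(square).collect()
    finally:
        spark.stop()
```

Calling run_on_spark(values) should give the same values as local_result; the difference is where the function runs, not what it computes.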
I would like to create a feature table with some popular time series features using out of the box feature transformations provided by popular python packages such as ta-lib or pandas-ta - these packages rely on numpy/pandas and not Spark dataframes.
Can this be done with Databricks Feature Store?
In the documentation I could only find feature creation examples using Spark dataframes.
When it comes to creation - yes, you can do it using Pandas. You just need to convert Pandas DataFrame into Spark DataFrame before creating the feature store or writing new data into it. The simplest way to do it is to use spark.createDataFrame function, passing Pandas DataFrame to it as an argument.
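A minimal sketch of that conversion, assuming pandas is installed and an existing SparkSession is passed in as spark. The column names, the rolling-mean feature, and the helpers add_rolling_mean and to_spark are made up for illustration; the Feature Store write itself is left out because it is not shown in this thread.

```python
import pandas as pd


def add_rolling_mean(pdf, column="close", window=3):
    """Return a copy of `pdf` with a rolling-mean feature column added."""
    out = pdf.copy()
    out[f"{column}_sma_{window}"] = out[column].rolling(window).mean()
    return out


def to_spark(spark, pdf):
    """Convert the pandas result to a Spark DataFrame.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.createDataFrame(pdf)


# Made-up sample data; the feature is computed entirely in pandas.
prices = pd.DataFrame({"id": [1, 2, 3, 4], "close": [10.0, 11.0, 12.0, 13.0]})
features = add_rolling_mean(prices)
```

The pandas step runs on the driver, so this fits data that is small enough for one machine; to_spark(spark, features) then hands the result to Spark.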
I just read that a DataFrame has 2-dimensional, array-like storage, whereas an RDD has no such constraint on storage, and that because of this queries can be better optimized with DataFrames. Does this mean that creating a DataFrame consumes more memory than creating an RDD over the same input dataset?
Also, if I have an RDD defined as rdd1 and I use the toDF method to convert rdd1 into a DataFrame, am I consuming more memory on the node?
Similarly, if I have a DataFrame and I convert it to an RDD using the df.rdd method, am I freeing some space on the node?
RDD:
Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. It is the fundamental data structure of Spark.
Through RDDs, we can process structured as well as unstructured data. However, an RDD cannot infer a schema for the ingested data; the user needs to specify it.
It is a distributed collection of data elements spread across many machines in the cluster. The elements are a set of Scala or Java objects (or, in PySpark, Python objects) representing data.
The RDD API supports an object-oriented programming style with compile-time type safety (in Scala and Java).
RDDs are immutable. An RDD cannot be changed once it is created; a transformation returns a new RDD.
If an RDD is in tabular format, we can move from the RDD to a DataFrame with the toDF() method. We can also do the reverse with the .rdd attribute.
There is no built-in optimization engine for RDDs. Developers optimize each RDD by hand, on the basis of its attributes.
Spark does not compute the result right away; it evaluates RDDs lazily.
The RDD API uses schema projection explicitly, so the user needs to define the schema manually.
For simple grouping and aggregation operations, the RDD API is slower than the DataFrame API.
DataFrame:
DataFrame data is organized into named columns. It is equivalent to a table in a relational database.
If we try to access a column that is not present in the table, an attribute error may occur at runtime. DataFrames do not offer compile-time type safety in such a case.
One cannot regenerate a domain object after transforming it into a DataFrame. For example, if we generate a test DataFrame from a test RDD, we cannot recover the original RDD of the test class.
Optimization of DataFrames is done by the Catalyst optimizer. DataFrames use the Catalyst tree transformation framework in four phases.
Using off-heap memory for serialization reduces overhead. Spark also generates bytecode so that many operations can be performed directly on the serialized data.
As with RDDs, computation happens only when an action is called, because Spark evaluates DataFrames lazily.
With a DataFrame, there is no need to specify a schema. Generally, it discovers the schema automatically.
For exploratory analysis and for creating aggregated statistics on data, DataFrames are faster.
We use DataFrames when we need a high level of abstraction over structured or semi-structured data. For unstructured data, such as media streams or streams of text, RDDs are the better fit.
I have a compute intensive python function called repeatedly within a for loop (each iteration is independent i.e. embarrassingly parallel). I am looking for spark.lapply (from SparkR) kind of functionality to utilize the Spark cluster.
Native Spark
If you use Spark data frames and libraries, then Spark will natively parallelize and distribute your task.
Thread Pools
One of the ways that you can achieve parallelism in Spark without using Spark data frames is by using the multiprocessing library, which provides a thread pool abstraction. However, by default all of your code will run on the driver node.
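A minimal sketch of the thread pool approach, using multiprocessing.pool.ThreadPool from the standard library. The function expensive and the input sizes are placeholders for the real independent task.

```python
from multiprocessing.pool import ThreadPool


def expensive(n):
    """Stand-in for a compute-intensive, independent task (hypothetical)."""
    return sum(i * i for i in range(n))


inputs = [10, 100, 1000]

# Four concurrent threads, all inside the driver process.
with ThreadPool(4) as pool:
    results = pool.map(expensive, inputs)  # results keep the input order
```

Everything here runs on the driver. The pattern tends to pay off when each thread submits its own Spark job, so that the heavy work happens on the executors; pure-Python CPU-bound work in threads is still limited by the interpreter lock.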
Pandas UDFs
One of the newer features in Spark that enables parallel processing is Pandas UDFs. With this feature, you can partition a Spark data frame into smaller data sets that are distributed and converted to Pandas objects, where your function is applied, and then the results are combined back into one large Spark data frame.
Example from https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
from pyspark.sql.functions import udf

# Use udf to define a row-at-a-time udf
@udf('double')
# Input/output are both a single double value
def plus_one(v):
    return v + 1

# df is assumed to be an existing Spark DataFrame with a double column 'v'
df.withColumn('v2', plus_one(df.v))
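The snippet above is the row-at-a-time baseline. A vectorized counterpart might look like the sketch below, which wraps the function with pandas_udf so that it receives a whole pandas.Series per batch. The names plus_one_batch and add_v2 are hypothetical, and the Spark part assumes pyspark with pyarrow installed and a DataFrame df that has a double column v.

```python
def plus_one_batch(v):
    """Vectorized version: `v` is a whole pandas.Series batch, not one value."""
    return v + 1


def add_v2(df):
    """Attach column 'v2' computed by a scalar Pandas UDF.

    Assumes pyspark and pyarrow are installed and that `df` has a double
    column named 'v'. The import is lazy so the plain function above can be
    tried without Spark.
    """
    from pyspark.sql.functions import pandas_udf

    plus_one_udf = pandas_udf(plus_one_batch, "double")
    return df.withColumn("v2", plus_one_udf(df.v))
```

The function body is the same as before; what changes is that Spark calls it once per batch of rows instead of once per row.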
Currently the integration between Spark structures and Dask seems cumbersome when dealing with complicated nested structures. Specifically, dumping a Spark DataFrame with a nested structure to be read by Dask does not seem very reliable yet, although parquet loading is part of a large ongoing effort (fastparquet, pyarrow).
So my follow-up question: let's assume that I can live with doing a few transformations in Spark and transforming the DataFrame into an RDD that contains custom class objects. Is there a way to reliably dump the data of a Spark RDD with custom class objects and read it into a Dask collection? Obviously you can collect the RDD into a Python list, pickle it, and then read it as a normal data structure, but that removes the opportunity to load larger-than-memory datasets. Could something like the Spark pickling be used by Dask to load a distributed pickle?
I solved this by doing the following.
Starting from a Spark RDD with a list of custom objects as Row values, I created a version of the RDD in which the objects were serialised to strings using cPickle.dumps. I then converted this RDD to a simple DataFrame with string columns and wrote it to parquet. Dask is able to read parquet files with a simple structure. Finally, I deserialised the strings with cPickle.loads to get the original objects back.
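A sketch of that round trip under a few assumptions: Python 3, where the pickle module takes the place of cPickle; base64 encoding (not mentioned in the answer) to turn the pickled bytes into a plain string column; and a hypothetical column name payload. fractions.Fraction stands in for a custom class. The receiving side must be able to import the class definition for unpickling to work.

```python
import base64
import pickle
from fractions import Fraction


def to_text(obj):
    """Pickle an object and encode the bytes as an ASCII string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def from_text(text):
    """Invert to_text: decode the string and unpickle the object."""
    return pickle.loads(base64.b64decode(text))


def write_pickled_parquet(spark, rdd, path):
    """Serialize each RDD element to a string and write a one-column parquet file.

    Assumes pyspark; `spark` is an existing SparkSession and `rdd` holds
    picklable objects. The column name 'payload' is an arbitrary choice.
    """
    rows = rdd.map(lambda obj: (to_text(obj),))
    spark.createDataFrame(rows, ["payload"]).write.parquet(path)


# Local round trip, no Spark needed.
restored = from_text(to_text(Fraction(3, 4)))
```

On the Dask side, after reading the parquet file, applying from_text to the payload column would recover the objects partition by partition.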
I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV file into a Spark DataFrame:
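from pyspark.sql.types import StructType, StructField, StringType

# sc and sqlContext are the objects provided by the PySpark shell
lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l : l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts,schemaData)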
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():
Does it store the pandas object in local memory?
Is pandas' low-level computation all handled by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes.)
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Using spark to read in a CSV file to pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
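To make the point about strings concrete, the plain-Python sketch below applies the same strip-and-split step as the lambda in the question to two made-up lines. No Spark is involved; the sample rows are invented for illustration.

```python
lines = ["1\t2.5\tfoo", "3\t4.5\tbar"]  # made-up sample rows

# Same transformation as the lambda passed to lines.map(...) in the question.
parts = [line.strip().split("\t") for line in lines]

# Every field is still a str, so numeric work needs an explicit cast.
second_column = [row[1] for row in parts]
total = sum(float(x) for x in second_column)
```

Calling sum(second_column) directly would raise a TypeError, because the values are strings until they are cast.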
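Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column, which sc.textFile does not do. (If you do stay in Spark, the DataFrame reader spark.read.csv accepts an inferSchema option for this.)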
Now to answer your questions:
Does it store the Pandas object to local memory:
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is pandas' low-level computation all handled by Spark?
No. Pandas runs its own computations. There's no interplay between Spark and pandas; there's simply some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it toPandas and just be done with it, without so much touching DataFrame API?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using some spark context or hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run a rdd.collect() method, you end up copying the contents of the rdd from all the worker nodes to the master node memory. Thus you lose your distributed compute benefits (but can still run the rdd methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.