rdd vs dataframe in pyspark


I just read that a DataFrame has two-dimensional, array-like storage, whereas an RDD has no such constraint on storage, and that this is why queries can be optimized better with DataFrames. Does this mean that creating a DataFrame consumes more memory than creating an RDD over the same input dataset?
Also, if I have an RDD defined as rdd1 and I use the toDF method to convert rdd1 into a DataFrame, am I consuming more memory on the node?
Similarly, if I have a DataFrame and I convert it to an RDD using df.rdd, am I freeing some space on the node?

RDD:
Resilient Distributed Datasets. An RDD is a fault-tolerant collection of elements that can be operated on in parallel; it is the fundamental data structure of Spark.
With RDDs we can process structured as well as unstructured data. However, an RDD cannot infer a schema for the ingested data; the user has to specify it.
It is a distributed collection of data elements spread across many machines in the cluster. RDDs are a set of Scala or Java objects representing data.
The RDD API supports an object-oriented programming style with compile-time type safety (in the Scala and Java APIs).
RDDs are immutable: once created, an RDD cannot be changed, and transformations produce new RDDs.
If an RDD is in tabular format, we can move from the RDD to a DataFrame with the toDF() method. We can also do the reverse with the .rdd attribute.
RDDs have no built-in optimization engine. Developers have to optimise each RDD themselves, based on its attributes.
Spark does not compute results right away; it evaluates RDDs lazily.
Since the RDD API uses schema projection explicitly, the user needs to define the schema manually.
For simple grouping and aggregation operations, the RDD API is slower than the DataFrame API.
DataFrame:
DataFrame data is organized into named columns. It is essentially the same as a table in a relational database.
If we try to access a column that is not present in the table, an attribute error may occur at runtime. DataFrames do not provide compile-time type safety in such cases.
One cannot regenerate a domain object after transforming it into a DataFrame. For example, if we generate a test DataFrame from an RDD of a test class, we cannot recover the original RDD of that class again.
Optimization in DataFrames is done by the Catalyst optimizer, which applies its tree transformation framework in four phases.
Use of off-heap memory for serialization reduces overhead. Spark also generates bytecode so that many operations can be performed directly on the serialized data.
As with RDDs, Spark evaluates DataFrames lazily: computation happens only when an action is called.
With DataFrames there is no need to specify a schema; generally, it is discovered automatically.
For exploratory analysis and computing aggregated statistics on data, DataFrames are faster.
We use DataFrames when we need a high level of abstraction. For unstructured data, such as media streams or streams of text, RDDs are the better fit.

Related

Is it possible to manually create Dask data frames? (i.e., not by a fixed partition count)

I would like to define how a Dask dataframe is partitioned (e.g., by a particular criterion for splitting) or be able to create one manually.
The situation:
I have a Python function that traverses a subset of a large data frame. The traversal can be limited to all rows that match a certain key, so I need to ensure that the rows for a key are not split over several partitions.
Currently, I am splitting the input data frame (Pandas) manually and using multiprocessing to process each partition separately.
I would love to use Dask, which I also use for other computations, due to its ease of use. But I can't find a way to manually define how the input dataframe is split in order to later use map_partitions.
Or am I on a completely wrong path here and should I use other methods of Dask?

You might find dask.delayed useful; you can use it to create a custom Dask dataframe: https://docs.dask.org/en/latest/dataframe-create.html#dask-delayed

Is there any good way to read the content of a Spark RDD into a Dask structure

Currently the integration between Spark structures and Dask seems cumbersome when dealing with complicated nested structures. Specifically, dumping a Spark DataFrame with nested structure to be read by Dask does not seem very reliable yet, although the parquet loading is part of a large ongoing effort (fastparquet, pyarrow).
So my follow-up question: let's assume that I can live with doing a few transformations in Spark and transforming the DataFrame into an RDD that contains custom class objects. Is there a way to reliably dump the data of a Spark RDD with custom class objects and read it into a Dask collection? Obviously you can collect the RDD into a Python list, pickle it, and then read it as a normal data structure, but that removes the opportunity to load larger-than-memory datasets. Could something like the Spark pickling be used by Dask to load a distributed pickle?

I solved this by doing the following:
Starting from a Spark RDD with a list of custom objects as Row values, I created a version of the RDD where I serialised the objects to strings using cPickle.dumps. I then converted this RDD to a simple DataFrame with string columns and wrote it to parquet. Dask is able to read parquet files with a simple structure. Finally, I deserialised with cPickle.loads to get the original objects back.
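The serialise/deserialise round trip at the heart of this approach can be sketched without Spark or Dask. This assumes Python 3, where cPickle has been folded into pickle and pickle.dumps returns bytes; the base64 step is an added assumption so that the result fits in a string column. The Reading class and the helper names are hypothetical.

```python
import base64
import pickle
from dataclasses import dataclass

@dataclass
class Reading:
    """Hypothetical custom class standing in for the objects in the RDD."""
    sensor: str
    value: float

def to_text(obj):
    """Pickle an object and base64-encode it so it fits in a string column."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")

def from_text(text):
    """Reverse of to_text: decode the string and unpickle the object."""
    return pickle.loads(base64.b64decode(text))

original = [Reading("a", 1.5), Reading("b", 2.0)]
encoded = [to_text(r) for r in original]     # strings that would be written to parquet
restored = [from_text(s) for s in encoded]   # applied after reading the strings back
```

The class definition has to be importable on the reading side too, since pickle stores a reference to the class rather than its code.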

PySpark casting IntegerTypes to ByteType for optimization

I'm reading a large amount of data from parquet files into dataframes. I noticed that a vast number of the columns only have 1, 0, or -1 as values and thus could be converted from Int to Byte types to save memory.
I wrote a function to do just that and return a new dataframe with the values cast as bytes. However, when looking at the memory of the dataframe in the UI, I see it saved as just a transformation from the original dataframe and not as a new dataframe itself, thus taking the same amount of memory.
I'm rather new to Spark and may not fully understand the internals, so how would I go about initially setting those columns to be of ByteType?

TL;DR: It might be useful, but in practice the impact might be much smaller than you think.
As you noticed:
the memory of the dataframe in the UI, I see it saved as just a transformation from the original dataframe and not as a new dataframe itself, thus taking the same amount of memory.
For storage, Spark uses in-memory columnar storage, which applies a number of optimizations, including compression. If the data has low cardinality, a column can easily be compressed using run-length encoding or dictionary encoding, and casting won't make any difference.
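To see why low cardinality matters more than the declared integer width, here is a toy sketch of the two encodings in plain Python. It is not Spark's actual implementation; the function names and the sample column are made up for illustration.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def dictionary_encode(values):
    """Replace each value with a small index into a table of distinct values."""
    table = sorted(set(values))
    index = {value: i for i, value in enumerate(table)}
    return table, [index[value] for value in values]

# A column that only ever holds 1, 0, or -1.
column = [1] * 5 + [0] * 3 + [-1] * 4

runs = run_length_encode(column)          # three pairs instead of twelve values
table, codes = dictionary_encode(column)  # three distinct values plus small codes
```

In both cases the encoded size depends on the number of runs or distinct values, not on whether the column was declared as an integer or a byte.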

In order to see whether there is any impact, you can try two things:
Write the data back to the file system, once with the original type and another time with your optimisation. Compare the size on disk.
Try calling collect on the dataframe and look at the driver memory in your OS's system monitor; make sure to induce a garbage collection to get a cleaner indication. Again, do this once without the optimisation and once with it.
user8371915 is right in the general case, but take into account that the optimisations may or may not kick in based on various parameters like row group size and dictionary encoding threshold.
This means that even if you do see an impact, there is a good chance you could get the same compression by tuning Spark.

How are Python data structures implemented in Spark when using PySpark?

I am currently self-learning Spark programming and trying to recode an existing Python application in PySpark. However, I am still confused about how we use regular Python objects in PySpark.
I understand the distributed data structures in Spark such as the RDD, DataFrame, Dataset, vector, etc. Spark has its own transformation and action operations, such as .map() and .reduceByKey(), to manipulate those objects. However, what if I create traditional Python data objects such as an array, list, tuple, or dictionary in PySpark? They will only be stored in the memory of my driver program node, right? If I transform them into an RDD, can I still operate on them with typical Python functions?
If I have a huge dataset, can I use regular Python libraries like pandas or numpy to process it in PySpark? Will Spark only use the driver node to process the data if I directly execute a Python function on a Python object in PySpark? Or do I have to put it in an RDD and use Spark's operations?

You can create traditional Python data objects such as an array, list, tuple, or dictionary in PySpark.
You can perform most operations using Python functions in PySpark.
You can import Python libraries in PySpark and use them to process data.
You can create an RDD and apply Spark operations to it.

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV file into a Spark DataFrame:
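from pyspark.sql.types import StructType, StructField, StringType
# sc and sqlContext are assumed to be an existing SparkContext and SQLContext
lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)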
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():
Does it store the Pandas object in local memory?
Is Pandas' low-level computation all handled by Spark?
Does it expose all pandas dataframe functionality? (I guess yes)
Can I convert it with toPandas and just be done with it, without touching the DataFrame API much?

Using spark to read in a CSV file to pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
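A small sketch of the difference, assuming pandas is installed. The tab-separated content and the column names are invented stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for the CSV file.
raw = "name\tcount\nfoo\t3\nbar\t4\n"

# Splitting lines by hand, as in the sc.textFile example, leaves every field a string.
split_rows = [line.split("\t") for line in raw.strip().split("\n")[1:]]

# pandas.read_csv infers a numeric dtype for the count column, so it can be summed.
df = pd.read_csv(io.StringIO(raw), sep="\t")
total = int(df["count"].sum())
```

With a real file, the io.StringIO object would simply be replaced by the file path.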
Now to answer your questions:
Does it store the Pandas object to local memory:
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is Pandas' low-level computation all handled by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and pandas, there's simply some API compatibility.
Does it expose all pandas dataframe functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it toPandas and just be done with it, without so much touching DataFrame API?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.

Using some spark context or hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run a rdd.collect() method, you end up copying the contents of the rdd from all the worker nodes to the master node memory. Thus you lose your distributed compute benefits (but can still run the rdd methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.
