I am using Spark's Python API and running Spark 0.8.
I am storing a large RDD of floating point vectors and I need to perform calculations of one vector against the entire set.
Is there any difference between slices and partitions in an RDD?
When I create the RDD, I pass it 100 as a parameter, which causes it to store the RDD as 100 slices and to create 100 tasks when performing the calculations. I want to know whether partitioning the data would improve performance beyond the slicing by enabling the system to process the data more efficiently (i.e., is there a difference between performing operations over a partition versus just operating over every element of the sliced RDD)?
For example, is there any significant difference between these two pieces of code?
rdd = sc.textFile('demo.txt', 100)
vs
rdd = sc.textFile('demo.txt')
rdd.partitionBy(100)
I believe slices and partitions are the same thing in Apache Spark.
However, there is a subtle but potentially significant difference between the two pieces of code you posted.
This code will attempt to load demo.txt directly into 100 partitions using 100 concurrent tasks:
rdd = sc.textFile('demo.txt', 100)
For uncompressed text, it will work as expected. But if instead of demo.txt you had a demo.gz, you will end up with an RDD with only 1 partition. Reads against gzipped files cannot be parallelized.
On the other hand, the following code will first open demo.txt into an RDD with the default number of partitions, then it will explicitly repartition the data into 100 partitions that are roughly equal in size.
rdd = sc.textFile('demo.txt')
rdd = rdd.repartition(100)
So in this case, even with a demo.gz you will end up with an RDD with 100 partitions.
As a side note, I replaced your partitionBy() with repartition(), since that's what I believe you were looking for; partitionBy() requires the RDD to be an RDD of tuples. Since repartition() is not available in Spark 0.8.0, you should instead be able to use coalesce(100, shuffle=True).
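A minimal sketch of that Spark 0.8.0 fallback, assuming the usual SparkContext sc:
rdd = sc.textFile('demo.txt')            # default partitioning (only 1 partition for a .gz)
rdd = rdd.coalesce(100, shuffle=True)    # shuffle=True forces a full reshuffle into ~100 roughly equal partitions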
Spark can run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably 2-3x that).
As of Spark 1.1.0, you can check how many partitions an RDD has as follows:
rdd.getNumPartitions() # Python API
rdd.partitions.size // Scala API
Before 1.1.0, the way to do this with the Python API was rdd._jrdd.splits().size().
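For example, a quick check in the PySpark shell (a sketch; sc is the usual SparkContext):
rdd = sc.parallelize(range(1000), 100)
print(rdd.getNumPartitions())       # Spark >= 1.1.0
print(rdd._jrdd.splits().size())    # earlier releases, via the internal Java RDD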
You can partition the data as follows:
import org.apache.spark.Partitioner

val p = new Partitioner() {
  def numPartitions = 2
  def getPartition(key: Any) = key.asInstanceOf[Int]
}

recordRDD.partitionBy(p)
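Since the question uses the Python API, a rough PySpark equivalent might look like this (a sketch; the sample pair RDD and the modulo key function are only for illustration):
pairs = sc.parallelize([(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')])
partitioned = pairs.partitionBy(2, partitionFunc=lambda key: key % 2)
print(partitioned.glom().collect())   # shows which keys ended up in which partition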
Related
I have a performance issue, and after analyzing the Spark web UI I found what seems to be data skewness:
Initially I thought the partitions were not evenly distributed, so I performed an analysis of row count per partition, but it seems normal (no outliers):
how to manually run pyspark's partitioning function for debugging
But the problem persists, and I see there is one executor processing most of the data:
So the hypothesis now is that partitions are not evenly distributed across executors. The question is: how does Spark distribute partitions to executors, and how can I change it to solve my skewness problem?
The code is very simple:
hive_query = """SELECT ... FROM <multiple joined hive tables>"""
df = sqlContext.sql(hive_query).cache()
print(df.count())
Update: after posting this question I performed further analysis and found that there are 3 tables that cause this; if they are removed, the data is evenly distributed across the executors and performance improves. So I added the Spark SQL hint /*+ BROADCASTJOIN(<table_name>) */ and it worked; performance is much better now, but the question remains:
Why do these tables (including a small 6-row table) cause this uneven distribution across executors when added to the query?
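For reference, a minimal sketch of where such a hint goes in the query; fact_table, small_dim and the join key are hypothetical names standing in for the real tables:
hive_query = """
    SELECT /*+ BROADCASTJOIN(small_dim) */ f.*
    FROM fact_table f
    JOIN small_dim d ON f.key = d.key
"""
df = sqlContext.sql(hive_query).cache()
print(df.count())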
repartition() will not give you an evenly distributed dataset, as Spark internally uses a HashPartitioner. To spread your data evenly across all partitions, in my view a custom Partitioner is the way to go.
In this case you need to extend the org.apache.spark.Partitioner class and use your own logic instead of HashPartitioner. To do this you need to convert the RDD to a pair RDD.
The blog post below should help in your case:
https://blog.clairvoyantsoft.com/custom-partitioning-spark-datasets-25cbd4e2d818
Thanks
When you are reading data from HDFS, the number of partitions depends on the number of blocks you are reading. From the images attached, it looks like your data is not distributed evenly across the cluster. Try repartitioning your data and tweaking the number of cores and executors.
If you are repartitioning your data and the hash partitioner returns one value more often than the others, that can lead to data skew.
If this happens after performing a join, then your data is skewed.
I have a data source, around 100GB, and I'm trying to write it partitioned using a date column.
In order to avoid small chunks inside the partitions, I've added a repartition(5) to have 5 files max inside each partition :
df.repartition(5).write.orc("path")
My problem here is that only 5 executors out of the 30 I'm allocating are actually running. In the end I have what I want (5 files inside each partition), but since only 5 executors are running, the execution time is extremely high.
Do you have any suggestions on how I can make it faster?
I fixed it simply by using:
df.repartition($"dateColumn").write.partitionBy("dateColumn").orc(path)
And by allocating the same number of executors as the number of partitions I'll have in the output.
Thanks all
You can use repartition along with partitionBy to resolve the issue.
There are two ways to solve this.
Suppose you need to partition by dateColumn
df.repartition(5, 'dateColumn').write.partitionBy('dateColumn').parquet(path)
In this case, the number of executors used will be equal to 5 * distinct(dateColumn), and each date will contain 5 files.
Another approach is to repartition your data to 3 times the number of executors, then use maxRecordsPerFile to save the data. This will create files of roughly equal size, but you will lose control over the number of files created.
df.repartition(60).write.option('maxRecordsPerFile',200000).partitionBy('dateColumn').parquet(path)
Spark can run 1 concurrent task for every partition of an RDD or data frame (up to the number of cores in the cluster). If your cluster has 30 cores, you should have at least 30 partitions. On the other hand, a single partition typically shouldn’t contain more than 128MB and a single shuffle block cannot be larger than 2GB (see SPARK-6235).
Since you want to reduce your execution time, it is better to increase the number of partitions and reduce it only at the end of your job.
For a better (more even) distribution of your data among partitions, it is better to use the hash partitioner.
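A minimal sketch of that advice, with illustrative numbers for a 30-core cluster (the 90/30 figures, and reusing dateColumn and path from the question, are assumptions):
df_wide = df.repartition(90, 'dateColumn')        # ~3x the core count while the heavy work runs
df_out = df_wide.coalesce(30)                     # reduce the partition count only at the end
df_out.write.partitionBy('dateColumn').orc(path)  # still partitioned by date on disk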
I'm trying to tune the performance of spark, by the use of partitioning on a spark dataframe. Here is the code:
file_path1 = spark.read.parquet(*paths[:15])
df = file_path1.select(columns) \
    .where((func.col("organization") == organization))
df = df.repartition(10)
#execute an action just to make spark execute the repartition step
df.first()
During the execution of first() I check the job stages in the Spark UI, and here is what I find:
Why is there no repartition step in the stages?
Why is there also a stage 8? I only requested one action, first(). Is it because of the shuffle caused by the repartition?
Is there a way to change the partitioning of the parquet files without having to resort to such operations? Initially, when I read the df, you can see that it is partitioned into 43k partitions, which is really a lot (compared to its size when I save it to a csv file: 4 MB with 13k rows) and creates problems in further steps; that's why I wanted to repartition it.
Should I use cache() after repartition, i.e. df = df.repartition(10).cache()? When I executed df.first() the second time, I also got a scheduled stage with 43k partitions, even though df.rdd.getNumPartitions() returned 10.
EDIT: the number of partitions is just for experimenting; my questions are meant to help me understand how to do the repartitioning right.
Note: initially the Dataframe is read from a selection of parquet files in Hadoop.
I have already read this as a reference: How does Spark partition(ing) work on files in HDFS?
Whenever there is shuffling, there is a new stage. The repartition causes shuffling, and that's why you have two stages.
Caching is used when you'll use the dataframe multiple times, to avoid reading it twice.
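A minimal sketch of that pattern, reusing the names from the question (paths, columns, func and organization are the question's own variables); count() is used instead of first() only so that every partition of the cache gets materialized:
df = spark.read.parquet(*paths[:15]) \
    .select(columns) \
    .where(func.col("organization") == organization)
df = df.repartition(10).cache()       # shuffle into 10 partitions and keep the result in memory
df.count()                            # full action: fills the cache across all 10 partitions
print(df.rdd.getNumPartitions())      # subsequent uses should now report 10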
Use coalesce instead of repartition. I think it causes less shuffling since it only reduces the number of partitions.
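For example (a quick sketch with the df from the question):
df_small = df.coalesce(10)               # merges existing partitions instead of a full shuffle
print(df_small.rdd.getNumPartitions())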
I am trying to use groupby and apply a custom function on a huge dataset, which is giving me memory errors, and the workers are getting killed because of the shuffling. How can I avoid the shuffle and do this efficiently?
I am reading around fifty 700MB (each) parquet files, and the data in those files is isolated, i.e. no group exists in more than one file. If I try running my code on one file it works fine, but it fails when I try to run it on the complete dataset.
The Dask documentation talks about problems with groupby when you apply a custom function on groups, but it does not offer a solution for such data:
http://docs.dask.org/en/latest/dataframe-groupby.html#difficult-cases
How can I process my dataset in a reasonable timeframe (it takes around 6 minutes for a groupby-apply on a single file) and hopefully avoid the shuffle? I do not need my results to be sorted, or groupby trying to sort my complete dataset from different files.
I have tried using persist, but the data does not fit into RAM (32GB). Dask does not support a multi-column index, but I tried setting an index on one column to help the groupby, to no avail. Below is what the structure of the code looks like:
import pandas as pd
from dask.dataframe import read_parquet

# The custom function sorts the data within a group (the groups are small, fewer than
# 50 entries) on a field and computes some values based on heuristics (it computes
# 4 values, but I am showing 1 in the example below; the other 3 calculations are similar).
def custom_function(group):
    results = {}
    sorted_group = group.sort_values(['C']).reset_index(drop=True)
    sorted_group['delta'] = sorted_group['D'].diff()
    sorted_group.delta = sorted_group.delta.shift(-1)
    results['res1'] = (sorted_group[sorted_group.delta < -100]['D'].sum() - sorted_group.iloc[0]['D'])
    # similarly, 3 more results are generated
    results_df = pd.DataFrame(results, index=[0])
    return results_df

df = read_parquet('s3://s3_directory_path')
results = df.groupby(['A', 'B']).apply(custom_function).compute()
One possibility is that I process one file at a time and do it multiple times, but in that case dask seems useless (no parallel processing) and it will take hours to achieve the desired results. Is there any way to do this efficiently using dask, or any other library? How do people deal with such data?
If you want to avoid shuffling and can promise that groups are well isolated, then you could just call a pandas groupby-apply across every partition with map_partitions:
df.map_partitions(lambda part: part.groupby(...).apply(...))
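A slightly fuller sketch under the same assumption (groups never span files, so one pandas groupby per partition is enough); the meta dict is an assumption based on the res1 column produced by custom_function in the question, and you would extend it with the other three result columns:
import dask.dataframe as dd

df = dd.read_parquet('s3://s3_directory_path')
results = df.map_partitions(
    lambda part: part.groupby(['A', 'B']).apply(custom_function),
    meta={'res1': 'f8'},   # assumed dtype; add the other result columns here
).compute()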
I see the parameter npartitions in many functions, but I don't understand what it is good for / used for.
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
head(...)
Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.
repartition(...)
Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.
Is the number of partitions 5 in this case?
(Image source: http://dask.pydata.org/en/latest/dataframe-overview.html )
The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways.
If you don't have enough partitions then you may not be able to use all of your cores effectively. For example if your dask.dataframe has only one partition then only one core can operate at a time.
If you have too many partitions then the scheduler may incur a lot of overhead deciding where to compute each task.
Generally you want a few times more partitions than you have cores. Every task takes up a few hundred microseconds in the scheduler.
You can determine the number of partitions either at data ingestion time, using parameters like blocksize= in read_csv(...), or afterwards with the .repartition(...) method.
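A minimal sketch of both knobs (the file pattern, block size, and target partition count are placeholders):
import dask.dataframe as dd

df = dd.read_csv('data-*.csv', blocksize='64MB')   # partition count chosen at ingestion time
print(df.npartitions)

df = df.repartition(npartitions=32)                # or adjusted afterwards
print(df.npartitions)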
I tried to check what the optimal number is for my case.
I work on a laptop with 8 cores.
I have 100 GB of CSV files with 250M rows and 25 columns.
I ran the "describe" function with 1, 5, 30, and 1000 partitions:
df = df.repartition(npartitions=1)
a1=df['age'].describe().compute()
df = df.repartition(npartitions=5)
a2=df['age'].describe().compute()
df = df.repartition(npartitions=30)
a3=df['age'].describe().compute()
df = df.repartition(npartitions=1000)
a4=df['age'].describe().compute()
About speed:
5 and 30 partitions > around 3 minutes
1 and 1000 partitions > around 9 minutes
But I found that "order" functions like median or percentile give wrong numbers when I use more than one partition.
1 partition gives the right number (I checked it with small data using both pandas and dask).