Spark Repartition Executors - python

I have a data source, around 100GB, and I'm trying to write it partitioned using a date column.
In order to avoid small chunks inside the partitions, I've added repartition(5) so that there are at most 5 files inside each partition:
df.repartition(5).write.orc("path")
My problem here is that only 5 executors out of the 30 I'm allocating are actually running. In the end I have what I want (5 files inside each partition), but since only 5 executors are running, the execution time is extremely high.
Do you have any suggestions on how I can make it faster?

I fixed it simply using:
df.repartition("dateColumn").write.partitionBy("dateColumn").orc(path)
and allocating the same number of executors as the number of partitions I'll have in the output.
Thanks all

You can use repartition along with partitionBy to resolve the issue.
There are two ways to solve this.
Suppose you need to partition by dateColumn:
df.repartition(5, 'dateColumn').write.partitionBy('dateColumn').parquet(path)
In this case the number of files written will be equal to 5 * distinct(dateColumn), and each date will contain 5 files.
Another approach is to repartition your data to 3 times the number of executors and then use maxRecordsPerFile to save the data. This creates files of roughly equal size, but you lose control over the number of files created.
df.repartition(60).write.option('maxRecordsPerFile',200000).partitionBy('dateColumn').parquet(path)

Spark can run 1 concurrent task for every partition of an RDD or dataframe (up to the number of cores in the cluster). If your cluster has 30 cores, you should have at least 30 partitions. On the other hand, a single partition typically shouldn't contain more than 128 MB, and a single shuffle block cannot be larger than 2 GB (see SPARK-6235).
Since you want to reduce your execution time, it is better to increase the number of partitions while processing and reduce it again at the end of the job, just before writing.
For a more even distribution of your data among partitions, it is better to use a hash partitioner.
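A minimal PySpark sketch of that advice (the paths, column name, and partition counts below are illustrative assumptions, not values from the question): keep parallelism high while processing, then hash-repartition on the date column just before the write so each date directory gets only a few files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()
df = spark.read.orc("input_path")  # hypothetical input location

print(df.rdd.getNumPartitions())   # inspect the current parallelism

processed = df.repartition(90)     # ~3x the 30 allocated cores, so every core stays busy
# ... run the heavy transformations here with full parallelism ...

(processed
 .repartition(5, "dateColumn")     # hash-partition on the date column into 5 partitions
 .write
 .partitionBy("dateColumn")
 .orc("output_path"))              # only 5 write tasks, so at most 5 files per date directory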

Related

More optimized way to do itertools.combinations

I'm trying to find unique combinations of ~70,000 IDs.
I'm currently doing an itertools.combinations([list name], 2) to get unique 2 ID combinations but it's been running for more than 800 minutes.
Is there a faster way to do this?
I tried converting the IDs into a matrix where the IDs are both the index and the columns and populating the matrix using itertools.product.
I tried doing it the manual way with loops too.
But after more than a full day of letting them run, none of my methods have actually finished running.
For additional information, I'm storing these into a data frame, to later run a function that compares each of the unique set of IDs.
(70_000 * 69_999) / 2 ≈ 2.4 billion - that is not such a large number as to be uncomputable in a few hours (update: I ran a dry run on itertools.combinations(range(70000), 2) and it took less than 70 seconds, on a 2017-era i7 @ 3GHz, naively using a single core). But if you are trying to keep all of this data in memory at once, it won't fit - and if your system is configured to swap memory to disk before erroring with a MemoryError, this may slow the program down by 2 or more orders of magnitude, and that is where your problem comes from.
itertools.combinations does the right thing in this respect, and there is no need to replace it with something else: it will yield one combination at a time. What you do with the result, however, does change things: if you are streaming the combinations to a file and not keeping them in memory, it should be fine, and then it is just computation time you can't speed up anyway.
If, on the other hand, you are collecting the combinations into a list or other data structure: there is your problem - don't do it.
Now, going a step further than your question: since these combinations are checkable and predictable, maybe generating them up front is not the right approach at all. You don't give details on how they are to be used, but if they are consumed in a reactive or lazy fashion, you might have an instantaneous workflow instead.
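A minimal sketch of the streaming approach described above (the output file name and CSV format are assumptions, not from the question): each pair is written out as soon as it is produced, so memory use stays constant.
import csv
import itertools

ids = range(70_000)  # stand-in for the real list of ~70,000 IDs

with open("combinations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for a, b in itertools.combinations(ids, 2):
        writer.writerow((a, b))  # streamed straight to disk, nothing accumulates in memory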
Your RAM will fill up. You can counter this with gc.collect() or by emptying the results, but the results found so far have to be saved in between.
You could try something similar to the code below. I would create individual file names or save the results into a database, since the result file will be several GB in size. Additionally, the range of the second loop can probably be divided by 2.
import gc

# build the set of IDs (0..69999 as a stand-in for the real IDs)
new_set = set()
for i in range(70000):
    new_set.add(i)
print(new_set)

combined_set = set()
for i in range(len(new_set)):
    print(i)
    if i % 300 == 0:
        # every 300 outer iterations: flush the pairs found so far to disk,
        # empty the set and force a garbage collection to keep RAM in check
        with open("results", "a") as f:
            f.write(str(combined_set))
        combined_set = set()
        gc.collect()
    for b in range(len(new_set)):
        combined_set.add((i, b))

Out of memory when trying to persist a dataframe

I am facing an out of memory error when trying to persist a dataframe and I don't really understand why. I have a dataframe of roughly 20 GB, with 2.5 million rows and around 20 columns. After filtering this dataframe, I have 4 columns and 0.5 million rows.
Now my problem is that when I persist the filtered dataframe I get an out of memory error (exceeds 25.4 GB of 20 GB physical memory used). I have tried persisting at different storage levels:
df = spark.read.parquet(path) # 20 Gb
df_filter = df.select('a', 'b', 'c', 'd').where(df.a == something) # a few Gb
df_filter.persist(StorageLevel.MEMORY_AND_DISK)
df_filter.count()
My cluster has 8 nodes with 30 GB of memory each.
Do you have any idea where that OOM could come from?
Just some suggestions to help identify the root cause...
You probably have either (or a combination) of:
skewed source data partition split sizes, which are tough to deal with and cause garbage collection pressure, OOM, etc. (these methods have helped me, but there may be better approaches per use case)
# to check num partitions
df_filter.rdd.getNumPartitions()
# to repartition (**does cause shuffle**) to increase parallelism and help with data skew
df_filter.repartition(...) # monitor/debug performance in spark ui after setting
too few/too many executors/RAM/cores set in the config
# check via
spark.sparkContext.getConf().getAll()
# these are the ones you want to watch out for
'''
--num-executors
--executor-cores
--executor-memory
'''
shuffle size for wide transformations too small/too large => try general debug checks to view the transformations that will be triggered when persisting + find their # of output partitions written to disk
# debug directed acyclic graph [dag]
df_filter.explain() # also "babysit" in spark UI to examine performance of each node/partitions to get specs when you are persisting
# check output partitions if shuffle occurs
spark.conf.get("spark.sql.shuffle.partitions")
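If those checks point at an unsuitable shuffle-partition count or at skew, here is a small sketch of the usual remedies (the partition count of 64 is an illustrative value, not something tuned for your data):
# pick a shuffle partition count suited to the post-filter data volume
spark.conf.set("spark.sql.shuffle.partitions", "64")

# and/or explicitly repartition the filtered dataframe before persisting
df_filter = df_filter.repartition(64)  # redistribute the filtered rows evenly across 64 partitions
df_filter.persist(StorageLevel.MEMORY_AND_DISK)
df_filter.count()  # materialize, then watch the Storage tab in the Spark UI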

What is the role of npartitions in a Dask dataframe?

I see the parameter npartitions in many functions, but I don't understand what it is good for / used for.
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
head(...)
Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.
repartition(...)
Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.
Is the number of partitions probably 5 in this case:
(Image source: http://dask.pydata.org/en/latest/dataframe-overview.html )
The npartitions property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways.
If you don't have enough partitions then you may not be able to use all of your cores effectively. For example if your dask.dataframe has only one partition then only one core can operate at a time.
If you have too many partitions then the scheduler may incur a lot of overhead deciding where to compute each task.
Generally you want a few times more partitions than you have cores. Every task takes up a few hundred microseconds in the scheduler.
You can determine the number of partitions either at data ingestion time using the parameters like blocksize= in read_csv(...) or afterwards by using the .repartition(...) method.
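A short sketch of both options (the file name, block size, and partition count are assumptions for illustration):
import dask.dataframe as dd

# option 1: choose the partition size at ingestion time
df = dd.read_csv("data.csv", blocksize="64MB")  # roughly 64 MB of CSV per pandas partition
print(df.npartitions)

# option 2: change the number of partitions afterwards
df = df.repartition(npartitions=32)  # e.g. a few times the number of cores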
I tried to check what the optimal number is for my case.
I work on a laptop with 8 cores.
I have a 100 GB csv file with 250M rows and 25 columns.
I ran the function describe() on 1, 5, 30, 1000 partitions:
df = df.repartition(npartitions=1)
a1=df['age'].describe().compute()
df = df.repartition(npartitions=5)
a2=df['age'].describe().compute()
df = df.repartition(npartitions=30)
a3=df['age'].describe().compute()
df = df.repartition(npartitions=100)
a4=df['age'].describe().compute()
About speed:
5, 30 partitions > around 3 minutes
1, 1000 partitions > around 9 minutes
But... I found that "order" functions like median or percentile give a wrong number when I use more than one partition (presumably because Dask computes these quantiles approximately when the data is split across partitions).
1 partition gives the right number (I checked it with small data using pandas and dask).

In what order does data get processed from RDDs in Spark?

Context
Spark provides RDDs for which map functions can be used to lazily set up operations for processing in parallel. RDDs can be created with a specified partitioning parameter that determines how many partitions to create per RDD; preferably this parameter equals the number of systems (e.g. you have 12 files to process: create an RDD with 3 partitions, which splits the data into buckets of 4 each for 3 systems, and all the files get processed concurrently in each system). It is my understanding that these partitions control the portion of data that goes to each system for processing.
Issue
I need to fine-tune and control how many functions run at the same time per system. If 2 or more functions run on the same GPU at the same time, the system will crash.
Question
If an RDD is not split evenly (like in the example above), how many threads run concurrently on the system?
Example
In:
sample_files = ['one.jpg','free.jpg','two.png','zero.png',
'four.jpg','six.png','seven.png','eight.jpg',
'nine.png','eleven.png','ten.png','ten.png',
'one.jpg','free.jpg','two.png','zero.png',
'four.jpg','six.png','seven.png','eight.jpg',
'nine.png','eleven.png','ten.png','ten.png',
'eleven.png','ten.png']
CLUSTER_SIZE = 3
example_rdd = sc.parallelize(sample_files, CLUSTER_SIZE)
example_partitions = example_rdd.glom().collect()
# Print elements per partition
for i, l in enumerate(example_partitions): print "partition #{} length: {}".format(i, len(l))
# Print partition distribution
print example_partitions
# How many map functions run concurrently when the action is called on this transformation?
mapped_rdd = example_rdd.map(lambda s: (s, len(s)))
action_results = mapped_rdd.reduceByKey(add)  # `add` is operator.add
Out:
partition #0 length: 8
partition #1 length: 8
partition #2 length: 10
[ ['one.jpg', 'free.jpg', 'two.png', 'zero.png', 'four.jpg', 'six.png', 'seven.png', 'eight.jpg'],
['nine.png', 'eleven.png', 'ten.png', 'ten.png', 'one.jpg', 'free.jpg', 'two.png', 'zero.png'],
['four.jpg', 'six.png', 'seven.png', 'eight.jpg', 'nine.png', 'eleven.png', 'ten.png', 'ten.png', 'eleven.png', 'ten.png'] ]
In Conclusion
What I need to know is: if the RDD is split the way it is, what controls how many threads are processed simultaneously? Is it the number of cores, or is there a global parameter that can be set so it only processes 4 at a time on each partition (system)?
In what order does data get processed from RDDs in Spark?
Unless it is some border case, like only one partition, the order is arbitrary or nondeterministic. It will depend on the cluster, on the data and on different runtime events.
The number of partitions only sets an upper limit on the overall parallelism for a given stage; in other words, a partition is the minimal unit of parallelism in Spark. No matter how many resources you allocate, a single stage cannot run more concurrent tasks than it has partitions. Once again, there can be border cases, for example when a worker is not accessible and a task is rescheduled on another machine.
Another possible limit you can think of is the number of executor threads. Even if you increase the number of partitions, a single executor thread will process only one partition at a time.
Neither of the above tells you where or when a given partition will be processed. While you can use some dirty, inefficient and non-portable tricks at the configuration level (like a single worker with a single executor thread per machine) to make sure that only one partition is processed on a given machine at a time, this is not particularly useful in general.
As a rule of thumb I would say that Spark code should never be concerned with the time and place it is executed. There are some low-level aspects of the API which provide means to set partition-specific preferences, but as far as I know these don't provide hard guarantees.
That being said, one can think of at least a few ways to approach this problem:
long-running executor threads with configuration-level guarantees - this could be acceptable if Spark is responsible only for loading and saving data (see the sketch after this list)
singleton objects which control queuing of jobs on the GPU
delegating GPU processing to a specialized service which ensures proper access
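A minimal sketch of the first, configuration-level option (every value below is an assumption, not something from the question): give each executor N cores and make every task reserve all N of them, so at most one task, and hence one GPU job, runs per executor at a time.
from operator import add
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("gpu-safe-processing")
        .set("spark.executor.cores", "4")  # cores available per executor (assumed value)
        .set("spark.task.cpus", "4"))      # each task claims all of them, so tasks never overlap on an executor
sc = SparkContext(conf=conf)

def process_partition(paths):
    # with the configuration above, only one copy of this function runs per
    # executor at a time, so the GPU on that machine is never shared
    for p in paths:
        yield (p, len(p))  # placeholder for the real GPU computation

files_rdd = sc.parallelize(['one.jpg', 'two.png', 'free.jpg'], 3)
results = files_rdd.mapPartitions(process_partition).reduceByKey(add).collect()
This trades cluster utilization for safety; the queuing-singleton or external-service options keep more cores busy.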
On a side note, you may be interested in Large Scale Distributed Deep Learning on Hadoop Clusters, which roughly describes an architecture that could be applicable here.

What are the differences between slices and partitions of RDDs?

I am using Spark's Python API and running Spark 0.8.
I am storing a large RDD of floating point vectors and I need to perform calculations of one vector against the entire set.
Is there any difference between slices and partitions in an RDD?
When I create the RDD, I pass it 100 as a parameter, which causes it to store the RDD as 100 slices and create 100 tasks when performing the calculations. I want to know if partitioning the data would improve performance beyond the slicing, by enabling the system to process the data more efficiently (i.e. is there a difference between performing operations over a partition versus just operating over every element in the sliced RDD).
For example, is there any significant difference between these two pieces of code?
rdd = sc.textFile('demo.txt', 100)
vs
rdd = sc.textFile('demo.txt')
rdd.partitionBy(100)
I believe slices and partitions are the same thing in Apache Spark.
However, there is a subtle but potentially significant difference between the two pieces of code you posted.
This code will attempt to load demo.txt directly into 100 partitions using 100 concurrent tasks:
rdd = sc.textFile('demo.txt', 100)
For uncompressed text, it will work as expected. But if instead of demo.txt you had a demo.gz, you will end up with an RDD with only 1 partition. Reads against gzipped files cannot be parallelized.
On the other hand, the following code will first open demo.txt into an RDD with the default number of partitions, then it will explicitly repartition the data into 100 partitions that are roughly equal in size.
rdd = sc.textFile('demo.txt')
rdd = rdd.repartition(100)
So in this case, even with a demo.gz you will end up with an RDD with 100 partitions.
As a side note, I replaced your partitionBy() with repartition() since that's what I believe you were looking for. partitionBy() requires the RDD to be an RDD of tuples. Since repartition() is not available in Spark 0.8.0, you should instead be able to use coalesce(100, shuffle=True).
Spark can run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to at least have 50 partitions (and probably 2-3x times that).
As of Spark 1.1.0, you can check how many partitions an RDD has as follows:
rdd.getNumPartitions() # Python API
rdd.partitions.size // Scala API
Before 1.1.0, the way to do this with the Python API was rdd._jrdd.splits().size().
You can create a custom partitioner as follows:
import org.apache.spark.Partitioner

// a partitioner with 2 partitions that uses the integer key itself as the
// partition index (so valid keys here are 0 and 1)
val p = new Partitioner() {
  def numPartitions = 2
  def getPartition(key: Any) = key.asInstanceOf[Int]
}
recordRDD.partitionBy(p)
