I am wondering why my jobs are running very slowly, and it appears to be because I am not using all of the memory available in PySpark.
When I go to the spark UI, and I click on "Executors" I see the following memory used:
And when I look at my executors I see the following table:
I am wondering why the "Used" memory is so small compared to the "Total" memory. What can I do to use as much of the memory as possible?
Other information:
I have a small broadcasted table, but it is only 1MB in size. It should be replicated once per each executor, so I do not imagine it would affect this that much.
I am using spark managed by yarn
I am using spark 1.6.1
Config settings are:
spark.executor.memory=45g
spark.executor.cores=2
spark.executor.instances=4
spark.sql.broadcastTimeout = 9000
spark.memory.fraction = 0.6
The dataset I am processing has 8397 rows, and 80 partitions. I am not doing any shuffle operations aside from the repartitioning initially to 80 partitions.
It is the part where I add columns that is slow. All of the parts before that seem to be reasonably fast, but when I try to add a column using a custom UDF (using withColumn), it slows down.
There is a similar question here:
How can I tell if my spark job is progressing? But my question is more pointed - why does the "Memory Used" show a number so low?
Thanks.
Related
I have a performance issue, and after analyzing the Spark web UI I found what seems to be data skew:
Initially I thought the partitions were not evenly distributed, so I analyzed the row count per partition, but it seems normal (with no outliers):
how to manually run pyspark's partitioning function for debugging
But the problem persists, and I see there is one executor processing most of the data:
So the hypothesis now is that partitions are not evenly distributed across executors. The question is: how does Spark distribute partitions to executors, and how can I change it to solve my skew problem?
The code is very simple:
hive_query = """SELECT ... FROM <multiple joined hive tables>"""
df = sqlContext.sql(hive_query).cache()
print(df.count())
Update: after posting this question I performed further analysis and found that there are 3 tables that cause this. If they are removed, the data is evenly distributed across the executors and performance improves, so I added the Spark SQL hint /*+ BROADCASTJOIN(<table_name>) */ and it worked. Performance is much better now, but the question remains:
Why do these tables (including a small 6-row table) cause this uneven distribution across executors when added to the query?
repartition() is not guaranteed to distribute the dataset evenly, since hash-based partitioning (HashPartitioner) sends every row whose key hashes to the same value to the same partition. In my view, a custom partitioner is the way to spread your data evenly across all partitions.
In this case you need to extend the org.apache.spark.Partitioner class and use your own logic instead of HashPartitioner. To do this, the RDD first needs to be converted to a pair RDD (PairRDD).
The following blog post may help in your case:
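To illustrate why hash-based assignment can skew and what a custom partitioning function changes, here is a minimal pure-Python sketch that does not use Spark at all. The names `hash_partition` and `make_balanced_partitioner` are hypothetical; in PySpark, a function of this shape could be passed as the `partitionFunc` argument of `RDD.partitionBy(numPartitions, partitionFunc)`, which is the Python-side counterpart of extending `Partitioner`. The round-robin assignment assumes the set of distinct keys is small enough to enumerate up front.

```python
from collections import Counter

NUM_PARTITIONS = 8


def hash_partition(key, num_partitions=NUM_PARTITIONS):
    """Hash-style assignment: partition index is hash(key) modulo the partition count."""
    return hash(key) % num_partitions


def make_balanced_partitioner(keys, num_partitions=NUM_PARTITIONS):
    """Hypothetical custom partitioner: deal the distinct keys out round-robin."""
    assignment = {key: i % num_partitions for i, key in enumerate(sorted(set(keys)))}
    return lambda key: assignment[key]


# Integer keys that are all multiples of 8 (in CPython, hash(n) == n for small ints),
# so every key collides on the same partition index under the modulo rule.
keys = [8 * i for i in range(1000)]

hashed = Counter(hash_partition(k) for k in keys)
balanced_fn = make_balanced_partitioner(keys)
balanced = Counter(balanced_fn(k) for k in keys)

print("hash-based:", dict(hashed))
print("custom:    ", dict(sorted(balanced.items())))
```

With these keys the hash-based rule puts every row into one partition, while the round-robin rule spreads the keys evenly. Real data is rarely this extreme, but the same mechanism applies whenever key values cluster on a few hash buckets.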
https://blog.clairvoyantsoft.com/custom-partitioning-spark-datasets-25cbd4e2d818
Thanks
When you read data from HDFS, the number of partitions depends on the number of blocks you are reading. From the attached images, it looks like your data is not distributed evenly across the cluster. Try repartitioning your data and tweaking the number of cores and executors.
If you are repartitioning your data and the hash partitioner returns one value much more often than the others, that can lead to data skew.
If this happens after performing a join, then your data is skewed.
Is there a memory efficient way to apply large (>4GB) models to Spark Dataframes without running into memory issues?
We recently ported a custom pipeline framework over to Spark (using python and pyspark) and ran into problems when applying large models like Word2Vec and Autoencoders to tokenized text inputs. First I very naively converted the transformation calls to udfs (both pandas and spark "native" ones), which was fine, as long as the models/utilities used were small enough to either be broadcasted, or instantiated repeatedly:
from pyspark.sql.functions import pandas_udf

@pandas_udf("array<string>")
def tokenize_sentence(sentences: pandas.Series):
    return sentences.map(lambda sentence: tokenize.word_tokenize(sentence))
Trying the same approach with large models (e.g. for embedding those tokens into vector space via word2vec) resulted in terrible performance, and I get why:
@pandas_udf("array<array<double>>")
def rows_to_lists_of_vectors(rows):
    # The model is loaded again on every invocation of the UDF
    model = api.load('word2vec-google-news-300')

    def words_to_vectors(words) -> List[List[float]]:
        vectors = []
        for word in words:
            if word in model:
                vec = model[word]
                vectors.append(vec.tolist())
        return vectors

    return rows.map(words_to_vectors)
The code above would instantiate the ~4 GB word2vec model repeatedly, loading it from disk into RAM each time, which is very slow. I could remedy this by using mapPartitions, which would at least load it only once per partition. But more importantly, this would crash with memory-related issues (at least on my dev machine) if I didn't heavily restrict the number of tasks, which in turn made the small UDFs very slow. For example, restricting the number of tasks to 2 would solve the memory crashes, but make the tokenizing painfully slow.
I understand there is an entire pipeline framework in Spark, that would fit our needs, but before committing to that, I'd like to understand how the problems I ran into were solved there. Maybe there are some key practices we could use instead of having to rewrite our framework.
My actual question therefore is twofold:
1) Would using the Spark pipeline framework solve our performance and memory issues, assuming we wrote custom Estimators and Transformers for the steps that are not covered by Spark out of the box (e.g. Tokenizers and Word2Vec)?
2) How does Spark solve those issues, if at all? Can I improve the current approach, or is this impossible using Python (where, to my understanding, processes don't share memory space)?
If any of the above makes you believe I missed a core principle with Spark, please point it out, after all I'm just getting started with Spark.
This varies a lot depending on several factors (models, cluster resources, pipeline), but to try to answer your main questions:
1). A Spark pipeline might solve your problem if it fits your needs in terms of Tokenizers, Word2Vec, etc. However, those are not as powerful as the ones already available off the shelf and loaded with api.load. You might also want to take a look at Deeplearning4J, which brings those to Java/Apache Spark, and see how it does the same things: tokenization, word2vec, etc.
2). Following the current approach, I would load the model inside foreachPartition or mapPartitions and make sure the model fits into memory per partition. You can shrink the partition size to a more affordable number based on the cluster resources to avoid memory problems (it's the same idea as creating one DB connection per partition instead of one per row).
Typically, Spark UDFs are a good fit when you apply some kind of business logic that is Spark-friendly and does not mix in external third-party libraries.
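The load-once-per-partition pattern can be sketched in plain Python, without Spark. Here `load_model` is a hypothetical stand-in for an expensive call such as `api.load(...)`, and the toy dictionary replaces a real embedding model. The function `embed_partition` has the iterator-in, iterator-out shape that `rdd.mapPartitions` expects, so under that assumption it could be passed to it directly.

```python
load_count = 0  # counts how often the (simulated) expensive load happens


def load_model():
    """Stand-in for an expensive load such as api.load(...); returns a toy lookup table."""
    global load_count
    load_count += 1
    return {"spark": [1.0, 0.0], "dask": [0.0, 1.0]}


def embed_partition(rows):
    """mapPartitions-style function: takes an iterator of rows, yields results lazily."""
    model = load_model()  # once per partition, not once per row
    for words in rows:
        yield [model[w] for w in words if w in model]


# Two simulated partitions of tokenized sentences.
partitions = [
    [["spark", "is", "fast"], ["dask"]],
    [["spark", "dask"], ["unknown"], ["dask", "dask"]],
]

results = [list(embed_partition(iter(p))) for p in partitions]
print("loads:", load_count)
print("first partition:", results[0])
```

The number of loads equals the number of partitions rather than the number of rows. Memory is still needed for one model copy per concurrently running task, so the number of parallel tasks per executor remains the limiting factor for a ~4 GB model.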
I'm reading in data using this:
ddf1 = dd.read_sql_table('mytable', conn_string, index_col='id', npartitions=8)
Of course, this runs instantaneously due to lazy computation. This table has several hundred million rows.
Next, I want to filter this Dask dataframe:
ddf2 = ddf1.query('some_col == "converted"')
Finally, I want to convert this to a Pandas dataframe. The result should only be about 8000 rows:
ddf3 = ddf2.compute()
However, this is taking very long (~1 hour). Can I get any advice on how to substantially speed this up? I've tried using .compute(scheduler='threads'), changing up the number of partitions, but none have worked so far. What am I doing wrong?
Firstly, you may be able to use sqlalchemy expression syntax to encode your filter clause in the query and do the filtering server-side. If data transfer is your bottleneck, then that is your best solution, especially if the filter column is indexed.
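The difference between client-side and server-side filtering can be shown with a small sketch. It uses the standard-library `sqlite3` module and an in-memory table as a stand-in for the real database, so the connection, the generated rows, and the index name `idx_some_col` are all assumptions for illustration; only the table and column names are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, some_col TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, some_col) VALUES (?, ?)",
    [(i, "converted" if i % 1000 == 0 else "pending") for i in range(100_000)],
)
conn.execute("CREATE INDEX idx_some_col ON mytable (some_col)")

# Client-side filtering: every row crosses the connection before being discarded.
all_rows = conn.execute("SELECT id, some_col FROM mytable").fetchall()
client_side = [row for row in all_rows if row[1] == "converted"]

# Server-side filtering: only the matching rows are transferred.
server_side = conn.execute(
    "SELECT id, some_col FROM mytable WHERE some_col = ? ORDER BY id",
    ("converted",),
).fetchall()

print("rows transferred without WHERE:", len(all_rows))
print("rows transferred with WHERE:   ", len(server_side))
```

Both approaches end with the same rows, but the first one moves the whole table to the client first. With several hundred million rows and roughly 8000 matches, that transfer is likely where most of the hour goes.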
Depending on your DB backend, sqlalchemy probably does not release the GIL, so your partitions cannot run in parallel in threads. All you are getting is contention between the threads and extra overhead. You should use the distributed scheduler with processes.
Of course, please look at your CPU and memory usage; with the distributed scheduler, you also have access to the diagnostic dashboard. You should also be concerned with how big each partition will be in memory.
I'm currently working on a project and I am having a hard time understanding how the Pandas UDF in PySpark works.
I have a Spark Cluster with one Master node with 8 cores and 64GB, along with two workers of 16 cores each and 112GB. My dataset is quite large and divided into seven principal partitions consisting each of ~78M lines. The dataset consists of 70 columns.
I defined a Pandas UDF to do some operations on the dataset that can only be done using Python, on a Pandas dataframe.
The pandas UDF is defined this way :
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def operation(pdf):
    # Some operations
    return pdf

spark.table("my_dataset").groupBy(partition_cols).apply(operation)
There is absolutely no way to get the Pandas UDF to work as it crashes before even doing the operations. I suspect there is an OOM error somewhere. The code above runs for a few minutes before crashing with an error code stating that the connection has reset.
However, if I call the .toPandas() function after filtering on one partition and then display it, it runs fine, with no error. The error seems to happen only when using a PandasUDF.
I fail to understand how it works. Does Spark try to convert one whole partition at once (78M lines) ? If so, what memory does it use ? The driver memory ? The executor's ? If it's on the driver's, is all Python code executed on it ?
The cluster is configured with the following :
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=64g
spark.executor.cores 2
spark.executor.memory 30g (to allow memory for the python instance)
spark.driver.memory 43g
Am I missing something or is there just no way to run 78M lines through a PandasUDF ?
Does Spark try to convert one whole partition at once (78M lines) ?
That's exactly what happens. Spark 3.0 adds support for chunked UDFs, which operate on iterators of Pandas DataFrames or Series, but if your operations "can only be done using Python, on a Pandas dataframe", these might not be the right choice for you.
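The idea behind chunked processing can be sketched in plain Python with lists standing in for Pandas objects. The helper names `read_in_batches` and `transform_batches` are hypothetical. In Spark 3.0 the same iterator-in, iterator-out shape is what `DataFrame.mapInPandas` and iterator-style Pandas UDFs use. The sketch assumes the operation is row-wise and never needs to see a whole group at once.

```python
from typing import Iterator, List


def read_in_batches(rows: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def transform_batches(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    """Iterator-in, iterator-out: only one batch is materialized at a time."""
    for batch in batches:
        yield [value * 2 for value in batch]


rows = list(range(10))
largest_batch = 0
output = []
for out_batch in transform_batches(read_in_batches(rows, batch_size=4)):
    largest_batch = max(largest_batch, len(out_batch))
    output.extend(out_batch)

print("largest batch held at once:", largest_batch)
print("output:", output)
```

Peak memory is bounded by the batch size instead of the group size. A GROUPED_MAP UDF cannot offer that bound, because each group has to be handed to the function as one complete DataFrame.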
If so, what memory does it use ? The driver memory? The executor's?
Each partition is processed locally, on the respective executor, and data is passed to and from Python worker, using Arrow streaming.
Am I missing something or is there just no way to run 78M lines through a PandasUDF?
As long as you have enough memory to handle the Arrow input and output (especially if data is copied), auxiliary data structures, as well as JVM overhead, it should handle large datasets just fine.
But on such a tiny cluster, you'll be better off partitioning the output and reading the data directly with Pandas, without using Spark at all. This way you'll be able to use all the available resources (i.e. > 100GB / interpreter) for data processing instead of wasting these on secondary tasks (having 16GB - overhead / interpreter).
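A back-of-the-envelope estimate shows why a single group is hard to fit. The row and column counts come from the question; the 8 bytes per value is an assumption (as if every column were float64 or int64), and string columns would usually need more.

```python
rows = 78_000_000      # lines per principal partition (from the question)
columns = 70           # columns in the dataset (from the question)
bytes_per_value = 8    # assumption: every value is a float64 or int64

raw_bytes = rows * columns * bytes_per_value
raw_gib = raw_bytes / 2**30

print(f"raw size of one group: {raw_gib:.1f} GiB")
```

Under these assumptions one group is already around 40 GiB before counting the Arrow buffers, the Pandas copy, and the returned DataFrame, which is more than the 30g configured per executor.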
To answer the general question about using a Pandas UDF on a large pyspark dataframe:
If you're getting out-of-memory errors such as
java.lang.OutOfMemoryError : GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space and increasing memory limits hasn't worked, ensure that pyarrow is enabled. It is disabled by default.
In pyspark, you can enable it using:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
More info here.
I'm doing some Monte Carlo for a model and figured that Dask could be quite useful for this purpose. For the first 35 hours or so, things were running quite "smoothly" (apart from the fan noise giving a sense that the computer was taking off). Each model run would take about 2 seconds and there were 8 partitions running it in parallel. Activity monitor was showing 8 python3.6 instances.
However, the computer has become "silent" and CPU usage (as displayed in Spyder) hardly exceeds 20%. Model runs are happening sequentially (not in parallel) and taking about 4 seconds each. This happened today at some point while I was working on other things. I understand that depending on the sequence of actions, Dask won't use all cores at the same time. However, in this case there is really just one task to be performed (see further below), so one could expect all partitions to run and finish more or less simultaneously. Edit: the whole set up has run successfully for 10.000 simulations in the past, the difference now being that there are nearly 500.000 simulations to run.
Edit 2: now it has shifted to doing 2 partitions in parallel (instead of the previous 1 and original 8). It appears that something is making it change how many partitions are simultaneously processed.
Edit 3: Following recommendations, I have used a dask.distributed.Client to track what is happening, and ran it for the first 400 rows. An illustration of what it looks like after completing is included below. I am struggling to understand the x-axis labels, hovering over the rectangles shows about 143 s.
Some questions therefore are:
Is there any relationship between running other software (Chrome, MS Word) and having the computer "take back" some CPU from python?
Or instead, could it be related to the fact that at some point I ran a second Spyder instance?
Or even, could the computer have somehow run out of memory? But then wouldn't the command have stopped running?
... any other possible explanation?
Is it possible to "tell" Dask to keep up the hard work and go back to using all CPU power while it is still running the original command?
Is it possible to interrupt an execution and keep whichever calculations have already been performed? I have noticed that stopping the current command doesn't seem to do much.
Is it possible to inquire on the overall progress of the computation while it is running? I would like to know how many model runs are left to have an idea of how long it would take to complete in this slow pace. I have tried using the ProgressBar in the past but it hangs on 0% until a few seconds before the end of the computations.
To be clear, uploading the model and the necessary data would be very complex. I haven't created a reproducible example either, out of fear of making the issue worse (for now the model is still running at least...) and because, as you can probably tell by now, I have very little idea of what could be causing it and I am not expecting anyone to be able to reproduce it. I'm aware this is not best practice and apologise in advance. However, I would really appreciate some thoughts on what could be going on and possible ways to go about it, if anyone has been through something similar before and/or has experience with Dask.
Running:
- macOS 10.13.6 (Memory: 16 GB | Processor: 2.5 GHz Intel Core i7 | 4 cores)
- Spyder 3.3.1
- dask 0.19.2
- pandas 0.23.4
Please let me know if anything needs to be made clearer
If you believe it can be relevant, the main idea of the script is:
# Create a pandas DataFrame where each column is a parameter and each row is a possible parameter combination (cartesian product). At the end of each row some columns to store the respective values of some objective functions are pre-allocated too.
# Generate a dask dataframe that is the DataFrame above split into 8 partitions
# Define a function that takes a partition and, for each row:
# Runs the model with the coefficient values defined in the row
# Retrieves the values of objective functions
# Assigns these values to the respective columns of the current row in the partition (columns have been pre-allocated)
# and then returns the partition with columns for objective functions populated with the calculated values
# map_partitions() to this function in the dask dataframe
Any thoughts?
This shows how simple the script is:
The dashboard:
Update: The approach I took was to:
Set a large number of partitions (npartitions=nCores*200). This made it much easier to visualise the progress. I'm not sure if setting so many partitions is good practice but it worked without much of a slowdown.
Instead of trying to get a single huge pandas DataFrame in the end by .compute(), I got the dask dataframe to be written to Parquet (in this way each partition was written to a separate file). Later, reading all files into a dask dataframe and computing it to a pandas DataFrame wasn't difficult, and if something went wrong in the middle at least I wouldn't lose the partitions that had been successfully processed and written.
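The same write-each-partition-separately idea can be sketched with the standard library only. This is an analogue, not the code I ran: JSON files stand in for Parquet, `run_model` is a hypothetical placeholder for one model run, and the file naming scheme is made up for the example.

```python
import json
import tempfile
from pathlib import Path


def run_model(params):
    """Hypothetical stand-in for one model run."""
    return params["a"] * params["b"]


def process_chunks(chunks, out_dir):
    """Write each chunk's results to its own file; skip chunks already on disk."""
    computed = 0
    for index, chunk in enumerate(chunks):
        target = out_dir / f"part-{index:05d}.json"
        if target.exists():
            continue  # finished in an earlier run
        results = [{**row, "objective": run_model(row)} for row in chunk]
        target.write_text(json.dumps(results))
        computed += 1
    return computed


rows = [{"a": a, "b": b} for a in range(4) for b in range(3)]
chunks = [rows[i:i + 3] for i in range(0, len(rows), 3)]  # 4 chunks of 3 rows

out_dir = Path(tempfile.mkdtemp())
first_run = process_chunks(chunks[:2], out_dir)  # simulate an interrupted run
second_run = process_chunks(chunks, out_dir)     # restart: only the rest is computed

combined = [
    row
    for path in sorted(out_dir.glob("part-*.json"))
    for row in json.loads(path.read_text())
]
print(first_run, second_run, len(combined))
```

Counting the files in the output directory also gives a simple progress indicator, which is the main reason many small partitions were convenient here.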
This is what it looked like at a given point:
Dask has many diagnostic tools to help you understand what is going on inside your computation. See http://docs.dask.org/en/latest/understanding-performance.html
In particular I recommend using the distributed scheduler locally and watching the Dask dashboard to get a sense of what is going on in your computation. See http://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
This is a webpage that you can visit that will tell you exactly what is going on in all of your processors.