Spark Job getting stuck - python

I am trying to union 5 data frames in my code (a simple union, no joins). The output data frame contains around 95k records. The cluster gets stuck: the job neither runs to completion nor fails. It is a 4.4xlarge cluster with 40 nodes.
Spark configs -
--num-executors 120
--executor-cores 5
--executor-memory 38g
--driver-memory 35g
--conf spark.sql.shuffle.partitions=1400
When I run spark-submit manually, it throws this error:
AsyncEventQueue: Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
I also get a lot of IndexOutOfBounds exceptions. The important thing to note is that there is no fixed dataframe at which this error/exception starts occurring. Can someone help with the possible reasons?
An example exception is pasted below:
javax.servlet.ServletException: java.lang.IndexOutOfBoundsException: 4
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:539)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
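For what it's worth, --shuffle.partitions=1400 is not a valid spark-submit flag on its own; it is a SQL setting and has to go through --conf. A sketch of a corrected invocation (the script name and the listener-queue capacity value are assumptions; spark.scheduler.listenerbus.eventqueue.capacity exists in Spark 2.3+ and enlarges the queue that the AsyncEventQueue warning complains about):

```shell
spark-submit \
  --num-executors 120 \
  --executor-cores 5 \
  --executor-memory 38g \
  --driver-memory 35g \
  --conf spark.sql.shuffle.partitions=1400 \
  --conf spark.scheduler.listenerbus.eventqueue.capacity=20000 \
  my_union_job.py
```

That said, 1400 shuffle partitions for ~95k records means a flood of tiny tasks, and every task generates scheduler events; fewer, larger partitions may make the dropped-event warning disappear on its own.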

Related

Where can I find a detailed description of Snowflake error 255005?

The answer I'm looking for is a reference to documentation. I'm still debugging, but can't find a reference for the error code.
The full error message is:
snowflake.connector.errors.OperationalError: 255005: Failed to read next arrow batch: b'Array length did not match record batch length'
Some background, if it helps:
The error is in response to the call to fetchall as shown here (python):
cs.execute(f'SELECT {specific_column} FROM {table};')
all_starts = cs.fetchall()
The code context: when the script runs from cron (i.e., as a timed job), it connects successfully, then loops over a list of tables; on the third pass through the loop, the error occurs (i.e., two tables succeed). When the same script is run at other times (via the command line, not cron), there is no error (all tables succeed).
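One way to narrow this down is to avoid fetchall() and stream the result instead; fetchmany() is standard DB-API, so the Snowflake cursor supports it. A minimal sketch (the helper name is mine):

```python
def fetch_in_batches(cursor, batch_size=10_000):
    """Stream rows from a DB-API cursor with fetchmany() instead of
    materializing the whole result set with fetchall()."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows
```

This only bounds memory; if one arrow batch is genuinely corrupt, the loop will still fail when it reaches it. A workaround sometimes reported is forcing the session result format back to JSON with ALTER SESSION SET PYTHON_CONNECTOR_QUERY_RESULT_FORMAT = 'JSON' before the query; verify that parameter against your connector version before relying on it.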

how to efficiently join and groupBy pyspark dataframe?

I am working with PySpark dataframes in Python. I have approximately 70 GB of JSON files, which are basically emails. The aim is to perform TF-IDF on the email bodies. First, I converted the record-oriented JSON and put it on HDFS. For the TF-IDF implementation, I did some data cleaning using Spark NLP, followed by computing tf, idf, and tf-idf, which involves several groupBy() and join() operations. The implementation works just fine with a small sample dataset, but when I run it with the entire dataset I get the following error:
Py4JJavaError: An error occurred while calling o506.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 107 in stage 7.0 failed 4 times, most recent failure: Lost task 107.3 in stage 7.0 : ExecutorLostFailure (executor 72 exited caused by one of the running tasks) Reason: Container from a bad node: container_1616585444828_0075_01_000082 on host: Exit status: 143. Diagnostics: [2021-04-08 01:02:20.700]Container killed on request. Exit code is 143
[2021-04-08 01:02:20.701]Container exited with a non-zero exit code 143.
[2021-04-08 01:02:20.702]Killed by external signal
sample code:
from pyspark.sql.functions import explode, length, count
df_1 = df.select(df.id, explode(df.final).alias("words"))
df_1 = df_1.filter(length(df_1.words) > 3)
df_2 = df_1.groupBy("id", "words").agg(count("id").alias("count_id"))
I get the error on the third step, i.e., the groupBy()/agg() line.
Things I have tried:
spark.conf.set("spark.sql.broadcastTimeout", "36000")
df_1.coalesce(20) before groupBy()
checked for null values -> no null values in any df
Nothing has worked so far. Since I am very new to PySpark, I would really appreciate some help on how to make the implementation more efficient and faster.
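Exit code 143 usually means the container was killed, most often for memory; note that coalesce(20) before the groupBy concentrates the data into only 20 tasks, which makes that more likely, not less. If a handful of hot (id, words) keys dominate, salting the key spreads one hot group across several tasks. A minimal pure-Python illustration of that two-phase idea (the same pattern applies to the PySpark groupBy; SALT_BUCKETS is a made-up tuning knob):

```python
import random
from collections import Counter

SALT_BUCKETS = 4  # hypothetical; tune to the observed skew

def salted_count(keys):
    """Two-phase count illustrating salting for a skewed groupBy:
    phase 1 aggregates on (key, salt) so one hot key is split across
    SALT_BUCKETS partial groups; phase 2 merges the partials per key."""
    partial = Counter()
    for key in keys:
        partial[(key, random.randrange(SALT_BUCKETS))] += 1
    merged = Counter()
    for (key, _salt), n in partial.items():
        merged[key] += n
    return dict(merged)

print(salted_count(["a", "a", "b", "a"]))  # counts: a -> 3, b -> 1
```

In PySpark the equivalent is grouping by the key plus a random salt column first, then grouping again by the key alone to merge the partial counts.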

Celery Redis instance filling up despite queue looking empty

We have a Django app that needs to fetch lots of data using Celery. There are 20 or so celery workers running every few minutes. We're running on Google Kubernetes Engine with a Redis queue using Cloud memorystore.
The Redis instance we're using for celery is filling up, even when the queue is empty according to Flower. This results in the Redis DB eventually being full and Celery throwing errors.
In Flower I see tasks coming in and out, and I have increased workers to the point where the queue is always empty now.
If I run redis-cli --bigkeys I see:
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest set found so far '_kombu.binding.my-queue-name-queue' with 1 members
[00.00%] Biggest list found so far 'default' with 611 items
[00.00%] Biggest list found so far 'my-other-queue-name-queue' with 44705 items
[00.00%] Biggest set found so far '_kombu.binding.celery.pidbox' with 19 members
[00.00%] Biggest list found so far 'my-queue-name-queue' with 727179 items
[00.00%] Biggest set found so far '_kombu.binding.celeryev' with 22 members
-------- summary -------
Sampled 12 keys in the keyspace!
Total key length in bytes is 271 (avg len 22.58)
Biggest list found 'my-queue-name-queue' has 727179 items
Biggest set found '_kombu.binding.celeryev' has 22 members
4 lists with 816144 items (33.33% of keys, avg size 204036.00)
0 hashs with 0 fields (00.00% of keys, avg size 0.00)
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
0 streams with 0 entries (00.00% of keys, avg size 0.00)
8 sets with 47 members (66.67% of keys, avg size 5.88)
0 zsets with 0 members (00.00% of keys, avg size 0.00)
If I inspect the queue using LRANGE I see lots of objects like this:
"{\"body\": \"W1syNDQ0NF0sIHsicmVmZXJlbmNlX3RpbWUiOiBudWxsLCAibGF0ZXN0X3RpbWUiOiBudWxsLCAicm9sbGluZyI6IGZhbHNlLCAidGltZWZyYW1lIjogIjFkIiwgIl9udW1fcmV0cmllcyI6IDF9LCB7ImNhbGxiYWNrcyI6IG51bGwsICJlcnJiYWNrcyI6IG51bGwsICJjaGFpbiI6IG51bGwsICJjaG9yZCI6IG51bGx9XQ==\", \"content-encoding\": \"utf-8\", \"content-type\": \"application/json\", \"headers\": {\"lang\": \"py\", \"task\": \"MyDataCollectorClass\", \"id\": \"646910fc-f9db-48c3-b5a9-13febbc00bde\", \"shadow\": null, \"eta\": \"2019-08-20T02:31:05.113875+00:00\", \"expires\": null, \"group\": null, \"retries\": 0, \"timelimit\": [null, null], \"root_id\": \"beeff557-66be-451d-9c0c-dc622ca94493\", \"parent_id\": \"374d8e3e-92b5-423e-be58-e043999a1722\", \"argsrepr\": \"(24444,)\", \"kwargsrepr\": \"{'reference_time': None, 'latest_time': None, 'rolling': False, 'timeframe': '1d', '_num_retries': 1}\", \"origin\": \"gen1#celery-my-queue-name-worker-6595bd8fd8-8vgzq\"}, \"properties\": {\"correlation_id\": \"646910fc-f9db-48c3-b5a9-13febbc00bde\", \"reply_to\": \"e55a31ed-cbba-3d79-9ffc-c19a29e77aac\", \"delivery_mode\": 2, \"delivery_info\": {\"exchange\": \"\", \"routing_key\": \"my-queue-name-queue\"}, \"priority\": 0, \"body_encoding\": \"base64\", \"delivery_tag\": \"a83074a5-8787-49e3-bb7d-a0e69ba7f599\"}}"
We're using django-celery-results to store results, so these shouldn't be going in there, and we're using a separate Redis instance for Django's cache.
If I clear Redis with a FLUSHALL it slowly fills up again.
I'm kind of stumped at where to go next. I don't know Redis well - maybe I can do something to inspect the data to see what's filling this? Maybe it's Flower not reporting properly? Maybe Celery keeps completed tasks for a bit despite us using the Django DB for results?
Thanks loads for any help.
It sounds like Redis is not set up to delete completed items or to report and delete failed items; i.e., it may be putting tasks on the list but never taking them off.
Check out pypi packages: rq, django-rq, django-rq-scheduler
You can read here a little bit about how this should work: https://python-rq.org/docs/
This seems to be a known (or intentional) issue with Celery, with various solutions/workarounds proposed:
https://github.com/celery/celery/issues/436
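If part of the growth comes from stored results or the celeryev event stream, a few settings are worth checking. A sketch, not a drop-in fix; the names below are the Celery 4.x lowercase settings, so verify them against your version (and note that disabling events also blinds Flower):

```python
# celeryconfig.py (sketch)
result_expires = 3600            # expire stored task results after an hour
worker_send_task_events = False  # stop publishing worker events consumed by Flower/celeryev
task_send_sent_event = False     # likewise for task-sent events
event_queue_ttl = 5              # seconds before undelivered event messages expire in Redis
```

Separately, the biggest key in the output above is a plain task list with 727k entries, which can simply mean tasks are being published to a queue that no worker is consuming; checking that the workers' -Q queue names match the task routing keys is a cheap first step.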

Spark MLlib - trainImplicit warning

I keep seeing these warnings when using trainImplicit:
WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB).
The maximum recommended task size is 100 KB.
And then the task size starts to increase. I tried calling repartition on the input RDD, but the warnings are the same.
All these warnings come from ALS iterations, from flatMap and also from aggregate; for instance, here is the origin of a stage where flatMap shows these warnings (with Spark 1.3.0, but they are also shown in Spark 1.3.1):
org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1065)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:530)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
and from aggregate:
org.apache.spark.rdd.RDD.aggregate(RDD.scala:968)
org.apache.spark.ml.recommendation.ALS$.computeYtY(ALS.scala:1112)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1064)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:538)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
Similar problem was described in Apache Spark mail lists - http://apache-spark-user-list.1001560.n3.nabble.com/Large-Task-Size-td9539.html
I think you can try playing with the number of partitions (using the repartition() method), depending on how many hosts, how much RAM, and how many CPUs you have.
Also try to investigate all steps via the Web UI, where you can see the number of stages, memory usage per stage, and data locality.
Or simply ignore these warnings, as long as everything works correctly and fast enough.
This notification is hard-coded in Spark (scheduler/TaskSetManager.scala):
if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
    !emittedTaskSizeWarning) {
  emittedTaskSizeWarning = true
  logWarning(s"Stage ${task.stageId} contains a task of very large size " +
    s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
    s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
}

private[spark] object TaskSetManager {
  // The user will be warned if any stages contain a task that has a serialized size greater than
  // this.
  val TASK_SIZE_TO_WARN_KB = 100
}
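The check only reports the serialized size; with ALS the usual cause of a large task is data captured in the task closure. A rough pure-Python illustration, with pickle standing in for Spark's closure serializer and a hypothetical lookup dict as the captured object (in Spark, moving such objects into a broadcast variable keeps them out of every task):

```python
import pickle

lookup = {i: i * i for i in range(50_000)}  # hypothetical large driver-side object

def task(x):
    # The closure captures `lookup`, so Spark would ship it with every task.
    return lookup[x]

# A single captured dict like this already dwarfs TASK_SIZE_TO_WARN_KB (100 KB):
print(len(pickle.dumps(lookup)) > 100 * 1024)  # True
```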

sample map reduce script in python for hive produces exception

I am learning Hive. I have set up a table named records with the following schema:
year : string
temperature : int
quality : int
Here are sample rows
1999 28 3
2000 28 3
2001 30 2
Now I wrote a sample map-reduce script in Python, exactly as specified in the book Hadoop: The Definitive Guide:
import re
import sys

for line in sys.stdin:
    (year, tmp, q) = line.strip().split()
    if (tmp != '9999' and re.match("[01459]", q)):
        print "%s\t%s" % (year, tmp)
I run this using the following command:
ADD FILE /usr/local/hadoop/programs/sample_mapreduce.py;
SELECT TRANSFORM(year, temperature, quality)
USING 'sample_mapreduce.py'
AS year,temperature;
Execution fails. On the terminal I get this:
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-08-23 18:30:28,506 Stage-1 map = 0%, reduce = 0%
2012-08-23 18:30:59,647 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201208231754_0005 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201208231754_0005_m_000002 (and more) from job job_201208231754_0005
Exception in thread "Thread-103" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://master:50060/tasklog?taskid=attempt_201208231754_0005_m_000000_2&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
... 3 more
I go to failed job list and this is the stack trace
java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hit error while closing ..
at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:452)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
... 8 more
The same trace is repeated 3 more times.
Please, can someone help me with this? What is wrong here? I am following the book exactly. There seem to be two errors: on the terminal it says it can't read from the task log URL, while the exception in the failed job list says something different. Please help.
I went to the stderr log from the Hadoop admin interface and saw that there was a syntax error from Python. Then I found that when I created the Hive table, the field delimiter was a tab, and I hadn't specified it in split(). So I changed it to split('\t') and it worked alright!
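The fix described above, restructured as a function for testability (Python 3 print; the filtering logic is unchanged from the book's script):

```python
import re

def mapper(lines):
    """Emit 'year<TAB>temperature' for valid readings. Fields are split
    on '\t', matching the Hive table's tab field delimiter."""
    for line in lines:
        (year, tmp, q) = line.strip().split('\t')
        if tmp != '9999' and re.match("[01459]", q):
            yield "%s\t%s" % (year, tmp)

# In the actual TRANSFORM script, wire it to stdin:
#   import sys
#   for record in mapper(sys.stdin):
#       print(record)
```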
Just use 'describe formatted <table name>' and near the bottom of the output you'll find 'Storage Desc Params:', which describes any delimiters used.
