I am working with a PySpark DataFrame. I have approximately 70 GB of JSON files, which are essentially emails. The aim is to perform TF-IDF on the email bodies. First, I loaded the record-oriented JSON into HDFS. For the TF-IDF implementation, I did some data cleaning using Spark NLP, followed by computing tf, idf, and tf-idf, which involves several groupBy() and join() operations. The implementation works just fine on a small sample dataset, but when I run it on the entire dataset I get the following error:
Py4JJavaError: An error occurred while calling o506.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 107 in stage 7.0 failed 4 times, most recent failure: Lost task 107.3 in stage 7.0 : ExecutorLostFailure (executor 72 exited caused by one of the running tasks) Reason: Container from a bad node: container_1616585444828_0075_01_000082 on host: Exit status: 143. Diagnostics: [2021-04-08 01:02:20.700]Container killed on request. Exit code is 143
[2021-04-08 01:02:20.701]Container exited with a non-zero exit code 143.
[2021-04-08 01:02:20.702]Killed by external signal
Sample code:
from pyspark.sql.functions import explode, length, count

df_1 = df.select(df.id, explode(df.final).alias("words"))
df_1 = df_1.filter(length(df_1.words) > 3)
df_2 = df_1.groupBy("id", "words").agg(count("id").alias("count_id"))
I get the error on the third step, i.e., the groupBy() with the count aggregation.
Things I have tried:
set("spark.sql.broadcastTimeout", "36000")
df_1.coalesce(20) before the groupBy()
checked for null values: there are none in any DataFrame
Nothing has worked so far. Since I am very new to PySpark, I would really appreciate some help on making the implementation more efficient and faster.
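For context, exit code 143 means YARN sent the container a SIGTERM, most often because an executor exceeded its memory allocation during the shuffle. Note also that coalesce(20) reduces parallelism and concentrates data into fewer, larger tasks, which tends to make this worse; repartitioning to more partitions is usually the better direction. A sketch of settings worth experimenting with for a wide groupBy like this (the setting names are real Spark configuration keys, but the values and the script name are placeholders, not recommendations):

```shell
spark-submit \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.executor.memory=16g \
  --conf spark.executor.memoryOverhead=4g \
  tfidf_job.py
```

Increasing spark.sql.shuffle.partitions shrinks each shuffle task's footprint, while spark.executor.memoryOverhead gives YARN headroom for off-heap allocations that the executor heap setting does not cover.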
The answer I'm looking for is a reference to the documentation. I'm still debugging, but I can't find a reference for this error code.
The full error message is:
snowflake.connector.errors.OperationalError: 255005: Failed to read next arrow batch: b'Array length did not match record batch length'
Some background, if it helps:
The error is raised by the call to fetchall, as shown here (Python):
cs.execute(f'SELECT {specific_column} FROM {table};')
all_starts = cs.fetchall()
The context: when running from cron (i.e., a timed job), the connection succeeds, then, looping over a list of tables, the third pass raises this error (i.e., two tables are "successful"). When the same script is run at other times via the command line rather than cron, there's no error (all tables are "successful").
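While debugging, one way to narrow down whether the failure is tied to materializing the whole result set at once is to stream it with fetchmany instead of fetchall; that won't by itself explain the Arrow batch mismatch, but it localizes which batch fails and lowers peak memory. A sketch against the standard DB-API (demonstrated here with sqlite3 and a hypothetical table t; the same cursor calls work on a Snowflake cursor):

```python
import sqlite3

def fetch_in_chunks(cursor, size=1000):
    """Yield rows in fixed-size batches instead of materializing everything."""
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        yield from rows

# Demonstration with an in-memory sqlite3 database.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE t (x INTEGER)')
cur.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(2500)])
cur.execute('SELECT x FROM t ORDER BY x')
rows = list(fetch_in_chunks(cur, size=1000))
print(len(rows))  # 2500
```

If the error reproduces only at a particular chunk, that points at the server-side result batch for those rows rather than at the connection itself.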
To run linear regression in PySpark, I have to convert the feature columns of my data to dense vectors and then use them to fit the regression model, as shown below:
assembler = VectorAssembler(inputCols=feature_cols, outputCol="Features")
vtrain = assembler.transform(train).select('Features', y)
lin_reg = LinearRegression(solver='normal',
                           featuresCol='Features',
                           labelCol=y)
model = lin_reg.fit(vtrain)
This has been working for a while but just recently started giving me the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1059.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1059.0 (TID 1877) (10.139.64.10 executor 0): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'MMM dd, yyyy hh:mm:ss aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
This is confusing me because all of the columns in "train" are either integer or double; vtrain is just that same data in vectorized form, and there is no datetime parsing anywhere. I tried setting spark.sql.legacy.timeParserPolicy to LEGACY, but the same error occurred.
Does anyone know why this might be?
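For what it's worth, Spark 3's java.time-based formatter accepts the am/pm marker only as a single letter ('a'); 'aa' was a SimpleDateFormat spelling, which is why the migration guide link appears in the error. And because Spark evaluates lazily, the failing pattern may live in an upstream transformation that built train, not in the regression code, only surfacing when fit() forces execution. As a quick sanity check of the raw strings outside Spark, a rough Python equivalent of that pattern (the sample value is hypothetical):

```python
from datetime import datetime

# Approximate Python equivalent of the Java pattern 'MMM dd, yyyy hh:mm:ss a'
fmt = '%b %d, %Y %I:%M:%S %p'
parsed = datetime.strptime('Apr 08, 2021 01:02:20 PM', fmt)
print(parsed)  # 2021-04-08 13:02:20
```

If the raw values parse cleanly here, the fix on the Spark side is likely just changing 'aa' to 'a' wherever the pattern was defined.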
I am trying to union 5 DataFrames in my code (a simple union, no joins). The output DataFrame contains around 95k records. The cluster gets stuck, neither running nor failing. It is a 4.4x large cluster with 40 nodes.
Spark Configs -
--num-executors 120
--executor-cores 5
--executor-memory 38g
--driver-memory 35g
--shuffle.partitions=1400
When I run the spark-submit manually, it throws this error:
AsyncEventQueue: Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
I also get a lot of IndexOutOfBoundsException errors. The important thing to note is that there is no fixed DataFrame at which the error/exception starts occurring. Can someone help with the possible reasons?
The exceptions are pasted below:
javax.servlet.ServletException: java.lang.IndexOutOfBoundsException: 4
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166)
at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:539)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
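The AsyncEventQueue message indicates the driver's listener bus is dropping events because its internal queue filled up; with 120 executors generating task events, enlarging the queue is one thing worth trying. A sketch (spark.scheduler.listenerbus.eventqueue.capacity is a real Spark setting whose default is 10000; the value below is a placeholder, and the remaining submit arguments are elided):

```shell
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.capacity=40000 \
  ...
```

Dropped listener events mostly corrupt UI/metrics rather than the job itself, so this addresses the warning; the hang and the IndexOutOfBoundsException in the Jersey/Jetty stack (which serves the Spark UI) would still need separate investigation.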
I keep seeing these warnings when using trainImplicit:
WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB).
The maximum recommended task size is 100 KB.
And then the task size starts to increase. I tried calling repartition on the input RDD, but the warnings are the same.
All these warnings come from ALS iterations, from flatMap and also from aggregate. For instance, here is the origin of a stage where the flatMap shows these warnings (with Spark 1.3.0, but they are also shown in Spark 1.3.1):
org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1065)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:530)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
and from aggregate:
org.apache.spark.rdd.RDD.aggregate(RDD.scala:968)
org.apache.spark.ml.recommendation.ALS$.computeYtY(ALS.scala:1112)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1064)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:538)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
A similar problem was described on the Apache Spark user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Large-Task-Size-td9539.html
I think you can try to play with the number of partitions (using the repartition() method), depending on how many hosts, how much RAM, and how many CPUs you have.
Also try investigating each step via the Web UI, where you can see the number of stages, memory usage at each stage, and data locality.
Or just don't worry about these warnings as long as everything works correctly and fast.
This notification is hard-coded in Spark (scheduler/TaskSetManager.scala):
if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
    !emittedTaskSizeWarning) {
  emittedTaskSizeWarning = true
  logWarning(s"Stage ${task.stageId} contains a task of very large size " +
    s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
    s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
}
and, in the same file:
private[spark] object TaskSetManager {
  // The user will be warned if any stages contain a task that has a
  // serialized size greater than this.
  val TASK_SIZE_TO_WARN_KB = 100
}
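The threshold in that snippet is easy to trip from the Python side too: closures are pickled into every task, so a large object captured by a lambda or function ships with each task. A rough, self-contained illustration (plain pickle, outside Spark, with a hypothetical lookup table standing in for whatever the closure captures) of how quickly a captured object passes the 100 KB mark:

```python
import pickle

# A hypothetical lookup table captured by a task closure.
big_lookup = {i: i * i for i in range(20000)}

def task(x):
    # Referencing big_lookup forces it into the serialized closure.
    return big_lookup.get(x, 0)

payload_kb = len(pickle.dumps(big_lookup)) // 1024
print(payload_kb > 100)  # True: the captured dict alone exceeds the warning threshold
```

When this is the cause, broadcasting the large object (rather than capturing it) keeps per-task payloads small; repartitioning alone does not help, which matches the observation above that repartition left the warnings unchanged.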
I am learning Hive. I have set up a table named records, with the following schema:
year : string
temperature : int
quality : int
Here are sample rows
1999 28 3
2000 28 3
2001 30 2
Now I wrote a sample MapReduce script in Python, exactly as specified in the book Hadoop: The Definitive Guide:
import re
import sys

for line in sys.stdin:
    (year, tmp, q) = line.strip().split()
    if (tmp != '9999' and re.match("[01459]", q)):
        print "%s\t%s" % (year, tmp)
I run this using the following commands:
ADD FILE /usr/local/hadoop/programs/sample_mapreduce.py;
SELECT TRANSFORM(year, temperature, quality)
USING 'sample_mapreduce.py'
AS year,temperature;
Execution fails. On the terminal I get this:
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-08-23 18:30:28,506 Stage-1 map = 0%, reduce = 0%
2012-08-23 18:30:59,647 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201208231754_0005 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201208231754_0005_m_000002 (and more) from job job_201208231754_0005
Exception in thread "Thread-103" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://master:50060/tasklog?taskid=attempt_201208231754_0005_m_000000_2&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
... 3 more
I went to the failed job list, and this is the stack trace:
java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hit error while closing ..
at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:452)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
... 8 more
The same trace is repeated 3 more times.
Can someone please help me with this? What is wrong here? I am following the book exactly. There seem to be two errors: on the terminal it says that it can't read from the task log URL, while in the failed job list the exception says something different. Please help.
I went to the stderr log from the Hadoop admin interface and saw that there was an error from Python. Then I found that when I created the Hive table, the field delimiter was a tab, but I hadn't specified it in split(). So I changed it to split('\t') and it worked alright!
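Given that diagnosis, the corrected filter logic with the tab delimiter made explicit might look like this (restructured into a function so it can be exercised without Hive, and modernized to a Python 3 print call; in the real script the rows would come from sys.stdin as before):

```python
import re

def keep(line):
    """Return 'year<TAB>temperature' for rows passing the quality filter, else None."""
    year, tmp, q = line.strip().split('\t')  # the Hive table is tab-delimited
    if tmp != '9999' and re.match("[01459]", q):
        return "%s\t%s" % (year, tmp)
    return None

# A couple of illustrative rows stand in for sys.stdin here.
for row in ['1950\t22\t1\n', '1999\t9999\t1\n']:
    out = keep(row)
    if out is not None:
        print(out)
```

Note that with the [01459] quality filter, the sample rows above (quality 3 and 2) would be dropped, so an empty TRANSFORM result on that sample is expected, not a bug.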
Just use 'describe formatted <table_name>', and near the bottom of the output you'll find 'Storage Desc Params:', which describes any delimiters used.