I ran the following command:
$ spark-submit --master yarn --deploy-mode cluster pi.py
The log below is then printed continuously:
...
2021-12-23 06:07:50,158 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
2021-12-23 06:07:51,162 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
...
I checked the result through the web UI on port 8088 (the "Logs for container" page), but there was nothing in stdout.
I was disappointed and tried to force the Spark job to end, but suddenly new log lines were printed, as shown below:
...
2021-12-23 06:09:06,694 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
2021-12-23 06:09:06,695 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: master
ApplicationMaster RPC port: 40451
queue: default
start time: 1640239668020
final status: UNDEFINED
tracking URL: http://master2:8088/proxy/application_1640239254568_0002/
user: root
2021-12-23 06:09:07,707 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
...
After some time, the following error log appeared:
...
2021-12-23 06:10:25,003 INFO retry.RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "master/172.17.0.2"; destination host is: "master2":8032; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm2. Trying to failover immediately.
2021-12-23 06:10:25,003 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2021-12-23 06:10:25,004 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From master/172.17.0.2 to master:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm1 after 1 failover attempts. Trying to failover after sleeping for 18340ms.
2021-12-23 06:10:43,347 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
...
My understanding is that Spark released its ResourceManager allocation once the work was done, so it is normal for the error log above to appear.
Q1. Is the above job normal?
Q2. After the job finishes, where can I check the results? Can I see them in the "container logs" web UI?
Update: I re-ran the command and the final status was SUCCEEDED. Why does the spark-submit operation sometimes succeed and sometimes stop in the middle?
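For reference, the state transitions in the client log above can be pulled out with a short script, and the application ID can be reused to build a `yarn logs` command. In cluster mode the driver's stdout ends up in the ApplicationMaster container log, which is what that command fetches once the application has finished. This is only a sketch: `parse_states` and `yarn_logs_command` are hypothetical helper names, and it assumes log aggregation is enabled on the cluster.

```python
import re

# Matches client log lines such as:
# "... INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)"
REPORT_RE = re.compile(r"Application report for (application_\d+_\d+) \(state: (\w+)\)")


def parse_states(log_text):
    """Return (application_id, states), with consecutive duplicate states collapsed."""
    app_id = None
    states = []
    for match in REPORT_RE.finditer(log_text):
        app_id = match.group(1)
        state = match.group(2)
        if not states or states[-1] != state:
            states.append(state)
    return app_id, states


def yarn_logs_command(app_id):
    """Build the argument list that fetches the aggregated logs of one application."""
    return ["yarn", "logs", "-applicationId", app_id]


# Three lines copied from the log in the question.
sample = """\
2021-12-23 06:07:50,158 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
2021-12-23 06:07:51,162 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
2021-12-23 06:09:06,694 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
"""

app_id, states = parse_states(sample)
command = yarn_logs_command(app_id)
print(app_id, states)
print(" ".join(command))
```

The resulting argument list can be passed to `subprocess.run` on a node that has the `yarn` client installed.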
After installing MySQL and working with it for almost a month, I now get the output below every time I execute a query from a Python script.
Any help on how to handle this?
(MainThread) # _do_auth(): user: root
(MainThread) # _do_auth(): self._auth_plugin:
(MainThread) new_auth_plugin: caching_sha2_password
(MainThread) # request: b'\xa5\x0c/\x14v\t\x86O\xa8\x84\xc7\x93\x8c8\x1c\xa9\x8b#\xaf\xa1' size: 20
(MainThread) # server response packet: bytearray(b'\x07\x00\x00\x05\x00\x00\x00\x02\x00\x00\x00')
I have set up Dask on my MapR cluster's edge node following the directions here: https://gateway.dask.org/install-hadoop.html
Per those directions, I'm testing the install by running the following in a JupyterHub-spawned IPython notebook:
from dask_gateway import Gateway
gateway = Gateway("http://sa1x-hadoopedg-np1.hchc.local:9010")
cluster = gateway.new_cluster()
However, when it tries to start the new cluster via YARN, I get the following error in the YARN application's log:
Diagnostics: User a059571(user id 1425180742) does not have access to maprfs:///user/a059571/.skein/application_1605411890003_0222/809B8EAF0CC3524F90366F449C11C97E/tmpv8cbv2ag
Even though dask is supposed to be running as the requesting user (in this case a059571), it appears to be creating directories as the user running the dask-gateway-server (in this case the user mapr):
hdfs dfs -ls -d maprfs:///user/a059571/.skein/application_1605411890003_0222
drwx------ - mapr mapr 7 2021-01-19 17:37 maprfs:///user/a059571/.skein/application_1605411890003_0222
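To make the mismatch explicit, the `ls` line above can be parsed and its owner compared with the requesting user. This is just a sketch: `parse_ls_line` and `owner_mismatch` are hypothetical helpers, and the field layout is assumed from the single output line shown here.

```python
def parse_ls_line(line):
    """Split one `hdfs dfs -ls` output line into named fields."""
    # Assumed layout: permissions, replication, owner, group, size, date, time, path
    parts = line.split(None, 7)
    if len(parts) != 8:
        raise ValueError(f"unexpected ls line: {line!r}")
    perms, _replication, owner, group, _size, _date, _time, path = parts
    return {"permissions": perms, "owner": owner, "group": group, "path": path}


def owner_mismatch(line, expected_user):
    """Return True when the directory is owned by someone other than expected_user."""
    return parse_ls_line(line)["owner"] != expected_user


# The line printed by the command above.
ls_line = (
    "drwx------ - mapr mapr 7 2021-01-19 17:37 "
    "maprfs:///user/a059571/.skein/application_1605411890003_0222"
)

info = parse_ls_line(ls_line)
print(info["owner"], info["permissions"], owner_mismatch(ls_line, "a059571"))
```

With `drwx------` and owner `mapr`, only `mapr` can read the directory, which matches the "does not have access" diagnostic for user a059571.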
I feel like I'm missing something obvious.
Here are my configs, for full disclosure:
/etc/dask-gateway/dask_gateway_config.py
c.DaskGateway.backend_class = (
"dask_gateway_server.backends.yarn.YarnBackend"
)
c.DaskGateway.address= '12.190.113.133:9010'
c.Proxy.address = '12.190.113.133:9011'
c.Proxy.tcp_address = '12.190.113.133:9012'
c.YarnClusterConfig.scheduler_cmd = "/opt/anaconda3/bin/dask-scheduler"
c.YarnClusterConfig.worker_cmd = "/opt/anaconda3/bin/dask-worker"
c.YarnClusterConfig.queue = 'root.default'
c.DaskGateway.log_level= 'DEBUG'
Snippet from inside my core-site.xml:
<property>
<name>hadoop.proxyuser.mapr.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapr.groups</name>
<value>*</value>
</property>
And, some interesting lines from the dask-gateway-server logs:
[DaskGateway] - HTTP routes listening at http://12.190.113.133:9011
[DaskGateway] - Scheduler routes listening at gateway://12.190.113.133:9012
[Proxy] Unexpected failure fetching routing table, retrying in 0.5s: Get http://12.190.113.133:9010/api/v1/routes: dial tcp 12.190.113.133:9010: connect: connection refused
[DaskGateway] Removed 0 expired clusters from the database
[Proxy] Unexpected failure fetching routing table, retrying in 1.0s: Get http://12.190.113.133:9010/api/v1/routes: dial tcp 12.190.113.133:9010: connect: connection refused
[Proxy] Unexpected failure fetching routing table, retrying in 2.0s: Get http://12.190.113.133:9010/api/v1/routes: dial tcp 12.190.113.133:9010: connect: connection refused
[Proxy] Unexpected failure fetching routing table, retrying in 4.0s: Get http://12.190.113.133:9010/api/v1/routes: dial tcp 12.190.113.133:9010: connect: connection refused
INFO skein.Driver: Driver started, listening on 44262
[DaskGateway] Backend started, clusters will contact api server at http://12.190.113.133:9011/api
[DaskGateway] Dask-Gateway server started
[DaskGateway] - Private API server listening at http://12.190.113.133:9010
Note: sa1x-hadoopedg-np1.hchc.local == 12.190.113.133, an RHEL 7.x server. MapR cluster is 6.x.
I run a Python script in Azkaban.
Environment:
CentOS 8.1
azkaban 3.90.0
Python 3.6.8
ChromeDriver 84.0.4147.30
In the test.flow file:
nodes:
- name: job_test
type: command
config:
command: python3 /home/azkaban/python_codes/pyib/activity/pickgoods.py
About twenty minutes after I execute this flow, the system becomes very slow and the execution fails.
28-07-2020 18:30:40 CST job_test INFO - Process with id 1403 completed unsuccessfully in 1727 seconds.
28-07-2020 18:30:40 CST job_test ERROR - Job run failed!
java.lang.RuntimeException: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:312)
at azkaban.execapp.JobRunner.runJob(JobRunner.java:830)
at azkaban.execapp.JobRunner.doRun(JobRunner.java:607)
at azkaban.execapp.JobRunner.run(JobRunner.java:568)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
at azkaban.jobExecutor.utils.process.AzkabanProcess.run(AzkabanProcess.java:125)
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:304)
... 8 more
28-07-2020 18:30:40 CST job_test ERROR - azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1 cause: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
28-07-2020 18:30:40 CST job_test INFO - Finishing job job_test at 1595932240480 with status FAILED
The azkaban-webserver.log under azkaban-web-server shows:
2020/07/28 21:00:34.127 +0800 INFO [ExecutorManager] [AzkabanWebServer-QueueProcessor-Thread] [Azkaban] Successfully refreshed executor: iZbp1hb3esnbp3levrcg05Z:36037 (id: 16), active=true with executor info : ExecutorInfo{remainingMemoryPercent=45.705342424456234, remainingMemoryInMB=835, remainingFlowCapacity=30, numberOfAssignedFlows=0, lastDispatchedTime=1595936723440, cpuUsage=0.01}
2020/07/28 21:00:34.128 +0800 ERROR [ExecutorManager] [AzkabanWebServer-QueueProcessor-Thread] [Azkaban] Failed to update ExecutorInfo for executor : iZbp1hb3esnbp3levrcg05Z:44085 (id: 17), active=true
java.util.concurrent.ExecutionException: org.apache.http.conn.HttpHostConnectException: Connect to iZbp1hb3esnbp3levrcg05Z:44085 [iZbp1hb3esnbp3levrcg05Z/172.16.184.105] failed: Connection refused (Connection refused)
Can anyone help me resolve it?
Your job process crashed: the command exited with code 1. You can find its error log in the web UI for further debugging; see https://azkaban.readthedocs.io/en/latest/useAzkaban.html#job-logs
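If the job log turns out to be thin, one option is to make the script report its own failure explicitly. The sketch below wraps the entry point so that any exception is written to stderr with a full traceback before the process returns a non-zero code. `run_and_report` and `failing_main` are hypothetical names, with `failing_main` standing in for the real body of pickgoods.py.

```python
import sys
import traceback


def run_and_report(main):
    """Run main(); on any exception print the traceback to stderr and return 1."""
    try:
        main()
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return 1
    return 0


def failing_main():
    # Hypothetical stand-in for the real script body.
    raise RuntimeError("example failure")


def passing_main():
    # Hypothetical stand-in for a run that completes normally.
    pass


exit_code = run_and_report(failing_main)
print("exit code:", exit_code)
# In a real script the last line would be: sys.exit(run_and_report(main))
```

An uncaught exception already gives exit code 1 and a traceback on stderr, so the wrapper mainly provides one place to add extra context (timestamps, memory usage, and so on) before the process exits.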
I'm running an analysis on AWS EMR, and I am getting an unexpected SIGTERM error.
Some background:
I'm running a script that reads in many csv files I have stored on S3, and then performs an analysis. My script is schematically:
analysis_script.py
import pandas as pd
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.types import *
from pyspark import SparkContext
import boto3
#Spark context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.csv("s3n://csv_files/*", header = True)
def analysis(df):
    # Do a bunch of stuff and create the output DataFrame
    return df_output
df_output = analysis(df)
I launch the cluster using:
aws emr create-cluster \
  --release-label emr-5.5.0 \
  --name "Analysis" \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Ganglia \
  --ec2-attributes KeyName=EMRB,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=r3.xlarge \
  --region us-west-2 \
  --log-uri s3://emr-logs/ \
  --bootstrap-actions Name="Install Python Packages",Path="s3://emr-bootstraps/install_python_packages_custom.bash",Args=["numpy pandas boto3 tqdm"] \
  --auto-terminate
I can see from the logs that reading the CSV files goes fine, but the job then finishes with errors. The following lines are in the stderr file:
18/07/16 12:02:26 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
18/07/16 12:02:26 ERROR ApplicationMaster: User application exited with status 143
18/07/16 12:02:26 INFO ApplicationMaster: Final app status: FAILED, exitCode: 143, (reason: User application exited with status 143)
18/07/16 12:02:26 INFO SparkContext: Invoking stop() from shutdown hook
18/07/16 12:02:26 INFO SparkUI: Stopped Spark web UI at http://172.31.36.42:36169
18/07/16 12:02:26 INFO TaskSetManager: Starting task 908.0 in stage 1494.0 (TID 88112, ip-172-31-35-59.us-west-2.compute.internal, executor 27, partition 908, RACK_LOCAL, 7278 bytes)
18/07/16 12:02:26 INFO TaskSetManager: Finished task 874.0 in stage 1494.0 (TID 88078) in 16482 ms on ip-172-31-35-59.us-west-2.compute.internal (executor 27) (879/4805)
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-36-42.us-west-2.compute.internal:34133 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(20, ip-172-31-36-42.us-west-2.compute.internal, 34133, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-47-55.us-west-2.compute.internal:45758 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(16, ip-172-31-47-55.us-west-2.compute.internal, 45758, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO DAGScheduler: Job 1494 failed: toPandas at analysis_script.py:267, took 479.895614 s
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1494 (toPandas at analysis_script.py:267) failed in 478.993 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionEnd(0,1531742546839)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#28e5b10c)
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1495 (toPandas at analysis_script.py:267) failed in 479.270 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#6b68c419)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(1494,1531742546841,JobFailed(org.apache.spark.SparkException: Job 1494 cancelled because SparkContext was shut down))
18/07/16 12:02:26 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/07/16 12:02:26 INFO YarnClusterSchedulerBackend: Shutting down all executors
18/07/16 12:02:26 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/16 12:02:26 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices(serviceOption=None, services=List(),started=false)
18/07/16 12:02:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
I can't find much useful information about exit code 143. Does anybody know why this error is occurring? Thanks.
Spark passes through exit codes when they're over 128, which is often the case with JVM errors. Exit code 143 signifies that the JVM received a SIGTERM (143 = 128 + 15, and 15 is the signal number of SIGTERM), essentially a Unix kill signal (see this post for more exit codes and an explanation). Other details about Spark exit codes can be found in this question.
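As a quick illustration of that convention, here is a small sketch that maps an exit code above 128 back to a signal name using Python's standard `signal` module. `describe_exit_code` is a hypothetical helper, not something Spark or EMR provides.

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit code; values above 128 conventionally mean 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a signal known on this platform.
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(143))
print(describe_exit_code(1))
```

The same arithmetic explains the other codes that show up in YARN container logs, since each one is 128 plus the number of the signal that ended the process.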
Since you didn't terminate this yourself, I'd start by suspecting that something external did. Given that almost exactly 8 minutes elapse between job start and the SIGTERM being issued, it seems likely that EMR itself is enforcing a maximum job run time or cluster age. Try checking through your EMR settings to see if any such timeout is set. There was one in my case (on AWS Glue, but the same concept applies).
I've got a remote server running Nginx -> gunicorn -> django. When I hit a view that causes an exception, I would expect a 500 server error page to be returned. Instead, it hangs for ~10 seconds and I get a 502 bad gateway.
When I look in the gunicorn logs, they indicate a worker timed out and was killed. No exceptions are logged, and no admin emails are sent. The gunicorn logs:
[2016-02-16 16:47:30 -0600] [5809] [CRITICAL] WORKER TIMEOUT (pid:5817)
[2016-02-16 22:47:30 +0000] [5817] [INFO] Worker exiting (pid: 5817)
[2016-02-16 16:47:30 -0600] [5833] [INFO] Booting worker with pid: 5833
On my local machine, everything works as expected. Both run an identical settings.py (DEBUG is False). I reduced it to this test case:
def foo(request):
    raise Exception('bar')
Browsing to it locally, it immediately returns the 500 server error page, as well as firing off admin emails. On the remote server, the browser spins for a while then nginx returns the bad gateway response. No emails are sent, no exceptions are logged.
Regular pages return immediately with the responses I expect. It appears to exhibit the bad behavior only if an exception is thrown.
What might cause such behavior?
I figured it out. The firewall wasn't allowing outbound SMTP connections. Django hung trying to send the email.
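For anyone hitting the same thing, a blocked outbound port is easy to confirm from the server with a connection attempt that has a short timeout. The sketch below is a generic TCP probe. `can_connect` is a hypothetical helper, and the host and port in the usage comment are placeholders for your own mail server.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False


# Example usage (placeholder host and port, replace with your SMTP server):
# print(can_connect("smtp.example.com", 587))
```

If the probe returns False only after waiting out the full timeout, packets are most likely being dropped by a firewall. Setting Django's `EMAIL_TIMEOUT` to a few seconds should also keep a stuck SMTP connection from holding the worker until gunicorn kills it.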
First, I would have increased the nginx timeouts:
proxy_connect_timeout 300s;
proxy_read_timeout 300s;
and the gunicorn timeout:
--timeout 180
Maybe that would give the exception enough time to show up in the log files.