I ran the following command:
$ spark-submit --master yarn --deploy-mode cluster pi.py
So, below log is continuous print:
...
2021-12-23 06:07:50,158 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
2021-12-23 06:07:51,162 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
...
and I check the result through my 8088(Logs for container web UI), but there is nothing in stdout.
I was disappointed and tried to force the park operation to end, but suddenly the new log is print like below:
...
2021-12-23 06:09:06,694 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
2021-12-23 06:09:06,695 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: master
ApplicationMaster RPC port: 40451
queue: default
start time: 1640239668020
final status: UNDEFINED
tracking URL: http://master2:8088/proxy/application_1640239254568_0002/
user: root
2021-12-23 06:09:07,707 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
...
And after some time, an error log occurred as shown below:
...
2021-12-23 06:10:25,003 INFO retry.RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "master/172.17.0.2"; destination host is: "master2":8032; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm2. Trying to failover immediately.
2021-12-23 06:10:25,003 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2021-12-23 06:10:25,004 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From master/172.17.0.2 to master:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm1 after 1 failover attempts. Trying to failover after sleeping for 18340ms.
2021-12-23 06:10:43,347 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
...
I understand that sparks have completed the resource manager allocation after work, so it is normal for the above error log to appear.
Q1. Is the above job normal?
Q2. After this work, where can I check the results? Can I check them on "containerlogs web UI"?
IMPORTANT!! ADD. I re-ran the command. and check the status: SUCCEEDED. Why does the park-submit operation sometimes succeed and sometimes stop in the middle?
I am new to spark and I tried to submit my first job.
The code is in python and i am running on windows 10.
On one console I deployed the master as:
spark-class org.apache.spark.deploy.master.Master
On another I deployed one slave:
spark-class org.apache.spark.deploy.worker.Worker spark://192.168.1.4:7077
All good so far.
When I want to deploy the spark job in python I run on another console the following:
spark-submit --master spark://192.168.1.4:7077 "C:\path\to\file\friends-by-age.py"
The job is executed correctly because it prints all the results well, but then an error is thrown:
19/07/28 21:51:52 INFO SparkContext: Invoking stop() from shutdown hook
19/07/28 21:51:52 INFO SparkUI: Stopped Spark web UI at http://192.168.1.4:4040
19/07/28 21:51:52 INFO StandaloneSchedulerBackend: Shutting down all executors
19/07/28 21:51:52 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
19/07/28 21:51:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/28 21:51:52 WARN TransportChannelHandler: Exception in connection from /192.168.1.4:59096
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Unknown Source)
19/07/28 21:51:52 INFO MemoryStore: MemoryStore cleared
19/07/28 21:51:52 INFO BlockManager: BlockManager stopped
19/07/28 21:51:52 INFO BlockManagerMaster: BlockManagerMaster stopped
19/07/28 21:51:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/07/28 21:51:52 INFO SparkContext: Successfully stopped SparkContext
This is the friends-by-age.py code:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("FriendsByAge")
sc = SparkContext(conf = conf)
def parseLine(line):
fields = line.split(',')
age = int(fields[2])
numFriends = int(fields[3])
return (age, numFriends)
lines = sc.textFile("file:///C:/path/to/file/fakefriends.csv")
rdd = lines.map(parseLine)
totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])
averagesByAge = averagesByAge.sortByKey()
results = averagesByAge.collect()
for result in results:
print(result)
Also in the spark ui
I did not deploy any other worker.
Any ideas?
I'm using MongoDB version 3.4.7, Spark version 1.6.3 and MongoDB-Spark connector version 1.1.0.
I have a pyspark script that extract data from MongoDB collection to create a dataframe.
I noticed that my spark-submit fails after the connection get closed. (see the log out below).
19/02/18 23:47:25 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, node1.dev.qwerty.asdf.io, partition 0,ASDF_LOCAL, 2476 bytes)
19/02/18 23:47:25 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on node1.dev.qwerty.asdf.io:45779 (size: 2.7 KB, free: 2.7 GB)
19/02/18 23:47:25 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node1.dev.qwerty.asdf.io:45779 (size: 497.0 B, free: 2.7 GB)
19/02/18 23:47:29 INFO MongoClientCache: Closing MongoClient: [mongoconfig-001.zxcv.prod.rba.company.net:27017]
19/02/18 23:47:29 INFO connection: Closed connection [connectionId{localValue:2}] to mongoconfig-001.zxcv.prod.rba.company.net:27017 because the pool has been closed.
19/02/18 23:47:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, node1.dev.qwerty.asdf.io): com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting to connect. Client view of cluster state is {type=UNKNOWN, servers=[{address=mongoconfig-001.zxcv.prod.rba.company.net:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
at com.mongodb.connection.BaseCluster.getDescription(BaseCluster.java:163)
at com.mongodb.Mongo.getClusterDescription(Mongo.java:411)
at com.mongodb.Mongo.getServerAddressList(Mongo.java:404)
at com.mongodb.spark.connection.MongoClientCache$$anonfun$logClient$1.apply(MongoClientCache.scala:161)
at com.mongodb.spark.connection.MongoClientCache$$anonfun$logClient$1.apply(MongoClientCache.scala:161)
at com.mongodb.spark.LoggingTrait$class.logInfo(LoggingTrait.scala:48)
at com.mongodb.spark.Logging.logInfo(Logging.scala:24)
at com.mongodb.spark.connection.MongoClientCache.logClient(MongoClientCache.scala:161)
at com.mongodb.spark.connection.MongoClientCache.acquire(MongoClientCache.scala:56)
at com.mongodb.spark.MongoConnector.acquireClient(MongoConnector.scala:239)
at com.mongodb.spark.rdd.MongoRDD.compute(MongoRDD.scala:141)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm not sure what is happening here. Can someone help me out here?
I'm currently using the code below.
conf = SparkConf().setAppName("pyspark test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri","mongodb://<USER>:<PASSWORD>#<HOST>:<PORT>/db.<COLLECTION>?ssl=true&authSource=<DATABASENAME>").load()
I'm calling the above script using the below spark-submit
spark-submit --master yarn --verbose --jars mongo-java-driver-3.4.2.jar,mongo-spark-connector_2.10-1.1.0.jar --py-files pymongo_spark.py test.py
I'm running an analysis on AWS EMR, and I am getting an unexpected SIGTERM error.
Some background:
I'm running a script that reads in many csv files I have stored on S3, and then performs an analysis. My script is schematically:
analysis_script.py
import pandas as pd
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.types import *
from pyspark import SparkContext
import boto3
#Spark context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.csv("s3n://csv_files/*", header = True)
def analysis(df):
#do bunch of stuff. Create output dataframe
return df_output
df_output = analysis(df)
I launch the cluster using:
aws emr create-cluster
--release-label emr-5.5.0
--name "Analysis"
--applications Name=Hadoop Name=Hive Name=Spark Name=Ganglia
--ec2-attributes KeyName=EMRB,InstanceProfile=EMR_EC2_DefaultRole
--service-role EMR_DefaultRole
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=r3.xlarge
--region us-west-2
--log-uri s3://emr-logs/
--bootstrap-actions Name="Install Python Packages",Path="s3://emr-bootstraps/install_python_packages_custom.bash",Args=["numpy pandas boto3 tqdm"]
--auto-terminate
I can see from logs that the reading in of the csv files goes fine. But then it finishes with errors. The following lines are in the stderr file:
18/07/16 12:02:26 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
18/07/16 12:02:26 ERROR ApplicationMaster: User application exited with status 143
18/07/16 12:02:26 INFO ApplicationMaster: Final app status: FAILED, exitCode: 143, (reason: User application exited with status 143)
18/07/16 12:02:26 INFO SparkContext: Invoking stop() from shutdown hook
18/07/16 12:02:26 INFO SparkUI: Stopped Spark web UI at http://172.31.36.42:36169
18/07/16 12:02:26 INFO TaskSetManager: Starting task 908.0 in stage 1494.0 (TID 88112, ip-172-31-35-59.us-west-2.compute.internal, executor 27, partition 908, RACK_LOCAL, 7278 bytes)
18/07/16 12:02:26 INFO TaskSetManager: Finished task 874.0 in stage 1494.0 (TID 88078) in 16482 ms on ip-172-31-35-59.us-west-2.compute.internal (executor 27) (879/4805)
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-36-42.us-west-2.compute.internal:34133 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(20, ip-172-31-36-42.us-west-2.compute.internal, 34133, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-47-55.us-west-2.compute.internal:45758 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(16, ip-172-31-47-55.us-west-2.compute.internal, 45758, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO DAGScheduler: Job 1494 failed: toPandas at analysis_script.py:267, took 479.895614 s
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1494 (toPandas at analysis_script.py:267) failed in 478.993 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionEnd(0,1531742546839)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#28e5b10c)
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1495 (toPandas at analysis_script.py:267) failed in 479.270 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#6b68c419)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(1494,1531742546841,JobFailed(org.apache.spark.SparkException: Job 1494 cancelled because SparkContext was shut down))
18/07/16 12:02:26 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/07/16 12:02:26 INFO YarnClusterSchedulerBackend: Shutting down all executors
18/07/16 12:02:26 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/16 12:02:26 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices(serviceOption=None, services=List(),started=false)
18/07/16 12:02:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
I can't find much useful information about exit code 143. Does anybody know why this error is occurring? Thanks.
Spark passes through exit codes when they're over 128, which is often the case with JVM errors. In the case of exit code 143, it signifies that the JVM received a SIGTERM - essentially a unix kill signal (see this post for more exit codes and an explanation). Other details about Spark exit codes can be found in this question.
Since you didn't terminate this yourself, I'd start by suspecting something else externally did. Given that precisely 8 minutes elapse between job start and a SIGTERM being issued, it seems much more likely that EMR itself may be enforcing a maximum job run time/cluster age. Try checking through your EMR settings to see if there is any such timeout set - there was one in my case (on AWS Glue, but the same concept).
I have a pyspark script like below. In this I am passing table names from a file to the this script. The script is executing succesfully and I have no problem with the script.
Now I want to collect the logs of this script for each table indiviually. Is it possible?
Pyspark script:
#!/usr/bin/env python
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
#Condition to specify exact number of arguments in the spark-submit command line
if len(sys.argv) != 8:
print "Invalid number of args......"
print "Usage: spark-submit import.py Arguments"
exit()
args_file = sys.argv[1]
hivedb = sys.argv[2]
domain = sys.argv[3]
port=sys.argv[4]
mysqldb=sys.argv[5]
username=sys.argv[6]
password=sys.argv[7]
def mysql_spark(table, hivedb, domain, port, mysqldb, username, password):
print "*********************************************************table = {} ***************************".format(table)
df = sqlContext.read.format("jdbc").option("url", "{}:{}/{}".format(domain,port,mysqldb)).option("driver", "com.mysql.jdbc.Driver").option("dbtable","{}".format(table)).option("user", "{}".format(username)).option("password", "{}".format(password)).load()
df.registerTempTable("mytempTable")
sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable".format(hivedb,table))
input = sc.textFile('/user/XXXXXXXX/mysql_spark/%s' %args_file).collect()
for table in input:
mysql_spark(table, hivedb, domain, port, mysqldb, username, password)
sc.stop()
Shell script to invoke pyspark script
#!/bin/bash
source /home/$USER/mysql_spark/source.sh
[ $# -ne 1 ] && { echo "Usage : $0 table ";exit 1; }
args_file=$1
TIMESTAMP=`date "+%Y-%m-%d"`
touch /home/$USER/logs/${TIMESTAMP}.success_log
touch /home/$USER/logs/${TIMESTAMP}.fail_log
success_logs=/home/$USER/logs/${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${TIMESTAMP}.fail_log
#Function to get the status of the job creation
function log_status
{
status=$1
message=$2
if [ "$status" -ne 0 ]; then
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
#echo "Please find the attached log file for more details"
exit 1
else
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
fi
}
spark-submit --name "${args_file}" --master "yarn-client" /home/$USER/mysql_spark/mysql_spark.py ${args_file} ${hivedb} ${domain} ${port} ${mysqldb} ${username} ${password}
g_STATUS=$?
log_status $g_STATUS "Spark job ${args_file} Execution"
Questions
I want to get the logs of each table separately as separate files rather than all the tables in a single file.
If possible the `status` messages of each table separately rather than getting single status message of file
Log file
*********************************************************table = table_1 ***************************
17/07/26 12:47:36 INFO parquet.ParquetRelation: Listing hdfs://localhost/user/hive/warehouse/testing.db/table_1 on driver
17/07/26 12:47:36 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Got job 4 (sql at NativeMethodAccessorImpl.java:-2) with 2 output partitions
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[12] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:36 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 71.6 KB, free 602.7 KB)
17/07/26 12:47:36 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 25.0 KB, free 627.7 KB)
17/07/26 12:47:36 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 25.0 KB, free: 530.2 MB)
17/07/26 12:47:36 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 4 (MapPartitionsRDD[12] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:36 INFO cluster.YarnScheduler: Adding task set 4.0 with 2 tasks
17/07/26 12:47:36 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 7, localhost, partition 0,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:36 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 8, localhost, partition 1,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:36 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:63339 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:36 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:59298 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:37 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 4.0 (TID 8) in 121 ms on localhost (1/2)
17/07/26 12:47:37 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 7) in 133 ms on localhost (2/2)
17/07/26 12:47:37 INFO scheduler.DAGScheduler: ResultStage 4 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.133 s
17/07/26 12:47:37 INFO cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Job 4 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.160750 s
17/07/26 12:47:37 INFO parquet.ParquetRelation: Using default output committer for Parquet: parquet.hadoop.ParquetOutputCommitter
17/07/26 12:47:37 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/07/26 12:47:37 INFO datasources.DefaultWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
17/07/26 12:47:37 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/07/26 12:47:37 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Got job 5 (sql at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Final stage: ResultStage 5 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[15] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:37 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 84.4 KB, free 712.0 KB)
17/07/26 12:47:37 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 30.9 KB, free 742.9 KB)
17/07/26 12:47:37 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 30.9 KB, free: 530.1 MB)
17/07/26 12:47:37 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[15] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:37 INFO cluster.YarnScheduler: Adding task set 5.0 with 1 tasks
17/07/26 12:47:37 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 9, localhost, partition 0,PROCESS_LOCAL, 1922 bytes)
17/07/26 12:47:37 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:63339 (size: 30.9 KB, free: 3.1 GB)
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 9) in 2270 ms on localhost (1/1)
17/07/26 12:47:39 INFO cluster.YarnScheduler: Removed TaskSet 5.0, whose tasks have all completed, from pool
17/07/26 12:47:39 INFO scheduler.DAGScheduler: ResultStage 5 (sql at NativeMethodAccessorImpl.java:-2) finished in 2.270 s
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Job 5 finished: sql at NativeMethodAccessorImpl.java:-2, took 2.302009 s
17/07/26 12:47:39 INFO datasources.DefaultWriterContainer: Job job_201707261247_0000 committed.
17/07/26 12:47:39 INFO parquet.ParquetRelation: Listing hdfs://localhost/user/hive/warehouse/testing.db/table_1 on driver
17/07/26 12:47:39 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Got job 6 (sql at NativeMethodAccessorImpl.java:-2) with 2 output partitions
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Final stage: ResultStage 6 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[17] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:39 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 71.6 KB, free 814.5 KB)
17/07/26 12:47:39 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 25.0 KB, free 839.5 KB)
17/07/26 12:47:39 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 25.0 KB, free: 530.1 MB)
17/07/26 12:47:39 INFO spark.SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 6 (MapPartitionsRDD[17] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:39 INFO cluster.YarnScheduler: Adding task set 6.0 with 2 tasks
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 10, localhost, partition 0,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 6.0 (TID 11, localhost, partition 1,PROCESS_LOCAL, 2101 bytes)
17/07/26 12:47:39 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:63339 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:39 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:59298 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 6.0 (TID 10) in 142 ms on localhost (1/2)
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 6.0 (TID 11) in 180 ms on localhost (2/2)
17/07/26 12:47:39 INFO cluster.YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
17/07/26 12:47:39 INFO scheduler.DAGScheduler: ResultStage 6 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.195 s
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Job 6 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.219934 s
*********************************************************table = table_2 ***************************
17/07/26 12:47:40 INFO parquet.ParquetRelation: Listing hdfs://localhost/user/hive/warehouse/testing.db/table_2 on driver
17/07/26 12:47:40 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Got job 7 (sql at NativeMethodAccessorImpl.java:-2) with 2 output partitions
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Final stage: ResultStage 7 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting ResultStage 7 (MapPartitionsRDD[21] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 71.6 KB, free 911.1 KB)
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 25.0 KB, free 936.1 KB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 25.0 KB, free: 530.1 MB)
17/07/26 12:47:40 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 7 (MapPartitionsRDD[21] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO cluster.YarnScheduler: Adding task set 7.0 with 2 tasks
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 7.0 (TID 12, localhost, partition 0,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 7.0 (TID 13, localhost, partition 1,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:63339 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:59298 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 7.0 (TID 13) in 69 ms on localhost (1/2)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 7.0 (TID 12) in 137 ms on localhost (2/2)
17/07/26 12:47:40 INFO scheduler.DAGScheduler: ResultStage 7 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.138 s
17/07/26 12:47:40 INFO cluster.YarnScheduler: Removed TaskSet 7.0, whose tasks have all completed, from pool
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Job 7 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.157692 s
17/07/26 12:47:40 INFO parquet.ParquetRelation: Using default output committer for Parquet: parquet.hadoop.ParquetOutputCommitter
17/07/26 12:47:40 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/07/26 12:47:40 INFO datasources.DefaultWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
17/07/26 12:47:40 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/07/26 12:47:40 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Got job 8 (sql at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Final stage: ResultStage 8 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting ResultStage 8 (MapPartitionsRDD[24] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 84.4 KB, free 1020.4 KB)
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 30.9 KB, free 1051.3 KB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 30.9 KB, free: 530.0 MB)
17/07/26 12:47:40 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (MapPartitionsRDD[24] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO cluster.YarnScheduler: Adding task set 8.0 with 1 tasks
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 8.0 (TID 14, localhost, partition 0,PROCESS_LOCAL, 1922 bytes)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:63339 (size: 30.9 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 8.0 (TID 14) in 194 ms on localhost (1/1)
17/07/26 12:47:40 INFO cluster.YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
17/07/26 12:47:40 INFO scheduler.DAGScheduler: ResultStage 8 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.195 s
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Job 8 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.221049 s
17/07/26 12:47:40 INFO datasources.DefaultWriterContainer: Job job_201707261247_0000 committed.
17/07/26 12:47:40 INFO parquet.ParquetRelation: Listing hdfs://localhost/user/hive/warehouse/testing.db/table_2 on driver
17/07/26 12:47:40 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Got job 9 (sql at NativeMethodAccessorImpl.java:-2) with 2 output partitions
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Final stage: ResultStage 9 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting ResultStage 9 (MapPartitionsRDD[26] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 71.6 KB, free 1122.9 KB)
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 25.0 KB, free 1147.9 KB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 25.0 KB, free: 530.0 MB)
17/07/26 12:47:40 INFO spark.SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 9 (MapPartitionsRDD[26] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO cluster.YarnScheduler: Adding task set 9.0 with 2 tasks
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 9.0 (TID 15, localhost, partition 0,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 9.0 (TID 16, localhost, partition 1,PROCESS_LOCAL, 2101 bytes)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:63339 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:59298 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 9.0 (TID 15) in 124 ms on localhost (1/2)
17/07/26 12:47:41 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 9.0 (TID 16) in 151 ms on localhost (2/2)
17/07/26 12:47:41 INFO scheduler.DAGScheduler: ResultStage 9 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.158 s
17/07/26 12:47:41 INFO cluster.YarnScheduler: Removed TaskSet 9.0, whose tasks have all completed, from pool
17/07/26 12:47:41 INFO scheduler.DAGScheduler: Job 9 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.181504 s
As far as I can see, you should read the file that contains the list of tables from bash script, and then send one spark-submit for each of them.
IFS=$'\n' read -d '' -r -a lines < args_file
for it in "${lines[#]}"
do
TIMESTAMP=`date "+%Y-%m-%d"`
touch /home/$USER/logs/${it}_${TIMESTAMP}.success_log
touch /home/$USER/logs/${it}_${TIMESTAMP}.fail_log
success_logs=/home/$USER/logs/${it}_${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${it}_${TIMESTAMP}.fail_log
#Function to get the status of the job creation
function log_status
{
status=$1
message=$2
if [ "$status" -ne 0 ]; then
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
#echo "Please find the attached log file for more details"
exit 1
else
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
fi
}
spark-submit --name "${it}" --master "yarn-client" /home/$USER/mysql_spark/mysql_spark.py ${it} ${hivedb} ${domain} ${port} ${mysqldb} ${username} ${password}
done
First argument of python script now is a table name:
if len(sys.argv) != 8:
print "Invalid number of args......"
print "Usage: spark-submit import.py Arguments"
exit()
table = sys.argv[1]
hivedb = sys.argv[2]
domain = sys.argv[3]
port=sys.argv[4]
mysqldb=sys.argv[5]
username=sys.argv[6]
password=sys.argv[7]
def mysql_spark(table, hivedb, domain, port, mysqldb, username, password):
print "*********************************************************table = {} ***************************".format(table)
df = sqlContext.read.format("jdbc").option("url", "{}:{}/{}".format(domain,port,mysqldb)).option("driver", "com.mysql.jdbc.Driver").option("dbtable","{}".format(table)).option("user", "{}".format(username)).option("password", "{}".format(password)).load()
df.registerTempTable("mytempTable")
sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable".format(hivedb,table))
mysql_spark(table, hivedb, domain, port, mysqldb, username, password)
sc.stop()
However, I do not know how this will affect the performance of the task.
Just run your script as
./script.sh >> log.txt
Should save your logs in log.txt file. Anything and everything that your script prints out!