OutOfMemoryError when using PySpark to read files in local mode - python

I have about a dozen gpg-encrypted files containing data I'd like to analyze using PySpark. My strategy is to apply a decryption function as a flat map to each file and then proceed processing at the record level:
def read_fun_generator(filename):
with gpg_open(filename[0].split(':')[-1], 'r') as f:
for line in f:
yield line.strip()
gpg_files = sc.wholeTextFiles(/path/to/files/*.gpg)
rdd_from_gpg = gpg_files.flatMap(read_fun_generator).map(lambda x: x.split('|'))
rdd_from_gpg.count() # <-- For example...
This approach works quite well when using a single thread in local mode, i.e. setting the master to local[1]. However, using any more than a single thread causes an OutOfMemoryError to be thrown. I've tried increasing spark.executor.memory and spark.driver.memory to 30g, but this seems not to help. I can confirm in the UI that those settings have stuck. (My machine has over 200GB available.) However, I've noticed in the logs that the block manager seems to be starting with only 265.4 MB of memory. I wonder if this is related?
Here is the full configuration I'm starting with:
conf = (SparkConf()
.setMaster("local[*]")
.setAppName("pyspark_local")
.set("spark.executor.memory", "30g")
.set("spark.driver.memory", "30g")
.set("spark.python.worker.memory", "5g")
)
sc = SparkContext(conf=conf)
This is the stack trace from my logs:
15/06/10 11:03:30 INFO SparkContext: Running Spark version 1.3.1
15/06/10 11:03:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/10 11:03:31 INFO SecurityManager: Changing view acls to: santon
15/06/10 11:03:31 INFO SecurityManager: Changing modify acls to: santon
15/06/10 11:03:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(santon); users with modify permissions: Set(santon)
15/06/10 11:03:31 INFO Slf4jLogger: Slf4jLogger started
15/06/10 11:03:31 INFO Remoting: Starting remoting
15/06/10 11:03:32 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#localhost:44347]
15/06/10 11:03:32 INFO Utils: Successfully started service 'sparkDriver' on port 44347.
15/06/10 11:03:32 INFO SparkEnv: Registering MapOutputTracker
15/06/10 11:03:32 INFO SparkEnv: Registering BlockManagerMaster
15/06/10 11:03:32 INFO DiskBlockManager: Created local directory at /tmp/spark-24dc8f0a-a89a-44f8-bb95-cd5514e5bf0c/blockmgr-85b6f082-ff5a-4a0e-b48a-1ec62715dda0
15/06/10 11:03:32 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/06/10 11:03:32 INFO HttpFileServer: HTTP File server directory is /tmp/spark-7b2172ed-d658-4e11-bbc1-600697f3255e/httpd-5423f8bc-ec43-48c5-9367-87214dad54f4
15/06/10 11:03:32 INFO HttpServer: Starting HTTP Server
15/06/10 11:03:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/10 11:03:32 INFO AbstractConnector: Started SocketConnector#0.0.0.0:50366
15/06/10 11:03:32 INFO Utils: Successfully started service 'HTTP file server' on port 50366.
15/06/10 11:03:32 INFO SparkEnv: Registering OutputCommitCoordinator
15/06/10 11:03:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/10 11:03:32 INFO AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
15/06/10 11:03:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/06/10 11:03:32 INFO SparkUI: Started SparkUI at localhost:4040
15/06/10 11:03:32 INFO Executor: Starting executor ID <driver> on host localhost
15/06/10 11:03:32 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#localhost:44347/user/HeartbeatReceiver
15/06/10 11:03:33 INFO NettyBlockTransferService: Server created on 46730
15/06/10 11:03:33 INFO BlockManagerMaster: Trying to register BlockManager
15/06/10 11:03:33 INFO BlockManagerMasterActor: Registering block manager localhost:46730 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 46730)
15/06/10 11:03:33 INFO BlockManagerMaster: Registered BlockManager
15/06/10 11:05:19 INFO MemoryStore: ensureFreeSpace(215726) called with curMem=0, maxMem=278302556
15/06/10 11:05:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 210.7 KB, free 265.2 MB)
15/06/10 11:05:19 INFO MemoryStore: ensureFreeSpace(31533) called with curMem=215726, maxMem=278302556
15/06/10 11:05:19 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 30.8 KB, free 265.2 MB)
15/06/10 11:05:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46730 (size: 30.8 KB, free: 265.4 MB)
15/06/10 11:05:19 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/06/10 11:05:19 INFO SparkContext: Created broadcast 0 from wholeTextFiles at NativeMethodAccessorImpl.java:-2
15/06/10 11:05:22 INFO FileInputFormat: Total input paths to process : 16
15/06/10 11:05:22 INFO FileInputFormat: Total input paths to process : 16
15/06/10 11:05:22 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 71665121
15/06/10 11:05:22 INFO SparkContext: Starting job: count at <timed exec>:2
15/06/10 11:05:22 INFO DAGScheduler: Got job 0 (count at <timed exec>:2) with 2 output partitions (allowLocal=false)
15/06/10 11:05:22 INFO DAGScheduler: Final stage: Stage 0(count at <timed exec>:2)
15/06/10 11:05:22 INFO DAGScheduler: Parents of final stage: List()
15/06/10 11:05:22 INFO DAGScheduler: Missing parents: List()
15/06/10 11:05:22 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at count at <timed exec>:2), which has no missing parents
15/06/10 11:05:23 INFO MemoryStore: ensureFreeSpace(6264) called with curMem=247259, maxMem=278302556
15/06/10 11:05:23 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.1 KB, free 265.2 MB)
15/06/10 11:05:23 INFO MemoryStore: ensureFreeSpace(4589) called with curMem=253523, maxMem=278302556
15/06/10 11:05:23 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.5 KB, free 265.2 MB)
15/06/10 11:05:23 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:46730 (size: 4.5 KB, free: 265.4 MB)
15/06/10 11:05:23 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/06/10 11:05:23 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/06/10 11:05:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at count at <timed exec>:2)
15/06/10 11:05:23 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/06/10 11:05:23 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1903 bytes)
15/06/10 11:05:23 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 3085 bytes)
15/06/10 11:05:23 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/06/10 11:05:23 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/06/10 11:05:26 INFO WholeTextFileRDD: Input split: Paths:[gpg_files]
15/06/10 11:05:40 ERROR Utils: Uncaught exception in thread stdout writer for /anaconda/python/bin/python
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
at java.nio.CharBuffer.toString(CharBuffer.java:1201)
at org.apache.hadoop.io.Text.decode(Text.java:405)
at org.apache.hadoop.io.Text.decode(Text.java:382)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:86)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:421)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
Exception in thread "stdout writer for /anaconda/python/bin/python" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
at java.nio.CharBuffer.toString(CharBuffer.java:1201)
at org.apache.hadoop.io.Text.decode(Text.java:405)
at org.apache.hadoop.io.Text.decode(Text.java:382)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:86)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:421)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
15/06/10 11:05:47 INFO PythonRDD: Times: total = 24140, boot = 2860, init = 664, finish = 20616
15/06/10 11:05:47 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1873 bytes result sent to driver
15/06/10 11:05:47 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 24251 ms on localhost (1/2)
Has anyone run into this problem? Is there a setting I'm not aware of that I should modify? It seems like this should be possible...

The thing with sc.wholeTextFiles(/path/to/files/*.gpg) - to returns PairRDD, key - the file name and the value - is the file contents.
Looks like you are not using file contents part, but still told Spark to read files from disk and ship them to workers.
If your goal is to process the list of file names only, and the contents of them to be read with gpg_open you can do this:
def read_fun_generator(filename):
with gpg_open(filename.split(':')[-1], 'r') as f:
for line in f:
yield line.strip()
gpg_filelist = glob.glob("/path/to/files/*.gpg")
# generate RDD with file name per record
gpg_files = sc.parallelize(gpg_filelist)
rdd_from_gpg = gpg_files.flatMap(read_fun_generator).map(lambda x: x.split('|'))
rdd_from_gpg.count() # <-- For example...
This would reduce the amount of memory used by Spark's JVM.

Related

Is this an error that occurred after the spark operation?

I ran the following command:
$ spark-submit --master yarn --deploy-mode cluster pi.py
So, below log is continuous print:
...
2021-12-23 06:07:50,158 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
2021-12-23 06:07:51,162 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
...
and I check the result through my 8088(Logs for container web UI), but there is nothing in stdout.
I was disappointed and tried to force the park operation to end, but suddenly the new log is print like below:
...
2021-12-23 06:09:06,694 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
2021-12-23 06:09:06,695 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: master
ApplicationMaster RPC port: 40451
queue: default
start time: 1640239668020
final status: UNDEFINED
tracking URL: http://master2:8088/proxy/application_1640239254568_0002/
user: root
2021-12-23 06:09:07,707 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
...
And after some time, an error log occurred as shown below:
...
2021-12-23 06:10:25,003 INFO retry.RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "master/172.17.0.2"; destination host is: "master2":8032; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm2. Trying to failover immediately.
2021-12-23 06:10:25,003 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2021-12-23 06:10:25,004 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From master/172.17.0.2 to master:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm1 after 1 failover attempts. Trying to failover after sleeping for 18340ms.
2021-12-23 06:10:43,347 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
...
I understand that sparks have completed the resource manager allocation after work, so it is normal for the above error log to appear.
Q1. Is the above job normal?
Q2. After this work, where can I check the results? Can I check them on "containerlogs web UI"?
IMPORTANT!! ADD. I re-ran the command. and check the status: SUCCEEDED. Why does the park-submit operation sometimes succeed and sometimes stop in the middle?

spark submit with python on windows- An existing connection was forcibly closed by the remote host

I am new to spark and I tried to submit my first job.
The code is in python and i am running on windows 10.
On one console I deployed the master as:
spark-class org.apache.spark.deploy.master.Master
On another I deployed one slave:
spark-class org.apache.spark.deploy.worker.Worker spark://192.168.1.4:7077
All good so far.
When I want to deploy the spark job in python I run on another console the following:
spark-submit --master spark://192.168.1.4:7077 "C:\path\to\file\friends-by-age.py"
The job is executed correctly because it prints all the results well, but then an error is thrown:
19/07/28 21:51:52 INFO SparkContext: Invoking stop() from shutdown hook
19/07/28 21:51:52 INFO SparkUI: Stopped Spark web UI at http://192.168.1.4:4040
19/07/28 21:51:52 INFO StandaloneSchedulerBackend: Shutting down all executors
19/07/28 21:51:52 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
19/07/28 21:51:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/28 21:51:52 WARN TransportChannelHandler: Exception in connection from /192.168.1.4:59096
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Unknown Source)
19/07/28 21:51:52 INFO MemoryStore: MemoryStore cleared
19/07/28 21:51:52 INFO BlockManager: BlockManager stopped
19/07/28 21:51:52 INFO BlockManagerMaster: BlockManagerMaster stopped
19/07/28 21:51:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/07/28 21:51:52 INFO SparkContext: Successfully stopped SparkContext
This is the friends-by-age.py code:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("FriendsByAge")
sc = SparkContext(conf = conf)
def parseLine(line):
fields = line.split(',')
age = int(fields[2])
numFriends = int(fields[3])
return (age, numFriends)
lines = sc.textFile("file:///C:/path/to/file/fakefriends.csv")
rdd = lines.map(parseLine)
totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])
averagesByAge = averagesByAge.sortByKey()
results = averagesByAge.collect()
for result in results:
print(result)
Also in the spark ui
I did not deploy any other worker.
Any ideas?

Spark to MongoDB connection closes without a meaningful error

I'm using MongoDB version 3.4.7, Spark version 1.6.3 and MongoDB-Spark connector version 1.1.0.
I have a pyspark script that extract data from MongoDB collection to create a dataframe.
I noticed that my spark-submit fails after the connection get closed. (see the log out below).
19/02/18 23:47:25 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, node1.dev.qwerty.asdf.io, partition 0,ASDF_LOCAL, 2476 bytes)
19/02/18 23:47:25 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on node1.dev.qwerty.asdf.io:45779 (size: 2.7 KB, free: 2.7 GB)
19/02/18 23:47:25 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node1.dev.qwerty.asdf.io:45779 (size: 497.0 B, free: 2.7 GB)
19/02/18 23:47:29 INFO MongoClientCache: Closing MongoClient: [mongoconfig-001.zxcv.prod.rba.company.net:27017]
19/02/18 23:47:29 INFO connection: Closed connection [connectionId{localValue:2}] to mongoconfig-001.zxcv.prod.rba.company.net:27017 because the pool has been closed.
19/02/18 23:47:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, node1.dev.qwerty.asdf.io): com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting to connect. Client view of cluster state is {type=UNKNOWN, servers=[{address=mongoconfig-001.zxcv.prod.rba.company.net:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
at com.mongodb.connection.BaseCluster.getDescription(BaseCluster.java:163)
at com.mongodb.Mongo.getClusterDescription(Mongo.java:411)
at com.mongodb.Mongo.getServerAddressList(Mongo.java:404)
at com.mongodb.spark.connection.MongoClientCache$$anonfun$logClient$1.apply(MongoClientCache.scala:161)
at com.mongodb.spark.connection.MongoClientCache$$anonfun$logClient$1.apply(MongoClientCache.scala:161)
at com.mongodb.spark.LoggingTrait$class.logInfo(LoggingTrait.scala:48)
at com.mongodb.spark.Logging.logInfo(Logging.scala:24)
at com.mongodb.spark.connection.MongoClientCache.logClient(MongoClientCache.scala:161)
at com.mongodb.spark.connection.MongoClientCache.acquire(MongoClientCache.scala:56)
at com.mongodb.spark.MongoConnector.acquireClient(MongoConnector.scala:239)
at com.mongodb.spark.rdd.MongoRDD.compute(MongoRDD.scala:141)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm not sure what is happening here. Can someone help me out here?
I'm currently using the code below.
conf = SparkConf().setAppName("pyspark test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri","mongodb://<USER>:<PASSWORD>#<HOST>:<PORT>/db.<COLLECTION>?ssl=true&authSource=<DATABASENAME>").load()
I'm calling the above script using the below spark-submit
spark-submit --master yarn --verbose --jars mongo-java-driver-3.4.2.jar,mongo-spark-connector_2.10-1.1.0.jar --py-files pymongo_spark.py test.py

AWS-EMR error exit code 143

I'm running an analysis on AWS EMR, and I am getting an unexpected SIGTERM error.
Some background:
I'm running a script that reads in many csv files I have stored on S3, and then performs an analysis. My script is schematically:
analysis_script.py
import pandas as pd
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.types import *
from pyspark import SparkContext
import boto3
#Spark context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.csv("s3n://csv_files/*", header = True)
def analysis(df):
#do bunch of stuff. Create output dataframe
return df_output
df_output = analysis(df)
I launch the cluster using:
aws emr create-cluster
--release-label emr-5.5.0
--name "Analysis"
--applications Name=Hadoop Name=Hive Name=Spark Name=Ganglia
--ec2-attributes KeyName=EMRB,InstanceProfile=EMR_EC2_DefaultRole
--service-role EMR_DefaultRole
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=r3.xlarge
--region us-west-2
--log-uri s3://emr-logs/
--bootstrap-actions Name="Install Python Packages",Path="s3://emr-bootstraps/install_python_packages_custom.bash",Args=["numpy pandas boto3 tqdm"]
--auto-terminate
I can see from logs that the reading in of the csv files goes fine. But then it finishes with errors. The following lines are in the stderr file:
18/07/16 12:02:26 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
18/07/16 12:02:26 ERROR ApplicationMaster: User application exited with status 143
18/07/16 12:02:26 INFO ApplicationMaster: Final app status: FAILED, exitCode: 143, (reason: User application exited with status 143)
18/07/16 12:02:26 INFO SparkContext: Invoking stop() from shutdown hook
18/07/16 12:02:26 INFO SparkUI: Stopped Spark web UI at http://172.31.36.42:36169
18/07/16 12:02:26 INFO TaskSetManager: Starting task 908.0 in stage 1494.0 (TID 88112, ip-172-31-35-59.us-west-2.compute.internal, executor 27, partition 908, RACK_LOCAL, 7278 bytes)
18/07/16 12:02:26 INFO TaskSetManager: Finished task 874.0 in stage 1494.0 (TID 88078) in 16482 ms on ip-172-31-35-59.us-west-2.compute.internal (executor 27) (879/4805)
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-36-42.us-west-2.compute.internal:34133 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(20, ip-172-31-36-42.us-west-2.compute.internal, 34133, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-47-55.us-west-2.compute.internal:45758 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(16, ip-172-31-47-55.us-west-2.compute.internal, 45758, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO DAGScheduler: Job 1494 failed: toPandas at analysis_script.py:267, took 479.895614 s
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1494 (toPandas at analysis_script.py:267) failed in 478.993 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionEnd(0,1531742546839)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#28e5b10c)
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1495 (toPandas at analysis_script.py:267) failed in 479.270 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#6b68c419)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(1494,1531742546841,JobFailed(org.apache.spark.SparkException: Job 1494 cancelled because SparkContext was shut down))
18/07/16 12:02:26 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/07/16 12:02:26 INFO YarnClusterSchedulerBackend: Shutting down all executors
18/07/16 12:02:26 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/16 12:02:26 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices(serviceOption=None, services=List(),started=false)
18/07/16 12:02:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
I can't find much useful information about exit code 143. Does anybody know why this error is occurring? Thanks.
Spark passes through exit codes when they're over 128, which is often the case with JVM errors. In the case of exit code 143, it signifies that the JVM received a SIGTERM - essentially a unix kill signal (see this post for more exit codes and an explanation). Other details about Spark exit codes can be found in this question.
Since you didn't terminate this yourself, I'd start by suspecting something else externally did. Given that precisely 8 minutes elapse between job start and a SIGTERM being issued, it seems much more likely that EMR itself may be enforcing a maximum job run time/cluster age. Try checking through your EMR settings to see if there is any such timeout set - there was one in my case (on AWS Glue, but the same concept).

Get logs separately while invoking python script using a shell script

I have a pyspark script like below. In this I am passing table names from a file to the this script. The script is executing succesfully and I have no problem with the script.
Now I want to collect the logs of this script for each table indiviually. Is it possible?
Pyspark script:
#!/usr/bin/env python
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
#Condition to specify exact number of arguments in the spark-submit command line
if len(sys.argv) != 8:
print "Invalid number of args......"
print "Usage: spark-submit import.py Arguments"
exit()
args_file = sys.argv[1]
hivedb = sys.argv[2]
domain = sys.argv[3]
port=sys.argv[4]
mysqldb=sys.argv[5]
username=sys.argv[6]
password=sys.argv[7]
def mysql_spark(table, hivedb, domain, port, mysqldb, username, password):
print "*********************************************************table = {} ***************************".format(table)
df = sqlContext.read.format("jdbc").option("url", "{}:{}/{}".format(domain,port,mysqldb)).option("driver", "com.mysql.jdbc.Driver").option("dbtable","{}".format(table)).option("user", "{}".format(username)).option("password", "{}".format(password)).load()
df.registerTempTable("mytempTable")
sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable".format(hivedb,table))
input = sc.textFile('/user/XXXXXXXX/mysql_spark/%s' %args_file).collect()
for table in input:
mysql_spark(table, hivedb, domain, port, mysqldb, username, password)
sc.stop()
Shell script to invoke pyspark script
#!/bin/bash
source /home/$USER/mysql_spark/source.sh
[ $# -ne 1 ] && { echo "Usage : $0 table ";exit 1; }
args_file=$1
TIMESTAMP=`date "+%Y-%m-%d"`
touch /home/$USER/logs/${TIMESTAMP}.success_log
touch /home/$USER/logs/${TIMESTAMP}.fail_log
success_logs=/home/$USER/logs/${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${TIMESTAMP}.fail_log
#Function to get the status of the job creation
function log_status
{
status=$1
message=$2
if [ "$status" -ne 0 ]; then
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
#echo "Please find the attached log file for more details"
exit 1
else
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
fi
}
spark-submit --name "${args_file}" --master "yarn-client" /home/$USER/mysql_spark/mysql_spark.py ${args_file} ${hivedb} ${domain} ${port} ${mysqldb} ${username} ${password}
g_STATUS=$?
log_status $g_STATUS "Spark job ${args_file} Execution"
Questions
I want to get the logs of each table separately as separate files rather than all the tables in a single file.
If possible the `status` messages of each table separately rather than getting single status message of file
Log file
*********************************************************table = table_1 ***************************
17/07/26 12:47:36 INFO parquet.ParquetRelation: Listing hdfs://localhost/user/hive/warehouse/testing.db/table_1 on driver
17/07/26 12:47:36 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Got job 4 (sql at NativeMethodAccessorImpl.java:-2) with 2 output partitions
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[12] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:36 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 71.6 KB, free 602.7 KB)
17/07/26 12:47:36 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 25.0 KB, free 627.7 KB)
17/07/26 12:47:36 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 25.0 KB, free: 530.2 MB)
17/07/26 12:47:36 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:36 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 4 (MapPartitionsRDD[12] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:36 INFO cluster.YarnScheduler: Adding task set 4.0 with 2 tasks
17/07/26 12:47:36 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 7, localhost, partition 0,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:36 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 8, localhost, partition 1,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:36 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:63339 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:36 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:59298 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:37 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 4.0 (TID 8) in 121 ms on localhost (1/2)
17/07/26 12:47:37 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 7) in 133 ms on localhost (2/2)
17/07/26 12:47:37 INFO scheduler.DAGScheduler: ResultStage 4 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.133 s
17/07/26 12:47:37 INFO cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Job 4 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.160750 s
17/07/26 12:47:37 INFO parquet.ParquetRelation: Using default output committer for Parquet: parquet.hadoop.ParquetOutputCommitter
17/07/26 12:47:37 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/07/26 12:47:37 INFO datasources.DefaultWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
17/07/26 12:47:37 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/07/26 12:47:37 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Got job 5 (sql at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Final stage: ResultStage 5 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[15] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:37 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 84.4 KB, free 712.0 KB)
17/07/26 12:47:37 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 30.9 KB, free 742.9 KB)
17/07/26 12:47:37 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 30.9 KB, free: 530.1 MB)
17/07/26 12:47:37 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:37 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[15] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:37 INFO cluster.YarnScheduler: Adding task set 5.0 with 1 tasks
17/07/26 12:47:37 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 9, localhost, partition 0,PROCESS_LOCAL, 1922 bytes)
17/07/26 12:47:37 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:63339 (size: 30.9 KB, free: 3.1 GB)
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 9) in 2270 ms on localhost (1/1)
17/07/26 12:47:39 INFO cluster.YarnScheduler: Removed TaskSet 5.0, whose tasks have all completed, from pool
17/07/26 12:47:39 INFO scheduler.DAGScheduler: ResultStage 5 (sql at NativeMethodAccessorImpl.java:-2) finished in 2.270 s
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Job 5 finished: sql at NativeMethodAccessorImpl.java:-2, took 2.302009 s
17/07/26 12:47:39 INFO datasources.DefaultWriterContainer: Job job_201707261247_0000 committed.
17/07/26 12:47:39 INFO parquet.ParquetRelation: Listing hdfs://localhost/user/hive/warehouse/testing.db/table_1 on driver
17/07/26 12:47:39 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Got job 6 (sql at NativeMethodAccessorImpl.java:-2) with 2 output partitions
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Final stage: ResultStage 6 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[17] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:39 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 71.6 KB, free 814.5 KB)
17/07/26 12:47:39 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 25.0 KB, free 839.5 KB)
17/07/26 12:47:39 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 25.0 KB, free: 530.1 MB)
17/07/26 12:47:39 INFO spark.SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 6 (MapPartitionsRDD[17] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:39 INFO cluster.YarnScheduler: Adding task set 6.0 with 2 tasks
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 10, localhost, partition 0,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 6.0 (TID 11, localhost, partition 1,PROCESS_LOCAL, 2101 bytes)
17/07/26 12:47:39 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:63339 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:39 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:59298 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 6.0 (TID 10) in 142 ms on localhost (1/2)
17/07/26 12:47:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 6.0 (TID 11) in 180 ms on localhost (2/2)
17/07/26 12:47:39 INFO cluster.YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
17/07/26 12:47:39 INFO scheduler.DAGScheduler: ResultStage 6 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.195 s
17/07/26 12:47:39 INFO scheduler.DAGScheduler: Job 6 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.219934 s
*********************************************************table = table_2 ***************************
17/07/26 12:47:40 INFO parquet.ParquetRelation: Listing hdfs://localhost/user/hive/warehouse/testing.db/table_2 on driver
17/07/26 12:47:40 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Got job 7 (sql at NativeMethodAccessorImpl.java:-2) with 2 output partitions
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Final stage: ResultStage 7 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting ResultStage 7 (MapPartitionsRDD[21] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 71.6 KB, free 911.1 KB)
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 25.0 KB, free 936.1 KB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 25.0 KB, free: 530.1 MB)
17/07/26 12:47:40 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 7 (MapPartitionsRDD[21] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO cluster.YarnScheduler: Adding task set 7.0 with 2 tasks
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 7.0 (TID 12, localhost, partition 0,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 7.0 (TID 13, localhost, partition 1,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:63339 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:59298 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 7.0 (TID 13) in 69 ms on localhost (1/2)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 7.0 (TID 12) in 137 ms on localhost (2/2)
17/07/26 12:47:40 INFO scheduler.DAGScheduler: ResultStage 7 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.138 s
17/07/26 12:47:40 INFO cluster.YarnScheduler: Removed TaskSet 7.0, whose tasks have all completed, from pool
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Job 7 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.157692 s
17/07/26 12:47:40 INFO parquet.ParquetRelation: Using default output committer for Parquet: parquet.hadoop.ParquetOutputCommitter
17/07/26 12:47:40 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/07/26 12:47:40 INFO datasources.DefaultWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
17/07/26 12:47:40 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/07/26 12:47:40 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Got job 8 (sql at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Final stage: ResultStage 8 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting ResultStage 8 (MapPartitionsRDD[24] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 84.4 KB, free 1020.4 KB)
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 30.9 KB, free 1051.3 KB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 30.9 KB, free: 530.0 MB)
17/07/26 12:47:40 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 8 (MapPartitionsRDD[24] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO cluster.YarnScheduler: Adding task set 8.0 with 1 tasks
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 8.0 (TID 14, localhost, partition 0,PROCESS_LOCAL, 1922 bytes)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:63339 (size: 30.9 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 8.0 (TID 14) in 194 ms on localhost (1/1)
17/07/26 12:47:40 INFO cluster.YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
17/07/26 12:47:40 INFO scheduler.DAGScheduler: ResultStage 8 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.195 s
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Job 8 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.221049 s
17/07/26 12:47:40 INFO datasources.DefaultWriterContainer: Job job_201707261247_0000 committed.
17/07/26 12:47:40 INFO parquet.ParquetRelation: Listing hdfs://localhost/user/hive/warehouse/testing.db/table_2 on driver
17/07/26 12:47:40 INFO spark.SparkContext: Starting job: sql at NativeMethodAccessorImpl.java:-2
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Got job 9 (sql at NativeMethodAccessorImpl.java:-2) with 2 output partitions
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Final stage: ResultStage 9 (sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Missing parents: List()
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting ResultStage 9 (MapPartitionsRDD[26] at sql at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 71.6 KB, free 1122.9 KB)
17/07/26 12:47:40 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 25.0 KB, free 1147.9 KB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on xxxxxxxxxxxxx:9612 (size: 25.0 KB, free: 530.0 MB)
17/07/26 12:47:40 INFO spark.SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:1006
17/07/26 12:47:40 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 9 (MapPartitionsRDD[26] at sql at NativeMethodAccessorImpl.java:-2)
17/07/26 12:47:40 INFO cluster.YarnScheduler: Adding task set 9.0 with 2 tasks
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 9.0 (TID 15, localhost, partition 0,PROCESS_LOCAL, 1975 bytes)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 9.0 (TID 16, localhost, partition 1,PROCESS_LOCAL, 2101 bytes)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:63339 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:59298 (size: 25.0 KB, free: 3.1 GB)
17/07/26 12:47:40 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 9.0 (TID 15) in 124 ms on localhost (1/2)
17/07/26 12:47:41 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 9.0 (TID 16) in 151 ms on localhost (2/2)
17/07/26 12:47:41 INFO scheduler.DAGScheduler: ResultStage 9 (sql at NativeMethodAccessorImpl.java:-2) finished in 0.158 s
17/07/26 12:47:41 INFO cluster.YarnScheduler: Removed TaskSet 9.0, whose tasks have all completed, from pool
17/07/26 12:47:41 INFO scheduler.DAGScheduler: Job 9 finished: sql at NativeMethodAccessorImpl.java:-2, took 0.181504 s
As far as I can see, you should read the file that contains the list of tables from bash script, and then send one spark-submit for each of them.
IFS=$'\n' read -d '' -r -a lines < args_file
for it in "${lines[#]}"
do
TIMESTAMP=`date "+%Y-%m-%d"`
touch /home/$USER/logs/${it}_${TIMESTAMP}.success_log
touch /home/$USER/logs/${it}_${TIMESTAMP}.fail_log
success_logs=/home/$USER/logs/${it}_${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${it}_${TIMESTAMP}.fail_log
#Function to get the status of the job creation
function log_status
{
status=$1
message=$2
if [ "$status" -ne 0 ]; then
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
#echo "Please find the attached log file for more details"
exit 1
else
echo "`date +\"%Y-%m-%d %H:%M:%S\"` [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
fi
}
spark-submit --name "${it}" --master "yarn-client" /home/$USER/mysql_spark/mysql_spark.py ${it} ${hivedb} ${domain} ${port} ${mysqldb} ${username} ${password}
done
First argument of python script now is a table name:
if len(sys.argv) != 8:
print "Invalid number of args......"
print "Usage: spark-submit import.py Arguments"
exit()
table = sys.argv[1]
hivedb = sys.argv[2]
domain = sys.argv[3]
port=sys.argv[4]
mysqldb=sys.argv[5]
username=sys.argv[6]
password=sys.argv[7]
def mysql_spark(table, hivedb, domain, port, mysqldb, username, password):
print "*********************************************************table = {} ***************************".format(table)
df = sqlContext.read.format("jdbc").option("url", "{}:{}/{}".format(domain,port,mysqldb)).option("driver", "com.mysql.jdbc.Driver").option("dbtable","{}".format(table)).option("user", "{}".format(username)).option("password", "{}".format(password)).load()
df.registerTempTable("mytempTable")
sqlContext.sql("create table {}.{} stored as parquet as select * from mytempTable".format(hivedb,table))
mysql_spark(table, hivedb, domain, port, mysqldb, username, password)
sc.stop()
However, I do not know how this will affect the performance of the task.
Just run your script as
./script.sh >> log.txt
Should save your logs in log.txt file. Anything and everything that your script prints out!

Categories