Spark to MongoDB connection closes without a meaningful error - python

I'm using MongoDB version 3.4.7, Spark version 1.6.3 and MongoDB-Spark connector version 1.1.0.
I have a pyspark script that extract data from MongoDB collection to create a dataframe.
I noticed that my spark-submit fails after the connection get closed. (see the log out below).
19/02/18 23:47:25 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, node1.dev.qwerty.asdf.io, partition 0,ASDF_LOCAL, 2476 bytes)
19/02/18 23:47:25 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on node1.dev.qwerty.asdf.io:45779 (size: 2.7 KB, free: 2.7 GB)
19/02/18 23:47:25 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node1.dev.qwerty.asdf.io:45779 (size: 497.0 B, free: 2.7 GB)
19/02/18 23:47:29 INFO MongoClientCache: Closing MongoClient: [mongoconfig-001.zxcv.prod.rba.company.net:27017]
19/02/18 23:47:29 INFO connection: Closed connection [connectionId{localValue:2}] to mongoconfig-001.zxcv.prod.rba.company.net:27017 because the pool has been closed.
19/02/18 23:47:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, node1.dev.qwerty.asdf.io): com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting to connect. Client view of cluster state is {type=UNKNOWN, servers=[{address=mongoconfig-001.zxcv.prod.rba.company.net:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]
at com.mongodb.connection.BaseCluster.getDescription(BaseCluster.java:163)
at com.mongodb.Mongo.getClusterDescription(Mongo.java:411)
at com.mongodb.Mongo.getServerAddressList(Mongo.java:404)
at com.mongodb.spark.connection.MongoClientCache$$anonfun$logClient$1.apply(MongoClientCache.scala:161)
at com.mongodb.spark.connection.MongoClientCache$$anonfun$logClient$1.apply(MongoClientCache.scala:161)
at com.mongodb.spark.LoggingTrait$class.logInfo(LoggingTrait.scala:48)
at com.mongodb.spark.Logging.logInfo(Logging.scala:24)
at com.mongodb.spark.connection.MongoClientCache.logClient(MongoClientCache.scala:161)
at com.mongodb.spark.connection.MongoClientCache.acquire(MongoClientCache.scala:56)
at com.mongodb.spark.MongoConnector.acquireClient(MongoConnector.scala:239)
at com.mongodb.spark.rdd.MongoRDD.compute(MongoRDD.scala:141)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm not sure what is happening here. Can someone help me out here?
I'm currently using the code below.
conf = SparkConf().setAppName("pyspark test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri","mongodb://<USER>:<PASSWORD>#<HOST>:<PORT>/db.<COLLECTION>?ssl=true&authSource=<DATABASENAME>").load()
I'm calling the above script using the below spark-submit
spark-submit --master yarn --verbose --jars mongo-java-driver-3.4.2.jar,mongo-spark-connector_2.10-1.1.0.jar --py-files pymongo_spark.py test.py

Related

Is this an error that occurred after the spark operation?

I ran the following command:
$ spark-submit --master yarn --deploy-mode cluster pi.py
So, below log is continuous print:
...
2021-12-23 06:07:50,158 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
2021-12-23 06:07:51,162 INFO yarn.Client: Application report for application_1640239254568_0002 (state: ACCEPTED)
...
and I check the result through my 8088(Logs for container web UI), but there is nothing in stdout.
I was disappointed and tried to force the park operation to end, but suddenly the new log is print like below:
...
2021-12-23 06:09:06,694 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
2021-12-23 06:09:06,695 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: master
ApplicationMaster RPC port: 40451
queue: default
start time: 1640239668020
final status: UNDEFINED
tracking URL: http://master2:8088/proxy/application_1640239254568_0002/
user: root
2021-12-23 06:09:07,707 INFO yarn.Client: Application report for application_1640239254568_0002 (state: RUNNING)
...
And after some time, an error log occurred as shown below:
...
2021-12-23 06:10:25,003 INFO retry.RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "master/172.17.0.2"; destination host is: "master2":8032; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm2. Trying to failover immediately.
2021-12-23 06:10:25,003 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
2021-12-23 06:10:25,004 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From master/172.17.0.2 to master:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm1 after 1 failover attempts. Trying to failover after sleeping for 18340ms.
2021-12-23 06:10:43,347 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
...
I understand that sparks have completed the resource manager allocation after work, so it is normal for the above error log to appear.
Q1. Is the above job normal?
Q2. After this work, where can I check the results? Can I check them on "containerlogs web UI"?
IMPORTANT!! ADD. I re-ran the command. and check the status: SUCCEEDED. Why does the park-submit operation sometimes succeed and sometimes stop in the middle?

azkaban run selenium automatic python script flow failed when after execute about twenty miniutes,and the system becomes very slowly

I run python script in azkaban.
enviroment:
CentOS 8.1
azkaban 3.90.0
Python 3.6.8
ChromeDriver84.0.4147.30
In test.flow file
nodes:
- name: job_test
type: command
config:
command: python3 /home/azkaban/python_codes/pyib/activity/pickgoods.py
when after run execute this flow about twenty miniutes, the system becomes very slowly and the execution is failed.
28-07-2020 18:30:40 CST job_test INFO - Process with id 1403 completed unsuccessfully in 1727 seconds.
28-07-2020 18:30:40 CST job_test ERROR - Job run failed!
java.lang.RuntimeException: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:312)
at azkaban.execapp.JobRunner.runJob(JobRunner.java:830)
at azkaban.execapp.JobRunner.doRun(JobRunner.java:607)
at azkaban.execapp.JobRunner.run(JobRunner.java:568)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
at azkaban.jobExecutor.utils.process.AzkabanProcess.run(AzkabanProcess.java:125)
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:304)
... 8 more
28-07-2020 18:30:40 CST job_test ERROR - azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1 cause: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 1
28-07-2020 18:30:40 CST job_test INFO - Finishing job job_test at 1595932240480 with status FAILED
and azkaban-webserver.log under azkaban-web-server
2020/07/28 21:00:34.127 +0800 INFO [ExecutorManager] [AzkabanWebServer-QueueProcessor-Thread] [Azkaban] Successfully refreshed executor: iZbp1hb3esnbp3levrcg05Z:36037 (id: 16), active=true with executor info : ExecutorInfo{remainingMemoryPercent=45.705342424456234, remainingMemoryInMB=835, remainingFlowCapacity=30, numberOfAssignedFlows=0, lastDispatchedTime=1595936723440, cpuUsage=0.01}
2020/07/28 21:00:34.128 +0800 ERROR [ExecutorManager] [AzkabanWebServer-QueueProcessor-Thread] [Azkaban] Failed to update ExecutorInfo for executor : iZbp1hb3esnbp3levrcg05Z:44085 (id: 17), active=true
java.util.concurrent.ExecutionException: org.apache.http.conn.HttpHostConnectException: Connect to iZbp1hb3esnbp3levrcg05Z:44085 [iZbp1hb3esnbp3levrcg05Z/172.16.184.105] failed: Connection refused (Connection refused)
anyone can help to resolve it?
Your job process crashes. You can find its error log in the web UI for further debugging; see https://azkaban.readthedocs.io/en/latest/useAzkaban.html#job-logs

spark submit with python on windows- An existing connection was forcibly closed by the remote host

I am new to spark and I tried to submit my first job.
The code is in python and i am running on windows 10.
On one console I deployed the master as:
spark-class org.apache.spark.deploy.master.Master
On another I deployed one slave:
spark-class org.apache.spark.deploy.worker.Worker spark://192.168.1.4:7077
All good so far.
When I want to deploy the spark job in python I run on another console the following:
spark-submit --master spark://192.168.1.4:7077 "C:\path\to\file\friends-by-age.py"
The job is executed correctly because it prints all the results well, but then an error is thrown:
19/07/28 21:51:52 INFO SparkContext: Invoking stop() from shutdown hook
19/07/28 21:51:52 INFO SparkUI: Stopped Spark web UI at http://192.168.1.4:4040
19/07/28 21:51:52 INFO StandaloneSchedulerBackend: Shutting down all executors
19/07/28 21:51:52 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
19/07/28 21:51:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/28 21:51:52 WARN TransportChannelHandler: Exception in connection from /192.168.1.4:59096
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Unknown Source)
19/07/28 21:51:52 INFO MemoryStore: MemoryStore cleared
19/07/28 21:51:52 INFO BlockManager: BlockManager stopped
19/07/28 21:51:52 INFO BlockManagerMaster: BlockManagerMaster stopped
19/07/28 21:51:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/07/28 21:51:52 INFO SparkContext: Successfully stopped SparkContext
This is the friends-by-age.py code:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("FriendsByAge")
sc = SparkContext(conf = conf)
def parseLine(line):
fields = line.split(',')
age = int(fields[2])
numFriends = int(fields[3])
return (age, numFriends)
lines = sc.textFile("file:///C:/path/to/file/fakefriends.csv")
rdd = lines.map(parseLine)
totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])
averagesByAge = averagesByAge.sortByKey()
results = averagesByAge.collect()
for result in results:
print(result)
Also in the spark ui
I did not deploy any other worker.
Any ideas?

AWS-EMR error exit code 143

I'm running an analysis on AWS EMR, and I am getting an unexpected SIGTERM error.
Some background:
I'm running a script that reads in many csv files I have stored on S3, and then performs an analysis. My script is schematically:
analysis_script.py
import pandas as pd
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.types import *
from pyspark import SparkContext
import boto3
#Spark context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.csv("s3n://csv_files/*", header = True)
def analysis(df):
#do bunch of stuff. Create output dataframe
return df_output
df_output = analysis(df)
I launch the cluster using:
aws emr create-cluster
--release-label emr-5.5.0
--name "Analysis"
--applications Name=Hadoop Name=Hive Name=Spark Name=Ganglia
--ec2-attributes KeyName=EMRB,InstanceProfile=EMR_EC2_DefaultRole
--service-role EMR_DefaultRole
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=r3.xlarge
--region us-west-2
--log-uri s3://emr-logs/
--bootstrap-actions Name="Install Python Packages",Path="s3://emr-bootstraps/install_python_packages_custom.bash",Args=["numpy pandas boto3 tqdm"]
--auto-terminate
I can see from logs that the reading in of the csv files goes fine. But then it finishes with errors. The following lines are in the stderr file:
18/07/16 12:02:26 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
18/07/16 12:02:26 ERROR ApplicationMaster: User application exited with status 143
18/07/16 12:02:26 INFO ApplicationMaster: Final app status: FAILED, exitCode: 143, (reason: User application exited with status 143)
18/07/16 12:02:26 INFO SparkContext: Invoking stop() from shutdown hook
18/07/16 12:02:26 INFO SparkUI: Stopped Spark web UI at http://172.31.36.42:36169
18/07/16 12:02:26 INFO TaskSetManager: Starting task 908.0 in stage 1494.0 (TID 88112, ip-172-31-35-59.us-west-2.compute.internal, executor 27, partition 908, RACK_LOCAL, 7278 bytes)
18/07/16 12:02:26 INFO TaskSetManager: Finished task 874.0 in stage 1494.0 (TID 88078) in 16482 ms on ip-172-31-35-59.us-west-2.compute.internal (executor 27) (879/4805)
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-36-42.us-west-2.compute.internal:34133 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(20, ip-172-31-36-42.us-west-2.compute.internal, 34133, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-47-55.us-west-2.compute.internal:45758 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(16, ip-172-31-47-55.us-west-2.compute.internal, 45758, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO DAGScheduler: Job 1494 failed: toPandas at analysis_script.py:267, took 479.895614 s
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1494 (toPandas at analysis_script.py:267) failed in 478.993 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionEnd(0,1531742546839)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#28e5b10c)
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1495 (toPandas at analysis_script.py:267) failed in 479.270 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#6b68c419)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(1494,1531742546841,JobFailed(org.apache.spark.SparkException: Job 1494 cancelled because SparkContext was shut down))
18/07/16 12:02:26 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/07/16 12:02:26 INFO YarnClusterSchedulerBackend: Shutting down all executors
18/07/16 12:02:26 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/16 12:02:26 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices(serviceOption=None, services=List(),started=false)
18/07/16 12:02:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
I can't find much useful information about exit code 143. Does anybody know why this error is occurring? Thanks.
Spark passes through exit codes when they're over 128, which is often the case with JVM errors. In the case of exit code 143, it signifies that the JVM received a SIGTERM - essentially a unix kill signal (see this post for more exit codes and an explanation). Other details about Spark exit codes can be found in this question.
Since you didn't terminate this yourself, I'd start by suspecting something else externally did. Given that precisely 8 minutes elapse between job start and a SIGTERM being issued, it seems much more likely that EMR itself may be enforcing a maximum job run time/cluster age. Try checking through your EMR settings to see if there is any such timeout set - there was one in my case (on AWS Glue, but the same concept).

OutOfMemoryError when using PySpark to read files in local mode

I have about a dozen gpg-encrypted files containing data I'd like to analyze using PySpark. My strategy is to apply a decryption function as a flat map to each file and then proceed processing at the record level:
def read_fun_generator(filename):
with gpg_open(filename[0].split(':')[-1], 'r') as f:
for line in f:
yield line.strip()
gpg_files = sc.wholeTextFiles(/path/to/files/*.gpg)
rdd_from_gpg = gpg_files.flatMap(read_fun_generator).map(lambda x: x.split('|'))
rdd_from_gpg.count() # <-- For example...
This approach works quite well when using a single thread in local mode, i.e. setting the master to local[1]. However, using any more than a single thread causes an OutOfMemoryError to be thrown. I've tried increasing spark.executor.memory and spark.driver.memory to 30g, but this seems not to help. I can confirm in the UI that those settings have stuck. (My machine has over 200GB available.) However, I've noticed in the logs that the block manager seems to be starting with only 265.4 MB of memory. I wonder if this is related?
Here is the full configuration I'm starting with:
conf = (SparkConf()
.setMaster("local[*]")
.setAppName("pyspark_local")
.set("spark.executor.memory", "30g")
.set("spark.driver.memory", "30g")
.set("spark.python.worker.memory", "5g")
)
sc = SparkContext(conf=conf)
This is the stack trace from my logs:
15/06/10 11:03:30 INFO SparkContext: Running Spark version 1.3.1
15/06/10 11:03:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/10 11:03:31 INFO SecurityManager: Changing view acls to: santon
15/06/10 11:03:31 INFO SecurityManager: Changing modify acls to: santon
15/06/10 11:03:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(santon); users with modify permissions: Set(santon)
15/06/10 11:03:31 INFO Slf4jLogger: Slf4jLogger started
15/06/10 11:03:31 INFO Remoting: Starting remoting
15/06/10 11:03:32 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#localhost:44347]
15/06/10 11:03:32 INFO Utils: Successfully started service 'sparkDriver' on port 44347.
15/06/10 11:03:32 INFO SparkEnv: Registering MapOutputTracker
15/06/10 11:03:32 INFO SparkEnv: Registering BlockManagerMaster
15/06/10 11:03:32 INFO DiskBlockManager: Created local directory at /tmp/spark-24dc8f0a-a89a-44f8-bb95-cd5514e5bf0c/blockmgr-85b6f082-ff5a-4a0e-b48a-1ec62715dda0
15/06/10 11:03:32 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/06/10 11:03:32 INFO HttpFileServer: HTTP File server directory is /tmp/spark-7b2172ed-d658-4e11-bbc1-600697f3255e/httpd-5423f8bc-ec43-48c5-9367-87214dad54f4
15/06/10 11:03:32 INFO HttpServer: Starting HTTP Server
15/06/10 11:03:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/10 11:03:32 INFO AbstractConnector: Started SocketConnector#0.0.0.0:50366
15/06/10 11:03:32 INFO Utils: Successfully started service 'HTTP file server' on port 50366.
15/06/10 11:03:32 INFO SparkEnv: Registering OutputCommitCoordinator
15/06/10 11:03:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/10 11:03:32 INFO AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
15/06/10 11:03:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/06/10 11:03:32 INFO SparkUI: Started SparkUI at localhost:4040
15/06/10 11:03:32 INFO Executor: Starting executor ID <driver> on host localhost
15/06/10 11:03:32 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#localhost:44347/user/HeartbeatReceiver
15/06/10 11:03:33 INFO NettyBlockTransferService: Server created on 46730
15/06/10 11:03:33 INFO BlockManagerMaster: Trying to register BlockManager
15/06/10 11:03:33 INFO BlockManagerMasterActor: Registering block manager localhost:46730 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 46730)
15/06/10 11:03:33 INFO BlockManagerMaster: Registered BlockManager
15/06/10 11:05:19 INFO MemoryStore: ensureFreeSpace(215726) called with curMem=0, maxMem=278302556
15/06/10 11:05:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 210.7 KB, free 265.2 MB)
15/06/10 11:05:19 INFO MemoryStore: ensureFreeSpace(31533) called with curMem=215726, maxMem=278302556
15/06/10 11:05:19 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 30.8 KB, free 265.2 MB)
15/06/10 11:05:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46730 (size: 30.8 KB, free: 265.4 MB)
15/06/10 11:05:19 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/06/10 11:05:19 INFO SparkContext: Created broadcast 0 from wholeTextFiles at NativeMethodAccessorImpl.java:-2
15/06/10 11:05:22 INFO FileInputFormat: Total input paths to process : 16
15/06/10 11:05:22 INFO FileInputFormat: Total input paths to process : 16
15/06/10 11:05:22 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 71665121
15/06/10 11:05:22 INFO SparkContext: Starting job: count at <timed exec>:2
15/06/10 11:05:22 INFO DAGScheduler: Got job 0 (count at <timed exec>:2) with 2 output partitions (allowLocal=false)
15/06/10 11:05:22 INFO DAGScheduler: Final stage: Stage 0(count at <timed exec>:2)
15/06/10 11:05:22 INFO DAGScheduler: Parents of final stage: List()
15/06/10 11:05:22 INFO DAGScheduler: Missing parents: List()
15/06/10 11:05:22 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at count at <timed exec>:2), which has no missing parents
15/06/10 11:05:23 INFO MemoryStore: ensureFreeSpace(6264) called with curMem=247259, maxMem=278302556
15/06/10 11:05:23 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.1 KB, free 265.2 MB)
15/06/10 11:05:23 INFO MemoryStore: ensureFreeSpace(4589) called with curMem=253523, maxMem=278302556
15/06/10 11:05:23 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.5 KB, free 265.2 MB)
15/06/10 11:05:23 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:46730 (size: 4.5 KB, free: 265.4 MB)
15/06/10 11:05:23 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/06/10 11:05:23 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/06/10 11:05:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at count at <timed exec>:2)
15/06/10 11:05:23 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/06/10 11:05:23 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1903 bytes)
15/06/10 11:05:23 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 3085 bytes)
15/06/10 11:05:23 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/06/10 11:05:23 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/06/10 11:05:26 INFO WholeTextFileRDD: Input split: Paths:[gpg_files]
15/06/10 11:05:40 ERROR Utils: Uncaught exception in thread stdout writer for /anaconda/python/bin/python
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
at java.nio.CharBuffer.toString(CharBuffer.java:1201)
at org.apache.hadoop.io.Text.decode(Text.java:405)
at org.apache.hadoop.io.Text.decode(Text.java:382)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:86)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:421)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
Exception in thread "stdout writer for /anaconda/python/bin/python" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
at java.nio.CharBuffer.toString(CharBuffer.java:1201)
at org.apache.hadoop.io.Text.decode(Text.java:405)
at org.apache.hadoop.io.Text.decode(Text.java:382)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:86)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:421)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
15/06/10 11:05:47 INFO PythonRDD: Times: total = 24140, boot = 2860, init = 664, finish = 20616
15/06/10 11:05:47 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1873 bytes result sent to driver
15/06/10 11:05:47 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 24251 ms on localhost (1/2)
Has anyone run into this problem? Is there a setting I'm not aware of that I should modify? It seems like this should be possible...
The thing with sc.wholeTextFiles(/path/to/files/*.gpg) - to returns PairRDD, key - the file name and the value - is the file contents.
Looks like you are not using file contents part, but still told Spark to read files from disk and ship them to workers.
If your goal is to process the list of file names only, and the contents of them to be read with gpg_open you can do this:
def read_fun_generator(filename):
with gpg_open(filename.split(':')[-1], 'r') as f:
for line in f:
yield line.strip()
gpg_filelist = glob.glob("/path/to/files/*.gpg")
# generate RDD with file name per record
gpg_files = sc.parallelize(gpg_filelist)
rdd_from_gpg = gpg_files.flatMap(read_fun_generator).map(lambda x: x.split('|'))
rdd_from_gpg.count() # <-- For example...
This would reduce the amount of memory used by Spark's JVM.

Categories