We have a standalone Spark cluster running on several machines. Currently we run Spark jobs in client mode, meaning that the driver program executes on the host machine, not on the cluster.
We want to be able to run the driver program on the cluster and submit our jobs to the Spark cluster. Using the spark-submit bash script provided by Spark itself, we can give it a python module and Spark cluster runs it on the cluster. But what I want now is to get the result of that parallel computation back in python because I want to write the results to a PostgreSQL database for further use.
How can I get the results of a Spark job deployed in cluster mode back in Python? And how can I submit a Python module from Python using pyspark, without using the spark-submit bash script?
I have a remote Ubuntu server on linode.com with 4 cores and 8G RAM
I have a Spark-2 cluster consisting of 1 master and 1 slave on my remote Ubuntu server.
I have started PySpark shell locally on my MacBook, connected to my master node on remote server by:
$ PYSPARK_PYTHON=python3 /vagrant/spark-2.0.0-bin-hadoop2.7/bin/pyspark --master spark://[server-ip]:7077
I tried executing simple Spark example from website:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.read.json("/path/to/spark-2.0.0-bin-hadoop2.7/examples/src/main/resources/people.json")
I got this error:
Initial job has not accepted any resources; check your cluster UI to
ensure that workers are registered and have sufficient resources
I have enough memory on my server and also on my local machine, but I am getting this weird error again and again. I have 6G for my Spark cluster, my script is using only 4 cores with 1G memory per node.
I have Googled this error, tried different memory configurations, and disabled the firewall on both machines, but none of it helped. I have no idea how to fix it.
Has anyone faced the same problem? Any ideas?
You are submitting the application in client mode. This means that the driver process is started on your local machine.
When executing Spark applications, all machines have to be able to communicate with each other. Most likely your driver process is not reachable from the executors (for example, it is using a private IP or is hidden behind a firewall). If that is the case, you can confirm it by checking the executor logs: go to the application, select one of the workers with the status EXITED, and check stderr. You "should" see that the executor is failing due to org.apache.spark.rpc.RpcTimeoutException.
There are two possible solutions:
Submit the application from a machine which can be reached from your cluster.
Submit the application in cluster mode. This will use cluster resources to start the driver process, so you have to account for that.
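One way to test the reachability diagnosis before changing anything is to try opening a TCP connection to the driver from one of the worker machines. The following is a minimal sketch using only the Python standard library. It assumes the driver was started with a fixed `spark.driver.port` (by default Spark picks a random port), and the host and port in the usage comment are placeholders, not values from the question.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Usage (run on a worker machine; placeholder address and port):
# print(can_connect("<driver-ip>", 40000))
```

If this returns False from the worker while the driver is running, the executors probably cannot call back to the driver either, which would be consistent with the RpcTimeoutException described above.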
I have 4 Spark applications (each computes a word count from a text file), written in 4 different languages (R, Python, Java, Scala):
./wordcount.R
./wordcount.py
./wordcount.java
./wordcount.scala
Spark runs in standalone mode with:
1. 4 worker nodes
2. 1 core for each worker node
3. 1 GB memory for each node
4. core_max set to 1
./conf/spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=4
I submitted the Spark applications from the terminal using a pgm.sh file:
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.py &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./project_2.jar &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./project_2.jar
When each process is executed individually, it takes 2 seconds.
When all processes are executed together using the .sh file, it takes 5 to 6 seconds.
How do I run different Spark applications in parallel?
How do I assign each Spark application to an individual core?
Edit: In Standalone mode, per the documentation, you simply need to set spark.cores.max to something less than the size of your standalone cluster, and it is also advisable to set spark.deploy.defaultCores for the applications that don't explicitly set this setting.
Original Answer (assuming this was running in something like local or YARN): When you submit multiple spark applications, they should run in parallel automatically, provided that the cluster or server you are running on is configured to allow for that. A YARN cluster, for example, would run the applications in parallel by default. Note that the more applications you are running in parallel, the greater the risk of resource contention.
On "how to assign each spark application to individual core": you don't, Spark handles the scheduling of workers to cores. You can configure how many resources each of your worker executors uses, but the allocation of them is up to the scheduler (be that Spark, Yarn, or some other scheduler).
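As a rough illustration of the `spark.cores.max` advice for standalone mode, the sketch below builds one `spark-submit` command per application, caps each at one core through `--conf spark.cores.max=1`, and starts them all at once. The helper names `build_submit_command` and `run_concurrently` are made up for this example, and the master URL in the usage comment is a placeholder.

```python
import subprocess


def build_submit_command(master, app_path, max_cores=1,
                         spark_submit="./bin/spark-submit"):
    """Build a spark-submit argv that caps the application at max_cores cores."""
    return [
        spark_submit,
        "--master", master,
        "--conf", "spark.cores.max={}".format(max_cores),
        app_path,
    ]


def run_concurrently(commands):
    """Start every command at once, wait for all of them, return exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in processes]


# Usage (not executed here; the master URL is a placeholder):
# apps = ["./wordcount.R", "./wordcount.py", "./project_2.jar"]
# commands = [build_submit_command("spark://<master-host>:7077", app)
#             for app in apps]
# exit_codes = run_concurrently(commands)
```

With four single-core workers and each application limited to one core, the scheduler has room to run the applications side by side. Which worker each one lands on is still decided by Spark, as noted above.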
I have to execute long-running (~10 hours) Hive queries from my local server using a Python script. My target Hive server is in an AWS cluster.
I've tried to execute it using
pyhs2, execute('<command>')
and
paramiko, exec_command('hive -e "<command>"')
In both cases my query runs on the Hive server and completes successfully. The issue is that even after the query completes successfully, my parent Python script continues to wait for the return value and remains in the interruptible sleep (Sl) state indefinitely!
Is there any way I can make my script work using pyhs2 or paramiko? Or is there a better option available in Python?
As I mentioned before, I faced a similar issue in my performance-based environment.
My use case was running queries with the pyhs2 module on the Hive Tez execution engine. Tez generates a lot of logs (on the scale of seconds). The logs are captured in the stdout variable and are only provided as output once the query completes successfully.
The way to overcome this is to stream the output as and when it is generated, as shown below:
for line in iter(lambda: stdout.readline(2048), ""):
    print(line)
But for this you will have to use a native connection to the cluster using paramiko or Fabric, and then issue the Hive command via the CLI or beeline.
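A fuller sketch of that streaming approach with paramiko might look like the following. It assumes paramiko is installed, that key-based SSH login to the cluster works, and that `hive` is on the remote PATH. The function names `stream_lines` and `run_hive_over_ssh` are invented for this example. Requesting a pty (`get_pty=True`) is one way to merge stderr into stdout so that log output is drained as well, though whether that is what resolves a given hang is an assumption.

```python
import shlex


def stream_lines(stream, max_line_chars=2048):
    """Yield lines from a file-like object as they arrive, until EOF."""
    for line in iter(lambda: stream.readline(max_line_chars), ""):
        yield line


def run_hive_over_ssh(host, username, query):
    """Run `hive -e` on a remote host, printing output as it is produced.

    host and username are placeholders supplied by the caller.
    """
    import paramiko  # third-party; imported here so stream_lines stays stdlib-only

    client = paramiko.SSHClient()
    # Accepting unknown host keys keeps the sketch short; review for real use.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username)
    try:
        command = "hive -e " + shlex.quote(query)
        _, stdout, _ = client.exec_command(command, get_pty=True)
        for line in stream_lines(stdout):
            print(line, end="")
        # Blocks until the remote command has finished, then returns its status.
        return stdout.channel.recv_exit_status()
    finally:
        client.close()
```

Reading line by line keeps the channel's buffer from filling up while the query runs, and the exit status gives the script a definite signal that the remote command has finished.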
For a few weeks I have been trying to submit Python scripts via remote access, or by connecting to the pyspark shell of the YARN cluster.
I am new to the Hadoop world. What I want is to submit Spark scripts from my local shell to the external Hadoop cluster.
My situation: an external Hadoop YARN cluster. I have access to the important ports. I have Windows 7 64-bit / Python 2.7.9 64-bit / Spark 1.4.1. The Hadoop cluster is running without any problems.
My problem: submitting Python scripts via remote access to the Hadoop cluster doesn't work.
If i try
spark-submit --master yarn-cluster --num-executors 2 --driver-memory 512m --executor-memory 512m --executor-cores 4 ... example.py
It says
Error: Cluster deploy mode is not applicable to Spark shells.
Exception: Java gateway process exited before sending the driver its port number
So as far as I understand the problem I think the question is
How do I set the yarn-config correctly to connect with my local client (not part of the cluster) to the external YARN cluster.
Spark version 1.6.0 (the current version at the time of writing).
Python code cannot be executed in YARN-cluster mode. Python can only be executed in cluster mode on a native Spark cluster.
You can either switch to a Spark cluster or re-implement your code in Java or Scala.
I am using the python yelp/mrjob framework for my mapreduce jobs. There are only about 4G of data and I don't want to go through the trouble of setting up Hadoop or EMR. I have a 64 core machine and it takes about 2 hours to process the data with mrjob. I notice mrjob assigns 54 mappers for my job, but it seems to run only one at a time. Is there a way to make mrjob run all tasks in parallel with all my cpu cores?
I manually changed number of tasks but didn't help much.
--jobconf mapred.map.tasks=10 --jobconf mapred.reduce.tasks=10
EDIT:
I have -r local when I execute the job, however, looking at the code, it seems it defaults to run one process at a time. Please tell me I am wrong.
The local job runner for mrjob just spawns one subprocess for each MR stage, one for the mapper, one for the combiner (optionally), and one for the reducer, and passes data between them via a pipe. It is not designed to have any parallelism at all, so it will never take advantage of your 64 cores.
My suggestion would be to run Hadoop on your local machine and submit the job with the -r hadoop option. A Hadoop cluster running on your local machine in pseudo-distributed mode should be able to take advantage of your multiple cores.
See this question which addresses that topic: Full utilization of all cores in Hadoop pseudo-distributed mode
The runner for a job can be specified via the command line with the -r option.
When you run a mrjob script from the command line, the default run mode is inline which runs your job on your local machine in a single process. The other obvious options for running jobs are emr and hadoop.
You can make the job run in parallel on your local machine by setting the runner to local:
$ python myjob.py -r local
Those --jobconf options are only recognised by Hadoop (i.e. on EMR or a Hadoop cluster).