I have 4 Spark applications (each computes a word count from a text file), written in 4 different languages (R, Python, Java, Scala):
./wordcount.R
./wordcount.py
./wordcount.java
./wordcount.scala
Spark runs in standalone mode with:
1. 4 worker nodes
2. 1 core for each worker node
3. 1 GB of memory for each node
4. core_max set to 1
./conf/spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=4
I submitted the Spark applications using a pgm.sh file from the terminal:
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.py &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./project_2.jar &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./project_2.jar
When each process is executed individually, it takes 2 sec.
When all processes are executed using the .sh file from the terminal, it takes 5 to 6 sec.
How do I run different Spark applications in parallel?
How do I assign each Spark application to an individual core?
Edit: In Standalone mode, per the documentation, you simply need to set spark.cores.max to something less than the size of your standalone cluster, and it is also advisable to set spark.deploy.defaultCores for the applications that don't explicitly set this setting.
Original Answer (assuming this was running in something like local or YARN): When you submit multiple spark applications, they should run in parallel automatically, provided that the cluster or server you are running on is configured to allow for that. A YARN cluster, for example, would run the applications in parallel by default. Note that the more applications you are running in parallel, the greater the risk of resource contention.
On "how to assign each spark application to individual core": you don't. Spark handles the scheduling of workers to cores. You can configure how many resources each of your worker executors uses, but their allocation is up to the scheduler (be that Spark, YARN, or some other scheduler).
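As a rough sketch of the standalone advice above, the launcher below starts several applications at once and caps each one at a single core through spark.cores.max, so that several of them can fit on a four-core cluster at the same time. The master URL and application file names are taken from the question; the helper names build_submit_command and launch_all are made up for illustration, and the script assumes it is run from the Spark home directory.

```python
import subprocess

MASTER = "spark://-Aspire-E5-001:7077"  # master URL from the question
APPS = ["./wordcount.R", "./wordcount.py", "./project_2.jar"]  # files from the question


def build_submit_command(app, master=MASTER, cores_max=1):
    """Return the spark-submit argument list for one application (hypothetical helper)."""
    return [
        "./bin/spark-submit",
        "--master", master,
        # cap the cores this application may take from the standalone cluster
        "--conf", "spark.cores.max={}".format(cores_max),
        app,
    ]


def launch_all(apps=APPS):
    """Start every application without waiting, then collect the exit codes."""
    procs = [subprocess.Popen(build_submit_command(app)) for app in apps]
    return [proc.wait() for proc in procs]
```

Calling launch_all() plays the role of pgm.sh: every application is started before any of them is waited on, and the scheduler decides which worker core each one gets.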
Related
We have a standalone Spark cluster running on several machines, and currently we run Spark jobs in client mode, meaning that the driver program is executed on the host machine, not on the cluster.
We want to be able to run the driver program on the cluster and submit our jobs to the Spark cluster. Using the spark-submit bash script provided by Spark itself, we can give it a Python module and the Spark cluster runs it. But what I want now is to get the result of that parallel computation back in Python, because I want to write the results to a PostgreSQL database for further use.
How can I get the results of a Spark job deployed in cluster mode back in Python? How can I submit a Python module using pyspark from Python, without using the spark-submit bash script?
I am using pyspark under Ubuntu with Python 2.7.
I installed it using
pip install pyspark --user
and am trying to follow the instructions to set up a Spark cluster.
I can't find the script start-master.sh.
I assume this is because I installed pyspark and not regular Spark.
I found here that I can connect a worker node to the master via pyspark, but how do I start the master node with pyspark?
https://pypi.python.org/pypi/pyspark
The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Well, I did a bit of a mix-up in the OP.
You need to get Spark on the machine that should run as master.
You can download it here
After extracting it, you have the spark/sbin folder, which contains the start-master.sh script. You need to start it with the -h argument.
Please note that you need to create a spark-env file as explained here and define the Spark local and master variables; this is important on the master machine.
After that, on the worker nodes, use the start-slave.sh script to start the workers.
And you are good to go: you can use a Spark context inside Python to use it!
If you are already using pyspark through conda / pip installation, there's no need to install Spark and setup environment variables again for cluster setup.
A conda / pip pyspark installation is missing only the 'conf', 'sbin', 'kubernetes', and 'yarn' folders. You can simply download Spark and move those folders into the folder where pyspark is located (usually the site-packages folder inside your Python installation).
After you installed pyspark via pip install pyspark, you can start the Spark standalone cluster master process using this command:
spark-class org.apache.spark.deploy.master.Master -h 127.0.0.1
And then you can add some workers (executors), which would process the jobs:
spark-class org.apache.spark.deploy.worker.Worker \
spark://127.0.0.1:7077 \
-c 4 -m 8G
Flags -c and -m specify the number of CPU cores and amount of memory provided by the worker.
The 127.0.0.1 local address is used here for security reasons (it isn't good if anyone just copy/pasting these lines would expose an "arbitrary code execution service" on their network). For a distributed standalone Spark cluster, a different address should be used (e.g. a private IP address in an isolated network available only to the cluster nodes and their intended users), and the official Spark security guide should be read.
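To make the distinction between these kinds of address concrete, here is a small sketch that classifies an address before it is passed to -h. The function name classify_bind_address is hypothetical, and the check relies only on Python's standard ipaddress module; it is a sanity check under those assumptions, not a substitute for the Spark security guide.

```python
import ipaddress


def classify_bind_address(host):
    """Classify an IP literal as 'loopback', 'private' or 'public' (hypothetical helper)."""
    addr = ipaddress.ip_address(host)
    if addr.is_loopback:
        return "loopback"  # reachable only from this machine
    if addr.is_private:
        return "private"   # e.g. 10.0.0.0/8 or 192.168.0.0/16
    return "public"        # may be reachable from outside networks


for host in ("127.0.0.1", "10.0.0.5", "8.8.8.8"):
    print(host, classify_bind_address(host))
```

Loopback is checked first because 127.0.0.1 also counts as a private address in the ipaddress module.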
The spark-class script is included in the "pyspark" Python package. It is a wrapper that loads the environment variables from spark-env.sh and adds the corresponding Spark jar locations to the -cp flag of the java command.
If you need to configure the environment, consult the official Spark docs, but the default parameters also work and may be suitable for regular usage. Also, see the flags for the master/worker commands using their --help.
This is an example of how to connect to this standalone cluster using the pyspark script with an IPython shell:
PYSPARK_DRIVER_PYTHON=ipython \
pyspark --master spark://127.0.0.1:7077 \
--num-executors 2 \
--executor-cores 2 \
--executor-memory 4G
The code for instantiating a Spark session manually, e.g. in Jupyter:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .master("spark://127.0.0.1:7077")
    # the number of executors this job needs
    .config("spark.executor.instances", 2)
    # the number of CPU cores and the amount of memory each executor needs;
    # they are reserved on the worker
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4G")
    .getOrCreate()
)
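As a quick check of the numbers used above: the worker was started with -c 4 -m 8G, and the session asks for 2 executors with 2 cores and 4G each, which adds up to exactly what that one worker offers. The sketch below does this arithmetic. The function name fits_on_worker is hypothetical, and it assumes nothing is reserved beyond spark.executor.memory.

```python
def fits_on_worker(worker_cores, worker_mem_gb,
                   executors, cores_per_executor, mem_per_executor_gb):
    """Return True if the requested executors fit into one worker's resources."""
    need_cores = executors * cores_per_executor
    need_mem_gb = executors * mem_per_executor_gb
    return need_cores <= worker_cores and need_mem_gb <= worker_mem_gb


# worker: -c 4 -m 8G; request: 2 executors x (2 cores, 4G)
print(fits_on_worker(4, 8, 2, 2, 4))  # 4 cores and 8G needed -> True
# a third executor of the same size would no longer fit
print(fits_on_worker(4, 8, 3, 2, 4))  # 6 cores and 12G needed -> False
```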
For a few weeks I have been trying to submit Python scripts via remote access, or by connecting to the pyspark shell of the YARN cluster.
I am new to the Hadoop world. What I want is to submit Spark scripts from my local shell to the external Hadoop cluster.
My situation: External Hadoop YARN cluster. I have access to the important ports. I have Windows 7 64 Bit / Python 2.7.9 64 Bit / Spark 1.4.1. The Hadoop cluster is running without any problems.
My problem: Submitting Python scripts via remote access to the Hadoop cluster doesn't work.
If I try
spark-submit --master yarn-cluster --num-executors 2 --driver-memory 512m --executor-memory 512m --executor-cores 4 ... example.py
It says
Error: Cluster deploy mode is not applicable to Spark shells.
Exception: Java gateway process exited before sending the driver its port number
So, as far as I understand the problem, I think the question is:
How do I set up the YARN config correctly to connect from my local client (not part of the cluster) to the external YARN cluster?
Spark version 1.6.0 (the current version at the time of writing).
Python code cannot be executed in YARN-cluster mode. Python can only be executed in cluster mode on a native Spark cluster.
You can switch to a Spark cluster or re-implement your code in Java or Scala.
I am using the python yelp/mrjob framework for my mapreduce jobs. There are only about 4G of data and I don't want to go through the trouble of setting up Hadoop or EMR. I have a 64 core machine and it takes about 2 hours to process the data with mrjob. I notice mrjob assigns 54 mappers for my job, but it seems to run only one at a time. Is there a way to make mrjob run all tasks in parallel with all my cpu cores?
I manually changed the number of tasks, but it didn't help much.
--jobconf mapred.map.tasks=10 --jobconf mapred.reduce.tasks=10
EDIT:
I have -r local when I execute the job, however, looking at the code, it seems it defaults to run one process at a time. Please tell me I am wrong.
The local job runner for mrjob just spawns one subprocess for each MR stage, one for the mapper, one for the combiner (optionally), and one for the reducer, and passes data between them via a pipe. It is not designed to have any parallelism at all, so it will never take advantage of your 64 cores.
My suggestion would be to run Hadoop on your local machine and submit the job with the -r hadoop option. A Hadoop cluster running on your local machine in pseudo-distributed mode should be able to take advantage of your multiple cores.
See this question which addresses that topic: Full utilization of all cores in Hadoop pseudo-distributed mode
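To illustrate why a single stream per stage cannot use more than one core for that stage, here is a plain-Python sketch of a word count run as one mapper stream, one sort, and one reducer stream. It is not mrjob's actual code; the function names mapper, reducer and run_local are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (word, 1) for every word, one line at a time."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Sum the counts of each run of identical words in the sorted stream."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Run the stages one after another; each consumes the previous stage's output."""
    return dict(reducer(sorted(mapper(lines))))


print(run_local(["a b a", "B c"]))
```

However many map tasks the job is split into on paper, each stage here is a single stream, which is why a runner that actually schedules tasks across cores (such as Hadoop in pseudo-distributed mode) is suggested.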
The runner for a job can be specified via the command line with the -r option.
When you run a mrjob script from the command line, the default run mode is inline which runs your job on your local machine in a single process. The other obvious options for running jobs are emr and hadoop.
You can make the job run in parallel on your local machine by setting the runner to local:
$ python myjob.py -r local
Those --jobconf options are only recognised by Hadoop (i.e. on EMR or a Hadoop cluster).
Even when I give a parameter to the groupByKey function, for example groupByKey(4), and check with the top command, Spark is still using one core. I run my script like this:
spark-submit --master local[4] program.py
So why does Spark use only one core when I tell it to use 4?
You're running this on Linux, if the tags on your question are to be trusted. Under Linux, top does not, by default, show every thread (it shows every process). local[4] tells Spark to work locally on 4 threads (not processes).
Run top -H to pick up the threads.
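The thread-versus-process distinction can be seen with a small Python analogy: a sketch, not Spark itself, in which one process starts four worker threads that all report the same process ID. The function name count_threads_in_one_process is hypothetical.

```python
import os
import threading


def count_threads_in_one_process(n_workers=4):
    """Start n_workers threads; return (extra live threads, set of PIDs they saw)."""
    stop = threading.Event()
    lock = threading.Lock()
    pids = []

    def worker():
        with lock:
            pids.append(os.getpid())  # every thread runs inside the same process
        stop.wait()                   # stay alive until told to finish

    before = threading.active_count()
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    during = threading.active_count()
    stop.set()
    for thread in threads:
        thread.join()
    return during - before, set(pids)


extra_threads, pids = count_threads_in_one_process()
print(extra_threads, len(pids))
```

Four threads are running, but they share one process ID, so a per-process view such as plain top lists them as a single entry while a per-thread view lists them separately.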