Problem submitting Pyspark jobs from Windows Driver to Ubuntu Spark Cluster - python

I am having trouble submitting a Pyspark job from my Windows driver machine (Win 10) to a simple Spark cluster running on Ubuntu.
There are several posts that already attempt to answer this question, most notably this one from ThatDataGuy, but none of them have helped.
Every time I try to submit the simple wordcount.py example to my remote master from my Windows box, I get the following error:
Cannot run program 'C:\apps\Python\3.6.6\python.exe': error=2, No such file or directory
This is a Java IOException generated by the Py4J jar.
My Spark cluster is a simple one-Master, one-Worker setup in VirtualBox, provisioned via Vagrant. All machines (my Spark driver laptop and the two VMs, Master and Worker) have identical Spark 2.4.2, Python 3.6.6, and Scala 2.12.8. Note that Scala programs using spark-submit against the remote cluster work fine, as does anything run in local mode. The code examples also work fine when run directly on either the Master or Worker node. It's only when I try to use my Windows laptop as a Spark driver in Pyspark, against the Ubuntu Spark cluster, that this issue arises. It always returns the error above.
It seems that Py4j is trying to use or instantiate Python from my Windows driver's Python path, which of course my Linux cluster can't see. I have already set the Pyspark Python path to a different value on the cluster nodes. I have set both PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in the nodes' environment variables (via .bashrc), in spark-defaults.conf, AND in spark-env.sh. All values point to /usr/local/bin/python3, as that's where Python 3.6.6 is installed on the Master and Worker nodes.
I've also (just as a hunch) aliased "python" to point to /usr/local/bin/python3 on the nodes and then changed my Windows python shortcut to pull up the same Python version. No luck, but I was grasping at straws. ;/ The error simply changed to:
Cannot run program 'python': error=2, No such file or directory
I did see an article saying that the Py4J 0.10.7 library does not support Python 3.7, so that caused me to drop down to Python 3.6. The error stayed the same after that, though.
The only thing I haven't done is to try to set up an additional shared / synced folder in Vagrant back to my Windows Python installation and then use /vagrant/shared/python/whatever in my remote PYSPARK settings. No idea if that would work, though, given I'm dealing with Windows and Linux Python flavors (all 3.6.6). Ugh. :/
Any ideas? I've got a Windows 10 machine and I like to do my Python development there. I've also got 64GB of RAM so I'd like to use it. Please don't make me switch to Scala! ;)
-- Pyspark works fine in local mode
spark-submit C:\apps\Spark\spark-2.4.2\examples\src\main\python\wordcount.py C:\Users\sitex\Desktop\p_and_p_ch1.txt
-- Pyspark fails with an IOException when calling the master
spark-submit --master spark://XXX.XX.XXX.XXX:7077 C:\apps\Spark\spark-2.4.2\examples\src\main\python\wordcount.py C:\Users\sitex\Desktop\p_and_p_ch1.txt
UPDATE: OK, so it looks like my workaround is to pretend that my Driver (Windows laptop) knows the Python path on Linux that the Worker needs to use. Fortunately for me, I do know it, as this entire setup is running on my laptop. Here is the command that gets me past the error:
spark-submit --conf spark.pyspark.driver.python=python --conf spark.pyspark.python=/usr/local/bin/python3 --master spark://172.28.128.150:7077 C:\apps\Spark\spark-2.4.2\examples\src\main\python\wordcount.py /vagrant/shared/p_and_p_ch1.txt
Now, I should add that this DOESN'T actually run wordcount.py, as I quickly realized that my cluster cannot figure out Windows paths, and my attempt to use a Vagrant synced / shared folder is resulting in a File Not Found on the p_and_p_ch1.txt file. But it does get me past my dreaded error. I can figure out how to stash my files on a network share / S3 / etc. some other day.
This puts a lot of onus on the Spark driver knowing exactly which Python path the cluster needs to use. Fortunately I know these settings, as the setup is entirely on my laptop, but isn't the entire point of this that I am supposed to be able to submit Spark jobs to a cluster without the driver (me) knowing settings like the worker nodes' Python path? I'm wondering if this is just a Windows + Linux quirk?
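For reference, the executor-side Python path can also be supplied programmatically instead of on the spark-submit command line. Below is a minimal sketch (untested against this exact setup) reusing the master URL and paths from above; the app name is just a placeholder, and spark.pyspark.driver.python still has to be handled outside the script, since the driver's interpreter is already running by the time this code executes:

from pyspark.sql import SparkSession

# executor-side interpreter: the path that exists on the Ubuntu nodes (from above)
spark = (
    SparkSession.builder
    .appName("wordcount-from-windows-driver")   # hypothetical app name
    .master("spark://172.28.128.150:7077")
    .config("spark.pyspark.python", "/usr/local/bin/python3")
    .getOrCreate()
)

lines = spark.read.text("/vagrant/shared/p_and_p_ch1.txt").rdd
counts = (
    lines.flatMap(lambda row: row.value.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(counts.take(10))
spark.stop()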

Related

How to use pycharm to run an application in remote spark cluster

I have installed PyCharm on my local system and have configured it to run Spark applications in local mode on Windows.
My spark cluster is in a remote Ubuntu box.
How can I run a spark application in the remote spark cluster, which is on Ubuntu, from my locally installed PyCharm which is on Windows?
My goal is to run the application in a remote cluster so workarounds are also welcome.
PyCharm is already set up for this. You want to configure a deployment and a remote interpreter for your setup, ideally via SSH.
This allows you to upload your codebase to the cluster (so that the pyspark driver has access to it) but run it from your laptop. The remote interpreter then takes care of resolving dependencies on the cluster.
Have a look here https://www.jetbrains.com/help/pycharm/configuring-remote-interpreters-via-ssh.html and here https://www.jetbrains.com/help/pycharm/creating-a-remote-server-configuration.html.
NB: Before you start configuring the remote interpreter, it's better to install venv or conda on your cluster and create a virtual environment, so that you don't run into dependency conflicts or outdated packages. You then point the remote interpreter config to the Python binary of that environment, for example /app/anaconda3/envs/my_env/bin/python.
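Once the remote interpreter is configured, a quick way to confirm the whole chain works is to run a tiny PySpark job through it. A minimal sketch, where spark://<master-ip>:7077 is a placeholder for your cluster's standalone master URL:

from pyspark.sql import SparkSession

# <master-ip> is a placeholder for the remote standalone master's address
spark = (
    SparkSession.builder
    .appName("remote-interpreter-smoke-test")   # hypothetical app name
    .master("spark://<master-ip>:7077")
    .getOrCreate()
)
# trivial job that forces a round trip through the executors
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()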

Opening script saved on a cluster with local spyder installation

I have Anaconda, and hence Spyder, installed on a local machine. What I am trying to do is use my local Spyder installation to open a .py script saved on a remote cluster (in my office) via ssh. The issues I am encountering are the following:
I cannot run Spyder from the cluster - there is no graphical device whatsoever. For example, we actually have Anaconda installed on the cluster, but when I run Spyder from the command line, I get the following error message: Could not connect to any X display
I cannot mount the (remote) drives where the .py scripts are located onto my local machine when I am working from home (mounting them does work when I am at work, connected to the network via cable). If I could, I would simply launch Spyder on my local machine and then open the scripts. I can only access the files on some drives mounted onto the cluster via ssh.
As, however, I can access the .py scripts saved on the cluster via ssh (I can open them with programs installed locally, e.g. vim, jpico, etc.), I was wondering whether it is possible to use the command line to open a script saved on the remote cluster using my local Spyder installation, something like $ spyder /path/to/myScript/savedOnTheRemoteCluster.py
(Spyder maintainer here) As of May 2019 our editor is not capable of working with files on remote locations. So your best option right now is to mount your remote server with sshfs to make it appear as a local directory and then open any file present there in Spyder.

How to start a standalone cluster using pyspark?

I am using pyspark under Ubuntu with Python 2.7
I installed it using
pip install pyspark --user
and am trying to follow the instructions to set up a Spark cluster
I can't find the script start-master.sh
I assume that it has to do with the fact that I installed pyspark and not regular Spark
I found here that I can connect a worker node to the master via pyspark, but how do I start the master node with pyspark?
https://pypi.python.org/pypi/pyspark
The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Well, I did a bit of a mix-up in the OP.
You need to get Spark on the machine that should run as master.
You can download it here
After extracting it, you have a spark/sbin folder, and there you have the start-master.sh script. You need to start it with the -h argument.
Please note that you need to create a spark-env file as explained here and define the Spark local and master variables; this is important on the master machine.
After that, on the worker nodes, use the start-slave.sh script to start the worker processes.
And you are good to go; you can use a Spark context inside Python to use the cluster!
If you are already using pyspark through a conda / pip installation, there's no need to install Spark and set up environment variables again for cluster setup.
A conda / pip pyspark installation is only missing the 'conf', 'sbin', 'kubernetes', and 'yarn' folders. You can simply download Spark and move those folders into the folder where pyspark is located (usually the site-packages folder inside Python).
After you installed pyspark via pip install pyspark, you can start the Spark standalone cluster master process using this command:
spark-class org.apache.spark.deploy.master.Master -h 127.0.0.1
And then you can add some workers (executors), which would process the jobs:
spark-class org.apache.spark.deploy.worker.Worker \
spark://127.0.0.1:7077 \
-c 4 -m 8G
Flags -c and -m specify the number of CPU cores and amount of memory provided by the worker.
The 127.0.0.1 local address is used there for security reasons (it isn't good if anyone who just copy/pastes these lines exposes an "arbitrary code execution service" on their network). For a distributed standalone Spark cluster setup, a different address should be used (e.g., a private IP address on an isolated network available only to the cluster nodes and their intended users), and the official Spark security guide should be read.
The spark-class script is contained in the "pyspark" python package; it is a wrapper that loads the environment variables from spark-env.sh and adds the corresponding Spark jar locations to the -cp flag of the java command.
If you need to configure the environment, consult the official Spark docs, but it also works and may be suitable for regular usage with default parameters. Also, see the available flags for the master/worker commands using their --help.
This is an example of how to connect to this standalone cluster using the pyspark script with the ipython shell:
PYSPARK_DRIVER_PYTHON=ipython \
pyspark --master spark://127.0.0.1:7077 \
--num-executors 2 \
--executor-cores 2 \
--executor-memory 4G
The code for instantiating a Spark session manually, e.g. in Jupyter:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://127.0.0.1:7077")
    # the number of executors this job needs
    .config("spark.executor.instances", 2)
    # the number of CPU cores and the amount of memory this job needs
    # from each executor; they will be reserved on the worker
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4G")
    .getOrCreate()
)

Controlling VMs using Python scripts

I want to manage virtual machines (any flavor) using Python scripts. For example: create a VM, start it, stop it, and be able to access my guest OS's resources.
My host machine runs Windows. I have VirtualBox installed. Guest OS: Kali Linux.
I just came across a piece of software called libvirt. Do any of you think this would help me?
Any insights on how to do this? Thanks for your help.
For AWS use boto.
For GCE use the Google API Python Client Library.
For OpenStack use the python-openstackclient and import its methods directly.
For VMware, google it.
For Opsware, abandon all hope, as their API is undocumented and has something like 12 years of accumulated abandoned methods to dig through, and an equally insane data model backing it.
For direct libvirt control there are Python bindings for libvirt. They work very well and closely mimic the C libraries; a minimal example follows below.
I could go on.
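For the libvirt route specifically, here is a rough sketch using the libvirt-python bindings. It assumes the VirtualBox driver is available to libvirt (the vbox:///session URI) and that a guest named "kali" is already defined in VirtualBox; both of those are assumptions, not givens:

import libvirt

# connect to the local VirtualBox hypervisor through libvirt's vbox driver
conn = libvirt.open("vbox:///session")

# list the guests libvirt can see and whether they are running
for dom in conn.listAllDomains():
    print(dom.name(), "running" if dom.isActive() else "stopped")

# look up a specific guest ("kali" is a hypothetical name) and boot it if needed
dom = conn.lookupByName("kali")
if not dom.isActive():
    dom.create()   # starts the defined VM

conn.close()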
Follow the directions here to install Docker: https://docs.docker.com/windows/ (it includes Oracle VirtualBox, if you don't already have it).
# grab the image
docker pull kalilinux/kali-linux-docker
#run a specific command
docker run kalilinux/kali-linux-docker <some_command>
# open an interactive terminal in the container
docker run -t -i kalilinux/kali-linux-docker /bin/bash
if you want to mount a local volume you can use the `-v <host_src>:<container_dst>` switch in your run command
# mount a local training/webapp directory (the host path must be absolute) into the kali image at /webapp
docker run -v <absolute_host_path_to_training/webapp>:/webapp kalilinux/kali-linux-docker <some_command>
Note that these are run from the regular Windows prompt; to use them from Python you would need to wrap them in subprocess calls ...
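A minimal sketch of that wrapping, assuming Docker is already on your PATH; the uname -a command is just a placeholder for whatever you actually want to run in the guest:

import subprocess

# pull the image once (a no-op if it is already present)
subprocess.check_call(["docker", "pull", "kalilinux/kali-linux-docker"])

# run a single command in a throwaway container and capture its output
output = subprocess.check_output(
    ["docker", "run", "--rm", "kalilinux/kali-linux-docker", "uname", "-a"],
    universal_newlines=True,
)
print(output)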

apache spark, "failed to create any local dir"

I am trying to set up Apache Spark on a small standalone cluster (1 master node and 8 slave nodes). I have installed the "pre-built" version of Spark 1.1.0, built on top of Hadoop 2.4. I have set up passwordless ssh between the nodes and exported a few necessary environment variables. One of these variables (which is probably the most relevant) is:
export SPARK_LOCAL_DIRS=/scratch/spark/
I have a small piece of Python code which I know works with Spark. I can run it locally (on my desktop, not the cluster) with:
$SPARK_HOME/bin/spark-submit ~/My_code.py
I copied the code to the cluster. Then, I start all the processes from the head node:
$SPARK_HOME/sbin/start-all.sh
And each of the slaves is listed as running as process xxxxx.
If I then attempt to run my code with the same command above:
$SPARK_HOME/bin/spark-submit ~/My_code.py
I get the following error:
14/10/27 14:19:02 ERROR util.Utils: Failed to create local root dir in /scratch/spark/. Ignoring this directory.
14/10/27 14:19:02 ERROR storage.DiskBlockManager: Failed to create any local dir.
I have the permissions on /scratch and /scratch/spark set to 777. Any help is greatly appreciated.
The problem was that I didn't realize the master node also needed a scratch directory. On each of my 8 worker nodes I had created the local /scratch/spark directory, but I neglected to do so on the master node. Adding the directory fixed the problem.
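If it helps anyone hitting the same error, here is a rough diagnostic sketch you can spark-submit against the cluster to see which worker hosts actually have a writable scratch directory. The /scratch/spark path comes from this question; the partition count is arbitrary, and the master/driver node still has to be checked by hand (which was the actual fix here):

import os
import socket
from pyspark import SparkContext

SCRATCH = "/scratch/spark"   # the SPARK_LOCAL_DIRS value used in this question

def check(_):
    # runs on an executor: report this host's view of the scratch directory
    yield (socket.gethostname(), os.path.isdir(SCRATCH), os.access(SCRATCH, os.W_OK))

sc = SparkContext(appName="scratch-dir-check")
# 32 partitions is arbitrary, just enough to land work on every worker
results = set(sc.parallelize(range(32), 32).mapPartitions(check).collect())
for host, exists, writable in sorted(results):
    print("%s exists=%s writable=%s" % (host, exists, writable))
sc.stop()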
