Apache Spark, "Failed to create any local dir" - Python

I am trying to set up Apache Spark on a small standalone cluster (1 master node and 8 slave nodes). I have installed the "pre-built" version of Spark 1.1.0 built on top of Hadoop 2.4. I have set up passwordless SSH between the nodes and exported a few necessary environment variables. One of these variables (probably the most relevant) is:
export SPARK_LOCAL_DIRS=/scratch/spark/
I have a small piece of python code which I know works with Spark. I can run it locally--on my desktop, not the cluster--with:
$SPARK_HOME/bin/spark-submit ~/My_code.py
I copied the code to the cluster. Then, I start all the processes from the head node:
$SPARK_HOME/sbin/start-all.sh
And each of the slaves is listed as running as process xxxxx.
If I then attempt to run my code with the same command above:
$SPARK_HOME/bin/spark-submit ~/My_code.py
I get the following error:
14/10/27 14:19:02 ERROR util.Utils: Failed to create local root dir in /scratch/spark/. Ignoring this directory.
14/10/27 14:19:02 ERROR storage.DiskBlockManager: Failed to create any local dir.
I have the permissions on /scratch and /scratch/spark set to 777. Any help is greatly appreciated.

The problem was that I didn't realize the master node also needed a scratch directory. I had created the local /scratch/spark directory on each of my 8 worker nodes but neglected to do so on the master node. Adding the directory fixed the problem.
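If you run into the same error, a quick sanity check is to confirm that every directory listed in SPARK_LOCAL_DIRS exists and is writable on every node, the master included. A minimal, hedged Python sketch (the /scratch/spark/ default simply mirrors the setup above; run it on each node):

import os

# SPARK_LOCAL_DIRS may hold a comma-separated list of directories
for d in os.environ.get("SPARK_LOCAL_DIRS", "/scratch/spark/").split(","):
    d = d.strip()
    if not os.path.isdir(d):
        os.makedirs(d)  # create the scratch directory if it is missing
    print(d, "writable" if os.access(d, os.W_OK) else "NOT writable")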

Related

PyCharm Error when setting remote interpreter -> Run Error: Jupyter server process failed to start due to path mismatch issue

I am trying to set up remote development in PyCharm. For this, I want to make changes locally and execute the code on a remote Amazon EC2 instance with a remote interpreter. I have configured the project as shown below, but I get a run error when I try to execute an IPython file created locally.
Cannot run program "stfp://<remote server hostname>/<remote server host>:<remote interpreter path>" (in directory <local folder directory>): error=2, No such file or directory.
It seems it should open <remote folder directory> instead of <local folder directory> when running the program. I read through multiple setup instructions but could not get this fixed. I am attaching the configuration below.
Can you please help me with what could be wrong?
Updating PyCharm to 2022.3 can fix the issue. This seems to be related to an unstable older version.

Installing packages in a Kubernetes Pod

I am experimenting with running Jenkins on a Kubernetes cluster. I have achieved running Jenkins on the cluster using a Helm chart. However, I'm unable to run any test cases, since my code base requires Python and MongoDB.
In my Jenkinsfile, I have tried the following:
withPythonEnv('python3.9') {
    pysh 'pip3 install pytest'
}
stage('Test') {
    sh 'python --version'
}
But it says java.io.IOException: error=2, No such file or directory.
It is not feasible to always run the Python install command hardcoded into the Jenkinsfile. After some research I found out that I would have to tell Kubernetes to install Python while the pod is being provisioned, but there seems to be no PreStart hook/lifecycle for pods; there are only PostStart and PreStop.
I'm not sure how to install Python and MongoDB and use that as a template for the Kubernetes pods.
This is the default YAML file that I used for the Helm chart - jenkins-values.yaml
Also, I'm not sure if I need to use Helm.
You should create a new container image with the packages installed. In this case, the Dockerfile could look something like this:
FROM jenkins/jenkins
USER root
RUN apt-get update && apt-get install -y appname
USER jenkins
Then build the container image, push it to a container registry, and replace the "Image: jenkins/jenkins" in your Helm chart with the name of the image you built plus the registry you uploaded it to. With this, your applications are installed in your container every time it runs.
The second way, which works but isn't perfect, is to run environment commands, with something like what is described here:
https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/
The issue with this method is that some deployments already use the startup commands, and by redefining the entrypoint you can stop the container's original start command from ever running, causing the container to fail.
(This should work if added to the helm chart in the deployment section, as they should share roughly the same format)
Otherwise, there's a really improper way of installing programs in a running pod: use kubectl exec -it deployment.apps/jenkins -- bash and then run your installation commands in the pod itself.
That being said, it's a poor idea to do this, because if the pod restarts it will revert to the original image without the required applications installed. If you build a new container image instead, your apps remain installed each time the pod restarts. This approach should basically never be used, unless it is a temporary pod serving as a testing environment.

Problem submitting Pyspark jobs from Windows Driver to Ubuntu Spark Cluster

I am having trouble submitting a Pyspark job from my Windows driver machine (Win 10) to a simple Spark cluster running on Ubuntu.
There are several posts already that attempt to answer this question, most notably this one from ThatDataGuy here but none of them have helped.
Every time I try to submit the simple wordcount.py example to my remote master from my Windows box, I get the following error:
Cannot run program 'C:\apps\Python\3.6.6\python.exe': error=2, No such file or directory
This is a Java IOException generated by the Py4J jar.
My Spark cluster is a simple master + 1 worker setup in VirtualBox, set up via Vagrant. All machines (my Spark driver laptop and the two VMs, master and worker) have identical Spark 2.4.2, Python 3.6.6, and Scala 2.12.8. Note that Scala programs using spark-submit against the remote cluster work fine, as does anything run in local mode. Also, the code examples work fine when run directly on either the master or worker nodes. It's only when I try to use my Windows laptop as a Spark driver in PySpark, against the Ubuntu Spark cluster, that this issue arises. It always returns the error above.
It seems that Py4J is trying to use or instantiate Python from my Windows driver's Python path, which of course my Linux cluster can't see. I have already set the PySpark Python path to a different value on the cluster nodes. I have set both PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in the nodes' environment variables (via .bashrc), in spark-defaults.conf, AND in spark-env.sh. All values point to /usr/local/bin/python3, as that's where Python 3.6.6 is installed on the master and worker nodes.
I've also (just as a hunch) aliased "python" to point to /usr/local/bin/python3 on the nodes and then changed my Windows python shortcut to pull up the same Python version. No luck, but I was grasping at straws. ;/ The error simply changed to:
Cannot run program 'python': error=2, No such file or directory
I did see an article saying that the Py4J 0.10.7 library does not support Python 3.7, which caused me to drop down to Python 3.6. The error stayed the same after that, though.
The only thing I haven't done is to try to set up an additional shared/synced folder in Vagrant back to my Windows Python installation and then use /vagrant/shared/python/whatever in my remote PYSPARK settings. No idea if that would work, though, given I'm dealing with Windows and Linux Python flavors (all 3.6.6). Ugh. :/
Any ideas? I've got a Windows 10 machine and I like to do my Python development there. I've also got 64GB of RAM so I'd like to use it. Please don't make me switch to Scala! ;)
-- PySpark works fine in local mode
spark-submit C:\apps\Spark\spark-2.4.2\examples\src\main\python\wordcount.py C:\Users\sitex\Desktop\p_and_p_ch1.txt
-- PySpark fails with an IOException when calling the master
spark-submit --master spark://XXX.XX.XXX.XXX:7077 C:\apps\Spark\spark-2.4.2\examples\src\main\python\wordcount.py C:\Users\sitex\Desktop\p_and_p_ch1.txt
UPDATE: OK, so it looks like my workaround is to pretend that my driver (the Windows laptop) knows the Python path on Linux that the worker needs to use. Fortunately for me, I do know it, as this entire setup is running on my laptop. Here is the command that gets me past the error:
spark-submit --conf spark.pyspark.driver.python=python --conf spark.pyspark.python=/usr/local/bin/python3 --master spark://172.28.128.150:7077 C:\apps\Spark\spark-2.4.2\examples\src\main\python\wordcount.py /vagrant/shared/p_and_p_ch1.txt
Now, I should add that this DOESN'T actually run wordcount.py, as I quickly realized that my cluster cannot figure out Windows paths, and my attempt to use a Vagrant synced / shared folder results in a File Not Found on the p_and_p_ch1.txt file. But it does get me past my dreaded error. I can figure out how to stash my files on a network share / S3 / etc. some other day.
This puts a lot of onus on the Spark driver knowing exactly which Python path the cluster needs to use. Fortunately I know these settings, as the setup is entirely on my laptop, but isn't the entire point of this that I am supposed to submit Spark jobs to a cluster without the driver (me) knowing settings like the worker nodes' Python path? I'm wondering if this is just a Windows + Linux quirk?
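For what it's worth, the worker-side interpreter setting can also live in the driver script itself instead of on the command line. A rough sketch under the same assumptions as above (the master URL and the /usr/local/bin/python3 path come from this setup; spark.pyspark.driver.python still has to be set before the driver starts, e.g. via --conf or spark-env.sh):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://172.28.128.150:7077")
         # interpreter the Linux executors should launch for their Python workers
         .config("spark.pyspark.python", "/usr/local/bin/python3")
         .appName("wordcount")
         .getOrCreate())

lines = spark.read.text("/vagrant/shared/p_and_p_ch1.txt")
print(lines.count())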

PyCharm always "uploading PyCharm helpers" to the same remote Python interpreter when it starts

When I start PyCharm with a remote Python interpreter, it always performs "Uploading PyCharm helpers", even when the remote machine IP is the same and it already contains previously uploaded helpers. Is this behaviour correct?
This is a well known problem that can be a major obstacle in productivity especially if you use disposable instances in your workflow. It leads to a forced coffee break of 20 minutes every time you want to connect to a remote system. Unacceptable.
Seems like PyCharm creates a build.txt file in the remote helper folder that just has the current PyCharm build number as its contents, for instance:
PY-171.4694.38
So it's possible to upload the helpers manually by using rsync on /Applications/PyCharm.app/Contents/helpers/ and finally manually creating a build.txt file with your current build number. After that, PyCharm should not attempt to re-upload them.
Example:
$ rsync -avz /Applications/PyCharm.app/Contents/helpers/ cluster:/home/xapple/.pycharm_helpers/
$ echo "PY-171.4694.38" > /home/xapple/.pycharm_helpers/build.txt
$ python /home/xapple/.pycharm_helpers/pydev/setup_cython.py build_ext --inplace
In my case, several projects were deployed to the remote server by PyCharm. All of them got stuck when one of the projects went wrong on the remote server. Solution: keep only the one you need to work on, and restart PyCharm via "Invalidate Caches".
Note that -- at least as late as version 2018.3.x -- PyCharm also appears to require re-uploading the helpers when the local network connection changes, for some reason.
What I've observed in my case is that if, while PyCharm remains running, I relocate my laptop and connect to a different LAN, the next remote debugging session I initiate will trigger the lengthy helper upload. It turns out that the contents of the helpers directory actually uploaded in this case are exactly identical to the contents already present in that directory on the remote system (I compared them), so this upload is entirely superfluous, but PyCharm isn't able to detect this.
As there's no way I know of in PyCharm to bypass or cancel automatic helpers upload, the only recourse is to completely exit from PyCharm (close all open project windows) after each change of network connection and restart the IDE. In my experience, this will cause the helper upload to succeed in the "checking remote helpers" phase, before actually uploading all the helpers again. Of course, this is major annoyance if you have multiple projects open, but it's faster than waiting the (tens of) minutes for the agonizingly slow helpers upload to complete.
All of what other responders describe for the course of action to take when changing PyCharm versions is true. It is sufficient to use rsync, ftp, scp, or whatever to transfer the contents of the new local helpers directory (on Linux, a subdirectory of where the app is installed) to the remote system (on Linux, ~/.pycharm_helpers, where ~ is the home directory of the user name used for the remote debugging session), and update the remote build.txt in the helpers directory with the new PyCharm version.
This problem came back again 6 years later with PyCharm 2022.3.2.
The directory /Applications/PyCharm.app/Contents/helpers/ doesn't exist anymore, so the previous trick doesn't work.
What solved it this time is simply to:
Quit PyCharm.
Delete the ~/.pycharm_helpers directory on the remote server.
Relaunch PyCharm and let it do its thing.
According to the docs,
PyCharm checks remote helpers version on every remote run, so if you update your PyCharm version, the new helpers will be uploaded automatically and you don't need to recreate remote interpreter.
A fast solution (less than 3 seconds between me and DigitalOcean), inspired by xApple's excellent answer.
on remote server:
export SOURCE=<your ip>
export PORT=9000
export HELPERS=$HOME/.pycharm_helpers
# PyCharm Help -> About
export BUILD=PY-172.4343.24 # 2017/10/11
cd $HELPERS
rm -fr *
# my OS - ubuntu, change firewall rules to yours if you're not so lucky
sudo ufw allow from $SOURCE proto tcp to any port $PORT
netcat -l -v -p $PORT | tar xz # here you wait for the connection
# after the transfer finishes
sudo ufw delete allow from $SOURCE proto tcp to any port $PORT
echo -n $BUILD > build.txt
python $HELPERS/pydev/setup_cython.py build_ext --inplace
on your workstation:
export TARGET=<remote server ip>
export PORT=9000
export HELPERS=<path to helpers> # for me it's $HOME/opt/pycharm-2016.3/helpers
cd $HELPERS
tar cfz - . | netcat -v $TARGET $PORT
Turning off the firewall addressed the problem in my case (macOS Mojave). Note that this is not a general solution, as it was not tested in any other environment/OS.

How can PySpark be called in debug mode?

I have IntelliJ IDEA set up with Apache Spark 1.4.
I want to be able to add debug points to my Spark Python scripts so that I can debug them easily.
I am currently running this bit of Python to initialise the Spark process:
import subprocess

proc = subprocess.Popen([SPARK_SUBMIT_PATH, scriptFile, inputFile],
                        shell=SHELL_OUTPUT, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if VERBOSE:
    print(proc.stdout.read())
    print(proc.stderr.read())
When spark-submit eventually calls myFirstSparkScript.py, the debug mode is not engaged and it executes as normal. Unfortunately, editing the Apache Spark source code and running a customised copy is not an acceptable solution.
Does anyone know if it is possible to have spark-submit call the Apache Spark script in debug mode? If so, how?
As far as I understand your intentions, what you want is not directly possible given the Spark architecture. Even without the subprocess call, the only part of your program that is accessible directly on the driver is the SparkContext. From the rest you're effectively isolated by different layers of communication, including at least one (in local mode) JVM instance. To illustrate that, let's use a diagram from the PySpark Internals documentation.
What is in the left box is the part that is accessible locally and could be used to attach a debugger. Since it is mostly limited to JVM calls, there is really nothing there that should be of interest to you, unless you're actually modifying PySpark itself.
What is on the right happens remotely and, depending on the cluster manager you use, is pretty much a black box from a user perspective. Moreover, there are many situations where the Python code on the right does nothing more than call the JVM API.
That was the bad part. The good part is that most of the time there should be no need for remote debugging. Excluding access to objects like TaskContext, which can be easily mocked, every part of your code should be easily runnable / testable locally without using a Spark instance whatsoever.
The functions you pass to actions / transformations take standard and predictable Python objects and are expected to return standard Python objects as well. What is also important, these should be free of side effects.
So at the end of the day you have two parts of your program: a thin layer that can be accessed interactively and tested based purely on inputs / outputs, and a "computational core" which doesn't require Spark for testing / debugging.
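To make that split concrete, here is a small illustrative sketch (the function and names are invented for the example, not taken from the question): the "computational core" is a plain Python function you can unit test or step through without Spark, and the Spark layer merely wires it into an RDD.

def word_pairs(line):
    # Pure function: takes a string, returns (word, 1) pairs. No Spark involved.
    return [(word, 1) for word in line.strip().split()]

# Test or debug it locally on ordinary Python data:
assert word_pairs("to be or not to be")[0] == ("to", 1)

# The thin Spark layer just plugs the same function into an RDD:
# counts = sc.textFile(path).flatMap(word_pairs).reduceByKey(lambda a, b: a + b)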
Other options
That being said, you're not completely out of options here.
Local mode
(passively attach debugger to a running interpreter)
Both plain GDB and the PyCharm debugger can be attached to a running process. This can be done only once the PySpark daemon and/or worker processes have been started. In local mode you can force this by executing a dummy action, for example:
sc.parallelize([], n).count()
where n is the number of "cores" available in local mode (local[n]). An example step-by-step procedure on Unix-like systems:
Start PySpark shell:
$SPARK_HOME/bin/pyspark
Use pgrep to check there is no daemon process running:
➜ spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
➜ spark-2.1.0-bin-hadoop2.7$
The same thing can be determined in PyCharm via alt+shift+a and choosing Attach to Local Process, or via Run -> Attach to Local Process.
At this point you should see only PySpark shell (and possibly some unrelated processes).
Execute dummy action:
sc.parallelize([], 1).count()
Now you should see both daemon and worker (here only one):
➜ spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
13990
14046
➜ spark-2.1.0-bin-hadoop2.7$
The process with the lower pid is the daemon; the one with the higher pid is a (possibly ephemeral) worker.
At this point you can attach debugger to a process of interest:
In PyCharm by choosing the process to connect.
With plain GDB by calling:
gdb python <pid of running process>
The biggest disadvantage of this approach is that you have to find the right interpreter at the right moment.
Distributed mode
(using an active component which connects to a debugger server)
With PyCharm
PyCharm provides Python Debug Server which can be used with PySpark jobs.
First of all you should add a configuration for remote debugger:
alt+shift+a and choose Edit Configurations or Run -> Edit Configurations.
Click on Add new configuration (green plus) and choose Python Remote Debug.
Configure host and port according to your own configuration (make sure that the port can be reached from a remote machine)
Start debug server:
shift+F9
You should see the debugger console.
Make sure that pydevd is accessible on the worker nodes, either by installing it or by distributing the egg file.
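One way to distribute it, as a hedged sketch (the egg path is a placeholder), is to ship the file along with the job from the driver:

# driver side: ship the pydevd distribution to every worker
sc.addPyFile("/path/to/pydevd.egg")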
pydevd uses an active component which has to be included in your code:
import pydevd
pydevd.settrace(<host name>, port=<port number>)
The tricky part is to find the right place to include it, and unless you debug batch operations (like functions passed to mapPartitions) it may require patching the PySpark source itself, for example pyspark.daemon.worker or RDD methods like RDD.mapPartitions. Let's say we are interested in debugging worker behavior. A possible patch can look like this:
diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 7f06d4288c..6cff353795 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -44,6 +44,9 @@ def worker(sock):
"""
Called by a worker process after the fork().
"""
+ import pydevd
+ pydevd.settrace('foobar', port=9999, stdoutToServer=True, stderrToServer=True)
+
signal.signal(SIGHUP, SIG_DFL)
signal.signal(SIGCHLD, SIG_DFL)
signal.signal(SIGTERM, SIG_DFL)
If you decide to patch the Spark source, be sure to use the patched source, not the packaged version located in $SPARK_HOME/python/lib.
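If you prefer not to patch Spark at all and the code you care about is a batch operation, a rough alternative is to call settrace from inside the function you pass to mapPartitions (same pydevd call as above; the host, port, rdd and toy logic below are placeholders):

def debugged_partition(iterator):
    # Connect this Python worker back to the PyCharm debug server
    # before processing the partition; adjust host and port to your setup.
    import pydevd
    pydevd.settrace('foobar', port=9999, stdoutToServer=True, stderrToServer=True)
    for x in iterator:
        yield x * 2  # replace with the logic you actually want to step through

rdd.mapPartitions(debugged_partition).count()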
Execute PySpark code. Go back to the debugger console and have fun:
Other tools
There are a number of tools, including python-manhole or pyrasite, which can be used, with some effort, to work with PySpark.
Note:
Of course, you can use "remote" (active) methods with local mode and, to some extent, "local" methods with distributed mode (you can connect to the worker node and follow the same steps as in local mode).
Check out this tool called pyspark_xray; below is a high-level summary extracted from its docs.
pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally. Specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.
The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base. For debugging Spark application code locally, pyspark_xray specifically provides the capability of locally debugging Spark application code that runs on slave nodes; the lack of this capability is an unfilled gap for Spark application developers right now.
Problem
For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.
If you develop PySpark applications, you know that PySpark application code is made up of two categories:
code that runs on master node
code that runs on worker/slave nodes
While code on the master node can be accessed by a debugger locally, code on slave nodes is like a black box, not accessible locally by a debugger.
Plenty of tutorials on the web have covered the steps of debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people refer to this part of the code either as a black box or as something there is no need to debug.
Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions.
Solution
The pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on the master node but also code that runs on slave nodes, using PyCharm and other popular IDEs such as VSCode.
This library achieves these capabilities by using the following techniques:
wrapper functions for Spark code on slave nodes; check out that section of the docs to learn more
the practice of sampling input data under local debugging mode in order to fit the application into the memory of your standalone local PC/Mac
For example, say your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory; in order to use pyspark_xray, you may take 100 sample rows as input to debug your application locally
usage of a flag to auto-detect local mode: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
True: if current OS is Mac or Windows
False: otherwise
With this flag, you can locally debug and remotely execute your Spark application using the same code base.
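The sampling idea above boils down to something like the following sketch (the import path for the flag is assumed from the const.py mention in the docs; spark, input_path and my_transformations are placeholders):

from pyspark_xray import const  # assumed location of CONST_BOOL_LOCAL_MODE

df = spark.read.parquet(input_path)  # full production input
if const.CONST_BOOL_LOCAL_MODE:
    # on a local PC/Mac, keep only a small sample so the job fits in memory
    df = df.limit(100)

result = my_transformations(df)  # the same code path runs in both modes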
