I have IntelliJ IDEA set up with Apache Spark 1.4.
I want to be able to add debug points to my Spark Python scripts so that I can debug them easily.
I am currently running this bit of Python to initialise the spark process
import subprocess

# stderr must also be piped, otherwise proc.stderr is None
proc = subprocess.Popen([SPARK_SUBMIT_PATH, scriptFile, inputFile],
                        shell=SHELL_OUTPUT,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if VERBOSE:
    print proc.stdout.read()
    print proc.stderr.read()
When spark-submit eventually calls myFirstSparkScript.py, the debug mode is not engaged and it executes as normal. Unfortunately, editing the Apache Spark source code and running a customised copy is not an acceptable solution.
Does anyone know if it is possible to have spark-submit call the Apache Spark script in debug mode? If so, how?
As far as I understand your intentions, what you want is not directly possible given the Spark architecture. Even without the subprocess call, the only part of your program that is accessible directly on the driver is the SparkContext. From the rest you're effectively isolated by different layers of communication, including at least one (in local mode) JVM instance. To illustrate that, let's use a diagram from the PySpark Internals documentation.
What is in the left box is the part that is accessible locally and could be used to attach a debugger. Since it is mostly limited to JVM calls, there is really nothing there that should be of interest to you, unless you're actually modifying PySpark itself.
What is on the right happens remotely and, depending on the cluster manager you use, is pretty much a black box from a user's perspective. Moreover, there are many situations when the Python code on the right does nothing more than call the JVM API.
That was the bad part. The good part is that most of the time there should be no need for remote debugging. Excluding access to objects like TaskContext, which can be easily mocked, every part of your code should be easily runnable and testable locally without using a Spark instance whatsoever.
Functions you pass to actions / transformations take standard and predictable Python objects and are expected to return standard Python objects as well. What is also important, they should be free of side effects.
So at the end of the day you have two parts of your program: a thin layer that can be accessed interactively and tested based purely on inputs / outputs, and a "computational core" which doesn't require Spark for testing / debugging.
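For example, a minimal sketch of that split (the function name and data format here are hypothetical):
# "Computational core": a plain Python function, testable without any Spark instance.
def parse_record(line):
    # Turn a CSV line like "foo, 1.5" into a (key, value) pair.
    name, value = line.split(",")
    return name.strip(), float(value)

# Local test / debugging session, no SparkContext required:
assert parse_record("foo, 1.5") == ("foo", 1.5)

# Only the thin driver layer touches Spark:
# rdd = sc.textFile("data.csv").map(parse_record)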
Other options
That being said, you're not completely out of options here.
Local mode
(passively attach a debugger to a running interpreter)
Both plain GDB and the PyCharm debugger can be attached to a running process. This can be done only once the PySpark daemon and/or worker processes have been started. In local mode you can force it by executing a dummy action, for example:
sc.parallelize([], n).count()
where n is the number of "cores" available in local mode (local[n]). An example procedure, step by step, on Unix-like systems:
Start PySpark shell:
$SPARK_HOME/bin/pyspark
Use pgrep to check there is no daemon process running:
➜ spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
➜ spark-2.1.0-bin-hadoop2.7$
The same thing can be determined in PyCharm with alt+shift+a and choosing Attach to Local Process, or via Run -> Attach to Local Process.
At this point you should see only PySpark shell (and possibly some unrelated processes).
Execute dummy action:
sc.parallelize([], 1).count()
Now you should see both daemon and worker (here only one):
➜ spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
13990
14046
➜ spark-2.1.0-bin-hadoop2.7$
The process with the lower pid is the daemon; the one with the higher pid is (possibly) an ephemeral worker.
At this point you can attach the debugger to the process of interest:
In PyCharm, by choosing the process to connect to.
With plain GDB by calling:
gdb python <pid of running process>
The biggest disadvantage of this approach is that you have to find the right interpreter at the right moment.
Distributed mode
(using an active component which connects to a debugger server)
With PyCharm
PyCharm provides Python Debug Server which can be used with PySpark jobs.
First of all you should add a configuration for remote debugger:
alt+shift+a and choose Edit Configurations or Run -> Edit Configurations.
Click on Add new configuration (green plus) and choose Python Remote Debug.
Configure host and port according to your own configuration (make sure that the port can be reached from a remote machine)
Start debug server:
shift+F9
You should see debugger console:
Make sure that pydevd is accessible on the worker nodes, either by installing it or distributing the egg file.
pydevd uses an active component which has to be included in your code:
import pydevd
pydevd.settrace(<host name>, port=<port number>)
The tricky part is to find the right place to include it, and unless you debug batch operations (like functions passed to mapPartitions) it may require patching the PySpark source itself, for example pyspark.daemon.worker or RDD methods like RDD.mapPartitions. Let's say we are interested in debugging worker behavior. A possible patch can look like this:
diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 7f06d4288c..6cff353795 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -44,6 +44,9 @@ def worker(sock):
"""
Called by a worker process after the fork().
"""
+ import pydevd
+ pydevd.settrace('foobar', port=9999, stdoutToServer=True, stderrToServer=True)
+
signal.signal(SIGHUP, SIG_DFL)
signal.signal(SIGCHLD, SIG_DFL)
signal.signal(SIGTERM, SIG_DFL)
If you decide to patch the Spark source, be sure to use the patched source, not the packaged version located in $SPARK_HOME/python/lib.
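If you'd rather not patch Spark at all, a rough alternative sketch is to call settrace from inside a function shipped to the workers as part of a batch operation (host, port, the rdd variable and the per-record logic below are placeholders):
def debug_partition(iterator):
    # Runs on the worker; connects back to the PyCharm debug server.
    import pydevd
    pydevd.settrace('foobar', port=9999, stdoutToServer=True, stderrToServer=True)
    for x in iterator:
        yield x  # replace with your own per-record logic

rdd.mapPartitions(debug_partition).count()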
Execute PySpark code. Go back to the debugger console and have fun:
Other tools
There are a number of tools, including python-manhole or pyrasite, which can be used, with some effort, to work with PySpark.
Note:
Of course, you can use "remote" (active) methods with local mode and, to some extent, "local" methods with distributed mode (you can connect to the worker node and follow the same steps as in local mode).
Check out this tool called pyspark_xray; below is a high-level summary extracted from its docs.
pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally; specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.
The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally, and do production runs remotely, using the same code base of a PySpark application. For the local-debugging part, pyspark_xray specifically provides the capability of locally debugging Spark application code that runs on slave nodes; the lack of this capability is an unfilled gap for Spark application developers right now.
Problem
For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.
If you develop PySpark applications, you know that PySpark application code is made up of two categories:
code that runs on master node
code that runs on worker/slave nodes
While code on the master node can be accessed by a debugger locally, code on slave nodes is like a black box and not accessible locally by a debugger.
Plenty of tutorials on the web have covered the steps of debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people treat this part of the code either as a black box or as something that doesn't need debugging.
Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions.
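To make the split concrete, here is a tiny, hypothetical example; only the lambda is serialized and executed on the worker/slave nodes, so a local debugger never steps into it:
rdd = sc.parallelize(range(10))       # driver / master node code
doubled = rdd.map(lambda x: x * 2)    # the lambda runs on worker/slave nodes
print(doubled.collect())              # driver again: triggers the job and gathers results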
Solution
The pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code: not only code that runs on the master node, but also code that runs on slave nodes, using PyCharm and other popular IDEs such as VSCode.
This library achieves these capabilities by using the following techniques:
wrapper functions around Spark code on slave nodes; check out the section to learn more details
the practice of sampling input data under local debugging mode in order to fit the application into the memory of your standalone local PC/Mac
For example, say your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory; in order to use pyspark_xray, you may take 100 sample rows as input to debug your application locally using pyspark_xray
usage of a flag to auto-detect local mode: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
True: if current OS is Mac or Windows
False: otherwise
with this flag in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base (a rough sketch of this pattern follows below).
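The sampling-plus-flag idea boils down to roughly the following pattern (illustrative only; the constant, the spark session, the input path and the transform function below are hypothetical, not pyspark_xray's actual API):
import platform

# Stand-in for pyspark_xray's CONST_BOOL_LOCAL_MODE idea: local debug on Mac/Windows.
LOCAL_MODE = platform.system() in ("Darwin", "Windows")

df = spark.read.csv("input.csv", header=True)
if LOCAL_MODE:
    # Sample the input so the job fits into a single PC/Mac's memory while debugging.
    df = df.limit(100)

result = my_transform(df)   # same transformation code path locally and in production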
Related
I've been working with Python programs which take several hours to complete, but crash occasionally. To debug, so far I have been adding conditional breakpoints, which drop me into a PDB session whenever a problem occurs. This is great because pinpointing the exact cause of the problem is hard, and the interactive session lets me explore the whole program (including all the stack frames and so on).
The only issue is, if I ever accidentally close or crash my debugging session, I need to start the whole program again! Reaching my breakpoint takes several hours! I would really, really like a way of serializing a PDB session and re-opening it multiple times. Does anything like this exist? I have looked into dill to serialize an interpreter session, unfortunately several of my types fail to serialize (it also isn't robust to code changes down the line). Thanks!
You haven't specified your operating system of choice, but in the Linux world there is the criu utility - https://criu.org/Main_Page - which can be used to save an application's state. There are a lot of pitfalls, especially with tty-based applications (see https://criu.org/Advanced_usage#Shell_jobs_C.2FR), but here is an example.
I have a simple Python application with a pdb breakpoint; let's call it app.py:
print("hello")
import pdb; pdb.set_trace()
print("world")
After running this application with python app.py you get the expected output:
hello
> /home/user/app.py(3)<module>()
-> print("world")
Get your pid with pgrep -f app.py, in my case it was 17060
Create a folder to dump your process
mkdir /tmp/criu
Dump your process with
sudo criu dump -D /tmp/criu -t 17060 --shell-job
notice that your current process will be killed (AFAIK due to the --shell-job key, see the link above).
you'll see
(Pdb) [1] 17060 killed python app.py
in your tty
Restore your process with
sudo criu restore -D /tmp/criu --shell-job
your tty will be restored in the same window where you used this command.
Since the debugger is attached, you may type c and press Enter to see if it actually worked. Here's the result on my machine:
(Pdb) c
world
Hope that helps, there are a lot of pitfalls that might make this approach unfeasible for you.
The other way is to run your code in a VM and snapshot the disk and memory each time. It might not be the best solution resource-wise, but many hypervisors have a nice UI, or even shell utilities, to control the state of virtual machines. Snapshotting tech is mature in any hypervisor nowadays; you shouldn't run into any problems. Set up remote debugging and connect with your favorite IDE after bringing your snapshot back.
Edit: There is also an easy way to do this if you are running your applications in containers and your OS supports podman and criu 3.11+:
https://criu.org/Podman
You can use something like
podman run -d --name your_container_name your_image
To snapshot use
podman container checkpoint your_container_id
To restore use
podman container restore your_container_id
All these commands require root privileges. Unfortunately I wasn't able to test it because my distro provides criu 3.8, and podman requires 3.11.
The same functionality is available as an experimental flag in Docker; see https://criu.org/Docker
I have a virtual server available which runs Linux, with 8 cores, 32 GB RAM, and 1 TB of storage in addition. It is meant to be a development environment (the same goes for test and prod); this is what I could get from IT. The server can only be accessed via so-called jump servers by PuTTY or direct TCP/IP ports (SSH is a must).
The application I am working on starts several processes via multiprocessing. In every process an asyncio event loop is started, and in some cases an asyncio socket server. Basically it is a low-level data streaming and processing application (unfortunately no Kafka or similar technology is available yet). The live application runs forever, with no or limited interaction with the user (it reads/processes/writes data).
I assume IPython is an option for this, but - and maybe I am wrong - I think it starts new kernels per client request, whereas I need to start new processes from the main code without user interaction. If so, it can still be an option for monitoring the application, gathering data from it, and sending new user commands to the main module, but I am not sure how to run processes and asyncio servers remotely.
I would like to understand how these can be done in the given environment. I do not know where to start or what alternatives there are. And I do not understand IPython properly; their page is not obvious to me yet.
Please help me out! Thank you in advance!
After lots of research and learning I came across a possible solution in our "sandbox" environment. First, I had to split the problem into several sub-problems:
"remote" development
parallelization
scheduling and executing parallel codes
data sharing between these "engines"
controlling these "engines"
Let's look at them in detail:
Remote development means you want to write your code on your laptop, but the code must be executed on a remote server. The easy answer is Jupyter Notebook (or an equivalent solution); it has several trade-offs and other solutions are available, but this was the fastest to deploy and use and had the fewest dependencies, maintenance needs, etc.
parallelization: I had several challenges with the IPython kernel when working with multiprocessing, so every piece of code that must run in parallel will be written in a separate Jupyter Notebook. Within a single notebook I can still use an event loop to get async behaviour
executing parallel code: there are several options I will use:
iPyParallel - "workaround" for multiprocessing
papermill - execute JNs with parameters from command line (optional)
using %%writefile magic command in Jupyter Notebook - create importables
os task scheduler like cron.
async with eventloops
Not an option yet: Docker, multiprocessing, multithreading, cloud (AWS, Azure, Google...)
data sharing: I selected ZeroMQ; it took time to learn but was simpler and easier than writing everything on pure sockets. There are alternatives, but they come with extra dependencies along with some very useful benefits (I will check them later): RabbitMQ, Redis message broker, etc. The reasons for preferring ZMQ: fast, simple, elegant, and just a library. (Known risk: our IT will prefer RabbitMQ, but that problem comes later :-) )
controlling the engines: now this answer is obvious: a separate Python program (it can be tested as notebook code but is easy to turn into a pure .py and schedule). This one can communicate with the other modules via ZMQ sockets: health checks, sending new parameters, commands, etc. (sketched below).
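Assuming pyzmq, a minimal REQ/REP health-check sketch (the port and message format are my own choices, not part of the setup above):
# engine side (notebook or .py running on the server)
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://*:5555")                 # port is an assumption

while True:
    msg = sock.recv_json()                # e.g. {"cmd": "healthcheck"}
    if msg.get("cmd") == "healthcheck":
        sock.send_json({"status": "ok"})
    else:
        sock.send_json({"status": "unknown command"})

# controller side:
#   req = ctx.socket(zmq.REQ); req.connect("tcp://server:5555")
#   req.send_json({"cmd": "healthcheck"}); print(req.recv_json())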
In our company we are using Vagrant VMs to have the same env for everybody. Is it possible to configure Visual Studio + PTVS (Python Tools for VS) to use a Vagrant-based Python interpreter, via ssh for example?
There's no special support for remote interpreters in PTVS, like what PyCharm has. It's probably possible to hack something together within the existing constraints, but it would be some work...
To register an interpreter that can actually run, it would have to have a local (well, CreateProcess'able - so e.g. SMB shares are okay) binary that accepts the same command line options as python.exe. It might be possible to use ssh directly by adding the corresponding command line options to project settings. Otherwise, a proxy binary that just turns around and invokes the remote process would definitely work.
Running under debugger is much trickier. For that to work, the invoked Python binary would also have to be able to load the PTVS debugging bits (which is a bunch of .py files in PTVS install directory), and to connect to VS over TCP to establish a debugger connection. I don't see how this could be done without writing significant amounts of code to correctly proxy everything.
Attaching to a remotely running process using ptvsd, on the other hand, would be trivial.
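For the attach route, the target script only needs a couple of lines; a sketch, assuming a reasonably recent ptvsd (older PTVS-era versions take a secret string instead of an address):
import ptvsd

ptvsd.enable_attach(address=("0.0.0.0", 5678))   # listen for the IDE to attach over TCP
ptvsd.wait_for_attach()                          # optional: block until VS has attached

# ... the rest of the program now runs under the attached debugger ...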
For code editing experience, you'd need a local copy (or a share etc) of the standard library for that interpreter, so that it can be analyzed by the type inference engine.
I'm using GitLab CI to automatically build a C++ project and run unit tests written in Python (the test suite runs the daemon and then communicates via the network/socket-based interface).
The problem I'm finding is that when the tests are run by the GitLab CI runner, they fail for various reasons (one test stalls indefinitely on a particular network operation; another doesn't receive a packet that should have been sent).
BUT: when I open up SSH and run the tests manually, they all pass (the tests also succeed on all of our developers' machines [Linux/Windows/OSX]).
At this point I've been trying to replicate enough of the build/test conditions that GitLab CI is using, but I don't really know the exact details, and none of my experiments have reproduced the problem.
I'd really appreciate help with either of the following:
Guidance on running the tests manually outside of gitlab-ci, but replicating its environment so I can get the same errors/failures and debug the daemon and/or tests, OR
Insight into why the tests would fail when run by the GitLab CI runner
Sidetrack 1:
For some reason, not all of the (mostly debugging) output that would normally be sent to the shell shows up in the GitLab CI output.
Sidetrack 2:
I also played around with setting it up in Jenkins, but one of the tests fails to even connect to the daemon, while the rest connect fine.
- I usually replicate the problem by using a Docker container only for the runner and running the tests inside it; I don't know if you have it set up like this =(.
- Normally the test doesn't actually fail: if you log into the container you will see that it actually does everything, but it doesn't report back to GitLab CI. Don't freak out, it does its job, it simply does not say so.
PS: you can see if it's actually running by checking the processes on the machine.
example:
I'm running a GitLab CI job with Java and Docker: GitLab CI starts doing its thing and then hangs at a download; meanwhile I log into the container and check that it is actually working and manages to upload my compiled Docker image.
I have a Python program running as a Windows service which, in my opinion, catches all exceptions. In my development environment I cannot reproduce any situation where no exception is logged when the program crashes, except in 2 cases: the program is killed via Task Manager, or I power off the computer.
However, in the target environment (Windows 2000 with all necessary libraries and Python installed), the Windows service quits suddenly about 4 minutes after reboot, without logging any exception or reason for the failure. The environment was definitely not powered off.
Does anybody have a suggestion how to determine what killed the python program?
EDIT: I cannot use a debugger in the target environment (as it is at production level). Therefore I need a way to log the reason for the failure. So I am looking for tools or methods to log additional information at runtime (or at failure time) which can be used for post-mortem analysis.
You need to give more information, like: is your program multi-threaded? Does the code depend on the version of the Python interpreter you are using, or on any imported modules not present in the target environment?
If you have GDB for Windows, you can run "gdb -p pid", where "pid" is the pid of the Python program you are running. If there is a crash, you can get the backtrace.
You may also want to check the following tools from sysinternals.com (now acquired by MSFT):
http://technet.microsoft.com/en-us/sysinternals/bb795533
such as ProcDump, Process Monitor, or even Process Explorer (though less suited than the previous ones).
You may also be able to install a lightweight debugger such as OllyDbg, or use MoonSols' tools to monitor the guest VM's process if you happen to have this in a virtualized environment.
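If installing anything on the production box is not an option, a low-tech fallback is a heartbeat log written from a background timer: the timestamp of the last entry narrows down when the service died. A minimal, standard-library-only sketch (the log path and interval are arbitrary choices):
import logging
import threading

logging.basicConfig(filename=r"C:\service_heartbeat.log",
                    format="%(asctime)s %(message)s", level=logging.INFO)

def heartbeat():
    logging.info("alive")
    threading.Timer(10.0, heartbeat).start()   # re-arm: log every 10 seconds

heartbeat()
# ... service main loop runs as before ...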