When installing libraries directly in Databricks notebook cells via %pip install, the Python interpreter gets restarted. My understanding is that the interpreter must be restarted for newly installed packages to become visible and accessible to the rest of the notebook cells.
My question: how can this interpreter restart be performed programmatically?
I am installing packages via a function call, based on requirements stored in a separate file, but I noticed that the newly installed packages are not present afterwards, even though the installation seems to take place, whether at notebook or cluster scope. I figured the reason might be that my installation code never restarts the interpreter.
Is such a behavior possible?
Firstly, you can preinstall packages at the cluster level or job level before you even start the notebook. I would recommend that before you try anything else. Trust me, I've run thousands of libraries, including custom-built ones, and have never needed to do this.
On to your actual question: yes, but it will throw an exception.
This code will kill the Python process on your Databricks workspace:
%sh
pid=$(ps aux | grep PythonShell | grep -v grep | awk '{print $2}')
kill -9 $pid
However, it poses a problem for you: if you run this as a bash notebook cell, it cannot be wrapped in try/except logic, which makes automated workflows impossible.
I would have said you can call the shell commands from Python, but that wouldn't work either, as the exception would be thrown in that cell. You could perhaps use Scala and the scala.sys.process library to achieve it, but I'm no Scala expert, sadly.
Hope that helps!
You can use this:
dbutils.library.restartPython()
For more info, look at https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-utils#dbutils-library-restartpython
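To tie this back to the original question, a minimal sketch of installing from a requirements file and then restarting the interpreter might look like the following. The requirements path and the helper name are placeholders, and note that installing through pip this way targets the driver's Python environment rather than being strictly notebook-scoped like %pip:
import subprocess
import sys

def install_requirements(requirements_path):
    # Equivalent to `pip install -r <file>` against the driver's Python environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", requirements_path])

# Hypothetical location; point this at wherever your requirements file lives.
install_requirements("/dbfs/FileStore/requirements.txt")

# Restart the Python interpreter so that later cells can import the new packages.
dbutils.library.restartPython()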
I have the following problem: I wrote a bash script for data analysis that works perfectly fine when I run it from the terminal. To further automate the process I wanted to use a python script that runs the bash script (using subprocess.call), changes the working directory, and reruns the script (and so on). This also worked fine when I did it on my MacBook. However, I need to do the analysis on a Linux machine and here the problem occurred. Again, running the script from the terminal worked fine but once I tried doing this with my python script it fails to find the relevant functions for the analysis. The functions are stored inside the anaconda3/bin folder.
(Python does not even find other functions like "pip")
Of course, I could add the path to all the functions in the bash script but this seems very inefficient to me. So my question is: is there any better way of telling python where to look for the functions? And can you maybe explain to me why running the script from the terminal works but not when I use subprocess.call?
Here is the python script:
import subprocess
import os

path_list = ["Path1",
             "Path2"
             ]

for path in path_list:
    os.chdir(path)
    subprocess.call("Users/.../bash_script", shell=True)
I'm just posting my series of comments as an answer since I think this at least constitutes a reasonable answer for anyone running into a similar issue (your question could definitely be common enough to index from search engine results).
Issue:
...running the script from the terminal worked fine but once I tried doing this with my python script it fails to find the relevant functions for the analysis
In general, you can troubleshoot this kind of problem with:
import subprocess
subprocess.call('echo $PATH', shell=True)
If the directory that contains the relevant binaries/scripts/etc. is not in the output, then you are facing a PATH issue in the shell created by subprocess.call.
The exact problem as confirmed by the OP in comments is that anaconda3/bin is not part of your PATH. Your script works in a regular terminal session because of the Anaconda initialization function that gets added to your .bashrc when installing.
Part of an answer that is very helpful here: Python - Activate conda env through shell script
The problem with your script, though, lies in the fact that the .bashrc is not sourced by the subshell that runs shell scripts (see this answer for more info). This means that even though your non-login interactive shell sees the conda commands, your non-interactive script subshells won't - no matter how many times you call conda init.
Solution 1: Manually use the Anaconda sourcing function in your script
As the OP mentioned in the comments, their workaround was to use the initialization function added to their .bashrc in the script they are trying to run. Although this perhaps doesn't feel like a great solution, it is a "good enough" workaround. Unfortunately I don't use Anaconda on Linux, so I don't have an exact snippet of what this looks like. See the next section for a possibly "cleaner" solution.
Solution 2: Use bash -i to run your script
As mentioned in the same answer linked above, you might be able to use:
bash -i Users/.../bash_script
This will tell bash to run in interactive mode, which then properly sources your .bashrc file when creating the shell. As a result, Anaconda and related functions should work properly.
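Wired into the OP's Python loop, that might look like the sketch below (the script path is kept as the OP's placeholder):
import subprocess

# Run the script through an interactive bash so ~/.bashrc (and the conda
# initialization block it contains) is sourced before the script executes.
subprocess.call(["bash", "-i", "Users/.../bash_script"])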
Solution 3: Manually add anaconda3/bin to PATH
You can check out this answer to decide if this is something you want to do. Keep in mind they are speaking about a Windows OS but most of the same applies to Linux.
When you add the directory to your PATH, you are specifically telling your system to always look in that directory for commands when executing by name, e.g. ping or which. This can have unexpected behavior if you have conflicts (e.g. a command is found with the same name in /usr/bin and .../anaconda3/bin), and as such Anaconda does not add its bin folder to your PATH by default.
This is not necessarily "dangerous" per se, it's just not an ideal solution for most people. However, you are the boss of your own system. If you decide this works for your particular workflow, you can just add the export to your script:
export PATH="path/to/anaconda3/bin:$PATH"
This will set the PATH for use in the current shell and sub-processes.
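If you would rather do this from the Python side than inside the bash script itself, you can pass a modified environment to subprocess.call; the anaconda3 location below is an assumption you would adjust for your install:
import os
import subprocess

env = os.environ.copy()
# Prepend the (assumed) anaconda3/bin location so the subshell can find its tools.
env["PATH"] = "/home/you/anaconda3/bin:" + env["PATH"]

subprocess.call("Users/.../bash_script", shell=True, env=env)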
Solution 4: Manually source the conda script (possibly outdated)
As mentioned in this answer, you can also opt to manually source the conda.sh script (keep in mind your conda.sh might be in another directory):
source /opt/anaconda/etc/profile.d/conda.sh
This will essentially run that shell script and add the included functionality to the current shell (e.g. the one spawned by subprocess.call).
Keep in mind this answer is quite a bit older (~2013) and may not apply anymore, depending how much conda has changed over the years.
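If you want to try this from the OP's Python script, one way is to source conda.sh and run the bash script in a single shell invocation; the conda.sh path is an assumption, and bash is requested explicitly because source is not available in every /bin/sh:
import subprocess

# Source conda.sh so conda-provided commands are on the PATH, then run the script.
cmd = "source /opt/anaconda/etc/profile.d/conda.sh && Users/.../bash_script"
subprocess.call(cmd, shell=True, executable="/bin/bash")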
Notes
As I mentioned in the comments, you may want to post some related questions on https://unix.stackexchange.com/. You have an interesting configuration challenge that may be better suited for answers specifically pertaining to Linux, since your issue is sourcing directly from Linux shell behavior.
I'm trying to use Sage in Anaconda 3, but it looks like the libraries are not imported.
I first created a new environment 'ipykernel_py2' and then installed Python 2 as explained here. With this I can have both Python 2 and Python 3 up and running in Anaconda 3.
Then I went to the newly created kernels folder (C:\Users\YOUR_USERNAME\AppData\Local\Continuum\anaconda3\envs\ipykernel_py2\share\jupyter\kernels) and pasted Sage's kernel into it (taken from C:\Program Files\SageMath 8.2\runtime\opt\sagemath-8.2\local\share\jupyter\kernels). This allows me to create new SageMath files in Jupyter, but the kernel is dead.
To activate the kernel I used Anaconda Prompt and typed:
activate ipykernel_py2
python -m ipykernel install --user --name sagemath --display-name "SageMath 8.2"
So the kernel is now activated and I can create and run Sage files. However, the libraries are still not working; it seems the file runs like a normal Python 2 file.
Does anyone know how to fix this? Do I need to create a separate environment?
Sage for Windows runs under a UNIX emulation environment called Cygwin. Looking at sagemath/kernel.json, it contains:
{"display_name": "SageMath 8.2", "argv": ["/opt/sagemath-8.2/local/bin/sage", "--python", "-m", "sage.repl.ipython_kernel", "-f", "{connection_file}"]}
You can see here that it has a UNIX-style path to the sage executable. This path only makes sense to other programs running under Sage's Cygwin environment, and is meaningless to native Windows programs. Simply converting it to the equivalent Windows path won't work either, because bin/sage is actually a shell script. At the very least you need to provide a Windows path to the bash that comes with Cygwin and pass it the UNIX path to the sage executable (the same as the one above). Without a login shell, most environment variables needed won't be set either, so you probably need bash -l.
So, something like:
{"display_name": "SageMath 8.2", "argv": ["C:\\Program Files\\SageMath 8.2\\runtime\\bin\\bash.exe", "-l", "/opt/sagemath-8.2/local/bin/sage", "--python", "-m", "sage.repl.ipython_kernel", "-f", "{connection_file}"]}
might work. The one thing I'm not sure about is whether the {connection_file} argument will be handled properly either. I haven't tested it.
Update: Indeed, the above partially works, but there are a few problems. The {connection_file} argument is passed as the absolute Windows path to the file. While Cygwin can normally translate transparently from Windows paths to the corresponding UNIX paths, there is a known issue that Python's os.path module on Cygwin does not handle Windows-style paths well, and this leads to issues.
The other major problem I encountered was that IPKernelApp, the class that drives generic Jupyter kernels, has a thread which polls to see if the kernel's parent process (in this case the notebook server) has exited, so it can appropriately shut down if the parent shuts down. This is how kernels know to automatically shut down when you kill the notebook server.
How this is done is very different depending on the platform (Windows versus UNIX-like). Because Sage's kernel runs in Cygwin, it chooses the UNIX-like poller. However, this is wrong if the notebook server happens to be a native Windows process, as is the case when running the Sage kernel in a Windows-native Jupyter. Remarkably, the parent poller for Windows can work just as well on Cygwin, since it accesses the Windows API through ctypes. Therefore, this can be worked around by providing a wrapper to IPKernelApp that forces use of ParentPollerWindows.
A possible solution then looks something like this: From within the SageMath Shell do:
$ cd "$SAGE_LOCAL"
$ mkdir -p ./share/jupyter/kernels/sagemath
$ cd ./share/jupyter/kernels/sagemath
$ cat <<'_EOF_' > kernel-wrapper.sh
#!/bin/sh
here="$(dirname "$0")"
connection_file="$(cygpath -u -a "$1")"
exec /opt/sagemath-8.2/local/bin/sage --python "${here}/kernel-wrapper.py" -f "${connection_file}"
_EOF_
$ cat <<_EOF_ > kernel-wrapper.py
from ipykernel.kernelapp import IPKernelApp as OrigIPKernelApp
from ipykernel.parentpoller import ParentPollerWindows
from sage.repl.ipython_kernel.kernel import SageKernel
class IPKernelApp(OrigIPKernelApp):
    """
    Although this kernel runs under Cygwin, its parent is a native Windows
    process, so we force use of the ParentPollerWindows.
    """
    def init_poller(self):
        if self.interrupt or self.parent_handle:
            self.poller = ParentPollerWindows(self.interrupt,
                                              self.parent_handle)

IPKernelApp.launch_instance(kernel_class=SageKernel)
_EOF_
Now edit the kernel.json (in its existing location under share\jupyter\kernels\sagemath) to read:
{"display_name": "SageMath 8.2", "argv": ["C:\\Program Files\\SageMath 8.2\\runtime\\bin\\bash.exe", "-l", "/opt/sagemath-8.2/local/share/jupyter/kernels/sagemath/kernel-wrapper.sh", "{connection_file}"]}
This runs kernel-wrapper.sh which in turn runs kernel-wrapper.py. (There are a few simplifications I could make to get rid of the need for kernel-wrapper.sh completely, but that would be easier in SageMath 8.3 which includes PyCygwin.)
Make sure to change every "8.2" to the appropriate "X.Y" version for your Sage installation.
Update: Made some updates thanks to feedback from a user, but I haven't tested these changes yet, so rather than blindly copy/pasting, please make sure that every file/directory path in my instructions exists and looks correct.
As you can see, this was not trivial, and was never by design meant to be possible. But it can be done. Once the kernel itself is up and running it's just a matter of talking to it over TCP/IP sockets so there's not too much magic involved after that. I believe there are some small improvements that could be made on both the Jupyter side and on the Sage side that would facilitate this sort of thing in the future...
I have been stuck for two days trying to set up a small automatic deployment script.
The thing is: I have been using Git for some months now, but always locally and just by myself, just for easily saving versions of my code. All good up to here.
Now I have to find a way to "publish" the code as soon as new functionalities are implemented and I think the code is stable enough.
Searching around, I've discovered these 'hooks', which are scripts that are executed by Git in certain situations. Basically the idea is to keep my master branch in sync with my published code, so that every time I merge a branch into master and push, the files are automatically copied into '/my/published/folder'.
That said, I've found this tutorial that explains how to do exactly what I want using a post-receive hook script, which is written in Ruby. Since at my studio I don't have and don't want to use Ruby at this time, I found a Python version of the same script.
I tested and tested, but I couldn't make it work. I keep getting the same error:
remote: GIT_WORK_TREE is not recognized as an internal or external command,
Consider that this is based on the tutorial I've shared above: same project name, same structure, etc.
I even installed Ruby on my personal laptop and tried the original script, but it still doesn't work...
I'm using Windows, and the Git environment variable is set and accessible. But nevertheless it seems like the GIT_WORK_TREE command is not recognized. If I run it from Git Bash it works just fine, but if I use the Windows shell I get the same error message.
I suppose that when my py script uses the call() function, it runs the command using the Windows shell. That's my guess, but I don't really know how to solve it. Google didn't help, as if no one ever had this problem before.
Maybe I'm just not seeing something obvious here, but I spent the whole day on this and I cannot get out of this bog!
Does anyone know how to solve it, or at least have an idea for a workaround?
Hope someone can help...
Thanks a lot!
The Ruby script you are talking about generates a "bash" command:
GIT_WORK_TREE=/deploy/path git checkout -f ...
It means: define the environment variable "GIT_WORK_TREE" with the value "/deploy/path" and execute "git checkout -f ...".
As I understand it, this doesn't work in the Windows command line.
Try to use something like:
set GIT_WORK_TREE=c:\temp\deploy && git checkout -f ...
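Since the OP is invoking git from a Python script via call(), another option is to set the variable through the env argument of subprocess.call instead of relying on shell-specific syntax; the deploy path below is just an example:
import os
import subprocess

env = os.environ.copy()
env["GIT_WORK_TREE"] = r"c:\temp\deploy"  # example deploy path

# git picks GIT_WORK_TREE up from the environment, so no `set ... &&` prefix is needed.
subprocess.call(["git", "checkout", "-f"], env=env)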
I've had this problem as well. The best solution I've found is to pass the working tree across as one of the parameters:
git --work-tree="/deploy/path" checkout -f ...
I'm trying to run a python script from bamboo. I created a script task and wrote inline "python myFile.py". Should I be listing the full path for python?
I changed the working directory to the location of myFile.py, so that is not a problem. Is there anything else I need to do within the configuration plan to properly run this script? It isn't running, but I know it should be, because the script works fine from the terminal on my local machine. Thanks
I run a lot of python tasks from bamboo, so it is possible. Using the Script task is generally painless...
You should be able to use your script task to run the commands directly and have stdout written to the logs. Since this is true, you can run:
'which python' -- outputs the path of the python executable that is being run.
'pip list' -- outputs a list of the modules that are installed with pip.
You should verify that the output from the above commands matches the output when run on the server. I'm guessing they won't match up, and once that is addressed, everything will work fine.
If not, comment back and we can look at a few other things.
For the future, there are a handful of different ways you can package things with python which could assist with this problem (e.g. automatically installing missing modules, etc).
You can also use the Script Task directly with an inline Python script to run your myFile.py:
/usr/bin/python <<EOF
print "Hello, World!"
EOF
Check this page for a more complex example:
https://www.langhornweb.com/display/BAT/Run+Python+script+as+a+Bamboo+task?desktop=true&macroName=seo-metadata
EDIT: I got it working. I went into the pycassa directory and typed python pycassaShell, but the second part of my question (at the bottom) is still valid: how do I run a script in pycassaShell?
I recently installed Cassandra and pycassa and followed the instruction from here.
They work fine, except I can't get pycassaShell to load. When I type pycassaShell at the command prompt, I get:
'pycassaShell' is not recognized as an internal or external command,
operable program or batch file.
Do I need to set up a path for it?
Also, does anyone know if you can run ddl scripts using pycassaShell? It is for this reason that I want to try it out. At the moment, I'm doing all my ddl in the cassandra CLI, I'd like to be able to put it in a script to automate it.
You probably don't want to be running scripts with pycassaShell. It's designed more as an interactive environment to quickly try things out. For serious scripts, I recommend just writing a normal python script that imports pycassa and sets up the connection pool and column families itself; it should only be an extra 5 or so lines.
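A minimal sketch of such a standalone script might look like this; the keyspace, column family, and server address are placeholders:
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

# Placeholder keyspace/column family names and server; adjust to your cluster.
pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = ColumnFamily(pool, 'MyColumnFamily')

cf.insert('row_key', {'column_name': 'value'})
print(cf.get('row_key'))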
However, there is an (undocumented, I just noticed) optional -f or --file flag that you can use. It will essentially run execfile() on that script after startup completes, so you can use the SYSTEM_MANAGER and CF variables that are already set up in your script. This is intended primarily to be used as a prep script for your environment, similar to how you might use a .bashrc file (I don't know of a Windows equivalent).
Regarding DDL statements, I suggest you look at the SystemManager class.
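For the DDL side, a hedged sketch using the SystemManager class might look like this (again, the keyspace and column family names are placeholders):
from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY

sys_mgr = SystemManager('localhost:9160')
# Create a keyspace and a column family; names are placeholders.
sys_mgr.create_keyspace('MyKeyspace', SIMPLE_STRATEGY, {'replication_factor': '1'})
sys_mgr.create_column_family('MyKeyspace', 'MyColumnFamily')
sys_mgr.close()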