How can I develop with Python libraries in editable mode on Databricks?

On Databricks, it is possible to install Python packages directly from a Git repo or from DBFS:
%pip install git+https://github/myrepo
%pip install /dbfs/my-library-0.0.0-py3-none-any.whl
Is there a way to enable a live package development mode, similar to the usage of pip install -e, such that the Databricks notebook references the library files as-is and it is possible to update the library files on the go?
E.g. something like
%pip install /dbfs/my-library/ -e
combined with a way to keep my-library up-to-date?
Thanks!

I would recommend adopting the Databricks Repos functionality, which allows you to import Python code into a notebook as a normal package, including automatic reload of the code when the package's source files change.
You need to add the following two lines to the notebook that uses the Python package you're developing:
%load_ext autoreload
%autoreload 2
Your library is recognized because the main folders of Databricks Repos are automatically added to sys.path. If your library is in a Repo subfolder, you can add it via:
import os, sys
sys.path.append(os.path.abspath('/Workspace/Repos/<username>/path/to/your/library'))
This works on the driver node, but not on worker nodes.
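If the code also has to run inside worker tasks, one possible workaround (a sketch, not part of the Repos feature itself; the path and the mylib name are placeholders) is to ship a snapshot of the package to the executors explicitly:
import shutil
# zip the Repo folder so it can be shipped to the executors; this assumes the
# package directory (e.g. mylib/) sits at the root of the zipped folder
zip_path = shutil.make_archive('/tmp/mylib', 'zip', '/Workspace/Repos/<username>/path/to/your/library')
# addPyFile distributes the archive to every worker and puts it on their sys.path,
# so `import mylib` also works inside RDD and UDF code
spark.sparkContext.addPyFile(zip_path)
Note that this ships a snapshot, so the zip has to be rebuilt after the library is edited.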
P.S. You can see examples in this Databricks cookbook and in this repository.

You can do %pip install -e at notebook scope, but you will need to repeat it every time the notebook is reattached. The code changes do not seem to reload with autoreload, since editable mode does not append the source path to sys.path; instead it places a link in site-packages.
However, editable mode at cluster scope does not seem to work for me.
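If you want to see where a notebook-scoped editable install actually landed, a quick check (a sketch; the exact file names depend on the pip version) is to look for the editable-install entries that pip drops into site-packages:
import site, glob
# editable installs are recorded in site-packages as __editable__*.pth / *.egg-link
# entries rather than by appending the source folder to the running kernel's sys.path
for sp in site.getsitepackages():
    print(sp, glob.glob(sp + '/__editable__*') + glob.glob(sp + '/*.egg-link'))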

I did some more tests, and here are my findings for pip install in editable mode:
(1) If I am currently working in /Workspace/xxx/Repo1 and run %pip install -e /Workspace/xxx/Repo2 at notebook scope, it only gets recognized on the driver node, not on the worker nodes when you run an RDD. A class function in Repo2 that I call from Repo1 works fine as long as it is used only on the driver node, but it fails on the worker nodes because they do not have /Workspace/xxx/Repo2 appended to sys.path. Apparently the worker nodes' path is out of sync with the driver node after a %pip install in editable mode.
(2) Manually appending /Workspace/xxx/Repo2 to sys.path while working in a notebook in /Workspace/xxx/Repo1: this also works only on the driver node, not on the worker nodes. To make it work on the workers, you need to append the same path inside every function you submit to the workers, which is not ideal.
(3) Installing /Workspace/xxx/Repo2 in editable mode from an init script: this works on both the driver and the worker nodes, because the environment path is set up at cluster init time. This is the best option in my opinion, as it ensures consistency across all notebooks. The only downside is that /Workspace is not mounted at cluster init time, so it is not accessible; I could only make it work with pip install -e /dbfs/xxx/Repo2 (see the sketch below).
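As a rough sketch of option (3), the init script can be generated from a notebook and then attached to the cluster; the script name and paths below are placeholders, and /dbfs/xxx/Repo2 is assumed to contain a setup.py or pyproject.toml:
# write a cluster init script to DBFS
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/install-editable.sh",
    """#!/bin/bash
pip install -e /dbfs/xxx/Repo2
""",
    True,  # overwrite
)
Then register dbfs:/databricks/init-scripts/install-editable.sh as a cluster-scoped init script in the cluster configuration and restart the cluster, so both the driver and the workers pick it up.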

Related

How to run GitHub Actions CI workflows locally from within a Python venv using the act tool? (FATA[0000]: .env is a directory)

I forked a repository to my GH page, cloned it, changed directory into it, and started a venv inside with python3.9 -m venv .env.
I want to use act to run GitHub Actions CI workflows locally every time before pushing commits to a PR. I installed this version of act using pikaur. However, when trying to list the available act commands with act -l (or any other command) from the project root dir, it gives the error FATA[0000] Error loading from ~/Development/cvat/.env: read ~/Development/cvat/.env: is a directory. The workflows are all there, and act is said to work out of the box if the workflow .yamls exist. I understand an env file is also required and should be created explicitly? Renaming the .env directory would clearly break everything within it, so that's not an option. What can I do at this point?
Using Arch Linux, Python 3.9 and act 0.2.31-1.

How to edit and debug a library in Python?

I have created my own library (package) and installed it in development mode using pip install -e.
Now I would like to edit the library's .py files and see the updates in a Jupyter notebook. Every time I edit one of the library's .py files, I close and reopen the IPython notebook to see the update. Is there an easier way to edit and debug the package's .py files?
Put this as the first cell of your notebooks:
%load_ext autoreload
%autoreload 2
More info in the doc.
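With autoreload active, a typical cell looks like the sketch below (mypackage and myfunction are placeholder names); after editing the .py file on disk, just re-run the cell and the new code is used without restarting the kernel:
from mypackage import myfunction
# edit myfunction in mypackage/*.py, then re-run this cell: autoreload re-imports
# the changed module before the call executes
print(myfunction())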
When you launch Jupyter, you initialize a Python kernel, and that locks your Python session to the state of the environment at the moment the kernel was loaded.
In this case, your kernel contains your locally installed (egg-linked) package as it was when you started Jupyter. Unfortunately, that means restarting Jupyter every time you update your local package.
@BlackBear has a great solution of using autoreload in your first cell:
%load_ext autoreload
%autoreload 2
A follow-up solution assumes you do not need to make changes to your notebooks, but just want updated outputs given changes to your package. One way I have gotten around this is an automated notebook-generation process using jupyter nbconvert and shell scripting: you essentially keep some Jupyter templates in a templates folder and re-execute them automatically every time you update your package.
An example script:
# clear outputs from previous runs
rm -f ./templates/*.nbconvert.ipynb
rm -f ./*.nbconvert.ipynb
# re-execute every template notebook against the freshly updated package
for file in templates/*.ipynb
do
    echo "$file"
    jupyter nbconvert --to notebook --execute "$file"
done
# move the executed copies out of the templates folder
mv ./templates/*.nbconvert.ipynb .
Assuming you want to actively debug your package, I would recommend writing test scripts that load a fresh kernel every time, e.g.:
# mytest.py
from mypackage import myfunction

inputs = {'some': 'input', 'values': 'here'}              # placeholder test inputs
expected_outputs = {'some': 'expected', 'outputs': 'here'}

if myfunction(inputs) == expected_outputs:
    print('Success')
else:
    print('Fail')
python3 mytest.py

How to add another path to the Python Path on AWS SageMaker

I've been trying to add one of the folders where I keep my Python modules to the Python path and, so far, I haven't been able to do it through AWS's terminal. The folder with the .py files is inside the main SageMaker folder, so I'm trying (I've also tried it with SageMaker/zds, which is the folder that holds the modules):
export PYTHONPATH="${PYTHONPATH}:SageMaker/"
After printing the directories on the PYTHONPATH through the terminal with python -c "import sys; print('\n'.join(sys.path))", I can see that my new path is indeed included. However, when I try to import any module from any notebook (with from zds.module import * or from module import *), I get an error that the module doesn't exist. If I print the paths directly inside the notebook, I no longer see the previously added path in the list.
Am I missing something basic here, or is it not possible to add paths to the PYTHONPATH inside AWS SageMaker? For now, I'm having to put
import sys, os
sys.path.insert(0, os.path.abspath('..'))
inside basically every notebook as a workaround.
Adding this to the lifecycle script worked for me:
sudo -i <<'EOF'
touch /etc/profile.d/jupyter-env.sh
echo export PYTHONPATH="$PYTHONPATH:/home/ec2-user/SageMaker/repo-name/src" >> /etc/profile.d/jupyter-env.sh
EOF
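After restarting the notebook instance so the lifecycle configuration runs, a quick sanity check inside any notebook (using the path from the script above) is:
import sys
# the directory exported via PYTHONPATH in jupyter-env.sh should now appear here
print([p for p in sys.path if 'SageMaker' in p])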
Thanks for using Amazon SageMaker!
Copying from https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html:
Amazon SageMaker notebook instances use conda environments to implement different kernels for Jupyter notebooks. If you want to install packages that are available to one or more notebook kernels, enclose the installation commands with conda environment commands that activate the conda environment containing the kernel where you want to install the package.
For example, if you want to install a package only for the python3 environment, use the following code:
# This will affect only the Jupyter kernel called "conda_python3".
source activate python3
# Replace myPackage with the name of the package you want to install.
pip install myPackage
# You can also perform "conda install" here as well.
source deactivate
If you do the installation in the way suggested above, you should be able to import your package from the notebook kernel that corresponds to that environment. Let us know if it doesn't help.
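Once the package is installed that way and the conda_python3 kernel is selected in the notebook, a minimal check (myPackage is the placeholder name from the snippet above) is:
import myPackage
# confirms the import resolves from the conda_python3 environment's site-packages
print(myPackage.__file__)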

Cloning a package from GitHub and using it in Spyder

I would like to create a copy of a package from GitHub that I can edit and use in Spyder. I currently use the Anaconda package manager for my Python packages.
Here are the steps that I have taken so far:
fork repo
clone repo onto my local directory
The package is called 'Nilearn'. I currently use Anaconda and have installed nilearn via 'conda install nilearn'.
I would like to be able to use my own copy of nilearn inside Spyder alongside the installed nilearn. I have tried renaming the repo to nilearn_copy, but this doesn't appear to work.
If this is not possible or not the ideal solution, then please suggest an alternative; I'm new to GitHub and Python.
Thanks a lot,
Joe
You need to open an IPython console, then run this command
In [1]: %cd /path/to/nilearn/parent
By this I mean that you go with the %cd magic to the parent directory where nilearn is located. After that you can run
In [2]: import nilearn
and that should import your local copy of nilearn.
Note: If you are planning to do changes to nilearn and want your changes to be picked up in the same console, you need to run these commands before the previous ones:
In [3]: %load_ext autoreload
In [4]: %autoreload 2
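Since the conda-installed nilearn may still be on the path, a quick check that the local clone is the one actually being imported is:
In [5]: import nilearn
In [6]: nilearn.__file__   # should point into your cloned directory, not into Anaconda's site-packages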

Azure Functions: Installing Python modules and extensions on the Consumption plan

I am trying to run a python script with Azure functions.
I had success updating the Python version and installing modules on Azure Functions under the App Service plan, but I need to use the Consumption plan, since my script will execute only once a day and for only a few minutes, so I want to pay only for the execution time. See: https://azure.microsoft.com/en-au/services/functions/
I'm still new to this, but from my understanding the Consumption plan spins up the VM and terminates it after your script has executed, unlike the App Service plan, which is always on.
I am not sure why this would mean that I can't install anything on it. I thought it would just mean I have to install it every time it spins up.
I have tried installing modules through the Python script itself and the Kudu command line with no success.
Under the App Service plan it was simple, following this tutorial: https://prmadi.com/running-python-code-on-azure-functions-app/
On the Functions Consumption plan, Kudu extensions are not available. However, you can update pip to install all your dependencies correctly:
Create your Python script on Functions (let's say NameOfMyFunction/run.py)
Open a Kudu console
Go to the folder of your script (should be d:/home/site/wwwroot/NameOfMyFunction)
Create a virtualenv in this folder (python -m virtualenv myvenv)
Activate this venv (cd myvenv/Scripts and call activate.bat)
Your shell prompt should now be prefixed with (myvenv)
Update pip (python -m pip install -U pip)
Install what you need (python -m pip install flask)
Now in the Azure Portal, in your script, update the sys.path to add this venv:
import sys, os.path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'myvenv/Lib/site-packages')))
You should be able to start what you want now.
(Reference: https://github.com/Azure/azure-sdk-for-python/issues/1044)
Edit: reading the previous comment, it seems you need numpy. I just tested it right now and was able to install 1.12.1 with no issues.
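As a quick sanity check (a sketch using the placeholder names from the steps above), the function code can confirm that the venv's site-packages was picked up once the sys.path line has run:
import sys
# 'myvenv' is the placeholder venv name created in the Kudu console above
print([p for p in sys.path if 'myvenv' in p])
import flask
print(flask.__file__)  # should resolve from myvenv/Lib/site-packages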
You may upload the modules for the Python version of your choice in Consumption Plan. Kindly refer to the instructions at this link: https://github.com/Azure/azure-webjobs-sdk-script/wiki/Using-a-custom-version-of-Python
This is what worked for me:
Disclaimer: I use a C# function that includes Python script execution, run from the command line with the System.Diagnostics.Process class.
Add the relevant Python extension for the Azure Function from the Azure Portal:
Platform Features -> Development Tools -> Extensions
It installed Python to D:\home\python364x86 (as seen from the Kudu console).
Add an application setting called WEBSITE_USE_PLACEHOLDER and set its value to 0. This is necessary to work around an Azure Functions issue that causes the Python extension to stop working after the function app is unloaded.
See: Using Python 3 in Azure Functions question.
Install the packages from the Kudu command-line console using pip install ...
(in my case it was pip install pandas)
