I am facing a small challenge while trying to implement a solution using the new Repos functionality of Databricks. I am working on an interdisciplinary project that needs to use both Python and PySpark code. The Python team has already built some libraries that the PySpark team now also wants to use (e.g. for preprocessing etc.). We thought that using the new Repos feature would be a good compromise for easy collaboration. Therefore, we have added the ## Databricks notebook source header to all library files so that they can easily be changed in Databricks (since the Python development isn't finished yet, the code will also be changed by the PySpark team). Unfortunately, we have run into trouble with "importing" the library modules in a Databricks workspace directly from the repo.
Let me explain our problem with a simple example:
Let this be module_a.py:
## Databricks notebook source
def function_a(test):
    pass
And this module_b.py:
## Databricks notebook source
import module_a as a

def function_b(test):
    a.function_a(test)
    ...
The issue is that the only way to import these modules directly in Databricks is to use
%run module_a
%run module_b
which will fail because module_b tries to import module_a, which is not on the Python path.
My idea was to copy the module_a.py and module_b.py files to DBFS or the local FileStore and then add that path to the Python path using sys.path.append(). Unfortunately, I didn't find any possibility to access the files from the repo via some magic commands in Databricks in order to copy them to the FileStore.
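To make the idea concrete, this is a minimal sketch of what I have in mind, assuming module_a.py and module_b.py had already been copied to an illustrative FileStore path (nothing here exists yet):

import sys

# Assumed location of the copied files; the path is illustrative only.
sys.path.append("/dbfs/FileStore/lib")

import module_a as a
import module_b as b

b.function_b("test")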
(I do not want to clone the repo, since then I would need to push my changes every time before re-executing the code.)
Is there a way to access the repo directory via a notebook itself, so that I can copy the files to DBFS/FileStore?
Is there another way to import the functions correctly? (Installing the repo as a library on the cluster is not an option, since the library will be changed by the developers during the process.)
Thanks!
This functionality isn't available on Databricks yet. When you work with notebooks in the Databricks UI, you work with objects located in the so-called control plane that is part of the Databricks cloud, while code that should be accessible as a Python package needs to live in the data plane that is part of the customer's cloud (see this answer for more details).
Usually people split the code into notebooks that are used as glue between configuration/business logic, and libraries that contain the data transformations, etc. But libraries need to be installed onto clusters and are usually developed separately from notebooks (there are tools that help with that, such as cicd-templates).
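To make the split concrete, here is a minimal sketch of such a "glue" notebook, assuming a hypothetical package my_transforms that was built as a wheel and installed on the cluster (the package name and paths are illustrative, not something Databricks provides):

# The notebook only wires things together; the reusable code lives in the installed library.
from my_transforms.preprocessing import clean_columns  # hypothetical installed package

df = spark.read.parquet("/mnt/raw/events")                        # illustrative input path
df_clean = clean_columns(df)                                      # transformation from the library
df_clean.write.mode("overwrite").parquet("/mnt/curated/events")   # illustrative output path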
There is also the libify package that tries to emulate Python packages on top of Databricks notebooks, but it isn't supported by Databricks, and I don't have personal experience with it.
P.S. I'll pass this feedback to development team.
Related
I am in the process of building a Python package that can be used by Data Scientists to manage their MLOps lifecycle. Now, this package can be used either locally (usually on PyCharm) or on Databricks.
I want a certain functionality of the package to be dependent on where it is running, i.e. I want it to do something different if it is called by a Databricks notebook and something else entirely if it is running locally.
Is there any way I can determine where it is being called from?
I am a little doubtful whether we can use something like the following, which checks if your code is running in a notebook or not, since this will be a package stored in your Databricks environment:
How can I check if code is executed in the IPython notebook?
The workaround I've found to work is to check for Databricks-specific environment variables.
import os

def is_running_in_databricks() -> bool:
    return "DATABRICKS_RUNTIME_VERSION" in os.environ
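Inside the package, the check can then be used to branch on environment-specific behaviour, for example (the paths below are purely illustrative):

# Hypothetical usage: pick a config location depending on where the code runs.
if is_running_in_databricks():
    config_path = "/dbfs/FileStore/mlops/config.yaml"   # illustrative DBFS path
else:
    config_path = "./config/local.yaml"                 # illustrative local path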
I'm trying to use fastai's libraries, but some of the data-access tools built into those libraries depend on HTML objects. For example, the DataBunch.show_batch method produces an HTML object that is easiest to use in Jupyter. I need to run my testing on GCP (or another cloud), and here are the issues:
fastai has some libraries that depend on Python 3.6 or greater (the new string formatting)
GCP doesn't have a good way to interface with Jupyter notebooks. I had it set up with some trouble, but then my computer needed a reformat, and now I am questioning whether I should set it up again. The previous method was largely based on this.
GCP apparently has something meant to provide an interface between it and Jupyter notebooks, called Datalab. However, Datalab does not support Python 3.6 or greater, per this link.
I see a few options:
Develop my own data visualization techniques by subclassing fastai's libraries and skip Jupyter altogether
Create a Jupyter-to-GCP interface in a different way, basically redoing the steps in the link in the second bullet point above.
Use one of the (Docker) containers that I keep hearing about on Datalab, which would allow me to use my own version of Python
Does anyone have other options for how I can make this connection? If not, can anyone provide other links for how to accomplish 1, 2, or 3?
You can follow this guide from fast.ai to create a VM with all the required libraries pre-installed. Then, following the same guide you can access JupyterLab or Jupyter Notebooks in this VM. It's simple, fast and comes with Python 3.7.3.
You could create a notebook using GCP's AI Platform Notebooks.
It should give you a one-click way to create a VM with all the libraries you need preinstalled. It'll even give you a URL that you can use to directly access your notebook.
I could not upload .py files to Google Colab in the first place, and I am also not sure how to import modules created within Google Colab or in the system environment.
I thought of creating and importing the modules on my system, but could not do it due to issues with PTVS; I couldn't figure out what the issue was.
I need to create a number of modules that can retrieve data, preprocess it, run queries, retrieve results, and rank them. I am not very comfortable with Python's module syntax. Is there a way to learn it intuitively as well?
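For reference, a minimal sketch of creating and importing a module from within Colab (the file name preprocess.py and its content are illustrative, not taken from the question); a file written to Colab's working directory can be imported directly because that directory is on sys.path, and the %%writefile cell magic can be used to the same effect:

from pathlib import Path

# Write a small module into the Colab working directory (contents are illustrative).
Path("preprocess.py").write_text(
    "def clean(text):\n"
    "    return text.strip().lower()\n"
)

import preprocess

print(preprocess.clean("  Hello World  "))  # -> "hello world"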
I am running ROS Indigo. I have what should be a simple problem: I have a utility class in my package that I want to be callable from our scripts. It only needs to be called within our own package; I don't need it to be available to other ROS packages.
I defined a class named HandControl in a file HandControl.py. All my attempts to import it, or to use it without importing, fail. Where in the catkin workspace do I put it -- the root of the package, or in scripts? Do I need an __init__.py anywhere (I have tried several places)?
It is good practice to follow the standards of both Python and ROS here. Scripts are typically placed in the /scripts directory and should not be imported by other Python scripts. Reusable Python code is an indication of a Python module. Python modules should be placed in /src/package_name, and you should create an __init__.py there as well. The module will then be available everywhere in your catkin workspace. There is a good chance this structure will help you organize things in the future, even if you don't seem to need it at the moment. Projects typically grow, and following guidelines helps to maintain good code. For more specific details, check out this Python doc.
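As a minimal sketch of that layout (the package name hand_control is assumed here, not taken from the question; the standard catkin setup.py pattern is shown, which also requires enabling catkin_python_setup() in CMakeLists.txt):

# Assumed layout (names are illustrative):
#
#   hand_control/
#     CMakeLists.txt            # call catkin_python_setup() here
#     package.xml
#     setup.py                  # the file below
#     src/hand_control/
#       __init__.py
#       HandControl.py          # defines class HandControl
#     scripts/
#       move_hand.py            # uses: from hand_control.HandControl import HandControl

# setup.py at the package root (standard catkin pattern):
from distutils.core import setup
from catkin_pkg.python_setup import generate_distutils_setup

setup_args = generate_distutils_setup(
    packages=["hand_control"],
    package_dir={"": "src"},
)
setup(**setup_args)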
Erica,
please see this school project, which was written in Python and runs on ROS Indigo. If you look in the /scripts folder, you can see an example of a custom class being called from other scripts. If you look into the launch files in /launch, you can see an example of configuring the ROS nodes - maybe that is your problem.
This issue has been driving me insane for the past few days.
So basically, I'm trying to port a pure Python project over to a proper PyCharm project, in order to improve code quality and project structure.
I wish it were as simple as creating a virtualenv to house everything, but it isn't. This project will eventually be developed simultaneously by multiple developers, with Git as source control, and the default libraries will be modified. I presume this means that the libraries should ideally be tracked by Git in the end. Virtualenv shouldn't help here as far as I know, because it's not portable between systems (or at least that's still being tested).
This project will also, in the future, be deployed to a CentOS server.
So the only plan I can think of to successfully pull this off would be to simply bring all of the libraries (which was done using pip install -t Libraries <ExampleLibrary>) into a single folder with an __init__.py inside, and use them from other Python files as a package within the PyCharm project.
Is this possible / recommended? I tried various methods to reference these libraries, but none of them work at runtime. Somehow, when the files in the library import something else from their own package, an ImportError is raised saying that there's no such module.
Will accept any other suggestions too.
Using PyCharm Community Edition.
EDIT: After having a good night's rest, I think the crux of the issue is really just project organization. Before I ported it over to PyCharm, the project worked as expected, but it had all of the Python files in the root directory and the libraries in a subfolder of the root, with every project file containing the same boilerplate code:
import os
import sys

# Make the bundled libraries in the "lib" subfolder (next to this file) importable.
absFilePath = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(1, absFilePath + "/lib")
I was hoping that by using PyCharm to help me flesh out the packages, I could avoid the repeated boilerplate code.
Note: Not full solution.
The addition of the template code below forces the file containing the code to be in the same directory as the libs folder.
For PyCharm, all I had to do was mark the libs folder as a source folder. Even with the addition of the template code to the file, the modified libraries still work as expected.
For the Python Shell, this template code is still needed:
import os
import sys
absFilePath = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(1, absFilePath + "/lib")