I am in the process of building a Python package that can be used by Data Scientists to manage their MLOps lifecycle. Now, this package can be used either locally (usually on PyCharm) or on Databricks.
I want a certain functionality of the package to be dependent on where it is running, i.e. I want it to do something different if it is called by a Databricks notebook and something else entirely if it is running locally.
Is there any way I can determine where it is being called from?
I am a little doubtful as to whether we can use something like the following, which checks whether your code is running in a notebook, since this will be a package that is going to be stored in your Databricks environment:
How can I check if code is executed in the IPython notebook?
The workaround I've found to work is to check for Databricks-specific environment variables.
import os

def is_running_in_databricks() -> bool:
    # Databricks runtimes set DATABRICKS_RUNTIME_VERSION; it is absent in a local interpreter.
    return "DATABRICKS_RUNTIME_VERSION" in os.environ
Is there a possibility to run a Jupyter Notebook in one environment and then call a .py file (from within the JN) that lives in another environment, without pulling it over like normally happens?
Example:
from PythonScript1 import FunctionFromScript
Edit:
Because I see my problem was described unclearly, here are some further details and the background of my question:
I want to run a MATLAB file from a Jupyter notebook, but this only works under a condition that does not allow me to use TensorFlow in the same JN (using matlab.engine and installing TensorFlow at the same time conflicts).
My idea was to have the TensorFlow model in one .py file that lives in an Anaconda environment (and another directory) designed for it, while I have a JN in another Anaconda environment to call the MATLAB code.
You can also use SOS kernels in JupyterLab. SOS allows you to run multiple kernels in the same notebook and pass variables between the kernels. I was able to run Python and R kernels in a single notebook using SOS. You can use two Python kernels in your case: one with TF and one without.
P.S. I am not affiliated with SOS and am not promoting it. It worked for me and I thought I'd suggest this option.
No, it is not possible, because you can't have two interpreters on the same notebook. You can have two virtual environments and execute the notebook with one or the other, but you can't do it with both.
If you are talking about running a module written for another version of the Python interpreter, it depends on version compatibility.
I found a solution to my problem. If I build my (.py) script as a Flask app, then I can run it in a different environment (and directory) than my Jupyter Notebook. The only difference is that I can't call the function directly; I have to access the server and exchange my data with GET and POST requests. Thanks for the help everyone!
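For anyone in the same situation, a rough sketch of that setup could look like this (assuming the model sits behind a predict function and the server runs on port 5000; all names and the port are illustrative):

# model_service.py - run with the Anaconda environment that has TensorFlow
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(values):
    # Placeholder for the real TensorFlow model call.
    return [v * 2 for v in values]

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json()
    return jsonify(result=predict(payload["values"]))

if __name__ == "__main__":
    app.run(port=5000)

The Jupyter notebook in the other environment then only needs the requests library:

import requests

resp = requests.post("http://localhost:5000/predict", json={"values": [1, 2, 3]})
print(resp.json())  # {'result': [2, 4, 6]}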
I am facing a little challenge while trying to implement a solution using the new Repos functionality of Databricks. I am working in an interdisciplinary project which needs to be able to use Python and PySpark code. The Python team has already built some libraries which the PySpark team now also wants to use (e.g. preprocessing etc.). We thought that using the new Repos function would be a good compromise for collaborating easily. Therefore, we have added the ## Databricks notebook source comment to all library files so that they can easily be changed in Databricks (since Python development isn't finished yet, the code will also be changed by the PySpark team). Unfortunately, we ran into trouble with "importing" the library modules in a Databricks workspace directly from the repo.
Let me explain you our problem in an easy example:
Let this be module_a.py
## Databricks notebook source
def function_a(test):
    pass
And this module_b.py
## Databricks notebook source
import module_a as a
def function_b(test):
    a.function_a(test)
...
The issue is that the only way to import these modules directly in Databricks is to use
%run module_a
%run module_b
which will fail, since module_b is trying to import module_a, which is not on the Python path.
My idea was to copy the module_a.py and module_b.py files to DBFS or the local FileStore and then add that path to the Python path using sys.path.append(). Unfortunately, I didn't find any possibility to access the files from the repo via some magic commands in Databricks in order to copy them to the file store.
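To illustrate, this is roughly what I had in mind, assuming the two files could somehow end up under DBFS first (the FileStore path is just an example):

import sys

# Hypothetical: module_a.py and module_b.py already copied to DBFS, e.g. with
# dbutils.fs.cp("<some accessible source>", "dbfs:/FileStore/shared_libs/module_a.py")
sys.path.append("/dbfs/FileStore/shared_libs")

import module_a as a
import module_b as b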
(I do not want to clone the repo locally, since then I would need to push my changes every time before re-executing the code.)
Is there a way to access the repo directory somehow via a notebook itself, so that I can copy the files to DBFS/FileStore?
Is there another way to import the functions correctly? (Installing the repo as a library on the cluster is not an option, since the library will be changed during the process by the developers.)
Thanks!
This functionality isn't available on Databricks yet. When you work with notebooks in the Databricks UI, you work with objects located in the so-called control plane that is part of the Databricks cloud, while code that should be accessible as a Python package needs to be in the data plane that is part of the customer's cloud (see this answer for more details).
Usually people split the code into notebooks that are used as glue between configuration and business logic, and libraries that contain data transformations, etc. But libraries need to be installed onto clusters, and they are usually developed separately from notebooks (there are some tools that help with that, like cicd-templates).
There is also the libify package that tries to emulate Python packages on top of Databricks notebooks, but it's not supported by Databricks and I don't have personal experience with it.
P.S. I'll pass this feedback to the development team.
I have encountered a rather funny situation: I work in a big scientific collaboration whose major software package is based on C++ and Python (still 2.7.15). This collaboration also has multiple servers (SL6) to run the framework on. Since I joined the collaboration recently, I received instructions on how to set up the software and run it. All works perfectly on the server. Now, there are reasons not to connect to the server for simple tasks or code development; instead it is preferable to do these kinds of things on your local laptop. Thus, I set up a virtual machine (Docker) according to a recipe I received, installed a couple of things (fuse, cvmfs, docker images, etc.), and in this way managed to connect my MacBook (OSX 10.14.2) to the server, where some of the libraries need to be sourced in order for the software to be compiled and run. And after 2h it does compile! So far, so good.
Now comes the fun part: you run the software by executing a specific Python script which is fed another Python script as an argument. Not funny yet. But somewhere in this big list of Python scripts sourcing one another, there is a very simple task:
import logging
variable = logging.DEBUG
This is written inside a script that is called Logging.py. So the script and the library differ only by the first letter: l or L. On the server, this runs perfectly smoothly. In my local VM setup, I get the error
AttributeError: 'module' object has no attribute 'DEBUG'
I checked the Python versions (which python) and the location of the logging library (print logging.__file__), and in both setups I get the same result for both commands. So the same Python version is run and the same logging library is sourced, but in one case there is a mix-up with the name of the file that sources the library.
So I am wondering if there is some "convention file" (like a .vimrc for vi) sourced somewhere, where this issue could be resolved by setting some tolerance parameter to another value?
Thanks a lot for the help!
conni
As others have said, OSX treats file names as case-insensitive by default, so the logging module bundled with Python gets confused with your Logging.py file. I'd suggest the better fix would be to get the Logging.py file renamed, as this would improve the compatibility of the code base. Otherwise, you could create a "Case-sensitive" APFS file system using "Disk Utility".
If you go with creating a file system, I'd suggest not changing the root/system partition to case-sensitive, as this will break various programs in subtle ways. You could either repartition your disk and create a case-sensitive filesystem, or create an "Image" (this might be slower, I'm not sure by how much) and work in there. Just make sure you pick the "APFS (Case-sensitive)" format when creating the filesystem!
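If you want to confirm which behaviour you are getting before repartitioning, a small probe like the following works (a rough sketch; it checks whichever volume holds the current working directory, so run it from the directory containing Logging.py):

import os

# Create a lowercase probe file, then check whether the uppercase name resolves to it.
# On a case-insensitive volume the uppercase lookup succeeds, which is exactly why
# "import logging" can pick up Logging.py instead of the standard library module.
probe = "case_probe.txt"
open(probe, "w").close()
try:
    print("Case-insensitive filesystem:", os.path.exists(probe.upper()))
finally:
    os.remove(probe)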
I’ve been programming a NAO robot using Choregraphe 2.1.4 and I’ve been using Python boxes. I need a way to install Tweepy onto my virtual robot. I’ve tried installing it on my computer and then copying all the libraries over, but I seem to not be able to get the SSL libraries or whatever onto it.
Is there a way to SSH into my virtual robot or something? Thank you.
I don't know of a clean way of doing that (there may be one); what I would usually do would be something like:
1) Create a service package, for example with robot-jumpstarter
python jumpstart.py python-service tweety-service TweetyService
2) Include Tweepy and whatever other libraries are needed directly in this package
3) When using a virtual robot, start Choregraphe, get that robot's port (in "Preferences > Virtual robot"), and run your service (in a console or in your Python IDE) with
python scripts/tweetyservice.py --qi-url localhost:34674 (or whatever port you got from Choregraphe)
4) Then inside your behavior, call your service with self.session().service("TweetyService") like you would with any NAOqi service
5) When running on an actual robot, install your tweety-service package like you would any normal package, and it should work fine.
This technique also allows you to put more of your logic in the standalone Python code, and less in Choregraphe boxes (which can be convenient if you want to split your code up in several modules).
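For reference, the standalone service from step 3 could look roughly like the sketch below; it assumes the qi Python SDK that ships with NAOqi, and the class and method names are placeholders for whatever robot-jumpstarter generates:

# scripts/tweetyservice.py - sketch; run with:
#   python scripts/tweetyservice.py --qi-url localhost:<port from Choregraphe>
import sys
import qi
import tweepy  # bundled in the package or installed in the environment running this script

class TweetyService(object):
    # Illustrative service exposing a single call to the behavior.
    def ping(self):
        return "tweepy " + tweepy.__version__

if __name__ == "__main__":
    app = qi.Application(sys.argv)
    app.start()
    app.session.registerService("TweetyService", TweetyService())
    app.run()  # keep the service alive so Choregraphe boxes can call it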
How can I programmatically determine if the python code in my notebook is running under DSX?
I'd like to be able to do different things under a local Jupyter notebook vs. DSX.
While the method presented in another answer (look for specific environment variables) works today, it may stop working in the future. This is not an official API that DSX exposes. It will obviously also not work if somebody decides to set these environment variables on their non-DSX system.
My take on this is that "No, there is no way to reliably determine whether the notebook is running on DSX".
In general (in my opinion), notebooks are not really designed as artifacts that you can arbitrarily deploy anywhere; there will always need to be someone wearing the "application developer" hat to transform them. How to do that is something you could describe in a markdown cell inside the notebook.
You can print your environment or look for some specific environment variable. I am sure you will find some differences.
For example:
import os

if os.environ.get('SERVICE_CALLER'):
    print('In DSX')
else:
    print('Not in DSX')