Access a Databricks secret in a custom Python package imported into a Databricks notebook - python

We have a custom Python package hosted in a private Bitbucket repo, which can be installed via %pip install git+https://... in any Databricks notebook.
One of its functions performs a number of operations and then pushes data to another location, for which credentials are required. When the function runs locally, this is handled via config files, but when it runs on Databricks we'd like to store these credentials in a Databricks secret (scope).
However, trying to do something like
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
from pyspark.dbutils import DBUtils
dbutils = DBUtils(spark)
pw = dbutils.secrets.get(scope = <SCOPE>, key = <KEY>)
within the module (foo) doesn't work and causes the following error:
>>> import myPackage
>>> myPackage.foo.bar()
Py4JJavaError: An error occurred while calling o272.get.
: java.lang.SecurityException: Accessing a secret from Databricks Connect requires
a privileged secrets token. To obtain such a token, you can run the following
in a Databricks workspace notebook:
Given that we would possibly like to run this function regularly via a scheduled job, creating a time-limited privileged token doesn't seem to be the way to go.
Is there a way to make this work or is there an alternative/better approach that we should be following instead?

For others that might run into the same problem in the future: using this code snippet within my function ended up working for me:
import IPython
dbutils = IPython.get_ipython().user_ns["dbutils"]
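For completeness, here is a sketch of how a package function could build on this, falling back to the local config file when it is not running on Databricks. The scope, key, config path, and section/option names below are placeholders, not part of the original question:

import configparser

def get_password(scope="my-scope", key="my-key", config_path="config.ini"):
    # On a Databricks cluster, the notebook's dbutils object already lives in
    # the IPython user namespace, so it can be reused from inside the package.
    try:
        import IPython  # available on Databricks; may be missing locally
        dbutils = IPython.get_ipython().user_ns["dbutils"]
        return dbutils.secrets.get(scope=scope, key=key)
    except (ImportError, AttributeError, KeyError):
        # Not running in a Databricks notebook: fall back to the config file
        # (the section and option names here are hypothetical).
        config = configparser.ConfigParser()
        config.read(config_path)
        return config["credentials"]["password"]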

Related

Azure Storage Blob Python module will NOT be recognized when running on Azure Batch node

I'm trying to run a Python script on Azure Batch nodes. One of the required imports in this script is
import azure.storage.blob
I need to use the BlobServiceClient class from that module, but when I try to access the BSC class, it tells me that no attribute of the name BSC (for short) exists in the azure.storage.blob module. Here are the things I have done:
I've run the script on my local machine, and it works perfectly
python3 --version returns 3.8.10 on my Azure nodes
I have installed the azure-storage-blob module on my Azure compute nodes (which are Linux nodes)
What might I need to do?
You have not prepared your compute node with the proper software, or the task execution environment does not have the proper context. Please read the documentation about Azure Batch jobs and tasks.
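One thing worth checking: BlobServiceClient only exists in azure-storage-blob version 12 and later; the older 2.x package exposes BlockBlobService instead, which would produce exactly this missing-attribute error. A minimal sketch you could run on the node to confirm its environment (the connection string is a placeholder you would supply yourself, e.g. via an environment variable):

import azure.storage.blob
# BlobServiceClient requires azure-storage-blob >= 12.
version = getattr(azure.storage.blob, "__version__", "unknown")
print("azure-storage-blob version:", version)

from azure.storage.blob import BlobServiceClient

# Placeholder connection string, for illustration only.
client = BlobServiceClient.from_connection_string("<STORAGE_CONNECTION_STRING>")
for container in client.list_containers():
    print(container.name)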

Cannot access pyspark in EMR cluster jupyter notebook

I've created a Spark cluster on EMR, but I'm unable to access pyspark when I open it with a notebook.
Example:
from pyspark import SparkContext
I also cannot access sc which I was under the impression would be available.
sc.list_packages()
NameError: name 'sc' is not defined
I feel like I'm missing something very basic here, but I'm completely new to EMR and have spent a bunch of time on this already.
Any ideas on what I should try to debug this?
When I opened my notebook with "JupyterLab" instead of "Jupyter", all libraries were available.
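For reference, once the notebook is attached to a PySpark kernel on a running EMR cluster, the preconfigured sc and spark objects should resolve without any import; a quick sanity check (using the same list_packages helper mentioned in the question):

# Run in a notebook cell attached to the PySpark kernel; sc and spark are
# injected by the kernel, so no import is needed.
print(sc.version)
print(spark.sparkContext.applicationId)
sc.list_packages()  # EMR Notebooks helper that lists Python packages on the cluster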

SageMaker notebook connected to EMR import custom Python module

I looked through similar questions but none of them solved my problem.
I have a SageMaker notebook instance and opened a SparkMagic PySpark notebook connected to an AWS EMR cluster. I also have a SageMaker repo called dsci-Python connected to this notebook.
Directory looks like:
/home/ec2-user/SageMaker/dsci-Python
/home/ec2-user/SageMaker/dsci-Python/pyspark_mle/datalake_data_object/SomeClass
/home/ec2-user/SageMaker/dsci-Python/Pyspark_playground.ipynb
There are __init__.py files under both the pyspark_mle and datalake_data_object directories, and I have no problem importing them in other environments.
When I run this code in Pyspark_playground.ipynb:
from pyspark_mle.datalake_data_object.SomeClass.SomeClass import Something
I get: No module named 'pyspark_mle'
I think this is an environment path thing.
The repo is on your Notebook Instance, whereas the PySpark kernel is executing code on the EMR cluster.
To access these local modules on the EMR cluster, you can clone the repository on the EMR cluster.
Also, SparkMagic has a useful magic, send_to_spark, which can be used to send data from the local notebook to the Spark kernel: https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Send%20local%20data%20to%20Spark.ipynb
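A rough sketch of making the package importable once it has been cloned onto the cluster; the paths and zip name below are assumptions, not something from the question:

# In a PySpark (SparkMagic) cell. Assumes the repo was cloned on the EMR
# master node and the package zipped first, e.g.:
#   cd /home/hadoop/dsci-Python && zip -r /home/hadoop/pyspark_mle.zip pyspark_mle
sc.addPyFile("/home/hadoop/pyspark_mle.zip")

from pyspark_mle.datalake_data_object.SomeClass.SomeClass import Something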

Cannot save Spark dataframe to Google Cloud Storage from PySpark

I have a Spark dataframe that I'm trying to save to a Google Storage bucket with the line
df.write.format("com.databricks.spark.csv").save('gs://some-test-bucket-delete-me')
But PySpark raises the following exception:
Py4JJavaError: An error occurred while calling o55.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
What I've tried:
The typical solution to this exception is to make sure that the HADOOP_CLASSPATH environment variable points at the gcs-connector-latest-hadoop2.jar file, which it does. I've tried using both the Hadoop version 1 and version 2 jars in case that was the problem. I've also tried explicitly pointing at it within the Jupyter notebook with
sc._jsc.hadoopConfiguration().set("spark.driver.extraClassPath", "/home/****/hadoop-2.8.2/share/hadoop/common/lib/gcs-connector-latest-hadoop1.jar")
to no avail.
If I try hadoop fs -ls gs://some-test-bucket-delete-me from bash, the command returns perfectly, which is supposed to indicate that the Google Cloud Storage connector works, but for some reason I can't seem to get this functionality to work in PySpark.
Things that may be important:
Spark Version 2.2.0
Python 3.6.1 :: Anaconda custom (64-bit)
I'm running PySpark locally
You should run gcloud init first.
Then try df.write.csv('gs://some-test-bucket-delete-me/file_name')
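If the ClassNotFoundException persists, the connector jar typically has to be on Spark's own classpath rather than only in HADOOP_CLASSPATH. Here is one possible way to set that up when running PySpark locally; the jar path is a placeholder, and the fs.gs.impl keys are the standard GCS connector settings:

from pyspark.sql import SparkSession

# Placeholder path to the GCS connector jar on the local machine.
GCS_JAR = "/path/to/gcs-connector-latest-hadoop2.jar"

spark = (
    SparkSession.builder
    .appName("gcs-write-test")
    .config("spark.jars", GCS_JAR)  # put the connector on the driver/executor classpath
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.csv("gs://some-test-bucket-delete-me/file_name")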

How do I send a message to a queue in Azure from an Azure function using Python?

I am using an Azure function with a timer trigger, and I am trying to query a database to return a list of dictionaries. I want to send each dictionary as a string (queue message) to a queue. I was going to do this with an output binding but could not figure out how, so I am using the azure module. The problem is that every message I send goes into the poison queue for some reason, and I cannot figure out why. Here is a code snippet of what I have in my Azure function.
import os
import platform
import WorkWithDatabase
#import base64
from azure.storage.queue import QueueService
acc='...ACCOUNT NAME'
key='...KEY'
#Connect to QueueService
queue_service = QueueService(account_name=acc, account_key=key)
#Pull missing data from the database,
#Call a function in another script to do this
missingList=WorkWithDatabase.ListRequests()
for item in missingList:
    queue_service.put_message('taskqueue', str(item))
Also, is there a way I can use the database as a resource in an Azure function with Python?
To use Python packages such as pyodbc or pymssql to connect to Azure SQL Database, you need to install a custom version of Python for your Azure Function, then install pip for that custom Python and use it to install the packages you want.
The steps are as follows.
Follow the document Using a custom version of Python to install a custom Python runtime in the path site\tools via Kudu, which you can access at https://<your function name>.scm.azurewebsites.net/DebugConsole.
After the custom Python installation, download a get-pip.py file into the site\tools path of the custom Python, then install the pip tool via the command python get-pip.py.
Then you can install these packages via Scripts/pip.exe install <package-names>.
Finally, you can import these packages and write your code in your Azure Function in the Azure portal.
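As a rough illustration of what the database call could then look like inside the function, assuming pymssql and an Azure SQL Database; the server, database, credentials, table, and helper name are all placeholders:

import pymssql

# Placeholder connection details; in practice store these in the Function App settings.
conn = pymssql.connect(
    server="<your-server>.database.windows.net",
    user="<user>@<your-server>",
    password="<password>",
    database="<your-database>",
)

def list_requests():
    # Return each row of a hypothetical Requests table as a dictionary.
    cursor = conn.cursor(as_dict=True)
    cursor.execute("SELECT * FROM Requests")
    rows = cursor.fetchall()
    cursor.close()
    return rows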
