Azure Data Factory run Databricks Python Wheel - python

I have a python package (created in PyCharm) that I want to run on Azure Databricks. The python code runs with Databricks from the command line of my laptop in both Windows and Linux environments, so I feel like there are no code issues.
I've also successfully created a python wheel from the package, and am able to run the wheel from the command line locally.
Finally I've uploaded the wheel as a library to my Spark cluster, and created the Databricks Python object in Data Factory pointing to the wheel in dbfs.
When I try to run the Data Factory Pipeline, it fails with the error that it can't find the module that is the very first import statement of the main.py script. This module (GlobalVariables) is one of the other scripts in my package. It is also in the same folder as main.py; although I have other scripts in sub-folders as well. I've tried installing the package into the cluster head and still get the same error:
ModuleNotFoundError: No module named 'GlobalVariables'
Tue Apr 13 21:02:40 2021 py4j imported
Has anyone managed to run a wheel distribution as a Databricks Python object successfully, and did you have to do any trickery to have the package find the rest of the contained files/modules?
Your help greatly appreciated!
Configuration screen grabs:

We run pipelines using egg packages, but it should be similar for wheels. Here is a summary of the steps:
1. Build the package with python setup.py bdist_egg.
2. Place the egg/whl file and the main.py script into Databricks FileStore (dbfs).
3. In Azure Data Factory's Databricks Activity, go to the Settings tab.
4. In Python file, set the dbfs path to the python entrypoint file (the main.py script).
5. In the Append libraries section, select type egg/wheel and set the dbfs path to the egg/whl file.
6. Select pypi and add all the dependencies of your package. It is recommended to specify the versions used in development.
Ensure the GlobalVariables module code is inside the egg; as you are working with wheels, try using them in step 5 (never tested this myself). A minimal packaging sketch follows below.
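If the ModuleNotFoundError persists, it usually means GlobalVariables.py never made it into the built wheel/egg. A minimal setup.py sketch for that kind of layout might look like this (the package name is hypothetical; adjust the module and package lists to your project):
# setup.py - a minimal sketch; "my_databricks_job" is a hypothetical name.
from setuptools import setup, find_packages

setup(
    name="my_databricks_job",
    version="0.1.0",
    packages=find_packages(),        # sub-folders are picked up only if they contain an __init__.py
    py_modules=["GlobalVariables"],  # top-level modules that sit next to setup.py
)
After building with python setup.py bdist_wheel, you can list the archive contents (for example unzip -l dist/*.whl) to confirm that GlobalVariables.py is actually inside; if it is not, the cluster has no way to import it.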

Related

Getting error No module named 'azure.storage.blob.blockblobservice' while executing on server

I have a script that runs perfectly fine on my local machine using Anaconda
from azure.storage.blob.blockblobservice import BlockBlobService
I installed it via: pip install azure-storage-blob.
I migrated the script to a server and first did pip install azure-storage-blob. This ran without any issues. Now when I execute from azure.storage.blob.blockblobservice import BlockBlobService, I get the error No module named 'azure.storage.blob.blockblobservice'.
I went into the site-packages folder on the server and could not find the file "blockblobservice.py" under the azure/storage/blob folder. Below is the list of files and folders I see under this folder on the server:
__init__.py
_blob_service_client.py
_blob_client.py
_deserialize.py
_container_client.py
_lease.py
_download.py
_shared_access_signature.py
_serialize.py
_models.py
_version.py
_upload_helpers.py
_generated
aio
_shared
__pycache__
pip freeze | grep azure returns below information:
azure-common==1.1.25
azure-core==1.6.0
azure-nspkg==3.0.2
azure-storage-blob==12.3.2
azure-storage-nspkg==3.1.0
Thanks in advance, for your help in resolving this!
azure.storage.blob.blockblobservice is part of the older Azure Storage SDK (azure-storage) and not the newer one (azure-storage-blob).
I believe the reason the code works on your machine is that the older SDK is still present there. You can confirm this by going into the site-packages/azure/storage/blob folder on your local machine; you should see the blockblobservice.py file there.
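If upgrading the code to the newer SDK is an option, the v12 replacement for BlockBlobService is BlobServiceClient. A minimal sketch (the connection string, container and blob names below are placeholders):
from azure.storage.blob import BlobServiceClient

# Placeholders - substitute your own connection string, container and blob names.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="my-container", blob="example.txt")

blob.upload_blob(b"hello from the v12 SDK", overwrite=True)   # upload
print(blob.download_blob().readall())                         # download
Alternatively, pinning the older package (for example azure-storage-blob==2.1.0) should keep BlockBlobService importable without code changes.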

Install Python modules in Azure Functions

I am learning how to use Azure functions and using my web scraping script in it.
It uses BeautifulSoup (bs4) and pymysql modules.
It works fine when I tried it locally in the virtual environment as per this MS guide:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-create-first-azure-function-azure-cli?pivots=programming-language-python&tabs=cmd%2Cbrowser#run-the-function-locally
But when I create the function App and publish the script to it, Azure Functions logs give me this error:
Failure Exception: ModuleNotFoundError: No module named 'pymysql'.
It must happen when attempting to import it.
I really don't know how to proceed, where should I specify what modules it needs to install?
Check that you have generated a requirements.txt listing all of the modules your function needs. When you deploy the function to Azure, the modules in requirements.txt are installed automatically.
You can generate the requirements.txt file locally with the command below:
pip freeze > requirements.txt
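For the two modules mentioned in the question, the generated file might look something like this (version numbers are illustrative; pip freeze pins whatever is installed in your environment and will usually list more packages than these):
beautifulsoup4==4.9.3
PyMySQL==1.0.2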
And then deploy the function to azure by running the publish command:
func azure functionapp publish hurypyfunapp --build remote
For more information about deploying a Python function from local to Azure, please refer to this tutorial.
By the way, if you use the consumption plan for your Python function, "Kudu" is not available. If you want to use "Kudu", you need to create an App Service plan for it rather than a consumption plan.
Hope it helps~
You need to upload the installed modules when deploying to Azure. You can upload them using Kudu:
https://github.com/projectkudu/kudu/wiki/Kudu-console
As an alternative, you can also open the Kudu console and run pip install there.
Install Python packages from the Python code itself with the following snippet (tried and verified on Azure Functions):
def install(package):
    # Install a package at runtime if it cannot be imported.
    from importlib import import_module
    try:
        import_module(package)
    except ImportError:
        from sys import executable as se
        from subprocess import check_call
        check_call([se, '-m', 'pip', '-q', 'install', package])

# Note: the pip name and the import name can differ (beautifulsoup4 installs
# the bs4 module), in which case the import check fails and pip simply runs
# again; pip skips packages that are already installed.
for package in ['beautifulsoup4', 'pymysql']:
    install(package)
The libraries named in the list get installed when the Azure Function is triggered for the first time. For subsequent triggers, you can comment out or remove the installation code.

Python Import error on installing ruamel.yaml in custom directory

I am using Python 2.7.13 and am facing problems importing ruamel.yaml when I install it in a custom directory.
ImportError: No module named ruamel.yaml
The command used is as follows:
pip install --target=Z:\XYZ\globalpacks ruamel.yaml
I have added this custom directory to PYTHONPATH env variable
and also have a .pth file in this location with the following lines
Z:\XYZ\globalpacks\anotherApp
Z:\XYZ\globalpacks\ruamel
There is another app installed similarly with the above settings
and it works.
What am I missing here?
PS: It works when I install into the site-packages folder. It also worked in the custom folder when I created an __init__.py file in the ruamel folder.
EDIT:
Since our content creation software uses Python 2.7, we are restricted to using the same. We have chosen to install the same version of Python on all machines and set import paths to point to modules/apps located on a shared network drive.
Like mentioned, it works in Python's site-packages but not on the network drive, which is on the PYTHONPATH env variable.
The ruamel.yaml-**.nspkg.pth and ruamel.ordereddict-*-nspkg.pth files are dutifully installed. Sorry for not giving complete details earlier. Your inputs are much appreciated.
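One quick diagnostic (not a fix) is to check at runtime whether the custom directory is on sys.path at all and which copy of ruamel.yaml, if any, gets imported; the Z:\XYZ\globalpacks path below is the one from the question:
import sys, pprint

pprint.pprint(sys.path)        # is Z:\XYZ\globalpacks actually listed here?

import ruamel.yaml             # still raises ImportError if the namespace
                               # package is not resolved from that folder
print(ruamel.yaml.__file__)    # shows which copy was picked up
Note that the *-nspkg.pth files are only processed for real site directories, which is one likely reason a directory added via PYTHONPATH behaves differently from site-packages.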

Add python modules to Azure

When I run a python script that I set up on WebJobs in Azure - I get the following error:
import MySQLdb
ImportError: No module named MySQLdb
job failed due to exit code 1
I found some articles that seem to suggest installing python modules to a directory created on the web app. How/where would I install those modules?
Have you tried this?
http://nicholasjackson.github.io/azure/python/python-packages-and-azure-webjobs/
(from the site):
Step 1.
If you are using OSX and the default Python 2.7 install, your packages installed with pip will be in /usr/local/lib/python2.7/site-packages. Create a folder called site-packages in the root of your python job and copy any packages you need for your job into it.
Step 2
Next you need to modify your run.py or any other file which requires access to the package files. At the top of the file add:
import sys
sys.path.append("site-packages")
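Putting the two steps together, the top of run.py might look like this (MySQLdb is just the module from the error message above; site-packages is the folder created in Step 1):
import sys

# "site-packages" is the folder copied into the root of the WebJob in Step 1.
sys.path.append("site-packages")

import MySQLdb  # now resolved from the bundled site-packages folder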

Shipping Python modules in pyspark to other nodes

How can I ship C compiled modules (for example, python-Levenshtein) to each node in a Spark cluster?
I know that I can ship Python files in Spark using a standalone Python script (example code below):
from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])
But in situations where there is no '.py', how do I ship the module?
If you can package your module into a .egg or .zip file, you should be able to list it in pyFiles when constructing your SparkContext (or you can add it later through sc.addPyFile).
For Python libraries that use setuptools, you can run python setup.py bdist_egg to build an egg distribution.
Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).
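A minimal sketch of the egg route, using python-Levenshtein from the question (the egg file name and path are hypothetical and depend on how and where you built it):
from pyspark import SparkContext

sc = SparkContext("local", "App Name")

# Built beforehand with: python setup.py bdist_egg  (file name is illustrative)
sc.addPyFile("dist/python_Levenshtein-0.12.0-py2.7-linux-x86_64.egg")

def distance(pair):
    # Import inside the task so workers resolve it after the egg has shipped.
    import Levenshtein
    return Levenshtein.distance(*pair)

print(sc.parallelize([("kitten", "sitting")]).map(distance).collect())   # [3]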
There are two main options here:
If it's a single file or a .zip/.egg, pass it to SparkContext.addPyFile.
Add a pip install step to the bootstrap code for the cluster's machines.
Some cloud platforms (Databricks in this case) have a UI to make this easier.
People also suggest using the Python shell to test whether the module is present on the cluster.
