Shipping Python modules in pyspark to other nodes - python

How can I ship C compiled modules (for example, python-Levenshtein) to each node in a Spark cluster?
I know that I can ship Python files in Spark using a standalone Python script (example code below):
from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])
But in situations where the module isn't a plain '.py' file, how do I ship it?

If you can package your module into a .egg or .zip file, you should be able to list it in pyFiles when constructing your SparkContext (or you can add it later through sc.addPyFile).
For Python libraries that use setuptools, you can run python setup.py bdist_egg to build an egg distribution.
Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).
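For example, a minimal sketch of both approaches (the .egg file name here is hypothetical, and the egg must be built for the same Python version and architecture as the worker nodes):
from pyspark import SparkContext

# Option 1: list the packaged dependency when the context is created.
sc = SparkContext("local", "App Name",
                  pyFiles=['dist/python_Levenshtein-0.12.0-py2.7-linux-x86_64.egg'])

# Option 2: attach it to an already-running context instead.
# sc.addPyFile('dist/python_Levenshtein-0.12.0-py2.7-linux-x86_64.egg')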

There are two main options here:
If it's a single file or a .zip/.egg, pass it to SparkContext.addPyFile.
Add a pip install step to the bootstrap code for the cluster's machines.
Some cloud platforms (Databricks, in this case) have a UI that makes this easier.
It is also worth using a Python shell to test whether the module is actually present on the cluster (see the sketch below).
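A rough sketch of such a check from a PySpark session (the module name and partition count are only examples; sc is the active SparkContext):
def worker_can_import(_):
    # Try the import inside a task so it runs on the executors, not the driver.
    try:
        import Levenshtein  # the C-compiled module we want shipped to every node
        return [True]
    except ImportError:
        return [False]

# Spread the check across several partitions so it hits the worker nodes.
results = sc.parallelize(range(8), 8).mapPartitions(worker_can_import).collect()
print(all(results))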

Related

Azure Data Factory run Databricks Python Wheel

I have a python package (created in PyCharm) that I want to run on Azure Databricks. The python code runs with Databricks from the command line of my laptop in both Windows and Linux environments, so I feel like there are no code issues.
I've also successfully created a python wheel from the package, and am able to run the wheel from the command line locally.
Finally I've uploaded the wheel as a library to my Spark cluster, and created the Databricks Python object in Data Factory pointing to the wheel in dbfs.
When I try to run the Data Factory Pipeline, it fails with the error that it can't find the module that is the very first import statement of the main.py script. This module (GlobalVariables) is one of the other scripts in my package. It is also in the same folder as main.py; although I have other scripts in sub-folders as well. I've tried installing the package into the cluster head and still get the same error:
ModuleNotFoundError: No module named 'GlobalVariables'
Tue Apr 13 21:02:40 2021 py4j imported
Has anyone managed to run a wheel distribution as a Databricks Python object successfully, and did you have to do any trickery to have the package find the rest of the contained files/modules?
Your help greatly appreciated!
Configuration screen grabs: (screenshots not included)
We run pipelines using egg packages, but it should be similar for wheels. Here is a summary of the steps:
1. Build the package with python setup.py bdist_egg
2. Place the egg/whl file and the main.py script into Databricks FileStore (dbfs)
3. In Azure Data Factory's Databricks Activity, go to the Settings tab
4. In Python file, set the dbfs path to the Python entrypoint file (the main.py script)
5. In the Append libraries section, select type egg/wheel and set the dbfs path to the egg/whl file
6. Select pypi and set all the dependencies of your package. It is recommended to specify the versions used in development.
Ensure the GlobalVariables module code is inside the egg. Since you are working with wheels, try using a wheel in step 5 instead (never tested this myself). A packaging sketch is shown below.
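A minimal setup.py sketch (the distribution name is hypothetical; adjust it to your project) that makes sure sibling modules such as GlobalVariables end up inside the built egg or wheel:
# setup.py -- illustrative only, not the asker's actual file
from setuptools import setup, find_packages

setup(
    name="my_pipeline",                 # hypothetical distribution name
    version="0.1.0",
    packages=find_packages(),           # picks up any folders containing __init__.py
    py_modules=["GlobalVariables"],     # include top-level single-file modules explicitly
)

# Build with:  python setup.py bdist_egg   (or bdist_wheel for a .whl)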

psutil library installation issue on databricks

I am using the psutil library on my Databricks cluster, and it was running fine for the last couple of weeks. When I started the cluster today, this specific library failed to install. I noticed that a different version of psutil had been released on the site.
Currently my Python script fails with 'No module psutil'.
I tried installing a previous version of psutil using pip install, but my code still fails with the same error.
Is there an alternative to psutil, or is there a way to install it on Databricks?
As far as I know, there are two ways to install a Python package on an Azure Databricks cluster, as below.
As in the two figures below (screenshots omitted), go to the Libraries tab of your cluster, click the Install New button, type the name of the package you want to install, and wait for the installation to complete successfully.
Open a notebook and run the shell command below to install a Python package via pip. Note: to install into the current Python environment of the Databricks cluster, not the Linux system environment, you must use /databricks/python/bin/pip, not plain pip.
%sh
/databricks/python/bin/pip install psutil
Finally, I ran the code below, and it works with either of the two approaches above.
import psutil
for proc in psutil.process_iter(attrs=['pid', 'name']):
    print(proc.info)
psutil.pid_exists(<a pid number in the printed list above>)
In addition to Peter's response, you can also use "Library utilities" to install Python libraries.
Library utilities allow you to install Python libraries and create an environment scoped to a notebook session. The libraries are available both on the driver and on the executors, so you can reference them in UDFs. This enables:
Library dependencies of a notebook to be organized within the notebook itself.
Notebook users with different library dependencies to share a cluster without interference.
Example: To install "psutil" library using library utilities:
dbutils.library.installPyPI("psutil")
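If you need to pin a specific version (for example to roll back after a broken release), installPyPI also takes a version argument, and restarting the notebook's Python process ensures the freshly installed package is picked up. A small sketch (the version number is only an example; this runs inside a Databricks notebook):
dbutils.library.installPyPI("psutil", version="5.7.0")  # pin an explicit version
dbutils.library.restartPython()  # restart the notebook's Python so the new package is importable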
Reference: Databricks - library utilities
Hope this helps.

How could I import an external jar library to ZEPPELIN in Hortonworks?

I have an HDP 2.5 cluster and I am working with Zeppelin's %pyspark interpreter to write code.
I want to use a library that helps with Time Series Analysis in Spark in Python, Java and Scala, which is described here: https://github.com/sryza/spark-timeseries
The problem is that I don't know how to import and use this library from my Zeppelin %pyspark interpreter.
First of all, I downloaded the .jar file named "sparkts-0.2.0-jar-with-dependencies.jar". Next, I saved it in the /opt/ directory on the cluster node where Zeppelin is running.
Then I tried using %dep, but it's deprecated in my current version of HDP, so I added a dependency in the Zeppelin "Interpreters" menu (screenshot omitted).
I restarted the interpreter and tried the following in a Zeppelin notebook:
%pyspark
import sparkts
But I got an error:
ImportError: No module named sparkts
So my question is: How could I import and use this .jar file to make Time Series analysis in my HDP cluster with ZEPPELIN?
Thank you so much!
Since it's a Python library, you need to pip install it on each node of your cluster if you're running Zeppelin on top of a cluster that uses a resource manager like YARN, where jobs could run on any node, and you're using an interpreter like Livy to distribute your job. If the library isn't available through pip, you can install it by running its setup.py (if it has one), or as a last resort supply the jar file directly to the PySpark shell with pyspark --jars <path-to-jar> (not a solution for Zeppelin, though).
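If the jar does end up on the Spark interpreter's classpath (for example via the interpreter dependency added above), the JVM side of the library can at least be reached from %pyspark through the py4j gateway; the Python sparkts bindings still have to be installed separately as described. A rough sketch, assuming the Scala classes live under com.cloudera.sparkts (check the contents of the downloaded jar):
%pyspark
# Ask the JVM to load one of the library's classes; this raises a py4j error
# if the jar is not actually on the driver's classpath.
clazz = sc._jvm.java.lang.Class.forName("com.cloudera.sparkts.models.ARIMA")
print(clazz.getName())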

Run python file on shared drive without installing python or python frameworks

I have placed a Python file on a shared drive that I want any user to be able to start. I don't want the user to install Python or any of the libraries needed (e.g. pandas).
I want it to be easy for the user to start the program. How should I do that?
I have tried to create a .bat file (Z: is the shared drive location; all users will have it mapped as their Z: drive):
#echo off
Z:\python27\python.exe Z:\main.py %*
pause
I tried installing Python on the shared drive (at the path specified above) and placed all the needed imports on the shared drive as well.
From my computer this is runnable. But from a users computer I get the error message:
ImportError: C extension: DLL load failed: The specified module could not be found. not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace' to build the C extensions first.
I have Python 2.7, the Teradata module and Pandas installed. What can I do to make this runnable?
PyPy does that out of the box.
CPython has an embeddable/redistributable bundle which should suit your use case.
Finally, there are these projects:
http://www.voidspace.org.uk/python/movpy/
http://portablepython.com/

Unable to install pandas on AWS Lambda

I'm trying to install and run pandas on an Amazon Lambda instance. I've used the recommended zip method of packaging my code file model_a.py and related python libraries (pip install pandas -t /path/to/dir/) and uploaded the zip to Lambda. When I try to run a test, this is the error message I get:
Unable to import module 'model_a': C extension:
/var/task/pandas/hashtable.so: undefined symbol: PyFPE_jbuf not built.
If you want to import pandas from the source directory, you may need
to run 'python setup.py build_ext --inplace' to build the C extensions
first.
Looks like an error involving a symbol in hashtable.so, which ships with the pandas install. Googling for this did not turn up any relevant articles. There were some references to a failure in the numpy installation, but nothing concrete. Would appreciate any help in troubleshooting this! Thanks.
I would advise you to use Lambda layers for additional libraries. The size of a Lambda function package is limited, but with layers you can go up to 250 MB (more here).
AWS has open-sourced a good package, including pandas, for dealing with data in Lambdas, and has also packaged it so that it is convenient to use as a Lambda layer. You can find instructions here.
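To make that concrete, a minimal handler sketch, assuming a pandas-providing layer (such as the AWS-provided one mentioned above) has already been attached to the function; attaching the layer is configuration, not code:
import json
import pandas as pd  # resolved from the attached Lambda layer, not from the deployment zip

def lambda_handler(event, context):
    # Tiny check that pandas (including its C extensions) loads inside Lambda.
    df = pd.DataFrame({"x": [1, 2, 3]})
    return {"statusCode": 200, "body": json.dumps({"sum_x": int(df["x"].sum())})}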
I have successfully run pandas code on Lambda before. If your development environment is not binary-compatible with the Lambda environment, you will not be able to simply run pip install pandas -t /some/dir and package it up into a Lambda .zip file. Even if you are developing on Linux, you may still run into compatibility issues.
So, how do you get around this? The solution is actually pretty simple: run your pip install on a lambda container and use the pandas module that it downloads/builds instead. When I did this, I had a build script that would spin up an instance of the lambci/lambda container on my local system (a clone of the AWS Lambda container in docker), bind my local build folder to /build and run pip install pandas -t /build/. Once that's done, kill the container and you have the lambda-compatible pandas module in your local build folder, ready to zip up and send to AWS along with the rest of your code.
You can do this for an arbitrary set of python modules by making use of a requirements.txt file, and you can even do it for arbitrary versions of python by first creating a virtual environment on the lambci container. I haven't needed to do this for a couple of years, so maybe there are better tools by now, but this approach should at least be functional.
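As a rough sketch of that build step (assuming Docker is available locally and using a lambci/lambda build image as the answer describes; the image tag and paths are illustrative):
# build_pandas_for_lambda.py -- illustrative sketch of the container-based pip install
import pathlib
import subprocess

build_dir = pathlib.Path("build").resolve()
build_dir.mkdir(exist_ok=True)

# Run pip inside a Lambda-like build container so the compiled .so files match
# the Lambda runtime, leaving the result in ./build ready to zip with your code.
subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{build_dir}:/build",
        "lambci/lambda:build-python3.8",
        "pip", "install", "pandas", "-t", "/build/",
    ],
    check=True,
)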
If you want to install it directly through the AWS Console, I made a step-by-step YouTube tutorial; check out the video here: How to install Pandas on AWS Lambda
