How to install GDAL on a Databricks cluster? (Python)

I am trying to install the GDAL package on an Azure Databricks cluster, but I cannot get it to work in any way.
Approaches that I've tried but that didn't work:
Via the Libraries tab of the corresponding cluster --> Install New --> PyPI (under Library Source) --> entered gdal under Package.
Tried all the approaches mentioned at https://forums.databricks.com/questions/13738/gdal-installation.html. None of them worked.
Details:
Runtime: 6.1 (includes Apache Spark 2.4.4, Scala 2.11). (With runtime 3.5 I got GDAL to work; however, an update to a higher runtime was necessary for other reasons.)
We're using Python 3.7.

Finally, we got it working by using an ML runtime in combination with the answer given at forums.databricks.com/answers/21118/view.html. Apparently the ML runtimes contain conda, which is needed for the answer given in that link.
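For reference, a conda-based install on an ML runtime typically looks something like the sketch below; the conda-forge channel is my assumption here, and the linked answer may pin a specific GDAL version instead.
%sh
# assumption: conda is available on the PATH, as on Databricks ML runtimes
conda install -y -c conda-forge gdal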

I have already replied to a similar question.
Please check the link below; it will help you install the required library:
How can I download GeoMesa on Azure Databricks?
For your convenience I am pasting the answer again; you just need to choose your required library in the search area.
You can install the GDAL library directly into your Databricks cluster.
1) Select the Libraries option; a new window will open.
2) Select the Maven option and click the 'Search Packages' option.
3) Search for the required library, select the library/jar version, and choose the 'Select' option.
That's it.
After the installation of the library/jar, restart your cluster. Now import the required classes in your Databricks notebook.
I hope it helps. Happy coding!
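For what it's worth, the GDAL Java bindings that come up in a Maven search are published under coordinates along the lines of the following (the version is illustrative, and as far as I know the jar still expects the native GDAL libraries to be present on the cluster):
org.gdal:gdal:3.0.0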

pip install https://manthey.github.io/large_image_wheels/GDAL-3.1.0-cp38-cp38-manylinux2010_x86_64.whl
It looks like you are able to use this whl file to install the package, but when running tasks like gdal.Translate it will not actually run. This is the farthest I've gotten.
The URL above was found while I was searching for the binaries that GDAL needs. As a note, you will have to run this every time you start your cluster.
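If the wheel does install, a quick way to check whether the bindings actually load is an import test in a notebook cell; osgeo is the standard import name for the GDAL Python bindings:
from osgeo import gdal
print(gdal.__version__)  # only prints if the bundled native libraries resolved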

Related

Amazon EMR pip install in bootstrap actions runs OK but has no effect

In Amazon EMR, I am using the following script as a custom bootstrap action to install Python packages. The script runs OK (I checked the logs; the packages installed successfully), but when I open a notebook in JupyterLab, I cannot import any of them. If I open a terminal in JupyterLab and run pip list or pip3 list, none of my packages are there. Even if I go to / and run find . -name mleap, for instance, it does not exist.
Something I have noticed is that on the master node I am constantly getting an error saying bootstrap action 2 has failed (there is no second action, only one). According to this, it is a rare error, which I get in all my clusters. However, my cluster eventually gets created and I can use it.
My script is called aws-emr-bootstrap-actions.sh
#!/bin/bash
sudo python3 -m pip install numpy scikit-learn pandas mleap sagemaker boto3
I suspect it might have something to do with a Docker image being deployed that invalidates my previous installs, but from my Google searches I gather it is common to use bootstrap actions to install Python packages, so it should work...
The PySpark Python interpreter that Spark uses is different from the one into which the OP was installing the modules (as confirmed in the comments).
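A quick way to confirm this kind of mismatch is to print the interpreter path from inside the notebook and then install against that exact interpreter; a minimal sketch (the package list is just the one from the question):
import sys
print(sys.executable)  # the interpreter the notebook kernel actually runs

# then, on the master node, install with that same interpreter, e.g.:
#   sudo /path/printed/above -m pip install numpy scikit-learn pandas mleap sagemaker boto3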

No module named pyb

I am working with an AD9833 signal generator and I am using a Raspberry Pi with it. I found a Python library for working with it.
The link to the library is https://github.com/KipCrossing/Micropython-AD9833
When I try to use this library as explained in the code below, I am not able to import the module 'pyb'. I am not able to link or install from pyb import Pin.
I tried various approaches. I tried this link, https://pybuilder.github.io/documentation/tutorial.html#.Xd6HOOhKiUk, which describes virtual environments. I was able to get things set up as described there, but ultimately, after many hours of trying, I am at the same place where I started. Please guide me; I am new to programming.
I even installed MicroPython as suggested, but the term from pyb import Pin is still not recognized.
To test your MicroPython installation, try installing the PyBoard package:
pip install PyBoard
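Note that, as far as I know, pyb is a module built into MicroPython firmware on boards such as the Pyboard; it is not something you can pip-install into regular CPython on a Raspberry Pi. A minimal sketch to check which environment you are actually in:
try:
    from pyb import Pin  # built into MicroPython firmware on a Pyboard
    print("running MicroPython with pyb available")
except ImportError:
    print("pyb is not available here (e.g. regular CPython on the Pi)")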

psutil library installation issue on databricks

I am using the psutil library on my Databricks cluster, which had been running fine for the last couple of weeks. When I started the cluster today, this specific library failed to install. I noticed that a different version of psutil had been published on the site.
Currently my Python script fails with 'No module psutil'.
I tried installing a previous version of psutil using pip install, but my code still fails with the same error.
Is there any alternative to psutil, or is there a way to install it on Databricks?
As far as I know, there are two ways to install a Python package on an Azure Databricks cluster, as below.
Go to the Libraries tab of your cluster, click the Install New button, type the name of the package you want to install, and wait for the installation to finish successfully.
Open a notebook and run the shell command below to install a Python package via pip. Note: to install into the current Python environment of the Databricks cluster, rather than the system environment of Linux, you must use /databricks/python/bin/pip, not plain pip.
%sh
/databricks/python/bin/pip install psutil
Finally, I ran the code below, and it works for both of the ways above.
import psutil
for proc in psutil.process_iter(attrs=['pid', 'name']):
    print(proc.info)
# check any pid from the printed list above
psutil.pid_exists(<a pid number in the printed list above>)
In addition to Peter's response, you can also use library utilities to install Python libraries.
Library utilities allow you to install Python libraries and create an environment scoped to a notebook session. The libraries are available both on the driver and on the executors, so you can reference them in UDFs. This enables:
Library dependencies of a notebook to be organized within the notebook itself.
Notebook users with different library dependencies to share a cluster without interference.
Example: To install "psutil" library using library utilities:
dbutils.library.installPyPI("psutil")
Reference: Databricks - library utilities
Hope this helps.
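Since the failure here coincided with a new psutil release, pinning a known-good version may also help; installPyPI accepts a version argument (the version number below is illustrative):
dbutils.library.installPyPI("psutil", version="5.6.7")
dbutils.library.restartPython()  # restart Python so the pinned version is picked up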

Install python CV2 on spark cluster(data bricks)

I want to install Python's cv2 library on a Spark cluster using Databricks Community Edition, and I am going to:
workspace -> create -> library, as per the normal procedure, then selecting Python in the Language combobox. But in the "PyPi Package" textbox I tried "cv2" and "opencv" and had no luck. Has anybody tried this? Do you know if cv2 can be installed on the cluster through this method, and if so, which name should be used in the textbox?
Try installing numpy first, followed by opencv-python; it will work.
Steps:
Navigate to Install Library --> Select PyPI --> In Package, enter numpy
(after the installation completes, proceed to step 2)
Navigate to Install Library --> Select PyPI --> In Package, enter opencv-python
The PyPI package you want is https://pypi.python.org/pypi/opencv-python, so just put opencv-python in the textbox and install.
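Once it is installed, a quick sanity check in a notebook confirms the import works:
import cv2
print(cv2.__version__)  # prints the installed OpenCV version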

Python 2.7 Modules with Hive Streaming

I am doing Hive streaming on a DSE 3.0 cluster (Hive 0.9) using a Python mapper. My Python script imports the statsmodels module, which requires Python 2.7. Since the default is not 2.7 (it's 2.4), I downloaded and installed it, along with the statsmodels module.
However, when running the simple Hive query
hive> select transform (line) using 'python python-mapper.py' from docs;
where "docs" is a Hive table of line STRINGs, I get the error:
File "python-mapper.py", line 6, in ?
import statsmodels
ImportError: No module named statsmodels
So I changed my Hive query to:
hive> select transform (line) using 'python2.7 python-mapper.py' from docs;
to invoke version 2.7. But then I get the error:
Caused by: java.io.IOException: Cannot run program "python2.7":
java.io.IOException: error=2, No such file or directory
I have also tried python27 and /usr/local/bin/python2.7 and am still getting the same error. Has anyone encountered this before? I have already referenced the second answer to the post "On linux SUSE or RedHat, how do I load Python 2.7". Any advice would be greatly appreciated!
Thanks,
AM
I know this is a bit old; however, I came across the same problem recently and thought I would answer for anybody else who runs into it.
The python2.7 command won't work if you have more than one version of Python installed.
There are two ways of solving this. One, use a Python virtual environment, which allows you to package your script's environment and add it as a resource to distribute across all your nodes. Two, find out where your python2.7 binary is installed by typing:
which python2.7
and then reference the location in your hive query like so (example):
select transform (line) using '/usr/local/bin/python2.7 python-mapper.py' from docs;
Caution: each node may have a different location where python2.7 is installed, so check beforehand. Better yet, use a virtual environment.
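For reference, a Hive streaming mapper simply reads rows from stdin and writes output rows to stdout; a minimal sketch of what python-mapper.py might look like (the statsmodels usage is illustrative, not the OP's actual code):
import sys
import statsmodels  # the import that was failing under Python 2.4

for line in sys.stdin:
    line = line.strip()
    # ... apply your statsmodels logic to each row here ...
    print(line)  # one output row per input row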
