PySpark Load Packages for Pandas UDFs - python

I have tried following the Databricks blog post here but keep getting errors. I'm trying to install pandas, pyarrow, numpy, and the h3 library, and then access those libraries on my PySpark cluster, but following these instructions isn't working.
conda init --all (then close and reopen terminal)
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas h3 numpy python=3.7.10 conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
import os
from pyspark.sql import SparkSession
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.yarn.archive",  # 'spark.yarn.dist.archives' in YARN.
    "~/gzk/pyspark_conda_env.tar.gz#environment").getOrCreate()
I'm able to get this far, but when I actually try to run a pandas UDF I get the error: ModuleNotFoundError: No module named 'numpy'
How can I solve this problem and use pandas UDFs?
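For context, this is a minimal sketch of the kind of pandas UDF I'm trying to run (the column and function names are placeholders, not my actual code; it reuses the spark session created above):
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Placeholder UDF: the executors need numpy and pandas from the shipped environment.
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + np.float64(1.0)

df = spark.range(10)
df.select(plus_one(df["id"])).show()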

I ended up solving this issue by writing a bootstrap script for my AWS EMR cluster that would install all the packages I needed on all the nodes. I was never able to get the directions above to work properly.
Documentation on bootstrap scripts can be found here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html

Related

No module found error for a downloaded package (sksparse.cholmod) and how to download a package from the internet to conda

I need to use the sksparse.cholmod package; however, PyCharm does not let me install it, as it can't seem to find it.
I found the sksparse package on GitHub and downloaded it, but I do not know how to add a package downloaded from the internet into a conda environment. So my first question is: can you download a package from GitHub and add it to your conda environment, and how do you do this?
As I did not know how to do the above, I instead saved the package within my project and thought I could simply import sksparse.cholmod. The line in my code that says import sksparse.cholmod as sks shows no errors, so I assumed that meant this was OK, but when I try to run my file I get this error:
import sksparse.cholmod as sks
ModuleNotFoundError: No module named 'sksparse.cholmod'
If I have downloaded the package into my project, why can't it be found, even though there are no errors when importing?
The cholmod file is a .pyx file, which I've been told should not be a problem.
Could anyone please help? I am reasonably new to Python and am looking for a straightforward solution that won't be time-consuming.
It was an issue with Windows; I was able to fix it using the instructions at this link:
https://github.com/EmJay276/scikit-sparse
We must follow these steps precisely:
(This was tested with an Anaconda 3 installation and Python 3.7)
Install these requirements in order:
'''
conda install -c conda-forge numpy        # tested with v1.19.1
conda install -c anaconda scipy           # tested with v1.5.0
conda install -c conda-forge cython       # tested with v0.29.21
conda install -c conda-forge suitesparse  # tested with v5.4.0
'''
Download Microsoft Build Tools for C++ from https://visualstudio.microsoft.com/de/visual-cpp-build-tools/ (tested with 2019, should work with 2015 or newer)
Install Visual Studio Build Tools
Choose Workloads
Check "C++ Buildtools"
Keep standard settings
Run ''' pip install git+https://github.com/EmJay276/scikit-sparse '''
Test ''' from sksparse.cholmod import cholesky '''
Use all the versions stated for numpy etc.; however, for scipy I installed the latest version and it worked fine.
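As a further sanity check after installing, a small solve like this should run (a rough sketch; the matrix values below are just an arbitrary small positive-definite example):
import numpy as np
from scipy.sparse import csc_matrix
from sksparse.cholmod import cholesky

# Arbitrary small symmetric positive-definite test matrix.
A = csc_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 0.0],
                         [0.0, 0.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

factor = cholesky(A)  # sparse Cholesky factorization via CHOLMOD
x = factor(b)         # solve A x = b using the factorization
print(x)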

How to import pyscipopt to Google Colab (in the Jupyter Notebook file)?

I usually use SCIP (calling PySCIPOpt) with Jupyter Notebook (installed via Anaconda) on my Mac, and when I write "from pyscipopt import Model" there is no error when running it on my machine. But for large-scale problems I decided to move my notebook to Google Colab and run the code there, and I cannot get rid of the error: "no module named pyscipopt".
I tried "!pip install pyscipopt" directly in Google Colab, and "!apt install pyscipopt". While executing "!pip install pyscipopt", I got another error: "failed building wheel for pyscipopt". When I googled it, I found that SCIP should be installed first, but as I said, everything works fine in Jupyter Notebook directly on my Mac, which means SCIP is installed and the lib and include subdirectories are there (I checked). I also tried "export SCIPOPTDIR=" and "!export SCIPOPTDIR=". Nothing works.
Any advice would be much appreciated.
Thanks!
Lidiia
From pip or PyPI you can only install PySCIPOpt, not SCIP itself. To use PySCIPOpt in a hosted Google Colab notebook, you need to use the conda package, which includes SCIP.
First, you need to install conda itself:
!pip install -q condacolab
import condacolab
condacolab.install()
Then, you can install PySCIPOpt:
!conda install pyscipopt
Finally, you can import PySCIPOpt as usual:
from pyscipopt import Model
m = Model()
m.redirectOutput()
m.printVersion()
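If you want to confirm that SCIP actually solves something in Colab, a tiny throwaway model like this should work (the variable names and bounds are just illustrative):
from pyscipopt import Model

m = Model("demo")
x = m.addVar("x", lb=0)  # continuous variable x >= 0
y = m.addVar("y", lb=0)  # continuous variable y >= 0
m.addCons(x + 2 * y <= 4)
m.setObjective(x + y, sense="maximize")
m.optimize()
print(m.getVal(x), m.getVal(y), m.getObjVal())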

Cannot install pyimagej in Python

I need to write code in a Jupyter notebook, but I need the ImageJ packages.
I installed pyimagej using several commands:
conda install -c conda-forge pyimagej
conda install -c conda-forge/label/cf201901 pyimagej
conda install -c conda-forge/label/cf202003 pyimagej
But I always get this error: ModuleNotFoundError: No module named 'ij'
when I try
from ij import IJ
I know that I can use the ImageJ editor, but I need to use a Jupyter notebook. Could someone help?
I tried this and it worked
conda create -n imagej pyimagej
We have recently improved our documentation on how to use PyImageJ. It looks like you were successful in installing the software, but you did not yet initialize PyImageJ. Check out the "How to initialize PyImageJ" guide. If you have trouble initializing PyImageJ, you might need to view the installation documentation.
But assuming you have installed PyImageJ and activated your environment, a typical script looks something like this:
import imagej
# initialize PyImageJ
ij = imagej.init('sc.fiji:fiji', mode='interactive')
print(f"ImageJ Version: {ij.getVersion()}")
# The original ImageJ "IJ" is a property of the initialized "ij" object.
print(dir(ij.IJ))
Hope that helps!

How do I update pandas using Spyder?

I am trying to convert a data structure I get from pandas to an array using .to_numpy, but my pandas version is too old. Trying to update it by running conda install -c anaconda pandas inside the Anaconda prompt results in this error:
ERROR conda.core.link:_execute_actions(337): An error occurred while installing package 'anaconda::tqdm-4.50.2-py_0'.
CondaError: Cannot link a source that does not exist. C:\Users\domyb\Anaconda3\Scripts\conda.exe
Running conda clean --packages may resolve your problem. Attempting to roll back.
CondaError: Cannot link a source that does not exist. C:\Users\domyb\Anaconda3\Scripts\conda.exe
Running conda clean --packages may resolve your problem.
Any idea?

What is the latest config for installing pyspark?

I am trying to install pyspark, following this thread here, particularly the advice from OneCricketeer and zero323.
I have done the following:
1 - Install pyspark in anaconda3 with conda install -c conda-forge pyspark
2 - Set up this in my .bashrc file:
function snotebook ()
{
    # Spark path (based on your computer)
    SPARK_PATH=~/spark-3.0.1-bin-hadoop3.2
    export ANACONDA_ROOT=~/anaconda3
    export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/ipython
    export PYSPARK_PYTHON=$ANACONDA_ROOT/bin/python
    export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
    # For Python 3 users, you have to add the line below or you will get an error
    #export PYSPARK_PYTHON=python3
    $SPARK_PATH/bin/pyspark --master local[2]
}
I have Python 3.8.2 and anaconda3, and I downloaded Spark 3.0.1 with Hadoop 3.2.
The .bashrc setup partially follows this article from Medium here.
When I try import pyspark as ps, I get No module named 'pyspark'.
What am I missing? Thanks.
I work with PySpark a lot and found these three simple steps, which always work irrespective of the OS. For this example, I will demonstrate on macOS.
Install PySpark using PIP
pip install pyspark
Install Java 8 using the link below:
https://www.oracle.com/in/java/technologies/javase/javase-jdk8-downloads.html
Set up environment variables in ~/.bash_profile:
[JAVA_HOME, SPARK_HOME, PYTHONPATH]
JAVA_HOME - check the path where Java is installed. On macOS it is usually the path below:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home/
SPARK_HOME - check the path where pyspark is installed. One hack is to run the command below, which will print the pyspark and py4j paths:
pip install pyspark
Requirement already satisfied: pyspark in /opt/anaconda3/lib/python3.7/site-packages (2.4.0)
Requirement already satisfied: py4j==0.10.7 in /opt/anaconda3/lib/python3.7/site-packages (from pyspark) (0.10.7)
Use the above two paths to set the following environment variables:
export SPARK_HOME=/opt/anaconda3/lib/python3.7/site-packages/pyspark
export PYTHONPATH=/opt/anaconda3/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
Run the command below to reload ~/.bash_profile:
source ~/.bash_profile
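After sourcing the profile, a quick way to confirm everything is wired up is to start a local session from plain Python (a minimal check; the app name is arbitrary):
from pyspark.sql import SparkSession

# Start a local Spark session to confirm that pyspark, py4j, and Java are all found.
spark = SparkSession.builder.master("local[2]").appName("install-check").getOrCreate()
print(spark.version)
spark.range(5).show()
spark.stop()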
