What is the latest config for installing pyspark?

I am trying to install pyspark, following this thread, particularly the advice from OneCricketeer and zero323.
I have done the following:
1 - Installed pyspark in anaconda3 with conda install -c conda-forge pyspark
2 - Set this up in my .bashrc file:
function snotebook ()
{
#Spark path (based on your computer)
SPARK_PATH=~/spark-3.0.1-bin-hadoop3.2
export ANACONDA_ROOT=~/anaconda3
export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/ipython
export PYSPARK_PYTHON=$ANACONDA_ROOT/bin/python
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
# For python 3 users, you have to add the line below or you will get an error
#export PYSPARK_PYTHON=python3
$SPARK_PATH/bin/pyspark --master local[2]
}
I have Python 3.8.2, anaconda3. I downloaded spark 3.0.1 with hadoop 3.2.
The .bashrc setup partially follows this Medium article.
When I try import pyspark as ps, I get No module named 'pyspark'.
What am I missing? Thanks.

I work with PySpark a lot and have found that these three steps work regardless of the OS. This example uses macOS.
1 - Install PySpark using pip
pip install pyspark
2 - Install Java 8 from the link below
https://www.oracle.com/in/java/technologies/javase/javase-jdk8-downloads.html
3 - Set up environment variables in ~/.bash_profile
[JAVA_HOME, SPARK_HOME, PYTHONPATH]
JAVA_HOME - Check the path where Java is installed. On macOS it is usually the path below.
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home/
SPARK_HOME - Check the path where pyspark is installed. One trick is to run the command below again, which prints the paths of pyspark and py4j.
pip install pyspark
Requirement already satisfied: pyspark in /opt/anaconda3/lib/python3.7/site-packages (2.4.0)
Requirement already satisfied: py4j==0.10.7 in /opt/anaconda3/lib/python3.7/site-packages (from pyspark) (0.10.7)
Use these two paths to set the following environment variables (the py4j version in the zip file name should match the installed one):
export SPARK_HOME=/opt/anaconda3/lib/python3.7/site-packages/pyspark
export PYTHONPATH=/opt/anaconda3/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
Reload ~/.bash_profile with the command below
source ~/.bash_profile
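As an alternative to reading the paths out of the pip output, the install location can be looked up from Python. This is a minimal sketch, assuming pyspark was installed into the same interpreter that runs the script; the helper names find_package_dir and find_py4j_zips are hypothetical and not part of PySpark.

```python
import importlib.util
from pathlib import Path


def find_package_dir(package="pyspark"):
    """Return the directory an installed package lives in, or None if it is missing."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def find_py4j_zips(spark_home):
    """List bundled py4j source zips under <spark_home>/python/lib (may be empty)."""
    return sorted(Path(spark_home, "python", "lib").glob("py4j-*-src.zip"))


if __name__ == "__main__":
    spark_home = find_package_dir("pyspark")
    if spark_home is None:
        print("pyspark is not importable from this interpreter")
    else:
        print(f"export SPARK_HOME={spark_home}")
        for zip_path in find_py4j_zips(spark_home):
            print(f"export PYTHONPATH={zip_path}:$PYTHONPATH")
```

If the script reports that pyspark is not importable, the package was most likely installed into a different Python than the one being run.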

Related

Jupyter notebook won't start - despite various fixes

Every time I try to run jupyter notebook (using the command python -m jupyter notebook), it returns:
Jupyter command "jupyter-notebook" not found.
I've tried uninstalling and reinstalling jupyter, and I've tried this recommendation:
uninstall pyzmq and install it again.
Then I ran pip3 install --upgrade nbconvert.
I also did pip install --upgrade pywin32==224
from a different question. Again no dice; I still get the same message. I definitely have notebook installed: I ran jupyter --version and it reported that notebook is installed.
I did python -m jupyter --debug --paths which returned:
JUPYTER_PREFER_ENV_PATH is not set, making the user-level path preferred over the environment-level path for data and config
JUPYTER_NO_CONFIG is not set, so we use the full path list for config
JUPYTER_CONFIG_PATH is not set, so we do not prepend anything to the config paths
JUPYTER_CONFIG_DIR is not set, so we use the default user-level config directory
Python's site.ENABLE_USER_SITE is True, so we add the user site directory 'C:\Users\Nathaniel Paczek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages'
JUPYTER_PATH is not set, so we do not prepend anything to the data paths
JUPYTER_DATA_DIR is not set, so we use the default user-level data directory
JUPYTER_RUNTIME_DIR is not set, so we use the default runtime directory
config:
C:\Users\Nathaniel Paczek\.jupyter
C:\Users\Nathaniel Paczek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\etc\jupyter
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\etc\jupyter
C:\ProgramData\jupyter
data:
C:\Users\Nathaniel Paczek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\Roaming\jupyter
C:\Users\Nathaniel Paczek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\share\jupyter
C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\share\jupyter
C:\ProgramData\jupyter
runtime:
C:\Users\Nathaniel Paczek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\Roaming\jupyter\runtime
I realise this is a lengthy question, but I'm tearing my hair out trying to get it sorted, so I would really appreciate any help.

You need to install the notebook package - see the Jupyter documentation.
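One way to confirm whether the notebook package is visible to the same interpreter that runs python -m jupyter is to ask Python directly. A minimal sketch; the helper name missing_modules is hypothetical, and jupyter_core and notebook are the import names of the packages being checked.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Missing:", missing_modules(["jupyter_core", "notebook"]))
```

If notebook is listed as missing, installing it with python -m pip install notebook (using that same python) targets the interpreter that was just checked.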

PySpark Load Packages for Pandas UDF's

I have tried following the Databricks blog post but keep getting errors. I'm trying to install pandas, pyarrow, numpy, and the h3 library and then access those libraries on my PySpark cluster, but following these instructions isn't working.
conda init --all (then close and reopen terminal)
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas h3 numpy python=3.7.10 conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
import os
from pyspark.sql import SparkSession
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.yarn.archive",  # 'spark.yarn.dist.archives' in YARN.
    "~/gzk/pyspark_conda_env.tar.gz#environment").getOrCreate()
I'm able to get this far, but when I actually try to run a pandas UDF I get the error: ModuleNotFoundError: No module named 'numpy'
How can I solve this problem and use pandas UDFs?

I ended up solving this issue by writing a bootstrap script for my AWS EMR cluster that would install all the packages I needed on all the nodes. I was never able to get the directions above to work properly.
Documentation on bootstrap scripts can be found here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
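For anyone who wants to retry the conda-pack route before falling back to bootstrap scripts: as far as I can tell, the Spark documentation on Python package management passes the archive through spark.archives (Spark 3.1 and later) or spark.yarn.dist.archives on YARN, rather than spark.yarn.archive, and I would not rely on Spark expanding ~ in the path. The sketch below only builds the key and value; conda_archive_setting is a hypothetical helper, and whether this resolves the numpy error on a given cluster is an assumption.

```python
import os


def conda_archive_setting(archive_path, alias="environment", on_yarn=False):
    """Build the (key, value) pair for shipping a conda-pack archive to executors."""
    key = "spark.yarn.dist.archives" if on_yarn else "spark.archives"
    # Expand "~" ourselves and make the path absolute before handing it to Spark.
    full_path = os.path.abspath(os.path.expanduser(archive_path))
    return key, f"{full_path}#{alias}"


if __name__ == "__main__":
    key, value = conda_archive_setting("~/gzk/pyspark_conda_env.tar.gz")
    print(key, value)
    # With pyspark installed, the pair would be passed as:
    #   SparkSession.builder.config(key, value).getOrCreate()
```

The alias after # has to match the directory used in PYSPARK_PYTHON (./environment/bin/python in the question).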

How do I update pandas using Spyder?

I am trying to convert a data structure I get from pandas to an array using .to_numpy, but my pandas version is too old. Trying to update it by running conda install -c anaconda pandas in the Anaconda Prompt results in this error:
ERROR conda.core.link:_execute_actions(337): An error occurred while installing package 'anaconda::tqdm-4.50.2-py_0'.
CondaError: Cannot link a source that does not exist. C:\Users\domyb\Anaconda3\Scripts\conda.exe
Running conda clean --packages may resolve your problem.
Attempting to roll back.
CondaError: Cannot link a source that does not exist. C:\Users\domyb\Anaconda3\Scripts\conda.exe
Running conda clean --packages may resolve your problem.
Any idea?
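While the conda update is failing, note that DataFrame.to_numpy was added in pandas 0.24.0, and on older versions the .values attribute also returns a NumPy array. A minimal sketch of a version check; supports_to_numpy is a hypothetical helper that assumes a version string starting with "major.minor" and looks only at those two numbers.

```python
def supports_to_numpy(pandas_version):
    """Return True if a pandas version string is 0.24 or newer."""
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    return (major, minor) >= (0, 24)


# Example use with pandas installed (df is an existing DataFrame):
#   import pandas as pd
#   array = df.to_numpy() if supports_to_numpy(pd.__version__) else df.values
```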

Visual Studio Code is rejecting the Tensorflow installation in a Virtual Environment

I created a virtual environment called env using
python -m venv env
.\env\Scripts\activate.bat
pip install tensorflow
I verified tensorflow is in the env\Lib\site-packages folder
Next I opened VS Code, created a workspace and added a Python file; it prompted me to install pylint.
I ran Python: Select Interpreter and browsed to the C:\Users\admin\env\Scripts folder.
This is the command line at the beginning of the script
(env) PS C:\Users\admin\env\project> cd 'c:\Users\admin\env\project'; & 'C:\Users\admin\env\Scripts\python.exe' 'c:\Users\admin\.vscode\extensions\ms-python.python-2020.8.106424\pythonFiles\lib\python\debugpy\launcher' '54436' '--' 'c:\Users\admin\env\project\face_gan.py'
This is the error I get when debugging the python file:
ImportError: Keras requires TensorFlow 2.2 or higher. Install TensorFlow via `pip install tensorflow`
PS C:\Users\admin\env\project> & C:/Users/admin/env/Scripts/Activate.ps1
When I type pip install tensorflow in the VS Code terminal, it shows it's already installed:
(env) PS C:\Users\admin\env\project> pip install tensorflow
Requirement already satisfied: tensorflow in c:\users\admin\env\lib\site-packages (2.3.0)
I don't understand this. Is it not running in the virtual environment?
Why is it executing C:/Users/admin/env/Scripts/Activate.ps1 at the end of the debugging session, not at the beginning?
Lastly, is running python from the virtual environment folder C:\Users\admin\env\Scripts the same as using the activate.bat file or the source command? Does it automatically defer to using the C:\Users\admin\env\Lib folder, or is it still trying to use the default python installation to look for Tensorflow?
What step did I miss to make it use the virtual environment correctly in VS Code?

First question: executing C:/Users/admin/env/Scripts/Activate.ps1 after the debugging command makes no difference. It appears there only because it is the first command sent to the terminal. You can run the debugger again to check.
Second question: Yes, that's the same. In your case, it adds 'C:\Users\admin\env' and 'C:\Users\admin\env\lib\site-packages' to the module search path.
You can use this code to print the module search path (sys.path, which also includes any entries from the PYTHONPATH variable):
import sys
print(sys.path)
If you import 'tensorflow' directly, you will find that it imports correctly, so this is a version problem. You should downgrade the packages involved; you can refer to this comment for some useful information.
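To check which interpreter and which package versions the debugger is actually using, a small script like the one below can be run from VS Code. This is a sketch; in_virtualenv and installed_version are hypothetical helper names, and importlib.metadata requires Python 3.8 or newer.

```python
import sys
from importlib import metadata


def in_virtualenv():
    """True when running inside a venv (sys.prefix differs from the base install)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtualenv())
    for name in ("tensorflow", "keras"):
        print(name, installed_version(name))
```

If the interpreter path points into C:\Users\admin\env and both versions are printed, the virtual environment is in use and the remaining problem is the version mismatch between the two packages.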

How to change default install location for pip

I'm trying to install Pandas using pip, but I'm having a bit of trouble. I just ran sudo pip install pandas which successfully downloaded pandas. However, it did not get downloaded to the location that I wanted. Here's what I see when I use pip show pandas:
---
Name: pandas
Version: 0.14.0
Location: /Library/Python/2.7/site-packages/pandas-0.14.0-py2.7-macosx-10.9-intel.egg
Requires: python-dateutil, pytz, numpy
So it is installed. But I was confused when I created a new Python Project and searched under System Libs/lib/python for pandas, because it didn't show up. Some of the other packages that I've downloaded in the past did show up, however, so I tried to take a look at where those were. Running pip show numpy (which I can import with no problem) yielded:
---
Name: numpy
Version: 1.6.2
Location: /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
Requires:
Which is in a completely different directory. For the sake of confirming my error, I ran pip install pyquery to see where it would be downloaded to, and got:
Name: pyquery
Version: 1.2.8
Location: /Library/Python/2.7/site-packages
Requires: lxml, cssselect
So the same place as pandas...
How do I change the default download location for pip so that these packages are downloaded to the same location that numpy is in?
Note: There were a few similar questions that I saw when searching for a solution, but I didn't see anything that mentioned permanently changing the default location.

According to pip documentation at
http://pip.readthedocs.org/en/stable/user_guide/#configuration
You will need to specify the default install location in a pip configuration file, which, according to the same page, is usually located as follows:
On Unix and Mac OS X the configuration file is: $HOME/.pip/pip.conf
On Windows, the configuration file is: %HOME%\pip\pip.ini
%HOME% is C:\Users\Bob on Windows, assuming your username is Bob.
On Linux, you can go to the $HOME directory with cd ~
You may have to create the pip.ini file once you find your pip directory. Within your pip.ini or pip.conf you will then need to put (assuming you're on Windows) something like
[global]
target=C:\Users\Bob\Desktop
Except that you would replace C:\Users\Bob\Desktop with whatever path you desire. If you are on Linux you would replace it with something like /usr/local/your/path
After saving, the install command stays the same:
pip install pandas
However, the program you install might assume it will be installed in a certain directory and might not work as a result of being installed elsewhere.
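The [global] section described in this answer can also be generated from a script. A minimal sketch using the standard-library configparser; render_pip_config is a hypothetical helper, and the output still has to be saved to the per-user configuration file location for your platform.

```python
import configparser
import io


def render_pip_config(target):
    """Return pip configuration text with a [global] target entry."""
    # interpolation=None keeps "%" characters in paths from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser["global"] = {"target": target}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_pip_config(r"C:\Users\Bob\Desktop"))
```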

You can set the following environment variable:
PIP_TARGET=/path/to/pip/dir
https://pip.pypa.io/en/stable/user_guide/#environment-variables

Open Terminal and type:
pip config set global.target /Users/Bob/Library/Python/3.8/lib/python/site-packages
except instead of
/Users/Bob/Library/Python/3.8/lib/python/site-packages
you would use whatever directory you want.

Follow these steps:
1. Run pip config set global.target D:\site-packages to change the install path,
or run py -m pip config --user --editor notepad edit and add:
[global]
target = D:\site-packages
2. Set environment variables so that the downloaded packages can be imported (import xxx):
PIP_TARGET=site-packages
PYTHONPATH=site-packages
3. Run pip config unset global.target before upgrading pip with py -m pip install --upgrade pip

@Austin's answer is outdated; here is a more up-to-date solution:
According to pip documentation at
https://pip.pypa.io/en/stable/topics/configuration/
You will need to specify the default install location in a configuration file, which, according to the same page, is usually located as follows:
Mac OS
$HOME/Library/Application Support/pip/pip.conf if directory $HOME/Library/Application Support/pip exists else $HOME/.config/pip/pip.conf.
The legacy “per-user” configuration file is also loaded, if it exists: $HOME/.pip/pip.conf.
The $HOME folder can be located by navigating to ~/ (cmd+shift+G in Finder; cmd+shift+. to show hidden files).
Windows
%APPDATA%\pip\pip.ini
The legacy “per-user” configuration file is also loaded, if it exists: %HOME%\pip\pip.ini
%HOME% is C:\Users\Bob on Windows, assuming your username is Bob.
Unix
$HOME/.config/pip/pip.conf, which respects the XDG_CONFIG_HOME environment variable.
The legacy “per-user” configuration file is also loaded, if it exists: $HOME/.pip/pip.conf.
On Linux, you can go to the $HOME directory with cd ~
You may have to create the configuration file when you find your pip directory. Put something like
[global]
target = /Library/Frameworks/Python.framework/Versions/Current/lib/python3.10/site-packages/
if you are on a Mac. Except that you would replace /Library/Frameworks/Python.framework/Versions/Current/lib/python3.10/site-packages/ with whatever path you desire. If you are on Linux you would replace it with something like /usr/local/your/path
After saving, the install command stays the same:
pip install pandas
However, the program you install might assume it will be installed in a certain directory and might not work as a result of being installed elsewhere.
Please note that
pip3 install pandas
might be the solution if your packages get installed in the Python 2 folder instead of the Python 3 one.
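Because pip and pip3 can belong to different interpreters, it helps to see where the interpreter you actually run looks for packages. A minimal sketch using only the standard library; is_on_search_path is a hypothetical helper. A directory used as a pip target has to be on this search path (for example via PYTHONPATH) before its packages can be imported.

```python
import os
import sys
import sysconfig


def is_on_search_path(directory):
    """True if the directory is one of the entries Python searches for imports."""
    wanted = os.path.normcase(os.path.abspath(directory))
    return any(
        os.path.normcase(os.path.abspath(entry)) == wanted
        for entry in sys.path
        if entry
    )


if __name__ == "__main__":
    site_packages = sysconfig.get_paths()["purelib"]
    print("Interpreter:", sys.executable)
    print("Default site-packages:", site_packages)
    print("On search path:", is_on_search_path(site_packages))
```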
