environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON - python

I installed pyspark recently, and it was installed correctly. But when I use the following simple program in Python, I get an error.
>>> from pyspark import SparkContext
>>> sc = SparkContext()
>>> data = range(1, 1000)
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
When running the last line, I get an error whose key line seems to be:
[Stage 0:> (0 + 0) / 4]18/01/15 14:36:32 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I have the following variables in .bashrc
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python3
I am using Python 3.

By the way, if you use PyCharm, you can add PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the run/debug configuration, as in the image below.

You should set the following environment variables in $SPARK_HOME/conf/spark-env.sh:
export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/python
If spark-env.sh doesn't exist, you can rename spark-env.sh.template to spark-env.sh.

This can also happen if you're working inside a virtual environment. In that case it may be harder to retrieve the correct path to the Python executable (and in any case I think it's not a good idea to hardcode the path if you want to share your code with others).
If you run the following lines at the beginning of your script/notebook (at least before you create the SparkSession/SparkContext), the problem is solved:
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
The os module lets you set environment variables; sys.executable is the absolute path of the Python interpreter that is running your script.
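Putting it together, a minimal sketch of the snippet from the question with this fix applied (assuming pyspark is installed in the active environment):
import os
import sys

# Point both the driver and the workers at the interpreter running this script,
# before any SparkContext is created.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize(range(1, 1000))
print(len(rdd.collect()))  # 999, with no version-mismatch error
sc.stop()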

I got the same issue, and I set both variables in .bash_profile:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
But my problem was still there.
Then I found out that the problem was that my default Python version is 2.7, by typing python --version.
So I solved the problem by following the page below:
How to set Python's default version to 3.x on OS X?

Just run the code below at the very beginning of your code. I am using Python 3.7. You might need to run locate python3.7 to get your Python path.
import os
os.environ["PYSPARK_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"

I'm using Jupyter Notebook to study PySpark, and that's what worked for me.
Find where python3 is installed by running this in a terminal:
which python3
Here it points to /usr/bin/python3.
Now, at the beginning of the notebook (or .py script), do:
import os
# Set Spark environment variables
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'
Restart your notebook session and it should work!
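If you want to verify the fix, here is an optional check (a sketch, assuming a local Spark install) that compares the Python version reported by the driver with the one reported by a worker:
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("version-check").getOrCreate()

def worker_version(_):
    # runs on a worker, so it reports the worker's interpreter version
    import sys
    return "%d.%d" % sys.version_info[:2]

driver_version = "%d.%d" % sys.version_info[:2]
worker_reported = spark.sparkContext.parallelize([0], 1).map(worker_version).first()
print(driver_version, worker_reported)  # the two values should match
spark.stop()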

Apache Spark 2.4.3 on Arch Linux
I've just installed Apache Spark 2.3.4 from the Apache Spark website. I'm using the Arch Linux distribution, a simple and lightweight distribution. I put the Apache Spark directory in /opt/apache-spark/, and now it's time to export our environment variables. Remember, I'm using Arch Linux, so adjust paths such as $JAVA_HOME to your own system.
Exporting the environment variables
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/user/.bashrc
echo 'export SPARK_HOME=/opt/apache-spark' >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH' >> /home/user/.bashrc
source ../.bashrc
Testing
emanuel@hinton ~ $ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export SPARK_HOME=/opt/apache-spark' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ source .bashrc
emanuel@hinton ~ $ python
Python 3.7.3 (default, Jun 24 2019, 04:54:02)
[GCC 9.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>>
Everything works fine once you've correctly exported the environment variables for SparkContext.
Using Apache Spark on Arch Linux via a Docker image
For my purposes I've created a Docker image with Python, Jupyter Notebook and Apache Spark 2.3.4.
Running the image:
docker run -ti -p 8888:8888 emanuelfontelles/spark-jupyter
Just go to your browser and type
http://localhost:8888/tree
You will be prompted with an authentication page; go back to the terminal, copy the token, and voila, you will have an Arch Linux container running an Apache Spark distribution.

If you are using PyCharm, go to Run -> Edit Configurations and click on Environment variables to add them as below (basically, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON should point to the same version of Python). This solution worked for me. Thanks to the above posts.

To make it easier to see: instead of having to set a specific path like /usr/bin/python3, you can do the following.
I put these lines in my ~/.zshrc:
export PYSPARK_PYTHON=python3.8
export PYSPARK_DRIVER_PYTHON=python3.8
When I type python3.8 in my terminal, I get Python 3.8 running. I think it's because I installed pipenv.
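If you want to confirm what python3.8 resolves to on your PATH, a quick check (the interpreter name here is just illustrative) is:
# prints the full path of the interpreter that "python3.8" resolves to
import shutil
print(shutil.which("python3.8"))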
Another good website to reference to get your SPARK_HOME is https://towardsdatascience.com/how-to-use-pyspark-on-your-computer-9c7180075617
(for permission denied issues use sudo mv)

1. Download and install Java (JRE).
2. Then choose one of the following solutions:
## -------- Temporary Solution -------- ##
Just put the paths into your Jupyter notebook with the following code and run it every time:
import os
os.environ["PYSPARK_PYTHON"] = r"C:\Users\LAPTOP0534\miniconda3\envs\pyspark_v3.3.0"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\LAPTOP0534\miniconda3\envs\pyspark_v3.3.0"
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jre1.8.0_333"
----OR----
## -------- Permanent Solution -------- ##
Set the above 3 variables as environment variables on your system.
I had gone through many answers and nothing worked for me, but both of these approaches did, and this resolved my error.
Thanks
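As a small optional check (not part of the answer above): printing sys.executable from inside the activated conda environment shows the exact interpreter path, which is usually what PYSPARK_PYTHON needs to point at.
# run inside the activated conda environment; the printed path (e.g.
# ...\miniconda3\envs\pyspark_v3.3.0\python.exe) is a safe value for
# PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
import sys
print(sys.executable)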

import os
import sys

# raw strings keep the backslashes in Windows paths intact
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-19"
os.environ["SPARK_HOME"] = r"C:\Program Files\Spark\spark-3.3.1-bin-hadoop2"
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
This worked for me in a Jupyter notebook, as the os library makes it easy to set environment variables. Make sure to run this cell before creating the SparkSession.

I tried two methods for this question; the method in the picture works.
Add the environment variables:
PYSPARK_PYTHON=/usr/local/bin/python3.7;PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.7;PYTHONUNBUFFERED=1

Related

Mac gcloud install ImportError: No module named __future__

When installing gcloud for Mac, I get this error when I run the install.sh command according to the docs here:
Traceback (most recent call last):
File "/path_to_unzipped_file/google-cloud-sdk/bin/bootstrapping/install.py", line 8, in <module>
from __future__ import absolute_import
I poked through and echoed out some stuff in the install shell script. It is setting the environment variables correctly (pointing to my default python installation, pointing to the correct location of the gcloud SDK).
If I just enter the python interpreter (using the same default python that the install script points to when running install.py) I can import the module just fine:
>>> from __future__ import absolute_import
>>>
The only other information worth noting is that my default Python setup is a virtual environment that I create from Python 2.7.15 installed through brew. The virtual environment's python bin is first in my PATH, so python, python2 and python2.7 all invoke the correct binary. I've had no other issues installing packages with this setup so far.
If I echo the final line of the install.sh script that calls the install.py script it shows /path_to_virtualenv/bin/python -S /path_to_unzipped_file/google-cloud-sdk/bin/bootstrapping/install.py which is the correct python. Or am I missing something?
The script uses the -S command-line switch, which disables loading the site module on start-up.
However, it is a custom, dedicated site module installed in a virtualenv that makes the virtualenv work. As such, the -S switch and virtualenvs are incompatible; with -S set, fundamental imports such as from __future__ break down entirely.
You can either remove the -S switch from the install.sh command or use a wrapper script to strip it from the command line as you call your real virtualenv Python.
I had the error below when trying to run gcloud commands.
File "/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/lib/gcloud.py", line 20, in <module>
from __future__ import absolute_import
ImportError: No module named __future__
If you have your virtualenv sourced automatically, you can specify the environment variable CLOUDSDK_PYTHON, i.e. set -x CLOUDSDK_PYTHON /usr/bin/python, to not use the virtualenv Python.
In google-cloud-sdk/install.sh, go to the last line and remove the variable $CLOUDSDK_PYTHON_ARGS as below.
Before:
"$CLOUDSDK_PYTHON" $CLOUDSDK_PYTHON_ARGS "${CLOUDSDK_ROOT_DIR}/bin/bootstrapping/install.py" "$@"
After:
"$CLOUDSDK_PYTHON" "${CLOUDSDK_ROOT_DIR}/bin/bootstrapping/install.py" "$@"

Why is $PATH in remote deployment path different from $PATH in remote system?

I'm currently working in PyCharm with a remote Python interpreter (miniconda3/bin/python).
So when I type echo $PATH on the remote server, it prints
/home/woosung/bin:/home/woosung/.local/bin:/home/woosung/miniconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
I created a project in PyCharm and set the remote Python interpreter to the miniconda3 Python; it works well when I just run some *.py files.
But when I run some os.system() lines, weird things happen.
For instance, in test.py from the PyCharm project:
import os
os.system('echo $PATH')
os.system('python --version')
Output is
ssh://woosung#xxx.xxx.xxx.xxx:xx/home/woosung/miniconda3/bin/python -u /tmp/pycharm_project_203/test.py
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
Python 2.7.12
Process finished with exit code 0
I tried the same commands on the remote server:
woosung#test-pc:~$ echo $PATH
/home/woosung/bin:/home/woosung/.local/bin:/home/woosung/miniconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
woosung#test-pc:~$ python --version
Python 3.6.6 :: Anaconda, Inc.
The PATH and the version of Python are totally different! How can I fix this?
I've already tried adding os.system('export PATH="$PATH:$HOME/miniconda3/bin"') to test.py, but it still gives the same $PATH (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games).
EDIT
Thanks to the comment from @Dietrich Epp, I successfully added the interpreter path to the shell $PATH:
(os.environ["PATH"] += ":/home/woosung/miniconda3/bin")
But I'm stuck on a more basic problem. When I add the path and then run a *.py file that imports a library which only exists in miniconda3, the shell gives an ImportError.
For instance, in test.py
import os
import matplotlib

os.environ["PATH"] += ":/home/woosung/miniconda3/bin"
os.system("python import_test.py")
and import_test.py
import matplotlib
And when I run test.py,
Traceback (most recent call last):
File "import_test.py", line 1, in <module>
import matplotlib
ImportError: No module named matplotlib
It looks like the shell doesn't pick up the modified $PATH.
I found the solution.
It is not direct, but quite simple.
I changed os.system("python import_test.py") to os.system(sys.executable + ' import_test.py') (with import sys added).
This makes the shell use the PyCharm remote interpreter (miniconda3) instead of the original one.
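For completeness, a minimal sketch of the fixed test.py; using subprocess.run with sys.executable is an equivalent, slightly more explicit alternative to os.system:
import subprocess
import sys

# run import_test.py with the same interpreter that is running this script
subprocess.run([sys.executable, "import_test.py"], check=True)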

Python doesn't extract a zip file when run through crontab

#!/usr/bin/python
import requests, zipfile, StringIO, sys
extractDir = "myfolder"
zip_file_url = "download url"
response = requests.get(zip_file_url)
zipDocument = zipfile.ZipFile(StringIO.StringIO(response.content))
zipinfos = zipDocument.infolist()
for zipinfo in zipinfos:
    extracted = zipDocument.extract(zipinfo, path=extractDir)
System configuration
Ubuntu OS 16.04
Python 2.7.12
$ python extract.py
When I run the code in the terminal with the above command, it works properly: it creates the folder and extracts the files into it.
However, when I create a cron job with sudo rights, the code executes but doesn't create any folder or extract the files.
crontab command:-
40 10 * * * /usr/bin/sudo /usr/bin/python /home/ubuntu/demo/directory.py > /home/ubuntu/demo/logmyshit.log 2>&1
also tried
40 10 * * * /usr/bin/python /home/ubuntu/demo/directory.py > /home/ubuntu/demo/logmyshit.log 2>&1
Notes:
I checked the syslog; it says the cron job runs successfully.
The above code gives no errors.
I also made the Python program executable with chmod +x filename.py.
Please help; where am I going wrong?
Oops... there is nothing really wrong with running a Python script from crontab, but many bad things can happen because the environment is not the one you are used to.
When you type python directory.py in an interactive shell, PATH and all the required Python environment variables have been set as part of login and interactive shell initialization, and the current directory is your home directory by default, or wherever you currently are.
When the same command is run from crontab, the current directory is not specified (and may not be what you expect), PATH is only /bin:/usr/bin, and the Python environment variables are not set. That means you will have to tweak environment variables in the crontab file until you get a correct Python environment, and set the current directory explicitly.
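One defensive sketch (reusing the paths from the question) is to have the script log and pin down its own environment rather than rely on whatever cron provides:
import os
import sys

# log which interpreter and working directory cron actually gave us
print("interpreter:", sys.executable)
print("cwd before:", os.getcwd())
print("PATH:", os.environ.get("PATH", ""))

# then use absolute paths and set the working directory explicitly
os.chdir("/home/ubuntu/demo")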
I had a very similar problem, and it turned out cron didn't like importing matplotlib; I ended up having to specify the Agg backend. I figured it out by putting log statements after each line to see how far the program got before it failed. Of course, my log was empty, which tipped me off that it crashed on the imports.
TL;DR: log each line inside the script.
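For reference, a minimal sketch of the Agg workaround (the backend must be selected before pyplot is imported; the output path is just illustrative):
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, works without a display
import matplotlib.pyplot as plt

plt.plot([1, 2, 3])
plt.savefig("/home/ubuntu/demo/plot.png")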

Configure Ipython/Jupyter notebook with Pyspark on AWS EMR v4.0.0

I am trying to use IPython notebook with Apache Spark 1.4.0. I have followed the two tutorials below to set up my configuration:
Installing Ipython notebook with pyspark 1.4 on AWS
and
Configuring IPython notebook support for Pyspark
After finishing the configuration, here is some of the code in the related files:
1.ipython_notebook_config.py
c=get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser =False
c.NotebookApp.port = 8193
2.00-pyspark-setup.py
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
I also added the following two lines to my .bash_profile:
export SPARK_HOME='home/hadoop/sparl'
source ~/.bash_profile
However, when I run
ipython notebook --profile=pyspark
it shows the message unrecognized alias '--profile=pyspark' it will probably have no effect.
It seems that the notebook isn't configured with pyspark successfully.
Does anyone know how to solve this? Thank you very much.
Here are some software versions:
ipython/Jupyter: 4.0.0
spark 1.4.0
AWS EMR: 4.0.0
python: 2.7.9
By the way, I have read the following, but it doesn't work:
IPython notebook won't read the configuration file
Jupyter notebooks don't have the concept of profiles (as IPython did). The recommended way of launching with a different configuration is e.g.:
JUPYTER_CONFIG_DIR=~/alternative_jupyter_config_dir jupyter notebook
See also issue jupyter/notebook#309, where you'll find a comment describing how to set up Jupyter notebook with PySpark without profiles or kernels.
This worked for me...
Update ~/.bashrc with:
export SPARK_HOME="<your location of spark>"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
(Lookup pyspark docs for those arguments)
Then create a new IPython profile, e.g. pyspark:
ipython profile create pyspark
Then create and add the following lines in ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.6" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
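    # The snippet as posted stops here; the usual continuation (an assumption,
    # not part of the original text) ensures "pyspark-shell" is in the args:
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args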
(update versions of py4j and spark to suit your case)
Then mkdir -p ~/.ipython/kernels/pyspark and then create and add following lines in the file ~/.ipython/kernels/pyspark/kernel.json
{
  "display_name": "pySpark (Spark 1.6.1)",
  "language": "python",
  "argv": [
    "/usr/bin/python",
    "-m",
    "IPython.kernel",
    "--profile=pyspark",
    "-f",
    "{connection_file}"
  ]
}
Now you should see this kernel, pySpark (Spark 1.6.1), under Jupyter's new-notebook options. You can test it by executing sc; you should see your Spark context.
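For example, in a new notebook cell using that kernel (sc is predefined by shell.py, per the setup above):
print(sc)                               # should print a SparkContext
print(sc.parallelize(range(4)).sum())   # 0 + 1 + 2 + 3 = 6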
I tried many ways to solve this 4.0 version problem, and finally I decided to install IPython version 3.2.3:
conda install 'ipython<4'
It's amazing! I hope this helps you all!
ref: https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/ace9F4dWZTA
As people have commented, in Jupyter you don't need profiles. All you need to do is export the variables for Jupyter to find your Spark install (I use zsh, but it's the same for bash):
emacs ~/.zshrc
export PATH="/Users/hcorona/anaconda/bin:$PATH"
export SPARK_HOME="$HOME/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_SUBMIT_ARGS="--master local[*,8] pyspark-shell"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
It is important to add pyspark-shell to PYSPARK_SUBMIT_ARGS.
I found this guide useful, but not fully accurate.
My config is local, but it should work if you adapt PYSPARK_SUBMIT_ARGS to what you need.
I am having the same problem specifying the --profile argument. It seems to be a general problem with the new version, not related to Spark. If you downgrade to IPython 3.2.1 you will be able to specify the profile again.

How do I make a python script executable?

How can I run a python script with my own command line name like myscript without having to do python myscript.py in the terminal?
Add a shebang line to the top of the script:
#!/usr/bin/env python
Mark the script as executable:
chmod +x myscript.py
Add the dir containing it to your PATH variable. (If you want it to stick, you'll have to do this in .bashrc or .bash_profile in your home dir.)
export PATH=/path/to/script:$PATH
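For example, a minimal script set up this way might look like the following (the file name is just illustrative):
#!/usr/bin/env python3
# myscript.py: after chmod +x myscript.py and adding its directory to PATH,
# it can be invoked from anywhere as myscript.py
import sys

def main():
    print("Hello from", sys.argv[0])

if __name__ == "__main__":
    main()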
The best way, which is cross-platform, is to create setup.py, define an entry point in it and install with pip.
Say you have the following contents of myscript.py:
def run():
    print('Hello world')
Then you add setup.py with the following:
from setuptools import setup

setup(
    name='myscript',
    version='0.0.1',
    entry_points={
        'console_scripts': [
            'myscript=myscript:run'
        ]
    }
)
Entry point format is terminal_command_name=python_script_name:main_method_name
Finally install with the following command.
pip install -e /path/to/script/folder
-e stands for editable, meaning you'll be able to work on the script and invoke the latest version without needing to reinstall it.
After that you can run myscript from any directory.
I usually do this in the script:
#!/usr/bin/python
... code ...
And in terminal:
$: chmod 755 yourfile.py
$: ./yourfile.py
Another related solution which some people may be interested in: you can also directly embed the contents of myscript.py into your .bashrc file on Linux (this should also work for macOS, I think).
For example, I have the following function defined in my .bashrc for dumping Python pickles to the terminal; note that ${1} is the first argument following the function name:
depickle() {
    python << EOPYTHON
import pickle
f = open('${1}', 'rb')
while True:
    try:
        print(pickle.load(f))
    except EOFError:
        break
EOPYTHON
}
With this in place (and after reloading .bashrc), I can now run depickle a.pickle from any terminal or directory on my computer.
The simplest way that comes to my mind is to use pyinstaller.
Create an environment that contains all the libraries used in your code.
Activate the environment and, in the command window, run pip install pyinstaller.
Use the command window to go to the main directory where your code maincode.py is located.
With the environment still active, run pyinstaller maincode.py.
Check the folder named "dist" and you will find the executable file.
I hope that this solution helps you.
GL
I struggled for a few days with the problem that the command py -3 (or any other py-launcher command) could not be found when the script was run by a service created with the NSSM tool, while the same commands worked when run directly from cmd.
The solution was to re-run the Python installer and, at the very end, click the option to disable the path length limit.
I'll just leave this here so that anyone who runs into it can find this answer helpful.
