pySpark has a worker - driver version conflict when ran in Rodeo

pySpark has a worker - driver version conflict when ran in Rodeo - python

The following simple script works fine in pyspark when it is ran from the terminal:
import pyspark
sc = pyspark.SparkContext()
foo = sc.parallelize([1,2])
foo.foreach(print)
But when ran in Rodeo, it produces an error, most important line of which says:
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions
And the full error output can be found at this link: http://pastebin.com/raw/unGuGLhq
My$SPARK_HOME/conf/spark-env.sh file contains the following lines:
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
The problem persists despite that and putting the same lines in ~/.bashrc doesn't solve the problem, either.
Rodeo version: 1.3.0
Spark version: 1.6.1
Platform: Linux

This issue is related to one described here: link
Rodeo as a desktop app has a hard time working with shell environment variables. The trick is to put variables we'd normally declare in spark-env.sh in Rodeo's .rodeoprofile instead using os module to add them. Specifically in this case adding the following lines to .rodeoprofile helped:
os.environ["PYSPARK_PYTHON"]="python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="python3"
(though the second one is redundant and I added it just for consistence as the driver used 3.5 anyway)

Related

R reticulate specifying python executable to use

First, I'm working on a Windows machine. I would like to specify a specific version of python to use in RStudio. I would like RStudio to use python 3 in the ArcGIS Pro folder in order to have arcpy available, along with the licensed extensions. I have reticulate installed and have tried the following methods to force RStudio to use the ArcGIS Pro version of python.
First I tried this:
library(reticulate)
use_python("C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe", required = TRUE)
The resulting error:
Error in path.expand(path) : invalid 'path' argument
Following some other tips, I tried setting the environment before loading the reticulate library.
Sys.setenv(RETICULATE_PYTHON = "c:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe")
library(reticulate)
Then I retrieve information about the the version of Python currently being used by reticulate.
py_config
Error in path.expand(path) : invalid 'path' argument
I also tried creating and editing the .Renviron by using the usethis package
usethis::edit_r_environ()
Then entering the following
RETICULATE_PYTHON="C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
And saving it, restarting R..
library (reticulate)
py_config()
Error in path.expand(path) : invalid 'path' argument
And, to confirm, here is the location...
Any ideas on why I continue to receive invalid 'path' argument

I was having a similar issue. After trying a whole assortment of things, I finally installed an archived version of reticulate (reticulate_1.22) instead of using the most up-to-date version (reticulate_1.23) and now the issue is gone. It appears that this bug has been brought to the developers' attention (https://github.com/rstudio/reticulate/issues/1189).

Try using
use_python("C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3")

Have you tried replacing Program Files with PROGRA~1 and have you maybe also checked for example a command like dir("path/to/your/env") although tbh your screenshot looks ok;
btw just in case - after editing your .Renviron file you need to restart your RStudio/R session for changes to take effect;
RStudio version 2022.02 has Python interpreter selection now available in Global Options

I ran into the same error with R version R-4.1.1, but when I switched back to the previous version R-4.0.5 everything worked as expected. It's a quick workaround but doesn't solve the underlying issue in the current version.

How can I resolve Python module import problems stemming from the failed import of NumPy C-extensions for running Spark/Python code on a MacBook Pro?

When I try to run the (simplified/illustrative) Spark/Python script shown below in the Mac Terminal (Bash), errors occur if imports are used for numpy, pandas, or pyspark.ml. The sample Python code shown here runs well when using the 'Section 1' imports listed below (when they include from pyspark.sql import SparkSession), but fails when any of the 'Section 2' imports are used. The full error message is shown below; part of it reads: '..._multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'). Apparently, there was a problem importing NumPy 'c-extensions' to some of the computing nodes. Is there a way to resolve the error so a variety of pyspark.ml and other imports will function normally? [Spoiler alert: It turns out there is! See the solution below!]
The problem could stem from one or more potential causes, I believe: (1) improper setting of the environment variables (e.g., PATH), (2) an incorrect SparkSession setting in the code, (3) an omitted but necessary Python module import, (4) improper integration of related downloads (in this case, Spark 3.2.1 (spark-3.2.1-bin-hadoop2.7), Scala (2.12.15), Java (1.8.0_321), sbt (1.6.2), Python 3.10.1, and NumPy 1.22.2) in the local development environment (a 2021 MacBook Pro (Apple M1 Max) running macOS Monterey version 12.2.1), or (5) perhaps a hardware/software incompatibility.
Please note that the existing combination of code (in more complex forms), plus software and hardware runs fine to import and process data and display Spark dataframes, etc., using Terminal--as long as the imports are restricted to basic versions of pyspark.sql. Other imports seem to cause problems, and probably shouldn't.
The sample code (a simple but working program only intended to illustrate the problem):
# Example code to illustrate an issue when using locally-installed versions
# of Spark 3.2.1 (spark-3.2.1-bin-hadoop2.7), Scala (2.12.15),
# Java (1.8.0_321), sbt (1.6.2), Python 3.10.1, and NumPy 1.22.2 on a
# MacBook Pro (Apple M1 Max) running macOS Monterey version 12.2.1
# The Python code is run using 'spark-submit test.py' in Terminal
# Section 1.
# Imports that cause no errors (only the first is required):
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
# Section 2.
# Example imports that individually cause similar errors when used:
# import numpy as np
# import pandas as pd
# from pyspark.ml.feature import StringIndexer
# from pyspark.ml.feature import VectorAssembler
# from pyspark.ml.classification import RandomForestClassifier
# from pyspark.ml import *
spark = (SparkSession
.builder
.appName("test.py")
.enableHiveSupport()
.getOrCreate())
# The associated dataset is located here (but is not required to replicate the issue):
# https://github.com/databricks/LearningSparkV2/blob/master/databricks-datasets/learning-spark-v2/flights/departuredelays.csv
# Create database and managed tables
spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE")
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING)")
# Display (print) the database
print(spark.catalog.listDatabases())
print('Completed with no errors!')
Here is the error-free output that results when only Section 1 imports are used (some details have been replaced by '...'):
MacBook-Pro ~/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src$ spark-submit test.py
[Database(name='default', description='Default Hive database', locationUri='file:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/spark-warehouse'), Database(name='learn_spark_db', description='', locationUri='file:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/spark-warehouse/learn_spark_db.db')]
Completed with no errors!
Here is the error that typically results when using from pyspark.ml import * or other (Section 2) imports individually:
MacBook-Pro ~/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src$ spark-submit test.py
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/__init__.py", line 23, in <module>
from . import multiarray
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/multiarray.py", line 10, in <module>
from . import overrides
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/overrides.py", line 6, in <module>
from numpy.core._multiarray_umath import (
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so, 0x0002): tried: '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/usr/lib/_multiarray_umath.cpython-310-darwin.so' (no such file)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/LearningSparkGitHub/chapter4/py/src/test.py", line 28, in <module>
from pyspark.ml import *
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/base.py", line 25, in <module>
File "/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 21, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/__init__.py", line 144, in <module>
from . import core
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/__init__.py", line 49, in <module>
raise ImportError(msg)
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.10 from "/Library/Frameworks/Python.framework/Versions/3.10/bin/python3"
* The NumPy version is: "1.22.2"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: dlopen(/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so, 0x0002): tried: '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/usr/lib/_multiarray_umath.cpython-310-darwin.so' (no such file)
To respond to the comment mentioned in the error message: Yes, the Python and NumPy versions noted above appear to be correct. (But it turns out the reference to Python 3.10 was misleading, as it was probably a reference to Python 3.10.1 rather than Python 3.10.2, as mentioned in Edit 1, below.)
For your reference, here are the settings currently used in the ~/.bash_profile:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home/
export SPARK_HOME=/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7
export SBT_HOME=/Users/.../Spark2/sbt
export SCALA_HOME=/Users/.../Spark2/scala-2.12.15
export PATH=$JAVA_HOME/bin:$SBT_HOME/bin:$SBT_HOME/lib:$SCALA_HOME/bin:$SCALA_HOME/lib:$PATH
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
# export PYSPARK_DRIVER_PYTHON="jupyter"
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
PATH="/Library/Frameworks/Python.framework/Versions/3.10/bin:${PATH}"
export PATH
# Misc: cursor customization, MySQL
export PS1="\h \w$ "
export PATH=${PATH}:/usr/local/mysql/bin/
# Not used, but available:
# export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-16.0.1.jdk/Contents/Home
# export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home
# export PATH=$PATH:$SPARK_HOME/bin
# For use of SDKMAN!
export SDKMAN_DIR="$HOME/.sdkman"
[[ -s "$HOME/.sdkman/bin/sdkman-init.sh" ]] && source "$HOME/.sdkman/bin/sdkman-init.sh"
The following website was helpful for loading and integrating Spark, Scala, Java, sbt, and Python (versions noted above): https://kevinvecmanis.io/python/pyspark/install/2019/05/31/Installing-Apache-Spark.html. Please note that the jupyter and notebook driver settings have been commented-out in the Bash profile because they are probably unnecessary (and because at one point, they seemed to interfere with the use of spark-submit commands in Terminal).
A review of the referenced numpy.org website did not help much:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
In response to some of the comments on the numpy.org website: a Python3 shell runs fine in the Mac Terminal, and pyspark and other imports (numpy, etc.) work there normally. Here is the output that results when printing the PYTHONPATH and PATH variables from Python interactively (with a few details replaced by '...'):
>>> import os
>>> print("PYTHONPATH:", os.environ.get('PYTHONPATH'))
PYTHONPATH: /Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/python/:
>>> print("PATH:", os.environ.get('PATH'))
PATH: /Users/.../.sdkman/candidates/sbt/current/bin:/Library/Frameworks/Python.framework/Versions/3.10/bin:/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home//bin:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/bin:/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7/sbin:/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home//bin:/Users/.../Spark2/sbt/bin:/Users/.../Spark2/sbt/lib:/Users/.../Spark2/scala-2.12.15/bin:/Users/.../Spark2/scala-2.12.15/lib:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/MacGPG2/bin:/Library/Apple/usr/bin:/usr/local/mysql/bin/
(I am not sure which portion of this output points to a problem.)
The previously attempted remedies included these (all unsuccessful):
The use and testing of a variety of environment variables in the ~/.bash_profile
Uninstallation and reinstallation of Python and NumPy using pip3
Re-installation of Spark, Scala, Java, Python, and sbt in a (new) local dev environment
Many Internet searches on the error message, etc.
To date, no action has resolved the problem.
Edit 1
I am adding recently discovered information.
First, it appears the PATH setting mentioned above (export PYSPARK_PYTHON=python3) was pointing toward Python 3.10.1 located in /Library/Frameworks/Python.framework/Versions/3.10/bin/python3 rather than to Python 3.10.2 in my development environment. I subsequently uninstalled Python 3.10.1 and reinstalled Python 3.10.2 (python-3.10.2-macos11.pkg) on my Mac (macOS Monterey 12.2.1), but have not yet changed the PYSPARK_PYTHON path to point toward the dev environment (suggestions would be welcome on how to do that). The code still throws errors as described previously.
Second, it may help to know a little more about the architecture of the computer, since the error message pointed to a potential hardware-software incompatiblity:
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')
The computer is a "MacBookPro18,2" with an Apple M1 Max chip (10 cores: 8 performance, and 2 efficiency; 32-core GPU). Some websites like these (https://en.wikipedia.org/wiki/Apple_silicon#Apple_M1_Pro_and_M1_Max, https://github.com/conda-forge/miniforge/blob/main/README.md) suggest 'Apple silicon' like the M1 Max needs software designed for the 'arm64' architecture. Using Terminal on the Mac, I checked the compatibility of Python 3.10.2 and the troublesome _multiarray_umath.cpython-310-darwin.so file. Python 3.10.2 is a 'universal binary' with 2 architectures (x86_64 and arm64), and the file is exclusively arm64:
MacBook-Pro ~$ python3 --version
Python 3.10.2
MacBook-Pro ~$ whereis python3
/usr/bin/python3
MacBook-Pro ~$ which python3
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3
MacBook-Pro ~$ file /Library/Frameworks/Python.framework/Versions/3.10/bin/python3
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64:Mach-O 64-bit executable arm64]
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 (for architecture x86_64): Mach-O 64-bit executable x86_64
/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 (for architecture arm64): Mach-O 64-bit executable arm64
MacBook-Pro ~$ file /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-darwin.so: Mach-O 64-bit bundle arm64
So I am still puzzled by the error message, which says 'x86_64' is needed for something (hardware or software?) to run this script. Do you need a special environment to run PySpark scripts on an Apple M1 Max chip? As discussed previously, PySpark seems to work fine on the same computer in Python's interactive mode:
MacBook-Pro ~$ python3
Python 3.10.2 (v3.10.2:a58ebcc701, Jan 13 2022, 14:50:16) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> from pyspark.ml import *
>>> import numpy as np
>>>
Is there a way to resolve the error so a variety of pyspark.ml and other imports will function normally in a Python script? Perhaps the settings in the ~/.bash_profile need to be changed? Would a different version of the _multiarray_umath.cpython-310-darwin.so file solve the problem, and if so, how would I obtain it? (Use a different version of Python?) I am seeking suggestions for code, settings, and/or actions. Perhaps there is an easy fix I have overlooked.

Solved it. The errors experienced while trying to import numpy c-extensions involved the challenge of ensuring each computing node had the environment it needed to execute the target script (test.py). It turns out this can be accomplished by zipping the necessary modules (in this case, only numpy) into a tarball (.tar.gz) for use in a 'spark-submit' command to execute the Python script. The approach I used involved leveraging conda-forge/miniforge to 'pack' the required dependencies into a file. (It felt like a hack, but it worked.)
The following websites were helpful for developing a solution:
Hyukjin Kwon's blog, "How to Manage Python Dependencies in PySpark" https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
"Python Package Management: Using Conda": https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
Alex Ziskind's video "python environment setup on Apple Silicon | M1, M1 Pro/Max with Conda-forge": https://www.youtube.com/watch?v=2Acht_5_HTo
conda-forge/miniforge on GitHub: https://github.com/conda-forge/miniforge (for Apple chips, use the Miniforge3-MacOSX-arm64 download for OS X (arm64, Apple Silicon).
Steps for implementing a solution:
Install conda-forge/miniforge on your computer (in my case, a MacBook Pro with Apple silicon), following Alex's recommendations. You do not yet need to activate any conda environment on your computer. During installation, I recommend these settings:
Do you wish the installer to initialize Miniforge3
by running conda init? [yes|no] >>> choose 'yes'
If you'd prefer that conda's base environment not be activated on startup,
set the auto_activate_base parameter to false:
conda config --set auto_activate_base false # Set to 'false' for now
After you have conda installed, cd into the directory that contains your Python (PySpark) script (i.e., the file you want to run--in the case discussed here, 'test.py').
Enter the commands recommended in the Spark documentation (see URL above) for "Using Conda." Include in the first line (shown below) a space-separated sequence of the modules you need (in this case, only numpy, since the problem involved the failure to import numpy c-extensions). This will create the tarball you need, pyspark_conda_env.tar.gz (with all of the required modules and dependencies for each computing node) in the directory where you are (the one that contains your Python script):
conda create -y -n pyspark_conda_env -c conda-forge numpy conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
(If needed, you could insert 'pyarrow pandas numpy', etc., instead of just 'numpy' in the first line of three shown above, if you require multiple modules such as these three. Pandas appears to have a dependency on pyarrow.)
The command conda activate pyspark_conda_env (above) will activate your new environment, so now is a good time to investigate which version of Python your conda environment has, and where it exists (you only need to do this once). You will need this information to set your PYSPARK_PYTHON environment variable in your ~/.bash_profile:
(pyspark_conda_env) MacBook-Pro ~$ python --version
Python 3.10.2
(pyspark_conda_env) MacBook-Pro ~$ which python
/Users/.../miniforge3/envs/pyspark_conda_env/bin/python
If you need a different version of Python, you can instruct conda to install it (see Alex's video).
Ensure your ~/.bash_profile (or similar profile) includes the following setting (filling-in the exact path you just discovered):
export PYSPARK_PYTHON=/Users/.../miniforge3/envs/pyspark_conda_env/bin/python
Remember to 'source' any changes to your profile (e.g., source ~/.bash_profile) or simply restart your Terminal so the changes take effect.
Use a command similar to this to run your target script (assuming you are in the same directory discussed above). The Python script should now execute successfully, with no errors:
spark-submit --archives pyspark_conda_env.tar.gz test.py
There are several other ways to use the tarball to ensure it is automatically unpacked on the Spark executors (nodes) to run your script. See the Spark documentation discussed above, if needed, to learn about them.
For clarity, here are the final ~/.bash_profile settings that worked for this installation, which included the ability to run Scala scripts in Spark. If you are not using Scala, the SBT_HOME and SCALA_HOME settings may not apply to you. Also, you may or may not need the PYTHONPATH setting. How to tailor it to your specific version of py4j is discussed in 'How to install PySpark locally' (https://sigdelta.com/blog/how-to-install-pyspark-locally/).
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home
export SPARK_HOME=/Users/.../Spark2/spark-3.2.1-bin-hadoop2.7
export SBT_HOME=/Users/.../Spark2/sbt
export SCALA_HOME=/Users/.../Spark2/scala-2.12.15
export PATH=$JAVA_HOME/bin:$SBT_HOME/bin:$SBT_HOME/lib:$SCALA_HOME/bin:$SCALA_HOME/lib:$PATH
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=/Users/.../miniforge3/envs/pyspark_conda_env/bin/python
PYTHONPATH=$SPARK_HOME$\python:$SPARK_HOME$\python\lib\py4j-0.10.9.3-src.zip:$PYTHONPATH
PATH="/Library/Frameworks/Python.framework/Versions/3.10/bin:${PATH}"
export PATH
# export PYSPARK_DRIVER_PYTHON="jupyter" # Not required
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook" # Not required
If you have suggestions on how to improve these settings, please comment below.
Other notes:
Your Python script should still include the other imports you need (in my case, there was no need to include a numpy import in the script itself--only numpy in the tarball). So your script might include these, for example:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
(etc.)
My script did not require this code snippet, which was shown in an example in the Spark documentation:
if __name__ == "__main__":
main(SparkSession.builder.getOrCreate())
The approach of simply creating a 'requirements.txt' file containing a list of modules (to zip and use in a spark-submit command without using conda) as discussed in this thread I can't seem to get --py-files on Spark to work did not work in my case:
pip3 install -t dependencies -r requirements.txt
zip -r dep.zip dependencies # Possibly incorrect...
zip -r dep.zip . # Correct if run from within folder containing requirements.txt
spark-submit --py-files dep.zip test.py
See 'PySpark dependencies' by Daniel Corin for more details on this approach, which clearly works in certain cases:
https://blog.danielcorin.com/posts/2015-11-09-pyspark/
I'm speculating a bit, but I think this approach may not allow packages built as 'wheels', so not all the dependencies you need will be built. The Spark documentation discusses this concept under "Using PySpark Native Features." (Feel free to test it out...you will not need conda-forge/miniforge to do so.)

Is there a version compatibility issue between Spark/Hadoop/Scala/Java/Python?

I'm getting an error while running spark-shell command through cmd but unfortunately without any luck so far. I have Python/Java/Spark/Hadoop(winutils.exe)/Scala installed with versions as below:
Python: 3.7.3
Java: 1.8.0_311
Spark: 3.2.0
Hadoop(winutils.exe):2.5x
scala sbt: sbt-1.5.5.msi
I followed below steps and ran spark-shell (C:\Program Files\spark-3.2.0-bin-hadoop3.2\bin>) through cmd:
Create JAVA_HOME variable: C:\Program Files\Java\jdk1.8.0_311\bin
Add the following part to your path: %JAVA_HOME%\bin
Create SPARK_HOME variable: C:\spark-3.2.0-bin-hadoop3.2\bin
Add the following part to your path: %SPARK_HOME%\bin
The most important part Hadoop path should include bin file before winutils.exe as the following: C:\Hadoop\bin Sure you will locate winutils.exe inside this path.
Create HADOOP_HOME Variable: C:\Hadoop
Add the following part to your path: %HADOOP_HOME%\bin
Am I missing out on anything? I've posted my question with error details in another thread (spark-shell command throwing this error: SparkContext: Error initializing SparkContext)

You went the difficult way in installing everything by hand. You may need Scala too, be extremely vigilant with the version you are installing, from your example it seems like it’s Scala 2.12.
But you are right: Spark is extremely demanding in term of version matching. Java 8 is good. Java 11 is ok too, but not any more recent version.
Alternatively, you can:
Try a very simple app like in https://github.com/jgperrin/net.jgp.books.spark.ch01
Use Docker with a pre made image, and if your goal is to do Python, I would recommend an image with Jupiter and Spark preconfigured together.

Python version different in worker and driver

The question I am trying to answer is:
Create RDD
Use the map to create an RDD of the NumPy arrays specified by the columns. The name of the RDD would be Rows
My code: Rows = df.select(col).rdd.map(make_array)
After I type this, I get a strange error, which basically says: Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I know I am working in an environment with Python 3.6. I am not sure if this specific line of code is triggering this error? What do you think
Just to note, this isn't my first line of code on this Jupyter notebook.
If you need more information, please let me know and I will provide it. I can't understand why this is happening.

Your slaves and your driver are not using the same version of Python, which will trigger this error anytime you use Spark.
Make sure you have Python 3.6 installed on your slaves then (in Linux) modify your spark/conf/spark-env.sh file to add PYSPARK_PYTHON=/usr/local/lib/python3.6 (if this is the python directory in your slaves)

In a notebook recently, I had to add those lines at the beginning, to sync the python versions:
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

Unable to change the Python to be used for interacting with R using reticulate

I want to use a specific Python version: /Users/aviral.s/.pyenv/versions/3.5.2/bin/python. This version is not available for R.
I tried reading the documentation but following all the three steps(setting the env variable, using the API use_python() didn't help either.
With sudo, I run the following code:
library("reticulate")
py_config()
use_python("/Users/aviral.s/.pyenv/versions/3.5.2/bin/python")
py_config() # Unchanged.
I tried using any of the available ones in the py_config() which worked by setting the environment variable as in here
However, if I set the same env variable to my pyenv version, I get this error:
> library("reticulate")
> py_config()
Error in initialize_python(required_module, use_environment) :
Python shared library not found, Python bindings not loaded.
My env variable is correct:
echo $RETICULATE_PYTHON
/Users/aviral.s/.pyenv/versions/3.5.2/bin/python

I ran into the same problem a few days ago and i had to jump through all kinds of hoops to get where i wanted and i am not sure which one did it for me, but what definitely helped was using py_discover_config() instead of the regular py_config() command.
what might be another problem, is that apparently a python version with installed numpy will always be preferred by reticulate:

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

pySpark has a worker - driver version conflict when ran in Rodeo - python

Related

R reticulate specifying python executable to use

How can I resolve Python module import problems stemming from the failed import of NumPy C-extensions for running Spark/Python code on a MacBook Pro?

Is there a version compatibility issue between Spark/Hadoop/Scala/Java/Python?

Python version different in worker and driver

Unable to change the Python to be used for interacting with R using reticulate

Categories

Resources