Is there a version compatibility issue between Spark/Hadoop/Scala/Java/Python? - python

I'm getting an error when running the spark-shell command through cmd, and I have had no luck fixing it so far. I have Python/Java/Spark/Hadoop(winutils.exe)/Scala installed with the versions below:
Python: 3.7.3
Java: 1.8.0_311
Spark: 3.2.0
Hadoop(winutils.exe):2.5x
scala sbt: sbt-1.5.5.msi
I followed the steps below and ran spark-shell (C:\Program Files\spark-3.2.0-bin-hadoop3.2\bin>) through cmd:
Create JAVA_HOME variable: C:\Program Files\Java\jdk1.8.0_311\bin
Add the following part to your path: %JAVA_HOME%\bin
Create SPARK_HOME variable: C:\spark-3.2.0-bin-hadoop3.2\bin
Add the following part to your path: %SPARK_HOME%\bin
Most important: the Hadoop path should include a bin folder that contains winutils.exe, as follows: C:\Hadoop\bin. Make sure winutils.exe is located inside this path.
Create HADOOP_HOME Variable: C:\Hadoop
Add the following part to your path: %HADOOP_HOME%\bin
Am I missing anything? I've posted my question with error details in another thread (spark-shell command throwing this error: SparkContext: Error initializing SparkContext)

You took the difficult route by installing everything by hand. You may need Scala too; be extremely vigilant about the version you install. From your example, it looks like Scala 2.12.
But you are right: Spark is extremely demanding in terms of version matching. Java 8 is good. Java 11 is OK too, but nothing more recent.
Alternatively, you can:
Try a very simple app like in https://github.com/jgperrin/net.jgp.books.spark.ch01
Use Docker with a premade image. If your goal is to work in Python, I would recommend an image with Jupyter and Spark preconfigured together.
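One thing worth checking in the steps listed in the question: PATH gets %JAVA_HOME%\bin and %SPARK_HOME%\bin appended, so the variables themselves would normally name the install root rather than its bin folder. Below is a minimal sketch of that check using only the Python standard library. The function name check_home_vars is made up for illustration, and the check only looks at the variable values, not at what is actually on disk.

```python
import ntpath  # Windows path rules, regardless of the OS running this script
import os


def check_home_vars(env):
    """Return a list of warnings for *_HOME variables that look misconfigured."""
    warnings = []
    for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            warnings.append(name + " is not set")
            continue
        # A *_HOME variable should name the install root; PATH then adds %NAME%\bin.
        last_part = ntpath.basename(value.rstrip("\\/"))
        if last_part.lower() == "bin":
            warnings.append(name + " ends in 'bin'; point it at the parent folder")
    return warnings


# Print any warnings for the current environment.
for warning in check_home_vars(os.environ):
    print(warning)
```

With the values from the question, this would flag JAVA_HOME and SPARK_HOME, since both end in \bin, while HADOOP_HOME (C:\Hadoop) would pass.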

Related

R reticulate specifying python executable to use

First, I'm working on a Windows machine. I would like to specify a specific version of python to use in RStudio. I would like RStudio to use python 3 in the ArcGIS Pro folder in order to have arcpy available, along with the licensed extensions. I have reticulate installed and have tried the following methods to force RStudio to use the ArcGIS Pro version of python.
First I tried this:
library(reticulate)
use_python("C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe", required = TRUE)
The resulting error:
Error in path.expand(path) : invalid 'path' argument
Following some other tips, I tried setting the environment before loading the reticulate library.
Sys.setenv(RETICULATE_PYTHON = "c:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe")
library(reticulate)
Then I retrieve information about the version of Python currently being used by reticulate.
py_config()
Error in path.expand(path) : invalid 'path' argument
I also tried creating and editing the .Renviron by using the usethis package
usethis::edit_r_environ()
Then entering the following
RETICULATE_PYTHON="C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
Then saving it and restarting R:
library(reticulate)
py_config()
Error in path.expand(path) : invalid 'path' argument
And, to confirm, here is the location...
Any ideas on why I continue to receive invalid 'path' argument?
I was having a similar issue. After trying a whole assortment of things, I finally installed an archived version of reticulate (reticulate_1.22) instead of using the most up-to-date version (reticulate_1.23) and now the issue is gone. It appears that this bug has been brought to the developers' attention (https://github.com/rstudio/reticulate/issues/1189).
Try using
use_python("C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3")
Have you tried replacing Program Files with PROGRA~1? You could also check that the path is visible from R with a command like dir("path/to/your/env"), although to be honest your screenshot looks OK.
Also, just in case: after editing your .Renviron file you need to restart your RStudio/R session for the changes to take effect.
RStudio version 2022.02 has Python interpreter selection now available in Global Options
I ran into the same error with R version R-4.1.1, but when I switched back to the previous version R-4.0.5 everything worked as expected. It's a quick workaround but doesn't solve the underlying issue in the current version.

Unable to change the Python to be used for interacting with R using reticulate

I want to use a specific Python version: /Users/aviral.s/.pyenv/versions/3.5.2/bin/python. This version is not available for R.
I tried reading the documentation, but following all three steps (setting the env variable, using the use_python() API) didn't help either.
With sudo, I run the following code:
library("reticulate")
py_config()
use_python("/Users/aviral.s/.pyenv/versions/3.5.2/bin/python")
py_config() # Unchanged.
I tried each of the interpreters listed by py_config(), and those worked when I set the environment variable as in here
However, if I set the same env variable to my pyenv version, I get this error:
> library("reticulate")
> py_config()
Error in initialize_python(required_module, use_environment) :
Python shared library not found, Python bindings not loaded.
My env variable is correct:
echo $RETICULATE_PYTHON
/Users/aviral.s/.pyenv/versions/3.5.2/bin/python
I ran into the same problem a few days ago and had to jump through all kinds of hoops to get where I wanted. I am not sure which step did it for me, but what definitely helped was using py_discover_config() instead of the regular py_config() command.
Another possible problem is that reticulate will apparently always prefer a Python version that has numpy installed.
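The error in the question mentions a missing Python shared library, so it may be worth checking whether the pyenv interpreter was built with one. Below is a minimal sketch, meant to be run with the interpreter in question (for example /Users/aviral.s/.pyenv/versions/3.5.2/bin/python). The function name shared_library_info is made up for illustration. A false enable_shared value usually points to a static build, although macOS framework builds are an exception.

```python
import sys
import sysconfig


def shared_library_info():
    """Report whether this interpreter was configured with --enable-shared."""
    return {
        "executable": sys.executable,
        # 1 for shared builds, 0 for static builds, None where the variable is undefined.
        "enable_shared": bool(sysconfig.get_config_var("Py_ENABLE_SHARED")),
        # Name of the Python library this build links against, if recorded.
        "library": sysconfig.get_config_var("LDLIBRARY"),
    }


for key, value in sorted(shared_library_info().items()):
    print(key, "=", value)
```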

Got errors, while running exe file built with pyinstaller and Google Cloud API integration in python

I am working on a single-file Python project.
I integrated the Google Cloud API for real-time speech streaming and recognition.
It works well when run with the python aaa.py command.
Now I need a Windows executable (.exe), so I used PyInstaller and built aaa.exe successfully.
But I get this error when running speech streaming through the Google Cloud API:
[Errno 2] No such file or directory:
'D:\AI\ai\dist\AAA\google\cloud\gapic\speech\v1\speech_client_config.json'
So I copied the speech_client_config.json file to the expected path, after which I got the error below.
Exception in 'grpc._cython.cygrpc.ssl_roots_override_callback' ignored
E0511 01:13:14.320000000 3108 src/core/lib/security/security_connector/security_connector.cc:1170] assertion failed: pem_root_certs != nullptr
I cannot find a way to get a working build with the Google Cloud API.
I am using Python 2.7.14.
I would appreciate your help.
Thanks.
I had the same problem. If you are willing to distribute roots.pem with your executable (just search for the file - it should be buried deep within the installation directory of grpcio), I had luck fixing this by setting GRPC_DEFAULT_SSL_ROOTS_FILE_PATH environment variable to the full path of this roots.pem file.
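A minimal sketch of that approach, assuming roots.pem has been copied next to the executable. The helper names app_dir and use_bundled_roots are made up for illustration, and setting the variable before the Google Cloud client is imported is the cautious ordering rather than a documented requirement.

```python
import os
import sys


def app_dir():
    """Folder holding the executable when frozen by PyInstaller, else the current folder."""
    if getattr(sys, "frozen", False):
        return os.path.dirname(sys.executable)
    return os.getcwd()


def use_bundled_roots(base_dir, filename="roots.pem"):
    """Point gRPC at a roots.pem file shipped inside base_dir."""
    path = os.path.join(base_dir, filename)
    os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = path
    return path


# Assumption: this runs at the top of aaa.py, before the Google Cloud imports.
use_bundled_roots(app_dir())
```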
Update 2021
To anyone experiencing this issue: I got it working thanks to these amazing people. See the full conversation on this GitHub issue.
Here is the link
Step 1
Credits to @cbenhagen & @rising-stark on this GitHub link.
A PyInstaller hook called hook-grpc.py does the trick. Create a Python file named hook-grpc.py with this code:
from PyInstaller.utils.hooks import collect_data_files
datas = collect_data_files('grpc')
Step 2
Put the hook-grpc.py file in the \site-packages\PyInstaller\hooks directory of the Python environment you are running. You can typically find it at
C:\Users\yourusername\AppData\Local\Programs\Python\Python37\Lib\site-packages\PyInstaller\hooks
Note:
Change yourusername and Python37 to your own username and the Python version you are using.
For Anaconda users the location might be different. Check this site to find the path of the Anaconda Python environment you are using.
Step 3
Once you've done that, you can convert your .py program to .exe using PyInstaller and it should work.
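An alternative to copying the hook into site-packages is to keep it in a project-local folder and pass that folder to PyInstaller with its --additional-hooks-dir option. Below is a minimal sketch of that variant. The helper names write_grpc_hook and pyinstaller_args and the folder name hooks are made up for illustration, and this is not the method from the GitHub thread itself.

```python
import os

# Same two lines as the hook shown in Step 1.
HOOK_SOURCE = (
    "from PyInstaller.utils.hooks import collect_data_files\n"
    "datas = collect_data_files('grpc')\n"
)


def write_grpc_hook(hooks_dir):
    """Create hooks_dir/hook-grpc.py and return its path."""
    if not os.path.isdir(hooks_dir):
        os.makedirs(hooks_dir)
    hook_path = os.path.join(hooks_dir, "hook-grpc.py")
    with open(hook_path, "w") as handle:
        handle.write(HOOK_SOURCE)
    return hook_path


def pyinstaller_args(script, hooks_dir):
    """Build the argument list for a PyInstaller run that uses the local hooks folder."""
    return [script, "--additional-hooks-dir=" + hooks_dir]
```

The resulting list can be passed on the command line (pyinstaller aaa.py --additional-hooks-dir=hooks) or, from Python, to PyInstaller.__main__.run(pyinstaller_args("aaa.py", "hooks")).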
This looks to me like an SSL credentials problem. I think you are not being granted access to Google Cloud. Check this code snippet and this documentation.

pySpark has a worker - driver version conflict when run in Rodeo

The following simple script works fine in pyspark when it is run from the terminal:
import pyspark
sc = pyspark.SparkContext()
foo = sc.parallelize([1,2])
foo.foreach(print)
But when run in Rodeo, it produces an error, the most important line of which says:
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions
And the full error output can be found at this link: http://pastebin.com/raw/unGuGLhq
My $SPARK_HOME/conf/spark-env.sh file contains the following lines:
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
The problem persists despite that, and putting the same lines in ~/.bashrc doesn't solve it either.
Rodeo version: 1.3.0
Spark version: 1.6.1
Platform: Linux
This issue is related to one described here: link
Rodeo, as a desktop app, has a hard time working with shell environment variables. The trick is to put the variables we'd normally declare in spark-env.sh into Rodeo's .rodeoprofile instead, using the os module to set them. Specifically, in this case adding the following lines to .rodeoprofile helped:
import os  # harmless if .rodeoprofile already imports it
os.environ["PYSPARK_PYTHON"]="python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="python3"
(though the second one is redundant and I added it just for consistency, as the driver used 3.5 anyway)
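The same idea can be written so that the workers always use whichever interpreter the driver is running, by setting PYSPARK_PYTHON from sys.executable before the SparkContext is created. Below is a minimal sketch. The helper name pin_worker_python is made up for illustration, and it assumes the driver and workers run on the same machine, so the same interpreter path is valid for both.

```python
import os
import sys


def pin_worker_python(env=None):
    """Point PySpark workers at the interpreter running this code."""
    env = os.environ if env is None else env
    env["PYSPARK_PYTHON"] = sys.executable
    return env["PYSPARK_PYTHON"]


# Call this before pyspark.SparkContext() is created.
pin_worker_python()
```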

Set up Brew installed Python 2.7.X as sdk for Intellij / Pycharm

I am trying to import the brew installed version of python by emulating the Global Libraries structure existing for the (mostly) working mac os built-in 2.7.2. However IJ is unable to infer the types or to create the library properly.
Update: this is a large existing project. Creating a new project just to get a different version of Python is not an option.
Here are the steps:
Try to create a new Global Library: fail, no Python.
OK, so I use Copy to clone the built-in SDK:
Now - let us try to emulate the paths included in the original built-in but with the brew base dir: here is a starting point:
And here is one of the exact entries from the builtin library:
So let us click on the + to add it:
So IJ is unable to handle it properly. I also tried half a dozen others, all with the same shrug result from IJ.
So then what is the correct process?
Update: here is the project SDK dialog (thanks to scribbles).
And trying to add it: but the "OK" button is not enabled! So IJ is not able to load it.
New Project -> Select SDK.
See this video if you still have any questions.
EDIT: Is this more along the lines of what you're looking for (link)?
This is old, but I ran into the same problem with the current Python 2 install from Homebrew on High Sierra. Instead of choosing a directory as the previous setup required, I just set up the Python SDK pointing to the python executable link in /usr/local/opt/python/libexec/bin (which is the directory I added to my path for Python 2). It seems to be working just fine now.
Hopefully this will help someone.
