Python version different in worker and driver - python

The question I am trying to answer is:
Create RDD
Use the map to create an RDD of the NumPy arrays specified by the columns. The name of the RDD would be Rows
My code: Rows = df.select(col).rdd.map(make_array)
After I type this, I get a strange error, which basically says: Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I know I am working in an environment with Python 3.6. I am not sure if this specific line of code is triggering this error? What do you think
Just to note, this isn't my first line of code on this Jupyter notebook.
If you need more information, please let me know and I will provide it. I can't understand why this is happening.

Your slaves and your driver are not using the same version of Python, which will trigger this error anytime you use Spark.
Make sure you have Python 3.6 installed on your slaves then (in Linux) modify your spark/conf/spark-env.sh file to add PYSPARK_PYTHON=/usr/local/lib/python3.6 (if this is the python directory in your slaves)

In a notebook recently, I had to add those lines at the beginning, to sync the python versions:
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

Related

Pandas Installed For All Python Versions But Module Can't Be Found

I am trying to modify an AI for a game on the steam store. The AI communicates through the game with the use of a mod called the communication mod. The AI is made using a python project. The package I am trying to modify is https://github.com/ForgottenArbiter/spirecomm and the mod is https://github.com/ForgottenArbiter/CommunicationMod.
I want to add the pandas package and the job lib package as imports so I can use a model I have made for the AI. When I try to run the game + mod after adding the pandas and joblib packages as imports I get this error in the error log.
Traceback (most recent call last):
File "/Users/ross/downloads/spirecomm-master/main.py", line 6, in <module>
from spirecomm.ai.agent import SimpleAgent
File "/Users/ross/Downloads/spirecomm-master/spirecomm/ai/agent.py", line 10, in <module>
import pandas as pd
ModuleNotFoundError: No module named 'pandas'
This issue only happens when the game is running and the mod tries to run. if I just run the file in terminal it is able to compile/run and send the ready signal
I have checked that I have these modules installed and it is installed. I am on an M1 Mac and I have several different versions of python installed but I have checked them all and it is installed for each of them. I have also opened the python package using pycharm and added pandas and joblib to the python interpreter as a package.
Another thing I have tried is modifying the setup.py file to say that pandas and joblib are required. I then ran it again but I am not sure if this had any effect because I have already run it before.
There is limited help that can be provided without knowing the framework that you are is using but hopefully this will give you some starting points to help.
If you are getting a "No module named 'pandas'" error, it is because you have imported pandas in your code but your python interpreter cannot find it. There are two major reasons this will happen, either it is not installed (which you say has definitely happened) or it is not in the location the interpreter expects (most likely).
The first thing you can do is make sure the pandas install is in the PYTHONPATH. To do this look at Permanently add a directory to PYTHONPATH?.
Secondly, you say you have several versions of python and have installed the module for all versions but you most likely have several instances of at least one version. Many IDEs, such as PyCharm, create a virtual environment when you create a new project and place in it a new instance of python interpreter, or more accurately a link to one version of python. Within this virtual environment, the IDE then loads the modules it has been told to load and thus makes them available for import to the code using that particular environment.
In your case I suspect you may be using a virtual environment that has not had pandas loaded into it. You need to investigate your IDEs documentation if you do not know how to load it. Alternatively you can instruct the IDE to use a different virtual environment (one that does have pandas loaded) - again search documentation for how to do this.
Finally, if all else fails, you can specifically tell your code where to look for the module using the sys.path.append command in your code.
import sys
sys.path.append('/your/pandas/module/path')
import pandas

The pandas installed in my conda env is not the same version that the one used by python

I am working in a High Performance Computing and I have been using a conda env. created by the team that supports this.
Today I have been getting strange error messages such as
PandasArray object has no attribute _str_len
PandasArray object has no attribute _str_contain
I think something is going wrong.
If I read a simple csv and then I print its content I get the first error I have put above in the line that print the data frame.
I have checked the pandas version using conda list and the version is 1.2. However, if I do
python
>>> import pandas as pd
>>> print(pd._version_)
the output is 1.1.4
It is weird because when I do conda list, pandas 1.1.4 is not in that list.
Python version is 3.7.
I have also take a backup of my application to see if it is me but I have found same error I show you above. I cannot use for example .str.replace()
I have not idea what is going on but I would like to use pandas 1.2 to see if that fix the problem. How can I choose between version when running a script?

R reticulate specifying python executable to use

First, I'm working on a Windows machine. I would like to specify a specific version of python to use in RStudio. I would like RStudio to use python 3 in the ArcGIS Pro folder in order to have arcpy available, along with the licensed extensions. I have reticulate installed and have tried the following methods to force RStudio to use the ArcGIS Pro version of python.
First I tried this:
library(reticulate)
use_python("C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe", required = TRUE)
The resulting error:
Error in path.expand(path) : invalid 'path' argument
Following some other tips, I tried setting the environment before loading the reticulate library.
Sys.setenv(RETICULATE_PYTHON = "c:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe")
library(reticulate)
Then I retrieve information about the the version of Python currently being used by reticulate.
py_config
Error in path.expand(path) : invalid 'path' argument
I also tried creating and editing the .Renviron by using the usethis package
usethis::edit_r_environ()
Then entering the following
RETICULATE_PYTHON="C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
And saving it, restarting R..
library (reticulate)
py_config()
Error in path.expand(path) : invalid 'path' argument
And, to confirm, here is the location...
Any ideas on why I continue to receive invalid 'path' argument
I was having a similar issue. After trying a whole assortment of things, I finally installed an archived version of reticulate (reticulate_1.22) instead of using the most up-to-date version (reticulate_1.23) and now the issue is gone. It appears that this bug has been brought to the developers' attention (https://github.com/rstudio/reticulate/issues/1189).
Try using
use_python("C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3")
Have you tried replacing Program Files with PROGRA~1 and have you maybe also checked for example a command like dir("path/to/your/env") although tbh your screenshot looks ok;
btw just in case - after editing your .Renviron file you need to restart your RStudio/R session for changes to take effect;
RStudio version 2022.02 has Python interpreter selection now available in Global Options
I ran into the same error with R version R-4.1.1, but when I switched back to the previous version R-4.0.5 everything worked as expected. It's a quick workaround but doesn't solve the underlying issue in the current version.

Trying to run python code on jenkins in ubuntu

all.
I recently started working with Jenkins, in an attempt to replace cronjob with Jenkins pipeline. I have really a bit knowledge of programming jargon. I learned what I learned from questions on stackoverflow. So, if you guys need any more info, I would really appreciate if you use plain English.
So, I installed the lastest version of Jenkins and suggested plugins plus all the plugins that I could find useful to python running.
Afterwards, I searched stackoverflow and other websites to make this work, but all I could do was
#!/usr/bin/env python
from __future__ import print_function
print('Hello World')
And it succeeded.
Currently, Jenkins is running on Ubuntu 16.04, and I am using anaconda3's python (~/anaconda3/bin/python).
When I tried to run a bit more complicated python code (by that I mean import pandas), it gives me import error.
What I have tried so far is
execute python script build: import pandas - import error
execute shell build: import pandas (import pandas added to the code that worked above)
python builder build: import pandas - invalid interpreter error
pipeline job: sh python /path_to_python_file/*.py - import error
All gave errors. Since 'hello world' works, I believe that using anaconda3's python is not an issue. Also, it imported print_function just fine, so I want to know what I should do from here. Change workspace setting? workdirectory setting? code changes?
Thanks.
Since 'hello world' works, I believe that using anaconda3's python is not an issue.
Your assumption is wrong.
There are multiple ways of solving the issue but they all come down to using the correct python interpreter with installed pandas. Usually in ubuntu you'll have at least two interpreters. One for python 2 and one for python 3 and you'll use them in shell by calling either python pth/to/myScript.py or python3 pth/to/myScript.py. python and python3 are in this case just a sort of labels which point to the correct executables, using environmental variable PATH.
By installing anaconda3 you are adding one more interpreter with pandas and plenty of other preinstalled packages. If you want to use it, you need to tell somehow your shell or Jenkins about it. If import pandas gives you an error then you're probably using a different interpreter or a different python environment (but this is out of scope here).
Coming back to your script
Following this stack overflow answer, you'll see that all the line #!/usr/bin/env python does, is to make sure that you're using the first python interpreter on your Ubuntu's environment path. Which almost for sure isn't the one you installed with anaconda3. Most likely it will be the default python 2 distributed with ubuntu. If you want to make sure which interpreter exactly is running your script, instead of 'Hello World' put inside:
#!/usr/bin/env python
import sys
print(sys.executable) # this line will give you the exact path to the interpreter
print(sys.version) # this one will give you the version
Ok, so what to do?
Well, run your script using the correct interpreter. Remove #!/usr/bin/env python from your file and if you have a pipeline, add there:
sh "/home/yourname/anaconda3/bin/python /path_to_python_file/myFile.py"
It will most likely solve the issue. It's also quite flexible in the sense that if you ever want to use this python file on a different machine, you won't have your username hardcoded inside.

pySpark has a worker - driver version conflict when ran in Rodeo

The following simple script works fine in pyspark when it is ran from the terminal:
import pyspark
sc = pyspark.SparkContext()
foo = sc.parallelize([1,2])
foo.foreach(print)
But when ran in Rodeo, it produces an error, most important line of which says:
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions
And the full error output can be found at this link: http://pastebin.com/raw/unGuGLhq
My$SPARK_HOME/conf/spark-env.sh file contains the following lines:
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
The problem persists despite that and putting the same lines in ~/.bashrc doesn't solve the problem, either.
Rodeo version: 1.3.0
Spark version: 1.6.1
Platform: Linux
This issue is related to one described here: link
Rodeo as a desktop app has a hard time working with shell environment variables. The trick is to put variables we'd normally declare in spark-env.sh in Rodeo's .rodeoprofile instead using os module to add them. Specifically in this case adding the following lines to .rodeoprofile helped:
os.environ["PYSPARK_PYTHON"]="python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="python3"
(though the second one is redundant and I added it just for consistence as the driver used 3.5 anyway)

Categories