New to Spark. I downloaded everything fine, but when I run pyspark I get the following errors:
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/05 20:46:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\bin\..\python\pyspark\shell.py", line 43, in <module>
spark = SparkSession.builder\
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\pyspark\sql\session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
Also, when I try (as recommended by http://spark.apache.org/docs/latest/quick-start.html)
textFile = sc.textFile("README.md")
I get:
NameError: name 'sc' is not defined
Any advice? Thank you!
If you are doing it from the pyspark console, it may be because your installation did not work.
If not, it's because most examples assume you are testing code in the pyspark console, where a default variable 'sc' exists.
You can create a SparkContext by yourself at the beginning of your script using the following code:
from pyspark import SparkContext, SparkConf
conf = SparkConf()
sc = SparkContext(conf=conf)
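Since this question is on Spark 2.x, where SparkSession is the usual entry point, an equivalent sketch (the app name is just an illustrative placeholder) is:
# Spark 2.x sketch: build a SparkSession and take the SparkContext from it
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my-app").getOrCreate()
sc = spark.sparkContext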
It looks like you've found the answer to the second part of your question in the answer above, but for future users getting here via the 'org.apache.spark.sql.hive.HiveSessionState' error: this class is found in the spark-hive jar file, which does not come bundled with Spark unless Spark is built with Hive.
You can get this jar at:
http://central.maven.org/maven2/org/apache/spark/spark-hive_${SCALA_VERSION}/${SPARK_VERSION}/spark-hive_${SCALA_VERSION}-${SPARK_VERSION}.jar
You'll have to put it into your SPARK_HOME/jars folder, and then Spark should be able to find all of the Hive classes required.
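Once the jar is in place, a quick way to confirm that the Hive classes can now be loaded is a minimal check like this (a sketch, assuming an otherwise working installation):
# Minimal check that Hive support can be instantiated after adding spark-hive
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SHOW DATABASES").show()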
I also encountered this issue on Windows 7 with pre-built Spark 2.2. Here is a possible solution for Windows users:
make sure all the environment variables are set correctly, including SPARK_HOME, HADOOP_HOME, etc.
get the correct version of winutils.exe for the Spark-Hadoop prebuilt package
then open a Command Prompt as Administrator and run this command:
winutils chmod 777 C:\tmp\hive
Note: The drive might be different depending on where you invoke pyspark or spark-shell
This link should take the credit: see the answer by timesking
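As a quick sanity check before launching pyspark again, a small Python snippet can confirm that the steps above took effect (the C:\tmp\hive location is the one used in the command above; adjust the drive if yours differs):
# Sanity check for the Windows setup described above
import os
for var in ("SPARK_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))
winutils = os.path.join(os.environ.get("HADOOP_HOME", ""), "bin", "winutils.exe")
print("winutils.exe present:", os.path.exists(winutils))
print(r"C:\tmp\hive exists:", os.path.isdir(r"C:\tmp\hive"))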
If you're on a Mac and you've installed Spark (and eventually Hive) through Homebrew, the answers from @Eric Pettijohn and @user7772046 will not work: the former because Homebrew's Spark already contains the aforementioned jar file, the latter because it is a purely Windows-based solution.
Inspired by this link and the permission issues hint, I've come up with the following simple solution: launch pyspark using sudo. No more Hive-related errors.
I deleted the metastore_db directory and then things worked. I'm doing some light development on a MacBook; I had run PyCharm to sync my directory with the server, and I think it picked up that Spark-specific directory and messed it up. For me, the error message came up when I was trying to start an interactive IPython pyspark shell.
In my case, I had configured Hadoop in YARN mode, so my solution was to start HDFS and YARN:
start-dfs.sh
start-yarn.sh
I came across this error:
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
This is because I had already run ./bin/spark-shell.
So just kill that spark-shell and re-run ./bin/pyspark.
You need a "winutils" competable in the hadoop bin directory.
Environment:
macOS Big Sur v 11.6.1
Python 3.7.7
pyarrow==5.0.0 (from pip freeze)
From the terminal:
>>> import pyarrow
>>> pyarrow
<module 'pyarrow' from '/Users/garyb/Develop/DS/tools-pay-data-pipeline/env/lib/python3.7/site-packages/pyarrow/__init__.py'>
So I confirmed that I have pyarrow installed. But when I try to write a Dask dataframe to parquet I get:
def make_parquet_file(filepath):
    parquet_path = f'{PARQUET_DIR}/{company}_{table}_{batch}.parquet'
    df.to_parquet(parquet_path, engine='pyarrow')
ModuleNotFoundError: No module named pyarrow
The exception detail:
~/Develop/DS/research-dask-parquet/env/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in get_engine(engine)
970 elif engine in ("pyarrow", "arrow", "pyarrow-legacy", "pyarrow-dataset"):
971
--> 972 pa = import_required("pyarrow", "`pyarrow` not installed")
973 pa_version = parse_version(pa.__version__)
974
This function works; it uses a much smaller CSV file, just to confirm that the to_parquet call itself works:
def make_parquet_file():
    csv_file = f'{CSV_DATA_DIR}/diabetes.csv'
    parquet_file = f'{PARQUET_DIR}/diabetes.parquet'
    # Just to prove I can read the csv file
    p_df = pd.read_csv(csv_file)
    print(p_df.shape)
    d_df = dd.read_csv(csv_file)
    d_df.to_parquet(parquet_file)
Is it looking in the right place for the package? I'm stuck
It does seem that dask and pure python are using different environments.
In the first example the path is:
~/Develop/DS/tools-pay-data-pipeline/env/lib/python3.7
In the traceback the path is:
~/Develop/DS/research-dask-parquet/env/lib/python3.7
So a quick fix is to install pyarrow in the second environment. Another fix is to install the packages on workers (this might help).
A more robust fix is to use environment files.
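To see which interpreter and which pyarrow (if any) each side actually uses, a rough diagnostic sketch is below; the dask.distributed part only applies if you run a Client:
# Rough diagnostic: which interpreter and which pyarrow does this process see?
import sys
import importlib.util

def report():
    spec = importlib.util.find_spec("pyarrow")
    return {"python": sys.executable, "pyarrow": spec.origin if spec else None}

print(report())  # the notebook / driver environment

# If you use dask.distributed, the same check can run on every worker:
# from dask.distributed import Client
# client = Client()  # or Client("scheduler-address:8786")
# print(client.run(report))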
OK, problem solved - maybe. During development and experimentation I use Jupyter to test and debug. Later all the functions get moved into scripts which can then be imported into any notebook that needs them. The role of the notebook at that point in this project is demo and document, i.e. a better alternative to the command line for demo.
In this case I'm still in experiment mode. So the problem was the Jupyter kernel: I had inadvertently recycled the kernel name, and the virtual environments in the two projects also have the same name, "env". See a pattern here? Laziness bit me.
I deleted the kernel that was being used and created a new one with a unique name. PyArrow was then pulled from the correct virtual environment and worked as expected.
https://raw.githubusercontent.com/paramiko/paramiko/master/demos/demo_server.py
I see the above demo_server from paramiko, but I don't see instructions on how to run it. I ran the ./demo_server.py command below, but once I run ssh robey@127.0.0.1 -p 2200, the server fails. Could anybody let me know the complete steps for running this example? Thanks.
$ python3 ./demo_server.py
Read key: 60733844cb5186657fdedaa22b5a57d5
Listening for connection ...
Got a connection!
*** Caught exception: <class 'ImportError'>: Unable to import a GSS-API / SSPI module!
Traceback (most recent call last):
File "./demo_server.py", line 140, in <module>
t = paramiko.Transport(client, gss_kex=DoGSSAPIKeyExchange)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/paramiko/transport.py", line 445, in __init__
self.kexgss_ctxt = GSSAuth("gssapi-keyex", gss_deleg_creds)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/paramiko/ssh_gss.py", line 107, in GSSAuth
raise ImportError("Unable to import a GSS-API / SSPI module!")
ImportError: Unable to import a GSS-API / SSPI module!
$ ssh robey@127.0.0.1 -p 2200
kex_exchange_identification: read: Connection reset by peer
I managed to get it to work. Make sure to go through these steps (thank you Sandeep for the pip insight); chances are you are missing the Kerberos dependencies:
1. You may need to run pip install gssapi in the CLI if that has not already been done (I'm using the Windows Command Prompt; Linux/WSL might need pip3 instead, depending on your version of Python).
2. From there you will need to import the gssapi library at the top of the code with the other imports, so just add import gssapi in demo_server.py.
3. After running demo_server.py again, the CLI should eventually say something like it is missing files located in Program Files\MIT\Kerberos\bin. As Kerberos is a dependency of gssapi, you can install it from here: https://web.mit.edu/KERBEROS/dist/
4. Make sure you do a custom install if you're not sure where it will be installed, so that you can set up the file location the CLI said was missing in step 3 (it should default to Program Files\MIT). I unchecked the boxes for auto-start and tickets, but I'm not sure what your preferences may be. After all that, a computer restart is required.
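If you don't actually need GSS-API / Kerberos authentication at all, another option is to avoid the GSS key exchange entirely. A minimal sketch of the server-side setup without GSS (not the full demo; the key filename is an assumption based on the demos folder):
import socket
import paramiko

# Load a host key; the paramiko demos folder ships test_rsa.key next to demo_server.py
host_key = paramiko.RSAKey(filename="test_rsa.key")

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", 2200))
sock.listen(1)
client, addr = sock.accept()

t = paramiko.Transport(client)  # gss_kex defaults to False, so no GSS module is imported
t.add_server_key(host_key)
# ...continue with a paramiko.ServerInterface implementation, as in demo_server.py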
I have a problem using the rqt_image_view package in ROS. Each time I type rqt_image_view or rosrun rqt_image_view rqt_image_view in the terminal, it returns:
Traceback (most recent call last):
File "/opt/ros/kinetic/bin/rqt_image_view", line 16, in
plugin_argument_provider=add_arguments))
File "/opt/ros/kinetic/lib/python2.7/dist-packages/rqt_gui/main.py", line 59, in main
return super(Main, self).main(argv, standalone=standalone, plugin_argument_provider=plugin_argument_provider, plugin_manager_settings_prefix=str(hash(os.environ['ROS_PACKAGE_PATH'])))
File "/opt/ros/kinetic/lib/python2.7/dist-packages/qt_gui/main.py", line 338, in main
from python_qt_binding import QT_BINDING
ImportError: cannot import name QT_BINDING
In the ~/.bashrc file, I have sourced:
source /opt/ros/kinetic/setup.bash
source /home/kelu/Dropbox/GET_Lab/leap_ws/devel/setup.bash --extend
source /eda/gazebo/setup.bash --extend
These are the default ROS path, my own workspace, and our university's robot simulator. I need to use all of them, and I have already finished many projects with this environment variable setting. However, when I tried to use the rqt_image_view package today, it returned the above error info.
When I run echo $ROS_PACKAGE_PATH, I get the return:
/eda/gazebo/ros/kinetic/share:/home/kelu/Dropbox/GET_Lab/leap_ws/src:/opt/ros/kinetic/share
And echo $PATH
/usr/local/cuda/bin:/opt/ros/kinetic/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/home/kelu/bin:/home/kelu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
When I source only /opt/ros/kinetic/setup.bash, the rqt_image_view package runs!
It seems that if I want to use rqt_image_view, I cannot source both /opt/ros/kinetic/setup.bash and /home/kelu/Dropbox/GET_Lab/leap_ws/devel/setup.bash at the same time.
Could someone tell me how to fix this problem? I have already searched Google for 5 hours and haven't found a solution.
Different solutions to try:
It sounds like the first path /eda/gazebo/ros/kinetic/share or /home/kelu/Dropbox/GET_Lab/leap_ws/src has an rqt_image_view package that is being used. Try to remove that dependency.
Have you tried switching the order in which the setup files are sourced? This depends on how the rqt_image_view package was built, e.g. from source or through a package manager.
Initially, it sounds like there is a problem with the paths being searched or the wrong package being run, since the package works with the default ROS environment setup.
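To narrow down which overlay is shadowing the Qt binding, a quick diagnostic from a Python shell might look like the sketch below; run it once per combination of sourced setup files (QT_BINDING_VERSION is available in recent python_qt_binding releases, so treat that name as an assumption):
# Which python_qt_binding gets imported, and can it locate a Qt binding?
import python_qt_binding
print(python_qt_binding.__file__)
from python_qt_binding import QT_BINDING, QT_BINDING_VERSION
print(QT_BINDING, QT_BINDING_VERSION)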
Launching pyspark in client mode with bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell goes fine, but it fails in the KMeans step. My feeling is that somehow the executors do not have numpy installed. I didn't find any good solution anywhere for making the workers aware of numpy. I tried setting PYSPARK_PYTHON, but that didn't work either.
import numpy
features = numpy.load(open("combined_features.npz"))
features = features['arr_0']
features.shape
features_rdd = sc.parallelize(features, 5000)
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
clusters = KMeans.train(features_rdd, 2, maxIterations=10, runs=10, initializationMode="random")
Stack trace
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
To use Spark in Yarn client mode, you'll need to install any dependencies to the machines on which Yarn starts the executors. That's the only surefire way to make this work.
Using Spark with Yarn cluster mode is a different story. You can distribute python dependencies with spark-submit.
spark-submit --master yarn-cluster my_script.py --py-files my_dependency.zip
However, the situation with numpy is complicated by the same thing that makes it so fast: the fact that it does the heavy lifting in C. Because of the way it is installed (with compiled, platform-specific extensions), you won't be able to distribute numpy in this fashion.
numpy is not installed on the worker (virtual) machines. If you use Anaconda, it's very convenient to upload such Python dependencies when deploying the application in cluster mode, so there is no need to install numpy or other modules on each machine; instead, they must be in your Anaconda environment.
First, zip your Anaconda installation and put the zip file on the cluster (e.g. HDFS); then you can submit a job using a script like the following.
spark-submit \
--master yarn \
--deploy-mode cluster \
--archives hdfs://host/path/to/anaconda.zip#python-env \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
app_main.py
Yarn will copy anaconda.zip from the HDFS path to each worker and use python-env/anaconda/bin/python to execute tasks.
See Running PySpark with Virtualenv for more information.
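Once the job starts, a small check inside the application can verify that the executors really run the shipped interpreter and can import numpy. This is only an illustrative sketch (the app name and partition count are arbitrary, and app_main.py is the script name from the submit command above):
# Sketch for app_main.py: probe the interpreter and numpy version on the executors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()
sc = spark.sparkContext

def probe(_):
    import sys
    import numpy
    return (sys.executable, numpy.__version__)

print(sc.parallelize(range(2), 2).map(probe).collect())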
What solved it for me (on Mac) was actually this guide, which also explains how to run Python through Jupyter Notebooks:
https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735
In a nutshell:
(Assuming you installed Spark with brew install apache-spark)
Find the Spark path using brew info apache-spark
Add these lines to your ~/.bash_profile:
# Spark and Python
######
export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.1
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
# For Python 3, you have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
######
You should be able to open Jupyter Notebook simply by calling:
pyspark
And just remember that you don't need to create the Spark context yourself; instead, simply call:
sc = SparkContext.getOrCreate()
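For example, a first notebook cell with this setup might look like the following sketch (assuming the alias above was used to launch Jupyter):
# First cell: reuse the context created by the pyspark launcher
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()
print(sc.master, spark.version)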
For me, the environment variable PYSPARK_PYTHON was not set, so I edited the /etc/environment file and added the Python environment path to the variable.
PYSPARK_PYTHON=/home/venv/python3
Afterwards, no such error.
I had a similar issue, but I don't think you need to set PYSPARK_PYTHON; instead, just install numpy on the worker machines (with apt-get or yum). The error will also tell you on which machine the import was missing.
You have to be aware that you need to have numpy installed on each and every worker, and even on the master itself (depending on your component placement).
Also make sure to launch the pip install numpy command from a root account (sudo does not suffice) after forcing the umask to 022 (umask 022) so that it cascades the rights to the Spark (or Zeppelin) user.
A few things to check:
Install the required packages on the worker nodes with sudo permission so that they are available to all users
If you have multiple versions of the python on the worker nodes, make sure to install packages for python used by Spark (usually set by PYSPARK_PYTHON).
Finally, to pass the custom modules (.py files), use --py-files while starting the session using spark-submit or pyspark
I had the same issue. Try installing numpy with pip3 if you're using Python 3:
pip3 install numpy
I have a project I'm trying to test and run on Jenkins. On my machine it works fine, but when I try to run it in Jenkins, it fails to find a module in the workspace.
In the main workspace directory, I run the command:
python xtests/app_verify_auto.py
And get the error:
+ python /home/tomcat7/.jenkins/jobs/exit103/workspace/xtests/app_verify_auto.py
Traceback (most recent call last):
File "/home/tomcat7/.jenkins/jobs/exit103/workspace/xtests/app_verify_auto.py", line 19, in <module>
import exit103.data.db as db
ImportError: No module named exit103.data.db
Build step 'Execute shell' marked build as failure
Finished: FAILURE
The directory exit103/data exists in the workspace and is a correct path, but python can't seem to find it.
This error exists both with and without virtualenv.
It may be caused by your PATH not being set correctly in the Jenkins environment. In fact, the environments for your default user and the jenkins user are not the same.
You can check what PATH and PYTHONPATH are set to in the jenkins user's environment.
Try running shell commands in Jenkins, such as echo $PATH, to see what they are.
Most of the time, you need to set the PATH yourself.
You may reference this answer.
Jenkins: putting my Python module on the PYTHONPATH
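If you'd rather not rely on Jenkins-side environment settings, one workaround (assuming the exit103 package lives in the workspace root, one level above xtests/) is to have the test script add the workspace root to sys.path itself:
# At the top of xtests/app_verify_auto.py: make the workspace root importable
import os
import sys

WORKSPACE_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if WORKSPACE_ROOT not in sys.path:
    sys.path.insert(0, WORKSPACE_ROOT)

import exit103.data.db as db  # now resolves regardless of the working directory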
I faced the same issue.
For others who are reading this: run the build on your master node. It fixed the problem for me.
Running the build on a slave node doesn't give the workspace proper access to all the Python modules and other commands such as jq.