I am hitting a library error when running pyspark (from an IPython notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs.
On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:
keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key:Statistics.chiSqTest(value).pValue for key,value in keys_to_bucketed.iteritems()}
but if I do the same directly on the RDD I hit issues
keys_to_chi = vectors.mapValues(lambda vector: Statistics.chiSqTest(vector))
keys_to_chi.collectAsMap()
results in the following exception
Traceback (most recent call last):
File "<ipython-input-80-c2f7ee546f93>", line 3, in chi_sq
File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 238, in chiSqTest
jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/common.py", line 127, in callMLlibFunc
api = getattr(sc._jvm.PythonMLLibAPI(), name)
AttributeError: 'NoneType' object has no attribute '_jvm'
I had an issue early on in my Spark install with numpy not being found, since macOS has two Python installs (one from brew and one from the OS), but I thought I had resolved that. What's odd here is that this is one of the Python libs that ships with the Spark install (my previous issue had been with numpy).
Install Details
Mac OS X Yosemite
Spark spark-1.4.0-bin-hadoop2.6
python is specified via spark-env.sh as
PYSPARK_PYTHON=/usr/bin/python
PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH:$EA_HOME/omnicat/src/main/python:$SPARK_HOME/python/
alias ipython-spark-notebook="IPYTHON_OPTS=\"notebook\" pyspark"
PYSPARK_SUBMIT_ARGS='--num-executors 2 --executor-memory 4g --executor-cores 2'
declare -x PYSPARK_DRIVER_PYTHON="ipython"
As you've noticed in your comment, sc on the worker nodes is None. The SparkContext is only defined on the driver node.
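If the test has to run on the executors, one workaround is to skip the JVM-backed MLlib call and use a pure-Python implementation inside mapValues instead. A minimal sketch, assuming SciPy is installed on the worker nodes (with a single argument, scipy.stats.chisquare, like Statistics.chiSqTest, tests the observed counts against a uniform expected distribution):
from scipy.stats import chisquare

# chisquare runs entirely in Python on each executor, so no SparkContext/JVM access is needed
keys_to_chi = vectors.mapValues(lambda vector: chisquare(vector).pvalue)
keys_to_chi.collectAsMap()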
Related
I have a Python subprocess to call R:
cmd = ['Rscript', 'Rcode.R', 'file_to_process.txt']
out = subprocess.run(cmd, universal_newlines = True, stdout = subprocess.PIPE)
lines = out.stdout.splitlines() #split stdout
My R code first checks if the 'ape' package is installed before proceeding:
if (!require("ape")) install.packages("ape")
library(ape)
do_R_stuff.......
return_output_to_Python
Previously, the whole process from Python to R worked perfectly: R was called and the processed output was returned to Python. It broke when I added the first line (if (!require("ape")) install.packages("ape")). Now Python reports: "there is no package called 'ape'" (i.e. when I uninstall ape in R). I have tried wait instructions in both the R and Python scripts, but I can't get it working. The R code works fine in isolation.
The full error output from Python is:
Traceback (most recent call last):
File ~\Documents\GitHub\wolPredictor\wolPredictor_MANUAL_parallel.py:347 in <module>
if __name__ == '__main__': main()
File ~\Documents\GitHub\wolPredictor\wolPredictor_MANUAL_parallel.py:127 in main
cophen, purge, pge_incr, z, _ = R_cophen('{}/{}'.format(dat_dir, tree), path2script) #get Dist Mat from phylogeny in R
File ~\Documents\GitHub\wolPredictor\wolPredictor_MANUAL_parallel.py:214 in R_cophen
purge = int(np.max(cophen) * 100) + 1 #max tree pw distance
File <__array_function__ internals>:5 in amax
File ~\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2754 in amax
return _wrapreduction(a, np.maximum, 'max', axis, None, out,
File ~\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:86 in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
Loading required package: ape
Installing package into 'C:/Users/Windows/Documents/R/win-library/4.1'
(as 'lib' is unspecified)
Error in contrib.url(repos, "source") :
trying to use CRAN without setting a mirror
Calls: install.packages -> contrib.url
In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called 'ape'
Execution halted
Point 1: The path to R libraries when using R in standalone mode may not be the same as when using Rscript.
Point 2: The error says there was difficulty in finding the CRAN repository, so perhaps the options that set the repos were not set for the Rscript environment. They can be set in the call to install.packages() or with an options(repos = ...) call.
The OP wrote: "Thanks @IRTFM, I had to set up a new subprocess with a new .R file specifically for the install, but the CRAN mirror was key. I never realised it would be an issue as it's not a requirement on my local machine (not sure why it becomes an issue through subprocess)."
The places to find more information are the ?Startup help page and the ?Rscript page. Rscript has many fewer defaults; even the usual set of recommended packages may not get loaded by default. The Rscript help page includes these flags, which could be used for debugging and for setting a proper path to the desired libraries (a sketch of passing them from Python follows the list):
--verbose
gives details of what Rscript is doing.
--default-packages=list
where list is a comma-separated list of package names or NULL. Sets the environment variable R_DEFAULT_PACKAGES which determines the packages loaded on startup.
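For example, those flags could be passed from the Python side when building the subprocess command. This is just a sketch of the original call with illustrative flag values added (the package list shown is hypothetical):
import subprocess

# --verbose reports what Rscript is doing; --default-packages makes the startup package set explicit
cmd = ['Rscript', '--verbose', '--default-packages=methods,utils,stats',
       'Rcode.R', 'file_to_process.txt']
out = subprocess.run(cmd, universal_newlines=True,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(out.stderr)  # Rscript's diagnostics (and R's messages/warnings) are written to stderr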
Here is a previous similar SO question with an answer that includes some options for construction of a proper working environment: Rscript: There is no package called ...?
There are a few R packages on CRAN that aid in overcoming some of the differences between programming for standalone R and Rscript.
getopt: https://cran.r-project.org/web/packages/getopt/index.html
optparse: https://cran.r-project.org/web/packages/optparse/index.html (styled after a similar Python package.)
argparse: https://cran.r-project.org/web/packages/argparse/index.html
I solved the issue (thanks to @IRTFM) by placing the if-not-installed check and install.packages() call in a separate Rscript (including the CRAN mirror):
if (!require("ape")) install.packages("ape", repos='http://cran.us.r-project.org')
which I then called using a separate Python subprocess in my Python routine.
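For completeness, the calling side could look roughly like the sketch below; install_ape.R is a hypothetical file name for a script containing only the one-line install shown above:
import subprocess

# Hypothetical one-off install step: install_ape.R holds the if (!require(...)) install.packages(...) line
subprocess.run(['Rscript', 'install_ape.R'], universal_newlines=True)

# Then run the main analysis script as before
cmd = ['Rscript', 'Rcode.R', 'file_to_process.txt']
out = subprocess.run(cmd, universal_newlines=True, stdout=subprocess.PIPE)
lines = out.stdout.splitlines()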
I would like to reproduce the following tutorial, but when I try to get the robot references there is always this error:
"AttributeError: 'CartpoleRobot' object has no attribute 'getSelf'"
I rebuilt this tutorial: https://github.com/aidudezzz/deepbots-tutorials/blob/master/robotSupervisorSchemeTutorial/README.md
In other controllers I get similar error messages when I try to get the robot references. I think the error is in the communication between the robot simulation and the controller.
I have tried importing the supervisor and getting the functions via supervisor.get, but then another error comes up: "Only one instance of the Robot class should be created".
However, I am new to webots and robotics/informatics in general. Any help would be greatly appreciated!
The whole error with traceback:
INFO: robotSupervisorController: Starting controller: python.exe -u robotSupervisorController.py
Traceback (most recent call last):
File "D:\Webots Projekte\controllers\robotSupervisorController\robotSupervisorController.py", line 88, in <module>
env = CartpoleRobot()
File "D:\Webots Projekte\controllers\robotSupervisorController\robotSupervisorController.py", line 16, in __init__
self.robot = self.getSelf() # Grab the robot reference from the supervisor to access various robot methods
AttributeError: 'CartpoleRobot' object has no attribute 'getSelf'
WARNING: 'robotSupervisorController' Controller exited with status: 1
The code is the same as shown in the tutorial.
The short section that generates the error:
self.robot = self.getSelf() # Grab the robot reference from the supervisor to access various robot methods
self.positionSensor = self.getDevice("polePosSensor")
self.positionSensor.enable(self.timestep)
If I comment out the first line, the next line returns a similar error.
Every answer is highly appreciated! Thanks!
I had the same issue. I use deepbots 0.1.2 and Webots R2021a.
Try uninstalling deepbots and then reinstalling it using:
pip install -i https://test.pypi.org/simple/ deepbots
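In full, and assuming pip points at the Python environment Webots is configured to use, the reinstall would be something like:
pip uninstall deepbots
pip install -i https://test.pypi.org/simple/ deepbots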
I am trying to run a simple benchmark script from the transformers library from Hugging Face, but it fails due to a CUDA error, which leads to another error:
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Traceback (most recent call last):
File "/home/cbarkhof/code-thesis/Experimentation/Benchmarking/benchmarking-models-claartje.py", line 12, in <module>
benchmark.run()
File "/home/cbarkhof/.local/lib/python3.6/site-packages/transformers/benchmark/benchmark_utils.py", line 674, in run
memory, inference_summary = self.inference_memory(model_name, batch_size, sequence_length)
ValueError: too many values to unpack (expected 2)
The script simply follows the example shown on this page:
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments
benchmark_args = PyTorchBenchmarkArguments(models=["bert-base-uncased"],
                                           batch_sizes=[8],
                                           sequence_lengths=[8, 32, 128, 512],
                                           save_to_csv=True,
                                           log_filename='log',
                                           env_info_csv_file='env_info')
benchmark = PyTorchBenchmark(benchmark_args)
benchmark.run()
If anyone can point me to why this might be happening, please let me know :). Cheers!
This error is due to the CUDA runtime not supporting process forking, which is the default multiprocessing start method in PyTorch. The suggested workaround is to call torch.multiprocessing.set_start_method('spawn') before importing transformers, but in my experiments this raises an AttributeError: Can't pickle local object error.
If you're happy with the script running as a single process when using GPUs, you can pass the multi_process=False arg, or the --no_multi_process flag if using the command line script. Alternatively, if running on CPU only without CUDA, you should be able to make use of multiple cores, and passing the cuda=False arg or --no_cuda flag should also work.
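For instance, the script from the question with multiprocessing disabled (only multi_process=False is added; everything else is unchanged):
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

benchmark_args = PyTorchBenchmarkArguments(models=["bert-base-uncased"],
                                           batch_sizes=[8],
                                           sequence_lengths=[8, 32, 128, 512],
                                           multi_process=False,  # single process, so CUDA is never re-initialized in a fork
                                           save_to_csv=True,
                                           log_filename='log',
                                           env_info_csv_file='env_info')
benchmark = PyTorchBenchmark(benchmark_args)
benchmark.run()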
I have followed the steps to set up pyspark in intellij from this question:
Write and run pyspark in IntelliJ IDEA
Here is the simple code I attempted to run:
#!/usr/bin/env python
from pyspark import *
def p(msg): print("%s\n" %repr(msg))
import numpy as np
a = np.array([[1,2,3], [4,5,6]])
p(a)
import os
sc = SparkContext("local","ptest",conf=SparkConf().setAppName("x"))
ardd = sc.parallelize(a)
p(ardd.collect())
Here is the result of submitting the code
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output
Traceback (most recent call last):
File "/git/misc/python/ptest.py", line 14, in <module>
sc = SparkContext("local","ptest",SparkConf().setAppName("x"))
File "/shared/spark16/python/pyspark/conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "/shared/spark16/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/shared/spark16/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
However, I really do not understand how this could be expected to work: in order to run on Spark, the code needs to be bundled up and submitted via spark-submit.
So I doubt that the other question actually addressed submitting pyspark code through IntelliJ to Spark.
Is there a way to submit pyspark code to Spark from IntelliJ? It would effectively be
spark-submit myPysparkCode.py
The pyspark executable itself has been deprecated since Spark 1.0. Does anyone have this working?
In my case the variable settings from the other Q&A Write and run pyspark in IntelliJ IDEA covered most but not all of the required settings. I tried them many times.
Only after adding
PYSPARK_SUBMIT_ARGS = pyspark-shell
to the run configuration did pyspark finally quiet down and succeed.
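If you prefer to keep everything in the script, the same variable can be set from Python before the SparkContext is created. This is only a sketch of the idea, not something I have verified inside IntelliJ:
import os
# launch_gateway() reads PYSPARK_SUBMIT_ARGS when the SparkContext is created,
# so setting it here has the same effect as setting it in the run configuration
os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"

from pyspark import SparkContext, SparkConf
sc = SparkContext("local", "ptest", conf=SparkConf().setAppName("x"))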
I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully; I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here.
I'm working on Ubuntu and the various component versions I have are
Spark spark-1.5.1-bin-hadoop2.6
Hadoop hadoop-2.6.1
Mongo 2.6.10
Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
Python 2.7.10
I had some difficulty following the various steps, such as which jars to add to which path, so here is what I have added:
in /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
and the following environment variables:
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
My Python program is basic
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')

if __name__ == '__main__':
    main()
I am running it using the command
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I am getting the following output as a result
Traceback (most recent call last):
File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
main()
File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
return self.mongoPairRDD(connection_string, config).values()
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
_ensure_pickles(self)
File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
orig_tb)
py4j.protocol.Py4JError
According to here
This exception is raised when an exception occurs in the Java client
code. For example, if you try to pop an element from an empty stack.
The instance of the Java exception thrown is stored in the
java_exception member.
Looking at the source code for pymongo_spark.py and the line throwing the error, it says
"Error while communicating with the JVM. Is the MongoDB Spark jar on
Spark's CLASSPATH? : "
So in response, I have tried to make sure the right jars are being passed, but I might be doing this all wrong; see below:
$SPARK_HOME/bin/spark-submit --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --master local[4] ~/sparkPythonExample/SparkPythonExample.py
I have imported pymongo to the same python program to verify that I can at least access MongoDB using that, and I can.
I know there are quite a few moving parts here so if I can provide any more useful information please let me know.
Updates:
2016-07-04
Since the last update, the MongoDB Spark Connector has matured quite a lot. It provides up-to-date binaries and a data-source-based API, but it uses SparkConf-based configuration, so it is subjectively less flexible than Stratio/Spark-MongoDB.
2016-03-30
Since the original answer I found two different ways to connect to MongoDB from Spark:
mongodb/mongo-spark
Stratio/Spark-MongoDB
While the former seems to be relatively immature, the latter looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
      .format("com.stratio.datasource.mongodb")
      .options(host="mongo:27017", database="foo", collection="bar")
      .load())
df.show()
## +---+----+--------------------+
## | x| y| _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
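For example, a filter applied to the DataFrame above should be handled by the data source rather than evaluated after loading; a quick sketch using the sample data shown earlier:
# The predicate is pushed down to MongoDB by the data source
df.filter(df.y > 0).show()
## keeps only the {"x": 0.0, "y": 4.0} document from the sample data above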
The original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little more manageable by building a simple Docker image which roughly matches the configuration described (I've omitted the Hadoop libraries for brevity, though). You can find the complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:
git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub (so you can simply docker pull zero323/mongo-spark):
Start images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
Start PySpark shell passing --jars and --driver-class-path:
pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
And finally see how it works:
import pymongo
import pymongo_spark
mongo_url = 'mongodb://mongo:27017/'
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
    {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()
pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
       .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()
## [(1.0, -1.0), (0.0, 4.0)]
Please note that mongo-hadoop seems to close the connection after the first action. So calling for example rdd.count() after the collect will throw an exception.
Based on different problems I've encountered creating this image I tend to believe that passing mongo-hadoop-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
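In other words, a submit command along these lines (the jar paths are illustrative and depend on where mongo-hadoop was built):
MONGO_LIBS=/usr/local/share/mongo-hadoop/spark/build/libs
$SPARK_HOME/bin/spark-submit \
  --jars $MONGO_LIBS/mongo-hadoop-1.5.0-SNAPSHOT.jar,$MONGO_LIBS/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar \
  --driver-class-path $MONGO_LIBS/mongo-hadoop-1.5.0-SNAPSHOT.jar:$MONGO_LIBS/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar \
  --master local[4] ~/sparkPythonExample/SparkPythonExample.py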
Notes:
This image is loosely based on jaceklaskowski/docker-spark
so please be sure to send some good karma to @jacek-laskowski if it helps.
If you don't require a development version including the new API, then using --packages is most likely a better option.
Can you try using the --packages option instead of --jars ... in your spark-submit command:
spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
Some of these jar files are not uber jars and need more dependencies to be downloaded before they can work.
I was having this same problem yesterday. I was able to fix it by placing mongo-java-driver.jar in $HADOOP_HOME/lib, and mongo-hadoop-core.jar and mongo-hadoop-spark.jar in $HADOOP_HOME/spark/classpath/emr (or any other folder that is on the $SPARK_CLASSPATH).
Let me know if that helps.
Good luck!
# see https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
from pyspark import SparkContext, SparkConf
import pymongo_spark

# Important: activate pymongo_spark.
pymongo_spark.activate()

def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)

    # Create an RDD backed by the MongoDB collection.
    # This RDD *does not* contain key/value pairs, just documents.
    # If you want key/value pairs, use the mongoPairRDD method instead.
    rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')

    # Save this RDD back to MongoDB as a different collection.
    rdd.saveToMongoDB('mongodb://localhost:27017/db.other.collection')

    # You can also read and write BSON:
    bson_rdd = sc.BSONFileRDD('/path/to/file.bson')
    bson_rdd.saveToBSON('/path/to/bson/output')

if __name__ == '__main__':
    main()