I have followed the steps to set up pyspark in intellij from this question:
Write and run pyspark in IntelliJ IDEA
Here is the simple code I attempted to run:
#!/usr/bin/env python
from pyspark import *
def p(msg): print("%s\n" %repr(msg))
import numpy as np
a = np.array([[1,2,3], [4,5,6]])
p(a)
import os
sc = SparkContext("local","ptest",conf=SparkConf().setAppName("x"))
ardd = sc.parallelize(a)
p(ardd.collect())
Here is the result of submitting the code:
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output
Traceback (most recent call last):
  File "/git/misc/python/ptest.py", line 14, in <module>
    sc = SparkContext("local","ptest",SparkConf().setAppName("x"))
  File "/shared/spark16/python/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/shared/spark16/python/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/shared/spark16/python/pyspark/java_gateway.py", line 94, in launch_gateway
    raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
However, I really do not understand how this could be expected to work: in order to run on Spark, the code needs to be bundled up and submitted via spark-submit.
So I doubt that the other question truly addressed submitting pyspark code through IntelliJ to Spark.
Is there a way to submit pyspark code to Spark from IntelliJ? It would effectively be:
spark-submit myPysparkCode.py
The pyspark executable itself has been deprecated since Spark 1.0. Does anyone have this working?
In my case, the variable settings from the other Q&A Write and run pyspark in IntelliJ IDEA covered most, but not all, of the required settings. I tried them many times.
Only after adding:
PYSPARK_SUBMIT_ARGS = pyspark-shell
to the run configuration did pyspark finally quiet down and succeed.
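If editing the run configuration is inconvenient, a minimal sketch of an equivalent workaround (my own variant, not from the linked Q&A) is to set the variable from the script itself before the SparkContext is created:

import os

# Same effect as adding PYSPARK_SUBMIT_ARGS to the run configuration:
# launch a plain pyspark-shell gateway instead of expecting a primary
# resource to be passed via spark-submit.
os.environ.setdefault("PYSPARK_SUBMIT_ARGS", "pyspark-shell")

from pyspark import SparkContext, SparkConf

sc = SparkContext("local", "ptest", conf=SparkConf().setAppName("x"))
print(sc.parallelize([[1, 2, 3], [4, 5, 6]]).collect())
sc.stop()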
I'm developing Python scripts for the automated grading of assignments using CanvasAPI, a Python API wrapper for the Canvas learning management platform. Studying the documentation, I can successfully translate curl commands into Python calls for a few parameters. For example, the conversion below is for grading a single submission:
Curl command per the Canvas API docs:
PUT /api/v1/courses/:course_id/assignments/:assignment_id/submissions/:user_id
with
submission[posted_grade]
Turns into this via the CanvasAPI Python wrapper:
edit(submission={'posted_grade': 'grade'})
Where I'm running into difficulties is the more complex parameter for rubrics. Using the same PUT request as above, the syntax in the documentation is as follows:
rubric_assessment[criterion_id][points]
For which I have:
edit(rubric_assessment[{'id': 'criterion_9980'},{'points', '37'}])
However, I get the following error:
Traceback (most recent call last):
  File "C:\Users\danie\AppData\Local\Temp\atom_script_tempfiles\2021528-29488-1eagfyw.k8hw", line 39, in <module>
    submission = assignment.get_submission(10370)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\canvasapi\assignment.py", line 203, in get_submission
    response = self._requester.request(
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\canvasapi\requester.py", line 255, in request
    raise ResourceDoesNotExist("Not Found")
canvasapi.exceptions.ResourceDoesNotExist: Not Found
I suspect I'm fouling up the syntax somewhere along the line. Any suggestions? All help much appreciated.
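For context, here is the full call sequence I am using (the course, assignment and user ids are placeholders). The last line shows a nested-dict form I have also tried for rubric_assessment, though I am not sure that is the syntax canvasapi expects:

from canvasapi import Canvas

API_URL = "https://example.instructure.com"    # placeholder Canvas instance
API_KEY = "my_access_token"                    # placeholder token

canvas = Canvas(API_URL, API_KEY)
course = canvas.get_course(12345)              # placeholder course id
assignment = course.get_assignment(67890)      # placeholder assignment id
submission = assignment.get_submission(10370)  # the user id from my traceback

# Guess: express the bracketed parameter as a nested dict, hoping the wrapper
# serializes it back to rubric_assessment[criterion_9980][points]=37.
submission.edit(rubric_assessment={'criterion_9980': {'points': 37}})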
I am trying to automate the deployment of an application that is mapped to two clusters in a cell, using wsadmin scripting. But no matter what I try, the application gets mapped to only one cluster, and as a result the application does not start at all.
I am getting the following error message:
Application helloteam_07062019_1956 is not deployed on the cluster SPPAbcd
Exception: exceptions.AttributeError WASL6048E: The helloteam_07062019_1956 application is not deployed on the SPPAbcd target.
WASX7017E: Exception received while running file "/app/was_scripts/main_scripts/deploy_mutlitest.py"; exception information: com.ibm.bsf.BSFException: exception from Jython:
Traceback (innermost last):
  File "<string>", line 175, in ?
  File "/app/service/IBM/WebSphere/AppServer/scriptLibraries/application/V70/AdminApplication.py", line 4665, in startApplicationOnCluster
ScriptLibraryException: : 'exceptions.AttributeError WASL6048E: The helloteam_07062019_1956 application is not deployed on the SPPAbcd target. '
It is clear from the error message that the app is mapped only to the SRVApp cluster, not to the SPPAbcd cluster. As a result, the script is unable to start the app there.
Here is the script:
targetServerOne = "WebSphere:cell=DIGIAPP1Cell02,cluster=SPPAbcd"
targetServerTwo = "WebSphere:cell=DIGIAPP1Cell02,cluster=SRVApp"
AdminApp.install(location, ['-appname',"hellotest",'-defaultbinding.virtual.host',virtualHost,'-usedefaultbindings','-contextroot',ctxRoot,'-MapModulesToServers',[["EchoApp",URI,targetServerOne],["EchoApp",URI,targetServerTwo]]])
AdminConfig.save()
cell=AdminConfig.list('Cell')
cellName=AdminConfig.showAttribute(cell, 'name')
clusters=AdminConfig.list('ServerCluster').split('\n')
print("The clusters in "+cellName+" are...")
print(clusters)
for name in startClusters:
    startapp = AdminApplication.startApplicationOnCluster(newWar, name)
    print(startapp)
As mentioned above, no matter what I try, the app only gets mapped to the SRVApp cluster (verified in the app's Manage modules section in the DMGR console); it never gets mapped to the SPPAbcd cluster.
How can I achieve proper module mapping to multiple clusters? The module mapping is done in the AdminApp.install command above; is that the correct way to map modules?
To solve this issue, I leveraged the EnvInject plugin of Jenkins to inject properties at build time.
Instead of having two target servers (targetServerOne and targetServerTwo), I used a single target string and read it from a properties file.
This is my properties file:
moduleMapping=WebSphere:cell=cell1,cluster=cluster1+WebSphere:cell=cell1,cluster=cluster2
My script has been modified like below:
from os import getenv as env
targetServer = env('moduleMapping')
AdminApp.install(filename, ['-MapModulesToServers', [['moduleName', 'uri', targetServer]]])
This has mapped my app to two clusters within a cell.
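For reference, the same mapping can also be expressed directly in the wsadmin Jython script, without the Jenkins property, by joining the two cluster targets with + in a single target string. This is only a sketch; location, URI and the module name are placeholders from my original script, and AdminApp/AdminConfig are provided by the wsadmin environment:

# Two cluster targets joined with '+' map the module to both clusters.
targetServers = ("WebSphere:cell=DIGIAPP1Cell02,cluster=SPPAbcd"
                 "+WebSphere:cell=DIGIAPP1Cell02,cluster=SRVApp")

AdminApp.install(location, ['-appname', 'hellotest',
                            '-usedefaultbindings',
                            '-MapModulesToServers',
                            [['EchoApp', URI, targetServers]]])
AdminConfig.save()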
I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully; I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here.
I'm working on Ubuntu, and the various component versions I have are:
Spark spark-1.5.1-bin-hadoop2.6
Hadoop hadoop-2.6.1
Mongo 2.6.10
Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
Python 2.7.10
I had some difficulty following the various steps, such as which jars to add to which path, so here is what I have added:
In /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar, along with the following environment variables:
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
My Python program is basic:
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()


def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')


if __name__ == '__main__':
    main()
I am running it using the command:
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I am getting the following output as a result:
Traceback (most recent call last):
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
    main()
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
    rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
    return self.mongoPairRDD(connection_string, config).values()
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
    _ensure_pickles(self)
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
    orig_tb)
py4j.protocol.Py4JError
According to here
This exception is raised when an exception occurs in the Java client
code. For example, if you try to pop an element from an empty stack.
The instance of the Java exception thrown is stored in the
java_exception member.
Looking at the source code for pymongo_spark.py and the line throwing the error, it says
"Error while communicating with the JVM. Is the MongoDB Spark jar on
Spark's CLASSPATH? : "
So in response, I have tried to make sure the right jars are being passed, but I might be doing this all wrong; see below:
$SPARK_HOME/bin/spark-submit --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --master local[4] ~/sparkPythonExample/SparkPythonExample.py
I have imported pymongo into the same Python program to verify that I can at least access MongoDB that way, and I can.
I know there are quite a few moving parts here so if I can provide any more useful information please let me know.
Updates:
2016-07-04
Since the last update the MongoDB Spark Connector has matured quite a lot. It provides up-to-date binaries and a data-source-based API, but it uses SparkConf-based configuration, so it is subjectively less flexible than Stratio/Spark-MongoDB.
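For reference, usage with the official connector looks roughly like this. Treat it as a sketch: the package coordinates and the spark.mongodb.input.uri property are the ones documented for the 1.x connector at the time, so double-check them against the version you install.

# Hypothetical launch, setting the input URI through SparkConf as the
# connector expects:
# bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.10:1.0.0 \
#             --conf spark.mongodb.input.uri=mongodb://mongo:27017/foo.bar

df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.show()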
2016-03-30
Since the original answer, I found two different ways to connect to MongoDB from Spark:
mongodb/mongo-spark
Stratio/Spark-MongoDB
While the former seems to be relatively immature, the latter looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
    .format("com.stratio.datasource.mongodb")
    .options(host="mongo:27017", database="foo", collection="bar")
    .load())
df.show()
## +---+----+--------------------+
## | x| y| _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
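As a rough illustration of the pushdown claim (reusing the df built above; whether a given filter is actually pushed down depends on the connector version, so treat this as a sketch):

# A simple comparison filter is a candidate for pushdown, i.e. it can be
# translated into a MongoDB query instead of a full scan followed by a
# Spark-side filter.
df.filter(df.x > 0.5).select("x", "y").show()

## +---+----+
## |  x|   y|
## +---+----+
## |1.0|-1.0|
## +---+----+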
The original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little bit more manageable by building a simple Docker image which roughly matches the described configuration (though I've omitted the Hadoop libraries for brevity). You can find the complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:
git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub, so you can simply run docker pull zero323/mongo-spark:
Start images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
Start a PySpark shell, passing --jars and --driver-class-path:
pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
And finally see how it works:
import pymongo
import pymongo_spark
mongo_url = 'mongodb://mongo:27017/'
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
    {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()
pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
    .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()
## [(1.0, -1.0), (0.0, 4.0)]
Please note that mongo-hadoop seems to close the connection after the first action, so calling, for example, rdd.count() after the collect will throw an exception.
Based on different problems I've encountered creating this image I tend to believe that passing mongo-hadoop-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
Notes:
This image is loosely based on jaceklaskowski/docker-spark
so please be sure to send some good karma to @jacek-laskowski if it helps.
If you don't require a development version including the new API, then using --packages is most likely a better option.
Can you try using the --packages option instead of --jars ... in your spark-submit command:
spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
Some of these jar files are not uber jars and need more dependencies to be downloaded before they can work.
I was having this same problem yesterday. I was able to fix it by placing mongo-java-driver.jar in $HADOOP_HOME/lib, and mongo-hadoop-core.jar and mongo-hadoop-spark.jar in $HADOOP_HOME/spark/classpath/emr (or any other folder that is on the $SPARK_CLASSPATH).
Let me know if that helps.
Good Luck!
# see https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
from pyspark import SparkContext, SparkConf
import pymongo_spark

# Important: activate pymongo_spark.
pymongo_spark.activate()


def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)

    # Create an RDD backed by the MongoDB collection.
    # This RDD *does not* contain key/value pairs, just documents.
    # If you want key/value pairs, use the mongoPairRDD method instead.
    rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')

    # Save this RDD back to MongoDB as a different collection.
    rdd.saveToMongoDB('mongodb://localhost:27017/db.other.collection')

    # You can also read and write BSON:
    bson_rdd = sc.BSONFileRDD('/path/to/file.bson')
    bson_rdd.saveToBSON('/path/to/bson/output')


if __name__ == '__main__':
    main()
I am hitting a library error when running pyspark (from an IPython notebook). I want to use Statistics.chiSqTest(obs) from pyspark.mllib.stat in a .mapValues operation on my RDD containing (key, list(int)) pairs.
On the master node, if I collect the RDD as a map and iterate over the values like so, I have no problems:
from pyspark.mllib.stat import Statistics

keys_to_bucketed = vectors.collectAsMap()
keys_to_chi = {key: Statistics.chiSqTest(value).pValue for key, value in keys_to_bucketed.iteritems()}
but if I do the same directly on the RDD I hit issues:
keys_to_chi = vectors.mapValues(lambda vector: Statistics.chiSqTest(vector))
keys_to_chi.collectAsMap()
This results in the following exception:
Traceback (most recent call last):
  File "<ipython-input-80-c2f7ee546f93>", line 3, in chi_sq
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/stat/_statistics.py", line 238, in chiSqTest
    jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
  File "/Users/atbrew/Development/Spark/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/common.py", line 127, in callMLlibFunc
    api = getattr(sc._jvm.PythonMLLibAPI(), name)
AttributeError: 'NoneType' object has no attribute '_jvm'
I had an issue early on in my Spark install with it not seeing numpy, Mac OS X having two Python installs (one from brew and one from the OS), but I thought I had resolved that. What's odd here is that this is one of the Python libs that ships with the Spark install (my previous issue had been with numpy).
Install Details
Mac OS X Yosemite
Spark spark-1.4.0-bin-hadoop2.6
Python is specified via spark-env.sh as:
PYSPARK_PYTHON=/usr/bin/python
PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH:$EA_HOME/omnicat/src/main/python:$SPARK_HOME/python/
alias ipython-spark-notebook="IPYTHON_OPTS=\"notebook\" pyspark"
PYSPARK_SUBMIT_ARGS='--num-executors 2 --executor-memory 4g --executor-cores 2'
declare -x PYSPARK_DRIVER_PYTHON="ipython"
As you've noticed in your comment, sc on the worker nodes is None. The SparkContext is only defined on the driver node.
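Since Statistics.chiSqTest needs the JVM gateway that only exists on the driver, one possible workaround (a sketch, assuming SciPy is installed on the executors and that a goodness-of-fit test against a uniform expected distribution is what you want) is to run a pure-Python chi-square test inside mapValues instead:

from scipy import stats

def chi_sq_pvalue(counts):
    # scipy.stats.chisquare runs entirely in Python/NumPy, so it needs no
    # SparkContext and can safely run on the executors.
    return stats.chisquare(counts)[1]  # index 1 is the p-value

# vectors is the (key, list(int)) RDD from the question.
keys_to_chi = vectors.mapValues(chi_sq_pvalue).collectAsMap()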
I followed the how-to documentation to install Trigger, but when I test it in the Python environment, I get the error below:
>>> from trigger.netdevices import NetDevices
>>> nd = NetDevices()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/trigger/netdevices/__init__.py", line 913, in __init__
    with_acls=with_acls)
  File "/usr/local/lib/python2.7/dist-packages/trigger/netdevices/__init__.py", line 767, in __init__
    production_only=production_only, with_acls=with_acls)
  File "/usr/local/lib/python2.7/dist-packages/trigger/netdevices/__init__.py", line 83, in _populate
    # device_data = _munge_source_data(data_source=data_source)
  File "/usr/local/lib/python2.7/dist-packages/trigger/netdevices/__init__.py", line 73, in _munge_source_data
    # return loader.load_metadata(path, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/trigger/netdevices/loader.py", line 163, in load_metadata
    raise RuntimeError('No data loaders succeeded. Tried: %r' % tried)
RuntimeError: No data loaders succeeded. Tried: [<trigger.netdevices.loaders.filesystem.XMLLoader object at 0x7f550a1ed350>, <trigger.netdevices.loaders.filesystem.JSONLoader object at 0x7f550a1ed210>, <trigger.netdevices.loaders.filesystem.SQLiteLoader object at 0x7f550a1ed250>, <trigger.netdevices.loaders.filesystem.CSVLoader object at 0x7f550a1ed290>, <trigger.netdevices.loaders.filesystem.RancidLoader object at 0x7f550a1ed550>]
Does anyone have an idea how to fix it?
The NetDevices constructor is apparently trying to find a "metadata source" that isn't there.
First, you need to define the metadata source. Second, your code should handle the exception raised when none is found.
I'm the lead developer of Trigger. Check out the doc Working with NetDevices. It is probably what you were missing. We've done some work recently to improve the quality of the setup/install docs, and I hope that this is more clear now!
If you want to get started super quickly, you can feed Trigger a CSV-formatted NetDevices file, like so:
test1-abc.net.example.com,juniper
test2-abc.net.example.com,cisco
Just put that in a file, e.g. /tmp/netdevices.csv and then set the NETDEVICES_SOURCE environment variable:
export NETDEVICES_SOURCE=/tmp/netdevices.csv
Then fire up Python, continue with your examples, and you should be good to go!
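For example, after exporting NETDEVICES_SOURCE as above, a quick sanity check in the Python shell might look like this (nd.find and the vendor attribute are part of the NetDevices API, but treat this as a sketch rather than verbatim output):

from trigger.netdevices import NetDevices

nd = NetDevices()                           # loads /tmp/netdevices.csv
dev = nd.find('test1-abc.net.example.com')  # hostname from the CSV above
print(dev.vendor)                           # should report juniper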
I found that the default of /etc/trigger/netdevices.xml wasn't listed in the setup instructions. They did indicate copying a file from the Trigger source folder:
cp conf/netdevices.json /etc/trigger/netdevices.json
But I didn't see how to specify this file instead of the default NETDEVICES_SOURCE on the installation page. As soon as I had a file that NETDEVICES_SOURCE pointed to in my /etc/trigger folder, it worked.
I recommend this to get the verification examples working right away with minimal fuss:
cp conf/netdevices.xml /etc/trigger/netdevices.xml
Using Ubuntu 14.04 with Python 2.7.3