I'm writing an aggregation in PySpark.
For this project I'm also adding tests, where I create a session, put in some data, run my aggregation, and check the results.
The code looks as follows:
def mapper_convert_row(row):
    # ... business-logic code; eventually returns one string value
    return my_str

def run_spark_query(spark: SparkSession, from_dt, to_dt):
    query = get_hive_query_str(from_dt, to_dt)
    df = spark.sql(query).rdd.map(lambda row: Row(mapper_convert_row(row)))
    out_schema = StructType([StructField("data", StringType())])
    df_conv = spark.createDataFrame(df, out_schema)
    df_conv.write.mode('overwrite').format("csv").save(folder)
And here is my test class
class SparkFetchTest(unittest.TestCase):

    @staticmethod
    def getOrCreateSC():
        conf = SparkConf()
        conf.setMaster("local")
        spark = (SparkSession.builder.config(conf=conf).appName("MyPySparkApp")
                 .enableHiveSupport().getOrCreate())
        return spark

    def test_fetch(self):
        dt_from = datetime.strptime("2019-01-01-10-00", '%Y-%m-%d-%H-%M')
        dt_to = datetime.strptime("2019-01-01-10-05", '%Y-%m-%d-%H-%M')
        spark = self.getOrCreateSC()
        self.init_and_populate_table_with_test_data(spark, input_tbl, dt_from, dt_to)
        run_spark_query(spark, dt_from, dt_to)
        # assert on results
I've added the PySpark dependency via a Conda environment and I'm running this code via PyCharm. Just to make it clear: there is no Spark installation on my local machine apart from the PySpark Conda package.
When I set a breakpoint, it stops in the driver code, but it does not stop inside the mapper_convert_row function.
How can I debug this business-logic function in a local test environment?
The same approach works perfectly in Scala, but this code has to be in Python.
PySpark is a conduit to the Spark runtime, which runs on the JVM / is written in Scala. The connection goes through Py4J, which provides a TCP-based socket from the Python executable to the JVM. Unfortunately, that means:
No local debugging
I'm no happier about it than you are. I might just write/maintain a parallel code branch in Scala to figure out things that are tiring to do without the debugger.
Update: PyCharm is able to debug Spark programs. I have been using it nearly daily. See: PyCharm Debugging of PySpark.
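That said, the business logic itself can be exercised without any JVM at all. Since Row behaves like a namedtuple, the mapper can be called in a plain unit test where breakpoints do work. A minimal sketch, assuming hypothetical field names and a stand-in mapper body:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: Row is namedtuple-like, so a plain
# namedtuple exercises the same attribute access without Spark.
FakeRow = namedtuple("FakeRow", ["name", "value"])

def mapper_convert_row(row):
    # hypothetical stand-in for the real business logic
    return "{}-{}".format(row.name, row.value)

# A plain Python call: a debugger breakpoint inside mapper_convert_row stops here.
print(mapper_convert_row(FakeRow(name="a", value=1)))  # a-1
```

This only tests the row-to-string logic, not the Spark plumbing around it, but that is usually where the breakpoints are wanted.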
I am writing unit tests for a Spark application. I am using pytest and have created a fixture to load the Spark session once.
When I run one test at a time it passes, but when I run all the tests together I get unexpected behavior. I then realized that Spark is not multi-threadable. Is there any way to fix this? Is running pytest in non-parallel mode the only solution?
Sample code structure,
@pytest.fixture(scope="session")
def spark() -> SparkSession:
    builder = SparkSession.builder.appName("pandas-on-spark")
    builder = builder.config("spark.sql.execution.arrow.pyspark.enabled", "true")
    return builder.getOrCreate()

def test1(spark):
    df = spark.createDataFrame(dummy_rows)
    # do some transformation
    # assert

def test2(spark):
    df = spark.createDataFrame(dummy_rows)
    # do some transformation
    # assert

def testN(spark):
    df = spark.createDataFrame(dummy_rows)
    # do some transformation
    # assert
pytest -s .
With scope="session" you have a single Spark session shared across all the tests, which means all variables, caches, transformations, etc. are shared too. If you really need each test completely separated, consider creating a new Spark session per test by lowering the scope to class or function. The whole suite will run slower, but your logic will be isolated.
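A sketch of the lower-scoped fixture (the master setting and app name are assumptions; the explicit teardown matters so each test starts from a clean session):

```python
import pytest

def build_spark():
    # Deferred import so this module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .master("local[1]")
            .appName("isolated-test")
            .getOrCreate())

@pytest.fixture(scope="function")  # was scope="session"
def spark():
    session = build_spark()
    yield session
    session.stop()  # tear down so the next test gets a fresh session
```

Each test function now receives its own session, at the cost of repeated startup time.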
Is there a straightforward way to get the yarn ApplicationId of the current job from the DRIVER node running under Amazon's Elastic Map Reduce (EMR)? This is running Spark in the cluster mode.
Right now I'm using code that runs a map() operation on a worker to read the CONTAINER_ID environment variable. This seems inefficient. Here's the code:
def applicationIdFromEnvironment():
    return "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])

def applicationId():
    """Return the YARN (or local) application ID.

    The environment variables are only set if we are running in a YARN container.
    """
    # First check to see if we are running on a worker...
    try:
        return applicationIdFromEnvironment()
    except KeyError:
        pass

    # Perhaps we are running on the driver? If so, run a Spark job that finds it.
    try:
        from pyspark import SparkConf, SparkContext
        sc = SparkContext.getOrCreate()
        if "local" in sc.getConf().get("spark.master"):
            return f"local{os.getpid()}"
        # Note: make sure that the following map does not require access to any existing module.
        appid = (sc.parallelize([1])
                 .map(lambda x: "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3]))
                 .collect())
        return appid[0]
    except ImportError:
        pass

    # Application ID cannot be determined.
    return f"unknown{os.getpid()}"
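The string munging above can be sanity-checked locally without Spark. A self-contained sketch, where the container id is a made-up example in the plain YARN format:

```python
def app_id_from_container_id(container_id):
    # "container_1433865536131_34483_01_000001" -> "application_1433865536131_34483"
    return "_".join(["application"] + container_id.split("_")[1:3])

print(app_id_from_container_id("container_1433865536131_34483_01_000001"))
# application_1433865536131_34483
```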
You can get the applicationID directly from the SparkContext using the property applicationId:
A unique identifier for the Spark application. Its format depends on
the scheduler implementation.
in case of a local Spark app: something like 'local-1433865536131'
in case of YARN: something like 'application_1433865536131_34483'
appid = sc.applicationId
I'm trying to submit a job written in Python to Spark.
First I'll explain my setup: I installed a single node of Spark 2.3.0 on a powerful Windows server (40 cores and 1 TB of RAM). Of course, the goal is eventually to create a cluster of less powerful nodes, but for now I'm testing everything there :)
My first test consists of taking a set of tabular CSV files (40-100 GB each), splitting them, and then saving the split results somewhere else.
I've been building my prototype in a Jupyter notebook using PySpark (which automatically creates a SparkContext).
Now I want to create a spark_test.py script containing the body of my prototype in main, which I plan to send to spark-submit.
The thing is that the processing part of my script does not seem to work at all. Here is the body of my script:
from pyspark import SparkContext, SparkConf

def main():
    # Create spark context
    spark_conf = SparkConf().setAppName('GMK_SPLIT_TEST')
    print('\nspark configuration: \n%s\n' % spark_conf.toDebugString())
    sc = SparkContext(conf=spark_conf)

    # Variables definition
    partitionings_number = 40
    file_1 = r'D:\path\to\my\csv\file.csv'
    output_path = r'D:\output\path'

    # Processing 1
    rdd = sc.parallelize(range(1000))
    print(rdd.mean())

    # Processing 2
    sdf = spark.read.option('header', 'true').csv(file_1, sep=';', encoding='utf-8')
    sdf_2 = sdf.repartition(partitionings_number, 'Zone[3-2]')
    sdf_2.write.saveAsTable('CSVBuckets', format='csv', sep=';', mode='overwrite', path=output_path, header='True')

if __name__ == '__main__':
    main()
Now here is where I have more doubts: will spark-submit try to connect to an already-running Spark instance, or will it initialize one by itself?
I tried:
spark-submit --master local[20] --driver-memory 30g
The above command seems to work for Processing 1, but not for Processing 2.
spark-submit --master spark://127.0.0.1:7077 --driver-memory 30g
The above command raises an exception during Spark context initialization. Is that because I don't have any Spark instance running?
How do I have to pass file_1 with my Python job in order to make Processing 2 work? I tried --files without success.
Thanks for your time, guys!
I am running a local environment with Spark, PySpark, IPython, and MySQL. I am struggling to run a MySQL query via Spark. The main issue is including the proper JDBC jar so the query can be performed.
Here is what I have so far :
import pyspark

conf = (pyspark.SparkConf()
        .setMaster('local')
        .setAppName('Romain_DS')
        .set("spark.executor.memory", "1g")
        .set("spark.driver.extraLibraryPath", "mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar")
        .set("spark.driver.extraClassPath", "mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar"))

sc = pyspark.SparkContext(conf=conf)
This is meant to properly create the Spark context and point to the jar containing the JDBC driver.
Then I create an SQLContext :
from pyspark.sql import SQLContext
sqlsc = SQLContext(sc)
And finally the query :
MYSQL_USERNAME = "root"
MYSQL_PWD = "rootpass"
MYSQL_CONNECTION_URL = "jdbc:mysql://127.0.0.1:33060/O_Tracking?user=" + MYSQL_USERNAME + "&password=" + MYSQL_PWD
query = 'Select * from tracker_action'

dataframe_mysql = sqlsc.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    dbtable="tracker_action",
    driver="com.mysql.jdbc.Driver",
    user="root",
    password="rootpass").load()
If I run this in the IPython notebook I get the error:
An error occurred while calling o198.load. :
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
However, if I do everything from the shell (not IPython), initializing the Spark context this way:
pyspark --driver-library-path './mysql-connector-java-5.1.37-bin.jar' --driver-class-path './mysql-connector-java-5.1.37-bin.jar'
it does work... I looked into the Spark UI and the configurations are the same. So I don't understand why one would work and not the other... Does it have something to do with the runtime setup before the JVM starts?
If I cannot find a proper solution, we could potentially run the sc in the shell and then use it from IPython, but I have no idea how to do that.
If someone can help me on that that would be great.
---- Hardware / Software
Mac OSX
Spark 1.5.2
Java 1.8.0
Python 2.7.10 :: Anaconda 2.3.0 (x86_64)
---- Sources to help:
https://gist.github.com/ololobus/4c221a0891775eaa86b0
http://spark.apache.org/docs/latest/configuration.html
Following the comments, here is my conf file:
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
spark.driver.extraLibraryPath /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar
spark.driver.extrClassPath /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar
spark.AppName PySpark
spark.setMaster Local
--------- Solution ---------
Thanks to the comments, I was finally able to get a working (and clean) solution.
Step 1 : Creating a profile :
ipython profile create pyspark
Step 2 : Edit the profile startup script :
touch ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
Step 3 : Fill in the file. Here I did something custom (thanks to the comments) :
import findspark
import os
import sys
findspark.init()
spark_home = findspark.find()
#spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Adding the library to mysql connector
packages = "mysql:mysql-connector-java:5.1.37"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {0} pyspark-shell".format(
    packages
)
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Then you can simply run the notebook with :
ipython notebook --profile=pyspark
I don't understand why would one work and not the other one ... Is there something to do with the runtime setting before the JVM ?
More or less. The IPython configuration you've shown executes python/pyspark/shell.py, which creates a SparkContext (and some other stuff) and spawns a JVM instance. When you create another context later, it uses the same JVM, and parameters like spark.driver.extraClassPath won't be applied.
There are multiple ways you can handle this, including passing arguments using PYSPARK_SUBMIT_ARGS or setting spark.driver.extraClassPath in $SPARK_HOME/conf/spark-defaults.conf.
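For the spark-defaults.conf route, the entries would look roughly like this (the jar path is hypothetical; point it at your actual connector jar):

```
# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.extraClassPath    /path/to/mysql-connector-java-5.1.37-bin.jar
spark.driver.extraLibraryPath  /path/to/mysql-connector-java-5.1.37-bin.jar
```

These are read before the JVM starts, which is exactly why setting them on an already-created context has no effect.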
Alternatively, you can add the following lines to 00-pyspark-setup.py before shell.py is executed:
packages = "mysql:mysql-connector-java:5.1.37"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {0} pyspark-shell".format(
    packages
)
Setting --driver-class-path / --driver-library-path there should work as well.
I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully, I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here
I'm working on Ubuntu and the various component versions I have are
Spark spark-1.5.1-bin-hadoop2.6
Hadoop hadoop-2.6.1
Mongo 2.6.10
Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
Python 2.7.10
I had some difficulty following the various steps, such as which jars to add to which path, so here is what I have added:
in /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
the following environment variables
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
My Python program is basic
from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()

def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')

if __name__ == '__main__':
    main()
I am running it using the command
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I am getting the following output as a result
Traceback (most recent call last):
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
    main()
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
    rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
    return self.mongoPairRDD(connection_string, config).values()
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
    _ensure_pickles(self)
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
    orig_tb)
py4j.protocol.Py4JError
According to here
This exception is raised when an exception occurs in the Java client
code. For example, if you try to pop an element from an empty stack.
The instance of the Java exception thrown is stored in the
java_exception member.
Looking at the source code for pymongo_spark.py, the line throwing the error says:
"Error while communicating with the JVM. Is the MongoDB Spark jar on
Spark's CLASSPATH? : "
So in response, I have tried to make sure the right jars are being passed, but I might be doing this all wrong; see below:
$SPARK_HOME/bin/spark-submit --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --master local[4] ~/sparkPythonExample/SparkPythonExample.py
I have imported pymongo in the same Python program to verify that I can at least access MongoDB that way, and I can.
I know there are quite a few moving parts here so if I can provide any more useful information please let me know.
Updates:
2016-07-04
Since the last update MongoDB Spark Connector matured quite a lot. It provides up-to-date binaries and data source based API but it is using SparkConf configuration so it is subjectively less flexible than the Stratio/Spark-MongoDB.
2016-03-30
Since the original answer I found two different ways to connect to MongoDB from Spark:
mongodb/mongo-spark
Stratio/Spark-MongoDB
While the former one seems to be relatively immature the latter one looks like a much better choice than a Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
      .format("com.stratio.datasource.mongodb")
      .options(host="mongo:27017", database="foo", collection="bar")
      .load())
df.show()
## +---+----+--------------------+
## | x| y| _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
The original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little bit more manageable by building a simple Docker image which roughly matches the described configuration (I've omitted Hadoop libraries for brevity, though). You can find the complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:
git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub, so you can simply docker pull zero323/mongo-spark.
Start images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
Start PySpark shell passing --jars and --driver-class-path:
pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
And finally see how it works:
import pymongo
import pymongo_spark
mongo_url = 'mongodb://mongo:27017/'
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
    {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()

pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
       .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()
## [(1.0, -1.0), (0.0, 4.0)]
Please note that mongo-hadoop seems to close the connection after the first action. So calling for example rdd.count() after the collect will throw an exception.
Based on different problems I've encountered creating this image I tend to believe that passing mongo-hadoop-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
Notes:
This image is loosely based on jaceklaskowski/docker-spark
so please be sure to send some good karma to @jacek-laskowski if it helps.
If you don't require a development version including the new API, then using --packages is most likely a better option.
Can you try using the --packages option instead of --jars in your spark-submit command:
spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
Some of these jar files are not uber jars and need more dependencies to be downloaded before they can work.
I was having this same problem yesterday. I was able to fix it by placing mongo-java-driver.jar in $HADOOP_HOME/lib, and mongo-hadoop-core.jar and mongo-hadoop-spark.jar in $HADOOP_HOME/spark/classpath/emr (or any other folder that is in $SPARK_CLASSPATH).
Let me know if that helps.
Good Luck!
# see https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
from pyspark import SparkContext, SparkConf
import pymongo_spark

# Important: activate pymongo_spark.
pymongo_spark.activate()

def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)

    # Create an RDD backed by the MongoDB collection.
    # This RDD *does not* contain key/value pairs, just documents.
    # If you want key/value pairs, use the mongoPairRDD method instead.
    rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')

    # Save this RDD back to MongoDB as a different collection.
    rdd.saveToMongoDB('mongodb://localhost:27017/db.other.collection')

    # You can also read and write BSON:
    bson_rdd = sc.BSONFileRDD('/path/to/file.bson')
    bson_rdd.saveToBSON('/path/to/bson/output')

if __name__ == '__main__':
    main()