Unable to connect to Mongo from pyspark - python

I'm trying to connect to MongoDB using pyspark. Below is the code I'm using:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sparkConf = SparkConf().setAppName("App")
sparkConf.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
I'm getting the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource.

Failed to find data source: com.mongodb.spark.sql.DefaultSource.
This error indicates that PySpark failed to locate the MongoDB Spark Connector.
If you're invoking pyspark directly, make sure you specify mongo-spark-connector via the --packages parameter. For example:
./bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
If you're not invoking pyspark directly (e.g. from an IDE such as Eclipse), you have to set the Spark configuration property spark.jars.packages to specify the dependency.
Either within the spark-defaults.conf file:
spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
Or, you can try changing the configuration within the code:
SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
Or:
SparkSession.builder.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.0').getOrCreate()
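Putting that together with the code from the question, here is a minimal sketch that sets the connector package on the SparkConf before the SparkContext is created (the connector version matches the examples above; pick the one that matches your Spark and Scala versions):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sparkConf = SparkConf().setAppName("App")
# Pull in the MongoDB Spark Connector; this must be set before the SparkContext starts.
sparkConf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
sparkConf.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()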

I had this problem. I could insert and find documents from the pyspark shell, but I could not when using this code snippet inside the pyspark runtime:
SparkSession.builder.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.0').getOrCreate()
You may also need to add the packages manually to $SPARK_HOME/jars. Consider this answer.

Related

Use Pyspark inside the python project

I have a Python project and am trying to use pyspark within it. I build a Python class and call PySpark classes and methods from it. I declare a SparkConf, create a configuration used by Spark, then create a SparkSession with this conf. My Spark environment is a cluster, and I can use it with cluster deploy mode and master set to yarn. But when I try to get an instance of the Spark session, it shows up on the Yarn applications page as submitted, yet my Python methods are not submitted as Spark jobs. It works if I submit this Python file as a single script; then it runs as a job on Yarn.
Here is the sample code:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
spark = SparkSession.builder.master('yarn')\
.config(conf=conf)\
.appName('myapp')\
.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1,2,3])
count = rdd.count()
print(sc.master)
print(count)
It works great when I submit it with ./bin/spark-submit myapp.py, and I see it running on Yarn. It does not work as I expect when I run it with python myapp.py: I can see it on Yarn as an application, but no job or executor is assigned.
Any help will be appreciated.
PS: I have already set the environment variables (including the Hadoop conf dir, Spark conf dir, etc.) and the core-site.xml and yarn-site.xml config files, so I have not repeated them here.

Spark - SparkSession - Cannot Import

I tried to import SparkSession using PySpark from Anaconda Spyder and from the PySpark command line, but I'm getting the error below. I have given the details of the versions installed on my machine.
Error: ImportError: cannot import name 'SparkSession' from 'pyspark' (E:\spark3\python\pyspark\__init__.py).
Please help me resolve this issue.
(Screenshots omitted: Spark version, Python version, pyspark version, the error from Spyder, the error from the PySpark command line, and my environment variable setup.)
Although it has been a long time since you asked your question, I think you intend to use pyspark.sql, so you should import it this way:
from pyspark.sql import SparkSession
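For completeness, a minimal sketch of what that import enables once it succeeds (the app name is just a placeholder):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test-session").getOrCreate()
print(spark.version)
spark.stop()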

How to specify driver class path when using pyspark within a jupyter notebook?

I want to query a PostgreSQL database with pyspark within a Jupyter notebook. I have browsed a lot of questions on Stack Overflow, but none of them worked for me, mainly because the answers seemed outdated. Here's my minimal code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Running this from a notebook would raise the following error:
Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...
The principal tips I have found are summed up in the link below, but unfortunately I can't get them to work in my notebook:
Pyspark connection to Postgres database in ipython notebook
Note: I am using Spark 2.3.1 and Python 3.6.3 and I am able to connect to the database from the pyspark shell if I specify the jar location.
pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar
Thanks to anyone who can help me on this one.
EDIT
The answers from How to load jar dependenices in IPython Notebook are already listed in the link I shared myself, and do not work for me. I already tried to configure the environment variable from the notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'
There's nothing wrong with the file path or the file itself since it works fine when I specify it and run the pyspark-shell.
Using the config method worked for me:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
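For reference, the same read can also be expressed through the generic jdbc data source; this sketch assumes the same placeholder host, table, and credentials as above, plus the standard PostgreSQL driver class:
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host/dbname")
      .option("dbtable", "tablename")
      .option("user", "username")
      .option("password", "pwd")
      .option("driver", "org.postgresql.Driver")  # standard PostgreSQL JDBC driver class
      .load())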

run python script with pyspark settings

I have my Spark configuration in spark-defaults.conf and the XML files core-site.xml and hive-site.xml, and I have exported the environment variables. When I run the pyspark console:
$ pyspark --master yarn
and then:
>>> sqlContext.sql("show tables").show()
everything works, but when I use the pure Python interpreter I cannot see my tables.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
sqlContext.sql("show tables").show()
How can I make Python see all the config files?
My understanding is that when you run the PySpark shell, Spark is instantiated with Hive support, i.e. the default sqlContext is a HiveContext.
But when running a Python program or the plain Python interpreter, as in your case, the SQLContext does not come with Hive support.
To fix this, create a HiveContext explicitly:
from pyspark.sql import HiveContext
sqlCtx = HiveContext(sc)
sqlCtx.sql("show tables").show()
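On Spark 2.x you can get the same effect with the SparkSession API; a minimal sketch, assuming the Hive and Yarn config files are picked up from the conf directories described in the question (the app name is a placeholder):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .master("yarn")
         .appName("show-tables")   # placeholder app name
         .enableHiveSupport()      # gives the session access to the Hive metastore
         .getOrCreate())
spark.sql("show tables").show()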

Read a folder of parquet files from s3 location using pyspark to pyspark dataframe

I want to read some parquet files in the folder poc/folderName of the S3 bucket myBucketName into a pyspark dataframe. I am using pyspark v2.4.3.
Below is the code I am using:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", 'id')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", 'sid')
sqlContext = SQLContext(sc)
parquetDF = sqlContext.read.parquet("s3a://myBucketName/poc/folderName")
I have downloaded the hadoop-aws package using the command pyspark --packages org.apache.hadoop:hadoop-aws:3.3.0, but when I run the above code I receive the error below.
An error occurred while calling o825.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
What am I doing wrong here?
I am running the Python code using Anaconda and Spyder on Windows 10.
The Maven coordinates for the open source Hadoop S3 driver need to be added as a package dependency:
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.0
Note the above package version is tied to the installed AWS SDK for Java version.
In the Spark application's code, something like the following may also be needed:
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
Note that when using the open source Hadoop driver, the S3 URI scheme is s3a and not s3 (as it is when using Spark on EMR and Amazon's proprietary EMRFS). e.g. s3a://bucket-name/
Credits to danielchalef
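Putting the answer together with the code from the question, a minimal end-to-end sketch (the bucket, prefix, and hadoop-aws version come from the examples above; the credentials and app name are placeholders, and the package version must suit your installation):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("read-s3-parquet")   # placeholder app name
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")  # version must match your setup
         .getOrCreate())
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder credentials
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
parquetDF = spark.read.parquet("s3a://myBucketName/poc/folderName")
parquetDF.printSchema()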
