Spark - SparkSession - Cannot Import - python

I tried to import SparkSession in PySpark, both from Anaconda Spyder and from the PySpark command line, but I get the error below. Details of the versions installed on my machine follow.
Error: ImportError: cannot import name 'SparkSession' from 'pyspark' (E:\spark3\python\pyspark\__init__.py)
Please help me resolve this issue.
The following details were attached as screenshots (not reproduced here): Spark version, Python version, pyspark version, the error from Spyder, the error from the PySpark command line, and my environment variable setup.

Although it has been a long time since you asked, I think you intend to use pyspark.sql, so you should import it this way:
from pyspark.sql import SparkSession
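If the import still fails, it is worth checking whether the pyspark on your path actually exposes SparkSession, since it only exists in Spark 2.0+. A small sketch of such a check (plain Python; the function name is illustrative):

```python
import importlib

def can_import_sparksession():
    """Return True if pyspark.sql exposes SparkSession (Spark 2.0+)."""
    try:
        sql_mod = importlib.import_module("pyspark.sql")
    except ImportError:
        return False  # pyspark itself is not on the path
    return hasattr(sql_mod, "SparkSession")
```

If this returns False even though pyspark is installed, the installation on your path is likely a 1.x release.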

Related

How to specify driver class path when using pyspark within a jupyter notebook?

I want to query a PostgreSQL database with pyspark from within a Jupyter notebook. I have browsed many questions on Stack Overflow, but none of them worked for me, mainly because the answers seemed outdated. Here's my minimal code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Running this from a notebook would raise the following error:
Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...
The main tips I found are summed up in the link below, but unfortunately I can't get them to work in my notebook:
Pyspark connection to Postgres database in ipython notebook
Note: I am using Spark 2.3.1 and Python 3.6.3 and I am able to connect to the database from the pyspark shell if I specify the jar location.
pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar
Thanks to anyone who can help me on this one.
EDIT
The answers from How to load jar dependencies in IPython Notebook are already listed in the link I shared, and do not work for me. I have already tried to configure the environment variable from the notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'
There's nothing wrong with the file path or the file itself since it works fine when I specify it and run the pyspark-shell.
Using the config method worked for me:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
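As an alternative to pointing at a local jar, Spark can resolve the JDBC driver from Maven through the spark.jars.packages setting. A sketch under assumptions: the helper name is mine, and the postgresql coordinate and version are illustrative, so pin whatever your database needs.

```python
def postgres_session(driver_coord="org.postgresql:postgresql:42.2.5"):
    """Build a SparkSession that downloads the JDBC driver from Maven.

    The coordinate is an example; pin the driver version you actually need.
    Requires pyspark to be installed.
    """
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .config("spark.jars.packages", driver_coord)
            .getOrCreate())

# df = postgres_session().read.jdbc(url=url, table='tablename', properties=properties)
```

This avoids hard-coding a machine-specific jar path in the notebook, at the cost of needing network access the first time Spark resolves the package.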

Unable to set-up Pyspark

I have installed pyspark and findspark in a conda environment and added their paths to my environment variables.
I execute the following code:
import findspark
import pyspark
findspark.find()
I get the output as:
'C:/Users/myname/AppData/Local/Continuum/anaconda3/Scripts'
Then I execute:
findspark.init("C:/Users/myname/AppData/Local/Continuum/anaconda3/Scripts")
The output I get is an error (posted as a screenshot, not reproduced here).
Please use a Docker container https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook and save yourself the trouble.

Unable to connect to Mongo from pyspark

I'm trying to connect to MongoDB using pyspark. Below is the code I'm using
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sparkConf = SparkConf().setAppName("App")
sparkConf.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test")
sc = SparkContext(conf = sparkConf)
sqlContext =SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
I get the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource.
Failed to find data source: com.mongodb.spark.sql.DefaultSource.
This error indicates that PySpark failed to locate the MongoDB Spark Connector.
If you're invoking pyspark directly, make sure you specify mongo-spark-connector in the --packages parameter, for example:
./bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
If you're not invoking pyspark directly (e.g. from an IDE such as Eclipse), you have to set the Spark configuration spark.jars.packages to specify the dependency.
Either within the spark-defaults.conf file:
spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
Or, you can try changing the configuration within the code:
SparkConf().set("spark.jars.packages","org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
Or:
SparkSession.builder.config('spark.jars.packages','org.mongodb.spark:mongo-spark-connector_2.11:2.2.0' ).getOrCreate()
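The options above can also be combined in one builder chain. A sketch, wrapped in a hypothetical helper function; the connector coordinate must match your own Scala/Spark build, and _2.11:2.2.0 is just the version used in this thread:

```python
def mongo_session(uri="mongodb://127.0.0.1/mydb.test",
                  package="org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"):
    """Build a SparkSession with the connector resolved via spark.jars.packages.

    Requires pyspark to be installed; Spark downloads the connector on first use.
    """
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("App")
            .config("spark.jars.packages", package)
            .config("spark.mongodb.input.uri", uri)
            .getOrCreate())

# df = mongo_session().read.format("com.mongodb.spark.sql.DefaultSource").load()
```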
I had this problem. I could insert and find documents from the pyspark shell, but this code snippet did not work within the pyspark runtime:
SparkSession.builder.config('spark.jars.packages','org.mongodb.spark:mongo-spark-connector_2.11:2.2.0' ).getOrCreate()
You may also need to add the packages manually to $SPARK_HOME/jars. Consider this answer.

Jupyter Notebook with Apache Spark (Kernel Error)

My objective is to use Jupyter Notebook (IPython) with Apache Spark, using Apache Toree. I set the SPARK_HOME environment variable and configured the Apache Toree installation with Jupyter. Everything seems fine.
When I run the command ipython notebook --profile=pyspark, a Jupyter browser opens.
When I choose Apache Toree - PySpark in the drop-down menu, I can't code in my notebook and I see this view (Python 2 is OK):
The red button gives:
What's wrong? Please help.
Not really an answer, but if you're not set on Toree and just need a local Spark for learning and experimenting, you can download a copy of Spark, unzip it, and use this at the beginning of your notebook:
import os
import sys

# Point SPARK_HOME at the directory where you extracted Spark
os.environ['SPARK_HOME'] = "<path where you have extracted the spark file>"
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))
# The py4j zip name varies between Spark releases; check python/lib for yours
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.10.4-src.zip'))

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
import pyspark.sql.functions as sql

sc = SparkContext()
sqlContext = SQLContext(sc)
print(sc.version)
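The sys.path edits above can be wrapped in a small helper; since the py4j zip name changes between Spark releases, globbing for it is safer than hard-coding the version. This is a sketch of the same idea (roughly what the findspark package automates), with an illustrative function name:

```python
import glob
import os
import sys

def add_spark_to_path(spark_home):
    """Put Spark's Python sources (and the bundled py4j zip) on sys.path."""
    sys.path.append(os.path.join(spark_home, "python"))
    # the py4j zip name varies by Spark release, so search for it
    for zip_path in glob.glob(os.path.join(spark_home, "python", "lib",
                                           "py4j-*-src.zip")):
        sys.path.append(zip_path)
    return sys.path
```

After calling it with your extracted Spark directory, the pyspark imports above should work without any manual path fiddling.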

"cannot import name SparkSession"

I cannot import SparkSession from pyspark.sql, but I can import Row.
My spark-1.6.0-bin-hadoop2.6 is installed in a Docker container; the system is CentOS.
How can I solve this problem? It has troubled me for a long time.
You cannot import it because it is not present in your version: you are using Spark 1.6, and SparkSession was introduced in Spark 2.0.0.
You can read more here: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
You can download Spark 2.0.0 here: http://spark.apache.org/downloads.html
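If upgrading is not an option, code on Spark 1.6 has to use the 1.x entry points instead. A version-agnostic sketch (the function name is mine; it assumes pyspark is importable when called):

```python
def make_entry_point():
    """Return a SparkSession on Spark 2.x, or an SQLContext on 1.x."""
    try:
        from pyspark.sql import SparkSession  # added in Spark 2.0
        return SparkSession.builder.getOrCreate()
    except ImportError:
        from pyspark import SparkContext
        from pyspark.sql import SQLContext    # the 1.x entry point
        return SQLContext(SparkContext())
```

On 1.6 the returned SQLContext supports read.jdbc, sql, and createDataFrame much like a SparkSession does on 2.x.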
