I tried to import SparkSession in PySpark from both Anaconda Spyder and the PySpark command line, but I get the error below. I have listed the versions installed on my machine.
Error: ImportError: cannot import name 'SparkSession' from 'pyspark' (E:\spark3\python\pyspark\__init__.py)
Please help me resolve this issue.
Spark version:
Python version:
pyspark:
Error from Spyder:
Error from the PySpark command line:
My environment variable setup:
Although it has been a long time since you asked your question, I think you are aiming to use pyspark.sql, in which case you should import it this way:
from pyspark.sql import SparkSession
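As a minimal sketch, a guarded version of that import makes the failure mode explicit (the assumption here is that the ImportError means either pyspark older than 2.0 or a broken installation on sys.path):

```python
# SparkSession exists only in pyspark 2.0+, so guard the import to make
# the failure mode explicit instead of crashing at import time.
try:
    from pyspark.sql import SparkSession
    have_sparksession = True
except ImportError:
    have_sparksession = False  # pyspark < 2.0, or pyspark not on sys.path

# On success you would build the entry point as usual (needs a working JVM):
# spark = SparkSession.builder.appName("example").getOrCreate()
```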
I want to query a PostgreSQL database with pyspark from within a Jupyter notebook. I have browsed a lot of questions on Stack Overflow, but none of them worked for me, mainly because the answers seemed outdated. Here's my minimal code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Running this from a notebook would raise the following error:
Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...
The principal tips I have found are summed up in the link below, but unfortunately I can't get them to work in my notebook:
Pyspark connection to Postgres database in ipython notebook
Note: I am using Spark 2.3.1 and Python 3.6.3, and I am able to connect to the database from the pyspark shell if I specify the jar location:
pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar
Thanks to anyone who can help me on this one.
EDIT
The answers from How to load jar dependencies in IPython Notebook are already listed in the link I shared, and they do not work for me. I have already tried to configure the environment variable from the notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'
There's nothing wrong with the file path or the file itself since it works fine when I specify it and run the pyspark-shell.
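One detail that often bites with this approach: when PYSPARK_SUBMIT_ARGS is set from a notebook, it generally has to be set before pyspark is imported and has to end with pyspark-shell, otherwise the arguments are silently ignored. A sketch with a placeholder jar path:

```python
import os

jar = '/path/to/postgresql.jar'  # placeholder path

# Must be set BEFORE pyspark is imported, and must end with 'pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-class-path {0} --jars {0} pyspark-shell'.format(jar)
)
```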
Using the config method worked for me:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
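If you'd rather not hard-code the path in every notebook, the same keys can be set once in $SPARK_HOME/conf/spark-defaults.conf (the paths below are placeholders):

```
spark.driver.extraClassPath  /path/to/postgresql.jar
spark.jars                   /path/to/postgresql.jar
```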
I have installed pyspark and findspark in a conda environment and added their paths to my environment variables.
I execute the following code:
import findspark
import pyspark
findspark.find()
I get this output:
'C:/Users/myname/AppData/Local/Continuum/anaconda3/Scripts'
Then I execute:
findspark.init("C:/Users/myname/AppData/Local/Continuum/anaconda3/Scripts")
The output I get is:
Please use a Docker container https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook and save yourself the trouble.
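If you do go the container route, something along these lines starts the image from that repository (the port mapping is the standard one from its README; adjust to taste):

```shell
# Start a Jupyter notebook server with Spark/PySpark preinstalled
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
```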
I'm trying to connect to MongoDB using pyspark. Below is the code I'm using:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sparkConf = SparkConf().setAppName("App")
sparkConf.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
I'm getting the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource.
Failed to find data source: com.mongodb.spark.sql.DefaultSource.
This error indicates that PySpark failed to locate the MongoDB Spark Connector.
If you're invoking pyspark directly, make sure you specify mongo-spark-connector in the --packages parameter. For example:
./bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
If you're not invoking pyspark directly (e.g. from an IDE such as Eclipse), you will have to modify the Spark configuration spark.jars.packages to specify the dependency.
Either within the spark-defaults.conf file:
spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
Or, you can try changing the configuration within the code:
SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
Or:
SparkSession.builder.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.0').getOrCreate()
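Putting the pieces together, here is a sketch of the in-code variant. The package coordinates are the ones from this answer and must match your Spark/Scala versions; the URI is a placeholder:

```python
# Connector coordinates (Scala 2.11 build, version 2.2.0 — adjust to match
# your Spark) and a placeholder mongodb://host/database.collection URI.
pkg = "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"
uri = "mongodb://127.0.0.1/mydb.test"

conf_pairs = {
    "spark.jars.packages": pkg,      # fetches the connector at startup
    "spark.mongodb.input.uri": uri,  # default source for .load()
}

# With pyspark available you would apply the pairs like so:
# builder = SparkSession.builder.appName("mongo-example")
# for k, v in conf_pairs.items():
#     builder = builder.config(k, v)
# df = builder.getOrCreate().read.format(
#     "com.mongodb.spark.sql.DefaultSource").load()
```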
I had this problem. I could insert and find documents from the pyspark shell, but I could not do so using this code snippet within the pyspark runtime:
SparkSession.builder.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.0').getOrCreate()
You may also need to add the packages manually to $SPARK_HOME/jars. Consider this answer.
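The manual variant amounts to dropping the connector jar (and its dependencies, such as the mongo-java-driver) into the directory Spark scans at startup; the file names and paths below are placeholders:

```shell
# Copy previously downloaded jars into Spark's jar directory
cp /path/to/mongo-spark-connector_2.11-2.2.0.jar "$SPARK_HOME/jars/"
cp /path/to/mongo-java-driver.jar "$SPARK_HOME/jars/"
```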
My objective is to use a Jupyter Notebook (IPython) with Apache Spark; I'm using Apache Toree to do this. I set the SPARK_HOME environment variable and configured the Apache Toree installation with Jupyter. Everything seems fine.
When I run the command below, a Jupyter browser is opened:
ipython notebook --profile=pyspark
When I choose Apache Toree - PySpark in the drop-down menu, I can't code in my notebook, and I see this view (Python 2 is OK):
The red button gives:
What's wrong? Please help.
Not really an answer, but if you're not hooked on Toree and just need a local Spark for learning and experimenting, you can download a copy of Spark, unzip it, and use this at the beginning of your notebook:
import os
import sys

# Point SPARK_HOME at the directory where you extracted Spark
os.environ['SPARK_HOME'] = "<path where you have extracted the spark file>"
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))
# Adjust the py4j version to match the zip in your Spark's python/lib directory
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.10.4-src.zip'))

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
import pyspark.sql.functions as sql

sc = SparkContext()
sqlContext = SQLContext(sc)
print(sc.version)
I cannot import SparkSession from pyspark.sql, but I can import Row.
My spark-1.6.0-bin-hadoop2.6 was installed in a Docker container; the system is CentOS.
How can I solve this problem? It has troubled me for a long time.
You cannot import it because it is not present there: the Spark version you are using is 1.6, and SparkSession was introduced in 2.0.0.
You can see here: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
You can download Spark 2.0.0 from here: http://spark.apache.org/downloads.html
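If upgrading isn't an option, the Spark 1.6 equivalent of the entry point is SQLContext; a guarded sketch (the import paths shown are the same ones the question already uses successfully for Row):

```python
# On Spark 1.x there is no SparkSession; SQLContext is the SQL entry point.
try:
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row  # both importable on Spark 1.6
    have_pyspark = True
except ImportError:
    have_pyspark = False  # pyspark not installed in this environment

# On Spark 1.6 you would then create the entry point like this:
# sc = SparkContext()
# sqlContext = SQLContext(sc)
# df = sqlContext.read.json("examples/src/main/resources/people.json")
```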