I cannot import SparkSession from pyspark.sql, but I can import Row.
My spark-1.6.0-bin-hadoop2.6 is installed in a Docker container running CentOS.
How can I solve this problem? It has troubled me for a long time.
You cannot use it because it is not there: the Spark version you are using is 1.6, and SparkSession was introduced in 2.0.0.
You can see here: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
You can download Spark 2.0.0 from here: http://spark.apache.org/downloads.html
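If the same script has to run on both Spark 1.6 and 2.x, a fallback along these lines might help. It is only a sketch: has_spark_session and make_sql_entry_point are names made up for this example, and the pyspark imports are done lazily inside the function so nothing is imported until you actually call it.

```python
def has_spark_session(spark_version):
    """Return True if the given Spark version string is 2.0 or newer.

    SparkSession was introduced in Spark 2.0.0, so 1.x versions return False.
    """
    major = int(spark_version.split(".")[0])
    return major >= 2


def make_sql_entry_point(app_name="example"):
    """Return a SparkSession when available, otherwise a SQLContext (Spark 1.x).

    Hypothetical helper for illustration; requires pyspark to be importable.
    """
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        # Spark 1.x: SparkSession does not exist, fall back to SQLContext.
        from pyspark import SparkContext
        from pyspark.sql import SQLContext
        return SQLContext(SparkContext(appName=app_name))
    return SparkSession.builder.appName(app_name).getOrCreate()
```

On 1.6 the returned object is a SQLContext, so only the calls the two classes share will work on both versions.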
Related
I'm new to PySpark, and I'm just trying to read a table from my Redshift database.
The code looks like the following:
import findspark
findspark.add_packages("io.github.spark-redshift-community:spark-redshift_2.11:4.0.1")
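findspark.init()
# Import after findspark.init(), which puts pyspark on sys.path
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Dim_Customer").getOrCreate()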
df_read_1 = spark.read \
.format("io.github.spark_redshift_community.spark.redshift") \
.option("url", "jdbc:redshift://fake_ip:5439/fake_database?user=fake_user&password=fake_password") \
.option("dbtable", "dim_customer") \
.option("tempdir", "https://bucket-name.s3.region-code.amazonaws.com/") \
.load()
I'm getting the error: java.lang.NoClassDefFoundError: scala/Product$class
I'm using Spark version 3.2.2 with Python 3.9.7
Could someone help me, please?
Thank you in advance!
You're using the wrong version of the spark-redshift connector: your version is built for Spark 2.4, which uses Scala 2.11, while you need the version for Spark 3, which uses Scala 2.12. Change the version to 5.1.0, which was released recently, and use the matching _2.12 artifact suffix in place of _2.11 (all released versions are listed here).
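To make the mismatch harder to repeat, the package coordinate can be built from the Spark version instead of typed by hand. This is a rough sketch, assuming the default prebuilt distributions (Scala 2.11 for Spark 2.4, Scala 2.12 for Spark 3); a custom build may use a different Scala version. The function names are made up for the example.

```python
def scala_suffix_for_spark(spark_version):
    """Guess the Scala binary version from a Spark version string.

    Assumption: default prebuilt distributions, i.e. Scala 2.12 for
    Spark 3.x and Scala 2.11 for Spark 2.x.
    """
    major = int(spark_version.split(".")[0])
    return "2.12" if major >= 3 else "2.11"


def redshift_package(spark_version, connector_version):
    """Build the Maven coordinate of the spark-redshift community connector."""
    suffix = scala_suffix_for_spark(spark_version)
    return "io.github.spark-redshift-community:spark-redshift_{}:{}".format(
        suffix, connector_version
    )


# Spark 3.2.2 with connector 5.1.0, the combination discussed above
print(redshift_package("3.2.2", "5.1.0"))
# prints io.github.spark-redshift-community:spark-redshift_2.12:5.1.0
```

The resulting string is what you would pass to findspark.add_packages(...) before findspark.init().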
I tried to import SparkSession with PySpark from both Anaconda Spyder and the PySpark command line, but I get the error below. I have included the versions installed on my machine.
Error : ImportError: cannot import name 'SparkSession' from 'pyspark' (E:\spark3\python\pyspark\__init__.py).
Please help me resolve this issue.
[Screenshots not shown: Spark version, Python version, pyspark, error from Spyder, error from the PySpark command line, environment variable setup]
Although it has been a long time since you asked your question, I think you want pyspark.sql, so the import should be:
from pyspark.sql import SparkSession
I'm trying to install the Isolation Forest package on the Databricks platform. The Spark version on Databricks is 3.1.1:
print (pyspark.__version__)
#3.1.1
So I tried to follow this article to implement IsolationForest, but I couldn't install the package from this repo with the following steps:
Step 1. Package spark-iforest jar and deploy it into spark lib
cd spark-iforest/
mvn clean package -DskipTests
cp target/spark-iforest-.jar $SPARK_HOME/jars/
Step 2. Package pyspark-iforest and install it via pip, skip this step if you don't need the python pkg
cd spark-iforest/python
python setup.py sdist
pip install dist/pyspark-iforest-.tar.gz
So basically I run the following script and get: ModuleNotFoundError: No module named 'pyspark_iforest'
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark_iforest.ml.iforest import IForest, IForestModel
import tempfile
conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')
spark = SparkSession \
.builder \
.config(conf=conf) \
.appName("IForestExample") \
.getOrCreate()
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"
What is the best practice for installing IsolationForest on the Databricks platform for PySpark?
This specific version of isolation forest is compiled for Spark 2.4 and Scala 2.11, and is binary incompatible with the Spark 3.1 that you're using. You may try Databricks Runtime (DBR) versions that are based on Spark 2.4: 6.4 or 5.4.
You may look into the mmlspark (Microsoft Machine Learning for Apache Spark) library developed by Microsoft. It has an implementation of IsolationForest, although I haven't used it myself.
You need to import the jar first before using it:
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
import tempfile
conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')
spark = SparkSession \
.builder \
.config(conf=conf) \
.appName("IForestExample") \
.getOrCreate()
from pyspark_iforest.ml.iforest import IForest, IForestModel
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"
I usually create the Spark session in a separate .py file, or provide the jars with the spark-submit command, because the way jars are loaded sometimes gives me trouble when I add them only within the code.
spark-submit --jars /full/path/to/spark-iforest-2.4.0.jar my_code.py
Also, there is a version mismatch, as @Alex Ott mentioned, but the error would be different in that case. Building IForest with pyspark 3.x is not very difficult, but if you don't want to get into it, you could downgrade the pyspark version.
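A typo in the jar path is easy to miss, so it may be worth checking the path before handing it to spark.jars. A minimal sketch, assuming the jars are on the local filesystem of the driver; jars_option is a made-up helper name, and spark.jars takes a comma-separated list of paths.

```python
import os


def jars_option(paths):
    """Join jar paths into the comma-separated value expected by spark.jars.

    Raises FileNotFoundError early if any path is not an existing file.
    """
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("jar(s) not found: " + ", ".join(missing))
    return ",".join(os.path.abspath(p) for p in paths)


# Usage with the configuration from the answer above:
#   conf.set('spark.jars', jars_option(['/full/path/to/spark-iforest-2.4.0.jar']))
```

This only verifies that the file exists. It does not tell you whether the jar was built for your Spark and Scala versions.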
My goal is to use Jupyter Notebook (IPython) with Apache Spark. I'm using Apache Toree to do this. I set the SPARK_HOME environment variable and configured the Apache Toree installation with Jupyter. Everything seems fine.
When I run the command below, a Jupyter browser window opens: ipython notebook --profile=pyspark
When I choose Apache Toree - PySpark in the drop-down menu, I can't write code in my notebook, and I get this view (Python 2 is OK):
The red button gives:
What's wrong? Help, please.
Not really an answer, but if you're not hooked on toree and just need a local spark for learning and experimenting, you could download a copy of spark, unzip it and use this in the beginning of your notebook:
import os
import sys
os.environ['SPARK_HOME']="<path where you have extracted the spark file>"
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'python') )
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'bin') )
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.10.4-src.zip') )
from pyspark import SparkContext,SparkConf
from pyspark.sql import SQLContext, Row
import pyspark.sql.functions as sql
sc = SparkContext()
sqlContext = SQLContext(sc)
print(sc.version)
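The py4j version in that path (0.10.4 above) changes between Spark releases, so hard-coding it tends to break after an upgrade. Here is a rough sketch that looks the zip up instead; add_pyspark_to_path is a made-up name, and it assumes the usual layout of an extracted Spark download (python/lib/py4j-*-src.zip under SPARK_HOME).

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Set SPARK_HOME and add Spark's Python sources and py4j zip to sys.path.

    Returns the path of the py4j zip that was found.
    """
    os.environ["SPARK_HOME"] = spark_home
    python_dir = os.path.join(spark_home, "python")
    lib_dir = os.path.join(python_dir, "lib")

    # Normally there is exactly one py4j zip; take the last one if several match.
    zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not zips:
        raise IOError("no py4j-*-src.zip found under " + lib_dir)

    for entry in (python_dir, zips[-1]):
        if entry not in sys.path:
            sys.path.append(entry)
    return zips[-1]
```

After calling it with the directory where Spark was extracted, the pyspark imports from the snippet above should resolve.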
I downloaded the prebuilt spark for hadoop 2.4 and I'm getting the following error when I try to fire up a SparkContext in python:
ClassNotFoundException: org.apache.spark.launcher.Main
The following code should be correct:
import sys, os
os.environ['SPARK_HOME'] = '/spark-1.5.1-bin-hadoop2.4/'
sys.path.insert(0, '/spark-1.5.1-bin-hadoop2.4/python/')
os.environ['PYTHONPATH'] = '/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/'
import pyspark
from pyspark import SparkContext
sc = SparkContext('local[2]')
Turns out my issue was that the default JDK on my Mac is Java 1.6, and Spark 1.5 dropped support for Java 1.6 (reference). I upgraded to Java 1.8 with the installer from Oracle, and it fixed the problem.
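For anyone hitting the same thing, the Java version can be checked from Python before starting a SparkContext. This is just a sketch with made-up function names: it parses the first line that java -version prints (which goes to stderr) and treats the old "1.x" numbering as major version x.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    'java version "1.6.0_65"' gives 6, 'openjdk version "11.0.2"' gives 11.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string")
    first, second = match.group(1), match.group(2)
    # Before Java 9 the version was written as 1.<major>.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` and return the major version of the default JDK."""
    result = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return java_major_version(result.stderr)
```

With the setup from the question, installed_java_major() would have reported 6, which is below what Spark 1.5 supports.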