Getting Spark 1.5 to run on mac local - python

I downloaded the prebuilt Spark for Hadoop 2.4 and I get the following error when I try to start a SparkContext in Python:
ClassNotFoundException: org.apache.spark.launcher.Main
The following code should be correct:
import sys, os
os.environ['SPARK_HOME'] = '/spark-1.5.1-bin-hadoop2.4/'
sys.path.insert(0, '/spark-1.5.1-bin-hadoop2.4/python/')
os.environ['PYTHONPATH'] = '/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/'
import pyspark
from pyspark import SparkContext
sc = SparkContext('local[2]')

It turned out that the default JDK on my Mac was Java 1.6, and Spark 1.5 dropped support for Java 1.6 (reference). I upgraded to Java 1.8 with the installer from Oracle, which fixed the problem.
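To check which Java the launcher will pick up before starting a SparkContext, something like the following sketch may help. The helper names and the MIN_JAVA_MAJOR constant are hypothetical; the value 7 is an assumption based on Java 1.6 being the version that was dropped. Note that `java -version` writes to stderr rather than stdout.

```python
import re
import subprocess

# Assumption: since Spark 1.5 dropped Java 1.6, treat Java 7 as the minimum.
MIN_JAVA_MAJOR = 7


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy numbering: "1.6.0_65" means Java 6. Modern numbering: "11.0.2" means Java 11.
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Run `java -version` and parse its stderr; return None if java is not found."""
    try:
        proc = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        return None
    return parse_java_major(proc.stderr)


# Parsing sample version strings (no Java installation needed for this part).
old_jdk = parse_java_major('java version "1.6.0_65"')
new_jdk = parse_java_major('java version "1.8.0_60"')
print(old_jdk >= MIN_JAVA_MAJOR, new_jdk >= MIN_JAVA_MAJOR)
```

Calling installed_java_major() before creating the SparkContext and comparing the result against MIN_JAVA_MAJOR gives an early, readable failure instead of the ClassNotFoundException.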

Related

Spark - SparkSession - Cannot Import

I tried to import SparkSession from PySpark, both in Anaconda Spyder and on the PySpark command line, but I get the error below. I have included the details of the versions installed on my machine.
Error: ImportError: cannot import name 'SparkSession' from 'pyspark' (E:\spark3\python\pyspark\__init__.py).
Please help me resolve this issue.
[Screenshots not preserved: Spark version, Python version, pyspark, error from Spyder, error from PySpark command line, environment variable setup]
Although it has been a long time since you asked your question, I think you mean to use pyspark.sql, in which case the import should be:
from pyspark.sql import SparkSession

Trouble with Python3 string encoding

I have a link to a CSV file encoded in windows-1251. The area_name field in this file is a string. In JupyterLab on my laptop I run:
import pandas as pd
df = pd.read_csv(target_link, encoding="windows-1251", delimiter=";")
df['area_name'] = [el.encode('utf-8').decode('utf-8').replace('/', '') for el in df.area_name.values]
df.to_sql(...)
After this, the data in the database is correct.
But when I run the same code in Apache Airflow on the server, the encoding in the database is incorrect.
On my laptop:
OS - macOS Monterey 12.1
Python version 3.9.7
Pandas version 1.3.4
On server:
Python version 3.7.10
Pandas version 1.1.4
Airflow version 1.10.15
In both cases I use one database. Its encoding is UTF8.
How can I fix the encoding in pandas?
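One observation that may narrow things down, as a sketch rather than a diagnosis: the encode('utf-8').decode('utf-8') round trip in the list comprehension returns an equal string, so it neither repairs nor damages the text. What matters is which codec is used when the windows-1251 bytes are first decoded. The sample value below is hypothetical, and this does not identify what differs on the server.

```python
text = "Москва/центр"  # hypothetical area_name value

# Encoding to UTF-8 and decoding back yields an equal string.
round_tripped = text.encode("utf-8").decode("utf-8")

# The same text as bytes, as it would be stored in a windows-1251 file.
raw = text.encode("windows-1251")

# Decoding with the matching codec recovers the text.
decoded_right = raw.decode("windows-1251")

# Decoding with a different single-byte codec silently produces mojibake.
decoded_wrong = raw.decode("latin-1")

print(round_tripped == text)
print(decoded_right == text)
print(decoded_wrong == text)
```

If the DataFrame prints correctly on the server right after read_csv, the strings themselves are fine, and the step that writes to the database (for example the connection's client encoding, which is an assumption here) would be the next place to look.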

What is the best practice to install IsolationForest on the Databricks platform for the PySpark API?

I'm trying to install the Isolation Forest package on the Databricks platform. The Spark version in Databricks is 3.1.1:
print (pyspark.__version__)
#3.1.1
I tried to follow this article to implement IsolationForest, but I couldn't install the package from this repo with the following steps:
Step 1. Package spark-iforest jar and deploy it into spark lib
cd spark-iforest/
mvn clean package -DskipTests
cp target/spark-iforest-.jar $SPARK_HOME/jars/
Step 2. Package pyspark-iforest and install it via pip, skip this step if you don't need the python pkg
cd spark-iforest/python
python setup.py sdist
pip install dist/pyspark-iforest-.tar.gz
So basically I run the following script and get: ModuleNotFoundError: No module named 'pyspark_iforest'
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark_iforest.ml.iforest import IForest, IForestModel
import tempfile
conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')
spark = SparkSession \
.builder \
.config(conf=conf) \
.appName("IForestExample") \
.getOrCreate()
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"
What is the best practice to install IsolationForest on the Databricks platform for PySpark?
This specific version of Isolation Forest is compiled for Spark 2.4 and Scala 2.11, and is binary incompatible with the Spark 3.1 that you're using. You may try Databricks Runtime (DBR) versions that are based on Spark 2.4: 6.4 or 5.4.
You may also look at the mmlspark (Microsoft Machine Learning for Apache Spark) library developed by Microsoft. It has an implementation of IsolationForest, although I haven't used it myself.
You need to load the jar before using it:
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
import tempfile
conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')
spark = SparkSession \
.builder \
.config(conf=conf) \
.appName("IForestExample") \
.getOrCreate()
from pyspark_iforest.ml.iforest import IForest, IForestModel
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"
I usually create the Spark session in a separate .py file or provide spark.jars through the spark-submit command, because the way jars are loaded sometimes gives me trouble when they are added only within the code.
spark-submit --jars /full/path/to/spark-iforest-2.4.0.jar my_code.py
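If the job is launched from another Python script, the same invocation can be assembled as an argument list. This is a minimal sketch; build_submit_command is a hypothetical helper, and it relies only on --jars accepting a comma-separated list of paths.

```python
def build_submit_command(script, jars):
    """Build a spark-submit argument list; --jars takes comma-separated paths."""
    command = ["spark-submit"]
    if jars:
        command += ["--jars", ",".join(jars)]
    command.append(script)
    return command


command = build_submit_command(
    "my_code.py",
    ["/full/path/to/spark-iforest-2.4.0.jar"],
)
print(" ".join(command))

# To actually launch it (requires spark-submit on PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when a jar path contains spaces.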
Also, there is a version mismatch, as @Alex Ott mentioned, but the error would be different in that case. Building IForest for pyspark 3.x is not very difficult, but if you don't want to get into it, you could downgrade the pyspark version.

Jupyter Notebook with Apache Spark (Kernel Error)

My objective is to use Jupyter Notebook (IPython) with Apache Spark, and I'm using Apache Toree to do this. I set the SPARK_HOME environment variable and configured the Apache Toree installation with Jupyter. Everything seemed fine.
When I run the command below, a Jupyter browser window opens: ipython notebook --profile=pyspark
When I choose Apache Toree - PySpark in the drop-down menu, I can't write code in my notebook and I get this view (Python 2 is OK):
The red button gives:
What's wrong? Please help.
Not really an answer, but if you're not set on Toree and just need a local Spark for learning and experimenting, you could download a copy of Spark, unzip it, and put this at the beginning of your notebook:
import os
import sys
os.environ['SPARK_HOME']="<path where you have extracted the spark file>"
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'python') )
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'bin') )
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.10.4-src.zip') )
from pyspark import SparkContext,SparkConf
from pyspark.sql import SQLContext, Row
import pyspark.sql.functions as sql
sc = SparkContext()
sqlContext = SQLContext(sc)
print(sc.version)
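The py4j-0.10.4-src.zip file name above is tied to one Spark release, so the snippet stops working when a different Spark version is unpacked. A possible variation, sketched here with hypothetical helper names, looks up whatever py4j zip ships under SPARK_HOME/python/lib instead of hard-coding it:

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the Spark python dir plus any bundled py4j source zip."""
    python_dir = os.path.join(spark_home, "python")
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    return [python_dir] + sorted(glob.glob(pattern))


def add_spark_to_sys_path(spark_home):
    """Append the Spark paths to sys.path, skipping ones already present."""
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.append(path)
```

Usage would be add_spark_to_sys_path(os.environ['SPARK_HOME']) before the pyspark imports.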

"cannot import name SparkSession"

I cannot import SparkSession from pyspark.sql, but I can import Row.
My spark-1.6.0-bin-hadoop2.6 is installed in a Docker container; the system is CentOS.
How can I solve the problem? It has troubled me for a long time.
You cannot use it because it is not present there: the Spark version you are using is 1.6, and SparkSession was introduced in 2.0.0.
You can see here: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
You can download Spark 2.0.0 from here : http://spark.apache.org/downloads.html
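For code that has to run on both sides of that boundary, a version check along these lines may be useful. It is a sketch with hypothetical helper names: it picks SparkSession on Spark 2.0 or later and falls back to SQLContext, the SQL entry point in 1.6, otherwise.

```python
def has_spark_session(spark_version):
    """SparkSession was introduced in 2.0.0, so check the major version."""
    major = int(spark_version.split(".")[0])
    return major >= 2


def sql_entry_point(sc):
    """Return a SparkSession on Spark 2.0+, otherwise a SQLContext."""
    if has_spark_session(sc.version):
        from pyspark.sql import SparkSession
        return SparkSession.builder.getOrCreate()
    from pyspark.sql import SQLContext
    return SQLContext(sc)


print(has_spark_session("1.6.0"))
print(has_spark_session("2.0.0"))
```

The pyspark imports sit inside sql_entry_point so that the version check itself can run without Spark installed.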
