Cannot access pyspark in EMR cluster jupyter notebook - python

I've created a spark cluster on EMR.
But I'm unable to access pyspark when I open it with a notebook.
Configuration:
Example:
from pyspark import SparkContext
I also cannot access sc, which I was under the impression would be available.
sc.list_packages()
NameError: name 'sc' is not defined
I feel like I'm missing something very basic here but I'm completely new to EMR and have spent a bunch of time on this already.
Are there any ideas I should try to debug this?

When I opened my notebook with "JupyterLab" instead of "Jupyter" all libraries were available.
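To debug this kind of problem, a small check of what the current kernel can see may help. The sketch below is only an illustration: the helper name describe_kernel is made up here. It also rests on an assumption the post does not confirm, namely that the notebook was attached to a plain Python kernel rather than a PySpark kernel that predefines sc and spark.

```python
import importlib.util
import sys


def describe_kernel(namespace):
    """Summarise what the current kernel can see.

    `namespace` is the notebook's global namespace, normally `globals()`.
    """
    return {
        # Path of the Python interpreter running this kernel.
        "python": sys.executable,
        # True if the pyspark package can be imported by this interpreter.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        # True if the kernel predefined the usual Spark entry points.
        "sc_defined": "sc" in namespace,
        "spark_defined": "spark" in namespace,
    }


print(describe_kernel(globals()))
```

If sc_defined is False, the kernel did not create a Spark context for you, and switching kernels (or opening the notebook the way described above) is worth trying before anything else.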

Related

start pyspark cluster with jupyter notebook

I'm building a PySpark app in a Jupyter notebook; so far I've been running it in standalone mode.
Now I have three virtual machines at my disposal with Spark installed on them, and I want to start PySpark on a cluster.
Here is my code to start it in standalone mode:
I'm using Spark 3.1.2 with Hadoop 3.2.
I've searched for ways to do this without success, and some articles say that PySpark doesn't work on clusters. If you know how I can change this code to launch my session on a cluster, please help.
Thank you.
You must have a cluster of some sort.
I use Kubernetes and https://github.com/bjornjorgensen/jlpyk8s
This way I have a notebook that I can run PySpark on interactively.
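If the three VMs run Spark's own standalone cluster manager instead of Kubernetes, pointing the session at the master might look roughly like this. This is a sketch under assumptions: the host name spark-master and the helper names are made up, and 7077 is the default port of a standalone master, which may differ on your setup.

```python
def standalone_master_url(host, port=7077):
    """Build the URL of a Spark standalone master (7077 is the default port)."""
    return f"spark://{host}:{port}"


def build_session(master_url, app_name="notebook-app"):
    """Create (or reuse) a SparkSession attached to the given master."""
    # Imported lazily so the URL helper works even where pyspark is absent.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master_url)
        .appName(app_name)
        .getOrCreate()
    )


# "spark-master" is a hypothetical host name for the VM running the master.
print(standalone_master_url("spark-master"))
```

In the notebook you would then call build_session(standalone_master_url("spark-master")) in place of the local session setup.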

Access databricks secret in custom python package imported into databricks notebook

We have a custom Python package hosted on a private Bitbucket repo, which can be installed via %pip install git+https://... on any Databricks notebook.
One of the functions performs a number of operations and then pushes data to another location, for which credentials are required. When the function is run, e.g., locally, this is handled via config files, but if the function runs on databricks we'd like to store these credentials in a databricks secret (scope).
However, trying to do something like
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
from pyspark.dbutils import DBUtils
dbutils = DBUtils(spark)
pw = dbutils.secrets.get(scope = <SCOPE>, key = <KEY>)
within the module (foo) doesn't work and causes the following error:
>>> import myPackage
>>> myPackage.foo.bar()
Py4JJavaError: An error occurred while calling o272.get.
: java.lang.SecurityException: Accessing a secret from Databricks Connect requires
a privileged secrets token. To obtain such a token, you can run the following
in a Databricks workspace notebook:
Given that we may want to run this function regularly as a scheduled job, creating a time-limited privileged token doesn't seem to be the way to go.
Is there a way to make this work or is there an alternative/better approach that we should be following instead?
For others who might run into the same problem in the future: using this code snippet within my function ended up working for me:
import IPython
dbutils = IPython.get_ipython().user_ns["dbutils"]
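Since the function also has to run locally with config files, the snippet above could be wrapped so that it falls back gracefully outside Databricks. This is only a sketch: get_dbutils, get_secret and the local_fallback parameter are names invented for illustration, not part of any Databricks API.

```python
def get_dbutils():
    """Return the notebook's `dbutils` object, or None outside a notebook."""
    try:
        import IPython
    except ImportError:
        return None
    shell = IPython.get_ipython()
    if shell is None:
        # Plain Python process: there is no notebook namespace to look in.
        return None
    return shell.user_ns.get("dbutils")


def get_secret(scope, key, local_fallback=None, dbutils=None):
    """Read a secret from Databricks if possible, else use the local value."""
    if dbutils is None:
        dbutils = get_dbutils()
    if dbutils is None:
        # Not on Databricks: use the value loaded from a local config file.
        return local_fallback
    return dbutils.secrets.get(scope=scope, key=key)
```

Passing dbutils explicitly from the notebook, for example myPackage.foo.bar(dbutils=dbutils), is another option that avoids the lookup entirely.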

Accessing delta lake through Pyspark on EMR notebooks

I have a question about using external libraries such as delta-core in AWS EMR notebooks. Currently there is no mechanism for installing the delta-core libraries through PyPI packages. The available options are:
Launching the PySpark kernel with the --packages option.
Changing the packages option from the Python script through the os configuration. With this approach the packages do not appear to be downloaded, and I still get an import error on import delta.tables.
Downloading the JARs manually, but there doesn't appear to be any way to do this on EMR notebooks.
Has anyone tried this out before?
You can download the JARs when creating the EMR cluster by using bootstrap scripts.
You can place the JARs in S3 and pass them to PySpark with the --jars option.
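In a notebook there is no command line to pass --jars on. One equivalent, assuming the notebook uses a sparkmagic PySpark kernel, is to set the spark.jars property through a %%configure cell. The sketch below only builds the JSON body for that cell; the helper name and the S3 path are hypothetical.

```python
import json


def configure_payload(jar_uris):
    """Return the JSON body for a `%%configure -f` cell that sets spark.jars."""
    # spark.jars takes a comma-separated list of jar locations.
    return json.dumps({"conf": {"spark.jars": ",".join(jar_uris)}})


# Hypothetical S3 location; replace it with wherever the jars were uploaded.
print(configure_payload(["s3://my-bucket/jars/delta-core.jar"]))
```

The printed JSON would go in a cell starting with %%configure -f, run before the Spark session starts.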

SageMaker notebook connected to EMR import custom Python module

I looked through similar questions but none of them solved my problem.
I have a SageMaker notebook instance and opened a SparkMagic PySpark notebook connected to an AWS EMR cluster. I also have a SageMaker repo called dsci-Python connected to this notebook.
The directory looks like:
/home/ec2-user/SageMaker/dsci-Python
/home/ec2-user/SageMaker/dsci-Python/pyspark_mle/datalake_data_object/SomeClass
/home/ec2-user/SageMaker/dsci-Python/Pyspark_playground.ipynb
There is an __init__.py under both the pyspark_mle and datalake_data_object directories, and I have no problem importing them in other environments.
When I run this code in Pyspark_playground.ipynb:
from pyspark_mle.datalake_data_object.SomeClass.SomeClass import Something
I get No module named 'pyspark_mle'.
I think this is an environment path issue.
The repo is on your Notebook Instance, whereas the PySpark kernel is executing code on the EMR cluster.
To access these local modules on the EMR cluster, you can clone the repository on the EMR cluster.
Also, SparkMagic has a useful magic send_to_spark which can be used to send data from the Notebook locally to the Spark kernel. https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Send%20local%20data%20to%20Spark.ipynb
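A different approach from cloning the repo, not mentioned in the answer above, is to zip the package and register it on the cluster with sc.addPyFile. The sketch below covers only the zipping step; zip_package is an illustrative helper, and the paths in the trailing comments are examples.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip the .py files of package_dir, keeping the package folder at the root."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Store paths relative to the parent so that
                    # `import pyspark_mle` resolves inside the archive.
                    zf.write(full, os.path.relpath(full, parent))
    return zip_path


# Example usage on the notebook instance (paths are illustrative):
# zip_package("/home/ec2-user/SageMaker/dsci-Python/pyspark_mle", "pyspark_mle.zip")
# After uploading the zip somewhere the cluster can read it, run on the cluster:
# sc.addPyFile("<location of pyspark_mle.zip>")
```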

Cannot save Spark dataframe to Google Cloud Storage from PySpark

I have a Spark dataframe that I'm trying to save to a Google Storage bucket with the line
df.write.format("com.databricks.spark.csv").save('gs://some-test-bucket-delete-me')
But Pyspark raises the following exception
Py4JJavaError: An error occurred while calling o55.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
What I've tried:
The typical solution to this exception is to make sure that the environment variable HADOOP_CLASSPATH points at the gcs-connector-latest-hadoop2.jar file, which it does. I've tried both the Hadoop version 1 and version 2 jars in case that was the problem. I've also tried explicitly pointing at the jar from within the Jupyter notebook with
sc._jsc.hadoopConfiguration().set("spark.driver.extraClassPath", "/home/****/hadoop-2.8.2/share/hadoop/common/lib/gcs-connector-latest-hadoop1.jar")
to no avail.
If I run hadoop fs -ls gs://some-test-bucket-delete-me from bash, the command returns without error, which is supposed to indicate that the Google Cloud Storage connector works, but for some reason I can't get this functionality to work in PySpark.
Things that may be important:
Spark Version 2.2.0
Python 3.6.1 :: Anaconda custom (64-bit)
I'm running PySpark locally
You should run gcloud init first
Then try df.write.csv('gs://some-test-bucket-delete-me/file_name')
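If the class is still not found, note that spark.driver.extraClassPath is a Spark property, not a Hadoop one, so setting it on hadoopConfiguration() after the context exists is unlikely to have any effect. A rough sketch of supplying the connector when the session is created is shown below. The helper names and the connector_jar argument are illustrative, and whether spark.jars alone is enough on your installation is an assumption.

```python
def gcs_spark_conf(connector_jar):
    """Spark properties that register the GCS connector for gs:// paths."""
    return {
        # Put the connector jar on the driver and executor classpaths.
        "spark.jars": connector_jar,
        # The spark.hadoop. prefix forwards these keys to the Hadoop config.
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    }


def build_session(conf, app_name="gcs-write-sketch"):
    """Create a SparkSession with the given properties applied up front."""
    # Imported lazily so gcs_spark_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

These properties only take effect if no Spark session is already running in the notebook, so restart the kernel before calling build_session.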
