I looked through similar questions but none of them solved my problem.
I have a SageMaker notebook instance and opened a SparkMagic PySpark notebook connected to an AWS EMR cluster. I also have a SageMaker repo named dsci-Python connected to this notebook instance.
Directory looks like:
/home/ec2-user/SageMaker/dsci-Python
/home/ec2-user/SageMaker/dsci-Python/pyspark_mle/datalake_data_object/SomeClass
/home/ec2-user/SageMaker/dsci-Python/Pyspark_playground.ipynb
There are __init__.py files under both the pyspark_mle and datalake_data_object directories, and I have no problem importing them in other environments.
When I run this code in Pyspark_playground.ipynb:
from pyspark_mle.datalake_data_object.SomeClass.SomeClass import Something
I get: No module named 'pyspark_mle'
I think this is an environment path thing.
The repo is on your Notebook Instance, whereas the PySpark kernel is executing code on the EMR cluster.
To access these local modules on the EMR cluster, you can clone the repository on the EMR cluster.
Also, SparkMagic has a useful magic, %%send_to_spark, which can be used to send data from the local notebook to the Spark kernel; a sketch follows. https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Send%20local%20data%20to%20Spark.ipynb
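For example, a rough sketch of shipping a small pandas DataFrame from the notebook instance to the EMR-side Spark session (the variable names are placeholders, each snippet goes in its own cell with the magic on the first line, and the exact magic options may differ across sparkmagic versions):

Cell 1 (runs locally on the notebook instance, not on EMR):
%%local
import pandas as pd
local_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

Cell 2 (sends local_df to the Spark session on the cluster as remote_df):
%%send_to_spark -i local_df -t df -n remote_df

Cell 3 (a normal PySpark cell executed on the cluster, where remote_df is now a Spark DataFrame):
remote_df.show()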
Related
How can I mock the existing Azure Databricks PySpark code of a project (written by others) and run it locally on a Windows machine/Anaconda to test and practice?
Is it possible to mock the code, or do I need to create a new cluster on Databricks for my own testing purposes?
How can I connect to the storage account, use the Databricks Utilities, etc.? I only have experience with Python & GCP; I just joined a Databricks project and need to run the cells one by one to see the results and modify them if required.
Thanks
You can test/run PySpark code from your IDE by installing PySpark on your local computer.
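As a minimal sketch (after pip install pyspark; the app name and data are placeholders), a local session needs no cluster at all:

from pyspark.sql import SparkSession

# Create a local Spark session on your own machine; no cluster is required.
spark = SparkSession.builder.master("local[*]").appName("local-test").getOrCreate()

# Exercise your transformations against a small in-memory DataFrame.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.groupBy("value").count().show()

spark.stop()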
Now, to use the Databricks Utilities (dbutils), you would in fact need a Databricks instance; they are not available locally. You can try Databricks Community Edition for free, but with some limitations.
Accessing a cloud storage account can be done locally from your computer or from your own Databricks instance. In both cases you will have to set up the endpoint of this storage account using its secrets.
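As a rough sketch of the Databricks-side setup for an Azure Data Lake Storage Gen2 account (the storage account name, secret scope, key name, and container are placeholders you would replace with your own):

# Runs inside a Databricks notebook: pull the account key from a secret scope
# and point Spark at the ADLS Gen2 endpoint of that storage account.
storage_account = "mystorageaccount"  # placeholder
account_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")  # placeholder scope/key

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

# Read from a container in that account.
df = spark.read.csv(
    f"abfss://my-container@{storage_account}.dfs.core.windows.net/path/to/data.csv",
    header=True,
)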
I have been studying Kubeflow and trying to grasp how to write my first hello-world program in it and run it locally on my Mac. I have kfp and kubectl installed locally on my machine. For testing purposes I want to write a simple pipeline with two functions: get_data() and add_data(). The documentation is overwhelming, and I am not clear how to program locally without Kubernetes installed, connect to a remote GCP machine and debug locally before creating a zip and uploading, or whether there is a way to execute the code locally and see how it runs on Google Cloud.
Currently you need Kubernetes to run KFP pipelines.
The easiest way to deploy KFP is through the Google Cloud Marketplace.
Alternatively, you can locally install Docker Desktop, which includes Kubernetes, and install the standalone version of KFP on it.
After that you can try this tutorial: Data passing in python components
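As a rough sketch (assuming the KFP v1 SDK; the function bodies are placeholders built around the get_data()/add_data() names from the question), a two-step function-based pipeline could look like this:

import kfp
from kfp.components import create_component_from_func

def get_data() -> int:
    # Placeholder: produce some data to pass downstream.
    return 10

def add_data(x: int) -> int:
    # Placeholder: transform the value received from the upstream step.
    return x + 5

# Wrap the plain Python functions as pipeline components.
get_data_op = create_component_from_func(get_data)
add_data_op = create_component_from_func(add_data)

@kfp.dsl.pipeline(name="hello-world-pipeline")
def my_pipeline():
    data_task = get_data_op()
    add_data_op(data_task.output)

if __name__ == "__main__":
    # Compile to a package you can upload through the KFP UI or submit via the client.
    kfp.compiler.Compiler().compile(my_pipeline, "pipeline.yaml")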
Actually, you can install a reduced version of Kubeflow with MiniKF. More info: https://www.kubeflow.org/docs/distributions/minikf/minikf-vagrant/
Check whether you are using Kubeflow Pipelines from the Google Cloud Marketplace or a custom Kubernetes cluster. If you are using the managed one, you can see your pipeline running through the Kubeflow Pipelines management console.
For details about how to create components based on functions, you can check https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/#getting-started-with-python-function-based-components
I have a question about using external libraries like delta-core with AWS EMR notebooks. Currently there isn't any mechanism for installing the delta-core libraries through PyPI packages. The available options include:
Launching the PySpark kernel with the --packages option.
Another option is to set the packages option in the Python script through the os environment configuration, but I don't see it downloading the packages, and I still get an import error on the delta.tables library.
A third option is to download the JARs manually, but it appears there isn't any way to do that from EMR notebooks.
Has anyone tried this out before?
You can download the JARs while creating the EMR cluster using bootstrap scripts.
You can place the JARs in S3 and pass them to PySpark with the --jars option.
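In an EMR notebook, one way to wire this up is the %%configure sparkmagic cell, roughly like this (the S3 path and Delta version are placeholders; the two spark.sql settings are the usual Delta Lake options):

%%configure -f
{
    "conf": {
        "spark.jars": "s3://your-bucket/jars/delta-core_2.12-1.0.0.jar",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    }
}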
I've created a spark cluster on EMR.
But I'm unable to access pyspark when I open it with a notebook.
Configuration:
Example:
from pyspark import SparkContext
I also cannot access sc, which I was under the impression would be available:
sc.list_packages()
NameError: name 'sc' is not defined
I feel like I'm missing something very basic here but I'm completely new to EMR and have spent a bunch of time on this already.
Does anyone have ideas for what I should try to debug this?
When I opened my notebook with "JupyterLab" instead of "Jupyter" all libraries were available.
I'm new to Spark. I installed Spark 2.3.0 in standalone mode on an Ubuntu 16.04.3 server. That runs well so far. Now I would like to start developing with PySpark, because I've got more experience using Python than Scala.
OK. Even after using Google for a while, I'm not sure how I should set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.
Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
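As a minimal sketch (the file name, app name, and data are placeholders), a script like this can be run unchanged in both contexts:

# my_job.py -- a tiny PySpark job; the master is supplied by spark-submit.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-job").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "value"])
df.groupBy("id").count().show()

spark.stop()

You would then run it locally with spark-submit --master local[4] my_job.py, or against your standalone cluster with spark-submit --master spark://<your-server>:7077 my_job.py.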
Give it a try and let us know how it goes.