Use Jupyter notebook without Databricks in Azure Data Factory? - python

I gather from the documentation that we can use Jupyter notebooks only with a Databricks Spark cluster.
Is there a way around this? Can I call a Jupyter notebook as an activity from ADF without a Databricks environment? I would like a simple way to call some Python code from ADF.
Thanks!

You can try the Custom Activity in ADF. The Custom Activity supports command-line execution, so you can use the command line to invoke a Python script.
There's also an example of using Python in a Custom Activity:
https://github.com/rawatsudhir1/ADFPythonCustomActivity
Hope it helps.
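For illustration, here is a minimal sketch of the kind of script the Custom Activity could invoke (e.g. with the command python main.py). The file and property names are assumptions: as far as I know, the v2 Custom Activity runtime drops an activity.json file into the working directory on the Azure Batch node, from which extended properties can be read.

    # main.py - a hypothetical script invoked by an ADF Custom Activity
    import json
    import os

    params = {}
    # activity.json is written by the Custom Activity runtime (assumption based on
    # the ADF v2 Custom Activity behaviour); fall back gracefully if it is absent.
    if os.path.exists("activity.json"):
        with open("activity.json") as f:
            activity = json.load(f)
        params = activity.get("typeProperties", {}).get("extendedProperties", {})

    print(f"Running with parameters: {params}")
    # ... your actual Python logic goes here ...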

Related

Databricks how to get output of the Notebook Jobs via API?

My Python notebooks log some data to stdout, and when I run these notebooks via the UI I can see the output inside the cells.
Now I am running the Python notebooks on Databricks via the API (/2.1/jobs/create and then /2.1/jobs/run-now) and would like to get their output. I tried both /2.1/jobs/runs/get and /2.1/jobs/runs/get-output, but neither of them includes the stdout of the notebook.
Is there any way to access the stdout of the notebook via the API?
P.S. I am aware of dbutils.notebook.exit() and will use it if it turns out not to be possible to get stdout.
It is not possible to get the default output of the Python code via the API. (I have seen that on a different cloud provider you can also get output from the logs.)
The 100% reliable solution is to use dbutils.notebook.exit().
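A minimal sketch of that approach, assuming a personal access token and the run_id returned by /2.1/jobs/run-now (the workspace URL and token are placeholders): the notebook's last cell passes its result to dbutils.notebook.exit(), and the caller reads it back from the notebook_output field of /2.1/jobs/runs/get-output.

    # Last cell of the notebook (dbutils is only available inside Databricks):
    # import json
    # dbutils.notebook.exit(json.dumps({"rows_processed": 123}))

    # Caller side: fetch the value returned by the notebook.
    import requests

    run_id = 1234567  # the run_id returned by /2.1/jobs/run-now
    resp = requests.get(
        "https://<workspace-url>/api/2.1/jobs/runs/get-output",
        headers={"Authorization": "Bearer <personal-access-token>"},
        params={"run_id": run_id},
    )
    resp.raise_for_status()
    notebook_output = resp.json()["notebook_output"]
    print(notebook_output["result"])  # the string passed to dbutils.notebook.exit()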

Start PySpark cluster with Jupyter notebook

I'm building a PySpark app using Jupyter notebook; so far I'm using it in standalone mode.
Now I have at my disposal 3 virtual machines with Spark on them, and I want to start PySpark on a cluster.
Here is my code to start it in standalone mode:
(I'm using Spark 3.1.2 with Hadoop 3.2.)
I've searched for ways to do this and haven't found one, and there are some articles saying that PySpark doesn't work on clusters, so if you know how I can change this code to launch my session on a cluster, please help.
Thank you.
You must have a cluster of some sort.
I use Kubernetes and https://github.com/bjornjorgensen/jlpyk8s
This way I have a notebook that I can run PySpark on interactively.
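If the three VMs run Spark's own standalone cluster manager rather than Kubernetes, the notebook's session can be pointed at the master directly. A minimal sketch, assuming the master was started with start-master.sh on <master-host> and listens on the default port 7077; the host name and resource sizes are placeholders:

    from pyspark.sql import SparkSession

    # Connect the notebook's session to the standalone cluster master instead of
    # the default local[*] mode.
    spark = (
        SparkSession.builder
        .master("spark://<master-host>:7077")
        .appName("jupyter-on-cluster")
        .config("spark.executor.memory", "2g")
        .config("spark.executor.cores", "2")
        .getOrCreate()
    )

    print(spark.sparkContext.master)  # should report spark://<master-host>:7077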

Jupyter notebook execution based on user entered automation parameters

I am trying to build a service that would allow users of a notebook to set automation parameters in a cell, such as the starting time at which the notebook should begin executing. The service would then take this input time, execute the notebook at the desired time, and store the executed notebook to S3. I have looked into papermill, but I believe there is no way to add automation parameters like a start execution time with it. Is there any way to achieve this? Or is there a way papermill can achieve this?
Papermill handles just parameterizing and executing notebooks, not scheduling. For that, you need another tool. You can build something yourself on top of Apache Airflow, which seems to be the most widespread scheduler for such a case. It has native support for Papermill (see here). Or you can use a ready-made tool like Paperboy.
To read in depth about scheduling notebooks, take a look at the article by Netflix.
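A minimal sketch of such an Airflow DAG, assuming the apache-airflow-providers-papermill package is installed; the notebook paths, schedule, and parameter names are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.papermill.operators.papermill import PapermillOperator

    # Run the parameterized notebook every day at 06:00 and write a dated copy
    # of the executed notebook next to the input.
    with DAG(
        dag_id="run_parameterized_notebook",
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 6 * * *",
        catchup=False,
    ) as dag:
        PapermillOperator(
            task_id="execute_notebook",
            input_nb="/notebooks/input.ipynb",
            output_nb="/notebooks/output-{{ ds }}.ipynb",
            parameters={"run_date": "{{ ds }}"},
        )

If the executed notebook should land in S3, papermill can also write the output notebook directly to an s3:// path (with boto3 installed).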
Take a look at the code here and here for a wrapper that will schedule notebook execution.
The shell scripts above create a VM, run the notebook, save the output, and destroy the instance.
In Google Cloud AI Platform Notebooks we provide a scheduling service which is in Beta now.

How do I automate my Jupyter notebook using Google Cloud?

I have code in a Jupyter notebook and I would like to schedule it to run daily on Google Cloud.
I already created a VM instance and am running my code there,
but I couldn't find any guide or video on how to implement daily runs.
So, how can I do that?
Google offers a product called AI Platform Notebooks. It bundles lots of useful things, such as many open-source frameworks, CI, etc. There is also a blog post by Google Cloud that explains the product in depth, which can be found here. I think you can use it to achieve what you want.
You can use cron to schedule the notebook on the VM. Take a look at nbconvert or papermill for executing notebooks.
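A minimal sketch of that approach: a small driver script that cron calls once a day, using papermill's Python API (the paths and cron time are placeholders):

    # run_daily.py - invoked by a crontab entry such as:
    #   0 6 * * * /usr/bin/python3 /home/user/run_daily.py
    import papermill as pm

    # Execute the notebook and save the executed copy, cells and outputs included.
    pm.execute_notebook(
        "/home/user/notebooks/daily_report.ipynb",
        "/home/user/notebooks/output/daily_report_out.ipynb",
        parameters={"run_mode": "scheduled"},
    )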
Other ways to schedule a Jupyter notebook include using a web-based application for notebook scheduling:
Mercury
Notebooker
Both of them can automatically execute the notebook in the background and share the resulting notebook as a website.

Papermill PySpark support

I'm looking for a way to easily execute parameterized runs of Jupyter notebooks, and I've found the Papermill project (https://github.com/nteract/papermill/).
This tool seems to match my requirements, but I can't find any reference to PySpark kernel support.
Are PySpark kernels supported by Papermill executions?
If so, is there some configuration to be done to connect it to the Spark cluster used by Jupyter?
Thanks in advance for the support, Mattia
Papermill will work with PySpark kernels, so long as they implement Jupyter's kernel spec.
Configuring your kernel will depend on the kernel in question. Usually these read from spark.conf and/or spark.properties files to configure cluster and launch-time settings for Spark.
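For what it's worth, papermill lets you pick the kernel explicitly when executing the notebook. A minimal sketch, assuming a PySpark kernel is registered under the name pysparkkernel (check jupyter kernelspec list for the actual name on your setup); any cluster connection details then come from that kernel's own Spark configuration:

    import papermill as pm

    # Execute the notebook with the PySpark kernel instead of the default kernel
    # recorded in the notebook's metadata; the kernel name is an assumption.
    pm.execute_notebook(
        "analysis.ipynb",
        "analysis_out.ipynb",
        kernel_name="pysparkkernel",
        parameters={"input_path": "hdfs:///data/events"},
    )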
