Jupyter notebooks as Kedro node - python

How can I use a Jupyter notebook as a node in a Kedro pipeline? This is different from converting functions from Jupyter notebooks into Kedro nodes; what I want is to use the full notebook as the node.

Although this is technically possible (via nbconvert, for example), it is strongly discouraged for several reasons, including the lack of testability and reproducibility of notebooks.
The best practice is usually to keep your pipeline node functions pure (where applicable), meaning that they don't incur any side effects. The way notebooks work generally contradicts that principle.
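To illustrate what a pure node function looks like in Kedro, here is a minimal sketch (the function, dataset names, and the pandas-style input are made up for the example):

from kedro.pipeline import Pipeline, node

def clean_data(raw_data):
    # Pure function: the output depends only on the input, no side effects
    return raw_data.dropna()

# Wrap the function as a node; Kedro loads "raw_data" and saves "clean_data"
# according to the catalog, so the function itself stays free of I/O
data_pipeline = Pipeline([
    node(clean_data, inputs="raw_data", outputs="clean_data"),
])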

AFAIK Kedro doesn't support this, but Ploomber does (disclaimer: I'm the author). Tasks can be notebooks, scripts, functions, or any combination of them, and you can run pipelines locally, on Airflow, or on Kubernetes (using Argo Workflows).
If you use a notebook or script as a pipeline task, Ploomber creates an executed copy whenever you run the pipeline. For example, you can create functions to pre-process your data and add a final task that trains a model in a notebook; this way you can leverage the ipynb format to generate reports for your model training procedure.
This is what a pipeline declaration looks like:
tasks:
  - source: notebook.ipynb
    product:
      nb: output.html
      data: output.csv
  - source: another.ipynb
    product:
      nb: another.html
      data: another.csv
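A minimal sketch of loading and running such a spec from Python, assuming it is saved as pipeline.yaml (Ploomber also ships a CLI, e.g. ploomber build):

from ploomber.spec import DAGSpec

# Load the YAML spec and convert it into a DAG object
dag = DAGSpec('pipeline.yaml').to_dag()

# Execute all tasks; executed notebook copies and declared products
# are written to the paths listed under "product"
dag.build()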
Resources:
Repository
Exporting to Airflow
Exporting to Kubernetes
Sample pipelines

Related

Package python code dependencies for remote execution on the fly

My situation is as follows. We have:
an "image" with all our dependencies in terms of software + our in-house python package
a "pod" in which such image is loaded on command (kubernetes pod)
a python project which has some uncommitted code of its own which leverages the in-house package
Please also assume you cannot work on the machine (or cluster) directly (say, via a remote SSH interpreter): the cluster is multi-tenant and we want to optimize it as much as possible, so no idle time between trials.
Also, for now, forget security issues; everything is locked down on our side, so no issues there.
We essentially want to "distribute" the workload remotely, i.e. a script.py that is unfeasible to run on our local machines, without being constrained by git commits, so we can do it "on the fly". This is necessary because all of the changes are experimental in nature (think ETL/pipeline kinds of analysis): we want to be able to experiment at scale but without being bound to git.
I tried dill but could not manage to make it work (probably due to the structure of the code). Ideally, I would like to replicate the concept MLeap applied to ML pipelines for Spark, but on a much smaller scale: basically packaging, but with little to no constraints.
What would be the preferred route for this use case?
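For reference, a minimal sketch of the kind of dill-based hand-off the question describes trying (module and function names are hypothetical; this only illustrates the pattern, not a recommended solution):

import dill

# On the local machine: serialize the experimental function and its references
dill.settings['recurse'] = True  # follow references so uncommitted helpers get pickled too

def experiment(df):
    # Uncommitted, experimental transformation built on top of the in-house package
    return df.dropna()

with open('payload.pkl', 'wb') as f:
    dill.dump(experiment, f)

# On the pod (which already has the in-house package installed): load and run it
with open('payload.pkl', 'rb') as f:
    func = dill.load(f)
# result = func(some_dataframe)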

Jupyter notebook execution based on user entered automation parameters

I am trying to build a service that would allow users of a notebook to set automation parameters in a cell, such as the starting time at which the notebook should begin executing. The service would then take this input time, execute the notebook at the desired time, and store the executed notebook in S3. I have looked into papermill, but I believe there is no way to add automation parameters like a start execution time using it. Is there any way to achieve this? Or can papermill achieve this?
Papermill handles just parameterizing and executing the notebooks, not scheduling. For that, you need to use another tool. You can build something yourself on top of Apache Airflow, which seems to be the most widespread scheduler for such a case; it has native support for Papermill (see here). Or you can use a ready-made tool like Paperboy.
To read in depth about scheduling notebooks, take a look at the article by Netflix.
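If you do roll your own, a minimal sketch of the execute-and-upload part could look like this (the bucket name, paths, and start_time are assumptions for the example; the actual scheduling would live in Airflow or whatever scheduler you pick rather than in a sleep):

import datetime
import time
import boto3
import papermill as pm

start_time = datetime.datetime(2024, 1, 1, 2, 0)  # user-entered start time

# Wait until the requested start time (a real service would delegate this to a scheduler)
delay = (start_time - datetime.datetime.now()).total_seconds()
if delay > 0:
    time.sleep(delay)

# Execute the notebook, injecting values into its "parameters" cell
pm.execute_notebook(
    'input.ipynb',
    'output.ipynb',
    parameters={'run_date': start_time.isoformat()},
)

# Store the executed notebook in S3
boto3.client('s3').upload_file('output.ipynb', 'my-bucket', 'runs/output.ipynb')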
Take a look at the code here and here for a wrapper that will schedule notebook execution.
The shell scripts above create a VM, run the notebook, save the output, and destroy the instance.
In Google Cloud AI Platform Notebooks we provide a scheduling service which is in Beta now.

Hiding / locking cells in Azure / Jupyter notebooks

I have made a few Jupyter notebooks to handle some workflows for clients, and I would like to deploy them in such a way that the clients cannot see or modify the code/functions I have written. As they have limited knowledge of Python, it is important, firstly, that they cannot access and modify the functions, and secondly, that the notebooks cannot be shared or sold on (although that is highly unlikely). They may run the notebooks in Anaconda / Jupyter Notebook / Lab, or alternatively via Azure Notebooks or some sort of JupyterHub setup.
The code mostly consists of functions that, when called, give an ipywidget display where the client can choose several options for displaying their data or running different calculations. So if they only saw the widgets, that would be optimal. I know it is possible to toggle cells or hide input, but this is easily worked around and they could get to the code. Is it possible to call the function using magics from a .py file that is stored somewhere they cannot access or modify? Are there any other methods?
Thanks
Maybe put your code in one or more external modules and then obfuscate the modules. See here: How to obfuscate Python code effectively?
You can't prevent the client from modifying the "launch code" which imports an external module and calls something in the module, but you can warn/ask them not to. Something like in this screenshot from https://github.com/flatironinstitute/mfa_jupyter_for_programming/blob/master/notebooks/Jupyter%20as%20a%20calculator.ipynb
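As a rough sketch of that pattern (module and function names are made up; the module would be distributed obfuscated or compiled, somewhere the client cannot easily read):

# client_tools.py -- shipped as an obfuscated/compiled module
import ipywidgets as widgets
from IPython.display import display

def show_dashboard(data):
    # Build the widget UI; the client only ever interacts with this
    choice = widgets.Dropdown(options=['summary', 'plot'], description='View:')
    output = widgets.Output()

    def _on_change(change):
        with output:
            output.clear_output()
            # The real calculation/plotting logic would run here
            print(f"Rendering the {change['new']} view of {len(data)} rows...")

    choice.observe(_on_change, names='value')
    display(choice, output)

The notebook itself then contains only the "launch code":

# from client_tools import show_dashboard
# show_dashboard(my_data)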

Hosting interactive jupyter notebook on private website

I currently run a personal website using Wordpress (but hosted on siteground) that is a set of engineering study guides. I would like to move towards making these study guides interactive (i.e. refreshing graphics based on sliders, doing basic calculations to indicate if a design works or not, so I need numpy). A friend recommended that I utilize Jupyter notebooks for this purpose, as you can both render LaTeX (which I'm currently using Mathjax with Wordpress to do), as well as have the types of interactive graphics I want using either Bokeh or Plotly.
While I've seen tutorials for sharing notebooks on specific servers, what I'm after is for others to be able to run my notebook in their browser (read-only), where the notebook is privately hosted.
I'm still not sure if Jupyter is the correct avenue to accomplish what I want, so I'm open to other suggestions (someone also recommended using Julia, but I've seen fewer examples of this).
I agree with your friend that Jupyter Notebooks is an excellent approach. And while it's by no means the only method to accomplish what you're after, I'm hard-pressed to come up with an immediate alternative that doesn't require significant work to set up.
I can think of three primary methods of using Jupyter Notebooks which suit your needs:
1. Azure Notebooks
Microsoft has a new service called Azure Notebooks, which is (currently) totally free.
Azure Notebooks boasts the complete functionality of Jupyter Notebooks, and in addition to Python, users can also program cells in R and F#. As for typical usage of the service, here's a snippet from their FAQ:
Jupyter (formerly IPython) is a multi-lingual REPL on steroids. This is a free service that provides Jupyter notebooks along with supporting packages for R, Python and F# as a service. This means you can just log in and get going, since no installation/setup is necessary. Typical usage includes schools/instruction, giving webinars, learning languages, sharing ideas, etc. The service is provided by the Python team @ Microsoft, which is part of the Data Group.
2. nbviewer
The top banner of the main Jupyter site contains a link to an application called nbviewer.
Evidently, you can create your markdown / Jupyter syntax as a discrete page somewhere else, feed the URL of your page into nbviewer, and it'll render it for you right there in the results. If I were going to use this, I would either:
Create a discrete WordPress page for my Jupyter syntax, then feed that into nbviewer; or, more likely
Use GitHub to host my Jupyter Notebook pages (mainly for posterity and version control, over the Gist option), and use the raw text link as the source to feed into nbviewer.
3. Hosting Your Own Solution
If you're technically savvy enough, I'd recommend this approach over nbviewer.
When you launch Jupyter Notebooks on your own machine, you access it through your browser using the default URL of http://localhost:8888. That means there must exist some mechanism to expose that port to external users, and allow them to have access to your Notebook, using the exact same interface. Two methods of doing so:
Using Jupyter Notebooks public server
Remotely accessing your normal Jupyter Notebook
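For the self-hosted route, the relevant settings live in jupyter_notebook_config.py (generated with jupyter notebook --generate-config). A minimal sketch, assuming the classic Notebook server; option names differ slightly for JupyterLab / Jupyter Server:

# jupyter_notebook_config.py -- minimal public-server setup (sketch)
c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces, not just localhost
c.NotebookApp.port = 8888           # port to expose (open it in your firewall)
c.NotebookApp.open_browser = False  # don't try to open a browser on the server
# Set a hashed password first with: jupyter notebook password
# c.NotebookApp.password = 'sha1:...'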
Hope that helps! I'm curious to know if any of these options works out for you.
The Iodide project and, subsequently, Pyodide are two projects that aim to allow this. They're still in development, but might be worth looking into.
You can try the Mercury framework. It allows you to transform notebooks into web applications (with interactive widgets). You need to add a YAML header to the beginning of the notebook; the widgets are generated based on that YAML. Your users can change widget values and click the Run button to execute the notebook with new inputs. You can decide whether to show or hide the code from your users. You can serve multiple notebooks with Mercury on a single server. It is based on Django, so it can be easily deployed on any server/cloud.
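A rough sketch of such a header, placed in the first raw cell of the notebook (the parameter name is made up, and the exact schema depends on the Mercury version, so check its docs):

---
title: Data explorer
description: Interactive report
show-code: False
params:
    n_samples:
        input: slider
        label: Number of samples
        value: 100
        min: 10
        max: 500
---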
(The original answer included screenshots of an example notebook, the web application generated from it, and the notebook gallery in Mercury.)

Creating projects based on jupyter notebooks

Having learned to program in Python with Jupyter Notebooks, I've found this to be a very practical way of analyzing data, writing simple programs and even interacting with databases.
However, if one is working on a bigger project (with a program and database that run automatically and should eventually be deployed), is it still possible/reasonable to run the code from notebooks? If so, do you have any advice on that?
Otherwise, I would of course resort to an IDE. Thanks!
Jupyter Notebooks are awesome tools, and great for certain types of collaborative development, and creating interactive web apps.
The purpose of Jupyter Notebooks is to provide a framework for combining rich text elements and code together (perfect for data science projects, tutorials or interactive dashboards). Jupyter notebooks also allow you to run code in multiple languages, which is a neat feature if your project requires it.
You probably wouldn't want to run a large scale production application from a Jupyter notebook, but you could certainly use them to help you develop it.
Check out this presentation for using Jupyter notebooks with multiple users.
https://www.slideshare.net/mbussonn/jupyter-a-platform-for-data-science-at-scale
If you specified the type of project you were considering, it might be easier to suggest alternatives to help complete it.
