I am looking to deploy an application that already has a trained caffemodel file, and I need to run it on a Spark cluster on AWS because of the GPU computation power required (20K patches per image). From my research it seems that the best way to do this is to use Spark to create an AWS cluster, which then runs a Docker image or an Amazon AMI to install the project dependencies automatically. Once everything is installed, the job can run on the cluster through Spark.

What I am wondering is how to do this from start to finish. I have seen several guides and have taken some online courses on Spark (BerkeleyX, Udemy) and Docker (Udemy); however, almost all the information I have seen consists of examples of how to implement the simplest application with little to no heavy software dependencies (CUDA drivers, cuDNN, Caffe, DIGITS). I have deployed Spark clusters on AWS and run simple examples that had no dependencies, but I have found little to no information on running an application that requires even a small dependency such as numpy. I would like to leverage the group to see if anyone has experience with such an implementation and can point me in the right direction or offer some help/suggestions.
Here are some things I have looked into:
Docker+NVIDIA: https://github.com/NVIDIA/nvidia-docker
bitfusion AMI: https://aws.amazon.com/marketplace/pp/B01DJ93C7Q/ref=sp_mpg_product_title?ie=UTF8&sr=0-13
My question is how to implement a small sample application from start to finish, with the Spark cluster being created automatically and the dependencies installed through either Docker or one of the AMIs above (a rough sketch of the kind of driver code I have in mind follows the notes below).
Notes:
Platform: Ubuntu 14.04
Language: Python
Dependencies: CUDA 7.5, caffenv, libcudnn4, NVIDIA Graphics Driver (346-352)
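To make the question concrete, here is roughly the kind of driver code I expect to run once the cluster and dependencies exist; the file names and S3 paths are placeholders, and the actual preprocessing/forward pass is left out:

    from pyspark import SparkContext

    def classify_partition(paths):
        # Import Caffe and load the trained model once per partition so the
        # network is not re-loaded for every single patch.
        import caffe
        net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)  # placeholder files
        for p in paths:
            # Real code would load the patch, preprocess it, and call net.forward();
            # a dummy label stands in for the prediction here.
            yield (p, "prediction")

    sc = SparkContext(appName="caffe-patch-classification")
    paths = sc.textFile("s3://my-bucket/patch_list.txt")   # ~20K patch paths per image (placeholder path)
    paths.mapPartitions(classify_partition).saveAsTextFile("s3://my-bucket/predictions")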
Related
I have Python scripts for automated currency trading, and I want to deploy them by running them in Jupyter Lab on a cloud instance. I have no experience with cloud computing or Linux, so I have been trying for weeks to get into this cloud computing mania, but I have found it very difficult to get started.
My goal is to set up a full-fledged Python infrastructure on a cloud instance from whichever provider, so that I can run my trading bot in the cloud. I want the instance to have the latest Python installation plus the typically needed scientific packages (such as NumPy, pandas, and others), in combination with a password-protected and Secure Sockets Layer (SSL)-encrypted Jupyter Lab server installation.
So far I have gotten nowhere. I am currently looking at the DigitalOcean website for setting up Jupyter Lab, but there are so many confusing terms.
What is Ubuntu or Debian? Is it like a sub-variant of the Linux operating system? Why do I only have two options here? I use neither of those operating systems; I use Windows on my laptop, and that is also where I developed my Python script. Do I need a Windows server or something?
How can I do this? I tried a lot of tutorials but I just got more confused.
Your question raises several more questions about what you are trying to accomplish. Are you just trying to run your script on cloud services? Or do you want to schedule a server to spin up and execute your code? Are you running a bot that trades for you? These are just some initial questions after reading your post.
Regarding your specific question about Ubuntu and Debian: they are indeed Linux distributions, and they are popular options for servers. You can set up a Windows server on AWS or another cloud provider, but because Linux distributions are much more popular, you will find far more documentation, articles, and Stack Overflow posts about running a Linux-based server.
If you just want to run a script in the cloud on demand, you would probably have a lot of success following Wayne's comment about PythonAnywhere or Google Colab.
If you want your own cloud server, I would suggest starting small and slow with a small or free-tier EC2 instance by following a tutorial such as this: https://dataschool.com/data-modeling-101/running-jupyter-notebook-on-an-ec2-server/. Alternatively, you could splurge for an AWS AMI, which will have much more compute power and come pre-configured.
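If you later want to script the instance launch itself rather than clicking through the console, a minimal boto3 sketch might look like this (the AMI ID and key pair name are placeholders you would replace with your own):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch a single free-tier t2.micro instance.
    # The AMI ID and key pair name below are placeholders, not real values.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder Ubuntu AMI for your region
        InstanceType="t2.micro",
        KeyName="my-key-pair",             # placeholder: an existing EC2 key pair
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])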
I had a similar problem, and the most suitable solution for me was using a Docker container for Jupyter notebooks. The instructions on how to install Docker can be found at https://docs.docker.com/engine/install/ubuntu/. There is a ready-to-use Docker image for the Jupyter notebook Python stack: docker pull jupyter/datascience-notebook. The Docker Compose files and some additional instructions can be found at https://github.com/stefanproell/jupyter-notebook-docker-compose/blob/master/README.md.
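If you would rather drive Docker from Python than from the shell, the same image can also be started with the Docker SDK for Python (a minimal sketch, assuming Docker and the docker package are already installed):

    import docker

    client = docker.from_env()

    # Pull and start the Jupyter datascience-notebook image,
    # publishing the notebook server's port 8888 on the host.
    container = client.containers.run(
        "jupyter/datascience-notebook",
        ports={"8888/tcp": 8888},
        detach=True,
    )
    print(container.name)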
Hi, I want to deploy Python notebooks to Azure Databricks using the REST API. Is there a way to achieve that? If you could provide me with any documentation or links to refer to, it would be a great help.
One of the first steps in designing a CI/CD pipeline is deciding on a code commit and branching strategy to manage the development and integration of new and updated code without adversely affecting the code currently in production. Part of this decision involves choosing a version control system to contain your code and facilitate the promotion of that code. Azure Databricks supports integrations with GitHub and Bitbucket, which allow you to commit notebooks to a git repository.
If your version control system is not among those supported through direct notebook integration, or if you want more flexibility and control than the self-service git integration, you can use the Databricks CLI to export notebooks and commit them from your local machine.
You can refer to this official documentation for code samples and steps.
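As a rough sketch of the REST route specifically, the Workspace API's import endpoint accepts a base64-encoded notebook; the host, token, and paths below are placeholders you would replace with your own:

    import base64
    import requests

    HOST = "https://<your-databricks-instance>"   # placeholder workspace URL
    TOKEN = "<personal-access-token>"             # placeholder token

    # Read a local notebook and base64-encode it for the import call.
    with open("my_notebook.ipynb", "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")

    resp = requests.post(
        HOST + "/api/2.0/workspace/import",
        headers={"Authorization": "Bearer " + TOKEN},
        json={
            "path": "/Users/me@example.com/my_notebook",  # placeholder workspace path
            "format": "JUPYTER",
            "content": content,
            "overwrite": True,
        },
    )
    resp.raise_for_status()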
I'm new to Spark. I installed Spark 2.3.0 in standalone mode on an Ubuntu 16.04.3 server. That runs well so far. Now I would like to start developing with PySpark, because I have more experience using Python than Scala.
OK. Even after using Google for a while, I'm not sure how I should set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.
Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
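For example, a minimal PySpark module you could write locally in Eclipse/PyDev and then ship with spark-submit might look like this (the master URL in the comment is a placeholder for your standalone server):

    # word_count.py - a tiny job just to test the submit workflow.
    # Run locally:         spark-submit word_count.py
    # Run on the cluster:  spark-submit --master spark://<your-server>:7077 word_count.py
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-test").getOrCreate()

    words = spark.sparkContext.parallelize(["spark", "pyspark", "spark"])
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.collect())

    spark.stop()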
Give it a try and let us know how it goes.
I am quite new to Hadoop and Apache Spark and am just beginning to try my hand at them. In order to do that, I am assuming I have to install a piece of software named Apache Spark on my machine.
I tried to create a local setup using a VM, but I am lost at this point. Is there any resource to help me configure and install Spark and Kafka on the same machine?
You are in luck: Chris Fregley (from the IBM Spark TC) has a project with Docker images for all of these things working together (you can see it at https://github.com/fluxcapacitor/pipeline/wiki ). For a "real" production deployment, you might want to look at deploying Spark on YARN or something similar - its deployment options are explained at http://spark.apache.org/docs/latest/cluster-overview.html , and integrating it with Kafka is covered in the dedicated Kafka integration guide at http://spark.apache.org/docs/latest/streaming-kafka-integration.html . Welcome to the wonderful world of Spark - I hope these help you get started :)
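To give a rough idea of what the Kafka integration guide walks through, a streaming word count over a local Kafka topic could look like the sketch below (the topic name and broker address are assumptions, and the matching spark-streaming-kafka package has to be added via --packages when submitting):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-wordcount")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # "test" topic on a local broker -- both values are assumptions for this sketch.
    stream = KafkaUtils.createDirectStream(
        ssc, ["test"], {"metadata.broker.list": "localhost:9092"}
    )

    counts = (stream.map(lambda kv: kv[1])              # keep the message value, drop the key
                    .flatMap(lambda line: line.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()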
I have a highly threaded application running on Amazon EC2. I would like to convert this application to a cluster on EC2. I would like to use StarCluster for this, as it makes the cluster easy to manage.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know if all of these libraries will work with StarCluster. Is there anything I need to keep in mind, like a dependency, when choosing a library, since I want the application to work with StarCluster?
Basically, StarCluster is a tool to help you manage your cluster. It can add/remove nodes, set them up within a placement and security group, register them with Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work.
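For example, a minimal plugin that installs an extra Python package on every node could look roughly like the sketch below (the package is just an example; see the StarCluster plugin docs for the full interface and how to register the plugin in your cluster template):

    # numpy_installer.py - a sketch of a StarCluster plugin; place it where your
    # StarCluster config can find it and reference it in the cluster template.
    from starcluster.clustersetup import ClusterSetup


    class NumpyInstaller(ClusterSetup):
        def run(self, nodes, master, user, user_shell, volumes):
            # Install the dependency on every node in the cluster.
            for node in nodes:
                node.ssh.execute("pip install numpy")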
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster then there shouldn't be any problems.
If it's as an actor (making it directly do the computation by launching it with a command you would craft yourself and parallelizing its execution among the cluster) then I don't know. It might be possible, but StarCluster was not designed for it. We can read from the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.