How to develop with PySpark locally and run on a Spark cluster? - python

I'm new to Spark. I installed Spark 2.3.0 in standalone mode on an Ubuntu 16.04.3 server, and that runs well so far. Now I would like to start developing with PySpark because I have more experience with Python than with Scala.
OK. Even after using Google for a while I'm not sure how to set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.

Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
Give it a try and let us know how it goes.
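To make that concrete, here is a minimal sketch of a PySpark job together with the two spark-submit calls. The file name, input path and master URL are placeholders for your own setup; 7077 is just the default port of a standalone master.

    # wordcount.py - a minimal PySpark job (names and paths are placeholders)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # read a text file that the cluster can reach (shared path, HDFS, etc.)
    lines = spark.read.text("input.txt")
    words = lines.selectExpr("explode(split(value, ' ')) AS word")
    words.groupBy("word").count().show()

    spark.stop()

    # run locally:             spark-submit --master local[*] wordcount.py
    # submit to the standalone
    # cluster on your server:  spark-submit --master spark://<your-server>:7077 wordcount.py

You develop and test with the local[*] master on your laptop and only switch the --master argument when you want the job to run on the cluster.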

Related

Setting up Jupyter Lab for Python scripts on a cloud provider as a beginner

I have Python scripts for automated currency trading and I want to deploy them by running them in Jupyter Lab on a cloud instance. I have no experience with cloud computing or Linux, so I have been trying for weeks to get into this cloud computing mania, but I have found it very difficult to participate in it.
My goal is to set up a full-fledged Python infrastructure on a cloud instance from whichever provider so that I can run my trading bot on the cloud.
I want to set up a cloud instance on whichever provider that has the latest Python installation plus the typically needed scientific packages (such as NumPy, pandas and others), in combination with a password-protected and Secure Sockets Layer (SSL)-encrypted Jupyter Lab server installation.
So far I have gotten nowhere. I am currently looking at the DigitalOcean website for setting Jupyter Lab up, but there are so many confusing terms.
What is Ubuntu or Debian? Is it like a sub-variant of the Linux operating system? Why do I have only two options here? I use neither of these operating systems; I use the Windows operating system on my laptop, which is also where I developed my Python script. Do I need a Windows server or something?
How can I do this? I tried a lot of tutorials but I just got more confused.
Your question raises several more about what you are trying to accomplish. Are you just trying to run your script on cloud services? Or do you want to schedule a server to spin up and execute your code? Are you running a bot that trades for you? These are just some initial questions after reading your post.
Regarding your specific question about Ubuntu and Debian: they are indeed Linux distributions, and they are popular options for servers. You can set up a Windows server on AWS or another cloud provider, but Linux distributions, being much more popular, have far more documentation, articles, and Stack Overflow posts around running a Linux-based server.
If you just want to run a script on a cloud on demand, you would probably have a lot of success following Wayne's comment around PythonAnywhere or Google Colab.
If you want your own cloud server, I would suggest starting small and slow with a small or free-tier EC2 instance by following a tutorial such as this: https://dataschool.com/data-modeling-101/running-jupyter-notebook-on-an-ec2-server/ Alternatively, you could splurge for an AWS AMI which has much more compute power and comes preconfigured.
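If you go the EC2 route, the core of it boils down to installing and starting Jupyter Lab yourself; a very rough sketch is below (the linked tutorial covers the security group, password and HTTPS parts, which you should not skip):

    # on the EC2 instance
    pip3 install jupyterlab
    jupyter lab --no-browser --ip=0.0.0.0 --port=8888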
I had a similar problem and the most suitable solution for me is using a Docker container for Jupyter notebooks. The instructions on how to install Docker can be found at https://docs.docker.com/engine/install/ubuntu/ There is a ready-to-use Docker image for the Jupyter notebook Python stack: docker pull jupyter/datascience-notebook. The Docker Compose files and some additional instructions can be found at https://github.com/stefanproell/jupyter-notebook-docker-compose/blob/master/README.md.
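For orientation, the basic commands look roughly like this; the port and volume mapping are just example values, not anything taken from the linked README:

    # pull the ready-made image and start Jupyter Lab in a container
    docker pull jupyter/datascience-notebook
    docker run -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/datascience-notebook
    # the container prints a URL with a login token; open it in your browser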

Python libraries in Azure

I have a requirement where I have to use the Python libraries I created on my machine, in the cloud, such that whenever any new dataset is loaded, this Python library has to start acting on it.
How can I do this? Where will I put the dataset and the python codes in Azure?
Thanks,
Shyam
There is more than one possibility to do that:
1. Run your Python code on Azure Web Apps for Containers, a Linux-based, managed application platform.
2. Azure Functions allows running Python code in a serverless environment that scales on demand.
3. Use a managed Hadoop and Spark cluster with Azure HDInsight, suitable for enterprise-grade production workloads.
4. Use a friction-free data science environment that contains popular tools for data exploration, modeling, and development activities.
5. Azure Kubernetes Service (AKS) offers a fully managed Kubernetes cluster to run your Python apps and services, as well as any other Docker container. Easily integrate with other Azure services using Open Service Broker for Azure.
6. Use your favorite Linux distribution, such as Ubuntu, CentOS, or Debian, or Windows Server. Run your code with scalable Azure Virtual Machines and Virtual Machine Scale Sets.
7. Run your own Python data science experiments using a fully managed Jupyter notebook with Azure Notebooks.
The easiest and fastest way to run your code is option 1: create a web app and a WebJob in there.

How can I run a simple python script hosted in the cloud on a specific schedule?

Say I have a file "main.py" and I just want it to run at 10-minute intervals, but not on my computer. The only external libraries the file uses are mysql.connector and requests (installed via pip).
Things I've tried:
PythonAnywhere - free tier is too limiting (need to connect to external DB)
AWS Lambda - Only supports up to Python 2.7, converted my code but still had issues
Google Cloud Platform + Heroku - can only find tutorials covering deploying applications, I think these could do what I'm looking for but I can't figure out how.
Thanks!
I'd start by taking a look at this question/answer that I asked previously on unix.stackexchange - I went with an AWS Red Hat installation and it was free to use.
Once you've decided on your VM, you can SSH onto your server using any SSH client and upload your Python script. A personal preference is this application.
If you need to update the Python version on the server, you can do this by installing the required Python RPMs. A quick Google search should return the yum (or whichever RPM management system you're using) repository for the required RPMs.
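As a rough sketch, on a Red Hat style system that could look something like the lines below; package names differ between distributions and versions, so treat these as placeholders:

    sudo yum install -y python3 python3-pip
    pip3 install --user mysql-connector-python requests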
Once you've installed the version of Python that you need, I'd suggest looking into crontab, which can be used to schedule jobs. You can set up a cron job to run every 10 minutes which will call your script.
See this site for more information on how to use crontab.
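A sketch of what the crontab entry could look like; the interpreter and script paths are assumptions, so adjust them to wherever you uploaded main.py:

    # open the current user's crontab for editing
    crontab -e

    # run main.py every 10 minutes and append its output to a log file
    */10 * * * * /usr/bin/python3 /home/ec2-user/main.py >> /home/ec2-user/main.log 2>&1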
This sounds like a perfect use case for AWS Lambda which supports Python. You can invoke your Lambda on a schedule using Scheduled Events.
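The handler itself stays tiny; here is a minimal sketch where the function body is just a placeholder for your existing main.py logic:

    # lambda_function.py - minimal handler invoked by the scheduled event
    def lambda_handler(event, context):
        # call into your existing logic here (MySQL connection, requests calls, ...)
        print("Triggered at", event.get("time"))
        return {"status": "ok"}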
I see that you tried Lambda and it didn't work out for you, which is too bad, as that seems like the easiest route. You could also launch an EC2 instance and use user data to schedule a cron job when the instance starts.
Another option would be an Elastic Beanstalk worker with a cron.yml that defines your schedule. Elastic Beanstalk supports Python 3.4.
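For reference, the schedule file (named cron.yaml in the AWS docs) could look roughly like the sketch below; the URL must match a POST endpoint that your worker application actually serves:

    version: 1
    cron:
      - name: "run-main"
        url: "/run-main"
        schedule: "*/10 * * * *"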
Update: AWS now supports Python 3.6. Just select Python 3.6 from the runtime environments when configuring.

How do I install Apache Spark and get it up and running with Kafka?

I am quite new to Hadoop and Apache Spark; I am a beginner just trying my hand at them. Now I would like to try out Apache Spark, and in order to do that I am assuming I have to install a piece of software named Apache Spark on my machine.
I tried to create a local machine using a VM, but I am lost at this point. Is there any resource to help me configure and install Spark and Kafka on the same machine?
You are in luck: Chris Fregly (from the IBM Spark TC) has a project with Docker images for all of these things working together (you can see it at https://github.com/fluxcapacitor/pipeline/wiki ). For a "real" production deployment, you might want to look at deploying Spark on YARN or something similar - its deployment options are explained at http://spark.apache.org/docs/latest/cluster-overview.html and integrating it with Kafka is covered in the dedicated Kafka integration guide at http://spark.apache.org/docs/latest/streaming-kafka-integration.html . Welcome to the wonderful world of Spark, I hope these help you get started :)
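Once you do have Spark and Kafka running, a very small PySpark Structured Streaming job that reads from Kafka can look roughly like the sketch below (this assumes a recent Spark version; the broker address and topic name are placeholders, and the matching spark-sql-kafka package has to be on the classpath when you submit the job):

    # kafka_read.py - sketch of consuming a Kafka topic from PySpark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

    # subscribe to a topic; adjust the broker address and topic name
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "my-topic")
          .load())

    # Kafka delivers key/value as binary, so cast the payload to a string
    messages = df.selectExpr("CAST(value AS STRING) AS message")

    # print incoming messages to the console while the stream runs
    query = messages.writeStream.format("console").start()
    query.awaitTermination()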

Running a Python script on Microsoft Azure

I'll soon have a Linux virtual machine on Microsoft Azure. I need to run some data mining/graph analysis algorithms on Azure because I work with big data. I don't want to use the Azure Machine Learning stuff; I just want to run my own Python code. What are the steps? If needed, how can I install Python libraries on Azure?
There are no additional steps compared to your own local server. Linux on Azure is a standard Linux machine. If you are looking for a step-by-step how-to on running a Linux VM on Azure, just search on azure.com and you will find it. I think you will not have any problems even without documentation. The Azure portal is very simple to use, and there is also a CLI tool for Linux, Mac and Windows. You just need to run a Linux VM and SSH into it. Nothing more. If you need some help, just write here.
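For example, once the VM is up, installing your libraries is just the usual pip workflow over SSH; the user name, address and package list below are only placeholders:

    # connect to the VM
    ssh azureuser@<your-vm-address>

    # on an Ubuntu image: install pip and the libraries your code needs
    sudo apt-get update && sudo apt-get install -y python3-pip
    pip3 install --user numpy pandas networkx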
