I have a Spark batch processing job (basically, model training) that I execute with spark-submit on an AWS EMR cluster. Now I want to be able to launch this job each day at a specific time.
What is the standard way to do it?
Should I change the code and add the scheduling inside it? Or is there a way to schedule the spark-submit job itself?
Or should I make it a Spark Streaming job executed every 24 hours? (Though I am interested in a specific time slot, i.e. between 11:00 PM and midnight.)
Cron is the more traditional choice, and it works well; another option is Rundeck.
Use Rundeck as an easier-to-manage and more secure replacement for cron, or as a replacement for legacy tools like Control-M or HP Operations Orchestration. Rundeck gives your users a simple web interface (GUI or API) to go to for both on-demand and scheduled operations tasks.
What is Rundeck?
Rundeck is open source software that helps you automate routine operational procedures in data center or cloud environments. Rundeck provides a number of features that will alleviate time-consuming grunt work and make it easy for you to scale up your automation efforts and create self service for others. Teams can collaborate to share how processes are automated while others are given trust to view operational activity or execute tasks.
Rundeck allows you to run tasks on any number of nodes from a web-based or command-line interface. Rundeck also includes other features that make it easy to scale up your automation efforts including: access control, workflow building, scheduling, logging, and integration with external sources for node and option data.
If you are using Linux, you can set up a cron job to call the spark-submit script:
http://kvz.io/blog/2007/07/29/schedule-tasks-on-linux-using-crontab/
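For example, an entry like the one below (added with crontab -e on the EMR master node) would launch the job every night at 11:00 PM; the spark-submit options, script path, and log path are placeholders for whatever your setup uses:

```
# Run the training job every day at 23:00; paths and options are placeholders.
0 23 * * * /usr/bin/spark-submit --master yarn --deploy-mode cluster /home/hadoop/jobs/train_model.py >> /home/hadoop/logs/train_model.log 2>&1
```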
Related
I'm using Apache Beam 2.40.0 in Python.
It has 10 different runners for jobs.
How do you choose which one to use? The DirectRunner seems like the easiest one to set up, but the docs claim it does not focus on efficient execution.
DirectRunner runs the pipeline on a single machine and is hardly used in production. There is also an InteractiveRunner wrapper for the Python SDK that mostly uses the DirectRunner in an IPython/notebook environment to execute small pipelines interactively for learning and prototyping.
To process large amounts of data in a distributed manner, the runners with the best support (documentation- and support-wise) and the most popularity are currently:
DataflowRunner: if you want to use Google Cloud services and want a more serverless experience without worrying about setting up your clusters.
FlinkRunner/SparkRunner: if you prefer setting up your own EMR solutions or using existing services that allow you to provision clusters with optional components (there are also serverless options for these runners out there).
As for the other runners, you may refer to the runners section of the roadmap for the latest updates.
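The runner itself is just a pipeline option, so the same pipeline code can target any of them. A minimal sketch in Python, where the GCP project, region, and bucket are placeholders rather than real resources:

```python
# A minimal sketch: the runner is chosen through pipeline options, so the same
# pipeline code can target DirectRunner locally or DataflowRunner in the cloud.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local prototyping: everything runs in-process on one machine.
local_options = PipelineOptions(runner="DirectRunner")

# The same pipeline handed to Google Cloud Dataflow instead.
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder project id
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
)

with beam.Pipeline(options=local_options) as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )
```

Switching from local prototyping to Dataflow is then just a matter of passing dataflow_options (or the equivalent --runner command-line flag) instead of local_options.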
In essence, I want to distribute Airflow DAGs with the DaskExecutor, but I have failed to find detailed information on how it actually works. Some of my tasks are supposed to save files, and for that purpose there is a network share accessible from every server running a dask-worker. If my guess is correct, some tasks within one DAG could be executed on different servers with different paths to this share. The question is: how do I specify the path within a task depending on the worker and OS type executing it? If there is a better solution, or another take on this problem, I would be glad if you shared it.
UPD:
Following the comment by @mdurant: the network share is a Redis server on a machine running Debian, so on that machine it appears as a directory; the other nodes predominantly run Windows Server and mount the share as a network drive.
For now, the most appropriate solution I can see is to use the Dask framework directly from within each high-load task, as that guarantees the computations are distributed across the cluster.
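One way a task can work out the correct location at run time is to map the worker's OS to its mount point; a rough sketch, where the Debian path and the Windows drive letter are placeholders for the real mounts:

```python
# Rough sketch: resolve the share location inside the task, based on the OS of
# whichever worker picks it up. The mount points below are placeholders.
import platform
from pathlib import Path

SHARE_LOCATIONS = {
    "Linux": Path("/mnt/shared"),   # placeholder: directory on the Debian node
    "Windows": Path("Z:/"),         # placeholder: mapped network drive
}

def shared_path(*parts: str) -> Path:
    """Return a path on the network share valid for the current worker."""
    root = SHARE_LOCATIONS.get(platform.system())
    if root is None:
        raise RuntimeError(f"No share mount configured for {platform.system()}")
    return root.joinpath(*parts)

def save_result(data: bytes, filename: str) -> None:
    # Called from inside an Airflow task or a Dask-submitted function, so
    # platform.system() is evaluated on the worker that actually runs it.
    target = shared_path("results", filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
```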
I am new to Azure Pipelines. I am trying to create a pipeline for deploying a simple Python application.
But I get the error:
No hosted parallelism has been purchased or granted
As I understand it, Microsoft disabled the free grant of parallel jobs for public projects and for certain private projects in new organizations. But what if I don't need parallel jobs? I just need jobs to run one after the other. Can I turn off the use of parallel jobs?
I chose the "Python package" template and set the environment variable "python.version" to only one version, "3.7", but that doesn't help. I still get the same error:
No hosted parallelism has been purchased or granted
The free tier supports one parallel job, which means only one job can run at a time.
See Microsoft's definition of a parallel job below:
What is a parallel job?
When you define a pipeline, you can define it as a collection of jobs. When a pipeline runs, you can run multiple jobs as part of that pipeline. Each running job consumes a parallel job that runs on an agent. When there aren't enough parallel jobs available for your organization, the jobs are queued up and run one after the other.
As you rightly mentioned, the free grant is temporarily disabled by Microsoft for private projects. However, you can ask to be granted the free parallel job; it can take 2-3 days for you to get access.
To request the free grant for public or private projects, submit a request here
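Note that arranging the jobs sequentially in YAML does not remove the requirement: each job still consumes the single hosted parallel job in turn while it runs, so you need at least one grant (or a self-hosted agent). A minimal sketch with placeholder job names:

```yaml
# Minimal sketch: two jobs chained with dependsOn so they never run in parallel.
# Each job still needs the single hosted parallel job while it runs.
trigger:
- main

pool:
  vmImage: 'ubuntu-latest'

jobs:
- job: Build
  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.7'
  - script: python -m pip install -r requirements.txt
    displayName: 'Install dependencies'

- job: Test
  dependsOn: Build   # runs only after Build finishes, never alongside it
  steps:
  - script: python -m pytest
    displayName: 'Run tests'
```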
I wrote a Python script to send data from a local DB via REST to Kafka.
My goal: I would like this script to run indefinitely, either by restarting at set intervals (e.g. every 5 minutes) or whenever the DB gets new entries. I assume the set-interval approach would be good enough, and easier and safer.
Someone suggested either running it via a cron job and using a monitoring tool, or doing it with Jenkins (which he considered better).
My setting: I am not a DevOps engineer and would like to know about the possibilities and risks of setting this script up. It would be no trouble to recreate the script in Java if that improves the situation.
My question: I did try to learn what Jenkins is about, and I think I understood the CI and CD part, but I don't see how it could help me with my goal. Can someone with experience on this topic elaborate?
If you would suggest a cron job, what are common methods or tools for monitoring such a setup? I think the main risks are failing to send the data due to connection issues between the local machine and REST or the local DB, or the job not being started properly at the specified time.
Jobs can be scheduled at regular intervals in Jenkins just like with cron; in fact, it uses the same syntax. What's nice about scheduling the job via Jenkins is that it's very easy to have it send an email if the job exits with a non-zero return code. I've moved all of my cron jobs into Jenkins and it's working well. So by running it via Jenkins you cover both the execution side and the monitoring side at the same time.
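For illustration, a minimal declarative Jenkinsfile along those lines might look like the sketch below; the schedule, script path, and email address are placeholders, and the mail step assumes SMTP is configured in Jenkins:

```groovy
// Minimal declarative Jenkinsfile sketch; schedule, script path, and address
// are placeholders, and the mail step assumes Jenkins has SMTP configured.
pipeline {
    agent any

    triggers {
        // Same field layout as crontab; 'H' spreads the load within the interval.
        cron('H/5 * * * *')   // or e.g. '0 23 * * *' for once a day at 11 PM
    }

    stages {
        stage('Send data to Kafka') {
            steps {
                sh 'python3 /opt/jobs/send_to_kafka.py'
            }
        }
    }

    post {
        failure {
            mail to: 'you@example.com',
                 subject: "Kafka export failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
                 body: "Console output: ${env.BUILD_URL}"
        }
    }
}
```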
I have a highly threaded application running on Amazon EC2. I would like to convert this application to run on a cluster on EC2, and I would like to use StarCluster for this, as it makes the cluster easy to manage.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know whether all of these libraries will work with StarCluster. Is there anything I need to keep in mind, such as a dependency, when choosing a library, given that I want the application to work with StarCluster?
Basically, StarCluster is a tool to help you manage your cluster. It can add/remove nodes, place them within a placement group and security group, register them with Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work.
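For example, a typical management-side workflow looks roughly like this, assuming a standard StarCluster config with Open Grid Scheduler enabled (the cluster name and script are placeholders):

```
starcluster start mycluster                            # provision and configure the EC2 nodes
starcluster put mycluster run_job.sh /home/sgeadmin/   # copy your script to the cluster
starcluster sshmaster mycluster                        # log in to the master node
qsub -cwd run_job.sh                                   # on the master: submit work to Open Grid Scheduler
```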
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster, then there shouldn't be any problems.
If it's as an actor (making it directly do the computation by launching it with a command you craft yourself and parallelizing its execution across the cluster), then I don't know. It might be possible, but StarCluster was not designed for that. We can read from the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.