How do I decide which runner to use for Apache Beam? - python

I'm using Apache Beam 2.40.0 in python.
It has 10 different runners for jobs.
How do you choose which one to use? The DirectRunner seems like the easiest one to set up, but the docs claim it does not focus on efficient execution.

DirectRunner runs the pipeline on a single machine. It's hardly used in production. There is also an InteractiveRunner wrapper for the Python SDK that mostly uses DirectRunner in an IPython/notebook environment to execute small pipelines interactively, for learning and prototyping.
To process large amounts of data in a distributed manner, the runners with the best support (documentation- and community-wise) and the most popularity are currently the following (switching between them is just a pipeline-option change, as sketched after this list):
DataflowRunner: if you want to use Google Cloud services and want a more serverless experience without worrying about setting up your clusters.
FlinkRunner/SparkRunner: if you prefer setting up your own EMR-style solutions or using existing services that allow you to provision clusters with optional components (there are also serverless options for these runners out there).
As for other runners, you may refer to the runners section of the roadmap for the newest update.
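In the Python SDK, the pipeline code itself stays the same across runners; only the pipeline options change. A minimal sketch (the GCP project, region and bucket values are placeholders):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # DirectRunner for local prototyping:
    local_opts = PipelineOptions(runner="DirectRunner")

    # DataflowRunner for a managed run on Google Cloud (placeholder values):
    dataflow_opts = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    # The same pipeline runs under either set of options.
    with beam.Pipeline(options=local_opts) as p:
        (p
         | beam.Create(["hello", "world"])
         | beam.Map(print))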

Related

Package python code dependencies for remote execution on the fly

My situation is as follows; we have:
an "image" with all our dependencies in terms of software + our in-house python package
a "pod" in which such image is loaded on command (kubernetes pod)
a python project which has some uncommitted code of its own which leverages the in-house package
Also, please assume you cannot work on the machine (or cluster) directly (say, via a remote SSH interpreter). The cluster is multi-tenant, and we want to optimize its use as much as possible, so there should be no idle time between trials.
Also, for now, set security issues aside; everything is locked down on our side, so there are no issues there.
We essentially want to "distribute" the workload remotely (i.e., a script.py that is unfeasible to run on our local machines) without being constrained by git commits, so we can do it "on the fly". This is necessary because all of the changes are experimental in nature (think ETL/pipeline kinds of analysis): we want to be able to experiment at scale but with no bounds with regard to git.
I tried dill but could not manage to make it work (probably due to the structure of the code). Ideally, I would like to replicate the concept MLeap applies to ML pipelines for Spark, but on a much smaller scale: basically packaging, but with little to no constraints.
What would be the preferred route for this use case?
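For reference, a minimal dill round trip for shipping a function "on the fly" looks roughly like this (the module, function and file names are hypothetical, and whether it works depends heavily on what the function closes over):

    import dill  # pip install dill

    def experiment(df):
        # Uncommitted, experimental logic that uses the in-house package,
        # which is assumed to already be installed inside the pod's image.
        from inhouse_pkg import transforms  # hypothetical module
        return transforms.clean(df)

    # Serialize the function by value on the local machine...
    with open("payload.pkl", "wb") as f:
        dill.dump(experiment, f)

    # ...ship payload.pkl to the pod, then on the remote side:
    #   with open("payload.pkl", "rb") as f:
    #       experiment = dill.load(f)
    #   result = experiment(some_dataframe)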

How can I call python scripts on a remote server from z/OS?

As part of migrating batch jobs (which use EXEC PGM) to another language (Python here), I am facing a challenge with cross-server connections.
We are aiming to migrate a few of our mainframe batch jobs' COBOL programs to Python. In this process, some batch jobs will be fully controlled by schedulers and the programs will be rewritten as Python scripts. But some mainframe programs will remain intact and not be migrated to Python for now. As we are targeting a partial migration for now, some mainframe batch jobs need to call Python scripts in the cloud. The challenge I am facing is how to call Python scripts from mainframe batch jobs.
I'm assuming in this answer the COBOL applications run on the z/OS operating system on your mainframe, but if that assumption is not correct, please post a follow-up.
Cschneid has a great answer: just run the Python scripts on your mainframe. Python for z/OS is available for download free of charge from Rocket Software here:
https://www.rocketsoftware.com/zos-open-source
You can optionally purchase Python support on z/OS from Rocket Software if you wish. (All Linux distributions for IBM Z machines also include Python, typically supported by the Linux distributor.) Python running on IBM Z can directly operate on IBM Z-based data stores/databases, including well protected, z/OS-encrypted data sets. And you can quite easily create and manage hybrid cloud architectures that include IBM Z resources across all operating systems. That'd be the best arrangement all around since otherwise you're likely to have operational and management issues. You don't have to look very far to find real world instances of organizations that have suffered major, hugely business impactful batch scheduling problems that have completely wrecked their payment processes, for example. (Relatedly, Python is not an enterprise job scheduler.)
OK, that said, if you're still going to proceed down this (probably unwise) path this way, then here are some other options in no particular order:
Configure z/OS Management Facility (included as a base, supported feature of z/OS), and use its authorized REST APIs to submit jobs. Details are available here (z/OS 2.4 assumed, but this feature is available in all currently supported z/OS releases and even prior):
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.4.0/com.ibm.zos.v2r4.izua700/IZUHPINFO_API_RESTJOBS.htm
Make sure you take reasonable, appropriate steps to secure this job submission path since it's quite powerful.
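For illustration, submitting a job through the z/OSMF jobs REST interface from the Python side looks roughly like this (the host, credentials and JCL are placeholders; check the documentation linked above for the exact headers, TLS setup and status-polling calls):

    import requests  # pip install requests

    ZOSMF_HOST = "https://zosmf.example.com"   # placeholder host
    AUTH = ("MYUSER", "MYPASSWD")              # placeholder credentials

    # Placeholder JCL for the job you want z/OS to run.
    jcl = (
        "//CALLJOB  JOB (ACCT),'TRIGGERED JOB',CLASS=A,MSGCLASS=X\n"
        "//STEP1    EXEC PGM=MYPROG\n"
    )

    resp = requests.put(
        f"{ZOSMF_HOST}/zosmf/restjobs/jobs",
        data=jcl,
        headers={
            "Content-Type": "text/plain",
            "X-CSRF-ZOSMF-HEADER": "",  # CSRF header expected by z/OSMF
        },
        auth=AUTH,
        verify=True,  # point this at your installation's CA bundle
    )
    resp.raise_for_status()
    print(resp.json().get("jobid"))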
Equip your z/OS installation with IBM's z/OS Connect Enterprise Edition software product, create the REST APIs you need (both easy and powerful), and invoke them from Python. More information on z/OS Connect EE is available here:
https://www.ibm.com/us-en/marketplace/connect-enterprise-edition
If you have MQ for z/OS, then go grab the MQ client, send an appropriately formatted MQ message from Python to an appropriately configured MQ queue on z/OS, and invoke/trigger your programs that way. (MQ Advanced for z/OS is recommended for Advanced Message Security.) The MQ clients are free for unlimited use when connecting to all currently IBM supported, licensed versions of MQ and MQ Advanced for z/OS. Recent releases of MQ and MQ Advanced for z/OS also support REST APIs (and JSON payloads), so you can format your messages that way now. MQ clients are available for download here:
https://developer.ibm.com/messaging/mq-downloads/
At least some of the choices I'm providing on this list can be combined with MQ, which provides assured messaging -- which is quite helpful if you're trying to make this all work robustly.
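As a rough sketch of the client side in Python, using the pymqi bindings on top of the MQ client libraries (the queue manager, channel, host and queue names are all placeholders):

    import pymqi  # pip install pymqi; requires the MQ client libraries

    qmgr_name = "QM1"                       # placeholder queue manager
    channel = "DEV.APP.SVRCONN"             # placeholder channel
    conn_info = "mqhost.example.com(1414)"  # placeholder host(port)
    queue_name = "ZOS.TRIGGER.QUEUE"        # placeholder queue

    qmgr = pymqi.connect(qmgr_name, channel, conn_info)
    queue = pymqi.Queue(qmgr, queue_name)
    # An appropriately formatted message that the z/OS side is configured
    # to recognize and use to trigger the target program:
    queue.put(b'{"action": "RUN", "job": "NIGHTLY01"}')
    queue.close()
    qmgr.disconnect()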
Go find out what enterprise job scheduler your mainframe has installed (it probably has one), and use its authorized APIs to schedule and to run programs. For example, IBM Z Workload Scheduler provides authorized REST APIs. Refer to this documentation for an introduction:
https://www.ibm.com/support/knowledgecenter/en/SSRULV_9.5.0/com.ibm.tivoli.itws.doc_9.5/common/src_dgd/awsddrestapi.htm
If you click through to the samples you'll find some Python sample code.
....And there are lots of other possible ways, so if for some reason you don't like any of these choices, please post a follow-up.
Cschneid has another reasonable answer: Dovetailed's Co:Z Toolkit ("z/OS Hybrid Batch"). Here are some more possibilities, in no particular order:
The z/OS Client Web Enablement Toolkit, an IBM-supported feature included in the base z/OS operating system. This toolkit allows you to call a REST API from practically any program on z/OS. A COBOL sample is available here:
https://github.com/IBM/zOS-Client-Web-Enablement-Toolkit
z/OS Connect Enterprise Edition, which is bidirectional.
The enterprise job scheduler often installed and hosted on z/OS typically can trigger and manage "remote" tasks on other systems. IBM Z Workload Scheduler (for example) certainly can, and there's a whole manual discussing the topic here:
https://www.ibm.com/support/knowledgecenter/SSRULV_9.5.0/com.ibm.tivoli.itws.doc_9.5/eqqlwmst.pdf
Remote Procedure Calls (RPC), per IETF RFCs 1831 and 1832. If you're using a COBOL program with RPC you'd call the C interfaces, a minor bit of mixed language programming.
Dovetailed Technologies' hybrid batch is another product that allows you to execute code residing on remote servers as steps in a batch job, similar to the solutions in the answers posted by Timothy Sipples and Kevin McKenzie.
Without knowing more, this question is impossible to answer.
However, generically speaking, you can issue USS commands from batch using BPXBATCH. So you could install something like curl or wget from Rocket Software, and then call Python via a REST API or something similar on the cloud end, built with Django or Flask (sketched below). If you really wanted to do something horrible, you could write a shell script that would ssh into the cloud system and issue a command on the remote system.
However, and I realize you probably don't have much say over this, I'd also point to Timothy Sipples' answer, and say this isn't a good idea, and it's going to be fragile. You'll need multiple such scripts, because you'll need to submit work, and then come back later and get the results, and behave appropriately based on the results. You're going to have to build all sorts of error handling capabilities into these batch jobs/shell scripts.
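That said, if you do go the curl-plus-REST route mentioned above, the cloud end can start out as small as a Flask app along these lines (the endpoint, script name and parameters are hypothetical; it blocks on the script and has no error handling, so treat it as a sketch only):

    import subprocess
    from flask import Flask, jsonify, request  # pip install flask

    app = Flask(__name__)

    @app.route("/run-batch", methods=["POST"])
    def run_batch():
        # Parameters sent by the mainframe-side curl/wget call (assumed JSON).
        params = request.get_json(silent=True) or {}
        # Run the Python batch script synchronously; a real setup would queue
        # the work and expose a second endpoint to poll for results.
        result = subprocess.run(
            ["python3", "script.py", "--mode", params.get("mode", "default")],
            capture_output=True, text=True,
        )
        return jsonify({"returncode": result.returncode,
                        "stdout": result.stdout[-1000:]})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)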

Cloud computing: when is the effort for using docker justified?

I have a very long-running Python script that cannot be parallelized (it is single-threaded, running with only one process).
This job runs several days on my own computer.
It does not benefit from any GPU support.
For analysis and parameter optimization, I expect to run this job several times, perhaps 10 - 20 times, each time with different parameters.
As my own existing computer resources are limited, I would like to use a powerful cloud CPU for this task.
If I realize that the cloud CPUs are really much faster than my own CPU, I probably will migrate the job from e.g. AWS EC2 (Amazon Web Services) to a cheaper flat rate solution like Hetzner.
With this use case: does it make sense to put my setup in a docker container?
Or does this task not justify the effort for engineering and taking the learning curve in docker / docker compose etc.?
Well, you certainly don't need to use Docker for that, for reasons I will list here:
Docker is justified when you want an encapsulated environment, to obtain security and, above all, controlled access between container processes.
Another common attraction of Docker is the continuous integration/deployment aspect of container development: it is really useful to build Docker containers for scaling with Kubernetes or for easy deployment with Jenkins, for example.
You can read more about it here: https://www.linode.com/docs/applications/containers/when-and-why-to-use-docker/
Now, since your application does not need any of that, Docker is not the way to go. Another suggestion: if you need to run the job multiple times with only the parameters differing between executions, it would really pay to run those executions in parallel to take advantage of a powerful CPU.
Docker will make it much easier for you to run your application in the cloud, in the sense that you will be able to switch machines much more easily. In addition, it will make it cheaper to run, because you won't have to spend a lot of time spinning up your VMs, and you can cheaply and easily spin VMs down, knowing that only a docker run, and no specific python or yum install steps, is needed to bootstrap the program.
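The effort involved is modest. A minimal Dockerfile sketch, assuming the job is a single script.py with its dependencies listed in requirements.txt (both names are assumptions):

    # Dockerfile (minimal sketch)
    FROM python:3.11-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY script.py .
    # Parameters can be appended at `docker run` time.
    ENTRYPOINT ["python", "script.py"]

You would then build once (docker build -t longjob .) and run it on any machine with docker run --rm longjob --param1 value1, which is what makes switching between EC2 and a flat-rate provider cheap.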

How to schedule the execution of spark-submit to specific time

I have Spark batch processing code (basically, model training) that I execute with spark-submit on an AWS EMR cluster. Now I want to be able to launch this job each day at a specific time.
What is the standard way to do it?
Should I change the code and add the scheduling inside the code? Or is there any way to schedule spark-submit job?
Or maybe should I make it as a Spark Streaming job executed every 24 hours? (though I am interested in a specific time slot, i.e. between 11:00pm and 12pm)
Cron is more traditional... and although it is good, another option is Rundeck.
Use Rundeck as an easier to manage and more secure replacement for Cron or as a replacement for legacy tools like Control-M or HP Operations Orchestration. Rundeck gives your users a simple web interface (GUI or API) to go to for both on-demand and scheduled operations tasks.
What is Rundeck?
Rundeck is open source software that helps you automate routine operational procedures in data center or cloud environments. Rundeck provides a number of features that will alleviate time-consuming grunt work and make it easy for you to scale up your automation efforts and create self service for others. Teams can collaborate to share how processes are automated while others are given trust to view operational activity or execute tasks.
Rundeck allows you to run tasks on any number of nodes from a web-based or command-line interface. Rundeck also includes other features that make it easy to scale up your automation efforts including: access control, workflow building, scheduling, logging, and integration with external sources for node and option data.
If you are using Linux, you can set up a cron job to call the spark-submit script:
http://kvz.io/blog/2007/07/29/schedule-tasks-on-linux-using-crontab/
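For example, a crontab entry along these lines would launch the job every day at 11:00 PM (the paths, arguments and log location are placeholders):

    # Edit the crontab with `crontab -e` (e.g. on the EMR master node):
    # minute hour day-of-month month day-of-week command
    0 23 * * * /usr/bin/spark-submit --master yarn --deploy-mode cluster /home/hadoop/jobs/train_model.py >> /home/hadoop/logs/train_model.log 2>&1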

Are all cluster computing libraries compatible with starcluster?

I have a highly threaded application running on Amazon EC2. I would like to convert this application to run on a cluster on EC2, and I would like to use StarCluster for this, as it's easy to manage the cluster.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know whether all of these libraries will work with StarCluster. Is there anything I need to keep in mind, like a dependency, when choosing a library, given that I want the application to work with StarCluster?
Basically, StarCluster is a tool to help you manage your cluster. It can add/remove nodes, set them up within a placement group and security group, register them with Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work.
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster then there shouldn't be any problems.
If it's as an actor (making it directly do the computation by launching it with a command you would craft yourself and parallelizing its execution among the cluster) then I don't know. It might be possible, but StarCluster was not designed for it. We can read from the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.
