Hadoop YARN vs mapreduce

I have installed Hadoop 2.6.0 on my machine and started all the services.
Compared with my old version, this version does not start the JobTracker and TaskTracker daemons; instead it starts the NodeManager and ResourceManager.
Question:
I believe this version of Hadoop uses YARN for running the jobs. Can't I run a map reduce job anymore?
Should I write a job that's tailored to fit the YARN resource manager and application manager?
Is there a sample Python job that I can submit?

I believe this version of Hadoop uses YARN for running the jobs. Can't I run a map reduce job anymore?
It's still fine to run MapReduce jobs. YARN is a rearchitecture of the cluster computing internals of a Hadoop cluster, but that rearchitecture maintained public API compatibility with classic Hadoop 1.x MapReduce. The Apache Hadoop documentation on Apache Hadoop NextGen MapReduce (YARN) discusses the rearchitecture in more detail. There is a relevant quote at the end of the document:
MRV2 maintains API compatibility with previous stable release (hadoop-1.x). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.
Should I write a job that's tailored to fit the YARN resource manager and application manager?
If you're already accustomed to writing MapReduce jobs or higher-level abstractions like Pig scripts and Hive queries, then you don't need to change anything you're doing as the end user. API compatibility as per above means that all of those things continue to work fine. You are welcome to write custom distributed applications that specifically target the YARN framework, but this is more advanced usage that isn't required if you just want to stick to Hadoop 1.x-style data processing jobs. The Apache Hadoop documentation contains a page on Writing YARN Applications if you're interested in exploring this.
Is there a sample Python job that I can submit?
I recommend taking a look at the Apache Hadoop documentation on Hadoop Streaming. Hadoop Streaming allows you to write MapReduce jobs based simply on reading stdin and writing to stdout. This is a very general paradigm, so it means you can code in pretty much anything you want, including Python.
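As a rough illustration of the stdin/stdout model, here is a minimal word-count sketch. The file name wordcount.py, the function names, and the "map"/"reduce" command-line argument are my own choices for this example, not anything Hadoop requires. It assumes Streaming's default behavior of passing tab-separated key/value lines and sorting them by key before the reducer sees them.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive records that share a key.

    Assumes the input is already sorted by key, which is what Hadoop
    Streaming provides between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Run as "python wordcount.py map" or "python wordcount.py reduce".
# The role argument is a convention of this sketch only.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

You would then hand the script to the streaming jar that ships with your distribution (its location varies by installation), along the lines of: hadoop jar <path to hadoop-streaming jar> -files wordcount.py -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" -input <input dir> -output <output dir>. You can also try it locally first with a shell pipeline such as: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce.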
In general, it sounds like you would benefit from exploring the Apache Hadoop documentation site. There is a lot of helpful information there.

Related

How do I decide which runner to use for Apache Beam?

I'm using Apache Beam 2.40.0 in Python.
It has 10 different runners for jobs.
How do you choose which one to use? The DirectRunner seems like the easiest one to set up, but the docs claim it does not focus on efficient execution.

DirectRunner runs the pipeline on a single machine. It's rarely used in production. There is also an InteractiveRunner wrapper for the Python SDK that mostly uses DirectRunner in an IPython/Notebook environment to execute small pipelines interactively for learning and prototyping.
To process large amounts of data in a distributed manner, the most popular runners with the best documentation and support are currently:
DataflowRunner: if you want to use Google Cloud services and want a more serverless experience without worrying about setting up your clusters.
FlinkRunner/SparkRunner: if you prefer setting up your own EMR solutions or using existing services that allow you to provision clusters with optional components (there are also serverless options for these runners out there).
As for other runners, you may refer to the runners section of the roadmap for the newest update.

Pydoop vs Mrjob for image processing on Hadoop

I want to process images (most probably large ones) on the Hadoop platform, but I am confused about which of the two interfaces above to choose, especially as someone who is still a beginner in Hadoop. The images need to be split into blocks so that processing can be distributed among the worker machines, and the blocks need to be merged after processing is completed.
It's known that Pydoop has better access to the Hadoop API while mrjob has powerful utilities for executing jobs. Which one is more suitable for this kind of work?

I would actually suggest pyspark because it natively supports binary files.
For image processing, you can try TensorFlowOnSpark

How can I call python scripts on a remote server from z/OS?

As part of migrating batch jobs (which use EXEC PGM) to another language (Python here), I am facing a challenge with cross-server connections.
We are planning to migrate a few of our mainframe batch COBOL programs to Python. In this process, some batch jobs will be fully controlled by schedulers and their programs will be rewritten as Python scripts. But some mainframe programs will remain intact and will not be migrated to Python for now. Because we are targeting a partial migration for now, some mainframe batch jobs need to call Python scripts in the cloud. My challenge is how to call Python scripts from mainframe batch jobs.

I'm assuming in this answer the COBOL applications run on the z/OS operating system on your mainframe, but if that assumption is not correct, please post a follow-up.
Cschneid has a great answer: just run the Python scripts on your mainframe. Python for z/OS is available for download free of charge from Rocket Software here:
https://www.rocketsoftware.com/zos-open-source
You can optionally purchase Python support on z/OS from Rocket Software if you wish. (All Linux distributions for IBM Z machines also include Python, typically supported by the Linux distributor.) Python running on IBM Z can directly operate on IBM Z-based data stores/databases, including well protected, z/OS-encrypted data sets. And you can quite easily create and manage hybrid cloud architectures that include IBM Z resources across all operating systems. That'd be the best arrangement all around since otherwise you're likely to have operational and management issues. You don't have to look very far to find real world instances of organizations that have suffered major, hugely business impactful batch scheduling problems that have completely wrecked their payment processes, for example. (Relatedly, Python is not an enterprise job scheduler.)
OK, that said, if you're still going to proceed down this (probably unwise) path, then here are some other options in no particular order:
Configure z/OS Management Facility (a supported base feature included in z/OS), and use its authorized REST APIs to submit jobs. Details are available here (z/OS 2.4 assumed, but this feature is available in all currently supported z/OS releases and even prior):
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.4.0/com.ibm.zos.v2r4.izua700/IZUHPINFO_API_RESTJOBS.htm
Make sure you take reasonable, appropriate steps to secure this job submission path since it's quite powerful.
Equip your z/OS installation with IBM's z/OS Connect Enterprise Edition software product, create the REST APIs you need (both easy and powerful), and invoke them from Python. More information on z/OS Connect EE is available here:
https://www.ibm.com/us-en/marketplace/connect-enterprise-edition
If you have MQ for z/OS, then go grab the MQ client, send an appropriately formatted MQ message from Python to an appropriately configured MQ queue on z/OS, and invoke/trigger your programs that way. (MQ Advanced for z/OS is recommended for Advanced Message Security.) The MQ clients are free for unlimited use when connecting to all currently IBM supported, licensed versions of MQ and MQ Advanced for z/OS. Recent releases of MQ and MQ Advanced for z/OS also support REST APIs (and JSON payloads), so you can format your messages that way now. MQ clients are available for download here:
https://developer.ibm.com/messaging/mq-downloads/
At least some of the choices I'm providing on this list can be combined with MQ, which provides assured messaging -- which is quite helpful if you're trying to make this all work robustly.
Go find out what enterprise job scheduler your mainframe has installed (it probably has one), and use its authorized APIs to schedule and to run programs. For example, IBM Z Workload Scheduler provides authorized REST APIs. Refer to this documentation for an introduction:
https://www.ibm.com/support/knowledgecenter/en/SSRULV_9.5.0/com.ibm.tivoli.itws.doc_9.5/common/src_dgd/awsddrestapi.htm
If you click through to the samples you'll find some Python sample code.
And there are lots of other possible ways, so if for some reason you don't like any of these choices, please post a follow-up.
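To make the z/OSMF option a little more concrete, here is a minimal sketch of how a Python script might build the "submit a job" request against the z/OSMF REST jobs interface. The host name, credentials, and sample JCL are placeholders I made up, and the function name is hypothetical. The sketch only constructs the request and does not send it, since authentication and TLS setup depend on your installation; check the linked documentation for the headers your release expects.

```python
import base64
import urllib.request

# Hypothetical JCL for illustration only; replace with a real job.
SAMPLE_JCL = (
    "//HELLO    JOB (ACCT),'SAMPLE',CLASS=A\n"
    "//STEP1    EXEC PGM=IEFBR14\n"
)


def build_submit_request(host, user, password, jcl):
    """Build (but do not send) a z/OSMF 'submit job' request.

    Assumes HTTP basic authentication and a plain-text JCL body sent
    with PUT to /zosmf/restjobs/jobs.
    """
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url=f"https://{host}/zosmf/restjobs/jobs",
        data=jcl.encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "text/plain",
            "Authorization": f"Basic {token}",
            # z/OSMF uses this header for cross-site request forgery checks.
            "X-CSRF-ZOSMF-HEADER": "true",
        },
    )
```

Sending it would then be a matter of passing the request to urllib.request.urlopen and reading the JSON job document that comes back, ideally with credentials pulled from a secrets store rather than hard-coded.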

Cschneid has another reasonable answer: Dovetailed's Co:Z Toolkit ("z/OS Hybrid Batch"). Here are some more possibilities, in no particular order:
The z/OS Client Web Enablement Toolkit, an included, IBM supported feature in the base z/OS operating system. This toolkit allows you to call a REST API from practically any program on z/OS. A COBOL sample is available here:
https://github.com/IBM/zOS-Client-Web-Enablement-Toolkit
z/OS Connect Enterprise Edition, which is bidirectional.
The enterprise job scheduler often installed and hosted on z/OS typically can trigger and manage "remote" tasks on other systems. IBM Z Workload Scheduler (for example) certainly can, and there's a whole manual discussing the topic here:
https://www.ibm.com/support/knowledgecenter/SSRULV_9.5.0/com.ibm.tivoli.itws.doc_9.5/eqqlwmst.pdf
Remote Procedure Calls (RPC), per IETF RFCs 1831 and 1832. If you're using a COBOL program with RPC you'd call the C interfaces, a minor bit of mixed language programming.

Dovetailed Technologies hybrid batch is another product that allows you to execute code residing on remote servers as steps in a batch job, similar to the solutions in the answers posted by @TimothySipples and @KevinMcKenzie.

Without knowing more, this question is impossible to answer.
However, generically speaking, you can issue USS commands from batch, using BPXBATCH. So, you could install something like curl or wget from Rocket Software, and then call Python via a REST API, or something similar on the cloud end, built in Django or Flask. If you really wanted to do something horrible, you could write a shell script that would ssh in to the cloud system, and issue a command on the remote system.
However, and I realize you probably don't have much say over this, I'd also point to Timothy Sipples' answer, and say this isn't a good idea, and it's going to be fragile. You'll need multiple such scripts, because you'll need to submit work, and then come back later and get the results, and behave appropriately based on the results. You're going to have to build all sorts of error handling capabilities into these batch jobs/shell scripts.
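For a sense of what the cloud end of the REST approach might look like, here is a minimal sketch. To keep it dependency-free it uses the standard library's http.server rather than Django or Flask, and the /run/<job> URL scheme, the JOBS allow-list, and the port argument are all invented for this example. It has no authentication or TLS, so treat it as a starting point only.

```python
import json
import subprocess
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical allow-list mapping a job name to the command it runs.
# Only commands listed here can be triggered remotely.
JOBS = {"hello": [sys.executable, "-c", "print('hello from python')"]}


class JobHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect paths of the form /run/<job name>.
        parts = self.path.strip("/").split("/")
        if len(parts) != 2 or parts[0] != "run" or parts[1] not in JOBS:
            self._reply(404, {"error": "unknown job"})
            return
        # Run the script synchronously and report its outcome to the caller.
        result = subprocess.run(JOBS[parts[1]], capture_output=True, text=True)
        status = 200 if result.returncode == 0 else 500
        self._reply(status, {
            "returncode": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr,
        })

    def _reply(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Suppress the default per-request logging to stderr.


# Start with "python jobserver.py <port>"; the file name is hypothetical.
if __name__ == "__main__" and len(sys.argv) > 1:
    HTTPServer(("0.0.0.0", int(sys.argv[1])), JobHandler).serve_forever()
```

The batch job would then issue something like curl -X POST http://<cloud host>:<port>/run/hello from BPXBATCH and inspect the HTTP status and the returncode field. Because this sketch runs the script synchronously, it only suits short jobs; longer ones need the submit-then-poll pattern described above.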

How to Connect Python to Spark Session and Keep RDDs Alive

How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's the experience I have so far. This is generally to be expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also exits the Spark Context it was using.
So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)
I started Pyspark. In a separate script in Visual Studio, I used this code for SparkContext. I loaded a text file into an RDD named RDDFromFilename. But I couldn't access that RDD in the Pyspark shell once the script had run.
How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts? Particularly in Windows?

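There is no built-in solution in Spark: an RDD belongs to the SparkContext that created it, and a separate script gets its own context. You may consider: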
To keep persistent RDDs:
Apache Ignite
To keep persistent shared context:
spark-jobserver
livy - https://github.com/cloudera/livy
mist - https://github.com/Hydrospheredata/mist
To share a context with notebooks:
Apache Zeppelin
I think that out of these only Zeppelin officially supports Windows.

For those who may follow: I've recently discovered SnappyData.
SnappyData is still fairly young and there's a bit of a learning curve, but what it promises to do is make a persistent mutable SQL collection that can be shared between multiple Spark jobs and can be accessed natively as RDDs and DataFrames. It has a job server that you can dump concurrent jobs onto.
It's essentially a combination of a GemFire in-memory database with Spark clusters that are local in the same JVM, so (when I get decent at managing it) I can do large tasks without single-machine bottlenecks to pipe data in and out of Spark, or I can even do live data manipulation while another Spark program is running on the same data.
I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.

Are all cluster computing libraries compatible with starcluster?

I have a highly threaded application running on Amazon EC2. I would like to convert this application to run on a cluster on EC2. I would like to use StarCluster for this as it's easy to manage the cluster.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know if all the libraries will work with StarCluster. Is there anything I need to keep in mind, like a dependency, when choosing a library, since I want the application to work with StarCluster?

Basically, StarCluster is a tool to help you manage your cluster. It can add/remove nodes, set them within a placement and security group, register them into Open Grid Scheduler and more. You can also easily create commands and plugins to help you in your work.
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster then there shouldn't be any problems.
If it's as an actor (making it directly do the computation by launching it with a command you would craft yourself and parallelizing its execution among the cluster) then I don't know. It might be possible, but StarCluster was not designed for it. We can read from the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.
