I want to process images (most likely large ones) on the Hadoop platform, but I am confused about which of the two interfaces, Pydoop or mrjob, to choose, especially as a beginner in Hadoop. The workflow requires splitting the images into blocks, distributing the processing among worker machines, and merging the blocks after processing is completed.
Pydoop is known to have better access to the Hadoop API, while mrjob has powerful utilities for executing jobs. Which one is better suited to this kind of work?
I would actually suggest PySpark because it natively supports binary files.
For image processing, you can also try TensorFlowOnSpark.
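As a minimal sketch of the binary-file support (the input path and the processing step are assumptions, not part of the original answer), you can load whole image files as byte strings and process them in parallel:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-processing").getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (path, file contents as bytes) pairs, so each image
# arrives on a worker as a single byte string.
images = sc.binaryFiles("hdfs:///data/images/*.png")  # assumed input path

def process(path_and_bytes):
    path, raw = path_and_bytes
    # decode and process the image here, e.g. with Pillow or OpenCV
    return path, len(raw)

results = images.map(process).collect()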
After creating a Hadoop cluster that feeds data into a Cassandra database, I would like to integrate into the Hadoop architecture some machine learning algorithms that I have written in Python with the scikit-learn library, and schedule them to run automatically against the data stored in Cassandra.
Does anyone know how to proceed, or of any reading material that could help me?
I have searched for information, but all I have found is that I could use Mahout; however, the algorithms I want to apply are the ones I wrote in Python.
For starters, Cassandra isn't part of Hadoop, and Hadoop isn't required for it.
scikit-learn is fine for small datasets, but once you scale out to Hadoop your dataset is distributed across machines and therefore cannot be loaded into scikit-learn directly.
As a starting point you would need to use PySpark with its pandas integration; Spark MLlib also has several algorithms of its own, and you can optionally deploy that code onto Hadoop YARN.
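As a rough sketch of that approach (the keyspace, table, and column names are assumptions, and reading from Cassandra requires the separate spark-cassandra-connector package), it could look something like this:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("cassandra-ml").getOrCreate()

# Read a Cassandra table into a distributed DataFrame
# (format and option names come from the spark-cassandra-connector).
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="mykeyspace", table="mytable")  # assumed names
      .load())

# Train a distributed model with Spark MLlib instead of scikit-learn.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")  # assumed columns
model = LogisticRegression(labelCol="label").fit(assembler.transform(df))

Submitting the script with spark-submit --master yarn is how it ends up running on Hadoop YARN, as mentioned above.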
I have installed Hadoop 2.6.0 on my machine and started all the services.
When I compare with my old version, this version does not start the JobTracker and TaskTracker daemons; instead it starts the NodeManager and ResourceManager.
Questions:
I believe this version of Hadoop uses YARN for running jobs. Can't I run a MapReduce job anymore?
Should I write a job that's tailored to the YARN ResourceManager and ApplicationMaster?
Is there a sample Python job that I can submit?
I believe this version of Hadoop uses YARN for running jobs. Can't I run a MapReduce job anymore?
It's still fine to run MapReduce jobs. YARN is a rearchitecture of the cluster computing internals of a Hadoop cluster, but that rearchitecture maintained public API compatibility with classic Hadoop 1.x MapReduce. The Apache Hadoop documentation on Apache Hadoop NextGen MapReduce (YARN) discusses the rearchitecture in more detail. There is a relevant quote at the end of the document:
MRV2 maintains API compatibility with previous stable release (hadoop-1.x). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.
Should I write a job that's tailored to the YARN ResourceManager and ApplicationMaster?
If you're already accustomed to writing MapReduce jobs or higher-level abstractions like Pig scripts and Hive queries, then you don't need to change anything you're doing as the end user. API compatibility as per above means that all of those things continue to work fine. You are welcome to write custom distributed applications that specifically target the YARN framework, but this is more advanced usage that isn't required if you just want to stick to Hadoop 1.x-style data processing jobs. The Apache Hadoop documentation contains a page on Writing YARN Applications if you're interested in exploring this.
Is there a sample Python job that I can submit?
I recommend taking a look at the Apache Hadoop documentation on Hadoop Streaming. Hadoop Streaming allows you to write MapReduce jobs simply by reading from stdin and writing to stdout. This is a very general paradigm, so you can code in pretty much any language you want, including Python.
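As a minimal sketch of what such a job looks like (this is the classic word count, not something from the original answer), the mapper and reducer are just two small scripts reading stdin and writing stdout:

# mapper.py: emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py: input arrives sorted by key, so sum consecutive counts
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, total))
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, total))

They are submitted with the Hadoop Streaming jar, roughly hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input <in> -output <out>; the exact jar location varies by installation.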
In general, it sounds like you would benefit from exploring the Apache Hadoop documentation site. There is a lot of helpful information there.
I currently have an executable that uses all the cores on my server when it runs. I want to add another server and have jobs split between the two machines, with each job still using all the cores of the machine it runs on. If both machines are busy, the next job needs to queue until one of the two machines becomes free.
I thought this might be controlled from Python; however, I am a novice and not sure which Python package would be best for this problem.
I liked the "heapq" package for queuing the jobs, but it looks like it is designed for single-server use. I then looked into IPython.parallel, but it seems designed for creating a separate smaller job for every core (on one or more servers).
I saw a huge list of different options here (https://wiki.python.org/moin/ParallelProcessing) but I could do with some guidance as to which way to go for a problem like this.
Can anyone suggest a package that may help with this problem, or a different way of approaching it?
Celery does exactly what you want: it makes it easy to distribute a task queue across multiple (many) machines.
See the Celery tutorial to get started.
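As a minimal sketch, assuming a Redis broker, and with the broker URL and executable path as placeholders rather than anything from the original answer:

# tasks.py
import subprocess

from celery import Celery

app = Celery("tasks", broker="redis://broker-host:6379/0")  # assumed broker URL

@app.task
def run_job(args):
    # The executable already uses every core, so the worker just launches it.
    return subprocess.call(["/path/to/executable"] + list(args))  # assumed path

Start one worker per server with a concurrency of 1 (for example, celery -A tasks worker --concurrency=1) so that each machine runs a single job at a time and that job gets all the cores; calls to run_job.delay(...) then queue up until one of the workers becomes free, which is exactly the behaviour you describe.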
Alternatively, IPython has its own parallel computing library built in (IPython.parallel), based on ZeroMQ; see the introduction. I have not used it before, but it looks pretty straightforward.
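For what it's worth, a load-balanced view looks like the natural fit for queuing whole-machine jobs; a rough sketch, with the executable path and controller setup as assumptions:

from ipyparallel import Client  # "from IPython.parallel import Client" in older IPython versions

rc = Client()                   # connects to a running ipcontroller
view = rc.load_balanced_view()  # dispatches to whichever engine is free

def run_job(args):
    import subprocess
    return subprocess.call(["/path/to/executable"] + list(args))  # assumed path

# Submissions queue up and run as engines become available.
result = view.apply_async(run_job, ["--input", "data1"])
print(result.get())

Starting a single engine per server would keep each job on a whole machine, much like the one-worker-per-server Celery setup above.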
I have a few problems which may apply well to the Map-Reduce model. I'd like to experiment with implementing them, but at this stage I don't want to go to the trouble of installing a heavyweight system like Hadoop or Disco.
Is there a lightweight Python framework for map-reduce which uses the regular filesystem for input, temporary files, and output?
A Coursera course dedicated to big data suggests using these lightweight Python MapReduce frameworks:
http://code.google.com/p/octopy/
https://github.com/michaelfairley/mincemeatpy
To get you started very quickly, try this example:
https://github.com/michaelfairley/mincemeatpy/zipball/v0.1.2
(hint: for [server address] in this example use localhost)
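If it helps, a rough word-count sketch based on the project's own example (the password and input data are placeholders) looks like this:

# server.py: holds the data, hands out map/reduce work, collects results
import mincemeat

data = dict(enumerate([
    "Humpty Dumpty sat on a wall",
    "Humpty Dumpty had a great fall",
]))

def mapfn(key, value):
    for word in value.split():
        yield word, 1

def reducefn(key, values):
    return sum(values)

server = mincemeat.Server()
server.datasource = data
server.mapfn = mapfn
server.reducefn = reducefn
print(server.run_server(password="changeme"))

Workers then connect with python mincemeat.py -p changeme localhost, which is where the [server address] hint above comes in.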
http://pythonhosted.org/mrjob/ is great for getting started quickly on your local machine; basically all you need is a simple:
pip install mrjob
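A minimal mrjob job (the usual word count example) can then be run locally with python wordcount.py input.txt:

# wordcount.py
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()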
http://jsmapreduce.com/ -- in-browser mapreduce; in Python or Javascript; nothing to install
Check out Apache Spark. It is written in Scala but also has a Python API. You can try it locally on your machine and then, when you need to, easily distribute your computation over a cluster.
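For instance, a local word count in PySpark takes only a few lines (the file names here are placeholders), and the same code later runs on a cluster by pointing it at a different master:

from pyspark import SparkContext

# "local[*]" uses all local cores; swap in a cluster master URL later.
sc = SparkContext("local[*]", "wordcount")

counts = (sc.textFile("input.txt")              # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("counts_out")             # placeholder output directory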
MockMR - https://github.com/sjtrny/mockmr
It's meant for educational use. It does not currently operate in parallel, but it accepts standard Python objects as input and output.
So this was asked ages ago, but I worked on a full implementation of MapReduce over the weekend: remap.
https://github.com/gtoonstra/remap
It's pretty easy to install with minimal dependencies; if all goes well you should be able to do a test run in 5 minutes.
The entire processing pipeline works, but submitting and monitoring jobs is still being worked on.
I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past, and I do not know Java.
As far as I can tell, there are no well-developed Python libraries for distributed machine learning. Java, on the other hand, has Apache Mahout and the more recent Oryx from Cloudera.
Essentially, it seems I have to choose between two options: slog through parallelising my own algorithms to use with Hadoop Streaming or one of the Python wrappers for Hadoop until decent libraries exist, or jump ship to Java so that I can use Mahout/Oryx. There is a world of difference between writing your own MapReduce word count and writing your own MapReduce SVM, even with the help of great tutorials like this one.
I don't know which is the wiser choice, so my question is:
A) Is there some Python library I have missed which would be useful? If not, do you know if there are any in development which will be useful in the near future?
B) If the answer to the above is no then would my time be better spent jumping ship to Java?
I do not know of any library that could be used natively in Python for machine learning on Hadoop, but an easy solution would be to use the jpype module, which basically allows you to interact with Java from within your Python code.
You can for example start a JVM like this:
from jpype import startJVM, getDefaultJVMPath

jvm = None

def start_jpype(classpath):
    """Start the JVM once, with the Hadoop/Mahout jars on the class path."""
    global jvm
    if jvm is None:
        cpopt = "-Djava.class.path={cp}".format(cp=classpath)
        startJVM(getDefaultJVMPath(), "-ea", cpopt)
        jvm = "started"
There is a very good tutorial on the topic here, which explains how to use KMeans clustering from your Python code via Mahout.
Answers to the questions:
A) To my knowledge, no. Python has an extensive collection of machine learning modules and of MapReduce modules, but not of modules that combine the two (ML + MR).
B) I would say yes. Since you are a heavy programmer, you should be able to catch up with Java fairly fast, as long as you stay away from those nasty (sorry, no offense) J2EE frameworks.
I would recommend using Java when you are using EMR.
First, and simply, it's the way EMR was designed to work. If you're going to develop on Windows you write in C#; if you're making a web service on Apache you use PHP. When you're running Hadoop MapReduce on EMR, you use Java.
Second, all the tools are there for you in Java, such as the AWS SDK. I regularly develop MapReduce jobs for EMR quickly with the help of NetBeans, Cygwin (when on Windows), and s3cmd (inside Cygwin). I use NetBeans to build my MapReduce jar, and Cygwin plus s3cmd to copy it to my S3 bucket to be run by EMR. I then also write a program using the AWS SDK to launch my EMR cluster with my configuration and to run my jar.
Third, there are many Hadoop debugging tools for Java (though they usually need macOS or Linux to work).
Please see here for creating a new NetBeans project with Maven for Hadoop.
This blog post provides a fairly comprehensive review of the Python frameworks for working with Hadoop:
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
including:
Hadoop Streaming
mrjob
dumbo
hadoopy
pydoop
and this page provides a working example of parallelized machine learning with Python and Hadoop:
http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
A) No
B) No
What you actually want to do is jump ship to Scala, and if you want to do any hardcore ML you also want to forget about Hadoop MapReduce and jump ship to Spark. Hadoop is a MapReduce framework, but ML algorithms do not necessarily map onto this dataflow structure, as they are often iterative. This means many ML algorithms result in a large number of MapReduce stages, and each stage carries the huge overhead of reading from and writing to disk.
Spark is an in-memory distributed framework that allows data to stay in memory, increasing speed by orders of magnitude.
Now Scala is a best-of-all-worlds language, especially for Big Data and ML. It's not dynamically typed, but it has type inference and implicit conversions, and it's significantly more concise than Java and Python. This means you can write code very fast in Scala, and moreover that code is readable and maintainable.
Finally, Scala is functional and naturally lends itself to mathematics and parallelization. This is why all the serious cutting-edge work for Big Data and ML is being done in Scala, e.g. Scalding, Scoobi, Scrunch, and Spark. Crufty Python and R code will be a thing of the past.