I want to process images (most probably large ones) on the Hadoop platform, but I am confused about which of the two aforementioned interfaces, Pydoop or mrjob, to choose, especially as someone who is still a beginner in Hadoop. I need to split the images into blocks to distribute processing among worker machines and to merge the blocks after processing is completed.
It's known that Pydoop has better access to the Hadoop API while mrjob has powerful utilities for executing jobs. Which one is suitable for this kind of work?
I would actually suggest pyspark because it natively supports binary files.
For image processing, you can try TensorFlowOnSpark.
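For a rough idea of what the binary-file support looks like (the HDFS path and the use of Pillow for decoding are placeholders, not part of the question), reading whole image files into an RDD with PySpark might look like this:

    # Minimal sketch: read whole image files as (path, bytes) records with PySpark.
    from io import BytesIO
    from pyspark.sql import SparkSession
    from PIL import Image  # assumed to be installed on every worker

    spark = SparkSession.builder.appName("image-demo").getOrCreate()
    sc = spark.sparkContext

    # Each record is (file_path, raw_bytes); files are never split mid-image.
    images = sc.binaryFiles("hdfs:///data/images/*.tif")  # path is a placeholder

    def image_size(record):
        path, raw = record
        with Image.open(BytesIO(raw)) as img:
            return path, img.size  # (width, height)

    print(images.map(image_size).take(5))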
Can I run normal Python code using regular ML libraries (e.g., TensorFlow or scikit-learn) on a Spark cluster? If yes, can Spark distribute my data and computation across the cluster? If no, why not?
Spark uses RDDs (Resilient Distributed Datasets) to distribute work among workers. I don't think you can use your existing Python code without dramatically adapting it to Spark's model; see the sketch below. For TensorFlow there are many options to distribute computation over multiple GPUs.
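To illustrate the kind of adaptation I mean (the data and the model here are toy placeholders), you would typically train a scikit-learn model on the driver, broadcast it, and apply it partition by partition:

    # Sketch: broadcast a locally trained scikit-learn model and apply it per partition.
    import numpy as np
    from pyspark.sql import SparkSession
    from sklearn.linear_model import LogisticRegression

    spark = SparkSession.builder.appName("sklearn-on-spark").getOrCreate()
    sc = spark.sparkContext

    # Train a small model locally on the driver (toy data).
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression().fit(X, y)

    # Ship the fitted model to every worker once.
    bmodel = sc.broadcast(model)

    def predict_partition(rows):
        feats = np.array(list(rows))
        if feats.size == 0:
            return iter([])
        return iter(bmodel.value.predict(feats))

    data = sc.parallelize([[0.5], [2.5], [1.5]], numSlices=2)
    print(data.mapPartitions(predict_partition).collect())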
My objective is to apply the map-reduce framework to cluster images using Hadoop. For map-reduce I am using Python and the mrjob package, but I am not able to work out the logic of how to process the images.
The images are in .tif format. The questions I have are:
How should the images be stored in HDFS (in what format) so that they can be retrieved for map-reduce in Python?
I am not getting what the I/O pipeline for using Python with Hadoop looks like.
Is it possible to execute Azure Machine Learning models from inside a Python or R script?
My requirement is to run a large but changing number of ML algorithms on multiple given data sets. I could hard-wire all sets to all algorithms, but this is very complex and not flexible.
So the idea was to wire the data sets into the Azure-embedded Python script, write the program logic in Python, and start the ML algorithms from within the Python script (e.g., in a for loop).
Thank you for any hints!
In my experience, executing Azure ML models from inside a Python or R script is possible in principle. You can follow the official tutorials for Python or R to learn how to use the Execute Python Script or Execute R Script module to implement this in Azure ML Studio.
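As a rough sketch of what the Execute Python Script module expects (the summary logic is only a placeholder for your own loop over algorithms), the module's entry point looks roughly like this:

    # Sketch of the Execute Python Script entry point in Azure ML Studio.
    import pandas as pd

    def azureml_main(dataframe1=None, dataframe2=None):
        # dataframe1 / dataframe2 arrive from the module's two input ports.
        summary = dataframe1.describe().reset_index()  # stand-in for running your algorithms
        # Return a sequence of DataFrames; each one feeds an output port.
        return summary,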
I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past and I do not know Java.
As far as I can tell there are no well developed Python libraries for distributed machine learning. Java on the other hand has Apache Mahout and the more recent Oryx from Cloudera.
Essentially it seems I have to choose between two options: slog through parallelising my own algorithms to use with Hadoop Streaming or one of the Python wrappers for Hadoop until decent libraries exist, or jump ship to Java so that I can use Mahout/Oryx. There is a world of difference between writing your own MapReduce word count code and writing your own MapReduce SVM, even with the help of great tutorials like this.
I don't know which is the wiser choice, so my question is:
A) Is there some Python library I have missed which would be useful? If not, do you know if there are any in development which will be useful in the near future?
B) If the answer to the above is no then would my time be better spent jumping ship to Java?
I do not know of any library that could be used natively in Python for machine learning on Hadoop, but an easy solution would be to use the jpype module, which basically allows you to interact with Java from within your Python code.
You can for example start a JVM like this:
from jpype import startJVM, getDefaultJVMPath

jvm = None

def start_jpype(classpath):
    """Start the JVM once, pointing it at the Hadoop/Mahout jars on classpath."""
    global jvm
    if jvm is None:
        jvmlib = getDefaultJVMPath()  # path to the JVM shared library on this machine
        cpopt = "-Djava.class.path={cp}".format(cp=classpath)
        startJVM(jvmlib, "-ea", cpopt)
        jvm = "started"
There is a very good tutorial on the topic here, which explains how to use KMeans clustering from your Python code using Mahout.
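Once the JVM is up, you look up Java classes by their fully qualified name; a tiny sketch (java.util.ArrayList is only a stand-in for a Mahout class on your classpath) would be:

    # Sketch: using a Java class after the JVM has been started with jpype.
    from jpype import JClass

    ArrayList = JClass("java.util.ArrayList")  # look up any class by fully qualified name
    items = ArrayList()
    items.add("sample")
    print(items.size())  # -> 1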
Answers to the questions:
A) To my knowledge, no. Python has an extensive collection of machine learning modules and of map-reduce modules, but not ML + MR together.
B) I would say yes: since you are a heavy programmer you should be able to pick up Java fairly fast, as long as you are not involved with those nasty (sorry, no offense) J2EE frameworks.
I would recommend using Java when you are using EMR.
First, and simply, it's the way it was designed to work. If you're going to play in Windows you write in C#, if you're making a web service in Apache you use PHP. When you're running MapReduce Hadoop in EMR, you use Java.
Second, all the tools are there for you in Java, like the AWS SDK. I regularly develop MapReduce jobs for EMR quickly with the help of NetBeans, Cygwin (when on Windows), and s3cmd (in Cygwin). I use NetBeans to build my MR jar, and Cygwin + s3cmd to copy it to my S3 directory to be run by EMR. I then also write a program using the AWS SDK to launch my EMR cluster with my config and to run my jar.
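As a rough sketch of that last step, here is roughly what launching a cluster and submitting a jar looks like in Python with boto3 instead of the Java AWS SDK (bucket names, roles, and the release label are placeholders):

    # Sketch: launch an EMR cluster and run a jar step with boto3.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="my-mapreduce-job",
        ReleaseLabel="emr-6.15.0",                    # placeholder EMR release
        LogUri="s3://my-bucket/emr-logs/",            # placeholder bucket
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,     # terminate when the step finishes
        },
        Steps=[{
            "Name": "run-my-jar",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3://my-bucket/jars/my-mr-job.jar",
                "Args": ["s3://my-bucket/input/", "s3://my-bucket/output/"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])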
Third, there are many Hadoop debugging tools for Java (they usually need macOS or Linux to work, though).
Please see here for creating a new NetBeans project with Maven for Hadoop.
This blog post provides a fairly comprehensive review of the Python frameworks for working with Hadoop:
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
including:
Hadoop Streaming
mrjob
dumbo
hadoopy
pydoop
and this post provides a working example of parallelized ML with Python and Hadoop:
http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
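To give a feel for what these frameworks look like, here is a minimal mrjob job, a plain word count rather than an ML algorithm, that runs locally or over Hadoop Streaming:

    # Minimal mrjob job: word count over lines of text.
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Emit (word, 1) for every word on the line.
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            # Sum the counts for each word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()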
A) No
B) No
What you actually want to do is jump ship to Scala, and if you want to do any hardcore ML then you also want to forget about Hadoop and jump ship to Spark. Hadoop is a MapReduce framework, but ML algorithms do not necessarily map to this dataflow structure, as they are often iterative. This means many ML algorithms will result in a large number of MapReduce stages, each of which has the huge overhead of reading and writing to disk.
Spark is an in-memory distributed framework that allows data to stay in memory, increasing speed by orders of magnitude.
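As a small illustration of why that matters for iterative algorithms (sketched in PySpark for continuity with the rest of this thread; the data and learning rate are toy values), the working set is cached once and then reused on every pass instead of being re-read from disk:

    # Sketch: cache an RDD and reuse it across iterations of a simple gradient loop.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
    sc = spark.sparkContext

    # Toy (x, y) points, cached so every iteration reuses the in-memory copy.
    points = sc.parallelize([(1.0, 2.0), (2.0, 0.5), (3.0, 3.5)]).cache()

    w = 0.0
    for _ in range(10):
        # Gradient of squared error for y ~= w * x, computed over the cached data.
        grad = points.map(lambda p: (p[1] - w * p[0]) * p[0]).sum()
        w += 0.1 * grad
    print(w)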
Now, Scala is a best-of-all-worlds language, especially for Big Data and ML. It's not dynamically typed, but it has type inference and implicit conversions, and it's significantly more concise than Java and Python. This means you can write code very fast in Scala, but moreover, that code is readable and maintainable.
Finally, Scala is functional and naturally lends itself to mathematics and parallelization. This is why all the serious cutting-edge work for Big Data and ML is being done in Scala, e.g. Scalding, Scoobi, Scrunch, and Spark. Crufty Python and R code will be a thing of the past.