After creating a Hadoop cluster that provides data to a Cassandra database, I would like to integrate into the Hadoop architecture some machine learning algorithms that I have coded in Python with the scikit-learn library, so that I can automatically schedule when to run these algorithms on the data stored in Cassandra.
Does anyone know how to proceed, or of any references that could help me?
I have searched for information, but all I found is that I could use Mahout; however, the algorithms I want to apply are the ones I wrote in Python.
For starters, Cassandra isn't part of Hadoop, and Hadoop isn't required for it.
scikit-learn is fine for small datasets, but to scale an algorithm onto Hadoop specifically, your dataset will be distributed across the cluster and therefore cannot be loaded into scikit-learn directly.
You would need to use PySpark with its pandas integration as a starting point; Spark MLlib has several algorithms of its own, and you can optionally deploy that code onto Hadoop YARN.
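As a rough sketch of that path (not the only way to do it), the job below reads a Cassandra table through the Spark Cassandra connector and fits an MLlib model; the connection host, keyspace, table, and column names are placeholders, the connector package must be supplied at submit time (e.g. via --packages), and the script would be launched with spark-submit --master yarn.

```python
# Minimal sketch: read a Cassandra table into a Spark DataFrame and fit an
# MLlib model. Host, keyspace, table, and column names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = (SparkSession.builder
         .appName("cassandra-ml-job")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
         .getOrCreate())

# Load the source table through the Spark Cassandra connector.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")  # placeholder names
      .load())

# Assemble numeric columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["col_a", "col_b"], outputCol="features")
features = assembler.transform(df)

# Fit a distributed clustering model instead of loading the data into scikit-learn.
model = KMeans(k=3, featuresCol="features").fit(features)
model.transform(features).show()
```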
My objective is to cluster images using the MapReduce framework on Hadoop. For MapReduce I am using the Python language and the mrjob package, but I am not able to work out the logic of how to process the images. The images are in .tif format. The questions I have are:
1. How should the images be stored in HDFS (in what format) so that they can be retrieved for MapReduce in Python?
2. I do not understand the I/O pipeline for using Python with Hadoop.
I am trying to provide my custom Python code, which requires libraries that are not supported by AWS Glue (pandas). So I created a zip file with the necessary libraries and uploaded it to an S3 bucket. While running the job, I pointed to the S3 path in the advanced properties, but the job still does not run successfully. Can anyone suggest why?
1. Do I have to include my code in the zip file? If yes, how will Glue understand that it's the code?
2. Also, do I need to create a package, or will just a zip file do?
Appreciate the help!
An update on AWS Glue jobs, released on 22 Jan 2019:
Introducing Python Shell Jobs in AWS Glue -- Posted On: Jan 22, 2019
Python shell jobs in AWS Glue support scripts that are compatible with
Python 2.7 and come pre-loaded with libraries such as the Boto3,
NumPy, SciPy, pandas, and others. You can run Python shell jobs using
1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A
single DPU provides processing capacity that consists of 4 vCPUs of
compute and 16 GB of memory.
More info at: https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html
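To illustrate, here is a minimal sketch of what such a Python shell job script could look like, using only the pre-loaded pandas and boto3 libraries; the bucket names, keys, and column names are placeholders.

```python
# Minimal sketch of an AWS Glue Python shell job script using the pre-loaded
# pandas and boto3 libraries. Bucket, key, and column names are placeholders.
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Fetch the input CSV from S3 to local disk, then load it with pandas.
s3.download_file("my-input-bucket", "input/data.csv", "/tmp/data.csv")
df = pd.read_csv("/tmp/data.csv")

# Example transformation: aggregate one column per category.
summary = df.groupby("category")["value"].sum()

# Write the result back to S3.
summary.to_csv("/tmp/summary.csv")
s3.upload_file("/tmp/summary.csv", "my-output-bucket", "output/summary.csv")
```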
According to AWS Glue Documentation:
Only pure Python libraries can be used. Libraries that rely on C
extensions, such as the pandas Python Data Analysis Library, are not
yet supported.
I think it wouldn't work even if you upload the Python library as a zip file, if the library you are using depends on C extensions. I tried using pandas, holidays, etc. the same way you have, and when I contacted AWS Support they mentioned that support for these Python libraries is on their to-do list, but there is no ETA as of now.
So any libraries that are not pure Python will not work in AWS Glue at this point, but they should be supported in the near future, since this is a popular request.
If you would still like to try it out, please refer to this link, where it is explained how to package external libraries to run in AWS Glue; I tried it, but it didn't work for me.
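For reference, the packaging approach in that link roughly boils down to zipping a pure-Python package, uploading it to S3, and passing it to the job through the --extra-py-files argument. A sketch with placeholder bucket, role, script, and file names (and which still only works for libraries without C extensions) might look like this:

```python
# Sketch of attaching a pure-Python dependency to a Glue Spark job.
# All names are placeholders; libraries with C extensions will still fail.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-etl-job",
    Role="MyGlueServiceRole",  # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
    },
    DefaultArguments={
        # Zip of a pure-Python package uploaded to S3 beforehand.
        "--extra-py-files": "s3://my-bucket/deps/pg8000.zip",
    },
)
```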
As Yuva's answer mentioned, I believe it's currently impossible to import a library that is not written purely in Python, and the documentation reflects that.
However, in case someone came here looking for an answer on how to import a Python library in AWS Glue in general, there is a good explanation in this post of how to do it with the pg8000 library:
AWS Glue - Truncate destination postgres table prior to insert
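The gist of that approach, sketched here with placeholder connection details, is to import the pure-Python pg8000 driver directly in the Glue script and issue SQL against the destination table before the insert:

```python
# Sketch of using the pure-Python pg8000 driver from a Glue script.
# Connection details and table name are placeholders.
import pg8000

conn = pg8000.connect(
    host="my-postgres-host",
    port=5432,
    database="mydb",
    user="myuser",
    password="mypassword",
)
cur = conn.cursor()
cur.execute("TRUNCATE TABLE destination_table")  # clear the target before the Glue insert
conn.commit()
conn.close()
```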
I currently have a Python program which reads a local file (containing a pickled database object) and saves to that file when it's done. I'd like to branch out and use this program on multiple computers accessing the same database, but I don't want to worry about synchronizing the local database files with each other, so I've been considering cloud storage options. Does anyone know how I might store a single data file in the cloud and interact with it using Python?
I've considered something like Google Cloud Platform and similar services, but those seem to be more server-oriented whereas I just need to access a single file on my own machines.
You could install gsutil and the boto library and use those.
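A rough sketch of what that could look like, assuming gsutil config has already written credentials to ~/.boto, and with placeholder bucket and object names:

```python
# Rough sketch: read and write a single pickled file in Google Cloud Storage
# with the boto library. Assumes `gsutil config` has already stored
# credentials in ~/.boto. Bucket and object names are placeholders.
import pickle
import boto

uri = boto.storage_uri("my-bucket/database.pickle", "gs")

# Download the current database object and unpickle it.
db = pickle.loads(uri.get_key().get_contents_as_string())

# ... work with `db` locally ...

# Upload the updated object back to the same location.
uri.new_key().set_contents_from_string(pickle.dumps(db))
```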
The documentation for the Azure Machine Learning Python script module describes using a ZIP file containing code as a resource, but I don't see how to create and upload such a ZIP file in the first place.
How do I get my custom Python code into Azure Machine Learning for use as a ZIP resource?
Just upload it as a dataset. Reference (search for it, as it is not on the first page).
Reference on how to upload the dataset.
Here is the doc explaining this in detail: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-execute-python-scripts/#importing-existing-python-script-modules
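In short: zip your .py files, upload the zip as a dataset, and connect it to the third input port (Script Bundle) of the Execute Python Script module; its contents are unpacked onto sys.path so they can be imported. A sketch of the entry-point script, where my_module and its process function are hypothetical names standing in for your own code:

```python
# Sketch of the Execute Python Script entry point, assuming a zip containing
# my_module.py was uploaded as a dataset and connected to the module's third
# input port (Script Bundle). my_module and process() are hypothetical names.
import my_module

def azureml_main(dataframe1=None, dataframe2=None):
    # Apply the custom code to the incoming DataFrame and return the result.
    result = my_module.process(dataframe1)  # hypothetical function
    return result,  # the module expects a tuple containing a single DataFrame
```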