My objective is to apply the map-reduce framework to cluster images on Hadoop. For map-reduce I am using the Python programming language and the MRJOB package, but I am not able to work out the logic for processing the images.
I have the images in .tif format. The questions I have are:
How should I store the images in HDFS (i.e. in what format) so that I can retrieve them for map-reduce in Python?
I also do not understand the I/O pipeline for using Python with Hadoop.
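For illustration, a minimal sketch of one common pattern, not taken from the question: keep the .tif files in HDFS unchanged, feed mrjob a small text file listing their HDFS paths, and let each mapper fetch and decode its own image. The namenode host/port, the listing file, and the emitted statistics are all assumptions, and pyarrow plus Pillow would need to be available on the task nodes.

```python
import io

from mrjob.job import MRJob
from PIL import Image
from pyarrow import fs


class MRImageStats(MRJob):
    """Emit (path, (width, height)) for every image path listed in the input."""

    def mapper(self, _, line):
        path = line.strip()
        if not path:
            return
        # Open the image directly from HDFS; host and port are placeholders
        # and libhdfs must be available where the task runs.
        hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
        with hdfs.open_input_stream(path) as stream:
            with Image.open(io.BytesIO(stream.read())) as img:
                yield path, (img.width, img.height)


if __name__ == "__main__":
    MRImageStats.run()
```

The point of the listing-file pattern is that the job input stays line-oriented, which is what mrjob expects, while the binary image data never has to pass through the streaming key/value channel.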
Is there anything to monitor or automate in MongoDB using a Python script?
For monitoring, the MongoDB exporter and Grafana are used.
Is there any monitoring task that the MongoDB exporter cannot cover and that a Python script could handle instead?
I'm not considering backups or adding shards because I use Atlas.
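For illustration only, not from the question: one kind of check that is sometimes scripted on top of exporter metrics is flagging individual long-running operations. A minimal sketch with pymongo follows; the connection URI and the threshold are placeholders, and running currentOp requires the appropriate privileges on Atlas.

```python
from pymongo import MongoClient

SLOW_SECS = 30  # assumed threshold
client = MongoClient("mongodb+srv://user:pass@cluster.example.mongodb.net")  # placeholder URI

# currentOp lists every in-progress operation together with its running time.
ops = client.admin.command("currentOp")

for op in ops.get("inprog", []):
    if op.get("secs_running", 0) > SLOW_SECS:
        # Replace the print with whatever alerting or automation is needed.
        print(op.get("opid"), op.get("ns"), op.get("secs_running"), "seconds")
```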
After creating a Hadoop cluster that feeds data into a Cassandra database, I would like to integrate into the Hadoop architecture some machine learning algorithms that I have written in Python with the scikit-learn library, and schedule them to run automatically against the data stored in Cassandra.
Does anyone know how to proceed, or can anyone recommend reading that could help me?
I have tried to search for information, but I have only found that I can use Mahout, whereas the algorithms I want to apply are the ones I wrote in Python.
For starters, Cassandra isn't part of Hadoop, and Hadoop isn't required for it.
scikit-learn is fine for small datasets, but to scale an algorithm onto Hadoop specifically, your dataset will be distributed and therefore cannot be loaded into scikit-learn directly.
You would need to use PySpark with its Pandas integration as a starting point; Spark MLlib has several algorithms of its own, and you can optionally deploy that code onto Hadoop YARN.
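As a rough illustration of the PySpark-with-Pandas route, here is a minimal sketch that reads a Cassandra table and runs existing scikit-learn code per group through a pandas UDF. The connection host, keyspace, table, and column names are all hypothetical, and the spark-cassandra-connector package is assumed to be available on the cluster.

```python
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.cluster import KMeans

spark = (
    SparkSession.builder
    .appName("sklearn-on-cassandra")
    .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder
    .getOrCreate()
)

# Read the Cassandra table as a distributed DataFrame.
df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="sensors", table="readings")  # hypothetical keyspace/table
    .load()
    .select("device_id", "x", "y")                  # hypothetical columns
)

# Run the existing scikit-learn code on each group; each group must fit
# in the memory of a single executor, which is the trade-off of this approach.
def cluster_group(pdf: pd.DataFrame) -> pd.DataFrame:
    model = KMeans(n_clusters=3, n_init=10).fit(pdf[["x", "y"]])
    pdf["cluster"] = model.labels_
    return pdf

result = df.groupBy("device_id").applyInPandas(
    cluster_group,
    schema="device_id string, x double, y double, cluster int",
)
result.show()
```

Scheduling such a job is then a matter of submitting it to YARN (e.g. spark-submit driven by cron, Oozie, or Airflow) on whatever cadence is needed.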
I have a very large CSV file, nearly 64 GB in size, in blob storage. I need to do some processing on every row and push the data to a database.
What would be the best way to do this efficiently?
A 64 GB file shouldn't worry you, as PySpark is capable of processing even 1 TB of data.
You can use Azure Databricks to write the code in PySpark and run it on a cluster.
A Databricks cluster is a set of computation resources and
configurations on which you run data engineering, data science, and
data analytics workloads, such as production ETL pipelines, streaming
analytics, ad-hoc analytics, and machine learning.
You can refer to this third-party tutorial: Read a CSV file stored in blob container using python in DataBricks.
You shouldn't face any issues while processing the file, but if required you can consider Boost Query Performance with Databricks and Spark.
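To make the shape of such a job concrete, here is a minimal PySpark sketch for a Databricks notebook: read the CSV from blob storage, transform the rows, and append the result to a database over JDBC. The storage account, container, secret scope, JDBC URL, and table names are all placeholders.

```python
from pyspark.sql import functions as F

# Allow Spark to read the blob container (account name and secret are placeholders;
# dbutils and the spark session are available inside Databricks notebooks).
spark.conf.set(
    "fs.azure.account.key.mystorage.blob.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key"),
)

# Read the large CSV straight from blob storage; Spark splits it across the cluster.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("wasbs://data@mystorage.blob.core.windows.net/big_file.csv")
)

# Example per-row processing; replace with the real transformation.
processed = df.withColumn("processed_at", F.current_timestamp())

# Push the result to the database over JDBC in parallel batches.
(
    processed.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.processed_rows")
    .option("user", "dbuser")
    .option("password", dbutils.secrets.get(scope="my-scope", key="db-password"))
    .mode("append")
    .save()
)
```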
I want to process images (most probably large ones) on the Hadoop platform, but I am confused about which of the aforementioned two interfaces to choose, especially as someone who is still a beginner with Hadoop. I need to split the images into blocks to distribute the processing among worker machines and to merge the blocks once processing is complete.
It's known that Pydoop has better access to the Hadoop API while mrjob has powerful utilities for executing jobs; which one is better suited to this kind of work?
I would actually suggest pyspark because it natively supports binary files.
For image processing, you can try TensorFlowOnSpark.
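For a sense of what the binary-file support looks like, here is a minimal PySpark sketch that loads .tif files and decodes them on the workers with Pillow; the HDFS path and the extracted statistics are placeholders.

```python
import io

from PIL import Image
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StructField, StructType

spark = SparkSession.builder.appName("image-processing").getOrCreate()

# The binaryFile source (Spark 3.x) yields one row per file:
# path, modificationTime, length, and the raw bytes in "content".
images = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.tif")
    .load("hdfs:///data/images")  # assumed location
)

# Decode each image on the workers and extract a simple feature.
size_schema = StructType([
    StructField("width", IntegerType()),
    StructField("height", IntegerType()),
])

@F.udf(size_schema)
def image_size(content):
    with Image.open(io.BytesIO(content)) as img:
        return (img.width, img.height)

result = images.select("path", image_size("content").alias("size"))
result.show(truncate=False)
```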
I am working with the Hadoop and Spark frameworks for clustering images.
I am using Python as my programming language. For the map-reduce framework, the MRJOB package is used.
The doubt I have is: how do I access HDFS files directly from Python?
For example, if my file on HDFS is /a.txt, how do I access it directly in Python to apply further processing?
I have looked at many libraries but have not found a concrete answer. I saw snakebite, but it only supports Python 2.
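Not part of the question, but for illustration: one Python 3 option is the pure-Python hdfs package (HdfsCLI), which talks to the namenode over WebHDFS; pyarrow's HadoopFileSystem is another. A minimal sketch with HdfsCLI, where the namenode host, WebHDFS port, and user are placeholders:

```python
from hdfs import InsecureClient

# Connect to the namenode's WebHDFS endpoint (host, port, and user are placeholders).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Read /a.txt out of HDFS and process it line by line.
with client.read("/a.txt", encoding="utf-8") as reader:
    content = reader.read()

for line in content.splitlines():
    # apply further processing here
    print(line)
```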