Can I run normal Python code that uses regular ML libraries (e.g., TensorFlow or scikit-learn) on a Spark cluster? If yes, can Spark distribute my data and computation across the cluster? If not, why?
Spark uses RDDs (Resilient Distributed Datasets) to distribute work among its workers. I don't think you can run your existing Python code unchanged; you would have to adapt it substantially to Spark's programming model. For TensorFlow specifically, there are several options for distributing computation across multiple GPUs.
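To give an idea of what "adapting the code" means in practice, here is a minimal sketch of distributing scikit-learn inference with PySpark's mapPartitions: the model is trained on the driver with ordinary scikit-learn code, broadcast to the workers, and applied partition by partition. The toy data, the model, and the feature layout are illustrative assumptions, not part of the original answer.

import numpy as np
from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("sklearn-on-spark").getOrCreate()
sc = spark.sparkContext

# Train a small model on the driver with ordinary scikit-learn code.
X = np.random.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X, y)

# Ship the fitted model to the workers, then score each partition locally.
bc_model = sc.broadcast(model)

def score_partition(rows):
    feats = np.array(list(rows))
    if len(feats) == 0:
        return iter([])
    return iter(bc_model.value.predict(feats).tolist())

rdd = sc.parallelize(np.random.rand(1000, 3).tolist(), numSlices=8)
predictions = rdd.mapPartitions(score_partition).collect()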
After creating a Hadoop cluster that feeds data into a Cassandra database, I would like to integrate into the Hadoop architecture some machine learning algorithms that I have written in Python with scikit-learn, and schedule them to run automatically on the data stored in Cassandra.
Does anyone know how to proceed, or of any references that could help me?
I have searched for information, but all I have found is that I could use Mahout; however, the algorithms I want to apply are the ones I wrote in Python.
For starters, Cassandra isn't part of Hadoop, and Hadoop isn't required for it.
scikit-learn is fine for small datasets, but once you scale out to Hadoop your dataset is distributed across machines and therefore cannot be loaded into scikit-learn directly.
As a starting point, you would need to use PySpark with its pandas integration; Spark MLlib also has several algorithms of its own, and you can optionally deploy that code onto Hadoop YARN.
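As a rough sketch of that starting point, the following assumes the spark-cassandra-connector is on the classpath, that a keyspace sensors with tables readings and predictions exists, and that a pre-trained scikit-learn model is saved as model.pkl; all of these names, and the single feature column value, are hypothetical.

import pickle
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (SparkSession.builder
         .appName("sklearn-on-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
         .getOrCreate())

# Read the distributed table through the connector's data source.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="sensors", table="readings")
      .load())

# Load the fitted model on the driver and broadcast it to the executors.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
bc_model = spark.sparkContext.broadcast(model)

@pandas_udf("double")
def predict(value: pd.Series) -> pd.Series:
    # Score one batch of rows at a time; adapt the feature frame to your schema.
    return pd.Series(bc_model.value.predict(value.to_frame())).astype("float64")

scored = df.withColumn("prediction", predict(df["value"]))

# Write the results back to Cassandra (table assumed to exist).
(scored.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="sensors", table="predictions")
       .mode("append")
       .save())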
I want to process images (most probably large ones) on the Hadoop platform, but I am confused about which of the two interfaces, Pydoop or mrjob, to choose, especially as a beginner in Hadoop. I need to split the images into blocks, distribute the processing among worker machines, and merge the blocks after processing is complete.
It's known that Pydoop has better access to the Hadoop API, while mrjob has powerful utilities for executing jobs. Which one is suitable for this kind of work?
I would actually suggest PySpark because it natively supports binary files.
For image processing specifically, you can also try TensorFlowOnSpark.
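For example, here is a minimal sketch of reading images as binary records with PySpark; the HDFS paths are assumptions and the per-image processing step is only a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-pipeline").getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (path, bytes) pairs, one per file, spread across the cluster.
images = sc.binaryFiles("hdfs:///data/images/*.jpg")

def process(record):
    path, raw = record
    # Placeholder: decode `raw` (e.g. with Pillow or OpenCV) and transform it here.
    return path, len(raw)

results = images.map(process).collect()

# Since Spark 3.0 there is also a DataFrame source for the same job:
df = spark.read.format("binaryFile").load("hdfs:///data/images/*.jpg")
df.select("path", "length").show()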
How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's my experience so far, which is generally to be expected: when I build a small Spark script in Python and run it from Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also shuts down the SparkContext it was using.
So I had the following thought: what if I started a persistent SparkContext in the PySpark shell and then configured the SparkConf and SparkContext in each Python script to connect to it? Looking up the PySpark defaults online, I tried the following:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)
I started the PySpark shell. In a separate script in Visual Studio, I used this code to create the SparkContext and loaded a text file into an RDD named RDDFromFilename. But I couldn't access that RDD in the PySpark shell once the script had run.
How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts? Particularly in Windows?
Spark itself has no built-in solution for this. You may consider:
To keep persistent RDDs:
Apache Ignite
To keep persistent shared context:
spark-jobserver
livy - https://github.com/cloudera/livy (a small REST example follows below)
mist - https://github.com/Hydrospheredata/mist
To share the context with notebooks:
Apache Zeppelin
I think that out of these only Zeppelin officially supports Windows.
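To make the Livy option concrete, here is a rough sketch of sharing one Spark context through its REST API, assuming a Livy server is reachable at http://localhost:8998; error handling and result polling are omitted.

import json
import time
import requests

LIVY = "http://localhost:8998"
headers = {"Content-Type": "application/json"}

# 1. Create a long-lived PySpark session; this is the shared context.
session = requests.post(f"{LIVY}/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()
sid = session["id"]

# Wait until the session reports itself as idle.
while requests.get(f"{LIVY}/sessions/{sid}", headers=headers).json()["state"] != "idle":
    time.sleep(2)

# 2. Any number of separate scripts can now post statements against the same session.
code = "rdd = sc.textFile('data.txt'); rdd.count()"
stmt = requests.post(f"{LIVY}/sessions/{sid}/statements",
                     data=json.dumps({"code": code}),
                     headers=headers).json()
print(stmt["id"], stmt["state"])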
For those who may follow: I've recently discovered SnappyData.
SnappyData is still fairly young and there's a bit of a learning curve, but what it promises to do is make a persistent mutable SQL collection that can be shared between multiple Spark jobs and can be accessed natively as RDDs and DataFrames. It has a job server that you can dump concurrent jobs onto.
It's essentially a combination of a GemFire in-memory database with Spark clusters running locally in the same JVM, so (once I get decent at managing it) I can do large tasks without single-machine bottlenecks for piping data in and out of Spark, and I can even do live data manipulation while another Spark program is running on the same data.
I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.
I just wanted to educate myself more on Spark, so I wanted to ask this question.
I currently have Spark installed on my local machine, a Mac with 16 GB of RAM.
I have a Jupyter notebook connected to it and running with PySpark.
So now, when I do any coding in that notebook, like reading data and converting it into a Spark DataFrame, I wanted to check:
1) How is the dataset distributed on my local machine? Does Spark spread it across different CPU cores, etc.? Is there a way to find that out?
2) Is running the code and computation in a plain Jupyter notebook without Spark different from running a Jupyter notebook with PySpark? That is, does the first use just one core of the machine with a single thread, while the notebook with PySpark runs the code and computation on different CPU cores with multi-threading/processing? Is this understanding correct?
Is there a way to check these?
Thanks
Jupyter has three main parts: the Jupyter Notebook, Jupyter clients, and kernels. See http://jupyter.readthedocs.io/en/latest/architecture/how_jupyter_ipython_work.html
Here is a short note on kernels from the Jupyter documentation:
Kernels are processes that run interactive code in a particular programming language and return output to the user. Kernels also respond to tab completion and introspection requests.
Jupyter's job is to communicate between the kernel (a Python kernel, a Spark kernel, etc.) and the web interface (your notebook). So under the hood Spark is running the driver and executors for you, and Jupyter is just communicating with the Spark driver.
1) How is the dataset distributed on my local machine? Does Spark spread it across different CPU cores, etc.? Is there a way to find that out?
Spark will spawn the number of executors (the processes responsible for executing tasks) specified with --num-executors; the executors are managed by the Spark driver (the program/process responsible for running the job on the Spark engine). So the number of executors is set when you launch a Spark program, and you will find this setting in the kernel's configuration directory.
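As a quick way to check this from the notebook itself (assuming it already exposes a SparkSession named spark), you can inspect the context directly:

sc = spark.sparkContext

print(sc.master)               # e.g. local[*] -> use all CPU cores of the machine
print(sc.defaultParallelism)   # how many partitions Spark creates by default
print(sc.uiWebUrl)             # the monitoring UI mentioned below, usually port 4040
for key, value in sc.getConf().getAll():
    print(key, "=", value)     # the full effective configuration, incl. executor settings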
2) Is running the code and computation in a plain Jupyter notebook without Spark different from running a Jupyter notebook with PySpark? That is, does the first use just one core of the machine with a single thread, while the notebook with PySpark runs the code and computation on different CPU cores with multi-threading/processing? Is this understanding correct?
Yes. As I explained, Jupyter is merely an interface that lets you run code; under the hood it connects to kernels, whether plain Python or Apache Spark.
Spark has its own web UI for monitoring jobs; by default it is served by the driver on port 4040. In your case it will be available at http://localhost:4040
I have a highly threaded application running on Amazon EC2. I would like to convert this application to run on a cluster on EC2, and I would like to use StarCluster for this, as it makes managing the cluster easy.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know whether all of these libraries will work with StarCluster. Is there anything I need to keep in mind, like a dependency, when choosing a library, given that I want the application to work with StarCluster?
Basically, StarCluster is a tool to help you manage your cluster. It can add or remove nodes, place them within a placement group and security group, register them with Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work.
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster then there shouldn't be any problems.
If it's as an actor (making it directly do the computation by launching it with a command you would craft yourself and parallelizing its execution among the cluster) then I don't know. It might be possible, but StarCluster was not designed for it. We can read from the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.