Are all cluster computing libraries compatible with StarCluster? - python

I have a highly threaded application running on Amazon EC2. I would like to convert this application to run on a cluster on EC2, and I would like to use StarCluster for this since it makes the cluster easy to manage.
However, I am new to cluster/distributed computing. After googling I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know whether all of these libraries will work with StarCluster. Is there anything, such as a dependency, that I need to keep in mind when choosing a library, given that I want the application to work with StarCluster?

Basically, StarCluster is a tool to help you manage your cluster. It can add/remove nodes, place them within a placement group and security group, register them with Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work.
How were you intending to use StarCluster?
If it's as a watcher to load-balance your cluster, then there shouldn't be any problems.
If it's as an actor (making it do the computation directly by launching it with a command you craft yourself and parallelizing its execution across the cluster), then I don't know. It might be possible, but StarCluster was not designed for it. We can read on the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.
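As an aside, the plugin mechanism mentioned above is just a Python class. A minimal sketch, assuming the documented starcluster.clustersetup.ClusterSetup interface and a hypothetical package to install on every node:

```python
# Minimal StarCluster plugin sketch; follows the documented plugin interface,
# but treat the details as approximate.
from starcluster.clustersetup import ClusterSetup


class PackageInstaller(ClusterSetup):
    """Install an extra package on every node when the cluster comes up."""

    def __init__(self, pkg_to_install="htop"):  # hypothetical package name
        self.pkg_to_install = pkg_to_install

    def run(self, nodes, master, user, user_shell, volumes):
        for node in nodes:
            # Each node object exposes an ssh handle for running remote commands.
            node.ssh.execute("apt-get -y install %s" % self.pkg_to_install)
```

The plugin is then registered in the StarCluster config file so that it runs automatically whenever the cluster is started.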

Related

How do I decide which runner to use for Apache Beam?

I'm using Apache Beam 2.40.0 in python.
It has 10 different runners for jobs.
How do you choose which one to use? The DirectRunner seems like the easiest one to set up, but the docs claim it does not focus on efficient execution.
DirectRunner runs the pipeline on a single machine; it is hardly used in production. There is also an InteractiveRunner wrapper for the Python SDK that mostly uses the DirectRunner in an IPython/notebook environment to execute small pipelines interactively for learning and prototyping.
To process large amounts of data in a distributed manner, the runners that currently have the best support (documentation- and community-wise) and the most popularity are:
DataflowRunner: if you want to use Google Cloud services and want a more serverless experience without worrying about setting up your clusters.
FlinkRunner/SparkRunner: if you prefer setting up your own EMR solutions or using existing services that allow you to provision clusters with optional components (there are also serverless options for these runners out there).
As for other runners, you may refer to the runners section of the roadmap for the newest update.
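As an illustration, switching runners in the Python SDK is mostly a matter of pipeline options. A minimal sketch (the Dataflow-specific values in the comment are placeholders you would fill in):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner: fine for local prototyping on a single machine.
options = PipelineOptions(runner="DirectRunner")
# For Dataflow you would instead pass something like (placeholder values):
# options = PipelineOptions(
#     runner="DataflowRunner",
#     project="my-gcp-project",
#     region="us-central1",
#     temp_location="gs://my-bucket/tmp",
# )

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([1, 2, 3])
        | "Double" >> beam.Map(lambda x: x * 2)
        | "Print" >> beam.Map(print)
    )
```

The pipeline code itself stays the same; only the options (and the worker environment they imply) change.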

Airflow + Dask: task environment identification

In essence, I want to distribute Airflow DAGs with the DaskExecutor; however, I have failed to find detailed information on how this actually works. Some of my tasks are supposed to save files, and for that purpose there is a network share accessible from each server running a dask-worker. If my guess is correct, some tasks within one DAG could be executed on different servers with different paths to this share. The question is: how do I specify the path within a task depending on the worker and OS type executing it? If there is a better solution or another view on this problem, it would be great if you could share it.
UPD:
Following the comment by @mdurant: the network share is a Redis server on a machine running Debian, so there it appears as a directory; the other nodes, predominantly running Windows Server, mount the share as a network drive.
For now, the most appropriate solution I see is to use the Dask framework directly from within each high-load task, as that guarantees distribution of the computation across the cluster.
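For illustration, the kind of per-OS path resolution I have in mind looks roughly like this (the mount points are hypothetical):

```python
import platform
from pathlib import Path

# Hypothetical mount points for the same network share on each worker OS.
SHARE_ROOTS = {
    "Linux": Path("/mnt/shared"),   # e.g. the Debian node where the share lives
    "Windows": Path("Z:/"),         # e.g. Windows Server nodes mounting it as a drive
}


def share_path(relative: str) -> Path:
    """Return the worker-local path to a file on the network share."""
    try:
        root = SHARE_ROOTS[platform.system()]
    except KeyError:
        raise RuntimeError("No share mount configured for this OS")
    return root / relative


# Inside a task body:
# output_file = share_path("results/run_42.csv")
```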

Package python code dependencies for remote execution on the fly

My situation is as follows. We have:
an "image" with all our dependencies in terms of software + our in-house Python package
a "pod" in which that image is loaded on command (a Kubernetes pod)
a python project which has some uncommitted code of its own which leverages the in-house package
Also, please assume you cannot work on the machine (or cluster) directly (say, via a remote SSH interpreter). The cluster is multi-tenant and we want to optimize its use as much as possible, so no idle time between trials.
Also, for now, forget security issues: everything is locked down on our side, so no issues there.
We essentially want to "distribute" the workload remotely, i.e. a script.py (unfeasible on our local machines), without being constrained by git commits, and therefore be able to do it "on the fly". This is necessary as all of the changes are experimental in nature (think ETL/pipeline kinds of analysis): we want to be able to experiment at scale but without being bound to git.
I tried dill but could not manage to make it work (probably due to the structure of the code). Ideally, I would like to replicate the concept MLeap applies to ML pipelines for Spark, but on a much smaller scale: basically packaging, but with little to no constraints.
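For reference, the kind of dill round trip I attempted looks roughly like this (a minimal sketch; the function and data are purely illustrative):

```python
import dill


def experimental_step(rows):
    # Uncommitted, experimental logic living only in the local project.
    return [r for r in rows if r is not None]


# On the local machine: serialize the function, including closures and
# referenced globals, into a byte payload that can be shipped to the pod.
payload = dill.dumps(experimental_step, recurse=True)

# On the remote pod (which has the in-house image/dependencies available):
restored = dill.loads(payload)
print(restored([1, None, 2]))  # -> [1, 2]
```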
What would be the preferred route for this use case?

Linking docker containers to combine different libraries

Docker containers can be linked. Most examples involve linking a Redis container with an SQL container. The beauty of linking containers is that you can keep the SQL environment separate from your Redis environment, and instead of building one monolithic image one can maintain two nicely separate ones.
I can see how this works for server applications (where communication happens over ports), but I have trouble replicating a similar approach for different libraries. As a concrete example, I'd like to use a container with IPython Notebook together with the C/C++ library Caffe (which exposes a Python interface through a package in one of its subfolders) and an optimisation library such as Ipopt. Containers for IPython and Caffe readily exist, and I am currently working on a separate image for Ipopt. Yet how do I link the three together without building one giant monolithic Dockerfile? Caffe, IPython and Ipopt each have a range of dependencies, making combined maintenance a real nightmare.
My view on Docker containers is that each container typically represents one process, e.g. redis or nginx. Containers typically communicate with each other using networking or via shared files in volumes.
Each container runs its own user space on top of the host kernel (the base is typically specified by the FROM line in your Dockerfile). In your case, you are not running any specific processes; instead you simply wish to share libraries. This is not what Docker was designed for, and I am not even sure it is doable, but it certainly seems like a strange way of doing things.
My suggestion is therefore that you create a base image with the least common denominator (the shared libraries that are common to all other images) and have your other images use that image as their FROM image.
Furthermore, if you need a more complex setup of your environment with lots of dependencies and heavy provisioning, I suggest that you take a look at provisioning tools such as Chef or Puppet.
Docker linking is about linking microservices, that is separate processes, and has no relation to your question as far as I can see.
There is no out-of-the-box facility to compose separate Docker images into one container in the way you call "linking" in your question.
If you don't want to have that giant monolithic image, you might consider using provisioning tools à la Puppet, Chef or Ansible together with Docker. One example here. There you might theoretically make use of existing recipes/playbooks for the libraries you need. I would be surprised, though, if this approach turned out to be much easier for you than maintaining your "big monolithic" Dockerfile.

Synchronizing across machines for a Python APScheduler and WMI based Windows service

I am using APScheduler and WMI to create and install new Python-based Windows services, where the service determines the type of job to be run. The services are installed across all the machines on the same network. Given this scenario, I want to make sure that these services run on only one machine and not on all of them.
If a machine goes down, I still want the job to be run from another machine on the same network. How would I accomplish this?
I know I need some kind of synchronization across machines, but I am not sure how to approach it.
I tried to include functionality like this in APScheduler 2.0, but it didn't pan out. The biggest issue is handling concurrent access to jobs and making sure jobs get run even if a particular node crashes. The nodes also need to communicate somehow.
Are you sure you don't want to use Celery instead?
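For what it's worth, the Celery direction would look roughly like the sketch below: a shared broker holds the queued job runs and whichever worker is alive consumes each one, so no job is tied to a single machine. The broker URL and schedule are placeholders:

```python
from celery import Celery

# Hypothetical broker URL, reachable from every machine on the network.
app = Celery("jobs", broker="redis://broker-host:6379/0")


@app.task
def nightly_report():
    # Job body: exactly one of the available workers consumes each queued run,
    # so a single crashed machine does not stop the job for good.
    ...


# celery beat publishes the task on a schedule; workers on any machine pick it up.
app.conf.beat_schedule = {
    "nightly-report": {
        "task": nightly_report.name,
        "schedule": 24 * 60 * 60,  # once a day, in seconds
    },
}
```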
