Airflow + Dask: task environment identification - python

In essence, I want to distribute Airflow DAGs with DaskExecutor, but I have failed to find detailed information on how it actually works. Some of my tasks need to save files, and for that purpose there is a network share accessible from every server running dask-worker. If my guess is correct, tasks within one DAG could be executed on different servers, each with a different path to this share. The question is: how do I specify the path within a task depending on the worker and OS type executing it? If there is a better solution or a different view of this problem, it would be great if you shared it.
UPD:
Following the comment by @mdurant: the network share is a Redis server on a machine running Debian, so on that machine it appears as a directory. The other nodes mostly run Windows Server and mount the share as a network drive.
For now, the most appropriate solution I see is to use the Dask framework directly from within each high-load task, as it guarantees that the computations are distributed across the cluster.
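
One way to handle the differing mount points is to resolve the path inside the task itself, since the task body runs on whichever worker picks it up. The following is only a minimal sketch: the mount points in SHARE_ROOTS and the helper name share_path are hypothetical, and it assumes every worker of a given OS mounts the share at the same location.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical mount points; replace them with the real locations on your nodes.
SHARE_ROOTS = {
    "Linux": PurePosixPath("/srv/share"),
    "Windows": PureWindowsPath("Z:/"),
}


def share_path(*parts, system=None):
    """Build a path on the share for the OS of the machine executing this call.

    `system` exists only so the lookup can be exercised without running on
    that OS; inside a task, leave it unset so the worker's own OS is used.
    """
    system = system or platform.system()
    try:
        root = SHARE_ROOTS[system]
    except KeyError:
        raise RuntimeError(f"No share root configured for OS {system!r}") from None
    return root.joinpath(*parts)
```

Calling share_path("reports", "out.csv") at the top of a task callable, rather than at DAG definition time, means the lookup happens on the worker and not on the scheduler.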

Related

Package python code dependencies for remote execution on the fly

My situation is as follows. We have:
an "image" with all our software dependencies plus our in-house Python package
a "pod" into which that image is loaded on command (a Kubernetes pod)
a Python project with some uncommitted code of its own that leverages the in-house package
Also, please assume you cannot work on the machine (or cluster) directly (say, through a remote SSH interpreter). The cluster is multi-tenant and we want to use it as efficiently as possible, so there should be no idle time between trials.
For now, forget security issues; everything is locked down on our side, so there are no issues there.
Essentially, we want to "distribute" the workload remotely, i.e. a script.py that is infeasible to run on our local machines, without being constrained by git commits, so that we can do it "on the fly". This is necessary because all of the changes are experimental in nature (think ETL/pipeline-style analysis): we want to experiment at scale without being bound to git.
I tried dill but could not make it work (probably due to the structure of the code). Ideally, I would like to replicate the concept that MLeap applies to ML pipelines for Spark, but on a much smaller scale: basically packaging, with little to no constraints.
What would be the preferred route for this use case?
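
One possible route, sketched here with the standard library only, is to zip the working tree as it is (committed or not) and import from the archive on the remote side, since Python can import pure-Python modules directly from a zip file on sys.path. The function names are hypothetical, how the archive reaches the pod is left open, and this does not work for compiled extension modules.

```python
import importlib
import shutil
import sys


def package_sources(src_dir, out_base):
    """Zip everything under src_dir and return the path of the created archive.

    out_base is the archive path without the ".zip" suffix and should live
    outside src_dir so the archive does not end up inside itself.
    """
    return shutil.make_archive(str(out_base), "zip", root_dir=str(src_dir))


def load_from_archive(archive_path, module_name):
    """Put the archive first on sys.path and import a module from it."""
    sys.path.insert(0, str(archive_path))
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

On the local machine you would call package_sources on the project directory; inside the pod, load_from_archive (or simply setting PYTHONPATH to the archive before running script.py) picks up the uncommitted code while the in-house package still comes from the image.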

Run Python ZMQ Workers simultaneously

I am pretty new to Python and to distributed systems.
I am using the ZeroMQ Ventilator-Worker-Sink configuration:
Ventilator - Worker - Sink
Everything is working fine at the moment. My problem is that I need a lot of workers, and every worker does the same work.
At the moment every worker runs in its own Python file and has its own output console.
If I make program changes, I have to change (or copy) the code in every file.
The next problem is that I have to start every file by hand, so it is quite annoying to start 12 files.
What are the best solutions here? Threads, processes?
I should add that the goal is to run every worker on a different Raspberry Pi.
This appears to be more of a dev/ops problem. You have your worker code, which is presumably a single codebase, on multiple distributed machines or instances. You make a change to that codebase and you need the resulting code to be distributed to each instance, and then the process restarted.
To start, you should at minimum be using a source control system, like Git. With such a system you could at least go to each instance and pull the most recent commit and restart. Beyond that, you could set up a system like Ansible to go out and run those actions on each instance initiated from a single command.
There's a whole host of other tools, strategies and services that will help you do those things in a myriad of different ways. Using Docker to create a single worker container and then distribute and run that container on your various instances is probably one of the more popular ways to do what you're after, but it'll require a more fundamental change to your infrastructure.
Hope this helps.
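
While everything still runs on one machine, the copy-per-worker problem can be avoided by keeping a single worker script and starting it several times with a different ID. A minimal sketch of such a launcher follows; the --worker-id argument is a hypothetical convention that your worker script would have to parse itself (for example with argparse), and the ZeroMQ code inside the worker is unchanged.

```python
import subprocess
import sys


def build_commands(script, count):
    """Return one command line per worker; every worker runs the same script."""
    return [[sys.executable, script, "--worker-id", str(i)] for i in range(count)]


def launch(script, count):
    """Start all workers as separate processes and return their handles."""
    return [subprocess.Popen(cmd) for cmd in build_commands(script, count)]


def wait_all(procs):
    """Block until every worker has exited; return the exit codes in order."""
    return [proc.wait() for proc in procs]
```

Something like wait_all(launch("worker.py", 12)) replaces starting 12 files by hand. Once each worker lives on its own Raspberry Pi, the same single script is what you would pull or deploy to every device.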

Distributing jobs over multiple servers using python

I currently have an executable that, when running, uses all the cores on my server. I want to add another server and have the jobs split between the two machines, with each job still using all the cores of the machine it runs on. If both machines are busy, I need the next job to queue until one of the two machines becomes free.
I thought this might be controlled from Python; however, I am a novice and not sure which Python package would be best for this problem.
I liked the "heapq" module for queuing the jobs, but it looks like it is designed for use on a single server. I then looked into IPython.parallel, but it seemed more geared toward creating a separate smaller job for every core (on one or more servers).
I saw a huge list of different options here (https://wiki.python.org/moin/ParallelProcessing), but I could do with some guidance as to which way to go for a problem like this.
Can anyone suggest a package that may help with this problem, or a different way of approaching it?
Celery does exactly what you want - make it easy to distribute a task queue across multiple (many) machines.
See the Celery tutorial to get started.
Alternatively, IPython has its own multiprocessing library built in, based on ZeroMQ; see the introduction. I have not used this before, but it looks pretty straight-forward.
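
To make the scheduling behaviour concrete, here is a standard-library sketch of the pattern the question describes: one queue of jobs, and one consumer per machine so that each machine runs a single job at a time. It is a stand-in for illustration, not Celery; with Celery the rough equivalent would be one worker per machine started with --concurrency=1. The run_on_host callable is hypothetical and would wrap however you start the executable on a given server.

```python
import queue
import threading


def dispatch(jobs, hosts, run_on_host):
    """Run every job on whichever host is free, one job per host at a time.

    run_on_host(host, job) is supplied by the caller and must block until
    the job has finished on that host.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    results_lock = threading.Lock()

    def host_loop(host):
        # Each host keeps pulling the next queued job until none are left.
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return
            outcome = run_on_host(host, job)
            with results_lock:
                results.append((job, host, outcome))

    threads = [threading.Thread(target=host_loop, args=(h,)) for h in hosts]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each host has exactly one consumer thread, a third job waits in the queue until one of the two machines finishes, which is the queuing behaviour asked for.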

Synchronizing across machines for a Python APScheduler and WMI based Windows service

I am using APScheduler and WMI to create and install new Python-based Windows services, where the service determines the type of job to be run. The services are installed on all the machines on the same network. Given this scenario, I want to make sure that each job runs on only one machine and not on all of them.
If a machine goes down, I still want the job to be run from another machine on the same network. How would I accomplish this?
I know I need some kind of synchronization across machines, but I am not sure how to approach it.
I tried to include functionality like this in APScheduler 2.0 but it didn't pan out. Perhaps the biggest issue is handling concurrent access to jobs and making sure jobs get run even if a particular node crashes. The nodes also need to communicate somehow.
Are you sure you don't want to use Celery instead?
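
If you do want to keep the scheduler on every machine, one common way to let the nodes "communicate" is a shared store in which each node tries to claim a job run before executing it, so that only the first claim succeeds. The sketch below shows the idea with sqlite3 purely as a stand-in; across machines you would need a database server that all nodes can reach, and the table and function names here are hypothetical.

```python
import sqlite3


def init_store(conn):
    """Create the claims table; the primary key allows one claim per job run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_runs ("
        " job_id TEXT NOT NULL,"
        " run_at TEXT NOT NULL,"
        " node TEXT NOT NULL,"
        " PRIMARY KEY (job_id, run_at))"
    )
    conn.commit()


def try_claim(conn, job_id, run_at, node):
    """Return True if this node won the right to run job_id at run_at."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO job_runs (job_id, run_at, node) VALUES (?, ?, ?)",
                (job_id, run_at, node),
            )
        return True
    except sqlite3.IntegrityError:
        # Another node already inserted the row for this run.
        return False
```

Each service would call try_claim when a job fires and skip the run if it returns False. A machine that is down simply never claims, so another one wins. This does not cover a node that crashes after claiming but before finishing the job.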

Are all cluster computing libraries compatible with starcluster?

I have a highly threaded application running on Amazon EC2. I would like to convert this application to run on a cluster on EC2, and I would like to use StarCluster for this, as it makes the cluster easy to manage.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know whether all of these libraries will work with StarCluster. Is there anything I need to keep in mind, such as a dependency, when choosing a library, given that I want the application to work with StarCluster?
Basically, StarCluster is a tool to help you manage your cluster. It can add and remove nodes, place them within a placement group and a security group, register them with Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work.
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster then there shouldn't be any problems.
If it's as an actor (making it directly do the computation by launching it with a command you would craft yourself and parallelizing its execution among the cluster) then I don't know. It might be possible, but StarCluster was not designed for it. We can read from the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.
