Starting and stopping processes in a cluster - python

I'm writing software that runs a bunch of different programs (via twisted's twistd); that is, N daemons of various kinds must be started across multiple machines. If I did this manually, I would be running commands like twistd foo_worker, twistd bar_worker and so on, on the machines involved.
Basically there will be a list of machines, and the daemon(s) I need them to run. Additionally, I need to shut them all down when the need arises.
If I were to program this from scratch, I would write a "spawner" daemon that would run permanently on each machine in the cluster with the following features accessible through the network for an authenticated administrator client:
Start a process with a given command line. Return a handle to manage it.
Kill a process given a handle.
Optionally, query stuff like cpu time given a handle.
It would be fairly trivial to program the above, but I cannot imagine this is a new problem. Surely there are existing solutions that do exactly this? I do, however, lack experience with server administration, and don't even know what the related terms are.
What existing ways are there to do this on a linux cluster, and what are some of the important terms involved? Python specific solutions are welcome, but not necessary.
Another way to put it: Given a bunch of machines in a lan, how do I programmatically work with them as a cluster?

The most familiar and universal way is just to use ssh. To automate you could use fabric.
To start foo_worker on all hosts:
$ fab all_hosts start:foo_worker
To stop bar_worker on a particular list of hosts:
$ fab -H host1,host2 stop:bar_worker
Here's an example fabfile.py:
from fabric.api import env, run, hide  # pip install fabric

def all_hosts():
    env.hosts = ['host1', 'host2', 'host3']

def start(daemon):
    run("twistd --pidfile %s.pid %s" % (daemon, daemon))

def stop(daemon):
    run("kill %s" % getpid(daemon))

def getpid(daemon):
    with hide('stdout'):
        return run("cat %s.pid" % daemon)

def ps(daemon):
    """Get process info for the `daemon`."""
    run("ps --pid %s" % getpid(daemon))
There are a number of ways to configure host lists in fabric, with scopes varying from global to per-task, and it's possible to mix and match as needed.
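For example, a host list can also be attached to a single task with the hosts decorator; a rough sketch (the host names are placeholders):

from fabric.api import hosts, run

@hosts('host1', 'host2')  # per-task host list; overrides env.hosts for this task only
def start_workers():
    run("twistd --pidfile foo_worker.pid foo_worker")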
To streamline process management on a particular host you could write init.d scripts for the daemons (and run service daemon_name start/stop/restart) or use supervisord (and run supervisorctl, e.g., supervisorctl stop all). To control "what is installed where" and to push configuration in a centralized manner, something like puppet could be used.
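If you go the init.d/supervisord route, the same fabfile can drive those remotely too; a rough sketch (the task names and supervisord program names are whatever you choose):

def service(daemon, action="restart"):
    """Requires an init.d script for `daemon` on the target host."""
    run("service %s %s" % (daemon, action))

def supervisor(action="status", target="all"):
    """e.g. fab -H host1,host2 supervisor:stop,all"""
    run("supervisorctl %s %s" % (action, target))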

The usual tool is a batch queue system, such as SLURM, SGE, Torque/Moab, LSF, and so on.
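Driving one of these from Python usually comes down to shelling out to the scheduler's CLI; a minimal, hedged sketch for SLURM (flags are illustrative, partition and time limits omitted):

import subprocess

# submit a worker as a batch job; `twistd -n` stays in the foreground so SLURM can track and kill it
result = subprocess.run(
    ["sbatch", "--job-name=foo_worker", "--wrap", "twistd -n foo_worker"],
    capture_output=True, text=True, check=True)
print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
# later, to stop it: subprocess.run(["scancel", "--name=foo_worker"])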

Circus
Documentation: http://docs.circus.io/en/0.5/index.html
Code: http://pypi.python.org/pypi/circus/0.5
Summary from the documentation:
Circus is a process & socket manager. It can be used to monitor and control processes and sockets.
Circus can be driven via a command-line interface or programmatically through its Python API.
It shares some of the goals of Supervisord, BluePill and Daemontools. If you are curious about what Circus brings compared to other projects, read "Why should I use Circus instead of X?".
Circus is designed using ZeroMQ (http://www.zeromq.org/). See Design for more details.
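Circus can also be embedded in your own code; a minimal sketch adapted from the programmatic API shown in the documentation (the watcher options here are illustrative):

from circus import get_arbiter

# one watcher per daemon type; numprocesses scales the workers on this machine
arbiter = get_arbiter([{"cmd": "twistd -n foo_worker", "numprocesses": 1}])
try:
    arbiter.start()
finally:
    arbiter.stop()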

Related

Running scripts on multiple hosts concurrently with fabric

I am trying to create a program that creates multiple droplets, sends a script to each droplet, and initiates the execution of all of the scripts without waiting for the output. I have tried to run them in the background using nohup, so that they aren't killed off when I disconnect from the terminal, with the following code:
for i in range(len(script_names)):
    c = Connection(host=host[i], user=user[i],
                   connect_kwargs={"password": password, "key_filename": key_filename})
    c.run("nohup python3 /root/" + script_names[i] + " &")
I have tried other variations of the same idea, including setting pty=False and redirecting the output to /dev/null with "> /dev/null < /dev/null &", yet nothing seems to work.
Is it possible to issue multiple commands to run scripts on different hosts concurrently without waiting for the output with fabric? Or should I use another package?
Fabric 2.x's groups aren't fully fleshed out yet, so they aren't well suited for this use case. In Fabric 1.x I would accomplish this by using a dictionary for script_names, where the keys are the host strings from your host list and the values are the names currently in script_names. Then I'd have each task perform its run commands in parallel as usual, looking up values using fabric.api.env.host_string within the task. The execution layer of Fabric 2.x does not yet support this use case, as far as I know. This was my attempt at hacking it in, but the author rightly pointed out that this functionality should be handled in an Executor, which I could not come up with a solution for at the time: https://github.com/fabric/fabric/pull/1595
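One workaround in the meantime, assuming host, user and script_names are parallel lists and password/key_filename are defined as in your snippet, is to open one Connection per host from your own thread pool, so the nohup'd commands are dispatched without waiting on each other; a sketch, not a drop-in fix:

from concurrent.futures import ThreadPoolExecutor
from fabric import Connection

def launch(hostname, username, script):
    c = Connection(host=hostname, user=username,
                   connect_kwargs={"password": password, "key_filename": key_filename})
    # pty=False plus full redirection lets nohup detach, so run() returns right away
    c.run("nohup python3 /root/%s > /dev/null 2>&1 < /dev/null &" % script, pty=False)

with ThreadPoolExecutor() as pool:
    list(pool.map(launch, host, user, script_names))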

Python compiler call another python compiler to execute a script (execute a script from one independent machine to another)

I know the question title is weird!
I have two virtual machines. The first one has limited resources, while the second one has enough resources, just like a normal machine. The first machine will receive a signal from an external device. This signal will trigger a python compiler to execute a script. The script is big and the first machine does not have enough resources to execute it.
I can copy the script to the second machine to run it there, but I can't make the second machine receive the external signal. I am wondering if there is a way to make the compiler on the first machine (once the external signal is received) call the compiler on the second machine, so that the compiler on the second machine executes the script and uses the second machine's resources. Please check the attached image.
Assume that the connection is established between the two machines and they can see each other, and the second machine has a copy of the script. I just need the commands that pass the execution to the second machine and make it use its own resources.
You should look into the microservice architecture to do this.
You can achieve this either by using Flask and sending server requests between the machines, or with something like nameko, which will allow you to create a "bridge" between machines and call functions between them (which seems to be what you are more interested in). Example for nameko:
Machine 2 (executor of resource-intensive script):
from nameko.rpc import rpc

class Stuff(object):
    name = "Stuff"  # service name; matches the proxy call on Machine 1

    @rpc
    def example(self):
        return "Function running on Machine 2."
You would run the above service with the nameko run command, as detailed in the docs.
Machine 1:
from nameko.standalone.rpc import ClusterRpcProxy

# This is the AMQP server that machine 2 would be running.
config = {
    'AMQP_URI': AMQP_URI  # e.g. "pyamqp://guest:guest@localhost"
}

with ClusterRpcProxy(config) as cluster_rpc:
    cluster_rpc.Stuff.example()  # "Function running on Machine 2."
More info here.
Hmm, there are many approaches to this problem.
If you want a Python-only solution, you can check out dispy (http://dispy.sourceforge.net/) or Dask (https://dask.org/).
If you want a robust solution (what I use on my home computing cluster, but IMO overkill for your problem) you can use SLURM. SLURM is basically a way to string multiple computers together into a "supercomputer": https://slurm.schedmd.com/documentation.html
For a semi-quick, hacky solution, you can write a microservice. Essentially, your "weak" computer will receive the signal and then send an HTTP request to your "strong" computer. Your strong computer will contain the actual program, compute the results, and pass them back to your "weak" computer.
Flask is an easy and lightweight solution for this.
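A hedged sketch of that shape (endpoint, port and script path are all made up): the strong machine exposes one endpoint that runs the big script, and the weak machine fires a request whenever the signal arrives.

# strong machine: save as runner.py and start with `python3 runner.py`
from flask import Flask, jsonify
import subprocess

app = Flask(__name__)

@app.route("/run", methods=["POST"])
def run_script():
    # the big script lives on this machine; capture its output and send it back
    result = subprocess.run(["python3", "/root/big_script.py"],
                            capture_output=True, text=True)
    return jsonify({"returncode": result.returncode, "stdout": result.stdout})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

# weak machine: called from whatever handles the external signal
import requests
print(requests.post("http://strong-machine:5000/run", timeout=600).json())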
All of these solutions require some type of networking. At the least, the computers need to be on the same LAN or both have access over the web.
There are many other approaches not mentioned. For example, you could export an NFS (network file system) share, have one computer put a file in the shared folder, and have the other computer perform work on the file. I'm sure there are plenty of other contrived ways to accomplish this task :). I'd be happy to expand on a particular method if you want.

How to schedule R and Python scripts on a Unix server used by many users in a centralized way

I am finalizing a new Unix server with R/RStudio Server and Python/Anaconda/JupyterLab to let users (around 15 end users of different skill levels) run their own analyses, save output, and maybe schedule and run some jobs.
I need a way to schedule R and Python scripts and jobs (ideally a single tool that can handle both) in an easy and transparent way (so that low-skill users can use it too) that is centralized for all the users who get access to it.
I have seen cronR for R, but it seems to be aimed at single users and not centralizable.
Is there a way, or maybe some other open-source tool, that can be used to make scheduling transparent?
Ideally the scheduling would be time-based, but also trigger/event-based.
What firewall ports, shell access and commands should I have installed and configured?

Handling hardware resources when testing with Jenkins

I want to set up Jenkins to
1) pull our source code from our repository,
2) compile and build it
3) run the tests on an embedded device
Steps 1 & 2 are quite easy and straightforward with Jenkins.
As for step 3, we have hundreds of those devices, in various versions, and I'm looking for a utility (preferably in Python) that can handle the availability of hardware devices/resources,
in such a manner that one of the build steps will be able to find out which device is available and run the tests on it.
What I have found is that the best thing to do is to have something like Jenkins (or, if you're using the enterprise product, Electric Commander) manage a resource 'pool'. The pool is essentially a set of virtual devices, but each has a property such that you can call into a Python script with either an IP address or a serial port and communicate with your devices.
I used this for automated embedded testing on radios. The Python script managed a whole host of tests; commander would choose a single-step resource from the pool, that resource had an IP, and it would pass it into the Python script. The test would then perform all the tests, and the stdout would get stored in commander/jenkins. It also set properties to track pass/fail counts as the test was executing.
The main resource gets a single-step item from the pool; in the main resource I wrote a tiny script that asked whether the item pulled from the pool had resource name == "Bench1" .. "BenchX", etc.
basically:
if resource.name == "BENCH1":
    python myscript.py --com COM3 --baud 9600
...
etc.
The really great feature about doing it this way is that if you have to disconnect a device, you don't need to deliver script changes; you simply mark the commander/jenkins resource as disabled, and the main 'project' can still pull from what remains in your resource pool.
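If you do have to roll the pool yourself in Python, here is a hypothetical sketch of the idea (paths, inventory format and field names are all made up): each job grabs the first free bench via a lock file, runs the tests against it, then releases it.

import fcntl, json, subprocess

POOL = "/var/lib/bench_pool/devices.json"  # assumed inventory: [{"name": "BENCH1", "com": "COM3", "baud": 9600}, ...]

def acquire_device():
    """Return (device, lock handle) for the first free bench, or (None, None)."""
    with open(POOL) as f:
        devices = json.load(f)
    for dev in devices:
        lock = open("/var/lib/bench_pool/%s.lock" % dev["name"], "w")
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return dev, lock              # keep the handle open to hold the lock
        except BlockingIOError:
            lock.close()                  # another job owns this bench
    return None, None

dev, lock = acquire_device()
if dev:
    subprocess.run(["python", "myscript.py", "--com", dev["com"], "--baud", str(dev["baud"])])
    lock.close()                          # releases the flock, returning the bench to the pool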

Task Scheduling Across a Network?

Can you recommend a Python tool / module that allows scheduling tasks on remote machines in a network?
Note that the solution must be able not only to run certain jobs/commands on remote machines, but also to verify that the jobs etc. are still running (for example, consider the case where a machine dies after a task has been assigned to it).
RPyC, or Remote Python Call, is a transparent and symmetrical Python library for remote procedure calls, clustering and distributed computing. Here is an example from Wikipedia:
import rpyc
conn = rpyc.classic.connect("hostname")  # assuming a classic server is running on 'hostname'
print(conn.modules.sys.path)
conn.modules.sys.path.append("lucy")
print(conn.modules.sys.path[-1])

# a version of 'ls' that runs remotely
def remote_ls(path):
    ros = conn.modules.os
    for filename in ros.listdir(path):
        stats = ros.stat(ros.path.join(path, filename))
        print("%d\t%d\t%s" % (stats.st_size, stats.st_uid, filename))

remote_ls("/usr/bin")

# and exceptions...
try:
    f = conn.builtin.open("/non/existent/file/name")
except IOError:
    pass
To check if the remote server has died after assigning it a job, you can use the ping method of the Connection class. The complete API is described here.
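For instance, a minimal liveness check might look like this (the exception handling is deliberately broad, since a dead node can surface as several exception types):

import rpyc

conn = rpyc.classic.connect("hostname")
# ... dispatch work over conn ...
try:
    conn.ping(timeout=2)   # round-trips a dummy request
    print("node is alive")
except Exception:          # EOFError, timeout, etc.: treat the node as dead and reschedule
    print("node appears to be down")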
Fabric (http://docs.fabfile.org/en/1.0.1/index.html) is a pretty good toolkit for various sysadmin and deployment tasks. It comes with a few predefined tasks but also gives you the flexibility to add what you need.
I highly recommend it.
You should be able to use Python WMI for the Windows machines; for *NIX-based systems it would be a wrapper around SSH and cron.
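On the Windows side, a hedged sketch using Tim Golden's wmi package (host and credentials are placeholders; requires DCOM access to the remote machine):

import wmi  # pip install wmi (Windows only)

c = wmi.WMI(computer="winhost", user="admin", password="secret")
pid, result = c.Win32_Process.Create(CommandLine="notepad.exe")
print("started pid", pid, "return code", result)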
