I have to run a script on several machines in a compute cluster using SSH. But before I run the script, I have to log in to a node in the cluster using ssh and then use nvidia-smi to check which GPU is free (there is no job scheduler in place at the moment). Each node has several GPUs. So I typically access, say, gpu1 by issuing ssh gpu1, followed by nvidia-smi, which just outputs a list of GPUs with the processes and utilization of each.
I need to automate all this. Say we have four nodes: gpu1...gpu4.
I want to be able to ssh into each of these, check their utilization, and then run a Python script (run_test.py -arg1) on whichever GPU is free.
How can I write a Python script that does all of this?
I'm new to Python, so I'd appreciate some help.
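A rough sketch of one way to automate this with only the standard library: shell out to ssh, ask nvidia-smi for machine-readable utilization figures, and launch the script on the first idle GPU found. The host names, the idle threshold, and the use of CUDA_VISIBLE_DEVICES to pin the script to a GPU are all assumptions you would adapt to your cluster.

```python
import subprocess

NODES = ["gpu1", "gpu2", "gpu3", "gpu4"]  # hypothetical host names

def parse_utilization(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`
    into a {gpu_index: utilization_percent} dict."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        usage[int(index)] = int(util)
    return usage

def find_free_gpu(node, threshold=5):
    """Return the index of a GPU on `node` whose utilization is below
    `threshold` percent, or None if every GPU is busy."""
    out = subprocess.check_output(
        ["ssh", node, "nvidia-smi",
         "--query-gpu=index,utilization.gpu", "--format=csv,noheader,nounits"],
        text=True)
    for index, util in parse_utilization(out).items():
        if util < threshold:
            return index
    return None

if __name__ == "__main__":
    for node in NODES:
        gpu = find_free_gpu(node)
        if gpu is not None:
            # Pin the script to the free GPU via CUDA_VISIBLE_DEVICES.
            subprocess.check_call(
                ["ssh", node,
                 f"CUDA_VISIBLE_DEVICES={gpu} python run_test.py -arg1"])
            break
```

This assumes passwordless SSH keys are already set up for each node, since the script cannot type a password for you.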
I'm playing around with some Python deep learning packages (Theano/Lasagne/Keras). I've been running them on the CPU of my laptop, which takes a very long time to train the models.
For a while I was also using Amazon GPU instances, with an iPython notebook server running, which obviously ran much faster for full runs, but was pretty expensive to use for prototyping.
Is there any way to set things up so that I can prototype in iPython on my local machine, and then, when I have a large model to train, spin up a GPU instance, do all the processing/training on it, and then shut the instance down?
Is a setup like this possible, or does anyone have any suggestions to combine the convenience of the local machine with temporary processing on AWS?
My thoughts so far were along the lines of:

1. Prototype in a local ipython notebook.
2. Set up a cell to run a long process from start to finish.
3. Use boto to start up an EC2 instance, then ssh into the instance using boto's sshclient_from_instance:

       ssh_client = sshclient_from_instance(instance,
                                            key_path='<path to SSH keyfile>',
                                            user_name='ec2-user')

4. Get the contents of the cell I've set up using the solution here (say the script is in cell 13), then execute that script with:

       ssh_client.run('python -c "' + _i13 + '"')

5. Shut down the instance using boto.
This just seems a bit convoluted, is there a proper way to do this?
So when it comes to EC2, you don't have to shut down the instance every time. The beauty of AWS is that you can stop and start your instance when you use it, and you only pay for the time it is up and running. You can also try your code on a smaller, cheaper instance first, and if it's too slow for your liking, just scale up to a larger instance.
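With boto3 (the current AWS SDK, rather than the older boto mentioned above), stopping and starting an instance between work sessions might look roughly like the sketch below; the instance ID and region are placeholders.

```python
def set_instance_state(instance_id, start, region="us-east-1"):
    """Start (start=True) or stop (start=False) an EC2 instance and
    block until it reaches the target state."""
    import boto3  # imported here so the sketch reads without boto3 installed
    ec2 = boto3.client("ec2", region_name=region)
    if start:
        ec2.start_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    else:
        ec2.stop_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# e.g. set_instance_state("i-0123456789abcdef0", start=True)
```

Note that a stopped instance still accrues EBS storage charges, just not compute charges.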
I am using the Python yelp/mrjob framework for my MapReduce jobs. There is only about 4 GB of data, and I don't want to go through the trouble of setting up Hadoop or EMR. I have a 64-core machine, and it takes about 2 hours to process the data with mrjob. I notice mrjob assigns 54 mappers to my job, but it seems to run only one at a time. Is there a way to make mrjob run all tasks in parallel, using all my CPU cores?
I manually changed the number of tasks, but it didn't help much:
--jobconf mapred.map.tasks=10 --jobconf mapred.reduce.tasks=10
EDIT:
I pass -r local when I execute the job; however, looking at the code, it seems to default to running one process at a time. Please tell me I am wrong.
The local job runner for mrjob just spawns one subprocess for each MR stage, one for the mapper, one for the combiner (optionally), and one for the reducer, and passes data between them via a pipe. It is not designed to have any parallelism at all, so it will never take advantage of your 64 cores.
My suggestion would be to run Hadoop on your local machine and submit the job with the -r hadoop option. A Hadoop cluster running on your local machine in pseudo-distributed mode should be able to take advantage of your multiple cores.
See this question which addresses that topic: Full utilization of all cores in Hadoop pseudo-distributed mode
The runner for a job can be specified via the command line with the -r option.
When you run an mrjob script from the command line, the default run mode is inline, which runs your job on your local machine in a single process. The other obvious options for running jobs are emr and hadoop.
You can make the job run in parallel on your local machine by setting the runner to local:
$ python myjob.py -r local
Those --jobconf options are only recognised by Hadoop (i.e. on EMR or a Hadoop cluster).
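If the job is simple enough that Hadoop is overkill for 4 GB on one machine, another option worth naming (outside mrjob entirely) is the standard library's multiprocessing module, which will happily use all 64 cores for a map/reduce-style pass over local data. A word-count sketch, where the mapper and reducer stand in for your real logic:

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Mapper: count the words in one chunk of input lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reducer: merge the per-chunk counters into one total."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total

def word_count(lines, workers=4, chunk=1000):
    """Split the input into chunks, map them in parallel, then reduce."""
    chunks = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    with Pool(workers) as pool:
        return reduce_counts(pool.map(map_chunk, chunks))

if __name__ == "__main__":
    print(word_count(["a b a", "b c"]).most_common())
```

Setting workers=64 (or os.cpu_count()) would let the map phase use every core, at the cost of rewriting the job outside the mrjob framework.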
We run many Python scripts for data processing tasks. We have a modeling computer that has been upgraded to provide the best performance for these tasks, but it is shared by many people that all need to run different scripts on it at the same time.
Is it possible for me to run a Python script remotely on that machine from my laptop while others are either directly logged into it or also remotely running a script?
Is SSH a possibility? I haven't ever run any scripts remotely aside from logging in via remote desktop. Ideally, I could start the Python script on that remote machine, but all the messages would be visible to me on my laptop. Does this sound doable?
EDIT:
I forgot to mention all machines are running Windows 7.
SSH is definitely the way to go and also have a look at Fabric.
Regarding your edit: you can use Fabric on Windows, and I think that using SSH on Windows will be a bit easier than dancing with PowerShell's remoting capabilities.
SSH does seem like it should meet your needs.
You could also consider setting up an IPython notebook server that everyone could use.
It's got nice parallel-processing capabilities if you are doing some serious number crunching.
I am attempting to launch a couple of external applications from a Jenkins build step in Windows 7 64-bit. They are essentially programs designed to interact with each other and perform a series of regression tests on some software. Jenkins is run as Windows service as a user with admin privileges on the machine. I think that's full disclosure on any weirdness with my Jenkins installation.
I have written a Python 3 script that successfully does what I want when run from the Windows command line. But when I run this script as a Jenkins build step, I can see that the applications have been spawned via the Task Manager, but there is no CPU activity associated with them and no other evidence that they are actually doing anything (they produce log files, etc., but none of these appear). One of the applications typically runs at 25% CPU during the course of the regression tests.
The Python script itself runs to completion as if everything is OK, and Jenkins correctly monitors the output of the script, which I can watch in the job's console output. I'm using os.spawnv(os.P_NOWAIT, ...) for each of the external applications; the subprocess module doesn't do what I want, since I just want these programs to run externally.
I've even run a bash script via Cygwin that functionally does the same thing as the Python script with the same results. Any idea why these applications spawn but don't execute?
Thanks!
I want to execute a Python script on several (15+) remote machines using SSH. After invoking the script/command, I need to disconnect the SSH session and keep the processes running in the background for as long as they are required.
I have used Paramiko and PySSH in past so have no problems using them again. Only thing I need to know is how to disconnect a ssh session in python (since normally local script would wait for each remote machine to complete processing before moving on).
This might work, or something similar:
ssh user@remote.host 'nohup python scriptname.py > /dev/null 2>&1 &'
Basically, have a look at the nohup command.
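Since the question mentions Paramiko, the nohup trick can be combined with it; here is a sketch, with the host, user, and script names as placeholders. Redirecting the process's streams is what lets the SSH session close cleanly while the script keeps running.

```python
def run_detached(host, user, command, key_path=None):
    """Launch `command` under nohup on `host` over SSH, then disconnect.
    The remote process keeps running after the session closes."""
    import paramiko  # imported here so the sketch reads without paramiko installed
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, key_filename=key_path)
    # Detach stdout/stderr so sshd does not hold the session open.
    client.exec_command(f"nohup {command} > /dev/null 2>&1 &")
    client.close()

# e.g. run_detached('remote.host', 'user', 'python scriptname.py')
```

Looping over your 15+ hostnames and calling run_detached for each would then fire off all the jobs without waiting for any of them to finish.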
On Linux machines, you can run the script with 'at'.
echo "python scriptname.py" | at now
If you are going to perform repetitive tasks on many hosts, like for example deploying software and running setup scripts, you should consider using something like Fabric
Fabric is a Python (2.5 or higher) library and command-line tool for
streamlining the use of SSH for application deployment or systems
administration tasks.
It provides a basic suite of operations for executing local or remote
shell commands (normally or via sudo) and uploading/downloading files,
as well as auxiliary functionality such as prompting the running user
for input, or aborting execution.
Typical use involves creating a Python module containing one or more
functions, then executing them via the fab command-line tool.
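For example, a minimal fabfile sketch (Fabric 1.x API; the host names and the command being run are hypothetical):

```python
# fabfile.py -- run the task on every host with:  fab run_script
from fabric.api import env, run

env.hosts = ['user@machine1', 'user@machine2']  # hypothetical hosts

def run_script():
    # Executes on each host in env.hosts, streaming the output locally.
    run('python process_data.py')
```

The fab tool handles the SSH connections for you and prints each host's output with a per-host prefix.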
You can even use tmux in this scenario.
As per the tmux documentation:
tmux is a terminal multiplexer. It lets you switch easily between several programs in one terminal, detach them (they keep running in the background), reattach them to a different terminal, and do a lot more.
From a tmux session, you can run a script, quit the terminal, log in again, and check back on it; tmux keeps the session alive until the server restarts.
How to configure tmux on a cloud server