Command to run Theano on GPU (windows) - python

I am talking about the tutorial here: http://deeplearning.net/software/theano/tutorial/using_gpu.html
The code I use:
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
In the section Testing Theano with GPU:
There are some commands that set the Theano flags to run on CPU or GPU. The problem is I have no idea where to put these commands.
I have tried, in the Windows cmd,
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python theanogpu_example.py
then I get
'THEANO_FLAGS' is not recognized as an internal or external command,
operable program or batch file.
However, I am able to run the code on CPU using the command
python theanogpu_example.py
I want to run the code on GPU; what should I do (with these commands from the tutorial)?
SOLUTION
Thanks to @Blauelf for the idea of using a Windows environment variable.
However, the params have to be set separately:
set THEANO_FLAGS="mode=FAST_RUN" & set THEANO_FLAGS="device=gpu" & set THEANO_FLAGS="floatX=float32" & python theanogpu_example.py

From the docs, THEANO_FLAGS is an environment variable. So as you're on Windows, you might want to change
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python theanogpu_example.py
into
set THEANO_FLAGS="mode=FAST_RUN,device=gpu,floatX=float32" & python theanogpu_example.py
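If the cmd quoting keeps getting in the way, another option is to set the variable from inside the script itself, before Theano is imported (Theano reads THEANO_FLAGS from the environment at import time). A minimal sketch, assuming the rest of theanogpu_example.py stays unchanged:

import os
# Must run before the first "import theano"; the flags are read at import time.
os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=gpu,floatX=float32"

from theano import function, config, shared, sandbox
print(config.device)  # should print "gpu" if the CUDA backend is usable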

Related

How to run Scikit's gp_minimize in parallel?

I am unable to make skopt.gp_minimize run on multiple cores. According to the documentation, the parameter n_jobs should set the number of cores. However, setting n_jobs > 1 seems to have no effect. Here is a minimal example that reproduces the problem:
from skopt import gp_minimize
import time
import datetime
def J(paramlist):
    x = paramlist[0]
    time.sleep(5)
    return x**2
print "starting at "+str(datetime.datetime.now())
res = gp_minimize(J,                    # the function to minimize
                  [(-1.0, 1.0)],
                  acq_func="EI",        # the acquisition function
                  n_calls=10,           # the number of evaluations of f
                  n_random_starts=1,    # the number of random initialization points
                  random_state=1234,
                  acq_optimizer="lbfgs",
                  n_jobs=5,
                  )
print "ending at "+str(datetime.datetime.now())
I am trying to optimize J. In order to verify if calls to J happen in parallel, I put a delay in J. The optimizer is set up for 10 function calls so I'd expect it to run for ~50 seconds in series and ~10 seconds if executed on 5 cores as specified.
The output is:
starting at 2022-11-28 12:32:30.954389
ending at 2022-11-28 12:33:23.403255
meaning that the runtime was 53 seconds and it did not run in parallel. I was wondering whether I'm missing something in the optimizer. I use Anaconda with the following scikit versions:
conda list | grep scikit
scikit-learn 0.19.2 py27_blas_openblasha84fab4_201 [blas_openblas] conda-forge
scikit-optimize 0.3 py27_0 conda-forge
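As far as I can tell from the scikit-optimize docs, n_jobs only parallelizes the L-BFGS runs over the acquisition function, not the calls to the objective itself; parallel objective evaluation is normally done through the ask/tell interface instead. A minimal sketch of that pattern, assuming a newer scikit-optimize release than the 0.3 shown above (one that supports ask(n_points=...)) and that joblib is installed:

from joblib import Parallel, delayed
from skopt import Optimizer

def J(paramlist):
    x = paramlist[0]
    return x ** 2

opt = Optimizer([(-1.0, 1.0)], acq_func="EI", random_state=1234)
for _ in range(2):                       # 2 rounds x 5 points = 10 evaluations
    points = opt.ask(n_points=5)         # propose 5 candidate points per round
    values = Parallel(n_jobs=5)(delayed(J)(p) for p in points)  # evaluate in parallel
    res = opt.tell(points, values)       # feed the results back to the optimizer
print(res.fun)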

automatically choose a device keras tensorflow [duplicate]

I have access through ssh to a cluster of n GPUs. Tensorflow automatically gave them names gpu:0,...,gpu:(n-1).
Others have access too and sometimes they take random gpus.
I did not place any tf.device() explicitly because that is cumbersome, and even if I selected GPU number j and someone is already on GPU number j, that would be problematic.
I would like to go through the GPUs' usage, find the first one that is unused, and use only that one.
I guess someone could parse the output of nvidia-smi with bash, get a variable i, and feed that variable i to the TensorFlow script as the number of the GPU to use.
I have never seen any example of this. I imagine it is a pretty common problem. What would be the simplest way to do that? Is a pure TensorFlow one available?
I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, for GPU memory, a GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place to add it, and there's no mechanism for process-global config (but there should be, to also be able to configure a process-global Eigen threadpool). So you need to do this at the process level by using the CUDA_VISIBLE_DEVICES environment variable.
Something like this:
import subprocess, re
# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23
def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python    11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns the GPU with the least allocated memory."""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu
You can then put it in utils.py and set the GPU in your TensorFlow script before the first TensorFlow import, i.e.
import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
An implementation along the lines of Yaroslav Bulatov's solution is available at https://github.com/bamos/setGPU.

How to run a Particle Swarm Optimization using Python and Abaqus in Cluster

Regards,
I apologize beforehand for the lengthy post:
My question: How do you modify the loop between * * * in errFunction, the runABQfile function (subprocess.call), and the bash script below so that I can run a PSO optimization in a cluster?
The Background: I am calibrating a model using Particle Swarm Optimization (PSO) written in Python and ABAQUS with a VUMAT (user material). A Python script updates the input files of N different ABAQUS models (which correspond to N different experiments) for each iteration and should run each of the N models until the global error between experiments and models is minimized. I am running this optimization on a cluster where I do not have admin privileges.
Assume I have a working main script main.py that imports the necessary modules, initializes variables, and reads the experimental data before calling a function from PSO.py using
XOpt, FOpt = pso(errFunction, lb, ub, f_ieqcons=mycons, args=args)
The target function errFunction to be minimized runs all N models using the runABQfile function and returns the global error on each iteration to the PSO function. A brief view of the structure of my code is shown below (I left out parts that are not relevant).
def errFunction(param2Calibrate,otherProps,InputFiles,experimentData,otherArgs):
    maxNpts = otherArgs[0]
    nAnalysis = otherArgs[1]
    # Run Each Abaqus Simulation
    inpFile = [0] * nAnalysis
    abqDisp = [[0 for x in range(maxNpts)] for y in range(nAnalysis)]
    abqForce = [[0 for x in range(maxNpts)] for y in range(nAnalysis)]
    iexpForce = [[0 for x in range(maxNpts)] for y in range(nAnalysis)]
    # ***********************************#
    # - Update and Run Each Input File - #
    for r in range(nParallelLoops):
        for k in range( r*nAnalysis/nParallelLoops, (r+1)*nAnalysis/nParallelLoops ):
            # - Write and Run Abaqus INP file - #
            inpFile[k] = writeABQfile(param2Calibrate,otherProps[k],InputFiles[k])
            runABQfile(inpFile[k])
            # - Extract from Abaqus ODB - #
            abqDisp_, abqForce_ = extraction(inpFile[k])
            abqDisp[k][0:len(abqDisp_)] = abqDisp_
            abqForce[k][0:len(abqForce_)] = abqForce_
    # ***********************************#
    # - Interpolate Experimental Results to Match Abaqus - #
    for k in range(0,nAnalysis):
        iexpForce_ = interpolate(experimentData[k],abqDisp[:][k])
        iexpForce[k][0:len(abqDisp_)] = iexpForce_
    # - Get Error - #
    for k in range(0,nAnalysis):
        Err[k] = Error(iexpForce[:][k],abqDisp[:][k],abqForce[:][k])
    return Err
And runABQfile is set up as follows, where 2 processes are to run in series:
def runABQfile(inpFile):
    import subprocess
    import os
    # - Run Abaqus - #
    ABQexe = '/opt/abaqus/6.14-1/code/bin/abq6141'
    prcStr1 = (ABQexe+' '+'job='+inpFile+' input='+inpFile+' \
        user=$HOME/mPDFvumatNED.f scratch=/scratch/$USER/$SLURM_JOBID \
        cpus=12 parallel=domain domains=12 mp_mode=mpi memory=60000mb \
        interactive double=both')
    prcStr2 = (ABQexe+' '+'cae noGUI='+inpFile+'_CAE.py')
    process = subprocess.call(prcStr1,stdin=None,stdout=None,stderr=None,shell=True)
    process = subprocess.call(prcStr2,shell=True)
Where the problem seems to be: I have access to a maximum of 2 nodes with 24 CPUs per job (restricted by the number of ABAQUS licenses). If I were to run a single analysis, I'd queue the job using SLURM with the following script.
#!/bin/bash
#SBATCH --job-name="abaqus"
#SBATCH --output="abaqus.%j.%N.out"
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=24
#SBATCH -L abaqus:25
#SBATCH -t 00:30:00
#Get the env file setup
scontrol show hostname > file-list1
scontrol show hostlist > file-list2
HOST1=`sed -n '1p' file-list1`
HOST2=`sed -n '2p' file-list1`
cat abq_v6.env |sed -e "s/host1/$HOST1/g" > ttt1.env
cat ttt1.env | sed -e "s/host2/$HOST2/g" > abaqus_v6.env
rm ttt*env
#Run the executable remotely
sed "s/DUMMY/$SLURM_JOBID/g" s4b.sh.orig > s4b.sh
chmod u+x s4b.sh
export EXHOST=`/bin/hostname`
ssh $EXHOST $SLURM_SUBMIT_DIR/s4b.sh
where s4b.sh.orig looks like this:
#!/bin/bash -l
cd /share/apps/examples/ABAQUS/s4b_multinode
module purge
module load abaqus/6.14-1
export EXE=abq6141
$EXE job=s4b scratch=/scratch/$USER/DUMMY cpus=48 -verbose 3 \
standard_parallel=all mp_mode=mpi memory=120000mb interactive
This script setup is the only way to submit one ABAQUS job that runs on multiple nodes on that cluster, because of problems with the ABAQUS environment file and SLURM (my guess is that the mp_host_list is not being properly assigned or it is oversubscribed, but honestly I do not understand what is going on).
I modified my runABQfile function to use the bash construct above when calling subprocess.call, to something like this:
prcStr1 = ('sed "s/DUMMY/$SLURM_JOBID/g" s4b.sh.orig > s4b0.sh; \
sed "s/MODEL/inpFile/g" s4b0.sh > s4b1.sh; \
chmod u+x s4b1.sh; \
export EXHOST=`/bin/hostname`; \
ssh $EXHOST $SLURM_SUBMIT_DIR/s4b1.sh' )
process = subprocess.call(prcStr1,stdin=None,stdout=None,stderr=None,shell=True)
But the optimization never starts and quits right after modifying the first script.
Now the question again is: how do you modify the loop between * * * in errFunction, the runABQfile function (subprocess.call), and the bash script so that I can run this optimization? I would like to use at least 12 processors per ABAQUS model, potentially running 4 jobs at the same time. Keep in mind all N models need to run and finish before moving on to the next iteration. (One possible direction for the Python side is sketched after this post.)
I will appreciate any help you guys could provide.
Sincerely,
D P.
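Not a full cluster answer, but one way to restructure the loop between * * * on the Python side: if each runABQfile call blocks until its ABAQUS job has finished, the per-model work can be fanned out with a thread pool so that nParallelLoops models run concurrently, and the main process waits for all of them before interpolating and computing the error. A minimal sketch using only names from the question plus an illustrative helper run_one (it would sit inside errFunction so it closes over the arguments; the SLURM/licensing side still has to allow the concurrent submissions):

from multiprocessing.pool import ThreadPool

def run_one(k):
    # Write, run, and post-process a single ABAQUS model; returns (k, disp, force).
    inp = writeABQfile(param2Calibrate, otherProps[k], InputFiles[k])
    runABQfile(inp)                       # must block until the ABAQUS job finishes
    abqDisp_, abqForce_ = extraction(inp)
    return k, abqDisp_, abqForce_

pool = ThreadPool(processes=nParallelLoops)    # e.g. 4 concurrent models, 12 CPUs each
for k, abqDisp_, abqForce_ in pool.map(run_one, range(nAnalysis)):
    abqDisp[k][0:len(abqDisp_)] = abqDisp_
    abqForce[k][0:len(abqForce_)] = abqForce_
pool.close()
pool.join()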

How to configure Theano on Windows?

I have Installed Theano on Windows machine and followed the configuration instructions.
I placed the following .theanorc.txt file in C:\Users\my_username folder:
[global]
device = gpu
floatX = float32
[nvcc]
fastmath = True
# flags=-m32 # we have this hard coded for now
[blas]
ldflags =
# ldflags = -lopenblas # placeholder for openblas support
I tried to run the test, but haven't managed to run it on GPU. I guess the values from .theanorc.txt are not read, because I added the line print config.device and it outputs "cpu".
Below is the basic test script and the output:
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
print config.device
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
output:
pydev debugger: starting (pid: 9564)
cpu
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 10.0310001373 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
I have installed CUDA Toolkit successfully but haven't managed to install pyCUDA. I guess Theano should work without pyCUDA installed anyway.
I would be very thankful if anyone could help out solving this problem. I have followed these instructions but don't know why the configuration values in the program don't match the values in .theanorc.txt file.
Contrary to what has been said on a couple of pages, my installation (Windows 10, Python 2.7, Theano 0.10.0.dev1) would not interpret config instructions within a .theanorc.txt file in my user profile folder, but would read a .theanorc file.
If you are having trouble creating a file with that style of name, use the following commands at a terminal:
cd %USERPROFILE%
type NUL > .theanorc
Source: http://ankivil.com/making-theano-faster-with-cudnn-and-cnmem-on-windows-10/
You are right that Theano does not need PyCUDA.
It is strange that Theano does not read your configuration file. You can check the exact path that gets read by running this in Python; it will show you where to put the file:
import os
print(os.path.expanduser('~/.theanorc.txt'))
Try changing the content of .theanorc.txt as indicated on the Theano website (http://deeplearning.net/software/theano/install_windows.html). The paths need to be changed according to your installation.
[global]
floatX = float32
device = gpu
[nvcc]
flags=-LC:\Users\cchan\Anaconda3\libs
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin

Calling function using Timeit

I'm trying to time several things in Python, including upload time to Amazon's S3 cloud storage, and am having a little trouble. I can time my hash, and a few other things, but not the upload. I thought this post would finally get me there, but I can't seem to find salvation. Any help would be appreciated. Very new to Python, thanks!
import timeit
import boto                      # assumes boto (v2) is installed
from boto.s3.key import Key

accKey = r"xxxxxxxxxxx"
secKey = r"yyyyyyyyyyyyyyyyyyyyyyyyy"
bucket_name = 'sweet_data'
c = boto.connect_s3(accKey, secKey)
b = c.get_bucket(bucket_name)
k = Key(b)
p = '/my/aws.path'
f = 'C:\\my.file'

def upload_data(p, f):
    k.key = p
    k.set_contents_from_filename(f)
    return
t = timeit.Timer(lambda: upload_data(p, f), "from aws_lib import upload_data; p=%r; f = %r" % (p,f))
# Just calling the function works fine
#upload_data(p, f)
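Side note on the snippet above: the Timer is built but never actually run, so nothing gets measured; the measurement comes from calling timeit() (or repeat()) on it. Also, because a lambda is passed as the statement, the setup string is only executed for its side effects, so the aws_lib import in it has to exist (or the setup argument can simply be dropped). A minimal sketch, assuming the boto setup above works:

# Run the upload once and report the elapsed wall-clock time in seconds.
print(t.timeit(number=1))

# Or run it a few times and keep the best (least-disturbed) measurement.
print(min(t.repeat(repeat=3, number=1)))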
I know this is heresy in the Python community, but I actually recommend not using timeit, especially for something like this. For your purposes, I believe it will be good enough (and possibly even better than timeit!) if you simply use time.time() to time things. In other words, do something like
from time import time
t0 = time()
myfunc()
t1 = time()
print t1 - t0
Note that depending on your platform, you might want to try time.clock() instead (see Stack Overflow questions such as this and this), and if you're on Python 3.3, then you have better options, due to PEP 418.
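On Python 3.3+, the PEP 418 clocks make this simpler: time.perf_counter() is the usual choice for wall-clock benchmarking. A minimal sketch (myfunc stands in for whatever is being measured, e.g. the S3 upload):

from time import perf_counter  # Python 3.3+, PEP 418

t0 = perf_counter()
myfunc()                       # the code under test
t1 = perf_counter()
print(t1 - t0)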
You can use the command line interface to timeit.
Just save your code as a module without the timing stuff. For example:
# file: test.py
data = range(5)
def foo(l):
    return sum(l)
Then you can run the timing code from the command line, like this:
$ python -mtimeit -s 'import test;' 'test.foo(test.data)'
See also:
http://docs.python.org/2/library/timeit.html#command-line-interface
http://docs.python.org/2/library/timeit.html#examples
