I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a GUI -- I can check that the data and analysis code is doing what it should. However, I've noticed that when I use these classes for GUI-less computations, all the CPUs on the system end up getting used.
Here is a small example that shows the CPU usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations

"""
Small example of high CPU usage by traited classes
"""

class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List

    def _samples_default(self):
        return np.random.randn(self.nsamples, 2000).tolist()

    def sample_samples(self, indices):
        """ return a 2D array of data at indices """
        return np.array(
            [self.samples[i] for i in indices])

class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation", "covariance")
    data_source = Instance(DataStorage, ())

    def compute_measure(self, indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage

# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000), 500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break

print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-cpu machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other CPUs available to split up the job. Does anyone know what's happening here? A plain Python version of this uses 100% of a single CPU.
Is there a way to keep everything local to a single CPU without rewriting all my classes to drop Traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation in numpy.cov() is a numpy.dot() call, which performs a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, it will usually use non-Python threads internally to split these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
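For most builds, the thread count can be capped with environment variables set before numpy is imported; which variable takes effect depends on the BLAS library numpy is linked against (OpenBLAS, MKL, or a generic OpenMP build), so treat this as a sketch to adapt:
import os

# Limit BLAS/OpenMP threads *before* numpy is imported; only the variable
# matching your BLAS library actually matters.
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL
os.environ["OMP_NUM_THREADS"] = "1"        # generic OpenMP-based BLAS

import numpy as np

a = np.random.randn(2000, 2000)
c = np.dot(a, a.T)   # this matrix product should now stay on a single core
The same variables can also be exported in the shell before launching Python.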
Related
I am writing a genetic optimization algorithm based on the deap package in Python 2.7 (the goal is to migrate to Python 3 soon). As it is a pretty heavy process, some parts of the optimisation are run using the multiprocessing package. Here is a summary outline of my program:
Configurations are read in and saved in a config object
Some additional pre-computations are made and saved as well in the config object
The optimisation starts (the population is initialized randomly, and mutation and crossover are applied to find a better solution), and some parts of it (the evaluation function) are executed with multiprocessing
The results are saved
For the evaluation function, we need to have access to some parts of the config object (which after phase 2 stays a constant). Therefore we make it accessible to the different cores using a global (constant) variable:
from deap import base
import multiprocessing

toolbox = base.Toolbox()

def evaluate(ind):
    # compute evaluation using config object
    return (obj1, obj2)

toolbox.register('evaluate', evaluate)

def init_pool_global_vars(self, _config):
    global config
    config = _config

...
# setting up multiprocessing
pool = multiprocessing.Pool(processes=72, initializer=self.init_pool_global_vars,
                            initargs=[config])
toolbox.register('map', pool.map_async)
...
while tic < max_time:
    # creating new individuals
    # computing in optimisation the objective function on the different individuals
    jobs = toolbox.map(toolbox.evaluate, ind)
    fits = jobs.get()
    # keeping best individuals
We basically run many iterations (a big for loop) until a maximum time is reached. I have noticed that if I make the config object bigger (i.e. add big attributes to it, like a big numpy array), even though the code is otherwise the same, it runs much slower (fewer iterations in the same timespan). So I thought I would make a specific config_multiprocessing object that contains only the attributes needed in the multiprocessing part and pass that as a global variable, but when I run it on 3 cores it is slower than with the big config object, and on 72 cores it is slightly faster, but not by much.
What should I do in order to make sure my loops don't suffer in speed from the config object or from any other data manipulations I make before launching the multiprocessing loops?
Running in a Linux docker image on a linux VM in the cloud.
The joblib package is designed to handle cases where you have large numpy arrays to distribute to workers with shared memory. This is especially useful if you are treating the data in shared memory as "read-only" like what you describe in your scenario. You can also create writable shared memory as described in the docs.
Your code might look something like:
import os
import numpy as np
from joblib import Parallel, delayed
from joblib import dump, load

folder = './joblib_memmap'
try:
    os.mkdir(folder)
except FileExistsError:
    pass

def evaluate(ind, data):
    # compute evaluation using shared memory data
    return (obj1, obj2)

# just used to initialize memory mapped data
def init_memmap_data(original_data):
    data_filename_memmap = os.path.join(folder, 'data_memmap')
    dump(original_data, data_filename_memmap)
    shared_data = load(data_filename_memmap, mmap_mode='r')
    return shared_data

...
# however you set up indices needs to be changed here
indexes = range(10)
# however you load your numpy data needs to be done here
shared_data = init_memmap_data(numpy_array_to_share)

# change n_jobs as appropriate
results = Parallel(n_jobs=2)(delayed(evaluate)(ind, shared_data) for ind in indexes)

# get index of the maximum as the "best" individual
# (results is a plain list, so use np.argmax rather than a method on it)
best_fit_individual = indexes[np.argmax(results)]
Additionally, joblib supports a threading backend that may be faster than the process based one. It will be easy to test both with joblib.
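As a minimal sketch of trying that (reusing the hypothetical evaluate, shared_data and indexes names from the block above; the prefer= argument needs a reasonably recent joblib):
from joblib import Parallel, delayed, parallel_backend

# Option 1: hint that threads are preferred for this call
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(evaluate)(ind, shared_data) for ind in indexes)

# Option 2: force the threading backend for everything inside the context
with parallel_backend("threading", n_jobs=2):
    results = Parallel()(delayed(evaluate)(ind, shared_data) for ind in indexes)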
I would like to call model.wv.most_similar_cosmul, on the same copy of model object, using multiple cores, on batches of input pairs.
The multiprocessing module requires multiple copies of the model, which would take too much RAM because my model is 30+ GB in RAM.
I have tried evaluating my query pairs. It took me ~12 hours for the first round. There may be more rounds coming. That's why I am looking for a threading solution. I understand Python has the Global Interpreter Lock issue.
Any suggestions?
Forking off processes using multiprocessing after your text-vector model is in memory and unchanging might work to let many processes share the same object-in-memory.
In particular, you'd want to be sure that the automatic generation of unit-normed vectors (into a syn0norm or doctag_syn0norm) has already happened. It'll be automatically triggered the first time it's needed by a most_similar() call, or you can force it with the init_sims() method on the relevant object. If you'll only be doing most-similar queries between unit-normed vectors, never needing the original raw vectors, use init_sims(replace=True) to clobber the raw mixed-magnitude syn0 vectors in-place and thus save a lot of addressable memory.
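As a rough sketch of that fork-based approach (gensim 3.x-era API; the vector file path and query batches here are made up for illustration), load and norm the vectors once in the parent, then let forked workers read them:
import multiprocessing
from gensim.models import KeyedVectors

# Load once in the parent process (hypothetical path).
wv = KeyedVectors.load("my_vectors.kv")
# Pre-compute unit-normed vectors and discard the raw ones to save memory.
wv.init_sims(replace=True)

def most_similar_batch(word_pairs):
    # Forked child processes (the default on Linux) share wv's memory pages
    # with the parent as long as they only read from it.
    return [wv.most_similar_cosmul(positive=[a, b], topn=10) for a, b in word_pairs]

if __name__ == "__main__":
    batches = [[("king", "woman")], [("paris", "italy")]]  # made-up query batches
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(most_similar_batch, batches)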
Gensim also has options to use memory-mapped files as the sources of model giant arrays, and when multiple processes use the same read-only memory-mapped file, the OS will be smart enough to only map that file into physical memory once, providing both processes pointers to the shared array.
For more discussion of the tricky parts of using this technique in a similar-but-not-identical use case, see my answer at:
How to speed up Gensim Word2vec model load time?
Gensim v4.x.x simplified a lot of what @gojomo described above, as he also explained in his other answer here. Based on those answers, here's an example of how you can multiprocess most_similar in a memory-efficient way, including logging of progress with tqdm. Swap in your own model/dataset to see how this works at scale.
import multiprocessing
from functools import partial
from typing import Dict, List, Tuple

import tqdm
from gensim.models.word2vec import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import common_texts


def get_most_similar(
    word: str, keyed_vectors: KeyedVectors, topn: int
) -> List[Tuple[str, float]]:
    try:
        return keyed_vectors.most_similar(word, topn=topn)
    except KeyError:
        return []


def get_most_similar_batch(
    word_batch: List[str], word_vectors_path: str, topn: int
) -> Dict[str, List[Tuple[str, float]]]:
    # Load the keyedvectors with mmap, so memory isn't duplicated
    keyed_vectors = KeyedVectors.load(word_vectors_path, mmap="r")
    return {word: get_most_similar(word, keyed_vectors, topn) for word in word_batch}


def create_batches_from_iterable(iterable, batch_size=1000):
    return [iterable[i : i + batch_size] for i in range(0, len(iterable), batch_size)]


if __name__ == "__main__":
    model = Word2Vec(
        sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4
    )

    # Save wv, so it can be reloaded with mmap later
    word_vectors_path = "word2vec.wordvectors"
    model.wv.save(word_vectors_path)

    # Dummy set of words to find most similar words for
    words_to_match = list(model.wv.key_to_index.keys())

    # Multiprocess
    batches = create_batches_from_iterable(words_to_match, batch_size=2)
    partial_func = partial(
        get_most_similar_batch,
        word_vectors_path=word_vectors_path,
        topn=5,
    )
    words_most_similar = dict()
    num_workers = multiprocessing.cpu_count()
    with multiprocessing.Pool(num_workers) as pool:
        max_ = len(batches)
        with tqdm.tqdm(total=max_) as pbar:
            # imap required for tqdm to function properly
            for result in pool.imap(partial_func, batches):
                words_most_similar.update(result)
                pbar.update()
Given 2 large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that would return, for each element of "source", the index of its closest element in "destination", with this limitation: I can only use numpy... So no scipy, pandas, numexpr, cython...
To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination, and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both threaded and unthreaded functions and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations - point)   # delta between destinations and given point
        d = inner1d(dlt, dlt)          # get distances
        return np.argmin(d)            # return closest index

    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    #for p in sources:
    for i in range(len(sources)):
        dlt = (destinations - sources[i])  # difference between destinations and given point
        d = inner1d(dlt, dlt)              # get distances
        results.append(np.argmin(d))       # append closest index
    return results

# Setup the data
n_destinations = 10000  # 10k random destinations
n_sources = 10000       # 10k random sources
destinations = np.random.rand(n_destinations, 3) * 100
sources = np.random.rand(n_sources, 3) * 100

# Compare!
print 'threaded: %s' % timeit.Timer(lambda: threaded(sources, destinations)).repeat(1, 1)[0]
print 'unthreaded: %s' % timeit.Timer(lambda: unthreaded(sources, destinations)).repeat(1, 1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x speedup, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!
OK, I've been reading the Maya documentation on Python, and I came to these conclusions/guesses:
They're probably using CPython inside (several references to that documentation and not any other).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. Because of the GIL, this is a common problem, and there are several ways to work around it:
Try to build a C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd only try to get SIP to work, and then move on.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get to a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes (a rough sketch follows below).
One of the above should work for you. If not, try another parallel tool (after some serious praying).
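To make the multiprocessing suggestion concrete for the nearest-neighbour problem above, here's a minimal numpy-only sketch (my own chunking and function names; inner1d is replaced with an equivalent einsum):
import multiprocessing
import numpy as np

def closest_indices(args):
    # For a chunk of source points, return the index of the closest destination point.
    chunk, destinations = args
    out = []
    for point in chunk:
        dlt = destinations - point            # deltas from this point to every destination
        d = np.einsum('ij,ij->i', dlt, dlt)   # squared distances (same result as inner1d)
        out.append(np.argmin(d))
    return out

if __name__ == '__main__':
    destinations = np.random.rand(10000, 3) * 100
    sources = np.random.rand(10000, 3) * 100

    n_workers = multiprocessing.cpu_count()
    chunks = np.array_split(sources, n_workers)
    pool = multiprocessing.Pool(n_workers)
    # Each worker receives its own pickled copy of destinations along with its chunk.
    per_chunk = pool.map(closest_indices, [(c, destinations) for c in chunks])
    pool.close()
    pool.join()
    indices = [i for chunk_result in per_chunk for i in chunk_result]
Unlike the ThreadPool version, the workers here don't share memory, so there is a pickling cost per chunk, but the argmin work itself runs on separate cores without fighting over the GIL.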
On a side note, if you're using outside modules, be mindful of matching Maya's Python version. This may be the reason you couldn't build scipy. Of course, scipy has a huge codebase and the Windows platform is not the most forgiving for building things.
I ran into this today and can't figure out why. I have several functions chained together that perform some time consuming operations as part of a larger pipeline. I've included these here, pared down to a test example, as best as I could. The issue is that when I call a function directly, I get the expected output (e.g., 5 different trees). However, when I call the same function in a multiprocessing pool with apply_async (or apply, doesn't matter), I get 5 trees, but they are all the same.
I've documented this in an IPython notebook, which can be viewed here: http://nbviewer.ipython.org/gist/cfriedline/0e275d528ff1a8d674c6
In cell 91, I create 5 trees (each with 10 tips) and return two lists. The first contains the non-multiprocessing trees, and the second the trees from apply_async.
In cell 92, you can see the results of creating trees without multiprocessing, and in 93, with multiprocessing.
What I expect is that there would be a total of 10 different trees between the two tests, but instead all of the multiprocessing trees are identical. Makes little sense to me.
Relevant versions of things:
Linux 2.6.18-238.12.1.el5 x86_64 GNU/Linux
Python 2.7.6 :: Anaconda 1.9.2 (64-bit)
IPython 2.0.0
Rpy2 2.3.9
Thanks!
Chris
I solved this one, with a point in the right direction from @mgilson. In fact, it was a random-number problem, just not in Python - in R (sigh). The state of R is copied when the Pool is created, meaning so is its random seed. To fix it, just a little rpy2 as below, calling R's set.seed function (with some process-specific stuff for good measure):
def create_tree(num_tips, type):
    """
    creates the taxa tree in R
    @param num_tips: number of taxa to create
    @param type: type for naming (e.g., 'taxa')
    @return: a dendropy Tree
    @rtype: dendropy.Tree
    """
    r = rpy2.robjects.r
    set_seed = r('set.seed')
    set_seed(int((time.time() + os.getpid() * 1000)))
    rpy2.robjects.globalenv['numtips'] = num_tips
    rpy2.robjects.globalenv['treetype'] = type
    name = _get_random_string(20)
    if type == "T":
        r("%s = rtree(numtips, rooted=T, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
    else:
        r("%s = rtree(numtips, rooted=F, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % name)
    tree = r[name]
    return ape_to_dendropy(tree)
I'm not 100% familiar with these libraries, however, on Linux, (IIRC) multiprocessing uses os.fork. This means that the state of the random module (which you're using) will also be forked and that each of your processes will generate the same sequence of random numbers resulting in a not-so-random _get_random_string function.
If I'm right, and you make the pool smaller than the number of trees that you want, you should see that you get groups of N identical trees (where N is the number of processes in the pool).
I think that probably the ideal solution is to re-seed the random number generator inside of each of the processes. It's unlikely that they'll run at exactly the same time, so you should get differing results.
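On the Python side, a common way to do that (a minimal sketch, not the asker's actual tree-building pipeline) is to re-seed in a Pool initializer so each forked worker starts from a different state:
import os
import random
import time
import multiprocessing

import numpy as np

def reseed():
    # Give each worker process its own seed, derived from its PID and the clock.
    seed = (os.getpid() * 2654435761 + int(time.time() * 1000)) % (2 ** 32)
    random.seed(seed)
    np.random.seed(seed)

def make_value(_):
    # Stand-in for the real tree-building work.
    return random.random()

if __name__ == '__main__':
    pool = multiprocessing.Pool(4, initializer=reseed)
    print(pool.map(make_value, range(5)))   # values should now differ across workers
    pool.close()
    pool.join()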
The following program loads two images with PyGame, converts them to Numpy arrays, and then performs some other Numpy operations (such as FFT) to emit a final result (of a few numbers). The inputs can be large, but at any moment only one or two large objects should be live.
A test image is about 10M pixels, which translates to 10MB once it's greyscaled. It gets converted to a Numpy array of dtype uint8, which after some processing (applying Hamming windows), is an array of dtype float64. Two images are loaded into arrays this way; later FFT steps result in an array of dtype complex128. Prior to adding the excessive gc.collect calls, the program memory size tended to increase with each step. Additionally, it seems most Numpy operations will give a result in the highest precision available.
Running the test (sans the gc.collect calls) on my 1GB Linux machine results in prolonged thrashing, which I have not waited for. I don't yet have detailed memory use stats -- I tried some Python modules and the time command, to no avail; now I'm looking into valgrind. Watching ps (and dealing with machine unresponsiveness in the later stages of the test) suggests a maximum memory usage of about 800 MB.
A 10 million cell array of complex128 should occupy 160 MB. Having (ideally) at most two of these live at one time, plus the not-insubstantial Python and Numpy libraries and other paraphernalia, probably means allowing for 500 MB.
I can think of two angles from which to attack the problem:
Discarding intermediate arrays as soon as possible. That's what the gc.collect calls are for -- they seem to have improved the situation, as it now completes with only a few minutes of thrashing ;-). I think one can expect that memory-intensive programming in a language like Python will require some manual intervention.
Using less-precise Numpy arrays at each step. Unfortunately the operations that return arrays, like fft2, do not appear to allow the type to be specified.
So my main question is: is there a way of specifying output precision in Numpy array operations?
More generally, are there other common memory-conserving techniques when using Numpy?
Additionally, does Numpy have a more idiomatic way of freeing array memory? (I imagine this would leave the array object live in Python, but in an unusable state.) Explicit deletion followed by immediate GC feels hacky.
import sys
import numpy
import pygame
import gc

def get_image_data(filename):
    im = pygame.image.load(filename)
    im2 = im.convert(8)
    a = pygame.surfarray.array2d(im2)
    hw1 = numpy.hamming(a.shape[0])
    hw2 = numpy.hamming(a.shape[1])
    a = a.transpose()
    a = a * hw1
    a = a.transpose()
    a = a * hw2
    return a

def check():
    gc.collect()
    print 'check'

def main(args):
    pygame.init()
    pygame.sndarray.use_arraytype('numpy')
    filename1 = args[1]
    filename2 = args[2]
    im1 = get_image_data(filename1)
    im2 = get_image_data(filename2)
    check()
    out1 = numpy.fft.fft2(im1)
    del im1
    check()
    out2 = numpy.fft.fft2(im2)
    del im2
    check()
    out3 = out1.conjugate() * out2
    del out1, out2
    check()
    correl = numpy.fft.ifft2(out3)
    del out3
    check()
    maxs = correl.argmax()
    maxpt = maxs % correl.shape[0], maxs / correl.shape[0]
    print correl[maxpt], maxpt, (correl.shape[0] - maxpt[0], correl.shape[1] - maxpt[1])

if __name__ == '__main__':
    args = sys.argv
    exit(main(args))
This post on SO says "Scipy 0.8 will have single precision support for almost all the fft code", and SciPy 0.8.0 beta 1 is just out. (Haven't tried it myself, cowardly.)
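If that pans out, a sketch of the single-precision route would be to keep the arrays in float32 and use SciPy's FFT instead of numpy's, since numpy.fft always promotes to complex128 (the dtypes below assume a SciPy build with single-precision FFT support):
import numpy as np
from scipy import fftpack

a = np.random.rand(1024, 1024).astype(np.float32)   # keep inputs in single precision
hw = np.hamming(a.shape[1]).astype(np.float32)
a *= hw                                             # float32 * float32 stays float32

out = fftpack.fft2(a)    # with single-precision support this comes back as complex64
print(out.dtype)         # complex64 uses half the memory of complex128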
If I understand correctly, you are calculating a convolution between two images. The SciPy package contains a dedicated module for that (ndimage), which might be more memory-efficient than the "manual" approach via Fourier transforms. It would be good to try using it instead of going through Numpy.