Why does spaCy keep GPU memory even after I delete the model? At first I thought I was facing the same problem as in this question, but then I realized that even CuPy doesn't mark the memory as free.
import spacy
import cupy as cp
import time
import gc
def pool_stats(mempool):
    print('used:', mempool.used_bytes(), 'bytes')
    print('total:', mempool.total_bytes(), 'bytes\n')
pool = cp.cuda.MemoryPool(cp.cuda.memory.malloc_managed) # get unified pool
cp.cuda.set_allocator(pool.malloc) # set unified pool as default allocator
spacy.prefer_gpu()
nlp = spacy.load('/tmp/nlp')
pool_stats(pool)
# used: 2716626944 bytes
# total: 2719709696 bytes
del nlp
time.sleep(3)
gc.collect()
pool_stats(pool)
# used: 2700994560 bytes
# total: 2719709696 bytes
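For what it's worth, CuPy's MemoryPool also exposes free_all_blocks(), which hands the pool's cached (unused) blocks back to the device. A hedged diagnostic, continuing the snippet above (not part of the original code):
# ask the pool to release cached blocks back to the device;
# whatever is still reported as "used" is held by live CuPy arrays somewhere
pool.free_all_blocks()
pool_stats(pool)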
I am writing a genetic optimization algorithm based on the deap package in Python 2.7 (the goal is to migrate to Python 3 soon). As it is a pretty heavy process, some parts of the optimisation are run using the multiprocessing package. Here is a summary outline of my program:
Configurations are read in and saved in a config object
Some additional pre-computations are made and saved as well in the config object
The optimisation starts (the population is initialized randomly, and mutation and crossover are applied to find a better solution) and some parts of it (the evaluation function) are executed with multiprocessing
The results are saved
For the evaluation function, we need access to some parts of the config object (which stays constant after phase 2). Therefore we make it accessible to the different cores using a global (constant) variable:
from deap import base
import multiprocessing

toolbox = base.Toolbox()

def evaluate(ind):
    # compute evaluation using config object
    return (obj1, obj2)

toolbox.register('evaluate', evaluate)

def init_pool_global_vars(self, _config):
    global config
    config = _config

...
# setting up multiprocessing
pool = multiprocessing.Pool(processes=72, initializer=self.init_pool_global_vars,
                            initargs=[config])
toolbox.register('map', pool.map_async)
...
while tic < max_time:
    # creating new individuals
    # computing in optimisation the objective function on the different individuals
    jobs = toolbox.map(toolbox.evaluate, ind)
    fits = jobs.get()
    # keeping best individuals
We basically make different iterations (a big for loop) until a maximum time is reached. I have noticed that if I make the config object bigger (i.e. add big attributes to it, like a big numpy array), even though the code is otherwise the same, it runs much slower (fewer iterations for the same timespan). So I thought I would make a specific config_multiprocessing object that contains only the attributes needed in the multiprocessing part and pass that as a global variable, but when I run it on 3 cores it is slower than with the big config object, and on 72 cores it is slightly faster, but not by much.
What should I do in order to make sure my loops don't suffer in speed from the config object or from any other data manipulations I make before launching the multiprocessing loops?
Running in a Linux Docker image on a Linux VM in the cloud.
The joblib package is designed to handle cases where you have large numpy arrays to distribute to workers with shared memory. This is especially useful if you are treating the data in shared memory as "read-only", as you describe in your scenario. You can also create writable shared memory as described in the docs.
Your code might look something like:
import os
import numpy as np
from joblib import Parallel, delayed
from joblib import dump, load

folder = './joblib_memmap'
try:
    os.mkdir(folder)
except FileExistsError:
    pass

def evaluate(ind, data):
    # compute evaluation using shared memory data
    return (obj1, obj2)

# just used to initialize memory mapped data
def init_memmap_data(original_data):
    data_filename_memmap = os.path.join(folder, 'data_memmap')
    dump(original_data, data_filename_memmap)
    shared_data = load(data_filename_memmap, mmap_mode='r')
    return shared_data

...
# however you set up indices needs to be changed here
indexes = range(10)
# however you load your numpy data needs to be done here
shared_data = init_memmap_data(numpy_array_to_share)
# change n_jobs as appropriate
results = Parallel(n_jobs=2)(delayed(evaluate)(ind, shared_data) for ind in indexes)
# get index of the maximum as the "best" individual
# (results is a plain list, so use np.argmax; if evaluate returns a tuple,
#  reduce it to a single fitness value first)
best_fit_individual = indexes[np.argmax(results)]
Additionally, joblib supports a threading backend that may be faster than the process-based one. It is easy to test both with joblib.
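A minimal sketch of switching to the threading backend (the prefer="threads" hint is a standard joblib option; evaluate, shared_data and indexes are the names from the snippet above). Threads avoid process start-up and pickling overhead, but only help if evaluate releases the GIL, e.g. by spending most of its time in numpy/scipy calls:
# same call as before, but asking joblib to use threads instead of processes
results = Parallel(n_jobs=2, prefer="threads")(delayed(evaluate)(ind, shared_data) for ind in indexes)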
I am trying to restrict GPU memory allocation in a MonitoredTrainingSession.
The methods of setting tf.GPUOptions shown here: How to prevent tensorflow from allocating the totality of a GPU memory? do not work in the case of a MonitoredTrainingSession.
I tried:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=.1)
# or allow_growth=True
config = tf.ConfigProto(allow_soft_placement=False,
                        device_filters=filters,
                        gpu_options=gpu_options)
scaffold = tf.train.Scaffold(saver=tf.train.Saver(max_to_keep=100, keep_checkpoint_every_n_hours=.5))
with tf.train.MonitoredTrainingSession(
        server.target,
        is_chief=True,
        checkpoint_dir=log_dir,
        scaffold=scaffold,
        save_checkpoint_secs=600,
        save_summaries_secs=30,
        log_step_count_steps=int(1e7),
        config=config) as session:
Despite using tf.GPUOptions, memory consumption is 10189MiB / 11175MiB.
I figured out what the problem was: the first session that is opened needs to include the memory options.
Hence, if in doubt, open a dummy session with the memory limit right at the beginning of the script.
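A minimal sketch of that workaround (TF 1.x API; the memory fraction is illustrative, not from the original code):
import tensorflow as tf

# throwaway session opened first, so its GPU memory options take effect
# for the whole process before the MonitoredTrainingSession is created
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=.1)  # or allow_growth=True
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)):
    pass

# ...then build the graph and open the MonitoredTrainingSession as above...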
I have access through ssh to a cluster of n GPUs. TensorFlow automatically gave them the names gpu:0,...,gpu:(n-1).
Others have access too, and sometimes they take random GPUs.
I did not place any tf.device() explicitly because that is cumbersome, and even if I selected GPU number j, someone might already be on GPU number j, which would be problematic.
I would like to go through the GPUs' usage, find the first one that is unused, and use only that one.
I guess someone could parse the output of nvidia-smi with bash, get a variable i, and feed that variable i to the TensorFlow script as the number of the GPU to use.
I have never seen any example of this. I imagine it is a pretty common problem. What would be the simplest way to do that? Is a pure TensorFlow solution available?
I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, the GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place for this, and there's no mechanism for process-global config (but there should be, to also be able to configure the process-global Eigen threadpool). So you need to do it at the process level by using the CUDA_VISIBLE_DEVICES environment variable.
Something like this:
import subprocess, re
# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23
def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""
    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python    11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory."""
    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu
You can then put this in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.
import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
An implementation along the lines of Yaroslav Bulatov's solution is available on https://github.com/bamos/setGPU.
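If I read the setGPU README correctly, usage is a single import before TensorFlow (hedged; check the repository for the current behaviour):
import setGPU       # picks the least-utilized GPU and sets CUDA_VISIBLE_DEVICES accordingly
import tensorflow as tf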
I would like to call model.wv.most_similar_cosmul on the same copy of the model object, using multiple cores, on batches of input pairs.
The multiprocessing module requires multiple copies of the model, which would take too much RAM because my model is 30+ GB in RAM.
I have tried to evaluate my query pairs. It took me ~12 hours for the first round, and there may be more rounds coming. That's why I am looking for a threading solution. I understand Python has the Global Interpreter Lock issue.
Any suggestions?
Forking off processes with multiprocessing after your text-vector model is in memory and unchanging might work to let many processes share the same object in memory.
In particular, you'd want to be sure that the automatic generation of unit-normed vectors (into a syn0norm or doctag_syn0norm) has already happened. It'll be automatically triggered the first time it's needed by a most_similar() call, or you can force it with the init_sims() method on the relevant object. If you'll only be doing most-similar queries between unit-normed vectors, never needing the original raw vectors, use init_sims(replace=True) to clobber the raw mixed-magnitude syn0 vectors in-place and thus save a lot of addressable memory.
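A minimal sketch of that fork-after-load pattern (gensim 3.x API; the path, words and topn are placeholders, not from the original question; the memory-mapped variant is shown in the fuller example further down):
import multiprocessing
from gensim.models import KeyedVectors

# load once in the parent process, then pre-compute the unit-normed vectors
# and discard the raw ones to save addressable memory
wv = KeyedVectors.load('vectors.kv')
wv.init_sims(replace=True)

def query(word):
    # forked workers (Linux) inherit the parent's pages copy-on-write,
    # so the large arrays are not duplicated as long as they stay read-only
    return word, wv.most_similar_cosmul(positive=[word], topn=5)

if __name__ == '__main__':
    words = ['apple', 'banana']  # placeholder batch
    with multiprocessing.Pool(processes=4) as pool:
        results = dict(pool.map(query, words))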
Gensim also has options to use memory-mapped files as the sources of the model's giant arrays, and when multiple processes use the same read-only memory-mapped file, the OS is smart enough to map that file into physical memory only once, providing both processes with pointers to the shared array.
For more discussion of the tricky parts of using this technique in a similar-but-not-identical use case, see my answer at:
How to speed up Gensim Word2vec model load time?
Gensim v4.x.x simplified a lot of what @gojomo described above, as he also explained in his other answer here. Based on those answers, here's an example of how you can multiprocess most_similar in a memory-efficient way, including logging of progress with tqdm. Swap in your own model/dataset to see how this works at scale.
import multiprocessing
from functools import partial
from typing import Dict, List, Tuple
import tqdm
from gensim.models.word2vec import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
from gensim.test.utils import common_texts
def get_most_similar(
    word: str, keyed_vectors: KeyedVectors, topn: int
) -> List[Tuple[str, float]]:
    try:
        return keyed_vectors.most_similar(word, topn=topn)
    except KeyError:
        return []

def get_most_similar_batch(
    word_batch: List[str], word_vectors_path: str, topn: int
) -> Dict[str, List[Tuple[str, float]]]:
    # Load the keyedvectors with mmap, so memory isn't duplicated
    keyed_vectors = KeyedVectors.load(word_vectors_path, mmap="r")
    return {word: get_most_similar(word, keyed_vectors, topn) for word in word_batch}

def create_batches_from_iterable(iterable, batch_size=1000):
    return [iterable[i : i + batch_size] for i in range(0, len(iterable), batch_size)]

if __name__ == "__main__":
    model = Word2Vec(
        sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4
    )

    # Save wv, so it can be reloaded with mmap later
    word_vectors_path = "word2vec.wordvectors"
    model.wv.save(word_vectors_path)

    # Dummy set of words to find most similar words for
    words_to_match = list(model.wv.key_to_index.keys())

    # Multiprocess
    batches = create_batches_from_iterable(words_to_match, batch_size=2)
    partial_func = partial(
        get_most_similar_batch,
        word_vectors_path=word_vectors_path,
        topn=5,
    )
    words_most_similar = dict()
    num_workers = multiprocessing.cpu_count()
    with multiprocessing.Pool(num_workers) as pool:
        max_ = len(batches)
        with tqdm.tqdm(total=max_) as pbar:
            # imap required for tqdm to function properly
            for result in pool.imap(partial_func, batches):
                words_most_similar.update(result)
                pbar.update()
Consider the memory usage of the following code after executing each statement:
import numpy
import scipy.signal
import gc
# memory at this point is ~35mb
a = numpy.ones(10**7)
b = numpy.ones(10**7)
# memory at this point is ~187mb
c = scipy.signal.fftconvolve(a, numpy.flipud(b), mode="full")
# memory usage at this point is 645mb
# given that a,b,c take up about 305mb in total, this is much
# larger than expected
# If we delete a,b,c and garbage collect...
del a,b,c
gc.collect()
# ...the memory usage drops to ~340mb which is a drop of 305mb
# (as expected since that is the size of what we deleted)
# but the remaining memory usage is still 340mb
# which is much larger than the starting value of 35 mb
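For reference, one way to take per-statement readings like those in the comments is to check the process RSS with psutil (assumed installed; not part of the original snippet):
import os
import psutil

def rss_mb():
    # resident set size of the current process, in MiB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

print("memory: %.0f MiB" % rss_mb())  # call after each statement of interest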
Why is the process using so much extra memory? Is there a leak in the fftconvolve function, or is there some other explanation?
Platform:
CPython 3.5.1 x64
Scipy 0.17.0 (Christoph Gohlke's package)
Numpy+MKL 1.11.0rc1 (Christoph Gohlke's package)
Windows 8.1 x64