tensorflow: memory allocation for a 'for' cycle - python

I am trying to use TensorFlow to calculate the minimum Euclidean distance between each column of a matrix and all other columns (excluding the column itself):
with graph.as_default():
    ...
    def get_diversity(matrix):
        num_rows = matrix.get_shape()[0].value
        num_cols = matrix.get_shape()[1].value
        identity = tf.ones([1, num_cols], dtype=tf.float32)
        diversity = 0
        for i in range(num_cols):
            col = tf.reshape(matrix[:, i], [num_rows, 1])
            col_extended_to_matrix = tf.matmul(col, identity)
            difference_matrix = (col_extended_to_matrix - matrix) ** 2
            sum_vector = tf.reduce_sum(difference_matrix, 0)
            mask = tf.greater(sum_vector, 0)
            non_zero_vector = tf.select(mask, sum_vector, tf.ones([num_cols], dtype=tf.float32) * 9e99)
            min_diversity = tf.reduce_min(non_zero_vector)
            diversity += min_diversity
        return diversity / num_cols
    ...
    diversity = get_diversity(matrix1)
    ...
When I call get_diversity() once every 1000 iterations (out of roughly 300k) it works just fine. But when I try to call it on every iteration, the interpreter returns:
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 2.99MiB. See logs for memory state.
I was thinking that was because TF creates a new set of variables each time get_diversity() is called. I tried this:
def get_diversity(matrix, scope):
    scope.reuse_variables()
    ...

with tf.variable_scope("diversity") as scope:
    diversity = get_diversity(matrix1, scope)
But it did not fix the problem.
How can I fix this allocation issue and use get_diversity() with a large number of iterations?

Assuming you call get_diversity() multiple times in your training loop, Aaron's comment is a good one: instead you can do something like the following:
diversity_input = tf.placeholder(tf.float32, [None, None], name="diversity_input")
diversity = get_diversity(diversity_input)
# ...
with tf.Session() as sess:
    for _ in range(NUM_ITERATIONS):
        # ...
        diversity_val = sess.run(diversity, feed_dict={diversity_input: ...})
This will avoid creating new operations each time round the loop, which should prevent the memory leak. This answer has more details.
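For contrast, a minimal sketch of the anti-pattern (NUM_ITERATIONS and the surrounding training loop are assumptions, not taken from your code): calling get_diversity() inside the loop appends a fresh set of ops (reshape, matmul, reduce_sum, ...) to the default graph on every call, which is what eventually exhausts the allocator.
with tf.Session() as sess:
    for step in range(NUM_ITERATIONS):
        # Each call builds new graph nodes, so memory grows every iteration.
        diversity_val = sess.run(get_diversity(matrix1))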


tf.data.datasets set each batch (prefetch)

I am looking for help thinking through this.
I have a function (that is not a generator) that will give me any number of samples.
Let's say that all the data I want to train on (1000 samples) can't fit into memory.
So I want to call this function 10 times to get smaller chunks of samples that do fit into memory.
This is a dummy example for simplicity.
def get_samples(num_samples: int, random_seed=0):
    np.random.seed(random_seed)
    x = np.random.randint(0, 100, num_samples)
    y = np.random.randint(0, 2, num_samples)
    return np.array(list(zip(x, y)))
Again, let's say get_samples(1000, 0) won't fit into memory.
So in theory I am looking for something like this:
batch_size = 100
total_num_samples = 1000
batches = []
for i in range(total_num_samples // batch_size):
    batches.append(get_samples(batch_size, i))
But this still loads everything into memory.
Again this function is a dummy representation and the real one is already defined and not a generator.
In tf land, I was hoping that:
tf.data.Dataset.batch[0] would equal to the output of get_data(100,0)
tf.data.Dataset.batch[1] would equal to the output of get_data(100,1)
tf.data.Dataset.batch[2] would equal to the output of get_data(100,2)
...
tf.data.Dataset.batch[9] would equal to the output of get_data(100,9)
I understand that I can use tf.data.Dataset with a generator (and I think you can set up a generator per batch). But the function I have gives more than a single sample, and the setup is too expensive to repeat for every single sample.
I was hoping to use tf.data.Dataset.prefetch() to run the get_batch function on every batch. And of course, it would call get_batch with the same parameters on every epoch.
Sorry if the explanation is convoluted; I'm trying my best to describe the problem.
Anyone have any ideas?
This is what I came up with:
def simple_static_synthesizer(batch_size, seed=1, verbose=True):
    if verbose:
        print(f"Creating Synthetic Data with seed {seed}")
    rng = np.random.default_rng(seed)
    all_x = []
    all_y = []
    for i in range(batch_size):
        x = np.array(np.concatenate((rng.integers(0, 100, 1, dtype=int), rng.integers(0, 100, 1, dtype=int), rng.integers(0, 100, 1, dtype=int))))
        y = np.array(rng.integers(0, 2, 1, dtype=int))
        all_x.append(x)
        all_y.append(y)
    return all_x, all_y

def my_generator(total_size, batch_size, seed=0, verbose=True):
    counter = 0
    for i in range(total_size):
        # Regenerate the data once per batch
        if counter % batch_size == 0:
            x, y = simple_static_synthesizer(batch_size, seed, verbose)
            seed += 1
        yield x[i % batch_size], y[i % batch_size]
        counter += 1

my_gen = my_generator(10, 2, seed=1)

# See values
for x, y in my_gen:
    print(x, y)

# Call again; this gives the same answer as above
my_gen = my_generator(10, 2, seed=1)
for x, y in my_gen:
    print(x, y)

# Dataset with small batches to see if it is doing it correctly
total_samples = 10
batch_size = 2
seed = 5
dataset = tf.data.Dataset.from_generator(
    my_generator,
    args=[total_samples, batch_size, seed],
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.uint8),
        tf.TensorSpec(shape=(1,), dtype=tf.uint8),
    )
)

for i, (x, y) in enumerate(dataset):
    print(x.numpy(), y.numpy())
    if i == 4:
        break  # shows the first 3 synthesizer calls
Wish we could have notebook answers!
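As a follow-up sketch (not part of the answer above): once the generator yields single samples, you can hand batching and prefetching back to tf.data. AUTOTUNE is an assumption here; on older TF versions it is tf.data.experimental.AUTOTUNE.
batched = dataset.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
for xb, yb in batched:
    print(xb.shape, yb.shape)  # (2, 3) and (2, 1) per batch with batch_size = 2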

How to free up RAM when using Jupyter Notebook?

I have a Jupyter Notebook where I am working with large matrices (20000x20000). I am running multiple iterations, but I get an error saying that I do not have enough RAM after every iteration. If I restart the kernel, I can run the next iteration, so perhaps the Jupyter Notebook is running out of RAM because it keeps variables around (which aren't needed for the next iteration). Is there a way to free up RAM?
Edit: I don't know if the bold segment is correct. In any case, I am looking to free up RAM; any suggestions are welcome.
## Outputs:
two_moons_n_of_samples = [int(_) for _ in np.repeat(20000, 10)]
for i in range(len(two_moons_n_of_samples)):
    # print(f'n: {two_moons_n_of_samples[i]}')
    ## Generate the data and the graph
    X, ground_truth, fid = synthetic_data({'type': 'two_moons', 'n': two_moons_n_of_samples[i], 'fidelity': 60, 'sigma': 0.18})
    N = X.shape[0]
    dist_mat = sqdist(X.T, X.T)
    opt = {
        'graph': 'full',
        'tau': 0.004,
        'type': 's'
    }
    LS = dense_laplacian(dist_mat, opt)

    ## Eigenvalues and eigenvectors
    tic = time.time()  ## Time how long to calculate eigenvalues/eigenvectors
    V, E = np.linalg.eigh(LS)
    idx = np.argsort(V)
    V, E = V[idx], E[:, idx]
    V = V / V.max()
    decomposition_time = time.time() - tic

    ## Initialize u0
    u0 = np.zeros(N)
    for j in range(len(fid[0])):
        u0[fid[0][j]] = 1
    for j in range(len(fid[1])):
        u0[fid[1][j]] = -1

    ## Initialize parameters
    dt = 0.05
    gamma = 0.07
    max_iter = 100

    ## Run MAP estimation
    tic = time.time()
    u_eg, _ = probit_optimization_eig(E, V, u0, dt, gamma, fid, max_iter)
    eg_time = time.time() - tic

    ## Run MAP estimation with CG
    tic2 = time.time()
    u_cg, _ = probit_optimization_cg(LS, u0, dt, gamma, fid, max_iter)
    cg_time = time.time() - tic2

    ## Write to file:
    with open('results2_two_moons_egvscg.txt', 'a') as f:
        f.write(f'{i},{two_moons_n_of_samples[i]},{decomposition_time + eg_time},{cg_time}\n')
Error:
MemoryError: Unable to allocate 1.07 GiB for an array with shape (12000, 12000) and data type float64
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
~\AppData\Local\Temp\2/ipykernel_2344/941022539.py in <module>
11 'type': 's'
12 }
---> 13 LS = dense_laplacian(dist_mat, opt)
14
15 ## Eigenvalues and eigenvectors
C:/Users/\util\graph\dense_laplacian.py in dense_laplacian(dist_mat, opt)
69 D_inv_sqrt = 1.0 / np.sqrt(D)
70 D_inv_sqrt = np.diag(D_inv_sqrt)
---> 71 L = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
72 # L = 0.5 * (L + L.T)
73 if opt['type'] == 'rw':
MemoryError: Unable to allocate 1.07 GiB for an array with shape (12000, 12000) and data type float64
I faced the same problem; the way I solved it was:
Writing functions wherever preprocessing is required, and returning only the preprocessed variables.
Deleting huge variables once they are no longer needed, with del x.
Clearing garbage:
import gc
gc.collect()
Sometimes clearing garbage doesn't help, and I used to clear the cache as well by using:
import ctypes
libc = ctypes.CDLL("libc.so.6")  # clearing cache
libc.malloc_trim(0)
I also tried to batch my code as far as possible.
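Applied to the loop in the question, a minimal sketch (using the variable names from your code; which ones are safe to delete depends on what you still need afterwards):
import gc

for i in range(len(two_moons_n_of_samples)):
    # ... build dist_mat and LS, run the eigendecomposition and both solvers,
    # write the timings to file ...
    del X, dist_mat, LS, V, E, u_eg, u_cg   # drop the big intermediates
    gc.collect()                            # give the memory back before the next iteration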
I think the best solution for you would be to batch the matrix multiplication. Libraries like TensorFlow and PyTorch do it by default; I'm not sure about NumPy. Check https://www.tensorflow.org/api_docs/python/tf/linalg/matmul (an API for matrix multiplication in batches). Most modern-day GPU calculations are possible thanks to batching!
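For illustration, a tiny sketch of batched matrix multiplication in TensorFlow (the shapes are made up): the leading dimension is treated as a batch, so four 500x500 products are computed in one call.
import tensorflow as tf

a = tf.random.normal([4, 500, 500])
b = tf.random.normal([4, 500, 500])
c = tf.linalg.matmul(a, b)   # shape (4, 500, 500), one matmul per batch entry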
I would suggest adding more swap space, which is really easy and will probably save you more time and headache than redesigning the code to be less wasteful or trying to delete and garbage-collect unnecessary objects. It would of course be slower than using RAM, since it uses the disk to simulate the extra memory needed.
There is an excellent answer on how to do this on Ubuntu: link

Avoid memory re-allocation in tensorflow while_loop

In every step of the while_loop, I want to update a 0.5 GB variable. I cannot avoid the loop because each iteration depends on the previous one. My program needs to run the while loop 100 million times.
To test the performance of tf.while_loop in this scenario, I made a test. The update here is simply adding a constant to the variable.
However, even this simple loop takes 24 seconds and requires 4 × 1 GB of memory. I suspect the loop is constantly trying to reallocate 1 GB chunks of memory, which is horribly slow on a GPU. The GPU has 4 GB of memory; when I set the variable to 2 GB, I get OOM.
Is it possible to avoid the re-allocation?
I can use x as a loop variable instead of using tf.control_dependencies (a sketch of this is below, after the test), but that uses a bit more memory.
tf.contrib.compiler.jit.experimental_jit_scope leads to OOM.
Thanks.
Test:
import tensorflow as tf
import numpy as np
from functools import partial
from timeit import default_timer as timer

def body1(x, i):
    a = tf.assign(x, x + 0.001)
    with tf.control_dependencies([a]):
        return i + 1

def make_loop1(x, end_ix):
    i = tf.Variable(0, name="i", dtype=np.int32)
    cond = lambda i2: tf.less(i2, end_ix)
    body = partial(body1, x)
    return tf.while_loop(
        cond, body, [i], back_prop=False,
        parallel_iterations=1)

def main():
    N = int(1e9 / 4)
    x = tf.get_variable('x', shape=N, dtype=np.float32,
                        initializer=tf.ones_initializer)
    end_ix = tf.constant(int(1000), dtype=np.int32)
    loop1 = make_loop1(x, end_ix)
    init_op = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init_op)
        print("running_loop1")
        st = timer()
        sess.run(loop1)
        en = timer()
        print(en - st)  # elapsed seconds
        print(sess.run(x[0]))

main()
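For reference, a sketch of the variant mentioned above that threads x through the loop instead of assigning under tf.control_dependencies (untested, and as noted it may use a bit more memory):
def body2(x, i):
    # The updated tensor is returned as a loop variable instead of being
    # written back into a tf.Variable.
    return x + 0.001, i + 1

def make_loop2(x, end_ix):
    i = tf.constant(0, dtype=np.int32)
    cond = lambda x2, i2: tf.less(i2, end_ix)
    return tf.while_loop(
        cond, body2, [x, i], back_prop=False,
        parallel_iterations=1)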

Considerations of model definitions when moving from Tensorflow to PyTorch

I've just recently switched to PyTorch after getting frustrated debugging tf, and I understand that it is almost completely equivalent to coding in numpy. My question is: what are the permitted Python constructs we can use in a PyTorch model (to be put completely on the GPU)? E.g., if-else has to be implemented as follows in tensorflow:
a = tf.Variable([1,2,3,4,5], dtype=tf.float32)
b = tf.Variable([6,7,8,9,10], dtype=tf.float32)
p = tf.placeholder(dtype=tf.float32)
ps = tf.placeholder(dtype=tf.bool)

li = [None] * 5
li_switch = [True, False, False, True, True]
for i in range(5):
    li[i] = tf.Variable(tf.random.normal([5]))

sess = tf.Session()
sess.run(tf.global_variables_initializer())

def func_0():
    return tf.add(a, p)

def func_1():
    return tf.subtract(b, p)

with tf.device('GPU:0'):
    my_op = tf.cond(ps, func_1, func_0)

for i in range(5):
    print(sess.run(my_op, feed_dict={p: li[i], ps: li_switch[i]}))
How would the structure change in pytorch for the above code? How to place the variables and ops above on GPU and parallelize the list inputs to our graph in pytorch?
In PyTorch, the code can be written just like normal Python code.
CPU
import torch

a = torch.FloatTensor([1,2,3,4,5])
b = torch.FloatTensor([6,7,8,9,10])
cond = torch.randn(5)

for ci in cond:
    if ci > 0:
        print(torch.add(a, 1))
    else:
        print(torch.sub(b, 1))
GPU
Move the tensors to GPU like this:
a = torch.FloatTensor([1,2,3,4,5]).to('cuda')
b = torch.FloatTensor([6,7,8,9,10]).to('cuda')
cond = torch.randn(5).to('cuda')

import torch.nn as nn

class Cond(nn.Module):
    def __init__(self):
        super(Cond, self).__init__()

    def forward(self, cond, a, b):
        result = torch.empty(cond.shape[0], a.shape[0]).cuda()
        for i, ci in enumerate(cond):
            if ci > 0:
                result[i] = torch.add(a, 1)
            else:
                result[i] = torch.sub(b, 1)
        return result

cond_model = Cond().to('cuda')
output = cond_model(cond, a, b)
https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#cuda-tensors
To initialize your a and b tensors in PyTorch, you do the following:
a = torch.tensor([1,2,3,4,5], dtype=torch.float32)
b = torch.tensor([6,7,8,9,10], dtype=torch.float32)
But, since you need them to be completely on the GPU, you have to use the magic .cuda() function. So, it would be:
a = torch.tensor([1,2,3,4,5], dtype=torch.float32).cuda()
b = torch.tensor([6,7,8,9,10], dtype=torch.float32).cuda()
which moves the tensors to the GPU.
Another way of initializing is:
a = torch.FloatTensor([1,2,3,4,5]).cuda()
b = torch.FloatTensor([6,7,8,9,10]).cuda()
If we need to generate values from a random normal distribution, we use torch.randn (there is also torch.rand, which samples from a uniform distribution).
li = torch.randn(5, 5)
(Catch the bug: it has to be initialized on CUDA; you cannot do operations on tensors that live on different devices, i.e., CPU and GPU.)
li = torch.randn(5, 5).cuda()
There is no difference for the li_switch initialization.
One possible way of handling your func_0 and func_1 is to declare them as
def func_0(li_value):
    return torch.add(a, li_value)

def func_1(li_value):
    return torch.sub(b, li_value)
Then, for the predicate function call, it could be as simple as doing this:
for i, pred in enumerate(li_switch):
    if pred:
        func_0(li[i])
    else:
        func_1(li[i])
However, I suggest vectorizing your operations and doing something like:
li_switch = torch.tensor([True, False, False, True, True])
torch.add(a, li[li_switch]).sum(dim=0)
torch.sub(b, li[~li_switch]).sum(dim=0)
This is much more optimized.
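If you want to reproduce the Cond module from the earlier answer without any Python loop, a hedged sketch using torch.where (same a, b, cond tensors as above) would be:
mask = (cond > 0).unsqueeze(1)             # shape (5, 1), broadcasts over columns
result = torch.where(mask, a + 1, b - 1)   # shape (5, 5), row i picks a+1 or b-1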

Initialize a batch-dependent variable in Tensorflow

I have a tensorflow code that runs well and accurately, but occupies a lot of memory. Specifically, in my code, I have a for-loop that looks something like this:
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5])  # shape = [None, 5, 5]
myarray2 = tf.Variable(np.zeros([K,5,5]), dtype=tf.float32)
vals = []
for k in range(0, K):
    tmp = tf.reduce_sum(myarray1 * myarray2[k], axis=(1,2))
    vals.append(tmp)
result = tf.reduce_min(tf.stack(vals, axis=-1), axis=-1)
Unfortunately, that takes a lot of memory as K gets to be big in my application. So, I want to have a better way of doing it. For example, in numpy/python, you would just keep track of the minimum value as you iterate through the loops, and update it on each iteration. It seems like I could use tf.assign, as:
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5])  # shape = [None, 5, 5]
myarray2 = tf.Variable(np.zeros([K,5,5]), dtype=tf.float32)
min_value = tf.Variable(myarray1, validate_shape=False, trainable=False)
for k in range(0, K):
    tmp = myarray1 * myarray2[k]
    idx = tf.where(tmp < min_value)
    tf.scatter_nd_assign(min_value, idx, tmp[idx], use_locking=True)
result = min_value
While this code builds the graph (when validate_shape=False), it fails to run because it complains that min_value has not been initialized. The issue is, when I run the initializer as:
sess.run(tf.global_variables_initializer())
or
sess.run(tf.variables_initializer(tf.trainable_variables()))
it complains that I am not feeding in a placeholder. This actually makes sense because the definition of min_value depends on myarray1 in the graph.
What I would actually want to do is define a dummy variable that doesn't depend on myarray1's values, but does match its shape. I would like these values to be initialized as some number (in this case something large is fine), as I will manually ensure these are overwritten in the network.
Note: as far as I know, currently you cannot define a variable with an unknown shape unless you feed in another variable of the desired shape and set validate_shape=False. Maybe there is another way?
Any help / suggestions appreciated.
Try this; if you don't know how to feed a placeholder, read the tutorial.
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5])  # shape = [None, 5, 5]
################### ADD THIS ####################
sess = tf.Session()
FOO = sess.run(myarray1, feed_dict={myarray1: YOURDATA})  # get myarray1's value
# replace all myarray1 below with FOO
##################################################
myarray2 = tf.Variable(np.zeros([K,5,5]), dtype=tf.float32)
min_value = tf.Variable(FOO, validate_shape=False, trainable=False)
for k in range(0, K):
    tmp = FOO * myarray2[k]
    idx = tf.where(tmp < min_value)
    tf.scatter_nd_assign(min_value, idx, tmp[idx], use_locking=True)
result = min_value
-------above new 15.April.2018------
Since I don't know your input data, I will try to walk through the steps.
Step 1: make a placeholder for the input data
x = tf.placeholder(tf.float32, shape=[None, 2])
Step 2: get batches of data
batch_x = [[1, 2], [3, 4]]  # example
# since x = [None, 2], the batch size here is batch_x_size / x_size = 2
Step 3: make a session
sess = tf.Session()
If you have variables, add the following code to initialize them before any calculation:
init = tf.global_variables_initializer()
sess.run(init)
Step 4:
yourplaceholderdictionary = {x: batch_x}
sess.run(x, feed_dict=yourplaceholderdictionary)
Always feed your placeholders so they have values to calculate with.
There is a very helpful PDF, "TensorFlow and Deep Learning without a PhD"; you can also find it on YouTube under that title.
