How to free up RAM when using Juypter Notebook?

How to free up RAM when using Juypter Notebook? - python

I have a Juypter Notebook where I am working with large matrices (20000x20000). I am running multiple iterations, but I am getting an error saying that I do not have enough RAM after every iteration. If I restart the kernel, I can run the next iteration, so perhaps the Juypter Notebook is running out of RAM because it stores the variables (which aren't needed for the next iteration). Is there a way to free up RAM?
Edit: I don't know if the bold segment is correct. In any case, I am looking to free up RAM, any suggestions are welcome.
## Outputs:
two_moons_n_of_samples = [int(_) for _ in np.repeat(20000, 10)]
for i in range(len(two_moons_n_of_samples)):
# print(f'n: {two_moons_n_of_samples[i]}')
## Generate the data and the graph
X, ground_truth, fid = synthetic_data({'type': 'two_moons', 'n': two_moons_n_of_samples[i], 'fidelity': 60, 'sigma': 0.18})
N = X.shape[0]
dist_mat = sqdist(X.T, X.T)
opt = {
'graph': 'full',
'tau': 0.004,
'type': 's'
}
LS = dense_laplacian(dist_mat, opt)
## Eigenvalues and eigenvectors
tic = time.time() ## Time how long to calculate eigenvalues/eigenvectors
V, E = np.linalg.eigh(LS)
idx = np.argsort(V)
V, E = V[idx], E[:, idx]
V = V / V.max()
decomposition_time = time.time() - tic
## Initialize u0
u0 = np.zeros(N)
for j in range(len(fid[0])):
u0[fid[0][j]] = 1
for j in range(len(fid[1])):
u0[fid[1][j]] = -1
## Initialize parameters
dt = 0.05
gamma = 0.07
max_iter = 100
## Run MAP estimation
tic = time.time()
u_eg, _ = probit_optimization_eig(E, V, u0, dt, gamma, fid, max_iter)
eg_time = time.time() - tic
## Run MAP estimation with CG
tic2 = time.time()
u_cg, _ = probit_optimization_cg(LS, u0, dt, gamma, fid, max_iter)
cg_time = time.time() - tic2
## Write to file:
with open('results2_two_moons_egvscg.txt', 'a') as f:
f.write(f'{i},{two_moons_n_of_samples[i]},{decomposition_time + eg_time},{cg_time}\n')
Error:
MemoryError: Unable to allocate 1.07 GiB for an array with shape (12000, 12000) and data type float64
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
~\AppData\Local\Temp\2/ipykernel_2344/941022539.py in <module>
11 'type': 's'
12 }
---> 13 LS = dense_laplacian(dist_mat, opt)
14
15 ## Eigenvalues and eigenvectors
C:/Users/\util\graph\dense_laplacian.py in dense_laplacian(dist_mat, opt)
69 D_inv_sqrt = 1.0 / np.sqrt(D)
70 D_inv_sqrt = np.diag(D_inv_sqrt)
---> 71 L = np.eye(W.shape[0]) - D_inv_sqrt # W # D_inv_sqrt
72 # L = 0.5 * (L + L.T)
73 if opt['type'] == 'rw':
MemoryError: Unable to allocate 1.07 GiB for an array with shape (12000, 12000) and data type float64

I faced the same problem, the way I solved it was -
Writing Functions wherever preprocessing is required and returning only preprocessed variables.
Deleting used huge variables just use del x
Clearing Garbage
import gc
gc.collect()
Sometimes clearing garbage doesn't helps and i used to clear the cache as well by using
import ctypes
libc = ctypes.CDLL("libc.so.6") # clearing cache
libc.malloc_trim(0)
I tried to batch my code as far as possible.
I think the best solution for you would be to batch the matrix multiplication. Libraries like TensorFlow and PyTorch does it by default, not sure about NumPy though. Check - https://www.tensorflow.org/api_docs/python/tf/linalg/matmul ( An API for matrix multiplication in batches ). Most of modern-day GPU calculations are possible due to batching !

I would suggest adding more swap space which is really easy and will probably save you more time and headache than redesigning the code to be less wasteful or trying to delete and garbage collect unnecessary objects. It would of course be slower than using ram memory since it will use the disk to simulate the extra memory needed.
Excellent answer on how to do this on ubuntu, link

Related

How to speed up Cupy with Streams correctly?

I want to use cuda streams in order to speed up small calculations on the GPU. My test so far consists of the following:
import cupy as xp
import time
x = xp.random.randn(10003, 20000) + 1j * xp.random.randn(10003, 20000)
y = xp.zeros_like(x)
nStreams = 16
streams = [xp.cuda.stream.Stream() for ii in range(nStreams)]
f = xp.fft.fft(x[:,:200])
t = time.time()
for ii in range(int(x.shape[1]/100)):
ss = streams[ii % nStreams]
with ss:
y[:,ii*200:(ii+1)*200] = xp.fft.fft(x[:,ii*200:(ii+1)*200], axis=0)
for ii,ss in enumerate(streams):
ss.synchronize()
print(time.time()-t)
t = time.time()
for ii in range(int(x.shape[1]/100)):
y[:,ii*200:(ii+1)*200] = xp.fft.fft(x[:,ii*200:(ii+1)*200], axis=0)
xp.cuda.Stream.null.synchronize()
print(time.time()-t)
produces
[user#pc snippets]$ intelpython3 strm.py
0.019365549087524414
0.018717050552368164
which I have trouble believing that I do everything correctly. Additionally, the situation becomes even more severe when replacing the FFT-calls with calls to xp.sum, which yields
[user#pc snippets]$ intelpython3 strm.py
0.002195596694946289
0.001004934310913086
What is the rationale behind cupy streams? How do I use them to my advantage?

Avoid memory re-allocation in tensorflow while_loop

In every step of the while_loop, I want to update a 0.5 GB variable. I cannot avoid the loop because each iteration depends on the previous iteration. My program need to run the while loop for 100 million times.
To test the performance of tf.while in this scenario, I make a test. The update here is simply adding a constant to the variable.
However, even this simple loop takes 24 seconds and requires 4 times 1 GB memory. I suspect the loop is constantly trying to reallocate 1 GB chunks of memory, which is horribly slow on a GPU. The GPU has 4 GB memory, when I set the variable to 2 GB, I get oom.
Is it possible to avoid the re-allocation?
I can use x as a loop variable instead of using the tf.control_dependencies. But that uses a bit more memory.
tf.contrib.compiler.jit.experimental_jit_scope leads to oom.
Thanks.
Test:
import tensorflow as tf
import numpy as np
from functools import partial
from timeit import default_timer as timer
def body1(x, i):
a = tf.assign(x, x + 0.001)
with tf.control_dependencies([a]):
return i + 1
def make_loop1(x, end_ix):
i = tf.Variable(0, name="i", dtype=np.int32)
cond = lambda i2: tf.less(i2, end_ix)
body = partial(body1, x)
return tf.while_loop(
cond, body, [i], back_prop=False,
parallel_iterations=1)
def main():
N = int(1e9 / 4)
x = tf.get_variable('x', shape=N, dtype=np.float32,
initializer=tf.ones_initializer)
end_ix = tf.constant(int(1000), dtype=np.int32)
loop1 = make_loop1(x, end_ix)
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init_op)
print("running_loop1")
st = timer()
sess.run(loop1)
en = timer()
print(st - en)
print(sess.run(x[0]))
main()

tensorflow: memory allocation for a 'for' cycle

I am trying to use TensorFlow for calculating minimum Euclidean distance between each column in the matrix and all other columns (excluding itself):
with graph.as_default():
...
def get_diversity(matrix):
num_rows = matrix.get_shape()[0].value
num_cols = matrix.get_shape()[1].value
identity = tf.ones([1, num_cols], dtype=tf.float32)
diversity = 0
for i in range(num_cols):
col = tf.reshape(matrix[:, i], [num_rows, 1])
col_extended_to_matrix = tf.matmul(neuron_matrix, identity)
difference_matrix = (col_extended_to_matrix - matrix) ** 2
sum_vector = tf.reduce_sum(difference_matrix, 0)
mask = tf.greater(sum_vector, 0)
non_zero_vector = tf.select(mask, sum_vector, tf.ones([num_cols], dtype=tf.float32) * 9e99)
min_diversity = tf.reduce_min(non_zero_vector)
diversity += min_diversity
return diversity / num_cols
...
diversity = get_diversity(matrix1)
...
When I call get_diversity() once per 1000 iterations (on the scale of 300k) it works just fine. But when I try to call it at every iteration the interpreter returns:
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 2.99MiB. See logs for memory state.
I was thinking that was because TF creates a new set of variables each time get_diversity() is called. I tried this:
def get_diversity(matrix, scope):
scope.reuse_variables()
...
with tf.variable_scope("diversity") as scope:
diversity = get_diversity(matrix1, scope)
But it did not fix the problem.
How can I fix this allocation issue and use get_diversity() with large number of iterations?

Assuming you call get_diversity() multiple times in your training loop, Aaron's comment is a good one: instead you can do something like the following:
diversity_input = tf.placeholder(tf.float32, [None, None], name="diversity_input")
diversity = get_diversity(matrix)
# ...
with tf.Session() as sess:
for _ in range(NUM_ITERATIONS):
# ...
diversity_val = sess.run(diversity, feed_dict={diversity_input: ...})
This will avoid creating new operations each time round the loop, which should prevent the memory leak. This answer has more details.

Gaussian SVM parameters C and gamma

I am using rbf,Support Vector machine for large training set=1135x9 matrix and test set{95x9}.
I am using C=20 and gamma=0.0001
'the result are as follows
optimization finished, iter = 3904
nu = 0.187228
obj = -2499.485353, rho = -0.072050
nSV = 852, nBSV = 48
Total nSV = 852
<Accuracy = 63.1579% (60/95) (classification)
i want to ask what should be optimal C & gamma for this data set

Use grid search method, although the speed may be an issue.
If you are using Matlab, the following code from the FAQ page may work:
bestcv = 0;
for log2c = -1:3,
for log2g = -4:1,
cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
cv = svmtrain(label, training_set, cmd);
if (cv >= bestcv),
bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
end
fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
end
end
If you are using Python, check this page with the usage on gridrepression.py.

Anaconda's NumbaPro CUDA Assertion Error

I am trying to use NumbaPro's cuda extension to multiply large array matrixes. What I want in the end is to multiply a matrix of size NxN by a diagonal matrix that would be fed in as a 1D matrix (thus, a.dot(numpy.diagflat(b)) which I have found to be synonymous to a * b). However, I am getting an assertion error that provides no information.
I can only avoid this assertion error if I multiply two 1D array matrixes but that is not what I want to do.
from numbapro import vectorize, cuda
from numba import f4,f8
import numpy as np
def generate_input(n):
import numpy as np
A = np.array(np.random.sample((n,n)))
B = np.array(np.random.sample(n) + 10)
return A, B
def product(a, b):
return a * b
def main():
cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)
N = 1000
A, B = generate_input(N)
D = np.empty(A.shape)
stream = cuda.stream()
with stream.auto_synchronize():
dA = cuda.to_device(A, stream)
dB = cuda.to_device(B, stream)
dD = cuda.to_device(D, stream, copy=False)
cu_product(dA, dB, out=dD, stream=stream)
dD.to_host(stream)
if __name__ == '__main__':
main()
This is what my terminal spits out:
Traceback (most recent call last):
File "cuda_vectorize.py", line 32, in <module>
main()
File "cuda_vectorize.py", line 28, in main
cu_product(dA, dB, out=dD, stream=stream)
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 109, in __call__
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 191, in _arguments_requirement
AssertionError

The problem is you are using vectorize on a function that takes non-scalar arguments. The idea with NumbaPro's vectorize is that it takes a scalar function as input, and generates a function that applies the scalar operation in parallel to all the elements of a vector. See the NumbaPro documentation.
Your function takes a matrix and a vector, which are definitely not scalar. [Edit] You can do what you want on the GPU using either NumbaPro's wrapper for cuBLAS, or by writing your own simple kernel function. Here's an example that demonstrates both. Note will need NumbaPro 0.12.2 or later (just released as of this edit).
from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer
def generate_input(n):
A = np.array(np.random.sample((n,n)), dtype=np.float32)
B = np.array(np.random.sample(n), dtype=A.dtype)
return A, B
#cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
def diagproduct(c, a, b):
startX, startY = cuda.grid(2)
gridX = cuda.gridDim.x * cuda.blockDim.x;
gridY = cuda.gridDim.y * cuda.blockDim.y;
height, width = c.shape
for y in range(startY, height, gridY):
for x in range(startX, width, gridX):
c[y, x] = a[y, x] * b[x]
def main():
N = 1000
A, B = generate_input(N)
D = np.empty(A.shape, dtype=A.dtype)
E = np.zeros(A.shape, dtype=A.dtype)
F = np.empty(A.shape, dtype=A.dtype)
start = timer()
E = np.dot(A, np.diag(B))
numpy_time = timer() - start
blas = cublas.api.Blas()
start = timer()
blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
cublas_time = timer() - start
diff = np.abs(D-E)
print("Maximum CUBLAS error %f" % np.max(diff))
blockdim = (32, 8)
griddim = (16, 16)
start = timer()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dF = cuda.to_device(F, copy=False)
diagproduct[griddim, blockdim](dF, dA, dB)
dF.to_host()
cuda_time = timer() - start
diff = np.abs(F-E)
print("Maximum CUDA error %f" % np.max(diff))
print("Numpy took %f seconds" % numpy_time)
print("CUBLAS took %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time))
print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))
if __name__ == '__main__':
main()
The kernel is significantly faster because SGEMM does a full matrix-matrix multiply (O(n^3)), and expands the diagonal into a full matrix. The diagproduct function is smarter. It simply does a single multiply for each matrix element, and never expands the diagonal to a full matrix. Here are the results on my NVIDIA Tesla K20c GPU for N=1000:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.024535 seconds
CUBLAS took 0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup
The timing includes all of the copies to and from the GPU, which is a significant bottleneck for small matrices. If we set N to 10,000 and run again, we get a much bigger speedup:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 7.245677 seconds
CUBLAS took 1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup
For very small matrices, however, CUBLAS SGEMM has an optimized path so it is closer to the CUDA performance. Here, N=100
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.006876 seconds
CUBLAS took 0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup

Just to bounce back on all those considerations. I also wanted to implement some matrix computations on CUDA, but then heard about the numpy.einsum function.
It turns out that einsum is incredibly fast.
In a case like this, here is the code for it. But it can be applied to many types of computations.
G = np.einsum('ij,j -> ij',A, B)
In terms of speed, here are the results for N = 10000
Numpy took 8.387756 seconds
CUDA JIT took 0.218394 seconds, 38.41x speedup
EINSUM took 0.131751 seconds, 63.66x speedup

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to free up RAM when using Juypter Notebook? - python

Related

How to speed up Cupy with Streams correctly?

Avoid memory re-allocation in tensorflow while_loop

tensorflow: memory allocation for a 'for' cycle

Gaussian SVM parameters C and gamma

Anaconda's NumbaPro CUDA Assertion Error

Categories

Resources