I'm copying data in Python using OpenCL onto my graphics card. There I have a kernel processing the data with n threads.
After this step I copy the result back to Python and into a new kernel. (The data is very big, 900MB, and the result is 100MB.) With the result I need to calculate triangles, which are about 200MB. All the data together exceeds the memory on my graphics card.
I do not need the first 900MB anymore after the first kernel has finished its work.
My question is: how can I delete the first dataset (stored in one array) from the graphics card?
Here is some code:
#Write
self.gridBuf = cl.Buffer(self.context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=self.grid)
#DO PART 1
...
#Read result
cl.enqueue_read_buffer(self.queue, self.indexBuf,index).wait()
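You will need to call clReleaseMemObject with the mem object you created with the call to clCreateBuffer. If the reference count becomes zero with this call, the underlying device/shared memory is released by the implementation. In PyOpenCL, the corresponding call should be the buffer's release() method, e.g. self.gridBuf.release() once the first kernel has finished.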
I am opening a list of asm files, and closing them after extracting arrays from them. I had issues with RAM usage, which are solved now.
When I was appending the extracted arrays (as NumPy arrays) to a list, the RAM usage kept stacking up with each iteration. However, when I changed my code to convert the extracted arrays to lists before appending, the issue was resolved. Please see the line i_arrays.append(h.tolist()). I'm just trying to understand why the RAM usage was stacking up when I was storing NumPy arrays.
Code:
import array
import os
from time import process_time

import numpy as np
import psutil
from tqdm import tqdm

t1_start = process_time()
files = os.listdir('asmFiles')
i_arrays=[]
file_names=[]
for i in tqdm(files[0:1501]):
    f_name=i.split('.')[0]
    file_names.append(f_name)
    b='asmFiles/'+str(i)
    f=open(b,'rb')
    ln=os.path.getsize(b)
    width=int(ln**0.5)
    rem=ln%width
    a = array.array("B")
    a.fromfile(f,ln-rem)
    f.close()
    g=np.reshape(a,(int(len(a)/width),width))
    g=np.uint(g)
    h=g[0][0:800]
    i_arrays.append(h.tolist())
    print(psutil.virtual_memory()[2])  # percent of RAM in use after this file
t1_stop = process_time()
print("Elapsed time during the whole program in seconds:",t1_stop-t1_start)
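One plausible explanation, not confirmed anywhere in the post: basic slicing in NumPy returns a view rather than a copy, so each stored h would keep the whole parent array g alive, while tolist() produces an independent Python list. The sketch below only illustrates that view/copy distinction; the array size is a made-up stand-in for the reshaped file contents.

```python
import numpy as np

# Hypothetical stand-in for the reshaped file contents (size is an assumption).
g = np.zeros((1000, 1000), dtype=np.uint64)

# Basic slicing returns a view: no data is copied, and the view keeps
# a reference to the array that owns the memory.
h = g[0][0:800]
print(h.base is not None)      # the slice does not own its data
print(np.shares_memory(h, g))  # it still points into g's buffer

# Either of these produces an independent object, so g can be freed
# once nothing else refers to it.
as_list = h.tolist()  # plain Python list, as in the working version
as_copy = h.copy()    # small standalone array of 800 elements
print(as_copy.base is None)
print(as_copy.nbytes, g.nbytes)  # bytes owned by the copy vs. by the parent
```

If this is what happens in the script above, appending h.copy() instead of h should cap memory in the same way as tolist() while keeping the data as an array.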
My script takes two movie files as an input, and writes a 2x1 array movie output (stereoscopic Side-by-Side Half-Width). The input video clips are of equal resolution (1280x720), frame rate (60), number of frames (23,899), format (mp4)...
When the write_videofile function starts processing, it provides an estimated time of completion that is very reasonable ~20min. As it processes every frame, the process gets slower and slower and slower (indicated by progress bar and estimated completion time). In my case, the input movie clips are about 6min long. After three minutes of processing, it indicates it will take over 3 hours to complete. After a half hour of processing, it then indicates it will take over 24hours to complete.
I have tried the 'threads' option of the write_videofile function, but it did not help.
Any idea? Thanks for the help.
---- Script ----
from moviepy.editor import VideoFileClip, clips_array  # MoviePy 1.x import path

movie_L = 'movie_L.mp4'
movie_R = 'movie_R.mp4'
output_movie = 'new_movie.mp4'
clip_L = VideoFileClip(movie_L)
(width_L, height_L) = clip_L.size
clip_L = clip_L.resize((width_L/2, height_L))
clip_R = VideoFileClip(movie_R)
(width_R, height_R) = clip_R.size
clip_R = clip_R.resize((width_R/2, height_R))
print("*** Make an array of the two movies side by side")
arrayClip = clips_array([[clip_L, clip_R]])
print("*** Write the video file")
arrayClip.write_videofile(output_movie, threads=4, audio = False)
I realize that this is old, but for anyone still having this issue, be sure to add
progress_bar = False to your code, e.g.:
arrayClip.write_videofile(output_movie, threads=4, audio = False, progress_bar = False)
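Having the progress bar print into IDLE each time it updates takes up a ton of memory, slowing your program down until it stops completely. In MoviePy 1.0 and later, the progress_bar argument was replaced by logger, so the equivalent there is logger=None.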
I have also had problems with slow rendering. I find that it helps a lot to use multithreading and also to set the bitrate.
This is my configuration:
videoclip.write_videofile("fractal.mp4",fps=20,threads=16,logger=None,codec="mpeg4",preset="slow",ffmpeg_params=['-b:v','10000k'])
This works very well even with preset set to slow. That setting gives better quality for the same number of bits; if quality is not an issue, you could set it to medium or fast to gain some speed.
Let's say I have a big matrix saved on disk. Storing it all in memory is not really feasible, so I use memmap to access it:
A = np.memmap(filename, dtype='float32', mode='r', shape=(3000000,162))
Now let's say I want to iterate over this matrix (not necessarily in order) such that each row will be accessed exactly once.
p = some_permutation_of_0_to_2999999()
I would like to do something like that:
start = 0
end = 3000000
num_rows_to_load_at_once = some_size_that_will_fit_in_memory()
while start < end:
    indices_to_access = p[start:start+num_rows_to_load_at_once]
    do_stuff_with(A[indices_to_access, :])
    start = min(end, start+num_rows_to_load_at_once)
As this process goes on, my computer becomes slower and slower, and my RAM and virtual memory usage explodes.
Is there some way to force np.memmap to use at most a certain amount of memory? (I know I won't need more than the number of rows I'm planning to read at a time, and that caching won't really help me since I'm accessing each row exactly once.)
Alternatively, is there some other way to iterate (generator-like) over a NumPy array in a custom order? I could write it manually using file.seek, but that turns out to be much slower than the np.memmap implementation.
do_stuff_with() does not keep any reference to the array it receives, so there are no "memory leaks" in that respect.
Thanks
This has been an issue that I've been trying to deal with for a while. I work with large image datasets and numpy.memmap offers a convenient solution for working with these large sets.
However, as you've pointed out, if I need to access each frame (or row in your case) to perform some operation, RAM usage will max out eventually.
Fortunately, I recently found a solution that will allow you to iterate through the entire memmap array while capping the RAM usage.
Solution:
import numpy as np

# create a memmap array
input = np.memmap('input', dtype='uint16', shape=(10000,800,800), mode='w+')

# create a memmap array to store the output
output = np.memmap('output', dtype='uint16', shape=(10000,800,800), mode='w+')

def iterate_efficiently(input, output, chunk_size):
    # create an empty array to hold each chunk
    # the size of this array will determine the amount of RAM usage
    holder = np.zeros([chunk_size,800,800], dtype='uint16')

    # iterate through the input in chunks, add 5, and write to output
    for i in range(0, input.shape[0], chunk_size):
        n = min(chunk_size, input.shape[0] - i)  # the last chunk may be shorter
        holder[:n] = input[i:i+n]   # read in chunk from input
        holder[:n] += 5             # perform some operation
        output[i:i+n] = holder[:n]  # write chunk to output

def iterate_inefficiently(input, output):
    output[:] = input[:] + 5
Timing Results:
In [11]: %timeit iterate_efficiently(input,output,1000)
1 loop, best of 3: 1min 48s per loop
In [12]: %timeit iterate_inefficiently(input,output)
1 loop, best of 3: 2min 22s per loop
The size of the array on disk is ~12GB. Using the iterate_efficiently function keeps the memory usage to 1.28GB whereas the iterate_inefficiently function eventually reaches 12GB in RAM.
This was tested on Mac OS.
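The memory figures quoted above follow from the array shapes. As a quick check, assuming decimal gigabytes (10**9 bytes) and the uint16 frames of 800 x 800 used in the answer:

```python
# Sizes taken from the answer: uint16 frames of 800 x 800,
# 10000 frames in total, processed in chunks of 1000 frames.
BYTES_PER_VALUE = 2  # uint16
FRAME_VALUES = 800 * 800

def gigabytes(n_frames):
    """Size of n_frames frames in decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_frames * FRAME_VALUES * BYTES_PER_VALUE / 1e9

holder_gb = gigabytes(1000)   # the in-RAM chunk buffer
total_gb = gigabytes(10000)   # the whole array on disk
print(holder_gb, total_gb)
```

The chunk buffer accounts for the 1.28GB figure, so halving chunk_size should roughly halve the RAM ceiling.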
I've been experimenting with this problem for a couple of days now, and it appears there are two ways to control memory consumption using np.memmap. The first is reliable, while the second would require some testing and will be OS dependent.
Option 1 - reconstruct the memory map with each read / write:
import numpy as np

def MoveMMapNPArray(data, output_filename):
    CHUNK_SIZE = 4096
    for idx in range(0, data.shape[1], CHUNK_SIZE):
        # re-open both maps on every iteration so earlier chunks are not kept mapped
        x = np.memmap(data.filename, dtype=data.dtype, mode='r', shape=data.shape, order='F')
        y = np.memmap(output_filename, dtype=data.dtype, mode='r+', shape=data.shape, order='F')
        end = min(idx+CHUNK_SIZE, data.shape[1])
        y[:,idx:end] = x[:,idx:end]
Where data is of type np.memmap. This discarding of the memmap object with each read keeps the array from being collected into memory and will keep memory consumption very low if the chunk size is low. It likely introduces some CPU overhead but was found to be small on my setup (MacOS).
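The same idea can be adapted to the permuted row access from the question. This is a minimal sketch under stated assumptions, not a tested fix for the original workload: iter_rows_in_chunks is a hypothetical helper, the file is a small stand-in for the 3000000 x 162 matrix, and sorting the indices inside each chunk is an extra choice that changes the visiting order within a chunk.

```python
import os
import tempfile

import numpy as np

def iter_rows_in_chunks(filename, dtype, shape, order, chunk_rows):
    """Yield (indices, rows) chunk by chunk, re-opening the memmap each time."""
    for start in range(0, len(order), chunk_rows):
        # Sorting within the chunk reads the file front to back (an assumption
        # that order inside a chunk does not matter to the caller).
        idx = np.sort(order[start:start + chunk_rows])
        mm = np.memmap(filename, dtype=dtype, mode='r', shape=shape)
        rows = np.array(mm[idx, :])  # copy the selected rows into ordinary RAM
        del mm                       # drop the map before handing out the chunk
        yield idx, rows

# Small stand-in file so the sketch runs anywhere.
shape = (1000, 16)
path = os.path.join(tempfile.mkdtemp(), "matrix.dat")
data = np.arange(shape[0] * shape[1], dtype="float32").reshape(shape)
data.tofile(path)

perm = np.random.default_rng(0).permutation(shape[0])
seen = 0
total = 0.0
for idx, rows in iter_rows_in_chunks(path, "float32", shape, perm, chunk_rows=128):
    seen += len(idx)                           # every row is visited exactly once
    total += float(rows.sum(dtype="float64"))  # placeholder for do_stuff_with
```

Whether the pages are actually given back after each chunk still depends on the OS, so memory use would need to be measured on the real data.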
Option 2 - construct the mmap buffer yourself and provide memory advice
If you look at the np.memmap source code, you can see that it is relatively simple to create your own memory-mapped NumPy array. Specifically, with the snippet:
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap_np_array = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm, offset=array_offset, order=order)
Note that this Python mmap instance is stored as the np.memmap's private _mmap attribute.
With access to the Python mmap object, and Python 3.8 or later, you can use its madvise method.
This allows you to advise the OS to free memory where available. The various madvise constants are documented for Linux, with some generic cross-platform options specified.
The MADV_DONTNEED constant looks promising (MADV_DONTDUMP, by contrast, only excludes pages from core dumps on Linux), but I haven't tested memory consumption with it like I have for option 1.
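As a rough illustration of option 2, the sketch below calls madvise on a plain Python mmap object. It assumes MADV_DONTNEED is the relevant advice, which is not something tested in this answer, and it guards the call because both the method and the constant exist only on some platforms. The scratch file is a made-up stand-in for the real data file.

```python
import mmap
import os
import tempfile

# Write a small scratch file to map (hypothetical stand-in for the data file).
path = os.path.join(tempfile.mkdtemp(), "scratch.bin")
payload = bytes(range(256)) * 4096  # 1 MiB
with open(path, "wb") as f:
    f.write(payload)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_pass = mm[:]  # touch every page by reading the whole mapping

    # madvise (Python 3.8+) and MADV_DONTNEED are platform dependent.
    advice = getattr(mmap, "MADV_DONTNEED", None)
    advised = False
    if advice is not None and hasattr(mm, "madvise"):
        mm.madvise(advice)  # hint that the pages of this mapping may be dropped
        advised = True

    # The data stays readable; any dropped pages are re-read from the file.
    second_pass = mm[:]
    mm.close()

print("madvise applied:", advised)
```

For a np.memmap, the same call would go through its private _mmap attribute mentioned above, with the usual caveat that private attributes can change between NumPy versions.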
First some background
I am trying to write my own set of tools for video analysis, mainly for detecting render errors like flashing frames and possibly some other stuff in the future.
The (obvious) goal is to write a script, that is faster and more accurate than me watching the file in real time.
Using OpenCV, I have something that looks like this:
import cv2

vid = cv2.VideoCapture("Video/OpenCV_Testfile.mov", cv2.CAP_FFMPEG)
width = 1024
height = 576
length = int(vid.get(cv2.CAP_PROP_FRAME_COUNT))  # get() returns a float; range() needs an int

for f in range(length):
    blue_values = []
    vid.set(cv2.CAP_PROP_POS_FRAMES, f)
    is_read, frame = vid.read()
    if is_read:
        for row in range(height):
            for col in range(width):
                blue_values.append(frame[row][col][0])
    print(blue_values)

vid.release()
This just prints out a list of all blue values of every frame.
- Just for simplicity (My actual script compares a few values across each frame and only saves the frame number when all are equal)
Although this works, it is not a very fast operation. (There are nested loops, but most importantly, the read() method has to be called for every frame, which is rather slow.)
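The nested-loop part, at least, can be replaced by a single NumPy slice, since OpenCV returns each frame as a height x width x 3 array in BGR order. The sketch below uses a small synthetic frame instead of a decoded one so that it runs without a video file; the frame size and random contents are assumptions for illustration only.

```python
import numpy as np

# Synthetic stand-in for one decoded frame: height x width x 3, BGR order.
height, width = 36, 64
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)

# Nested-loop version, as in the script above.
blue_loop = []
for row in range(height):
    for col in range(width):
        blue_loop.append(frame[row][col][0])

# Vectorized version: slice the blue channel once and flatten it
# in the same row-major order as the loops.
blue_fast = frame[:, :, 0].ravel()

same = np.array_equal(np.array(blue_loop), blue_fast)
print(same)
```

In the reading loop this would reduce the two inner loops to frame[:, :, 0]. Dropping the per-frame vid.set(cv2.CAP_PROP_POS_FRAMES, f) seek and simply calling vid.read() sequentially may also help, though that is not measured here.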
I tried to use multiprocessing but basically ended up having the same crashes as described here:
how to get frames from video in parallel using cv2 & multiprocessing in python
I have a 20s long 1024x576@25fps test file which performs as follows:
mov, ProRes: 15s
mp4, h.264: 30s (too slow)
My machine is capable of playing back h.264 in 1920x1080@50fps with mplayer (which uses ffmpeg to decode). So, I should be able to get more out of this. Which leads me to
my Question
How can I decode a video and simply dump all pixel values into a list for further (possibly multithreaded) operations? Speed is really all that matters. Note: I'm not fixated on OpenCV. Whatever works best.
Thanks!
I am currently using a Python script to process information stored in the EnSight Gold format. My Python (2.6) script uses VTK (5.10.0) to process the file: I use the vtkEnSightGoldReader to read the data and loop over the time steps. In principle this works fine for smaller datasets; however, for large datasets (GBs), I see the memory usage (via top) increasing with time while the process is running. The memory fills up slowly, but in some cases problems are inevitable.
The following is the minimal script that I reduced my issue to.
import vtk
reader = vtk.vtkEnSightGoldReader()
reader.SetCaseFileName("case.case")
reader.Update()
# Get time values
timeset=reader.GetTimeSets()
time=timeset.GetItem(0)
timesteps=time.GetSize()
#reader.ReleaseDataFlagOn()
for j in range(timesteps):
    curTime=time.GetTuple(j)[0]
    print curTime
    reader.SetTimeValue(curTime)
    reader.Update()
    #reader.RemoveAllInputs()
My question is, how can I unload/replace the data that is stored in the memory, instead of using more memory continuously?
As you can see in my source code, I tried the member functions "RemoveAllInputs" and "ReleaseDataFlagOn", but they don't work or I used them in the wrong way. Unfortunately, I am not getting any closer to a solution.
Something else I tried is the DeepCopy() approach, which I found on the VTK website. However, it seems that this approach is not useful for me, because I get the memory issues even before calling GetOutput().
There is indeed a (minor) memory leak in the vtkEnSightGoldReader. It results from a collection object not being cleared properly, which becomes apparent only when processing very large datasets. Technically it is not a memory leak, since the memory is properly released after a run.
This can only be solved by applying a patch to the VTK source and recompiling. I received the patch below from people at Kitware, so I would assume it has been rolled out in later versions of VTK.
diff --git a/IO/vtkEnSightReader.cxx b/IO/vtkEnSightReader.cxx
index 68a9b8f..7ab8ddd 100644
--- a/IO/vtkEnSightReader.cxx
+++ b/IO/vtkEnSightReader.cxx
@@ -985,6 +985,8 @@ int vtkEnSightReader::ReadCaseFileTime(char* line)
   int timeSet, numTimeSteps, i, filenameNum, increment, lineRead;
   float timeStep;
+  this->TimeSetFileNameNumbers->RemoveAllItems();
+
   // found TIME section
   int firstTimeStep = 1;