I'm using Librosa and I have a memory problem.
I have a lot of audio files, let's say a hundred.
I process audio files one by one.
Each audio file is loaded in chunks of 1 minute.
Each chunk is processed before I move to the next chunk.
This way, I know that I never have more than 60s of audio in memory at a given time.
This allows me to avoid using too much memory during the whole process.
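For context, a minimal sketch of that loading scheme (the file names, the total duration, and process_chunk() are placeholders, not my real code):

```python
import librosa

SAMPLING_RATE = 22050
CHUNK_SECONDS = 60
TOTAL_SECONDS = 600          # assumed known length of each file, for illustration

def process_chunk(y, sr):
    pass                     # stands in for the real per-chunk processing

for path in ["clip_01.wav", "clip_02.wav"]:   # hypothetical file names
    for offset in range(0, TOTAL_SECONDS, CHUNK_SECONDS):
        y, sr = librosa.load(path, sr=SAMPLING_RATE,
                             offset=offset, duration=CHUNK_SECONDS)
        process_chunk(y, sr)  # at most ~60 s of audio held at a time
```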
For some reason, the memory used by the process is growing over time.
Here is a simpler version of the code:
import librosa
import matplotlib.pyplot as plt
import os
import psutil

SAMPLING_RATE = 22050
N_FFT = 2048
HOP_LENGTH = 1024

def foo():
    y, sr = librosa.load("clip_04.wav", sr=SAMPLING_RATE, offset=600, duration=60)
    D = librosa.stft(y, n_fft=N_FFT, hop_length=HOP_LENGTH)
    spec_mag = abs(D)
    spec_db = librosa.amplitude_to_db(spec_mag)
    return 42  # I return a constant to make sure the memory is (or should be) released

def main():
    process = psutil.Process(os.getpid())
    array = []
    for i in range(100):
        foo()
        m = int(process.memory_info().rss / 1024**2)
        array.append(m)
    plt.figure()
    plt.plot(array)
    plt.xlabel('iterations')
    plt.ylabel('MB')
    plt.show()

if __name__ == '__main__':
    main()
Using this code, the memory increases like this:
Is that normal? And if it is, is there a way to clear Librosa memory at each iteration?
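One possible workaround (a sketch, not a confirmed fix) is to isolate each iteration in a short-lived worker process, so whatever memory it holds onto is returned to the OS when the worker exits:

```python
from multiprocessing import Pool

def foo_in_worker(path):
    # Same work as foo() above, but done inside a throwaway process.
    import librosa
    y, sr = librosa.load(path, sr=22050, offset=600, duration=60)
    D = librosa.stft(y, n_fft=2048, hop_length=1024)
    spec_db = librosa.amplitude_to_db(abs(D))
    return 42

if __name__ == '__main__':
    # maxtasksperchild=1 recycles the worker after every task.
    with Pool(processes=1, maxtasksperchild=1) as pool:
        pool.map(foo_in_worker, ["clip_04.wav"] * 100)
```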
In the CuPy documentation, it is stated that
"CuPy caches the kernel code sent to GPU device within the process, which reduces the kernel compilation time on further calls."
This means that after the first call to a CuPy function, subsequent calls to the same function are extremely fast. An example is as follows:
import cupy as cp
from timeit import default_timer as timer
import time

mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()

def multiply():
    rand = cp.random.default_rng()  # This is the fast way of creating large arrays with cp
    arr = rand.integers(0, 100_000, (10000, 1000))  # Create array
    y = cp.multiply(arr, 42)  # Multiply by 42, randomly chosen number
    return y

if __name__ == '__main__':
    times = []
    for i in range(21):
        mempool.free_all_blocks()
        pinned_mempool.free_all_blocks()
        start = timer()
        multiply()
        times.append(timer() - start)
    print(times)
This will return the times:
[0.17462146899993058, 0.0006819850000283623, 0.0006159440001738403, 0.0006145069999092811, 0.000610309999956371, 0.0006169410000893549, 0.0006062159998236893, 0.0006096620002153941, 0.0006096250001519365, 0.0006106630000886071, 0.0006063629998607212, 0.0006168999998408253, 0.0006058349999875645, 0.0006090080000831222, 0.0005964219999441411, 0.0006113049998930364, 0.0005968339999071759, 0.0005951619998540991, 0.0005980400001135422, 0.0005941219999385794, 0.0006568090000200755]
Only the first call includes the time it takes to compile the kernel.
Is there a way to flush everything in order to force the compilation for each subsequent call to multiply()?
Currently, there is no way to disable kernel caching in CuPy. The only available option is to disable persistent kernel caching on disk (CUPY_CACHE_IN_MEMORY=1), but kernels are still cached in memory, so compilation runs only once per process.
https://docs.cupy.dev/en/stable/user_guide/performance.html#one-time-overheads
https://docs.cupy.dev/en/latest/reference/environment.html
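For completeness, a sketch of how that environment variable is typically applied; it has to be set before cupy is imported, and the in-memory cache still means only the first call in the process compiles:

```python
import os
os.environ["CUPY_CACHE_IN_MEMORY"] = "1"   # skip writing compiled kernels to the on-disk cache

import cupy as cp

x = cp.multiply(cp.arange(10), 42)         # first kernel launch in the process still compiles
print(x)
```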
I am new to Python, but I've worked a lot with MATLAB and R, including plenty of custom audio and signal processing. In many years of working with MATLAB and R on large datasets I have never run into out-of-memory issues.
For many reasons I've been trying to switch over to Python, but I have grown extremely frustrated by out-of-memory errors.
For example, I'm using Python 3 (64-bit) with JupyterLab as the IDE, on a Windows PC with an i7 processor and 16 GB of RAM. MATLAB and R, keeping every step and variable in memory, have no problem synthesizing a 1-second signal and plotting it. When I attempt the same in Python, it runs out of memory before the final step of combining the two signal elements into a single array. I'm using NumPy and matplotlib to handle these tasks.
I've divided my code into functions so that unused arrays and variables get released, but to no avail: I still get the out-of-memory error.
From what I've read, the issue likely stems from NumPy storing arrays as 64-bit floats by default. Can someone suggest a way to streamline the generation of large arrays with NumPy so that I can get around these errors?
I generate my arrays in 3 main ways:
np.arange(0, SigLen, SigLen/fs)
np.zeros(SigLen)
np.append(SigComponents) # building an array in a for loop
Again, these signals are generally 1 s long with a sampling frequency of 44100 Hz (fs = 44100), occasionally up to 5 s. Here again, I've never run into issues in R or MATLAB.
Thoughts? Suggestions?
```python
## Load Dependancies
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

## Generate General Variables
fs = 441000

## Generate Signal Variables
SignalLength = 5.0      # Signal Length (or file length in seconds)
PulseLength = 0.025     # In Seconds
NumberofPulses = 3      # # of pulses in syllable
InterPInterval = 0.025  # In Seconds
InterSInterval = 0.375  # In Seconds
CarrierFreq = 4200.0    # In Hertz
RiseTime = 0.005        # In Seconds
FallTime = 0.005        # In Seconds
Amplitude = 0.8         # In mV

## BUILD PULSE FUNCTION
def BuildPulse():
    wt = np.arange(0, PulseLength, 1.0/fs)  # Builds an array
    CarrierCall = Amplitude * np.sin(np.pi * 2.0 * CarrierFreq * wt)  # Build Signal
    # Rise & Fall: sinusoidal increase decrease
    wrt = np.arange(0, np.pi/2.0, (np.pi/2.0)/(fs*RiseTime))
    wrt_env = np.sin(wrt)
    wft = np.arange(0, np.pi/2.0, (np.pi/2.0)/(fs*FallTime))
    wft = np.flip(wft)
    wft_env = np.sin(wft)
    CarrierCall[0:np.size(wrt, axis=None)] = CarrierCall[0:np.size(wrt, axis=None)] * wrt_env
    CarrierCall[-np.size(wft, axis=None):] = CarrierCall[-np.size(wft, axis=None):] * wft_env
    plt.plot(wt, CarrierCall)
    plt.show()
    ## EXPORT SIGNAL AS A WAV
    CarrierCall *= 32767  # Binary 16 ones = decimal value of 65535 / 2
    CarrierCall = np.int16(CarrierCall)  # Changing datatype from float to int16 datatype
    wavfile.write("Pulse_Test2.wav", fs, CarrierCall)
    return CarrierCall

## BUILD SYLLABLE FUNCTION
def BuildSyllable():
    SamplesIPI = InterPInterval * fs
    SpaceIPI = np.zeros(int(SamplesIPI))
    Syllable = []
    SyllUnit = [Pulse, SpaceIPI]
    for i in range(0, NumberofPulses):
        Syllable.append(SyllUnit)
    st = np.arange(0, (np.size(Syllable)), 1.0/fs)
    st = np.arange(0, 33075, 1.0/fs)
    plt.plot(st, Syllable)
    plt.show()
    return Syllable

## BUILDING THE PLAYBACK SIGNAL FUNCTION
def BuildCall():
    SampleSignal = SignalLength * fs
    SamplesISI = InterSInterval * fs
    NumberofSyllables = np.floor(SamplesSignal / (np.size(Syllable) + SamplesISI))
    SpaceISI = np.zeros(int(SamplesISI))
    Calls = []
    for i in range(0, NumberofSyllables):
        Calls.append(Syllable, SpaceISI)
    return Calls

## BUILDING THE FULL PLAYBACK SIGNAL
Pulse = BuildPulse()
Syllable = BuildSyllable()
Playback = BuildCall()

# plot playback
time_sig = np.arange(1, np.size(int(Playback)), np.size(int(Playback))/fs)
plt.plot(time_sig, Playback)
plt.show()
```
I made two minor changes, and this seems to produce reasonable results now:
def BuildSyllable():
    SamplesIPI = InterPInterval * fs
    SpaceIPI = np.zeros(int(SamplesIPI))
    Syllable = []
    SyllUnit = [Pulse, SpaceIPI]
    for i in range(0, NumberofPulses):
        Syllable.append(SyllUnit)
    # Convert from Python list to np.array and reshape to 1D.
    Syllable = np.array(Syllable).reshape(np.size(Syllable))
    # Fix misunderstanding of arange.
    st = np.arange(0, (np.size(Syllable))/fs, 1.0/fs)
    plt.plot(st, Syllable)
    plt.show()
    return Syllable
My compliments for including a complete runnable example; without it, this would have taken much longer.
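As a general illustration of the memory side of the question (a sketch using the fs = 44100 and 1-5 s figures from the question text): arrays of this size are tiny, and preallocating with an explicit dtype is both smaller and much faster than growing an array with np.append inside a loop.

```python
import numpy as np

fs = 44100
seconds = 5
n = fs * seconds                       # 220,500 samples

sig64 = np.zeros(n)                    # default float64: ~1.8 MB
sig32 = np.zeros(n, dtype=np.float32)  # float32: ~0.9 MB
print(sig64.nbytes / 1e6, sig32.nbytes / 1e6, "MB")

# Instead of np.append in a loop (which reallocates and copies the whole
# array on every call), collect the pieces in a list and concatenate once.
parts = [np.zeros(fs, dtype=np.float32) for _ in range(seconds)]
sig = np.concatenate(parts)
```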
I have been trying this for quite some time now, but my array remains unchanged.
The array is TC_p_value, and the function I am trying to run in parallel is TC_stats. The code runs fine sequentially, but takes too long (about an hour), so to reduce the processing time I split the original 1000x100 array into 10 sets of 100x100. Although the code runs without an error, I always get back the array exactly as it was originally defined (all zeros). I tried declaring TC_p_value as global so that each run could assign values to its own part of the array, but it seems that either filling a single array from multiple processes simply doesn't work this way, or there is something wrong with my logic.
Any help is greatly appreciated.
The code is below.
import numpy as np
import pingouin as pg  # A package to do regression
from tqdm import tqdm

TC_p_value = np.zeros((Treecover.shape[1], Treecover.shape[2]))  # let this array be of size 1000 x 100

def TC_stats(grid_start):
    global TC_p_value
    for lat in tqdm(range(grid_start, grid_start + 100)):
        for lon in range(Treecover.shape[2]):
            TC_p_value[lat, lon] = pg.corr(y=Treecover[:, lat, lon].values,
                                           x=np.arange(1, 16, 1))['p-val'].values[0]

# Multiprocessing starts here
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    grid = np.arange(0, 1000, 100)  # Running it in a group of 100, 10 times
    pool.map(TC_stats, grid)
    pool.close()
    pool.join()
The problem is that an array defined globally is not shared across processes: each worker operates on its own copy, so the values it writes never make it back to the parent. Thus, you need to use shared memory.
import ctypes
import multiprocessing as mp
import numpy as np
import pingouin as pg  # A package to do regression
from tqdm import tqdm

N, M = Treecover.shape[1], Treecover.shape[2]

mp_arr = mp.Array(ctypes.c_double, N * M)
TC_p_value = np.frombuffer(mp_arr.get_obj())
TC_p_value = TC_p_value.reshape((N, M))
# let this array be of size 1000 x 100

def TC_stats(grid_start):
    TC_p_value = np.frombuffer(mp_arr.get_obj())
    TC_p_value = TC_p_value.reshape((N, M))
    for lat in tqdm(range(grid_start, grid_start + 100)):
        for lon in range(Treecover.shape[2]):
            TC_p_value[lat, lon] = pg.corr(y=Treecover[:, lat, lon].values,
                                           x=np.arange(1, 16, 1))['p-val'].values[0]

def init(shared_arr_):
    global mp_arr
    mp_arr = shared_arr_

# Multiprocessing starts here
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(initializer=init, initargs=(mp_arr,))
    grid = np.arange(0, 1000, 100)  # Running it in a group of 100, 10 times
    pool.map_async(TC_stats, grid)
    pool.close()
    pool.join()
I ran the code above with a slightly modified toy example, and it worked.
Reference: Use numpy array in shared memory for multiprocessing
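An alternative that avoids shared memory entirely is to have each worker compute and return its own block of rows and let the parent reassemble them. A sketch, assuming the same Treecover data and pg.corr call as in the question:

```python
import numpy as np
from multiprocessing import Pool
import pingouin as pg

def tc_block(grid_start):
    # Each worker fills and returns a private 100 x M block.
    block = np.zeros((100, Treecover.shape[2]))
    for lat in range(100):
        for lon in range(Treecover.shape[2]):
            block[lat, lon] = pg.corr(y=Treecover[:, grid_start + lat, lon].values,
                                      x=np.arange(1, 16, 1))['p-val'].values[0]
    return block

if __name__ == '__main__':
    with Pool() as pool:
        blocks = pool.map(tc_block, np.arange(0, 1000, 100))
    TC_p_value = np.vstack(blocks)    # reassemble the full 1000 x 100 result
```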
I am performing a DCT (on a Raspberry Pi). I've broken the image into 8x8 blocks. Initially I performed the DCT in a nested for loop (without multithreading) and observed that it takes about 18 seconds for a 512x512 image.
Here's the code with multiple threads:
#!/usr/bin/env python
from __future__ import print_function, division
import time
start_time = time.time()

import cv2
import numpy as np
import sys
import pylab as plt
import threading
import Queue
from numpy import empty, arange, exp, real, imag, pi
from numpy.fft import rfft, irfft
from pprint import pprint

queue = Queue.Queue()

if len(sys.argv) > 1:
    im = cv2.imread(sys.argv[1])
else:
    im = cv2.imread('baboon.jpg')

im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
h, w = im.shape[:2]
DF = np.zeros((h, w))
Nb = 8

def dct2(y):
    M = y.shape[0]
    N = y.shape[1]
    a = empty([M, N], float)
    b = empty([M, N], float)
    for i in range(M):
        a[i, :] = dct(y[i, :])
    for j in range(N):
        b[:, j] = dct(a[:, j])
    queue.put(b)

def dct(y):
    N = len(y)
    y2 = empty(2*N, float)
    y2[:N] = y[:]
    y2[N:] = y[::-1]
    c = rfft(y2)
    phi = exp(-1j*pi*arange(N)/(2*N))
    return real(phi*c[:N])

def Main():
    jobs = []
    for row in range(0, h, Nb):
        for col in range(0, w, Nb):
            f = im[row:(row+Nb), col:(col+Nb)]
            thread = threading.Thread(target=dct2(f))
            jobs.append(thread)
            df = queue.get()
            DF[row:row+Nb, col:col+Nb] = df
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

if __name__ == "__main__":
    Main()
    cv2.imwrite('dct_img.jpg', DF)
    print("--- %s seconds ---" % (time.time() - start_time))
    plt.imshow(DF, cmap='Greys')
    plt.show()
    cv2.waitKey(0)
    cv2.destroyAllWindows()
After switching to multiple threads, this code takes about 25 seconds to execute. What's wrong? Have I implemented multithreading incorrectly? I want to reduce the time taken to perform the DCT as much as possible (ideally to 1-5 seconds). Any suggestions?
Is there any other concept or method (I've read posts on multiprocessing) that would significantly reduce my execution and processing time?
Due to the GIL, all your threads are executed in sequence (not in parallel), so you might want to switch to multiprocessing.
Another option is Numba, which can greatly speed up ordinary Python code and can also release the GIL.
In Python, multithreading only helps performance when the workload mixes I/O with CPU tasks.
For your problem you should use multiprocessing.
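A sketch of what that could look like, parallelising over 8-pixel-high strips rather than individual 8x8 blocks (to keep the per-task overhead down) and using scipy.fft.dctn for the block transform; the normalisation may differ slightly from the hand-rolled dct in the question:

```python
import numpy as np
from multiprocessing import Pool
from scipy.fft import dctn

Nb = 8

def dct_strip(strip):
    # 2-D DCT of every 8x8 block in one horizontal strip of the image.
    out = np.empty(strip.shape, dtype=float)
    for c in range(0, strip.shape[1], Nb):
        out[:, c:c + Nb] = dctn(strip[:, c:c + Nb], norm='ortho')
    return out

def dct_image(im):
    strips = [im[r:r + Nb, :] for r in range(0, im.shape[0], Nb)]
    with Pool() as pool:              # one task per strip, one process per core
        results = pool.map(dct_strip, strips)
    return np.vstack(results)
```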
Maybe the other posters are right about the GIL, but OpenCV and NumPy both release the GIL, so I would at least expect some speedup from a multithreaded solution.
I would look at how many threads you are creating simultaneously; it's probably a lot, since you start one for each 8x8-pixel sub-picture. Every time a thread is taken off the CPU and replaced by another there is a small overhead, which adds up noticeably when you have that many threads.
If that is the case, you will probably gain performance by not starting them all at once, but only running about as many as you have CPU cores (a few more or a few less; experiment) and starting the next thread only when one has finished.
Look at the answers to this question on how to do this with minimal effort.
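A sketch of that idea with a bounded pool of threads (concurrent.futures, available in Python 3 or via the futures backport on Python 2; dct2 here is assumed to return its block instead of pushing it onto a queue):

```python
import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor

Nb = 8

def dct_blocks_threaded(im, dct2):
    coords = [(r, c) for r in range(0, im.shape[0], Nb)
                     for c in range(0, im.shape[1], Nb)]
    DF = np.zeros(im.shape)
    # Never run more threads at once than there are CPU cores.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = pool.map(lambda rc: dct2(im[rc[0]:rc[0] + Nb, rc[1]:rc[1] + Nb]), coords)
        for (r, c), block in zip(coords, results):
            DF[r:r + Nb, c:c + Nb] = block
    return DF
```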
What is the fastest Python mechanism for getting data read off a serial port to a separate process that is plotting that data?
I am plotting eeg data in real-time that I read off of a serial port. The serial port reading and packet unpacking code works fine, in that if I read and store the data, then later plot the stored data, it looks great. Like this:
note: device generates test sine wave for debugging
I am using pyqtgraph for the plotting. Updating the plot in the same process that I read the serial data in is not an option, because the slight delay between serial read() calls causes the serial buffer to overflow and bad checksums ensue. pyqtgraph has provisions for rendering the graph in a separate process, which is great, but the bottleneck seems to be the inter-process communication. I have tried various configurations of Pipe() and Queue(), all of which result in laggy, flickering graph updates. So far, the smoothest, most consistent method of getting new values from the serial port to the graph seems to be through shared memory, like so:
from pyqtgraph.Qt import QtGui
import pyqtgraph as pg
from multiprocessing import Process, Array, Value, Pipe
from serial_interface import EEG64Board
from collections import deque

def serialLoop(arr):
    eeg = EEG64Board(port='/dev/ttyACM0')
    eeg.openSerial()
    eeg.sendTest('1')  # Tells the eeg device to start sending data
    while True:
        data = eeg.readEEG()  # Returns an array of the 8 latest values, one per channel
        if data != False:  # Returns False if bad checksum
            val.value = data[7]

val = Value('d', 0.0)
q = deque([], 500)

def graphLoop():
    global val, q
    plt = pg.plot(q)
    while True:
        q.append(val.value)
        plt.plot(q, clear=True)
        QtGui.QApplication.processEvents()

serial_proc = Process(target=serialLoop, args=(val,), name='serial_proc')
serial_proc.start()

try:
    while True:
        graphLoop()
except KeyboardInterrupt:
    print('interrupted')
The above code performs real-time plotting by simply pulling the latest value recorded by the serialLoop and appending it to a deque. While the plot updates smoothly, it is only grabbing about 1 in 4 values, as seen in the resulting plot:
So, what multi-process or thread structure would you recommend, and then what form of IPC should be used between them?
Update:
I am receiving 2,000 samples per second. I am thinking that if I update the display at 100 fps and add 20 new samples per frame then I should be good. What is the best Python multithreading mechanism for implementing this?
This may not be the most efficient, but the following code achieves 100 fps for one plot, or 20 fps for 8 plots. The idea is very simple: share an array, an index, and a lock. The serial process fills the array and increments the index while it holds the lock; the plotting process periodically grabs all of the new values from the array and resets the index, again under the lock.
from pyqtgraph.Qt import QtGui
import pyqtgraph as pg
from multiprocessing import Process, Array, Value, Lock
from serial_interface import EEG64Board
from collections import deque

def serialLoop(arr, idx, lock):
    eeg = EEG64Board(port='/dev/ttyACM0')
    eeg.openSerial()
    eeg.sendTest('1')  # Tells the eeg device to start sending data
    while True:
        data = eeg.readEEG()  # Returns an array of the 8 latest values, one per channel
        if data != False:  # Returns False if bad checksum
            lock.acquire()
            for i in range(8):
                arr[i][idx.value] = data[i]
            idx.value += 1
            lock.release()
    eeg.sendTest('2')

arr = [Array('d', range(1024)) for i in range(8)]
idx = Value('i', 0)
q = [deque([], 500) for i in range(8)]
iq = deque([], 500)
lock = Lock()

lastUpdate = pg.ptime.time()
avgFps = 0.0

def graphLoop():
    global val, q, lock, arr, iq, lastUpdate, avgFps
    win = pg.GraphicsWindow()
    plt = list()
    for i in range(8):
        plt += [win.addPlot(row=(i+1), col=0, colspan=3)]
    # iplt = pg.plot(iq)
    counter = 0
    while True:
        lock.acquire()
        # time.sleep(.01)
        for i in range(idx.value):
            for j in range(8):
                q[j].append(arr[j][i])
        idx.value = 0
        lock.release()
        for i in range(8):
            plt[i].plot(q[i], clear=True)
        QtGui.QApplication.processEvents()
        counter += 1
        now = pg.ptime.time()
        fps = 1.0 / (now - lastUpdate)
        lastUpdate = now
        avgFps = avgFps * 0.8 + fps * 0.2

serial_proc = Process(target=serialLoop, args=(arr, idx, lock), name='serial_proc')
serial_proc.start()

graphLoop()

serial_proc.terminate()