I have several binary images (GeoTIFF) that I want to perform binary closing on.
Because there are at least 2600 images, I want to shorten the execution time of each operation.
I'm using Python and have found several packages that include this operation. Among these are SciPy and scikit-image:
import time
import numpy as np
from scipy.ndimage import binary_closing
from scipy.ndimage import grey_closing
import skimage.morphology as sm
....
arr = some_binary_image_as_numpy_array
# scikit-image binary closing with a disk of radius 5 (11x11 footprint)
start = time.time()
disk = sm.disk(5)
closed_arr = sm.binary_closing(arr, disk)
end = time.time()
print(end - start)
# SciPy binary closing with a 10x10 square structuring element
start = time.time()
closed_arr = binary_closing(arr, structure=np.ones((10, 10)))
end = time.time()
print(end - start)
output:
8.373806715011597
9.53745985031128
I know the scikit-image disk isn't the same as np.ones((10, 10)), so the two operations are not completely equivalent.
However, when I run SciPy's greyscale closing (grey_closing) I get this result:
start = time.time()
closed_arr = grey_closing(arr, size=(10, 10))
end = time.time()
print(end - start)
output:
4.35299277305603
On 10 out of the 11 images I tried, the number of pixels with value 1 was the same for grey_closing and binary_closing.
I know SciPy's grey_closing also has a structure argument, but when I use it, the execution time increases to 30 seconds.
So my question is: would SciPy's grey_closing using the size argument be the same as
using SciPy's binary_closing with the structure argument?
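One way to check this empirically is to run both functions on the same array and compare the results. The sketch below uses a synthetic image (an assumption, not one of the GeoTIFFs from the question) whose foreground stays well away from the borders. Away from the borders, a flat size=(10, 10) window and structure=np.ones((10, 10)) should describe the same neighbourhood. Near the borders the two can differ, because binary_closing pads with border_value=0 by default while grey_closing defaults to mode='reflect'.

```python
import numpy as np
from scipy import ndimage as ndi

# Synthetic binary image (hypothetical test data): random foreground
# confined to the centre, with a wide all-zero margin around it.
rng = np.random.default_rng(0)
img = np.zeros((200, 200), dtype=np.uint8)
img[50:150, 50:150] = rng.random((100, 100)) > 0.7

size = (10, 10)
closed_bin = ndi.binary_closing(img, structure=np.ones(size, dtype=bool))
closed_grey = ndi.grey_closing(img, size=size)

# Compare the two results as boolean masks.
same = np.array_equal(closed_bin, closed_grey.astype(bool))
print("identical:", same)
print("foreground pixels:", closed_bin.sum(), closed_grey.astype(bool).sum())
```

If the real images have foreground touching the border, repeating the comparison on them (or on a zero-padded copy) would show whether the border handling explains the one image out of 11 that differed.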
I would like to know: is there a simple way to parallelize einsum in NumPy?
I found some discussions:
Numpy np.einsum array multiplication using multiple cores
Any chance of making this faster? (numpy.einsum)
numpy.tensordot() only handles binary contractions over a single axis, and Numba requires writing out specific loops. Is there a simple and robust approach to parallelizing einsum (possibly including opt-einsum, tf-einsum, etc.) for arbitrary contractions?
Sample code follows (if necessary, I can use a more complicated contraction as the example):
import numpy as np
import timeit
import time
na = nc = 1000
nb = 1000
n_iter = 10
A = np.random.random((na,nb))
B = np.random.random((nb,nc))
t_total = 0.
for i in range(n_iter):
    start = time.time()
    C = np.einsum('ij,jk->ik', A, B)
    end = time.time()
    t_total += end - start
print('AB->C',(t_total)/n_iter)
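One low-effort option is np.einsum's own optimize argument. With optimize=True, NumPy may dispatch suitable contractions to BLAS-backed routines, which are typically multithreaded when NumPy is linked against a threaded BLAS. That depends on the installation, so any speed-up is an assumption to measure rather than a guarantee. A minimal sketch with smaller matrices than in the question (time_call is an illustrative helper):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((300, 300))
B = rng.random((300, 300))

def time_call(fn, n_iter=5):
    """Return the mean wall-clock time of fn() over n_iter runs."""
    start = time.perf_counter()
    for _ in range(n_iter):
        fn()
    return (time.perf_counter() - start) / n_iter

t_plain = time_call(lambda: np.einsum('ij,jk->ik', A, B))
t_opt = time_call(lambda: np.einsum('ij,jk->ik', A, B, optimize=True))
print('plain einsum  :', t_plain)
print('optimize=True :', t_opt)

# Both variants compute the same contraction.
C_plain = np.einsum('ij,jk->ik', A, B)
C_opt = np.einsum('ij,jk->ik', A, B, optimize=True)
```

The same optimize argument applies to contractions with more than two operands, where it also chooses a pairwise contraction order.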
I am new to Python but I've worked a lot with MATLAB and R. I have used both for custom audio and signal processing. In the many years I have worked with MATLAB and R on large datasets, I have never run into out-of-memory issues.
For many reasons I've tried switching over to Python, but I have grown extremely frustrated by out-of-memory errors.
For example, I'm using Python 3.* (64-bit) with JupyterLab as the IDE, on a Windows PC with an i7 processor and 16 GB of RAM. MATLAB and R, keeping all steps and variables in memory, have no problem synthesizing a 1 s signal and plotting it. I am attempting to do the same in Python, and it runs out of memory before completing the final step of combining the two signal elements into a single array. I'm using NumPy and Matplotlib to handle these tasks.
I've divided my code into functions in order to remove unused arrays and variables, but to no avail: I still get the out-of-memory error.
From what I've learned, the issue likely stems from Python saving NumPy arrays as floats. Can someone suggest a way to streamline the generation of large arrays using NumPy so that I can get around these errors?
I generate my arrays in 3 main ways:
np.arange(0, SigLen, SigLen/fs)
np.zeros(SigLen)
np.append(SigComponents) #building an array in a for loop.
Again, these signals are generally 1 s long with a sampling frequency of 44100 (fs=44100). Sometimes I process signals up to 5 s long. But here again, I've never run into issues in R or MATLAB.
Thoughts? Suggestions?
```python
## Load Dependencies
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
## Generate General Variables
fs = 441000
## Generate Signal Variables
SignalLength = 5.0 # Signal Length (or file length in seconds)
PulseLength = 0.025 # In Seconds
NumberofPulses = 3 # # of pulses in syllable
InterPInterval = 0.025 # In Seconds
InterSInterval = 0.375 # In Seconds
CarrierFreq = 4200.0 # In Hertz
RiseTime = 0.005 # In Seconds
FallTime = 0.005 # In Seconds
Amplitude = 0.8 # In mV
## BUILD PULSE FUNCTION
def BuildPulse():
    wt = np.arange(0,PulseLength, 1.0/fs) #Builds an array
    CarrierCall = Amplitude * np.sin(np.pi * 2.0 * CarrierFreq * wt) # Build Signal
    # Rise & Fall: sinusoidal increase decrease
    wrt = np.arange(0, np.pi/2.0, (np.pi/2.0)/(fs*RiseTime))
    wrt_env = np.sin(wrt)
    wft = np.arange(0, np.pi/2.0, (np.pi/2.0)/(fs*FallTime))
    wft = np.flip(wft)
    wft_env = np.sin(wft)
    CarrierCall[0:np.size(wrt,axis=None)] = CarrierCall[0:np.size(wrt, axis=None)]*wrt_env
    CarrierCall[-np.size(wft,axis=None):] = CarrierCall[-np.size(wft,axis=None):]*wft_env
    plt.plot(wt,CarrierCall)
    plt.show()
    ## EXPORT SIGNAL AS A WAV
    CarrierCall*=32767 #Binary 16 ones = decimal value of 65535 / 2
    CarrierCall = np.int16(CarrierCall) #Changing datatype from float to int16 datatype
    wavfile.write("Pulse_Test2.wav", fs, CarrierCall)
    return CarrierCall
## BUILD SYLLABLE FUNCTION
def BuildSyllable():
    SamplesIPI = InterPInterval * fs
    SpaceIPI = np.zeros(int(SamplesIPI))
    Syllable = []
    SyllUnit = [Pulse, SpaceIPI]
    for i in range(0 , NumberofPulses):
        Syllable.append(SyllUnit)
    st = np.arange(0,(np.size(Syllable)), 1.0/fs)
    st = np.arange(0,33075,1.0/fs)
    plt.plot(st,Syllable)
    plt.show()
    return Syllable
## BUILDING THE PLAYBACK SIGNAL FUNCTION
def BuildCall():
    SampleSignal = SignalLength * fs
    SamplesISI = InterSInterval * fs
    NumberofSyllables = np.floor(SamplesSignal / (np.size(Syllable) + SamplesISI))
    SpaceISI = np.zeros(int(SamplesISI))
    Calls = []
    for i in range(0 , NumberofSyllables):
        Calls.append(Syllable, SpaceISI)
    return Calls
## BUILDING THE FULL PLAYBACK SIGNAL
Pulse = BuildPulse()
Syllable = BuildSyllable()
Playback = BuildCall()
# plot playback
time_sig = np.arange(1, np.size(int(Playback)), np.size(int(Playback))/fs)
plt.plot(time_sig, Playback)
plt.show()
```
I made two minor changes, and this seems to produce reasonable results now:
def BuildSyllable():
    SamplesIPI = InterPInterval * fs
    SpaceIPI = np.zeros(int(SamplesIPI))
    Syllable = []
    SyllUnit = [Pulse, SpaceIPI]
    for i in range(0 , NumberofPulses):
        Syllable.append(SyllUnit)
    # Convert from Python list to np.array and reshape to 1D.
    Syllable = np.array(Syllable).reshape( np.size(Syllable) )
    # Fix misunderstanding of arange.
    st = np.arange(0,(np.size(Syllable))/fs, 1.0/fs)
    plt.plot(st,Syllable)
    plt.show()
    return Syllable
My compliments for including a complete runnable example. Without it, this might have taken much longer.
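The arange point is worth spelling out, since it is the likely source of the memory error: np.arange(start, stop, step) produces roughly (stop - start) / step elements, so passing a sample count as stop together with a step of 1.0/fs asks for fs times more elements than intended. A minimal sketch using the fs and stop value from the question's code; np.concatenate is shown as one alternative (my assumption, not part of the two changes above) for joining pieces that need not have equal lengths, and the pulse and gap arrays are hypothetical stand-ins:

```python
import numpy as np

fs = 441000          # sampling rate as written in the question's code
n_samples = 33075    # stop value hard-coded in the question's second arange call

# Elements requested by np.arange(0, n_samples, 1.0/fs), computed without
# allocating anything: about n_samples * fs values of 8 bytes each.
n_requested = n_samples * fs
print(n_requested, "elements,", n_requested * 8 / 1e9, "GB as float64")

# A time axis with exactly one entry per sample.
t = np.arange(n_samples) / fs

# Joining 1-D pieces of different lengths into one flat array.
pulse = np.ones(5)   # hypothetical stand-in for Pulse
gap = np.zeros(3)    # hypothetical stand-in for SpaceIPI
syllable = np.concatenate([pulse, gap] * 3)
print(t.shape, syllable.shape)
```

Building the time axis from an integer sample index also avoids the off-by-one lengths that a floating-point step in arange can occasionally produce.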
If I have an array on the GPU, copying it back to the CPU is really slow (on the order of hundreds of seconds), even for an array of shape (20, 256, 256).
My code is the following:
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
xt_gpu = cp.asarray(xt)
# Also very fast...
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
# Very very very very very slow....
result_cpu = cp.asnumpy(result_gpu)
I measured the times using cp.cuda.Event() with record and synchronize to avoid measuring spurious times, but the result is still the same: the GPU->CPU transfer is incredibly slow. However, with PyTorch or TensorFlow this is not the case (in my experience with similar data sizes and shapes). What am I doing wrong?
I think you might be timing it wrong. I modified the code to synchronize between every GPU operation and it seems like the convolution takes the majority of the time with both transfer operations being very fast.
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
import time
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
t0 = time.time()
xt_gpu = cp.asarray(xt)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
# Also very fast...
t0 = time.time()
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
# Very very very very very slow....
t0 = time.time()
result_cpu = cp.asnumpy(result_gpu)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
Output:
0.1380000114440918
4.032999753952026
0.0010001659393310547
To me it seems like you are not actually synchronizing between calls when you tested it. Until the transfer back to a numpy array all operations are simply queued up and seem to finish instantly without the synchronize calls. This would lead to the measured GPU->CPU transfer time actually being the time for the convolution and the transfer.
I ran into the same problem. I found that accessing float64 data was much faster than float32, so you could try .astype(np.float64).
I have tried the following Python median filters on time-series signals to find the fastest and most efficient function.
sig is a numpy array of size 80×188 which contains 188 samples measured by 80 sensors.
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import medfilt
from scipy.signal import medfilt2d
import time
sig = np.random.rand(80,188).astype('f')
print(type(sig))
print(type(sig[0][0]))
window_length = 181
t = time.time()
sigFiltered = medfilt2d(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.signal.medfilt2d: %g seconds' % elapsed)
t = time.time()
sigFiltered = median_filter(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.ndimage.median_filter: %g seconds' % elapsed)
t = time.time()
sigFiltered = medfilt(sig, (1,window_length))
elapsed = time.time() - t
print('scipy.signal.medfilt: %g seconds' % elapsed)
The code can be tried here.
The result of the filter is another time-series array of size 80×188 with smoothed time-points for each sensor.
MATLAB's medfilt1(sig, 181, [], 2) performs the filtering on the same data 10 times faster than scipy.signal.medfilt2d, which was the fastest of the Python functions. On my machine: MATLAB = 2 ms vs. Python = 20 ms. I think MATLAB uses multithreading and Python does not.
Is there any way to perform multithreaded median filtering to speed up the process, assigning sensors to different threads? Is there a more efficient median filter available in Python? Can I achieve MATLAB's performance in Python, or at least get closer to it?
With such a long filter relative to the input, most outputs of a standard medfilt are going to be the same. Were this a convolution, it would be a "full" convolution. If you instead only compute the outputs of a "valid" convolution, that will be much faster in this case:
t = time.time()
medians = []
for i in range(sig.shape[1] - window_length + 1):  # every full ("valid") window
    sig2 = sig[:, i:i+window_length]
    f = np.median(sig2, axis=1)
    medians.append(f)
sigFiltered = np.stack(medians).T
elapsed = time.time() - t
print('numpy.median: %g seconds' % elapsed)
numpy.median: 0.0015518 seconds
This is in the ballpark of the requested 1 ms runtime per 188 sample size.
Note also that each output value changes only slowly or rarely as new input samples arrive, so you could speed this up considerably by using a hop larger than 1.
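The "valid" medians, and a hopped version of them, can also be computed without a Python loop. A minimal sketch, assuming NumPy 1.20 or newer for sliding_window_view, with random data standing in for the sensor signals; hop is an illustrative parameter name:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
sig = rng.random((80, 188)).astype('f')   # 80 sensors x 188 samples
window_length = 181

# Read-only view holding every full window along the time axis;
# the window contents sit on the last axis.
windows = sliding_window_view(sig, window_length, axis=1)

# Median over each window: one column per valid window position.
valid_medians = np.median(windows, axis=-1)

# Keep only every hop-th window position.
hop = 4
hopped_medians = np.median(windows[:, ::hop], axis=-1)

print(windows.shape, valid_medians.shape, hopped_medians.shape)
```

The view itself costs no extra memory, but np.median copies the windows it works on, so for much longer signals a hop or chunked processing keeps memory use bounded.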
I'm wondering why you're using a median filter of 181 points for a data length of 188? The filter is so long that you're essentially just throwing away all the data and replacing it with the global median of the sensor's output. Typical median filter lengths would be a few samples, depending on what kind of transients you want to filter out.
The filter length also explains why it's so slow. On my machine, your median_filter example takes 46 ms. Running with a more normal filter size of 3 samples takes 0.7 ms.
I was having problems with the accuracy of floats in Python. I need high accuracy because I want to use explicitly written spherical Bessel functions J_n (x), which deviate (especially for n>5) from their theoretical values at low x if NumPy floats are used (15 precise digits).
I have tried many options, especially from mpmath and sympy, in order to keep more precise numbers. I had problems combining the accuracy of mpmath inside the functions with NumPy arrays, until I learned about the function numpy.vectorize. Finally I arrived at this solution to my initial problem:
import time
%matplotlib qt
import scipy
import numpy as np
from scipy import special
import matplotlib.pyplot as plt
from sympy import *
from mpmath import *
mp.dps=100
#explicit inaccurate
def bessel6_expi(z):
    return -((z**6-210*z**4+4725*z**2-10395)*np.sin(z)+(21*z**5-1260*z**3+10395*z)*np.cos(z))/z**7
#explicit inaccurate 1, computation time increases, a bit less inaccuracy
def bessel6_exp1(z):
    def bv(z):
        return -((z**6-210*z**4+4725*z**2-10395)*mp.sin(z)+(21*z**5-1260*z**3+10395*z)*mp.cos(z))/z**7
    bvec=np.vectorize(bv)
    return bvec(z)
#explicit accurate 2, computation time increases markedly, accurate
def bessel6_exp2(z):
    def bv(z):
        return -((mpf(z)**mpf(6)-mpf(210)*mpf(z)**mpf(4)+mpf(4725)*mpf(z)**mpf(2)-mpf(10395))*mp.sin(mpf(z))+(mpf(21)*mpf(z)**mpf(5)-mpf(1260)*mpf(z)**mpf(3)+mpf(10395)*mpf(z))*mp.cos(mpf(z)))/mpf(z)**mpf(7)
    bvec=np.vectorize(bv)
    return bvec(z)
#explicit accurate 3, computation time increases markedly, accurate
def bessel6_exp3(z):
    def bv(z):
        return -((mpf(z)**6-210*mpf(z)**4+4725*mpf(z)**2-10395)*mp.sin(mpf(z))+(21*mpf(z)**5-1260*mpf(z)**3+10395*mpf(z))*mp.cos(mpf(z)))/mpf(z)**7
    bvec=np.vectorize(bv)
    return bvec(z)
#implemented in scipy, accurate, fast
def bessel6_imp(z):
    def bv(z):
        return scipy.special.sph_jn(6,(z))[0][6]
    bvec=np.vectorize(bv)
    return bvec(z)
a=np.arange(0.0001,17,0.0001)
plt.figure()
start = time.time()
plt.plot(a,bessel6_expi(a),'b',lw=1,label='expi')
end = time.time()
print(end - start)
start = time.time()
plt.plot(a,bessel6_exp1(a),'m',lw=1,label='exp1')
end = time.time()
print(end - start)
start = time.time()
plt.plot(a,bessel6_exp2(a),'c',lw=3,label='exp2')
end = time.time()
print(end - start)
start = time.time()
plt.plot(a,bessel6_exp3(a),'y',lw=5,linestyle='--',label='exp3')
end = time.time()
print(end - start)
start = time.time()
plt.plot(a,bessel6_imp(a),'r',lw=1,label='imp')
end = time.time()
print(end - start)
plt.ylim(-0.5/10**7,2.5/10**7)
plt.xlim(0,2.0)
plt.legend()
plt.show()
The problem I have now is that just for plotting the explicit, accurate ones, it takes quite a long time (about 31 times slower than the scipy function for mp.dps=100). Smaller dps do not make these processes much faster, even with mp.dps=15, they are still 26 times slower. Is there a way to make this faster?
Note that the loss of accuracy you observe near zero comes from the fact that you are subtracting two nearly equal terms both of the form 10395 z^-6 + O(z^-4). As the true value is 1/135135 z^6 + O(z^8) you will lose a factor of ~1.4 x 10^9 z^-12 in accuracy. So if you want to calculate the value at z=0.01 to, say, 7 decimals you need to start with >40 decimals precision.
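The cancellation is easy to see in double precision. A minimal sketch comparing the explicit formula with scipy.special.spherical_jn (assuming SciPy 0.18 or newer, where that function is available) at a few arguments; bessel6_explicit is an illustrative name for the float64 formula from the question:

```python
import numpy as np
from scipy.special import spherical_jn

def bessel6_explicit(z):
    """Explicit closed form for j_6(z), evaluated in float64."""
    return -((z**6 - 210*z**4 + 4725*z**2 - 10395) * np.sin(z)
             + (21*z**5 - 1260*z**3 + 10395*z) * np.cos(z)) / z**7

z = np.array([0.01, 0.1, 1.0, 5.0])
explicit = bessel6_explicit(z)
reference = spherical_jn(6, z)

# Relative error of the explicit formula against the library value.
rel_err = np.abs(explicit - reference) / np.abs(reference)

for zi, err in zip(z, rel_err):
    print(f"z = {zi:5.2f}   relative error = {err:.3e}")
```

The relative error should shrink as z grows, since the two large terms stop being nearly equal in magnitude.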
The solution is of course to avoid this cancellation. A straight-forward way of achieving this is to compute the power series around 0.
You could use sympy to obtain the power series:
>>> z = sympy.Symbol('z')
>>> f = -((z**6-210*z**4+4725*z**2-10395)*sympy.sin(z)+(21*z**5-1260*z**3+10395*z)*sympy.cos(z))/z**7
>>> f.nseries(n=20)
z**6/135135 - z**8/4054050 + z**10/275675400 - z**12/31426995600 + z**14/5279735260800 - z**16/1214339109984000 + z**18/364301732995200000 + O(z**20)
For small z a small number of terms appear to be enough for good accuracy.
>>> ply = f.nseries(n=20).removeO().as_poly()
>>> float(ply.subs(z, 0.1))
7.397541093587708e-12
You can export the coefficients for use with numpy.
>>> monoms = np.array(ply.monoms(), dtype=int).ravel()
>>> coeffs = np.array(ply.coeffs(), dtype=float)
>>>
>>> (np.linspace(-0.1, 0.1, 21)[:, None]**monoms * coeffs).sum(axis=1)
array([7.39754109e-12, 3.93160564e-12, 1.93945374e-12, 8.70461282e-13,
3.45213317e-13, 1.15615481e-13, 3.03088138e-14, 5.39444356e-15,
4.73594159e-16, 7.39998273e-18, 0.00000000e+00, 7.39998273e-18,
4.73594159e-16, 5.39444356e-15, 3.03088138e-14, 1.15615481e-13,
3.45213317e-13, 8.70461282e-13, 1.93945374e-12, 3.93160564e-12,
7.39754109e-12])
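Putting this together, a small NumPy function can evaluate the truncated series directly in double precision. The sketch below builds each term from the ratio between consecutive coefficients of the series shown above, rather than from the exported sympy arrays. The function name and the n_terms default are illustrative choices, and the truncated series is only meant for small arguments (roughly |z| <= 1 here):

```python
import numpy as np
from scipy.special import spherical_jn

def bessel6_series(z, n_terms=8):
    """Truncated power series of j_6(z) around 0, in float64.

    Term k is z**(6 + 2k) times the k-th coefficient; consecutive
    terms differ by the factor -z**2 / (2 * k * (13 + 2k)).
    """
    z = np.asarray(z, dtype=float)
    term = z**6 / 135135.0          # leading term, 135135 = 13!!
    total = term.copy()
    for k in range(1, n_terms):
        term = term * (-0.5 * z * z) / (k * (13 + 2 * k))
        total += term
    return total

z = np.linspace(0.01, 1.0, 50)
series = bessel6_series(z)
reference = spherical_jn(6, z)
print(np.max(np.abs(series - reference) / np.abs(reference)))
print(float(bessel6_series(0.1)))
```

For larger z the explicit formula no longer suffers from severe cancellation, so one could switch between the two at some threshold; where to put that threshold is a choice to validate against a reference such as spherical_jn.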