I have a problem in the analysis of the acoustic indices. I tried calculating different indices both with Scikit-maad and Soundecology, but the results at the end are hardly comparable, here's an example of the results for ADI.
Results Comparisons
We checked that all the parameters set for the analysis were the same and we came to the conclusion that the problem is in how Soundecology and Maad calculate the spectrogram.
Maad uses some parameters that we do not fully understand and we cannot find in R packages that should do the same thing (like ReadWave of TuneR or Spectro of seewave).
https://cran.r-project.org/web/packages/tuneR/tuneR.pdf
https://www.rdocumentation.org/packages/seewave/versions/1.0/topics/spectro
Python-code example
if __name__ == '__main__':
fullfilename="wav_files/AM08_Grotte-New_2019-10-04_0FE081F80FE081F0_2019-07-26_000000_UTC.wav"
wave, fs = sound.load(filename=fullfilename, channel='left', detrend=False, verbose=True)
Sxx_power,tn,fn,ext = sound.spectrogram (wave, fs, window='hanning',
nperseg = 1024, noverlap= None,
verbose = False, display = False,
savefig = None)
adi = features.acoustic_diversity_index(Sxx_power, fn, fmin=0, fmax=10000, bin_step=1000, dB_threshold=-50, index='shannon')
print(adi)
R code example
a <- readWave("wav_files/AM08_Grotte-New_2019-10-04_0FE081F80FE081F0_2019-07-26_000000_UTC.wav")
adi <- acoustic_diversity(a, max_freq = 10000, db_threshold = -50,
freq_step = 1000, shannon = TRUE)
For example, we do not find a matching parameter for 'nperseg' in any R package that calculates the spectrogram.
I would be very grateful if you could help us with this.
Thank you very much,
Valeria
Thank you very much to point out the issue.
I found the reason that explains the discrepancy between the result from soundecology library in R and scikit-maad package in Python.
In soundecology, the DC value is never subtracted from the original signal (i.e. no detrend of the signal) meaning that the mean value of the signal is not null. Depending on the electronic of the audio recorder, the signal is a bit above or below the zero level.
In scikit-maad, the DC value is removed by default when loading the signal (but you can switch off the option : detrend=False) and also when computed the spectrogram ( I've just added a switch off option, detrend=False).
The problem is now in the function all_spectral_alpha_indices which computes all indices at the same time, where the input argument is a detrend spectrogram (i.e the dc value is already subtracted from the signal).
There are two options :
I've just added optional arguments ADI_dB_threshold and AEI_dB_threshold in the function all_spectral_alpha_indices in order to change the threshold used for the calculation. A "good" value seems to be -30. With -30, the ADI and AEI from maad is well correlated with soundecology but the values are a bit different, especially for low values (when there is not much activities). See the colab notebook below.
Compute ADI and AEI as soundecology. An example is provided for both functions.
Here is the example for ADI :
s, fs = maad.sound.load('../data/cold_forest_daylight.wav', detrend=False)
Sxx, tn, fn, ext = maad.sound.spectrogram (s, fs, nperseg=int(fs/10), noverlap=0, mode='amplitude', detrend=False)
ADI = maad.features.acoustic_diversity_index(Sxx,fn,fmax=10000)
print('ADI : %2.2f ' %ADI)
The main problem with this solution is that you need to load the signal and compute the spectrogram without detrend specifically for ADI (and AEI) which could increase the time to process a large dataset.
Here, I wrote an script to explain the issue and give you both solutions to fix the issue and compare the results.
https://colab.research.google.com/drive/1xoErUfqN1_AX0ibujOlnt8bKO23DUXFR#scrollTo=Poydb8kSF9VD
I hope this helps.
Sylvain
Related
I'm trying to do the following:
Extract the melody of me asking a question (word "Hey?" recorded to
wav) so I get a melody pattern that I can apply to any other
recorded/synthesized speech (basically how F0 changes in time).
Use polynomial interpolation (Lagrange?) so I get a function that describes the melody (approximately of course).
Apply the function to another recorded voice sample. (eg. word "Hey." so it's transformed to a question "Hey?", or transform the end of a sentence to sound like a question [eg. "Is it ok." => "Is it ok?"]). Voila, that's it.
What I have done? Where am I?
Firstly, I have dived into the math that stands behind the fft and signal processing (basics). I want to do it programatically so I decided to use python.
I performed the fft on the entire "Hey?" voice sample and got data in frequency domain (please don't mind y-axis units, I haven't normalized them)
So far so good. Then I decided to divide my signal into chunks so I get more clear frequency information - peaks and so on - this is a blind shot, me trying to grasp the idea of manipulating the frequency and analyzing the audio data. It gets me nowhere however, not in a direction I want, at least.
Now, if I took those peaks, got an interpolated function from them, and applied the function on another voice sample (a part of a voice sample, that is also ffted of course) and performed inversed fft I wouldn't get what I wanted, right?
I would only change the magnitude so it wouldn't affect the melody itself (I think so).
Then I used spec and pyin methods from librosa to extract the real F0-in-time - the melody of asking question "Hey?". And as we would expect, we can clearly see an increase in frequency value:
And a non-question statement looks like this - let's say it's moreless constant.
The same applies to a longer speech sample:
Now, I assume that I have blocks to build my algorithm/process but I still don't know how to assemble them beacause there are some blanks in my understanding of what's going on under the hood.
I consider that I need to find a way to map the F0-in-time curve from the spectrogram to the "pure" FFT data, get an interpolated function from it and then apply the function on another voice sample.
Is there any elegant (inelegant would be ok too) way to do this? I need to be pointed in a right direction beceause I can feel I'm close but I'm basically stuck.
The code that works behind the above charts is taken just from the librosa docs and other stackoverflow questions, it's just a draft/POC so please don't comment on style, if you could :)
fft in chunks:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
import os
file = os.path.join("dir", "hej_n_nat.wav")
fs, signal = wavfile.read(file)
CHUNK = 1024
afft = np.abs(np.fft.fft(signal[0:CHUNK]))
freqs = np.linspace(0, fs, CHUNK)[0:int(fs / 2)]
spectrogram_chunk = freqs / np.amax(freqs * 1.0)
# Plot spectral analysis
plt.plot(freqs[0:250], afft[0:250])
plt.show()
spectrogram:
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import os
file = os.path.join("/path/to/dir", "hej_n_nat.wav")
y, sr = librosa.load(file, sr=44100)
f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
times = librosa.times_like(f0)
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
fig, ax = plt.subplots()
img = librosa.display.specshow(D, x_axis='time', y_axis='log', ax=ax)
ax.set(title='pYIN fundamental frequency estimation')
fig.colorbar(img, ax=ax, format="%+2.f dB")
ax.plot(times, f0, label='f0', color='cyan', linewidth=2)
ax.legend(loc='upper right')
plt.show()
Hints, questions and comments much appreciated.
The problem was that I didn't know how to modify the fundamental frequency (F0). By modifying it I mean modify F0 and its harmonics, as well.
The spectrograms in question show frequencies at certain points in time with power (dB) of certain frequency point.
Since I know which time bin holds which frequency from the melody (green line below) ...
....I need to compute a function that represents that green line so I can apply it to other speech samples.
So I need to use some interpolation method which takes as parameters the sample F0 function points.
One need to remember that degree of the polynomial should equal to the number of points. The example doesn't have that unfortunately, but the effect is somehow ok as for the prototype.
def _get_bin_nr(val, bins):
the_bin_no = np.nan
for b in range(0, bins.size - 1):
if bins[b] <= val < bins[b + 1]:
the_bin_no = b
elif val > bins[bins.size - 1]:
the_bin_no = bins.size - 1
return the_bin_no
def calculate_pattern_poly_coeff(file_name):
y_source, sr_source = librosa.load(os.path.join(ROOT_DIR, file_name), sr=sr)
f0_source, voiced_flag, voiced_probs = librosa.pyin(y_source, fmin=librosa.note_to_hz('C2'),
fmax=librosa.note_to_hz('C7'), pad_mode='constant',
center=True, frame_length=4096, hop_length=512, sr=sr_source)
all_freq_bins = librosa.core.fft_frequencies(sr=sr, n_fft=n_fft)
f0_freq_bins = list(filter(lambda x: np.isfinite(x), map(lambda val: _get_bin_nr(val, all_freq_bins), f0_source)))
return np.polynomial.polynomial.polyfit(np.arange(0, len(f0_freq_bins), 1), f0_freq_bins, 3)
def calculate_pattern_poly_func(coefficients):
return np.poly1d(coefficients)
Method calculate_pattern_poly_coeff calculates polynomial coefficients.
Using pythons poly1d lib I can compute function which can modify the speech. How to do that?
I just need to move up or down all values vertically at certain point in time.
for instance I want to move all frequencies at time bin 0,75 seconds up 3 times -> it means that frequency will be increased and the melody at that point will sound higher.
Code:
def transform(sentence_audio_sample, mode=None, show_spectrograms=False, frames_from_end_to_transform=12):
# cutting out silence
y_trimmed, idx = librosa.effects.trim(sentence_audio_sample, top_db=60, frame_length=256, hop_length=64)
stft_original = librosa.stft(y_trimmed, hop_length=hop_length, pad_mode='constant', center=True)
stft_original_roll = stft_original.copy()
rolled = stft_original_roll.copy()
source_frames_count = np.shape(stft_original_roll)[1]
sentence_ending_first_frame = source_frames_count - frames_from_end_to_transform
sentence_len = np.shape(stft_original_roll)[1]
for i in range(sentence_ending_first_frame + 1, sentence_len):
if mode == 'question':
by = int(_question_pattern(i) / 500)
elif mode == 'exclamation':
by = int(_exclamation_pattern(i) / 500)
else:
by = 0
rolled = _roll_column(rolled, i, by)
transformed_data = librosa.istft(rolled, hop_length=hop_length, center=True)
def _roll_column(two_d_array, column, shift):
two_d_array[:, column] = np.roll(two_d_array[:, column], shift)
return two_d_array
In this case I am simply rolling up or down frequencies referencing certain time bin.
This needs to be polished as it doesn't take into consideration an actual state of the transformed sample. It just rolls it up/down according to the factor calculated using the polynomial function computer earlier.
You can check full code of my project at github, "audio" package contains pattern calculator and audio transform algorithm described above.
Feel free to ask if something's unclear :)
I recently do my homework about MFCC, and I can't figure out some differences between using these libraries.
The 3 libraries I use are:
python_speech_features
SpeechPy
LibROSA
samplerate = 16000
NFFT = 512
NCEPT = 13
1st Part: Mel filter bank
temp1_fb = pyspeech.get_filterbanks(nfilt=NFILT, nfft=NFFT, samplerate=sample1)
# speechpy do not divide 2 and add 1 when initializing
temp2_fb = speechpy.feature.filterbanks(num_filter=NFILT, fftpoints=NFFT, sampling_freq=sample1)
temp3_fb = librosa.filters.mel(sr=sample1, n_fft=NFFT, n_mels=NFILT)
# fix librosa normalized version
temp3_fb /= np.max(temp3_fb, axis=-1)[:, None]
pic1
Only the shape in speechpy will get (, 512), other all (, 257). The figure of librosa is a bit of deformation.
2nd Part: MFCC
# pyspeech without lifter. Using hamming
temp1_mfcc = pyspeech.mfcc(speaker1, samplerate=sample1, winlen=0.025, winstep=0.01, numcep=NCEPT, nfilt=NFILT, nfft=NFFT,
preemph=0.97, ceplifter=0, winfunc=np.hamming, appendEnergy=False)
# speechpy need pre-emphasized. Using rectangular window fixed. Mel filter bank is not the same
temp2_mfcc = speechpy.feature.mfcc(emphasized_speaker1, sampling_frequency=sample1, frame_length=0.025, frame_stride=0.01,
num_cepstral=NCEPT, num_filters=NFILT, fft_length=NFFT)
# librosa need pre-emphasized. Using log energy. Its STFT using hanning, but its framing is not the same
temp3_energy = librosa.feature.melspectrogram(emphasized_speaker1, sr=sample1, S=temp3_pow.T, n_fft=NFFT,
hop_length=frame_step, n_mels=NFILT).T
temp3_energy = np.log(temp3_energy)
temp3_mfcc = librosa.feature.mfcc(emphasized_speaker1, sr=sample1, S=temp3_energy.T, n_mfcc=13, dct_type=2, n_fft=NFFT,
hop_length=frame_step).T
pic2
I've tried my best to set the condition faire. The figure of speechpy gets darker.
3rd Part: Delta coefficient
temp1 = pyspeech.delta(mfcc_speaker1, 2)
temp2 = speechpy.processing.derivative_extraction(mfcc_speaker1.T, 1).T
# librosa along the frame axis
temp3 = librosa.feature.delta(mfcc_speaker1, width=5, axis=0, order=1)
pic3
I can't directly set mfcc as argument in speechpy, or it will be very strange. And what these parameters originally act is not the same as my expected.
I'm wondering what factors to make these differences. Is it just somethong I mentioned above? Or I made some mistakes? Hope for details, thanks.
There are many MFCC implementations and they often differ bit by bit - window function shape, mel filterbank calculation, dct could be different too. It is hard to find a fully compatible library. In the long term it should not matter for you as long as you use the same implementation everywhere. The differences do not affect results.
There are several threads asking for a way to simulate time-inhomogenous poisson processes in python. The NeuroTools module offer a simple way to do so via the inh_poisson_generator () function. The help of this function is introduced at the bottom of this thread. The function was originally designed to simulate spike trains, and uses the thinning method.
I would like to simulate a spike train during 2000ms. The spike rate (in Hertz) changes every millisecond, and is comprised between 20 spikes/second and 160 spikes/second. I've tried to simulate this using the following code:
import NeuroTools
import numpy as np
from NeuroTools import stgen
import matplotlib.pyplot as plt
import random
st_gen = stgen.StGen()
time = np.arange(0, 2000)
t_rate = []
for i in range (2000):
t_rate.append(random.randrange(20, 161, 1))
t_rate = np.array(t_rate)
Psim = st_gen.inh_poisson_generator(rate = t_rate, t = time, t_stop = 2000, array = True)
However, the code returns very few timestamps (e.g., array([ 397.55345905, 1208.79804513, 1478.03525045, 1982.63643262]), which doesn't make sense to me. I would appreciate any help on this.
inh_poisson_generator(self, rate, t, t_stop, array=False) method of NeuroTools.stgen.StGen instance
Returns a SpikeTrain whose spikes are a realization of an inhomogeneous
poisson process (dynamic rate). The implementation uses the thinning
method, as presented in the references.
Inputs:
rate - an array of the rates (Hz) where rate[i] is active on interval
[t[i],t[i+1]]
t - an array specifying the time bins (in milliseconds) at which to
specify the rate
t_stop - length of time to simulate process (in ms)
array - if True, a numpy array of sorted spikes is returned,
rather than a SpikeList object.
Note:
t_start=t[0]
References:
Eilif Muller, Lars Buesing, Johannes Schemmel, and Karlheinz Meier
Spike-Frequency Adapting Neural Ensembles: Beyond Mean Adaptation and Renewal Theories
Neural Comput. 2007 19: 2958-3010.
Devroye, L. (1986). Non-uniform random variate generation. New York: Springer-Verlag.
Examples:
>> time = arange(0,1000)
>> stgen.inh_poisson_generator(time,sin(time), 1000)enter code here
I don't really have an answer for you but because this post helped me to get started with NeuroTools, I thought I'd share my small example which is working fine.
For the inh_poisson_generator() the rate input is in unit Hz and all times are in ms. I use an average rate of 1.6 spikes/ms, so I expect to receive ~4000 events. The results confirm that just fine!
I guess it might be an issue that you are using a non-continuous rate. However I barely know anything about the algorithm implemented for this function..
I hope my example can help you somehow!
import NeuroTools
from NeuroTools import stgen
v0=1.6 #spikes/ms
Amp=1 # amplitude in spikes/ms
w=4/1000 # periodic frequency in spikes/ms
st_gen = stgen.StGen()
tstop=2500.0
intervals=np.arange(0,tstop,0.05)
rate=np.array([])
for tt in intervals:
v_next=v0+Amp*math.sin(2*math.pi*w*tt)
if (v_next>0.0):
rate=np.append(rate,v_next*1000)
else: rate=np.append(rate,0.0)
PSim=st_gen.inh_poisson_generator(rate=rate,t = intervals, t_stop = 2500.0, array = True) # important to have rate in Hz and all other times in ms
print len(PSim)
print np.mean(rate)/1000*tstop
I'm using matplotlib's magnitude_spectrum to compare the tonal characteristics of guitar strings. Magnitude_spectrum shows the y axis as having units of "Magnitude (energy)". I use two different 'processes' to compare the FFT. Process 2 (for lack of a better description) is much easier to interpret- code & graphs below
My questions are:
In terms of units, what does "Magnitude (energy)" mean and how does it relate to dB?
Using #Process 2 (see code & graphs below), what type of units am I looking at, dB?
If #Process 2 is not dB, then what is the best way to scale it to dB?
My code below (simplified) shows an example of what I'm talking about/looking at.
import numpy as np
from scipy.io.wavfile import read
from pylab import plot
from pylab import plot, psd, magnitude_spectrum
import matplotlib.pyplot as plt
#Hello Signal!!!
(fs, x) = read('C:\Desktop\Spectral Work\EB_AB_1_2.wav')
#Remove silence out of beginning of signal with threshold of 1000
def indices(a, func):
#This allows to use the lambda function for equivalent of find() in matlab
return [i for (i, val) in enumerate(a) if func(val)]
#Make the signal smaller so it uses less resources
x_tiny = x[0:100000]
#threshold is 1000, 0 is calling the first index greater than 1000
thresh = indices(x_tiny, lambda y: y > 1000)[1]
# backs signal up 20 bins, so to not ignore the initial pluck sound...
thresh_start = thresh-20
#starts at threshstart ends at end of signal (-1 is just a referencing thing)
analysis_signal = x[thresh_start-1:]
#Split signal so it is 1 second long
one_sec = 1*fs
onesec = x[thresh_start-1:one_sec+thresh_start-1]
#process 1
(spectrum, freqs, _) = magnitude_spectrum(onesec, Fs=fs)
#process 2
spectrum1 = spectrum/len(spectrum)
I don't know how to bulk process on multiple .wav files so I run this code separately on a whole bunch of different .wav files and i put them into excel to compare. But for the sake of not looking at ugly graphs, I graphed it in Python. Here's what #process1 and #process2 look like when graphed:
Process 1
Process 2
Magnetude is just the absolute value of the frequency spectrum. As you have labelled in Process 1 "Energy" is a good way to think about it.
Both Process 1 and Process 2 are in the same units. The only difference is that the values in Process 2 has been divided by the total length of the array (a scalar, hence no change of units). Normally this happens as part of the FFT, but sometimes it does not (e.g. numpy.FFT doesn't include the divide by length).
The easiest way to scale it to dB is:
(spectrum, freqs, _) = magnitude_spectrum(onesec, Fs=fs, scale='dB')
If you wanted to do this yourself then you would need to do something like:
spectrum2 = 20*numpy.log10(spectrum)
**It is worth noting that I'm not sure if you should be applying the /len(spectrum) or not. I would suggest using the scale='dB' !!
To convert to dB, take the log of any non-zero spectrum magnitudes, and scale (scale to match a calibrated mic and sound source if available, or use an arbitrarily scale to make the levels look familiar otherwise), before plotting.
For zero magnitude values, perhaps just replace or clamp the log with whatever you want to be on the bottom of your log plot (certainly not negative-infinity).
In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated python function.
Is it possible to do that?
Following is a pyMC3 code which shows what I am trying to do in a simplified case.
import numpy as np
import pymc as pm
#Predefine values on two parameter Grid (x,w) for a set of i values (1,2,3)
idata = np.array([1,2,3])
size= 20
gridlength = size*size
Grid = np.empty((gridlength,2+len(idata)))
for x in range(size):
for w in range(size):
# A silly version of my real model evaluated on grid.
Grid[x*size+w,:]= np.array([x,w]+[(x**i + w**i) for i in idata])
# A function to find the nearest value in Grid and return its product with third variable z
def FindFromGrid(x,w,z):
return Grid[int(x)*size+int(w),2:] * z
#Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size+12,2:]*3.6 + yerror # ie. True x= 16, w= 12 and z= 3.6
with pm.Model() as model:
#Priors
x = pm.Uniform('x',lower=0,upper= size)
w = pm.Uniform('w',lower=0,upper =size)
z = pm.Uniform('z',lower=-5,upper =10)
#Expected value
y_hat = pm.Deterministic('y_hat',FindFromGrid(x,w,z))
#Data likelihood
ysigmas = np.ones(len(idata))*9.0
y_like = pm.Normal('y_like',mu= y_hat, sd=ysigmas, observed=ydata)
# Inference...
start = pm.find_MAP() # Find starting value by optimization
step = pm.NUTS(state=start) # Instantiate MCMC sampling algorithm
trace = pm.sample(1000, step, start=start, progressbar=False) # draw 1000 posterior samples using NUTS sampling
print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z':3.6})
fig.show()
When I run this code, I get error at the y_hat stage, because the int() function inside the FindFromGrid(x,w,z) function needs integer not FreeRV.
Finding y_hat from a pre calculated grid is important because my real model for y_hat does not have an analytical form to express.
I have earlier tried to use OpenBUGS, but I found out here it is not possible to do this in OpenBUGS. Is it possible in PyMC ?
Update
Based on an example in pyMC github page, I found I need to add the following decorator to my FindFromGrid(x,w,z) function.
#pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar],otypes=[t.dvector])
This seems to solve the above mentioned issue. But I cannot use NUTS sampler anymore since it needs gradient.
Metropolis seems to be not converging.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: Are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason this could not converge is that Metropolis() by default samples in the joint space while often Gibbs within Metropolis is more effective (and this was the default in pymc2). Having said that, I just merged this: https://github.com/pymc-devs/pymc/pull/587 which changes the default behavior of the Metropolis and Slice sampler to be non-blocked by default (so within Gibbs). Other samplers like NUTS that are primarily designed to sample the joint space still default to blocked. You can always explicitly set this with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.