Additional audio feature extraction tips

Additional audio feature extraction tips - python

I'm trying to create a speech emotion recognition model using Keras, I've done all of the code and have trained the model. It sits around 50% validation and is overfitting.
When i use model.predict() with unseen data it seems to have a hard time distinguishing between 'neutral', 'calm', 'happy' and 'suprised', but seems to be able to predict 'angry' correctly in the majority of cases - i assume because there's a clear difference in pitch or something.
I'm thinking it could possibly be that i'm not getting enough features from these emotions, which would help the model distinguish between them.
Currently i am using Librosa and coverting audio to MFCC's. Is there any other way, even using Librosa, that i can do to extract features for the model to help it better distinguish between the 'neutral', 'calm', 'happy', 'surprised' etc?
some feature extraction code:
wav_clip, sample_rate = librosa.load(file_path, duration=3, mono=True, sr=None)
mfcc = librosa.feature.mfcc(wav_clip, sample_rate)
Also, this is with 1400 samples.

A few observations for starter:
Likely you have far too few samples to efficiently use neural networks. Use a simple algorithm for starter to understand well how your model is making prediction.
Make sure you have enough (30% or more) samples from different speakers put aside for final testing. You can use this test set only once, so think about building a pipeline to generate train, validation and test sets. Make sure you don't put the same speaker into more than 1 set.
First coefficient from librosa gives you AFAIK an offset. I'd recommend plotting how your features correlate with labels and how far they overlap, some can be easily confused I guess. Find if there are any feature that would differentiate your classes. Don't do this by running your model, do visual inspection first.
To the actual features! You're right to assume pitch should play a vital role. I'd recommend checking out aubio - it has Python bindings.
Yaafe also offers excellent selection of features.
You might easily end up with 150+ features. You might want to reduce dimensionality of the problem, perhaps even compress it to 2d and see if you can somehow separate the classes. Here is my own example with Dash.
Last but not least, some basic code to extract frequencies from the audio. In this case I am also trying to find three peak frequencies.
import numpy as np
def spectral_statistics(y: np.ndarray, fs: int, lowcut: int = 0) -> dict:
"""
Compute selected statistical properties of spectrum
:param y: 1-d signsl
:param fs: sampling frequency [Hz]
:param lowcut: lowest frequency [Hz]
:return: spectral features (dict)
"""
spec = np.abs(np.fft.rfft(y))
freq = np.fft.rfftfreq(len(y), d=1 / fs)
idx = int(lowcut / fs * len(freq) * 2)
spec = np.abs(spec[idx:])
freq = freq[idx:]
amp = spec / spec.sum()
mean = (freq * amp).sum()
sd = np.sqrt(np.sum(amp * ((freq - mean) ** 2)))
amp_cumsum = np.cumsum(amp)
median = freq[len(amp_cumsum[amp_cumsum <= 0.5]) + 1]
mode = freq[amp.argmax()]
Q25 = freq[len(amp_cumsum[amp_cumsum <= 0.25]) + 1]
Q75 = freq[len(amp_cumsum[amp_cumsum <= 0.75]) + 1]
IQR = Q75 - Q25
z = amp - amp.mean()
w = amp.std()
skew = ((z ** 3).sum() / (len(spec) - 1)) / w ** 3
kurt = ((z ** 4).sum() / (len(spec) - 1)) / w ** 4
top_peaks_ordered_by_power = {'stat_freq_peak_by_power_1': 0, 'stat_freq_peak_by_power_2': 0, 'stat_freq_peak_by_power_3': 0}
top_peaks_ordered_by_order = {'stat_freq_peak_by_order_1': 0, 'stat_freq_peak_by_order_2': 0, 'stat_freq_peak_by_order_3': 0}
amp_smooth = signal.medfilt(amp, kernel_size=15)
peaks, height_d = signal.find_peaks(amp_smooth, distance=100, height=0.002)
if peaks.size != 0:
peak_f = freq[peaks]
for peak, peak_name in zip(peak_f, top_peaks_ordered_by_order.keys()):
top_peaks_ordered_by_order[peak_name] = peak
idx_three_top_peaks = height_d['peak_heights'].argsort()[-3:][::-1]
top_3_freq = peak_f[idx_three_top_peaks]
for peak, peak_name in zip(top_3_freq, top_peaks_ordered_by_power.keys()):
top_peaks_ordered_by_power[peak_name] = peak
specprops = {
'stat_mean': mean,
'stat_sd': sd,
'stat_median': median,
'stat_mode': mode,
'stat_Q25': Q25,
'stat_Q75': Q75,
'stat_IQR': IQR,
'stat_skew': skew,
'stat_kurt': kurt
}
specprops.update(top_peaks_ordered_by_power)
specprops.update(top_peaks_ordered_by_order)
return specprops

Related

Cleaning generated Sinewave artifacts for smooth frequency transitions

I'm a python beginner and as a learning project I'm doing a SSTV encoder using Wraase SC2-120 methods.
SSTV for those who don't know is a technique sending images through radio as sound and to be decoded in the receiving end back to an image. Wraase SC2-120 is one of many types of encoding but it's one of the more simpler ones that support color.
I've been able to create a system that takes an image and converts it to an array. Then take that array and create the needed values for the luminance and chrominance needed for the encoder.
I'm using then this block to create a value between 1500hz - 2300hz for the method.
def ChrominanceAsHertz(value=0.0):
value = 800 * value
value -= value % 128 # Test. Results were promising but too much noise
value += 1500
return int(value)
You can ignore the modulus operation. It is just my way of playing around with the data for "fun" and experiments.
I then clean the audio to avoid having too many of the same values in the same array and add their duration together to achieve a cleaner sound
cleanTone = []
cleanDuration = []
for i in range(len(hertzData)-1):
# If the next tone is not the same
# Add it to the cleantone array
# with it's initial duration
if hertzData[i] != hertzData[i+1]:
cleanTone.append(hertzData[i])
cleanDuration.append(durationData[i])
# else add the duration of the current hertz to the clean duration array
else:
# the current duration is the last inserted duration
currentDur = cleanDuration[len(cleanDuration)-1]
# Add the new duration to the current duration
currentDur += durationData[i]
cleanDuration[len(cleanDuration)-1] = currentDur
My array handling can use some work but it's not why I'm here for now.
The result is a array where no consecutive values are the same and the duration of that tone is still correct.
I then create a sinewave array using this block
audio = []
for i in range(len(cleanTone)):
sineAudio = AudioGen.SineWave(cleanTone[i], cleanDuration[i])
for sine in sineAudio:
audio.append(sine)
The sinewave function is
def SineWave( freq=440, durationMS = 500, sample_rate = 44100.0 ):
num_samples = durationMS * (sample_rate / 1000)
audio = []
for i in range(int(num_samples)):
audio.insert(i, np.sin(2 * np.pi * freq * (i / sample_rate)))
return audio
It works as intended. It creates a sinewave of the frequency I want and for the duration I want.
The problem is that when I create the .wav file then with wave the sinewaves created are not smoothly transitioning.
Screenshot of a closeup of what I mean.
Sinusoidal wave artifacts
The audio file has these immense screeches and cracks because of these artifacts that the above method produces, seeing as how it takes a single frequency and duration with no regard of where the last tone ended and starts a new.
What I've tried to do to remedy these is to refactor that SineWave method to take in a whole array and create the sinewaves consecutively right after one another in hopes of achieving a clean sound but it still did the same thing.
I also tried "smoothing" the generated audio array then with a simple filtering operation from this post.
0.7 * audio[1:-1] + 0.15 * ( audio[2:] + audio[:-2] )
but the results again were not satisfying and the artifacts were still present.
I've also started to look into Fourier Transforms, mainly FFT (fast fourier transform) but I'm not that familiar with them yet to know exactly what it is that I'm trying to do and code.
For SSTV to work the changes in frequency have to sometimes be very fast. 0.3ms fast to be exact, so I'm kinda lost on how to achieve this without loosing too much data in the process.
TL;DR
My sinewave function is producing artifacts inbetween tone changes that cause scratches and unwanted pops. How to not do that?

What you need to transfer from one wave-snippet to the next is the phase. You have to start the next wave with the phase you ended the previous phase.
def SineWave( freq=440, durationMS = 500, phase = 0, sample_rate = 44100.0):
num_samples = int(durationMS * (sample_rate / 1000))
audio = []
for i in range(num_samples):
audio.insert(i, np.sin(2 * np.pi * freq * (i / sample_rate) + phase))
phase = (phase + 2 * np.pi * freq * (num_samples / sample_rate)) % (2 * np.pi)
return audio, phase
In your main loop pass through the phase from one wave fragment to the next:
audio = []
phase = 0
for i in range(len(cleanTone)):
sineAudio, phase = AudioGen.SineWave(cleanTone[i], cleanDuration[i], phase)
for sine in sineAudio:
audio.append(sine)

Change the melody of human speech using FFT and polynomial interpolation

I'm trying to do the following:
Extract the melody of me asking a question (word "Hey?" recorded to
wav) so I get a melody pattern that I can apply to any other
recorded/synthesized speech (basically how F0 changes in time).
Use polynomial interpolation (Lagrange?) so I get a function that describes the melody (approximately of course).
Apply the function to another recorded voice sample. (eg. word "Hey." so it's transformed to a question "Hey?", or transform the end of a sentence to sound like a question [eg. "Is it ok." => "Is it ok?"]). Voila, that's it.
What I have done? Where am I?
Firstly, I have dived into the math that stands behind the fft and signal processing (basics). I want to do it programatically so I decided to use python.
I performed the fft on the entire "Hey?" voice sample and got data in frequency domain (please don't mind y-axis units, I haven't normalized them)
So far so good. Then I decided to divide my signal into chunks so I get more clear frequency information - peaks and so on - this is a blind shot, me trying to grasp the idea of manipulating the frequency and analyzing the audio data. It gets me nowhere however, not in a direction I want, at least.
Now, if I took those peaks, got an interpolated function from them, and applied the function on another voice sample (a part of a voice sample, that is also ffted of course) and performed inversed fft I wouldn't get what I wanted, right?
I would only change the magnitude so it wouldn't affect the melody itself (I think so).
Then I used spec and pyin methods from librosa to extract the real F0-in-time - the melody of asking question "Hey?". And as we would expect, we can clearly see an increase in frequency value:
And a non-question statement looks like this - let's say it's moreless constant.
The same applies to a longer speech sample:
Now, I assume that I have blocks to build my algorithm/process but I still don't know how to assemble them beacause there are some blanks in my understanding of what's going on under the hood.
I consider that I need to find a way to map the F0-in-time curve from the spectrogram to the "pure" FFT data, get an interpolated function from it and then apply the function on another voice sample.
Is there any elegant (inelegant would be ok too) way to do this? I need to be pointed in a right direction beceause I can feel I'm close but I'm basically stuck.
The code that works behind the above charts is taken just from the librosa docs and other stackoverflow questions, it's just a draft/POC so please don't comment on style, if you could :)
fft in chunks:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
import os
file = os.path.join("dir", "hej_n_nat.wav")
fs, signal = wavfile.read(file)
CHUNK = 1024
afft = np.abs(np.fft.fft(signal[0:CHUNK]))
freqs = np.linspace(0, fs, CHUNK)[0:int(fs / 2)]
spectrogram_chunk = freqs / np.amax(freqs * 1.0)
# Plot spectral analysis
plt.plot(freqs[0:250], afft[0:250])
plt.show()
spectrogram:
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import os
file = os.path.join("/path/to/dir", "hej_n_nat.wav")
y, sr = librosa.load(file, sr=44100)
f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
times = librosa.times_like(f0)
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
fig, ax = plt.subplots()
img = librosa.display.specshow(D, x_axis='time', y_axis='log', ax=ax)
ax.set(title='pYIN fundamental frequency estimation')
fig.colorbar(img, ax=ax, format="%+2.f dB")
ax.plot(times, f0, label='f0', color='cyan', linewidth=2)
ax.legend(loc='upper right')
plt.show()
Hints, questions and comments much appreciated.

The problem was that I didn't know how to modify the fundamental frequency (F0). By modifying it I mean modify F0 and its harmonics, as well.
The spectrograms in question show frequencies at certain points in time with power (dB) of certain frequency point.
Since I know which time bin holds which frequency from the melody (green line below) ...
....I need to compute a function that represents that green line so I can apply it to other speech samples.
So I need to use some interpolation method which takes as parameters the sample F0 function points.
One need to remember that degree of the polynomial should equal to the number of points. The example doesn't have that unfortunately, but the effect is somehow ok as for the prototype.
def _get_bin_nr(val, bins):
the_bin_no = np.nan
for b in range(0, bins.size - 1):
if bins[b] <= val < bins[b + 1]:
the_bin_no = b
elif val > bins[bins.size - 1]:
the_bin_no = bins.size - 1
return the_bin_no
def calculate_pattern_poly_coeff(file_name):
y_source, sr_source = librosa.load(os.path.join(ROOT_DIR, file_name), sr=sr)
f0_source, voiced_flag, voiced_probs = librosa.pyin(y_source, fmin=librosa.note_to_hz('C2'),
fmax=librosa.note_to_hz('C7'), pad_mode='constant',
center=True, frame_length=4096, hop_length=512, sr=sr_source)
all_freq_bins = librosa.core.fft_frequencies(sr=sr, n_fft=n_fft)
f0_freq_bins = list(filter(lambda x: np.isfinite(x), map(lambda val: _get_bin_nr(val, all_freq_bins), f0_source)))
return np.polynomial.polynomial.polyfit(np.arange(0, len(f0_freq_bins), 1), f0_freq_bins, 3)
def calculate_pattern_poly_func(coefficients):
return np.poly1d(coefficients)
Method calculate_pattern_poly_coeff calculates polynomial coefficients.
Using pythons poly1d lib I can compute function which can modify the speech. How to do that?
I just need to move up or down all values vertically at certain point in time.
for instance I want to move all frequencies at time bin 0,75 seconds up 3 times -> it means that frequency will be increased and the melody at that point will sound higher.
Code:
def transform(sentence_audio_sample, mode=None, show_spectrograms=False, frames_from_end_to_transform=12):
# cutting out silence
y_trimmed, idx = librosa.effects.trim(sentence_audio_sample, top_db=60, frame_length=256, hop_length=64)
stft_original = librosa.stft(y_trimmed, hop_length=hop_length, pad_mode='constant', center=True)
stft_original_roll = stft_original.copy()
rolled = stft_original_roll.copy()
source_frames_count = np.shape(stft_original_roll)[1]
sentence_ending_first_frame = source_frames_count - frames_from_end_to_transform
sentence_len = np.shape(stft_original_roll)[1]
for i in range(sentence_ending_first_frame + 1, sentence_len):
if mode == 'question':
by = int(_question_pattern(i) / 500)
elif mode == 'exclamation':
by = int(_exclamation_pattern(i) / 500)
else:
by = 0
rolled = _roll_column(rolled, i, by)
transformed_data = librosa.istft(rolled, hop_length=hop_length, center=True)
def _roll_column(two_d_array, column, shift):
two_d_array[:, column] = np.roll(two_d_array[:, column], shift)
return two_d_array
In this case I am simply rolling up or down frequencies referencing certain time bin.
This needs to be polished as it doesn't take into consideration an actual state of the transformed sample. It just rolls it up/down according to the factor calculated using the polynomial function computer earlier.
You can check full code of my project at github, "audio" package contains pattern calculator and audio transform algorithm described above.
Feel free to ask if something's unclear :)

Harmonic product spectrum for single guitar note Python

I am trying to detect the pitch of a B3 note played with a guitar. The audio can be found here.
This is the spectrogram:
As you can see, it is visible that the fundamental pitch is about 250Hz which corresponds to the B3 note.
It also contains a good amount of harmonics and that is why I chose to use HPS from here. I am using this code for detecting the pitch:
def freq_from_hps(signal, fs):
"""Estimate frequency using harmonic product spectrum
Low frequency noise piles up and overwhelms the desired peaks
"""
N = len(signal)
signal -= mean(signal) # Remove DC offset
# Compute Fourier transform of windowed signal
windowed = signal * kaiser(N, 100)
# Get spectrum
X = log(abs(rfft(windowed)))
# Downsample sum logs of spectra instead of multiplying
hps = copy(X)
for h in arange(2, 9): # TODO: choose a smarter upper limit
dec = decimate(X, h)
hps[:len(dec)] += dec
# Find the peak and interpolate to get a more accurate peak
i_peak = argmax(hps[:len(dec)])
i_interp = parabolic(hps, i_peak)[0]
# Convert to equivalent frequency
return fs * i_interp / N # Hz
My sampling rate is 40000. However, instead of getting a result close to 250Hz (B3 note), I am getting 0.66Hz. How is this possible?
I also tried with an autocorrelation method from the same repo but I also get bad results like 10000Hz.
Thanks to an answer I understand I have to apply a filter to remove the low frequencies in the signal. How do I do that? Are there multiple methods to do that, and which one is recommended?
STATUS UPDATE:
The high-pass filter proposed by the answer is working. If I apply the function in the answer to my audio signal, it correctly displays about 245Hz. However, I would like to filter the whole signal, not only a part of it. A note could lie in the middle of the signal or a signal contain more than one note (I know a solution is onset detection, but I am curious to know why this isn't working). That is why I edited the code to return filtered_audio.
The problem is that if I do that, even though the noise has been correctly removed (see screenshot). I get 0.05 as a result.

Based on the distances between the harmonics in the spectrogram, I would estimate the pitch to be about 150-200 Hz. So, why doesn't the pitch detection algorithm detect the pitch that we can see by eye in the spectrogram? I have a few guesses:
The note only lasts for a few seconds. At the beginning, there is a beautiful harmonic stack with 10 or more harmonics! These quickly fade away and are not even visible after 5 seconds. If you are trying to estimate the pitch of the entire signal, your estimate might be contaminated by the "pitch" of the sound from 5-12 seconds. Try computing the pitch only for the first 1-2 seconds.
There is too much low frequency noise. In the spectrogram, you can see a lot of power between 0 and 64 Hz. This is not part of the harmonics, so you could try removing it with a high-pass filter.
Here is some code that does the job:
import numpy as np
from scipy.io import wavfile
from scipy import signal
import matplotlib.pyplot as plt
from frequency_estimator import freq_from_hps
# downloaded from https://github.com/endolith/waveform-analyzer/
filename = 'Vocaroo_s1KZzNZLtg3c.wav'
# downloaded from http://vocaroo.com/i/s1KZzNZLtg3c
# Parameters
time_start = 0 # seconds
time_end = 1 # seconds
filter_stop_freq = 70 # Hz
filter_pass_freq = 100 # Hz
filter_order = 1001
# Load data
fs, audio = wavfile.read(filename)
audio = audio.astype(float)
# High-pass filter
nyquist_rate = fs / 2.
desired = (0, 0, 1, 1)
bands = (0, filter_stop_freq, filter_pass_freq, nyquist_rate)
filter_coefs = signal.firls(filter_order, bands, desired, nyq=nyquist_rate)
# Examine our high pass filter
w, h = signal.freqz(filter_coefs)
f = w / 2 / np.pi * fs # convert radians/sample to cycles/second
plt.plot(f, 20 * np.log10(abs(h)), 'b')
plt.ylabel('Amplitude [dB]', color='b')
plt.xlabel('Frequency [Hz]')
plt.xlim((0, 300))
# Apply high-pass filter
filtered_audio = signal.filtfilt(filter_coefs, [1], audio)
# Only analyze the audio between time_start and time_end
time_seconds = np.arange(filtered_audio.size, dtype=float) / fs
audio_to_analyze = filtered_audio[(time_seconds >= time_start) &
(time_seconds <= time_end)]
fundamental_frequency = freq_from_hps(audio_to_analyze, fs)
print 'Fundamental frequency is {} Hz'.format(fundamental_frequency)

subclassing of scipy.stats.rv_continuous

I have 3 questions regarding subclassing of scipy.stats.rv_continuous.
My goal is to code a statistical mixture model of a truncated normal distribution, a truncated exponential distribution and 2 uniform distributions.
1) Why is drawing random variates via mm_model.rvs(size = 1000) so extremely slow? I read something about performance issues in the documentary, but I was really surprised.
2) After drawing random variates via mm_model.rvs(size = 1000) I get this IntegrationWarning?
IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
3) I read in the documentary that I can transmit parameters to the pdf via the "shape" parameter. I tried to adjust my pdf and set the shape parameter but it did not work. Could someone explain it?
Thanks for any help.
def trunc_norm(z,low_bound,up_bound,mu,sigma):
a = (low_bound - mu) / sigma
b = (up_bound - mu) / sigma
return stats.truncnorm.pdf(z,a,b,loc=mu,scale=sigma)
def trunc_exp(z,up_bound,lam):
return stats.truncexpon.pdf(z,b=up_bound*lam,scale=1/lam)
def uniform(z,a,b):
return stats.uniform.pdf(z,loc=a,scale=(b-a))
class Measure_mixture_model(stats.rv_continuous):
def _pdf(self,z):
z_true = 8
z_min = 0
z_max = 10
p_hit = 0.7
p_short = 0.1
p_max = 0.1
p_rand = 0.1
sig_hit = 1
lambda_short = 0.5
sum_p = p_hit + p_short + p_max + p_rand
if sum_p < 0.99 or 1.01 < sum_p:
misc.eprint("apriori probabilities p_hit, p_short, p_max, p_rand have to be 1!")
return None
pdf_hit = trunc_norm(z,z_min,z_max,z_true,sig_hit)
pdf_short = trunc_exp(z,z_true,lambda_short)
pdf_max = uniform(z,0.99*z_max,z_max)
pdf_rand = uniform(z,z_min,z_max)
pdf = p_hit * pdf_hit + p_short * pdf_short + p_max * pdf_max + p_rand * pdf_rand
return pdf
#mm_model = Measure_mixture_model(shapes='z_true,z_min,z_max,p_hit,p_short,p_max,p_rand,sig_hit,lambda_short')
mm_model = Measure_mixture_model()
z = np.linspace(-1,11,1000)
samples = mm_model.pdf(z)
plt.plot(z,samples)
plt.show()
rand_samples = mm_model.rvs(size = 1000)
bins = np.linspace(-1, 11, 100)
plt.hist(rand_samples,bins)
plt.title("measure mixture model")
plt.xlabel("z: measurement")
plt.ylabel("p: relative frequency")
plt.show()

(1) and (2) are probably related. You are asking scipy to generate random samples based on only the density you provide.
I don't really know what scipy does, but I suspect it integrates the density ("pdf") to get the probability function ("cdf") and then inverts it to map uniform samples to your distribution. This is numerically expensive and as you experienced error prone.
To speed things up you can help scipy by implementing _rvs directly. Just draw a uniform to decide which sub-model of your mixture to select and then invoke the rvs of the selected sub-model. And similar for other functions you may require.
Here are some tips on how to implement a vectorised rvs:
To batch-select sub-models. Since your number of sub-models is small np.digitize should be good enough for this. If possible use rv_frozen instances for sub-models; they are very convenient, but I seem to remember that you can't pass all the optional parameters to them, so you may have to handle those separately.
self._bins = np.cumsum([p_hit, p_short, p_max])
self._bins /= self._bins[-1] + p_rand
submodel = np.digitize(uniform.rvs(size=size), self._bins)
result = np.empty(size)
for j, frozen in enumerate((frz_trunc_norm, frz_trunc_exp, frz_unif_1, frz_unif_2)):
inds = np.where(submodel == j)
result[inds] = frozen.rvs(size=inds.shape)
return result
Re (3) Here is what the scipy docs have to say.
A note on shapes: subclasses need not specify them explicitly. In this case, shapes will be automatically deduced from the signatures of the overridden methods (pdf, cdf etc). If, for some reason, you prefer to avoid relying on introspection, you can specify shapes explicitly as an argument to the instance constructor.
So the normal way would be to just put some arguments in your methods.

Why NUMPY correlate and corrcoef return different values and how to "normalize" a correlate in "full" mode?

I'm trying to use some Time Series Analysis in Python, using Numpy.
I have two somewhat medium-sized series, with 20k values each and I want to check the sliding correlation.
The corrcoef gives me as output a Matrix of auto-correlation/correlation coefficients. Nothing useful by itself in my case, as one of the series contains a lag.
The correlate function (in mode="full") returns a 40k elements list that DO look like the kind of result I'm aiming for (the peak value is as far from the center of the list as the Lag would indicate), but the values are all weird - up to 500, when I was expecting something from -1 to 1.
I can't just divide it all by the max value; I know the max correlation isn't 1.
How could I normalize the "cross-correlation" (correlation in "full" mode) so the return values would be the correlation on each lag step instead those very large, strange values?

You are looking for normalized cross-correlation. This option isn't available yet in Numpy, but a patch is waiting for review that does just what you want. It shouldn't be too hard to apply it I would think. Most of the patch is just doc string stuff. The only lines of code that it adds are
if normalize:
a = (a - mean(a)) / (std(a) * len(a))
v = (v - mean(v)) / std(v)
where a and v are the inputted numpy arrays of which you are finding the cross-correlation. It shouldn't be hard to either add them into your own distribution of Numpy or just make a copy of the correlate function and add the lines there. I would do the latter personally if I chose to go this route.
Another, quite possibly better, alternative is to just do the normalization to the input vectors before you send it to correlate. It's up to you which way you would like to do it.
By the way, this does appear to be the correct normalization as per the Wikipedia page on cross-correlation except for dividing by len(a) rather than (len(a)-1). I feel that the discrepancy is akin to the standard deviation of the sample vs. sample standard deviation and really won't make much of a difference in my opinion.

According to this slides, I would suggest to do it this way:
def cross_correlation(a1, a2):
lags = range(-len(a1)+1, len(a2))
cs = []
for lag in lags:
idx_lower_a1 = max(lag, 0)
idx_lower_a2 = max(-lag, 0)
idx_upper_a1 = min(len(a1), len(a1)+lag)
idx_upper_a2 = min(len(a2), len(a2)-lag)
b1 = a1[idx_lower_a1:idx_upper_a1]
b2 = a2[idx_lower_a2:idx_upper_a2]
c = np.correlate(b1, b2)[0]
c = c / np.sqrt((b1**2).sum() * (b2**2).sum())
cs.append(c)
return cs

For a full mode, would it make sense to compute corrcoef directly on the lagged signal/feature? Code
from dataclasses import dataclass
from typing import Any, Optional, Sequence
import numpy as np
ArrayLike = Any
#dataclass
class XCorr:
cross_correlation: np.ndarray
lags: np.ndarray
def cross_correlation(
signal: ArrayLike, feature: ArrayLike, lags: Optional[Sequence[int]] = None
) -> XCorr:
"""
Computes normalized cross correlation between the `signal` and the `feature`.
Current implementation assumes the `feature` can't be longer than the `signal`.
You can optionally provide specific lags, if not provided `signal` is padded
with the length of the `feature` - 1, and the `feature` is slid/padded (creating lags)
with 0 padding to match the length of the new signal. Pearson product-moment
correlation coefficients is computed for each lag.
See: https://en.wikipedia.org/wiki/Cross-correlation
:param signal: observed signal
:param feature: feature you are looking for
:param lags: optional lags, if not provided equals to (-len(feature), len(signal))
"""
signal_ar = np.asarray(signal)
feature_ar = np.asarray(feature)
if np.count_nonzero(feature_ar) == 0:
raise ValueError("Unsupported - feature contains only zeros")
assert (
signal_ar.ndim == feature_ar.ndim == 1
), "Unsupported - only 1d signal/feature supported"
assert len(feature_ar) <= len(
signal
), "Unsupported - signal should be at least as long as the feature"
padding_sz = len(feature_ar) - 1
padded_signal = np.pad(
signal_ar, (padding_sz, padding_sz), "constant", constant_values=0
)
lags = lags if lags is not None else range(-padding_sz, len(signal_ar), 1)
if np.max(lags) >= len(signal_ar):
raise ValueError("max positive lag must be shorter than the signal")
if np.min(lags) <= -len(feature_ar):
raise ValueError("max negative lag can't be longer than the feature")
assert np.max(lags) < len(signal_ar), ""
lagged_patterns = np.asarray(
[
np.pad(
feature_ar,
(padding_sz + lag, len(signal_ar) - lag - 1),
"constant",
constant_values=0,
)
for lag in lags
]
)
return XCorr(
cross_correlation=np.corrcoef(padded_signal, lagged_patterns)[0, 1:],
lags=np.asarray(lags),
)
Example:
signal = [0, 0, 1, 0.5, 1, 0, 0, 1]
feature = [1, 0, 0, 1]
xcorr = cross_correlation(signal, feature)
assert xcorr.lags[xcorr.cross_correlation.argmax()] == 4

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Additional audio feature extraction tips - python

Related

Cleaning generated Sinewave artifacts for smooth frequency transitions

Change the melody of human speech using FFT and polynomial interpolation

Harmonic product spectrum for single guitar note Python

subclassing of scipy.stats.rv_continuous

Why NUMPY correlate and corrcoef return different values and how to "normalize" a correlate in "full" mode?

Categories

Resources