I am using this algorithm to detect the pitch of
this audio file. As you can hear, it is an E2 note played on a guitar with a bit of noise in the background.
I generated this spectrogram using STFT:
And I am using the algorithm linked above like this:
import numpy as np
import librosa

y, sr = librosa.load(filename, sr=40000)
pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
np.set_printoptions(threshold=np.inf)
print(pitches[np.nonzero(pitches)])
As a result, I am getting pretty much every possible frequency between my fmin and fmax. What do I have to do with the output of the piptrack method to discover the fundamental frequency of a time frame?
UPDATE
I am still not sure what those 2D arrays represent, though. Let's say I want to find out how strong 82 Hz is in frame 5. I could do that using the STFT function, which simply returns a 2D matrix (the one used to plot the spectrogram).
However, piptrack does something additional which could be useful, and I don't really understand what. pitches[f, t] contains the instantaneous frequency at bin f, time t. Does that mean that, if I want to find the strongest frequency at time frame t, I have to:

1. Go to the magnitudes[:, t] array and find the bin with the maximum magnitude.
2. Assign that bin index to a variable f.
3. Look up pitches[f, t] to find the frequency that belongs to that bin?
Pitch detection is a tricky topic and is often counter-intuitive. I'm not wild about the way the source code is documented for this particular function -- it almost seems like the developer is confusing a 'harmonic' with a 'pitch'.
When a single note (a 'pitch') is played on a guitar or piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at mathematically related frequencies, called harmonics. Typical pitch-tracking techniques search the results of an FFT for magnitudes in the bins that correspond to the expected frequencies of those harmonics. For instance, if we press the Middle C key on the piano, the fundamental frequency is 261.6 Hz, the 2nd harmonic is 523.2 Hz, the 3rd harmonic is 784.8 Hz, the 4th harmonic is 1046.4 Hz, and so on: the higher harmonics are integer multiples of the fundamental frequency ( ex: 2 x 261.6 = 523.2, 3 x 261.6 = 784.8, 4 x 261.6 = 1046.4 ). Musical pitches, however, are spaced logarithmically, while the FFT's bins are spaced linearly, so an FFT often does not have enough frequency resolution at the lower frequencies.
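To make that arithmetic concrete, here is a tiny sketch (my own illustration) of the harmonic series of Middle C:

# harmonic series of Middle C: integer multiples of the fundamental
fundamental = 261.6  # Hz
for n in range(1, 5):
    print(f"harmonic {n}: {n * fundamental:.1f} Hz")
# harmonic 1: 261.6 Hz (the fundamental)
# harmonic 2: 523.2 Hz
# harmonic 3: 784.8 Hz
# harmonic 4: 1046.4 Hz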
For that reason, when I wrote a pitch-detection application (PitchScope Player), I chose to create a logarithmically spaced DFT rather than an FFT, so I could focus on the precise frequencies of interest for music ( see the attached diagram of my custom DFT from 3 seconds of a guitar solo ). If you are serious about pursuing pitch detection, you should consider doing more reading on the topic, looking at other sample code (mine is linked below), and writing your own functions to measure frequency.
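As a rough illustration of the idea (my own sketch, not PitchScope Player's actual code), a DFT can be evaluated directly at logarithmically spaced frequencies, so that bins line up with musical pitches:

import numpy as np

def log_spaced_dft(x, sr, fmin=80.0, fmax=1000.0, bins_per_octave=12):
    # naive O(N * bins) DFT: correlate the signal with a complex
    # exponential at each logarithmically spaced target frequency
    n_bins = int(np.ceil(np.log2(fmax / fmin) * bins_per_octave))
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    t = np.arange(len(x)) / sr
    mags = np.array([np.abs(np.dot(x, np.exp(-2j * np.pi * f * t)))
                     for f in freqs])
    return freqs, mags

# example: locate a synthetic E2 (~82.4 Hz) tone
sr = 8000
t = np.arange(sr) / sr                  # one second of samples
x = np.sin(2 * np.pi * 82.4 * t)
freqs, mags = log_spaced_dft(x, sr)
print(freqs[mags.argmax()])             # prints the log-spaced bin nearest E2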
https://en.wikipedia.org/wiki/Transcription_(music)#Pitch_detection
https://github.com/CreativeDetectors/PitchScope_Player
Turns out the way to pick the pitch at a certain frame t is simple:
def detect_pitch(y, sr, t):
    pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    index = magnitudes[:, t].argmax()
    pitch = pitches[index, t]
    return pitch
First get the bin with the strongest magnitude from the magnitudes array, then look up the corresponding frequency at pitches[index, t].
To find the pitch of the whole audio segment:
def detect_pitch(y, sr):
    pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    # get the index of the maximum magnitude in each time slice
    max_indexes = np.argmax(magnitudes, axis=0)
    # get the pitch of that strongest bin in each time slice
    pitches = pitches[max_indexes, np.arange(magnitudes.shape[1])]
    return pitches
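A possible follow-up (my own sketch, not part of the original answer): the per-frame track can be collapsed into a single estimate for the whole segment, e.g. the median over the frames where piptrack found anything:

import numpy as np
import librosa

def detect_overall_pitch(y, sr):
    pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    best = pitches[magnitudes.argmax(axis=0), np.arange(pitches.shape[1])]
    voiced = best[best > 0]   # keep only frames where a pitch was detected
    return np.median(voiced) if voiced.size else 0.0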
Related
I have implemented Demetri's Pitch Detector project for the iPhone and am running up against two problems: 1) any sort of background noise sends the frequency reading haywire, and 2) lower-frequency sounds aren't being pitched correctly. I tried to tune my guitar, and while the higher strings worked, the tuner could not correctly discern the low E.
The Pitch Detection code is located in RIOInterface.mm and goes something like this ...
// get the data
AudioUnitRender(...);
// convert int16 to float
Convert(...);
// divide the signal into even-odd configuration
vDSP_ctoz((COMPLEX*)outputBuffer, 2, &A, 1, nOver2);
// apply the fft
vDSP_fft_zrip(fftSetup, &A, stride, log2n, FFT_FORWARD);
// convert split complex form back to an interleaved buffer
vDSP_ztoc(&A, 1, (COMPLEX *)outputBuffer, 2, nOver2);
Demetri then goes on to determine the 'dominant' frequency as follows:
float dominantFrequency = 0;    // actually holds the largest squared magnitude seen so far
int bin = -1;
for (int i = 0; i < n; i += 2) {
    // squared magnitude of the complex value in this FFT bin
    float curFreq = MagnitudeSquared(outputBuffer[i], outputBuffer[i+1]);
    if (curFreq > dominantFrequency) {
        dominantFrequency = curFreq;
        bin = (i+1)/2;
    }
}
memset(outputBuffer, 0, n*sizeof(SInt16));
// Update the UI with our newly acquired frequency value.
[THIS->listener frequencyChangedWithValue:bin*(THIS->sampleRate/bufferCapacity)];
To start with, I believe I need to apply a LOW PASS FILTER ... but I'm not an FFT expert and I'm not sure exactly where or how to do that to the data returned by the vDSP functions. I'm also not sure how to improve the accuracy of the code in the lower frequencies. There seem to be other algorithms for determining the dominant frequency, but again, I'm looking for a push in the right direction when using the data returned by Apple's Accelerate framework.
UPDATE:
The Accelerate framework actually has some windowing functions. I set up a basic window like this:
windowSize = maxFrames;
transferBuffer = (float*)malloc(sizeof(float)*windowSize);
window = (float*)malloc(sizeof(float)*windowSize);
memset(window, 0, sizeof(float)*windowSize);
vDSP_hann_window(window, windowSize, vDSP_HANN_NORM);
which I then apply by inserting
vDSP_vmul(outputBuffer, 1, window, 1, transferBuffer, 1, windowSize);
before the vDSP_ctoz function. I then changed the rest of the code to use 'transferBuffer' instead of 'outputBuffer' ... but so far I haven't noticed any dramatic change in the final pitch guess.
Pitch is not the same as peak magnitude frequency bin (which is what the FFT in the Accelerate framework might give you directly). So any peak frequency detector will not be reliable for pitch estimation. A low-pass filter will not help when the note has a missing or very weak fundamental (common in some voice, piano and guitar sounds) and/or lots of powerful overtones in its spectrum.
Look at a wide-band spectrum or spectrograph of your musical sounds and you will see the problem.
Other methods are usually needed for a more reliable estimate of musical pitch. Some of these include autocorrelation methods (AMDF, ASDF), Cepstrum/Cepstral analysis, harmonic product spectrum, phase vocoder, and/or composite algorithms such as RAPT (Robust Algorithm for Pitch Tracking) and YAAPT. An FFT is useful as only a sub-part of some of the above methods.
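For a flavor of those alternatives, here is a minimal sketch (my own, with arbitrary parameter choices) of one of them, the harmonic product spectrum:

import numpy as np

def hps_pitch(frame, sr, n_harmonics=4):
    # harmonic product spectrum: downsampling the magnitude spectrum by
    # 2, 3, ... aligns each harmonic with the fundamental, so multiplying
    # the copies together reinforces the pitch even when the fundamental
    # itself is weak or missing
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    hps = spectrum.copy()
    for h in range(2, n_harmonics + 1):
        decimated = spectrum[::h]
        hps[:len(decimated)] *= decimated
    # skip the DC bin and search only where all harmonics stay in range
    search = hps[1:len(spectrum) // n_harmonics]
    return (search.argmax() + 1) * sr / len(frame)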
At the very least you need to apply a window function to your time domain data, prior to calculating the FFT. Without this step the power spectrum will contain artefacts (see: spectral leakage) which will interfere with your attempts at extracting pitch information.
A simple Hann (aka Hanning) window should suffice.
What is your sample frequency and blocksize? Low E is around 82 Hz, so you need to make sure your capture block is long enough to capture many cycles at this frequency. This is because the Fourier transform divides the frequency spectrum into bins, each several Hz wide. If you sample at 44.1 kHz with a 1024-point time-domain block, for instance, each bin will be 44100/1024 = 43.07 Hz wide, so a low E would fall in the second bin. For a bunch of reasons (to do with spectral leakage and the nature of finite time blocks), you should, practically speaking, treat the first 3 or 4 bins of an FFT result with extreme suspicion.
If you drop the sample rate to 8 kHz, the same blocksize gives you bins that are 7.8125 Hz wide. Now low E will be in the 10th or 11th bin, which is much better. You could also use a longer blocksize.
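The arithmetic, as a quick sanity check (a sketch; 82.4 Hz is the low-E fundamental):

blocksize = 1024
low_e = 82.4  # Hz
for sr in (44100, 8000):
    bin_width = sr / blocksize
    print(f"sr={sr}: bin width {bin_width:.2f} Hz, low E in bin {low_e / bin_width:.1f}")
# sr=44100: bin width 43.07 Hz, low E in bin 1.9
# sr=8000: bin width 7.81 Hz, low E in bin 10.5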
And as Paul R points out, you MUST use a window to reduce spectral leakage.
The frequency response function of the iPhone drops off below 100 - 200 Hz (see http://blog.faberacoustical.com/2009/ios/iphone/iphone-microphone-frequency-response-comparison/ for an example).
If you are trying to detect the fundamental mode of a low guitar string, the microphone might be acting as a filter and suppressing the very frequency you are interested in. There are a couple of options if you want to make use of the FFT data you can get. You can window the data in the frequency domain around the note you are trying to detect, so that all you can see is the first mode, even if it is of lower magnitude than the higher modes (i.e. have a toggle for tuning the first string that puts the tuner into this mode).
Or you can low-pass filter the sound data. You can do this in the time domain, or, even easier since you already have frequency-domain data, in the frequency domain. A very simple time-domain low-pass filter is a moving-average filter. A very simple frequency-domain low-pass filter is to multiply your FFT magnitudes by a vector with 1s in the low-frequency range and a linear (or even a step) ramp down to 0 in the higher frequencies.
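A minimal numpy sketch of that frequency-domain approach (my own illustration; the cutoff and ramp widths are arbitrary):

import numpy as np

def lowpass_spectrum(x, sr, pass_hz=120.0, ramp_hz=80.0):
    # multiply the spectrum by a mask that is 1 below pass_hz and
    # ramps linearly down to 0 over the next ramp_hz
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    mask = np.clip(1.0 - (freqs - pass_hz) / ramp_hz, 0.0, 1.0)
    return spectrum * mask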
I am analyzing a time-series dataset that I am pretty sure can be broken down using an FFT. I want to develop a model that estimates the data using a sum of sines/cosines, but I am having trouble with the syntax for finding the frequencies in Python.
Here is a graph of the data
And here's a link to the original data: https://drive.google.com/open?id=1mqZtQ-txdd_AFbKGBlbSL6903CK-_kXl
Most of the examples I have seen have multiple samples per second/time period; however, the data in this set represents by-minute observations of some metric. Because of this, I've had trouble translating the answers online to this problem.
Here's my naive first approach
from scipy import fftpack
import matplotlib.pyplot as plt
import numpy as np

X = fftpack.fft(data)
freqs = fftpack.fftfreq(len(data))
plt.plot(freqs, np.abs(X))
plt.show()
Instead of peaks at the major frequencies, my plot has only one peak, at 0.
The FFT you posted has been shifted so that 0 is at the center. Data to the left of the center represents negative frequencies and to the right represents positive frequencies. If you zoom in and look more closely, I think you will see that there are two peaks close to the center that you are interpreting as a single peak at 0. Just looking at the positive side, the location of this peak will tell you which frequency is contributing significant signal power.
Like you said, your x-axis is probably incorrect. scipy.fftpack.fftfreq needs to know the time between samples (in seconds, if you want the axis in Hz) of your time-domain signal to correctly determine the bandwidth and create the x-axis array in Hz. This should do it:
dt = 60  # 60 seconds between samples
freqs = fftpack.fftfreq(len(data), d=dt)
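Putting it together (a sketch; data is the by-minute series from the question):

from scipy import fftpack
import matplotlib.pyplot as plt
import numpy as np

dt = 60.0                                 # seconds between samples
X = fftpack.fft(data - np.mean(data))     # subtract the mean to suppress the DC spike
freqs = fftpack.fftfreq(len(data), d=dt)  # frequencies in Hz
half = len(data) // 2
plt.plot(freqs[:half], np.abs(X[:half]))  # positive frequencies only
plt.xlabel('frequency (Hz)')
plt.show()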
I am new to Python. I intend to apply a Fourier transform to an array of discrete points (time, acceleration) and plot the result. I copied and pasted the sample FFT code and modified it accordingly. Please see the code:
import numpy as np
import matplotlib.pyplot as plt

# Load the .txt file in
myData = np.loadtxt('twenty_z_up.txt')
# Extract the time column
time = myData[:, 0].copy()
# Extract the acceleration column
zAcc = myData[:, 3].copy()

t = np.arange(10080)
sp = np.fft.fft(zAcc)
freq = np.fft.fftfreq(t.shape[-1])
plt.plot(freq, sp.real)
plt.show()
myData is a rectangular matrix with 10080 rows and 10 columns; zAcc is column 3 extracted from the matrix.
In the plot drawn by Spyder, most of the harmonics are concentrated around 0, and they are all extremely small. But my data are actually the accelerations of a phone carried by a walking person (including gravity), so I expect the most significant harmonic to be around 2 Hz.
Why does the graph look like nonsense?
Thanks in advance!
==============UPDATES: My Graphs======================
The first, in the time domain:
the x-axis is in milliseconds;
the y-axis is in m/s^2; due to Earth's gravity it has a DC offset of ~10.
You do get two spikes at (approximately) 2 Hz. Your sampling period is around 2.8 ms (as best I can infer from your first plot), which gives +/-2 Hz a normalized frequency of +/-0.0056, and that is about where your spikes are. fft.fftfreq by default returns the normalized frequency in cycles per sample; set the d argument to the sampling period and you'll get a vector containing the actual frequencies in Hz.
Your huge spike in the middle is obviously the DC offset, which you can trivially remove by subtracting the mean.
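In code, both fixes are one-liners (a sketch, using zAcc from the question and assuming the ~2.8 ms sampling period estimated above):

import numpy as np

dt = 0.0028                             # ~2.8 ms sampling period (estimated)
sp = np.fft.fft(zAcc - zAcc.mean())     # remove the ~10 m/s^2 DC offset
freq = np.fft.fftfreq(zAcc.size, d=dt)  # actual frequencies in Hz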
As others said, we need to see the data, so post it somewhere. Just to check, first fix the timestep size in fftfreq, then plot this synthetic signal, and then plot your signal to see how they compare:
import numpy as np
import matplotlib.pyplot as plt

timestep = 1. / 50.  # assume sampling at 50 Hz; change this accordingly
N = 10080            # the number of samples
T = N * timestep
t = np.linspace(0, T, N)  # needed only to generate xAcc_synthetic
peak_freq = 2.            # put a peak at 2 Hz

# generate a synthetic signal at 2 Hz and add some noise to it
xAcc_synthetic = np.sin((2 * np.pi) * peak_freq * t) + np.random.rand(N) * 0.2
sp_synthetic = np.fft.fft(xAcc_synthetic)
freq = np.fft.fftfreq(t.size, d=timestep)
print(np.isclose(max(abs(freq)), (1 / timestep) / 2.))  # sanity check: highest frequency ~ Nyquist
plt.plot(freq, abs(sp_synthetic))
plt.xlabel('Hz')
plt.show()
Now, at the x-axis value of 2 you actually have a physical frequency of 2 Hz, and you may spot the more pronounced peak you are looking for. Moreover, you may also want to have a look at yAcc and zAcc.
Audio processing is pretty new to me, and I am currently using Python and numpy to process wave files. After calculating the FFT matrix, I am getting noisy power values for non-existent frequencies. I am interested in visualizing the data, and accuracy is not a high priority. Is there a safe way to calculate a clipping value to remove these values, or should I use all the FFT matrices for each sample set to come up with an average number?
Edit:
import wave
import struct
from numpy import *
from pylab import plot, show

fp = wave.open("500-200f.wav", "rb")
sample_rate = fp.getframerate()
total_num_samps = fp.getnframes()
fft_length = 2048
num_fft = total_num_samps // fft_length - 2
temp = zeros((num_fft, fft_length), float)
for i in range(num_fft):
    tempb = fp.readframes(fft_length)
    data = struct.unpack("%dH" % fft_length, tempb)
    temp[i, :] = array(data, short)
pts = fft_length // 2 + 1
data = abs(fft.rfft(temp, fft_length)) / pts
x_axis = arange(pts) * sample_rate * .5 / pts
plot(x_axis, data[0])
show()
Here is the plot, on a non-logarithmic scale, for a synthetic wave file containing a 500 Hz (fading out) + 200 Hz sine wave, created using GoldWave.
Simulated waveforms shouldn't produce FFTs like your figure, so something is very wrong, and probably not with the FFT but with the input waveform. The main problem in your plot is not the ripples, but the harmonics around 1000 Hz and the subharmonic at 500 Hz. A simulated waveform shouldn't show any of this (for example, see my plot below).
First, you probably want to just plot the raw waveform, which will likely point to an obvious problem. Also, it seems odd to have the wave unpack to unsigned shorts, i.e. "H", and especially, after this, to not have a large zero-frequency component.
I was able to produce a pretty close duplicate of your FFT by applying clipping to the waveform, as was suggested by both the subharmonic and the higher harmonics (and by Trevor). You could be introducing clipping either in the simulation or in the unpacking. Either way, I bypassed this by creating the waveforms in numpy to start with.
Here's what the proper FFT should look like (i.e. basically perfect, except for the broadening of the peaks due to the windowing).
Here's one from a waveform that has been clipped (and it is very similar to your FFT, from the subharmonic to the precise pattern of the three higher harmonics around 1000 Hz).
Here's the code I used to generate these
from numpy import *
from pylab import plot, show, xlabel, ylabel

sample_rate = 20000.
times = arange(0, 10., 1. / sample_rate)
wfm0 = sin(2 * pi * 200. * times)
wfm1 = sin(2 * pi * 500. * times) * (10. - times) / 10.
wfm = wfm0 + wfm1
# int test
#wfm *= 2**8
#wfm = wfm.astype(int16)
#wfm = wfm.astype(float)
# abs test
#wfm = abs(wfm)
# clip test
#wfm = clip(wfm, -1.2, 1.2)

fft_length = 5 * 2048
total_num_samps = len(times)
num_fft = total_num_samps // fft_length - 2
temp = zeros((num_fft, fft_length), float)
for i in range(num_fft):
    temp[i, :] = wfm[i * fft_length:(i + 1) * fft_length]
pts = fft_length // 2 + 1
data = abs(fft.rfft(temp, fft_length)) / pts
x_axis = arange(pts) * sample_rate * .5 / pts
plot(x_axis, data[2], linewidth=3)
xlabel("freq (Hz)")
ylabel("abs(FFT)")
show()
Because FFTs operate on windowed, sampled data, they cause aliasing and sampling effects in the frequency domain as well. Filtering in the time domain is just multiplication in the frequency domain, so you may want to simply apply a filter by multiplying each frequency bin by the value of the filter function you are using: for example, multiply by 1 in the passband and by 0 everywhere else. The unexpected values are probably caused by aliasing, where higher frequencies are folded down onto the ones you are seeing. The original signal needs to be band-limited to half your sampling rate or you will get aliasing. Of more concern is aliasing that distorts the area of interest, because you want to be sure that the content in that band of frequencies really is at the expected frequency.
The other thing to keep in mind is that when you grab a chunk of data from a wave file, you are mathematically multiplying it by a square wave. This causes a sin(x)/x shape to be convolved with the frequency response; to minimize this, you can multiply the original signal by something like a Hann window before the FFT.
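In numpy that is a one-line fix before the FFT (a sketch; 'block' stands for one chunk of samples, and fft_length is the block size used in the code above):

import numpy as np

# taper each analysis block with a Hann window before the FFT to
# suppress the sin(x)/x leakage of the implicit rectangular window
window = np.hanning(fft_length)
spectrum = abs(np.fft.rfft(block * window))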
It's worth mentioning that for a 1D FFT, the first element (index [0]) contains the DC (zero-frequency) term, the elements [1:N/2] contain the positive frequencies, and the elements [N/2+1:N-1] contain the negative frequencies. Since you didn't provide a code sample or additional information about the output of your FFT, I can't rule out the possibility that the "noisy power values at non-existent frequencies" are just the negative frequencies of your spectrum.
EDIT: Here is an example of a radix-2 FFT implemented in pure Python with a simple test routine that finds the FFT of a rectangular pulse, [1.,1.,1.,1.,0.,0.,0.,0.]. You can run the example on codepad and see that the FFT of that sequence is
[0j, Negative frequencies
(1+0.414213562373j), ^
0j, |
(1+2.41421356237j), |
(4+0j), <= DC term
(1-2.41421356237j), |
0j, v
(1-0.414213562373j)] Positive frequencies
Note that the code prints out the Fourier coefficients in order of ascending frequency, i.e. from the highest negative frequency up to DC, and then up to the highest positive frequency.
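You can reproduce that shifted, ascending ordering with numpy (a sketch using the same rectangular pulse):

import numpy as np

x = np.array([1., 1., 1., 1., 0., 0., 0., 0.])
X = np.fft.fft(x)                       # numpy's order: DC first, then positive, then negative
print(np.round(np.fft.fftshift(X), 3))  # ascending order with DC in the middle
# ~[0, 1+0.414j, 0, 1+2.414j, 4, 1-2.414j, 0, 1-0.414j]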
I don't know enough from your question to actually answer anything specific.
But here are a couple of things to try from my own experience writing FFTs:
Make sure you are following the Nyquist rule.
If you are viewing the linear output of the FFT, you will have trouble seeing your own signal and think everything is broken. Make sure you are looking at the dB of your FFT magnitude (i.e. plot(20*log10(abs(fft(x))))).
Create a unit test for your FFT() function by feeding it generated data, like a pure tone. Then feed the same generated data to Matlab's FFT(). Do an absolute-value diff between the two output series and make sure the maximum absolute difference is something like 10^-6 (i.e. the only difference is caused by small floating-point errors); a sketch of this kind of test is shown after this list.
Make sure you are windowing your data
If all three of those things work, then your FFT is fine, and your input data is probably the issue.
Check the input data to see if there is clipping: http://www.users.globalnet.co.uk/~bunce/clip.gif
Time-domain clipping shows up in the frequency domain as mirror images of the signal at specific regular intervals, with less amplitude.
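For instance, using numpy's FFT as the reference implementation instead of Matlab's, a unit test along those lines might look like this (my own sketch; naive_dft stands in for 'your FFT'):

import numpy as np

def naive_dft(x):
    # O(N^2) textbook DFT, playing the role of the FFT under test
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])

# pure-tone unit test, compared bin-by-bin against the reference FFT
sr, f = 1024, 100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * f * t)
diff = np.max(np.abs(naive_dft(tone) - np.fft.fft(tone)))
assert diff < 1e-6, f'max absolute difference {diff} is too large'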