I am using this algorithm to detect the pitch of
this audio file. As you can hear, it is an E2 note played on a guitar with a bit of noise in the background.
I generated this spectrogram using STFT:
And I am using the algorithm linked above like this:
import sys
import librosa
import numpy as np
y, sr = librosa.load(filename, sr=40000)
pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
np.set_printoptions(threshold=sys.maxsize)  # print the full array instead of a summary
print(pitches[np.nonzero(pitches)])
As a result, I am getting pretty much every possible frequency between my fmin and fmax. What do I have to do with the output of the piptrack method to discover the fundamental frequency of a time frame?
UPDATE
I am still not sure what those 2D arrays represent, though. Let's say I want to find out how strong 82 Hz is in frame 5. I could do that using the STFT function, which simply returns a 2D matrix (the one used to plot the spectrogram).
However, piptrack does something additional which could be useful, and I don't really understand what. pitches[f, t] contains the instantaneous frequency at bin f, time t. Does that mean that, if I want to find the strongest frequency at time frame t, I have to:
Go to the magnitudes[:, t] array and find the bin with the maximum magnitude.
Assign the bin to a variable f.
Look up pitches[f, t] to find the frequency that belongs to that bin?
Pitch detection is a tricky topic and is often counter-intuitive. I'm not wild about the way the source code is documented for this particular function -- it almost seems like the developer is confusing a 'harmonic' with a 'pitch'.
When a single note (a 'pitch') is played on a guitar or piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies, called harmonics. Typical pitch tracking techniques include searching the results of an FFT for magnitudes in certain bins that correspond to the expected frequencies of harmonics. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency (the 1st harmonic), 523 Hz would be the 2nd harmonic, 785 Hz the 3rd harmonic, 1046 Hz the 4th harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz (e.g. 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046). However, the frequencies of musical notes are logarithmically spaced, while the FFT uses linear spacing. Often the frequency resolution of the FFT is too coarse at the lower frequencies.
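To make the arithmetic above concrete, here is a small sketch that lists the harmonics of Middle C and compares the linear bin spacing of an FFT with the gap between neighbouring notes near E2. The frame size of 2048 samples is an assumption chosen for illustration (it is not stated in the question); the sample rate of 40000 Hz is taken from the question's code, and the helper names are hypothetical.

```python
# Harmonics are integer multiples of the fundamental; notes are spaced by a
# constant ratio (2 ** (1/12) per semitone), so low notes sit close together in Hz.

def harmonics(f0, count):
    """Return the first `count` harmonics of fundamental f0 (Hz)."""
    return [n * f0 for n in range(1, count + 1)]

def fft_bin_spacing(sr, n_fft):
    """Distance in Hz between adjacent FFT bins."""
    return sr / n_fft

def semitone_gap(f):
    """Distance in Hz from frequency f to the note one semitone above it."""
    return f * (2 ** (1 / 12) - 1)

middle_c = harmonics(261.6, 4)          # fundamental plus 2nd, 3rd and 4th harmonic
spacing = fft_bin_spacing(40000, 2048)  # n_fft=2048 is an assumed frame size
gap_at_e2 = semitone_gap(82.41)         # E2 is about 82.41 Hz

print(middle_c)
print(f"FFT bin spacing: {spacing:.2f} Hz, semitone gap at E2: {gap_at_e2:.2f} Hz")
```

With these numbers one FFT bin is about 19.5 Hz wide while E2 and F2 are only about 4.9 Hz apart, which is why a plain FFT struggles to separate low notes.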
For that reason, when I wrote a pitch detecting application (PitchScope Player), I chose to create a logarithmically spaced DFT, rather than an FFT, so I could focus on the precise frequencies of interest for music (see the attached diagram of my custom DFT from 3 seconds of a guitar solo). If you are serious about pursuing pitch detection, you should consider doing more reading into the topic, looking at other sample code (mine is linked below), and writing your own functions to measure frequency.
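As a rough illustration of the idea of a logarithmically spaced DFT, the sketch below evaluates the DFT sum directly at the frequencies of musical notes (one per semitone) instead of at the linearly spaced FFT bins. This is not the code from PitchScope Player; the function names, the MIDI note range and the synthetic test tone are assumptions made for the example.

```python
import numpy as np

def midi_to_hz(midi_note):
    """Equal-tempered frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def log_spaced_dft(signal, sr, midi_notes):
    """Magnitude of the DFT sum evaluated at one frequency per MIDI note."""
    n = np.arange(len(signal))
    freqs = np.array([midi_to_hz(m) for m in midi_notes])
    # One complex exponential per note frequency; rows = notes, columns = samples.
    basis = np.exp(-2j * np.pi * np.outer(freqs, n) / sr)
    return np.abs(basis @ signal)

# Synthetic test tone: half a second of a pure sine at E2 (MIDI note 40).
sr = 8000
t = np.arange(sr // 2) / sr
tone = np.sin(2 * np.pi * midi_to_hz(40) * t)

notes = list(range(36, 61))             # C2 .. C4, one analysis frequency per semitone
mags = log_spaced_dft(tone, sr, notes)
strongest_note = notes[int(np.argmax(mags))]
print(strongest_note)
```

The direct sum costs on the order of notes x samples operations, so it is slower than an FFT, but the analysis frequencies land exactly on the notes of interest.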
https://en.wikipedia.org/wiki/Transcription_(music)#Pitch_detection
https://github.com/CreativeDetectors/PitchScope_Player
Turns out the way to pick the pitch at a certain frame t is simple:
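import librosa
def detect_pitch(y, sr, t):
    pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    index = magnitudes[:, t].argmax()  # bin with the largest magnitude in frame t
    pitch = pitches[index, t]          # instantaneous frequency (Hz) of that bin
    return pitch
This first gets the bin of the strongest frequency by looking at the magnitudes array, and then reads the pitch at pitches[index, t].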
To find the pitch of the whole audio segment:
import librosa
import numpy as np
def detect_pitch(y, sr):
    pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    # get indexes of the maximum value in each time slice
    max_indexes = np.argmax(magnitudes, axis=0)
    # get the pitches of the max indexes per time slice
    pitches = pitches[max_indexes, np.arange(magnitudes.shape[1])]
    return pitches
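The indexing used above can be tried without an audio file. The sketch below applies the same argmax selection to small hand-made pitches and magnitudes arrays (hypothetical values standing in for piptrack output, with rows as frequency bins and columns as frames). Note that the strongest peak is not guaranteed to be the fundamental: in the second frame of this made-up data a harmonic is louder, so it is the one that gets picked.

```python
import numpy as np

def strongest_pitch_per_frame(pitches, magnitudes):
    """For every frame (column), return the pitch whose bin has the largest magnitude."""
    max_indexes = np.argmax(magnitudes, axis=0)
    return pitches[max_indexes, np.arange(magnitudes.shape[1])]

# Made-up data: 4 frequency bins x 3 frames, zeros where no peak was found.
pitches = np.array([
    [0.0,   0.0,   0.0],
    [82.4,  82.5,  0.0],
    [164.8, 0.0,   165.0],
    [0.0,   247.0, 0.0],
])
magnitudes = np.array([
    [0.0, 0.0, 0.0],
    [5.0, 1.0, 0.0],
    [2.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
])

result = strongest_pitch_per_frame(pitches, magnitudes)
print(result)
```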
BIG EDIT
The original code was:
The code plots a graph built by reading a text file with n lines. Each line contains 4 columns: the first three columns are the coordinates of an (x,y,z) point, and the fourth column is a binary variable that is not needed for this plot. Every 20 lines form a skeleton, that is, a group of 20 (x,y,z) points or joints, each joint given by the first three columns of a line.
Example of a text file's content: a text file contains 860 lines, and 860/20 = 43, where 20 is the number of joints that make up one skeleton. The text file is therefore made of 43 skeletons, which together form a movement. I've called it an "example" because the numbers vary.
After building the code to read the skeletons' movements, I made a big 2D array that contains all the movements together. The result was a 22797x400 array, where each row is a skeleton: 22797 skeletons with 400 columns each. I've called this 2D array final_array.
I applied Singular Value Decomposition (SVD) to final_array (the decomposition yields three matrices, of which I use V) and multiplied final_array by a reduced version of V (V is originally 400x400; the reduced version is 400x3), resulting in a 22797x3 2D array. The reasons don't need to be covered here, but the goal was dimensionality reduction so that the skeletons can be plotted in later parts of the process.
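The projection described above can be sketched with NumPy on a small random matrix standing in for final_array (the real data is not available here, so the 50x8 shape and the variable names are assumptions). Note that numpy.linalg.svd returns the right singular vectors already transposed, so the reduced V is taken from the first three rows of that result.

```python
import numpy as np

rng = np.random.default_rng(0)
final_array = rng.normal(size=(50, 8))   # stand-in for the 22797x400 array

# final_array = U @ diag(S) @ Vt, where the rows of Vt are the right singular vectors.
U, S, Vt = np.linalg.svd(final_array, full_matrices=False)

V_reduced = Vt[:3].T                     # 8x3: the three strongest directions
matrix = final_array @ V_reduced         # 50x3: each row is one projected skeleton

print(matrix.shape)
```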
Hence, I have a 22797x3 2D array, where each row represents a skeleton, built from the operations explained above. I need to cluster the rows of this matrix using KMeans from Scikit-learn in Python, with 100 clusters.
The result I need is kmeans_labels, a list of 22797 elements telling which of the 100 clusters each row (skeleton) was assigned to.
So far I've tried:
kmeans = KMeans(n_clusters=100, random_state=0).fit(matrix)
But the result was the following error message:
Number of distinct clusters (68) found smaller than n_clusters (100). Possibly due to duplicate points in X.
return_n_iter=True)
No matter how many times I change the number of clusters, the message comes back with a smaller value.
Any help?
This error means that your data matrix is mostly composed of repeated vectors.
So from your 22797 data points, there are only 68 different vectors and the rest are just repetitions of these 68 values.
Try printing the matrix. I believe that either you are not reading the data as you should, or you are not measuring them the right way.
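One quick way to check this is to count the distinct rows before clustering. The sketch below uses numpy.unique with axis=0 on a small made-up matrix; the helper name count_distinct_rows and the toy data are assumptions for illustration.

```python
import numpy as np

def count_distinct_rows(matrix):
    """Number of unique rows in a 2D array."""
    return np.unique(matrix, axis=0).shape[0]

# Toy data: 6 rows, but only 2 distinct points.
matrix = np.array([
    [0.0, 1.0, 2.0],
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0],
])

n_distinct = count_distinct_rows(matrix)
n_clusters = min(100, n_distinct)   # KMeans cannot find more clusters than distinct points
print(n_distinct, n_clusters)
```

If the count is far below the number of rows, as in the question, the problem lies in how the matrix was built rather than in KMeans.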
I am attempting to calculate the MTF from a test target. I can calculate the spread function easily enough, but the FFT results do not quite make sense to me. To summarize, the values seem to alternate in sign, giving me a reflection of what I would expect. To test, I used a simple square pulse and numpy:
from numpy import fft
data = []
for x in range(0, 20):
    data.append(0)
data[9] = 10
data[10] = 10
data[11] = 10
dataFFT = fft.fft(data)
The results look correct, with the exception of the sign... I am seeing the following for the first 4 values as an example:
30.00000000 +0.00000000e+00j
-29.02113033 +7.10542736e-15j
26.18033989 -1.24344979e-14j
-21.75570505 +1.24344979e-14j
So my question is: why positive->negative->positive->negative in the real part? This is not what I would expect... If I plot it, it almost appears that the correct function is mirrored around the x axis.
Note: I was expecting the following as an example:
This is what I am getting:
Your pulse is symmetric and positioned in the center of your FFT window (around N/2). Symmetric real data corresponds to only the cosine or "real" components of an FFT result. Note that the cosine function alternates between being -1 and 1 at the center of the FFT window, depending on the frequency bin index (representing cosine periods per FFT width). So the correlation of these FFT basis functions with a positive going pulse will also alternate as long as the pulse is narrower than half the cosine period.
If you want the largest FFT coefficients to be mostly positive, try centering your narrow rectangular pulse around time 0 (or circularly, time N), where the cosine function is always 1 for any frequency.
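The effect can be checked numerically. The sketch below builds the pulse from the question, moves it so that it is centered on index 0, and compares the two transforms. By the DFT shift theorem, a circular shift of N/2 samples multiplies bin k by (-1)^k, which is exactly the alternating sign. Using numpy.fft.ifftshift to do the shift is a choice made for this example.

```python
import numpy as np

N = 20
centered = np.zeros(N)
centered[9:12] = 10                      # pulse centered on index N/2, as in the question

at_zero = np.fft.ifftshift(centered)     # circular shift: index N/2 moves to index 0

X_centered = np.fft.fft(centered)
X_zero = np.fft.fft(at_zero)

# Shifting by N/2 samples multiplies bin k by (-1)**k.
signs = (-1.0) ** np.arange(N)
print(np.allclose(X_centered, signs * X_zero))
print(np.round(X_zero.real[:4], 4))
```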
It works if you shift the data so that the pulse is centered around index 0 instead of the middle of your array, with:
dataFFT = fft.fft(fft.fftshift(data))
This isn't all that unexpected. If you want to check against conventional plots, make sure you convert that info to magnitude and phase before coming to any conclusions.
I did a quick check using your code, with numpy.abs for magnitude and numpy.angle for phase. It sure looks like a sinc() function to me, which is what would be expected if the time-domain signal is a square pulse. If you do this, you'll find a pretty wide sinc, as would be expected for a short-duration pulse on so few samples.
You forgot to specify whether your data is Real or Complex.
Not everyone codes in python/numpy (including me), and if you do not know this, then you are probably handling the data to/from the FFT the wrong way.
FFT input can be in either the Real or the Complex domain
FFT output is in the Complex domain
So check the docs for your FFT implementation, specify it, and repair your data handling accordingly. Complex-domain data usually has Re as the first value and Im as the second, but that depends on the FFT implementation/configuration.
here is an example of impulse response from FFT
first is input Real domain signal (Im=0) single finite nonzero width pulse and second is the Re part of FFT output. The third is the Im part of FFT output. If you zoom it a bit then you will see amplitude range of y axis of each signal (on left).
Do not forget that different FFT implementations can use different normalization constants, which changes the amplitude of the signal. If you want magnitude and phase, convert like this:
mag=sqrt(Re*Re+Im*Im); // magnitude
ang=atanxy(Re,Im); // phase angle
atanxy(dx,dy) is the 4-quadrant arctan, also called atan2, but be careful to pass the operands in the order your atanxy/atan2 implementation needs. You can also use my C++ atanxy implementation.
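In Python with NumPy the same conversion can be sketched as follows. Note the operand order: numpy.arctan2 takes the imaginary part first and the real part second. The sample values are made up.

```python
import numpy as np

spectrum = np.array([3 + 4j, -1 + 0j, 0 - 2j, 1 + 1j])  # made-up FFT output values

re, im = spectrum.real, spectrum.imag
mag = np.sqrt(re * re + im * im)   # magnitude
ang = np.arctan2(im, re)           # phase angle in radians, 4-quadrant arctan

# NumPy's built-ins give the same result.
print(np.allclose(mag, np.abs(spectrum)), np.allclose(ang, np.angle(spectrum)))
```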
[Notes]
If your input signal is in the Real domain, then the FFT output is conjugate-symmetric. The Re signal is mirrored around the middle, like:
{ a0,a1,a2,a3,...,a(n/2),...,a3,a2,a1 }
and the Im signal is mirrored the same way but with the opposite sign, exactly like on the image above. On the left are the low frequencies and in the middle is the top frequency. If your input signal is in the Complex domain, then the output can be anything.
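The symmetry for real input can be verified directly: bin N-k is the complex conjugate of bin k, so the real parts mirror each other and the imaginary parts mirror each other with the sign flipped. A minimal check with NumPy on random real data (the data and its length are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=16)            # arbitrary real-valued signal
X = np.fft.fft(x)

k = np.arange(1, len(x))           # bins 1 .. N-1 (bin 0 has no mirror partner)
mirror = X[len(x) - k]             # bin N-k for every k

print(np.allclose(mirror, np.conj(X[k])))        # conjugate symmetry
print(np.allclose(mirror.real, X[k].real))       # Re is mirrored
print(np.allclose(mirror.imag, -X[k].imag))      # Im is mirrored with opposite sign
```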
I'm trying to get the correct FFT bin index for a given frequency. The audio is sampled at 44.1 kHz and the FFT size is 1024. Given that the signal is real (captured from PyAudio, decoded through numpy.fromstring, windowed by scipy.signal.hann), I perform the FFT through scipy.fftpack.rfft and compute the result in decibels; as a whole, magnitude = 20 * scipy.log10(abs(rfft(audio_sample)))
Based on this, and this, I originally had my mapping from the FFT bin index, k, to any frequency, F, as:
F = k*Fs/N for k = 0 ... N/2-1 where Fs is the sampling rate, and N is the FFT bin size, in this case, 1024. And the reverse as:
k = F*N/Fs for F = 0Hz ... Fs/2-Fs/N
However, I then realized that rfft's result is not symmetric like fft's, and that it provides the result in an array of size N. I now have some questions regarding the mapping and the function. The documentation unfortunately did not help much, as I'm a novice in this area.
My questions:
1. To me, the result of rfft on an audio sample can be used directly from the first bin to the last bin, as no symmetry occurs in the output. Is that correct?
2. Given the lack of symmetry from the above, the frequency resolution appears to have increased. Is this interpretation correct?
3. Because of using rfft, my mapping function from bin index k to frequency F is now F = k*Fs/(2N) for k = 0 ... N-1. Is this correct?
4. Conversely, the reverse mapping function from frequency F to bin index k now becomes k = 2*F*N/Fs for F = 0Hz ... Fs/2-(Fs/2/N). Is this correct?
My general confusion arises from how rfft is related to fft, and how the mapping can be done correctly while using rfft. I believe my mapping is offset by a small amount, and that is crucial in my application. Please point out the mistake or advise on the matter if possible, thank you very much.
First to clear up a few things for you:
A quick reference to the fftpack documentation reveals that rfft only gives you an output vector from 0..512 (in your case). The reason for this is exactly because of the symmetry present when calculating the discrete Fourier transform of a real-valued input:
y[k] = y*[N-k] (see Wikipedia page on DFTs). Therefore, the rfft function only calculates and stores N/2+1 values since you can calculate the other half by just taking the complex conjugates (should you really want it for plotting (say)). The fft function makes no assumption on the input values (they can have both a real and imaginary part) and therefore no symmetry can be assumed in the output and it gives you a full output vector with N values. Admittedly, most applications use a real input, so people tend to assume the symmetry is always there. Note that the Fast Fourier Transform (FFT) is an (efficient) algorithm to calculate the Discrete Fourier Transform (DFT) and the rfft function also uses the FFT to do the calculation.
In light of the above, your indices for accessing the output vector are out of bounds, i.e. > 512. The reason why/how you are able to do this depends on your code. You should clearly distinguish between the 'logical N' (that you use to map the bin frequencies, define the DFT etc.) and the 'computational N' (the actual number of values in your output vector); then all your problems should disappear.
To concretely answer your questions:
1. No. There is symmetry and you need to use this to calculate the last bins (but they give you no extra information).
2. No. The only way to increase resolution of a DFT is to increase your sample length.
3. No, but almost. F = k*Fs/N for k = 0..N/2
4. For an output vector with N bins you get frequencies from 0 to (N-1)/N*Fs. Using the rfft you will have an output vector with N/2+1 bins. You do the maths, but I get 0..Fs/2
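The mapping above can be sketched in a few lines. The helper names are hypothetical. The check uses numpy.fft.rfft, which returns N/2+1 complex values, together with numpy.fft.rfftfreq. As far as I know, scipy.fftpack.rfft packs the same information into N real numbers in a different layout, so it is worth checking the documentation of whichever function you call before indexing its output.

```python
import numpy as np

def bin_to_freq(k, fs, n):
    """Center frequency in Hz of FFT bin k for sampling rate fs and FFT size n."""
    return k * fs / n

def freq_to_bin(f, fs, n):
    """Index of the FFT bin whose center is closest to frequency f (Hz)."""
    return int(round(f * n / fs))

fs, n = 44100, 1024
spectrum = np.fft.rfft(np.zeros(n))      # real-input FFT: bins 0 .. n/2
freqs = np.fft.rfftfreq(n, d=1.0 / fs)   # frequency of every rfft bin

print(len(spectrum))                     # number of bins
print(bin_to_freq(n // 2, fs, n))        # frequency of the last bin
print(freq_to_bin(1000, fs, n))          # bin closest to 1 kHz
```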
Hope things are clearer now.