I am currently doing a project at university where I am distinguishing between different instruments playing notes of the same pitch using Python.
I have recorded various notes on different instruments using a microphone attached to a computer.
I have also recorded background for the room.
So far I have plots for different notes on different instruments, with the amplitude in dB (20*log10(|FFT(signal)|)) on the y-axis and the DFT sample frequencies on the x-axis.
Some of the harmonic peaks are small enough (or the background is large enough) for noise to be a factor (I can't post images as I'm new here). My problem is calculating the level of uncertainty in the height of the peaks when accounting for background noise.
My question is: how do I calculate the level of uncertainty in the height of the peaks (their relative harmonic amplitudes) when accounting for background noise?
Some ideas:
What dB threshold should I use when classifying what is a harmonic peak and what is attributable to noise? Should I discount a peak lower than the maximum background (~28000 dB), or the mean (~15000), or perhaps twice one of these values?
Also, to take account of the noise introduced by background, is it legitimate to subtract the value in FFT bin n for the background from FFT bin n for my instrument recording?
I have also looked at this post: how can the noise be removed from a recorded sound, using fft in MATLAB? There seem to be very differing opinions there.
If it's relevant I can post segments of my code - I'm wary of putting too much up, though, in case of classmate plagiarism.
Links to literature that would help with the project would be very much appreciated. (Still at the stage where I'm plotting the data every which way I can think of to look for distinguishing attributes for each instrument).
Thanks in advance
You seem to be asking many questions. Let me start by answering your first one:
How do I calculate the level of uncertainty in the height of the peaks (their relative harmonic amplitudes) when accounting for background noise?
You would expect the sounds to sum linearly (to a first-order approximation). The natural thing to do would be to make some recordings of only the background and then measure the mean amplitude and standard deviation of the harmonics in the background.
As an example, say you are looking at 3 harmonics - 20 kHz, 11 kHz and 33 kHz. Do some recordings of only background and you find mean amplitudes of 1.3 dB, 2.2 dB and 2.3 dB with standard deviations of, say, ±0.1, ±0.2 and ±0.4 dB. You now have an uncertainty estimate and a mean background level to subtract from each harmonic.
There are smarter ways to do this but it's a start.
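As a minimal sketch of that measurement, assuming you have several background-only recordings of equal length collected in a list bg_recordings (the names here are mine):

import numpy as np

def background_stats(bg_recordings):
    # dB magnitude spectrum of each recording; the 1e-12 avoids log(0)
    specs = np.array([20 * np.log10(np.abs(np.fft.rfft(r)) + 1e-12)
                      for r in bg_recordings])
    # per-bin mean and standard deviation across the recordings
    return specs.mean(axis=0), specs.std(axis=0)

# bg_mean, bg_std = background_stats(bg_recordings)

bg_mean[k] ± bg_std[k] then gives the background estimate and its uncertainty for FFT bin k.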
Now then, to get on to your second question
What dB threshold should I use when classifying what is a harmonic peak and what is attributable to noise? Should I discount a peak lower than the maximum background (~28000 dB), or the mean (~15000), or perhaps twice one of these values?
If a peak exceeds the noise mean plus the uncertainty (one or two standard deviations - this is arbitrary really and depends on convention) you can say it's significant; if it falls within that range, it isn't. E.g. if you find that the noise level at 3 kHz is 1.2 dB with an uncertainty of ±0.3 dB, and you measure your harmonic to be 1.3 dB with an uncertainty (measured in the same way) of 0.1 dB, then it's not significant.
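In code, the comparison from that example looks like this (toy numbers; k sets how many standard deviations you demand):

# per-bin noise statistics from the background recordings (dB)
noise_mean, noise_std = 1.2, 0.3
# measured harmonic at the same bin, same units
peak, peak_std = 1.3, 0.1

k = 1  # 1 or 2 standard deviations, by convention
significant = (peak - k * peak_std) > (noise_mean + k * noise_std)
print(significant)  # False - 1.2 dB does not exceed 1.5 dB, so not significant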
Now for the third part:
Also, to take account of the noise introduced by background, is it legitimate to subtract the value in FFT bin n for the background from FFT bin n for my instrument recording?
Yes (generally speaking). If you really want to convince yourself of this you can either (a) do some simulations with summed waves and take an FFT of them, (b) do an experiment, the same as in (a) but with real recordings, or (c) go through the mathematics of Fourier transforms.
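A quick version of (a), as a sketch (the 440 Hz tone and the noise level are arbitrary). One caveat: linearity holds for the complex spectra; subtracting magnitudes (or dB values) bin by bin is only an approximation:

import numpy as np

t = np.arange(4096) / 44100.0
signal = np.sin(2 * np.pi * 440 * t)
background = 0.1 * np.random.randn(t.size)

# the FFT of the sum equals the sum of the FFTs, bin by bin
lhs = np.fft.rfft(signal + background)
rhs = np.fft.rfft(signal) + np.fft.rfft(background)
print(np.allclose(lhs, rhs))  # True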
With regard to literature, I think that would depend on what you're doing specifically. If you're a physics student, "Mathematical Methods in the Physical Sciences" by Mary Boas treats Fourier transforms well; if you're a computer scientist/engineer you probably want something different.
Let me know if you need more help.
As a musician [bassoon], a neural-network researcher, and someone who has used FFT to compress bird song, I have a few suggestions:
Musical instruments are defined by the bore - straight, conical, or a combination. This results in an emphasis of harmonic variations that will help distinguish them.
Double-reed, single-reed, flute, and brass instruments have different vibration patterns.
FFT can resolve sound into fundamental and harmonics.
Try training a neural network to recognize harmonic patterns [normalize if the fundamentals differ?] - use a separate input for each frequency bin AND a separate output for each instrument [or family? - it may be difficult to distinguish between saxes, for instance, while oboe, English horn, bassoon, and contra-bassoon may be distinguishable]. I like at least 3-layer neural nets and had excellent results doing OCR with 4 layers [2 internal].
This article https://www.bitweenie.com/listings/fft-zero-padding/ gives a simple relation between the time-length of the input data to the FFT and the minimum distance between two frequencies that can be distinguished in the FFT. The article calls this the waveform frequency resolution.
In other words: if two input frequencies are closer together than 1/time-length_of_input_data, they will show as only one peak in the FFT plot.
My question is: is there a way to increase this Waveform frequency resolution? I am finding it difficult to work with rather short data-series due to this limitation.
As an example, if I use a combination of sine series with periods 9.5, 10, and 11 over 240 datapoints I cannot distinguish between the different frequencies.
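For instance, a quick numpy version of that example (my periods from above; the three components end up only one or two bins apart, so they merge):

import numpy as np

n = 240
t = np.arange(n)
x = (np.sin(2 * np.pi * t / 9.5)
     + np.sin(2 * np.pi * t / 10.0)
     + np.sin(2 * np.pi * t / 11.0))

spec = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(n)  # in cycles/sample

# bin spacing is 1/240 ~ 0.0042 cycles/sample, but the components at
# 1/9.5 ~ 0.105, 1/10 = 0.100 and 1/11 ~ 0.091 are only 0.005-0.009 apart
print(freqs[np.argmax(spec)])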
To have good frequency resolution you need a long time series.
This is a fundamental issue, called the uncertainty principle. It cannot be overcome within Fourier analysis (Fourier transform, DFT, short-time Fourier transform and so on).
Also note that zero padding will not overcome this issue.
It gives more points in the frequency domain, in the sense that the same spectral information is sampled more densely, but it will not make peaks sharper or more separated.
The only way to overcome the uncertainty principle is to make further assumptions on the data.
If for example you know that there is only a single frequency component, it is possible to determine its frequency more accurately than the uncertainty principle predicts.
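For example, a common trick under that single-component assumption is parabolic interpolation of the log-magnitude around the strongest bin (a sketch; the Hann window and the test frequency are my choices):

import numpy as np

n, f_true = 240, 0.1037  # f_true in cycles/sample, deliberately off-bin
x = np.sin(2 * np.pi * f_true * np.arange(n)) * np.hanning(n)

mag = np.abs(np.fft.rfft(x))
k = np.argmax(mag)
a, b, c = np.log(mag[k - 1:k + 2])
offset = 0.5 * (a - c) / (a - 2 * b + c)  # parabola vertex, in bins
f_est = (k + offset) / n
print(f_true, f_est)  # agrees to well below one bin width (1/240)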
You can also use transforms such as the Wigner-Ville transform. It is not bound by the uncertainty principle, but it generates "cross-terms", i.e. frequency-component artifacts. However, when you only have a few frequency components this might be acceptable. It depends on the use-case.
I need to optimise a method for finding the number of data peaks in a 1D array. The data is a time-series of the amplitude of a wav file.
I have the code implemented already:
from scipy.io.wavfile import read
from scipy.signal import find_peaks

_, amplitudes = read('audio1.wav')              # discard the sample rate
indexes, _ = find_peaks(amplitudes, height=80)  # peaks with amplitude above 80
print(f'Number of peaks: {len(indexes)}')
When plotted, the data looks like this:
[Plot: general scale]
The 'peaks' that I am interested in are clear to the human eye - there are 23 in this particular dataset.
However, because the array is so large, the data is extremely variable within the peaks that are clear at the general scale (hence the many hundreds of peaks labelled with blue crosses):
[Plot: zoomed-in view of one peak]
Peak-finding questions have been asked many times before (I've been through a lot of them!) - but I can't find any help or explanation of optimising the parameters for finding only the peaks I want. I know a little about Python, but am blind when it comes to mathematical analysis!
Analysing by width seems useless because, as per the second image, the peaks clear at a large scale are actually interspersed with 'silent' ranges. Distance is not helpful because I do not know how close the peaks will be in other wav files. Prominence has been suggested as the best method but I could not get the results I needed; threshold likewise. I have also tried decimating the signal, smoothing the signal with a Savitzky-Golay filter, and different combinations of parameters and values, all with inaccurate results.
Height alone has been useful because I can see from the charts that peaks always reach above 80.
This is a common task in audio processing and there are several approaches which totally depend on your data.
However, there are implementations out there which are used for finding peaks in novelty functions (e.g., the output from a beat tracker). Try these ones:
https://madmom.readthedocs.io/en/latest/modules/features/onsets.html#madmom.features.onsets.peak_picking
https://librosa.github.io/librosa/generated/librosa.util.peak_pick.html#librosa.util.peak_pick
Basically they implement the same method but there might be differences in the details.
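For instance, with librosa (a sketch: the onset-strength step and all parameter values are arbitrary starting points and need tuning for your data):

import librosa

y, sr = librosa.load('audio1.wav')
onset_env = librosa.onset.onset_strength(y=y, sr=sr)  # novelty function

peaks = librosa.util.peak_pick(onset_env, pre_max=3, post_max=3,
                               pre_avg=3, post_avg=5, delta=0.5, wait=10)
print(f'Number of peaks: {len(peaks)}')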
Furthermore, you could check whether you really need to work at this high sampling frequency. Try downsampling the signal or using a moving-average filter.
Look at 0d persistent homology to find a good strategy, where the parameter you can optimize for is peak persistence. A nice blog post here explains the basics.
But in short the idea is to imagine your graph being filled by water, and then slowly draining the water. Every time a piece of the graph comes above water a new island is born. When two islands are next to each other they merge, which causes the younger island (with the lower peak) to die.
Then each data point has a birth time and a death time. The most significant peaks are those with the longest persistence, which is death - birth.
If the water level drops at a continuous rate, persistence is defined in terms of peak height. Another possibility is to drop the water instantaneously from point to point as time goes from step t to step t+1, in which case persistence is defined as peak width in terms of signal samples.
For you it seems that using the original definition in terms of peak height > 70 finds all the peaks you are interested in, albeit possibly too many, clustered together. You can limit this by choosing the first peak in each cluster, or the highest peak in each cluster, or by doing both approaches and only choosing peaks that have both great height persistence and great width persistence.
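A minimal self-contained sketch of that water-level procedure (union-find over samples processed from highest to lowest; the function and variable names are mine, following the blog post's idea):

import numpy as np

def peak_persistence(y):
    # returns (peak_index, persistence) pairs, most persistent first;
    # the global maximum never dies, so it gets y.max() - y.min()
    y = np.asarray(y, dtype=float)
    order = np.argsort(-y, kind='stable')  # highest samples first
    parent = {}  # union-find forest over sample indices
    birth = {}   # component root -> index of its peak
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        birth[i] = i
        for j in (i - 1, i + 1):  # neighbours already above water
            if j not in parent:
                continue
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # two islands merge: the one with the lower peak dies here
            high, low = (rj, ri) if y[birth[rj]] >= y[birth[ri]] else (ri, rj)
            persistence = y[birth[low]] - y[i]
            if persistence > 0:
                pairs.append((birth[low], persistence))
            parent[low] = high

    root = find(order[0])  # the surviving island is the global maximum
    pairs.append((birth[root], y[birth[root]] - y.min()))
    return sorted(pairs, key=lambda p: -p[1])

With your data, peak_persistence(amplitudes)[:23] would then give the 23 most persistent peaks.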
Does aubio have a way to detect sections of a piece of audio that lack tonal elements -- rhythm only? I tested a piece of music that has 16 seconds of rhythm at the start, but all the aubiopitch and aubionotes algorithms seemed to detect tonality during the rhythmic section. Could it be tuned somehow to distinguish tonal from non-tonal onsets? Or is there a related library that can do this?
Been busy the past couple of days - but started looking into this today...
It'll take a while to perfect I guess but I thought I'd give you a few thoughts and some code I've started working on to attack this!
Firstly, pseudo code's a good way to design an initial method.
1/ use import matplotlib.pyplot as plt to spectrum analyse the audio, and plot various fft and audio signals.
2/ import numpy as np for basic array-like structure handling.
(I know this is more than pseudo code, but hey :-)
3/ plt.specgram creates spectral maps of your audio. Apart from the image it creates (which can be used to start to manually deconstruct your audio file), it returns 4 structures.
eg
ffts,freqs,times,img = plt.specgram(signal,Fs=44100)
ffts is a 2-dimensional array whose columns are the FFTs (Fast Fourier Transforms) of the successive time sections.
The plain vanilla specgram analyses time sections of 256 samples long, stepping 128 samples forwards each time.
This gives a very low resolution frequency array at a pretty fast rate.
As musical notes merge into a single sound when played at more or less 10 per second, I decided to use the specgram options to divide the audio into 4096-sample lengths (circa 10 per second at 44100 Hz), stepping forwards every 2048 samples (ie 20 times a second).
This gives a decent frequency resolution, and the time sections, being a 20th of a second apart, are faster than people can perceive individual notes.
This means calling the specgram as follows:
plt.specgram(signal,Fs=44100,NFFT=4096,noverlap=2048,mode='magnitude')
(Note the mode - this seems to give me amplitudes of between 0 and 0.1. I have a problem with fft not giving me amplitudes on the same scale as the audio signal - you may have seen the question I posted. But here we are...)
4/ Next I decided to get rid of noise in the ffts returned. This means we can concentrate on freqs of a decent amplitude, and zero out the noise which is always present in ffts (in my experience).
Here is (are) my function(s):
def gate(signal, minAmplitude):
    # zero any amplitude at or below minAmplitude, keep the rest
    return np.array([int((((a - minAmplitude) + abs(a - minAmplitude)) / 2) > 0) * a for a in signal])
Looks a bit crazy - and I'm sure a proper mathematician could come up with something more efficient - but this is the best I could invent. It zeros any frequencies with amplitude less than minAmplitude.
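For reference, a vectorized equivalent (same behaviour - anything at or below minAmplitude goes to zero):

def gate(signal, minAmplitude):
    signal = np.asarray(signal)
    return np.where(signal > minAmplitude, signal, 0)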
This is the relevant code to call it from the ffts returned by plt.specgram. My actual function is more involved, as it is part of a class and references other functions, but this should be enough:
def fft_noise_gate(ffts, minAmplitude=0.001, check=True):
    '''
    zero the amplitudes of frequencies
    with amplitudes below minAmplitude
    across all the ffts
    check - plot the middle fft just because!
    '''
    nffts = ffts.shape[1]
    gated_ffts = []
    for f in range(nffts):
        fft = ffts[..., f]
        # Anyone got a more efficient noise gate formula? Best I could think up!
        fft_gated = gate(fft, minAmplitude)
        gated_ffts.append(fft_gated)
    gated = np.array(gated_ffts)
    if check:
        # plot the middle fft just to see!
        plt.plot(gated[int(nffts / 2)])
        plt.show(block=False)
    return gated
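Hypothetical usage, with the ffts array returned by plt.specgram in step 3 (names as above):

ffts, freqs, times, img = plt.specgram(signal, Fs=44100, NFFT=4096,
                                       noverlap=2048, mode='magnitude')
gated_ffts = fft_noise_gate(ffts, minAmplitude=0.001, check=True)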
This should give you a start. I'm still working on it and will get back to you when I've got further - but if you have any ideas, please share them.
Anyway, my strategy from here is to:
1/ find the peaks (ie the start of any sounds), then
2/ look for ranges of frequencies which rise and fall in unison (ie make up a sound), and
3/ differentiate them into individual instruments (sound sources, more specifically), and plot the times and amplitudes thereof to create your analysis (score).
Hope you're having fun with it - I know I am.
As I said any thoughts...
Regards
Tony
Use a spectrum analyser to detect sections with high amplitude. If you program - you could take each section and make an average of the frequencies (and amplitudes) present to give you an idea of the instrument(s) involved in creating that amplitude peak.
Hope that helps - if you're using Python I could give you some pointers on how to program this!?
Regards
Tony
I'm very new to signal processing. I have two sound signals right now. Each was collected at a sample rate of 10 kHz over 2 seconds. I have imported this data into Python; both sound_1 and sound_2 are numpy arrays, each of length 20000 of course.
Sound_1 contains a water-flow sound (which I'm interested in) and environmental noise (which I'm not), while sound_2 contains only environmental noise.
I'm looking for an algorithm (or package) which can help me determine the frequency range of this water-flow sound. I think if I can find the frequency range, I can use an inverse Fourier transform to filter out the environmental noise.
However, my ultimate purpose is to extract the water-flow sound from the sound_1 data and eliminate the environmental noise. It would be great if there are other approaches.
I'm currently looking at this post: Python frequency detection
But I don't understand how they can find the frequency from only one sound signal. I think we need to compare at least two signals (one containing the sound I'm interested in, the other without it), so we can find the difference.
Since sound_1 contains both water flow and environmental noise, there's no straightforward way of extracting the water flow. The Fourier transform will get you all frequencies in the signal, irrespective of the source.
The way to approach it is to get the frequencies of the environmental noise from sound_2 and then remove them from sound_1. After that is done, you can extract the frequencies from the already denoised sound_1.
One popular approach to such noise reduction is spectral gating. Essentially, you first determine what the noise sounds like and then remove a smoothed version of its spectrum from your signal. Smoothing is crucial, as sound is a wave, a continuous entity: if you simply chop discrete frequencies out of the wave, you will get very poor results (the audio will sound unnatural and robotic). The amount of smoothing you apply determines how much noise is reduced (mind that it's never truly removed - you will always get some residue).
To the concrete solution.
As you're new to the subject, I'd recommend first seeing how noise reduction works in software that does the work for you. Audacity is an excellent choice. I linked the manual for noise reduction, but there are plenty of tutorials out there.
After you know what you want to get, you can either implement spectral gating yourself or use an existing package. Audacity has an excellent implementation in C++, but it may prove difficult for a newbie to port. I'd recommend going first with the noisereduce package, which is based on the Audacity implementation. If you use it, you will be done in a few lines.
Here's a snippet:
import noisereduce as nr
from scipy.io import wavfile

# load data
rate, data = wavfile.read("sound_1.wav")
# load the noise-only recording (wavfile.read returns a (rate, data) tuple)
_, noisy_part = wavfile.read("sound_2.wav")
# perform noise reduction
reduced_noise = nr.reduce_noise(audio_clip=data, noise_clip=noisy_part, verbose=True)
Now simply run FFT on the reduced_noise to discover the frequencies of water flow.
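A sketch of that last step, reusing the names from the snippet above (picking the ten strongest bins is an arbitrary choice):

import numpy as np

spectrum = np.abs(np.fft.rfft(reduced_noise))
freqs = np.fft.rfftfreq(len(reduced_noise), d=1.0 / rate)  # in Hz
strongest = freqs[np.argsort(spectrum)[-10:]]  # ten strongest frequencies
print(strongest)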
Here's how I am using noisereduce. In this part I am determining the frequency statistics.
Does anyone know if it is possible to find the power spectral density of a signal with gaps in it? For example (in MATLAB syntax, because that is what I'm familiar with):
ta = 1:1000;
tb = 1200:3000;
t = [ta tb];             % this is the timebase
signal = randn(size(t)); % this is the signal
figure(101)
plot(t, signal, '.')
I'd like to be able to determine frequencies on a longer time base than just the individual sections of data. Obviously I could just take the PSD of the individual sections, but that would limit the lowest frequency. I could interpolate the data, but this would colour the PSD.
Any thoughts would be much appreciated.
The Lomb-Scargle periodogram algorithm is usually used to analyse unevenly spaced data (sampled at arbitrary time points) or data where a proportion of the samples is missing.
Here are a few MATLAB implementations:
lombscargle.m (FEX)
Lomb (Lomb-Scargle) Periodogram (FEX)
lomb.m - ECG tools by Gari Clifford
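For what it's worth, SciPy also ships an implementation; a Python sketch on the question's gapped timebase (the frequency grid is an arbitrary choice):

import numpy as np
from scipy.signal import lombscargle

t = np.concatenate([np.arange(1, 1001), np.arange(1200, 3001)]).astype(float)
signal = np.random.randn(t.size)

w = np.linspace(0.001, np.pi, 2000)  # angular frequencies to evaluate
pgram = lombscargle(t, signal, w)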
I found this Non Uniform FFT, but I'm not sure that it's exactly what I need, as it might really be for data that is mostly sampled on an uneven time base, rather than evenly spaced data with significant gaps. I'll give it a go!
Computing the Fourier transform with only the sampled points (i.e. leaving out the segments of the Fourier basis vectors that fall in the gaps) gives exactly the same FT, and thus PSD, as using the complete basis on a full-length signal with zeros filling the gaps.
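A quick numerical check of that equivalence (the frequency bin is an arbitrary choice):

import numpy as np

t = np.concatenate([np.arange(0, 1000), np.arange(1200, 3000)])
x = np.random.randn(t.size)

full = np.zeros(3000)
full[t] = x  # zeros fill the gap

k = 7  # an arbitrary frequency bin
direct = np.sum(x * np.exp(-2j * np.pi * k * t / 3000))
print(np.allclose(direct, np.fft.fft(full)[k]))  # True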