I am analyzing the spectrogram's of .wav files. But after getting the code to finally work, I've run into a small issue. After saving the spectrograms of 700+ .wav files I realize that they all essentially look the same!!! This is not because they are the same audio file, but because I don't know how to change the scale of the plot to be smaller(so I can make out the differences).
I've already tried to fix this issue by looking at this StackOverflow post
Changing plot scale by a factor in matplotlib
I'll show the graph of two different .wav files below
This is .wav #1
This is .wav #2
Believe it or not, these are two different .wav files, but they look super similar. And a computer especially won't be able to pick up the differences in these two .wav files if the scale is this broad.
My code is below
def individualWavToSpectrogram(myAudio, fileNameToSaveTo):
print(myAudio)
#Read file and get sampling freq [ usually 44100 Hz ] and sound object
samplingFreq, mySound = wavfile.read(myAudio)
#Check if wave file is 16bit or 32 bit. 24bit is not supported
mySoundDataType = mySound.dtype
#We can convert our sound array to floating point values ranging from -1 to 1 as follows
mySound = mySound / (2.**15)
#Check sample points and sound channel for duel channel(5060, 2) or (5060, ) for mono channel
mySoundShape = mySound.shape
samplePoints = float(mySound.shape[0])
#Get duration of sound file
signalDuration = mySound.shape[0] / samplingFreq
#If two channels, then select only one channel
#mySoundOneChannel = mySound[:,0]
#if one channel then index like a 1d array, if 2 channel index into 2 dimensional array
if len(mySound.shape) > 1:
mySoundOneChannel = mySound[:,0]
else:
mySoundOneChannel = mySound
#Plotting the tone
# We can represent sound by plotting the pressure values against time axis.
#Create an array of sample point in one dimension
timeArray = numpy.arange(0, samplePoints, 1)
#
timeArray = timeArray / samplingFreq
#Scale to milliSeconds
timeArray = timeArray * 1000
plt.rcParams['agg.path.chunksize'] = 100000
#Plot the tone
plt.plot(timeArray, mySoundOneChannel, color='Black')
#plt.xlabel('Time (ms)')
#plt.ylabel('Amplitude')
print("trying to save")
plt.savefig('/Users/BillyBobJoe/Desktop/' + fileNameToSaveTo + '.jpg')
print("saved")
#plt.show()
#plt.close()
How can I modify this code to increase the sensitivity of the graphing so that the differences between two .wav files is made more distinct?
Thanks!
[UPDATE]
I have tried using
plt.xlim((0, 16000))
But this just adds whitespace to the right of the graph
like
I need a way to change the scale of each unit. so that the graph is filled out when I change the x axis from 0 - 16000
If the question is: how to limit the scale on the xaxis, say to between 0 and 1000, you can do as follows:
plt.xlim((0, 1000))
Related
I am trying to plot an audio sample's amplitude across the time domain. I've used scipy.io.wavfile to read the audio sample and determine sampling rate:
# read .wav file
data = read("/Users/user/Desktop/voice.wav")
# determine sample rate of .wav file
# print(data[0]) # 48000 samples per second
# 48000 (samples per second) * 4 (4 second sample) = 192000 samples overall
# store the data read from .wav file
audio = data[1]
# plot the data
plt.plot(audio[0 : 192000]) # see code above for how this value was determined
This creates a plot displaying amplitude on y axis and sample number on the x axis. How can I transform the x axis to instead show seconds?
I tried using plt.xticks but I don't think this is the correct use case based upon the error I received:
# label title, axis, show the plot
seconds = range(0,4)
plt.xticks(range(0,192000), seconds)
plt.ylabel("Amplitude")
plt.xlabel("Time")
plt.show()
ValueError: The number of FixedLocator locations (192000), usually from a call to set_ticks, does not match the number of ticklabels (4).
You need to pass a t vector to the plotting command, a vector that you can generate on the fly using Numpy, so that after the command execution it is garbage collected (sooner or later, that is)
from numpy import linspace
plt.plot(linspace(0, 4, 192000, endpoint=False), audio[0 : 192000])
I'm trying to plot the Amplitude (dBFS) vs. Time (s) plot of an audio (.wav) file using matplotlib. I managed to do that with the following code:
def convert_to_decibel(sample):
ref = 32768 # Using a signed 16-bit PCM format wav file. So, 2^16 is the max. value.
if sample!=0:
return 20 * np.log10(abs(sample) / ref)
else:
return 20 * np.log10(0.000001)
from scipy.io.wavfile import read as readWav
from scipy.fftpack import fft
import matplotlib.pyplot as gplot1
import matplotlib.pyplot as gplot2
import numpy as np
import struct
import gc
wavfile1 = '/home/user01/audio/speech.wav'
wavsamplerate1, wavdata1 = readWav(wavfile1)
wavdlen1 = wavdata1.size
wavdtype1 = wavdata1.dtype
gplot1.rcParams['figure.figsize'] = [15, 5]
pltaxis1 = gplot1.gca()
gplot1.axhline(y=0, c="black")
gplot1.xticks(np.arange(0, 10, 0.5))
gplot1.yticks(np.arange(-200, 200, 5))
gplot1.grid(linestyle = '--')
wavdata3 = np.array([convert_to_decibel(i) for i in wavdata1], dtype=np.int16)
yvals3 = wavdata3
t3 = wavdata3.size / wavsamplerate1
xvals3 = np.linspace(0, t3, wavdata3.size)
pltaxis1.set_xlim([0, t3 + 2])
pltaxis1.set_title('Amplitude (dBFS) vs Time(s)')
pltaxis1.plot(xvals3, yvals3, '-')
which gives the following output:
I had also plotted the Power Spectral Density (PSD, in dBm) using the code below:
from scipy.signal import welch as psd # Computes PSD using Welch's method.
fpsd, wPSD = psd(wavdata1, wavsamplerate1, nperseg=1024)
gplot2.rcParams['figure.figsize'] = [15, 5]
pltpsdm = gplot2.gca()
gplot2.axhline(y=0, c="black")
pltpsdm.plot(fpsd, 20*np.log10(wPSD))
gplot2.xticks(np.arange(0, 4000, 400))
gplot2.yticks(np.arange(-150, 160, 10))
pltpsdm.set_xlim([0, 4000])
pltpsdm.set_ylim([-150, 150])
gplot2.grid(linestyle = '--')
which gives the output as:
The second output above, using the Welch's method plots a more presentable output. The dBFS plot though informative is not very presentable IMO. Is this because of:
the difference in the domains (time in case of 1st output vs frequency in the 2nd output)?
the way plot function is implemented in pyplot?
Also, is there a way I can plot my dBFS output as a peak-to-peak style of plot just like in my PSD (dBm) plot rather than a dense stem plot?
Would be much helpful and would appreciate any pointers, answers or suggestions from experts here as I'm just a beginner with matplotlib and plots in python in general.
TLNR
This has nothing to do with pyplot.
The frequency domain is different from the time domain, but that's not why you didn't get what you want.
The calculation of dbFS in your code is wrong.
You should frame your data, calculate RMSs or peaks in every frame, and then convert that value to dbFS instead of applying this transformation to every sample point.
When we talk about the amplitude, we are talking about a periodic signal. And when we read in a series of data from a sound file, we read in a series of sample points of a signal(may be or be not periodic). The value of every sample point represents a, say, voltage value, or sound pressure value sampled at a specific time.
We assume that, within a very short time interval, maybe 10ms for example, the signal is stationary. Every such interval is called a frame.
Some specific function is applied to each frame usually, to reduce the sudden change at the edge of this frame, and these functions are called window functions. If you did nothing to every frame, you added rectangle windows to them.
An example: when the sampling frequency of your sound is 44100Hz, in a 10ms-long frame, there are 44100*0.01=441 sample points. That's what the nperseg argument means in your psd function but it has nothing to do with dbFS.
Given the knowledge above, now we can talk about the amplitude.
There are two methods a get the value of amplitude in every frame:
The most straightforward one is to get the maximum(peak) values in every frame.
Another one is to calculate the RMS(Root Mean Sqaure) of every frame.
After that, the peak values or RMS values can be converted to dbFS values.
Let's start coding:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
# Determine full scall(maximum possible amplitude) by bit depth
bit_depth = 16
full_scale = 2 ** bit_depth
# dbFS function
to_dbFS = lambda x: 20 * np.log10(x / full_scale)
# Read in the wave file
fname = "01.wav"
fs,data = wavfile.read(fname)
# Determine frame length(number of sample points in a frame) and total frame numbers by window length(how long is a frame in seconds)
window_length = 0.01
signal_length = data.shape[0]
frame_length = int(window_length * fs)
nframes = signal_length // frame_length
# Get frames by broadcast. No overlaps are used.
idx = frame_length * np.arange(nframes)[:,None] + np.arange(frame_length)
frames = data[idx].astype("int64") # Convert to in 64 to avoid integer overflow
# Get RMS and peaks
rms = ((frames**2).sum(axis=1)/frame_length)**.5
peaks = np.abs(frames).max(axis=1)
# Convert them to dbfs
dbfs_rms = to_dbFS(rms)
dbfs_peak = to_dbFS(peaks)
# Let's start to plot
# Get time arrays of every sample point and ever frame
frame_time = np.arange(nframes) * window_length
data_time = np.linspace(0,signal_length/fs,signal_length)
# Plot
f,ax = plt.subplots()
ax.plot(data_time,data,color="k",alpha=.3)
# Plot the dbfs values on a twin x Axes since the y limits are not comparable between data values and dbfs
tax = ax.twinx()
tax.plot(frame_time,dbfs_rms,label="RMS")
tax.plot(frame_time,dbfs_peak,label="Peak")
tax.legend()
f.tight_layout()
# Save serval details
f.savefig("whole.png",dpi=300)
ax.set_xlim(1,2)
f.savefig("1-2sec.png",dpi=300)
ax.set_xlim(1.295,1.325)
f.savefig("1.2-1.3sec.png",dpi=300)
The whole time span looks like(the unit of the right axis is dbFS):
And the voiced part looks like:
You can see that the dbFS values become greater while the amplitudes become greater at the vowel start point:
I'm a python beginner and as a learning project I'm doing a SSTV encoder using Wraase SC2-120 methods.
SSTV for those who don't know is a technique sending images through radio as sound and to be decoded in the receiving end back to an image. Wraase SC2-120 is one of many types of encoding but it's one of the more simpler ones that support color.
I've been able to create a system that takes an image and converts it to an array. Then take that array and create the needed values for the luminance and chrominance needed for the encoder.
I'm using then this block to create a value between 1500hz - 2300hz for the method.
def ChrominanceAsHertz(value=0.0):
value = 800 * value
value -= value % 128 # Test. Results were promising but too much noise
value += 1500
return int(value)
You can ignore the modulus operation. It is just my way of playing around with the data for "fun" and experiments.
I then clean the audio to avoid having too many of the same values in the same array and add their duration together to achieve a cleaner sound
cleanTone = []
cleanDuration = []
for i in range(len(hertzData)-1):
# If the next tone is not the same
# Add it to the cleantone array
# with it's initial duration
if hertzData[i] != hertzData[i+1]:
cleanTone.append(hertzData[i])
cleanDuration.append(durationData[i])
# else add the duration of the current hertz to the clean duration array
else:
# the current duration is the last inserted duration
currentDur = cleanDuration[len(cleanDuration)-1]
# Add the new duration to the current duration
currentDur += durationData[i]
cleanDuration[len(cleanDuration)-1] = currentDur
My array handling can use some work but it's not why I'm here for now.
The result is a array where no consecutive values are the same and the duration of that tone is still correct.
I then create a sinewave array using this block
audio = []
for i in range(len(cleanTone)):
sineAudio = AudioGen.SineWave(cleanTone[i], cleanDuration[i])
for sine in sineAudio:
audio.append(sine)
The sinewave function is
def SineWave( freq=440, durationMS = 500, sample_rate = 44100.0 ):
num_samples = durationMS * (sample_rate / 1000)
audio = []
for i in range(int(num_samples)):
audio.insert(i, np.sin(2 * np.pi * freq * (i / sample_rate)))
return audio
It works as intended. It creates a sinewave of the frequency I want and for the duration I want.
The problem is that when I create the .wav file then with wave the sinewaves created are not smoothly transitioning.
Screenshot of a closeup of what I mean.
Sinusoidal wave artifacts
The audio file has these immense screeches and cracks because of these artifacts that the above method produces, seeing as how it takes a single frequency and duration with no regard of where the last tone ended and starts a new.
What I've tried to do to remedy these is to refactor that SineWave method to take in a whole array and create the sinewaves consecutively right after one another in hopes of achieving a clean sound but it still did the same thing.
I also tried "smoothing" the generated audio array then with a simple filtering operation from this post.
0.7 * audio[1:-1] + 0.15 * ( audio[2:] + audio[:-2] )
but the results again were not satisfying and the artifacts were still present.
I've also started to look into Fourier Transforms, mainly FFT (fast fourier transform) but I'm not that familiar with them yet to know exactly what it is that I'm trying to do and code.
For SSTV to work the changes in frequency have to sometimes be very fast. 0.3ms fast to be exact, so I'm kinda lost on how to achieve this without loosing too much data in the process.
TL;DR
My sinewave function is producing artifacts inbetween tone changes that cause scratches and unwanted pops. How to not do that?
What you need to transfer from one wave-snippet to the next is the phase. You have to start the next wave with the phase you ended the previous phase.
def SineWave( freq=440, durationMS = 500, phase = 0, sample_rate = 44100.0):
num_samples = int(durationMS * (sample_rate / 1000))
audio = []
for i in range(num_samples):
audio.insert(i, np.sin(2 * np.pi * freq * (i / sample_rate) + phase))
phase = (phase + 2 * np.pi * freq * (num_samples / sample_rate)) % (2 * np.pi)
return audio, phase
In your main loop pass through the phase from one wave fragment to the next:
audio = []
phase = 0
for i in range(len(cleanTone)):
sineAudio, phase = AudioGen.SineWave(cleanTone[i], cleanDuration[i], phase)
for sine in sineAudio:
audio.append(sine)
I'm trying to make a wavetable synthesizer in Python for the first time (based off an example I found here https://blamsoft.com/tutorials/expanse-creating-wavetables/) but the resultant sound I'm getting doesn't sound tonal at all. My output is just a low grainy buzz. I'm pretty new to making wavetables in Python and I was wondering if anybody might be able to tell me what I'm missing in order to write an A440 sine wavetable to the file "wavetable.wav" and have it actually produce a pure sine tone? Here's what I have at the moment:
import wave
import struct
import numpy as np
frame_count = 256
frame_size = 2048
sps = 44100
freq_hz = 440
file = "wavetable.wav" #write waveform to file
wav_file = wave.open(file, 'w')
wav_file.setparams((1, 2, sps, frame_count, 'NONE', 'not compressed'))
values = bytes(0)
for i in range(frame_count):
for ii in range(frame_size):
sample = np.sin((float(ii)/frame_size) * (i+128)/256 * 2 * np.pi * freq_hz/sps) * 65535
if sample < 0:
sample = 0
sample -= 32768
sample = int(sample)
values += struct.pack('h', sample)
wav_file.writeframes(values)
wav_file.close()
print("Generated " + file)
The sine function I have inside the for loop is probably the part I understand the least because I just went by the example verbatim. I'm used to making sine functions like (y = Asin(2πfx)) but I'm not sure what the purpose is of multiplying by ((i+128)/256) and 65535 (16-bit amplitude resolution?). I'm also not sure what the purpose is of subtracting 32768 from each sample. Is anyone able to clarify what I'm missing and maybe point me in the right direction? Am I going about this the wrong way? Any help is appreciated!
If you just wanted to generate sound data ahead of time and then dump it all into a file, and you’re also comfortable using NumPy, I’d suggest using it with a library like SoundFile. Then there’s no need to delimit the data into frames.
Starting with a naïve approach (using numpy.sin, not trying to optimize things yet), one ends with something like this:
from math import tau
import numpy as np
import soundfile as sf
file_path = 'sine.flac'
sample_rate = 48_000 # hertz
duration = 1.0 # seconds
frequency = 432.0 # hertz
amplitude = 0.8 # (not in decibels!)
start_phase = 0.0 # at what phase to start
sample_count = floor(sample_rate * duration)
# cyclical frequency in sample^-1
omega = frequency * tau / sample_rate
# all phases for which we want to sample our sine
phases = np.linspace(start_phase, start_phase + omega * sample_count,
sample_count, endpoint=False)
# our sine wave samples, generated all at once
audio = amplitude * np.sin(phases)
# now write to file
fmt, sub = 'FLAC', 'PCM_24'
assert sf.check_format(fmt, sub) # to make sure we ask the correct thing beforehand
sf.write(file_path, audio, sample_rate, format=fmt, subtype=sub)
This will be a mono sound, you can write stereo using 2d arrays (see NumPy and SoundFile’s docs).
But note that to make a wavetable specifically, you need to be sure it contains just a single period (or an integer number of periods) of the wave exactly, so the playback of the wavetable will be without clicks and have a correct frequency.
You can play chunked sound in real time in Python too, using something like PyAudio. (I’ve not yet used that, so at least for a time this answer would lack code related to that.)
Finally, frankly, all above is unrelated to the generation of sound data from a wavetable: you just pick a wavetable from somewhere, that doesn’t do much for actual synthesis. Here is a simple starting algorithm for that. Assume you want to play back a chunk of sample_count samples and have a wavetable stored in wavetable, a single period which loops perfectly and is normalized. And assume your current wave phase is start_phase, frequency is frequency, sample rate is sample_rate, amplitude is amplitude. Then:
# indices for the wavetable values; this is just for `np.interp` to work
wavetable_period = float(len(wavetable))
wavetable_indices = np.linspace(0, wavetable_period,
len(wavetable), endpoint=False)
# frequency of the wavetable played at native resolution
wavetable_freq = sample_rate / wavetable_period
# start index into the wavetable
start_index = start_phase * wavetable_period / tau
# code above you run just once at initialization of this wavetable ↑
# code below is run for each audio chunk ↓
# samples of wavetable per output sample
shift = frequency / wavetable_freq
# fractional indices into the wavetable
indices = np.linspace(start_index, start_index + shift * sample_count,
sample_count, endpoint=False)
# linearly interpolated wavetavle sampled at our frequency
audio = np.interp(indices, wavetable_indices, wavetable,
period=wavetable_period)
audio *= amplitude
# at last, update `start_index` for the next chunk
start_index += shift * sample_count
Then you output the audio. Though there are better ways to play back a wavetable, linear interpolation is at least a fine start. Frequency slides are also possible with this approach: just compute indices in another way, no longer spaced uniformly.
I am using the Librosa library for pitch and onset detection. Specifically, I am using onset_detect and piptrack.
This is my code:
def detect_pitch(y, sr, onset_offset=5, fmin=75, fmax=1400):
y = highpass_filter(y, sr)
onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=fmin, fmax=fmax)
notes = []
for i in range(0, len(onset_frames)):
onset = onset_frames[i] + onset_offset
index = magnitudes[:, onset].argmax()
pitch = pitches[index, onset]
if (pitch != 0):
notes.append(librosa.hz_to_note(pitch))
return notes
def highpass_filter(y, sr):
filter_stop_freq = 70 # Hz
filter_pass_freq = 100 # Hz
filter_order = 1001
# High-pass filter
nyquist_rate = sr / 2.
desired = (0, 0, 1, 1)
bands = (0, filter_stop_freq, filter_pass_freq, nyquist_rate)
filter_coefs = signal.firls(filter_order, bands, desired, nyq=nyquist_rate)
# Apply high-pass filter
filtered_audio = signal.filtfilt(filter_coefs, [1], y)
return filtered_audio
When running this on guitar audio samples recorded in a studio, therefore samples without noise (like this), I get very good results in both functions. The onset times are correct and the frequencies are almost always correct (with some octave errors sometimes).
However, a big problem arises when I try to record my own guitar sounds with my cheap microphone. I get audio files with noise, such as this. The onset_detect algorithm gets confused and thinks that noise contains onset times. Therefore, I get very bad results. I get many onset times even if my audio file consists of one note.
Here are two waveforms. The first is of a guitar sample of a B3 note recorded in a studio, whereas the second is my recording of an E2 note.
The result of the first is correctly B3 (the one onset time was detected).
The result of the second is an array of 7 elements, which means that 7 onset times were detected, instead of 1! One of those elements is the correct onset time, other elements are just random peaks in the noise part.
Another example is this audio file containing the notes B3, C4, D4, E4:
As you can see, the noise is clear and my high-pass filter has not helped (this is the waveform after applying the filter).
I assume this is a matter of noise, as the difference between those files lies there. If yes, what could I do to reduce it? I have tried using a high-pass filter but there is no change.
I have three observations to share.
First, after a bit of playing around, I've concluded that the onset detection algorithm appears as if it's probably probably been designed to automatically rescale its own operation in order to take into account local background noise at any given instant. This is likely in order so that it can detect onset times in pianissimo sections with equal likelihood as it would in fortissimo sections. This has the unfortunate result that the algorithm tends to trigger on background noise coming from your cheap microphone--the onset detection algorithm honestly thinks it's simply listening to pianissimo music.
A second observation is that roughly the first ~2200 samples in your recorded example (roughly the first 0.1 seconds) are a bit wonky, in the sense that the noise truly is nearly zero during that short initial interval. Try zooming way into the waveform at the starting point and you'll see what I mean. Unfortunately, the start of the guitar playing follows so quickly after the noise onset (roughly around sample 3000) that the algorithm is unable to resolve the two independently--instead it simply merges the two into a single onset event that begins about 0.1 seconds too early. I therefore cut out roughly the first 2240 samples in order to "normalize" the file (I don't think this is cheating though; it's an edge effect that would likely disappear if you had simply recorded a second or so of initial silence prior to plucking the first string, as one would normally do).
My third observation is that frequency-based filtering only works if the noise and the music are actually in somewhat different frequency bands. That may be true in this case, however I don't think you've demonstrated it yet. Therefore, instead of frequency-based filtering, I elected to try a different approach: thresholding. I used the final 3 seconds of your recording, where there is no guitar playing, in order to estimate the typical background noise level throughout the recording, in units of RMS energy, and then I used that median value to set a minimum energy threshold which was calculated to lie safely above the median. Only onset events returned by the detector occurring at times when the RMS energy is above the threshold are accepted as "valid".
An example script is shown below:
import librosa
import numpy as np
import matplotlib.pyplot as plt
# I played around with this but ultimately kept the default value
hoplen=512
y, sr = librosa.core.load("./Vocaroo_s07Dx8dWGAR0.mp3")
# Note that the first ~2240 samples (0.1 seconds) are anomalously low noise,
# so cut out this section from processing
start = 2240
y = y[start:]
idx = np.arange(len(y))
# Calcualte the onset frames in the usual way
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hoplen)
onstm = librosa.frames_to_time(onset_frames, sr=sr, hop_length=hoplen)
# Calculate RMS energy per frame. I shortened the frame length from the
# default value in order to avoid ending up with too much smoothing
rmse = librosa.feature.rmse(y=y, frame_length=512, hop_length=hoplen)[0,]
envtm = librosa.frames_to_time(np.arange(len(rmse)), sr=sr, hop_length=hoplen)
# Use final 3 seconds of recording in order to estimate median noise level
# and typical variation
noiseidx = [envtm > envtm[-1] - 3.0]
noisemedian = np.percentile(rmse[noiseidx], 50)
sigma = np.percentile(rmse[noiseidx], 84.1) - noisemedian
# Set the minimum RMS energy threshold that is needed in order to declare
# an "onset" event to be equal to 5 sigma above the median
threshold = noisemedian + 5*sigma
threshidx = [rmse > threshold]
# Choose the corrected onset times as only those which meet the RMS energy
# minimum threshold requirement
correctedonstm = onstm[[tm in envtm[threshidx] for tm in onstm]]
# Print both in units of actual time (seconds) and sample ID number
print(correctedonstm+start/sr)
print(correctedonstm*sr+start)
fg = plt.figure(figsize=[12, 8])
# Print the waveform together with onset times superimposed in red
ax1 = fg.add_subplot(2,1,1)
ax1.plot(idx+start, y)
for ii in correctedonstm*sr+start:
ax1.axvline(ii, color='r')
ax1.set_ylabel('Amplitude', fontsize=16)
# Print the RMSE together with onset times superimposed in red
ax2 = fg.add_subplot(2,1,2, sharex=ax1)
ax2.plot(envtm*sr+start, rmse)
for ii in correctedonstm*sr+start:
ax2.axvline(ii, color='r')
# Plot threshold value superimposed as a black dotted line
ax2.axhline(threshold, linestyle=':', color='k')
ax2.set_ylabel("RMSE", fontsize=16)
ax2.set_xlabel("Sample Number", fontsize=16)
fg.show()
Printed output looks like:
In [1]: %run rosatest
[ 0.17124717 1.88952381 3.74712018 5.62793651]
[ 3776. 41664. 82624. 124096.]
and the plot that it produces is shown below:
Did you test to normalize the sound sample before treatment ?
When reading onset_detect documentation we can see that there is a lot of optionals arguments, have you already try to use some ?
Maybe one of this optionals arguments may help you to keep only the good one (or at least limit the size of the onset time returned array):
librosa.util.peak_pick (maybe the best)
backtrack
energy
Please see also an update of your code in order to use a pre-computed onset envelope:
def detect_pitch(y, sr, onset_offset=5, fmin=75, fmax=1400):
y = highpass_filter(y, sr)
o_env = librosa.onset.onset_strength(y, sr=sr)
times = librosa.frames_to_time(np.arange(len(o_env)), sr=sr)
onset_frames = librosa.onset.onset_detect(y=o_env, sr=sr)
pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=fmin, fmax=fmax)
notes = []
for i in range(0, len(onset_frames)):
onset = onset_frames[i] + onset_offset
index = magnitudes[:, onset].argmax()
pitch = pitches[index, onset]
if (pitch != 0):
notes.append(librosa.hz_to_note(pitch))
return notes
def highpass_filter(y, sr):
filter_stop_freq = 70 # Hz
filter_pass_freq = 100 # Hz
filter_order = 1001
# High-pass filter
nyquist_rate = sr / 2.
desired = (0, 0, 1, 1)
bands = (0, filter_stop_freq, filter_pass_freq, nyquist_rate)
filter_coefs = signal.firls(filter_order, bands, desired, nyq=nyquist_rate)
# Apply high-pass filter
filtered_audio = signal.filtfilt(filter_coefs, [1], y)
return filtered_audio
does it work better ?