Python : spectrogram speech recognition - python

I am trying to reproduce a preprocessing step seen in a YouTube video, where they build a spectrogram by taking windows of 20 ms and applying an FFT to each one. Finally they feed the resulting spectrogram to a neural network. I am using the scipy package but I am a little confused about which parameters to use. Here is the code:
def get_spectrogram(path, nsamples=16000):
    """Given path, return specgram."""
    # read the wav file (assumes `from scipy.io import wavfile`)
    wav = wavfile.read(path)[1]  # 16000 samples per second
    # zero pad the shorter samples and cut off the long ones to have a signal of 1 sec.
    if wav.size < nsamples:
        d = np.pad(wav, (nsamples - wav.size, 0), mode='constant')
    else:
        d = wav[0:nsamples]
    # get the specgram
    specgram = signal.spectrogram(d, fs= ? , nperseg=None, noverlap=None, nfft=None)[2]
    return specgram
Moreover, I am also wondering what the shape of the output should be. Is it (X, 1)?
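One possible way to fill in the parameters, as a sketch rather than a definitive answer: with fs=16000, a 20 ms window is 320 samples, so nperseg=320. The 50 % overlap, the helper names and the synthetic test tone below are assumptions made for illustration. The result is 2-D, with nperseg // 2 + 1 = 161 frequency rows and one column per window, so it is not (X, 1).

```python
import numpy as np
from scipy import signal

FS = 16000             # sampling rate in Hz, as stated in the question
WIN = int(0.020 * FS)  # 20 ms window -> 320 samples
HOP = WIN // 2         # assumed 50 % overlap, i.e. a 10 ms step


def pad_or_trim(wav, nsamples=FS):
    """Left-pad with zeros or cut so the signal has exactly nsamples samples."""
    if wav.size < nsamples:
        return np.pad(wav, (nsamples - wav.size, 0), mode='constant')
    return wav[:nsamples]


def spectrogram_20ms(wav):
    """Return (frequencies, frame times, spectrogram) for 20 ms windows."""
    d = pad_or_trim(wav)
    return signal.spectrogram(d, fs=FS, nperseg=WIN, noverlap=WIN - HOP)


# Synthetic 0.5 s tone standing in for a real recording
t = np.arange(FS // 2) / FS
freqs, times, spec = spectrogram_20ms(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (frequency bins, time frames)
```

If the network expects time along the first axis, transposing with spec.T is enough.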


Get timing information from MFCC generated with librosa.feature.mfcc

I am extracting MFCCs from an audio file using Librosa's function (librosa.feature.mfcc) and I correctly get back a numpy array with the shape I was expecting: 13 MFCCs values for the entire length of the audio file which is 1292 windows (in 30 seconds).
What is missing is timing information for each window: for example I want to know what the MFCC looks like at time 5000ms, then at 5200ms etc.
Do I have to manually calculate the time? Is there a way to automatically get the exact time for each window?
The "timing information" is not directly available, as it depends on the sampling rate. To provide such information, librosa would have to create its own classes. This would pollute the interface and make it much less interoperable. In the current implementation, feature.mfcc returns a numpy.ndarray, meaning you can easily integrate this code anywhere in Python.
To relate MFCC to timing:
import librosa
import numpy as np
filename = librosa.util.example_audio_file()
y, sr = librosa.load(filename)
hop_length = 512 # number of samples between successive frames
mfcc = librosa.feature.mfcc(y=y, n_mfcc=13, sr=sr, hop_length=hop_length)
audio_length = len(y) / sr # in seconds
step = hop_length / sr # in seconds
intervals_s = np.arange(start=0, stop=audio_length, step=step)
print(f'MFCC shape: {mfcc.shape}')
print(f'intervals_s shape: {intervals_s.shape}')
print(f'First 5 intervals: {intervals_s[:5]} second')
Note that array length of mfcc and intervals_s is the same - a sanity check that we did not make a mistake in our calculation.
MFCC shape: (13, 2647)
intervals_s shape: (2647,)
First 5 intervals: [0. 0.02321995 0.04643991 0.06965986 0.09287982] second
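To answer the "what does the MFCC look like at 5000 ms" part directly, the same arithmetic can be inverted. This is a small sketch in plain numpy; the helper names are made up for illustration, and librosa.frames_to_time performs the forward conversion for you if you prefer a library call.

```python
import numpy as np


def frame_times(n_frames, sr, hop_length):
    """Start time in seconds of each frame: index * hop_length / sr."""
    return np.arange(n_frames) * hop_length / sr


def frame_at(time_s, sr, hop_length):
    """Index of the frame that starts at or just before time_s."""
    return int(time_s * sr // hop_length)


sr, hop_length = 22050, 512
times = frame_times(5, sr, hop_length)
idx = frame_at(5.0, sr, hop_length)
print(times)
print(idx)  # column of the MFCC matrix to look at for t = 5 s
```

With an MFCC matrix of shape (13, n_frames), mfcc[:, idx] would then be the coefficient vector closest to that time.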

How can I do real-time voice activity detection in Python?

I am performing a voice activity detection on the recorded audio file to detect speech vs non-speech portions in the waveform.
The output of the classifier looks like (highlighted green regions indicate speech):
The only issue I face is making it work on a stream of audio input (e.g. from a microphone) and doing the analysis in real time over a stipulated time frame.
I know PyAudio can be used to record speech from the microphone dynamically, and there are a couple of real-time visualization examples of a waveform, spectrum, spectrogram, etc., but I could not find anything relevant to carrying out feature extraction in a near real-time manner.
You should try using Python bindings to webRTC VAD from Google. It's lightweight, fast and provides very reasonable results, based on GMM modelling. As the decision is provided per frame, the latency is minimal.
# Run the VAD on 10 ms of silence. The result should be False.
import webrtcvad
vad = webrtcvad.Vad(2)
sample_rate = 16000
frame_duration = 10 # ms
frame = b'\x00\x00' * int(sample_rate * frame_duration / 1000)
print('Contains speech: %s' % (vad.is_speech(frame, sample_rate)))
Also, this article might be useful for you.
UPDATE December 2022
As the topic still draws attention, I'd like to update my answer. SileroVAD is a very fast and very accurate VAD that was released recently under the MIT license.
I found out that LibROSA could be one of the solutions to your problem. There's a simple tutorial on Medium on using Microphone streaming to realise real-time prediction.
Let's use Short-Time Fourier Transform (STFT) as the feature extractor, the author explains:
To calculate STFT, Fast Fourier transform window size(n_fft) is used
as 512. According to the equation n_stft = n_fft/2 + 1, 257 frequency
bins(n_stft) are calculated over a window size of 512. The window is
moved by a hop length of 256 to have a better overlapping of the
windows in calculating the STFT.
stft = np.abs(librosa.stft(trimmed, n_fft=512, hop_length=256, win_length=512))
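To make the n_fft / 2 + 1 = 257 figure concrete, here is a rough numpy-only sketch of the same framing. The one-second random input and the Hann window are assumptions; librosa.stft pads the signal by default, so its frame count will differ slightly from this unpadded version.

```python
import numpy as np

n_fft, hop_length = 512, 256
signal_len = 22050  # one second at 22050 Hz (assumed input length)
x = np.random.default_rng(0).standard_normal(signal_len)

# Cut the signal into overlapping frames (no padding), window each, transform
n_frames = 1 + (signal_len - n_fft) // hop_length
frames = np.stack([x[i * hop_length: i * hop_length + n_fft]
                   for i in range(n_frames)])
stft = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).T

print(stft.shape)            # (n_fft // 2 + 1, n_frames)
feature = stft.mean(axis=1)  # per-bin average over time, one value per bin
```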
# Plot audio with zoomed in y axis
def plotAudio(output):
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 10))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))
    ax.margins(2, -0.1)

# Plot audio
def plotAudio2(output):
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 4))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))

def minMaxNormalize(arr):
    mn = np.min(arr)
    mx = np.max(arr)
    return (arr - mn) / (mx - mn)

def predictSound(X):
    # Empirically select top_db for every sample
    clip, index = librosa.effects.trim(X, top_db=20, frame_length=512, hop_length=64)
    stfts = np.abs(librosa.stft(clip, n_fft=512, hop_length=256, win_length=512))
    stfts = np.mean(stfts, axis=1)
    stfts = minMaxNormalize(stfts)
    result = model.predict(np.array([stfts]))
    predictions = [np.argmax(y) for y in result]
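CHUNKSIZE = 22050  # fixed chunk size
RATE = 22050
p = pyaudio.PyAudio()
# paFloat32 is assumed here, to match the float32 buffers decoded below
stream = p.open(format=pyaudio.paFloat32, channels=1,
                rate=RATE, input=True, frames_per_buffer=CHUNKSIZE)
# preprocessing the noise around
# noise window (the number of frames read here is assumed)
data = stream.read(CHUNKSIZE)
noise_sample = np.frombuffer(data, dtype=np.float32)
print("Noise Sample")
loud_threshold = np.mean(np.abs(noise_sample)) * 10
print("Loud threshold", loud_threshold)
audio_buffer = []
near = 0
# The branch conditions and the limit of 10 quiet chunks are assumed
while True:
    # Read chunk and load it into numpy array.
    data = stream.read(CHUNKSIZE)
    current_window = np.frombuffer(data, dtype=np.float32)
    # Reduce noise real-time
    current_window = nr.reduce_noise(audio_clip=current_window, noise_clip=noise_sample, verbose=False)
    if len(audio_buffer) == 0:
        audio_buffer = current_window
    elif np.mean(np.abs(current_window)) < loud_threshold:
        print("Inside silence reign")
        if near < 10:
            audio_buffer = np.concatenate((audio_buffer, current_window))
            near += 1
        else:
            predictSound(np.array(audio_buffer))
            audio_buffer = []
    else:
        print("Inside loud reign")
        near = 0
        audio_buffer = np.concatenate((audio_buffer, current_window))
# close stream
stream.stop_stream()
stream.close()
p.terminate()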
Code credit to: Chathuranga Siriwardhana
Full code can be found here.
Audio usually has a low bitrate, so I don't see any problem with writing your code completely in numpy and Python. If you need low-level array access, consider numba. Also profile your code, e.g. with line_profiler. Also note there is scipy.signal for more advanced signal processing.
Usually audio processing works in samples. So you define a sample size for your process, and then run a method to decide if that sample contains speech or not.
import numpy as np

def main_loop():
    stream = <create stream with your audio library>
    while True:
        sample = stream.readframes(<define number of samples / time to read>)

def is_speech(sample):
    audio = np.array(sample)
    < do your processing >
    # e.g. simple loudness test
    return np.any(audio > 0.8)
That should get you pretty far.
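A runnable variant of the same idea, as a rough sketch: the audio is cut into fixed 20 ms frames and each frame gets a yes/no decision. The RMS test, the 0.1 threshold and the synthetic quiet/loud signal are all assumptions chosen only to make the example self-contained; a real stream would supply the frames instead.

```python
import numpy as np


def frame_signal(audio, frame_len):
    """Split a 1-D array into non-overlapping frames, dropping the remainder."""
    n = len(audio) // frame_len
    return audio[: n * frame_len].reshape(n, frame_len)


def is_speech(frame, threshold=0.1):
    """Hypothetical loudness test: RMS energy above a fixed threshold."""
    return float(np.sqrt(np.mean(frame ** 2))) > threshold


sample_rate = 16000
frame_len = int(0.02 * sample_rate)  # 20 ms frames

# Half a second of faint noise followed by half a second of a loud tone
rng = np.random.default_rng(0)
quiet = 0.01 * rng.standard_normal(sample_rate // 2)
loud = 0.5 * np.sin(2 * np.pi * 200 * np.arange(sample_rate // 2) / sample_rate)
audio = np.concatenate([quiet, loud])

decisions = [is_speech(f) for f in frame_signal(audio, frame_len)]
print(decisions)
```

A fixed threshold like this is fragile with real microphones, which is why the answers above estimate it from a noise sample or use a trained VAD.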
I think there are two approaches here,
Threshold Approach
Small, deployable, Neural net. Approach
The first one is feasible and can be implemented and tested very quickly, while the second one is a bit more difficult to implement. I think you are already somewhat familiar with the second option.
In the case of the second approach, you will need a dataset of speech that is labeled as a sequence of binary classifications, like 00000000111111110000000011110000. The neural net should be small and optimized for running on edge devices such as mobile phones.
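For illustration, a frame-level label string like the one above can be turned back into speech segments as follows. The 10 ms frame length and the function name are assumptions, since the label resolution depends on the dataset.

```python
def labels_to_segments(labels, frame_ms):
    """Convert a string of per-frame 0/1 labels into (start_ms, end_ms) pairs."""
    segments = []
    start = None
    for i, lab in enumerate(labels):
        if lab == '1' and start is None:
            start = i  # a speech run begins
        elif lab != '1' and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None  # the run ended on the previous frame
    if start is not None:
        # The sequence ended while still inside a speech run
        segments.append((start * frame_ms, len(labels) * frame_ms))
    return segments


labels = "00000000111111110000000011110000"
segments = labels_to_segments(labels, frame_ms=10)  # assumed 10 ms per label
print(segments)
```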
You can check this out from TensorFlow
This is a voice activity detector. I think it's for your purpose.
Also, check these out.
Of course, you should compare the performance of the mentioned toolkits and models and the feasibility of implementing them on mobile devices.

Replicating Spectogram of Audacity?

I'm trying to plot the spectrograms of audio samples. When I plot them with the code below, the result looks odd, but when I import the same samples into Audacity the spectrograms look clean. What changes do I need to make to replicate that in Python? In particular, which colour map should I use, and what else has to change to get an image similar to the Audacity spectrograms?
Thanks in advance.
from scipy import fft
# other usual libraries
N = 8000
K = 256
Step = 4
wind = 0.5*(1 -np.cos(np.array(range(K))*2*np.pi/(K-1) ))
ffts = []
S = data_hollow['collection_hollow'][0]
Spectogram = []
for j in range(int(Step*N/K)-Step):
    vec = S[int(j * K/Step) : int((j+Step) * K/Step)] * wind
Python Spectogram:
Audacity Spectrogram:
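One likely difference is that Audacity displays magnitude on a decibel scale limited to a fixed range, whereas a raw FFT magnitude plot is dominated by a few strong bins. The sketch below keeps the question's 256-sample Hann window and K/Step = 64 sample hop, then converts to dB. The 80 dB range, the function name and the test tone are assumptions, and the colour map remains a matter of taste.

```python
import numpy as np


def db_spectrogram(x, n_fft=256, step=64, db_range=80.0):
    """Hann-windowed magnitude spectrogram in dB, clipped to db_range below the peak."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // step
    frames = np.stack([x[i * step: i * step + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T   # (freq bins, frames)
    db = 20 * np.log10(mag + 1e-12)               # small offset avoids log(0)
    db -= db.max()                                # 0 dB at the strongest bin
    return np.clip(db, -db_range, 0.0)


fs = 8000
t = np.arange(fs) / fs
spec_db = db_spectrogram(np.sin(2 * np.pi * 1000 * t))  # 1 kHz test tone
print(spec_db.shape)

# To display it (requires matplotlib):
# plt.imshow(spec_db, origin='lower', aspect='auto', cmap='magma')
```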

Python implementation of MFCC algorithm

I have a database that contains video streams. I want to compute LBP features from the images and MFCCs from the audio, and I have an annotation for every frame in the video. The annotation is aligned with the video frames and the time of the video. Thus, I want to map the time from the annotation to the result of the MFCC. I know that the sample_rate = 44100
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
audio_file = "sample.wav"
(rate, sig) = wav.read(audio_file)
mfcc_feat = mfcc(sig, rate)
print(len(sig))  # 2130912
print(len(mfcc_feat))  # 4831
Firstly, why is the length of the MFCC result 4831, and how do I map that to the annotation that I have in seconds? The total duration of the video is 48 seconds, and the annotation is 0 everywhere except the 19-29 sec window, where it is 1. How can I locate the samples within that window (19-29) in the MFCC results?
You should get (4831, 13). 13 is the MFCC length (the default numcep is 13) and 4831 is the number of windows. The default winstep is 10 ms, which matches your sound file duration (4831 × 0.01 s ≈ 48 s). To get the windows corresponding to 19-29 sec, slice rows 1900 to 2900 of mfcc_feat, since the frame index is the time in seconds divided by winstep.
Remember that you cannot listen to the MFCC. Each row just represents a 0.025 sec slice of audio (the default value of the winlen parameter).
If you want the audio itself, slice sig by sample index, which is the time in seconds multiplied by rate.
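The two slices can be sketched like this. Zero-filled arrays with the shapes from the question stand in for the real sig and mfcc_feat, and the helper names are made up for illustration.

```python
import numpy as np

rate = 44100
winstep = 0.01                    # python_speech_features default step, in seconds
sig = np.zeros(2130912)           # stand-in for the real audio signal
mfcc_feat = np.zeros((4831, 13))  # stand-in for the real MFCC matrix


def frames_between(start_s, end_s, winstep=0.01):
    """Row slice of the MFCC matrix covering [start_s, end_s)."""
    return slice(int(round(start_s / winstep)), int(round(end_s / winstep)))


def samples_between(start_s, end_s, rate):
    """Sample slice of the raw signal covering [start_s, end_s)."""
    return slice(int(start_s * rate), int(end_s * rate))


mfcc_window = mfcc_feat[frames_between(19, 29)]
audio_window = sig[samples_between(19, 29, rate)]
print(mfcc_window.shape, audio_window.shape)
```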

How to decrease the scale of a matplotlib spectrogram in python3

I am analyzing the spectrograms of .wav files. But after getting the code to finally work, I've run into a small issue. After saving the spectrograms of 700+ .wav files, I realize that they all essentially look the same! This is not because they are the same audio file, but because I don't know how to make the scale of the plot smaller (so I can make out the differences).
I've already tried to fix this issue by looking at this StackOverflow post
Changing plot scale by a factor in matplotlib
I'll show the graph of two different .wav files below
This is .wav #1
This is .wav #2
Believe it or not, these are two different .wav files, but they look super similar. And a computer especially won't be able to pick up the differences in these two .wav files if the scale is this broad.
My code is below
def individualWavToSpectrogram(myAudio, fileNameToSaveTo):
    # Read file and get sampling freq [ usually 44100 Hz ] and sound object
    # (assumes `from scipy.io import wavfile`)
    samplingFreq, mySound = wavfile.read(myAudio)
    # Check if wave file is 16bit or 32 bit. 24bit is not supported
    mySoundDataType = mySound.dtype
    # We can convert our sound array to floating point values ranging from -1 to 1 as follows
    mySound = mySound / (2.**15)
    # Check sample points and sound channel for dual channel (5060, 2) or (5060, ) for mono channel
    mySoundShape = mySound.shape
    samplePoints = float(mySound.shape[0])
    # Get duration of sound file
    signalDuration = mySound.shape[0] / samplingFreq
    # If one channel then index like a 1d array, if 2 channels index into a 2 dimensional array
    if len(mySound.shape) > 1:
        mySoundOneChannel = mySound[:, 0]
    else:
        mySoundOneChannel = mySound
    # Plotting the tone
    # We can represent sound by plotting the pressure values against time axis.
    # Create an array of sample points in one dimension
    timeArray = numpy.arange(0, samplePoints, 1)
    timeArray = timeArray / samplingFreq
    # Scale to milliseconds
    timeArray = timeArray * 1000
    plt.rcParams['agg.path.chunksize'] = 100000
    # Plot the tone
    plt.plot(timeArray, mySoundOneChannel, color='Black')
    # plt.xlabel('Time (ms)')
    print("trying to save")
    plt.savefig('/Users/BillyBobJoe/Desktop/' + fileNameToSaveTo + '.jpg')
How can I modify this code to increase the sensitivity of the graphing so that the differences between two .wav files are made more distinct?
I have tried using
plt.xlim((0, 16000))
But this just adds whitespace to the right of the graph
I need a way to change the scale of each unit, so that the graph is filled out when I change the x axis to 0 - 16000.
If the question is: how to limit the scale on the xaxis, say to between 0 and 1000, you can do as follows:
plt.xlim((0, 1000))
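Note that the x axis in the question's code is in milliseconds, so plt.xlim((0, 16000)) asks for 16 seconds, and any file shorter than that leaves whitespace on the right. An alternative to setting limits is to crop the data to the window of interest before plotting, so the curve always fills the figure. The sketch below does that with numpy only; the function name and the one-second stand-in tone are assumptions.

```python
import numpy as np


def crop_to_window(samples, sampling_freq, start_ms, end_ms):
    """Return (time axis in ms, samples) restricted to [start_ms, end_ms)."""
    start = int(start_ms * sampling_freq / 1000)
    end = min(int(end_ms * sampling_freq / 1000), len(samples))
    time_ms = np.arange(start, end) / sampling_freq * 1000
    return time_ms, samples[start:end]


sampling_freq = 44100
# One second of a 440 Hz tone standing in for a real recording
samples = np.sin(2 * np.pi * 440 * np.arange(sampling_freq) / sampling_freq)

t_win, s_win = crop_to_window(samples, sampling_freq, 0, 100)
print(t_win.size, s_win.size)

# Then plot only the cropped part (requires matplotlib):
# plt.plot(t_win, s_win, color='black')
```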
