I am working on speech recognition using a neural network. To do so I need to get the spectrograms of the training audio files (.wav). How do I get those spectrograms in Python?
There are numerous ways to do so. The easiest is to check out the methods proposed in the Kernels of the Kaggle competition TensorFlow Speech Recognition Challenge (just sort by most voted). This one is particularly clear and simple and contains the following function. The inputs are a numeric vector of samples extracted from the wav file, the sample rate, the size of the frame in milliseconds, the step (stride or skip) size in milliseconds, and a small offset.
from scipy.io import wavfile
from scipy import signal
import numpy as np

sample_rate, audio = wavfile.read(path_to_wav_file)

def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                            fs=sample_rate,
                                            window='hann',
                                            nperseg=nperseg,
                                            noverlap=noverlap,
                                            detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)
The outputs are as defined in the SciPy manual, except that the spectrogram is rescaled with a monotonic function (log), which compresses large values much more than small values while still keeping the larger values larger than the smaller ones. This way no extreme value in spec will dominate the computation. Alternatively, one can cap the values at some quantile, but log (or even square root) is preferred; a small sketch of the quantile-capping option follows the output list below. There are many other ways to normalize the heights of the spectrogram, i.e. to prevent extreme values from "bullying" the output :)
freqs (f) : ndarray, array of sample frequencies.
times (t) : ndarray, array of segment times.
spec (Sxx) : ndarray, spectrogram of x. By default, the last axis of Sxx corresponds to the segment times.
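As a rough illustration of the quantile-capping alternative mentioned above, here is a minimal sketch (the 99.5% threshold and the clip_spec name are my own choices, not from the kernel):

import numpy as np

def clip_spec(spec, q=0.995):
    # Cap spectrogram values at the q-th quantile instead of taking a log,
    # so a few extreme bins cannot dominate the dynamic range.
    cap = np.quantile(spec, q)
    return np.minimum(spec, cap)

# compare with the log version used in log_specgram:
# log_spec = np.log(spec + 1e-10)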
Alternatively, you can check the train.py and models.py code in the GitHub repo of the TensorFlow example on audio recognition.
Here is another thread that explains and gives code on building spectrograms in Python.
SciPy serves this purpose.
from scipy.io import wavfile
from scipy import signal

# Read the .wav file
sample_rate, data = wavfile.read('directory_path/file_name.wav')

# Spectrogram of the .wav file
sample_freq, segment_time, spec_data = signal.spectrogram(data, sample_rate)
# Note: sample_rate and sampling frequency hold the same value here, but theoretically they are different measures
Use the matplotlib library to visualize the spectrogram:
import matplotlib.pyplot as plt
plt.pcolormesh(segment_time, sample_freq, spec_data )
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
You can use the NumPy, SciPy and matplotlib packages to make spectrograms. See the following post:
http://www.frank-zalkow.de/en/code-snippets/create-audio-spectrograms-with-python.html
Related
I need to get a log-frequency scaled spectrogram. I'm currently using scipy.signal.stft function to get a magnitude array. But output frequencies are linearly spaced.
import librosa
import scipy.signal

sample, samplerate = librosa.load('sound.wav', sr=64000)
f, t, Zxx = scipy.signal.stft(sample, fs=samplerate, window='hamming', nperseg=512, noverlap=256)
I basically need f to be log-spaced from 1Hz to 32kHz (since my sound has a samplerate of 64kHz).
I can only get the top spectrogram. I need the actual array of values of the bottom spectrogram. I can obtain it through various visualisation functions (librosa specshow, matplotlib with a log-scaled y axis, etc.), but I can't find a solution to retrieve an actual 2-D array of magnitudes with the frequency axis logarithmically spaced.
Any help or clue on what method to use will be greatly appreciated !
I just stumbled across a good solution for your problem.
The nnAudio library is an audio processing toolbox that uses a PyTorch convolutional neural network as its backend, though it can also be used as a stand-alone solution.
For installation just use:
pip install git+https://github.com/KinWaiCheuk/nnAudio.git#subdirectory=Installation
To transform your audio into a spectrogram with log-spaced frequency bins use:
from nnAudio import features
from scipy.io import wavfile
import torch
import numpy as np
import librosa

sr, song = wavfile.read('./Bach.wav')  # Loading your audio
x = song.mean(1)                       # Converting stereo to mono
x = torch.tensor(x).float()            # Casting the array into a PyTorch tensor

spec_layer = features.STFT(n_fft=2048, hop_length=512, window='hann',
                           freq_scale='log', pad_mode='reflect', sr=sr)  # Initializing the model
spec = spec_layer(x)                   # Feed-forward your waveform to get the spectrogram
log_spec = np.array(spec)[0]           # Cast the PyTorch tensor back to a NumPy array
db_log_spec = librosa.amplitude_to_db(log_spec)  # Convert the amplitude spec into a dB representation
Plotting the resulting log-frequency spectrogram with librosa specshow using the y_axis='linear' flag will give you the representation you asked for, as an actual 2-D array :)
plt.figure()
librosa.display.specshow(db_log_spec, y_axis='linear', x_axis='time', sr=sr)
plt.colorbar()
The library also contains an inverse function and a ton of additional features:
https://kinwaicheuk.github.io/nnAudio/intro.html
Although this produces a good-looking log-frequency spectrogram, I am having trouble reverting the STFT back to the time domain.
The included iSTFT does not do the trick for me. Maybe someone else can pick it up from here?
Actually, for the record, I found out that what I needed was to perform a constant-Q transform, which is exactly a log-frequency spectrogram. You also choose the starting frequency, which is very useful in my case. For this I used librosa.cqt.
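For completeness, a minimal sketch of what that can look like (the fmin, bins_per_octave and octave-count values below are illustrative, not taken from the question; the top of the range has to stay below the Nyquist frequency, and very low fmin values make the CQT filters extremely long):

import numpy as np
import librosa

y, sr = librosa.load('sound.wav', sr=64000)

fmin = 32.0                       # illustrative starting frequency in Hz
bins_per_octave = 24
n_bins = 9 * bins_per_octave      # 9 octaves: 32 Hz .. ~16 kHz

# constant-Q transform: the bin centre frequencies are log-spaced by construction
C = np.abs(librosa.cqt(y, sr=sr, fmin=fmin,
                       n_bins=n_bins, bins_per_octave=bins_per_octave))

# the log-spaced frequency axis that goes with the rows of C
freqs = librosa.cqt_frequencies(n_bins=n_bins, fmin=fmin,
                                bins_per_octave=bins_per_octave)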
I'm hoping this is an appropriate question for here.
I have used Python Librosa to plot a waveform for a sound file. I'm finding it difficult to extract the data points, e.g. what is the value of y at x (time) = 0.15 on the output below? I can't see this in the documentation for Librosa, so I'm wondering if it can be done.
Here is the code I have based on Librosa documentation so far:
import librosa
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load('audio.wav')
bpm, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
plt.figure()
librosa.display.waveplot(y, sr=sr)
plt.show()
print (f'bpm: {bpm:.2f} beats per minute')
output
Is it possible to get the x and y axis into an array for example, or at least print a single data point please?
Thank you
librosa.display.waveplot computes and plots the amplitude envelope of the audio signal.
You can see how this is done by looking at the source code of the function (accessible via "view source" on the documentation page for the function).
Here is the relevant part of the code for computing the envelope.
import numpy as np
import librosa

def __envelope(x, hop):
    """Compute the max-envelope of non-overlapping frames of x at length hop

    x is assumed to be multi-channel, of shape (n_channels, n_samples).
    """
    x_frame = np.abs(librosa.util.frame(x, frame_length=hop, hop_length=hop))
    return x_frame.max(axis=1)

# Reduce by envelope calculation
env = __envelope(y, hop_length)
Where hop_length is the number of audio samples per point of the envelope.
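To answer the original question of reading a value at a given time, here is a minimal mono sketch (not the exact waveplot internals; the hop_length of 512 is just an example, waveplot chooses its own hop based on the figure width):

import numpy as np
import librosa

y, sr = librosa.load('audio.wav')

# raw sample value at t = 0.15 s
i = int(0.15 * sr)
print(y[i])

# max-envelope with one value per hop_length samples, mirroring __envelope above
hop_length = 512
frames = np.abs(librosa.util.frame(y, frame_length=hop_length, hop_length=hop_length))
env = frames.max(axis=0)                          # one envelope value per frame
times = np.arange(len(env)) * hop_length / sr     # matching x axis in seconds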
I wonder if any of the tools listed in the post How to edit raw PCM audio data without an audio library? might be helpful in getting the raw PCM you are seeking. Ideally, librosa library itself should allow a hook to view the PCM data array itself, but I've found that audio tools sometimes don't actually allow this (for example Clip in Java). If you can't locate a librosa hook, maybe obtaining the PCM array from one of these tools prior to or in parallel to shipping data to librosa would be helpful.
Background
I'm trying to validate audio data received over RTP for its accuracy compared to the original source. In my system the audio is played by embedded platform devices and sent out on the network for other devices to capture and play. Its specs vary between mono, stereo, or surround audio, but the sample rate and bit depth are as described below.
I'm using a .wav file for now which contains a sine wave: 44.1 kHz sample rate, 440 Hz frequency, 1 channel, 16-bit PCM data. I'm using a sine wave so that it is clean and easy to analyze.
Using the following Python code I could verify that both files are the same, since the alignment.distance it gives is 0.0:
import librosa
import librosa.display
import matplotlib.pyplot as plt
from numpy.linalg import norm
from dtw import dtw
# Loading audio files
y1, sr1 = librosa.load("./test-audio-files/1kHz_44100Hz_16bit_05sec.wav")
y2, sr2 = librosa.load("./test-audio-files/received.wav")
# Computing MFCC values
mfcc1 = librosa.feature.mfcc(y1, sr1)
mfcc2 = librosa.feature.mfcc(y2, sr2)
alignment = dtw(mfcc1.T, mfcc2.T)
print("The normalized distance between the two : ", alignment.distance) # 0 for similar audios
Query
What I'm wondering is: how do I validate the accuracy of the sine wave itself?
The above solution works only as long as I can make sure my source file is perfect and accurate. If the source file has any problem, the solution above would still claim the two match.
I'm playing with a .wav file for now, but it could be any format, such as mp3 or mp4.
The following packet loss is what I am trying to detect:
Reference
Downloaded the .wav file from https://www.mediacollege.com/downloads/
If you know the frequency of the test sine wave, you could mathematically generate a perfect sine wave, align the peaks, and compare offsets and amplitude.
If you don't know what type of signal you are testing, but still wanted to test if it was a sine wave, you could try a regression to fit a sine wave to it.
Note: higher frequencies may appear as (be confounded with) packet loss, but may in fact be a source sampling limitation, e.g. a 70 kHz signal sampled at 44.1 kHz.
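A minimal sketch of that idea, assuming the mono 16-bit 440 Hz test tone described above (the file name is a placeholder): spikes in the residual would point at dropped or corrupted stretches.

import numpy as np
from scipy.io import wavfile
from scipy import signal

sr, received = wavfile.read('received.wav')
received = received.astype(np.float64) / np.iinfo(np.int16).max   # scale 16-bit PCM to [-1, 1]

f0 = 440.0                                  # known frequency of the test tone
t = np.arange(len(received)) / sr
reference = np.sin(2.0 * np.pi * f0 * t)

# align the reference to the recording via FFT-based cross-correlation
corr = signal.correlate(received, reference, mode='full', method='fft')
lag = np.argmax(corr) - (len(reference) - 1)
aligned = np.roll(reference, lag)

# least-squares amplitude estimate, then the residual exposes dropouts or glitches
amp = np.dot(received, aligned) / np.dot(aligned, aligned)
residual = received - amp * aligned
print('estimated amplitude:', amp, 'max abs residual:', np.abs(residual).max())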
If you want to ensure that the original sine wave is accurate, an approach could be to generate the sine wave directly in Python and then use SciPy's scipy.io.wavfile.write to generate the .wav file:
import numpy as np
from scipy.io.wavfile import write

samplerate = 44100
fs = 100  # tone frequency in Hz
t = np.linspace(0., 1., samplerate)
amplitude = np.iinfo(np.int16).max
data = amplitude * np.sin(2. * np.pi * fs * t)
write("example.wav", samplerate, data.astype(np.int16))
As you might notice, I am really new to Python and sound processing. I (hopefully) extracted FFT data from a wave file using Python and the logfbank and mfcc functions. (The logfbank seems to give the most promising data; the mfcc output looked a bit weird to me.)
In my program I want to change the logfbank/mfcc data and then create wave data from it (and write it to a file). I didn't really find any information about the process of creating wave data from FFT data. Does anyone have an idea how to solve this? I would appreciate it a lot :)
This is my code so far:
from scipy.io import wavfile
import numpy as np
from python_speech_features import mfcc, logfbank
rate, signal = wavfile.read('orig.wav')
fbank = logfbank(signal, rate, nfilt=100, nfft=1400).T
mfcc = mfcc(signal, rate, numcep=13, nfilt=26, nfft=1103).T
#magic data processing of fbank or mfcc here
#creating wave data and writing it back to a .wav file here
A suitably constructed STFT spectrogram containing both magnitude and phase can be converted back to a time-domain waveform using the overlap-add method. The important thing is that the spectrogram construction must have the constant-overlap-add (COLA) property.
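For example, a minimal round-trip sketch with SciPy's STFT (the window and segment sizes are illustrative):

import numpy as np
from scipy import signal

fs = 22050
x = np.random.randn(fs * 2)          # placeholder signal: 2 seconds of noise

nperseg, noverlap = 1024, 768        # 75% overlap
window = 'hann'
print(signal.check_COLA(window, nperseg, noverlap))   # True: constant-overlap-add holds

f, t, Zxx = signal.stft(x, fs=fs, window=window, nperseg=nperseg, noverlap=noverlap)
# ... modify Zxx here (it is complex, so it carries both magnitude and phase) ...
_, x_rec = signal.istft(Zxx, fs=fs, window=window, nperseg=nperseg, noverlap=noverlap)

n = min(len(x), len(x_rec))
print(np.max(np.abs(x[:n] - x_rec[:n])))              # near zero if Zxx is left untouched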
It can be challenging to make your modifications manipulate both the magnitude and the phase of a spectrogram correctly, so sometimes the phase is discarded and the magnitude manipulated independently. To convert this back into a waveform, one must then estimate the phase information during reconstruction (phase reconstruction). This is a lossy process and usually pretty computationally intensive. Established approaches use an iterative algorithm, usually a variation of Griffin-Lim, but there are now also newer methods using convolutional neural networks.
Waveform from mel-spectrogram or MFCC using librosa
librosa version 0.7.0 contains a fast Griffin-Lim implementation as well as helper functions to invert a mel-spectrogram or MFCCs.
Below is a code example. The input test file is found at https://github.com/jonnor/machinehearing/blob/ab7fe72807e9519af0151ec4f7ebfd890f432c83/handson/spectrogram-inversion/436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac
import numpy
import librosa
import soundfile

# parameters
sr = 22050
n_mels = 128
hop_length = 512
n_iter = 32
n_mfcc = None  # can try n_mfcc=20

# load audio and create Mel-spectrogram
path = '436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac'
y, _ = librosa.load(path, sr=sr)
S = numpy.abs(librosa.stft(y, hop_length=hop_length, n_fft=hop_length*2))
mel_spec = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels, hop_length=hop_length)

# optional, compute MFCCs in addition
if n_mfcc is not None:
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(S), sr=sr, n_mfcc=n_mfcc)
    mel_spec = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=n_mels)

# Invert mel-spectrogram
S_inv = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=hop_length*4)
y_inv = librosa.griffinlim(S_inv, n_iter=n_iter, hop_length=hop_length)

soundfile.write('orig.wav', y, samplerate=sr)
soundfile.write('inv.wav', y_inv, samplerate=sr)
Results
The reconstructed waveform will have some artifacts.
The above example got a lot of repetitive noise, more than I expected. It was possible to reduce it quite a lot using the standard Noise Reduction algorithm in Audacity.
I'm a python newbie and audio analysis newbie. If this is not the right place for this question, please point me to right place.
I have an mp3 audio file which has just silence.
Converted to .wav using sox
sox input.mp3 output.wav
from scipy.io.wavfile import read
import matplotlib.pyplot as plt
(fs,x)=read('/home/vivek/Documents/VivekProjects/Silence/silence.wav')
##plt.rcParams['agg.path.chunksize'] = 5000 # for preventing overflow error.
fs
x.size/float(fs)
plt.plot(x)
Which generates this image:
I also used the solution from this question: How to plot a wav file
from scipy.io.wavfile import read
import matplotlib.pyplot as plt

# read audio samples
input_data = read("/home/vivek/Documents/VivekProjects/Silence/silence.wav")
audio = input_data[1]

# plot the samples
plt.plot(audio)
# label the axes
plt.ylabel("Amplitude")
plt.xlabel("Time")
# set the title
plt.title("Sample Wav")
# display the plot
plt.show()
Which generated this image:
Question:
I want to know how to interpret the different colors (blue, green, yellow) in the chart. If you listen to the file it is only silence, and I expected to see just a flat line, if anything.
My mp3 file can be downloaded from here.
The sox converted wav file can be found here.
Even though the file is silent, even Dropbox generates a waveform for it. I can't seem to figure out why.
First, always check the shape of your data before plotting.
x.shape
## (3479040, 2)
So the 2 here means you have two channels in your .wav file; matplotlib by default plots them in different colors. You will need to slice the matrix (by time and/or channel) in this situation.
import matplotlib.pyplot as plt

ind = int(fs * 0.5)  # plot the first 500 ms

# plot as a time series
plt.plot(x[:ind, :])
plt.figure()

# visualise the distribution of sample values
plt.hist(x[:ind, 0], bins=10)
plt.gca().set_yscale('log')

print(x.min(), x.max())
# -3 3
As can be seen from the graph, the signal has very low absolute values (between -3 and 3). Depending on the encoding of the .wav file (integer or float), these raw values are translated to amplitude, here a very low amplitude, which is why it sounds silent; a small sketch follows the quoted excerpt below.
I myself am not familiar with the precise encoding, but this page might help: http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
For all formats other than PCM, the Format chunk must have an extended portion. The extension can be of zero length, but the size field (with value 0) must be present.
For float data, full scale is 1. The bits/sample would normally be 32 or 64.
For the log-PCM formats (µ-law and A-law), the Rev. 3 documentation indicates that the bits/sample field (wBitsPerSample) should be set to 8 bits.
The non-PCM formats must have a fact chunk.
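As a small sketch of how those raw integer values relate to amplitude (assuming signed integer PCM; the file name is taken from the question):

import numpy as np
from scipy.io import wavfile

sr, x = wavfile.read('/home/vivek/Documents/VivekProjects/Silence/silence.wav')

# scipy returns integer PCM as int16/int32 and float WAV as float32/float64;
# for integer data full scale is the dtype's maximum, for float data it is 1.0
if np.issubdtype(x.dtype, np.integer):
    x = x.astype(np.float64) / np.iinfo(x.dtype).max

print(x.min(), x.max())   # values of +-3 in int16 become roughly +-0.0001, i.e. near silence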
PS: if you want to start some more advanced audio analysis, do check this workshop which I found super practical, especially the Energy part and FFT part.
I had a suspicion that your silence.mp3 file had audio which was very low (below human hearing), since I couldn't hear it even when I played it at maximum speaker volume.
So I came across an approach for plotting audio frequencies from mp3, from here.
First we convert the mp3 audio to wav. As the parent file is stereo, the converted wav file is stereo as well. In order to demonstrate that there are audio frequencies, we just need a single channel.
Once we have single-channel wav audio, we simply plot frequency against the time index with a colorbar of the dB power level.
import scipy.io.wavfile
from pydub import AudioSegment
import matplotlib.pyplot as plt
import numpy as np
from numpy import fft as fft
#read mp3 file
mp3 = AudioSegment.from_mp3("silence.mp3")
#convert to wav
mp3.export("silence.wav", format="wav")
#read wav file
rate,audData=scipy.io.wavfile.read("silence.wav")
#if stereo grab both channels
channel1=audData[:,0] #left
#channel2=audData[:,1] #right channel, we dont need here
#create a time variable in seconds
time = np.arange(0, float(audData.shape[0]), 1) / rate
#Plot spectrogram of frequency vs time
plt.figure(1, figsize=(8,6))
plt.subplot(211)
Pxx, freqs, bins, im = plt.specgram(channel1, Fs=rate, NFFT=1024, cmap=plt.get_cmap('autumn_r'))
cbar=plt.colorbar(im)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
cbar.set_label('Intensity dB')
plt.show()
As you can see in the image, silence.mp3 does contain audio frequencies, possibly with power levels of -30 to -45 dB.