Tensorflow is not reconstructing the original signal when applying the STFT followed by the inverse STFT. The problems arise when the frames of the STFT overlap: It seems like every frame contributes with a weight of 1 regardless of the number of overlapping frames N = frame_size / frame_step. As a result, the central part of the signal is N times larger than the original. Here is a minimal code to reproduce the error:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
size = 2048
frame_length = 512
frame_step = 128
waveform = np.sin(np.arange(size) * 1 / 100)
stft = tf.signal.stft(waveform, frame_length, frame_step, window_fn=None)
inverse_stft = tf.signal.inverse_stft(stft, frame_length, frame_step, window_fn=None)
plt.plot(waveform)
plt.plot(inverse_stft)
plt.show()
plt.clf()
Notice that I'm using no window. If I put the Hann window, the central part works well but the borders are smoothly going to zero, a related but surprisingly different error. The implementation of scipy works well under all circumstances.
Am I missing something?
Seems that Tensorflow has some not perfect implementation that does not work well on boundaries of signals/stfts. And it has not take into the account approriate scaling when frame_rate =/= frame length
This could be posted as an issue on their github.
For longer signals (plus taking into account scaling) it looks "good enough":
size = 20480
frame_length = 512
frame_step = 256
waveform = np.sin(np.arange(size) * 1 / 100)
stft = tf.signal.stft(waveform, frame_length, frame_step, window_fn=None, pad_end=True)
inverse_stft = tf.signal.inverse_stft(stft, frame_length, frame_step, window_fn=None) / (frame_length / frame_step)
plt.plot(waveform)
plt.plot(inverse_stft)
plt.show()
plt.clf()
Related
I'm pretty new in Python. I would like to sintetize a first order Gauss-Markov process from a white Gaussian noise. I know from signal processing theory that it could be performed using a noise shaping filter designed properly. See https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_process for details. First order Guass-Markov processes have two parameter: sigma, which is the standard deviation of the process, and the time costant beta.
The shaping filter should have a transfer function equal to the one in this figure:
Here it is my code:
import scipy.signal as dsp
import numpy as np
Nsamples = 2000
fs = 100
time = np.arange(Nsamples) / fs
rng = np.random.default_rng()
gaussianNoise = rng.standard_normal(size=time.shape)
wgn = (gaussianNoise - np.mean(gaussianNoise)) / np.std(gaussianNoise)
print('\n\n\nWGN MEAN: ', np.mean(wgn))
print('WGN STD: ', np.std(wgn))
beta = 0.01
sigma = 0.1
b = np.array([np.sqrt(2 * beta * sigma**2)])
a = np.array([1, beta])
gaussMarkovNoise = dsp.lfilter(b, a, whiteGaussianNoise)
Unfortunately, something is wrong because the gaussMarkovNoise should have an autocorrelation with an exponential decay (see http above); while filtered in this way, it still has a spike in the origin as a white noise sequence. What am I missing?
I'm trying to plot the Amplitude (dBFS) vs. Time (s) plot of an audio (.wav) file using matplotlib. I managed to do that with the following code:
def convert_to_decibel(sample):
ref = 32768 # Using a signed 16-bit PCM format wav file. So, 2^16 is the max. value.
if sample!=0:
return 20 * np.log10(abs(sample) / ref)
else:
return 20 * np.log10(0.000001)
from scipy.io.wavfile import read as readWav
from scipy.fftpack import fft
import matplotlib.pyplot as gplot1
import matplotlib.pyplot as gplot2
import numpy as np
import struct
import gc
wavfile1 = '/home/user01/audio/speech.wav'
wavsamplerate1, wavdata1 = readWav(wavfile1)
wavdlen1 = wavdata1.size
wavdtype1 = wavdata1.dtype
gplot1.rcParams['figure.figsize'] = [15, 5]
pltaxis1 = gplot1.gca()
gplot1.axhline(y=0, c="black")
gplot1.xticks(np.arange(0, 10, 0.5))
gplot1.yticks(np.arange(-200, 200, 5))
gplot1.grid(linestyle = '--')
wavdata3 = np.array([convert_to_decibel(i) for i in wavdata1], dtype=np.int16)
yvals3 = wavdata3
t3 = wavdata3.size / wavsamplerate1
xvals3 = np.linspace(0, t3, wavdata3.size)
pltaxis1.set_xlim([0, t3 + 2])
pltaxis1.set_title('Amplitude (dBFS) vs Time(s)')
pltaxis1.plot(xvals3, yvals3, '-')
which gives the following output:
I had also plotted the Power Spectral Density (PSD, in dBm) using the code below:
from scipy.signal import welch as psd # Computes PSD using Welch's method.
fpsd, wPSD = psd(wavdata1, wavsamplerate1, nperseg=1024)
gplot2.rcParams['figure.figsize'] = [15, 5]
pltpsdm = gplot2.gca()
gplot2.axhline(y=0, c="black")
pltpsdm.plot(fpsd, 20*np.log10(wPSD))
gplot2.xticks(np.arange(0, 4000, 400))
gplot2.yticks(np.arange(-150, 160, 10))
pltpsdm.set_xlim([0, 4000])
pltpsdm.set_ylim([-150, 150])
gplot2.grid(linestyle = '--')
which gives the output as:
The second output above, using the Welch's method plots a more presentable output. The dBFS plot though informative is not very presentable IMO. Is this because of:
the difference in the domains (time in case of 1st output vs frequency in the 2nd output)?
the way plot function is implemented in pyplot?
Also, is there a way I can plot my dBFS output as a peak-to-peak style of plot just like in my PSD (dBm) plot rather than a dense stem plot?
Would be much helpful and would appreciate any pointers, answers or suggestions from experts here as I'm just a beginner with matplotlib and plots in python in general.
TLNR
This has nothing to do with pyplot.
The frequency domain is different from the time domain, but that's not why you didn't get what you want.
The calculation of dbFS in your code is wrong.
You should frame your data, calculate RMSs or peaks in every frame, and then convert that value to dbFS instead of applying this transformation to every sample point.
When we talk about the amplitude, we are talking about a periodic signal. And when we read in a series of data from a sound file, we read in a series of sample points of a signal(may be or be not periodic). The value of every sample point represents a, say, voltage value, or sound pressure value sampled at a specific time.
We assume that, within a very short time interval, maybe 10ms for example, the signal is stationary. Every such interval is called a frame.
Some specific function is applied to each frame usually, to reduce the sudden change at the edge of this frame, and these functions are called window functions. If you did nothing to every frame, you added rectangle windows to them.
An example: when the sampling frequency of your sound is 44100Hz, in a 10ms-long frame, there are 44100*0.01=441 sample points. That's what the nperseg argument means in your psd function but it has nothing to do with dbFS.
Given the knowledge above, now we can talk about the amplitude.
There are two methods a get the value of amplitude in every frame:
The most straightforward one is to get the maximum(peak) values in every frame.
Another one is to calculate the RMS(Root Mean Sqaure) of every frame.
After that, the peak values or RMS values can be converted to dbFS values.
Let's start coding:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
# Determine full scall(maximum possible amplitude) by bit depth
bit_depth = 16
full_scale = 2 ** bit_depth
# dbFS function
to_dbFS = lambda x: 20 * np.log10(x / full_scale)
# Read in the wave file
fname = "01.wav"
fs,data = wavfile.read(fname)
# Determine frame length(number of sample points in a frame) and total frame numbers by window length(how long is a frame in seconds)
window_length = 0.01
signal_length = data.shape[0]
frame_length = int(window_length * fs)
nframes = signal_length // frame_length
# Get frames by broadcast. No overlaps are used.
idx = frame_length * np.arange(nframes)[:,None] + np.arange(frame_length)
frames = data[idx].astype("int64") # Convert to in 64 to avoid integer overflow
# Get RMS and peaks
rms = ((frames**2).sum(axis=1)/frame_length)**.5
peaks = np.abs(frames).max(axis=1)
# Convert them to dbfs
dbfs_rms = to_dbFS(rms)
dbfs_peak = to_dbFS(peaks)
# Let's start to plot
# Get time arrays of every sample point and ever frame
frame_time = np.arange(nframes) * window_length
data_time = np.linspace(0,signal_length/fs,signal_length)
# Plot
f,ax = plt.subplots()
ax.plot(data_time,data,color="k",alpha=.3)
# Plot the dbfs values on a twin x Axes since the y limits are not comparable between data values and dbfs
tax = ax.twinx()
tax.plot(frame_time,dbfs_rms,label="RMS")
tax.plot(frame_time,dbfs_peak,label="Peak")
tax.legend()
f.tight_layout()
# Save serval details
f.savefig("whole.png",dpi=300)
ax.set_xlim(1,2)
f.savefig("1-2sec.png",dpi=300)
ax.set_xlim(1.295,1.325)
f.savefig("1.2-1.3sec.png",dpi=300)
The whole time span looks like(the unit of the right axis is dbFS):
And the voiced part looks like:
You can see that the dbFS values become greater while the amplitudes become greater at the vowel start point:
Both Librosa and Scipy have the fft function, however, they give me a different spectrogram output even with the same signal input.
Scipy
I am trying to get the spectrogram with the following code
import numpy as np # fast vectors and matrices
import matplotlib.pyplot as plt # plotting
from scipy import fft
X = np.sin(np.linspace(0,1e10,5*44100))
fs = 44100 # assumed sample frequency in Hz
window_size = 2048 # 2048-sample fourier windows
stride = 512 # 512 samples between windows
wps = fs/float(512) # ~86 windows/second
Xs = np.empty([int(2*wps),2048])
for i in range(Xs.shape[0]):
Xs[i] = np.abs(fft(X[i*stride:i*stride+window_size]))
fig = plt.figure(figsize=(20,7))
plt.imshow(Xs.T[0:150],aspect='auto')
plt.gca().invert_yaxis()
fig.axes[0].set_xlabel('windows (~86Hz)')
fig.axes[0].set_ylabel('frequency')
plt.show()
Then I get the following spectrogram
Librosa
Now I try to get the same spectrogram with Librosa
from librosa import stft
X_libs = stft(X, n_fft=window_size, hop_length=stride)
X_libs = np.abs(X_libs)[:,:int(2*wps)]
fig = plt.figure(figsize=(20,7))
plt.imshow(X_libs[0:150],aspect='auto')
plt.gca().invert_yaxis()
fig.axes[0].set_xlabel('windows (~86Hz)')
fig.axes[0].set_ylabel('frequency')
plt.show()
Question
The two spectrogram are obviously different, specifically, the Librosa version has an attack at the very beginning.
What causes the difference? I don't see many parameters that I can tune in the documentation for Scipy and Librosa.
The reason for this is the argument center for librosa's stft. By default it's True (along with pad_mode = 'reflect').
From the docs:
librosa.core.stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, dtype=, pad_mode='reflect')
center:boolean
If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length].
If False, then D[:, t] begins at y[t * hop_length]
pad_mode:string
If center=True, the padding mode to use at the edges of the signal. By default, STFT uses reflection padding.
Calling the STFT like this
X_libs = stft(X, n_fft=window_size, hop_length=stride,
center=False)
does lead to a straight line:
Note that librosa's stft also uses the Hann window function by default. If you want to avoid this and make it more like your Scipy stft implementation, call the stft with a window consisting only of ones:
X_libs = stft(X, n_fft=window_size, hop_length=stride,
window=np.ones(window_size),
center=False)
You'll notice that the line is thinner.
I recently do my homework about MFCC, and I can't figure out some differences between using these libraries.
The 3 libraries I use are:
python_speech_features
SpeechPy
LibROSA
samplerate = 16000
NFFT = 512
NCEPT = 13
1st Part: Mel filter bank
temp1_fb = pyspeech.get_filterbanks(nfilt=NFILT, nfft=NFFT, samplerate=sample1)
# speechpy do not divide 2 and add 1 when initializing
temp2_fb = speechpy.feature.filterbanks(num_filter=NFILT, fftpoints=NFFT, sampling_freq=sample1)
temp3_fb = librosa.filters.mel(sr=sample1, n_fft=NFFT, n_mels=NFILT)
# fix librosa normalized version
temp3_fb /= np.max(temp3_fb, axis=-1)[:, None]
pic1
Only the shape in speechpy will get (, 512), other all (, 257). The figure of librosa is a bit of deformation.
2nd Part: MFCC
# pyspeech without lifter. Using hamming
temp1_mfcc = pyspeech.mfcc(speaker1, samplerate=sample1, winlen=0.025, winstep=0.01, numcep=NCEPT, nfilt=NFILT, nfft=NFFT,
preemph=0.97, ceplifter=0, winfunc=np.hamming, appendEnergy=False)
# speechpy need pre-emphasized. Using rectangular window fixed. Mel filter bank is not the same
temp2_mfcc = speechpy.feature.mfcc(emphasized_speaker1, sampling_frequency=sample1, frame_length=0.025, frame_stride=0.01,
num_cepstral=NCEPT, num_filters=NFILT, fft_length=NFFT)
# librosa need pre-emphasized. Using log energy. Its STFT using hanning, but its framing is not the same
temp3_energy = librosa.feature.melspectrogram(emphasized_speaker1, sr=sample1, S=temp3_pow.T, n_fft=NFFT,
hop_length=frame_step, n_mels=NFILT).T
temp3_energy = np.log(temp3_energy)
temp3_mfcc = librosa.feature.mfcc(emphasized_speaker1, sr=sample1, S=temp3_energy.T, n_mfcc=13, dct_type=2, n_fft=NFFT,
hop_length=frame_step).T
pic2
I've tried my best to set the condition faire. The figure of speechpy gets darker.
3rd Part: Delta coefficient
temp1 = pyspeech.delta(mfcc_speaker1, 2)
temp2 = speechpy.processing.derivative_extraction(mfcc_speaker1.T, 1).T
# librosa along the frame axis
temp3 = librosa.feature.delta(mfcc_speaker1, width=5, axis=0, order=1)
pic3
I can't directly set mfcc as argument in speechpy, or it will be very strange. And what these parameters originally act is not the same as my expected.
I'm wondering what factors to make these differences. Is it just somethong I mentioned above? Or I made some mistakes? Hope for details, thanks.
There are many MFCC implementations and they often differ bit by bit - window function shape, mel filterbank calculation, dct could be different too. It is hard to find a fully compatible library. In the long term it should not matter for you as long as you use the same implementation everywhere. The differences do not affect results.
I'm trying to extract data from an wav file for audio analysis of each frequency and their amplitude with respect to time, my aim to run this data for a machine learning algorithm for a college project, after a bit of googling I found out that this can be done by Python's matplotlib library, I saw some sample codes that ran a Short Fourier transform and plotted a spectrogram of these wav files but wasn't able to understand how to use this library to extract data (all frequency's amplitude at a given time in the audio file) and store it in an 3D array or a .mat file.
Here's the code I saw on some website:
#!/usr/bin/env python
""" This work is licensed under a Creative Commons Attribution 3.0 Unported License.
Frank Zalkow, 2012-2013 """
import numpy as np
from matplotlib import pyplot as plt
import scipy.io.wavfile as wav
from numpy.lib import stride_tricks
""" short time fourier transform of audio signal """
def stft(sig, frameSize, overlapFac=0.5, window=np.hanning):
win = window(frameSize)
hopSize = int(frameSize - np.floor(overlapFac * frameSize))
# zeros at beginning (thus center of 1st window should be for sample nr. 0)
samples = np.append(np.zeros(np.floor(frameSize/2.0)), sig)
# cols for windowing
cols = np.ceil( (len(samples) - frameSize) / float(hopSize)) + 1
# zeros at end (thus samples can be fully covered by frames)
samples = np.append(samples, np.zeros(frameSize))
frames = stride_tricks.as_strided(samples, shape=(cols, frameSize), strides=(samples.strides[0]*hopSize, samples.strides[0])).copy()
frames *= win
return np.fft.rfft(frames)
""" scale frequency axis logarithmically """
def logscale_spec(spec, sr=44100, factor=20.):
timebins, freqbins = np.shape(spec)
scale = np.linspace(0, 1, freqbins) ** factor
scale *= (freqbins-1)/max(scale)
scale = np.unique(np.round(scale))
# create spectrogram with new freq bins
newspec = np.complex128(np.zeros([timebins, len(scale)]))
for i in range(0, len(scale)):
if i == len(scale)-1:
newspec[:,i] = np.sum(spec[:,scale[i]:], axis=1)
else:
newspec[:,i] = np.sum(spec[:,scale[i]:scale[i+1]], axis=1)
# list center freq of bins
allfreqs = np.abs(np.fft.fftfreq(freqbins*2, 1./sr)[:freqbins+1])
freqs = []
for i in range(0, len(scale)):
if i == len(scale)-1:
freqs += [np.mean(allfreqs[scale[i]:])]
else:
freqs += [np.mean(allfreqs[scale[i]:scale[i+1]])]
return newspec, freqs
""" plot spectrogram"""
def plotstft(audiopath, binsize=2**10, plotpath=None, colormap="jet"):
samplerate, samples = wav.read(audiopath)
s = stft(samples, binsize)
sshow, freq = logscale_spec(s, factor=1.0, sr=samplerate)
ims = 20.*np.log10(np.abs(sshow)/10e-6) # amplitude to decibel
timebins, freqbins = np.shape(ims)
plt.figure(figsize=(15, 7.5))
plt.imshow(np.transpose(ims), origin="lower", aspect="auto", cmap=colormap, interpolation="none")
plt.colorbar()
plt.xlabel("time (s)")
plt.ylabel("frequency (hz)")
plt.xlim([0, timebins-1])
plt.ylim([0, freqbins])
xlocs = np.float32(np.linspace(0, timebins-1, 5))
plt.xticks(xlocs, ["%.02f" % l for l in ((xlocs*len(samples)/timebins)+(0.5*binsize))/samplerate])
ylocs = np.int16(np.round(np.linspace(0, freqbins-1, 10)))
plt.yticks(ylocs, ["%.02f" % freq[i] for i in ylocs])
if plotpath:
plt.savefig(plotpath, bbox_inches="tight")
else:
plt.show()
plt.clf()
plotstft("abc.wav")
Please guide me to understand how to extract the data, if not by matplotlib, recommend me some other library which will help me achieve this.
First of all, this looks like my code which is stated to be under a CC license. I don't take it too serious, but you should not ignore those aspects (you omitted the statement of authorship in this case), others could be more miffed about such a thing.
To your question: In this code the stft isn't computed by matplotlib, but just by numpy. You can get it like this:
samplerate, samples = wav.read(audiopath)
s = stft(samples, 1024)
I am not sure why you want a 3D array? It is a 2D-array, but it is complex valued. If you want to save it in a .mat file:
from scipy.io import savemat
savemat("file.mat", {'arr': s})
You can see once the wav audio file is read into variable samples it is passed to a function called stft :
samplerate, samples = wav.read(audiopath)
s = stft(samples, binsize)
here you already have access to the audio samples in var samples in the form of integers ... be aware that bit depth will impact number of bytes per sample as represented as a series of integers ... also know your endianness (left to right or visa versa) ... however in function stft that array is further processed into an array of floats in variable : frames before its passed into function np.fft.rfft
Depending on your needs those are your access choices without doing any of your own processing