Creating wave data from FFT data? - python

As you might notice, I am really new to Python and sound processing. I (hopefully) extracted FFT data from a wave file using Python and the logfbank and mfcc functions. (The logfbank seems to give the most promising data; the mfcc output looked a bit weird to me.)
In my program I want to change the logfbank/mfcc data and then create wave data from it (and write it back to a file). I didn't really find any information about the process of creating wave data from FFT data. Does anyone have an idea how to solve this? I would appreciate it a lot :)
This is my code so far:
from scipy.io import wavfile
import numpy as np
from python_speech_features import mfcc, logfbank

rate, signal = wavfile.read('orig.wav')

# Log Mel filterbank energies, transposed to (n_filters, n_frames)
fbank_feat = logfbank(signal, rate, nfilt=100, nfft=1400).T
# MFCCs; use a new variable name so the imported mfcc function is not shadowed
mfcc_feat = mfcc(signal, rate, numcep=13, nfilt=26, nfft=1103).T

# magic data processing of fbank_feat or mfcc_feat here
# creating wave data and writing it back to a .wav file here

A suitably constructed STFT spectrogram containing both magnitude and phase can be converted back to a time-domain waveform using the overlap-add method. The important thing is that the spectrogram construction must have the constant-overlap-add (COLA) property.
It can be challenging to have your modifications correctly manipulate both the magnitude and the phase of a spectrogram. So sometimes the phase is discarded and the magnitude manipulated independently. To convert this back into a waveform, one must then estimate the phase information during reconstruction (phase reconstruction). This is a lossy process and usually fairly computationally intensive. Established approaches use an iterative algorithm, usually a variation on Griffin-Lim, but there are now also newer methods using convolutional neural networks.
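For reference, SciPy provides a helper to check whether a given window and hop combination satisfies the constant-overlap-add property; a minimal sketch (the 1024/512 values are just example settings, not taken from the question):
from scipy import signal

# A periodic Hann window with 50% overlap satisfies COLA, so an STFT built with it
# can be inverted by overlap-add (e.g. with scipy.signal.istft)
print(signal.check_COLA('hann', nperseg=1024, noverlap=512))  # True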
Waveform from mel-spectrogram or MFCC using librosa
librosa version 0.7.0 contains a fast Griffin-Lim implementation as well as helper functions to invert a mel-spectrogram or MFCCs.
Below is a code example. The input test file is found at https://github.com/jonnor/machinehearing/blob/ab7fe72807e9519af0151ec4f7ebfd890f432c83/handson/spectrogram-inversion/436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac
import numpy
import librosa
import soundfile

# parameters
sr = 22050
n_mels = 128
hop_length = 512
n_iter = 32
n_mfcc = None  # can try n_mfcc=20

# load audio and create Mel-spectrogram
path = '436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac'
y, _ = librosa.load(path, sr=sr)
S = numpy.abs(librosa.stft(y, hop_length=hop_length, n_fft=hop_length*2))
mel_spec = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels, hop_length=hop_length)

# optional, compute MFCCs in addition and invert them back to a Mel-spectrogram
if n_mfcc is not None:
    # librosa.feature.mfcc expects a (log-power) Mel spectrogram when S= is given
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_spec), sr=sr, n_mfcc=n_mfcc)
    mel_spec = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=n_mels)

# Invert the Mel-spectrogram back to a linear STFT magnitude, then to a waveform via Griffin-Lim
S_inv = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=hop_length*4)
y_inv = librosa.griffinlim(S_inv, n_iter=n_iter, hop_length=hop_length)

soundfile.write('orig.wav', y, samplerate=sr)
soundfile.write('inv.wav', y_inv, samplerate=sr)
Results
The reconstructed waveform will have some artifacts.
The above example got a lot of repetitive noise, more than I expected. It was possible to reduce it quite a lot using the standard Noise Reduction algorithm in Audacity.

Related

Log-frequency spectrogram array

I need to get a log-frequency scaled spectrogram. I'm currently using scipy.signal.stft function to get a magnitude array. But output frequencies are linearly spaced.
import librosa
import scipy.signal

sample, samplerate = librosa.load('sound.wav', sr=64000)
f, t, Zxx = scipy.signal.stft(sample, fs=samplerate, window='hamming', nperseg=512, noverlap=256)
I basically need f to be log-spaced from 1 Hz to 32 kHz (since my sound has a sample rate of 64 kHz).
I can only get the top spectrogram. I need the actual array of values of the bottom spectrogram. I can obtain it through various visualisation functions (librosa specshow, matplotlib with a log y-scale, etc.), but I can't find a way to retrieve an actual 2-D array of magnitudes with the frequencies logarithmically spaced.
Any help or clue on what method to use will be greatly appreciated !
I just stumbled across a good solution for your problem.
The nnAudio library is an audio processing toolbox that uses PyTorch convolutional neural networks as its backend, though it can also be used as a standalone solution.
For installation just use:
pip install git+https://github.com/KinWaiCheuk/nnAudio.git#subdirectory=Installation
To transform your audio into a spectrogram with log-spaced frequency bins use:
from nnAudio import features
from scipy.io import wavfile
import torch
import numpy as np
import librosa

sr, song = wavfile.read('./Bach.wav')  # Loading your audio
x = song.mean(1)                       # Converting stereo to mono
x = torch.tensor(x).float()            # Casting the array into a PyTorch tensor

spec_layer = features.STFT(n_fft=2048, hop_length=512, window='hann',
                           freq_scale='log', pad_mode='reflect', sr=sr)  # Initializing the model
spec = spec_layer(x)  # Feed-forward your waveform to get the spectrogram

log_spec = np.array(spec)[0]                      # Cast the PyTorch tensor back to a numpy array
db_log_spec = librosa.amplitude_to_db(log_spec)   # Convert the amplitude spec into a dB representation
Plotting the resulting log-frequency spectrogram with librosa specshow using the y_axis='linear' flag will give you the asked-for representation as an actual 2-D array :)
import matplotlib.pyplot as plt
import librosa.display

plt.figure()
librosa.display.specshow(db_log_spec, y_axis='linear', x_axis='time', sr=sr)
plt.colorbar()
The library also contains an inverse function and a ton of additional features:
https://kinwaicheuk.github.io/nnAudio/intro.html
Although it produces a good-looking log-frequency spectrogram, I am having trouble reverting the STFT back into the time domain.
The included iSTFT does not do the trick for me. Maybe someone else can pick it up from here?
Actually, for the record, I found out that what I needed was to perform a constant-Q transform, which is exactly a log-frequency spectrogram. You also get to choose the starting frequency, which in my case is very useful. For this I used librosa.cqt.
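For reference, a minimal sketch of how librosa.cqt could be used this way; the file name, starting frequency and bin counts are placeholders, not values from the question:
import numpy as np
import librosa

y, sr = librosa.load('sound.wav', sr=64000)

# Constant-Q transform: frequency bins are logarithmically spaced, starting at fmin
C = np.abs(librosa.cqt(y, sr=sr, fmin=32.7, n_bins=84, bins_per_octave=12))

# C is the desired 2-D array of magnitudes, shape (n_bins, n_frames)
C_db = librosa.amplitude_to_db(C, ref=np.max)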

How can I extract the data points (corresponding y value for x values) from a waveplot

I'm hoping this is an appropriate question for here.
I have used Python's librosa to plot a waveform for a sound file. I'm finding it difficult to extract the data points, e.g. what is the value of y at x (time) = 0.15 in the output below? I can't see this in the documentation for librosa, so I'm wondering if it can be done.
Here is the code I have based on Librosa documentation so far:
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load('audio.wav')
bpm, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

plt.figure()
librosa.display.waveplot(y, sr=sr)
plt.show()

print(f'bpm: {bpm:.2f} beats per minute')
(waveplot output image)
Is it possible to get the x and y axis into an array for example, or at least print a single data point please?
Thank you
librosa.display.waveplot computes and plots the amplitude envelope of the audio signal.
You can see how this is done by looking at the source code of the function (accessible via "view source" on the documentation page for the function).
Here is the relevant part of the code for computing the envelope.
import numpy as np
import librosa

def __envelope(x, hop):
    """Compute the max-envelope of non-overlapping frames of x at length hop

    x is assumed to be multi-channel, of shape (n_channels, n_samples).
    """
    x_frame = np.abs(librosa.util.frame(x, frame_length=hop, hop_length=hop))
    return x_frame.max(axis=1)

# Reduce by envelope calculation
env = __envelope(y, hop_length)
Where hop_length is the number of audio samples per point of the envelope.
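If the goal is simply to read off the signal value at a given time (such as the 0.15 s point mentioned in the question), the array returned by librosa.load can be indexed directly; a minimal sketch:
import numpy as np
import librosa

y, sr = librosa.load('audio.wav')

# Time axis matching y, if you want explicit (x, y) pairs
times = np.arange(len(y)) / sr

# Signal value at t = 0.15 seconds (nearest sample)
idx = int(round(0.15 * sr))
print(f'y at t = 0.15 s: {y[idx]}')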
I wonder if any of the tools listed in the post How to edit raw PCM audio data without an audio library? might be helpful in getting the raw PCM you are seeking. Ideally, the librosa library itself should allow a hook to view the PCM data array, but I've found that audio tools sometimes don't actually allow this (for example Clip in Java). If you can't locate a librosa hook, obtaining the PCM array from one of these tools before, or in parallel to, shipping the data to librosa might be helpful.

how can I validate a perfect sine wave?

Background
I'm trying to validate audio data received over RTP for accuracy when compared to the original source. In my system the audio is played by embedded platform devices and sent out on the network for other devices to capture and play. Its specs vary based on mono, stereo or surround audio, but the sample rate and bit depth are as follows.
I'm using a .wav file for now which contains a sine wave with these specs: 44.1 kHz sample rate, 440 Hz frequency, 1 channel, 16-bit PCM data. I'm using a sine wave so that it is clean and easy to analyze.
Using the following Python code I could verify that both files are the same, since the alignment.distance it gives is 0.0:
import librosa
import librosa.display
import matplotlib.pyplot as plt
from numpy.linalg import norm
from dtw import dtw

# Loading audio files
y1, sr1 = librosa.load("./test-audio-files/1kHz_44100Hz_16bit_05sec.wav")
y2, sr2 = librosa.load("./test-audio-files/received.wav")

# Computing MFCC values
mfcc1 = librosa.feature.mfcc(y=y1, sr=sr1)
mfcc2 = librosa.feature.mfcc(y=y2, sr=sr2)

alignment = dtw(mfcc1.T, mfcc2.T)
print("The normalized distance between the two:", alignment.distance)  # 0 for similar audios
Query
What I'm wondering is: how do I validate the accuracy of the sine wave?
The above solution only works as long as I make sure my source file is perfect and accurate. If the source file itself has a problem, the above solution would still claim the two files match.
I'm playing with a .wav file for now, but it could be any format like mp3, mp4...
The packet loss I am trying to detect is shown in an image that is not included here.
Reference
Downloaded the .wav file from https://www.mediacollege.com/downloads/
If you know the frequency of the test sine wave, you could mathematically generate a perfect sine wave, align the peaks, and compare offsets and amplitude.
If you don't know what type of signal you are testing but still want to test whether it is a sine wave, you could try a regression to fit a sine wave to it (a sketch of this follows the note below).
Note: higher frequencies may appear as (be confounded with) packet loss, but may in fact be a source sampling limitation, e.g. a 70 kHz signal sampled at 44 kHz.
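A minimal sketch of the regression idea using scipy.optimize.curve_fit; the file path is taken from the question, while the 440 Hz initial guess and the parameterization of the sine are assumptions:
import numpy as np
import librosa
from scipy.optimize import curve_fit

# Load the captured audio at its native sample rate
y, sr = librosa.load("./test-audio-files/received.wav", sr=None)

def sine(t, amp, freq, phase, offset):
    return amp * np.sin(2 * np.pi * freq * t + phase) + offset

t = np.arange(len(y)) / sr

# Initial guess: amplitude from the data, frequency from what the test tone should be
p0 = [np.abs(y).max(), 440.0, 0.0, 0.0]
params, _ = curve_fit(sine, t, y, p0=p0)

# Large residuals relative to the fitted amplitude hint at dropouts or distortion
residual = y - sine(t, *params)
print(f"fitted frequency: {params[1]:.2f} Hz, max residual: {np.abs(residual).max():.4f}")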
If you want to ensure that the original sine wave is accurate, one approach is to generate the sine wave directly in Python and then use SciPy's scipy.io.wavfile.write to generate the .wav file.
import numpy as np
from scipy.io.wavfile import write

samplerate = 44100
fs = 100  # frequency of the sine wave in Hz

t = np.linspace(0., 1., samplerate)
amplitude = np.iinfo(np.int16).max
data = amplitude * np.sin(2. * np.pi * fs * t)

write("example.wav", samplerate, data.astype(np.int16))

Soundfile imports audio in two different formats

I am attempting to preprocess audio files to be used in a neural net with soundfile.read(), but the function formats the returned data differently for different .FLAC files with the same sample rate and length. For example, calling data, sr = soundfile.read(audiofile1) produced an array with shape data.shape = (48000, 2) (where individual element values were the amplitude, 0, or the negative amplitude, as NumPy float64), while calling data, sr = soundfile.read(audiofile2) produced an array with shape data.shape = (48000,) (where individual element values were varied NumPy float64 values).
Also, if it helps, audiofile1 was a recording taken via PyAudio, whereas audiofile2 was a sample from the LibriSpeech corpus.
So, my question is twofold:
Why is soundfile.read() producing two different data formats, and how do I ensure that the function returns the arrays in the same format in the future?
Your audiofile2 sample is mono, whereas your audiofile1 recording is stereo (i.e. you probably recorded it with a PyAudio stream configured with channels=2). So I suggest you first figure out whether you need mono or stereo for your application.
If all you really care about is a mono audio signal, you can convert stereo (or more generally N-channel) audio to mono by averaging the channels:
import numpy as np
import soundfile

data, sr = soundfile.read(audiofile)
if np.ndim(data) > 1:
    data = np.mean(data, axis=1)
If you need stereo audio, then you may create an additional channel by duplicating the one you have (although that would not be adding the usual additional information such as phase or amplitude differences between the different channels) with:
if np.ndim(data) < 2:
    data = np.tile(data, (2, 1)).transpose()
It's as simple as:
data, sr = soundfile.read(audiofile2, always_2d=True)
With this, data.shape will always have two elements; data.shape[0] will be the number of frames and data.shape[1] will be the number of channels.

plotting spectrogram in audio analysis

I am working on speech recognition using a neural network. To do so I need to get the spectrograms of the training audio files (.wav). How do I get those spectrograms in Python?
There are numerous ways to do so. The easiest is to check out the methods proposed in kernels on the Kaggle competition TensorFlow Speech Recognition Challenge (just sort by most voted). This one is particularly clear and simple and contains the following function. The inputs are a numeric vector of samples extracted from the wav file, the sample rate, the size of the frame in milliseconds, the step (stride or skip) size in milliseconds, and a small offset.
from scipy.io import wavfile
from scipy import signal
import numpy as np

sample_rate, audio = wavfile.read(path_to_wav_file)

def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                            fs=sample_rate,
                                            window='hann',
                                            nperseg=nperseg,
                                            noverlap=noverlap,
                                            detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)
The outputs are as defined in the SciPy manual, with the exception that the spectrogram is rescaled with a monotonic function (log), which compresses larger values much more than smaller values while still keeping the larger values larger than the smaller ones. This way no extreme value in spec will dominate the computation. Alternatively, one can cap the values at some quantile, but log (or even square root) is preferred. There are many other ways to normalize the heights of the spectrogram, i.e. to prevent extreme values from "bullying" the output :) (a quantile-capping sketch follows the output list below).
freq (f) : ndarray, Array of sample frequencies.
times (t) : ndarray, Array of segment times.
spec (Sxx) : ndarray, Spectrogram of x. By default, the last axis of Sxx corresponds to the segment times.
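If one wanted to try the quantile-capping idea mentioned above instead of the log, a minimal sketch (the file name and the choice of the 99th percentile are arbitrary placeholders):
import numpy as np
from scipy import signal
from scipy.io import wavfile

sample_rate, audio = wavfile.read('sound.wav')
freqs, times, spec = signal.spectrogram(audio, fs=sample_rate, window='hann',
                                        nperseg=int(0.02 * sample_rate),
                                        noverlap=int(0.01 * sample_rate),
                                        detrend=False)

# Cap extreme values at the 99th percentile so no single bin dominates
cap = np.percentile(spec, 99)
spec_capped = np.minimum(spec, cap)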
Alternatively, you can check the train.py and models.py code in the GitHub repo for the TensorFlow example on audio recognition.
Here is another thread that explains and gives code on building spectrograms in Python.
SciPy serves this purpose.
from scipy.io import wavfile
from scipy import signal

# Read the .wav file
sample_rate, data = wavfile.read('directory_path/file_name.wav')

# Spectrogram of the .wav file
sample_freq, segment_time, spec_data = signal.spectrogram(data, sample_rate)
# Note: sample_rate and sampling frequency have the same value here, but theoretically they are different measures
Use the matplotlib library to visualize the spectrogram:
import matplotlib.pyplot as plt
plt.pcolormesh(segment_time, sample_freq, spec_data)
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
You can use the NumPy, SciPy and matplotlib packages to make spectrograms. See the following post:
http://www.frank-zalkow.de/en/code-snippets/create-audio-spectrograms-with-python.html
