I need to get a log-frequency scaled spectrogram. I'm currently using scipy.signal.stft function to get a magnitude array. But output frequencies are linearly spaced.
import librosa
import scipy.signal
sample, samplerate = librosa.load('sound.wav', sr=64000)
f, t, Zxx = scipy.signal.stft(sample, fs=samplerate, window='hamming', nperseg=512, noverlap=256)
I basically need f to be log-spaced from 1Hz to 32kHz (since my sound has a samplerate of 64kHz).
I can only get the top spectrogram. I need the actual array of values of the bottom spectrogram. I can obtain it through various visualisation functions (librosa specshow, matplotlib yscale, etc.), but I can't find a way to retrieve an actual 2-D array of magnitudes with logarithmically spaced frequencies.
Any help or clue on what method to use will be greatly appreciated!
I just stumbled across a good solution for your problem.
The nnAudio library is an audio processing toolbox that uses PyTorch convolutional neural networks as its backend, though it can also be used as a standalone solution.
For installation, just use:
pip install git+https://github.com/KinWaiCheuk/nnAudio.git#subdirectory=Installation
To transform your audio into a spectrogram with log-spaced frequency bins use:
import numpy as np
import librosa
import torch
from nnAudio import features
from scipy.io import wavfile
sr, song = wavfile.read('./Bach.wav') # Loading your audio
x = song.mean(1) # Converting Stereo to Mono
x = torch.tensor(x).float() # casting the array into a PyTorch Tensor
spec_layer = features.STFT(n_fft=2048, hop_length=512,
                           window='hann', freq_scale='log', pad_mode='reflect', sr=sr) # Initializing the model
spec = spec_layer(x) # Feed-forward your waveform to get the spectrogram
log_spec = np.array(spec)[0] # cast PyTorch Tensor back to numpy array
db_log_spec = librosa.amplitude_to_db(log_spec) # convert amplitude spec into db representation
Because the rows of db_log_spec are already log-spaced in frequency, plotting it with librosa specshow and the y_axis='linear' flag shows the requested representation, and db_log_spec itself is the actual 2-D array you asked for :)
import matplotlib.pyplot as plt
import librosa.display
plt.figure()
librosa.display.specshow(db_log_spec, y_axis='linear', x_axis='time', sr=sr)
plt.colorbar()
plt.show()
The library also contains an inverse function and a ton of additional features:
https://kinwaicheuk.github.io/nnAudio/intro.html
Although this produces a good-looking log-frequency spectrogram, I am having trouble converting the STFT back into the time domain.
The included iSTFT does not do the trick for me. Maybe someone else can pick it up from here?
Actually, for the record, I found out that what I needed was a constant-Q transform, which is exactly a log-frequency spectrogram. You also get to choose the starting frequency, which is very useful in my case. For this I used librosa.cqt
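For anyone who wants to stay with scipy.signal.stft instead of a CQT, here is a rough sketch (not taken from the answers above) that interpolates the linear-frequency magnitudes onto a log-spaced grid. The function name log_frequency_spectrogram, the n_bins parameter and the test tone are my own placeholders. Keep in mind that with nperseg=512 at 64 kHz the STFT bins are 125 Hz apart, so any log bins below that are pure interpolation and a grid starting at 1 Hz would carry no real low-frequency detail.

```python
import numpy as np
from scipy import signal


def log_frequency_spectrogram(x, fs, n_bins=128, fmin=None,
                              nperseg=512, noverlap=256):
    """Resample |STFT| from linear to log-spaced frequencies (hypothetical helper)."""
    f, t, Zxx = signal.stft(x, fs=fs, window='hamming',
                            nperseg=nperseg, noverlap=noverlap)
    mag = np.abs(Zxx)                       # shape (n_linear_bins, n_frames)
    if fmin is None:
        fmin = f[1]                         # lowest non-zero STFT frequency
    f_log = np.geomspace(fmin, f[-1], n_bins)
    log_mag = np.empty((n_bins, mag.shape[1]))
    for j in range(mag.shape[1]):
        # linear interpolation of one time frame onto the log grid
        log_mag[:, j] = np.interp(f_log, f, mag[:, j])
    return f_log, t, log_mag


# Example with a synthetic 1 kHz tone standing in for the real recording
fs = 64000
time = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * time)
f_log, t, log_mag = log_frequency_spectrogram(tone, fs)
peak_frequency = f_log[np.argmax(log_mag.mean(axis=1))]
```

log_mag is a plain 2-D NumPy array whose rows correspond to the frequencies in f_log, so it can be fed to further processing without going through a plotting function.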
Related
I'm hoping this is an appropriate question for here.
I have used Python Librosa to plot a waveform for a sound file. I'm finding it difficult to extract the data points, e.g. what is the value of y at x (Time) = 0.15 on the output below? I can't see this in the documentation for Librosa, so I'm wondering if this can be done.
Here is the code I have based on Librosa documentation so far:
import librosa
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load('audio.wav')
bpm, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
plt.figure()
librosa.display.waveplot(y, sr=sr)
plt.show()
print (f'bpm: {bpm:.2f} beats per minute')
output
Is it possible to get the x and y axis into an array for example, or at least print a single data point please?
Thank you
librosa.display.waveplot computes and plots the amplitude envelope of the audio signal.
You can see how this is done by looking at the source code of the function (accessible via "view source" on the documentation page for the function).
Here is the relevant part of the code for computing the envelope.
def __envelope(x, hop):
    """Compute the max-envelope of non-overlapping frames of x at length hop
    x is assumed to be multi-channel, of shape (n_channels, n_samples).
    """
    import numpy as np
    x_frame = np.abs(librosa.util.frame(x, frame_length=hop, hop_length=hop))
    return x_frame.max(axis=1)
# Reduce by envelope calculation
env = __envelope(y, hop_length)
Where hop_length is the number of audio samples per point of the envelope.
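To make this concrete for a mono signal, here is a minimal NumPy-only sketch. It is not librosa's own code: sample_at_time and max_envelope are hypothetical helper names, and the sine wave stands in for the y returned by librosa.load. The array y already holds the raw y-axis values, and the matching x-axis values are simply the sample indices divided by sr.

```python
import numpy as np


def sample_at_time(y, sr, t):
    """Return the sample of y closest to time t (seconds)."""
    index = int(round(t * sr))
    return y[index]


def max_envelope(y, hop_length):
    """Max of |y| over non-overlapping frames of hop_length samples (mono input)."""
    n_frames = len(y) // hop_length
    frames = np.abs(y[:n_frames * hop_length]).reshape(n_frames, hop_length)
    return frames.max(axis=1)


sr = 22050
times = np.arange(sr) / sr                    # x axis of the raw waveform, in seconds
y = 0.5 * np.sin(2 * np.pi * 440 * times)     # stand-in for the loaded audio

value = sample_at_time(y, sr, 0.15)           # y value at x = 0.15 s
hop_length = 512
env = max_envelope(y, hop_length)             # roughly what the plot draws
env_times = np.arange(len(env)) * hop_length / sr
```

Here times and y are the x and y arrays of the raw signal, while env_times and env are the (much shorter) arrays for the plotted envelope.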
I wonder if any of the tools listed in the post How to edit raw PCM audio data without an audio library? might be helpful in getting the raw PCM you are seeking. Ideally, librosa library itself should allow a hook to view the PCM data array itself, but I've found that audio tools sometimes don't actually allow this (for example Clip in Java). If you can't locate a librosa hook, maybe obtaining the PCM array from one of these tools prior to or in parallel to shipping data to librosa would be helpful.
I would like to voxelise a .stl file and write it into an np.array. The resolution of the voxels should be adjustable.
Here is my code for this:
import open3d as o3d
component_path = r"C:\Users\User\documents\components\Test_1.stl"
mesh = o3d.io.read_triangle_mesh(component_path)
voxel_grid = o3d.geometry.VoxelGrid.create_from_triangle_mesh(mesh, voxel_size = 3)
ply_path = "voxel.ply"
o3d.io.write_voxel_grid(ply_path, voxel_grid, True,True,True)
pcd = o3d.io.read_point_cloud(ply_path)
list_path = "list.xyz"
o3d.io.write_point_cloud(list_path, pcd)
Then I read the coordinate points from the list, write them into a 3D array and plot them. When plotting, the border is not displayed for certain voxel sizes, as can be seen in the image (although it is present in the original). Is there a solution for this so that it doesn't happen no matter what voxel size?
voxelized picture with missing border
In addition, the voxel size changes the maximum dimension. So the component originally has three times the length as it is shown here. How can this be adjusted? (If I just multiply a factor, the voxels stay small but pull the distances apart).
Is there perhaps a more reasonable way to write a voxelisation of a .stl file and put the centers of voxels into an np.array?
If anyone ever has the same problem and is looking for a solution:
This project worked for me: GitHub: stl-to-voxel
The model is then also filled. If the maximum dimension is known, you can determine the exact voxel size via the resolution.
Here is some code:
import stl_reader
import stltovoxel
import numpy as np
import copy
import os
import sys
input_path = r"C:\Users\user\Example.stl"
output_path = r"C:\Users\user\Test.xyz"
resolution = 50  # Resolution: into how many layers the model should be divided
stltovoxel.doExport(input_path, output_path, resolution)
Afterwards, you can read the coordinates from the list, write them into an array and process them further (quite normally).
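As a rough sketch of that last step, assuming the .xyz file contains one whitespace-separated "x y z" triple per line and that the voxel coordinates are spaced by a known voxel_size (both are assumptions, so check your own export), the points can be loaded with NumPy and turned into a dense boolean grid. The small file written here only stands in for the real export.

```python
import os
import tempfile

import numpy as np

# Stand-in for the exported file: three voxel positions, one triple per line
xyz_path = os.path.join(tempfile.mkdtemp(), "example.xyz")
with open(xyz_path, "w") as fh:
    fh.write("0 0 0\n1 0 0\n1 2 3\n")

voxel_size = 1.0  # assumed spacing between neighbouring voxel coordinates

# (n_voxels, 3) array of voxel coordinates
points = np.loadtxt(xyz_path).reshape(-1, 3)

# Convert coordinates to integer indices relative to the smallest corner
indices = np.rint((points - points.min(axis=0)) / voxel_size).astype(int)

# Dense occupancy grid: True where a voxel is present
grid = np.zeros(indices.max(axis=0) + 1, dtype=bool)
grid[tuple(indices.T)] = True
```

Multiplying indices by voxel_size (and adding the minimum corner back) recovers real-world positions, so the model keeps its original dimensions regardless of the chosen resolution.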
I am attempting to preprocess audiofiles to be used in a neural net with soundfile.read(), but the function is formatting the returned data differently for different .FLAC files with the same sample rate and length. For example, calling data, sr = soundfile.read(audiofile1) produced an array with shape data.shape = (48000, 2) (where individual element values were either the amplitude, 0, or the negative amplitude in NumPy float64), while calling data, sr = soundfile.read(audiofile2) produced an array with shape data.shape = (48000,) (where individual element values were varied NumPy float64).
Also, if it helps, audiofile1 was a recording made via PyAudio, whereas audiofile2 was a sample from the LibriSpeech corpus.
So, my question is twofold:
Why is soundfile.read() producing two different data formats, and how do I ensure that the function returns the arrays in the same format in the future?
Your audiofile2 sample is mono, whereas your audiofile1 recording is stereo (i.e. you probably recorded it with a PyAudio stream configured with channels=2). So I suggest you first figure out whether you need mono or stereo for your application.
If all you really care is a mono audio signal, you can convert stereo (or more generally N-channel) audio to mono by averaging the channels:
import numpy as np
import soundfile
data, sr = soundfile.read(audiofile)
if np.ndim(data) > 1:
    data = np.mean(data, axis=1)
If you need stereo audio, then you may create an additional channel by duplicating the one you have (although that would not be adding the usual additional information such as phase or amplitude differences between the different channels) with:
if np.ndim(data) < 2:
    data = np.tile(data, (2, 1)).transpose()
It's as simple as:
data, sr = soundfile.read(audiofile2, always_2d=True)
With this, data.shape will always have two elements; data.shape[0] will be the number of frames and data.shape[1] will be the number of channels.
As you might notice, I am really new to Python and sound processing. I (hopefully) extracted FFT data from a wave file using Python and the logfbank and mfcc functions. (The logfbank seems to give the most promising data; the mfcc output looked a bit weird to me.)
In my program I want to change the logfbank/mfcc data and then create wave data from it (and write it to a file). I didn't really find any information about the process of creating wave data from FFT data. Does anyone have an idea how to solve this? I would appreciate it a lot :)
This is my code so far:
from scipy.io import wavfile
import numpy as np
from python_speech_features import mfcc, logfbank
rate, signal = wavfile.read('orig.wav')
fbank = logfbank(signal, rate, nfilt=100, nfft=1400).T
mfcc = mfcc(signal, rate, numcep=13, nfilt=26, nfft=1103).T
#magic data processing of fbank or mfcc here
#creating wave data and writing it back to a .wav file here
A suitably constructed STFT spectrogram containing both magnitude and phase can be converted back to a time-domain waveform using the Overlap-Add method. The important thing is that the spectrogram construction must have the constant-overlap-add (COLA) property.
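As a small illustration of that round trip, here is a sketch using scipy.signal (my choice of library for the example; the window, segment length and test tone are arbitrary assumptions). check_COLA verifies the constant-overlap-add property for the chosen window and overlap, and istft performs the overlap-add inversion.

```python
import numpy as np
from scipy import signal

fs = 16000
nperseg, noverlap = 512, 256
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)          # stand-in for the real audio

# A Hann window with 50% overlap satisfies constant-overlap-add
cola_ok = signal.check_COLA('hann', nperseg, noverlap)

# Forward transform: complex values, i.e. magnitude and phase together
_, _, Zxx = signal.stft(x, fs=fs, window='hann',
                        nperseg=nperseg, noverlap=noverlap)

# Inverse transform via overlap-add, using the same parameters
_, x_rec = signal.istft(Zxx, fs=fs, window='hann',
                        nperseg=nperseg, noverlap=noverlap)
x_rec = x_rec[:len(x)]                   # drop the zero padding added by stft
max_error = np.max(np.abs(x - x_rec))
```

As long as Zxx keeps its phase and the same parameters are used in both directions, the reconstruction error stays at floating-point level.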
It can be challenging to have your modifications correctly manipulate both magnitude and phase of a spectrogram. So sometimes the phase is discarded, and magnitude manipulated independently. In order to convert this back into a waveform one must then estimate phase information during reconstruction (phase reconstruction). This is a lossy process, and usually pretty computationally intensive. Established approaches use an iterative algorithm, usually a variation on Griffin-Lim. But there are now also new methods using Convolutional Neural Networks.
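To show what such an iterative phase estimate looks like, here is a bare-bones Griffin-Lim style sketch built on scipy.signal. It is a simplified illustration under my own assumptions (random initial phase, fixed iteration count, Hann window with 75% overlap), not the librosa implementation, and griffin_lim is a name chosen for this example.

```python
import numpy as np
from scipy import signal


def griffin_lim(magnitude, n_iter=32, nperseg=512, noverlap=384, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from random phase
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Go to the time domain with the current phase estimate ...
        _, x = signal.istft(magnitude * angles, nperseg=nperseg, noverlap=noverlap)
        # ... and back, keeping only the phase of the resulting STFT
        _, _, Z = signal.stft(x, nperseg=nperseg, noverlap=noverlap)
        angles = np.exp(1j * np.angle(Z))
    _, x = signal.istft(magnitude * angles, nperseg=nperseg, noverlap=noverlap)
    return x


# Example: discard the phase of a two-tone signal, then reconstruct it
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
_, _, Z = signal.stft(x, nperseg=512, noverlap=384)
magnitude = np.abs(Z)
x_rec = griffin_lim(magnitude, n_iter=32)
```

The result generally matches the target magnitudes much better than a single inversion with random phase, but it is still an estimate, which is why artifacts are to be expected.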
Waveform from mel-spectrogram or MFCC using librosa
librosa version 0.7.0 contains a fast Griffin-Lim implementation as well as helper functions to invert a mel-spectrogram or MFCC.
Below is a code example. The input test file is found at https://github.com/jonnor/machinehearing/blob/ab7fe72807e9519af0151ec4f7ebfd890f432c83/handson/spectrogram-inversion/436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac
import numpy
import librosa
import soundfile
# parameters
sr = 22050
n_mels = 128
hop_length = 512
n_fft = hop_length * 4  # same FFT size for analysis and inversion
n_iter = 32
n_mfcc = None # can try n_mfcc=20
# load audio and create Mel-spectrogram
path = '436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac'
y, _ = librosa.load(path, sr=sr)
# power spectrogram, matching the default power=2.0 assumed by mel_to_stft
S = numpy.abs(librosa.stft(y, hop_length=hop_length, n_fft=n_fft)) ** 2
mel_spec = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels, hop_length=hop_length)
# optional, compute MFCCs in addition
if n_mfcc is not None:
    # MFCCs are computed from the log-power Mel spectrogram
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_spec), sr=sr, n_mfcc=n_mfcc)
    mel_spec = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=n_mels)
# Invert mel-spectrogram
S_inv = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=n_fft)
y_inv = librosa.griffinlim(S_inv, n_iter=n_iter,
                           hop_length=hop_length)
soundfile.write('orig.wav', y, samplerate=sr)
soundfile.write('inv.wav', y_inv, samplerate=sr)
Results
The reconstructed waveform will have some artifacts.
The above example got a lot of repetitive noise, more than I expected. It was possible to reduce it quite a lot using the standard Noise Reduction algorithm in Audacity.
I am working on speech recognition using a neural network. To do so, I need to get the spectrograms of the training audio files (.wav). How can I get those spectrograms in Python?
There are numerous ways to do so. The easiest is to check out the methods proposed in the Kernels of the Kaggle competition TensorFlow Speech Recognition Challenge (just sort by most voted). This one is particularly clear and simple and contains the following function. The inputs are a numeric vector of samples extracted from the wav file, the sample rate, the size of the frame in milliseconds, the step (stride or skip) size in milliseconds, and a small offset.
from scipy.io import wavfile
from scipy import signal
import numpy as np
sample_rate, audio = wavfile.read(path_to_wav_file)
def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    # window and step sizes are given in milliseconds
    nperseg = int(round(window_size * sample_rate / 1e3))
    step = int(round(step_size * sample_rate / 1e3))
    noverlap = nperseg - step  # samples shared by consecutive windows
    freqs, times, spec = signal.spectrogram(audio,
                                            fs=sample_rate,
                                            window='hann',
                                            nperseg=nperseg,
                                            noverlap=noverlap,
                                            detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)
Outputs are defined in the SciPy manual, with an exception that the spectrogram is rescaled with a monotonic function (Log()), which depresses larger values much more than smaller values, while leaving the larger values still larger than the smaller values. This way no extreme value in spec will dominate the computation. Alternatively, one can cap the values at some quantile, but log (or even square root) are preferred. There are many other ways to normalize the heights of the spectrogram, i.e. to prevent extreme values from "bullying" the output :)
freq (f) : ndarray, Array of sample frequencies.
times (t) : ndarray, Array of segment times.
spec (Sxx) : ndarray, Spectrogram of x. By default, the last axis of Sxx corresponds to the segment times.
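To illustrate the rescaling options mentioned above, here is a tiny NumPy sketch comparing log, square root and quantile capping on a made-up spectrogram with one extreme value. The random data, the 0.95 quantile and the variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.exponential(scale=1.0, size=(5, 8))  # stand-in for spectrogram power values
spec[0, 0] = 1e6                                # one extreme value

eps = 1e-10
log_spec = np.log(spec + eps)        # monotonic, compresses large values strongly
sqrt_spec = np.sqrt(spec)            # monotonic, milder compression

cap = np.quantile(spec, 0.95)        # alternative: clip everything above the 95% quantile
capped_spec = np.minimum(spec, cap)
```

Log and square root keep the ordering of the values while shrinking the range, whereas capping flattens everything above the threshold to the same value.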
Alternatively, you can check the train.py and models.py code on github repo from the Tensorflow example on audio recognition.
Here is another thread that explains and gives code on building spectrograms in Python.
SciPy serves this purpose.
from scipy import signal
from scipy.io import wavfile
# Read the .wav file
sample_rate, data = wavfile.read('directory_path/file_name.wav')
# Spectrogram of .wav file
sample_freq, segment_time, spec_data = signal.spectrogram(data, sample_rate)
# Note: sample_freq is the array of frequency bins (in Hz) and segment_time the array of segment times; neither is the sampling rate
Use the matplotlib library to visualize the spectrogram:
import matplotlib.pyplot as plt
plt.pcolormesh(segment_time, sample_freq, spec_data )
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
You can use the NumPy, SciPy and matplotlib packages to make spectrograms. See the following post.
http://www.frank-zalkow.de/en/code-snippets/create-audio-spectrograms-with-python.html