I am attempting to preprocess audiofiles to be used in a neural net with soundfile.read(), but the function is formatting the returned data differently for different .FLAC files with the same sample rate and length. For example, calling data, sr = soundfile.read(audiofile1) produced an array with shape data.shape = (48000, 2) (where individual element values were either the amplitude, 0, or the negative amplitude in NumPy float64), while calling data, sr = soundfile.read(audiofile2) produced an array with shape data.shape = (48000,) (where individual element values were varied NumPy float64).
Also, if it helps, audiofile1 was a recording taken from a recording taken via PyAudio, whereas audiofile2 was a sample from the LibriSpeech corpus.
So, my question is twofold:
Why is soundfile.read() producing two different data formats, and how do I ensure that the function returns the arrays in the same format in the future?
Your audiofile2 sample is mono, whereas your audiofile1 recording is stereo (i.e. you probably recorded it with a PyAudio stream configured with channels=2). So I suggest you first figure out whether you need mono or stereo for your application.
If all you really care is a mono audio signal, you can convert stereo (or more generally N-channel) audio to mono by averaging the channels:
data, sr = soundfile.read(audiofile)
if np.dim(data)>1:
data = np.mean(data,axis=1)
If you need stereo audio, then you may create an additional channel by duplicating the one you have (although that would not be adding the usual additional information such as phase or amplitude differences between the different channels) with:
if np.dim(data)<2:
data = np.tile(data,(2,1)).transpose()
It's as simple as:
data, sr = soundfile.read(audiofile2, always_2d=True)
With this, data.shape will always have two elements; data.shape[0] will be the number of frames and data.shape[1] will be the number of channels.
Related
I need to get a log-frequency scaled spectrogram. I'm currently using scipy.signal.stft function to get a magnitude array. But output frequencies are linearly spaced.
import librosa
import scipy
sample, samplerate = librosa.load('sound.wav', sr=64000)
f, t, Zxx = scipysignal.stft(sample, fs=samplerate, window='hamming', nperseg=512, noverlap=256)
I basically need f to be log-spaced from 1Hz to 32kHz (since my sound has a samplerate of 64kHz).
I can only get the top spectrogram. I need the actual array of values of the bottom spectrogram. I can obtain it through various visualisation function (librosa specshow, matplotlib yscaled etc.) but I can't find a solution to retrieve an actual 2-D array of magnitudes with only frequency logarithmically-spaced.
Any help or clue on what method to use will be greatly appreciated !
I just stumbled across a good soulution for your problem.
The nnAudio library is an audio processing toolbox using PyTorch convolutional neural network as its backend. Though it can also be used as a stand alone solution.
for installation just use:
pip install git+https://github.com/KinWaiCheuk/nnAudio.git#subdirectory=Installation
To transform your audio into a spectrogram with log-spaced frequency bins use:
from nnAudio import features
from scipy.io import wavfile
import torch
sr, song = wavfile.read('./Bach.wav') # Loading your audio
x = song.mean(1) # Converting Stereo to Mono
x = torch.tensor(x).float() # casting the array into a PyTorch Tensor
spec_layer = features.STFT(n_fft=2048, hop_length=512,
window='hann', freq_scale='log', pad_mode='reflect', sr=sr) # Initializing the model
spec = spec_layer(x) # Feed-forward your waveform to get the spectrogram
log_spec =np.array(spec)[0]# cast PyTorch Tensor back to numpy array
db_log_spec = librosa.amplitude_to_db(log_spec) # convert amplitude spec into db representation
Plotting the resulting log-frequency spectrogram with librosa specshow using the y_axis='linear' flag will give you the asked for representation in an actual 2d array :)
plt.figure()
librosa.display.specshow(db_log_spec, y_axis='linear', x_axis='time', sr=sr)
plt.colorbar()
The library also contains an inverse funktion and a ton of additional features:
https://kinwaicheuk.github.io/nnAudio/intro.html
Although producing a good looking log-freq spectrogram I am having trouble reverting the STFT back into the time domain.
The included iSTFT does not do the trick for me. Maybe someone else can pick it up from here?
Actually, for record I found out taht what I needed was to perform a constant-Q transform, which is exactly a log-based spectrogram. But you choose the starting frequency, which is in my case, very useful. For this I used librosa.cqt
I accidentally forgot to convert some NumPy arrays to bytes objects when using PyAudio, but to my surprise it still played audio, even if it sounded a bit off. I wrote a little test script (see below) for playing 1 second of a 440Hz tone, and it seems like writing a NumPy array directly to a PyAudio Stream cuts that tone short.
Can anyone explain why this happens? I thought a NumPy array was a contiguous sequence of bytes with some header information about its dtype and strides, so I would've predicted that PyAudio played the full second of the tone after some garbled audio from the header, not cut the tone off.
# script segment
import pyaudio
import numpy as np
RATE = 48000
p = pyaudio.PyAudio()
stream = p.open(format = pyaudio.paFloat32, channels = 1, rate = RATE, output = True)
TONE = 440
SECONDS = 1
t = np.arange(0, 2*np.pi*TONE*SECONDS, 2*np.pi*TONE/RATE)
sina = np.sin(t).astype(np.float32)
sinb = sina.tobytes()
# console commands segment
stream.write(sinb) # bytes object plays 1 second of 440Hz tone
stream.write(sina) # still plays 440Hz tone, but noticeably shorter than 1 second
The problem is more subtle than you describe. Your first call is passing a bytes array of size 192,000. The second call is passing a list of float32 values with size 48,000. pyaudio handles both of them, and passes the buffer to portaudio to be played.
However, when you opened pyaudio, you told it you were sending paFloat32 data, which has 4 bytes per sample. The pyaudio write handler takes the length of the array you gave it, and divides by the number of channels times the sample size to determine how many audio samples there are. In your second call, the length of the array is 48,000, which it divides by 4, and thereby tells portaudio "there are 12,000 samples here".
So, everyone understood the format, but were confused about the size. If you change the second call to
stream.write(sina, 48000)
then no one has to guess, and it works perfectly fine.
I am working with AVIRIS Classic data which has an interleave of BIP, or Band Interleaved by Pixel. I want to convert the datatype to BIL (Band Interleaved by Line). In the image processing language IDL, you can do this using the function CONVERT_DOIT, but this uses a proprietary software. Are there any python libraries that have a function to carry out this task?
I am completely unfamiliar with AVIRIS data and its processing, so there may be much simpler or better methods of accessing it of which I am unaware. However, I found and downloaded a smallish sample from the linked website as follows:
Reading the .hdr file (which is ASCII fortunately), I was able to work out that the data are signed 16-bit integers, band-interleaved-by-pixel of 224 bands, and 735 samples/line and 2017 lines. So, I can then load the image and process it with Numpy as follows:
import numpy as np
from PIL import Image
# Define datafile parameters
channels, samples, lines = 224, 735, 2017
# Load file and reshape
im = np.fromfile('f090710t01p00r11rdn_b_ort_img', dtype=np.int16).reshape(lines,samples,channels)
The data are signed integers in the range -32768...+32767, so if we add 32768 the data will be 0..65536 and then multiply by 255/65535 we should get a viewable, but not radiometrically correct, image to prove the reading from file:
# That's kind of all - just do crude scaling to show we have it correctly
a = (im.astype(np.float)+32768.0)*255.0/65535.0
Now select band 0, and save (using PIL, but we could use OpenCV or tifffile):
Image.fromarray(a[:,:,0].astype(np.uint8)).save('result.png')
Presumably you can now arrange the data however you like with Numpy as we have read it successfully. So, say you want line 7 of band 4, you could use:
a[7,:,4]
Or line 0, all bands:
a[0,:,:]
If you wanted to make a false colour composite from 3 of the 224 bands, you can use np.dstack() like this - my bands were chosen at random:
FalseColour = np.dstack((a[...,37], a[...,164], a[...,200])).astype(np.uint8)
Image.fromarray(FalseColour).save('result.png')
Keywords: Python, AVIRIS, hyperspectral image processing, hyper-spectral, band-interleaved by pixel, by line, planar, interlace.
I'm using pyaudio to take input from a microphone or read a wav file, and analyze the stream while playing it. I want to only analyze the right channel if the input is stereo. I've been able to extract the data and convert to integers using loops:
levels = []
length = len(data)
if channels == 1:
for i in range(length//2):
volume = abs(struct.unpack('<h', data[i:i+2])[0])
levels.append(volume)
elif channels == 2:
for i in range(length//4):
j = 4 * i + 2
volume = abs(struct.unpack('<h', data[j:j+2])[0])
levels.append(volume)
I think this working correctly, I know it runs without error on a laptop and Raspberry Pi 3, but it appears to consume too much time to run on a Raspberry Pi Zero when simultaneously streaming the output to a speaker. I figure that eliminating the loop and using numpy may help. I assume I need to use np.ndarray to do this, and the first parameter will be (CHUNK,) where CHUNK is my chunk size for analyzing the audio (I'm using 1024). And the format would be '<h', as in the struct code above, I think. But I'm at a loss as to how to code it correctly for each of the two cases (mono and right channel only for stereo). How do I create the numpy arrays for each of the two cases?
You are here reading 16-bit integers from a binary file. It seems that you are first reading the data into data variable with something like data = f.read(), which is here not visible. Then you do:
for i in range(length//2):
volume = abs(struct.unpack('<h', data[i:i+2])[0])
levels.append(volume)
BTW, that code is wrong, it shoud be abs(struct.unpack('<h', data[2*i:2*i+2])[0]), otherwise you are overlapping bytes from different values.
To do the same with numpy, you should just do this (instead of both f.read()and the whole loop):
data = np.fromfile(f, dtype='<i2')
This is over 100 times faster than the manual thing above in my test on 5 MB of data.
In the second case, you have interleaved left-right-left-right values. Again you can read them all (assuming you have enough memory) and then access only one half:
data = np.fromfile(f, dtype='<i2')
left = data[::2]
right = data[1::2]
This processes everything, even though you need just one half, but it is still much much faster.
EDIT: If the data not coming from a file, np.fromfile can be replaced with np.frombuffer. Then you have this:
channel_data = np.frombuffer(data, dtype='<i2')
if channels == 2:
channel_data = channel_data[1::2]
levels = np.abs(channel_data)
As you might notice, i am really new to python and sound processing. I (hopefully) extracted FFT data from a wave file using python and the logfbank and mfcc function. (The logfbank seems to give the most promising data, mfcc output looked a bit weird for me).
In my program i want to change the logfbank/mfcc data and then create wave data from it (and write them into a file). I didn't really find any information about the process of creating wave data from FFT data. Does anyone of you have an idea how to solve this? I would appreciate it a lot :)
This is my code so far:
from scipy.io import wavfile
import numpy as np
from python_speech_features import mfcc, logfbank
rate, signal = wavfile.read('orig.wav')
fbank = logfbank(signal, rate, nfilt=100, nfft=1400).T
mfcc = mfcc(signal, rate, numcep=13, nfilt=26, nfft=1103).T
#magic data processing of fbank or mfcc here
#creating wave data and writing it back to a .wav file here
A suitably constructed STFT spectrogram containing both magnitude and phase can be converted back to a time-domain waveform using the Overlap Add method. Important thing is that the spectrogram construction must have the constant-overlap-add property.
It can be challenging to have your modifications correctly manipulate both magnitude and phase of a spectrogram. So sometimes the phase is discarded, and magnitude manipulated independently. In order to convert this back into a waveform one must then estimate phase information during reconstruction (phase reconstruction). This is a lossy process, and usually pretty computationally intensive. Established approaches use an iterative algorithm, usually a variation on Griffin-Lim. But there are now also new methods using Convolutional Neural Networks.
Waveform from mel-spectrogram or MFCC using librosa
librosa version 0.7.0 contains a fast Griffin-Lim implementation as well as helper functions to invert a mel-spectrogram of MFCC.
Below is a code example. The input test file is found at https://github.com/jonnor/machinehearing/blob/ab7fe72807e9519af0151ec4f7ebfd890f432c83/handson/spectrogram-inversion/436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac
import numpy
import librosa
import soundfile
# parameters
sr = 22050
n_mels = 128
hop_length = 512
n_iter = 32
n_mfcc = None # can try n_mfcc=20
# load audio and create Mel-spectrogram
path = '436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac'
y, _ = librosa.load(path, sr=sr)
S = numpy.abs(librosa.stft(y, hop_length=hop_length, n_fft=hop_length*2))
mel_spec = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels, hop_length=hop_length)
# optional, compute MFCCs in addition
if n_mfcc is not None:
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(S), sr=sr, n_mfcc=n_mfcc)
mel_spec = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=n_mels)
# Invert mel-spectrogram
S_inv = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=hop_length*4)
y_inv = librosa.griffinlim(S_inv, n_iter=n_iter,
hop_length=hop_length)
soundfile.write('orig.wav', y, samplerate=sr)
soundfile.write('inv.wav', y_inv, samplerate=sr)
Results
The reconstructed waveform will have some artifacts.
The above example got a lot of repetitive noise, more than I expected. It was possible to reduce it quite a lot using the standard Noise Reduction algorithm in Audacity.