I'm looking for something to generate a "constant" audio waveform, which looks like this: [image: a constant waveform]
I have a set of analog data points:
12766:149
12786:0
13339:149
13359:0
13721:57
13741:0
15249:255
15269:0
15822:87
The format is time_in_ms:amplitude. I want to output them through a headphone jack.
I think the "constant waveform" shown above can be treated as a combination of multiple square waves with very low frequencies and different amplitudes, each lasting 20 ms.
Is that possible? What's out there that I can use to do this?
I guess you can simply write raw PCM audio frames into a .wav file using the wave module (https://docs.python.org/3/library/wave.html). If your amplitude ranges from 0 to 255 inclusive, it's probably easiest to create a wav file with a sampwidth of 1 (byte) and just write the amplitude values as bytes into the sample frames.
with wave.open("test.wav", "w") as w:
    w.setnchannels(1)
    w.setsampwidth(1)
    w.setframerate(4000)  # 4000 samples/sec
    w.writeframes(bytearray([100] * 4000))  # 4000 samples of amplitude 100
The above creates a small file 'test.wav' that is a mono 8-bit audio waveform of constant amplitude 100. Change the code accordingly to write the amplitude values from your input file, and adjust the sample rate as required.
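For instance, a rough sketch of that idea (parsing the time_in_ms:amplitude pairs and holding each value until the next timestamp are my assumptions, not something from your post):

import wave

# time_in_ms:amplitude pairs from the question (treated as step changes)
events = [(12766, 149), (12786, 0), (13339, 149), (13359, 0),
          (13721, 57), (13741, 0), (15249, 255), (15269, 0), (15822, 87)]

rate = 4000                    # samples per second
samples_per_ms = rate // 1000  # 4 samples per millisecond

frames = bytearray()
for (t, amp), (t_next, _) in zip(events, events[1:]):
    # hold each amplitude until the next event; the last event has no
    # following timestamp, so its duration is unknown and it is skipped
    frames.extend([amp] * ((t_next - t) * samples_per_ms))

with wave.open("steps.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(1)          # 8-bit unsigned samples, 0..255
    w.setframerate(rate)
    w.writeframes(bytes(frames))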
Hi, I am trying to replicate the following pre-processing pipeline from the paper 'Masked Autoencoders that Listen' (https://doi.org/10.48550/arXiv.2207.06405):
we transform raw waveform (pre-processed as mono channel under 16,000 sampling rate) into 128 Kaldi [54]-compatible Mel-frequency bands with a 25ms Hanning window that shifts every 10 ms. For a 10-second recording in AudioSet, the resulting spectrogram is of 1×1024×128 dimension.
Starting from 10 s of audio with a sampling rate of 16 kHz, I am trying the following function from torchaudio:
fbank = torchaudio.compliance.kaldi.fbank(signal, htk_compat=True, sample_frequency=16000, use_energy=False, window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
However, I am not obtaining the expected spectrogram: the expected shape is (1 x 1024 x 128), but instead I am getting (1 x 998 x 128).
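As a sanity check on where the 998 comes from (assuming torchaudio's default 25 ms frame_length and its default edge handling, which drops incomplete frames at the end):

# rough check, not torchaudio itself
num_samples = 10 * 16000            # 10 s at 16 kHz = 160000 samples
frame_length = int(0.025 * 16000)   # 25 ms window  = 400 samples
frame_shift = int(0.010 * 16000)    # 10 ms shift   = 160 samples
print((num_samples - frame_length) // frame_shift + 1)   # 998

So the 998 matches the plain frame-count arithmetic; the remaining frames up to 1024 would presumably come from padding the feature matrix to a fixed length, which the call above does not do.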
Here I am practicing analyzing audio (WAV format) in order to remove low-volume samples in a given range and export the result to a new audio file. The data was converted to an int16 array; the max value is some positive number and the min is some negative number. As a result, the output audio is too short, and I think the problem is that the range is wrong. So how do I choose the right range? I set it to between min/2 and max/2.
from pydub import AudioSegment
import io
import scipy.io.wavfile
import IPython
import numpy as np

w = AudioSegment.from_file("input.wav", format="wav")
a = w.get_array_of_samples()
fp_arr = np.array(a).T.astype(np.int16)

avg = (max(fp_arr) / 2).astype(np.int16)    # half of the maximum sample value
avg2 = (min(fp_arr) / 2).astype(np.int16)   # half of the minimum sample value

b = []
for d in a:
    if d not in range(avg2, avg):           # keep only samples outside [min/2, max/2)
        b.append(d)

myarray = np.asarray(b, dtype=np.int16)     # keep int16 so scipy writes a valid 16-bit WAV
wav_io = io.BytesIO()
scipy.io.wavfile.write(wav_io, 16000, myarray)
wav_io.seek(0)
sound = AudioSegment.from_wav(wav_io)
file_handle = sound.export("output.wav", format="wav")
If you reject some samples without replacing them with something, it's normal for the resulting wave to be shorter. If what you plan to do is a kind of noise gate, you should probably replace the eliminated samples with silence instead.
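A minimal sketch of that idea, reusing fp_arr and the half-min/half-max thresholds from the code above (the output filename is just an example):

import numpy as np
import scipy.io.wavfile

avg = fp_arr.max() // 2
avg2 = fp_arr.min() // 2

gated = fp_arr.copy()
quiet = (gated > avg2) & (gated < avg)   # samples inside the low-volume range
gated[quiet] = 0                         # silence them instead of deleting them

scipy.io.wavfile.write("output_gated.wav", 16000, gated)   # same length as the input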
However, a real noise gate, like any dynamics processor, works a little differently. First, it follows the envelope of the signal, meaning that it doesn't take into account each oscillation around the axis (if you do that, you'll cut some samples inside each oscillation, i.e. several dozen times per second, which is probably not what you want). Instead, a noise gate analyses the variation of amplitude at a higher temporal level. After that step, the resulting envelope contains no negative values anymore. When this envelope goes below the defined threshold (let's say 0.125 for power, or an equivalent integer value in 16 or 24 bits), it takes a few milliseconds to make a little fade out (meaning it multiplies the amplitude by a factor going progressively from 1 to 0). On the contrary, when the signal passes above the threshold again, it reopens the gate with a little fade in.
If you bypass these little fades in/out, the resulting wave will contain unpleasant digital clicks. If you bypass the envelope follower used to smooth the amplitude, you will close the gate far too often.
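For completeness, a very rough sketch of such a gate; simple_noise_gate, the smoothing/fade times and the 0.125 threshold are illustrative choices, not taken from any particular library:

import numpy as np

def simple_noise_gate(samples, sr, threshold=0.125, smooth_ms=10.0, fade_ms=5.0):
    # samples: int16 array; threshold applies to the 0..1 envelope scale
    x = samples.astype(np.float64) / 32768.0

    # 1) envelope follower: rectify and smooth with a one-pole low-pass,
    #    so individual oscillations around zero are ignored
    alpha = np.exp(-1.0 / (sr * smooth_ms / 1000.0))
    env = np.empty_like(x)
    level = 0.0
    for i, v in enumerate(np.abs(x)):
        level = alpha * level + (1.0 - alpha) * v
        env[i] = level

    # 2) raw open/closed decision from the (non-negative) envelope
    gate = (env >= threshold).astype(np.float64)

    # 3) replace the hard 0/1 switching with short fades to avoid clicks
    fade = max(1, int(sr * fade_ms / 1000.0))
    gain = np.convolve(gate, np.ones(fade) / fade, mode="same")

    return (x * gain * 32767.0).astype(np.int16)

Applied to fp_arr from the question, something like simple_noise_gate(fp_arr, 16000) keeps the original length and avoids the clicks described above.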
I have an audio frame which is a NumPy array of length 16000.
When I apply numpy FFT to the audio frame, I get a spectrum that peaks at 0 Hz. I tried different audio frames from the same audio file but all of them seem to have peaks at 0 Hz.
Could anyone please help me understand what I am doing wrong? Thank you.
There is a bias (a DC offset) of around -0.2, right? This is a constant value over time, which is to say that there is a strong component at 0 Hz compared with the variation around this constant value. You only need to interpret the results.
Solution: try to subtract the average value from the signal in the time domain. I suppose that, magically, the 0 Hz component will disappear.
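A quick sketch of that fix; spectrum_without_dc is a made-up helper, audio_frame stands for your length-16000 array, and the 16 kHz sample rate is an assumption:

import numpy as np

def spectrum_without_dc(frame, sample_rate=16000):
    # subtract the mean so the 0 Hz (DC) component no longer dominates
    frame = frame - np.mean(frame)
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, spectrum

freqs, spectrum = spectrum_without_dc(audio_frame)
print(freqs[np.argmax(spectrum)])   # the peak should no longer sit at 0 Hz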
When I extract MFCCs from an audio file, the output is (13, 22). What does the number represent? Is it time frames? I use librosa.
The code I use is:
import librosa

# X and sample_rate come from loading the audio file beforehand
mfccs = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13, hop_length=256)
mfccs
print(mfccs.shape)
And the output is (13, 22).
Yes, those are time frames, and their number mainly depends on how many samples you provide via y and what hop_length you choose.
Example
Say you have 10s of audio sampled at 44.1 kHz (CD quality). When you load it with librosa, it gets resampled to 22,050 Hz (that's the librosa default) and downmixed to one channel (mono). When you then run something like an STFT, melspectrogram, or MFCC, so-called feature frames are computed.
The question is, how many (feature) frames do you get for your 10s of audio?
The deciding parameter for this is the hop_length. For all the mentioned functions, librosa slides a window of a certain length (typically n_fft) over the 1d audio signal, i.e., it looks at one shorter segment (or frame) at a time, computes features for this segment and moves on to the next segment. These segments are usually overlapping. The distance between two such segments is hop_length and it is specified in number of samples. It may be identical to n_fft, but often hop_length is half or even just a quarter of n_fft. It allows you to control the temporal resolution of your features (the spectral resolution is controlled by n_fft or n_mfcc, depending on what you are actually computing).
10s of audio at 44.1 kHz are 441000 samples. But remember, librosa by default resamples to 22050 Hz, so it's actually only 220500 samples. How many times can we move a segment of some length over these 220500 samples, if we move it by 256 samples in each step? The precise number depends on how long the segment is. But let's ignore that for a second and assume that when we hit the end, we simply zero-pad the input so that we can still compute frames for as long as there is at least some input. Then the computation becomes trivial:
number_of_samples / hop_length = number_of_frames
So for our examples, this would be:
220500 / 256 = 861.3
So we get about 861 frames.
Note that you can make this computation even easier by computing the so-called frame_rate. That's frames per second in Hz. It's:
frame_rate = sample_rate / hop_length = 86.13
To get the number of frames for your input, simply multiply frame_rate by the length of your audio in seconds and you're set (ignoring padding).
frames = frame_rate * audio_in_seconds
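As a quick sanity check with synthetic input (with librosa's default centred padding you get one frame more than the plain division):

import numpy as np
import librosa

sr = 22050
hop_length = 256
y = np.random.randn(10 * sr).astype(np.float32)   # 10 s of synthetic noise at 22050 Hz

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
print(mfccs.shape)                # (13, 862)
print(1 + len(y) // hop_length)   # 862: the ~861 from above, plus one frame from centred padding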
I am going through these two librosa docs: melspectrogram and stft.
I am working on datasets of audio of variable lengths, but I don't quite get the shapes. For example:
import librosa
import torch

(waveform, sample_rate) = librosa.load('audio_file')
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
dur = librosa.get_duration(y=waveform, sr=sample_rate)
spectrogram = torch.from_numpy(spectrogram)
print(spectrogram.shape)
print(sample_rate)
print(dur)
Output:
torch.Size([128, 150])
22050
3.48
What I get are the following points:
The sample rate means you get N samples each second, in this case 22050 samples per second.
The window length is the stretch of audio for which each FFT is calculated.
The STFT is the calculation of the FFT over small windows of time of the audio.
The shape of the output is (n_mels, t). t = duration/window_of_fft.
I am trying to understand or calculate:
What is n_fft? I mean what exactly is it doing to the audio wave? I read in the documentation the following:
n_fft : int > 0 [scalar]
length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa.
This means that in each window, 2048 samples are taken, which means that 1/22050 * 2048 = 93 ms. So is the FFT calculated for every 93 ms of the audio?
So, does this mean that the window size and the window function are there to filter the signal in this frame?
In the example above, I understand that I am getting 128 Mel bands, but what exactly does that mean?
And what is hop_length? Reading the docs, I understand that it is how far the window shifts from one FFT window to the next, right? If this value is 512 and n_fft is also 512, what does that mean? Does this mean that it will take a window of 23 ms, calculate the FFT for this window, and then skip to the next 23 ms?
How can I specify that I want one FFT window to overlap with the next?
Please help; I have watched many videos on calculating spectrograms, but I just can't seem to see how it works in practice.
The essential parameter for understanding the output dimensions of spectrograms is not necessarily the length of the FFT used (n_fft), but the distance between consecutive FFTs, i.e., the hop_length.
When computing an STFT, you compute the FFT for a number of short segments. These segments have the length n_fft. Usually these segments overlap (in order to avoid information loss), so the distance between two segments is often not n_fft, but something like n_fft/2. The name for this distance is hop_length. It is also defined in samples.
So when you have 1000 audio samples and the hop_length is 100, you get 10 feature frames (note that, if n_fft is greater than hop_length, you may need to pad).
In your example, you are using the default hop_length of 512. So for audio sampled at 22050 Hz, you get a feature frame rate of
frame_rate = sample_rate/hop_length = 22050 Hz/512 = 43 Hz
Again, padding may change this a little.
So for 10s of audio at 22050 Hz, you get a spectrogram array with the dimensions (128, 430), where 128 is the number of Mel bins and 430 the number of features (in this case, Mel spectra).
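Again, a quick check with synthetic input (the one extra frame comes from librosa's centred padding):

import numpy as np
import librosa

sr = 22050
y = np.random.randn(10 * sr).astype(np.float32)   # 10 s of synthetic noise
S = librosa.feature.melspectrogram(y=y, sr=sr)    # defaults: n_fft=2048, hop_length=512, n_mels=128
print(S.shape)                                    # (128, 431): 128 Mel bins, ~430 frames plus padding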