Inaccurate real-time audio FFT interpretation with Python - python

I'm trying to use Python to create a live music visualization. The libraries I'm using are SoundCard (for live audio capture) and Librosa (for short-time Fourier transform).
However I suspect I'm not interpreting the audio data correctly. Looking at the 100Hz-200Hz bin, I get a constant stream of sound even when the song doesn't contain that much bass (or really, any whatsoever). I admit I am a bit in over my head with all the audio processing FFT stuff, since it's not really my expertise and the math beats me most of the time.
This is the function that captures and analyses the audio. lb is set to the speakers and works properly. Fs is set to 48000 and I record 1000 frames in an attempt to keep 48 FPS. fftwindowsize is set to 2048*8 because... I'm not sure. I increased the number until Librosa stopped throwing warnings.
def audioanalysis():
    with lb.recorder(samplerate=Fs) as mic:
        rawdata = mic.record(numframes=1000)
    datalen: int = int(rawdata.size/2)
    monodata = numpy.empty(datalen)
    for x in range(0, datalen):
        monodata[x] = max(rawdata[x][0], rawdata[x][1])
    data = numpy.abs(librosa.stft(monodata, n_fft=fftwindowsize, hop_length=1024))
    return librosa.amplitude_to_db(data, ref=numpy.max)
And the code for making buckets:
frequencies = librosa.core.fft_frequencies(n_fft=fftwindowsize)
freq_index_ratio = len(frequencies)/frequencies[len(frequencies)-1] / 2
[...]
for i in range(0,buckets):
    avg = 0
    for j in range (i * bucketsize, (i+1)*bucketsize):
        avg += amp(spectrogram=spectrogram, freq=j)
    amps[i] = avg/bucketsize
def amp(spectrogram, freq) -> float:
    return spectrogram[int(freq*freq_index_ratio)]
Over the course of a song, amps[1] (so 100Hz-200Hz) stays in the -50dB to -30dB range, which isn't really useful or representative of the song playing.
Is my FFT analysis wrong? Is there no way to better interpret short samples of sound?
P.S. I know my Python code isn't excellent. This is my first project in Python :)
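One thing worth checking in a setup like this is the mapping from frequency to FFT bin. The sketch below is not the code from the question: it uses plain NumPy instead of Librosa, and the name band_level_db and its parameters are made up for illustration. It assumes a mono block of float samples and derives the bin frequencies from the actual sample rate (48000 in the question). Note that librosa.fft_frequencies assumes sr=22050 when no sample rate is passed.

```python
import numpy as np

def band_level_db(samples, sample_rate, f_lo, f_hi):
    """Mean FFT magnitude, in dB, of the bins with f_lo <= f < f_hi (Hz).

    Hypothetical helper for illustration; not part of Librosa or SoundCard.
    """
    samples = np.asarray(samples, dtype=float)
    # A Hann window reduces leakage from strong tones into distant bins.
    windowed = samples * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed))
    # Bin centre frequencies depend on the real sample rate and block length.
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    if not mask.any():
        raise ValueError("no FFT bins fall in this band; use a longer block")
    # Small offset avoids log10(0) on pure silence.
    return 20 * np.log10(np.mean(spectrum[mask]) + 1e-12)
```

The bin spacing is sample_rate / block_length, so a 1000-frame block at 48000 Hz has bins 48 Hz apart and only two of them (144 Hz and 192 Hz) fall inside 100Hz-200Hz. Keeping a rolling buffer of the last few blocks is one way to get finer resolution at the same frame rate.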

Related

Use microphone in Raspberry Pi Pico (Micropython)

How can I use MicroPython firmware alongside a MAX9814?
I have written the code below, but I can't hear a clear voice in Audacity...
from machine import Pin, ADC
import ustruct , time
analog_value = machine.ADC(26)
conversion_factor =3.3/(65536)
samples = []
while True:
    reading = analog_value.read_u16()*conversion_factor
    samples.append(int(reading)) #print("ADC: ",reading)
    time.sleep(0.002)
with open('Voice.bin', 'wb') as output:
    for sample in samples:
        output.write(struct.pack('<h', sample))
Try changing
conversion_factor = 3.3/(65536)
to
conversion_factor = 3.3/(4096)
This is because, although the ADC result is returned as a 16-bit integer the actual result is only the lower 12 bits - it is a 12-bit ADC!
Using 65536 (16 bits), the resulting audio will seem quiet as it is only capable of reaching 1/16 of the full-scale range of a 16-bit value.
I would also suggest using the Normalise effect in Audacity, bearing in mind the audio will always sound a bit noisy.
A further point to bear in mind is that your sample rate is unlikely to be 100% stable when the timing is done in code. If you want hardware-timed audio, it is worth learning to use DMA, e.g. https://iosoft.blog/2021/10/26/pico-adc-dma/
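A separate thing to look at is the conversion to volts before packing. With either factor, int(reading) truncates a voltage, so only a handful of distinct integer values ever reach the file. The MicroPython documentation describes ADC.read_u16() as returning a value scaled to the range 0-65535, so a possible alternative is to skip the voltage step and recentre the raw reading into signed 16-bit PCM. The sketch below is plain Python so it can be tried off-device; the helper names are made up for illustration, and on the Pico the readings would come from analog_value.read_u16().

```python
import struct

def u16_to_pcm16(raw):
    """Map an unsigned 0..65535 reading to signed 16-bit PCM (-32768..32767)."""
    return raw - 32768

def pack_samples(readings):
    """Pack raw unsigned readings as little-endian signed 16-bit samples."""
    return b"".join(struct.pack("<h", u16_to_pcm16(r)) for r in readings)
```

The resulting bytes can be imported into Audacity as raw signed 16-bit little-endian mono data. The sample rate to enter there is only approximate, since the loop timing is done in software.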

Audio signal split at word level boundary

I am working with an audio file using webrtcvad and pydub. Fragments are currently split at the silences between sentences.
Is there any way to split at word-level boundaries instead (after each spoken word)?
Does librosa/ffmpeg/pydub have a feature like this, so that a split is possible at each vocal? After the split, I also need the exact start and end time of each vocal part as positioned in the original file.
One simple way to split with ffmpeg is described here:
https://gist.github.com/vadimkantorov/00bf4fbe4323360722e3d2220cc2915e
but this also splits by silence, and the split changes with each padding value or frame size. I am trying to split by vocal.
As an example, I have done this manually. The original file, the split words and their time positions in JSON are in a folder provided under this link:
www.mediafire.com/file/u4ojdjezmw4vocb/attached_problem.tar.gz
Simple audio segmentation problems can be handled with a Hidden Markov Model, after preprocessing the audio into suitable features. Typical features for speech would be sound level and vocal activity / voicedness. To get word-level segmentation (as opposed to sentence-level), this needs rather high time resolution. Unfortunately pyWebRTCVAD does not have adjustable time smoothing, so it might not be suited for the task.
In your audio sample there is a radio host speaking rather quickly in German.
Looking at the sound levels with respect to the word boundaries you have marked, it is clear that between some words the sound level doesn't really drop. That rules out a simple sound-level segmentation model.
All in all, getting good results for general speech signals can be quite hard. Fortunately this is very well researched, and off-the-shelf solutions are available.
These typically use an acoustic model (how words and phonemes sound) as well as a language model (likely orders of words), learned over many hours of audio.
Word segmentation using Speech Recognition library
All these features are included in a Speech Recognition framework, and many of them can produce word-level output with timing. Below is some working code for this using Vosk.
An alternative to Vosk would be PocketSphinx, or an online speech recognition service from Google Cloud, Amazon Web Services, Azure Cloud etc.
import sys
import os
import subprocess
import json
import math
# tested with VOSK 0.3.15
import vosk
import librosa
import numpy
import pandas
def extract_words(res):
    jres = json.loads(res)
    if not 'result' in jres:
        return []
    words = jres['result']
    return words
def transcribe_words(recognizer, bytes):
    results = []
    chunk_size = 4000
    for chunk_no in range(math.ceil(len(bytes)/chunk_size)):
        start = chunk_no*chunk_size
        end = min(len(bytes), (chunk_no+1)*chunk_size)
        data = bytes[start:end]
        if recognizer.AcceptWaveform(data):
            words = extract_words(recognizer.Result())
            results += words
    results += extract_words(recognizer.FinalResult())
    return results
def main():
    vosk.SetLogLevel(-1)
    audio_path = sys.argv[1]
    out_path = sys.argv[2]
    model_path = 'vosk-model-small-de-0.15'
    sample_rate = 16000
    audio, sr = librosa.load(audio_path, sr=16000)
    # convert to 16bit signed PCM, as expected by VOSK
    int16 = numpy.int16(audio * 32768).tobytes()
    # XXX: Model must be downloaded from https://alphacephei.com/vosk/models
    # https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
    if not os.path.exists(model_path):
        raise ValueError(f"Could not find VOSK model at {model_path}")
    model = vosk.Model(model_path)
    recognizer = vosk.KaldiRecognizer(model, sample_rate)
    res = transcribe_words(recognizer, int16)
    df = pandas.DataFrame.from_records(res)
    df = df.sort_values('start')
    df.to_csv(out_path, index=False)
    print('Word segments saved to', out_path)
if __name__ == '__main__':
    main()
Run the program with the .WAV file and the path to an output file.
python vosk_words.py attached_problem/main.wav out.csv
The script outputs words and their times in the CSV. These timings can then be used to split the audio file. Here is example output:
conf,end,start,word
0.618949,1.11,0.84,also
1.0,1.32,1.116314,eine
1.0,1.59,1.32,woche
0.411941,1.77,1.59,des
Comparing the output (bottom) with the example file you provided (top), it looks pretty good.
It actually picked up a word that your annotations did not include, "und" at 42.25 seconds.
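Using the timings to cut the audio could look roughly like the sketch below. It assumes the audio is already loaded as a one-dimensional NumPy array at a known sample rate, and that each word is a dict with "word", "start" and "end" keys in seconds, matching the CSV columns above. The function name split_by_words is made up for illustration.

```python
import numpy as np

def split_by_words(audio, sample_rate, words):
    """Return a list of (word, samples) pairs cut at the given timings.

    `words` is a list of dicts with "word", "start" and "end" (seconds).
    """
    segments = []
    for w in words:
        # Convert seconds to sample indices in the original signal.
        start = int(round(w["start"] * sample_rate))
        end = int(round(w["end"] * sample_rate))
        segments.append((w["word"], audio[start:end]))
    return segments
```

Because the slices index into the original array, the start and end times of each piece stay tied to positions in the original file.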
Delimiting words is outside the audio domain and requires a kind of intelligence. Doing it manually is easy because we are intelligent and know exactly what we are looking for, but automating the process is hard because, as you already noticed, a silence is not (not only, not always) a word delimiter.
At the audio level we can only approximate a solution, and this requires both analyzing the amplitude of the signal and adding some timing mechanisms. As an example, Pro Tools provides a nice tool named Strip Silence that cuts audio regions automatically based on the amplitude of the signal. It always keeps the material at its original position in the timeline, and naturally each region knows its own duration. In addition to the threshold in dB, and to prevent creating too many regions, it provides several useful parameters in the time domain: a minimum length for the created regions, a delay before the cut (computed from the point where the amplitude passes below the threshold), and an inverted delay before reopening the gate (computed backward from the point where the amplitude passes above the threshold).
This could be a good starting point for you. Implementing such a system probably won't be 100% successful, but you could obtain quite a good ratio if the settings are well adjusted to the speaker. Even if it's not perfect, it will significantly reduce the need for manual work.
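A very simplified gate in that spirit might look like the sketch below. It is not the Pro Tools algorithm: it only implements a dB threshold on per-frame RMS, a hold time that bridges short dips below the threshold, and a minimum region length. The function name, parameter names and default values are all assumptions chosen for illustration.

```python
import numpy as np

def strip_silence(audio, sample_rate, threshold_db=-40.0,
                  min_region_s=0.05, hold_s=0.02, frame_s=0.01):
    """Return (start_s, end_s) regions whose RMS level exceeds threshold_db."""
    audio = np.asarray(audio, dtype=float)
    frame = max(1, int(frame_s * sample_rate))       # samples per frame
    n_frames = len(audio) // frame
    hold_frames = int(round(hold_s / frame_s))       # quiet frames tolerated
    regions, start, quiet = [], None, 0
    for i in range(n_frames):
        block = audio[i * frame:(i + 1) * frame]
        level = 20 * np.log10(np.sqrt(np.mean(block ** 2)) + 1e-12)
        if level >= threshold_db:
            if start is None:
                start = i                            # gate opens
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hold_frames:                  # dip lasted too long
                regions.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:                            # region runs to the end
        regions.append((start, n_frames - quiet))
    to_s = frame / sample_rate
    # Drop regions shorter than the minimum length, convert frames to seconds.
    return [(s * to_s, e * to_s) for s, e in regions
            if (e - s) * to_s >= min_region_s]
```

The returned times are relative to the start of the original signal, so each region keeps its original position. The threshold and hold time would need tuning per speaker, as described above.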

Read left channel of wav data into numpy array

I'm using pyaudio to take input from a microphone or read a wav file, and analyze the stream while playing it. I want to only analyze the right channel if the input is stereo. I've been able to extract the data and convert to integers using loops:
levels = []
length = len(data)
if channels == 1:
    for i in range(length//2):
        volume = abs(struct.unpack('<h', data[i:i+2])[0])
        levels.append(volume)
elif channels == 2:
    for i in range(length//4):
        j = 4 * i + 2
        volume = abs(struct.unpack('<h', data[j:j+2])[0])
        levels.append(volume)
I think this is working correctly. I know it runs without error on a laptop and a Raspberry Pi 3, but it appears to consume too much time to run on a Raspberry Pi Zero when simultaneously streaming the output to a speaker. I figure that eliminating the loop and using numpy may help. I assume I need to use np.ndarray to do this, and the first parameter will be (CHUNK,) where CHUNK is my chunk size for analyzing the audio (I'm using 1024). And the format would be '<h', as in the struct code above, I think. But I'm at a loss as to how to code it correctly for each of the two cases (mono and right channel only for stereo). How do I create the numpy arrays for each of the two cases?
You are reading 16-bit integers from a binary file. It seems that you first read the data into the data variable with something like data = f.read(), which is not shown here. Then you do:
for i in range(length//2):
    volume = abs(struct.unpack('<h', data[i:i+2])[0])
    levels.append(volume)
BTW, that code is wrong. It should be abs(struct.unpack('<h', data[2*i:2*i+2])[0]), otherwise you are overlapping bytes from different values.
To do the same with numpy, you should just do this (instead of both f.read() and the whole loop):
data = np.fromfile(f, dtype='<i2')
This is over 100 times faster than the manual thing above in my test on 5 MB of data.
In the second case, you have interleaved left-right-left-right values. Again you can read them all (assuming you have enough memory) and then access only one half:
data = np.fromfile(f, dtype='<i2')
left = data[::2]
right = data[1::2]
This processes everything, even though you need just one half, but it is still much much faster.
EDIT: If the data is not coming from a file, np.fromfile can be replaced with np.frombuffer. Then you have this:
channel_data = np.frombuffer(data, dtype='<i2')
if channels == 2:
    channel_data = channel_data[1::2]
levels = np.abs(channel_data)
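A variant that handles both cases with the same code path is to reshape the interleaved samples into one row per frame and pick a column. This is only a sketch: the function name channel_levels is made up, and it assumes data holds whole frames of little-endian 16-bit samples. Widening to int32 before taking the absolute value avoids the overflow that np.abs produces for -32768 in int16.

```python
import numpy as np

def channel_levels(data, channels, channel=-1):
    """Absolute sample values of one channel from interleaved 16-bit PCM bytes.

    channel=-1 selects the last channel (the right one for stereo, or the
    only one for mono).
    """
    samples = np.frombuffer(data, dtype="<i2")
    frames = samples.reshape(-1, channels)   # one row per frame
    return np.abs(frames[:, channel].astype(np.int32))
```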

Making a synthesized sound at arbitrary pitch in python

I'm finding it strangely hard to find a synthesizer module for python that allows the program to play a note at an arbitrary pitch. Preferably the note should be more than just a pure sinewave and should include at least a few harmonics - it should be more than just a beep.
The idea is to be able to write something like
the_module.play(frequency, loudness, duration)
or
my_synth = the_module.newsynth()
my_synth.play(frequency, loudness, duration)
where frequency is specified in Hz, and have a synthesized tone play from the user's speakers. There are JavaScript modules for doing this, such as Tone.js, but does anyone know of an approach using Python?
If on Windows, you can use the built-in winsound.Beep.
If on Linux, you need to write directly to /dev/audio, as suggested here:
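def beep(frequency, amplitude, duration):
    # Square wave for /dev/audio (8000 samples per second, one byte each).
    # amplitude must be an integer in 0..255.
    sample = 8000
    half_period = int(sample/frequency/2)
    # Python 3: build the waveform from bytes rather than chr() strings
    beep = bytes([amplitude])*half_period + bytes([0])*half_period
    beep *= int(duration*frequency)
    # open() replaces the Python 2 file() builtin
    with open('/dev/audio', 'wb') as audio:
        audio.write(beep)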
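For a tone that is more than a beep, one portable option is to build the waveform yourself and write it to a WAV file, then play that file with whatever player or audio library is available on the platform. The sketch below sums a fundamental and two overtones using NumPy and the standard-library wave module. The function names, the harmonic weights and the 44100 Hz sample rate are assumptions chosen for illustration, and loudness is taken to be a value between 0 and 1.

```python
import wave
import numpy as np

def synth_tone(frequency, loudness, duration, sample_rate=44100,
               harmonics=(1.0, 0.5, 0.25)):
    """Return int16 samples of a tone at `frequency` Hz with a few overtones."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    # harmonics[k] is the weight of the (k+1)-th multiple of the fundamental.
    tone = sum(a * np.sin(2 * np.pi * frequency * (k + 1) * t)
               for k, a in enumerate(harmonics))
    tone = tone / sum(harmonics)             # keep the peak within [-1, 1]
    scale = min(max(loudness, 0.0), 1.0) * 32767
    return (tone * scale).astype(np.int16)

def write_wav(path, samples, sample_rate=44100):
    """Write mono 16-bit samples to a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)                   # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(samples.astype("<i2").tobytes())
```

For example, write_wav("a4.wav", synth_tone(440, 0.5, 1.0)) would produce one second of A4 at half of full scale.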

How to control a sound card programmatically?

I'm playing with pyaudio on a mac using a Saffire Pro 40 sound card.
Currently I have two inputs plugged in and I'd like to control the levels of the second input channel programmatically. (This works fine using the sound card's mix control software).
I've been going through the pyaudio docs, but haven't found anything glaring on this issue so far. What's the simplest way to essentially do what the mix control software does (control volume per channel) programmatically? (A Python API would be nice, but not essential)
To simplify: it looks like it's possible to manually read the streams from the channels I want to control, scale them using numpy, then write them as output, but I'm hoping there is a method to simply send a normalized value per channel to control it.
So instead of something like this:
stream1 = pyaudioInstance.open(
    format = FORMAT,
    channels = CHANNELS,
    rate = RATE,
    input = True,
    output = True,
    input_device_index = 0,
    frames_per_buffer = CHUNK
)
stream2 = pyaudioInstance.open(
    format = FORMAT,
    channels = CHANNELS,
    rate = RATE,
    input = True,
    input_device_index = 1,
    frames_per_buffer = CHUNK
)
while processingAudio:
    # manually fetch each channel
    data1In = stream1.read(CHUNK)
    data2In = stream2.read(CHUNK)
    # convert to numpy to easily scale the arrays
    decodeddata1 = numpy.frombuffer(data1In, numpy.int16)
    decodeddata2 = numpy.frombuffer(data2In, numpy.int16)
    newdata = (decodeddata1 * 0.5 + decodeddata2 * 0.1).astype(numpy.int16)
    # finally write the processed data
    stream1.write(newdata.tobytes())
This is a bit misleading but I would need to mix separate channels from the same input device index. However what I'm hoping is something like:
someSoundCardAPI.channels[0].setVolume(0.2)
Having a look at the Channel Maps example feels closer to what I'm after. At the moment I find the host_api_specific part of API a bit confusing and I was hoping someone already has some experience successfully using this.
I am using OSX 10.10
I don't really have any experience with OSX, so I don't know, but normally you can remote-control everything with AppleScript.
See, for example, this question.
It doesn't say how to control the volume of a single channel separately, though.
Probably you should ask there ...
Regarding the inferior work-around, you can use python-sounddevice to create a little (untested) Python script:
import sounddevice as sd
def callback(indata, outdata, *stuff):
    outdata[:] = indata * [1, 0.5]
with sd.Stream(channels=2, callback=callback):
    input()
This script will run until you press <Return> and it will reduce the volume of the second channel.
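If you stay with PyAudio and both channels arrive interleaved from a single device, the same per-channel scaling can be applied to the raw bytes of each block. This is only a sketch of the manual workaround, not a way to control the sound card's own mixer. The function name scale_channels is made up, and it assumes 16-bit samples (paInt16) with one gain per channel.

```python
import numpy as np

def scale_channels(raw_bytes, gains):
    """Scale each channel of interleaved int16 audio by its own gain."""
    channels = len(gains)
    # One row per frame, one column per channel.
    frames = np.frombuffer(raw_bytes, dtype=np.int16).reshape(-1, channels)
    scaled = frames.astype(np.float32) * np.asarray(gains, dtype=np.float32)
    # Clip before converting back so gains above 1 cannot wrap around.
    return np.clip(scaled, -32768, 32767).astype(np.int16).tobytes()
```

Inside the read loop this would be used as stream.write(scale_channels(stream.read(CHUNK), [1.0, 0.2])).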
