Audio signal split at word level boundary - python

I am working with audio file using webrtcvad and pydub. The split of any fragment is by silence of the sentence.
Is there any way by which the split can be done at word level boundry condition? (after each spoken word)?
If librosa/ffmpeg/pydub has any feature like this, can split is possible at each vocal? but after split, I need start and end time of the vocal exactly what that vocal part has positioned in the original file.
One simple solution or way to split by ffmpeg is also defined by :
https://gist.github.com/vadimkantorov/00bf4fbe4323360722e3d2220cc2915e
but this is also splitting by silence, and with each padding number or the frame size, the split is different. I am trying split by vocal.
As example, I have done this manually the original file, split words and its time position in json is in a folder provided here under the link:
www.mediafire.com/file/u4ojdjezmw4vocb/attached_problem.tar.gz

Simple audio segmentation problems can be handled by using a Hidden Markov Model, after preprocessing the audio into suitable features. Typical features for speech would be soundlevel, vocal activity / voicedness. To get word-level segmentation (as opposed to sentence), this needs to have rather high time resolution. Unfortunately the pyWebRTCVAD does not have adjustable time smoothening so it might not be suited for the task.
In your audio sample there is a radio host speaking rather quickly in German.
Looking at the soundlevels wrt to the word boundaries you have marked it is clear that between some words the soundlevel doesnt really drop. That rules out a simple soundlevel segmentation model.
All in all, getting good results for general speech signals can be quite hard. But fortunately this is very well researched, and with off-the-shelf solutions being available.
These use typically an acoustic model (how words and phonemes sound), as well as a language model (likely orders of words), learned over many hours of audio.
Word segmentation using Speech Recognition library
All these features are included in a Speech Recognition framework, and many allow to get word-level outputs with timing. Below is some working code for this using Vosk.
Alternatives to Vosk would be PocketSphinx. Or using an online speech recognition service from Google Cloud, Amazon Web Services, Azure Cloud etc.
import sys
import os
import subprocess
import json
import math
# tested with VOSK 0.3.15
import vosk
import librosa
import numpy
import pandas
def extract_words(res):
jres = json.loads(res)
if not 'result' in jres:
return []
words = jres['result']
return words
def transcribe_words(recognizer, bytes):
results = []
chunk_size = 4000
for chunk_no in range(math.ceil(len(bytes)/chunk_size)):
start = chunk_no*chunk_size
end = min(len(bytes), (chunk_no+1)*chunk_size)
data = bytes[start:end]
if recognizer.AcceptWaveform(data):
words = extract_words(recognizer.Result())
results += words
results += extract_words(recognizer.FinalResult())
return results
def main():
vosk.SetLogLevel(-1)
audio_path = sys.argv[1]
out_path = sys.argv[2]
model_path = 'vosk-model-small-de-0.15'
sample_rate = 16000
audio, sr = librosa.load(audio_path, sr=16000)
# convert to 16bit signed PCM, as expected by VOSK
int16 = numpy.int16(audio * 32768).tobytes()
# XXX: Model must be downloaded from https://alphacephei.com/vosk/models
# https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
if not os.path.exists(model_path):
raise ValueError(f"Could not find VOSK model at {model_path}")
model = vosk.Model(model_path)
recognizer = vosk.KaldiRecognizer(model, sample_rate)
res = transcribe_words(recognizer, int16)
df = pandas.DataFrame.from_records(res)
df = df.sort_values('start')
df.to_csv(out_path, index=False)
print('Word segments saved to', out_path)
if __name__ == '__main__':
main()
Run the program with the .WAV file and the path to an output file.
python vosk_words.py attached_problem/main.wav out.csv
The script outputs words and their times in the CSV. These timings can then be used to split the audio file. Here is example output:
conf,end,start,word
0.618949,1.11,0.84,also
1.0,1.32,1.116314,eine
1.0,1.59,1.32,woche
0.411941,1.77,1.59,des
Comparing the output (bottom) with the example file you provided (top), it looks pretty good.
It actually picked up a word that your annotations did not include, "und" at 42.25 seconds.

Delimiting words is out of the audio domain and requires a kind of intelligence. Doing it manually is easy because we are intelligent and know exactly what we are looking for, but automatizing the process is hard because, as you already noticed, a silence is not (not only, not always) a word delimiter.
At audio level, we can only approach a solution and this require both analyzing the amplitude of the signal and adding some time mechanisms. As an example, Protools provides a nice tool named Strip Silence that cuts audio regions automatically based on the amplitude of the signal. It always keeps the material at its original position in the timeline and naturally each region knows its own duration. In addition to the threshold in dB, and to prevent creating too much regions, it provides several useful parameters in the time domain : a minimum length for the created regions, a delay before the cut (the delay is computed from the point the amplitude passes below the threshold), an inverted delay before reopening the gate (the delay is computed backward from the point the amplitude passes above the threshold).
This could be a good starting point for you. Implementing such a system probably won't be 100 % successful, but you could obtain a quite good ratio if the settings are well adjusted to the speaker. Even if it's not perfect, it will significantly reduce the need for manual work.

Related

Inaccurate real-time audio FFT interpretation with Python

I'm trying to use Python to create a live music visualization. The libraries I'm using are SoundCard (for live audio capture) and Librosa (for short-time Fourier transform).
However I suspect I'm not interpreting the audio data correctly. Looking at the 100Hz-200Hz bin, I get a constant stream of sound even when the song doesn't contain that much bass (or really, any whatsoever). I admit I am a bit in over my head with all the audio processing FFT stuff, since it's not really my expertise and the math beats me most of the time.
This is the function that captures and analyses the audio. lb is set to the speakers and works properly. Fs is set to 48000 and I record 1000 frames in the attempt of keeping 48FPS. fftwindowsize is set to 2048*8 because... I'm not sure. I increased the number until Librosa stopped throwing warnings.
def audioanalysis():
with lb.recorder(samplerate=Fs) as mic:
rawdata = mic.record(numframes=1000)
datalen: int = int(rawdata.size/2)
monodata = numpy.empty(datalen)
for x in range(0, datalen):
monodata[x] = max(rawdata[x][0], rawdata[x][1])
data = numpy.abs(librosa.stft(monodata, n_fft=fftwindowsize, hop_length=1024))
return librosa.amplitude_to_db(data, ref=numpy.max)
And the code for making buckets:
frequencies = librosa.core.fft_frequencies(n_fft=fftwindowsize)
freq_index_ratio = len(frequencies)/frequencies[len(frequencies)-1] / 2
[...]
for i in range(0,buckets):
avg = 0
for j in range (i * bucketsize, (i+1)*bucketsize):
avg += amp(spectrogram=spectrogram, freq=j)
amps[i] = avg/bucketsize
def amp(spectrogram, freq) -> float:
return spectrogram[int(freq*freq_index_ratio)]
Over the course of a song, amps[1] (so 100Hz-200Hz) stays in the -50dB to -30dB range, which isn't really useful or representative of the song playing.
Is my FFT analysis wrong? Is there no way to better interpret short samples of sound?
P.S. I know my Python code isn't excellent. This is my first project in Python :)

Producing histogram of sound for m3u of radio station

I want to sample a radio station which broadcasts in format *.m3u8 and to produce the histogram of the first n seconds (where the user fixes n).
I had been trying using radiopy but it doesn't work and gnuradio seems useless. How can I produce and show this histogram?
EDIT: Now I use Gstreamer v1.0 so I can play it directly but now I need to live-sample my broadcast. How can I do it using Gst?
gnuradio seems useless
Well, I'd argue that this is what you're looking for, if you're looking for a live spectrogram:
As you can see, it's but a matter of connecting a properly configured audio source to a Qt GUI sink. If properly configured (I wrote an answer about that, and a GNU Radio wiki page as well).
Point is: you shouldn't be trying to play an internet station by yourself. Let a software do that which knows what it is doing.
In your case, I'd recommend:
using VLC or mplayer to write the radio, after decoding it to PCM 32bit float of a fixed sampling rate to a file.
Use Python with the libraries Numpy to open that file (samples = numpy.fromfile(filename, dtype=numpy.float32)), and matplotlib/pyplot to plot a spectrogram to a file, i.e. something like (untested, because written right here):
#!/usr/bin/python2
import sys
import os
import tempfile
import numpy
from matplotlib import pyplot
stream = sys.argv[1] ## you can pass the stream URL as argument
outfile = sys.argv[2] ## second argument: output file; ending determines type!
num_of_seconds = min(int(sys.argv[3]), 60) # not more than 1min of streaming
(intermediate_file, inter_fname) = tempfile.mkstemp()
# pan = number of output channels (1: mix to mono)
# resample = sampling rate. this must be the same for all files, so that we can actually compare spectrograms
# format = sample format, here: native floats
sys.system("mplayer -endpos %d -vo null -af pan=1 -af resample=441000 -af format=floatne -ao pcm:nowaveheader:file=%s" % num_of_seconds % inter_fname)
samples = numpy.fromfile(inter_fname, dtype=float32)
pyplot.figure((num_of_seconds * 44100, 256), dpi=1)
### Attention: this call to specgram expects of you to understand what the Discrete Fourier Transform does.
### This uses a Hanning window by default; whether that is appropriate for audio data is questionable. Use all your DSP skillz!
### pyplot.specgram has a lot of options, including colormaps, frequency scaling, overlap. Make yourself acquintanced with those!
pyplot.specgram(samples, NFFT=256, FS=44100)
pyplot.savefig(outfile, bbox_inches="tight")

Using pyDub to chop up a long audio file

I'd like to use pyDub to take a long WAV file of individual words (and silence in between) as input, then strip out all the silence, and output the remaining chunks is individual WAV files. The filenames can just be sequential numbers, like 001.wav, 002.wav, 003.wav, etc.
The "Yet another Example?" example on the Github page does something very similar, but rather than outputting separate files, it combines the silence-stripped segments back together into one file:
from pydub import AudioSegment
from pydub.utils import db_to_float
# Let's load up the audio we need...
podcast = AudioSegment.from_mp3("podcast.mp3")
intro = AudioSegment.from_wav("intro.wav")
outro = AudioSegment.from_wav("outro.wav")
# Let's consider anything that is 30 decibels quieter than
# the average volume of the podcast to be silence
average_loudness = podcast.rms
silence_threshold = average_loudness * db_to_float(-30)
# filter out the silence
podcast_parts = (ms for ms in podcast if ms.rms > silence_threshold)
# combine all the chunks back together
podcast = reduce(lambda a, b: a + b, podcast_parts)
# add on the bumpers
podcast = intro + podcast + outro
# save the result
podcast.export("podcast_processed.mp3", format="mp3")
Is it possible to output those podcast_parts fragments as individual WAV files? If so, how?
Thanks!
The example code is pretty simplified, you'll probably want to look at the strip_silence function:
https://github.com/jiaaro/pydub/blob/2644289067aa05dbb832974ac75cdc91c3ea6911/pydub/effects.py#L98
And then just export each chunk instead of combining them.
The main difference between the example and the strip_silence function is the example looks at one millisecond slices, which doesn't count low frequency sound very well since one waveform of a 40hz sound, for example, is 25 milliseconds long.
The answer to your original question though, is that all those slices of the original audio segment are also audio segments, so you can just call the export method on them :)
update: you may want to take a look at the silence utilities I've just pushed up into the master branch; especially split_on_silence() which could do this (assuming the right specific arguments) like so:
from pydub import AudioSegment
from pydub.silence import split_on_silence
sound = AudioSegment.from_mp3("my_file.mp3")
chunks = split_on_silence(sound,
# must be silent for at least half a second
min_silence_len=500,
# consider it silent if quieter than -16 dBFS
silence_thresh=-16
)
you could export all the individual chunks as wav files like this:
for i, chunk in enumerate(chunks):
chunk.export("/path/to/ouput/dir/chunk{0}.wav".format(i), format="wav")
which would make output each one named "chunk0.wav", "chunk1.wav", "chunk2.wav", and so on

Trimming audio in python

ok so im using pyaudio as well but from what I been looking at the wave module could maybe help me out here.
So im trying to add a trimming function on my program, what I mean is im trying to allow the user to find parts of a wav. that he/she doesn't like and has the ability to trim the wave file to however he/she wants it.
so far i been using pyaudio for just simple playback and pyaudio is really easy when it comes to recording from an input device.
I been searching on pyaudio on anything I can do to trim audio but i really havent found anything that can help me. Though on the embedded wave module I see there are ways to set position.
Would I have to have a loop or if statement done so that the program would know which positions to record and then have either pyaudio or the wave module to record the song from user set positions (beginning, end)? Would my program run efficiently if I approached it this way?
Let's assume that you read in the wave file using scipy.
Then you need to have "edit points" These are in and out values (in seconds for example) that the user would like to keep. you could get these from a file or from displaying the wave file and getting clicks from the mouse. If the user gives parts of the audio file that should be removed then this would need to be first negated
This is not the most efficient solution but should be ok for many scenarios.
import scipy.io.wavfile
fs1, y1 = scipy.io.wavfile.read(filename)
l1 = numpy.array([ [7.2,19.8], [35.3,67.23], [103,110 ] ])
l1 = ceil(l1*fs1)#get integer indices into the wav file - careful of end of array reading with a check for greater than y1.shape
newWavFileAsList = []
for elem in l1:
startRead = elem[0]
endRead = elem[1]
if startRead >= y1.shape[0]:
startRead = y1.shape[0]-1
if endRead >= y1.shape[0]:
endRead = y1.shape[0]-1
newWavFileAsList.extend(y1[startRead:endRead])
newWavFile = numpy.array(newWavFileAsList)
scipy.io.wavfile.write(outputName, fs1, newWavFile)

Python activity logging

I have a question regarding logging for somescript.py
The script performs some actions to find matches for words the user is looking for in some pages that have become unreadable due to re-formatting and printing of the pages.
Because of this, OCR techniques don't work for us anymore so i've come up with a script that compares countours of words to find matches.
the script looks something like:
import cv2
from cv2 import *
import numpy as np
method = cv.CV_TM_SQDIFF_NORMED
template_name = "this.png"
image_name = "3.tif"
needle = cv2.imread(template_name)
haystack = cv2.imread(image_name)
# Convert to gray:
needle_g = cv2.cvtColor(needle, cv2.CV_32FC1)
haystack_g = cv2.cvtColor(haystack, cv2.CV_32FC1)
# Attempt match
d = cv2.matchTemplate(needle_g, haystack_g, cv2.cv.CV_TM_SQDIFF_NORMED)
#we want the minimum squared difference
mn,_,mnLoc,_ = cv2.minMaxLoc(d)
print mnLoc
# Draw the rectangle
MPx,MPy = mnLoc
trows,tcols = needle_g.shape[:2]
#Normed methods give better results, ie matchvalue = [1,3,5], others sometimes showserrors
cv2.rectangle(haystack, (MPx,MPy),(MPx+tcols,MPy+trows),(0,0,255),2)
cv2.imshow('output',haystack)
cv2.waitKey(0)
import sys
sys.exit(0)
Now i want to log the various tasks that the script performs, like
converting the image to grayscale
attempting a match
drawing the rectangle
I have seen a few scripts on stackoverflow explaining how to log an entire script or the entire output but i haven't found anything that just logs a few actions.
Also i would like to add the date and time the activity was performed.
Furthermore i have wrote a function that calculates an MD5 and SHA1 hash of the input file, for this particular case, that is for 'this.png' and '3.tif', I have yet to implement this piece of code but would it be easy to log that as well?
I am a python-noob so if the anwsers are obvious to you guys you know why i couldn't figure it out myself.
I hope you can help me out on this one!

Categories