I need to measure the delay between different streams of the same TV channel on different platforms. The details are as follows:
As is well known, platforms do not show a live TV channel at exactly the same time but within several seconds of each other, for a number of reasons, and the delay differs from one platform to another.
My plan is to record each stream first and then use audio fingerprinting in Python with the dejavu library (the programming language can be changed). But how can I find the delay between two streams using audio fingerprinting? For example, I want to compare the delay of the same TV channel between the web, a mobile platform, and a television. How can I record the streams from these platforms and process them?
I would be happy to hear your suggestions.
It sounds like you'll want to cross-correlate the two signals. The position of the highest peak in the output corresponds to the time delay. Furthermore, the inputs need not be identical, which is handy in this case.
Source and for further explanation: https://dev.to/hiisi13/find-an-audio-within-another-audio-in-10-lines-of-python-1866
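To make the idea concrete, here is a minimal pure-Python sketch on synthetic data. The function name estimate_delay, the noise signals, and the tiny sample rate are all made up for illustration. The brute-force loop is only practical for short signals; for real recordings an FFT-based correlation is the better choice.

```python
import random

def estimate_delay(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `reference`.

    Brute-force cross-correlation: for every candidate lag, sum the products
    of overlapping samples and keep the lag with the largest sum.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        overlap = min(len(reference), len(delayed) - lag)
        score = sum(reference[i] * delayed[i + lag] for i in range(overlap))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Synthetic example: noise stands in for the audio of the faster platform.
rng = random.Random(0)
sample_rate = 100  # Hz, deliberately tiny so the brute-force loop stays fast
reference = [rng.uniform(-1.0, 1.0) for _ in range(500)]

# The slower platform carries the same content 37 samples later, plus some
# extra noise, since the two streams are never bit-identical.
true_lag = 37
delayed = [rng.uniform(-0.1, 0.1) for _ in range(true_lag)] + [
    s + rng.uniform(-0.1, 0.1) for s in reference
]

lag = estimate_delay(reference, delayed, max_lag=100)
delay_seconds = lag / sample_rate
print(f"estimated delay: {lag} samples = {delay_seconds:.2f} s")
```

Dividing the lag by the sample rate converts it to seconds, which is the number you would compare across platforms.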
First, decode both recordings to PCM at a common sample rate that you choose beforehand (e.g. 16 kHz), resampling any recording that has a different rate. A high sample rate is not required since the comparison is fuzzy anyway, but too low a sample rate loses too much detail.
You can use the following code for that:
ffmpeg -i audio1.mkv -ar 16000 -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -ar 16000 -c:a pcm_s24le output2.wav
Then you can use the following script. librosa loads both recordings as floating-point samples at the same sample rate, the script cross-correlates them (computed via FFT), and the position of the highest peak, divided by the sample rate, gives the offset in seconds:
import argparse
import librosa
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt


def find_offset(within_file, find_file, window):
    y_within, sr_within = librosa.load(within_file, sr=None)
    y_find, _ = librosa.load(find_file, sr=sr_within)
    c = signal.correlate(y_within, y_find[:sr_within*window], mode='valid', method='fft')
    peak = np.argmax(c)
    offset = round(peak / sr_within, 2)
    fig, ax = plt.subplots()
    ax.plot(c)
    fig.savefig("cross-correlation.png")
    return offset


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--find-offset-of', metavar='audio file', type=str, help='Find the offset of file')
    parser.add_argument('--within', metavar='audio file', type=str, help='Within file')
    parser.add_argument('--window', metavar='seconds', type=int, default=10, help='Only use first n seconds of a target audio')
    args = parser.parse_args()
    offset = find_offset(args.within, args.find_offset_of, args.window)
    print(f"Offset: {offset}s")


if __name__ == '__main__':
    main()
Related
What I'm trying to do is setup 16 analog input channels, sample them constantly at a given rate and read 1 sample from each channel when calling the read function. Ideally I would like to read the newest sample so I can timestamp it when reading.
The problem is that the readings do not change from read to read, only after a few seconds. If I adjust the sampling speed, I can get to a situation where I get an error saying the software can't keep up with the hardware sampling rate.
Which part of my code is wrong?
import numpy
import nidaqmx
from nidaqmx.stream_readers import AnalogSingleChannelReader, AnalogMultiChannelReader
from nidaqmx.constants import Edge, AcquisitionType
# Create a task and a reader
task = nidaqmx.Task()
values_read = numpy.zeros(16, dtype = numpy.float64)
task.ai_channels.add_ai_current_chan('cDAQ1Mod2/ai0:15')
task.timing.cfg_samp_clk_timing(rate = 1000, active_edge = Edge.RISING, sample_mode = AcquisitionType.CONTINUOUS, samps_per_chan = 1)
reader = AnalogMultiChannelReader(task.in_stream)
task.start()
while 1:
    reader.read_one_sample(values_read)
    print(values_read)
The sample rate is 1000 Hz, but each call reads only one sample per channel. Each read call usually takes a few milliseconds, so the loop cannot drain the buffer as fast as the hardware fills it, hence the buffer overflow error.
Suggestions:
Reduce sample rate.
Read more samples per Read call.
Since you want to read only the latest data and timestamp yourself, you can use the On Demand software timed acquisition. See example ai_voltage_sw_timed.py
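The arithmetic behind the first two suggestions can be sketched in a few lines. The 5 ms per loop iteration is an assumed figure, not a measurement, and backlog_after is a made-up helper. As far as I understand, a read returns the oldest unread sample rather than the newest, so a growing backlog would also explain why the printed values lag behind by seconds.

```python
def backlog_after(seconds, sample_rate_hz, read_time_ms, samples_per_read):
    """Unread samples per channel left in the buffer after `seconds`."""
    produced = sample_rate_hz * seconds
    reads = (seconds * 1000) // read_time_ms
    consumed = min(produced, reads * samples_per_read)
    return produced - consumed

rate = 1000        # Hz, as in the question
read_time_ms = 5   # time per loop iteration (assumed, not measured)

one_at_a_time = backlog_after(3, rate, read_time_ms, samples_per_read=1)
in_chunks = backlog_after(3, rate, read_time_ms, samples_per_read=10)

# Age in seconds of the sample that the next one-sample read would return
staleness = one_at_a_time / rate

print(one_at_a_time, staleness, in_chunks)
```

Under these assumptions the one-sample loop falls 800 samples further behind every second, while reading 10 samples per call keeps up.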
I'm hoping this is an appropriate question for here.
I have used Python Librosa to plot a waveform for a sound file. I'm finding it difficult to extract the data points, e.g. what is the value of y at x (Time) = 0.15 in the output below? I can't see this in the documentation for Librosa, so I'm wondering if this can be done.
Here is the code I have based on Librosa documentation so far:
import librosa
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load('audio.wav')
bpm, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
plt.figure()
librosa.display.waveplot(y, sr=sr)
plt.show()
print (f'bpm: {bpm:.2f} beats per minute')
[Image: waveform plot produced by the code above]
Is it possible to get the x and y axis into an array for example, or at least print a single data point please?
Thank you
librosa.display.waveplot computes and plots the amplitude envelope of the audio signal.
You can see how this is done by looking at the source code of the function (accessible via "view source" on the documentation page for the function).
Here is the relevant part of the code for computing the envelope.
def __envelope(x, hop):
    """Compute the max-envelope of non-overlapping frames of x at length hop

    x is assumed to be multi-channel, of shape (n_channels, n_samples).
    """
    import numpy as np
    x_frame = np.abs(librosa.util.frame(x, frame_length=hop, hop_length=hop))
    return x_frame.max(axis=1)

# Reduce by envelope calculation
env = __envelope(y, hop_length)
Where hop_length is the number of audio samples per point of the envelope.
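Note that the array y returned by librosa.load already holds the individual samples, so the value at a given time is just the element at index time × sample rate. The sketch below illustrates this, and the envelope idea, on a synthetic sine wave in plain Python. The helpers value_at and max_envelope are made up for illustration and are not part of librosa.

```python
import math

sr = 1000  # samples per second (stand-in for the `sr` returned by librosa.load)
# Stand-in for `y`: one second of a 5 Hz sine wave
y = [math.sin(2 * math.pi * 5 * n / sr) for n in range(sr)]

def value_at(samples, sample_rate, t):
    """Return the sample closest to time t (in seconds)."""
    return samples[round(t * sample_rate)]

def max_envelope(samples, hop):
    """Max of |x| over consecutive non-overlapping frames of length hop."""
    return [
        max(abs(s) for s in samples[start:start + hop])
        for start in range(0, len(samples) - hop + 1, hop)
    ]

print(value_at(y, sr, 0.15))      # sample at t = 0.15 s
env = max_envelope(y, hop=100)    # 10 envelope points, one per 0.1 s
times = [i * 100 / sr for i in range(len(env))]  # x-axis values for env
print(list(zip(times, env)))
```

The times list plays the role of the x axis and env the y axis of the plotted envelope.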
I wonder if any of the tools listed in the post How to edit raw PCM audio data without an audio library? might be helpful in getting the raw PCM you are seeking. Ideally, librosa library itself should allow a hook to view the PCM data array itself, but I've found that audio tools sometimes don't actually allow this (for example Clip in Java). If you can't locate a librosa hook, maybe obtaining the PCM array from one of these tools prior to or in parallel to shipping data to librosa would be helpful.
I am working with an audio file using webrtcvad and pydub. Currently, fragments are split at the silence between sentences.
Is there any way to split at word boundaries instead (after each spoken word)?
If librosa, ffmpeg, or pydub has a feature like this, is it possible to split at each vocal part? After the split, I also need the exact start and end time of each vocal part as positioned in the original file.
One simple way to split with ffmpeg is described here:
https://gist.github.com/vadimkantorov/00bf4fbe4323360722e3d2220cc2915e
but this also splits by silence, and the result differs with each padding value or frame size. I am trying to split by vocal part.
As an example, I have done this manually: the original file, the split words, and their time positions in JSON are in a folder available at this link:
www.mediafire.com/file/u4ojdjezmw4vocb/attached_problem.tar.gz
Simple audio segmentation problems can be handled with a Hidden Markov Model, after preprocessing the audio into suitable features. Typical features for speech would be sound level and vocal activity / voicedness. To get word-level segmentation (as opposed to sentence-level), the features need rather high time resolution. Unfortunately, pyWebRTCVAD does not have adjustable time smoothing, so it might not be suited for the task.
In your audio sample there is a radio host speaking rather quickly in German.
Looking at the sound levels with respect to the word boundaries you have marked, it is clear that between some words the sound level doesn't really drop. That rules out a simple sound-level segmentation model.
All in all, getting good results for general speech signals can be quite hard. Fortunately, this is very well researched, and off-the-shelf solutions are available.
These typically use an acoustic model (how words and phonemes sound) as well as a language model (likely orders of words), learned from many hours of audio.
Word segmentation using Speech Recognition library
All of these features are included in speech recognition frameworks, and many of them can produce word-level output with timing. Below is some working code for this using Vosk.
An alternative to Vosk would be PocketSphinx, or an online speech recognition service from Google Cloud, Amazon Web Services, Azure, etc.
import sys
import os
import json
import math

# tested with VOSK 0.3.15
import vosk
import librosa
import numpy
import pandas


def extract_words(res):
    """Return the list of word entries from a VOSK JSON result string."""
    jres = json.loads(res)
    if 'result' not in jres:
        return []
    return jres['result']


def transcribe_words(recognizer, pcm_bytes):
    """Feed PCM data to the recognizer in chunks and collect all word entries."""
    results = []
    chunk_size = 4000
    for chunk_no in range(math.ceil(len(pcm_bytes) / chunk_size)):
        start = chunk_no * chunk_size
        end = min(len(pcm_bytes), (chunk_no + 1) * chunk_size)
        data = pcm_bytes[start:end]
        if recognizer.AcceptWaveform(data):
            results += extract_words(recognizer.Result())
    results += extract_words(recognizer.FinalResult())
    return results


def main():
    vosk.SetLogLevel(-1)

    audio_path = sys.argv[1]
    out_path = sys.argv[2]
    model_path = 'vosk-model-small-de-0.15'
    sample_rate = 16000

    audio, sr = librosa.load(audio_path, sr=sample_rate)
    # convert to 16bit signed PCM, as expected by VOSK
    # (clip first so that a sample of exactly 1.0 does not wrap around)
    int16 = numpy.int16(numpy.clip(audio * 32768, -32768, 32767)).tobytes()

    # XXX: Model must be downloaded from https://alphacephei.com/vosk/models
    # https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
    if not os.path.exists(model_path):
        raise ValueError(f"Could not find VOSK model at {model_path}")

    model = vosk.Model(model_path)
    recognizer = vosk.KaldiRecognizer(model, sample_rate)
    # Newer VOSK releases may need recognizer.SetWords(True) to get word timings

    res = transcribe_words(recognizer, int16)
    df = pandas.DataFrame.from_records(res)
    df = df.sort_values('start')
    df.to_csv(out_path, index=False)
    print('Word segments saved to', out_path)


if __name__ == '__main__':
    main()
Run the program with the .WAV file and the path to an output file.
python vosk_words.py attached_problem/main.wav out.csv
The script outputs words and their times in the CSV. These timings can then be used to split the audio file. Here is example output:
conf,end,start,word
0.618949,1.11,0.84,also
1.0,1.32,1.116314,eine
1.0,1.59,1.32,woche
0.411941,1.77,1.59,des
Comparing the output (bottom) with the example file you provided (top), it looks pretty good.
It actually picked up a word that your annotations did not include, "und" at 42.25 seconds.
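The word timings from the CSV can then be turned into sample ranges for cutting the audio. The following is a small sketch using only the standard library. It embeds the first rows of the example output so that it runs on its own, and word_slices is a made-up helper name.

```python
import csv
import io

# First rows of the example output above; in practice, open out.csv instead.
csv_text = """conf,end,start,word
0.618949,1.11,0.84,also
1.0,1.32,1.116314,eine
1.0,1.59,1.32,woche
0.411941,1.77,1.59,des
"""

def word_slices(csv_file, sample_rate):
    """Yield (word, first_sample, last_sample_exclusive) for each CSV row."""
    for row in csv.DictReader(csv_file):
        start = round(float(row["start"]) * sample_rate)
        end = round(float(row["end"]) * sample_rate)
        yield row["word"], start, end

slices = list(word_slices(io.StringIO(csv_text), sample_rate=16000))
for word, start, end in slices:
    print(f"{word}: samples {start}..{end}")
    # With the audio loaded as an array: segment = audio[start:end]
```

Because the start and end values refer to positions in the original file, each segment keeps its place on the original timeline.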
Delimiting words lies outside the audio domain and requires a kind of intelligence. Doing it manually is easy because we are intelligent and know exactly what we are looking for, but automating the process is hard because, as you already noticed, a silence is not always (and not only) a word delimiter.
At the audio level, we can only approximate a solution, and this requires both analyzing the amplitude of the signal and adding some timing mechanisms. As an example, Pro Tools provides a nice tool named Strip Silence that cuts audio regions automatically based on the amplitude of the signal. It always keeps the material at its original position in the timeline, and naturally each region knows its own duration. In addition to the threshold in dB, and to prevent creating too many regions, it provides several useful parameters in the time domain: a minimum length for the created regions, a delay before the cut (the delay is computed from the point the amplitude passes below the threshold), and an inverted delay before reopening the gate (the delay is computed backward from the point the amplitude passes above the threshold).
This could be a good starting point for you. Implementing such a system probably won't be 100 % successful, but you could obtain quite a good success rate if the settings are well adjusted to the speaker. Even if it's not perfect, it will significantly reduce the need for manual work.
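A gate of this kind can be sketched in a few lines of Python. This is loosely inspired by the parameters described above and is not how Pro Tools actually implements Strip Silence. The function name, the parameter names, and the envelope values are all made up for illustration.

```python
def strip_silence(levels, threshold, min_length, hold):
    """Return (start, end) frame index pairs of regions to keep.

    levels     -- one amplitude value per frame
    threshold  -- frames at or above this level count as sound
    min_length -- regions shorter than this many frames are dropped
    hold       -- gaps of at most this many quiet frames do not split a region
    """
    regions = []
    start = None       # start of the region currently open, if any
    last_loud = None   # index of the most recent loud frame
    for i, level in enumerate(levels):
        if level >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud > hold:
            regions.append((start, last_loud + 1))
            start = None
    if start is not None:
        regions.append((start, last_loud + 1))
    return [(s, e) for s, e in regions if e - s >= min_length]

# Made-up envelope: two words with a brief dip inside the first one,
# then a one-frame click that is too short to be a word.
levels = [0, 5, 6, 1, 7, 6, 0, 0, 0, 8, 9, 7, 0, 0, 0, 9, 0]
regions = strip_silence(levels, threshold=3, min_length=2, hold=1)
print(regions)
```

The hold parameter keeps the short dip from splitting the first word, and min_length discards the click. Tuning both to the speaker is where most of the work lies.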
I'm finding it strangely hard to find a synthesizer module for python that allows the program to play a note at an arbitrary pitch. Preferably the note should be more than just a pure sinewave and should include at least a few harmonics - it should be more than just a beep.
The idea is to be able to write something like
the_module.play(frequency, loudness, duration)
or
my_synth = the_module.newsynth()
my_synth.play(frequency, loudness, duration)
where frequency is specified in Hz, and have a synthesized tone play from the user's speakers. There are JavaScript modules for doing this, such as Tone.js, but does anyone know of an approach using Python?
On Windows, you can use the built-in winsound.Beep.
On Linux, you can write directly to /dev/audio, on systems where that legacy device exists:
def beep(frequency, amplitude, duration):
    # amplitude is a byte value between 0 and 255
    sample = 8000
    half_period = int(sample / frequency / 2)
    # One period of a square wave: high for half a period, then low
    beep = bytes([amplitude]) * half_period + bytes([0]) * half_period
    beep *= int(duration * frequency)
    with open('/dev/audio', 'wb') as audio:
        audio.write(beep)
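For a tone that is more than a plain beep, one option is to synthesize the samples yourself and write them to a WAV file with the standard library. The sketch below adds three harmonics to the fundamental. The function write_tone and the harmonic weights are arbitrary choices for illustration, and playing the resulting file is left to whatever player or audio library is available on the system.

```python
import math
import struct
import wave

def write_tone(path, frequency, loudness, duration, sample_rate=44100):
    """Write a mono 16-bit WAV with a fundamental plus three harmonics.

    loudness is a factor between 0.0 and 1.0; returns the number of samples.
    """
    # Relative strength of harmonics 1..4 (arbitrary choice for a fuller sound)
    weights = [1.0, 0.5, 0.25, 0.125]
    total = sum(weights)
    n_samples = int(sample_rate * duration)
    frames = bytearray()
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(
            w * math.sin(2 * math.pi * frequency * (k + 1) * t)
            for k, w in enumerate(weights)
        ) / total
        frames += struct.pack("<h", int(32767 * loudness * value))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_samples

n_written = write_tone("tone.wav", frequency=440.0, loudness=0.5, duration=0.5)
```

Dividing by the sum of the weights keeps every sample within the 16-bit range regardless of how many harmonics are added.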
I want to sample a radio station that broadcasts in *.m3u8 format and produce the histogram of the first n seconds (where the user chooses n).
I tried using radiopy, but it doesn't work, and gnuradio seems useless. How can I produce and show this histogram?
EDIT: I now use GStreamer 1.0, so I can play the stream directly, but I need to sample the broadcast live. How can I do that using Gst?
"gnuradio seems useless"
Well, I'd argue that this is what you're looking for, if you're looking for a live spectrogram:
As you can see, it's just a matter of connecting a properly configured audio source to a Qt GUI sink (I wrote an answer about configuring it properly, and a GNU Radio wiki page as well).
The point is: you shouldn't be trying to play an internet station by yourself. Let software that knows what it is doing handle that.
In your case, I'd recommend:
Use VLC or mplayer to decode the radio stream to 32-bit float PCM at a fixed sampling rate and write it to a file.
Use Python with NumPy to open that file (samples = numpy.fromfile(filename, dtype=numpy.float32)) and matplotlib/pyplot to plot a spectrogram to a file, i.e. something like (untested, because written right here):
#!/usr/bin/env python3
import sys
import os
import subprocess
import tempfile
import numpy
from matplotlib import pyplot

stream = sys.argv[1]   ## you can pass the stream URL as argument
outfile = sys.argv[2]  ## second argument: output file; ending determines type!
num_of_seconds = min(int(sys.argv[3]), 60)  # not more than 1min of streaming

(intermediate_file, inter_fname) = tempfile.mkstemp()
os.close(intermediate_file)  # mplayer writes to the file by name
# pan = number of output channels (1: mix to mono)
# resample = sampling rate. this must be the same for all files, so that we can actually compare spectrograms
# format = sample format, here: native floats
subprocess.run(["mplayer", "-endpos", str(num_of_seconds), "-vo", "null",
                "-af", "pan=1,resample=44100,format=floatne",
                "-ao", "pcm:nowaveheader:file=%s" % inter_fname, stream], check=True)
samples = numpy.fromfile(inter_fname, dtype=numpy.float32)
os.remove(inter_fname)

# Figure size in inches; adjust to taste
pyplot.figure(figsize=(12, 4))
### Attention: this call to specgram expects of you to understand what the Discrete Fourier Transform does.
### This uses a Hanning window by default; whether that is appropriate for audio data is questionable. Use all your DSP skillz!
### pyplot.specgram has a lot of options, including colormaps, frequency scaling, overlap. Make yourself acquainted with those!
pyplot.specgram(samples, NFFT=256, Fs=44100)
pyplot.savefig(outfile, bbox_inches="tight")
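If a plain amplitude histogram is wanted rather than a spectrogram, the raw float32 file can also be read and binned without NumPy. The sketch below assumes a header-less file of native 32-bit floats, like the one described above, and writes a tiny made-up file first so that it runs on its own. amplitude_histogram is a hypothetical helper.

```python
import array
import os
import tempfile

def amplitude_histogram(samples, bins=4, lo=-1.0, hi=1.0):
    """Count samples per equal-width amplitude bin between lo and hi."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        # Clamp so that values at or beyond the edges land in the outer bins
        index = min(bins - 1, max(0, int((s - lo) / width)))
        counts[index] += 1
    return counts

# Write a tiny made-up raw float32 file, then read it back the same way a
# header-less PCM dump would be read.
fd, raw_path = tempfile.mkstemp(suffix=".raw")
with os.fdopen(fd, "wb") as f:
    array.array("f", [-0.9, -0.2, 0.1, 0.2, 0.6, 0.95]).tofile(f)

samples = array.array("f")
with open(raw_path, "rb") as f:
    samples.fromfile(f, os.path.getsize(raw_path) // samples.itemsize)
os.remove(raw_path)

counts = amplitude_histogram(samples)
print(counts)
```

To restrict the histogram to the first n seconds, slice the samples to n times the sampling rate before binning.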