Automatically aligning audio tracks with timings for dubbing screencasts - Python

We have some screencasts that need to be dubbed into various languages, for which we have a textual script in the target language, as shown below:
Beginning Time    Audio Narration
0:0               blah nao lorep iposm...
1:20              xao dok dkjv dwv....
...
We can record each of the above units separately and then align them at the proper beginning times given in the script.
Example:
Input:
Input the N timing values: 0:0,1:20 ...
Then input the N audio recordings
Output:
Audio recordings aligned to the above timings. An overflow (a recording running past the next start time) should be detected by the system, whereas an underflow is padded with silence.
Are there any platform-independent audio APIs/software, or a code snippet preferably in Python, that would allow us to align these audio units based on the times provided?

If the input audio files are uncompressed (i.e., WAV files, etc.), the audio library I like to use is libsndfile. It appears to have a Python wrapper here: https://code.google.com/p/libsndfile-python/. With that in mind, the rest could be accomplished as follows:
Open an output audio stream to write audio data to with libsndfile
For each input audio file, open an input stream with libsndfile
Extract the meta-data information for the given audio file based on your textual description 'script'
Write any silence needed to your master output stream, then write the data from the input stream to the output stream and note the current position/time. Repeat this step for each input audio file, checking that each clip's target start time is >= the position/time noted earlier; if not, you have an overlap.
Of course, you have to worry about sample rates matching, etc., but that should be enough to get started. Also, I'm not exactly sure whether you are trying to write a single output file or one for each input file, but this answer should be tweakable enough. libsndfile will give you all the information you need (such as clip lengths, etc.), assuming it supports the input file format.
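A minimal sketch of those steps in Python, using the soundfile package (another libsndfile wrapper) rather than libsndfile-python; the file names, the timing list, and the mono/44.1 kHz assumptions are placeholders:

import numpy as np
import soundfile as sf

# (file, start time in seconds) pairs taken from the narration script
script = [("clip_000.wav", 0.0), ("clip_001.wav", 80.0)]
SAMPLE_RATE = 44100  # all clips are assumed to be mono at this rate

with sf.SoundFile("master.wav", "w", samplerate=SAMPLE_RATE, channels=1) as out:
    position = 0  # current write position in frames
    for path, start in script:
        start_frame = int(start * SAMPLE_RATE)
        if start_frame < position:
            raise ValueError("%s overlaps the previous clip (overflow)" % path)
        # pad with silence up to the clip's start time (underflow)
        out.write(np.zeros(start_frame - position))
        data, rate = sf.read(path)
        assert rate == SAMPLE_RATE, "sample rates must match"
        out.write(data)
        position = start_frame + len(data)

Each clip is written at its scheduled frame offset, so the output stays in sync with the script regardless of how long the individual recordings run.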

Related

Sound detection using Python

To perform an end-to-end test of an embedded platform that plays musical notes, we are trying to record via a microphone and identify whether a specific sound was played through the device's speakers. The testing setup is not a real-time system, so we don't really know when (or even if) the expected sound begins and ends.
The expected sound is represented in a wave file (or similar) we can read from disk.
How can we run a test that asserts whether the sound was played as expected?
There are a few ways to tackle this problem:
Convert the expected sound into a sequence of frequency-amplitude pairs, record the sound via the microphone, convert that recording into a corresponding sequence of frequency-amplitude pairs, and compare the two sequences to see if they match. This can be accomplished with the scipy, numpy, and matplotlib modules.
First, generate the frequency-amplitude pairs for the expected sound: scipy.io.wavfile.read() reads a wave file and returns a tuple containing the sample rate (in samples per second) and a numpy array of waveform amplitudes, and numpy.fft.fft() converts those amplitudes into frequency-amplitude pairs.
Next, record the sound via the microphone using the pyaudio module: create a PyAudio object with the pyaudio.PyAudio() constructor, open a stream on the microphone with its open() method, and read blocks of data from the stream with the read() method. Each block is a raw byte buffer that can be converted to a numpy array (e.g. with numpy.frombuffer) and passed through numpy.fft.fft() in the same way.
Finally, compare the two sequences of frequency-amplitude pairs. If they match, the expected sound was played and recorded correctly; if they don't, it was not.
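A minimal sketch of that first approach, assuming mono WAV files, that the microphone capture has already been saved to recording.wav, and that the recording is at least as long as the expected clip; the 0.9 similarity threshold is an arbitrary placeholder:

import numpy as np
from scipy.io import wavfile

def spectrum(samples):
    # magnitude spectrum of a mono signal
    return np.abs(np.fft.rfft(samples))

rate_exp, expected = wavfile.read("expected.wav")
rate_rec, recorded = wavfile.read("recording.wav")

exp_spec = spectrum(expected)
rec_spec = spectrum(recorded[:len(expected)])  # compare equal-length windows

# normalise both spectra and use their dot product as a similarity score
exp_spec /= np.linalg.norm(exp_spec)
rec_spec /= np.linalg.norm(rec_spec)
similarity = float(np.dot(exp_spec, rec_spec))
print("match" if similarity > 0.9 else "no match")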
Use a sound recognition system to identify the expected sound in the recording. For example, pydub can split the recording on silence so that each candidate sound can be inspected:
from pydub import AudioSegment
from pydub.silence import split_on_silence
from pydub.playback import play

def get_sound_from_recording():
    sound = AudioSegment.from_wav("recording.wav")
    # Detect silent chunks and split the recording on them: split on silences
    # longer than 1000 ms (anything under -16 dBFS, the default threshold, is
    # considered silence) and keep 200 ms of silence at the start and end of
    # each chunk.
    chunks = split_on_silence(sound, min_silence_len=1000, keep_silence=200)
    for chunk in chunks:
        play(chunk)
    return chunks
Cross-correlate the recording with the expected sound. This will produce a sequence of values that indicates how closely the recording matches the expected sound. A high value at a particular time index indicates that the recording and expected sound match closely at that time.
# read in the recording and the expected sound (mono WAV files; names are placeholders)
from scipy.io import wavfile
from scipy import signal
import matplotlib.pyplot as plt

sampling_freq, audio = wavfile.read('audio_file.wav')
_, expected = wavfile.read('expected_sound.wav')

# cross correlate the recording with the expected sound
corr = signal.correlate(audio, expected, mode='valid')

# plot the cross correlation signal; a sharp peak marks where the
# expected sound occurs in the recording
plt.plot(corr)
plt.show()
This way you can set up your test to check if you are getting the correct output.

Find number of times recognized audio repeat in the source

I want to find the number of times a snippet of audio is repeated within another audio file.
There are libraries like https://github.com/worldveil/dejavu which can be used to create audio fingerprints for recognition, but they only tell you whether the snippet exists in the audio or not; they do not give a count.
Is there any way to modify this to find the number of times the recorded audio repeats in the source (any audio from the database)?
Thanks
If it's an exact copy, you may want to convolve the snippet with the source audio you're trying to extract it from. The peaks from the convolution will then show you where the offsets are.
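A minimal sketch of that idea, cross-correlating the snippet against the source with scipy (equivalent to convolving with the time-reversed snippet); the file names and the 0.9 peak threshold are placeholders and assume exact, equally loud copies:

import numpy as np
from scipy.io import wavfile
from scipy import signal

_, source = wavfile.read("source.wav")
_, snippet = wavfile.read("snippet.wav")

corr = signal.correlate(source.astype(float), snippet.astype(float), mode="valid")
corr /= np.max(np.abs(corr))  # normalise so the threshold is relative to the best match

# Count peaks that are at least 90% as strong as the best match and at
# least one snippet length apart; each peak is one occurrence.
peaks, _ = signal.find_peaks(corr, height=0.9, distance=len(snippet))
print("snippet found %d times" % len(peaks))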

How to direct realtime-synthesized sound to individual channels in multichannel audio output in Python

I need to
read in variable data from sensors
use those data to generate audio
spit out the generated audio to individual audio output channels in real time
My trouble is with item 3.
Parts 1 & 2 have a lot in common with a guitar effects pedal, I should think: take in some variable and adjust the audio output in real time as the input variable changes, but never stop sending a signal while doing it.
I have had no trouble using pyaudio to drive wav files to specific channels using the mapping[] parameter of pyaudio.play nor have I had trouble generating sine waves dynamically and sending them out using pyaudio.stream.play.
I'm working with 8 audio output channels. My problem is that stream.play only lets you specify a count of channels, and as far as I can tell I can't say, for example, "stream generated_audio to channel 5".
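One way to get per-channel control (a sketch, not from the original thread, using the sounddevice library rather than pyaudio; the default output device, 8-channel count, and target channel are assumptions) is to open a multichannel output stream and fill only the target column of the callback's output buffer:

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 44100
CHANNELS = 8
TARGET_CHANNEL = 4  # zero-based, i.e. "channel 5"
phase = 0

def callback(outdata, frames, time, status):
    global phase
    t = (np.arange(frames) + phase) / SAMPLE_RATE
    outdata[:] = 0                                    # silence on every channel
    outdata[:, TARGET_CHANNEL] = 0.2 * np.sin(2 * np.pi * 440 * t)
    phase += frames

with sd.OutputStream(samplerate=SAMPLE_RATE, channels=CHANNELS, callback=callback):
    sd.sleep(2000)  # keep the stream running for two seconds

Because the callback fills a frames-by-channels array on every block, the generated audio can be routed to any channel (or several) simply by writing into different columns.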

Audio alignment (same sentence with different speakers)

I am super new to audio processing. I have one reference audio file and several other audio recordings (the same sentence spoken by different speakers, differing in dialect and duration), and I want to align all the audio files to the one reference file with the least warping. I tried using MFCC and Chroma features (Python/librosa), but I don't know what to do next. I was reading about DTW (Dynamic Time Warping) for alignment; would that work? Is there an example/open-source project or audio tool which already does this? It seems to be a solved problem but I couldn't find it. Please help.
I was reading this -
https://librosa.github.io/librosa_gallery/auto_examples/plot_music_sync.html but how do I save back the aligned audio in the time domain?
This seems related - Dynamic time warping with python (final mapping)
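A sketch of the frame-level alignment step with librosa (assuming mono files named reference.wav and speaker2.wav and a hop length of 512 samples); this reproduces the mapping the gallery example computes before any resynthesis:

import numpy as np
import librosa

HOP = 512
ref, sr = librosa.load("reference.wav", sr=None)
other, _ = librosa.load("speaker2.wav", sr=sr)

ref_mfcc = librosa.feature.mfcc(y=ref, sr=sr, hop_length=HOP)
oth_mfcc = librosa.feature.mfcc(y=other, sr=sr, hop_length=HOP)

# wp is an array of (reference_frame, other_frame) index pairs, last-to-first
D, wp = librosa.sequence.dtw(X=ref_mfcc, Y=oth_mfcc, metric="cosine")
wp_times = wp[::-1] * HOP / sr  # convert frame indices to seconds, first-to-last
print(wp_times[:10])            # the first few aligned time pairs

To get the warped audio back in the time domain, one common approach is to cut the second recording at the matched frame boundaries and time-stretch each segment (e.g. with librosa.effects.time_stretch or pyrubberband) so its duration matches the corresponding reference segment.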

How to extract the bitrate and other statistics of a video file with Python

I am trying to extract the prevailing bitrate of a video file (e.g. an .mkv file containing a movie) at a regular sampling interval of between 1-10 seconds under conditions of normal playback, kind of like what you see in VLC's statistics window during playback of the file.
Can anyone suggest the best way to bootstrap the coding of such an analyser? Does anyone know of a library that provides an API to such information? Perhaps a Python wrapper for ffmpeg or an equivalent tool that processes video files and can thereby extract such statistics.
What I am really aiming for is a CSV format file containing the seconds offset and the average or actual bitrate in KiB/s at that offset into the asset.
Update:
I built pyffmpeg and wrote the following spike:
import pyffmpeg

reader = pyffmpeg.FFMpegReader(False)
reader.open("/home/mark/Videos/BBB.m2ts", pyffmpeg.TS_VIDEO)
tracks = reader.get_tracks()

# Called for each frame
def obs(f):
    pass

tracks[0].set_observer(obs)
reader.run()
But observing the frame information (f) in the callback does not appear to give me any hooks for calculating per-second bitrates. In fact, bitrate calculations within pyffmpeg are measured across the entire file (filesize / duration), so the treatment within the library is very superficial. Clearly its focus is on extracting I-frames and other frame/GOP-specific work.
Something like these:
http://code.google.com/p/pyffmpeg/
http://pymedia.org/
You should be able to do this with gstreamer. http://pygstdocs.berlios.de/pygst-tutorial/seeking.html has an example of a simple media player. It calls
pos_int = self.player.query_position(gst.FORMAT_TIME, None)[0]
periodically. All you have to do is call query_position() a second time with gst.FORMAT_BYTES, do some simple math, and voila! Bitrate vs. time.
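A sketch of that idea (assuming the pygst 0.10 API used in the tutorial above and a player pipeline as in its example; the one-second interval is arbitrary), printing CSV rows of offset and bitrate:

import gst
import gobject

last = {"time": 0, "bytes": 0}

def sample_bitrate(player):
    pos_ns = player.query_position(gst.FORMAT_TIME, None)[0]
    pos_bytes = player.query_position(gst.FORMAT_BYTES, None)[0]
    dt = (pos_ns - last["time"]) / 1e9   # seconds since the last sample
    db = pos_bytes - last["bytes"]       # bytes consumed since the last sample
    if dt > 0:
        print("%.1f,%.1f" % (pos_ns / 1e9, db / dt / 1024.0))  # offset_s,KiB/s
    last["time"], last["bytes"] = pos_ns, pos_bytes
    return True  # keep the timeout firing

# e.g. gobject.timeout_add(1000, sample_bitrate, self.player)  # sample once a second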
