Audio alignment (same sentence with different speakers) - python

I am very new to audio processing. I have one reference audio file and several other recordings of the same sentence spoken by different speakers (they differ in dialect and duration), and I want to align all of the audio files to the single reference file with the least warping. I tried extracting MFCC and Chroma features (Python/librosa), but I don't know what to do next. I was reading about DTW (Dynamic Time Warping) for alignment; would that work? Is there an example, open-source project, or audio tool which already does this? It seems like a solved problem, but I couldn't find anything. Please help.
I was reading this:
https://librosa.github.io/librosa_gallery/auto_examples/plot_music_sync.html but how do I save the aligned audio back in the time domain?
This seems related - Dynamic time warping with python (final mapping)
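
A rough sketch of the DTW step with librosa, assuming the files are named reference.wav and other.wav (placeholder names): it computes MFCCs for both recordings, runs DTW, and then rebuilds the second recording on the reference timeline by copying the hop-sized chunk that each reference frame was mapped to. This keeps everything in the time domain, at the cost of audible artifacts at chunk boundaries.

```python
# Rough sketch: align "other.wav" to "reference.wav" via DTW over MFCC frames.
# Filenames are placeholders; the chunk-copy resynthesis is deliberately crude.
import librosa
import numpy as np
import soundfile as sf

hop = 512
y_ref, sr = librosa.load("reference.wav", sr=None)
y_oth, _ = librosa.load("other.wav", sr=sr)

mfcc_ref = librosa.feature.mfcc(y=y_ref, sr=sr, hop_length=hop)
mfcc_oth = librosa.feature.mfcc(y=y_oth, sr=sr, hop_length=hop)

# wp is the warping path: (reference frame, other frame) pairs, end to start
D, wp = librosa.sequence.dtw(X=mfcc_ref, Y=mfcc_oth, metric="cosine")
wp = wp[::-1]  # reorder start to end

# For every reference frame, remember the first "other" frame mapped to it
frame_map = np.full(mfcc_ref.shape[1], -1, dtype=int)
for ref_i, oth_i in wp:
    if frame_map[ref_i] == -1:
        frame_map[ref_i] = oth_i

# Crude time-domain reconstruction: concatenate the mapped hop-sized chunks
chunks = [y_oth[i * hop:i * hop + hop] for i in frame_map]
aligned = np.concatenate(chunks)

sf.write("other_aligned.wav", aligned, sr)
```

A smoother result would come from overlap-add resynthesis or from time-stretching each segment (for example with librosa.effects.time_stretch) according to the local slope of the warping path, but the idea is the same: the DTW path tells you which frames correspond, and you stretch the audio to follow it.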

Related

How would you approach selectively isolating and amplifying specific audio streams?

I am trying to wrap my head around how I'd go about isolating and amplifying specific sound streams in real time. I am playing with code that enables you to classify specific sounds in a given sound environment (e.g., isolating the guitar in a clip of a song) -- not difficult.
The issue is, how does one then selectively amplify the target audio stream? The most common output from existing audio diarizers is a probability of the audio signal belonging to a given class. The crux appears to be using that real-time class probability to identify the stream and then amplify it, as in the sketch below.
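
One simple way to use that per-frame probability is a time-varying gain. The sketch below assumes the classifier already yields one probability per hop-sized frame (class_probs is a hypothetical array) and that y is a float signal in [-1, 1]:

```python
# Minimal sketch: boost frames whose (hypothetical) class probability is high.
# Assumes y is a mono float signal in [-1, 1], e.g. loaded with librosa.load.
import numpy as np

def amplify_by_probability(y, class_probs, hop_length, gain_db=6.0, threshold=0.5):
    """Apply extra gain to hop-sized frames whose class probability exceeds threshold."""
    gain = 10 ** (gain_db / 20.0)
    out = y.astype(np.float32).copy()
    for i, p in enumerate(class_probs):
        if p >= threshold:
            start = i * hop_length
            out[start:start + hop_length] *= gain
    peak = np.max(np.abs(out))
    if peak > 1.0:          # avoid clipping after the boost
        out /= peak
    return out
```

Note that boosting a whole frame boosts everything in it; actually isolating the target source needs source separation (for example a time-frequency mask applied to the STFT), with the class probability only deciding when the mask is applied.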

How to Prepare .WAV Files for Cross-Correlation with Scipy.Correlate for Time of Arrival Lag

I have two .wav audio files recorded simultaneously (outdoor mics for bioacoustics pilot study). A bird flying over chirps, and both mics detect the bird, but at different time points.
A common task is to cross-correlate the two signals and find the peak cross-correlation, which indicates the time lag between the signal arriving at one microphone vs. the other. I found code for doing exactly this here: Find time shift of two signals using cross correlation
However, that post seems to assume that people know how to get their audio files into a useful format for this analysis. Basic attempts to just use my whole .wav files as y1 and y2 fail because the data is not in the right format:
TypeError: ufunc 'multiply' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')
I started looking around at how to turn a .wav file into a numpy array but got errors and didn't really know what I was doing. I assume it has something to do with taking an FFT and turning each audio file into an image (spectrogram), and that those image arrays are the y1 and y2 in the example above. I assume this link is talking about that: FFT-based 2D convolution and correlation in Python
What is the right way to proceed? Thank you very much.
TL;DR: How should I import and modify two locally saved .wav files to prepare them for finding the peak time lag by cross-correlation?
Having found no tool to sync the start of video/audio recordings, and seeing more such questions, I decided to make a tool: syncstart. It is now on GitHub and PyPI.
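
For the narrower question of how to import the two .wav files and prepare them, no spectrogram or FFT is needed: you can cross-correlate the raw sample arrays directly. A minimal sketch (mic1.wav and mic2.wav are placeholder filenames):

```python
# Minimal sketch: read both recordings as float arrays and find the lag of
# peak cross-correlation. Filenames are placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate, correlation_lags

sr1, y1 = wavfile.read("mic1.wav")
sr2, y2 = wavfile.read("mic2.wav")
assert sr1 == sr2, "both files must share the same sample rate"

def to_mono_float(y):
    """Convert integer PCM to float and keep one channel."""
    y = y.astype(np.float64)
    return y[:, 0] if y.ndim > 1 else y

y1, y2 = to_mono_float(y1), to_mono_float(y2)

corr = correlate(y1, y2, mode="full")
lags = correlation_lags(len(y1), len(y2), mode="full")
lag = lags[np.argmax(corr)]
print(f"peak lag: {lag} samples = {lag / sr1:.4f} s")
# The sign of the lag tells you which mic the chirp reached first; it is worth
# confirming the convention once with an artificially shifted test signal.
```

scipy.signal.correlation_lags needs SciPy 1.6 or newer; on older versions the same lag follows from np.argmax(corr) - (len(y2) - 1) for mode="full".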

Detecting a noise within an audio stream in Python

My goal is to be able to detect a specific noise that comes through the speakers of a PC using Python. That means the following, in pseudo code:
Sound is being played out of the speakers by applications such as games, for example
My "audio to detect" sound happens, and I want to detect that, and take an action
The specific sound I want to detect for example can be found here.
If I break that down, I believe I need two things:
A way to sample the audio that is being streamed to an audio device -- perhaps something based on this, or potentially sounddevice, but I can't work out how to do this from looking at their API
A way to compare each sample with my "audio to detect" sound file.
The detection does not need to be exact - it just needs to be close. For example, there will be lots of other noises happening at the same time, so it's more about being able to detect the footprint of the "audio to detect" within the audio stream of a variety of sounds.
Having investigated this, I found technologies mentioned in this post on SO and also this interesting article on Chromaprint. The Chromaprint article uses fpcalc to generate fingerprints, but because my "audio to detect" is only around 1-2 seconds long, fpcalc can't generate a fingerprint for it. I need something that works over shorter timespans.
My question is - can somebody help me with the two parts to my question:
How do I sample the audio device on my PC using python
How should I attempt this comparison (ideally with a little example)
Many thanks in advance.
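
A rough sketch of both parts, assuming the sound to detect is saved as target.wav (a placeholder name) and that the default input device, or an OS-level loopback device, carries the audio you want to monitor. The comparison here is plain cross-correlation of normalized blocks, which is cruder than fingerprinting but is a reasonable first test:

```python
# Rough sketch: record short blocks with sounddevice and compare each block to
# a template via cross-correlation. "target.wav" is a placeholder filename, and
# capturing speaker output may require an OS-level loopback input device.
import numpy as np
import sounddevice as sd
from scipy.io import wavfile
from scipy.signal import correlate

sr, template = wavfile.read("target.wav")
template = template.astype(np.float64)
if template.ndim > 1:
    template = template[:, 0]
template /= (np.linalg.norm(template) + 1e-12)

block_seconds = 3.0   # listen in 3 s chunks (the template is only 1-2 s long)
threshold = 0.5       # similarity threshold; tune empirically

while True:
    block = sd.rec(int(block_seconds * sr), samplerate=sr, channels=1, dtype="float32")
    sd.wait()
    block = block[:, 0].astype(np.float64)
    block /= (np.linalg.norm(block) + 1e-12)
    score = np.max(correlate(block, template, mode="valid"))
    if score > threshold:
        print("target sound detected, score =", score)
```

Raw-sample correlation is sensitive to level and pitch changes, so in a noisy mix a feature-based comparison (spectrogram peaks or a short-window fingerprint, in the spirit of the Chromaprint article) will hold up better.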

How to extract audio after particular sound?

Let's say I have a few very long audio files (for example, radio recordings). I need to extract the 5 seconds following a particular sound (for example, an ad-start sound) from each file. Each file may contain 3-5 such sounds, so I should get (3-5) × (number of source files) result files.
I found librosa and scipy python libraries, but not sure if they can help. What should I start with?
You could start by calculating the correlation of the signal with your particular sound. Not sure if librosa offers this. I'd start with scipy.signal.correlate or scipy.signal.convolve.
Not sure what your background is. Start here if you need some theory.
Basically the correlation will be high if the audio matches your particular signal or is very similar to it. After identifying these positions you can select an area around them.
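A minimal sketch of that approach with scipy, assuming the long file is recording.wav and the marker sound is ad_start.wav (both placeholder names). It correlates the marker against the recording, keeps well-separated peaks, and writes out the 5 seconds following each hit:

```python
# Minimal sketch: locate a marker sound by cross-correlation and save the
# 5 seconds that follow each occurrence. Filenames are placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

sr, raw = wavfile.read("recording.wav")
_, marker = wavfile.read("ad_start.wav")

def mono_float(y):
    y = y.astype(np.float64)
    return y[:, 0] if y.ndim > 1 else y

rec, mk = mono_float(raw), mono_float(marker)

corr = correlate(rec, mk, mode="valid")

# Keep peaks above a fraction of the maximum, at least one marker length apart
threshold = 0.7 * np.max(corr)
positions, last = [], -len(mk)
for idx in np.where(corr > threshold)[0]:
    if idx - last >= len(mk):
        positions.append(idx)
        last = idx

for n, start in enumerate(positions):
    clip_start = start + len(mk)               # 5 s right after the marker
    wavfile.write(f"clip_{n}.wav", sr, raw[clip_start:clip_start + 5 * sr])
```

The absolute correlation values depend on loudness, so the 0.7 threshold is only a starting point; normalizing per window or correlating amplitude envelopes makes the peak picking more robust.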

Automated aligning audio tracks with timings for dubbing screencasts

We have some screencasts that need to be dubbed into various languages, for which we have a textual script in the target language as shown below:
Beginning Time    Audio Narration
0:00              blah nao lorep iposm...
1:20              xao dok dkjv dwv....
...
We can record each of the above units separately and then align it at the proper beginning times as mentioned in the above script.
Example:
Input:
Input the N timing values: 0:0,1:20 ...
Then input the N audio recordings
Output:
Audio recordings aligned to the above timings. An overflow (a recording that runs past the next start time) should be flagged by the system for each clip, whereas an underflow is padded with silence.
Are there any platform-independent audio APIs / software, or a code snippet preferably in Python, that allow us to align these audio units based on the times provided?
If the input audio files are uncompressed (e.g., WAV files), the audio library I like to use is libsndfile. It appears to have a Python wrapper here: https://code.google.com/p/libsndfile-python/. With that in mind, the rest could be accomplished like so:
Open an output audio stream to write audio data to with libsndfile
For each input audio file, open an input stream with libsndfile
Extract the meta-data information for the given audio file based on your textual description 'script'
Write any silence needed to your master output stream, and then write the data from the input stream to the output stream. Note the current position/time. Repeat this step for each input audio file, checking that each audio clip's target start time is always >= the current position/time noted earlier. If not, then you have an overlap.
Of course, you have to worry about sample rates matching, etc., but that should be enough to get started. Also, I'm not exactly sure whether you are trying to write a single output file or one per input file, but this answer should be tweakable enough. libsndfile will give you all the information you need (such as clip lengths) assuming it supports the input file format.
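
A minimal sketch of those steps using the soundfile package (one of the common Python wrappers around libsndfile). The clip filenames and start times below are placeholders, and the clips are assumed to be mono and already at the output sample rate:

```python
# Minimal sketch: place each narration clip at its scheduled start time,
# padding underflow with silence and flagging overflow. The schedule entries
# are placeholders; clips are assumed to be mono and at the target sample rate.
import numpy as np
import soundfile as sf

sr = 44100
schedule = [("clip_00.wav", 0.0), ("clip_01.wav", 80.0)]  # (file, start in seconds)

segments = []
position = 0  # current write position, in samples

for path, start_s in schedule:
    data, clip_sr = sf.read(path)
    assert clip_sr == sr, f"{path}: sample rate mismatch"
    start = int(start_s * sr)
    if start < position:
        raise ValueError(f"{path} overlaps the previous clip (overflow)")
    segments.append(np.zeros(start - position, dtype=data.dtype))  # underflow padding
    segments.append(data)
    position = start + len(data)

sf.write("dubbed_track.wav", np.concatenate(segments), sr)
```

For multi-channel clips the silence would need the same channel count, and mismatched sample rates would have to be resampled first, but the bookkeeping is the same as in the steps above.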
