Mute/silence non-speech part of audio with Python (Voice Activity Detection) - python

My purpose is to silence all the parts of a .wav audio where there is no speech. I am currently using webrtcvad, but what I achieve is just removing the non-speech part from the audio (with their example.py code: https://github.com/wiseman/py-webrtcvad/blob/master/example.py). If someone can point me or show me a how to achieve my goal, I would be grateful! This sounds also sounds like a background noise removal problem.

Assuming that you want the WAV output to have the same duration as the input, just with the non-speech areas being replaced with silence, and the speech areas unaltered.
The way to do this is to multiply the audio signal with the output from the detector. The detector should output 1.0 for passing though (speech signal), and 0.0 for silencing (non-speech).
Sometimes one uses a small value instead of 0.0 for the blocking part, to just reduce the volume a bit, without making it pure silence. For example 0.01 (-20 dB).
Sometimes an abrupt transition can be a bit rough. In this case one may apply a bit of smoothing or fade. A simple alternative is an exponential moving average.

Related

Multi-channel square-wave noise removal

I am new to the signals and Scipy but I may need to use it to remove the square-wave noise on multiple channels. I have tried a few things with fft but nothing seems to make sense so far. I am hoping to get a few clues here that I can try. Problem: I have a series of 6 sensors transmitting data via USB # 1Hz (yes, very slow) / sensor. Every once in a while, they capture an external motor noise along with the signal that I am trying to remove (see attached figure). Any ideas on how to handle this? My original idea was to process the incoming data for 60 seconds, use fft to identify the common frequency on all the sensor channels and to remove it but that did not work. The code is basically useless to even share here.the square wave in the figure is the noise I am trying to remove. Thank you for your input.

Spectrogram image to Audio

I want to write a python script which takes the input as the image of the spectrogram and generates the audio from it. Is there a way to convert the image of spectrogram into corresponding audio ?
I believe that there must be a way to reverse engineer the image of spectrogram to generate the audio. Can someone please help me with the same?
By a strange coincidence, I also needed to do this to recover audio for which only the spectrograms were available, but could find no tools to do this so wrote it myself in C. It's not simple and the results are, as user14325 rightly points out, very noisy compared to the originals, partly due to the low time resolution of most spectrograms but mostly because the phase information for each data point is lost and has to be invented.
However, if you are interested, you will find a brief description at
https://wikidelia.net/wiki/Spectrograms#Inverse_spectrograms
and you can find the code by following the "even hairier custom software" link and checking the files named "run.*" (the rest of the code there is for log-freq-axis forward spectrograms)

OpenCV decentralized processing for stereo vision

I have a decent amount of experience with OpenCV and am currently familiarizing myself with stereo vision. I happen to have two JeVois cameras (don't ask why) and was wondering if it was possible to run some sort of code on each camera to distribute the workload and cut down on processing time. It needs to be so that each camera can do part of the overall process (without needing to talk to each other) and the computer they're connected to receives that information and handles the rest of the work. If this is possible, does anyone have any solutions or tips? Thanks in advance!
To generalize the stereo-vision pipeline (look here for more in-depth):
Find the intrinsic/extrinsic values of each camera (good illustration here)
Solve for the transformation that will rectify your cameras' images (good illustration here)
Capture a pair of images
Transform the images according to Step 2.
Perform stereo-correspondence on that pair of rectified images
If we can assume that your cameras are going to remain perfectly stationary (relative to each other), you'll only need to perform Steps 1 and 2 one time after camera installation.
That leaves you with image capture (duh) and the image rectification as general stereo-vision tasks that can be done without the two cameras communicating.
Additionally, there are some pre-processing techniques (you could try this and this) that have been shown to improve the accuracy of some stereo-correspondence algorithms. These could also be done on each of your image-capture platforms individually.

Determining the "quality" of a video in Python

I am trying to compute a rough "quality" metric for a video, which takes the following into consideration:
"Smoothness" of video; i.e., the opposite of how "choppy" it is
Image quality; i.e. if there are a lot of compression artifacts, the quality should decrease in size
I came across https://github.com/aizvorski/scikit-video, but the code seems to be littered with FIXMEs and TODOs, and on top of that there's barely any comments or documentation.
Is there a Python library, or even a program with a CLI, for computing video quality, or perhaps a set of libraries that will help me compute the above two metrics separately?
Image Quality
I would think that "Image Quality" is largely a function of bit-depth (or effective bit-depth) and bit-rate.
You can parse ffmpeg output to get this information. PIL or PyQt/PySide can also do this.
Smoothness
For smoothness, you may need to use some type of optical flow algorithm and get deltas from frame to frame.
OpenCV looks like a project that does many of these things.

Get the amplitude at a given time within a sound file?

I'm working on a project where I need to know the amplitude of sound coming in from a microphone on a computer.
I'm currently using Python with the Snack Sound Toolkit and I can record audio coming in from the microphone, but I need to know how loud that audio is. I could save the recording to a file and use another toolkit to read in the amplitude at given points in time from the audio file, or try and get the amplitude while the audio is coming in (which could be more error prone).
Are there any libraries or sample code that can help me out with this? I've been looking and so far the Snack Sound Toolkit seems to be my best hope, yet there doesn't seem to be a way to get direct access to amplitude.
Looking at the Snack Sound Toolkit examples, there seems to be a dbPowerSpectrum function.
From the reference:
dBPowerSpectrum ( )
Computes the log FFT power spectrum of the sound (at the sample number given in the start option) and returns a list of dB values. See the section item for a description of the rest of the options. Optionally an ending point can be given, using the end option. In this case the result is the average of consecutive FFTs in the specified range. Their default spacing is taken from the fftlength but this can be changed using the skip option, which tells how many points to move the FFT window each step. Options:
EDIT: I am assuming when you say amplitude, you mean how "loud" the sound appears to a human, and not the time domain voltage(Which would probably be 0 throughout the entire length since the integral of sine waves is going to be 0. eg: 10 * sin(t) is louder than 5 * sin(t), but their average value over time is 0. (You do not want to send non-AC voltages to a speaker anyways)).
To get how loud the sound is, you will need to determine the amplitudes of each frequency component. This is done with a Fourier Transform (FFT), which breaks down the sound into it's frequency components. The dbPowerSpectrum function seems to give you a list of the magnitudes (forgive me if this differs from the exact definition of a power spectrum) of each frequency. To get the total volume, you can just sum the entire list (Which will be close, xept it still might be different from percieved loudness since the human ear has a frequency response itself).
I disagree completely with this "answer" from CookieOfFortune.
granted, the question is poorly phrased... but this answer is making things much more complex than necessary. I am assuming that by 'amplitude' you mean perceived loudness. as technically each sample in the (PCM) audio stream represents an amplitude of the signal at a given time-slice. to get a loudness representation try a simple RMS calculation:
RMS
|K<
I'm not sure if this will help, but
skimpygimpy
provides facilities for parsing WAVE files into python
sequences and back -- you could potentially use this
to examine the wave form samples directly and do
what you like. You will have to read some source,
these subcomponents are not documented.

Categories