I have a database that contains streaming videos. I want to compute LBP features from the images and MFCCs from the audio. Every frame in the video has an annotation, and the annotations are aligned with the video frames and the video time. I therefore want to map the annotation times to the MFCC output. I know that sample_rate = 44100.
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
audio_file = "sample.wav"
(rate,sig) = wav.read(audio_file)
mfcc_feat = mfcc(sig,rate)
print(len(sig))        # 2130912
print(len(mfcc_feat))  # 4831
First, why is the length of the MFCC result 4831, and how do I map it to my annotation, which is in seconds? The total duration of the video is 48 seconds. The annotation is 0 everywhere except in the 19-29 sec window, where it is 1. How can I locate the samples within that window (19-29 sec) in the MFCC result?
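You should get an array of shape (4831, 13). 13 is the number of MFCC coefficients (the default numcep is 13), and 4831 is the number of windows. The default winstep is 10 msec, so 4831 windows cover about 48.3 sec, which matches the duration of your sound file. At 100 windows per second, the windows corresponding to 19-29 sec are rows 1900 to 2900 of the result, so just slice them out.
Remember that you cannot listen to the MFCC. Each row just describes a 0.025 sec slice of audio (the default value of the winlen parameter).
If you want the audio itself, slice the signal from sample 19 * rate to sample 29 * rate.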
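The slicing described in this answer can be sketched as follows. This is a minimal illustration, assuming the default winstep of 0.01 sec; it uses zero-filled stand-in arrays with the shapes reported in the question instead of real audio, and seconds_to_frame is a hypothetical helper name.

```python
import numpy as np

def seconds_to_frame(t, winstep=0.01):
    """Index of the MFCC window that starts at time t (in seconds)."""
    return int(round(t / winstep))

rate = 44100
# Stand-ins with the shapes reported in the question (not real data).
sig = np.zeros(2130912, dtype=np.int16)
mfcc_feat = np.zeros((4831, 13))

start_s, end_s = 19, 29
first, last = seconds_to_frame(start_s), seconds_to_frame(end_s)

mfcc_window = mfcc_feat[first:last, :]            # MFCC rows for 19-29 sec
audio_window = sig[start_s * rate:end_s * rate]   # raw samples for 19-29 sec

# Per-window labels matching the annotation: 1 inside 19-29 sec, 0 elsewhere.
labels = np.zeros(len(mfcc_feat), dtype=int)
labels[first:last] = 1

print(mfcc_window.shape)   # (1000, 13)
print(audio_window.shape)  # (441000,)
```

The labels array has one entry per MFCC window, so it can be used directly as a per-window target.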
I am extracting MFCCs from an audio file using librosa's librosa.feature.mfcc function, and I correctly get back a numpy array with the shape I was expecting: 13 MFCC values for each of the 1292 windows that cover the whole audio file (30 seconds).
What is missing is timing information for each window: for example I want to know what the MFCC looks like at time 5000ms, then at 5200ms etc.
Do I have to manually calculate the time? Is there a way to automatically get the exact time for each window?
The "timing information" is not directly available, because it depends on the sampling rate and the hop length. To provide it, librosa would have to create its own classes, which would pollute the interface and make it much less interoperable. In the current implementation, feature.mfcc returns a numpy.ndarray, which means you can easily integrate this code anywhere in Python.
To relate MFCC to timing:
import librosa
import numpy as np
filename = librosa.util.example_audio_file()
y, sr = librosa.load(filename)
hop_length = 512 # number of samples between successive frames
mfcc = librosa.feature.mfcc(y=y, n_mfcc=13, sr=sr, hop_length=hop_length)
audio_length = len(y) / sr # in seconds
step = hop_length / sr # in seconds
intervals_s = np.arange(start=0, stop=audio_length, step=step)
print(f'MFCC shape: {mfcc.shape}')
print(f'intervals_s shape: {intervals_s.shape}')
print(f'First 5 intervals: {intervals_s[:5]} second')
Note that the second dimension of mfcc and the length of intervals_s are the same, which is a sanity check that we did not make a mistake in our calculation.
MFCC shape: (13, 2647)
intervals_s shape: (2647,)
First 5 intervals: [0. 0.02321995 0.04643991 0.06965986 0.09287982] second
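To look up the MFCC column at a specific time such as 5000 ms, a minimal sketch along the same lines may help. It assumes librosa's default sr of 22050 and hop_length of 512, and it uses only numpy with a zero-filled stand-in array in place of real MFCCs; frame_times and frame_at are hypothetical names. librosa.frames_to_time performs the same frame-to-seconds conversion.

```python
import numpy as np

sr = 22050          # assumed: librosa's default sampling rate
hop_length = 512    # assumed: librosa's default hop length
mfcc = np.zeros((13, 1292))  # stand-in with the shape from the question

# Start time of every frame, in seconds: frame index * hop_length / sr.
frame_times = np.arange(mfcc.shape[1]) * hop_length / sr

def frame_at(t_seconds):
    """Index of the frame whose start time is closest to t_seconds."""
    return int(round(t_seconds * sr / hop_length))

i = frame_at(5.0)
print(i, frame_times[i])   # frame index and its start time in seconds
print(mfcc[:, i].shape)    # the 13 coefficients for that frame
```

Building the times from an integer frame count avoids the off-by-one length that np.arange can produce with a floating-point step.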
I found many meanings of the Volume Unit meter (VU meter). For example, the average of loudness of sound, the average of frequencies of sound, and the average of power in dB scale.
I read audio by using AudioSegment and segmented an audio sound into small windows. Then I got an array of values for each window (I guess that the values that I got are amplitudes).
import numpy as np
from pydub import AudioSegment
from pydub.utils import get_array_type
#def vu(arr):
# return vu_value
sound = AudioSegment.from_file(fullfilename) #also read file
# stereo signal to two mono signal for left and right channel
split_sound = sound.split_to_mono()
left_channel = split_sound[0]
right_channel = split_sound[1]
left_channel = np.array(left_channel.get_array_of_samples())
right_channel = np.array(right_channel.get_array_of_samples())
# print(vu(left_channel))
I would like to know the exact meaning of the VU meter and how to get the VU value for each window (e.g. a formula). I am also confused about the difference between a VU meter, a Peak Programme Meter (PPM), and RMS. If anyone knows the answer, please help me.
Thank you
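As a starting point, here is a hedged sketch of per-window levels. A classic VU meter shows an averaged level with a response time of roughly 300 ms, a PPM reacts much faster to peaks, and RMS is a mathematical average of signal power. The code below is not a calibrated VU or PPM implementation; it only computes the RMS and peak level of each non-overlapping window in dB relative to full scale, assuming 16-bit samples. window_levels and the test tone are hypothetical.

```python
import numpy as np

def window_levels(samples, window_size, full_scale=32768.0):
    """Return (rms_db, peak_db) for each non-overlapping window, in dBFS."""
    n_windows = len(samples) // window_size
    frames = samples[:n_windows * window_size].astype(np.float64)
    frames = frames.reshape(n_windows, window_size) / full_scale
    rms = np.sqrt(np.mean(frames ** 2, axis=1))   # average power per window
    peak = np.max(np.abs(frames), axis=1)         # largest sample per window
    eps = 1e-12                                   # avoids log10(0) on silence
    return 20 * np.log10(rms + eps), 20 * np.log10(peak + eps)

# Hypothetical input: 1 second of a full-scale 440 Hz sine at 44.1 kHz.
sr = 44100
t = np.arange(sr) / sr
left_channel = (32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

# 300 ms windows, chosen to resemble the slow response of a VU meter.
rms_db, peak_db = window_levels(left_channel, window_size=sr * 300 // 1000)
print(rms_db.round(2))
print(peak_db.round(2))
```

For a full-scale sine the RMS level sits about 3 dB below the peak level, which illustrates why an averaging meter and a peak meter show different readings for the same signal.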
I am using wave files to build a deep learning model. They have different lengths, so I want to pad all of them to a length of 16 seconds using Python.
If I understood correctly, the question wants to fix all lengths to a given length. Therefore, the solution will be slightly different:
from pydub import AudioSegment
pad_ms = 1000  # Set the fixed length you want here (in milliseconds)
audio = AudioSegment.from_wav('you-wav-file.wav')
assert pad_ms > len(audio), "Audio is longer than the target length of " + str(pad_ms) + " ms"
silence = AudioSegment.silent(duration=pad_ms - len(audio))  # only as much silence as is missing
padded = audio + silence # Adding silence after the audio
padded.export('padded-file.wav', format='wav')
This answer differs from the other pydub answer: this one makes every audio file the same length, whereas the other appends the same amount of silence to the end of each file.
Using pydub:
from pydub import AudioSegment
pad_ms = 1000 # milliseconds of silence needed
silence = AudioSegment.silent(duration=pad_ms)
audio = AudioSegment.from_wav('you-wav-file.wav')
padded = audio + silence # Adding silence after the audio
padded.export('padded-file.wav', format='wav')
AudioSegment objects are immutable, so audio + silence returns a new segment and leaves audio unchanged.
You can use librosa. The librosa.util.fix_length function pads the audio with silence by appending zeros to the end of the numpy array containing the audio data:
import numpy as np
from librosa import load
from librosa.util import fix_length

file_path = 'dir/audio.wav'
sf = 44100  # sampling frequency of the wav file
required_audio_size = 5  # target length in seconds (here a 2 second file is padded to 5 seconds)
audio, sf = load(file_path, sr=sf, mono=True)  # mono=True converts stereo audio to mono
padded_audio = fix_length(audio, size=required_audio_size * sf)  # target array size is required_audio_size * sampling frequency
print('Array length before padding', np.shape(audio))
print('Audio length before padding in seconds', (np.shape(audio)[0] / sf))
print('Array length after padding', np.shape(padded_audio))
print('Audio length after padding in seconds', (np.shape(padded_audio)[0] / sf))
Array length before padding (88200,)
Audio length before padding in seconds 2.0
Array length after padding (220500,)
Audio length after padding in seconds 5.0
That said, after looking through a number of similar questions, it seems that pydub.AudioSegment is the go-to solution.
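If the audio is already loaded as a numpy array, the same idea can be sketched without any audio library. This is a minimal example for the 16 second target from the question, assuming mono audio in a 1-D array; pad_or_trim and the 16000 Hz rate are hypothetical choices for illustration. Clips longer than the target are cut, which is also what librosa.util.fix_length does.

```python
import numpy as np

def pad_or_trim(samples, sample_rate, seconds=16.0):
    """Return a copy of a 1-D sample array that is exactly `seconds` long."""
    target = int(round(seconds * sample_rate))
    if len(samples) >= target:
        return samples[:target].copy()            # cut clips that are too long
    # Append zeros (silence) after the audio until the target length is reached.
    return np.pad(samples, (0, target - len(samples)), mode='constant')

sr = 16000                                 # assumed sampling rate
clip = np.ones(sr * 3, dtype=np.float32)   # stand-in for a 3 second clip
fixed = pad_or_trim(clip, sr)
print(fixed.shape)  # (256000,)
```

Every clip passed through this function ends up with the same number of samples, so the results can be stacked into one batch array.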
I am trying to reproduce the preprocessing shown in a YouTube video, https://www.youtube.com/watch?v=g-sndkf7mCs, where they create a spectrogram by taking windows of 20 ms and applying an FFT to each one. Finally, they feed the resulting spectrogram to a neural network. I am using the scipy package, but I am a little confused about which parameters to use. Here is the code:
import numpy as np
from scipy import signal
from scipy.io import wavfile

def get_spectrogram(path, nsamples=16000):
    """Given path, return specgram."""
    # read the wav file
    wav = wavfile.read(path)[1]  # 16000 samples per second
    # zero pad the shorter samples and cut off the long ones to have a signal of 1 sec.
    if wav.size < nsamples:
        d = np.pad(wav, (nsamples - wav.size, 0), mode='constant')
    else:
        d = wav[0:nsamples]
    # get the specgram
    specgram = signal.spectrogram(d, fs= ? , nperseg=None, noverlap=None, nfft=None)[2]
    return specgram
Moreover, I am also wondering what the shape of the output should be. Is it (X, 1)?
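For reference, here is a minimal sketch of what a 20 ms window means in samples, assuming the 16000 Hz sampling rate stated in the code comment. At that rate 20 ms is 320 samples, so in scipy.signal.spectrogram the corresponding settings would be fs=16000 and nperseg=320. The sketch uses only numpy and non-overlapping Hann windows to show the resulting shape; simple_spectrogram is a hypothetical helper, not necessarily the method used in the video.

```python
import numpy as np

def simple_spectrogram(x, fs=16000, window_ms=20):
    """Power spectrum of consecutive, non-overlapping windows.

    Returns an array of shape (n_freq_bins, n_windows), the same
    orientation scipy.signal.spectrogram uses for its third output.
    """
    nperseg = int(fs * window_ms / 1000)      # 320 samples for 20 ms at 16 kHz
    n_windows = len(x) // nperseg
    frames = np.reshape(x[:n_windows * nperseg], (n_windows, nperseg))
    frames = frames.astype(np.float64) * np.hanning(nperseg)  # taper each window
    spectrum = np.fft.rfft(frames, axis=1)    # one FFT per window
    return (np.abs(spectrum) ** 2).T

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)   # hypothetical 1 second, 1000 Hz test tone
spec = simple_spectrogram(tone, fs)
print(spec.shape)  # (161, 50)
```

The output is two-dimensional, (frequency bins, time windows), rather than (X, 1): 320 samples per window give 161 frequency bins, and one second of audio gives 50 non-overlapping windows.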
I need a program that turns tones recorded by the microphone into keyboard presses. Example: if somebody sings into the microphone at a frequency between 400 Hz and 600 Hz and the average tone is 550 Hz, then I store the average frequency in the variable 'tom', and the key "G" on my keyboard is pressed.
Even though I'm a newbie at programming, I searched and figured out a way to do it.
Using PyAudio in Python to record small WAV files, I could then read them and get a number for the average frequency. With this number and some ifs and elifs I could press keys (it's not that hard to find code to press keys), all inside an enormous while loop that repeats the process while the program runs. So the process would be: talk, read the small files the talking produces, and turn them into key presses according to the tone.
The main problem is that I have no idea how to turn the WAV files I've been recording into a single average frequency. Can somebody help me with this? Or with the big picture? Because I know this method is not a really good one. Thanks! I was using this code to record, which I found on the PyAudio website:
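import pyaudio
import wave
import numpy as np

CHUNK = 2048
FORMAT = pyaudio.paInt16
CHANNELS = 2        # assumed; this value is missing from the post
RATE = 44100
RECORD_SECONDS = 1  # assumed; this value is missing from the post
WAVE_OUTPUT_FILENAME = "output1.wav"

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

print("* recording")
frames = []
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)  # keep each chunk so it can be written out later
    print("* done chunk")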
To press the keys, this other code:
import win32com.client
shell = win32com.client.Dispatch("WScript.Shell")
if tom >= 400 and tom <= 500:
    shell.SendKeys("g")  # presumed body; the original post is cut off after the condition
PS.: I'm using Windows
You can use the Fourier transform to convert sound into frequencies.
More specifically, use the one-dimensional discrete Fourier Transform provided by numpy.fft.rfft.
Here is an example that reads a single second from a stereo WAV file and extracts the frequencies:
import wave
import numpy as np
with wave.open('input.wav', 'r') as wr:
    sz = wr.getframerate()  # Read and process 1 second.
    da = np.frombuffer(wr.readframes(sz), dtype=np.int16)  # np.fromstring is deprecated for binary data
left, right = da[0::2], da[1::2]  # separate into left and right channel
lf, rf = np.absolute(np.fft.rfft(left)), np.absolute(np.fft.rfft(right))
The lf and rf are numpy arrays containing the intensity of each frequency. Because exactly one second of audio is transformed, index k corresponds to k Hz. Using numpy.argmax you can get the index (frequency) with the highest strength.
But try it and graph the result using e.g. matplotlib. You'll see that there are probably multiple peaks in the data. For example you might find a peak at 50 Hz or 60 Hz. This is most probably interference from mains electricity and should be ignored by zero-ing out the data.
Example for 60 Hz:
lf[55:65], rf[55:65] = 0, 0
Below is an example plot made with matplotlib from a one-second sound clip. The top graph shows the samples from the WAV file while the bottom one shows the same data converted to frequencies. This is a graph of a person speaking, so there are many peaks. The highest is around 200 Hz.
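Putting these steps together, a minimal sketch of getting one frequency value per chunk might look like this. It assumes mono samples in a numpy array and uses a synthetic tone instead of a recording; dominant_frequency, the 70 Hz cutoff, and the key mapping are hypothetical choices for illustration, and the strongest peak is only a rough stand-in for the "average tone" in the question.

```python
import numpy as np

def dominant_frequency(samples, sample_rate, min_hz=70.0):
    """Frequency in Hz of the strongest spectral peak at or above min_hz."""
    windowed = samples * np.hanning(len(samples))    # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[freqs < min_hz] = 0.0                   # drop DC and 50/60 Hz mains hum
    return float(freqs[np.argmax(spectrum)])

# Hypothetical input: 8192 samples of a 550 Hz tone plus strong 60 Hz hum.
rate = 44100
t = np.arange(8192) / rate
chunk = np.sin(2 * np.pi * 550 * t) + 2.0 * np.sin(2 * np.pi * 60 * t)

tom = dominant_frequency(chunk, rate)

# Map the frequency to a key, using the 400-600 Hz range from the question.
key = None
if 400 <= tom <= 600:
    key = "G"
print(tom, key)
```

The frequency resolution is sample_rate / len(samples) Hz per bin, so longer chunks give a finer estimate at the cost of more delay before a key can be pressed.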