I've developed a script that, given an input file, extracts the voice signal and outputs the signal WITHOUT the voice (i.e. the signal that contains the noise):
!pip install pydub
from pydub import AudioSegment
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
audio = AudioSegment.from_file('fileInput.mp3')
Download fileInput.mp3
samples = audio.get_array_of_samples()
plt.plot(list(samples))
from scipy import signal
sos = signal.butter(10, [100, 4000], 'bandstop', fs=44100, output='sos')
filtered = signal.sosfilt(sos, np.array(samples))
plt.figure(figsize=(10,10))
plt.plot(np.array(samples))
plt.plot(filtered)
plt.title('After 100 - 4000 Hz band-stop filter')
plt.tight_layout()
plt.show()
To export the filtered file (i.e. the file that contains the noise) I wrote the following lines:
from scipy.io.wavfile import write
write('./test.wav', 44100, filtered.astype(np.int16))
That code saves a file, but the file doesn't have the same length as the original (input) one.
As you can see, the input file is 36 seconds long while the output is 1:12 ...
Download Output file
The input file is stereo. The pydub documentation states that:
AudioSegment(…).get_array_of_samples()
Returns the raw audio data as an array of (numeric) samples. Note: if the audio has multiple channels, the samples for each channel will be serialized – for example, stereo audio would look like [sample_1_L, sample_1_R, sample_2_L, sample_2_R, …]
For scipy this is just one "long" channel; it cannot know that the samples are interleaved like this. That is also why your output is exactly twice as long: the interleaved stereo array holds twice as many samples, and writing it as a single channel at the same rate doubles the duration. A filter also has state, so it cannot process data shuffled like this and produce the desired output.
Either you reshape the data from AudioSegment into 2 mono channels, for example:
[sample1L, sample2L, ...]
and
[sample1R, sample2R, ...]
and process these individually.
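For that first option, a minimal sketch of the de-interleaving (my own illustration, assuming audio is the stereo AudioSegment from the question):
import numpy as np
samples = np.array(audio.get_array_of_samples())
# Reshape the interleaved 1D array so that each column is one channel
stereo = samples.reshape((-1, audio.channels))
left, right = stereo[:, 0], stereo[:, 1]
# left and right can now be filtered separately with signal.sosfilt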
OR
you simply convert the AudioSegment to mono, like so:
audio = AudioSegment.from_file('fileInput.mp3')
audio = audio.set_channels(1)
Either way, I highly recommend you use the sample rate of the input file wherever a sample rate is required; otherwise loading a file whose sample rate differs from the hard-coded value will shift the filter frequencies and change the length and playback speed of the output file. E.g.
sos = signal.butter(10, [100, 4000], 'bandstop', fs=audio.frame_rate, output='sos')
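Putting the pieces together, a minimal sketch of the mono approach (my own assembly of the steps above, using the same file names as in the question):
from pydub import AudioSegment
from scipy import signal
from scipy.io.wavfile import write
import numpy as np

audio = AudioSegment.from_file('fileInput.mp3').set_channels(1)  # convert to mono
samples = np.array(audio.get_array_of_samples())
# Design the band-stop filter with the file's own sample rate
sos = signal.butter(10, [100, 4000], 'bandstop', fs=audio.frame_rate, output='sos')
filtered = signal.sosfilt(sos, samples)
# Export with the same sample rate, so the output keeps the input's duration
write('./test.wav', audio.frame_rate, filtered.astype(np.int16))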
I am adding noise to a signal using librosa, but after adding the noise I am unable to save the signal back as a wav file.
My code is as follows:
import librosa
import matplotlib.pyplot as plt
import numpy as np
import math
file_path = r'path\to\file'

signal, sr = librosa.load(file_path, sr=16000)
# plt.plot(signal)

RMS = math.sqrt(np.mean(signal**2))
STD_n = 0.001
noise = np.random.normal(0, STD_n, signal.shape[0])

# X = np.fft.rfft(noise)
# radius, angle = to_polar(X)

signal_noise = signal + noise
I want to save signal_noise as a wav file. I tried different librosa functions but couldn't find one for this. I tried scipy.io.wavfile.write but got an error, probably because librosa generates normalized (float) audio while scipy doesn't.
You can do it using the soundfile library. Add these lines to your code:
import soundfile
soundfile.write('filename.wav',signal_noise,16000)
Parameters:
The 1st parameter is the file name
The 2nd parameter is the audio that u wanna save
The 3rd parameter is the sample rate
Hope this helps!
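If you would rather stick with scipy.io.wavfile.write (which the question already tried), one common workaround is to rescale the normalized float signal to 16-bit integers first. A minimal sketch, assuming signal_noise stays roughly within [-1, 1] (the output name is just illustrative):
import numpy as np
from scipy.io.wavfile import write
# Scale the float signal to the int16 range before writing
scaled = np.int16(signal_noise / np.max(np.abs(signal_noise)) * 32767)
write('filename_scipy.wav', 16000, scaled)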
Basically I have trained a few models using Keras to do isolated word recognition. Currently I can record audio using sounddevice's record function for a pre-fixed duration and save it as a wav file. I have implemented silence detection to trim out unwanted samples, but this all happens after the whole recording is complete. I would like to get the trimmed audio segments immediately while recording, so that I can do speech recognition in real time. I'm using Python 2 and TensorFlow 1.14.0. Below is a snippet of what I currently have:
import sounddevice as sd
import matplotlib.pyplot as plt
import time
#import tensorflow.keras.backend as K
import numpy as np
from scipy.io.wavfile import write
from scipy.io.wavfile import read
from scipy.io import wavfile
from pydub import AudioSegment
import cv2
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()
from contextlib import closing
import multiprocessing
models=['model1.h5','model2.h5','model3.h5','model4.h5','model5.h5']
loaded_models=[]
for model in models:
    loaded_models.append(tf.keras.models.load_model(model))

def prediction(model_ip):
    model, t = model_ip
    ret_val = model.predict(t).tolist()[0]
    return ret_val
print("recording in 5sec")
time.sleep(5)
fs = 44100 # Sample rate
seconds = 10 # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh=0.025
gaplimit=9000
wav_file='/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs,myrecording = read(wav_file)[0], read(wav_file)[1]
#Now the silence removal function is called which trims and saves only the useful audio samples in the form of a wav file. This trimmed audio contains the full word which can be recognized.
end_points(wav_file,thresh,50)
#Below for loop combines the loaded models(I'm using multiple models) with the input in a tuple
for trimmed_aud in trimmed_audio:
    ...
    ...
    # The trimmed audio is processed further; the input which the model can predict on is t
    ...
    modelon = []
    for md in loaded_models:
        modelon.append((md, t))
    start_time = time.time()
    with closing(multiprocessing.Pool()) as p:
        predops = p.map(prediction, modelon)
    print('Total time taken: {}'.format(time.time() - start_time))
    actops = []
    for predop in predops:
        actops.append(predop.index(max(predop)))
    print(actops)
    max_freqq = max(set(actops), key=actops.count)
    final_ans += str(max_freqq)
print("Output: {}".format(final_ans))
Note that the above code only includes what is relevant to the question and will not run. I wanted to give an overview of what I have so far. I would really appreciate your input on how I can record and trim audio based on a threshold simultaneously: if multiple words are spoken within the recording duration of 10 seconds (the seconds variable in the code), then as I speak, whenever the energy of the samples in a 50 ms window drops below a certain threshold, I want to cut the audio at those two points, trim it, and use it for prediction. Recording and prediction of the trimmed segments must happen simultaneously, so that each output word can be displayed immediately after its utterance during the 10 seconds of recording. Any suggestions on how I can go about this would be really appreciated.
Hard to say without knowing your model architecture, but there are models specifically designed for streaming recognition, like Facebook's streaming convnets. You won't be able to implement them in Keras easily, though.
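For the "record and trim at the same time" part, here is a rough sketch of one possible approach (my own addition, not part of the answer above), using sounddevice's callback-based InputStream with the same 50 ms window and 0.025 threshold as in the question. It is sketched for Python 3; in Python 2 the queue module is called Queue.
import queue
import numpy as np
import sounddevice as sd

fs = 44100
block_ms = 50                             # 50 ms analysis window, as in the question
blocksize = int(fs * block_ms / 1000)
thresh = 0.025                            # energy threshold from the question
audio_q = queue.Queue()

def callback(indata, frames, time_info, status):
    # Called by sounddevice for every 50 ms block while the stream is open
    audio_q.put(indata.copy())

current_word = []
with sd.InputStream(samplerate=fs, channels=1, blocksize=blocksize, callback=callback):
    for _ in range(int(10 * 1000 / block_ms)):     # 10 s of recording, block by block
        block = audio_q.get()
        if np.sqrt(np.mean(block ** 2)) > thresh:  # block contains speech
            current_word.append(block)
        elif current_word:                         # silence right after speech: word ended
            word = np.concatenate(current_word)
            current_word = []
            # build the model input t from word here and run the predictions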
I am new to Python, and I am trying to train my audio voice recognition model. I want to read a .wav file and get the contents of that .wav file into NumPy arrays. How can I do that?
In keeping with @Marco's comment, you can have a look at the SciPy library and, in particular, at scipy.io.
from scipy.io import wavfile
To read your file ('filename.wav'), simply do
output = wavfile.read('filename.wav')
This will output a tuple (which I named 'output'):
output[0], the sampling rate
output[1], the sample array you want to analyze
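For example, unpacking the tuple directly (a small illustrative snippet, assuming 'filename.wav' is in the working directory):
from scipy.io import wavfile
rate, data = wavfile.read('filename.wav')
print(rate, data.shape, data.dtype)  # sampling rate, sample count/channels, sample format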
This is possible with a few lines using wave (built in) and numpy (obviously). You don't need librosa, scipy or soundfile. The last one gave me problems reading wav files, and it's the whole reason I'm writing here now.
import numpy as np
import wave
# Start opening the file with wave
with wave.open('filename.wav') as f:
    # Read the whole file into a buffer. If you are dealing with a large file
    # then you should read it in blocks and process them separately.
    buffer = f.readframes(f.getnframes())
    # Convert the buffer to a numpy array by checking the size of the sample
    # width in bytes. The output will be a 1D array with interleaved channels.
    interleaved = np.frombuffer(buffer, dtype=f'int{f.getsampwidth()*8}')
    # Reshape it into a 2D array separating the channels in columns.
    data = np.reshape(interleaved, (-1, f.getnchannels()))
I like to pack it into a function that returns the sampling frequency and works with pathlib.Path objects. In this way it can be played using sounddevice
# play_wav.py
import sounddevice as sd
import numpy as np
import wave
from typing import Tuple
from pathlib import Path
# Utility function that reads the whole `wav` file content into a numpy array
def wave_read(filename: Path) -> Tuple[np.ndarray, int]:
    with wave.open(str(filename), 'rb') as f:
        buffer = f.readframes(f.getnframes())
        inter = np.frombuffer(buffer, dtype=f'int{f.getsampwidth()*8}')
        return np.reshape(inter, (-1, f.getnchannels())), f.getframerate()

if __name__ == '__main__':
    # Play all files in the current directory
    for wav_file in Path().glob('*.wav'):
        print(f"Playing {wav_file}")
        data, fs = wave_read(wav_file)
        sd.play(data, samplerate=fs, blocking=True)
I'm a Python newbie and an audio analysis newbie. If this is not the right place for this question, please point me to the right place.
I have an mp3 audio file which has just silence.
Converted to .wav using sox
sox input.mp3 output.wav
from scipy.io.wavfile import read
import matplotlib.pyplot as plt
(fs,x)=read('/home/vivek/Documents/VivekProjects/Silence/silence.wav')
##plt.rcParams['agg.path.chunksize'] = 5000 # for preventing overflow error.
fs
x.size/float(fs)
plt.plot(x)
Which generates this image:
I also used the solution to this question: How to plot a wav file
from scipy.io.wavfile import read
import matplotlib.pyplot as plt
# read audio samples
input_data = read("/home/vivek/Documents/VivekProjects/Silence/silence.wav")
audio = input_data[1]
# plot the first 1024 samples
plt.plot(audio)
# label the axes
plt.ylabel("Amplitude")
plt.xlabel("Time")
# set the title
plt.title("Sample Wav")
# display the plot
plt.show()
Which generated this image:
Question:
I want to know how to interpret the different colored bars (blue, green, yellow) in the chart. If you listen to the file, it is only silence, and I expected to see just a flat line, if anything.
My mp3 file can be downloaded from here.
The sox converted wav file can be found here.
Even though the file is silent, even Dropbox generates a waveform for it. I can't seem to figure out why.
First, always check the shape of your data before plotting.
x.shape
## (3479040, 2)
So the 2 here means you have two channels in your .wav file; matplotlib by default plots them in different colors. You will need to slice the matrix by row in this situation.
import matplotlib.pyplot as plt
ind = int(fs * 0.5) ## plot first 500ms
### plot as time series
plt.plot(x[:ind,:])
plt.figure()
#### Visualise distribution
plt.hist(x[:ind,0],bins = 10)
plt.gca().set_yscale('log')
#####
print(x.min(), x.max())
#### -3 3
As can be seen from the graph, the signal has very low absolute values (between -3 and 3). Depending on the encoding of the .wav file (integer or float), this translates to a very low amplitude, which is why it sounds silent.
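To put that (-3, 3) range in perspective, a quick back-of-the-envelope check (my own addition, assuming 16-bit PCM, where full scale is 32767):
import numpy as np
peak = 3             # largest absolute sample value observed in the file
full_scale = 32767   # full scale for 16-bit PCM
print(20 * np.log10(peak / full_scale))  # about -80.8 dB below full scale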
I myself am not familiar with the precise encoding, but this page might help: http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
For all formats other than PCM, the Format chunk must have an extended portion. The extension can be of zero length, but the size field (with value 0) must be present.
For float data, full scale is 1. The bits/sample would normally be 32 or 64.
For the log-PCM formats (µ-law and A-law), the Rev. 3 documentation indicates that the bits/sample field (wBitsPerSample) should be set to 8 bits.
The non-PCM formats must have a fact chunk.
PS: if you want to start some more advanced audio analysis, do check this workshop which I found super practical, especially the Energy part and FFT part.
I had a suspicion that your silence.mp3 file contained audio at a very low level (below the threshold of hearing), since I couldn't hear anything even when playing it at maximum speaker volume.
So I came across plotting audio frequencies from an mp3, from here.
First we convert the mp3 audio to wav. As the parent file is stereo, the converted wav file is stereo as well. To demonstrate that there are audio frequencies present, we only need a single channel.
Once we have single-channel wav audio, we simply plot frequency against the time index with a color bar of dB power level.
import scipy.io.wavfile
from pydub import AudioSegment
import matplotlib.pyplot as plt
import numpy as np
from numpy import fft as fft
#read mp3 file
mp3 = AudioSegment.from_mp3("silence.mp3")
#convert to wav
mp3.export("silence.wav", format="wav")
#read wav file
rate,audData=scipy.io.wavfile.read("silence.wav")
#if stereo grab both channels
channel1=audData[:,0] #left
#channel2=audData[:,1] #right channel, we dont need here
#create a time variable in seconds
time = np.arange(0, float(audData.shape[0]), 1) / rate
#Plot spectrogram of frequency vs time
plt.figure(1, figsize=(8,6))
plt.subplot(211)
Pxx, freqs, bins, im = plt.specgram(channel1, Fs=rate, NFFT=1024, cmap=plt.get_cmap('autumn_r'))
cbar=plt.colorbar(im)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
cbar.set_label('Intensity dB')
plt.show()
As you can see in the image, silence.mp3 does contain audio frequencies, with power levels of roughly -30 to -45 dB.
I would like to use sounddevice's playrec feature. To start, I would like to just get sd.play() to work. I am new to Python and have never worked with NumPy; I have gotten audio to play using pyaudio, but I need sounddevice's simultaneous play/record feature. When I try to play a .wav audio file I get: TypeError: Unsupported data type: 'string288'. I think it has something to do with having to store the .wav file in a numpy array, but I have no idea how to do that. Here is what I have:
import sounddevice as sd
import numpy as np
sd.default.samplerate = 44100
sd.play('test.wav')
sd.wait
The documentation of sounddevice.play() says:
sounddevice.play(data, samplerate=None, mapping=None, blocking=False, loop=False, **kwargs)
where data is an "array-like".
It can't work with an audio file name, as you tried. The audio file first has to be read and interpreted as a numpy array.
This code should work:
import soundfile as sf
filename = 'test.wav'  # the file from the question
data, fs = sf.read(filename, dtype='float32')
sd.play(data, fs)
You'll find more examples here.
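And since the question ultimately asks about playrec, here is a minimal sketch of simultaneous play and record (my own addition, assuming a default input device is available; recorded.wav is just an illustrative output name):
import soundfile as sf
import sounddevice as sd

data, fs = sf.read('test.wav', dtype='float32')
# Play data and record from the default input device at the same time
recording = sd.playrec(data, samplerate=fs, channels=1)
sd.wait()  # block until playback and recording have finished
sf.write('recorded.wav', recording, fs)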