To perform an end-to-end test of an embedded platform that plays musical notes, we are trying to record via a microphone and identify whether a specific sound was played using the device's speakers. The testing setup is not a real-time system, so we don't really know when (or even if) the expected sound begins and ends.
The expected sound is represented in a wave file (or similar) we can read from disk.
How can we run a test that asserts whether the sound was played as expected?
There are a few ways to tackle this problem:
Convert the expected sound into a sequence of frequency-amplitude pairs. Then record the sound via the microphone and convert that recording into a corresponding sequence of frequency-amplitude pairs. Finally, compare the two sequences to see if they match. This task can be accomplished using the modules scipy, numpy, and matplotlib.
We'll need to generate a sequence of frequency-amplitude pairs for the expected sound. We can do this by using the scipy.io.wavfile.read() function to read in a wave file containing the expected sound. This function will return a tuple containing the sample rate (in samples per second) and a numpy array containing the amplitudes of the waveform. We can then use the numpy.fft.fft() function to convert the amplitudes into a sequence of frequency-amplitude pairs.
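A minimal sketch of this step ('expected_sound.wav' is a placeholder filename):

import numpy as np
from scipy.io import wavfile

# read the expected sound; 'expected_sound.wav' is a placeholder filename
rate, amplitudes = wavfile.read('expected_sound.wav')

# if the file is stereo, keep a single channel for simplicity
if amplitudes.ndim > 1:
    amplitudes = amplitudes[:, 0]

# convert the waveform into frequency/amplitude pairs
spectrum = np.fft.fft(amplitudes)
freqs = np.fft.fftfreq(len(amplitudes), d=1.0 / rate)

# keep only the magnitudes of the positive-frequency half of the spectrum
half = len(freqs) // 2
expected_pairs = list(zip(freqs[:half], np.abs(spectrum[:half])))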
We'll need to record the sound via the microphone. For this, we'll use the pyaudio module. We can create a PyAudio object using the pyaudio.PyAudio() constructor, and then use the open() method to open a stream on the microphone. We can then read in blocks of data from the stream using the read() method. Each block of data is a string of raw bytes that can be converted (for example with numpy.frombuffer()) into a numpy array containing the amplitudes of the waveform over that stretch of time. We can then use the numpy.fft.fft() function to convert the amplitudes into a sequence of frequency-amplitude pairs.
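A sketch of the recording step, assuming the default input device is the test microphone (the chunk size and recording length are arbitrary choices for illustration):

import numpy as np
import pyaudio

RATE = 44100
CHUNK = 1024
SECONDS = 5  # how long to record; an arbitrary assumption for this sketch

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

# read blocks of raw bytes from the microphone and collect them
blocks = []
for _ in range(int(RATE / CHUNK * SECONDS)):
    data = stream.read(CHUNK)
    blocks.append(np.frombuffer(data, dtype=np.int16))

stream.stop_stream()
stream.close()
pa.terminate()

# convert the whole recording into frequency/amplitude pairs
recording = np.concatenate(blocks)
spectrum = np.abs(np.fft.fft(recording))
freqs = np.fft.fftfreq(len(recording), d=1.0 / RATE)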
Finally, we can compare the two sequences of frequency-amplitude pairs to see if they match. A recording will never be bit-identical to the source, so the comparison should allow some tolerance. If the sequences match within that tolerance, we can conclude that the expected sound was recorded correctly; if they don't, we can conclude that it was not.
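As a rough sketch of the comparison step (comparing only the dominant frequency, with an arbitrary tolerance, is a simplification; a real test might compare several prominent peaks):

import numpy as np

def dominant_frequency(samples, rate):
    # return the frequency (Hz) with the largest magnitude in the signal
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    return freqs[np.argmax(spectrum)]

def sounds_match(expected_samples, recorded_samples, rate, tolerance_hz=5.0):
    # expected_samples and recorded_samples come from the two steps above
    expected_peak = dominant_frequency(expected_samples, rate)
    recorded_peak = dominant_frequency(recorded_samples, rate)
    return abs(expected_peak - recorded_peak) <= tolerance_hz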
Use a sound recognition system to identify the expected sound in the recording.
from pydub import AudioSegment
from pydub.silence import split_on_silence
from pydub.playback import play

def get_sound_from_recording():
    sound = AudioSegment.from_wav("recording.wav")
    # detect silent stretches and split the recording on them:
    # split on silences longer than 1000 ms, treat anything under -16 dBFS as silence,
    # and keep 200 ms of silence at the beginning and end of each chunk
    chunks = split_on_silence(sound, min_silence_len=1000, silence_thresh=-16, keep_silence=200)
    for chunk in chunks:
        play(chunk)
    return chunks
Cross-correlate the recording with the expected sound. This will produce a sequence of values that indicates how closely the recording matches the expected sound. A high value at a particular time index indicates that the recording and expected sound match closely at that time.
from scipy.io import wavfile
from scipy import signal
import matplotlib.pyplot as plt

# read in the recorded wav file and get the sampling rate
sampling_freq, audio = wavfile.read('audio_file.wav')
# read in the expected sound ('expected_sound.wav' is a placeholder; assumed mono, same sample rate)
reference_freq, reference = wavfile.read('expected_sound.wav')
# cross-correlate the recording with the expected sound
corr = signal.correlate(audio, reference, mode='valid')
# plot the cross-correlation signal; a pronounced peak marks where the expected sound occurs
plt.plot(corr)
plt.show()
This way you can set up your test to check if you are getting the correct output.
Related
I need to
read in variable data from sensors
use those data to generate audio
spit out the generated audio to individual audio output channels in real time
My trouble is with item 3.
Parts 1 & 2 have a lot in common with a guitar effects pedal, I should think: take in some variable and then adjust the audio output in real time as the input variable changes, but don't ever stop sending a signal while doing it.
I have had no trouble using pyaudio to drive wav files to specific channels using the mapping[] parameter of pyaudio.play, nor have I had trouble generating sine waves dynamically and sending them out using pyaudio.stream.play.
I'm working with 8 audio output channels. My problem is that stream.play only lets you specify a count of channels and, as far as I can tell, I can't say, for example, "stream generated_audio to channel 5".
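A sketch of one possible workaround, assuming the output device exposes 8 interleaved channels: open an 8-channel stream and write frames in which only the target channel carries the generated signal, with zeros in the other columns.

import numpy as np
import pyaudio

RATE = 44100
N_CHANNELS = 8
TARGET_CHANNEL = 5  # zero-based index of the output channel to drive

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=N_CHANNELS, rate=RATE, output=True)

# generate one second of a sine wave as the signal to route
t = np.linspace(0, 1, RATE, endpoint=False)
tone = (10000 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

# build an interleaved (frames x channels) buffer with the tone on one channel only
frames = np.zeros((len(tone), N_CHANNELS), dtype=np.int16)
frames[:, TARGET_CHANNEL] = tone

stream.write(frames.tobytes())
stream.stop_stream()
stream.close()
pa.terminate()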
Say I computed the Short-Time Fourier Transform or spectrogram of some audio input using scipy.signal.stft or scipy.signal.spectrogram.
Is there an easy way in Python to use it as a filter for my audio input, to produce another audio signal that would basically be just silence, or very near to that?
In practice I would modify that spectrogram before applying it back to the audio, to filter out only some of the audio information.
A magnitude spectrogram is lossy (it does not preserve the necessary phase information), and thus cannot be used to recreate the input signal (or a filtered variant).
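For what it's worth, the complex STFT returned by scipy.signal.stft does keep phase, so something along these lines (a sketch only; the filename and the 2 kHz cutoff are arbitrary assumptions) can modify the time-frequency representation and resynthesize audio with scipy.signal.istft:

import numpy as np
from scipy import signal
from scipy.io import wavfile

rate, audio = wavfile.read('input.wav')  # placeholder filename; assumed mono

# complex STFT (keeps phase, unlike a magnitude spectrogram)
f, t, Zxx = signal.stft(audio, fs=rate, nperseg=1024)

# example modification: zero out everything above 2 kHz (an arbitrary choice)
Zxx[f > 2000, :] = 0

# inverse STFT to get the filtered time-domain signal back
_, filtered = signal.istft(Zxx, fs=rate, nperseg=1024)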
I am trying to write a simple audio function generator in Python, to be run on a Raspberry Pi (model 2). The code essentially does this:
Generate 1 second of the audio signal (say, a sine wave, or a square wave, etc)
Play it repeatedly in a loop
For example:
import pyaudio
from numpy import linspace, sin, pi, int16

def note(freq, len, amp=1, rate=44100):
    t = linspace(0, len, len*rate)
    data = sin(2*pi*freq*t)*amp
    return data.astype(int16)  # two byte integers

RATE = 44100
FREQ = 261.6

pa = pyaudio.PyAudio()
s = pa.open(output=True,
            channels=2,
            rate=RATE,
            format=pyaudio.paInt16,
            output_device_index=2)

# generate 1 second of sound
tone = note(FREQ, 1, amp=10000, rate=RATE)

# play it forever
while True:
    s.write(tone)
The problem is that every iteration of the loop results in an audible "tick" in the audio, even when using an external USB sound card. Is there any way to avoid this, rather than trying to rewrite everything in C?
I tried using the pyaudio callback interface, but that actually sounded worse (like maybe my Pi was flatulent).
The generated audio needs to be short because it will ultimately be adjusted dynamically with an external control, and anything more than 1 second latency on control changes just feels awkward. Is there a better way to produce these signals from within Python code?
You're hearing a "tick" because there's a discontinuity in the audio you're sending. One second of 261.6 Hz contains 261.6 cycles, so you end up with about 0.6 of a cycle left over at the end, and the next repetition starts with a jump in the waveform.
You'll need to either change the frequency so that there is a whole number of cycles per second (e.g., 262 Hz), change the duration so that it's long enough for a whole number of cycles, or generate a new audio clip every second that starts in the right phase to fit where the last chunk left off.
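A sketch of that third option, keeping a running phase so each generated chunk starts exactly where the previous one ended (mono output and the default device are used here for simplicity):

import numpy as np
import pyaudio

RATE = 44100
FREQ = 261.6
AMP = 10000

pa = pyaudio.PyAudio()
s = pa.open(output=True, channels=1, rate=RATE, format=pyaudio.paInt16)

phase = 0.0
phase_step = 2 * np.pi * FREQ / RATE  # phase advance per sample

while True:
    # generate one second of samples starting at the current phase
    phases = phase + phase_step * np.arange(RATE)
    chunk = (AMP * np.sin(phases)).astype(np.int16)
    # remember where this chunk ended so the next one continues smoothly
    phase = (phases[-1] + phase_step) % (2 * np.pi)
    s.write(chunk.tobytes())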
I was looking for a similar question to yours, and found a variation that plays a pre-calculated length by concatenating a bunch of pre-calculated chunks.
http://milkandtang.com/blog/2013/02/16/making-noise-in-python/
Using a for loop with a 1-second pre-calculated chunk in a "play_tone" function seems to generate a smooth-sounding output, but this is on a PC. If this doesn't work for you, it may be that the Raspberry Pi has a different back-end implementation that doesn't like successive writes.
I'm doing some research on how to compare sound files (WAV). Basically, I want to compare stored sound files (WAV) with sound from a microphone. So in the end I would like to pre-store some voice commands of my own, and then, when I'm running my app, compare the pre-stored files with input from the microphone.
My thought was to allow some margin when comparing, because saying something twice in exactly the same way would be difficult, I guess.
So after some googling I see that Python has this module named wave and the Wave_read object. That object has a function named readframes(n):
Reads and returns at most n frames of
audio, as a string of bytes.
What do these bytes contain? I'm thinking of looping through the wave files one frame at a time, comparing them frame by frame.
An audio frame, or sample, contains amplitude (loudness) information at that particular point in time. To produce sound, tens of thousands of frames are played in sequence to produce frequencies.
In the case of CD quality audio or uncompressed wave audio, there are around 44,100 frames/samples per second. Each of those frames contains 16-bits of resolution, allowing for fairly precise representations of the sound levels. Also, because CD audio is stereo, there is actually twice as much information, 16-bits for the left channel, 16-bits for the right.
When you use the wave module in Python to get a frame, it will be returned as a string of raw bytes (which print as hexadecimal escape characters):
One byte for an 8-bit mono signal.
Two bytes for 8-bit stereo.
Two bytes for 16-bit mono.
Four bytes for 16-bit stereo.
In order to convert and compare these values you'll have to first use the python wave module's functions to check the bit depth and number of channels. Otherwise, you'll be comparing mismatched quality settings.
A simple byte-by-byte comparison has almost no chance of a successful match, even with some tolerance thrown in. Voice-pattern recognition is a very complex and subtle problem that is still the subject of much research.
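For the "check the bit depth and number of channels" part, a sketch using the wave module (assuming 16-bit PCM files) might look like this:

import wave
import numpy as np

def read_samples(path):
    # read a WAV file and return (sample_rate, numpy array of samples)
    with wave.open(path, 'rb') as wf:
        n_channels = wf.getnchannels()
        samp_width = wf.getsampwidth()   # bytes per sample (2 for 16-bit audio)
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())  # raw bytes, channels interleaved

    assert samp_width == 2, "this sketch assumes 16-bit samples"
    samples = np.frombuffer(frames, dtype=np.int16)
    if n_channels == 2:
        samples = samples.reshape(-1, 2)  # one column per channel
    return rate, samples

Before comparing two files you would check that both return the same rate and channel layout.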
I believe the accepted description to be slightly incorrect.
A frame appears to be somewhat like a stride in graphics formats. For interleaved stereo @ 16 bits/sample, the frame size is 2*sizeof(short) = 4 bytes. For non-interleaved stereo @ 16 bits/sample, the samples of the left channel are all one after another, so the frame size is just sizeof(short).
The first thing you should do is a Fourier transform, to transform the data into its frequencies. It is rather complex, however. I wouldn't use voice recognition libraries here, as it sounds like you don't record voices only. You would then try different time shifts (in case the sounds are not exactly aligned) and use the one that gives you the best similarity, where you have to define a similarity function. Oh, and you should normalize both signals (same maximum loudness).
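A sketch of that recipe (normalization plus trying every time shift via cross-correlation; the scoring here is just one possible choice of similarity function):

import numpy as np

def normalize(x):
    # scale a signal so its peak amplitude is 1.0 (same maximum loudness)
    x = x.astype(float)
    return x / np.max(np.abs(x))

def best_alignment_score(reference, candidate):
    # cross-correlate the two signals over every relative time shift and
    # return the highest similarity found, scaled to roughly [-1, 1]
    a = normalize(reference)
    b = normalize(candidate)
    corr = np.correlate(a, b, mode='full')
    return np.max(corr) / np.sqrt(np.sum(a**2) * np.sum(b**2))

For long recordings, scipy.signal.correlate is a faster drop-in for the numpy call.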
I'm working on a project where I need to know the amplitude of sound coming in from a microphone on a computer.
I'm currently using Python with the Snack Sound Toolkit and I can record audio coming in from the microphone, but I need to know how loud that audio is. I could save the recording to a file and use another toolkit to read in the amplitude at given points in time from the audio file, or try and get the amplitude while the audio is coming in (which could be more error prone).
Are there any libraries or sample code that can help me out with this? I've been looking and so far the Snack Sound Toolkit seems to be my best hope, yet there doesn't seem to be a way to get direct access to amplitude.
Looking at the Snack Sound Toolkit examples, there seems to be a dbPowerSpectrum function.
From the reference:
dBPowerSpectrum ( )
Computes the log FFT power spectrum of the sound (at the sample number given in the start option) and returns a list of dB values. See the section item for a description of the rest of the options. Optionally an ending point can be given, using the end option. In this case the result is the average of consecutive FFTs in the specified range. Their default spacing is taken from the fftlength but this can be changed using the skip option, which tells how many points to move the FFT window each step. Options:
EDIT: I am assuming that when you say amplitude, you mean how "loud" the sound appears to a human, and not the time-domain voltage (which would probably average to 0 over the entire length, since the integral of a sine wave is 0: 10 * sin(t) is louder than 5 * sin(t), but their average value over time is 0. You do not want to send non-AC voltages to a speaker anyway).
To get how loud the sound is, you will need to determine the amplitudes of each frequency component. This is done with a Fourier transform (FFT), which breaks down the sound into its frequency components. The dBPowerSpectrum function seems to give you a list of the magnitudes (forgive me if this differs from the exact definition of a power spectrum) of each frequency. To get the total volume, you can just sum the entire list (which will be close, except it still might differ from perceived loudness, since the human ear has a frequency response of its own).
I disagree completely with this "answer" from CookieOfFortune.
Granted, the question is poorly phrased... but this answer is making things much more complex than necessary. I am assuming that by "amplitude" you mean perceived loudness. Technically, each sample in the (PCM) audio stream represents an amplitude of the signal at a given time slice. To get a loudness representation, try a simple RMS calculation:
RMS = sqrt( (x_1^2 + x_2^2 + ... + x_n^2) / n )
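In Python, once the samples are in a numpy array, that works out to something like this (a sketch assuming 16-bit PCM samples):

import numpy as np

def rms(samples):
    # root-mean-square level of a block of audio samples
    samples = samples.astype(float)
    return np.sqrt(np.mean(samples ** 2))

def rms_db(samples, full_scale=32768.0):
    # RMS expressed in dB relative to full scale for 16-bit audio
    return 20 * np.log10(rms(samples) / full_scale)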
I'm not sure if this will help, but skimpygimpy provides facilities for parsing WAVE files into Python sequences and back. You could potentially use this to examine the waveform samples directly and do what you like. You will have to read some source; these subcomponents are not documented.