I'm doing some research on how to compare sound files (WAV). Basically, I want to compare stored sound files with sound from a microphone. In the end I would like to pre-store some voice commands of my own, and then, while my app is running, compare the pre-stored files with input from the microphone.
My thought was to allow some margin when comparing, because saying something twice in a row in exactly the same way would be difficult, I guess.
So after some googling I see that Python has this module named wave and the Wave_read object. That object has a function named readframes(n):
Reads and returns at most n frames of
audio, as a string of bytes.
What do these bytes contain? I'm thinking of looping through the wave files one frame at a time, comparing them frame by frame.
An audio frame, or sample, contains amplitude (loudness) information at that particular point in time. To produce sound, tens of thousands of frames are played in sequence to produce frequencies.
In the case of CD-quality audio (or uncompressed wave audio recorded at the same settings), there are 44,100 frames/samples per second. Each of those frames contains 16 bits of resolution, allowing for fairly precise representations of the sound levels. Also, because CD audio is stereo, there is actually twice as much information: 16 bits for the left channel and 16 bits for the right.
When you use the wave module in Python to get a frame, it is returned as a string of bytes (a bytes object in Python 3):
One byte for an 8-bit mono signal.
Two bytes for 8-bit stereo.
Two bytes for 16-bit mono.
Four bytes for 16-bit stereo.
In order to convert and compare these values you'll have to first use the python wave module's functions to check the bit depth and number of channels. Otherwise, you'll be comparing mismatched quality settings.
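As a rough sketch of that check, the following reads the channel count and sample width from the file header and only then decodes the bytes from readframes() into numbers. The helper name read_samples is made up for illustration, and the sketch assumes plain 8-bit or 16-bit PCM:

```python
import struct
import wave


def read_samples(path):
    """Return ((channels, sample_width, frame_rate), frames) for a PCM WAV file.

    Hypothetical helper: each entry of frames is a tuple holding one
    sample per channel.
    """
    with wave.open(path, "rb") as wav:
        n_channels = wav.getnchannels()    # 1 = mono, 2 = stereo
        sample_width = wav.getsampwidth()  # bytes per sample: 1 = 8-bit, 2 = 16-bit
        frame_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    if sample_width == 1:
        fmt_char = "B"  # 8-bit WAV samples are unsigned
    elif sample_width == 2:
        fmt_char = "h"  # 16-bit WAV samples are signed, little-endian
    else:
        raise ValueError("this sketch only handles 8-bit and 16-bit PCM")

    bytes_per_frame = n_channels * sample_width
    n_frames = len(raw) // bytes_per_frame
    values = struct.unpack("<" + fmt_char * (n_frames * n_channels), raw)
    # Group the interleaved values so each frame holds one sample per channel.
    frames = [values[i:i + n_channels] for i in range(0, len(values), n_channels)]
    return (n_channels, sample_width, frame_rate), frames
```

Two files are only worth comparing sample by sample if the first element of the returned tuple (channels, sample width, frame rate) is the same for both.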
A simple byte-by-byte comparison has almost no chance of a successful match, even with some tolerance thrown in. Voice-pattern recognition is a very complex and subtle problem that is still the subject of much research.
I believe the accepted description to be slightly incorrect.
A frame appears to be somewhat like the stride in graphics formats. For interleaved stereo at 16 bits/sample, the frame size is 2*sizeof(short) = 4 bytes. For non-interleaved stereo at 16 bits/sample, the samples of the left channel all follow one another, so the frame size is just sizeof(short).
The first thing you should do is apply a Fourier transform to convert the data into its frequency components. It is rather complex, however. I wouldn't use voice recognition libraries here, as it sounds like you aren't only recording voices. You would then try different time shifts (in case the sounds are not exactly aligned) and use the one that gives you the best similarity, where you have to define a similarity function yourself. Oh, and you should normalize both signals (same maximum loudness).
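A minimal sketch of the normalization and time-shift search, assuming both signals are mono numpy arrays with the same sample rate. It works directly on the samples rather than on spectra, and cosine similarity is just one possible choice of similarity function. Both function names are made up for illustration:

```python
import numpy as np


def normalize(signal):
    """Scale a signal so its largest absolute amplitude is 1.0."""
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal


def best_shift_similarity(a, b, max_shift):
    """Try shifting b against a; return (best_shift, best_similarity).

    Similarity is the cosine similarity of the overlapping samples,
    an assumed choice that can be swapped for any other measure.
    """
    a, b = normalize(a), normalize(b)
    best = (0, -1.0)
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            x, y = a[shift:], b
        else:
            x, y = a, b[-shift:]
        n = min(len(x), len(y))
        if n == 0:
            continue
        x, y = x[:n], y[:n]
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        score = float(np.dot(x, y) / denom) if denom > 0 else 0.0
        if score > best[1]:
            best = (shift, score)
    return best
```

A positive shift means b starts that many samples into a. The same loop could be run over blocks of FFT magnitudes instead of raw samples.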
To perform an end-to-end test of an embedded platform that plays musical notes, we are trying to record via a microphone and identify whether a specific sound was played using the device's speakers. The testing setup is not a real-time system, so we don't really know when (or even if) the expected sound begins and ends.
The expected sound is represented in a wave file (or similar) we can read from disk.
How can we run a test that asserts whether the sound was played as expected?
There are a few ways to tackle this problem:
Convert the expected sound into a sequence of frequency amplitude
pairs. Then, record the sound via the microphone and convert that
recording into a corresponding sequence of frequency amplitude
pairs. Finally, compare the two sequences to see if they match.
This task can be accomplished using the modules scipy, numpy,
and matplotlib.
We'll need to generate a sequence of frequency amplitude pairs
for the expected sound. We can do this by using the
scipy.io.wavfile.read() function to read in a wave file
containing the expected sound. This function will return a tuple
containing the sample rate (in samples per second) and a numpy
array containing the amplitudes of the waveform. We can then use
the numpy.fft.fft() function to convert the amplitudes into a
sequence of frequency amplitude pairs.
We'll need to record the sound via the microphone. For this,
we'll use the pyaudio module. We can create a PyAudio object
using the pyaudio.PyAudio() constructor, and then use the
open() method to open a stream on the microphone. We can then
read in blocks of data from the stream using the read() method.
Each block of data is returned as a bytes object, which can be
converted into a numpy array of amplitudes with numpy.frombuffer().
We can then use the numpy.fft.fft() function to convert the
amplitudes into a sequence of frequency amplitude pairs.
Finally, we can compare the two sequences of frequency
amplitude pairs to see if they match. If they do match, then we
can conclude that the expected sound was recorded correctly. If
they don't match, then we can conclude that the expected sound
was not recorded correctly.
Use a sound recognition system to identify the expected sound in the recording.
from pydub import AudioSegment
from pydub.silence import split_on_silence
from pydub.playback import play

def get_sound_from_recording():
    sound = AudioSegment.from_wav("recording.wav")
    # Detect silent chunks and split the recording on them: split on silences
    # longer than 1000 ms (anything under -16 dBFS is considered silence) and
    # keep 200 ms of silence at the beginning and end of each chunk.
    chunks = split_on_silence(sound, min_silence_len=1000, keep_silence=200)
    for chunk in chunks:
        play(chunk)
    return chunks
Cross-correlate the recording with the expected sound. This will produce a sequence of values that indicates how closely the recording matches the expected sound. A high value at a particular time index indicates that the recording and expected sound match closely at that time.
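import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

# read in the recording and the expected (reference) sound; both are
# assumed to be mono and to share the same sampling rate
sampling_freq, audio = wavfile.read('audio_file.wav')
reference_freq, reference = wavfile.read('reference_sound.wav')
# cross-correlate the recording with the reference sound
corr = signal.correlate(audio.astype(float), reference.astype(float), mode='valid')
# plot the cross-correlation; a peak marks where the reference aligns best
plt.plot(corr)
plt.show()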
This way you can set up your test to check if you are getting the correct output.
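For the first approach (comparing frequency/amplitude sequences), a minimal sketch could look like the following. It assumes two mono blocks of equal length and uses cosine similarity of the magnitude spectra as the match test. The function names and the 0.9 threshold are placeholders, not anything prescribed by numpy or PyAudio:

```python
import numpy as np


def magnitude_spectrum(samples, sample_rate):
    """Return (frequencies_hz, magnitudes) for a mono block of samples."""
    samples = np.asarray(samples, dtype=float)
    magnitudes = np.abs(np.fft.rfft(samples))
    frequencies = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return frequencies, magnitudes


def spectra_match(expected, recorded, sample_rate, threshold=0.9):
    """Compare two equal-length blocks by cosine similarity of their spectra.

    The threshold is an arbitrary placeholder to be tuned on real recordings.
    """
    _, mag_a = magnitude_spectrum(expected, sample_rate)
    _, mag_b = magnitude_spectrum(recorded, sample_rate)
    denom = np.linalg.norm(mag_a) * np.linalg.norm(mag_b)
    if denom == 0:
        return False
    return bool(np.dot(mag_a, mag_b) / denom >= threshold)


# A block read from a PyAudio stream arrives as bytes. For 16-bit mono input
# it can be decoded like this (raw_block stands in for the stream.read() result).
raw_block = np.array([0, 1000, -1000, 0], dtype=np.int16).tobytes()
block = np.frombuffer(raw_block, dtype=np.int16)
```

Because the magnitude spectrum discards phase, this comparison tolerates small timing offsets within a block, but it does not tell you where in a long recording the sound occurs.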
I want to find the number of times a snippet of audio is repeated in another audio.
There are libraries like https://github.com/worldveil/dejavu which can create fingerprints of audio and then use them for recognition, but it only tells you whether the snippet exists in the audio or not; it does not give a count.
Is there any way to make changes to find the number of times the recorded audio repeats in the source(any audio from database)?
Thanks
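If it's an exact copy, you may want to cross-correlate said audio with the source audio you're trying to extract from (equivalently, convolve the source with the snippet reversed in time). The peaks in the result then show you the offsets at which the snippet occurs, and counting those peaks gives the number of repetitions.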
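A rough sketch of that idea, assuming both signals are mono numpy arrays at the same sample rate. It uses a normalized sliding correlation so that a single threshold works regardless of volume. The function name and the 0.8 threshold are made up for illustration:

```python
import numpy as np


def find_occurrences(source, snippet, threshold=0.8):
    """Return the offsets at which snippet (nearly) reappears in source.

    Each window of source is compared with the snippet, giving a score in
    [-1, 1]. Scores at or above the (assumed) threshold count as matches;
    matches closer than one snippet length to the previous match are
    treated as the same occurrence.
    """
    source = np.asarray(source, dtype=float)
    snippet = np.asarray(snippet, dtype=float)
    n = len(snippet)
    # Sliding dot product of the source against the snippet.
    raw = np.correlate(source, snippet, mode="valid")
    # Energy of each source window, from a cumulative sum of squares.
    csum = np.concatenate(([0.0], np.cumsum(source ** 2)))
    window_energy = np.maximum(csum[n:] - csum[:-n], 0.0)
    denom = np.sqrt(window_energy) * np.linalg.norm(snippet)
    safe = np.where(denom > 0, denom, 1.0)
    scores = np.where(denom > 0, raw / safe, 0.0)

    offsets = []
    for i in np.flatnonzero(scores >= threshold):
        if not offsets or i - offsets[-1] >= n:
            offsets.append(int(i))
    return offsets
```

The number of repetitions is then len(find_occurrences(source, snippet)). For long recordings an FFT-based correlation would be faster, and for copies that are not sample-exact (re-encoded or re-recorded audio) a fingerprinting approach is likely to hold up better.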
I'm interested in migrating from psychtoolbox to shady for my stimulus presentation. I looked through the online docs, but it is not very clear to me how to replicate what I'm currently doing in matlab in shady.
What I do is actually very simple. For each trial,
I load from disk a single image (I do luminance linearization off-line), which contains all the frames I plan to display in that trial (the stimulus is 1000x1000 px, and I present 25 frames, hence the image is 5000x5000px. I only use BW images, so I have a single int8 value per pixel).
I transfer the entire image from the CPU to the GPU
At some point (externally controlled) I copy the first frame to the video buffer and present it
At some other point (externally controlled) I trigger the presentation of the remaining 24 frames (copying the relevant part of the image to video buffer for each video frame, and then calling flip()).
The external control happens by having another machine communicate with the stimulus presentation code over TCP/IP. After the control PC sends a command to the presentation PC and this is executed, the presentation PC needs to send back an acknowledgement message to the control PC. I need to send three ACK messages, one when the first frame appears on screen, one when the 2nd frame appears on screen, and one when the 25th frame appears on screen (this way the control PC can easily verify if a frame has been dropped).
In matlab I do this by calling the blocking method flip() to present a frame, and when it returns I send the ACK to the control PC.
That's it. How would I do that in shady? Is there an example that I should look at?
The places to look for this information are the docstrings of Shady.Stimulus and Shady.Stimulus.LoadTexture, as well as the included example script animated-textures.py.
Like most things Python, there are multiple ways to do what you want. Here's how I would do it:
w = Shady.World()
s = w.Stimulus( [frame00, frame01, frame02, ...], multipage=True )
where each frameNN is a 1000x1000-pixel numpy array (either floating-point or uint8).
Alternatively you can ask Shady to load directly from disk:
s = w.Stimulus('trial01/*.png', multipage=True)
where directory trial01 contains twenty-five 1000x1000-pixel image files, named (say) 00.png through 24.png so that they get sorted correctly. Or you could supply an explicit list of filenames.
Either way, whether you loaded from memory or from disk, the frames are all transferred to the graphics card in that call. You can then (time-critically) switch between them with:
s.page = 0 # or any number up to 24 in your case
Note that, due to our use of the multipage option, we're using the "page" animation mechanism (create one OpenGL texture per frame) instead of the default "frame" mechanism (create one 1000x25000 OpenGL texture) because the latter would exceed the maximum allowable dimensions for a single texture on many graphics cards. The distinction between these mechanisms is discussed in the docstring for the Shady.Stimulus class as well as in the aforementioned interactive demo:
python -m Shady demo animated-textures
To prepare the next trial, you might use .LoadPages() (new in Shady version 1.8.7). This loops through the existing "pages" loading new textures into the previously-used graphics-card texture buffers, and adds further pages as necessary:
s.LoadPages('trial02/*.png')
Now, you mention that your established workflow is to concatenate the frames as a single 5000x5000-pixel image. My solutions above assume that you have done the work of cutting it up again into 1000x1000-pixel frames, presumably using numpy calls (sounds like you might be doing the equivalent in Matlab at the moment). If you're going to keep saving as 5000x5000, the best way of staying in control of things might indeed be to maintain your own code for cutting it up. But it's worth mentioning that you could take the entirely different strategy of transferring it all in one go:
s = w.Stimulus('trial01_5000x5000.png', size=1000)
This loads the entire pre-prepared 5000x5000 image from disk (or again from memory, if you want to pass a 5000x5000 numpy array instead of a filename) into a single texture in the graphics card's memory. However, because of the size specification, the Stimulus will only show the lower-left 1000x1000-pixel portion of the array. You can then switch "frames" by shifting the carrier relative to the envelope. For example, if you were to say:
s.carrierTranslation = [-1000, -2000]
then you would be looking at the frame located one "column" across and two "rows" up in your 5x5 array.
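To step through all 25 cells this way, the translation for each one has to be computed from its position in the grid. A small sketch, where the helper name is hypothetical and the numbering (row by row, starting at the lower-left cell) is an assumption about how the composite image was laid out:

```python
def carrier_translation_for_frame(frame_index, frame_size=1000, columns=5):
    """Return the [x, y] carrier shift that brings one grid cell into view.

    Assumes frames are numbered row by row, starting at the lower-left
    cell of the composite image, so frame 0 needs no shift.
    """
    row, column = divmod(frame_index, columns)
    return [-column * frame_size, -row * frame_size]


# Frame 11 sits one column across and two rows up in a 5x5 grid, so this
# gives the same [-1000, -2000] as the example above:
#     s.carrierTranslation = carrier_translation_for_frame(11)
```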
As a final note, remember that you could take advantage of Shady's on-the-fly gamma-correction and dithering. They're happening anyway unless you explicitly disable them, though of course they have no physical effect if you leave the stimulus .gamma at 1.0 and use integer pixel values. So you could generate your stimuli as separate 1000x1000 arrays, each containing unlinearized floating-point values in the range [0.0,1.0], and let Shady worry about everything beyond that.
I have a song and I'd like to use Python to analyze it.
I need to find the "major sounds" in the song.
I use this term because I don't know the technical term for it, but here is what I mean:
https://www.youtube.com/watch?v=TYYyMu3pzL4
If you play only the first second of the song, I count about 4 major sounds.
In general, these are the same sounds that a person would hum if they were humming the song.
What are these called? And is there a function in librosa (or any other library/programming language) that can help me pinpoint their occurrence in a song?
I can provide more info/examples as needed.
UPDATE: After doing more research, I believe I am looking for what is called the "strongest beats". Librosa already has a beat_track function, but I think this gives you every single thing that can be called a beat in the song. I don't really want every beat, just the ones that stand out the most. The over-arching goal here is to create a music video where the major action happening on the screen lines up perfectly with the strongest beats. This creates a synergistic effect within the video - everything feels connected.
You would do well to call the process of parsing audio to identify its sonic archetypes "acoustic fingerprinting".
Audio has a time dimension, so to witness your "major sounds" requires listening to the audio for a period of time, across a succession of instantaneous audio samples. Audio can be thought of as a time series curve where for each instant in time you record the height of the audio curve digitized into PCM format. It takes wall clock time to hear a given "major sound". Here your audio is in its natural state in the time domain. However, the information load of a stretch of audio can be transformed into its frequency domain counterpart by feeding a window of audio samples into an FFT API call (to take its Fourier transform).
A powerfully subtle aspect of taking the FFT is that it removes the dimension of time from the input data and replaces it with a distillation while retaining the input information load. As an aside, if the audio is periodic, then once it has been transformed from the time domain into its frequency domain representation by applying a Fourier transform, it can be reconstituted back into the identical time domain audio curve by applying an inverse Fourier transform. The data which began life as a curve which wobbles up and down over time is now cast as a spread of frequencies, each with an intensity and phase offset, yet critically without any notion of time. Now you have the luxury to pluck from this static array of frequencies a set of attributes which can be represented by a mundane struct data structure and yet is imbued with its underlying temporal origins.
Here is where you can find your "major sounds". To a first approximation you simply stow the top X frequencies along with their intensity values, and this is a measure of a given stretch of time of your input audio captured as its "major sound". Once you have a collection of "major sounds" you can use it to identify when any subsequent audio contains an occurrence of a "major sound" by performing a difference match test between your pre-stored set of "major sounds" and the FFT of the current window of audio samples. You have found a match when there is little or no difference between the frequency intensity values of each of those top X frequencies of the current FFT result compared against each pre-stored "major sound".
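To make the "top X frequencies" idea concrete, here is a rough sketch assuming mono windows of a fixed length. It keeps only local spectral peaks so a single tone does not occupy several slots. The function names and the 20% tolerance are my own placeholders:

```python
import numpy as np


def _magnitudes(window):
    """Magnitude spectrum of a Hann-tapered window of samples."""
    window = np.asarray(window, dtype=float)
    return np.abs(np.fft.rfft(window * np.hanning(len(window))))


def top_frequencies(window, sample_rate, top_x=3):
    """Return the top_x strongest spectral peaks as (frequency_hz, magnitude)."""
    mags = _magnitudes(window)
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    # Keep only local maxima so a single tone does not fill several slots.
    is_peak = np.zeros(len(mags), dtype=bool)
    is_peak[1:-1] = (mags[1:-1] > mags[:-2]) & (mags[1:-1] >= mags[2:])
    peak_bins = np.flatnonzero(is_peak)
    strongest = peak_bins[np.argsort(mags[peak_bins])[::-1][:top_x]]
    return [(float(freqs[i]), float(mags[i])) for i in strongest]


def matches(stored, window, sample_rate, tolerance=0.2):
    """True if the window's magnitude at every stored frequency is within
    tolerance (a fraction of the stored magnitude, an assumed value)."""
    mags = _magnitudes(window)
    for freq, stored_mag in stored:
        bin_index = int(round(freq * len(window) / sample_rate))
        if abs(mags[bin_index] - stored_mag) > tolerance * stored_mag:
            return False
    return True
```

The window passed to matches() has to be the same length as the one used to build the stored list, otherwise the magnitudes are not comparable. Since raw magnitudes also scale with volume, you would probably want to normalize the windows first.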
I could digress by explaining how by sitting down and playing the piano you are performing the inverse Fourier Transform of those little white and black frequency keys, or by saying the muddied wagon tracks across a spring rain swollen pasture is the Fourier Transform of all those untold numbers of heavily laden market wagons as they trundle forward leaving behind an ever deepening track imprinted with each wagon's axle width, but I won't.
Here are some links to audio fingerprinting
Audio fingerprinting and recognition in Python
https://github.com/worldveil/dejavu
Audio Fingerprinting with Python and Numpy http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
Shazam-like acoustic fingerprinting of continuous audio streams (github.com) https://news.ycombinator.com/item?id=15809291
https://github.com/dest4/stream-audio-fingerprint
Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints. https://github.com/adblockradio/stream-audio-fingerprint
https://stackoverflow.com/questions/26357841/audio-matching-audio-fingerprinting
I'm working on a project where I need to know the amplitude of sound coming in from a microphone on a computer.
I'm currently using Python with the Snack Sound Toolkit and I can record audio coming in from the microphone, but I need to know how loud that audio is. I could save the recording to a file and use another toolkit to read in the amplitude at given points in time from the audio file, or try and get the amplitude while the audio is coming in (which could be more error prone).
Are there any libraries or sample code that can help me out with this? I've been looking and so far the Snack Sound Toolkit seems to be my best hope, yet there doesn't seem to be a way to get direct access to amplitude.
Looking at the Snack Sound Toolkit examples, there seems to be a dBPowerSpectrum function.
From the reference:
dBPowerSpectrum ( )
Computes the log FFT power spectrum of the sound (at the sample number given in the start option) and returns a list of dB values. See the section item for a description of the rest of the options. Optionally an ending point can be given, using the end option. In this case the result is the average of consecutive FFTs in the specified range. Their default spacing is taken from the fftlength but this can be changed using the skip option, which tells how many points to move the FFT window each step. Options:
EDIT: I am assuming that when you say amplitude, you mean how "loud" the sound appears to a human, and not the time domain voltage (which would probably average to 0 over the entire length, since the integral of sine waves is going to be 0; e.g. 10 * sin(t) is louder than 5 * sin(t), but their average value over time is 0. You do not want to send non-AC voltages to a speaker anyway).
To get how loud the sound is, you will need to determine the amplitudes of each frequency component. This is done with a Fourier Transform (FFT), which breaks down the sound into its frequency components. The dBPowerSpectrum function seems to give you a list of the magnitudes (forgive me if this differs from the exact definition of a power spectrum) of each frequency. To get the total volume, you can just sum the entire list (which will be close, except it still might be different from perceived loudness, since the human ear has a frequency response itself).
I disagree completely with this "answer" from CookieOfFortune.
Granted, the question is poorly phrased, but this answer makes things much more complex than necessary. I am assuming that by 'amplitude' you mean perceived loudness, since technically each sample in the (PCM) audio stream represents the amplitude of the signal at a given time slice. To get a loudness representation, try a simple RMS calculation:
RMS
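A minimal sketch of such an RMS calculation, assuming the recorded block is 16-bit little-endian mono PCM (the function names are made up for illustration and are not part of Snack):

```python
import math
import struct


def rms_16bit(raw_bytes):
    """RMS level of 16-bit little-endian mono PCM bytes, as a fraction of full scale."""
    count = len(raw_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, raw_bytes[:count * 2])
    mean_square = sum(s * s for s in samples) / count
    return math.sqrt(mean_square) / 32768.0


def to_dbfs(rms):
    """Convert a full-scale fraction to decibels relative to full scale."""
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")
```

Calling rms_16bit() on each block as it comes in from the microphone gives a running loudness estimate without any FFT. The result is a signal level, not a perceptual loudness measure.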
I'm not sure if this will help, but
skimpygimpy
provides facilities for parsing WAVE files into python
sequences and back -- you could potentially use this
to examine the wave form samples directly and do
what you like. You will have to read some source,
these subcomponents are not documented.