Pitch tone analysis of wav singing file - python

I'm analyzing 10 seconds of singing and wanted to know what the predominant frequencies were over the 10 seconds to see if the person is hitting them properly. For example, if they were to "sing" do-re-mi-fa-so-la-si-do, I'd like to know at what frequency they hit each note and how closely that approximates the theoretical frequency of the note (do = 261 Hz, re = 293 Hz, etc.).
I've tried working with the wave module in Python, reading the WAV file into a numpy array, slicing it into 0.1-second slices (the WAV is sampled at 44 kHz), and then calculating the predominant frequency in each time slice.
Is this the correct way of doing this?
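A minimal sketch of that approach, assuming a mono 16-bit WAV named singing.wav (the file name, the Hann window, and the window length are illustrative choices, not part of the original question):

import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read("singing.wav")        # e.g. rate == 44100
if data.ndim > 1:                               # mix stereo down to mono
    data = data.mean(axis=1)

slice_len = int(0.1 * rate)                     # 0.1-second analysis windows
for i in range(0, len(data) - slice_len, slice_len):
    window = data[i:i + slice_len] * np.hanning(slice_len)
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(slice_len, d=1.0 / rate)
    peak = freqs[np.argmax(spectrum)]           # predominant frequency of this slice
    print(f"{i / rate:5.2f} s  ->  {peak:7.1f} Hz")

Note that a 0.1-second window gives FFT bins roughly 10 Hz apart, so adjacent notes around middle C (261 Hz vs. 293 Hz) are only just resolvable; for finer pitch estimates, longer windows or a dedicated pitch tracker (for example librosa's pyin) are commonly used.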

Related

How to find time of high frequency noise in an audio wav / raw file using Python?

I have an audio clip, and I want to detect when a certain (high-pitched) noise occurs. I don't know anything about FFT; how do I find the audio frame at which the noise occurs (I was thinking of some kind of frequency trigger)?
Since you already have scipy there: you can compute a spectrogram to get frequencies over time for your sample (which you can load with wavfile.read).
You can then plot the spectrogram and figure out what the frequency you're looking for is.
Then either:
loop over the spectrogram data itself to find the times where there is a suitably strong signal at that frequency (sketched below), or
filter your original signal first (a bandpass filter would be best) and then find the overall intense bits.
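A rough sketch of the first option, assuming a file named clip.wav and an arbitrary 5-6 kHz band for the "high pitch" noise (both are placeholders):

import numpy as np
from scipy.io import wavfile
from scipy import signal

rate, audio = wavfile.read("clip.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

f, t, Sxx = signal.spectrogram(audio, fs=rate, nperseg=1024)

# Energy inside the frequency band we care about, per time frame.
band = (f >= 5000) & (f <= 6000)
band_energy = Sxx[band, :].sum(axis=0)

# Flag frames whose band energy stands well above the typical level.
threshold = band_energy.mean() + 3 * band_energy.std()
hit_times = t[band_energy > threshold]
print("noise detected around (s):", hit_times)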

How to plot a very large audio file quickly and save the figure to disk?

I have an audio file sampled at 44 kHz and it contains a few hours of recording. I would like to view the raw waveform in a plot (figure) with something like matplotlib (or GR in Julia) and then save the figure to disk. Currently this takes a considerable amount of time, and I would like to reduce that time.
What are some common strategies for doing so? Are there any special considerations when reducing the number of points in the figure? I expect that some type of subsampling of the time points will be needed and that some interpolation or smoothing will be used. (Python or Julia solutions would be ideal, but other languages like R or MATLAB are similar enough to understand the approach.)
Assuming that your audio file has a sample rate of 44 kHz (close to the most common sampling rate of 44.1 kHz), there are 60*60*44_000 = 158,400,000 samples per hour. Compare that to a high-resolution screen, which is ~4000 pixels wide (4K resolution). If you were to print the time series on a 600 dpi printer with one sample per dot, one hour would be 60*60*44_000 / 600 = 264,000 inches, or roughly 6,700 meters long, if every sample were resolved. (So please don't print this :-))
Instead, have a look at the PyPlot.jl functions psd (power spectral density) and specgram (spectrogram), which are often used to visualize the frequencies present in an audio recording.
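For the raw waveform view the question asks about, one common strategy is min/max decimation: reduce each block of samples to its minimum and maximum before plotting, so the figure keeps the visual envelope of hours of audio while drawing only a few thousand points. A minimal Python/matplotlib sketch (the file name, block size, and figure settings are arbitrary choices):

import numpy as np
import matplotlib
matplotlib.use("Agg")                      # render off-screen, straight to a file
import matplotlib.pyplot as plt
from scipy.io import wavfile

rate, audio = wavfile.read("long_recording.wav")
if audio.ndim > 1:
    audio = audio[:, 0]

target_points = 4000                       # roughly one min/max pair per screen pixel
block = max(1, len(audio) // target_points)
trimmed = audio[: (len(audio) // block) * block].reshape(-1, block)
t = np.arange(trimmed.shape[0]) * block / rate

fig, ax = plt.subplots(figsize=(12, 3))
ax.fill_between(t, trimmed.min(axis=1), trimmed.max(axis=1), color="steelblue")
ax.set_xlabel("time (s)")
fig.savefig("waveform.png", dpi=150)

The psd/specgram route suggested above maps directly onto matplotlib as plt.psd(audio, Fs=rate) and plt.specgram(audio, Fs=rate).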

Python audio analysis: find real time values of the strongest beat in each meter

I have a song and I'd like to use Python to analyze it.
I need to find the "major sounds" in the song.
I use this term because I don't know the technical term for it, but here is what I mean:
https://www.youtube.com/watch?v=TYYyMu3pzL4
If you play only the first second of the song, I count about 4 major sounds.
In general, these are the same sounds that a person would hum if they were humming the song.
What are these called? And is there a function in librosa (or any other library/programming language) that can help me pinpoint their occurrence in a song?
I can provide more info/examples as needed.
UPDATE: After doing more research, I believe I am looking for what is called the "strongest beats". Librosa already has a beat_track function, but I think this gives you every single thing that can be called a beat in the song. I don't really want every beat, just the ones that stand out the most. The over-arching goal here is to create a music video where the major action happening on the screen lines up perfectly with the strongest beats. This creates a synergistic effect within the video - everything feels connected.
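For the beat_track route mentioned in the update, here is a hedged sketch of one way to keep only the most prominent beats, ranking librosa's detected beats by the onset-strength envelope; the file name and the 75th-percentile cut-off are placeholders, not recommended settings:

import numpy as np
import librosa

y, sr = librosa.load("song.mp3")
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)

# Rank every detected beat by how strong the onset envelope is at that frame,
# then keep only the beats above the chosen percentile.
strengths = onset_env[beat_frames]
keep = strengths > np.percentile(strengths, 75)
strong_beat_times = librosa.frames_to_time(beat_frames[keep], sr=sr)
print(strong_beat_times)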
The process of parsing audio to identify its sonic archetypes is commonly called acoustic fingerprinting.
Audio has a time dimension, so witnessing your "major sounds" requires listening to the audio for a period of time ... across a succession of instantaneous audio samples. Audio can be thought of as a time-series curve where, for each instant in time, you record the height of the audio curve digitized into PCM format. It takes wall-clock time to hear a given "major sound". Here your audio is in its natural state, the time domain. However, the information content of a stretch of audio can be transformed into its frequency-domain counterpart by feeding a window of audio samples into an FFT API call (to take its Fourier transform).
A powerfully subtle aspect of taking the FFT is that it removes the dimension of time from the input data and replaces it with a distillation, while retaining the input information. As an aside, if the audio is periodic, then once it has been transformed from the time domain into its frequency-domain representation by applying a Fourier transform, it can be reconstituted back into the identical time-domain audio curve by applying an inverse Fourier transform. The data, which began life as a curve that wobbles up and down over time, is now cast as a spread of frequencies, each with an intensity and phase offset, yet critically without any notion of time. Now you have the luxury of plucking from this static array of frequencies a set of attributes which can be represented by a mundane struct data structure, yet is imbued with its underlying temporal origins.
Here is where you can find your "major sounds". To a first approximation, you simply stow the top X frequencies along with their intensity values, and this becomes the measure of a given stretch of your input audio, captured as its "major sound". Once you have a collection of "major sounds", you can identify when any subsequent audio contains an occurrence of one by performing a difference-match test between your pre-stored set of "major sounds" and the FFT of the current window of audio samples. You have found a match when there is little or no difference between the frequency intensity values of each of the top X frequencies of the current FFT result and those of a pre-stored "major sound".
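A bare-bones illustration of that "top X frequencies" idea (window size, X, and the match tolerance are arbitrary choices here, and the equality test on the dominant bins is deliberately crude):

import numpy as np

def top_frequencies(window, rate, x=5):
    # Return the x strongest frequencies (Hz) and their magnitudes for one window.
    spectrum = np.abs(np.fft.rfft(window * np.hanning(len(window))))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / rate)
    order = np.sort(np.argsort(spectrum)[-x:])   # x largest bins, in frequency order
    return freqs[order], spectrum[order]

def matches(stored, candidate, tol=0.25):
    # Difference test: same dominant bins, and intensities within a relative tolerance.
    (f1, m1), (f2, m2) = stored, candidate
    if not np.array_equal(f1, f2):
        return False
    return bool(np.all(np.abs(m1 - m2) / (m1 + 1e-9) < tol))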
I could digress by explaining how by sitting down and playing the piano you are performing the inverse Fourier Transform of those little white and black frequency keys, or by saying the muddied wagon tracks across a spring rain swollen pasture is the Fourier Transform of all those untold numbers of heavily laden market wagons as they trundle forward leaving behind an ever deepening track imprinted with each wagon's axle width, but I won't.
Here are some links on audio fingerprinting:
Audio fingerprinting and recognition in Python: https://github.com/worldveil/dejavu
Audio Fingerprinting with Python and Numpy: http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
Shazam-like acoustic fingerprinting of continuous audio streams: https://news.ycombinator.com/item?id=15809291 and https://github.com/dest4/stream-audio-fingerprint
Audio landmark fingerprinting as a Node Stream module - Node.js code that converts a PCM audio signal into a series of audio fingerprints: https://github.com/adblockradio/stream-audio-fingerprint
Audio matching (audio fingerprinting) on Stack Overflow: https://stackoverflow.com/questions/26357841/audio-matching-audio-fingerprinting

wav file generated from numpy is not audible (complete silence)

I have a numpy array that represents audio data (dtype is np.int16). Here is a plot of the audio data (me saying "one, two"):
The sampling rate is 100 Hz. I saved this array into a WAV file. However, the WAV file is not audible in other music players (iTunes, VLC, Audacity, etc.); it's just complete silence.
Here is how I saved the array:
scipy.io.wavfile.write('output.wav', 100, waveform)  # 'waveform' is the numpy array
I am wondering what the cause could be:
Is the sampling rate too low?
Is the amplitude not enough? I tried to normalize to the range -32767 to 32767, but there is still no sound.
Any help is appreciated.
PS:
This is how the file looks in Audacity (I'm not very familiar with this software):
With a sampling frequency of 100 Hz, the highest frequency you can represent is 50 Hz (the Nyquist limit).
The range of human hearing is from about 20 Hz to about 20,000 Hz.
For "telephone quality" you need 8000 Hz, and for "CD quality" you need 44100 Hz (the standard sampling frequency for consumer audio).

Get the amplitude at a given time within a sound file?

I'm working on a project where I need to know the amplitude of sound coming in from a microphone on a computer.
I'm currently using Python with the Snack Sound Toolkit and I can record audio coming in from the microphone, but I need to know how loud that audio is. I could save the recording to a file and use another toolkit to read in the amplitude at given points in time from the audio file, or try and get the amplitude while the audio is coming in (which could be more error prone).
Are there any libraries or sample code that can help me out with this? I've been looking and so far the Snack Sound Toolkit seems to be my best hope, yet there doesn't seem to be a way to get direct access to amplitude.
Looking at the Snack Sound Toolkit examples, there seems to be a dbPowerSpectrum function.
From the reference:
dBPowerSpectrum ( )
Computes the log FFT power spectrum of the sound (at the sample number given in the start option) and returns a list of dB values. See the section item for a description of the rest of the options. Optionally an ending point can be given, using the end option. In this case the result is the average of consecutive FFTs in the specified range. Their default spacing is taken from the fftlength but this can be changed using the skip option, which tells how many points to move the FFT window each step. Options:
EDIT: I am assuming that when you say amplitude, you mean how "loud" the sound appears to a human, and not the time-domain voltage (which would probably average to 0 over the entire length, since the integral of a sine wave is 0; e.g. 10 * sin(t) is louder than 5 * sin(t), but their average value over time is 0. You do not want to send non-AC voltages to a speaker anyway).
To get how loud the sound is, you will need to determine the amplitudes of each frequency component. This is done with a Fourier transform (FFT), which breaks down the sound into its frequency components. The dBPowerSpectrum function seems to give you a list of the magnitudes (forgive me if this differs from the exact definition of a power spectrum) of each frequency. To get the total volume, you can just sum the entire list (which will be close, except it might still differ from perceived loudness, since the human ear has a frequency response of its own).
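As a small numpy sketch of what this answer describes, without any Snack calls (the test signal is made up and no ear-response weighting is applied):

import numpy as np

rate = 8000
t = np.arange(rate) / rate                     # one second of samples
window = 0.8 * np.sin(2 * np.pi * 100 * t) + 0.2 * np.sin(2 * np.pi * 10 * t)

magnitudes = np.abs(np.fft.rfft(window))       # magnitude of each frequency component
loudness_estimate = magnitudes.sum()           # crude "total volume" figure
print(loudness_estimate)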
I disagree completely with this "answer" from CookieOfFortune.
Granted, the question is poorly phrased... but this answer is making things much more complex than necessary. I am assuming that by 'amplitude' you mean perceived loudness. Technically, each sample in the (PCM) audio stream represents an amplitude of the signal at a given time slice. To get a loudness representation, try a simple RMS calculation:
RMS = sqrt((x_1^2 + x_2^2 + ... + x_n^2) / n)
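A minimal RMS sketch over PCM samples, assuming int16 data in recording.wav and an arbitrary 50 ms window (both are assumptions, not part of the original answer):

import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("recording.wav")
if samples.ndim > 1:
    samples = samples.mean(axis=1)
samples = samples.astype(np.float64)
win = int(0.05 * rate)                         # 50 ms windows

for i in range(0, len(samples) - win, win):
    chunk = samples[i:i + win]
    rms = np.sqrt(np.mean(chunk ** 2))         # root mean square of the window
    print(f"{i / rate:6.2f} s  RMS = {rms:.1f}")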
I'm not sure if this will help, but skimpygimpy provides facilities for parsing WAVE files into Python sequences and back -- you could potentially use this to examine the waveform samples directly and do what you like. You will have to read some source; these subcomponents are not documented.
