How to properly use pitch_shift (librosa)? - python

I am trying to use librosa and its pitch_shift function.
I recorded some of my voice and used this code:
sampling_rate = 44100
y, sr = librosa.load(directory, sr=sampling_rate)  # y is a numpy array of the wav file, sr = sample rate
y_shifted = librosa.effects.pitch_shift(y, sr, n_steps=4, bins_per_octave=24)  # 4 steps of a quarter tone each (bins_per_octave=24), i.e. shifted up 2 semitones
librosa.output.write_wav(directory, y_shifted, sr=sampling_rate, norm=False)  # note: librosa.output was removed in librosa 0.8; soundfile.write is the usual replacement
It works fine - almost.
I hear some noise in my new voice (after the pitch shifting).
Is there something else I need to do?
Without shift:
https://vocaroo.com/i/s1qEEDvzcUHN
With shift (n_steps = 4):
https://vocaroo.com/i/s0cOiC0cFJSB

Pitch-shifting typically involves an STFT, a shift of the magnitude spectrum along the frequency axis, and then signal reconstruction via the Griffin-Lim algorithm (see the Quora explanation of how Griffin-Lim works).
The problem is that when we shift the magnitude spectrum, we do just that and ignore the phase! Griffin-Lim tries to find a reasonable solution for the correct phase when reconstructing the time-domain signal, but it is often just that: a reasonable solution, not a perfect one. That is why you hear this metallic twang: the phases of your signal are not quite right (also called "phasiness").
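To hear that phasiness in isolation, independent of any pitch shift, you can rebuild a recording from its magnitude spectrogram alone. A minimal sketch, with 'voice.wav' standing in for any recording:
import librosa

y, sr = librosa.load('voice.wav', sr=None)   # 'voice.wav' is a placeholder
magnitude = abs(librosa.stft(y))             # the phase information is discarded here
y_rebuilt = librosa.griffinlim(magnitude)    # iterative phase estimation
# y_rebuilt has the metallic "phasiness" described above, even though
# no pitch shift was applied at all.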
I believe your function call to librosa is perfectly alright. It may just not be the greatest implementation on earth. Give PyRubberband a try. It's based on Rubberband (a C++ library) and has a good reputation.
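A minimal sketch of the PyRubberband route (it assumes the Rubber Band command-line tool and the pyrubberband package are installed; the file names are placeholders):
import librosa
import pyrubberband as pyrb
import soundfile as sf

y, sr = librosa.load('input.wav', sr=44100)
y_shifted = pyrb.pitch_shift(y, sr, n_steps=4)   # shift up by 4 semitones
sf.write('output_shifted.wav', y_shifted, sr)    # librosa.output.write_wav was removed in librosa 0.8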

Related

How do I find amplitude of wav file in python?

I am working on WAV file analysis using the librosa library in Python. I used librosa.load() to load the audio file. Apparently this function loads the WAV file into a numpy array with normalised amplitude values in the range -1 to 1. But I need the actual amplitude values for processing. How can I find them?
Thanks in advance!
You observed correctly that librosa always normalizes the samples to mono in [-1, 1] (and also resamples to 22050 Hz by default). That said, it's digital audio, so you can multiply by whatever you want to get a different scale. If you insist that your samples are on a scale of -2^15 to 2^15, simply multiply by 2^15; it means essentially the same thing.
You won't gain anything, except dragging a peculiarity of the encoding format into your data.
That said, if that's what you want, you could use PySoundFile like this:
import soundfile as sf
y, sr = sf.read('existing_file.wav', dtype='int16')
The parameter dtype='int16' tells the library to assume a signed 16-bit format per sample.
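As a quick sanity check of the scaling described above (assuming a mono, 16-bit 'existing_file.wav'), the librosa floats multiplied by 2^15 should match the raw int16 samples:
import numpy as np
import librosa
import soundfile as sf

y_float, sr = librosa.load('existing_file.wav', sr=None)  # floats in [-1, 1]
y_int16, _ = sf.read('existing_file.wav', dtype='int16')  # raw signed 16-bit samples

print(np.allclose(y_float * 2**15, y_int16))  # same data, just a different scale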
You can't. As Hendrik mentioned, the signal is digital, and the amplitude in the WAV file won't tell you anything about the actual sound-wave amplitude / sound power. That information was lost the moment the sound was digitised to WAV.
That being said, you can compute e.g. loudness, a relative perception of the sound power. If you are dealing with the human auditory system, one of the recommended approaches is to:
Use the Bark scale (the Bark scale better reflects how we hear).
Compute energy in each bin.
(Optional) Normalise by the overall sum.
If you don't want to compute it yourself, check out e.g. YAAFE. A rough sketch of the recipe is shown below.
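A rough sketch of that recipe, using Traunmüller's Hz-to-Bark approximation; the function name and the choice of 25 bands are mine, not from any library:
import numpy as np

def bark_band_energies(y, sr, n_fft=2048):
    # Power spectrum of one analysis frame
    power = np.abs(np.fft.rfft(y[:n_fft], n=n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Hz -> Bark (Traunmueller approximation), clipped to ~25 critical bands
    bark = 26.81 * freqs / (1960.0 + freqs) - 0.53
    bands = np.clip(bark, 0, 24).astype(int)
    # Energy in each Bark band
    energies = np.zeros(25)
    np.add.at(energies, bands, power)
    # Optional: normalise by the overall sum
    return energies / energies.sum()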

How to transpose part of sound by librosa

For example,
import numpy as np
import librosa

y, sr = librosa.load("sound.wav", sr=44100, mono=True)
half = y.shape[0] // 2          # y.shape is a tuple, so index it before dividing
y1 = y[:half]
y2 = y[half:]
y_pit = librosa.effects.pitch_shift(y2, sr, n_steps=24)  # shift the second half up two octaves
y = np.concatenate([y1, y_pit])
This code loads sound.wav, pitch-shifts only the second half, and then joins the two halves back into one signal.
Now, what I want to do goes a bit further.
I would like to pitch-shift only around a specific frequency, such as 440 Hz = A.
For example:
if I have a sound (A C E) = Am chord,
I want to pitch-shift only around the A to make (G C E).
Where should I start? And is librosa.effects.pitch_shift useful for this purpose?
This is not possible with a pitch shifter. A pitch shifter simply changes the frequencies by speeding the sound up or slowing it down (like varispeed), then cutting out small slices if the resulting sound is longer or, on the contrary, duplicating small slices if it is shorter. As you can imagine, this process treats the whole wave as a single thing, meaning that the spectrum is transposed in its entirety.
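A minimal sketch of that varispeed-then-stretch idea, written with librosa's resample and time_stretch purely for illustration; it shows why every frequency in the spectrum moves together:
import librosa

def naive_pitch_shift(y, sr, n_steps):
    rate = 2.0 ** (n_steps / 12.0)
    # Varispeed: fewer samples played back at the same rate -> faster and higher,
    # and *every* frequency is multiplied by `rate`.
    sped_up = librosa.resample(y, orig_sr=sr, target_sr=int(sr / rate))
    # Stretch back to (roughly) the original duration without changing pitch again.
    return librosa.effects.time_stretch(sped_up, rate=1.0 / rate)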
Doing what you want requires a much more sophisticated technique called resynthesis, which first converts the wave into a synthetic sound using FFT and additive synthesis (or other techniques more appropriate when the sound is noisy), then allows manipulation of independent parts of the spectrum, and finally converts the synthetic sound back into an audio wave. There is standalone software that does this quite well, called Spear. You could also investigate Loris, which seems to have a Python module.

Simulation Lorentzian function with FFT

I need to compute the Fourier transform of a Lorentzian function and look at it on a log (ln) scale.
I know that the Fourier transform of a Lorentzian should be exp(-pi|k|), which seems right.
On a log scale it is therefore supposed to be linear, with no oscillation at all.
However, there is oscillation, and I am totally lost.
Here is my code:
import numpy as np
from scipy import fft
import matplotlib.pyplot as plt
a = 1
N = 500
x = np.linspace(-5, 5, N)
lorentz = (a / np.pi) * (1 / (a**2 + x**2))  # normalized Lorentzian
fourier = fft.fft(lorentz)

fig, ax1 = plt.subplots(nrows=1, ncols=1)
ax1.loglog(abs(fourier[0:N // 2]), basey=np.e)  # note: basey was removed in newer matplotlib; use ax1.set_yscale('log', base=np.e) there
ax1.grid(True)
plt.show()
How could I solve the problem?
Following the comments:
Here is the result with
x = np.linspace(-20,20,N)
It seems to postpone the oscillation, but it is still there.
After adding a Hamming window:
the Hamming window also only postpones it.
I tried extending the range to
x = np.linspace(-60,60,N)
That seems correct (related to a, the wider range, and the point spacing), but I'm curious about what actually happened.
Pascal's remark helps to explain it. At first glance, I felt the oscillating distortion you show was related to the analysis window boundaries. When your signal is not zero on either side of the window, the FFT analysis sees a step, which results in "butterfly" artefacts that have nothing to do with your input signal. A Hamming window (a raised cosine) can solve that, but if the Hamming window has to do too much smoothing at the edges, you are analysing a signal you don't have!
Nice to see that the tip to enlarge the analysis window worked in this case: once you take a larger range around zero, you get the expected result for a Lorentzian function. My science expertise is too limited to actually understand why this particular spectrum is the correct result.
An attempt to explain why the 2x Nyquist requirement is relevant: you are using signal analysis tools in the real domain (not complex input). For FFT analysis of real samples, the analysis window should accommodate two periods of the lowest frequency you are interested in. You are investigating an impulse response, so its "period" shape will be the only one you are interested in. By taking a larger interval around 0 into account, you have put your impulse response in the middle, where the Hamming window is around 1. When your analysis window is wide enough and a Hamming window is applied, the FFT sees proper input (zero on either side!) and yields proper, smooth output, as if you were analysing a very low frequency periodic signal.
My experience with FFT tools is in the field of speech research. In a speech sample there is a pitch, the lowest frequency or "f0" of the speaker; for males, a typical pitch is around 100 Hz. With a sampling frequency of 20 kHz, a single pitch period requires 200 samples, and two pitch periods require 400 samples. I prefer setting the FFT order instead of the analysis window length: an order-9 FFT is 512 samples in your window and yields 256 frequencies; order 10 yields 512 frequencies and requires 1024 samples, and so on. The Hamming window I use in my spectrum tool is always the full window, not a truncated one.
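Putting that advice together, a small sketch: take a window wide enough that the Lorentzian has essentially decayed to zero at the edges, optionally apply a Hamming window, and the log-magnitude spectrum comes out as the expected straight line (the values here are only for illustration):
import numpy as np
from scipy import fft
import matplotlib.pyplot as plt

a = 1
N = 500
x = np.linspace(-60, 60, N)              # wide window: ~0 at both edges
lorentz = (a / np.pi) / (a**2 + x**2)
windowed = lorentz * np.hamming(N)       # raised cosine removes the residual edge step

spectrum = np.abs(fft.fft(windowed))[: N // 2]

fig, ax = plt.subplots()
ax.semilogy(spectrum)                    # a decaying exponential is a straight line here
ax.grid(True)
plt.show()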

Followup: extracting phase information from FFT - proper use of frequency shifts and windows

This is a follow-up question to one I asked earlier, based on the chat after the answer given by @hotpaw2. I have a signal which will be a single cosine wave with a phase offset. My task is to extract (with very high accuracy) the amplitude and phase of this single frequency component.
On paper, the following relations hold, assuming a properly normalized Fourier transform T:
Unsurprisingly, there is a bit more to the DFT than simply taking the transform and picking off the relevant components. In particular, the discussion suggested that I was not entirely clear on what the phase offset is measured with respect to, and that there are significant edge effects that can destroy the accuracy of the result if the data is not properly windowed.
I have been googling around, but most of the discussion is fairly technical and light on examples, so I was hoping that someone could shed some light on things. In particular, I came across one example suggesting that instead of doing a simple transform, I should be shifting it first:
import numpy as np
import pylab as pl

f = 30.0
w = 2.0 * np.pi * f
phase = np.pi / 2
num_t = int(10 * f)
t, dt = np.linspace(0, 1, num_t, endpoint=False, retstep=True)

signal = np.cos(w * t + phase)  # + np.random.normal(0, 0.25, len(t))

amp = np.fft.fftshift(np.fft.rfft(np.fft.ifftshift(signal)))
freqs = np.fft.fftshift(np.fft.rfftfreq(t.shape[-1], dt))
index = np.where(freqs == 30)

print(index[0][0])
print(np.angle(amp)[index[0][0]])
print(np.abs(amp)[index[0][0]] * (2.0 / len(t)))

pl.subplot(211)
pl.semilogy(freqs, np.abs(amp))
pl.subplot(212)
pl.plot(freqs, np.angle(amp))
pl.show()
So, first set of questions: can someone explain the point of using fftshift, and what exactly is it doing to the data? Why does using an inverse shift, then transforming, then shifting again require a frequency set that is only shifted once, with no inverse operation? Is this approach the correct one (ignoring, for the moment, the issue of a window)?
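For reference, the purely mechanical effect of the two shift functions on an array: they rotate the halves so that the zero-index element moves between the start and the middle, and they undo each other.
import numpy as np

a = np.arange(8)
print(np.fft.fftshift(a))                      # [4 5 6 7 0 1 2 3]
print(np.fft.ifftshift(np.fft.fftshift(a)))    # [0 1 2 3 4 5 6 7]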
Second set of questions: if I window my data, presumably this will affect the amplitude, and likely the phase(?), of the result. Is there an analytical way to correct for the amplitude change for a given window shape? I can find a few tables that list correction factors, but I haven't really seen a good explanation yet.
In the linked question, it was stated that phase should be measured near the center of the hump of the window function. But because the window function is a time-domain function and I want the phase for a specific frequency, I don't quite understand what that means.
Any light that could be shed on the matter (perhaps in the form of references, since clearly I need to do some more reading) would be most appreciated.

Get the amplitude at a given time within a sound file?

I'm working on a project where I need to know the amplitude of sound coming in from a microphone on a computer.
I'm currently using Python with the Snack Sound Toolkit and I can record audio coming in from the microphone, but I need to know how loud that audio is. I could save the recording to a file and use another toolkit to read in the amplitude at given points in time from the audio file, or try and get the amplitude while the audio is coming in (which could be more error prone).
Are there any libraries or sample code that can help me out with this? I've been looking and so far the Snack Sound Toolkit seems to be my best hope, yet there doesn't seem to be a way to get direct access to amplitude.
Looking at the Snack Sound Toolkit examples, there seems to be a dbPowerSpectrum function.
From the reference:
dBPowerSpectrum ( )
Computes the log FFT power spectrum of the sound (at the sample number given in the start option) and returns a list of dB values. See the section item for a description of the rest of the options. Optionally an ending point can be given, using the end option. In this case the result is the average of consecutive FFTs in the specified range. Their default spacing is taken from the fftlength but this can be changed using the skip option, which tells how many points to move the FFT window each step. Options:
EDIT: I am assuming that when you say amplitude, you mean how "loud" the sound appears to a human, and not the time-domain voltage (which would probably average to 0 over the entire length, since the integral of a sine wave is 0; e.g. 10 * sin(t) is louder than 5 * sin(t), but their average value over time is 0, and you do not want to send non-AC voltages to a speaker anyway).
To get how loud the sound is, you will need to determine the amplitudes of each frequency component. This is done with a Fourier transform (FFT), which breaks the sound down into its frequency components. The dBPowerSpectrum function seems to give you a list of the magnitudes (forgive me if this differs from the exact definition of a power spectrum) of each frequency. To get the total volume, you can just sum the entire list (which will be close, except it still might differ from perceived loudness, since the human ear has a frequency response of its own).
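A literal sketch of that summation, using two synthetic sine waves in place of the dBPowerSpectrum output; it also illustrates the earlier point that both signals average to zero in the time domain:
import numpy as np

sr = 44100
t = np.arange(sr) / sr                      # one second of samples
loud = 10 * np.sin(2 * np.pi * 440 * t)
quiet = 5 * np.sin(2 * np.pi * 440 * t)

for name, s in (("loud", loud), ("quiet", quiet)):
    magnitudes = np.abs(np.fft.rfft(s))     # magnitude of each frequency component
    print(name, round(np.mean(s), 6), magnitudes.sum())   # mean ~0, but the sums differ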
I disagree completely with this "answer" from CookieOfFortune.
Granted, the question is poorly phrased, but this answer is making things much more complex than necessary. I am assuming that by "amplitude" you mean perceived loudness; technically, each sample in the (PCM) audio stream already represents an amplitude of the signal at a given time slice. To get a loudness representation, try a simple RMS (root mean square) calculation:
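A minimal sketch of that RMS calculation, assuming `samples` is a NumPy array of PCM samples for one short analysis block; for a running loudness estimate, apply it to consecutive 20-50 ms blocks as the audio comes in:
import numpy as np

def rms(samples):
    samples = np.asarray(samples, dtype=np.float64)
    return np.sqrt(np.mean(samples ** 2))

# Example: a full-scale sine wave has an RMS of ~0.707
t = np.arange(44100) / 44100.0
print(rms(np.sin(2 * np.pi * 440 * t)))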
I'm not sure if this will help, but skimpygimpy provides facilities for parsing WAVE files into Python sequences and back. You could potentially use this to examine the waveform samples directly and do what you like. You will have to read some source; these subcomponents are not documented.
