Calculating THD in Python

I'm trying to calculate the total harmonic distortion (THD) of the AC voltage supply. I am sampling voltage data with an Arduino at a rate of just over 8 kHz and storing the data in a text file. Then I try to calculate THD using the following Python snippet:
import numpy as np
import scipy.fftpack
from scipy.fftpack import fft
from numpy import genfromtxt

sampled_data = genfromtxt('/../file.txt', delimiter=',')
abs_yf = np.abs(fft(sampled_data))

# As far as I know, THD = sqrt(sum of squared magnitudes of
# harmonics + noise) / fundamental value (is that correct?). So I'm
# just summing the squares of all frequency data obtained from the FFT,
# taking sqrt() of the sum and dividing by the fundamental frequency value.
def thd(abs_data):
    sq_sum = 0.0
    for r in range(len(abs_data)):
        sq_sum = sq_sum + (abs_data[r])**2
    sq_harmonics = sq_sum - (max(abs_data))**2.0
    thd = 100*sq_harmonics**0.5 / max(abs_data)
    return thd

print "Total Harmonic Distortion(in percent):"
print thd(abs_yf)
The problem is that the obtained THD values vary between 5% and 25% in my case (in reality it should not be more than 5%). What am I doing wrong? Is there another way to find THD?

Though this thread has long been quiet, for anyone encountering this post like me: there are a couple of problems with the OP's method.
1) The magnitudes returned by the FFT include the 0-frequency (DC) bin, so the assumption that max(abs_data) is the magnitude corresponding to the fundamental frequency is not correct if there is any DC bias in the signal. This is a problem in the line
thd = 100*sq_harmonics**0.5 / max(abs_data)
The amplitude associated with the 0 frequency can just be ignored as a quick solution.
2) The second half of abs_data should be thrown out; it is a "mirrored" reflection of the first half. This is due to the nature of the Fourier transform of a real-valued signal.
Both these issues can be addressed by changing the input to the function, i.e by replacing
print thd(abs_yf)
with
print( thd(abs_yf[1:int(len(abs_yf)/2) ]) )
where we have changed the input to not include the first or the last N/2 elements.
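As a side note, one could also compute only the non-negative half of the spectrum directly with np.fft.rfft, which makes the slicing unnecessary. A minimal sketch, assuming sampled_data and the thd() function from the snippets above:
import numpy as np

# rfft returns the DC bin plus the positive-frequency bins only,
# so dropping index 0 removes the DC component and no mirrored half remains
abs_yf = np.abs(np.fft.rfft(sampled_data))
print(thd(abs_yf[1:]))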
The result is still not ideal because the window needs to contain an exact integer number of cycles, as the previous answers noted. Testing with a pure sine with an offset and adjusting the window shows that the method works fairly well, but fails badly if the window error is significant.
import numpy as np
from scipy.fftpack import fft

t0 = 0
tf = 0.02          # integer number of cycles
dt = 1e-4
offset = 0.5
N = int((tf-t0)/dt)
time = np.linspace(0.0, tf, N)
commandSigFreq = 100
Amplitude = 2
waveOfSin = Amplitude*np.sin(2.0*np.pi*commandSigFreq*time) + offset
abs_yf = np.abs(fft(waveOfSin))
#print("freq is" + str(scipy.fftpack.fftfreq(sampled_data, dt)))

# As far as I know, THD = sqrt(sum of squared magnitudes of
# harmonics + noise) / fundamental value (is that correct?). So I'm
# just summing the squares of all frequency data obtained from the FFT,
# taking sqrt() of the sum and dividing by the fundamental frequency value.
def thd(abs_data):
    sq_sum = 0.0
    for r in range(len(abs_data)):
        sq_sum = sq_sum + (abs_data[r])**2
    sq_harmonics = sq_sum - (max(abs_data))**2.0
    thd = 100*sq_harmonics**0.5 / max(abs_data)
    return thd

print("Total Harmonic Distortion(in percent):")
print(thd(abs_yf[1:int(len(abs_yf)/2)]))

It is quite likely that you add additional distortion through the measurement process itself.
If you compare an Arduino ADC with a high-class measurement device, the values from the Arduino will very likely be much worse. At the very least you need a very stable and jitter-free sampling clock.
Furthermore, the output of the data (I guess via UART) might interfere with the timing of the ADC measurements.


Find plateau in Numpy array

I am looking for an efficient way to detect plateaus in otherwise very noisy data. The plateaus are always relatively broad. A simple example of what this data could look like:
import numpy as np
import matplotlib.pyplot as plt

test = np.random.uniform(0.9, 1, 100)
test[10:20] = 0
plt.plot(test)
Note that there can be multiple plateaus (which should all be detected) which can have different values.
I've tried using scipy.signal.argrelextrema, but it doesn't seem to be doing what I want it to:
from scipy.signal import argrelextrema

peaks = argrelextrema(test, np.less, order=25)
plt.vlines(peaks, ymin=0, ymax=1)
I don't need the exact interval of the plateau; a rough range estimate would be enough, as long as that estimate is bigger than or equal to the actual plateau range. It should be relatively efficient, however.
There is a method scipy.signal.find_peaks that you can try. Here is an example:
import numpy
from scipy.signal import find_peaks
test = numpy.random.uniform(0.9, 1.0, 100)
test[10 : 20] = 0
peaks, peak_plateaus = find_peaks(- test, plateau_size = 1)
Although find_peaks only finds peaks, it can be used to find valleys if the array is negated. Then you do the following:
for i in range(len(peak_plateaus['plateau_sizes'])):
    if peak_plateaus['plateau_sizes'][i] > 1:
        print('a plateau of size %d is found' % peak_plateaus['plateau_sizes'][i])
        print('its left index is %d and right index is %d' % (peak_plateaus['left_edges'][i], peak_plateaus['right_edges'][i]))
it will print
a plateau of size 10 is found
its left index is 10 and right index is 19
This is really just a "dumb" machine learning task. You'll want to code a custom function to screen for them. You have two key characteristics to a plateau:
They're consecutive occurrences of the same value (or very nearly so).
The first and last points deviate strongly from a forward and backward moving average, respectively. (Try quantifying this based on the standard deviation if you expect additive noise, for geometric noise you'll have to take the magnitude of your signal into account too.)
A simple loop should then be sufficient to calculate a forward moving average, stdev of points in that forward moving average, reverse moving average, and stdev of points in that reverse moving average.
Read until you find a point well outside the regular noise (compare to variance). Start buffering those indices into a list.
Keep reading and buffering indices into that list while they have the same value (or nearly the same, if your plateaus can be a little rough; you'll want to use some tolerance plus the standard deviation of your plateaus, or just some tolerance if you expect them all to behave similarly).
If the variance of the points in your buffer gets too high, it's not a plateau, too rough; throw it out and start scanning again from your current position.
If the last value was very different from the previous (on the order of the change that triggered your code to start buffering indices) and in the opposite direction of the original impulse, cap your buffer here; you've got a plateau there.
Now do whatever you want with the points at those indices. Delete them, replace them with a linear interpolation between the two boundary points, whatever.
I could generate some noise and give you some sample code, but this is really something you're going to have to adapt to your application. (For example, there's a shortcoming in this method that a plateau which captures a point on the middle of the "cliff edge" may leave that point when it removes the rest of the plateau. If that's something you're worried about, you'll have to do a little more exploring after you ID the plateau.) You should be able to do this in a single pass over the data, but it might be wise to get some statistics on the whole set first to intelligently tweak your thresholds.
If you have an exact definition of what constitutes a plateau, you can make this a lot less hand-wavy and ML-looking, but so long as you're trying to identify a fuzzy pattern, you're going to have to take a statistics-based approach.
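For concreteness, here is a rough sketch of the buffering loop described above. It assumes additive noise, and the thresholds (jump_thresh, flat_tol, min_len) are hypothetical placeholders that would need tuning to your data:
import numpy as np

def find_flat_runs(x, jump_thresh, flat_tol, min_len=5):
    """Scan for runs of nearly-constant samples that begin with a strong
    deviation from the previous sample (thresholds are data-dependent)."""
    runs = []
    i = 1
    n = len(x)
    while i < n:
        if abs(x[i] - x[i - 1]) > jump_thresh:   # strong deviation: candidate plateau start
            start = i
            j = i + 1
            while j < n and abs(x[j] - x[start]) < flat_tol:
                j += 1                           # keep buffering while values stay flat
            if j - start >= min_len and np.std(x[start:j]) < flat_tol:
                runs.append((start, j - 1))      # accept the buffered run as a plateau
            i = j
        else:
            i += 1
    return runs

# e.g. find_flat_runs(test, jump_thresh=0.5, flat_tol=0.05) on the example signal above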
I had a similar problem and found a simple heuristic solution, shared below. I find plateaus as ranges of constant gradient of the signal; you could change the code to also check that the gradient is (close to) 0.
I apply a moving average (uniform_filter1d) to filter out noise. I also calculate the first and second derivatives of the signal numerically, so I'm not sure it meets the efficiency requirement, but it worked perfectly for my signal and might be a good starting point for others.
def find_plateaus(F, min_length=200, tolerance=0.75, smoothing=25):
    '''
    Finds plateaus of signal using second derivative of F.

    Parameters
    ----------
    F : Signal.
    min_length: Minimum length of plateau.
    tolerance: Number between 0 and 1 indicating how tolerant
        the requirement of constant slope of the plateau is.
    smoothing: Size of uniform filter 1D applied to F and its derivatives.

    Returns
    -------
    plateaus: array of plateau left and right edges pairs
    dF: (smoothed) derivative of F
    d2F: (smoothed) second derivative of F
    '''
    import numpy as np
    from scipy.ndimage.filters import uniform_filter1d

    # calculate smooth gradients
    smoothF = uniform_filter1d(F, size=smoothing)
    dF = uniform_filter1d(np.gradient(smoothF), size=smoothing)
    d2F = uniform_filter1d(np.gradient(dF), size=smoothing)

    def zero_runs(x):
        '''
        Helper function for finding sequences of 0s in a signal
        https://stackoverflow.com/questions/24885092/finding-the-consecutive-zeros-in-a-numpy-array/24892274#24892274
        '''
        iszero = np.concatenate(([0], np.equal(x, 0).view(np.int8), [0]))
        absdiff = np.abs(np.diff(iszero))
        ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
        return ranges

    # Find ranges where the second derivative is zero
    # Values under eps are assumed to be zero.
    eps = np.quantile(abs(d2F), tolerance)
    smalld2F = (abs(d2F) <= eps)

    # Find repetitions in the mask "smalld2F" (i.e. ranges where d2F is constantly zero)
    p = zero_runs(np.diff(smalld2F))

    # np.diff(p) gives the length of each range found.
    # only accept plateaus of min_length
    plateaus = p[(np.diff(p) > min_length).flatten()]

    return (plateaus, dF, d2F)
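A possible usage sketch (the synthetic signal and parameter values here are my own and purely illustrative; the detected ranges depend heavily on the smoothing and tolerance settings):
import numpy as np

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 6*np.pi, 5000)) + 0.01*rng.standard_normal(5000)
x[2000:2600] = x[2000]                      # insert a plateau of 600 samples

plateaus, dF, d2F = find_plateaus(x, min_length=200, tolerance=0.75, smoothing=25)
print(plateaus)                             # pairs of [left, right] edge indices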

Hurst Exponent in Python

from datetime import datetime
from pandas.io.data import DataReader
from numpy import cumsum, log, polyfit, sqrt, std, subtract
from numpy.random import randn

def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts"""
    # Create the range of lag values
    lags = range(2, 100)

    # Calculate the array of the variances of the lagged differences
    # Here it calculates the variances, but why does it use the
    # standard deviation and then take a square root of it?
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]

    # Use a linear fit to estimate the Hurst Exponent
    poly = polyfit(log(lags), log(tau), 1)

    # Return the Hurst exponent from the polyfit output
    return poly[0]*2.0

# Download the stock prices series from Yahoo
aapl = DataReader("AAPL", "yahoo", datetime(2012,1,1), datetime(2015,9,18))

# Call the function
hurst(aapl['Adj Close'])
From this code for estimating the Hurst Exponent: when we want to calculate the variance of the lagged differences, why do we still take the standard deviation and then a square root of it? I have been confused about this for a long time, and I don't know why others don't share the same confusion. Am I misunderstanding the math behind it? Thanks!
I was just as confused. I didn't understand where the sqrt of the std came from either, and spent three days trying to figure it out. In the end I noticed that QuantStart credits Dr Tom Starke, who uses slightly different code. Dr Tom Starke credits Dr Ernie Chan, and going to his blog I was able to find enough information to put together my own code from his principles. This version doesn't use sqrt, uses variance instead of std, and uses a 2.0 divisor at the end instead of a 2.0 multiplier. In the end it seems to give the same results as the QuantStart code you posted, but I am able to understand it from first principles, which I guess is important. I put together a Jupyter Notebook which makes it clearer, but I'm not sure if I can post that here, so I will try to explain as best I can. Code is pasted first, then an explanation.
from numpy import subtract, var, log10, polyfit

lags = range(2, 100)

def hurst_ernie_chan(p):
    variancetau = []
    tau = []

    for lag in lags:
        # Write the different lags into a vector to compute a set of tau or lags
        tau.append(lag)

        # Compute the log returns on all days, then compute the variance on the difference in log returns
        # call this pp or the price difference
        pp = subtract(p[lag:], p[:-lag])
        variancetau.append(var(pp))

    # we now have a set of tau or lags and a corresponding set of variances.
    #print tau
    #print variancetau

    # plot the log of those variances against the log of tau and get the slope
    m = polyfit(log10(tau), log10(variancetau), 1)

    hurst = m[0] / 2

    return hurst
Dr Chan doesn't give any code on this page (I believe he works in MATLAB not Python anyway). Hence I needed to put together my own code from the notes he gives in his blog and answers he gives to questions posed on his blog.
Dr Chan states that if z is the log price, then volatility, sampled at intervals of τ, is volatility(τ)=√(Var(z(t)-z(t-τ))). To me another way of describing volatility is standard deviation, so std(τ)=√(Var(z(t)-z(t-τ)))
std is just the root of variance so var(τ)=(Var(z(t)-z(t-τ)))
Dr Chan then states: In general, we can write Var(τ) ∝ τ^(2H) where H is the Hurst exponent
Hence (Var(z(t)-z(t-τ))) ∝ τ^(2H)
Taking the log of each side we get log (Var(z(t)-z(t-τ))) ∝ 2H log τ
[ log (Var(z(t)-z(t-τ))) / log τ ] / 2 ∝ H (gives the Hurst exponent) where we know the term in square brackets on far left is the slope of a log-log plot of tau and a corresponding set of variances.
If you run that function and compare the answers to the Quantstart function, they should be the same. Not sure if that helped.
All that is going on here is a variation on math notation
I'll define
d = subtract(ts[lag:], ts[:-lag])
Then it is clear that
np.log(np.std(d)**2) == np.log(np.var(d))
np.log(np.std(d)) == .5*np.log(np.var(d))
Then you have the equivalence
2*np.log(np.sqrt(np.std(d))) == 0.5*np.log(np.var(d))
The left-hand side is what the OP's code fits (and then doubles); the right-hand side is what Dr Chan's version fits (and then halves). Since the slope returned by polyfit scales proportionally with its y input, multiplying the fitted quantity by a constant factor multiplies the slope by the same factor, so the two versions return the same Hurst exponent.
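A quick numerical check of this equivalence on synthetic data (my own snippet, purely illustrative):
import numpy as np

rng = np.random.default_rng(0)
ts = np.cumsum(rng.standard_normal(10000))      # random walk, so H should be near 0.5
lags = np.arange(2, 100)

tau = [np.sqrt(np.std(ts[lag:] - ts[:-lag])) for lag in lags]   # OP's quantity
var = [np.var(ts[lag:] - ts[:-lag]) for lag in lags]            # Dr Chan's quantity

h_op = 2.0 * np.polyfit(np.log(lags), np.log(tau), 1)[0]
h_chan = 0.5 * np.polyfit(np.log(lags), np.log(var), 1)[0]
print(h_op, h_chan)    # the two estimates agree to floating-point precision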
As per intuitive definition taken from Ernest Chan's "Algorithmic trading" (p.44):
Intuitively speaking, a “stationary” price series means that the prices diffuse
from its initial value more slowly than a geometric random walk would.
one would want to check the variance of the time series at increasing lags against the lag(s). This is because for a normal distribution -- and log prices are believed to be (approximately) normal -- the variance of a sum of independent normal variables is the sum of the constituents' variances.
As per Ernest Chan's citation, for mean reverting processes the realized variance will be less than theoretically projected.
Putting this in code:
import numpy as np
from scipy.stats import linregress

def hurst(p, l):
    """
    Arguments:
        p: ndarray -- the price series to be tested
        l: list of integers or an integer -- lag(s) to test for mean reversion

    Returns:
        Hurst exponent
    """
    if isinstance(l, int):
        lags = [1, l]
    else:
        lags = l

    assert lags[-1] >= 2, "Lag in prices must be greater than or equal to 2"
    print(f"Price lags of {lags[1:]} are included")

    lp = np.log(p)
    var = [np.var(lp[l:] - lp[:-l]) for l in lags]
    hr = linregress(np.log(lags), np.log(var))[0] / 2
    return hr
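A possible usage sketch with synthetic prices (the data and values are illustrative assumptions, not from the original post; it relies on the hurst() function defined above):
import numpy as np

rng = np.random.default_rng(0)
# geometric random walk: its log prices are a Brownian motion, so H should be close to 0.5
p = np.exp(np.cumsum(0.01*rng.standard_normal(5000)))
print(hurst(p, list(range(2, 100))))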
The code posted by OP is correct.
The reason for the confusion is that the code takes a square root first, and then compensates by multiplying the slope (returned by polyfit) by 2.
For a more detailed explanation, continue reading.
tau is calculated with an "extra" square-root. Then, its log is calculated. log(sqrt(x)) = log(x^0.5) = 0.5*log(x) (this is the key).
polyfit now conducts the fitting with y multiplied by an "extra" 0.5. So the resulting slope is also multiplied by 0.5. Returning twice that (return poly[0]*2.0) cancels the initial, seemingly extra, 0.5.
Hope this makes it clearer.

Fastest way to get average value of frequencies within range [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 7 years ago.
I am new to Python as well as to signal processing. I am trying to calculate the mean value over some frequency range of a signal.
What I am trying to do is as follows:
import numpy as np

data = <my 1d signal>
lF = <lower frequency>
uF = <upper frequency>

ps = np.abs(np.fft.fft(data)) ** 2             # array of power spectrum
time_step = 1.0 / 2000.0
freqs = np.fft.fftfreq(data.size, time_step)   # array of frequencies
idx = np.argsort(freqs)                        # sorting frequencies

sum = 0
c = 0
for i in idx:
    if (freqs[i] >= lF) and (freqs[i] <= uF):
        sum += ps[i]
        c += 1

avgValue = sum/c
print 'mean value is=', avgValue
I think the calculation is fine, but it takes a lot of time: for data of more than 15 GB the processing time becomes enormous. Is there a faster way to get the mean value of the power spectrum within some frequency range? Thanks in advance.
EDIT 1
I followed this code for calculation of power spectrum.
EDIT 2
This doesn't answer my question, as it calculates the mean over the whole array/list, but I want the mean over only part of the array.
EDIT 3
The solution by jez of using a mask reduces the time. Actually I have more than 10 channels of 1-D signal and I want to treat them all in the same manner, i.e. average the frequencies in a range for each channel separately. I think Python loops are slow. Is there an alternative for that?
Like this:
for i in xrange(0, 15):
    data = signals[:, i]
    ps = np.abs(np.fft.fft(data)) ** 2
    freqs = np.fft.fftfreq(data.size, time_step)
    mask = np.logical_and(freqs >= lF, freqs <= uF)
    avgValue = ps[mask].mean()
    print 'mean value is=', avgValue
The following performs a mean over a selected region:
mask = numpy.logical_and( freqs >= lF, freqs <= uF )
avgValue = ps[ mask ].mean()
For proper scaling of power values that have been computed as abs(fft coefficients)**2, you will need to multiply by (2.0 / len(data))**2 (Parseval's theorem)
Note that it gets slightly fiddly if your frequency range includes the Nyquist frequency: for precise results, handling of that single frequency component would then need to depend on whether data.size is even or odd. So for simplicity, ensure that uF is strictly less than max(freqs). (For similar reasons you should ensure lF > 0.)
The reasons for this are tedious to explain and even more tedious to correct for, but basically: the DC component is represented once in the DFT, whereas most other frequency components are represented twice (positive frequency and negative frequency) at half-amplitude each time. The even-more-annoying exception is the Nyquist frequency which is represented once at full amplitude if the signal length is even, but twice at half amplitude if the signal length is odd. All of this would not affect you if you were averaging amplitude: in a linear system, being represented twice compensates for being at half amplitude. But you're averaging power, i.e. squaring the values before averaging, so this compensation doesn't work out.
I've pasted my code for grokking all of this. This code also shows how you can work with multiple signals stacked in one numpy array, which addresses your follow-up question about avoiding loops in the multi-channel case. Remember to supply the correct axis argument both to numpy.fft.fft() and to my fft2ap().
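The fft2ap() code referred to above is not reproduced here, but as a rough sketch of the stacked-array idea (my own illustration, not the author's actual code): if the channels are the columns of a 2-D array you can FFT along axis 0 and average the masked power for all channels at once, without a Python loop. This assumes signals, time_step, lF and uF as defined in the question:
import numpy as np

ps = np.abs(np.fft.fft(signals, axis=0))**2       # power spectra, one column per channel
freqs = np.fft.fftfreq(signals.shape[0], time_step)
mask = (freqs >= lF) & (freqs <= uF)
avgValues = ps[mask, :].mean(axis=0)              # one mean per channel
print(avgValues)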
If you really have a signal of 15 GB in size, you'll not be able to calculate the FFT in an acceptable time. You can avoid using the FFT if it is acceptable for you to approximate your frequency range by a band-pass filter. The justification is Parseval's theorem, which states that the sum of squares is not changed by the FFT (i.e. the power is preserved). Staying in the time domain lets the processing time grow proportionally to the signal length.
The following code designs a Butterworth band-pass filter, plots the filter response and filters a sample signal:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
dd = np.random.randn(10**4) # generate sample data
T = 1./2e3 # sampling interval
n, f_s = len(dd), 1./T # number of points and sampling frequency
# design band-pass filter:
f_l, f_u = 50, 500 # Band from 50 Hz to 500 Hz
wp = np.array([f_l, f_u])*2/f_s # normalized pass band frequencies
ws = np.array([0.8*f_l, 1.2*f_u])*2/f_s # normalized stop band frequencies
b, a = signal.iirdesign(wp, ws, gpass=60, gstop=80, ftype="butter",
                        analog=False)
# plot filter response:
w, h = signal.freqz(b, a, whole=False)
ff_w = w*f_s/(2*np.pi)
fg, ax = plt.subplots()
ax.set_title('Butterworth filter amplitude response')
ax.plot(ff_w, np.abs(h))
ax.set_ylabel('relative Amplitude')
ax.grid(True)
ax.set_xlabel('Frequency in Hertz')
fg.canvas.draw()
# do the filtering:
zi = signal.lfilter_zi(b, a)*dd[0]
dd1, _ = signal.lfilter(b, a, dd, zi=zi)
# calculate the average:
avg = np.mean(dd1**2)
print("Mean square value is %g" % avg)
plt.show()
Read the documentation on SciPy's filter design functions to learn how to modify the parameters of the filter.
If you want to stay with the FFT, read the docs on scipy.signal.welch and plt.psd. The Welch algorithm is a method to efficiently calculate the power spectral density of a signal (with some trade-offs).
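For reference, a minimal scipy.signal.welch sketch (the sample data and parameter values are illustrative only):
import numpy as np
from scipy import signal

fs = 2000.0
dd = np.random.randn(10**5)                        # sample data
f, Pxx = signal.welch(dd, fs=fs, nperseg=4096)     # power spectral density estimate
band = (f >= 50) & (f <= 500)
print("mean PSD in band:", Pxx[band].mean())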
It is much easier to work with the FFT if your array length is a power of 2. When you do the FFT, the frequencies range from -pi/timestep to pi/timestep (assuming that frequency is defined as w = 2*pi/t; change the values accordingly if you use the f = 1/t representation). The spectrum is arranged from 0 up to the maximum positive frequency, followed by the negative frequencies coming back up toward zero. You can now use the fftshift function to swap the frequencies, so the spectrum looks like min frequency -- DC -- max frequency; you can then easily determine your desired frequency range because it is already sorted.
The frequency step is dw = 2*pi/(time span), or max frequency/(N/2), where N is the array size.
The point N/2 is DC (0 frequency) and the Nth position is the maximum frequency, so you can easily determine your range:
Lower_freq_indx = N/2 + N/2*Lower_freq/max_freq
Higher_freq_index = N/2 + N/2*Higher_freq/max_freq
avg = sum(ps[Lower_freq_indx:Higher_freq_index])/(Higher_freq_index - Lower_freq_indx)
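A hedged sketch of those index formulas (the sample data and band limits are my own assumptions):
import numpy as np

fs = 2000.0
data = np.random.randn(4096)                            # power-of-2 length, as suggested
ps = np.fft.fftshift(np.abs(np.fft.fft(data))**2)       # order: -max_freq ... DC ... +max_freq
N = data.size
max_freq = fs/2.0
lF, uF = 50.0, 500.0
lo = int(N/2 + N/2*lF/max_freq)                         # index of lower bound (DC sits at N/2)
hi = int(N/2 + N/2*uF/max_freq)                         # index of upper bound
avg = ps[lo:hi].mean()
print("mean power in band:", avg)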
I hope this will help
regards

Integration in Fourier or time domain

I'm struggling to understand a problem with the numerical integration of a signal. Basically I have a signal which I would like to integrate, or perform an antiderivative of, as a function of time (integration of a pick-up coil signal to obtain the magnetic field). I've tried two different methods which in principle should be consistent, but they are not. The code I'm using is the following. Beware that the signal y in the code has been previously high-pass filtered using Butterworth filtering (similar to what is done here http://wiki.scipy.org/Cookbook/ButterworthBandpass). The signal and time base can be downloaded here (https://www.dropbox.com/s/fi5z38sae6j5410/trial.npz?dl=0).
import numpy as np
import scipy as sp
from scipy import integrate
from scipy import fftpack

data = np.load('trial.npz')
y = data['arr_1']    # this is the signal
t = data['arr_0']

# integration using fft
bI = sp.fftpack.diff(y-y.mean(), order=-1)
bI2 = sp.integrate.cumtrapz(y-y.mean(), x=t)
Now the two signals (apart from a possible different linear trend, which can be taken out) are different; or rather, dynamically they are quite similar, with the same oscillation times, but there is a factor of approximately 30 between the two signals, in the sense that bI2 is roughly 30 times lower than bI. BTW, I've subtracted the mean from both signals to be sure they are zero-mean, and performing the integration in IDL (both with the equivalent of cumtrapz and in the Fourier domain) gives values compatible with bI2. Any clue is really welcome.
It's difficult to know what scipy.fftpack.diff() is doing under the bonnet.
To try and solve your problem, I have dug up an old frequency domain integration function that I wrote a while ago. It's worth pointing out that in practice, one generally wants a bit more control of some of the parameters than scipy.fftpack.diff() gives you. For example, the f_lo and f_hi parameters of my intf() function allow you to band-limit the input to exclude very low or very high frequencies which may be noisy. Noisy low frequencies in particular can 'blow-up' during integration and overwhelm the signal. You may also want to use a window at the start and end of the time series to stop spectral leakage.
I have calculated bI2 and also a result, bI3, integrated once with intf() using the following code (I assumed an average sampling rate for simplicity):
import numpy as np
import scipy as sp
from scipy import integrate
import intf

data = np.load(path)
y = data['arr_1']
t = data['arr_0']
bI2 = sp.integrate.cumtrapz(y-y.mean(), x=t)
bI3 = intf.intf(y-y.mean(), fs=500458, f_lo=1, winlen=1e-2, times=1)
I plotted bI2 and bI3:
The two time series are of the same order of magnitude, and broadly the same shape, notwithstanding the piecewise linear trend apparent in bI2. I know this doesn't explain what's going on in the scipy function, but at least this shows it's not a problem with the frequency domain method.
The code for intf is pasted in full below.
import numpy as np

EPS = np.finfo(float).eps   # small constant to avoid division by zero (assumed here; not defined in the original post)

def intf(a, fs, f_lo=0.0, f_hi=1.0e12, times=1, winlen=1, unwin=False):
    """
    Numerically integrate a time series in the frequency domain.

    This function integrates a time series in the frequency domain using
    'Omega Arithmetic', over a defined frequency band.

    Parameters
    ----------
    a : array_like
        Input time series.
    fs : int
        Sampling rate (Hz) of the input time series.
    f_lo : float, optional
        Lower frequency bound over which integration takes place.
        Defaults to 0 Hz.
    f_hi : float, optional
        Upper frequency bound over which integration takes place.
        Defaults to the Nyquist frequency ( = fs / 2).
    times : int, optional
        Number of times to integrate input time series a. Can be either
        0, 1 or 2. If 0 is used, function effectively applies a 'brick wall'
        frequency domain filter to a.
        Defaults to 1.
    winlen : int, optional
        Number of seconds at the beginning and end of a file to apply half a
        Hanning window to. Limited to half the record length.
        Defaults to 1 second.
    unwin : Boolean, optional
        Whether or not to remove the window applied to the input time series
        from the output time series.

    Returns
    -------
    out : complex ndarray
        The zero-, single- or double-integrated acceleration time series.

    Versions
    ----------
    1.1 First development version.
        Uses rfft to avoid complex return values.
        Checks for even length time series; if not, end-pad with single zero.
    1.2 Zero-means time series to avoid spurious errors when applying Hanning
        window.
    """
    a = a - a.mean()                            # Convert time series to zero-mean
    if np.mod(a.size, 2) != 0:                  # Check for even length time series
        odd = True
        a = np.append(a, 0)                     # If not, append zero to array
    else:
        odd = False
    f_hi = min(fs/2, f_hi)                      # Upper frequency limited to Nyquist
    winlen = min(a.size/2, winlen)              # Limit window to half record length

    ni = a.size                                 # No. of points in data (int)
    nf = float(ni)                              # No. of points in data (float)
    fs = float(fs)                              # Sampling rate (Hz)
    df = fs/nf                                  # Frequency increment in FFT
    stf_i = int(f_lo/df)                        # Index of lower frequency bound
    enf_i = int(f_hi/df)                        # Index of upper frequency bound

    window = np.ones(ni)                        # Create window function
    es = int(winlen*fs)                         # No. of samples to window from ends
    edge_win = np.hanning(es)                   # Hanning window edge
    window[:es//2] = edge_win[:es//2]
    window[-es//2:] = edge_win[-es//2:]
    a_w = a*window

    FFTspec_a = np.fft.rfft(a_w)                # Calculate complex FFT of input
    FFTfreq = np.fft.fftfreq(ni, d=1/fs)[:ni//2+1]

    w = (2*np.pi*FFTfreq)                       # Omega
    iw = (0+1j)*w                               # i*Omega

    mask = np.zeros(ni//2+1)                    # Half-length mask for +ve freqs
    mask[stf_i:enf_i] = 1.0                     # Mask = 1 for desired +ve freqs

    if times == 2:                              # Double integration
        FFTspec = -FFTspec_a*w / (w+EPS)**3
    elif times == 1:                            # Single integration
        FFTspec = FFTspec_a*iw / (iw+EPS)**2
    elif times == 0:                            # No integration
        FFTspec = FFTspec_a
    else:
        print('Error')

    FFTspec *= mask                             # Select frequencies to use
    out_w = np.fft.irfft(FFTspec)               # Return to time domain

    if unwin == True:
        out = out_w*window/(window+EPS)**2      # Remove window from time series
    else:
        out = out_w

    if odd == True:                             # Check for even length time series
        return out[:-1]                         # If not, remove last entry
    else:
        return out
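As an aside, one quick way to sanity-check cumtrapz against scipy.fftpack.diff is a pure sine spanning an integer number of cycles (my own sketch; note that fftpack.diff assumes a periodic signal and its period argument defaults to 2*pi, so it has to be set to the actual record length for the amplitudes to be comparable):
import numpy as np
from scipy import fftpack, integrate

fs = 1e4
t = np.arange(0, 1.0, 1/fs)
y = np.sin(2*np.pi*50*t)                                 # 50 full cycles in 1 s

bI = fftpack.diff(y - y.mean(), order=-1, period=t[-1] - t[0] + 1/fs)
bI2 = integrate.cumtrapz(y - y.mean(), x=t, initial=0)

print(np.ptp(bI), np.ptp(bI2))                           # peak-to-peak amplitudes should be similar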

Clipping FFT Matrix

Audio processing is pretty new for me, and I am currently using Python/NumPy to process wave files. After calculating the FFT matrix I am getting noisy power values for non-existent frequencies. I am interested in visualizing the data and accuracy is not a high priority. Is there a safe way to calculate a clipping value to remove these values, or should I use all the FFT matrices for each sample set to come up with an average number?
regards
Edit:
from numpy import *
import wave
import pymedia.audio.sound as sound
import time, struct
from pylab import ion, plot, draw, show

fp = wave.open("500-200f.wav", "rb")
sample_rate = fp.getframerate()
total_num_samps = fp.getnframes()
fft_length = 2048
num_fft = (total_num_samps // fft_length) - 2
temp = zeros((num_fft, fft_length), float)

for i in range(num_fft):
    tempb = fp.readframes(fft_length)
    data = struct.unpack("%dH" % (fft_length), tempb)
    temp[i,:] = array(data, short)

pts = fft_length//2 + 1
data = (abs(fft.rfft(temp, fft_length)) / (pts))[:pts]
x_axis = arange(pts)*sample_rate*.5/pts
spec_range = pts
plot(x_axis, data[0])
show()
Here is the plot on a non-logarithmic scale, for a synthetic wave file containing a 500 Hz (fading out) + 200 Hz sine wave created using GoldWave.
Simulated waveforms shouldn't show FFTs like your figure, so something is very wrong, and probably not with the FFT, but with the input waveform. The main problem in your plot is not the ripples, but the harmonics around 1000 Hz, and the subharmonic at 500 Hz. A simulated waveform shouldn't show any of this (for example, see my plot below).
First, you probably want to just try plotting out the raw waveform, and this will likely point to an obvious problem. Also, it seems odd to have a wave unpack to unsigned shorts, i.e. "H", and especially after this to not have a large zero-frequency component.
I was able to get a pretty close duplicate to your FFT by applying clipping to the waveform, as was suggested by both the subharmonic and higher harmonics (and Trevor). You could be introducing clipping either in the simulation or the unpacking. Either way, I bypassed this by creating the waveforms in numpy to start with.
Here's what the proper FFT should look like (i.e. basically perfect, except for the broadening of the peaks due to the windowing)
Here's one from a waveform that's been clipped (and is very similar to your FFT, from the subharmonic to the precise pattern of the three higher harmonics around 1000 Hz)
Here's the code I used to generate these
from numpy import *
from pylab import ion, plot, draw, show, xlabel, ylabel, figure

sample_rate = 20000.
times = arange(0, 10., 1./sample_rate)
wfm0 = sin(2*pi*200.*times)
wfm1 = sin(2*pi*500.*times)*(10.-times)/10.
wfm = wfm0 + wfm1

# int test
#wfm *= 2**8
#wfm = wfm.astype(int16)
#wfm = wfm.astype(float)
# abs test
#wfm = abs(wfm)
# clip test
#wfm = clip(wfm, -1.2, 1.2)

fft_length = 5*2048
total_num_samps = len(times)
num_fft = (total_num_samps // fft_length) - 2
temp = zeros((num_fft, fft_length), float)

for i in range(num_fft):
    temp[i,:] = wfm[i*fft_length:(i+1)*fft_length]

pts = fft_length//2 + 1
data = (abs(fft.rfft(temp, fft_length)) / (pts))[:pts]
x_axis = arange(pts)*sample_rate*.5/pts
spec_range = pts

plot(x_axis, data[2], linewidth=3)
xlabel("freq (Hz)")
ylabel('abs(FFT)')
show()
FFTs, because they are windowed and sampled, cause aliasing and sampling in the frequency domain as well. Filtering in the time domain is just multiplication in the frequency domain, so you may want to simply apply a filter, which means multiplying each frequency by the value of the filter's response at that frequency: for example, multiply by 1 in the passband and by zero everywhere else. The unexpected values are probably caused by aliasing, where higher frequencies are being folded down to the ones you are seeing. The original signal needs to be band-limited to half your sampling rate or you will get aliasing. Of more concern is aliasing that distorts the area of interest, because for this band of frequencies you want to know that the frequency really is the expected one.
The other thing to keep in mind is that when you grab a piece of data from a wave file, you are mathematically multiplying it by a rectangular window. This causes a sin(x)/x response to be convolved with the frequency response; to minimize this you can multiply the extracted segment by something like a Hanning window.
It's worth mentioning for a 1D FFT that the first element (index [0]) contains the DC (zero-frequency) term, the elements [1:N/2] contain the positive frequencies and the elements [N/2+1:N-1] contain the negative frequencies. Since you didn't provide a code sample or additional information about the output of your FFT, I can't rule out the possibility that the "noisy power values at non-existent frequencies" aren't just the negative frequencies of your spectrum.
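A tiny illustration of that layout, using numpy (my own snippet):
import numpy as np

N = 8
print(np.fft.fftfreq(N))   # [0., 0.125, 0.25, 0.375, -0.5, -0.375, -0.25, -0.125]
# index 0 is the DC bin, indices 1..N/2-1 hold the positive frequencies,
# and indices N/2..N-1 hold the negative frequencies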
EDIT: Here is an example of a radix-2 FFT implemented in pure Python with a simple test routine that finds the FFT of a rectangular pulse, [1.,1.,1.,1.,0.,0.,0.,0.]. You can run the example on codepad and see that the FFT of that sequence is
[0j, Negative frequencies
(1+0.414213562373j), ^
0j, |
(1+2.41421356237j), |
(4+0j), <= DC term
(1-2.41421356237j), |
0j, v
(1-0.414213562373j)] Positive frequencies
Note that the code prints out the Fourier coefficients in order of ascending frequency, i.e. from the highest negative frequency up to DC, and then up to the highest positive frequency.
I don't know enough from your question to actually answer anything specific.
But here are a couple of things to try from my own experience writing FFTs:
Make sure you are following Nyquist rule
If you are viewing the linear output of the FFT... you will have trouble seeing your own signal and think everything is broken. Make sure you are looking at the dB of your FFT magnitude. (i.e. "plot(10*log10(abs(fft(x))))" )
Create a unit test for your FFT() function by feeding it generated data like a pure tone. Then feed the same generated data to Matlab's FFT(). Do an absolute value diff between the two output data series and make sure the maximum absolute value difference is something like 10^-6 (i.e. the only difference is caused by small floating-point errors).
Make sure you are windowing your data
If all of those three things work, then your fft is fine. And your input data is probably the issue.
Check the input data to see if there is clipping http://www.users.globalnet.co.uk/~bunce/clip.gif
Time domain clipping shows up as mirror images of the signal in the frequency domain at specific regular intervals, with less amplitude.
