Scaling factor of n-dim fft in numpy

I have an image of a grid of holes. Processing it with numpy.fft.fft2 yields a nice image where I can clearly see periodicity, base vectors etc.
But how can I extract the lattice spacing?
The lattice points in real-space have a spacing of about 96px, so the spacing in k-space would be 2*Pi / 96px = 0.065 1/px.
Naturally, numpy can't return an image array with sub-pixel spacing, so it is somehow scaled - spacing in k-space is about 70px.
But how is the scaling done and what is the exact scaling factor?

The frequency scale of numpy.fft.fft2's output is in units of cycles per full input length per pixel, under the assumption that the input is periodic with a period equal to the full input length.
So, if you have an fft2 output with a size of 6720 x 6720 pixels and with a spike at the 70th pixel, you may expect a periodic component in the spatial domain with a period of:
1 / (70 pixels * 1 cycle / 6720 pixels / pixel) = 96 pixels/cycle.
Correspondingly, if you have an input image with a size of 6720 x 6720 pixels with elements that are repeating every 96 pixels, you will get a spike in the frequency domain at:
(1 / (96 pixels/cycle)) / (1 cycle / 6720 pixels / pixel) = 70 pixels.
While this is unit accurate, perhaps a simpler way to look at it is:
spatial-domain-period-in-pixels = image-size-in-pixels / frequency-domain-frequency-in-pixels
frequency-domain-frequency-in-pixels = image-size-in-pixels / spatial-domain-period-in-pixels
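As a quick check of the relation above, here is a minimal sketch on a synthetic image. The image size (480 px), the cosine test pattern and the helper name dominant_period are assumptions made for illustration; they are not from the original question.

```python
import numpy as np


def dominant_period(image):
    """Return (peak_bin, period_in_pixels) along the horizontal axis.

    Assumes the periodic structure varies along axis 1 only, so the
    peak lies in the first row of the 2-D spectrum.
    """
    n = image.shape[1]
    spectrum = np.abs(np.fft.fft2(image))
    spectrum[0, 0] = 0.0                  # ignore the DC term
    row = spectrum[0, 1 : n // 2]         # positive horizontal frequencies
    peak_bin = int(np.argmax(row)) + 1    # +1 because the slice starts at bin 1
    return peak_bin, n / peak_bin


# Synthetic 480 x 480 image whose pattern repeats every 96 px horizontally.
n, period = 480, 96
x = np.arange(n)
image = np.tile(np.cos(2 * np.pi * x / period), (n, 1))

peak_bin, estimated_period = dominant_period(image)
print(peak_bin, estimated_period)
```

The peak bin is image size / period, and dividing the image size by the peak bin recovers the period, which is the same arithmetic as 6720 / 70 = 96 in the answer.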

Related

estimate object's height and width using lidar data

I am currently working on lidar and camera fusion for object detection, distance and size estimation. I am struggling with the width and height estimation using lidar data (x and y coordinates).
I need help with a method that makes use of all the information extracted from the lidar sensor to estimate the object's size.
NB: 1- The bounding boxes are provided by the YOLOv5 algorithm.
2- I have calculated the actual distance of each object inside a bounding box.
I want the height and width of the cyclist in the attached image.

This is geometry around the "pinhole camera" model.
Let's first regard the unit circle.
Picture from debraborkovitz.com
Your camera is at the origin looking at B. The object is the line BC. Let's say it is 1 meter away (O-B), and 0.5 meters tall (B-C). It spans a certain angle of the unit circle (your view). Let's call that alpha (or theta, doesn't matter).
tan(alpha) * 1.0 m = 0.5 m
tan(alpha) * distance[m] = length[m]
tan(alpha) = length[m] / distance[m]
alpha isn't important, but tan(alpha) is, because it's proportional to the object's length. Only keep that in mind.
Focal length is just a factor, describing image resolution. Say f = 1000 px, then this object would be 500 px tall because
length[px] = f[px] * tan(alpha)
= f[px] * length[m] / distance[m]
Now, if lidar says the object is 5 m away, and the image says the object is 300 px tall/wide, you calculate
length[px] = f[px] * length[m] / distance[m]
rearrange
length[m] = length[px] / f[px] * distance[m]
length[m] = 300px / 1000px * 5m
length[m] = 1.5 m
You need to know the focal length (in pixels) for your camera. That is either given by the manufacturer, somewhere in the documentation, or you have to calculate it. There are calibration methods available. You can also calculate it from manual measurements.
If you need to estimate it, you can just place a yard stick at a known distance, take a picture, measure its length in pixels, and use the previous equations to evaluate:
f[px] = length[px] * distance[m] / length[m]
If you knew the sensor's pixel pitch, let's say 1.40 µm/px, and the true focal distance (not 35mm-equivalent), let's say 4.38 mm, then f[px] = 4.38 mm / (1.40 µm/px) ≈ 3129 px. Those values are roughly representative of smartphone cameras and some webcams.
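The equations above can be sketched in a few lines of Python. The function names, the (x1, y1, x2, y2) box layout and the example numbers are assumptions for illustration. The sketch also assumes the object is roughly parallel to the image plane and that the bounding box is tight.

```python
def focal_length_px(focal_length_mm, pixel_pitch_um):
    """Focal length in pixels from the true focal length and the pixel pitch."""
    return (focal_length_mm * 1000.0) / pixel_pitch_um


def size_in_meters(size_px, distance_m, f_px):
    """length[m] = length[px] / f[px] * distance[m] (pinhole model)."""
    return size_px / f_px * distance_m


def bbox_size_in_meters(bbox, distance_m, f_px):
    """Width and height in meters of an (x1, y1, x2, y2) pixel box."""
    x1, y1, x2, y2 = bbox
    width_m = size_in_meters(x2 - x1, distance_m, f_px)
    height_m = size_in_meters(y2 - y1, distance_m, f_px)
    return width_m, height_m


# Hypothetical detection: a 120 x 300 px box, lidar distance 5 m, f = 1000 px.
width_m, height_m = bbox_size_in_meters((100, 50, 220, 350), distance_m=5.0, f_px=1000.0)
print(width_m, height_m)
```

With a detector such as YOLOv5, the box corners would come from the detection output and the distance from the lidar points that fall inside the box.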

How can I report the outcome of the typical spatial frequency Butterworth filter in cycles / degree of visual angle?

I'm working on spatial frequency filtering using code from this site.
https://www.djmannion.net/psych_programming/vision/sf_filt/sf_filt.html
There is similar code here on stack exchange. What I was wondering though is how to convert the cutoff used in the Butterworth filter, which is a number from 0 to 1, to cycles / degree in the image when I report it. I feel like I'm missing something obvious. I'm imagining it has to do with the visual angle the image subtends and the resolution.

The Butterworth filter commonly used in psychology software to filter spatial frequencies from images is typically used as a low-pass or high-pass filter with a cutoff value between 0 and 1. That value is the 50% point of the filter. For a low-pass filter, frequencies below the cutoff are retained at 50% or more of their original strength and frequencies above it at 50% or less; the opposite holds for a high-pass filter. The function rises steeply near the cutoff, so the cutoff is a good way to describe your filtered image.
So, what is the cutoff value and what does it mean? It's just a proportion of the maximum frequency in cycles/pixel. Once you know that maximum it's easy to derive, and it turns out the maximum is a constant. Suppose your image is 128x128. In 128 pixels you could have a maximum of 64 cycles, or 0.5 cycles/pixel. Looking at the fftfreq function in numpy or matlab reveals that the maximum is always 0.5 cycles/pixel. That makes sense because you need two pixels for a period. This means that a cutoff of 0.2 is going to be 0.1 cycles/pixel. Whatever cutoff you pick, you can just divide it in half and that's the cycles/pixel. Then all you need to do is scale that to cycles/degree.
Cycles / pixel can be converted to cycles / degree of visual angle once you know the size of the image presented in degrees of visual angle. For example, if someone's eye is d distance from the image of h height then the image will span 2 * atan( (h/2) / d) degrees of visual angle (assuming your atan function reports degrees and not radians, you may have to convert). Take the number of pixels in your image and divide it by the total span in degrees to get pixels / degree. Then multiply frequency (cycles / pixel) by pixels / degree to get cycles / degree.
Pseudo code equation version below:
cycles/pixel = cutoff * 0.5
d = distance_from_image_in_cm
h = height_of_image_in_cm
degrees_of_visual_angle = 2 * atan( (h/2) / d)
pixels/degree = total_pixels / degrees_of_visual_angle
cycles/degree = cycles/pixel * pixels/degree
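The pseudo code above translates fairly directly into Python. This is a sketch: the function name and the example viewing geometry are made up for illustration, and the pixel count is assumed to be measured along the same dimension as the image height.

```python
import math


def cutoff_to_cycles_per_degree(cutoff, height_px, height_cm, distance_cm):
    """Convert a 0..1 Butterworth cutoff into cycles per degree of visual angle."""
    cycles_per_pixel = cutoff * 0.5  # 0.5 cycles/pixel is the maximum frequency
    # math.atan returns radians, so convert the full angle to degrees.
    visual_angle_deg = math.degrees(2.0 * math.atan((height_cm / 2.0) / distance_cm))
    pixels_per_degree = height_px / visual_angle_deg
    return cycles_per_pixel * pixels_per_degree


# Hypothetical setup: 128 px image, 10 cm tall, viewed from 57 cm.
cpd = cutoff_to_cycles_per_degree(0.2, height_px=128, height_cm=10.0, distance_cm=57.0)
print(cpd)
```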

average luminescence value vs distance to the center of an image

I would like to compute the average luminescence value vs distance to the center of an image. The approach I am thinking about is to
compute the distance between pixels in image and image center
group pixels with same distance
compute the average value of pixels for each group
plot graph of distance vs average intensity
To compute first step I use this function:
dist_img = np.zeros(gray.shape, dtype=np.uint8)
for y in range(0, h):
    for x in range(0, w):
        cy = gray.shape[0]/2
        cx = gray.shape[1]/2
        dist = math.sqrt(((x-cx)**2)+((y-cy)**2))
        dist_img[y,x] = dist
Unfortunately it gives a different result from the one I compute with:
distance = math.sqrt(((1 - gray.shape[0]/2)**2 )+((1 - gray.shape[1]/2 )**2))
When I test it for pixel (1,1) I get 20 from the first code and 3605 from the second.
I would appreciate suggestions on how to correct the loop and hints on how to start with the other steps. Or maybe there is another way to achieve what I want.

You are setting up dist_img with an np.uint8 dtype. This 8-bit unsigned integer can only hold values between 0 and 255, so 3605 cannot be represented properly. Use a higher bit depth for your distance image dtype, like np.uint32.
distance = math.sqrt(((1 - gray.shape[0]/2)**2 )+((1 - gray.shape[1]/2 )**2))
Careful: gray.shape will give you (height, width) or (y, x). The other code correctly assigns gray.shape[0]/2 to the y center, this one mixes it up and uses the height for the x coordinate.
Your algorithm seems good enough, and I would suggest you stick with it. You can achieve something similar to the first two steps by converting the image to polar space (e.g. with OpenCV's linearPolar or warpPolar), but that may be harder to debug.
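For reference, the four steps from the question can also be done without explicit loops. This is a sketch rather than the asker's code: it assumes a 2-D grayscale array, groups pixels by distance rounded to the nearest integer, and uses the same center convention (shape / 2) as the question.

```python
import numpy as np


def radial_profile(gray):
    """Average pixel value for each integer distance from the image center."""
    h, w = gray.shape
    cy, cx = h / 2, w / 2                    # same center as in the question
    y, x = np.indices((h, w))
    dist = np.sqrt((x - cx) ** 2 + (y - cy) ** 2)   # float distances, no overflow
    group = np.rint(dist).astype(np.int64)          # pixels sharing a rounded distance
    sums = np.bincount(group.ravel(), weights=gray.ravel().astype(np.float64))
    counts = np.bincount(group.ravel())
    return sums / np.maximum(counts, 1)             # avoid dividing by zero


# A constant image should give a flat profile.
profile = radial_profile(np.full((20, 30), 7.0))
print(profile)
```

Plotting the profile against np.arange(len(profile)) gives the distance versus average intensity graph.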

How to implement/perform DFT on a segment in python?

I am trying to write a simple program in python that will calculate and display DFT output of 1 segment.
My signal is 3 seconds long, I want to calculate DFT for every 10ms long segment. Sampling rate is 44100. So one segment is 441 samples long.
Since I am in the phase of testing this and original program is much larger(speech recognition) here is an isolated part for testing purposes that unfortunately behaves odd. Either that or my lack of knowledge on the subject.
1. I read somewhere that DFT input should be rounded to power of 2 so I arranged my array to 512 instead 441. Is this true?
2. If I am sampling at a rate of 44100, at most I can reach frequency of 22050Hz and for sample of length 512(~441) at least 100Hz ?
3. If 2. is true, then I can have all frequencies between 100hz and 22050hz in that 10ms segments, but the length of segment is 512(441) samples only, output of fft returns array of 256(220) values, they cannot contain all 21950 frequencies in there, can they?
4. My first guess is that the values in output of fft should be multiplied by 100, since 10ms is 100th of a second. Is this good reasoning?
The following program for two given frequencies 1000 and 2000 returns two spikes on graph at positions 24 and 48 in the output array and ~2071 and ~4156 on the graph. Since ratio of numbers is okay (2000:1000 = 48:24) I wonder if I should ignore some starting part of the fft output?
import matplotlib.pyplot as plt
import numpy as np
t = np.arange(0, 1, 1/512.0) # We create 512 long array
# We calculate here two sinusoids together at 1000hz and 2000hz
y = np.sin(2*np.pi*1000*t) + np.sin(2*np.pi*2000*t)
n = len(y)
k = np.arange(n)
# Problematic part is around here, I am not quite sure what
# should be on the horizontal line
T = n/44100.0
frq = k/T
frq = frq[:n // 2]
Y = np.fft.fft(y)
Y = Y[:n // 2]
# Convert from complex numbers to magnitudes
iY = []
for f in Y:
    iY.append(np.sqrt(f.imag * f.imag + f.real * f.real))
plt.plot(frq, iY, 'r')
plt.xlabel('freq (HZ)')
plt.show()

I read somewhere that the DFT input should be rounded to power of 2 so I arranged my array to 512 instead 441. Is this true?
The DFT is defined for all sizes. However, implementations of the DFT such as the FFT are generally much more efficient for sizes which can be factored into small primes. Some library implementations have limitations and do not support sizes other than powers of 2, but that isn't the case with numpy.
If I am sampling at a rate of 44100, at most I can reach frequency of 22050Hz and for sample of length 512(~441) at least 100Hz?
The highest frequency for even sized DFT will be 44100/2 = 22050Hz as you've correctly pointed out. Note that for odd sized DFT the highest frequency bin will correspond to a frequency slightly less than the Nyquist frequency. As for the minimum frequency, it will always be 0Hz. The next non-zero frequency will be 44100.0/N where N is the DFT length in samples (which gives 100Hz if you are using a DFT length of 441 samples and ~86Hz with a DFT length of 512 samples).
If 2) is true, then I can have all frequencies between 100Hz and 22050Hz in that 10ms segments, but the length of segment is 512(441) samples only, output of fft returns array of 256(220) values, they cannot contain all 21950 frequencies in there, can they?
First there aren't 21950 frequencies between 100Hz and 22050Hz since frequencies are continuous and not limited to integer frequencies. That said, you are correct in your realization that the output of the DFT will be limited to a much smaller set of frequencies. More specifically the DFT represents the frequency spectrum at discrete frequency step: 0, 44100/N, 2*44100/N, ...
My first guess is that the values in output of FFT should be multiplied by 100, since 10ms is 100th of a second. Is this good reasoning?
There is no need to multiply the FFT output by 100. But if you meant multiples of 100Hz with a DFT of length 441 and a sampling rate of 44100Hz, then your guess would be correct.
The following program for two given frequencies 1000 and 2000 returns two spikes on graph at positions 24 and 48 in the output array and ~2071 and ~4156 on the graph. Since ratio of numbers is okay (2000:1000 = 48:24) I wonder if I should ignore some starting part of the fft output?
Here the problem is more significant. As you declare the array
t = np.arange(0, 1, 1/512.0) # We create 512 long array
you are in fact representing a signal with a sampling rate of 512Hz instead of 44100Hz. As a result the tones you are generating are severely aliased (to 24Hz and 48Hz respectively). This is further compounded by the fact that you then use a sampling rate of 44100Hz for the frequency axis conversion. This is why the peaks are not appearing at the expected 1000Hz and 2000Hz frequencies.
To represent 512 samples of a signal sampled at a rate of 44100Hz, you should instead use
t = np.arange(512) / 44100.0
at which point the formula you used for the frequency axis would be correct (since it is based on the same 44100Hz sampling rate). You should then be able to see peaks near the expected 1000Hz and 2000Hz (the closest frequency bins of the peaks being at ~1033Hz and ~1981Hz).
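Putting this together, a corrected version of the test program could look like the sketch below. It uses np.fft.rfft and np.fft.rfftfreq in place of the manual slicing in the question; that substitution, and the constant names, are choices made for this example.

```python
import numpy as np

FS = 44100.0   # sampling rate in Hz
N = 512        # number of samples in the segment

# 512 samples taken at 44100 Hz (about 11.6 ms of signal).
t = np.arange(N) / FS
y = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 2000 * t)

# rfft keeps only the non-negative frequencies of a real signal.
magnitudes = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(N, d=1.0 / FS)   # bin k is at k * FS / N Hz

# The two strongest bins, sorted by frequency.
top_two = np.sort(np.argsort(magnitudes)[-2:])
peak_freqs = freqs[top_two]
print(top_two, peak_freqs)
```

Plotting magnitudes against freqs gives the spectrum with a correctly labelled horizontal axis.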

1) I read somewhere that DFT input should be rounded to power of 2 so
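I arranged my array to 512 instead 441. Is this true?
No, the DFT length does not have to be a power of two, but FFT implementations are usually fastest for such sizes. If you want a power-of-two length, just pad the input with zeros up to 512.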
2) If I am sampling at a rate of 44100, at most I can reach frequency
of 22050hz and for sample of length 512(~441) at least 100hz ?
Yes, the highest frequency you can get is half the sampling rate. It's called the Nyquist frequency.
No, the lowest frequency bin you get (the first bin of the DFT) is called the DC component and marks the average of the signal. The next lowest frequency bin in your case is 22050 / 256 = 86Hz, and then 172Hz, 258Hz, and so on until 22050Hz.
You can get these frequencies with the numpy.fft.fftfreq() function.
3) If 2) is true, then I can have all frequencies between 100hz and
22050hz in that 10ms segments, but the length of segment is 512(441)
samples only, output of fft returns array of 256(220) values, they
cannot contain all 21950 frequencies in there, can they?
The DFT doesn't lose the original signal's data, but its frequency resolution is coarse when the DFT size is small. You may zero-pad the input to make the DFT size larger, such as 1024 or 2048; this samples the spectrum on a finer grid, although it adds no new information.
The DFT bin refers to a frequency range centered at each of the N output
points. The width of the bin is sample rate/N,
and it extends from: center frequency -(sample rate/N)/2 to center
frequency +(sample rate/N)/2. In other words, half of the bin extends
below each of the N output points, and half above it.
4) My first guess is that the values in output of fft should be
multiplied by 100, since 10ms is 100th of a second. Is this good
reasoning?
No, the values should not be multiplied if you want to preserve the magnitude.
The following program for two given frequencies 1000 and 2000 returns
two spikes on graph at positions 24 and 48 in the output array and
~2071 and ~4156 on the graph. Since ratio of numbers is okay
(2000:1000 = 48:24) I wonder if I should ignore some starting part of
the fft output?
For real input, the DFT result is mirrored. In other words, your frequencies will be like this:
n    0   1   2    3    4    ...  255    256    257    ...  510  511
Hz   DC  86  172  258  344  ...  21964  22050  21964  ...  172  86
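The bin frequencies listed above can be checked with numpy.fft.fftfreq. This is a small sketch assuming a 512-point DFT at 44100 Hz. Note that fftfreq reports the mirrored upper half as negative frequencies, so the absolute value is taken to match the listing.

```python
import numpy as np

FS = 44100.0   # sampling rate in Hz
N = 512        # DFT length

freqs = np.fft.fftfreq(N, d=1.0 / FS)   # one entry per output bin, 0 .. N-1
bin_spacing = FS / N                    # about 86.13 Hz

# Bins above N/2 come back as negative frequencies; abs() gives the
# mirrored values shown in the listing.
mirrored = np.abs(freqs)
print(mirrored[:5], mirrored[255:258], mirrored[-2:])
```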

Analyze audio using Fast Fourier Transform

I am trying to create a graphical spectrum analyzer in python.
I am currently reading 1024 bytes of a 16 bit dual channel 44,100 Hz sample rate audio stream and averaging the amplitude of the 2 channels together. So now I have an array of 256 signed shorts. I now want to perform an FFT on that array, using a module like numpy, and use the result to create the graphical spectrum analyzer, which, to start, will just be 32 bars.
I have read the Wikipedia articles on Fast Fourier Transform and Discrete Fourier Transform but I am still unclear about what the resulting array represents. This is what the array looks like after I perform an FFT on my array using numpy:
[ -3.37260500e+05 +0.00000000e+00j 7.11787022e+05 +1.70667403e+04j
4.10040193e+05 +3.28653370e+05j 9.90933073e+04 +1.60555003e+05j
2.28787050e+05 +3.24141951e+05j 2.09781047e+04 +2.31063376e+05j
-2.15941453e+05 +1.63773851e+05j -7.07833051e+04 +1.52467334e+05j
-1.37440802e+05 +6.28107674e+04j -7.07536614e+03 +5.55634993e+03j
-4.31009964e+04 -1.74891657e+05j 1.39384348e+05 +1.95956947e+04j
1.73613033e+05 +1.16883207e+05j 1.15610357e+05 -2.62619884e+04j
-2.05469722e+05 +1.71343186e+05j -1.56779748e+04 +1.51258101e+05j
-2.08639913e+05 +6.07372799e+04j -2.90623668e+05 -2.79550838e+05j
-1.68112214e+05 +4.47877871e+04j -1.21289916e+03 +1.18397979e+05j
-1.55779104e+05 +5.06852464e+04j 1.95309737e+05 +1.93876325e+04j
-2.80400414e+05 +6.90079265e+04j 1.25892113e+04 -1.39293422e+05j
3.10709174e+04 -1.35248953e+05j 1.31003438e+05 +1.90799303e+05j...
I am wondering what exactly these numbers represent and how I would convert these numbers into a percentage of a height for each of the 32 bars. Also, should I be averaging the 2 channels together?

The array you are showing is the Fourier Transform coefficients of the audio signal. These coefficients can be used to get the frequency content of the audio. The FFT is defined for complex valued input functions, so the coefficients you get out will be complex numbers even though your input is all real values. In order to get the amount of power in each frequency, you need to calculate the magnitude of the FFT coefficient for each frequency. This is not just the real component of the coefficient; you need to calculate the square root of the sum of the squares of its real and imaginary components. That is, if your coefficient is a + b*j, then its magnitude is sqrt(a^2 + b^2).
Once you have calculated the magnitude of each FFT coefficient, you need to figure out which audio frequency each FFT coefficient belongs to. An N point FFT will give you the frequency content of your signal at N equally spaced frequencies, starting at 0. Because your sampling frequency is 44100 samples / sec. and the number of points in your FFT is 256, your frequency spacing is 44100 / 256 = 172 Hz (approximately).
The first coefficient in your array will be the 0 frequency coefficient. That is the DC component: the sum of all samples, which is proportional to the average level of the signal. The rest of your coefficients will count up from 0 in multiples of 172 Hz until you get to coefficient 128. In an FFT, you can only measure frequencies up to half your sampling rate. Read up on the Nyquist Frequency and the Nyquist-Shannon Sampling Theorem if you are a glutton for punishment and need to know why, but the basic result is that your lower frequencies are going to be replicated or aliased in the higher frequency buckets. So the frequencies will start from 0, increase by 172 Hz for each coefficient up to the N/2 coefficient, then decrease by 172 Hz until the N - 1 coefficient.
That should be enough information to get you started. If you would like a much more approachable introduction to FFTs than is given on Wikipedia, you could try Understanding Digital Signal Processing: 2nd Ed.. It was very helpful for me.
So that is what those numbers represent. Converting to a percentage of height could be done by scaling each frequency component magnitude by the sum of all component magnitudes. Although, that would only give you a representation of the relative frequency distribution, and not the actual power for each frequency. You could try scaling by the maximum magnitude possible for a frequency component, but I'm not sure that that would display very well. The quickest way to find a workable scaling factor would be to experiment on loud and soft audio signals to find the right setting.
Finally, you should be averaging the two channels together if you want to show the frequency content of the entire audio signal as a whole. You are mixing the stereo audio into mono audio and showing the combined frequencies. If you want two separate displays for right and left frequencies, then you will need to perform the Fourier Transform on each channel separately.
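As a rough illustration of the steps in this answer, the sketch below computes the magnitude and frequency of each coefficient for a 256-sample block. The test tone, the function names and the choice of np.fft.rfft (which returns only the non-mirrored half) are assumptions for the example. Scaling by the sum of magnitudes is just the first option mentioned above.

```python
import numpy as np

SAMPLE_RATE = 44100
N = 256


def spectrum(samples, sample_rate=SAMPLE_RATE):
    """Return (frequencies in Hz, magnitudes) for a block of real samples."""
    samples = np.asarray(samples, dtype=np.float64)
    coeffs = np.fft.rfft(samples)                  # N/2 + 1 complex coefficients
    magnitudes = np.abs(coeffs)                    # sqrt(real^2 + imag^2)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs, magnitudes


def relative_heights(magnitudes):
    """Scale magnitudes so that they sum to 100 (relative distribution only)."""
    total = magnitudes.sum()
    return 100.0 * magnitudes / total if total > 0 else magnitudes


# Test tone sitting exactly on bin 10 (10 * 44100 / 256 Hz).
n = np.arange(N)
tone = np.sin(2 * np.pi * 10 * n / N)
freqs, magnitudes = spectrum(tone)
heights = relative_heights(magnitudes)
peak = int(np.argmax(magnitudes))
print(peak, freqs[peak])
```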

Although this thread is years old, I found it very helpful. I just wanted to give my input to anyone who finds this and is trying to create something similar.
As for the division into bars, this should not be done as antti suggests, by dividing the data equally based on the number of bars. The most useful approach would be to divide the data into octave parts, each octave being double the frequency of the previous one (i.e. 100hz is one octave above 50hz, which is one octave above 25hz).
Depending on how many bars you want, you divide the whole range into 1/X octave ranges.
Based on a given center frequency of A on the bar, you get the upper and lower limits of the bar from:
upper limit = A * 2 ^ ( 1 / 2X )
lower limit = A / 2 ^ ( 1 / 2X )
To calculate the next adjoining center frequency you use a similar calculation:
next lower = A / 2 ^ ( 1 / X )
next higher = A * 2 ^ ( 1 / X )
You then average the data that fits into these ranges to get the amplitude for each bar.
For example:
We want to divide into 1/3 octaves ranges and we start with a center frequency of 1khz.
Upper limit = 1000 * 2 ^ ( 1 / ( 2 * 3 ) ) = 1122.5
Lower limit = 1000 / 2 ^ ( 1 / ( 2 * 3 ) ) = 890.9
Given 44100hz and 1024 samples (43hz between each data point) we should average out values 21 through 26. ( 890.9 / 43 = 20.72 ~ 21 and 1122.5 / 43 = 26.10 ~ 26 )
(1/3 octave bars would get you around 30 bars between ~40hz and ~20khz).
As you can figure out by now, as we go higher we will average a larger range of numbers. Low bars typically only include 1 or a small number of data points. While the higher bars can be the average of hundreds of points. The reason being that 86hz is an octave above 43hz... while 10086hz sounds almost the same as 10043hz.
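The band formulas above can be written as a short Python sketch. The function names are made up for this example, and rounding the limits to the nearest bin is one possible reading of the "~" in the worked example.

```python
def band_limits(center_hz, bands_per_octave):
    """Lower and upper limits of a 1/X octave band around a center frequency."""
    half_step = 2.0 ** (1.0 / (2.0 * bands_per_octave))
    return center_hz / half_step, center_hz * half_step


def next_center(center_hz, bands_per_octave):
    """Center frequency of the adjoining higher band."""
    return center_hz * 2.0 ** (1.0 / bands_per_octave)


def band_bins(center_hz, bands_per_octave, sample_rate, n_samples):
    """First and last FFT bin index to average for one bar."""
    lower, upper = band_limits(center_hz, bands_per_octave)
    spacing = sample_rate / n_samples      # Hz between FFT data points
    return round(lower / spacing), round(upper / spacing)


lower, upper = band_limits(1000.0, 3)
first_bin, last_bin = band_bins(1000.0, 3, sample_rate=44100, n_samples=1024)
print(lower, upper, first_bin, last_bin)
```

For each bar, the FFT magnitudes from first_bin through last_bin would then be averaged.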

What you have is a sample whose length in time is 256/44100 = 0.00580499 seconds. This means that your frequency resolution is 1 / 0.00580499 = 172 Hz. The 256 values you get out from Python correspond to frequencies in steps of 172 Hz starting at 0 Hz; for real input only the first half (up to 22050 Hz) is unique and the rest is a mirror image. The numbers you get out are complex numbers (hence the "j" at the end of every second number).
You need to convert the complex numbers into amplitudes by calculating sqrt(i^2 + j^2), where i and j are the real and imaginary parts, respectively.
If you want to have 32 bars, you should, as far as I understand, take the average of four successive amplitudes from the unique first half, getting 128 / 4 = 32 bars as you want.
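A minimal sketch of this equal-width grouping, assuming a block of 256 real samples and using only the first 128 amplitudes (the non-mirrored half). The function name and the test tone are made up for the example.

```python
import numpy as np


def bars_from_samples(samples, n_bars=32):
    """Average groups of successive FFT amplitudes into n_bars values."""
    samples = np.asarray(samples, dtype=np.float64)
    # Keep the first half only; the second half mirrors it for real input.
    amplitudes = np.abs(np.fft.fft(samples))[: len(samples) // 2]
    group = len(amplitudes) // n_bars            # 128 // 32 = 4 bins per bar
    return amplitudes[: group * n_bars].reshape(n_bars, group).mean(axis=1)


# Test tone on bin 10, which falls into bar 2 (bins 8 to 11).
n = np.arange(256)
bars = bars_from_samples(np.sin(2 * np.pi * 10 * n / 256))
print(bars.shape, int(np.argmax(bars)))
```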

The FFT returns N complex values, for each of which you can compute the modulus = sqrt(real_part^2 + imaginary_part^2). To get the value for each band you have to sum the moduli of all harmonics inside the band. Below you can see an example for a 10-bar spectrum analyzer. The C code has to be wrapped to get a pyd Python module.
float *samples_vett;
float *out_filters_vett;
int Nsamples;
float band_power = 0.0;
float harmonic_amplitude=0.0;
int i, out_index;
out_index=0;
for (i = 0; i < Nsamples / 2 + 1; i++)
{
    if (i == 1 || i == 2 || i == 4 || i == 8 || i == 17 || i == 33 || i == 66 || i == 132 || i == 264 || i == 511)
    {
        out_filters_vett[out_index] = band_power;
        band_power = 0;
        out_index++;
    }
    harmonic_amplitude = sqrt(pow(ttfr_out_vett[i].r, 2) + pow(ttfr_out_vett[i].i, 2));
    band_power += harmonic_amplitude;
}
I designed and made a whole 10-LED-bar spectrum analyzer in Python. Instead of using the numpy library (too big just to get the FFT), I created a Python pyd module (just 27 KB) that computes the FFT and splits the entire audio spectrum into bands.
In addition, to read the output audio a loopback WASapi portaudio pyd module was created. You can see the project (block diagram) in the image
10BarsSpectrumAnalyzerWithWASapi.jpg
Just added a tutorial video on my YouTube channel: how to design and make a very smart Python Spectrum Analyzer 10 Led Bar
