I have the following scenario:
import numpy as np

value_range = [250.0, 350.0]
precision = 0.01
unique_values = len(np.arange(min(value_range),
                              max(value_range) + precision,
                              precision))
This means all values range between 250.0 and 350.0 with a precision of 0.01, giving a potential total of 10001 unique values that the data set can have.
# This is the data I'd like to scale
values_to_scale = np.arange(min(value_range),
                            max(value_range) + precision,
                            precision)
# These are the bins I want to assign to
unique_bins = np.arange(1, unique_values + 1)
You can see that in the above example each value in values_to_scale maps exactly to its corresponding item in the unique_bins array, i.e. a value of 250.0 (values_to_scale[0]) corresponds to 1 (unique_bins[0]), etc.
However, if my values_to_scale array looks like:
values_to_scale = np.array((250.66, 342.02))
How can I do the scaling/transformation to get the unique bin value? I.e. 250.66 should map to a value of 66, but how do I obtain this?
NOTE The value_range could equally be between -1 and 1, I'm just looking for a generic way to scale/normalise data between two values.
You're basically looking for a linear interpolation between min and max:
minv = min(value_range)
maxv = max(value_range)
unique_values = int(((maxv - minv) / precision) + 1)
((values_to_scale - minv) / (maxv + precision - minv) * unique_values).astype(int)
# array([ 65, 9202])
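If the truncation toward zero is a concern (floating point noise can turn a value that should be 66.0 into 65.999..., which then lands one bin low), a small variant is to round to the nearest step instead. This is only a sketch reusing the variables above, not part of the original answer:

import numpy as np

value_range = [250.0, 350.0]
precision = 0.01
values_to_scale = np.array((250.66, 342.02))

minv = min(value_range)
# Nearest-step (0-based) index instead of floor: 250.66 -> 66, 342.02 -> 9202
indices = np.rint((values_to_scale - minv) / precision).astype(int)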
I am working on finding the frequencies from a given dataset and I am struggling to understand how np.fft.fft() works. I thought I had a working script but ran into a weird issue that I cannot understand.
I have a dataset that is roughly sinusoidal and I wanted to understand what frequencies the signal is composed of. Once I took the FFT, I got this plot:
However, when I take the same dataset, slice it in half, and plot the same thing, I get this:
I do not understand why the frequency drops from 144kHz to 128kHz, since it is technically the same dataset, just with a smaller length.
I can confirm a few things:
The step size between data points is 0.001.
I have tried interpolation with little luck.
If I slice the second half of the dataset I get a different frequency as well.
If my dataset is indeed composed of both 128 and 144kHz, then why doesn't the 128 peak show up in the first plot?
What is even more confusing is that I am running a script with pure sine waves without issues:
import numpy as np
import matplotlib.pyplot as plt

T = 0.001
fs = 1 / T

def find_nearest_ind(data, value):
    return (np.abs(data - value)).argmin()
x = np.arange(0, 30, T)
ff = 0.2
y = np.sin(2 * ff * np.pi * x)
x = x[:len(x) // 2]
y = y[:len(y) // 2]
n = len(y) # length of the signal
k = np.arange(n)
T = n / fs
frq = k / T * 1e6 / 1000 # two sides frequency range
frq = frq[:len(frq) // 2] # one side frequency range
Y = np.fft.fft(y) / n # dft and normalization
Y = Y[:n // 2]
frq = frq[:50]
Y = Y[:50]
fig, (ax1, ax2) = plt.subplots(2)
ax1.plot(x, y)
ax1.set_xlabel("Time (us)")
ax1.set_ylabel("Electric Field (V / mm)")
peak_ind = find_nearest_ind(abs(Y), np.max(abs(Y)))
ax2.plot(frq, abs(Y))
ax2.axvline(frq[peak_ind], color = 'black', linestyle = '--', label = F"Frequency = {round(frq[peak_ind], 3)}kHz")
plt.legend()
plt.xlabel('Freq(kHz)')
ax1.title.set_text('dV/dX vs. Time')
ax2.title.set_text('Frequencies')
fig.tight_layout()
plt.show()
Here is a breakdown of your code, with some suggestions for improvement, and extra explanations. Working through it carefully will show you what is going on. The results you are getting are completely expected. I will propose a common solution at the end.
First set up your units correctly. I assume that you are dealing with seconds, not microseconds. You can adjust later as long as you stay consistent.
Establish the period and frequency of the sampling. This means that the Nyquist frequency for the FFT will be 500Hz:
T = 0.001 # 1ms sampling period
fs = 1 / T # 1kHz sampling frequency
Make a time domain of 30e3 points. The one-sided spectrum will then contain 15000 bins, which implies a frequency resolution of 500Hz / 15k = 0.03333Hz.
x = np.arange(0, 30, T) # time domain
n = x.size # number of points: 30000
Before doing anything else, we can define our frequency domain right here. I prefer a more intuitive approach than the one you are using: that way you don't have to redefine T or introduce the auxiliary variable k. But as long as the results are the same, it does not really matter:
F = np.linspace(0, 1 - 1/n, n) / T # Notice F[1] = 0.03333, as predicted
Now define the signal. You picked ff = 0.2, i.e. 0.2Hz. Since 0.2 / 0.03333 = 6, you would expect to see your peak exactly in bin index 6 (F[6] == 0.2). To better illustrate what is going on, let's take ff = 0.22 instead. This will bleed the spectrum into the neighboring bins.
ff = 0.22
y = np.sin(2 * np.pi * ff * x)
Now take the FFT:
Y = np.fft.fft(y) / n
maxbin = np.abs(Y).argmax() # 7
maxF = F[maxbin] # 0.23333333: This is the nearest bin
Since your frequency bins are ~0.033Hz wide, the best accuracy you can expect is about half a bin, ~0.017Hz. For your real data, which has much lower resolution, the error is much larger.
Now let's take a look at what happens when you halve the data size. Among other things, the frequency resolution becomes coarser. You now have a maximum frequency of 500Hz spread over 7.5k bins instead of 15k: the resolution drops to 0.066666Hz per bin:
n2 = n // 2 # 15000
F2 = np.linspace(0, 1 - 1 / n2, n2) / T # F2[1] = 0.06666
Y2 = np.fft.fft(y[:n2]) / n2
Take a look what happens to the frequency estimate:
maxbin2 = np.abs(Y2).argmax() # 3
maxF2 = F2[maxbin2] # 0.2: This is the nearest bin
Hopefully, you can see how this applies to your original data. You have a resolution of ~16.1kHz per bin with the full data, and ~32.2kHz per bin with the half data. So your original result is within ~±8kHz of the right peak, while the second one is within ~±16kHz. The true frequency is therefore between 136kHz and 144kHz. Another way to look at it is to compare the bins that you showed me:
full: 128.7 144.8 160.9
half: 96.6 128.7 160.9
When you take out exactly half of the data, you drop every other frequency bin. If your peak was originally closest to 144.8kHz, and you drop that bin, it will end up in either 128.7 or 160.9.
Note: Based on the bin numbers you show, I suspect that your computation of frq is a little off. Notice the 1 - 1/n in my linspace expression. You need that to get the right frequency axis: the last bin is (1 - 1/n) / T, not 1 / T, no matter how you compute it.
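If you prefer not to build the frequency axis by hand, np.fft.fftfreq produces the same bin centres. This is just a sketch using the variables defined above, not part of the original walkthrough; the slice keeps only the non-negative half, which matches F[:n // 2]:

F_alt = np.fft.fftfreq(n, d=T)[:n // 2]
# F_alt[1] == 1 / (n * T) ≈ 0.03333, the bin resolution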
So how to get around this problem? The simplest solution is to do a parabolic fit on the three points around your peak. That is usually a sufficiently good estimator of the true frequency in the data when you are looking for essentially perfect sinusoids.
def peakF(F, Y):
    index = np.abs(Y).argmax()
    # Compute the offset on the normalized domain [-1, 0, 1], not F[index-1:index+2]
    y = np.abs(Y[index - 1:index + 2])
    # Vertex of the parabola through the three points: the offset from zero,
    # which is the scaled offset from F[index]
    vertex = 0.5 * (y[0] - y[2]) / (y[0] - 2 * y[1] + y[2])
    # F[1] is the bin resolution
    return F[index] + vertex * F[1]
In case you are wondering how I got the formula for the parabola: I solved the system with x = [-1, 0, 1] and y = np.abs(Y[index - 1:index + 2]). The matrix equation is
[(-1)^2  -1  1]   [a]   [Y[index - 1]]
[  0^2    0  1] * [b] = [Y[index]    ]
[  1^2    1  1]   [c]   [Y[index + 1]]
Computing the offset using a normalized domain and scaling afterwards is almost always more numerically stable than using whatever huge numbers you have in F[index - 1:index + 2].
You can plug the results from the example into this function to see how well it works:
>>> peakF(F, Y)
0.2261613409657391
>>> peakF(F2, Y2)
0.20401580936430794
As you can see, the parabolic fit gives an improvement, however slight. There is no replacement for just increasing frequency resolution through more samples though!
I've generated a huge amount of random data like so:
ndata = np.random.binomial(1, 0.25, (100000, 1000))
which is a 100,000 by 1000 matrix(!)
I'm generating a new matrix where, for each row, each column is True if the mean of all the columns up to that point (minus the expectation of a Bernoulli RV with p=0.25) is greater than or equal to some epsilon.
like so:
def true_false_inequality(data, eps, data_len):
    return [abs(np.mean(data[:index + 1]) - 0.25) >= eps for index in range(data_len)]
After doing so, I'm generating a 1-d array (finally!) where each column represents how many True values I had in the same column of the matrix, and then I'm dividing every column by some number (exp_number = 100,000):
def final_res(data, eps):
    tf = np.array([true_false_inequality(data[seq], eps, data_len) for seq in range(exp_number)])
    percentage = tf.sum(axis=0) / exp_number
    return percentage
I also have 5 different epsilons which I iterate over to get my final result 5 times.
(epsilons = [0.001, 0.1, 0.5, 0.25, 0.025])
My code does work, but it takes a long while for 100,000 rows by 1000 columns. I know I can make it faster by exploring the numpy functionality a little more, but I just don't know how.
You can perform the whole calculation with vectorized operations on the full data array:
mean = np.cumsum(data, axis=1) / np.arange(1, data.shape[1]+1)
condition = np.abs(mean - 0.25) >= eps
percentage = condition.sum(axis=0) / len(data)
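Since the cumulative mean does not depend on eps, you can compute it once and reuse it for every threshold. A quick sketch assuming the epsilons list from your question:

import numpy as np

ndata = np.random.binomial(1, 0.25, (100000, 1000))
epsilons = [0.001, 0.1, 0.5, 0.25, 0.025]

mean = np.cumsum(ndata, axis=1) / np.arange(1, ndata.shape[1] + 1)
# condition.mean(axis=0) is the same as condition.sum(axis=0) / len(ndata)
percentages = [(np.abs(mean - 0.25) >= eps).mean(axis=0) for eps in epsilons]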
You can calculate the cumulative mean with:
np.cumsum(ndata, axis=0).sum(axis=1) / np.arange(1, 100001)
so we can optimize the true_false_inequality to:
def true_false_inequality(data, eps, data_len):
    cummean = np.cumsum(data, axis=0).sum(axis=1) / np.arange(1, data_len + 1)
    return abs(cummean - 0.25) >= eps
Or, as @a_guest suggests, we can first sum up the elements and then calculate the cumulative sum:
def true_false_inequality(data, eps, data_len):
    cummean = data.sum(axis=1).cumsum(axis=0) / np.arange(1, data_len + 1)
    return abs(cummean - 0.25) >= eps
How can I sample N random values such that the following constraints are satisfied?
the N values add up to 1.0
none of the values is less than 0.01 (or some other threshold T << 1/N)
The following procedure was my first attempt.
import numpy

def proportions(N):
    proportions = list()
    for value in sorted(numpy.random.random(N - 1) * 0.98 + 0.01):
        prop = value - sum(proportions)
        proportions.append(prop)
    prop = 1.0 - sum(proportions)
    proportions.append(prop)
    return proportions
The * 0.98 + 0.01 bit was intended to enforce the ≥ 1% constraint. This works at the margins, but not internally: if two of the random values are less than 0.01 apart, the resulting proportion falls below the threshold and is not caught or corrected. Example:
>>> numpy.random.seed(2000)
>>> proportions(5)
[0.3397481983960182, 0.14892479749759702, 0.07456518420712799, 0.005868759570153426, 0.43089306032910335]
Any suggestions to fix this broken approach or to replace it with a better approach?
You could adapt Mark Dickinson's nice solution:
import random

def proportions(n):
    dividers = sorted(random.sample(range(1, 100), n - 1))
    return [(a - b) / 100 for a, b in zip(dividers + [100], [0] + dividers)]

print(proportions(5))
# [0.13, 0.19, 0.3, 0.34, 0.04]
# or
# [0.31, 0.38, 0.12, 0.05, 0.14]
# etc
Note that this assumes "none of the values is less than 0.01" is a fixed threshold.
UPDATE: We can generalize if we take the reciprocal of the threshold and use that to replace the hard-coded 100 values in the proposed code.
def proportions(N, T=0.01):
    limit = int(1 / T)
    dividers = sorted(random.sample(range(1, limit), N - 1))
    return [(a - b) / limit for a, b in zip(dividers + [limit], [0] + dividers)]
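For example (outputs omitted since they are random):

print(proportions(5))        # parts are multiples of 0.01, each >= 0.01, summing to 1.0
print(proportions(5, 0.05))  # parts are multiples of 0.05, each >= 0.05, summing to 1.0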
What about this?
N/2 times, choose a random number x such that both 1/N + x and 1/N - x fit your constraints, and add 1/N + x and 1/N - x to the result.
If N is odd, also add 1/N.
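A minimal sketch of that pairing idea, assuming the threshold T from the question satisfies T < 1/N (my own illustration, not the original poster's code):

import random

def proportions_paired(N, T=0.01):
    result = []
    for _ in range(N // 2):
        x = random.uniform(0, 1 / N - T)  # keeps both 1/N - x and 1/N + x >= T
        result += [1 / N + x, 1 / N - x]  # each pair sums to exactly 2/N
    if N % 2:
        result.append(1 / N)              # odd N: one leftover value of exactly 1/N
    return result

By construction the values sum to 1.0, although they all stay inside [T, 2/N - T], a narrower spread than the divider approach allows.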
I have a big array of continuous values in the range (-100, 100).
Now for this array I want to calculate the weighted average described here
Since it's continuous, I also want to set breaks for the values every 20, i.e. the values should be discretized as:
-100
-80
-60
....
60
80
100
How can I do this in NumPy or python in general?
EDIT: the difference here from the normal mean is that the mean is calculated according to the frequency of the values.
You actually have 2 different questions.
How to make data discrete, and
How to make a weighted average.
It's usually better to ask 1 question at a time, but anyway.
Given your specification:
xmin = -100
xmax = 100
binsize = 20
First, let's import numpy and make some data:
import numpy as np
data = np.arange(xmin, xmax)
Then let's make the binnings you are looking for:
bins_arange = np.arange(xmin, xmax + 1, binsize)
From this we can convert the data to the discrete form:
counts, edges = np.histogram(data, bins=bins_arange)
Now to calculate the weighted average, we can use the binning middle (e.g. numbers between -100 and -80 will be on average -90):
bin_middles = (edges[:-1] + edges[1:]) / 2
Note that this method does not require the binnings to be evenly "spaced", contrary to the integer division method.
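For example, a hypothetical uneven binning (my own illustration, not from the question) works just the same:

uneven_bins = np.array([-100, -50, 0, 25, 50, 100])
counts_u, edges_u = np.histogram(data, bins=uneven_bins)
middles_u = (edges_u[:-1] + edges_u[1:]) / 2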
Then let's make some weights:
weights = np.array(range(len(counts))) / sum(range(len(counts)))
Then to bring it all together:
average = np.sum(bin_middles * counts * 1) / sum(counts)
weighted_average = np.sum(bin_middles * counts * weights) / sum(counts)
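As a side note, np.average accepts a weights argument, so the plain frequency-weighted mean can also be written more compactly (this gives the same result as the average line above):

average = np.average(bin_middles, weights=counts)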
For the discretization (breaks), here is a method using Python integer division:
import numpy as np
values = np.array([0, 5, 10, 11, 21, 24, 48, 60])
(values // 20) * 20
# or (values / 20).astype(int) * 20 to truncate toward zero
which will print:
array([ 0,  0,  0,  0, 20, 20, 40, 60])
For the weighted mean, if you have another array with the weights for each point, you can use:
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
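If you also need to know which bin each value belongs to (rather than the snapped value), np.digitize is a common option. A short sketch with 20-wide bins (my own addition):

import numpy as np

values = np.array([0, 5, 10, 11, 21, 24, 48, 60])
edges = np.arange(-100, 101, 20)
bin_index = np.digitize(values, edges)  # index i such that edges[i-1] <= value < edges[i]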
I have four columns, namely x,y,z,zcosmo. The range of zcosmo is 0.0<zcosmo<0.5.
For each x,y,z, there is a zcosmo.
When x,y,z are plotted, this is how they look.
I would like to find the volume of this figure. If I slice it into 50 parts (in ascending zcosmo order), so that each part resembles a cylinder, I can add them up to get the final volume.
The volume of the sliced cylinders would be pi*r^2*h, in my case r = z/2 & h = x
The slicing for example would be like,
x,z for 0.0<zcosmo<0.01 find this volume V1. Then x,z for 0.01<zcosmo<0.02 find this volume V2 and so on until zcosmo=0.5
I know how to do this manually (which of course is time consuming) by writing:
r1 = z[np.logical_and(zcosmo>0.0,zcosmo<0.01)] / 2 #gives me z within the range 0.0<zcosmo<0.01
h1 = x[np.logical_and(zcosmo>0.0,zcosmo<0.01)] #gives me x within the range 0.0<zcosmo<0.01
V1 = math.pi*(r1**2)*(h1)
Here r1 and h1 should be r1 = (min(z) + max(z)) / 2.0 and h1 = max(x) - min(x), i.e. built from the max and min values within the slice, so that I get one volume per slice.
How should I write code that calculates the 50 volume slices within the zcosmo ranges?
Use a for loop:
volumes = list()
for index in range(0, 50):
    mask = np.logical_and(zcosmo > index * 0.01, zcosmo < (index + 1) * 0.01)
    r = (z[mask].min() + z[mask].max()) / 2.0  # radius from the min/max z in the slice, per your note
    h = x[mask].max() - x[mask].min()          # height from the x extent of the slice
    volumes.append(math.pi * r**2 * h)
At the end, volumes will be a list containing the volumes of the 50 cylinders.
You can use volume = sum(volumes) to get the final volume of the shape.