I am just starting to learn numpy.fft, so apologies in advance.
I have an array of 1000 elements of 1s and 0s, representing 1000 ms of pulsed input (trues and falses). I wanted to perform rfft on this array. For a simple example, I created this array, which has a 1 on every 3rd element and 0 otherwise:
freq = 3
arr3hz = []
for j in range(0, 1000):
    if freq != 0 and (((j + 1) % freq) == 0):
        arr3hz.append(1)
    else:
        arr3hz.append(0)
I was expecting rfft to give me 3Hz somehow, I used this code:
n = len(arr3hz)
d = 1 / 1000
hs = np.fft.rfft(arr3hz)
fs = np.fft.rfftfreq(n, d)
amps = np.absolute(hs)
for j in range(0, len(fs)):
    fw.write("Freq: %d Amp: %f\n" % (fs[j], amps[j]))
In the written file, I just see seemingly random frequency elements with random amplitudes, which I was not able to make sense of. What is wrong with my use of numpy.rfft? I was also not sure what to use for n and d for an array like this.
There are a few things going on here.
The first block gives a period of 3 ms, i.e. a frequency of 333.33 Hz.
The mean is not zero, so there will also be a zero frequency component.
1000 ms is not divisible by 3 ms. The discrete Fourier transform assumes that the entire signal repeats with period equal to the window length (1000 ms). This means that you have 332 intervals of 3 ms and 1 interval of 4 ms. As it is not periodic at 333 Hz, there will be a spread of frequencies.
You were using the correct values for n and d.
In general, I find it more helpful to plot the output than printing out the values.
import matplotlib.pyplot as plt
plt.plot(fs, amps)
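For contrast, here is a small sketch of my own (not from the original post) using a pulse train whose period divides the 1000 ms window, a 1 every 4 ms. Because the signal is truly periodic over the window, the spectrum shows sharp spikes only at 0 Hz (the mean), 250 Hz, and its harmonic at 500 Hz, with no spread in between:
import numpy as np
import matplotlib.pyplot as plt

n = 1000                      # 1000 samples at 1 kHz, i.e. 1 ms per sample
d = 1 / 1000
arr4ms = np.zeros(n)
arr4ms[3::4] = 1              # a pulse every 4 ms; 4 divides 1000, so the window holds whole periods

hs = np.fft.rfft(arr4ms)
fs = np.fft.rfftfreq(n, d)
plt.plot(fs, np.abs(hs))
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude")
plt.show()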
I want to calculate the remaining probabilities for each possible result in a football game at minute n.
In this case I have expected goals of 2.69 for the home team and 1.12 for the away team at minute 70, with a current score of 2-1.
Code
from scipy.stats import poisson
from itertools import product
import numpy as np
import pandas as pd
xgh = 2.69
xga = 1.12
minute = 70
hg, ag = 2,1
phs=[]
pas=[]
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga, k=l, loc=ag)
    pas.append(pa)
prod_table = np.array([(i*j) for i, j in product(phs, pas)])
prod_table.shape = (6, 6)
prob_df = pd.DataFrame(prod_table, index=range(0,6), columns=range(0, 6))
This returns a probability of 2.21% for a 2-1 final result, which seems very low; I would expect a high probability considering there are only 20 minutes left.
Math considerations
The Poisson distribution gives the probability that an event occurs k times in a given time frame, knowing that, on average, it occurs μ times in that same time frame.
The postulate of the Poisson distribution is that events are totally independent, so how many times the event has already occurred is irrelevant. It also assumes they are uniformly distributed in time (if I may use that confusing word, since this is not a uniform distribution).
Most of the time, Poisson is used to compute the probability that k events occur in a time frame T, when we know that μ events occur on average in a time frame τ (the difference from the first sentence being that T and τ are not the same).
But that is the easy part: since events are uniformly distributed, if μ events occur on average in a time frame τ, then μ×T/τ events should occur, on average, in a time frame T (understand: if we were to observe millions of time frames T, then on average there should be μT/τ events in each of them).
So, to compute the probability that the event occurs k times in a time frame T, knowing that it occurs μ times in a time frame τ, you just have to answer the question "how likely is it that the event occurs k times in a time frame in which it occurs μT/τ times on average?". Which is exactly the question Poisson can answer.
In Python, that answer is poisson.pmf(k, μT/τ).
In your case, you know μ, the number of goals expected over a 90-minute time frame. You know that the time left to score is 20 minutes. If 2.69 goals are expected in a time frame of 90 minutes, then 0.5978 goals are expected in a time frame of 20 minutes (at least, Poisson's postulates say that things work that way).
Therefore, the probability for that team to score no further goal in that time frame is poisson.pmf(0, 0.5978). Or, using your keyword style, poisson.pmf(mu=0.5978, k=0). Or, using loc to express the total number of goals, poisson.pmf(mu=0.5978, k=2, loc=2) (but that is purely cosmetic: the loc parameter just replaces k by k-loc).
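A tiny sketch of my own, using the question's numbers, just to show that the three call styles above agree:
from scipy.stats import poisson

mu_remaining = 2.69 * 20 / 90                  # ≈ 0.5978 expected home goals in the last 20 minutes
p1 = poisson.pmf(0, mu_remaining)              # positional style: k, then mu
p2 = poisson.pmf(mu=mu_remaining, k=0)         # keyword style
p3 = poisson.pmf(mu=mu_remaining, k=2, loc=2)  # loc shifts k to k - loc
print(p1, p2, p3)                              # all three ≈ 0.55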
tl;dr solution
So, long story short, you just need to scale down xgh and xga so that they reflect the expected number of goals in the remaining time.
for i, l in zip(range(0, 6), range(0, 6)):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=i, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=l, loc=ag)
    pas.append(pa)
Other comments
zip
While we're at it, and since there is a python tag, a few comments on the code.
for i, l in zip(range(0, 6), range(0, 6)):
    print(i, l)
produces
0 0
1 1
2 2
3 3
4 4
5 5
So it is quite strange not to use a single variable: zip pairs the two ranges element by element, so i and l are always equal here, and it is hard to see a circumstance where you would want, say, i to grow from 0 to 5 while l grows from 0 to 10.
So just
for k in range(0, 6):
    ph = poisson.pmf(mu=xgh*(90-minute)/90, k=k, loc=hg)
    phs.append(ph)
    pa = poisson.pmf(mu=xga*(90-minute)/90, k=k, loc=ag)
    pas.append(pa)
I surmise, especially given the subject of the next remark, that once upon a time there was a product instead of that zip, before you realized it was computing the same pmf several times.
Cross product
That usage of product was then probably reduced to the task of computing phs[i]×pas[j] for all i, j. That is a good usage of product.
But, since you have 2 arrays, and you intend to build a numpy array from those phs[i]×pas[j], let numpy do the job. It will be more efficient at it.
prod_table = np.array(phs).reshape(-1,1)*np.array(pas)
Getting arrays directly from Poisson
Which leads to another optimization. If the goal is to turn phs and pas into arrays so that we can multiply them (one as a column, the other as a row) to get the table, why not let numpy build those arrays directly? As with many numpy functions, pmf accepts a list for k rather than a scalar, and then returns an array rather than a scalar.
So
phs=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg)
pas=poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
So, altogether
prod_table=poisson.pmf(mu=xgh*(90-minute)/90, k=range(6), loc=hg).reshape(-1,1)*poisson.pmf(mu=xga*(90-minute)/90, k=range(6), loc=ag)
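As a quick sanity check (my own addition, using the question's numbers), the entry of that table corresponding to the current score should now be far larger than the 2.21% obtained without scaling. Rows index the home team's final total and columns the away team's, because of the loc shifts:
p_2_1 = prod_table[2, 1]      # probability that the match ends 2-1
print(p_2_1)                  # roughly 0.43 with xgh=2.69, xga=1.12, minute=70, hg=2, ag=1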
Timings
Optimisations    Time in μs
Without          1647 μs
With             329 μs
So it is not just more compact and readable; it is also (almost exactly) 5 times faster.
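The exact timing harness is not shown above, so here is a sketch of how such a comparison might be run, assuming "Without" means the loop + itertools.product version and "With" means the vectorised one-liner:
import timeit
from itertools import product
import numpy as np
from scipy.stats import poisson

xgh, xga, minute, hg, ag = 2.69, 1.12, 70, 2, 1

def without():
    # original approach: python loop plus itertools.product
    phs, pas = [], []
    for k in range(6):
        phs.append(poisson.pmf(mu=xgh * (90 - minute) / 90, k=k, loc=hg))
        pas.append(poisson.pmf(mu=xga * (90 - minute) / 90, k=k, loc=ag))
    return np.array([i * j for i, j in product(phs, pas)]).reshape(6, 6)

def with_numpy():
    # vectorised pmf plus broadcasting
    phs = poisson.pmf(mu=xgh * (90 - minute) / 90, k=range(6), loc=hg)
    pas = poisson.pmf(mu=xga * (90 - minute) / 90, k=range(6), loc=ag)
    return phs.reshape(-1, 1) * pas

print(timeit.timeit(without, number=1000))
print(timeit.timeit(with_numpy, number=1000))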
I have a huge list of numpy arrays (1-dimensional), which are time series for different events. Each point has a label, and I want to window the numpy arrays based on their labels. The labels I have are 0, 1, and 2. Each window has a fixed size M.
The label of each window will be the biggest label present in the window. So if a window contains both 0- and 1-labeled data points, the label of the whole window will be 1.
But the problem is that the windowing is not label agnostic. Because of class imbalance, I want to do overlapped windowing only in the case of labels 1 and 2.
So far I have written this code:
# conditional framing
data = []
start_cursor = 0
while start_cursor < arr.size:
    end_cursor = start_cursor + window_size
    data.append(
        {
            "frame": arr[start_cursor:end_cursor],
            "label": y[start_cursor:end_cursor].max(),
        }
    )
    start_cursor = end_cursor
    if np.any(y[start_cursor, end_cursor] != 0):
        start_cursor = start_cursor - overlap_size
But this is clearly too verbose and just plain inefficient, especially because I will call this while loop on my huge list of separate arrays.
EDIT: to explain the problem more: imagine you are windowing a signal with fixed window length M. If there are only 0-labeled points in the window, there will be no overlap between adjacent windows. But if labels 1 or 2 are present, adjacent windows will overlap by p%.
I think this does what you are asking to do. The visualization for checking isn't great, but it helps you see how the windowing works. Hopefully I understood your question right and this is what you are trying to do. Anytime there is a 1 or 2 in the time series (rather than a 0) the window steps forward some fraction of the full window length (here 50%).
To examine how to do this, start with a sample time series:
import matplotlib.pylab as plt
import numpy as np
N = 5000 # time series length
# create some sort of data set to work with
x = np.zeros(N)
# add a few 1s and 2s to the list (though really they are the same for the windowing)
y = np.random.random(N)
x[y < 0.01] = 1
x[y < 0.005] = 2
# assign a window length
M = 50 # window length
overlap = 0.5 # assume 50% overlap
M_overlap = int(M * (1-overlap))
My approach is to sum the window of interest over your time series. If the sum ==0, there is no overlap between windows and if it is >0 then there is overlap. The question, then, becomes how should we calculate these sums efficiently? I compare two approaches. The first is simply to walk through the time series and the second is to use convolution (which is much faster). For the first one, I also explore different ways of assessing window size after summation.
Summation (slow version)
import timeit

def window_sum1():
    # start of windows in list windows
    windows = [0,]
    while windows[-1] + M < N:
        check = sum(x[windows[-1]:windows[-1]+M]) == 0
        windows.append(windows[-1] + M_overlap + (M - M_overlap) * check)
        if windows[-1] + M > N:
            windows.pop()
            break
    # plotting stuff for checking
    return(windows)

Niter = 10**4
print(timeit.timeit(window_sum1, number = Niter))
# 29.201083058
So this approach went through 10,000 time series of length 5000 in about 30 seconds. But the line windows.append(windows[-1] + M_overlap + (M - M_overlap) * check) can be streamlined into an if statement.
Summation (fast version, 33% faster than slow version)
def window_sum2():
    # start of windows in list windows
    windows = [0,]
    while windows[-1] + M < N:
        check = sum(x[windows[-1]:windows[-1]+M]) == 0
        if check:
            windows.append(windows[-1] + M)
        else:
            windows.append(windows[-1] + M_overlap)
        if windows[-1] + M > N:
            windows.pop()
            break
    # plotting stuff for checking
    return(windows)
print(timeit.timeit(window_sum2, number = Niter))
# 20.456240447000003
We see a 1/3 reduction in time with the if statement.
Convolution (85% faster than fast summation)
We can use signal processing to get a lot faster, by convolving the time series with the window of interest using numpy.convolve. (Disclaimer: I got the idea from the accepted answer to this question.) Of course, it also makes sense to adopt the faster window size assessment from above.
def window_conv():
    a = np.convolve(x,np.ones(M,dtype=int),'valid')
    windows = [0,]
    while windows[-1] + M < N:
        if a[windows[-1]]:
            windows.append(windows[-1] + M_overlap)
        else:
            windows.append(windows[-1] + M)
        if windows[-1] + M > N:
            windows.pop()
            break
    return(windows)
print(timeit.timeit(window_conv, number = Niter))
#3.3695770570000008
Sliding window
The last thing I will add is that, as shown in one of the comments of this question, as of numpy 1.20 there is a function called sliding_window_view. I still have numpy 1.19 running and was not able to test it to see if it's faster than convolution.
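For reference, a sketch of what that could look like, assuming numpy 1.20+ and the same x, M, N and M_overlap as above (untested here, as noted): sliding_window_view gives the same per-window sums as the 'valid' convolution.
from numpy.lib.stride_tricks import sliding_window_view

def window_view():
    # per-window sums, equivalent to np.convolve(x, np.ones(M, dtype=int), 'valid')
    a = sliding_window_view(x, M).sum(axis=1)
    windows = [0,]
    while windows[-1] + M < N:
        if a[windows[-1]]:
            windows.append(windows[-1] + M_overlap)
        else:
            windows.append(windows[-1] + M)
        if windows[-1] + M > N:
            windows.pop()
            break
    return(windows)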
First, I think you should revise this line: if np.any(y[start_cursor, end_cursor] != 0): to if np.any(y[start_cursor:end_cursor] != 0):
Anyway, I think we can improve your code in a few places.
Firstly, you can revise this part :
if np.any(y[start_cursor: end_cursor] != 0):
    start_cursor = start_cursor - overlap_size
Before these lines you have already calculated y[start_cursor:end_cursor].max(), so you already know whether any label bigger than 0 is present. So this is better:
if data[-1]['label'] != 0:
    start_cursor -= overlap_size
Even better, store y[start_cursor:end_cursor].max() in a variable once and use that value both for setting 'label' and for the if check.
Secondly, you used append for data, which is inefficient. It is better to preallocate: the frame size is fixed and you know the maximum number of frames is maxNumFrame = np.ceil((arr.size - overlap_size) / (window_size - overlap_size)). So initialize frames = np.zeros((maxNumFrame, window_size)) up front and fill frames inside the while loop; or, if you prefer your customized structure, initialize the list with placeholder values and overwrite them in the loop.
Thirdly, it is best if the while loop only computes start_cursor and the window label and stores them in an array of tuples or in two arrays (end_cursor is redundant).
After that, build the frames with map, using one of the layouts mentioned above (one array or your customized structure). See the sketch below.
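Putting those suggestions together, here is a rough sketch of the preallocation idea. It assumes every frame is a full window_size long (any short final frame is dropped), and the example inputs are made up purely for illustration:
import numpy as np

# hypothetical example inputs
window_size, overlap_size = 50, 25
arr = np.random.random(1000)
y = np.random.randint(0, 3, size=1000)

max_num_frames = int(np.ceil((arr.size - overlap_size) / (window_size - overlap_size)))
frames = np.zeros((max_num_frames, window_size))   # preallocated instead of appended to
labels = np.zeros(max_num_frames, dtype=int)

n, start = 0, 0
while start + window_size <= arr.size:
    end = start + window_size
    label = y[start:end].max()                     # computed once, reused for the label and the check
    frames[n] = arr[start:end]
    labels[n] = label
    n += 1
    start = end if label == 0 else end - overlap_size

frames, labels = frames[:n], labels[:n]            # trim the unused rows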
I have a large data set, statistic, with statistic.shape = (1E10,) that I want to efficiently bin (sum) into an array of zeros, out = np.zeros(1E10). Each entry in statistic has a corresponding index, idx, which tells me in which out bin it belongs. The indices are not unique, so I cannot use out[idx] += statistic, since that only counts the first time a particular index is encountered. Therefore I'm using np.add.at(out, idx, statistic). My problem is that for very large arrays, np.add.at() returns the wrong answer.
Below is an example script that shows this behaviour. The function check_add() should return 1.
import numpy as np
def check_add(N):
    N = int(N)
    out = np.zeros(N)
    np.add.at(out, np.arange(N), np.ones(N))
    return np.sum(out)/N

n_arr = [1E3, 1E5, 1E8, 1E10]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))
This example returns for me:
N = 1000.0 (log(N) = 3.0); output ratio is 1.0
N = 100000.0 (log(N) = 5.0); output ratio is 1.0
N = 100000000.0 (log(N) = 8.0); output ratio is 1.0
N = 10000000000.0 (log(N) = 10.0); output ratio is 0.1410065408
Can someone explain to me why the function fails for N=1E10?
This is an old bug, NumPy issue 13286. ufunc.at was using a too-small variable for the loop counter. It got fixed a while ago, so update your NumPy. (The fix is present in 1.16.3 and up.)
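If you want to confirm which version you are running before and after upgrading, a trivial check:
import numpy as np
print(np.__version__)   # per the above, the ufunc.at fix is present in 1.16.3 and later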
You're overflowing int32:
1E10 % (np.iinfo(np.int32).max - np.iinfo(np.int32).min + 1) # + 1 for 0
Out[]: 1410065408
There's your weird number (googling that number is actually how I figured this out).
Now, what's happening in your function is a bit more weird. According to the documentation of ufunc.at, you should just be accumulate-adding the 1 values at the indices below np.iinfo(np.int32).max, plus the negative indices above np.iinfo(np.int32).min - but it seems to be 1) working backwards and 2) stopping when it gets to the last overflow. Without digging into the C code I couldn't tell you why, but it's probably a good thing it does: had it worked that way, your function would fail silently, with the "correct" mean, while corrupting your results (putting 2s or 3s at those indices and 0 in the middle).
It is most likely due to integer precision indeed. If you play around with the numpy data type (e.g. constrain it to an unsigned value between 0 and 255 by setting uint8), you will see that the ratios already start declining for the second array. I do not have enough memory to test it, but setting all dtypes to uint64 as below should help:
def check_add(N):
    N = int(N)
    out = np.zeros(N,dtype='uint64')
    np.add.at(out, np.arange(N,dtype='uint64'), 1)
    return np.sum(out)/N
To understand the behavior, I recommend setting dtype='uint8' and checking the behavior for smaller N. What happens is that np.arange creates ascending integers for the vector elements until it reaches the integer limit; it then starts again at 0 and counts up again. So at the beginning (smaller N) you get the correct sum, although your out vector contains a lot of elements >1 in the positions 0:limit and a lot of elements equal to 0 beyond the limit. If, however, you choose N large enough, the elements in the out vector themselves start exceeding the integer limit and wrap around to 0, and as soon as that happens your sum is vastly off. To double-check, note that the uint8 limit is 255 (256 integers) and 256^2 = 65536: set N = 65536 with dtype='uint8' and check_add(65536) will return 0.
import numpy as np
def check_add(N):
    N = int(N)
    out = np.zeros(N,dtype='uint8')
    np.add.at(out, np.arange(N,dtype='uint8'), 1)
    return np.sum(out)/N

n_arr = [1E1, 1E3, 1E5, 65536, 1E7]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))
Also note that you don't need the np.ones vector; you can simply replace it by 1 if all you care about is uniformly incrementing everything by 1.
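For example, the scalar form below behaves the same as passing np.ones(N) but avoids allocating the extra array (a small sketch with a tiny N):
import numpy as np

N = 10
out = np.zeros(N, dtype='uint64')
np.add.at(out, np.arange(N, dtype='uint64'), 1)   # the scalar 1 is broadcast over every index
print(out)                                        # [1 1 1 1 1 1 1 1 1 1]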
Guessing, as I couldn't run it, but could it be a problem that you are exceeding the maximum integer value in Python for the last option? I.e. it exceeds 2147483647.
Use a long integer type instead, as per below.
Referring to: https://docs.python.org/2.0/ref/integers.html
Hope this helps. Please let me know if it does work.
Hi I have written a script that randomly shuffles read sequences over the gene they were mapped to.
This is useful if you want to determine if a peak that you observe over your gene of interest is statistically significant. I use this code to calculate False Discovery Rates for peaks in my gene of interest.
Below the code:
import numpy as np
import matplotlib.pyplot as plt
iterations = 1000 # number of times a read needs to be shuffled
featurelength = 1000 # length of the gene
a = np.zeros((iterations,featurelength)) # create a matrix with 1000 rows of the feature length
b = np.arange(iterations) # a matrix with the number of iterations (0-999)
reads = np.random.randint(10,50,1000) # a random dataset containing an array of DNA read lengths
Below the code to fill the large matrix (a):
for i in reads: # for read with read length i
    r = np.random.randint(-i,featurelength-1,iterations) # generate random read start positions for the read i
    for j in b: # for each row in a:
        pos = r[j] # get the first random start position for that row
        if pos < 0: # start position can be negative because a read does not have to completely overlap with the feature
            a[j][:pos+i]+=1
        else:
            a[j][pos:pos+i]+=1 # add the read to the array and repeat
Then generate a heat map to see if the distribution is roughly even:
plt.imshow(a)
plt.show()
This generates the desired result but it is very slow because of the many for loops.
I tried to do fancy numpy indexing but I constantly get the "too many indices error".
Anybody have a better idea of how to do this?
Fancy indexing is a bit tricky, but still possible:
for i in reads:
    r = np.random.randint(-i,featurelength-1,iterations)
    idx = np.clip(np.arange(i)[:,None]+r, 0, featurelength-1)
    a[b,idx] += 1
To deconstruct this a bit, we're:
Creating a simple index array as a column vector, from 0 to i: np.arange(i)[:,None]
Adding each element from r (a row vector), which broadcasts to make a matrix of size (i,iterations) with the correct offsets into the columns of a.
Clamping the indices to the range [0,featurelength), via np.clip.
Finally, we fancy-index a for each row (b) and the relevant columns (idx).
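To see the shapes involved in those steps, here is a tiny sketch with made-up small numbers:
import numpy as np

iterations, featurelength, i = 4, 10, 3
b = np.arange(iterations)
r = np.array([-1, 0, 5, 8])                          # one random start position per row
idx = np.clip(np.arange(i)[:, None] + r, 0, featurelength - 1)
print(idx.shape)    # (3, 4): i rows of window offsets, one column per iteration
print(idx)          # [[0 0 5 8]
                    #  [0 1 6 9]
                    #  [1 2 7 9]]
a = np.zeros((iterations, featurelength))
a[b, idx] += 1      # b broadcasts against idx, so each row gets its own window filled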
I have scipy and numpy, Python v3.1
I need to create a 1D array of length 3 million, using random numbers between (and including) 100 and 60,000. It has to fit a normal distribution.
Using a = numpy.random.standard_normal(3000000), I get a normal distribution of the required length; I'm not sure how to achieve the required range.
A standard normal distribution has mean 0 and standard deviation 1. What I understand from your requirements is that you need a distribution centred at (60000+100)/2 with a spread on the order of (60000-100)/2. Take each value from the standard_normal() result, multiply it by the new standard deviation, and add the new mean.
I haven't used NumPy, but a quick search of the docs says that you can achieve what you want directly by using numpy.random.normal().
One last tidbit: normal distributions are not bounded. That means there isn't a value with probability zero. Your requirements should be in terms of means and variances (or standard deviations), and not of limits.
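A small sketch of both routes; the exact scale is up to you (the next answer argues for about one sixth of half the range):
import numpy as np

n = 3_000_000
mean = (60000 + 100) / 2          # centre of the desired range
std = (60000 - 100) / 2 / 6       # example choice; see the next answer for the reasoning

a = mean + std * np.random.standard_normal(n)        # shift and scale a standard normal
b = np.random.normal(loc=mean, scale=std, size=n)    # or draw directly with the right parameters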
If you want a truly random normal distribution, you can't guarantee how far the numbers will spread. You can reduce the probability of outliers, however, by specifying the standard deviation:
>>> n = 3000000
>>> sigma5 = 1.0 / 1744278
>>> n * sigma5
1.7199093263803131 # Expect one value in 3 mil outside range at 5 stdev.
>>> sigma6 = 1.0 / 506800000
>>> n * sigma6
0.0059194948697711127 # Expect 0.005 values in 3 mil outside range at 6 stdev.
>>> sigma7 = 1.0 / 390600000000
>>> n * sigma7
7.6804915514592934e-06
Therefore, in this case, ensuring that the standard deviation is only 1/6 or 1/7 of half the range will give you reasonable confidence that your data will not exceed the range.
>>> range = 60000 - 100
>>> spread = (range / 2) / 6 # Anything outside of the range will be six std. dev. from the mean
>>> mean = (60000 + 100) / 2
>>> a = numpy.random.normal(loc = mean, scale = spread, size = n)
>>> min(a)
6320.0238199673404
>>> max(a)
55044.015566089176
Of course, you can still get values that fall outside the range here.
Try this nice little method. You'll want something that just makes one random number at a time:
import random
values = [random.randint(min_value, max_value) for _ in range(num_items)]
This will give you a list with num_items random numbers between min_value and max_value.
Of course, 3000000 is a lot of items to have in memory. Consider making the random numbers as they are needed by the program.
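If memory is a concern, here is a sketch of generating the values lazily instead of building the whole list up front, keeping this answer's randint approach (the helper name is made up):
import random

def random_values(n, low, high):
    # yield n random integers one at a time instead of keeping 3 million in memory
    for _ in range(n):
        yield random.randint(low, high)

for value in random_values(3_000_000, 100, 60_000):
    pass  # process each value here as it is produced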