Faster convolution of probability density functions in Python
Suppose the convolution of a general number of discrete probability density functions needs to be calculated. For the example below there are four distributions which take on values 0,1,2 with the specified probabilities:
import numpy as np
pdfs = np.array([[0.6,0.3,0.1],[0.5,0.4,0.1],[0.3,0.7,0.0],[1.0,0.0,0.0]])
The convolution can be found like this:
pdf = pdfs[0]
for i in range(1,pdfs.shape[0]):
    pdf = np.convolve(pdfs[i], pdf)
The probabilities of seeing 0,1,...,8 are then given by
array([ 0.09 , 0.327, 0.342, 0.182, 0.052, 0.007, 0. , 0. , 0. ])
This part is the bottleneck in my code and it seems there must be something available to vectorize this operation. Does anyone have a suggestion for making it faster?
Alternatively, a solution where you could use
pdf1 = np.array([[0.6,0.3,0.1],[0.5,0.4,0.1]])
pdf2 = np.array([[0.3,0.7,0.0],[1.0,0.0,0.0]])
convolve(pdf1, pdf2)
and get the pairwise convolutions
array([[ 0.18,  0.51,  0.24,  0.07,  0.  ],
       [ 0.5 ,  0.4 ,  0.1 ,  0.  ,  0.  ]])
would also help tremendously.
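For reference, here is a loop-based version of the behaviour I am after (the name convolve_pairwise is just a placeholder); ideally the explicit Python loop would go away:

import numpy as np

def convolve_pairwise(pdf1, pdf2):
    # Placeholder reference: row i of the result is np.convolve(pdf1[i], pdf2[i]).
    out_len = pdf1.shape[1] + pdf2.shape[1] - 1
    out = np.zeros((pdf1.shape[0], out_len))
    for i in range(pdf1.shape[0]):
        out[i] = np.convolve(pdf1[i], pdf2[i])
    return out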
You can compute the convolution of all your PDFs efficiently using fast Fourier transforms (FFTs): the key fact is that the FFT of the convolution is the product of the FFTs of the individual probability density functions. So transform each PDF, multiply the transformed PDFs together, and then perform the inverse transform. You'll need to pad each input PDF with zeros to the appropriate length to avoid effects from wraparound.
This should be reasonably efficient: if you have m PDFs, each containing n entries, then the time to compute the convolution using this method should grow as (m^2)n log(mn). The time is dominated by the FFTs, and we're effectively computing m + 1 independent FFTs (m forward transforms and one inverse transform), each of an array of length no greater than mn. But as always, if you want real timings you should profile.
Here's some code:
import numpy.fft
def convolve_many(arrays):
    """
    Convolve a list of 1d float arrays together, using FFTs.
    The arrays need not have the same length, but each array should
    have length at least 1.
    """
    result_length = 1 + sum((len(array) - 1) for array in arrays)
    # Copy each array into a 2d array of the appropriate shape.
    rows = numpy.zeros((len(arrays), result_length))
    for i, array in enumerate(arrays):
        rows[i, :len(array)] = array
    # Transform, take the product, and do the inverse transform
    # to get the convolution.
    fft_of_rows = numpy.fft.fft(rows)
    fft_of_convolution = fft_of_rows.prod(axis=0)
    convolution = numpy.fft.ifft(fft_of_convolution)
    # Assuming real inputs, the imaginary part of the output can
    # be ignored.
    return convolution.real
Applying this to your example, here's what I get:
>>> convolve_many([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.3, 0.7], [1.0]])
array([ 0.09 , 0.327, 0.342, 0.182, 0.052, 0.007])
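If you do want real timings, a rough harness along these lines could be used (the random test PDFs here are just an illustration; convolve_many is the function defined above):

import timeit
import numpy as np

pdfs = np.random.dirichlet(np.ones(3), size=50)  # 50 random PDFs over the values 0, 1, 2

def loop_convolve(pdfs):
    # The original np.convolve loop, for comparison.
    pdf = pdfs[0]
    for i in range(1, pdfs.shape[0]):
        pdf = np.convolve(pdfs[i], pdf)
    return pdf

print(timeit.timeit(lambda: loop_convolve(pdfs), number=100))
print(timeit.timeit(lambda: convolve_many(list(pdfs)), number=100))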
That's the basic idea. If you want to tweak this, you might also look at numpy.fft.rfft (and its inverse, numpy.fft.irfft), which take advantage of the fact that the input is real to produce more compact transformed arrays. You might also be able to gain some speed by padding the rows array with zeros so that the total number of columns is optimal for performing an FFT. The definition of "optimal" here would depend on the FFT implementation, but powers of two would be good targets, for example. Finally, there are some obvious simplifications that can be made when creating rows if all the input arrays have the same length. But I'll leave these potential enhancements to you.
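As a rough sketch of those tweaks (the helper name convolve_many_rfft and the power-of-two padding rule are my own choices here, not something from the code above), it might look like this:

import numpy as np

def convolve_many_rfft(arrays):
    """
    Variant of convolve_many using rfft/irfft and padding to a
    power-of-two length. Whether the padding actually helps depends
    on the FFT implementation, so it is worth timing both versions.
    """
    result_length = 1 + sum((len(array) - 1) for array in arrays)
    padded_length = 1 << (result_length - 1).bit_length()
    rows = np.zeros((len(arrays), padded_length))
    for i, array in enumerate(arrays):
        rows[i, :len(array)] = array
    fft_of_rows = np.fft.rfft(rows, axis=-1)
    fft_of_convolution = fft_of_rows.prod(axis=0)
    convolution = np.fft.irfft(fft_of_convolution, n=padded_length)
    # Drop the extra padded entries at the end.
    return convolution[:result_length]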
Related
Find the N smallest values in a pair-wise comparison NxN numpy array?
I have a Python NxN numpy pair-wise array (matrix) of double values. Each array element, e.g. (i,j), is a measurement between the i and j item. The diagonal, where i==j, is 1, as it's a pairwise measurement of itself. This also means that the 2D NxN numpy array can be represented in matrix triangular form (one half of the numpy array identical to the other half across the diagonal). A truncated representation:

[[1.         0.11428571 0.04615385 ... 0.13888889 0.07954545 0.05494505]
 [0.11428571 1.         0.09836066 ... 0.06578947 0.09302326 0.07954545]
 [0.04615385 0.09836066 1.         ... 0.07843137 0.09821429 0.11711712]
 ...
 [0.13888889 0.06578947 0.07843137 ... 1.         0.34313725 0.31428571]
 [0.07954545 0.09302326 0.09821429 ... 0.34313725 1.         0.64130435]
 [0.05494505 0.07954545 0.11711712 ... 0.31428571 0.64130435 1.        ]]

I want to get out the smallest N values whilst not including the pairwise values twice, as would be the case due to the pairwise duplication, e.g. (5,6) == (6,5), and I do not want to include any of the identical diagonal values of 1 where i == j.

I understand that numpy has the partition method and I've seen plenty of examples for a flat array, but I'm struggling to find anything straightforward for a pairwise comparison matrix.

EDIT #1

Based on the first response below I implemented:

seventyPercentInt: int = round((populationSizeInt/100)*70)
upperTriangleArray = dataArray[np.triu_indices(len(dataArray), 1)]
seventyPercentArray = upperTriangleArray[np.argpartition(upperTriangleArray, seventyPercentInt)][0:seventyPercentInt]
print(len(np.unique(seventyPercentArray)))

The upperTriangleArray numpy array has 1133265 elements to pick the lowest k from. In this case k is represented by seventyPercentInt, which is around 1054 values. However, when I apply np.argpartition only the value of 0 is returned. The flat array upperTriangleArray is reduced to a shape (1133265,).

SOLUTION

As per the first reply below (the accepted answer), my code that worked:

upperTriangleArray = dataArray[np.triu_indices(len(dataArray), 1)]
seventyPercentInt: int = round((len(upperTriangleArray)/100)*70)
seventyPercentArray = upperTriangleArray[np.argpartition(upperTriangleArray, seventyPercentInt)][0:seventyPercentInt]

I ran into some slight trouble (of my own making) with seventyPercentInt. Rather than taking 70% of the pairwise elements, I took 70% of the elements to be compared. Two very different values.
You can use np.triu_indices to keep only the values of the upper triangle. Then you can use np.argpartition as in the example below.

import numpy as np

A = np.array([[1.0, 0.1, 0.2, 0.3],
              [0.1, 1.0, 0.4, 0.5],
              [0.2, 0.3, 1.0, 0.6],
              [0.3, 0.5, 0.4, 1.0]])

A_upper_triangle = A[np.triu_indices(len(A), 1)]
print(A_upper_triangle)
# returns [0.1 0.2 0.3 0.4 0.5 0.6]

k = 2
print(A_upper_triangle[np.argpartition(A_upper_triangle, k)][0:k])
# returns [0.1 0.2]
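If the (row, column) position of each small value is also needed, the index arrays returned by np.triu_indices can be reused; a small extension of the above (not part of the original answer):

import numpy as np

A = np.array([[1.0, 0.1, 0.2, 0.3],
              [0.1, 1.0, 0.4, 0.5],
              [0.2, 0.3, 1.0, 0.6],
              [0.3, 0.5, 0.4, 1.0]])
rows, cols = np.triu_indices(len(A), 1)
values = A[rows, cols]
k = 2
smallest = np.argpartition(values, k)[:k]
# Pairs (i, j, value) for the k smallest upper-triangle entries (order not guaranteed).
print(list(zip(rows[smallest], cols[smallest], values[smallest])))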
How to clip the real and imaginary parts of elements of a python numpy array of complex numbers
I have a numpy array of complex numbers. I seek to clip the real and imaginary parts of each number in the array to some prescribed minimum and maximum (same clipping applied to both the real and imaginary parts). For example, consider:

import numpy as np

clip_min = -4
clip_max = 3
x = np.array([-1.4 + 5j, -4.7 - 3j])

The desired output of the clipping operation would be:

[-1.4 + 3j, -4 - 3j]

I can achieve this by calling np.clip on the real and imaginary parts of the complex array and then adding them (after multiplying the clipped imaginary data by 1j). Is there a way to do this with one command? np.clip(x, clip_min, clip_max) doesn't yield the desired result.
There is a slightly more efficient way than clipping the real and imaginary parts as separate arrays, using in-place operations:

np.clip(x.real, clip_min, clip_max, out=x.real)
np.clip(x.imag, clip_min, clip_max, out=x.imag)

If these are just Cartesian coordinates stored in complex numbers, you could clip them in a single command by keeping them as floats rather than complex:

x = np.array([[-1.4,  5. ],
              [-4.7, -3. ]])
np.clip(x, clip_min, clip_max)
>>> array([[-1.4,  3. ],
           [-4. , -3. ]])
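If a true single-command version is wanted, one possibility (my own suggestion, relying on the interleaved real/imaginary memory layout of complex128 arrays) is to clip a float view of the array in place:

import numpy as np

clip_min, clip_max = -4, 3
x = np.array([-1.4 + 5j, -4.7 - 3j])

# View the complex array as interleaved (real, imag) float64 pairs and clip in place.
np.clip(x.view(np.float64), clip_min, clip_max, out=x.view(np.float64))
print(x)  # [-1.4+3.j -4. -3.j]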
Markov Clustering in Python
As the title says, I'm trying to get a Markov Clustering Algorithm to work in Python, namely Python 3.7. Unfortunately, it's not doing much of anything, and it's driving me up the wall trying to fix it.

EDIT: First, I've made the adjustments to the main code to make each column sum to 100, even if it's not perfectly balanced. I'm going to try to account for that in the final answer. To be clear, the biggest problem is that the numbers spiral out of control, into such easily-understandable numbers as 5.56268465e-309, and I don't know how to convert that into something understandable.

Here's the code so far:

import numpy as np
import math

## How far you'd like your random-walkers to go (bigger number -> more walking)
EXPANSION_POWER = 2
## How tightly clustered you'd like your final picture to be (bigger number -> more clusters)
INFLATION_POWER = 2

ITERATION_COUNT = 10

def normalize(matrix):
    return matrix/np.sum(matrix, axis=0)

def expand(matrix, power):
    return np.linalg.matrix_power(matrix, power)

def inflate(matrix, power):
    for entry in np.nditer(transition_matrix, op_flags=['readwrite']):
        entry[...] = math.pow(entry, power)
    return matrix

def run(matrix):
    #np.fill_diagonal(matrix, 1)
    #print(matrix)
    matrix = normalize(matrix)
    print(matrix)
    for _ in range(ITERATION_COUNT):
        matrix = normalize(inflate(expand(matrix, EXPANSION_POWER), INFLATION_POWER))
    return matrix

transition_matrix = np.array([[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                              [0.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                              [0.5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                              [0,0,0.34,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                              [0,0,0.33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                              [0,0,0.33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                              [0,0,0,0.34,0,0,0,0,0,0,0,0,0,0,0,0,0.125,0],
                              [0,0,0,0.33,0,0,0.5,0,0,0,0,0,0,0,0,0,0.125,1],
                              [0,0,0,0.33,0,0,0.5,1,1,0,0,0,0,0,0,0,0.125,0],
                              [0,0,0,0,0.166,0,0,0,0,0,0,0,0,0,0,0,0.125,0],
                              [0,0,0,0,0.166,0,0,0,0,0.2,0,0,0,0,0,0,0.125,0],
                              [0,0,0,0,0.167,0,0,0,0,0.2,0.25,0,0,0,0,0,0.125,0],
                              [0,0,0,0,0.167,0,0,0,0,0.2,0.25,0.5,0,0,0,0,0,0],
                              [0,0,0,0,0.167,0,0,0,0,0.2,0.25,0.5,0,1,0,0,0.125,0],
                              [0,0,0,0,0.167,0,0,0,0,0.2,0.25,0,1,0,1,0,0.125,0],
                              [0,0,0,0,0,0.34,0,0,0,0,0,0,0,0,0,0,0,0],
                              [0,0,0,0,0,0.33,0,0,0,0,0,0,0,0,0,0.5,0,0],
                              [0,0,0,0,0,0.33,0,0,0,0,0,0,0,0,0,0.5,0,0]])

run(transition_matrix)
print(transition_matrix)

This is part of a uni assignment - I need to do this array both weighted and unweighted (though the weighted part can just wait until I've got the bloody thing working at all). Any tips or suggestions?
Your transition matrix is not valid.

>>> transition_matrix.sum(axis=0)
>>> matrix([[1.  , 1.  , 0.99, 0.99, 0.96, 0.99, 1.  , 1.  , 0.  , 1.  , 1.  ,
         1.  , 1.  , 0.  , 0.  , 1.  , 0.88, 1.  ]])

Not only do some of your columns not sum to 1, some of them sum to 0. This means that when you try to normalize your matrix, you will end up with nan because you are dividing by 0.

Lastly, is there a reason why you are using a Numpy matrix instead of just a Numpy array, which is the recommended container for such data? Using Numpy arrays will simplify some of the operations, such as raising each entry to a power. Also, there are some differences between Numpy matrix and Numpy array which can result in subtle bugs.
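As a quick way to see which columns are the problem, something along these lines could be used (a small helper of my own, not from the original post):

import numpy as np

def bad_columns(matrix, tol=1e-2):
    # Return the indices of columns whose sums deviate from 1 by more than tol.
    col_sums = np.asarray(matrix).sum(axis=0)
    return np.where(np.abs(col_sums - 1.0) > tol)[0]

# Small example: the third column sums to 0, so normalizing it would divide by zero.
m = np.array([[0.5, 1.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
print(bad_columns(m))  # [2]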
SAS Proc Corr with Weighting in Python
I have a SAS script that uses the "proc corr" procedure, along with weighting, in order to create a weighted correlation matrix. I am now trying to reproduce this function in Python, but I haven't found a good way of including the weighting in the output matrix.

While looking for a solution, I've found a few scripts and functions that calculate weighted correlation coefficients for two columns/variables (examples here) using a weights array, but I am trying to create a weighted correlation matrix with many more variables. I've tried using these functions by looping through variable combinations, but they run orders of magnitude slower than the SAS procedure.

I was wondering if there was an efficient way to create a weighted correlation matrix in Python that works similarly to the SAS code, or at least returns equivalent results without looping through all variable combinations.
numpy's covariance function takes two different kinds of weight parameters - I don't have SAS to check against, but it is likely a similar approach.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html#numpy.cov

Once you have a covariance matrix, it can be converted to a correlation matrix using the formula described here:

https://en.wikipedia.org/wiki/Covariance_matrix#Correlation_matrix

Complete example:

import numpy as np

x = np.array([1., 1.1, 1.2, 0.9])
y = np.array([2., 2.05, 2.02, 2.8])

np.cov(x, y)
Out[49]:
array([[ 0.01666667, -0.03816667],
       [-0.03816667,  0.151225  ]])

cov = np.cov(x, y, fweights=[10, 1, 1, 1])
cov
Out[51]:
array([[ 0.00474359, -0.00703205],
       [-0.00703205,  0.04872308]])

def cov_to_corr(cov):
    """ based on https://en.wikipedia.org/wiki/Covariance_matrix#Correlation_matrix """
    D = np.sqrt(np.diag(np.diag(cov)))
    Dinv = np.linalg.inv(D)
    return Dinv @ cov @ Dinv  # requires python3.5+, use np.dot otherwise

cov_to_corr(cov)
Out[53]:
array([[ 1.        , -0.46255259],
       [-0.46255259,  1.        ]])
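To build a full weighted correlation matrix for many variables at once, the same idea extends naturally. A sketch using np.cov's aweights parameter (whether this matches the exact weighting used by SAS's proc corr would need to be checked against SAS output):

import numpy as np

def weighted_corr_matrix(data, weights):
    # data: (n_observations, n_variables); weights: one weight per observation.
    cov = np.cov(data, rowvar=False, aweights=weights)
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))           # 100 observations, 5 variables
weights = rng.uniform(0.5, 2.0, size=100)  # hypothetical observation weights
print(weighted_corr_matrix(data, weights).shape)  # (5, 5)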
Need explanation how specgram function work in python (matplotlib - MATLAB compatible functions)
I'm working on converting my code from Python to Objective-C. Inside the matplotlib.mlab.specgram function I see 3 important functions before the FFT:

result = stride_windows(x, NFFT, noverlap, axis=0)
result = detrend(result, detrend_func, axis=0)
result, windowVals = apply_window(result, window, axis=0, return_window=True)
result = np.fft.fft(result, n=pad_to, axis=0)[:numFreqs, :]

I tried to debug to understand the purpose of each. For example, I have this array of input:

x = [1,2,3,4,5,6,7,8,9,10,11,12]

After the first function, stride_windows (is this one to prevent leakage?), if NFFT = 4 and noverlap = 2 then:

x = [
 [1,3,5,7,9],
 [2,4,6,8,10],
 [3,5,7,9,11],
 [4,6,8,10,12]
]

After detrend nothing changes (I understand the use of detrend before the FFT).

Inside apply_window (I don't understand this step):

xshape = list(x.shape)
xshapetarg = xshape.pop(axis)  # = 4
windowVals = window(np.ones(xshapetarg, dtype=x.dtype))  # result: 4 elements [0.0, 0.75, 0.75, 0.0]
xshapeother = xshape.pop()  # = 5
otheraxis = (axis+1) % 2  # = 1
windowValsRep = stride_repeat(windowVals, xshapeother, axis=otheraxis)

# result:
windowValsRep = [
 [0.  , 0.  , 0.  , 0.  , 0.  ],
 [0.75, 0.75, 0.75, 0.75, 0.75],
 [0.75, 0.75, 0.75, 0.75, 0.75],
 [0.  , 0.  , 0.  , 0.  , 0.  ]
]

then multiply it with x:

windowValsRep * x

Now x = [
 [0.  , 0.  , 0.  , 0.  , 0.  ],
 [1.5 , 3.  , 4.5 , 6.  , 7.5 ],
 [2.25, 3.75, 5.25, 6.75, 8.25],
 [0.  , 0.  , 0.  , 0.  , 0.  ]
]

And then the final step is the FFT. As far as I know, the FFT only needs a single array, but here it processes a 2-dimensional array. Why?

result = np.fft.fft(x, n=pad_to, axis=0)[:numFreqs, :]

Could anyone explain for me, step by step, why the data needs to be processed like this before the FFT? Thanks,
Spectrograms and FFTs are not the same thing. The purpose of a spectrogram is to take the FFT of small, equal-sized time chunks. This produces a 2D representation where the X axis is the start time of each time chunk, the Y axis is frequency, and the values are the energy (or power, etc.) in each frequency for that time chunk. This allows you to see how the frequency components change over time.

This is explained in the documentation for the specgram function:

Data are split into NFFT length segments and the spectrum of each section is computed. The windowing function window is applied to each segment, and the amount of overlap of each segment is specified with noverlap.

As for the individual functions, a lot of what you are asking is described in the documentation for each function, but I will try to explain in a bit more detail.

The purpose of stride_windows, as described in the documentation, is to convert the 1D array of data into a 2D array of successive time chunks. These are the time chunks that will have their FFT calculated in the final spectrogram. In your case they are length-4 (NFFT=4) time chunks (notice the 4 elements per column). Because you set noverlap=2, the last 2 elements of each column are the same as the first 2 elements of the next column (that is what the overlap means). It is called "stride" because it uses a trick regarding the internal storage of numpy arrays to create an array with the overlapping windows without taking any additional memory.

The detrend function, as its name implies and as is described in its documentation, removes the trend from a signal. By default it uses the mean, which, as the detrend_mean documentation describes, removes the mean (DC offset) of the signal.

The apply_window function does exactly what its name implies, and what its documentation says: it applies a window function to each of the time chunks. This is needed because suddenly cutting off the signal at the beginning and end of each time chunk causes large bursts of broadband energy called transients that will mess up the spectrogram. Windowing the signal reduces those transients. By default the specgram function uses the Hanning window, which attenuates the beginning and end of each time chunk.

The FFT isn't really 2D. The numpy FFT function allows you to specify an axis to take an FFT over. So in this case, we have a 2D array, and we take the FFT of each column of that array. It is much cleaner and a little faster to do this in one step rather than manually looping over each column.
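To make the sequence of steps concrete, here is a bare-bones sketch of the same pipeline (mean detrend, Hanning window, FFT of each column) in plain numpy. It is a simplification for illustration, not the exact matplotlib implementation:

import numpy as np

def simple_specgram(x, nfft=4, noverlap=2):
    step = nfft - noverlap
    n_chunks = (len(x) - noverlap) // step
    # stride_windows equivalent: one length-nfft chunk per column, overlapping by noverlap.
    chunks = np.column_stack([x[i * step : i * step + nfft] for i in range(n_chunks)])
    chunks = chunks - chunks.mean(axis=0)        # detrend_mean: remove the DC offset
    chunks = chunks * np.hanning(nfft)[:, None]  # apply_window: taper each chunk
    return np.fft.rfft(chunks, axis=0)           # FFT of each column (each time chunk)

x = np.arange(1.0, 13.0)
print(simple_specgram(x).shape)  # (3, 5): nfft//2 + 1 frequencies, 5 time chunks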