Calculate correlation in xarray with missing data - python

I am trying to calculate a correlation between two datasets in xarray along the time dimension. My datasets are both lat x lon x time. One of my datasets has enough data missing that it isn't reasonable to interpolate and eliminate gaps; instead I would like to just ignore missing values. I have some simple bits of code that are working somewhat, but none that fits my exact use case. For example:
def covariance(x,y,dims=None):
    return xr.dot(x-x.mean(dims), y-y.mean(dims), dims=dims) / x.count(dims)
def correlation(x,y,dims=None):
    return covariance(x,y,dims) / (x.std(dims) * y.std(dims))
works well if no data is missing but of course can't work with NaNs. While there is a good example written for xarray here, even with this code I am struggling to calculate Pearson's correlation rather than Spearman's.
import numpy as np
import xarray as xr
import bottleneck
def covariance_gufunc(x, y):
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)
def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))
def spearman_correlation_gufunc(x, y):
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)
def spearman_correlation(x, y, dim):
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])
Finally there was a useful discussion on github of adding this as a feature to xarray but it has yet to be implemented. Is there an efficient way to do this on datasets with data gaps?

I've been following this Github discussion and the subsequent attempts to implement a .corr() method, seems like we're pretty close but it's still not there yet.
In the meantime, the basic code which most are attempting to merge is outlined pretty well in this other answer (How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?). It's a good solution which leverages vectorized operations in NumPy and with some small tweaking (see accepted answer in the link) can be made to account for NaNs along the time axis.
import numpy as np
import xarray as xr
from scipy.stats import t
def lag_linregress_3D(x, y, lagx=0, lagy=0):
    """
    Input: Two xr.DataArrays of any dimensions with the first dim being time.
    Thus the input data could be a 1D time series, or for example, have three
    dimensions (time,lat,lon).
    Datasets can be provided in any order, but note that the regression slope
    and intercept will be calculated for y with respect to x.
    Output: Covariance, correlation, regression slope and intercept, p-value,
    and standard error on regression between the two datasets along their
    aligned time dimension.
    Lag values can be assigned to either of the data, with lagx shifting x, and
    lagy shifting y, with the specified lag amount.
    """
    #1. Ensure that the data are properly aligned to each other.
    x,y = xr.align(x,y)
    #2. Add lag information if any, and shift the data accordingly
    if lagx!=0:
        # If x lags y by 1, x must be shifted 1 step backwards.
        # But as the 'zero-th' value is nonexistent, xr assigns it as invalid
        # (nan). Hence it needs to be dropped
        x = x.shift(time = -lagx).dropna(dim='time')
        # Next important step is to re-align the two datasets so that y adjusts
        # to the changed coordinates of x
        x,y = xr.align(x,y)
    if lagy!=0:
        y = y.shift(time = -lagy).dropna(dim='time')
        x,y = xr.align(x,y)
    #3. Compute data length, mean and standard deviation along time axis:
    n = y.notnull().sum(dim='time')
    xmean = x.mean(axis=0)
    ymean = y.mean(axis=0)
    xstd = x.std(axis=0)
    ystd = y.std(axis=0)
    #4. Compute covariance along time axis
    cov = np.sum((x - xmean)*(y - ymean), axis=0)/(n)
    #5. Compute correlation along time axis
    cor = cov/(xstd*ystd)
    #6. Compute regression slope and intercept:
    slope = cov/(xstd**2)
    intercept = ymean - xmean*slope
    #7. Compute P-value and standard error
    #Compute t-statistics
    tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
    stderr = slope/tstats
    # Two-sided p-value: use |t| so negative correlations do not give p > 1
    pval = t.sf(np.abs(tstats), n-2)*2
    pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)
    return cov,cor,slope,intercept,pval,stderr
Hope this helps! Fingers crossed the merge comes soon for this.

The solution is in the github thread https://github.com/pydata/xarray/issues/1115
def covariance(x, y, dim=None):
    valid_values = x.notnull() & y.notnull()
    valid_count = valid_values.sum(dim)
    demeaned_x = (x - x.mean(dim)).fillna(0)
    demeaned_y = (y - y.mean(dim)).fillna(0)
    return xr.dot(demeaned_x, demeaned_y, dims=dim) / valid_count
def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
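One caveat with the snippet above is that the means and standard deviations are taken over each array's own valid values, not over the time steps where both arrays are valid. A minimal sketch of a strictly pairwise version, written as a NumPy gufunc in the style of the question's pearson_correlation_gufunc, might look like the following. The name nan_pearson_gufunc is hypothetical and not part of xarray. Recent xarray releases also ship xr.corr, which may make a hand-rolled version unnecessary.

```python
import numpy as np

def nan_pearson_gufunc(x, y):
    """Pearson correlation along the last axis, using only positions
    where both x and y are non-NaN (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Mask out every position where either input is missing
    valid = ~(np.isnan(x) | np.isnan(y))
    x = np.where(valid, x, np.nan)
    y = np.where(valid, y, np.nan)
    # NaN-aware statistics now see exactly the same samples in x and y
    dx = x - np.nanmean(x, axis=-1, keepdims=True)
    dy = y - np.nanmean(y, axis=-1, keepdims=True)
    cov = np.nanmean(dx * dy, axis=-1)
    return cov / (np.nanstd(x, axis=-1) * np.nanstd(y, axis=-1))
```

It should be usable with xr.apply_ufunc in the same way as spearman_correlation in the question, i.e. with input_core_dims=[[dim], [dim]]. Pixels with no valid pairs come out as NaN (NumPy emits a RuntimeWarning for them).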

Related

np.random.choice not producing expected histogram

I'm looking to generate random normally distributed numbers between 1 and 0, but as the mean moves closer to 1 or 0, the right or left side respectively becomes "squished".
After modifying the normal distribution and playing around with sliders in geogebra, I came up with the following:
Next I needed to create a method in python which would generate random samples that would be distributed according to this PDF.
Originally I thought the only way to do this was to try and derive a new equation for generating random numbers as seen in the Box-Muller proof (which I got by following along with this tutorial).
However, I thought there might be an easier way to do this by using the numpy library's np.random.choice() method.
After all, I should be able to integrate the PDF at a very small step size and get the various probabilities for said steps (approximately of course).
So with that I wrote the following script:
# Standard libs
import math
# Third party libs
import numpy as np
from alive_progress import alive_bar
from matplotlib import pyplot as plt
class RandomNumberGenerator:
    def __init__(self):
        pass
    def clamped_normal_distribution(self, mu: float,
                                    stddev: float, x: float):
        """ Computes a value from the clamped normal distribution """
        divideByZeroAvoider = 1e-5
        if x < 0 or x > 1:
            return 0
        elif x >= 0 and x <= mu:
            return math.exp(-0.5*( (x - mu) / (stddev) )**2 \
                * (1/(x**2 + divideByZeroAvoider)))
        elif x <= 1 and x > mu:
            return math.exp(-0.5*( (x - mu) / (stddev) )**2 \
                * (1/((1-x)**2 + divideByZeroAvoider)))
        else:
            print("This shouldn't happen!: {}".format(x))
            return 0
if __name__ == '__main__':
    rng = RandomNumberGenerator()
    mu = 0.7
    stddev = 1
    stepSize = 1e-3
    x = np.linspace(stepSize,1, int(1/stepSize) - 1)
    # Determine the total area under the curve
    samples = []
    print("Generating samples...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            samples.append(rng.clamped_normal_distribution(
                mu, stddev, i))
            bar()
    area = np.trapz(samples, dx=stepSize)
    print("Area = {}".format(area))
    # Determine the probability of x falling in a specific interval
    probabilities = []
    print("Generating probabilities...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            lead = rng.clamped_normal_distribution(mu,
                stddev, i)
            lag = rng.clamped_normal_distribution(mu,
                stddev, i - stepSize)
            probability = np.trapz(
                np.array([lag, lead]),
                dx=stepSize)
            # Divide by the area because this isn't a standard normal
            probabilities.append(probability / area)
            bar()
    # Should be approximately 1
    print("Probability: {}".format(sum(probabilities)))
    plt.plot(x, probabilities)
    plt.show()
    y = []
    print("Performing distribution test...")
    testSize = int(10e3)
    with alive_bar(testSize) as bar:
        for _ in range(testSize):
            randSamp = np.random.choice(samples, p=probabilities)
            y.append(randSamp)
            bar()
    plt.hist(y,300)
    plt.show()
The first plot of the probabilities against the linearly spaced samples looks promising, giving me the following graph:
However, if we use these samples as choices with given probabilities, we get the following histogram:
I have no idea why this isn't working correctly.
I've tried other (smaller) examples like the ones listed on the numpy website, and they produce histograms that match the given probabilities array.
I'd really appreciate some advice/intuition if at all possible :).
It looks like there is a problem with the first argument in the call np.random.choice(samples, p=probabilities). The first argument should be x, not samples.
ADDITION BY AUTHOR:
The reason for this is the samples are the values of the curve (i.e. the y-axis and NOT the x-axis).
Thus the values with the highest probabilities (i.e. the samples around the mean) all have a value of ~1, which is why we see such a massive spike around the value 1.
Changing this to x gives us the following graphs (for 10e3 samples):
Working as expected, very nice.
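For reference, the corrected approach can be sketched in vectorized form: evaluate the PDF on a grid, normalize the values so they sum to 1, and draw grid positions (not PDF values) with those probabilities. The function names clamped_pdf and sample_from_grid are made up for this sketch, and using cell midpoints as the grid is an assumption rather than what the original script does.

```python
import numpy as np

def clamped_pdf(x, mu, stddev, eps=1e-5):
    """Unnormalized 'clamped normal' curve from the question, vectorized."""
    x = np.asarray(x, dtype=float)
    # Distance to the nearer boundary on the relevant side of mu
    width = np.where(x <= mu, x, 1.0 - x)
    z = (x - mu) / stddev
    pdf = np.exp(-0.5 * z**2 / (width**2 + eps))
    # Zero outside the unit interval
    return np.where((x < 0) | (x > 1), 0.0, pdf)

def sample_from_grid(mu, stddev, size, step=1e-3, seed=None):
    """Draw samples by choosing grid positions with probability ~ pdf."""
    rng = np.random.default_rng(seed)
    grid = np.arange(step / 2, 1.0, step)   # cell midpoints in (0, 1)
    weights = clamped_pdf(grid, mu, stddev)
    p = weights / weights.sum()             # probabilities must sum to 1
    # The first argument is the x grid, i.e. the values to be returned
    return rng.choice(grid, size=size, p=p)

draws = sample_from_grid(mu=0.7, stddev=1.0, size=10_000, seed=0)
```

Passing size to choice draws all samples in one call, which avoids the Python loop and the progress bar.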

Manually recover the original function from numpy rfft

I have performed a numpy.fft.rfft on a function to obtain the Fourier coefficients. Since the docs do not seem to contain the exact formula used, I have been assuming a formula found in a textbook of mine:
S(x) = a_0/2 + SUM(real(a_n) * cos(nx) + imag(a_n) * sin(nx))
where imag(a_n) is the imaginary part of the n_th element of the Fourier coefficients.
To translate this into python-speak, I have implemented the following:
def fourier(freqs, X):
    # input the fourier frequencies from np.fft.rfft, and arbitrary X
    const_term = np.repeat(np.real(freqs[0])/2, X.shape[0]).reshape(-1,1)
    # this is the "n" part of the inside of the trig terms
    trig_terms = np.tile(np.arange(1,len(freqs)), (X.shape[0],1))
    sin_terms = np.imag(freqs[1:])*np.sin(np.einsum('i,ij->ij', X, trig_terms))
    cos_terms = np.real(freqs[1:])*np.cos(np.einsum('i,ij->ij', X, trig_terms))
    return np.concatenate((const_term, sin_terms, cos_terms), axis=1)
This should give me an [X.shape[0], 2*freqs.shape[0] - 1] array, containing at entry i,j the i_th element of X evaluated at the j_th term of the Fourier decomposition (where the j_th term is a sin term for odd j).
By summing this array over the axis of Fourier terms, I should obtain the function evaluated at the i_th term in X:
import numpy as np
import matplotlib.pyplot as plt
X = np.linspace(-1,1,50)
y = X*(X-0.8)*(X+1)
reconstructed_y = np.sum(fourier(np.fft.rfft(y), X), axis=1)
plt.plot(X,y)
plt.plot(X, reconstructed_y, c='r')
plt.show()
In any case, the red line should be basically on top of the blue line. Something has gone wrong either in my assumptions about what numpy.fft.rfft returns, or in my specific implementation, but I am having a hard time tracking down the bug. Can anyone shed some light on what I've done wrong here?
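As a point of comparison, NumPy documents its forward transform as A_k = sum over m of a_m * exp(-2*pi*i*m*k/n), with no scaling. Under that convention the angle inside the trig terms is 2*pi*k*m/n with m the sample index (not the value of X), the sine term carries a minus sign, non-DC terms count twice, and the whole sum is divided by n. A minimal sketch of a manual inverse under those assumptions follows; reconstruct_from_rfft is a hypothetical name, and np.fft.irfft does the same job directly.

```python
import numpy as np

def reconstruct_from_rfft(coeffs, n):
    """Rebuild n real samples from np.fft.rfft output by summing
    cosine and sine terms explicitly (hypothetical helper)."""
    m = np.arange(n)                 # sample indices 0..n-1
    k = np.arange(len(coeffs))       # frequency indices 0..n//2
    # Each non-DC bin stands for itself and its conjugate mirror,
    # so it counts twice; DC and (for even n) Nyquist count once.
    weights = np.full(len(coeffs), 2.0)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[-1] = 1.0
    angles = 2 * np.pi * np.outer(m, k) / n
    terms = weights * (coeffs.real * np.cos(angles)
                       - coeffs.imag * np.sin(angles))
    return terms.sum(axis=1) / n

X = np.linspace(-1, 1, 50)
y = X * (X - 0.8) * (X + 1)
y_rebuilt = reconstruct_from_rfft(np.fft.rfft(y), len(y))
```

This only recovers the function at the original sample points; evaluating it at arbitrary X would need the sample index replaced by a rescaled position, which is an extra step not shown here.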

Fast way reduce noise of autocorrelation function in python?

I can compute the autocorrelation using numpy's built in functionality:
numpy.correlate(x,x,mode='same')
However the resulting correlation is naturally noisy. I can partition my data, and compute the correlation on each resulting window, then average them all together to compute cleaner autocorrelation, similar to what signal.welch does. Is there a handy function in either numpy or scipy that does this, possibly faster than I would get if I were to compute partition and loop through the data myself?
UPDATE
This is motivated by @kazemakase's answer. I have tried to show what I mean with some code used to generate the figure below.
One can see that @kazemakase is correct that the AC function naturally averages out the noise. However, averaging the AC has the advantage that it is much faster! np.correlate seems to scale as the slow O(n^2) rather than the O(n log n) that I would expect if the correlation were calculated using circular convolution via the FFT...
from statsmodels.tsa.arima_process import arma_generate_sample
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(12345)
arparams = np.array([.75, -.25, 0.2, -0.15])
maparams = np.array([.65, .35])
ar = np.r_[1, -arparams] # add zero-lag and negate
ma = np.r_[1, maparams] # add zero-lag
x = arma_generate_sample(ar, ma, 10000)
def calc_rxx(x):
    x = x-x.mean()
    N = len(x)
    # Integer division so the slice index is an int on Python 3
    Rxx = np.correlate(x,x,mode="same")[N//2::]/N
    #Rxx = np.correlate(x,x,mode="same")[N//2::]/np.arange(N,N//2,-1)
    return Rxx/x.var()
def avg_rxx(x,nperseg=1024):
    rxx_windows = []
    Nw = int(np.floor(len(x)/nperseg))
    print(Nw)
    first = True
    for i in range(Nw-1):
        xw = x[i*nperseg:nperseg*(i+1)]
        y = calc_rxx(xw)
        if i%1 == 0:
            if first:
                plt.semilogx(y,"k",alpha=0.2,label="Short AC")
                first = False
            else:
                plt.semilogx(y,"k",alpha=0.2)
        rxx_windows.append(y)
    print(np.shape(rxx_windows))
    return np.mean(rxx_windows,axis=0)
plt.figure()
r_avg = avg_rxx(x,nperseg=300)
r = calc_rxx(x)
plt.semilogx(r_avg,label="Average AC")
plt.semilogx(r,label="Long AC")
plt.xlabel("Lag")
plt.ylabel("Auto-correlation")
plt.legend()
plt.xlim([0,150])
plt.show()
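On the O(n^2) scaling mentioned in the update: a minimal sketch of an FFT-based autocorrelation, assuming the same mean removal and normalization as calc_rxx, could look like this. The name autocorr_fft is made up here. scipy.signal.correlate with method='fft' is another option.

```python
import numpy as np

def autocorr_fft(x):
    """Normalized autocorrelation for lags 0..n-1 via the FFT,
    in O(n log n) time (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Zero-pad to 2n so the circular correlation computed by the FFT
    # equals the linear correlation for lags 0..n-1
    nfft = 2 * n
    spectrum = np.fft.rfft(x, nfft)
    acov = np.fft.irfft(spectrum * np.conj(spectrum), nfft)[:n] / n
    # acov[0] is the (biased) variance, so this sets lag 0 to exactly 1
    return acov / acov[0]
```

The first half of the result should match calc_rxx up to floating-point error.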
TL;DR: To decrease noise in the autocorrelation function, increase the length of your signal x.
Partitioning the data and averaging like in spectral estimation is an interesting idea. I wish it would work...
The autocorrelation is defined as
Let's say we partition the data into two windows. Their autocorrelations become
Note how they differ only in the limits of the summations. Basically, we split the summation of the autocorrelation into two parts. When we add these back together we are back to the original autocorrelation! So we did not gain anything.
The conclusion is, there is no such thing implemented in numpy/scipy because there is no point in doing so.
Remarks:
I hope it's easy to see that this extends to any number of partitions.
To keep it simple I left the normalization out. If you divide Rxx by n and the partial Rxx by n/2 you get Rxx / n == (Rxx1 * 2/n + Rxx2 * 2/n) / 2, i.e. the mean of the normalized partial autocorrelations is equal to the complete normalized autocorrelation.
To keep it even simpler I assumed the signal x could be indexed beyond the limits of 0 and n-1. In practice, if the signal is stored in an array this is often not possible. In this case there is a small difference between the full and the partial autocorrelations that increases with the lag l. Unfortunately, this is merely a loss of precision and does not reduce noise.
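The splitting argument can be checked numerically. The sketch below assumes the unnormalized form Rxx(l) = sum over i of x[i] * x[i + l], with samples past the end of the array treated as zero; that is one reading of the definition that fits the remarks above, not necessarily the exact formula intended. The helper name partial_acf is hypothetical.

```python
import numpy as np

def partial_acf(x, start, stop, max_lag):
    """Sum x[i] * x[i + l] for i in [start, stop) and l in 0..max_lag.
    Samples beyond the end of x are treated as zero, but the second
    factor may reach into the next partition of the full signal."""
    padded = np.concatenate([x, np.zeros(max_lag)])
    return np.array([
        np.dot(x[start:stop], padded[start + l:stop + l])
        for l in range(max_lag + 1)
    ])

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
n, max_lag = len(x), 50

full = partial_acf(x, 0, n, max_lag)          # whole signal
first = partial_acf(x, 0, n // 2, max_lag)    # first partition
second = partial_acf(x, n // 2, n, max_lag)   # second partition

# The two partial sums add back up to the full autocorrelation
print(np.allclose(full, first + second))
```

When each segment is instead correlated only with itself, as in the practical case described in the last remark, the products that straddle the partition boundary are lost, which is where the small lag-dependent difference comes from.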
Code heretic! I don't believe your evil math!
Of course we can try things out and see:
import matplotlib.pyplot as plt
import numpy as np
n = 2**16
n_segments = 8
x = np.random.randn(n) # data
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
segments = x.reshape(n_segments, -1)
m = segments.shape[1]
rs = []
for y in segments:
    ry = np.correlate(y, y, mode='same') / m # partial ACF
    rs.append(ry)
l2 = np.arange(-m//2, m//2) # lags of partial ACFs
plt.plot(l1, rx, label='full ACF')
plt.plot(l2, np.mean(rs, axis=0), label='partial ACF')
plt.xlim(-m, m)
plt.legend()
plt.show()
Although we used 8 segments to average the ACF, the noise level visually stays the same.
Okay, so that's why it does not work but what is the solution?
Here is the good news: Autocorrelation is already a noise reduction technique! Well, in some way at least: An application of the ACF is to find periodic signals hidden by noise.
Since noise (ideally) has zero mean, its influence diminishes the more elements we sum up. In other words, you can reduce noise in the autocorrelation by using longer signals. (I guess this is probably not true for every type of noise, but should hold for the usual Gaussian white noise and its relatives.)
Behold the noise getting lower with more data samples:
import matplotlib.pyplot as plt
import numpy as np
for n in [2**6, 2**8, 2**12]:
    x = np.random.randn(n)
    rx = np.correlate(x, x, mode='same') / n # ACF
    l1 = np.arange(-n//2, n//2) # Lags
    plt.plot(l1, rx, label='n={}'.format(n))
plt.legend()
plt.xlim(-20, 20)
plt.show()

Python: generating a "curve fit score"

I am working on a project in which I am trying to model the movement of an object in a kymograph. In order to do so, I fit a curve to each line of pixels in an image, and append the location of the vertex to approximately model the location of the object in the image. Below is a sample image.
As you can see, early in the time series (at the top of the image) the position of the object is nicely focused and easily modeled with a Gaussian curve. However, closer to the end of the time series (at the bottom of the image), the peak is much more diffuse. I suspect that the data at the bottom of the image will be fit much more closely by a curve modeling a Poisson distribution (image below, right) while the data at the top/middle of the image will be fit much more closely by a Gaussian or polynomial curve (image below, left).
Is there any way to, for each line of pixels, fit more than one curve to the same data and then score each for a least-squares fit? This way, I could (hopefully) switch models midway through an image to accommodate changing behaviors of the object that I am trying to track. My current code is below:
from PIL import Image
def populateData(picture) :
    """Open an image and populate a list of lists with the grayscale value"""
    im = Image.open(picture)
    size = im.size
    width = size[0]
    height = size[1]
    allPixels = list(im.getdata())
    pixelList = [allPixels[width*i :
                 width * (i+1)] for i in range(height)]
    return(pixelList)
rawData = populateData("testTop.tif")
import numpy as np
from scipy.optimize import curve_fit
def findVertex(listOfRows) :
    xList = []
    for row in listOfRows :
        x = np.arange(len(row))
        ffunc = lambda x, a, x0, s: a*np.exp(-0.5*(x-x0)**2/s**2)
        p, _ = curve_fit(ffunc, x, row, p0=[100,5,2])
        x0 = p[1]
        xList.append(x0)
    xArray = np.array(xList)
    return(xArray)
xValues = findVertex(rawData)
def buildRows(listOfRows) :
    yArray = np.arange(len(listOfRows))
    return(yArray)
yValues = buildRows(rawData)
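from matplotlib import pyplot as plt
# scipy.ndimage.imread has been removed from SciPy; Pillow's "F" mode
# gives the same kind of float grayscale array that flatten=True produced
image = np.asarray(Image.open("testTop.tif").convert("F"))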
fig = plt.figure()
axes = fig.add_subplot(111)
axes.imshow(image)
axes.plot(xValues, yValues, 'k-')
axes.set_title('testLine')
axes.grid()
axes.set_xlabel('x')
axes.set_ylabel('time')
plt.show()
EDIT:
This is the file I used as an input (testTop.tif)
You will need to work out some form of goodness of fit between the fit and your data. One option is the sum of the squared differences between your current fit (a Gaussian) and your data, divided by the variance.
sumerrsq = 0.
for i in range(yValues.shape[0]):
    sumerrsq += np.power(yValues[i] - xValues[i],2)
goodfit = np.sqrt(sumerrsq/var)
I think you can use the second output from curve_fit (the covariance) to get the variance,
p, pcov = curve_fit(ffunc, x, row, p0=[100,5,2])
var = np.diag(pcov)
You can then check the value of goodfit and if it is not sufficient, switch to a different distribution. In using a different distribution, you may need to use a different estimation of error (this assumes the errors are normally distributed).
Note, without the data (and not being sure what array was which) I couldn't test any of this code...
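A minimal sketch of scoring several candidate curves against one row of pixels follows: fit each model with curve_fit and compare the residual sum of squares between the data and the fitted curve. The parabola model, the score_models helper, and the starting guesses are illustrative assumptions, not taken from the question's data.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, x0, s):
    # Same functional form as ffunc in the question
    return a * np.exp(-0.5 * (x - x0) ** 2 / s ** 2)

def parabola(x, a, x0, c):
    # Hypothetical alternative model with its vertex at x0
    return a * (x - x0) ** 2 + c

def score_models(x, row, models):
    """Fit every model to one row and return {name: (rss, params)}.
    `models` maps a name to (function, initial guess)."""
    scores = {}
    for name, (func, p0) in models.items():
        try:
            params, _ = curve_fit(func, x, row, p0=p0, maxfev=5000)
        except RuntimeError:
            # curve_fit raises RuntimeError when the fit does not converge
            scores[name] = (np.inf, None)
            continue
        residuals = row - func(x, *params)
        scores[name] = (float(np.sum(residuals ** 2)), params)
    return scores

# Synthetic row standing in for one line of the kymograph
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
row = gaussian(x, 100.0, 20.0, 3.0) + rng.normal(scale=2.0, size=x.size)

models = {
    "gaussian": (gaussian, [80.0, 18.0, 4.0]),
    "parabola": (parabola, [-1.0, 25.0, 50.0]),
}
scores = score_models(x, row, models)
best = min(scores, key=lambda name: scores[name][0])
```

Both models here have three parameters, so the raw residual sums are directly comparable; for models with different numbers of parameters a penalized score would likely be fairer.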
According to the curve_fit docs:
"To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov))."
So if that's the value you're trying to compare, then you could take that second returned value from curve_fit (the one you are currently assigning to _), use it to calculate perr as above, and compare the error between multiple curves.
I would suggest you work with a 2D fit model. A 1d Gaussian distribution is the basis but the mean and variance depend on position and time. You then would fit the model against the 2d image data.
In case you want to stay with your approach, it looks like it's just the starting value for mean and variance which you need to tweak in order to get a better fit for the lines with large times.
To your question, you can model any score function you want, so you could do something like:
def score(x,y):
    if x < 10:
        return x**2 - y
    else:
        return x - y
So in order to work with two different models in different ranges, follow this example.

KDE in python with different mu, sigma / mapping a function to an array

I have a 2-dimensional array of values that I would like to perform a Gaussian KDE on, with a catch: the points are assumed to have different variances. For that, I have a second 2-dimensional array (with the same shape) that is the variance of the Gaussian to be used for each point. In the simple example,
import numpy as np
data = np.array([[0.4,0.2],[0.1,0.5]])
sigma = np.array([[0.05,0.1],[0.02,0.3]])
there would be four gaussians, the first of which is centered at x=0.4 with σ=0.05. Note: Actual data is much larger than 2x2
I am looking for one of two things:
A Gaussian KDE solver that will allow for bandwidth to change for each point
or
A way to map the results of each Gaussian into a 3-dimensional array, with each Gaussian evaluated across a range of points (say, evaluate each center/σ pair along np.linspace(0,1,101)). In this case, I could e.g. have the KDE value at x=0.5 by taking outarray[:,:,50].
The best way I found to handle this is through array multiplication of a sigma array and a data array. Then, I stack the arrays for each value I want to solve the KDE for.
import numpy as np
# Example inputs from the question
data_array = np.array([[0.4,0.2],[0.1,0.5]])
sigma_array = np.array([[0.05,0.1],[0.02,0.3]])
def solve_gaussian(val,data_array,sigma_array):
    # One (unnormalized) Gaussian per data point, evaluated at val
    return (1. / sigma_array) * np.exp(- (val - data_array) * (val - data_array) / (2 * sigma_array * sigma_array))
def solve_kde(xlist,data_array,sigma_array):
    # Stack one 2-D slice per evaluation point along axis 0
    return np.stack([solve_gaussian(xx,data_array,sigma_array) for xx in xlist], axis=0)
xlist = np.linspace(0,1,101) #Adjust as needed
kde_array = solve_kde(xlist,data_array,sigma_array)
# Sum over all data points to get the KDE value at each x
kde_vector = np.sum(np.sum(kde_array,axis=2),axis=1)
mode_guess = xlist[np.argmax(kde_vector)]
Caveat, for anyone attempting to use this code: the value of the Gaussian is along axis 0, not axis 2 as specified in the original question.
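The same result can also be obtained without a Python loop by letting NumPy broadcast the evaluation grid against the data. This is a sketch only: the name variable_bandwidth_kde is made up, and the 1/sqrt(2*pi) factor and the division by the number of points are added here so that each kernel integrates to 1 over the real line, which the code above does not do.

```python
import numpy as np

def variable_bandwidth_kde(xgrid, data, sigma):
    """Evaluate one Gaussian per data point on xgrid.
    Returns (kernels, density): kernels has shape (len(xgrid),) + data.shape,
    density is the average over all data points at each grid value."""
    # Shape (n_grid, 1, 1) broadcasts against the (rows, cols) data arrays
    x = np.asarray(xgrid, dtype=float)[:, None, None]
    z = (x - data) / sigma
    kernels = np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))
    density = kernels.sum(axis=(1, 2)) / data.size
    return kernels, density

data = np.array([[0.4, 0.2], [0.1, 0.5]])
sigma = np.array([[0.05, 0.1], [0.02, 0.3]])
xgrid = np.linspace(0, 1, 101)

kernels, density = variable_bandwidth_kde(xgrid, data, sigma)
mode_guess = xgrid[np.argmax(density)]
```

As in the looped version, the grid runs along axis 0, so the kernels evaluated at x=0.5 are kernels[50].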
