Resample a numpy array - python

It's easy to resample an array like
a = numpy.array([1,2,3,4,5,6,7,8,9,10])
with an integer resampling factor. For instance, with a factor of 2:
b = a[::2] # [1 3 5 7 9]
But with a non-integer resampling factor, it doesn't work so easily:
c = a[::1.5] # [1 2 3 4 5 6 7 8 9 10] in old NumPy (recent versions raise a TypeError) => not what is needed...
It should be (with linear interpolation):
[1 2.5 4 5.5 7 8.5 10]
or (by taking the nearest neighbour in the array):
[1 3 4 6 7 9 10]
How to resample a numpy array with a non-integer resampling factor?
Example of application: audio signal resampling / repitching

NumPy has numpy.interp which does linear interpolation:
In [1]: np.interp(np.arange(0, len(a), 1.5), np.arange(len(a)), a)
Out[1]: array([ 1. , 2.5, 4. , 5.5, 7. , 8.5, 10. ])
SciPy has scipy.interpolate.interp1d which can do linear and nearest interpolation (though which point is nearest might not be obvious):
In [2]: from scipy.interpolate import interp1d
In [3]: xp = np.arange(0, len(a), 1.5)
In [4]: lin = interp1d(np.arange(len(a)), a)
In [5]: lin(xp)
Out[5]: array([ 1. , 2.5, 4. , 5.5, 7. , 8.5, 10. ])
In [6]: nearest = interp1d(np.arange(len(a)), a, kind='nearest')
In [7]: nearest(xp)
Out[7]: array([ 1., 2., 4., 5., 7., 8., 10.])
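(A side note, not in the original answer: 'nearest' resolves the half-way positions by rounding down; SciPy >= 1.6 also offers kind='nearest-up', which rounds them up and happens to reproduce the nearest-neighbour output the question asked for.)
In [8]: nearest_up = interp1d(np.arange(len(a)), a, kind='nearest-up')  # SciPy >= 1.6
In [9]: nearest_up(xp)
Out[9]: array([ 1., 3., 4., 6., 7., 9., 10.])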

As scipy.signal.resample can be very slow, I searched for other algorithms adapted for audio.
It seems that Erik de Castro Lopo's SRC (a.k.a. Secret Rabbit Code a.k.a. libsamplerate) is one of the best resampling algorithms available.
It is used by scikits.samplerate, but that library seems to be complicated to install (I gave up on Windows).
Fortunately, there is an easy-to-use and easy-to-install Python wrapper for libsamplerate, made by Tino Wagner: https://pypi.org/project/samplerate/. Installation with pip install samplerate. Usage:
import samplerate
from scipy.io import wavfile
sr, x = wavfile.read('input.wav') # 48 kHz file
y = samplerate.resample(x, 44100 * 1.0 / 48000, 'sinc_best')
Interesting reading / comparison of many resampling solutions:
http://signalsprocessed.blogspot.com/2016/08/audio-resampling-in-python.html
Addendum: comparison of spectrograms of a resampled frequency sweep (20 Hz to 20 kHz): 1) original, 2) resampled with libsamplerate / the samplerate module, 3) resampled with numpy.interp ("one-dimensional linear interpolation"). (Spectrogram images not reproduced here.)
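For reference, a minimal sketch of how such a spectrogram comparison can be generated (the 4-second sweep, the 48 kHz -> 44.1 kHz ratio and the plotting details are illustrative assumptions, not the original test setup):
import numpy as np
import matplotlib.pyplot as plt
import samplerate
from scipy.signal import chirp, spectrogram

fs = 48000
t = np.arange(0, 4.0, 1.0 / fs)
sweep = chirp(t, f0=20, f1=20000, t1=4.0)                  # 20 Hz -> 20 kHz sweep

ratio = 44100.0 / 48000.0
y_src = samplerate.resample(sweep, ratio, 'sinc_best')     # libsamplerate
y_lin = np.interp(np.arange(0, len(sweep), 1.0 / ratio),   # linear interpolation
                  np.arange(len(sweep)), sweep)

for sig, sr, title in [(sweep, fs, 'original'),
                       (y_src, 44100, 'libsamplerate'),
                       (y_lin, 44100, 'numpy.interp')]:
    f, tt, Sxx = spectrogram(sig, sr)
    plt.figure()
    plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))      # dB scale
    plt.title(title)
plt.show()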

Since you mention this being data from an audio .WAV file, you might look at scipy.signal.resample.
Resample x to num samples using Fourier method along the given axis.
The resampled signal starts at the same value as x but is sampled
with a spacing of len(x) / num * (spacing of x). Because a
Fourier method is used, the signal is assumed to be periodic.
Your linear array a is not a good one to test this on, since it isn't periodic in appearance. But consider sin data:
import numpy as np
from scipy import signal

x = np.arange(10)
y = np.sin(x)
y1, x1 = signal.resample(y, 15, x)  # 10 points resampled to 15
Compare these with either
y1 - np.sin(x1)  # residual against the true curve, or
plt.plot(x, y, x1, y1)  # assuming matplotlib.pyplot imported as plt

In signal processing, you can think of resampling as rescaling the array and interpolating the missing values, or the values at non-integer indices, using nearest, linear, cubic, etc. methods.
Using scipy.interpolate.interp1d, you can achieve one-dimensional resampling with the following function:
import numpy as np
from scipy.interpolate import interp1d

def resample(x, factor, kind='linear'):
    n = int(np.ceil(x.size / factor))
    f = interp1d(np.linspace(0, 1, x.size), x, kind)
    return f(np.linspace(0, 1, n))
e.g.:
a = np.array([1,2,3,4,5,6,7,8,9,10])
resample(a, factor=1.5, kind='linear')
yields
array([ 1. , 2.5, 4. , 5.5, 7. , 8.5, 10. ])
and
a = np.array([1,2,3,4,5,6,7,8,9,10])
resample(a, factor=1.5, kind='nearest')
yields
array([ 1., 2., 4., 5., 7., 8., 10.])

And if you want the integer (nearest-index) sampling:
a = numpy.array([1,2,3,4,5,6,7,8,9,10])
factor = 1.5
x = numpy.round(numpy.arange(0, len(a), factor)).astype(int)  # integer indices (map() returns an iterator in Python 3)
sampled = a[x]


Empirical CDF function in python with reasonable NaN behavior

I'm looking to compute the ECDF and am using this statsmodels function:
from statsmodels.distributions.empirical_distribution import ECDF
Looks good at first:
ECDF(np.array([0,1,2,3, 3, 3]))(np.array([0,1,2,3, 3,3]))
array([0.16666667, 0.33333333, 0.5, 1., 1., 1.])
However, nan seems to be treated as infinity:
>>> x = np.array([0,1,2,3, np.nan, np.nan])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5, 0.66666667, 1., 1.])
Same as with infinity:
>>> x = np.array([0,1,2,3, np.inf, np.inf])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5, 0.66666667, 1., 1.])
Comparing with R:
> x <- c(0,1,2,3,NA,NA)
> x
[1] 0 1 2 3 NA NA
> ecdf(x)(x)
[1] 0.25 0.50 0.75 1.00 NA NA
What's the standard python function for ecdf that is nan aware?
Hot-wiring like so does not seem to work:
def ecdf(x):
    return np.where(~np.isfinite(x),
                    np.full_like(x, np.nan),
                    ECDF(x[np.isfinite(x)])(x[np.isfinite(x)]))

>>> ecdf(x)
  File "<__array_function__ internals>", line 6, in where
ValueError: operands could not be broadcast together with shapes (7,) (7,) (4,)
The source code of statsmodel's ECDF is pleasantly brief (after stripping comments):
class ECDF(StepFunction):
    def __init__(self, x, side='right'):
        x = np.array(x, copy=True)
        x.sort()
        nobs = len(x)
        y = np.linspace(1./nobs, 1, nobs)
        super(ECDF, self).__init__(x, y, side=side, sorted=True)
Sorting the input samples via x.sort() will move all the np.nan valued elements to the end, even after np.inf, which is why they appear to be treated as infinity:
bar=np.array([1, np.nan, 2, np.inf, 3])
bar.sort()
# bar is now array([ 1., 2., 3., inf, nan])
The reason np.nan isn't propagated is because ECDF's parent class uses np.searchsorted to find the correct index and then looks it up in y. For np.nan this is simply the last element of the array and a subsequent lookup of self.y will return 1 for this case.
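A quick way to convince yourself (a small check, not from the original post): np.nan sorts after every finite value and even after np.inf, so np.searchsorted returns the last index and the subsequent y lookup yields 1.
import numpy as np

xs = np.sort(np.array([1., 2., np.inf, np.nan]))    # -> array([ 1.,  2., inf, nan])
print(np.searchsorted(xs, np.nan, side='right'))    # 4: past the last element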
You can make it propagate np.nan with a simple change, which you can realize as a subclass or sibling.
from statsmodels.distributions.empirical_distribution import StepFunction
import numpy as np

class MyECDF(StepFunction):
    def __init__(self, x, side='right'):
        x = np.sort(x)
        # count number of non-nan's instead of length
        nobs = np.count_nonzero(~np.isnan(x))
        # fill the y values corresponding to np.nan with np.nan
        y = np.full_like(x, np.nan)
        y[:nobs] = np.linspace(1./nobs, 1, nobs)
        super(MyECDF, self).__init__(x, y, side=side, sorted=True)
This small change will make the function behave in a way similar to R:
>>> from foobar import MyECDF
>>> from statsmodels.distributions.empirical_distribution import ECDF
>>> import numpy as np
>>> x = np.array([0,1,2,3, np.nan, np.nan])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5, 0.66666667, 1., 1.])
>>> MyECDF(x)(x)
array([0.25, 0.5 , 0.75, 1. , nan, nan])
You can use a masked array:
import numpy.ma as ma

def ecdf(x):
    masked = ma.array(x, mask=np.isnan(x))
    return np.where(np.isnan(x),
                    np.full_like(x, np.nan),
                    ECDF(masked.compressed(), "right")(masked))
>>> ecdf(x)
array([0.25, 0.5 , 0.75, 1. , nan, nan])
Matches what R does natively.

Distance between 2 points in 3D for a big array

I have an array n×m, where n = 217000 and m = 3 (some data from a telescope).
I need to calculate the distances between 2 points in 3D (according to my x, y, z coordinates in columns).
When I try to use sklearn tools the result is:
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
What tool can I use in this situation, and what is the maximum possible size for these tools?
What tool can I use in this situation...?
You could implement the Euclidean distance function on your own, using the approach suggested by @Saksow. Assuming that a and b are one-dimensional NumPy arrays, you could also use any of the methods proposed in this thread:
import numpy as np
np.linalg.norm(a-b)
np.sqrt(np.sum((a-b)**2))
np.sqrt(np.dot(a-b, a-b))
If you wish to compute in one go the pairwise distance (not necessarily the Euclidean distance) between all the points in your array, the module scipy.spatial.distance is your friend.
Demo:
In [79]: from scipy.spatial.distance import squareform, pdist
In [80]: arr = np.asarray([[0, 0, 0],
...: [1, 0, 0],
...: [0, 2, 0],
...: [0, 0, 3]], dtype='float')
...:
In [81]: squareform(pdist(arr, 'euclidean'))
Out[81]:
array([[ 0. , 1. , 2. , 3. ],
[ 1. , 0. , 2.23606798, 3.16227766],
[ 2. , 2.23606798, 0. , 3.60555128],
[ 3. , 3.16227766, 3.60555128, 0. ]])
In [82]: squareform(pdist(arr, 'cityblock'))
Out[82]:
array([[ 0., 1., 2., 3.],
[ 1., 0., 3., 4.],
[ 2., 3., 0., 5.],
[ 3., 4., 5., 0.]])
Notice that the number of points in the mock data array used in this toy example is 4, and the resulting pairwise distance array has 4² = 16 elements.
...and what is the maximum possible size for these tools?
If you try to apply the approach above using your data (n = 217000), you get an error:
In [105]: data = np.random.random(size=(217000, 3))
In [106]: squareform(pdist(data, 'euclidean'))
Traceback (most recent call last):
File "<ipython-input-106-fd273331a6fe>", line 1, in <module>
squareform(pdist(data, 'euclidean'))
File "C:\Users\CPU 2353\Anaconda2\lib\site-packages\scipy\spatial\distance.py", line 1220, in pdist
dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError
The issue is that you are running out of RAM. To hold such a distance matrix you would need more than 350 GB! The required amount of memory results from multiplying the number of elements of the distance matrix (217000²) by the number of bytes of each element of that matrix (8), and dividing this product by the appropriate factor (1024³) to express the result in gigabytes:
In [107]: round(data.shape[0]**2 * data.dtype.itemsize / 1024.**3, 1)
Out[107]: 350.8
So the maximum allowed size for your data is determined by the amount of available RAM (take a look at this thread for further details).
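If you do not actually need the full n × n matrix in memory at once, one workaround (a sketch under the assumption that a per-point reduction such as the nearest-neighbour distance is what you are after; the chunk size is arbitrary) is to process the distance matrix in blocks with scipy.spatial.distance.cdist:
import numpy as np
from scipy.spatial.distance import cdist

data = np.random.random(size=(217000, 3))
chunk = 100                      # rows per block; tune to the available RAM

nearest = np.empty(data.shape[0])
for start in range(0, data.shape[0], chunk):
    block = cdist(data[start:start + chunk], data)   # (chunk, n) distances
    # after partitioning, position 0 is the self-distance 0 and
    # position 1 is the distance to the nearest other point
    nearest[start:start + chunk] = np.partition(block, 1, axis=1)[:, 1]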
Using only Python and the Euclidean distance formula for 3 dimensions:
import math
distance = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)
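The same formula also vectorizes over the whole array without ever building an n × n matrix; a minimal sketch, where the reference point p is an illustrative assumption:
import numpy as np

arr = np.random.random(size=(217000, 3))
p = np.array([0.0, 0.0, 0.0])            # some reference point

d = np.linalg.norm(arr - p, axis=1)      # one distance per row, O(n) memory
# equivalently: np.sqrt(((arr - p) ** 2).sum(axis=1))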

Python - Binning x,y,z values on a 2D grid

I have a list of z points associated to pairs x,y, meaning that for example
x y z
3.1 5.2 1.3
4.2 2.3 9.3
5.6 9.8 3.5
and so on. The total number of z values is relatively high, around 10000.
I would like to bin my data, in the following sense:
1) I would like to split the x and y values into cells, so as to make a 2-dimensional grid in x, y. If I have Nx cells for the x axis and Ny for the y axis, I would then have Nx*Ny cells on the grid. For example, the first bin for x could range from 1. to 2., the second from 2. to 3., and so on.
2) For each of these cells in the two-dimensional grid, I would then need to calculate how many points fall into it and sum all their z values. This gives me a numerical value associated to each cell.
I thought about using binned_statistic from scipy.stats, but I would have no idea on how to set the options to accomplish my task. Any suggestions? Also other tools, other than binned_statistic, are well accepted.
Assuming I understand you correctly, you can get what you need by exploiting the expand_binnumbers parameter of binned_statistic_2d, as follows.
from scipy.stats import binned_statistic_2d
import numpy as np

x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
z = [2., 3., 5., 7.]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = binned_statistic_2d(x, y, None, 'count', bins=[binx, biny],
                          expand_binnumbers=True)
print(ret.statistic)
print(ret.binnumber)
sums = np.zeros((len(binx) - 1, len(biny) - 1))
for i in range(len(x)):
    m = ret.binnumber[0][i] - 1
    n = ret.binnumber[1][i] - 1
    sums[m][n] += z[i]
print(sums)
This is just an expansion of one of the examples. Here's the output.
[[ 2. 1.]
[ 1. 0.]]
[[1 1 1 2]
[1 2 1 1]]
[[ 7. 3.]
[ 7. 0.]]
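It may be worth noting that binned_statistic_2d can also compute the per-cell sums directly, without the manual loop, by passing z as the values and 'sum' as the statistic:
from scipy.stats import binned_statistic_2d  # x, y, z, binx, biny as above

ret_sum = binned_statistic_2d(x, y, z, 'sum', bins=[binx, biny])
print(ret_sum.statistic)
# [[7. 3.]
#  [7. 0.]]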
Establish the edges of the cells, iterate over the cell edges, and use boolean indexing to extract the z values in each cell; keep the sums in a list, then convert the list and reshape it.
import itertools
import numpy as np

x = np.array([0.1, 0.1, 0.1, 0.6, 1.2, 2.1])
y = np.array([2.1, 2.6, 2.1, 2.1, 3.4, 4.7])
z = np.array([2., 3., 5., 7., 10, 20])

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)  # itertools.izip in Python 2

minx, maxx = int(min(x)), int(max(x)) + 1
miny, maxy = int(min(y)), int(max(y)) + 1
result = []
x_edges = pairwise(range(minx, maxx + 1))  # xrange in Python 2
for xleft, xright in x_edges:
    xmask = np.logical_and(x >= xleft, x < xright)
    y_edges = pairwise(range(miny, maxy + 1))
    for yleft, yright in y_edges:
        ymask = np.logical_and(y >= yleft, y < yright)
        cell = z[np.logical_and(xmask, ymask)]
        result.append(cell.sum())
result = np.array(result).reshape((maxx - minx, maxy - miny))
>>> result
array([[ 17., 0., 0.],
[ 0., 10., 0.],
[ 0., 0., 20.]])
Unfortunately, no numpy vectorization magic
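That said, np.histogram2d with a weights argument does provide a vectorized route to the same per-cell sums (a sketch, not part of the original answer; the explicit range matches the integer cell edges used above):
import numpy as np

x = np.array([0.1, 0.1, 0.1, 0.6, 1.2, 2.1])
y = np.array([2.1, 2.6, 2.1, 2.1, 3.4, 4.7])
z = np.array([2., 3., 5., 7., 10., 20.])

# weights=z makes every point contribute its z value to its cell's total
sums, x_edges, y_edges = np.histogram2d(x, y, bins=(3, 3),
                                        range=[[0, 3], [2, 5]], weights=z)
print(sums)
# [[17.  0.  0.]
#  [ 0. 10.  0.]
#  [ 0.  0. 20.]]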

NumPy array sum reduce

I have a numpy array with three columns of the form:
x1 y1 f1
x2 y2 f2
...
xn yn fn
The (x,y) pairs may repeat. I would need another array such that each (x,y) pair appears once and the corresponding third column is the sum of all the f values that appeared next to (x,y).
For example, the array
1 2 4.0
1 1 5.0
1 2 3.0
0 1 9.0
would give
0 1 9.0
1 1 5.0
1 2 7.0
The order of rows is not relevant. What is the fastest way to do this in Python?
Thank you!
This would be one approach to solve it -
import numpy as np
# Input array
A = np.array([[1,2,4.0],
[1,1,5.0],
[1,2,3.0],
[0,1,9.0]])
# Extract xy columns
xy = A[:,0:2]
# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(xy.T)
sorted_xy = xy[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_xy,axis=0)
df2 = np.append([True],np.any(df1!=0,1),0)
# OR df2 = np.append([True],np.logical_or(df1[:,0]!=0,df1[:,1]!=0),0)
# OR df2 = np.append([True],np.dot(df1!=0,[True,True]),0)
# Get unique sorted labels
sorted_labels = df2.cumsum(0)-1
# Get labels
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels
# Get unique indices
unq_idx = sorted_idx[df2]
# Get counts and unique rows and setup output array
counts = np.bincount(labels, weights=A[:,2])
unq_rows = xy[unq_idx,:]
out = np.append(unq_rows,counts.ravel()[:,None],1)
Input & Output -
In [169]: A
Out[169]:
array([[ 1., 2., 4.],
[ 1., 1., 5.],
[ 1., 2., 3.],
[ 0., 1., 9.]])
In [170]: out
Out[170]:
array([[ 0., 1., 9.],
[ 1., 1., 5.],
[ 1., 2., 7.]])
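On NumPy >= 1.13, np.unique with axis=0 and return_inverse=True condenses the lexsort/labeling machinery above to a few lines; a minimal sketch:
import numpy as np

A = np.array([[1, 2, 4.0],
              [1, 1, 5.0],
              [1, 2, 3.0],
              [0, 1, 9.0]])

uniq, inv = np.unique(A[:, :2], axis=0, return_inverse=True)
inv = inv.ravel()                      # inverse shape differs across NumPy versions
sums = np.bincount(inv, weights=A[:, 2])
out = np.column_stack([uniq, sums])
print(out)
# [[0. 1. 9.]
#  [1. 1. 5.]
#  [1. 2. 7.]]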
Thanks to @hpaulj, I finally found the simplest solution. If d contains the 3-column data:
import numpy as np
ind = d[:, 0:2].astype(int)                    # integer (x, y) index columns
x = np.zeros(shape=(N, N))
np.add.at(x, (ind[:, 0], ind[:, 1]), d[:, 2])  # accumulate f at each (x, y)
This solution assumes that the (x, y) indices in the first two columns are integers and smaller than N. This is what I need and should have mentioned in the post.
Edit: Note that the above solution produces a sparse matrix, with the sum values stored at position (x, y) within the matrix.
Certainly easily done in Python:
arr = np.array([[1,2,4.0],
[1,1,5.0],
[1,2,3.0],
[0,1,9.0]])
d = {}
for x, y, z in arr:
    d.setdefault((x, y), 0)
    d[x, y] += z
>>> d
{(1.0, 2.0): 7.0, (0.0, 1.0): 9.0, (1.0, 1.0): 5.0}
Then translate back to numpy:
>>> np.array([[x,y,d[(x,y)]] for x,y in d.keys()])
array([[ 1., 2., 7.],
[ 0., 1., 9.],
[ 1., 1., 5.]])
If you have scipy, the sparse module does this kind of addition - again for an array where the first two columns are integers, i.e. indexes.
from scipy import sparse
M = sparse.csr_matrix((d[:,2], (d[:,0].astype(int), d[:,1].astype(int))))
M = M.tocoo()  # there may be a shortcut to this csr/coo round trip
x = np.column_stack([M.row, M.col, M.data])  # needs testing
For convenience in constructing certain kinds of linear algebra matrices, the csr sparse array format sums values with duplicate indices. It's implemented in compiled code so should be fairly fast. But putting the data into M and taking it back out might slow it down.
(ps. I haven't tested this script since I'm writing this on a machine without scipy).

Using numpy.cov on a vector yields NANs

Good afternoon.
I am faced with a PCA task which simply involves reducing the dimensionality of a vector. I'm not interested in a two-dimensional matrix in this case, but merely a D-dimensional vector which I would like to project along its K principal eigenvectors.
In order to implement PCA, I need to retrieve the covariance matrix of this vector. Let's try to do this on an example vector:
someVec = np.array([[1.0, 1.0, 2.0, -1.0]])
I've defined this vector as a 1 X 4 matrix, i.e a row vector, in order to make it compatible with numpy.cov. Taking the covariance matrix of this vector through numpy.cov will yield a scalar covariance matrix, because numpy.cov makes the assumption that the features are in the rows:
print np.cov(someVec)
1.58333333333
but this is (or rather, should be) merely a difference in dimensionality assumptions, and taking the covariance of the transpose vector should work fine, right? Except that it doesn't:
print np.cov(someVec.T)
/usr/lib/python2.7/site-packages/numpy/lib/function_base.py:2005: RuntimeWarning:
invalid value encountered in divide
return (dot(X, X.T.conj()) / fact).squeeze()
[[ nan nan nan nan]
[ nan nan nan nan]
[ nan nan nan nan]
[ nan nan nan nan]]
I'm not exactly sure what I've done wrong here. Any advice?
Thanks,
Jason
If you want to pass in the transpose, you'll need to set rowvar to zero.
In [10]: np.cov(someVec, rowvar=0)
Out[10]: array(1.5833333333333333)
In [11]: np.cov(someVec.T, rowvar=0)
Out[11]: array(1.5833333333333333)
From the docs:
rowvar : int, optional
If rowvar is non-zero (default), then each row
represents a variable, with observations in the columns. Otherwise,
the relationship is transposed: each column represents a variable,
while the rows contain observations.
If you want to find a full covariance matrix, you'll need more than one observation. With a single observation and numpy's default estimator, NaN is exactly what you'd expect. If you would like the normalization to be done by N instead of (N - 1), you can pass bias=1.
In [12]: np.cov(someVec.T, bias=1)
Out[12]:
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
Again, from the docs.
bias : int, optional
Default normalization is by (N - 1), where N is
the number of observations given (unbiased estimate). If bias is 1,
then normalization is by N. These values can be overridden by using
the keyword ddof in numpy versions >= 1.5.
You should use the option rowvar=0 in numpy.cov:
In [1]: a = np.array([[1, 2, 3, 4]])
In [2]: np.cov(a)
Out[2]: array(1.6666666666666667)
In [3]: np.cov(a.T)
Out[3]:
array([[ nan, nan, nan, nan],
[ nan, nan, nan, nan],
[ nan, nan, nan, nan],
[ nan, nan, nan, nan]])
In [4]: np.cov(a.T, rowvar=0)
Out[4]: array(1.6666666666666667)
Not really, shouldn't that be returning a matrix of size 4 x 4? I mean, the vector has 4 "features", so given that I want to measure the variance between the features and store them in appropriate places, I need a covariance matrix.
Since you only have one observation, you can't compute a covariance matrix. Depending on the estimator the covariances would either be zero or undefined.
If that's not intuitively clear, try answering the following questions:
what is the variance of 1.0?
what is the covariance of 1.0 and 2.0?
In essence, these are the computations that you're asking numpy.cov() to perform.
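To make that concrete, here is a minimal sketch with made-up data (not from the question): once several D-dimensional observations are stacked, np.cov returns the full D × D matrix.
import numpy as np

# five made-up observations of the four features, one observation per row
obs = np.array([[1.0, 1.0, 2.0, -1.0],
                [1.1, 0.9, 2.1, -0.8],
                [0.9, 1.2, 1.8, -1.1],
                [1.2, 1.1, 2.2, -0.9],
                [0.8, 0.8, 1.9, -1.2]])

C = np.cov(obs, rowvar=0)   # columns are the variables -> 4 x 4 matrix
print(C.shape)              # (4, 4)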
