Why doesn't scipy's interpolate average over colocated values?

If I were to run the following code:
>>> from scipy.interpolate import interpolate
>>> import numpy as np
>>> data = np.arange(10)
>>> times = np.r_[np.arange(5),np.arange(5)]
>>> new_times = np.arange(5)
>>> f = interpolate.interp1d(times,data)
>>> interp_data = f(new_times)
I would naively (and hopefully) expect the following:
>>> interp_data
array([2.5, 3.5, 4.5, 5.5, 6.5])
based on the assumption that colocated values would be averaged and weighted accordingly in the interpolation. But, in fact, the result is:
>>> interp_data
array([ 0., 6., 7., 8., 9.])
What is causing this behaviour, and how could it be rectified?

From the interp1d documentation:
assume_sorted : bool, optional
    If False, values of x can be in any order and they are sorted first.
    If True, x has to be an array of monotonically increasing values.
I can only get the result you got by explicitly forcing assume_sorted to be True:
>>> f = interpolate.interp1d(times,data, assume_sorted=True)
>>> interp_data = f(new_times)
>>> interp_data
array([ 0., 6., 7., 8., 9.])
It appears from your code that assume_sorted defaulted to True, which is giving the answer you don't expect.
If you explicitly set it to False, according to the documentation, interp1d sorts it automatically, and then does the interpolation, giving
>>> f = interpolate.interp1d(times, data, assume_sorted=False)
>>> interp_data = f(new_times)
>>> interp_data
array([ nan, 1., 2., 3., 4.])
which is consistent with the documentation.

I'm not sure exactly what you want, but it seems interp1d may not be the best way to achieve it. An interpolation function, f, should relate a single input to a single output, for example:
from scipy.interpolate import interpolate
import numpy as np
data = np.arange(2.,8.)
times = np.arange(data.shape[0])
new_times = np.arange(0.5,5.,1.)
f = interpolate.interp1d(times,data)
interp_data = f(new_times)
Alternatively, maybe an answer like:
Get sums of pairs of elements in a numpy array
may be what you wanted?

No, interp1d would not weight, average or do anything else to the data for you.
It expects the data to be sorted. If your scipy is recent enough (0.14 or above), it has an assume_sorted keyword which you can set to False, and then it'll just sort the data for you. The precise behaviour for unsorted data is undefined.
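If you do want the averaged behaviour described in the question, one option is to collapse the colocated samples yourself before building the interpolator. Here is a minimal sketch of that idea (this is not something interp1d offers; np.unique and np.bincount do the grouping):

from scipy.interpolate import interp1d
import numpy as np

data = np.arange(10)
times = np.r_[np.arange(5), np.arange(5)]
new_times = np.arange(5)

# Average all data values that share the same time before interpolating.
unique_times, inverse = np.unique(times, return_inverse=True)
mean_data = np.bincount(inverse, weights=data) / np.bincount(inverse)

f = interp1d(unique_times, mean_data)
print(f(new_times))  # [2.5 3.5 4.5 5.5 6.5]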

Related

NumPy - Faster Operations on Masked Array?

I have a numpy array:
import numpy as np
arr = np.random.rand(100)
If I want to find its maximum value, I run np.amax which runs 155,357 times a second on my machine.
However, for some reason, I have to mask some of its values. Let's, for example, mask just one cell:
import numpy.ma as ma
arr = ma.masked_array(arr, mask=[0]*99 + [1])
Now, finding the max is much slower, running 26,574 times a second.
This is only 17% of the speed of the same operation on a non-masked array.
Other operations, such as subtract, add, and multiply, are affected as well: although on a masked array they operate on ALL OF THE VALUES, they run at only about 3% of the speed of the non-masked equivalent (15,343 vs. 497,663 per second).
I'm looking for a faster way to operate on masked arrays like this, whether its using numpy or not.
(I need to run this on real data, which is arrays with multiple dimensions, and millions of cells)
MaskedArray is a subclass of the base numpy ndarray. It does not have compiled code of its own. Look at the numpy/ma/ directory for details, or the main file:
/usr/local/lib/python3.6/dist-packages/numpy/ma/core.py
A masked array has two key attributes, data and mask: one is the data array you used to create it, the other is a boolean array of the same size.
So all operations have to take those two arrays into account. Not only does it calculate new data, it also has to calculate a new mask.
It can take several approaches (depending on the operation):
use the data as is
use compressed data - a new array with the masked values removed
use filled data, where the masked values are replaced by the fillvalue or some innocuous value (e.g. 0 when doing addition, 1 when doing multiplication).
The number of masked values, whether 0 or all of them, makes little, if any, difference in speed.
So the speed differences that you see are not surprising. There's a lot of extra calculation going on. The ma.core.py file says this package was first developed in pre-numpy days, and incorporated into numpy around 2005. While there have been changes to keep it up to date, I don't think it has been significantly reworked.
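To make the data/mask pairing concrete before looking at max, here is a short sketch using the question's arr (nothing in it is specific to max):

import numpy as np
import numpy.ma as ma

arr = ma.masked_array(np.random.rand(100), mask=[0]*99 + [1])

print(arr.data.shape)          # (100,)  the underlying ndarray, masked slot still present
print(arr.mask.sum())          # 1       boolean array, True where a value is masked
print(arr.compressed().shape)  # (99,)   a new array with the masked values removed
print(arr.filled(0.0)[-1])     # 0.0     masked slot replaced by the given fill value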
Here's the code for np.ma.max method:
def max(self, axis=None, out=None, fill_value=None, keepdims=np._NoValue):
    kwargs = {} if keepdims is np._NoValue else {'keepdims': keepdims}

    _mask = self._mask
    newmask = _check_mask_axis(_mask, axis, **kwargs)
    if fill_value is None:
        fill_value = maximum_fill_value(self)
    # No explicit output
    if out is None:
        result = self.filled(fill_value).max(
            axis=axis, out=out, **kwargs).view(type(self))
        if result.ndim:
            # Set the mask
            result.__setmask__(newmask)
            # Get rid of Infs
            if newmask.ndim:
                np.copyto(result, result.fill_value, where=newmask)
        elif newmask:
            result = masked
        return result
    # Explicit output
    ....
The key steps are
fill_value = maximum_fill_value(self)  # depends on dtype
self.filled(fill_value).max(
    axis=axis, out=out, **kwargs).view(type(self))
You can experiment with filled to see what happens with your array.
In [40]: arr = np.arange(10.)
In [41]: arr
Out[41]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
In [42]: Marr = np.ma.masked_array(arr, mask=[0]*9 + [1])
In [43]: Marr
Out[43]:
masked_array(data=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, --],
             mask=[False, False, False, False, False, False, False, False,
                   False,  True],
       fill_value=1e+20)
In [44]: np.ma.maximum_fill_value(Marr)
Out[44]: -inf
In [45]: Marr.filled()
Out[45]:
array([0.e+00, 1.e+00, 2.e+00, 3.e+00, 4.e+00, 5.e+00, 6.e+00, 7.e+00,
8.e+00, 1.e+20])
In [46]: Marr.filled(_44)
Out[46]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., -inf])
In [47]: arr.max()
Out[47]: 9.0
In [48]: Marr.max()
Out[48]: 8.0
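Building on that, if all you need is the reduced value (and you know the array is not fully masked), you can do the fill-and-reduce step on the plain ndarray yourself and skip the MaskedArray bookkeeping. A minimal sketch, not a drop-in replacement (it ignores the axis/out handling and the all-masked case):

import numpy as np
import numpy.ma as ma

arr = ma.masked_array(np.random.rand(100), mask=[0]*99 + [1])

# Same idea ma.max uses internally: replace masked slots with -inf,
# then call the fast ndarray reduction directly.
fast_max = np.where(arr.mask, -np.inf, arr.data).max()

assert fast_max == arr.max()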

How to count non-zeroes values using binned_statistic

I need to efficiently process very large 1D arrays, extracting some statistics per bin, and I have found the function binned_statistic from scipy.stats very useful, as its 'statistic' argument works quite efficiently.
I would like to perform a 'count' statistic, but without considering zero values.
In parallel I am working with sliding windows (pandas' rolling function) over the same arrays, where substituting zeroes with NaN works nicely, but that behaviour does not carry over to binned_statistic.
This is a toy example of what I am doing:
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic
# As example with sliding windows, this returns just the length of each window:
a = np.array([1., 0., 0., 1.])
pd.Series(a).rolling(2).count() # Returns [1.,2.,2.,2.]
# You can make the count to do it only if not zero:
nonzero_a = a.copy()
nonzero_a[nonzero_a == 0.0] = np.nan
pd.Series(nonzero_a).rolling(2).count() # Returns [1.,1.,0.,1.]
# However, with binned_statistic I am not able to do anything similar:
binned_statistic(range(4), a, bins=2, statistic='count')[0]
binned_statistic(range(4), nonzero_a, bins=2, statistic='count')[0]
binned_statistic(range(4), np.array([1., False, None, 1.]), bins=2, statistic='count')[0]
All the previous runs provide the same output: [2., 2.] but I am expecting [1., 1.].
The only option I have found is to pass a custom function, but it performs considerably worse than the built-in statistics on real cases.
binned_statistic(range(4), a, bins=2, statistic=np.count_nonzero)
I have found an easy way to replicate the nonzero count by transforming the array to 0-1 and applying 'sum':
# Transform all non-zero to 1s
a = np.array([1., 0., 0., 2.])
nonzero_a = a.copy()
nonzero_a[nonzero_a != 0.0] = 1.0  # nonzero_a = [1., 0., 0., 1.]
binned_statistic(np.arange(len(nonzero_a)), nonzero_a, bins=2, statistic='sum')[0]  # Returns [1., 1.]
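The same idea can be written without modifying a copy of the array, by passing a 0/1 indicator directly as the values (a small sketch along the lines of the snippet above):

import numpy as np
from scipy.stats import binned_statistic

a = np.array([1., 0., 0., 2.])

# Summing a 0/1 indicator per bin is exactly a per-bin count of non-zero values,
# and it stays within binned_statistic's fast built-in statistics.
counts, edges, _ = binned_statistic(np.arange(len(a)), (a != 0).astype(float),
                                    bins=2, statistic='sum')
print(counts)  # [1. 1.]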

Multiple coefficient sets for least squares fitting in numpy/scipy

Is there a way to perform multiple simultaneous (but unrelated) least-squares fits with different coefficient matrices in either numpy.linalg.lstsq or scipy.linalg.lstsq? For example, here is a trivial linear fit that I would like to be able to do with different x-values but the same y-values. Currently, I have to write a loop:
x = np.arange(12.0).reshape(4, 3)
y = np.arange(12.0, step=3.0)
m = np.stack((x, np.broadcast_to(1, x.shape)), axis=0)
fit = np.stack(tuple(np.linalg.lstsq(w, y, rcond=-1)[0] for w in m), axis=-1)
This results in a set of fits with the same slope and different intercepts, such that fit[n] corresponds to coefficients m[n].
Linear least squares is not a great example since it is invertible, and both functions have an option for multiple y-values. However, it serves to illustrate my point.
Ideally, I would like to extend this to any "broadcastable" combination of a and b, where a.shape[-2] == b.shape[0] exactly, and the last dimensions have to either match or be one (or missing). I am not really hung up on which dimension of a is the one representing the different matrices: it was just convenient to make it the first one to shorten the loop.
Is there a built in method in numpy or scipy to avoid the Python loop? I am very much interested in using lstsq rather than manually transposing, multiplying and inverting the matrices.
You could use scipy.sparse.linalg.lsqr together with scipy.sparse.block_diag. I'm just not sure it will be any faster.
Example:
>>> import numpy as np
>>> from scipy.sparse import block_diag
>>> from scipy.sparse import linalg as sprsla
>>>
>>> x = np.random.random((3,5,4))
>>> y = np.random.random((3,5))
>>>
>>> for A, b in zip(x, y):
... print(np.linalg.lstsq(A, b))
...
(array([-0.11536962, 0.22575441, 0.03597646, 0.52014899]), array([0.22232195]), 4, array([2.27188101, 0.69355384, 0.63567141, 0.21700743]))
(array([-2.36307163, 2.27693405, -1.85653264, 3.63307554]), array([0.04810252]), 4, array([2.61853881, 0.74251282, 0.38701194, 0.06751288]))
(array([-0.6817038 , -0.02537582, 0.75882223, 0.03190649]), array([0.09892803]), 4, array([2.5094637 , 0.55673403, 0.39252624, 0.18598489]))
>>>
>>> sprsla.lsqr(block_diag(x), y.ravel())
(array([-0.11536962, 0.22575441, 0.03597646, 0.52014899, -2.36307163,
2.27693405, -1.85653264, 3.63307554, -0.6817038 , -0.02537582,
0.75882223, 0.03190649]), 2, 15, 0.6077437777160813, 0.6077437777160813, 6.226368324510392, 106.63227777368986, 1.3277892240815807e-14, 5.36589277249043, array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]))
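Applied to the toy problem from the question (three straight-line fits against the same y values), the same approach might look like the sketch below; building the per-column design matrices explicitly is my assumption about what each block should contain:

import numpy as np
from scipy.sparse import block_diag
from scipy.sparse import linalg as sprsla

x = np.arange(12.0).reshape(4, 3)
y = np.arange(12.0, step=3.0)

# One (4, 2) design matrix [x_column, 1] per column of x.
designs = [np.column_stack((x[:, j], np.ones(x.shape[0]))) for j in range(x.shape[1])]

# Solve all three fits at once as a single block-diagonal least-squares problem.
coeffs = sprsla.lsqr(block_diag(designs), np.tile(y, len(designs)))[0]
print(coeffs.reshape(len(designs), 2))  # one (slope, intercept) row per column of x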

Normalization of a matrix

I have a 150x4 matrix X which I created from a pandas dataframe using the following code:
X = df_new.as_matrix()
I have to normalize it using this function:
X̄_,j = (X_,j − μ_j) / σ_j
I know that μ_j is the mean value of j, and that σ_j is the standard deviation of j, but I don't understand what j is. I'm having a little trouble understanding what the bar on X means, and I'm confused by the commas in the equation (I don't know whether they have any significance or not).
Can anyone help me understand what this equation means so I can then write the normalization using sklearn?
You don't actually need to write code for the normalization yourself - it comes ready with sklearn.preprocessing.scale.
Here is an example from the docs:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
When used with the default setting axis=0, the normalization happens column-wise (i.e. for each column j, as in your equation). As a result, it is easy to confirm that the scaled data has zero mean and unit variance:
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
The indexes for matrix X are row (i) and column (j). Hence, X,j means column j of matrix X. I.e. normalize each column of matrix X to z-scores.
You can do that using pandas:
df_new_zscores = (df_new - df_new.mean()) / df_new.std()
I do not know pandas, but I think the equation means that the normalized matrix is given by X̄_,j = (X_,j − μ_j) / σ_j.
You subtract the empirical mean and divide by the empirical standard deviation per column.
You sometimes use this for Principal Component Analysis.
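For reference, the same column-wise z-scoring can also be written directly in NumPy; a minimal sketch (it matches preprocessing.scale, which uses the population standard deviation, i.e. ddof=0 in NumPy terms):

import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Subtract each column's mean and divide by each column's standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0. 0. 0.]
print(X_scaled.std(axis=0))   # [1. 1. 1.]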

scipy's splrep/splev for python interpolation returns nan

I have a data set whose first column is the x data (wavelength) and whose second column is the y data (relative intensity).
I wish to interpolate it onto another set of x values, x_new, but my problem is that splrep returns NaN values:
>>> import numpy as np
>>> from scipy.interpolate import splrep, splev
>>> d = np.loadtxt("test.txt")
>>> x, y = d[:,0], d[:,1]
>>>
>>> f = splrep(x, y, k=5)
>>> print(f)
(array([ 4501.19, 4501.19, 4501.19, ..., 7091.74, 7091.74, 7091.74]), array([ nan, nan, nan, ..., 0., 0., 0.]), 5)
It also happens when I don't specify k. Any suggestions how to overcome this problem?
Your x values probably contain duplicates. Use the s=... keyword argument to splrep to set a smoothing factor, because if it is not set the spline is supposed to go through every point exactly, which is impossible with duplicates.
It might also be that the values are not exact duplicates but merely very close together.
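A sketch of both workarounds on made-up data with duplicated x values (the real test.txt is not available, so the numbers here are illustrative only):

import numpy as np
from scipy.interpolate import splrep, splev

x = np.array([1., 1., 2., 3., 3., 4., 5., 6., 7., 8.])   # duplicated "wavelengths"
y = np.sin(x) + 0.01 * np.random.randn(x.size)           # "intensities"

order = np.argsort(x)        # splrep needs x in ascending order
x, y = x[order], y[order]

# Option 1: keep the duplicates but allow smoothing instead of exact interpolation.
tck_smooth = splrep(x, y, k=3, s=0.1)   # tune s for your data

# Option 2: average the y values of duplicated x first, then fit as usual.
xu, inverse = np.unique(x, return_inverse=True)
yu = np.bincount(inverse, weights=y) / np.bincount(inverse)
tck_exact = splrep(xu, yu, k=3)

x_new = np.linspace(x.min(), x.max(), 50)
print(splev(x_new, tck_smooth))
print(splev(x_new, tck_exact))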
