how to calculate geometric average value with nans? - python

I would like to calculate the geometric mean of some data (including NaN), how can I do it?
I know how to calculate the mean value while ignoring NaNs; we can use the following code:
import numpy as np
M = np.nanmean(data, axis=2)
So how can I do the same for the geometric mean?

You could use the identity (I only found it in the German Wikipedia but there are probably other sources as well):
(x_1 * x_2 * ... * x_n)**(1/n) == a**((log_a(x_1) + log_a(x_2) + ... + log_a(x_n)) / n)
This identity can be constructed using the "logarithm rules" on the normal definition of the geometric mean: the log of the product becomes a sum of logs, and the 1/n root becomes a 1/n factor in front of that sum. The base a can be chosen arbitrarily, so you could use np.log (and np.exp as the inverse operation):
import numpy as np
def nangmean(arr, axis=None):
    arr = np.asarray(arr)
    inverse_valids = 1. / np.sum(~np.isnan(arr), axis=axis)  # could be a problem for an all-nan axis
    rhs = inverse_valids * np.nansum(np.log(arr), axis=axis)
    return np.exp(rhs)
And it seems to work:
>>> l = [[1, 2, 3], [1, np.nan, 3], [np.nan, 2, np.nan]]
>>> nangmean(l)
1.8171205928321397
>>> nangmean(l, axis=1)
array([ 1.81712059, 1.73205081, 2. ])
>>> nangmean(l, axis=0)
array([ 1., 2., 3.])
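As a quick sanity check (my addition, not part of the original answer): on nan-free input this should agree with scipy.stats.gmean, assuming SciPy is available:
>>> from scipy import stats
>>> x = np.array([1., 2., 3., 4.])
>>> nangmean(x)
2.2133638394006434
>>> stats.gmean(x)
2.2133638394006434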
np.nanprod was added in NumPy 1.10, so you could also use the normal definition:
import numpy as np
def nangmean(arr, axis=None):
    arr = np.asarray(arr)
    valids = np.sum(~np.isnan(arr), axis=axis)
    prod = np.nanprod(arr, axis=axis)
    return np.power(prod, 1. / valids)
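As the comment above notes, both versions divide by the number of valid entries, so an all-nan slice divides by zero. A minimal guarded sketch (my addition, not part of the original answer) that returns nan for such slices:
import numpy as np

def nangmean_safe(arr, axis=None):
    arr = np.asarray(arr, dtype=float)
    valids = np.sum(~np.isnan(arr), axis=axis)  # count of non-nan entries per slice
    with np.errstate(divide='ignore', invalid='ignore'):
        result = np.exp(np.nansum(np.log(arr), axis=axis) / valids)
    return np.where(valids == 0, np.nan, result)  # nan for all-nan slices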

numpy vectorize use on (2,) array

I have a numpy array of (m, 2) and I want to transform it to shape of (m, 1) using a function below.
def func(x):
    if x == [1., 1.]:
        return 0.
    if x == [-1., 1.] or x == [-1., -1.]:
        return 1.
    if x == [1., -1.]:
        return 2.
I want this function applied to each (2,) vector inside the (m, 2) array, resulting in an (m, 1) array. I tried to use numpy.vectorize, but it seems that the function gets applied to each element of the array (which makes sense in the general-purpose case), so I have failed to apply it.
My intention is not to use a for loop. Can anyone help me with this? Thanks.
import numpy as np
def f(a, b):
    return a + b
F = np.vectorize(f)
x = np.asarray([[1, 2], [3, 4], [5, 6]]).T
print(F(*x))
Output:
[ 3  7 11]
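If you want the original row-wise func applied directly, np.vectorize also accepts a signature argument (available since NumPy 1.12). A sketch under that assumption; note it uses np.array_equal, because == on a NumPy row compares elementwise rather than returning a single bool:
import numpy as np

def func(x):
    if np.array_equal(x, [1., 1.]):
        return 0.
    if np.array_equal(x, [-1., 1.]) or np.array_equal(x, [-1., -1.]):
        return 1.
    if np.array_equal(x, [1., -1.]):
        return 2.

vfunc = np.vectorize(func, signature='(n)->()')  # func now receives whole (2,) rows
arr = np.array([[1., 1.], [-1., -1.], [1., -1.]])
print(vfunc(arr).reshape(-1, 1))
# [[0.]
#  [1.]
#  [2.]]
Note that the signature version still loops in Python internally, so it is convenient rather than fast.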

Calculate condensed distance matrix with varying length data points

Scipy's pdist function expects an evenly shaped numpy array as input.
Working example:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

# Example distance function.
def dfun(u, v):
    return u.sum() + v.sum()
dat0 = np.array([-1, 1,-3, 1])
dat1 = np.array([-1, 1,-3, 1])
dat2 = np.array([ 1, 1, 1, 1])
data = np.array([dat0, dat1, dat2])
distance_matrix = pdist(data, dfun)
squareform(distance_matrix)
I have a custom distance function which works with run-length encoded data, so the arrays may vary in size. When using the following input
dat0 = np.array([-1, 1,-4, 1])
dat1 = np.array([-1, 1,-3, 1, 1])
dat2 = np.array([ 1,-6])
a ValueError (A 2-dimensional array must be passed.) is raised, even though the distance function would be just fine handling the input. Does an alternative exist to calculate these values?
Edit: the distance function in the above snippet is just an example of a metric which does not care about the actual number of elements inside the data point. In my case https://github.com/mclmza/AWarp is used, which computes the DTW for sparse data sets (example series: [1,-456,1,1,-23,1]), so padding the data is not a valid option.
If I understand correctly, you want to compute the distances using awarp, but that distance function takes signals of varying length. So you need to avoid creating an array, because NumPy doesn't allow 'ragged' arrays. Then I think you can do this:
import numpy as np
from itertools import combinations
from scipy.spatial.distance import squareform

# Example distance function.
def dfun(u, v):
    return u.sum() + v.sum()
dat0 = np.array([-1, 1,-4, 1])
dat1 = np.array([-1, 1,-3, 1, 1])
dat2 = np.array([ 1,-6])
data = [dat0, dat1, dat2]
dists = [dfun(a, b) for a, b in combinations(data, r=2)]
squareform(dists)
For your example, this yields:
array([[ 0, -4, -8],
       [-4,  0, -6],
       [-8, -6,  0]])
And if dfun = awarp then you get this output for those signals:
array([[ 0.        ,  0.        ,  2.23606798],
       [ 0.        ,  0.        ,  2.44948974],
       [ 2.23606798,  2.44948974,  0.        ]])
I guess this approach only works if dfun is symmetric (i.e. dfun(a, b) == dfun(b, a)), which I think awarp is.
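If dfun were not symmetric, one workaround (a sketch of my own, not from the original answer) is to symmetrize it before building the condensed list:
def symmetrized(f):
    # Average f over both argument orders so squareform's symmetry assumption holds.
    return lambda a, b: 0.5 * (f(a, b) + f(b, a))

sym_dfun = symmetrized(dfun)
dists = [sym_dfun(a, b) for a, b in combinations(data, r=2)]
Whether averaging the two directions is meaningful depends on the metric, so treat this as a rough patch rather than a general fix.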

How to make numpy.nan*0 = 0

I would like to have Python make the product
numpy.nan*0
return 0 (instead of nan), but have e.g.
numpy.nan*4
still return nan.
My application: I have some numpy matrices which I am multiplying with one another. These contain many nan entries, and plenty of zeros.
The nans always represent unknown, but finite values which are known to become zero when multiplied with zero.
So I would like A*B to return [[1, nan], [nan, 1]] in the following example:
import numpy as np
A=np.matrix('1 0; 0 1')
B=np.matrix([[1, np.nan],[np.nan, 1]])
Is this possible?
Many thanks
You can use the numpy function numpy.nan_to_num():
import numpy as np
A = np.matrix('1 0; 0 1')
B = np.matrix([[1, np.nan],[np.nan, 1]])
C = np.nan_to_num(A) * np.nan_to_num(B)
The outcome will be [[1., 0.], [0., 1.]].
I don't think it is possible to override the behavior of nan * 0 in numpy directly, because that multiplication is performed at a very low level.
However, you can provide your own Python class with the desired multiplication behavior. Be warned: this will seriously hurt performance.
import numpy as np
class MyNumber(float):
    def __mul__(self, other):
        if (other == 0 and np.isnan(self)) or (self == 0 and np.isnan(other)):
            return 0.0
        return float(self) * other

def convert(x):
    x = np.asmatrix(x, dtype=object)        # use Python objects as matrix elements
    x.flat = [MyNumber(i) for i in x.flat]  # convert each element to MyNumber
    return x
A = convert([[1, 0], [0, 1]])
B = convert([[1, np.nan], [np.nan, 1]])
print(A * B)
# [[1.0 nan]
# [nan 1.0]]
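For completeness, a vectorized sketch of my own (not from either answer) that implements a matrix product in which nan*0 is treated as 0 but nan times anything else stays nan. It materializes every pairwise product, so memory grows as O(m*k*n):
import numpy as np

def matmul_nan_times_zero_is_zero(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    P = A[:, :, None] * B[None, :, :]                      # all pairwise products, shape (m, k, n)
    P[(A[:, :, None] == 0) | (B[None, :, :] == 0)] = 0.0   # enforce nan*0 == 0
    return P.sum(axis=1)                                   # contract over the shared axis

A = np.array([[1, 0], [0, 1]], dtype=float)
B = np.array([[1, np.nan], [np.nan, 1]], dtype=float)
print(matmul_nan_times_zero_is_zero(A, B))
# [[ 1. nan]
#  [nan  1.]]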

efficient numpy array creation

Given x, I want to produce x, log(x) as a numpy array: x has shape s, and the result has shape (*s, 2). What's the neatest way to do this? x may just be a float, in which case I want a result with shape (2,).
An ugly way to do this is:
import numpy as np
x = np.asarray(x)
result = np.empty((*x.shape, 2))
result[..., 0] = x
result[..., 1] = np.log(x)
It's important to separate aesthetics from performance. Sometimes ugly code is fast. In fact, that's the case here. Although creating an empty array and then assigning values to slices may not look beautiful, it is fast.
import numpy as np
import timeit
import itertools as IT
import pandas as pd
def using_empty(x):
    x = np.asarray(x)
    result = np.empty(x.shape + (2,))
    result[..., 0] = x
    result[..., 1] = np.log(x)
    return result

def using_concat(x):
    x = np.asarray(x)
    return np.concatenate([x, np.log(x)], axis=-1).reshape(x.shape + (2,), order='F')

def using_stack(x):
    x = np.asarray(x)
    return np.stack([x, np.log(x)], axis=x.ndim)

def using_ufunc(x):
    return np.array([x, np.log(x)])
using_ufunc = np.vectorize(using_ufunc, otypes=[np.ndarray])
tests = [np.arange(600),
         np.arange(600).reshape(20, 30),
         np.arange(960).reshape(8, 15, 8)]

# check that all implementations return the same result
for x in tests:
    assert np.allclose(using_empty(x), using_concat(x))
    assert np.allclose(using_empty(x), using_stack(x))

timing = []
funcs = ['using_empty', 'using_concat', 'using_stack', 'using_ufunc']
for test, func in IT.product(tests, funcs):
    timing.append(timeit.timeit(
        '{}(test)'.format(func),
        setup='from __main__ import test, {}'.format(func), number=1000))
timing = pd.DataFrame(np.array(timing).reshape(-1, len(funcs)), columns=funcs)
print(timing)
yields the following timeit results on my machine:
using_empty using_concat using_stack using_ufunc
0 0.024754 0.025182 0.030244 2.414580
1 0.025766 0.027692 0.031970 2.408344
2 0.037502 0.039644 0.044032 3.907487
So using_empty is the fastest of the options tested, on these inputs.
Note that np.stack does exactly what you want, so
np.stack([x, np.log(x)], axis=x.ndim)
looks reasonably pretty, but it is also the slowest of the three options tested.
Note that along with being much slower, using_ufunc returns an array of object dtype:
In [236]: x = np.arange(6)
In [237]: using_ufunc(x)
Out[237]:
array([array([ 0., -inf]), array([ 1., 0.]),
array([ 2. , 0.69314718]),
array([ 3. , 1.09861229]),
array([ 4. , 1.38629436]), array([ 5. , 1.60943791])], dtype=object)
which is not the same as the desired result:
In [240]: using_empty(x)
Out[240]:
array([[ 0.        ,        -inf],
       [ 1.        ,  0.        ],
       [ 2.        ,  0.69314718],
       [ 3.        ,  1.09861229],
       [ 4.        ,  1.38629436],
       [ 5.        ,  1.60943791]])
In [238]: using_ufunc(x).shape
Out[238]: (6,)
In [239]: using_empty(x).shape
Out[239]: (6, 2)
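The question also asks about a scalar x; using_empty already handles that case (not shown in the original answer), since np.asarray(2.0) has shape ():
In [241]: using_empty(2.0)
Out[241]: array([ 2.        ,  0.69314718])
In [242]: using_empty(2.0).shape
Out[242]: (2,)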

python numpy weighted average with nans

First things first: this is not a duplicate of NumPy: calculate averages with NaNs removed, and I'll explain why:
Suppose I have an array
a = array([1,2,3,4])
and I want to average over it with the weights
weights = [4,3,2,1]
output = average(a, weights=weights)
print output
2.0
OK, so this is pretty straightforward. But now I have something like this:
a = array([1,2,nan,4])
calculating the average with the usual method yields, of course, nan. Can I avoid this?
In principle I want to ignore the nans, so I'd like to have something like this:
a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print output
1.75
Alternatively, you can use a MaskedArray, like so:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
First find the indices where the items are not nan, and then pass the filtered versions of a and weights to numpy.average:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75
As suggested by @mtrw in the comments, it would be cleaner to use a boolean mask here instead of an index array:
>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
I would offer another solution, which scales better to higher dimensions (e.g. when averaging over different axes). The code below works with a 2D array, which may contain nans, and takes the average over axis=0.
import numpy as np

a = np.random.randint(5, size=(3, 2))  # let's generate some random 2D array
# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a))
# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec
# mean_vec is vector with weighted nan-averages of array a taken along axis=0
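One way to sanity-check this (my addition, not part of the original answer) is against np.ma.average with the same per-row weights:
import numpy as np

a = np.array([[1., 2.], [np.nan, 4.], [5., np.nan]])
w_vec = np.arange(1, a.shape[0] + 1).reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1) * (~np.isnan(a))
# Reference result via masked arrays; should match mean_vec computed as above:
print(np.ma.average(np.ma.masked_invalid(a), weights=w_mtx, axis=0))
# [4.0 3.3333333333333335]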
Expanding on @Ashwini's and @Nicolas's answers, here is a version that also handles the edge case where all the data values are np.nan, and that is designed to work with a pandas DataFrame without type-related issues:
from typing import List, Union

import numpy as np
import pandas as pd

def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in `measures`' values, the weights are
    recalculated based only on the weights for non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
        calc_wa_ignore_nan approach:
            (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
        The ignoring-nans approach:
            (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of the columns to select from `df`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
        ...                    "meas2": [2, 2],
        ...                    "meas3": [3, 3],
        ...                    "meas4": [np.nan, 0],
        ...                    "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])
    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Need to coerce the type to np.float64 instead of Python's float
    # to avoid the "ufunc 'isnan' not supported for the input types ..." error.
    data = np.array(df[measures].values, dtype=np.float64)
    # Make a 2d array with the same weights for each row
    # (cast for safety and better errors).
    weights = np.array([weights, ] * data.shape[0], dtype=np.float64)
    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)
    # np.nanmean doesn't support weights.
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan, otherwise those elements will
    # be interpreted as 0 when read into a pd.DataFrame.
    weighted_avgs = weighted_avgs.filled(np.nan)
    return weighted_avgs
All the solutions above are very good, but they don't handle the case where there are nans in the weights. To do so, using pandas:
import numpy as np
import pandas as pd

def weighted_average_ignoring_nan(df, col_value, col_weight):
    den = 0
    num = 0
    for index, row in df.iterrows():
        if not np.isnan(row[col_weight]) and not np.isnan(row[col_value]):
            den = den + row[col_weight]
            num = num + row[col_weight] * row[col_value]
    return num / den
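A vectorized alternative (my sketch, using the same hypothetical column names): mask out the rows where either the value or the weight is nan, then take a plain weighted sum over what remains:
import numpy as np
import pandas as pd

def weighted_average_ignoring_nan(df, col_value, col_weight):
    valid = df[col_value].notna() & df[col_weight].notna()  # keep rows where both are present
    v = df.loc[valid, col_value]
    w = df.loc[valid, col_weight]
    return (v * w).sum() / w.sum()
This avoids the per-row Python loop of iterrows, which matters on large frames.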
Since you're looking for the mean, another idea is to simply replace all the nan values with 0:
>>> import numpy as np
>>> a = np.array([[ 3., 2., 5.], [np.nan, 4., np.nan], [np.nan, np.nan, np.nan]])
>>> w = np.array([[ 1., 2., 3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
>>> a[np.isnan(a)] = 0
>>> w[np.isnan(w)] = 0
>>> np.average(a, weights=w)
3.6666666666666665
This can be used with the axis functionality of the average function, but be careful that your weights don't sum up to 0.
