I would like to calculate the weighted median of each row of a pandas dataframe.
I found this nice function (https://stackoverflow.com/a/29677616/10588967), but I don't seem to be able to pass a 2d array.
def weighted_quantile(values, quantiles, sample_weight=None, values_sorted=False, old_style=False):
""" Very close to numpy.percentile, but supports weights.
NOTE: quantiles should be in [0, 1]!
:param values: numpy.array with data
:param quantiles: array-like with many quantiles needed
:param sample_weight: array-like of the same length as `array`
:param values_sorted: bool, if True, then will avoid sorting of initial array
:param old_style: if True, will correct output to be consistent with numpy.percentile.
:return: numpy.array with computed quantiles.
"""
values = numpy.array(values)
quantiles = numpy.array(quantiles)
if sample_weight is None:
sample_weight = numpy.ones(len(values))
sample_weight = numpy.array(sample_weight)
assert numpy.all(quantiles >= 0) and numpy.all(quantiles <= 1), 'quantiles should be in [0, 1]'
if not values_sorted:
sorter = numpy.argsort(values)
values = values[sorter]
sample_weight = sample_weight[sorter]
weighted_quantiles = numpy.cumsum(sample_weight) - 0.5 * sample_weight
if old_style:
# To be convenient with numpy.percentile
weighted_quantiles -= weighted_quantiles[0]
weighted_quantiles /= weighted_quantiles[-1]
else:
weighted_quantiles /= numpy.sum(sample_weight)
return numpy.interp(quantiles, weighted_quantiles, values)
Using the code from the link, the following works:
weighted_quantile([1, 2, 9, 3.2, 4], [0.0, 0.5, 1.])
However, this does not work:
values = numpy.random.randn(10,5)
quantiles = [0.0, 0.5, 1.]
sample_weight = numpy.random.randn(10,5)
weighted_quantile(values, quantiles, sample_weight)
I receive the following error:
weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
ValueError: operands could not be broadcast together with shapes (250,) (10,5,5)
Question
Is it possible to apply this weighted quantile function in a vectorized manner on a dataframe, or I can only achieve this using .apply()?
Many thanks for your time!
np.cumsum(sample_weight)
return a 1D list. So you would like to reshape it to (10,5,5) using
np.cumsum(sample_weight).reshape(10,5,5)
Try my code in the handy repo
https://github.com/syrte/handy/blob/773a1500a9e10dd28eb0704fded94d6105a84374/stats.py#L239
I copy the docstring here, so you see what it can do. Please go to the link for the complete function (which is pretty long...)
def quantile(a, weights=None, q=None, nsig=None, origin='middle',
axis=None, keepdims=False, sorted=False, nmin=0,
nanas=None, shape='stats'):
'''Compute the quantile of the data.
Be careful when q is very small or many numbers repeat in a.
Parameters
----------
a : array_like
Input array.
weights : array_like, optional
Weighting of a.
q : float or float array in range of [0,1], optional
Quantile to compute. One of `q` and `nsig` must be specified.
nsig : float, optional
Quantile in unit of standard deviation.
Ignored when `q` is given.
origin : ['middle'| 'high'| 'low'], optional
Control how to interpret `nsig` to `q`.
axis : int, optional
Axis along which the quantiles are computed. The default is to
compute the quantiles of the flattened array.
sorted : bool
If True, the input array is assumed to be in increasing order.
nmin : int or None
Return `nan` when the tail probability is less than `nmin/a.size`.
Set `nmin` if you want to make result more reliable.
- nmin = None will turn off the check.
- nmin = 0 will return NaN for q not in [0, 1].
- nmin >= 3 is recommended for statistical use.
It is *not* well defined when `weights` is given.
nanas : None, float, 'ignore'
- None : do nothing. Note default sorting puts `nan` after `inf`.
- float : `nan`s will be replaced by given value.
- 'ignore' : `nan`s will be excluded before any calculation.
shape : 'data' | 'stats'
Put which axes first in the result:
'data' - the shape of data
'stats' - the shape of `q` or `nsig`
Only works for case where axis is not None.
Returns
-------
quantile : scalar or ndarray
The first axes of the result corresponds to the quantiles,
the rest are the axes that remain after the reduction of `a`.
See Also
--------
numpy.percentile
conflevel
Examples
--------
>>> np.random.seed(0)
>>> x = np.random.randn(3, 100)
>>> quantile(x, q=0.5)
0.024654858649703838
>>> quantile(x, nsig=0)
0.024654858649703838
>>> quantile(x, nsig=1)
1.0161711040272021
>>> quantile(x, nsig=[0, 1])
array([ 0.02465486, 1.0161711 ])
>>> quantile(np.abs(x), nsig=1, origin='low')
1.024490097937702
>>> quantile(-np.abs(x), nsig=1, origin='high')
-1.0244900979377023
>>> quantile(x, q=0.5, axis=1)
array([ 0.09409612, 0.02465486, -0.07535884])
>>> quantile(x, q=0.5, axis=1).shape
(3,)
>>> quantile(x, q=0.5, axis=1, keepdims=True).shape
(3, 1)
>>> quantile(x, q=[0.2, 0.8], axis=1).shape
(2, 3)
>>> quantile(x, q=[0.2, 0.8], axis=1, shape='stats').shape
(3, 2)
'''
Related
I'll try to explain my issue here without going into too much detail on the actual application so that we can stay grounded in the code. Basically, I need to do operations to a vector field. My first step is to generate the field as
x,y,z = np.meshgrid(np.linspace(-5,5,10),np.linspace(-5,5,10),np.linspace(-5,5,10))
Keep in mind that this is a generalized case, in the program, the bounds of the vector field are not all the same. In the general run of things, I would expect to say something along the lines of
u,v,w = f(x,y,z).
Unfortunately, this case requires so more difficult operations. I need to use a formula similar to
where the vector r is defined in the program as np.array([xgrid-x,ygrid-y,zgrid-z]) divided by its own norm. Basically, this is a vector pointing from every point in space to the position (x,y,z)
Now Numpy has implemented a cross product function using np.cross(), but I can't seem to create a "meshgrid of vectors" like I need.
I have a lambda function that is essentially
xgrid,ygrid,zgrid=np.meshgrid(np.linspace(-5,5,10),np.linspace(-5,5,10),np.linspace(-5,5,10))
B(x,y,z) = lambda x,y,z: np.cross(v,np.array([xgrid-x,ygrid-y,zgrid-z]))
Now the array v is imported from another class and seems to work just fine, but the second array, np.array([xgrid-x,ygrid-y,zgrid-z]) is not a proper shape because it is a "vector of meshgrids" instead of a "meshgrid of vectors". My big issue is that I cannot seem to find a method by which to format the meshgrid in such a way that the np.cross() function can use the position vector. Is there a way to do this?
Originally I thought that I could do something along the lines of:
x,y,z = np.meshgrid(np.linspace(-2,2,5),np.linspace(-2,2,5),np.linspace(-2,2,5))
A = np.array([x,y,z])
cross_result = np.cross(np.array(v),A)
This, however, returns the following error, which I cannot seem to circumvent:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 1682, in cross
raise ValueError(msg)
ValueError: incompatible dimensions for cross product
(dimension must be 2 or 3)
There's a work around with reshape and broadcasting:
A = np.array([x_grid, y_grid, z_grid])
# A.shape == (3,5,5,5)
def B(v, p):
'''
v.shape = (3,)
p.shape = (3,)
'''
shape = A.shape
Ap = A.reshape(3,-1) - p[:,None]
return np.cross(v[None,:], Ap.reshape(3,-1).T).reshape(shape)
print(B(v,p).shape)
# (3, 5, 5, 5)
I think your original attempt only lacks the specification of the axis along which the cross product should be executed.
x, y, z = np.meshgrid(np.linspace(-2, 2, 5),np.linspace(-2, 2, 5), np.linspace(-2, 2, 5))
A = np.array([x, y, z])
cross_result = np.cross(np.array(v), A, axis=0)
I tested this with the code below. As an alternative to np.array([x, y, z]), you can also use np.stack(x, y, z, axis=0), which clearly shows along which axis the meshgrids are stacked to form a meshgrid of vectors, the vectors being aligned with axis 0. I also printed the shape each time and used random input for testing. In the test, the output of the formula is compared at a random index to the cross product of the input-vector at the same index with vector v.
import numpy as np
x, y, z = np.meshgrid(np.linspace(-5, 5, 10), np.linspace(-5, 5, 10), np.linspace(-5, 5, 10))
p = np.random.rand(3) # random reference point
A = np.array([x-p[0], y-p[1], z-p[2]]) # vectors from positions to reference
A_bis = np.stack((x-p[0], y-p[1], z-p[2]), axis=0)
print(f"A equals A_bis? {np.allclose(A, A_bis)}") # the two methods of stacking yield the same
v = -1 + 2*np.random.rand(3) # random vector v
B = np.cross(v, A, axis=0) # cross-product for all points along correct axis
print(f"Shape of v: {v.shape}")
print(f"Shape of A: {A.shape}")
print(f"Shape of B: {B.shape}")
print("\nComparison for random locations: ")
point = np.random.randint(0, 9, 3) # generate random multi-index
a = A[:, point[0], point[1], point[2]] # look up input-vector corresponding to index
b = B[:, point[0], point[1], point[2]] # look up output-vector corresponding to index
print(f"A[:, {point[0]}, {point[1]}, {point[2]}] = {a}")
print(f"v = {v}")
print(f"Cross-product as v x a: {np.cross(v, a)}")
print(f"Cross-product from B (= v x A): {b}")
The resulting output looks like:
A equals A_bis? True
Shape of v: (3,)
Shape of A: (3, 10, 10, 10)
Shape of B: (3, 10, 10, 10)
Comparison for random locations:
A[:, 8, 1, 1] = [-4.03607312 3.72661831 -4.87453077]
v = [-0.90817859 0.10110274 -0.17848181]
Cross-product as v x a: [ 0.17230515 -3.70657882 -2.97637688]
Cross-product from B (= v x A): [ 0.17230515 -3.70657882 -2.97637688]
I am having a problem when trying to reshape a torch tensor Yp of dimensions [10,200,1] to [2000,1,1]. The tensor is obtained from a numpy array y of dimension [2000,1]. I am doing the following:
Yp = reshape(Yp, (-1,1,1))
I try to subtract the result to a torch tensor version of y by doing:
Yp[0:2000,0] - torch.from_numpy(y[0:2000,0])
I expect the result to be an array of zeros, but that is not the case. Calling different orders when reshaping (order = 'F' or 'C') does not solve the problem, and strangely outputs the same result when doing the subtraction. I only manage to get an array of zeros by calling on the tensor Yp the ravel method with order = 'F'.
What am I doing wrong? I would like to solve this using reshape!
I concur with #linamnt's comment (though the actual resulting shape is [2000, 1, 2000]).
Here is a small demonstration:
import torch
import numpy as np
# Your inputs according to question:
y = np.random.rand(2000, 1)
y = torch.from_numpy(y[0:2000,0])
Yp = torch.reshape(y, (10,200,1))
# Your reshaping according to question:
Yp = torch.reshape(Yp, (-1,1,1))
# (note: Tensor.view() may suit your need more if you don't want to copy values)
# Your subtraction:
y_diff = Yp - y
print(y_diff.shape)
# > torch.Size([2000, 1, 2000])
# As explained by #linamnt, unwanted broadcasting is done
# since the dims of your tensors don't match
# If you give both your tensors the same shape, e.g. [2000, 1, 1] (or [2000]):
y_diff = Yp - y.view(-1, 1, 1)
print(y_diff.shape)
# > torch.Size([2000, 1, 1])
# Checking the result tensor contains only 0 (by calculing its abs. sum):
print(y_diff.abs().sum())
# > 0
I have angular data on a domain that is wrapped at pi radians (i.e. 0 = pi). The data are 2D, where one dimension represents the angle. I need to interpolate this data onto another grid in a wrapped way.
In one dimension, the np.interp function takes a period kwarg (for NumPy 1.10 and later):
http://docs.scipy.org/doc/numpy/reference/generated/numpy.interp.html
This is exactly what I need, but I need it in two dimensions. I'm currently just stepping through columns in my array and using np.interp, but this is of course slow.
Anything out there that could achieve this same outcome but faster?
An explanation of how np.interp works
Use the source, Luke!
The numpy doc for np.interp makes the source particularly easy to find, since it has the link right there, along with the documentation. Let's go through this, line by line.
First, recall the parameters:
"""
x : array_like
The x-coordinates of the interpolated values.
xp : 1-D sequence of floats
The x-coordinates of the data points, must be increasing if argument
`period` is not specified. Otherwise, `xp` is internally sorted after
normalizing the periodic boundaries with ``xp = xp % period``.
fp : 1-D sequence of floats
The y-coordinates of the data points, same length as `xp`.
period : None or float, optional
A period for the x-coordinates. This parameter allows the proper
interpolation of angular x-coordinates. Parameters `left` and `right`
are ignored if `period` is specified.
"""
Let's take a simple example of a triangular wave while going through this:
xp = np.array([-np.pi/2, -np.pi/4, 0, np.pi/4])
fp = np.array([0, -1, 0, 1])
x = np.array([-np.pi/8, -5*np.pi/8]) # Peskiest points possible }:)
period = np.pi
Now, I start off with the period != None branch in the source code, after all the type-checking happens:
# normalizing periodic boundaries
x = x % period
xp = xp % period
This just ensures that all values of x and xp supplied are between 0 and period. So, since the period is pi, but we specified x and xp to be between -pi/2 and pi/2, this will adjust for that by adding pi to all values in the range [-pi/2, 0), so that they effectively appear after pi/2. So our xp now reads [pi/2, 3*pi/4, 0, pi/4].
asort_xp = np.argsort(xp)
xp = xp[asort_xp]
fp = fp[asort_xp]
This is just ordering xp in increasing order. This is especially required after performing that modulo operation in the previous step. So, now xp is [0, pi/4, pi/2, 3*pi/4]. fp has also been shuffled accordingly, [0, 1, 0, -1].
xp = np.concatenate((xp[-1:]-period, xp, xp[0:1]+period))
fp = np.concatenate((fp[-1:], fp, fp[0:1]))
return compiled_interp(x, xp, fp, left, right) # Paraphrasing a little
np.interp does linear interpolation. When trying to interpolate between two points a and b present in xp, it only uses the values of f(a) and f(b) (i.e., the values of fp at the corresponding indices). So what np.interp is doing in this last step is to take the point xp[-1] and put it in front of the array, and take the point xp[0] and put it after the array, but after subtracting and adding one period respectively. So you now have a new xp that looks like [-pi/4, 0, pi/4, pi/2, 3*pi/4, pi]. Likewise, fp[0] and fp[-1] have been concatenated around, so fp is now [-1, 0, 1, 0, -1, 0].
Note that after the modulo operations, x had been brought into the [0, pi] range too, so x is now [7*pi/8, 3*pi/8]. Which lets you easily see that you'll get back [-0.5, 0.5].
Now, coming to your 2D case:
Say you have a grid and some values. Let's just take all values to be between [0, pi] off the bat so we don't need to worry about modulos and shufflings.
xp = np.array([0, np.pi/4, np.pi/2, 3*np.pi/4])
yp = np.array([0, 1, 2, 3])
period = np.pi
# Put x on the 1st dim and y on the 2nd dim; f is linear in y
fp = np.array([0, 1, 0, -1])[:, np.newaxis] + yp[np.newaxis, :]
# >>> fp
# array([[ 0, 1, 2, 3],
# [ 1, 2, 3, 4],
# [ 0, 1, 2, 3],
# [-1, 0, 1, 2]])
We now know that all you need to do is to add xp[[-1]] in front of the array and xp[[0]] at the end, adjusting for the period. Note how I've indexed using the singleton lists [-1] and [0]. This is a trick to ensure that dimensions are preserved.
xp = np.concatenate((xp[[-1]]-period, xp, xp[[0]]+period))
fp = np.concatenate((fp[[-1], :], fp, fp[[0], :]))
Finally, you are free to use scipy.interpolate.interpn to achieve your result. Let's get the value at x = pi/8 for all y:
from scipy.interpolate import interpn
interp_points = np.hstack(( (np.pi/8 * np.ones(4))[:, np.newaxis], yp[:, np.newaxis] ))
result = interpn((xp, yp), fp, interp_points)
# >>> result
# array([ 0.5, 1.5, 2.5, 3.5])
interp_points has to be specified as an Nx2 matrix of points, where the first dimension is for each point you want interpolation at the second dimension gives the x- and y-coordinate of that point. See this answer for a detailed explanation.
If you want to get the value outside of the range [0, period], you'll need to modulo it yourself:
x = 21 * np.pi / 8
x_equiv = x % period # Now within [0, period]
interp_points = np.hstack(( (x_equiv * np.ones(4))[:, np.newaxis], yp[:, np.newaxis] ))
result = interpn((xp, yp), fp, interp_points)
# >>> result
# array([-0.5, 0.5, 1.5, 2.5])
Again, if you want to generate interp_points for a bunch of x- and y- values, look at this answer.
I'm very new to Python and currently trying to replicate plots etc that I previously used GrADs for. I want to calculate the divergence at each grid box using u and v wind fields (which are just scaled by specific humidity, q), from a netCDF climate model file.
From endless searching I know I need to use some combination of np.gradient and np.sum, but can't find the right combination. I just know that to do it 'by hand', the calculation would be
divg = dqu/dx + dqv/dy
I know the below is wrong, but it's the best I've got so far...
nc = Dataset(ifile)
q = np.array(nc.variables['hus'][0,:,:])
u = np.array(nc.variables['ua'][0,:,:])
v = np.array(nc.variables['va'][0,:,:])
lon=nc.variables['lon'][:]
lat=nc.variables['lat'][:]
qu = q*u
qv = q*v
dqu/dx, dqu/dy = np.gradient(qu, [dx, dy])
dqv/dx, dqv/dy = np.gradient(qv, [dx, dy])
divg = np.sum(dqu/dx, dqv/dy)
This gives the error 'SyntaxError: can't assign to operator'.
Any help would be much appreciated.
try something like:
dqu_dx, dqu_dy = np.gradient(qu, [dx, dy])
dqv_dx, dqv_dy = np.gradient(qv, [dx, dy])
you can not assign to any operation in python; any of those are syntax errors:
a + b = 3
a * b = 7
# or, in your case:
a / b = 9
UPDATE
following Pinetwig's comment: a/b is not a valid identifier name; it is (the return value of) an operator.
Try removing the [dx, dy].
[dqu_dx, dqu_dy] = np.gradient(qu)
[dqv_dx, dqv_dy] = np.gradient(qv)
Also to point out if you are recreating plots. Gradient changed in numpy between 1.82 and 1.9. This had an effect for recreating matlab plots in python as 1.82 was the matlab method. I am not sure how this relates to GrADs. Here is the wording for both.
1.82
"The gradient is computed using central differences in the interior
and first differences at the boundaries. The returned gradient hence has
the same shape as the input array."
1.9
"The gradient is computed using second order accurate central differences in the interior and either first differences or second order accurate one-sides (forward or backwards) differences at the boundaries. The returned gradient hence has the same shape as the input array."
The gradient function for 1.82 is here.
def gradient(f, *varargs):
"""
Return the gradient of an N-dimensional array.
The gradient is computed using central differences in the interior
and first differences at the boundaries. The returned gradient hence has
the same shape as the input array.
Parameters
----------
f : array_like
An N-dimensional array containing samples of a scalar function.
`*varargs` : scalars
0, 1, or N scalars specifying the sample distances in each direction,
that is: `dx`, `dy`, `dz`, ... The default distance is 1.
Returns
-------
gradient : ndarray
N arrays of the same shape as `f` giving the derivative of `f` with
respect to each dimension.
Examples
--------
>>> x = np.array([1, 2, 4, 7, 11, 16], dtype=np.float)
>>> np.gradient(x)
array([ 1. , 1.5, 2.5, 3.5, 4.5, 5. ])
>>> np.gradient(x, 2)
array([ 0.5 , 0.75, 1.25, 1.75, 2.25, 2.5 ])
>>> np.gradient(np.array([[1, 2, 6], [3, 4, 5]], dtype=np.float))
[array([[ 2., 2., -1.],
[ 2., 2., -1.]]),
array([[ 1. , 2.5, 4. ],
[ 1. , 1. , 1. ]])]
"""
f = np.asanyarray(f)
N = len(f.shape) # number of dimensions
n = len(varargs)
if n == 0:
dx = [1.0]*N
elif n == 1:
dx = [varargs[0]]*N
elif n == N:
dx = list(varargs)
else:
raise SyntaxError(
"invalid number of arguments")
# use central differences on interior and first differences on endpoints
outvals = []
# create slice objects --- initially all are [:, :, ..., :]
slice1 = [slice(None)]*N
slice2 = [slice(None)]*N
slice3 = [slice(None)]*N
otype = f.dtype.char
if otype not in ['f', 'd', 'F', 'D', 'm', 'M']:
otype = 'd'
# Difference of datetime64 elements results in timedelta64
if otype == 'M' :
# Need to use the full dtype name because it contains unit information
otype = f.dtype.name.replace('datetime', 'timedelta')
elif otype == 'm' :
# Needs to keep the specific units, can't be a general unit
otype = f.dtype
for axis in range(N):
# select out appropriate parts for this dimension
out = np.empty_like(f, dtype=otype)
slice1[axis] = slice(1, -1)
slice2[axis] = slice(2, None)
slice3[axis] = slice(None, -2)
# 1D equivalent -- out[1:-1] = (f[2:] - f[:-2])/2.0
out[slice1] = (f[slice2] - f[slice3])/2.0
slice1[axis] = 0
slice2[axis] = 1
slice3[axis] = 0
# 1D equivalent -- out[0] = (f[1] - f[0])
out[slice1] = (f[slice2] - f[slice3])
slice1[axis] = -1
slice2[axis] = -1
slice3[axis] = -2
# 1D equivalent -- out[-1] = (f[-1] - f[-2])
out[slice1] = (f[slice2] - f[slice3])
# divide by step size
outvals.append(out / dx[axis])
# reset the slice object in this dimension to ":"
slice1[axis] = slice(None)
slice2[axis] = slice(None)
slice3[axis] = slice(None)
if N == 1:
return outvals[0]
else:
return outvals
If your grid is Gaussian and the wind names in the file are "u" and "v" you can also calculate divergence directly using cdo:
cdo uv2dv in.nc out.nc
See https://code.mpimet.mpg.de/projects/cdo/embedded/index.html#x1-6850002.13.2 for more details.
First things first: this is not a duplicate of NumPy: calculate averages with NaNs removed, i'll explain why:
Suppose I have an array
a = array([1,2,3,4])
and I want to average over it with the weights
weights = [4,3,2,1]
output = average(a, weights=weights)
print output
2.0
ok. So this is pretty straightforward. But now I have something like this:
a = array([1,2,nan,4])
calculating the average with the usual method yields of coursenan. Can I avoid this?
In principle I want to ignore the nans, so I'd like to have something like this:
a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print output
1.75
Alternatively, you can use a MaskedArray as such:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
First find out indices where the items are not nan, and then pass the filtered versions of a and weights to numpy.average:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75
As suggested by #mtrw in comments, it would be cleaner to use masked array here instead of index array:
>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
I would offer another solution, which is more scalable to bigger dimensions (eg when doing average over different axis). Attached code works with 2D array, which possibly contains nans, and takes average over axis=0.
a = np.random.randint(5, size=(3,2)) # let's generate some random 2D array
# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a))
# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec
# mean_vec is vector with weighted nan-averages of array a taken along axis=0
Expanding on #Ashwini and #Nicolas' answers, here is a version that can also handle an edge case where all the data values are np.nan, and that is designed to also work with pandas DataFrame without type-related issues:
def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
weights: List[Union[float, int]]) -> np.ndarray:
""" Calculates the weighted average of `measures`' values, ex-nans.
When nans are present in `measures`' values,
the weights are recalculated based only on the weights for non-nan measures.
Note:
The calculation used is NOT the same as just ignoring nans.
For example, if we had data and weights:
data = [2, 3, np.nan]
weights = [0.5, 0.2, 0.3]
calc_wa_ignore_nan approach:
(2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
The ignoring nans approach:
(2*0.5) + (3*0.2) == 1.6
Args:
data: Multiple rows of numeric data values with `measures` as column headers.
measures: The str names of values to select from `row`.
weights: The numeric weights associated with `measures`.
Example:
>>> df = pd.DataFrame({"meas1": [1, 1],
"meas2": [2, 2],
"meas3": [3, 3],
"meas4": [np.nan, 0],
"meas5": [5, 5]})
>>> measures = ["meas2", "meas3", "meas4"]
>>> weights = [0.5, 0.2, 0.3]
>>> calc_wa_ignore_nan(df, measures, weights)
array([2.28571429, 1.6])
"""
assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
# Need to coerce type to np.float instead of python's float
# to avoid "ufunc 'isnan' not supported for the input types ..." error
data = np.array(df[measures].values, dtype=np.float64)
# Make a 2d array with the same weights for each row
# cast for safety and better errors
weights = np.array([weights, ] * data.shape[0], dtype=np.float64)
mask = np.isnan(data)
masked_data = np.ma.masked_array(data, mask=mask)
masked_weights = np.ma.masked_array(weights, mask=mask)
# np.nanmean doesn't support weights
weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
# Replace masked elements with np.nan
# otherwise those elements will be interpretted as 0 when read into a pd.DataFrame
weighted_avgs = weighted_avgs.filled(np.nan)
return weighted_avgs
All the solutions above are very good, but has don't handle the cases when there is nan in weights. For doing so, using pandas :
def weighted_average_ignoring_nan(df, col_value, col_weight):
den = 0
num = 0
for index, row in df.iterrows():
if(~np.isnan(row[col_weight]) & ~np.isnan(row[col_value])):
den = den + row[col_weight]
num = num + row[col_weight]*row[col_value]
return num/den
Since you're looking for the mean another idea is to simply replace all the nan values with 0's:
>>>import numpy as np
>>>a = np.array([[ 3., 2., 5.], [np.nan, 4., np.nan], [np.nan, np.nan, np.nan]])
>>>w = np.array([[ 1., 2., 3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
>>>a[np.isnan(a)] = 0
>>>w[np.isnan(w)] = 0
>>>np.average(a, weights=w)
3.6666666666666665
This can be used with the axis functionality of the average function but be carful that your weights don't sum up to 0.