matplotlib - extracting data from contour lines - python

I would like to get data from a single contour of evenly spaced 2D data (an image-like data).
Based on the example found in a similar question: How can I get the (x,y) values of the line that is ploted by a contour plot (matplotlib)?
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4]
>>> y = [1,2,3,4]
>>> m = [[15,14,13,12],[14,12,10,8],[13,10,7,4],[12,8,4,0]]
>>> cs = plt.contour(x,y,m, [9.5])
>>> cs.collections[0].get_paths()
The result of the call to cs.collections[0].get_paths() is:
[Path([[ 4. 1.625 ]
[ 3.25 2. ]
[ 3. 2.16666667]
[ 2.16666667 3. ]
[ 2. 3.25 ]
[ 1.625 4. ]], None)]
Based on the plots, this result makes sense and appears to be a collection of (y, x) pairs for the contour line.
Other than manually looping over this return value, extracting the coordinates and assembling arrays for the line, are there better ways to get data back from a matplotlib.path object? Are there pitfalls to be aware of when extracting data from a matplotlib.path?
Alternatively, is there something within matplotlib, or better yet numpy/scipy, that does a similar thing? The ideal result would be a high-resolution vector of (x, y) pairs describing the line, which could be used for further analysis, since in general my datasets are not as small or simple as the example above.

For a given path, you can get the points like this:
p = cs.collections[0].get_paths()[0]
v = p.vertices
x = v[:,0]
y = v[:,1]

from: http://matplotlib.org/api/path_api.html#module-matplotlib.path
Users of Path objects should not access the vertices and codes arrays
directly. Instead, they should use iter_segments() to get the
vertex/code pairs. This is important, since many Path objects, as an
optimization, do not store a codes at all, but have a default one
provided for them by iter_segments().
Otherwise, I'm not really sure what your question is. zip is a sometimes-useful built-in function when working with coordinates.
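A minimal sketch combining both suggestions, assuming the contour path contains only MOVETO/LINETO segments (no curves):
import numpy as np

p = cs.collections[0].get_paths()[0]
# iter_segments() yields (vertices, code) pairs; for straight-line contour
# paths each `vertices` is a single (x, y) point
xy = np.array([vert for vert, code in p.iter_segments()])
x, y = xy[:, 0], xy[:, 1]

# zip pairs the coordinates back up if (x, y) tuples are more convenient
points = list(zip(x, y))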

The vertices of all paths can be returned as a numpy array of float64 simply via:
cs.allsegs[i][j] # for element j, in level i
where cs is defined as in the original question as:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [1, 2, 3, 4]
m = [[15, 14, 13, 12], [14, 12, 10, 8], [13, 10, 7, 4], [12, 8, 4, 0]]
cs = plt.contour(x, y, m, [9.5])
In more detail:
Going through the collections and extracting the paths and vertices is not the most straightforward or fastest thing to do. The returned ContourSet object actually has an attribute for the segments, cs.allsegs, which is a nested list of shape [level][element][vertex_coord]:
num_levels = len(cs.allsegs)
num_element = len(cs.allsegs[0]) # in level 0
num_vertices = len(cs.allsegs[0][0]) # of element 0, in level 0
num_coord = len(cs.allsegs[0][0][0]) # of vertex 0, in element 0, in level 0
See reference:
https://matplotlib.org/stable/api/contour_api.html
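For instance, with the single-level example above, the (x, y) vertices of the only contour line could be pulled out roughly like this (a small sketch, not part of the original answer):
seg = cs.allsegs[0][0]   # level 0, element 0 -> array of shape (n_vertices, 2)
x, y = seg[:, 0], seg[:, 1]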

I am facing a similar problem, and stumbled over this matplotlib list discussion.
Basically, it is possible to strip away the plotting and call the underlying functions directly, not super convenient, but possible. The solution is also not pixel precise, as there is probably some interpolation going on in the underlying code.
import matplotlib.pyplot as plt
import matplotlib._cntr as cntr
import scipy as sp

data = sp.zeros((6, 6))
data[2:4, 2:4] = 1
plt.imshow(data, interpolation='none')

level = 0.5
X, Y = sp.meshgrid(sp.arange(data.shape[0]), sp.arange(data.shape[1]))
c = cntr.Cntr(X, Y, data.T)
nlist = c.trace(level, level, 0)
segs = nlist[:len(nlist) // 2]
for seg in segs:
    plt.plot(seg[:, 0], seg[:, 1], color='white')
plt.show()
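Note that matplotlib._cntr was removed in later Matplotlib releases. A rough modern equivalent of the snippet above, assuming the contourpy package (which now backs Matplotlib's contouring) is installed, might look like this; treat it as a sketch rather than a drop-in replacement:
import numpy as np
from contourpy import contour_generator, LineType

data = np.zeros((6, 6))
data[2:4, 2:4] = 1
X, Y = np.meshgrid(np.arange(data.shape[0]), np.arange(data.shape[1]))

gen = contour_generator(x=X, y=Y, z=data.T, line_type=LineType.Separate)
segs = gen.lines(0.5)   # a list of (n_points, 2) arrays, one per contour line
for seg in segs:
    print(seg[:3])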

Is it possible to leave blank spaces in matplotlib's pcolormesh plots?

I'm displaying some data using matplotlib.pyplot.pcolormesh in python, and I want to leave blank spaces where there are missing data points.
Suppose I've collected data for x values 0 to 10, and y values 0 to 10, but not every such value. At present, I initialize my data storage array using np.zeros((11,11)), then use a for loop to change the values of that array to the data value if I have the data for that point.
That leaves me with a bunch of data plus some zeros in an array. When I plot this, it is impossible to distinguish between points that have no data and points which have data with small values.
Is it possible to have missing data points clearly distinct from non-missing data points? For example, in the code below I want the squares at (3,1), (5,7), and (8,8) colored but the rest of the squares white.
I've tried initializing my data storage array with np.empty((11,11)) and np.full((11,11),np.nan) as well, but they both produce the same output as np.zeros. Here's the code below:
import numpy as np
import matplotlib.pyplot as plt

data_storage = np.zeros((11, 11))
collected_data = [[3, 1, 45.2], [5, 7, 23.9], [8, 8, 78.4]]
for data in collected_data:
    x_coord = data[0]
    y_coord = data[1]
    value = data[2]
    data_storage[y_coord, x_coord] = value
all_x_values = np.linspace(0, 10, 11)
all_y_values = np.linspace(0, 10, 11)
plt.pcolormesh(all_x_values, all_y_values, data_storage)
plt.show()
One approach is to change all zeros to NaN, which makes the corresponding cells transparent.
Please note that the x and y values for pcolormesh are for the grid points, not for the centers, so you need one value more in each dimension (11 cells, 12 cell borders). This allows you to create color meshes with unequal cell sizes. If you want the ticks to sit nicely in the centers of the cells, you can put the cell borders at the halves.
(In the code below the for loop has been written more concisely.)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

data_storage = np.zeros((11, 11))
collected_data = [[3, 1, 45.2], [5, 7, 23.9], [8, 8, 78.4]]
for x_coord, y_coord, value in collected_data:
    data_storage[y_coord, x_coord] = value
all_x_values = np.arange(0, 12) - 0.5
all_y_values = np.arange(0, 12) - 0.5
plt.pcolormesh(all_x_values, all_y_values, np.where(data_storage == 0, np.nan, data_storage))
plt.gca().xaxis.set_major_locator(MultipleLocator(1))
plt.gca().yaxis.set_major_locator(MultipleLocator(1))
plt.colorbar()
plt.show()
An alternative approach could be to create a colormap, set an 'under' color and set vmin to a value slightly larger than 0. Optionally, the 'under' color can be visualized in the colorbar with extend='min'.
from copy import copy
my_cmap = copy(plt.cm.get_cmap('viridis'))
my_cmap.set_under('lightgrey')
plt.pcolormesh(all_x_values, all_y_values, data_storage, cmap=my_cmap, vmin=0.000001)
plt.colorbar(extend='min', extendrect=True)

why does my histogram look weird in this stochastic problem plot in python?

I am trying to write a stochastic program in Python to replicate a fair dice (one dice) roll, such that this one dice is rolled 100 times. I intend to display the output of the dice rolls as a histogram.
I thought histograms had a specific n shape, but this is what I have been getting with the code below.
import random
import numpy as np
import matplotlib.pyplot as plt

x = []
for i in range(100):
    num1 = random.choice(range(1, 7))
    x.append(num1)
plt.hist(x, bins=6)
plt.xlabel('dice')
plt.show()
(image: the resulting histogram)
Also, is there an easier way to plot a histogram in python when you have, e.g., ages as [10,3,5,1] and the frequencies in a table as [2,3,4,4]? Do I have to type out the entire frequency of the ages in a list like this: age = [10,10,3,3,3,5,5,5,5,1,1,1,1] before I write the program?
Please see what I mean in the code below:
import numpy as np
import matplotlib.pyplot as plt
plt.close()
ages = [88,88,88,88,76,76,76, 65,65,65,65,65,96,96,52,52,52,52,52,98,98,102,102,102,102]
#the frequency was = [4, 3, 5, 2, 5, 2, 4] which corresponded to the ages [88,76,65,96,52,98,102]
num_bins = 25
n, bins, patches = plt.hist(ages, num_bins, facecolor='blue')
plt.xlabel('age')
plt.ylabel('Frequency of occurence')
plt.show()
#my histogram again looks more like a bar chart. Is this because I used bins as the ages?
So far, it's easier for me to plot a histogram from raw random numbers than from a frequency table. Here is my second weird output: (image: the second histogram)
>>> ages = [10,3,5,1]
>>> freqs = [2,3,4,4]
Use zip to pair each age with its frequency; then use the frequency to add that many copies to a new container.
>>> q = []
>>> for a, f in zip(ages, freqs):
... q.extend(a for _ in range(f))
>>> q
[10, 10, 3, 3, 3, 5, 5, 5, 5, 1, 1, 1, 1]
>>>
a for _ in range(f) is a generator expression which will produce f a's when iterated. The extend method consumes (iterates over) the generator expression.
We typically use _ for values that we aren't going to use.
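As an aside (my addition, not part of the answer above), matplotlib's hist() also accepts a weights argument, so the frequencies can be passed directly without expanding the list first:
import matplotlib.pyplot as plt

ages = [10, 3, 5, 1]
freqs = [2, 3, 4, 4]
# each age is counted freqs[i] times when building the histogram
plt.hist(ages, bins=range(0, 12), weights=freqs)
plt.xlabel('age')
plt.ylabel('Frequency of occurrence')
plt.show()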

Find all simplices a point is a part of in scipy.spatial.Delaunay python

Is there anyway to get all the simplices/triangles a certain point is a part of in a Delaunay triangulation using scipy.spatial.Delaunay?
I know there is the find_simplex() function, but that only returns one triangle that a point is a part of; I would like to get all the triangles that it is a part of.
So in the example, when I do find_simplex() for point 6, it only returns triangle 2, but I would like it to return the triangles 1, 2, 3, 4, 10, and 9, as point 6 is a part of all of those triangles.
Any help would be appreciated!
You don’t want find_simplex because it is geometric, not topological. That is, it treats a point as a location rather than as a component of the triangulation: almost all points lie in only one simplex, so that’s what it reports.
Instead, use the vertex number. The trivial answer is to scan the simplices attribute (where d is the Delaunay triangulation):
vert = 6
[i for i, s in enumerate(d.simplices) if vert in s]
With a good bit more code, it is possible to search more efficiently using the vertex_to_simplex and neighbors attributes.
You can get all simplices adjacent to a given vertex efficiently by
def get_simplices(self, vertex):
    "Find all simplices this `vertex` belongs to"
    visited = set()
    queue = [self.vertex_to_simplex[vertex]]
    while queue:
        simplex = queue.pop()
        for i, s in enumerate(self.neighbors[simplex]):
            if self.simplices[simplex][i] != vertex and s != -1 and s not in visited:
                queue.append(s)
        visited.add(simplex)
    return np.array(list(visited))
Example:
import scipy.spatial
import numpy as np

np.random.seed(0)
points = np.random.rand(10, 2)
tri = scipy.spatial.Delaunay(points)
vertex = 2
simplices = get_simplices(tri, vertex)
# 0, 2, 5, 9, 11
neighbors = np.unique(tri.simplices[simplices].reshape(-1))
# 0, 1, 2, 3, 7, 8
Visualisation:
import matplotlib.pyplot as plt
plt.triplot(points[:,0], points[:,1], tri.simplices)
plt.plot(points[neighbors,0], points[neighbors,1], 'or')
plt.plot(points[vertex,0], points[vertex,1], 'ob')
plt.show()

How to interpolate a line between two other lines in python

Note: I asked this question before, but it was closed as a duplicate. However, I, along with several others, believe it was unduly closed; I explain why in an edit in my original post. So I would like to re-ask this question here.
Does anyone know of a python library that can interpolate between two lines? For example, given the two solid lines below, I would like to produce the dashed line in the middle. In other words, I'd like to get the centreline. The input is just two numpy arrays of coordinates, with sizes N x 2 and M x 2 respectively.
Furthermore, I'd like to know if someone has written a function for this in some optimized python library, although optimization isn't strictly necessary.
Here is an example of two lines that I might have. You can assume they do not overlap with each other, and that an x/y can have multiple y/x coordinates.
array([[ 1233.87375018, 1230.07095987],
[ 1237.63559365, 1253.90749041],
[ 1240.87500801, 1264.43925132],
[ 1245.30875975, 1274.63795396],
[ 1256.1449357 , 1294.48254424],
[ 1264.33600095, 1304.47893299],
[ 1273.38192911, 1313.71468591],
[ 1283.12411536, 1322.35942538],
[ 1293.2559388 , 1330.55873344],
[ 1309.4817002 , 1342.53074698],
[ 1325.7074616 , 1354.50276051],
[ 1341.93322301, 1366.47477405],
[ 1358.15898441, 1378.44678759],
[ 1394.38474581, 1390.41880113]])
array([[ 1152.27115094, 1281.52899302],
[ 1155.53345506, 1295.30515742],
[ 1163.56506781, 1318.41642169],
[ 1168.03497425, 1330.03181319],
[ 1173.26135672, 1341.30559949],
[ 1184.07110925, 1356.54121651],
[ 1194.88086178, 1371.77683353],
[ 1202.58908737, 1381.41765447],
[ 1210.72465255, 1390.65097106],
[ 1227.81309742, 1403.2904646 ],
[ 1244.90154229, 1415.92995815],
[ 1261.98998716, 1428.56945169],
[ 1275.89219696, 1438.21626352],
[ 1289.79440676, 1447.86307535],
[ 1303.69661656, 1457.50988719],
[ 1323.80994319, 1470.41028655],
[ 1343.92326983, 1488.31068591],
[ 1354.31738934, 1499.33260989],
[ 1374.48879779, 1516.93734053],
[ 1394.66020624, 1534.54207116]])
Visualizing this we have:
So my attempt at this has been using the skeletonize function in the skimage.morphology library by first rasterizing the coordinates into a filled in polygon. However, I get branching at the ends like this:
First of all, pardon the overkill; I had fun with your question. If the description is too long, feel free to skip to the bottom, I defined a function that does everything I describe.
Your problem would be relatively straightforward if your arrays were the same length. In that case, all you would have to do is find the average between the corresponding x values in each array, and the corresponding y values in each array.
So what we can do is create arrays of the same length, that are more or less good estimates of your original arrays. We can do this by fitting a polynomial to the arrays you have. As noted in comments and other answers, the midline of your original arrays is not specifically defined, so a good estimate should fulfill your needs.
Note: In all of these examples, I've gone ahead and named the two arrays that you posted a1 and a2.
Step one: Create new arrays that estimate your old lines
Looking at the data you posted:
These aren't particularly complicated functions, it looks like a 3rd degree polynomial would fit them pretty well. We can create those using numpy:
import numpy as np
# Find the range of x values in a1
min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
# Create an evenly spaced array that ranges from the minimum to the maximum
# I used 100 elements, but you can use more or fewer.
# This will be used as your new x coordinates
new_a1_x = np.linspace(min_a1_x, max_a1_x, 100)
# Fit a 3rd degree polynomial to your data
a1_coefs = np.polyfit(a1[:,0],a1[:,1], 3)
# Get your new y coordinates from the coefficients of the above polynomial
new_a1_y = np.polyval(a1_coefs, new_a1_x)
# Repeat for array 2:
min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
new_a2_x = np.linspace(min_a2_x, max_a2_x, 100)
a2_coefs = np.polyfit(a2[:,0],a2[:,1], 3)
new_a2_y = np.polyval(a2_coefs, new_a2_x)
The result:
That's not so bad! If you have more complicated functions, you'll have to fit a higher-degree polynomial, or find some other adequate function to fit to your data.
Now, you've got two sets of arrays of the same length (I chose a length of 100, you can do more or less depending on how smooth you want your midpoint line to be). These sets represent the x and y coordinates of the estimates of your original arrays. In the example above, I named these new_a1_x, new_a1_y, new_a2_x and new_a2_y.
Step two: calculate the average between each x and each y in your new arrays
Then, we want to find the average x and average y value for each of our estimate arrays. Just use np.mean:
midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(100)]
midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(100)]
midx and midy now represent the midpoint between our 2 estimate arrays. Now, just plot your original (not estimate) arrays, alongside your midpoint array:
plt.plot(a1[:,0], a1[:,1],c='black')
plt.plot(a2[:,0], a2[:,1],c='black')
plt.plot(midx, midy, '--', c='black')
plt.show()
And voilà:
This method still works with more complex, noisy data (but you have to fit the function thoughtfully):
As a function:
I've put the above code in a function, so you can use it easily. It returns an array of your estimated midpoints, in the format you had your original arrays in.
The arguments: a1 and a2 are your 2 input arrays, poly_deg is the degree of the polynomial you want to fit, n_points is the number of points you want in your midpoint array, and plot is a boolean for whether you want to plot it or not.
import matplotlib.pyplot as plt
import numpy as np
def interpolate(a1, a2, poly_deg=3, n_points=100, plot=True):
    min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
    new_a1_x = np.linspace(min_a1_x, max_a1_x, n_points)
    a1_coefs = np.polyfit(a1[:,0], a1[:,1], poly_deg)
    new_a1_y = np.polyval(a1_coefs, new_a1_x)

    min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
    new_a2_x = np.linspace(min_a2_x, max_a2_x, n_points)
    a2_coefs = np.polyfit(a2[:,0], a2[:,1], poly_deg)
    new_a2_y = np.polyval(a2_coefs, new_a2_x)

    midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(n_points)]
    midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(n_points)]

    if plot:
        plt.plot(a1[:,0], a1[:,1], c='black')
        plt.plot(a2[:,0], a2[:,1], c='black')
        plt.plot(midx, midy, '--', c='black')
        plt.show()

    return np.array([[x, y] for x, y in zip(midx, midy)])
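Usage with the two arrays from the question (assuming they are named a1 and a2 as above) might look like:
midline = interpolate(a1, a2, poly_deg=3, n_points=100, plot=True)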
[EDIT]:
I was thinking back on this question, and I overlooked a simpler way to do this, by "densifying" both arrays to the same number of points using np.interp. This method follows the same basic idea as the line-fitting method above, but instead of approximating lines using polyfit / polyval, it just densifies:
min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
new_a1_x = np.linspace(min_a1_x, max_a1_x, 100)
new_a2_x = np.linspace(min_a2_x, max_a2_x, 100)
new_a1_y = np.interp(new_a1_x, a1[:,0], a1[:,1])
new_a2_y = np.interp(new_a2_x, a2[:,0], a2[:,1])
midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(100)]
midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(100)]
plt.plot(a1[:,0], a1[:,1],c='black')
plt.plot(a2[:,0], a2[:,1],c='black')
plt.plot(midx, midy, '--', c='black')
plt.show()
The "line between two lines" is not so well defined. You can obtain a decent though simple solution by triangulating between the two curves (you can triangulate by progressing from vertex to vertex, choosing the diagonals that produce the less skewed triangle).
Then the interpolated curve joins the middles of the sides.
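A minimal sketch of that idea (my own interpretation; the "skewness" measure below is just the longest-to-shortest side ratio, which the answer does not specify):
import numpy as np

def skewness(p, q, r):
    # crude skew measure: ratio of the longest to the shortest triangle side
    sides = [np.linalg.norm(p - q), np.linalg.norm(q - r), np.linalg.norm(r - p)]
    return max(sides) / min(sides)

def centerline(a, b):
    # greedy triangulation between polylines a (N x 2) and b (M x 2);
    # the centerline joins the midpoints of the cross edges
    i, j = 0, 0
    mids = [(a[0] + b[0]) / 2]
    while i < len(a) - 1 or j < len(b) - 1:
        if i == len(a) - 1:
            j += 1
        elif j == len(b) - 1:
            i += 1
        # otherwise pick the diagonal giving the less skewed triangle
        elif skewness(a[i], b[j], a[i + 1]) <= skewness(a[i], b[j], b[j + 1]):
            i += 1
        else:
            j += 1
        mids.append((a[i] + b[j]) / 2)
    return np.array(mids)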
I work with rivers, so this is a common problem. One of my solutions is exactly like the one you showed in your question--i.e. skeletonize the blob. You see that the boundaries have problems, so what I've done that seems to work well is to simply mirror the boundaries. For this approach to work, the blob must not intersect the corners of the image.
You can find my implementation in RivGraph; this particular algorithm is in rivers/river_utils.py called "mask_to_centerline".
Here's an example output showing how the ends of the centerline extend to the desired edge of the object:
sacuL's solution almost worked for me, but I needed to aggregate more than just two curves.
Here is my generalization of sacuL's solution:
import numpy as np
import matplotlib.pyplot as plt

def interp(*axis_list):
    min_max_xs = [(min(axis[:,0]), max(axis[:,0])) for axis in axis_list]
    new_axis_xs = [np.linspace(min_x, max_x, 100) for min_x, max_x in min_max_xs]
    new_axis_ys = [np.interp(new_x_axis, axis[:,0], axis[:,1]) for axis, new_x_axis in zip(axis_list, new_axis_xs)]
    midx = [np.mean([new_axis_xs[axis_idx][i] for axis_idx in range(len(axis_list))]) for i in range(100)]
    midy = [np.mean([new_axis_ys[axis_idx][i] for axis_idx in range(len(axis_list))]) for i in range(100)]
    for axis in axis_list:
        plt.plot(axis[:,0], axis[:,1], c='black')
    plt.plot(midx, midy, '--', c='black')
    plt.show()
If we now run an example:
a1 = np.array([[x, x**2+5*(x%4)] for x in range(10)])
a2 = np.array([[x-0.5, x**2+6*(x%3)] for x in range(10)])
a3 = np.array([[x+0.2, x**2+7*(x%2)] for x in range(10)])
interp(a1, a2, a3)
we get the plot:

NumPy or SciPy to calculate weighted median

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.
For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.
So far I've
imported the csv showing the weights as an array, masking values of 0, and
created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.
I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.
I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!
Update: here's some code for what I've done so far:
#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter=",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

#Mask values ==0
maTest = np.ma.masked_equal(nArray, 0)

#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]
massArr = []
for i in range(rowLength):
    createArr = np.arange(0, fieldLength * 10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)
nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
What we can do, if I understood your problem correctly, is to sum up the observations; dividing by 2 gives us the observation number corresponding to the median. From there we need to figure out which observation this number corresponds to.
One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.
Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Now looking at the cumsum result, we can easily see that this must be the observation between the second and third elements (cumulative counts 3 and 6).
So all we need to do is figure out the index of where the median (5) would fit.
np.searchsorted does exactly what we need. It will find the index at which to insert an element into an array so that it stays sorted.
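For instance, inserting the median position 5 into the cumulative sums above gives index 2, i.e. the third element, as expected:
np.searchsorted([1, 3, 6, 10], 5) -> 2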
The code to do it looks like this:
import numpy as np

# my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10, 20, 30, 40], [100, 10, 10, 10], [1, 1, 1, 100]])

c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1] / 2.0) for row in c]
masses = [i * 10 for i in indices]  # correct if the masses are indeed 0, 10, 20, ...

# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis] / 2.0)))
Output will be:
median masses is: [10 20 20 0 30]
[[ 30 191 9 0] <- The test data
[ 10 20 300 10]
[ 10 20 30 40]
[100 10 10 10]
[ 1 1 1 100]]
[[ 30. 221. 230. 230. 115. ] <- cumsum results with median added to the end.
[ 10. 30. 330. 340. 170. ] you can see from this where they fit in.
[ 10. 30. 60. 100. 50. ]
[ 100. 110. 120. 130. 65. ]
[ 1. 2. 3. 103. 51.5]]
wquantiles is a small python package that will do exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).
def weighted_median(values, weights):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]
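A quick sanity check against the example in the question (mass array [0, 10, 20, 30], weights [30, 191, 9, 0], expected weighted median 10):
import numpy as np

print(weighted_median(np.array([0, 10, 20, 30]), np.array([30, 191, 9, 0])))   # -> 10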
Using argsort lets us maintain the alignment between the two arrays without changing or copying their content. It should be straightforward to extend it to an arbitrary number of arbitrary quantiles.
Update
Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:
def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]
This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type mirrors what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values, the results closely track the requested quantiles:
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424, 0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])
Update 2
In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be fairly easily added by calculating the midpoint between two numbers in the case where the weight mass is equally (or quantile/1-quantile) divided between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).
def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    q = np.searchsorted(c, quantiles * c[-1])
    return np.where(c[q]/c[-1] == quantiles, 0.5 * (values[i[q]] + values[i[q+1]]), values[i[q]])
This function will fail with arrays smaller than 2 (the original would handle non-empty arrays).
>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)
Note that this extension is fairly unlikely to be needed when working with actual data sets, where we typically have (a) large data sets and (b) real-valued weights that make the odds of ending up exactly at a quantile edge very long (and when it does happen, it is probably due to rounding errors). Including it for completeness nonetheless.
I ended up writing this function based on the @muzzle and @maesers replies above:
def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()
    if interpolate:
        Pn = (Sn - sorted_weights/2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
The difference between interpolate True and False is as follows:
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5
(there is no difference for uneven arrays such as [1, 2, 3, 4, 5])
Speed tests show it is just as performant as @maesers' function in the uninterpolated case, and twice as performant in the interpolated case.
Sharing some code that I got a hand with. This allows you to run stats on each column of an excel spreadsheet.
import xlrd
import itertools
import numpy as np

book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)  # first column has mass
ages = sh.row_values(0, start_colx=1)    # first row has age ranges

# collect one column of counts per age range
count = 1
age = []
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in ages:
    # pair the mass vector with the counts for this age range
    age_mass = zip(masses, age[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)
    # flatten into one big list
    medianlist = [x for t in expanded for x in t]
    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)
    median = np.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)
    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)
    stats.append(stats1)

np.savetxt(ofile, stats, fmt="%d")
