I want to group numbers in a list based on how 'large' the numbers are compared to their neighbors, but I want to do it continuously and via clustering if possible. To clarify, let me give you an example:
Suppose you have the list
lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
then, if we have 3 groups, it's obvious how to cluster. Running the k-means algorithm from sklearn (see code) confirms this. But, when the numbers in the list aren't that 'convenient', I run into trouble. Suppose you have the list:
lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
My problem now is two-fold:
I want some sort of 'order-preserving, linear' clustering, which takes the order of the data into account. For the list above, the clustering algorithm should give me a desired output of the form
lst = [0,0,1,1,1,1,1,1,2,2]
Looking at this output, you can also see that I want the value 6.2 to be placed in the second cluster, i.e. I want the clustering algorithm to treat it as an outlier, not as an entirely new cluster.
EDIT: For clarification, I want to be able to specify the number of clusters in the linear clustering process, i.e. the 'end total' of clusters.
Code:
import numpy as np
from sklearn.cluster import KMeans
lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 2]: OK output
lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 1 1 0 0]. Desired output: [0 0 1 1 1 1 1 1 2 2]
As mentioned, I think a straightforward(ish) way to get the desired results is to just use normal K-means clustering and then modify the generated output as desired.
Explanation: the idea is to take the K-means outputs and iterate through them, keeping track of the previous item's cluster group and the current cluster group, and controlling when new clusters are created based on conditions. Further explanation is in the code.
import numpy as np
from sklearn.cluster import KMeans
lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 2]: OK output
lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
km = KMeans(3,).fit(np.array(lst).reshape(-1,1))
print(km.labels_)
# [0 0 1 1 1 2 1 1 0 0]. Desired output: [0 0 1 1 1 1 1 1 2 2]
def linear_order_clustering(km_labels, outlier_tolerance=1):
    '''Expects clustering outputs as an array/list'''
    prev_label = km_labels[0]  # keeps track of the last seen item's real cluster
    cluster = 0  # like a counter for our new linear clustering outputs
    result = [cluster]  # initialize first entry
    for i, label in enumerate(km_labels[1:]):
        if prev_label == label:
            # just written for clarity of control flow,
            # do nothing special here
            pass
        else:  # current cluster label did not match previous label
            # check if the previous cluster label reappears
            # to the right of the current cluster label position
            # (aka the current non-matching cluster is sandwiched
            # within a reasonable tolerance)
            if (outlier_tolerance and
                    prev_label in km_labels[i + 1: i + 2 + outlier_tolerance]):
                label = prev_label  # if so, overwrite the current label
            else:
                cluster += 1  # it's genuinely a new cluster
        result.append(cluster)
        prev_label = label
    return result
Note that I have only tested this with a tolerance of 1 outlier, and cannot promise it works as-is out of the box for all cases. It should get you started, however.
Output:
print(km.labels_)
result = linear_order_clustering(km.labels_)
print(result)
[1 1 0 0 0 2 0 0 1 1]
[0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
I would approach this in a couple of passes. First, one function/method would do the analysis to determine the cluster centers for each group and return an array of those centers. A second function/method would then take those centers, along with the list, assemble a list of the cluster id of each number in the list, and return that list sorted.
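A minimal sketch of that two-pass idea, assuming a plain 1-D k-means is used to find the centers (the helper names here are my own):
import numpy as np
from sklearn.cluster import KMeans

def find_centers(values, n_groups):
    # First pass: determine a center for each group (here via 1-D k-means).
    km = KMeans(n_clusters=n_groups, n_init=10).fit(np.array(values).reshape(-1, 1))
    return np.sort(km.cluster_centers_.ravel())

def assign_to_centers(values, centers):
    # Second pass: give every number the id of its nearest center.
    return [int(np.argmin(np.abs(centers - v))) for v in values]

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
centers = find_centers(lst, 3)
print(assign_to_centers(lst, centers))
Note that this alone does not enforce the order-preserving or outlier-tolerant behaviour asked for; it only illustrates the centers-then-assign idea.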
Define a threshold.
If the values of x[i] and x[i-1] differ too much, begin a new segment.
For better results, look at KDE and CUSUM approaches.
Don't use clustering. It has a different objective.
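A minimal sketch of the thresholding idea above (the threshold value is an arbitrary choice of mine):
def segment_by_threshold(values, threshold):
    # Start a new segment whenever consecutive values differ by more than the threshold.
    labels = [0]
    for prev, cur in zip(values, values[1:]):
        labels.append(labels[-1] + (abs(cur - prev) > threshold))
    return labels

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
print(segment_by_threshold(lst, threshold=15))
# [0, 0, 1, 1, 1, 2, 3, 3, 4, 4] -- the lone 6.2 becomes its own segment
This segments on local jumps only; it does not let you fix the total number of clusters in advance, which is why the KDE and CUSUM approaches mentioned above can give better results.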
I had a similar problem and solved it as follows:
Given a distances matrix between all the elements,
I either do a bottom-up clustering (merging the two "most similar" elements/sub-clusters) or a top-down clustering (splitting a group of elements into the "most different" sub-clusters);
To compute the distance between sub-clusters I aggregate the distances of all the elements in them (the default method is taking the average, using the minimal or maximal distance is also possible).
Either way this results in a hierarchical clustering which you can then cut to produce any desired number of clusters.
It seems the bottom-up method gave better results, but YMMV.
Here's the code for the bottom-up method (in R). It builds:
A merge matrix where every row includes two columns with the indices of the next two things to merge - negative index for elements and positive index for previously created sub-clusters (R uses 1-based indices)
A height array containing the distance between the two merged elements/sub-clusters. This is added to the maximal height of the merged things (0 height for leaf elements) so heights are always increasing (for display of the tree, or as R calls it, the "dendrogram").
This can be used to create R hclust objects which can be displayed and manipulated in various ways.
This isn't the most efficient possible implementation, but it gets the work done in a reasonable amount of time. A more efficient approach would be to shrink the distances matrix as you go (this would require more bookkeeping to track the mapping between the indices of the smaller matrix and the original elements):
bottom_up <- function(distances, aggregation) {
    aggregate <- switch(aggregation, mean=mean, min=min, max=max)
    rows_count <- dim(distances)[1]
    diag(distances) <- Inf
    merge <- matrix(0, nrow=rows_count - 1, ncol=2)
    height <- rep(0, rows_count - 1)
    merged_height <- rep(0, rows_count)
    groups <- -(1:rows_count)
    for (merge_index in 1:(rows_count - 1)) {
        adjacent_distances <- pracma::Diag(distances, 1)
        low_index <- which.min(adjacent_distances)
        high_index <- low_index + 1
        grouped_indices <- sort(groups[c(low_index, high_index)])
        merged_indices <- which(groups %in% grouped_indices)
        groups[merged_indices] <- merge_index
        merge[merge_index,] <- grouped_indices
        height[merge_index] <- max(merged_height[merged_indices]) + adjacent_distances[low_index]
        merged_height[merged_indices] <- height[merge_index]
        merged_distances <- apply(distances[,merged_indices], 1, aggregate)
        distances[,merged_indices] <- merged_distances
        distances[merged_indices,] <- rep(merged_distances, each=length(merged_indices))
        distances[merged_indices, merged_indices] <- Inf
    }
    return (list(merge=merge, height=height))
}
The pracma::Diag(distances, 1) fetches the offset-by-1 diagonal (above the main diagonal).
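For readers who want to stay in Python: to my knowledge the closest ready-made equivalent of this adjacent-only bottom-up merging is scikit-learn's agglomerative clustering with a chain connectivity matrix. A rough sketch (whether the 6.2 outlier gets absorbed or ends up as its own segment depends on the linkage choice):
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
X = np.array(lst).reshape(-1, 1)

# Connectivity matrix linking each element only to its immediate neighbours
# in the list, so merges can only happen between adjacent runs of values.
n = len(lst)
connectivity = diags([np.ones(n - 1), np.ones(n - 1)], offsets=[-1, 1]).tocsr()

model = AgglomerativeClustering(n_clusters=3, linkage='ward',
                                connectivity=connectivity).fit(X)
print(model.labels_)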
Related
I found that the irr package has 2 big bugs in the calculation of weighted kappa.
Please tell me whether the 2 bugs are really there or whether I have misunderstood something.
You can replicate the bugs using the following examples.
First bug: the sorting of labels in the confusion matrix is incorrect.
I have 2 sets of scores for disease extent (ranging from 0 to 100, where 0 is healthy and 100 is extremely ill).
In label_test.csv (you can just copy and paste the data to your disk to do the following test):
0
1
1
1
0
14
53
3
In pred_test.csv:
0
1
1
0
3
4
54
6
in script_r.R:
library(irr)
label <- read.csv('label_test.csv',header=FALSE)
pred <- read.csv('pred_test.csv',header=FALSE)
kapp <- kappa2(data.frame(label,pred),"unweighted")
kappa <- getElement(kapp,"value")
print(kappa) # output: 0.245283
w_kapp <- kappa2(data.frame(label,pred),"equal")
weighted_kappa <- getElement(w_kapp,"value")
print(weighted_kappa) # output: 0.443038
When I use Python to calculate the kappa and weighted_kappa, in script_python.py:
from sklearn.metrics import cohen_kappa_score
import pandas as pd
import numpy as np

label_file = 'label_test.csv'
pred_file = 'pred_test.csv'
label = pd.read_csv(label_file, header=None).to_numpy()
pred = pd.read_csv(pred_file, header=None).to_numpy()
kappa = cohen_kappa_score(label.astype(int), pred.astype(int))
print(kappa) # output: 0.24528301886792447
weighted_kappa = cohen_kappa_score(label.astype(int), pred.astype(int), weights='linear', labels=np.array(list(range(100))))
print(weighted_kappa) # output: 0.8359908883826879
We can see that the kappa calculated by R and Python is the same, but the weighted_kappa from R is far lower than the weighted_kappa from sklearn in Python. Which is wrong? After two days of research, I found that the weighted_kappa from the irr package in R is wrong. Details are as follows.
During debugging, we can inspect the confusion matrix built by irr in R and see that the order of its labels is wrong. The order of the labels should be changed from [0, 1, 14, 3, 4, 53, 54, 6] to [0, 1, 3, 4, 6, 14, 53, 54], as it is in Python. It seems that the irr package used a string-based sort instead of an integer-based sort, which puts 14 in front of 3. This mistake could and should be corrected easily.
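A quick Python illustration of that difference (my own snippet, not from the original post):
print(sorted(['0', '1', '14', '3', '4', '53', '54', '6']))
# ['0', '1', '14', '3', '4', '53', '54', '6']  <- string sort keeps 14 in front of 3
print(sorted([0, 1, 14, 3, 4, 53, 54, 6]))
# [0, 1, 3, 4, 6, 14, 53, 54]                  <- integer sort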
Second bug: the confusion matrix in R is not complete.
In my pred_test.csv and label_test.csv, the values do not cover all possible values from 0 to 100, so the default confusion matrix in irr from R misses the values that do not appear in the data. This should be fixed.
Let's see another example.
In pred_test.csv, let's change the label from 54 to 99. Then, we run script_r.R and script_python.py again. The results are:
In R:
kappa: 0.245283
weighted_kappa: 0.443038
In Python:
kappa: 0.24528301886792447
weighted_kappa: 0.592891760904685
We can see that the weighted_kappa from irr in R is completely unchanged, while the weighted_kappa from sklearn in Python decreases from 0.83 to 0.59. So irr has made a mistake again.
The reason is that sklearn lets us pass the full set of labels to the confusion matrix, so that the confusion matrix has shape 100 x 100, whereas in irr the labels of the confusion matrix are taken from the unique values of label and pred, which misses many of the other possible values. Here this mistake gives the 53-vs-99 disagreement the same weight as the 53-vs-54 one. It would be better if the irr package offered an option to let the user provide custom labels, as sklearn does in Python.
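To make that concrete, here is a small check of my own using the second-bug data (54 replaced by 99). Without the labels argument, sklearn also builds the label set from the observed values only, so 53 and 99 become adjacent categories and the linear-weighted kappa is the same as in the 54 case (this mimics irr's observed-values-only label set, although irr additionally mis-sorts it); passing the full 0-100 range penalises the 53-vs-99 disagreement much more heavily.
import numpy as np
from sklearn.metrics import cohen_kappa_score

label = np.array([0, 1, 1, 1, 0, 14, 53, 3])
pred = np.array([0, 1, 1, 0, 3, 4, 99, 6])

# label set inferred from the data only
print(cohen_kappa_score(label, pred, weights='linear'))
# full label range 0..100
print(cohen_kappa_score(label, pred, weights='linear', labels=np.arange(101)))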
The solution from the authors is not going to work, because the kappa2 function converts your ratings into a matrix, and once you convert a factor into a matrix the levels are lost. This is the line:
ratings <- as.matrix(na.omit(ratings))
You can try it on your data; the factors are converted to character:
lvl = 0:100
ratings = data.frame(label = factor(label[,1], levels=lvl),
                     pred = factor(pred[,1], levels=lvl))
as.matrix(ratings)
label pred
[1,] "0" "0"
[2,] "1" "1"
[3,] "1" "1"
[4,] "1" "0"
[5,] "0" "3"
[6,] "14" "4"
[7,] "53" "54"
[8,] "3" "6"
Same results:
kappa2(ratings,weight="equal")
Cohen's Kappa for 2 Raters (Weights: equal)
Subjects = 8
Raters = 2
Kappa = 0.368
z = 1.79
p-value = 0.0742
I suggest using DescTools; you just need to provide the confusion matrix using the table() function in R, with the factors declared correctly as above:
library(DescTools)
CohenKappa(table(ratings$label,ratings$pred), weight="Unweighted")
[1] 0.245283
CohenKappa(table(ratings$label,ratings$pred), weight="Equal-Spacing")
[1] 0.8359909
I have sent an email to the author of the package, and he said he will fix the bug in the next update.
Details are as follows:
Actually, I am aware of this awkward behavior of the kappa2-function. This is due to the conversion and reordering of factor levels. These are actually not two bugs but only one that results in an incorrect generation of the confusion matrix (which you already found out). You can easily fix it by deleting the first row in the kappa2-function ("ratings <- as.matrix(na.omit(ratings))"). This conversion to numerical value as part of the removal of NA ratings is responsible for the error.
In general, my function needs to know the factor levels in order to correctly compute kappa. Thus, for your data, you would need to store the values as factors with the appropriate possible factor levels. E.g.
label <- c(0, 1, 1, 1, 0, 14, 53, 3)
label <- factor(label, levels=0:100)
pred <- c(0, 1, 1, 0, 3, 4, 54, 6)
pred <- factor(pred, levels=0:100)
ratings <- data.frame(label, pred)
When you now run the modified kappa2-function (i.e. without the first line), the results should be correct.
kappa2(ratings) # unweighted
kappa2(ratings, "equal") # weighted kappa with equal weights
For the next update of my package, I will take this into account.
I am doing some scientific computing and I couldn't find an elegant way of performing the following operation. Suppose I have a 2-dimensional numpy array D which stores measurements of a given quantity at several times along the day. Each row corresponds to a different measuring instrument and each column corresponds to a different moment in the day at which the measurement was done.
Consider a list of desired percentiles. For example:
quantiles = [0.25, 0.5, 0.75]
My goal is to compute the average measurement by percentile group at each moment in the day. In other words, given a column of measurements, I would like to sort the measurements from that column into groups delimited by the quantiles above and then take averages within groups. Using the example, I would have 4 groups at each moment of the day: the measurements below the 25th percentile, those between the 25th and 50th percentiles, those between the 50th and 75th percentiles, and finally those above the 75th percentile. Therefore, if m is the number of moments in the day when measurements were taken and q is the number of elements in the quantiles variable, my desired output is a (q+1) x m numpy array.
Currently, I am doing this in the most inefficient and hard-coded way possible. Here we go:
import numpy as np
import pandas as pd

quantiles = [0.25, 0.5, 0.75]
window = "30min"
moments = pd.date_range(start="9:30", end="16:00", freq=window).time
# D is the (instruments x moments) array of measurements
quantile_curves = np.zeros((len(quantiles)+1, len(moments)-1))
EmpQuantiles = np.quantile(D, quantiles, axis=0)
for moment in range(len(moments)-1):
    quantile_curves[0, moment] = np.mean(D[:, moment][D[:, moment] < EmpQuantiles[0, moment]])
    quantile_curves[1, moment] = np.mean(D[:, moment][np.logical_and(D[:, moment] > EmpQuantiles[0, moment], D[:, moment] < EmpQuantiles[1, moment])])
    quantile_curves[2, moment] = np.mean(D[:, moment][np.logical_and(D[:, moment] > EmpQuantiles[1, moment], D[:, moment] < EmpQuantiles[2, moment])])
    quantile_curves[3, moment] = np.mean(D[:, moment][D[:, moment] > EmpQuantiles[2, moment]])
What's an elegant and simpler way of doing this? I couldn't find the answer here; however, there is a related (but not identical) question in R: ddply multiple quantiles by group
I intend to plot the evolution of the in-group average along the day. I show the plot I get below; I am satisfied with the plot and I get the result I want, but I am looking for a better way of computing the quantile_curves variable.
Thanks a lot in advance!
You can do it efficiently using masked_arrays:
import numpy as np
quantiles = [0.25, 0.5, 0.75]
print('quantiles:\n', quantiles)
moments = [f'moment {i}' for i in range(5)]
print('nb of moments:\n', len(moments))
nb_measurements = 10000
D = np.random.rand(nb_measurements,len(moments))
quantile_values = np.quantile(D,quantiles,axis=0)
print('quantile_values (for each moment):\n', quantile_values)
quantile_curves = np.zeros((len(quantiles)+1,len(moments)))
quantile_curves[0, :] = np.mean(np.ma.masked_array(D, mask=D>quantile_values[[0],:]), axis=0)
for q in range(len(quantiles)-1):
    quantile_curves[q+1, :] = np.mean(np.ma.masked_array(D, mask=np.logical_or(D<quantile_values[[q],:], D>quantile_values[[q+1],:])), axis=0)
quantile_curves[len(quantiles), :] = np.mean(np.ma.masked_array(D, mask=D<quantile_values[[len(quantiles)-1],:]), axis=0)
print('mean for each group and at each moment:')
print(quantile_curves)
Output:
% python3 script.py
quantiles:
[0.25, 0.5, 0.75]
nb of moments:
5
quantile_values (for each moment):
[[0.25271343 0.25434056 0.24658732 0.24612319 0.25221014]
[0.51114344 0.50103699 0.49671249 0.49113293 0.49819521]
[0.75629377 0.75427293 0.74676209 0.74211813 0.7490436 ]]
mean for each group and at each moment
[[0.12650993 0.12823392 0.12492136 0.12200609 0.12655318]
[0.3826476 0.373516 0.37050513 0.36974876 0.37722219]
[0.63454102 0.63023986 0.62280545 0.61696283 0.6238492 ]
[0.87866019 0.87614489 0.87492553 0.87253142 0.87403426]]
Note that I'm using random values between 0 and 1, which is why the quantile values (the endpoints of the group intervals) are almost equal to the quantiles themselves. Also note that this code works for an arbitrary number of quantiles or moments.
I'm trying to generate all possible bidirectional graphs given a set of nodes. I'm storing my graphs as numpy vectors, and I need them that way because some downstream code consumes the graphs in this format.
Let's say I have two sets of nodes in which nodes in the same set do not connect. It is, however, possible to have graphs in which members of the two sets do not meet at all.
posArgs= [0,1] # (0->0) / (1->1) is not allowed (self-loops); neither is (0->1), since 0 and 1 are in the same set
negArgs= [2] # (0->2) is possible, and so is (0 2) - meaning no connection.
To illustrate what I mean, the example graph translates as:
import numpy as np

nargs = 3  # total number of nodes (len(posArgs + negArgs))
graph = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0])
singleGraph = np.vstack(np.array_split(graph, nargs))
>> [[0 0 1]
    [0 0 1]
    [0 0 0]]
# where the first row represents node 0's relationships with nodes 0, 1, 2,
# the second row represents node 1's relationships with nodes 0, 1, 2, etc.
What I want is to generate all the possible ways in which these two sets can be connected (i.e. all the possible node combinations as vectors). At the moment I am using itertools.product to generate every possible 0/1 assignment over all node pairs; I then work out which positions correspond to circular connections and same-set connections, and delete every graph that uses them. Hence, using the sets above, I have the following code:
import numpy as np
import itertools

posArgs = [0, 1]  # (0->0) / (1->1) is not allowed; neither is (0->1)
negArgs = [2]
nargs = len(posArgs + negArgs)

allPermutations = np.array(list(itertools.product([0, 1], repeat=nargs*nargs)))

# Create a list of attacks that we will never need: circular attacks, and
# attacks between arguments of the same polarity
circularAttacks = (np.arange(0, nargs*nargs, nargs+1)).tolist()
samePolarityAttacks = []
posList = list(itertools.permutations(posArgs, 2))
negList = list(itertools.permutations(negArgs, 2))
totList = posList + negList
for l in totList:
    ptn = ((l[0]+1)*nargs) - ((nargs+1) - l[1]) + 1  # all the odd +1 are to account for the shift in 0 index
    samePolarityAttacks.append(ptn)

graphsToDelete = np.unique([circularAttacks + samePolarityAttacks])
subGraphs = allPermutations[:, graphsToDelete]
cutDownGraphs = np.delete(allPermutations, (np.where(subGraphs > 0)[0]).tolist(), axis=0)

for graph in cutDownGraphs:
    singleGraph = np.vstack(np.array_split(graph, nargs))
    print(singleGraph)
My problem is that when my two sets together contain 5 or more nodes, itertools.product tries to produce at least 2^25 vectors. This is of course really expensive and leaves me with memory errors.
Are you aware of a smart way in which I can reshape this code while ensuring my graphs stay in this numpy array format?
--Additional info:
For two sets, with one node per set, all possible combinations look as follows:
Thanks
So I realize this is both a theoretical question and a coding question, but say I have a list of 10 labels (x1, x2, ..., x10) and their corresponding "location" vectors (v1, v2, ..., v10).
I want to collapse them based on their L2-norm distance from each other. For example, if v1 is close to v10, then relabel all x10's as x1's and so on.
So the end result could hypothetically look like the new labels (x1, x3, x7, x8). Is there a way to smartly just turn this into (x1', x2', x3', x4'), so that people don't get confused and assume the new labels are the same as the old ones?
Given:
labels = vector of Nx1 that has all the labels (1,2,3...,10)
Example Code:
epsilon = 0.2  # defines distance
change = []    # initialize list of label pairs to merge
# distancematrix is an NxN matrix of the pairwise distances between all our vectors (v1, ..., v10)
for i in range(0, len(distancematrix)):
    for j in range(0, len(distancematrix)):
        # add all pairs of labels that are "close", so that we may relabel
        if i != j and distancematrix[i, j] < epsilon:
            change.append((i, j))
This will produce a list of pairs that I want to relabel. Is there a smart way of rewriting 'labels' so that it merges all the pairs I want to merge AND keeps the labels that were not part of any merge, then reorganizes the result to run over (1, 2, 3, 4) if I merge 6 pairs of numbers (10 - 6 = 4)?
Thank you. I realize this is somewhat of a weird problem, so if you have questions please let me know!
This actually does the job for me.
# creates a list of numbers from 0 to the length of your newlabels vector
changeto = [i for i in range(0, len(np.unique(newlabels)))]
# get the unique values of your newlabels (e.g. 0, 3, 4, 5, 10)
currentlabels = np.unique(newlabels)
# change all your labels to your new mapping (e.g. 0 -> 0, 3 -> 1, 4 -> 2, etc.)
for i in range(0, len(changeto)):
    if currentlabels[i] != changeto[i]:
        # change the 'states' in newlabels to the new label
        newlabels = [changeto[i] if x == currentlabels[i] else x for x in newlabels]
Maybe it's not pretty, but you map your new labels onto the line 0, 1, 2,...x, where x is the length of your new condensed label vector.
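For reference, numpy can do this remapping in one call via np.unique with return_inverse (a small example of my own):
import numpy as np

newlabels = np.array([0, 0, 3, 3, 4, 10, 10, 5])
# return_inverse gives, for each element, its index into the sorted unique
# values, i.e. the labels remapped onto 0, 1, ..., k-1 in one step.
_, condensed = np.unique(newlabels, return_inverse=True)
print(condensed)  # [0 0 1 1 2 4 4 3]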
What if a label is not involved in any merge? Do you want to keep the original label? If so, what if that label is outside the new range?
Overall, I think that this is simply generating new labels given only the quantity of labels:
new_label_list = ["x"+str(n+1)+"'" for n in range(len(change))]
For change of length 4, this gives you
["x1'", "x2'", "x3'", "x4'"]
Do you see how the new label is built?
leading "x"
string version of the index, 1 .. length
trailing prime character
I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.
For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.
So far I've
imported the csv showing the weights as an array, masking values of 0, and
created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.
I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.
I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!
Update: here's some code for what I've done so far:
#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter=",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

#Mask values ==0
maTest = np.ma.masked_equal(nArray, 0)

#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]
massArr = []
for i in range(rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)
nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
What we can do, if I understood your problem correctly, is sum up the observations; dividing that sum by 2 gives us the observation number corresponding to the median. From there we need to figure out which observation that number falls on.
One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.
Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation (we get 5 by dividing the last element by 2).
Now, looking at the cumsum result, we can easily see that this must be the observation between the second and third elements (cumulative counts 3 and 6).
So all we need to do is figure out the index where the median (5) would fit.
np.searchsorted does exactly what we need. It will find the index at which to insert an element into an array so that the array stays sorted.
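For example (my own illustration), finding where the median count 5 from above fits:
np.searchsorted([1, 3, 6, 10], 5) -> 2
That is index 2, i.e. the third element, which matches the reasoning above.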
The code to do it looks like this:
import numpy as np

# my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10, 20, 30, 40], [100, 10, 10, 10], [1, 1, 1, 100]])

c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices]  # correct if the masses are indeed 0, 10, 20, ...

# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis]/2.0)))
Output will be:
median masses is: [10 20 20 0 30]
[[ 30 191 9 0] <- The test data
[ 10 20 300 10]
[ 10 20 30 40]
[100 10 10 10]
[ 1 1 1 100]]
[[ 30. 221. 230. 230. 115. ] <- cumsum results with median added to the end.
[ 10. 30. 330. 340. 170. ] you can see from this where they fit in.
[ 10. 30. 60. 100. 50. ]
[ 100. 110. 120. 130. 65. ]
[ 1. 2. 3. 103. 51.5]]
wquantiles is a small python package that will do exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).
def weighted_median(values, weights):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]
Using argsort lets us maintain the alignment between the two arrays without changing or copying their content. It should be straightforward to extend it to an arbitrary number of arbitrary quantiles.
Update
Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:
def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]
This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type is equivalent to what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values, you can indeed recover the requested quantiles quite closely:
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424,0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])
Update 2
In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be fairly easily added by calculating the midpoint between two numbers in the case where the weight mass is equally (or quantile/1-quantile) divided between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).
def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    q = np.searchsorted(c, quantiles * c[-1])
    return np.where(c[q]/c[-1] == quantiles, 0.5 * (values[i[q]] + values[i[q+1]]), values[i[q]])
This function will fail with arrays smaller than 2 (the original would handle non-empty arrays).
>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)
Note that this extension is fairly unlikely to be needed when working with actual data sets, where we typically have (a) large data sets and (b) real-valued weights, which make the odds of ending up exactly at a quantile edge very long (and when it does happen, it is probably due to rounding errors). It is included for completeness nonetheless.
I ended up writing this function, based on @muzzle's and @maesers' replies:
def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()
    if interpolate:
        Pn = (Sn - sorted_weights/2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
The difference between interpolate True and False is as follows:
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5
(there is no difference for uneven arrays such as [1, 2, 3, 4, 5])
Speed tests show it is just as performant as @maesers' function in the uninterpolated case, and twice as performant in the interpolated case.
Sharing some code that I got some help with. This allows you to run stats on each column of an Excel spreadsheet.
import xlrd
import sys
import csv
import numpy as np
import itertools
from itertools import chain

book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)       # first column has mass
age_headers = sh.row_values(0, start_colx=1)  # first row has age ranges

# collect one column of counts per age range
count = 1
ages = []
for a in age_headers:
    ages.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in age_headers:
    # pair each mass with its count for this age range
    age_mass = zip(masses, ages[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)
    # separate into one big list
    medianlist = [x for t in expanded for x in t]
    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)
    median = np.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)
    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)
    stats.append(stats1)

np.savetxt(ofile, stats, fmt="%d")