Is weighted Kappa calculated by the `irr` package in R wrong?

I found that the irr package has two serious bugs in its calculation of weighted kappa.
Please tell me whether these two bugs are really there or I have misunderstood something.
You can reproduce them with the following examples.
First bug: the sorting of labels in the confusion matrix needs to be corrected.
I have two sets of scores for disease extent (from 0 to 100, where 0 is healthy and 100 is extremely ill).
In label_test.csv (you can copy and paste the data into a file to reproduce the test):
0
1
1
1
0
14
53
3
In pred_test.csv:
0
1
1
0
3
4
54
6
In script_r.R:
library(irr)
label <- read.csv('label_test.csv',header=FALSE)
pred <- read.csv('pred_test.csv',header=FALSE)
kapp <- kappa2(data.frame(label,pred),"unweighted")
kappa <- getElement(kapp,"value")
print(kappa) # output: 0.245283
w_kapp <- kappa2(data.frame(label,pred),"equal")
weighted_kappa <- getElement(w_kapp,"value")
print(weighted_kappa) # output: 0.443038
When I use Python to calculate kappa and weighted kappa, in script_python.py:
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
label = pd.read_csv('label_test.csv', header=None).to_numpy().ravel()
pred = pd.read_csv('pred_test.csv', header=None).to_numpy().ravel()
kappa = cohen_kappa_score(label.astype(int), pred.astype(int))
print(kappa) # output: 0.24528301886792447
weighted_kappa = cohen_kappa_score(label.astype(int), pred.astype(int), weights='linear', labels=np.array(list(range(100))))
print(weighted_kappa) # output: 0.8359908883826879
We can see that the unweighted kappa calculated by R and Python is the same, but the weighted kappa from R is far lower than the weighted kappa from sklearn in Python. Which one is wrong? After two days of research, I found that the weighted kappa from the irr package in R is wrong. Details are as follows.
During debugging, we can inspect the confusion matrix that irr builds internally:
We can see that the order is wrong. The order of the labels should be changed from [0, 1, 14, 3, 4, 53, 54, 6] to [0, 1, 3, 4, 6, 14, 53, 54], as it is in Python. It seems that the irr package used a string-based sort instead of an integer-based sort, which puts 14 before 3. This mistake could and should be corrected easily.
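A quick way to see the difference between the two orderings (a minimal Python sketch, not part of either package; the values are the unique scores from the files above):
# Unique scores appearing in label_test.csv and pred_test.csv
values = [0, 1, 3, 4, 6, 14, 53, 54]
print(sorted(values, key=str))  # [0, 1, 14, 3, 4, 53, 54, 6]  <- string-based order
print(sorted(values))           # [0, 1, 3, 4, 6, 14, 53, 54]  <- integer-based order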
Second bug: the confusion matrix is not complete in R.
In my pred_test.csv and label_test.csv, the values cannot cover all possible values from 0 to 100, so the default confusion matrix in irr misses the values that do not appear in the data. This should be fixed.
Let's see another example.
In pred_test.csv, change the value 54 to 99. Then run script_r.R and script_python.py again. The results are:
In R:
kappa: 0.245283
weighted_kappa: 0.443038
In Python:
kappa: 0.24528301886792447
weighted_kappa: 0.592891760904685
We can see that the weighted kappa from irr in R does not change at all, while the weighted kappa from sklearn in Python drops from 0.83 to 0.59. So irr gets it wrong again.
The reason is that sklearn lets us pass the full set of labels to the confusion matrix, so that the matrix covers every possible score, whereas in irr the labels of the confusion matrix are derived from the unique values of label and pred, which misses many other possible values. This mistake assigns the same weight to 53 and 99 here. It would be better if the irr package offered an option that lets users provide custom labels, as sklearn does in Python.
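As a concrete illustration of passing the full label set (a minimal sketch, not from the original post; it reuses the csv files above together with sklearn's confusion_matrix):
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

label = pd.read_csv('label_test.csv', header=None).to_numpy().ravel().astype(int)
pred = pd.read_csv('pred_test.csv', header=None).to_numpy().ravel().astype(int)

# Passing every possible score (0-100) forces a complete confusion matrix, so scores
# that never occur in the data still get their proper distance-based weight.
cm = confusion_matrix(label, pred, labels=np.arange(101))
print(cm.shape)  # (101, 101)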

The solution from the author is not going to work, because the kappa2 function converts your ratings into a matrix, and once you convert a factor into a matrix the levels are lost. This is the line:
ratings <- as.matrix(na.omit(ratings))
You can try it on your data; the factors are converted into characters:
lvl = 0:100
ratings = data.frame(label = factor(label[,1],levels=lvl),
pred = factor(pred[,1],levels=lvl))
as.matrix(ratings)
label pred
[1,] "0" "0"
[2,] "1" "1"
[3,] "1" "1"
[4,] "1" "0"
[5,] "0" "3"
[6,] "14" "4"
[7,] "53" "54"
[8,] "3" "6"
Same results:
kappa2(ratings,weight="equal")
Cohen's Kappa for 2 Raters (Weights: equal)
Subjects = 8
Raters = 2
Kappa = 0.368
z = 1.79
p-value = 0.0742
I suggest using DescTools instead: you just need to provide the confusion matrix via R's table() function, with the factors declared correctly as above:
library(DescTools)
CohenKappa(table(ratings$label,ratings$pred), weight="Unweighted")
[1] 0.245283
CohenKappa(table(ratings$label,ratings$pred), weight="Equal-Spacing")
[1] 0.8359909

I have sent an email to the author of the package, and he said he will fix the bug in the next update.
Details are as follows:
Actually, I am aware of this awkward behavior of the kappa2-function.
This is due to the conversion and reordering of factor levels. These
are actually not two bugs but only one that results in an incorrect
generation of the confusion matrix (which you already found out). You
can easily fix it by deleting the first row in the kappa2-function
("ratings <- as.matrix(na.omit(ratings))"). This conversion to
numerical value as part of the removal of NA ratings is responsible
for the error.
In general, my function needs to know the factor levels in order to
correctly compute kappa. Thus, for your data, you would need to store
the values as factors with the appropriate possible factor levels.
E.g.
label <- c(0, 1, 1, 1, 0, 14, 53, 3)
label <- factor(label, levels=0:100)
pred <- c(0, 1, 1, 0, 3, 4, 54, 6)
pred <- factor(pred, levels=0:100)
ratings <- data.frame(label,pred)
When you now run the modified kappa2-function (i.e. without the first
line), the results should be correct.
kappa2(ratings)          # unweighted
kappa2(ratings, "equal") # weighted kappa with equal weights
For the next update of my package, I will take this into account.

Related

Calculating first and second derivative when coefficients stored in a csv file in python

I have a csv file with around 1000 regression results looking like this:
x^4_coeff x^3_coeff x^2_coeff x_coeff intercept
10 -.43 0.05 12 298
from the first set of coefficients I get an equation of:
10x^4 - 0.43x^3 + 0.05x^2 + 12x + 298
I want to automate calculating a first derivative which will be:
40x^3 - 1.29x^2 + 0.1x + 12
Then I would like to set this equation to 0 and find all its roots.
After that I would like to get the second derivative, which in this case would be:
120x^2 - 2.58x + 0.1, and find both roots of this function.
I would like to store the results in a csv file for comparison; the point is to find out whether there are commonalities across all 1000 regressions and what the difference is between the roots of the first and second derivatives of these equations.
I haven't calculated the roots manually, so these values are dummies, but I hope you get the point:
fd_root1 fd_root2 fd_root3 sd_root1 sd_root2
10 20 25 13 15
and do this for all my 1000 regression results. Is there a quick way to do this in Python? What I have done so far was generate those 1000 regression outputs in Stata (which I don't know really well) and save the output to a csv file, thinking it would be easier to carry on in Python.
Thanks for your help!
Here's a sample script for calculating derivatives and roots of the polynomials you have. I didn't include csv reading/writing because I wasn't sure about the exact format you were working with.
from sympy import Symbol, Poly

# Define symbols
x = Symbol("x")

# Add csv reading here
input_rows = [
    [2.234, 0, 0.523, 2.3123, 4.123],
    [2, 2, 2, 2, 2]]

output_rows = []

# Iterate over each row
for r in input_rows:
    # Create polynomial from coefficients
    y = Poly(r, x)
    print(y)
    # 1st derivative and its roots
    y_dx = y.diff(x)
    y_dx_roots = y_dx.nroots()
    # 2nd derivative and its roots
    y_ddx = y_dx.diff(x)
    y_ddx_roots = y_ddx.nroots()
    # Write results to list of dicts
    output_rows.append({
        "1st deriv": y_dx.all_coeffs(),
        "2nd deriv": y_ddx.all_coeffs(),
        "1st deriv roots": y_dx_roots,
        "2nd deriv roots": y_ddx_roots})

print(*output_rows, sep="\n")
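If your csv really has the columns shown in the question (x^4_coeff, x^3_coeff, x^2_coeff, x_coeff, intercept), the reading/writing part could look roughly like this (a sketch with an assumed column order and hypothetical file names):
import pandas as pd
from sympy import Symbol, Poly

x = Symbol("x")
coeffs = pd.read_csv("regressions.csv")  # hypothetical input file, one regression per row

results = []
for _, row in coeffs.iterrows():
    y = Poly(row.tolist(), x)   # assumes highest-order coefficient comes first
    y_dx = y.diff(x)            # first derivative
    y_ddx = y_dx.diff(x)        # second derivative
    results.append({
        "fd_roots": y_dx.nroots(),   # numerical roots of the first derivative
        "sd_roots": y_ddx.nroots(),  # numerical roots of the second derivative
    })

pd.DataFrame(results).to_csv("derivative_roots.csv", index=False)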
import pandas as pd
import numpy as np

d = {'coeff1': [2.3, 1], 'coeff2': [-5.3, -8.1], 'coeff3': [-13.2, -111.2], 'coeff4': [-5, -12], 'intercept': [150, 200]}
df = pd.DataFrame(data=d)
df["root1"] = np.nan
df["root2"] = np.nan

for row in df.index:
    p = np.poly1d([df['coeff1'][row], df['coeff2'][row], df['coeff3'][row], df['coeff4'][row], df['intercept'][row]])
    # showing only second derivative roots to make the point
    df.loc[row, "root1"] = p.deriv().deriv().roots.item(0).real
    df.loc[row, "root2"] = p.deriv().deriv().roots.item(1).real

# print results
print(df)

fisher's linear discriminant in Python

I have Fisher's linear discriminant and I need to use it to reduce my examples A and B, which are high-dimensional matrices, to 2D, exactly like LDA. Each example has classes A and B, so if I had a third example it would also have classes A and B, and the same for a fourth, fifth, and nth example; I would like to separate them with a simple application of Fisher's linear discriminant. I'm pretty new to machine learning, so I don't know how to separate my classes; I've been following the formula by eye and coding as I go. From what I have read, I need to apply a linear transformation to my data so I can find a good threshold for it, but first I need to find the maximization function. For that task, I managed to find Sw and Sb, but I don't know how to go on from there...
I also need to find the maximization function, which gives me an eigenvalue solution.
What I have for each class are 5x2 matrices, for 2 examples. For instance:
Example 1
Class_A = [
201, 103,
40, 43,
23, 50,
12, 123,
99, 78
]
Class_B = [
201, 129,
114, 195,
180, 90,
69, 62,
76, 90
]
Example 2
Class_A = [
68, 98,
201, 203,
78, 212,
49, 5,
204, 78
]
Class_B = [
52, 19,
220, 219,
159, 195,
99, 23,
46, 50
]
I tried finding Sw for the example above like this:
Example_1_Class_A = np.dot(Example_1_Class_A, np.transpose(Example_1_Class_A))
Example_1_Class_B = np.dot(Example_1_Class_B, np.transpose(Example_1_Class_B))
Example_2_Class_A = np.dot(Example_2_Class_A, np.transpose(Example_2_Class_A))
Example_2_Class_B = np.dot(Example_2_Class_B, np.transpose(Example_2_Class_B))
Sw = np.sum([Example_1_Class_A, Example_1_Class_B, Example_2_Class_A, Example_2_Class_B], axis=0)
As for Sb, I tried this:
Example_1_Class_A_mean = Example_1_Class_A.mean(axis=0)
Example_1_Class_B_mean = Example_1_Class_B.mean(axis=0)
Example_2_Class_A_mean = Example_2_Class_A.mean(axis=0)
Example_2_Class_B_mean = Example_2_Class_B.mean(axis=0)
Example_1_Class_A_Sb = np.dot(Example_1_Class_A_mean, np.transpose(Example_1_Class_A_mean))
Example_1_Class_B_Sb = np.dot(Example_1_Class_B_mean, np.transpose(Example_1_Class_B_mean))
Example_2_Class_A_Sb = np.dot(Example_2_Class_A_mean, np.transpose(Example_2_Class_A_mean))
Example_2_Class_B_Sb = np.dot(Example_2_Class_B_mean, np.transpose(Example_2_Class_B_mean))
Sb = np.sum([Example_1_Class_A_Sb, Example_1_Class_B_Sb, Example_2_Class_A_Sb, Example_2_Class_B_Sb], axis=0)
The problem is, I have no idea what else to do with my Sw and Sb; I am completely lost. Basically, what I need to do is get from here to this:
How, for given Example A and Example B, do I separate a cluster for class A only and another for class B only?
Before answering your question, I will first touch on the basic difference between PCA and (F)LDA. In PCA you don't know anything about the underlying classes, but you assume that the information about class separability lies in the variance of the data. So you rotate your original axes (sometimes this is called projecting all the data onto new axes) in such a way that your first new axis points in the direction of most variance, the second one is perpendicular to the first and points in the direction of most residual variance, and so on. This way a PCA transformation results in a (sub)space of the same dimensionality as the original one. Then you can take only the first 2 dimensions, rejecting the rest, hence getting a dimensionality reduction from k dimensions to only 2.
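For contrast, here is a minimal PCA-by-eigendecomposition sketch (not part of the original answer; the data is synthetic), just to make the "rotate the axes, then keep the first two" idea concrete:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # hypothetical data, 5 features
Xc = X - X.mean(axis=0)                   # center the data
cov = np.cov(Xc.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric
order = np.argsort(eig_vals)[::-1]        # sort axes by decreasing variance
W = eig_vecs[:, order[:2]]                # keep the first two principal axes
X_2d = Xc.dot(W)                          # project: dimensionality reduction to 2D
print(X_2d.shape)                         # (100, 2)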
LDA works a bit differently. In this case you know in advance how many classes there are in your data, and you can find their mean and covariance matrices. What the Fisher criterion does is find a direction in which the distance between class means is maximized, while at the same time total variability is minimized (total variability being the mean of the within-class covariance matrices). And for each pair of classes there is only one such line. This is why, when your data has C classes, LDA can provide you with at most C-1 dimensions, regardless of the original data dimensionality. In your case this means that since you have only 2 classes, A and B, you will get a one-dimensional projection, i.e. a line. And this is exactly what you have in your picture: the original 2d data is projected onto a line. The direction of the line is the solution of the eigenproblem.
Let's generate data that is similar to your picture:
import numpy as np
import matplotlib.pyplot as plt

a = np.random.multivariate_normal((1.5, 3), [[0.5, 0], [0, .05]], 30)
b = np.random.multivariate_normal((4, 1.5), [[0.5, 0], [0, .05]], 30)
plt.plot(a[:,0], a[:,1], 'b.', b[:,0], b[:,1], 'r.')

mu_a, mu_b = a.mean(axis=0).reshape(-1,1), b.mean(axis=0).reshape(-1,1)
Sw = np.cov(a.T) + np.cov(b.T)
inv_S = np.linalg.inv(Sw)
res = inv_S.dot(mu_a-mu_b)  # the trick

####
# more general solution
#
# Sb = (mu_a-mu_b)*((mu_a-mu_b).T)
# eig_vals, eig_vecs = np.linalg.eig(inv_S.dot(Sb))
# res = sorted(zip(eig_vals, eig_vecs), reverse=True)[0][1]  # take only the eigenvector corresponding to the largest (and only) eigenvalue
# res = res / np.linalg.norm(res)

plt.plot([-res[0], res[0]], [-res[1], res[1]])  # this is the solution
plt.plot(mu_a[0], mu_a[1], 'cx')
plt.plot(mu_b[0], mu_b[1], 'yx')
plt.gca().axis('square')

# let's project the data points onto it
r = res.reshape(2,)
n2 = np.linalg.norm(r)**2
for pt in a:
    prj = r * r.dot(pt) / n2
    plt.plot([prj[0], pt[0]], [prj[1], pt[1]], 'b.:', alpha=0.2)
for pt in b:
    prj = r * r.dot(pt) / n2
    plt.plot([prj[0], pt[0]], [prj[1], pt[1]], 'r.:', alpha=0.2)
The resulting projection is calculated using a neat trick for the two-class problem. You can read the details on it here in section 1.6.
Regarding the "examples" you mention in your question: I believe you need to repeat the process for each example, as each is a different set of data points, probably with different distributions. Also note that the estimated means (mu_a, mu_b) and class covariance matrices will be slightly different from the ones the data was generated with, especially for small sample sizes.
Mathematics
See https://sebastianraschka.com/Articles/2014_python_lda.html#lda-in-5-steps for more information.
Implementation using Iris
Since you want to use LDA for dimensionality reduction but provide only 2d data, I am showing how to perform this procedure on the iris dataset.
Let's import libraries
import pandas as pd
import numpy as np
import sklearn as sk
from collections import Counter
from sklearn import datasets
# load dataset and transform to pandas df
X, y = datasets.load_iris(return_X_y=True)
X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(4)])
y = pd.DataFrame(y, columns=['labels'])
tot = pd.concat([X,y], axis=1)
# calculate class means
class_means = tot.groupby('labels').mean()
total_mean = X.mean()
The class_means are given by:
class_means
feat_0 feat_1 feat_2 feat_3
labels
0 5.006 3.428 1.462 0.246
1 5.936 2.770 4.260 1.326
2 6.588 2.974 5.552 2.026
To do this, we first subtract the corresponding class mean from each observation (basically we calculate x - m_i from the equation above):
x_mi = tot.transform(lambda x: x - class_means.loc[x['labels']], axis=1).drop('labels', axis=1)
def kronecker_and_sum(df, weights):
    S = np.zeros((df.shape[1], df.shape[1]))
    for idx, row in df.iterrows():
        x_m = row.to_numpy().reshape(df.shape[1], 1)
        S += weights[idx] * np.dot(x_m, x_m.T)
    return S

# Each x_mi is weighted with 1. Now we use the kronecker_and_sum function to calculate the within-class scatter matrix S_w
S_w = kronecker_and_sum(x_mi, 150*[1])

mi_m = class_means.transform(lambda x: x - total_mean, axis=1)
# Each mi_m is weighted with the number of observations per class, which is 50 for each class in this example. We use kronecker_and_sum to calculate the between-class scatter matrix.
S_b = kronecker_and_sum(mi_m, 3*[50])
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_w).dot(S_b))
We only need to consider the eigenvalues that are noticeably different from zero (in this case only the first two).
eig_vals
array([ 3.21919292e+01, 2.85391043e-01, 6.53468167e-15, -2.24877550e-15])
Transform X with the matrix of the two eigenvectors that correspond to the highest eigenvalues:
W = eig_vecs[:, :2]
X_trafo = np.dot(X, W)
tot_trafo = pd.concat([pd.DataFrame(X_trafo, index=range(len(X_trafo))), y], axis=1)
# plot the result
tot_trafo.plot.scatter(x=0, y=1, c='labels', colormap='viridis')
We have reduced the dimensions from 4 to 2 and chosen the space in such a way that the classes can be well separated.
Scikit-learn usage
Scikit-learn has LDA support as well. What we did in dozens of lines can be done with the following lines of code:
from sklearn import discriminant_analysis
lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2)
X_trafo_sk = lda.fit_transform(X,y)
pd.DataFrame(np.hstack((X_trafo_sk, y))).plot.scatter(x=0, y=1, c=2, colormap='viridis')
I'm not giving a plot here, because it is the same as in our derived example (except for a 180 degree rotation).

statsmodels.tsa.stattools.coint() | Critial Values are Negative, P Value is Positive

So I am running a co-integration test and the documentation says,
If the pvalue is small, below a critical size, then we can reject the
hypothesis that there is no cointegrating relationship.
My p values and critical sizes are as follows:
7.720961017991229e-07, array([-3.89753487, -3.33674071, -3.04487389]))
I have read that in an ADF test a positive value means it is cointegrated; however, is that the case for this test (which uses the two-step Engle-Granger method)?
If the critical values are negative, do we want to see the p-value larger than them, and vice versa if they are positive?
To test this I have made two time series that I think should definitely be cointegrated:
import numpy
XR = numpy.random.normal(0, 1, 100)
noise = numpy.random.normal(0, 1, 100)
YR = XR + 5 + noise
These are the results:
1.8668081271228297e-12, array([-4.01048603, -3.39854434, -3.08756793]))
So I am assuming that if it's above 0 and the critical values are below it, the series are cointegrated? Please correct me if I am wrong.
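No answer is preserved here, but as a point of reference: coint() returns three values, and it is the first one (the t-statistic, not the p-value) that is compared against the negative critical values, while the p-value can be read directly. A minimal sketch of that reading, using the same synthetic series as above:
import numpy as np
from statsmodels.tsa.stattools import coint

XR = np.random.normal(0, 1, 100)
noise = np.random.normal(0, 1, 100)
YR = XR + 5 + noise

t_stat, p_value, crit_values = coint(XR, YR)  # crit_values are for the 1%, 5% and 10% levels
print(t_stat, p_value, crit_values)

# The t-statistic is what gets compared to the (negative) critical values:
# if it is more negative than the 5% critical value, reject the null of no
# cointegration at that level; the p-value gives the same conclusion directly.
if t_stat < crit_values[1]:  # equivalent to p_value < 0.05
    print("reject the null of no cointegration at the 5% level")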

K-means clustering on 3 dimensions with sklearn

I'm trying to cluster data using lat/lon as X/Y axes and DaysUntilDueDate as my Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful but I don't know if it's taking the Z-axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the parameters of the iloc bit of this line:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(A.iloc[:, :])
I tried changing this part to iloc[1:4] (to only work on columns 1-3) but that resulted in the following error:
ValueError: n_samples=3 should be >= n_clusters=4
So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?
Here's my python file, thanks for your help:
from sklearn.cluster import KMeans
import csv
import pandas as pd
# Import csv file with data in following columns:
# [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]
df = pd.read_csv('point_data_test.csv',index_col=['PM'])
numProjects = len(df)
K = numProjects // 3 # Around three projects can be worked per day
print("Number of projects: ", numProjects)
print("K-clusters: ", K)
for k in range(1, K):
# Create a kmeans model on our data, using k clusters.
# Random_state helps ensure that the algorithm returns the
# same results each time.
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
# These are our fitted labels for clusters --
# the first cluster has label 0, and the second has label 1.
labels = kmeans_model.labels_
# Sum of distances of samples to their closest cluster center
SSE = kmeans_model.inertia_
print("k:",k, " SSE:", SSE)
# Add labels to df
df['Labels'] = labels
#print(df)
df.to_csv('test_KMeans_out.csv')
It seems the issue is with the syntax of iloc[1:4].
From your question it appears you changed:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
to:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])
It seems to me that either you have a typo or you don't understand how iloc works. So I will explain.
You should start by reading Indexing and Selecting Data from the pandas documentation.
But in short .iloc is an integer based indexing method for selecting data by position.
Let's say you have the dataframe:
A B C
1 2 3
4 5 6
7 8 9
10 11 12
The use of iloc in the example you provided, iloc[:, :], selects all rows and columns and produces the entire dataframe. In case you aren't familiar with Python's slice notation, take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error, iloc[1:4], selects the rows at index 1-3. This would result in:
A B C
4 5 6
7 8 9
10 11 12
Now, if you think about what you are trying to do and the error you received, you will realize that you have selected fewer samples from your data than you are looking for clusters: 3 samples (rows 1, 2, 3), but you're telling KMeans to find 4 clusters, which just isn't possible.
What you really intended to do (as I understand it) was to select all rows and the columns 1-3 that correspond to your lat, lng, and z values. To do this, just add a colon as the first argument to iloc, like so:
df.iloc[:, 1:4]
Now you will have selected all of your samples and the columns at index 1, 2, and 3. Now, assuming you have enough samples, KMeans should work as you intended.
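To make the slicing concrete, here is a tiny runnable illustration using the toy frame from above (hypothetical data, not your csv):
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({'A': [1, 4, 7, 10], 'B': [2, 5, 8, 11], 'C': [3, 6, 9, 12]})
print(df.iloc[1:4])     # rows 1-3 only -> just 3 samples, too few for 4 clusters
print(df.iloc[:, 1:4])  # all 4 rows, columns from index 1 up to (but not including) 4

KMeans(n_clusters=4, random_state=1).fit(df.iloc[:, 1:4])  # n_samples >= n_clusters, so this runs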

NumPy or SciPy to calculate weighted median

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.
For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.
So far I've
imported the csv showing the weights as an array, masking values of 0, and
created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.
I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.
I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!
Update: here's some code for what I've done so far:
#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter = ",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)
#Mask values ==0
maTest = np.ma.masked_equal(nArray,0)
#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]

massArr = []
for i in range(rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)

nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
What we can do, if I understood your problem correctly, is sum up the observations; dividing that total by 2 gives us the observation number corresponding to the median. From there we need to figure out which observation this number falls on.
One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.
Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Now looking at the cumsum result, we can easily see that it must be the observation between the second and third elements (cumulative counts 3 and 6).
So all we need to do is figure out the index of where the median (5) would fit.
np.searchsorted does exactly what we need. It will find the index at which to insert an element into an array so that it stays sorted.
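For example (a quick check using the cumulative sums from above):
import numpy as np

c = np.cumsum([1, 2, 3, 4])             # array([ 1,  3,  6, 10])
print(np.searchsorted(c, c[-1] / 2.0))  # 2 -> the 5th observation falls in the third bin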
The code to do it looks like this:
import numpy as np
#my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10,20,30,40], [100,10,10,10], [1,1,1,100]])
c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices] #Correct if the masses are indeed 0, 10, 20,...
# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis]/2.0)))
Output will be:
median masses is: [10 20 20 0 30]
[[ 30 191 9 0] <- The test data
[ 10 20 300 10]
[ 10 20 30 40]
[100 10 10 10]
[ 1 1 1 100]]
[[ 30. 221. 230. 230. 115. ] <- cumsum results with median added to the end.
[ 10. 30. 330. 340. 170. ] you can see from this where they fit in.
[ 10. 30. 60. 100. 50. ]
[ 100. 110. 120. 130. 65. ]
[ 1. 2. 3. 103. 51.5]]
wquantiles is a small python package that will do exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).
def weighted_median(values, weights):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]
Using argsort lets us maintain the alignment between the two arrays without changing or copying their content. It should be straightforward to extend it to an arbitrary number of arbitrary quantiles.
Update
Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:
def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]
This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type matches what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values, the quantiles are approximated quite well:
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424,0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])
Update 2
In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be fairly easily added by calculating the midpoint between two numbers in the case where the weight mass is divided equally (or quantile/1-quantile) between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).
def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    q = np.searchsorted(c, quantiles * c[-1])
    return np.where(c[q]/c[-1] == quantiles, 0.5 * (values[i[q]] + values[i[q+1]]), values[i[q]])
This function will fail with arrays smaller than 2 (the original would handle non-empty arrays).
>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)
Note that this extension is fairly unlikely to be needed when working with actual data sets, where we typically have (a) large data sets and (b) real-valued weights that make the odds of ending up exactly at a quantile edge very long, and when it does happen it is probably due to rounding errors. Including it for completeness nonetheless.
I ended up writing this function based on @muzzle's and @maesers' replies:
def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()
    if interpolate:
        Pn = (Sn - sorted_weights/2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
The difference between interpolate True and False is as follows:
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5
(there is no difference for uneven arrays such as [1, 2, 3, 4, 5])
Speed tests show it is just as performant as @maesers' function in the uninterpolated case, and twice as performant in the interpolated case.
Sharing some code that I got help with. This allows you to run stats on each column of an Excel spreadsheet.
import itertools
import numpy as np
import xlrd

book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)  # first column has mass
ages = sh.row_values(0, start_colx=1)    # first row has age ranges

# read the count column for each age range
count = 1
age = []
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in ages:
    # create a tuple with the mass vector
    age_mass = zip(masses, age[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)
    # flatten into one big list
    medianlist = [x for t in expanded for x in t]
    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)
    median = np.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)
    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)
    stats.append(stats1)

np.savetxt(ofile, stats, fmt="%d")
