Related
I found that the irr package has two serious bugs in the calculation of weighted kappa.
Please tell me whether the two bugs are really there or whether I have misunderstood something.
You can replicate the bugs using the following examples.
First bug: the labels of the confusion matrix are sorted incorrectly.
I have two sets of scores (label and prediction) for disease extent (from 0 to 100, where 0 is healthy and 100 is extremely ill).
In label_test.csv (you can copy and paste the data to your disk to reproduce the test):
0
1
1
1
0
14
53
3
In pred_test.csv:
0
1
1
0
3
4
54
6
In script_r.R:
library(irr)
label <- read.csv('label_test.csv',header=FALSE)
pred <- read.csv('pred_test.csv',header=FALSE)
kapp <- kappa2(data.frame(label,pred),"unweighted")
kappa <- getElement(kapp,"value")
print(kappa) # output: 0.245283
w_kapp <- kappa2(data.frame(label,pred),"equal")
weighted_kappa <- getElement(w_kapp,"value")
print(weighted_kappa) # output: 0.443038
When I use Python to calculate the kappa and weighted_kappa, in script_python.py:
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

label_file, pred_file = 'label_test.csv', 'pred_test.csv'
label = pd.read_csv(label_file, header=None).to_numpy()
pred = pd.read_csv(pred_file, header=None).to_numpy()
kappa = cohen_kappa_score(label.astype(int), pred.astype(int))
print(kappa) # output: 0.24528301886792447
weighted_kappa = cohen_kappa_score(label.astype(int), pred.astype(int), weights='linear', labels=np.array(list(range(100))) )
print(weighted_kappa) # output: 0.8359908883826879
We can see that the unweighted kappa calculated by R and Python is the same, but the weighted kappa from R is far lower than the weighted kappa from sklearn in Python. Which one is wrong? After two days of investigation, I concluded that the weighted kappa from the irr package in R is wrong. Details are as follows.
While debugging, I found that the confusion matrix built by irr in R has its labels in the wrong order: [0, 1, 14, 3, 4, 53, 54, 6] instead of [0, 1, 3, 4, 6, 14, 53, 54] as in Python. It seems the irr package sorts the labels as strings rather than as integers, which places 14 before 3. This mistake could, and should, be easy to correct.
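The mis-ordering is easy to reproduce with a plain string sort (a quick Python illustration, not the irr code itself):
labels = [0, 1, 14, 3, 4, 53, 54, 6]
print(sorted(labels))                   # [0, 1, 3, 4, 6, 14, 53, 54]  (integer order)
print(sorted(str(x) for x in labels))   # ['0', '1', '14', '3', '4', '53', '54', '6']  (string order)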
Second bug: the confusion matrix in R is incomplete.
In my pred_test.csv and label_test.csv, the values do not cover all possible values from 0 to 100, so the default confusion matrix in irr misses the values that do not appear in the data. This should be fixed.
Let's see another example.
In pred_test.csv, let's change the value 54 to 99. Then we run script_r.R and script_python.py again. The results are:
In R:
kappa: 0.245283
weighted_kappa: 0.443038
In Python:
kappa: 0.24528301886792447
weighted_kappa: 0.592891760904685
We can see that the weighted kappa from irr in R does not change at all, while the weighted kappa from sklearn in Python drops from 0.83 to 0.59. So irr gets it wrong again.
The reason is that sklearn lets us pass the full label set to the confusion matrix, so the matrix becomes 100 x 100, whereas irr derives the labels of the confusion matrix from the unique values in label and pred, which misses many possible values. As a result, the disagreements 53 vs. 54 and 53 vs. 99 receive the same weight here. It would be better if the irr package offered an option to supply custom labels, as sklearn does in Python.
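To make the difference concrete, a small sklearn sketch with the modified data from the example above shows how the label set changes the confusion matrix that the weights act on:
import numpy as np
from sklearn.metrics import confusion_matrix

label = [0, 1, 1, 1, 0, 14, 53, 3]
pred = [0, 1, 1, 0, 3, 4, 99, 6]   # the modified example, with 54 replaced by 99

cm_observed = confusion_matrix(label, pred)                      # only the 8 observed categories
cm_full = confusion_matrix(label, pred, labels=np.arange(101))   # all categories 0..100
print(cm_observed.shape, cm_full.shape)   # (8, 8) (101, 101)
# In the 8x8 matrix, 53 and 99 are adjacent categories, so linear weights treat the
# (53, 99) disagreement as one step apart; in the 101x101 matrix it is 46 steps apart.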
The solution from the author is not going to work, because the kappa2 function converts your ratings into a matrix, and once you convert a factor into a matrix the levels are lost. This is the line:
ratings <- as.matrix(na.omit(ratings))
You can try it on your data; the ratings are converted into character:
lvl = 0:100
ratings = data.frame(label = factor(label[,1],levels=lvl),
pred = factor(pred[,1],levels=lvl))
as.matrix(ratings)
label pred
[1,] "0" "0"
[2,] "1" "1"
[3,] "1" "1"
[4,] "1" "0"
[5,] "0" "3"
[6,] "14" "4"
[7,] "53" "54"
[8,] "3" "6"
Same results:
kappa2(ratings,weight="equal")
Cohen's Kappa for 2 Raters (Weights: equal)
Subjects = 8
Raters = 2
Kappa = 0.368
z = 1.79
p-value = 0.0742
I suggest using DescTools; you just need to provide the confusion matrix via the table() function in R, with the factors declared correctly as above:
library(DescTools)
CohenKappa(table(ratings$label,ratings$pred), weight="Unweighted")
[1] 0.245283
CohenKappa(table(ratings$label,ratings$pred), weight="Equal-Spacing")
[1] 0.8359909
I have emailed the author of the package, and he said he will fix the bug in the next update.
Details are as follows:
Actually, I am aware of this awkward behavior of the kappa2-function.
This is due to the conversion and reordering of factor levels. These
are actually not two bugs but only one that results in an incorrect
generation of the confusion matrix (which you already found out). You
can easily fix it by deleting the first row in the kappa2-function
("ratings <- as.matrix(na.omit(ratings))"). This conversion to
numerical value as part of the removal of NA ratings is responsible
for the error.
In general, my function needs to know the factor levels in order to
correctly compute kappa. Thus, for your data, you would need to store
the values as factors with the appropriate possible factor levels.
E.g.
label <- c(0, 1, 1, 1, 0, 14, 53, 3)
label <- factor(label, levels=0:100)
pred <- c(0, 1, 1, 0, 3, 4, 54, 6)
pred <- factor(pred, levels=0:100)
ratings <- data.frame(label,pred)
When you now run the modified kappa2-function (i.e. without the first
line), the results should be correct.
kappa2(ratings)          # unweighted
kappa2(ratings, "equal") # weighted kappa with equal weights
For the next update of my package, I will take this into account.
I have run the simulation given below and obtained the simulated transition probabilities for dry-to-dry and wet-to-wet conditions. The simulated dry-to-dry values are almost equal to the estimated dry-to-dry probabilities (d2d_tran), but the simulated wet-to-wet values are substantially lower than the estimated ones. It seems there is something wrong in the program. I have tried several other approaches but have not obtained the expected results. Can you please run the program and suggest how I can get improved results for the wet-to-wet probabilities? Thanks in advance.
My code:
import numpy as np
import random, datetime

d2d = np.zeros(12)
d2w = np.zeros(12)
w2w = np.zeros(12)
w2d = np.zeros(12)
pd2d = np.zeros(12)
pw2w = np.zeros(12)

dry = [0.333]  ## unconditional probability of dry for January
d2d_tran = [0.564,0.503,0.582,0.621,0.634,0.679,0.738,0.667,0.604,0.564,0.577,0.621]
w2w_tran = [0.784,0.807,0.8,0.732,0.727,0.728,0.64,0.64,0.665,0.717,0.741,0.769]
mu = [3.71,4.46,4.11,2.94,3.01,2.87,2.31,2.44,2.56,3.45,4.32,4.12]
sigma = [6.72,7.92,7.49,6.57,6.09,5.53,4.38,4.69,4.31,5.71,7.64,7.54]
days = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
rain = np.array([])

for y in xrange(0,10000):
    for m in xrange(0,12):
        # include leap years in the calculation and create random variables for each month
        if ((y%4 == 0 and y%100 != 0) or y%400 == 0) and m==1:
            random_num = np.random.rand(29)
        else:
            random_num = np.random.rand(days[m])
        # let's generate a rainfall amount for the first day of the random series
        if random_num[0] <= dry[0]:
            random_num[0] = 0
        else:
            random_num[0] = abs(random.gauss(mu[0],sigma[0]))
        # generate the whole series in sequence of month and year
        for i in xrange(0,days[m]):
            if random_num[i-1] == 0:  # if yesterday was dry
                if random_num[i] <= d2d_tran[m]:  # check today against the dry-to-dry transition probability
                    random_num[i] = 0
                    d2d[m] += 1.0
                else:
                    random_num[i] = abs(random.gauss(mu[m],sigma[m]))
                    d2w[m] += 1.0
            else:
                if random_num[i] <= w2w_tran[m]:
                    random_num[i] = abs(random.gauss(mu[m],sigma[m]))
                    w2w[m] += 1.0
                else:
                    random_num[i] = 0
                    w2d[m] += 1.0
        pd2d[m] = d2d[m]/(d2d[m] + d2w[m])
        pw2w[m] = w2w[m]/(w2d[m] + w2w[m])

print 'Simulated transition probability of dry2dry:\n', np.around(pd2d, decimals=3)
print 'Simulated transition probability of wet2wet:\n', np.around(pw2w, decimals=3)
### pd2d and pw2w of the generated data should be identical to d2d_tran and w2w_tran respectively
The simulation looks correct as far as it goes, and after running it for 8000 years, I get transition probabilities within .001 most of the time, and there is convergence as the number of days increases.
Nothing guarantees that you will get the exact transition probabilities - on any single run you may get anything. What you've done is generate an estimator for each single transition probability that has mean equal to the actual value (0.345), and some positive variance. The variance of your estimator decreases with n = sample size, but it will always be positive.
If you'd like values closer to the actual transition probabilities (faster convergence), apply some well-known variance reduction techniques: Stratified Sampling, Importance Sampling, etc. - too many to mention. Here's a quick technique - take the uniform random deviates generated by np.random.rand(), and estimate as usual. Then generate another estimator using the transformed deviates: [(1-x) for x in stored_deviates]. The average of the two estimators has reduced variance (by .5).
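For example, here is a minimal sketch of that antithetic-deviate trick for a single transition probability (p = 0.345 as mentioned above; n is arbitrary):
import numpy as np

rng = np.random.default_rng()
p, n = 0.345, 10000

u = rng.random(n)
plain = (u <= p).mean()                                        # ordinary Monte Carlo estimate
antithetic = ((u <= p).mean() + ((1 - u) <= p).mean()) / 2.0   # reuse the deviates as 1 - x
print(plain, antithetic)   # the antithetic average is typically closer to 0.345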
I am trying to generate log-normally distributed random numbers in Python (for a later MC simulation), and I find the results to be quite inconsistent when the parameters are a bit larger.
Below I generate a series of LogNormals from Normals (and then apply Exp) and directly from LogNormals.
The resulting means are bearable, but the variances are quite imprecise. This also holds for mu = 4, 5, ...
If you re-run the code below a couple of times, the results come back quite different.
Code:
import numpy as np
mu = 10;
tmp1 = np.random.normal(loc=-mu, scale=np.sqrt(mu*2),size=1e7)
tmp1 = np.exp(tmp1)
print tmp1.mean(), tmp1.var()
tmp2 = np.random.lognormal(mean=-mu, sigma=np.sqrt(mu*2), size=1e7)
print tmp2.mean(), tmp2.var()
print 'True Mean:', np.exp(0), 'True Var:',(np.exp(mu*2)-1)
Any advice on how to fix this?
I've also tried this on Wakari.io, and the results are consistent there as well.
Update:
I've taken the 'True' Mean and Variance formula from Wikipedia: https://en.wikipedia.org/wiki/Log-normal_distribution
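For reference, plugging the parameters above (normal mean -10, variance 20) into those formulas reproduces the quoted "True" values:
import numpy as np

mu, sig2 = -10.0, 20.0                                  # parameters used above
true_mean = np.exp(mu + sig2 / 2)                       # exp(0) = 1.0
true_var = (np.exp(sig2) - 1) * np.exp(2 * mu + sig2)   # exp(20) - 1, about 4.85e8
print(true_mean, true_var)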
Snapshots of results:
1)
0.798301881219 57161.0894726
1.32976988569 2651578.69947
True Mean: 1.0 True Var: 485165194.41
2)
1.20346203176 315782.004309
0.967106664211 408888.403175
True Mean: 1.0 True Var: 485165194.41
3) Last one with n=1e8 random numbers
1.17719369919 2821978.59163
0.913827160458 338931.343819
True Mean: 1.0 True Var: 485165194.41
Even with the large sample size that you have, with these parameters, the estimated variance is going to change wildly from run to run. That's just the nature of the fat-tailed lognormal distribution. Try running the np.exp(np.random.normal(...)).var() several times. You will see a similar swing of values as np.random.lognormal(...).var().
In any case, np.random.lognormal() is just implemented as np.exp(np.random.normal()) (well, the C equivalent).
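To see that swing, here is a quick sketch with the same parameters as the question (unseeded on purpose, so each run differs):
import numpy as np

rng = np.random.default_rng()
mu, sigma = -10.0, np.sqrt(20.0)

# With sigma**2 = 20, the variance is driven by very rare, very large draws,
# so the sample variance swings wildly even across 10-million-sample runs.
for _ in range(5):
    x = rng.lognormal(mean=mu, sigma=sigma, size=10**7)
    print(x.mean(), x.var())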
OK: since you have just built the sample, using the notation in Wikipedia (first section, mu and sigma) and the example you gave:
from numpy import log, exp, sqrt
import numpy as np
mu = -10
scale = sqrt(2*10) # scale is sigma, not variance
tmp1 = np.random.normal(loc=mu, scale=scale, size=1e8)
# Just checking
print tmp1.mean(), tmp1.std()
# 10.0011028634 4.47048010775, perfectly accurate
tmp1_exp = exp(tmp1) # Not sensible to use the same name for two samples
# WIKIPEDIA NOTATION!
m = tmp1_exp.mean() # until proven wrong, this is a measure of the mean
v = tmp1_exp.var() # again, until proven wrong, this is sigma**2
#Now, according to wikipedia
print "This: ", log(m**2/sqrt(v+m**2)), "should be similar to", mu
# I get This: 13.9983309499 should be similar to 10
print "And this:", sqrt(log(1+v/m**2)), "should be similar to", scale
# I get And this: 3.39421327037 should be similar to 4.472135955
So, even if the values are not exactly perfect, I wouldn't claim that they are completely wrong.
I have a numpy array (actually imported from a GIS raster map) which contains
probability values for the occurrence of a species, like the following example:
from numpy import random
a = random.randint(1.0, 20.0, 1200).reshape(40, 30)
b = (a * 1.0) / a.sum()
Now I want to get a discrete version of that array again. For example, if 100 individuals are
located on the area of that array (1200 cells), how are they distributed? Of course they should
be distributed according to the probabilities, meaning lower values indicate a lower probability
of occurrence. Still, since this is statistics, there is a chance that an individual ends up in a
low-probability cell. It should also be possible for multiple individuals to occupy one cell...
It is like transforming a continuous distribution curve back into a histogram. Just as many different histograms can produce a given distribution curve, the reverse should also hold, so applying the algorithm I am looking for will produce different discrete values each time.
...is there any algorithm in Python which can do that? As I am not that familiar with discretization, maybe someone can help.
Use np.random.choice with np.bincount:
np.bincount(np.random.choice(b.size, 100, p=b.flat),
minlength=b.size).reshape(b.shape)
If you don't have NumPy 1.7, you can replace random.choice with:
np.searchsorted(np.cumsum(b), np.random.random(100))
giving:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(100)),
minlength=b.size).reshape(b.shape)
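For completeness, here is a hypothetical end-to-end run with the 40x30 setup from the question (assumes NumPy >= 1.7 for np.random.choice):
import numpy as np

# Rebuild the probability grid from the question (NumPy's randint assumed here).
a = np.random.randint(1, 20, 1200).reshape(40, 30)
b = a / a.sum()

# Drop 100 individuals onto the grid according to those probabilities;
# several individuals may land in the same cell.
counts = np.bincount(np.random.choice(b.size, 100, p=b.ravel()),
                     minlength=b.size).reshape(b.shape)
print(counts.sum())   # 100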
So far I think ecatmur's answer seems quite reasonable and simple.
I just want to add a perhaps more "applied" example. Consider a die
with 6 faces (6 numbers); each number/result has a probability of 1/6.
Representing the die as an array could look like:
b = np.array([[1,1,1],[1,1,1]])/6.0
Thus rolling the die 100 times (n = 100) results in the following simulation:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(n)),minlength=b.size).reshape(b.shape)
I think that can be an appropriate approach for such an application.
Thank you, ecatmur, for your help!
/Johannes
This is similar to a question I had earlier this month.
import random

def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector
from numpy.random import multinomial
import math

def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
    """
    Inputs:
    ListSize = the size of the list to return
    ListSumValue = the sum of the list values
    Distribution = can be 'uniform' for a uniform distribution, 'normal' for a normal distribution ~ N(0,1) with +/- 3 sigma (default), or a list of size 'ListSize' or 'ListSize - 1' for an empirical (arbitrary) distribution. Probabilities of each of the p different outcomes. These should sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).
    Output:
    A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
    """
    if type(Distribution) == list:
        DistributionSize = len(Distribution)
        if ListSize == DistributionSize or (ListSize-1) == DistributionSize:
            Values = multinomial(ListSumValue, Distribution, size=1)
            OutputValue = Values[0]
        else:
            raise ValueError('Cannot create desired vector')
    elif Distribution.lower() == 'uniform': # I do not recommend this!!!! I see that it is not as random (at least on my computer) as I had hoped
        UniformDistro = [1.0/ListSize for i in range(ListSize)]
        Values = multinomial(ListSumValue, UniformDistro, size=1)
        OutputValue = Values[0]
    elif Distribution.lower() == 'normal':
        """
        Normal Distribution Construction....It's very flexible and hideous
        Assume a +-3 sigma range. Warning, this may or may not be a suitable range for your implementation!
        If one wishes to explore a different range, then change the LowSigma and HighSigma values
        """
        LowSigma = -3   # -3 sigma
        HighSigma = 3   # +3 sigma
        StepSize = 1/(float(ListSize) - 1)
        ZValues = [(LowSigma * (1-i*StepSize) + (i*StepSize)*HighSigma) for i in range(int(ListSize))]
        # Construction parameters for N(Mean,Variance) - Default is N(0,1)
        Mean = 0
        Var = 1
        #NormalDistro = [self.NormalDistributionFunction(Mean, Var, x) for x in ZValues]
        NormalDistro = list()
        for i in range(len(ZValues)):
            if i == 0:
                ERFCVAL = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                NormalDistro.append(ERFCVAL)
            elif i == len(ZValues) - 1:
                ERFCVAL = NormalDistro[0]
                NormalDistro.append(ERFCVAL)
            else:
                ERFCVAL1 = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                ERFCVAL2 = 0.5 * math.erfc(-ZValues[i-1]/math.sqrt(2))
                ERFCVAL = ERFCVAL1 - ERFCVAL2
                NormalDistro.append(ERFCVAL)
        #print "Normal Distribution sum = %f"%sum(NormalDistro)
        Values = multinomial(ListSumValue, NormalDistro, size=1)
        OutputValue = Values[0]
    else:
        raise ValueError('Cannot create desired vector')
    return OutputValue
ProbabilityDistribution = RandFloats(1200)  # this is your probability distribution for your 1200-cell array
SizeDistribution = RandIntVec(1200, 100, Distribution=ProbabilityDistribution)  # for a 1200-cell array whose counts sum to 100, with the given probability distribution
The two important lines are the last two lines in the code above.
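A hypothetical quick check of those two lines (assuming the definitions above are in scope):
import numpy as np

probs = RandFloats(1200)
counts = RandIntVec(1200, 100, Distribution=probs)
print(len(counts), counts.sum())          # 1200 cells whose counts add up to 100
grid = np.array(counts).reshape(40, 30)   # back to the question's 40x30 raster shape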
I'm trying to use some Time Series Analysis in Python, using Numpy.
I have two somewhat medium-sized series, with 20k values each, and I want to check the sliding correlation.
corrcoef gives me a matrix of auto-correlation/correlation coefficients as output. Not useful by itself in my case, as one of the series contains a lag.
The correlate function (in mode="full") returns a 40k-element list that DOES look like the kind of result I'm aiming for (the peak value is as far from the center of the list as the lag would indicate), but the values are all weird: up to 500, when I was expecting something between -1 and 1.
I can't just divide it all by the max value; I know the max correlation isn't 1.
How could I normalize the "cross-correlation" (correlation in "full" mode) so the returned values would be the correlation at each lag step instead of those very large, strange values?
You are looking for normalized cross-correlation. This option isn't available yet in NumPy, but a patch is waiting for review that does just what you want. It shouldn't be too hard to apply, I would think. Most of the patch is just docstring changes. The only lines of code that it adds are:
if normalize:
    a = (a - mean(a)) / (std(a) * len(a))
    v = (v - mean(v)) / std(v)
where a and v are the input NumPy arrays whose cross-correlation you are computing. It shouldn't be hard to either add them into your own distribution of NumPy or just make a copy of the correlate function and add the lines there. I would do the latter personally, if I chose to go this route.
Another, quite possibly better, alternative is to apply the normalization to the input vectors before you send them to correlate. It's up to you which way you would like to do it.
By the way, this does appear to be the correct normalization as per the Wikipedia page on cross-correlation, except for dividing by len(a) rather than (len(a) - 1). I feel that the discrepancy is akin to the population standard deviation vs. the sample standard deviation, and it really won't make much of a difference, in my opinion.
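As a sketch of that second route (normalizing the inputs yourself rather than patching NumPy; the function name here is mine):
import numpy as np

def norm_xcorr(a, v):
    # Normalize the inputs the same way the patch does, then use the ordinary
    # full-mode correlation; the peak now lies in [-1, 1].
    a = (np.asarray(a, dtype=float) - np.mean(a)) / (np.std(a) * len(a))
    v = (np.asarray(v, dtype=float) - np.mean(v)) / np.std(v)
    return np.correlate(a, v, mode='full')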
Based on these slides, I would suggest doing it this way:
def cross_correlation(a1, a2):
    lags = range(-len(a1)+1, len(a2))
    cs = []
    for lag in lags:
        idx_lower_a1 = max(lag, 0)
        idx_lower_a2 = max(-lag, 0)
        idx_upper_a1 = min(len(a1), len(a1)+lag)
        idx_upper_a2 = min(len(a2), len(a2)-lag)
        b1 = a1[idx_lower_a1:idx_upper_a1]
        b2 = a2[idx_lower_a2:idx_upper_a2]
        c = np.correlate(b1, b2)[0]
        c = c / np.sqrt((b1**2).sum() * (b2**2).sum())
        cs.append(c)
    return cs
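A hypothetical quick check of this function: the zero-lag coefficient of a signal with itself should be 1.
import numpy as np

x = np.array([1., 2., 3., 2., 1.])
cs = cross_correlation(x, x)
print(cs[len(x) - 1])   # lag 0 sits in the middle of the lag list; prints 1.0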
For full mode, would it make sense to compute corrcoef directly on the lagged signal/feature? Code:
from dataclasses import dataclass
from typing import Any, Optional, Sequence
import numpy as np

ArrayLike = Any


@dataclass
class XCorr:
    cross_correlation: np.ndarray
    lags: np.ndarray


def cross_correlation(
    signal: ArrayLike, feature: ArrayLike, lags: Optional[Sequence[int]] = None
) -> XCorr:
    """
    Computes normalized cross correlation between the `signal` and the `feature`.
    Current implementation assumes the `feature` can't be longer than the `signal`.
    You can optionally provide specific lags; if not provided, the `signal` is padded
    with the length of the `feature` - 1, and the `feature` is slid/padded (creating lags)
    with 0 padding to match the length of the new signal. A Pearson product-moment
    correlation coefficient is computed for each lag.
    See: https://en.wikipedia.org/wiki/Cross-correlation
    :param signal: observed signal
    :param feature: feature you are looking for
    :param lags: optional lags, if not provided equals to (-len(feature), len(signal))
    """
    signal_ar = np.asarray(signal)
    feature_ar = np.asarray(feature)
    if np.count_nonzero(feature_ar) == 0:
        raise ValueError("Unsupported - feature contains only zeros")
    assert (
        signal_ar.ndim == feature_ar.ndim == 1
    ), "Unsupported - only 1d signal/feature supported"
    assert len(feature_ar) <= len(
        signal
    ), "Unsupported - signal should be at least as long as the feature"

    padding_sz = len(feature_ar) - 1
    padded_signal = np.pad(
        signal_ar, (padding_sz, padding_sz), "constant", constant_values=0
    )

    lags = lags if lags is not None else range(-padding_sz, len(signal_ar), 1)
    if np.max(lags) >= len(signal_ar):
        raise ValueError("max positive lag must be shorter than the signal")
    if np.min(lags) <= -len(feature_ar):
        raise ValueError("max negative lag can't be longer than the feature")
    assert np.max(lags) < len(signal_ar), ""

    lagged_patterns = np.asarray(
        [
            np.pad(
                feature_ar,
                (padding_sz + lag, len(signal_ar) - lag - 1),
                "constant",
                constant_values=0,
            )
            for lag in lags
        ]
    )

    return XCorr(
        cross_correlation=np.corrcoef(padded_signal, lagged_patterns)[0, 1:],
        lags=np.asarray(lags),
    )
Example:
signal = [0, 0, 1, 0.5, 1, 0, 0, 1]
feature = [1, 0, 0, 1]
xcorr = cross_correlation(signal, feature)
assert xcorr.lags[xcorr.cross_correlation.argmax()] == 4