Python KS test - why is the p-value so large?

I am trying to run the KS test on two curves.
One is the raw data plot (red), and the other is the power-law fit (blue).
from scipy import stats
stats.ks_2samp(Red.Y, Blue.Y)
where Red.Y is the y value at each x, and Blue.Y is the power-law y value at each x.
Out[210]:
Ks_2sampResult(statistic=0.16666666666666669, pvalue=0.99133252540492101)
The p-value seems extremely large given that the graphs are not alike. What is the reason?
The values of Red.Y are:
(0.03, 0.09] 0.000018
(0.09, 0.16] 0.000019
(0.16, 0.29] 0.000016
(0.29, 0.5] 0.000018
(0.5, 0.77] 0.000018
(0.77, 1.0] 0.000022
(1.0, 1.05] 0.000021
(1.05, 1.5] 0.000022
(1.5, 2.0] 0.000025
(2.0, 3.0] 0.000025
(3.0, 4.0] 0.000024
(4.0, 6.42] 0.000026
The values of Blue.Y are:
(0.03, 0.09] 0.000017
(0.09, 0.16] 0.000017
(0.16, 0.29] 0.000018
(0.29, 0.5] 0.000019
(0.5, 0.77] 0.000020
(0.77, 1.0] 0.000021
(1.0, 1.05] 0.000021
(1.05, 1.5] 0.000022
(1.5, 2.0] 0.000023
(2.0, 3.0] 0.000024
(3.0, 4.0] 0.000025
(4.0, 6.42] 0.000026

Basically, the KS test compares the empirical cumulative distribution functions (CDFs) of two data samples (see Wikipedia). Suppose you have the red-line data and the blue-line data:
import numpy as np

red_line = [0.000018, 0.000019, 0.000016,
            0.000018, 0.000018, 0.000022,
            0.000021, 0.000022, 0.000025,
            0.000025, 0.000024, 0.000026]
blue_line = [0.000017, 0.000017, 0.000018,
             0.000019, 0.000020, 0.000021,
             0.000021, 0.000022, 0.000023,
             0.000024, 0.000025, 0.000026]
n1 = len(red_line)
n2 = len(blue_line)
# np.searchsorted needs sorted input (ks_2samp sorts both samples first)
data1 = np.sort(red_line)
data2 = np.sort(blue_line)
data_all = np.concatenate([data1, data2])
# CDF of red line, evaluated at every observed value
cdf1 = np.searchsorted(data1, data_all, side='right') / n1
# CDF of blue line, evaluated at every observed value
cdf2 = np.searchsorted(data2, data_all, side='right') / n2
# D-statistic
d = np.max(np.absolute(cdf1 - cdf2))
The D statistic (the first value returned) is the maximum distance between the two CDFs.
The p-value is obtained by scaling D by the effective sample size and evaluating the survival function of the limiting Kolmogorov distribution (the distribution of the supremum of a Brownian bridge). You can see how it is calculated in the source code. If the difference between the CDFs is small relative to that distribution, you get a large p-value (p > 0.1, for example), which means you cannot reject the null hypothesis that both samples come from the same distribution.
from scipy.stats import distributions
en = np.sqrt(n1 * n2 / float(n1 + n2))
prob = distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d) # p-value
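From the given data here, I got (D, p) = (0.1667, 0.9913), matching the ks_2samp output in the question. Note that the rounded values listed above contain ties, so recomputing from them can give a slightly different D.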
So even though the graphs look different, the CDFs of the two samples can be very similar, and that is why the p-value is still large.
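One point that may help here: ks_2samp treats each argument as an unordered sample of values and ignores which x each y belongs to. A minimal sketch, assuming the rounded values listed in the question (the names red, blue and rng are illustrative):

```python
import numpy as np
from scipy import stats

red = np.array([0.000018, 0.000019, 0.000016, 0.000018, 0.000018, 0.000022,
                0.000021, 0.000022, 0.000025, 0.000025, 0.000024, 0.000026])
blue = np.array([0.000017, 0.000017, 0.000018, 0.000019, 0.000020, 0.000021,
                 0.000021, 0.000022, 0.000023, 0.000024, 0.000025, 0.000026])

# Test with the values in their original order.
original = stats.ks_2samp(red, blue)

# Shuffle the red values: plotted against x the curve would look very
# different, but the set of values (and so its CDF) is unchanged.
rng = np.random.default_rng(0)
shuffled = stats.ks_2samp(rng.permutation(red), blue)

print(original.statistic, original.pvalue)
print(shuffled.statistic, shuffled.pvalue)
```

Both calls report the same statistic and p-value, so the test says nothing about how well the fit follows the data point by point; it only compares how the y values are distributed.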

Related

Multiplying Different Columns/Rows In Same Dataframe

I have a dataframe (see below). I would like to multiply 30 by 0.115 and create a new column (second image). I started writing the code below, but I'm not sure if I'm even on the right track.
DF.loc[(DF['PRO_CHARGE']=="(1.0, 99283.0)")), '%'] = DF.loc(DF['PRO_CHARGE']=="(0.0, 99283.0)")....?
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "PRO_CHARGE": [(0.0, 99283.0), (1.0, 99283.0), (1.0, 99284.0)],
        "TOTAL": [15, 30, 23],
        "PERC_99282": [0.115, 0.49, 0.53],
    }
)
print(df)
# Output
       PRO_CHARGE  TOTAL  PERC_99282
0  (0.0, 99283.0)     15       0.115
1  (1.0, 99283.0)     30       0.490
2  (1.0, 99284.0)     23       0.530
Here is one way to do it with boolean masks and .loc (.at is meant for a single scalar row/column label pair, so .loc is the appropriate accessor when selecting with a mask):
df.loc[df["PRO_CHARGE"] == (1.0, 99283.0), "%"] = (
    df.loc[df["PRO_CHARGE"] == (1.0, 99283.0), "TOTAL"].values[0]
    * df.loc[df["PRO_CHARGE"] == (0.0, 99283.0), "PERC_99282"].values[0]
)
Then:
print(df)
# Output
       PRO_CHARGE  TOTAL  PERC_99282     %
0  (0.0, 99283.0)     15       0.115   NaN
1  (1.0, 99283.0)     30       0.490  3.45
2  (1.0, 99284.0)     23       0.530   NaN

How to calculate Standard Deviation in Python when x and P(x) are known

I have the following data and I wish to calculate the standard deviation. The values of x and their probabilities P(x) are given.
x      P(x)
-2000  0.1
-1000  0.1
0      0.2
1000   0.2
2000   0.3
3000   0.1
I know how to calculate it manually: compute Var(x) = E[x^2] - (E[x])^2 and then take sqrt(Var(x)).
[This is how it is done manually]
How do you calculate it in Python?
To clarify, the standard deviation of [1000, 2000, 3000, 0, -1000, -2000] is indeed 1707.8 if all 6 values are assumed to be equally probable.
However, in the post the 6 values have unequal probabilities [0.1, 0.1, 0.2, 0.2, 0.3, 0.1].
import pandas as pd

df = pd.DataFrame([
    {'x': -2000, 'P(x)': 0.1},
    {'x': -1000, 'P(x)': 0.1},
    {'x': 0, 'P(x)': 0.2},
    {'x': 1000, 'P(x)': 0.2},
    {'x': 2000, 'P(x)': 0.3},
    {'x': 3000, 'P(x)': 0.1}])
df['E(x)'] = df['x'] * df['P(x)']        # terms x * P(x); their sum is E[x]
df['E(x^2)'] = df['x']**2 * df['P(x)']   # terms x^2 * P(x); their sum is E[x^2]
variance = df['E(x^2)'].sum() - df['E(x)'].sum()**2
std_dev = variance**0.5
print(df)   # display(df) also works inside a Jupyter notebook
print('Standard Deviation is: {:.2f}'.format(std_dev))
Output
      x  P(x)   E(x)     E(x^2)
0 -2000   0.1 -200.0   400000.0
1 -1000   0.1 -100.0   100000.0
2     0   0.2    0.0        0.0
3  1000   0.2  200.0   200000.0
4  2000   0.3  600.0  1200000.0
5  3000   0.1  300.0   900000.0
Standard Deviation is: 1469.69
To confirm, you can go to https://www.rapidtables.com/calc/math/standard-deviation-calculator.html
Try this:
import math
df['x_squared'] = df['x']**2
df['E_of_x_squared'] = df['x_squared'] * df['P(x)']
df['E_of_x'] = df['x'] * df['P(x)']
sum_E_x_square = df['E_of_x_squared'].values.sum()
square_of_E_x_sum = df['E_of_x'].values.sum()**2
var = sum_E_x_square - square_of_E_x_sum
std_dev = math.sqrt(var)
print('Standard Deviation is: ' + str(std_dev))
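The same calculation can also be done without a DataFrame. A minimal sketch, assuming NumPy is available and using np.average with its weights argument (the variable names are illustrative):

```python
import numpy as np

x = np.array([-2000, -1000, 0, 1000, 2000, 3000])
p = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.1])

# The probabilities should sum to 1 for this to be a valid distribution.
assert np.isclose(p.sum(), 1.0)

mean = np.average(x, weights=p)                     # E[x]
variance = np.average((x - mean) ** 2, weights=p)   # E[(x - E[x])^2]
std_dev = np.sqrt(variance)

print('Standard Deviation is: {:.2f}'.format(std_dev))
```

This uses the definition Var(x) = E[(x - E[x])^2], which is algebraically the same as E[x^2] - (E[x])^2 used above.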

What's the simplest way in pandas to comparatively plot the ROC curve for different binary classifiers?

I have three binary classification models, and I got as far as the following while trying to assemble them into a final comparative ROC plot.
import pandas as pd
import numpy as np
import sklearn.metrics as metrics
y_test = ... # a numpy array containing the test values
dfo = ... # a pd.DataFrame containing the model predictions
dfroc = dfo[['SVM',
             'RF',
             'NN']].apply(lambda y_pred: metrics.roc_curve(y_test[:-1], y_pred[:-1])[0:2],
                          axis=0, result_type='reduce')
print(dfroc)
dfroc_auc = dfroc.apply(lambda x: metrics.auc(x[0], x[1]))
print(dfroc_auc)
Which outputs the following (where dfroc and dfroc_auc are of type pandas.core.series.Series):
SVM ([0.0, 0.016666666666666666, 1.0], [0.0, 0.923...
RF ([0.0, 0.058333333333333334, 1.0], [0.0, 0.769...
NN ([0.0, 0.06666666666666667, 1.0], [0.0, 1.0, 1...
dtype: object
SVM 0.953205
RF 0.855449
NN 0.966667
dtype: float64
To plot them as a comparative ROC, I need to convert these into the following pivoted structure as a dfroc pd.DataFrame. How can this pivoting be done?
   model  fpr       tpr
1  SVM    0.0       0.0
2  SVM    0.016666  0.923
3  SVM    1.0       ...
4  RF     0.0       0.0
5  RF     0.05833   0.769
6  RF     1.0       ...
7  NN     ...       ...
Then the plotting, following the directions from "How to plot ROC curve in Python", would be something like:
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
dfroc.plot(label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Not the ideal structure to work on, but assuming you have something as follows:
s = pd.Series({'SVC': ([0.0, 0.016, 1.0], [0.0, 0.923, 0.5], [0.3, 0.4, 0.9]),
               'RF': ([0.0, 0.058, 1.0], [0.0, 0.923, 0.2], [0.5, 0.3, 0.9]),
               'NN': ([0.0, 0.06, 1.0], [0.0, 0.13, 0.4], [0.2, 0.4, 0.9])})
You could define a function to compute the TPR and FPR and return a dataframe with the specified structure:
def tpr_fpr(g):
    model, cm = g
    cm = np.stack(cm.values)
    diag = np.diag(cm)
    FP = cm.sum(0) - diag
    FN = cm.sum(1) - diag
    TP = diag
    TN = cm.sum() - (FP + FN + TP)
    TPR = TP/(TP+FN)
    FPR = FP/(FP+TN)
    return pd.DataFrame({'model': model,
                         'TPR': TPR,
                         'FPR': FPR})
Then group by the first index level and apply the above function to each group:
out = pd.concat([tpr_fpr(g) for g in s.explode().groupby(level=0)])
print(out)
  model       TPR       FPR
0    NN  0.000000  0.098522
1    NN  0.245283  0.179688
2    NN  0.600000  0.880503
0    RF  0.000000  0.177117
1    RF  0.821906  0.129804
2    RF  0.529412  0.550206
0   SVC  0.000000  0.099239
1   SVC  0.648630  0.159021
2   SVC  0.562500  0.615006
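If the Series holds (fpr, tpr) pairs as in the question, the long model/fpr/tpr layout can also be built directly. A minimal sketch, assuming made-up ROC points (the numbers and the names dfroc and long_df are illustrative, not the asker's data):

```python
import pandas as pd

# Hypothetical (fpr, tpr) pairs per model, shaped like the question's dfroc.
dfroc = pd.Series({
    'SVM': ([0.0, 0.017, 1.0], [0.0, 0.923, 1.0]),
    'RF': ([0.0, 0.058, 1.0], [0.0, 0.769, 1.0]),
    'NN': ([0.0, 0.067, 1.0], [0.0, 1.0, 1.0]),
})

# One small frame per model, stacked into a single long table.
long_df = pd.concat(
    [pd.DataFrame({'model': model, 'fpr': fpr, 'tpr': tpr})
     for model, (fpr, tpr) in dfroc.items()],
    ignore_index=True,
)
print(long_df)
```

Each model's curve could then be drawn by looping over long_df.groupby('model') and plotting fpr against tpr for each group.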

How to set precision on column names made by np.arange()?

I made a dataframe and set the column names using np.arange(). However, instead of exact numbers, it (sometimes) sets them to numbers like 0.300000004.
I tried both rounding the entire dataframe and using np.around() on the np.arange() output, but neither seems to work.
I also tried to add these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is the return statement of my function:
stepT = 0.1
# net is some numpy array
return pd.DataFrame(net, columns=np.arange(0, 1 + stepT, stepT),
                    index=np.around(np.arange(0, 1 + stepS, stepS), decimals=3)).round(3)
Is there any function that will let me have these names as numbers with only one digit after the decimal point?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how arrays are displayed, but they have no effect when individual values are printed.
When I make a dataframe with this column specification I get a nice display. (_689 is ipython shorthand for the Out[689] array.) It is using the array formatting:
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
0    0    1    2    3    4    5    6    7    8    9   10
In [701]: df.columns
Out[701]:
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0],
dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
I think you should create a list of properly formatted strings from the arange, and use that as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
'0.100',
'0.200',
'0.300',
'0.400',
'0.500',
'0.600',
'0.700',
'0.800',
'0.900',
'1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
   0.000  0.100  0.200  0.300  0.400  0.500  0.600  0.700  0.800  0.900  1.000
0      0      1      2      3      4      5      6      7      8      9     10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
'0.800', '0.900', '1.000'],
dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64
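If numeric column labels are preferred over strings, one option is to build them from integers so that each label is the same float as the literal you would type. A minimal sketch, assuming a step of 0.1 (the names cols and labels are illustrative):

```python
import numpy as np
import pandas as pd

# Dividing integers by 10 gives the correctly rounded floats 0.0, 0.1, ..., 1.0,
# avoiding the rounding error that shows up with a float step in np.arange.
cols = np.arange(11) / 10

df = pd.DataFrame(np.arange(11)[None, :], columns=cols)
print(df[0.3])   # the lookup succeeds because cols[3] == 0.3

# String labels with one digit after the decimal point, if exact text is wanted.
labels = [f'{c:.1f}' for c in cols]
print(labels)
```

The string labels remain the more robust choice when the column names are mainly for display or are typed by hand elsewhere in the code.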

pandas groupby report empty bins

I want to make a 2d histogram (or other statistics, but let's take a histogram for the example) of a given 2d data set. The problem is that empty bins seem to be discarded altogether. For instance,
import numpy
import pandas
numpy.random.seed(35)
values = numpy.random.random((2,10000))
xbins = numpy.linspace(0, 1.2, 7)
ybins = numpy.linspace(0, 1, 6)
I can easily get the desired output with
print(numpy.histogram2d(values[0], values[1], (xbins, ybins)))
giving
[[ 408. 373. 405. 411. 400.]
[ 390. 413. 400. 414. 368.]
[ 354. 414. 421. 400. 413.]
[ 426. 393. 407. 416. 412.]
[ 412. 397. 396. 356. 401.]
[ 0. 0. 0. 0. 0.]]
However, with pandas,
df = pandas.DataFrame({'x': values[0], 'y': values[1]})
binned = df.groupby([pandas.cut(df['x'], xbins),
                     pandas.cut(df['y'], ybins)])
print(binned.size().unstack())
prints
y           (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
x
(0, 0.2]         408         373         405         411       400
(0.2, 0.4]       390         413         400         414       368
(0.4, 0.6]       354         414         421         400       413
(0.6, 0.8]       426         393         407         416       412
(0.8, 1]         412         397         396         356       401
That is, the last row, with 1 < x <= 1.2, is missing entirely because there are no values in it. However, I would like to see it explicitly (as when using numpy.histogram2d). In this example I can use numpy just fine, but in more complicated settings (n-dimensional binning, or calculating statistics other than counts, etc.), pandas can be more efficient to code and to calculate than numpy.
In principle I can come up with ways to check if an index is present, using something like
allkeys = [('({0}, {1}]'.format(xbins[i-1], xbins[i]),
            '({0}, {1}]'.format(ybins[j-1], ybins[j]))
           for j in range(1, len(ybins))
           for i in range(1, len(xbins))]
However, the problem is that the index formatting is not consistent, in the sense that, as you see above, the first index of binned is ['(0, 0.2]', '(0, 0.2]'] but the first entry in allkeys is ['(0.0, 0.2]', '(0.0, 0.2]'], so I cannot match allkeys to binned.viewkeys().
Any help is much appreciated.
It appears that pd.cut keeps your binning information which means we can use it in a reindex:
In [79]: xcut = pd.cut(df['x'], xbins)
In [80]: ycut = pd.cut(df['y'], ybins)
In [81]: binned = df.groupby([xcut, ycut])
In [82]: sizes = binned.size()
In [85]: (sizes.reindex(pd.MultiIndex.from_product([xcut.cat.categories, ycut.cat.categories]))
...: .unstack()
...: .fillna(0.0))
...:
Out[85]:
            (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
(0.0, 0.2]       408.0       373.0       405.0       411.0       400.0
(0.2, 0.4]       390.0       413.0       400.0       414.0       368.0
(0.4, 0.6]       354.0       414.0       421.0       400.0       413.0
(0.6, 0.8]       426.0       393.0       407.0       416.0       412.0
(0.8, 1.0]       412.0       397.0       396.0       356.0       401.0
(1.0, 1.2]         0.0         0.0         0.0         0.0         0.0
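In more recent pandas versions, grouping by the categorical output of pd.cut can also report empty bins directly. A minimal sketch, assuming a pandas version where groupby accepts observed=False and reports zero-sized groups for categorical keys (the random data here comes from default_rng, so the counts differ from the question's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(35)
values = rng.random((2, 10000))
xbins = np.linspace(0, 1.2, 7)
ybins = np.linspace(0, 1, 6)

df = pd.DataFrame({'x': values[0], 'y': values[1]})

# observed=False keeps every bin combination, including bins with no data.
counts = (df.groupby([pd.cut(df['x'], xbins), pd.cut(df['y'], ybins)],
                     observed=False)
            .size()
            .unstack(fill_value=0))
print(counts)
```

The last row, for (1.0, 1.2], is then present and filled with zeros, and the counts stay integers because no NaN values need to be filled in.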
