Comparing distributions with chi-square in Python

I have two kinds of data lists, historical data and simulated data, that I want to compare with each other to see whether they follow the same distribution. My code is as follows:
import scipy.stats as stats
data_hist = [164, 157, 145, 113, 127, 192, 214, 193, 107, 95, 60, 55, 30, 19, 22, 22, 19, 20]
data_sim1 = [160, 174, 142, 121, 122, 192, 198, 179, 119, 107, 63, 50, 26, 17, 16, 22, 23, 23]
data_sim2 = [181, 130, 152, 114, 122, 198, 183, 192, 105, 100, 85, 42, 37, 26, 25, 30, 17, 15]
print(stats.chisquare(data_sim1, f_exp=data_hist))
print(stats.chisquare(data_sim2, f_exp=data_hist))
The code gives the following output:
Power_divergenceResult(statistic=12.11387994054504, pvalue=0.79319278886052769)
Power_divergenceResult(statistic=34.413397609752003, pvalue=0.0074220617004927226)
I compared the same data lists with each other using the F-test in Excel and got p-values of 0.939 and 0.849, respectively.
Now my questions are: am I using the correct chi-square function to calculate the p-value, and how do I interpret it to decide whether to reject the null hypothesis? Why is there such a big difference in the p-value between the two methods?
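As an aside, here is a minimal sketch of how the two lists could instead be compared as a two-sample test with scipy.stats.chi2_contingency, which avoids designating either list as the exact expected frequencies (reusing data_hist and data_sim1 from above; this is one standard approach, not necessarily the one the asker intends):
import numpy as np
import scipy.stats as stats

# Stack the two observed samples as rows of a 2 x 18 contingency table.
table = np.array([data_sim1, data_hist])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)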


How to find all the minima in a graph?

I am trying to find all the minima in this graph.
But this code gives many extra minima as well, and I want 8 minima in total.
mini = []
for i in range(1, len(y)-1):
    if y[i-1] >= y[i] and y[i] <= y[i+1]:
        mini.append(i)
print(mini)
Output
[2, 5, 7, 15, 20, 26, 30, 37, 40, 47, 50, 52, 59, 61, 64, 70, 76, 84, 89, 94, 96, 99, 107, 109, 117, 120, 122, 130, 134, 140, 144, 148, 154, 164, 169, 176]
And I want to cut my data according to these minimum values. Can somebody tell me what changes I should make in this code to achieve my goal?
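One common way to suppress spurious minima caused by noise is to require a minimum prominence, for example with scipy.signal.find_peaks applied to the negated signal. A minimal sketch with a synthetic stand-in for y (the prominence threshold is a placeholder to tune until only the 8 real minima remain):
import numpy as np
from scipy.signal import find_peaks

# Synthetic stand-in for y: 8 noisy valleys.
rng = np.random.default_rng(0)
t = np.linspace(0, 16 * np.pi, 400)
y = np.sin(t) + 0.1 * rng.standard_normal(t.size)

# Minima of y are peaks of -y; `prominence` filters out shallow dips.
minima, _ = find_peaks(-y, prominence=0.5)
print(minima)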

How can I remove clusters of given indexes in KMeans? [duplicate]

I am using the sklearn.cluster KMeans package. Once I finish the clustering, if I need to know which values were grouped together, how can I do it?
Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that?
Is there a function that takes a cluster id and lists out all the data points in that cluster?
I had a similar requirement, and I used pandas to create a new DataFrame with the index of the dataset and the labels as columns.
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('filename')
km = KMeans(n_clusters=5).fit(data)
cluster_map = pd.DataFrame()
cluster_map['data_index'] = data.index.values
cluster_map['cluster'] = km.labels_
Once the DataFrame is available, it is quite easy to filter.
For example, to filter all data points in cluster 3:
cluster_map[cluster_map.cluster == 3]
If you have a large dataset and you need to extract clusters on-demand you'll see some speed-up using numpy.where. Here is an example on the iris dataset:
from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
km = KMeans(n_clusters=3)
km.fit(X)
Define a function to extract the indices of the cluster id you provide. (Here are two functions, for benchmarking; they both return the same values):
def ClusterIndicesNumpy(clustNum, labels_array):  # numpy
    return np.where(labels_array == clustNum)[0]

def ClusterIndicesComp(clustNum, labels_array):  # list comprehension
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])
Let's say you want all samples that are in cluster 2:
ClusterIndicesNumpy(2, km.labels_)
array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])
Numpy wins the benchmark:
%timeit ClusterIndicesNumpy(2,km.labels_)
100000 loops, best of 3: 4 µs per loop
%timeit ClusterIndicesComp(2,km.labels_)
1000 loops, best of 3: 479 µs per loop
Now you can extract all of your cluster 2 data points like so:
X[ClusterIndicesNumpy(2,km.labels_)]
array([[ 6.9, 3.1, 4.9, 1.5],
[ 6.7, 3. , 5. , 1.7],
[ 6.3, 3.3, 6. , 2.5],
... #truncated
Double-check the first three indices from the truncated array above:
print(X[52], km.labels_[52])
print(X[77], km.labels_[77])
print(X[100], km.labels_[100])
[ 6.9 3.1 4.9 1.5] 2
[ 6.7 3. 5. 1.7] 2
[ 6.3 3.3 6. 2.5] 2
Actually, a very simple way to do this is:
clusters = KMeans(n_clusters=5).fit(df)
df[clusters.labels_ == 0]
The second line returns all the rows of df that belong to cluster 0. Similarly, you can find the elements of the other clusters.
To get the IDs of the points/samples/observations that are inside each cluster, do this:
Python 2
An example using the Iris data and a nice Pythonic way:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X)  # unsupervised: the labels y are not used by KMeans
# Coordinates of cluster centers, with shape [n_clusters, n_features]
clf.cluster_centers_
# Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.iteritems():
    temp = [key, value]
    dictlist.append(temp)
RESULTS
#dict format
{0: array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
1: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
2: array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}
# list format
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
Python 3
Just change
for key, value in mydict.iteritems():
to
for key, value in mydict.items():
You can look at the attribute labels_.
For example:
km = KMeans(2)
km.fit([[1,2,3],[2,3,4],[5,6,7]])
print(km.labels_)
output: array([1, 1, 0], dtype=int32)
As you can see, the first and second points are in cluster 1 and the last point is in cluster 0.
You can simply store the labels in an array and convert the array to a data frame. Then merge the data you used to create the K-means model with the new data frame containing the clusters.
Display the dataframe; now you should see each row with its corresponding cluster. If you want to list all the data with a specific cluster, use something like data.loc[data['cluster_label_name'] == 2], assuming 2 is your cluster of interest.
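A minimal sketch of that recipe with a tiny hypothetical DataFrame (the column name cluster_label_name is purely illustrative):
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical 2-D data; replace with the data used to fit K-means.
data = pd.DataFrame({'x': [1, 2, 10, 11, 1], 'y': [2, 3, 10, 12, 1]})
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Attach the labels as a new column, then filter by cluster id.
data['cluster_label_name'] = km.labels_
print(data.loc[data['cluster_label_name'] == 1])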

Using Agglomerative Hierarchical Clustering on a high-dimensional dataset with categorical and continuous variables

My group and I are working on a high-dimensional dataset with a mix of categorical (binary and integer) and continuous variables. We are wondering what the best distance metric and linkage method would be for agglomerative hierarchical clustering. We started with Euclidean distance and Ward's linkage, but given the issues that arise with Euclidean distance and categorical variables, we need a new strategy. We have attempted the Heterogeneous Euclidean-Overlap Metric (HEOM) and Gower's distance with average, centroid, and single linkage, but have not gotten the clear results we were hoping for. Are there better methods or metrics we should use for our analysis?
Here is an example of the code we have already:
from distython import HEOM
categorical_ix = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 28, 34, 37, 39, 142, 41, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 213, 217, 218, 219, 220, 221, 222, 223, 224, 225]
nan_eqv = 12345
heom_metric = HEOM(features, categorical_ix, nan_equivalents = [nan_eqv])
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric(heom_metric.heom)
distance = dist.pairwise(features)
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
# linkage expects a condensed (1-D) distance matrix or raw observations,
# so condense the square distance matrix first
linkage_matrix = linkage(squareform(distance, checks=False), 'average')
plt.figure(figsize=(10, 7))
plt.title("Test")
dendrogram(linkage_matrix)
plt.axhline(y=8, color='r', linestyle='--')
plt.show()
from scipy.cluster.hierarchy import fcluster
k = 4
clusters = fcluster(linkage_matrix, k, criterion='maxclust')
clusters
If Gower's distance or HEOM is the preferred method, we would also appreciate any advice on how to better implement these metrics in our code. Thank you!
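As a reference point, here is a minimal sketch of Gower's distance using the third-party gower package (assuming the data sits in a pandas DataFrame whose categorical columns have non-numeric dtypes), fed to SciPy's average linkage through a condensed distance matrix:
import gower  # third-party package: pip install gower
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical mixed-type frame standing in for `features`.
df = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'income': [40000, 52000, 81000, 90000],
    'smoker': ['yes', 'no', 'no', 'yes'],
})

dm = gower.gower_matrix(df)                        # square n x n dissimilarities
Z = linkage(squareform(dm, checks=False), 'average')
clusters = fcluster(Z, 2, criterion='maxclust')
print(clusters)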

Performing a better fit - Nonlinear regression with Python

I have some data that varies by day (x-axis), and I would like to fit it better. I have used scipy.optimize.curve_fit and it fits reasonably well with the following function, but I would like to improve the fit.
A polynomial would not be useful, and would be imprecise for me, since the y-axis values are cumulative, so the curve is unlikely to ever drop.
Could someone give me a hand with what I could change in the function to make it fit better?
Here is the data I use, as well as the formula with the parameter values.
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy.optimize import curve_fit
>>> X
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66])
>>> y
array([ 1, 1, 1, 1, 1, 1, 2, 2, 4, 15, 17, 38, 46,
53, 61, 75, 100, 111, 133, 142, 152, 159, 170, 174, 179, 182,
188, 190, 192, 196, 198, 199, 202, 205, 207, 214, 215, 216, 218,
224, 229, 231, 233, 236, 236, 236, 237, 237, 237, 237, 237, 237,
237, 238, 238, 238, 238, 238, 238, 238, 238, 238, 238, 238, 239,
243])
>>> def func(x,a,b):
...     return a*np.exp(b/x)
>>> popt,pcov=curve_fit(func,X,y)
>>> a=popt[0]
>>> b=popt[1]
>>> yvals=func(X,a,b)
>>> plot1=plt.plot(X,y,'*',label='original values')
>>> plot2=plt.plot(X,yvals,'r',label='curve_fit values')
>>> plt.show()
>>> popt
array([357.97373884, -20.60549425])
[Curve fit plot: original values as points, curve_fit values as a red line]
Thanks!
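Since the y values are cumulative and flatten out toward a plateau, one natural alternative to a*np.exp(b/x) is a logistic (sigmoid) model. A minimal sketch, reusing X and y from above; the starting guesses in p0 are rough assumptions:
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, L, k, x0):
    # L: plateau level, k: growth rate, x0: inflection day
    return L / (1 + np.exp(-k * (x - x0)))

p0 = [y.max(), 0.2, np.median(X)]  # rough starting guesses, not fitted values
popt, pcov = curve_fit(logistic, X, y, p0=p0, maxfev=10000)
yfit = logistic(X, *popt)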

Iterating Swap Algorithm Python

I have an algorithm, and I want the last solution, if it respects certain conditions, to become the new starting solution. In my case I have this:
First Part
Split the multidimensional array q into 2 parts:
split_at = q[:,3].searchsorted([1,random.randrange(LB,UB-I)])
B = numpy.split(q, split_at)
Rename and modify the split matrices:
S=B[1]
SF=B[2]
S2=copy(SF)
S2[:,3]=S2[:,3]+I
Compute a value f:
f=sum(S[:,1]*S[:,3])+sum(S2[:,1]*S2[:,3])
This first part is a required step.
Second Passage
Then I split the array into 2 parts again:
split_at = q[:,3].searchsorted([1,random.randrange(LB,UB-I)])
D = numpy.split(q, split_at)
I rename and change parts of the matrix (as in the first passage):
T=D[1]
TF=D[2]
T2=copy(TF)
T2[:,3]=T2[:,3]+I
u=random.sample(T[:],1) # randomly select a row from T
v=random.sample(T2[:],1) # randomly select a row from T2
u=array(u)
v=array(v)
Here is my first problem: I want to continue the algorithm only if v[0,0]-u[0,0]+T[-1,3] <= UB; if not, I want to repeat the second passage until the condition is satisfied.
Now I swap one random row from T with another from T2:
x=numpy.where(v==T2)[0][0]
y=numpy.where(u==T)[0][0]
l=np.copy(T[y])
T[y],T2[x]=T2[x],T[y]
T2[x],l=l,T2[x]
I modify and recalculate some values in the matrices:
E=np.copy(T)
E2=np.copy(T2)
E[:,3]=np.cumsum(E[:,0])
E2[:,3]=np.cumsum(E2[:,0])+I
Define f2:
f2=sum(E[:,1]*E[:,3])+sum(E2[:,1]*E2[:,3])
Here is my second and last problem: I need to iterate this algorithm. If f - f2 < 0, my new starting solution has to be E and E2, my new f has to be f2, and the algorithm should run again from the second passage (recalculating a new f and f2), excluding the last choice.
Thank you for the patience. I'm a noob :D
EDIT:
I have an example here (this part goes before the code written above):
import random
import numpy
import numpy as np
from numpy import array, copy, transpose
p=[ 29, 85, 147, 98, 89, 83, 49, 7, 48, 88, 106, 97, 2,
107, 33, 144, 123, 84, 25, 42, 17, 82, 125, 103, 31, 110,
34, 100, 36, 46, 63, 18, 132, 10, 26, 119, 133, 15, 138,
113, 108, 81, 118, 116, 114, 130, 134, 86, 143, 126, 104, 52,
102, 8, 90, 11, 87, 37, 68, 75, 69, 56, 40, 70, 35,
71, 109, 5, 131, 121, 73, 38, 149, 20, 142, 91, 24, 53,
57, 39, 80, 79, 94, 136, 111, 78, 43, 92, 135, 65, 140,
148, 115, 61, 137, 50, 77, 30, 3, 93]
w=[106, 71, 141, 134, 14, 53, 57, 128, 119, 6, 4, 2, 140,
63, 51, 126, 35, 21, 125, 7, 109, 82, 95, 129, 67, 115,
112, 31, 114, 42, 91, 46, 108, 60, 97, 142, 85, 149, 28,
58, 52, 41, 22, 83, 86, 9, 120, 30, 136, 49, 84, 38,
70, 127, 1, 99, 55, 77, 144, 105, 145, 132, 45, 61, 81,
10, 36, 80, 90, 62, 32, 68, 117, 64, 24, 104, 131, 15,
47, 102, 100, 16, 89, 3, 147, 48, 148, 59, 143, 98, 88,
118, 121, 18, 19, 11, 69, 65, 123, 93]
p=array(p,'double')
w=array(w,'double')
r=p/w
LB=12
UB=155
I=9
j=p,w,r
j=transpose(j)
k=j[j[:,2].argsort()]
c=np.cumsum(k[:,0])
q=k[:,0],k[:,1],k[:,2],c
q=transpose(q)
o=sum(q[:,1]*q[:,3])
split_at = q[:,3].searchsorted([1,UB-I])
B = numpy.split(q, split_at)
S=B[1]
SF=B[2]
S2=copy(SF)
S2[:,3]=S2[:,3]+I
f=sum(S[:,1]*S[:,3])+sum(S2[:,1]*S2[:,3])
split_at = q[:,3].searchsorted([1,random.randrange(LB,UB-I)])
D = numpy.split(q, split_at)
T=D[1]
TF=D[2]
T2=copy(TF)
T2[:,3]=T2[:,3]+I
u=random.sample(T[:],1)
v=random.sample(T2[:],1)
u=array(u)
v=array(v)
x=numpy.where(v==T2)[0][0]
y=numpy.where(u==T)[0][0]
l=np.copy(T[y])
T[y],T2[x]=T2[x],T[y]
T2[x],l=l,T2[x]
E=np.copy(T)
E2=np.copy(T2)
E[:,3]=np.cumsum(E[:,0])
E2[:,3]=np.cumsum(E2[:,0])+I
f2=sum(E[:,1]*E[:,3])+sum(E2[:,1]*E2[:,3])
I tried:
def DivideRandom(T, T2):
    split_at = q[:,3].searchsorted([1, random.randrange(LB, UB-I)])
    D = numpy.split(q, split_at)
    T = D[1]
    TF = D[2]
    T2 = copy(TF)
    T2[:,3] = T2[:,3] + I

DivideRandom(T, T2)

def SelectJob(u, v):
    u = random.sample(T[:], 1)
    v = random.sample(T2[:], 1)
    u = array(u)
    v = array(v)

SelectJob(u, v)
d = v[0,0] - u[0,0] + T[-1,3]

def Swap(u, v):
    x = numpy.where(v == T2)[0][0]
    y = numpy.where(u == T)[0][0]
    l = np.copy(T[y])
    T[y], T2[x] = T2[x], T[y]
    T2[x], l = l, T2[x]
    E = np.copy(T)
    E2 = np.copy(T2)
    E[:,3] = np.cumsum(E[:,0])
    E2[:,3] = np.cumsum(E2[:,0]) + I
    f2 = sum(E[:,1]*E[:,3]) + sum(E2[:,1]*E2[:,3])

while True:
    if d <= UB:
        Swap(u, v)
    if d > UB:
        DivideRandom(T, T2)
        SelectJob(u, v)
    if d < UB:
        break
You can iterate indefinitely using while True, then stop whenever your conditions are met using break:
count = 0
while True:
    count += 1
    if count == 10:
        break
So for your second example you can try:
while True:
    ...
    if f - f2 < 0:
        # use new variables
        f, E = f2, E2
    else:
        break
Your first problem is similar: loop, test, and reset the appropriate variables.
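A minimal sketch of that retry pattern for the first problem (divide_random and select_job are hypothetical stand-ins for the question's second-passage steps; q and UB are as defined in the question):
while True:
    T, T2 = divide_random(q)   # hypothetical: re-split q at a random point
    u, v = select_job(T, T2)   # hypothetical: pick one random row from each part
    if v[0, 0] - u[0, 0] + T[-1, 3] <= UB:
        break                  # acceptance condition met: continue with the swap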
