Which is the right standard deviation formula in Python?

I am confused. I found a few formulas for finding the SD (standard deviation).
This is the NumPy library's std method:
>>> nums = np.array([65, 36, 52, 91, 63, 79])
>>> np.std(nums)
17.716909687891082
But I found another formula here: Standard deviation
By this formula, with the same dataset, my result is 323.1666666666667. Now which one is right? Or are they used for two different things?
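For reference, the population standard deviation formula from that page is
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2},$$
where $\mu$ is the mean of the $N$ values; without the final square root this is the variance, not the standard deviation.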
EDIT: It seems I forgot about the square root.

NumPy is correct, of course. Here is the plain Python version:
from math import sqrt
data = [65, 36, 52, 91, 63, 79]
mean = sum(data) / len(data)
std = sqrt(sum((d - mean) ** 2 for d in data) / len(data))
print(std) # 17.716909687891082

Core Python: see statistics.pstdev.
import statistics
print(statistics.pstdev([65, 36, 52, 91, 63, 79]))
Output:
17.716909687891082
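Note that np.std and statistics.pstdev both compute the population standard deviation (dividing by n). If you want the sample standard deviation (dividing by n - 1), both libraries support that as well:
import numpy as np
import statistics
nums = [65, 36, 52, 91, 63, 79]
print(np.std(nums, ddof=1))    # sample standard deviation, divides by n - 1
print(statistics.stdev(nums))  # same value from the statistics module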

Related

Can't get the correct parameter fit for a system of ODEs using Symfit and some experimental results

I want to fit the following model (the ODE system written out in the code below),
dx/dt = TauONx * (x_SS - x)
dy/dt = TauON * x - TauOFF * y,
to some fluorescence measurements ([YFP] over time). Basically I can measure the change of [YFP] over time, but not the change of x. After navigating through different solutions on Stack Overflow (and trying several of the proposed solutions), I ended up getting pretty close with Symfit.
However, when I try to fit the model to the experimental results, I get the following fit results:
Parameter Value Standard Deviation
TauOFF 4.425923e-02 2.173698e+00
TauON 9.687891e+00 1.945774e+02
TauONx 4.539607e-02 2.239210e+00
x_SS 7.968579e+00 2.726591e+02
Status message Maximum number of function evaluations has been exceeded.
Number of iterations 443
Objective <symfit.core.objectives.LeastSquares object at 0x000002640701C898>
Minimizer <symfit.core.minimizers.NelderMead object at 0x000002640701CEF0>
Goodness of fit qualifiers:
chi_squared 480161.4690600715
objective_value 240080.73453003576
r_squared 0.9677940481847731
I don't understand why x's prediction is so low and almost constant (almost, because when I zoom in it actually changes a little bit). Also, it says that "Maximum number of function evaluations has been exceeded". What am I doing wrong? Am I using the wrong minimizer? The wrong initial parameter estimates?
Below is my code:
# %% Importing modules
import symfit
from symfit import parameters, variables, ODEModel, Fit, Parameter, D
from symfit.core.objectives import LogLikelihood
from symfit.core.minimizers import NelderMead
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sympy.solvers import ode
# %% Experimental data. Inputs is time, outputs is fluorescence measurements ([YFP])
inputs = np.array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66])
outputs = [73.64756293519015, 44.83500717360115, 66.59467242620596, 49.65998568360771, 46.484859217283514, 72.37530519707008, 74.47354904982025, 61.437468439656605, 80.15204098496119, 93.11740890688259, 74.73900664346728, 87.38835848475512, 94.96499329658872, 116.07910576096306, 126.95045168354777, 123.76237623762376, 147.73432650527624, 168.04489072652098, 183.3221551531411, 321.22186495176834, 356.38957816377166, 389.03440885819737, 321.22186495176834, 356.38957816377166, 389.03440885819737, 582.1501961516907, 607.139657798083, 651.6151143860851, 682.4329863103533, 716.422610612502, 749.3927432822223, 777.726234656009, 809.6079246328624, 847.2845376012857, 870.6370831711431, 895.512942218847, 914.3568311720239, 1002.7537605116663, 1019.3525890625908, 1028.7006485379452, 1073.162564875272, 1080.7277331278212, 1106.8392267287595, 1119.0425361584034, 1139.207233729366, 1145.790182270091, 1177.2867420349437, 1185.0114126299773, 1196.1818638533032, 1213.7383689107828, 1208.2922013820337, 1209.8943558642277, 1225.7463589296947, 1232.9657629893582, 1221.7722725107194, 1237.6858956142842, 1240.1111320399323, 1240.6384572177496, 1249.767333643555, 1247.0462864291337, 1259.6783113651027, 1258.188648128636, 1267.006026296567, 1272.2310666363428, 1260.6866757617101, 1266.8857660924748]
# %% Model Definitions
x, y, t = variables('x, y, t')
TauONx = Parameter('TauONx', 0.1)
TauON = Parameter('TauON', 0.180854297)
### For a moment, I thought of fixing TauOFF, obtaining this value from other experiments
TauOFF = Parameter('TauOFF', 10.53547354)
#TauOFF = 10.53547354
x_SS = Parameter('x_SS', 0.1)
#### All of this is using symfit package!
model_dict = {
D(x, t): TauONx*(x_SS - x),
D(y, t): TauON*x - TauOFF*y,
}
# %% Execute data
ode_model = ODEModel(model_dict, initial={t: 0.0, x: 54 * 10e-4, y: 54 * 10e-4})
fit = Fit(ode_model, t=inputs, x=None, y=outputs, minimizer=NelderMead)
#fit = Fit(ode_model, outputs, objective=LogLikelihood)
fit_result = fit.execute()
print(fit_result)
# %% Plot the data generated vs the output
tvec = np.linspace(0, 60, 1000)
X, Y = ode_model(t=tvec, **fit_result.params)
plt.plot(tvec, X, label='[x]')
plt.plot(tvec, Y, label='[y]')
plt.scatter(inputs, outputs)
plt.legend()
plt.show()
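One cheap experiment, offered as an assumption rather than a confirmed fix: Nelder-Mead can stall when parameters live on very different scales, so it may be worth omitting the minimizer argument and letting Fit fall back to its default gradient-based minimizer:
# Same model and data as above; omitting minimizer=NelderMead lets symfit
# choose its default minimizer (an experiment, not a guaranteed fix).
fit = Fit(ode_model, t=inputs, x=None, y=outputs)
fit_result = fit.execute()
print(fit_result)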

AgglomerativeClustering, no attribute called distances_

So I tried to learn about hierarchical clustering, but I always get an error in Spyder:
AttributeError: 'AgglomerativeClustering' object has no attribute 'distances_'
This is the code:
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
df = pd.DataFrame({
    'x': [41, 36, 32, 34, 32, 31, 24, 30, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 30, 30, 31, 32, 29]
})
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=0)
clustering.fit(df)
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)
    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

plt.title('Hierarchical Clustering Dendrogram')
# plot the dendrogram
plot_dendrogram(clustering)
plt.xlabel("index data")
plt.show()
#print(clustering.labels_)
I have upgraded scikit-learn to the newest version, but the same error still exists. Is there anything I can do, or is there something wrong in this code?
The official documentation of sklearn.cluster.AgglomerativeClustering says:
distances_ : array-like of shape (n_nodes-1,)
Distances between nodes in the corresponding place in children_.
Only computed if distance_threshold is used or compute_distances is set to True.
I had the same problem, and I fixed it by setting the parameter compute_distances=True.
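For example (a minimal sketch; compute_distances is available in scikit-learn 0.24 and later):
from sklearn.cluster import AgglomerativeClustering
# With a fixed n_clusters, distance_threshold must be None, so distances_
# is only populated when compute_distances=True is passed explicitly.
clustering = AgglomerativeClustering(n_clusters=3, compute_distances=True)
clustering.fit(df)
print(clustering.distances_)  # now available for plot_dendrogram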
Apparently I missed some steps before I posted this question, so here are the steps I took to solve the problem:
Uninstall scikit-learn through the Anaconda prompt
Install scikit-learn back
If somehow your Spyder is gone, install it again with the Anaconda prompt
Install any missing libraries
It can work again :)
Update scikit-learn from 0.21.* to 0.22.*:
pip install -U scikit-learn
For me, the solution at https://stackoverflow.com/a/61363342/10270590 did not work, but it worked without the dendrogram.
Ex. Code:
aggmodel = AgglomerativeClustering(distance_threshold=None,
                                   n_clusters=10,
                                   affinity="manhattan",
                                   linkage="complete")
aggmodel = aggmodel.fit(data1)
aggmodel.n_clusters_
#aggmodel.labels_

Generating random lists in Python (seed problem?)

I'm trying to generate some (pseudo-)random lists for testing purposes. Here I'm generating a 2-D matrix (a list of lists), where one test_data.json file has some positive number of "cycles" and each "cycle" has n (resolution) integers.
After trying some basic random functions from the numpy and random libraries, I've been unable to generate the lists randomly at all.
import json
import random as ran
import numpy as np
import os

resolution = 10  # Map resolution: Max = 200
cycles = 3  # Number of cycles
dist = [None for _ in range(resolution)]  # Distance list
output = list()
n = 0
with open("test_data.json", "w") as test:
    for turn in range(cycles):
        n += 10
        # ran.seed(n)
        np.random.seed(n)
        for num in range(resolution):
            # dist[num] = int(ran.random() * 255)
            dist[num] = int(np.random.random() * 255)
        output.append(dist)
        # print(output)
    json.dump(output, test)
    # test.write('\n')
I can work with any "random" output within a certain range (here I'm scaling 0-1 up to 0-255). The numbers within each list (cycle) are random enough, but every cycle is the same list of numbers:
[[164, 97, 169, 41, 245, 88, 252, 59, 149, 103], [164, 97, 169, 41, 245, 88, 252, 59, 149, 103], [164, 97, 169, 41, 245, 88, 252, 59, 149, 103]]
I've tried using seed() with both constant and changing seeds, but the output never changes between cycles.
Don't set the seed at every iteration. The whole point of a seed is that a given seed will generate the same stream of numbers every single time. Set the seed once at the beginning of your program (to get the same results for each run), or not at all. In the latter case, a time-dependent initial state will be generated for you, making your generator appear random indeed.
Also, choose whether you want to use python's built-in random module or np.random. You probably don't want to use both. Especially not if you're setting seeds. The seeds of one don't affect the other.
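A second issue worth noting in the original code, separate from seeding: output.append(dist) appends a reference to the same list object on every cycle, so all rows of output end up showing the final cycle's values even when the seeds differ. A minimal fix is to append a copy:
# Append a snapshot of dist rather than the list object itself,
# so later cycles don't overwrite earlier rows.
output.append(dist[:])  # or dist.copy() / list(dist)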
Why not remove the seeds, and simply use
import random as ran
ran.randint(0, 255)
to generate your random numbers in the range 0-255?
You can solve this by simply using random alone:
import random
cycles = 3
resolution = 10
output = [[random.randint(0, 255) for _ in range(resolution)] for _ in range(cycles)]
This will give output as follows:
[[130, 126, 153, 18, 58, 24, 184, 75, 14, 25], [215, 73, 2, 58, 170, 255, 34, 113, 83, 80], [82, 100, 0, 118, 181, 90, 113, 165, 57, 87]]
You can then dump output to a file:
import json
with open("test_data.json", "w") as test:
    json.dump(output, test)
The number of cycles and the resolution can be changed to achieve the desired results.

Find length of cluster (how many points are associated with a cluster) after KMeans clustering (scikit-learn)

I have done clustering using KMeans from sklearn. While it has a method to print the centroids, I find it rather bizarre that scikit-learn doesn't seem to have a method to find the cluster length (or I have not seen it so far). Is there a neat way to get the length of each cluster, i.e. how many points are associated with it? I currently have this rather kludgy code, where I find clusters of length one and need to add another point to such a cluster by measuring the Euclidean distance between the points and updating the labels:
import numpy as np
from numpy import genfromtxt
from clustering.clusternew import Kmeans_clu
from evolution.generate import reproduction
from mapping.somnew import mapping, no_of_neurons, neuron_weights_init
from population_creation.population import pop_create
from New_SOL import newsol

data = genfromtxt('iris.csv', delimiter=',', skip_header=0, usecols=range(0, 4))  # Read the input data
actual_label = genfromtxt('iris.csv', delimiter=',', dtype=str, skip_header=0, usecols=(4))
chromosome = int(input("Enter the number of chromosomes: "))  # Input the population size
max_gen = int(input("Enter the maximum number of generation: "))  # Input the maximum number of generations
for i in range(0, chromosome):
    cluster = 3  # random.randint(2, max_cluster)  # Randomly selects cluster number from 2 to root(population)
    K.insert(i, cluster)  # Store the number of clusters in K
    print('value of K is ', K)
    u, label, z1, A1 = Kmeans_clu(cluster, data)
    # print("centers and labels : ", u, label)
    lab.insert(i, label)  # Store the labels in lab
    center.insert(i, u)
    new_center = pop_create(max_cluster, features, cluster, u)
    population.insert(i, new_center)
print("Value of population in main\n", population)
newsol(max_gen, population, data)
To the newsol method we pass the new population generated by the code above, and we again run K-Means on that population:
def ClusterIndicesComp(clustNum, labels_array):  # list comprehension for accessing the features in the iris data set
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])

def newsol(max_gen, population, data):
    # print('Value of NewSol Population is', population)
    for i in range(max_gen):
        cluster1 = 5
        u, label, t, l = Kmeans_clu(cluster1, population)
        A1.insert(i, t)
        plab.insert(i, label)
        pcenter.insert(i, u)
        k2 = Counter(l.labels_)  # Count number of elements in each cluster
        k1 = [t for (t, v) in k2.items() if v == 1]  # clusters whose length is one will be fetched
        t1 = np.array(k1)  # Iterating through the clusters which have one point associated with them
        for b in range(len(t1)):
            print("Value in NEW_SOL is of 1 length cluster\n", t1[b])
            plot1 = data[ClusterIndicesComp(t1[b], l.labels_)]
            print("Values are in sol of plot1", plot1)
            for q in range(cluster1):
                plot2 = data[ClusterIndicesComp(q, l.labels_)]
                print("Value of plot2 is for\n", q, plot2)
                for i in range(len(plot2)):  # To get one element at a time from plot2
                    plotk = plot2[i]
                    if [t for (t, v) in k2.items() if v > 2]:  # distance is calculated only if a cluster has more than 2 points
                        S = np.linalg.norm(np.array(plot1) - np.array(plotk))
                        print("Distance between plot1 and plotk is", plot1, plotk, S)  # Euclidean distance is calculated
                    else:
                        print("NO distance between them\n")
The K-Means clustering I have done is:
from sklearn.cluster import KMeans
import numpy as np
def Kmeans_clu(K, data):
    kmeans = KMeans(n_clusters=K, init='random', max_iter=1, n_init=1).fit(data)  # Apply k-means clustering
    labels = kmeans.labels_
    clu_centres = kmeans.cluster_centers_
    z = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}  # getting the points in each cluster
    return clu_centres, labels, z, kmeans
For getting the number of instances in each cluster, maybe you can try using Counter:
from collections import Counter, defaultdict
print(Counter(estimator.labels_))
Result:
Counter({0: 62, 1: 50, 2: 38})
where cluster 0 has 62 instances, cluster 1 has 50 instances, and cluster 2 has 38 instances
And maybe to store the indices of the instances in each cluster, you can use defaultdict:
clusters_indices = defaultdict(list)
for index, c in enumerate(estimator.labels_):
    clusters_indices[c].append(index)
Now, to find the indices of the instances in cluster 0, call:
print(clusters_indices[0])
Result:
[50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114, 119, 121, 123, 126, 127, 133, 138, 142, 146, 149]
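As a side note, NumPy can produce the same per-cluster counts in one line; a minimal sketch:
import numpy as np
labels, counts = np.unique(estimator.labels_, return_counts=True)
print(dict(zip(labels, counts)))  # e.g. {0: 62, 1: 50, 2: 38}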

Method to identify the point where numbers fall off sharply

I have a series of numbers:
numbers = [100, 101, 99, 102, 99, 98, 100, 97.5, 98, 99, 95, 93, 90, 85, 80]
It's very easy to see by eye that the numbers start to fall sharply at roughly index 10, but is there a simple way to identify that point (or one close to it) on the x axis?
This is being done in retrospect, so you can use the entire list of numbers to select the x axis point where the dropoff accelerates.
Python solutions are preferred, but pseudo-code or a general methodology is fine too.
OK, this ended up fitting my needs. I calculate a running mean and standard deviation, and then a CDF from a t-distribution, to tell me how unlikely each successive value is.
This only works for decreases, since I am only checking for CDF < 0.05, but it works very well.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
numbers = np.array([100, 101, 99, 102, 99, 98, 100, 97.5, 98, 99, 95, 93, 90, 85, 80])
# Calculate a running mean
cum_mean = numbers.cumsum() / (np.arange(len(numbers)) + 1)
# Calculate a running standard deviation
cum_std = np.array([numbers[:i].std() for i in range(len(numbers))])
# Calculate a z value
cum_z = (numbers[1:] - cum_mean[:-1]) / cum_std[:-1]
# Add in NA vals to account for records without sample size
z_vals = np.concatenate((np.zeros(1+2), cum_z[2:]), axis=0)
# Calculate cdf
cum_t = np.array([stats.t.cdf(z, i) for i, z in enumerate(z_vals)])
# Identify first number to fall below threshold
first_deviation = np.where(cum_t < 0.05)[0].min()
fig, ax = plt.subplots()
# plot the numbers and the point immediately prior to the decrease
ax.plot(numbers)
ax.axvline(first_deviation-1, color='red')
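For comparison, a much simpler heuristic (a sketch, assuming the series contains a single sharp drop) is to look at successive differences and flag the steepest one:
import numpy as np
numbers = np.array([100, 101, 99, 102, 99, 98, 100, 97.5, 98, 99, 95, 93, 90, 85, 80])
diffs = np.diff(numbers)   # numbers[i + 1] - numbers[i]
steepest = diffs.argmin()  # index where the largest single-step drop begins
print(steepest)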
