Method to identify the point where numbers fall off sharply - python

I have a series of numbers:
numbers = [100, 101, 99, 102, 99, 98, 100, 97.5, 98, 99, 95, 93, 90, 85, 80]
It's easy to see by eye that the numbers start to fall sharply at roughly index 10, but is there a simple way to identify that point (or close to it) on the x axis?
This is being done in retrospect, so you can use the entire list of numbers to select the x axis point where the dropoff accelerates.
Python solutions are preferred, but pseudo-code or a general methodology is fine too.

Ok, this ended up fitting my needs. I calculate a running mean and standard deviation, then use the CDF of a t distribution to tell me how unlikely each successive value is.
This only works for decreases, since I am only checking for cdf < 0.05, but it works very well.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
numbers = np.array([100, 101, 99, 102, 99, 98, 100, 97.5, 98, 99, 95, 93, 90, 85, 80])
# Calculate a running mean
cum_mean = numbers.cumsum() / (np.arange(len(numbers)) + 1)
# Calculate a running standard deviation (of the first i+1 values, matching the running mean)
cum_std = np.array([numbers[:i+1].std() for i in range(len(numbers))])
# Calculate a z value
cum_z = (numbers[1:] - cum_mean[:-1]) / cum_std[:-1]
# Pad with zeros where the sample size is too small for a meaningful z value
z_vals = np.concatenate((np.zeros(3), cum_z[2:]), axis=0)
# CDF of each z value under a t distribution whose degrees of freedom grow with the sample size
cum_t = np.array([stats.t.cdf(z, i) for i, z in enumerate(z_vals)])
# Identify first number to fall below threshold
first_deviation = np.where(cum_t < 0.05)[0].min()
fig, ax = plt.subplots()
# Plot the numbers and mark the point immediately prior to the decrease
ax.plot(numbers)
ax.axvline(first_deviation - 1, color='red')
plt.show()
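A simpler alternative (a sketch of mine, not part of the approach above): work on the first differences of the series and flag the first step that drops more than two standard deviations below the mean of an assumed-stable early window. The window size and the factor of two are tunable assumptions.
import numpy as np
numbers = np.array([100, 101, 99, 102, 99, 98, 100, 97.5, 98, 99, 95, 93, 90, 85, 80])
diffs = np.diff(numbers)
baseline = diffs[:5]  # assume the first few steps are "stable"
threshold = baseline.mean() - 2 * baseline.std()
drops = np.where(diffs < threshold)[0]
print(drops[0] if drops.size else None)  # step index just before the first flagged drop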

Related

Create barplot but getting TypeError only size-1 arrays can be converted to Python scalars

I am trying to create a barplot which shows the increase in binding capacity, in percentage, for different reactions. However, I keep getting the error "TypeError: only size-1 arrays can be converted to Python scalars". This is the code I have:
import numpy as np
import matplotlib.pyplot as plt
numbers_ci = [1.113, 1.068, 0.999, 1.021, 1.078, 1.086, 1.024, 1.025, 1.082, 1.215, 1.069, 1.09, 1.11, 1.106, 1.02, 1.087, 1.124, 1.069, 1.004, 1.002, 1.058, 0.993, 1.024, 0.926, 1.099, 1.083, 0.995, 1.023, 1.422]
def calculate_concentration(numbers):
    concentrations = [0.3722 - (number/2.185*0.3722) for number in numbers]
    increases = [(concentration - concentrations[-1])/concentrations[-1]*100 for concentration in concentrations]
    print(f"The average absorbance numbers are:\n{numbers}")
    print(f"The concentrations of bound copper are:\n{concentrations}")
    print(f"The increases in copper binding are:\n{increases}")
    reactions = [range(97,126)]
    plt.bar(reactions, increases)
    plt.xlabel("Reactions")
    plt.ylabel("Increase in binding capacity (%)")
    plt.title("Pure chitin")
    plt.show()
calculate_concentration(numbers_ci)
The list reactions is not being created correctly. Currently it is a list containing a single range object, so matplotlib cannot convert it into the 29 scalar x positions it needs. I think what you were going for is
reactions = [i for i in range(97,126)]
Your variable reactions is a list containing a range object instead of a list of values:
# this is how your reactions variable looks
[range(97, 126)]
# this is how it should look
[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125]
Try this instead:
reactions = list(range(97,126))
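For context on the error message itself (a reproduction of my own, not from the answers): roughly what happens under the hood is that matplotlib tries to coerce the x positions to scalars, and calling float() on a multi-element NumPy array raises exactly this TypeError.
import numpy as np
float(np.array([range(97, 126)]))
# TypeError: only size-1 arrays can be converted to Python scalars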

Which is the right standard deviation formula Python

I am confused. I found a few formulas for finding the SD (standard deviation).
This is the NumPy library std method:
>>> nums = np.array([65, 36, 52, 91, 63, 79])
>>> np.std(nums)
17.716909687891082
But I found another formula here: Standard deviation
By this formula, with the same dataset, my result is 323.1666666666667. Now which one is right? Or are they used for two different things?
EDIT: Seems I forgot about the square root
NumPy is correct, of course. Here is the plain Python version:
from math import sqrt
data = [65, 36, 52, 91, 63, 79]
mean = sum(data) / len(data)
std = sqrt(sum((d - mean) ** 2 for d in data) / len(data))
print(std) # 17.716909687891082
Core Python: see statistics.pstdev
import statistics
print(statistics.pstdev([65, 36, 52, 91, 63, 79]))
output
17.716909687891082
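One caveat worth adding: np.std and statistics.pstdev both compute the population standard deviation (dividing by n). If the six numbers are a sample from a larger population, the sample formula (dividing by n - 1) gives a slightly larger value. A quick sketch:
import numpy as np
import statistics
nums = [65, 36, 52, 91, 63, 79]
print(np.std(nums))            # population standard deviation (divides by n)
print(np.std(nums, ddof=1))    # sample standard deviation (divides by n - 1)
print(statistics.stdev(nums))  # the sample counterpart of pstdev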

Generating random lists in Python (seed problem?)

I'm trying to generate some (pseudo-)random lists for testing purposes. Here I'm generating a 2-D matrix (a list of lists), where one test_data.json file holds any positive number of "cycles" and each "cycle" has n ("resolution") integers.
After trying some basic random functions from the numpy and random libraries, I've been unable to generate lists randomly at all.
import json
import random as ran
import numpy as np
import os
resolution = 10  # Map resolution: Max = 200
cycles = 3  # Number of cycles
dist = [None for _ in range(resolution)]  # Distance list
output = list()
n = 0
with open("test_data.json", "w") as test:
    for turn in range(cycles):
        n += 10
        # ran.seed(n)
        np.random.seed(n)
        for num in range(resolution):
            # dist[num] = int(ran.random() * 255)
            dist[num] = int(np.random.random() * 255)
        output.append(dist)
        # print(output)
    json.dump(output, test)
    # test.write('\n')
I can work with any "random" output within a certain range (here I'm scaling 0-1 to 0-255). The numbers within each list (cycle) are random enough, but every cycle is the same list of numbers.
[[164, 97, 169, 41, 245, 88, 252, 59, 149, 103], [164, 97, 169, 41, 245, 88, 252, 59, 149, 103], [164, 97, 169, 41, 245, 88, 252, 59, 149, 103]]
I've tried using seed() with constant and with changing seeds, but the output never changes between cycles.
Don't set the seed at every iteration. The whole point of a seed is that a given seed will generate the same stream of numbers every single time. Set the seed once at the beginning of your program (to get the same results for each run), or not at all. In the latter case, a time-dependent initial state will be generated for you, making your generator appear random indeed.
Also, choose whether you want to use python's built-in random module or np.random. You probably don't want to use both. Especially not if you're setting seeds. The seeds of one don't affect the other.
Why not remove the seeds, and simply use
import random as ran
ran.randint(0, 255)
to generate your random numbers in the range 0-255? (randint is inclusive on both ends, so the upper bound should be 255, not 256.)
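For example, a minimal sketch (the seed value 42 is arbitrary) that seeds once at the top and still gets a different list per cycle:
import random as ran
ran.seed(42)  # arbitrary seed; set once so each run is reproducible
cycles, resolution = 3, 10
output = [[ran.randint(0, 255) for _ in range(resolution)] for _ in range(cycles)]
# the three inner lists differ because the stream is never reset mid-run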
You can solve this by simply using the random module:
import random
cycles = 3
resolution = 10
output = [[random.randint(0, 255) for _ in range(resolution)] for _ in range(cycles)]
This will give output as follows:
[[130, 126, 153, 18, 58, 24, 184, 75, 14, 25], [215, 73, 2, 58, 170, 255, 34, 113, 83, 80], [82, 100, 0, 118, 181, 90, 113, 165, 57, 87]]
You can then dump output to a file:
import json
with open("test_data.json", "w") as test:
    json.dump(output, test)
The number of cycles and the resolution can be changed to achieve the desired results.

Special function definition issue in Python vs Mathematica

I have a Mathematica code that calculates the 95% confidence intervals of a Cumulative Distribution Function (CDF) obtained from a specific Probability Distribution Function (PDF). The PDF is ugly, as it contains a Hypergeometric 2F1 function, and I need to calculate the 2-sigma error bars of a data set of 15 values.
I want to translate this code to Python, but I get a very significant divergence on the second half of the values.
Mathematica code
results are the lower and upper 2-sigma confidence levels for the values in xdata. That is, xdata should always fall between the two corresponding results values.
navs = {10, 10, 18, 30, 52, 87, 147, 245, 410, 684, 1141, 1903, 3173, 5290, 8816};
freqs = {0.00002, 0.00004, 0.0000666667, 0.000111111, 0.000185185, 0.000308642, 0.000514403, 0.000857339, 0.00142893, 0.00238166, 0.00396944, 0.00661594, 0.0165426, 0.0220568, 0.027571}
xdata = {0.578064980346793, 0.030812200935204, 0.316777979844816,
0.353718150091612, 0.287659600326548, 0.269254388840293,
0.16545714457921, 0.138759871084825, 0.0602382519940077,
0.10120771961, 0.065311134782518, 0.105235790998594,
0.124642033979457, 0.0271909963701794, 0.0686653810421847};
data = MapThread[{#1, #2, #3} &, {navs, freqs, xdata}]
post[x_, n_, y_] =
(n - 1) (1 - x)^n (1 - y)^(n - 2) Hypergeometric2F1[n, n, 1, x*y]
integral = Map[(values = #; mesh = Subdivide[0, 1, 1000];
Interpolation[
DeleteDuplicates[{Map[
SetPrecision[post[#, values[[1]], values[[3]]^2], 100] &,
mesh] // (Accumulate[#] - #/2 - #[[1]]/
2) & // #/#[[-1]] &,
mesh}\[Transpose], (#1[[1]] == #2[[1]] &)],
InterpolationOrder -> 1]) &, data];
results =
MapThread[{Sqrt[#1[.025]], Sqrt[#1[0.975]]} &, {integral, data}]
{{0.207919, 0.776508}, {0.0481485, 0.535278}, {0.0834002, 0.574447},
{0.137742, 0.551035}, {0.121376, 0.455097}, {0.136889, 0.403306},
{0.0674029, 0.279408}, {0.0612534, 0.228762}, {0.0158357, 0.134521},
{0.0525374, 0.156055}, {0.0270589, 0.108861}, {0.0740978, 0.137691},
{0.100498, 0.149646}, {0.00741129, 0.0525161}, {0.0507748, 0.0850961}}
Python code
Here's my translation; results is the same quantity as before, truncated to the 7th digit for readability.
The results values I get start to diverge from the 7th pair of values onward, and the last four points of xdata do not fall between the two corresponding results values.
import numpy as np
from scipy.integrate import cumtrapz
from scipy.interpolate import interp1d
from mpmath import *
mesh = list(np.linspace(0, 1, 1000))
navs = [10, 10, 18, 30, 52, 87, 147, 245, 410, 684, 1141, 1903, 3173, 5290, 8816]
freqs = [0.00002, 0.00004, 0.0000666667, 0.000111111, 0.000185185, 0.000308642, 0.000514403, 0.000857339, 0.00142893, 0.00238166, 0.00396944, 0.00661594, 0.0165426, 0.0220568, 0.027571]
xdata = [0.578064980346793, 0.030812200935204, 0.316777979844816,
0.353718150091612,0.287659600326548, 0.269254388840293,
0.16545714457921, 0.138759871084825, 0.0602382519940077,
0.10120771961, 0.065311134782518, 0.105235790998594,
0.124642033979457, 0.0271909963701794, 0.0686653810421847]
def post(x, n, y):
    return (n-1)*((1-x)**n)*((1-y)**(n-2))*hyp2f1(n, n, 1, x*y)
# setting the numeric precision to 100 as in Mathematica
# trying to get the most precise hypergeometric function values
mp.dps = 100
mp.pretty = True
results = []
for i in range(len(navs)):
    postprob = []
    for j in range(len(mesh)):
        posterior = post(mesh[j], navs[i], xdata[i]**2)
        postprob.append(posterior)
    # calculate the norm of the pdf for integration
    norm = np.trapz(np.array(postprob), mesh)
    # integrate pdf/norm to obtain the cdf
    integrate = list(np.unique(cumtrapz(np.array(postprob)/norm, mesh, initial=0)))
    mesh2 = list(np.linspace(0, 1, len(integrate)))
    # interpolate the inverse cdf to obtain the 2-sigma quantiles
    icdf = interp1d(integrate, mesh2, bounds_error=False, fill_value='extrapolate')
    results.append(list(np.sqrt(icdf([0.025, 0.975]))))
results
[[0.2079198, 0.7765088], [0.0481485, 0.5352773], [0.0834, 0.5744489],
[0.1377413, 0.5510352], [0.1218029, 0.4566994], [0.1399324, 0.4122767],
[0.0733743, 0.3041607], [0.0739691, 0.2762597], [0.0230135, 0.1954886],
[0.0871462, 0.2588804], [0.05637, 0.2268962], [0.1731199, 0.3217401],
[0.2665897, 0.3969059], [0.0315915, 0.2238736], [0.2224567, 0.3728803]]
Thanks to the comments on this question, I found out that the hypergeometric function gives different results in the two languages. With the same input values, Mathematica's Hypergeometric2F1 returns 1.0588267, while Python's mpmath.hyp2f1 returns 1.0588866. This is at the very second point of the mesh, and the difference is in the fifth decimal place.
Is there a better implementation of this special function somewhere that I was not able to find?
I still don't know if this is due only to the hypergeometric function or also to the integration method, but that is definitely a starting point.
(I am fairly new to Python, maybe the code is a bit naive)
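One way to start isolating the discrepancy (my suggestion, not from the original post) is to evaluate the same hypergeometric call in double precision via scipy.special and at high precision via mpmath; if the two agree to roughly 15 digits, the divergence more likely comes from the integration/interpolation step than from hyp2f1 itself. A sketch with sample inputs (the exact inputs behind the 1.0588... values aren't given):
import mpmath
from scipy import special
mpmath.mp.dps = 50  # high working precision for the reference value
n, y = 10, 0.578064980346793**2  # sample values from the first data point
x = 1.0 / 999                    # the second point of a 1000-point mesh on [0, 1]
print(mpmath.hyp2f1(n, n, 1, x * y))   # arbitrary-precision reference
print(special.hyp2f1(n, n, 1, x * y))  # double-precision SciPy value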

Find length of cluster (how many points are associated with each cluster) after KMeans clustering (scikit-learn)

I have done clustering using KMeans from sklearn. While it has a method to print the centroids, I find it rather bizarre that scikit-learn doesn't seem to have a method to find out the cluster length (or I have not seen it so far). Is there a neat way to get the length of each cluster, i.e. the number of points associated with it? I currently have this rather kludgy code, where I find clusters of length one and need to add other points to them by measuring the Euclidean distance between the points and updating the labels.
import numpy as np
from clustering.clusternew import Kmeans_clu
from evolution.generate import reproduction
from mapping.somnew import mapping, no_of_neurons, neuron_weights_init
from population_creation.population import pop_create
from New_SOL import newsol
from numpy import genfromtxt  # needed for reading the CSV below
data = genfromtxt('iris.csv', delimiter=',', skip_header=0, usecols=range(0, 4))  # Read the input data
actual_label = genfromtxt('iris.csv', delimiter=',', dtype=str, skip_header=0, usecols=(4))
chromosome = int(input("Enter the number of chromosomes: "))  # Input the population size
max_gen = int(input("Enter the maximum number of generation: "))  # Input the maximum number of generations
for i in range(0, chromosome):
    cluster = 3  # random.randint(2, max_cluster)  # Randomly select a cluster number from 2 to root(population)
    K.insert(i, cluster)  # Store the number of clusters
    print('value of K is ', K)
    u, label, z1, A1 = Kmeans_clu(cluster, data)
    # print("centers and labels : ", u, label)
    lab.insert(i, label)  # Store the labels
    center.insert(i, u)
    new_center = pop_create(max_cluster, features, cluster, u)
    population.insert(i, new_center)
print("Value of population in main\n", population)
newsol(max_gen, population, data)
In the newsol method we pass the new population generated above and run K-Means on it again:
def ClusterIndicesComp(clustNum, labels_array):  # list comprehension for accessing the features in the iris data set
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])

def newsol(max_gen, population, data):
    # print('Value of NewSol Population is', population)
    for i in range(max_gen):
        cluster1 = 5
        u, label, t, l = Kmeans_clu(cluster1, population)
        A1.insert(i, t)
        plab.insert(i, label)
        pcenter.insert(i, u)
        k2 = Counter(l.labels_)  # Count the number of elements in each cluster
        k1 = [t for (t, v) in k2.items() if v == 1]  # clusters whose length is one
        t1 = np.array(k1)  # Iterate through the clusters that have one point associated with them
        for b in range(len(t1)):
            print("Value in NEW_SOL is of 1 length cluster\n", t1[b])
            plot1 = data[ClusterIndicesComp(t1[b], l.labels_)]
            print("Values are in sol of plot1", plot1)
            for q in range(cluster1):
                plot2 = data[ClusterIndicesComp(q, l.labels_)]
                print("Value of plot2 is for \n", q, plot2)
                for i in range(len(plot2)):  # Take one element at a time from plot2
                    plotk = plot2[i]
                    if [t for (t, v) in k2.items() if v > 2]:  # only calculate the distance if the cluster has more than 2 points
                        S = np.linalg.norm(np.array(plot1) - np.array(plotk))
                        print("Distance between plot1 and plotk is", plot1, plotk, S)  # Euclidean distance
                    else:
                        print("NO distance between them\n")
The K-Means function I use is:
from sklearn.cluster import KMeans
import numpy as np

def Kmeans_clu(K, data):
    kmeans = KMeans(n_clusters=K, init='random', max_iter=1, n_init=1).fit(data)  # Apply k-means clustering
    labels = kmeans.labels_
    clu_centres = kmeans.cluster_centers_
    z = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}  # indices of the points in each cluster
    return clu_centres, labels, z, kmeans
To get the number of instances in each cluster, you can use Counter:
from collections import Counter, defaultdict
print(Counter(estimator.labels_))
Result:
Counter({0: 62, 1: 50, 2: 38})
where cluster 0 has 62 instances, cluster 1 has 50 instances, and cluster 2 has 38 instances
And to store the indices of the instances in each cluster, you can use a defaultdict:
clusters_indices = defaultdict(list)
for index, c in enumerate(estimator.labels_):
    clusters_indices[c].append(index)
Now, to find the indices of the instances in cluster 0, call:
print(clusters_indices[0])
Result:
[50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114, 119, 121, 123, 126, 127, 133, 138, 142, 146, 149]
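Putting the pieces together, here is a minimal self-contained sketch (it uses scikit-learn's bundled iris data rather than the question's iris.csv, and the exact counts depend on the random_state):
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data  # same iris features the question reads from CSV
estimator = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(Counter(estimator.labels_))  # one count per cluster label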
