Inverse method for discrete variables in Python

I am trying to do the inverse method for a discrete variable, but I don't get anything (if I print the sample I get only one number, not a sample). How do I make this work correctly?
import numpy as np
import numpy.random as rd
import matplotlib.pyplot as plt
def inversao_discreto(p,M):
    amostra=np.zeros(M)
    acum=np.zeros(len(p))
    acum[:]=p
    for i in range(1, len(acum)):
        acum[i]+=acum[i-1]
    r=rd.random_sample(M)
    for i in range(M):
        i=0
        k=0
        while(r[k]>acum[i]):
            i+=1
        amostra=i
    return amostra
model=np.array([0.1,0.2,0.1,0.2,0.4])
sample=inversao_discreto(model,10000)

As far as I could understand, you want to implement the Inverse transform sampling for discrete variables, which works like this:
Given a probability distribution `p`
Given a number of samples `M`
Calculate the cumulative probability distribution function `acum` from `p`
Generate `M` uniformly distributed samples and store into `r`
For each sample `r[i]`
Get the index where the cumulative distribution `acum` is larger or equal to the sample `r[i]`
Store this value into `amostra[i]`
If my understanding is correct, then your code is almost there. Your mistake is only in the last for loop. Your variable i is tracking the position of your samples stored in r. Then, k will keep track of where in the accumulator acum you are comparing r to. Here is the pseudo-code:
For each sample `r[i]`
Start with `k = 0`
While `r[i] > acum[k]`
Increment `k`
Now that `r[i] <= acum[k]`, store `k` into `amostra[i]`
Translating it to Python:
for i in range(M):
    k = 0
    while (r[i] > acum[k]):
        k += 1
    amostra[i] = k
So here is the code fixed:
import numpy as np
import numpy.random as rd
def inversao_discreto(p, M):
    amostra = np.zeros(M)
    acum = np.zeros(len(p))
    acum[:] = p
    for i in range(1, len(acum)):
        acum[i] += acum[i - 1]
    r = rd.random_sample(M)
    for i in range(M):
        k = 0
        while (r[i] > acum[k]):
            k += 1
        amostra[i] = k
    return amostra
model = np.array([0.1, 0.2, 0.1, 0.2, 0.4])
sample = inversao_discreto(model, 10000)
I tried to change as little code as possible to make it work, since you are relatively new to Python. Most of the changes I made follow the Style Guide for Python Code (PEP 8), which I recommend you take a look at, since it will make your code easier to read.
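The explicit loops can also be replaced by a cumulative sum and a binary search. The following is a minimal sketch, not part of the original answer; the function name `inverse_transform_sample` and the `seed` parameter are assumptions made for illustration.

```python
import numpy as np


def inverse_transform_sample(p, m, seed=None):
    """Draw m indices distributed according to the probabilities in p."""
    rng = np.random.default_rng(seed)
    acum = np.cumsum(p)   # cumulative distribution
    r = rng.random(m)     # m uniform draws in [0, 1)
    # First index k with acum[k] >= r[i], the same condition the while loop checks.
    k = np.searchsorted(acum, r, side="left")
    # Guard against acum[-1] falling slightly below 1 because of rounding.
    return np.minimum(k, len(acum) - 1)


model = np.array([0.1, 0.2, 0.1, 0.2, 0.4])
sample = inverse_transform_sample(model, 10000, seed=0)
freqs = np.bincount(sample, minlength=len(model)) / len(sample)
print(freqs)  # empirical frequencies, to compare against model
```

With enough draws the printed frequencies should approach the entries of `model`, which is a quick sanity check for any sampler of this kind.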

Related

Optimizing a simple Photon Detection Simulation

I am a medical physics student trying to simulate photon detection - I succeeded (below) but I want to make it better by speeding it up: it currently takes 50 seconds to run and I want it to run in some fraction of that time. I assume someone more knowledgeable in Python could optimize it to complete within less than 10 seconds (without reducing num_photons_detected values). Thank you very much for trying out this little optimization challenge.
from random import seed
from random import random
import random
import matplotlib.pyplot as plt
import numpy as np
rows, cols = (25, 25)
num_photons_detected = [10**3, 10**4, 10**5, 10**6, 10**7]
lesionPercentAboveNoiseLevel = [1, 0.20, 0.10, 0.05]
index_range = np.array([i for i in range(rows)])
for l in range(len(lesionPercentAboveNoiseLevel)):
    pixels = np.array([[0.0 for i in range(cols)] for j in range(rows)])
    for k in range(len(num_photons_detected)):
        random.seed(a=None, version=2)
        photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)])
        counts = 0
        while num_photons_detected[k] > counts:
            for i in photons_random_pixel_choice:
                photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)]) #further ensures random pixel selection
                for j in photons_random_pixel_choice:
                    pixels[i,j] +=1
                    counts +=1
        plt.imshow(pixels, cmap="gray") #in the resulting images/graphs, x is on the vertical and y on the horizontal
        plt.show()
I think that, aside from efficiency issues, a problem with the code is that it does not select the positions of photons truly at random. Instead, it selects row numbers, and then for each selected row, it picks the column numbers where photons will be observed in that row. As a result, if a row number is not selected, there will be no photons in that row at all, and if the same row is selected several times, there will be many photons in it. This is visible in the produced plots, which have a clear pattern of lighter and darker rows.
Assuming that this is unintended and that each pixel should have equal chances of being selected, here is a function generating an array of a given size, with a given number of randomly selected pixels:
import numpy as np
def generate_photons(rows, cols, num_photons):
    rng = np.random.default_rng()
    indices = rng.choice(rows*cols, num_photons)
    np.add.at(pix:=np.zeros(rows*cols), indices, 1)
    return pix.reshape(rows, cols)
You can use it to produce images with specified parameters. E.g.:
import matplotlib.pyplot as plt
pixels = generate_photons(rows=25, cols=25, num_photons=10**4)
plt.imshow(pixels, cmap="gray")
plt.show()
photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)])
It seems like the goal here is:
Use a pre-made sequence of integers, 0 to 24 inclusive, to select one of those values.
Repeat that process 25 times in a list comprehension, to get a Python list of 25 random values in that range.
Make a 1-d Numpy array from those results.
This is very much missing the point of using Numpy. If we want integers in a range, then we can directly ask for those. But more importantly, we should let Numpy do the looping as much as possible when using Numpy data structures. This is where it pays to read the documentation:
size: int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
So, just make it directly with a NumPy generator: photons_random_pixel_choice = np.random.default_rng().integers(rows, size=(rows,)).
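A short sketch of letting NumPy do the looping, assuming NumPy's `Generator` API (`np.random.default_rng`). The names `row_idx` and `col_idx` and the photon count are illustrative and do not come from the question.

```python
import numpy as np

rows, cols = 25, 25
num_photons = 10**4

rng = np.random.default_rng(0)
# One call per coordinate; NumPy draws all values internally.
row_idx = rng.integers(rows, size=num_photons)  # values in [0, rows)
col_idx = rng.integers(cols, size=num_photons)  # values in [0, cols)

pixels = np.zeros((rows, cols))
# np.add.at accumulates correctly when the same pixel is hit more than once.
np.add.at(pixels, (row_idx, col_idx), 1)
```

Plain fancy-index assignment (`pixels[row_idx, col_idx] += 1`) would count a repeated pixel only once, which is why `np.add.at` is used here.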

Gap Statistic Method

import sys
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import SVC
filename = sys.argv[1]
datafile = sio.loadmat(filename)
data = datafile['bow']
sizedata=[len(data), len(data[0])]
gap=[]
SD=[]
for knum in xrange(10,20):
    print knum
    #Clustering original Data
    kmeanspp = KMeans(n_clusters=knum,init = 'k-means++',max_iter = 100,n_jobs = 1)
    kmeanspp.fit(data)
    dispersion = kmeanspp.inertia_
    #Clustering Reference Data
    nrefs = 10
    refDisp = np.zeros(nrefs)
    for nref in xrange(nrefs):
        refdata = np.random.random_sample((sizedata[0],sizedata[1]))
        refkmeans = KMeans(n_clusters=knum,init='k-means++',max_iter=100,n_jobs=1)
        refkmeans.fit(refdata)
        refdisp = refkmeans.inertia_
        refDisp[nref]=np.log(refdisp)
    mean_log_refdisp = np.mean(refDisp)
    gap.append(mean_log_refdisp-np.log(dispersion))
    #Calculating standard deviation
    sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
    SD.append(sd)
SD = [sd*((1+(1/nrefs))**0.5) for sd in SD]
#determining optimal k
opt_k = None
diff = []
for i in xrange(len(gap)-1):
    diff = (SD[i+1]-(gap[i+1]-gap[i]))
    if diff>0:
        opt_k = i+10
        break
print diff
plt.plot(np.linspace(10,19,10,True),gap)
plt.show()
Here I am trying to implement the Gap Statistic method for determining the optimal number of clusters. But the problem is that every time I run the code I get a different value for k.
What is the solution to the problem?
How can the value of optimal k differ for the same data?
I have stored the data in a .mat file beforehand and I am passing it as an argument via terminal
I am looking for the smallest value of k for which Gap(k)>= Gap(k+1)-s(k+1) where s(k+1) = sd(k+1)*square_root(1+(1/B)) where sd is the standard deviation of the reference distribution and B is the number of copies of Monte Carlo sample
Otherwise stated, I am searching for the value of k for which
s(k+1)-Gap(k+1)+Gap(k)>=0
Couple of problems with your simulation:
1- sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
Why did you multiply the second argument of zip by nrefs? That is not needed according to the original paper.
2-
if diff>0:
    opt_k = i+10
    break
Instead of diff>0 you want diff>=0, since equality can happen.
As for why you get a different number of clusters each time: as others have said, this is a Monte Carlo simulation, so some randomness is expected, and the result also depends on what you are clustering and on your dataset. I suggest you test your algorithm against the Silhouette and Elbow methods to get a better idea of the number of clusters.
One option is to run your function several times, average the gap statistics and the s values, and then find the smallest k where the average s(k+1)-Gap(k+1)+Gap(k) is greater than or equal to 0.
This will take longer but give a more reliable result.
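A minimal sketch of that averaging step, assuming the gap and s values of each run have already been collected into arrays with one row per run. The function name `smallest_k` and the numbers below are made up for illustration; they are not results from the question's data.

```python
import numpy as np


def smallest_k(gaps, s, k_values):
    """Return the smallest k with s(k+1) - Gap(k+1) + Gap(k) >= 0, or None.

    gaps, s: arrays of shape (n_runs, len(k_values)), one row per run.
    """
    gap_mean = np.mean(gaps, axis=0)
    s_mean = np.mean(s, axis=0)
    # Evaluate the criterion for every pair (k, k+1) at once.
    ok = s_mean[1:] - gap_mean[1:] + gap_mean[:-1] >= 0
    hits = np.flatnonzero(ok)
    return k_values[hits[0]] if hits.size else None


# Hypothetical values for three runs over k = 10, 11, 12, 13.
k_values = [10, 11, 12, 13]
gaps = np.array([[0.10, 0.30, 0.32, 0.31],
                 [0.12, 0.28, 0.30, 0.33],
                 [0.11, 0.32, 0.31, 0.30]])
s = np.full_like(gaps, 0.05)
best_k = smallest_k(gaps, s, k_values)
print(best_k)
```

Averaging before applying the criterion smooths out run-to-run noise in the reference dispersions, at the cost of running the clustering several times.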

1D Random Walk from Matlab to Python

I have a Matlab code that generates a 1D random walk.
%% probability to move up or down
prob = [0.05, 0.95];
start = 2; %% start with 2
positions(1) = start;
for i=2:1000
    rr = rand(1);
    down = rr<prob(1) & positions(i-1)>1;
    up = rr>prob(2) & positions(i-1)<4;
    positions(i) = positions(i-1)-down + up;
end
figure(1), clf
plot(positions)
This gives me the plot below (figure: 1D Random Walk with Matlab).
I need to translate this into Python, and I have come up with this (using numpy):
import random
import numpy as np
import matplotlib.pyplot as plt
prob = [0.05, 0.95] ##probability to move up or down
N = 100 ##length of walk
def randomWalk(N):
    positions=np.zeros(N)
    start = 2 ##Start at 2
    positions[0] = start
    for i in range(1,100):
        rr = random.randint(0,1)
        if rr<prob[0] and positions[i-1]>1:
            start -= 1
        elif rr>prob[1] and positions[i-1]<4:
            start += 1
        positions[i] = start
    return positions
plt.plot(randomWalk(N))
plt.show()
It looks fairly close to what I want (see figure below: 1D Random Walk with Python).
But I wonder if they are really equivalent, because they do seem different: The Python code seems spikier than the Matlab one.
What is missing in my Python code to achieve the perfect stepwise increase/decrease (similar to the Matlab code)? Maybe it needs an "else" that tells it to stay the same unless the two conditions are met. How do I implement that?
You are doing a bunch of things differently.
For one, you are using rand in MATLAB, which returns a random float between 0 and 1. In Python, you are using randint, which returns a random integer. random.randint(0, 1) returns either 0 or 1, with both endpoints included, so rr is never strictly between your two thresholds and the walk tries to move on every step. You want random.random(), which returns a random float in the range [0, 1).
Next, in MATLAB you compute down and up independently, whereas in Python you use if/elif, so only one of the two can apply. For your particular probabilities these give the same result, but the logic is not the same. You can use almost the same syntax as MATLAB in Python in this case.
Finally, you are calculating a lot more samples for MATLAB than Python (about a factor of 10 more).
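A quick way to see the difference between the two calls, assuming the standard library `random` module. The names `ints` and `floats` are only for illustration.

```python
import random

random.seed(0)
# randint includes both endpoints, so only the integers 0 and 1 can appear.
ints = {random.randint(0, 1) for _ in range(1000)}
# random() returns floats in the half-open interval [0.0, 1.0).
floats = [random.random() for _ in range(1000)]

print(ints)
print(min(floats), max(floats))
```

Since the thresholds are 0.05 and 0.95, an integer draw of 0 or 1 always falls outside them, while a float draw stays between them about 90% of the time.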
Here is a direct port of your MATLAB code to Python. The result for me is pretty much the same as your MATLAB example (with different random numbers, of course):
import random
import matplotlib.pyplot as plt
prob = [0.05, 0.95] # Probability to move up or down
start = 2 #Start at 2
positions = [start]
for _ in range(1, 1000):
    rr = random.random()
    down = rr < prob[0] and positions[-1] > 1
    up = rr > prob[1] and positions[-1] < 4
    positions.append(positions[-1] - down + up)
plt.plot(positions)
plt.show()
If speed is an issue you can probably speed this up by using np.random.random(1000) to generate the random numbers up-front, and do the probability comparisons up-front as well in a vectorized manner.
So something like this:
import random
import numpy as np
import matplotlib.pyplot as plt
prob = [0.05, 0.95] # Probability to move up or down
start = 2 #Start at 2
positions = [start]
rr = np.random.random(1000)
downp = rr < prob[0]
upp = rr > prob[1]
for idownp, iupp in zip(downp, upp):
    down = idownp and positions[-1] > 1
    up = iupp and positions[-1] < 4
    positions.append(positions[-1] - down + up)
plt.plot(positions)
plt.show()
Edit: To explain a bit more about the second example, basically what I am doing is pre-computing whether the probability is below the first threshold or above the second for every step ahead of time. This is much faster than computing a random sample and doing the comparison at each step of the loop. Then I am using zip to combine those two random sequences into one sequence where each element is the pair of corresponding elements from the two sequences. This is assuming python 3, if you are using python 2 you should use itertools.izip instead of zip.
So it is roughly equivalent to this:
import random
import numpy as np
import matplotlib.pyplot as plt
prob = [0.05, 0.95] # Probability to move up or down
start = 2 #Start at 2
positions = [start]
rr = np.random.random(1000)
downp = rr < prob[0]
upp = rr > prob[1]
for i in range(len(rr)):
    idownp = downp[i]
    iupp = upp[i]
    down = idownp and positions[-1] > 1
    up = iupp and positions[-1] < 4
    positions.append(positions[-1] - down + up)
plt.plot(positions)
plt.show()
In python, it is generally preferred to iterate over values, rather than indexes. There is pretty much never a situation where you need to iterate over an index. If you find yourself doing something like for i in range(len(foo)):, or something equivalent to that, you are almost certainly doing something wrong. You should either iterate over foo directly, or if you need the index for something else you can use something like for i, ifoo in enumerate(foo):, which gets you both the elements of foo and their indexes.
Iterating over indexes is common in MATLAB because of various limitations in the MATLAB language. It is technically possible to do something similar to what I did in that Python example in MATLAB, but in MATLAB it requires a lot of boilerplate to be safe and will be extremely slow in most cases. In Python, however, it is the fastest and cleanest approach.
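To make the three iteration styles concrete, here is a small sketch on made-up lists; `positions` and `steps` are illustrative values, not output of the random walk.

```python
positions = [2, 2, 3, 3, 4]
steps = [0, 1, 0, 1]

# Iterate over the values directly when the index is not needed.
total = 0
for p in positions:
    total += p

# enumerate yields the index and the value together.
first_at_four = None
for i, p in enumerate(positions):
    if p == 4:
        first_at_four = i
        break

# zip walks several sequences in lockstep and stops at the shortest one.
moved = [p + s for p, s in zip(positions, steps)]
```

None of the three loops needs `range(len(...))`, and each one states directly which values it works on.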

Pearson correlation on big numpy matrices

I have a 24000 * 316 numpy matrix, each row represents a time series with 316 time points, and I am computing pearson correlation between each pair of these time series. Meaning as a result I would have a 24000 * 24000 numpy matrix having pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though still slow). I am wondering if it is expected to be this slow (takes more than a day!!!). And what I might be able to do about it...
If it helps this is my code... nothing special or hard..
def SimMat(mat,name):
    mrange = mat.shape[0]
    print "mrange:", mrange
    nTRs = mat.shape[1]
    print "nTRs:", nTRs
    SimM = numpy.zeros((mrange,mrange))
    for i in range(mrange):
        SimM[i][i] = 1
    for i in range (mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if(pearV[1] <= 0.05):
                if(pearV[0] >= 0.5):
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
            else:
                SimM[i][j] = SimM[j][i] = 0
    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks
The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results, since it works nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr
# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()
# Compute the Pearson correlations one row at a time
h5 = h5py.File("data.h5",'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i],A[j])[0] for j in range(N)]
h5.close()
Testing the first 100 rows suggests this will only take 8 hours on a single core. If you parallelized it, it should have linear speedup with the number of cores.
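If only the correlation coefficients are needed (not the p-values that `pearsonr` also returns), the inner Python loop can be replaced by a matrix product on standardized rows. This is a sketch under that assumption, not the answer's method; the function name `pearson_blocks` and the block size are hypothetical.

```python
import numpy as np


def pearson_blocks(a, block=1000):
    """Yield (start, stop, corr), where corr holds rows start:stop of the
    full correlation matrix between the rows of a."""
    z = a - a.mean(axis=1, keepdims=True)          # center each row
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # scale rows to unit length
    for start in range(0, z.shape[0], block):
        stop = min(start + block, z.shape[0])
        # Dot products of unit-length centered rows are Pearson coefficients.
        yield start, stop, z[start:stop] @ z.T


rng = np.random.default_rng(0)
small = rng.random((50, 316))
full = np.vstack([c for _, _, c in pearson_blocks(small, block=16)])
```

Each yielded block could be written to the HDF5 dataset with `out[start:stop] = corr`, so the full matrix never has to sit in memory at once.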

Discretization of probability array in Python

I have a numpy array (actually imported from a GIS raster map) which contains probability values of occurrence of a species, like the following example:
a = random.randint(1.0,20.0,1200).reshape(40,30)
b = (a*1.0)/sum(a)
Now I want to get a discrete version of that array again. For example, if I have 100 individuals located in the area of that array (1200 cells), how are they distributed? Of course they should be distributed according to their probability, meaning lower values indicate a lower probability of occurrence. However, as everything is statistics, there is still a chance that an individual is located in a low-probability cell. It should be possible for multiple individuals to occupy one cell.
It is like transforming a continuous distribution curve into a histogram again. Just as many different histograms may result in a certain distribution curve, it should also work the other way round. Accordingly, applying the algorithm I am looking for will produce different discrete values each time.
Is there any algorithm in Python which can do that? As I am not that familiar with discretization, maybe someone can help.
Use np.random.choice with np.bincount:
np.bincount(np.random.choice(b.size, 100, p=b.flat),
            minlength=b.size).reshape(b.shape)
If you don't have NumPy 1.7, you can replace random.choice with:
np.searchsorted(np.cumsum(b), np.random.random(100))
giving:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(100)),
            minlength=b.size).reshape(b.shape)
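The same counts can also be drawn in one step from a multinomial distribution. This is a sketch assuming NumPy's `Generator` API and a probability array normalized over all cells; the seed and the way `a` is generated are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(1, 20, size=(40, 30))
b = a / a.sum()  # probabilities that sum to 1 over all 1200 cells

# Number of individuals per cell for 100 individuals in total;
# a cell can receive more than one individual.
counts = rng.multinomial(100, b.ravel()).reshape(b.shape)
```

Each call produces a different allocation, but the counts always add up to the requested number of individuals.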
So far I think ecatmur's answer seems quite reasonable and simple.
I just want to add a maybe more "applied" example. Consider a die with 6 faces (6 numbers). Each number/result has a probability of 1/6. Representing the die as an array could look like this:
b = np.array([[1,1,1],[1,1,1]])/6.0
Thus rolling the die 100 times (n=100) results in the following simulation:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(n)),minlength=b.size).reshape(b.shape)
I think that can be an appropriate approach for such an application.
Thank you, ecatmur, for your help!
/Johannes
This is similar to a question I had earlier this month.
import random
def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector
from numpy.random import multinomial
import math
def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
"""
Inputs:
ListSize = the size of the list to return
ListSumValue = The sum of list values
Distribution = can be 'uniform' for uniform distribution, 'normal' for a normal distribution ~ N(0,1) with +/- 5 sigma (default), or a list of size 'ListSize' or 'ListSize - 1' for an empirical (arbitrary) distribution. Probabilities of each of the p different outcomes. These should sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).
Output:
A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
"""
if type(Distribution) == list:
DistributionSize = len(Distribution)
if ListSize == DistributionSize or (ListSize-1) == DistributionSize:
Values = multinomial(ListSumValue,Distribution,size=1)
OutputValue = Values[0]
elif Distribution.lower() == 'uniform': #I do not recommend this!!!! I see that it is not as random (at least on my computer) as I had hoped
UniformDistro = [1/ListSize for i in range(ListSize)]
Values = multinomial(ListSumValue,UniformDistro,size=1)
OutputValue = Values[0]
elif Distribution.lower() == 'normal':
"""
Normal Distribution Construction....It's very flexible and hideous
Assume a +-3 sigma range. Warning, this may or may not be a suitable range for your implementation!
If one wishes to explore a different range, then changes the LowSigma and HighSigma values
"""
LowSigma = -3#-3 sigma
HighSigma = 3#+3 sigma
StepSize = 1/(float(ListSize) - 1)
ZValues = [(LowSigma * (1-i*StepSize) +(i*StepSize)*HighSigma) for i in range(int(ListSize))]
#Construction parameters for N(Mean,Variance) - Default is N(0,1)
Mean = 0
Var = 1
#NormalDistro= [self.NormalDistributionFunction(Mean, Var, x) for x in ZValues]
NormalDistro= list()
for i in range(len(ZValues)):
if i==0:
ERFCVAL = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
NormalDistro.append(ERFCVAL)
elif i == len(ZValues) - 1:
ERFCVAL = NormalDistro[0]
NormalDistro.append(ERFCVAL)
else:
ERFCVAL1 = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
ERFCVAL2 = 0.5 * math.erfc(-ZValues[i-1]/math.sqrt(2))
ERFCVAL = ERFCVAL1 - ERFCVAL2
NormalDistro.append(ERFCVAL)
#print "Normal Distribution sum = %f"%sum(NormalDistro)
Values = multinomial(ListSumValue,NormalDistro,size=1)
OutputValue = Values[0]
else:
raise ValueError ('Cannot create desired vector')
return OutputValue
else:
raise ValueError ('Cannot create desired vector')
return OutputValue
ProbabilityDistribution = RandFloats(1200)#This is your probability distribution for your 1200 cell array
SizeDistribution = RandIntVec(1200,100,Distribution=ProbabilityDistribution)#for a 1200 cell array, whose sum is 100 with given probability distribution
The two important lines are the last two lines in the code above.
