Generating weighted intervals in Python

I know how to produce a weighted integer with random.choice.
Now I have 5000 integers from 0 to 1000, and I want, say, 75% to land in the interval 0-500, 20% in 501-750 and 5% in 751-1000. What I tried (and failed with) is
x = random.choice([np.arange(501), np.arange(501,751), np.arange(751, 1001)], size=5000, p=[0.75, 0.2, 0.05])
But then I only get randomly arranged intervals: the choice picks whole arange arrays, not individual integers. Any help would be appreciated.

How about something like this:
import numpy as np
x = np.random.choice(list(range(1001)), size=5000,
                     p=[.75/501]*501 + [.2/250]*250 + [.05/250]*250)
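Continuing from that snippet (my addition, not part of the original answer), a quick sanity check of the empirical proportions:
# Fraction of samples in each band; expect roughly 0.75, 0.20 and 0.05.
print(np.mean(x <= 500), np.mean((x > 500) & (x <= 750)), np.mean(x > 750))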

another version would be:
import numpy as np
from scipy import stats
N = 5000
probs = [0.75, 0.2, 0.05]
breaks = [0, 501, 751, 1001]
# figure out how big each group should be
sizes = stats.multinomial.rvs(N, probs)
# get values for each group
x = np.concatenate([
    stats.randint.rvs(l, h, size=n)
    for n, l, h in zip(sizes, breaks, breaks[1:])])
# mix everything up
np.random.shuffle(x)
Some differences from rocket's solution:
fewer/smaller temporary variables and more opportunity for vectorisation
allows probabilities to take irrational values
the runtime doesn't depend on the number of possible values generated

Related

Pandas: Create a binary column randomly but with specific proportions

I am trying to create a new random binary column in my table and it needs to have 60% of values as 1 and 40% of values as 0. I have tried to use the np.random.choice function from the numpy package like the following; however, the proportion changes every time I run my code.
np.random.choice(a = [0,1], size = len(df), p = [0.4, 0.6])
I need to have these proportions fixed. Can anyone help how it can be done? Thank you!
This is how you create a NumPy array of size 100 with the desired distribution of 1s and 0s and store it in the variable m:
import numpy as np
m = np.random.choice(a = [0,1], size = 100, p = [0.4, 0.6])
I don't know anything about your pandas data frame because you didn't post your source code here, so I can't tell you why len(df) is different each time.
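If the proportions must come out exactly (this goes beyond the answer above; it is a common alternative approach, not something from the original post), you can build an array with exactly the right counts and shuffle it:
import numpy as np

n = 100  # replace with len(df) when filling a DataFrame column
n_ones = int(round(0.6 * n))

# Exactly 60% ones and 40% zeros, in random order.
m = np.zeros(n, dtype=int)
m[:n_ones] = 1
np.random.shuffle(m)
print(m.mean())  # always 0.6 for n = 100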

Optimizing a simple Photon Detection Simulation

I am a medical physics student trying to simulate photon detection. I succeeded (below), but I want to make it better by speeding it up: it currently takes 50 seconds to run and I want it to run in some fraction of that time. I assume someone more knowledgeable in Python could optimize it to complete in less than 10 seconds (without reducing the num_photons_detected values). Thank you very much for trying out this little optimization challenge.
from random import seed
from random import random
import random
import matplotlib.pyplot as plt
import numpy as np

rows, cols = (25, 25)
num_photons_detected = [10**3, 10**4, 10**5, 10**6, 10**7]
lesionPercentAboveNoiseLevel = [1, 0.20, 0.10, 0.05]
index_range = np.array([i for i in range(rows)])

for l in range(len(lesionPercentAboveNoiseLevel)):
    pixels = np.array([[0.0 for i in range(cols)] for j in range(rows)])
    for k in range(len(num_photons_detected)):
        random.seed(a=None, version=2)
        photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)])
        counts = 0
        while num_photons_detected[k] > counts:
            for i in photons_random_pixel_choice:
                photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)])  # further ensures random pixel selection
                for j in photons_random_pixel_choice:
                    pixels[i, j] += 1
                    counts += 1
        plt.imshow(pixels, cmap="gray")  # in the resulting images/graphs, x is on the vertical and y on the horizontal
        plt.show()
I think that, aside from efficiency issues, a problem with the code is that it does not select the positions of photons truly at random. Instead, it selects row numbers, and then for each selected row it picks column numbers where photons will be observed in that row. As a result, if a row number is not selected, there will be no photons in that row at all, and if the same row is selected several times, there will be many photons in it. This is visible in the produced plots, which have a clear pattern of lighter and darker rows.
Assuming that this is unintended and that each pixel should have equal chances of being selected, here is a function generating an array of a given size, with a given number of randomly selected pixels:
import numpy as np

def generate_photons(rows, cols, num_photons):
    rng = np.random.default_rng()
    indices = rng.choice(rows*cols, num_photons)
    np.add.at(pix := np.zeros(rows*cols), indices, 1)
    return pix.reshape(rows, cols)
You can use it to produce images with specified parameters. E.g.:
import matplotlib.pyplot as plt
pixels = generate_photons(rows=25, cols=25, num_photons=10**4)
plt.imshow(pixels, cmap="gray")
plt.show()
gives an image in which the photons are spread evenly over the grid, with no row striping (figure omitted).
photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)])
It seems like the goal here is:
Use a pre-made sequence of integers, 0 to 24 inclusive, to select one of those values.
Repeat that process 25 times in a list comprehension, to get a Python list of 25 random values in that range.
Make a 1-d Numpy array from those results.
This is very much missing the point of using Numpy. If we want integers in a range, then we can directly ask for those. But more importantly, we should let Numpy do the looping as much as possible when using Numpy data structures. This is where it pays to read the documentation:
size: int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
So, with a NumPy generator (e.g. rng = np.random.default_rng()), just make it directly: photons_random_pixel_choice = rng.integers(rows, size=(rows,)).
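Building on that idea (a sketch of mine, not code from either answer): the row and the column of every photon can each be drawn in a single vectorized call, and the per-pixel counts accumulated with np.add.at:
import numpy as np

rng = np.random.default_rng()
rows, cols, num_photons = 25, 25, 10**4

# Draw a row index and a column index for every photon at once.
row_idx = rng.integers(rows, size=num_photons)
col_idx = rng.integers(cols, size=num_photons)

# Accumulate photon counts per pixel.
pixels = np.zeros((rows, cols))
np.add.at(pixels, (row_idx, col_idx), 1)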

How to generate random numbers in one of two intervals in Python

I want to generate random numbers from two different ranges, [0, 0.3) and [0.7, 1), in Python.
numpy.random.uniform only has the option of generating from one particular interval.
I assume you want to choose an interval with probability weighted by its size, then sample uniformly from the chosen interval. In that case, the following Python code will do this:
import random

# Define the intervals. They should be disjoint.
intervals = [[0, 0.05], [0.7, 1]]

# Choose one number uniformly inside the set
random.uniform(*random.choices(intervals,
    weights=[r[1] - r[0] for r in intervals])[0])

import numpy

# Generate a NumPy array of given size
size = 1000
numpy.asarray([
    random.uniform(*random.choices(intervals,
        weights=[r[1] - r[0] for r in intervals])[0])
    for i in range(size)])
Note that the intervals you give, [[0, 0.3], [0.7, 1]], appear to be arbitrary; this solution works for any number of disjoint intervals, and it samples uniformly at random from the union of those intervals.
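A vectorized NumPy variant of the same idea (my sketch, assuming the intervals are given as disjoint (low, high) pairs): pick an interval for each sample with probability proportional to its width, then sample uniformly inside it:
import numpy as np

rng = np.random.default_rng()
intervals = np.array([[0.0, 0.3], [0.7, 1.0]])  # disjoint (low, high) pairs
size = 1000

widths = intervals[:, 1] - intervals[:, 0]
# Pick an interval for each sample, weighted by its width.
idx = rng.choice(len(intervals), size=size, p=widths / widths.sum())
# Sample uniformly inside the chosen interval.
samples = intervals[idx, 0] + rng.random(size) * widths[idx]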
How about this one?
import numpy as np

first_interval = np.array([0, 0.3])
second_interval = np.array([0.7, 1])
total_length = np.ptp(first_interval) + np.ptp(second_interval)
n = 100
numbers = np.random.random(n) * total_length
numbers += first_interval.min()
numbers[numbers > first_interval.max()] += second_interval.min() - first_interval.max()
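Continuing from that snippet (my addition, not part of the original answer), a quick check that every value lands in one of the two intervals:
in_first = (numbers >= 0) & (numbers <= 0.3)
in_second = (numbers >= 0.7) & (numbers <= 1.0)
assert np.all(in_first | in_second)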
Refer to this thread; hope it solves your query.
The answer is as follows:
import random

def random_of_ranges(*ranges):
    # Flatten the ranges into one pool of candidate values.
    all_ranges = sum((list(r) for r in ranges), [])
    return random.choice(all_ranges)

print(random_of_ranges(range(65, 90), range(97, 122)))
You can just concatenate random numbers from those two intervals.
import numpy as np

rng = np.random.default_rng(12345)
a = rng.uniform(0, 0.3, 1000)
b = rng.uniform(0.7, 1, 1000)
my_rnd = np.concatenate([a, b])
These look fairly uniform across the two intervals.
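One caveat (my addition, not part of the answer): concatenation leaves all samples from the first interval ahead of those from the second, so shuffle the combined array if the order matters. Also note that drawing equal counts from each interval only gives a uniform spread over the union here because both intervals have the same width (0.3); for unequal widths the counts would need to be proportional to the widths.
# Continues from the snippet above: mix the two halves together.
rng.shuffle(my_rnd)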

Limiting a sequence of ratios to a range whilst maintaining overall increase/decrease of values they are multiplying

Sorry my maths isn't fantastic so you'll have to bear with me.
Let's say I have a ratio limit of 3.
I have a numpy array of sizes that are to be multiplied by the ratios and a numpy array of the ratios, some of which are within the limit, some of which aren't.
I need the ratios that are above the limit to be set to the limit, and the ratios that are below the limit to be increased to account for the reduction of the ratios that were over the limit. The result would be that the sum of the sizes multiplied by the ratios is still the same, but the individual sizes haven't been altered by more than the limit.
In [1]: import numpy as np
In [2]: sizes = np.array([2.0,4.0,6.0,8.0,10.0])
In [3]: ratios = np.array([0.5, 0.5, 5.0, 4.0, 0.5])
In [4]: print np.sum(sizes * ratios)
70.0
#result after limiting ratios would still be 70
Edit:
So in the example above the resulting ratios would be:
np.array([1.75, 1.75, 3.0, 3.0, 1.75])
In [4]: print np.sum(sizes * ratios)
70.0
The ratios that were previously above the limit have been reduced and the ratios that were below have been raised to compensate.
I think you are looking for something like this:
import numpy as np

def Spread_Ratios(ratios, sizes):
    if np.dot(ratios, sizes) / np.sum(sizes) > 3.:
        print('There is no solution!\n')
        return None
    if np.any(ratios > 3.):
        score = np.dot(sizes, ratios)
        ratios_reduced = np.where(ratios > 3., 3., ratios)
        score_reduced = np.dot(sizes, ratios_reduced)
        delta_ratios = (score - score_reduced) / np.sum(sizes[ratios < 3.])
        new_ratios = ratios_reduced + np.where(ratios < 3., delta_ratios, 0.)
        return Spread_Ratios(new_ratios, sizes)
    else:
        return ratios, sizes
The recursive definition is necessary since it is possible that a weight below 3 (but close) is lifted above 3.
Furthermore it is possible that there exists no solution at all. This case is handled with the first if condition.
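Applied to the numbers from the question (my usage sketch, reusing the function above), the capped ratios match the OP's edit and the weighted sum is preserved:
sizes = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
ratios = np.array([0.5, 0.5, 5.0, 4.0, 0.5])

new_ratios, _ = Spread_Ratios(ratios, sizes)
print(new_ratios)                  # [1.75 1.75 3.   3.   1.75]
print(np.sum(sizes * new_ratios))  # 70.0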

NumPy or SciPy to calculate weighted median

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.
For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.
So far I've
imported the csv showing the weights as an array, masking values of 0, and
created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.
I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.
I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!
Update: here's some code for what I've done so far:
#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter=",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

# Mask values == 0
maTest = np.ma.masked_equal(nArray, 0)

# Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]

massArr = []  # collect one mass column per weight column
for i in range(rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)

nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
What we can do, if I understood your problem correctly, is sum up the observations; dividing that total by 2 gives us the observation number corresponding to the median. From there we need to figure out which observation this number falls on.
One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.
Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Now looking at the cumsum result, we can easily see that observation 5 must lie between the second and third cumulative sums (3 and 6), i.e. within the third element.
So all we need to do is figure out the index of where the median (5) would fit.
np.searchsorted does exactly what we need. It will find the index at which to insert an element into an array so that it stays sorted.
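For instance (my illustration): inserting the half-total 5 into the cumulative sums of the example above keeps them sorted only at index 2, which is exactly the group that contains the median:
import numpy as np

np.searchsorted([1, 3, 6, 10], 5)  # -> 2, i.e. the third group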
The code to do it looks like so:
import numpy as np

# my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10],
                       [10, 20, 30, 40], [100, 10, 10, 10], [1, 1, 1, 100]])

c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices]  # correct if the masses are indeed 0, 10, 20, ...

# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis]/2.0)))
Output will be:
median masses is: [10 20 20 0 30]
[[ 30 191 9 0] <- The test data
[ 10 20 300 10]
[ 10 20 30 40]
[100 10 10 10]
[ 1 1 1 100]]
[[ 30. 221. 230. 230. 115. ] <- cumsum results with median added to the end.
[ 10. 30. 330. 340. 170. ] you can see from this where they fit in.
[ 10. 30. 60. 100. 50. ]
[ 100. 110. 120. 130. 65. ]
[ 1. 2. 3. 103. 51.5]]
wquantiles is a small python package that will do exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).
import numpy as np

def weighted_median(values, weights):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]
Using argsort lets us maintain the alignment between the two arrays without changing or copying their content. It should be straightforward to extend it to an arbitrary number of arbitrary quantiles.
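For example (my usage sketch, reusing the function above), the data from the question gives the expected result:
masses = np.array([0, 10, 20, 30])
weights = np.array([30, 191, 9, 0])
print(weighted_median(masses, weights))  # 10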
Update
Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:
def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]
This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type is equivalent to what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values, the results track the requested quantiles quite closely:
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424, 0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])
Update 2
In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be fairly easily added by calculating the midpoint between two numbers in the case where the weight mass is equally (or quantile/1-quantile) divided between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).
def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    q = np.searchsorted(c, quantiles * c[-1])
    return np.where(c[q]/c[-1] == quantiles,
                    0.5 * (values[i[q]] + values[i[q+1]]),
                    values[i[q]])
This function will fail with arrays smaller than 2 (the original would handle non-empty arrays).
>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)
Note that this extension is fairly unlikely to be needed when working with actual data sets, where we typically have (a) large data sets, and (b) real-valued weights that make the odds of ending up exactly at a quantile edge very long, and probably due to rounding errors when it does happen. Including it for completeness nonetheless.
I ended up writing this function based on @muzzle's and @maesers' replies:
def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()
    if interpolate:
        Pn = (Sn - sorted_weights / 2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
The difference between interpolate True and False is as follows:
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5
(there is no difference for odd-length arrays such as [1, 2, 3, 4, 5])
Speed tests show it is just as performant as @maesers' function in the uninterpolated case, and twice as performant in the interpolated case.
Sharing some code that I got a hand with. This allows you to run stats on each column of an Excel spreadsheet.
import xlrd
import sys
import csv
import numpy as np
import itertools
from itertools import chain

book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)  # first column has mass
ages = sh.row_values(0, start_colx=1)    # first row has age ranges

# collect the count column for each age range
count = 1
age = []
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in ages:
    # create a tuple with the mass vector
    age_mass = zip(masses, age[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)
    # separate into one big list
    medianlist = [x for t in expanded for x in t]
    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)
    median = np.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)
    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)
    stats.append(stats1)

np.savetxt(ofile, stats, fmt="%d")
