I have the following problem: starting from a base triple (0, 1, 0), I want to create a list of modified triples within a given range.
The constraints:
triple[0] and triple[2] should each have a maximum of r, such as r = 0.2
sum(triple) = 1
triple[0] need not equal triple[2]; both are increased by a given stepwise parameter s, such as s = 0.02
In the above example, the method should create
lst = [(0.0, 1, 0.0), (0.02, 0.98, 0.0), (0.04, 0.96, 0.0), (0.04, 0.94, 0.02), (0.06, 0.94, 0.0), (0.06, 0.92, 0.02), (0.06, 0.9, 0.04), ...]
Is there any pretty way to do this?
Maybe you have an idea to create these list without nested loops (probably with numpy?).
Thanks a lot!
Here's a list comprehension that should produce all 3-tuples that meet your constraints (as I understand them). It's a bit clunkier than I'd like, because the range function only accepts integers:
import math

r = 0.2
s = 0.02
steps = int(math.ceil(r / s))
lst = [(a*s, 1 - (a+b)*s, b*s) for b in range(steps) for a in range(steps)]
Results:
>>> lst[0:4]
[(0.0, 1.0, 0.0), (0.02, 0.98, 0.0), (0.04, 0.96, 0.0), (0.06, 0.94, 0.0)]
>>> lst[90:94]
[(0.0, 0.8200000000000001, 0.18), (0.02, 0.8, 0.18), (0.04, 0.78, 0.18), (0.06, 0.76, 0.18)]
The first and last elements only go up to 0.18 in this code, and I'm not sure whether that's desirable (is the constraint < r or <= r?). It shouldn't be too hard to tweak if you want it the other way.
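For instance, if the constraint is meant to be <= r, one small tweak (my assumption, not part of the original) is to extend the range by one step:

steps = int(math.ceil(r / s)) + 1  # include the endpoint r itself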
You could write a function that builds one triple as you describe, something such as:
import random

def make_triple(r=0.2, s=0.02):
    element_one = round(random.uniform(0, r), 2)
    max_s = int(r / s)  # number of steps; randint needs integer bounds
    element_three = random.randint(0, max_s) * s
    element_two = round(1 - element_one - element_three, 2)
    return (element_one, element_two, element_three)
And then just create a single loop that calls this function:
list_of_triples = []
for i in range(5):
    list_of_triples.append(make_triple(0.2, 0.02))
And there you go! No nested loops necessary.
Another numpy answer just for kicks:
import numpy as np

r = .2
s = .02
# mgrid builds two 2-D coordinate arrays covering every (a, b) combination
a, b = np.mgrid[0:r:s, 0:r:s]
lst = np.dstack([a, 1 - (a + b), b]).reshape(-1, 3)
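For reference, the first few rows come out like this (a sketch of the expected output; exact float formatting may differ):

print(lst[:3])
# [[0.   1.   0.  ]
#  [0.   0.98 0.02]
#  [0.   0.96 0.04]]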
Here's a NumPy solution without for-loops, as requested. It uses a 3D array and NumPy broadcasting rules to assign the scale by row and by column. scale is a 2D array with a single column, so it can be conveniently transposed with .T. At the end, the 3D array is reshaped to 2D.
import numpy as np

r = .2
s = .02
scale = np.arange(r, step=s, dtype=float).reshape(-1, 1)
a = np.empty((len(scale), len(scale), 3), dtype=float)
a[:, :, 0] = scale        # broadcast down the rows
a[:, :, 2] = scale.T      # broadcast across the columns
a[:, :, 1] = 1 - a[:, :, 0] - a[:, :, 2]
print(a.reshape(-1, 3))
I have several lists that can only contain the following values: 0, 0.5, 1, 1.5
I want to efficiently convert each of these lists into probability mass functions. So if a list is as follows: [0.5, 0.5, 1, 1.5], the PMF will look like this: [0, 0.5, 0.25, 0.25].
I need to do this many times (and with very large lists), so avoiding looping will be optimal, if at all possible. What's the most efficient way to make this happen?
Edit: Here's my current approach. It feels like a really inefficient and inelegant way to do it:
import numpy as np
from collections import Counter

def get_distribution(samplemodes1):
    n, bin_edges = np.histogram(samplemodes1, bins=9)
    totalcount = np.sum(n)
    bin_probability = n / totalcount
    bins_per_point = np.fmin(np.digitize(samplemodes1, bin_edges), len(bin_edges) - 1)
    probability_perpoint = [bin_probability[bins_per_point[i] - 1] for i in range(len(samplemodes1))]

    counts = Counter(samplemodes1)
    total = sum(counts.values())
    probability_mass = {k: v / total for k, v in counts.items()}
    #print(probability_mass)

    key_values = {}
    if 0 in probability_mass:
        key_values[0] = probability_mass.get(0)
    else:
        key_values[0] = 0
    if 0.5 in probability_mass:
        key_values[0.5] = probability_mass.get(0.5)
    else:
        key_values[0.5] = 0
    if 1 in probability_mass:
        key_values[1] = probability_mass.get(1)
    else:
        key_values[1] = 0
    if 1.5 in probability_mass:
        key_values[1.5] = probability_mass.get(1.5)
    else:
        key_values[1.5] = 0

    distribution = list(key_values.values())
    return distribution
Here are some solutions for you to benchmark:
Using collections.Counter
from collections import Counter

bins = [0, 0.5, 1, 1.5]
a = [0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5]
denom = len(a)
counts = Counter(a)
pmf = [counts[b] / denom for b in bins]
# pmf == [0.0, 0.5714..., 0.2857..., 0.1428...]
NumPy-based solution
import numpy as np
bins = [0, 0.5, 1, 1.5]
a = np.array([0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5])
denom = len(a)
pmf = [(a == b).sum() / denom for b in bins]
You can probably do better by using np.bincount() instead, though.
Further reading on this idea: https://thispointer.com/count-occurrences-of-a-value-in-numpy-array-in-python/
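For what it's worth, here is a minimal np.bincount() sketch; it assumes the values are exact multiples of 0.5, so they can first be mapped to integer codes:

import numpy as np

bins = [0, 0.5, 1, 1.5]
a = np.array([0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5])
# Map each value to its index in bins: 0 -> 0, 0.5 -> 1, 1 -> 2, 1.5 -> 3
codes = np.rint(a / 0.5).astype(int)
counts = np.bincount(codes, minlength=len(bins))
pmf = counts / len(a)
print(pmf)  # [0.         0.57142857 0.28571429 0.14285714]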
Does anybody have an idea how to get the elements of a list whose values fall within a specific (from - to) range?
I need to check whether a list contains elements in a specific range, and if there are any, I need the biggest one to be saved in a variable.
Example:
list = [0.5, 0.56, 0.34, 0.45, 0.53, 0.6]
# range (0.5 - 0.58)
# biggest = 0.56
You could use a filtered comprehension to get only those elements in the range you want, then find the biggest of them using the built-in max():
lst = [0.5, 0.56, 0.34, 0.45, 0.53, 0.6]
biggest = max([e for e in lst if 0.5 < e < 0.58])
# biggest = 0.56
As an alternative to other answers, you can also use filter and lambda:
lst = [0.5, 0.56, 0.34, 0.45, 0.53, 0.6]
biggest = max(filter(lambda x: 0.5 < x < 0.58, lst))
I suppose a plain if check would be faster, but I'll give this just for completeness.
Also, you should not use list = ..., as list is a built-in in Python.
You could also go about it a step at a time, as the approach may aid in debugging.
I used numpy in this case, which is also a helpful tool to put in your tool belt.
This should run as is:
import numpy as np

l = [0.5, 0.56, 0.34, 0.45, 0.53, 0.6]
a = np.array(l)
low = 0.5
high = 0.58

index_low = (a < high)            # boolean mask: elements below the upper bound
print(index_low)
a_low = a[index_low]
print(a_low)

index_in_range = (a_low >= low)   # boolean mask: elements at or above the lower bound
print(index_in_range)
a_in_range = a_low[index_in_range]
print(a_in_range)

a_max = a_in_range.max()
print(a_max)
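The same result can also be had in one step with a combined boolean mask; a quick sketch reusing a, low and high from above (it keeps the same bounds, >= low and < high):

a_max = a[(a >= low) & (a < high)].max()
print(a_max)  # 0.56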
I have a list called x containing 1,000,000 elements (numbers), and I would like to count how many of them are equal to or above each of the thresholds [0.5, 0.55, 0.60, ..., 1]. Is there a way to do it without a for loop?
Right now I have the following code, which works for a single value from that interval, say 0.5, and assigns the result to the count variable:
count = len([i for i in x if i >= 0.5])
EDIT: Basically what I want to avoid is doing this... if possible?
obs = []
alpha = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]
for a in alpha:
    count = len([i for i in x if i >= a])
    obs.append(count)
Thanks in advance
Best, Mikael
I don't think it's possible without a loop, but you can sort the array x and then use the bisect module (doc) to locate the insertion points (indices).
For example:
import bisect

x = [0.341, 0.423, 0.678, 0.999, 0.523, 0.751, 0.7]
alpha = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]

x = sorted(x)
obs = [len(x) - bisect.bisect_left(x, a) for a in alpha]
print(obs)
print(obs)
Will print:
[5, 4, 4, 4, 3, 2, 1, 1, 1, 1, 0]
Note:
sorted() is O(n log n), and each bisect_left() lookup is O(log n).
You can use numpy and boolean indexing:
>>> import numpy as np
>>> a = np.arange(100)
>>> a[a >= 50].size
50
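To handle all thresholds at once, a broadcast comparison is one option. A sketch, assuming x is already a NumPy array (it builds an N-by-M boolean matrix, so memory grows with both sizes):

import numpy as np

x = np.random.rand(1_000_000)
alpha = np.array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1])
# Compare every element with every threshold, then count per column.
obs = (x[:, None] >= alpha).sum(axis=0)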
Even if you do not write a for loop yourself, the methods you call still iterate internally; they just do so efficiently. You can use filter() to avoid an explicit loop on your end:
x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
l = list(filter(lambda v: v >= 0.5, x))
print(l)
Based on comments, you're OK with using numpy, so use np.searchsorted to simply insert alpha into a sorted version of x. Subtracting the resulting indices from the array size gives your counts.
If you're OK with sorting x in place:
import numpy as np

x = np.asarray(x)  # in case x starts out as a plain list
x.sort()
counts = x.size - np.searchsorted(x, alpha)
If not:
counts = x.size - np.searchsorted(np.sort(x), alpha)
These counts are of the elements with x >= alpha (the searchsorted indices themselves count x < alpha). To count x > alpha instead, add the keyword side='right':
np.searchsorted(x, alpha, side='right')
PS
There are a couple of significant problems with the line
count = len([i for i in x if i >= 0.5])
First of all, you're creating a list of all the matching elements instead of just counting them. To count them do
count = sum(1 for i in x if i >= threshold)
Now the problem is that you are doing a linear pass through the entire array for each alpha, which is not necessary.
As I commented under @Andrej Kesely's answer, let's say we have N = len(x) and M = len(alpha). Your implementation is O(M * N) time complexity, while sorting gives you O((M + N) log N). For M << N (few alphas), your complexity is approximately O(N), which beats O(N log N). But for M ~= N, yours approaches O(N^2) versus my O(N log N).
EDIT: If you are using NumPy already, you can simply do this:
import numpy as np
# Make random data
np.random.seed(0)
x = np.random.binomial(n=20, p=0.5, size=1000000) / 20
bins = np.arange(0.55, 1.01, 0.05)
# One extra value for the upper bound of last bin
bins = np.append(bins, max(bins.max(), x.max()) + 1)
h, _ = np.histogram(x, bins)
result = np.cumsum(h)
print(result)
# [280645 354806 391658 406410 411048 412152 412356 412377 412378 412378]
If you are dealing with large arrays of numbers, you may consider using NumPy. But if you are using simple Python lists, you can do it for example like this:
def how_many_bigger(nums, mins):
    # List of counts for each minimum
    counts = [0] * len(mins)
    # For each number
    for n in nums:
        # For each minimum
        for i, m in enumerate(mins):
            # Add 1 to the count if the number is at or above the current minimum
            if n >= m:
                counts[i] += 1
    return counts
# Test
import random
# Make random data
random.seed(0)
nums = [random.random() for _ in range(1_000_000)]
# Make minimums
mins = [i / 100. for i in range(55, 101, 5)]
print(mins)
# [0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]
count = how_many_bigger(nums, mins)
print(count)
# [449771, 399555, 349543, 299687, 249605, 199774, 149945, 99928, 49670, 0]
I have two lists as given below:
phi= [0,pi/4, pi/2, 3*pi/4, pi, 5*pi/4, 3*pi/2, 7*pi/4, 2*pi]
t = [1,1.25,1.50,1.25,1,0.75,0.5,0.5,0.75]
I get t from a computation over the phi list, so phi = 0 gives me t = 1 and so on. I want two lists from t and two from phi. The first t list should run from the minimum value of t (at the second-to-last position) up to the maximum value of t, and the first phi list should hold the phi values associated with those t values. The second t list should run from the maximum t value down to the minimum, and the second phi list should hold the phi values associated with that list. Is there a way to code this?
Desired Output:
t1 = [0.5,0.75,1,1.25,1.5]; phi1 = [7*pi/4, 2*pi, 0, pi/4, pi/2]
t2 = [1.50, 1.25, 1, 0.75, 0.5]; phi2 = [pi/2, 3*pi/4, pi, 5*pi/4, 3*pi/2]
Starting from the following data:
phi= ['0','pi/4', 'pi/2','3*pi/4', 'pi','5*pi/4','3*pi/2','7*pi/4', '2*pi']
t = [1,1.25,1.50,1.25,1,0.75,0.5,0.5,0.75]
You can get the indexes of the maximum and the minimum with numpy.argmax() and numpy.argmin(), which return the index of the maximum or minimum of a given list or array (note that in the case of multiple occurrences, the index of the first one is returned). Thus:
import numpy as np

max_index = np.argmax(t)
min_index = np.argmin(t)

# the +1 in some indexes is to have pi/2 and 3*pi/2 in both phi1 and phi2
phi1 = phi[min_index:] + phi[:max_index+1]
t1 = t[min_index:] + t[:max_index+1]
phi2 = phi[max_index:min_index+1]
t2 = t[max_index:min_index+1]

# Out: t1, phi1
# [0.5, 0.5, 0.75, 1, 1.25, 1.5] ['3*pi/2', '7*pi/4', '2*pi', '0', 'pi/4', 'pi/2']
# t2, phi2
# [1.5, 1.25, 1, 0.75, 0.5] ['pi/2', '3*pi/4', 'pi', '5*pi/4', '3*pi/2']
Using numpy may be overkill, in which case max_index and min_index can be computed with built-in functions:
max_val = max(t)
min_val = min(t)
max_index = t.index(max_val)
min_index = t.index(min_val)
The same indexes are retrieved, and thus, the same output.
Your phi list is confusing and not a valid Python object; please correct it. If I understand you correctly, you want something like this:
from math import pi
# note: this phi list reproduces the question's list as posted, including its oddities
phi = [0, pi/4, pi/2, 3, pi/4, pi, 5, pi/4, 3, pi/2, 7, pi/4]
t = [1, 1.25, 1.50, 1.25, 1, 0.75, 0.5, 0.5, 0.75]
zipped_ = list(zip(t, phi))
print(sorted(zipped_))
print(sorted(zipped_,reverse=True))
output:
[(0.5, 0.7853981633974483), (0.5, 5), (0.75, 3), (0.75, 3.141592653589793), (1, 0), (1, 0.7853981633974483), (1.25, 0.7853981633974483), (1.25, 3), (1.5, 1.5707963267948966)]
#First list will start from the minimum value of t(at second last position) to maximum value of t.The phi list will be the phi values associated with those new t values
[(1.5, 1.5707963267948966), (1.25, 3), (1.25, 0.7853981633974483), (1, 0.7853981633974483), (1, 0), (0.75, 3.141592653589793), (0.75, 3), (0.5, 5), (0.5, 0.7853981633974483)]
#Second list of t values will start from the maximum t value and end at the minimum. second phi list should be the phi values associated with this t value lists.
You can adjust the phi list if this modified version is not the same as yours.
I'm supposed to normalize an array. I've read about normalization and came across this formula:

x_norm = (x - x_min) / (x_max - x_min)
I wrote the following function for it:
def normalize_list(list):
    max_value = max(list)
    min_value = min(list)
    for i in range(0, len(list)):
        list[i] = (list[i] - min_value) / (max_value - min_value)
That is supposed to normalize an array of elements.
Then I have come across this: https://stackoverflow.com/a/21031303/6209399
Which says you can normalize an array by simply doing this:
import numpy as np

def normalize_list_numpy(list):
    normalized_list = list / np.linalg.norm(list)
    return normalized_list
If I normalize this test array test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9] with my own function and with the numpy method, I get these answers:
My own function: [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
The numpy way: [0.059234887775909233, 0.11846977555181847, 0.17770466332772769, 0.23693955110363693, 0.29617443887954614, 0.35540932665545538, 0.41464421443136462, 0.47387910220727386, 0.5331139899831831]
Why do the functions give different answers? Are there other ways to normalize an array of data? What does numpy.linalg.norm(list) do? What am I getting wrong?
There are different types of normalization. You are using min-max normalization. The min-max normalization from scikit-learn is as follows.
import numpy as np
from sklearn.preprocessing import minmax_scale

# your function
def normalize_list(list_normal):
    max_value = max(list_normal)
    min_value = min(list_normal)
    for i in range(len(list_normal)):
        list_normal[i] = (list_normal[i] - min_value) / (max_value - min_value)
    return list_normal

# scikit-learn version
def normalize_list_numpy(list_numpy):
    normalized_list = minmax_scale(list_numpy)
    return normalized_list

test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test_array_numpy = np.array(test_array)

print(normalize_list(test_array))
print(normalize_list_numpy(test_array_numpy))
Output:
[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
MinMaxScaler uses exactly your formula for normalization/scaling:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html
@OuuGiii: Note that it is not a good idea to use Python built-in function names as variable names. list() is a Python built-in, so its use as a variable name should be avoided.
The question/answer that you reference doesn't explicitly relate your own formula to the np.linalg.norm(list) version that you use here.
One NumPy solution would be this:
import numpy as np

def normalize(x):
    x = np.asarray(x)
    return (x - x.min()) / np.ptp(x)

test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # the array from the question
print(normalize(test_array))
# [ 0.     0.125  0.25   0.375  0.5    0.625  0.75   0.875  1.   ]
Here np.ptp is peak-to-peak, i.e.
Range of values (maximum - minimum) along an axis.
This approach scales the values to the interval [0, 1], as pointed out by @phg.
The more traditional definition of normalization would be to scale to zero mean and unit variance:
x = np.asarray(test_array)
res = (x - x.mean()) / x.std()
print(res.mean(), res.std())
# 0.0 1.0
Or use sklearn.preprocessing.scale as a pre-canned function for that (sklearn.preprocessing.normalize, by contrast, scales to unit norm, which is a different operation).
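A minimal sketch of that pre-canned route, assuming scikit-learn is installed:

import numpy as np
from sklearn.preprocessing import scale

res = scale(np.asarray(test_array, dtype=float))
print(res.mean(), res.std())  # ~0.0 1.0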
Using test_array / np.linalg.norm(test_array) creates a result of unit length; you'll see that np.linalg.norm(test_array / np.linalg.norm(test_array)) equals 1. So you're talking about two different fields here: one is statistics and the other is linear algebra.
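A quick check of that unit-length claim:

import numpy as np

v = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
u = v / np.linalg.norm(v)   # L2-normalize
print(np.linalg.norm(u))    # 1.0, up to floating-point error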
The power of NumPy is broadcasting, which lets you perform vectorized array operations without explicit looping. So you do not need to write a function with an explicit for loop, which is slow and time-consuming, especially if your dataset is big.
The vectorized way of doing min-max normalization is:
import numpy as np

test_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
normalized_test_array = (test_array - test_array.min()) / (test_array.max() - test_array.min())
output >> [ 0., 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1. ]