Python float precision: will this increment work reliably?

I use the following code to dynamically generate a list of dictionaries of every combination of incremental probabilities associated with a given list of items, such that the probabilities sum to 1. For example, if the increment_divisor were 2 (leading to increment of 1/2 or 0.5), and the list contained 3 items ['a', 'b', 'c'], then the function should return
[{'a': 0.5, 'b': 0.5, 'c': 0.0},
{'a': 0.5, 'b': 0.0, 'c': 0.5},
{'a': 0.0, 'b': 0.5, 'c': 0.5},
{'a': 1.0, 'b': 0.0, 'c': 0.0},
{'a': 0.0, 'b': 1.0, 'c': 0.0},
{'a': 0.0, 'b': 0.0, 'c': 1.0}]
The code is as follows. The script generates the incrementer by calculating 1/x and then iteratively adds the incrementer to increments until the value is >= 1.0. I already know that python floats are imprecise, but I want to be sure that the last value in increments will be something very close to 1.0.
from collections import OrderedDict
from itertools import permutations
import sys

def generate_hyp_space(list_of_items, increment_divisor):
    """Generate list of OrderedDicts filling the hypothesis space.

    Each OrderedDict is of the form ...
        { i1: 0.0, i2: 0.1, i3: 0.0, ...}
    ... where .values() sums to 1.

    Arguments:
    list_of_items -- items that receive prior weights
    increment_divisor -- Increment by 1/increment_divisor. For example,
        4 yields (0.0, 0.25, 0.5, 0.75, 1.0).
    """
    _LEN = len(list_of_items)
    if increment_divisor < _LEN:  # permutations() yields nothing if this is True
        print('WARN: increment_divisor too small, so was reset to '
              'len(list_of_items).', file=sys.stderr)
        increment_divisor = _LEN
    increment_size = 1/increment_divisor
    h_space = []
    increments = []
    incremental = 0.0
    while incremental <= 1.0:
        increments.append(incremental)
        incremental += increment_size
    for p in permutations(increments, _LEN):
        if sum(p) == 1.0:
            h_space.append(OrderedDict([(list_of_items[i], p[i])
                                        for i in range(_LEN)]))
    return h_space
How large can the increment_divisor be before the imprecision of float breaks the reliability of the script? (specifically, while incremental <= 1.0 and if sum(p) == 1.0)
This is a small example, but real use will involve much larger permutation space. Is there a more efficient/effective way to achieve this goal? (I already plan to implement a cache.) Would numpy datatypes be useful here for speed or precision?

The script generates the incrementer by calculating 1/x and then iteratively adds the incrementer to increments until the value is >= 1.0.
No, no, no. Just make a list of [0/x, 1/x, ..., (x-1)/x, x/x] by dividing each integer from 0 to x by x:
increments = [i/increment_divisor for i in range(increment_divisor+1)]
# or for Python 2
increments = [1.0*i/increment_divisor for i in xrange(increment_divisor+1)]
The list will always have exactly the right number of elements, no matter what rounding errors occur.
With NumPy, this would be numpy.linspace:
increments = numpy.linspace(start=0, stop=1, num=increment_divisor+1)
As for your overall problem, working in floats at all is probably a bad idea. You should be able to do the whole thing with integers and only divide by increment_divisor at the end, so you don't have to deal with floating-point precision issues in sum(p) == 1.0. Also, itertools.permutations doesn't do what you want, since it doesn't allow repeated items in the same permutation.
Instead of filtering permutations at all, you should use an algorithm based on the stars and bars idea to generate all possible ways to place len(list_of_items) - 1 separators between increment_divisor outcomes, and convert separator placements to probability dicts.
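For illustration, here is a minimal sketch of that separator-placement idea (the helper name is mine, not from the answer): choose the positions of the k-1 bars among divisor + k - 1 slots with itertools.combinations, read off the gaps between bars as integer weights, and divide by the divisor only when building each dict:

from collections import OrderedDict
from itertools import combinations

def hyp_space_stars_and_bars(items, divisor):
    """Yield one OrderedDict per way of splitting `divisor` tokens among items."""
    k = len(items)
    for bars in combinations(range(divisor + k - 1), k - 1):
        counts, prev = [], -1
        for b in bars:
            counts.append(b - prev - 1)  # stars between consecutive bars
            prev = b
        counts.append(divisor + k - 2 - prev)  # stars after the last bar
        # integer arithmetic throughout; divide only at the very end
        yield OrderedDict((item, c / divisor) for item, c in zip(items, counts))

For the question's example, list(hyp_space_stars_and_bars(['a', 'b', 'c'], 2)) produces exactly the six dicts shown above.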

Thanks to @user2357112 for...
...pointing out the approach to work with ints until the last step.
...directing me to stars and bars approach.
I implemented stars_and_bars as a generator as follows:
def stars_and_bars(n, k, the_list=[]):
    """Distribute n probability tokens among k endings.

    Generator implementation of the stars-and-bars algorithm.

    Arguments:
    n -- number of probability tokens (stars)
    k -- number of endings/bins (bars+1)
    """
    if n == 0:
        # no tokens left: all remaining bins get zero
        yield the_list + [0]*k
    elif k == 1:
        # one bin left: it gets all remaining tokens
        yield the_list + [n]
    else:
        # put i tokens in the next bin, recurse on the rest
        for i in range(n+1):
            yield from stars_and_bars(n-i, k-1, the_list+[i])
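Converting the generator's integer token counts into the probability dicts then looks something like this (dividing by the divisor only at this final step, as suggested above):

from collections import OrderedDict

list_of_items = ['a', 'b', 'c']
increment_divisor = 2

h_space = [OrderedDict(zip(list_of_items, (t / increment_divisor for t in tokens)))
           for tokens in stars_and_bars(increment_divisor, len(list_of_items))]
# 6 dicts, matching the example output in the question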

Related

Using brute force on a dictionary of lists

I have a task that has to be solved using brute force.
The data is in a Python dictionary where one value from each key is used to correctly answer a sum. This is part of a wider solution, so in that context there is only one solution, but in the example I am giving there may be multiple solutions, I suppose. So let's just assume that the first solution is the correct one.
So, for example, say the target is 9036. The "rules" are that all we can do is addition, and we have to take one element from each list within the dictionary.
Therefore 9036 can be calculated from the dictionary below as x:9040, y:1247, w:242, z:-1493.
I have been trying to achieve this via a loop, but I can't get the code to "loop" how I want it to, which is along the lines of iterating over x[0],y[0],w[0],z[0] (where 0 is just the first element in each list in the first iteration), then x[0],y[1],w[0],z[0], x[0],y[1],w[1],z[0], x[0],y[1],w[1],z[1]... until it has solved the problem.
I have not added any code, as I was simply running a loop that never got anywhere near what I needed it to do, and as I have no real experience with these sorts of algorithms / brute-forcing I was hoping someone could help point me in the right direction.
EDIT::: I have added the dictionary so that a solution can be provided but the dictionary size can vary so it needs to be a dynamic solution.
The dictionary:
{'x': [11909.0, 9040.0], 'y': [4345.0, 1807.0, 1247.0, 0.0, 6152.0, 4222.0, 123.0], 'w': [538.0, 12.0, 526.0, 0.0, 242.0, 1.0, 128.0, 155.0], 'z': [7149.0, 3003.0, 4146.0, 3054.0, 0.0, -51.0, 1010.0, 189.0, 182.0, -1493.0, 5409.0, -1151.0]}
Using inputs from https://stackoverflow.com/a/61335465/14066512 to iterate over the dictionary.
The permutations_dicts variable contains all the different "brute force" combinations.
import itertools

# d is the dictionary from the question
keys, values = zip(*d.items())
permutations_dicts = [dict(zip(keys, v)) for v in itertools.product(*values)]
for i in permutations_dicts:
    if sum(i.values()) == 9036:
        print(i)
        break
{'x': 9040.0, 'y': 1247.0, 'w': 242.0, 'z': -1493.0}
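If memory is a concern (the list above holds every combination before the search even starts), a lazy variant of the same idea is possible; this is just a sketch reusing the keys/values from above, stopping at the first match:

import itertools

# same d as in the question; no full list is ever materialized
keys, values = zip(*d.items())
hit = next((dict(zip(keys, combo)) for combo in itertools.product(*values)
            if sum(combo) == 9036), None)
print(hit)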
From what I understood you want to try all combinations of values for each key until you reach the right answer.
This will add all possible values for each key until it finds 9036:
my_dict = {'x': [11909.0, 9040.0], 'y': [4345.0, 1807.0, 1247.0, 0.0, 6152.0, 4222.0, 123.0], 'w': [538.0, 12.0, 526.0, 0.0, 242.0, 1.0, 128.0, 155.0], 'z': [7149.0, 3003.0, 4146.0, 3054.0, 0.0, -51.0, 1010.0, 189.0, 182.0, -1493.0, 5409.0, -1151.0]}
looping = True
for x in range(len(my_dict['x'])):
    if not looping:
        break
    for y in range(len(my_dict['y'])):
        if not looping:
            break
        for w in range(len(my_dict['w'])):
            if not looping:
                break
            for z in range(len(my_dict['z'])):
                addition = my_dict['x'][x] + my_dict['y'][y] + my_dict['w'][w] + my_dict['z'][z]
                if addition == 9036:
                    new_dict = {'x': my_dict['x'][x], 'y': my_dict['y'][y], 'w': my_dict['w'][w], 'z': my_dict['z'][z]}
                    print(f"Correct answer is {new_dict}")
                    looping = False
                    break

Markov Chain: Find the most probable path from point A to point B

I have a transition matrix stored as a dictionary:
{'hex1': {'hex2': 1.0},
 'hex2': {'hex4': 0.4, 'hex7': 0.2, 'hex6': 0.2, 'hex1': 0.2},
 'hex4': {'hex3': 1.0},
 'hex3': {'hex6': 0.3333333333333333, 'hex2': 0.6666666666666666},
 'hex6': {'hex1': 0.3333333333333333,
          'hex4': 0.3333333333333333,
          'hex5': 0.3333333333333333},
 'hex7': {'hex6': 1.0},
 'hex5': {'hex3': 1.0}}
which shows the probability of going from a certain hex to another hex (e.g. hex1 has probability 1 going to hex2, hex2 has probability 0.4 going to hex4).
Taking the starting and endpoint, I want to find the path that has the highest probability.
The structure of the code will look like
def find_most_probable_path(start_hex, end_hex, max_path):
    path = <compute maximum-probability path from start_hex to end_hex>
    return path
where max_path is the maximum number of hexes to traverse. If there is no path within max_path, return empty/null. Also, drop the path if it goes back to the starting hex before reaching the ending hex.
Example would be
find_most_probable_path(hex2, hex3, 5)
>> "hex2,hex4,hex3"
The output can be a list of hexes or just concatenated strings.
You can treat your Markov chain as a directed weighted graph and use the probabilities as edge weights.
From this point, you can use Dijkstra's algorithm to find the shortest path between two points on a weighted graph. One caveat: Dijkstra minimizes the sum of edge weights, while you want to maximize the product of probabilities, so use -log(p) as the weight of each edge; minimizing the sum of -log(p) is the same as maximizing the product of the probabilities.
https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
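A minimal sketch of that idea, assuming the transition dict from the question is bound to the name table (my own illustration, not a tested reference implementation):

import heapq
from math import log

def most_probable_path(table, start_hex, end_hex, max_path):
    # Best-first search ordered by accumulated -log(p): the first path
    # popped that reaches end_hex has the highest probability.
    heap = [(0.0, [start_hex])]
    while heap:
        cost, path = heapq.heappop(heap)
        if path[-1] == end_hex:
            return path
        if len(path) >= max_path:
            continue  # respect the max_path bound
        for nxt, p in table.get(path[-1], {}).items():
            if nxt in path:
                continue  # drop paths that revisit a hex (incl. the start)
            heapq.heappush(heap, (cost - log(p), path + [nxt]))
    return None  # no path within max_path

print(most_probable_path(table, "hex2", "hex3", 5))  # ['hex2', 'hex4', 'hex3']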
I developed an algorithm. I have no idea about its efficiency, but it works well.
table = {'hex1': {'hex2': 1.0},
         'hex2': {'hex4': 0.4, 'hex7': 0.2, 'hex6': 0.2, 'hex1': 0.2},
         'hex4': {'hex3': 1.0},
         'hex3': {'hex6': 0.3333333333333333, 'hex2': 0.6666666666666666},
         'hex6': {'hex1': 0.3333333333333333,
                  'hex4': 0.3333333333333333,
                  'hex5': 0.3333333333333333},
         'hex7': {'hex6': 1.0},
         'hex5': {'hex3': 1.0}}

def find_most_probable_path(start_hex, end_hex, max_path=0):
    assigned = [start_hex]
    foundTrue = False
    prob = [{"nodes": [start_hex], "prob": 1, "length": 1}]
    if max_path == 0:
        status = False
    else:
        status = True
    while status == True:
        chn = []
        status = False
        for i in prob:
            if i["length"] < max_path:
                lastElement = i["nodes"][-1]
                for j in table[lastElement]:
                    if j not in assigned:
                        temp = i.copy()
                        js = temp["nodes"].copy()
                        js.append(j)
                        temp["nodes"] = js
                        temp["prob"] = temp["prob"]*table[lastElement][j]
                        temp["length"] += 1
                        #print(temp)
                        chn.append(temp)
                        status = True
        maxv = 0
        for i in chn:
            if i["prob"] >= maxv:
                maxv = i["prob"]
                added = i
        if added["nodes"][-1] == end_hex:
            foundTrue = True
            status = False
        assigned.append(added["nodes"][-1])
        prob.append(added)
    if foundTrue == True:
        return prob[-1]["nodes"]
    else:
        return None

print(find_most_probable_path("hex2", "hex3", 5))
The output will be:
['hex2', 'hex4', 'hex3']
If you want to see the probability of the path, you can change the part:
if foundTrue==True:
    return prob[-1]["nodes"]
to:
if foundTrue==True:
    return prob[-1]
Then the program gives output like this:
{'nodes': ['hex2', 'hex4', 'hex3'], 'prob': 0.4, 'length': 3}

List indices must be integers or slices, not str - HMM forward algorithm

I am trying to implement the forward algorithm in order to calculate an HMM. I am going step by step and debugging every step, but I am getting an error. Can anyone tell me what the error is?
My code is:
states = ('Fever', 'Healthy')
end = 'F'
observation = ('3', '1', '1', '2', '2', '3', '1', '3')
start = {'Fever': 0.5, 'Healthy': 0.5}
trans_prob = {
    'Fever': {'Fever': 0.8, 'Healthy': 0.1, 'F': 0.1},
    'Healthy': {'Fever': 0.1, 'Healthy': 0.8, 'F': 0.1},
}
em_prob = {
    'Fever': {'1': 0.1, '2': 0.2, '3': 0.7},
    'Healthy': {'1': 0.7, '2': 0.2, '3': 0.1},
}
#lent = len(observation)
prev = []
for i, obs_i in enumerate(observation):
    curr = []
    for st in states:
        if i == 0:
            prev_sum = start[st]*em_prob[st][obs_i]
        else:
            for i in trans_prob.keys():
                prev_sum = sum(prev[k]*transition_probability[i][st] for k in states)
                print (prev_sum)
It is giving me this error:
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
     20         else:
     21             for i in trans_prob.keys():
---> 22                 prev_sum = sum(prev[k]*transition_probability[i][st] for k in states)
     23                 print (prev_sum)

<ipython-input> in <genexpr>(.0)
     20         else:
     21             for i in trans_prob.keys():
---> 22                 prev_sum = sum(prev[k]*transition_probability[i][st] for k in states)
     23                 print (prev_sum)

TypeError: list indices must be integers or slices, not str
The issue is that a list is being indexed with a string in prev_sum = sum(prev[k]*transition_probability[i][st] for k in states): prev is a list, and k iterates over the state names 'Fever' and 'Healthy', which are strings, not integers. (transition_probability never appears in the code shown at all, so once prev[k] is fixed it would raise a NameError next.)
You make a for loop that iterates over the keys of trans_prob. But those keys are "Fever" and "Healthy", which are strings, not integers, so they cannot index a list either.
You may be confused on two aspects. Either you think that these containers are dictionaries keyed by state name, or you think that trans_prob.keys() returns integers, not strings. The value you're looking for is probably trans_prob.values(), but even then, with your current dict structure, it'll return strings "Fever", "Healthy", "F".
EDIT: Ah, I see the other issue. You use the variable i twice in iteration. Once in the topmost for loop, where it DOES hold an int (which is probably what you're looking for), and then once in the very inner loop, where it holds strings. I'd suggest rewriting for i in trans_prob.keys(): to for curr_key in trans_prob.keys():
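Putting the fixes together, here is a minimal sketch of the forward loop using a dict for prev (so no list is ever indexed with a state name), reusing the variable names from the question:

prev = {}
for i, obs_i in enumerate(observation):
    curr = {}
    for st in states:
        if i == 0:
            # base case: start probability times emission probability
            curr[st] = start[st] * em_prob[st][obs_i]
        else:
            # sum over all predecessor states k, then emit obs_i from st
            curr[st] = em_prob[st][obs_i] * sum(prev[k] * trans_prob[k][st]
                                                for k in states)
    prev = curr

# probability of the whole observation sequence, ending in the end state
print(sum(prev[k] * trans_prob[k][end] for k in states))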

How to choose keys from a python dictionary based on weighted probability? [duplicate]

This question already has answers here:
A weighted version of random.choice
(28 answers)
Closed 2 years ago.
I have a Python dictionary where keys represent some item and values represent some (normalized) weighting for said item. For example:
d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}
# Note that sum([v for k,v in d.iteritems()]) == 1 for all `d`
Given this correlation of items to weights, how can I choose a key from d such that 6.25% of the time the result is 'a', 31.25% of the time the result is 'b', and 62.5% of the time the result is 'c'?
import random

def weighted_random_by_dct(dct):
    rand_val = random.random()
    total = 0
    for k, v in dct.items():
        total += v
        if rand_val <= total:
            return k
    assert False, 'unreachable'
Should do the trick. It goes through each key while keeping a running sum; if the random value (between 0 and 1) falls in the current slot, it returns that key. (Note that if rounding ever makes the weights sum to slightly less than 1, the assert can in principle be hit; returning the last key instead would be safer.)
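A quick sanity check of the function (my own test harness, not part of the original answer): draw many samples and compare the observed counts against the weights.

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}
counts = {k: 0 for k in d}
for _ in range(100_000):
    counts[weighted_random_by_dct(d)] += 1
print(counts)  # roughly {'a': 6250, 'c': 62500, 'b': 31250}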
Starting in Python 3.6, you can use the built-in random.choices() instead of having to use Numpy.
So then, if we want to sample (with replacement) 20 keys from your dictionary, where the values are the weights/probabilities of being sampled, we can simply write:
import random
random.choices(list(my_dict.keys()), weights=my_dict.values(), k=20)
This outputs a list of sampled keys:
['c', 'b', 'c', 'b', 'b', 'c', 'c', 'c', 'b', 'c', 'b', 'c', 'b', 'c', 'c', 'c', 'c', 'c', 'a', 'b']
If you just want one key, set k to 1 and extract the single element from the list that random.choices returns:
random.choices(list(my_dict.keys()), weights=my_dict.values(), k=1)[0]
(If you don't convert my_dict.keys() to a list, you'll get a TypeError about how it's not subscriptable.)
Here's the relevant snippet from the docs:
random.choices(population, weights=None, *, cum_weights=None, k=1)
Return a k sized list of elements chosen from the population with replacement. If the population is empty, raises IndexError.
If a weights sequence is specified, selections are made according to the relative weights. Alternatively, if a cum_weights sequence is given, the selections are made according to the cumulative weights (perhaps computed using itertools.accumulate()). For example, the relative weights [10, 5, 30, 5] are equivalent to the cumulative weights [10, 15, 45, 50]. Internally, the relative weights are converted to cumulative weights before making selections, so supplying the cumulative weights saves work.
If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence. It is a TypeError to specify both weights and cum_weights.
The weights or cum_weights can use any numeric type that interoperates with the float values returned by random() (that includes integers, floats, and fractions but excludes decimals). Weights are assumed to be non-negative.
For a given seed, the choices() function with equal weighting typically produces a different sequence than repeated calls to choice(). The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error.
According to the comments at https://stackoverflow.com/a/39976962/5139284, random.choices is faster for small arrays, and numpy.random.choice is faster for big arrays. numpy.random.choice also provides an option to sample without replacement, whereas there's no built-in Python standard library function for that.
If you're planning to do this a lot, you could use numpy to select your keys from a list with weighted probabilities using np.random.choice(). The below example will pick your keys 10,000 times with the weighted probabilities.
import numpy as np
probs = [0.0625, 0.625, 0.3125]
keys = ['a', 'c', 'b']
choice_list = np.random.choice(keys, 10000, replace=True, p=probs)
Not sure what your use case is here, but you can check out the frequency distribution/probability distribution classes in the NLTK package, which handle all the nitty-gritty details.
FreqDist is an extension of a counter, which can be passed to a ProbDistI interface. The ProbDistI interface exposes a "generate()" method which can be used to sample the distribution, as well as a "prob(sample)" method that can be used to get the probability of a given key.
For your case you'd want to use Maximum Likelihood Estimation, so the MLEProbDist. If you want to smooth the distribution, you could try LaplaceProbDist or SimpleGoodTuringProbDist.
For example:
from nltk.probability import FreqDist, MLEProbDist

d = {'a': 6.25, 'c': 62.5, 'b': 31.25}
freq_dist = FreqDist(d)
prob_dist = MLEProbDist(freq_dist)
print(prob_dist.prob('a'))
print(prob_dist.prob('b'))
print(prob_dist.prob('c'))
print(prob_dist.prob('d'))
will print 0.0625, 0.3125, 0.625 and 0.0, each on its own line.
To generate a new sample, you can use:
prob_dist.generate()
If you are able to use numpy, you can use the numpy.random.choice function, like so:
import numpy as np

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}

def pick_by_weight(d):
    d_choices = []
    d_probs = []
    for k, v in d.items():
        d_choices.append(k)
        d_probs.append(v)
    return np.random.choice(d_choices, 1, p=d_probs)[0]

choice = pick_by_weight(d)
What I have understood: you need a simple random function that will generate a random number uniformly between 0 and 1. If the value is between, say, 0 and 0.0625, you will select key a; if it is between 0.0625 and (0.0625 + 0.625), you will select key c; etc. This is what is actually described in this answer.
Since the random numbers are generated uniformly, it is expected that keys associated with larger weights will be selected more often than the others.
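A sketch of that interval idea using only the standard library (the names are mine): build the cumulative weights once, then binary-search a uniform random draw into its slot.

import random
from bisect import bisect
from itertools import accumulate

d = {'a': 0.0625, 'c': 0.625, 'b': 0.3125}
keys = list(d)
cum_weights = list(accumulate(d.values()))  # e.g. [0.0625, 0.6875, 1.0]

def pick(keys=keys, cum_weights=cum_weights):
    # random() is uniform on [0, 1); bisect finds the first cumulative
    # weight exceeding the draw, i.e. the interval the draw fell into
    return keys[bisect(cum_weights, random.random() * cum_weights[-1])]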

Merging python lists based on a 'similar' float value

I have a list (containing tuples) and I want to merge the list items based on whether the first element is within a maximum distance of the other first elements (i.e. if the delta value is < 0.05). I have the following list as an example:
[(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
This should yield something like:
[(0.0, 1.883659017),(1.00422, 0.9998252466431066),(2.00425,0.9951777494430947)]
I am thinking that I can use something similar as in this question (Merge nested list items based on a repeating value), although a lot of other questions yield a similar answer. The only problem I see there is that they use collections.defaultdict or itertools.groupby, which require exact matching of the element. An important addition is that I want the first element of a merged tuple to be the weighted mixture of the elements. For example, if (1.001, 80) and (0.99, 20) are matched, then the result should be (0.9988, 100).
Is something similar possible, but with the matching based on a value difference rather than an exact match?
What I was trying myself (but I don't really like the look of it) is:
Res = 0.05
combinations = itertools.combinations(list, 2)
for i in combinations:
    if i[1][0] - Res < i[0][0] < i[1][0] + Res:
        newValue = ...
-- UPDATE --
Based on some comments and Dawg's answer I tried the following approach:
for fv, v in total:
    k = round(fv, 2)
    data[k] = data.get(k, 0) + v
using the following list (actual data example, instead of short example list):
total = [(0.0, 0.11630591852564721), (1.00335, 0.25158664272201053), (2.0067, 0.2707487305913156), (3.0100499999999997, 0.19327075057473678), (4.0134, 0.10295042331357719), (5.01675, 0.04364856520231155), (6.020099999999999, 0.015342958201863783), (0.0, 0.9811758192941256), (1.00422, 0.018649427348981), (0.0, 0.9024831978342827), (2.00425, 0.09269455160881204), (0.0, 0.6944298762418107), (0.99703, 0.2536959281304138), (1.99406, 0.045877927988415786)]
which then yields problems with values such as 2.0067 (rounded to 2.01) and 1.99406 (rounded to 1.99), where the total difference is 0.01264, far below 0.05 (a value that I had in mind as a 'limit' for now, but which should stay changeable). Rounding the values to 1 decimal place is also not an option, since that would result in a window of ~0.09: values such as 2.04999 and 1.95001 would both yield 2.0 in that case.
The exact output was:
{0.0: 2.694394811895866, 1.0: 0.5239319982014053, 4.01: 0.10295042331357719, 5.02: 0.04364856520231155, 2.0: 0.09269455160881204, 1.99: 0.045877927988415786, 3.01: 0.19327075057473678, 6.02: 0.015342958201863783, 2.01: 0.2707487305913156}
accum = list()
data = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
EPSILON = 0.05
newdata = {d: True for d in data}
for k, v in data:
    if not newdata[(k, v)]:
        continue
    newdata[(k, v)] = False  # use each piece of data only once
    keys, values = [k*v], [v]
    for kk, vv in [d for d in data if newdata[d]]:
        if abs(k - kk) < EPSILON:
            keys.append(kk*vv)
            values.append(vv)
            newdata[(kk, vv)] = False
    accum.append((sum(keys)/sum(values), sum(values)))
You can round the float values then use setdefault:
li = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]

data = {}
for fv, v in li:
    k = round(fv, 5)
    data.setdefault(k, 0)
    data[k] += v

print(data)
# {0.0: 1.8836590171284082, 2.00425: 0.9951777494430947, 1.00422: 0.9998252466431066}
If you want some more complex comparison (other than fixed rounding) you can create a hashable object based on the epsilon value you want and use the same method from there.
As pointed out in the comments, this works too:
data = {}
for fv, v in li:
    k = round(fv, 5)
    data[k] = data.get(k, 0) + v
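For the epsilon-based (rather than rounding-based) merge discussed in the update, one possible sketch (my own, under the assumption that weights are positive) is to sort by the float key and start a new group whenever the gap to the running group mean exceeds the limit; note that long chains of nearby keys can still end up merged into a single group:

def merge_within_epsilon(pairs, eps=0.05):
    merged = []  # each entry: [sum of key*weight, sum of weights]
    for k, v in sorted(pairs):
        # compare against the current group's weighted-mean key
        if merged and k - merged[-1][0] / merged[-1][1] < eps:
            merged[-1][0] += k * v
            merged[-1][1] += v
        else:
            merged.append([k * v, v])
    return [(ks / ws, ws) for ks, ws in merged]

print(merge_within_epsilon(li))
# approximately [(0.0, 1.8837), (1.00422, 0.9998), (2.00425, 0.9952)]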
