I am trying to implement Theil's index (http://en.wikipedia.org/wiki/Theil_index) in Python to measure inequality of revenue in a list.
The formula is basically Shannon's entropy, so it deals with logs. My problem is that I have a few revenues at 0 in my list, and log(0) makes my formula unhappy. I believe adding a tiny float to 0 wouldn't work either, as log(tinyFloat) is a huge negative number that would mess my index up.
[EDIT]
Here's a snippet (taken from another, much cleaner and freely available, implementation):
from math import log, exp

def error_if_not_in_range01(value):
    if (value <= 0) or (value > 1):
        raise Exception, \
            str(value) + ' is not in (0,1]!'

def H(x):
    n = len(x)
    entropy = 0.0
    sum = 0.0
    for x_i in x:  # work on all x[i]
        print x_i
        error_if_not_in_range01(x_i)
        sum += x_i
        group_negentropy = x_i*log(x_i)
        entropy += group_negentropy
    error_if_not_1(sum)  # assumed to be defined elsewhere in the original implementation
    return -entropy

def T(x):
    print x
    n = len(x)
    maximum_entropy = log(n)
    actual_entropy = H(x)
    redundancy = maximum_entropy - actual_entropy
    inequality = 1 - exp(-redundancy)
    return redundancy, inequality
Is there any way out of this problem?
If I understand you correctly, the formula you are trying to implement is the Theil index:
T = (1/n) * sum( (x_i / mean(x)) * ln(x_i / mean(x)) )
In this case, your problem is calculating the natural logarithm of x_i / mean(x) when x_i = 0.
However, since that logarithm has to be multiplied by x_i / mean(x) first, if x_i == 0 the value of ln(x_i / mean(x)) doesn't matter, because it will be multiplied by zero. You can treat the value of the formula for that entry as zero and skip calculating the logarithm entirely.
The same holds if you are implementing Shannon's entropy formula directly:
H(x) = -sum( p_i * ln(p_i) )
In both the first and second form, calculating the log is not necessary if the coefficient (x_i or p_i) is zero, because whatever value the log has, it will be multiplied by zero (indeed, p*ln(p) tends to 0 as p tends to 0).
UPDATE:
Given the code you quoted, you can replace x_i*log(x_i) with a function as follows:
def Group_negentropy(x_i):
    if x_i == 0:
        return 0
    else:
        return x_i*log(x_i)

def H(x):
    n = len(x)
    entropy = 0.0
    sum = 0.0
    for x_i in x:  # work on all x[i]
        print x_i
        error_if_not_in_range01(x_i)  # note: this check must also be relaxed to (value < 0) so that 0 is accepted
        sum += x_i
        group_negentropy = Group_negentropy(x_i)
        entropy += group_negentropy
    error_if_not_1(sum)
    return -entropy
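For completeness, here is a compact, self-contained sketch of the same idea (this is my own example code, not the quoted implementation): it normalizes the revenues to shares and simply skips zero entries when summing the entropy.

from math import log

def theil_index(revenues):
    total = float(sum(revenues))
    n = len(revenues)
    shares = [r / total for r in revenues]
    # a zero share contributes nothing: p*log(p) tends to 0 as p tends to 0
    entropy = -sum(p * log(p) for p in shares if p > 0)
    return log(n) - entropy  # the redundancy, i.e. the Theil index

print(theil_index([0, 0, 10, 10, 30, 50]))  # zero revenues no longer cause a log(0) error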
Related
I'm trying to evaluate a Taylor polynomial for the natural logarithm, ln(x), centred at a=1 in Python. I'm using the series given on Wikipedia; however, when I try a simple calculation like ln(2.7), instead of giving me something close to 1 it gives me a gigantic number. Is there something obvious that I'm doing wrong?
def log(x):
    n = 1000
    s = 0
    for i in range(1, n):
        s += ((-1)**(i+1))*((x-1)**i)/i
    return s
EDIT: If anyone stumbles across this, an alternative way to evaluate the natural logarithm of some real number is to use numerical integration (e.g. Riemann sum, midpoint rule, trapezoid rule, Simpson's rule, etc.) on the integral of 1/t from t = 1 to t = x, which is often used to define the natural logarithm.
That series only converges when 0 < x <= 2 (i.e. when |x - 1| <= 1). For larger x you will need a different series, for example this one:
def ln(x): return 2*sum(((x-1)/(x+1))**i/i for i in range(1,100,2))
output:
ln(2.7) # 0.9932517730102833
math.log(2.7) # 0.9932517730102834
Note that it takes a lot more than 100 terms to converge as x gets bigger (up to a point where it becomes impractical).
You can compensate for that by adding the logarithms of smaller factors of x:
def ln(x):
    if x > 2: return ln(x/2) + ln(2)  # ln(x) = ln(x/2 * 2) = ln(x/2) + ln(2)
    return 2*sum(((x-1)/(x+1))**i/i for i in range(1,1000,2))
which is something you can also do in your Taylor based function to support x>1:
def log(x):
    if x > 1: return log(x/2) - log(0.5)  # ln(2) = -ln(1/2)
    n = 1000
    s = 0
    for i in range(1, n):
        s += ((-1)**(i+1))*((x-1)**i)/i
    return s
These series also take more terms to converge when x gets closer to zero so you may want to work them in the other direction as well to keep the actual value to compute between 0.5 and 1:
def log(x):
    if x > 1:   return log(x/2) - log(0.5)  # ln(x/2 * 2) = ln(x/2) + ln(2)
    if x < 0.5: return log(2*x) + log(0.5)  # ln(x*2 / 2) = ln(x*2) - ln(2)
    ...
If performance is an issue, you'll want to store ln(2) or log(0.5) somewhere and reuse it instead of computing it on every call, for example:
ln2 = None
def ln(x):
    if x <= 2:
        return 2*sum(((x-1)/(x+1))**i/i for i in range(1,10000,2))
    global ln2
    if ln2 is None: ln2 = ln(2)
    n2 = 0
    while x > 2: x, n2 = x/2, n2+1
    return ln2*n2 + ln(x)
The program is correct, but the Mercator series comes with a caveat: it converges to ln(1 + x) only when −1 < x ≤ 1.
In your code the series variable is x − 1, so for ln(2.7) that variable is 1.7 > 1; the series diverges there, and you shouldn't expect a result close to 1.
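One minimal way around this, as a sketch (the reciprocal identity is my addition, not taken from any answer here): for x > 2, use ln(x) = -ln(1/x) so that the series argument falls back inside the convergent range.

def mercator_ln(x, terms=1000):
    # Mercator series for ln(x), valid for 0 < x <= 2
    if x <= 0:
        raise ValueError("ln(x) is undefined for x <= 0")
    if x > 2:
        return -mercator_ln(1.0 / x, terms)  # ln(x) = -ln(1/x), and 0 < 1/x < 0.5
    s = 0.0
    for i in range(1, terms):
        s += ((-1) ** (i + 1)) * ((x - 1) ** i) / i
    return s

print(mercator_ln(2.7))  # roughly 0.993, close to math.log(2.7)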
The Python function math.frexp(x) can be used to advantage here to modify the problem so that the Taylor series is working with a value close to one. math.frexp(x) is described as:
Return the mantissa and exponent of x as the pair (m, e). m is a float
and e is an integer such that x == m * 2**e exactly. If x is zero,
returns (0.0, 0), otherwise 0.5 <= abs(m) < 1. This is used to “pick
apart” the internal representation of a float in a portable way.
Using math.frexp(x) should not be regarded as "cheating" because it is presumably implemented just by accessing the bit fields in the underlying binary floating point representation. It isn't absolutely guaranteed that the representation of floats will be IEEE 754 binary64, but as far as I know every platform uses this. sys.float_info can be examined to find out the actual representation details.
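For example (my illustration of the split and of the float parameters you can inspect):

import math
import sys

print(math.frexp(10.0))  # (0.625, 4), because 10.0 == 0.625 * 2**4
print(sys.float_info.mant_dig, sys.float_info.radix)  # 53 2 on IEEE 754 binary64 platforms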
Much like the other answer does, you can use the standard logarithmic identities as follows: let m, e = math.frexp(x). Then log(x) = log(m * 2**e) = log(m) + e * log(2). log(2) can be precomputed to full precision ahead of time and is just a constant in the program. Here is some code illustrating this to compute the two similar Taylor series approximations to log(x). The number of terms in each series was determined by trial and error rather than rigorous analysis.
taylor1 implements log(1 + x) = x - (1/2)*x**2 + (1/3)*x**3 - ...
taylor2 implements log(x) = 2 * [t + (1/3)*t**3 + (1/5)*t**5 + ...], where t = (x - 1) / (x + 1).
import math

_LOG_OF_2 = 0.69314718055994530941723212145817656807550013436025

def taylor1(x):
    m, e = math.frexp(x)
    log_of_m = 0
    num_terms = 36
    sign = 1
    m_minus1_power = m - 1
    for k in range(1, num_terms + 1):
        log_of_m += sign * m_minus1_power / k
        sign = -sign
        m_minus1_power *= m - 1
    return log_of_m + e * _LOG_OF_2

def taylor2(x):
    m, e = math.frexp(x)
    num_terms = 12
    half_log_of_m = 0
    t = (m - 1) / (m + 1)
    t_squared = t * t
    t_power = t
    denominator = 1
    for k in range(num_terms):
        half_log_of_m += t_power / denominator
        denominator += 2
        t_power *= t_squared
    return 2 * half_log_of_m + e * _LOG_OF_2
This seems to work well over most of the domain of log(x), but as x approaches 1 (and log(x) approaches 0) the transformation provided by x = m * 2**e actually produces a less accurate result. So a better algorithm would first check if x is close to 1, say abs(x - 1) < .5, and if so just compute the Taylor series approximation directly on x.
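A sketch of that refinement (my illustration, not part of the quoted code; it reuses taylor2 from above and follows the 0.5 threshold suggested here):

def log_near_one(x):
    # For x close to 1, summing the series on x directly avoids the
    # cancellation between log(m) and e * _LOG_OF_2 in the frexp route.
    if abs(x - 1) < 0.5:
        t = (x - 1) / (x + 1)
        return 2 * sum(t ** k / k for k in range(1, 30, 2))
    return taylor2(x)  # fall back to the frexp-based version above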
My answer is just using the Taylor series for ln(x). I really hope this helps. It is simple and straight to the point.
I want to generate an array of random numbers (each between -1 and 1) whose sum is exactly 0. If the required sum were 1, I could just divide the values by their sum; however, that approach is not applicable when the sum has to be 0.
Maybe I could compute the opposite of each value I sample, so I would always have a pair of numbers such that their sum is 0. However, this approach reduces the "randomness" I would like to have in my random array.
Are there better approaches?
Edit: the array length can vary (from 3 to a few hundred), but it has to be fixed before sampling.
There is a Dirichlet-Rescale (DRS) algorithm that generates random numbers summing up to a given number. As its documentation says, it has the feature that
the vectors are uniformly distributed over the valid region of the
domain of all possible vectors, bounded by the constraints.
There is also a Python library for it.
You could use sklearn's StandardScaler. It scales your data to have a variance of 1 and a mean of 0. A mean of 0 is equivalent to a sum of 0.
from sklearn.preprocessing import StandardScaler
import numpy as np

rand_numbers = StandardScaler().fit_transform(np.random.rand(100, 1))
If you don't want to use sklearn, you can standardize by hand; the formula is pretty simple:
rand_numbers = np.random.rand(1000, 1)
rand_numbers = (rand_numbers - np.mean(rand_numbers)) / np.std(rand_numbers)
The problem here is the variance of 1, which causes numbers greater than 1 or smaller than -1. Therefore you divide the array by its maximum absolute value:
rand_numbers = rand_numbers*(1/max(abs(rand_numbers)))
Now you have an array with values between -1 and 1 with a sum really close to zero.
print(sum(rand_numbers))
print(min(rand_numbers))
print(max(rand_numbers))
Output:
[-1.51822999e-14]
[-0.99356294]
[1.]
What you will have with this solution is that there is always either one 1 or one -1 in your data. If you want to avoid this, you can add a positive random factor to the divisor: rand_numbers*(1/(max(abs(rand_numbers))+randomfactor))
Edit
As #KarlKnechtel mentioned, the division by the standard deviation is redundant with the division by the max absolute value.
The above can be simply done by:
rand_numbers = np.random.rand(100000, 1)
rand_numbers = rand_numbers - np.mean(rand_numbers)
rand_numbers = rand_numbers / max(abs(rand_numbers))
I would try the following solution:
import random

def draw_randoms_while_sum_not_zero(eps):
    r = random.uniform(-1, 1)
    sum = r
    yield r
    while abs(sum) > eps:
        if sum > 0:
            r = random.uniform(-1, 0)
        else:
            r = random.uniform(0, 1)
        sum += r
        yield r
As floating point numbers are not perfectly accurate, you can never be sure that the numbers you draw will sum to exactly 0. You need to decide what margin is acceptable and call the above generator.
It'll yield (lazily return) random numbers as you need them, as long as their running sum is not within ±eps of 0.
epss = [0.1, 0.01, 0.001, 0.0001, 0.00001]
for eps in epss:
    lengths = []
    for _ in range(100):
        lengths.append(len(list(draw_randoms_while_sum_not_zero(eps))))
    print(f'{eps}: min={min(lengths)}, max={max(lengths)}, avg={sum(lengths)/len(lengths)}')
Results:
0.1: min=1, max=24, avg=6.1
0.01: min=1, max=174, avg=49.27
0.001: min=4, max=2837, avg=421.41
0.0001: min=5, max=21830, avg=4486.51
1e-05: min=183, max=226286, avg=48754.42
Since you are fine with the approach of generating lots of numbers and dividing by the sum, why not generate n/2 positive numbers and divide them by their sum, then generate n/2 negative numbers and divide them by their sum?
Want a random positive-to-negative mix? Randomly generate that split first, then continue, as sketched below.
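A minimal sketch of that idea (the function name and the normalization targets of +1 and -1 are my choices):

import random

def split_normalized_zero_sum(n, split=None):
    # split: how many entries end up in the positive half; chosen at random if not given
    if split is None:
        split = random.randint(1, n - 1)
    pos = [random.uniform(0, 1) for _ in range(split)]
    neg = [random.uniform(0, 1) for _ in range(n - split)]
    pos = [x / sum(pos) for x in pos]    # positive half sums to +1, each value stays in [0, 1]
    neg = [-x / sum(neg) for x in neg]   # negative half sums to -1, each value stays in [-1, 0]
    return pos + neg

vals = split_normalized_zero_sum(6)
print(vals, sum(vals))  # the sum is 0 up to floating point error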
One way to generate such a list is by pairing each sampled number with its opposite.
If exact opposite pairs are not a desirable property, you can introduce some extra randomness by adding / subtracting the same random value to different opposite couples, e.g.:
import random

def exact_sum_uniform_random(num, min_val=-1.0, max_val=1.0, epsilon=0.1):
    items = [random.uniform(min_val, max_val) for _ in range(num // 2)]
    opposites = [-x for x in items]
    if num % 2 != 0:
        items.append(0.0)
    for i in range(len(items)):
        diff = random.random() * epsilon
        if items[i] + diff <= max_val \
                and any(opposite - diff >= min_val for opposite in opposites):
            items[i] += diff
            modified = False
            while not modified:
                j = random.randint(0, num // 2 - 1)
                if opposites[j] - diff >= min_val:
                    opposites[j] -= diff
                    modified = True
    result = items + opposites
    random.shuffle(result)
    return result

random.seed(0)
x = exact_sum_uniform_random(3)
print(x, sum(x))
# [0.7646391433441265, -0.7686875811622043, 0.004048437818077755] 2.2551405187698492e-17
EDIT
If the upper and lower limits are not strict, a simple way to construct a zero sum sequence is to sum-normalize two separate sequences to 1 and -1 and join them together:
def norm(items, scale):
    return [item / scale for item in items]

def zero_sum_uniform_random(num, min_val=-1.0, max_val=1.0):
    a = [random.uniform(min_val, max_val) for _ in range(num // 2)]
    a = norm(a, sum(a))
    b = [random.uniform(min_val, max_val) for _ in range(num - len(a))]
    b = norm(b, -sum(b))
    result = a + b
    random.shuffle(result)
    return result
random.seed(0)
n = 3
x = zero_sum_uniform_random(n)
print(x, sum(x))
# [1.0, 2.2578843364303585, -3.2578843364303585] 0.0
Note that neither approach will, in general, produce a uniform distribution.
How do I create a Python function called mySqrt that will approximate the square root of a number, call it n, by using Newton's algorithm? Here's what I tried so far:
def newguess(x):
    result = x/2
    return result

def mySqrt(n):
    result = (1/2) * (oldguess + (n/oldguess))
    return result

v = newguess(45)
t = mySqrt(65)
print(t)
I think this is what you are looking for:
def my_sqrt(n):
    approx = n/2
    closer = (approx + n/approx)/2
    while closer != approx:
        approx = closer
        closer = (approx + n/approx)/2
    return approx
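For example (my usage illustration):

print(my_sqrt(2))   # approximately 1.4142135623730951
print(my_sqrt(45))  # approximately 6.708203932499369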
The Newton method finds an approximated solution r of the equation f(x) = 0 as follows:
[Initialize] Set r to some initial guess. Set epsilon := 0.00001 (precision)
[Iterate] While abs(f(r)) > epsilon Repeat r := r - f(r)/f'(r)
[End] Return r
In step 1 above, epsilon is the precision you want to achieve: the smaller epsilon is (i.e. the higher the precision), the longer your program will take. In step 2, f'(r) stands for the derivative of f at r.
Now, you want to compute sqrt(a) for any value of a >= 0 using the Newton method.
By definition x = sqrt(a) means x^2 = a or x^2 - a = 0. Let f(x) = x^2 - a. Finding a solution r of f(x) = 0 is equivalent to finding r = sqrt(a). Note that in this case we have f'(x) = 2*x.
If we now apply the above algorithm to this case with a/2 as the initial guess (actually anything between 0 and a), we get:
[Initialize] Set r := a/2 and epsilon := 0.000000001
[Iterate] While abs(r^2 - a) > epsilon Repeat r := r - (r^2 - a)/(2*r)
[End] Return r
So, the only thing you have to do now is translate these three simple steps into a Python program.
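A direct translation of those three steps might look like this (a sketch; the function name and the epsilon default are mine):

def newton_sqrt(a, epsilon=1e-9):
    if a < 0:
        raise ValueError("cannot take the square root of a negative number")
    if a == 0:
        return 0.0
    r = a / 2.0                       # [Initialize]
    while abs(r * r - a) > epsilon:   # [Iterate]
        r = r - (r * r - a) / (2 * r)
    return r                          # [End]

print(newton_sqrt(2))  # close to 1.4142135623730951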
Here is a solution which uses 50 iterations to approximate the value:
def mySqrt(n):
    newGuess = n/2
    for i in range(50):
        newGuess = 0.5*(newGuess + (n/newGuess))
    return newGuess
I'm having a bit of trouble controlling the results from a data generating algorithm I am working on. Basically it takes values from a list and then lists all the different combinations that reach a specific sum. So far the code works fine (haven't tested scaling it with many variables yet), but I need to allow negative numbers to be included in the list.
The way I think I can solve this problem is to put a cap on the possible results so as to prevent infinitely many results (if apples are 2 and oranges are -1, then for any sum there will be infinitely many solutions, but if I put a limit on either then it cannot go on forever).
So here's super basic code that computes each value's maximum allowed coefficient:
import math

data = [-2, 10, 5, 50, 20, 25, 40]
target_sum = 100
max_percent = .8  # no value can exceed 80% of total (this is to prevent infinite solutions)
for node in data:
    max_value = abs(math.floor((target_sum * max_percent)/node))
    print node, "'s max value is ", max_value
Here's the code that generates the results (the first function builds a table of which sums are achievable, and the second function composes the actual results; details/pseudocode of the algorithm are here: Can brute force algorithms scale?):
from collections import defaultdict

data = [-2, 10, 5, 50, 20, 25, 40]
target_sum = 100

# T[x, i] is True if 'x' can be solved
# by a linear combination of data[:i+1]
T = defaultdict(bool)  # all values are False by default
T[0, 0] = True  # base case
for i, x in enumerate(data):  # i is index, x is data[i]
    for s in range(target_sum + 1):  # set the range one higher than sum to include sum itself
        for c in range(s / x + 1):
            if T[s - c * x, i]:
                T[s, i+1] = True

coeff = [0]*len(data)

def RecursivelyListAllThatWork(k, sum):  # Using last k variables, make sum
    # /* Base case: If we've assigned all the variables correctly, list this
    #  * solution.
    #  */
    if k == 0:
        # print what we have so far
        print(' + '.join("%2s*%s" % t for t in zip(coeff, data)))
        return
    x_k = data[k-1]
    # /* Recursive step: Try all coefficients, but only if they work. */
    for c in range(sum // x_k + 1):
        if T[sum - c * x_k, k - 1]:
            # mark the coefficient of x_k to be c
            coeff[k-1] = c
            RecursivelyListAllThatWork(k - 1, sum - c * x_k)
            # unmark the coefficient of x_k
            coeff[k-1] = 0

RecursivelyListAllThatWork(len(data), target_sum)
My problem is, I don't know where/how to integrate my limiting code into the main code in order to restrict results and allow for negative numbers. When I add a negative number to the list, it displays it but does not include it in the output. I think this is because it is not being added to the table (first function), and I'm not sure how to have it added (and still keep the program's structure so I can scale it with more variables).
Thanks in advance, and if anything is unclear please let me know.
Edit: a bit unrelated (if it detracts from the question just ignore it), but since you're looking at the code already, is there a way I can utilize both CPUs on my machine with this code? Right now when I run it, it only uses one CPU. I know the technical method of parallel computing in Python, but I'm not sure how to logically parallelize this algorithm.
You can restrict results by changing both loops over c from
for c in range(s / x + 1):
to
max_value = int(abs((target_sum * max_percent)/x))
for c in range(max_value + 1):
This will ensure that any coefficient in the final answer will be an integer in the range 0 to max_value inclusive.
A simple way of adding negative values is to change the loop over s from
for s in range(target_sum + 1):
to
R=200 # Maximum size of any partial sum
for s in range(-R,R+1):
Note that if you do it this way then your solution will have an additional constraint.
The new constraint is that the absolute value of every partial weighted sum must be <=R.
(You can make R large to avoid this constraint reducing the number of solutions, but this will slow down execution.)
The complete code looks like:
from collections import defaultdict

data = [-2, 10, 5, 50, 20, 25, 40]
target_sum = 100

# T[x, i] is True if 'x' can be solved
# by a linear combination of data[:i+1]
T = defaultdict(bool)  # all values are False by default
T[0, 0] = True  # base case

R = 200            # Maximum size of any partial sum
max_percent = 0.8  # Maximum weight of any term

for i, x in enumerate(data):  # i is index, x is data[i]
    for s in range(-R, R+1):  # allow negative partial sums up to size R
        max_value = int(abs((target_sum * max_percent)/x))
        for c in range(max_value + 1):
            if T[s - c * x, i]:
                T[s, i+1] = True

coeff = [0]*len(data)

def RecursivelyListAllThatWork(k, sum):  # Using last k variables, make sum
    # /* Base case: If we've assigned all the variables correctly, list this
    #  * solution.
    #  */
    if k == 0:
        # print what we have so far
        print(' + '.join("%2s*%s" % t for t in zip(coeff, data)))
        return
    x_k = data[k-1]
    # /* Recursive step: Try all coefficients, but only if they work. */
    max_value = int(abs((target_sum * max_percent)/x_k))
    for c in range(max_value + 1):
        if T[sum - c * x_k, k - 1]:
            # mark the coefficient of x_k to be c
            coeff[k-1] = c
            RecursivelyListAllThatWork(k - 1, sum - c * x_k)
            # unmark the coefficient of x_k
            coeff[k-1] = 0

RecursivelyListAllThatWork(len(data), target_sum)
I have got this code to solve Newton's method for a given polynomial and an initial guess value. I want to turn it into an iterative process, which is what Newton's method actually is: the program should keep running until the output value x_n becomes constant, and that final value of x_n is the actual root. Also, when used in my algorithm this method should always produce a positive root between 0 and 1, so would converting a negative output (root) into a positive number make any difference? Thank you.
import copy

poly = [[-0.25, 3], [0.375, 2], [-0.375, 1], [-3.1, 0]]

def poly_diff(poly):
    """ Differentiate a polynomial. """
    newlist = copy.deepcopy(poly)
    for term in newlist:
        term[0] *= term[1]
        term[1] -= 1
    return newlist

def poly_apply(poly, x):
    """ Apply a value to a polynomial. """
    sum = 0.0
    for term in poly:
        sum += term[0] * (x ** term[1])
    return sum

def poly_root(poly):
    """ Returns a root of the polynomial"""
    poly_d = poly_diff(poly)
    x = float(raw_input("Enter initial guess:"))
    x_n = x - (float(poly_apply(poly, x)) / poly_apply(poly_d, x))
    print x_n

if __name__ == "__main__":
    poly_root(poly)
First, in poly_diff, you should check to see if the exponent is zero, and if so simply remove that term from the result. Otherwise you will end up with the derivative being undefined at zero.
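A sketch of that change (my wording of the suggestion, using the same [coefficient, exponent] term format as the question):

def poly_diff(poly):
    """ Differentiate a polynomial, dropping constant terms. """
    newlist = []
    for coeff, exponent in poly:
        if exponent != 0:  # constant terms vanish when differentiated
            newlist.append([coeff * exponent, exponent - 1])
    return newlist

With that in place, the iterative poly_root looks like this: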
def poly_root(poly):
    """ Returns a root of the polynomial"""
    poly_d = poly_diff(poly)
    x = None
    x_n = float(raw_input("Enter initial guess:"))
    while x != x_n:
        x = x_n
        x_n = x - (float(poly_apply(poly, x)) / poly_apply(poly_d, x))
    return x_n
That should do it. However, I think it is possible that for certain polynomials this may not terminate, due to floating point rounding error. It may end up in a repeating cycle of approximations that differ only in the least significant bits. You might terminate when the percentage of change reaches a lower limit, or after a number of iterations.
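A sketch of such a safeguard (the tolerance and iteration cap are arbitrary choices of mine; poly_diff and poly_apply are the functions defined above):

def poly_root_bounded(poly, guess, tol=1e-12, max_iter=100):
    poly_d = poly_diff(poly)
    x = guess
    for _ in range(max_iter):
        x_n = x - poly_apply(poly, x) / poly_apply(poly_d, x)
        if abs(x_n - x) <= tol * max(1.0, abs(x_n)):  # relative change is small enough
            return x_n
        x = x_n
    return x  # best approximation found within max_iter iterations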
import copy

poly = [[1, 64], [2, 109], [3, 137], [4, 138], [5, 171], [6, 170]]

def poly_diff(poly):
    newlist = copy.deepcopy(poly)
    for term in newlist:
        term[0] *= term[1]
        term[1] -= 1
    return newlist

def poly_apply(poly, x):
    sum = 0.0
    for term in poly:
        sum += term[0] * (x ** term[1])
    return sum

def poly_root(poly):
    poly_d = poly_diff(poly)
    x = float(input("Enter initial guess:"))
    x_n = x - (float(poly_apply(poly, x)) / poly_apply(poly_d, x))
    print(x_n)

if __name__ == "__main__":
    poly_root(poly)