I am trying to calculate the Theil index in Python and R, but the given functions produce different answers. Here is the formula I am trying to use:

T = (1/N) * sum_i( (x_i / mean(x)) * ln(x_i / mean(x)) )
Using the ineq package in R, I can easily get the Theil index:
library(ineq)
x=c(26.1,16.1,15.5,15.4,14.8,14.7,13.7,12.1,11.7,11.6,11,10.8,10.8,7.5)
Theil(x)
0.04152699
This implementation seems to make sense, and I can look at the code provided to see exactly what calculations are happening; it seems to follow the formula (deleting zeros when I have them, in order to take the log):
getAnywhere(Theil)
A single object matching ‘Theil’ was found
It was found in the following places
package:ineq
namespace:ineq
with value
function (x, parameter = 0, na.rm = TRUE)
{
    if (!na.rm && any(is.na(x)))
        return(NA_real_)
    x <- as.numeric(na.omit(x))
    if (is.null(parameter))
        parameter <- 0
    if (parameter == 0) {
        x <- x[!(x == 0)]
        Th <- x/mean(x)
        Th <- sum(x * log(Th))
        Th <- Th/sum(x)
    }
    else {
        Th <- exp(mean(log(x)))/mean(x)
        Th <- -log(Th)
    }
    Th
}
However, I see that this question has been answered previously for Python here. The code is below, but the answers do not match for some reason:
import math
import numpy as np

def T(x):
    n = len(x)
    maximum_entropy = math.log(n)
    actual_entropy = H(x)
    redundancy = maximum_entropy - actual_entropy
    inequality = 1 - math.exp(-redundancy)
    return redundancy, inequality

def Group_negentropy(x_i):
    if x_i == 0:
        return 0
    else:
        return x_i * math.log(x_i)

def H(x):
    n = len(x)
    entropy = 0.0
    summ = 0.0
    for x_i in x:  # work on all x[i]
        summ += x_i
        group_negentropy = Group_negentropy(x_i)
        entropy += group_negentropy
    return -entropy
x=np.array([26.1,16.1,15.5,15.4,14.8,14.7,13.7,12.1,11.7,11.6,11,10.8,10.8,7.5])
T(x)
(512.62045438815949, 1.0)
It is not stated explicitly in the other question, but that implementation expects its input to be normalized, so that each x_i is a proportion of income, not an actual amount. (That's why the other code has that error_if_not_in_range01 function and raises an error if any x_i is not between 0 and 1.)
If you normalize your x, you'll get the same result as the R code:
>>> T(x/x.sum())
(0.041526988117662533, 0.0406765553418974)
(The first value there is what R is reporting.)
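For reference, here is a minimal NumPy sketch of the same calculation that accepts raw (unnormalized) amounts directly, mirroring the parameter = 0 branch of ineq::Theil above (theil is a hypothetical helper name, not from either post):

import numpy as np

def theil(x):
    # Theil T index for raw amounts, as in ineq::Theil with parameter = 0
    x = np.asarray(x, dtype=float)
    x = x[x != 0]  # drop zeros so the log is defined
    return np.sum(x * np.log(x / x.mean())) / np.sum(x)

x = np.array([26.1, 16.1, 15.5, 15.4, 14.8, 14.7, 13.7, 12.1,
              11.7, 11.6, 11, 10.8, 10.8, 7.5])
print(theil(x))  # ~0.041527, matching R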
Given two sequences A and B of the same length: one is strictly increasing, the other is strictly decreasing.
It is required to find an index i such that the absolute value of the difference between A[i] and B[i] is minimal. If there are several such indices, the answer is the smallest of them. The input sequences are standard Python arrays, and it is guaranteed that they have the same length. Efficiency requirement: the asymptotic complexity must be at most a power of the logarithm of the length of the input sequences.
I have implemented the index lookup using the golden-section method, but I am bothered by the use of floating-point arithmetic. Is it possible to improve this algorithm so as to avoid it, or can you come up with a more concise solution?
import random
import math

def peak(A, B):
    def f(x):
        return abs(A[x] - B[x])

    phi_inv = 1 / ((math.sqrt(5) + 1) / 2)

    def cal_x1(left, right):
        return right - (round((right - left) * phi_inv))

    def cal_x2(left, right):
        return left + (round((right - left) * phi_inv))

    left, right = 0, len(A) - 1
    x1, x2 = cal_x1(left, right), cal_x2(left, right)
    while x1 < x2:
        if f(x1) > f(x2):
            left = x1
            x1 = x2
            x2 = cal_x1(x1, right)
        else:
            right = x2
            x2 = x1
            x1 = cal_x2(left, x2)
    if x1 > 1 and f(x1 - 2) <= f(x1 - 1): return x1 - 2
    if x1 + 2 < len(A) and f(x1 + 2) < f(x1 + 1): return x1 + 2
    if x1 > 0 and f(x1 - 1) <= f(x1): return x1 - 1
    if x1 + 1 < len(A) and f(x1 + 1) < f(x1): return x1 + 1
    return x1

# value check
def make_arr(inv):
    x = set()
    while len(x) != 1000:
        x.add(random.randint(-10000, 10000))
    x = sorted(list(x), reverse=inv)
    return x

x = make_arr(0)
y = make_arr(1)
needle = 1000000
c = 0
for i in range(1000):
    if abs(x[i] - y[i]) < needle:
        c = i
        needle = abs(x[i] - y[i])
print(c)
print(peak(x, y))
Approach
The poster asks about alternative, simpler solutions to the posted code.

The problem is a variant of LeetCode Problem 852, where the goal is to find the peak index in a mountain array. Since A is strictly increasing and B is strictly decreasing, A[i] - B[i] is strictly increasing, so abs(A[i] - B[i]) is unimodal: it falls to a single valley and then rises. We convert to a peak, rather than a minimum, by computing the negative of the absolute difference. Our approach is to modify this Python solution to the LeetCode problem.
Code
def binary_search(x, y):
    ''' Mod of https://walkccc.me/LeetCode/problems/0852/ to use a function '''
    def f(m):
        ' Negated absolute difference at index m of the two arrays '
        return -abs(x[m] - y[m])  # make negative so we are looking for a peak

    # find the peak using binary search
    l = 0
    r = len(x) - 1  # was len(arr), which is undefined
    while l < r:
        m = (l + r) // 2
        if f(m) < f(m + 1):  # check if increasing
            l = m + 1
        else:
            r = m  # was decreasing
    return l
Test
import random

def linear_search(A, B):
    ' Linear search method '
    values = [abs(ai - bi) for ai, bi in zip(A, B)]
    return values.index(min(values))  # linear search

def make_arr(inv):
    random.seed(10)  # added so we can repeat with the same data
    x = set()
    while len(x) != 1000:
        x.add(random.randint(-10000, 10000))
    x = sorted(list(x), reverse=inv)
    return x

# Create data
x = make_arr(0)
y = make_arr(1)

# Run search methods
print(f'Linear Search Solution {linear_search(x, y)}')
print(f'Golden Section Search Solution {peak(x, y)}')  # posted code
print(f'Binary Search Solution {binary_search(x, y)}')
Output
Linear Search Solution 499
Golden Section Search Solution 499
Binary Search Solution 499
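As a quick sanity check of the unimodality argument on a small hand-made example (assuming binary_search as defined above):

A = [1, 4, 6, 9, 12]   # strictly increasing
B = [11, 8, 7, 3, 2]   # strictly decreasing
print([abs(a - b) for a, b in zip(A, B)])  # [10, 4, 1, 6, 10] -- a single valley
print(binary_search(A, B))                 # 2, the index of the minimum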
I'm trying to compute the upper bound on the predictability of my occupancy dataset, as in Song's 'Limits of Predictability in Human Mobility' paper. Basically, home (=1) and not at home (=0) represent the visited locations (towers) in Song's paper.
I tested my code (which I derived from https://github.com/gavin-s-smith/MobilityPredictabilityUpperBounds and https://github.com/gavin-s-smith/EntropyRateEst) on a random binary sequence, which should return an entropy of 1 and a predictability of 0.5. Instead, the returned entropy is 0.87 and the predictability 0.71.
Here's my code:
import numpy as np
from scipy.optimize import fsolve
from cmath import log
import math

def matchfinder(data):
    data_len = len(data)
    output = np.zeros(len(data))
    output[0] = 1

    # Using the L_n definition from "Nonparametric Entropy Estimation for
    # Stationary Processes and Random Fields, with Applications to English
    # Text" by Kontoyiannis et al.:
    # $L_{n} = 1 + max \{l : 0 \leq l \leq n, X^{l-1}_{0} = X^{-j+l-1}_{-j} \text{ for some } l \leq j \leq n \}$
    #
    # For each position i in the sub-sequence that occurs before the current
    # position start_idx, check the maximum continuously equal string we can
    # make by simultaneously extending from i and start_idx.
    for start_idx in range(1, data_len):
        max_subsequence_matched = 0
        for i in range(0, start_idx):
            j = 0
            # increase the length of the substring starting at i and start_idx
            # while they are the same, keeping track of the length
            while (start_idx + j < data_len) and (i + j < start_idx) and (data[i + j] == data[start_idx + j]):
                j = j + 1
            if j > max_subsequence_matched:
                max_subsequence_matched = j
        # L_n is obtained by adding 1 to the longest match length
        output[start_idx] = max_subsequence_matched + 1
    return output

if __name__ == '__main__':
    # Read dataset
    data = np.random.randint(2, size=2000)
    # Number of distinct locations
    N = len(np.unique(data))
    # True entropy
    lambdai = matchfinder(data)
    Etrue = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
    S = Etrue
    # Use Fano's inequality to compute the predictability
    func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
    ub = fsolve(func, 0.9)[0]
    print(ub)
The matchfinder function estimates the entropy by looking for the longest previous match at each position and adding 1 to it (= the length of the shortest substring not previously seen). The predictability is then computed using Fano's inequality.
What could be the problem?
Thanks!
The entropy function seems to be wrong.
Referring to the paper you mentioned (Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018–1021), the real entropy is estimated by an algorithm based on Lempel-Ziv data compression:

S_est = ( (1/n) * sum_i(Lambda_i) )^(-1) * ln(n)

In code it would look like this:

Etrue = math.pow((np.sum(lambdai) / n), -1) * log(n, 2).real
Here n is the length of the time series.
Notice that we used a different base for the logarithm than in the given formula. However, since the base for the logarithm in Fano's inequality is 2, it seems logical to use the same base for the entropy calculation. Also, I'm not sure why you started the sum from the first index instead of the zeroth.
So now wrapping that up into function for example:
def solve(locations, size):
    data = np.random.randint(locations, size=size)
    N = len(np.unique(data))
    n = float(len(data))
    print("Distinct locations: %i" % N)
    print("Time series length: %i" % n)
    # True entropy
    lambdai = matchfinder(data)
    # S = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
    Etrue = math.pow((np.sum(lambdai) / n), -1) * log(n, 2).real
    S = Etrue
    print("Maximum entropy: %2.5f" % log(locations, 2).real)
    print("Real entropy: %2.5f" % S)
    # Fano's inequality gives the upper bound of predictability
    func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
    ub = fsolve(func, 0.9)[0]
    print("Upper bound of predictability: %2.5f" % ub)
    return ub
Output for 2 locations
Distinct locations: 2
Time series length: 10000
Maximum entropy: 1.00000
Real entropy: 1.01441
Upper bound of predictability: 0.50013
Output for 3 locations
Distinct locations: 3
Time series length: 10000
Maximum entropy: 1.58496
Real entropy: 1.56567
Upper bound of predictability: 0.41172
The Lempel-Ziv estimate converges to the real entropy as n approaches infinity; that is why in the 2-location case it comes out slightly above the theoretical maximum.
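One way to see that convergence empirically (a sketch, simply reusing the solve function above) is to run it for increasing series lengths and watch the estimated entropy settle toward the maximum:

# reusing solve() from above; for random data the estimate should
# approach log2(locations) as the series length grows
for size in (1000, 10000, 100000):
    solve(2, size)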
I am also not sure whether you interpreted the definition of lambda correctly. It is defined as "the length of the shortest substring starting at position i which doesn't previously appear from position 1 to i-1", so once we reach a point where no further substring is unique, your matching algorithm will always give a length one higher than the matched substring, while it should instead equal 0, since no unique substring exists there.
To make it clearer, let's give a simple example. If the array of positions looks like this:
[1 0 0 1 0 0]
Then we can see that after the first three positions the pattern repeats. That means that from the fourth location on, the shortest unique substring does not exist, so it equals 0. The output (lambda) should therefore look like this:
[1 1 2 0 0 0]
However, your function for that case would return:
[1 1 2 4 3 2]
I rewrote your matching function to handle that problem:
def matchfinder2(data):
    data_len = len(data)
    output = np.zeros(len(data))
    output[0] = 1
    for start_idx in range(1, data_len):
        max_subsequence_matched = 0
        for i in range(0, start_idx):
            j = 0
            end_distance = data_len - start_idx  # length left to the end of the sequence (including the current index)
            while (start_idx + j < data_len) and (i + j < start_idx) and (data[i + j] == data[start_idx + j]):
                j = j + 1
            if j == end_distance:  # check if j has reached the end of the sequence
                output[start_idx:] = np.zeros(end_distance)  # if yes, fill the rest of the output with zeros
                return output  # end function
            elif j > max_subsequence_matched:
                max_subsequence_matched = j
        output[start_idx] = max_subsequence_matched + 1
    return output
The differences are small, of course, because the result changes only for a small part of each sequence.
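To make the difference concrete, running both versions on the example sequence from above (assuming matchfinder and matchfinder2 as defined) gives:

import numpy as np

data = np.array([1, 0, 0, 1, 0, 0])
print(matchfinder(data))   # [1. 1. 2. 4. 3. 2.]
print(matchfinder2(data))  # [1. 1. 2. 0. 0. 0.]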
I'm trying to implement Newton's method for fun in Python but I'm having a problem conceptually understanding the placement of the check.
A refresher on Newton's method, as a way to approximate roots via repeated linear approximation through differentiation:

x_{n+1} = x_n - f(x_n) / f'(x_n)
I have the following code:
# x_1 = x_0 - (f(x_0)/f'(x_0))
# x_n+1 - x_n = precision
def newton_method(f, f_p, prec=0.01):
    x = 1
    x_p = 1
    tmp = 0
    while True:
        tmp = x
        x = x_p - (f(x_p) / float(f_p(x_p)))
        if abs(x - x_p) < prec:
            break
        x_p = tmp
    return x
This works; however, if I move the if statement in the loop to after the x_p = tmp line, the function ceases to work as expected, like so:
# x_1 = x_0 - (f(x_0)/f'(x_0))
# x_n+1 - x_n = precision
def newton_method(f, f_p, prec=0.01):
    x = 1
    x_p = 1
    tmp = 0
    while True:
        tmp = x
        x = x_p - (f(x_p) / float(f_p(x_p)))
        x_p = tmp
        if abs(x - x_p) < prec:
            break
    return x
To clarify: function v1 (the first piece of code) works as expected, while function v2 (the second) does not.
Why is this the case?
Isn't the original version essentially checking the current x versus the x from 2 assignments back, rather than the immediately previous x ?
Here is the test code I am using:
def f(x):
    return x * x - 5

def f_p(x):
    return 2 * x

newton_method(f, f_p)
EDIT
I ended up using this version of the code, which forgoes the tmp variable and is much clearer for me, conceptually:
# x_1 = x_0 - (f(x_0)/f'(x_0))
# x_n+1 - x_n = precision
def newton_method(f, f_p, prec=0.01):
    x = 1
    x_p = 1
    while True:
        x = x_p - (f(x_p) / float(f_p(x_p)))
        if abs(x - x_p) < prec:
            break
        x_p = x
    return x
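With the test functions f and f_p from above, this version converges as expected (a quick check):

print(newton_method(f, f_p))  # ~2.23607, close to sqrt(5)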
Let x[i] be the new value to be computed in an iteration.
What is happening in version 1:
The statement x = x_p - (f(x_p)/float(f_p(x_p))) translates to:

x[i] = x[i-2] - f(x[i-2])/f'(x[i-2])    ... (1)

But according to the actual mathematical formula, it should have been this:

x[i] = x[i-1] - f(x[i-1])/f'(x[i-1])

Similarly, x[i-1] = x[i-2] - f(x[i-2])/f'(x[i-2])    ... (2)

Comparing (1) and (2), we can see that the x[i] in (1) is actually x[i-1] according to the math formula.
The main point to note here is that x and x_p are always one iteration apart. That is, x is the actual successor to x_p, unlike what it might seem by just looking at the code.
Hence, it is working correctly as expected.
What is happening in version 2:
Just like the above case, the same thing happens at the statement x = x_p - (f(x_p)/float(f_p(x_p))).
But by the time we reach if (abs(x-x_p) < prec), x_p has changed its value to tmp = x = x[i-1].
But as deduced in the case of version 1, x too is x[i-1] rather than x[i].
So, abs(x - x_p) translates to abs(x[i-1] - x[i-1]), which turns out to be 0, terminating the iteration.
The main point to note here is that x and x_p are numerically the same value, which always results in the algorithm terminating after just one iteration.
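Tracing version 2 by hand with the question's test functions (f(x) = x*x - 5, f_p(x) = 2*x, starting from x = x_p = 1, prec = 0.01) shows exactly this:

# pass 1: tmp = 1;   x = 1 - (-4)/2 = 3.0;  x_p = tmp = 1
#         -> |x - x_p| = 2.0, continue
# pass 2: tmp = 3.0; x = 1 - (-4)/2 = 3.0 (x_p is still 1); x_p = tmp = 3.0
#         -> |x - x_p| = 0.0 < prec, loop breaks and returns 3.0, not sqrt(5)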
Update
This saves the current value of x:

tmp = x

This statement computes the next value of x from the current value x_p:

x = x_p - (f(x_p)/float(f_p(x_p)))

If converged (i.e., next value - current value < threshold), break:

if (abs(x-x_p) < prec):
    break

Set x_p for the next iteration:

x_p = tmp
If you pull x_p = tmp above the if statement, you are actually checking x vs x from 2 iterations ago, which is not what you want to do. This actually causes odd behavior where the correctness of the outcome depends on the starting values: if you start x at 7 you get the correct response, whereas if you start with 1, you will not.
To test it out and see why, it can be helpful to add a print statement, as below.

def newton_method(f, f_p, prec=0.01):
    x = 7
    x_p = 1
    tmp = 0
    while True:
        tmp = x
        x = x_p - (f(x_p) / float(f_p(x_p)))
        print(x, x_p, tmp)
        if abs(x - x_p) < prec:
            break
        x_p = tmp
Are you trying to check x vs x from 2 iterations ago, or x from the previous iteration of the loop?
If you have x_p = tmp before the if statement, then if (abs(x-x_p) < prec): will check the current value of x against the previous value of x, instead of x from 2 assignments ago.
I am trying to implement Theil's index (http://en.wikipedia.org/wiki/Theil_index) in Python to measure inequality of revenue in a list.
The formula is basically Shannon's entropy, so it deals with logs. My problem is that I have a few revenues at 0 in my list, and log(0) makes my formula unhappy. I believe adding a tiny float to 0 wouldn't work, since log(tinyFloat) is a hugely negative number, and that would mess up my index.
[EDIT]
Here's a snippet (taken from another, much cleaner and freely available implementation):
from math import log, exp

def error_if_not_in_range01(value):
    if (value <= 0) or (value > 1):
        raise Exception(str(value) + ' is not in (0,1]!')

def H(x):
    n = len(x)
    entropy = 0.0
    sum = 0.0
    for x_i in x:  # work on all x[i]
        print(x_i)
        error_if_not_in_range01(x_i)
        sum += x_i
        group_negentropy = x_i * log(x_i)
        entropy += group_negentropy
    error_if_not_1(sum)  # defined elsewhere in the quoted implementation
    return -entropy

def T(x):
    print(x)
    n = len(x)
    maximum_entropy = log(n)
    actual_entropy = H(x)
    redundancy = maximum_entropy - actual_entropy
    inequality = 1 - exp(-redundancy)
    return redundancy, inequality
Is there any way out of this problem?
If I understand you correctly, the formula you are trying to implement is the following:

T = (1/N) * sum_i( (x_i / mean(x)) * ln(x_i / mean(x)) )

In this case, your problem is calculating the natural logarithm of x_i / mean(x) when x_i = 0.
However, since that logarithm has to be multiplied by x_i / mean(x) first, if x_i == 0 the value of ln(x_i / mean(x)) doesn't matter, because it will be multiplied by zero. You can treat the value of the formula for that entry as zero, and skip calculating the logarithm entirely.
In the case that you are implementing Shannon's formula directly, the same holds:

H(X) = -sum_i( p_i * ln(p_i) )

In both forms, calculating the log is unnecessary when p_i == 0, because whatever value it has, it will be multiplied by zero.
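This is also consistent with the limit x*ln(x) -> 0 as x -> 0+, which a quick check illustrates:

import math

for x in (1e-3, 1e-6, 1e-12):
    print(x, x * math.log(x))
# 0.001 -> -0.0069..., 1e-06 -> -1.4e-05, 1e-12 -> -2.8e-11: tending to 0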
UPDATE:
Given the code you quoted, you can replace x_i*log(x_i) with a function as follows:
def Group_negentropy(x_i):
    if x_i == 0:
        return 0
    else:
        return x_i * log(x_i)

def H(x):
    n = len(x)
    entropy = 0.0
    sum = 0.0
    for x_i in x:  # work on all x[i]
        print(x_i)
        error_if_not_in_range01(x_i)
        sum += x_i
        group_negentropy = Group_negentropy(x_i)
        entropy += group_negentropy
    error_if_not_1(sum)
    return -entropy
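As an aside, the same zero-guard can be written without an explicit loop. Here is a vectorized NumPy sketch under the same assumption that the input holds proportions summing to 1 (H_vec is just an illustrative name):

import numpy as np

def H_vec(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # skip zero entries: x*log(x) -> 0 as x -> 0
    return -np.sum(nz * np.log(nz))

print(H_vec([0.5, 0.5, 0.0]))  # log(2) ~ 0.6931; the zero contributes nothing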
I'm having a bit of trouble controlling the results from a data-generating algorithm I am working on. Basically, it takes values from a list and then lists all the different combinations that reach a specific sum. So far the code works fine (I haven't tested scaling it with many variables yet), but I need to allow negative numbers to be included in the list.
The way I think I can solve this problem is to put a collar on the possible results, to prevent an infinity of results (if apples are worth 2 and oranges -1, then for any sum there will be infinitely many solutions, but if I put a limit on either then it cannot go on forever).
So here's super basic code that computes each value's cap:
import math

data = [-2, 10, 5, 50, 20, 25, 40]
target_sum = 100
max_percent = .8  # no value can exceed 80% of the total (this is to prevent infinite solutions)

for node in data:
    max_value = abs(math.floor((target_sum * max_percent) / node))
    print(node, "'s max value is ", max_value)
Here's the code that generates the results (the first function builds a table of which sums are possible, and the second composes the actual results; details/pseudocode of the algorithm are here: Can brute force algorithms scale?):
from collections import defaultdict

data = [-2, 10, 5, 50, 20, 25, 40]
target_sum = 100

# T[x, i] is True if 'x' can be solved
# by a linear combination of data[:i+1]
T = defaultdict(bool)  # all values are False by default
T[0, 0] = True         # base case

for i, x in enumerate(data):         # i is index, x is data[i]
    for s in range(target_sum + 1):  # range one higher than the sum, to include the sum itself
        for c in range(s // x + 1):
            if T[s - c * x, i]:
                T[s, i + 1] = True

coeff = [0] * len(data)

def RecursivelyListAllThatWork(k, sum):  # using the last k variables, make sum
    # Base case: if we've assigned all the variables correctly, list this solution.
    if k == 0:
        # print what we have so far
        print(' + '.join("%2s*%s" % t for t in zip(coeff, data)))
        return
    x_k = data[k - 1]
    # Recursive step: try all coefficients, but only if they work.
    for c in range(sum // x_k + 1):
        if T[sum - c * x_k, k - 1]:
            # mark the coefficient of x_k to be c
            coeff[k - 1] = c
            RecursivelyListAllThatWork(k - 1, sum - c * x_k)
            # unmark the coefficient of x_k
            coeff[k - 1] = 0

RecursivelyListAllThatWork(len(data), target_sum)
My problem is that I don't know where/how to integrate my limiting code into the main code in order to restrict results and allow for negative numbers. When I add a negative number to the list, it displays it but does not include it in the output. I think this is because it is not being added to the table (first function), and I'm not sure how to add it (while still keeping the program's structure so I can scale it with more variables).
Thanks in advance and if anything is unclear please let me know.
Edit: this is a bit unrelated (if it detracts from the question, just ignore it), but since you're looking at the code already: is there a way I can utilize both CPUs on my machine with this code? Right now, when I run it, it only uses one CPU. I know the technical methods of parallel computing in Python, but I'm not sure how to logically parallelize this algorithm.
You can restrict results by changing both loops over c from
for c in range(s // x + 1):
to
max_value = int(abs((target_sum * max_percent)/x))
for c in range(max_value + 1):
This will ensure that any coefficient in the final answer will be an integer in the range 0 to max_value inclusive.
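For the data in the question, with target_sum = 100 and max_percent = 0.8, these caps work out as follows (a quick check):

data = [-2, 10, 5, 50, 20, 25, 40]
target_sum, max_percent = 100, 0.8
for x in data:
    print(x, int(abs((target_sum * max_percent) / x)))
# caps: -2 -> 40, 10 -> 8, 5 -> 16, 50 -> 1, 20 -> 4, 25 -> 3, 40 -> 2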
A simple way of adding negative values is to change the loop over s from
for s in range(target_sum + 1):
to
R=200 # Maximum size of any partial sum
for s in range(-R,R+1):
Note that if you do it this way then your solution will have an additional constraint.
The new constraint is that the absolute value of every partial weighted sum must be <=R.
(You can make R large to avoid this constraint reducing the number of solutions, but this will slow down execution.)
The complete code looks like:
from collections import defaultdict

data = [-2, 10, 5, 50, 20, 25, 40]
target_sum = 100

# T[x, i] is True if 'x' can be solved
# by a linear combination of data[:i+1]
T = defaultdict(bool)  # all values are False by default
T[0, 0] = True         # base case

R = 200            # maximum size of any partial sum
max_percent = 0.8  # maximum weight of any term

for i, x in enumerate(data):    # i is index, x is data[i]
    for s in range(-R, R + 1):  # allow negative partial sums up to size R
        max_value = int(abs((target_sum * max_percent) / x))
        for c in range(max_value + 1):
            if T[s - c * x, i]:
                T[s, i + 1] = True

coeff = [0] * len(data)

def RecursivelyListAllThatWork(k, sum):  # using the last k variables, make sum
    # Base case: if we've assigned all the variables correctly, list this solution.
    if k == 0:
        # print what we have so far
        print(' + '.join("%2s*%s" % t for t in zip(coeff, data)))
        return
    x_k = data[k - 1]
    # Recursive step: try all coefficients, but only if they work.
    max_value = int(abs((target_sum * max_percent) / x_k))
    for c in range(max_value + 1):
        if T[sum - c * x_k, k - 1]:
            # mark the coefficient of x_k to be c
            coeff[k - 1] = c
            RecursivelyListAllThatWork(k - 1, sum - c * x_k)
            # unmark the coefficient of x_k
            coeff[k - 1] = 0

RecursivelyListAllThatWork(len(data), target_sum)