Complication using log-probabilities - Naive Bayes text classifier - python

I'm constructing a Naive Bayes text classifier from scratch in Python and I am aware that, upon encountering a product of very small probabilities, using a logarithm over the probabilities is a good choice.
The issue now, is that the mathematical function that I'm using has a summation OVER a product of these extremely small probabilities.
To be specific, I'm trying to calculate the total word probabilities given a mixture component (class) over all classes.
Just plainly adding up the logs of these total probabilities is incorrect, since the log of a sum is not equal to the sum of logs.
To give an example, lets say that I have 3 classes, 2000 words and 50 documents.
Then I have a word probability matrix called wordprob with 2000 rows and 3 columns.
The algorithm for the total word probability in this example would look like this:
sum = 0
for j in range(0,3):
prob_product = 1
for i in words: #just the index of words from my vocabulary in this document
prob_product = prob_product*wordprob[i,j]
sum = sum + prob_product
What ends up happening is that prob_product becomes 0 on many iterations due to many small probabilities multiplying with each other.
Since I can't easily solve this with logs (because of the summation in front) I'm totally clueless.
Any help will be much appreciated.

I think you may be best to keep everything in logs. The first part of this, to compute the log of the product is just adding up the log of the terms. The second bit, computing the log of the sum of the exponentials of the logs is a bit trickier.
One way would be to store each of the logs of the products in an array, and then you need a function that, given an array L with n elements, will compute
S = log( sum { i=1..n | exp( L[i])})
One way to do this is to find the maximum, M say, of the L's; a little algebra shows
S = M + log( sum { i=1..n | exp( L[i]-M)})
Each of the terms L[i]-M is non-positive so overflow can't occur. Underflow is not a problem as for them exp will return 0. At least one of them (the one where L[i] is M) will be zero so it's exp will be one and we'll end up with something we can pass to log. In other words the evaluation of the formula will be trouble free.
If you have the function log1p (log1p(x) = log(1+x)) then you could gain some accuracy by omitting the (just one!) i where L[i] == M from the sum, and passing the sum to log1p instead of log.

your question seems on the math side of things rather than the coding of it.
I haven't quite figured out what your issue is but the sum of logs equals the log of the products. Dont know if that helps..
Also, you are calculating one prob_product for every j but you are just using the last one (and you are re-initializing it). you meant to do one of two things: either initialize it before the j-loop or use it before you increment j. Finally, i doesnt look that you need to initialize sum unless this is part of yet another loop you are not showing here.
That's all i have for now.
Sorry for the long post and no code.

High school algebra tells you this:
log(A*B*....*Z) = log(A) + log(B) + ... + log(Z) != log(A + B + .... + Z)

Related

Algorithm-finding-dedicated-sum-from-the-population-of-variables

I need a way of finding an exact value made of the sum of variables chosen from the population. The algorithm can find just the first solution or all. So we can have 10, 20, or 30 different numbers and we will sum some of them to get a desirable number. As an example we have a population of the below numbers: -2,-1,1,2,3,5,8,10 and we try to get 6 - this can be made of 8 and -2, 1 + 5 etc. I need at least 2 decimal places to consider as well for accuracy and ideally, the sum of variables will be exact to the asking value.
Thanks for any advice and help on this:)
I build a model Using the simplex method in Excel but I need the solution in Python.
This is the subset sum problem, which is an NP Complete problem.
There is a known pseudo-polynomial solution for it, if the numbers are integers. In your case, you need to consider numbers only to 2nd decimal point, so you could convert the problem into integers by multiplying by 1001, and then run the pseudo-polynomial algorithm.
It will works quite nicely and efficiently - if the range of numbers you have is quite small (Complexity is O(n*W), where W is the sum of numbers in absolute value).
Appendix:
Pseudo polynomial time solution is Dynamic Programming adaptation of the following recursive formula:
k is the desired number
n is the total number of elements in list.
// stop clause: Found a sum
D(k, i) = true | for all 0 <= i < n
// Stop clause: failing attempt, cannot find sum in this branch.
D(x, n) = false | x != k
// Recursive step, either take the current element or skip it.
D(x, i) = D(x + arr[i], i+1) OR D(x, i+1)
Start from D(0,0)
If this is not the case, and the range of numbers is quite high, you might have to go with brute force solution, of checking all possible subsets. This solution is of course exponential, and processing it is in O(2^n) .
(1) Consider rounding if needed, but that's a simple preprocessing that doesn't affect the answer.

How to iterate through the Cartesian product of ten lists (ten elements each) faster? (Probability and Dice)

I'm trying to solve this task.
I wrote function for this purpose which uses itertools.product() for Cartesian product of input iterables:
def probability(dice_number, sides, target):
from itertools import product
from decimal import Decimal
FOUR_PLACES = Decimal('0.0001')
total_number_of_experiment_outcomes = sides ** dice_number
target_hits = 0
sides_combinations = product(range(1, sides+1), repeat=dice_number)
for side_combination in sides_combinations:
if sum(side_combination) == target:
target_hits += 1
p = Decimal(str(target_hits / total_number_of_experiment_outcomes)).quantize(FOUR_PLACES)
return float(p)
When calling probability(2, 6, 3) output is 0.0556, so works fine.
But calling probability(10, 10, 50) calculates veeery long (hours?), but there must be a better way:)
for side_combination in sides_combinations: takes to long to iterate through huge number of sides_combinations.
Please, can you help me to find out how to speed up calculation of result, i want too sleep tonight..
I guess the problem is to find the distribution of the sum of dice. An efficient way to do that is via discrete convolution. The distribution of the sum of variables is the convolution of their probability mass functions (or densities, in the continuous case). Convolution is an n-ary operator, so you can compute it conveniently just two pmf's at a time (the current distribution of the total so far, and the next one in the list). Then from the final result, you can read off the probabilities for each possible total. The first element in the result is the probability of the smallest possible total, and the last element is the probability of the largest possible total. In between you can figure out which one corresponds to the particular sum you're looking for.
The hard part of this is the convolution, so work on that first. It's just a simple summation, but it's just a little tricky to get the limits of the summation correct. My advice is to work with integers or rationals so you can do exact arithmetic.
After that you just need to construct an appropriate pmf for each input die. The input is just [1, 1, 1, ... 1] if you're using integers (you'll have to normalize eventually) or [1/n, 1/n, 1/n, ..., 1/n] if rationals, where n = number of faces. Also you'll need to label the indices of the output correctly -- again this is just a little tricky to get it right.
Convolution is a very general approach for summations of variables. It can be made even more efficient by implementing convolution via the fast Fourier transform, since FFT(conv(A, B)) = FFT(A) FFT(B). But at this point I don't think you need to worry about that.
If someone still interested in solution which avoids very-very-very long iteration process through all itertools.product Cartesian products, here it is:
def probability(dice_number, sides, target):
if dice_number == 1:
return (1 <= target <= sides**dice_number) / sides
return sum([probability(dice_number-1, sides, target-x) \
for x in range(1,sides+1)]) / sides
But you should add caching of probability function results, if you won't - calculation of probability will takes very-very-very long time as well)
P.S. this code is 100% not mine, i took it from the internet, i'm not such smart to product it by myself, hope you'll enjoy it as much as i.

count number of basketball plays when given the final score

Given the final score of a basketball game, how i can count the number of possible scoring sequences that lead to the final score.
Each score can be one of: 3 point, 2 point, 1 point score by either the visiting or home team. For example:
basketball(3,0)=4
Because these are the 4 possible scoring sequences:
V3
V2, V1
V1, V2
V1, V1, V1
And:
basketball(88,90)=2207953060635688897371828702457752716253346841271355073695508308144982465636968075
Also I need to do it in a recursive way and without any global variables(dictionary is allowed and probably is the way to solve this)
Also, the function can get only the result as an argument (basketball(m,n)).
for those who asked for the solution:
basketballDic={}
def basketball(n,m):
count=0;
if n==0 and m==0:
return 1;
if (n,m) in basketballDic:
return basketballDic[(n,m)]
if n>=3:
count+= basketball(n-3,m)
if n>=2:
count+= basketball(n-2,m)
if n>=1:
count+= basketball(n-1,m)
if m>=3:
count+= basketball(n,m-3)
if m>=2:
count+= basketball(n,m-2)
if m>=1:
count+= basketball(n,m-1)
basketballDic[(n,m)]=count;
return count;
When you're considering a recursive algorithm, there are two things you need to figure out.
What is the base case, where the recursion ends.
What is the recursive case. That is, how can you calculate one value from one or more previous values?
For your basketball problem, the base case is pretty simple. When there's no score, there's exactly one possible set of baskets that has happened to get there (it's the empty set). So basketball(0,0) needs to return 1.
The recursive case is a little more tricky to think about. You need to reduce a given score, say (M,N), step by step until you get to (0,0), counting up the different ways to get each score on the way. There are six possible ways for the score to have changed to get to (M,N) from whatever it was previously (1, 2 and 3-point baskets for each team) so the number of ways to get to (M,N) is the sum of the ways to get to (M-1,N), (M-2,N), (M-3,N), (M,N-1), (M,N-2) and (M,N-3). So those are the recursive calls you'll want to make (after perhaps some bounds checking).
You'll find that a naive recursive implementation takes a very long time to solve for high scores. This is because it calculates the same values over and over (for instance, it may calculate that there's only one way to get to a score of (1,0) hundreds of separate times). Memoization can help prevent the duplicate work by remembering previously calculated results. It's also worth noting that the problem is symmetric (there are the same number of ways of getting a score of (M,N) as there are of getting (N,M)) so you can save a little work by remembering not only the current result, but also its reverse.
There are two ways this can be done, and neither come close to matching your specified outputs. The less relevant way would be to count the maximum possible number of scoring plays. Since basketball has 1 point scores, this will always be equal to the sum of both inputs to our basketball() function. The second technique is counting the minimum number of scoring plays. This can be done trivially with recursion, like so:
def team(x):
if x:
score = 3
if x < 3:
score = 2
if x < 2:
score = 1
return 1 + team(x - score)
else:
return 0
def basketball(x, y):
return team(x) + team(y)
Can this be done more tersely and even more elegantly? Certainly, but this should give you a decent starting point for the kind of stateless, recursive solution you are working on.
tried to reduce from the given result (every possible play- 1,2,3 points) using recursion until i get to 0 but for that i need global variable and i cant use one.
Maybe this is where you reveal what you need. You can avoid a global by passing the current count and/or returning the used count (or remaining count) as needed.
In your case I think you would just pass the points to the recursive function and have it return the counts. The return values would be added so the final total would roll-up as the recursion unwinds.
Edit
I wrote a function that was able to generate correct results. This question is tagged "memoization", using it gives a huge performance boost. Without it, the same sub-sequences are processed again and again. I used a decorator to implement memoization.
I liked #Maxwell's separate handling of teams, but that approach will not generate the numbers you are looking for. (Probably because your original wording was not at all clear, I've since rewritten your problem description). I wound up processing the 6 home and visitor scoring possibilities in a single function.
My counts were wrong until I realized what I needed to count was the number of times I hit the terminal condition.
Solution
Other solutions have been posted. Here's my (not very readable) one-liner:
def bball(vscore, hscore):
return 1 if vscore == 0 and hscore == 0 else \
sum([bball(vscore-s, hscore) for s in range(1,4) if s <= vscore]) + \
sum([bball(vscore, hscore-s) for s in range(1,4) if s <= hscore])
Actually I also have this line just before the function definition:
#memoize
I used Michele Simionato's decorator module and memoize sample. Though as #Blckknght mentioned the function is commutative so you could customize memoize to take advantage of this.
While I like the separation of concerns provided by generic memoization, I'm also tempted to initialize the cache with (something like):
cache = {(0, 0): 1}
and remove the special case check for 0,0 args in the function.

Finite Metric Embeddings: Good Algorithm?

I have a finite metric space given as a (symmetric) k by k distance matrix. I would like an algorithm to (approximately) isometrically embed this in euclidean space R^(k-1). While it is not always possible to do exactly by solving the system of equations given by the distances I am looking for a solution that embeds with some (very small) controllable error.
I currently use multidimensional scaling (MDS) with the output dimension set to (k-1). It occurs to me that in general MDS may be optimized for the situation where you are trying to reduce the ambient embedding dimension to something less then (k-1) (typically 2 or 3) and that there may be a better algorithm for my restricted case.
Question: What is a good/fast algorithm for realizing a metric space of size k in R^{k-1} using euclidean distance?
Some parameters and pointers:
(1) My k's are relatively small. Say 3 < k < 25
(2) I don't actually care if I embed in R^{k-1}. If it simplifies things/makes things faster any R^N would also be fine as long as it's isometric. I'm happy if there's a faster algorithm or one with less error if I increase to R^k or R^(2k+1).
(3) If you can point to a python implementation I'll be even happier.
(4) Anything better then MDS would work.
Why not try LLE , you can also find the code there.
Ok, as promised here a simple solution:
Notation: Let d_{i,j}=d_{j,i} denote the squared distance between points i and j. Let N be the number of points. Let p_i denote the point i and p_{i,k} the k-th coordinate of the point.
Let's hope I derive the algorithm correctely now. There will be some explanations afterwards so you can check the derivation (I hate it when many indexes appear).
The algorithm uses dynamic programming to arrive at the correct solution
Initialization:
p_{i,k} := 0 for all i and k
Calculation:
for i = 2 to N do
sum = 0
for j = 2 to i - 1 do
accu = d_{i,j} - d_{i,1} - p_{j,j-1}^2
for k = 1 to j-1 do
accu = accu + 2 p_{i,k}*p_{j,k} - p_{j,k}^2
done
p_{i,j} = accu / ( 2 p_{j,j-1} )
sum = sum + p_{i,j}^2
done
p_{i,i-1} = sqrt( d_{i,0} - sum )
done
If I have not done any grave index mistakes (I usually do) this should do the trick.
The idea behind this:
We set the first point arbitrary at the origin to make our life easier. Not that for a point p_i we never set a coordinate k when k > i-1. I.e. for the second point we only set the first coordinate, for the third point we only set first and second coordinate etc.
Now let's assume we have coordinates for all points p_{k'} where k'<i and we want to calculate the coordinates for p_{i} so that all d_{i,k'} are satisfied (we do not care yet about any constraints for points with k>k'). If we write down the set of equations we have
d_{i,j} = \sum_{k=1}^N (p_{i,k} - p_{j,k} )^2
Because both p_{i,k} and p_{j,k} are equal to zero for k>k' we can reduce this to:
d_{i,j} = \sum_{k=1}^k' (p_{i,k} - p_{j,k} )^2
Also note that by the loop invariant all p_{j,k} will be zero when k>j-1. So we split this equation:
d_{i,j} = \sum_{k=1}^{j-1} (p_{i,k} - p_{j,k} )^2 + \sum_{k=j}^{k'} p_{i,j}^2
For the first equation we just get:
d_{i,1} = \sum_{k=1}^{j-1} (p_{i,k} - p_{j,k} )^2 + \sum_{k=j}^{k'} p_i{i,1}^2
This one will need some special treatment later.
Now if we solve all binomials in the normal equation we get:
d_{i,j} = \sum_{k=1}^{j-1} p_{i,k}^2 - 2 p_{i,k} p_{j,k} + p_{j,k}^2 + \sum_{k=j}^{k'} p_{i,j}^2
subtract the first equation from this and you are left with:
d_{i,j} - d_{i,1} = \sum_{k=1}^{j-1} p_{j,k}^2 - 2 p_{i,k} p_{j,k}
for all j > 1.
If you look at this you'll note that all squares of coordinates of p_i are gone and the only squares we need are already known. This is a set of linear equations that can easily be solved using methods from linear algebra. Actually there is one more special thing about this set of equations: The equations already are in triangular form, so you only need the final step of propagating the solutions. For the final step we are left with one single quadratic equation that we can just solve by taking one square root.
I hope you can follow my reasoning. It's a bit late and my head is a bit spinning from those indexes.
EDIT: Yes, there were indexing mistakes. Fixed. I'll try to implement this in python when I have time in order to test it.

What category of combinatorial problems appear on the logic games section of the LSAT?

EDIT: See Solving "Who owns the Zebra" programmatically? for a similar class of problem
There's a category of logic problem on the LSAT that goes like this:
Seven consecutive time slots for a broadcast, numbered in chronological order I through 7, will be filled by six song tapes-G, H, L, O, P, S-and exactly one news tape. Each tape is to be assigned to a different time slot, and no tape is longer than any other tape. The broadcast is subject to the following restrictions:
L must be played immediately before O.
The news tape must be played at some time after L.
There must be exactly two time slots between G and
P, regardless of whether G comes before P or whether G comes after P.
I'm interested in generating a list of permutations that satisfy the conditions as a way of studying for the test and as a programming challenge. However, I'm not sure what class of permutation problem this is. I've generalized the type problem as follows:
Given an n-length array A:
How many ways can a set of n unique items be arranged within A? Eg. How many ways are there to rearrange ABCDEFG?
If the length of the set of unique items is less than the length of A, how many ways can the set be arranged within A if items in the set may occur more than once? Eg. ABCDEF => AABCDEF; ABBCDEF, etc.
How many ways can a set of unique items be arranged within A if the items of the set are subject to "blocking conditions"?
My thought is to encode the restrictions and then use something like Python's itertools to generate the permutations. Thoughts and suggestions are welcome.
This is easy to solve (a few lines of code) as an integer program. Using a tool like the GNU Linear Programming Kit, you specify your constraints in a declarative manner and let the solver come up with the best solution. Here's an example of a GLPK program.
You could code this using a general-purpose programming language like Python, but this is the type of thing you'll see in the first few chapters of an integer programming textbook. The most efficient algorithms have already been worked out by others.
EDIT: to answer Merjit's question:
Define:
matrix Y where Y_(ij) = 1 if tape i
is played before tape j, and 0
otherwise.
vector C, where C_i
indicates the time slot when i is
played (e.g. 1,2,3,4,5,6,7)
Large
constant M (look up the term for
"big M" in an optimization textbook)
Minimize the sum of the vector C subject to the following constraints:
Y_(ij) != Y_(ji) // If i is before j, then j must not be before i
C_j < C_k + M*Y_(kj) // the time slot of j is greater than the time slot of k only if Y_(kj) = 1
C_O - C_L = 1 // L must be played immediately before O
C_N > C_L // news tape must be played at some time after L
|C_G - C_P| = 2 // You will need to manipulate this a bit to make it a linear constraint
That should get you most of the way there. You want to write up the above constraints in the MathProg language's syntax (as shown in the links), and make sure I haven't left out any constraints. Then run the GLPK solver on the constraints and see what it comes up with.
Okay, so the way I see it, there are two ways to approach this problem:
Go about writing a program that will approach this problem head first. This is going to be difficult.
But combinatorics teaches us that the easier way to do this is to count all permutations and subtract the ones that don't satisfy your constraints.
I would go with number 2.
You can find all permutations of a given string or list by using this algorithm. Using this algorithm, you can get a list of all permutations. You can now apply a number of filters on this list by checking for the various constraints of the problem.
def L_before_O(s):
return (s.index('L') - s.index('O') == 1)
def N_after_L(s):
return (s.index('L') < s.index('N'))
def G_and_P(s):
return (abs(s.index('G') - s.index('P')) == 2)
def all_perms(s): #this is from the link
if len(s) <=1:
yield s
else:
for perm in all_perms(s[1:]):
for i in range(len(perm)+1):
yield perm[:i] + s[0:1] + perm[i:]
def get_the_answer():
permutations = [i for i in all_perms('GHLOPSN')] #N is the news tape
a = [i for i in permutations if L_before_O(i)]
b = [i for i in a if N_after_L(i)]
c = [i for i in b if G_and_P(i)]
return c
I haven't tested this, but this is general idea of how I would go about coding such a question.
Hope this helps

Categories