Unknown pattern detection in a list of numbers - python

I have a sequence of numbers that follow some kind of arbitrary rule, let's imagine the following 5 examples:
A = [1,2,3,4]
B = [8,7,6,5,4,3,2]
C = [2,4,6,8,10,12]
D = [15,18,21,24]
E = [2,8,18,32,50]
Sequence A follows a rule of x_n = x_(n-1) + 1, where x_0 = 1; sequence B follows a rule of x_n = x_(n-1) - 1, where x_0 = 8; and so on. Example E follows the more complex formula x_i = 2(i+1)^2.
How, using python, can I predict the next element of each sequence?

You can fit a curve using scipy.optimize.curve_fit if you have a specific function in mind, or you could do a numpy.polyfit if you're confident that the "and so on" is always going to conform to some polynomial. Your examples A through D are all linear, so each is just a polynomial of degree 1; E is quadratic, so it needs degree 2.
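For instance, if you suspected a specific form such as a*(x+1)^2 for sequence E below, a curve_fit sketch might look like this (the function f here is an assumed form, not something curve_fit discovers for you):
import numpy as np
from scipy.optimize import curve_fit

def f(x, a):
    # hypothesized form for sequence E: a * (x + 1)^2
    return a * (x + 1) ** 2

E = [2, 8, 18, 32, 50]
params, _ = curve_fit(f, np.arange(len(E)), E)
print(f(5, *params))  # predicts the next element, ~72.0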
Here's an example of using numpy.polyfit:
import numpy as np
model = np.polyfit([0, 1], [1, 2], deg=1)
This will take in your values [1,2] and map them to positional values [0,1] before calculating the 1-degree polynomial that best fits their sequence.
You then need a function that uses the model to predict the nth value in the sequence (alternatively, use np.poly1d). Here's a simple polynomial calculator function that accepts coefficients as the first parameter, and a value of x for which you want to return the result of the polynomial:
def poly(coeffs, x):
    # coeffs[0] is the highest-order coefficient, as returned by np.polyfit
    accumulator = 0
    n = len(coeffs) - 1
    for e, c in enumerate(coeffs):
        accumulator = accumulator + c * (x ** (n - e))
    return accumulator
So, we've trained it on a sequence with indices 0 and 1; the answer for the 3rd point, with index 2, is found by:
poly(model, 2)
Which returns the expected value of 3.
Here's an example using the sequence [3,6,9,12]:
model = np.polyfit([0, 1, 2, 3], [3, 6, 9, 12], deg=1)
poly(model, 4)
Gives the answer 15. (OK, 15.000000000000002, but it's close enough - if you're confident that you're always going to arrive at integer answers then you could round to the closest integer - or choose some level of precision you're comfortable with)
This is all linear; for a quadratic model, you'd change deg=1 to deg=2, and so on.
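For instance, sequence E above is quadratic, so deg=2 recovers it; here's a small sketch, using np.poly1d for the evaluation this time:
import numpy as np
E = [2, 8, 18, 32, 50]
model = np.polyfit(range(len(E)), E, deg=2)
print(np.poly1d(model)(5))  # next element: ~72.0, i.e. 2 * (5 + 1)**2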
What this won't do for you is find more interesting patterns for which there isn't a polynomial to describe them. The On-Line Encyclopedia of Integer Sequences has a huge list of such sequences; examples include the Fibonacci sequence, the prime numbers, and the triangular numbers. For these more interesting examples, you'll need to come up with a more nuanced approach.

Related

Generate random postfix expressions of an arbitrary length without producing duplicate expressions

For a research project, we are currently using the Gambler's Ruin algorithm to produce random postfix expressions (terms) using the term variables "x", "y", and "z" and an operator "*", as shown in the method below:
from random import uniform, choice

def gamblers_ruin_algorithm(prob=0.3,
                            min_term_length=1,
                            max_term_length=None):
    """
    Generate a random term using the gambler's ruin algorithm

    :type prob: float
    :param prob: Probability of growing the size of a random term
    :type max_term_length: int
    :param max_term_length: Maximum length of the generated term
    """
    term_variables = ["x", "y", "z"]
    substitutions = ("EE*", "I")
    term = "E"
    term_length = 0
    # randomly build a term
    while "E" in term:
        rand = uniform(0, 1)
        if rand < prob or term_length < min_term_length:
            index = 0
            term_length += 1
        else:
            index = 1
        if (max_term_length is not None and
                term_length >= max_term_length):
            term = term.replace("E", "I")
            break
        term = term.replace("E", substitutions[index], 1)
    # randomly replace operands
    while "I" in term:
        term = term.replace("I", choice(term_variables), 1)
    return term
This method will produce a random postfix term like the following:
xyz*xx***zyz*zzx*zzyy***x**x****x**
The issue with this method is that, when run thousands of times, it tends to frequently produce duplicate expressions.
Is there a different algorithm for producing random postfix expressions of an arbitrary length that minimizes the probability of producing the same expression more than once?
Your basic problem is not the algorithm; it's the way you force the minimum size of the resulting expression. That procedure introduces an important bias into the generation of the first min_term_length operators. If the expression manages to grow further, that bias will slowly decrease, but it will never disappear.
Until the expression reaches the minimum length, you replace the first E with EE*. So the first few expressions are always:
E
EE*
EE*E*
EE*E*E*
...
When the minimum length is reached, the function starts replacing E with I with probability 1-prob, which is 70% using the default argument. If this succeeds for all the Es, the function will return a tree with the above shape.
Suppose that min_term_length is 5. The probability of five successive tests choosing not to extend the expression is 0.7^5, or about 16.8%. At that point, the expression will be II*I*I*I*I, and the six Is will be randomly replaced by a variable name. There are three variables, making a total of 3^6 = 729 different postfix expressions. If you do a large number of samples, the fact that a sixth of the samples fall into 729 possible expressions will certainly create lots of duplicates. That's unnecessary, because there are actually 42 possible shapes of a postfix expression with five operators (the fifth Catalan number), so there are really 42 * 729 = 30618 possible postfix expressions. If all of those could be produced, you'd expect less than one duplicate in a hundred thousand samples.
Note that the bias introduced by forcing a particular replacement for the first min_term_length steps will continue to show up for longer strings as well. For example, if the algorithm happens to expand the string exactly once during the first six steps, which has a probability of about 10%, then it will choose one of six shapes, although there are 132 possibilities. So you can expect duplicates of that size as well, although somewhat fewer.
Instead of forcing a choice when the string is still short, you should let the algorithm just continue until the gambler is ruined or the maximum length occurs. If the gambler is ruined too soon, throw out that sample and start over. That will slow things down a bit, but it's still quite practical. If tossing out so many possibilities annoys you, you could instead pre-generate all possible patterns of the minimum length -- as noted above, if the minimum length is six operators, then that's 132 shapes, which are easy to enumerate -- and select one of those at random as the starting point.
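To make the rejection idea concrete, here is a minimal sketch of that unbiased variant; it reuses the question's parameter names, but it is an illustration of the approach rather than tested code:
from random import uniform, choice

def unbiased_gamblers_ruin(prob=0.3, min_term_length=1, max_term_length=None):
    term_variables = ["x", "y", "z"]
    while True:
        term, term_length = "E", 0
        while "E" in term:
            if max_term_length is not None and term_length >= max_term_length:
                term = term.replace("E", "I")
                break
            if uniform(0, 1) < prob:
                term = term.replace("E", "EE*", 1)
                term_length += 1
            else:
                term = term.replace("E", "I", 1)
        if term_length >= min_term_length:
            break  # otherwise the gambler was ruined too soon: discard and retry
    # randomly replace operands, as in the original
    while "I" in term:
        term = term.replace("I", choice(term_variables), 1)
    return term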
You have four digits: x, y, z, and * such that:
x = 1
y = 2
z = 3
* = 4
So any expression can be expressed as a number using those digits. For example, the postfix expression xy*z* is 12434. And every such number maps to a unique expression.
With this technique, you can map each expression to a unique 32 bit or 64 bit number. And there are many good techniques for generating unique random numbers. See, for example, https://stackoverflow.com/a/34420445/56778.
So:
1. Generate a bunch of unique random numbers.
2. For each random number:
3. Convert it to the modified base-5 number.
4. Generate the expression from that number.
You can of course combine the 3rd and 4th steps. That is, instead of generating a '1', generate an 'x'.
There will of course be some limit on the length of the expressions. Each digit requires two bits to represent, so the maximum length of an expression from a 32 bit number will be 16 characters. You can extend to longer expressions easily enough by generating 64 bit random numbers. Or 128. Or whatever you like. The basic algorithm remains the same.
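For illustration, here is a minimal sketch of the decoding direction; the helper names are mine, and it includes a validity check, since one caveat to the above is that not every digit string spells a well-formed postfix expression:
DIGITS = {1: "x", 2: "y", 3: "z", 4: "*"}

def number_to_term(n):
    # read the modified base-5 digits (1-4, no 0), least significant first
    chars = []
    while n > 0:
        n, digit = divmod(n, 5)
        if digit == 0:
            return None  # this number has no representation with digits 1-4
        chars.append(DIGITS[digit])
    return "".join(reversed(chars))

def is_valid_postfix(term):
    # scanning left to right, '*' pops two operands and pushes one;
    # a valid term never underflows and leaves exactly one value
    depth = 0
    for ch in term:
        if ch == "*":
            if depth < 2:
                return False
            depth -= 1
        else:
            depth += 1
    return depth == 1

term = number_to_term(1 * 5**4 + 2 * 5**3 + 4 * 5**2 + 3 * 5 + 4)
print(term, is_valid_postfix(term))  # xy*z* True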

How to iterate through the Cartesian product of ten lists (ten elements each) faster? (Probability and Dice)

I'm trying to solve this task.
I wrote a function for this purpose which uses itertools.product() for the Cartesian product of the input iterables:
def probability(dice_number, sides, target):
    from itertools import product
    from decimal import Decimal
    FOUR_PLACES = Decimal('0.0001')
    total_number_of_experiment_outcomes = sides ** dice_number
    target_hits = 0
    sides_combinations = product(range(1, sides + 1), repeat=dice_number)
    for side_combination in sides_combinations:
        if sum(side_combination) == target:
            target_hits += 1
    p = Decimal(str(target_hits / total_number_of_experiment_outcomes)).quantize(FOUR_PLACES)
    return float(p)
When calling probability(2, 6, 3), the output is 0.0556, so it works fine.
But calling probability(10, 10, 50) takes a very long time to calculate (hours?), and there must be a better way :)
The loop for side_combination in sides_combinations: takes too long to iterate through the huge number of combinations.
Please, can you help me find out how to speed up the calculation? I want to sleep tonight...
I guess the problem is to find the distribution of the sum of the dice. An efficient way to do that is via discrete convolution: the distribution of a sum of variables is the convolution of their probability mass functions (or densities, in the continuous case). Convolution is associative, so you can compute it conveniently just two pmfs at a time (the current distribution of the total so far, and the next one in the list). Then from the final result, you can read off the probabilities for each possible total. The first element in the result is the probability of the smallest possible total, and the last element is the probability of the largest possible total. In between, you can figure out which one corresponds to the particular sum you're looking for.
The hard part of this is the convolution, so work on that first. It's just a simple summation, but it's a little tricky to get the limits of the summation correct. My advice is to work with integers or rationals so you can do exact arithmetic.
After that you just need to construct an appropriate pmf for each input die. The input is just [1, 1, 1, ... 1] if you're using integers (you'll have to normalize eventually) or [1/n, 1/n, 1/n, ..., 1/n] if rationals, where n = number of faces. Also you'll need to label the indices of the output correctly -- again this is just a little tricky to get it right.
Convolution is a very general approach for summations of variables. It can be made even more efficient by implementing the convolution via the fast Fourier transform, since FFT(conv(A, B)) = FFT(A) * FFT(B). But at this point I don't think you need to worry about that.
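Here is a minimal sketch of the plain-summation plan with exact rationals; the function names and the use of fractions.Fraction are my own choices:
from fractions import Fraction

def convolve(pmf_a, pmf_b):
    # index 0 of each list is the probability of that variable's smallest value
    out = [Fraction(0)] * (len(pmf_a) + len(pmf_b) - 1)
    for i, a in enumerate(pmf_a):
        for j, b in enumerate(pmf_b):
            out[i + j] += a * b
    return out

def probability(dice_number, sides, target):
    die = [Fraction(1, sides)] * sides  # pmf of one die over the faces 1..sides
    total = die
    for _ in range(dice_number - 1):
        total = convolve(total, die)
    index = target - dice_number  # index 0 corresponds to the minimum sum
    if not 0 <= index < len(total):
        return 0.0
    return float(total[index])

print(probability(2, 6, 3))     # 0.0555...
print(probability(10, 10, 50))  # fast even for ten ten-sided dice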
If someone is still interested in a solution which avoids the very, very long iteration through all the itertools.product Cartesian products, here it is:
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides
But you should add caching of the probability function's results; if you don't, the calculation will also take a very, very long time.
P.S. This code is 100% not mine, I took it from the internet. I'm not smart enough to produce it myself; hope you'll enjoy it as much as I did.
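As a side note, assuming Python 3, the caching mentioned above can be added with a one-line functools.lru_cache decorator:
from functools import lru_cache

@lru_cache(maxsize=None)
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides

print(probability(10, 10, 50))  # returns almost instantly once memoized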

Generate non-uniform random numbers [duplicate]

This question already has an answer here:
Fast way to obtain a random index from an array of weights in python
Algo (Source: Elements of Programming Interviews, 5.16)
You are given n numbers as well as probabilities p_0, p_1, ..., p_(n-1),
which sum up to 1. Given a random number generator that produces values in
[0, 1] uniformly, how would you generate one of the n numbers according
to their specific probabilities?
Example
If the numbers are 3, 5, 7, 11, and the probabilities are 9/18, 6/18,
2/18, 1/18, then in 1,000,000 calls to the program, 3 should appear
500,000 times, 7 should appear 111,111 times, etc.
The book says to create the cumulative intervals p_0, p_0 + p_1, p_0 + p_1 + p_2, etc., so in the example above the intervals are [0.0, 0.5), [0.5, 0.8333), etc., and combining these intervals into a sorted array of endpoints could look something like [1/18, 3/18, 9/18, 18/18]. Then run the random number generator, and find the smallest element that is larger than the generated value; the array index that it corresponds to maps to an index in the given n numbers.
This would require O(N) pre-processing time and then O(log N) to binary search for the value.
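For reference, a minimal sketch of the book's approach (variable names are mine), using bisect for the binary search:
import bisect
from fractions import Fraction
from itertools import accumulate
from random import random

numbers = [3, 5, 7, 11]
probs = [Fraction(9, 18), Fraction(6, 18), Fraction(2, 18), Fraction(1, 18)]
endpoints = list(accumulate(probs))  # exact prefix sums: [1/2, 5/6, 17/18, 1]

def pick():
    # the first endpoint strictly greater than the draw selects the number;
    # the last endpoint is exactly 1, so the index is always in range
    return numbers[bisect.bisect(endpoints, random())]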
I have an alternate solution that requires O(N) pre-processing time and O(1) execution time, and am wondering what may be wrong with it.
Why can't we iterate through each number n, building an array [n] * (100 * p), where p is the probability matching n, e.g. [3] * int((9/18) * 100)? Concatenate all these arrays to get, at the end, a list of 100 elements, where the count of each element maps to how likely it is to occur. Then run the random number function, index into the array, and return the value.
Wouldn't this be more efficient than the provided solution?
Your number 100 is not independent of the input; it depends on the given p values. Any parameter that depends on the magnitude of the input values is really exponential in the input size, meaning you are actually using exponential space. Just constructing that array would thus take exponential time, even if it was structured to allow constant lookup time after generating the random number.
Consider two p values, 0.01 and 0.99. 100 values are sufficient to implement your scheme. Now consider 0.001 and 0.999: now you need an array of 1,000 values to model the probability distribution. The amount of space grows with (I believe) the ratio of the largest p value to the smallest, not with the number of p values given.
If you have rational probabilities, you can make that work. Rather than 100, you must use a common denominator of the rational proportions. Insisting on 100 items will not fulfill the specs of your assigned example, let alone more diabolical ones.
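For instance, a sketch of that fix, assuming Python 3.9+ for math.lcm (the helper name is mine):
from fractions import Fraction
from math import lcm
from random import randrange

def build_table(numbers, probs):
    # size the table by the common denominator of the probabilities, not by 100
    fracs = [Fraction(p) for p in probs]
    denom = lcm(*(f.denominator for f in fracs))
    table = []
    for n, f in zip(numbers, fracs):
        table.extend([n] * (f.numerator * (denom // f.denominator)))
    return table

table = build_table([3, 5, 7, 11],
                    [Fraction(9, 18), Fraction(6, 18), Fraction(2, 18), Fraction(1, 18)])
print(len(table))                    # 18 entries, not 100
print(table[randrange(len(table))])  # O(1) per draw after O(denominator) setup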

python polynomial fit with some coefficients being fixed, order should be a parameter, need to create list of variables?

I need some help writing a pretty simple piece of code (at least in pseudocode):
I want to fit data using a polynomial of order n, where n is a parameter and should be changeable. On top of that, I would like to always keep the first three coefficients fixed at zero. So I need something like
order = 5
def poly(x, c0=0, c1=0, c2=0, c3, c4, c5):
    return numpy.polynomial.polynomial.polyval(x, [c0, c1, c2, c3, c4, c5], tensor=False)
popt, pcov = scipy.optimize.curve_fit(poly, x, y)
So the problems I cannot solve at the moment are:
How do I create a polynomial function with n coefficients? I basically need to create a list of variables of length n.
If that is solved, then we could set c0 through c2 to 0.
I hope I was able to make myself clear; if not, please help me refine my question.
You currently do not keep the first 3 coefficients fixed at 0, you just give them a default value.
Arbitrary argument lists seem to be what you are looking for:
def poly(x, *args):
    return numpy.polynomial.polynomial.polyval(x, [0, 0, 0] + list(args), tensor=False)
If the number of arguments MUST be of fixed length (for instance n), you can check len(args) and raise an error if necessary.
Calling poly(x, a, b, c) now evaluates the polynomial with the coefficients [0, 0, 0, a, b, c].
You can find more information in Python's documentation: https://docs.python.org/3/tutorial/controlflow.html#more-on-defining-functions
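One practical note when combining this with scipy.optimize.curve_fit: a *args function hides its arity, so curve_fit needs an explicit initial guess p0 whose length fixes the number of free coefficients. A sketch, with made-up sample data:
import numpy
import scipy.optimize

def poly(x, *args):
    return numpy.polynomial.polynomial.polyval(x, [0, 0, 0] + list(args), tensor=False)

order = 5
x = numpy.linspace(0, 1, 50)
y = 2.0 * x**3 - 1.0 * x**5

# p0's length tells curve_fit how many coefficients (c3..c5) to fit
popt, pcov = scipy.optimize.curve_fit(poly, x, y, p0=numpy.ones(order - 2))
print(popt)  # approximately [2, 0, -1]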

Complication using log-probabilities - Naive Bayes text classifier

I'm constructing a Naive Bayes text classifier from scratch in Python and I am aware that, upon encountering a product of very small probabilities, using a logarithm over the probabilities is a good choice.
The issue now is that the mathematical function that I'm using has a summation OVER a product of these extremely small probabilities.
To be specific, I'm trying to calculate the total word probabilities given a mixture component (class) over all classes.
Just plainly adding up the logs of these total probabilities is incorrect, since the log of a sum is not equal to the sum of logs.
To give an example, let's say that I have 3 classes, 2000 words and 50 documents.
Then I have a word probability matrix called wordprob with 2000 rows and 3 columns.
The algorithm for the total word probability in this example would look like this:
sum = 0
for j in range(0, 3):
    prob_product = 1
    for i in words:  # just the index of words from my vocabulary in this document
        prob_product = prob_product * wordprob[i, j]
    sum = sum + prob_product
What ends up happening is that prob_product becomes 0 on many iterations due to many small probabilities multiplying with each other.
Since I can't easily solve this with logs (because of the summation in front) I'm totally clueless.
Any help will be much appreciated.
I think you may be best to keep everything in logs. The first part of this, computing the log of the product, is just adding up the logs of the terms. The second bit, computing the log of the sum of the exponentials of the logs, is a bit trickier.
One way would be to store each of the logs of the products in an array, and then you need a function that, given an array L with n elements, will compute
S = log( sum { i=1..n | exp( L[i])})
One way to do this is to find the maximum, M say, of the L's; a little algebra shows
S = M + log( sum { i=1..n | exp( L[i]-M)})
Each of the terms L[i]-M is non-positive, so overflow can't occur. Underflow is not a problem, as for those terms exp will return 0. At least one of them (the one where L[i] is M) will be zero, so its exp will be one, and we'll end up with something we can pass to log. In other words, the evaluation of the formula will be trouble-free.
If you have the function log1p (log1p(x) = log(1+x)) then you could gain some accuracy by omitting the (just one!) i where L[i] == M from the sum, and passing the sum to log1p instead of log.
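In Python, a minimal sketch of the shifted computation might look like this (math.log1p is available for the refinement just mentioned):
import math

def log_sum_exp(logs):
    # computes log(sum(exp(L[i]))) without overflow by shifting by the max
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

# example: three log-probabilities far too small for exp() on their own
print(log_sum_exp([-1000.0, -1001.0, -1002.0]))  # about -999.59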
Your question seems to be on the math side of things rather than the coding of it.
I haven't quite figured out what your issue is, but the sum of logs equals the log of the product. Don't know if that helps...
Also, you are calculating one prob_product for every j, but you are just using the last one (and you are re-initializing it). You meant to do one of two things: either initialize it before the j-loop or use it before you increment j. Finally, it doesn't look like you need to initialize sum unless this is part of yet another loop you are not showing here.
That's all I have for now.
Sorry for the long post and no code.
High school algebra tells you this:
log(A*B*...*Z) = log(A) + log(B) + ... + log(Z) != log(A + B + ... + Z)
