Creating recursive function for nested loop in python

Two months ago I posted this question: Non overlapping pattern matching with gap constraint in python. I got only one response. The solution works, but it is quite long, and it needs one nested loop for each word in the pattern. Is there any way to write the following function recursively?
i=0
while i < len(pt_dic[pt_split[0]]):
    match=False
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)
    # Start loop at next index after ii
    j = next(x[0] for x in enumerate(pt_dic[pt_split[1]]) if x[1] > ii)
    while j < len(pt_dic[pt_split[1]]) and not match:
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)
        if jj > ii and jj <= ii + 2:
            # Start loop at next index after jj
            k = next(x[0] for x in enumerate(pt_dic[pt_split[2]]) if x[1] > jj)
            while k < len(pt_dic[pt_split[2]]) and not match:
                kk = pt_dic[pt_split[2]][k]
                #print "kk=" + str(kk)
                if kk > jj and kk <= jj + 2:
                    # Start loop at next index after kk
                    l = next(x[0] for x in enumerate(pt_dic[pt_split[3]]) if x[1] > kk)
                    # the bound must use the fourth list, pt_split[3]
                    while l < len(pt_dic[pt_split[3]]) and not match:
                        ll = pt_dic[pt_split[3]][l]
                        #print "ll=" + str(ll)
                        if ll > kk and ll <= kk + 2:
                            print "Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + "," + str(ll) + ")"
                            # Now that we've found a match, skip indices within that match.
                            i = next(x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > ll)
                            i -= 1
                            match=True
                        l += 1
                k += 1
        j += 1
    i += 1
Edit: For those who don't get the context:
I want to find the total number of non-overlapping matches of a pattern appearing in a sequence, with a gap constraint of 2.
E.g. A B C is a pattern found using some algorithm. I have to find the total number of times this pattern appears in a sequence such as A A B B C D E A B C …, where the max gap constraint is 2.
The max gap is not measured across the whole sequence; it applies between two consecutive words of the pattern as they appear in the sequence. E.g. Pat: A B C and seq: A B D E C B A B A B C D E.
In this case, A B D E C ... is a match, since a gap of at most two is allowed between A, B and between B, C. Next we find A B A B C as another match. Interestingly, there are two matches here (2 chars between A, B and 2 chars between B, C). However, we count it only once, as the matches overlap. A B X X X C isn't valid.

I have read the original question only briefly. I'm not really sure if I've got the gap counting part right. I think you have L sorted sequences of unique indices and the code searches for all lists with L elements, where Nth element is from Nth sequence and where two adjacent items satisfy a condition prev < next < prev + GAP + 1
Anyway this question is about nested loops.
The basic idea of the code below is to pass a list of sequences to the recursive function. This function takes the first sequence from it and iterates over it. The remaining sequences are passed to the other instances of the same function where each instance does the same, i.e. iterates over the first sequence and passes the rest until no sequences to iterate over are left.
During that process a partial solution is being built step by step. The recursion continues only if this partial solution satisfies the condition. When all sequences are exhausted, the partial solution becomes a final solution.
list_of_seqs= [
    [0, 1, 7, 11, 22, 29],
    [2, 3, 8, 14, 25, 33],
    [4, 9, 15, 16, 27, 34],
]
def found_match(m):
    print(m)
GAP = 2
def recloop(part, ls):
    if not ls:
        found_match(part)
        return
    seq, *ls = ls # this is Python3 syntax
    last = part[-1] if part else None
    # this is not optimized:
    for i in seq:
        if last is None or last < i <= last + GAP + 1:
            recloop(part + [i], ls)
recloop([], list_of_seqs)
For Python2 replace the marked line with seq, ls = ls[0], ls[1:]
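If the matches are needed as values rather than printed, the same recursion can be written as a generator. The following is only a sketch under that assumption; the name iter_matches and its gap parameter are illustrative and not part of the answer above, and the window test is copied unchanged from recloop.

```python
def iter_matches(seqs, gap=2, part=()):
    """Yield every index list that recloop would hand to found_match."""
    if not seqs:
        # No sequences left: the partial solution is a complete match.
        yield list(part)
        return
    first, rest = seqs[0], seqs[1:]
    last = part[-1] if part else None
    for i in first:
        # Same window test as in recloop above.
        if last is None or last < i <= last + gap + 1:
            yield from iter_matches(rest, gap, part + (i,))


list_of_seqs = [
    [0, 1, 7, 11, 22, 29],
    [2, 3, 8, 14, 25, 33],
    [4, 9, 15, 16, 27, 34],
]
matches = list(iter_matches(list_of_seqs))
print(len(matches), matches[0])
```

Because the generator is lazy, a caller that only wants the first match can call next() on it and stop, which is closer to the early exit in the original nested loops.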

Related

How can I count the number of ways to divide a string into N parts of any size?

I'm trying to count the number of ways you can divide a given string into three parts in Python.
Example: "bbbbb" can be divided into three parts 6 ways:
b|b|bbb
b|bb|bb
b|bbb|b
bb|b|bb
bb|bb|b
bbb|b|b
My first line of thinking was N choose K, where N = the string's length and K = the number of ways to split (3), but that only works for 3 and 4.
My next idea was to iterate through the string and count the number of spots the first third could be segmented and the number of spots the second third could be segmented, then multiply the two counts, but I'm having trouble implementing that, and I'm not even too sure if it'd work.
How can I count the ways to split a string into N parts?

Think of it in terms of the places of the splits as the elements you're choosing:
b ^ b ^ b ^ ... ^ b
^ is where you can split, and there are N - 1 places where you can split (N is the length of the string), and, if you want to split the string into M parts, you need to choose M - 1 split places, so it's N - 1 choose M - 1.
For your example, N = 5, M = 3. (N - 1 choose M - 1) = (4 choose 2) = 6.
An implementation:
import scipy.special
s = 'bbbbb'
n = len(s)
m = 3
res = scipy.special.comb(n - 1, m - 1, exact=True)
print(res)
Output:
6
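The same count can be cross-checked without SciPy by actually enumerating the split points. This is a small sketch, assuming Python 3.8+ for math.comb; the helper name splits is made up for the illustration.

```python
from itertools import combinations
from math import comb


def splits(s, m):
    """Yield every way to cut s into m non-empty consecutive parts."""
    n = len(s)
    # Choose m - 1 cut positions out of the n - 1 gaps between characters.
    for cuts in combinations(range(1, n), m - 1):
        bounds = (0,) + cuts + (n,)
        yield [s[a:b] for a, b in zip(bounds, bounds[1:])]


all_splits = list(splits("bbbbb", 3))
print(len(all_splits), comb(len("bbbbb") - 1, 3 - 1))
print(all_splits[0])
```

Enumerating is only useful for checking or for listing the parts; for the count alone the closed formula is enough.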

I came up with a solution that counts the ways to split a string into three parts by stepping through every pair of split points. I think it is easier to follow, although it needs a quadratic number of iterations, so it is not faster than the closed formula.
def slitStr(s):
    i = 1
    j= 2
    count = 0
    while i <= len(s)-2:
        # a, b, c are the split strings
        a = s[:i]
        b = s[i:j]
        c = s[j:]
        #increase j till it gets to the end of the list
        #each time j gets to the end of the list increment i
        #set j to i + 1
        if j<len(s):
            j+= 1
        if j==len(s):
            i += 1
            j = i+1
        # you can increment count after each iteration
        count += 1
    # return the number of splits found
    return count
You can customize the solution to fit your needs. I hope this helps.

Hope this helps you too :
string = "ABCDE"
div = "|"
out = []
for i in range(len(string)):
    temp1 = ''
    if 1 < i < len(string):
        temp1 += string[0:i-1] + div
        for j in range(len(string) + 1):
            temp2 = ""
            if j > i:
                temp2 += string[i-1:j-1] + div + string[j-1:]
                out.append(temp1 + temp2)
print(out)
Result :
['A|B|CDE', 'A|BC|DE', 'A|BCD|E', 'AB|C|DE', 'AB|CD|E', 'ABC|D|E']

How to check if any item exists in the list

How can I check whether a given index exists, both in the outer list and in one of its inner lists?
I have a list [[10,10,9], [10,10,10], [10,10,10]]
Then I enter the number of coordinates (k) and the coordinates themselves. At each coordinate I have to subtract 8 from that cell and 4 from each cell next to it. But what if some of the neighbouring cells don't exist?
The check if field[r + f][c + s] in field: always gives a negative answer. How should I write the check?
for i in range(k):
    for j in range(1):
        f = drops[i][j]
        s = drops[i][j + 1]
        field[f][s] -= 8
        for r in range(-1, 1):
            for c in range(-1, 1):
                if not (r == c == 1):
                    if field[r + f][c + s] in field:
                        field[r + f][c + s] -= 4

You just have to check whether the index isn't at the start or the end of the list.
n = 2
mylist = [4, 5, 8, 9, 12]
if len(mylist) > n+1:
    mylist[n+1] -= 1
if n > 0:
    mylist[n-1] -= 1

Slice assignment might help. You have to avoid letting an index go negative, but something like
s = slice(max(n-1,0), n+2)
x[s] = [v-1 for v in x[s]]
isn't too repetitive, while handling the edge cases n == 0 and n == len(x) - 1. (It won't work if n is explicitly set to a negative index, though.)
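For the two-dimensional field in the question, the same idea becomes an explicit range check on both the row and the column before touching a neighbour. The sketch below assumes that "next to it" means all eight surrounding cells (the question does not say whether diagonals count); the function name apply_drop and its parameters are illustrative.

```python
def apply_drop(field, row, col, center=8, around=4):
    """Subtract center from field[row][col] and around from each existing neighbour."""
    field[row][col] -= center
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # the cell itself was already handled
            r, c = row + dr, col + dc
            # Compare the indices against the list sizes; testing
            # "value in field" compares a number with the rows instead.
            if 0 <= r < len(field) and 0 <= c < len(field[r]):
                field[r][c] -= around


field = [[10, 10, 9], [10, 10, 10], [10, 10, 10]]
apply_drop(field, 0, 0)
print(field)
```

The explicit lower bound matters because a negative index in Python does not raise an error; it silently wraps around to the other end of the list.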

Find Repeating Substring In a List

I have a long list of sub-strings (close to 16000), and I want to find where the repeating cycle in it starts and stops. I have come up with this code as a starting point:
strings= ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',
'1001000000110110',
'0010000001101101',
'1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',]
pat = [ '1100100100000010',
'1001001000000110',
'0010010000001100',]
for i in range(0,len(strings)-1):
for j in range(0,len(pat)):
if strings[i] == pat[j]:
continue
if strings[i+1] == pat[j]:
print 'match', strings[i]
break
break
The problem with this method is that you have to know what pat is in order to search for it. I would like to start with the first n sub-strings (in this case 3) and search for them; if there is no match, move down one sub-string and try the next 3, until the search has gone through the entire list or has found the repeat. I believe that if n is high enough (maybe 10) it will find the repeat without being too time demanding.

strings= ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',
'1001000000110110',
'0010000001101101',
'1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011',]
n = 3
patt_dict = {}
# + 1 so that the last window of n items is counted as well
for i in range(0, len(strings) - n + 1, 1):
    patt = (' '.join(strings[i:i + n]))
    if patt not in patt_dict.keys(): patt_dict[patt] = 1
    else: patt_dict[patt] += 1
for key in patt_dict.keys():
    if patt_dict[key] > 1:
        print 'Found ' + str(patt_dict[key]) + ' repeating instances of ' + str(key) + '.'
Give this a shot. It runs in linear time. It uses a dictionary to count how many times each window of n consecutive items occurs in the list. If a count exceeds 1, then we have a repeating pattern :)
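Since the question asks where the cycle starts and stops, a variant of the same dictionary idea can remember the position of each window instead of a count. This is a sketch only; first_repeat is a hypothetical helper, and it reports the first window of n items that occurs a second time.

```python
def first_repeat(items, n):
    """Return (first_index, repeat_index) of the first window of n items seen twice, or None."""
    seen = {}
    for i in range(len(items) - n + 1):
        window = tuple(items[i:i + n])
        if window in seen:
            return seen[window], i
        seen[window] = i
    return None


strings = ['1100100100000010', '1001001000000110', '0010010000001100',
           '0100100000011011', '1001000000110110', '0010000001101101',
           '1100100100000010', '1001001000000110', '0010010000001100',
           '0100100000011011']
print(first_repeat(strings, 3))
```

The difference between the two returned indices is a candidate cycle length; a larger n makes accidental matches less likely, as the question already suspects.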

Here's a reasonably simple way that finds all matches of all lengths >= 1:
def findall(xs):
    from itertools import combinations
    # x2i maps each member of xs to a list of all the
    # indices at which that member appears.
    x2i = {}
    for i, x in enumerate(xs):
        x2i.setdefault(x, []).append(i)
    n = len(xs)
    for ixs in x2i.values():
        if len(ixs) > 1:
            for i, j in combinations(ixs, 2):
                length = 1 # xs[i] == xs[j]
                while (i + length < n and
                       j + length < n and
                       xs[i + length] == xs[j + length]):
                    length += 1
                yield i, j, length
Then:
for i, j, n in findall(strings):
    print("match of length", n, "at indices", i, "and", j)
displays:
match of length 4 at indices 0 and 6
match of length 1 at indices 3 and 9
match of length 3 at indices 1 and 7
match of length 2 at indices 2 and 8
What you do and don't want hasn't been precisely specified, so this lists all matches. You probably don't really want some of them. For example, the match of length 3 at indices 1 and 7 is just the tail end of the match of length 4 at indices 0 and 6.
So you'll need to alter the code to compute what you really want. Perhaps you only want a single, maximal match? All maximal matches? Only matches of a particular length? Etc.
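As one possible alteration, the sketch below keeps only matches that cannot be extended to the left, which removes the "tail end" duplicates mentioned above. It is an assumption about what "maximal" should mean here, and maximal_matches is an illustrative name, not part of the answer.

```python
from itertools import combinations


def maximal_matches(xs):
    """Yield (i, j, length) only for matches that cannot be extended to the left."""
    positions = {}
    for i, x in enumerate(xs):
        positions.setdefault(x, []).append(i)
    n = len(xs)
    for idxs in positions.values():
        for i, j in combinations(idxs, 2):  # i < j because idxs is increasing
            if i > 0 and xs[i - 1] == xs[j - 1]:
                continue  # tail of a longer match that starts earlier
            length = 1
            while j + length < n and xs[i + length] == xs[j + length]:
                length += 1
            yield i, j, length


demo = ['a', 'b', 'c', 'd', 'e', 'f', 'a', 'b', 'c', 'd']
print(list(maximal_matches(demo)))
```

The demo list has the same repetition structure as strings in the question: one run of four items starting at index 0 that repeats at index 6.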

Here's something that will find all subarrays that match within the strings array.
strings = ['A', 'B', 'C', 'D', 'Z', 'B', 'B', 'C', 'A', 'B', 'C']
pat = ['A', 'B', 'C', 'D']
i = 0
while i < len(strings):
    if strings[i] not in pat:
        i += 1
        continue
    matches = 0
    for j in xrange(pat.index(strings[i]), len(pat)):
        if i + j - pat.index(strings[i]) >= len(strings):
            break
        if strings[i + j - pat.index(strings[i])] == pat[j]:
            matches += 1
        else:
            break
    if matches:
        print 'matched at index %d subsequence length: %d value %s' % (i, matches, strings[i])
        i += matches
    else:
        i += 1
Output:
matched at index 0 subsequence length: 4 value A
matched at index 5 subsequence length: 1 value B
matched at index 6 subsequence length: 2 value B
matched at index 8 subsequence length: 3 value A

Longest arithmetic progression with a hole

The longest arithmetic progression subsequence problem is as follows. Given an array of integers A, devise an algorithm to find the longest arithmetic progression in it. In other words find a sequence i1 < i2 < … < ik, such that A[i1], A[i2], …, A[ik] form an arithmetic progression, and k is maximal. The following code solves the problem in O(n^2) time and space. (Modified from http://www.geeksforgeeks.org/length-of-the-longest-arithmatic-progression-in-a-sorted-array/ . )
#!/usr/bin/env python
import sys
def arithmetic(arr):
    n = len(arr)
    if (n<=2):
        return n
    llap = 2
    L = [[0]*n for i in xrange(n)]
    for i in xrange(n):
        L[i][n-1] = 2
    for j in xrange(n-2,0,-1):
        i = j-1
        k = j+1
        while (i >=0 and k <= n-1):
            if (arr[i] + arr[k] < 2*arr[j]):
                k = k + 1
            elif (arr[i] + arr[k] > 2*arr[j]):
                L[i][j] = 2
                i -= 1
            else:
                L[i][j] = L[j][k] + 1
                llap = max(llap, L[i][j])
                i = i - 1
                k = j + 1
        while (i >=0):
            L[i][j] = 2
            i -= 1
    return llap
arr = [1,4,5,7,8,10]
print arithmetic(arr)
This outputs 4.
However I would like to be able to find arithmetic progressions where up to one value is missing. So if arr = [1,4,5,8,10,13] I would like it to report that there is a progression of length 5 with one value missing.
Can this be done efficiently?

Adapted from my answer to Longest equally-spaced subsequence. n is the length of A, and d is the range, i.e. the largest item minus the smallest item.
A = [1, 4, 5, 8, 10, 13] # in sorted order
Aset = set(A)
for d in range(1, 13):
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            # if there is a hole to jump over:
            if b + 2 * d in Aset:
                b += 2 * d
                count += 1
                while b + d in Aset:
                    b += d
                    count += 1
                    # don't record in already_seen here
            print "found %d items in %d .. %d" % (count, a, b)
            # collect here the largest 'count'
I believe that this solution is still O(n*d), simply with larger constants than looking without a hole, despite the two "while" loops inside the two nested "for" loops. Indeed, fix a value of d: then we are in the "a" loop that runs n times; but each of the inner two while loops run at most n times in total over all values of a, giving a complexity O(n+n+n) = O(n) again.
Like the original, this solution is adaptable to the case where you're not interested in the absolute best answer but only in subsequences with a relatively small step d: e.g. n might be 1'000'000, but you're only interested in subsequences of step at most 1'000. Then you can make the outer loop stop at 1'000.
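The "collect here the largest count" step and the optional cap on d can be folded into one function. The following is a Python 3 sketch of the same scan, not a different algorithm; the name longest_ap_with_hole, the max_step parameter and the returned tuple layout are assumptions made for the example. The count is the number of items actually present, so the missing value is not counted.

```python
def longest_ap_with_hole(values, max_step=None):
    """Return (count, start, step) for the best progression with at most one hole."""
    a = sorted(set(values))
    if not a:
        return (0, None, 0)
    present = set(a)
    if max_step is None:
        max_step = a[-1] - a[0]
    best = (1, a[0], 0)
    for d in range(1, max_step + 1):
        seen = set()
        for start in a:
            if start in seen:
                continue
            b, count = start, 1
            while b + d in present:
                b += d
                count += 1
                seen.add(b)
            # Jump over a single missing value, then keep extending.
            if b + 2 * d in present:
                b += 2 * d
                count += 1
                while b + d in present:
                    b += d
                    count += 1
            if count > best[0]:
                best = (count, start, d)
    return best


print(longest_ap_with_hole([1, 4, 5, 8, 10, 13]))
```

For the list from the question this finds 1, 4, 10, 13 with step 3, i.e. four present items around the missing 7.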

Longest equally-spaced subsequence

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example
1, 4, 5, 7, 8, 12
has a subsequence
4, 8, 12
My naive method is greedy and just checks how far you can extend a subsequence from each point. This takes O(n²) time per point it seems.
Is there a faster way to solve this problem?
Update. I will test the code given in the answers as soon as possible (thank you). However it is clear already that using n^2 memory will not work. So far there is no code that terminates with the input as [random.randint(0,100000) for r in xrange(200000)] .
Timings. I tested with the following input data on my 32 bit system.
a= [random.randint(0,10000) for r in xrange(20000)]
a.sort()
The dynamic programming method of ZelluX uses 1.6G of RAM and takes 2 minutes and 14 seconds. With pypy it takes only 9 seconds! However it crashes with a memory error on large inputs.
The O(nd) time method of Armin took 9 seconds with pypy but only 20MB of RAM. Of course this would be much worse if the range were much larger. The low memory usage meant I could also test it with a= [random.randint(0,100000) for r in xrange(200000)] but it didn't finish in the few minutes I gave it with pypy.
To be able to test Kluev's method I reran with
a= [random.randint(0,40000) for r in xrange(28000)]
a = list(set(a))
a.sort()
to make a list of length roughly 20000. All timings with pypy
ZelluX, 9 seconds
Kluev, 20 seconds
Armin, 52 seconds
It seems that if the ZelluX method could be made linear space it would be the clear winner.

We can have a solution O(n*m) in time with very little memory needs, by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.
Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number a from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at a with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n) for every value of d.
A = [1, 4, 5, 7, 8, 12] # in sorted order
Aset = set(A)
for d in range(1, 12):
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            print "found %d items in %d .. %d" % (count, a, b)
            # collect here the largest 'count'
Updates:
This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximative, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10'000'000.)
If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.

UPDATE: I've found a paper on this problem, you can download it here.
Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.
We assume all numbers are saved in array a in ascending order, and n saves its length. 2D array l[i][j] defines length of longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).
lmax = 2
l = [[2 for i in xrange(n)] for j in xrange(n)]
for mid in xrange(n - 1):
    prev = mid - 1
    succ = mid + 1
    while (prev >= 0 and succ < n):
        if a[prev] + a[succ] < a[mid] * 2:
            succ += 1
        elif a[prev] + a[succ] > a[mid] * 2:
            prev -= 1
        else:
            l[mid][succ] = l[prev][mid] + 1
            lmax = max(lmax, l[mid][succ])
            prev -= 1
            succ += 1
print lmax
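For readers on Python 3, the same recurrence can be wrapped in a self-contained function. This is a straightforward port offered as a sketch; the function name longest_ap_length is made up, and it assumes the input is sorted in ascending order with no duplicate values.

```python
def longest_ap_length(a):
    """Length of the longest equally-spaced subsequence of the sorted list a."""
    n = len(a)
    if n <= 2:
        return n
    best = 2
    # length[i][j]: longest progression ending with a[i], a[j] (i < j)
    length = [[2] * n for _ in range(n)]
    for mid in range(n - 1):
        prev, succ = mid - 1, mid + 1
        while prev >= 0 and succ < n:
            total = a[prev] + a[succ]
            if total < 2 * a[mid]:
                succ += 1
            elif total > 2 * a[mid]:
                prev -= 1
            else:
                # a[prev], a[mid], a[succ] are equally spaced
                length[mid][succ] = length[prev][mid] + 1
                best = max(best, length[mid][succ])
                prev -= 1
                succ += 1
    return best


print(longest_ap_length([1, 4, 5, 7, 8, 12]))
```

The table still takes O(n^2) memory, which is the limitation the question reports for large inputs.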

Update: First algorithm described here is obsoleted by Armin Rigo's second answer, which is much simpler and more efficient. But both these methods have one disadvantage. They need many hours to find the result for one million integers. So I tried two more variants (see second half of this answer) where the range of input integers is assumed to be limited. Such limitation allows much faster algorithms. Also I tried to optimize Armin Rigo's code. See my benchmarking results at the end.
Here is an idea for an algorithm using O(N) memory. Time complexity is O(N^2 log N), but may be decreased to O(N^2).
The algorithm uses the following data structures:
prev: array of indexes pointing to previous element of (possibly incomplete) subsequence.
hash: hashmap with key = difference between consecutive pairs in subsequence and value = two other hashmaps. For these other hashmaps: key = starting/ending index of the subsequence, value = pair of (subsequence length, ending/starting index of the subsequence).
pq: priority queue for all possible "difference" values for subsequences stored in prev and hash.
Algorithm:
1. Initialize prev with indexes i-1. Update hash and pq to register all (incomplete) subsequences found on this step and their "differences".
2. Get (and remove) smallest "difference" from pq. Get corresponding record from hash and scan one of second-level hash maps. At this time all subsequences with given "difference" are complete. If second-level hash map contains subsequence length better than found so far, update the best result.
3. In the array prev: for each element of any sequence found on step #2, decrement index and update hash and possibly pq. While updating hash, we could perform one of the following operations: add a new subsequence of length 1, or grow some existing subsequence by 1, or merge two existing subsequences.
4. Remove hash map record found on step #2.
5. Continue from step #2 while pq is not empty.
This algorithm updates O(N) elements of prev O(N) times each. And each of these updates may require adding a new "difference" to pq. All this means time complexity of O(N^2 log N) if we use a simple heap implementation for pq. To decrease it to O(N^2) we might use more advanced priority queue implementations. Some of the possibilities are listed on this page: Priority Queues.
See corresponding Python code on Ideone. This code does not allow duplicate elements in the list. It is possible to fix this, but it would be a good optimization anyway to remove duplicates (and to find the longest subsequence beyond duplicates separately).
And the same code after a little optimization. Here search is terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range.
Armin Rigo's code is simple and pretty efficient. But in some cases it does some extra computations that may be avoided. Search may be terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range:
def findLESS(A):
    Aset = set(A)
    lmax = 2
    d = 1
    minStep = 0
    while (lmax - 1) * minStep <= A[-1] - A[0]:
        minStep = A[-1] - A[0] + 1
        for j, b in enumerate(A):
            if j+d < len(A):
                a = A[j+d]
                step = a - b
                minStep = min(minStep, step)
                if a + step in Aset and b - step not in Aset:
                    c = a + step
                    count = 3
                    while c + step in Aset:
                        c += step
                        count += 1
                    if count > lmax:
                        lmax = count
        d += 1
    return lmax
print(findLESS([1, 4, 5, 7, 8, 12]))
If the range of integers in the source data (M) is small, a simple algorithm is possible with O(M^2) time and O(M) space:
def findLESS(src):
    r = [False for i in range(src[-1]+1)]
    for x in src:
        r[x] = True
    d = 1
    best = 1
    while best * d < len(r):
        for s in range(d):
            l = 0
            for i in range(s, len(r), d):
                if r[i]:
                    l += 1
                    best = max(best, l)
                else:
                    l = 0
        d += 1
    return best
print(findLESS([1, 4, 5, 7, 8, 12]))
It is similar to the first method by Armin Rigo, but it doesn't use any dynamic data structures. I suppose source data has no duplicates. And (to keep the code simple) I also suppose that minimum input value is non-negative and close to zero.
The previous algorithm may be improved if, instead of the array of booleans, we use a bitset data structure and bitwise operations to process data in parallel. The code shown below implements the bitset as a built-in Python integer. It has the same assumptions: no duplicates, minimum input value is non-negative and close to zero. Time complexity is O(M^2 * log L) where L is the length of the optimal subsequence, space complexity is O(M):
def findLESS(src):
    r = 0
    for x in src:
        r |= 1 << x
    d = 1
    best = 1
    while best * d < src[-1] + 1:
        c = best
        rr = r
        while c & (c-1):
            cc = c & -c
            rr &= rr >> (cc * d)
            c &= c-1
        while c != 1:
            c = c >> 1
            rr &= rr >> (c * d)
        rr &= rr >> d
        while rr:
            rr &= rr >> d
            best += 1
        d += 1
    return best
Benchmarks:
Input data (about 100000 integers) is generated this way:
random.seed(42)
s = sorted(list(set([random.randint(0,200000) for r in xrange(140000)])))
And for fastest algorithms I also used the following data (about 1000000 integers):
s = sorted(list(set([random.randint(0,2000000) for r in xrange(1400000)])))
All results show time in seconds:
Size:                          100000   1000000
Second answer by Armin Rigo:      634         ?
By Armin Rigo, optimized:          64     >5000
O(M^2) algorithm:                  53      2940
O(M^2*L) algorithm:                 7       711

Algorithm
Main loop traversing the list
If the number is found in the precalculated list, then it belongs to all sequences stored for it in that list; recalculate all those sequences with count + 1
Remove all precalculated entries for the current element
Recalculate new sequences where the first element is from the range from 0 to current, and the second is the current element of the traversal (actually, not from 0 to current: we can use the fact that a new element shouldn't be more than max(a), and that a new list should have a chance to become longer than the one already found)
So for list [1, 2, 4, 5, 7] the output would be (it's a little messy, try the code yourself and see)
index 0, element 1:
if 1 in precalc? No - do nothing
Do nothing
index 1, element 2:
if 2 in precalc? No - do nothing
check if 3 = 1 + (2 - 1) * 2 in our set? No - do nothing
index 2, element 4:
if 4 in precalc? No - do nothing
check if 6 = 2 + (4 - 2) * 2 in our set? No
check if 7 = 1 + (4 - 1) * 2 in our set? Yes - add new element {7: {3: {'count': 2, 'start': 1}}} 7 - element of the list, 3 is step.
index 3, element 5:
if 5 in precalc? No - do nothing
do not check 4 because 6 = 4 + (5 - 4) * 2 is less than the calculated element 7
check if 8 = 2 + (5 - 2) * 2 in our set? No
check 9 = 1 + (5 - 1) * 2 - more than max(a) == 7
index 4, element 7:
if 7 in precalc? Yes - put it into result
do not check 5 because 9 = 5 + (7 - 5) * 2 is more than max(a) == 7
result = (3, {'count': 3, 'start': 1}) # step 3, count 3, start 1, turn it into sequence
Complexity
It shouldn't be more than O(N^2), and I think it's less because the search for new sequences terminates early. I'll try to provide a detailed analysis later.
Code
def add_precalc(precalc, start, step, count, res, N):
    if step == 0: return True
    if start + step * res[1]["count"] > N: return False
    x = start + step * count
    if x > N or x < 0: return False
    if precalc[x] is None: return True
    if step not in precalc[x]:
        precalc[x][step] = {"start":start, "count":count}
    return True
def work(a):
    precalc = [None] * (max(a) + 1)
    for x in a: precalc[x] = {}
    N, m = max(a), 0
    ind = {x:i for i, x in enumerate(a)}
    res = (0, {"start":0, "count":0})
    for i, x in enumerate(a):
        for el in precalc[x].iteritems():
            el[1]["count"] += 1
            if el[1]["count"] > res[1]["count"]: res = el
            add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
            t = el[1]["start"] + el[0] * el[1]["count"]
            if t in ind and ind[t] > m:
                m = ind[t]
        precalc[x] = None
        for y in a[i - m - 1::-1]:
            if not add_precalc(precalc, y, x - y, 2, res, N): break
    return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]

Here is another answer, working in time O(n^2) and without any notable memory requirements beyond that of turning the list into a set.
The idea is quite naive: like the original poster, it is greedy and just checks how far you can extend a subsequence from each pair of points --- however, checking first that we're at the start of a subsequence. In other words, from points a and b you check how far you can extend to b + (b-a), b + 2*(b-a), ... but only if a - (b-a) is not already in the set of all points. If it is, then you already saw the same subsequence.
The trick is to convince ourselves that this simple optimization is enough to lower the complexity to O(n^2) from the original O(n^3). That's left as an exercise to the reader :-) The time is competitive with other O(n^2) solutions here.
A = [1, 4, 5, 7, 8, 12] # in sorted order
Aset = set(A)
lmax = 2
for j, b in enumerate(A):
    for i in range(j):
        a = A[i]
        step = b - a
        if b + step in Aset and a - step not in Aset:
            c = b + step
            count = 3
            while c + step in Aset:
                c += step
                count += 1
            #print "found %d items in %d .. %d" % (count, a, c)
            if count > lmax:
                lmax = count
print lmax
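If the subsequence itself is wanted and not only its length, the same pair scan can keep the best run it has seen. A Python 3 sketch under that assumption follows; longest_equally_spaced is an illustrative name, and on ties it keeps the first run found.

```python
def longest_equally_spaced(values):
    """Return one longest equally-spaced subsequence (first found on ties)."""
    a = sorted(set(values))
    present = set(a)
    best = a[:2]
    for j, b in enumerate(a):
        for i in range(j):
            step = b - a[i]
            # Only extend from a true start: a[i] - step must be absent.
            if b + step in present and a[i] - step not in present:
                run = [a[i], b, b + step]
                while run[-1] + step in present:
                    run.append(run[-1] + step)
                if len(run) > len(best):
                    best = run
    return best


print(longest_equally_spaced([1, 4, 5, 7, 8, 12]))
```

Building the run as a list costs a little extra memory per candidate, but nothing is kept beyond the current best, so the overall memory stays linear.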

Your solution is O(N^3) now (you said O(N^2) per index). Here is a solution that takes O(N^2) time and O(N^2) memory.
Idea
If we know a subsequence that goes through indices i[0], i[1], i[2], i[3], we shouldn't try subsequences that start with i[1] and i[2] or with i[2] and i[3]
Note: I edited the code to make it a bit simpler by using the fact that a is sorted, but it will not work for equal elements. You can easily find the maximum number of equal elements in O(N)
Pseudocode
I'm seeking only for max length but that doesn't change anything
# a is the sorted input list and n = len(a); both are assumed to exist already
whereInA = {}
for i in range(n):
    whereInA[a[i]] = i  # it doesn't matter which of the equal elements it points to
usedPairs = [[False] * n for _ in range(n)]
maxLen = 0
for i in range(n):
    for j in range(i + 1, n):
        if usedPairs[i][j]:
            continue  # do not do anything. It was in one of the previous sequences.
        usedPairs[i][j] = True
        # here is a quite naive solution:
        diff = a[j] - a[i]
        if diff == 0:
            continue  # we can't work with that
        lastIndex = j
        currentLen = 2
        while a[lastIndex] + diff in whereInA:
            nextIndex = whereInA[a[lastIndex] + diff]
            usedPairs[lastIndex][nextIndex] = True
            currentLen += 1
            lastIndex = nextIndex
            # you may store all indices here
        maxLen = max(maxLen, currentLen)
Thoughts about memory usage
O(n^2) time is very slow for 1000000 elements. But if you are going to run this code on that many elements, the biggest problem will be memory usage.
What can be done to reduce it?
Change the boolean arrays to bitfields, to store more booleans per byte.
Make each next boolean array shorter, because we only use usedPairs[i][j] if i < j
A few heuristics:
Store only pairs of used indices. (Conflicts with the first idea)
Remove usedPairs entries that will never be used again (those for pairs i, j that the loop has already passed)

This is my 2 cents.
If you have a list called input:
input = [1, 4, 5, 7, 8, 12]
You can build a data structure that for each one of this points (excluding the first one), will tell you how far is that point from anyone of its predecessors:
[1, 4, 5, 7, 8, 12]
 x  3  4  6  7  11   # distance from point i to point 0
 x  x  1  3  4   8   # distance from point i to point 1
 x  x  x  2  3   7   # distance from point i to point 2
 x  x  x  x  1   5   # distance from point i to point 3
 x  x  x  x  x   4   # distance from point i to point 4
Now that you have the columns, you can consider the i-th item of input (which is input[i]) and each number n in its column.
The numbers that belong to a series of equidistant numbers that include input[i], are those which have n * j in the i-th position of their column, where j is the number of matches already found when moving columns from left to right, plus the k-th predecessor of input[i], where k is the index of n in the column of input[i].
Example: if we consider i = 1, input[i] = 4, n = 3, then we can identify a sequence comprising 4 (input[i]), 7 (because it has a 3 in position 1 of its column) and 1, because k is 0, so we take the first predecessor of i.
Possible implementation (sorry if the code is not using the same notation as the explanation):
def build_columns(l):
    columns = {}
    for x in l[1:]:
        col = []
        for y in l[:l.index(x)]:
            col.append(x - y)
        columns[x] = col
    return columns
def algo(input, columns):
    seqs = []
    for index1, number in enumerate(input[1:]):
        index1 += 1 #first item was sliced
        for index2, distance in enumerate(columns[number]):
            seq = []
            seq.append(input[index2]) # k-th pred
            seq.append(number)
            matches = 1
            for successor in input[index1 + 1 :]:
                column = columns[successor]
                if column[index1] == distance * matches:
                    matches += 1
                    seq.append(successor)
            if (len(seq) > 2):
                seqs.append(seq)
    return seqs
The longest one:
print max(sequences, key=len)

Traverse the array, keeping a record of the optimal result/s and a table with
(1) index - the element difference in the sequence,
(2) count - number of elements in the sequence so far, and
(3) the last recorded element.
For each array element look at the difference from each previous array element; if that element is last in a sequence indexed in the table, adjust that sequence in the table, and update the best sequence if applicable, otherwise start a new sequence, unless the current max is greater than the length of the possible sequence.
Scanning backwards we can stop our scan when d is greater than the middle of the array's range; or when the current max is greater than the length of the possible sequence, for d greater than the largest indexed difference. Sequences where s[j] is greater than the last element in the sequence are deleted.
I converted my code from JavaScript to Python (my first python code):
import random
import timeit
import sys
#s = [1,4,5,7,8,12]
#s = [2, 6, 7, 10, 13, 14, 17, 18, 21, 22, 23, 25, 28, 32, 39, 40, 41, 44, 45, 46, 49, 50, 51, 52, 53, 63, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 80, 82, 86, 95, 97, 101, 110, 111, 112, 114, 115, 120, 124, 125, 129, 131, 132, 136, 137, 138, 139, 140, 144, 145, 147, 151, 153, 157, 159, 161, 163, 165, 169, 172, 173, 175, 178, 179, 182, 185, 186, 188, 195]
#s = [0, 6, 7, 10, 11, 12, 16, 18, 19]
m = [random.randint(1,40000) for r in xrange(20000)]
s = list(set(m))
s.sort()
lenS = len(s)
halfRange = (s[lenS-1] - s[0]) // 2
while s[lenS-1] - s[lenS-2] > halfRange:
    s.pop()
    lenS -= 1
    halfRange = (s[lenS-1] - s[0]) // 2
while s[1] - s[0] > halfRange:
    s.pop(0)
    lenS -=1
    halfRange = (s[lenS-1] - s[0]) // 2
n = lenS
largest = (s[n-1] - s[0]) // 2
#largest = 1000 #set the maximum size of d searched
maxS = s[n-1]
maxD = 0
maxSeq = 0
hCount = [None]*(largest + 1)
hLast = [None]*(largest + 1)
best = {}
start = timeit.default_timer()
for i in range(1,n):
    sys.stdout.write(repr(i)+"\r")
    for j in range(i-1,-1,-1):
        d = s[i] - s[j]
        numLeft = n - i
        if d != 0:
            maxPossible = (maxS - s[i]) // d + 2
        else:
            maxPossible = numLeft + 2
        ok = numLeft + 2 > maxSeq and maxPossible > maxSeq
        if d > largest or (d > maxD and not ok):
            break
        if hLast[d] != None:
            found = False
            for k in range (len(hLast[d])-1,-1,-1):
                tmpLast = hLast[d][k]
                if tmpLast == j:
                    found = True
                    hLast[d][k] = i
                    hCount[d][k] += 1
                    tmpCount = hCount[d][k]
                    if tmpCount > maxSeq:
                        maxSeq = tmpCount
                        best = {'len': tmpCount, 'd': d, 'last': i}
                elif s[tmpLast] < s[j]:
                    del hLast[d][k]
                    del hCount[d][k]
            if not found and ok:
                hLast[d].append(i)
                hCount[d].append(2)
        elif ok:
            if d > maxD:
                maxD = d
            hLast[d] = [i]
            hCount[d] = [2]
end = timeit.default_timer()
seconds = (end - start)
#print (hCount)
#print (hLast)
print(best)
print(seconds)

This is a particular case of the more generic problem described here: Discover long patterns, where K=1 and is fixed. It is demonstrated there that it can be solved in O(N^2). Running my implementation of the C algorithm proposed there, it takes 3 seconds to find the solution for N=20000 and M=28000 on my 32-bit machine.

Greedy method
1. Only one sequence of decisions is generated.
2. It does not always guarantee an optimal solution.
Dynamic programming
1. Many sequences of decisions are generated.
2. It definitely gives an optimal solution.
