Find Repeating Sublist Within Large List - python

I have a large list of sub-lists (approx. 16000), and I want to find where a repeating pattern starts and ends. I am not 100% sure that there is a repeat, but I have strong reason to believe so because of the diagonals that appear within the sub-list sequence. A list of sub-lists is the preferred structure, as the script already uses the data that way for other things. The data looks like this:
data = ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011', etc
I do not have any time constraints, but the fastest method would not be frowned upon. The code should be able to return the starting/ending sequence and its location within the list, so that it can be called upon in the future. If there is an arrangement of the data that would be more useful, I can try to reformat it. I have been learning Python for the past few months, so I am not yet able to create my own algorithms from scratch. Thank you!

Here's some fairly simple code that scans a string for adjacent repeating subsequences. Set minrun to the length of the smallest subsequences that you want to check. For each match, the code prints the starting index of the first subsequence, the length of the subsequence, and the subsequence itself.
data = [
    '1100100100000010',
    '1001001000000110',
    '0010010000001100',
    '0100100000011011',
]
data = ''.join(data)
minrun = 3
lendata = len(data)
for runlen in range(minrun, lendata // 2):
    i = 0
    while i < lendata - runlen * 2:
        s1 = data[i:i + runlen]
        s2 = data[i + runlen:i + runlen * 2]
        if s1 == s2:
            print(i, runlen, s1)
            i += runlen
        else:
            i += 1
output
1 3 100
4 3 100
8 3 000
15 3 010
18 3 010
23 3 000
32 3 001
38 3 000
47 3 001
53 3 000
17 15 001001000000110
32 15 001001000000110
Note that we get the same sequence of length 3 at index 15 and 18 = 15 + 3 : 010; that indicates that there are 3 adjacent copies of 010. Similarly, there are 3 adjacent copies of the sequence at index 17 of length 15.
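Since the question asks for the results to be returned and located within the original list, here is a minimal sketch of the same scan wrapped in a function. The names find_adjacent_repeats and to_row_col are hypothetical, and the sketch assumes every sub-list (row) has the same width, so that an index into the joined string can be mapped back to a (row, column) pair with divmod.

```python
def find_adjacent_repeats(text, minrun=3):
    """Collect (start, length, block) for every block that is
    immediately followed by a copy of itself (same scan as above)."""
    found = []
    n = len(text)
    for runlen in range(minrun, n // 2):
        i = 0
        while i < n - runlen * 2:
            block = text[i:i + runlen]
            if block == text[i + runlen:i + runlen * 2]:
                found.append((i, runlen, block))
                i += runlen
            else:
                i += 1
    return found


def to_row_col(index, width):
    """Map an index in the joined string back to (row, column).
    Assumes all rows share the same width."""
    return divmod(index, width)


rows = [
    '1100100100000010',
    '1001001000000110',
    '0010010000001100',
    '0100100000011011',
]
repeats = find_adjacent_repeats(''.join(rows))
for start, length, block in repeats:
    print(start, length, block, to_row_col(start, len(rows[0])))
```

Keeping the results in a list makes it easy to filter them afterwards, for example to keep only the longest blocks.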

Related

Find line intersection with all possible combinations of points in dataframe

Let's say I have the following line:
l = Line(Point(25, 0), Point(25, 25))
and I have a dataframe (df) which contains 2500 points, something like:
x y
0 0 49
1 13 48
2 0 47
3 5 46
4 9 45
...
How can I efficiently check whether the lines formed by every combination of those points intersect the line above?
Note that I am using the intersection function from the sympy library.
Two nested loops take forever, so that approach is not efficient enough.

finding the length of the longest subsequence

Here, in this piece of code, it prints the length of the largest subsequence of a sequence that's increasing then decreasing or vice versa.
for example:
Input: 1, 11, 2, 10, 4, 5, 2, 1
Output: 6 (A Longest Subsequence of length 6 is 1, 2, 10, 4, 2, 1)
but how can I make it work with three monotonic (increasing or decreasing) regions?
like increasing-decreasing-increasing OR decreasing-increasing-decreasing
example:
input: 7 16 1 6 20 17 7 18 25 1 25 21 11 5 29 11 3 3 26 19
output: 12
(largest subsequence: 7 1 6 17 18 25 25 21 11 5 3 3) as we see,
it can be split into three regions:
7,1 / 6,17,18,25,25 / 21,11,5,3,3
arr = list(map(int, input().split()))
def lbs(arr):
    n = len(arr)
    # lis[i]: longest strictly increasing subsequence ending at arr[i]
    lis = [1] * n
    for i in range(1, n):
        for j in range(i):
            if arr[i] > arr[j] and lis[i] < lis[j] + 1:
                lis[i] = lis[j] + 1
    # lds[i]: longest strictly decreasing subsequence starting at arr[i]
    lds = [1] * n
    for i in reversed(range(n - 1)):
        # only elements after position i can follow arr[i]
        for j in range(i + 1, n):
            if arr[i] > arr[j] and lds[i] < lds[j] + 1:
                lds[i] = lds[j] + 1
    # the peak arr[i] is counted in both lis[i] and lds[i], hence the -1
    maximum = lis[0] + lds[0] - 1
    for i in range(1, n):
        maximum = max(lis[i] + lds[i] - 1, maximum)
    return maximum
print("Length of LBS is", lbs(arr))
I've come up with an O(n^2 log n) idea.
You want to divide the whole sequence into three parts: the first containing an increasing subsequence, the second a decreasing one, and the last an increasing one again.
First, choose a prefix of the sequence as the first part (O(n) possibilities). To reduce the number of intervals to check, you can pick only prefixes whose last element belongs to their longest increasing subsequence. (In other words, when choosing the range [1, x], a_x should be in its longest increasing subsequence.)
Now you have a problem similar to the one you've already solved: finding a decreasing, then increasing subsequence (I'd use binary search instead of the for loop you used, by the way). The only difference is that the decreasing subsequence must start from values smaller than the last element of the chosen prefix (just ignore any larger or equal values), which you can do in O(n log n).
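For comparison, here is a minimal sketch of a different and simpler technique than the prefix-plus-binary-search idea above: a plain O(n^2) dynamic programme that sweeps the array once per run. The function name longest_three_run_subsequence and the first_up parameter are hypothetical. The sketch assumes strict comparisons, as in the question's code, and it allows a run to be empty, so it returns the best subsequence made of at most three alternating runs. The question's second example contains repeated values, so it would need non-strict comparisons instead.

```python
def longest_three_run_subsequence(arr, first_up=True):
    """Length of the longest subsequence consisting of at most three
    strictly monotonic runs with alternating directions.
    first_up=True means increasing-decreasing-increasing."""
    n = len(arr)
    if n == 0:
        return 0
    # best[i]: longest valid subsequence ending at arr[i] with the runs
    # processed so far; a single element is always valid.
    best = [1] * n
    for run in range(3):
        going_up = first_up if run % 2 == 0 else not first_up
        # Start from the previous result: using fewer runs stays allowed.
        cur = best[:]
        for i in range(n):
            for j in range(i):
                step_ok = arr[j] < arr[i] if going_up else arr[j] > arr[i]
                if step_ok and cur[j] + 1 > cur[i]:
                    cur[i] = cur[j] + 1
        best = cur
    return max(best)


print(longest_three_run_subsequence([1, 3, 2, 4, 0]))
print(longest_three_run_subsequence([2, 1, 3, 0], first_up=False))
```

Each pass extends subsequences in one direction only, so after three passes every stored length corresponds to a subsequence whose steps change direction at most twice.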

How to iterate over a range of permutations? [duplicate]

This question already has an answer here:
How to iterate through array combinations with constant sum efficiently?
(1 answer)
Closed 6 years ago.
Say you have n items, each ranging from 1-100. How can I go over all possible variations within the range?
Example:
3 stocks A, B and C
Working to find possible portfolio allocation.
A - 0 0 0 1 2 1 1
B - 0 1 2 ... 0 0 ... 1 2
C - 100 99 98 99 98 98 97
Looking for an efficient way to get a matrix of all possible outcomes.
Sum should add up to 100 and cover all possible variations for n elements.
How I'd do it:
>>> import itertools
>>> cp = itertools.product(range(101),repeat=3)
>>> portfolios = list(p for p in cp if sum(p)==100)
But that creates unnecessary combinations. See discussions of integer partitioning to avoid that. E.g., Elegant Python code for Integer Partitioning
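One way to avoid the wasted combinations is to generate only tuples that already sum to the target. The sketch below is a recursive generator; the name allocations is hypothetical, and it assumes allocations are non-negative integers and that order matters (A=1, B=0 differs from A=0, B=1), which matches the portfolio example.

```python
def allocations(n_assets, total=100):
    """Yield every tuple of n_assets non-negative integers summing to total."""
    if n_assets == 1:
        # The last asset must take whatever is left.
        yield (total,)
        return
    for first in range(total + 1):
        for rest in allocations(n_assets - 1, total - first):
            yield (first,) + rest


portfolios = list(allocations(3, 100))
print(len(portfolios))
print(portfolios[:3])
```

For three assets this yields 5151 tuples directly, whereas the itertools.product version builds 101**3 candidates and discards most of them.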

In Python Dictionaries, how does ( (j*5)+1 ) % 2**i cycle through all 2**i

I am researching how python implements dictionaries. One of the equations in the python dictionary implementation relates the pseudo random probing for an empty dictionary slot using the equation
j = ((j*5) + 1) % 2**i
which is explained here.
I have read this question, How are Python's Built In Dictionaries Implemented?, and basically understand how dictionaries are implemented.
What I don't understand is why/how the equation:
j = ((j*5) + 1) % 2**i
cycles through all the remainders of 2**i. For instance, if i = 3 for a total starting size of 8. j goes through the cycle:
0
1
6
7
4
5
2
3
0
if the starting size is 16, it would go through the cycle:
0 1 6 15 12 13 2 11 8 9 14 7 4 5 10 3 0
This is very useful for probing all the slots in the dictionary. But why does it work? Why does j = ((j*5)+1) work but not j = ((j*6)+1) or j = ((j*3)+1), both of which get stuck in smaller cycles?
I am hoping to get a more intuitive understanding of this than the equation just works and that's why they used it.
This is the same principle that pseudo-random number generators use, as Jasper hinted at, namely linear congruential generators. A linear congruential generator is a sequence that follows the relationship X_(n+1) = (a * X_n + c) mod m. From the wiki page,
The period of a general LCG is at most m, and for some choices of factor a much less than that. The LCG will have a full period for all seed values if and only if:
m and c are relatively prime.
a - 1 is divisible by all prime factors of m.
a - 1 is divisible by 4 if m is divisible by 4.
It's clear to see that 5 is the smallest a greater than 1 that satisfies these requirements (a = 1 satisfies them trivially, but it just steps through the slots in order), namely
2^i and 1 are relatively prime.
4 is divisible by 2.
4 is divisible by 4.
Also interestingly, 5 is not the only number that satisfies these conditions. 9 will also work. Taking m to be 16, using j=(9*j+1)%16 yields
0 1 10 11 4 5 14 15 8 9 2 3 12 13 6 7
The proof for these three conditions can be found in the original Hull-Dobell paper on page 5, along with a bunch of other PRNG-related theorems that also may be of interest.
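The three conditions are easy to check by brute force. The sketch below uses hypothetical helper names (probe_cycle, prime_factors, has_full_period); it walks the recurrence until a value repeats and compares the result with the conditions quoted above.

```python
from math import gcd


def probe_cycle(a, c, m, start=0):
    """Values visited by j = (a*j + c) % m before the first repeat."""
    seen, order, j = set(), [], start
    while j not in seen:
        seen.add(j)
        order.append(j)
        j = (a * j + c) % m
    return order


def prime_factors(m):
    """Set of prime factors of m, found by trial division."""
    factors, p = set(), 2
    while p * p <= m:
        while m % p == 0:
            factors.add(p)
            m //= p
        p += 1
    if m > 1:
        factors.add(m)
    return factors


def has_full_period(a, c, m):
    """Check the three conditions quoted above."""
    if gcd(c, m) != 1:
        return False
    if any((a - 1) % p for p in prime_factors(m)):
        return False
    if m % 4 == 0 and (a - 1) % 4 != 0:
        return False
    return True


for a in (3, 5, 6, 9):
    print(a, has_full_period(a, 1, 16), len(probe_cycle(a, 1, 16)))
```

With m = 16 and c = 1, the multipliers 5 and 9 visit all 16 slots, while 3 and 6 visit fewer, in line with the conditions.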

Longest Snake Sequence in an Array

Question: A set of numbers separated by spaces is passed as input. The program must print the longest snake sequence present in the numbers. A snake sequence is made up of adjacent numbers such that for each number, the number on the right or left is +1 or -1 of its value. If multiple snake sequences of maximum length are possible, print the snake sequence appearing in the natural input order.
Example Input/Output 1:
Input:
9 8 7 5 3 0 1 -2 -3 1 2
Output:
3 2 1 0 1
Example Input/Output 2:
Input:
-5 -4 -3 -1 0 1 4 6 5 4 3 4 3 2 1 0 2 -3 9
Output:
6 5 4 3 4 3 2 1 0 -1 0 1 2
Example Input/Output 3:
Input:
5 6 7 9 8 8
Output:
5 6 7 8 9 8
I have searched online & have only found references to find a snake sequence when a grid of numbers is given & not an array.
My Solution so far :
Create a 2D Array containing all the numbers from input as 1 value and the 2nd value being the max length sequence that can be generated starting from that number. But this doesn't always generate the max length sequence and doesn't work at all when there are 2 snakes of max length.
Assuming that the order in the original set of numbers does not matter, as seems to be the case in your question, this seems to be an instance of the Longest Path Problem, which is NP-hard.
Think of it that way: You can create a graph from your numbers, with edges between all pairs of nodes that have a difference of one. Now, the longest simple (acyclic) path in this graph is your solution. Your first example would correspond to this graph and path. (Note that there are two 1 nodes for the two ones in the input set.)
While this in itself does not solve your problem, it should help you getting started finding an algorithm to solve (or approximate) it, now that you know a better/more common name for the problem.
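To make the graph view concrete, here is a minimal sketch that builds such a graph as an adjacency dictionary. The name build_graph is hypothetical, and the sketch assumes one node per input position, so duplicate values (such as the two ones) become separate nodes.

```python
def build_graph(numbers):
    """Map each input position to the positions whose values differ by exactly 1."""
    return {
        i: [j for j, w in enumerate(numbers) if abs(v - w) == 1]
        for i, v in enumerate(numbers)
    }


numbers = list(map(int, "9 8 7 5 3 0 1 -2 -3 1 2".split()))
graph = build_graph(numbers)
for node, neighbours in graph.items():
    print(numbers[node], [numbers[j] for j in neighbours])
```

A snake sequence is then a simple path in this graph, which is why the problem matches the longest path problem.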
One algorithm works like this: Starting from each of the numbers, determine the "adjacent" numbers and do a kind of depth-first search through the graph to determine the longest path. Remember to temporarily remove the visited nodes from the graph. This has a worst-case complexity of O(2^n) (see note 1), but apparently it's sufficient for your examples.
def longest_snake(numbers, counts, path):
    best = path
    for n in sorted(counts, key=numbers.index):
        if counts[n] > 0 and (path == [] or abs(path[-1] - n) == 1):
            counts[n] -= 1
            res = longest_snake(numbers, counts, path + [n])
            if len(res) > len(best):
                best = res
            counts[n] += 1
    return best
Example:
>>> from collections import Counter
>>> numbers = list(map(int, "9 8 7 5 3 0 1 -2 -3 1 2".split()))
>>> longest_snake(numbers, Counter(numbers), [])
[3, 2, 1, 0, 1]
Note that this algorithm will reliably find a maximum "snake" sequence, using no number more often than allowed. However, it may not find the specific sequence that's expected as the output, i.e. "the snake sequence appearing in the natural input order", whatever that's supposed to mean.
To get closer to the "natural order", you might try the numbers in the same order as they appear in the input (as I did with sorted), but that does not work perfectly, either. Anyway, I'm sure you can figure out the rest by yourself.
1) In this special case, the graph has a branching factor of 2, thus O(2^n); in the more general case, the complexity would be closer to O(n!).
