Optimizing string search in Python

I have to write a Python program that, given a large (50 MB) DNA sequence and a smaller one of around 15 characters, returns a list of all 15-character subsequences ordered by how close they are to the given one, as well as where they occur in the larger sequence.
My current approach is to first get all the subsequences:
def get_subsequences_of_size(size, data):
    sequences = {}
    i = 0
    while i + size <= len(data):
        sequence = data[i:i+size]
        if sequence not in sequences:
            sequences[sequence] = data.count(sequence)
        i += 1
    return sequences
and then pack them in a list of dictionaries according to what the problem asked (I forgot to get the position):
def find_similar_sequences(seq, data):
    similar_sequences = {}
    sequences = get_subsequences_of_size(len(seq), data)
    for sequence in sequences.keys():
        diffs, muts = calculate_similarity(seq, sequence)
        if diffs not in similar_sequences:
            similar_sequences[diffs] = [{"Sequence": sequence, "Mutations": muts}]
        else:
            similar_sequences[diffs].append({"Sequence": sequence, "Mutations": muts})
        #similar_sequences[sequence] = {"Similarity": (len(sequence)-diffs), "Differences": diffs, "Mutations": muts}
    return similar_sequences
My problem is that this runs way too slowly. With the 50 MB input, it takes over 30 minutes to finish processing.

What about the following approach:
Go with a sliding window of length 15 over your long sequence and for every subsequence:
store the start location on the long sequence
calculate and store the similarity
import re
from itertools import islice
from collections import defaultdict
short_seq = 'TGGCGACGGACTTCA'
long_seq = 'AGAACGTTTCGCGTCAGCCCGGAAGTGGTCAGTCGCCTGAGTCCGAACAAAAATGACAACAACGTTTATGACAGAACATT' +\
'CCTTGCTGGCAACTACCTGAAAATCGGCTGGCCGTCAGTCAATATCATGTCCTCATCAGATTATAAATGCGTGGCGCTGA' +\
'CGGATTATGACCGTTTTCCGGAAGATATTGATGGCGAGGGGGATGCCTTCTCTCTTGCCTCAAAACGTACCACCACATTT' +\
'ATGTCCAGTGGTATGACGCTGGTGGAGAGTTCCCCCGGCAGGGATGTGAAGGATGTGAAATGGCGACGGACTTCACCGCA' +\
'TGAGGCTCCACCAACCACGGGGATACTGTCGCTCTATAACCGTGGCGATCGCCGTCGCTGGTACTGGCCCTGTCCACACT' +\
'GTGGTGAGTATTTTCAGCCCTGCGGCGATGTGGTTGCTGGTTTCCGTGATATTGCCGATCCCGTGCTGGCAAGTGAGGCG' +\
'GCTTATATTCAGTGTCCTTCTGGCGACGGACTTCACGCGTCAGCCCGGAAGTGGTCAGTCGCCTGAGTCCGAACAAAAAT'
def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
    # from https://docs.python.org/release/2.3.5/lib/itertools-example.html
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield ''.join(result)
    for elem in it:
        result = result[1:] + (elem,)
        yield ''.join(result)

def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

k = len(short_seq)
locations = defaultdict(list)
similarities = defaultdict(set)

for start, subseq in enumerate(window(long_seq, k)):
    locations[subseq].append(start)
    similarity = hamming_distance(subseq, short_seq)  # substitute with your own similarity function
    similarities[similarity].add(subseq)

with open(r'stack46268997.txt', 'w') as f:
    for similarity in sorted(similarities.keys()):
        f.write("Sequence(s) which differ in {} base(s) from the short sequence:\n".format(similarity))
        for subseq in similarities[similarity]:
            f.write("{} at location(s) {}\n".format(subseq, ', '.join(map(str, locations[subseq]))))
        f.write('\n')
This outputs the list of subsequences ordered by how close they are to the given sequence.
Sequence(s) which differ in 0 base(s) from the short sequence:
TGGCGACGGACTTCA at location(s) 300, 500
Sequence(s) which differ in 5 base(s) from the short sequence:
TGGCGATCGCCGTCG at location(s) 362
Sequence(s) which differ in 6 base(s) from the short sequence:
TGGCAACTACCTGAA at location(s) 86
TGGTGAGTATTTTCA at location(s) 401
TGGCGAGGGGGATGC at location(s) 191
Sequence(s) which differ in 7 base(s) from the short sequence:
ATGTGAAGGATGTGA at location(s) 283
AGGGGGATGCCTTCT at location(s) 196
TGACAACAACGTTTA at location(s) 53
CGCTGACGGATTATG at location(s) 154
TTATGACCGTTTTCC at location(s) 164
TGGTTGCTGGTTTCC at location(s) 430
TCGCGTCAGCCCGGA at location(s) 8
AGTCGCCTGAGTCCG at location(s) 30, 536
CGGCGATGTGGTTGC at location(s) 422
[... and so on...]
I also ran the script on a 50 MB FASTA file. On my machine, this took 42 seconds to compute the results and another 30 seconds to write out the results to a file (printing them out would have taken much longer!)

Related

How to find 3 numbers with sum closest to a given number

I'm trying to write simple code for this problem: given an array and a number, I need to find the 3 numbers whose sum is closest to the given number.
My thought was to first pop out the last element (the first number); then I'd have a new array without that element. Next I look for the second number, which needs to be less than the target minus the first, so I only take the numbers smaller than second = target - first (but I don't know how to choose it).
The last number would be third = target - first - second.
I tried to write code, but it's not working and it's very basic:
def f(s, target):
    s = sorted(s)
    print(s)
    print(s[0])
    closest = s[0] + s[1] + s[2]
    m = s[:-1]
    print(m)
    for i in range(len(s)):
        for j in range(len(m)):
            if (closest <= target - m[0]) and s[-1] + m[j] == target:
                print(m[j])
                n = m[:j] + nums[j+1:]
                for z in range(len(z)):
                    if (closest < target - n[z]) and s[-1] + m[j] + n[z] == target:
                        print(n[z])

s = [4, 2, 12, 3, 4, 8, 14]
target = 20
f(s, target)
If you have an idea of what to change here, please let me know.
Thank you.
Here is my solution. I tried to maximize the performance of the code by not repeating any combinations. Let me know if you have any questions.
Good luck.
def find_3(s, target):
    to_not_rep = []  # this list stores all combinations already seen, to avoid repetition
    close_to_0 = abs(target - (s[0] + s[1] + s[2]))  # initial value
    There_is_one = False  # False: we don't have a combination equal to the target yet
    for s1, first_n in enumerate(s):
        for s2, second_n in enumerate(s):
            if s1 == s2: continue  # don't take the same index twice
            for s3, third_n in enumerate(s):
                if (s1 == s3) or (s2 == s3): continue  # don't take the same index twice
                val = sorted([first_n, second_n, third_n])  # sorting
                if val in to_not_rep: continue  # don't repeat the same combination in different positions
                to_not_rep.append(val)  # remember this combination
                sum_ = sum(val)  # the sum of the three numbers
                # Exact match
                if sum_ == target:
                    print(f"Found a possibility: {val[0]} + {val[1]} + {val[2]} = {target}")
                    There_is_one = True
                if There_is_one is False:  # no need once we found a combination equal to the target
                    # Close to the target:
                    # (target - sum) should equal 0 for an exact match; otherwise we look
                    # for the combination whose sum is closest (in absolute value) to the target.
                    pos_n = abs(target - sum_)
                    if pos_n < close_to_0:
                        closest_one = f"The closest combination to the target is: {val[0]} + {val[1]} + {val[2]} = {sum_}, almost {target}"
                        close_to_0 = pos_n
    # Print the closest combination to the target in case we did not find a combination equal to the target
    if There_is_one is False:
        print(closest_one)
So we can test it:
s =[4,2,3,8,6,4,12,16,30,20,5]
target=20
find_3(s,target)
#Found a possibility: 4 + 4 + 12 = 20
#Found a possibility: 2 + 6 + 12 = 20
#Found a possibility: 3 + 5 + 12 = 20
Another test:
s =[4,2,3,8,6,4,323,23,44]
find_3(s,target)
#The closest combination to the target is: 4 + 6 + 8 = 18, almost 20
This is a simple solution that returns all possibilities.
For your case it completed in 0.002019 seconds:
from itertools import combinations
import numpy as np
def f(s, target):
    dic = {}
    for tup in combinations(s, 3):
        try:
            dic[np.absolute(np.sum(tup) - target)].append(tup)
        except KeyError:
            dic[np.absolute(np.sum(tup) - target)] = [tup]
    print(dic[min(dic.keys())])
Use itertools.combinations to get all combinations of your numbers without replacement of a certain length (three in your case). Then take the three-tuple for which the absolute value of the difference of the sum and target is minimal. min can take a key argument to specify the ordering of the iterable passed to the function.
from typing import Sequence, Tuple

def closest_to(seq: Sequence[float], target: float, length: int = 3) -> Tuple[float, ...]:
    from itertools import combinations
    combs = combinations(seq, length)
    diff = lambda x: abs(sum(x) - target)
    return min(combs, key=diff)

closest_to([4, 2, 12, 3, 4, 8, 14], 20)  # (4, 2, 14)
This is not the fastest or most efficient way to do it, but it's conceptually simple and short.
Something like this?
import math
num_find = 1448
lst_Results = []
i_Number = num_find
while i_Number > 0:
    num_Exp = math.floor(math.log(i_Number) / math.log(2))
    lst_Results.append(dict({num_Exp: int(math.pow(2, num_Exp))}))
    i_Number = i_Number - math.pow(2, num_Exp)
print(lst_Results)
In a sequence of powers of two, for example 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, etc., the sum of the previous numbers is never greater than the next one. This makes each number's combination unique; for example:
For the number 1448 there is no other combination than the sum of these numbers: 8 + 32 + 128 + 256 + 1024.
Then you find the numbers whose sum is close to the number provided.
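As a quick sanity check (my addition, not part of the original answer), the decomposition above is just the binary representation of 1448:
assert sum([8, 32, 128, 256, 1024]) == 1448
print(bin(1448))  # 0b10110101000 -> bits set at 2**3, 2**5, 2**7, 2**8, 2**10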

Python: Size of (large) dict 10 times smaller when pickled

I'm trying to understand what's going on internally in Python in the following situation.
Situation (Python3 on debian):
A (large) dict that has integers as keys (running from zero) and tuples as values.
The elements of the tuple are ALL integers (randomly from zero to the number of the largest key).
All tuples have exactly 30 elements.
Problem:
The pickled dict is significantly (approx. 10 times!) smaller on my hard disk than what the sizes of its individual elements add up to in memory.
Details:
The size of an integer is 28 bytes (except < 0 > which is just 24 bytes).
The size of a tuple is dependent on the number of elements it contains; assuming 30 elements it is 288 bytes.
The size of a dictionary is dependent on the number of elements it contains; assuming 1000 elements it is 49248 bytes.
Given the situation above, 1000 elements in the dict and assuming the number < 0 > appears 29 times in the tuples I get:
size of the integers in the tuples: 28 x 30 x 1000 - 4 x 29 = 839,884 bytes
size of the tuples: 288 x 1000 = 288,000 bytes
size of the keys: 28 x 1000 - 4 (the first key is zero) = 27,996 bytes
size of the dict with 1000 elements: 49,248 bytes
Sum of this all = 1,205,128 bytes
Now I pickle this dict to harddisk as a binary file and I actually get 91,207 bytes as the size of the file.
So my question is now: what is going on here?
Is the pickling "compressing" the integers to just what the bits (or something like that) are? The number < 1000 > for example can be represented with just 10 bits and would fit into 2 bytes (instead of 28).
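A quick way to see this (my addition, not part of the original question) is to pickle a single integer and disassemble the result; pickle stores small ints as compact binary opcodes rather than as 28-byte Python int objects:
import pickle
import pickletools
import sys

n = 1000
print(sys.getsizeof(n))           # 28 bytes as a Python int object in memory
print(len(pickle.dumps(n)))       # only a handful of bytes in the pickle stream
pickletools.dis(pickle.dumps(n))  # shows a BININT* opcode carrying the raw value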
Code that might be useful:
import os
import sys
import random
import pickle

max_key = 1000
zeros = 0
theoretical_size = 0
the_dict = {}

for i in range(max_key):
    the_tuple = tuple()
    ii = 0
    while ii < 30:
        number = random.randint(0, (max_key - 1))
        if number not in the_tuple:
            the_tuple += (number, )
            theoretical_size += sys.getsizeof(number)
            ii += 1
            if not number:
                zeros += 1
    theoretical_size += sys.getsizeof(the_tuple)
    theoretical_size += sys.getsizeof(i)
    the_dict[i] = the_tuple
theoretical_size += sys.getsizeof(the_dict)

outfile = '/path/to/outfile/outfilename'
with open(outfile, 'wb') as f:
    pickle.dump(the_dict, f)

print("           zeros:", zeros)
print("theoretical size:", theoretical_size)
print("      Calculated:", 28*30*max_key - 4*zeros + 288*max_key + 28*max_key - 4 + sys.getsizeof(the_dict))
print("         On disk:", os.path.getsize(outfile))

Shannon-Fano code as max-heap in python

I have a min-heap code for Huffman coding which you can see here: http://rosettacode.org/wiki/Huffman_coding#Python
I'm trying to make a max-heap Shannon-Fano code which is similar to min-heap.
Here is the code:
from collections import defaultdict, Counter
import heapq, math

def _heappop_max(heap):
    """Maxheap version of a heappop."""
    lastelt = heap.pop()    # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        heapq._siftup_max(heap, 0)
        return returnitem
    return lastelt

def _heappush_max(heap, item):
    """Push item onto heap, maintaining the heap invariant."""
    heap.append(item)
    heapq._siftdown_max(heap, 0, len(heap)-1)

def sf_encode(symb2freq):
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
    heapq._heapify_max(heap)
    while len(heap) > 1:
        lo = _heappop_max(heap)
        hi = _heappop_max(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        _heappush_max(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        print heap
    return sorted(_heappop_max(heap)[1:], key=lambda p: (len(p[1]), p))
But I've got output like this:
Symbol Weight Shannon-Fano Code
! 1 1
3 1 01
: 1 001
J 1 0001
V 1 00001
z 1 000001
E 3 0000001
L 3 00000001
P 3 000000001
N 4 0000000001
O 4 00000000001
Am I right to use heapq to implement Shannon-Fano coding? The problem is in this line:
_heappush_max(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
and I don't understand how to fix it.
I expect output similar to the Huffman encoding:
Symbol Weight Huffman Code
2875 01
a 744 1001
e 1129 1110
h 606 0000
i 610 0001
n 617 0010
o 668 1000
t 842 1100
d 358 10100
l 326 00110
Added:
Well, I've tried to do this without heapq, but got unstoppable recursion:
def sf_encode(iA, iB, maxP):
    global tupleList, total_sf
    global mid
    maxP = maxP / float(2)
    sumP = 0
    for i in range(iA, iB):
        tup = tupleList[i]
        if sumP < maxP or i == iA:  # top group
            sumP += tup[1] / float(total_sf)
            tupleList[i] = (tup[0], tup[1], tup[2] + '0')
            mid = i
        else:  # bottom group
            tupleList[i] = (tup[0], tup[1], tup[2] + '1')
    print tupleList
    if mid - 1 > iA:
        sf_encode(iA, mid - 1, maxP)
    if iB - mid > 0:
        sf_encode(mid, iB, maxP)
    return tupleList
In Shannon-Fano coding you need the following steps:
A Shannon–Fano tree is built according to a specification designed to define an effective code table. The actual algorithm is simple:
1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts so that each symbol's relative frequency of occurrence is known.
2. Sort the lists of symbols according to frequency, with the most frequently occurring symbols at the left and the least common at the right.
3. Divide the list into two parts, with the total frequency counts of the left part being as close to the total of the right as possible.
4. The left part of the list is assigned the binary digit 0, and the right part is assigned the digit 1. This means that the codes for the symbols in the first part will all start with 0, and the codes in the second part will all start with 1.
5. Recursively apply steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree.
So you will need code to sort (your input appears already sorted so you may be able to skip this), plus a recursive function that chooses the best partition and then recurses on the first and second halves of the list.
Once the list is sorted, the order of the elements never changes so there is no need to use heapq to do this style of encoding.
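Here is a minimal sketch of that recursion (my addition, not the asker's code), assuming the input is a dict mapping symbols to frequencies:
def shannon_fano(symb2freq):
    # Step 2: most frequent symbols first
    items = sorted(symb2freq.items(), key=lambda kv: kv[1], reverse=True)
    codes = {sym: '' for sym, _ in items}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(w for _, w in group)
        running, cut, best = 0, 1, None
        # Step 3: find the cut whose left total is closest to half of the group total
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(total - 2 * running)
            if best is None or diff < best:
                best, cut = diff, i
        # Step 4: prefix '0' to the left part, '1' to the right part
        for sym, _ in group[:cut]:
            codes[sym] += '0'
        for sym, _ in group[cut:]:
            codes[sym] += '1'
        # Step 5: recurse on both halves
        split(group[:cut])
        split(group[cut:])

    split(items)
    return codes

print(shannon_fano({'a': 15, 'b': 7, 'c': 6, 'd': 6, 'e': 5}))
# {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}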

Efficient way to find the longest duplicate string in Python (from Programming Pearls)

From Section 15.2 of Programming Pearls
The C codes can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using a suffix array:
example = open("iliad10.txt").read()

def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  # VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
I found the idx.sort step to be very slow. I think it's slow because Python needs to pass the substrings by value instead of by pointer (as the C code above does).
The test file can be downloaded from here.
The C code needs only 0.3 seconds to finish:
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never ends on my computer (I waited for 10 minutes and then killed it).
Does anyone have ideas how to make the code efficient (for example, finish in less than 10 seconds)?
My solution is based on suffix arrays: the suffix array is constructed by prefix doubling, and the longest-common-prefix (LCP) array is computed from it. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. for searching the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicate strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for any criteria, e.g. for searching
    the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if substring not in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of the length _step are first pre-sorted. The results are
    then repeatedly merged so that the guaranteed number of compared characters
    is doubled in every iteration until all substrings are sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value: (tuple)
        (sa, rsa, lcp)
        sa:  Suffix array                 for i in range(1, size):
                 assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array         for i in range(size):
                 assert rsa[sa[i]] == i
        lcp: Longest common prefix        for i in range(1, size):
                 assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
                 if sa[i-1] + lcp[i] < len(text):
                     assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]

    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])
    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
I prefer this solution over a more complicated O(n log n) approach because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than the necessary linear-time operations in the method from that article, which should be O(n) under very special assumptions of random strings together with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent e.g. in "independent" school homework collections. The program should not freeze on such input.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.
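A minimal usage sketch (my addition; it assumes the iliad.mb.txt file mentioned above is available locally):
text = open('iliad.mb.txt').read()
repeats = longest_common_substring(text)
for substring, positions in repeats.items():
    print(len(substring), positions, repr(substring[:60]))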
The translation of the algorithm into Python:
from itertools import imap, izip, starmap, tee
from os.path import commonprefix
def pairwise(iterable):  # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data)))  # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
buffer() allows getting a substring without copying:
def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i))  # suffix array
    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with an LCP array.
Note: the *_memoryview() version below is superseded by the *_buffer() version above.
More memory efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces the wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
The main problem seems to be that Python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead to get a reference rather than a copy. When I did this, the program hung after the idx.sort step (which itself became very fast).
I'm sure with a little work, you can get the rest working.
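To see why a memoryview helps, here is a small sketch (my addition, assuming Python 3 and a bytes buffer): slicing bytes copies the data, while slicing a memoryview does not.
import sys

data = b'x' * 10_000_000           # ~10 MB buffer
copy_slice = data[1:]              # a new bytes object: copies ~10 MB
view_slice = memoryview(data)[1:]  # a small view object: no copy

print(sys.getsizeof(copy_slice))   # roughly the size of the data
print(sys.getsizeof(view_slice))   # a couple of hundred bytes at most
print(view_slice.obj is data)      # True: the view still references the original buffer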
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
And compare the result to this python:
test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).
To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:
The above is interesting trivia, but not actually correct in terms of affecting either chunks of code. I realized this just as I shut my laptop and left a wifi zone... Really should double check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
This version takes about 17 seconds on my circa-2007 desktop, using a totally different algorithm:
#!/usr/bin/env python
ex = open("iliad.mb.txt").read()

chains = dict()

# populate initial chains dictionary
for (a, b) in enumerate(zip(ex, ex[1:])):
    s = ''.join(b)
    if s not in chains:
        chains[s] = list()
    chains[s].append(a)

def grow_chains(chains):
    new_chains = dict()
    for (string, pos) in chains:
        offset = len(string)
        for p in pos:
            if p + offset >= len(ex):
                break
            # add one more character
            s = string + ex[p + offset]
            if s not in new_chains:
                new_chains[s] = list()
            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1:
    print 'length of chains', len(chains)
    # remove chains that appear only once
    chains = [(i, chains[i]) for i in chains if len(chains[i]) > 1]
    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]
    chains = grow_chains(chains)
The basic idea is to create a list of substrings and the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by one character and a new list is created. Unique strings are removed again. And so on and so forth...

Recursive Permutation Generator, swapping list items not working

I want to systematically generate permutations of the alphabet.
I cannot (well, don't want to) use Python's itertools.permutations, because pregenerating a list of every permutation causes my computer to crash (the first time I actually got it to force a shutdown; it was pretty great).
Therefore, my new approach is to generate and test each key on the fly. Currently, I am trying to handle this with recursion.
My idea is to start with the largest list (I'll use a 3-element list as an example) and recurse into smaller lists until the list is two elements long. Then it will print the list, swap the last two elements, print the list again, return up one level, and repeat.
For example, for 123
123 (swap position 0 with position 0)
23 --> 123 (swap position 1 with position 1)
32 --> 132 (swap position 1 with position 2)
213 (swap position 0 with position 1)
13 --> 213 (swap position 1 with position 1)
31 --> 231 (swap position 1 with position 2)
321 (swap position 0 with position 2)
21 --> 321 (swap position 1 with position 1)
12 --> 312 (swap position 1 with position 2)
for a four letter number (1234)
1234 (swap position 0 with position 0)
234 (swap position 1 with position 1)
34 --> 1234
43 --> 1243
324 (swap position 1 with position 2)
24 --> 1324
42 --> 1342
432 (swap position 1 with position 3)
32 --> 1432
23 --> 1423
2134 (swap position 0 for position 1)
134 (swap position 1 with position 1)
34 --> 2134
43 --> 2143
314 (swap position 1 with position 2)
14--> 2314
41--> 2341
431 (swap position 1 with position 3)
31--> 2431
13 -->2413
This is the code I currently have for the recursion, but it's causing me a lot of grief, recursion not being my strong suit. Here's what I have:
def perm(x, y, key):
    print "Perm called: X=", x, ", Y=", y, ", key=", key
    while (x < y):
        print "\tLooping Inward"
        print "\t", x, " ", y, " ", key
        x = x + 1
        key = perm(x, y, key)
        swap(x, y, key)
        print "\tAfter 'swap':", x, " ", y, " ", key, "\n"
    print "\nFull Depth Reached"
    #print key, " SWAPPED:? ", swap(x, y, key)
    print swap(x, y, key)
    print " X=", x, ", Y=", y, ", key=", key
    return key

def swap(x, y, key):
    v = key[x]
    key[x] = key[y]
    key[y] = v
    return key
Any help would be greatly appreciated, this is a really cool project and I don't want to abandon it.
Thanks to all! Comments on my method or anything are welcome.
Happened upon my old question later in my career
To efficiently do this, you want to write a generator.
Instead of returning a list of all of the permutations, which requires that you store them (all of them) in memory, a generator returns one permutation (one element of this list), then pauses, and then computes the next one when you ask for it.
The advantages to generators are:
Take up much less space.
A generator takes up between 40 and 80 bytes of space, yet it can generate millions of items.
A list with one item takes up 40 bytes; a list with 1000 items takes up 4560 bytes.
More efficient
Only computes as many values as you need: when permuting the alphabet, if the correct permutation is found before the end of the list, no time is wasted generating all of the remaining permutations.
(itertools.permutations is an example of a generator.)
How do I Write a Generator?
Writing a generator in Python is actually very easy.
Basically, write code that would work for generating a list of permutations. Now, instead of writing resultList += [resultItem], write yield resultItem.
Now you've made a generator. If I wanted to loop over my generator, I could write
for i in myGenerator:
It's that easy.
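As a toy illustration of that one-line change (my addition, not the original answer's code):
def squares_list(n):
    result = []
    for i in range(n):
        result += [i * i]     # builds the whole list in memory
    return result

def squares_gen(n):
    for i in range(n):
        yield i * i           # pauses here, resumes on the next request

for sq in squares_gen(5):     # 0 1 4 9 16, computed one at a time
    print(sq)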
Below is a generator for the code that I tried to write long ago:
def permutations(iterable, r=None):
    # permutations('ABCD', 2) --> AB AC AD BA BC BD CA CB CD DA DB DC
    # permutations(range(3)) --> 012 021 102 120 201 210
    pool = tuple(iterable)
    n = len(pool)
    r = n if r is None else r
    if r > n:
        return
    indices = range(n)
    cycles = range(n, n-r, -1)
    yield tuple(pool[i] for i in indices[:r])
    while n:
        for i in reversed(range(r)):
            cycles[i] -= 1
            if cycles[i] == 0:
                indices[i:] = indices[i+1:] + indices[i:i+1]
                cycles[i] = n - i
            else:
                j = cycles[i]
                indices[i], indices[-j] = indices[-j], indices[i]
                yield tuple(pool[i] for i in indices[:r])
                break
        else:
            return
I think you have a really good idea, but keeping track of the positions might get a bit difficult to deal with. The general way I've seen for generating permutations recursively is a function which takes two string arguments: one to strip characters from (str) and one to add characters to (soFar).
When generating a permutation, we can think of taking characters from str and adding them to soFar. Assume we have a function perm that takes these two arguments and finds all permutations of str. We'll have permutations beginning with each character in str, so we just need to loop over str, using each of those characters as the initial character and calling perm on the characters remaining in the string:
# a runnable Python version of the recursive sketch
def perm(s, soFar=""):
    if s == "":
        print(soFar)  # here we have a valid permutation
        return
    for i in range(len(s)):
        nxt = soFar + s[i]            # pick s[i] as the next character
        remaining = s[:i] + s[i+1:]   # everything except s[i]
        perm(remaining, nxt)
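For example (my addition), calling it on a short string prints all six permutations:
perm("abc")
# abc
# acb
# bac
# bca
# cab
# cba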
