I'm currently learning and practicing algorithms on strings. Specifically, I was toying with replacing patterns in strings using KMP with some modifications, which has O(N) complexity (my implementation is below).
def replace_string(s, p, c):
    """
    Replace pattern p in string s with c
    :param s: initial string
    :param p: pattern to replace
    :param c: replacing string
    """
    pref = [0] * len(p)   # prefix function of the pattern
    s_p = p + '#' + s     # classic KMP trick: pattern + separator + text
    p_prev = 0
    shift = 0             # length difference accumulated by earlier replacements
    for i in range(1, len(s_p)):
        k = p_prev
        while k > 0 and s_p[i] != s_p[k]:
            k = pref[k - 1]
        if s_p[i] == s_p[k]:
            k += 1
        if i < len(p):
            pref[i] = k
        p_prev = k
        if k == len(p):   # a full match of p ends at position i
            s = s[:i - 2 * len(p) + shift] + c + s[i - len(p) + shift:]
            shift += len(c) - k
    return s
Then I wrote the same program using the built-in Python str.replace function:
def replace_string_python(s, p, c):
    return s.replace(p, c)
and compared performance on various strings. I'll attach just one example, for a string of length 1e5:
import time

if __name__ == '__main__':
    initial_string = "a" * 100000
    pattern = "a"
    replace = "ab"

    start = time.time()
    res = replace_string(initial_string, pattern, replace)
    print('total time:', time.time() - start)

    start = time.time()
    res = replace_string_python(initial_string, pattern, replace)
    print('total time:', time.time() - start)
Output (my implementation):
total time: 1.1617710590362549
Output (python built-in):
total time: 0.0015637874603271484
As you can see, the implementation via Python's str.replace is light-years ahead of KMP. So my question is: why is that? What algorithm does Python's C code use?
While the algorithm might be O(N), your implementation does not seem linear, at least not with respect to repeated occurrences of the pattern, because of
s = s[:i - 2 * len(p) + shift] + c + s[i - len(p) + shift:]
which is itself O(N): it rebuilds the entire string on every match. Thus if your pattern occurs N times in the string, your implementation is in fact O(N^2). (As for the C side: CPython's str.replace searches with a heavily optimized routine, a Boyer-Moore/Horspool-style "fastsearch", but the asymptotic gap you measured comes from the slicing, not from the search algorithm.)
See the following timings for the scaling of your algorithm, which confirm the quadratic shape:
LENGTH TIME
------------
100000 1s
200000 8s
300000 31s
400000 76s
500000 134s
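For what it's worth, here is a minimal sketch of a linear-time variant (my addition, not the asker's code): instead of rebuilding the string at every match, collect the pieces in a list and join once at the end. The k = 0 reset gives non-overlapping replacements, matching str.replace semantics.
def replace_string_linear(s, p, c):
    """Same KMP scan as above, but joins output pieces once at the end."""
    pref = [0] * len(p)
    s_p = p + '#' + s
    pieces = []   # output fragments, joined once at the end
    last = 0      # end of the last region of s already copied
    k = 0
    for i in range(1, len(s_p)):
        while k > 0 and s_p[i] != s_p[k]:
            k = pref[k - 1]
        if s_p[i] == s_p[k]:
            k += 1
        if i < len(p):
            pref[i] = k
        if k == len(p):
            match_start = i - 2 * len(p)   # match position within s
            pieces.append(s[last:match_start])
            pieces.append(c)
            last = match_start + len(p)
            k = 0   # restart after the match: non-overlapping replacements
    pieces.append(s[last:])
    return ''.join(pieces)
With this change, the 100000-character benchmark above should take milliseconds rather than seconds, since each character of s is copied only a constant number of times.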
Related
Using this code to get the count of permutations is slow for big numbers, because the partition step takes a long time to compute all the partitions of a number like 100, and keeping all the partitions in RAM makes it very memory-hungry. Is there any way to get the count of permutations faster? Thanks.
get_permutations_count(10, 10) means all the permutations of length 10 using 10 distinct symbols, and get_permutations_count(10, 1) means all the permutations of length 10 using 1 distinct symbol, which is 10, since those permutations are 0000000000, 1111111111, 2222222222, 3333333333, ..., 9999999999.
from sympy.utilities.iterables import partitions
from sympy import binomial, factorial


def get_permutations_count(all_symbols_count, used_symbols_count):
    m = n = all_symbols_count
    r = n - used_symbols_count
    result = 0
    for partition in partitions(r):
        length = 0
        if 2 * r > n:
            for k, v in partition.items():
                length += (k + 1) * v
        if length > n:
            continue
        # combinations
        C = binomial(m, n - r)
        d = n - r
        for v in partition.values():
            C *= binomial(d, v)
            d -= v
        # permutations
        P = 1
        for k in partition.keys():
            for v in range(partition[k]):
                P *= factorial(k + 1)
        P = factorial(n) // P
        result += C * P
    return result


if __name__ == "__main__":
    print(get_permutations_count(300, 270))  # takes a long time to calculate
    print(get_permutations_count(10, 9))     # prints: 163296000
    print(get_permutations_count(10, 10))    # prints: 3628800
Following this answer, you can find the derivation of an efficient algorithm for counting such permutations.
It is achieved by generalizing the problem to count sequences whose length is not necessarily equal to the size of the alphabet.
from functools import lru_cache


@lru_cache(maxsize=None)
def get_permutations_count(n_symbols, length, distinct, used=0):
    '''
    - n_symbols: number of symbols in the alphabet
    - length: the number of symbols in each sequence
    - distinct: the number of distinct symbols in each sequence
    - used: the number of distinct symbols already placed
    '''
    if distinct < 0:
        return 0
    if length == 0:
        return 1 if distinct == 0 else 0
    else:
        # Each position either repeats one of the `used` symbols already placed,
        # or introduces one of the remaining (n_symbols - used) fresh symbols.
        return \
            get_permutations_count(n_symbols, length-1, distinct-0, used+0) * used + \
            get_permutations_count(n_symbols, length-1, distinct-1, used+1) * (n_symbols - used)
Then
get_permutations_count(n_symbols=300, length=300, distinct=270)
runs in about 0.5 seconds, giving the answer
2729511887951350984580070745513114266766906881300774347439917775
7093985721949669285469996223829969654724957176705978029888262889
8157939885553971500652353177628564896814078569667364402373549268
5524290993833663948683375995196081654415976659499171897405039547
1546236260377859451955180752885715923847446106509971875543496023
2494854876774756172488117802642800540206851318332940739395445903
6305051887120804168979339693187702655904071331731936748927759927
3688881301614948043182289382736687065840703041231428800720854767
0713406956719647313048146023960093662879015837313428567467555885
3564982943420444850950866922223974844727296000000000000000000000
000000000000000000000000000000000000000000000000
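As a sanity check (my addition, not part of the original answer), the recurrence can be compared against a brute-force count on small inputs:
from itertools import product

def brute_force_count(n_symbols, length, distinct):
    # count sequences over an n_symbols alphabet using exactly `distinct` symbols
    return sum(1 for seq in product(range(n_symbols), repeat=length)
               if len(set(seq)) == distinct)

# e.g. both give 144 for 4 symbols, length 4, exactly 3 distinct:
# C(4,3) choices of symbols times 36 surjections of 4 positions onto them
assert get_permutations_count(4, 4, 3) == brute_force_count(4, 4, 3) == 144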
I am trying to find the longest palindromic substring of a given string (a LeetCode problem).
I am getting a lower runtime for expanding centers even though its time complexity is O(N**2) and Manacher's is O(N).
What mistake am I making?
# Manacher's algorithm, O(N) ---> runtime 2000ms
class Solution(object):
    def addhashspace(self, s: str) -> str:
        # interleave '#' sentinels so odd- and even-length palindromes are
        # handled uniformly; '$' and the extra trailing '#' guard the ends
        t = ''
        for i in range(len(s)):
            t += '#' + s[i]
        t = '$' + t + '##'
        return t

    def longestPalindrome(self, s: str) -> str:
        t = self.addhashspace(s)
        P = [0] * len(t)
        maxi = 0
        index = 0
        C, R = 0, 0   # center and right edge of the rightmost palindrome so far
        for i in range(1, len(t) - 1):
            mirr = 2 * C - i   # mirror position of i around C
            if i < R:
                P[i] = min(P[mirr], R - i)
            while t[i + P[i] + 1] == t[i - P[i] - 1]:
                P[i] += 1
            if i + P[i] > R:
                C = i
                R = i + P[i]
            if P[i] > maxi:
                maxi = P[i]
                index = i
        ind1 = index // 2 - P[index] // 2 - maxi % 2
        ind2 = index // 2 + P[index] // 2
        return s[ind1:ind2]
# Expanding centers, O(N**2) ----> runtime 800ms
import time

class Solution(object):
    def longestPalindrome(self, s: str) -> str:
        startt = time.time()
        if len(s) <= 1:
            return s
        start = end = 0
        length = len(s)
        for i in range(length):
            max_len_1 = self.get_max_len(s, i, i + 1)  # even-length palindromes
            max_len_2 = self.get_max_len(s, i, i)      # odd-length palindromes
            max_len = max(max_len_1, max_len_2)
            if max_len > end - start:
                start = i - (max_len - 1) // 2
                end = i + max_len // 2
        print("Execution Time of 2nd Algo " + str((time.time() - startt) * 10**6) + " us")
        return s[start: end + 1]

    def get_max_len(self, s: str, left: int, right: int) -> int:
        length = len(s)
        while left >= 0 and right < length and s[left] == s[right]:
            left -= 1
            right += 1
        return right - left - 1
"civilwartestingwhetherthatnaptionoranynartionsoconceivedandsodedicatedcanlongendureWeareqmetonagreatbattlefiemldoftzhatwarWehavecometodedicpateaportionofthatfieldasafinalrestingplaceforthosewhoheregavetheirlivesthatthatnationmightliveItisaltogetherfangandproperthatweshoulddothisButinalargersensewecannotdedicatewecannotconsecratewecannothallowthisgroundThebravelmenlivinganddeadwhostruggledherehaveconsecrateditfaraboveourpoorponwertoaddordetractTgheworldadswfilllittlenotlenorlongrememberwhatwesayherebutitcanneverforgetwhattheydidhereItisforusthelivingrathertobededicatedheretotheulnfinishedworkwhichtheywhofoughtherehavethusfarsonoblyadvancedItisratherforustobeherededicatedtothegreattdafskremainingbeforeusthatfromthesehonoreddeadwetakeincreaseddevotiontothatcauseforwhichtheygavethelastpfullmeasureofdevotionthatweherehighlyresolvethatthesedeadshallnothavediedinvainthatthisnationunsderGodshallhaveanewbirthoffreedomandthatgovernmentofthepeoplebythepeopleforthepeopleshallnotperishfromtheearth"
This is the worst-case test input among the inputs I was using to compare the two algorithms. I used a counter to see the number of iterations each took to find the palindrome "ranynar" in the given string.
Not shockingly, Manacher's took 1189 iterations whereas expanding centers took 1094 iterations.
Manacher's definitely does more computational work in each iteration, and that explains the runtime difference.
I guess my test input size has been too small to really take advantage of the O(N) time complexity, since Manacher's requires more steps in each iteration.
I will have to test with larger input sizes to see the difference between the two.
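A sketch of that larger-scale test (my addition; ManacherSolution and ExpandSolution are hypothetical renames of the two Solution classes above): a long run of a single character is the worst case for center expansion, since every center expands all the way to a string boundary.
import time

def time_longest_palindrome(solution, s):
    start = time.perf_counter()
    solution.longestPalindrome(s)
    return time.perf_counter() - start

for n in (1000, 5000, 10000):
    s = 'a' * n
    # ManacherSolution / ExpandSolution: the two classes above, renamed
    print(n,
          time_longest_palindrome(ManacherSolution(), s),
          time_longest_palindrome(ExpandSolution(), s))
On such inputs the O(N) algorithm should pull ahead quickly as n grows.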
I'm trying to lex (i.e., tokenize) escaped strings fast in pure CPython (without resorting to C code).
The best I have been able to come up with is the following:
def bench(s, c, i, n):
    m = 0
    iteration = 0
    while iteration < n:
        # How do I optimize this part?
        # Inputs: string s, index i
        k = i
        while True:
            j = s.index(c, k, n)
            sub = s[k:j]
            if '\\' not in sub: break
            k += sub.index('\\') + 2
        # Outputs: substring s[i:j], index j
        m += j - i
        iteration += 1
    return m
def test():
    from time import perf_counter
    start = perf_counter()
    s = 'sd;fa;sldkfjas;kdfj;askjdf;askjd;fasdjkfa, "abcdefg", asdfasdfas;dfasdl;fjas;dfjk'
    m = bench(s, '"', s.index('"') + 1, 3000000)
    print("%.0f chars/usec" % (m / (perf_counter() - start) / 1000000,))

test()
However, it's still somewhat slow for my taste. The invocation of .index seems to be taking a lot of time in my actual project, though it doesn't get called quite as often in this benchmark.
Most strings that need to be lexed can be assumed to be relatively short (say, 7 characters) and are unlikely to contain backslashes; I've already optimized for that somewhat. My question is:
Are there any optimizations I could make to speed up this code? If so, what?
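One option worth benchmarking (my suggestion, not from the original post) is to let the re module scan the whole string body in a single C-level call, instead of alternating .index and slicing in Python:
import re

# body of a double-quoted string: runs of ordinary characters,
# with backslash escapes (backslash plus any character) in between
STRING_BODY = re.compile(r'[^"\\]*(?:\\.[^"\\]*)*', re.DOTALL)

def lex_string_body(s, i):
    """Return (body, j) where s[i:j] is the string body starting just after
    the opening quote; the caller should check that s[j] == '"'."""
    m = STRING_BODY.match(s, i)   # always matches (possibly the empty string)
    return m.group(), m.end()
For short, escape-free strings this is one regex call per token, and the common case stays entirely on the fast path inside the regex engine.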
I have to implement the Z algorithm and use it to search a target text for a specific pattern. I've implemented what I thought was the correct algorithm and search function, but it's really slow. For the naive implementation of string search I consistently got times under 1.5 seconds, while for the Z string search I consistently got times over 3 seconds (on my biggest test case), so I have to be doing something wrong. The results seem to be correct, or at least they were for the few test cases we were given. The code for the functions mentioned in my rant is below:
import sys
import time

# z algorithm a.k.a. the fundamental preprocessing algorithm
def z(P, start=1, max_box_size=sys.maxsize):
    n = len(P)
    boxes = [0] * n
    l = -1
    r = -1
    for k in range(start, n):
        if k > r:
            i = 0
            while k + i < n and P[i] == P[k + i] and i < max_box_size:
                i += 1
            boxes[k] = i
            if i:
                l = k
                r = k + i - 1
        else:
            kp = k - l
            Z_kp = boxes[kp]
            if Z_kp < r - k + 1:
                boxes[k] = Z_kp
            else:
                i = r + 1
                while i < n and P[i] == P[i - k] and i - k < max_box_size:
                    i += 1
                boxes[k] = i - k
                l = k
                r = i - 1
    return boxes

# a simple string search
def naive_string_search(P, T):
    m = len(T)
    n = len(P)
    indices = []
    for i in range(m - n + 1):
        if P == T[i: i + n]:
            indices.append(i)
    return indices

# string search using the z algorithm.
# The pattern you're searching for is simply prepended to the target text
# and then the z algorithm is run on that concatenation
def z_string_search(P, T):
    PT = P + T
    n = len(P)
    boxes = z(PT, start=n, max_box_size=n)
    return list(map(lambda x: x[0]-n, filter(lambda x: x[1] >= n, enumerate(boxes))))
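For reference (my addition), on a tiny input the functions above give:
print(z('aabxaab'))                           # [0, 1, 0, 0, 3, 1, 0]
print(z_string_search('aab', 'aabxaab'))      # [0, 4]
print(naive_string_search('aab', 'aabxaab'))  # [0, 4]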
Your implementation of the z-function def z(..) is algorithmically and asymptotically fine.
It has O(m + n) worst-case time complexity, while the naive string search has O(m*n) worst-case time complexity, so I think the problem is in your test cases.
For example if we take this test case:
T = ['a'] * 1000000
P = ['a'] * 1000
we will get for z-function:
real 0m0.650s
user 0m0.606s
sys 0m0.036s
and for naive string matching:
real 0m8.235s
user 0m8.071s
sys 0m0.085s
PS: You should understand that there are also plenty of test cases where naive string matching runs in linear time, for example:
T = ['a'] * 1000000
P = ['a'] * 1000000
The worst case for naive string matching is when the function has to apply the pattern and re-check again and again. But in this case it does only one check, because of the lengths of the inputs (the pattern cannot even be applied starting from index 1, so the search stops immediately).
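For reference, a minimal harness along those lines (my sketch; plain strings behave the same here as the lists above):
import time

T = 'a' * 1000000
P = 'a' * 1000

for search in (z_string_search, naive_string_search):
    start = time.perf_counter()
    hits = search(P, T)
    print(search.__name__, len(hits), 'matches,',
          time.perf_counter() - start, 'seconds')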
What is the fastest way to swap two digits in a number in Python? I am given the numbers as strings, so it'd be nice if I could have something as fast as
string[j] = string[j] ^ string[j+1]
string[j+1] = string[j] ^ string[j+1]
string[j] = string[j] ^ string[j+1]
Everything I've seen has been much more expensive than it would be in C, and involves making a list and then converting the list back to a string, or some variant thereof.
This is faster than you might think, at least faster than Jon Clements' current answer in my timing test:
i, j = (i, j) if i < j else (j, i) # make sure i < j
s = s[:i] + s[j] + s[i+1:j] + s[i] + s[j+1:]
Here's my test bed should you want to compare any other answers you get:
import timeit
import types
N = 10000
R = 3
SUFFIX = '_test'
SUFFIX_LEN = len(SUFFIX)
def setup():
import random
global s, i, j
s = 'abcdefghijklmnopqrstuvwxyz'
i = random.randrange(len(s))
while True:
j = random.randrange(len(s))
if i != j: break
def swapchars_martineau(s, i, j):
i, j = (i, j) if i < j else (j, i) # make sure i < j
return s[:i] + s[j] + s[i+1:j] + s[i] + s[j+1:]
def swapchars_martineau_test():
global s, i, j
swapchars_martineau(s, i, j)
def swapchars_clements(text, fst, snd):
ba = bytearray(text)
ba[fst], ba[snd] = ba[snd], ba[fst]
return str(ba)
def swapchars_clements_test():
global s, i, j
swapchars_clements(s, i, j)
# find all the functions named *SUFFIX in the global namespace
funcs = tuple(value for id,value in globals().items()
if id.endswith(SUFFIX) and type(value) is types.FunctionType)
# run the timing tests and collect results
timings = [(f.func_name[:-SUFFIX_LEN],
min(timeit.repeat(f, setup=setup, repeat=R, number=N))
) for f in funcs]
timings.sort(key=lambda x: x[1]) # sort by speed
fastest = timings[0][1] # time fastest one took to run
longest = max(len(t[0]) for t in timings) # len of longest func name (w/o suffix)
print 'fastest to slowest *_test() function timings:\n' \
' {:,d} chars, {:,d} timeit calls, best of {:d}\n'.format(len(s), N, R)
def times_slower(speed, fastest):
return speed/fastest - 1.0
for i in timings:
print "{0:>{width}}{suffix}() : {1:.4f} ({2:.2f} times slower)".format(
i[0], i[1], times_slower(i[1], fastest), width=longest, suffix=SUFFIX)
Addendum:
For the special case of swapping digit characters in a positive decimal number given as a string, the following also works and is a tiny bit faster than the general version at the top of my answer.
The somewhat involved conversion back to a string at the end with the format() method is to deal with cases where a zero got moved to the front of the string. I present it mainly as a curiosity, since it's fairly incomprehensible unless you grasp what it does mathematically. It also doesn't handle negative numbers.
n = int(s)
len_s = len(s)
ord_0 = ord('0')
di = ord(s[i])-ord_0
dj = ord(s[j])-ord_0
pi = 10**(len_s-(i+1))
pj = 10**(len_s-(j+1))
s = '{:0{width}d}'.format(n + (dj-di)*pi + (di-dj)*pj, width=len_s)
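To make the arithmetic concrete, here is the snippet traced on a small case (my worked example), including the zero-to-the-front situation the format() call handles:
s, i, j = '120', 0, 2
n = int(s)                       # 120
len_s = len(s)
ord_0 = ord('0')
di = ord(s[i]) - ord_0           # 1, the digit at position 0
dj = ord(s[j]) - ord_0           # 0, the digit at position 2
pi = 10 ** (len_s - (i + 1))     # 100, place value of position 0
pj = 10 ** (len_s - (j + 1))     # 1, place value of position 2
# 120 + (0 - 1)*100 + (1 - 0)*1 = 21, zero-padded back to width 3
s = '{:0{width}d}'.format(n + (dj - di) * pi + (di - dj) * pj, width=len_s)
print(s)                         # '021'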
It has to be a mutable type of some sort; the best I can think of is the following (I can't make any claims as to performance though):
def swapchar(text, fst, snd):
    ba = bytearray(text, 'ascii')  # Python 3: bytearray needs an encoding; assumes ASCII text
    ba[fst], ba[snd] = ba[snd], ba[fst]
    return ba
>>> swapchar('thequickbrownfox', 3, 7)
bytearray(b'thekuicqbrownfox')
You can still utilise the result as you would a str/list, or explicitly convert it to a str (e.g. with ba.decode()) if need be.
>>> int1 = 2
>>> int2 = 3
>>> eval(str(int1)+str(int2))
23
I know you've already accepted an answer, so I won't bother coding it in Python, but here's how you could do it in JavaScript which also has immutable strings:
function swapchar(string, j)
{
return string.replace(RegExp("(.{" + j + "})(.)(.)"), "$1$3$2");
}
Obviously if j isn't in an appropriate range then it just returns the original string.
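For completeness, a direct Python port of the same regex idea would look something like this (my sketch, not the answerer's code):
import re

def swapchar(s, j):
    # swap the characters at positions j and j+1; count=1 mirrors the
    # non-global JavaScript replace, and no match returns s unchanged
    return re.sub('(.{%d})(.)(.)' % j, r'\1\3\2', s, count=1)

print(swapchar('thequickbrownfox', 3))  # theuqickbrownfox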
Given an integer n and two (zero-based) indexes i and j of the digits to swap, counted from the least significant digit, this can be done using powers of ten to locate the digits, division and modulo operations to extract them, and subtraction and addition to perform the swap.
def swapDigits(n, i, j):
    # These powers of 10 encode the locations i and j in n.
    power_i = 10 ** i
    power_j = 10 ** j
    # Retrieve digits [i] and [j] from n.
    digit_i = (n // power_i) % 10
    digit_j = (n // power_j) % 10
    # Remove digits [i] and [j] from n.
    n -= digit_i * power_i
    n -= digit_j * power_j
    # Insert digit [i] in position [j] and vice versa.
    n += digit_i * power_j
    n += digit_j * power_i
    return n
For example:
>>> swapDigits(9876543210, 4, 0)
9876503214
>>> swapDigits(9876543210, 7, 2)
9826543710
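Since the question starts from strings, a round-trip usage would be (my note): convert, swap, and format back, bearing in mind that i and j count from the least significant digit and that a leading zero would be lost without explicit zero-padding:
s = '9876543210'
swapped = swapDigits(int(s), 4, 0)
print('{:0{width}d}'.format(swapped, width=len(s)))  # 9876503214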