I'm a little lost and would love some help building shift structures.
The work is carried out in an 8-hour shift, and employees are entitled to three 10-minute breaks during their shift. For convenience, a break can start on any ten-minute boundary (i.e. breaks can start at 08:00, 09:20, ...).
The maximum interval between two breaks, or between a break and the beginning or end of the shift, must not exceed w = 160 minutes. It can be assumed that w is a multiple of 10 minutes.
I need to build all the possible shift structures.
Say this list represents a shift, with one entry for every 10-minute interval:
print(list(range(1,49)))
Example of one shift
A1=[2,3,....,15,17,........,32,34,...48]
I.e. break 1 in the first 10 minutes of the shift, break 2 after 150 minutes and last break after 330 minutes.
Thanks
All shift structures can be obtained by performing an exhaustive search in three nested loops (one per break), keeping only the combinations that satisfy the condition.
work = set(range(1, 49))
w = 16
breaktime = []
check = 0
for i in range(1, w + 2):
    if i > check:
        break1 = i
        for j in range(i, i + w + 2):
            if j > break1:
                break2 = j
                for k in range(j, j + w + 2):
                    if k > break2 and k < 49:
                        break3 = k
                        if w + k > 49:
                            breaktime.append([break1, break2, break3])
    check += 1
shift_list = []
for l in breaktime:
    shift = list(work - set(l))
    print(shift)
    shift_list.append(shift)
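The exhaustive search described above can also be sketched with itertools.combinations instead of hand-written nested loops. This is only a sketch under one reading of the question: a "gap" is counted as the number of work blocks between two breaks (or between a break and the start or end of the shift), and it may be at most w blocks. The function name shift_structures and its parameters are made up for illustration.

```python
from itertools import combinations

def shift_structures(blocks=48, n_breaks=3, w=16):
    """Yield 1-indexed break blocks whose work stretches never exceed w blocks."""
    for breaks in combinations(range(1, blocks + 1), n_breaks):
        # 0 and blocks + 1 act as sentinels for the shift start and end.
        edges = (0, *breaks, blocks + 1)
        if all(later - earlier - 1 <= w for earlier, later in zip(edges, edges[1:])):
            yield breaks

structures = list(shift_structures())
work = set(range(1, 49))
# Each shift is the list of working blocks, as in the question's A1 example.
shift_list = [sorted(work - set(breaks)) for breaks in structures]
```

Under this reading, the example shift from the question (breaks in blocks 1, 16 and 33) is one of the generated structures.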
I'd just be lazy and use itertools functions here since the data is quite small. At larger numbers one might want to be more clever.
The breaks (and shift) results are zero-indexed for programming convenience (as opposed to 1-indexed as in the question), but that's trivial to adjust, and makes it easier to use e.g. datetime for actual times.
import itertools

def compute_breaks(
    *,
    timeblock_count: int,
    n_breaks: int,
    min_between: int,
    min_start: int = 0,
    min_end: int = 0,
):
    timeblocks = range(timeblock_count)
    for breaks in itertools.combinations(timeblocks, n_breaks):
        if any(abs(a - b) < min_between for a, b in itertools.combinations(breaks, 2)):
            continue
        if any(a < min_start or a > timeblock_count - min_end for a in breaks):
            continue
        yield set(breaks)

def render_shift(shift, timeblock_count: int):
    return "".join("O" if x in shift else "." for x in range(timeblock_count))

def breaks_to_shift(breaks, timeblock_count: int):
    return [i for i in range(timeblock_count) if i not in breaks]

for breaks in sorted(compute_breaks(timeblock_count=48, n_breaks=3, min_between=16)):
    shift = breaks_to_shift(breaks, timeblock_count=48)
    print(render_shift(shift, timeblock_count=48))
The renderings show breaks as .s and work blocks as Os, e.g. (truncated):
.OOOOOOOOOOOOOOO.OOOOOOOOOOOOOOO.OOOOOOOOOOOOOOO
.OOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOO.OOOOOOOOOOOOOO
.OOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOOO.OOOOOOOOOOOOO
...
.OOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOOOOOOOOOOOOOOO.O
.OOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO.
...
O.OOOOOOOOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOO.OOOOOOO
O.OOOOOOOOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOOO.OOOOOO
O.OOOOOOOOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOOOO.OOOOO
...
OOOOO.OOOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOO.OOOOOOOO
OOOOO.OOOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOOO.OOOOOOO
OOOOO.OOOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOOOO.OOOOOO
...
OOOOOOOOOOOOO.OOOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOO.
OOOOOOOOOOOOO.OOOOOOOOOOOOOOOOO.OOOOOOOOOOOOOOO.
OOOOOOOOOOOOOO.OOOOOOOOOOOOOOO.OOOOOOOOOOOOOOO.O
OOOOOOOOOOOOOO.OOOOOOOOOOOOOOO.OOOOOOOOOOOOOOOO.
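As mentioned above, zero-indexed blocks map directly onto clock times with datetime. A minimal sketch, assuming the shift starts at 08:00 and each block is 10 minutes long (both taken from the question); the helper name block_start_times is hypothetical:

```python
from datetime import datetime, timedelta

def block_start_times(breaks, shift_start="08:00", block_minutes=10):
    """Convert zero-indexed break blocks to HH:MM start times."""
    start = datetime.strptime(shift_start, "%H:%M")
    return [
        (start + timedelta(minutes=block_minutes * block)).strftime("%H:%M")
        for block in sorted(breaks)
    ]

print(block_start_times({0, 16, 32}))  # ['08:00', '10:40', '13:20']
```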
I wrote a python script that calculates all possibilities where the following conditions are met:
1. a^2 + b^2 + c^2 + d^2 + e^2 = f^2
2. a, b, c, d, e, f are distinct and nonzero integers
3. a, b, c, d, e are even numbers between twin primes (e.g. 11 & 13 are twin primes, so 12 is a valid possibility)
4. f ≤ 65535
5. the sum of the digits of a == the sum of the digits of b == the sum of the digits of c == the sum of the digits of d == the sum of the digits of e == the sum of the digits of f
I'm not positive whether there will be any results when including criteria 5, but I'd like to find out in a timely manner at least. Ideally, the following conditions should also be met:
results that use the same values for a,b,c,d,e,f but in a different order should not be in the results; ideally they should be excluded from the for loops as well
results should be sorted by lowest value of a first, lowest value of b first and so and so forth
My question would be, how can I decrease the operating time and increase efficiency?
import itertools
import time
start_time = time.time()

def is_prime(n):
    for i in range(2, n):
        if n % i == 0:
            return False
    return True

def generate_twin_primes(start, end):
    for i in range(start, end):
        j = i + 2
        if(is_prime(i) and is_prime(j)):
            n = text_file2.write(str(i+1) + '\n')

def sum_digits(n):
    r = 0
    while n:
        r, n = r + n % 10, n // 10
    return r

def is_sorted(vals):
    for i in range(len(vals)-2):
        if vals[i] < vals[i+1]:
            return False
    return True

def pythagorean_sixlet():
    valid = []
    for a in x:
        for b in x:
            for c in x:
                for d in x:
                    for e in x:
                        f = (a * a + b * b + c * c + d * d + e * e)**(1/2)
                        if f % 1 == 0 and all(x[0]!=x[1] for x in list(itertools.combinations([a, b, c, d, e], 2))):
                            if sum_digits(a) == sum_digits(b) == sum_digits(c) == sum_digits(d) == sum_digits(e) == sum_digits(int(f)):
                                valid.append([a, b, c, d, e, int(f)])
    for valid_entry in valid:
        if is_sorted(valid_entry):
            with open("output.txt", "a") as text_file1:
                text_file1.write(str(valid_entry[0]) + " " + str(valid_entry[1]) + " " + str(valid_entry[2]) + " " + str(valid_entry[3]) + " " + str(valid_entry[4]) + " | " + str(valid_entry[5]) + '\n')
            text_file1.close()

#input #currently all even numbers between twin primes under 1000
text_file2 = open("input.txt", "w")
generate_twin_primes(2, 1000)
text_file2.close()

# counting number of lines in "input.txt" and calculating number of potential possibilities to go through
count = 0
fname = "input.txt"
with open(fname, 'r') as f:
    for line in f:
        count += 1
print("Number of lines:", count)
print("Number of potential possibilites:", count**5)
with open('input.txt', 'r') as f:
    x = f.read().splitlines()
    x = [int(px) for px in x]
pythagorean_sixlet()
print("--- %s seconds ---" % (time.time() - start_time))
Well, this smells a lot like a HW problem, so we can't give away the farm too easy... :)
A couple things to consider:
if you want to check unique combinations, the number of possibilities is reduced a good chunk from count**5, right?
You are doing all of your checking at the inner-most part of the loop. Can you do some checking along the way so that you don't have to generate and test all of the possibilities, which is "expensive."
If you do choose to keep your check for uniqueness in the inner portion, find a better way than making all the combinations... that is wayyyyy expensive. Hint: If you made a set of the numbers you have, what would it tell you?
Implementing some of the above:
Number of candidate twin primes between [2, 64152]: 846
total candidate solutions: 1795713740 [need to check f for these]
elapsed: 5.957056045532227
size of result set: 27546
20 random selections from the group:
(40086.0, [3852, 4482, 13680, 20808, 30852])
(45774.0, [6552, 10458, 17028, 23832, 32940])
(56430.0, [1278, 13932, 16452, 27108, 44532])
(64746.0, [15732, 17208, 20772, 32562, 46440])
(47610.0, [3852, 9432, 22158, 24372, 32832])
(53046.0, [3852, 17208, 20772, 23058, 39240])
(36054.0, [4518, 4932, 16452, 21492, 22860])
(18396.0, [3258, 4518, 5742, 9342, 13680])
(45000.0, [2970, 10890, 16650, 18540, 35730])
(59976.0, [2970, 9342, 20772, 35802, 42282])
(42246.0, [3528, 5652, 17208, 25308, 28350])
(39870.0, [3528, 7308, 16362, 23292, 26712])
(64656.0, [8820, 13932, 16452, 36108, 48312])
(61200.0, [198, 882, 22158, 35532, 44622])
(55350.0, [3168, 3672, 5652, 15732, 52542])
(14526.0, [1278, 3528, 7128, 7560, 9432])
(65106.0, [5652, 30852, 31248, 32832, 34650])
(63612.0, [2088, 16830, 26730, 33750, 43650])
(42066.0, [2088, 13932, 15642, 23832, 27540])
(31950.0, [828, 3582, 13932, 16452, 23292])
--- 2872.701852083206 seconds ---
[Finished in 2872.9s]
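The first hint (only visit each unordered combination once) can be sketched as follows. This is not the code that produced the numbers above, just one possible reading of the hints: a sieve replaces the trial-division is_prime, itertools.combinations over a sorted candidate list guarantees distinct values in ascending order, and math.isqrt checks for a perfect square without floating point. The names twin_prime_middles and sixlets are made up for illustration, and the digit-sum criterion is left out.

```python
from itertools import combinations
from math import isqrt

def twin_prime_middles(limit):
    """Even numbers p + 1 where p and p + 2 are both prime and p + 2 <= limit."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for n in range(2, isqrt(limit) + 1):
        if sieve[n]:
            sieve[n * n::n] = [False] * len(range(n * n, limit + 1, n))
    return [p + 1 for p in range(3, limit - 1) if sieve[p] and sieve[p + 2]]

def sixlets(candidates, f_max=65535):
    """Yield ((a, b, c, d, e), f) with a < b < c < d < e and a^2 + ... + e^2 == f^2."""
    for combo in combinations(sorted(candidates), 5):
        total = sum(v * v for v in combo)
        f = isqrt(total)
        if f * f == total and f <= f_max:
            yield combo, f

candidates = twin_prime_middles(200)   # small limit so the demo finishes quickly
results = list(sixlets(candidates))
```

Because combinations walks the sorted candidates in lexicographic order, the results already come out sorted by a, then b, and so on. Since criterion 5 requires all digit sums to be equal, grouping the candidates by digit sum before combining them would prune the search further.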
Context
Firstly, thanks for hypothesis. It's both extremely powerful and extremely useful!
I've written a hypothesis strategy to produce monotonic (ANDS and ORs) policy expressions of the form:
(A and (B or C))
This can be thought of as a tree structure, where A, B and C are attributes at the leaf nodes, whereas 'and' and 'or' are non-leaf nodes.
The strategy seems to generate expressions as desired.
>>> find(policy_expressions(), lambda x: len(x.split()) > 3)
'(A or (A or A))'
(Perhaps the statistical diversity of examples could be improved, but that is not the essence of this question).
Inequalities are valid too. For example:
(N or (WlIorO and (nX <= 55516 and e)))
I want to constrain or filter the examples so that I can generate policy expressions with a specified number of leaf nodes (i.e. attributes).
For a performance test, I've tried using data.draw() with filter something like this:
@given(data=data())
def test_keygen_encrypt_proxy_decrypt_decrypt_execution_time(data, n):
    """
    :param n: the input size n. Number of attributes or leaf nodes in policy tree.
    """
    policy_str = data.draw(strategy=policy_expressions().filter(lambda x: len(extract_attributes(group, x)) == n),
                           label="policy string")
Where extract_attributes() counts the number of leaf nodes in the expression and n is the desired number of leaves.
The problem with this solution is that when n > 16, hypothesis throws a:
hypothesis.errors.Unsatisfiable: Unable to satisfy assumptions of hypothesis test_keygen_encrypt_proxy_decrypt_decrypt_execution_time.
I want to generate valid policy expressions with 100s of leaf nodes.
Another drawback of that approach was that hypothesis reported HealthCheck.filter_too_much and HealthCheck.too_slow and the @settings got ugly.
I would rather have a parameter to say policy_expressions(leaf_nodes=4) to get an example like this:
(N or (WlIorO and (nX <= 55516 and e)))
I avoided that initially, because I'm not able to see how to do it with the recursive strategy code.
Question
Can you suggest a way to refactor this strategy so that it can be parameterized for number of leaf nodes?
Here's the strategy code (it's open source in Charm Crypto anyway)
from hypothesis.strategies import text, composite, sampled_from, characters, one_of, integers

def policy_expressions():
    return one_of(attributes(), inequalities(), policy_expression())

@composite
def policy_expression(draw):
    left = draw(policy_expressions())
    right = draw(policy_expressions())
    gate = draw(gates())
    return u'(' + u' '.join((left, gate, right)) + u')'

def attributes():
    return text(min_size=1, alphabet=characters(whitelist_categories='L', max_codepoint=0x7e))

@composite
def inequalities(draw):
    attr = draw(attributes())
    oper = draw(inequality_operators())
    numb = draw(integers(min_value=1))
    return u' '.join((attr, oper, str(numb)))

def inequality_operators():
    return sampled_from((u'<', u'>', u'<=', u'>='))

def gates():
    return sampled_from((u'or', u'and'))

def assert_valid(policy_expression):
    assert policy_expression  # not empty
    assert policy_expression.count(u'(') == policy_expression.count(u')')
https://github.com/JHUISI/charm/blob/dev/charm/toolbox/policy_expression_spec.py
I'd suggest explicitly building in the number of leaves into how the data is constructed, then passing in the number of leaves you want:
from hypothesis.strategies import text, composite, sampled_from, characters, one_of, integers

def policy_expressions_of_size(num_leaves):
    if num_leaves == 1:
        return attributes()
    elif num_leaves == 2:
        return one_of(inequalities(), policy_expression(num_leaves))
    else:
        return policy_expression(num_leaves)

policy_expressions = integers(min_value=1, max_value=500).flatmap(policy_expressions_of_size)

@composite
def policy_expression(draw, num_leaves):
    left_leaves = draw(integers(min_value=1, max_value=num_leaves - 1))
    right_leaves = num_leaves - left_leaves
    left = draw(policy_expressions_of_size(left_leaves))
    right = draw(policy_expressions_of_size(right_leaves))
    gate = draw(gates())
    return u'(' + u' '.join((left, gate, right)) + u')'

def attributes():
    return text(min_size=1, alphabet=characters(whitelist_categories='L', max_codepoint=0x7e))

@composite
def inequalities(draw):
    attr = draw(attributes())
    oper = draw(inequality_operators())
    numb = draw(integers(min_value=1))
    return u' '.join((attr, oper, str(numb)))

def inequality_operators():
    return sampled_from((u'<', u'>', u'<=', u'>='))

def gates():
    return sampled_from((u'or', u'and'))
Then you can pick exactly how large you want the policy expression to be:
>>> policy_expressions.example()
'((((((oOjFo or (((cH and (Q or (uO > 18 and byy))) and kS) or pqKUUZ > 74)) and (gi or mwsrU <= 4115)) and qLkVSTqXZxgScTj) and (vNJ > 969 and (Drwvh or (((xhmsWhHpc or hQSMnfgyiYnblLFJ) or sesfHbQ) and jt)))) or xS) and ((V and (mArqYR or qY)) or (((uVf and bbtKUCnecMKjRJD > 18944) and nerVkPSs < 29292) and (UlOJebfbgcJz or (bxfVfjgmfulSB > 71 or (jqGLlr or (zQqj and zqUGwc < 24845)))))))'
>>>
>>> policy_expressions_of_size(1).example()
'Eo'
>>>
>>> policy_expressions_of_size(2).example()
'KJAitOKC > 18179'
>>> policy_expressions_of_size(10).example()
'(((htjdVy or (((XTfZil or (rqZw and DEOeER)) and xGVsdeQJLTJxLsC < 388312303) or LxLfUPljUTH)) or (Kb or EoipoYzjncAGKTE)) or bc)'
>>> policy_expressions_of_size(100).example()
'(((((CxySeUrNW or bZG) or (gzSUGgTG and (((V or n) or wqA) or veuTEnjGKwIpkDDDBiQkMwsNbxrBv))) or (((SKgQSXtAg or ChCHcEsVavy) and (((Yxj and xcCX) or QrILGAWxVKXWRb > 98817811688973569232860005374239659122) or JD <= 28510)) and KhrGfZciz > 4057857855522854443)) and (ZMIzFELKAKDMrH and (((MOmAZ and J <= 22052) or (Scy >= 17563 and (VCS and ((FFLa and EtZvqwNymnZNnjlREM) or pU)))) or A))) and ((((kaYzzIXIu and (lwos and (vp and GqG))) and ((Nh and lb) or ((TbNZWYOpYmj and (AQs or w)) or NjFYLBr > 228431293))) or ((((FTSXkXGZyKXD or zXeVEqNgkyXI) or mNGI) or ((cGOGK or gjcI) and DQzYonXszfSrZMB)) and JI > 3802)) or (((jIREd and IVzFB >= 28149) and (UdCBg < 20 or (VSGxr or XBuiS <= 1615))) and (rE > 10511139808015932 and ((((((((W and u) or yslVZ) or (eVGlz < 7033 or UiE)) and ((trOmArBc and Zx) or mPKva)) or ((qqDmKUpAnW or yvSkhTgqXQaLnxL) or Z)) or snXcMDhhf) and ((Wu or XSjbKdsZqEiXXvOb) and (DNZg and qv >= 7503))) and ((rnffxTLThwvw >= 24460 and ((oO or y <= 24926) and (NjM and vEHukii))) or ((((BTdpW and rP) or (rjUylCZwJzGobXZR or MNoBdEEIuLbTRvZHMb < 7958346708112664935)) and ((YU or gY >= 15498) and (s and GnOydthO > 103))) or ((caumKPjp < 27 and OQoFXscbD) or ((qaxYwfnelmetYqHKnatQ or P) and (ixzsvX and mYROpqoHAqeEy))))))))))'
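To see why splitting the leaf budget between the left and right subtree always yields exactly the requested number of leaves, here is the same idea with the standard library's random module swapped in for Hypothesis. It is only an illustration of the splitting step, not a Hypothesis strategy; the name random_policy and the single-letter attributes are assumptions made for this sketch, and inequalities are left out.

```python
import random

def random_policy(num_leaves, rng):
    """Build an expression with exactly num_leaves attributes by splitting the budget."""
    if num_leaves == 1:
        return rng.choice("ABCDEFGH")  # hypothetical single-letter attributes
    left_leaves = rng.randint(1, num_leaves - 1)  # both sides get at least one leaf
    left = random_policy(left_leaves, rng)
    right = random_policy(num_leaves - left_leaves, rng)
    gate = rng.choice(("and", "or"))
    return "(" + " ".join((left, gate, right)) + ")"

expr = random_policy(100, random.Random(0))
tokens = expr.replace("(", " ").replace(")", " ").split()
leaves = [tok for tok in tokens if tok not in ("and", "or")]
```

Every binary gate joins two non-empty subtrees, so an expression with n leaves always contains n - 1 gates and n - 1 pairs of parentheses.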
Problem: I need to convert an amount to Indian currency format
My code: I have the following Python implementation:
import decimal

def currencyInIndiaFormat(n):
    d = decimal.Decimal(str(n))
    if d.as_tuple().exponent < -2:
        s = str(n)
    else:
        s = '{0:.2f}'.format(n)
    l = len(s)
    i = l-1;
    res = ''
    flag = 0
    k = 0
    while i>=0:
        if flag==0:
            res = res + s[i]
            if s[i]=='.':
                flag = 1
        elif flag==1:
            k = k + 1
            res = res + s[i]
            if k==3 and i-1>=0:
                res = res + ','
                flag = 2
                k = 0
        else:
            k = k + 1
            res = res + s[i]
            if k==2 and i-1>=0:
                res = res + ','
                flag = 2
                k = 0
        i = i - 1
    return res[::-1]

def main():
    n = 100.52
    print("INR " + currencyInIndiaFormat(n))  # INR 100.52
    n = 1000.108
    print("INR " + currencyInIndiaFormat(n))  # INR 1,000.108
    n = 1200000
    print("INR " + currencyInIndiaFormat(n))  # INR 12,00,000.00

main()
My Question: Is there a way to make my currencyInIndiaFormat function shorter, more concise and clean ? / Is there a better way to write my currencyInIndiaFormat function ?
Note: My question is mainly based on Python implementation of the above stated problem. It is not a duplicate of previously asked questions regarding conversion of currency to Indian format.
Indian Currency Format:
For example, numbers here are represented as:
1
10
100
1,000
10,000
1,00,000
10,00,000
1,00,00,000
10,00,00,000
Refer Indian Numbering System
Too much work.
>>> import locale
>>> locale.setlocale(locale.LC_MONETARY, 'en_IN')
'en_IN'
>>> print(locale.currency(100.52, grouping=True))
₹ 100.52
>>> print(locale.currency(1000.108, grouping=True))
₹ 1,000.11
>>> print(locale.currency(1200000, grouping=True))
₹ 12,00,000.00
You can follow these steps.
Install Babel python package from pip
pip install Babel
In your python script
from babel.numbers import format_currency
format_currency(5433422.8012, 'INR', locale='en_IN')
Output:
₹ 54,33,422.80
def formatINR(number):
    s, *d = str(number).partition(".")
    r = ",".join([s[x-2:x] for x in range(-3, -len(s), -2)][::-1] + [s[-3:]])
    return "".join([r] + d)
It's simple to use:
print(formatINR(123456))
Output
1,23,456
If you want to handle negative numbers
def negativeFormatINR(number):
    negativeNumber = False
    if number < 0:
        number = abs(number)
        negativeNumber = True
    s, *d = str(number).partition(".")
    r = ",".join([s[x-2:x] for x in range(-3, -len(s), -2)][::-1] + [s[-3:]])
    value = "".join([r] + d)
    if negativeNumber:
        return '-' + value
    return value
It's simple to use:
print(negativeFormatINR(100-10000))
output
-9,900
Note: this is an alternative solution to the actual question.
If anyone is trying to abbreviate amounts with the common Indian suffixes K, L, or Cr, truncated to two decimal places, the following solution works.
def format_cash(amount):
    def truncate_float(number, places):
        return int(number * (10 ** places)) / 10 ** places

    if amount < 1e3:
        return amount
    if 1e3 <= amount < 1e5:
        return str(truncate_float((amount / 1e5) * 100, 2)) + " K"
    if 1e5 <= amount < 1e7:
        return str(truncate_float((amount / 1e7) * 100, 2)) + " L"
    if amount >= 1e7:  # >= so that exactly 1e7 is handled as well
        return str(truncate_float(amount / 1e7, 2)) + " Cr"
Examples
format_cash(7843) --> '7.84 K'
format_cash(78436) --> '78.43 K'
format_cash(784367) --> '7.84 L'
format_cash(7843678) --> '78.43 L'
format_cash(78436789) --> '7.84 Cr'
Here is the other way around:
import re
def in_for(value):
value,b=str(value),''
value=''.join(map(lambda va:va if re.match(r'[0-9,.]',va) else '',value))
val=value
if val.count(',')==0:
v,c,a,cc,ii=val,0,[3,2,2],0,0
val=val[:val.rfind('.')] if val.rfind('.')>=0 else val
for i in val[::-1]:
if c==ii and c!=0:
ii+=a[cc%3]
b=','+i+b
cc+=1
else:
b=i+b
c+=1
b=b[1:] if b[0]==',' else b
val=b+v[value.rfind('.'):] if value.rfind('.')>=0 else b
else:
val=str(val).strip('()').replace(' ','')
v=val.rfind('.')
if v>0:
val=val[:v+3]
return val.rstrip('0').rstrip('.') if '.' in val else val
print(in_for('1000000000000.5445'))
Output will be:
10,000,00,00,000.54
(As mentioned in wikipedia indian number system Ex:67,89,000,00,00,000)
def format_indian(t):
    dic = {
        4:'Thousand',
        5:'Lakh',
        6:'Lakh',
        7:'Crore',
        8:'Crore',
        9:'Arab'
    }
    y = 10
    len_of_number = len(str(t))
    save = t
    z=y
    while(t!=0):
        t=int(t/y)
        z*=10
    zeros = len(str(z)) - 3
    if zeros>3:
        if zeros%2!=0:
            string = str(save)+": "+str(save/(z/100))[0:4]+" "+dic[zeros]
        else:
            string = str(save)+": "+str(save/(z/1000))[0:4]+" "+dic[zeros]
        return string
    return str(save)+": "+str(save)
This code converts your numbers to Lakhs, Crores and Arabs in a simple way. Hope it helps.
for i in [1.234567899 * 10**x for x in range(9)]:
print(format_indian(int(i)))
Output:
1: 1
12: 12
123: 123
1234: 1234
12345: 12.3 Thousand
123456: 1.23 Lakh
1234567: 12.3 Lakh
12345678: 1.23 Crore
123456789: 12.3 Crore
Another way:
def formatted_int(value):
    # if the value is 100, 10, 1
    if len(str(value)) <= 3:
        return f'{value} ₹'  # include the symbol so the result matches the output shown below
    # if the value is 10,000, 1,000
    elif 3 < len(str(value)) <= 5:
        return f'{str(value)[:-3]},{str(value)[-3:]} ₹'
    # if the value is greater than 10,000
    else:
        cut = str(value)[:-3]
        o = []
        while cut:
            o.append(cut[-2:])  # appending from 1000th value (right to left)
            cut = cut[:-2]
        o = o[::-1]  # reversing list
        res = ",".join(o)
        return f'{res},{str(value)[-3:]} ₹'
value1 = 1_00_00_00_000
value2 = 10_00_00_00_000
value3 = 100
print(formatted_int(value1))
print(formatted_int(value2))
print(formatted_int(value3))
Output:
1,00,00,00,000 ₹
10,00,00,00,000 ₹
100 ₹
Building on pgksunilkumar's answer, here is a small improvement for the case where the number is between 0 and -1000:
def formatINR(number):
    if number < 0 and number > -1000:
        return number
    else:
        s, *d = str(number).partition(".")
        r = ",".join([s[x-2:x] for x in range(-3, -len(s), -2)][::-1] + [s[-3:]])
        return "".join([r] + d)
Now, if the number is between 0 and -1000, it is returned unchanged.
i.e
a = -600
b = -10000000
c = 700
d = 8000000
print(formatINR(a))
print(formatINR(b))
print(formatINR(c))
print(formatINR(d))
output will be:
-600
-1,00,00,000
700
80,00,000
Couldn't make the other two solutions work for me, so I made something a little more low-tech:
def format_as_indian(input):
    input_list = list(str(input))
    if len(input_list) <= 1:
        formatted_input = input
    else:
        first_number = input_list.pop(0)
        last_number = input_list.pop()
        formatted_input = first_number + (
            (''.join(l + ',' * (n % 2 == 1) for n, l in enumerate(reversed(input_list)))[::-1] + last_number)
        )
        if len(input_list) % 2 == 0:
            formatted_input.lstrip(',')
    return formatted_input
This doesn't work with decimals. If you need that, I would suggest saving the decimal portion into another variable and adding it back in at the end.
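The suggestion above (set the decimal portion aside and add it back at the end) could look roughly like this. It is a sketch that assumes str(value) gives plain decimal notation (no exponent such as 1e+16); the function name format_indian_number is made up for illustration.

```python
def format_indian_number(value):
    """Group the integer part as 3,2,2,... from the right; keep sign and decimals as-is."""
    text = str(value)
    sign = "-" if text.startswith("-") else ""
    # partition() returns empty strings for dot/fraction when there is no decimal point.
    whole, dot, fraction = text.lstrip("-").partition(".")
    head, tail = whole[:-3], whole[-3:]
    groups = []
    while head:
        groups.append(head[-2:])  # peel off two digits at a time, right to left
        head = head[:-2]
    return sign + ",".join(groups[::-1] + [tail]) + dot + fraction

print(format_indian_number(1200000))    # 12,00,000
print(format_indian_number(1000.108))   # 1,000.108
print(format_indian_number(-9900))      # -9,900
```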
num=123456789
snum=str(num)
slen=len(snum)
result=''
if (slen-3)%2 !=0 :
    snum='x'+snum
    slen+=1  # keep slen in sync with the padded string
for i in range(0,slen-3,2):
    result=result+snum[i:i+2]+','
result+=snum[slen-3:]
print(result.replace('x',''))
From Section 15.2 of Programming Pearls
The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using suffix-array:
example = open("iliad10.txt").read()
def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i
suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:])) #VERY VERY SLOW
max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
I found the idx.sort step very slow. I think it's slow because Python needs to pass the substrings by value instead of by pointer (as the C code above does).
The tested file can be downloaded from here
The C code needs only 0.3 seconds to finish.
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never finishes on my computer (I waited for 10 minutes and killed it).
Does anyone have ideas on how to make the code efficient? (For example, less than 10 seconds.)
My solution is based on suffix arrays. The suffix array is constructed by prefix doubling, and a longest-common-prefix (LCP) array is built from it. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. for searching the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if duplicate strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter
def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easy modified for any criteria, e.g. for searching ten
    longest non overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if not substring in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())
def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of the length _step are first pre-sorted. Then the
    results are repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value:      (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between  tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
I prefer this solution over more complicated O(n log n) because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than necessary linear time operations in the method from that article, that should be O(n) under very special presumptions of random strings together with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that worst-case O(n log n) of my algorithm can be in practice faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question, if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent e.g. in "independent" school homework collections. The program should not freeze on it.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.
The translation of the algorithm into Python:
from itertools import imap, izip, starmap, tee
from os.path import commonprefix

def pairwise(iterable):  # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data)))  # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
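The small version above relies on Python 2 names (imap, izip, xrange). A Python 3 sketch of the same idea, under the same O(n*n) memory caveat, might look like this; zip over the sorted list replaces the pairwise helper, and the default argument for empty input is an addition that is not in the original:

```python
from os.path import commonprefix

def longest_duplicate_small(data):
    """Longest repeated substring via sorted suffixes (small inputs only)."""
    suffixes = sorted(data[i:] for i in range(len(data)))  # O(n*n) in memory
    # Neighbouring suffixes in sorted order share the longest common prefixes.
    neighbours = zip(suffixes, suffixes[1:])
    return max((commonprefix(pair) for pair in neighbours), key=len, default="")

print(longest_duplicate_small("banana"))  # ana
```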
buffer() makes it possible to get a substring without copying:
def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i))  # suffix array

    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start

    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with a lcp array.
Note: the *_memoryview() version is superseded by the *_buffer() version
More memory efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
The main problem seems to be that python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead to get a reference instead of a copy. When I did this, the program hung after the idx.sort function (which was very fast).
I'm sure with a little work, you can get the rest working.
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>
int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
And compare the result to this python:
test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).
To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:
The above is interesting trivia, but not actually correct in terms of affecting either chunk of code. I realized this just as I shut my laptop and left a wifi zone... Really should double check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
This version takes about 17 secs on my circa-2007 desktop using a totally different algorithm:
#!/usr/bin/env python
ex = open("iliad.mb.txt").read()
chains = dict()
# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
    s = ''.join(b)
    if s not in chains :
        chains[s] = list()
    chains[s].append(a)

def grow_chains(chains) :
    new_chains = dict()
    for (string,pos) in chains :
        offset = len(string)
        for p in pos :
            if p + offset >= len(ex) : break
            # add one more character
            s = string + ex[p + offset]
            if s not in new_chains :
                new_chains[s] = list()
            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1 :
    print 'length of chains', len(chains)
    # remove chains that appear only once
    chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]
    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]
    chains = grow_chains(chains)
The basic idea is to create a list of substrings and the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by 1 character and a new list is created. Unique strings are removed again. And so on and so forth...
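The grow-and-filter idea can be condensed into a small Python 3 function. This is a sketch rather than a port of the script above: it works on a string argument instead of a file, returns one of the longest repeated substrings instead of printing progress, and (like the script) starts from two-character chains, so a repeat of a single character is not reported. The name longest_repeated is made up for illustration.

```python
from collections import defaultdict

def longest_repeated(text):
    """Grow repeated substrings one character at a time, dropping unique ones."""
    chains = defaultdict(list)
    for pos in range(len(text) - 1):
        chains[text[pos:pos + 2]].append(pos)  # initial two-character chains
    best = ""
    while True:
        # Keep only substrings that occur at more than one position.
        chains = {s: positions for s, positions in chains.items() if len(positions) > 1}
        if not chains:
            return best
        best = next(iter(chains))  # any survivor has the current (longest) length
        grown = defaultdict(list)
        for s, positions in chains.items():
            for p in positions:
                end = p + len(s)
                if end < len(text):
                    grown[s + text[end]].append(p)  # extend by one character
        chains = grown

print(longest_repeated("banana"))  # ana
```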